Since we can't change the destination of indirectbr, so when
encounter indirectbr as PredPredBB terminator, we should pass it.
Differential Revision: https://reviews.llvm.org/D129193
After D129205, we support SplitBlockPredecessors() for predecessors
with callbr terminators. This means that it is now also safe to
invoke critical edge splitting for an edge coming from a callbr
terminator. Remove checks in various passes that were protecting
against that.
Differential Revision: https://reviews.llvm.org/D129256
For the longest time we used `AAValueSimplify` and
`genericValueTraversal` to determine "potential values". This was
problematic for many reasons:
- We recomputed the result a lot as there was no caching for the 9
locations calling `genericValueTraversal`.
- We added the idea of "intra" vs. "inter" procedural simplification
only as an afterthought. `genericValueTraversal` did offer an option
but `AAValueSimplify` did not. Thus, we might end up with "too much"
simplification in certain situations and then gave up on it.
- Because `genericValueTraversal` was not a real `AA` we ended up with
problems like the infinite recursion bug (#54981) as well as code
duplication.
This patch introduces `AAPotentialValues` and replaces the
`AAValueSimplify` uses with it. `genericValueTraversal` is folded into
`AAPotentialValues` as are the instruction simplifications performed in
`AAValueSimplify` before. We further distinguish "intra" and "inter"
procedural simplification now.
`AAValueSimplify` was not deleted as we haven't ported the
re-materialization of instructions yet. There are other differences over
the former handling, e.g., we may not fold trivially foldable
instructions right now, e.g., `add i32 1, 1` is not folded to `i32 2`
but if an operand would be simplified to `i32 1` we would fold it still.
We are also even more aware of function/SCC boundaries in CGSCC passes,
which is good even if some tests look like they regress.
Fixes: https://github.com/llvm/llvm-project/issues/54981
Note: A previous version was flawed and consequently reverted in
6555558a80.
We recently learned to place the alloca during the heap2stack
transformation in the entry block but we did not account for other
concurrent modifications. We need to record our decision rather than
checking (then outdated) passes during the manifest stage. This will
also allow us to use a custom (=optimistic) "loop info" in the future.
This way it can be reused easily in D128387.
Note this changes the IR slightly. Before The steps for calculating and storing the frame record info were:
1. getPC
2. getSP
3. inttoptr
4. or SP, PC
5. store
Now the steps are:
1. getPC
2. getSP
3. or SP, PC
4. inttoptr
5. store
Differential Revision: https://reviews.llvm.org/D129315
Enhance memchr and strchr handling to simplify calls to the functions
used in equality expressions with the first argument to at most two
integer comparisons:
- memchr(A, C, N) == A to N && *A == C for either a dereferenceable
A or a nonzero N,
- strchr(S, C) == S to *S == C for any S and C, and
- strchr(S, '\0') == 0 to true for any S
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D128939
Fix bug exposed by https://reviews.llvm.org/D125990
rewriteLoopExitValues calls InductionDescriptor::isInductionPHI which requires
the PHI node to have an incoming edge from the loop preheader. This adds checks
before calling InductionDescriptor::isInductionPHI to see that the loop has a
preheader. Also did some refactoring.
Differential Revision: https://reviews.llvm.org/D129297
The 'and (sext (ashr X, ShiftC)), C' --> 'lshr (sext X), ShiftC'
transformation would access out of bounds bits in APInt::getLowBitsSet
if the shift count was larger than X's bit width or if it was negative.
Fixes#56424
This patchs adds a new metadata kind `exclude` which implies that the
global variable should be given the necessary flags during code
generation to not be included in the final executable. This is done
using the ``SHF_EXCLUDE`` flag on ELF for example. This should make it
easier to specify this flag on a variable without needing to explicitly
check the section name in the target backend.
Depends on D129053 D129052
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D129151
Currently we use the `embedBufferInModule` function to store binary
strings containing device offloading data inside the host object to
create a fatbinary. In the case of LTO, we need to extract this object
from the LLVM-IR. This patch adds a metadata node for the embedded
objects containing the embedded pointers and the sections they were
stored at. This should create a cleaner interface for identifying these
values.
In the future it may be worthwhile to also encode an `ID` in the
metadata corresponding to the object's special section type if relevant.
This would allow us to extract the data from an object file and LLVM-IR
using the same ID.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D129033
Now that removeDeadRecipes can remove most dead recipes across a whole
VPlan, there is no need to first collect some dead instructions.
Instead removeDeadRecipes can simply clean them up.
Depends D127580.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D128408
SplitBlockPredecessors currently asserts if one of the predecessor
terminators is a callbr. This limitation was originally necessary,
because just like with indirectbr, it was not possible to replace
successors of a callbr. However, this is no longer the case since
D67252. As the requirement nowadays is that callbr must reference
all blockaddrs directly in the call arguments, and these get
automatically updated when setSuccessor() is called, we no longer
need this limitation.
The only thing we need to do here is use replaceSuccessorWith()
instead of replaceUsesOfWith(), because only the former does the
necessary blockaddr updating magic.
I believe there's other similar limitations that can be removed,
e.g. related to critical edge splitting.
Differential Revision: https://reviews.llvm.org/D129205
This can enable additional region merging, while not losing
opportunities as region merging does not produce dead recipes.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D128831
Previously the scope of debug type of __coro_frame is limited in the
current function. It looked good at the first sight. But it prevent us
to print the type in splitted functions and other functions. Also the
debug type is different for different coroutine functions. So it makes
sense to rename the debug type to make it related to the function name.
After this patch, we could access the coroutine frame type in a function
by `function_name.coro_frame_ty`.
Reviewed By: dblaikie
Differential Revision: https://reviews.llvm.org/D127623
Debugify in OriginalDebugInfo mode, introduced with D82545,
runs only with legacy PassManager.
This patch enables this utility for the NewPM.
Differential Revision: https://reviews.llvm.org/D115351
This patch adds the support for `fmax` and `fmin` operations in `atomicrmw`
instruction. For now (at least in this patch), the instruction will be expanded
to CAS loop. There are already a couple of targets supporting the feature. I'll
create another patch(es) to enable them accordingly.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D127041
This addresses the assertion failure reported in
https://reviews.llvm.org/D124159#3631240.
I believe that this limitation in SplitBlockPredecessors is not
actually necessary (because unlike with indirectbr, callbr is
restricted in a way that does allow updating successors), but for
now fix the assertion failure the same way we do everywhere else,
by also skipping callbr.
Currently, LLVM doesn't have the correct shadow offset
mapping for the n32 ABI.
This patch introduces the correct shadow offset value
for the n32 ABI - 1ULL << 29.
Differential Revision: https://reviews.llvm.org/D127096
As constant expressions can no longer trap, it only makes sense to
call isSafeToSpeculativelyExecute on Instructions, so limit the
API to accept only them, rather than general Operators or Values.
As integer div/rem constant expressions are no longer supported,
constants can no longer trap and are always safe to speculate.
Remove the Constant::canTrap() method and its usages.
By LangRef, hoisting token-returning instructions obsures the origin
so it should be skipped. Found this issue while investigating a
CoroSplit pass crash.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D129025
This in an extension to the code added in D123911 which added vector
combine folding of shuffle-select patterns, attempting to reduce the
total amount of shuffling required in patterns like:
%x = shuffle %i1, %i2
%y = shuffle %i1, %i2
%a = binop %x, %y
%b = binop %x, %y
shuffle %a, %b, selectmask
This patch extends the handing of shuffles that are dependent on one
another, which can arise from the SLP vectorizer, as-in:
%x = shuffle %i1, %i2
%y = shuffle %x
The input shuffles can also be emitted, in which case they are treated
like identity shuffles. This patch also attempts to calculate a better
ordering of input shuffles, which can help getting lower cost input
shuffles, pushing complex shuffles further down the tree.
This is a recommit with some additional checks for supported forms and
out-of-bounds mask elements, with some extra tests.
Differential Revision: https://reviews.llvm.org/D128732
This reverts commit 4e545bdb35.
The newly added test is the third infinite combine loop caused by
this change. In this case, it's a combination of the branch to
common dest and jump threading folds that keeps peeling off loop
iterations.
The core problem here is that we ideally would not thread over
loop backedges, both because it is potentially non-profitable
(it may break canonical loop structure) and because it may result
in these kinds of loops. Unfortunately, due to the lack of a
dominator tree in SimplifyCFG, there is no good way to prevent
this. While we have LoopHeaders, this is an optional structure and
we don't do a good job of keeping it up to date. It would be fine
for a profitability check, but is not suitable for a correctness
check.
So for now I'm just giving up here, as I don't see a good way to
robustly prevent infinite combine loops.
Fixes https://github.com/llvm/llvm-project/issues/56203.
This removes creation of udiv/sdiv/urem/srem constant expressions,
in preparation for their removal. I've added a
ConstantExpr::isDesirableBinOp() predicate to determine whether
an expression should be created for a certain operator.
With this patch, div/rem expressions can still be created through
explicit IR/bitcode, forbidding them entirely will be the next step.
Differential Revision: https://reviews.llvm.org/D128820
If there are multiple predecessors that have the same condition
value (and thus same "real destination"), these were previously
handled by copying the threaded block for each predecessor.
Instead, we can reuse one block for all of them. This makes the
behavior of SimplifyCFG's jump threading match that of the
actual JumpThreading pass.
This also avoids the infinite combine loop reported in:
https://reviews.llvm.org/D124159#3624387
In D95959, the improve analysis for "C >> X" broken the fold
((%x & C) == 0) --> %x u< (-C) iff (-C) is power of two.
It simplifies C, but fails to satisfy the fold condition.
This patch try to restore C before the fold.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D128790
For scalable VFs, the minimum assumed vscale needs to be included in the
cost-computation, otherwise a smaller VF may be used for RT check cost
computation than was used for earlier cost computations.
Fixes a RISCV test failing with UBSan due to both scalar and vector
loops having the same cost.
The test diffs are cosmetic -- but improvements -- because we
let instcombine handle replacement. Instead of dropping the
old value name, it propagates to the new instruction.
This fixes an UBSan failure after 644a965c1e. When using
user-provided VFs/ICs (via the force-vector-width /
force-vector-interleave options) the scalar cost is zero, which would
cause divide-by-zero.
When forcing vectorization using the options, the cost of the runtime
checks should not block vectorization.
This patch replaces the tight hard cut-off for the number of runtime
checks with a more accurate cost-driven approach.
The new approach allows vectorization with a larger number of runtime
checks in general, but only executes the vector loop (and runtime checks) if
considered profitable at runtime. Profitable here means that the cost-model
indicates that the runtime check cost + vector loop cost < scalar loop cost.
To do that, LV computes the minimum trip count for which runtime check cost
+ vector-loop-cost < scalar loop cost.
Note that there is still a hard cut-off to avoid excessive compile-time/code-size
increases, but it is much larger than the original limit.
The performance impact on standard test-suites like SPEC2006/SPEC2006/MultiSource
is mostly neutral, but the new approach can give substantial gains in cases where
we failed to vectorize before due to the over-aggressive cut-offs.
On AArch64 with -O3, I didn't observe any regressions outside the noise level (<0.4%)
and there are the following execution time improvements. Both `IRSmk` and `srad` are relatively short running, but the changes are far above the noise level for them on my benchmark system.
```
CFP2006/447.dealII/447.dealII -1.9%
CINT2017rate/525.x264_r/525.x264_r -2.2%
ASC_Sequoia/IRSmk/IRSmk -9.2%
Rodinia/srad/srad -36.1%
```
`size` regressions on AArch64 with -O3 are
```
MultiSource/Applications/hbd/hbd 90256.00 106768.00 18.3%
MultiSourc...ks/ASCI_Purple/SMG2000/smg2000 240676.00 257268.00 6.9%
MultiSourc...enchmarks/mafft/pairlocalalign 472603.00 489131.00 3.5%
External/S...2017rate/525.x264_r/525.x264_r 613831.00 630343.00 2.7%
External/S...NT2006/464.h264ref/464.h264ref 818920.00 835448.00 2.0%
External/S...te/538.imagick_r/538.imagick_r 1994730.00 2027754.00 1.7%
MultiSourc...nchmarks/tramp3d-v4/tramp3d-v4 1236471.00 1253015.00 1.3%
MultiSource/Applications/oggenc/oggenc 2108147.00 2124675.00 0.8%
External/S.../CFP2006/447.dealII/447.dealII 4742999.00 4759559.00 0.3%
External/S...rate/510.parest_r/510.parest_r 14206377.00 14239433.00 0.2%
```
Reviewed By: lebedev.ri, ebrevnov, dmgreen
Differential Revision: https://reviews.llvm.org/D109368
This patch slightly extends the limit on the RecursionMaxDepth inside
the SLP vectorizer. It does it only when it hits a load (or zext/sext of
a load), which allows it to peek through in the places where it will be
the most valuable, without ballooning out the O(..) by any 2^n factors.
Differential Revision: https://reviews.llvm.org/D122148
Use ConstantFoldBinaryOpOperands() instead, to handle the case
where not all binary ops have a constant expression variant.
This is a bit awkward because we only want to pop the element from
Ops once we're sure that it has folded.
This in an extension to the code added in D123911 which added vector
combine folding of shuffle-select patterns, attempting to reduce the
total amount of shuffling required in patterns like:
%x = shuffle %i1, %i2
%y = shuffle %i1, %i2
%a = binop %x, %y
%b = binop %x, %y
shuffle %a, %b, selectmask
This patch extends the handing of shuffles that are dependent on one
another, which can arise from the SLP vectorizer, as-in:
%x = shuffle %i1, %i2
%y = shuffle %x
The input shuffles can also be emitted, in which case they are treated
like identity shuffles. This patch also attempts to calculate a better
ordering of input shuffles, which can help getting lower cost input
shuffles, pushing complex shuffles further down the tree.
Differential Revision: https://reviews.llvm.org/D128732
These conditions are later checked in the HoistTerminator code
path. Checking them here is somewhat confusing, because this code
only checks the first instruction in the block, which is not
necessarily the terminator.
(-(X & 1)) & Y --> (X & 1) == 0 ? 0 : Y
https://alive2.llvm.org/ce/z/rhpH3i
This is noted as a missing IR canonicalization in issue #55618.
We already managed to fix codegen to the expected form.
The moved helpers are only used for codegen. It will allow moving the
remaining ::execute implementations out of LoopVectorize.cpp.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D128657
If we are certainly not in a loop we can directly emit the heap2stack
allocas in the function entry block. This will help to get rid of them
(SROA) and avoid stacksave/restore intrinsics when the function is
inlined.
This transform is responsible for a long-standing miscompile
as discussed in issue #47012 (was bugzilla #47668).
There was a proposal to correct it in D88432, but that was
abandoned and there hasn't been any recent activity to fix
it AFAICT.
The original patch D45108 started with a constant-shift-only
restriction and only expanded during review, so I don't think
there's much risk of perf regression on the motivating code.
Add an emitter for the memrchr common extension and simplify the strrchr
call handler to use it. This enables transforming calls with the empty
string to the test C ? S : 0.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D128954
LoopSimplify only requires that the loop predecessor has a single
successor and is safe to hoist into -- it doesn't necessarily have
to be an unconditional BranchInst.
Adjust LoopDeletion to assert conditions closer to what it actually
needs for correctness, namely a single successor and a
side-effect-free terminator (as the terminator is getting dropped).
Fixes https://github.com/llvm/llvm-project/issues/56266.
At the moment, the same VPlan can be used code generation of both the
main vector and epilogue vector loop. This can lead to wrong results, if
the plan is optimized based on the VF of the main vector loop and then
re-used for the epilogue loop.
One example where this is problematic is if the scalar loops need to
execute at least one iteration, e.g. due to interleave groups.
To prevent mis-compiles in the short-term, disable optimizing exit
conditions for VPlans when using epilogue vectorization. The proper fix
is to avoid re-using the same plan for both loops, which will require
support for cloning plans first.
Fixes#56319.
When converting strchr(p, '\0') to p + strlen(p) we know that
strlen() must return an offset that is inbounds of the allocated
object (otherwise it would be UB), so we can use an inbounds GEP.
An equivalent argument can be made for the other cases.
The moved helpers are only used for codegen. It will allow moving the
remaining ::execute implementations out of LoopVectorize.cpp.
Depends on D127966.
Depends on D127965.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D127968
This is a minor refinement of resolvedUndefsIn(), mostly for clarity.
If the value of an instruction is undef, then that's already a legal
final result -- we can safely rauw such an instruction with undef.
We only need to mark unknown values as overdefined, as that's the
result we get for an instruction that has not been processed because
it has an undef operand.
Differential Revision: https://reviews.llvm.org/D128251
The unidentified objects recognized in `getUnderlyingObjects` may
still alias to the noalias parameter because `getUnderlyingObjects`
may not check deep enough to get the underlying object because of
`MaxLookup`. The real underlying object for the unidentified object
may still be the noalias parameter.
Originally Patched By: tingwang
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D127202
When merging GEP of GEP with constant indices, if the second GEP's offset is not divisible by the first GEP's element size, convert both type to i8* and merge.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D125934
I looked at canonicalizing in the other direction, but that causes
many potential regressions and infinite loops because we already
(possibly wrongly) canonicalize "trunc X to i1" into an and+icmp.
This has a data layout restriction to avoid creating illegal
mask instructions, but we could remove that if we can show
that the backend can undo this when needed.
The motivating example from issue #56119 is modeled by the
PhaseOrdering test.
Correct a logic bug in the memrchr enhancement added in D123629 that
makes it ineffective in a subset of cases.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D128856
Migrate all binops to use FoldXYZ rather than CreateXYZ APIs,
which are compatible with InstSimplifyFolder and fallible constant
folding.
Rather than continuing to add one method for every single operator,
add a generic FoldBinOp (plus variants for nowrap, exact and fmf
operators), which we would need anyway for CreateBinaryOp.
This change is not NFC because IRBuilder with InstSimplifyFolder
may perform more folding. However, this patch changes SCEVExpander
to not use the folder in InsertBinOp to minimize practical impact
and keep this change as close to NFC as possible.
This means we no longer need to have the same API between IRBuilder
and IRBuilderFolder.
The constant case is substantially simpler, so implementing it
separately isn't an undue burden.
Nowdays we have a generic constant folding API to load a type from
an offset. It should be able to do anything that VNCoercion can do.
This avoids the weird templating between IRBuilder and ConstantFolder
in one function, which is will stop working as the IRBuilderFolder
moves from CreateXYZ to FoldXYZ APIs.
Unfortunately, this doesn't eliminate this pattern from VNCoercion
entirely yet.
At the moment LoopVersioning is only created for inner-loop
vectorization. This patch moves it to LVP::execute, which means it will
also be added for epilogue vectorization. As a consequence, the proper
noalias metadata is now also added to epilogue vector loops.
LVer will be moved to VPTransformState as follow-up.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D127966
The assert was added with 0399473de8 and is correct for that
pattern, but it is off-by-1 with the enhancement in d4f39d8333.
The transforms are still correct with the new pre-condition:
https://alive2.llvm.org/ce/z/6_6ghmhttps://alive2.llvm.org/ce/z/_GTBUt
And as shown in the new test, the transform is expected with
'ult' - in that case, the icmp reduces to test if the shift
amount is 0.
For instructions that don't need any special handling, use
ConstantFoldInstOperands(), rather than re-implementing individual
cases.
This is probably not NFC because it can handle cases the previous
code missed (e.g. vector operations).
Support compares in ConstantFoldInstOperands(), instead of
forcing the use of ConstantFoldCompareInstOperands(). Also handle
insertvalue (extractvalue was already handled).
This removes a footgun, where many uses of ConstantFoldInstOperands()
need a separate check for compares beforehand. It's particularly
insidious if called on a constant expression, because it doesn't
fail in that case, but will just not do DL-dependent folding.
In some cases, there may be widened users of inductions even though the
plan includes the scalar VF. In those cases, make sure we still replace
the VPWidenIntOrFpInductionRecipe with scalar steps, as otherwise we may
try to execute a VPWidenIntOrFpInductionRecipe with a scalar VF.
Alternatively the patch could also split the range if needed.
This fixes a crash exposed by D123720.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D128755
Currently, we only remove dead blocks and non-feasible edges in
IPSCCP, but not in SCCP. I'm not aware of any strong reason for
that difference, so this patch updates SCCP to perform the CFG
cleanup as well.
Compile-time impact seems to be pretty minimal, in the 0.05%
geomean range on CTMark.
For the test case from https://reviews.llvm.org/D126962#3611579
the result after -sccp now looks like this:
define void @test(i1 %c) {
entry:
br i1 %c, label %unreachable, label %next
next:
unreachable
unreachable:
call void @bar()
unreachable
}
-jump-threading does nothing on this, but -simplifycfg will produce
the optimal result.
Differential Revision: https://reviews.llvm.org/D128796
enabled
The C++20 Coroutines couldn't be compiled to WebAssembly due to an
optimization named symmetric transfer requires the support for musttail
calls but WebAssembly doesn't support it yet.
This patch tries to fix the problem by adding a supportsTailCalls
method to TargetTransformImpl to skip the symmetric transfer when
tail-call feature is not supported.
Reviewed By: tlively
Differential Revision: https://reviews.llvm.org/D128794
ConnectProlog adds new incoming values to exit phi nodes which can
change the SCEV for the phi after 20d798bd47.
Fix is analog to cfc741bc0e.
Fixes#56286.
ConnectEpilog adds new incoming values to exit phi nodes which can
change the SCEV for the phi after 20d798bd47.
Fix is analog to cfc741bc0e.
Fixes#56282.
This code requires the result to be an UndefValue/ConstantInt
anyway (checked by getKnownConstant), so we are only interested
in the case where this folds.
AARGetter is an abstraction over a source of the `AAResults` introduced
to support the legacy pass manager as well as the modern one. Since the
Argument Promotion pass doesn't support the legacy pass manager anymore,
the abstraction is not required and `AAResults` may be used directly.
The instance of the `FunctionAnalysisManager` is passed through the
functions to get all the required analyses just wherever they are
required and do not use the awkward getter callbacks.
The `ReplaceCallSite` parameter was required for the legacy pass manager
only and isn't used anymore, so the parameter has been eliminated.
Differential Revision: https://reviews.llvm.org/D128727
The `isDenselyPacked` static member of the `ArgumentPromotionPass` class
is not used in the class itself anymore. The single known user of the
function is in the `AttributorAttributes.cpp` file, so the function has
been moved into the file.
Differential Revision: https://reviews.llvm.org/D128725
Extend the solution accepted in D127766 to strncmp and simplify
strncmp(A, B, N) calls with constant A and B and variable N to
the equivalent of
N <= Pos ? 0 : (A < B ? -1 : B < A ? +1 : 0)
where Pos is the offset of either the first mismatch between A
and B or the terminating null character if both A and B are equal
strings.
Reviewed By: courbet
Differential Revision: https://reviews.llvm.org/D128089
Enhance getConstantDataArrayInfo to let the memchr and memcmp library
call folders look through arbitrarily long sequences of bitcast and
GEP instructions.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D128364
If the root order itself does not require reordering, we can just
remove its reorder mask safely (e.g., if the root node is a vector of
phis). But if this node is used as an operand in the graph, we cannot
delete the reordering, need to keep it. Otherwise the graph nodes are
not synchronized with the operands. It may cause an extra gather
instruction(s) or a compiler crash.
Also, need to be very careful when selecting the gather nodes for
reordering since there might several gather nodes with the same scalars
and we can try to reorder just the same node many times instead of
different nodes.
Differential Revision: https://reviews.llvm.org/D128680
This moves some code for getting PC and SP into their own functions. Since SP
is also retrieved in the prologue and getting the stack tag, we can cache the
SP if we get it once in the prologue. This caching will really only be relevant
in D128387 where StackBaseTag may not be set in the prologue if __hwasan_tls
is not used.
Differential Revision: https://reviews.llvm.org/D128551
It makes sense to handle byval promotion in the same way as non-byval
but also allowing `store` instructions. However, these should
use the same checks as the `load` instructions do, i.e. be part of the
`ArgsToPromote` collection. For these instructions, the check for
interfering modifications can be disabled, though. The promotion
algorithm itself has been modified a lot: all the accesses (i.e. loads
and stores) are rewritten to the emitted `alloca` instructions. To
optimize these new `alloca`s out, the `PromoteMemToReg` function from
`Transforms/Utils/PromoteMemoryToRegister.cpp` file is invoked after
promotion.
In order to let the `PromoteMemToReg` promote as many `alloca`s as it
is possible, there should be no `GEP`s from the `alloca`s. To
eliminate the `GEP`s, its own `alloca` is generated for every argument
part because a single `alloca` for the whole argument (that
significantly simplifies the code of the pass though) unfortunately
cannot be used.
The idea comes from the following discussion:
https://reviews.llvm.org/D124514#3479676
Differential Revision: https://reviews.llvm.org/D125485
This patch moves the code for recipe implementations to a separate file.
The benefits are:
* Keep VPlan.cpp smaller => faster compile-time during parallel builds.
* Keep code for logical units together
As a follow-up I am also planning on moving all ::execute
implemetnations from LoopVectorize.cpp over to the new file, which
should help to reduce the size of the file a bit.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D127965
This removes the extractvalue constant expression, as part of
https://discourse.llvm.org/t/rfc-remove-most-constant-expressions/63179.
extractvalue is already not supported in bitcode, so we do not need
to worry about bitcode auto-upgrade.
Uses of ConstantExpr::getExtractValue() should be replaced with
IRBuilder::CreateExtractValue() (if the fact that the result is
constant is not important) or ConstantFoldExtractValueInstruction()
(if it is). Though for this particular case, it is also possible
and usually preferable to use getAggregateElement() instead.
The C API function LLVMConstExtractValue() is removed, as the
underlying constant expression no longer exists. Instead,
LLVMBuildExtractValue() should be used (which will constant fold
or create an instruction). Depending on the use-case,
LLVMGetAggregateElement() may also be used instead.
Differential Revision: https://reviews.llvm.org/D125795
`commonAlignment` is a shortcut to pick the smallest of two `Align`
objects. As-is it doesn't bring much value compared to `std::min`.
Differential Revision: https://reviews.llvm.org/D128345
This is the followup patch to https://reviews.llvm.org/D125246 for the `SampleContextTracker` part. Before the promotion and merging of the context is based on the SampleContext(the array of frame), this causes a lot of cost to the memory. This patch detaches the tracker from using the array ref instead to use the context trie itself. This can save a lot of memory usage and benefit both the compiler's CS inliner and llvm-profgen's pre-inliner.
One structure needs to be specially treated is the `FuncToCtxtProfiles`, this is used to get all the functionSamples for one function to do the merging and promoting. Before it search each functions' context and traverse the trie to get the node of the context. Now we don't have the context inside the profile, instead we directly use an auxiliary map `ProfileToNodeMap` for profile , it initialize to create the FunctionSamples to TrieNode relations and keep updating it during promoting and merging the node.
Moreover, I was expecting the results before and after remain the same, but I found that the order of FuncToCtxtProfiles matter and affect the results. This can happen on recursive context case, but the difference should be small. Now we don't have the context, so I just used a vector for the order, the result is still deterministic.
Measured on one huge size(12GB) profile from one of our internal service. The profile similarity difference is 99.999%, and the running time is improved by 3X(debug mode) and the memory is reduced from 170GB to 90GB.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D127031
This is another attempt to land this patch.
The patch proposed to use a new cost model for loop interchange,
which is obtained from loop cache analysis.
Given a loopnest, what loop cache analysis returns is a vector of
loops [loop0, loop1, loop2, ...] where loop0 should be replaced as
the outermost loop, loop1 should be placed one more level inside, and
loop2 one more level inside, etc. What loop cache analysis does is not
only more comprehensive than the current cost model, it is also a "one-shot"
query which means that we only need to query it once during the entire
loop interchange pass, which is better than the current cost model where
we query it every time we check whether it is profitable to interchange
two loops. Thus complexity is reduced, especially after D120386 where we
do more interchanges to get the globally optimal loop access pattern.
Updates made to test cases are mostly minor changes and some
corrections. One change that applies to all tests is that we added an option
`-cache-line-size=64` to the RUN lines. This is ensure that loop
cache analysis receives a valid number of cache line size for correct
analysis. Test coverage for loop interchange is not reduced.
Currently we did not completely remove the legacy cost model, but
keep it as fall-back in case the new cost model did not run successfully.
This is because currently we have some limitations in delinearization, which
sometimes makes loop cache analysis bail out. The longer term goal is to
enhance delinearization and eventually remove the legacy cost model
compeletely.
Reviewed By: bmahjour, #loopoptwg
Differential Revision: https://reviews.llvm.org/D124926
Now that we have the sanitizer metadata that is actually on the global
variable, and now that we use debuginfo in order to do symbolization of
globals, we can delete the 'llvm.asan.globals' IR synthesis.
This patch deletes the 'location' part of the __asan_global that's
embedded in the binary as well, because it's unnecessary. This saves
about ~1.7% of the optimised non-debug with-asserts clang binary.
Reviewed By: vitalybuka
Differential Revision: https://reviews.llvm.org/D127911
This change is a bit subtle. If we have a type like <vscale x 1 x i64>, the vectorizer will currently reject vectorization. The reason is that a type like <1 x i64> is likely to get simply rescalarized, and the vectorizer doesn't want to be in the game of simple unrolling.
(I've given the example in terms of 1 x types which use a single register, but the same issue exists for any N x types which use N registers. e.g. RISCV LMULs.)
This change distinguishes scalable types from fixed types under the reasoning that converting to a scalable type isn't unrolling. Because the actual vscale isn't known until runtime, using a vscale type is potentially very profitable.
This makes an important, but unchecked, assumption. Specifically, the scalable type is assumed to only be legal per the cost model if there's actually a scalable register class which is distinct from the scalar domain. This is, to my knowledge, true for all targets which return non-invalid costs for scalable vector ops today, but in theory, we could have a target decide to lower scalable to fixed length vector or even scalar registers. If that ever happens, we'd need to revisit this code.
In practice, this patch unblocks scalable vectorization for ELEN types on RISCV.
Let me sketch one alternate implementation I considered. We could have restricted this to when we know a minimum value for vscale. Specifically, for the default +v extension for RISCV, we actually know that vscale >= 2 for ELEN types. However, doing it this way means we can't generate scalable vectors when using the various embedded vector extensions which have a minimum vscale of 1.
Differential Revision: https://reviews.llvm.org/D128542
Information in the function `Prologue Data` is intentionally opaque.
When a function with `Prologue Data` is duplicated. The self (global
value) references inside `Prologue Data` is still pointing to the
original function. This may cause errors like `fatal error: error in backend: Cannot represent a difference across sections`.
This patch detaches the information from function `Prologue Data`
and attaches it to a function metadata node.
This and D116130 fix https://github.com/llvm/llvm-project/issues/49689.
Reviewed By: pcc
Differential Revision: https://reviews.llvm.org/D115844
Summary:
Currently in OpenMPOpt we strip `noinline` attributes from runtime
functions. This is here because the device bitcode library that we link
has problems with needed definitions getting prematurely optimized out.
This is only necessary for OpenMP offloading to GPUs so we should narrow
the scope for where we spend time doing this. In the future this
shouldn't be necessary as we move to using a linked library rather than
pulling in a bitcode library in Clang.
The global ctor evaluator currently handles by checking whether the
memset memory is already zero, and skips it in that case. However,
it only actually checks the first byte of the memory being set.
This patch extends the code to check all bytes being set. This is
done byte-by-byte to avoid converting undef values to zeros in
larger reads. However, the handling is still not completely correct,
because there might still be padding bytes (though probably this
doesn't matter much in practice, as I'd expect global variable
padding to be zero-initialized in practice).
Mostly fixes https://github.com/llvm/llvm-project/issues/55859.
Differential Revision: https://reviews.llvm.org/D128532
These intrinsics are now fundemental for SVE code generation and have been
present for a year and a half, hence move them out of the experimental
namespace.
Differential Revision: https://reviews.llvm.org/D127976
Support for the legacy pass manager in ArgPromotion causes
complications in D125485. As the legacy pass manager for middle-end
optimizations is unsupported, drop ArgPromotion from the legacy
pipeline, rather than introducing additional complexity to deal
with it.
Differential Revision: https://reviews.llvm.org/D128536
Globals that shouldn't be sanitized are currently communicated to HWASan
through the use of the llvm.asan.globals IR metadata. Now that we have
an on-GV attribute, use it.
Reviewed By: pcc
Differential Revision: https://reviews.llvm.org/D127543
Improved/fixed cost modeling for shuffles by providing masks, improved
cost model for non-identity insertelements.
Differential Revision: https://reviews.llvm.org/D115462
This patch updates LV to generate runtime after the VF & IC are selected. It
allows deciding whether to vectorize with runtime checks or not based on
their cost compared to the vector loop.
It also updates VectorizationFactor to include the scalar cost.
Reviewed By: lebedev.ri, dmgreen
Differential Revision: https://reviews.llvm.org/D75981
Drop the requirement that getInitialValueOfAllocation() must be
passed an allocator function, shifting the responsibility for
checking that into the function (which it does anyway). The
motivation is to avoid some calls to isAllocationFn(), which has
somewhat ill-defined semantics (given the number of
allocator-related attributes we have floating around...)
(For this function, all we eventually need is an allockind of
zeroed or uninitialized.)
Differential Revision: https://reviews.llvm.org/D127274
This is the second attempt to land this patch.
The patch proposed to use a new cost model for loop interchange,
which is obtained from loop cache analysis.
Given a loopnest, what loop cache analysis returns is a vector of
loops [loop0, loop1, loop2, ...] where loop0 should be replaced as the
outermost loop, loop1 should be placed one more level inside, and loop2
one more level inside, etc. What loop cache analysis does is not only more
comprehensive than the current cost model, it is also a "one-shot" query
which means that we only need to query it once during the entire loop
interchange pass, which is better than the current cost model where we
query it every time we check whether it is profitable to interchange two
loops. Thus complexity is reduced, especially after D120386 where we do
more interchanges to get the globally optimal loop access pattern.
Updates made to test cases are mostly minor changes and some corrections.
One change that applies to all tests is that we added an option
`-cache-line-size=64` to the RUN lines. This is ensure that loop cache
analysis receives a valid number of cache line size for correct analysis.
Test coverage for loop interchange is not reduced.
Currently we did not completely remove the legacy cost model, but keep it
as fall-back in case the new cost model did not run successfully. This is
because currently we have some limitations in delinearization, which sometimes
makes loop cache analysis bail out. The longer term goal is to enhance
delinearization and eventually remove the legacy cost model compeletely.
Reviewed By: bmahjour, #loopoptwg
Differential Revision: https://reviews.llvm.org/D124926
If we have an unaligned uniform store, then when costing a scalable VF we can't emit code to scalarize it. (Well, we could, but we haven't implemented that case.) This change replaces an assert with a cost-model bailout such that we reject vectorization with the scalable VF instead of crashing.
If the masked gather nodes must be reordered, we can just reorder
scalars, just like for gather nodes. But if the node contains reused
scalars, it must be handled same way as a regular vectorizable node,
since need to reorder reused mask, not the scalars directly.
Differential Revision: https://reviews.llvm.org/D128360
If there are multiple constraints in the same block, at the moment the
order they are processed may be different depending on the sort
implementation.
Use stable_sort to ensure consistent ordering.
This reverts commit cac60940b7.
Caused -Os -fsanitize=memory -march=haswell miscompile to pytorch/cpuinfo.
See my latest comment (may update) on D115462.
Finding BDV for vector value does not handle freeze instruction.
Adding its handling as it is done for scalar case.
Reviewed By: apilipenko
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D128254
The reachability queries default to "reachable" after exploring too many
basic blocks. LoopInfo helps it skip over the whole loop.
Reviewed By: eugenis
Differential Revision: https://reviews.llvm.org/D127917
StructurizeCFG linearizes the successors of branching basic block
by adding Flow blocks to record the true/false path for branches
and back edges. This patch reduces the number of Phi values needed
to capture the control flow path by improving the basic block
ordering.
Previously, StructurizeCFG adds loop exit blocks outside of the
loop. StructurizeCFG sets a boolean value to indicate the path
taken, and all exit block live values extend to after the loop.
For loops with a large number of exits blocks, this creates a
huge number of values that are maintained, which increases
compilation time and register pressure. This is problem
especially with ASAN, which adds early exits to blocks with
unreachable instructions for each instrumented check in the loop.
In specific cases, this patch reduces the number of values needed
after the loop by moving the exit block into the loop. This is
done for blocks that have a single predecessor and single successor
by moving the block to appear just after the predecessor.
Differential Revision: https://reviews.llvm.org/D123231
UnifyLoopExits creates a single exit, a control flow hub, for
loops with multiple exits. There is an input to the block for
each loop exiting block and an output from the block for each
loop exit block. Multiple checks, or guard blocks, are needed
to branch to the correct exit block.
For large loops with lots of exit blocks, all the extra guard
blocks cause problems for StructurizeCFG and subsequent passes.
This patch reduces the number of guard blocks needed when the
exit blocks branch to a common block (e.g., an unreachable
block). The guard blocks are reduced by changing the inputs
and outputs of the control flow hub. The inputs are the exit
blocks and the outputs are the common block.
Reducing the guard blocks enables StructurizeCFG to reorder the
basic blocks in the CFG to reduce the values that exit a loop
with multiple exits. This reduces the compile-time of
StructurizeCFG and also reduces register pressure.
Differential Revision: https://reviews.llvm.org/D123230
We were overly conservative and required a ret statement to be dominated
completely be a single lifetime.end marker. This is quite restrictive
and leads to two problems:
* limits coverage of use-after-scope, as we degenerate to
use-after-return;
* increases stack usage in programs, as we have to remove all lifetime
markers if we degenerate to use-after-return, which prevents
reuse of stack slots by the stack coloring algorithm.
Reviewed By: eugenis
Differential Revision: https://reviews.llvm.org/D127905
This was necessary for code reuse between the old and new passmanager.
With the old pass-manager gone, this is no longer necessary.
Reviewed By: eugenis, myhsu
Differential Revision: https://reviews.llvm.org/D127913
Binary size of `clang` is trivial; namely, numerical value doesn't
change when measured in MiB, and `.data` section increases from 139Ki to
173 Ki.
Differential Revision: https://reviews.llvm.org/D128070
Scale reg should never be zero, so when the quotient is zero, we
cannot assign it there. Limit this transform to avoid this situation.
Differential Revision: https://reviews.llvm.org/D128339
Reviewed By: eopXD
This patch adds a new transferToOtherSystem helper that tries to
transfer information from signed predicates to the unsigned system and
vice versa.
The initial version adds A >=u B for A >=s B && B >=s 0
https://alive2.llvm.org/ce/z/8b6F9i
As branch on undef is immediate undefined behavior, there is no need
to mark one of the edges as feasible. We can leave all the edges
non-feasible. In IPSCCP, we can replace the branch with an unreachable
terminator.
Differential Revision: https://reviews.llvm.org/D126962
The code has been reformatted in accordance with the code style. Some
function comments were extended to the Doxygen ones and reworded a bit
to eliminate the duplication of the function's/class' name in the
comment.
Differential Revision: https://reviews.llvm.org/D128168
NewGVN will find operator from other context. ValueTracking currently doesn't have a way to run completely without context instruction.
So it will use operator itself as conext instruction.
If the operator in another branch will never be executed but it has an assume, it may caused value tracking use the assume to do wrong simpilfy.
It would be better to make these simplification queries not use context at all, but that would require some API changes.
For now we just use the orignial instruction as context instruction to fix the issue.
Fix#56039
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D127942
createInductionResumeValues creates a phi node placeholder
without filling incoming values. Then it generates the incoming values.
It includes triggering of SCEV expander which may invoke SSAUpdater.
SSAUpdater has an optimization to detect number of predecessors
basing on incoming values if there is phi node.
In case phi node is not filled with incoming values - the number of predecessors
is detected as 0 and this leads to segmentation fault.
In other words SSAUpdater expects that phi is in good shape while
LoopVectorizer breaks this requirement.
The fix is just prepare all incoming values first and then build a phi node.
Reviewed By: fhahn
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D128033
This avoid creating empty bins in AAPointerInfo which can lead to
segfaults. Also ensure we do not try to translate from callee to caller
except if we really take the argument state and move it to the call site
argument state.
Fixes: https://github.com/llvm/llvm-project/issues/55726
When determining liveness via Attributor::isAssumedDead(...) we might
end up without a liveness AA or with one pointing into another function.
Neither is helpful and we will avoid both from now on.
Reapplied after fixing the ASAN error which caused the revert:
db68a25ca9
During the reordering transformation we should try to avoid reordering bundles
like fadd,fsub because this may block them being matched into a single vector
instruction in x86.
We do this by checking if a TreeEntry is such a pattern and adding it to the
list of TreeEntries with orders that need to be considered.
Differential Revision: https://reviews.llvm.org/D125712
In some cases, a recurrence splice instructions needs to be inserted
between to regions, for example if the regions get re-arranged during
sinking.
Fixes#56146.
For non-mem-intrinsic and non-lifetime `CallBase`s, the current
`isRemovable` function only checks if the `CallBase` 1. has no uses 2.
will return 3. does not throw:
80fb782336/llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp (L1017)
But we should also exclude invokes even in case they don't throw,
because they are terminators and thus cannot be removed. While it
doesn't seem to make much sense for `invoke`s to have an `nounwind`
target, this kind of code can be generated and is also valid bitcode.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D128224
Remove the known limitation of the library function call folders to only
work with top-level arrays of characters (as per the TODO comment in
the code) and allows them to also fold calls involving subobjects of
constant aggregates such as member arrays.
ExtractElement does not produce a vector out of a vector, so there's no need to
call a gather once done.
Fix#54469
Credits to npopov@redhat.com for the original approach.
Differential Revision: https://reviews.llvm.org/D126012
If the OffsetBeg + InsertVecSz is greater than VecSz, need to estimate
the cost as shuffle of 2 vector, not as insert of subvector. Otherwise,
the inserted subvector is out of range and compiler may crash.
Differential Revision: https://reviews.llvm.org/D128071
LoopPeel add new incoming values to exit phi nodes which can change the
SCEV for the phi after 20d798bd47.
Forget SCEVs for such phis.
Fixes#56044.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D128164
`llvm::max(Align, MaybeAlign)` and `llvm::max(MaybeAlign, Align)` are
not used often enough to be required. They also make the code more opaque.
Differential Revision: https://reviews.llvm.org/D128121
When threading, we always create a new block for the threaded edge
(even if the edge is not critical), which will later get folded back
into the predecessor if possible. Depending on precise processing
order, this separate block may break the detection of trivial
cycles in the threading code, which normally avoids infinite
threading of loops. Explicitly merge the created edge block into
the predecessor to avoid this.
Fixes https://github.com/llvm/llvm-project/issues/55765.
Differential Revision: https://reviews.llvm.org/D127216
Symmetric transfer is not a part of C++ standards. So the vendors is not
forced to implement it any way. Given the symmetric transfer nowadays is
an optimization. It makes more sense to enable it only if the
optimization is enabled. It is also helpful for the compilation speed in
O0.
We wanted to check if all uses of the function are direct calls, but the
code didn't account for passing the function as a parameter.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D128104
This reverts commit 7aa8a67882.
This version includes fixes to address issues uncovered after
the commit landed and discussed at D11448.
Those include:
* Limit select-traversal to selects inside the loop.
* Freeze pointers resulting from looking through selects to avoid
branch-on-poison.
The memcmp simplifier is limited to folding to constants calls with constant
arrays and constant sizes. This change adds the ability to simplify
memcmp(A, B, N) calls with constant A and B and variable N to the pseudocode
equivalent of
N <= Pos ? 0 : (A < B ? -1 : B < A ? +1 : 0)
where Pos is the offset of the first mismatch between A and B.
Differential Revision: https://reviews.llvm.org/D127766
When the mask is a power-of-2 constant and op0 is a shifted-power-of-2
constant, test if the shift amount equals the offset bit index:
(ShiftC << X) & C --> X == (log2(C) - log2(ShiftC)) ? C : 0
(ShiftC >> X) & C --> X == (log2(ShiftC) - log2(C)) ? C : 0
This is an alternate to D127610 with a more general pattern.
We match only shift+and instead of the trailing xor, so we see a few
more tests diffs. I think we discussed this initially in D126617.
Here are proofs for shifts in both directions:
https://alive2.llvm.org/ce/z/CFrLs4
The test diffs look equal or better for IR, and this makes the
patterns more uniform in IR. The backend can partially invert this
in both cases if that is profitable. It is not trivially reversible,
however, so if we find perf regressions that are not easy to undo,
then we may want to revert this.
Differential Revision: https://reviews.llvm.org/D127801
We really want to push freezes through recurrence phis, so that we
freeze only the start value, rather than the IV value on every
iteration. foldOpIntoPhi() already handles this for the case where
the transfer function doesn't produce poison, e.g.
%iv.next = add %iv, 1. However, this does not work if nowrap flags
are present, e.g. the very common %iv.next = add nuw %iv, 1 case.
This patch adds a fold that pushes freeze instructions to the start
value by checking whether all backedge values will be non-poison
after poison generating flags have been dropped. This allows pushing
freezes out of loops in most cases. I suspect that this also
obsoletes the CanonicalizeFreezeInLoops pass, and we can probably
drop it.
Fixes https://github.com/llvm/llvm-project/issues/56048.
Differential Revision: https://reviews.llvm.org/D127960
llvm.used and llvm.compiler.used are often used with inline assembly
that refers to a specific symbol so that the symbol is kept through to
the linker even though there are no references to it from LLVM IR.
This fixes the MergeFunctions pass to preserve references to these
symbols in llvm.used/llvm.compiler.used so they are not deleted from the
IR. This doesn't prevent these functions from being merged, but
guarantees that an alias or thunk with the expected symbol name is kept
in the IR.
Differential Revision: https://reviews.llvm.org/D127751
Profiling stopped working for us after D98061, which was largely a
Fuschia-specific patch but in one place used `isOSBinFormatELF` to
make a decision. I'm adding a PS4/PS5 exception to that, so we can
get profiling to work again.
Differential Revision: https://reviews.llvm.org/D127506
Profiling stopped working for us after D98061, which was largely a
Fuschia-specific patch but in one place used `isOSBinFormatELF` to
make a decision. I'm adding a PS4/PS5 exception to that, so we can
get profiling to work again.
Differential Revision: https://reviews.llvm.org/D127506
If the root scalar is mapped to to the smallest bit width, the vector is
truncated and the types between original buildvector and extracted value
mismatched. For extract, we emit sext/zext instructions, for shuffles we
can reuse oringal vector instead of the truncated one.
Differential Revision: https://reviews.llvm.org/D127974
Instead of using the underlying instruction and VF to get the type, use
the type of the incoming value. This removes an unnecessary dependence
on the underlying instruction and enables using the recipe without an
underlying instruction.
Currently scatter vectorize nodes can be emitted only for GEPs with
constant indices. But we can also emit such nodes for GEPs with the same
ptr and non-constant vectorizable/gathered indices, if profitable. Patch
adds support for such nodes and tries to improve handling of GEPs with
non-const indeces for such nodes.
Metric: SLP.NumVectorInstructions
Program SLP.NumVectorInstructions
results results0 diff
test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 5243.00 5240.00 -0.1%
test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 5243.00 5240.00 -0.1%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 27550.00 27507.00 -0.2%
test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test 5395.00 5380.00 -0.3%
test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 5389.00 5374.00 -0.3%
test-suite :: External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp_r.test 961.00 958.00 -0.3%
test-suite :: External/SPEC/CINT2017speed/620.omnetpp_s/620.omnetpp_s.test 961.00 958.00 -0.3%
test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test 5664.00 5643.00 -0.4%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 13202.00 13127.00 -0.6%
test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test 212.00 207.00 -2.4%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 890.00 850.00 -4.5%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 1695.00 1581.00 -6.7%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 2338.00 2140.00 -8.5%
test-suite :: SingleSource/UnitTests/matrix-types-spec.test 63.00 55.00 -12.7%
test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 468.00 356.00 -23.9%
Geomean difference -0.3%
All numbers show increased number of generated vector instructions.
Diff:
SingleSource/Benchmarks/Adobe-C++/loop_unroll - better without LTO, but
need an extra analysis with LTO (with LTO compiler generates
masked_gather, while before regular loads were emitted because of extra
data, availbale at LTO time).
SingleSource/UnitTests/matrix-types-spec - more vector code.
MultiSource/Applications/JM/lencod/lencod - same.
External/SPEC/CINT2006/464.h264ref/464.h264ref - same.
MultiSource/Benchmarks/7zip/7zip-benchmark - same.
External/SPEC/CINT2006/445.gobmk/445.gobmk - no changes.
External/SPEC/CFP2017rate/510.parest_r/510.parest_r - more vector code.
External/SPEC/CFP2006/447.dealII/447.dealII - same
External/SPEC/CINT2017speed/620.omnetpp_s/620.omnetpp_s - same
External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp - same
External/SPEC/CFP2017rate/511.povray_r/511.povray - same
External/SPEC/CFP2006/453.povray/453.povray - same
External/SPEC/CFP2017rate/526.blender_r/526.blender_r - same
External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r - same
External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s - same
Differential Revision: https://reviews.llvm.org/D127219
Previously if the inliner split an SCC such that an empty one remained, the MLInlineAdvisor could potentially lose track of the EdgeCount if a subsequent CGSCC pass modified the calls of a function that was initially in the SCC pre-split. Saving the seen nodes in onPassEntry resolves this.
Reviewed By: mtrofin
Differential Revision: https://reviews.llvm.org/D127693
We can skip the analysis of the constant nodes, their order should not
affect the ordering of the trees/subtrees.
Differential Revision: https://reviews.llvm.org/D127775
Adding the `DW_CC_nocall` calling convention to the function debug metadata is needed when either the return values or the arguments of a function are removed as this helps in informing debugger that it may not be safe to call this function or try to interpret the return value.
This translates to setting `DW_AT_calling_convention` with `DW_CC_nocall` for appropriate DWARF DIEs.
The DWARF5 spec (section 3.3.1.1 Calling Convention Information) says:
If the `DW_AT_calling_convention` attribute is not present, or its value is the constant `DW_CC_normal`, then the subroutine may be safely called by obeying the `standard` calling conventions of the target architecture. If the value of the calling convention attribute is the constant `DW_CC_nocall`, the subroutine does not obey standard calling conventions, and it may not be safe for the debugger to call this subroutine.
Reviewed By: dblaikie
Differential Revision: https://reviews.llvm.org/D127134
If an instruction at the beginning of a block is erased, this may
trigger crash due to dereferencing an invalid iterator.
Check if II is at the end before dereferencing it.
Reviewed By: thegameg
Differential Revision: https://reviews.llvm.org/D127736
Adds option to print the contents of the Inline Advisor after each SCC Inliner pass
Reviewed By: mtrofin
Differential Revision: https://reviews.llvm.org/D127689
GetValueInMiddleOfBlock uses result of GetValueAtEndOfBlockInternal if there is no value
defined for current basic block.
If there is already a value it tries (in this order):
to find single register coming from all predecessors
find existing phi node which matches our incoming registers
build new phi.
The compile time improvement is to use current available value if
it is defined out of current BB or it is a PHI register.
This is due to it can be used in the middle basic block.
Reviewed By: sameerds
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D126523
Remove the early exit if both constraints contain no variables. This
restriction is unnecessayr for correctness and removing it simplifies
handling of trivial constant conditions in follow-up changes.
If an integer PHI has an illegal type (according to the data layout) and
it is only used by `trunc` or `trunc(lshr)` operations, we split the PHI
into various instructions in its predecessors:
6d1543a167/llvm/lib/Transforms/InstCombine/InstCombinePHI.cpp (L1536-L1543)
So this can produce code like the following:
Before:
```
pred:
...
bb:
%p = phi i8 [ %somevalue, %pred ], ...
...
%tobool = trunc i8 %p to i1
use %tobool
...
```
In this code, `%p` has an illegal integer type, `i8`, and its only used
in a `trunc` instruction later. In this case this pass puts extraction
code in its predecessors:
After:
```
pred:
...
%t = and i8 %somevalue, 1
%extract = icmp ne i8 %t, 0
bb:
%p.new = phi i1 [ %extract, %pred ], ...
use %p.new instead of %tobool
```
But this doesn't work if `pred` is a `catchswitch` BB because it cannot
have any non-PHI instructions. This CL ensures we bail out in that case.
Fixes https://github.com/llvm/llvm-project/issues/55803.
Reviewed By: dschuff
Differential Revision: https://reviews.llvm.org/D127699
OrigPHIsToFix is only used in the native path. Collecting phis can be
replaced by iterating over the plan. This also removes another
unnecessary use of a late getVPValue.
This also reduces the coupling between ILV and the VPlan utilities.
Removes the workaround from https://reviews.llvm.org/D98509#2732628 for
an AIX build compiler issue.
The AIX build compiler product that caused the issue has since been
fixed. Also, the AIX build compiler has been changed to one based on
LLVM.
This shows narrowing improvements on the logic tests
(transforms recently added with e247b0e5c9).
This is not a complete fix. That would require adding
folds to visitOr/visitXor. But it enables the expected
transforms for the basic patterns in the affected tests.
Handle the fact that not only constant expressions, but also
constant aggregates containing expressions can trap.
This still doesn't fix the original C reproducer, probably due to
more issues remaining in other passes.
When pushing an operation across a phi node, we should avoid doing
so across a loop backedge. This is generally non-profitable, because
it does not reduce the number of times the operation is executed,
and could lead to an infinite combine loop.
The code was already guarding against this, but using an
insufficiently strong condition, which did not cover the case where
the operation was originally outside the loop (in which case the
transform moves the operation from outside the loop into the loop,
which is particularly undesirable).
Differential Revision: https://reviews.llvm.org/D127499
The 1st try ( afa192cfb6 ) was reverted because it could
cause an infinite loop with constant expressions.
A test for that and an extra condition to enable the transform
are added now. I also added code comments to better describe
the transform and the existing, related transform.
Original commit message:
https://alive2.llvm.org/ce/z/hRy3rE
As shown in D123408, we can produce this pattern when moving
casts around, and we already have a related fold for a binop
with a constant operand.
In foldSelectIntoOp we sometimes transform a select of a fadd into a
fadd of a select, where we select between data and an identity value.
For both fadd and fsub the identity is always -0.0, but if the nsz
flag is set on the select instruction we can use +0.0 instead. Doing
so then triggers other optimisations, such as when folding the select
of masked load into a new masked load.
Differential Revision: https://reviews.llvm.org/D126774
This patch improves the fix in D110529 to prevent from crashing on value
with byval attribute that is not added in SCCP solver.
Authored-by: sinan.lin@linux.alibaba.com
Reviewed By: ChuanqiXu
Differential Revision: https://reviews.llvm.org/D126355
This adds a fold for aggressive instcombine that converts
smin(smax(fptosi(x))) into a llvm.fptosi.sat, providing that the
saturation constants are correct and the cost of the llvm.fptosi.sat is
lower.
Unfortunately, a llvm.fptosi.sat cannot always be converted back to a
smin/smax/fptosi. The llvm.fptosi.sat intrinsic is more defined that the
original, which produces poison if the original fptosi was out of range.
The llvm.fptosi.sat will saturate any value, so needs to be expanded to
a fptosi(fpmin(fpmax(x))), which can be worse for codegeneration
depending on the target.
So this change thais conditional on the backend reporting that the
llvm.fptosi.sat is cheaper that the original smin+smax+fptost. This is
a change to the way that AggressiveInstrcombine has worked in the past.
Instead of just being a canonicalization pass, that canonicalization can
be dependant on the target in certain specific cases.
Differential Revision: https://reviews.llvm.org/D125755
Teach the unroller(s) how to handle an invalid cost. This avoids crashes when the backend can't provide a cost due to either a fundemental limitation or an unimplemented cost model case.
Differential Revision: https://reviews.llvm.org/D127305
Per the documentation in Support/InstructionCost.h, the purpose of an invalid cost is so that clients can change behavior on impossible to cost inputs. CodeMetrics was instead asserting that invalid costs never occurred.
On a target with an incomplete cost model - e.g. RISCV - this means that transformations would crash on (falsely) invalid constructs - e.g. scalable vectors. While we certainly should improve the cost model - and I plan to do so in the near future - we also shouldn't be crashing. This violates the explicitly stated purpose of an invalid InstructionCost.
I updated all of the "easy" consumers where bailouts were locally obvious. I plan to follow up with loop unroll in a following change.
Differential Revision: https://reviews.llvm.org/D127131
https://alive2.llvm.org/ce/z/hRy3rE
As shown in D123408, we can produce this pattern when moving
cast around, and we already have a related fold for a binop
with a constant operand.
For the longest time we used `AAValueSimplify` and
`genericValueTraversal` to determine "potential values". This was
problematic for many reasons:
- We recomputed the result a lot as there was no caching for the 9
locations calling `genericValueTraversal`.
- We added the idea of "intra" vs. "inter" procedural simplification
only as an afterthought. `genericValueTraversal` did offer an option
but `AAValueSimplify` did not. Thus, we might end up with "too much"
simplification in certain situations and then gave up on it.
- Because `genericValueTraversal` was not a real `AA` we ended up with
problems like the infinite recursion bug (#54981) as well as code
duplication.
This patch introduces `AAPotentialValues` and replaces the
`AAValueSimplify` uses with it. `genericValueTraversal` is folded into
`AAPotentialValues` as are the instruction simplifications performed in
`AAValueSimplify` before. We further distinguish "intra" and "inter"
procedural simplification now.
`AAValueSimplify` was not deleted as we haven't ported the
re-materialization of instructions yet. There are other differences over
the former handling, e.g., we may not fold trivially foldable
instructions right now, e.g., `add i32 1, 1` is not folded to `i32 2`
but if an operand would be simplified to `i32 1` we would fold it still.
We are also even more aware of function/SCC boundaries in CGSCC passes,
which is good.
Fixes: https://github.com/llvm/llvm-project/issues/54981
When determining liveness via Attributor::isAssumedDead(...) we might
end up without a liveness AA or with one pointing into another function.
Neither is helpful and we will avoid both from now on.
Clang-format InstructionSimplify and convert all "FunctionName"s to
"functionName". This patch does touch a lot of files but gets done with
the cleanup of InstructionSimplify in one commit.
This is the alternative to the less invasive clang-format only patch: D126783
Reviewed By: spatel, rengolin
Differential Revision: https://reviews.llvm.org/D126889
All information is already available in VPlan. Note that there are some
test changes, because we now can correctly look through instructions
like truncates to analyze the actual users.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D123541
We can use constant to allow undef and there is no need to force
integers in the API anyway. The user can decide if a non integer
constant is fine or not.
We need to be careful replacing values as call site arguments
(IRPosition::IRP_CALL_SITE_ARGUMENT) is representing a use and not a
value. This patch replaces the interface to take a IR position instead
making it harder to misuse accidentally. It does not change our tests
right now but a follow up exposed the potential footgun.
We used to be very conservative when integer states were merged.
Instead of adding the known range (which is large due to uncertainty)
into the assumed range (which is hopefully small), we can also only
allow to merge in both at the same time into their respective
counterpart. This will ensure we keep the invariant that assumed is part
of known.
When we recreate instructions as part of simplification we need to take
care of debug metadata and replacing the value multiple times. For now,
we handle both conservatively.
The patch simplifies some of the patterns as below
(A | (B & C0)) | (B & C1) -> A | (B & C0|C1)
((B & C0) | A) | (B & C1) -> (B & C0|C1) | A
In some scenarios like byte reverse on half word, we can see this pattern multiple times and this conversion can optimize these patterns.
Additionally this commit fixes the issue reported with the test case.
int f(int a, int b) {
int c = ((unsigned char)(a >> 23) & 925);
if (a)
c = (a >> 23 & b) | ((unsigned char)(a >> 23) & 925) | (b >> 23 & 157);
return c;
}
The previous revision/commit did not check one-use of an intermediate value that this transform re-uses.
When that value has another use, an existing transform will try to invert the transform here.
By adding one-use checks, we avoid the infinite loops seen with the earlier commit.
Differential Revision: https://reviews.llvm.org/D124119
Existing condition for
fold icmp ugt (ashr X, ShAmtC), C --> icmp ugt X, ((C + 1) << ShAmtC) - 1
missed some boundary. It cause this fold don't work for some cases, and the
reason is due to signed number overflow.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D127188
The IV widening code currently asserts that terminators aren't SCEVable
-- however, this is not the case for invokes with a returned attribute.
As far as I can tell, this assertions is not necessary -- even if we
have a critical edge (the second test case), the trunc gets inserted
in a legal position.
Fixes https://github.com/llvm/llvm-project/issues/55925.
Differential Revision: https://reviews.llvm.org/D127288
This reverts commit 266ea446ab.
The reasons for the revert have been addressed by cleaning up condition
handling in VPlan and properly marking VPBranchOnMaskRecipe as using
scalars.
The test case for the revert from D123720 has been added in 3d663308a5.
Background:
When we construct coroutine frame, we would insert a dbg.declare
intrinsic for it:
```
%hdl = call void @llvm.coro.begin() ; would return coroutine handle
call void @llvm.dbg.declare(metadata ptr %hdl, metadata
![[DEBUG_VARIABLE: __coro_frame]], metadata !DIExpression())
```
And in the splitted coroutine, it looks like:
```
define void @coro_func.resume(ptr *hdl) {
entry.resume:
call void @llvm.dbg.declare(metadata ptr %hdl, metadata
![[DEBUG_VARIABLE: __coro_frame]], metadata !DIExpression())
}
```
And we would salvage the debug info by inserting a new alloca here:
```
define void @coro_func.resume(ptr %hdl) {
entry.resume:
%frame.debug = alloca ptr
call void @llvm.dbg.declare(metadata ptr %frame.debug, metadata
![[DEBUG_VARIABLE: __coro_frame]], metadata !DIExpression())
store ptr %hdl, %frame.debug
}
```
But now, the problem comes since the `dbg.declare` refers to the address
of that alloca instead of actual coroutine handle. I saw there are codes
to solve the problem but it only applies to complex expression only. I
feel if it is OK to relax the condition to make it work for
`__coro_frame`.
Reviewed By: jmorse
Differential Revision: https://reviews.llvm.org/D126277
InstCombine tries to rewrite
%prod = mul nsw i64 %X, Scale
%acc = add nsw i64 %prod, Offset
%0 = alloca i8, i64 %acc, align 4
%1 = bitcast i8* %0 to i32*
Use ( %1 )
into
%prod = mul nsw i64 %X, Scale/4
%acc = add nsw i64 %prod, Offset/4
%0 = alloca i32, i64 %acc, align 4
Use (%0)
But it assumes Scale is unsigned, and performs an unsigned division.
So we should bail out if Scale cannot be interpreted as an unsigned safely.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D126546
If we don't demand low bits and it is valid to pre-shift a constant:
(C2 >> X) << C1 --> (C2 << C1) >> X
https://alive2.llvm.org/ce/z/_UzTMP
This is the reverse-order shift sibling to 82040d414b ( D127122 ).
It seems likely that we would want to add this to the SDAG version of
the code too to keep it on par with IR.
c2eccc6 introduced a call to etHasNoUnsignedWrap which implicitly assumes that Inst is a OverflowingBinaryOperator. This is frequently untrue, but was not caught because cast<Ty>(X) has been broken, see https://discourse.llvm.org/t/cast-x-is-broken-implications-and-proposal-to-address/63033 for context.
I considered reverting this, but since doing so re-introduces a nasty miscompile of its own, I decided to fix forward instead.
I'll note that this is a particularly nasty form of the cast<Ty>(X) issue. Because the cast was succeeding unexpected, we were writing data to instructions which weren't OBOs. This could result in near arbitrary data or memory corruption. I'm a bit shocked that the sanitizers didn't find this TBH.
Enhance memchr libcall folder to handle constant arrays consisting
of one or two sequences of cosecutive equal characters.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D126515
If we don't demand high bits (zeros) and it is valid to pre-shift a constant:
(C2 << X) >> C1 --> (C2 >> C1) << X
https://alive2.llvm.org/ce/z/P3dWDW
There are a variety of related patterns, but I haven't found a single solution
that gets all of the motivating examples - so pulling this piece out of
D126617 along with more tests.
We should also handle the case where we shift-right followed by shift-left,
but I'll make that a follow-on patch assuming this one is ok. It seems likely
that we would want to add this to the SDAG version of the code too to keep it
on par with IR.
Differential Revision: https://reviews.llvm.org/D127122
If we look through a truncate in matchLinearIVUser, it's possible
we find a sext/zext instruction that didn't come from widening.
This will fail the MatchedItCount->getType() == InnerInductionPHI->getType()
assertion.
Fix this by checking that we did not look through a truncate already.
Reviewed By: SjoerdMeijer
Differential Revision: https://reviews.llvm.org/D127149
Based on reviewer comments on https://reviews.llvm.org/D126692 I've
added FastMathFlags to the select instruction used when tail-folding
with reductions. These flags can then be used by InstCombine to
decide upon the most optimal floating point identity value for
fadd/fsub. Doing so unlocks further optimisations, such as folding
selects into masked loads.
Differential Revision: https://reviews.llvm.org/D126778
Now that transforms introducing branch on poison have been removed,
we can stop marking ranges that have been derived from branch
conditions as containing undef. The existing comment explains why
this is legal. I've checked that alive2 is happy with SCCP tests
after this change.
Differential Revision: https://reviews.llvm.org/D126647
Currently, we only check !nosanitize metadata for instruction passed to function `getInterestingMemoryOperands()` or instruction which is a cannot return callable instruction.
This patch add this check to any instruction.
E.g. ASan shouldn't instrument the instruction inserted by UBSan/pointer-overflow.
Reviewed By: vitalybuka
Differential Revision: https://reviews.llvm.org/D126269
In D115737 I found that I needed to teach Instruction::isSafeToRemove()
about strictfp/constrained intrinsics. It was pointed out that this is
probably the wrong function to use isInstructionTriviallyDead(). It doesn't
make sense to have a "second, worse implementation".
I also believe that the Instruction class is the wrong place for this
functionality. The information about whether or not an instruction can be
removed is in the transform passes and should stay there.
Differential Revision: https://reviews.llvm.org/D118387
Try to simplify BranchOnCount to `BranchOnCond true` if TC <= UF * VF.
This is an alternative to D121899 which simplifies the VPlan directly
instead of doing so late in code-gen.
The potential benefit of doing this in VPlan is that this may help
cost-modeling in the future. The reason this is done in prepareToExecute
at the moment is that a single plan may be used for multiple VFs/UFs.
There are further simplifications that can be applied as follow ups:
1. Replace inductions with constants
2. Replace vector region with regular block.
Fixes#55354.
Depends on D126679.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D126680
https://alive2.llvm.org/ce/z/o7rQ5q
This shows an extra instruction in some cases, but that is
caused by an existing canonicalization of trunc -> and+icmp.
Codegen should be better for any target where a multiply is
more costly than the most simple ALU op.
This ends up producing the requested x86 asm from issue #55618,
but it's not the same IR. We are missing a canonicalization
from the negate+mask pattern to the trunc+select created here.
Instead of setting the successor to the exit using CFG.ExitBB, set it to
nullptr initially. The successor to the exit block is later set either
through createEmptyBasicBlock or after VPlan execution (because at the
moment, no block is created by VPlan for the exit block, the existing
one is reused).
This also enables BranchOnCond to be used as terminator for the exiting
block of the topmost vector region.
Depends on D126618.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D126679
Some cl::ZeroOrMore were added to avoid the `may only occur zero or one times!`
error. More were added due to cargo cult. Since the error has been removed,
cl::ZeroOrMore is unneeded.
Also remove cl::init(false) while touching the lines.
Improved/fixed cost modeling for shuffles by providing masks, improved
cost model for non-identity insertelements.
Differential Revision: https://reviews.llvm.org/D115462
Async context frames are allocated with a maximum alignment. If a type
requests an alignment bigger than that dynamically align the address
in the frame.
Differential Revision: https://reviews.llvm.org/D126715
This patch removes CondBit and Predicate from VPBasicBlock. To do so,
the patch introduces a new branch-on-cond VPInstruction opcode to model
a branch on a condition explicitly.
This addresses a long-standing TODO/FIXME that blocks shouldn't be users
of VPValues. Those extra users can cause issues for VPValue-based
analyses that don't expect blocks. Addressing this fixme should allow us
to re-introduce 266ea446ab.
The generic branch opcode can also be used in follow-up patches.
Depends on D123005.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D126618
This patch proposed to use a new cost model for loop interchange, which
is obtained from loop cache analysis.
Given a loopnest, what loop cache analysis returns is a vector of loops
[loop0, loop1, loop2, ...] where loop0 should be replaced as the outermost
loop, loop1 should be placed one more level inside, and loop2 one more level
inside, etc. What loop cache analysis does is not only more comprehensive than
the current cost model, it is also a "one-shot" query which means that we only
need to query it once during the entire loop interchange pass, which is better
than the current cost model where we query it every time we check whether it is
profitable to interchange two loops. Thus complexity is reduced, especially after
D120386 where we do more interchanges to get the globally optimal loop access pattern.
Updates made to test cases are mostly minor changes and some corrections.
Test coverage for loop interchange is not reduced.
Currently we did not completely remove the legacy cost model, but keep it as
fall-back in case the new cost model did not run successfully. This is because
currently we have some limitations in delinearization, which sometimes makes
loop cache analysis bail out. The longer term goal is to enhance delinearization
and eventually remove the legacy cost model compeletely.
Reviewed By: bmahjour, #loopoptwg
Differential Revision: https://reviews.llvm.org/D124926
We could go either way on this and several similar matches.
Just matching as a binop is possibly slightly more efficient;
we don't need to re-confirm the opcode of the instruction.
Improved/fixed cost modeling for shuffles by providing masks, improved
cost model for non-identity insertelements.
Differential Revision: https://reviews.llvm.org/D115462
This patch introduces the abstract base class InlinePriority to serve as
the comparison function for the priority queue. A derived class, such
as SizePriority, may choose to cache the priorities for different
functions for performance reasons.
This design shields the type used for the priority away from classes
outside InlinePriority and classes derived from it. In turn,
PriorityInlineOrder no longer needs to be a template class.
Reviewed By: kazu
Differential Revision: https://reviews.llvm.org/D126300
This patch introduces the abstract base class InlinePriority to serve as
the comparison function for the priority queue. A derived class, such
as SizePriority, may choose to cache the priorities for different
functions for performance reasons.
This design shields the type used for the priority away from classes
outside InlinePriority and classes derived from it. In turn,
PriorityInlineOrder no longer needs to be a template class.
Reviewed By: kazu
Differential Revision: https://reviews.llvm.org/D126300
This patch does not effect any behavior of the current code.
The codebase implicitly implies that `Cost::RateFormula` is only called
when the `Cost` is not in losing status, or else there may be possible
to trigger the assertion of `Cost::isValid`.
The intention here is to prevent mis-use where future development
allow `Cost` that is already loser to call `Cost::RateFormula` - Early
exit when `Cost` is already losing.
Reviewed By: Meinersbur, #loopoptwg
Differential Revision: https://reviews.llvm.org/D125670
Recently the terminology used has been changed from Exit->Exiting in
line with common LLVM loop terminology. Update a remaining use of the
old terminology.
Improved/fixed cost modeling for shuffles by providing masks, improved
cost model for non-identity insertelements.
Differential Revision: https://reviews.llvm.org/D115462
Extractelement instructions may come from different basic blocks, need
to take it into account when looking for a last instruction in the
bundle to prevent compiler crash.
Differential Revision: https://reviews.llvm.org/D126777
This reverts commit ec4adf1f6c. The commit causes
clang to hang on a certain input:
```
$ cat q.cc
int f(int a, int b) {
int c = ((unsigned char)(a >> 23) & 925);
if (a)
c = (a >> 23 & b) | ((unsigned char)(a >> 23) & 925) | (b >> 23 & 157);
return c;
}
$ time ./clang-15-10515 --target=x86_64--linux-gnu -O1 -c q.cc
^C
real 0m45.072s
user 0m0.025s
sys 0m0.099s
```
This patch updates the VPlan native path to use VPRegionBlocks for all
loops in a loop nest. Up to now, only the outermost loop used a region.
This is a step towards unifying both paths and keep things consistent
between them. It also prepares various code-gen parts for modeling the
pre-header in the inner loop vectorizer (D121624).
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D123005
The implementations of VPlanDominatorTree, VPlanLoopInfo and VPlanPredicator
are all incompatible with modeling loops in VPlans as region without
explicit back-edges.
Those pieces are not actively used and only exercised by a few gtest
unit tests. They are at the moment blocking progress towards unifying
the native and inner-loop vectorizer paths in D121624 and D123005.
I think we should not block forward progress on unused pieces of code,
so this patch removes the utilities for now. The plan is to re-introduce
them as needed in a way that is compatible with the unified VPlan scheme
used in both the inner loop vectorizer and the native path.
Reviewed By: sguggill
Differential Revision: https://reviews.llvm.org/D123017
Commit dd5991cc modified the aliasing checks here to allow transforming
a memcpy where the source and destination point into the same object.
However, the change accidentally made the code skip the alias check for
other operations in the loop.
Instead of completely skipping the alias check, just skip the check for
whether the memcpy aliases itself.
Differential Revision: https://reviews.llvm.org/D126486
X <=u (sext i1 Y) --> (X == 0) | Y
https://alive2.llvm.org/ce/z/W_tZzo
This is the conjugate/sibling pattern suggested with D126171
for a sign-extended bool value.
I chose to encode the allockind information in a string constant because
otherwise we would get a bit of an explosion of keywords to deal with
the possible permutations of allocation function types.
I'm not sure that CodeGen.h is the correct place for this enum, but it
seemed to kind of match the UWTableKind enum so I put it in the same
place. Constructive suggestions on a better location most certainly
encouraged.
Differential Revision: https://reviews.llvm.org/D123088
When reassociating GEPs, we can only keep inbounds if both original
GEPs were inbounds, and their offsets have the same sign. For the
sake of simplicity, I only handle the case where both offsets are
non-negative here.
It would probably be fine to just not preserve inbounds at all here,
but as I don't see a compile-time impact for adding the
isKnownNonNegative() calls I went with this more conservative
approach.
Fixes https://github.com/llvm/llvm-project/issues/44206.
Differential Revision: https://reviews.llvm.org/D126687
Even if the total offset is inbounds, we might represent it by first
performing a large negative offset and then a small positive one.
With inbounds semantics as currently specified, each offset must
be inbounds individually, not just the overall offset of the GEP.
Fix this by checking that the sign of all offsets is the same.
Fixes https://github.com/llvm/llvm-project/issues/55722.
(C2 >> X) >> C1 --> (C2 >> C1) >> X
The shift-left form of this transform has existed since:
16f18ed7b5
...but it applies to matching shift right opcodes too:
https://alive2.llvm.org/ce/z/c5eQms
The restriction goes back to:
16f18ed7b5
...but the fold only replaces a shift with a shift, so that's not necessary.
Generalizing to other opcodes is planned as a follow-up.
There are a few places where we use report_fatal_error when the input is broken.
Currently, this function always crashes LLVM with an abort signal, which
then triggers the backtrace printing code.
I think this is excessive, as wrong input shouldn't give a link to
LLVM's github issue URL and tell users to file a bug report.
We shouldn't print a stack trace either.
This patch changes report_fatal_error so it uses exit() rather than
abort() when its argument GenCrashDiag=false.
Reviewed by: nikic, MaskRay, RKSimon
Differential Revision: https://reviews.llvm.org/D126550
If only one of the GEPs is inbounds, then after swapping, there is
no guarantee that one of them will be inbounds as well
(see e.g. https://alive2.llvm.org/ce/z/agaCnp).
This is only a partial fix, because even if both are inbounds, the
result is not necessarily inbounds (if the offsets have different
signs).
As the long explanatory comment attests, performing the modification
in place is pretty tricky. Drop this unnecessary complexity and
always create new instructions.
This should be NFC-ish, but can probably cause difference due to
worklist order.
This option was added in D89854. It prevents GVN from performing
load PRE in a loop, if doing so would require critical edge
splitting on the backedge. From the review:
> I know that GVN Load PRE negatively impacts peeling,
> loop predication, so the passes expecting that latch has
> a conditional branch.
In the PhaseOrdering test in this patch, splitting the backedge
negatively affects vectorization: After critical edge splitting,
the loop gets rotated, effectively peeling off the first loop
iteration. The effect is that the first element is handled
separately, then the bulk of the elements use a vectorized
reduction (but using unaligned, off-by-one memory accesses) and
then a tail of 15 elements is handled separately again.
It's probably worth noting that the loop load PRE from D99926 is
not affected by this change (as it does not need backedge
splitting). This is about normal load PRE that happens to occur
inside a loop.
Differential Revision: https://reviews.llvm.org/D126382
This whole part with recomputation of BPI and BFI looks redundant,
and we tried to get rid of it in D124439. Unfortunately, it causes
some hard-to-reproduce failures due to invalid state of analysis.
Until this is investigated and fixed, let's try to reuse at least
part of available analyzes.
DT is available at this point, and there is no need to recompute it.
Please revert if you see it causing *any* behavior changes.
This reverts the revert commit ad95255b92.
The updated version also creates a load when the store may not execute.
In those cases, we still need to introduce a load in a function where
there may not have been one before, so this doesn't completely resolve
issue #51248.
Original message:
When only a store is sunk, there is no need to create a load in the
pre-header, as the result of the load will never get used.
The dead load can can introduce UB, if the function is marked as
writeonly.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D123473
In LLVM's common loop terminology, an exit block is a block outside a
loop with a predecessor inside the loop. An exiting block is a block
inside the loop which branches to an exit block outside the loop.
This patch updates a few places where VPlan was using ExitBlock for a
block exiting a region. Those instances have been updated to use
ExitingBlock.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D126173
(ashr i32 X, 31) * C --> (X < 0) ? -C : 0
https://alive2.llvm.org/ce/z/G8u9SS
With a constant operand, this is an improvement in IR
and codegen (where it can be converted to a mask op).
Without a constant operand, we would have to negate
the operand, so that is probably better left to the backend.
This is similar but not the same optimization that is requested
in #55618.
This patch adds !nosanitize metadata to FixedMetadataKinds.def, !nosanitize indicates that LLVM should not insert any sanitizer instrumentation.
Reviewed By: vitalybuka
Differential Revision: https://reviews.llvm.org/D126294
All callers pass true.
select-unfold-freeze.ll is now a subset of select.ll so delete it.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D126501
This is effectively NFC (intentionally no test diffs)
because we already have the related fold that converts
the 'and' pattern to select. So this is just an efficiency
improvement.
This extends the fold from D126410 / 3952c905ef
to allow for the only case where it works with signed
division:
https://alive2.llvm.org/ce/z/k7_ypu
(X s/ Y) == SMIN --> (X == SMIN) && (Y == 1)
(X s/ Y) != SMIN --> (X != SMIN) || (Y != 1)
This is another improvement based on #55695.
Use logical instead of bitwise and to combine conditions, to avoid
propagating poison from a later condition if an earlier one is
already false. This avoids introducing branch on poison.
Differential Revision: https://reviews.llvm.org/D125898
Patch improves compile time. For function calls, which cannot be
vectorized, create a unique group for each such a call instead of
subgroup. It prevents them from being grouped by a subgroups and
attempts for their vectorization.
Also, looks through casts operand to try to check their
groups/subgroups.
Reduces number of vectorization attempts. No changes in the statistics
for SPEC2017/2006/llvm-test-suite.
Differential Revision: https://reviews.llvm.org/D126476
Need to handle a corner case correctly, if all elements are Undefs/Poisons,
need to emit actual values, not just poisons.
Differential Revision: https://reviews.llvm.org/D126298
Responding to a feature request from the Rust community:
https://github.com/rust-lang/rust/issues/80630
void foo(X) {
for (...)
switch (X)
case A
X = B
case B
X = C
}
Even though the initial switch value is non-constant, the switch
statement can still be threaded: the initial value will hit the switch
statement but the rest of the state changes will proceed by jumping
unconditionally.
The early predictability check is relaxed to allow unpredictable values
anywhere, but later, after the paths through the switch statement have
been enumerated, no non-constant state values are allowed along the
paths. Any state value not along a path will be an initial switch value,
which can be safely ignored.
Differential Revision: https://reviews.llvm.org/D124394
ScatterVectorize nodes should be handled same way as gathers in
reorderBottomToTop function, since we can simple reorder the loads in
this node. Because of that need to include such nodes to the list of
gathered nodes to fix compiler crash.
Differential Revision: https://reviews.llvm.org/D126378
With large compare constant:
(X u/ Y) == C --> (X == C) && (Y == 1)
(X u/ Y) != C --> (X != C) || (Y != 1)
https://alive2.llvm.org/ce/z/EhKwh6
There are various potential missing icmp (div) transforms shown here:
https://github.com/llvm/llvm-project/issues/55695
This is a generalization for part of the udiv + equality.
I didn't check in detail, but some of those may only make sense as
codegen transforms.
This results in one extra instruction in IR, but it is better for
analysis, and looks much better in codegen on all targets that I tried.
Differential Revision: https://reviews.llvm.org/D126410
When updating the branch instruction outside the loopduring non-trivial
unswitching, always skip trivial selects and update the condition.
Otherwise we might create invalid IR, because the trivial select is
inside the loop, while the condition is outside the loop.
Fixes#55697.
The purpose of the custom linked list was to optimize for the case
of a single-element list. It turns out that TinyPtrVector handles
the same basic scenario even better, reducing the size of
LeaderTableEntry by 33%, and requiring only log2(N) allocations
as the size of the list grows. The only downside is that we have
to store the Value's and BasicBlock's in separate vectors, which
is slightly awkward in a few cases. Fortunately that ends up being
entirely encapsulated inside helper functions.
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D125205
When we hoist instructions over guard we must clear flags due to these flags
might be implied using this guard, so they make sense only after the guard.
As an example of the bug due to current behavior.
L is known to be in range say [0, 100)
c1 = x u< L
guard (c1)
x1 = add x, 1
c2 = x1 u< L
guard(c2)
basing on guard(c1) we can say that x1 = add nuw nsw x, 1
after guard widening we get
c1 = x u< L
x1 = add nuw nsw x, 1
c2 = x1 u< L
c = and c1, c2
guard(c)
now, basing on fact that x + 1 < L and x >= 0 due to x + 1 is nuw
we can prove that x + 1 u< L implies that x u< L, so we can just remove c1
x1 = add nuw nsw x, 1
c2 = x1 u< L
guard(c2)
But that is not correct due to we will pass x == -1 value.
Reviewed By: mkazantsev
Subscribers: llvm-commits, nikic
Differential Revision: https://reviews.llvm.org/D126354
This patch break foldBitCastBitwiseLogic limite the destination
must have an integer element type, and eliminate one bitcast by
doing the logic op in the type of the input that has an integer
element type.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D126184
SLP should build ScatterVectorize nodes only if they actually end up
with masked gather rather than with scalarization. In the second
scenario better to build a gather node.
Differential Revision: https://reviews.llvm.org/D126379
Need to use all ReductionOps when propagating flags for the reduction
ops, otherwise transformation is not correct. Plus, need to drop nuw/nsw
flags.
Differential Revision: https://reviews.llvm.org/D126371
When compiling the attached new test in scalable-reductions-tf.ll we
were hitting this assertion in fixReduction:
Assertion `isa<PHINode>(U) && "Reduction exit must feed Phi's or select"
The loop contains a reduction and an intermediate store of the reduction
value. When vectorising with tail-folding the contains of 'U' in the
assertion above happened to be a scatter_store. It turns out that we
were still creating a widen recipe for the invariant store, despite
knowing that we can actually sink it. The simplest fix is to change
buildVPlanWithVPRecipes so that we look for invariant stores before
attempting to widen it.
Differential Revision: https://reviews.llvm.org/D126295
The crash is caused by incorrect order set by reorderBottomToTop(), which
happens when it is reordering a TreeEntry which has a user that has already been
reordered earlier. Please see the detailed description in the lit test.
Differential Revision: https://reviews.llvm.org/D126099
shuffle (cast X), (cast Y), Mask --> cast (shuffle X, Y, Mask)
This extends the transform added with 0353c2c996.
If the shuffle reduces vector length, the transform
reduces the width of the cast, so that should be a
win for most codegen (if not, it can be inverted).
Bitcasts were stripped in one case, but not the other. Of course,
this no longer really matters with opaque pointers, but as I went
through the trouble of tracking this down, we may as well remove
one typed vs opaque pointer optimization discrepancy.
Use IRBuilder so that the newly created freeze instructions
automatically gets inserted back into the IC worklist.
The changed worklist processing order leads to some cosmetic
differences in tests.
Fixes https://github.com/llvm/llvm-project/issues/55619.
To be used correctly in a sort-like function, isFirstInsertElement
function must follow weak strict ordering rule, i.e.
isFirstInsertElement(IE1, IE1) should return false.
Most of the folds implemented in this function work fine with
logical operations. We only need to be careful for the cases that
work on non-constant masks, where the RHS operand shouldn't be
poison.
This is a conservative implementation that bails out of illegal
transforms, but we could also change these to insert freeze instead.
This is a followup to D125754. We introduce two branches, one
before the unrolled loop and one before the epilogue (and similar
for the prologue case). The previous patch only froze the
condition on the first branch.
Rather than independently freezing the second condition, this patch
instead freezes TripCount and bases BECount on it. These are the
two quantities involved in the conditions, and this ensures that
both work on a consistent, non-poisonous trip count.
Differential Revision: https://reviews.llvm.org/D125896
Fixes a bug preventing moving the loop's metadata to an outer loop's header,
which happens if the loop's exit is also the header of an outer loop.
Adjusts test for above.
Fixes#55416.
Differential Revision: https://reviews.llvm.org/D125574
Builds UserIgnore list only once as a SmallDenseSet without rebuilding
it between the runs, iterate over gathers instead list of reduction ops,
do some checks in the buildTree_rec only if the corresponding containers
are not empty.
SLP vectorizer emits extracts for externally used vectorized scalars and
estimates the cost for each such extract. But in many cases these
scalars are input for insertelement instructions, forming buildvector,
and instead of extractelement/insertelement pair we can emit/cost
estimate shuffle(s) cost and generate series of shuffles, which can be
further optimized.
Tested using test-suite (+SPEC2017), the tests passed, SLP was able to
generate/vectorize more instructions in many cases and it allowed to reduce
number of re-vectorization attempts (where we could try to vectorize
buildector insertelements again and again).
Differential Revision: https://reviews.llvm.org/D107966
X <u (zext i1 Y) --> (X == 0) && Y
https://alive2.llvm.org/ce/z/avQDRY
This is a generalization of 4069cccf3b based on the post-commit suggestion.
This also adds the i1 type check and tests that were missing from the earlier
attempt; that commit caused several bot fails and was reverted.
Differential Revision: https://reviews.llvm.org/D126171
Similarly to a change recently done for fcmps, add a flag that
indicates whether the and/or is logical to foldAndOrOfICmps, and
reuse the function when folding logical and/or.
We were already calling some parts of it, but this gives us a
clearer indication of which parts may need poison-safe variants,
and would also allow to fold combinations of bitwise and logical
and/or.
This change should be close to NFC, because all folds this enables
were either already called previously, or can make use of implied
poison reasoning.
Previously, `getRegUsageForType` was implemented using
`getTypeLegalizationCost`. `getRegUsageForType` is used by the loop
vectorizer to estimate the register pressure caused by using a vector
type. However, `getTypeLegalizationCost` currently only appears to
understand splitting and not scalarization, so significantly
underestimates the register requirements.
Instead, use `getNumRegisters`, which understands when scalarization
can occur (via computeRegisterProperties).
This was discovered while investigating D118979 (Set maximum VF with
shouldMaximizeVectorBandwidth), where under fixed-length 512-bit SVE the
loop vectorizer previously ends up costing an v128i1 as 2 v64i*
registers where it actually occupies 128 i32 registers.
I'm sending this patch early for comment, I'm still doing some sanity checking
with LNT. I note that getRegisterClassForType appears to return VectorRC even
though the type in question (large vNi1 types) end up occupying scalar
registers. That might be worth fixing too.
Differential Revision: https://reviews.llvm.org/D125918
The latch may not be the exiting block. Use the exiting block instead
when looking up the incoming value of the LCSSA phi node. This fixes a
crash with early-exit loops.
This is the specific pattern seen in #53432, but it can be extended
in multiple ways:
1. The 'zext' could be an 'and'
2. The 'sub' could be some other binop with a similar ==0 property (udiv).
There might be some way to generalize using knownbits, but that
would require checking that the 'bool' value is created with
some instruction that can be replaced with new icmp+logic.
https://alive2.llvm.org/ce/z/-KCfpa
Current codegen only supports scalarization of pointer inductions for
scalable VFs if they are uniform. After 3bebec659 we now may enter the
scalarization code path in VPWidenPointerInductionRecipe::execute for
scalable vectors.
Fall back to widening for scalable vectors if necessary.
This should fix a build failure when bootstrapping LLVM with SVE, e.g.
https://lab.llvm.org/buildbot/#/builders/176/builds/1723
This reverts commit fc9c59c355.
The patch triggers an assertion when building SPEC on X86. Reduced
reproducer shared at D107966.
Also reverts follow-up commit 11a09af76d.
This patch introduces a new VPLiveOut subclass of VPUser to model
exit values explicitly. The initial version handles exit values that
are neither part of induction or reduction chains nor first order
recurrence phis.
Fixes#51366, #54867, #55167, #55459
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D123537
JumpThreading may convert selects into branch instructions,
in which case the condition needs to be frozen (as branch on
poison is immediate undefined behavior, unlike select on poison).
The necessary code for this is already in place, this just enables
the option.
Differential Revision: https://reviews.llvm.org/D125869
SLP vectorizer emits extracts for externally used vectorized scalars and
estimates the cost for each such extract. But in many cases these
scalars are input for insertelement instructions, forming buildvector,
and instead of extractelement/insertelement pair we can emit/cost
estimate shuffle(s) cost and generate series of shuffles, which can be
further optimized.
Tested using test-suite (+SPEC2017), the tests passed, SLP was able to
generate/vectorize more instructions in many cases and it allowed to reduce
number of re-vectorization attempts (where we could try to vectorize
buildector insertelements again and again).
Differential Revision: https://reviews.llvm.org/D107966
At the moment LV runs LoopSimplify and reconstructs LCSSA form after
generating the main vector loop and before generating the epilogue
vector loop.
In practice, this adds a new exit block for the scalar loop because the
middle block now also branches to the original exit block of the scalar
loop. It also requires adding a new LCSSA phi in the newly created exit
block.
This complicates things when modeling exit values in VPlan, because we
would need to update the VPlan for the epilogue loop to update the newly
created LCSSA phi node.
But none of that should be necessary, as all analysis requiring
loop-simplify form is already done at this point and LCSSA form of the
original loop is not broken.
Reviewed By: bmahjour
Differential Revision: https://reviews.llvm.org/D125810
Update clearReductionWrapFlags to use the VPlan def-use chain from the
reduction phi recipe to drop reduction wrap flags.
This addresses an existing FIXME and fixes a crash when instructions in
the reduction chain are not used and have been removed before VPlan
codegeneration.
Fixes#55540.
It doesn't matter which value we use for dead args, so let's switch
to poison, so we can eventually kill undef.
Reviewed By: aeubanks, fhahn
Differential Revision: https://reviews.llvm.org/D125983
The runtime check threshold should also restrict interleave count.
Otherwise, too many runtime checks will be generated for some cases.
Reviewed By: fhahn, dmgreen
Differential Revision: https://reviews.llvm.org/D122126
VPWidenMemoryInstruction also models stores which may not produce a value.
This can trip over analyses. Improve the modeling by only adding
VPValues for VPWidenMemoryInstructionRecipes modeling loads.
Most clients only used these methods because they wanted to be able to
extend or truncate to the same bit width (which is a no-op). Now that
the standard zext, sext and trunc allow this, there is no reason to use
the OrSelf versions.
The OrSelf versions additionally have the strange behaviour of allowing
extending to a *smaller* width, or truncating to a *larger* width, which
are also treated as no-ops. A small amount of client code relied on this
(ConstantRange::castOp and MicrosoftCXXNameMangler::mangleNumber) and
needed rewriting.
Differential Revision: https://reviews.llvm.org/D125557
This patch changes the strategy for vectorizing freeze instrucion, from
replicating multiple times to widening according to selected VF.
Fixes#54992
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D125016
In this patch we add a function foldICmpInstWithConstantAllowUndef
to fold integer comparisons with a constant operand: icmp Pred X, C
where X is some kind of instruction and C is AllowUndef.
We move this fold to the new function, so that it can solve undef elts in a vector.
Reviewed By: spatel, RKSimon
Differential Revision: https://reviews.llvm.org/D125220
The pattern matching and vectgorization for reductions was not very
effective. Some of of the possible reduction values were marked as
external arguments, SLP could not find some reduction patterns because
of too early attempt to vectorize pair of binops arguments, the cost of
consts reductions was not correct. Patch addresses these issues and
improves the analysis/cost estimation and vectorization of the
reductions.
The most significant changes in SLP.NumVectorInstructions:
Metric: SLP.NumVectorInstructions [140/14396]
Program results results0 diff
test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 920.00 3548.00 285.7%
test-suite :: SingleSource/Benchmarks/BenchmarkGame/n-body.test 66.00 122.00 84.8%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 100.00 128.00 28.0%
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 664.00 810.00 22.0%
test-suite :: MultiSource/Benchmarks/mafft/pairlocalalign.test 592.00 687.00 16.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 402.00 426.00 6.0%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 1665.00 1745.00 4.8%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 135.00 139.00 3.0%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 135.00 139.00 3.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 388.00 397.00 2.3%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 895.00 914.00 2.1%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 240.00 244.00 1.7%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 240.00 244.00 1.7%
test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 820.00 832.00 1.5%
test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 820.00 832.00 1.5%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 14804.00 14914.00 0.7%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 8125.00 8183.00 0.7%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 1330.00 1338.00 0.6%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 1330.00 1338.00 0.6%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 9832.00 9880.00 0.5%
test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 5267.00 5291.00 0.5%
test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 4018.00 4024.00 0.1%
test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 4018.00 4024.00 0.1%
test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test 426.00 424.00 -0.5%
test-suite :: External/SPEC/CFP2017rate/544.nab_r/544.nab_r.test 426.00 424.00 -0.5%
test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test 201.00 192.00 -4.5%
test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test 201.00 192.00 -4.5%
644.nab_s and 544.nab_r - reduced number of shuffles but increased number
of useful vectorized instructions.
641.leela_s and 541.leela_r - the function
`@_ZN9FastBoard25get_pattern3_augment_specEiib` is not inlined anymore
but its body gets vectorized successfully. Before, the function was
inlined twice and vectorized just after inlining, currently it is not
required. The vector code looks pretty similar, just like as it was before.
Differential Revision: https://reviews.llvm.org/D111574
When shifting by a byte-multiple:
bswap (shl X, Y) --> lshr (bswap X), Y
bswap (lshr X, Y) --> shl (bswap X), Y
This was limited to constants as a first step in D122010 / 60820e53ec ,
but issue #55327 shows a source example (and there's a test based on that here)
where a variable shift amount is used in this pattern.
Evaluation odering in function call arguments is implementation-dependent.
In fact, gcc evaluates bottom-top and clang does top-bottom.
Fixes#55283 partially.
Part of https://reviews.llvm.org/D125627
We could do better by inserting a bitcast from scalar int
to vector int or using an insertelement (the alternate test
does not crash because there's an independent fold like that).
But this doesn't seem like a likely pattern, so just bail out
for now.
Fixes issue #55516.
This code is valid for any icmp, so we can safely look through a
freeze when trying to find one.
A caveat here is that replaceFoldableUses() may not end up replacing
any uses in this case. It might make sense to use the freeze as the
context instruction (rather than the terminator) if there is a
freeze, to ensure that it always gets folded. This would require
some changes to how replaceFoldedUses() works though, as it
currently assumes that the value is valid at the end of the block.
The modified function was incorrectly (not unnecessarily) ignoring grandchild
loops, and this change fixes the bug. In particular, this fixes the handling of
the loop { inner, body }. The TODO in the same function is talking about the b1
self loop, which may be "unnecessarily" lost, but that is a different issue.
It's sufficient to just fold the icmp to true/false here, and then
let constant terminator folding take care of the rest.
It should be noted that while replaceFoldableUses() may not replace
all uses of the icmp, at least the use in the terminator we're
working on is always replaceable, so terminator constant folding
should be reliably enabled as a subsequent step.
%x umin_seq %y is currently expanded to %x == 0 ? 0 : umin(%x, %y).
This patch changes the expansion to umin(%x, freeze %y) instead
(https://alive2.llvm.org/ce/z/wujUhp).
The motivation for this change are the test cases affected by
D124910, where the freeze expansion ultimately produces better
optimization results. This is largely because
`(%x umin_seq %y) == %x` is a common expansion pattern, which
reliably optimizes in freeze representation, but only sometimes
with the zero comparison (in particular, if %x == 0 can fold to
something else, we generally won't be able to cover reasonable
code from this.)
Differential Revision: https://reviews.llvm.org/D125372
When performing runtime unrolling with multiple exits, one of the
earlier (non-latch) exits may exit the loop on the first iteration,
such that we never branch on the latch exit condition. As such, we
need to freeze the condition of the new branch that is introduced
before the loop, as it now executes unconditionally.
Differential Revision: https://reviews.llvm.org/D125754
This patch makes JumpThreading's ProcessImpliedCondition deal with frozen
conditions.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D84941
shuffle (cast X), (cast Y), Mask --> cast (shuffle X, Y, Mask)
This extends the transform added with 0353c2c996.
If the casts are to a larger element type, the transform
reduces shuffle bit width, so that should be a win for
most codegen (if not, it can be inverted).
The transform was wrong in 3 ways:
1. It created an extra instruction when the source and dest types don't match.
2. It did not account for an extra use of the icmp, so could create 2 extra insts.
3. It favored bit hacks over icmp (icmp generally has better analysis).
This fixes#54692 (modeled by the PhaseOrdering tests).
This is a minimal step to fix the bug, but we should likely invert
this and the sibling transform for the "is negative" pattern too.
The backend should be able to invert this back to a shift if that
leads to better codegen.
This is a reduced try of 3794cc0e99 - that was reverted because
it could cause infinite loops by conflicting with the related
transforms in this block that create shifts.
Need to check if the reduction is still (not)cmp-select pattern min/max
reduction to avoid compiler crash during building list of reduction
operations. cmp-sel pattern provides 2 reduction operations, while
intrinsics - just one.
Those helpers model properties of a user and they should also be
available to non-recipe users. This will be used in D123537 for a new
exit value user.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D124936
JumpThreading intentionally does not force updating of the DT
during optimization, because this may be expensive when many CFG
updates and DT calculations are interleaved.
We shouldn't be fetching the DT just for the purpose of calling
isGuaranteedNotToBeUndefOrPoison(), especially as DT availability
doesn't even show benefit in tests.
This patch fixes a bug that generates unnecessary packing/unpacking structure code because of incorrectly handling lifetime intrinsic.
For example, a partition of an alloca may contain many slices:
```
Partition [0, 4):
Slice0: [0, 4) used by: load i32 addr;
Slice1: [0, 4) used by: store i32 v, addr;
Slice2: [0, 16) used by lifetime.start(16, addr);
```
When SROA determines if the partition can be promoted, lifetime.start is currently treated as a whole alloca load/store, so Slice0 and Slice1 cannot be promoted at this attempt,
but the packing/unpacking code for Slice0 and Slice1 has been generated.
After rewrite lifetime.start/end intrinsic, SROA tries again with Slice0 and Slice1 and finally promotes them, but redundant packing/unpacking code remaining in the IRs.
This patch changes promotability checking to ignore lifetime intrinsic (they will be rewritten to correct sizes later), so we can promote the real users (load/store) at the first attempt with optimal code.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D124967
Checking whether two KnownBits are the same is somewhat common,
mainly in test code.
I don't think there is a lot of room for confusion with "determine
what the KnownBits for an icmp eq would be", as that has a
different result type (this is what the eq() method implements,
which returns Optional<bool>).
Differential Revision: https://reviews.llvm.org/D125692
When using counter relocations, two instructions are emitted to compute
the address of the counter variable.
```
%BiasAdd = add i64 ptrtoint <__profc_>, <__llvm_profile_counter_bias>
%Addr = inttoptr i64 %BiasAdd to i64*
```
When promoting a counter, these instructions might not be available in
the block, so we need to copy these instructions.
This fixes https://github.com/llvm/llvm-project/issues/55125
Reviewed By: phosek
Differential Revision: https://reviews.llvm.org/D125710
The existing transform was wrong in 3 ways:
1. It created an extra instruction when the source and dest types don't match.
2. It did not account for an extra use of the icmp, so could create 2 extra insts.
3. It favored bit hacks over icmp (icmp generally has better analysis).
This fixes#54692 (modeled by the PhaseOrdering tests).
This is a minimal step to fix the bug, but we should likely invert
the sibling transform for the "is negative" pattern too.
The backend should be able to invert this back to a shift if that
leads to better codegen.
Add a map from functions to load instructions that compute the profile bias. Previously we assumed that if the first instruction in the function was a load instruction, then it must be computing the bias. This was likely to work out because functions usually start with the `llvm.instrprof.increment` instruction, but optimizations could change this. For example, inlining into a non-profiled function.
Reviewed By: phosek
Differential Revision: https://reviews.llvm.org/D114319
This patch adds initial support for a pointer diff based runtime check
scheme for vectorization. This scheme requires fewer computations and
checks than the existing full overlap checking, if it is applicable.
The main idea is to only check if source and sink of a dependency are
far enough apart so the accesses won't overlap in the vector loop. To do
so, it is sufficient to compute the difference and compare it to the
`VF * UF * AccessSize`. It is sufficient to check
`(Sink - Src) <u VF * UF * AccessSize` to rule out a backwards
dependence in the vector loop with the given VF and UF. If Src >=u Sink,
there is not dependence preventing vectorization, hence the overflow
should not matter and using the ULT should be sufficient.
Note that the initial version is restricted in multiple ways:
1. Pointers must only either be read or written, by a single
instruction (this allows re-constructing source/sink for
dependences with the available information)
2. Source and sink pointers must be add-recs, with matching steps
3. The step must be a constant.
3. abs(step) == AccessSize.
Most of those restrictions can be relaxed in the future.
See https://github.com/llvm/llvm-project/issues/53590.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D119078
The patch simplifies some of the patterns as below
(A | (B & C0)) | (B & C1) -> A | (B & C0|C1)
((B & C0) | A) | (B & C1) -> (B & C0|C1) | A
In some scenarios like byte reverse on half word, we can see this pattern multiple times and this conversion can optimize these patterns.
Differential Revision: https://reviews.llvm.org/D124119
While select conditions can be poison, branch on poison is
immediate UB. As such, we need to freeze the condition when
converting a select into a branch.
Differential Revision: https://reviews.llvm.org/D125398
When the loop vectoriser encounters a known low trip count it tries
to create a single predicated loop in order to get the benefit of
vectorisation and eliminate the scalar tail. However, until now the
vectoriser prevented the use of scalable vectors in this case due
to concerns in the past about stability. I believe that tail-folded
loops using scalable vectors are now sufficiently well tested that
we can enable this. For the same reason I've also enabled it when
optimising for code size too.
Tests added here:
Transforms/LoopVectorize/AArch64/sve-low-trip-count.ll
Transforms/LoopVectorize/AArch64/sve-tail-folding-optsize.ll
Transforms/LoopVectorize/RISCV/low-trip-count.ll
Differential Revision: https://reviews.llvm.org/D121595
Under some circumstances, SCEVExpander will insert new instructions when
expanding a predicate, but the final result of the expansion can be a
false constant.
In those cases, the expanded instructions may later be used by other
expansions, e.g. the trip count. This may trigger an assertion during
SCEVExpander cleanup. To avoid this, always mark the result as used.
Fixes#55100.
There is a long function foldICmpInstWithConstant,
we can separate a function foldICmpBinOpWithConstant from it.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D125457
If alternate node has only 2 instructions and the tree is already big
enough, better to skip the vectorization of such nodes, they are not
very profitable (the resulting code cotains 3 instructions instead of
original 2 scalars). SLP can try to vectorize the buildvector sequence
in the next attempt, if it is profitable.
Metric: SLP.NumVectorInstructions
Program SLP.NumVectorInstructions
results results0 diff
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test 72.00 73.00 1.4%
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 1186.00 1198.00 1.0%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test 241.00 242.00 0.4%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 2131.00 2139.00 0.4%
test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test 6377.00 6384.00 0.1%
test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test 6377.00 6384.00 0.1%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 12650.00 12658.00 0.1%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 26169.00 26147.00 -0.1%
test-suite :: MultiSource/Benchmarks/Trimaran/enc-3des/enc-3des.test 99.00 86.00 -13.1%
Gains:
526.blender_r - more vectorized trees.
enc-3des - same.
Others:
510.parest_r - no changes.
miniFE - same
623.xalancbmk_s - some (non-profitable) parts of the trees are not
vectorized.
523.xalancbmk_r - same
lencod - same
timberwolfmc - same
miniAMR - same
Differential Revision: https://reviews.llvm.org/D125571
If the insert indes was used already or is not constant, we should stop
looking for unique buildvector sequence, it mustbe splitted to
2 different buildvectors.
In InnerLoopVectorizer::getOrCreateVectorTripCount there is an
assert that the known minimum value for the VF is a power of 2
when tail-folding is enabled. However, for scalable vectors the
value of vscale may not be a power of 2, which means we have
to worry about the possibility of overflow. I have solved this
problem by adding preheader checks that prevent us from entering
the vector body if the canonical IV would overflow, i.e.
if ((IntMax - TripCount) < (VF * UF)) ... skip vector loop ...
Differential Revision: https://reviews.llvm.org/D125235
We commonly want to create either an inbounds or non-inbounds GEP
based on a boolean value, e.g. when preserving inbounds from
existing GEPs. Directly accept such a boolean in the API, rather
than requiring a ternary between CreateGEP and CreateInBoundsGEP.
This change is not entirely NFC, because we now preserve an
inbounds flag in a constant expression edge-case in InstCombine.
A first patch to use the reasoning in ConstraintElimination to simplify
sub with overflow to a regular sub, if the operation is guaranteed to
not overflow.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D125264
This refactors RS4GC to cache results returned findBaseDefiningValue
and also gets rid of BaseDefiningValueResult by caching the
IsKnownBase flag for BDVs and bases.
Differential Revision: https://reviews.llvm.org/D125000
This patch fix bug left in D124503. We should do
sub(add(X,Z),umin(Y,Z)) --> add(X,usub.sat(Z,Y)) instead of
sub(add(X,Z),umin(Y,Z)) --> add(X,usub.sat(Y,Z)).
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D125352
As discussed in issue #37809, this transform is not safe
if the input is an undefined value.
This is similar to recent changes for urem and sdiv:
d428f09b2c99ef341ce9
There is no difference in codegen on the basic examples,
but this could lead to regressions. We may need to
improve freeze analysis or lowering if that happens.
Presumably, in real cases that are similar to the tests
where a subsequent transform removes the rem, we
will also be able to remove the freeze by seeing that
the parameter has 'noundef'.
The re-apply includes fixes to clang tests that were missed in
the original commit.
Original message:
Prior to this patch we would only set to undef the unused arguments of the
external functions. The rationale was that unused arguments of internal
functions wouldn't need to be turned into undef arguments because they
should have been simply eliminated by the time we reach that code.
This is actually not true because there are plenty of cases where we can't
remove unused arguments. For instance, if the internal function is used in
an indirect call, it may not be possible to change the function signature.
Yet, for statically known call-sites we would still like to mark the unused
arguments as undef.
This patch enables the "set undef arguments" optimization on internal
functions when we encounter cases where internal functions cannot be
optimized. I.e., whenever an internal function is marked "live".
Differential Revision: https://reviews.llvm.org/D124699
It makes sense to make a non-byval promotion attempt first and then
fall back to the byval one. The non-byval ('usual') promotion is
generally better, for example it does promotion even when a structure
has more elements than 'MaxElements' but not all of them are actually
used in the function.
Differential Revision: https://reviews.llvm.org/D124514
As discussed in issue #37809, this transform is not safe
if the input is an undefined value.
This is similar to a recent change for urem:
d428f09b2c
There is no difference in codegen on the basic examples,
but this could lead to regressions. We may need to
improve freeze analysis or lowering if that happens.
Presumably, in real cases that are similar to the tests
where a subsequent transform removes the select, we
will also be able to remove the freeze by seeing that
the parameter has 'noundef'.
As discussed in issue #37809, this transform is not safe
if the input is an undefined value.
There is no difference in codegen on the basic examples,
but this could lead to regressions. We may need to
improve freeze analysis or lowering if that happens.
If there is a freeze %x, we currently replace all other uses of %x
with freeze %x -- as long as they are dominated by the freeze
instruction. This patch extends this behavior to cases where we
did not originally dominate the use by moving the freeze
instruction directly after the definition of the frozen value.
The motivation can be seen in test @combine_and_after_freezing_uses:
Canonicalizing everything to freeze %x allows folds that are based
on value identity (i.e. same operand occurring in two places) to
trigger. This also covers the case from D125248.
Differential Revision: https://reviews.llvm.org/D125321
Further improvement of the cost model for the scalars used in
buildvectors sequences. The main functionality is outlined into
a separate function.
The cost is calculated in the following way:
1. If the Base vector is not undef vector, resizing the very first mask to
have common VF and perform action for 2 input vectors (including non-undef
Base). Other shuffle masks are combined with the resulting after the 1 stage and processed as a shuffle of 2 elements.
2. If the Base is undef vector and have only 1 shuffle mask, perform the
action only for 1 vector with the given mask, if it is not the identity
mask.
3. If > 2 masks are used, perform serie of shuffle actions for 2 vectors,
combing the masks properly between the steps.
The original implementation misses the very first analysis for the Base
vector, so the cost might too optimistic in some cases. But it improves
the cost for the insertelements which are part of the current SLP graph.
Part of D107966.
Differential Revision: https://reviews.llvm.org/D115750
With opaque pointers, both the stored value and the address can be the
same. Only consider the recipe using the first lane only *if* the
address is not stored.
Fixes#55375.
We're having a hard time booting the ARCH=i386 Linux kernel with clang
after removing -ffreestanding because instcombine was dropping inreg
from callers during libcall simplification, but not the callees defined
in different translation units. This led the callers and callees to have
wildly different calling conventions, which (predictably) blew up at
runtime.
Infer the inreg param attrs on function declarations from the module
metadata "NumRegisterParameters." This allows us to boot the ARCH=i386
Linux kernel (w/ -ffreestanding removed).
Fixes: https://github.com/llvm/llvm-project/issues/53645
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D125285
The current reordering scheme only checks the ordering of in-tree operands.
There are some cases, however, where we need to adjust the ordering based on
the ordering of a future SLP-tree who's instructions are not part of the
current tree, but are external users.
This patch is a simple implementation of this. We keep track of scalar stores
that are users of TreeEntries and if they look profitable to vectorize, then
we keep track of their ordering. During the reordering step we take this new
index order into account. This can remove some shuffles in cases like in the
lit test.
Differential Revision: https://reviews.llvm.org/D125111
shuffle (cast X), (cast Y), Mask --> cast (shuffle X, Y, Mask)
This is similar to a recent transform with fneg ( b331a7ebc1 ),
but this is intentionally the most conservative first step to
try to avoid regressions in codegen. There are several
restrictions that could be removed as follow-up enhancements.
Note that a cast with a unary shuffle is currently canonicalized
in the other direction (shuffle after cast - D103038 ). We might
want to invert that to be consistent with this patch.
Previously we took the old name and always appended a numberic suffix.
Since we're doing a 1:1 replacement, it's clearer to keep the original
name exactly.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D125281
This makes the output IR more readable since we're doing a one to
one replacement.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D125280
The EnableReuseStorageInFrame option is designed for testing only.
But it is better to use *_PASS_WITH_PARAMS macro to keep consistent with
other passes.
30a12f3f63 switched the type check
to use the GEP result type rather than the GEP operand type.
However, the GEP result types may match even if the operand types
don't, in case GEPs with scalar/vector base and vector index
are compared.
Fixes https://github.com/llvm/llvm-project/issues/55363.
This is a following cleanup for the previous work D123918. I missed
serveral places which still use legacy pass managers. This patch tries
to remove them.
When a callee function is inlined via an invoke instruction, every function call inside the callee, if not an invoke, will be converted to an invoke after cloned to the caller body. I found that during the conversion the !prof metadata was dropped. This in turned caused a cloned indirect call not properly promoted in subsequent passes.
The particular scenario I was investigating was with AutoFDO and thinLTO. In prelink, no ICP was triggered (neither by the sample loader nor PGO ICP), no indirect call was promoted. This is because 1) the particular indirect call did not have inlined samples; and 2) PGO ICP was intentionally disabled. After inlining, the prof metadata was dropped. Then in postlink, PGO ICP jumped in but didn't do anything. Thus the opportunity was missed.
I'm making a simple fix to preserve !prof metadata when converting call to invoke.
Reviewed By: davidxl
Differential Revision: https://reviews.llvm.org/D125249
If the same scalar is inserted several times into the same buildvector,
the mask index can be used already. In this case need to check, that
this scalar is already part of the vectorized buildvector.
We can try to vectorize number of stores less than MinVecRegSize
/ scalar_value_size, if it is allowed by target. Gives an extra
opportunity for the vectorization.
Fixes PR54985.
Differential Revision: https://reviews.llvm.org/D124284
Need to use actual index instead of the tree entry position, since the
insert index may be different than 0. It mean, that we vectorized part
of the buildvector starting from not initial insertelement instruction
beause of some reason.
Given a commutative reduction leading from a shuffle, the order of the
lanes on the shuffle are not important for the result. This means we can
reorder the shuffle to something simpler, which we try shuffling the
first vector lanes first. This was D123494.
The new shuffle may not be profitable though, and if it is not we can
try the folding of select shuffles from D123911. This, with some
adjustment as the output lane ordering is now unimportant, can allow the
final shuffle to simplify given the inputs to the patterns from D123911.
Where as each transformation on their own are not profitable, the
combination is.
We can only support a single shuffle when called from reductions, but we
are able to sort the ReconstructMask, potentially allowing it to
simplify to an identity or concat mask.
Differential Revision: https://reviews.llvm.org/D125086
When a PHINode has an incoming block from outside the region, it must be handled specially when assigning a global value number to each incoming value. A PHINode has multiple predecessors, and we must handle this case rather than only the single predecessor case.
Reviewer: paquette
Differential Revision: https://reviews.llvm.org/D124777
Given a load without a better order, this patch partially sorts the
elements to form clusters of adjacent elements in memory. These clusters
can potentially be loaded in fewer loads, meaning less overall shuffling
(for example loading v4i8 clusters of a v16i8 as a single f32 loads, as
opposed to multiple independent bytes loads and inserts).
Differential Revision: https://reviews.llvm.org/D122145
As shown in https://github.com/llvm/llvm-project/issues/55150 -
the existing fold may be wrong when converting to a signed value.
This is a quick fix to avoid the miscompile.
I added tests/comments for all of the signed/unsigned combinations
at either side of the boundary width, and tried to confirm with Alive2:
https://alive2.llvm.org/ce/z/3p9DSu
There are already some TODO items in the test file that suggest
possible refinements, so the regression with ui->FP->si is probably ok.
It seems unlikely that we'd see these kind of edge cases with
non-byte-width integer types in real code. The potential miscompile
went undetected for several years.
This and 747c6a0c73fixes#55150.
Differential Revision: https://reviews.llvm.org/D124692
If a constrained intrinsic call was replaced by some value, it was not
removed in some cases. The dangling instruction resulted in useless
instructions executed in runtime. It happened because constrained
intrinsics usually have side effect, it is used to model the interaction
with floating-point environment. In some cases side effect is actually
absent or can be ignored.
This change adds specific treatment of constrained intrinsics so that
their side effect can be removed if it actually absents.
Differential Revision: https://reviews.llvm.org/D118426
Splatting a bit of constant-index across a value:
sext (ashr (trunc iN X to iM), M-1) to iN --> ashr (shl X, N-M), N-1
If the dest type is different, use a cast (adjust use check).
https://alive2.llvm.org/ce/z/acAan3
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D124590
For the unary shuffle pattern, this is opposite to what we try
to do with binops, but it seems better to keep it consistent
with the motivating binary shuffle pattern. On that, it is
clearly better on the usual no-extra uses case.
There is a chance that this will pull an fneg away from some
other binop and cause a regression in codegen, but that should
be invertible in the backend. The transform is birectional:
https://alive2.llvm.org/ce/z/kKaKCUhttps://alive2.llvm.org/ce/z/3DesfwFixes#45631
Try to push an icmp into a select even if the icmp operand isn't
constant - perform a generic SimplifyICmpInst instead.
This doesn't appear to impact compile-time much, and forming
logical and/or is generally profitable, as we have very good
support for them.
D113035 enhanced the matching of bitwise selects from vector types. This
change unfortunately introduced crashes as it tries to cast scalable
vector types to integers.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D124997
After D97756, collectHomogenousInstGraphLoopInvariants may collect
conditions for both logical ANDs and logical ORs in case the root is a
select that matches both logical AND & OR.
This means the function won't return invariant values of either AND/OR
chains, but both. This can result in incorrect transformations.
See llvm/test/Transforms/SimpleLoopUnswitch/trivial-unswitch-logical-and-or.ll.
Without the patch, Alive2 rejects the modified tests with:
Source and target don't have the same return domain.
Note that this also applies to the test case added in D97756
(@test_partial_condition_unswitch_or_select). We can't unswitch on
%cond6, because the graph leading to it contains and AND and an OR.
This only fixes trivial unswitching for now, but a similar problem
likely exists with non-trivial unswitching.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D124526
Factor our InstrumentationIRBuilder and share it between ThreadSanitizer
and SanitizerCoverage. Simplify its usage at the same time (use function
of passed Instruction or BasicBlock).
This class may be used in other instrumentation passes in future.
NFCI.
Reviewed By: nickdesaulniers
Differential Revision: https://reviews.llvm.org/D125038
This patch adds a combine to attempt to reduce the costs of certain
select-shuffle patterns. The form of code it attempts to detect is:
%x = shuffle ...
%y = shuffle ...
%a = binop %x, %y
%b = binop %x, %y
shuffle %a, %b, selectmask
A classic select-mask will pick items from each lane of a or b. These
do not always have a great lowering on many architectures. This patch
attempts to pack a and b into the lower elements, creating a differently
ordered shuffle for reconstructing the orignal which may be better than
the select mask. This can be better for performance, especially if less
elements of a and b need to be computed and the input shuffles are
cheaper.
Because select-masks are just one form of shuffle, we generalize to any
mask. So long as the backend has decent costmodel for the shuffles, this
can generally improve things when they come up. For more basic cost
models the folds do not appear to be profitable, not getting past the
cost checks.
Differential Revision: https://reviews.llvm.org/D123911
Re-materialize for debug instructions would cause a different code
generated if we enabled `-g`. This is bad. So we disable to
re-materialize for debug instructions.
When building with debug info enabled, some load/store instructions do
not have a DebugLocation attached. When using the default IRBuilder, it
attempts to copy the DebugLocation from the insertion-point instruction.
When there's no DebugLocation, no attempt is made to add one.
This is problematic for inserted calls, where the enclosing function has
debug info but the call ends up without a DebugLocation in e.g. LTO
builds that verify that both the enclosing function and calls to
inlinable functions have debug info attached.
This issue was noticed in Linux kernel KCSAN builds with LTO and debug
info enabled:
| ...
| inlinable function call in a function with debug info must have a !dbg location
| call void @__tsan_read8(i8* %432)
| ...
To fix, ensure that all calls to the runtime have a DebugLocation
attached, where the possibility exists that the insertion-point might
not have any DebugLocation attached to it.
Reviewed By: nickdesaulniers
Differential Revision: https://reviews.llvm.org/D124937
Further improvement of the cost model for the scalars used in
buildvectors sequences. The main functionality is outlined into
a separate function.
The cost is calculated in the following way:
1. If the Base vector is not undef vector, resizing the very first mask to
have common VF and perform action for 2 input vectors (including non-undef
Base). Other shuffle masks are combined with the resulting after the 1 stage and processed as a shuffle of 2 elements.
2. If the Base is undef vector and have only 1 shuffle mask, perform the
action only for 1 vector with the given mask, if it is not the identity
mask.
3. If > 2 masks are used, perform serie of shuffle actions for 2 vectors,
combing the masks properly between the steps.
The original implementation misses the very first analysis for the Base
vector, so the cost might too optimistic in some cases. But it improves
the cost for the insertelements which are part of the current SLP graph.
Part of D107966.
Differential Revision: https://reviews.llvm.org/D115750
We cannot skip the freezing the condition if the unswitched branch
executes, if the condition is a chain of ANDs/ORs. For example, if if we
have an AND %c1, %c2 with %c1 == undef and %c2 == 0, there would be no
branch on undef in the original code, but a branch on undef if we
unswitch %c1.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D124603
If a constrained intrinsic call was replaced by some value, it was not
removed in some cases. The dangling instruction resulted in useless
instructions executed in runtime. It happened because constrained
intrinsics usually have side effect, it is used to model the interaction
with floating-point environment. In some cases it is correct behavior
but often the side effect is actually absent or can be ignored.
This change adds specific treatment of constrained intrinsics so that
their side effect can be removed if it actually absents.
Differential Revision: https://reviews.llvm.org/D118426
This patch switches the PGO implementation on AIX from using the runtime
registration-based section tracking to the __start_SECNAME/__stop_SECNAME
based. In order to enable the recognition of __start_SECNAME/__stop_SECNAME
symbols in the AIX linker, the -bdbg:namedsects:ss needs to be used.
Reviewed By: jsji, MaskRay, davidxl
Differential Revision: https://reviews.llvm.org/D124857
This check is in the related fold for binops,
but it was missed when the code was adapted
for intrinsics in 432c199e84. The new test
would crash when trying to create a new
intrinsic with mismatched types.
This extends 432c199e84 and 9c4770eaab with an intrinsic
cited directly in issue #46238
Eventually, we will want to use llvm::isTriviallyVectorizable()
or create some new API for this list, but for now, I am intentionally
making a minimum change to reduce risk and only affect an intrinsic
with regression tests in place.
https://alive2.llvm.org/ce/z/sD-JVv
This extends 432c199e84 with a 3 arg intrinsic to demonstrate
that the code works with the extra operand.
Eventually, we will want to use llvm::isTriviallyVectorizable()
or create some new API for this list, but for now, I am intentionally
making a minimum change to reduce risk and only affect an intrinsic
with regression tests in place.
As a follow-up to D124632, I'm turning on unlimited size caps for inlining with preinlined profile. It should be safe as a preinlined profile has "bounded" inline contexts.
No noticeable size or perf delta was seen with two of our internal large services, but I think this is still a good change to be consistent with the other case.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D124793
The two fields have the same meaning. Their values come from the reader. Therefore I'm removing one.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D124788
This is an intrinsic version of the existing fold for binops.
As a first step, I only allowed min/max, but the code is set
up to make adding more intrinsics easy (with more or less than
2 arguments).
This (and possible follow-ups) are discussed in issue #46238.
Per feedback on D123086 after submit.
Also added a test for vec_malloc et al attribute inference to show it's
doing the right thing.
The new tests exposed a defect, corrected by adding vec_free to the list of
free functions in MemoryBuiltins.cpp, which had been overlooked all the
way back in D94710, over a year ago.
Differential Revision: https://reviews.llvm.org/D124859
Adds ability to vectorize loops containing a store to a loop-invariant
address as part of a reduction that isn't converted to SSA form due to
lack of aliasing info. Runtime checks are generated to ensure the store
does not alias any other accesses in the loop.
Ordered fadd reductions are not yet supported.
Differential Revision: https://reviews.llvm.org/D110235
This adds fptosi_sat and fptoui_sat to the list of trivially
vectorizable functions, mainly so that the loop vectorizer can vectorize
the instruction. Marking them as trivially vectorizable also allows them
to be SLP vectorized, and Scalarized.
The signature of a fptosi_sat requires two type overrides
(@llvm.fptosi.sat.v2i32.v2f32), unlike other intrinsics that often only
take a single. This patch alters hasVectorInstrinsicOverloadedScalarOpd
to isVectorIntrinsicWithOverloadTypeAtArg, so that it can mark the first
operand of the intrinsic as a overloaded (but not scalar) operand.
Differential Revision: https://reviews.llvm.org/D124358
We don't need to insert a load of the dynamic shadow address unless there
are interesting memory accesses to profile.
Split out of D124703.
Differential Revision: https://reviews.llvm.org/D124797
Suppress instrumentation of PGO counter accesses, which is unnecessary
and costly. Also suppress accesses to other compiler inserted variables
starting with "__llvm". This is a slightly expanded variant of what is
done for tsan in shouldInstrumentReadWriteFromAddress.
Differential Revision: https://reviews.llvm.org/D124703
Currently SLP vectorizer walks through the instructions and selects
3 main classes of values: 1) reduction operations - instructions with same
reduction opcode (add, mul, min/max, etc.), which build the reduction,
2) reduced values - instructions with the same opcodes, but different
from the reduction opcode, 3) extra arguments - all other values,
instructions from the different basic block rather than the root node,
instructions with to many/less uses.
This scheme is not very efficient. It excludes some instructions and all
non-instruction values from the reductions (constants, proficient
gathers), to many possibly reduced values are marked as extra arguments.
Patch improves this process by introducing a bit extended analysis
stage. During this stage, we still try to select 3 classes of the
values: 1) reduction operations - same as before, 2) possibly reduced
values - all instructions from the current block/non-instructions, which
may build a vectorization tree, 3) extra arguments - instructions from
the different basic blocks. Additionally, an extra sorting of the
possibly reduced values occurs to build the scalar sequences which
highly likely will bed vectorized, e.g. loads are grouped by the
distance between them, constants are grouped together, cmp instructions
are sorted by their compare types and predicates, extractelement
instructions are sorted by the vector operand, etc. Also, these groups
are reordered by their length so the longest group is the first in the
list of the possibly reduced values.
The vectorization process tries to emit the reductions for all these
groups. These reductions, remaining non-vectorized possible reduced
values and extra arguments are then combined into the final expression
just like it was before.
Differential Revision: https://reviews.llvm.org/D114171
'Widen' recipe are only used when actual vector values are generated.
Fix tryToWidenCall to do not create VPWidenCallRecipes for scalar vector
factors.
This was exposed by D123720, because the widened recipes are considered
vector users.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D124718
Prior to this patch we would only set to undef the unused arguments of the
external functions. The rationale was that unused arguments of internal
functions wouldn't need to be turned into undef arguments because they
should have been simply eliminated by the time we reach that code.
This is actually not true because there are plenty of cases where we can't
remove unused arguments. For instance, if the internal function is used in
an indirect call, it may not be possible to change the function signature.
Yet, for statically known call-sites we would still like to mark the unused
arguments as undef.
This patch enables the "set undef arguments" optimization on internal
functions when we encounter cases where internal functions cannot be
optimized. I.e., whenever an internal function is marked "live".
Differential Revision: https://reviews.llvm.org/D124699
libcalls." (was 0f8c626). This reverts commit 14d9390.
The patch previously failed to recognize cases where user had defined a
function alias with an identical name as that of the library
function. Module::getFunction() would then return nullptr which is what the
sanitizer discovered.
In this updated version a new function isLibFuncEmittable() has as well been
introduced which is now used instead of TLI->has() anytime a library function
is to be emitted . It additionally also makes sure there is e.g. no function
alias with the same name in the module.
Reviewed By: Eli Friedman
Differential Revision: https://reviews.llvm.org/D123198
If there are pre-existing dead instructions, the order we visit replaced
values can cause us sometimes to not delete dead instructions.
The added test non-deterministically failed without the change.
Normally the index type will already be canonicalized here, but
this is not guaranteed depending on visitation order. The code
was already accounting for a potentially needed sext, but a trunc
may also be needed.
Add a ConstantExpr::getSExtOrTrunc() helper method to make this
simpler. This matches the corresponding IRBuilder method in behavior.
Fixes https://github.com/llvm/llvm-project/issues/55228.
Per the guidance in
https://llvm.org/docs/Atomics.html#atomics-and-ir-optimization,
an atomic load from a constant global can be dropped, as there can
be no stores to synchronize with. Any write to the constant global
would be UB.
IPSCCP will already drop such loads, but the main helper in Local
doesn't recognize this currently. This is motivated by D118387.
Differential Revision: https://reviews.llvm.org/D124241
X86 codegen uses function attribute `min-legal-vector-width` to select the proper ABI. The intention of the attribute is to reflect user's requirement when they passing or returning vector arguments. So Clang front-end will iterate the vector arguments and set `min-legal-vector-width` to the width of the maximum for both caller and callee.
It is assumed any middle end optimizations won't care of the attribute expect inlining and argument promotion.
- For inlining, we will propagate the attribute of inlined functions because the inlining functions become the newer caller.
- For argument promotion, we check the `min-legal-vector-width` of the caller and callee and refuse to promote when they don't match.
The problem comes from the optimizations' combination, as shown by https://godbolt.org/z/zo3hba8xW. The caller `foo` has two callees `bar` and `baz`. When doing argument promotion, both `foo` and `bar` has the same `min-legal-vector-width`. So the argument was promoted to vector. Then the inlining inlines `baz` to `foo` and updates `min-legal-vector-width`, which results in ABI mismatch between `foo` and `bar`.
This patch fixes the problem by expanding the concept of `min-legal-vector-width` to indicator of functions arguments. That says, any passes touch functions arguments have to set `min-legal-vector-width` to the value reflects the width of vector arguments. It makes sense to me because any arguments modifications are ABI related and should response for the ABI compatibility.
Differential Revision: https://reviews.llvm.org/D123284
In some cases, it is not enough to freeze the final AND/OR operation
when chaining a number of invariant conditions together.
After creating a chain of ANDs/ORs, we assume all unswitched operands to
be either true or false. But if any of the operands is poison, the rest
of the operands could have any value after branching on the frozen
condition.
To avoid that, freeze individual operands, if needed. In some cases this
may lead to unnecessary freezes, but it seems required at least for some
cases (see trivial-unswitch-freeze-individual-conditions.ll)
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D124554
Trivial unswitching can also introduce new branches on undef/poison.
Freeze the conditions if needed.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D124549
This patch removes an old hack in visitSelectInst that was written to avoid miscompilation bugs in loop unswitch.
(Added via https://reviews.llvm.org/D35811)
The legacy loop unswitch pass will be removed after D124376, and the new simple loop unswitch pass correctly uses freeze to avoid introducing UB after D124252.
Since the hack is not necessary anymore, this patch removes it.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D124426
We have seen that the prioirty inliner delivered on-par performance with the old inliner for probe-only CSSPGO profile, as long as without a size budget. I'm turning on the priority inliner for probe-only profile by default.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D124632
To be more clear and definitive, I'm renaming `ProfileIsCSFlat` back to `ProfileIsCS` which stands for full context-sensitive flat profiles. `ProfileIsCSNested` is now renamed to `ProfileIsPreInlined` and is extended to be applicable for CS flat profiles too. More specifically, `ProfileIsPreInlined` is for any kind of profiles (flat or nested) that contain 'ShouldBeInlined' contexts. The flag is encoded in the profile summary section for extbinary profiles and is computed on-the-fly for text profiles.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D122602
TI->getBitWidth can be > 64 and in those cases the shift will be UB due
to the exponent being too large.
To fix this, cap the shift at 63. I think this should work out fine,
because TableSize is itself a 64 bit type and the maximum table size
must fit in the type. Also, if we would underestimate the size here, at
most we get an extra ZExt.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D124608
Similar to c515b2f39e, If there are no loops in the function as seen
through LI, we should avoid computing the remaining expensive analyses
(such as SCEV, BPI). Reordered the analyses requests and early return
if there are no loops.
The logic of avoiding expensive analyses is applied to LoopVectorizer,
LoopLoadElimination and LoopUnrollPass, i.e. all function passes which operate
on loops.
This is an NFC with compile time improvement.
Differential Revision: https://reviews.llvm.org/D124529
This removes memset with undef char. We already do this for stores
of undef value.
This comes with the caveat that this optimization is not, strictly
speaking, legal for undef values, because we might be overwriting
a poison value. However, our entire load/store model currently still
operates on undef values, so we need to support undef here as well
for internal consistency.
Once https://github.com/llvm/llvm-project/issues/52930 is resolved,
these and related folds can be limited to poison -- I've added
FIXMEs to that effect.
Differential Revision: https://reviews.llvm.org/D124173
The name CountRoundDown is potentially misleading, as the number of
iterations can be rounded up when folding the tail.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D119681
This is an edge-case where we don't convert to bitwise and/or based
on implies poison reasoning, so explicitly try to perform the fold
in logical form. The transform itself is poison-safe, as both icmps
are based on the same value and any nowrap flags are discarded as
part of the fold (https://alive2.llvm.org/ce/z/aCwC8b for the used
example).
This fold handles a special subset of foldAndOrOfICmpsUsingRanges(),
use the more generic implementation instead.
The result can differ if a representation using a range comparison
is possible, in which case that is preferred over masking. There is
a canonicalization opportunity here.
This is the de Morgan conjugated variant of the existing fold for
ors. Implement this by switching the range code to always work
on ors and perform invert operands at the start and end. This makes
reasoning easier and makes the extension more obviosuly correct.
The legacy LoopUnswitch pass is only used in the legacy pass manager
pipeline, which is deprecated.
The NewPM replacement is SimpleLoopUnswitch and I think it is time to
remove the legacy LoopUnswitch code.
Fixes#31000.
Reviewed By: aeubanks, Meinersbur, asbirlea
Differential Revision: https://reviews.llvm.org/D124376
We can express this fold more naturally when working on the constant
range implementation. This change is not entirely NFC, because the
code now also handles cases that don't match the precise pattern
this previously looked for, e.g. we can omit an add on one of the
ranges.
I think this sort comparator was overly complex, and the windows
expensive check bot agreed, failing as it was not giving a strict weak
ordering. Change it to use the comparison of the mask values as unsigned
integers. This should sort the undef elements to the end whilst keeping
X<Y otherwise.
Replace the condition value with the known constant value on the
threaded edge. This happens implicitly with phi threading because
we replace with the incoming value, but not for non-phi threading.
SimplifyCFG implements basic jump threading, if a branch is
performed on a phi node with constant operands. However,
InstCombine canonicalizes such phis to the condition value of a
previous branch, if possible. SimplifyCFG does support this as
well, but only in the very limited case where the same condition
is used in a direct predecessor -- notably, this does not include
the common diamond pattern (i.e. two consecutive if/elses on the
same condition).
This patch extends the code to look back a limited number of
blocks to find a branch on the same value, rather than only
looking at the direct predecessor.
Fixes https://github.com/llvm/llvm-project/issues/54980.
Differential Revision: https://reviews.llvm.org/D124159
Given a shuffle feeding a commutative reduction, the lane ordering of
the shuffle will not alter the result. This is also true if there are a
number of operations between the reduction and the shuffle, providing
they only operate lane-wise. This patch searches for cases like that in
Vector Combine, allowing us to check the cost of the shuffle vs an
in-order identity shuffle and replace the order if possible. This only
handles a single shuffle at the moment to keep things simple, and is
able to ignore splats that produce results where every result is the
same.
This is a more powerful version of a combine that already happens in
instrcombine, capable of optimizing more cases by looking through more
instructions and being able to cost the shuffle.
Differential Revision: https://reviews.llvm.org/D123494
Introduced masks where they are not added and improved target dependent
cost models to avoid returning of the incorrect cost results after
adding masks.
Differential Revision: https://reviews.llvm.org/D100486
The structure ArgPart and alias OffsetAndArgPart have been moved
into the anonymous namespace. NFC.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D124617
The condition should be 'ArgParts.size() > MaxElements', so that if we
have exactly 3 elements in the 'ArgParts' vector, the promotion should
be allowed because the 'MaxElement' threshold is not exceeded yet.
The default value for 'MaxElement' has been decreased to 2 in order
to avoid an actual change in argument promoting behavior. However,
this changes byval argument transformation behavior by allowing
adding not more than 2 arguments to the function instead of 3 allowed
before.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D124178
Remove one of the last remaining uses of ::needsVectorIV, preparing for
its removal. Now that usesScalars is available and based on the
information explicit in VPlan, there is no need to use the pre-computed
needsVectorIV.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D123720
Some loop counters ('i', 'e') and variables ('type') were named not
in accordance with the code style and clang-tidy issues warnings
about the using of such variables. This patch renames the variables
and fixes some typos in the comments within the source file.
Differential Revision: https://reviews.llvm.org/D123662
When using opaque pointers, convert GEPs into offset representation
of the form P + V1 * Scale1 + V2 * Scale2 + ... + ConstantOffset.
This allows us to recognize equivalent address calculations even if
the GEPs don't use the same source element type.
This fixes an opaque pointer codegen regression seen in rustc.
Differential Revision: https://reviews.llvm.org/D124527
They can already be available, and even if not, DT/LI can be available.
We should not recompute them. Old PM is unchanged because it would
require changing dependencies, and we don't care enough about it.
Differential Revision: https://reviews.llvm.org/D124439
Reviewed By: nikic, aeubanks
isNoopAddrSpaceCast is expecting SrcAS is different from DestAS.
If the two AS are the same, consider ptrtoint/inttoptr as noop cast.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D123573
Previously all entries in global_ctors had to have the void()* type and
we'd skip evaluating bitcasted functions. With opaque pointers we may
see the function directly.
Fixes#55147.
Reviewed By: #opaque-pointers, nikic
Differential Revision: https://reviews.llvm.org/D124553
Using the legacy PM for the optimization pipeline was deprecated in 13.0.0.
Following recent changes to remove non-core features of the legacy
PM/optimization pipeline, remove ThreadSanitizerLegacyPass.
Reviewed By: #sanitizers, vitalybuka
Differential Revision: https://reviews.llvm.org/D124209
Introduced masks where they are not added and improved target dependent
cost models to avoid returning of the incorrect cost results after
adding masks.
Differential Revision: https://reviews.llvm.org/D100486
When a block containing llvm.coro.id is cloned during CHR, it inserts an invalid
PHI node with token type to the beginning of the block containing llvm.coro.begin.
To avoid such case, we exclude regions with llvm.coro.id.
Reviewed By: ChuanqiXu
Differential Revision: https://reviews.llvm.org/D124418
IRCE is a function pass that operates on loops. If there are no loops in
the function (as seen through LI), we should avoid computing the
remaining expensive analyses (such as BPI). Reordered the analyses
requests and early return if there are no loops. This is an NFC with
compile time improvement.
The same will be done in a follow-up patch for the loop vectorizer.
Reviewed-By: nikic
Differential Revision: https://reviews.llvm.org/D124478
This relands commit 8f550368b1.
The test is amended with REQUIRES: x86-registered-target, in line with
the other debuginfo-scev-salvage tests.
Differential Revision: https://reviews.llvm.org/D120169
Second of two patches to extend SCEV-based salvaging to dbg.value
intrinsics that have multiple location ops pre-LSR. This second patch
adds the core implementation.
Reviewers: @StephenTozer, @djtodoro
Differential Revision: https://reviews.llvm.org/D120169
First of two patches that extends SCEV-based salvaging to enable
salvaging of dbg.value instrinsics that have multiple locations ops
before the Loop Strength Reduction pass.
The existing single-op SCEV-based salvaging can generate variadic
dbg.value intrinsics in order to salvage a dbg.value that has a single
location op. If a dbg.value has multiple location ops before LSR, and
LSR optimises away one or more of the location operands, then currently
no salvaging will be attempted.
Salvaging can now be added, but first this patch cleans up consistency
in both the code and comments, and applies some refactoring to make
application of the new salvaging implementation more straightforward.
- Use SCEVDbgValueBuilder for both types of recovery expressions:
IV-offset based and iteration count based.
- Combine the functions that write the final DIExpression.
- Move some static functions into member functions.
Reviewers: @Orlando
Differential Revision: https://reviews.llvm.org/D120168
Currently, two GEPs will only be combined if the result element
type of one is the same as the source element type of the other.
However, this means we may miss folding opportunities where the
second GEP could be rewritten using a different element type. This
is especially relevant for opaque pointers, where constant GEPs
often use i8 element type.
Address this by converting GEP indices to offsets, adding them,
and then converting them back to indices. The first (inner) GEP
is allowed to have variable indices as well, in which case only
the constant suffix is converted into an offset.
This should address the regression reported in
https://reviews.llvm.org/D123300#3467615.
Differential Revision: https://reviews.llvm.org/D124459
I found this bug when performing a two-stage build of clang with
Function Specialization enabled and tuned aggressively. The crash
appears only on release builds.
Fixes https://github.com/llvm/llvm-project/issues/55000.
Before accessing the contents of the ArgInfo iterator inside
SCCPInstVisitor::markArgInFuncSpecialization, we should be
checking that the iterator is valid.
Differential Revision: https://reviews.llvm.org/D124114
canonicalizeClampLike canonicalizes the ule/ugt comparisons to ult/uge,
respectively. However, it does not update the variable holding the
comparison predicate type after doing this. Later code fails to handle
the non-canonical predicate type (specifically, the swap of
ThresholdLowIncl and ThresholdHighExcl when Pred0 has been canonicalized
from ugt to uge). This leads to the miscompile reported in PR53252. Fix
this by updating the comparison predicate after canonicalizing.
Fixes#53252
Differential Revision: https://reviews.llvm.org/D119690
The callback is expected to create a branch to the ContinuationBB (sometimes called FiniBB in some lambdas) argument when finishing. This creates problems:
1. The InsertPoint used for CodeGenIP does not need to be the end of a block. If it is not, a naive callback will insert a branch instruction into the middle of the block.
2. The BasicBlock the CodeGenIP is pointing to may or may not have a terminator. There is an conflict where to branch to if the block already has a terminator.
3. Some API functions work only with block having a terminator. Some workarounds have been used to insert a temporary terminator that is removed again.
4. Some callbacks are sensitive to whether the BasicBlock has a terminator or not. This creates a callback ordering problem where different callback may have different behaviour depending on whether a previous callback created a terminator or not. The problem also exists for FinalizeCallbackTy where some callbacks do create branch to another "continue" block, but unlike BodyGenCallbackTy does not receive the target as argument. This is not addressed in this patch.
With this patch, the callback receives an CodeGenIP into a BasicBlock where to insert instructions. If it has to insert control flow, it can split the block at that position as needed but otherwise no separate ContinuationBB is needed. In particular, a callback can be empty without breaking the emitted IR. If the caller needs the control flow to branch to a specific target, it can insert the branch instruction itself and pass an InsertPoint before the terminator to the callback.
Certain frontends such as Clang may expect the current IRBuilder position to be at the end of a basic block. In this case its callbacks must split the block at CodeGenIP before setting the IRBuilder position such that the instructions after CodeGenIP are moved to another basic block and before returning create a new branch instruction to the split block.
Some utility functions such as `splitBB` are supporting correct splitting of BasicBlocks, independent of whether they have a terminator or not, returning/setting the InsertPoint of an IRBuilder to the end of split predecessor block, and optionally omitting creating a branch to the split successor block to be added later.
Reviewed By: kiranchandramohan
Differential Revision: https://reviews.llvm.org/D118409
We can always replace the undef elements in a vector constant
with regular constants to get rid of the freeze:
https://alive2.llvm.org/ce/z/nfRb4F
The select diffs show that we might do better by adjusting the
logic for a frozen select condition. We may also want to refine
the vector constant replacement to consider forming a splat.
Differential Revision: https://reviews.llvm.org/D123962
Before this patch `Args` was used to pass a broadcat's arguments by SLP.
This patch changes this. `Args` is now used for passing the operands of
the shuffle.
Differential Revision: https://reviews.llvm.org/D124202
This continues the push away from hard-coded knowledge about functions
towards attributes. We'll use this to annotate free(), realloc() and
cousins and obviate the hard-coded list of free functions.
Differential Revision: https://reviews.llvm.org/D123083
This reorganizes the code as a preparation for D123865:
* Use more descriptive names for variables
* Simplify a condition by use an already calculated value
for `MaxPeelCount`
* Remove a duplicate log entry
* Report basic values for loop costs
Differential Revision: https://reviews.llvm.org/D124388
Since the size of most of SCC's is 1, the PriorityInlineOrder would not change the inline
order in SCC inliner.
Reviewed By: kazu
Differential Revision: https://reviews.llvm.org/D123608
At the moment, unfeasible default destinations are not handled properly
in removeNonFeasibleEdges. So far, only unfeasible cases are removed,
but later code expects unreachable blocks to have no predecessors.
This is causing the crash reported in PR49573.
If the default destination is unfeasible it won't be executed. Create
a new unreachable block on demand and use that as default
destination.
Note that at the moment this only is relevant for cases where
resolvedUndefsIn marks the first case as executable. Regular switch
handling has a FIXME/TODO to support determining whether the default
case is feasible or not.
Fixes#48917.
Differential Revision: https://reviews.llvm.org/D113497
Don't check whether an input of BDV can be pruned if the input
is the BDV itself. BDV is present in the states map, so in case
the input is the BDV itself, we'd return false. So explicitly check this case.
Differential Revision: https://reviews.llvm.org/D123846
We may be able to make the ValueTracking wrapper smarter
in the future (for example, analyze a simple recurrence),
so this will automatically benefit if that happens.
tryToVectorize() method implements one of searching paths for vectorizable tree roots in SLP vectorizer,
specifically for binary and comparison operations. Order of making probes for various scalar pairs
was defined by its implementation: the instruction operands, then climb over one operand if
the instruction is its sole user and then perform same actions for another operand if previous
attempts failed. Problem with this approach is that among these options we can have more than a
single vectorizable tree candidate and it is not necessarily the one that encountered first.
Trying to build vectorizable tree for each possible combination for just evaluation is expensive.
But we already have lookahead heuristics mechanism which we use for finding best pick among
operands of commutative instructions. It calculates cumulative score for candidates in two
consecutive lanes. This patch introduces use of the heuristics for choosing the best pair among
several combinations. We only try one that looks as most promising for vectorization.
Additional benefit is that we reduce total number of vectorization trees built for probes
because we skip those looking non-profitable early.
Reviewed By: Alexey Bataev (ABataev), Vasileios Porpodas (vporpo)
Differential Revision: https://reviews.llvm.org/D124309
This AliasPtr is being created always from an Int64 even for targets
where 32 bit is the proper type. e.g. “thumbv7-none-linux-android16”.
This causes the assert in the `get` func to fail as we're getting a 32
bit from the APInt.
Fix this by simply always just getting the type from the value instead.
Reviewed By: ChuanqiXu
Differential Revision: https://reviews.llvm.org/D123272
Using the legacy PM for the optimization pipeline was deprecated in 13.0.0.
Following recent changes to remove non-core features of the legacy
PM/optimization pipeline, remove AddressSanitizerLegacyPass...
...,
ModuleAddressSanitizerLegacyPass, and ASanGlobalsMetadataWrapperPass.
MemorySanitizerLegacyPass was removed in D123894.
AddressSanitizerLegacyPass was removed in D124216.
Reviewed By: #sanitizers, vitalybuka
Differential Revision: https://reviews.llvm.org/D124337
This fixes a series of mis-compiles by SimpleLoopUnswitch.
My measurements showed no performance regression with -O3 on AArch64
in SPEC2006, SPEC2017 and a set of internal benchmarks.
Fixes#50387, #50430
Depends on D124251.
Reviewed By: nikic, aqjune
Differential Revision: https://reviews.llvm.org/D124252
Logic in this pass assumes that all users of loop instructions are
either in the same loop or are LCSSA Phis. In fact, there can also
be users in unreachable blocks that currently break assertions.
Such users don't need to go to the next round of simplifications.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D124368
This patch add a function foldSelectWithFCmpToFabs, and do more combine for
fneg-of-fabs.
With 'nsz':
fold (X < +/-0.0) ? X : -X or (X <= +/-0.0) ? X : -X to -fabs(x)
fold (X > +/-0.0) ? X : -X or (X >= +/-0.0) ? X : -X to -fabs(x)
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D123830
Minor refactoring to reduce size of functional change D124309:
look-ahead scoring routines pulled out of VLOperands and formed
new LookAheadHeuristics helper class.
Reviewed By: Alexey Bataev (ABataev), Vasileios Porpodas (vporpo)
Differential Revision: https://reviews.llvm.org/D124313
MisExpect diagnostics should not prevent compilation from succeeding, and the
assertion is insufficient to prevent division by zero in release builds.
This patch addresses that by replacing the assert with an early return.
Additionally, it disables MisExpect diagnostics when using sample profiling,
since this is the only known case where this error has manifested.
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D124302
We only need to insert a Freeze instruction if any of the conditions
may be poison. Similar checks are already done in the other places
SimpleLoopUnswitch creates Freeze instruction.
Reviewed By: aeubanks, efriedma
Differential Revision: https://reviews.llvm.org/D124259
Folds are supposed to always be added in conjugated pairs for and
and or. Merge the two functions to make folds for which this is
currently not the case more obvious.
1d90e53044 switch this code to store
the predicates and operands in variables, but retained a
swapOperands() call here. Thus the commuted cases were no longer
folded. Additionally, as the change was not reported, the next
InstCombine iteration would not pick it up either.
Reapplying without changes, after a fix to a dependent patch.
-----
Rather than creating a PHI node and then using the PHI threading
code, directly handle this case in
FoldCondBranchOnValueKnownInPredecessor().
This change is supposed to be NFC-ish, but may cause changes due
to different transform order.
Reapply with SmallMapVector instead of SmallDenseMap, which should
address the non-determinism issue.
-----
This general threading transform can be performed whenever we know
a constant value for the condition in a predecessor, which would
currently just be the case of a phi node with constant arguments.
Using the legacy PM for the optimization pipeline was deprecated in 13.0.0.
Following recent changes to remove non-core features of the legacy
PM/optimization pipeline, remove AddressSanitizerLegacyPass,
ModuleAddressSanitizerLegacyPass, and ASanGlobalsMetadataWrapperPass.
MemorySanitizerLegacyPass was removed in D123894.
Reviewed By: #sanitizers, vitalybuka
Differential Revision: https://reviews.llvm.org/D124216
Using the legacy PM for the optimization pipeline was deprecated in 13.0.0.
Following recent changes to remove non-core features of the legacy
PM/optimization pipeline, remove AddressSanitizerLegacyPass,
ModuleAddressSanitizerLegacyPass, and ASanGlobalsMetadataWrapperPass.
MemorySanitizerLegacyPass was removed in D123894.
Reviewed By: #sanitizers, vitalybuka
Differential Revision: https://reviews.llvm.org/D124216
This emits an `st_size` that represents the actual useable size of an object before the redzone is added.
Reviewed By: vitalybuka, MaskRay, hctim
Differential Revision: https://reviews.llvm.org/D123010
The first attempt at this missed a check to make sure the offset
constant was in range and caused many bot failures.
That was missed in the Alive2 proof because on overshift creates
poison rather than the assert from APInt. Here's an alternate
attempt at a proof using count-trailing-zeros:
https://alive2.llvm.org/ce/z/pnXQYR
Original commit message:
This is similar to an existing pre-shift-of-constant fold:
8a9c70fc01
...but in this case, we need no-wrap on the shl and a negative
offset:
https://alive2.llvm.org/ce/z/_RVz99
This reverts commit 3df86e799e.
This reverts commit 8988254667.
`[SimplifyCFG] Handle branch on same condition in pred more directly`
caused non-determinism when compiling opt with a bootstrapped clang.
I have to revert the dependent commit as well.
Using the legacy PM for the optimization pipeline was deprecated in 13.0.0.
Following recent changes to remove non-core features of the legacy
PM/optimization pipeline, remove GCOVProfilerLegacyPass.
I have checked many LLVM users and only llvm-hs[1] uses the legacy gcov pass.
[1]: https://github.com/llvm-hs/llvm-hs/issues/392
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D123829
Using the legacy PM for the optimization pipeline was deprecated in 13.0.0.
Following recent changes to remove non-core features of the legacy
PM/optimization pipeline, remove MemorySanitizerLegacyPass.
Differential Revision: https://reviews.llvm.org/D123894
Debugify in OriginalDebugInfo mode, does (DebugInfo) collect-before-pass & check-after-pass
for each instruction, which is pretty expensive. When used to analyze DebugInfo losses
in large projects (like LLVM), this raises the build time unacceptably.
This patch introduces a limit for the number of processed functions per compile unit.
By default, the limit is set to UINT_MAX (practically unlimited), and by using the introduced
option -debugify-func-limit the limit could be set to any positive integer number.
Differential revision: https://reviews.llvm.org/D115714
Rather than creating a PHI node and then using the PHI threading
code, directly handle this case in
FoldCondBranchOnValueKnownInPredecessor().
This change is supposed to be NFC-ish, but may cause changes due
to different transform order.
This general threading transform can be performed whenever we know
a constant value for the condition in a predecessor, which would
currently just be the case of a phi node with constant arguments.
The legacy passes are deprecated now and would be removed in near
future. This patch tries to remove legacy passes in coroutines.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D123918
This patch extends the scope of VPlan to also include the exit (aka
middle) block.
For now, the exit block remains empty, but handling of exit values will
subsequently be moved to VPlan, by adding recipes to model exit values
in the exit block.
As a first step, this will allow fixing #51366.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D123457
We're making a recursive call here and everything in the function
assumes we're looking at scalars. This would be violated if we
looked through a bitcast from vectors.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D124015