This patch extends the condition collection logic to allow adding
conditions from pre-headers to loop headers, by allowing cases where the
target block dominates some of its predecessors.
A == B map to A >= B && A <= B
(https://alive2.llvm.org/ce/z/_dwxKn).
This extends the constraint construction to return a list of
constraints, which can be used to properly de-compose nested AND & OR.
If the incoming block to a phi node is an EH pad, then we will
materialize into an EH pad, which is not supposed to happen. To fix
this, I added a check to see if incoming block of a phi node is an EH
pad before using it as the insertion point.
Differential Revision: https://reviews.llvm.org/D95019
Instead of using ConstraintSystem::negate when adding new constraints,
flip the condition in IR.
The main advantage is that EQ predicates can be represented by 2
constraints, which makes negating based on the constraint tricky. The IR
condition can easily negated.
If we determine that the invariant path through the loop has no effects,
we can directly branch to the exit block, instead to unswitching first.
Besides avoiding some extra work (unswitching first, then deleting the
loop again) this allows to be more aggressive than regular unswitching
with respect to cost-modeling. This approach should always be be
desirable.
This is similar in spirit to D93734, just that it uses the previously
added checks for loop-unswitching.
I tried to add the required no-op checks from scratch, as we only check
a subset of the loop. There is potential to unify the checks with
LoopDeletion, at the cost of adding a predicate whether a block should
be considered.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D95468
This patch fixes updating MemorySSA if the header contains memory
defs that do not clobber a duplicated instruction. We need to find the
first defining access outside the loop body and use that as defining
access of the duplicated instruction.
This fixes a crash caused by bee486851c.
GCC warning:
```
/llvm-project/llvm/lib/Transforms/Scalar/ScalarizeMaskedMemIntrin.cpp: In function ‘void scalarizeMaskedStore(llvm::CallInst*, llvm::DomTreeUpdater*, bool&)’:
/llvm-project/llvm/lib/Transforms/Scalar/ScalarizeMaskedMemIntrin.cpp:295:15: warning: variable ‘IfBlock’ set but not used [-Wunused-but-set-variable]
295 | BasicBlock *IfBlock = CI->getParent();
| ^~~~~~~
/llvm-project/llvm/lib/Transforms/Scalar/ScalarizeMaskedMemIntrin.cpp: In function ‘void scalarizeMaskedScatter(llvm::CallInst*, llvm::DomTreeUpdater*, bool&)’:
/llvm-project/llvm/lib/Transforms/Scalar/ScalarizeMaskedMemIntrin.cpp:555:15: warning: variable ‘IfBlock’ set but not used [-Wunused-but-set-variable]
555 | BasicBlock *IfBlock = CI->getParent();
| ^~~~~~~
```
This de-pessimizes the arguably more usual case of no masked mem intrinsics,
and gets rid of one more Dominator Tree recalculation.
As per llvm/test/CodeGen/X86/opt-pipeline.ll,
there's one more Dominator Tree recalculation left, we could get rid of.
This patch adds additional checks to avoid partial unswitching
in cases where it won't be profitable, e.g. because the path directly
exits the loop anyways.
Loop peeling removes conditions from loop bodies that become invariant
after a small number of iterations. When triggered, this leads to fewer
compares and possibly PHIs in loop bodies, enabling further
optimizations. The current cost-model of loop peeling should be quite
conservative/safe, i.e. only peel if a condition in the loop becomes
known after peeling.
For example, see PR47671, where loop peeling enables vectorization by
removing a PHI the vectorizer does not understand. Granted, the
loop-vectorizer could also be taught about constant PHIs, but loop
peeling is likely to enable other optimizations as well.
This has an impact on quite a few benchmarks from
MultiSource/SPEC2000/SPEC2006 on X86 with -O3 -flto, for example
Same hash: 186 (filtered out)
Remaining: 51
Metric: loop-vectorize.LoopsVectorized
Program base patch diff
test-suite...ve-susan/automotive-susan.test 8.00 9.00 12.5%
test-suite...nal/skidmarks10/skidmarks.test 35.00 31.00 -11.4%
test-suite...lications/sqlite3/sqlite3.test 41.00 43.00 4.9%
test-suite...s/ASC_Sequoia/AMGmk/AMGmk.test 25.00 26.00 4.0%
test-suite...006/450.soplex/450.soplex.test 88.00 89.00 1.1%
test-suite...TimberWolfMC/timberwolfmc.test 120.00 119.00 -0.8%
test-suite.../CINT2006/403.gcc/403.gcc.test 215.00 216.00 0.5%
test-suite...006/447.dealII/447.dealII.test 957.00 958.00 0.1%
test-suite...ternal/HMMER/hmmcalibrate.test 75.00 75.00 0.0%
Same hash: 186 (filtered out)
Remaining: 51
Metric: loop-vectorize.LoopsAnalyzed
Program base patch diff
test-suite...ks/Prolangs-C/agrep/agrep.test 440.00 434.00 -1.4%
test-suite...nal/skidmarks10/skidmarks.test 312.00 308.00 -1.3%
test-suite...marks/7zip/7zip-benchmark.test 6399.00 6323.00 -1.2%
test-suite...lications/minisat/minisat.test 134.00 135.00 0.7%
test-suite...rks/FreeBench/pifft/pifft.test 295.00 297.00 0.7%
test-suite...TimberWolfMC/timberwolfmc.test 1879.00 1869.00 -0.5%
test-suite...pplications/treecc/treecc.test 689.00 691.00 0.3%
test-suite...T2000/300.twolf/300.twolf.test 1593.00 1597.00 0.3%
test-suite.../Benchmarks/Bullet/bullet.test 1394.00 1392.00 -0.1%
test-suite...ications/JM/ldecod/ldecod.test 1431.00 1429.00 -0.1%
test-suite...6/464.h264ref/464.h264ref.test 2229.00 2230.00 0.0%
test-suite...lications/sqlite3/sqlite3.test 2590.00 2589.00 -0.0%
test-suite...ications/JM/lencod/lencod.test 2732.00 2733.00 0.0%
test-suite...006/453.povray/453.povray.test 3395.00 3394.00 -0.0%
Note the -11% regression in number of loops vectorized for skidmarks. I
suspect this corresponds to the fact that those loops are gone now (see
the reduction in number of loops analyzed by LV).
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D88471
Fixes an infinite loop encountered in GVN.
GVN will delay PRE if it encounters critical edges, attempt to split
them later via calls to SplitCriticalEdge(), then restart.
The caller of GVN::splitCriticalEdges() assumed a return value of true
meant that critical edges were split, that the IR had changed, and that
PRE should be re-attempted, upon which we loop infinitely.
This was exposed after D88438, by compiling the Linux kernel for s390,
but the test case is reproducible on x86.
Fixes: https://github.com/ClangBuiltLinux/linux/issues/1261
Reviewed By: void
Differential Revision: https://reviews.llvm.org/D94996
If i change it to AssertingVH instead, a number of existing tests fail,
which means we don't consistently remove from the set when deleting blocks,
which means newly-created blocks may happen to appear in that set
if they happen to occupy the same memory chunk as did some block
that was in the set originally.
There are many places where we delete blocks,
and while we could probably consistently delete from LoopHeaders
when deleting a block in transforms located in SimplifyCFG.cpp itself,
transforms located elsewhere (Local.cpp/BasicBlockUtils.cpp) also may
delete blocks, and it doesn't seem good to teach them to deal with it.
Since we at most only ever delete from LoopHeaders,
let's just delegate to WeakVH to do that automatically.
But to be honest, personally, i'm not sure that the idea
behind LoopHeaders is sound.
This builds on the restricted after initial revert form of D93906, and adds back support for breaking backedges of inner loops. It turns out the original invalidation logic wasn't quite right, specifically around the handling of LCSSA.
When breaking the backedge of an inner loop, we can cause blocks which were in the outer loop only because they were also included in a sub-loop to be removed from both loops. This results in the exit block set for our original parent loop changing, and thus a need for new LCSSA phi nodes.
This case happens when the inner loop has an exit block which is also an exit block of the parent, and there's a block in the child which reaches an exit to said block without also reaching an exit to the parent loop.
(I'm describing this in terms of the immediate parent, but the problem is general for any transitive parent in the nest.)
The approach implemented here involves a potentially expensive LCSSA rebuild. Perf testing during review didn't show anything concerning, but we may end up needing to revert this if anyone encounters a practical compile time issue.
Differential Revision: https://reviews.llvm.org/D94378
Similar to binary operators like fadd/fmul/fsub, propagate shape info
through unary operators (fneg is the only one?).
Differential Revision: https://reviews.llvm.org/D95252
The existing code did not deal with atomic loads correctly. Such loads
are represented as MemoryDefs. Bail out on any MemoryAccess that is not
a MemoryUse.
In LoopInterchange, `findInnerReductionPhi()` looks for reduction
variables, which cannot be constants. Update it to return early in that
case.
This also addresses a blocker for removing use-lists from ConstantData,
whose users could be spread across arbitrary modules in the same
LLVMContext.
Differential Revision: https://reviews.llvm.org/D94712
This patch applies the idea from D93734 to LoopUnswitch.
It adds support for unswitching on conditions that are only
invariant along certain paths through a loop.
In particular, it targets conditions in the loop header that
depend on values loaded from memory. If either path from
the true or false successor through the loop does not modify
memory, perform partial loop unswitching.
That is, duplicate the instructions feeding the condition in the pre-header.
Then unswitch on the duplicated condition. The condition is now known
in the unswitched version for the 'invariant' path through the original loop.
On caveat of this approach is that one of the loops created can be partially
unswitched again. To avoid this behavior, `llvm.loop.unswitch.partial.disable`
metadata is added to the unswitched loops, to avoid subsequent partial
unswitching.
If that's the approach to go, I can move the code handling the metadata kind
into separate functions.
This increases the cases we unswitch quite a bit in SPEC2006/SPEC2000 &
MultiSource. It also allows us to eliminate a dead loop in SPEC2017's omnetpp
```
Tests: 236
Same hash: 170 (filtered out)
Remaining: 66
Metric: loop-unswitch.NumBranches
Program base patch diff
test-suite...000/255.vortex/255.vortex.test 2.00 23.00 1050.0%
test-suite...T2006/401.bzip2/401.bzip2.test 7.00 55.00 685.7%
test-suite :: External/Nurbs/nurbs.test 5.00 26.00 420.0%
test-suite...s-C/unix-smail/unix-smail.test 1.00 3.00 200.0%
test-suite.../Prolangs-C++/ocean/ocean.test 1.00 3.00 200.0%
test-suite...tions/lambda-0.1.3/lambda.test 1.00 3.00 200.0%
test-suite...yApps-C++/PENNANT/PENNANT.test 2.00 5.00 150.0%
test-suite...marks/Ptrdist/yacr2/yacr2.test 1.00 2.00 100.0%
test-suite...lications/viterbi/viterbi.test 1.00 2.00 100.0%
test-suite...plications/d/make_dparser.test 12.00 24.00 100.0%
test-suite...CFP2006/433.milc/433.milc.test 14.00 27.00 92.9%
test-suite.../Applications/lemon/lemon.test 7.00 12.00 71.4%
test-suite...ce/Applications/Burg/burg.test 6.00 10.00 66.7%
test-suite...T2006/473.astar/473.astar.test 16.00 26.00 62.5%
test-suite...marks/7zip/7zip-benchmark.test 78.00 121.00 55.1%
```
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D93764
The pass has dependency on 'TargetTransformInfoWrapperPass', but the
corresponding call to INITIALIZE_PASS_DEPENDENCY was missing.
Differential Revision: https://reviews.llvm.org/D94916
Just like llvm.assume, there are a lot of cases where we can just ignore llvm.experimental.noalias.scope.decl.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D93042
D84108 exposed a bad interaction between inlining and loop-rotation
during regular LTO, which is causing notable regressions in at least
CINT2006/473.astar.
The problem boils down to: we now rotate a loop just before the vectorizer
which requires duplicating a function call in the preheader when compiling
the individual files ('prepare for LTO'). But this then prevents further
inlining of the function during LTO.
This patch tries to resolve this issue by making LoopRotate more
conservative with respect to rotating loops that have inline-able calls
during the 'prepare for LTO' stage.
I think this change intuitively improves the current situation in
general. Loop-rotate tries hard to avoid creating headers that are 'too
big'. At the moment, it assumes all inlining already happened and the
cost of duplicating a call is equal to just doing the call. But with LTO,
inlining also happens during full LTO and it is possible that a previously
duplicated call is actually a huge function which gets inlined
during LTO.
From the perspective of LV, not much should change overall. Most loops
calling user-provided functions won't get vectorized to start with
(unless we can infer that the function does not touch memory, has no
other side effects). If we do not inline the 'inline-able' call during
the LTO stage, we merely delayed loop-rotation & vectorization. If we
inline during LTO, chances should be very high that the inlined code is
itself vectorizable or the user call was not vectorizable to start with.
There could of course be scenarios where we inline a sufficiently large
function with code not profitable to vectorize, which would have be
vectorized earlier (by scalarzing the call). But even in that case,
there probably is no big performance impact, because it should be mostly
down to the cost-model to reject vectorization in that case. And then
the version with scalarized calls should also not be beneficial. In a way,
LV should have strictly more information after inlining and make more
accurate decisions (barring cost-model issues).
There is of course plenty of room for things to go wrong unexpectedly,
so we need to keep a close look at actual performance and address any
follow-up issues.
I took a look at the impact on statistics for
MultiSource/SPEC2000/SPEC2006. There are a few benchmarks with fewer
loops rotated, but no change to the number of loops vectorized.
Reviewed By: sanwou01
Differential Revision: https://reviews.llvm.org/D94232
This is not nice, but it's the best transient solution possible,
and is better than just duplicating the whole function.
The problem is, this function is widely used,
and it is not at all obvious that all the users
could be painlessly switched to operate on DomTreeUpdater,
and somehow i don't feel like porting all those users first.
This function is one of last three that not operate on DomTreeUpdater.
Summary:
Set the default for the option enabling memory ssa use in the loop sink
pass to true for the new pass manager.
Author: Jamie Schmeiser <schmeise@ca.ibm.com>
Reviewed By: asbirlea (Alina Sbirlea)
Differential Revision: https://reviews.llvm.org/D92486
In places where we calculate costs using TTI.getXXXCost() interfaces
I have changed the code to use InstructionCost instead of unsigned.
The change is non functional since InstructionCost behaves in the
same way as an integer for valid costs. Currently the getXXXCost()
functions used in this file do not return invalid costs.
See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html
Differential revision: https://reviews.llvm.org/D94484
Added a utility function in Value class to print block name and use
block labels for unnamed blocks.
Changed LICM to call this function in its debug output.
Patch by Xiaoqing Wu <xiaoqing_wu@apple.com>
Differential Revision: https://reviews.llvm.org/D93577
This boils down to how we deal with early-increment iterator
over function's basic blocks: not only we need to early-increment,
after that we also need to skip all the blocks
that are scheduled for removal, as per DomTreeUpdater.
Thus supporting lazy DomTreeUpdater mode,
where the domtree updates (and thus block removals)
aren't applied immediately, but are delayed
until last possible moment.
This patch fixes a bug that could result in miscompiles (at least
in an OOT target). The problem could be seen by adding checks that
the DominatorTree used in BasicAliasAnalysis and ValueTracking was
valid (e.g. by adding DT->verify() call before every DT dereference
and then running all tests in test/CodeGen).
Problem was that the LegacyPassManager calculated "last user"
incorrectly for passes such as the DominatorTree when not telling
the pass manager that there was a transitive dependency between
the different analyses. And then it could happen that an incorrect
dominator tree was used when doing alias analysis (which was a pretty
serious bug as the alias analysis result could be invalid).
Fixes: https://bugs.llvm.org/show_bug.cgi?id=48709
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D94138
This is a resubmit of dd6bb367 (which was reverted due to stage2 build failures in 7c63aac), with the additional restriction added to the transform to only consider outer most loops.
As shown in the added test case, ensuring LCSSA is up to date when deleting an inner loop is tricky as we may actually need to remove blocks from any outer loops, thus changing the exit block set. For the moment, just avoid transforming this case. I plan to return to this case in a follow up patch and see if we can do better.
Original commit message follows...
The basic idea is that if SCEV can prove the backedge isn't taken, we can go ahead and get rid of the backedge (and thus the loop) while leaving the rest of the control in place. This nicely handles cases with dispatch between multiple exits and internal side effects.
Differential Revision: https://reviews.llvm.org/D93906
Add support for mixed pre/post CFG views.
Update usages of the MemorySSAUpdater to use the new DT API by
requesting the DT updates to be done by the MSSAUpdater.
Differential Revision: https://reviews.llvm.org/D93371
Currently, LoopDeletion does skip loops that have sub-loops, but this
means we currently fail to remove some no-op loops.
One example are inner loops with live-out values. Those cannot be
removed by itself. But the containing loop may itself be a no-op and the
whole loop-nest can be deleted.
The legality checks do not seem to rely on analyzing inner-loops only
for correctness.
With LoopDeletion being a LoopPass, the change means that we now
unfortunately need to do some extra work in parent loops, by checking
some conditions we already checked. But there appears to be no
noticeable compile time impact:
http://llvm-compile-time-tracker.com/compare.php?from=02d11f3cda2ab5b8bf4fc02639fd1f4b8c45963e&to=843201e9cf3b6871e18c52aede5897a22994c36c&stat=instructions
This changes patch leads to ~10 more loops being deleted on
MultiSource, SPEC2000, SPEC2006 with -O3 & LTO
This patch is also required (together with a few others) to eliminate a
no-op loop in omnetpp as discussed on llvm-dev 'LoopDeletion / removal of
empty loops.' (http://lists.llvm.org/pipermail/llvm-dev/2020-December/147462.html)
This change becomes relevant after removing potentially infinite loops
is made possible in 'must-progress' loops (D86844).
Note that I added a function call with side-effects to an outer loop in
`llvm/test/Transforms/LoopDeletion/update-scev.ll` to preserve the
original spirit of the test.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D93716
From C11 and C++11 onwards, a forward-progress requirement has been
introduced for both languages. In the case of C, loops with non-constant
conditionals that do not have any observable side-effects (as defined by
6.8.5p6) can be assumed by the implementation to terminate, and in the
case of C++, this assumption extends to all functions. The clang
frontend will emit the `mustprogress` function attribute for C++
functions (D86233, D85393, D86841) and emit the loop metadata
`llvm.loop.mustprogress` for every loop in C11 or later that has a
non-constant conditional.
This patch modifies LoopDeletion so that only loops with
the `llvm.loop.mustprogress` metadata or loops contained in functions
that are required to make progress (`mustprogress` or `willreturn`) are
checked for observable side-effects. If these loops do not have an
observable side-effect, then we delete them.
Loops without observable side-effects that do not satisfy the above
conditions will not be deleted.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D86844
Loop strength reduction tries to recover debug variable values by looking
for simple offsets from PHI values. In really extreme conditions there may
be an offset used that won't fit in an int64_t, hitting an APInt assertion.
This patch adds a regression test and adjusts the equivalent value
collecting code to filter out any values where the offset can't be
represented by an int64_t. This means that for very large integers with
very large offsets, the variable location will become undef, which is the
same behaviour as before 2a6782bb9f / D87494.
Differential Revision: https://reviews.llvm.org/D94016
Notably, this doesn't switch *every* case, remaining cases
don't actually pass sanity checks in non-permissve mode,
and therefore require further analysis.
Note that SimplifyCFG still defaults to not preserving DomTree by default,
so this is effectively a NFC change.
This reverts commit dd6bb367d1.
Multi-stage builders are showing an assertion failure w/LCSSA not being preserved on entry to IndVars. Reason isn't clear, reverting while investigating.
The basic idea is that if SCEV can prove the backedge isn't taken, we can go ahead and get rid of the backedge (and thus the loop) while leaving the rest of the control in place. This nicely handles cases with dispatch between multiple exits and internal side effects.
Differential Revision: https://reviews.llvm.org/D93906
This patch makes Scalarizer to use poison as insertelement's placeholder.
It contains two changes in Scalarizer.cpp, and the both changes does not change the semantics of the optimized program.
It is because the placeholder value (poison) is already completely hidden by following insertelement instructions.
The first change at visitBitCastInst() creates poison vector of MidTy and consecutively inserts FanIn times,
which is # of elems of MidTy.
The second change at ScalarizerVisitor::finish() creates poison with Op->getType(), and it is filled with
Count insertelements.
The test diffs show that the poison value is never exposed after insertelements.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D93989
Test clang/test/Misc/loop-opt-setup.c fails when executed in Release.
This reverts commit 6f1503d598.
Reviewed By: SureYeaah
Differential Revision: https://reviews.llvm.org/D93956
From C11 and C++11 onwards, a forward-progress requirement has been
introduced for both languages. In the case of C, loops with non-constant
conditionals that do not have any observable side-effects (as defined by
6.8.5p6) can be assumed by the implementation to terminate, and in the
case of C++, this assumption extends to all functions. The clang
frontend will emit the `mustprogress` function attribute for C++
functions (D86233, D85393, D86841) and emit the loop metadata
`llvm.loop.mustprogress` for every loop in C11 or later that has a
non-constant conditional.
This patch modifies LoopDeletion so that only loops with
the `llvm.loop.mustprogress` metadata or loops contained in functions
that are required to make progress (`mustprogress` or `willreturn`) are
checked for observable side-effects. If these loops do not have an
observable side-effect, then we delete them.
Loops without observable side-effects that do not satisfy the above
conditions will not be deleted.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D86844
As mentioned in D93793, there are quite a few places where unary `IRBuilder::CreateShuffleVector(X, Mask)` can be used
instead of `IRBuilder::CreateShuffleVector(X, Undef, Mask)`.
Let's update them.
Actually, it would have been more natural if the patches were made in this order:
(1) let them use unary CreateShuffleVector first
(2) update IRBuilder::CreateShuffleVector to use poison as a placeholder value (D93793)
The order is swapped, but in terms of correctness it is still fine.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D93923
This patch adds support for select form of and/or.
Currently there is an ongoing effort for moving towards using `select a, b, false` instead of `and i1 a, b` and
`select a, true, b` instead of `or i1 a, b` as well.
D93065 has links to relevant changes.
Alive2 proof: (undef input was disabled due to timeout :( )
- and: https://alive2.llvm.org/ce/z/AgvFbQ
- or: https://alive2.llvm.org/ce/z/KjLJyb
Differential Revision: https://reviews.llvm.org/D93935
The function FoldSingleEntryPHINodes() is changed to return if
it has changed IR or not. This return value is used by RS4GC to
set the MadeChange flag respectively.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D93810
This patch updates GVN to correctly return the modified status, if PRE
is performed on indices. It fixes a crash when building the test-suite
with EXPENSIVE_CHECKS and LTO.
EarlyCSE's handleBranchCondition says:
```
// If the condition is AND operation, we can propagate its operands into the
// true branch. If it is OR operation, we can propagate them into the false
// branch.
```
This holds for the corresponding select patterns as well.
This is a part of an ongoing work for disabling buggy select->and/or transformations.
See llvm.org/pr48353 and D93065 for more context
Proof:
and: https://alive2.llvm.org/ce/z/MQWodU
or: https://alive2.llvm.org/ce/z/9GLbB_
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D93842
This patch makes GVN recognize `select c1, c2, false` as well as `select c1, true, c2`
branch condition and propagate equality from these.
See llvm.org/pr48353, D93065
Differential Revision: https://reviews.llvm.org/D93841
While `%x.curr` is always safe to compute, because `LoopBackedgeTakenCount`
will always be smaller than `bitwidth(X)`, i.e. we never get poison,
rewriting `%x.next` is more complicated, however, because `X << LoopTripCount`
will be poison iff `LoopTripCount == bitwidth(X)` (which will happen
iff `BitPos` is `bitwidth(x) - 1` and `X` is `1`).
So unless we know that isn't the case (as alive2 notes, we know it's safe
to do iff shift had no-wrap flags, or bitpos does not indicate signbit,
or we know that %x is never `1`), we'll need to emit an alternative,
safe IR, by either just shifting the `%x.curr`, or conditionally selecting
between the computed `%x.next` and `0`..
Former IR looks better so let's do that.
While there, ensure that we don't drop no-wrap flags from said shift.
The current state of the transform is still not enough to support
my motivational pattern, because it has one more "induction variable".
I have delayed posting this patch, because originally even just rewriting
the loop as countable wasn't enough to nicely transform my motivational pattern,
because i expected that extra IV to be rewritten afterwards,
but it wasn't happening until i fixed that in D91800.
So, this patch allows the 'left-shift until bittest' loop idiom
as long as the inserted ops are cheap,
and lifts any and all extra use checks on the instructions.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D92754
If the bitmask is for sign bit, instcombine would have canonicalized
the pattern into a proper sign bit check. Supporting that is still
simple, but requires a bit of a roundtrip - we first have to use
`decomposeBitTestICmp()`, and the rest again just works.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D91726
The handing of the case where the mask is a constant is trivial,
if said constant is a power of two, the bit in question is log2(mask),
rest just works.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D91725
The motivation here is the following inner loop in fp16/fp24 -> fp32 expander,
that runs as part of the floating-point DNG decompression in RawSpeed library:
cd380bb9a2/src/librawspeed/decompressors/DeflateDecompressor.cpp (L112-L115)
```
while (!(fp32_fraction & (1 << 23))) {
fp32_exponent -= 1;
fp32_fraction <<= 1;
}
```
(https://godbolt.org/z/r13YMh)
As one might notice, that loop is currently uncountable, and that whole code stays scalar.
Yet, it is rather trivial to make that loop countable:
https://godbolt.org/z/do8WMz
and we can prove that via alive2:
https://alive2.llvm.org/ce/z/7vQnji (ha nice, isn't it?)
... and that allow for the whole fp16->fp32 code to vectorize:
https://godbolt.org/z/7hYr13
Now, while i'd love to get there, i feel like i should take it in steps.
For now, this introduces support for the most basic case,
where the bit position is known as a variable,
and the loop *will* go away (has no live-outs other than the recurrence,
no extra instructions in the loop).
I have added sufficient (i believe) test coverage,
and alive2 is happy with those transforms.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D91038
Current approach doesn't work well in cases when multiple paths are predicted to be "cold". By "cold" paths I mean those containing "unreachable" instruction, call marked with 'cold' attribute and 'unwind' handler of 'invoke' instruction. The issue is that heuristics are applied one by one until the first match and essentially ignores relative hotness/coldness
of other paths.
New approach unifies processing of "cold" paths by assigning predefined absolute weight to each block estimated to be "cold". Then we propagate these weights up/down IR similarly to existing approach. And finally set up edge probabilities based on estimated block weights.
One important difference is how we propagate weight up. Existing approach propagates the same weight to all blocks that are post-dominated by a block with some "known" weight. This is useless at least because it always gives 50\50 distribution which is assumed by default anyway. Worse, it causes the algorithm to skip further heuristics and can miss setting more accurate probability. New algorithm propagates the weight up only to the blocks that dominates and post-dominated by a block with some "known" weight. In other words, those blocks that are either always executed or not executed together.
In addition new approach processes loops in an uniform way as well. Essentially loop exit edges are estimated as "cold" paths relative to back edges and should be considered uniformly with other coldness/hotness markers.
Reviewed By: yrouban
Differential Revision: https://reviews.llvm.org/D79485
This is a follow-up patch of D87045.
The patch implements "loop-nest mode" for `LPMUpdater` and `FunctionToLoopPassAdaptor` in which only top-level loops are operated.
`createFunctionToLoopPassAdaptor` decides whether the returned adaptor is in loop-nest mode or not based on the given pass. If the pass is a loop-nest pass or the pass is a `LoopPassManager` which contains only loop-nest passes, the loop-nest version of adaptor is returned; otherwise, the normal (loop) version of adaptor is returned.
Reviewed By: Whitney
Differential Revision: https://reviews.llvm.org/D87531
MSSA DSE starts at a killing store, finds an earlier store and
then checks that the earlier store is not read along any paths
(without being killed first). However, it uses the memory location
of the killing store for that, not the earlier store that we're
attempting to eliminate.
This has a number of problems:
* Mismatches between what BasicAA considers aliasing and what DSE
considers an overwrite (even though both are correct in isolation)
can result in miscompiles. This is PR48279, which D92045 tries to
fix in a different way. The problem is that we're using a location
from a store that is potentially not executed and thus may be UB,
in which case analysis results can be arbitrary.
* Metadata on the killing store may be used to determine aliasing,
but there is no guarantee that the metadata is valid, as the specific
killing store may not be executed. Using the metadata on the earlier
store is valid (it is the store we're removing, so on any execution
where its removal may be observed, it must be executed).
* The location is imprecise. For full overwrites the killing store
will always have a location that is larger or equal than the earlier
access location, so it's beneficial to use the earlier access
location. This is not the case for partial overwrites, in which
case either location might be smaller. There is some room for
improvement here.
Using the earlier access location means that we can no longer cache
which accesses are read for a given killing store, as we may be
querying different locations. However, it turns out that simply
dropping the cache has no notable impact on compile-time.
Differential Revision: https://reviews.llvm.org/D93523
The SROA pass tries to be lazy for removing dead instructions that are collected during iterative run of the pass in the DeadInsts list. However it does not remove instructions from the dead list while running eraseFromParent() on those instructions.
This causes (rare) null pointer dereferences. For example, in the speculatePHINodeLoads() instruction, in the following code snippet:
```
while (!PN.use_empty()) {
LoadInst *LI = cast<LoadInst>(PN.user_back());
LI->replaceAllUsesWith(NewPN);
LI->eraseFromParent();
}
```
If the Load instruction LI belongs to the DeadInsts list, it should be removed when eraseFromParent() is called. However, the bug does not show up in most cases, because immediately in the same function, a new LoadInst is created in the following line:
```
LoadInst *Load = PredBuilder.CreateAlignedLoad(
LoadTy, InVal, Alignment,
(PN.getName() + ".sroa.speculate.load." + Pred->getName()));
```
This new LoadInst object takes the same memory address of the just deleted LI using eraseFromParent(), therefore the bug does not materialize. In very rare cases, the addresses differ and therefore, a dangling pointer is created, causing a crash.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D92431
Part of the <=> changes in C++20 make certain patterns of writing equality
operators ambiguous with themselves (sorry!).
This patch goes through and adjusts all the comparison operators such that
they should work in both C++17 and C++20 modes. It also makes two other small
C++20-specific changes (adding a constructor to a type that cases to be an
aggregate, and adding casts from u8 literals which no longer have type
const char*).
There were four categories of errors that this review fixes.
Here are canonical examples of them, ordered from most to least common:
// 1) Missing const
namespace missing_const {
struct A {
#ifndef FIXED
bool operator==(A const&);
#else
bool operator==(A const&) const;
#endif
};
bool a = A{} == A{}; // error
}
// 2) Type mismatch on CRTP
namespace crtp_mismatch {
template <typename Derived>
struct Base {
#ifndef FIXED
bool operator==(Derived const&) const;
#else
// in one case changed to taking Base const&
friend bool operator==(Derived const&, Derived const&);
#endif
};
struct D : Base<D> { };
bool b = D{} == D{}; // error
}
// 3) iterator/const_iterator with only mixed comparison
namespace iter_const_iter {
template <bool Const>
struct iterator {
using const_iterator = iterator<true>;
iterator();
template <bool B, std::enable_if_t<(Const && !B), int> = 0>
iterator(iterator<B> const&);
#ifndef FIXED
bool operator==(const_iterator const&) const;
#else
friend bool operator==(iterator const&, iterator const&);
#endif
};
bool c = iterator<false>{} == iterator<false>{} // error
|| iterator<false>{} == iterator<true>{}
|| iterator<true>{} == iterator<false>{}
|| iterator<true>{} == iterator<true>{};
}
// 4) Same-type comparison but only have mixed-type operator
namespace ambiguous_choice {
enum Color { Red };
struct C {
C();
C(Color);
operator Color() const;
bool operator==(Color) const;
friend bool operator==(C, C);
};
bool c = C{} == C{}; // error
bool d = C{} == Red;
}
Differential revision: https://reviews.llvm.org/D78938
Details: Jump Threading does not make sense for the targets with divergent CF
since they do not use branch prediction for speculative execution.
Also in the high level IR there is no enough information to conclude that the branch is divergent or uniform.
This may cause errors in further CF lowering.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D93302
A first real transformation that didn't already knew how to do that,
but it's pretty tame - either change successor of all the predecessors
of a block and carefully delay deletion of the block until afterwards
the DomTree updates are appled, or add a successor to the block.
There wasn't a great test coverage for this, so i added extra, to be sure.
... so just ensure that we pass DomTreeUpdater it into it.
Fixes DomTree preservation for a large number of tests,
all of which are marked as such so that they do not regress.
... so just ensure that we pass DomTreeUpdater it into it.
Apparently, there were no dedicated tests just for that functionality,
so i'm adding one here.
Per http://llvm.org/OpenProjects.html#llvm_loopnest, the goal of this
patch (and other following patches) is to create facilities that allow
implementing loop nest passes that run on top-level loop nests for the
New Pass Manager.
This patch extends the functionality of LoopPassManager to handle
loop-nest passes by specializing the definition of LoopPassManager that
accepts both kinds of passes in addPass.
Only loop passes are executed if L is not a top-level one, and both
kinds of passes are executed if L is top-level. Currently, loop nest
passes should have the following run method:
PreservedAnalyses run(LoopNest &, LoopAnalysisManager &,
LoopStandardAnalysisResults &, LPMUpdater &);
Reviewed By: Whitney, ychen
Differential Revision: https://reviews.llvm.org/D87045
Two observations:
1. Unavailability of DomTree makes it impossible to make
`FoldBranchToCommonDest()` transform in certain cases,
where the successor is dominated by predecessor,
because we then don't have PHI's, and can't recreate them,
well, without handrolling 'is dominated by' check,
which doesn't really look like a great solution to me.
2. Avoiding invalidating DomTree in SimplifyCFG will
decrease the number of `Dominator Tree Construction` by 5
(from 28 now, i.e. -18%) in `-O3` old-pm pipeline
(as per `llvm/test/Other/opt-O3-pipeline.ll`)
This might or might not be beneficial for compile time.
So the plan is to make SimplifyCFG preserve DomTree, and then
eventually make DomTree fully required and preserved by the pass.
Now, SimplifyCFG is ~7KLOC. I don't think it will be nice
to do all this uplifting in a single mega-commit,
nor would it be possible to review it in any meaningful way.
But, i believe, it should be possible to do this in smaller steps,
introducing the new behavior, in an optional way, off-by-default,
opt-in option, and gradually fixing transforms one-by-one
and adding the flag to appropriate test coverage.
Then, eventually, the default should be flipped,
and eventually^2 the flag removed.
And that is what is happening here - when the new off-by-default option
is specified, DomTree is required and is claimed to be preserved,
and SimplifyCFG-internal assertions verify that the DomTree is still OK.
ResultPtr is guaranteed to be non-null - and using dyn_cast_or_null causes unnecessary static analyzer warnings.
We can't say the same for FirstResult AFAICT, so keep dyn_cast_or_null for that.
This adds support for loops like
unsigned clz(unsigned x) {
unsigned w = sizeof (x) * CHAR_BIT;
while (x) {
w--;
x >>= 1;
}
return w;
}
and
unsigned clz(unsigned x) {
unsigned w = sizeof (x) * CHAR_BIT - 1;
while (x >>= 1) {
w--;
}
return w;
}
To support these we look for add x, -1 as well as add x, 1 that
we already matched. If the value was -1 we need to subtract from
the initial counter value instead of adding to it.
Fixes PR48404.
Differential Revision: https://reviews.llvm.org/D92745
his is a preparation patch for supporting multiple exits in the loop vectorizer, by itself it should be mostly NFC. This patch moves the loop structure checks from LAA to their respective consumers (where duplicates don't already exist). Moving the checks does end up changing some of the optimization warnings and debug output slightly, but nothing that appears to be a regression.
Why do this? Well, after auditing the code, I can't actually find anything in LAA itself which relies on having all instructions within a loop execute an equal number of times. This patch simply makes this explicit so that if one consumer - say LV in the near future (hopefully) - wants to handle a broader class of loops, it can do so.
Differential Revision: https://reviews.llvm.org/D92066
Use SCEV to salvage additional @llvm.dbg.value that have turned into
referencing undef after transformation (and traditional
salvageDebugInfo). Before rewrite (but after introduction of new
induction variables) use SCEV to compute an equivalent set of values for
each @llvm.dbg.value in the loop body (among the loop header PHI-nodes).
After rewrite (and dead PHI elimination) update those @llvm.dbg.value
now referencing undef by picking a remaining value from its equivalence
set. Allow match with offset by inserting compensation code in the
DIExpression.
Fixes : PR38815
Differential Revision: https://reviews.llvm.org/D87494
CVP currently handles switches by checking an equality predicate
on all edges from predecessor blocks. Of course, this can only
work if the value being switched over is defined in a different block.
Replace this implementation with a call to getPredicateAt(), which
also does the predecessor edge predicate check (if not defined in
the same block), but can also do quite a bit more: It can reason
about phi-nodes by checking edge predicates for incoming values,
it can reason about assumes, and it can reason about block values.
As such, this makes the implementation both simpler and more
powerful. The compile-time impact on CTMark is in the noise.
This is the first in a series of patches that attempts to migrate
existing cost instructions to return a new InstructionCost class
in place of a simple integer. This new class is intended to be
as light-weight and simple as possible, with a full range of
arithmetic and comparison operators that largely mirror the same
sets of operations on basic types, such as integers. The main
advantage to using an InstructionCost is that it can encode a
particular cost state in addition to a value. The initial
implementation only has two states - Normal and Invalid - but these
could be expanded over time if necessary. An invalid state can
be used to represent an unknown cost or an instruction that is
prohibitively expensive.
This patch adds the new class and changes the getInstructionCost
interface to return the new class. Other cost functions, such as
getUserCost, etc., will be migrated in future patches as I believe
this to be less disruptive. One benefit of this new class is that
it provides a way to unify many of the magic costs in the codebase
where the cost is set to a deliberately high number to prevent
optimisations taking place, e.g. vectorization. It also provides
a route to represent the extremely high, and unknown, cost of
scalarization of scalable vectors, which is not currently supported.
Differential Revision: https://reviews.llvm.org/D91174
This patch adds new PM support for the pass and the pass can be now used
during middle-end transforms. The old pass is remamed to
ScalarizeMaskedMemIntrinLegacyPass.
Reviewed-By: skatkov, aeubanks
Differential Revision: https://reviews.llvm.org/D92743
ScalarizeMaskedMemIntrinsic is currently a codeGen level pass. The pass
is actually operating on IR level and does not use any code gen specific
passes. It is useful to move it into transforms directory so that it
can be more widely used as a mid-level transform as well (apart from
usage in codegen pipeline).
In particular, we have a usecase downstream where we would like to use
this pass in our mid-level pipeline which operates on IR level.
The next change will be to add support for new PM.
Reviewers: craig.topper, apilipenko, skatkov
Reviewed-By: skatkov
Differential Revision: https://reviews.llvm.org/D92407
Currently in some places we use signed type to represent size of an access and put explicit casts from unsigned to signed.
For example: int64_t EarlierSize = int64_t(Loc.Size.getValue());
Even though it doesn't loos bits (immidiatly) it may overflow and we end up with negative size. Potentially that cause later code to work incorrectly. A simple expample is a check that size is not negative.
I think it would be safer and clearer if we use unsigned type for the size and handle it appropriately.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D92648
The CountPrev variable was only used to forward a value from
the if statement to the conditional operator under the same
condition.
While there move some variable declarations to their first
assignment.
IsSigned and its accessor, isSigned, were introduced on Oct 25, 2017
in commit 9ac7021a25. The last use was
removed on Nov 20, 2017 in commit
268467869b.
Currently PassBuilder.cpp is by far the file that takes longest to
compile. This is due to tons of templates being instantiated per pass.
Follow PassManager by using wrappers around passes to avoid making
the adaptors templated on the pass type. This allows us to move various
adaptors' run methods into .cpp files.
This reduces the compile time of PassBuilder.cpp on my machine from 66
to 39 seconds. It also reduces the size of opt from 685M to 676M.
Reviewed By: dexonsmith
Differential Revision: https://reviews.llvm.org/D92616
Currently we have to duplicate the same checks in isPotentiallyReassociatable and tryReassociate. With simple pattern like add/mul this may be not a big deal. But the situation gets much worse when I try to add support for min/max. Min/Max may be represented by several instructions and can take different forms. In order reduce complexity for upcoming min/max support we need to restructure the code a bit to avoid mentioned code duplication.
Reviewed By: mkazantsev
Differential Revision: https://reviews.llvm.org/D88286
Currently we delete optimized instructions as we go. That has several negative consequences. First it complicates traversal logic itself. Second if newly generated instruction has been deleted the traversal is repeated from scratch.
But real motivation for the change is upcoming change with support for min/max reassociation. Here we employ SCEV expander to generate code. As a result newly generated instructions may be inserted not right before original instruction (because SCEV may do hoisting) and there is no way to know 'next' instruction.
Reviewed By: mkazantsev
Differential Revision: https://reviews.llvm.org/D88285
This patch teaches the jump threading pass to call BPI->eraseBlock
when it folds a conditional branch.
Without this patch, BranchProbabilityInfo could end up with stale edge
probabilities for the basic block containing the conditional branch --
one edge probability with less than 1.0 and the other for a removed
edge.
Differential Revision: https://reviews.llvm.org/D92608
1. Removed #include "...AliasAnalysis.h" in other headers and modules.
2. Cleaned up includes in AliasAnalysis.h.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D92489
In this patch I have added support for a new loop hint called
vectorize.scalable.enable that says whether we should enable scalable
vectorization or not. If a user wants to instruct the compiler to
vectorize a loop with scalable vectors they can now do this as
follows:
br i1 %exitcond, label %for.end, label %for.body, !llvm.loop !2
...
!2 = !{!2, !3, !4}
!3 = !{!"llvm.loop.vectorize.width", i32 8}
!4 = !{!"llvm.loop.vectorize.scalable.enable", i1 true}
Setting the hint to false simply reverts the behaviour back to the
default, using fixed width vectors.
Differential Revision: https://reviews.llvm.org/D88962
This is a straightforward port of MemCpyOpt to MemorySSA following
the approach of D26739. MemDep queries are replaced with MSSA queries
without changing the overall structure of the pass. Some care has
to be taken to account for differences between these APIs
(MemDep also returns reads, MSSA doesn't).
Differential Revision: https://reviews.llvm.org/D89207
We were not correctly splitting a blocks for chains of length 1.
Before that change, additional instructions for blocks in chains of
length 1 were not split off from the block before removing (this was
done correctly for chains of longer size).
If this first block contained an instruction referenced elsewhere,
deleting the block, would result in invalidation of the produced value.
This caused a miscompile which motivated D92297 (before D17993,
nonnull and dereferenceable attributed were not added so MergeICmps were
not triggered.) The new test gep-references-bb.ll demonstrate the issue.
The regression was introduced in
rG0efadbbcdeb82f5c14f38fbc2826107063ca48b2.
This supersedes D92364.
Test case by MaskRay (Fangrui Song).
Differential Revision: https://reviews.llvm.org/D92375
Currently, we have some confusion in the codebase regarding the
meaning of LocationSize::unknown(): Some parts (including most of
BasicAA) assume that LocationSize::unknown() only allows accesses
after the base pointer. Some parts (various callers of AA) assume
that LocationSize::unknown() allows accesses both before and after
the base pointer (but within the underlying object).
This patch splits up LocationSize::unknown() into
LocationSize::afterPointer() and LocationSize::beforeOrAfterPointer()
to make this completely unambiguous. I tried my best to determine
which one is appropriate for all the existing uses.
The test changes in cs-cs.ll in particular illustrate a previously
clearly incorrect AA result: We were effectively assuming that
argmemonly functions were only allowed to access their arguments
after the passed pointer, but not before it. I'm pretty sure that
this was not intentional, and it's certainly not specified by
LangRef that way.
Differential Revision: https://reviews.llvm.org/D91649
When bailing out in rewriteLoopExitValues() you could be left with PHI
nodes in the DeadInsts vector. Those would be not handled by the use of
RecursivelyDeleteTriviallyDeadInstructions() in IndVarSimplify. This
resulted in the IndVarSimplify pass returning an incorrect modified
status. This was caught by the expensive check introduced in D86589.
This patches changes IndVarSimplify so that it deletes those PHI nodes,
using RecursivelyDeleteDeadPHINode().
This fixes PR47486.
Reviewed By: mkazantsev
Differential Revision: https://reviews.llvm.org/D91153