The goal is to separate collecting items for post-processing
and processing them. Post processing also outlined as
dedicated method.
Differential Revision: https://reviews.llvm.org/D132603
When rebasing the review which became f79214d1, I forgot to adjust for the changed semantics introduced by 531dd3634. Functionally, this had no impact, but semantically it resulted in an incorrect result for isPredicatedInst. I noticed this while doing a follow up change.
This patch adds support for vectorizing conditionally executed div/rem operations via a variant of widening. The existing support for predicated divrem in the vectorizer requires scalarization which we can't do for scalable vectors.
The basic idea is that we can always divide (take remainder) by 1 without executing UB. As such, we can use the active lane mask to conditional select either the actual divisor for active lanes, or a constant one for inactive lanes. We already account for the cost of the active lane mask, so the only additional cost is a splat of one and the vector select. This is one of several possible approaches to this problem; see the review thread for discussion on some of the others. This one was chosen mostly because it was straight forward, and none of the others seemed oviously better.
I enabled the new code only for scalable vectors. We could also legally enable it for fixed vectors as well, but I haven't thought through the cost tradeoffs between widening and scalarization enough to know if that's profitable. This will be explored in future patches.
Differential Revision: https://reviews.llvm.org/D130164
The existing cost model for fixed-order recurrences models the phi as an
extract shuffle of a v1 vector. The shuffle produced should be a splice,
as they take two vectors inputs are extracting from a subset of the
lanes. On certain architectures the existing cost model can drastically
under-estimate the correct cost for the shuffle, so this changes it to a
SK_Splice and passes a correct Mask through to the getShuffleCost call.
I believe this might be the first use of a SK_Splice shuffle cost model
outside of scalable vectors, and some targets may require additions to
the cost-model to correctly account for them. In tree targets appear to
all have been updated where needed.
Differential Revision: https://reviews.llvm.org/D132308
This removes the last use of OperandValueKind from the client side API, and (once this is fully plumbed through TTI implementation) allow use of the same properties in store costing as arithmetic costing.
This completes the client side transition to the OperandValueInfo version of this routine. Backend TTI implementations still use the prior versions for now.
OperandValueKind and OperandValueProperties both provide facts about the operands of an instruction for purposes of cost modeling. We've discussed merging them several times; before I plumb through more flags, let's go ahead and do so.
This change only adds the client side interface for getArithmeticInstrCost and makes a couple of minor changes in client code to prove that it works. Target TTI implementations still use the split flags. I'm deliberately splitting what could be one big change into a series of smaller ones so that I can lean on the compiler to catch errors along the way.
Defaults to TCK_RecipThroughput - as most explicit calls were assuming TCK_RecipThroughput (vectorizers) or was just doing a before-vs-after comparison (vectorcombiner). Calls via getInstructionCost were just dropping the CostKind, so again there should be no change at this time (as getShuffleCost and its expansions don't use CostKind yet) - but it will make it easier for us to better account for size/latency shuffle costs in inline/unroll passes in the future.
Differential Revision: https://reviews.llvm.org/D132287
SLP vectorizer tries to find the reductions starting the operands of the
instructions with no-users/void returns/etc. But such operands can be
postponable instructions, like Cmp, InsertElement or InsertValue. Such
operands still must be postponed, vectorizer should not try to vectorize
them immediately.
Differential Revision: https://reviews.llvm.org/D131965
In many cases constant buildvector results in a vector load from a
constant/data pool. Need to consider this cost too.
Differential Revision: https://reviews.llvm.org/D126885
If the incoming previous value of a fixed-order recurrence is a phi in
the header, go through incoming values from the latch until we find a
non-phi value. Use this as the new Previous, all uses in the header
will be dominated by the original phi, but need to be moved after
the non-phi previous value.
At the moment, fixed-order recurrences are modeled as a chain of
first-order recurrences.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D119661
This change reorganizes the code and comments to make the expected semantics of these routines more clear. However, this is *not* an NFC change. The functional change is having isScalarWithPredication return false if the instruction does not need predicated. Specifically, for the case of a uniform memory operation we were previously considering it *not* to be a predicated instruction, but *were* considering it to be scalable with predication.
As can be seen with the test changes, this causes uniform memory ops which should have been lowered as uniform-per-parts values to instead be lowering via naive scalarization or if scalarization is infeasible (i.e. scalable vectors) aborted entirely. I also don't trust the code to bail out correctly 100% of the time, so it's possible we had a crash or miscompile from trying to scalarize something which isn't scalaralizable. I haven't found a concrete example here, but I am suspicious.
Differential Revision: https://reviews.llvm.org/D131093
Currently, we try to vectorize values, feeding into stores, only if
slp-vectorize-hor-store option is provided. We can safely enable
vectorization of the value operand of a single store in the basic block,
if the operand value is used only in store.
It should enable extra vectorization and should not increase compile
time significantly.
Fixes https://github.com/llvm/llvm-project/issues/51320
Differential Revision: https://reviews.llvm.org/D131894
After D121595 was commited, I noticed regressions assosicated with small trip
count numbersvectorisation by tail folding with scalable vectors. As a solution
for those issues I propose to introduce the minimal trip count threshold value.
Differential Revision: https://reviews.llvm.org/D130755
A const reference is preferred over a non-null const pointer.
`Type *` is kept as is to match the other overload.
Reviewed By: davidxl
Differential Revision: https://reviews.llvm.org/D131197
1) Overloaded (instruction-based) method is a wrapper around the current (opcode-based) method.
2) This patch also changes a few callsites (VectorCombine.cpp,
SLPVectorizer.cpp, CodeGenPrepare.cpp) to call the overloaded method.
3) This is a split of D128302.
Differential Revision: https://reviews.llvm.org/D131114
This is mostly a stylistic change to make the uniform memop widening cost
code fit more naturally with the sourounding code. Its not strictly
speaking NFC as I added in the store with invariant value case, and we
could in theory have a target where a gather/scatter is cheaper than a
single load/store... but it's probably NFC in practice. Note that the
scatter/gather result can still be overriden later if the result is
uniform-by-parts.
This extends the handling of uniform memory operations to handle the case where a store is storing a loop invariant value. Unlike the general case of a store to an invariant address where we must use the last active lane, in this case we can use any lane since all lanes must produce the same result.
For context, the basic structure of the existing code and how the change fits in:
* First, we select a widening strategy. (The result is irrelevant for this patch.)
* Then we determine if a computation is uniform within all lanes of VF. (Note this is the uniform-per-part definition, not LAI's uniform across all unrolled iterations definition.)
* If it is, we overrule the widening strategy, and unconditionally scalarize.
* VPReplicationRecipe - which is what actually does the scalarization - knows how to handle unform-per-part values including for scalable vectors. However, we do need to know that the expression is safe to execute without predication - e.g. the uniform mem op was unconditional in the original loop. (This part was split off and already landed.)
An obvious question is why not simply implement the generic case? The answer is that I'm going to, but doing so without a canonicalization towards uniform causes regressions due to bad interaction with scalarization/uniformity of values feeding the uniform mem-op. This patch is needed to avoid those regressions.
Differential Revision: https://reviews.llvm.org/D130364
If we have interleave groups in the loop we want to vectorise then
we should fall back on normal vectorisation with a scalar epilogue. In
such cases when tail-folding is enabled we'll almost certainly go on to
create vplans with very high costs for all vector VFs and fall back on
VF=1 anyway. This is likely to be worse than if we'd just used an
unpredicated vector loop in the first place.
Once the vectoriser has proper support for analysing all the costs
for each combination of VF and vectorisation style, then we should
be able to remove this.
Added an extra test here:
Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll
Differential Revision: https://reviews.llvm.org/D128342
Now the API getExtendedAddReductionCost is used to determine the cost of extended Add reduction with optional Mul. For Arm, it could cover the cases. But for other target, for example: RISCV, they support other kinds of extended recution, such as FAdd.
This patch does the following changes:
1, Split getExtendedAddReductionCost into 2 new API: getExtendedReductionCost which handles the extended reduction with addtional input of Opcode; getMulAccReductionCost which handle the MLA cases the getExtendedAddReductionCost.
2, Refactor getReductionPatternCost, add some contraint condition to make sure the getMulAccReductionCost should only handle the reuction of Add + Mul.
Differential Revision: https://reviews.llvm.org/D130868
We already had the reasoning about uniform mem op loads; if the address is accessed at least once, we know the instruction doesn't need predicated to ensure fault safety. For stores, we do need to ensure that the values visible in memory are the same with and without predication. The easiest sub-case to check for is that all the values being stored are the same. Since we know that at least one lane is active, this tells us that the value must be visible.
Warning on confusing terminology: "uniform" vs "uniform mem op" mean two different things here, and this patch is specific to the later. It would *not* be legal to make this same change for merely "uniform" operations.
Differential Revision: https://reviews.llvm.org/D130637
Reorganize the code to make it clear what is and isn't handle, and why.
Restructure bailout to remove (false and confusing) dependence on
CM_Scalarize; just return invalid cost and propagate, that's what it
is for.
This code confuses LV's "Uniform" and LVL/LAI's "Uniform". Despite the
common name, these are different.
* LVs notion means that only the first lane *of each unrolled part* is
required. That is, lanes within a single unroll factor are considered
uniform. This allows e.g. widenable memory ops to be considered
uses of uniform computations.
* LVL and LAI's notion refers to all lanes across all unrollings.
IsUniformMem is in turn defined in terms of LAI's notion. Thus a
UniformMemOpmeans is a memory operation with a loop invariant address.
This means the same address is accessed in every iteration.
The tweaked piece of code was trying to match a uniform mem op (i.e.
fully loop invariant address), but instead checked for LV's notion of
uniformity. In theory, this meant with UF > 1, we could speculate
a load which wasn't safe to execute.
This ends up being mostly silent in current code as it is nearly
impossible to create the case where this difference is visible. The
closest I've come in the test case from 54cb87, but even then, the
incorrect result is only visible in the vplan debug output; before this
change we sink the unsafely speculated load back into the user's predicate
blocks before emitting IR. Both before and after IR are correct so the
differences aren't "interesting".
The other test changes are uninteresting. They're cases where LV's uniform
analysis is slightly weaker than SCEV isLoopInvariant.
This patch adds the AArch64 hook for preferPredicateOverEpilogue,
which currently returns true if SVE is enabled and one of the
following conditions (non-exhaustive) is met:
1. The "sve-tail-folding" option is set to "all", or
2. The "sve-tail-folding" option is set to "all+noreductions"
and the loop does not contain reductions,
3. The "sve-tail-folding" option is set to "all+norecurrences"
and the loop has no first-order recurrences.
Currently the default option is "disabled", but this will be
changed in a later patch.
I've added new tests to show the options behave as expected here:
Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll
Differential Revision: https://reviews.llvm.org/D129560
An srem or sdiv has two cases which can cause undefined behavior, not just one. The existing code did not account for this, and as a result, we miscompiled when we encountered e.g. a srem i64 %v, -1 in a conditional block.
Instead of hand rolling the logic, just use the utility function which exists exactly for this purpose.
Differential Revision: https://reviews.llvm.org/D130106
This patch introduces some initial def-use verification. This catches
cases like the one fixed by D129436.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D129717
We currently assert in vectorizeTree(TreeEntry*) when processing a PHI
bundle in a block containing a catchswitch. We attempt to set the
IRBuilder insertion point following the catchswitch, which is invalid.
This is done so that ShuffleBuilder.finalize() knows where to insert
a shuffle if one is needed.
To avoid this occurring, watch out for catchswitch blocks during
buildTree_rec() processing, and avoid adding PHIs in such blocks to
the vectorizable tree. It is unlikely that constraining vectorization
over an exception path will cause a noticeable performance loss, so
this seems preferable to trying to anticipate when a shuffle will and
will not be required.
At the moment, the VPPRedInstPHIRecipe is not used in subsequent uses of
the predicate recipe. This incorrectly models the def-use chains, as all
later uses should use the phi recipe. Fix that by delaying recording of
the recipe.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D129436
At the moment, the cost of runtime checks for scalable vectors is
overestimated due to creating separate vscale * VF expressions for each
check. Instead re-use the first expression.
The backend getShuffleCosts do not currently handle shuffles that change
size very well. Limit the shuffles we collect to the same type to make
sure they do not cause issues as reported in D128732.
For scalable vectors, it is not sufficient to only check
MinProfitableTripCount if it is >= VF.getKnownMinValue() * UF, because
this property may not holder for larger values of vscale. In those
cases, compute umax(VF * UF, MinProfTC) instead.
This should fix
https://lab.llvm.org/buildbot/#/builders/197/builds/2262
When vectorising ordered reductions we call a function
LoopVectorizationPlanner::adjustRecipesForReductions to replace the
existing VPWidenRecipe for the fadd instruction with a new
VPReductionRecipe. We attempt to insert the new recipe in the same
place, but this is wrong because createBlockInMask may have
generated new recipes that VPReductionRecipe now depends upon. I
have changed the insertion code to append the recipe to the
VPBasicBlock instead.
Added a new RUN with tail-folding enabled to the existing test:
Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll
Differential Revision: https://reviews.llvm.org/D129550
When calculating the cost of Instruction::Br in getInstructionCost
we query PredicatedBBsAfterVectorization to see if there is a
scalar predicated block. However, this meant that the decisions
being made for a given fixed-width VF were affecting the cost for a
scalable VF. As a result we were returning InstructionCost::Invalid
pointlessly for a scalable VF that should have a low cost. I
encountered this for some loops when enabling tail-folding for
scalable VFs.
Test added here:
Transforms/LoopVectorize/AArch64/sve-tail-folding-cost.ll
Differential Revision: https://reviews.llvm.org/D128272
Currently, for vectorised loops that use the get.active.lane.mask
intrinsic we only use the mask for predicated vector operations,
such as masked loads and stores, etc. The loop itself is still
controlled by comparing the canonical induction variable with the
trip count. However, for some targets this is inefficient when it's
cheap to use the mask itself to control the loop.
This patch adds support for using the active lane mask for control
flow by:
1. Generating the active lane mask for the next iteration of the
vector loop, rather than the current one. If there are still any
remaining iterations then at least the first bit of the mask will
be set.
2. Extract the first bit of this mask and use this bit for the
conditional branch.
I did this by creating a new VPActiveLaneMaskPHIRecipe that sets
up the initial PHI values in the vector loop pre-header. I've also
made use of the new BranchOnCond VPInstruction for the final
instruction in the loop region.
Differential Revision: https://reviews.llvm.org/D125301
This patch is a simple piece of refactoring that now permits users
to create VPInstructions and specify the name of the value being
generated. This is useful for creating more readable/meaningful
names in IR.
Differential Revision: https://reviews.llvm.org/D128982
Now that removeDeadRecipes can remove most dead recipes across a whole
VPlan, there is no need to first collect some dead instructions.
Instead removeDeadRecipes can simply clean them up.
Depends D127580.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D128408
This can enable additional region merging, while not losing
opportunities as region merging does not produce dead recipes.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D128831
As integer div/rem constant expressions are no longer supported,
constants can no longer trap and are always safe to speculate.
Remove the Constant::canTrap() method and its usages.
This in an extension to the code added in D123911 which added vector
combine folding of shuffle-select patterns, attempting to reduce the
total amount of shuffling required in patterns like:
%x = shuffle %i1, %i2
%y = shuffle %i1, %i2
%a = binop %x, %y
%b = binop %x, %y
shuffle %a, %b, selectmask
This patch extends the handing of shuffles that are dependent on one
another, which can arise from the SLP vectorizer, as-in:
%x = shuffle %i1, %i2
%y = shuffle %x
The input shuffles can also be emitted, in which case they are treated
like identity shuffles. This patch also attempts to calculate a better
ordering of input shuffles, which can help getting lower cost input
shuffles, pushing complex shuffles further down the tree.
This is a recommit with some additional checks for supported forms and
out-of-bounds mask elements, with some extra tests.
Differential Revision: https://reviews.llvm.org/D128732
For scalable VFs, the minimum assumed vscale needs to be included in the
cost-computation, otherwise a smaller VF may be used for RT check cost
computation than was used for earlier cost computations.
Fixes a RISCV test failing with UBSan due to both scalar and vector
loops having the same cost.
This fixes an UBSan failure after 644a965c1e. When using
user-provided VFs/ICs (via the force-vector-width /
force-vector-interleave options) the scalar cost is zero, which would
cause divide-by-zero.
When forcing vectorization using the options, the cost of the runtime
checks should not block vectorization.
This patch replaces the tight hard cut-off for the number of runtime
checks with a more accurate cost-driven approach.
The new approach allows vectorization with a larger number of runtime
checks in general, but only executes the vector loop (and runtime checks) if
considered profitable at runtime. Profitable here means that the cost-model
indicates that the runtime check cost + vector loop cost < scalar loop cost.
To do that, LV computes the minimum trip count for which runtime check cost
+ vector-loop-cost < scalar loop cost.
Note that there is still a hard cut-off to avoid excessive compile-time/code-size
increases, but it is much larger than the original limit.
The performance impact on standard test-suites like SPEC2006/SPEC2006/MultiSource
is mostly neutral, but the new approach can give substantial gains in cases where
we failed to vectorize before due to the over-aggressive cut-offs.
On AArch64 with -O3, I didn't observe any regressions outside the noise level (<0.4%)
and there are the following execution time improvements. Both `IRSmk` and `srad` are relatively short running, but the changes are far above the noise level for them on my benchmark system.
```
CFP2006/447.dealII/447.dealII -1.9%
CINT2017rate/525.x264_r/525.x264_r -2.2%
ASC_Sequoia/IRSmk/IRSmk -9.2%
Rodinia/srad/srad -36.1%
```
`size` regressions on AArch64 with -O3 are
```
MultiSource/Applications/hbd/hbd 90256.00 106768.00 18.3%
MultiSourc...ks/ASCI_Purple/SMG2000/smg2000 240676.00 257268.00 6.9%
MultiSourc...enchmarks/mafft/pairlocalalign 472603.00 489131.00 3.5%
External/S...2017rate/525.x264_r/525.x264_r 613831.00 630343.00 2.7%
External/S...NT2006/464.h264ref/464.h264ref 818920.00 835448.00 2.0%
External/S...te/538.imagick_r/538.imagick_r 1994730.00 2027754.00 1.7%
MultiSourc...nchmarks/tramp3d-v4/tramp3d-v4 1236471.00 1253015.00 1.3%
MultiSource/Applications/oggenc/oggenc 2108147.00 2124675.00 0.8%
External/S.../CFP2006/447.dealII/447.dealII 4742999.00 4759559.00 0.3%
External/S...rate/510.parest_r/510.parest_r 14206377.00 14239433.00 0.2%
```
Reviewed By: lebedev.ri, ebrevnov, dmgreen
Differential Revision: https://reviews.llvm.org/D109368
This patch slightly extends the limit on the RecursionMaxDepth inside
the SLP vectorizer. It does it only when it hits a load (or zext/sext of
a load), which allows it to peek through in the places where it will be
the most valuable, without ballooning out the O(..) by any 2^n factors.
Differential Revision: https://reviews.llvm.org/D122148
This in an extension to the code added in D123911 which added vector
combine folding of shuffle-select patterns, attempting to reduce the
total amount of shuffling required in patterns like:
%x = shuffle %i1, %i2
%y = shuffle %i1, %i2
%a = binop %x, %y
%b = binop %x, %y
shuffle %a, %b, selectmask
This patch extends the handing of shuffles that are dependent on one
another, which can arise from the SLP vectorizer, as-in:
%x = shuffle %i1, %i2
%y = shuffle %x
The input shuffles can also be emitted, in which case they are treated
like identity shuffles. This patch also attempts to calculate a better
ordering of input shuffles, which can help getting lower cost input
shuffles, pushing complex shuffles further down the tree.
Differential Revision: https://reviews.llvm.org/D128732
The moved helpers are only used for codegen. It will allow moving the
remaining ::execute implementations out of LoopVectorize.cpp.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D128657
At the moment, the same VPlan can be used code generation of both the
main vector and epilogue vector loop. This can lead to wrong results, if
the plan is optimized based on the VF of the main vector loop and then
re-used for the epilogue loop.
One example where this is problematic is if the scalar loops need to
execute at least one iteration, e.g. due to interleave groups.
To prevent mis-compiles in the short-term, disable optimizing exit
conditions for VPlans when using epilogue vectorization. The proper fix
is to avoid re-using the same plan for both loops, which will require
support for cloning plans first.
Fixes#56319.
The moved helpers are only used for codegen. It will allow moving the
remaining ::execute implementations out of LoopVectorize.cpp.
Depends on D127966.
Depends on D127965.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D127968
At the moment LoopVersioning is only created for inner-loop
vectorization. This patch moves it to LVP::execute, which means it will
also be added for epilogue vectorization. As a consequence, the proper
noalias metadata is now also added to epilogue vector loops.
LVer will be moved to VPTransformState as follow-up.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D127966
In some cases, there may be widened users of inductions even though the
plan includes the scalar VF. In those cases, make sure we still replace
the VPWidenIntOrFpInductionRecipe with scalar steps, as otherwise we may
try to execute a VPWidenIntOrFpInductionRecipe with a scalar VF.
Alternatively the patch could also split the range if needed.
This fixes a crash exposed by D123720.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D128755
If the root order itself does not require reordering, we can just
remove its reorder mask safely (e.g., if the root node is a vector of
phis). But if this node is used as an operand in the graph, we cannot
delete the reordering, need to keep it. Otherwise the graph nodes are
not synchronized with the operands. It may cause an extra gather
instruction(s) or a compiler crash.
Also, need to be very careful when selecting the gather nodes for
reordering since there might several gather nodes with the same scalars
and we can try to reorder just the same node many times instead of
different nodes.
Differential Revision: https://reviews.llvm.org/D128680
This patch moves the code for recipe implementations to a separate file.
The benefits are:
* Keep VPlan.cpp smaller => faster compile-time during parallel builds.
* Keep code for logical units together
As a follow-up I am also planning on moving all ::execute
implemetnations from LoopVectorize.cpp over to the new file, which
should help to reduce the size of the file a bit.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D127965
`commonAlignment` is a shortcut to pick the smallest of two `Align`
objects. As-is it doesn't bring much value compared to `std::min`.
Differential Revision: https://reviews.llvm.org/D128345
This change is a bit subtle. If we have a type like <vscale x 1 x i64>, the vectorizer will currently reject vectorization. The reason is that a type like <1 x i64> is likely to get simply rescalarized, and the vectorizer doesn't want to be in the game of simple unrolling.
(I've given the example in terms of 1 x types which use a single register, but the same issue exists for any N x types which use N registers. e.g. RISCV LMULs.)
This change distinguishes scalable types from fixed types under the reasoning that converting to a scalable type isn't unrolling. Because the actual vscale isn't known until runtime, using a vscale type is potentially very profitable.
This makes an important, but unchecked, assumption. Specifically, the scalable type is assumed to only be legal per the cost model if there's actually a scalable register class which is distinct from the scalar domain. This is, to my knowledge, true for all targets which return non-invalid costs for scalable vector ops today, but in theory, we could have a target decide to lower scalable to fixed length vector or even scalar registers. If that ever happens, we'd need to revisit this code.
In practice, this patch unblocks scalable vectorization for ELEN types on RISCV.
Let me sketch one alternate implementation I considered. We could have restricted this to when we know a minimum value for vscale. Specifically, for the default +v extension for RISCV, we actually know that vscale >= 2 for ELEN types. However, doing it this way means we can't generate scalable vectors when using the various embedded vector extensions which have a minimum vscale of 1.
Differential Revision: https://reviews.llvm.org/D128542
Improved/fixed cost modeling for shuffles by providing masks, improved
cost model for non-identity insertelements.
Differential Revision: https://reviews.llvm.org/D115462
This patch updates LV to generate runtime after the VF & IC are selected. It
allows deciding whether to vectorize with runtime checks or not based on
their cost compared to the vector loop.
It also updates VectorizationFactor to include the scalar cost.
Reviewed By: lebedev.ri, dmgreen
Differential Revision: https://reviews.llvm.org/D75981
If we have an unaligned uniform store, then when costing a scalable VF we can't emit code to scalarize it. (Well, we could, but we haven't implemented that case.) This change replaces an assert with a cost-model bailout such that we reject vectorization with the scalable VF instead of crashing.
If the masked gather nodes must be reordered, we can just reorder
scalars, just like for gather nodes. But if the node contains reused
scalars, it must be handled same way as a regular vectorizable node,
since need to reorder reused mask, not the scalars directly.
Differential Revision: https://reviews.llvm.org/D128360
This reverts commit cac60940b7.
Caused -Os -fsanitize=memory -march=haswell miscompile to pytorch/cpuinfo.
See my latest comment (may update) on D115462.
createInductionResumeValues creates a phi node placeholder
without filling incoming values. Then it generates the incoming values.
It includes triggering of SCEV expander which may invoke SSAUpdater.
SSAUpdater has an optimization to detect number of predecessors
basing on incoming values if there is phi node.
In case phi node is not filled with incoming values - the number of predecessors
is detected as 0 and this leads to segmentation fault.
In other words SSAUpdater expects that phi is in good shape while
LoopVectorizer breaks this requirement.
The fix is just prepare all incoming values first and then build a phi node.
Reviewed By: fhahn
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D128033
During the reordering transformation we should try to avoid reordering bundles
like fadd,fsub because this may block them being matched into a single vector
instruction in x86.
We do this by checking if a TreeEntry is such a pattern and adding it to the
list of TreeEntries with orders that need to be considered.
Differential Revision: https://reviews.llvm.org/D125712
In some cases, a recurrence splice instructions needs to be inserted
between to regions, for example if the regions get re-arranged during
sinking.
Fixes#56146.
If the OffsetBeg + InsertVecSz is greater than VecSz, need to estimate
the cost as shuffle of 2 vector, not as insert of subvector. Otherwise,
the inserted subvector is out of range and compiler may crash.
Differential Revision: https://reviews.llvm.org/D128071
If the root scalar is mapped to to the smallest bit width, the vector is
truncated and the types between original buildvector and extracted value
mismatched. For extract, we emit sext/zext instructions, for shuffles we
can reuse oringal vector instead of the truncated one.
Differential Revision: https://reviews.llvm.org/D127974
Instead of using the underlying instruction and VF to get the type, use
the type of the incoming value. This removes an unnecessary dependence
on the underlying instruction and enables using the recipe without an
underlying instruction.
Currently scatter vectorize nodes can be emitted only for GEPs with
constant indices. But we can also emit such nodes for GEPs with the same
ptr and non-constant vectorizable/gathered indices, if profitable. Patch
adds support for such nodes and tries to improve handling of GEPs with
non-const indeces for such nodes.
Metric: SLP.NumVectorInstructions
Program SLP.NumVectorInstructions
results results0 diff
test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 5243.00 5240.00 -0.1%
test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 5243.00 5240.00 -0.1%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 27550.00 27507.00 -0.2%
test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test 5395.00 5380.00 -0.3%
test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 5389.00 5374.00 -0.3%
test-suite :: External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp_r.test 961.00 958.00 -0.3%
test-suite :: External/SPEC/CINT2017speed/620.omnetpp_s/620.omnetpp_s.test 961.00 958.00 -0.3%
test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test 5664.00 5643.00 -0.4%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 13202.00 13127.00 -0.6%
test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test 212.00 207.00 -2.4%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 890.00 850.00 -4.5%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 1695.00 1581.00 -6.7%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 2338.00 2140.00 -8.5%
test-suite :: SingleSource/UnitTests/matrix-types-spec.test 63.00 55.00 -12.7%
test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 468.00 356.00 -23.9%
Geomean difference -0.3%
All numbers show increased number of generated vector instructions.
Diff:
SingleSource/Benchmarks/Adobe-C++/loop_unroll - better without LTO, but
need an extra analysis with LTO (with LTO compiler generates
masked_gather, while before regular loads were emitted because of extra
data, availbale at LTO time).
SingleSource/UnitTests/matrix-types-spec - more vector code.
MultiSource/Applications/JM/lencod/lencod - same.
External/SPEC/CINT2006/464.h264ref/464.h264ref - same.
MultiSource/Benchmarks/7zip/7zip-benchmark - same.
External/SPEC/CINT2006/445.gobmk/445.gobmk - no changes.
External/SPEC/CFP2017rate/510.parest_r/510.parest_r - more vector code.
External/SPEC/CFP2006/447.dealII/447.dealII - same
External/SPEC/CINT2017speed/620.omnetpp_s/620.omnetpp_s - same
External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp - same
External/SPEC/CFP2017rate/511.povray_r/511.povray - same
External/SPEC/CFP2006/453.povray/453.povray - same
External/SPEC/CFP2017rate/526.blender_r/526.blender_r - same
External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r - same
External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s - same
Differential Revision: https://reviews.llvm.org/D127219
We can skip the analysis of the constant nodes, their order should not
affect the ordering of the trees/subtrees.
Differential Revision: https://reviews.llvm.org/D127775
OrigPHIsToFix is only used in the native path. Collecting phis can be
replaced by iterating over the plan. This also removes another
unnecessary use of a late getVPValue.
This also reduces the coupling between ILV and the VPlan utilities.
Removes the workaround from https://reviews.llvm.org/D98509#2732628 for
an AIX build compiler issue.
The AIX build compiler product that caused the issue has since been
fixed. Also, the AIX build compiler has been changed to one based on
LLVM.
All information is already available in VPlan. Note that there are some
test changes, because we now can correctly look through instructions
like truncates to analyze the actual users.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D123541
This reverts commit 266ea446ab.
The reasons for the revert have been addressed by cleaning up condition
handling in VPlan and properly marking VPBranchOnMaskRecipe as using
scalars.
The test case for the revert from D123720 has been added in 3d663308a5.
Based on reviewer comments on https://reviews.llvm.org/D126692 I've
added FastMathFlags to the select instruction used when tail-folding
with reductions. These flags can then be used by InstCombine to
decide upon the most optimal floating point identity value for
fadd/fsub. Doing so unlocks further optimisations, such as folding
selects into masked loads.
Differential Revision: https://reviews.llvm.org/D126778
Try to simplify BranchOnCount to `BranchOnCond true` if TC <= UF * VF.
This is an alternative to D121899 which simplifies the VPlan directly
instead of doing so late in code-gen.
The potential benefit of doing this in VPlan is that this may help
cost-modeling in the future. The reason this is done in prepareToExecute
at the moment is that a single plan may be used for multiple VFs/UFs.
There are further simplifications that can be applied as follow ups:
1. Replace inductions with constants
2. Replace vector region with regular block.
Fixes#55354.
Depends on D126679.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D126680
Instead of setting the successor to the exit using CFG.ExitBB, set it to
nullptr initially. The successor to the exit block is later set either
through createEmptyBasicBlock or after VPlan execution (because at the
moment, no block is created by VPlan for the exit block, the existing
one is reused).
This also enables BranchOnCond to be used as terminator for the exiting
block of the topmost vector region.
Depends on D126618.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D126679
Improved/fixed cost modeling for shuffles by providing masks, improved
cost model for non-identity insertelements.
Differential Revision: https://reviews.llvm.org/D115462
This patch removes CondBit and Predicate from VPBasicBlock. To do so,
the patch introduces a new branch-on-cond VPInstruction opcode to model
a branch on a condition explicitly.
This addresses a long-standing TODO/FIXME that blocks shouldn't be users
of VPValues. Those extra users can cause issues for VPValue-based
analyses that don't expect blocks. Addressing this fixme should allow us
to re-introduce 266ea446ab.
The generic branch opcode can also be used in follow-up patches.
Depends on D123005.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D126618
Improved/fixed cost modeling for shuffles by providing masks, improved
cost model for non-identity insertelements.
Differential Revision: https://reviews.llvm.org/D115462
Recently the terminology used has been changed from Exit->Exiting in
line with common LLVM loop terminology. Update a remaining use of the
old terminology.
Improved/fixed cost modeling for shuffles by providing masks, improved
cost model for non-identity insertelements.
Differential Revision: https://reviews.llvm.org/D115462
Extractelement instructions may come from different basic blocks, need
to take it into account when looking for a last instruction in the
bundle to prevent compiler crash.
Differential Revision: https://reviews.llvm.org/D126777
This patch updates the VPlan native path to use VPRegionBlocks for all
loops in a loop nest. Up to now, only the outermost loop used a region.
This is a step towards unifying both paths and keep things consistent
between them. It also prepares various code-gen parts for modeling the
pre-header in the inner loop vectorizer (D121624).
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D123005
The implementations of VPlanDominatorTree, VPlanLoopInfo and VPlanPredicator
are all incompatible with modeling loops in VPlans as region without
explicit back-edges.
Those pieces are not actively used and only exercised by a few gtest
unit tests. They are at the moment blocking progress towards unifying
the native and inner-loop vectorizer paths in D121624 and D123005.
I think we should not block forward progress on unused pieces of code,
so this patch removes the utilities for now. The plan is to re-introduce
them as needed in a way that is compatible with the unified VPlan scheme
used in both the inner loop vectorizer and the native path.
Reviewed By: sguggill
Differential Revision: https://reviews.llvm.org/D123017
In LLVM's common loop terminology, an exit block is a block outside a
loop with a predecessor inside the loop. An exiting block is a block
inside the loop which branches to an exit block outside the loop.
This patch updates a few places where VPlan was using ExitBlock for a
block exiting a region. Those instances have been updated to use
ExitingBlock.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D126173
Patch improves compile time. For function calls, which cannot be
vectorized, create a unique group for each such a call instead of
subgroup. It prevents them from being grouped by a subgroups and
attempts for their vectorization.
Also, looks through casts operand to try to check their
groups/subgroups.
Reduces number of vectorization attempts. No changes in the statistics
for SPEC2017/2006/llvm-test-suite.
Differential Revision: https://reviews.llvm.org/D126476
Need to handle a corner case correctly, if all elements are Undefs/Poisons,
need to emit actual values, not just poisons.
Differential Revision: https://reviews.llvm.org/D126298
ScatterVectorize nodes should be handled same way as gathers in
reorderBottomToTop function, since we can simple reorder the loads in
this node. Because of that need to include such nodes to the list of
gathered nodes to fix compiler crash.
Differential Revision: https://reviews.llvm.org/D126378
SLP should build ScatterVectorize nodes only if they actually end up
with masked gather rather than with scalarization. In the second
scenario better to build a gather node.
Differential Revision: https://reviews.llvm.org/D126379
Need to use all ReductionOps when propagating flags for the reduction
ops, otherwise transformation is not correct. Plus, need to drop nuw/nsw
flags.
Differential Revision: https://reviews.llvm.org/D126371
When compiling the attached new test in scalable-reductions-tf.ll we
were hitting this assertion in fixReduction:
Assertion `isa<PHINode>(U) && "Reduction exit must feed Phi's or select"
The loop contains a reduction and an intermediate store of the reduction
value. When vectorising with tail-folding the contains of 'U' in the
assertion above happened to be a scatter_store. It turns out that we
were still creating a widen recipe for the invariant store, despite
knowing that we can actually sink it. The simplest fix is to change
buildVPlanWithVPRecipes so that we look for invariant stores before
attempting to widen it.
Differential Revision: https://reviews.llvm.org/D126295
The crash is caused by incorrect order set by reorderBottomToTop(), which
happens when it is reordering a TreeEntry which has a user that has already been
reordered earlier. Please see the detailed description in the lit test.
Differential Revision: https://reviews.llvm.org/D126099
To be used correctly in a sort-like function, isFirstInsertElement
function must follow weak strict ordering rule, i.e.
isFirstInsertElement(IE1, IE1) should return false.
Builds UserIgnore list only once as a SmallDenseSet without rebuilding
it between the runs, iterate over gathers instead list of reduction ops,
do some checks in the buildTree_rec only if the corresponding containers
are not empty.
SLP vectorizer emits extracts for externally used vectorized scalars and
estimates the cost for each such extract. But in many cases these
scalars are input for insertelement instructions, forming buildvector,
and instead of extractelement/insertelement pair we can emit/cost
estimate shuffle(s) cost and generate series of shuffles, which can be
further optimized.
Tested using test-suite (+SPEC2017), the tests passed, SLP was able to
generate/vectorize more instructions in many cases and it allowed to reduce
number of re-vectorization attempts (where we could try to vectorize
buildector insertelements again and again).
Differential Revision: https://reviews.llvm.org/D107966
Previously, `getRegUsageForType` was implemented using
`getTypeLegalizationCost`. `getRegUsageForType` is used by the loop
vectorizer to estimate the register pressure caused by using a vector
type. However, `getTypeLegalizationCost` currently only appears to
understand splitting and not scalarization, so significantly
underestimates the register requirements.
Instead, use `getNumRegisters`, which understands when scalarization
can occur (via computeRegisterProperties).
This was discovered while investigating D118979 (Set maximum VF with
shouldMaximizeVectorBandwidth), where under fixed-length 512-bit SVE the
loop vectorizer previously ends up costing an v128i1 as 2 v64i*
registers where it actually occupies 128 i32 registers.
I'm sending this patch early for comment, I'm still doing some sanity checking
with LNT. I note that getRegisterClassForType appears to return VectorRC even
though the type in question (large vNi1 types) end up occupying scalar
registers. That might be worth fixing too.
Differential Revision: https://reviews.llvm.org/D125918
The latch may not be the exiting block. Use the exiting block instead
when looking up the incoming value of the LCSSA phi node. This fixes a
crash with early-exit loops.
Current codegen only supports scalarization of pointer inductions for
scalable VFs if they are uniform. After 3bebec659 we now may enter the
scalarization code path in VPWidenPointerInductionRecipe::execute for
scalable vectors.
Fall back to widening for scalable vectors if necessary.
This should fix a build failure when bootstrapping LLVM with SVE, e.g.
https://lab.llvm.org/buildbot/#/builders/176/builds/1723
This reverts commit fc9c59c355.
The patch triggers an assertion when building SPEC on X86. Reduced
reproducer shared at D107966.
Also reverts follow-up commit 11a09af76d.
This patch introduces a new VPLiveOut subclass of VPUser to model
exit values explicitly. The initial version handles exit values that
are neither part of induction or reduction chains nor first order
recurrence phis.
Fixes#51366, #54867, #55167, #55459
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D123537
SLP vectorizer emits extracts for externally used vectorized scalars and
estimates the cost for each such extract. But in many cases these
scalars are input for insertelement instructions, forming buildvector,
and instead of extractelement/insertelement pair we can emit/cost
estimate shuffle(s) cost and generate series of shuffles, which can be
further optimized.
Tested using test-suite (+SPEC2017), the tests passed, SLP was able to
generate/vectorize more instructions in many cases and it allowed to reduce
number of re-vectorization attempts (where we could try to vectorize
buildector insertelements again and again).
Differential Revision: https://reviews.llvm.org/D107966
At the moment LV runs LoopSimplify and reconstructs LCSSA form after
generating the main vector loop and before generating the epilogue
vector loop.
In practice, this adds a new exit block for the scalar loop because the
middle block now also branches to the original exit block of the scalar
loop. It also requires adding a new LCSSA phi in the newly created exit
block.
This complicates things when modeling exit values in VPlan, because we
would need to update the VPlan for the epilogue loop to update the newly
created LCSSA phi node.
But none of that should be necessary, as all analysis requiring
loop-simplify form is already done at this point and LCSSA form of the
original loop is not broken.
Reviewed By: bmahjour
Differential Revision: https://reviews.llvm.org/D125810
Update clearReductionWrapFlags to use the VPlan def-use chain from the
reduction phi recipe to drop reduction wrap flags.
This addresses an existing FIXME and fixes a crash when instructions in
the reduction chain are not used and have been removed before VPlan
codegeneration.
Fixes#55540.
The runtime check threshold should also restrict interleave count.
Otherwise, too many runtime checks will be generated for some cases.
Reviewed By: fhahn, dmgreen
Differential Revision: https://reviews.llvm.org/D122126
VPWidenMemoryInstruction also models stores which may not produce a value.
This can trip over analyses. Improve the modeling by only adding
VPValues for VPWidenMemoryInstructionRecipes modeling loads.
Most clients only used these methods because they wanted to be able to
extend or truncate to the same bit width (which is a no-op). Now that
the standard zext, sext and trunc allow this, there is no reason to use
the OrSelf versions.
The OrSelf versions additionally have the strange behaviour of allowing
extending to a *smaller* width, or truncating to a *larger* width, which
are also treated as no-ops. A small amount of client code relied on this
(ConstantRange::castOp and MicrosoftCXXNameMangler::mangleNumber) and
needed rewriting.
Differential Revision: https://reviews.llvm.org/D125557
This patch changes the strategy for vectorizing freeze instrucion, from
replicating multiple times to widening according to selected VF.
Fixes#54992
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D125016
The pattern matching and vectgorization for reductions was not very
effective. Some of of the possible reduction values were marked as
external arguments, SLP could not find some reduction patterns because
of too early attempt to vectorize pair of binops arguments, the cost of
consts reductions was not correct. Patch addresses these issues and
improves the analysis/cost estimation and vectorization of the
reductions.
The most significant changes in SLP.NumVectorInstructions:
Metric: SLP.NumVectorInstructions [140/14396]
Program results results0 diff
test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 920.00 3548.00 285.7%
test-suite :: SingleSource/Benchmarks/BenchmarkGame/n-body.test 66.00 122.00 84.8%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 100.00 128.00 28.0%
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 664.00 810.00 22.0%
test-suite :: MultiSource/Benchmarks/mafft/pairlocalalign.test 592.00 687.00 16.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 402.00 426.00 6.0%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 1665.00 1745.00 4.8%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 135.00 139.00 3.0%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 135.00 139.00 3.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 388.00 397.00 2.3%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 895.00 914.00 2.1%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 240.00 244.00 1.7%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 240.00 244.00 1.7%
test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 820.00 832.00 1.5%
test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 820.00 832.00 1.5%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 14804.00 14914.00 0.7%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 8125.00 8183.00 0.7%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 1330.00 1338.00 0.6%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 1330.00 1338.00 0.6%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 9832.00 9880.00 0.5%
test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 5267.00 5291.00 0.5%
test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 4018.00 4024.00 0.1%
test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 4018.00 4024.00 0.1%
test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test 426.00 424.00 -0.5%
test-suite :: External/SPEC/CFP2017rate/544.nab_r/544.nab_r.test 426.00 424.00 -0.5%
test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test 201.00 192.00 -4.5%
test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test 201.00 192.00 -4.5%
644.nab_s and 544.nab_r - reduced number of shuffles but increased number
of useful vectorized instructions.
641.leela_s and 541.leela_r - the function
`@_ZN9FastBoard25get_pattern3_augment_specEiib` is not inlined anymore
but its body gets vectorized successfully. Before, the function was
inlined twice and vectorized just after inlining, currently it is not
required. The vector code looks pretty similar, just like as it was before.
Differential Revision: https://reviews.llvm.org/D111574
Need to check if the reduction is still (not)cmp-select pattern min/max
reduction to avoid compiler crash during building list of reduction
operations. cmp-sel pattern provides 2 reduction operations, while
intrinsics - just one.
Those helpers model properties of a user and they should also be
available to non-recipe users. This will be used in D123537 for a new
exit value user.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D124936
This patch adds initial support for a pointer diff based runtime check
scheme for vectorization. This scheme requires fewer computations and
checks than the existing full overlap checking, if it is applicable.
The main idea is to only check if source and sink of a dependency are
far enough apart so the accesses won't overlap in the vector loop. To do
so, it is sufficient to compute the difference and compare it to the
`VF * UF * AccessSize`. It is sufficient to check
`(Sink - Src) <u VF * UF * AccessSize` to rule out a backwards
dependence in the vector loop with the given VF and UF. If Src >=u Sink,
there is not dependence preventing vectorization, hence the overflow
should not matter and using the ULT should be sufficient.
Note that the initial version is restricted in multiple ways:
1. Pointers must only either be read or written, by a single
instruction (this allows re-constructing source/sink for
dependences with the available information)
2. Source and sink pointers must be add-recs, with matching steps
3. The step must be a constant.
3. abs(step) == AccessSize.
Most of those restrictions can be relaxed in the future.
See https://github.com/llvm/llvm-project/issues/53590.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D119078
When the loop vectoriser encounters a known low trip count it tries
to create a single predicated loop in order to get the benefit of
vectorisation and eliminate the scalar tail. However, until now the
vectoriser prevented the use of scalable vectors in this case due
to concerns in the past about stability. I believe that tail-folded
loops using scalable vectors are now sufficiently well tested that
we can enable this. For the same reason I've also enabled it when
optimising for code size too.
Tests added here:
Transforms/LoopVectorize/AArch64/sve-low-trip-count.ll
Transforms/LoopVectorize/AArch64/sve-tail-folding-optsize.ll
Transforms/LoopVectorize/RISCV/low-trip-count.ll
Differential Revision: https://reviews.llvm.org/D121595
Under some circumstances, SCEVExpander will insert new instructions when
expanding a predicate, but the final result of the expansion can be a
false constant.
In those cases, the expanded instructions may later be used by other
expansions, e.g. the trip count. This may trigger an assertion during
SCEVExpander cleanup. To avoid this, always mark the result as used.
Fixes#55100.