This patch adds the AArch64 hook for preferPredicateOverEpilogue,
which currently returns true if SVE is enabled and one of the
following conditions (non-exhaustive) is met:
1. The "sve-tail-folding" option is set to "all", or
2. The "sve-tail-folding" option is set to "all+noreductions"
and the loop does not contain reductions,
3. The "sve-tail-folding" option is set to "all+norecurrences"
and the loop has no first-order recurrences.
Currently the default option is "disabled", but this will be
changed in a later patch.
I've added new tests to show the options behave as expected here:
Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll
Differential Revision: https://reviews.llvm.org/D129560
This patch is in preparation for enabling vectorisation with tail-folding
by default for SVE targets. Once we do that many existing tests will
break that depend upon having normal unpredicated vector loops. For
all such tests I have added the flag:
-prefer-predicate-over-epilogue=scalar-epilogue
Differential Revision: https://reviews.llvm.org/D129137
An srem or sdiv has two cases which can cause undefined behavior, not just one. The existing code did not account for this, and as a result, we miscompiled when we encountered e.g. a srem i64 %v, -1 in a conditional block.
Instead of hand rolling the logic, just use the utility function which exists exactly for this purpose.
Differential Revision: https://reviews.llvm.org/D130106
By default if SVE is enabled we want the select instruction used for
reductions to be inside the loop, rather than outside. This makes it
possible for the backend to fold the select into the operation to
produce a single predicated add, fadd, etc.
Differential Revision: https://reviews.llvm.org/D129763
In sve-tail-folding-reductions.ll I've also added an extra RUN line
to test normal reductions, i.e. not in-loop. This patch is a pre-commit
in preparation for a follow-on patch that changes how reduction selects
are generated in the vector loop.
Differential Revision: https://reviews.llvm.org/D129761
I've simplified all of the SVE vectoriser tail-folding tests to
only care about testing the flag:
-prefer-predicate-over-epiloge=predicate-else-scalar-epilogue
In practice we always want to fall back on unpredicated vector
loops if tail-folding is not possible.
Differential Revision: https://reviews.llvm.org/D129843
This builds on the previous forked pointers patch, which only accepted
a single select as the pointer to check. A recursive function to walk
through IR has been added, which searches for either a loop-invariant
or addrec SCEV.
This will only handle a single fork at present, so selects of selects
or a GEP with a select for both the base and offset will be rejected.
There is also a recursion limit with a cli option to change it.
Reviewed By: fhahn, david-arm
Differential Revision: https://reviews.llvm.org/D108699
At the moment, the VPPRedInstPHIRecipe is not used in subsequent uses of
the predicate recipe. This incorrectly models the def-use chains, as all
later uses should use the phi recipe. Fix that by delaying recording of
the recipe.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D129436
At the moment, the cost of runtime checks for scalable vectors is
overestimated due to creating separate vscale * VF expressions for each
check. Instead re-use the first expression.
For scalable vectors, it is not sufficient to only check
MinProfitableTripCount if it is >= VF.getKnownMinValue() * UF, because
this property may not holder for larger values of vscale. In those
cases, compute umax(VF * UF, MinProfTC) instead.
This should fix
https://lab.llvm.org/buildbot/#/builders/197/builds/2262
The test shows a case where the minimum trip count check incorrectly
only checks the minimum profitable trip count computed due to runtime
checks. This is incorrect for scalable VFs, because the VF * UF may
exceed the minimum profitable trip count for vscale > 1.
This is the likely reason for
https://lab.llvm.org/buildbot/#/builders/197/builds/2262 failing.
When vectorising ordered reductions we call a function
LoopVectorizationPlanner::adjustRecipesForReductions to replace the
existing VPWidenRecipe for the fadd instruction with a new
VPReductionRecipe. We attempt to insert the new recipe in the same
place, but this is wrong because createBlockInMask may have
generated new recipes that VPReductionRecipe now depends upon. I
have changed the insertion code to append the recipe to the
VPBasicBlock instead.
Added a new RUN with tail-folding enabled to the existing test:
Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll
Differential Revision: https://reviews.llvm.org/D129550
Currently we only call replaceLoopPHINodesWithPreheaderValues() if
optimizeLoopExits() replaces the exit with an unconditional exit.
However, it is very common that this already happens as part of
eliminateIVComparison(), in which case we're leaving behind the
dead header phi.
Tweak the early bailout for already-constant exits to also call
replaceLoopPHINodesWithPreheaderValues().
Differential Revision: https://reviews.llvm.org/D129214
When calculating the cost of Instruction::Br in getInstructionCost
we query PredicatedBBsAfterVectorization to see if there is a
scalar predicated block. However, this meant that the decisions
being made for a given fixed-width VF were affecting the cost for a
scalable VF. As a result we were returning InstructionCost::Invalid
pointlessly for a scalable VF that should have a low cost. I
encountered this for some loops when enabling tail-folding for
scalable VFs.
Test added here:
Transforms/LoopVectorize/AArch64/sve-tail-folding-cost.ll
Differential Revision: https://reviews.llvm.org/D128272
Currently, for vectorised loops that use the get.active.lane.mask
intrinsic we only use the mask for predicated vector operations,
such as masked loads and stores, etc. The loop itself is still
controlled by comparing the canonical induction variable with the
trip count. However, for some targets this is inefficient when it's
cheap to use the mask itself to control the loop.
This patch adds support for using the active lane mask for control
flow by:
1. Generating the active lane mask for the next iteration of the
vector loop, rather than the current one. If there are still any
remaining iterations then at least the first bit of the mask will
be set.
2. Extract the first bit of this mask and use this bit for the
conditional branch.
I did this by creating a new VPActiveLaneMaskPHIRecipe that sets
up the initial PHI values in the vector loop pre-header. I've also
made use of the new BranchOnCond VPInstruction for the final
instruction in the loop region.
Differential Revision: https://reviews.llvm.org/D125301
The motivation here is to a) bring us closer into alignment with AArch64 under the assumption that codepath is better tested, and b) simplify pattern matching in an upcoming change.
The immediate impact is a significant IR reduction but a fairly minimal change in the generated assembly. Due to a difference in expansion behavior we get a saturating add vs an unsaturating one for the old code, but that's about it. This difference comes down to different handling of overflow, which doesn't seem to be possible here anyways, so the assembly codegen is arguably a minor regression. I don't expect that to matter in practice.
Differential Revision: https://reviews.llvm.org/D129221
Now that removeDeadRecipes can remove most dead recipes across a whole
VPlan, there is no need to first collect some dead instructions.
Instead removeDeadRecipes can simply clean them up.
Depends D127580.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D128408
These three subtarget features are meant to control where MVE
instructions take 1 vs 2 vs 4 architectural beats. The mve1beat feature
is described as "Model MVE instructions as a 1 beat per tick
architecture", meaning MVE instruction will execute over 4 cycles.
mve4beat is the opposite where the entire 4 beats of the MVE instruction
execute in a single cycle. The costs for the two were backwards though,
not matching the cycle counts like they should. This patch switches the
costs on the two to bring them in-line with expectations.
Differential Revision: https://reviews.llvm.org/D129141
This can enable additional region merging, while not losing
opportunities as region merging does not produce dead recipes.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D128831
The removed CHECK configurations are tested as well below, modulo the
dce/instcombine runs. This makes them redundant, and removing them
removes a substantial amount of uneeded checks.
The tests are focused on code-gen for first-order recurrences. There are
plenty of tests specifically for runtime check generation. Using noalias
to avoid runtime checks slightly simplifies the test output and ensures
the checks focus on the relevant bits and ensures the checks focus on
the relevant bits and ensures the checks focus on the relevant bits and
ensures the checks focus on the relevant bits.
D128820 stopped creating div/rem constant expressions by default;
this patch removes support for them entirely.
The getUDiv(), getExactUDiv(), getSDiv(), getExactSDiv(), getURem()
and getSRem() on ConstantExpr are removed, and ConstantExpr::get()
now only accepts binary operators for which
ConstantExpr::isSupportedBinOp() returns true. Uses of these methods
may be replaced either by corresponding IRBuilder methods, or
ConstantFoldBinaryOpOperands (if a constant result is required).
On the C API side, LLVMConstUDiv, LLVMConstExactUDiv, LLVMConstSDiv,
LLVMConstExactSDiv, LLVMConstURem and LLVMConstSRem are removed and
corresponding LLVMBuild methods should be used.
Importantly, this also means that constant expressions can no longer
trap! This patch still keeps the canTrap() method to minimize diff --
I plan to drop it in a separate NFC patch.
Differential Revision: https://reviews.llvm.org/D129148
This patch replaces the tight hard cut-off for the number of runtime
checks with a more accurate cost-driven approach.
The new approach allows vectorization with a larger number of runtime
checks in general, but only executes the vector loop (and runtime checks) if
considered profitable at runtime. Profitable here means that the cost-model
indicates that the runtime check cost + vector loop cost < scalar loop cost.
To do that, LV computes the minimum trip count for which runtime check cost
+ vector-loop-cost < scalar loop cost.
Note that there is still a hard cut-off to avoid excessive compile-time/code-size
increases, but it is much larger than the original limit.
The performance impact on standard test-suites like SPEC2006/SPEC2006/MultiSource
is mostly neutral, but the new approach can give substantial gains in cases where
we failed to vectorize before due to the over-aggressive cut-offs.
On AArch64 with -O3, I didn't observe any regressions outside the noise level (<0.4%)
and there are the following execution time improvements. Both `IRSmk` and `srad` are relatively short running, but the changes are far above the noise level for them on my benchmark system.
```
CFP2006/447.dealII/447.dealII -1.9%
CINT2017rate/525.x264_r/525.x264_r -2.2%
ASC_Sequoia/IRSmk/IRSmk -9.2%
Rodinia/srad/srad -36.1%
```
`size` regressions on AArch64 with -O3 are
```
MultiSource/Applications/hbd/hbd 90256.00 106768.00 18.3%
MultiSourc...ks/ASCI_Purple/SMG2000/smg2000 240676.00 257268.00 6.9%
MultiSourc...enchmarks/mafft/pairlocalalign 472603.00 489131.00 3.5%
External/S...2017rate/525.x264_r/525.x264_r 613831.00 630343.00 2.7%
External/S...NT2006/464.h264ref/464.h264ref 818920.00 835448.00 2.0%
External/S...te/538.imagick_r/538.imagick_r 1994730.00 2027754.00 1.7%
MultiSourc...nchmarks/tramp3d-v4/tramp3d-v4 1236471.00 1253015.00 1.3%
MultiSource/Applications/oggenc/oggenc 2108147.00 2124675.00 0.8%
External/S.../CFP2006/447.dealII/447.dealII 4742999.00 4759559.00 0.3%
External/S...rate/510.parest_r/510.parest_r 14206377.00 14239433.00 0.2%
```
Reviewed By: lebedev.ri, ebrevnov, dmgreen
Differential Revision: https://reviews.llvm.org/D109368
At the moment, the same VPlan can be used code generation of both the
main vector and epilogue vector loop. This can lead to wrong results, if
the plan is optimized based on the VF of the main vector loop and then
re-used for the epilogue loop.
One example where this is problematic is if the scalar loops need to
execute at least one iteration, e.g. due to interleave groups.
To prevent mis-compiles in the short-term, disable optimizing exit
conditions for VPlans when using epilogue vectorization. The proper fix
is to avoid re-using the same plan for both loops, which will require
support for cloning plans first.
Fixes#56319.
I looked at canonicalizing in the other direction, but that causes
many potential regressions and infinite loops because we already
(possibly wrongly) canonicalize "trunc X to i1" into an and+icmp.
This has a data layout restriction to avoid creating illegal
mask instructions, but we could remove that if we can show
that the backend can undo this when needed.
The motivating example from issue #56119 is modeled by the
PhaseOrdering test.
At the moment LoopVersioning is only created for inner-loop
vectorization. This patch moves it to LVP::execute, which means it will
also be added for epilogue vectorization. As a consequence, the proper
noalias metadata is now also added to epilogue vector loops.
LVer will be moved to VPTransformState as follow-up.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D127966
In some cases, there may be widened users of inductions even though the
plan includes the scalar VF. In those cases, make sure we still replace
the VPWidenIntOrFpInductionRecipe with scalar steps, as otherwise we may
try to execute a VPWidenIntOrFpInductionRecipe with a scalar VF.
Alternatively the patch could also split the range if needed.
This fixes a crash exposed by D123720.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D128755
This change is a bit subtle. If we have a type like <vscale x 1 x i64>, the vectorizer will currently reject vectorization. The reason is that a type like <1 x i64> is likely to get simply rescalarized, and the vectorizer doesn't want to be in the game of simple unrolling.
(I've given the example in terms of 1 x types which use a single register, but the same issue exists for any N x types which use N registers. e.g. RISCV LMULs.)
This change distinguishes scalable types from fixed types under the reasoning that converting to a scalable type isn't unrolling. Because the actual vscale isn't known until runtime, using a vscale type is potentially very profitable.
This makes an important, but unchecked, assumption. Specifically, the scalable type is assumed to only be legal per the cost model if there's actually a scalable register class which is distinct from the scalar domain. This is, to my knowledge, true for all targets which return non-invalid costs for scalable vector ops today, but in theory, we could have a target decide to lower scalable to fixed length vector or even scalar registers. If that ever happens, we'd need to revisit this code.
In practice, this patch unblocks scalable vectorization for ELEN types on RISCV.
Let me sketch one alternate implementation I considered. We could have restricted this to when we know a minimum value for vscale. Specifically, for the default +v extension for RISCV, we actually know that vscale >= 2 for ELEN types. However, doing it this way means we can't generate scalable vectors when using the various embedded vector extensions which have a minimum vscale of 1.
Differential Revision: https://reviews.llvm.org/D128542
LoopVectorizer uses getVScaleForTuning for deciding how to discount the cost of a potential vector factor by the amount of work performed. Without the callback implemented, the vectorizer was defaulting to an estimated vscale of 1. This results in fixed vectorization looking falsely profitable (since it used the command line VLEN).
The test change is pretty limited since a) we don't have much coverage of the vectorizer with scalable vectors at all, and b) what little coverage we have mostly uses i64 element types. There's a separate issue with <vscale x 1 x i64> which prevents us from getting to this stage of costing, and thus only the one test explicitly written to avoid that is visible in the diff. However, this is actually a very wide impact change as it changes the practical vectorization result when both fixed and scalable is enabled to scalable.
As an aside, I think the vectorizer is at little too strongly biased towards scalable when both are legal, but we can explore that separately. For now, let's just get the cost model working the way it was intended.
Differential Revision: https://reviews.llvm.org/D128547
We currently have a costing bug around the etype == ELEN case, so add otherwise duplicate tests to show test diffs as I work on other parts of costing.
If we have an unaligned uniform store, then when costing a scalable VF we can't emit code to scalarize it. (Well, we could, but we haven't implemented that case.) This change replaces an assert with a cost-model bailout such that we reject vectorization with the scalable VF instead of crashing.
createInductionResumeValues creates a phi node placeholder
without filling incoming values. Then it generates the incoming values.
It includes triggering of SCEV expander which may invoke SSAUpdater.
SSAUpdater has an optimization to detect number of predecessors
basing on incoming values if there is phi node.
In case phi node is not filled with incoming values - the number of predecessors
is detected as 0 and this leads to segmentation fault.
In other words SSAUpdater expects that phi is in good shape while
LoopVectorizer breaks this requirement.
The fix is just prepare all incoming values first and then build a phi node.
Reviewed By: fhahn
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D128033
This just adds some very basic vectorizer testing with both fixed and scalable vectorization enabled. For context, I just yesterday fixed a crash in costing of the splat_ptr example - see bbf3fd.
In some cases, a recurrence splice instructions needs to be inserted
between to regions, for example if the regions get re-arranged during
sinking.
Fixes#56146.
This reverts commit 7aa8a67882.
This version includes fixes to address issues uncovered after
the commit landed and discussed at D11448.
Those include:
* Limit select-traversal to selects inside the loop.
* Freeze pointers resulting from looking through selects to avoid
branch-on-poison.
TTI::prefersVectorizedAddressing() try to vectorize the addresses that lead to loads.
For aarch64, only gather/scatter (supported by SVE) can deal with vectors of addresses.
This patch specializes the hook for AArch64, to return true only when we enable SVE.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D124612
This brings us into alignment with AArch64, and in the process fixes a compiler crash bug in uniform store handling in the vectorizer.
Before the recent invalid cost bailout work, this would have also avoided crashes on invalid costs in some cases. I honestly think the vectorizer should gracefully bailout on uniform stores it can't use a scatter for, but it doesn't, so lets take the path of least resistance here. It's also possible that there are other vectorizer bugs AArch64 isn't seeing because of this hook; we don't want to be finding them either.
Differential Revision: https://reviews.llvm.org/D127514
This reverts commit 1fbdbb5595.
All known issues surfaced by this patch should have been fixed now.
The fixes included fixing issues with SCEV expansion in LV and DA's
reliance on LCSSA phis.
All information is already available in VPlan. Note that there are some
test changes, because we now can correctly look through instructions
like truncates to analyze the actual users.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D123541
Based on reviewer comments on https://reviews.llvm.org/D126692 I've
added FastMathFlags to the select instruction used when tail-folding
with reductions. These flags can then be used by InstCombine to
decide upon the most optimal floating point identity value for
fadd/fsub. Doing so unlocks further optimisations, such as folding
selects into masked loads.
Differential Revision: https://reviews.llvm.org/D126778
Try to simplify BranchOnCount to `BranchOnCond true` if TC <= UF * VF.
This is an alternative to D121899 which simplifies the VPlan directly
instead of doing so late in code-gen.
The potential benefit of doing this in VPlan is that this may help
cost-modeling in the future. The reason this is done in prepareToExecute
at the moment is that a single plan may be used for multiple VFs/UFs.
There are further simplifications that can be applied as follow ups:
1. Replace inductions with constants
2. Replace vector region with regular block.
Fixes#55354.
Depends on D126679.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D126680
The default RegisterClass is not enough to model RISCV Register.
We define risc-v's own register class to model FP Register.
This helps to better estimate the register pressure in the loop-vectorize.
Reviewed By: kito-cheng
Differential Revision: https://reviews.llvm.org/D126854
This patch removes CondBit and Predicate from VPBasicBlock. To do so,
the patch introduces a new branch-on-cond VPInstruction opcode to model
a branch on a condition explicitly.
This addresses a long-standing TODO/FIXME that blocks shouldn't be users
of VPValues. Those extra users can cause issues for VPValue-based
analyses that don't expect blocks. Addressing this fixme should allow us
to re-introduce 266ea446ab.
The generic branch opcode can also be used in follow-up patches.
Depends on D123005.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D126618
This patch updates the VPlan native path to use VPRegionBlocks for all
loops in a loop nest. Up to now, only the outermost loop used a region.
This is a step towards unifying both paths and keep things consistent
between them. It also prepares various code-gen parts for modeling the
pre-header in the inner loop vectorizer (D121624).
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D123005
Now that SimpleLoopUnswitch and other transforms no longer introduce
branch on poison, enable the -branch-on-poison-as-ub option by
default. The practical impact of this is mostly better flag
preservation in SCEV, and some freeze instructions no longer being
necessary.
Differential Revision: https://reviews.llvm.org/D125299
When reassociating GEPs, we can only keep inbounds if both original
GEPs were inbounds, and their offsets have the same sign. For the
sake of simplicity, I only handle the case where both offsets are
non-negative here.
It would probably be fine to just not preserve inbounds at all here,
but as I don't see a compile-time impact for adding the
isKnownNonNegative() calls I went with this more conservative
approach.
Fixes https://github.com/llvm/llvm-project/issues/44206.
Differential Revision: https://reviews.llvm.org/D126687
If only one of the GEPs is inbounds, then after swapping, there is
no guarantee that one of them will be inbounds as well
(see e.g. https://alive2.llvm.org/ce/z/agaCnp).
This is only a partial fix, because even if both are inbounds, the
result is not necessarily inbounds (if the offsets have different
signs).
```
void vector_reverse_i64(int *A, int *B, int n) {
#pragma clang loop vectorize_width(4, scalable)
for (int i = n-1; i >= 0; i--)
A[i] = B[i] + 1;
}
```
When option: scalable-vectorization is on (or set #pragma clang loop vectorize_width(elements, scalable)), Reverse Iterators can't loop vectorization as <vscale x elements x elementType>
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D125866
When compiling the attached new test in scalable-reductions-tf.ll we
were hitting this assertion in fixReduction:
Assertion `isa<PHINode>(U) && "Reduction exit must feed Phi's or select"
The loop contains a reduction and an intermediate store of the reduction
value. When vectorising with tail-folding the contains of 'U' in the
assertion above happened to be a scatter_store. It turns out that we
were still creating a widen recipe for the invariant store, despite
knowing that we can actually sink it. The simplest fix is to change
buildVPlanWithVPRecipes so that we look for invariant stores before
attempting to widen it.
Differential Revision: https://reviews.llvm.org/D126295
Previously, `getRegUsageForType` was implemented using
`getTypeLegalizationCost`. `getRegUsageForType` is used by the loop
vectorizer to estimate the register pressure caused by using a vector
type. However, `getTypeLegalizationCost` currently only appears to
understand splitting and not scalarization, so significantly
underestimates the register requirements.
Instead, use `getNumRegisters`, which understands when scalarization
can occur (via computeRegisterProperties).
This was discovered while investigating D118979 (Set maximum VF with
shouldMaximizeVectorBandwidth), where under fixed-length 512-bit SVE the
loop vectorizer previously ends up costing an v128i1 as 2 v64i*
registers where it actually occupies 128 i32 registers.
I'm sending this patch early for comment, I'm still doing some sanity checking
with LNT. I note that getRegisterClassForType appears to return VectorRC even
though the type in question (large vNi1 types) end up occupying scalar
registers. That might be worth fixing too.
Differential Revision: https://reviews.llvm.org/D125918
The latch may not be the exiting block. Use the exiting block instead
when looking up the incoming value of the LCSSA phi node. This fixes a
crash with early-exit loops.
Current codegen only supports scalarization of pointer inductions for
scalable VFs if they are uniform. After 3bebec659 we now may enter the
scalarization code path in VPWidenPointerInductionRecipe::execute for
scalable vectors.
Fall back to widening for scalable vectors if necessary.
This should fix a build failure when bootstrapping LLVM with SVE, e.g.
https://lab.llvm.org/buildbot/#/builders/176/builds/1723
This patch introduces a new VPLiveOut subclass of VPUser to model
exit values explicitly. The initial version handles exit values that
are neither part of induction or reduction chains nor first order
recurrence phis.
Fixes#51366, #54867, #55167, #55459
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D123537
At the moment LV runs LoopSimplify and reconstructs LCSSA form after
generating the main vector loop and before generating the epilogue
vector loop.
In practice, this adds a new exit block for the scalar loop because the
middle block now also branches to the original exit block of the scalar
loop. It also requires adding a new LCSSA phi in the newly created exit
block.
This complicates things when modeling exit values in VPlan, because we
would need to update the VPlan for the epilogue loop to update the newly
created LCSSA phi node.
But none of that should be necessary, as all analysis requiring
loop-simplify form is already done at this point and LCSSA form of the
original loop is not broken.
Reviewed By: bmahjour
Differential Revision: https://reviews.llvm.org/D125810
Update clearReductionWrapFlags to use the VPlan def-use chain from the
reduction phi recipe to drop reduction wrap flags.
This addresses an existing FIXME and fixes a crash when instructions in
the reduction chain are not used and have been removed before VPlan
codegeneration.
Fixes#55540.
The runtime check threshold should also restrict interleave count.
Otherwise, too many runtime checks will be generated for some cases.
Reviewed By: fhahn, dmgreen
Differential Revision: https://reviews.llvm.org/D122126
This patch changes the strategy for vectorizing freeze instrucion, from
replicating multiple times to widening according to selected VF.
Fixes#54992
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D125016
isImpliedCondition() currently handles and/or on the LHS, but not
on the RHS, resulting in asymmetric behavior. This patch adds two
new implication rules:
* LHS ==> (RHS1 || RHS2) if LHS ==> RHS1 or LHS ==> RHS2
* LHS ==> !(RHS1 && RHS2) if LHS ==> !RHS1 or LHS ==> !RHS2
Differential Revision: https://reviews.llvm.org/D125551
This patch adds initial support for a pointer diff based runtime check
scheme for vectorization. This scheme requires fewer computations and
checks than the existing full overlap checking, if it is applicable.
The main idea is to only check if source and sink of a dependency are
far enough apart so the accesses won't overlap in the vector loop. To do
so, it is sufficient to compute the difference and compare it to the
`VF * UF * AccessSize`. It is sufficient to check
`(Sink - Src) <u VF * UF * AccessSize` to rule out a backwards
dependence in the vector loop with the given VF and UF. If Src >=u Sink,
there is not dependence preventing vectorization, hence the overflow
should not matter and using the ULT should be sufficient.
Note that the initial version is restricted in multiple ways:
1. Pointers must only either be read or written, by a single
instruction (this allows re-constructing source/sink for
dependences with the available information)
2. Source and sink pointers must be add-recs, with matching steps
3. The step must be a constant.
3. abs(step) == AccessSize.
Most of those restrictions can be relaxed in the future.
See https://github.com/llvm/llvm-project/issues/53590.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D119078
When the loop vectoriser encounters a known low trip count it tries
to create a single predicated loop in order to get the benefit of
vectorisation and eliminate the scalar tail. However, until now the
vectoriser prevented the use of scalable vectors in this case due
to concerns in the past about stability. I believe that tail-folded
loops using scalable vectors are now sufficiently well tested that
we can enable this. For the same reason I've also enabled it when
optimising for code size too.
Tests added here:
Transforms/LoopVectorize/AArch64/sve-low-trip-count.ll
Transforms/LoopVectorize/AArch64/sve-tail-folding-optsize.ll
Transforms/LoopVectorize/RISCV/low-trip-count.ll
Differential Revision: https://reviews.llvm.org/D121595
Under some circumstances, SCEVExpander will insert new instructions when
expanding a predicate, but the final result of the expansion can be a
false constant.
In those cases, the expanded instructions may later be used by other
expansions, e.g. the trip count. This may trigger an assertion during
SCEVExpander cleanup. To avoid this, always mark the result as used.
Fixes#55100.
In InnerLoopVectorizer::getOrCreateVectorTripCount there is an
assert that the known minimum value for the VF is a power of 2
when tail-folding is enabled. However, for scalable vectors the
value of vscale may not be a power of 2, which means we have
to worry about the possibility of overflow. I have solved this
problem by adding preheader checks that prevent us from entering
the vector body if the canonical IV would overflow, i.e.
if ((IntMax - TripCount) < (VF * UF)) ... skip vector loop ...
Differential Revision: https://reviews.llvm.org/D125235
With opaque pointers, both the stored value and the address can be the
same. Only consider the recipe using the first lane only *if* the
address is not stored.
Fixes#55375.
Adds ability to vectorize loops containing a store to a loop-invariant
address as part of a reduction that isn't converted to SSA form due to
lack of aliasing info. Runtime checks are generated to ensure the store
does not alias any other accesses in the loop.
Ordered fadd reductions are not yet supported.
Differential Revision: https://reviews.llvm.org/D110235
This adds fptosi_sat and fptoui_sat to the list of trivially
vectorizable functions, mainly so that the loop vectorizer can vectorize
the instruction. Marking them as trivially vectorizable also allows them
to be SLP vectorized, and Scalarized.
The signature of a fptosi_sat requires two type overrides
(@llvm.fptosi.sat.v2i32.v2f32), unlike other intrinsics that often only
take a single. This patch alters hasVectorInstrinsicOverloadedScalarOpd
to isVectorIntrinsicWithOverloadTypeAtArg, so that it can mark the first
operand of the intrinsic as a overloaded (but not scalar) operand.
Differential Revision: https://reviews.llvm.org/D124358
'Widen' recipe are only used when actual vector values are generated.
Fix tryToWidenCall to do not create VPWidenCallRecipes for scalar vector
factors.
This was exposed by D123720, because the widened recipes are considered
vector users.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D124718
Based off the script from D103695, we were exaggerating the cost of the OR(AND(X,M),AND(Y,~M)) expansion using instruction count instead of effective throughput
Update a bunch of loop-vectorize regression tests to use the new PM
syntax (opt -passes=loop-vectorize) instead of the deprecated legacy
PM syntax (opt -loop-vectorize).
This patch extends the scope of VPlan to also include the exit (aka
middle) block.
For now, the exit block remains empty, but handling of exit values will
subsequently be moved to VPlan, by adding recipes to model exit values
in the exit block.
As a first step, this will allow fixing #51366.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D123457
This patch moves SCEV expansion of steps used by
VPWidenIntOrFpInductionRecipes to the pre-header using
VPExpandSCEVRecipe. This ensures that those steps are expanded while the
CFG is in a valid state. Previously, SCEV expansion may happen during
vector body code-generation, during which the CFG may be invalid,
causing issues with SCEV expansion.
Depends on D122095.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D122096
Most of insertelement constant folding is blocked if the vector type
is scalable. I believe we can make an exception for inserting null
into an all zeros vector.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D123413
With opaque pointers, the stored value and address can be the same.
Previously the code in VPWidenMemoryInstructionRecipe::onlyFirstLaneDemanded
incorrectly considers stores with matching store and pointer operands as
only demanding the first lane, causing a crash.
After D121624 models the pre-header in VPlan, VPExpandSCEVRecipes can be
placed there. This ensures SCEV expansion happens before modifying the
CFG during VPlan execution, when CFG is incomplete.
Depends on D121624.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D122095
This patch extends the scope of VPlan to also model the pre-header.
The pre-header can be used to place recipes that should be code-gen'd
outside the loop, like SCEV expansion.
Depends on D121623.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D121624
D67148 has removed TTI::getNumberOfRegisters(bool Vector) and
started to call TTI::getNumberOfRegisters(unsigned ClassID) from
the LoopVectorize. This has resulted in an unrestricted vectorization
on AMDGPU blowing up register pressure.
Differential Revision: https://reviews.llvm.org/D122850
This fixes the code to actually use the location of the instruction, if
available. Previously, SetInsertPoint would overwrite the insert point
set from the instruction.
When creating induction resume values, SCEV queries may rely on
LoopInfo. Make sure vector.body gets added to the loop of the pre-header
during skeleton construction.
%vector.body will be moved to the vector preheader during VPlan
execution.
Fixes#54745.
During skeleton construction for the epilogue vector loop, generic
helpers use getOrCreateTripCount, which will re-expand the trip count
computation. Instead, re-use the TripCount created during main loop
vectorization.
When MaximizeVectorBandwidth is enabled, we can end up (via calls to
collectUniformsAndScalars/setCostBasedWideningDecision through
calculateRegisterUsage) making widening decisions before we have decided
whether to fold the tail by masking. These decisions will be wrong if we
later decided to fold the tail, for example when the trip count is very
low. It will use incorrect costs for loads that should get masked, using
standard memory operation costs instead.
This still at the moment uses the EmulatedMaskMemRefHack costs (a bit
unfortunately), but the old costs without this change were 1, leading to
too optimistic vectorization.
This slightly changes the way that the MaximizeVectorBandwidth option
works to make it easier to test, always honouring the option if it is
set.
Differential Revision: https://reviews.llvm.org/D120215
In some case, like in the added test case, we can reach
selectInterleaveCount with loops that actually have a cost of 0.
Unfortunately a loop cost of 0 is also used to communicate that the cost
has not been computed yet. To resolve the crash, bail out if the cost
remains zero after computing it.
This seems like the best option, as there are multiple code paths that
return a cost of 0 to force a computation in selectInterleaveCount.
Computing the cost at multiple places up front there would unnecessarily
complicate the logic.
Fixes#54413.
This patch moves pointer induction handling from VPWidenPHIRecipe to its
own recipe. In the process, it adds all information required to generate
code for pointer inductions without relying on Legal to access the list
of induction phis.
Alternatively VPWidenPHIRecipe could also take an optional pointer to InductionDescriptor.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D121615
This patch tries to sink instructions when they are only used in a successor block.
This is a further enhancement patch based on Anna's commit:
D109700, which allows sinking an instruction having multiple uses in a single user.
In this patch, sink instructions with multiple users in a single successor block will be supported.
It could fix a known issue from rust:
https://github.com/rust-lang/rust/issues/51346#issuecomment-394443610
Reviewed By: nikic, reames
Differential Revision: https://reviews.llvm.org/D121585
This uses the existing VPlan helpers to check whether there are scalar
uses of a phi recipe. It remove one of the few remaining dependencies on
the cost model from VPlan code generation.
Depends on D121612.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D121613
This patch ensures scalars (except for uniforms) are no
longer collected (prior to LVP planning phase) for
scalable vectorization.
This is to avoid the chances of generating scalarized
instructions later (during LVP execute phase) as they
are not supported for scalable vectorization.
Relevant test has also been added.
Differential Revision: https://reviews.llvm.org/D121452