After D121595 was commited, I noticed regressions assosicated with small trip
count numbersvectorisation by tail folding with scalable vectors. As a solution
for those issues I propose to introduce the minimal trip count threshold value.
Differential Revision: https://reviews.llvm.org/D130755
This extends the handling of uniform memory operations to handle the case where a store is storing a loop invariant value. Unlike the general case of a store to an invariant address where we must use the last active lane, in this case we can use any lane since all lanes must produce the same result.
For context, the basic structure of the existing code and how the change fits in:
* First, we select a widening strategy. (The result is irrelevant for this patch.)
* Then we determine if a computation is uniform within all lanes of VF. (Note this is the uniform-per-part definition, not LAI's uniform across all unrolled iterations definition.)
* If it is, we overrule the widening strategy, and unconditionally scalarize.
* VPReplicationRecipe - which is what actually does the scalarization - knows how to handle unform-per-part values including for scalable vectors. However, we do need to know that the expression is safe to execute without predication - e.g. the uniform mem op was unconditional in the original loop. (This part was split off and already landed.)
An obvious question is why not simply implement the generic case? The answer is that I'm going to, but doing so without a canonicalization towards uniform causes regressions due to bad interaction with scalarization/uniformity of values feeding the uniform mem-op. This patch is needed to avoid those regressions.
Differential Revision: https://reviews.llvm.org/D130364
If we have interleave groups in the loop we want to vectorise then
we should fall back on normal vectorisation with a scalar epilogue. In
such cases when tail-folding is enabled we'll almost certainly go on to
create vplans with very high costs for all vector VFs and fall back on
VF=1 anyway. This is likely to be worse than if we'd just used an
unpredicated vector loop in the first place.
Once the vectoriser has proper support for analysing all the costs
for each combination of VF and vectorisation style, then we should
be able to remove this.
Added an extra test here:
Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll
Differential Revision: https://reviews.llvm.org/D128342
We already had the reasoning about uniform mem op loads; if the address is accessed at least once, we know the instruction doesn't need predicated to ensure fault safety. For stores, we do need to ensure that the values visible in memory are the same with and without predication. The easiest sub-case to check for is that all the values being stored are the same. Since we know that at least one lane is active, this tells us that the value must be visible.
Warning on confusing terminology: "uniform" vs "uniform mem op" mean two different things here, and this patch is specific to the later. It would *not* be legal to make this same change for merely "uniform" operations.
Differential Revision: https://reviews.llvm.org/D130637
This change enables vectorization (using scalable vectorization only, fixed vectors are not yet enabled) for RISCV when vector instructions are available for the target configuration.
At this point, the resulting configuration should be both stable (e.g. no crashes), and profitable (i.e. few cases where scalar loops beat vector ones), but is not going to be particularly well tuned (i.e. we emit the best possible vector loop). The goal of this change is to align testing across organizations and ensure the default configuration matches what downstreams are using as closely as possible.
This exposes a large amount of code which hasn't otherwise been on by default, and thus may not have been fully exercised. Given that, having issues fall out is not unexpected. If you find issues, please make sure to include as much information as you can when reverting this change.
Differential Revision: https://reviews.llvm.org/D129013
The problem here is target independent, but particularly painful on RISCV. If we chose to vectorize such that vscale x 2 x i32 is our widest type and fits in a register, a naive expansion of i64 comparisons results in comparisons and index types at <scalabe x 2 x i64>. This requires both an LMUL of 2, and a VSETVLI toggle in the loop. Note that we could have used <vscale x 2 x i32> for the compairons legally given the range of the trip count.
All of our other tests are functionality tests constrained to some
specific configuration. This one is intended to float with the
default configuration so that changes in that default are visible
in reviews. Note that our current default does not enable
vectorization at all; thus the current output is unvectorized.
We call tail-call-elim near the beginning of the pipeline,
but that is too early to annotate calls that get added later.
In the motivating case from issue #47852, the missing 'tail'
on memset leads to sub-optimal codegen.
I experimented with removing the early instance of
tail-call-elim instead of just adding another pass, but that
appears to be slightly worse for compile-time:
+0.15% vs. +0.08% time.
"tailcall" shows adding the pass; "tailcall2" shows moving
the pass to later, then adding the original early pass back
(so 1596886802 is functionally equivalent to 180b0439dc ):
https://llvm-compile-time-tracker.com/index.php?config=NewPM-O3&stat=instructions&remote=rotateright
Note that there was an effort to split the tail call functionality
into 2 passes - that could help reduce compile-time if we find
that this change costs more in compile-time than expected based
on the preliminary testing:
D60031
Differential Revision: https://reviews.llvm.org/D130374
This code confuses LV's "Uniform" and LVL/LAI's "Uniform". Despite the
common name, these are different.
* LVs notion means that only the first lane *of each unrolled part* is
required. That is, lanes within a single unroll factor are considered
uniform. This allows e.g. widenable memory ops to be considered
uses of uniform computations.
* LVL and LAI's notion refers to all lanes across all unrollings.
IsUniformMem is in turn defined in terms of LAI's notion. Thus a
UniformMemOpmeans is a memory operation with a loop invariant address.
This means the same address is accessed in every iteration.
The tweaked piece of code was trying to match a uniform mem op (i.e.
fully loop invariant address), but instead checked for LV's notion of
uniformity. In theory, this meant with UF > 1, we could speculate
a load which wasn't safe to execute.
This ends up being mostly silent in current code as it is nearly
impossible to create the case where this difference is visible. The
closest I've come in the test case from 54cb87, but even then, the
incorrect result is only visible in the vplan debug output; before this
change we sink the unsafely speculated load back into the user's predicate
blocks before emitting IR. Both before and after IR are correct so the
differences aren't "interesting".
The other test changes are uninteresting. They're cases where LV's uniform
analysis is slightly weaker than SCEV isLoopInvariant.
This patch adds the AArch64 hook for preferPredicateOverEpilogue,
which currently returns true if SVE is enabled and one of the
following conditions (non-exhaustive) is met:
1. The "sve-tail-folding" option is set to "all", or
2. The "sve-tail-folding" option is set to "all+noreductions"
and the loop does not contain reductions,
3. The "sve-tail-folding" option is set to "all+norecurrences"
and the loop has no first-order recurrences.
Currently the default option is "disabled", but this will be
changed in a later patch.
I've added new tests to show the options behave as expected here:
Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll
Differential Revision: https://reviews.llvm.org/D129560
This patch is in preparation for enabling vectorisation with tail-folding
by default for SVE targets. Once we do that many existing tests will
break that depend upon having normal unpredicated vector loops. For
all such tests I have added the flag:
-prefer-predicate-over-epilogue=scalar-epilogue
Differential Revision: https://reviews.llvm.org/D129137
An srem or sdiv has two cases which can cause undefined behavior, not just one. The existing code did not account for this, and as a result, we miscompiled when we encountered e.g. a srem i64 %v, -1 in a conditional block.
Instead of hand rolling the logic, just use the utility function which exists exactly for this purpose.
Differential Revision: https://reviews.llvm.org/D130106
By default if SVE is enabled we want the select instruction used for
reductions to be inside the loop, rather than outside. This makes it
possible for the backend to fold the select into the operation to
produce a single predicated add, fadd, etc.
Differential Revision: https://reviews.llvm.org/D129763
In sve-tail-folding-reductions.ll I've also added an extra RUN line
to test normal reductions, i.e. not in-loop. This patch is a pre-commit
in preparation for a follow-on patch that changes how reduction selects
are generated in the vector loop.
Differential Revision: https://reviews.llvm.org/D129761
I've simplified all of the SVE vectoriser tail-folding tests to
only care about testing the flag:
-prefer-predicate-over-epiloge=predicate-else-scalar-epilogue
In practice we always want to fall back on unpredicated vector
loops if tail-folding is not possible.
Differential Revision: https://reviews.llvm.org/D129843
This builds on the previous forked pointers patch, which only accepted
a single select as the pointer to check. A recursive function to walk
through IR has been added, which searches for either a loop-invariant
or addrec SCEV.
This will only handle a single fork at present, so selects of selects
or a GEP with a select for both the base and offset will be rejected.
There is also a recursion limit with a cli option to change it.
Reviewed By: fhahn, david-arm
Differential Revision: https://reviews.llvm.org/D108699
At the moment, the VPPRedInstPHIRecipe is not used in subsequent uses of
the predicate recipe. This incorrectly models the def-use chains, as all
later uses should use the phi recipe. Fix that by delaying recording of
the recipe.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D129436
At the moment, the cost of runtime checks for scalable vectors is
overestimated due to creating separate vscale * VF expressions for each
check. Instead re-use the first expression.
For scalable vectors, it is not sufficient to only check
MinProfitableTripCount if it is >= VF.getKnownMinValue() * UF, because
this property may not holder for larger values of vscale. In those
cases, compute umax(VF * UF, MinProfTC) instead.
This should fix
https://lab.llvm.org/buildbot/#/builders/197/builds/2262
The test shows a case where the minimum trip count check incorrectly
only checks the minimum profitable trip count computed due to runtime
checks. This is incorrect for scalable VFs, because the VF * UF may
exceed the minimum profitable trip count for vscale > 1.
This is the likely reason for
https://lab.llvm.org/buildbot/#/builders/197/builds/2262 failing.
When vectorising ordered reductions we call a function
LoopVectorizationPlanner::adjustRecipesForReductions to replace the
existing VPWidenRecipe for the fadd instruction with a new
VPReductionRecipe. We attempt to insert the new recipe in the same
place, but this is wrong because createBlockInMask may have
generated new recipes that VPReductionRecipe now depends upon. I
have changed the insertion code to append the recipe to the
VPBasicBlock instead.
Added a new RUN with tail-folding enabled to the existing test:
Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll
Differential Revision: https://reviews.llvm.org/D129550
Currently we only call replaceLoopPHINodesWithPreheaderValues() if
optimizeLoopExits() replaces the exit with an unconditional exit.
However, it is very common that this already happens as part of
eliminateIVComparison(), in which case we're leaving behind the
dead header phi.
Tweak the early bailout for already-constant exits to also call
replaceLoopPHINodesWithPreheaderValues().
Differential Revision: https://reviews.llvm.org/D129214