I have removed an unnecessary assert in LoopVectorizationCostModel::getInstructionCost
that prevented a cost being calculated for select instructions when using
scalable vectors. In addition, I have changed AArch64TTIImpl::getCmpSelInstrCost
to only do special cost calculations for fixed width vectors and fall
back to the base version for scalable vectors.
I have added a simple cost model test for cmps and selects:
test/Analysis/CostModel/sve-cmpsel.ll
and some simple tests that show we vectorize loops with cmp and select:
test/Transforms/LoopVectorize/AArch64/sve-basic-vec.ll
Differential Revision: https://reviews.llvm.org/D95039
We tend to assume that the AA pipeline is by default the default AA
pipeline and it's confusing when it's empty instead.
PR48779
Initially reverted due to BasicAA running analyses in an unspecified
order (multiple function calls as parameters), fixed by fetching
analyses before the call to construct BasicAA.
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D95117
We tend to assume that the AA pipeline is by default the default AA
pipeline and it's confusing when it's empty instead.
PR48779
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D95117
This adds cost modelling for the inloop vectorization added in
745bf6cf44. Up until now they have been modelled as the original
underlying instruction, usually an add. This happens to works OK for MVE
with instructions that are reducing into the same type as they are
working on. But MVE's instructions can perform the equivalent of an
extended MLA as a single instruction:
%sa = sext <16 x i8> A to <16 x i32>
%sb = sext <16 x i8> B to <16 x i32>
%m = mul <16 x i32> %sa, %sb
%r = vecreduce.add(%m)
->
R = VMLADAV A, B
There are other instructions for performing add reductions of
v4i32/v8i16/v16i8 into i32 (VADDV), for doing the same with v4i32->i64
(VADDLV) and for performing a v4i32/v8i16 MLA into an i64 (VMLALDAV).
The i64 are particularly interesting as there are no native i64 add/mul
instructions, leading to the i64 add and mul naturally getting very
high costs.
Also worth mentioning, under NEON there is the concept of a sdot/udot
instruction which performs a partial reduction from a v16i8 to a v4i32.
They extend and mul/sum the first four elements from the inputs into the
first element of the output, repeating for each of the four output
lanes. They could possibly be represented in the same way as above in
llvm, so long as a vecreduce.add could perform a partial reduction. The
vectorizer would then produce a combination of in and outer loop
reductions to efficiently use the sdot and udot instructions. Although
this patch does not do that yet, it does suggest that separating the
input reduction type from the produced result type is a useful concept
to model. It also shows that a MLA reduction as a single instruction is
fairly common.
This patch attempt to improve the costmodelling of in-loop reductions
by:
- Adding some pattern matching in the loop vectorizer cost model to
match extended reduction patterns that are optionally extended and/or
MLA patterns. This marks the cost of the reduction instruction correctly
and the sext/zext/mul leading up to it as free, which is otherwise
difficult to tell and may get a very high cost. (In the long run this
can hopefully be replaced by vplan producing a single node and costing
it correctly, but that is not yet something that vplan can do).
- getExtendedAddReductionCost is added to query the cost of these
extended reduction patterns.
- Expanded the ARM costs to account for these expanded sizes, which is a
fairly simple change in itself.
- Some minor alterations to allow inloop reduction larger than the highest
vector width and i64 MVE reductions.
- An extra InLoopReductionImmediateChains map was added to the vectorizer
for it to efficiently detect which instructions are reductions in the
cost model.
- The tests have some updates to show what I believe is optimal
vectorization and where we are now.
Put together this can greatly improve performance for reduction loop
under MVE.
Differential Revision: https://reviews.llvm.org/D93476
It turns out the vectorizer calls the getIntrinsicInstrCost functions
with a scalar return type and vector VF. This updates the costmodel to
handle that, still producing the correct vector costs.
A vectorizer test is added to show it vectorizing at the correct factor
again.
Just like llvm.assume, there are a lot of cases where we can just ignore llvm.experimental.noalias.scope.decl.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D93042
Keys matching the tombstone/empty special values cannot be inserted in a
DenseMap. Under some circumstances, LV tries to add members to an
interleave group that match the special values. Skip adding such
members. This is unlikely to have any impact in practice, because
interleave groups with such indices are very likely to not be
vectorized, due to gaps.
This issue has been surfaced by fuzzing, see
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=11638
This relates to the ongoing effort to support vectorization of multiple exit loops (see D93317).
The previous code assumed that LCSSA phis were always single entry before the vectorizer ran. This was correct, but only because the vectorizer allowed only a single exiting edge. There's nothing in the definition of LCSSA which requires single entry phis.
A common case where this comes up is with a loop with multiple exiting blocks which all reach a common exit block. (e.g. see the test updates)
Differential Revision: https://reviews.llvm.org/D93725
This patch unifies the way recipes and VPValues are printed after the
transition to VPDef.
VPSlotTracker has been updated to iterate over all recipes and all
their defined values to number those. There is no need to number
values in Value2VPValue.
It also updates a few places that only used slot numbers for
VPInstruction. All recipes now can produce numbered VPValues.
In the following loop:
void foo(int *a, int *b, int N) {
for (int i=0; i<N; ++i)
a[i + 4] = a[i] + b[i];
}
The loop dependence constrains the VF to a maximum of (4, fixed), which
would mean using <4 x i32> as the vector type in vectorization.
Extending this to scalable vectorization, a VF of (4, scalable) implies
a vector type of <vscale x 4 x i32>. To determine if this is legal
vscale must be taken into account. For this example, unless
max(vscale)=1, it's unsafe to vectorize.
For SVE, the number of bits in an SVE register is architecturally
defined to be a multiple of 128 bits with a maximum of 2048 bits, thus
the maximum vscale is 16. In the loop above it is therefore unfeasible
to vectorize with SVE. However, in this loop:
void foo(int *a, int *b, int N) {
#pragma clang loop vectorize_width(X, scalable)
for (int i=0; i<N; ++i)
a[i + 32] = a[i] + b[i];
}
As long as max(vscale) multiplied by the number of lanes 'X' doesn't
exceed the dependence distance, it is safe to vectorize. For SVE a VF of
(2, scalable) is within this constraint, since a vector of <16 x 2 x 32>
will have no dependencies between lanes. For any number of lanes larger
than this it would be unsafe to vectorize.
This patch extends 'computeFeasibleMaxVF' to legalize scalable VFs
specified as loop hints, implementing the following behaviour:
* If the backend does not support scalable vectors, ignore the hint.
* If scalable vectorization is unfeasible given the loop
dependence, like in the first example above for SVE, then use a
fixed VF.
* Accept scalable VFs if it's safe to do so.
* Otherwise, clamp scalable VFs that exceed the maximum safe VF.
Reviewed By: sdesmalen, fhahn, david-arm
Differential Revision: https://reviews.llvm.org/D91718
The new test case here contains a first order recurrences and an
instruction that is replicated. The first order recurrence forces an
instruction to be sunk _into_, as opposed to after the replication
region. That causes several things to go wrong including registering
vector instructions multiple times and failing to create dominance
relations correctly.
Instead we should be sinking to after the replication region, which is
what this patch makes sure happens.
Differential Revision: https://reviews.llvm.org/D93629
This patch makes SLP and LV emit operations with initial vectors set to poison constant instead of undef.
This is a part of efforts for using poison vector instead of undef to represent "doesn't care" vector.
The goal is to make nice shufflevector optimizations valid that is currently incorrect due to the tricky interaction between undef and poison (see https://bugs.llvm.org/show_bug.cgi?id=44185 ).
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D94061
Creating in-loop reductions relies on IR references to map
IR values to VPValues after interleave group creation.
Make sure we re-add the updated member to the plan, so the look-ups
still work as expected
This fixes a crash reported after D90562.
Let getTruncateExpr() short-circuit to zero when the value being truncated is
known to have at least as many trailing zeros as the target type.
Differential Revision: https://reviews.llvm.org/D93973
The loop vectorizer avoids folding the tail for loop's whose trip-count is
known to SCEV to be divisible by VF. In this case the assumption providing this
information is not taken into account, so the tail is needlessly folded.
If DoExtraAnalysis is true (e.g. because remarks are enabled), we
continue with the analysis rather than exiting. Update code to
conditionally check if the ExitBB has phis or not a single predecessor.
Otherwise a nullptr is dereferenced with DoExtraAnalysis.
As mentioned in D93793, there are quite a few places where unary `IRBuilder::CreateShuffleVector(X, Mask)` can be used
instead of `IRBuilder::CreateShuffleVector(X, Undef, Mask)`.
Let's update them.
Actually, it would have been more natural if the patches were made in this order:
(1) let them use unary CreateShuffleVector first
(2) update IRBuilder::CreateShuffleVector to use poison as a placeholder value (D93793)
The order is swapped, but in terms of correctness it is still fine.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D93923
This patch updates IRBuilder to create insertelement/shufflevector using poison as a placeholder.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D93793
This reverts commit 4ffcd4fe9a thus restoring e4df6a40da.
The only change from the original patch is to add "llvm::" before the call to empty(iterator_range). This is a speculative fix for the ambiguity reported on some builders.
This patch is a major step towards supporting multiple exit loops in the vectorizer. This patch on it's own extends the loop forms allowed in two ways:
single exit loops which are not bottom tested
multiple exit loops w/ a single exit block reached from all exits and no phis in the exit block (because of LCSSA this implies no values defined in the loop used later)
The restrictions on multiple exit loop structures will be removed in follow up patches; disallowing cases for now makes the code changes smaller and more obvious. As before, we can only handle loops with entirely analyzable exits. Removing that restriction is much harder, and is not part of currently planned efforts.
The basic idea here is that we can force the last iteration to run in the scalar epilogue loop (if we have one). From the definition of SCEV's backedge taken count, we know that no earlier iteration can exit the vector body. As such, we can leave the decision on which exit to be taken to the scalar code and generate a bottom tested vector loop which runs all but the last iteration.
The existing code already had the notion of requiring one iteration in the scalar epilogue, this patch is mainly about generalizing that support slightly, making sure we don't try to use this mechanism when tail folding, and updating the code to reflect the difference between a single exit block and a unique exit block (very mechanical).
Differential Revision: https://reviews.llvm.org/D93317
Currently undef is used as a don’t-care vector when constructing a vector using a series of insertelement.
However, this is problematic because undef isn’t undefined enough.
Especially, a sequence of insertelement can be optimized to shufflevector, but using undef as its placeholder makes shufflevector a poison-blocking instruction because undef cannot be optimized to poison.
This makes a few straightforward optimizations incorrect, such as:
```
; https://bugs.llvm.org/show_bug.cgi?id=44185
define <4 x float> @insert_not_undef_shuffle_translate_commute(float %x, <4 x float> %y, <4 x float> %q) {
%xv = insertelement <4 x float> %q, float %x, i32 2
%r = shufflevector <4 x float> %y, <4 x float> %xv, <4 x i32> { 0, 6, 2, undef }
ret <4 x float> %r ; %r[3] is undef
}
=>
define <4 x float> @insert_not_undef_shuffle_translate_commute(float %x, <4 x float> %y, <4 x float> %q) {
%r = insertelement <4 x float> %y, float %x, i32 1
ret <4 x float> %r ; %r[3] = %y[3], incorrect if %y[3] = poison
}
Transformation doesn't verify!
ERROR: Target is more poisonous than source
```
I’d like to suggest
1. Using poison as insertelement’s placeholder value (IRBuilder::CreateVectorSplat should be patched too)
2. Updating shufflevector’s semantics to return poison element if mask is undef
Note that poison is currently lowered into UNDEF in SelDag, so codegen part is okay.
m_Undef() matches PoisonValue as well, so existing optimizations will still fire.
The only concern is hidden miscompilations that will go incorrect when poison constant is given.
A conservative way is copying all tests having `insertelement undef` & replacing it with `insertelement poison` & run Alive2 on it, but it will create many tests and people won’t like it. :(
Instead, I’ll simply locally maintain the tests and run Alive2.
If there is any bug found, I’ll report it.
Relevant links: https://bugs.llvm.org/show_bug.cgi?id=43958 , http://lists.llvm.org/pipermail/llvm-dev/2019-November/137242.html
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D93586
Previously the branch from the middle block to the scalar preheader & exit
was being set-up at the end of skeleton creation in completeLoopSkeleton.
Inserting SCEV or runtime checks may result in LCSSA phis being created,
if they are required. Adjusting branches afterwards may break those
PHIs.
To avoid this, we can instead create the branch from the middle block
to the exit after we created the middle block, so we have the final CFG
before potentially adjusting/creating PHIs.
This fixes a crash for the included test case. For the non-crashing
case, this is almost a NFC with respect to the generated code. The
only change is the order of the predecessors of the involved branch
targets.
Note an assertion was moved from LoopVersioning() to
LoopVersioning::versionLoop. Adjusting the branches means loop-simplify
form may be broken before constructing LoopVersioning. But LV only uses
LoopVersioning to annotate the loop instructions with !noalias metadata,
which does not require loop-simplify form.
This is a fix for an existing issue uncovered by D93317.
When the trip-count is provably divisible by the maximal/chosen VF, folding the
loop's tail during vectorization is redundant. This commit extends the existing
test for constant trip-counts to any trip-count known to be divisible by
maximal/selected VF by SCEV.
Differential Revision: https://reviews.llvm.org/D93615
And that exposes that a number of tests don't *actually* manage to
maintain DomTree validity, which is inline with my observations.
Once again, SimlifyCFG pass currently does not require/preserve DomTree
by default, so this is effectively NFC.
Temporarily revert commit 8b1c4e310c.
After 8b1c4e310c the compile-time for `MultiSource/Benchmarks/MiBench/consumer-lame`
dramatically increases with -O3 & LTO, causing issues for builders with
that configuration.
I filed PR48553 with a smallish reproducer that shows a 10-100x compile
time increase.
... so just ensure that we pass DomTreeUpdater it into it.
Fixes DomTree preservation for a large number of tests,
all of which are marked as such so that they do not regress.
... so just ensure that we pass DomTreeUpdater it into it.
Fixes DomTree preservation for a large number of tests,
all of which are marked as such so that they do not regress.
... so just ensure that we pass DomTreeUpdater it into it.
Fixes DomTree preservation for a large number of tests,
all of which are marked as such so that they do not regress.
First step after e113317958,
in these tests, DomTree is valid afterwards, so mark them as such,
so that they don't regress.
In further steps, SimplifyCFG transforms shall taught to preserve DomTree,
in as small steps as possible.