This is another step (see D95452) towards correcting fast-math-flags
bugs in vector reductions.
There are multiple bugs visible in the test diffs, and this is still
not working as it should. We still use function attributes (rather
than FMF) to drive part of the logic, but we are not checking for
the correct FP function attributes.
Note that FMF may not be propagated optimally on selects (example
in https://llvm.org/PR35607 ). That's why I'm proposing to union the
FMF of a fcmp+select pair and avoid regressions on existing vectorizer
tests.
Differential Revision: https://reviews.llvm.org/D95690
D90687 introduced a crash:
llvm::LoopVectorizationCostModel::computeMaxVF(llvm::ElementCount, unsigned int):
Assertion `WideningDecisions.empty() && Uniforms.empty() && Scalars.empty() &&
"No decisions should have been taken at this point"' failed.
when compiling the following C code:
typedef struct {
char a;
} b;
b *c;
int d, e;
int f() {
int g = 0;
for (; d; d++) {
e = 0;
for (; e < c[d].a; e++)
g++;
}
return g;
}
with:
clang -Os -target hexagon -mhvx -fvectorize -mv67 testcase.c -S -o -
This occurred since prior to D90687 computeFeasibleMaxVF would only be
called in computeMaxVF when a scalar epilogue was allowed, but now it's
always called. This causes the assert above since computeFeasibleMaxVF
collects all viable VFs larger than the default MaxVF, and for each VF
calculates the register usage which results in analysis being done the
assert above guards against. This can occur in computeFeasibleMaxVF if
TTI.shouldMaximizeVectorBandwidth and this target hook is implemented in
the hexagon backend to always return true.
Reported by @iajbar.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D94869
I am trying to untangle the fast-math-flags propagation logic
in the vectorizers (see a6f022127 for SLP).
The loop vectorizer has a mix of checking FP function attributes,
IR-level FMF, and just wrong assumptions.
I am trying to avoid regressions while fixing this, and I think
the IR-level logic is good enough for that, but it's hard to say
for sure. This would be the 1st step in the clean-up.
The existing test that I changed to include 'fast' actually shows
a miscompile: the function only had the equivalent of nnan, but we
created new instructions that had fast (all FMF set). This is
similar to the example in https://llvm.org/PR35538
Differential Revision: https://reviews.llvm.org/D95452
The existing test has less FMF than we might expect if
our FMF was fixed (on all FP values), so this additional
test is intended to check propagation in a more "normal"
example.
I have removed an unnecessary assert in LoopVectorizationCostModel::getInstructionCost
that prevented a cost being calculated for select instructions when using
scalable vectors. In addition, I have changed AArch64TTIImpl::getCmpSelInstrCost
to only do special cost calculations for fixed width vectors and fall
back to the base version for scalable vectors.
I have added a simple cost model test for cmps and selects:
test/Analysis/CostModel/sve-cmpsel.ll
and some simple tests that show we vectorize loops with cmp and select:
test/Transforms/LoopVectorize/AArch64/sve-basic-vec.ll
Differential Revision: https://reviews.llvm.org/D95039
We tend to assume that the AA pipeline is by default the default AA
pipeline and it's confusing when it's empty instead.
PR48779
Initially reverted due to BasicAA running analyses in an unspecified
order (multiple function calls as parameters), fixed by fetching
analyses before the call to construct BasicAA.
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D95117
We tend to assume that the AA pipeline is by default the default AA
pipeline and it's confusing when it's empty instead.
PR48779
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D95117
This adds cost modelling for the inloop vectorization added in
745bf6cf44. Up until now they have been modelled as the original
underlying instruction, usually an add. This happens to works OK for MVE
with instructions that are reducing into the same type as they are
working on. But MVE's instructions can perform the equivalent of an
extended MLA as a single instruction:
%sa = sext <16 x i8> A to <16 x i32>
%sb = sext <16 x i8> B to <16 x i32>
%m = mul <16 x i32> %sa, %sb
%r = vecreduce.add(%m)
->
R = VMLADAV A, B
There are other instructions for performing add reductions of
v4i32/v8i16/v16i8 into i32 (VADDV), for doing the same with v4i32->i64
(VADDLV) and for performing a v4i32/v8i16 MLA into an i64 (VMLALDAV).
The i64 are particularly interesting as there are no native i64 add/mul
instructions, leading to the i64 add and mul naturally getting very
high costs.
Also worth mentioning, under NEON there is the concept of a sdot/udot
instruction which performs a partial reduction from a v16i8 to a v4i32.
They extend and mul/sum the first four elements from the inputs into the
first element of the output, repeating for each of the four output
lanes. They could possibly be represented in the same way as above in
llvm, so long as a vecreduce.add could perform a partial reduction. The
vectorizer would then produce a combination of in and outer loop
reductions to efficiently use the sdot and udot instructions. Although
this patch does not do that yet, it does suggest that separating the
input reduction type from the produced result type is a useful concept
to model. It also shows that a MLA reduction as a single instruction is
fairly common.
This patch attempt to improve the costmodelling of in-loop reductions
by:
- Adding some pattern matching in the loop vectorizer cost model to
match extended reduction patterns that are optionally extended and/or
MLA patterns. This marks the cost of the reduction instruction correctly
and the sext/zext/mul leading up to it as free, which is otherwise
difficult to tell and may get a very high cost. (In the long run this
can hopefully be replaced by vplan producing a single node and costing
it correctly, but that is not yet something that vplan can do).
- getExtendedAddReductionCost is added to query the cost of these
extended reduction patterns.
- Expanded the ARM costs to account for these expanded sizes, which is a
fairly simple change in itself.
- Some minor alterations to allow inloop reduction larger than the highest
vector width and i64 MVE reductions.
- An extra InLoopReductionImmediateChains map was added to the vectorizer
for it to efficiently detect which instructions are reductions in the
cost model.
- The tests have some updates to show what I believe is optimal
vectorization and where we are now.
Put together this can greatly improve performance for reduction loop
under MVE.
Differential Revision: https://reviews.llvm.org/D93476
It turns out the vectorizer calls the getIntrinsicInstrCost functions
with a scalar return type and vector VF. This updates the costmodel to
handle that, still producing the correct vector costs.
A vectorizer test is added to show it vectorizing at the correct factor
again.
Just like llvm.assume, there are a lot of cases where we can just ignore llvm.experimental.noalias.scope.decl.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D93042
Keys matching the tombstone/empty special values cannot be inserted in a
DenseMap. Under some circumstances, LV tries to add members to an
interleave group that match the special values. Skip adding such
members. This is unlikely to have any impact in practice, because
interleave groups with such indices are very likely to not be
vectorized, due to gaps.
This issue has been surfaced by fuzzing, see
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=11638
This relates to the ongoing effort to support vectorization of multiple exit loops (see D93317).
The previous code assumed that LCSSA phis were always single entry before the vectorizer ran. This was correct, but only because the vectorizer allowed only a single exiting edge. There's nothing in the definition of LCSSA which requires single entry phis.
A common case where this comes up is with a loop with multiple exiting blocks which all reach a common exit block. (e.g. see the test updates)
Differential Revision: https://reviews.llvm.org/D93725
This patch unifies the way recipes and VPValues are printed after the
transition to VPDef.
VPSlotTracker has been updated to iterate over all recipes and all
their defined values to number those. There is no need to number
values in Value2VPValue.
It also updates a few places that only used slot numbers for
VPInstruction. All recipes now can produce numbered VPValues.
In the following loop:
void foo(int *a, int *b, int N) {
for (int i=0; i<N; ++i)
a[i + 4] = a[i] + b[i];
}
The loop dependence constrains the VF to a maximum of (4, fixed), which
would mean using <4 x i32> as the vector type in vectorization.
Extending this to scalable vectorization, a VF of (4, scalable) implies
a vector type of <vscale x 4 x i32>. To determine if this is legal
vscale must be taken into account. For this example, unless
max(vscale)=1, it's unsafe to vectorize.
For SVE, the number of bits in an SVE register is architecturally
defined to be a multiple of 128 bits with a maximum of 2048 bits, thus
the maximum vscale is 16. In the loop above it is therefore unfeasible
to vectorize with SVE. However, in this loop:
void foo(int *a, int *b, int N) {
#pragma clang loop vectorize_width(X, scalable)
for (int i=0; i<N; ++i)
a[i + 32] = a[i] + b[i];
}
As long as max(vscale) multiplied by the number of lanes 'X' doesn't
exceed the dependence distance, it is safe to vectorize. For SVE a VF of
(2, scalable) is within this constraint, since a vector of <16 x 2 x 32>
will have no dependencies between lanes. For any number of lanes larger
than this it would be unsafe to vectorize.
This patch extends 'computeFeasibleMaxVF' to legalize scalable VFs
specified as loop hints, implementing the following behaviour:
* If the backend does not support scalable vectors, ignore the hint.
* If scalable vectorization is unfeasible given the loop
dependence, like in the first example above for SVE, then use a
fixed VF.
* Accept scalable VFs if it's safe to do so.
* Otherwise, clamp scalable VFs that exceed the maximum safe VF.
Reviewed By: sdesmalen, fhahn, david-arm
Differential Revision: https://reviews.llvm.org/D91718
The new test case here contains a first order recurrences and an
instruction that is replicated. The first order recurrence forces an
instruction to be sunk _into_, as opposed to after the replication
region. That causes several things to go wrong including registering
vector instructions multiple times and failing to create dominance
relations correctly.
Instead we should be sinking to after the replication region, which is
what this patch makes sure happens.
Differential Revision: https://reviews.llvm.org/D93629
This patch makes SLP and LV emit operations with initial vectors set to poison constant instead of undef.
This is a part of efforts for using poison vector instead of undef to represent "doesn't care" vector.
The goal is to make nice shufflevector optimizations valid that is currently incorrect due to the tricky interaction between undef and poison (see https://bugs.llvm.org/show_bug.cgi?id=44185 ).
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D94061
Creating in-loop reductions relies on IR references to map
IR values to VPValues after interleave group creation.
Make sure we re-add the updated member to the plan, so the look-ups
still work as expected
This fixes a crash reported after D90562.
Let getTruncateExpr() short-circuit to zero when the value being truncated is
known to have at least as many trailing zeros as the target type.
Differential Revision: https://reviews.llvm.org/D93973
The loop vectorizer avoids folding the tail for loop's whose trip-count is
known to SCEV to be divisible by VF. In this case the assumption providing this
information is not taken into account, so the tail is needlessly folded.
If DoExtraAnalysis is true (e.g. because remarks are enabled), we
continue with the analysis rather than exiting. Update code to
conditionally check if the ExitBB has phis or not a single predecessor.
Otherwise a nullptr is dereferenced with DoExtraAnalysis.
As mentioned in D93793, there are quite a few places where unary `IRBuilder::CreateShuffleVector(X, Mask)` can be used
instead of `IRBuilder::CreateShuffleVector(X, Undef, Mask)`.
Let's update them.
Actually, it would have been more natural if the patches were made in this order:
(1) let them use unary CreateShuffleVector first
(2) update IRBuilder::CreateShuffleVector to use poison as a placeholder value (D93793)
The order is swapped, but in terms of correctness it is still fine.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D93923
This patch updates IRBuilder to create insertelement/shufflevector using poison as a placeholder.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D93793
This reverts commit 4ffcd4fe9a thus restoring e4df6a40da.
The only change from the original patch is to add "llvm::" before the call to empty(iterator_range). This is a speculative fix for the ambiguity reported on some builders.
This patch is a major step towards supporting multiple exit loops in the vectorizer. This patch on it's own extends the loop forms allowed in two ways:
single exit loops which are not bottom tested
multiple exit loops w/ a single exit block reached from all exits and no phis in the exit block (because of LCSSA this implies no values defined in the loop used later)
The restrictions on multiple exit loop structures will be removed in follow up patches; disallowing cases for now makes the code changes smaller and more obvious. As before, we can only handle loops with entirely analyzable exits. Removing that restriction is much harder, and is not part of currently planned efforts.
The basic idea here is that we can force the last iteration to run in the scalar epilogue loop (if we have one). From the definition of SCEV's backedge taken count, we know that no earlier iteration can exit the vector body. As such, we can leave the decision on which exit to be taken to the scalar code and generate a bottom tested vector loop which runs all but the last iteration.
The existing code already had the notion of requiring one iteration in the scalar epilogue, this patch is mainly about generalizing that support slightly, making sure we don't try to use this mechanism when tail folding, and updating the code to reflect the difference between a single exit block and a unique exit block (very mechanical).
Differential Revision: https://reviews.llvm.org/D93317
Currently undef is used as a don’t-care vector when constructing a vector using a series of insertelement.
However, this is problematic because undef isn’t undefined enough.
Especially, a sequence of insertelement can be optimized to shufflevector, but using undef as its placeholder makes shufflevector a poison-blocking instruction because undef cannot be optimized to poison.
This makes a few straightforward optimizations incorrect, such as:
```
; https://bugs.llvm.org/show_bug.cgi?id=44185
define <4 x float> @insert_not_undef_shuffle_translate_commute(float %x, <4 x float> %y, <4 x float> %q) {
%xv = insertelement <4 x float> %q, float %x, i32 2
%r = shufflevector <4 x float> %y, <4 x float> %xv, <4 x i32> { 0, 6, 2, undef }
ret <4 x float> %r ; %r[3] is undef
}
=>
define <4 x float> @insert_not_undef_shuffle_translate_commute(float %x, <4 x float> %y, <4 x float> %q) {
%r = insertelement <4 x float> %y, float %x, i32 1
ret <4 x float> %r ; %r[3] = %y[3], incorrect if %y[3] = poison
}
Transformation doesn't verify!
ERROR: Target is more poisonous than source
```
I’d like to suggest
1. Using poison as insertelement’s placeholder value (IRBuilder::CreateVectorSplat should be patched too)
2. Updating shufflevector’s semantics to return poison element if mask is undef
Note that poison is currently lowered into UNDEF in SelDag, so codegen part is okay.
m_Undef() matches PoisonValue as well, so existing optimizations will still fire.
The only concern is hidden miscompilations that will go incorrect when poison constant is given.
A conservative way is copying all tests having `insertelement undef` & replacing it with `insertelement poison` & run Alive2 on it, but it will create many tests and people won’t like it. :(
Instead, I’ll simply locally maintain the tests and run Alive2.
If there is any bug found, I’ll report it.
Relevant links: https://bugs.llvm.org/show_bug.cgi?id=43958 , http://lists.llvm.org/pipermail/llvm-dev/2019-November/137242.html
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D93586
Previously the branch from the middle block to the scalar preheader & exit
was being set-up at the end of skeleton creation in completeLoopSkeleton.
Inserting SCEV or runtime checks may result in LCSSA phis being created,
if they are required. Adjusting branches afterwards may break those
PHIs.
To avoid this, we can instead create the branch from the middle block
to the exit after we created the middle block, so we have the final CFG
before potentially adjusting/creating PHIs.
This fixes a crash for the included test case. For the non-crashing
case, this is almost a NFC with respect to the generated code. The
only change is the order of the predecessors of the involved branch
targets.
Note an assertion was moved from LoopVersioning() to
LoopVersioning::versionLoop. Adjusting the branches means loop-simplify
form may be broken before constructing LoopVersioning. But LV only uses
LoopVersioning to annotate the loop instructions with !noalias metadata,
which does not require loop-simplify form.
This is a fix for an existing issue uncovered by D93317.
When the trip-count is provably divisible by the maximal/chosen VF, folding the
loop's tail during vectorization is redundant. This commit extends the existing
test for constant trip-counts to any trip-count known to be divisible by
maximal/selected VF by SCEV.
Differential Revision: https://reviews.llvm.org/D93615
And that exposes that a number of tests don't *actually* manage to
maintain DomTree validity, which is inline with my observations.
Once again, SimlifyCFG pass currently does not require/preserve DomTree
by default, so this is effectively NFC.
Temporarily revert commit 8b1c4e310c.
After 8b1c4e310c the compile-time for `MultiSource/Benchmarks/MiBench/consumer-lame`
dramatically increases with -O3 & LTO, causing issues for builders with
that configuration.
I filed PR48553 with a smallish reproducer that shows a 10-100x compile
time increase.
... so just ensure that we pass DomTreeUpdater it into it.
Fixes DomTree preservation for a large number of tests,
all of which are marked as such so that they do not regress.
... so just ensure that we pass DomTreeUpdater it into it.
Fixes DomTree preservation for a large number of tests,
all of which are marked as such so that they do not regress.
... so just ensure that we pass DomTreeUpdater it into it.
Fixes DomTree preservation for a large number of tests,
all of which are marked as such so that they do not regress.
First step after e113317958,
in these tests, DomTree is valid afterwards, so mark them as such,
so that they don't regress.
In further steps, SimplifyCFG transforms shall taught to preserve DomTree,
in as small steps as possible.
When it comes to the scalar cost of any predicated block, the loop
vectorizer by default regards this predication as a sign that it is
looking at an if-conversion and divides the scalar cost of the block by
2, assuming it would only be executed half the time. This however makes
no sense if the predication has been introduced to tail predicate the
loop.
Original patch by Anna Welker
Differential Revision: https://reviews.llvm.org/D86452
If we have two unknown sizes and one GEP operand and one non-GEP
operand, then we currently simply return MayAlias. The comment says
we can't do anything useful ... but we can! We can still check that
the underlying objects are different (and do so for the GEP-GEP case).
To reduce the compile-time impact, this a) checks this early, before
doing the relatively expensive GEP decomposition that will not be
used and b) doesn't do the check if the other operand is a phi or
select. In that case, the phi/select will already recurse, so this
would just do two slightly different recursive walks that arrive at
the same roots.
Compile-time is still a bit of a mixed bag: https://llvm-compile-time-tracker.com/compare.php?from=624af932a808b363a888139beca49f57313d9a3b&to=845356e14adbe651a553ed11318ddb5e79a24bcd&stat=instructions
On average this is a small improvement, but sqlite with ThinLTO has
a 0.5% regression (lencod has a 1% improvement).
The BasicAA test case checks this by using two memsets with unknown
size. However, the more interesting case where this is useful is
the LoopVectorize test case, as analysis of accesses in loops tends
to always us unknown sizes.
Differential Revision: https://reviews.llvm.org/D92401
* Steps are scaled by `vscale`, a runtime value.
* Changes to circumvent the cost-model for now (temporary)
so that the cost-model can be implemented separately.
This can vectorize the following loop [1]:
void loop(int N, double *a, double *b) {
#pragma clang loop vectorize_width(4, scalable)
for (int i = 0; i < N; i++) {
a[i] = b[i] + 1.0;
}
}
[1] This source-level example is based on the pragma proposed
separately in D89031. This patch only implements the LLVM part.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D91077
The -enable-new-pm=1 translation caused loop-vectorize to run on all
functions, then instcombine, rather than all passes on one function then
the next. This caused the output of -debug-only and -print-after to be
interleaved in an unexpected way.
This change should be fairly straight forward. If we've reached a call, check to see if we can tell the result is dereferenceable from information about the minimum object size returned by the call.
To control compile time impact, I'm only adding the call for base facts in the routine. getObjectSize can also do recursive reasoning, and we don't want that general capability here.
As a follow up patch (without separate review), I will plumb through the missing TLI parameter. That will have the effect of extending this to known libcalls - malloc, new, and the like - whereas currently this only covers calls with the explicit allocsize attribute.
Differential Revision: https://reviews.llvm.org/D90341
The initial step of the uniform-after-vectorization (lane-0 demanded only) analysis was very awkwardly written. It would revisit use list of each pointer operand of a widened load/store. As a result, it was in the worst case O(N^2) where N was the number of instructions in a loop, and had restricted operand Value types to reduce the size of use lists.
This patch replaces the original algorithm with one which is at most O(2N) in the number of instructions in the loop. (The key observation is that each use of a potentially interesting pointer is visited at most twice, once on first scan, once in the use list of *it's* operand. Only instructions within the loop have their uses scanned.)
In the process, we remove a restriction which required the operand of the uniform mem op to itself be an instruction. This allows detection of uniform mem ops involving global addresses.
Differential Revision: https://reviews.llvm.org/D92056
This is yet another attempt at providing support for epilogue
vectorization following discussions raised in RFC http://llvm.1065342.n5.nabble.com/llvm-dev-Proposal-RFC-Epilog-loop-vectorization-tt106322.html#none
and reviews D30247 and D88819.
Similar to D88819, this patch achieve epilogue vectorization by
executing a single vplan twice: once on the main loop and a second
time on the epilogue loop (using a different VF). However it's able
to handle more loops, and generates more optimal control flow for
cases where the trip count is too small to execute any code in vector
form.
Reviewed By: SjoerdMeijer
Differential Revision: https://reviews.llvm.org/D89566
In this patch I have added support for a new loop hint called
vectorize.scalable.enable that says whether we should enable scalable
vectorization or not. If a user wants to instruct the compiler to
vectorize a loop with scalable vectors they can now do this as
follows:
br i1 %exitcond, label %for.end, label %for.body, !llvm.loop !2
...
!2 = !{!2, !3, !4}
!3 = !{!"llvm.loop.vectorize.width", i32 8}
!4 = !{!"llvm.loop.vectorize.scalable.enable", i1 true}
Setting the hint to false simply reverts the behaviour back to the
default, using fixed width vectors.
Differential Revision: https://reviews.llvm.org/D88962
This is yet another attempt at providing support for epilogue
vectorization following discussions raised in RFC http://llvm.1065342.n5.nabble.com/llvm-dev-Proposal-RFC-Epilog-loop-vectorization-tt106322.html#none
and reviews D30247 and D88819.
Similar to D88819, this patch achieve epilogue vectorization by
executing a single vplan twice: once on the main loop and a second
time on the epilogue loop (using a different VF). However it's able
to handle more loops, and generates more optimal control flow for
cases where the trip count is too small to execute any code in vector
form.
Reviewed By: SjoerdMeijer
Differential Revision: https://reviews.llvm.org/D89566
In the following loop the dependence distance is 2 and can only be
vectorized if the vector length is no larger than this.
void foo(int *a, int *b, int N) {
#pragma clang loop vectorize(enable) vectorize_width(4)
for (int i=0; i<N; ++i) {
a[i + 2] = a[i] + b[i];
}
}
However, when specifying a VF of 4 via a loop hint this loop is
vectorized. According to [1][2], loop hints are ignored if the
optimization is not safe to apply.
This patch introduces a check to bail of vectorization if the user
specified VF is greater than the maximum feasible VF, unless explicitly
forced with '-force-vector-width=X'.
[1] https://llvm.org/docs/LangRef.html#llvm-loop-vectorize-and-llvm-loop-interleave
[2] https://clang.llvm.org/docs/LanguageExtensions.html#extensions-for-loop-hint-optimizations
Reviewed By: sdesmalen, fhahn, Meinersbur
Differential Revision: https://reviews.llvm.org/D90687
Instruction ExtractValue wasn't handled in
LoopVectorizationCostModel::getInstructionCost(). As a result, it was modeled
as a mul which is not really accurate. Since it is free (most of the times),
this now gets a cost of 0 using getInstructionCost.
This is a follow-up of D92208, that required changing this regression test.
In a follow up I will look at InsertValue which also isn't handled yet.
Differential Revision: https://reviews.llvm.org/D92317
VPPredInstPHIRecipe is one of the recipes that was missed during the
initial conversion. This patch adjusts the recipe to also manage its
operand using VPUser.
This was modeled to have a cost of 1, but since we do not have a MUL.2d this is
scalarized into vector inserts/extracts and scalar muls.
Motivating precommitted test is test/Transforms/SLPVectorizer/AArch64/mul.ll,
which we don't want to SLP vectorize.
Test Transforms/LoopVectorize/AArch64/extractvalue-no-scalarization-required.ll
unfortunately needed changing, but the reason is documented in
LoopVectorize.cpp:6855:
// The cost of executing VF copies of the scalar instruction. This opcode
// is unknown. Assume that it is the same as 'mul'.
which I will address next as a follow up of this.
Differential Revision: https://reviews.llvm.org/D92208
Similar to other patches, this makes VPWidenRecipe a VPValue. Because of
the way it interacts with the reduction code it also slightly alters the
way that VPValues are registered, removing the up front NeedDef and
using getOrAddVPValue to create them on-demand if needed instead.
Differential Revision: https://reviews.llvm.org/D88447
This converts the VPReductionRecipe into a VPValue, like other
VPRecipe's in preparation for traversing def-use chains. It also makes
it a VPUser, now storing the used VPValues as operands.
It doesn't yet change how the VPReductionRecipes are created. It will
need to call replaceAllUsesWith from the original recipe they replace,
but that is not done yet as VPWidenRecipe need to be created first.
Differential Revision: https://reviews.llvm.org/D88382
Fix PR47390.
The primary induction should be considered alive when folding tail by masking,
because it will be used by said masking; even when it may otherwise appear
useless: feeding only its own 'bump', which is correctly considered dead, and
as the 'bump' of another induction variable, which may wrongfully want to
consider its bump = the primary induction, dead.
Differential Revision: https://reviews.llvm.org/D92017
A uniform load is one which loads from a uniform address across all lanes. As currently implemented, we cost model such loads as if we did a single scalar load + a broadcast, but the actual lowering replicates the load once per lane.
This change tweaks the lowering to use the REPLICATE strategy by marking such loads (and the computation leading to their memory operand) as uniform after vectorization. This is a useful change in itself, but it's real purpose is to pave the way for a following change which will generalize our uniformity logic.
In review discussion, there was an issue raised with coupling cost modeling with the lowering strategy for uniform inputs. The discussion on that item remains unsettled and is pending larger architectural discussion. We decided to move forward with this patch as is, and revise as warranted once the bigger picture design questions are settled.
Differential Revision: https://reviews.llvm.org/D91398
This is re-applying a combination of f7eac51b9b and 8ec7ea3ddc as one patch
to avoid regressions now that we have better testing in place.
Those were reverted with 32dd5870ee because of crashing in experimental intrinsics.
That bug should be fixed with 7ae346434.
Paraphrased original commit messages:
This is the last step in removing cost-kind as a consideration in the
basic class model for intrinsics.
See D89461 for the start of that.
Subsequent commits dealt with each of the special-case intrinsics that
had customization here in the basic class. This should remove a barrier
to retrying D87188 (canonicalization to the abs intrinsic).
The ARM and x86 cost diffs seen here may be wrong because the
target-specific overrides have their own bugs, but we hope this is
less wrong - if something has a significant throughput cost, then it
should have a significant size / blended cost too by default.
The only behavioral diff in current regression tests is shown in the
x86 scatter-gather test (which is misplaced or broken because it runs
the entire -O3 pipeline) - we unrolled less, and we assume that is
a improvement.
Exception: in general, we want the *size* cost for a scalar call to be
cheap even if the other costs are expensive - we expect it to just be
a branch with some optional stack manipulation.
It is likely that we will want to carve out some
exceptions/overrides to this rule as follow-up patches for
calls that have some general and/or target-specific difference
to the expected lowering.
This was noticed as a regression in unrolling, so we have a test
for that now along with a couple of direct cost model tests.
If the assumed scalarization costs for the oversized vector
calls are not realistic, that would be another follow-up
refinement of the cost models.
Differential Revision: https://reviews.llvm.org/D90554
as it's causing crashes in the optimizer. A reduced testcase has been posted as a follow-up.
This reverts commit f7eac51b9b.
Temporarily Revert "[CostModel] make default size cost for libcalls small (again)" as it depends upon the primary revert.
This reverts commit 8ec7ea3ddc.
Temporarily Revert "[CostModel] add tests for math library calls; NFC" as it depends upon the primary revert.
This reverts commit df09f82599.
Temporarily Revert "[LoopUnroll] add test for full unroll that is sensitive to cost-model; NFC" as it depends upon the primary revert.
This reverts commit 618d555e8d.
I noticed an add example like the one from D91343, so here's a similar patch.
The logic is based on existing code for the single-use demanded bits fold.
But I only matched a constant instead of using compute known bits on the
operands because that was the motivating patterni that I noticed.
I think this will allow removing a special-case (but incomplete) dedicated
fold within visitAnd(), but I need to untangle the existing code to be sure.
https://rise4fun.com/Alive/V6fP
Name: add with low mask
Pre: (C1 & (-1 u>> countLeadingZeros(C2))) == 0
%a = add i8 %x, C1
%r = and i8 %a, C2
=>
%r = and i8 %x, C2
Differential Revision: https://reviews.llvm.org/D91415
This patch turns VPWidenGEPRecipe into a VPValue and uses it
during VPlan construction and codegeneration instead of the plain IR
reference where possible.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D84683
This patch turns VPWidenSelectRecipe into a VPValue and uses it
during VPlan construction and codegeneration instead of the plain IR
reference where possible.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D84682
This is the last step in removing cost-kind as a consideration in the basic class model for intrinsics.
See D89461 for the start of that.
Subsequent commits dealt with each of the special-case intrinsics that had customization here in the
basic class. This should remove a barrier to retrying
D87188 (canonicalization to the abs intrinsic).
The ARM and x86 cost diffs seen here may be wrong because the target-specific overrides have their own
bugs, but we hope this is less wrong - if something has a significant throughput cost, then it should
have a significant size / blended cost too by default.
The only behavioral diff in current regression tests is shown in the x86 scatter-gather test (which is
misplaced or broken because it runs the entire -O3 pipeline) - we unrolled less, and we assume that is
a improvement.
Differential Revision: https://reviews.llvm.org/D90554
This patch turns VPWidenCall into a VPValue and uses it
during VPlan construction and codegeneration instead of the plain IR
reference where possible.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D84681
Claim to not have any vector support to dissuade SLP, LV and friends
from generating SIMD IR for the VE target. We will take this back once
vector isel is stable.
Reviewed By: kaz7, fhahn
Differential Revision: https://reviews.llvm.org/D90462