Floating point conversions inside vectorized loops have performance
implications but are very subtle. The user could specify a floating
point constant, or call a function without realizing that it will
force a change in the vector width. An example of this behaviour is
seen in https://godbolt.org/z/M3nT6c . The vectorizer should indicate
when this happens becuase it is most likely unintended behaviour.
This patch adds a simple check for this behaviour by following floating
point stores in the original loop and checking if a floating point
conversion operation occurs.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D95539
This patch enables scalable vectorization of loops with integer/fast reductions, e.g:
```
unsigned sum = 0;
for (int i = 0; i < n; ++i) {
sum += a[i];
}
```
A new TTI interface, isLegalToVectorizeReduction, has been added to prevent
reductions which are not supported for scalable types from vectorizing.
If the reduction is not supported for a given scalable VF,
computeFeasibleMaxVF will fall back to using fixed-width vectorization.
Reviewed By: david-arm, fhahn, dmgreen
Differential Revision: https://reviews.llvm.org/D95245
This patch updates codegen to use VPValues to manage the generated
scalarized instructions.
Reviewed By: gilr
Differential Revision: https://reviews.llvm.org/D92285
This patch fixes pr48832 by correctly generating the mask when a poison value is involved.
Consider this CFG (which is a part of the input):
```
for.body: ; preds = %for.cond
br i1 true, label %cond.false, label %land.rhs
land.rhs: ; preds = %for.body
br i1 poison, label %cond.end, label %cond.false
cond.false: ; preds = %for.body, %land.rhs
br label %cond.end
cond.end: ; preds = %land.rhs, %cond.false
%cond = phi i32 [ 0, %cond.false ], [ 1, %land.rhs ]
```
The path for.body -> land.rhs -> cond.end should be taken when 'select i1 false, i1 poison, i1 false' holds (which means it's never taken); but VPRecipeBuilder::createEdgeMask was emitting 'and i1 false, poison' instead.
The former one successfully blocks poison propagation whereas the latter one doesn't, making the condition poison and thus causing the miscompilation.
SimplifyCFG has a similar bug (which didn't expose a real-world bug yet), and a patch for this is also ongoing (see https://reviews.llvm.org/D95026).
Reviewed By: bjope
Differential Revision: https://reviews.llvm.org/D95217
Changes `getScalarizationOverhead` to return an invalid cost for scalable VFs
and adds some simple tests for loops containing a function for which
there is a vectorized variant available.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D96356
The individual recipes have been updated to manage their operands using
VPUser a while back. Now that the transition is done, we can instead
make VPRecipeBase a VPUser and get rid of the toVPUser helper.
This patch changes the VecDesc struct to use ElementCount
instead of an unsigned VF value, in preparation for
future work that adds support for vectorized versions of
math functions using scalable vectors. Since all I'm doing
in this patch is switching the type I believe it's a
non-functional change. I changed getWidestVF to now return
both the widest fixed-width and scalable VF values, but
currently the widest scalable value will be zero.
Differential Revision: https://reviews.llvm.org/D96011
This will be needed in the loop-vectorizer where the minimum VF
requested may be a scalable VF. getMinimumVF now takes an additional
operand 'IsScalableVF' that indicates whether a scalable VF is required.
Reviewed By: kparzysz, rampitec
Differential Revision: https://reviews.llvm.org/D96020
This patch is NFC and changes occurrences of `unsigned Width`
and `unsigned i` to work on type ElementCount instead.
This patch is a preparatory patch with the ultimate goal of making
`computeMaxVF()` return both a max fixed VF and a max scalable VF,
so that `selectVectorizationFactor()` can pick the most cost-effective
vectorization factor.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D96019
This patch is NFC and changes occurrences of `unsigned MaxVectorSize`
to work on type ElementCount.
This patch is a preparatory patch with the ultimate goal of making
`computeMaxVF()` return both a max fixed VF and a max scalable VF,
so that `selectVectorizationFactor()` can pick the most cost-effective
vectorization factor.
Reviewed By: kmclaughlin
Differential Revision: https://reviews.llvm.org/D96018
VP blocks keep track of a condition, which is a VPValue. This patch
updates VPBlockBase to manage the value using VPUser, so
replaceAllUsesWith properly updates the condition bit as well.
This is required to enable VP2VP transformations and it helps with
simplifying some of the code required to manage condition bits.
Reviewed By: gilr
Differential Revision: https://reviews.llvm.org/D95382
This reverts commit 502a67dd7f.
This expose a failure in test-suite build on PowerPC,
revert to unblock buildbot first,
Dave will re-commit in https://reviews.llvm.org/D96287.
Thanks Dave.
This patch updates some places where VectorLoopValueMap is accessed
directly to instead go through VPTransformState.
As we move towards managing created values exclusively in VPTransformState,
this ensures the use always can fetch the correct value.
This is in preparation for D92285, which switches to managing scalarized
values through VPValues.
In the future, the various fix* functions should be moved directly into
the VPlan codegen stage.
Reviewed By: gilr
Differential Revision: https://reviews.llvm.org/D95757
getIntrinsicInstrCost takes a IntrinsicCostAttributes holding various
parameters of the intrinsic being costed. It can either be called with a
scalar intrinsic (RetTy==Scalar, VF==1), with a vector instruction
(RetTy==Vector, VF==1) or from the vectorizer with a scalar type and
vector width (RetTy==Scalar, VF>1). A RetTy==Vector, VF>1 is considered
an error. Both of the vector modes are expected to be treated the same,
but because this is confusing many backends end up getting it wrong.
Instead of trying work with those two values separately this removes the
VF parameter, widening the RetTy/ArgTys by VF used called from the
vectorizer. This keeps things simpler, but does require some other
modifications to keep things consistent.
Most backends look like this will be an improvement (or were not using
getIntrinsicInstrCost). AMDGPU needed the most changes to keep the code
from c230965ccf working. ARM removed the fix in
dfac521da1, webassembly happens to get a fixup for an SLP cost
issue and both X86 and AArch64 seem to now be using better costs from
the vectorizer.
Differential Revision: https://reviews.llvm.org/D95291
If we know that the scalar epilogue is required to run, modify the CFG to end the middle block with an unconditional branch to scalar preheader. This is instead of a conditional branch to either the preheader or the exit block.
The motivation to do this is to support multiple exit blocks. Specifically, the current structure forces us to identify immediate dominators and *which* exit block to branch from in the middle terminator. For the multiple exit case - where we know require scalar will hold - these questions are ill formed.
This is the last change needed to support multiple exit loops, but since the diffs are already large enough, I'm going to land this, and then enable separately. You can think of this as being NFCI-ish prep work, but the changes are a bit too involved for me to feel comfortable tagging the change that way.
Differential Revision: https://reviews.llvm.org/D94892
This patch updates the induction value creation to use VPValues of
recipes to map the created values. This should bring is one step closer
to being able to optimize induction recipes directly in VPlan.
Currently widenIntOrFpInduction also generates vector values for a cast
of the induction, if it exists. Make this explicit by adding the cast
instruction to the values defined by the recipe.
Reviewed By: gilr
Differential Revision: https://reviews.llvm.org/D92284
This patch adds constructors to VPIteration as a cleaner way of
initialising the struct and replaces existing constructions of
the form:
{Part, Lane}
with
VPIteration(Part, Lane)
I have also added a default constructor, which is used by VPlan.cpp
when deciding whether to replicate a block or not.
This refactoring will be required in a later patch that adds more
members and functions to VPIteration.
Differential Revision: https://reviews.llvm.org/D95676
This patch updates IRBuilder::CreateMaskedGather/Scatter to work
with ScalableVectorType and adds isLegalMaskedGather/Scatter functions
to AArch64TargetTransformInfo. In addition I've fixed up
isLegalMaskedLoad/Store to return true for supported scalar types,
since this is what the vectorizer asks for.
In LoopVectorize.cpp I've changed
LoopVectorizationCostModel::getInterleaveGroupCost to return an invalid
cost for scalable vectors, since currently this relies upon using shuffle
vector for reversing vectors. In addition, in
LoopVectorizationCostModel::setCostBasedWideningDecision I have assumed
that the cost of scalarising memory ops is infinitely expensive.
I have added some simple masked load/store and gather/scatter tests,
including cases where we use gathers and scatters for conditional invariant
loads and stores.
Differential Revision: https://reviews.llvm.org/D95350
Extend applyLoopGuards() to take into account conditions/assumes proving some
value %v to be divisible by D by rewriting %v to (%v / D) * D. This lets the
loop unroller and the loop vectorizer identify more loops as not requiring
remainder loops.
Differential Revision: https://reviews.llvm.org/D95521
This is another step (see D95452) towards correcting fast-math-flags
bugs in vector reductions.
There are multiple bugs visible in the test diffs, and this is still
not working as it should. We still use function attributes (rather
than FMF) to drive part of the logic, but we are not checking for
the correct FP function attributes.
Note that FMF may not be propagated optimally on selects (example
in https://llvm.org/PR35607 ). That's why I'm proposing to union the
FMF of a fcmp+select pair and avoid regressions on existing vectorizer
tests.
Differential Revision: https://reviews.llvm.org/D95690
D90687 introduced a crash:
llvm::LoopVectorizationCostModel::computeMaxVF(llvm::ElementCount, unsigned int):
Assertion `WideningDecisions.empty() && Uniforms.empty() && Scalars.empty() &&
"No decisions should have been taken at this point"' failed.
when compiling the following C code:
typedef struct {
char a;
} b;
b *c;
int d, e;
int f() {
int g = 0;
for (; d; d++) {
e = 0;
for (; e < c[d].a; e++)
g++;
}
return g;
}
with:
clang -Os -target hexagon -mhvx -fvectorize -mv67 testcase.c -S -o -
This occurred since prior to D90687 computeFeasibleMaxVF would only be
called in computeMaxVF when a scalar epilogue was allowed, but now it's
always called. This causes the assert above since computeFeasibleMaxVF
collects all viable VFs larger than the default MaxVF, and for each VF
calculates the register usage which results in analysis being done the
assert above guards against. This can occur in computeFeasibleMaxVF if
TTI.shouldMaximizeVectorBandwidth and this target hook is implemented in
the hexagon backend to always return true.
Reported by @iajbar.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D94869
I am trying to untangle the fast-math-flags propagation logic
in the vectorizers (see a6f022127 for SLP).
The loop vectorizer has a mix of checking FP function attributes,
IR-level FMF, and just wrong assumptions.
I am trying to avoid regressions while fixing this, and I think
the IR-level logic is good enough for that, but it's hard to say
for sure. This would be the 1st step in the clean-up.
The existing test that I changed to include 'fast' actually shows
a miscompile: the function only had the equivalent of nnan, but we
created new instructions that had fast (all FMF set). This is
similar to the example in https://llvm.org/PR35538
Differential Revision: https://reviews.llvm.org/D95452
This gives the user control over which expander to use, which in turn
allows the user to decide what to do with the expanded instructions.
Used in D75980.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D94295
Now that VPRecipeBase inherits from VPDef, we can always use the new
VPValue for replacement, if the recipe defines one. Given the recipes
that are supported at the moment, all new recipes must have either 0 or
1 defined values.
a6f0221276 enabled intersection of FMF on reduction instructions,
so it is safe to ease the check here.
There is still some room to improve here - it looks like we
have nearly duplicate flags propagation logic inside of the
LoopUtils helper but it is limited targets that do not form
reduction intrinsics (they form the shuffle expansion).
Add an intrinsic type class to represent the
llvm.experimental.noalias.scope.decl intrinsic, to make code
working with it a bit nicer by hiding the metadata extraction
from view.
As shown in the test diffs, we could miscompile by
propagating flags that did not exist in the original
code.
The flags required for fmin/fmax reductions will be
fixed in a follow-up patch.
Walking the use list of a Constant (particularly, ConstantData)
is not scalable, since a given constant may be used by many
instructinos in many functions in many modules.
Differential Revision: https://reviews.llvm.org/D94713
I have removed an unnecessary assert in LoopVectorizationCostModel::getInstructionCost
that prevented a cost being calculated for select instructions when using
scalable vectors. In addition, I have changed AArch64TTIImpl::getCmpSelInstrCost
to only do special cost calculations for fixed width vectors and fall
back to the base version for scalable vectors.
I have added a simple cost model test for cmps and selects:
test/Analysis/CostModel/sve-cmpsel.ll
and some simple tests that show we vectorize loops with cmp and select:
test/Transforms/LoopVectorize/AArch64/sve-basic-vec.ll
Differential Revision: https://reviews.llvm.org/D95039
This adds cost modelling for the inloop vectorization added in
745bf6cf44. Up until now they have been modelled as the original
underlying instruction, usually an add. This happens to works OK for MVE
with instructions that are reducing into the same type as they are
working on. But MVE's instructions can perform the equivalent of an
extended MLA as a single instruction:
%sa = sext <16 x i8> A to <16 x i32>
%sb = sext <16 x i8> B to <16 x i32>
%m = mul <16 x i32> %sa, %sb
%r = vecreduce.add(%m)
->
R = VMLADAV A, B
There are other instructions for performing add reductions of
v4i32/v8i16/v16i8 into i32 (VADDV), for doing the same with v4i32->i64
(VADDLV) and for performing a v4i32/v8i16 MLA into an i64 (VMLALDAV).
The i64 are particularly interesting as there are no native i64 add/mul
instructions, leading to the i64 add and mul naturally getting very
high costs.
Also worth mentioning, under NEON there is the concept of a sdot/udot
instruction which performs a partial reduction from a v16i8 to a v4i32.
They extend and mul/sum the first four elements from the inputs into the
first element of the output, repeating for each of the four output
lanes. They could possibly be represented in the same way as above in
llvm, so long as a vecreduce.add could perform a partial reduction. The
vectorizer would then produce a combination of in and outer loop
reductions to efficiently use the sdot and udot instructions. Although
this patch does not do that yet, it does suggest that separating the
input reduction type from the produced result type is a useful concept
to model. It also shows that a MLA reduction as a single instruction is
fairly common.
This patch attempt to improve the costmodelling of in-loop reductions
by:
- Adding some pattern matching in the loop vectorizer cost model to
match extended reduction patterns that are optionally extended and/or
MLA patterns. This marks the cost of the reduction instruction correctly
and the sext/zext/mul leading up to it as free, which is otherwise
difficult to tell and may get a very high cost. (In the long run this
can hopefully be replaced by vplan producing a single node and costing
it correctly, but that is not yet something that vplan can do).
- getExtendedAddReductionCost is added to query the cost of these
extended reduction patterns.
- Expanded the ARM costs to account for these expanded sizes, which is a
fairly simple change in itself.
- Some minor alterations to allow inloop reduction larger than the highest
vector width and i64 MVE reductions.
- An extra InLoopReductionImmediateChains map was added to the vectorizer
for it to efficiently detect which instructions are reductions in the
cost model.
- The tests have some updates to show what I believe is optimal
vectorization and where we are now.
Put together this can greatly improve performance for reduction loop
under MVE.
Differential Revision: https://reviews.llvm.org/D93476
This is NFC-intended and removes the "OperationData"
class which had become nothing more than a recurrence
(reduction) type.
I adjusted the matching logic to distinguish
instructions from non-instructions - that's all that
the "IsLeafValue" member was keeping track of.
We were able to remove almost all of the state from
OperationData, so these don't make sense as members
of that class - just pass the RecurKind in as a param.
More streamlining is possible, but I'm trying to avoid
logic/typo bugs while fixing this. Eventually, we should
not need the `OperationData` class.
We were able to remove almost all of the state from
OperationData, so these don't make sense as members
of that class - just pass the RecurKind in as a param.
Just like llvm.assume, there are a lot of cases where we can just ignore llvm.experimental.noalias.scope.decl.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D93042
A previous patch has already changed getInstructionCost to return
an InstructionCost type. This patch changes the other various
getXXXCost functions to return an InstructionCost too. This is a
non-functional change - I've added a few asserts that the costs
are valid in places where we're selecting between vector call
and intrinsic costs. However, since we don't yet return invalid
costs from any of the TTI implementations these asserts should
not fire.
See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html
Differential Revision: https://reviews.llvm.org/D94065
After much refactoring over the last 2 weeks to the reduction
matching code, I think this change is finally ready.
We effectively broke fmax/fmin vector reduction optimization
when we started canonicalizing to intrinsics in instcombine,
so this should restore that functionality for SLP.
There are still FMF problems here as noted in the code comments,
but we should be avoiding miscompiles on those for fmax/fmin by
restricting to full 'fast' ops (negative tests are included).
Fixing FMF propagation is a planned follow-up.
Differential Revision: https://reviews.llvm.org/D94913
This will avoid confusion once we start matching
min/max intrinsics. All of these hacks to accomodate
cmp+sel idioms should disappear once we canonicalize
to min/max intrinsics.
The icmp opcode is now hard-coded in the cost model call.
This will make it easier to eventually remove all opcode
queries for min/max patterns as we transition to intrinsics.
This patch changes these functions:
vectorizeLoadInsert
isExtractExtractCheap
foldExtractedCmps
scalarizeBinopOrCmp
getShuffleExtract
foldBitcastShuf
to use the class InstructionCost when calling TTI.get<something>Cost().
This patch is part of a series of patches to use InstructionCost instead of
unsigned/int for the cost model functions.
See this thread for context:
http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html
See this patch for the introduction of the type:
https://reviews.llvm.org/D91174
ps.:This patch adds the test || !NewCost.isValid(), because we want to
return false when:
!NewCost.isValid && !OldCost.isValid()->the cost to transform it expensive
and
!NewCost.isValid() && OldCost.isValid()
Therefore for simplication we only add test for !NewCost.isValid()
Differential Revision: https://reviews.llvm.org/D94069
This is NFC-intended and another step towards supporting
intrinsics as reduction candidates.
The remaining bits of the OperationData class do not make
much sense as-is, so I will try to improve that, but I'm
trying to take minimal steps because it's still not clear
how this was intended to work.
This is another NFC-intended patch to allow matching
intrinsics (example: maxnum) as candidates for reductions.
It's possible that the loop/if logic can be reduced now,
but it's still difficult to understand how this all works.
To get into this block we had: !A || B || C
and we checked C in the first 'if' clause
leaving !A || B. But the 2nd 'if' is checking:
A && !B --> !(!A || B)
This is NFC-intended. I'm still trying to figure out
how the loop where this is used works. It does not
seem like we require this data at all, but it's
hard to confirm given the complicated predicates.
In the spirit of commit fc783e91e0 (llvm-svn: 248943) we
shouldn't vectorize stores of non-packed types (i.e. types that
has padding between consecutive variables in a scalar layout,
but being packed in a vector layout).
The problem was detected as a miscompile in a downstream test case.
Reviewed By: anton-afanasyev
Differential Revision: https://reviews.llvm.org/D94446
This relates to the ongoing effort to support vectorization of multiple exit loops (see D93317).
The previous code assumed that LCSSA phis were always single entry before the vectorizer ran. This was correct, but only because the vectorizer allowed only a single exiting edge. There's nothing in the definition of LCSSA which requires single entry phis.
A common case where this comes up is with a loop with multiple exiting blocks which all reach a common exit block. (e.g. see the test updates)
Differential Revision: https://reviews.llvm.org/D93725
This patch unifies the way recipes and VPValues are printed after the
transition to VPDef.
VPSlotTracker has been updated to iterate over all recipes and all
their defined values to number those. There is no need to number
values in Value2VPValue.
It also updates a few places that only used slot numbers for
VPInstruction. All recipes now can produce numbered VPValues.
This patch is part of a series of patches that migrate integer
instruction costs to use InstructionCost. In the function
selectVectorizationFactor I have simply asserted that the cost
is valid and extracted the value as is. In future we expect
to encounter invalid costs, but we should filter out those
vectorization factors that lead to such invalid costs.
See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html
Differential Revision: https://reviews.llvm.org/D92178
A severe compile-time slowdown from this call is noted in:
https://llvm.org/PR48689
My naive fix was to put it under LLVM_DEBUG ( 267ff79 ),
but that's not limiting in the way we want.
This is a quick fix (or we could just remove the call completely
and rely on some later pass to discover potentially wrong IR?).
A bigger/better fix would be to improve/limit verifyFunction()
as noted in:
https://llvm.org/PR47712
Differential Revision: https://reviews.llvm.org/D94328
Similar to D92129, update VPWidenPHIRecipe to manage the start value as
VPValue. This allows adjusting the start value as a VPlan transform,
which will be used in a follow-up patch to support reductions during
epilogue vectorization.
Reviewed By: gilr
Differential Revision: https://reviews.llvm.org/D93975
This was suggested to prepare for D93975.
By moving the start value creation to widenPHInstruction, we set the
stage to manage the start value directly in VPWidenPHIRecipe, which be
used subsequently to set the 'resume' value for reductions during
epilogue vectorization.
It also moves RdxDesc to the recipe, so we do not have to rely on Legal
to look it up later.
Reviewed By: gilr
Differential Revision: https://reviews.llvm.org/D94175
As noted in PR48689, the verifier may have some kind
of exponential behavior that should be addressed
separately. For now, only run it in debug mode to
prevent problems for release+asserts.
That limit is what we had before D80401, and I'm
not sure if there was a reason to change it in that
patch.
In the following loop:
void foo(int *a, int *b, int N) {
for (int i=0; i<N; ++i)
a[i + 4] = a[i] + b[i];
}
The loop dependence constrains the VF to a maximum of (4, fixed), which
would mean using <4 x i32> as the vector type in vectorization.
Extending this to scalable vectorization, a VF of (4, scalable) implies
a vector type of <vscale x 4 x i32>. To determine if this is legal
vscale must be taken into account. For this example, unless
max(vscale)=1, it's unsafe to vectorize.
For SVE, the number of bits in an SVE register is architecturally
defined to be a multiple of 128 bits with a maximum of 2048 bits, thus
the maximum vscale is 16. In the loop above it is therefore unfeasible
to vectorize with SVE. However, in this loop:
void foo(int *a, int *b, int N) {
#pragma clang loop vectorize_width(X, scalable)
for (int i=0; i<N; ++i)
a[i + 32] = a[i] + b[i];
}
As long as max(vscale) multiplied by the number of lanes 'X' doesn't
exceed the dependence distance, it is safe to vectorize. For SVE a VF of
(2, scalable) is within this constraint, since a vector of <16 x 2 x 32>
will have no dependencies between lanes. For any number of lanes larger
than this it would be unsafe to vectorize.
This patch extends 'computeFeasibleMaxVF' to legalize scalable VFs
specified as loop hints, implementing the following behaviour:
* If the backend does not support scalable vectors, ignore the hint.
* If scalable vectorization is unfeasible given the loop
dependence, like in the first example above for SVE, then use a
fixed VF.
* Accept scalable VFs if it's safe to do so.
* Otherwise, clamp scalable VFs that exceed the maximum safe VF.
Reviewed By: sdesmalen, fhahn, david-arm
Differential Revision: https://reviews.llvm.org/D91718
The new test case here contains a first order recurrences and an
instruction that is replicated. The first order recurrence forces an
instruction to be sunk _into_, as opposed to after the replication
region. That causes several things to go wrong including registering
vector instructions multiple times and failing to create dominance
relations correctly.
Instead we should be sinking to after the replication region, which is
what this patch makes sure happens.
Differential Revision: https://reviews.llvm.org/D93629
After merging the shuffles, we cannot rely on the previous shuffle
anymore and need to shrink the final shuffle, if it is required.
Reported in D92668
Differential Revision: https://reviews.llvm.org/D93967
Similar to 5a1d31a28 -
This should be no-functional-change because the reduction kind
opcodes are 1-for-1 mappings to the instructions we are matching
as reductions. But we want to remove the need for the
`OperationData` opcode field because that does not work when
we start matching intrinsics (eg, maxnum) as reduction candidates.
This patch updates VPWidenIntOrFpInductionRecipe to hold the start value
for the induction variable. This makes the start value explicit and
allows for adjusting the start value for a VPlan.
The flexibility will be used in further patches.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D92129
This patch adds a new getLiveInIRValue accessor to VPValue, which
returns the underlying value, if the VPValue is defined outside of
VPlan. This is required to handle scalars in VPTransformState, which
requires dealing with scalars defined outside of VPlan.
We can simply check VPValue::Def to determine if the value is defined
inside a VPlan.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D92281
This patch makes SLP and LV emit operations with initial vectors set to poison constant instead of undef.
This is a part of efforts for using poison vector instead of undef to represent "doesn't care" vector.
The goal is to make nice shufflevector optimizations valid that is currently incorrect due to the tricky interaction between undef and poison (see https://bugs.llvm.org/show_bug.cgi?id=44185 ).
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D94061
This should be no-functional-change because the reduction kind
opcodes are 1-for-1 mappings to the instructions we are matching
as reductions. But we want to remove the need for the
`OperationData` opcode field because that does not work when
we start matching intrinsics (eg, maxnum) as reduction candidates.
SLP tries to model 2 forms of vector reductions: pairwise and splitting.
From the cost model code comments, those are defined using an example as:
/// Pairwise:
/// (v0, v1, v2, v3)
/// ((v0+v1), (v2+v3), undef, undef)
/// Split:
/// (v0, v1, v2, v3)
/// ((v0+v2), (v1+v3), undef, undef)
I don't know the full history of this functionality, but it was partly
added back in D29402. There are apparently no users at this point (no
regression tests change). X86 might have managed to work-around the need
for this through cost model and codegen improvements.
Removing this code makes it easier to continue the work that was started
in D87416 / D88193. The alternative -- if there is some target that is
silently using this option -- is to move this logic into LoopUtils. We
have related/duplicate functionality there via llvm::createTargetReduction().
Differential Revision: https://reviews.llvm.org/D93860
Creating in-loop reductions relies on IR references to map
IR values to VPValues after interleave group creation.
Make sure we re-add the updated member to the plan, so the look-ups
still work as expected
This fixes a crash reported after D90562.
While here, rename the inaccurate getRecurrenceBinOp()
because that was also used to get CmpInst opcodes.
The recurrence/reduction kind should always refer to the
expected opcode for a reduction. SLP appears to be the
only direct caller of createSimpleTargetReduction(), and
that calling code ideally should not be carrying around
both an opcode and a reduction kind.
This should allow us to generalize reduction matching to
use intrinsics instead of only binops.
This is almost all mechanical search-and-replace and
no-functional-change-intended (NFC). Having a single
enum makes it easier to match/reason about the
reduction cases.
The goal is to remove `Opcode` from reduction matching
code in the vectorizers because that makes it harder to
adapt the code to handle intrinsics.
The code in RecurrenceDescriptor::AddReductionVar() is
the only place that required closer inspection. It uses
a RecurrenceDescriptor and a second InstDesc to sometimes
overwrite part of the struct. It seem like we should be
able to simplify that logic, but it's not clear exactly
which cmp+sel patterns that we are trying to handle/avoid.
If DoExtraAnalysis is true (e.g. because remarks are enabled), we
continue with the analysis rather than exiting. Update code to
conditionally check if the ExitBB has phis or not a single predecessor.
Otherwise a nullptr is dereferenced with DoExtraAnalysis.
I don't know if there's some way this changes what the vectorizers
may produce for reductions, but I have added test coverage with
3567908 and 5ced712 to show that both passes already have bugs in
this area. Hopefully this does not make things worse before we can
really fix it.
I'm not sure if the SLP enum was created before the IVDescriptor
RecurrenceDescriptor / RecurrenceKind existed, but the code in
SLP is now redundant with that class, so it just makes things
more complicated to have both. We eventually call LoopUtils
createSimpleTargetReduction() to create reduction ops, so we
might as well standardize on those enum names.
There's still a question of whether we need to use TTI::ReductionFlags
vs. MinMaxRecurrenceKind, but that can be another clean-up step.
Another option would just be to flatten the enums in RecurrenceDescriptor
into a single enum. There isn't much benefit (smaller switches?) to
having a min/max subset.
This reverts commit 4ffcd4fe9a thus restoring e4df6a40da.
The only change from the original patch is to add "llvm::" before the call to empty(iterator_range). This is a speculative fix for the ambiguity reported on some builders.
This patch is a major step towards supporting multiple exit loops in the vectorizer. This patch on it's own extends the loop forms allowed in two ways:
single exit loops which are not bottom tested
multiple exit loops w/ a single exit block reached from all exits and no phis in the exit block (because of LCSSA this implies no values defined in the loop used later)
The restrictions on multiple exit loop structures will be removed in follow up patches; disallowing cases for now makes the code changes smaller and more obvious. As before, we can only handle loops with entirely analyzable exits. Removing that restriction is much harder, and is not part of currently planned efforts.
The basic idea here is that we can force the last iteration to run in the scalar epilogue loop (if we have one). From the definition of SCEV's backedge taken count, we know that no earlier iteration can exit the vector body. As such, we can leave the decision on which exit to be taken to the scalar code and generate a bottom tested vector loop which runs all but the last iteration.
The existing code already had the notion of requiring one iteration in the scalar epilogue, this patch is mainly about generalizing that support slightly, making sure we don't try to use this mechanism when tail folding, and updating the code to reflect the difference between a single exit block and a unique exit block (very mechanical).
Differential Revision: https://reviews.llvm.org/D93317
Previously the branch from the middle block to the scalar preheader & exit
was being set-up at the end of skeleton creation in completeLoopSkeleton.
Inserting SCEV or runtime checks may result in LCSSA phis being created,
if they are required. Adjusting branches afterwards may break those
PHIs.
To avoid this, we can instead create the branch from the middle block
to the exit after we created the middle block, so we have the final CFG
before potentially adjusting/creating PHIs.
This fixes a crash for the included test case. For the non-crashing
case, this is almost a NFC with respect to the generated code. The
only change is the order of the predecessors of the involved branch
targets.
Note an assertion was moved from LoopVersioning() to
LoopVersioning::versionLoop. Adjusting the branches means loop-simplify
form may be broken before constructing LoopVersioning. But LV only uses
LoopVersioning to annotate the loop instructions with !noalias metadata,
which does not require loop-simplify form.
This is a fix for an existing issue uncovered by D93317.
I am hoping to extend the reduction matching code, and it is
hard to distinguish "ReductionData" from "ReducedValueData".
So extend the tree/root metaphor to include leaves.
Another problem is that the name "OperationData" does not
provide insight into its purpose. I'm not sure if we can alter
that underlying data structure to make the code clearer.
I think this is NFC currently, but the bug would be exposed
when we allow binary intrinsics (maxnum, etc) as candidates
for reductions.
The code in matchAssociativeReduction() is using
OperationData::getNumberOfOperands() when comparing whether
the "EdgeToVisit" iterator is in-bounds, so this code must
use the same (potentially offset) operand value to set
the "EdgeToVisit".
ScalarEvolution should be able to handle both constant and variable trip
counts using getURemExpr, so we do not have to handle them separately.
This is a small simplification of a56280094e.
Reviewed By: gilr
Differential Revision: https://reviews.llvm.org/D93677
This patch turns updates VPInstruction to manage the value it defines
using VPDef. The VPValue is used during VPlan construction and
codegeneration instead of the plain IR reference where possible.
Reviewed By: gilr
Differential Revision: https://reviews.llvm.org/D90565
When the trip-count is provably divisible by the maximal/chosen VF, folding the
loop's tail during vectorization is redundant. This commit extends the existing
test for constant trip-counts to any trip-count known to be divisible by
maximal/selected VF by SCEV.
Differential Revision: https://reviews.llvm.org/D93615
This patch makes VPRecipeBase a direct subclass of VPDef, moving the
SubclassID to VPDef.
Reviewed By: gilr
Differential Revision: https://reviews.llvm.org/D90564
This patch turns updates VPInterleaveRecipe to manage the values it defines
using VPDef. The VPValue is used during VPlan construction and
codegeneration instead of the plain IR reference where possible.
Reviewed By: gilr
Differential Revision: https://reviews.llvm.org/D90562
An earlier patch introduced asserts that the InstructionCost is
valid because at that time the ReuseShuffleCost variable was an
unsigned. However, now that the variable is an InstructionCost
instance the asserts can be removed.
See this thread for context:
http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html
See this patch for the introduction of the type:
https://reviews.llvm.org/D91174
This is an enhancement motivated by https://llvm.org/PR16739
(see D92858 for another).
We can look through a GEP to find a base pointer that may be
safe to use for a vector load. If so, then we shuffle (shift)
the necessary vector element over to index 0.
Alive2 proof based on 1 of the regression tests:
https://alive2.llvm.org/ce/z/yPJLkh
The vector translation is independent of endian (verify by
changing to leading 'E' in the datalayout string).
Differential Revision: https://reviews.llvm.org/D93229
Here's another minimal step suggested by D93229 / D93397 .
(I'm trying to be extra careful in these changes because
load transforms are easy to get wrong.)
We can optimistically choose the greater alignment of a
load and its pointer operand. As the test diffs show, this
can improve what would have been unaligned vector loads
into aligned loads.
When we enhance with gep offsets, we will need to adjust
the alignment calculation to include that offset.
Differential Revision: https://reviews.llvm.org/D93406
As discussed in D93229, we only need a minimal alignment constraint
when querying whether a hypothetical vector load is safe. We still
pass/use the potentially stronger alignment attribute when checking
costs and creating the new load.
There's already a test that changes with the minimum code change,
so splitting this off as a preliminary commit independent of any
gep/offset enhancements.
Differential Revision: https://reviews.llvm.org/D93397
This patch changes the type of cost variables (for instance: Cost, ExtractCost,
SpillCost) to use InstructionCost.
This patch also changes the type of cost variables to InstructionCost in other
functions that use the result of getTreeCost()
This patch is part of a series of patches to use InstructionCost instead of
unsigned/int for the cost model functions.
See this thread for context:
http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html
Depends on D91174
Differential Revision: https://reviews.llvm.org/D93049
Given we haven't yet enabled multiple exiting blocks, this is currently non functional, but it's an obvious extension which cleans up a later patch.
I don't think this is worth review (as it's pretty obvious), if anyone disagrees, feel feel to revert or comment and I will.
This should be purely non-functional. When touching this code for another reason, I found the handling of the PredicateOrDontVectorize piece here very confusing. Let's make it an explicit state (instead of an implicit combination of two variables), and use early return for options/hint processing.
This patch turns updates VPWidenSelectRecipe to manage the value
it defines using VPDef.
Reviewed By: gilr
Differential Revision: https://reviews.llvm.org/D90560
This patch turns updates VPWidenGEPRecipe to manage the value it defines
using VPDef. The VPValue is used during VPlan construction and
codegeneration instead of the plain IR reference where possible.
Reviewed By: gilr
Differential Revision: https://reviews.llvm.org/D90561
This patch turns updates VPWidenREcipe to manage the value it defines
using VPDef.
Reviewed By: gilr
Differential Revision: https://reviews.llvm.org/D90559
As noted in D93229, the transform from scalar load to vector load
potentially leaks poison from the extra vector elements that are
being loaded.
We could use freeze here (and x86 codegen at least appears to be
the same either way), but we already have a shuffle in this logic
to optionally change the vector size, so let's allow that
instruction to serve both purposes.
Differential Revision: https://reviews.llvm.org/D93238
D82227 has added a proper check to limit PHI vectorization to the
maximum vector register size. That unfortunately resulted in at
least a couple of regressions on SystemZ and x86.
This change reverts PHI handling from D82227 and replaces it with
a more general check in SLPVectorizerPass::tryToVectorizeList().
Moved to tryToVectorizeList() it allows to restart vectorization
if initial chunk fails.
However, this function is more general and handles not only PHI
but everything which SLP handles. If vectorization factor would
be limited to maximum vector register size it would limit much
more vectorization than before leading to further regressions.
Therefore a new TTI callback getMaximumVF() is added with the
default 0 to preserve current behavior and limit nothing. Then
targets can decide what is better for them.
The callback gets ElementSize just like a similar getMinimumVF()
function and the main opcode of the chain. The latter is to avoid
regressions at least on the AMDGPU. We can have loads and stores
up to 128 bit wide, and <2 x 16> bit vector math on some
subtargets, where the rest shall not be vectorized. I.e. we need
to differentiate based on the element size and operation itself.
Differential Revision: https://reviews.llvm.org/D92059
This patch updates VPWidenMemoryInstructionRecipe to use VPDef
to manage the value it produces instead of inheriting from VPValue.
Reviewed By: gilr
Differential Revision: https://reviews.llvm.org/D90563
Vector element size could be different for different store chains.
This patch prevents wrong computation of maximum number of elements
for that case.
Differential Revision: https://reviews.llvm.org/D93192
When it comes to the scalar cost of any predicated block, the loop
vectorizer by default regards this predication as a sign that it is
looking at an if-conversion and divides the scalar cost of the block by
2, assuming it would only be executed half the time. This however makes
no sense if the predication has been introduced to tail predicate the
loop.
Original patch by Anna Welker
Differential Revision: https://reviews.llvm.org/D86452
This is the first in a series of patches that attempts to migrate
existing cost instructions to return a new InstructionCost class
in place of a simple integer. This new class is intended to be
as light-weight and simple as possible, with a full range of
arithmetic and comparison operators that largely mirror the same
sets of operations on basic types, such as integers. The main
advantage to using an InstructionCost is that it can encode a
particular cost state in addition to a value. The initial
implementation only has two states - Normal and Invalid - but these
could be expanded over time if necessary. An invalid state can
be used to represent an unknown cost or an instruction that is
prohibitively expensive.
This patch adds the new class and changes the getInstructionCost
interface to return the new class. Other cost functions, such as
getUserCost, etc., will be migrated in future patches as I believe
this to be less disruptive. One benefit of this new class is that
it provides a way to unify many of the magic costs in the codebase
where the cost is set to a deliberately high number to prevent
optimisations taking place, e.g. vectorization. It also provides
a route to represent the extremely high, and unknown, cost of
scalarization of scalable vectors, which is not currently supported.
Differential Revision: https://reviews.llvm.org/D91174
This is an enhancement to load vectorization that is motivated by
a pattern in https://llvm.org/PR16739.
Unfortunately, it's still not enough to make a difference there.
We will have to handle multi-use cases in some better way to avoid
creating multiple overlapping loads.
Differential Revision: https://reviews.llvm.org/D92858
For stores chain vectorization we choose the size of vector
elements to ensure we fit to minimum and maximum vector register
size for the number of elements given. This patch corrects vector
element size choosing the width of value truncated just before
storing instead of the width of value stored.
Fixes PR46983
Differential Revision: https://reviews.llvm.org/D92824
* Steps are scaled by `vscale`, a runtime value.
* Changes to circumvent the cost-model for now (temporary)
so that the cost-model can be implemented separately.
This can vectorize the following loop [1]:
void loop(int N, double *a, double *b) {
#pragma clang loop vectorize_width(4, scalable)
for (int i = 0; i < N; i++) {
a[i] = b[i] + 1.0;
}
}
[1] This source-level example is based on the pragma proposed
separately in D89031. This patch only implements the LLVM part.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D91077
This patch removes a number of asserts that VF is not scalable, even though
the code where this assert lives does nothing that prevents VF being scalable.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D91060
It is possible to merge reuse and reorder shuffles and reduce the total
cost of the ivectorization tree/number of final instructions.
Differential Revision: https://reviews.llvm.org/D92668
The initial step of the uniform-after-vectorization (lane-0 demanded only) analysis was very awkwardly written. It would revisit use list of each pointer operand of a widened load/store. As a result, it was in the worst case O(N^2) where N was the number of instructions in a loop, and had restricted operand Value types to reduce the size of use lists.
This patch replaces the original algorithm with one which is at most O(2N) in the number of instructions in the loop. (The key observation is that each use of a potentially interesting pointer is visited at most twice, once on first scan, once in the use list of *it's* operand. Only instructions within the loop have their uses scanned.)
In the process, we remove a restriction which required the operand of the uniform mem op to itself be an instruction. This allows detection of uniform mem ops involving global addresses.
Differential Revision: https://reviews.llvm.org/D92056
This is yet another attempt at providing support for epilogue
vectorization following discussions raised in RFC http://llvm.1065342.n5.nabble.com/llvm-dev-Proposal-RFC-Epilog-loop-vectorization-tt106322.html#none
and reviews D30247 and D88819.
Similar to D88819, this patch achieve epilogue vectorization by
executing a single vplan twice: once on the main loop and a second
time on the epilogue loop (using a different VF). However it's able
to handle more loops, and generates more optimal control flow for
cases where the trip count is too small to execute any code in vector
form.
Reviewed By: SjoerdMeijer
Differential Revision: https://reviews.llvm.org/D89566
This might be a small improvement in readability, but the
real motivation is to make it easier to adapt the code to
deal with intrinsics like 'maxnum' and/or integer min/max.
There is potentially help in doing that with D92086, but
we might also just add specialized wrappers here to deal
with the expected patterns.