This patch adds a new transform to remove dead recipes. For now, it only
removes dead recipes in the header, to keep the number tests that require
updating manageable. Future patches will extend this to remove dead
recipes across the whole plan.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D118051
Extends getReductionOpChain to look through Phis which may be part of
the reduction chain. adjustRecipesForReductions will now also create a
CondOp for VPReductionRecipe if the block is predicated and not only if
foldTailByMasking is true.
Changes were required in tryToBlend to ensure that we don't attempt
to convert the reduction Phi into a select by returning a VPBlendRecipe.
The VPReductionRecipe will create a select between the Phi and the reduction.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D117580
This reverts commit 77a0da926c as we've
received multiple reports of this significantly impacting performance,
in ways that don't seem to just be target specific cost models going
wrong. I would offer some reproducers, but the test changes here seem to
be full of them!
Reverting for now and hopefully we can remove the "hack" more carefully
as we go.
The vectorizer will choose at times to "vectorize" loops with a scalar
factor (VF=1) with interleaving (IC > 1). This can occasionally produce
better code than the unroller (notable for reductions where it can
produce independent reduction chains that are combined after the loop).
At times this is not very beneficial though, for example when runtime
checks are needed or when the scalar code requires predication.
This addresses the second point, preventing the vectorizer from
interleaving when the scalar loop will require predication. This
prevents it from making a bit of a mess, that is worse than the original
and better left for the unroller to unroll if beneficial. It helps
reverse some of the regressions from D118090.
Differential Revision: https://reviews.llvm.org/D118566
D43208 extracted `useEmulatedMaskMemRefHack()` from legality into cost model.
What it essentially does is prevents scalarized vectorization of masked memory operations:
```
// TODO: Cost model for emulated masked load/store is completely
// broken. This hack guides the cost model to use an artificially
// high enough value to practically disable vectorization with such
// operations, except where previously deployed legality hack allowed
// using very low cost values. This is to avoid regressions coming simply
// from moving "masked load/store" check from legality to cost model.
// Masked Load/Gather emulation was previously never allowed.
// Limited number of Masked Store/Scatter emulation was allowed.
```
While i don't really understand about what specifically `is completely broken`
was talking about, i believe that at least on X86 with AVX2-or-later,
this is no longer true. (or at least, i would like to know what is still broken).
So i would like to follow suit after D111460, and like wise disable that hack for AVX2+.
But since this was added for X86 specifically, let's just instead completely remove this hack.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D114779
When the main loop is e.g. VF=vscale x 1 and the epilogue VF cannot
be any smaller, the vectorizer should try to estimate how many lanes are
executed at runtime and allow a suitable fixed-width VF to be chosen. It
can use VScaleForTuning to figure out what a suitable fixed-width VF could
be. For the case where the main loop VF is VF=vscale x 1, and VScaleForTuning=8,
it could still choose an epilogue VF upto VF=4.
This was a bit tricky to test, so this patch also introduces a wrapper
function to get 'VScaleForTuning' by also considering vscale_range.
If min and max are equal, then that will be the vscale we compile for.
It makes little sense to tune for a different width if the code
will not be portable for other widths.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D118709
For some reason we limited the epilogue VF to be fixed-width, but there
is not necessarily a reason for doing so. If the main VF=vscale x 16, the
epilogue VF could be either fixed-width, or a scalable VF upto vscale x 8.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D118688
This patch tries to use an existing VPWidenCanonicalIVRecipe
instead of creating another step-vector for canonical
induction recipes in widenIntOrFpInduction.
This has the following benefits:
1. First step to avoid setting both vector and scalar values for the
same induction def.
2. Reducing complexity of widenIntOrFpInduction through making things
more explicit in VPlan
3. Only need to splat the vector IV for block in masks.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D116123
This is a bugfix in IVDescriptor.cpp.
The helper function `RecurrenceDescriptor::getExactFPMathInst()`
is supposed to return the 1st FP instruction that does not allow
reordering. However, when constructing the RecurrenceDescriptor,
we trace the use-def chain staring from a PHI node and for each
instruction in the use-def chain, its descriptor overrides the
previous one. Therefore in the final RecurrenceDescriptor we
constructed, we lose previous FP instructions that does not allow
reordering.
Reviewed By: kmclaughlin
Differential Revision: https://reviews.llvm.org/D118073
isCandidateForEpilogueVectorization will currently return false for loops
which contain reductions. This patch removes this restriction and makes
the following changes to support epilogue vectorisation with reductions:
- `fixReduction`: If fixReduction is being called during vectorisation of the
epilogue, the phi node it creates will need to additionally carry incoming
values from the middle block of the main loop.
- `createEpilogueVectorizedLoopSkeleton`: The incoming values of the phi
created by fixReduction are updated after the vec.epilog.iter.check block
is added. The phi is also moved to the preheader of the epilogue.
- `processLoop`: The start value of any VPReductionPHIRecipes are updated before
vectorising the epilogue loop. The getResumeInstr function added to the ILV
will return the resume instruction associated with the recurrence descriptor.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D116928
Adds `-prefer-inloop-reductions` to the RUN line of sve-tail-folding.ll & adds
a new test where in-loop reductions cannot be used (`@cond_xor_reduction`). NFC.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D117578
When SVE is enabled for AArch64 targets it makes more sense to use the
get.active.lane.mask intrinsic, because SVE has an exact 1-1 mapping
from the intrinsic to the 'whilelo' instruction for legal vector types.
This instruction neatly takes overflow into account as well. This patch
fixes an issue in VPInstruction::generateInstruction that assumed we are
only dealing with fixed-width vectors.
Differential Revision: https://reviews.llvm.org/D117109
This patch adds a new BranchOnCount VPInstruction opcode with 2
operands. It first compares its 2 operands (increment of canonical
induction and vector trip count), followed by a branch to either the
exit block or back to the vector header.
It must be the last recipe in the exit block of the topmost vector loop
region.
This extracts parts from D113224 and was discussed in D113223.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D116479
This is required to query the legality more precisely in the LoopVectorizer.
This adds another TTI function named 'forceScalarizeMaskedGather/Scatter'
function to work around the hack introduced for MVE, where
isLegalMaskedGather/Scatter would return an answer by second-guessing
where the function was called from, based on the Type passed in (vector
vs scalar). The new interface makes this explicit. It is also used by
X86 to check for vector widths where gather/scatters aren't profitable
(or don't exist) for certain subtargets.
Differential Revision: https://reviews.llvm.org/D115329
The code in VPWidenCanonicalIVRecipe::execute only worked for fixed-width
vectors due to the way we generate the values per lane. This patch changes
the code to use a combination of vector splats and step vectors to get
the same result. This then works for both fixed-width and scalable vectors.
Tests that exercise this code path for scalable vectors have been added here:
Transforms/LoopVectorize/AArch64/sve-tail-folding.ll
Differential Revision: https://reviews.llvm.org/D113180
This patch fixes up an issue with InnerLoopVectorizer::getOrCreateVectorTripCount
whereby we weren't correctly generating the runtime trip count
for scalable vectors when tail-folding.
It also removes some asserts in the tail-folding path for cases when
the VF is not scalable.
In this patch I have only permitted tail-folding to be enabled
explicitly for scalable vectors when the user has specified one
of the following flags:
-prefer-predicate-over-epilogue=predicate-dont-vectorize
-prefer-predicate-over-epilogue=predicate-else-scalar-epilogue
For now it's best not to enable tail-folding with scalable vectors for
low trip counts or when optimising for code size, since there has been
no analysis on whether this is worth it.
Various tests have been added here:
Transforms/LoopVectorize/AArch64/sve-tail-folding.ll
Transforms/LoopVectorize/AArch64/sve-tail-folding-forced.ll
The tests cannot be target independent because they require masked
load/store support, i.e. TTI.isLegalMaskedLoad and TTI.isLegalMaskedStore
need to return true.
Differential Revision: https://reviews.llvm.org/D113003
Similar to D116732, this adds basic scalar sadd_with_overflow,
uadd_with_overflow, ssub_with_overflow and usub_with_overflow costs for
aarch64, which are usually quite efficiently lowered.
Differential Revision: https://reviews.llvm.org/D116734
At the moment, the primary induction variable for the vector loop is
created as part of the skeleton creation. This is tied to creating the
vector loop latch outside of VPlan. This prevents from modeling the
*whole* vector loop in VPlan, which in turn is required to model
preheader and exit blocks in VPlan as well.
This patch introduces a new recipe VPCanonicalIVPHIRecipe to represent the
primary IV in VPlan and CanonicalIVIncrement{NUW} opcodes for
VPInstruction to model the increment.
This allows us to partly retire createInductionVariable. At the moment,
a bit of patching up is done after executing all blocks in the plan.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D113223
For loops that contain in-loop reductions but no loads or stores, large
VFs are chosen because LoopVectorizationCostModel::getSmallestAndWidestTypes
has no element types to check through and so returns the default widths
(-1U for the smallest and 8 for the widest). This results in the widest
VF being chosen for the following example,
float s = 0;
for (int i = 0; i < N; ++i)
s += (float) i*i;
which, for more computationally intensive loops, leads to large loop
sizes when the operations end up being scalarized.
In this patch, for the case where ElementTypesInLoop is empty, the widest
type is determined by finding the smallest type used by recurrences in
the loop instead of falling back to a default value of 8 bits. This
results in the cost model choosing a more sensible VF for loops like
the one above.
Differential Revision: https://reviews.llvm.org/D113973
The availability of SVE should be sufficient to enable scalable
auto-vectorization.
This patch adds a new TTI interface to query the target what style of
vectorization it wants when scalable vectors are available. For other
targets than AArch64, this currently defaults to 'FixedWidthOnly'.
Differential Revision: https://reviews.llvm.org/D115651
The basic idea to this is that a) having a single canonical type makes CSE easier, and b) many of our transforms are inconsistent about which types we end up with based on visit order.
I'm restricting this to constants as for non-constants, we'd have to decide whether the simplicity was worth extra instructions. For constants, there are no extra instructions.
We chose the canonical type as i64 arbitrarily. We might consider changing this to something else in the future if we have cause.
Differential Revision: https://reviews.llvm.org/D115387
This patch adds on an overhead cost for gathers and scatters, which
is a rough estimate based on performance investigations I have
performed on SVE hardware for various micro-benchmarks.
Differential Revision: https://reviews.llvm.org/D115143
I've added some tests that were previously missing for the gather-scatter costs
being calculated by the vectorizer for AArch64:
Transforms/LoopVectorize/AArch64/sve-gather-scatter-cost.ll
The costs are sometimes different to the ones in
Analysis/CostModel/AArch64/sve-gather.ll
because the vectorizer also adds on the address computation cost.
The default for min is changed to 1. The behaviour of -mvscale-{min,max}
in Clang is also changed such that 16 is the max vscale when targeting
SVE and no max is specified.
Reviewed By: sdesmalen, paulwalker-arm
Differential Revision: https://reviews.llvm.org/D113294
If the condition of a select is a compare, pass its predicate to
TTI::getCmpSelInstrCost to get a more accurate cost value instead
of passing BAD_ICMP_PREDICATE.
I noticed that the commit message from D90070 had a comment about the
vectorized select predicate possibly being composed of other compares with
different predicate values, but I wasn't able to construct an example
where this was an actual issue. If this is an issue, I guess we could
add another check that the block isn't predicated for any reason.
Reviewed By: dmgreen, fhahn
Differential Revision: https://reviews.llvm.org/D114646
collectLoopScalars should only add non-uniform nodes to the list if they
are used by a load/store instruction that is marked as CM_Scalarize.
Before this patch, the LV incorrectly marked pointer induction variables
as 'scalar' when they required to be widened by something else,
such as a compare instruction, and weren't used by a node marked as
'CM_Scalarize'. This case is covered by sve-widen-phi.ll.
This change also allows removing some code where the LV tried to
widen the PHI nodes with a stepvector, even though it was marked as
'scalarAfterVectorization'. Now that this code is more careful about
marking instructions that need widening as 'scalar', this code has
become redundant.
Differential Revision: https://reviews.llvm.org/D114373
In VPRecipeBuilder::handleReplication if we believe the instruction
is predicated we then proceed to create new VP region blocks even
when the load is uniform and only predicated due to tail-folding.
I have updated isPredicatedInst to avoid treating a uniform load as
predicated when tail-folding, which means we can do a single scalar
load and a vector splat of the value.
Tests added here:
Transforms/LoopVectorize/AArch64/tail-fold-uniform-memops.ll
Differential Revision: https://reviews.llvm.org/D112552
This patch updates the cost model for ordered reductions so that a call
to the llvm.fmuladd intrinsic is modelled as a normal fmul instruction
plus the cost of an ordered fadd reduction.
Differential Revision: https://reviews.llvm.org/D111630
In-loop vector reductions which use the llvm.fmuladd intrinsic involve
the creation of two recipes; a VPReductionRecipe for the fadd and a
VPInstruction for the fmul. If the call to llvm.fmuladd has fast-math flags
these should be propagated through to the fmul instruction, so an
interface setFastMathFlags has been added to the VPInstruction class to
enable this.
Differential Revision: https://reviews.llvm.org/D113125
This patch fixes PR52111. The problem is that LV propagates poison-generating flags (`nuw`/`nsw`, `exact`
and `inbounds`) in instructions that contribute to the address computation of widen loads/stores that are
guarded by a condition. It may happen that when the code is vectorized and the control flow within the loop
is linearized, these flags may lead to generating a poison value that is effectively used as the base address
of the widen load/store. The fix drops all the integer poison-generating flags from instructions that
contribute to the address computation of a widen load/store whose original instruction was in a basic block
that needed predication and is not predicated after vectorization.
Reviewed By: fhahn, spatel, nlopes
Differential Revision: https://reviews.llvm.org/D111846
A first step towards modeling preheader and exit blocks in VPlan as well.
Keeping the vector loop in a region allows for changing the VF as we
traverse region boundaries.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D113182
checkOrderedReductions looks for Phi nodes which can be classified as in-order,
meaning they can be vectorised without unsafe math. In order to vectorise the
reduction it should also be classified as in-loop by getReductionOpChain, which
checks that the reduction has two uses.
In this patch, a similar check is added to checkOrderedReductions so that we
now return false if there are more than two uses of the FAdd instruction.
This fixes PR52515.
Reviewed By: fhahn, david-arm
Differential Revision: https://reviews.llvm.org/D114002
When getTypeConversion returns TypeScalarizeScalableVector we were
sometimes returning a non-simple type from getTypeLegalizationCost.
However, many callers depend upon this being a simple type and will
crash if not. This patch changes getTypeLegalizationCost to ensure
that we always a return sensible simple VT. If the vector type
contains unusual integer types, e.g. <vscale x 2 x i3>, then we just
set the type to MVT::i64 as a reasonable default.
A test has been added here that demonstrates the vectoriser can
correctly calculate the cost of vectorising a "zext i3 to i64"
instruction with a VF=vscale x 1:
Transforms/LoopVectorize/AArch64/sve-inductions-unusual-types.ll
Differential Revision: https://reviews.llvm.org/D113777
When asking how many parts are required for a scalable vector type
there are occasions when it cannot be computed. For example, <vscale x 1 x i3>
is one such vector for AArch64+SVE because at the moment no matter how we
promote the i3 type we never end up with a legal vector. This means
that getTypeConversion returns TypeScalarizeScalableVector as the
LegalizeKind, and then getTypeLegalizationCost returns an invalid cost.
This then causes BasicTTImpl::getNumberOfParts to dereference an invalid
cost, which triggers an assert. This patch changes getNumberOfParts to
return 0 for such cases, since the definition of getNumberOfParts in
TargetTransformInfo.h states that we can use a return value of 0 to represent
an unknown answer.
Currently, LoopVectorize.cpp is the only place where we need to check for
0 as a return value, because all other instances will not currently
ask for the number of parts for <vscale x 1 x iX> types.
In addition, I have changed the target-independent interface for
getNumberOfParts to return 1 and assume there is a single register
that can fit the type. The loop vectoriser has lots of tests that are
target-independent and they relied upon the 0 value to mean the
answer is known and that we are not scalarising the vector.
I have added tests here that show we correctly return an invalid cost
for VF=vscale x 1 when the loop contains unusual types such as i7:
Transforms/LoopVectorize/AArch64/sve-inductions-unusual-types.ll
Differential Revision: https://reviews.llvm.org/D113772
`collectElementTypesForWidening` collects the types of load, store and
reduction Phis in a loop. These types are later checked using
`isElementTypeLegalForScalableVector` to prevent vectorisation of
loops with instruction types that are unsupported.
This patch removes i1 from the list of types supported for scalable
vectors. This fixes an assert ("Cannot yet scalarize uniform stores") in
`setCostBasedWideningDecision` when we have a loop containing a uniform
i1 store and a scalable VF, which we cannot create a scatter for.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D113680
When creating a splat of 0 for scalable vectors we tend to create them
with using a combination of shufflevector and insertelement, i.e.
shufflevector (<vscale x 4 x i32> insertelement (<vscale x 4 x i32> poison, i32 0, i32 0),
<vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer)
However, for the case of a zero splat we can actually just replace the
above with zeroinitializer instead. This makes the IR a lot simpler and
easier to read. I have changed ConstantFoldShuffleVectorInstruction to
use zeroinitializer when creating a splat of integer 0 or FP +0.0 values.
Differential Revision: https://reviews.llvm.org/D113394
Changes VPReplicateRecipe to extract the last lane from an unconditional,
uniform store instruction. collectLoopUniforms will also add stores to
the list of uniform instructions where Legal->isUniformMemOp is true.
setCostBasedWideningDecision now sets the widening decision for
all uniform memory ops to Scalarize, where previously GatherScatter
may have been chosen for scalable stores.
This fixes an assert ("Cannot yet scalarize uniform stores") in
setCostBasedWideningDecision when we have a loop containing a
uniform i1 store and a scalable VF, which we cannot create a scatter for.
Reviewed By: sdesmalen, david-arm, fhahn
Differential Revision: https://reviews.llvm.org/D112725
When targeting a specific CPU with scalable vectorization, the knowledge
of that particular CPU's vscale value can be used to tune the cost-model
and make the cost per lane less pessimistic.
If the target implements 'TTI.getVScaleForTuning()', the cost-per-lane
is calculated as:
Cost / (VScaleForTuning * VF.KnownMinLanes)
Otherwise, it assumes a value of 1 meaning that the behavior
is unchanged and calculated as:
Cost / VF.KnownMinLanes
Reviewed By: kmclaughlin, david-arm
Differential Revision: https://reviews.llvm.org/D113209
At the moment in LoopVectorizationCostModel::selectEpilogueVectorizationFactor
we bail out if the main vector loop uses a scalable VF. This patch adds
support for generating epilogue vector loops using a fixed-width VF when the
main vector loop uses a scalable VF.
I've changed LoopVectorizationCostModel::selectEpilogueVectorizationFactor
so that we convert the scalable VF into a fixed-width VF and do profitability
checks on that instead. In addition, since the scalable and fixed-width VFs
live in different VPlans that means I had to change the calls to
LVP.hasPlanWithVFs so that we only pass in the fixed-width VF.
New tests added here:
Transforms/LoopVectorize/AArch64/sve-epilog-vect.ll
Differential Revision: https://reviews.llvm.org/D109432
I've added a test for a loop containing a conditional uniform load for
a target that supports masked loads. The test just ensures that we
correctly use gather instructions and have the correct mask.
Differential Revision: https://reviews.llvm.org/D112619
This patch updates VPReductionRecipe::execute so that the fast-math
flags associated with the underlying instruction of the VPRecipe are
propagated through to the reductions which are created.
Differential Revision: https://reviews.llvm.org/D112548
There's precedent for that in `CreateOr()`/`CreateAnd()`.
The motivation here is to avoid bloating the run-time check's IR
in `SCEVExpander::generateOverflowCheck()`.
Refs. https://reviews.llvm.org/D109368#3089809
The math here is:
Cost of 1 load = cost of n loads / n
Cost of live loads = num live loads * Cost of 1 load
Cost of live loads = num live loads * (cost of n loads / n)
Cost of live loads = cost of n loads * (num live loads / n)
But, all the variables here are integers,
and integer division rounds down,
but this calculation clearly expects float semantics.
Instead multiply upfront, and then perform round-up-division.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D112302
This patch introduces a new function:
AArch64Subtarget::getVScaleForTuning
that returns a value for vscale that can be used for tuning the cost
model when using scalable vectors. The VScaleForTuning option in
AArch64Subtarget is initialised according to the following rules:
1. If the user has specified the CPU to tune for we use that, else
2. If the target CPU was specified we use that, else
3. The tuning is set to "generic".
For CPUs of type "generic" I have assumed that vscale=2.
New tests added here:
Analysis/CostModel/AArch64/sve-gather.ll
Analysis/CostModel/AArch64/sve-scatter.ll
Transforms/LoopVectorize/AArch64/sve-strict-fadd-cost.ll
Differential Revision: https://reviews.llvm.org/D110259
collectLoopScalars collects pointer induction updates in ScalarPtrs, assuming
that the instruction will be scalar after vectorization. This may crash later
in VPReplicateRecipe::execute() if there there is another user of the instruction
other than the Phi node which needs to be widened.
This changes collectLoopScalars so that if there are any other users of
Update other than a Phi node, it is not added to ScalarPtrs.
Reviewed By: david-arm, fhahn
Differential Revision: https://reviews.llvm.org/D111294
This patch adds further support for vectorisation of loops that involve
selecting an integer value based on a previous comparison. Consider the
following C++ loop:
int r = a;
for (int i = 0; i < n; i++) {
if (src[i] > 3) {
r = b;
}
src[i] += 2;
}
We should be able to vectorise this loop because all we are doing is
selecting between two states - 'a' and 'b' - both of which are loop
invariant. This just involves building a vector of values that contain
either 'a' or 'b', where the final reduced value will be 'b' if any lane
contains 'b'.
The IR generated by clang typically looks like this:
%phi = phi i32 [ %a, %entry ], [ %phi.update, %for.body ]
...
%pred = icmp ugt i32 %val, i32 3
%phi.update = select i1 %pred, i32 %b, i32 %phi
We already detect min/max patterns, which also involve a select + cmp.
However, with the min/max patterns we are selecting loaded values (and
hence loop variant) in the loop. In addition we only support certain
cmp predicates. This patch adds a new pattern matching function
(isSelectCmpPattern) and new RecurKind enums - SelectICmp & SelectFCmp.
We only support selecting values that are integer and loop invariant,
however we can support any kind of compare - integer or float.
Tests have been added here:
Transforms/LoopVectorize/AArch64/sve-select-cmp.ll
Transforms/LoopVectorize/select-cmp-predicated.ll
Transforms/LoopVectorize/select-cmp.ll
Differential Revision: https://reviews.llvm.org/D108136
This patch adds further support for vectorisation of loops that involve
selecting an integer value based on a previous comparison. Consider the
following C++ loop:
int r = a;
for (int i = 0; i < n; i++) {
if (src[i] > 3) {
r = b;
}
src[i] += 2;
}
We should be able to vectorise this loop because all we are doing is
selecting between two states - 'a' and 'b' - both of which are loop
invariant. This just involves building a vector of values that contain
either 'a' or 'b', where the final reduced value will be 'b' if any lane
contains 'b'.
The IR generated by clang typically looks like this:
%phi = phi i32 [ %a, %entry ], [ %phi.update, %for.body ]
...
%pred = icmp ugt i32 %val, i32 3
%phi.update = select i1 %pred, i32 %b, i32 %phi
We already detect min/max patterns, which also involve a select + cmp.
However, with the min/max patterns we are selecting loaded values (and
hence loop variant) in the loop. In addition we only support certain
cmp predicates. This patch adds a new pattern matching function
(isSelectCmpPattern) and new RecurKind enums - SelectICmp & SelectFCmp.
We only support selecting values that are integer and loop invariant,
however we can support any kind of compare - integer or float.
Tests have been added here:
Transforms/LoopVectorize/AArch64/sve-select-cmp.ll
Transforms/LoopVectorize/select-cmp-predicated.ll
Transforms/LoopVectorize/select-cmp.ll
Differential Revision: https://reviews.llvm.org/D108136
The expansion for these was updated in https://reviews.llvm.org/D47927 but the cost model was not adjusted.
I believe the cost model was also incorrect for the old expansion.
The expansion prior to D47927 used 3 icmps using LHS, RHS, and Result
to calculate theirs signs. Then 2 icmps to compare the signs. Followed
by an And. The previous cost model was using 3 icmps and 2 selects.
Digging back through git blame, those 2 selects in the cost model used to
be 2 icmps, but were changed in https://reviews.llvm.org/D90681
Differential Revision: https://reviews.llvm.org/D110739
In ValueTracking.cpp we use a function called
computeKnownBitsFromOperator to determine the known bits of a value.
For the vscale intrinsic if the function contains the vscale_range
attribute we can use the maximum and minimum values of vscale to
determine some known zero and one bits. This should help to improve
code quality by allowing certain optimisations to take place.
Tests added here:
Transforms/InstCombine/icmp-vscale.ll
Differential Revision: https://reviews.llvm.org/D109883
For SVE, when scalarising the PHI instruction the whole vector part is
generated as opposed to creating instructions for each lane for fixed-
width vectors. However, in some cases the lane values may be needed
later (e.g for a load instruction) so we still need to calculate
these values to avoid extractelement being called on the vector part.
Differential Revision: https://reviews.llvm.org/D109445
Reverted (manually due to merge conflicts) while regressions reported on PR51540 are investigated
As noticed on D106352, after we've folded "(select C, (gep Ptr, Idx), Ptr) -> (gep Ptr, (select C, Idx, 0))" if the inner Ptr was also a (now one use) gep we could then merge the geps, using the sum of the indices instead.
I've limited this to basic 2-op geps - a more general case further down InstCombinerImpl.visitGetElementPtrInst doesn't have the one-use limitation but only creates the add if it can be created via SimplifyAddInst.
https://alive2.llvm.org/ce/z/f8pLfD (Thanks Roman!)
Differential Revision: https://reviews.llvm.org/D106450
Adjusting the reduction recipes still relies on references to the
original IR, which can become outdated by the first-order recurrence
handling. Until reduction recipe construction does not require IR
references, move it before first-order recurrence handling, to prevent a
crash as exposed by D106653.
This reverts commit f4122398e7 to
investigate a crash exposed by it.
The patch breaks building the code below with `clang -O2 --target=aarch64-linux`
int a;
double b, c;
void d() {
for (; a; a++) {
b += c;
c = a;
}
}
I have added a new TTI interface called enableOrderedReductions() that
controls whether or not ordered reductions should be enabled for a
given target. By default this returns false, whereas for AArch64 it
returns true and we rely upon the cost model to make sensible
vectorisation choices. It is still possible to override the new TTI
interface by setting the command line flag:
-force-ordered-reductions=true|false
I have added a new RUN line to show that we use ordered reductions by
default for SVE and Neon:
Transforms/LoopVectorize/AArch64/strict-fadd.ll
Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll
Differential Revision: https://reviews.llvm.org/D106653
For tight loops like this:
float r = 0;
for (int i = 0; i < n; i++) {
r += a[i];
}
it's better not to vectorise at -O3 using fixed-width ordered reductions
on AArch64 targets. Although the resulting number of instructions in the
generated code ends up being comparable to not vectorising at all, there
may be additional costs on some CPUs, for example perhaps the scheduling
is worse. It makes sense to deter vectorisation in tight loops.
Differential Revision: https://reviews.llvm.org/D108292
Removed AArch64 usage of the getMaxVScale interface, replacing it with
the vscale_range(min, max) IR Attribute.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D106277
Previously we emitted a "does not support scalable vectors"
remark for all targets whenever vectorisation is attempted. This
pollutes the output for architectures that don't support scalable
vectors and is likely confusing to the user.
Instead this patch introduces a debug message that reports when
scalable vectorisation is allowed by the target and only issues
the previous remark when scalable vectorisation is specifically
requested, for example:
#pragma clang loop vectorize_width(2, scalable)
Differential Revision: https://reviews.llvm.org/D108028
I have added RUN lines to both:
Transforms/LoopVectorize/AArch64/strict-fadd.ll
Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll
to show the default behaviour is to not vectorise when the following
flag is unset:
-force-ordered-reductions
This patch updates ConstantVector::getSplat to use poison instead
of undef when using insertelement/shufflevector to splat.
This follows on from D93793.
Differential Revision: https://reviews.llvm.org/D107751
Since all operands to ExtractValue must be loop-invariant when we deem
the loop vectorizable, we can consider ExtractValue to be uniform.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D107286
This change wasn't strictly necessary for D106164 and could be removed.
This patch addresses the post-commit comments from @fhahn on D106164, and
also changes sve-widen-gep.ll to use the same IR test as shown in
pointer-induction.ll.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D106878
I'm renaming the flag because a future patch will add a new
enableOrderedReductions() TTI interface and so the meaning of this
flag will change to be one of forcing the target to enable/disable
them. Also, since other places in LoopVectorize.cpp use the word
'Ordered' instead of 'strict' I changed the flag to match.
Differential Revision: https://reviews.llvm.org/D107264
It was writing to the source directory (which may not be writeable),
rather than using %t.
Fixes: a5dd6c6cf9 ("[LoopVectorize] Don't interleave scalar ordered reductions for inner loops")
Consider the following loop:
void foo(float *dst, float *src, int N) {
for (int i = 0; i < N; i++) {
dst[i] = 0.0;
for (int j = 0; j < N; j++) {
dst[i] += src[(i * N) + j];
}
}
}
When we are not building with -Ofast we may attempt to vectorise the
inner loop using ordered reductions instead. In addition we also try
to select an appropriate interleave count for the inner loop. However,
when choosing a VF=1 the inner loop will be scalar and there is existing
code in selectInterleaveCount that limits the interleave count to 2
for reductions due to concerns about increasing the critical path.
For ordered reductions this problem is even worse due to the additional
data dependency, and so I've added code to simply disable interleaving
for scalar ordered reductions for now.
Test added here:
Transforms/LoopVectorize/AArch64/strict-fadd-vf1.ll
Differential Revision: https://reviews.llvm.org/D106646
The loop vectorizer may decide to use tail folding when the trip-count
is low. When that happens, scalable VFs are no longer a candidate,
since tail folding/predication is not yet supported for scalable vectors.
This can be re-enabled in a future patch.
Reviewed By: kmclaughlin
Differential Revision: https://reviews.llvm.org/D106657
Invalid costs can be used to avoid vectorization with a given VF, which is
used for scalable vectors to avoid things that the code-generator cannot
handle. If we override the cost using the -force-target-instruction-cost
option of the LV, we would override this mechanism, rendering the flag useless.
This change ensures the cost is only overriden when the original cost that
was calculated is valid. That allows the flag to be used in combination
with the -scalable-vectorization option.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D106677
This change moves most of `sve-inductions.ll` to non-AArch64 specific
LV tests using the `-target-supports-scalable-vectors` flag, because they're
not explicitly AArch64-specific. One test builds on AArch64-specific
knowledge regarding masked loads/stores, and remains in sve-inductions.ll.
Scalarization for scalable vectors is not (yet) supported, so the
LV discards a VF when scalarization is chosen as the widening
decision. It should therefore not assert that the VF is not scalable
when it computes the decision to scalarize.
The code can get here when both the interleave-cost, gather/scatter cost
and scalarization-cost are all illegal. This may e.g. happen for SVE
when the VF=1, to avoid generating `<vscale x 1 x eltty>` types that
the code-generator cannot yet handle.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D106656
This fixes an issue that was found in D105199, where a GEP instruction
is used both as the address of a store, as well as the value of a store.
For the former, the value is scalar after vectorization, but the latter
(as value) requires widening.
Other code in that function seems to prevent similar cases from happening,
but it seems this case was missed.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D106164
I have added a new FastMathFlags parameter to getArithmeticReductionCost
to indicate what type of reduction we are performing:
1. Tree-wise. This is the typical fast-math reduction that involves
continually splitting a vector up into halves and adding each
half together until we get a scalar result. This is the default
behaviour for integers, whereas for floating point we only do this
if reassociation is allowed.
2. Ordered. This now allows us to estimate the cost of performing
a strict vector reduction by treating it as a series of scalar
operations in lane order. This is the case when FP reassociation
is not permitted. For scalable vectors this is more difficult
because at compile time we do not know how many lanes there are,
and so we use the worst case maximum vscale value.
I have also fixed getTypeBasedIntrinsicInstrCost to pass in the
FastMathFlags, which meant fixing up some X86 tests where we always
assumed the vector.reduce.fadd/mul intrinsics were 'fast'.
New tests have been added here:
Analysis/CostModel/AArch64/reduce-fadd.ll
Analysis/CostModel/AArch64/sve-intrinsics.ll
Transforms/LoopVectorize/AArch64/strict-fadd-cost.ll
Transforms/LoopVectorize/AArch64/sve-strict-fadd-cost.ll
Differential Revision: https://reviews.llvm.org/D105432
This patch avoids computing discounts for predicated instructions when the
VF is scalable.
There is no support for vectorization of loops with division because the
vectorizer cannot guarantee that zero divisions will not happen.
This loop now does not use VF scalable
```
for (long long i = 0; i < n; i++)
if (cond[i])
a[i] /= b[i];
```
Differential Revision: https://reviews.llvm.org/D101916
As noticed on D106352, after we've folded "(select C, (gep Ptr, Idx), Ptr) -> (gep Ptr, (select C, Idx, 0))" if the inner Ptr was also a (now one use) gep we could then merge the geps, using the sum of the indices instead.
I've limited this to basic 2-op geps - a more general case further down InstCombinerImpl.visitGetElementPtrInst doesn't have the one-use limitation but only creates the add if it can be created via SimplifyAddInst.
https://alive2.llvm.org/ce/z/f8pLfD (Thanks Roman!)
Differential Revision: https://reviews.llvm.org/D106450
This patch returns an Invalid cost from getInstructionCost() for alloca
instructions if the VF is scalable, as otherwise loops which contain
these instructions will crash when attempting to scalarize the alloca.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D105824
The original patch was:
https://reviews.llvm.org/D105806
There were some issues with undeterministic behaviour of the sorting
function, which led to scalable-call.ll passing and/or failing. This
patch fixes the issue by numbering all instructions in the array first,
and using that number as the order, which should provide a consistent
ordering.
This reverts commit a607f64118.
The sort function for emitting an OptRemark was not deterministic,
which caused scalable-call.ll to fail on some buildbots. This patch
fixes that.
This patch also fixes an issue where `Instruction::comesBefore()`
is called when two Instructions are in different basic blocks,
which would otherwise cause an assertion failure.
This patch emits remarks for instructions that have invalid costs for
a given set of vectorization factors. Some example output:
t.c:4:19: remark: Instruction with invalid costs prevented vectorization at VF=(vscale x 1): load
dst[i] = sinf(src[i]);
^
t.c:4:14: remark: Instruction with invalid costs prevented vectorization at VF=(vscale x 1, vscale x 2, vscale x 4): call to llvm.sin.f32
dst[i] = sinf(src[i]);
^
t.c:4:12: remark: Instruction with invalid costs prevented vectorization at VF=(vscale x 1): store
dst[i] = sinf(src[i]);
^
Reviewed By: fhahn, kmclaughlin
Differential Revision: https://reviews.llvm.org/D105806
At the moment, <vscale x 1 x eltty> are not yet fully handled by the
code-generator, so to avoid vectorizing loops with that VF, we mark the
cost for these types as invalid.
The reason for not adding a new "TTI::getMinimumScalableVF" is because
the type is supposed to be a type that can be legalized. It partially is,
although the support for these types need some more work.
Reviewed By: paulwalker-arm, dmgreen
Differential Revision: https://reviews.llvm.org/D103882
This patch adds a TTI function, isElementTypeLegalForScalableVector, to query
whether it is possible to vectorize a given element type. This is called by
isLegalToVectorizeInstTypesForScalable to reject scalable vectorization if
any of the instruction types in the loop are unsupported, e.g:
int foo(__int128_t* ptr, int N)
#pragma clang loop vectorize_width(4, scalable)
for (int i=0; i<N; ++i)
ptr[i] = ptr[i] + 42;
This example currently crashes if we attempt to vectorize since i128 is not a
supported type for scalable vectorization.
Reviewed By: sdesmalen, david-arm
Differential Revision: https://reviews.llvm.org/D102253
Loads of <4 x i8> vectors were modeled as extremely expensive. And while we
don't have a load instruction that supports this, it isn't that expensive to
create a vector of i8 elements. The codegen for this was fixed/optimised in
D105110. This now tweaks the cost model and enables SLP vectorisation of my
motivating case loadi8.ll.
Differential Revision: https://reviews.llvm.org/D103629