For some reason we limited the epilogue VF to be fixed-width, but there
is not necessarily a reason for doing so. If the main VF=vscale x 16, the
epilogue VF could be either fixed-width, or a scalable VF upto vscale x 8.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D118688
This removes another instance of recipe execution still relying on
the cost model.
Depends on D116554.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D116656
This removes the remaining dependence on LoopVectorizationCostModel from
buildScalarSteps and is required so it can be moved out of ILV.
It also improves allows us to remove a few unneeded instructions.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D116554
This patch tries to use an existing VPWidenCanonicalIVRecipe
instead of creating another step-vector for canonical
induction recipes in widenIntOrFpInduction.
This has the following benefits:
1. First step to avoid setting both vector and scalar values for the
same induction def.
2. Reducing complexity of widenIntOrFpInduction through making things
more explicit in VPlan
3. Only need to splat the vector IV for block in masks.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D116123
This explicitly records whether a scalar IV is needed in the
VPWidenIntOrFpInductionRecipe, to remove a dependence on the cost-model
during its ::execute.
It will also be used in D116123 to determine if a vector phi will be
generated.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D118167
isCandidateForEpilogueVectorization will currently return false for loops
which contain reductions. This patch removes this restriction and makes
the following changes to support epilogue vectorisation with reductions:
- `fixReduction`: If fixReduction is being called during vectorisation of the
epilogue, the phi node it creates will need to additionally carry incoming
values from the middle block of the main loop.
- `createEpilogueVectorizedLoopSkeleton`: The incoming values of the phi
created by fixReduction are updated after the vec.epilog.iter.check block
is added. The phi is also moved to the preheader of the epilogue.
- `processLoop`: The start value of any VPReductionPHIRecipes are updated before
vectorising the epilogue loop. The getResumeInstr function added to the ILV
will return the resume instruction associated with the recurrence descriptor.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D116928
This patch updates createBlockInMask to always generate
VPWidenCanonicalIVRecipe and adds a transform to optimize it away later,
if it is not needed.
This is a step towards breaking up VPWidenIntOrFpInductionRecipe and
explicitly distinguishing between vector phis and scalarizing.
Split off from D116123.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D117140
This patch adds VPWidenIntOrFpInductionRecipe::isCanonical to check if
an induction recipe is canonical. The code is also updated to use it
instead of isCanonicalID.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D117551
After d4a8fc3a87 LV stopped adding metadata to disable runtime
unrolling to the vectorized epilogue loop. This was missed because
278aa65cc4 removed the relevant test coverage.
This patch fixes that by adding the relevant metadata after
vector loop generation.
This reverts the revert commit 073c27b5e5.
A reduced test case has been added in 5e4966cbae and the code has
been updated to handle the case where getInductionOpcode returns
BinaryOpsEnd. In this case, the original code was always using
Instruction::Add. Do the same in the patch.
Note this commit may slightly change the value naming, because it now
also assigns the 'induction' name in the floating point case.
Causes a crash with the following (creduce'd) test-case:
clang -O3 '--target=aarch64-grtev4-linux-gnu' -xc - -c -o /dev/null <<EOF
int *e;
int f;
int g() {
int h;
int *j = 0;
while (&f - j > 0) {
int k;
k = j;
if (e == j && *e)
k = 5;
h = k;
j++;
}
return h;
}
EOF
This reverts commit 7ce48be0fd.
This patch adds a new BranchOnCount VPInstruction opcode with 2
operands. It first compares its 2 operands (increment of canonical
induction and vector trip count), followed by a branch to either the
exit block or back to the vector header.
It must be the last recipe in the exit block of the topmost vector loop
region.
This extracts parts from D113224 and was discussed in D113223.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D116479
This is required to query the legality more precisely in the LoopVectorizer.
This adds another TTI function named 'forceScalarizeMaskedGather/Scatter'
function to work around the hack introduced for MVE, where
isLegalMaskedGather/Scatter would return an answer by second-guessing
where the function was called from, based on the Type passed in (vector
vs scalar). The new interface makes this explicit. It is also used by
X86 to check for vector widths where gather/scatters aren't profitable
(or don't exist) for certain subtargets.
Differential Revision: https://reviews.llvm.org/D115329
This patch fixes up an issue with InnerLoopVectorizer::getOrCreateVectorTripCount
whereby we weren't correctly generating the runtime trip count
for scalable vectors when tail-folding.
It also removes some asserts in the tail-folding path for cases when
the VF is not scalable.
In this patch I have only permitted tail-folding to be enabled
explicitly for scalable vectors when the user has specified one
of the following flags:
-prefer-predicate-over-epilogue=predicate-dont-vectorize
-prefer-predicate-over-epilogue=predicate-else-scalar-epilogue
For now it's best not to enable tail-folding with scalable vectors for
low trip counts or when optimising for code size, since there has been
no analysis on whether this is worth it.
Various tests have been added here:
Transforms/LoopVectorize/AArch64/sve-tail-folding.ll
Transforms/LoopVectorize/AArch64/sve-tail-folding-forced.ll
The tests cannot be target independent because they require masked
load/store support, i.e. TTI.isLegalMaskedLoad and TTI.isLegalMaskedStore
need to return true.
Differential Revision: https://reviews.llvm.org/D113003
There is no need to sort inserted instructions by dominance, as the
deletion loop still requires RAUW with undef before deleting. Removing
instructions in reverse insertion order should still insure that the
number of uselist updates is kept to a minimum.
This was originally added in rG22174f5d5af1eb15b376c6d49e7925cbb7cca6be
although that patch doesn't really mention any reasons for ignoring the
pointer type in this calculation if the memory access isn't consecutive.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D115356
At the moment, the primary induction variable for the vector loop is
created as part of the skeleton creation. This is tied to creating the
vector loop latch outside of VPlan. This prevents from modeling the
*whole* vector loop in VPlan, which in turn is required to model
preheader and exit blocks in VPlan as well.
This patch introduces a new recipe VPCanonicalIVPHIRecipe to represent the
primary IV in VPlan and CanonicalIVIncrement{NUW} opcodes for
VPInstruction to model the increment.
This allows us to partly retire createInductionVariable. At the moment,
a bit of patching up is done after executing all blocks in the plan.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D113223
For loops that contain in-loop reductions but no loads or stores, large
VFs are chosen because LoopVectorizationCostModel::getSmallestAndWidestTypes
has no element types to check through and so returns the default widths
(-1U for the smallest and 8 for the widest). This results in the widest
VF being chosen for the following example,
float s = 0;
for (int i = 0; i < N; ++i)
s += (float) i*i;
which, for more computationally intensive loops, leads to large loop
sizes when the operations end up being scalarized.
In this patch, for the case where ElementTypesInLoop is empty, the widest
type is determined by finding the smallest type used by recurrences in
the loop instead of falling back to a default value of 8 bits. This
results in the cost model choosing a more sensible VF for loops like
the one above.
Differential Revision: https://reviews.llvm.org/D113973
Setting the loop metadata for the vector loop after VPlan execution
allows generating the full loop body during VPlan execution. This is in
preparation for D113224.
VPWidenCanonicalIVRecipe does not create PHI instructions, so it does
not need to be placed in the phi section of a VPBasicBlock.
Also tidies the code so the WidenCanonicalIV recipe and the
compare/lane-masks are created in the header.
Discussed D113223.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D116473
This patch adds a new prepareToExecute helper to set up live-ins, so
VPTransformState doesn't need to hold values like TripCount.
This also requires making the trip count operand for ActiveLaneMask
explicit in VPlan.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D116320
Not all header phis widen the phi, e.g. like the new
VPCanonicalIVPHIRecipe in D113223. To let those recipes also inherit
from a phi-like base class, add a more generic VPHeaderPHIRecipe
abstract base class.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D116304
By creating the header and latch blocks up front and adding blocks and
recipes in between those 2 blocks we ensure that the entry and exits of
the plan remain valid throughout construction.
In order to avoid test changes and keep printing of the plans the same,
we use the new header block instead of creating a new block on the first
iteration of the loop traversing the original loop.
We also fold the latch into its predecessor.
This is a follow up to a post-commit suggestion in D114586.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D115793
The availability of SVE should be sufficient to enable scalable
auto-vectorization.
This patch adds a new TTI interface to query the target what style of
vectorization it wants when scalable vectors are available. For other
targets than AArch64, this currently defaults to 'FixedWidthOnly'.
Differential Revision: https://reviews.llvm.org/D115651
Pass in the vector header instead of relying on ILV::LoopVectorBody.
This reduces the dependence on state from ILV. Where VPTransformState is
available, State.CFG.PrevBB can be used.
For the simple copy loop (see test case) vectorizer selects VF equal to 32 while the loop is known to have 17 iterations only. Such behavior makes no sense to me since such vector loop will never be executed. The only case we may want to select VF large than TC is masked vectoriztion. So I haven't touched that case.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D114528
Given a MLA reduction from two different types (say i8 and i16), we were
previously failing to find the reduction pattern, often making us chose
the lower vector factor. This improves that by using the largest of the
two extension types, allowing us to use the larger VF as the type of the
reduction.
As per https://godbolt.org/z/KP549EEYM the backend handles this
valiantly, leading to better performance.
Differential Revision: https://reviews.llvm.org/D115432
This patch simplifies handling of redundant induction casts, by
removing dead cast instructions after initial VPlan construction.
This has the following benefits:
1. fixes a crash
(see @test_optimized_cast_induction_feeding_first_order_recurrence)
2. Simplifies VPWidenIntOrFpInduction to a single-def recipes
3. Retires recordVectorLoopValueForInductionCast.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D115112
This patch uses a similar trick as in D113947 to only run the extra
passes after vectorization on functions where loops have been
vectorized.
The reason for running the 'extra vector passes' is
simplification/unswitching of the runtime checks created by LV, there
should be no need to run them if nothing got vectorized
To do that, a new dummy analysis ShouldRunExtraVectorPasses has been
added. If loops have been vectorized for a function, LV will cache the
analysis. At the moment it uses MadeCFGChanges as proxy for loop
vectorized, which isn't perfect (it could be too aggressive, e.g.
because no runtime checks have been added), but should be good enough
for now.
The extra passes are now managed by a new FunctionPassManager that
runs its passes only if ShouldRunExtraVectorPasses has been cached.
Without this patch, `-extra-vectorizer-passes` has the following
compile-time impact:
NewPM-O3: +4.86%
NewPM-ReleaseThinLTO: +3.56%
NewPM-ReleaseLTO-g: +7.17%
http://llvm-compile-time-tracker.com/compare.php?from=ead3979a92fc33add4710c4510d6906260dcb4ad&to=c292da649e2c6e88a31e702fdc474727d09c72bc&stat=instructions
With this patch, that gets reduced to
NewPM-O3: +1.43%
NewPM-ReleaseThinLTO: +1.00%
NewPM-ReleaseLTO-g: +1.58%
http://llvm-compile-time-tracker.com/compare.php?from=ead3979a92fc33add4710c4510d6906260dcb4ad&to=e67d86b57810011cf285eb9aa1944781be6096f0&stat=instructions
It is probably still too high to enable by default, but much better.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D115052
This allows easier access to the induction descriptor from VPlan,
without needing to go through Legal. VPReductionPHIRecipe already
contains a RecurrenceDescriptor in a similar fashion.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D115111
Both the entry and exit blocks of the top-region of a plan must be
VPBasicBlocks. They also must have no predecessors or successors
respectively.
This invariant was broken when splitting a block for sink-after. To fix
the issue, set the exit block of the region *after* sink-after is done.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D114586