This patch is a follow-up to D115953. It updates optimizeInductions
to also introduce new VPScalarIVStepsRecipes if an IV has both vector
and scalar uses.
It updates all uses that only need scalar values to use the newly
created recipe for the scalar steps.
This completes untangling of VPWidenIntOrFpInductionRecipe
code-generation. Now the recipe *only* creates the widened vector
values, as it says on the tin.
The code to generate IR has been moved directly to
VPWidenIntOrFpInductionRecipe::execute.
Note that the recipe has been updated to hold a reference to
ScalarEvolution, which is needed to expand the step, until we can place
the corresponding SCEV expansion in the pre-header.
Depends on D120827.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D120828
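To make the vector/scalar split concrete, here is a hand-written IR sketch (not taken from the patch or its tests; names are illustrative, and %n is assumed to be a non-zero multiple of the VF) of a VF=4 vector body where the widened IV feeds a vector user while separate scalar steps feed a scalar-only address computation:
```
define void @iv_vector_and_scalar_uses(i64* %a, i64 %n) {
entry:
  br label %vector.body

vector.body:
  %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
  ; Widened vector IV value, produced for vector users of the induction.
  %bcast = insertelement <4 x i64> poison, i64 %index, i32 0
  %splat = shufflevector <4 x i64> %bcast, <4 x i64> poison, <4 x i32> zeroinitializer
  %vec.iv = add <4 x i64> %splat, <i64 0, i64 1, i64 2, i64 3>
  ; Scalar step for lane 0, feeding a scalar-only use (the store address).
  %step.0 = add i64 %index, 0
  %gep.0 = getelementptr inbounds i64, i64* %a, i64 %step.0
  %vec.ptr = bitcast i64* %gep.0 to <4 x i64>*
  store <4 x i64> %vec.iv, <4 x i64>* %vec.ptr, align 8
  %index.next = add nuw i64 %index, 4
  %done = icmp eq i64 %index.next, %n
  br i1 %done, label %exit, label %vector.body

exit:
  ret void
}
```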
The analysis passes output the function name enclosed in `'` quotes,
but LV uses `"`. Harmonizing this may help in creating an update script
for the LV cost-model test checks.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D121105
The test used to run the whole O3 pipeline. Modify it to contain the LLVM IR right
before LV and limit the passes to "-loop-vectorize -simplifycfg".
For the RUN line with a forced VF, force the interleave factor as well to simplify the
CHECKs, as interleaving isn't related to the purpose of the test.
I also tried to add "noalias" to pointer arguments in
@test_gather_not_profitable_pr48429 but LAI seems unable to use them.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D119786
This reverts the revert commit ff93260bf6.
The underlying issue causing the PPC bot failures has been fixed in
cbaac14734 and a corresponding test case has been added in
ad2cad1c52.
Original message:
This patch adds a new VPScalarIVStepsRecipe to handle building scalar
steps.
In the first patch, it only handles the case where there is no vector
induction variable needed.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D115953
This patch adds a new transform to remove dead recipes. For now, it only
removes dead recipes in the header, to keep the number of tests that require
updating manageable. Future patches will extend this to remove dead
recipes across the whole plan.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D118051
Now that integer min/max intrinsics have good support in both
InstCombine and other passes, start canonicalizing SPF min/max
to intrinsic min/max.
Once this sticks, we can stop matching SPF min/max in various
places, and can remove hacks we have for preventing infinite loops
and breaking of SPF canonicalization.
Differential Revision: https://reviews.llvm.org/D98152
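For illustration, a minimal before/after pair (my own example, not taken from the patch) showing the select-of-compare (SPF) form of a signed max being canonicalized to the @llvm.smax intrinsic:
```
; Select-of-icmp ("SPF") form of signed max.
define i32 @smax_spf(i32 %a, i32 %b) {
  %cmp = icmp sgt i32 %a, %b
  %max = select i1 %cmp, i32 %a, i32 %b
  ret i32 %max
}

; Canonical intrinsic form after InstCombine.
define i32 @smax_intrinsic(i32 %a, i32 %b) {
  %max = call i32 @llvm.smax.i32(i32 %a, i32 %b)
  ret i32 %max
}

declare i32 @llvm.smax.i32(i32, i32)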
Adds new optimization remarks when loop vectorization fails due to
the compiler being unable to find the bound of an array access inside
a loop.
Differential Revision: https://reviews.llvm.org/D115873
This reverts commit 77a0da926c as we've
received multiple reports of this significantly impacting performance,
in ways that don't seem to just be target specific cost models going
wrong. I would offer some reproducers, but the test changes here seem to
be full of them!
Reverting for now and hopefully we can remove the "hack" more carefully
as we go.
D43208 extracted `useEmulatedMaskMemRefHack()` from legality into cost model.
What it essentially does is prevent scalarized vectorization of masked memory operations:
```
// TODO: Cost model for emulated masked load/store is completely
// broken. This hack guides the cost model to use an artificially
// high enough value to practically disable vectorization with such
// operations, except where previously deployed legality hack allowed
// using very low cost values. This is to avoid regressions coming simply
// from moving "masked load/store" check from legality to cost model.
// Masked Load/Gather emulation was previously never allowed.
// Limited number of Masked Store/Scatter emulation was allowed.
```
While I don't really understand what exactly `is completely broken` refers to,
I believe that at least on X86 with AVX2 or later,
this is no longer true (or at least, I would like to know what is still broken).
So I would like to follow suit after D111460 and likewise disable that hack for AVX2+.
But since this was added for X86 specifically, let's just completely remove this hack instead.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D114779
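As a rough sketch of what "scalarized" (emulated) masked memory operations look like, here is the per-lane shape the vectorizer emits for a predicated store; this is a hand-written single-lane illustration, not output from the patch's tests:
```
; One lane of an emulated masked store: branch on the lane's mask bit
; around a plain scalar store. The pattern is repeated for every lane.
define void @emulated_masked_store_lane(i1 %mask.bit, i32 %val, i32* %addr) {
entry:
  br i1 %mask.bit, label %pred.store.if, label %pred.store.continue

pred.store.if:
  store i32 %val, i32* %addr
  br label %pred.store.continue

pred.store.continue:
  ret void
}
```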
This removes the remaining dependence on LoopVectorizationCostModel from
buildScalarSteps and is required so it can be moved out of ILV.
It also allows us to remove a few unneeded instructions.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D116554
This patch tries to use an existing VPWidenCanonicalIVRecipe
instead of creating another step-vector for canonical
induction recipes in widenIntOrFpInduction.
This has the following benefits:
1. It is a first step towards avoiding setting both vector and scalar values for the
same induction def.
2. It reduces the complexity of widenIntOrFpInduction by making things
more explicit in VPlan.
3. The vector IV only needs to be splatted for block-in masks.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D116123
isCandidateForEpilogueVectorization will currently return false for loops
which contain reductions. This patch removes this restriction and makes
the following changes to support epilogue vectorisation with reductions:
- `fixReduction`: If fixReduction is being called during vectorisation of the
epilogue, the phi node it creates will need to additionally carry incoming
values from the middle block of the main loop.
- `createEpilogueVectorizedLoopSkeleton`: The incoming values of the phi
created by fixReduction are updated after the vec.epilog.iter.check block
is added. The phi is also moved to the preheader of the epilogue.
- `processLoop`: The start values of any VPReductionPHIRecipes are updated before
vectorising the epilogue loop. The getResumeInstr function added to the ILV
will return the resume instruction associated with the recurrence descriptor.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D116928
The modified tests didn't have actual users of all inductions, making it
trivial to eliminate them. Add users to make sure the inductions are
actually used in the vectorized version.
After d4a8fc3a87 LV stopped adding metadata to disable runtime
unrolling to the vectorized epilogue loop. This was missed because
278aa65cc4 removed the relevant test coverage.
This patch fixes that by adding the relevant metadata after
vector loop generation.
This patch adds a new BranchOnCount VPInstruction opcode with 2
operands. It first compares its 2 operands (increment of canonical
induction and vector trip count), followed by a branch to either the
exit block or back to the vector header.
It must be the last recipe in the exit block of the topmost vector loop
region.
This extracts parts from D113224 and was discussed in D113223.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D116479
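Roughly, BranchOnCount models the familiar latch pattern in the generated IR; the hand-written sketch below (VF=4, illustrative names, assuming %n.vec is a non-zero multiple of the VF) shows the compare of the incremented canonical induction against the vector trip count followed by the back-branch:
```
define void @branch_on_count_shape(i64 %n.vec) {
entry:
  br label %vector.body

vector.body:
  %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
  ; ... widened loop body ...
  %index.next = add nuw i64 %index, 4
  %done = icmp eq i64 %index.next, %n.vec
  br i1 %done, label %middle.block, label %vector.body

middle.block:
  ret void
}
```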
9345ab3a45 updated generateOverflowCheck to skip creating checks that
always evaluate to false. This in turn means that we only need to check
for overflows if the result of the multiplication is actually used.
Sink the Or for the overflow check into ComputeEndCheck, so it is only
created when there's an actual check.
Currently generateOverflowCheck always creates code for Step being
negative and positive, followed by a select at the end depending on
Step's sign.
This patch updates the code to only create either the checks for step
being positive or negative, if the sign is known.
Follow-up to D116696.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D116747
This patch updates SCEVExpander::expandUnionPredicate to not create
redundant 'or false, x' instructions. While those are trivially
foldable, they can be easily avoided and hinder code that checks the
size/cost of the generated checks before further folds.
I am planning to look into a few other similar improvements to the code
generated by SCEVExpander.
I remember a while ago @lebedev.ri was working on doing some trivial folds
like that in IRBuilder itself, but there were concerns that such
changes may subtly break existing code.
Reviewed By: reames, lebedev.ri
Differential Revision: https://reviews.llvm.org/D116696
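A small, hand-made illustration of the redundant instruction in question (not the exact code produced by expandUnionPredicate):
```
; Unioning two runtime checks by folding into an accumulator that starts
; at 'false' used to produce a trivially foldable leading 'or':
define i1 @union_checks_before(i1 %check1, i1 %check2) {
  %acc = or i1 false, %check1
  %res = or i1 %acc, %check2
  ret i1 %res
}

; With the change, the 'or false, x' is skipped:
define i1 @union_checks_after(i1 %check1, i1 %check2) {
  %res = or i1 %check1, %check2
  ret i1 %res
}
```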
This was originally added in rG22174f5d5af1eb15b376c6d49e7925cbb7cca6be
although that patch doesn't really mention any reasons for ignoring the
pointer type in this calculation if the memory access isn't consecutive.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D115356
At the moment, the primary induction variable for the vector loop is
created as part of the skeleton creation. This is tied to creating the
vector loop latch outside of VPlan. This prevents modeling the
*whole* vector loop in VPlan, which in turn is required to model
preheader and exit blocks in VPlan as well.
This patch introduces a new recipe VPCanonicalIVPHIRecipe to represent the
primary IV in VPlan and CanonicalIVIncrement{NUW} opcodes for
VPInstruction to model the increment.
This allows us to partly retire createInductionVariable. At the moment,
a bit of patching up is done after executing all blocks in the plan.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D113223
For loops that contain in-loop reductions but no loads or stores, large
VFs are chosen because LoopVectorizationCostModel::getSmallestAndWidestTypes
has no element types to check through and so returns the default widths
(-1U for the smallest and 8 for the widest). This results in the widest
VF being chosen for the following example:
```
  float s = 0;
  for (int i = 0; i < N; ++i)
    s += (float) i*i;
```
which, for more computationally intensive loops, leads to large loop
sizes when the operations end up being scalarized.
In this patch, for the case where ElementTypesInLoop is empty, the widest
type is determined by finding the smallest type used by recurrences in
the loop instead of falling back to a default value of 8 bits. This
results in the cost model choosing a more sensible VF for loops like
the one above.
Differential Revision: https://reviews.llvm.org/D113973
VPWidenCanonicalIVRecipe does not create PHI instructions, so it does
not need to be placed in the phi section of a VPBasicBlock.
Also tidies the code so the WidenCanonicalIV recipe and the
compare/lane-masks are created in the header.
Discussed in D113223.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D116473
((X << C) + Y) >>u C --> (X + (Y >>u C)) & (-1 >>u C)
https://alive2.llvm.org/ce/z/DY9DPg
This replaces a shift with an 'and', and in the case
where the add has a constant operand, it eliminates
both shifts.
As noted in the TODO comment, we already have this fold when
the shifts are in the opposite order (and that code handles
bitwise logic ops too).
Fixes #52851
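A concrete instance of the fold with C == 3 on i8 (my own example; the Alive2 link above has the general proof):
```
; Before: ((x << 3) + y) >>u 3
define i8 @src(i8 %x, i8 %y) {
  %shl = shl i8 %x, 3
  %add = add i8 %shl, %y
  %res = lshr i8 %add, 3
  ret i8 %res
}

; After: (x + (y >>u 3)) & (-1 >>u 3), i.e. masked with 31 for i8
define i8 @tgt(i8 %x, i8 %y) {
  %yshr = lshr i8 %y, 3
  %add = add i8 %x, %yshr
  %res = and i8 %add, 31
  ret i8 %res
}
```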
The basic idea behind this is that a) having a single canonical type makes CSE easier, and b) many of our transforms are inconsistent about which types we end up with based on visit order.
I'm restricting this to constants since, for non-constants, we'd have to decide whether the simplicity is worth the extra instructions. For constants, there are no extra instructions.
We chose the canonical type as i64 arbitrarily. We might consider changing this to something else in the future if we have cause.
Differential Revision: https://reviews.llvm.org/D115387
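A minimal illustration, assuming a typed-pointer GEP with a constant index (hand-written, not from the patch's tests):
```
; Before: constant index expressed with a narrow type.
define i32* @gep_i32_index(i32* %base) {
  %p = getelementptr inbounds i32, i32* %base, i32 4
  ret i32* %p
}

; After: the constant index is canonicalized to i64.
define i32* @gep_i64_index(i32* %base) {
  %p = getelementptr inbounds i32, i32* %base, i64 4
  ret i32* %p
}
```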
This reverts commit bbfaf0b170.
Post commit review noted a case where my manual update lost intentional check lines. Given I've abandoned the motivating patch, I'm just reverting the autogen prep.
For the simple copy loop (see the test case) the vectorizer selects a VF of 32 while the loop is known to have only 17 iterations. Such behavior makes no sense to me, since such a vector loop will never be executed. The only case where we may want to select a VF larger than TC is masked vectorization, so I haven't touched that case.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D114528
We ask `TTI.getAddressComputationCost()` about the cost of computing a vector address,
and then multiply it by the vector width. This doesn't make any sense;
it implies that we'd do a vector GEP and then scalarize the vector of pointers,
but there is no such thing in the vectorized IR, where we perform scalar GEPs.
This is *especially* bad on X86, and was effectively prohibiting any scalarized
vectorization of gathers/scatters, because `X86TTIImpl::getAddressComputationCost()`
says that cost of vector address computation is `10` as compared to `1` for scalar.
The computed costs are similar to the ones with D111222+D111220,
but we end up without masked memory intrinsics that we'd then have to
expand later on, without much luck. (D111363)
Differential Revision: https://reviews.llvm.org/D111460
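The point about scalar GEPs can be seen in the shape of a scalarized gather; a hand-written VF=2 sketch (illustrative names, not from the patch's tests) is:
```
; Scalarized gather: one scalar GEP and one scalar load per lane, then the
; results are reassembled into a vector. There is no vector GEP whose
; pointers would need scalarizing.
define <2 x float> @scalarized_gather(float* %base, <2 x i64> %idx) {
  %idx0 = extractelement <2 x i64> %idx, i32 0
  %p0 = getelementptr inbounds float, float* %base, i64 %idx0
  %v0 = load float, float* %p0, align 4
  %idx1 = extractelement <2 x i64> %idx, i32 1
  %p1 = getelementptr inbounds float, float* %base, i64 %idx1
  %v1 = load float, float* %p1, align 4
  %vec0 = insertelement <2 x float> poison, float %v0, i32 0
  %vec1 = insertelement <2 x float> %vec0, float %v1, i32 1
  ret <2 x float> %vec1
}
```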
This patch fixes PR52111. The problem is that LV propagates poison-generating flags (`nuw`/`nsw`, `exact`
and `inbounds`) in instructions that contribute to the address computation of widened loads/stores that are
guarded by a condition. It may happen that, when the code is vectorized and the control flow within the loop
is linearized, these flags lead to generating a poison value that is effectively used as the base address
of the widened load/store. The fix drops all the integer poison-generating flags from instructions that
contribute to the address computation of a widened load/store whose original instruction was in a basic block
that needed predication and is not predicated after vectorization.
Reviewed By: fhahn, spatel, nlopes
Differential Revision: https://reviews.llvm.org/D111846
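In IR terms, the fix amounts to stripping flags like `inbounds` from address computations that become unconditional after linearization; a minimal hand-written before/after sketch (not from the patch's tests):
```
; Original scalar form: the GEP only executes when the guarding condition
; holds, so its poison-generating 'inbounds' flag is harmless.
define float* @addr_predicated(float* %a, i64 %off) {
  %gep = getelementptr inbounds float, float* %a, i64 %off
  ret float* %gep
}

; After vectorization the control flow is linearized and the address is
; computed unconditionally, so the flag is dropped.
define float* @addr_linearized(float* %a, i64 %off) {
  %gep = getelementptr float, float* %a, i64 %off
  ret float* %gep
}
```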
This is one of those wonderful "in theory X doesn't matter, but in practice it does" changes. In this particular case, we shift the IVs inserted by the runtime unroller to clamp the iteration count of the loops* from decrementing to incrementing.
Why does this matter? A couple of reasons:
* SCEV doesn't have a native subtract node. Instead, all subtracts (A - B) are represented as A + -1 * B and drops any flags invalidated by such. As a result, SCEV is slightly less good at reasoning about edge cases involving decrementing addrecs than incrementing ones. (You can see this in the inferred flags in some of the test cases.)
* Other parts of the optimizer produce incrementing IVs, and they're common in idiomatic source language. We do have support for reversing IVs, but in general if we produce one of each, the pair will persist surprisingly far through the optimizer before being coalesced. (You can see this looking at nearby phis in the test cases.)
Note that if the hardware prefers decrementing (i.e. zero tested) loops, LSR should convert back immediately before codegen.
* Mostly irrelevant detail: The main loop of the prolog case is handled independently and will simply use the original IV with a changed start value. We could in theory use this scheme for all iteration clamping, but that's a larger and more invasive change.
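For reference, the two IV shapes look roughly like this (hand-written remainder-loop skeletons, assuming the remainder count %rem is at least 1):
```
; Old style: counter decrements from %rem down to zero.
define void @remainder_dec(i64 %rem) {
entry:
  br label %loop
loop:
  %iv = phi i64 [ %rem, %entry ], [ %iv.next, %loop ]
  %iv.next = add i64 %iv, -1
  %done = icmp eq i64 %iv.next, 0
  br i1 %done, label %exit, label %loop
exit:
  ret void
}

; New style: counter increments from zero up to %rem, which SCEV tends to
; reason about more precisely.
define void @remainder_inc(i64 %rem) {
entry:
  br label %loop
loop:
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
  %iv.next = add nuw i64 %iv, 1
  %done = icmp eq i64 %iv.next, %rem
  br i1 %done, label %exit, label %loop
exit:
  ret void
}
```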
This reverts commit 7cd273c339.
Several patches with tests fixes have been applied:
0cada82f0a "[Test] Remove incorrect test in GVN"
97cb13615d "[Test] Separate IndVars test into AArch64 and X86 parts"
985cc490f1 "[Test] Remove separated test in IndVars",
and test failures caused by 5ec2386 should be resolved now.
Changes VPReplicateRecipe to extract the last lane from an unconditional,
uniform store instruction. collectLoopUniforms will also add stores to
the list of uniform instructions where Legal->isUniformMemOp is true.
setCostBasedWideningDecision now sets the widening decision for
all uniform memory ops to Scalarize, where previously GatherScatter
may have been chosen for scalable stores.
This fixes an assert ("Cannot yet scalarize uniform stores") in
setCostBasedWideningDecision when we have a loop containing a
uniform i1 store and a scalable VF, which we cannot create a scatter for.
Reviewed By: sdesmalen, david-arm, fhahn
Differential Revision: https://reviews.llvm.org/D112725
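As a small sketch of the "extract the last lane" codegen for a uniform store (illustrative, VF=4, not taken from the patch's tests):
```
; A store to a loop-invariant address only needs the value from the last
; vector lane of each vector iteration.
define void @uniform_store_last_lane(i32* %invariant.addr, <4 x i32> %vec.val) {
  %last = extractelement <4 x i32> %vec.val, i32 3
  store i32 %last, i32* %invariant.addr, align 4
  ret void
}
```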
This reapplies patch db289340c8.
The test failures on build with expensive checks caused by the patch happened due
to the fact that we sorted loop Phis in replaceCongruentIVs using llvm::sort,
which shuffles the given container if the expensive checks are enabled,
so equivalent Phis in the sorted vector had different mutual order from run
to run. replaceCongruentIVs tries to replace narrow Phis with truncations
of wide ones. In some test cases there were several Phis with the same
width, so if their order differs from run to run, the narrow Phis would
be replaced with a different Phi, depending on the shuffling result.
The patch ae14fae0ff fixed this issue by
replacing llvm::sort with llvm::stable_sort.
In IndVarSimplify after simplifying and extending loop IVs we call 'replaceCongruentIVs'.
This function optionally takes a TTI argument to be able to replace narrow IVs uses
with truncates of the widest one.
For some reason the TTI wasn't passed to the function, so it couldn't perform such a
transform.
This patch fixes it.
Reviewed By: mkazantsev
Differential Revision: https://reviews.llvm.org/D113024
Upon further investigation and discussion,
this is actually the opposite direction from what we should be taking,
and this direction wouldn't solve the motivational problem anyway.
Additionally, some more (polly) tests have escaped being updated.
So, let's just take a step back here.
This reverts commit f3190dedee.
This reverts commit 749581d21f.
This reverts commit f3df87d57e.
This reverts commit ab1dbcecd6.