This was originally added in rG22174f5d5af1eb15b376c6d49e7925cbb7cca6be
although that patch doesn't really mention any reasons for ignoring the
pointer type in this calculation if the memory access isn't consecutive.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D115356
At the moment, the primary induction variable for the vector loop is
created as part of the skeleton creation. This is tied to creating the
vector loop latch outside of VPlan. This prevents from modeling the
*whole* vector loop in VPlan, which in turn is required to model
preheader and exit blocks in VPlan as well.
This patch introduces a new recipe VPCanonicalIVPHIRecipe to represent the
primary IV in VPlan and CanonicalIVIncrement{NUW} opcodes for
VPInstruction to model the increment.
This allows us to partly retire createInductionVariable. At the moment,
a bit of patching up is done after executing all blocks in the plan.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D113223
For loops that contain in-loop reductions but no loads or stores, large
VFs are chosen because LoopVectorizationCostModel::getSmallestAndWidestTypes
has no element types to check through and so returns the default widths
(-1U for the smallest and 8 for the widest). This results in the widest
VF being chosen for the following example,
float s = 0;
for (int i = 0; i < N; ++i)
s += (float) i*i;
which, for more computationally intensive loops, leads to large loop
sizes when the operations end up being scalarized.
In this patch, for the case where ElementTypesInLoop is empty, the widest
type is determined by finding the smallest type used by recurrences in
the loop instead of falling back to a default value of 8 bits. This
results in the cost model choosing a more sensible VF for loops like
the one above.
Differential Revision: https://reviews.llvm.org/D113973
This reverts commit fd4808887e.
This patch causes gcc to issue a lot of warnings like:
warning: base class ‘class llvm::MCParsedAsmOperand’ should be
explicitly initialized in the copy constructor [-Wextra]
Setting the loop metadata for the vector loop after VPlan execution
allows generating the full loop body during VPlan execution. This is in
preparation for D113224.
Rather than checking for nounwind in particular, make sure the
instruction is guaranteed to transfer execution, which will also
handle non-willreturn calls correctly.
Fixes https://github.com/llvm/llvm-project/issues/52950.
VPWidenCanonicalIVRecipe does not create PHI instructions, so it does
not need to be placed in the phi section of a VPBasicBlock.
Also tidies the code so the WidenCanonicalIV recipe and the
compare/lane-masks are created in the header.
Discussed D113223.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D116473
Need to clear CurrentOrder order mask if it is determined that
extractelements form identity order and need to use a vector-like
construct when iterating over ordered entries in the reorderTopToBottom
function.
This patch adds a new prepareToExecute helper to set up live-ins, so
VPTransformState doesn't need to hold values like TripCount.
This also requires making the trip count operand for ActiveLaneMask
explicit in VPlan.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D116320
We should not lose analysis precision if an 'add' has both no-wrap
flags (nsw and nuw) compared to just one or the other.
This patch is modeled on a similar construct that was added with
D59386.
I don't think it is possible to expose a problem with an unsigned
compare because of the way this was coded (nuw is handled first).
InstCombine has an assert that fires with the example from:
https://github.com/llvm/llvm-project/issues/52884
...because it was expecting InstSimplify to handle this kind of
pattern with an smax.
Fixes#52884
Differential Revision: https://reviews.llvm.org/D116322
Not all header phis widen the phi, e.g. like the new
VPCanonicalIVPHIRecipe in D113223. To let those recipes also inherit
from a phi-like base class, add a more generic VPHeaderPHIRecipe
abstract base class.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D116304
By creating the header and latch blocks up front and adding blocks and
recipes in between those 2 blocks we ensure that the entry and exits of
the plan remain valid throughout construction.
In order to avoid test changes and keep printing of the plans the same,
we use the new header block instead of creating a new block on the first
iteration of the loop traversing the original loop.
We also fold the latch into its predecessor.
This is a follow up to a post-commit suggestion in D114586.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D115793
The availability of SVE should be sufficient to enable scalable
auto-vectorization.
This patch adds a new TTI interface to query the target what style of
vectorization it wants when scalable vectors are available. For other
targets than AArch64, this currently defaults to 'FixedWidthOnly'.
Differential Revision: https://reviews.llvm.org/D115651
Need to check for the number of the unique non-constant values since the
unique values may include several constants.
Differential Revision: https://reviews.llvm.org/D115939
Need to check for the number of the unique non-constant values since the
unique values may include several constants.
Differential Revision: https://reviews.llvm.org/D115939
Pass in the vector header instead of relying on ILV::LoopVectorBody.
This reduces the dependence on state from ILV. Where VPTransformState is
available, State.CFG.PrevBB can be used.
Need to early exit out of the reordering process if the perfect/shuffled match is found in the operands. Such pattern will result in not profitable reordering because of (false positive) external use of scalars.
Differential Revision: https://reviews.llvm.org/D115811
These are deprecated and should be replaced with getAlign().
Some of these asserts don't do anything because Load/Store/AllocaInst never have a 0 align value.
No need to represent splats as a node with the reused scalars, it may
increase the cost (currently pass just ignores extra shuffle cost and it
is still not correct).
Differential Revision: https://reviews.llvm.org/D115800
Changes the preliminary multinode analysis:
1. Introduced scores for reversed loads/extractelements.
2. Improved shallow score calculation.
3. Lowered the cost of external uses (no need to consider it several times, just ones).
4. The initial lane for analysis is the one with the minimal possible
reorderings.
These changes in general shall reduce compile time and improve the
reordering in many cases.
Part of D57059.
Differential Revision: https://reviews.llvm.org/D101109
If the gather node is a mix of undefvalues and exractelement
instructions, need to take the ordering for such nodes into account too.
It allows to reorder some (sub)trees and remove some extra shuffles,
improving overall vectorization.
Also, outlined common functionality into a separate function.
Differential Revision: https://reviews.llvm.org/D115358
For the simple copy loop (see test case) vectorizer selects VF equal to 32 while the loop is known to have 17 iterations only. Such behavior makes no sense to me since such vector loop will never be executed. The only case we may want to select VF large than TC is masked vectoriztion. So I haven't touched that case.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D114528
Given a MLA reduction from two different types (say i8 and i16), we were
previously failing to find the reduction pattern, often making us chose
the lower vector factor. This improves that by using the largest of the
two extension types, allowing us to use the larger VF as the type of the
reduction.
As per https://godbolt.org/z/KP549EEYM the backend handles this
valiantly, leading to better performance.
Differential Revision: https://reviews.llvm.org/D115432
This patch simplifies handling of redundant induction casts, by
removing dead cast instructions after initial VPlan construction.
This has the following benefits:
1. fixes a crash
(see @test_optimized_cast_induction_feeding_first_order_recurrence)
2. Simplifies VPWidenIntOrFpInduction to a single-def recipes
3. Retires recordVectorLoopValueForInductionCast.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D115112
This patch uses a similar trick as in D113947 to only run the extra
passes after vectorization on functions where loops have been
vectorized.
The reason for running the 'extra vector passes' is
simplification/unswitching of the runtime checks created by LV, there
should be no need to run them if nothing got vectorized
To do that, a new dummy analysis ShouldRunExtraVectorPasses has been
added. If loops have been vectorized for a function, LV will cache the
analysis. At the moment it uses MadeCFGChanges as proxy for loop
vectorized, which isn't perfect (it could be too aggressive, e.g.
because no runtime checks have been added), but should be good enough
for now.
The extra passes are now managed by a new FunctionPassManager that
runs its passes only if ShouldRunExtraVectorPasses has been cached.
Without this patch, `-extra-vectorizer-passes` has the following
compile-time impact:
NewPM-O3: +4.86%
NewPM-ReleaseThinLTO: +3.56%
NewPM-ReleaseLTO-g: +7.17%
http://llvm-compile-time-tracker.com/compare.php?from=ead3979a92fc33add4710c4510d6906260dcb4ad&to=c292da649e2c6e88a31e702fdc474727d09c72bc&stat=instructions
With this patch, that gets reduced to
NewPM-O3: +1.43%
NewPM-ReleaseThinLTO: +1.00%
NewPM-ReleaseLTO-g: +1.58%
http://llvm-compile-time-tracker.com/compare.php?from=ead3979a92fc33add4710c4510d6906260dcb4ad&to=e67d86b57810011cf285eb9aa1944781be6096f0&stat=instructions
It is probably still too high to enable by default, but much better.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D115052
This allows easier access to the induction descriptor from VPlan,
without needing to go through Legal. VPReductionPHIRecipe already
contains a RecurrenceDescriptor in a similar fashion.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D115111
The comparator for the sort functions should provide strict weak
ordering relation between parameters. Current solution causes compiler
crash with some standard c++ library implementations, because it does
not meet this criteria. Tried to fix it + it improves the iverall
vectorization result.
Differential Revision: https://reviews.llvm.org/D115268
The recurrence lowering code has handling which claims to be about flag intersection, but all the callers pass empty arrays to the arguments. The sole exception is a caller of a method which has the argument, but no implementation.
I don't know what the intent was here, but it certaintly doesn't actually do anything today.
Both the entry and exit blocks of the top-region of a plan must be
VPBasicBlocks. They also must have no predecessors or successors
respectively.
This invariant was broken when splitting a block for sink-after. To fix
the issue, set the exit block of the region *after* sink-after is done.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D114586
Need to add an extra check for potential undef values in
computeExtractCost function to avoid compiler crash on casting to
instructon.
Differential Revision: https://reviews.llvm.org/D115162
ILV::scalarizeInstruction still uses the original IR operands to check
if an input value is uniform after vectorization.
There is no need to go back to the cost model to figure that out, as the
information is already explicit in the VPlan. Just check directly
whether the VPValue is defined outside the plan or is a uniform
VPReplicateRecipe.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D114253
If the condition of a select is a compare, pass its predicate to
TTI::getCmpSelInstrCost to get a more accurate cost value instead
of passing BAD_ICMP_PREDICATE.
I noticed that the commit message from D90070 had a comment about the
vectorized select predicate possibly being composed of other compares with
different predicate values, but I wasn't able to construct an example
where this was an actual issue. If this is an issue, I guess we could
add another check that the block isn't predicated for any reason.
Reviewed By: dmgreen, fhahn
Differential Revision: https://reviews.llvm.org/D114646
VPWidenIntOrFpInductionRecipes can either be constructed with a PHI and
an optional cast or a PHI and a trunc instruction. Reflect this in 2
separate constructors. This also simplifies a follow-up change.
If the extractelement instruction is used multiple times in the
different tree entries (either vectorized, or gathered), need to
compensate the scalar cost of such instructions. They are completely
removed if all users are part of the tree but we need to compensate the
cost only once for each instruction.
Differential Revision: https://reviews.llvm.org/D114958
Need to outline the code for finding common vectors in insertelement
instructions into a separate function for future patches. It also
improves the process by adding some extra checks for early exit and
fixes a bug where it always finds the match because of erroneous compare
of the same values.
Differential Revision: https://reviews.llvm.org/D114909
If several shuffle instructions are emitted, some of them might
same/compatible (less defined) with the previously emitted ones. Such
shuffles can be removed safely, improving the total cost of the
vectorized code.
Differential Revision: https://reviews.llvm.org/D114087
Improved the calculation of the shuffled extracts, where possible. Need
to calculate the cost for the extracted scalars if some users are not
insertelements + improved the total estimation of the shuffled scalars
used in insertelements build vectors.
Differential Revision: https://reviews.llvm.org/D113782
Undefined vector might be not only the UndefValue, but also it can be
a constant vector with undef ot poison elements, need to check for this
kind of undef too.
Differential Revision: https://reviews.llvm.org/D114873
The code in widenMemoryInstruction has already been transitioned
to only rely on information provided by VPWidenMemoryInstructionRecipe
directly.
Moving the code directly to VPWidenMemoryInstructionRecipe::execute
completes the transition for the recipe.
It provides the following advantages:
1. Less indirection, easier to see what's going on.
2. Removes accesses to fields of ILV.
2) in particular ensures that no dependencies on
fields in ILV for vector code generation are re-introduced.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D114324
Extended support for undefined source vector/extract indices/non-fixed
vector types, also no need to check for the parent of the extractelement
instructions with the constant indicies.
Differential Revision: https://reviews.llvm.org/D114121
The code in widenSelectInstruction has already been transitioned
to only rely on information provided by VPWidenSelectRecipe directly.
Moving the code directly to VPWidenSelectRecipe::execute completes
the transition for the recipe.
It provides the following advantages:
1. Less indirection, easier to see what's going on.
2. Removes accesses to fields of ILV.
2) in particular ensures that no dependencies on
fields in ILV for vector code generation are re-introduced.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D114323
We ask `TTI.getAddressComputationCost()` about the cost of computing vector address,
and then multiply it by the vector width. This doesn't make any sense,
it implies that we'd do a vector GEP and then scalarize the vector of pointers,
but there is no such thing in the vectorized IR, we perform scalar GEP's.
This is *especially* bad on X86, and was effectively prohibiting any scalarized
vectorization of gathers/scatters, because `X86TTIImpl::getAddressComputationCost()`
says that cost of vector address computation is `10` as compared to `1` for scalar.
The computed costs are similar to the ones with D111222+D111220,
but we end up without masked memory intrinsics that we'd then have to
expand later on, without much luck. (D111363)
Differential Revision: https://reviews.llvm.org/D111460
The code in widenInstruction has already been transitioned to
only rely on information provided by VPWidenRecipe directly.
Moving the code directly to VPWidenRecipe::execute completes
the transition for the recipe.
It provides the following advantages:
1. Less indirection, easier to see what's going on.
2. Removes accesses to fields of ILV.
2) in particular ensures that no dependencies on
fields in ILV for vector code generation are re-introduced.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D114322
The code in widenGEP has already been transitioned to only rely on
information provided by VPWidenGEPRecipe directly.
Moving the code directly to VPWidenGEPRecipe::execute completes
the transition for the recipe.
It provides the following advantages:
1. Less indirection, easier to see what's going on.
2. Removes accesses to fields of ILV.
2) in particular ensures that no dependencies on
fields in ILV for GEP code generation are re-introduced.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D114321
collectLoopScalars should only add non-uniform nodes to the list if they
are used by a load/store instruction that is marked as CM_Scalarize.
Before this patch, the LV incorrectly marked pointer induction variables
as 'scalar' when they required to be widened by something else,
such as a compare instruction, and weren't used by a node marked as
'CM_Scalarize'. This case is covered by sve-widen-phi.ll.
This change also allows removing some code where the LV tried to
widen the PHI nodes with a stepvector, even though it was marked as
'scalarAfterVectorization'. Now that this code is more careful about
marking instructions that need widening as 'scalar', this code has
become redundant.
Differential Revision: https://reviews.llvm.org/D114373
Compiler has an analysis for perfect diamond matching but it does not
support nodes with main/alternate opcodes. The problem is that the
scalars themselves are different and might not match directly with other
nodes, but operands and main/alternate opcodes might match and compiler
might reuse some previously emitted vector instructions. Need to include
this analysis in the cost model and actual vector instructions emission
process.
Differential Revision: https://reviews.llvm.org/D114101
In VPRecipeBuilder::handleReplication if we believe the instruction
is predicated we then proceed to create new VP region blocks even
when the load is uniform and only predicated due to tail-folding.
I have updated isPredicatedInst to avoid treating a uniform load as
predicated when tail-folding, which means we can do a single scalar
load and a vector splat of the value.
Tests added here:
Transforms/LoopVectorize/AArch64/tail-fold-uniform-memops.ll
Differential Revision: https://reviews.llvm.org/D112552
Compiler has an analysis for perfect diamond matching but it does not
support nodes with main/alternate opcodes. The problem is that the
scalars themselves are different and might not match directly with other
nodes, but operands and main/alternate opcodes might match and compiler
might reuse some previously emitted vector instructions. Need to include
this analysis in the cost model and actual vector instructions emission
process.
Differential Revision: https://reviews.llvm.org/D114101
This patch updates the cost model for ordered reductions so that a call
to the llvm.fmuladd intrinsic is modelled as a normal fmul instruction
plus the cost of an ordered fadd reduction.
Differential Revision: https://reviews.llvm.org/D111630
In-loop vector reductions which use the llvm.fmuladd intrinsic involve
the creation of two recipes; a VPReductionRecipe for the fadd and a
VPInstruction for the fmul. If the call to llvm.fmuladd has fast-math flags
these should be propagated through to the fmul instruction, so an
interface setFastMathFlags has been added to the VPInstruction class to
enable this.
Differential Revision: https://reviews.llvm.org/D113125
This patch fixes PR52111. The problem is that LV propagates poison-generating flags (`nuw`/`nsw`, `exact`
and `inbounds`) in instructions that contribute to the address computation of widen loads/stores that are
guarded by a condition. It may happen that when the code is vectorized and the control flow within the loop
is linearized, these flags may lead to generating a poison value that is effectively used as the base address
of the widen load/store. The fix drops all the integer poison-generating flags from instructions that
contribute to the address computation of a widen load/store whose original instruction was in a basic block
that needed predication and is not predicated after vectorization.
Reviewed By: fhahn, spatel, nlopes
Differential Revision: https://reviews.llvm.org/D111846
A first step towards modeling preheader and exit blocks in VPlan as well.
Keeping the vector loop in a region allows for changing the VF as we
traverse region boundaries.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D113182
When asking how many parts are required for a scalable vector type
there are occasions when it cannot be computed. For example, <vscale x 1 x i3>
is one such vector for AArch64+SVE because at the moment no matter how we
promote the i3 type we never end up with a legal vector. This means
that getTypeConversion returns TypeScalarizeScalableVector as the
LegalizeKind, and then getTypeLegalizationCost returns an invalid cost.
This then causes BasicTTImpl::getNumberOfParts to dereference an invalid
cost, which triggers an assert. This patch changes getNumberOfParts to
return 0 for such cases, since the definition of getNumberOfParts in
TargetTransformInfo.h states that we can use a return value of 0 to represent
an unknown answer.
Currently, LoopVectorize.cpp is the only place where we need to check for
0 as a return value, because all other instances will not currently
ask for the number of parts for <vscale x 1 x iX> types.
In addition, I have changed the target-independent interface for
getNumberOfParts to return 1 and assume there is a single register
that can fit the type. The loop vectoriser has lots of tests that are
target-independent and they relied upon the 0 value to mean the
answer is known and that we are not scalarising the vector.
I have added tests here that show we correctly return an invalid cost
for VF=vscale x 1 when the loop contains unusual types such as i7:
Transforms/LoopVectorize/AArch64/sve-inductions-unusual-types.ll
Differential Revision: https://reviews.llvm.org/D113772
No need to count the final shuffle cost for the constants, gathering of
the constants is just a constant vector + extra inserts, if required.
Differential Revision: https://reviews.llvm.org/D113770
Need to adjust the types of GEPs indices when building the tree
entries/operands. Otherwise some of the nodes might differ and
vectorizer is unable to correctly find them and count their cost.
Differential Revision: https://reviews.llvm.org/D113792
A bunch of scalars can be treated as a splat not only if all elements
are the same but also if some of them are undefvalues.
Differential Revision: https://reviews.llvm.org/D113774
If the vector intrinsic has scalar argument, we currently still create
a tree entry for this argument. This entry is not used, just consumes
resources and increases the cost of the tree.
Differential Revision: https://reviews.llvm.org/D113806
The interface is a convenience function to ask if a block requires
predication when widening, but it's important that there are two
separate concepts to consider:
(A) The block was predicated in the original loop.
(B) The block was unpredicated in the original loop, but requires
predication because of tail folding.
In the case of (B) we know that at least one lane of the vector will
be executed, which means we can implementing a load from a uniform address
with a scalar load + splat (D112552). In the case of predication because
of (A), we cannot do this, because the scalar load itself requires
predication.
The name 'blockNeedsPredication' does not make the distinction between
(A) and (B), hence the reason to rename it.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D113392