All information to fix-up the reduction phi nodes in the vectorized loop
is available in VPlan now. This patch moves the code to do so, to make
this clearer. Fixing up the loop exit value still relies on other
information and remains outside of VPlan for now.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D100113
If the vectorized insertelements instructions form indentity subvector
(the subvector at the beginning of the long vector), it is just enough
to extend the vector itself, no need to generate inserting subvector
shuffle.
Differential Revision: https://reviews.llvm.org/D107494
Since all operands to ExtractValue must be loop-invariant when we deem
the loop vectorizable, we can consider ExtractValue to be uniform.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D107286
We can only trust the range of the index if it is guaranteed
non-poison.
Fixes PR50949.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D107364
This patch adds more instructions to the Uniforms list, for example certain
intrinsics that are uniform by definition or whose operands are loop invariant.
This list includes:
1. The intrinsics 'experimental.noalias.scope.decl' and 'sideeffect', which
are always uniform by definition.
2. If intrinsics 'lifetime.start', 'lifetime.end' and 'assume' have
loop invariant input operands then these are also uniform too.
Also, in VPRecipeBuilder::handleReplication we check if an instruction is
uniform based purely on whether or not the instruction lives in the Uniforms
list. However, there are certain cases where calls to some intrinsics can
be effectively treated as uniform too. Therefore, we now also treat the
following cases as uniform for scalable vectors:
1. If the 'assume' intrinsic's operand is not loop invariant, then we
are free to treat this as uniform anyway since it's only a performance
hint. We will get the benefit for the first lane.
2. When the input pointers for 'lifetime.start' and 'lifetime.end' are loop
variant then for scalable vectors we assume these still ultimately come
from the broadcast of an alloca. We do not support scalable vectorisation
of loops containing alloca instructions, hence the alloca itself would
be invariant. If the pointer does not come from an alloca then the
intrinsic itself has no effect.
I have updated the assume test for fixed width, since we now treat it
as uniform:
Transforms/LoopVectorize/assume.ll
I've also added new scalable vectorisation tests for other intriniscs:
Transforms/LoopVectorize/scalable-assume.ll
Transforms/LoopVectorize/scalable-lifetime.ll
Transforms/LoopVectorize/scalable-noalias-scope-decl.ll
Differential Revision: https://reviews.llvm.org/D107284
This change wasn't strictly necessary for D106164 and could be removed.
This patch addresses the post-commit comments from @fhahn on D106164, and
also changes sve-widen-gep.ll to use the same IR test as shown in
pointer-induction.ll.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D106878
If the vectorized insertelements instructions form indentity subvector
(the subvector at the beginning of the long vector), it is just enough
to extend the vector itself, no need to generate inserting subvector
shuffle.
Differential Revision: https://reviews.llvm.org/D107344
I'm renaming the flag because a future patch will add a new
enableOrderedReductions() TTI interface and so the meaning of this
flag will change to be one of forcing the target to enable/disable
them. Also, since other places in LoopVectorize.cpp use the word
'Ordered' instead of 'strict' I changed the flag to match.
Differential Revision: https://reviews.llvm.org/D107264
This patch updates VPInterleaveRecipe::print to print the actual defined
VPValues for load groups and the store VPValue operands for store
groups.
The IR references may become outdated while transforming the VPlan and
the defined and stored VPValues always are up-to-date.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D107223
Replace insertelement instructions for splats with just single
insertelement + broadcast shuffle. Also, try to merge these instructions
if they come from the same/shuffled gather node.
Differential Revision: https://reviews.llvm.org/D107104
For the nodes with reused scalars the user may be not only of the size
of the final shuffle but also of the size of the scalars themselves,
need to check for this. It is safe to just modify the check here, since
the order of the scalars themselves is preserved, only indeces of the
reused scalars are changed. So, the users with the same size as the
number of scalars in the node, will not be affected, they still will get
the operands in the required order.
Reported by @mstorsjo in D105020.
Differential Revision: https://reviews.llvm.org/D107080
If the instruction was previously deleted, it should not be treated as
an external user. This fixes cost estimation and removes dead
extractelement instructions.
Differential Revision: https://reviews.llvm.org/D107106
Need to check that the minimum acceptable vector factor is at least 2,
not 0, to avoid compiler crash during gathered loads analysis.
Differential Revision: https://reviews.llvm.org/D107058
Reworked reordering algorithm. Originally, the compiler just tried to
detect the most common order in the reordarable nodes (loads, stores,
extractelements,extractvalues) and then fully rebuilding the graph in
the best order. This was not effecient, since it required an extra
memory and time for building/rebuilding tree, double the use of the
scheduling budget, which could lead to missing vectorization due to
exausted scheduling resources.
Patch provide 2-way approach for graph reodering problem. At first, all
reordering is done in-place, it doe not required tree
deleting/rebuilding, it just rotates the scalars/orders/reuses masks in
the graph node.
The first step (top-to bottom) rotates the whole graph, similarly to the previous
implementation. Compiler counts the number of the most used orders of
the graph nodes with the same vectorization factor and then rotates the
subgraph with the given vectorization factor to the most used order, if
it is not empty. Then repeats the same procedure for the subgraphs with
the smaller vectorization factor. We can do this because we still need
to reshuffle smaller subgraph when buildiong operands for the graph
nodes with lasrger vectorization factor, we can rotate just subgraph,
not the whole graph.
The second step (bottom-to-top) scans through the leaves and tries to
detect the users of the leaves which can be reordered. If the leaves can
be reorder in the best fashion, they are reordered and their user too.
It allows to remove double shuffles to the same ordering of the operands in
many cases and just reorder the user operations instead. Plus, it moves
the final shuffles closer to the top of the graph and in many cases
allows to remove extra shuffle because the same procedure is repeated
again and we can again merge some reordering masks and reorder user nodes
instead of the operands.
Also, patch improves cost model for gathering of loads, which improves
x264 benchmark in some cases.
Gives about +2% on AVX512 + LTO (more expected for AVX/AVX2) for {625,525}x264,
+3% for 508.namd, improves most of other benchmarks.
The compile and link time are almost the same, though in some cases it
should be better (we're not doing an extra instruction scheduling
anymore) + we may vectorize more code for the large basic blocks again
because of saving scheduling budget.
Differential Revision: https://reviews.llvm.org/D105020
As suggested in D105008, move the code that fixes up the backedge value
for first order recurrences to VPlan::execute.
Now all that remains in fixFirstOrderRecurrences is the code responsible
for creating the exit values in the middle block.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D106244
This makes a couple of changes to the costing of MLA reduction patterns,
to more accurately cost various patterns that can come up from
vectorization.
- The Arm implementation of getExtendedAddReductionCost is altered to
only provide costs for legal or smaller types. Larger than legal types
need to be split, which currently does not work very well, especially
for predicated reductions where the predicate may be legal but needs to
be split. Currently we limit it to legal or smaller input types.
- The getReductionPatternCost has learnt that reduce(ext(mul(ext, ext))
is a pattern that can come up, and can be treated the same as
reduce(mul(ext, ext)) providing the extension types match.
- And it has been adjusted to not count the ext in reduce(mul(ext, ext))
as part of a reduce(mul) pattern.
Together these changes help to more accurately cost the mla reductions
in cases such as where the extend types don't match or the extend
opcodes are different, picking better vector factors that don't result
in expanded reductions.
Differential Revision: https://reviews.llvm.org/D106166
Consider the following loop:
void foo(float *dst, float *src, int N) {
for (int i = 0; i < N; i++) {
dst[i] = 0.0;
for (int j = 0; j < N; j++) {
dst[i] += src[(i * N) + j];
}
}
}
When we are not building with -Ofast we may attempt to vectorise the
inner loop using ordered reductions instead. In addition we also try
to select an appropriate interleave count for the inner loop. However,
when choosing a VF=1 the inner loop will be scalar and there is existing
code in selectInterleaveCount that limits the interleave count to 2
for reductions due to concerns about increasing the critical path.
For ordered reductions this problem is even worse due to the additional
data dependency, and so I've added code to simply disable interleaving
for scalar ordered reductions for now.
Test added here:
Transforms/LoopVectorize/AArch64/strict-fadd-vf1.ll
Differential Revision: https://reviews.llvm.org/D106646
The loop vectorizer may decide to use tail folding when the trip-count
is low. When that happens, scalable VFs are no longer a candidate,
since tail folding/predication is not yet supported for scalable vectors.
This can be re-enabled in a future patch.
Reviewed By: kmclaughlin
Differential Revision: https://reviews.llvm.org/D106657
Invalid costs can be used to avoid vectorization with a given VF, which is
used for scalable vectors to avoid things that the code-generator cannot
handle. If we override the cost using the -force-target-instruction-cost
option of the LV, we would override this mechanism, rendering the flag useless.
This change ensures the cost is only overriden when the original cost that
was calculated is valid. That allows the flag to be used in combination
with the -scalable-vectorization option.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D106677
Scalarization for scalable vectors is not (yet) supported, so the
LV discards a VF when scalarization is chosen as the widening
decision. It should therefore not assert that the VF is not scalable
when it computes the decision to scalarize.
The code can get here when both the interleave-cost, gather/scatter cost
and scalarization-cost are all illegal. This may e.g. happen for SVE
when the VF=1, to avoid generating `<vscale x 1 x eltty>` types that
the code-generator cannot yet handle.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D106656
This fixes an issue that was found in D105199, where a GEP instruction
is used both as the address of a store, as well as the value of a store.
For the former, the value is scalar after vectorization, but the latter
(as value) requires widening.
Other code in that function seems to prevent similar cases from happening,
but it seems this case was missed.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D106164
This reverts the revert commit b1777b04dc.
The patch originally got reverted due to a crash:
https://bugs.chromium.org/p/chromium/issues/detail?id=1232798#c2
The underlying issue was that we were not using the stored values from
the modified memory recipes, but the out-of-date values directly from
the IR (accessed via the VPlan). This should be fixed in d995d6376. A
reduced version of the reproducer has been added in 93664503be.
Need to fix several cost-related problems. The final type may be defined
incorrectly because of to early definition (we may end up with the wider
type), the CommonCost should not be redefined in ExtractElements
cost related calculations and the shuffle of the final insertelements
vectors should be calculated as a cost of single vector permutations
+ costs of two vector permutations for other n-1 incoming vectors.
Differential Revision: https://reviews.llvm.org/D106578
Fixes more casts to `<FixedVectorType>` for the cases where the
instruction is a Insert/ExtractElementInst.
For fixed-width, this part of truncateToMinimalBitWidths is tested by
AArch64/type-shrinkage-insertelt.ll. I attempted to write a test case for this part
of truncateToMinimalBitWidths which uses scalable vectors, but was unable to add
one. The tests in type-shrinkage-insertelt.ll rely on scalarization to create extract
element instructions for instance, which is not possible for scalable vectors.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D106163
Need to fix several cost-related problems. The final type may be defined
incorrectly because of to early definition (we may end up with the wider
type), the CommonCost should not be redefined in ExtractElements
cost related calculations and the shuffle of the final insertelements
vectors should be calculated as a cost of single vector permutations
+ costs of two vector permutations for other n-1 incoming vectors.
Differential Revision: https://reviews.llvm.org/D106578
Instead of getting the VPValue for the stored IR values through the
current plan, use the stored value of the recipes directly.
This way, the correct VPValues are used if the store recipes have been
modified in the VPlan and the IR value is not correct any longer. This
can happen, e.g. due to D105008.
I have added a new FastMathFlags parameter to getArithmeticReductionCost
to indicate what type of reduction we are performing:
1. Tree-wise. This is the typical fast-math reduction that involves
continually splitting a vector up into halves and adding each
half together until we get a scalar result. This is the default
behaviour for integers, whereas for floating point we only do this
if reassociation is allowed.
2. Ordered. This now allows us to estimate the cost of performing
a strict vector reduction by treating it as a series of scalar
operations in lane order. This is the case when FP reassociation
is not permitted. For scalable vectors this is more difficult
because at compile time we do not know how many lanes there are,
and so we use the worst case maximum vscale value.
I have also fixed getTypeBasedIntrinsicInstrCost to pass in the
FastMathFlags, which meant fixing up some X86 tests where we always
assumed the vector.reduce.fadd/mul intrinsics were 'fast'.
New tests have been added here:
Analysis/CostModel/AArch64/reduce-fadd.ll
Analysis/CostModel/AArch64/sve-intrinsics.ll
Transforms/LoopVectorize/AArch64/strict-fadd-cost.ll
Transforms/LoopVectorize/AArch64/sve-strict-fadd-cost.ll
Differential Revision: https://reviews.llvm.org/D105432
This patch avoids computing discounts for predicated instructions when the
VF is scalable.
There is no support for vectorization of loops with division because the
vectorizer cannot guarantee that zero divisions will not happen.
This loop now does not use VF scalable
```
for (long long i = 0; i < n; i++)
if (cond[i])
a[i] /= b[i];
```
Differential Revision: https://reviews.llvm.org/D101916
Currently the Instruction cost of getReductionPatternCost returns an
Invalid cost to specify "did not find the pattern". This changes that to
return an Optional with None specifying not found, allowing Invalid to
mean an infinite cost as is used elsewhere.
Differential Revision: https://reviews.llvm.org/D106140
This patch removes the assertion when VF is scalable and replaces
getKnownMinValue() by getFixedValue(), so it still guards the code against
scalable vector types.
The assertions were used to guarantee that getknownMinValue were not used for
scalable vectors.
Differential Revision: https://reviews.llvm.org/D106359
This patch adds a VPFirstOrderRecurrencePHIRecipe, to further untangle
VPWidenPHIRecipe into distinct recipes for distinct use cases/lowering.
See D104989 for a new recipe for reduction phis.
This patch also introduces a new `FirstOrderRecurrenceSplice`
VPInstruction opcode, which is used to make the forming of the vector
recurrence value explicit in VPlan. This more accurately models def-uses
in VPlan and also simplifies code-generation. Now, the vector recurrence
values are created at the right place during VPlan-codegeneration,
rather than during post-VPlan fixups.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D105008
The incoming values for PHI nodes may come from unreachable BasicBlocks,
need to handle this case.
Differential Revision: https://reviews.llvm.org/D106264
Part of D105020. Also, fixed FIXMEs that need to use wider vector type
when trying to calculate the cost of reused scalars. This may cause
regressions unless D100486 is landed to improve the cost estimations
for long vectors shuffling.
Differential Revision: https://reviews.llvm.org/D106060
The cost of the InsertSubvector shuffle kind cost is not complete and
may end up with just extracts + inserts costs in many cases. Added
a workaround to represent it as a generic PermuteSingleSrc, which is
still pessimistic but better than InsertSubvector.
Differential Revision: https://reviews.llvm.org/D105827
This patch returns an Invalid cost from getInstructionCost() for alloca
instructions if the VF is scalable, as otherwise loops which contain
these instructions will crash when attempting to scalarize the alloca.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D105824
The original patch was:
https://reviews.llvm.org/D105806
There were some issues with undeterministic behaviour of the sorting
function, which led to scalable-call.ll passing and/or failing. This
patch fixes the issue by numbering all instructions in the array first,
and using that number as the order, which should provide a consistent
ordering.
This reverts commit a607f64118.
This bug was introduced with D105730 / 25ee55c0ba .
If we are not converting all of the operations of a reduction
into a vector op, we need to preserve the existing select form
of the remaining ops. Otherwise, we are potentially leaking
poison where it did not in the original code.
Alive2 agrees that the version that freezes some inputs
and then falls back to scalar is correct:
https://alive2.llvm.org/ce/z/erF4K2
This change enables vectorization of multiple exit loops when the exit count is statically computable. That requirement - shared with the rest of LV - in turn requires each exit to be analyzeable and to dominate the latch.
The majority of work to support this was done in a set of previous patches. In particular,, 72314466 avoids having multiple edges from the middle block to the exits, and 4b33b2387 which added support for non-latch single exit and multiple exits with a single exiting block. As a result, this change is basically just removing a bailout and adjusting some tests now that the prerequisite work is done and has stuck in tree for a bit.
Differential Revision: https://reviews.llvm.org/D105817
The sort function for emitting an OptRemark was not deterministic,
which caused scalable-call.ll to fail on some buildbots. This patch
fixes that.
This patch also fixes an issue where `Instruction::comesBefore()`
is called when two Instructions are in different basic blocks,
which would otherwise cause an assertion failure.
This patch emits remarks for instructions that have invalid costs for
a given set of vectorization factors. Some example output:
t.c:4:19: remark: Instruction with invalid costs prevented vectorization at VF=(vscale x 1): load
dst[i] = sinf(src[i]);
^
t.c:4:14: remark: Instruction with invalid costs prevented vectorization at VF=(vscale x 1, vscale x 2, vscale x 4): call to llvm.sin.f32
dst[i] = sinf(src[i]);
^
t.c:4:12: remark: Instruction with invalid costs prevented vectorization at VF=(vscale x 1): store
dst[i] = sinf(src[i]);
^
Reviewed By: fhahn, kmclaughlin
Differential Revision: https://reviews.llvm.org/D105806
The cost of the InsertSubvector shuffle kind cost is not complete and
may end up with just extracts + inserts costs in many cases. Added
a workaround to represent it as a generic PermuteSingleSrc, which is
still pessimistic but better than InsertSubvector.
Differential Revision: https://reviews.llvm.org/D105827
This has been a work-in-progress for a long time...we finally have all of
the pieces in place to handle vectorization of compare code as shown in:
https://llvm.org/PR41312
To do this (see PhaseOrdering tests), we converted SimplifyCFG and
InstCombine to the poison-safe (select) forms of the logic ops, so now we
need to have SLP recognize those patterns and insert a freeze op to make
a safe reduction:
https://alive2.llvm.org/ce/z/NH54Ah
We get the minimal patterns with this patch, but the PhaseOrdering tests
show that we still need adjustments to get the ideal IR in some or all of
the motivating cases.
Differential Revision: https://reviews.llvm.org/D105730
The const version of VPValue::getVPValue still had a default value for
the value index. Remove the default value and use getVPSingleValue
instead, which is the proper function.
Instead of performing the isMoreProfitable() operation on
InstructionCost::CostTy the operation is performed on InstructionCost
directly, so that it can handle the case where one of the costs is
Invalid.
This patch also changes the CostTy to be int64_t, so that the type is
wide enough to deal with multiplications with e.g. `unsigned MaxTripCount`.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D105113
This makes it clearer when we have encountered the extra arg.
Also, we may need to adjust the way the operand iteration
works when handling logical and/or.
This is NFC-intended currently (so no test diffs). The motivation
is to eventually allow matching for poison-safe logical-and and
logical-or (these are in the form of a select-of-bools).
( https://llvm.org/PR41312 )
Those patterns will not have all of the same constraints as min/max
in the form of cmp+sel. We may also end up removing the cmp+sel
min/max matching entirely (if we canonicalize to intrinsics), so
this will make that step easier.
This patch removes the IsPairwiseForm flag from the Reduction Cost TTI
hooks, along with some accompanying code for pattern matching reductions
from trees starting at extract elements. IsPairWise is now assumed to be
false, which was the predominant way that the value was used from both
the Loop and SLP vectorizers. Since the adjustments such as D93860, the
SLP vectorizer has not relied upon this distinction between paiwise and
non-pairwise reductions.
This also removes some code that was detecting reductions trees starting
from extract elements inside the costmodel. This case was
double-counting costs though, adding the individual costs on the
individual instruction _and_ the total cost of the reduction. Removing
it changes the costs in llvm/test/Analysis/CostModel/X86/reduction.ll to
not double count. The cost of reduction intrinsics is still tested
through the various tests in
llvm/test/Analysis/CostModel/X86/reduce-xyz.ll.
Differential Revision: https://reviews.llvm.org/D105484
Patch tries to improve the vectorization of stores. Originally, we just
check the type and the base pointer of the store.
Patch adds some extra checks to avoid non-profitable vectorization
cases. It includes analysis of the scalar values to be stored and
triggers the vectorization attempt only if the scalar values have
same/alt opcode and are from same basic block, i.e. we don't end up
immediately with the gather node, which is not profitable.
This also improves compile time by filtering out non-profitable cases.
Part of D57059.
Differential Revision: https://reviews.llvm.org/D104122
Revived D101297 in its original form + added some changes in X86
legalization cehcking for masked gathers.
This solution is the most stable and the most correct one. We have to
check the legality before trying to build the masked gather in SLP.
Without this check we have incorrect cost (for SLP) in case if the masked gather
is not legal/slower than the gather. And we're missing some
vectorization opportunities.
This can be fixed in the cost model, but in this case we need to add
special checks for the cost of GEPs for ScatterVectorize node, add
special check for small trees, etc., i.e. there are a lot of corner
cases here and there, which insrease code base and make it harder to
maintain the code.
> Can't we rely on cost model to deal with this? This can be profitable for futher vectorization, when we can start from such gather loads as seed.
The question from D101297. Actually, no, it can't. Actually, simple
gather may give us better result, especially after we started
vectorization of insertelements. Plus, like I said before, the cost for
non-legal masked gathers leads to missed vectorization opportunities.
Differential Revision: https://reviews.llvm.org/D105042
The reduction matching was probably only dealing with binops
when it was written, but we have now generalized it to handle
select and intrinsics too, so assert on that too.
Resubmit after the following changes:
* Fix a latent bug related to unrolling with required epilogue (see e49d65f). I believe this is the cause of the prior PPC buildbot failure.
* Disable non-latch exits for epilogue vectorization to be safe (9ffa90d)
* Split out assert movement (600624a) to reduce churn if this gets reverted again.
Previous commit message (try 3)
Resubmit after fixing test/Transforms/LoopVectorize/ARM/mve-gather-scatter-tailpred.ll
Previous commit message...
This is a resubmit of 3e5ce4 (which was reverted by 7fe41ac). The original commit caused a PPC build bot failure we never really got to the bottom of. I can't reproduce the issue, and the bot owner was non-responsive. In the meantime, we stumbled across an issue which seems possibly related, and worked around a latent bug in 80e8025. My best guess is that the original patch exposed that latent issue at higher frequency, but it really is just a guess.
Original commit message follows...
If we know that the scalar epilogue is required to run, modify the CFG to end the middle block with an unconditional branch to scalar preheader. This is instead of a conditional branch to either the preheader or the exit block.
The motivation to do this is to support multiple exit blocks. Specifically, the current structure forces us to identify immediate dominators and *which* exit block to branch from in the middle terminator. For the multiple exit case - where we know require scalar will hold - these questions are ill formed.
This is the last change needed to support multiple exit loops, but since the diffs are already large enough, I'm going to land this, and then enable separately. You can think of this as being NFCIish prep work, but the changes are a bit too involved for me to feel comfortable tagging the review that way.
Differential Revision: https://reviews.llvm.org/D94892
When skimming through old review discussion, I noticed a post commit comment on an earlier patch which had gone unaddressed. Better late (4 months), than never right?
I'm not aware of an active problem with the combination of non-latch exits and epilogue vectorization, but the interaction was not considered and I'm not modivated to make epilogue vectorization work with early exits. If there were a bug in the interaction, it would be pretty hard to hit right now (as we canonicalize towards bottom tested loops), but an upcoming change to allow multiple exit loops will greatly increase the chance for error. Thus, let's play it safe for now.
Compare type IDs and DFS numbering for basic block instead of addresses
to fix non-determinism.
Differential Revision: https://reviews.llvm.org/D105031
This reverts commit 706bbfb35b.
The committed version moves the definition of VPReductionPHIRecipe out
of an ifdef only intended for ::print helpers. This should resolve the
build failures that caused the revert
This patch adds a TTI function, isElementTypeLegalForScalableVector, to query
whether it is possible to vectorize a given element type. This is called by
isLegalToVectorizeInstTypesForScalable to reject scalable vectorization if
any of the instruction types in the loop are unsupported, e.g:
int foo(__int128_t* ptr, int N)
#pragma clang loop vectorize_width(4, scalable)
for (int i=0; i<N; ++i)
ptr[i] = ptr[i] + 42;
This example currently crashes if we attempt to vectorize since i128 is not a
supported type for scalable vectorization.
Reviewed By: sdesmalen, david-arm
Differential Revision: https://reviews.llvm.org/D102253
This reverts commit 3fed6d443f,
bbcbf21ae6 and
6c3451cd76.
The changes causing build failures with certain configurations, e.g.
https://lab.llvm.org/buildbot/#/builders/67/builds/3365/steps/6/logs/stdio
lib/libLLVMVectorize.a(LoopVectorize.cpp.o): In function `llvm::VPRecipeBuilder::tryToCreateWidenRecipe(llvm::Instruction*, llvm::ArrayRef<llvm::VPValue*>, llvm::VFRange&, std::unique_ptr<llvm::VPlan, std::default_delete<llvm::VPlan> >&) [clone .localalias.8]':
LoopVectorize.cpp:(.text._ZN4llvm15VPRecipeBuilder22tryToCreateWidenRecipeEPNS_11InstructionENS_8ArrayRefIPNS_7VPValueEEERNS_7VFRangeERSt10unique_ptrINS_5VPlanESt14default_deleteISA_EE+0x63b): undefined reference to `vtable for llvm::VPReductionPHIRecipe'
collect2: error: ld returned 1 exit status
This patch is a first step towards splitting up VPWidenPHIRecipe into
separate recipes for the 3 distinct cases they model:
1. reduction phis,
2. first-order recurrence phis,
3. pointer induction phis.
This allows untangling the code generation and allows us to reduce the
reliance on LoopVectorizationCostModel during VPlan code generation.
Discussed/suggested in D100102, D100113, D104197.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D104989
Splits `getSmallestAndWidestTypes` into two functions, one of which now collects
a list of all element types found in the loop (`ElementTypesInLoop`). This ensures we do not
have to iterate over all instructions in the loop again in other places, such as in D102253
which disables scalable vectorization of a loop if any of the instructions use invalid types.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D105437
The function vectorizeChainsInBlock does not support scalable vector,
because function like canReuseExtract and isCommutative in the code
path assert with scalable vectors.
This patch avoids vectorizing blocks that have extract instructions with scalable
vector..
Differential Revision: https://reviews.llvm.org/D104809
This API is not compatible with opaque pointers, the method
accepting an explicit pointer element type should be used instead.
Thankfully there were few in-tree users. The BPF case still ends
up using the pointer element type for now and needs something like
D105407 to avoid doing so.
Same as other CreateLoad-style APIs, these need an explicit type
argument to support opaque pointers.
Differential Revision: https://reviews.llvm.org/D105395
The compiler should not ignore UndefValue when gathering the scalars,
otherwise the resulting code may be less defined than the original one.
Also, grouped scalars to insert them at first to reduce the analysis in
further passes.
Differential Revision: https://reviews.llvm.org/D105275
In lots of places we were calling setDebugLocFromInst and passing
in the same Builder member variable found in InnerLoopVectorizer.
I personally found this confusing so I've changed the interface
to take an Optional<IRBuilder<> *> and we can now pass in None
when we want to use the class member variable.
Differential Revision: https://reviews.llvm.org/D105100
If we unroll a loop in the vectorizer (without vectorizing), and the cost model requires a epilogue be generated for correctness, the code generation must actually do so.
The included test case on an unmodified opt will access memory one past the expected bound. As a result, this patch is fixing a latent miscompile.
Differential Revision: https://reviews.llvm.org/D103700
This patch fixes a crash when the target instruction for sinking is
dead. In that case, no recipe is created and trying to get the recipe
for it results in a crash. To ensure all sink targets are alive, find &
use the first previous alive instruction.
Note that the case where the sink source is dead is already handled.
Found by
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=35320
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D104603
Previously in setCostBasedWideningDecision if we encountered an
invariant store we just assumed that we could scalarize the store
and called getUniformMemOpCost to get the associated cost.
However, for scalable vectors this is not an option because it is
not currently possibly to scalarize the store. At the moment we
crash in VPReplicateRecipe::execute when trying to scalarize the
store.
Therefore, I have changed setCostBasedWideningDecision so that if
we are storing a scalable vector out to a uniform address and the
target supports scatter instructions, then we should use those
instead.
Tests have been added here:
Transforms/LoopVectorize/AArch64/sve-inv-store.ll
Differential Revision: https://reviews.llvm.org/D104624
Currently we will allow loops with a fixed width VF of 1 to vectorize
if the -enable-strict-reductions flag is set. However, the loop vectorizer
will not use ordered reductions if `VF.isScalar()` and the resulting
vectorized loop will be out of order.
This patch removes `VF.isVector()` when checking if ordered reductions
should be used. Also, instead of converting the FAdds to reductions if the
VF = 1, operands of the FAdds are changed such that the order is preserved.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D104533
Sinking scalar operands into predicated-triangle regions may allow
merging regions. This patch adds a VPlan-to-VPlan transform that tries
to merge predicate-triangle regions after sinking.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D100260
This patch updates VPWidenPHI recipes for first-order recurrences to
also track the incoming value from the back-edge. Similar to D99294,
which did the same for reductions.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D104197
Make getPointersDiff() and sortPtrAccesses() compatible with opaque
pointers by explicitly passing in the element type instead of
determining it from the pointer element type.
The SLPVectorizer result is slightly non-optimal in that unnecessary
pointer bitcasts are added.
Differential Revision: https://reviews.llvm.org/D104784
Perform better analysis when trying to vectorize PHIs.
1. Do not try to vectorize vector PHIs.
2. Do deeper analysis for more profitable nodes for the vectorization.
Before we just tried to vectorize the PHIs of the same type. Patch
improves this and tries to vectorize PHIs with incoming values which
come from the same basic block, have the same and/or alternative
opcodes.
It allows to save the compile time and provides better vectorization
results in general.
Part of D57059.
Differential Revision: https://reviews.llvm.org/D103638
This really isn't talking about vectors in general,
but only about either fixed or scalable vectors,
and it's pretty confusing to see it state
that there aren't any vectors :)
At the moment, we create insertelement instructions directly after
LastInst when inserting scalar values in a vector in
VPTransformState::get.
This results in invalid IR when LastInst is a phi, followed by another
phi. In that case, the new instructions should be inserted just after
the last PHI node in the block.
At the moment, I don't think the problematic case can be triggered, but
it can happen once predicate regions are merged and multiple
VPredInstPHI recipes are in the same block (D100260).
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D104188
This can be seen as a follow up to commit 0ee439b705,
that changed the second argument of __powidf2, __powisf2 and
__powitf2 in compiler-rt from si_int to int. That was to align with
how those runtimes are defined in libgcc.
One thing that seem to have been missing in that patch was to make
sure that the rest of LLVM also handle that the argument now depends
on the size of int (not using the si_int machine mode for 32-bit).
When using __builtin_powi for a target with 16-bit int clang crashed.
And when emitting libcalls to those rtlib functions, typically when
lowering @llvm.powi), the backend would always prepare the exponent
argument as an i32 which caused miscompiles when the rtlib was
compiled with 16-bit int.
The solution used here is to use an overloaded type for the second
argument in @llvm.powi. This way clang can use the "correct" type
when lowering __builtin_powi, and then later when emitting the libcall
it is assumed that the type used in @llvm.powi matches the rtlib
function.
One thing that needed some extra attention was that when vectorizing
calls several passes did not support that several arguments could
be overloaded in the intrinsics. This patch allows overload of a
scalar operand by adding hasVectorInstrinsicOverloadedScalarOpd, with
an entry for powi.
Differential Revision: https://reviews.llvm.org/D99439
As Eli mentioned post-commit in D103378, the result of the freeze may
still be out-of-range according to Alive2. So for now, just limit the
transform to indices that are non-poison.
It was found by chance revealing discrepancy between comment (few lines above),
the condition and how re-ordering of instruction is done inside the if statement
it guards. The condition was always evaluated to true.
Differential Revision: https://reviews.llvm.org/D104064
We were passing the RecurrenceDescriptor by value to most of the reduction analysis methods, despite it being rather bulky with TrackingVH members (that can be costly to copy). In all these cases we're only using the RecurrenceDescriptor for rather basic purposes (access to types/kinds etc.).
Differential Revision: https://reviews.llvm.org/D104029
This fixes the concern in single element store scalarization that the
alignment of new store may be larger than it should be. It calculates
the largest alignment if index is constant, and a safe one if not.
Reviewed By: lebedev.ri, spatel
Differential Revision: https://reviews.llvm.org/D103419
First we refactor the code which does no wrapping add sequences
match: we need to allow different operand orders for
the key add instructions involved in the match.
Then we use the refactored code trying 4 variants of matching operands.
Originally the code relied on the fact that the matching operands
of the two last add instructions of memory index calculations
had the same LHS argument. But which operand is the same
in the two instructions is actually not essential, so now we allow
that to be any of LHS or RHS of each of the two instructions.
This increases the chances of vectorization to happen.
Reviewed By: volkan
Differential Revision: https://reviews.llvm.org/D103912
As noted in https://bugs.llvm.org/show_bug.cgi?id=46666, the current behavior of assuming if-conversion safety if a loop is annotated parallel (`!llvm.loop.parallel_accesses`), is not expectable, the documentation for this behavior was since removed from the LangRef again, and can lead to invalid reads.
This was observed in POCL (https://github.com/pocl/pocl/issues/757) and would require similar workarounds in current work at hipSYCL.
The question remains why this was initially added and what the implications of removing this optimization would be.
Do we need an alternative mechanism to propagate the information about legality of if-conversion?
Or is the idea that conditional loads in `#pragma clang loop vectorize(assume_safety)` can be executed unmasked without additional checks flawed in general?
I think this implication is not part of what a user of that pragma (and corresponding metadata) would expect and thus dangerous.
Only two additional tests failed, which are adapted in this patch. Depending on the further direction force-ifcvt.ll should be removed or further adapted.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D103907
There is no need to schedule insertelement instructions. The compiler
did not schedule them before it started support their vectorization and
it should not do it after. We pre-schedule them manually when finding
a build vector sequence.
Disabling scheduling of insertelement instructions improves compile
time and vectorization of the very large basic blocks by saving
scheduling budget for other instructions.
Differential Revision: https://reviews.llvm.org/D104026
```
llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:8024:19: warning: loop variable 'VF' of type 'const llvm::ElementCount' creates a copy from type 'const llvm::ElementCount' [-Wrange-loop-analysis]
for (const auto VF : VFCandidates) {
^
llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:8024:8: note: use reference type 'const llvm::ElementCount &' to prevent copying
for (const auto VF : VFCandidates) {
^~~~~~~~~~~~~~~
&
1 warning generated.
```
Differential Revision: https://reviews.llvm.org/D103970
1. Better sorting of scalars to be gathered. Trying to insert
constants/arguments/instructions-out-of-loop at first and only then
the instructions which are inside the loop. It improves hoisting of
invariant insertelements instructions.
2. Better detection of shuffle candidates in gathering function.
3. The cost of insertelement for constants is 0.
Part of D57059.
Differential Revision: https://reviews.llvm.org/D103458
If the `-enable-strict-reductions` flag is set to true, then currently we will
always choose to vectorize the loop with strict in-order reductions. This is
not necessary where we allow the reordering of FP operations, such as
when loop hints are passed via metadata.
This patch moves useOrderedReductions so that we can also check whether
loop hints allow reordering, in which case we should use the default
behaviour of vectorizing with unordered reductions.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D103814
The non-DOT printing does not include the successors of VPregionBlocks.
This patch use the same style for printing successors as for
VPBasicBlock.
I think the printing of successors could be a bit improved further, as
at the moment it is hard to ensure a check line matches all successors.
But that can be done as follow-up.
Reviewed By: a.elovikov
Differential Revision: https://reviews.llvm.org/D103515
This patch marks the induction increment of the main induction variable
of the vector loop as NUW when not folding the tail.
If the tail is not folded, we know that End - Start >= Step (either
statically or through the minimum iteration checks). We also know that both
Start % Step == 0 and End % Step == 0. We exit the vector loop if %IV +
%Step == %End. Hence we must exit the loop before %IV + %Step unsigned
overflows and we can mark the induction increment as NUW.
This should make SCEV return more precise bounds for the created vector
loops, used by later optimizations, like late unrolling.
At the moment quite a few tests still need to be updated, but before
doing so I'd like to get initial feedback to make sure I am not missing
anything.
Note that this could probably be further improved by using information
from the original IV.
Attempt of modeling of the assumption in Alive2:
https://alive2.llvm.org/ce/z/H_DL_g
Part of a set of fixes required for PR50412.
Reviewed By: mkazantsev
Differential Revision: https://reviews.llvm.org/D103255
No need to recalculate the cost of extractelements, just no need to
compensate the cost of all extractelements, need to check before if this
is actually going to be removed at the vectorization. Also, no need to
generate new extractelement instruction, we may just regenerate the
original one. It may improve the final vectorization.
Differential Revision: https://reviews.llvm.org/D102933
tryToVectorizeList function allows to reorder only 2 scalars. Patch
allows to reorder >2 scalars. Also, to avoid possible regressions, it
allows extra vectorization of the remaining parts of the scalars
elements if possible.
Part of D57059.
Differential Revision: https://reviews.llvm.org/D103247
As noticed by NAKAMURA Takumi back in 2017, we cannot use
properlyDominates for std::stable_sort as properlyDominates only
partially orders blocks. That is, for blocks A, B, C, D, where A
dominates B and C dominates D, we have A == C, B == C, but A < B. This
is not a valid comparison function for std::stable_sort and causes
different results between libstdc++ and libc++. This change uses DFS
numbering to give deterministic results for all reachable blocks.
Unreachable blocks are ignored already, so do not need special
consideration.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D103441
This patch uses the calculated maximum scalable VFs to build VPlans,
cost them and select a suitable scalable VF.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D98722
llvm::getLoadStoreType was added recently and has the same implementation
as 'getMemInstValueType' in LoopVectorize.cpp. Since there is no
value in having two implementations, this patch removes the custom LV
implementation in favor of the generic one defined in Instructions.h.
As the existing test unreachable.ll shows, we should be doing more
work to avoid entering unreachable blocks: we should not stop
vectorization just because a PHI incoming value from an unreachable
block cannot be vectorized. We know that particular value will never
be used so we can just replace it with poison.
Implemented better scheme for perfect/shuffled matches of the gather
nodes which allows to fix the performance regressions introduced by
earlier patches. Starting detecting matches for broadcast nodes and
extractelement gathering.
Differential Revision: https://reviews.llvm.org/D102920
If the index itself is already poison, the poison propagates through
instructions clamping the index to a valid range. This still causes
introducing a load of poison, as flagged by Alive2 and pointed out
at 575e2aff55.
This patch updates the code to freeze the index, unless it is proven to
not be poison.
Reviewed By: nlopes
Differential Revision: https://reviews.llvm.org/D103378
Update isFirstOrderRecurrence to explore all uses of a recurrence phi
and check if we can sink them. If there are multiple users to sink, they
are all mapped to the previous instruction.
Fixes PR44286 (and another PR or two).
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D84951
For uniform ReplicateRecipes, only the first lane should be used, so
sinking them would mean we have to compute the value of the first lane
multiple times. Also, at the moment, sinking them causes a crash because
the value of the first lane is re-used by all users.
Reported post-commit for D100258.
SLP vectorizer should not consider in sertelements with multiple uses as
a part of high level build vector, it must be considered as
a terminating insertelement in the vector build, otherwise it may
produce incorrect code.
Differential Revision: https://reviews.llvm.org/D103164
When loop hints are passed via metadata, the allowReordering function
in LoopVectorizationLegality will allow the order of floating point
operations to be changed:
bool allowReordering() const {
// When enabling loop hints are provided we allow the vectorizer to change
// the order of operations that is given by the scalar loop. This is not
// enabled by default because can be unsafe or inefficient.
The -enable-strict-reductions flag introduced in D98435 will currently only
vectorize reductions in-loop if hints are used, since canVectorizeFPMath()
will return false if reordering is not allowed.
This patch changes canVectorizeFPMath() to query whether it is safe to
vectorize the loop with ordered reductions if no hints are used. For
testing purposes, an additional flag (-hints-allow-reordering) has been
added to disable the reordering behaviour described above.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D101836
We can only scalarize memory accesses if we know the index is valid.
This patch adjusts canScalarizeAcceess to fall back to
computeConstantRange to check if the index is known to be valid.
Reviewed By: nlopes
Differential Revision: https://reviews.llvm.org/D102476
This patch adds a first VPlan-based implementation of sinking of scalar
operands.
The current version traverse a VPlan once and processes all operands of
a predicated REPLICATE recipe. If one of those operands can be sunk,
it is moved to the block containing the predicated REPLICATE recipe.
Continue with processing the operands of the sunk recipe.
The initial version does not re-process candidates after other recipes
have been sunk. It also cannot partially sink induction increments at
the moment. The VPlan only contains WIDEN-INDUCTION recipes and if the
induction is used for example in a GEP, only the first lane is used and
in the lowered IR the adds for the other lanes can be sunk into the
predicated blocks.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D100258
This reverts commit 94d54155e2.
This fixes a sanitizer failure by moving scalarizeLoadExtract(I)
before foldSingleElementStore(I), which may remove instructions.
This patch adds a new combine that tries to scalarize chains of
`extractelement (load %ptr), %idx` to `load (gep %ptr, %idx)`. This is
profitable when extracting only a few elements out of a large vector.
At the moment, `store (extractelement (load %ptr), %idx), %ptr`
operations on large vectors result in huge code in the backend.
This can easily be triggered by using the matrix extension, e.g.
https://clang.godbolt.org/z/qsccPdPf4
This should complement D98240.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D100273
External insertelement users can be represented as a result of shuffle
of the vectorized element and noconsecutive insertlements too. Added
support for handling non-consecutive insertelements.
Differential Revision: https://reviews.llvm.org/D101555
If we gather extract elements and they actually are just shuffles, it
might be profitable to vectorize them even if the tree is tiny.
Differential Revision: https://reviews.llvm.org/D101460
In InnerLoopVectorizer::setDebugLocFromInst we were previously
asserting that the VF is not scalable. This is because we want to
use the number of elements to create a duplication factor for the
debug profiling data. However, for scalable vectors we only know the
minimum number of elements. I've simply removed the assert for now
and added a FIXME saying that we assume vscale is always 1. When
vscale is not 1 it just means that the profiling data isn't as
accurate, but shouldn't cause any functional problems.
This patch adds a new option to the LoopVectorizer to control how
scalable vectors can be used.
Initially, this suggests three levels to control scalable
vectorization, although other more aggressive options can be added in
the future.
The possible options are:
- Disabled: Disables vectorization with scalable vectors.
- Enabled: Vectorize loops using scalable vectors or fixed-width
vectors, but favors fixed-width vectors when the cost
is a tie.
- Preferred: Like 'Enabled', but favoring scalable vectors when the
cost-model is inconclusive.
Reviewed By: paulwalker-arm, vkmr
Differential Revision: https://reviews.llvm.org/D101945
This patch implements first part of Flow Sensitive SampleFDO (FSAFDO).
It has the following changes:
(1) disable current discriminator encoding scheme,
(2) new hierarchical discriminator for FSAFDO.
For this patch, option "-enable-fs-discriminator=true" turns on the new
functionality. Option "-enable-fs-discriminator=false" (the default)
keeps the current SampleFDO behavior. When the fs-discriminator is
enabled, we insert a flag variable, namely, llvm_fs_discriminator, to
the object. This symbol will checked by create_llvm_prof tool, and used
to generate a profile with FS-AFDO discriminators enabled. If this
happens, for an extbinary format profile, create_llvm_prof tool
will add a flag to profile summary section.
Differential Revision: https://reviews.llvm.org/D102246
Currently all AA analyses marked as preserved are stateless, not taking
into account their dependent analyses. So there's no need to mark them
as preserved, they won't be invalidated unless their analyses are.
SCEVAAResults was the one exception to this, it was treated like a
typical analysis result. Make it like the others and don't invalidate
unless SCEV is invalidated.
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D102032
This allows cast/dyn_cast'ing from VPUser to recipes. This is needed
because there are VPUsers that are not recipes.
Reviewed By: gilr, a.elovikov
Differential Revision: https://reviews.llvm.org/D100257
This patch introduces a new class, MaxVFCandidates, that holds the
maximum vectorization factors that have been computed for both scalable
and fixed-width vectors.
This patch is intended to be NFC for fixed-width vectors, although
considering a scalable max VF (which is disabled by default) pessimises
tail-loop elimination, since it can no longer determine if any chosen VF
(less than fixed/scalable MaxVFs) is guaranteed to handle all vector
iterations if the trip-count is known. This issue will be addressed in
a future patch.
Reviewed By: fhahn, david-arm
Differential Revision: https://reviews.llvm.org/D98721
This reverts commit 6d3e3ae8a9.
Still seeing PPC build bot failures, and one arm self host bot failing. I'm officially stumped, and need help from a bot owner to reduce.
Resubmit after fixing test/Transforms/LoopVectorize/ARM/mve-gather-scatter-tailpred.ll
Previous commit message...
This is a resubmit of 3e5ce4 (which was reverted by 7fe41ac). The original commit caused a PPC build bot failure we never really got to the bottom of. I can't reproduce the issue, and the bot owner was non-responsive. In the meantime, we stumbled across an issue which seems possibly related, and worked around a latent bug in 80e8025. My best guess is that the original patch exposed that latent issue at higher frequency, but it really is just a guess.
Original commit message follows...
If we know that the scalar epilogue is required to run, modify the CFG to end the middle block with an unconditional branch to scalar preheader. This is instead of a conditional branch to either the preheader or the exit block.
The motivation to do this is to support multiple exit blocks. Specifically, the current structure forces us to identify immediate dominators and *which* exit block to branch from in the middle terminator. For the multiple exit case - where we know require scalar will hold - these questions are ill formed.
This is the last change needed to support multiple exit loops, but since the diffs are already large enough, I'm going to land this, and then enable separately. You can think of this as being NFCIish prep work, but the changes are a bit too involved for me to feel comfortable tagging the review that way.
Differential Revision: https://reviews.llvm.org/D94892
This is a resubmit of 3e5ce4 (which was reverted by 7fe41ac). The original commit caused a PPC build bot failure we never really got to the bottom of. I can't reproduce the issue, and the bot owner was non-responsive. In the meantime, we stumbled across an issue which seems possibly related, and worked around a latent bug in 80e8025. My best guess is that the original patch exposed that latent issue at higher frequency, but it really is just a guess.
Original commit message follows...
If we know that the scalar epilogue is required to run, modify the CFG to end the middle block with an unconditional branch to scalar preheader. This is instead of a conditional branch to either the preheader or the exit block.
The motivation to do this is to support multiple exit blocks. Specifically, the current structure forces us to identify immediate dominators and *which* exit block to branch from in the middle terminator. For the multiple exit case - where we know require scalar will hold - these questions are ill formed.
This is the last change needed to support multiple exit loops, but since the diffs are already large enough, I'm going to land this, and then enable separately. You can think of this as being NFCIish prep work, but the changes are a bit too involved for me to feel comfortable tagging the review that way.
Differential Revision: https://reviews.llvm.org/D94892
As discussed in D102437, the VF argument to isScalarWithPredication
seems redundant, so this is intended to be a non-functional change. It
seems wrong to query the widening decision at this point. Removing the
operand and code to get the widening decision causes no unit/regression
tests to fail. I've also found no issues running the LLVM test-suite.
This subsequently removes the VF argument from isPredicatedInst as well,
since it is no longer required.
Add new type of tree node for `InsertElementInst` chain forming vector.
These instructions could be either removed, or replaced by shuffles during
vectorization and we can add this node to cost model, so naturally estimating
their cost, getting rid of `CompensateCost` tricks and reducing further work
for InstCombine. This fixes PR40522 and PR35732 in a natural way. Also this
patch is the first step towards revectorization of partially vectorization
(to fix PR42022 completely). After adding inserts to tree the next step is
to add vector instructions there (for instance, to merge `store <2 x float>`
and `store <2 x float>` to `store <4 x float>`).
Fixes PR40522 and PR35732.
Differential Revision: https://reviews.llvm.org/D98714
This change enables cases for which the index value for the first
load/store instruction in a pair could be a function argument. This
allows using llvm.assume to provide known bits information in such
cases.
Patch by Viacheslav Nikolaev. Thanks!
Differential Revision: https://reviews.llvm.org/D101680
In InnerLoopVectorizer::widenPHIInstruction there are cases where we have
to scalarise a pointer induction variable after vectorisation. For scalable
vectors we already deal with the case where the pointer induction variable
is uniform, but we currently crash if not uniform. For fixed width vectors
we calculate every lane of the scalarised pointer induction variable for a
given VF, however this cannot work for scalable vectors. In this case I
have added support for caching the whole vector value for each unrolled
part so that we can always extract an arbitrary element. Additionally, we
still continue to cache the known minimum number of lanes too in order
to improve code quality by avoiding an extractelement operation.
I have adapted an existing test `pointer_iv_mixed` from the file:
Transforms/LoopVectorize/consecutive-ptr-uniforms.ll
and added it here for scalable vectors instead:
Transforms/LoopVectorize/AArch64/sve-widen-phi.ll
Differential Revision: https://reviews.llvm.org/D101294
Vector single element update optimization is landed in 2db4979. But the
scope needs restriction. This patch restricts the index to inbounds and
vector must be fixed sized. In future, we may use value tracking to
relax constant restrictions.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D102146
If the simplified VPValue is a recipe, we need to register it for Instr,
in case it needs to be recorded. The way this is handled in general may
change soon, following some post-commit comments.
This fixes PR50298.
The test example from https://llvm.org/PR50256 (and reduced here)
shows that we can match a load combine candidate even when there
are no "or" instructions. We can avoid that by confirming that we
do see an "or". This doesn't apply when matching an or-reduction
because that match begins from the operands of the reduction.
Differential Revision: https://reviews.llvm.org/D102074
Need to remove the old code for avoiding double counting of the gather
nodes with perfect diamond matches within the tree after we started
detecting perfect/shuffled matching in the previous patch D100495. We
may skip the cost for such nodes completely.
Differential Revision: https://reviews.llvm.org/D102023
The comment incorrectly states that the PHI is recorded. That's not
accurate, only the recipe for the incoming value is recorded.
Suggested post-commit for 4ba8720f88.
Currently sinking a replicate region into another replicate region is
not supported. Add an assert, to make the problem more obvious, should
it occur.
Discussed post-commit for ccebf7a109.
The function fixReduction used to assert/crash for scalable vector when
a vector reduce could be done with a smaller vector.
This patch removes this assertion as it is safe to use scalable vector for
vector reduce and truncate.
Differential Revision: https://reviews.llvm.org/D101260
The loop vectorizer will currently assume a large trip count when
calculating which of several vectorization factors are more profitable.
That is often not a terrible assumption to make as small trip count
loops will usually have been fully unrolled. There are cases however
where we will try to vectorize them, and especially when folding the
tail by masking can incorrectly choose to vectorize loops that are not
beneficial, due to the folded tail rounding the iteration count up for
the vectorized loop.
The motivating example here has a trip count of 5, so either performs 5
scalar iterations or 2 vector iterations (with VF=4). At a high enough
trip count the vectorization becomes profitable, but the rounding up to
2 vector iterations vs only 5 scalar makes it unprofitable.
This adds an alternative cost calculation when we know the max trip
count and are folding tail by masking, rounding the iteration count up
to the correct number for the vector width. We still do not account for
anything like setup cost or the mixture of vector and scalar loops, but
this is at least an improvement in a few cases that we have had
reported.
Differential Revision: https://reviews.llvm.org/D101726
Adds support for scalable vectorization of loops containing first-order recurrences, e.g:
```
for(int i = 0; i < n; i++)
b[i] = a[i] + a[i - 1]
```
This patch changes fixFirstOrderRecurrence for scalable vectors to take vscale into
account when inserting into and extracting from the last lane of a vector.
CreateVectorSplice has been added to construct a vector for the recurrence, which
returns a splice intrinsic for scalable types. For fixed-width the behaviour
remains unchanged as CreateVectorSplice will return a shufflevector instead.
The tests included here are the same as test/Transform/LoopVectorize/first-order-recurrence.ll
Reviewed By: david-arm, fhahn
Differential Revision: https://reviews.llvm.org/D101076
LoopVectorize has a fairly deeply baked in design problem where it will try to query analysis (primarily SCEV, but also ValueTracking) in the midst of mutating IR. In particular, the intermediate IR state does not represent the semantics of the original (or final) program.
Fixing this for real is hard, but all of the cases seen so far share a common symptom. In cases seen to date, the analysis being queried is the computation of the original loop's trip count. We can fix this particular instance of the issue by simply computing the trip count early, and caching it.
I want to be really clear that this is nothing but a workaround. It does nothing to fix the root issue, and at best, delays the time until we have to fix this for real. Florian and I have discussed an eventual solution in the review comments for https://reviews.llvm.org/D100663, but it's a lot of work.
Test taken from https://reviews.llvm.org/D100663.
Differential Revision: https://reviews.llvm.org/D101487
This patch updates the code that sinks recipes required for first-order
recurrences to properly handle replicate-regions. At the moment, the
code would just move the replicate recipe out of its replicate-region,
producing an invalid VPlan.
When sinking a recipe in a replicate-region, we have to sink the whole
region. To do that, we first need to split the block at the target
recipe and move the region in between.
This patch also adds a splitAt helper to VPBasicBlock to split a
VPBasicBlock at a given iterator.
Fixes PR50009.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D100751
This patch updates the code handling reduction recipes to also keep
track of the incoming value from the latch in the recipe. This is needed
to model the def-use chains completely in VPlan, so that it is possible
to replace the incoming value with an arbitrary VPValue.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D99294
Need to check if target allows/supports masked gathers before trying to
estimate its cost, otherwise we may fail to vectorize some of the
patterns because of too pessimistic cost model.
Part of D57059.
Differential Revision: https://reviews.llvm.org/D101297
Need to check if target allows/supports masked gathers before trying to
estimate its cost, otherwise we may fail to vectorize some of the
patterns because of too pessimistic cost model.
Part of D57059.
Differential Revision: https://reviews.llvm.org/D101297
As we gradually move more elements of LV to VPlan, we are trying to
reduce the number of places that still has to check IR of the original
loop.
This patch adjusts the code to fix cross iteration phis to get the PHIs
to fix directly from the VPlan that is executed. We still need the
original PHI to check for first-order recurrences, but we can get rid of
that once we model that explicitly in VPlan as well.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D99293
This patch introduces a helper to obtain an iterator range for the
PHI-like recipes in a block.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D100101
If the extracts from the non-power-2 vectors are recognized as shuffles,
need some extra checks to not crash cost calculations if trying to gext
the ecost for subvector extracts. In this case need to check carefully
that we do not exit out of bounds of the original vector, otherwise the
TTI's cost model will crash on assert.
Differential Revision: https://reviews.llvm.org/D101477
Added an extra analysis for better choosing of shuffle kind in
getShuffleCost functions for better cost estimation if mask was
provided.
Differential Revision: https://reviews.llvm.org/D100865
Added an extra analysis for better choosing of shuffle kind in
getShuffleCost functions for better cost estimation if mask was
provided.
Differential Revision: https://reviews.llvm.org/D100865
As suggested in D99294, this adds a getVPSingleValue helper to use for
recipes that are guaranteed to define a single value. This replaces uses
of getVPValue() which used to default to I = 0.
This patch causes the loop vectorizer to not interleave loops that have
nounroll loop hints (llvm.loop.unroll.disable and llvm.loop.unroll_count(1)).
Note that if a particular interleave count is being requested
(through llvm.loop.interleave_count), it will still be honoured, regardless
of the presence of nounroll hints.
Reviewed By: Meinersbur
Differential Revision: https://reviews.llvm.org/D101374
This patch fixes a crash encountered when vectorising the following loop:
void foo(float *dst, float *src, long long n) {
for (long long i = 0; i < n; i++)
dst[i] = -src[i];
}
using scalable vectors. I've added a test to
Transforms/LoopVectorize/AArch64/sve-basic-vec.ll
as well as cleaned up the other tests in the same file.
Differential Revision: https://reviews.llvm.org/D98054
If the first tree element is vectorize and the second is gather, it
still might be profitable to vectorize it if the gather node contains
less scalars to vectorize than the original tree node. It might be
profitable to use shuffles.
Differential Revision: https://reviews.llvm.org/D101397
This patch simplifies the calculation of certain costs in
getInstructionCost when isScalarAfterVectorization() returns a true value.
There are a few places where we multiply a cost by a number N, i.e.
unsigned N = isScalarAfterVectorization(I, VF) ? VF.getKnownMinValue() : 1;
return N * TTI.getArithmeticInstrCost(...
After some investigation it seems that there are only these cases that occur
in practice:
1. VF is a scalar, in which case N = 1.
2. VF is a vector. We can only get here if: a) the instruction is a
GEP/bitcast/PHI with scalar uses, or b) this is an update to an induction
variable that remains scalar.
I have changed the code so that N is assumed to always be 1. For GEPs
the cost is always 0, since this is calculated later on as part of the
load/store cost. PHI nodes are costed separately and were never previously
multiplied by VF. For all other cases I have added an assert that none of
the users needs scalarising, which didn't fire in any unit tests.
Only one test required fixing and I believe the original cost for the scalar
add instruction to have been wrong, since only one copy remains after
vectorisation.
I have also added a new test for the case when a pointer PHI feeds directly
into a store that will be scalarised as we were previously never testing it.
Differential Revision: https://reviews.llvm.org/D99718
This patch also refactors the way the feasible max VF is calculated,
although this is NFC for fixed-width vectors.
After this change scalable VF hints are no longer truncated/clamped
to a shorter scalable VF, nor does it drop the 'scalable flag' from
the suggested VF to vectorize with a similar VF that is fixed.
Instead, the hint is ignored which means the vectorizer is free
to find a more suitable VF, using the CostModel to determine the
best possible VF.
Reviewed By: c-rhodes, fhahn
Differential Revision: https://reviews.llvm.org/D98509
When using the -enable-strict-reductions flag where UF>1 we generate multiple
Phi nodes, though only one of these is used as an input to the vector.reduce.fadd
intrinsics. The unused Phi nodes are removed later by instcombine.
This patch changes widenPHIInstruction/fixReduction to only generate
one Phi, and adds an additional test for unrolling to strict-fadd.ll
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D100570
This patch simplifies the calculation of certain costs in
getInstructionCost when isScalarAfterVectorization() returns a true value.
There are a few places where we multiply a cost by a number N, i.e.
unsigned N = isScalarAfterVectorization(I, VF) ? VF.getKnownMinValue() : 1;
return N * TTI.getArithmeticInstrCost(...
After some investigation it seems that there are only these cases that occur
in practice:
1. VF is a scalar, in which case N = 1.
2. VF is a vector. We can only get here if: a) the instruction is a
GEP/bitcast/PHI with scalar uses, or b) this is an update to an induction
variable that remains scalar.
I have changed the code so that N is assumed to always be 1. For GEPs
the cost is always 0, since this is calculated later on as part of the
load/store cost. PHI nodes are costed separately and were never previously
multiplied by VF. For all other cases I have added an assert that none of
the users needs scalarising, which didn't fire in any unit tests.
Only one test required fixing and I believe the original cost for the scalar
add instruction to have been wrong, since only one copy remains after
vectorisation.
I have also added a new test for the case when a pointer PHI feeds directly
into a store that will be scalarised as we were previously never testing it.
Differential Revision: https://reviews.llvm.org/D99718
This patch simplifies VPSlotTracker by using the recursive traversal
iterator to traverse all blocks in a VPlan in reverse post-order when
numbering VPValues in a plan.
This depends on a fix to RPOT (D100169). It also extends the traversal
unit tests to check RPOT.
Reviewed By: a.elovikov
Differential Revision: https://reviews.llvm.org/D100176
When iterating over const blocks, the base type in the lambdas needs
to use const VPBlockBase *, otherwise it cannot be used with input
iterators over const VPBlockBase.
Also adjust the type of the input iterator range to const &, as it
does not take ownership of the input range.
This patch adds a blocksOnly helpers which take an iterator range
over VPBlockBase * or const VPBlockBase * and returns an interator
range that only include BlockTy blocks. The accesses are casted to
BlockTy.
Reviewed By: a.elovikov
Differential Revision: https://reviews.llvm.org/D101093
This patch adds a new iterator to traverse through VPRegionBlocks and a
GraphTraits specialization using the iterator to traverse through
VPRegionBlocks.
Because there is already a GraphTraits specialization for VPBlockBase *
and co, a new VPBlockRecursiveTraversalWrapper helper is introduced.
This allows us to provide a new GraphTraits specialization for that
type. Users can use the new recursive traversal by using this wrapper.
The graph trait visits both the entry block of a region, as well as all
its successors. Exit blocks of a region implicitly have their parent
region's successors. This ensures all blocks in a region are visited
before any blocks in a successor region when doing a reverse post-order
traversal of the graph.
Reviewed By: a.elovikov
Differential Revision: https://reviews.llvm.org/D100175
We can skip check for undefs trying to find perfect/shuffled tree
entries matching, they can be ignored completely improving the final
cost/vectorization results.
Differential Revision: https://reviews.llvm.org/D101061
This commit fixes a bug where the loop vectoriser fails to predicate
loads/stores when interleaving for targets that support masked
loads and stores.
Code such as:
1 void foo(int *restrict data1, int *restrict data2)
2 {
3 int counter = 1024;
4 while (counter--)
5 if (data1[counter] > data2[counter])
6 data1[counter] = data2[counter];
7 }
... could previously be transformed in such a way that the predicated
store implied by:
if (data1[counter] > data2[counter])
data1[counter] = data2[counter];
... was lost, resulting in miscompiles.
This bug was causing some tests in llvm-test-suite to fail when built
for SVE.
Differential Revision: https://reviews.llvm.org/D99569
1. No need to call `areAllUsersVectorized` as later the cost is
calculated only if the instruction has one use and gets vectorized.
2. Need to calculate the cost of the dead extractelement more precisely,
taking the vector type of the vector operand, not the resulting
vector type.
Part of D57059.
Differential Revision: https://reviews.llvm.org/D99980
In quite a few cases in LoopVectorize.cpp we call createStepForVF
with a step value of 0, which leads to unnecessary generation of
llvm.vscale intrinsic calls. I've optimised IRBuilder::CreateVScale
and createStepForVF to return 0 when attempting to multiply
vscale by 0.
Differential Revision: https://reviews.llvm.org/D100763
SLP supports perfect diamond matching for the vectorized tree entries
but do not support it for gathered entries and does not support
non-perfect (shuffled) matching with 1 or 2 tree entries. Patch adds
support for this matching to improve cost of the vectorized tree.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D100495
SLP supports perfect diamond matching for the vectorized tree entries
but do not support it for gathered entries and does not support
non-perfect (shuffled) matching with 1 or 2 tree entries. Patch adds
support for this matching to improve cost of the vectorized tree.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D100495
SLP supports perfect diamond matching for the vectorized tree entries
but do not support it for gathered entries and does not support
non-perfect (shuffled) matching with 1 or 2 tree entries. Patch adds
support for this matching to improve cost of the vectorized tree.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D100495
Rather than maintaining two separate values, a `float` for the per-lane
cost and a Width for the VF, maintain a single VectorizationFactor which
comprises the two and also removes the need for converting an integer value
to float.
This simplifies the query when asking if one VF is more profitable than
another when we want to extend this for scalable vectors (which may
require additional options to determine if e.g. a scalable VF of the
some cost, is more profitable than a fixed VF of the same cost).
The patch isn't entirely NFC because it also fixes an issue in
selectEpilogueVectorizationFactor, where the cost passed to ProfitableVFs
no longer truncates the floating-point cost from `float` to `unsigned` to
then perform the calculation on the truncated cost. It now does
a cost comparison with the correct precision.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D100121
SLP supports perfect diamond matching for the vectorized tree entries
but do not support it for gathered entries and does not support
non-perfect (shuffled) matching with 1 or 2 tree entries. Patch adds
support for this matching to improve cost of the vectorized tree.
Differential Revision: https://reviews.llvm.org/D100495
Add an initial version of a helper to determine whether a recipe may
have side-effects.
Reviewed By: a.elovikov
Differential Revision: https://reviews.llvm.org/D100259
There were a few places in widenPHIInstruction where calculations of
offsets were failing to take the runtime calculation of VF into
account for scalable vectors. I've fixed those cases in this patch
as well as adding an assert that we should not be scalarising for
scalable vectors.
Tests are added here:
Transforms/LoopVectorize/AArch64/sve-widen-phi.ll
Differential Revision: https://reviews.llvm.org/D99254
There are a few places in LoopVectorize.cpp where we have been too
cautious in adding VF.isScalable() asserts and it can be confusing.
It also makes it more difficult to see the genuine places where
work needs doing to improve scalable vectorization support.
This patch changes getMemInstScalarizationCost to return an
invalid cost instead of firing an assert for scalable vectors. Also,
vectorizeInterleaveGroup had multiple asserts all for the same
thing. I have removed all but one assert near the start of the
function, and added a new assert that we aren't dealing with masks
for scalable vectors.
Differential Revision: https://reviews.llvm.org/D99727
Only attempt to propagateIRFlags if we have both SelectInst - afaict we shouldn't have matched a min/max reduction without both SelectInst, but static analyzer doesn't know that.
Main reason is preparation to transform AliasResult to class that contains
offset for PartialAlias case.
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D98027
Instead of passing the start value and the defined value to
widenPHIInstruction, pass the VPWidenPHIRecipe directly, which can be
used to get both (and more in future patches).
D99674 stopped the folding of certain select operations into and/or, due
to incorrect folding in the presence of poison. D97360 added some costs
to attempt to account for the change, but only worked at the getUserCost
level, not the getCmpSelInstrCost that the vectorizer will use directly.
This adds similar logic into the vectorizer to handle these logical
and/or selects, treating them like and/or directly.
This fixes 60% performance regressions from code like the attached test
case.
Differential Revision: https://reviews.llvm.org/D99884
No need to lookup through and/or try to vectorize operands of the
CmpInst instructions during attempts to find/vectorize min/max
reductions. Compiler implements postanalysis of the CmpInsts so we can
skip extra attempts in tryToVectorizeHorReductionOrInstOperands and save
compile time.
Differential Revision: https://reviews.llvm.org/D99950
Add the subclass, update a few places which check for the intrinsic to use idiomatic dyn_cast, and update the public interface of AssumptionCache to use the new class. A follow up change will do the same for the newer assumption query/bundle mechanisms.
Previously we could only vectorize FP reductions if fast math was enabled, as this allows us to
reorder FP operations. However, it may still be beneficial to vectorize the loop by moving
the reduction inside the vectorized loop and making sure that the scalar reduction value
be an input to the horizontal reduction, e.g:
%phi = phi float [ 0.0, %entry ], [ %reduction, %vector_body ]
%load = load <8 x float>
%reduction = call float @llvm.vector.reduce.fadd.v8f32(float %phi, <8 x float> %load)
This patch adds a new flag (IsOrdered) to RecurrenceDescriptor and makes use of the changes added
by D75069 as much as possible, which already teaches the vectorizer about in-loop reductions.
For now in-order reduction support is off by default and controlled with the `-enable-strict-reductions` flag.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D98435
Changes getRecurrenceIdentity to always return a neutral value of -0.0 for FAdd.
Reviewed By: dmgreen, spatel
Differential Revision: https://reviews.llvm.org/D98963
For VPWidenPHIRecipes that model all incoming values as VPValue
operands, print those operands instead of printing the original PHI.
D99294 updates recipes of reduction PHIs to use the VPValue for the
incoming value from the loop backedge, making use of this new printing.
During vectorization better to postpone the vectorization of the CmpInst
instructions till the end of the basic block. Otherwise we may vectorize
it too early and may miss some vectorization patterns, like reductions.
Reworked part of D57059
Differential Revision: https://reviews.llvm.org/D99796
This patch moves mapping of IR operands to VPValues out of
tryToCreateWidenRecipe. This allows using existing VPValue operands when
widening recipes directly, which will be introduced in future patches.
The ultimate reduction node may have multiple uses, but if the ultimate
reduction is min/max reduction and based on SelectInstruction, the
condition of this select instruction must have only single use.
Differential Revision: https://reviews.llvm.org/D99753
The motivation for this patch is to better estimate the cost of
extracelement instructions in cases were they are going to be free,
because the source vector can be used directly.
A simple example is
%v1.lane.0 = extractelement <2 x double> %v.1, i32 0
%v1.lane.1 = extractelement <2 x double> %v.1, i32 1
%a.lane.0 = fmul double %v1.lane.0, %x
%a.lane.1 = fmul double %v1.lane.1, %y
Currently we only consider the extracts free, if there are no other
users.
In this particular case, on AArch64 which can fit <2 x double> in a
vector register, the extracts should be free, independently of other
users, because the source vector of the extracts will be in a vector
register directly, so it should be free to use the vector directly.
The SLP vectorized version of noop_extracts_9_lanes is 30%-50% faster on
certain AArch64 CPUs.
It looks like this does not impact any code in
SPEC2000/SPEC2006/MultiSource both on X86 and AArch64 with -O3 -flto.
This originally regressed after D80773, so if there's a better
alternative to explore, I'd be more than happy to do that.
Reviewed By: ABataev
Differential Revision: https://reviews.llvm.org/D99719
1. Need to cleanup InstrElementSize map for each new tree, otherwise might
use sizes from the previous run of the vectorization attempt.
2. No need to include into analysis the instructions from the different basic
blocks to save compile time.
Differential Revision: https://reviews.llvm.org/D99677
Use SetVector instead of SmallPtrSet to track values with uniform use. Doing this
can help avoid non-determinism caused by iterating over unordered containers.
This bug was found with reverse iteration turning on,
--extra-llvm-cmake-variables="-DLLVM_REVERSE_ITERATION=ON".
Failing LLVM test consecutive-ptr-uniforms.ll .
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D99549
This marks FSIN and other operations to EXPAND for scalable
vectors, so that they are not assumed to be legal by the cost-model.
Depends on D97470
Reviewed By: dmgreen, paulwalker-arm
Differential Revision: https://reviews.llvm.org/D97471
Use SetVector instead of SmallPtrSet for external definitions created for VPlan.
Doing this can help avoid non-determinism caused by iterating over unordered containers.
This bug was found with reverse iteration turning on,
--extra-llvm-cmake-variables="-DLLVM_REVERSE_ITERATION=ON".
Failing LLVM-Unit test VPRecipeTest.dump.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D99544
This patch adds support for the vectorization of induction variables when
using scalable vectors, which required the following changes:
1. Removed assert from InnerLoopVectorizer::getStepVector.
2. Modified InnerLoopVectorizer::createVectorIntOrFpInductionPHI to use
a runtime determined value for VF and removed an assert.
3. Modified InnerLoopVectorizer::buildScalarSteps to work for scalable
vectors. I did this by calculating the full vector value for each Part
of the unroll factor (UF) and caching this in the VP state. This means
that we are always able to extract an arbitrary element from the vector
if necessary. In addition to this, I also permitted the caching of the
individual lane values themselves for the known minimum number of elements
in the same way we do for fixed width vectors. This is a further
optimisation that improves the code quality since it avoids unnecessary
extractelement operations when extracting the first lane.
4. Added an assert to InnerLoopVectorizer::widenPHIInstruction, since while
testing some code paths I noticed this is currently broken for scalable
vectors.
Various tests to support different cases have been added here:
Transforms/LoopVectorize/AArch64/sve-inductions.ll
Differential Revision: https://reviews.llvm.org/D98715
Re-apply 25fbe803d4, with a small update to emit the right remark
class.
Original message:
[LV] Move runtime pointer size check to LVP::plan().
This removes the need for the remaining doesNotMeet check and instead
directly checks if there are too many runtime checks for vectorization
in the planner.
A subsequent patch will adjust the logic used to decide whether to
vectorize with runtime to consider their cost more accurately.
Reviewed By: lebedev.ri
This is a 2nd try of:
3c8473ba53
which was reverted at:
a26312f9d4
because of crashing.
This version includes extra code and tests to avoid the known
crashing examples as discussed in PR49730.
Original commit message:
As noted in D98152, we need to patch SLP to avoid regressions when
we start canonicalizing to integer min/max intrinsics.
Most of the real work to make this possible was in:
7202f47508
Differential Revision: https://reviews.llvm.org/D98981
This removes the need for the remaining doesNotMeet check and instead
directly checks if there are too many runtime checks for vectorization
in the planner.
A subsequent patch will adjust the logic used to decide whether to
vectorize with runtime to consider their cost more accurately.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D98634
This reverts commit 3c8473ba53 and includes test diffs to
maintain testing status.
There's at least 1 place that was not updated with 7202f47508 ,
so we can crash mismatching select and intrinsics as shown in
PR49730.
This patch simplifies the calculation of certain costs in
getInstructionCost when isScalarAfterVectorization() returns a true value.
There are a few places where we multiply a cost by a number N, i.e.
unsigned N = isScalarAfterVectorization(I, VF) ? VF.getKnownMinValue() : 1;
return N * TTI.getArithmeticInstrCost(...
After some investigation it seems that there are only these cases that occur
in practice:
1. VF is a scalar, in which case N = 1.
2. VF is a vector. We can only get here if: a) the instruction is a
GEP/bitcast with scalar uses, or b) this is an update to an induction variable
that remains scalar.
I have changed the code so that N is assumed to always be 1. For GEPs
the cost is always 0, since this is calculated later on as part of the
load/store cost. For all other cases I have added an assert that none of the
users needs scalarising, which didn't fire in any unit tests.
Only one test required fixing and I believe the original cost for the scalar
add instruction to have been wrong, since only one copy remains after
vectorisation.
Differential Revision: https://reviews.llvm.org/D98512
The SCEV commit b46c085d2b [NFCI] SCEVExpander:
emit intrinsics for integral {u,s}{min,max} SCEV expressions
seems to reveal a new crash in SLPVectorizer.
SLP crashes expecting a SelectInst as an externally used value
but umin() call is found.
The patch relaxes the assumption to make the IR flag propagation safe.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D99328
We do not need to scan further if the upper end or lower end of the
basic block is reached already and the instruction is not found. It
means that the instruction is definitely in the lower part of basic
block or in the upper block relatively.
This should improve compile time for the very big basic blocks.
Differential Revision: https://reviews.llvm.org/D99266
This patch changes the interface to take a RegisterKind, to indicate
whether the register bitwidth of a scalar register, fixed-width vector
register, or scalable vector register must be returned.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D98874
We know if the loop contains FP instructions preventing vectorization
after we are done with legality checks. This patch updates the code the
check for un-vectorizable FP operations earlier, to avoid unnecessarily
running the cost model and picking a vectorization factor. It also makes
the code more direct and moves the check to a position where similar
checks are done.
I might be missing something, but I don't see any reason to handle this
check differently to other, similar checks.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D98633
Added getPointersDiff function to LoopAccessAnalysis and used it instead
direct calculatoin of the distance between pointers and/or
isConsecutiveAccess function in SLP vectorizer to improve compile time
and detection of stores consecutive chains.
Part of D57059
Differential Revision: https://reviews.llvm.org/D98967
Added getPointersDiff function to LoopAccessAnalysis and used it instead
direct calculatoin of the distance between pointers and/or
isConsecutiveAccess function in SLP vectorizer to improve compile time
and detection of stores consecutive chains.
Part of D57059
Differential Revision: https://reviews.llvm.org/D98967
As noted in D98152, we need to patch SLP to avoid regressions when
we start canonicalizing to integer min/max intrinsics.
Most of the real work to make this possible was in:
7202f47508
Differential Revision: https://reviews.llvm.org/D98981
In places where we create a ConstantVector whose elements are a
linear sequence of the form <start, start + 1, start + 2, ...>
I've changed the code to make use of CreateStepVector, which creates
a vector with the sequence <0, 1, 2, ...>, and a vector addition
operation. This patch is a non-functional change, since the output
from the vectoriser remains unchanged for fixed length vectors and
there are existing asserts that still fire when attempting to use
scalable vectors for vectorising induction variables.
In a later patch we will enable support for scalable vectors
in InnerLoopVectorizer::getStepVector(), which relies upon the new
stepvector intrinsic in IRBuilder::CreateStepVector.
Differential Revision: https://reviews.llvm.org/D97861
The name is included when printing in DOT mode. Also print it in non-DOT
mode after 93a9d2de8f.
This will become more important to distinguish different plans once
VPlans are gradually refined.
Make sure we use PowerOf2Floor instead of PowerOf2Ceil when
calculating max number of elements that fits inside a vector
register (otherwise we could end up creating vectors larger
than the maximum vector register size).
Also make sure we honor the min/max VF (as given by TTI or
cmd line parameters) when doing vectorizeStores.
Reviewed By: anton-afanasyev
Differential Revision: https://reviews.llvm.org/D97691
I foresee two uses for this:
1) It's easier to use those in debugger.
2) Once we start implementing more VPlan-to-VPlan transformations (especially
inner loop massaging stuff), using the vectorized LLVM IR as CHECK targets in
LIT test would become too obscure. I can imagine that we'd want to CHECK
against VPlan dumps after multiple transformations instead. That would be
easier with plain text dumps than with DOT format.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D96628
This reverts commit 6b053c9867.
The build is broken:
ld.lld: error: undefined symbol: llvm::VPlan::printDOT(llvm::raw_ostream&) const
>>> referenced by LoopVectorize.cpp
>>> LoopVectorize.cpp.o:(llvm::LoopVectorizationPlanner::printPlans(llvm::raw_ostream&)) in archive lib/libLLVMVectorize.a
I foresee two uses for this:
1) It's easier to use those in debugger.
2) Once we start implementing more VPlan-to-VPlan transformations (especially
inner loop massaging stuff), using the vectorized LLVM IR as CHECK targets in
LIT test would become too obscure. I can imagine that we'd want to CHECK
against VPlan dumps after multiple transformations instead. That would be
easier with plain text dumps than with DOT format.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D96628
If SLP vectorizer tries to extend the scheduling region and runs out of
the budget too early, but still extends the region to the new ending
instructions (i.e., it was able to extend the region for the first
instruction in the bundle, but not for the second), the compiler need to
recalculate dependecies in full, just like if the extending was
successfull. Without it, the schedule data chunks may end up with the
wrong number of (unscheduled) dependecies and it may end up with the
incorrect function, where the vectorized instruction does not dominate
on the extractelement instruction.
Differential Revision: https://reviews.llvm.org/D98531
This adds an Mask ArrayRef to getShuffleCost, so that if an exact mask
can be provided a more accurate cost can be provided by the backend.
For example VREV costs could be returned by the ARM backend. This should
be an NFC until then, laying the groundwork for that to be added.
Differential Revision: https://reviews.llvm.org/D98206
The `hasIrregularType` predicate checks whether an array of N values of type Ty is "bitcast-compatible" with a <N x Ty> vector.
The previous check returned invalid results in some cases where there's some padding between the array elements: eg. a 4-element array of u7 values is considered as compatible with <4 x u7>, even though the vector is only loading/storing 28 bits instead of 32.
The problem causes LLVM to generate incorrect code for some targets: for AArch64 the vector loads/stores are lowered in terms of ubfx/bfi, effectively losing the top (N * padding bits).
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D97465
This adds the cost of an i1 extract and a branch to the cost in
getMemInstScalarizationCost when the instruction is predicated. These
predicated loads/store would generate blocks of something like:
%c1 = extractelement <4 x i1> %C, i32 1
br i1 %c1, label %if, label %else
if:
%sa = extractelement <4 x i32> %a, i32 1
%sb = getelementptr inbounds float, float* %pg, i32 %sa
%sv = extractelement <4 x float> %x, i32 1
store float %sa, float* %sb, align 4
else:
So this increases the cost by the extract and branch. This is probably
still too low in many cases due to the cost of all that branching, but
there is already an existing hack increasing the cost using
useEmulatedMaskMemRefHack. It will increase the cost of a memop if it is
a load or there are more than one store. This patch improves the cost
for when there is only a single store, and hopefully at some point in
the future the hack can be removed.
Differential Revision: https://reviews.llvm.org/D98243
Current SLP pass has this piece of code that inserts a trunc instruction
after the vectorized instruction. In the case that the vectorized instruction
is a phi node and not the last phi node in the BB, the trunc instruction
will be inserted between two phi nodes, which will trigger verify problem
in debug version or unpredictable error in another pass.
This patch changes the algorithm to 'if the last vectorized instruction
is a phi, insert it after the last phi node in current BB' to fix this problem.
The motivation is to handle integer min/max reductions independently
of whether they are in the current cmp+sel form or the planned intrinsic
form.
We assumed that min/max included a select instruction, but we can
decouple that implementation detail by checking the instructions
themselves rather than relying on the recurrence (reduction) type.
Instead of maintaining a separate map from predicated instructions to
recipes, we can instead directly look at the VP operands. If the operand
comes from a predicated instruction, the operand will be a
VPPredInstPHIRecipe with a VPReplicateRecipe as its operand.
This patch adds support for reverse loop vectorization.
It is possible to vectorize the following loop:
```
for (int i = n-1; i >= 0; --i)
a[i] = b[i] + 1.0;
```
with fixed or scalable vector.
The loop-vectorizer will use 'reverse' on the loads/stores to make
sure the lanes themselves are also handled in the right order.
This patch adds support for scalable vector on IRBuilder interface to
create a reverse vector. The IR function
CreateVectorReverse lowers to experimental.vector.reverse for scalable vector
and keedp the original behavior for fixed vector using shuffle reverse.
Differential Revision: https://reviews.llvm.org/D95363
This patch fixes a crash when trying to get a scalar value using
VPTransformState::get() for uniform induction values or truncated
induction values. IVs and truncated IVs can be uniform and the updated
code accounts for that, fixing the crash.
This should fix
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=31981
Associative reduction matcher in SLP begins with select instruction but when
it reached call to llvm.umax (or alike) via def-use chain the latter also matched
as UMax kind. The routine's later code assumes matched instruction to be a select
and thus it merely died on the first encountered cast that did not fit.
Differential Revision: https://reviews.llvm.org/D98432
Add support to widen select instructions in VPlan native path by using a correct recipe when such instructions are encountered. This is already used by inner loop vectorizer.
Previously select instructions get handled by the wrong recipe and resulted in unreachable instruction errors like this one: https://bugs.llvm.org/show_bug.cgi?id=48139.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D97136
Add support to widen call instructions in VPlan native path by using a correct recipe when such instructions are encountered. This is already used by inner loop vectorizer.
Previously call instructions got handled by wrong recipes and resulted in unreachable instruction errors like this one: https://bugs.llvm.org/show_bug.cgi?id=48139.
Patch by Mauri Mustonen <mauri.mustonen@tuni.fi>
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D97278
There are certain loops like this below:
for (int i = 0; i < n; i++) {
a[i] = b[i] + 1;
*inv = a[i];
}
that can only be vectorised if we are able to extract the last lane of the
vectorised form of 'a[i]'. For fixed width vectors this already works since
we know at compile time what the final lane is, however for scalable vectors
this is a different story. This patch adds support for extracting the last
lane from a scalable vector using a runtime determined lane value. I have
added support to VPIteration for runtime-determined lanes that still permit
the caching of values. I did this by introducing a new class called VPLane,
which describes the lane we're dealing with and provides interfaces to get
both the compile-time known lane and the runtime determined value. Whilst
doing this work I couldn't find any explicit tests for extracting the last
lane values of fixed width vectors so I added tests for both scalable and
fixed width vectors.
Differential Revision: https://reviews.llvm.org/D95139
This code assumed that FP math was only permissable if it was
fully "fast", so it hard-coded "fast" when creating new instructions.
The underlying code already allows matching recurrences/reductions
that are only "reassoc", so this change should prevent the potential
miscompile seen in the test diffs (we created "fast" ops even though
none existed in the original code).
I don't know if we need to create the temporary IRBuilder objects
used here, so that could be follow-up clean-up.
There's an open question about whether we should require "nsz" in
addition to "reassoc" here. InstCombine uses that combo for its
reassociative folds, but I think codegen is not as strict.
Similar to b3a33553ae, but this shows a TODO and a potential
miscompile is already present.
We are tracking an FP instruction that does *not* have FMF (reassoc)
properties, so calling that "Unsafe" seems opposite of the common
reading.
I also removed one getter method by rolling the null check into
the access. Further simplification may be possible.
The motivation is to clean up the interactions between FMF and
function-level attributes in these classes and their callers.
The new test shows that there is an existing bug somewhere in
the callers. We assumed that the original code was fully 'fast'
and so we produced IR with 'fast' even though it was just 'reassoc'.