I have added a new TTI interface called enableOrderedReductions() that
controls whether or not ordered reductions should be enabled for a
given target. By default this returns false, whereas for AArch64 it
returns true and we rely upon the cost model to make sensible
vectorisation choices. It is still possible to override the new TTI
interface by setting the command line flag:
-force-ordered-reductions=true|false
I have added a new RUN line to show that we use ordered reductions by
default for SVE and Neon:
Transforms/LoopVectorize/AArch64/strict-fadd.ll
Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll
Differential Revision: https://reviews.llvm.org/D106653
Removed AArch64 usage of the getMaxVScale interface, replacing it with
the vscale_range(min, max) IR Attribute.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D106277
LoopLoadElimination, LoopVersioning and LoopVectorize currently
fetch MemorySSA when construction LoopAccessAnalysis. However,
LoopAccessAnalysis does not actually use MemorySSA and we can pass
nullptr instead.
This saves one MemorySSA calculation in the default pipeline, and
thus improves compile-time.
Differential Revision: https://reviews.llvm.org/D108074
Previously we emitted a "does not support scalable vectors"
remark for all targets whenever vectorisation is attempted. This
pollutes the output for architectures that don't support scalable
vectors and is likely confusing to the user.
Instead this patch introduces a debug message that reports when
scalable vectorisation is allowed by the target and only issues
the previous remark when scalable vectorisation is specifically
requested, for example:
#pragma clang loop vectorize_width(2, scalable)
Differential Revision: https://reviews.llvm.org/D108028
Teach LV to use masked-store to support interleave-store-group with
gaps (instead of scatters/scalarization).
The symmetric case of using masked-load to support
interleaved-load-group with gaps was introduced a while ago, by
https://reviews.llvm.org/D53668; This patch completes the store-scenario
leftover from D53668, and solves PR50566.
Reviewed by: Ayal Zaks
Differential Revision: https://reviews.llvm.org/D104750
After refactoring the phi recipes, we can now iterate over all header
phis in a VPlan to detect reductions when it comes to fixing them up
when tail folding.
This reduces the coupling with the cost model & legal by using the
information directly available in VPlan. It also removes a call to
getOrAddVPValue, which references the original IR value which may
become outdated after VPlan transformations.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D100102
This patch adds more instructions to the Uniforms list, for example certain
intrinsics that are uniform by definition or whose operands are loop invariant.
This list includes:
1. The intrinsics 'experimental.noalias.scope.decl' and 'sideeffect', which
are always uniform by definition.
2. If intrinsics 'lifetime.start', 'lifetime.end' and 'assume' have
loop invariant input operands then these are also uniform too.
Also, in VPRecipeBuilder::handleReplication we check if an instruction is
uniform based purely on whether or not the instruction lives in the Uniforms
list. However, there are certain cases where calls to some intrinsics can
be effectively treated as uniform too. Therefore, we now also treat the
following cases as uniform for scalable vectors:
1. If the 'assume' intrinsic's operand is not loop invariant, then we
are free to treat this as uniform anyway since it's only a performance
hint. We will get the benefit for the first lane.
2. When the input pointers for 'lifetime.start' and 'lifetime.end' are loop
variant then for scalable vectors we assume these still ultimately come
from the broadcast of an alloca. We do not support scalable vectorisation
of loops containing alloca instructions, hence the alloca itself would
be invariant. If the pointer does not come from an alloca then the
intrinsic itself has no effect.
I have updated the assume test for fixed width, since we now treat it
as uniform:
Transforms/LoopVectorize/assume.ll
I've also added new scalable vectorisation tests for other intriniscs:
Transforms/LoopVectorize/scalable-assume.ll
Transforms/LoopVectorize/scalable-lifetime.ll
Transforms/LoopVectorize/scalable-noalias-scope-decl.ll
Differential Revision: https://reviews.llvm.org/D107284
All information to fix-up the reduction phi nodes in the vectorized loop
is available in VPlan now. This patch moves the code to do so, to make
this clearer. Fixing up the loop exit value still relies on other
information and remains outside of VPlan for now.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D100113
Since all operands to ExtractValue must be loop-invariant when we deem
the loop vectorizable, we can consider ExtractValue to be uniform.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D107286
This patch adds more instructions to the Uniforms list, for example certain
intrinsics that are uniform by definition or whose operands are loop invariant.
This list includes:
1. The intrinsics 'experimental.noalias.scope.decl' and 'sideeffect', which
are always uniform by definition.
2. If intrinsics 'lifetime.start', 'lifetime.end' and 'assume' have
loop invariant input operands then these are also uniform too.
Also, in VPRecipeBuilder::handleReplication we check if an instruction is
uniform based purely on whether or not the instruction lives in the Uniforms
list. However, there are certain cases where calls to some intrinsics can
be effectively treated as uniform too. Therefore, we now also treat the
following cases as uniform for scalable vectors:
1. If the 'assume' intrinsic's operand is not loop invariant, then we
are free to treat this as uniform anyway since it's only a performance
hint. We will get the benefit for the first lane.
2. When the input pointers for 'lifetime.start' and 'lifetime.end' are loop
variant then for scalable vectors we assume these still ultimately come
from the broadcast of an alloca. We do not support scalable vectorisation
of loops containing alloca instructions, hence the alloca itself would
be invariant. If the pointer does not come from an alloca then the
intrinsic itself has no effect.
I have updated the assume test for fixed width, since we now treat it
as uniform:
Transforms/LoopVectorize/assume.ll
I've also added new scalable vectorisation tests for other intriniscs:
Transforms/LoopVectorize/scalable-assume.ll
Transforms/LoopVectorize/scalable-lifetime.ll
Transforms/LoopVectorize/scalable-noalias-scope-decl.ll
Differential Revision: https://reviews.llvm.org/D107284
This change wasn't strictly necessary for D106164 and could be removed.
This patch addresses the post-commit comments from @fhahn on D106164, and
also changes sve-widen-gep.ll to use the same IR test as shown in
pointer-induction.ll.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D106878
I'm renaming the flag because a future patch will add a new
enableOrderedReductions() TTI interface and so the meaning of this
flag will change to be one of forcing the target to enable/disable
them. Also, since other places in LoopVectorize.cpp use the word
'Ordered' instead of 'strict' I changed the flag to match.
Differential Revision: https://reviews.llvm.org/D107264
This patch updates VPInterleaveRecipe::print to print the actual defined
VPValues for load groups and the store VPValue operands for store
groups.
The IR references may become outdated while transforming the VPlan and
the defined and stored VPValues always are up-to-date.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D107223
As suggested in D105008, move the code that fixes up the backedge value
for first order recurrences to VPlan::execute.
Now all that remains in fixFirstOrderRecurrences is the code responsible
for creating the exit values in the middle block.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D106244
This makes a couple of changes to the costing of MLA reduction patterns,
to more accurately cost various patterns that can come up from
vectorization.
- The Arm implementation of getExtendedAddReductionCost is altered to
only provide costs for legal or smaller types. Larger than legal types
need to be split, which currently does not work very well, especially
for predicated reductions where the predicate may be legal but needs to
be split. Currently we limit it to legal or smaller input types.
- The getReductionPatternCost has learnt that reduce(ext(mul(ext, ext))
is a pattern that can come up, and can be treated the same as
reduce(mul(ext, ext)) providing the extension types match.
- And it has been adjusted to not count the ext in reduce(mul(ext, ext))
as part of a reduce(mul) pattern.
Together these changes help to more accurately cost the mla reductions
in cases such as where the extend types don't match or the extend
opcodes are different, picking better vector factors that don't result
in expanded reductions.
Differential Revision: https://reviews.llvm.org/D106166
Consider the following loop:
void foo(float *dst, float *src, int N) {
for (int i = 0; i < N; i++) {
dst[i] = 0.0;
for (int j = 0; j < N; j++) {
dst[i] += src[(i * N) + j];
}
}
}
When we are not building with -Ofast we may attempt to vectorise the
inner loop using ordered reductions instead. In addition we also try
to select an appropriate interleave count for the inner loop. However,
when choosing a VF=1 the inner loop will be scalar and there is existing
code in selectInterleaveCount that limits the interleave count to 2
for reductions due to concerns about increasing the critical path.
For ordered reductions this problem is even worse due to the additional
data dependency, and so I've added code to simply disable interleaving
for scalar ordered reductions for now.
Test added here:
Transforms/LoopVectorize/AArch64/strict-fadd-vf1.ll
Differential Revision: https://reviews.llvm.org/D106646
The loop vectorizer may decide to use tail folding when the trip-count
is low. When that happens, scalable VFs are no longer a candidate,
since tail folding/predication is not yet supported for scalable vectors.
This can be re-enabled in a future patch.
Reviewed By: kmclaughlin
Differential Revision: https://reviews.llvm.org/D106657
Invalid costs can be used to avoid vectorization with a given VF, which is
used for scalable vectors to avoid things that the code-generator cannot
handle. If we override the cost using the -force-target-instruction-cost
option of the LV, we would override this mechanism, rendering the flag useless.
This change ensures the cost is only overriden when the original cost that
was calculated is valid. That allows the flag to be used in combination
with the -scalable-vectorization option.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D106677
Scalarization for scalable vectors is not (yet) supported, so the
LV discards a VF when scalarization is chosen as the widening
decision. It should therefore not assert that the VF is not scalable
when it computes the decision to scalarize.
The code can get here when both the interleave-cost, gather/scatter cost
and scalarization-cost are all illegal. This may e.g. happen for SVE
when the VF=1, to avoid generating `<vscale x 1 x eltty>` types that
the code-generator cannot yet handle.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D106656
This fixes an issue that was found in D105199, where a GEP instruction
is used both as the address of a store, as well as the value of a store.
For the former, the value is scalar after vectorization, but the latter
(as value) requires widening.
Other code in that function seems to prevent similar cases from happening,
but it seems this case was missed.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D106164
This reverts the revert commit b1777b04dc.
The patch originally got reverted due to a crash:
https://bugs.chromium.org/p/chromium/issues/detail?id=1232798#c2
The underlying issue was that we were not using the stored values from
the modified memory recipes, but the out-of-date values directly from
the IR (accessed via the VPlan). This should be fixed in d995d6376. A
reduced version of the reproducer has been added in 93664503be.
Fixes more casts to `<FixedVectorType>` for the cases where the
instruction is a Insert/ExtractElementInst.
For fixed-width, this part of truncateToMinimalBitWidths is tested by
AArch64/type-shrinkage-insertelt.ll. I attempted to write a test case for this part
of truncateToMinimalBitWidths which uses scalable vectors, but was unable to add
one. The tests in type-shrinkage-insertelt.ll rely on scalarization to create extract
element instructions for instance, which is not possible for scalable vectors.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D106163
Instead of getting the VPValue for the stored IR values through the
current plan, use the stored value of the recipes directly.
This way, the correct VPValues are used if the store recipes have been
modified in the VPlan and the IR value is not correct any longer. This
can happen, e.g. due to D105008.
I have added a new FastMathFlags parameter to getArithmeticReductionCost
to indicate what type of reduction we are performing:
1. Tree-wise. This is the typical fast-math reduction that involves
continually splitting a vector up into halves and adding each
half together until we get a scalar result. This is the default
behaviour for integers, whereas for floating point we only do this
if reassociation is allowed.
2. Ordered. This now allows us to estimate the cost of performing
a strict vector reduction by treating it as a series of scalar
operations in lane order. This is the case when FP reassociation
is not permitted. For scalable vectors this is more difficult
because at compile time we do not know how many lanes there are,
and so we use the worst case maximum vscale value.
I have also fixed getTypeBasedIntrinsicInstrCost to pass in the
FastMathFlags, which meant fixing up some X86 tests where we always
assumed the vector.reduce.fadd/mul intrinsics were 'fast'.
New tests have been added here:
Analysis/CostModel/AArch64/reduce-fadd.ll
Analysis/CostModel/AArch64/sve-intrinsics.ll
Transforms/LoopVectorize/AArch64/strict-fadd-cost.ll
Transforms/LoopVectorize/AArch64/sve-strict-fadd-cost.ll
Differential Revision: https://reviews.llvm.org/D105432
This patch avoids computing discounts for predicated instructions when the
VF is scalable.
There is no support for vectorization of loops with division because the
vectorizer cannot guarantee that zero divisions will not happen.
This loop now does not use VF scalable
```
for (long long i = 0; i < n; i++)
if (cond[i])
a[i] /= b[i];
```
Differential Revision: https://reviews.llvm.org/D101916
Currently the Instruction cost of getReductionPatternCost returns an
Invalid cost to specify "did not find the pattern". This changes that to
return an Optional with None specifying not found, allowing Invalid to
mean an infinite cost as is used elsewhere.
Differential Revision: https://reviews.llvm.org/D106140
This patch removes the assertion when VF is scalable and replaces
getKnownMinValue() by getFixedValue(), so it still guards the code against
scalable vector types.
The assertions were used to guarantee that getknownMinValue were not used for
scalable vectors.
Differential Revision: https://reviews.llvm.org/D106359
This patch adds a VPFirstOrderRecurrencePHIRecipe, to further untangle
VPWidenPHIRecipe into distinct recipes for distinct use cases/lowering.
See D104989 for a new recipe for reduction phis.
This patch also introduces a new `FirstOrderRecurrenceSplice`
VPInstruction opcode, which is used to make the forming of the vector
recurrence value explicit in VPlan. This more accurately models def-uses
in VPlan and also simplifies code-generation. Now, the vector recurrence
values are created at the right place during VPlan-codegeneration,
rather than during post-VPlan fixups.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D105008
This patch returns an Invalid cost from getInstructionCost() for alloca
instructions if the VF is scalable, as otherwise loops which contain
these instructions will crash when attempting to scalarize the alloca.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D105824
The original patch was:
https://reviews.llvm.org/D105806
There were some issues with undeterministic behaviour of the sorting
function, which led to scalable-call.ll passing and/or failing. This
patch fixes the issue by numbering all instructions in the array first,
and using that number as the order, which should provide a consistent
ordering.
This reverts commit a607f64118.
The sort function for emitting an OptRemark was not deterministic,
which caused scalable-call.ll to fail on some buildbots. This patch
fixes that.
This patch also fixes an issue where `Instruction::comesBefore()`
is called when two Instructions are in different basic blocks,
which would otherwise cause an assertion failure.
This patch emits remarks for instructions that have invalid costs for
a given set of vectorization factors. Some example output:
t.c:4:19: remark: Instruction with invalid costs prevented vectorization at VF=(vscale x 1): load
dst[i] = sinf(src[i]);
^
t.c:4:14: remark: Instruction with invalid costs prevented vectorization at VF=(vscale x 1, vscale x 2, vscale x 4): call to llvm.sin.f32
dst[i] = sinf(src[i]);
^
t.c:4:12: remark: Instruction with invalid costs prevented vectorization at VF=(vscale x 1): store
dst[i] = sinf(src[i]);
^
Reviewed By: fhahn, kmclaughlin
Differential Revision: https://reviews.llvm.org/D105806
Instead of performing the isMoreProfitable() operation on
InstructionCost::CostTy the operation is performed on InstructionCost
directly, so that it can handle the case where one of the costs is
Invalid.
This patch also changes the CostTy to be int64_t, so that the type is
wide enough to deal with multiplications with e.g. `unsigned MaxTripCount`.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D105113
This patch removes the IsPairwiseForm flag from the Reduction Cost TTI
hooks, along with some accompanying code for pattern matching reductions
from trees starting at extract elements. IsPairWise is now assumed to be
false, which was the predominant way that the value was used from both
the Loop and SLP vectorizers. Since the adjustments such as D93860, the
SLP vectorizer has not relied upon this distinction between paiwise and
non-pairwise reductions.
This also removes some code that was detecting reductions trees starting
from extract elements inside the costmodel. This case was
double-counting costs though, adding the individual costs on the
individual instruction _and_ the total cost of the reduction. Removing
it changes the costs in llvm/test/Analysis/CostModel/X86/reduction.ll to
not double count. The cost of reduction intrinsics is still tested
through the various tests in
llvm/test/Analysis/CostModel/X86/reduce-xyz.ll.
Differential Revision: https://reviews.llvm.org/D105484
Resubmit after the following changes:
* Fix a latent bug related to unrolling with required epilogue (see e49d65f). I believe this is the cause of the prior PPC buildbot failure.
* Disable non-latch exits for epilogue vectorization to be safe (9ffa90d)
* Split out assert movement (600624a) to reduce churn if this gets reverted again.
Previous commit message (try 3)
Resubmit after fixing test/Transforms/LoopVectorize/ARM/mve-gather-scatter-tailpred.ll
Previous commit message...
This is a resubmit of 3e5ce4 (which was reverted by 7fe41ac). The original commit caused a PPC build bot failure we never really got to the bottom of. I can't reproduce the issue, and the bot owner was non-responsive. In the meantime, we stumbled across an issue which seems possibly related, and worked around a latent bug in 80e8025. My best guess is that the original patch exposed that latent issue at higher frequency, but it really is just a guess.
Original commit message follows...
If we know that the scalar epilogue is required to run, modify the CFG to end the middle block with an unconditional branch to scalar preheader. This is instead of a conditional branch to either the preheader or the exit block.
The motivation to do this is to support multiple exit blocks. Specifically, the current structure forces us to identify immediate dominators and *which* exit block to branch from in the middle terminator. For the multiple exit case - where we know require scalar will hold - these questions are ill formed.
This is the last change needed to support multiple exit loops, but since the diffs are already large enough, I'm going to land this, and then enable separately. You can think of this as being NFCIish prep work, but the changes are a bit too involved for me to feel comfortable tagging the review that way.
Differential Revision: https://reviews.llvm.org/D94892
When skimming through old review discussion, I noticed a post commit comment on an earlier patch which had gone unaddressed. Better late (4 months), than never right?
I'm not aware of an active problem with the combination of non-latch exits and epilogue vectorization, but the interaction was not considered and I'm not modivated to make epilogue vectorization work with early exits. If there were a bug in the interaction, it would be pretty hard to hit right now (as we canonicalize towards bottom tested loops), but an upcoming change to allow multiple exit loops will greatly increase the chance for error. Thus, let's play it safe for now.
This reverts commit 706bbfb35b.
The committed version moves the definition of VPReductionPHIRecipe out
of an ifdef only intended for ::print helpers. This should resolve the
build failures that caused the revert
This patch adds a TTI function, isElementTypeLegalForScalableVector, to query
whether it is possible to vectorize a given element type. This is called by
isLegalToVectorizeInstTypesForScalable to reject scalable vectorization if
any of the instruction types in the loop are unsupported, e.g:
int foo(__int128_t* ptr, int N)
#pragma clang loop vectorize_width(4, scalable)
for (int i=0; i<N; ++i)
ptr[i] = ptr[i] + 42;
This example currently crashes if we attempt to vectorize since i128 is not a
supported type for scalable vectorization.
Reviewed By: sdesmalen, david-arm
Differential Revision: https://reviews.llvm.org/D102253
This reverts commit 3fed6d443f,
bbcbf21ae6 and
6c3451cd76.
The changes causing build failures with certain configurations, e.g.
https://lab.llvm.org/buildbot/#/builders/67/builds/3365/steps/6/logs/stdio
lib/libLLVMVectorize.a(LoopVectorize.cpp.o): In function `llvm::VPRecipeBuilder::tryToCreateWidenRecipe(llvm::Instruction*, llvm::ArrayRef<llvm::VPValue*>, llvm::VFRange&, std::unique_ptr<llvm::VPlan, std::default_delete<llvm::VPlan> >&) [clone .localalias.8]':
LoopVectorize.cpp:(.text._ZN4llvm15VPRecipeBuilder22tryToCreateWidenRecipeEPNS_11InstructionENS_8ArrayRefIPNS_7VPValueEEERNS_7VFRangeERSt10unique_ptrINS_5VPlanESt14default_deleteISA_EE+0x63b): undefined reference to `vtable for llvm::VPReductionPHIRecipe'
collect2: error: ld returned 1 exit status
This patch is a first step towards splitting up VPWidenPHIRecipe into
separate recipes for the 3 distinct cases they model:
1. reduction phis,
2. first-order recurrence phis,
3. pointer induction phis.
This allows untangling the code generation and allows us to reduce the
reliance on LoopVectorizationCostModel during VPlan code generation.
Discussed/suggested in D100102, D100113, D104197.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D104989
Splits `getSmallestAndWidestTypes` into two functions, one of which now collects
a list of all element types found in the loop (`ElementTypesInLoop`). This ensures we do not
have to iterate over all instructions in the loop again in other places, such as in D102253
which disables scalable vectorization of a loop if any of the instructions use invalid types.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D105437
Same as other CreateLoad-style APIs, these need an explicit type
argument to support opaque pointers.
Differential Revision: https://reviews.llvm.org/D105395