This patch extends VPlan-based sinking to also sink VPScalarIVStepsRecipe.
This takes us a step closer towards retiring the IR-based sinking.
The main change is extending VPScalarIVStepsRecipe::execute to support
executing in a replicate-region.
Depends on D133758.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D133760
This reuses the routine introduced in 0e6f0b7 to address several existing TODOs. Many of the operations scale linearly with LMUL; this change reflects that in the cost model.
Differential Revision: https://reviews.llvm.org/D139039
The first attempt was reverted because a clang test changed
unexpectedly. The file is already marked with a FIXME, so this
time I just updated it to pass.
Original commit message:
This is the main patch for converting a truncated scalar that is
inserted into a vector to bitcast+shuffle. We could go either way
on patterns like this, but this direction will allow collapsing a
pair of these sequences on the motivating example from issue
The patch is split into 3 parts to make it easier to see the
progression of test diffs. We allow inserting/shuffling into a
different size vector for flexibility, so there are several test
variations. The length-changing is handled by shortening/padding
the shuffle mask with undef elements.
In part 1, handle the basic pattern:
inselt undef, (trunc T), IndexC --> shuffle (bitcast T), IdentityMask
Proof for the endian-dependency behaving as expected:
https://alive2.llvm.org/ce/z/BsA7yC
The TODO items for handling shifts and inserting into an arbitrary base
vector value are implemented as follow-ups.
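As a rough illustration (my own sketch, assuming a little-endian target and
insertion at index 0, not taken from the patch's tests), the basic pattern
looks like:

```
; Lane 0 of the bitcast already holds the truncated low byte of %x on a
; little-endian target, so the insert becomes an identity shuffle.
define <4 x i8> @src(i32 %x) {
  %t = trunc i32 %x to i8
  %v = insertelement <4 x i8> undef, i8 %t, i64 0
  ret <4 x i8> %v
}

define <4 x i8> @tgt(i32 %x) {
  %bc = bitcast i32 %x to <4 x i8>
  %v = shufflevector <4 x i8> %bc, <4 x i8> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  ret <4 x i8> %v
}
```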
Differential Revision: https://reviews.llvm.org/D138872
This reverts commit bf15f1e489.
The updated version fixes a crash by checking the induction kind instead
of the opcode; for integer inductions, the step is always added, but the
opcode might not be set.
In revision B.q and before of the Armv8-M architecture reference
manual, the vector/scalar forms of the `vmla` and `vmlas` instructions
came in signed and unsigned integer forms, such as `vmla.s8 q0,q1,r2`
or `vmlas.u32 q3,q4,r5`.
Revision B.r has changed this. There are no longer signed and unsigned
versions of these instructions, since they were functionally identical
anyway. Now there is just `vmla.i8` (or `i16` or `i32`, and similarly
for `vmlas`). Bit 28 of the instruction encoding, which was previously
0 for signed or 1 for unsigned, is now expected to be 0 always.
This change updates LLVM to the new version of the architecture. The
obsoleted encodings for unsigned integers are now decoding errors, and
only the still-valid encoding is ever emitted. This shouldn't break
any existing assembly code, because the old signed and unsigned
versions of the mnemonic are still accepted by the assembler (which is
standard practice anyway for all signedness-agnostic MVE integer
instructions).
Reviewed By: dmgreen, lenary
Differential Revision: https://reviews.llvm.org/D138827
This patch implements getArithmeticInstrCost for RISCV, adding cost
modeling for integer and floating-point vector arithmetic instructions.
Differential Revision: https://reviews.llvm.org/D133552 (Original patch by jacquesguan. Subset by me with todos added.)
This patch splits off the logic to transform the canonical IV into a
value for an induction with a different start and step. This
transformation only needs to be done once (independent of VF/UF) and
enables sinking of VPScalarIVStepsRecipe as follow-up.
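As a rough sketch (illustrative names, not the recipe's actual output), the
derived induction is computed from the canonical IV as start + iv * step:

```
define void @derived_iv_sketch(i64 %start, i64 %step, i64 %n) {
entry:
  br label %loop

loop:
  %canonical.iv = phi i64 [ 0, %entry ], [ %canonical.iv.next, %loop ]
  ; induction with a different start and step, derived from the canonical IV
  %offset = mul i64 %canonical.iv, %step
  %derived.iv = add i64 %start, %offset
  %canonical.iv.next = add nuw i64 %canonical.iv, 1
  %done = icmp eq i64 %canonical.iv.next, %n
  br i1 %done, label %exit, label %loop

exit:
  ret void
}
```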
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D133758
StoredValues only has entries for members of the interleave group. If
there are gaps, then using the index i here will either access a wrong
entry or be out-of-bounds.
Instead use a dedicated index that only gets incremented for members of
the interleave group.
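A reduced sketch of the situation (my own example, not the test from the
report): a stride-3 store group where member 1 is a gap, so there are only
two stored values for three member slots:

```
define void @store_group_with_gap(ptr %p, i32 %x, i32 %y, i64 %n) {
entry:
  br label %loop

loop:
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
  %base = mul i64 %iv, 3
  %gep0 = getelementptr inbounds i32, ptr %p, i64 %base   ; member 0
  store i32 %x, ptr %gep0
  %off2 = add i64 %base, 2
  %gep2 = getelementptr inbounds i32, ptr %p, i64 %off2   ; member 2, member 1 is a gap
  store i32 %y, ptr %gep2
  %iv.next = add nuw i64 %iv, 1
  %done = icmp eq i64 %iv.next, %n
  br i1 %done, label %exit, label %loop

exit:
  ret void
}
```

Indexing the stored values with the member index (0 and 2) would read past
the end of the two-entry list; the dedicated counter only advances for
members that actually exist.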
Fixes #59090.
This is quite reduced from the original example, but hopefully shows
where vectorization is unprofitable because of multiple factors
including the low trip count of the loop.
All of our insert/extract ops work on 128-bit lanes.
For `Insert`, we need to extract the affected 128-bit lane,
unless it's being fully overwritten (FIXME: do we need to be
careful about legalization-induced padding that we obviously don't demand?),
perform insertions, and then insert the 128-bit lane back.
But hold on. If we are operating on a 256-bit legal vector,
and thus have two 128-bit subvectors, and are fully overwriting them both,
we don't actually need to insert *both* subvectors,
only the second one, into the implicitly-widened first one.
Also, `Insert` wasn't actually querying the costs,
but just assuming them to be `1`.
`getShuffleCost(TTI::SK_ExtractSubvector)` notes:
```
// Note that in general, the insertion starting at the beginning of a vector
// isn't free, because we need to preserve the rest of the wide vector.
```
... so as far as I can tell, we didn't account for that.
I was hoping this would allow vectorization at a higher VF in one case I looked at,
but the subvector insertion cost still advises against that.
The change for `Extract` is NFC, and is for consistency only;
I wanted to get rid of that weird explicit discounting of insertion of the 0'th element,
since the general code should already deal with that.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D137913
For min and max reduction idioms, the identity (i.e. neutral) element
should be the datatype's highest and lowest possible values, respectively.
The current implementation in IVDescriptors incorrectly returns -Inf for FMin
reductions and +Inf for FMax reductions. This patch fixes this bug, which
was causing incorrect reduction results in loops vectorized by LV.
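As a small illustration of why the choice matters (my own example, not a test
from the patch): for an FMin reduction the neutral start value must be +Inf,
since fmin(x, +Inf) == x for every x, whereas starting from -Inf would pin the
result to -Inf; the FMax case is symmetric.

```
define float @fmin_reduction(ptr %p, i64 %n) {
entry:
  br label %loop

loop:
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
  %red = phi float [ 0x7FF0000000000000, %entry ], [ %red.next, %loop ] ; +Inf identity
  %addr = getelementptr inbounds float, ptr %p, i64 %iv
  %x = load float, ptr %addr, align 4
  %red.next = call float @llvm.minnum.f32(float %red, float %x)
  %iv.next = add nuw i64 %iv, 1
  %done = icmp eq i64 %iv.next, %n
  br i1 %done, label %exit, label %loop

exit:
  ret float %red.next
}

declare float @llvm.minnum.f32(float, float)
```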
Differential Revision: https://reviews.llvm.org/D137220
In D136659 I found a few tests that write through readonly parameters:
* Analysis/BasicAA/pr18573.ll: @foo1 writes through %arr.ptr, but declares it
readonly. I removed the readonly annotation.
* CodeGen/ARM/ParallelDSP/aliasing.ll: @restrict writes through the readonly
%arg3, @store_alias_arg3_illegal_1 writes through the readonly %arg3, and
@store_alias_arg3_illegal_2 writes through the readonly %arg3. I removed
readonly from all three. Also, I added some CHECK-LABEL directives to make it
harder for FileCheck output to be mixed up.
* Transforms/LoopVectorize/AArch64/sve-gather-scatter.ll:
@gather_nxv4i32_ind64_stride2 writes through the readonly %a. I removed the
readonly attribute.
* Transforms/LoopVectorize/interleaved-accesses.ll: @load_gap_reverse writes
through the readonly %P1 and %P2. Also, the corresponding C code in the comment
didn't match the test. I removed the readonly attribute from both parameters
and corrected the C code.
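The pattern being fixed is the same in each case; a minimal hypothetical
example of the inconsistency (not one of the tests above):

```
; The parameter is declared readonly, but the body writes through it,
; which is undefined behavior.
define void @write_through_readonly(ptr readonly %p) {
  store i32 0, ptr %p, align 4
  ret void
}
```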
Differential Revision: https://reviews.llvm.org/D136880
Replace custom code that checks if only the first lane is used with the
generic helper `onlyFirstLaneUsed`. This enables VPlan-based sinking in a few
additional cases and was suggested in D133760.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D136368
Epilogue loop vectorization is a feature in the vectorizer intended to avoid running fully scalar code when the vector length of the main loop turns out to be either longer than the trip count of the actual loop, or to leave a large remainder.
In practice, this feature appears to not have been well tuned. I honestly don't think it should be on by default at all, but it definitely shouldn't be on for RISCV. Note that other targets have also disabled it, but they've done so via disabling interleaving - which is, well, completely unrelated - and we don't want to do that for RISCV.
In the near term, many examples I'm seeing have terrible codegen for epilogue vectorization. We are greatly increasing code size for little value at reasonable VLEN values for small types. In the long term, the cases that epilogue vectorization are intended to handle are likely better handled via tail folding on RISCV.
As an aside, I also don't really trust the correctness of epilogue vectorization. The code structure is such that otherwise straightforward changes sometimes break only epilogue vectorization. The reuse of an existing vplan without careful validation opens significant room for nasty bugs. Given how rarely the code is exercised, that is not a good combination.
As such, this patch introduces a TTI hook, and completely disables epilogue vectorization on RISCV.
Differential Revision: https://reviews.llvm.org/D136695
Canonicalize GEP of GEP by swapping a GEP with some suffix constant indices to the back (and a GEP with all constant indices to the back of that); this allows more constant-index GEP merging to happen. Exceptions are when swapping would violate use-def relations or anti-optimize LICM.
For constant-indexed GEP of GEP, if they cannot be merged directly, they are cast to i8* and merged.
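As a rough before/after sketch (my own example; inbounds handling and
profitability checks glossed over):

```
define ptr @before(ptr %p, i64 %i) {
  ; constant-index GEP in front blocks merging with later constant-index GEPs
  %g1 = getelementptr inbounds i8, ptr %p, i64 4
  %g2 = getelementptr inbounds i8, ptr %g1, i64 %i
  ret ptr %g2
}

define ptr @after(ptr %p, i64 %i) {
  ; constant indices swapped to the back, where they can merge
  %g1 = getelementptr inbounds i8, ptr %p, i64 %i
  %g2 = getelementptr inbounds i8, ptr %g1, i64 4
  ret ptr %g2
}
```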
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D125845
The code in buildScalarSteps already properly handles creating the
scalar induction values with VF = 1. Use it directly instead of using
extra code to handle that case.
Suggested by @Ayal in D133760.
When the SME attributes indicate that a function is or may be executed in Streaming
SVE mode, we currently need to be conservative and disable _any_ vectorization
(fixed or scalable) because the code-generator does not yet support generating
streaming-compatible code.
Scalable auto-vec will be gradually enabled in the future when we have
confidence that the loop-vectorizer won't use any SVE or NEON instructions
that are illegal in Streaming SVE mode.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D135950
getPrimitiveSizeInBits returns 0 for pointers, so we need to query
the size via DataLayout instead.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D135976
optimizeInductions may leave dead recipes which can prevent sinking.
Sinking on the other hand should not introduce new dead recipes, so
clean up dead recipes before sinking.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D133762
The attached test case causes an LLVM crash in buildVPlanWithVPRecipes because
an invalid VPlan is generated.
FIRST-ORDER-RECURRENCE-PHI ir<%792> = phi ir<%501>, ir<%806>
CLONE ir<%804> = fdiv ir<1.000000e+00>, vp<%17> // use of %17
CLONE ir<%806> = load ir<%805>
EMIT vp<%17> = first-order splice ir<%792> ir<%806> // def of %17
...
There is a use-before-def error on %17.
When the vectorizer generates a VPlan, it generates a "first-order splice"
instruction for a loop-carried variable after its definition. All related PHI
users are changed to use this "first-order splice" result, and are moved after
it. The move is guided by a MapVector SinkAfter, whose contents are filled in
by RecurrenceDescriptor::isFixedOrderRecurrence.
Let's look at the first PHI and its related instructions:
%v792 = phi double [ %v806, %Loop ], [ %d1, %Entry ]
%v802 = fdiv double %v794, %v792
%v804 = fdiv double 1.000000e+00, %v792
%v806 = load double, ptr %v805, align 8
%v806 is a loop-carried variable and %v792 is the related PHI instruction. The
vectorizer will generate a new "first-order splice" instruction for %v806, and
it will be used by %v802 and %v804. So %v802 and %v804 will be moved after
%v806 and its "first-order splice" instruction. So SinkAfter contains
%v802 -> %v806
%v804 -> %v802
This means %v802 should be moved after %v806 and %v804 should be moved after %v802.
Note that the order is important.
When isFixedOrderRecurrence processes the PHI instruction %v794, the related
instructions are:
%v793 = phi double [ %v813, %Loop ], [ %d1, %Entry ]
%v794 = phi double [ %v793, %Loop ], [ %d2, %Entry ]
%v802 = fdiv double %v794, %v792
%v813 = load double, ptr %v812, align 8
This time the related loop-carried variable is %v813, and its user is %v802. So
%v802 should also be moved after %v813. But %v802 is already in SinkAfter, and
because %v813 comes later than %v806, the original %v802 entry in SinkAfter is
deleted and a new %v802 entry is added. Now SinkAfter contains
%v804 -> %v802
%v802 -> %v813
With these entries, %v802 can still be moved after all its operands, but %v804
can't be moved after %v806 and its "first-order splice" instruction, which
causes the use-before-def error.
So when removing/re-inserting an instruction I in SinkAfter, we should also
recursively remove instructions targeting I and re-insert them into SinkAfter.
But for simplicity I just bail out in this case.
Differential Revision: https://reviews.llvm.org/D134083
Currently, AArch64 doesn't support vectorization of non-temporal loads because `isLegalNTLoad` is not implemented for the target.
This patch applies similar functionality to `D73158`, but for non-temporal loads.
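A sketch of the kind of loop this enables vectorizing (my own example; the
non-temporal hint is expressed with !nontemporal metadata):

```
define i32 @sum_nontemporal(ptr %src, i64 %n) {
entry:
  br label %loop

loop:
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
  %sum = phi i32 [ 0, %entry ], [ %sum.next, %loop ]
  %addr = getelementptr inbounds i32, ptr %src, i64 %iv
  %v = load i32, ptr %addr, align 4, !nontemporal !0
  %sum.next = add i32 %sum, %v
  %iv.next = add nuw i64 %iv, 1
  %done = icmp eq i64 %iv.next, %n
  br i1 %done, label %exit, label %loop

exit:
  ret i32 %sum.next
}

!0 = !{i32 1}
```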
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D131964
At the moment, LoopAccessAnalysis is a loop analysis for the new pass
manager. The issue with that is that LAI caches SCEV expressions, and
modifications in a loop may impact SCEV expressions in other loops, but
we do not have a convenient way to invalidate LAI for other loops
within a loop pipeline.
To avoid this issue, turn it into a function analysis which returns a
manager object that keeps track of the individual LAI objects per loop.
Fixes #50940.
Fixes #51669.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D134606
Move LCSSA fixup from ::expandCodeForImpl to ::expand(). This has
the advantage that we directly preserve LCSSA nodes here instead of
relying on doing so in rememberInstruction. It also ensures that we
don't add the non-LCSSA-safe value to InsertedExpressions.
Alternative to D132704.
Fixes #57000.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D134739
Fixes #57572.
Generally the LICM pass is responsible for moving code that calculates an
invariant address out of the loop, as it only needs to be calculated once.
But in the rare case where that does not happen, we will not vectorize the
loop.
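A minimal sketch of the pattern (hypothetical example, not the attached test):
the store address is loop-invariant, but its computation sits inside the loop
because LICM did not move it out:

```
define void @invariant_address_in_loop(ptr %base, i64 %off, i64 %n) {
entry:
  br label %loop

loop:
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
  %addr = getelementptr inbounds i32, ptr %base, i64 %off ; loop-invariant address
  %t = trunc i64 %iv to i32
  store i32 %t, ptr %addr, align 4
  %iv.next = add nuw i64 %iv, 1
  %done = icmp eq i64 %iv.next, %n
  br i1 %done, label %exit, label %loop

exit:
  ret void
}
```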
Differential Revision: https://reviews.llvm.org/D133687
Follow-up to D133580; adjust the cost model to prefer uniform store lowering for scalable stores which are unpredicated.
The impact here isn't in the uniform store lowering quality itself. InstCombine happily converts the scatter form into the single store form. The main impact is in letting the rest of the cost model make choices based on the knowledge that the vector will be scalarized on use.
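For reference, a minimal sketch of an unpredicated uniform store (my own
example): every iteration writes the same loop-invariant address, so only the
value from the last iteration matters, and a single scalar store is preferable
to a scatter of identical addresses:

```
define void @uniform_store(ptr %dst, ptr %src, i64 %n) {
entry:
  br label %loop

loop:
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
  %addr = getelementptr inbounds i32, ptr %src, i64 %iv
  %v = load i32, ptr %addr, align 4
  store i32 %v, ptr %dst, align 4 ; same destination every iteration
  %iv.next = add nuw i64 %iv, 1
  %done = icmp eq i64 %iv.next, %n
  br i1 %done, label %exit, label %loop

exit:
  ret void
}
```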
Differential Revision: https://reviews.llvm.org/D134460