Commit Graph

1720 Commits

Author SHA1 Message Date
Florian Hahn 973183612e
[VPlan] Add test for VPExpandSCEVRecipe printing. 2022-03-20 10:11:40 +00:00
Florian Hahn d5fbcf76fd
[VPlan] Improve pattern in vplan-printing.ll check line.
The existing pattern only matched a single value, which breaks if the
numbering slightly changes.
2022-03-19 16:03:25 +00:00
Andrew Wei 0af3e6a22d [InstCombine] Sink instructions with multiple users in a successor block.
This patch tries to sink instructions when they are only used in a successor block.

This is a further enhancement patch based on Anna's commit:
D109700, which allows sinking an instruction having multiple uses in a single user.

In this patch, sink instructions with multiple users in a single successor block will be supported.
It could fix a known issue from rust:
  https://github.com/rust-lang/rust/issues/51346#issuecomment-394443610

Reviewed By: nikic, reames

Differential Revision: https://reviews.llvm.org/D121585
2022-03-18 11:53:45 +08:00
Florian Hahn 151c144350
[LV] Use usesScalars in widenPHIInstruction.
This uses the existing VPlan helpers to check whether there are scalar
uses of a phi recipe. It remove one of the few remaining dependencies on
the cost model from VPlan code generation.

Depends on D121612.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D121613
2022-03-17 13:16:32 +00:00
Malhar Jajoo a36d269658 [VPlan] Avoid collecting scalars for SVE
This patch ensures scalars (except for uniforms) are no
longer collected (prior to LVP planning phase) for
scalable vectorization.

This is to avoid the chances of generating scalarized
instructions later (during LVP execute phase) as they
are not supported for scalable vectorization.

Relevant test has also been added.

Differential Revision: https://reviews.llvm.org/D121452
2022-03-16 16:33:34 +00:00
Florian Hahn 5c4d64eb0d
[LV] Make reduction-order.ll test independent of instruction naming.
Also update test to not use branch on undef.
2022-03-15 11:13:18 +00:00
Florian Hahn 4a0481e981
[LV] Check for users of truncated IVs, add more detailed comment.
Add missing outside user check for truncated IVs. Also hoist the code in
the helper with additional explanations.

Fixes #54370.
2022-03-14 19:39:30 +00:00
Florian Hahn 1c0fc1f074
[VPlan] Ensure each iv user is only visited once in transform.
If a recipe has multiple uses of an IV, we crash. It causes a crash when
building llvm-test-suite.

Exposed by 95f76bff1c.
2022-03-13 21:42:17 +00:00
Florian Hahn 95f76bff1c
[LV] Create & use VPScalarIVSteps for all scalar users.
This patch is a follow-up to D115953. It updates optimizeInductions
to also introduce new VPScalarIVStepsRecipes if an IV has both vector
and scalar uses.

It updates all uses that only need scalar values to use the newly
created recipe for the scalar steps.

This completes untangling of VPWidenIntOrFpInductionRecipe
code-generation. Now the recipe *only* creates the widened vector
values, as it says on the tin.

The code to genereate IR has been moved directly to
VPWidenIntOrFpInductionRecipe::execute.

Note that the recipe has been updated to hold a reference to
ScalarEvolution, which is needed to expand the step, until we can place
the corresponding SCEV expansion in the pre-header.

Depends on D120827.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D120828
2022-03-13 17:15:24 +00:00
Sanjay Patel b48fe158e0 [Analysis] remove bogus smin/smax pattern detection
This is a revert of cfcc42bdc. The analysis is wrong as shown by
the minimal tests for instcombine:
https://alive2.llvm.org/ce/z/y9Dp8A

There may be a way to salvage some of the other tests,
but that can be done as follow-ups. This avoids a miscompile
and fixes #54311.
2022-03-09 17:50:34 -05:00
Florian Hahn a12403cfea
[LV] Do not consider instrs dead if used by phi that's not in plan.
Single value phis won't be modeled in VPlan. If the phi only gets used
outside the loop, the current code misses the fact that the incoming
value is not dead. Update the code to also look through such phis to
check for outside users.

Fixes #54266
2022-03-09 16:04:44 +00:00
Florian Hahn a2979c8399
[IVDescriptors] Bail out instead of asserting that order is expected.
When dealing with multiple phis that depend on each other, the order
might have been changed and may not match the expectation. If that
happens, bail out, rather than asserting.

Fixes https://github.com/llvm/llvm-project/issues/54218
Fixes https://github.com/llvm/llvm-project/issues/54233
Fixes https://github.com/llvm/llvm-project/issues/54254
2022-03-07 19:57:26 +00:00
Florian Hahn f4368487aa
[LV] Add test from PR54227.
Test from https://github.com/llvm/llvm-project/issues/54227.

The underlying issue has already been fixed in de8ac48 with a separate
test.
2022-03-07 17:01:22 +00:00
Roman Lebedev 2f80ea7f4f
[NFC][LV] Use different braces in debug output
The analysis passes output function name encapsulated in `'` braces,
but LV uses `"`. Harmonizing this may help in creating an update script
for the LV costmodel test checks.

Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D121105
2022-03-07 19:32:37 +03:00
Florian Hahn de8ac485e5
[IVDescriptor] Remove SinkCandidate from SinkAfter before re-sinking.
This ensures the right order in the sink-after map is maintained. If we
re-sink an instruction, it must be sunk after all earlier instructions
have been sunk.

Fixes https://github.com/llvm/llvm-project/issues/54223
2022-03-05 19:48:26 +00:00
Florian Hahn 5a60260efe
[IVDescriptor] Use DT to check order of Previous, OtherPrev.
Previous and OhterPrev may not be in the same block. Use DT::dominates
instead of local comesBefore. DT::dominates is already used earlier to
check the order of Previous and SinkCandidate.

Fixes https://github.com/llvm/llvm-project/issues/54195
2022-03-04 11:07:42 +00:00
Florian Hahn 139215af8e
[IVDescriptor] Find original 'Previous' for first-order recurrences.
This patch extends first-order recurrence handling to support cases
where we already sunk an instruction for a different recurrence, but
LastPrev comes before Previous.

To handle those cases correctly, we need to find the earliest entry for
the sink-after chain, because this is references the Previous from the
original recurrence. This is needed to ensure we use the correct
instruction as sink point.

Depends on D118558.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D118642
2022-03-03 16:41:26 +00:00
Florian Hahn 8777cb66a8
[VPlan] Remove reliance on underlying instr for ScalarIVSteps (NFCI).
Instead of relying on underlying instructions, this patch updates
VPScalarIVStepsRecipe to only store the required type information.

This removes access to unrelated information, as well as avoiding issues
with the same underlying instruction being shared by multiple recipes.

This change should only change the debug output and not cause any
codegen changes, hence NFCI.
2022-03-02 16:23:19 +00:00
Florian Hahn 6dc456a375
[LV] Remove redundant check line from recurrence test.
The removed line matches the previous line, modulo the check prefix.
There is no way to disable sinking instructions as required due to
first-order recurrence and removing the line should be safe.
2022-03-02 13:48:46 +00:00
Florian Hahn 83fd2071f0
[LV] Modernize test matching hardcoded induction phi name. 2022-03-02 10:12:38 +00:00
Florian Hahn 470b5c7f0d
[LV] Add test with multiple use of a FOR chained together.
Additional test coverage for D118642.
2022-03-01 14:18:23 +00:00
Nikita Popov 26748bb15a [InstCombine] Slightly relax one-use check in abs canonicalization
Treat the icmp and sub symmetrically, and require that one of them
has one use, not the icmp in particular. This could be further
relaxed in the abs (but not nabs) case to not check one-use at
all.
2022-03-01 15:06:41 +01:00
Nikita Popov 7c080e4649 [LoopVectorize] Regenerate test checks (NFC) 2022-03-01 15:01:14 +01:00
Andrei Elovikov 6e9a8cdcfb [NFC][LoopVectorizer] Simplify LoopVectorize/X86/gather_scatter.ll
The test used to run whole O3 pipeline. Modify it to contain LLVM IR right
before LV and limit passes to "-loop-vectorizer -simplifycfg".

For the RUN line with forced VF force interleave factor as well to simplify
CHECKs as interleaving isn't related to the purpose of the test.

I also tried to add "noalias" to pointer arguments in
@test_gather_not_profitable_pr48429 but LAI seems unable to use them.

Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D119786
2022-02-28 11:12:50 -08:00
Florian Hahn b3e8ace198
Recommit "[VPlan] Introduce recipe to build scalar steps."
This reverts the revert commit ff93260bf6.

The underlying issue causing the PPC bot failures has been fixed in
cbaac14734 and a corresponding test case has been added in
ad2cad1c52.

Original message:

    This patch adds a new VPScalarIVStepsRecipe to handle building scalar
    steps.

    In the first patch, it only handles the case where there is no vector
    induction variable needed.

    Reviewed By: Ayal

    Differential Revision: https://reviews.llvm.org/D115953
2022-02-28 14:12:20 +00:00
Florian Hahn cbaac14734
[LV] Remove induction recipes only used outside vector loop.
Exit values of vector inductions are generated completely independent of
the induction recipes. Consider them for removal, if they are not used
in loop.

This fixes a crash exposed by 49b23f451c.
2022-02-28 11:14:22 +00:00
Florian Hahn 8bbc5e172a
[LV] Add test with dead induction in vector loop used outside.
Add test with a induction phi that is not used in the vector loop, but
by an lcssa phi in the loop exit.
2022-02-28 10:39:08 +00:00
Florian Hahn ad2cad1c52
[LV] Add test with IV that needs scalar steps and user outside of loop.
Also add a run line to check interleaving only. This test covers the PPC
buildbot failures caused by 49b23f451c.
2022-02-28 09:46:18 +00:00
Florian Hahn ff93260bf6
Revert "[VPlan] Introduce recipe to build scalar steps."
This reverts commit 49b23f451c.

This appears to break some PPC build bots. Revert while I investigate.
2022-02-27 17:51:19 +00:00
Florian Hahn 49b23f451c
[VPlan] Introduce recipe to build scalar steps.
This patch adds a new VPScalarIVStepsRecipe to handle building scalar
steps.

In the first patch, it only handles the case where there is no vector
induction variable needed.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D115953
2022-02-27 17:32:41 +00:00
Florian Hahn da740492b0
[VPlan] Remove dead header-phi recipes.
This patch adds a new transform to remove dead recipes. For now, it only
removes dead recipes in the header, to keep the number tests that require
updating manageable. Future patches will extend this to remove dead
recipes across the whole plan.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D118051
2022-02-26 16:26:39 +00:00
Florian Hahn 462cd9270c
[LV] Add test with redundant cast in separate latch block.
Adds another interesting test for D118051.
2022-02-26 14:52:55 +00:00
Nikita Popov a266af7211 [InstCombine] Canonicalize SPF to min/max intrinsics
Now that integer min/max intrinsics have good support in both
InstCombine and other passes, start canonicalizing SPF min/max
to intrinsic min/max.

Once this sticks, we can stop matching SPF min/max in various
places, and can remove hacks we have for preventing infinite loops
and breaking of SPF canonicalization.

Differential Revision: https://reviews.llvm.org/D98152
2022-02-24 09:01:20 +01:00
Malhar Jajoo 9f1c6fbf11 [LAA] Add remarks for unbounded array access
Adds new optimization remarks when loop vectorization fails due to
the compiler being unable to find bound of an array access inside
a loop

Differential Revision: https://reviews.llvm.org/D115873
2022-02-23 15:57:39 +00:00
Kerry McLaughlin 12fb133eba [LoopVectorize] Support conditional in-loop vector reductions
Extends getReductionOpChain to look through Phis which may be part of
the reduction chain. adjustRecipesForReductions will now also create a
CondOp for VPReductionRecipe if the block is predicated and not only if
foldTailByMasking is true.

Changes were required in tryToBlend to ensure that we don't attempt
to convert the reduction Phi into a select by returning a VPBlendRecipe.
The VPReductionRecipe will create a select between the Phi and the reduction.

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D117580
2022-02-22 12:04:35 +00:00
Florian Hahn 5c7ae10cec
[LV] Add store to test to make sure the loop is not dead.
Add an extra store to the test, to make sure the operations in the loop
cannot be optimized away after D118051.
2022-02-20 15:05:29 +00:00
zhongyunde b2f5164deb [IVDescriptors] Support FOR where we have multiple sink pointed
Handles the case where Previous doesn't come before LastPrev incorrectly.
Fix https://github.com/llvm/llvm-project/issues/53483

Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D118558
2022-02-14 09:30:35 +08:00
Florian Hahn d462e64754
[LV] Drop noalias from check lines from test (NFC).
The noalias metadata checks re not really relevant for the test and
slight changes to metadata numbering can have large knock-on effects
causing large noise in test diff.
2022-02-13 11:36:54 +00:00
Florian Hahn 446e7c64c7
[LV] Add real uses in some tests, to make them more robust.
Add real uses to some tests, to ensure dead instructions cannot be directly
removed.
2022-02-13 09:52:59 +00:00
Florian Hahn 9474c3009e
[LV] Move unrelated tests from first-order-recurrence-chains.ll 2022-02-11 09:15:42 +00:00
Florian Hahn f97795121f
[LV] Add tests with chained first-order recurrences. 2022-02-10 15:55:19 +00:00
Simon Pilgrim 4517488eb7 [LoopVectorize] Regenerate reduction-predselect.ll test checks 2022-02-10 12:03:10 +00:00
David Green b55d4c2ad8 Revert "[LV] Remove `LoopVectorizationCostModel::useEmulatedMaskMemRefHack()`"
This reverts commit 77a0da926c as we've
received multiple reports of this significantly impacting performance,
in ways that don't seem to just be target specific cost models going
wrong. I would offer some reproducers, but the test changes here seem to
be full of them!

Reverting for now and hopefully we can remove the "hack" more carefully
as we go.
2022-02-09 20:02:54 +00:00
David Green b4c6d1bb37 [LoopVectorizer] Don't perform interleaving of predicated scalar loops
The vectorizer will choose at times to "vectorize" loops with a scalar
factor (VF=1) with interleaving (IC > 1). This can occasionally produce
better code than the unroller (notable for reductions where it can
produce independent reduction chains that are combined after the loop).
At times this is not very beneficial though, for example when runtime
checks are needed or when the scalar code requires predication.

This addresses the second point, preventing the vectorizer from
interleaving when the scalar loop will require predication. This
prevents it from making a bit of a mess, that is worse than the original
and better left for the unroller to unroll if beneficial. It helps
reverse some of the regressions from D118090.

Differential Revision: https://reviews.llvm.org/D118566
2022-02-07 19:34:28 +00:00
Florian Hahn 1049735d07
[LV] Adjust accesses in test to ensure full RT checks are generated.
Add an additional access so the full runtime checks are still generated,
even after D119078.
2022-02-07 18:07:19 +00:00
Roman Lebedev 77a0da926c
[LV] Remove `LoopVectorizationCostModel::useEmulatedMaskMemRefHack()`
D43208 extracted `useEmulatedMaskMemRefHack()` from legality into cost model.
What it essentially does is prevents scalarized vectorization of masked memory operations:
```
  // TODO: Cost model for emulated masked load/store is completely
  // broken. This hack guides the cost model to use an artificially
  // high enough value to practically disable vectorization with such
  // operations, except where previously deployed legality hack allowed
  // using very low cost values. This is to avoid regressions coming simply
  // from moving "masked load/store" check from legality to cost model.
  // Masked Load/Gather emulation was previously never allowed.
  // Limited number of Masked Store/Scatter emulation was allowed.
```

While i don't really understand about what specifically `is completely broken`
was talking about, i believe that at least on X86 with AVX2-or-later,
this is no longer true. (or at least, i would like to know what is still broken).
So i would like to follow suit after D111460, and like wise disable that hack for AVX2+.

But since this was added for X86 specifically, let's just instead completely remove this hack.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D114779
2022-02-07 16:08:31 +03:00
Florian Hahn ef4df27940
[LV] Modernize some runtime check tests a bit.
Update tests to check runtime checks a bit more precisely.
2022-02-07 12:08:56 +00:00
Sander de Smalen eaee477eda [LV] Use VScaleForTuning to allow wider epilogue VFs.
When the main loop is e.g. VF=vscale x 1 and the epilogue VF cannot
be any smaller, the vectorizer should try to estimate how many lanes are
executed at runtime and allow a suitable fixed-width VF to be chosen. It
can use VScaleForTuning to figure out what a suitable fixed-width VF could
be. For the case where the main loop VF is VF=vscale x 1, and VScaleForTuning=8,
it could still choose an epilogue VF upto VF=4.

This was a bit tricky to test, so this patch also introduces a wrapper
function to get 'VScaleForTuning' by also considering vscale_range.
If min and max are equal, then that will be the vscale we compile for.
It makes little sense to tune for a different width if the code
will not be portable for other widths.

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D118709
2022-02-03 15:40:17 +00:00
Malhar Jajoo 778b455dd6 [LAA] Add Memory dependence remarks.
Adds new optimization remarks when vectorization fails.

More specifically, new remarks are added for following 4 cases:

- Backward dependency
- Backward dependency that prevents Store-to-load forwarding
- Forward dependency that prevents Store-to-load forwarding
- Unknown dependency

It is important to note that only one of the sources
of failures (to vectorize) is reported by the remarks.
This source of failure may not be first in program order.

A regression test has been added to test the following cases:

a) Loop can be vectorized: No optimization remark is emitted
b) Loop can not be vectorized: In this case an optimization
remark will be emitted for one source of failure.

Reviewed By: sdesmalen, david-arm

Differential Revision: https://reviews.llvm.org/D108371
2022-02-02 12:07:51 +00:00
Sander de Smalen 2a44eaf20f [LV] Allow a scalable VF for the epilogue.
For some reason we limited the epilogue VF to be fixed-width, but there
is not necessarily a reason for doing so. If the main VF=vscale x 16, the
epilogue VF could be either fixed-width, or a scalable VF upto vscale x 8.

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D118688
2022-02-01 22:38:55 +00:00
David Green aaa16eb023 [LV][AArch64] Add test for scalar interleaving with predication. NFC 2022-02-01 09:21:49 +00:00
Florian Hahn 02ee3fbff8
[LV] Add additional complex first order recurrence test.
Add a new test case with 2 first-order recurrences, which share a user.
2022-01-31 19:54:14 +00:00
Florian Hahn 8f12175fed
[VPlan] Use VPlan to check if only the first lane is used.
This removes the remaining dependence on LoopVectorizationCostModel from
buildScalarSteps and is required so it can be moved out of ILV.

It also improves allows us to remove a few unneeded instructions.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D116554
2022-01-30 13:07:29 +00:00
Florian Hahn efd4938723
[VPlan] Handle IV vector splat using VPWidenCanonicalIV.
This patch tries to use an existing VPWidenCanonicalIVRecipe
instead of creating another step-vector for canonical
induction recipes in widenIntOrFpInduction.

This has the following benefits:

 1. First step to avoid setting both vector and scalar values for the
    same induction def.
 2. Reducing complexity of widenIntOrFpInduction through making things
    more explicit in VPlan
 3. Only need to splat the vector IV for block in masks.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D116123
2022-01-29 16:25:27 +00:00
Malhar Jajoo b75bdff4a0 Trivial update for debug location in LIT test.
This just updates debug location of a loop in a LIT test to point
to the correct source line.
2022-01-27 19:07:47 +00:00
Congzhe Cao f3e1f44340 [IVDescriptor] Get the exact FP instruction that does not allow reordering
This is a bugfix in IVDescriptor.cpp.

The helper function `RecurrenceDescriptor::getExactFPMathInst()`
is supposed to return the 1st FP instruction that does not allow
reordering. However, when constructing the RecurrenceDescriptor,
we trace the use-def chain staring from a PHI node and for each
instruction in the use-def chain, its descriptor overrides the
previous one. Therefore in the final RecurrenceDescriptor we
constructed, we lose previous FP instructions that does not allow
reordering.

Reviewed By: kmclaughlin

Differential Revision: https://reviews.llvm.org/D118073
2022-01-27 00:33:46 -05:00
Igor Kirillov d3932c690d [LoopVectorize] Add tests with reductions that are stored in invariant address
This patch adds tests for functionality that is to be implemented in D110235.

Differential Revision: https://reviews.llvm.org/D117213
2022-01-24 21:26:38 +00:00
Florian Hahn b2a8eff45c
[LV] Make some tests more robust by adding missing users. 2022-01-24 13:04:09 +00:00
Florian Hahn b7f69b8d46
[LV] Name values and blocks in same induction tests (NFC).
This reduces the churn in the test in future updates due to numbering
changes.
2022-01-24 12:28:43 +00:00
Kerry McLaughlin 8082ab2fc3 [LoopVectorize] Support epilogue vectorisation of loops with reductions
isCandidateForEpilogueVectorization will currently return false for loops
which contain reductions. This patch removes this restriction and makes
the following changes to support epilogue vectorisation with reductions:

- `fixReduction`: If fixReduction is being called during vectorisation of the
    epilogue, the phi node it creates will need to additionally carry incoming
     values from the middle block of the main loop.

- `createEpilogueVectorizedLoopSkeleton`: The incoming values of the phi
    created by fixReduction are updated after the vec.epilog.iter.check block
    is added. The phi is also moved to the preheader of the epilogue.

- `processLoop`: The start value of any VPReductionPHIRecipes are updated before
    vectorising the epilogue loop. The getResumeInstr function added to the ILV
    will return the resume instruction associated with the recurrence descriptor.

Reviewed By: sdesmalen

Differential Revision: https://reviews.llvm.org/D116928
2022-01-24 12:03:31 +00:00
eopXD 3cf15af2da [RISCV] Remove experimental prefix from rvv-related extensions.
Extensions affected: +v, +zve*, +zvl*

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D117860
2022-01-22 20:18:40 -08:00
Kerry McLaughlin c740a07863 [LoopVectorize] Test in-loop reductions with tail folding for scalable vectors
Adds `-prefer-inloop-reductions` to the RUN line of sve-tail-folding.ll & adds
a new test where in-loop reductions cannot be used (`@cond_xor_reduction`). NFC.

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D117578
2022-01-19 14:36:23 +00:00
David Sherwood e781620dee [LoopVectorize][AArch64] Use get.active.lane.mask intrinsic when SVE is enabled
When SVE is enabled for AArch64 targets it makes more sense to use the
get.active.lane.mask intrinsic, because SVE has an exact 1-1 mapping
from the intrinsic to the 'whilelo' instruction for legal vector types.
This instruction neatly takes overflow into account as well. This patch
fixes an issue in VPInstruction::generateInstruction that assumed we are
only dealing with fixed-width vectors.

Differential Revision: https://reviews.llvm.org/D117109
2022-01-18 11:59:30 +00:00
Florian Hahn 524150fe07
[LV] Add test coverage for reductions with odd interleave counts.
Add test coverage for loops with reductions and odd (3, 5) interleave
counts.
2022-01-17 14:34:21 +00:00
Florian Hahn 4a6f475446
[LV] Make test more robust by adding users of inductions.
The modified tests didn't have actual users of all inductions, making it
trivial to eliminate them. Add users to make sure the inductions are
actually used in the vectorized version.
2022-01-17 13:28:59 +00:00
Kito Cheng cc35161dc7 [RISCV] Add initial support for getRegUsageForType and getNumberOfRegisters
Those two TTI hooks are used during vectorization for calculating
register pressure, the default implementation isn't consider for LMUL,
and that's also definitly wrong value for register number (all register class
are 8 registers).

So in this patch we tried to:

1. Calculate right register usage for vector type and scalar type.
2. Return right number of register for general purpose register and
   vector register.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D116890
2022-01-17 15:27:54 +08:00
Florian Hahn 070d1034da
[LV] Restore metadata to disable runtime unrolling for epilogue loop.
After d4a8fc3a87 LV stopped adding metadata to disable runtime
unrolling to the vectorized epilogue loop. This was missed because
278aa65cc4 removed the relevant test coverage.

This patch fixes that by adding the relevant metadata after
vector loop generation.
2022-01-16 13:14:16 +00:00
Florian Hahn ba3198cfd1
[IRBuilder] Migrate select-folding to value-based FoldSelect.
Reviewed By: lebedev.ri

Differential Revision: https://reviews.llvm.org/D117228
2022-01-15 11:26:44 +00:00
Florian Hahn 42b34facfd
Recommit "[LV] Inline CreateSplatIV call for scalar VFs."
This reverts the revert commit 073c27b5e5.

A reduced test case has been added in 5e4966cbae and the code has
been updated to handle the case where getInductionOpcode returns
BinaryOpsEnd. In this case, the original code was always using
Instruction::Add. Do the same in the patch.

Note this commit may slightly change the value naming, because it now
also assigns the 'induction' name in the floating point case.
2022-01-14 19:03:49 +00:00
Florian Hahn 5e4966cbae
[LV] Add test with an integer induction based on a ptr one.
Reduced test case from the reproducer mentioned in
073c27b5e5.
2022-01-14 15:56:47 +00:00
James Y Knight 073c27b5e5 Revert "[LV] Inline CreateSplatIV call for scalar VFs (NFC)."
Causes a crash with the following (creduce'd) test-case:

clang -O3 '--target=aarch64-grtev4-linux-gnu' -xc - -c -o /dev/null <<EOF
int *e;
int f;
int g() {
  int h;
  int *j = 0;
  while (&f - j > 0) {
    int k;
    k = j;
    if (e == j && *e)
      k = 5;
    h = k;
    j++;
  }
  return h;
}
EOF

This reverts commit 7ce48be0fd.
2022-01-14 00:00:02 +00:00
Florian Hahn 7b9f5cbfa7
[LV] Extend check lines for pr34681.ll to cover foldable select. 2022-01-13 16:42:47 +00:00
Florian Hahn 3f2fb767e3
[VPlan] Make IV operand explicit for VPWidenCanonicalIVRecipe (NFC).
This makes the def-use relationship between VPCanonicalIVPHIRecipe and
VPWidenCanonicalIVRecipe explicit. Needed for D117140.
2022-01-13 11:13:05 +00:00
Florian Hahn 7ce48be0fd
[LV] Inline CreateSplatIV call for scalar VFs (NFC).
This is a NFC change split off from D116123, as suggested there.
D116123 will remove the last user of CreateSplatIV.
2022-01-13 09:34:31 +00:00
Florian Hahn d4a8fc3a87
[VPlan] Introduce and use BranchOnCount VPInstruction.
This patch adds a new BranchOnCount VPInstruction opcode with 2
operands. It first compares its 2 operands (increment of canonical
induction and vector trip count), followed by a branch to either the
exit block or back to the vector header.

It must be the last recipe in the exit block of the topmost vector loop
region.

This extracts parts from D113224 and was discussed in D113223.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D116479
2022-01-12 13:42:13 +00:00
Rosie Sumpter 552eb372cb [LoopVectorize] Pass a vector type to isLegalMaskedGather/Scatter
This is required to query the legality more precisely in the LoopVectorizer.

This adds another TTI function named 'forceScalarizeMaskedGather/Scatter'
function to work around the hack introduced for MVE, where
isLegalMaskedGather/Scatter would return an answer by second-guessing
where the function was called from, based on the Type passed in (vector
vs scalar). The new interface makes this explicit. It is also used by
X86 to check for vector widths where gather/scatters aren't profitable
(or don't exist) for certain subtargets.

Differential Revision: https://reviews.llvm.org/D115329
2022-01-12 13:34:12 +00:00
Florian Hahn 138fcc5f76
[IRBuilder] Migrate icmp-folding to value-based FoldICmp.
Depends on D116935.

Reviewed By: nikic, lebedev.ri

Differential Revision: https://reviews.llvm.org/D116969
2022-01-12 12:37:46 +00:00
Florian Hahn 7e68061305
[IRBuilder] Migrate add-folding to value-based FoldAdd.
Depends on D116935.

Reviewed By: nikic

Differential Revision: https://reviews.llvm.org/D116968
2022-01-12 09:24:46 +00:00
Florian Hahn f0ef1ea6dd
[IRBuilder] Introduce folder using inst-simplify, use for Or fold.
Alternative to D116817.

This introduces a new value-based folding interface for Or (FoldOr),
which takes 2 values and returns an existing Value or a constant if the
Or can be simplified. Otherwise nullptr is returned. This replaces the
more restrictive CreateOr which takes 2 constants.

This is the used to implement a folder that uses InstructionSimplify.
The logic to simplify `Or` instructions is moved there. Subsequent
patches are going to transition other CreateXXX to the more general
FoldXXX interface.

Reviewed By: nikic, lebedev.ri

Differential Revision: https://reviews.llvm.org/D116935
2022-01-11 17:30:48 +00:00
David Sherwood b0922a9dcd [LoopVectorize] Make VPWidenCanonicalIVRecipe::execute work for scalable vectors
The code in VPWidenCanonicalIVRecipe::execute only worked for fixed-width
vectors due to the way we generate the values per lane. This patch changes
the code to use a combination of vector splats and step vectors to get
the same result. This then works for both fixed-width and scalable vectors.

Tests that exercise this code path for scalable vectors have been added here:

  Transforms/LoopVectorize/AArch64/sve-tail-folding.ll

Differential Revision: https://reviews.llvm.org/D113180
2022-01-10 14:12:32 +00:00
Florian Hahn aecad5828e
[SCEVExpander] Only create trunc when needed.
9345ab3a45 updated generateOverflowCheck to skip creating checks that
always evaluate to false. This in turn means that we only need to
create TruncTripCount if it is actually used.

Sink the TruncTripCount creating into ComputeEndCheck, so it is only
created when there's an actual check.
2022-01-10 11:31:27 +00:00
David Sherwood e3c84fb948 [LoopVectorize] Add support for tail folding using scalable vectors
This patch fixes up an issue with InnerLoopVectorizer::getOrCreateVectorTripCount
whereby we weren't correctly generating the runtime trip count
for scalable vectors when tail-folding.

It also removes some asserts in the tail-folding path for cases when
the VF is not scalable.

In this patch I have only permitted tail-folding to be enabled
explicitly for scalable vectors when the user has specified one
of the following flags:

  -prefer-predicate-over-epilogue=predicate-dont-vectorize
  -prefer-predicate-over-epilogue=predicate-else-scalar-epilogue

For now it's best not to enable tail-folding with scalable vectors for
low trip counts or when optimising for code size, since there has been
no analysis on whether this is worth it.

Various tests have been added here:

  Transforms/LoopVectorize/AArch64/sve-tail-folding.ll
  Transforms/LoopVectorize/AArch64/sve-tail-folding-forced.ll

The tests cannot be target independent because they require masked
load/store support, i.e. TTI.isLegalMaskedLoad and TTI.isLegalMaskedStore
need to return true.

Differential Revision: https://reviews.llvm.org/D113003
2022-01-10 10:55:40 +00:00
Florian Hahn 7f1bf68d7d
[SCEVExpander] Only check overflow if it is needed.
9345ab3a45 updated generateOverflowCheck to skip creating checks that
always evaluate to false. This in turn means that we only need to check
for overflows if the result of the multiplication is actually used.

Sink the Or for the overflow check into ComputeEndCheck, so it is only
created when there's an actual check.
2022-01-09 12:55:41 +00:00
Florian Hahn 3b7b1a75b0
[LV] Improve check lines in existing tests.
Update the check lines in 2 existing tests to use patterns + variables
to match some IR to make them independent of value naming.
2022-01-08 20:46:31 +00:00
Florian Hahn daa5e26312
[LV] Make tests more robust by removing undef.
Replace some uses of undef in the tests. The undef causes runtime checks
to be trivially fold/removeable, which does defeat the purpose of the tests.
2022-01-08 15:21:57 +00:00
Florian Hahn 9345ab3a45
[SCEVExpander] Skip creating <u 0 check, which is always false.
Unsigned compares of the form <u 0 are always false. Do not create such
a redundant check in generateOverflowCheck.

The patch introduces a new lambda to create the check, so we can
exit early conveniently and skip creating some instructions feeding the
check.

I am planning to sink a few additional instructions as follow-ups, but I
would prefer to do this separately, to keep the changes and diff
smaller.

Reviewed By: reames

Differential Revision: https://reviews.llvm.org/D116811
2022-01-08 10:31:04 +00:00
Craig Topper 042394b69e [RISCV] Add a command line option to control the LMUL used by TTI's getRegisterBitWidth.
By default we return the width of an LMUL=1 register. We can enable
testing with larger LMUL values by returning a larger bit width.

This patch adds a RISCV specific option to provide a LMUL which will be
multiplied by the LMUL=1 bit width.

Reviewed By: kito-cheng

Differential Revision: https://reviews.llvm.org/D116339
2022-01-07 20:02:10 -08:00
David Green bc615e436c [AArch64] Update addo and subo costs
Similar to D116732, this adds basic scalar sadd_with_overflow,
uadd_with_overflow, ssub_with_overflow and usub_with_overflow costs for
aarch64, which are usually quite efficiently lowered.

Differential Revision: https://reviews.llvm.org/D116734
2022-01-07 16:20:23 +00:00
Florian Hahn f395a4f8d5
[SCEVExpand] Only create required predicate checks.
Currently generateOverflowCheck always creates code for Step being
negative and positive, followed by a select at the end depending on
Step's sign.

This patch updates the code to only create either the checks for step
being positive or negative, if the sign is known.

Follow-up to D116696.

Reviewed By: reames

Differential Revision: https://reviews.llvm.org/D116747
2022-01-07 14:49:02 +00:00
Florian Hahn 86d113a8b8
[SCEVExpand] Do not create redundant 'or false' for pred expansion.
This patch updates SCEVExpander::expandUnionPredicate to not create
redundant 'or false, x' instructions. While those are trivially
foldable, they can be easily avoided and hinder code that checks the
size/cost of the generated checks before further folds.

I am planning on look into a few other similar improvements to code
generated by SCEVExpander.

I remember a while ago @lebedev.ri working on doing some trivial folds
like that in IRBuilder itself, but there where concerns that such
changes may subtly break existing code.

Reviewed By: reames, lebedev.ri

Differential Revision: https://reviews.llvm.org/D116696
2022-01-06 11:52:19 +00:00
Sander de Smalen 95a93722db [LV] Remove what seems like stale code in collectElementTypesForWidening.
This was originally added in rG22174f5d5af1eb15b376c6d49e7925cbb7cca6be
although that patch doesn't really mention any reasons for ignoring the
pointer type in this calculation if the memory access isn't consecutive.

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D115356
2022-01-05 12:20:59 +00:00
Florian Hahn 65c4d6191f
[VPlan] Add VPCanonicalIVPHIRecipe, partly retire createInductionVariable.
At the moment, the primary induction variable for the vector loop is
created as part of the skeleton creation. This is tied to creating the
vector loop latch outside of VPlan. This prevents from modeling the
*whole* vector loop in VPlan, which in turn is required to model
preheader and exit blocks in VPlan as well.

This patch introduces a new recipe VPCanonicalIVPHIRecipe to represent the
primary IV in VPlan and CanonicalIVIncrement{NUW} opcodes for
VPInstruction to model the increment.

This allows us to partly retire createInductionVariable. At the moment,
a bit of patching up is done after executing all blocks in the plan.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D113223
2022-01-05 10:46:06 +00:00
Rosie Sumpter 961f51fdf0 [LoopVectorize][CostModel] Choose smaller VFs for in-loop reductions without loads/stores
For loops that contain in-loop reductions but no loads or stores, large
VFs are chosen because LoopVectorizationCostModel::getSmallestAndWidestTypes
has no element types to check through and so returns the default widths
(-1U for the smallest and 8 for the widest). This results in the widest
VF being chosen for the following example,

float s = 0;
for (int i = 0; i < N; ++i)
  s += (float) i*i;

which, for more computationally intensive loops, leads to large loop
sizes when the operations end up being scalarized.

In this patch, for the case where ElementTypesInLoop is empty, the widest
type is determined by finding the smallest type used by recurrences in
the loop instead of falling back to a default value of 8 bits. This
results in the cost model choosing a more sensible VF for loops like
the one above.

Differential Revision: https://reviews.llvm.org/D113973
2022-01-04 10:12:57 +00:00
Florian Hahn b1a333f0fe
[VPlan] Don't consider VPWidenCanonicalIVRecipe phi-like.
VPWidenCanonicalIVRecipe does not create PHI instructions, so it does
not need to be placed in the phi section of a VPBasicBlock.

Also tidies the code so the WidenCanonicalIV recipe and the
compare/lane-masks are created in the header.

Discussed D113223.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D116473
2022-01-02 12:48:17 +00:00
Sanjay Patel 0c6979b2d6 [InstCombine] fold opposite shifts around an add
((X << C) + Y) >>u C --> (X + (Y >>u C)) & (-1 >>u C)

https://alive2.llvm.org/ce/z/DY9DPg

This replaces a shift with an 'and', and in the case
where the add has a constant operand, it eliminates
both shifts.

As noted in the TODO comment, we already have this fold when
the shifts are in the opposite order (and that code handles
bitwise logic ops too).

Fixes #52851
2021-12-30 12:01:06 -05:00
Sanjay Patel fd9cd3408b Revert "[InstCombine] fold opposite shifts around an add"
This reverts commit 2e3e0a5c28.
Some unintended diffs snuck into this patch.
2021-12-30 11:54:55 -05:00
Sanjay Patel 2e3e0a5c28 [InstCombine] fold opposite shifts around an add
((X << C) + Y) >>u C --> (X + (Y >>u C)) & (-1 >>u C)

https://alive2.llvm.org/ce/z/DY9DPg

This replaces a shift with an 'and', and in the case
where the add has a constant operand, it eliminates
both shifts.

As noted in the TODO comment, we already have this fold when
the shifts are in the opposite order (and that code handles
bitwise logic ops too).

Fixes #52851
2021-12-30 11:52:29 -05:00
Craig Topper a9486a40f7 [RISCV] Disable interleaving scalar loops in the loop vectorizer.
The loop vectorizer can interleave scalar loops even if it doesn't
vectorize them. I don't believe we intended to enable this when
we enabled interleaving for vector instructions.

Disable interleaving for VF=1 like X86 and AMDGPU already do. Test
lifted from AMDGPU.

Differential Revision: https://reviews.llvm.org/D115975
2021-12-23 08:37:24 -06:00
Florian Hahn ede7c2438f
[VPlan] Create header & latch blocks for skeleton up front (NFC).
By creating the header and latch blocks up front and adding blocks and
recipes in between those 2 blocks we ensure that the entry and exits of
the plan remain valid throughout construction.

In order to avoid test changes and keep printing of the plans the same,
we use the new header block instead of creating a new block on the first
iteration of the loop traversing the original loop.

We also fold the latch into its predecessor.

This is a follow up to a post-commit suggestion in D114586.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D115793
2021-12-22 12:44:25 +00:00
Sander de Smalen 290ae657a6 Fix buildbot failure caused by D115651
I somehow missed updating the RUN line of this test.
2021-12-20 17:18:59 +00:00
Sander de Smalen b1ff20fd35 [LV] Enable scalable vectorization by default for SVE cores.
The availability of SVE should be sufficient to enable scalable
auto-vectorization.

This patch adds a new TTI interface to query the target what style of
vectorization it wants when scalable vectors are available. For other
targets than AArch64, this currently defaults to 'FixedWidthOnly'.

Differential Revision: https://reviews.llvm.org/D115651
2021-12-20 16:23:29 +00:00
Florian Hahn 5b362e4c7f
[VPlan] Add Debugloc to VPInstruction.
Upcoming changes require attaching debug locations to VPInstructions,
e.g. adding induction increment recipes in D113223.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D115123
2021-12-20 15:10:41 +00:00
Philip Reames e6ad9ef4e7 [instcombine] Canonicalize constant index type to i64 for extractelement/insertelement
The basic idea to this is that a) having a single canonical type makes CSE easier, and b) many of our transforms are inconsistent about which types we end up with based on visit order.

I'm restricting this to constants as for non-constants, we'd have to decide whether the simplicity was worth extra instructions. For constants, there are no extra instructions.

We chose the canonical type as i64 arbitrarily.  We might consider changing this to something else in the future if we have cause.

Differential Revision: https://reviews.llvm.org/D115387
2021-12-13 16:56:22 -08:00
Philip Reames eb052f6b8f Reapply: Autogen more vectorizer tests in advance of D115387.
Drop changes to consecutive-ptr-uniforms.ll since that test checks boths IR output and debug messages.  I'd missed this in the original commit, and Florian pointed it out in post-commit review.

Original commit message:

These are the ones my first round of scripting couldn't handle that required a bit of manual messaging.  This should be the last batch in llvm-check.

This reverts commit bbba86764a.
2021-12-13 15:49:14 -08:00
Philip Reames bbba86764a Revert "Autogen more vectorizer tests in advance of D115387."
This reverts commit bbfaf0b170.

Post commit review noted a case where my manual update lost intentional check lines.  Given I've abandoned the motivating patch, I'm just reverting the autogen prep.
2021-12-13 12:45:50 -08:00
Philip Reames bbfaf0b170 Autogen more vectorizer tests in advance of D115387.
These are the ones my first round of scripting couldn't handle that required a bit of manual messaging.  This should be the last batch in llvm-check.
2021-12-13 11:04:20 -08:00
Philip Reames 1a18de3d0a Autogen a bunch of instcombine and vectorizer tests
Done in advance of D115387.  These are all the ones which my local script could handle, there's a couple more which need manual updates.
2021-12-13 10:41:38 -08:00
Florian Hahn e2885c7c9b
[VPlan] Add printing test with VPInstruction with debug locs.
Test case for D113223.
2021-12-13 13:08:41 +00:00
Florian Hahn 42263e7d26
[LV] Add test with debug locations on branches that get scalarized. 2021-12-13 12:06:35 +00:00
Evgeniy Brevnov 2025e0985c [LV] Make sure VF doesn't exceed compile time known TC
For the simple copy loop (see test case) vectorizer selects VF equal to 32 while the loop is known to have 17 iterations only. Such behavior makes no sense to me since such vector loop will never be executed. The only case we may want to select VF large than TC is masked vectoriztion. So I haven't touched that case.

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D114528
2021-12-13 13:48:46 +07:00
David Green fed3041863 [LV][ARM] Improve reduction costmodel for mismatching extension types.
Given a MLA reduction from two different types (say i8 and i16), we were
previously failing to find the reduction pattern, often making us chose
the lower vector factor. This improves that by using the largest of the
two extension types, allowing us to use the larger VF as the type of the
reduction.

As per https://godbolt.org/z/KP549EEYM the backend handles this
valiantly, leading to better performance.

Differential Revision: https://reviews.llvm.org/D115432
2021-12-10 15:40:58 +00:00
Florian Hahn 505ad03c7d
[LV] Remove redundant IV casts using VPlan (NFCI).
This patch simplifies handling of redundant induction casts, by
removing dead cast instructions after initial VPlan construction.
This has the following benefits:

  1. fixes a crash
     (see @test_optimized_cast_induction_feeding_first_order_recurrence)
  2. Simplifies VPWidenIntOrFpInduction to a single-def recipes
  3. Retires recordVectorLoopValueForInductionCast.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D115112
2021-12-10 13:57:03 +00:00
Evgeniy Brevnov eef8f3f856 [LV][NFC] New test case for compile time known trip count (TC)
New test to test/track upcoming chnages

Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D114526
2021-12-10 18:25:55 +07:00
David Sherwood 8b0448ce5d [AArch64][Analysis] Add on overhead costs for SVE gathers and scatters
This patch adds on an overhead cost for gathers and scatters, which
is a rough estimate based on performance investigations I have
performed on SVE hardware for various micro-benchmarks.

Differential Revision: https://reviews.llvm.org/D115143
2021-12-09 16:02:59 +00:00
David Sherwood def8b952eb [LoopVectorize][AArch64] Add vectoriser cost model tests for gathers/scatters
I've added some tests that were previously missing for the gather-scatter costs
being calculated by the vectorizer for AArch64:

  Transforms/LoopVectorize/AArch64/sve-gather-scatter-cost.ll

The costs are sometimes different to the ones in

  Analysis/CostModel/AArch64/sve-gather.ll

because the vectorizer also adds on the address computation cost.
2021-12-09 15:44:12 +00:00
Sander de Smalen e1edec1ee6 [LV] NFC: Add check for VF to vector_ptr_load_store.ll.
This just adds some extra CHECK lines to show the effect
of a follow-up patch.
2021-12-08 16:41:59 +00:00
Cullen Rhodes 698584f89b [IR] Remove unbounded as possible value for vscale_range minimum
The default for min is changed to 1. The behaviour of -mvscale-{min,max}
in Clang is also changed such that 16 is the max vscale when targeting
SVE and no max is specified.

Reviewed By: sdesmalen, paulwalker-arm

Differential Revision: https://reviews.llvm.org/D113294
2021-12-07 09:52:21 +00:00
Sander de Smalen 3d549dddf7 [LV] Pass compare predicate to getCmpSelInstrCost.
If the condition of a select is a compare, pass its predicate to
TTI::getCmpSelInstrCost to get a more accurate cost value instead
of passing BAD_ICMP_PREDICATE.

I noticed that the commit message from D90070 had a comment about the
vectorized select predicate possibly being composed of other compares with
different predicate values, but I wasn't able to construct an example
where this was an actual issue. If this is an issue, I guess we could
add another check that the block isn't predicated for any reason.

Reviewed By: dmgreen, fhahn

Differential Revision: https://reviews.llvm.org/D114646
2021-12-06 11:41:27 +00:00
David Green 255ad73424 [ARM] Make MVE v2i1 predicates legal
MVE can treat v16i1, v8i1, v4i1 and v2i1 as different views onto the
same 16bit VPR.P0 register, with v2i1 holding two 8 bit values for the
two halves. This was never treated as a legal type in llvm in the past
as there are not many 64bit instructions and no 64bit compares. There
are a few instructions that could use it though, notably a VSELECT (as
it can handle any size using the underlying v16i8 VPSEL), AND/OR/XOR for
similar reasons, some gathers/scatter and long multiplies and VCTP64
instructions.

This patch goes through and makes v2i1 a legal type, handling all the
cases that fall out of that. It also makes VSELECT legal for v2i64 as a
side benefit. A lot of the codegen changes as a result - usually in way
that is a little better or a little worse, but still expensive. Costs
can change a little too in the process, again in a way that expensive
things remain expensive. A lot of the tests that changed are mainly to
ensure correctness - the code can hopefully be improved in the future
where it comes up in practice.

The intrinsics currently remain using the v4i1 they previously did to
emulate a v2i1. This will be changed in a followup patch but this one
was already large enough.

Differential Revision: https://reviews.llvm.org/D114449
2021-12-03 14:05:41 +00:00
Roman Lebedev 8cd782487f
[X86][LoopVectorize] "Fix" `X86TTIImpl::getAddressComputationCost()`
We ask `TTI.getAddressComputationCost()` about the cost of computing vector address,
and then multiply it by the vector width. This doesn't make any sense,
it implies that we'd do a vector GEP and then scalarize the vector of pointers,
but there is no such thing in the vectorized IR, we perform scalar GEP's.

This is *especially* bad on X86, and was effectively prohibiting any scalarized
vectorization of gathers/scatters, because `X86TTIImpl::getAddressComputationCost()`
says that cost of vector address computation is `10` as compared to `1` for scalar.

The computed costs are similar to the ones with D111222+D111220,
but we end up without masked memory intrinsics that we'd then have to
expand later on, without much luck. (D111363)

Differential Revision: https://reviews.llvm.org/D111460
2021-11-30 10:47:56 +03:00
Sander de Smalen 28a4deab92 [LV] Fix incorrectly marking a pointer indvar as 'scalar'.
collectLoopScalars should only add non-uniform nodes to the list if they
are used by a load/store instruction that is marked as CM_Scalarize.

Before this patch, the LV incorrectly marked pointer induction variables
as 'scalar' when they required to be widened by something else,
such as a compare instruction, and weren't used by a node marked as
'CM_Scalarize'. This case is covered by sve-widen-phi.ll.

This change also allows removing some code where the LV tried to
widen the PHI nodes with a stepvector, even though it was marked as
'scalarAfterVectorization'. Now that this code is more careful about
marking instructions that need widening as 'scalar', this code has
become redundant.

Differential Revision: https://reviews.llvm.org/D114373
2021-11-28 09:49:28 +00:00
Sander de Smalen a9f837bbf0 NFC: Simplify sve-widen-phi.ll by unrolling once.
The unroll factor > 1 has little value for what is being tested.
2021-11-28 09:49:28 +00:00
David Sherwood e20391fc5d [LoopVectorize] When tail-folding, don't always predicate uniform loads
In VPRecipeBuilder::handleReplication if we believe the instruction
is predicated we then proceed to create new VP region blocks even
when the load is uniform and only predicated due to tail-folding.

I have updated isPredicatedInst to avoid treating a uniform load as
predicated when tail-folding, which means we can do a single scalar
load and a vector splat of the value.

Tests added here:

  Transforms/LoopVectorize/AArch64/tail-fold-uniform-memops.ll

Differential Revision: https://reviews.llvm.org/D112552
2021-11-26 11:30:54 +00:00
Graham Hunter dee810e117 [NFC][LAA] Precommit tests for forked pointers
Precommit for https://reviews.llvm.org/D108699
2021-11-24 16:20:35 +00:00
Florian Hahn a7648eb2aa
[LV] Use patterns in some induction tests, to make more robust. (NFC) 2021-11-24 13:32:24 +00:00
Rosie Sumpter df32a39dd0 [LoopVectorize][CostModel] Update cost model for fmuladd intrinsic
This patch updates the cost model for ordered reductions so that a call
to the llvm.fmuladd intrinsic is modelled as a normal fmul instruction
plus the cost of an ordered fadd reduction.

Differential Revision: https://reviews.llvm.org/D111630
2021-11-24 08:50:05 +00:00
Rosie Sumpter 2d33327f9d [LoopVectorize] Print fast-math flags for VPReductionRecipe 2021-11-24 08:50:05 +00:00
Rosie Sumpter 991074012a [LoopVectorize] Propagate fast-math flags for VPInstruction
In-loop vector reductions which use the llvm.fmuladd intrinsic involve
the creation of two recipes; a VPReductionRecipe for the fadd and a
VPInstruction for the fmul. If the call to llvm.fmuladd has fast-math flags
these should be propagated through to the fmul instruction, so an
interface setFastMathFlags has been added to the VPInstruction class to
enable this.

Differential Revision: https://reviews.llvm.org/D113125
2021-11-24 08:50:04 +00:00
Rosie Sumpter c2441b6b89 [LoopVectorize] Add vector reduction support for fmuladd intrinsic
Enables LoopVectorize to handle reduction patterns involving the
llvm.fmuladd intrinsic.

Differential Revision: https://reviews.llvm.org/D111555
2021-11-24 08:50:04 +00:00
Huihui Zhang 9cd7c534e2 [InstCombine] Enable fold select into operand for FAdd, FMul, FSub and FDiv.
For FAdd, FMul, FSub and FDiv, fold select into one of the operands to enable
further optimizations, i.e., floating-point reduction detection.

Turn code:
  %C = fadd %A, %B
  %D = select %cond, %C, %A

into:
  %C = select %cond, %B, -0.000000e+00
  %D = fadd %A, %C

Alive2 verification (with --disable-undef-input), timed out otherwise.
FAdd - https://alive2.llvm.org/ce/z/eUxN4Y
FMul - https://alive2.llvm.org/ce/z/5SWZz4
FSub - https://alive2.llvm.org/ce/z/Dhj8dU
FDiv - https://alive2.llvm.org/ce/z/Yj_NA2

Reviewed By: spatel

Differential Revision: https://reviews.llvm.org/D113442
2021-11-22 15:10:10 -08:00
Diego Caballero 4348cd42c3 [LV] Drop integer poison-generating flags from instructions that need predication
This patch fixes PR52111. The problem is that LV propagates poison-generating flags (`nuw`/`nsw`, `exact`
and `inbounds`) in instructions that contribute to the address computation of widen loads/stores that are
guarded by a condition. It may happen that when the code is vectorized and the control flow within the loop
is linearized, these flags may lead to generating a poison value that is effectively used as the base address
of the widen load/store. The fix drops all the integer poison-generating flags from instructions that
contribute to the address computation of a widen load/store whose original instruction was in a basic block
that needed predication and is not predicated after vectorization.

Reviewed By: fhahn, spatel, nlopes

Differential Revision: https://reviews.llvm.org/D111846
2021-11-22 10:57:29 +00:00
Diego Caballero a7027bb799 [LV] Pre-commit test for D111846
Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D112054
2021-11-22 10:13:56 +00:00
Florian Hahn cf8efbd30e
[VPlan] Wrap vector loop blocks in region.
A first step towards modeling preheader and exit blocks in VPlan as well.
Keeping the vector loop in a region allows for changing the VF as we
traverse region boundaries.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D113182
2021-11-20 17:59:48 +00:00
Kerry McLaughlin ff64b2933a [LoopVectorize] Check the number of uses of an FAdd before classifying as ordered
checkOrderedReductions looks for Phi nodes which can be classified as in-order,
meaning they can be vectorised without unsafe math. In order to vectorise the
reduction it should also be classified as in-loop by getReductionOpChain, which
checks that the reduction has two uses.

In this patch, a similar check is added to checkOrderedReductions so that we
now return false if there are more than two uses of the FAdd instruction.
This fixes PR52515.

Reviewed By: fhahn, david-arm

Differential Revision: https://reviews.llvm.org/D114002
2021-11-18 16:41:19 +00:00
Florian Hahn dead1c11ff
[LV] Add basic check lines to test added in 00200dbda3. 2021-11-18 14:08:57 +00:00
Florian Hahn 00200dbda3
[LV] Add test case for PR52024.
This patch adds a reduced version of the test case from PR52024.

Together with 764d9aa979 the test causes a crash, because LV expands a
SCEV expression during code generation, when the dominator tree is not
up-to-date.
2021-11-18 12:10:44 +00:00
David Sherwood 8d77555b12 [Analysis] Ensure getTypeLegalizationCost returns a simple VT for TypeScalarizeScalableVector
When getTypeConversion returns TypeScalarizeScalableVector we were
sometimes returning a non-simple type from getTypeLegalizationCost.
However, many callers depend upon this being a simple type and will
crash if not. This patch changes getTypeLegalizationCost to ensure
that we always a return sensible simple VT. If the vector type
contains unusual integer types, e.g. <vscale x 2 x i3>, then we just
set the type to MVT::i64 as a reasonable default.

A test has been added here that demonstrates the vectoriser can
correctly calculate the cost of vectorising a "zext i3 to i64"
instruction with a VF=vscale x 1:

  Transforms/LoopVectorize/AArch64/sve-inductions-unusual-types.ll

Differential Revision: https://reviews.llvm.org/D113777
2021-11-17 13:11:58 +00:00
David Sherwood 670dd40244 [Analysis] Fix getNumberOfParts to return 0 when the answer is unknown
When asking how many parts are required for a scalable vector type
there are occasions when it cannot be computed. For example, <vscale x 1 x i3>
is one such vector for AArch64+SVE because at the moment no matter how we
promote the i3 type we never end up with a legal vector. This means
that getTypeConversion returns TypeScalarizeScalableVector as the
LegalizeKind, and then getTypeLegalizationCost returns an invalid cost.
This then causes BasicTTImpl::getNumberOfParts to dereference an invalid
cost, which triggers an assert. This patch changes getNumberOfParts to
return 0 for such cases, since the definition of getNumberOfParts in
TargetTransformInfo.h states that we can use a return value of 0 to represent
an unknown answer.

Currently, LoopVectorize.cpp is the only place where we need to check for
0 as a return value, because all other instances will not currently
ask for the number of parts for <vscale x 1 x iX> types.

In addition, I have changed the target-independent interface for
getNumberOfParts to return 1 and assume there is a single register
that can fit the type. The loop vectoriser has lots of tests that are
target-independent and they relied upon the 0 value to mean the
answer is known and that we are not scalarising the vector.

I have added tests here that show we correctly return an invalid cost
for VF=vscale x 1 when the loop contains unusual types such as i7:

  Transforms/LoopVectorize/AArch64/sve-inductions-unusual-types.ll

Differential Revision: https://reviews.llvm.org/D113772
2021-11-17 12:07:09 +00:00
David Green 309f1e4ac8 [ARM] Add datalayout to costmodel tests. NFC
This adds a sensible datalayout to the ARM cost model tests, to prevent
the costs reported being incorrect for the size of pointers.
2021-11-16 09:49:42 +00:00
Florian Hahn 112c1c346a
[IVDescriptor] Make sure the sign is included for negative extension.
At the moment, computeRecurrenceType does not include any sign bits in
the maximum bit width. If the value can be negative, this means the sign
bit will be missing and the sext won't properly extend the value.

If the value can be negative, increment the bitwidth by one to make sure
there is at least one sign bit in the result value.

Note that the increment is also needed *if* the value is *known* to be
negative, as a sign bit needs to be preserved for the sext to work.

Note that this at the moment prevents vectorization, because the
analysis computes i1 as type for the recurrence when looking through the
AND in lookThroughAnd.

Fixes PR51794, PR52485.

Reviewed By: spatel

Differential Revision: https://reviews.llvm.org/D113056
2021-11-15 13:12:57 +00:00
Simon Pilgrim fbe72e41b9 [LoopVectorize] Add PR41179 test case 2021-11-14 21:54:23 +00:00
Philip Reames 37ead201e6 [runtime-unroll] Use incrementing IVs instead of decrementing ones
This is one of those wonderful "in theory X doesn't matter, but in practice is does" changes. In this particular case, we shift the IVs inserted by the runtime unroller to clamp iteration count of the loops* from decrementing to incrementing.

Why does this matter?  A couple of reasons:
* SCEV doesn't have a native subtract node.  Instead, all subtracts (A - B) are represented as A + -1 * B and drops any flags invalidated by such.  As a result, SCEV is slightly less good at reasoning about edge cases involving decrementing addrecs than incrementing ones.  (You can see this in the inferred flags in some of the test cases.)
* Other parts of the optimizer produce incrementing IVs, and they're common in idiomatic source language.  We do have support for reversing IVs, but in general if we produce one of each, the pair will persist surprisingly far through the optimizer before being coalesced.  (You can see this looking at nearby phis in the test cases.)

Note that if the hardware prefers decrementing (i.e. zero tested) loops, LSR should convert back immediately before codegen.

* Mostly irrelevant detail: The main loop of the prolog case is handled independently and will simple use the original IV with a changed start value.  We could in theory use this scheme for all iteration clamping, but that's a larger and more invasive change.
2021-11-12 15:44:58 -08:00
Florian Hahn 30ebdf8a6d
[LV] Precommit test case from PR52485. 2021-11-12 16:09:19 +00:00
Kerry McLaughlin 7647822156 [AArch64][SVE] Remove i1 type from isElementTypeLegalForScalableVector
`collectElementTypesForWidening` collects the types of load, store and
reduction Phis in a loop. These types are later checked using
`isElementTypeLegalForScalableVector` to prevent vectorisation of
loops with instruction types that are unsupported.

This patch removes i1 from the list of types supported for scalable
vectors. This fixes an assert ("Cannot yet scalarize uniform stores") in
`setCostBasedWideningDecision` when we have a loop containing a uniform
i1 store and a scalable VF, which we cannot create a scatter for.

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D113680
2021-11-12 14:24:38 +00:00
Florian Hahn e7f1232cb7
[LV] Move optimized IV recipes to phi section of header after sinking.
Unfortunately sinking recipes for first-order recurrences relies on
the original position of recipes. So if a recipes needs to be sunk after
an optimized induction, it needs to stay in the original position, until
sinking is done. This is causing PR52460.

To fix the crash, keep the recipes in the original position until
sink-after is done.

Post-commit follow-up to c45045bfd0 to address PR52460.
2021-11-10 11:41:08 +00:00
Kerry McLaughlin 6f16ee5e14 Revert "[LoopVectorize] Extract the last lane from a uniform store"
This reverts commit 0d748b4d32.
This is causing some failures when building Spec2017 with scalable
vectors. Reverting to investigate.
2021-11-10 11:21:19 +00:00
Dmitry Makogon 62f86d4f95 Reapply 5ec2386 "Reapply db28934 "[IndVars] Pass TTI to replaceCongruentIVs""
This reverts commit 7cd273c339.

Several patches with tests fixes have been applied:
0cada82f0a "[Test] Remove incorrect test in GVN"
97cb13615d "[Test] Separate IndVars test into AArch64 and X86 parts"
985cc490f1 "[Test] Remove separated test in IndVars",
and test failures caused by 5ec2386 should be resolved now.
2021-11-10 17:36:14 +07:00
David Sherwood 2a48b6993a [IR] In ConstantFoldShuffleVectorInstruction use zeroinitializer for splats of 0
When creating a splat of 0 for scalable vectors we tend to create them
with using a combination of shufflevector and insertelement, i.e.

shufflevector (<vscale x 4 x i32> insertelement (<vscale x 4 x i32> poison, i32 0, i32 0),
               <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer)

However, for the case of a zero splat we can actually just replace the
above with zeroinitializer instead. This makes the IR a lot simpler and
easier to read. I have changed ConstantFoldShuffleVectorInstruction to
use zeroinitializer when creating a splat of integer 0 or FP +0.0 values.

Differential Revision: https://reviews.llvm.org/D113394
2021-11-10 09:42:58 +00:00
Douglas Yung 7cd273c339 Revert "Reapply db28934 "[IndVars] Pass TTI to replaceCongruentIVs""
This reverts commit 5ec2386332.

This change is causing test failures on the PS4 linux build bot: https://lab.llvm.org/buildbot/#/builders/139/builds/12871
2021-11-09 10:28:41 -08:00
Kerry McLaughlin 0d748b4d32 [LoopVectorize] Extract the last lane from a uniform store
Changes VPReplicateRecipe to extract the last lane from an unconditional,
uniform store instruction. collectLoopUniforms will also add stores to
the list of uniform instructions where Legal->isUniformMemOp is true.

setCostBasedWideningDecision now sets the widening decision for
all uniform memory ops to Scalarize, where previously GatherScatter
may have been chosen for scalable stores.

This fixes an assert ("Cannot yet scalarize uniform stores") in
setCostBasedWideningDecision when we have a loop containing a
uniform i1 store and a scalable VF, which we cannot create a scatter for.

Reviewed By: sdesmalen, david-arm, fhahn

Differential Revision: https://reviews.llvm.org/D112725
2021-11-09 14:43:16 +00:00