Commit Graph

1862 Commits

Author SHA1 Message Date
Philip Reames 4c10646367 [LV] Refresh autogen tests to reflect naming changes [nfc]
Purely so that these can be easily autogened without spurious diffs
2022-08-29 14:16:54 -07:00
Florian Hahn 005d1a8ff5
[LV] Add test where either a libfunc or intrinsic is chosen.
In the newly added test either a libfunc (VF=2) or a intrinsic (VF=4)
can be chosen.

Test coverage for D132585.
2022-08-29 10:51:20 +01:00
Philip Reames b45a262679 [RISCV] Enable fixed length vectors and loop vectorization with same
This change enables the use of RISCV's variable length vector registers for fixed length vectors in the IR, and implicitly enables various IR transforms which generate fixed length vectors if legal (e.g. LoopVectorize). Specifically, this enables fixed length vectors which are known to be inbounds of the underlying variable hardware size.

For context, remember that the +V extension provides a minimum VLEN of 128. The embedded variants provide lower minimums. The analogy here is essentially vectorizing for SSE on a machine which may or may not include AVX2/AVX512. We won't get full utilization by default, but we will get some benefit. And of course, with an explicit mcpu we can vectorize to the exact target hardware.

The LV impact is mostly related to vectorizer robustness. In cases we haven't yet fully implemented scalable vectorization support, we can fall back to fixed length vectorization.

SLP has been disabled for now, even when fixed vectors are enabled.  See a310637 and associated review.  There are a few addiitional code quality issues which need worked through before turning SLP on would be reasonable.

Differential Revision: https://reviews.llvm.org/D131508
2022-08-26 14:45:23 -07:00
Florian Hahn 9405af1c85
[LAA] Require AddRecs to be in the innermost loop for diff-checks.
The simpler diff-checks require pointers with add-recs from the same
innermost loop, but this property wasn't check completely. Add the
missing check to ensure both addrecs are in the innermost loop.

Fixes #57315.
2022-08-26 20:39:52 +01:00
Florian Hahn e117137af0
[LV] Add another test for incorrect runtime check generation.
Add a variation of @nested_loop_outer_iv_addrec_invariant_in_inner with
the dependence sink and source swapped to extend test coverage.

Also simplifies the test by removing an unneeded reduction.
2022-08-26 17:28:55 +01:00
Florian Hahn 6e56779e6b
[LV] Add test for incorrect runtime check generation #57315.
Test for PR57315 based on a test provided by @kpdev42.
2022-08-26 16:29:20 +01:00
Florian Hahn 3b135ef446
[LV] Convert runtime diff check test to use opaque pointers.
Modernize the test to make it easier to extend with up-to-date IR.
2022-08-26 16:02:38 +01:00
Philip Reames 86b67a310d [LAA] Prune dependencies with distance large than access implied by trip count
When we have a dependency with a dependence distance which can only be hit on an iteration beyond the actual trip count of the loop, we can ignore that dependency when analyzing said loop. We already had this code, but had restricted it solely to unknown dependence distances. This change applies it to all dependence distances.

Without this code, we relied on the vectorizer reducing VF such that our infeasible dependence was respected. This usually worked out to about the same result, but not always. For fixed length vectorization, this could mean a smaller VF than optimal being chosen or additional runtime checks. For scalable vectorization - where the bounds on access implied by VF are broader - we could often not find a feasible VF at all.

Differential Revision: https://reviews.llvm.org/D131924
2022-08-25 14:24:13 -07:00
Florian Hahn 637da77e66
[LV] Add additional test coverage for SCEVexp and LCSSA interaction.
Also converts the test to use opaque pointers while I am here.
2022-08-25 20:59:47 +01:00
Philip Reames 190cdf51ff [RISCV][LV] Add predicated div/rem test for fixed length vectorization 2022-08-24 11:24:22 -07:00
Philip Reames b20104f644 [LV] Update a test which appears to have been editted without regen [nfc] 2022-08-24 11:05:49 -07:00
Philip Reames f79214d1e1 [LV] Support predicated div/rem operations via safe-divisor select idiom
This patch adds support for vectorizing conditionally executed div/rem operations via a variant of widening. The existing support for predicated divrem in the vectorizer requires scalarization which we can't do for scalable vectors.

The basic idea is that we can always divide (take remainder) by 1 without executing UB. As such, we can use the active lane mask to conditional select either the actual divisor for active lanes, or a constant one for inactive lanes. We already account for the cost of the active lane mask, so the only additional cost is a splat of one and the vector select. This is one of several possible approaches to this problem; see the review thread for discussion on some of the others.  This one was chosen mostly because it was straight forward, and none of the others seemed oviously better.

I enabled the new code only for scalable vectors. We could also legally enable it for fixed vectors as well, but I haven't thought through the cost tradeoffs between widening and scalarization enough to know if that's profitable. This will be explored in future patches.

Differential Revision: https://reviews.llvm.org/D130164
2022-08-24 10:07:59 -07:00
David Green 8d830f8d68 [LV] Replace fixed-order cost model with a SK_Splice shuffle
The existing cost model for fixed-order recurrences models the phi as an
extract shuffle of a v1 vector. The shuffle produced should be a splice,
as they take two vectors inputs are extracting from a subset of the
lanes. On certain architectures the existing cost model can drastically
under-estimate the correct cost for the shuffle, so this changes it to a
SK_Splice and passes a correct Mask through to the getShuffleCost call.

I believe this might be the first use of a SK_Splice shuffle cost model
outside of scalable vectors, and some targets may require additions to
the cost-model to correctly account for them. In tree targets appear to
all have been updated where needed.

Differential Revision: https://reviews.llvm.org/D132308
2022-08-24 13:00:32 +01:00
David Green e29f9f7572 [AArch64][X86] Add some fixed-order-recurrence tests to check the costmodel of fixed order recurrences. NFC 2022-08-24 08:18:01 +01:00
Graham Hunter 14212c968f [NFC][LoopVectorize] Precommit masked vector function call tests 2022-08-23 09:47:10 +01:00
Jay Foad 2754ff883d [InstCombine] Try not to demand low order bits for Add
Don't demand low order bits from the LHS of an Add if:
- they are not demanded in the result, and
- they are known to be zero in the RHS, so they can't possibly
  overflow and affect higher bit positions

This is intended to avoid a regression from a future patch to change
the order of canonicalization of ADD and AND.

Differential Revision: https://reviews.llvm.org/D130075
2022-08-22 20:03:53 +01:00
David Green 04a68fce13 [ARM] Add a couple of MVE fixed-order-reduction tests. NFC 2022-08-22 10:58:14 +01:00
Sanjay Patel 15e3d86911 [InstCombine] reassociate bitwise logic chains based on uses
(X op Y) op Z --> (Y op Z) op X

This isn't a complete solution (see TODO tests for possible refinements),
but it shows some nice wins and doesn't seem to cause any harm. I think
the most potential danger is from conflicting with other folds and causing
an infinite loop - that's the reason for avoiding patterns with constant
operands.

Alternatively, we could try this in the reassociate pass, but we would not
immediately see all of the logic folds that instcombine provides. I also
looked at improving ValueTracking's isImpliedCondition() (and we should
still add some enhancements there), but that would not work in general for
bitwise logic reduction.

The tests that reduce completely to 0/-1 are motivated by issue #56653.

Differential Revision: https://reviews.llvm.org/D131356
2022-08-21 09:42:14 -04:00
David Sherwood 666d2a925f [SVE][LoopVectorize][NFC] Tidy up some tests
Whilst writing a patch to add extra tail-folding RUN lines to
existing tests I noticed a few areas where they can be
cleaned up a little:

1. scalable-reductions.ll: fmin_fast does not mark fcmp as fast.
2. sve-inductions-unusual-types.ll: remove direct references to
   SSA variable names.
3. sve-strict-fadd-cost.ll: don't force vector width so we see
   costs for different VFs in one go. This will be important for
   the follow-on patch.
4. sve-vector-reverse.ll,vector-reverse-mask4.ll: add noalias
   keyword to simplify IR.
4. sve-widen-gep.ll,sve-widen-phi.ll: regenerate using script.

These changes will make the subsequent patch adding RUN lines much
easier to review!

Differential Revision: https://reviews.llvm.org/D132219
2022-08-19 15:12:58 +01:00
Philip Reames 4d87591028 [RISCV] Use VScaleForTuning in costing of operations whose cost depends on VL
On known hardware, reductions, gather, and scatter operations have execution latencies which correlated with the vector length (VL) of the operation. Most other operations (e.g. simply arithmetic) don't correlated in this way, and instead essentially fixed cost as VL varies.

When I'd implemented initial scalable cost model support for reductions, gather, and scatter operations, I had used an upper bound on the statically unknown VL. The argument at the time was that this prevented falsely low costs, and biased the vectorizer away from generating bad (on some hardware) code. Unfortunately, practical experience shows we were a bit too effective at that goal, and the high costs defacto prevents vectorization using these constructs at all.

This patch reverses course, and ties the returned cost not to the maximum possible VL, but the VL which would correspond to VScaleForTuning. This parameter is the same one the vectorizer uses when normalizing loop costs, so the term effectively cancels out. The result is that the vectorizer now sees these constructs as comparable in cost to their fixed length variants.

This does introduce the possibility of the cost for these operations being a significant under estimate on platforms where actual VLEN is far from that implied by VScaleForTuning. On such platforms, we might make poor heuristic choices. Probably not in LV itself (due to the cancellation mentioned above), but possibly during e.g. lowering. I'm not currently aware of any concrete examples of this, but this patch does open a concern which did not previously exist.

Previously, we had the problem of overestimating costs causing the same problem on machines much closer to default values for vscale for tuning. With this patch, we still have that problem potentially if vscale for tuning is set high (manually), and then the code is run on a narrow VLEN machine.

Differential Revision: https://reviews.llvm.org/D131519
2022-08-18 13:10:03 -07:00
Florian Hahn b8709a9d03
[LV] Support fixed order recurrences.
If the incoming previous value of a fixed-order recurrence is a phi in
the header, go through incoming values from the latch until we find a
non-phi value. Use this as the new Previous, all uses in the header
will be dominated by the original phi, but need to be moved after
the non-phi previous value.

At the moment, fixed-order recurrences are modeled as a chain of
first-order recurrences.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D119661
2022-08-18 19:15:52 +01:00
Philip Reames 531dd3634d [LV] Restructure isPredicatedInst and isScalarWithPredication (w/a fix for uniform mem ops)
This change reorganizes the code and comments to make the expected semantics of these routines more clear. However, this is *not* an NFC change. The functional change is having isScalarWithPredication return false if the instruction does not need predicated. Specifically, for the case of a uniform memory operation we were previously considering it *not* to be a predicated instruction, but *were* considering it to be scalable with predication.

As can be seen with the test changes, this causes uniform memory ops which should have been lowered as uniform-per-parts values to instead be lowering via naive scalarization or if scalarization is infeasible (i.e. scalable vectors) aborted entirely. I also don't trust the code to bail out correctly 100% of the time, so it's possible we had a crash or miscompile from trying to scalarize something which isn't scalaralizable. I haven't found a concrete example here, but I am suspicious.

Differential Revision: https://reviews.llvm.org/D131093
2022-08-18 07:14:04 -07:00
Florian Hahn a34428f07d
[LV] Use variables instead of hard-coded metadata IDs in tests. 2022-08-16 12:21:49 +01:00
Zain Jaffal 94d21a94d9
[AArch64] Add tests to check for loop vectorization of non temporal loads
Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D131899
2022-08-16 09:40:51 +01:00
Philip Reames 33e7a0a33b [RISCV][LV] Add test coverage for upcoming dependence distance handling change 2022-08-15 15:20:36 -07:00
Florian Hahn 4f04be5649
[LV] Add tests for vectorizing select of minimum idx idiom.
Test cases for selecting the index with the minimum value.
2022-08-14 17:44:11 +01:00
Martin Sebor 0dcfe7aa35 [InstCombine] Tighten up known library function signature tests (PR #56463)
Replace a switch statement used to validate arguments to known library
functions with a more consistent table-driven approach and tighten it
up.
2022-08-10 14:15:46 -06:00
Dinar Temirbulatov cab6cd6834 [AArch64][LoopVectorize] Introduce trip count minimal value threshold to ignore tail-folding.
After D121595 was commited, I noticed regressions assosicated with small trip
count numbersvectorisation by tail folding with scalable vectors. As a solution
for those issues I propose to introduce the minimal trip count threshold value.

  Differential Revision: https://reviews.llvm.org/D130755
2022-08-09 22:10:17 +01:00
jacquesguan 45bae1be90 [RISCV][test] Add inloop reduction vectorize test. NFC 2022-08-04 15:06:44 +08:00
Philip Reames 0b47615fcf [LV] Recognize store of invariant value to invariant address as uniform
This extends the handling of uniform memory operations to handle the case where a store is storing a loop invariant value. Unlike the general case of a store to an invariant address where we must use the last active lane, in this case we can use any lane since all lanes must produce the same result.

For context, the basic structure of the existing code and how the change fits in:
* First, we select a widening strategy. (The result is irrelevant for this patch.)
* Then we determine if a computation is uniform within all lanes of VF. (Note this is the uniform-per-part definition, not LAI's uniform across all unrolled iterations definition.)
* If it is, we overrule the widening strategy, and unconditionally scalarize.
* VPReplicationRecipe - which is what actually does the scalarization - knows how to handle unform-per-part values including for scalable vectors. However, we do need to know that the expression is safe to execute without predication - e.g. the uniform mem op was unconditional in the original loop. (This part was split off and already landed.)

An obvious question is why not simply implement the generic case? The answer is that I'm going to, but doing so without a canonicalization towards uniform causes regressions due to bad interaction with scalarization/uniformity of values feeding the uniform mem-op. This patch is needed to avoid those regressions.

Differential Revision: https://reviews.llvm.org/D130364
2022-08-02 08:09:49 -07:00
David Sherwood 4ef9cb6c17 [AArch64][LoopVectorize] Disable tail-folding for SVE when loop has interleaved accesses
If we have interleave groups in the loop we want to vectorise then
we should fall back on normal vectorisation with a scalar epilogue. In
such cases when tail-folding is enabled we'll almost certainly go on to
create vplans with very high costs for all vector VFs and fall back on
VF=1 anyway. This is likely to be worse than if we'd just used an
unpredicated vector loop in the first place.

Once the vectoriser has proper support for analysing all the costs
for each combination of VF and vectorisation style, then we should
be able to remove this.

Added an extra test here:

  Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll

Differential Revision: https://reviews.llvm.org/D128342
2022-08-02 09:52:33 +01:00
Florian Hahn ff5ae948a7
[LV] Add variation of test cases with order of phis flipped.
Additional tests with integer and pointer inductions for
D119661.
2022-08-01 11:38:16 +01:00
Florian Hahn 6e1ba62d0d
[LV] Add additional tests with multiple chained recurrences.
Adds more extra tests for D119661.

Also update the test to use opaque pointers.
2022-08-01 10:01:19 +01:00
Philip Reames 82c1b136db [LV] Don't predicate uniform mem op stores unneccessarily
We already had the reasoning about uniform mem op loads; if the address is accessed at least once, we know the instruction doesn't need predicated to ensure fault safety. For stores, we do need to ensure that the values visible in memory are the same with and without predication. The easiest sub-case to check for is that all the values being stored are the same. Since we know that at least one lane is active, this tells us that the value must be visible.

Warning on confusing terminology: "uniform" vs "uniform mem op" mean two different things here, and this patch is specific to the later. It would *not* be legal to make this same change for merely "uniform" operations.

Differential Revision: https://reviews.llvm.org/D130637
2022-07-28 08:55:52 -07:00
Philip Reames 15c645f7ee [RISCV] Enable (scalable) vectorization by default
This change enables vectorization (using scalable vectorization only, fixed vectors are not yet enabled) for RISCV when vector instructions are available for the target configuration.

At this point, the resulting configuration should be both stable (e.g. no crashes), and profitable (i.e. few cases where scalar loops beat vector ones), but is not going to be particularly well tuned (i.e. we emit the best possible vector loop). The goal of this change is to align testing across organizations and ensure the default configuration matches what downstreams are using as closely as possible.

This exposes a large amount of code which hasn't otherwise been on by default, and thus may not have been fully exercised.  Given that, having issues fall out is not unexpected.  If you find issues, please make sure to include as much information as you can when reverting this change.

Differential Revision: https://reviews.llvm.org/D129013
2022-07-27 12:36:04 -07:00
Philip Reames e8ceadd0ce [LV][RISCV] Add a test case for a quality problem mixing vector index and data types
The problem here is target independent, but particularly painful on RISCV.  If we chose to vectorize such that vscale x 2 x i32 is our widest type and fits in a register, a naive expansion of i64 comparisons results in comparisons and index types at <scalabe x 2 x i64>.  This requires both an LMUL of 2, and a VSETVLI toggle in the loop.  Note that we could have used <vscale x 2 x i32> for the compairons legally given the range of the trip count.
2022-07-27 11:42:28 -07:00
Florian Hahn 16e0620d6d
[VPlan] Mark VPPredInstPHIRecipe as not having side-effects.
Now that all uses of VPPredInstPHIRecipes are properly modeled, they can
be treated as not having side-effects, enabling removal.
2022-07-27 19:29:26 +01:00
Philip Reames 43b5e12159 [LV] Refresh an autogened test to pickup naming changes 2022-07-27 10:54:15 -07:00
Philip Reames ebee4fbb34 [RISCV][LV] Add basic tests for default configuration
All of our other tests are functionality tests constrained to some
specific configuration.  This one is intended to float with the
default configuration so that changes in that default are visible
in reviews.  Note that our current default does not enable
vectorization at all; thus the current output is unvectorized.
2022-07-27 09:16:44 -07:00
Florian Hahn a8fdc247e9
[LV] Add missing uses to test to make them more robust.
The changes ensure the VPPredInstPHIRecipes are actually used and cannot
be remove by VP-DCE.
2022-07-27 16:06:52 +01:00
Sanjay Patel bfb9b8e075 [Passes] add a tail-call-elim pass near the end of the opt pipeline
We call tail-call-elim near the beginning of the pipeline,
but that is too early to annotate calls that get added later.

In the motivating case from issue #47852, the missing 'tail'
on memset leads to sub-optimal codegen.

I experimented with removing the early instance of
tail-call-elim instead of just adding another pass, but that
appears to be slightly worse for compile-time:
+0.15% vs. +0.08% time.
"tailcall" shows adding the pass; "tailcall2" shows moving
the pass to later, then adding the original early pass back
(so 1596886802 is functionally equivalent to 180b0439dc ):
https://llvm-compile-time-tracker.com/index.php?config=NewPM-O3&stat=instructions&remote=rotateright

Note that there was an effort to split the tail call functionality
into 2 passes - that could help reduce compile-time if we find
that this change costs more in compile-time than expected based
on the preliminary testing:
D60031

Differential Revision: https://reviews.llvm.org/D130374
2022-07-25 15:25:47 -04:00
Nuno Lopes a30e77b6f6 fix tests for commit 9df0b254d2 2022-07-23 22:32:30 +01:00
Nuno Lopes 9df0b254d2 [NFC] Switch a few uses of undef to poison as placeholders for unreachable code 2022-07-23 21:50:11 +01:00
Philip Reames bd75350180 [LV] Fix a conceptual mistake around meaning of uniform in isPredicatedInst
This code confuses LV's "Uniform" and LVL/LAI's "Uniform".  Despite the
common name, these are different.
* LVs notion means that only the first lane *of each unrolled part* is
  required.  That is, lanes within a single unroll factor are considered
  uniform.  This allows e.g. widenable memory ops to be considered
  uses of uniform computations.
* LVL and LAI's notion refers to all lanes across all unrollings.

IsUniformMem is in turn defined in terms of LAI's notion.  Thus a
UniformMemOpmeans is a memory operation with a loop invariant address.
This means the same address is accessed in every iteration.

The tweaked piece of code was trying to match a uniform mem op (i.e.
fully loop invariant address), but instead checked for LV's notion of
uniformity.  In theory, this meant with UF > 1, we could speculate
a load which wasn't safe to execute.

This ends up being mostly silent in current code as it is nearly
impossible to create the case where this difference is visible.  The
closest I've come in the test case from 54cb87, but even then, the
incorrect result is only visible in the vplan debug output; before this
change we sink the unsafely speculated load back into the user's predicate
blocks before emitting IR.  Both before and after IR are correct so the
differences aren't "interesting".

The other test changes are uninteresting.  They're cases where LV's uniform
analysis is slightly weaker than SCEV isLoopInvariant.
2022-07-21 15:44:34 -07:00
Philip Reames 54cb87964d [LV] Add a load focused version of the r45679 test
This a reproducer for bug in predicated instruction handling.  The final result code is correct, but the reasoning by which we get there isn't.
2022-07-21 15:33:42 -07:00
Philip Reames 83993d666b [LV][SVE] Autogen a test for ease of update 2022-07-21 13:12:53 -07:00
Philip Reames 27945f9282 [RISCV][LV] Split coverage of uniform load with outside use
Turns out this has a large effect of tail folding, so split out a single test to cover that case and remove it from the others.
2022-07-21 12:07:26 -07:00
Philip Reames bb5dc2918f {RISCV][LV] Add tail folding coverage of uniform load store cases 2022-07-21 11:15:36 -07:00
Philip Reames 56a25ed208 {RISCV][LV] Add a test for uniform store of a loop varying value 2022-07-21 11:15:36 -07:00
Philip Reames 0ae46693f0 {RISCV][LV] Split out and expand tests for uniform loads and stores 2022-07-21 10:42:18 -07:00
David Sherwood f15b6b2907 [AArch64] Add target hook for preferPredicateOverEpilogue
This patch adds the AArch64 hook for preferPredicateOverEpilogue,
which currently returns true if SVE is enabled and one of the
following conditions (non-exhaustive) is met:

1. The "sve-tail-folding" option is set to "all", or
2. The "sve-tail-folding" option is set to "all+noreductions"
and the loop does not contain reductions,
3. The "sve-tail-folding" option is set to "all+norecurrences"
and the loop has no first-order recurrences.

Currently the default option is "disabled", but this will be
changed in a later patch.

I've added new tests to show the options behave as expected here:

  Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll

Differential Revision: https://reviews.llvm.org/D129560
2022-07-21 17:20:06 +01:00
David Sherwood ceb6c23b70 [NFC][LoopVectorize] Explicitly disable tail-folding on some SVE tests
This patch is in preparation for enabling vectorisation with tail-folding
by default for SVE targets. Once we do that many existing tests will
break that depend upon having normal unpredicated vector loops. For
all such tests I have added the flag:

  -prefer-predicate-over-epilogue=scalar-epilogue

Differential Revision: https://reviews.llvm.org/D129137
2022-07-21 15:23:00 +01:00
Philip Reames f934b9b073 [LV] Refresh a couple of autogen tests for naming change
These appear to just be changes in temporary identifiers; bit suprising we have so many.
2022-07-20 14:47:52 -07:00
Philip Reames 1a73ef75fa [LV] Autogen a test for ease of update 2022-07-20 08:19:38 -07:00
Philip Reames be25f52fec [LV] Autogen several tests for ease of update in upcoming change 2022-07-20 07:17:51 -07:00
Philip Reames 523a526a02 [LV] Fix miscompile due to srem/sdiv speculation safety condition
An srem or sdiv has two cases which can cause undefined behavior, not just one. The existing code did not account for this, and as a result, we miscompiled when we encountered e.g. a srem i64 %v, -1 in a conditional block.

Instead of hand rolling the logic, just use the utility function which exists exactly for this purpose.

Differential Revision: https://reviews.llvm.org/D130106
2022-07-20 05:35:23 -07:00
David Sherwood 79660d339e [LoopVectorize][AArch64] Add TTI hook preferPredicatedReductionSelect
By default if SVE is enabled we want the select instruction used for
reductions to be inside the loop, rather than outside. This makes it
possible for the backend to fold the select into the operation to
produce a single predicated add, fadd, etc.

Differential Revision: https://reviews.llvm.org/D129763
2022-07-20 09:33:29 +01:00
Philip Reames f1243fa193 [LV] Autogen a partially autogened test for ease of update 2022-07-19 14:18:53 -07:00
Philip Reames 8353403f08 [LV] Add test for generic predicated sdiv 2022-07-19 12:33:36 -07:00
Philip Reames 2247fe856a [LV] Add test coverage for a bug in srem handling 2022-07-19 11:29:17 -07:00
Philip Reames b7d3ba4bdb [LV] Add test coverage for scalable div/rem patterns 2022-07-19 11:02:14 -07:00
David Sherwood 34f81cfa3d [LoopVectorize][NFC] Split reductions out from sve-tail-folding into new file
In sve-tail-folding-reductions.ll I've also added an extra RUN line
to test normal reductions, i.e. not in-loop. This patch is a pre-commit
in preparation for a follow-on patch that changes how reduction selects
are generated in the vector loop.

Differential Revision: https://reviews.llvm.org/D129761
2022-07-18 13:56:39 +01:00
David Sherwood 1e77b0c871 [AArch64][NFC] Simplify loop vectoriser tail-folding tests
I've simplified all of the SVE vectoriser tail-folding tests to
only care about testing the flag:

  -prefer-predicate-over-epiloge=predicate-else-scalar-epilogue

In practice we always want to fall back on unpredicated vector
loops if tail-folding is not possible.

Differential Revision: https://reviews.llvm.org/D129843
2022-07-18 13:37:29 +01:00
Graham Hunter db8fcb2c25 [LAA] Add recursive IR walker for forked pointers
This builds on the previous forked pointers patch, which only accepted
a single select as the pointer to check. A recursive function to walk
through IR has been added, which searches for either a loop-invariant
or addrec SCEV.

This will only handle a single fork at present, so selects of selects
or a GEP with a select for both the base and offset will be rejected.

There is also a recursion limit with a cli option to change it.

Reviewed By: fhahn, david-arm

Differential Revision: https://reviews.llvm.org/D108699
2022-07-18 12:06:17 +01:00
Florian Hahn 105032f549
[LV] Use PHI recipe instead of PredRecipe for subsequent uses.
At the moment, the VPPRedInstPHIRecipe is not used in subsequent uses of
the predicate recipe. This incorrectly models the def-use chains, as all
later uses should use the phi recipe. Fix that by delaying recording of
the recipe.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D129436
2022-07-18 09:35:34 +01:00
Florian Hahn 6813b41d57
[LV] Avoid creating new run-time VF expression for each runtime checks.
At the moment, the cost of runtime checks for scalable vectors is
overestimated due to creating separate vscale * VF expressions for each
check. Instead re-use the first expression.
2022-07-16 17:24:07 +01:00
Florian Hahn aa00fb02c9
[LV] Use umax(VF * UF, MinProfTC) for scalable vectors.
For scalable vectors, it is not sufficient to only check
MinProfitableTripCount if it is >= VF.getKnownMinValue() * UF, because
this property may not holder for larger values of vscale. In those
cases, compute umax(VF * UF, MinProfTC) instead.

This should fix
https://lab.llvm.org/buildbot/#/builders/197/builds/2262
2022-07-15 10:23:14 -07:00
Florian Hahn 4c85a01758
[LV] Add scalable vector test showing incorrect min-trip count check.
The test shows a case where the minimum trip count check incorrectly
only checks the minimum profitable trip count computed due to runtime
checks. This is incorrect for scalable VFs, because the VF * UF may
exceed the minimum profitable trip count for vscale > 1.

This is the likely reason for
https://lab.llvm.org/buildbot/#/builders/197/builds/2262 failing.
2022-07-15 10:02:55 -07:00
Mel Chen bd404fbcc8 [LV][NFC] Fix the condition for printing debug messages
Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D128523
2022-07-15 01:47:33 -07:00
Mel Chen ae15e6a952 [LV] Pre-commit test case for D128523, NFC 2022-07-15 01:22:06 -07:00
David Sherwood 307ace7f20 [LoopVectorize] Ensure the VPReductionRecipe is placed after all it's inputs
When vectorising ordered reductions we call a function
LoopVectorizationPlanner::adjustRecipesForReductions to replace the
existing VPWidenRecipe for the fadd instruction with a new
VPReductionRecipe. We attempt to insert the new recipe in the same
place, but this is wrong because createBlockInMask may have
generated new recipes that VPReductionRecipe now depends upon. I
have changed the insertion code to append the recipe to the
VPBasicBlock instead.

Added a new RUN with tail-folding enabled to the existing test:

  Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll

Differential Revision: https://reviews.llvm.org/D129550
2022-07-13 09:29:25 +01:00
Nikita Popov a5ee62a141 [IndVars] Call replaceLoopPHINodesWithPreheaderValues() for already constant exits
Currently we only call replaceLoopPHINodesWithPreheaderValues() if
optimizeLoopExits() replaces the exit with an unconditional exit.
However, it is very common that this already happens as part of
eliminateIVComparison(), in which case we're leaving behind the
dead header phi.

Tweak the early bailout for already-constant exits to also call
replaceLoopPHINodesWithPreheaderValues().

Differential Revision: https://reviews.llvm.org/D129214
2022-07-13 09:43:21 +02:00
David Sherwood 6b694d600a [LoopVectorize] Change PredicatedBBsAfterVectorization to be per VF
When calculating the cost of Instruction::Br in getInstructionCost
we query PredicatedBBsAfterVectorization to see if there is a
scalar predicated block. However, this meant that the decisions
being made for a given fixed-width VF were affecting the cost for a
scalable VF. As a result we were returning InstructionCost::Invalid
pointlessly for a scalable VF that should have a low cost. I
encountered this for some loops when enabling tail-folding for
scalable VFs.

Test added here:

  Transforms/LoopVectorize/AArch64/sve-tail-folding-cost.ll

Differential Revision: https://reviews.llvm.org/D128272
2022-07-12 14:53:20 +01:00
Nikita Popov 4bb7b6fae3 [IR] Remove support for float binop constant expressions
As part of https://discourse.llvm.org/t/rfc-remove-most-constant-expressions/63179,
this removes support for the floating-point binop constant expressions
fadd, fsub, fmul, fdiv and frem.

As part of this change, the C APIs LLVMConstFAdd, LLVMConstFSub,
LLVMConstFMul, LLVMConstFDiv and LLVMConstFRem are removed.
The LLVMBuild APIs should be used instead.

Differential Revision: https://reviews.llvm.org/D129478
2022-07-12 09:40:49 +02:00
David Sherwood 03fee6712a [LoopVectorize] Add option to use active lane mask for loop control flow
Currently, for vectorised loops that use the get.active.lane.mask
intrinsic we only use the mask for predicated vector operations,
such as masked loads and stores, etc. The loop itself is still
controlled by comparing the canonical induction variable with the
trip count. However, for some targets this is inefficient when it's
cheap to use the mask itself to control the loop.

This patch adds support for using the active lane mask for control
flow by:

1. Generating the active lane mask for the next iteration of the
vector loop, rather than the current one. If there are still any
remaining iterations then at least the first bit of the mask will
be set.
2. Extract the first bit of this mask and use this bit for the
conditional branch.

I did this by creating a new VPActiveLaneMaskPHIRecipe that sets
up the initial PHI values in the vector loop pre-header. I've also
made use of the new BranchOnCond VPInstruction for the final
instruction in the loop region.

Differential Revision: https://reviews.llvm.org/D125301
2022-07-11 13:46:55 +01:00
Philip Reames b12930e133 [RISCV] Switch to using get.active.lane.mask when tail folding
The motivation here is to a) bring us closer into alignment with AArch64 under the assumption that codepath is better tested, and b) simplify pattern matching in an upcoming change.

The immediate impact is a significant IR reduction but a fairly minimal change in the generated assembly. Due to a difference in expansion behavior we get a saturating add vs an unsaturating one for the old code, but that's about it. This difference comes down to different handling of overflow, which doesn't seem to be possible here anyways, so the assembly codegen is arguably a minor regression. I don't expect that to matter in practice.

Differential Revision: https://reviews.llvm.org/D129221
2022-07-08 10:24:59 -07:00
Florian Hahn 12f9c7b270
[LV] Update RISCV test missed by bc19b7c3cc. 2022-07-07 08:51:15 -07:00
Florian Hahn bc19b7c3cc
[LV] Remove collectTriviallyDeadInstructions, already handled by VP DCE.
Now that removeDeadRecipes can remove most dead recipes across a whole
VPlan, there is no need to first collect some dead instructions.
Instead removeDeadRecipes can simply clean them up.

Depends D127580.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D128408
2022-07-07 08:40:27 -07:00
David Green 438ffdb821 [ARM] Switch the costs of mve1beat and mve4beat
These three subtarget features are meant to control where MVE
instructions take 1 vs 2 vs 4 architectural beats. The mve1beat feature
is described as "Model MVE instructions as a 1 beat per tick
architecture", meaning MVE instruction will execute over 4 cycles.
mve4beat is the opposite where the entire 4 beats of the MVE instruction
execute in a single cycle. The costs for the two were backwards though,
not matching the cycle counts like they should. This patch switches the
costs on the two to bring them in-line with expectations.

Differential Revision: https://reviews.llvm.org/D129141
2022-07-07 16:10:00 +01:00
Florian Hahn 17d48c3169
[VPlan] Move remove dead recipes before merging regions.
This can enable additional region merging,  while not losing
opportunities as region merging does not produce dead recipes.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D128831
2022-07-06 20:38:38 -07:00
Florian Hahn 6826047381
[LV] Remove redundant checks from recurrence test.
The removed CHECK configurations are tested as well below, modulo the
dce/instcombine runs. This makes them redundant, and removing them
removes a substantial amount of uneeded checks.
2022-07-06 15:31:57 -07:00
Philip Reames b9513a70e1 [RISCV] Autogen a vectorizer test for ease of update 2022-07-06 09:35:02 -07:00
Florian Hahn e4613d8e1b
[LV] Remove unnecessary memory checks from recurrence test
The tests are focused on code-gen for first-order recurrences. There are
plenty of tests specifically for runtime check generation. Using noalias
to avoid runtime checks slightly simplifies the test output and ensures
the checks focus on the relevant bits and ensures the checks focus on
the relevant bits and ensures the checks focus on the relevant bits and
ensures the checks focus on the relevant bits.
2022-07-06 09:08:29 -07:00
Philip Reames 0f49d9e8d0 [RISCV] Add test coverage for vectorizer tailfolding
As can be seen in the check lines, we have a lot of work to do.
2022-07-06 09:06:20 -07:00
Nikita Popov 11950efe06 [ConstExpr] Remove div/rem constant expressions
D128820 stopped creating div/rem constant expressions by default;
this patch removes support for them entirely.

The getUDiv(), getExactUDiv(), getSDiv(), getExactSDiv(), getURem()
and getSRem() on ConstantExpr are removed, and ConstantExpr::get()
now only accepts binary operators for which
ConstantExpr::isSupportedBinOp() returns true. Uses of these methods
may be replaced either by corresponding IRBuilder methods, or
ConstantFoldBinaryOpOperands (if a constant result is required).

On the C API side, LLVMConstUDiv, LLVMConstExactUDiv, LLVMConstSDiv,
LLVMConstExactSDiv, LLVMConstURem and LLVMConstSRem are removed and
corresponding LLVMBuild methods should be used.

Importantly, this also means that constant expressions can no longer
trap! This patch still keeps the canTrap() method to minimize diff --
I plan to drop it in a separate NFC patch.

Differential Revision: https://reviews.llvm.org/D129148
2022-07-06 10:11:34 +02:00
Florian Hahn 644a965c1e
[LV] Vectorize cases with larger number of RT checks, execute only if profitable.
This patch replaces the tight hard cut-off for the number of runtime
checks with a more accurate cost-driven approach.

The new approach allows vectorization with a larger number of runtime
checks in general, but only executes the vector loop (and runtime checks) if
considered profitable at runtime. Profitable here means that the cost-model
indicates that the runtime check cost + vector loop cost < scalar loop cost.

To do that, LV computes the minimum trip count for which runtime check cost
+ vector-loop-cost < scalar loop cost.

Note that there is still a hard cut-off to avoid excessive compile-time/code-size
increases, but it is much larger than the original limit.

The performance impact on standard test-suites like SPEC2006/SPEC2006/MultiSource
is mostly neutral, but the new approach can give substantial gains in cases where
we failed to vectorize before due to the over-aggressive cut-offs.

On AArch64 with -O3, I didn't observe any regressions outside the noise level (<0.4%)
and there are the following execution time improvements. Both `IRSmk` and `srad` are relatively short running, but the changes are far above the noise level for them on my benchmark system.

```
CFP2006/447.dealII/447.dealII    -1.9%
CINT2017rate/525.x264_r/525.x264_r    -2.2%
ASC_Sequoia/IRSmk/IRSmk       -9.2%
Rodinia/srad/srad     -36.1%
```

`size` regressions on AArch64 with -O3 are

```
MultiSource/Applications/hbd/hbd                 90256.00   106768.00 18.3%
MultiSourc...ks/ASCI_Purple/SMG2000/smg2000     240676.00   257268.00  6.9%
MultiSourc...enchmarks/mafft/pairlocalalign     472603.00   489131.00  3.5%
External/S...2017rate/525.x264_r/525.x264_r     613831.00   630343.00  2.7%
External/S...NT2006/464.h264ref/464.h264ref     818920.00   835448.00  2.0%
External/S...te/538.imagick_r/538.imagick_r    1994730.00  2027754.00  1.7%
MultiSourc...nchmarks/tramp3d-v4/tramp3d-v4    1236471.00  1253015.00  1.3%
MultiSource/Applications/oggenc/oggenc         2108147.00  2124675.00  0.8%
External/S.../CFP2006/447.dealII/447.dealII    4742999.00  4759559.00  0.3%
External/S...rate/510.parest_r/510.parest_r   14206377.00 14239433.00  0.2%
```

Reviewed By: lebedev.ri, ebrevnov, dmgreen

Differential Revision: https://reviews.llvm.org/D109368
2022-07-04 15:11:39 +01:00
Florian Hahn 0dddf04cab
[LV] Don't optimize exit cond during epilogue vectorization.
At the moment, the same VPlan can be used code generation of both the
main vector and epilogue vector loop. This can lead to wrong results, if
the plan is optimized based on the VF of the main vector loop and then
re-used for the epilogue loop.

One example where this is problematic is if the scalar loops need to
execute at least one iteration, e.g. due to interleave groups.

To prevent mis-compiles in the short-term, disable optimizing exit
conditions for VPlans when using epilogue vectorization. The proper fix
is to avoid re-using the same plan for both loops, which will require
support for cloning plans first.

Fixes #56319.
2022-07-01 13:48:38 +01:00
Florian Hahn afd9f422e4
[LV] Update test for #56319 to use interleave group.
The original test was over-reduced. It requires an interleave group, so
the last vector iteration of the epilogue vector loop doesn't execute.
2022-07-01 11:12:00 +01:00
Florian Hahn 8704cfc744
[LV] Add test case for #56319.
Test case for PR56319.
2022-07-01 10:09:24 +01:00
Sanjay Patel cc88445a91 [InstCombine] canonicalize 'icmp (trunc X), C' to 'icmp (X & Mask), C'
I looked at canonicalizing in the other direction, but that causes
many potential regressions and infinite loops because we already
(possibly wrongly) canonicalize "trunc X to i1" into an and+icmp.

This has a data layout restriction to avoid creating illegal
mask instructions, but we could remove that if we can show
that the backend can undo this when needed.

The motivating example from issue #56119 is modeled by the
PhaseOrdering test.
2022-06-30 15:51:39 -04:00
Florian Hahn 68884dde70
[LV] Move LoopVersioning creation to LVP::execute.
At the moment LoopVersioning is only created for inner-loop
vectorization. This patch moves it to LVP::execute, which means it will
also be added for epilogue vectorization. As a consequence, the proper
noalias metadata is now also added to epilogue vector loops.

LVer will be moved to VPTransformState as follow-up.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D127966
2022-06-30 12:14:32 +01:00
Florian Hahn 24b5f8e0d0
[VPlan] Make sure optimizeInductions removes wide ind from scalar plan.
In some cases, there may be widened users of inductions even though the
plan includes the scalar VF. In those cases, make sure we still replace
the VPWidenIntOrFpInductionRecipe with scalar steps, as otherwise we may
try to execute a VPWidenIntOrFpInductionRecipe with a scalar VF.

Alternatively the patch could also split the range if needed.

This fixes a crash exposed by D123720.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D128755
2022-06-30 09:11:48 +01:00
Florian Hahn 476f9c909c
[LV] Add test case showing dead recipe blocking region merging. 2022-06-29 16:34:12 +01:00
Philip Reames f239cddbac [RISCV] Pin two tests to fixed length vectorization to preserve test intent 2022-06-28 13:53:31 -07:00
Philip Reames 20dd3297b1 [LV] Allow scalable vectorization with vscale = 1
This change is a bit subtle. If we have a type like <vscale x 1 x i64>, the vectorizer will currently reject vectorization. The reason is that a type like <1 x i64> is likely to get simply rescalarized, and the vectorizer doesn't want to be in the game of simple unrolling.

(I've given the example in terms of 1 x types which use a single register, but the same issue exists for any N x types which use N registers. e.g. RISCV LMULs.)

This change distinguishes scalable types from fixed types under the reasoning that converting to a scalable type isn't unrolling. Because the actual vscale isn't known until runtime, using a vscale type is potentially very profitable.

This makes an important, but unchecked, assumption. Specifically, the scalable type is assumed to only be legal per the cost model if there's actually a scalable register class which is distinct from the scalar domain. This is, to my knowledge, true for all targets which return non-invalid costs for scalable vector ops today, but in theory, we could have a target decide to lower scalable to fixed length vector or even scalar registers. If that ever happens, we'd need to revisit this code.

In practice, this patch unblocks scalable vectorization for ELEN types on RISCV.

Let me sketch one alternate implementation I considered. We could have restricted this to when we know a minimum value for vscale. Specifically, for the default +v extension for RISCV, we actually know that vscale >= 2 for ELEN types. However, doing it this way means we can't generate scalable vectors when using the various embedded vector extensions which have a minimum vscale of 1.

Differential Revision: https://reviews.llvm.org/D128542
2022-06-27 13:38:57 -07:00
Philip Reames 9803b0d1e7 [RISCV] Implement getVScaleForTuning and thus prefer scalable vectorization when enabled
LoopVectorizer uses getVScaleForTuning for deciding how to discount the cost of a potential vector factor by the amount of work performed. Without the callback implemented, the vectorizer was defaulting to an estimated vscale of 1. This results in fixed vectorization looking falsely profitable (since it used the command line VLEN).

The test change is pretty limited since a) we don't have much coverage of the vectorizer with scalable vectors at all, and b) what little coverage we have mostly uses i64 element types. There's a separate issue with <vscale x 1 x i64> which prevents us from getting to this stage of costing, and thus only the one test explicitly written to avoid that is visible in the diff. However, this is actually a very wide impact change as it changes the practical vectorization result when both fixed and scalable is enabled to scalable.

As an aside, I think the vectorizer is at little too strongly biased towards scalable when both are legal, but we can explore that separately. For now, let's just get the cost model working the way it was intended.

Differential Revision: https://reviews.llvm.org/D128547
2022-06-25 11:25:23 -07:00
Philip Reames ae8fac6f98 [LV][RISCV] Add coverage showing scalable codegen when etype != ELEN
We currently have a costing bug around the etype == ELEN case, so add otherwise duplicate tests to show test diffs as I work on other parts of costing.
2022-06-24 11:38:54 -07:00
Philip Reames 056d63938a [RISCV] Split a vectorizer test runline so that upcoming changes in defaults are visible 2022-06-24 08:48:11 -07:00
Philip Reames adbe718675 [RISCV] Modify a test line so it exercises the intended configuration once we turn on scalable vectorization 2022-06-24 08:48:11 -07:00
Philip Reames 46ea4b5ea1 [LV] Avoid a crash when costing a uniform store which doesn't correspond to a legal scatter
If we have an unaligned uniform store, then when costing a scalable VF we can't emit code to scalarize it.  (Well, we could, but we haven't implemented that case.)  This change replaces an assert with a cost-model bailout such that we reject vectorization with the scalable VF instead of crashing.
2022-06-23 12:41:09 -07:00