Commit Graph

358 Commits

Author SHA1 Message Date
Arthur Eubanks e23aee7175 [test] Update some legacy PM tests 2022-09-30 11:31:02 -07:00
Philip Reames dc7387b587 [LV] Adjust cost model to use uniform store lowering for unpredicated uniform stores
Follow up to D133580; adjust the cost model to prefer uniform store lowering for scalable stores which are unpredicated.

The impact here isn't in the uniform store lowering quality itself. InstCombine happily converts the scatter form into the single store form. The main impact is in letting the rest of the cost model make choices based on the knowledge that the vector will be scalarized on use.

Differential Revision: https://reviews.llvm.org/D134460
2022-09-27 07:28:40 -07:00
Florian Hahn 2c692d891e
[LV] Update handling of scalable pointer inductions after b73d2c8.
The dependent code has been changed quite a lot since 151c144 which
b73d2c8 effectively reverts. Now we run into a case where lowering
didn't expect/support the behavior pre 151c144 any longer.

Update the code dealing with scalable pointer inductions to also check
for uniformity in combination with isScalarAfterVectorization. This
should ensure scalable pointer inductions are handled properly during
epilogue vectorization.

Fixes #57912.
2022-09-23 18:23:02 +01:00
Florian Hahn 17167005d5
[LV] Add test for #57912.
Add test showing miscompilation during epilogue vectorization with SVE.
2022-09-23 11:49:55 +01:00
Florian Hahn 05b3493819
[LV] Convert sve-epilog-vect.ll to use opaque pointers. 2022-09-23 10:24:19 +01:00
Philip Reames 32dc1151e2 [VPlan] Only generate single instr for unpredicated stores of varying value to invariant address
This extends the previously added uniform store case to handle stores of loop varying values to a loop invariant address. Note that the placement of this code only allows unpredicated stores; this is important for correctness. (That is "IsPredicated" is always false at this point in the function.)

This patch does not include scalable types. The diff felt "large enough" as it were; I'll handle that in a separate patch. (It requires some changes to cost modeling.)

Differential Revision: https://reviews.llvm.org/D133580
2022-09-22 08:53:46 -07:00
Sanjay Patel 0f32a5dea0 [InstCombine] don't canonicalize shl+sub to mul+add
This stops Negator from transforming:
`C1 - shl X, C2 --> mul X, (1<<C2) + C1`
...in the general case. There does not seem to be any analysis
benefit to using mul in IR, and there's definitely downside in
codegen (particularly when the multiply has to be expanded).

If `C1` is 0, then there's a stronger argument that the single
mul is a better canonicalization than negate-of-shl, but we may
want to remove that too.

This was noted as a potential conflict for D133667.

Differential Revision: https://reviews.llvm.org/D134310
2022-09-21 08:39:07 -04:00
Simon Pilgrim 09cb9fdef9 [InstCombine] Fold ult(add(x,-1),c) -> ule(x,c) iff x != 0 (PR57635)
Alive2: https://alive2.llvm.org/ce/z/sZ6wwS

As detailed on Issue #57635 and #37628 - for unsigned comparisons, we can compare prior to a decrement iff the value is known never to be zero.

Differential Revision: https://reviews.llvm.org/D134172
2022-09-20 16:44:41 +01:00
Vitaly Buka bbef90ace4 [IRBuilder] Use PoisonValue in CreateMasked*
Followup to 72b776168c

Reviewed By: nlopes

Differential Revision: https://reviews.llvm.org/D133967
2022-09-19 11:01:41 -07:00
Florian Hahn 582f8ef19f
[LV] Keep track of cost-based ScalarAfterVec in VPWidenPointerInd.
Epilogue vectorization uses isScalarAfterVectorization to check if
widened versions for inductions need to be generated and bails out in
those cases.

At the moment, there are scenarios where isScalarAfterVectorization
returns true but VPWidenPointerInduction::onlyScalarsGenerated would
return false, causing widening.

This can lead to widened phis with incorrect start values being created
in the epilogue vector body.

This patch addresses the issue by storing the cost-model decision in
VPWidenPointerInductionRecipe and restoring the behavior before 151c144.
This effectively reverts 151c144, but the long-term fix is to properly
support widened inductions during epilogue vectorization

Fixes #57712.
2022-09-19 18:14:35 +01:00
Florian Hahn f02ff5348f
[LV] Move new epilog-vectorization-widen-inductions.ll to AArch64 dir.
The test requires the AArch64 backend, so move it to the right subdir.
2022-09-19 17:13:06 +01:00
Sanjay Patel d6498abc24 [InstCombine] remove multi-use add demanded constant fold
This was originally part of D133788. There are no visible
regressions. All of the diffs show a large unsigned constant
becoming a small negative constant. This should be better
for analysis (and slightly less compile-time) and codegen.
2022-09-18 14:23:43 -04:00
Graham Hunter 1f639d1bd2 [NFC][LV] Convert masked call tests to use update script 2022-09-09 10:07:39 +01:00
Florian Hahn fc444ddc77
[VPlan] Add field to track if intrinsic should be used for call. (NFC)
This patch moves the cost-based decision whether to use an intrinsic or
library call to the point where the recipe is created. This untangles
code-gen from the cost model and also avoids doing some extra work as
the information is already computed at construction.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D132585
2022-09-01 13:14:40 +01:00
Florian Hahn 1ed555a62b
[LV] Fix test cases where vector loop never executed.
It looks like the vector loops in the modified test cases
unintentionally never get executed. Update the exit condition to ensure
it does to avoid them getting optimized away in upcoming changes.
2022-08-31 13:24:49 +01:00
Philip Reames 4c10646367 [LV] Refresh autogen tests to reflect naming changes [nfc]
Purely so that these can be easily autogened without spurious diffs
2022-08-29 14:16:54 -07:00
Florian Hahn 005d1a8ff5
[LV] Add test where either a libfunc or intrinsic is chosen.
In the newly added test either a libfunc (VF=2) or a intrinsic (VF=4)
can be chosen.

Test coverage for D132585.
2022-08-29 10:51:20 +01:00
Philip Reames f79214d1e1 [LV] Support predicated div/rem operations via safe-divisor select idiom
This patch adds support for vectorizing conditionally executed div/rem operations via a variant of widening. The existing support for predicated divrem in the vectorizer requires scalarization which we can't do for scalable vectors.

The basic idea is that we can always divide (take remainder) by 1 without executing UB. As such, we can use the active lane mask to conditional select either the actual divisor for active lanes, or a constant one for inactive lanes. We already account for the cost of the active lane mask, so the only additional cost is a splat of one and the vector select. This is one of several possible approaches to this problem; see the review thread for discussion on some of the others.  This one was chosen mostly because it was straight forward, and none of the others seemed oviously better.

I enabled the new code only for scalable vectors. We could also legally enable it for fixed vectors as well, but I haven't thought through the cost tradeoffs between widening and scalarization enough to know if that's profitable. This will be explored in future patches.

Differential Revision: https://reviews.llvm.org/D130164
2022-08-24 10:07:59 -07:00
David Green e29f9f7572 [AArch64][X86] Add some fixed-order-recurrence tests to check the costmodel of fixed order recurrences. NFC 2022-08-24 08:18:01 +01:00
Graham Hunter 14212c968f [NFC][LoopVectorize] Precommit masked vector function call tests 2022-08-23 09:47:10 +01:00
David Sherwood 666d2a925f [SVE][LoopVectorize][NFC] Tidy up some tests
Whilst writing a patch to add extra tail-folding RUN lines to
existing tests I noticed a few areas where they can be
cleaned up a little:

1. scalable-reductions.ll: fmin_fast does not mark fcmp as fast.
2. sve-inductions-unusual-types.ll: remove direct references to
   SSA variable names.
3. sve-strict-fadd-cost.ll: don't force vector width so we see
   costs for different VFs in one go. This will be important for
   the follow-on patch.
4. sve-vector-reverse.ll,vector-reverse-mask4.ll: add noalias
   keyword to simplify IR.
4. sve-widen-gep.ll,sve-widen-phi.ll: regenerate using script.

These changes will make the subsequent patch adding RUN lines much
easier to review!

Differential Revision: https://reviews.llvm.org/D132219
2022-08-19 15:12:58 +01:00
Zain Jaffal 94d21a94d9
[AArch64] Add tests to check for loop vectorization of non temporal loads
Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D131899
2022-08-16 09:40:51 +01:00
Martin Sebor 0dcfe7aa35 [InstCombine] Tighten up known library function signature tests (PR #56463)
Replace a switch statement used to validate arguments to known library
functions with a more consistent table-driven approach and tighten it
up.
2022-08-10 14:15:46 -06:00
Dinar Temirbulatov cab6cd6834 [AArch64][LoopVectorize] Introduce trip count minimal value threshold to ignore tail-folding.
After D121595 was commited, I noticed regressions assosicated with small trip
count numbersvectorisation by tail folding with scalable vectors. As a solution
for those issues I propose to introduce the minimal trip count threshold value.

  Differential Revision: https://reviews.llvm.org/D130755
2022-08-09 22:10:17 +01:00
David Sherwood 4ef9cb6c17 [AArch64][LoopVectorize] Disable tail-folding for SVE when loop has interleaved accesses
If we have interleave groups in the loop we want to vectorise then
we should fall back on normal vectorisation with a scalar epilogue. In
such cases when tail-folding is enabled we'll almost certainly go on to
create vplans with very high costs for all vector VFs and fall back on
VF=1 anyway. This is likely to be worse than if we'd just used an
unpredicated vector loop in the first place.

Once the vectoriser has proper support for analysing all the costs
for each combination of VF and vectorisation style, then we should
be able to remove this.

Added an extra test here:

  Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll

Differential Revision: https://reviews.llvm.org/D128342
2022-08-02 09:52:33 +01:00
Philip Reames 83993d666b [LV][SVE] Autogen a test for ease of update 2022-07-21 13:12:53 -07:00
David Sherwood f15b6b2907 [AArch64] Add target hook for preferPredicateOverEpilogue
This patch adds the AArch64 hook for preferPredicateOverEpilogue,
which currently returns true if SVE is enabled and one of the
following conditions (non-exhaustive) is met:

1. The "sve-tail-folding" option is set to "all", or
2. The "sve-tail-folding" option is set to "all+noreductions"
and the loop does not contain reductions,
3. The "sve-tail-folding" option is set to "all+norecurrences"
and the loop has no first-order recurrences.

Currently the default option is "disabled", but this will be
changed in a later patch.

I've added new tests to show the options behave as expected here:

  Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll

Differential Revision: https://reviews.llvm.org/D129560
2022-07-21 17:20:06 +01:00
David Sherwood ceb6c23b70 [NFC][LoopVectorize] Explicitly disable tail-folding on some SVE tests
This patch is in preparation for enabling vectorisation with tail-folding
by default for SVE targets. Once we do that many existing tests will
break that depend upon having normal unpredicated vector loops. For
all such tests I have added the flag:

  -prefer-predicate-over-epilogue=scalar-epilogue

Differential Revision: https://reviews.llvm.org/D129137
2022-07-21 15:23:00 +01:00
David Sherwood 79660d339e [LoopVectorize][AArch64] Add TTI hook preferPredicatedReductionSelect
By default if SVE is enabled we want the select instruction used for
reductions to be inside the loop, rather than outside. This makes it
possible for the backend to fold the select into the operation to
produce a single predicated add, fadd, etc.

Differential Revision: https://reviews.llvm.org/D129763
2022-07-20 09:33:29 +01:00
Philip Reames f1243fa193 [LV] Autogen a partially autogened test for ease of update 2022-07-19 14:18:53 -07:00
David Sherwood 34f81cfa3d [LoopVectorize][NFC] Split reductions out from sve-tail-folding into new file
In sve-tail-folding-reductions.ll I've also added an extra RUN line
to test normal reductions, i.e. not in-loop. This patch is a pre-commit
in preparation for a follow-on patch that changes how reduction selects
are generated in the vector loop.

Differential Revision: https://reviews.llvm.org/D129761
2022-07-18 13:56:39 +01:00
David Sherwood 1e77b0c871 [AArch64][NFC] Simplify loop vectoriser tail-folding tests
I've simplified all of the SVE vectoriser tail-folding tests to
only care about testing the flag:

  -prefer-predicate-over-epiloge=predicate-else-scalar-epilogue

In practice we always want to fall back on unpredicated vector
loops if tail-folding is not possible.

Differential Revision: https://reviews.llvm.org/D129843
2022-07-18 13:37:29 +01:00
Florian Hahn 6813b41d57
[LV] Avoid creating new run-time VF expression for each runtime checks.
At the moment, the cost of runtime checks for scalable vectors is
overestimated due to creating separate vscale * VF expressions for each
check. Instead re-use the first expression.
2022-07-16 17:24:07 +01:00
Florian Hahn aa00fb02c9
[LV] Use umax(VF * UF, MinProfTC) for scalable vectors.
For scalable vectors, it is not sufficient to only check
MinProfitableTripCount if it is >= VF.getKnownMinValue() * UF, because
this property may not holder for larger values of vscale. In those
cases, compute umax(VF * UF, MinProfTC) instead.

This should fix
https://lab.llvm.org/buildbot/#/builders/197/builds/2262
2022-07-15 10:23:14 -07:00
Florian Hahn 4c85a01758
[LV] Add scalable vector test showing incorrect min-trip count check.
The test shows a case where the minimum trip count check incorrectly
only checks the minimum profitable trip count computed due to runtime
checks. This is incorrect for scalable VFs, because the VF * UF may
exceed the minimum profitable trip count for vscale > 1.

This is the likely reason for
https://lab.llvm.org/buildbot/#/builders/197/builds/2262 failing.
2022-07-15 10:02:55 -07:00
David Sherwood 307ace7f20 [LoopVectorize] Ensure the VPReductionRecipe is placed after all it's inputs
When vectorising ordered reductions we call a function
LoopVectorizationPlanner::adjustRecipesForReductions to replace the
existing VPWidenRecipe for the fadd instruction with a new
VPReductionRecipe. We attempt to insert the new recipe in the same
place, but this is wrong because createBlockInMask may have
generated new recipes that VPReductionRecipe now depends upon. I
have changed the insertion code to append the recipe to the
VPBasicBlock instead.

Added a new RUN with tail-folding enabled to the existing test:

  Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll

Differential Revision: https://reviews.llvm.org/D129550
2022-07-13 09:29:25 +01:00
David Sherwood 6b694d600a [LoopVectorize] Change PredicatedBBsAfterVectorization to be per VF
When calculating the cost of Instruction::Br in getInstructionCost
we query PredicatedBBsAfterVectorization to see if there is a
scalar predicated block. However, this meant that the decisions
being made for a given fixed-width VF were affecting the cost for a
scalable VF. As a result we were returning InstructionCost::Invalid
pointlessly for a scalable VF that should have a low cost. I
encountered this for some loops when enabling tail-folding for
scalable VFs.

Test added here:

  Transforms/LoopVectorize/AArch64/sve-tail-folding-cost.ll

Differential Revision: https://reviews.llvm.org/D128272
2022-07-12 14:53:20 +01:00
David Sherwood 03fee6712a [LoopVectorize] Add option to use active lane mask for loop control flow
Currently, for vectorised loops that use the get.active.lane.mask
intrinsic we only use the mask for predicated vector operations,
such as masked loads and stores, etc. The loop itself is still
controlled by comparing the canonical induction variable with the
trip count. However, for some targets this is inefficient when it's
cheap to use the mask itself to control the loop.

This patch adds support for using the active lane mask for control
flow by:

1. Generating the active lane mask for the next iteration of the
vector loop, rather than the current one. If there are still any
remaining iterations then at least the first bit of the mask will
be set.
2. Extract the first bit of this mask and use this bit for the
conditional branch.

I did this by creating a new VPActiveLaneMaskPHIRecipe that sets
up the initial PHI values in the vector loop pre-header. I've also
made use of the new BranchOnCond VPInstruction for the final
instruction in the loop region.

Differential Revision: https://reviews.llvm.org/D125301
2022-07-11 13:46:55 +01:00
Florian Hahn 644a965c1e
[LV] Vectorize cases with larger number of RT checks, execute only if profitable.
This patch replaces the tight hard cut-off for the number of runtime
checks with a more accurate cost-driven approach.

The new approach allows vectorization with a larger number of runtime
checks in general, but only executes the vector loop (and runtime checks) if
considered profitable at runtime. Profitable here means that the cost-model
indicates that the runtime check cost + vector loop cost < scalar loop cost.

To do that, LV computes the minimum trip count for which runtime check cost
+ vector-loop-cost < scalar loop cost.

Note that there is still a hard cut-off to avoid excessive compile-time/code-size
increases, but it is much larger than the original limit.

The performance impact on standard test-suites like SPEC2006/SPEC2006/MultiSource
is mostly neutral, but the new approach can give substantial gains in cases where
we failed to vectorize before due to the over-aggressive cut-offs.

On AArch64 with -O3, I didn't observe any regressions outside the noise level (<0.4%)
and there are the following execution time improvements. Both `IRSmk` and `srad` are relatively short running, but the changes are far above the noise level for them on my benchmark system.

```
CFP2006/447.dealII/447.dealII    -1.9%
CINT2017rate/525.x264_r/525.x264_r    -2.2%
ASC_Sequoia/IRSmk/IRSmk       -9.2%
Rodinia/srad/srad     -36.1%
```

`size` regressions on AArch64 with -O3 are

```
MultiSource/Applications/hbd/hbd                 90256.00   106768.00 18.3%
MultiSourc...ks/ASCI_Purple/SMG2000/smg2000     240676.00   257268.00  6.9%
MultiSourc...enchmarks/mafft/pairlocalalign     472603.00   489131.00  3.5%
External/S...2017rate/525.x264_r/525.x264_r     613831.00   630343.00  2.7%
External/S...NT2006/464.h264ref/464.h264ref     818920.00   835448.00  2.0%
External/S...te/538.imagick_r/538.imagick_r    1994730.00  2027754.00  1.7%
MultiSourc...nchmarks/tramp3d-v4/tramp3d-v4    1236471.00  1253015.00  1.3%
MultiSource/Applications/oggenc/oggenc         2108147.00  2124675.00  0.8%
External/S.../CFP2006/447.dealII/447.dealII    4742999.00  4759559.00  0.3%
External/S...rate/510.parest_r/510.parest_r   14206377.00 14239433.00  0.2%
```

Reviewed By: lebedev.ri, ebrevnov, dmgreen

Differential Revision: https://reviews.llvm.org/D109368
2022-07-04 15:11:39 +01:00
Malhar Jajoo 6bb40552f2 [LoopVectorize] Add support for invariant stores of ordered reductions
Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D126772
2022-06-17 14:56:21 +01:00
Tiehu Zhang b329156f4f [AArch64][LV] AArch64 does not prefer vectorized addressing
TTI::prefersVectorizedAddressing() try to vectorize the addresses that lead to loads.
For aarch64, only gather/scatter (supported by SVE) can deal with vectors of addresses.
This patch specializes the hook for AArch64, to return true only when we enable SVE.

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D124612
2022-06-17 18:32:50 +08:00
Florian Hahn eaf48dd9b0
[VPlan] Replace BranchOnCount with BranchOnCond if TC <= UF * VF.
Try to simplify BranchOnCount to `BranchOnCond true` if TC <= UF * VF.

This is an alternative to D121899 which simplifies the VPlan directly
instead of doing so late in code-gen.

The potential benefit of doing this in VPlan is that this may help
cost-modeling in the future. The reason this is done in prepareToExecute
at the moment is that a single plan may be used for multiple VFs/UFs.

There are further simplifications that can be applied as follow ups:

1. Replace inductions with constants
2. Replace vector region with regular block.

Fixes #55354.

Depends on D126679.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D126680
2022-06-06 09:38:53 +01:00
Nikita Popov 03aceab08b [ValueTracking] Enable -branch-on-poison-as-ub by default
Now that SimpleLoopUnswitch and other transforms no longer introduce
branch on poison, enable the -branch-on-poison-as-ub option by
default. The practical impact of this is mostly better flag
preservation in SCEV, and some freeze instructions no longer being
necessary.

Differential Revision: https://reviews.llvm.org/D125299
2022-06-01 10:46:06 +02:00
David Green 75631438e3 [AArch64] Costmodel tests for llvm.vscale intrinsics. NFC
These shows that the cost of a @llvm.vscale is indeed 1, not 10.
2022-05-26 10:16:21 +01:00
David Sherwood 87936c7b13 [LoopVectorize] Fix assertion failure in fixReduction when tail-folding
When compiling the attached new test in scalable-reductions-tf.ll we
were hitting this assertion in fixReduction:

  Assertion `isa<PHINode>(U) && "Reduction exit must feed Phi's or select"

The loop contains a reduction and an intermediate store of the reduction
value. When vectorising with tail-folding the contains of 'U' in the
assertion above happened to be a scatter_store. It turns out that we
were still creating a widen recipe for the invariant store, despite
knowing that we can actually sink it. The simplest fix is to change
buildVPlanWithVPRecipes so that we look for invariant stores before
attempting to widen it.

Differential Revision: https://reviews.llvm.org/D126295
2022-05-25 11:46:32 +01:00
Jingu Kang bb82f74612 Revert "Revert "[AArch64] Set maximum VF with shouldMaximizeVectorBandwidth""
This reverts commit 42ebfa8269.

The commmit from https://reviews.llvm.org/D125918 has fixed the stage 2 build
failure.

Differential Revision: https://reviews.llvm.org/D118979
2022-05-23 16:15:45 +01:00
Peter Waller ade47bdc31 [LV] Improve register pressure estimate at high VFs
Previously, `getRegUsageForType` was implemented using
`getTypeLegalizationCost`.  `getRegUsageForType` is used by the loop
vectorizer to estimate the register pressure caused by using a vector
type.  However, `getTypeLegalizationCost` currently only appears to
understand splitting and not scalarization, so significantly
underestimates the register requirements.

Instead, use `getNumRegisters`, which understands when scalarization
can occur (via computeRegisterProperties).

This was discovered while investigating D118979 (Set maximum VF with
shouldMaximizeVectorBandwidth), where under fixed-length 512-bit SVE the
loop vectorizer previously ends up costing an v128i1 as 2 v64i*
registers where it actually occupies 128 i32 registers.

I'm sending this patch early for comment, I'm still doing some sanity checking
with LNT.  I note that getRegisterClassForType appears to return VectorRC even
though the type in question (large vNi1 types) end up occupying scalar
registers. That might be worth fixing too.

Differential Revision: https://reviews.llvm.org/D125918
2022-05-23 07:57:45 +00:00
Florian Hahn 97590baead
[LV] Widen ptr-inductions with scalar uses for scalable VFs.
Current codegen only supports scalarization of pointer inductions for
scalable VFs if they are uniform. After 3bebec659 we now may enter the
scalarization code path in VPWidenPointerInductionRecipe::execute for
scalable vectors.

Fall back to widening for scalable vectors if necessary.

This should fix a build failure when bootstrapping LLVM with SVE, e.g.
https://lab.llvm.org/buildbot/#/builders/176/builds/1723
2022-05-22 16:24:13 +01:00
Florian Hahn 3bebec6592
[VPlan] Model first exit values using VPLiveOut.
This patch introduces a new VPLiveOut subclass of VPUser  to model
 exit values explicitly. The initial version handles exit values that
are neither part of induction or reduction chains nor first order
recurrence phis.

Fixes #51366, #54867, #55167, #55459

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D123537
2022-05-21 16:01:38 +01:00
Florian Hahn cd61d4bd2f
[LV] Do not LoopSimplify/LCSSA after generating main vector loop.
At the moment LV runs LoopSimplify and reconstructs LCSSA form after
generating the main vector loop and before generating the epilogue
vector loop.

In practice, this adds a new exit block for the scalar loop because the
middle block now also branches to the original exit block of the scalar
loop. It also requires adding a new LCSSA phi in the newly created exit
block.

This complicates things when modeling exit values in VPlan, because we
would need to update the VPlan for the epilogue loop to update the newly
created LCSSA phi node.

But none of that should be necessary, as all analysis requiring
loop-simplify form is already done at this point and LCSSA form of the
original loop is not broken.

Reviewed By: bmahjour

Differential Revision: https://reviews.llvm.org/D125810
2022-05-20 09:58:40 +01:00