Commit Graph

337 Commits

Author SHA1 Message Date
Florian Hahn da740492b0
[VPlan] Remove dead header-phi recipes.
This patch adds a new transform to remove dead recipes. For now, it only
removes dead recipes in the header, to keep the number tests that require
updating manageable. Future patches will extend this to remove dead
recipes across the whole plan.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D118051
2022-02-26 16:26:39 +00:00
Kerry McLaughlin 12fb133eba [LoopVectorize] Support conditional in-loop vector reductions
Extends getReductionOpChain to look through Phis which may be part of
the reduction chain. adjustRecipesForReductions will now also create a
CondOp for VPReductionRecipe if the block is predicated and not only if
foldTailByMasking is true.

Changes were required in tryToBlend to ensure that we don't attempt
to convert the reduction Phi into a select by returning a VPBlendRecipe.
The VPReductionRecipe will create a select between the Phi and the reduction.

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D117580
2022-02-22 12:04:35 +00:00
Florian Hahn 446e7c64c7
[LV] Add real uses in some tests, to make them more robust.
Add real uses to some tests, to ensure dead instructions cannot be directly
removed.
2022-02-13 09:52:59 +00:00
David Green b55d4c2ad8 Revert "[LV] Remove `LoopVectorizationCostModel::useEmulatedMaskMemRefHack()`"
This reverts commit 77a0da926c as we've
received multiple reports of this significantly impacting performance,
in ways that don't seem to just be target specific cost models going
wrong. I would offer some reproducers, but the test changes here seem to
be full of them!

Reverting for now and hopefully we can remove the "hack" more carefully
as we go.
2022-02-09 20:02:54 +00:00
David Green b4c6d1bb37 [LoopVectorizer] Don't perform interleaving of predicated scalar loops
The vectorizer will choose at times to "vectorize" loops with a scalar
factor (VF=1) with interleaving (IC > 1). This can occasionally produce
better code than the unroller (notable for reductions where it can
produce independent reduction chains that are combined after the loop).
At times this is not very beneficial though, for example when runtime
checks are needed or when the scalar code requires predication.

This addresses the second point, preventing the vectorizer from
interleaving when the scalar loop will require predication. This
prevents it from making a bit of a mess, that is worse than the original
and better left for the unroller to unroll if beneficial. It helps
reverse some of the regressions from D118090.

Differential Revision: https://reviews.llvm.org/D118566
2022-02-07 19:34:28 +00:00
Roman Lebedev 77a0da926c
[LV] Remove `LoopVectorizationCostModel::useEmulatedMaskMemRefHack()`
D43208 extracted `useEmulatedMaskMemRefHack()` from legality into cost model.
What it essentially does is prevents scalarized vectorization of masked memory operations:
```
  // TODO: Cost model for emulated masked load/store is completely
  // broken. This hack guides the cost model to use an artificially
  // high enough value to practically disable vectorization with such
  // operations, except where previously deployed legality hack allowed
  // using very low cost values. This is to avoid regressions coming simply
  // from moving "masked load/store" check from legality to cost model.
  // Masked Load/Gather emulation was previously never allowed.
  // Limited number of Masked Store/Scatter emulation was allowed.
```

While i don't really understand about what specifically `is completely broken`
was talking about, i believe that at least on X86 with AVX2-or-later,
this is no longer true. (or at least, i would like to know what is still broken).
So i would like to follow suit after D111460, and like wise disable that hack for AVX2+.

But since this was added for X86 specifically, let's just instead completely remove this hack.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D114779
2022-02-07 16:08:31 +03:00
Sander de Smalen eaee477eda [LV] Use VScaleForTuning to allow wider epilogue VFs.
When the main loop is e.g. VF=vscale x 1 and the epilogue VF cannot
be any smaller, the vectorizer should try to estimate how many lanes are
executed at runtime and allow a suitable fixed-width VF to be chosen. It
can use VScaleForTuning to figure out what a suitable fixed-width VF could
be. For the case where the main loop VF is VF=vscale x 1, and VScaleForTuning=8,
it could still choose an epilogue VF upto VF=4.

This was a bit tricky to test, so this patch also introduces a wrapper
function to get 'VScaleForTuning' by also considering vscale_range.
If min and max are equal, then that will be the vscale we compile for.
It makes little sense to tune for a different width if the code
will not be portable for other widths.

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D118709
2022-02-03 15:40:17 +00:00
Sander de Smalen 2a44eaf20f [LV] Allow a scalable VF for the epilogue.
For some reason we limited the epilogue VF to be fixed-width, but there
is not necessarily a reason for doing so. If the main VF=vscale x 16, the
epilogue VF could be either fixed-width, or a scalable VF upto vscale x 8.

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D118688
2022-02-01 22:38:55 +00:00
David Green aaa16eb023 [LV][AArch64] Add test for scalar interleaving with predication. NFC 2022-02-01 09:21:49 +00:00
Florian Hahn efd4938723
[VPlan] Handle IV vector splat using VPWidenCanonicalIV.
This patch tries to use an existing VPWidenCanonicalIVRecipe
instead of creating another step-vector for canonical
induction recipes in widenIntOrFpInduction.

This has the following benefits:

 1. First step to avoid setting both vector and scalar values for the
    same induction def.
 2. Reducing complexity of widenIntOrFpInduction through making things
    more explicit in VPlan
 3. Only need to splat the vector IV for block in masks.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D116123
2022-01-29 16:25:27 +00:00
Congzhe Cao f3e1f44340 [IVDescriptor] Get the exact FP instruction that does not allow reordering
This is a bugfix in IVDescriptor.cpp.

The helper function `RecurrenceDescriptor::getExactFPMathInst()`
is supposed to return the 1st FP instruction that does not allow
reordering. However, when constructing the RecurrenceDescriptor,
we trace the use-def chain staring from a PHI node and for each
instruction in the use-def chain, its descriptor overrides the
previous one. Therefore in the final RecurrenceDescriptor we
constructed, we lose previous FP instructions that does not allow
reordering.

Reviewed By: kmclaughlin

Differential Revision: https://reviews.llvm.org/D118073
2022-01-27 00:33:46 -05:00
Igor Kirillov d3932c690d [LoopVectorize] Add tests with reductions that are stored in invariant address
This patch adds tests for functionality that is to be implemented in D110235.

Differential Revision: https://reviews.llvm.org/D117213
2022-01-24 21:26:38 +00:00
Florian Hahn b2a8eff45c
[LV] Make some tests more robust by adding missing users. 2022-01-24 13:04:09 +00:00
Kerry McLaughlin 8082ab2fc3 [LoopVectorize] Support epilogue vectorisation of loops with reductions
isCandidateForEpilogueVectorization will currently return false for loops
which contain reductions. This patch removes this restriction and makes
the following changes to support epilogue vectorisation with reductions:

- `fixReduction`: If fixReduction is being called during vectorisation of the
    epilogue, the phi node it creates will need to additionally carry incoming
     values from the middle block of the main loop.

- `createEpilogueVectorizedLoopSkeleton`: The incoming values of the phi
    created by fixReduction are updated after the vec.epilog.iter.check block
    is added. The phi is also moved to the preheader of the epilogue.

- `processLoop`: The start value of any VPReductionPHIRecipes are updated before
    vectorising the epilogue loop. The getResumeInstr function added to the ILV
    will return the resume instruction associated with the recurrence descriptor.

Reviewed By: sdesmalen

Differential Revision: https://reviews.llvm.org/D116928
2022-01-24 12:03:31 +00:00
Kerry McLaughlin c740a07863 [LoopVectorize] Test in-loop reductions with tail folding for scalable vectors
Adds `-prefer-inloop-reductions` to the RUN line of sve-tail-folding.ll & adds
a new test where in-loop reductions cannot be used (`@cond_xor_reduction`). NFC.

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D117578
2022-01-19 14:36:23 +00:00
David Sherwood e781620dee [LoopVectorize][AArch64] Use get.active.lane.mask intrinsic when SVE is enabled
When SVE is enabled for AArch64 targets it makes more sense to use the
get.active.lane.mask intrinsic, because SVE has an exact 1-1 mapping
from the intrinsic to the 'whilelo' instruction for legal vector types.
This instruction neatly takes overflow into account as well. This patch
fixes an issue in VPInstruction::generateInstruction that assumed we are
only dealing with fixed-width vectors.

Differential Revision: https://reviews.llvm.org/D117109
2022-01-18 11:59:30 +00:00
Florian Hahn d4a8fc3a87
[VPlan] Introduce and use BranchOnCount VPInstruction.
This patch adds a new BranchOnCount VPInstruction opcode with 2
operands. It first compares its 2 operands (increment of canonical
induction and vector trip count), followed by a branch to either the
exit block or back to the vector header.

It must be the last recipe in the exit block of the topmost vector loop
region.

This extracts parts from D113224 and was discussed in D113223.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D116479
2022-01-12 13:42:13 +00:00
Rosie Sumpter 552eb372cb [LoopVectorize] Pass a vector type to isLegalMaskedGather/Scatter
This is required to query the legality more precisely in the LoopVectorizer.

This adds another TTI function named 'forceScalarizeMaskedGather/Scatter'
function to work around the hack introduced for MVE, where
isLegalMaskedGather/Scatter would return an answer by second-guessing
where the function was called from, based on the Type passed in (vector
vs scalar). The new interface makes this explicit. It is also used by
X86 to check for vector widths where gather/scatters aren't profitable
(or don't exist) for certain subtargets.

Differential Revision: https://reviews.llvm.org/D115329
2022-01-12 13:34:12 +00:00
David Sherwood b0922a9dcd [LoopVectorize] Make VPWidenCanonicalIVRecipe::execute work for scalable vectors
The code in VPWidenCanonicalIVRecipe::execute only worked for fixed-width
vectors due to the way we generate the values per lane. This patch changes
the code to use a combination of vector splats and step vectors to get
the same result. This then works for both fixed-width and scalable vectors.

Tests that exercise this code path for scalable vectors have been added here:

  Transforms/LoopVectorize/AArch64/sve-tail-folding.ll

Differential Revision: https://reviews.llvm.org/D113180
2022-01-10 14:12:32 +00:00
David Sherwood e3c84fb948 [LoopVectorize] Add support for tail folding using scalable vectors
This patch fixes up an issue with InnerLoopVectorizer::getOrCreateVectorTripCount
whereby we weren't correctly generating the runtime trip count
for scalable vectors when tail-folding.

It also removes some asserts in the tail-folding path for cases when
the VF is not scalable.

In this patch I have only permitted tail-folding to be enabled
explicitly for scalable vectors when the user has specified one
of the following flags:

  -prefer-predicate-over-epilogue=predicate-dont-vectorize
  -prefer-predicate-over-epilogue=predicate-else-scalar-epilogue

For now it's best not to enable tail-folding with scalable vectors for
low trip counts or when optimising for code size, since there has been
no analysis on whether this is worth it.

Various tests have been added here:

  Transforms/LoopVectorize/AArch64/sve-tail-folding.ll
  Transforms/LoopVectorize/AArch64/sve-tail-folding-forced.ll

The tests cannot be target independent because they require masked
load/store support, i.e. TTI.isLegalMaskedLoad and TTI.isLegalMaskedStore
need to return true.

Differential Revision: https://reviews.llvm.org/D113003
2022-01-10 10:55:40 +00:00
David Green bc615e436c [AArch64] Update addo and subo costs
Similar to D116732, this adds basic scalar sadd_with_overflow,
uadd_with_overflow, ssub_with_overflow and usub_with_overflow costs for
aarch64, which are usually quite efficiently lowered.

Differential Revision: https://reviews.llvm.org/D116734
2022-01-07 16:20:23 +00:00
Florian Hahn 65c4d6191f
[VPlan] Add VPCanonicalIVPHIRecipe, partly retire createInductionVariable.
At the moment, the primary induction variable for the vector loop is
created as part of the skeleton creation. This is tied to creating the
vector loop latch outside of VPlan. This prevents from modeling the
*whole* vector loop in VPlan, which in turn is required to model
preheader and exit blocks in VPlan as well.

This patch introduces a new recipe VPCanonicalIVPHIRecipe to represent the
primary IV in VPlan and CanonicalIVIncrement{NUW} opcodes for
VPInstruction to model the increment.

This allows us to partly retire createInductionVariable. At the moment,
a bit of patching up is done after executing all blocks in the plan.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D113223
2022-01-05 10:46:06 +00:00
Rosie Sumpter 961f51fdf0 [LoopVectorize][CostModel] Choose smaller VFs for in-loop reductions without loads/stores
For loops that contain in-loop reductions but no loads or stores, large
VFs are chosen because LoopVectorizationCostModel::getSmallestAndWidestTypes
has no element types to check through and so returns the default widths
(-1U for the smallest and 8 for the widest). This results in the widest
VF being chosen for the following example,

float s = 0;
for (int i = 0; i < N; ++i)
  s += (float) i*i;

which, for more computationally intensive loops, leads to large loop
sizes when the operations end up being scalarized.

In this patch, for the case where ElementTypesInLoop is empty, the widest
type is determined by finding the smallest type used by recurrences in
the loop instead of falling back to a default value of 8 bits. This
results in the cost model choosing a more sensible VF for loops like
the one above.

Differential Revision: https://reviews.llvm.org/D113973
2022-01-04 10:12:57 +00:00
Sander de Smalen 290ae657a6 Fix buildbot failure caused by D115651
I somehow missed updating the RUN line of this test.
2021-12-20 17:18:59 +00:00
Sander de Smalen b1ff20fd35 [LV] Enable scalable vectorization by default for SVE cores.
The availability of SVE should be sufficient to enable scalable
auto-vectorization.

This patch adds a new TTI interface to query the target what style of
vectorization it wants when scalable vectors are available. For other
targets than AArch64, this currently defaults to 'FixedWidthOnly'.

Differential Revision: https://reviews.llvm.org/D115651
2021-12-20 16:23:29 +00:00
Philip Reames e6ad9ef4e7 [instcombine] Canonicalize constant index type to i64 for extractelement/insertelement
The basic idea to this is that a) having a single canonical type makes CSE easier, and b) many of our transforms are inconsistent about which types we end up with based on visit order.

I'm restricting this to constants as for non-constants, we'd have to decide whether the simplicity was worth extra instructions. For constants, there are no extra instructions.

We chose the canonical type as i64 arbitrarily.  We might consider changing this to something else in the future if we have cause.

Differential Revision: https://reviews.llvm.org/D115387
2021-12-13 16:56:22 -08:00
Philip Reames 1a18de3d0a Autogen a bunch of instcombine and vectorizer tests
Done in advance of D115387.  These are all the ones which my local script could handle, there's a couple more which need manual updates.
2021-12-13 10:41:38 -08:00
David Sherwood 8b0448ce5d [AArch64][Analysis] Add on overhead costs for SVE gathers and scatters
This patch adds on an overhead cost for gathers and scatters, which
is a rough estimate based on performance investigations I have
performed on SVE hardware for various micro-benchmarks.

Differential Revision: https://reviews.llvm.org/D115143
2021-12-09 16:02:59 +00:00
David Sherwood def8b952eb [LoopVectorize][AArch64] Add vectoriser cost model tests for gathers/scatters
I've added some tests that were previously missing for the gather-scatter costs
being calculated by the vectorizer for AArch64:

  Transforms/LoopVectorize/AArch64/sve-gather-scatter-cost.ll

The costs are sometimes different to the ones in

  Analysis/CostModel/AArch64/sve-gather.ll

because the vectorizer also adds on the address computation cost.
2021-12-09 15:44:12 +00:00
Cullen Rhodes 698584f89b [IR] Remove unbounded as possible value for vscale_range minimum
The default for min is changed to 1. The behaviour of -mvscale-{min,max}
in Clang is also changed such that 16 is the max vscale when targeting
SVE and no max is specified.

Reviewed By: sdesmalen, paulwalker-arm

Differential Revision: https://reviews.llvm.org/D113294
2021-12-07 09:52:21 +00:00
Sander de Smalen 3d549dddf7 [LV] Pass compare predicate to getCmpSelInstrCost.
If the condition of a select is a compare, pass its predicate to
TTI::getCmpSelInstrCost to get a more accurate cost value instead
of passing BAD_ICMP_PREDICATE.

I noticed that the commit message from D90070 had a comment about the
vectorized select predicate possibly being composed of other compares with
different predicate values, but I wasn't able to construct an example
where this was an actual issue. If this is an issue, I guess we could
add another check that the block isn't predicated for any reason.

Reviewed By: dmgreen, fhahn

Differential Revision: https://reviews.llvm.org/D114646
2021-12-06 11:41:27 +00:00
Sander de Smalen 28a4deab92 [LV] Fix incorrectly marking a pointer indvar as 'scalar'.
collectLoopScalars should only add non-uniform nodes to the list if they
are used by a load/store instruction that is marked as CM_Scalarize.

Before this patch, the LV incorrectly marked pointer induction variables
as 'scalar' when they required to be widened by something else,
such as a compare instruction, and weren't used by a node marked as
'CM_Scalarize'. This case is covered by sve-widen-phi.ll.

This change also allows removing some code where the LV tried to
widen the PHI nodes with a stepvector, even though it was marked as
'scalarAfterVectorization'. Now that this code is more careful about
marking instructions that need widening as 'scalar', this code has
become redundant.

Differential Revision: https://reviews.llvm.org/D114373
2021-11-28 09:49:28 +00:00
Sander de Smalen a9f837bbf0 NFC: Simplify sve-widen-phi.ll by unrolling once.
The unroll factor > 1 has little value for what is being tested.
2021-11-28 09:49:28 +00:00
David Sherwood e20391fc5d [LoopVectorize] When tail-folding, don't always predicate uniform loads
In VPRecipeBuilder::handleReplication if we believe the instruction
is predicated we then proceed to create new VP region blocks even
when the load is uniform and only predicated due to tail-folding.

I have updated isPredicatedInst to avoid treating a uniform load as
predicated when tail-folding, which means we can do a single scalar
load and a vector splat of the value.

Tests added here:

  Transforms/LoopVectorize/AArch64/tail-fold-uniform-memops.ll

Differential Revision: https://reviews.llvm.org/D112552
2021-11-26 11:30:54 +00:00
Rosie Sumpter df32a39dd0 [LoopVectorize][CostModel] Update cost model for fmuladd intrinsic
This patch updates the cost model for ordered reductions so that a call
to the llvm.fmuladd intrinsic is modelled as a normal fmul instruction
plus the cost of an ordered fadd reduction.

Differential Revision: https://reviews.llvm.org/D111630
2021-11-24 08:50:05 +00:00
Rosie Sumpter 991074012a [LoopVectorize] Propagate fast-math flags for VPInstruction
In-loop vector reductions which use the llvm.fmuladd intrinsic involve
the creation of two recipes; a VPReductionRecipe for the fadd and a
VPInstruction for the fmul. If the call to llvm.fmuladd has fast-math flags
these should be propagated through to the fmul instruction, so an
interface setFastMathFlags has been added to the VPInstruction class to
enable this.

Differential Revision: https://reviews.llvm.org/D113125
2021-11-24 08:50:04 +00:00
Rosie Sumpter c2441b6b89 [LoopVectorize] Add vector reduction support for fmuladd intrinsic
Enables LoopVectorize to handle reduction patterns involving the
llvm.fmuladd intrinsic.

Differential Revision: https://reviews.llvm.org/D111555
2021-11-24 08:50:04 +00:00
Diego Caballero 4348cd42c3 [LV] Drop integer poison-generating flags from instructions that need predication
This patch fixes PR52111. The problem is that LV propagates poison-generating flags (`nuw`/`nsw`, `exact`
and `inbounds`) in instructions that contribute to the address computation of widen loads/stores that are
guarded by a condition. It may happen that when the code is vectorized and the control flow within the loop
is linearized, these flags may lead to generating a poison value that is effectively used as the base address
of the widen load/store. The fix drops all the integer poison-generating flags from instructions that
contribute to the address computation of a widen load/store whose original instruction was in a basic block
that needed predication and is not predicated after vectorization.

Reviewed By: fhahn, spatel, nlopes

Differential Revision: https://reviews.llvm.org/D111846
2021-11-22 10:57:29 +00:00
Florian Hahn cf8efbd30e
[VPlan] Wrap vector loop blocks in region.
A first step towards modeling preheader and exit blocks in VPlan as well.
Keeping the vector loop in a region allows for changing the VF as we
traverse region boundaries.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D113182
2021-11-20 17:59:48 +00:00
Kerry McLaughlin ff64b2933a [LoopVectorize] Check the number of uses of an FAdd before classifying as ordered
checkOrderedReductions looks for Phi nodes which can be classified as in-order,
meaning they can be vectorised without unsafe math. In order to vectorise the
reduction it should also be classified as in-loop by getReductionOpChain, which
checks that the reduction has two uses.

In this patch, a similar check is added to checkOrderedReductions so that we
now return false if there are more than two uses of the FAdd instruction.
This fixes PR52515.

Reviewed By: fhahn, david-arm

Differential Revision: https://reviews.llvm.org/D114002
2021-11-18 16:41:19 +00:00
David Sherwood 8d77555b12 [Analysis] Ensure getTypeLegalizationCost returns a simple VT for TypeScalarizeScalableVector
When getTypeConversion returns TypeScalarizeScalableVector we were
sometimes returning a non-simple type from getTypeLegalizationCost.
However, many callers depend upon this being a simple type and will
crash if not. This patch changes getTypeLegalizationCost to ensure
that we always a return sensible simple VT. If the vector type
contains unusual integer types, e.g. <vscale x 2 x i3>, then we just
set the type to MVT::i64 as a reasonable default.

A test has been added here that demonstrates the vectoriser can
correctly calculate the cost of vectorising a "zext i3 to i64"
instruction with a VF=vscale x 1:

  Transforms/LoopVectorize/AArch64/sve-inductions-unusual-types.ll

Differential Revision: https://reviews.llvm.org/D113777
2021-11-17 13:11:58 +00:00
David Sherwood 670dd40244 [Analysis] Fix getNumberOfParts to return 0 when the answer is unknown
When asking how many parts are required for a scalable vector type
there are occasions when it cannot be computed. For example, <vscale x 1 x i3>
is one such vector for AArch64+SVE because at the moment no matter how we
promote the i3 type we never end up with a legal vector. This means
that getTypeConversion returns TypeScalarizeScalableVector as the
LegalizeKind, and then getTypeLegalizationCost returns an invalid cost.
This then causes BasicTTImpl::getNumberOfParts to dereference an invalid
cost, which triggers an assert. This patch changes getNumberOfParts to
return 0 for such cases, since the definition of getNumberOfParts in
TargetTransformInfo.h states that we can use a return value of 0 to represent
an unknown answer.

Currently, LoopVectorize.cpp is the only place where we need to check for
0 as a return value, because all other instances will not currently
ask for the number of parts for <vscale x 1 x iX> types.

In addition, I have changed the target-independent interface for
getNumberOfParts to return 1 and assume there is a single register
that can fit the type. The loop vectoriser has lots of tests that are
target-independent and they relied upon the 0 value to mean the
answer is known and that we are not scalarising the vector.

I have added tests here that show we correctly return an invalid cost
for VF=vscale x 1 when the loop contains unusual types such as i7:

  Transforms/LoopVectorize/AArch64/sve-inductions-unusual-types.ll

Differential Revision: https://reviews.llvm.org/D113772
2021-11-17 12:07:09 +00:00
Kerry McLaughlin 7647822156 [AArch64][SVE] Remove i1 type from isElementTypeLegalForScalableVector
`collectElementTypesForWidening` collects the types of load, store and
reduction Phis in a loop. These types are later checked using
`isElementTypeLegalForScalableVector` to prevent vectorisation of
loops with instruction types that are unsupported.

This patch removes i1 from the list of types supported for scalable
vectors. This fixes an assert ("Cannot yet scalarize uniform stores") in
`setCostBasedWideningDecision` when we have a loop containing a uniform
i1 store and a scalable VF, which we cannot create a scatter for.

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D113680
2021-11-12 14:24:38 +00:00
Kerry McLaughlin 6f16ee5e14 Revert "[LoopVectorize] Extract the last lane from a uniform store"
This reverts commit 0d748b4d32.
This is causing some failures when building Spec2017 with scalable
vectors. Reverting to investigate.
2021-11-10 11:21:19 +00:00
David Sherwood 2a48b6993a [IR] In ConstantFoldShuffleVectorInstruction use zeroinitializer for splats of 0
When creating a splat of 0 for scalable vectors we tend to create them
with using a combination of shufflevector and insertelement, i.e.

shufflevector (<vscale x 4 x i32> insertelement (<vscale x 4 x i32> poison, i32 0, i32 0),
               <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer)

However, for the case of a zero splat we can actually just replace the
above with zeroinitializer instead. This makes the IR a lot simpler and
easier to read. I have changed ConstantFoldShuffleVectorInstruction to
use zeroinitializer when creating a splat of integer 0 or FP +0.0 values.

Differential Revision: https://reviews.llvm.org/D113394
2021-11-10 09:42:58 +00:00
Kerry McLaughlin 0d748b4d32 [LoopVectorize] Extract the last lane from a uniform store
Changes VPReplicateRecipe to extract the last lane from an unconditional,
uniform store instruction. collectLoopUniforms will also add stores to
the list of uniform instructions where Legal->isUniformMemOp is true.

setCostBasedWideningDecision now sets the widening decision for
all uniform memory ops to Scalarize, where previously GatherScatter
may have been chosen for scalable stores.

This fixes an assert ("Cannot yet scalarize uniform stores") in
setCostBasedWideningDecision when we have a loop containing a
uniform i1 store and a scalable VF, which we cannot create a scatter for.

Reviewed By: sdesmalen, david-arm, fhahn

Differential Revision: https://reviews.llvm.org/D112725
2021-11-09 14:43:16 +00:00
Sander de Smalen 2829376bb2 [LV] Use VScaleForTuning to fine-tune the cost per lane.
When targeting a specific CPU with scalable vectorization, the knowledge
of that particular CPU's vscale value can be used to tune the cost-model
and make the cost per lane less pessimistic.

If the target implements 'TTI.getVScaleForTuning()', the cost-per-lane
is calculated as:

  Cost / (VScaleForTuning * VF.KnownMinLanes)

Otherwise, it assumes a value of 1 meaning that the behavior
is unchanged and calculated as:

  Cost / VF.KnownMinLanes

Reviewed By: kmclaughlin, david-arm

Differential Revision: https://reviews.llvm.org/D113209
2021-11-08 16:59:46 +00:00
David Sherwood c42bb30b9e [LoopVectorize] Permit fixed-width epilogue loops for scalable vector bodies
At the moment in LoopVectorizationCostModel::selectEpilogueVectorizationFactor
we bail out if the main vector loop uses a scalable VF. This patch adds
support for generating epilogue vector loops using a fixed-width VF when the
main vector loop uses a scalable VF.

I've changed LoopVectorizationCostModel::selectEpilogueVectorizationFactor
so that we convert the scalable VF into a fixed-width VF and do profitability
checks on that instead. In addition, since the scalable and fixed-width VFs
live in different VPlans that means I had to change the calls to
LVP.hasPlanWithVFs so that we only pass in the fixed-width VF.

New tests added here:

  Transforms/LoopVectorize/AArch64/sve-epilog-vect.ll

Differential Revision: https://reviews.llvm.org/D109432
2021-11-08 09:41:13 +00:00
David Sherwood 9da8dde7fd [NFC][LoopVectorize] Add test for tail-folding loop with conditional uniform load
I've added a test for a loop containing a conditional uniform load for
a target that supports masked loads. The test just ensures that we
correctly use gather instructions and have the correct mask.

Differential Revision: https://reviews.llvm.org/D112619
2021-11-03 09:51:11 +00:00
Rosie Sumpter dcb8222d87 [LoopVectorize] Propagate fast-math flags for inloop reductions
This patch updates VPReductionRecipe::execute so that the fast-math
flags associated with the underlying instruction of the VPRecipe are
propagated through to the reductions which are created.

Differential Revision: https://reviews.llvm.org/D112548
2021-11-02 08:59:53 +00:00
Roman Lebedev 101aaf62ef
Revert "[NFC] `IRBuilderBase::CreateAdd()`: place constant onto RHS"
Clang OpenMP codegen tests are failing,
will recommit afterwards.

This reverts commit 4723c9b3c6.
2021-10-27 22:21:37 +03:00
Roman Lebedev 42712698fd
Revert "[IR] `IRBuilderBase::CreateAdd()`: short-circuit `x + 0` --> `x`"
Clang OpenMP codegen tests are failing.

This reverts commit 288f1f8abe.
This reverts commit cb90e5356a.
2021-10-27 22:21:37 +03:00
Roman Lebedev cb90e5356a
[IR] `IRBuilderBase::CreateAdd()`: short-circuit `x + 0` --> `x`
There's precedent for that in `CreateOr()`/`CreateAnd()`.

The motivation here is to avoid bloating the run-time check's IR
in `SCEVExpander::generateOverflowCheck()`.

Refs. https://reviews.llvm.org/D109368#3089809
2021-10-27 21:34:38 +03:00
Roman Lebedev 4723c9b3c6
[NFC] `IRBuilderBase::CreateAdd()`: place constant onto RHS 2021-10-27 21:34:38 +03:00
Roman Lebedev 2eaef53023
[TTI] `BasicTTIImplBase::getInterleavedMemoryOpCost()`: fix load discounting
The math here is:
Cost of 1 load = cost of n loads / n
Cost of live loads = num live loads * Cost of 1 load
Cost of live loads = num live loads * (cost of n loads / n)
Cost of live loads = cost of n loads * (num live loads / n)

But, all the variables here are integers,
and integer division rounds down,
but this calculation clearly expects float semantics.

Instead multiply upfront, and then perform round-up-division.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D112302
2021-10-22 14:08:58 +03:00
David Sherwood 9448cdc900 [SVE][Analysis] Tune the cost model according to the tune-cpu attribute
This patch introduces a new function:

  AArch64Subtarget::getVScaleForTuning

that returns a value for vscale that can be used for tuning the cost
model when using scalable vectors. The VScaleForTuning option in
AArch64Subtarget is initialised according to the following rules:

1. If the user has specified the CPU to tune for we use that, else
2. If the target CPU was specified we use that, else
3. The tuning is set to "generic".

For CPUs of type "generic" I have assumed that vscale=2.

New tests added here:

  Analysis/CostModel/AArch64/sve-gather.ll
  Analysis/CostModel/AArch64/sve-scatter.ll
  Transforms/LoopVectorize/AArch64/sve-strict-fadd-cost.ll

Differential Revision: https://reviews.llvm.org/D110259
2021-10-21 09:33:50 +01:00
Kerry McLaughlin 1439ef1a3f [LoopVectorize] Classify pointer induction updates as scalar only if they have one use
collectLoopScalars collects pointer induction updates in ScalarPtrs, assuming
that the instruction will be scalar after vectorization. This may crash later
in VPReplicateRecipe::execute() if there there is another user of the instruction
other than the Phi node which needs to be widened.

This changes collectLoopScalars so that if there are any other users of
Update other than a Phi node, it is not added to ScalarPtrs.

Reviewed By: david-arm, fhahn

Differential Revision: https://reviews.llvm.org/D111294
2021-10-12 13:24:49 +01:00
David Sherwood 26b7d9d622 [LoopVectorize] Permit vectorisation of more select(cmp(), X, Y) reduction patterns
This patch adds further support for vectorisation of loops that involve
selecting an integer value based on a previous comparison. Consider the
following C++ loop:

  int r = a;
  for (int i = 0; i < n; i++) {
    if (src[i] > 3) {
      r = b;
    }
    src[i] += 2;
  }

We should be able to vectorise this loop because all we are doing is
selecting between two states - 'a' and 'b' - both of which are loop
invariant. This just involves building a vector of values that contain
either 'a' or 'b', where the final reduced value will be 'b' if any lane
contains 'b'.

The IR generated by clang typically looks like this:

  %phi = phi i32 [ %a, %entry ], [ %phi.update, %for.body ]
  ...
  %pred = icmp ugt i32 %val, i32 3
  %phi.update = select i1 %pred, i32 %b, i32 %phi

We already detect min/max patterns, which also involve a select + cmp.
However, with the min/max patterns we are selecting loaded values (and
hence loop variant) in the loop. In addition we only support certain
cmp predicates. This patch adds a new pattern matching function
(isSelectCmpPattern) and new RecurKind enums - SelectICmp & SelectFCmp.
We only support selecting values that are integer and loop invariant,
however we can support any kind of compare - integer or float.

Tests have been added here:

  Transforms/LoopVectorize/AArch64/sve-select-cmp.ll
  Transforms/LoopVectorize/select-cmp-predicated.ll
  Transforms/LoopVectorize/select-cmp.ll

Differential Revision: https://reviews.llvm.org/D108136
2021-10-11 09:41:38 +01:00
Krasimir Georgiev 685f1bfd0a Revert "[LoopVectorize] Permit vectorisation of more select(cmp(), X, Y) reduction patterns"
It appears to cause stage2 clang build failures, e.g.,
https://lab.llvm.org/buildbot/#/builders/74/builds/7145.

This reverts commit 1fb37334bd.
2021-10-01 11:39:43 +02:00
David Sherwood 1fb37334bd [LoopVectorize] Permit vectorisation of more select(cmp(), X, Y) reduction patterns
This patch adds further support for vectorisation of loops that involve
selecting an integer value based on a previous comparison. Consider the
following C++ loop:

  int r = a;
  for (int i = 0; i < n; i++) {
    if (src[i] > 3) {
      r = b;
    }
    src[i] += 2;
  }

We should be able to vectorise this loop because all we are doing is
selecting between two states - 'a' and 'b' - both of which are loop
invariant. This just involves building a vector of values that contain
either 'a' or 'b', where the final reduced value will be 'b' if any lane
contains 'b'.

The IR generated by clang typically looks like this:

  %phi = phi i32 [ %a, %entry ], [ %phi.update, %for.body ]
  ...
  %pred = icmp ugt i32 %val, i32 3
  %phi.update = select i1 %pred, i32 %b, i32 %phi

We already detect min/max patterns, which also involve a select + cmp.
However, with the min/max patterns we are selecting loaded values (and
hence loop variant) in the loop. In addition we only support certain
cmp predicates. This patch adds a new pattern matching function
(isSelectCmpPattern) and new RecurKind enums - SelectICmp & SelectFCmp.
We only support selecting values that are integer and loop invariant,
however we can support any kind of compare - integer or float.

Tests have been added here:

  Transforms/LoopVectorize/AArch64/sve-select-cmp.ll
  Transforms/LoopVectorize/select-cmp-predicated.ll
  Transforms/LoopVectorize/select-cmp.ll

Differential Revision: https://reviews.llvm.org/D108136
2021-10-01 08:41:03 +01:00
Craig Topper 765348298c [CostModel] Update default cost model for sadd/ssub overflow to match TargetLowering
The expansion for these was updated in https://reviews.llvm.org/D47927 but the cost model was not adjusted.

I believe the cost model was also incorrect for the old expansion.
The expansion prior to D47927 used 3 icmps using LHS, RHS, and Result
to calculate theirs signs. Then 2 icmps to compare the signs. Followed
by an And. The previous cost model was using 3 icmps and 2 selects.
Digging back through git blame, those 2 selects in the cost model used to
be 2 icmps, but were changed in https://reviews.llvm.org/D90681

Differential Revision: https://reviews.llvm.org/D110739
2021-09-30 09:41:14 -07:00
Florian Hahn 4b581e87df
[LV] Add tests where rt checks may make vectorization unprofitable.
Add a few additional tests which require a large number of runtime
checks for D109368.
2021-09-27 10:32:28 +01:00
Usman Nadeem f417d9d821 [InstCombine] Eliminate vector reverse if all inputs/outputs to an instruction are reverses
Differential Revision: https://reviews.llvm.org/D109808

Change-Id: I1a10d2bc33acbe0ea353c6cb3d077851391fe73e
2021-09-20 18:32:24 -07:00
David Sherwood f988f68064 [Analysis] Add support for vscale in computeKnownBitsFromOperator
In ValueTracking.cpp we use a function called
computeKnownBitsFromOperator to determine the known bits of a value.
For the vscale intrinsic if the function contains the vscale_range
attribute we can use the maximum and minimum values of vscale to
determine some known zero and one bits. This should help to improve
code quality by allowing certain optimisations to take place.

Tests added here:

  Transforms/InstCombine/icmp-vscale.ll

Differential Revision: https://reviews.llvm.org/D109883
2021-09-20 15:01:59 +01:00
Rosie Sumpter 9d1bea9c88 [SVE][LoopVectorize] Optimise code generated by widenPHIInstruction
For SVE, when scalarising the PHI instruction the whole vector part is
generated as opposed to creating instructions for each lane for fixed-
width vectors. However, in some cases the lane values may be needed
later (e.g for a load instruction) so we still need to calculate
these values to avoid extractelement being called on the vector part.

Differential Revision: https://reviews.llvm.org/D109445
2021-09-10 11:58:04 +01:00
Simon Pilgrim 10c982e0b3 Revert rG1c9bec727ab5c53fa060560dc8d346a911142170 : [InstCombine] Fold (gep (oneuse(gep Ptr, Idx0)), Idx1) -> (gep Ptr, (add Idx0, Idx1)) (PR51069)
Reverted (manually due to merge conflicts) while regressions reported on PR51540 are investigated

As noticed on D106352, after we've folded "(select C, (gep Ptr, Idx), Ptr) -> (gep Ptr, (select C, Idx, 0))" if the inner Ptr was also a (now one use) gep we could then merge the geps, using the sum of the indices instead.

I've limited this to basic 2-op geps - a more general case further down InstCombinerImpl.visitGetElementPtrInst doesn't have the one-use limitation but only creates the add if it can be created via SimplifyAddInst.

https://alive2.llvm.org/ce/z/f8pLfD (Thanks Roman!)

Differential Revision: https://reviews.llvm.org/D106450
2021-08-23 21:09:26 +01:00
Florian Hahn d024a01511
Recommit "[LoopVectorize][AArch64] Enable ordered reductions by default for AArch64"
This reverts the revert ab9296f13b.

The issue causing the revert should be fixed in 9baed023b4.
2021-08-23 11:25:27 +01:00
Florian Hahn 9baed023b4
[LV] Adjust reduction recipes before recurrence handling.
Adjusting the reduction recipes still relies on references to the
original IR, which can become outdated by the first-order recurrence
handling. Until reduction recipe construction does not require IR
references, move it before first-order recurrence handling, to prevent a
crash as exposed by D106653.
2021-08-22 11:02:33 +01:00
Florian Hahn ab9296f13b
Revert "[LoopVectorize][AArch64] Enable ordered reductions by default for AArch64"
This reverts commit f4122398e7 to
investigate a crash exposed by it.

The patch breaks building the code below with `clang -O2 --target=aarch64-linux`

     int a;
     double b, c;
     void d() {
       for (; a; a++) {
         b += c;
         c = a;
       }
     }
2021-08-20 21:24:28 +01:00
David Sherwood f4122398e7 [LoopVectorize][AArch64] Enable ordered reductions by default for AArch64
I have added a new TTI interface called enableOrderedReductions() that
controls whether or not ordered reductions should be enabled for a
given target. By default this returns false, whereas for AArch64 it
returns true and we rely upon the cost model to make sensible
vectorisation choices. It is still possible to override the new TTI
interface by setting the command line flag:

  -force-ordered-reductions=true|false

I have added a new RUN line to show that we use ordered reductions by
default for SVE and Neon:

  Transforms/LoopVectorize/AArch64/strict-fadd.ll
  Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll

Differential Revision: https://reviews.llvm.org/D106653
2021-08-19 09:29:40 +01:00
David Sherwood 219d4518fc [Analysis][AArch64] Make fixed-width ordered reductions slightly more expensive
For tight loops like this:

  float r = 0;
  for (int i = 0; i < n; i++) {
    r += a[i];
  }

it's better not to vectorise at -O3 using fixed-width ordered reductions
on AArch64 targets. Although the resulting number of instructions in the
generated code ends up being comparable to not vectorising at all, there
may be additional costs on some CPUs, for example perhaps the scheduling
is worse. It makes sense to deter vectorisation in tight loops.

Differential Revision: https://reviews.llvm.org/D108292
2021-08-18 17:01:56 +01:00
Dylan Fleming ef198cd99e [SVE] Remove usage of getMaxVScale for AArch64, in favour of IR Attribute
Removed AArch64 usage of the getMaxVScale interface, replacing it with
the vscale_range(min, max) IR Attribute.

Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D106277
2021-08-17 14:42:47 +01:00
Paul Walker f7a831daa6 [LoopVectorize] Don't emit remarks about lack of scalable vectors unless they're specifically requested.
Previously we emitted a "does not support scalable vectors"
remark for all targets whenever vectorisation is attempted. This
pollutes the output for architectures that don't support scalable
vectors and is likely confusing to the user.

Instead this patch introduces a debug message that reports when
scalable vectorisation is allowed by the target and only issues
the previous remark when scalable vectorisation is specifically
requested, for example:

  #pragma clang loop vectorize_width(2, scalable)

Differential Revision: https://reviews.llvm.org/D108028
2021-08-15 12:15:52 +01:00
David Sherwood 3ce8c31eb8 [NFC] Add extra RUN line to strict reduction tests
I have added RUN lines to both:

  Transforms/LoopVectorize/AArch64/strict-fadd.ll
  Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll

to show the default behaviour is to not vectorise when the following
flag is unset:

  -force-ordered-reductions
2021-08-10 14:48:38 +01:00
David Sherwood 8439415333 [IR] Let ConstantVector::getSplat use poison instead of undef
This patch updates ConstantVector::getSplat to use poison instead
of undef when using insertelement/shufflevector to splat.

This follows on from D93793.

Differential Revision: https://reviews.llvm.org/D107751
2021-08-10 08:27:43 +01:00
Sander de Smalen 3e47f009ff [LV] Consider ExtractValue as uniform.
Since all operands to ExtractValue must be loop-invariant when we deem
the loop vectorizable, we can consider ExtractValue to be uniform.

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D107286
2021-08-05 16:20:50 +01:00
Sander de Smalen 8d08a84745 [LV] Remove a change that was added in D106164.
This change wasn't strictly necessary for D106164 and could be removed.
This patch addresses the post-commit comments from @fhahn on D106164, and
also changes sve-widen-gep.ll to use the same IR test as shown in
pointer-induction.ll.

Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D106878
2021-08-05 14:44:53 +01:00
David Sherwood 0156f91f3b [NFC] Rename enable-strict-reductions to force-ordered-reductions
I'm renaming the flag because a future patch will add a new
enableOrderedReductions() TTI interface and so the meaning of this
flag will change to be one of forcing the target to enable/disable
them. Also, since other places in LoopVectorize.cpp use the word
'Ordered' instead of 'strict' I changed the flag to match.

Differential Revision: https://reviews.llvm.org/D107264
2021-08-03 09:33:01 +01:00
James Y Knight 3d272eea08 Fix test/Transforms/LoopVectorize/AArch64/strict-fadd-vf1.ll.
It was writing to the source directory (which may not be writeable),
rather than using %t.

Fixes: a5dd6c6cf9 ("[LoopVectorize] Don't interleave scalar ordered reductions for inner loops")
2021-07-27 18:32:29 -04:00
David Sherwood a5dd6c6cf9 [LoopVectorize] Don't interleave scalar ordered reductions for inner loops
Consider the following loop:

  void foo(float *dst, float *src, int N) {
    for (int i = 0; i < N; i++) {
      dst[i] = 0.0;
      for (int j = 0; j < N; j++) {
        dst[i] += src[(i * N) + j];
      }
    }
  }

When we are not building with -Ofast we may attempt to vectorise the
inner loop using ordered reductions instead. In addition we also try
to select an appropriate interleave count for the inner loop. However,
when choosing a VF=1 the inner loop will be scalar and there is existing
code in selectInterleaveCount that limits the interleave count to 2
for reductions due to concerns about increasing the critical path.
For ordered reductions this problem is even worse due to the additional
data dependency, and so I've added code to simply disable interleaving
for scalar ordered reductions for now.

Test added here:

  Transforms/LoopVectorize/AArch64/strict-fadd-vf1.ll

Differential Revision: https://reviews.llvm.org/D106646
2021-07-27 17:41:01 +01:00
Sander de Smalen d7dd12aee3 [LV] Disable Scalable VFs when tail folding is enabled b/c of low tripcount.
The loop vectorizer may decide to use tail folding when the trip-count
is low. When that happens, scalable VFs are no longer a candidate,
since tail folding/predication is not yet supported for scalable vectors.

This can be re-enabled in a future patch.

Reviewed By: kmclaughlin

Differential Revision: https://reviews.llvm.org/D106657
2021-07-27 11:37:21 +01:00
Sander de Smalen 13ccb09725 [LV] Don't let ForceTargetInstructionCost override Invalid cost.
Invalid costs can be used to avoid vectorization with a given VF, which is
used for scalable vectors to avoid things that the code-generator cannot
handle. If we override the cost using the -force-target-instruction-cost
option of the LV, we would override this mechanism, rendering the flag useless.

This change ensures the cost is only overriden when the original cost that
was calculated is valid. That allows the flag to be used in combination
with the -scalable-vectorization option.

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D106677
2021-07-26 20:27:49 +01:00
Sander de Smalen e745277012 [AArch64] NFC: Make some AArch64-SVE LoopVectorize tests generic.
This change moves most of `sve-inductions.ll` to non-AArch64 specific
LV tests using the `-target-supports-scalable-vectors` flag, because they're
not explicitly AArch64-specific. One test builds on AArch64-specific
knowledge regarding masked loads/stores, and remains in sve-inductions.ll.
2021-07-26 20:27:48 +01:00
Sander de Smalen b9051ba848 [LV] Remove assert that VF cannot be scalable in setCostBasedWideningDecision.
Scalarization for scalable vectors is not (yet) supported, so the
LV discards a VF when scalarization is chosen as the widening
decision. It should therefore not assert that the VF is not scalable
when it computes the decision to scalarize.

The code can get here when both the interleave-cost, gather/scatter cost
and scalarization-cost are all illegal. This may e.g. happen for SVE
when the VF=1, to avoid generating `<vscale x 1 x eltty>` types that
the code-generator cannot yet handle.

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D106656
2021-07-26 17:11:45 +01:00
Sander de Smalen 981e9dce54 [LV] Don't assume isScalarAfterVectorization if one of the uses needs widening.
This fixes an issue that was found in D105199, where a GEP instruction
is used both as the address of a store, as well as the value of a store.
For the former, the value is scalar after vectorization, but the latter
(as value) requires widening.

Other code in that function seems to prevent similar cases from happening,
but it seems this case was missed.

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D106164
2021-07-26 16:01:55 +01:00
Florian Hahn 93664503be
[LV] Add test to store a first-order rec via interleave group.
This is a reduced version of the reproducer from
https://bugs.chromium.org/p/chromium/issues/detail?id=1232798#c2
2021-07-26 15:20:04 +01:00
David Sherwood b2a5f0029f Fix test failures caused by 0aff1798b5 2021-07-26 11:40:26 +01:00
David Sherwood 0aff1798b5 [Analysis] Add simple cost model for strict (in-order) reductions
I have added a new FastMathFlags parameter to getArithmeticReductionCost
to indicate what type of reduction we are performing:

  1. Tree-wise. This is the typical fast-math reduction that involves
  continually splitting a vector up into halves and adding each
  half together until we get a scalar result. This is the default
  behaviour for integers, whereas for floating point we only do this
  if reassociation is allowed.
  2. Ordered. This now allows us to estimate the cost of performing
  a strict vector reduction by treating it as a series of scalar
  operations in lane order. This is the case when FP reassociation
  is not permitted. For scalable vectors this is more difficult
  because at compile time we do not know how many lanes there are,
  and so we use the worst case maximum vscale value.

I have also fixed getTypeBasedIntrinsicInstrCost to pass in the
FastMathFlags, which meant fixing up some X86 tests where we always
assumed the vector.reduce.fadd/mul intrinsics were 'fast'.

New tests have been added here:

  Analysis/CostModel/AArch64/reduce-fadd.ll
  Analysis/CostModel/AArch64/sve-intrinsics.ll
  Transforms/LoopVectorize/AArch64/strict-fadd-cost.ll
  Transforms/LoopVectorize/AArch64/sve-strict-fadd-cost.ll

Differential Revision: https://reviews.llvm.org/D105432
2021-07-26 10:26:06 +01:00
Caroline Concatto 5a4de84d55 [LoopVectorize] Fix crash for predicated instruction with scalable VF
This patch avoids computing discounts for predicated instructions  when the
VF is scalable.
There is no support for vectorization of loops with division because the
vectorizer cannot guarantee that zero divisions will not happen.

This loop now does not use VF scalable

```
for (long long i = 0; i < n; i++)
    if (cond[i])
      a[i] /= b[i];
```

Differential Revision: https://reviews.llvm.org/D101916
2021-07-22 12:48:27 +01:00
Simon Pilgrim 1c9bec727a [InstCombine] Fold (gep (oneuse(gep Ptr, Idx0)), Idx1) -> (gep Ptr, (add Idx0, Idx1)) (PR51069)
As noticed on D106352, after we've folded "(select C, (gep Ptr, Idx), Ptr) -> (gep Ptr, (select C, Idx, 0))" if the inner Ptr was also a (now one use) gep we could then merge the geps, using the sum of the indices instead.

I've limited this to basic 2-op geps - a more general case further down InstCombinerImpl.visitGetElementPtrInst doesn't have the one-use limitation but only creates the add if it can be created via SimplifyAddInst.

https://alive2.llvm.org/ce/z/f8pLfD (Thanks Roman!)

Differential Revision: https://reviews.llvm.org/D106450
2021-07-22 10:58:51 +01:00
Simon Pilgrim ca9b60f9de [LoopVectorize] Regenerate sve-vector-reverse.ll test checks 2021-07-21 15:14:04 +01:00
Kerry McLaughlin 49d73130ca [LV] Avoid scalable vectorization for loops containing alloca
This patch returns an Invalid cost from getInstructionCost() for alloca
instructions if the VF is scalable, as otherwise loops which contain
these instructions will crash when attempting to scalarize the alloca.

Reviewed By: sdesmalen

Differential Revision: https://reviews.llvm.org/D105824
2021-07-16 11:47:13 +01:00
Sander de Smalen 239d01fa88 Reland "[LV] Print remark when loop cannot be vectorized due to invalid costs."
The original patch was:
  https://reviews.llvm.org/D105806

There were some issues with undeterministic behaviour of the sorting
function, which led to scalable-call.ll passing and/or failing. This
patch fixes the issue by numbering all instructions in the array first,
and using that number as the order, which should provide a consistent
ordering.

This reverts commit a607f64118.
2021-07-16 10:52:01 +01:00
Sander de Smalen a607f64118 Revert "[LV] Print remark when loop cannot be vectorized due to invalid costs."
This reverts commit efaf3099c8.
This reverts commit dc7bdc1e71.

Reverting patches due to buildbot failures.
2021-07-15 15:21:57 +01:00
Sander de Smalen dc7bdc1e71 [LV] Fix determinism for failing scalable-call.ll test.
The sort function for emitting an OptRemark was not deterministic,
which caused scalable-call.ll to fail on some buildbots. This patch
fixes that.

This patch also fixes an issue where `Instruction::comesBefore()`
is called when two Instructions are in different basic blocks,
which would otherwise cause an assertion failure.
2021-07-15 13:16:59 +01:00
Sander de Smalen efaf3099c8 [LV] Print remark when loop cannot be vectorized due to invalid costs.
This patch emits remarks for instructions that have invalid costs for
a given set of vectorization factors. Some example output:

  t.c:4:19: remark: Instruction with invalid costs prevented vectorization at VF=(vscale x 1): load
      dst[i] = sinf(src[i]);
                    ^
  t.c:4:14: remark: Instruction with invalid costs prevented vectorization at VF=(vscale x 1, vscale x 2, vscale x 4): call to llvm.sin.f32
      dst[i] = sinf(src[i]);
               ^
  t.c:4:12: remark: Instruction with invalid costs prevented vectorization at VF=(vscale x 1): store
      dst[i] = sinf(src[i]);
             ^

Reviewed By: fhahn, kmclaughlin

Differential Revision: https://reviews.llvm.org/D105806
2021-07-14 17:11:33 +01:00
Sander de Smalen eac1670739 [CostModel][AArch64] Make loads/stores of <vscale x 1 x eltty> invalid.
At the moment, <vscale x 1 x eltty> are not yet fully handled by the
code-generator, so to avoid vectorizing loops with that VF, we mark the
cost for these types as invalid.
The reason for not adding a new "TTI::getMinimumScalableVF" is because
the type is supposed to be a type that can be legalized. It partially is,
although the support for these types need some more work.

Reviewed By: paulwalker-arm, dmgreen

Differential Revision: https://reviews.llvm.org/D103882
2021-07-14 16:44:22 +01:00
Sander de Smalen d2e4ccc790 [LV] Ignore candidate VFs with invalid costs.
This follows on from discussion on the mailing-list:
  https://lists.llvm.org/pipermail/llvm-dev/2021-June/151047.html

to interpret an Invalid cost as 'infinitely expensive', as this
simplifies some of the legalization issues with scalable vectors.

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D105473
2021-07-12 09:58:22 +01:00
Kerry McLaughlin a7512401e5 [LV] Prevent vectorization with unsupported element types.
This patch adds a TTI function, isElementTypeLegalForScalableVector, to query
whether it is possible to vectorize a given element type. This is called by
isLegalToVectorizeInstTypesForScalable to reject scalable vectorization if
any of the instruction types in the loop are unsupported, e.g:

  int foo(__int128_t* ptr, int N)
    #pragma clang loop vectorize_width(4, scalable)
    for (int i=0; i<N; ++i)
      ptr[i] = ptr[i] + 42;

This example currently crashes if we attempt to vectorize since i128 is not a
supported type for scalable vectorization.

Reviewed By: sdesmalen, david-arm

Differential Revision: https://reviews.llvm.org/D102253
2021-07-06 13:06:21 +01:00
Sjoerd Meijer ee752134ac [AArch64] Cost-model i8 vector loads/stores
Loads of <4 x i8> vectors were modeled as extremely expensive. And while we
don't have a load instruction that supports this, it isn't that expensive to
create a vector of i8 elements. The codegen for this was fixed/optimised in
D105110. This now tweaks the cost model and enables SLP vectorisation of my
motivating case loadi8.ll.

Differential Revision: https://reviews.llvm.org/D103629
2021-07-05 11:25:10 +01:00