Commit Graph

2676 Commits

Author SHA1 Message Date
Alexey Bataev a3fd82c289 [SLP]Fix the crash on cost calculation if non-compatible vectors shuffled.
If the extracts from the non-power-2 vectors are recognized as shuffles,
need some extra checks to not crash cost calculations if trying to gext
the ecost for subvector extracts. In this case need to check carefully
that we do not exit out of bounds of the original vector, otherwise the
TTI's cost model will crash on assert.

Differential Revision: https://reviews.llvm.org/D101477
2021-04-30 09:34:20 -07:00
Alexey Bataev 12c51f2358 [COST] Improve shuffle kind detection if shuffle mask is provided.
Added an extra analysis for better choosing of shuffle kind in
getShuffleCost functions for better cost estimation if mask was
provided.

Differential Revision: https://reviews.llvm.org/D100865
2021-04-29 12:48:00 -07:00
Alexey Bataev 6e859f3cd4 Revert "[COST] Improve shuffle kind detection if shuffle mask is provided."
This reverts commit 9239932221 to fix
a compiler crash on mask checks.
2021-04-29 12:40:33 -07:00
Alexey Bataev 9239932221 [COST] Improve shuffle kind detection if shuffle mask is provided.
Added an extra analysis for better choosing of shuffle kind in
getShuffleCost functions for better cost estimation if mask was
provided.

Differential Revision: https://reviews.llvm.org/D100865
2021-04-29 09:42:56 -07:00
Sander de Smalen 51d648c119 Revert "[LV] Calculate max feasible scalable VF."
Temporarily reverting this patch due to some unexpected issue found
by one of the PPC buildbots.

This reverts commit 584e9b6e4b.
2021-04-29 16:04:37 +01:00
Florian Hahn a0e1313c23
[VPlan] Add getVPSingleValue helper.
As suggested in D99294, this adds a getVPSingleValue helper to use for
recipes that are guaranteed to define a single value. This replaces uses
of getVPValue() which used to default to I = 0.
2021-04-29 13:37:38 +01:00
Bardia Mahjour ddb3b26a12 [LV] Consider Loop Unroll Hints When Making Interleave Decisions
This patch causes the loop vectorizer to not interleave loops that have
nounroll loop hints (llvm.loop.unroll.disable and llvm.loop.unroll_count(1)).
Note that if a particular interleave count is being requested
(through llvm.loop.interleave_count), it will still be honoured, regardless
of the presence of nounroll hints.

Reviewed By: Meinersbur

Differential Revision: https://reviews.llvm.org/D101374
2021-04-28 17:27:52 -04:00
David Sherwood 00e65f3345 [LoopVectorize][SVE] Fix crash when vectorising FP negation
This patch fixes a crash encountered when vectorising the following loop:

 void foo(float *dst, float *src, long long n) {
   for (long long i = 0; i < n; i++)
     dst[i] = -src[i];
 }

using scalable vectors. I've added a test to

 Transforms/LoopVectorize/AArch64/sve-basic-vec.ll

as well as cleaned up the other tests in the same file.

Differential Revision: https://reviews.llvm.org/D98054
2021-04-28 15:22:35 +01:00
Tres Popp f0e848e63d Silence unused variable warning 2021-04-28 15:46:09 +02:00
Alexey Bataev 8af4723c58 [SLP]Try to vectorize tiny trees with shuffled gathers.
If the first tree element is vectorize and the second is gather, it
still might be profitable to vectorize it if the gather node contains
less scalars to vectorize than the original tree node. It might be
profitable to use shuffles.

Differential Revision: https://reviews.llvm.org/D101397
2021-04-28 06:35:31 -07:00
David Sherwood 6998f8ae2d [LoopVectorize] Simplify scalar cost calculation in getInstructionCost
This patch simplifies the calculation of certain costs in
getInstructionCost when isScalarAfterVectorization() returns a true value.
There are a few places where we multiply a cost by a number N, i.e.

  unsigned N = isScalarAfterVectorization(I, VF) ? VF.getKnownMinValue() : 1;
  return N * TTI.getArithmeticInstrCost(...

After some investigation it seems that there are only these cases that occur
in practice:

1. VF is a scalar, in which case N = 1.
2. VF is a vector. We can only get here if: a) the instruction is a
GEP/bitcast/PHI with scalar uses, or b) this is an update to an induction
variable that remains scalar.

I have changed the code so that N is assumed to always be 1. For GEPs
the cost is always 0, since this is calculated later on as part of the
load/store cost. PHI nodes are costed separately and were never previously
multiplied by VF. For all other cases I have added an assert that none of
the users needs scalarising, which didn't fire in any unit tests.

Only one test required fixing and I believe the original cost for the scalar
add instruction to have been wrong, since only one copy remains after
vectorisation.

I have also added a new test for the case when a pointer PHI feeds directly
into a store that will be scalarised as we were previously never testing it.

Differential Revision: https://reviews.llvm.org/D99718
2021-04-28 13:41:07 +01:00
Sander de Smalen 584e9b6e4b [LV] Calculate max feasible scalable VF.
This patch also refactors the way the feasible max VF is calculated,
although this is NFC for fixed-width vectors.

After this change scalable VF hints are no longer truncated/clamped
to a shorter scalable VF, nor does it drop the 'scalable flag' from
the suggested VF to vectorize with a similar VF that is fixed.

Instead, the hint is ignored which means the vectorizer is free
to find a more suitable VF, using the CostModel to determine the
best possible VF.

Reviewed By: c-rhodes, fhahn

Differential Revision: https://reviews.llvm.org/D98509
2021-04-28 12:30:00 +01:00
Kerry McLaughlin 9cc217ab36 [LoopVectorize] Prevent multiple Phis being generated with in-order reductions
When using the -enable-strict-reductions flag where UF>1 we generate multiple
Phi nodes, though only one of these is used as an input to the vector.reduce.fadd
intrinsics. The unused Phi nodes are removed later by instcombine.

This patch changes widenPHIInstruction/fixReduction to only generate
one Phi, and adds an additional test for unrolling to strict-fadd.ll

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D100570
2021-04-28 11:29:01 +01:00
David Sherwood 6968520c3b Revert "[LoopVectorize] Simplify scalar cost calculation in getInstructionCost"
This reverts commit 4afeda9157.
2021-04-27 15:46:03 +01:00
David Sherwood 4afeda9157 [LoopVectorize] Simplify scalar cost calculation in getInstructionCost
This patch simplifies the calculation of certain costs in
getInstructionCost when isScalarAfterVectorization() returns a true value.
There are a few places where we multiply a cost by a number N, i.e.

  unsigned N = isScalarAfterVectorization(I, VF) ? VF.getKnownMinValue() : 1;
  return N * TTI.getArithmeticInstrCost(...

After some investigation it seems that there are only these cases that occur
in practice:

1. VF is a scalar, in which case N = 1.
2. VF is a vector. We can only get here if: a) the instruction is a
GEP/bitcast/PHI with scalar uses, or b) this is an update to an induction
variable that remains scalar.

I have changed the code so that N is assumed to always be 1. For GEPs
the cost is always 0, since this is calculated later on as part of the
load/store cost. PHI nodes are costed separately and were never previously
multiplied by VF. For all other cases I have added an assert that none of
the users needs scalarising, which didn't fire in any unit tests.

Only one test required fixing and I believe the original cost for the scalar
add instruction to have been wrong, since only one copy remains after
vectorisation.

I have also added a new test for the case when a pointer PHI feeds directly
into a store that will be scalarised as we were previously never testing it.

Differential Revision: https://reviews.llvm.org/D99718
2021-04-27 15:26:15 +01:00
Alexey Bataev 24590d8d67 [SLP]Improved isGatherShuffledEntry, NFC.
Reworked isGatherShuffledEntry function, simplified and moved
common code to the lambda (it shall go away when non-power-2 patch will
be landed).
2021-04-27 05:59:46 -07:00
Florian Hahn cb96d802d4
[LV] Hoist code to get vector loop latch (NFC).
Address suggestion from D99294.
2021-04-27 13:30:17 +01:00
Florian Hahn 160e729cf0
[VPlan] Use recursive traversal iterator in VPSlotTracker.
This patch simplifies VPSlotTracker by using the recursive traversal
iterator to traverse all blocks in a VPlan in reverse post-order when
numbering VPValues in a plan.

This depends on a fix to RPOT (D100169). It also extends the traversal
unit tests to check RPOT.

Reviewed By: a.elovikov

Differential Revision: https://reviews.llvm.org/D100176
2021-04-27 12:39:06 +01:00
Florian Hahn 7302fe4328
[VPlan] Make blocksOnly work properly with ranges over const pointers.
When iterating over const blocks, the base type in the lambdas needs
to use const VPBlockBase *, otherwise it cannot be used with input
iterators over const VPBlockBase.

Also adjust the type of the input iterator range to const &, as it
does not take ownership of the input range.
2021-04-26 10:52:35 +01:00
Florian Hahn 4b9be5ac08
[VPlan] Add VPBlockUtils::blocksOnly helper.
This patch adds a blocksOnly helpers which take an iterator range
over VPBlockBase * or const VPBlockBase * and returns an interator
range that only include BlockTy blocks. The accesses are casted to
BlockTy.

Reviewed By: a.elovikov

Differential Revision: https://reviews.llvm.org/D101093
2021-04-25 17:38:09 +01:00
Florian Hahn 89c4dda076
[VPlan] Add GraphTraits impl to traverse through VPRegionBlock.
This patch adds a new iterator to traverse through VPRegionBlocks and a
GraphTraits specialization using the iterator to traverse through
VPRegionBlocks.

Because there is already a GraphTraits specialization for VPBlockBase *
and co, a new VPBlockRecursiveTraversalWrapper helper is introduced.
This allows us to provide a new GraphTraits specialization for that
type. Users can use the new recursive traversal by using this wrapper.

The graph trait visits both the entry block of a region, as well as all
its successors. Exit blocks of a region implicitly have their parent
region's successors. This ensures all blocks in a region are visited
before any blocks in a successor region when doing a reverse post-order
traversal of the graph.

Reviewed By: a.elovikov

Differential Revision: https://reviews.llvm.org/D100175
2021-04-23 17:26:47 +01:00
Alexey Bataev 18c61fc498 [SLP]Skip undefs trying to find perfect/shuffled tree entries matching.
We can skip check for undefs trying to find perfect/shuffled tree
entries matching, they can be ignored completely improving the final
cost/vectorization results.

Differential Revision: https://reviews.llvm.org/D101061
2021-04-22 08:59:07 -07:00
Joe Ellis 2c551aedcf [LoopVectorize] Fix bug where predicated loads/stores were dropped
This commit fixes a bug where the loop vectoriser fails to predicate
loads/stores when interleaving for targets that support masked
loads and stores.

Code such as:

     1  void foo(int *restrict data1, int *restrict data2)
     2  {
     3    int counter = 1024;
     4    while (counter--)
     5      if (data1[counter] > data2[counter])
     6        data1[counter] = data2[counter];
     7  }

... could previously be transformed in such a way that the predicated
store implied by:

    if (data1[counter] > data2[counter])
       data1[counter] = data2[counter];

... was lost, resulting in miscompiles.

This bug was causing some tests in llvm-test-suite to fail when built
for SVE.

Differential Revision: https://reviews.llvm.org/D99569
2021-04-22 15:05:54 +00:00
Alexey Bataev d4f5f23bbb [SLP]Replace more `TTI` with `TTIRef`, NFC.
To pacify MSVC buildbots.
2021-04-22 07:53:20 -07:00
Alexey Bataev da2cdfd421 [SLP]Added explicit ref to TargetTransformInfo to try to pacify MSVC
buildbots, NFC.
2021-04-22 07:49:48 -07:00
Alexey Bataev e99b98cb1b [SLP]Improve cost model for the vectorized extractelements.
1. No need to call `areAllUsersVectorized` as later the cost is
   calculated only if the instruction has one use and gets vectorized.
2. Need to calculate the cost of the dead extractelement more precisely,
   taking the vector type of the vector operand, not the resulting
   vector type.

Part of D57059.

Differential Revision: https://reviews.llvm.org/D99980
2021-04-22 07:40:17 -07:00
David Sherwood 5a229a6702 [LoopVectorize] Don't create unnecessary vscale intrinsic calls
In quite a few cases in LoopVectorize.cpp we call createStepForVF
with a step value of 0, which leads to unnecessary generation of
llvm.vscale intrinsic calls. I've optimised IRBuilder::CreateVScale
and createStepForVF to return 0 when attempting to multiply
vscale by 0.

Differential Revision: https://reviews.llvm.org/D100763
2021-04-22 09:01:52 +01:00
Alexey Bataev af870e11ae [SLP] Add detection of shuffled/perfect matching of tree entries.
SLP supports perfect diamond matching for the vectorized tree entries
but do not support it for gathered entries and does not support
non-perfect (shuffled) matching with 1 or 2 tree entries. Patch adds
support for this matching to improve cost of the vectorized tree.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D100495
2021-04-20 09:08:46 -07:00
Alexey Bataev b82344a019 Revert "[SLP] Add detection of shuffled/perfect matching of tree entries."
This reverts commit daf6e18c55 to fix the
compiler crash.
2021-04-20 08:29:32 -07:00
Alexey Bataev daf6e18c55 [SLP] Add detection of shuffled/perfect matching of tree entries.
SLP supports perfect diamond matching for the vectorized tree entries
but do not support it for gathered entries and does not support
non-perfect (shuffled) matching with 1 or 2 tree entries. Patch adds
support for this matching to improve cost of the vectorized tree.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D100495
2021-04-20 07:46:49 -07:00
Alexey Bataev cf00cb8bed Revert "[SLP] Add detection of shuffled/perfect matching of tree entries."
This reverts commit b232771aca to fix
buildbots.
2021-04-20 07:16:11 -07:00
Alexey Bataev b232771aca [SLP] Add detection of shuffled/perfect matching of tree entries.
SLP supports perfect diamond matching for the vectorized tree entries
but do not support it for gathered entries and does not support
non-perfect (shuffled) matching with 1 or 2 tree entries. Patch adds
support for this matching to improve cost of the vectorized tree.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D100495
2021-04-20 06:55:55 -07:00
Sander de Smalen 86729538bd [LV] Let selectVectorizationFactor reason directly on VectorizationFactor.
Rather than maintaining two separate values, a `float` for the per-lane
cost and a Width for the VF, maintain a single VectorizationFactor which
comprises the two and also removes the need for converting an integer value
to float.

This simplifies the query when asking if one VF is more profitable than
another when we want to extend this for scalable vectors (which may
require additional options to determine if e.g. a scalable VF of the
some cost, is more profitable than a fixed VF of the same cost).

The patch isn't entirely NFC because it also fixes an issue in
selectEpilogueVectorizationFactor, where the cost passed to ProfitableVFs
no longer truncates the floating-point cost from `float` to `unsigned` to
then perform the calculation on the truncated cost. It now does
a cost comparison with the correct precision.

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D100121
2021-04-20 09:54:45 +01:00
Alexey Bataev 8030481065 Revert "[SLP]Add detection of shuffled/perfect matching of tree entries."
This reverts commit d6fde91379 to fix
compiler crashes.
2021-04-19 14:10:04 -07:00
Alexey Bataev d6fde91379 [SLP]Add detection of shuffled/perfect matching of tree entries.
SLP supports perfect diamond matching for the vectorized tree entries
but do not support it for gathered entries and does not support
non-perfect (shuffled) matching with 1 or 2 tree entries. Patch adds
support for this matching to improve cost of the vectorized tree.

Differential Revision: https://reviews.llvm.org/D100495
2021-04-19 13:29:30 -07:00
Cullen Rhodes f0bc2782f2 [TTI] NFC: Remove unused 'OptSize' parameter from shouldMaximizeVectorBandwidth
Reviewed By: sdesmalen

Differential Revision: https://reviews.llvm.org/D100377
2021-04-19 11:01:34 +00:00
Florian Hahn 49999d4364 [VPlan] Replace a few unnecessary includes with forward decls. 2021-04-15 20:08:31 +01:00
Florian Hahn 6adebe3fd2 [VPlan] Add VPRecipeBase::mayHaveSideEffects.
Add an initial version of a helper to determine whether a recipe may
have side-effects.

Reviewed By: a.elovikov

Differential Revision: https://reviews.llvm.org/D100259
2021-04-15 11:49:40 +01:00
David Sherwood ea14df695e [SVE][LoopVectorize] Fix crash in InnerLoopVectorizer::widenPHIInstruction
There were a few places in widenPHIInstruction where calculations of
offsets were failing to take the runtime calculation of VF into
account for scalable vectors. I've fixed those cases in this patch
as well as adding an assert that we should not be scalarising for
scalable vectors.

Tests are added here:

  Transforms/LoopVectorize/AArch64/sve-widen-phi.ll

Differential Revision: https://reviews.llvm.org/D99254
2021-04-15 10:51:49 +01:00
David Sherwood 7120f89f7d [NFC][LoopVectorize] Remove unnecessary VF.isScalable asserts
There are a few places in LoopVectorize.cpp where we have been too
cautious in adding VF.isScalable() asserts and it can be confusing.
It also makes it more difficult to see the genuine places where
work needs doing to improve scalable vectorization support.

This patch changes getMemInstScalarizationCost to return an
invalid cost instead of firing an assert for scalable vectors. Also,
vectorizeInterleaveGroup had multiple asserts all for the same
thing. I have removed all but one assert near the start of the
function, and added a new assert that we aren't dealing with masks
for scalable vectors.

Differential Revision: https://reviews.llvm.org/D99727
2021-04-15 09:41:03 +01:00
Simon Pilgrim b49c41afba [SLP] createOp - fix null dereference warning. NFCI.
Only attempt to propagateIRFlags if we have both SelectInst - afaict we shouldn't have matched a min/max reduction without both SelectInst, but static analyzer doesn't know that.
2021-04-14 15:24:41 +01:00
Sander de Smalen bd86824d98 [TTI] NFC: Change getArithmeticReductionCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

This patch is practically NFC, with the exception of an AArch64 SVE related
cost-model change, where we can now return an Invalid cost instead of some
bogus number.

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D100201
2021-04-13 14:20:59 +01:00
Sander de Smalen 92d8421f49 [TTI] NFC: Change getCastInstrCost and getExtractWithExtendCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D100199
2021-04-13 14:20:58 +01:00
dfukalov d066079728 [NFC][AA] Prepare to convert AliasResult to class with PartialAlias offset.
Main reason is preparation to transform AliasResult to class that contains
offset for PartialAlias case.

Reviewed By: asbirlea

Differential Revision: https://reviews.llvm.org/D98027
2021-04-09 12:54:22 +03:00
Alexey Bataev ab124bbe2a [SLP]Fix PR49898: Infinite loop in SLP vectorizer.
We should not re-try attempt of finding of the consecutive store chain
if it was tried before.

Differential Revision: https://reviews.llvm.org/D100131
2021-04-08 14:18:06 -07:00
Florian Hahn e4de3cdf3d [LV] Pass VPWidenPHIRecipe to widenPHIInstruction (NFC).
Instead of passing the start value and the defined value to
widenPHIInstruction, pass the VPWidenPHIRecipe directly, which can be
used to get both (and more in future patches).
2021-04-08 14:25:10 +01:00
David Green 8675ef100f [LV] Logical and/or select costs
D99674 stopped the folding of certain select operations into and/or, due
to incorrect folding in the presence of poison. D97360 added some costs
to attempt to account for the change, but only worked at the getUserCost
level, not the getCmpSelInstrCost that the vectorizer will use directly.
This adds similar logic into the vectorizer to handle these logical
and/or selects, treating them like and/or directly.

This fixes 60% performance regressions from code like the attached test
case.

Differential Revision: https://reviews.llvm.org/D99884
2021-04-08 10:39:47 +01:00
Alexey Bataev a78e86e6be [SLP]Avoid multiple attempts to vectorize CmpInsts.
No need to lookup through and/or try to vectorize operands of the
CmpInst instructions during attempts to find/vectorize min/max
reductions. Compiler implements postanalysis of the CmpInsts so we can
skip extra attempts in tryToVectorizeHorReductionOrInstOperands and save
compile time.

Differential Revision: https://reviews.llvm.org/D99950
2021-04-07 06:15:42 -07:00
Philip Reames a6d2a8d6f5 Add a subclass of IntrinsicInst for llvm.assume [nfc]
Add the subclass, update a few places which check for the intrinsic to use idiomatic dyn_cast, and update the public interface of AssumptionCache to use the new class.  A follow up change will do the same for the newer assumption query/bundle mechanisms.
2021-04-06 11:16:22 -07:00
Kerry McLaughlin 7344f3d39a [LoopVectorize] Add strict in-order reduction support for fixed-width vectorization
Previously we could only vectorize FP reductions if fast math was enabled, as this allows us to
reorder FP operations. However, it may still be beneficial to vectorize the loop by moving
the reduction inside the vectorized loop and making sure that the scalar reduction value
be an input to the horizontal reduction, e.g:

  %phi = phi float [ 0.0, %entry ], [ %reduction, %vector_body ]
  %load = load <8 x float>
  %reduction = call float @llvm.vector.reduce.fadd.v8f32(float %phi, <8 x float> %load)

This patch adds a new flag (IsOrdered) to RecurrenceDescriptor and makes use of the changes added
by D75069 as much as possible, which already teaches the vectorizer about in-loop reductions.
For now in-order reduction support is off by default and controlled with the `-enable-strict-reductions` flag.

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D98435
2021-04-06 14:45:34 +01:00
Kerry McLaughlin 857b8a73da [LoopVectorize] Change the identity element for FAdd
Changes getRecurrenceIdentity to always return a neutral value of -0.0 for FAdd.

Reviewed By: dmgreen, spatel

Differential Revision: https://reviews.llvm.org/D98963
2021-04-06 12:13:43 +01:00
Florian Hahn a6b06b785c [VPlan] Print VPValue operands for VPWidenPHI if possible.
For VPWidenPHIRecipes that model all incoming values as VPValue
operands, print those operands instead of printing the original PHI.

D99294 updates recipes of reduction PHIs to use the VPValue for the
incoming value from the loop backedge, making use of this new printing.
2021-04-06 12:11:21 +01:00
Alexey Bataev 00a84f9a7f [SLP]Improve vectorization of the CmpInst instructions.
During vectorization better to postpone the vectorization of the CmpInst
instructions till the end of the basic block. Otherwise we may vectorize
it too early and may miss some vectorization patterns, like reductions.

Reworked part of D57059

Differential Revision: https://reviews.llvm.org/D99796
2021-04-05 06:22:51 -07:00
Fangrui Song 8e5f3d04f2 [SLPVectorizer] Fix divide-by-zero after D99719
Will add a test case later.
2021-04-02 11:13:51 -07:00
Florian Hahn 8867fc69f0 [LV] Hoist mapping of IR operands to VPValues (NFC).
This patch moves mapping of IR operands to VPValues out of
tryToCreateWidenRecipe. This allows using existing VPValue operands when
widening recipes directly, which will be introduced in future patches.
2021-04-02 17:57:20 +01:00
Alexey Bataev 5fcb07a070 [SLP]Fix a bug in min/max reduction, number of condition uses.
The ultimate reduction node may have multiple uses, but if the ultimate
reduction is min/max reduction and based on SelectInstruction, the
condition of this select instruction must have only single use.

Differential Revision: https://reviews.llvm.org/D99753
2021-04-02 07:09:44 -07:00
Florian Hahn 0f3230390b
[SLP] Better estimate cost of no-op extracts on target vectors.
The motivation for this patch is to better estimate the cost of
extracelement instructions in cases were they are going to be free,
because the source vector can be used directly.

A simple example is

    %v1.lane.0 = extractelement <2 x double> %v.1, i32 0
    %v1.lane.1 = extractelement <2 x double> %v.1, i32 1

    %a.lane.0 = fmul double %v1.lane.0, %x
    %a.lane.1 = fmul double %v1.lane.1, %y

Currently we only consider the extracts free, if there are no other
users.

In this particular case, on AArch64 which can fit <2 x double> in a
vector register, the extracts should be free, independently of other
users, because the source vector of the extracts will be in a vector
register directly, so it should be free to use the vector directly.

The SLP vectorized version of noop_extracts_9_lanes is 30%-50% faster on
certain AArch64 CPUs.

It looks like this does not impact any code in
SPEC2000/SPEC2006/MultiSource both on X86 and AArch64 with -O3 -flto.

This originally regressed after D80773, so if there's a better
alternative to explore, I'd be more than happy to do that.

Reviewed By: ABataev

Differential Revision: https://reviews.llvm.org/D99719
2021-04-02 10:40:12 +01:00
Alexey Bataev c03696da5e [SLP]Improve and fix getVectorElementSize.
1. Need to cleanup InstrElementSize map for each new tree, otherwise might
use sizes from the previous run of the vectorization attempt.
2. No need to include into analysis the instructions from the different basic
   blocks to save compile time.

Differential Revision: https://reviews.llvm.org/D99677
2021-04-01 06:51:26 -07:00
Alexey Bataev ce98a0556a [SLP]Remove `else` after `return`, NFC.` 2021-04-01 05:33:01 -07:00
Huihui Zhang fe5c4a06a4 [LoopVectorize] Use SetVector to track uniform uses to prevent non-determinism.
Use SetVector instead of SmallPtrSet to track values with uniform use. Doing this
can help avoid non-determinism caused by iterating over unordered containers.

This bug was found with reverse iteration turning on,
--extra-llvm-cmake-variables="-DLLVM_REVERSE_ITERATION=ON".
Failing LLVM test consecutive-ptr-uniforms.ll .

Reviewed By: MaskRay

Differential Revision: https://reviews.llvm.org/D99549
2021-03-31 11:21:07 -07:00
Sander de Smalen 7108b2dec1 [SVE] Fix LoopVectorizer test scalalable-call.ll
This marks FSIN and other operations to EXPAND for scalable
vectors, so that they are not assumed to be legal by the cost-model.

Depends on D97470

Reviewed By: dmgreen, paulwalker-arm

Differential Revision: https://reviews.llvm.org/D97471
2021-03-31 14:52:49 +01:00
Huihui Zhang d857a81437 [VPlan] Use SetVector for VPExternalDefs to prevent non-determinism.
Use SetVector instead of SmallPtrSet for external definitions created for VPlan.
Doing this can help avoid non-determinism caused by iterating over unordered containers.

This bug was found with reverse iteration turning on,
 --extra-llvm-cmake-variables="-DLLVM_REVERSE_ITERATION=ON".
Failing LLVM-Unit test VPRecipeTest.dump.

Reviewed By: MaskRay

Differential Revision: https://reviews.llvm.org/D99544
2021-03-30 12:10:56 -07:00
David Sherwood a08c7736a7 [LoopVectorize] Add support for scalable vectorization of induction variables
This patch adds support for the vectorization of induction variables when
using scalable vectors, which required the following changes:

1. Removed assert from InnerLoopVectorizer::getStepVector.
2. Modified InnerLoopVectorizer::createVectorIntOrFpInductionPHI to use
   a runtime determined value for VF and removed an assert.
3. Modified InnerLoopVectorizer::buildScalarSteps to work for scalable
   vectors. I did this by calculating the full vector value for each Part
   of the unroll factor (UF) and caching this in the VP state. This means
   that we are always able to extract an arbitrary element from the vector
   if necessary. In addition to this, I also permitted the caching of the
   individual lane values themselves for the known minimum number of elements
   in the same way we do for fixed width vectors. This is a further
   optimisation that improves the code quality since it avoids unnecessary
   extractelement operations when extracting the first lane.
4. Added an assert to InnerLoopVectorizer::widenPHIInstruction, since while
   testing some code paths I noticed this is currently broken for scalable
   vectors.

Various tests to support different cases have been added here:

  Transforms/LoopVectorize/AArch64/sve-inductions.ll

Differential Revision: https://reviews.llvm.org/D98715
2021-03-30 11:13:31 +01:00
Florian Hahn c773d0f973
Recommit "[LV] Move runtime pointer size check to LVP::plan()."
Re-apply 25fbe803d4, with a small update to emit the right remark
class.

Original message:
    [LV] Move runtime pointer size check to LVP::plan().

    This removes the need for the remaining doesNotMeet check and instead
    directly checks if there are too many runtime checks for vectorization
    in the planner.

    A subsequent patch will adjust the logic used to decide whether to
    vectorize with runtime to consider their cost more accurately.

    Reviewed By: lebedev.ri
2021-03-29 16:14:27 +01:00
Florian Hahn 485c8ce733
Revert "[LV] Move runtime pointer size check to LVP::plan()."
This reverts commit 25fbe803d4.

This breaks a clang test which filters for the wrong remark type.
2021-03-29 14:41:53 +01:00
Sanjay Patel da381cf7ce [SLP] allow matching integer min/max intrinsics as reduction ops
This is a 2nd try of:
3c8473ba53
which was reverted at:
 a26312f9d4
because of crashing.

This version includes extra code and tests to avoid the known
crashing examples as discussed in PR49730.

Original commit message:
As noted in D98152, we need to patch SLP to avoid regressions when
we start canonicalizing to integer min/max intrinsics.
Most of the real work to make this possible was in:
7202f47508

Differential Revision: https://reviews.llvm.org/D98981
2021-03-29 09:38:18 -04:00
Florian Hahn 25fbe803d4
[LV] Move runtime pointer size check to LVP::plan().
This removes the need for the remaining doesNotMeet check and instead
directly checks if there are too many runtime checks for vectorization
in the planner.

A subsequent patch will adjust the logic used to decide whether to
vectorize with runtime to consider their cost more accurately.

Reviewed By: lebedev.ri

Differential Revision: https://reviews.llvm.org/D98634
2021-03-29 14:12:29 +01:00
Florian Hahn 8c6c357897
[LV] Mark a few more cost-model members as const (NFC). 2021-03-28 14:59:48 +01:00
Florian Hahn d2855eba81
[LV] Fix formatting from 2f9d68c3f1. 2021-03-27 21:29:56 +00:00
Florian Hahn 2f9d68c3f1
[LV] Mark some methods as const (NFC).
Mark a few methods as const, as they do not modify any state.
2021-03-27 21:27:53 +00:00
Sanjay Patel b0797e0c12 [SLP] use dyn_cast instead of isa + cast; NFC 2021-03-26 13:52:31 -04:00
Sanjay Patel a26312f9d4 Revert "[SLP] allow matching integer min/max intrinsics as reduction ops"
This reverts commit 3c8473ba53 and includes test diffs to
maintain testing status.

There's at least 1 place that was not updated with 7202f47508 ,
so we can crash mismatching select and intrinsics as shown in
PR49730.
2021-03-26 09:59:14 -04:00
David Sherwood c39460cc4f Revert "[LoopVectorize] Simplify scalar cost calculation in getInstructionCost"
This reverts commit 240aa96cf2.
2021-03-26 11:36:53 +00:00
David Sherwood 240aa96cf2 [LoopVectorize] Simplify scalar cost calculation in getInstructionCost
This patch simplifies the calculation of certain costs in
getInstructionCost when isScalarAfterVectorization() returns a true value.
There are a few places where we multiply a cost by a number N, i.e.

  unsigned N = isScalarAfterVectorization(I, VF) ? VF.getKnownMinValue() : 1;
  return N * TTI.getArithmeticInstrCost(...

After some investigation it seems that there are only these cases that occur
in practice:

1. VF is a scalar, in which case N = 1.
2. VF is a vector. We can only get here if: a) the instruction is a
GEP/bitcast with scalar uses, or b) this is an update to an induction variable
that remains scalar.

I have changed the code so that N is assumed to always be 1. For GEPs
the cost is always 0, since this is calculated later on as part of the
load/store cost. For all other cases I have added an assert that none of the
users needs scalarising, which didn't fire in any unit tests.

Only one test required fixing and I believe the original cost for the scalar
add instruction to have been wrong, since only one copy remains after
vectorisation.

Differential Revision: https://reviews.llvm.org/D98512
2021-03-26 11:27:12 +00:00
Yevgeny Rouban f7ef26ef0b [SLP] Fix crash in reduction for integer min/max
The SCEV commit b46c085d2b [NFCI] SCEVExpander:
    emit intrinsics for integral {u,s}{min,max} SCEV expressions
seems to reveal a new crash in SLPVectorizer.
SLP crashes expecting a SelectInst as an externally used value
but umin() call is found.

The patch relaxes the assumption to make the IR flag propagation safe.

Reviewed By: spatel

Differential Revision: https://reviews.llvm.org/D99328
2021-03-25 21:44:21 +07:00
Alexey Bataev 568c874117 [SLP]Improve and simplify extendSchedulingRegion.
We do not need to scan further if the upper end or lower end of the
basic block is reached already and the instruction is not found. It
means that the instruction is definitely in the lower part of basic
block or in the upper block relatively.
This should improve compile time for the very big basic blocks.

Differential Revision: https://reviews.llvm.org/D99266
2021-03-25 05:31:58 -07:00
Florian Hahn 9d45579279
[LV] Factor out phi type access to variable (NFC).
A slight simplification of the code to reduce future diffs.
2021-03-24 19:25:22 +00:00
Florian Hahn 8d1342f79d
[LV] Remove redundant access to Legal::getReductionVars() (NFC).
The reduction descriptor is retrieved earlier and stored in a variable
RdxDesc already.
2021-03-24 19:15:14 +00:00
Sander de Smalen 55d18b3cc2 [TTI] Return a TypeSize from getRegisterBitWidth.
This patch changes the interface to take a RegisterKind, to indicate
whether the register bitwidth of a scalar register, fixed-width vector
register, or scalable vector register must be returned.

Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D98874
2021-03-24 14:45:13 +00:00
Florian Hahn cd0c00c9fe
[LV] Move exact FP math check out of Requirements.
We know if the loop contains FP instructions preventing vectorization
after we are done with legality checks. This patch updates the code the
check for un-vectorizable FP operations earlier, to avoid unnecessarily
running the cost model and picking a vectorization factor. It also makes
the code more direct and moves the check to a position where similar
checks are done.

I might be missing something, but I don't see any reason to handle this
check differently to other, similar checks.

Reviewed By: lebedev.ri

Differential Revision: https://reviews.llvm.org/D98633
2021-03-24 11:01:44 +00:00
Alexey Bataev 99203f2004 [Analysis]Add getPointersDiff function to improve compile time.
Added getPointersDiff function to LoopAccessAnalysis and used it instead
direct calculatoin of the distance between pointers and/or
isConsecutiveAccess function in SLP vectorizer to improve compile time
and detection of stores consecutive chains.

Part of D57059

Differential Revision: https://reviews.llvm.org/D98967
2021-03-23 14:25:36 -07:00
Alexey Bataev f1b47ad278 Revert "[Analysis]Add getPointersDiff function to improve compile time."
This reverts commit 065a14a12d to
investigate and fix crash in SLP vectorizer.
2021-03-23 13:17:54 -07:00
Alexey Bataev 065a14a12d [Analysis]Add getPointersDiff function to improve compile time.
Added getPointersDiff function to LoopAccessAnalysis and used it instead
direct calculatoin of the distance between pointers and/or
isConsecutiveAccess function in SLP vectorizer to improve compile time
and detection of stores consecutive chains.

Part of D57059

Differential Revision: https://reviews.llvm.org/D98967
2021-03-23 12:58:42 -07:00
Sanjay Patel 3c8473ba53 [SLP] allow matching integer min/max intrinsics as reduction ops
As noted in D98152, we need to patch SLP to avoid regressions when
we start canonicalizing to integer min/max intrinsics.
Most of the real work to make this possible was in:
7202f47508

Differential Revision: https://reviews.llvm.org/D98981
2021-03-23 08:56:44 -04:00
David Sherwood d70251163f [LoopVectorize][NFC] Refactor code to use IRBuilder::CreateStepVector
In places where we create a ConstantVector whose elements are a
linear sequence of the form <start, start + 1, start + 2, ...>
I've changed the code to make use of CreateStepVector, which creates
a vector with the sequence <0, 1, 2, ...>, and a vector addition
operation. This patch is a non-functional change, since the output
from the vectoriser remains unchanged for fixed length vectors and
there are existing asserts that still fire when attempting to use
scalable vectors for vectorising induction variables.

In a later patch we will enable support for scalable vectors
in InnerLoopVectorizer::getStepVector(), which relies upon the new
stepvector intrinsic in IRBuilder::CreateStepVector.

Differential Revision: https://reviews.llvm.org/D97861
2021-03-23 11:29:05 +00:00
Florian Hahn f759d512c8
[VPlan] Include name when printing after 93a9d2de8f.
The name is included when printing in DOT mode. Also print it in non-DOT
mode after 93a9d2de8f.

This will become more important to distinguish different plans once
VPlans are gradually refined.
2021-03-23 09:50:14 +00:00
Bjorn Pettersson 688cdddafb [SLP] Honor min/max regsize and min/max VF in vectorizeStores
Make sure we use PowerOf2Floor instead of PowerOf2Ceil when
calculating max number of elements that fits inside a vector
register (otherwise we could end up creating vectors larger
than the maximum vector register size).

Also make sure we honor the min/max VF (as given by TTI or
cmd line parameters) when doing vectorizeStores.

Reviewed By: anton-afanasyev

Differential Revision: https://reviews.llvm.org/D97691
2021-03-22 17:29:35 +01:00
Andrei Elovikov 92205cb27f [NFC][VPlan] Guard print routines with "#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)"
Reviewed By: mehdi_amini

Differential Revision: https://reviews.llvm.org/D98897
2021-03-19 10:50:12 -07:00
Andrei Elovikov 93a9d2de8f [VPlan] Add plain text (not DOT's digraph) dumps
I foresee two uses for this:
1) It's easier to use those in debugger.
2) Once we start implementing more VPlan-to-VPlan transformations (especially
   inner loop massaging stuff), using the vectorized LLVM IR as CHECK targets in
   LIT test would become too obscure. I can imagine that we'd want to CHECK
   against VPlan dumps after multiple transformations instead. That would be
   easier with plain text dumps than with DOT format.

Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D96628
2021-03-19 10:50:12 -07:00
Mehdi Amini 3614df3537 Revert "[VPlan] Add plain text (not DOT's digraph) dumps"
This reverts commit 6b053c9867.
The build is broken:

ld.lld: error: undefined symbol: llvm::VPlan::printDOT(llvm::raw_ostream&) const
>>> referenced by LoopVectorize.cpp
>>>               LoopVectorize.cpp.o:(llvm::LoopVectorizationPlanner::printPlans(llvm::raw_ostream&)) in archive lib/libLLVMVectorize.a
2021-03-18 19:20:39 +00:00
Andrei Elovikov 6b053c9867 [VPlan] Add plain text (not DOT's digraph) dumps
I foresee two uses for this:
1) It's easier to use those in debugger.
2) Once we start implementing more VPlan-to-VPlan transformations (especially
   inner loop massaging stuff), using the vectorized LLVM IR as CHECK targets in
   LIT test would become too obscure. I can imagine that we'd want to CHECK
   against VPlan dumps after multiple transformations instead. That would be
   easier with plain text dumps than with DOT format.

Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D96628
2021-03-18 11:33:39 -07:00
Alexey Bataev b3ced9852c [SLP]Fix crash on extending scheduling region.
If SLP vectorizer tries to extend the scheduling region and runs out of
the budget too early, but still extends the region to the new ending
instructions (i.e., it was able to extend the region for the first
instruction in the bundle, but not for the second), the compiler need to
recalculate dependecies in full, just like if the extending was
successfull. Without it, the schedule data chunks may end up with the
wrong number of (unscheduled) dependecies and it may end up with the
incorrect function, where the vectorized instruction does not dominate
on the extractelement instruction.

Differential Revision: https://reviews.llvm.org/D98531
2021-03-18 06:11:08 -07:00
David Green e2935dcfc4 [TTI] Add a Mask to getShuffleCost
This adds an Mask ArrayRef to getShuffleCost, so that if an exact mask
can be provided a more accurate cost can be provided by the backend.
For example VREV costs could be returned by the ARM backend. This should
be an NFC until then, laying the groundwork for that to be added.

Differential Revision: https://reviews.llvm.org/D98206
2021-03-17 17:46:26 +00:00
LemonBoy 4f024938e4 [LoopVectorize] Refine hasIrregularType predicate
The `hasIrregularType` predicate checks whether an array of N values of type Ty is "bitcast-compatible" with a <N x Ty> vector.
The previous check returned invalid results in some cases where there's some padding between the array elements: eg. a 4-element array of u7 values is considered as compatible with <4 x u7>, even though the vector is only loading/storing 28 bits instead of 32.

The problem causes LLVM to generate incorrect code for some targets: for AArch64 the vector loads/stores are lowered in terms of ubfx/bfi, effectively losing the top (N * padding bits).

Reviewed By: lebedev.ri

Differential Revision: https://reviews.llvm.org/D97465
2021-03-17 17:03:47 +01:00
David Green 3c25c40d51 [LV] Account for the cost of predication of scalarized load/store
This adds the cost of an i1 extract and a branch to the cost in
getMemInstScalarizationCost when the instruction is predicated. These
predicated loads/store would generate blocks of something like:

    %c1 = extractelement <4 x i1> %C, i32 1
    br i1 %c1, label %if, label %else
  if:
    %sa = extractelement <4 x i32> %a, i32 1
    %sb = getelementptr inbounds float, float* %pg, i32 %sa
    %sv = extractelement <4 x float> %x, i32 1
    store float %sa, float* %sb, align 4
  else:

So this increases the cost by the extract and branch. This is probably
still too low in many cases due to the cost of all that branching, but
there is already an existing hack increasing the cost using
useEmulatedMaskMemRefHack. It will increase the cost of a memop if it is
a load or there are more than one store. This patch improves the cost
for when there is only a single store, and hopefully at some point in
the future the hack can be removed.

Differential Revision: https://reviews.llvm.org/D98243
2021-03-17 10:57:50 +00:00
Bu Le 9abe500473 [SLP] Fix the trunc instruction insertion problem
Current SLP pass has this piece of code that inserts a trunc instruction
after the vectorized instruction. In the case that the vectorized instruction
is a phi node and not the last phi node in the BB, the trunc instruction
will be inserted between two phi nodes, which will trigger verify problem
in debug version or unpredictable error in another pass.
This patch changes the algorithm to 'if the last vectorized instruction
is a phi, insert it after the last phi node in current BB' to fix this problem.
2021-03-17 13:51:08 +03:00
Sanjay Patel 7202f47508 [SLP] separate min/max matching from its instruction-level implementation; NFC
The motivation is to handle integer min/max reductions independently
of whether they are in the current cmp+sel form or the planned intrinsic
form.

We assumed that min/max included a select instruction, but we can
decouple that implementation detail by checking the instructions
themselves rather than relying on the recurrence (reduction) type.
2021-03-16 17:16:11 -04:00
Florian Hahn f586de8459
[VPlan] Remove PredInst2Recipe, use VP operands instead. (NFC)
Instead of maintaining a separate map from predicated instructions to
recipes, we can instead directly look at the VP operands. If the operand
comes from a predicated instruction, the operand will be a
VPPredInstPHIRecipe with a VPReplicateRecipe as its operand.
2021-03-16 17:40:35 +00:00
Sanjay Patel 40fdb43d30 [SLP] improve readability in reduction logic; NFC
We had 2 different and ambiguously-named 'I' variables.
2021-03-16 07:35:13 -04:00
Caroline Concatto 3c03635d53 [SVE][LoopVectorize] Add support for scalable vectorization of loops with vector reverse
This patch adds support for reverse loop vectorization.
It is possible to vectorize the following loop:
```
  for (int i = n-1; i >= 0; --i)
    a[i] = b[i] + 1.0;
```
with fixed or scalable vector.
The loop-vectorizer will use 'reverse' on the loads/stores to make
sure the lanes themselves are also handled in the right order.
This patch adds support for scalable vector on IRBuilder interface to
create a reverse vector. The IR function
CreateVectorReverse lowers to experimental.vector.reverse for scalable vector
and keedp the original behavior for fixed vector using shuffle reverse.

Differential Revision: https://reviews.llvm.org/D95363
2021-03-16 07:51:59 +00:00
Florian Hahn fb3ca70761
[LV] Account IV recipes being uniform in VPTransformState::get().
This patch fixes a crash when trying to get a scalar value using
VPTransformState::get() for uniform induction values or truncated
induction values. IVs and truncated IVs can be uniform and the updated
code accounts for that, fixing the crash.

This should fix
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=31981
2021-03-12 13:29:06 +00:00
Valery N Dmitriev 73f94969b2 [SLP] Fix crash when matching associative reduction for integer min/max.
Associative reduction matcher in SLP begins with select instruction but when
it reached call to llvm.umax (or alike) via def-use chain the latter also matched
as UMax kind. The routine's later code assumes matched instruction to be a select
and thus it merely died on the first encountered cast that did not fit.

Differential Revision: https://reviews.llvm.org/D98432
2021-03-11 11:52:57 -08:00
Mauri Mustonen 0de8aeae72
[VPlan] Support to widen select intructions in VPlan native path
Add support to widen select instructions in VPlan native path by using a correct recipe when such instructions are encountered. This is already used by inner loop vectorizer.

Previously select instructions get handled by the wrong recipe and resulted in unreachable instruction errors like this one: https://bugs.llvm.org/show_bug.cgi?id=48139.

Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D97136
2021-03-10 20:59:53 +00:00
Sanjay Patel 23fd647cc6 [SLP] remove dead null check; NFC
We cast<> to Instruction (not dyn_cast<>), so we already
required/assumed that Cmp is not null.
2021-03-09 17:43:07 -05:00
Mauri Mustonen 494b5ba364
[VPlan] Support to widen call intructions in VPlan native path
Add support to widen call instructions in VPlan native path by using a correct recipe when such instructions are encountered. This is already used by inner loop vectorizer.

Previously call instructions got handled by wrong recipes and resulted in unreachable instruction errors like this one: https://bugs.llvm.org/show_bug.cgi?id=48139.

Patch by Mauri Mustonen <mauri.mustonen@tuni.fi>

Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D97278
2021-03-06 21:59:52 +00:00
David Sherwood fec0a0adac [SVE][LoopVectorize] Add support for extracting the last lane of a scalable vector
There are certain loops like this below:

  for (int i = 0; i < n; i++) {
    a[i] = b[i] + 1;
    *inv = a[i];
  }

that can only be vectorised if we are able to extract the last lane of the
vectorised form of 'a[i]'. For fixed width vectors this already works since
we know at compile time what the final lane is, however for scalable vectors
this is a different story. This patch adds support for extracting the last
lane from a scalable vector using a runtime determined lane value. I have
added support to VPIteration for runtime-determined lanes that still permit
the caching of values. I did this by introducing a new class called VPLane,
which describes the lane we're dealing with and provides interfaces to get
both the compile-time known lane and the runtime determined value. Whilst
doing this work I couldn't find any explicit tests for extracting the last
lane values of fixed width vectors so I added tests for both scalable and
fixed width vectors.

Differential Revision: https://reviews.llvm.org/D95139
2021-03-05 09:57:56 +00:00
Sanjay Patel 1bee549737 [LoopVectorize] propagate fast-math-flags from induction instructions
This code assumed that FP math was only permissable if it was
fully "fast", so it hard-coded "fast" when creating new instructions.

The underlying code already allows matching recurrences/reductions
that are only "reassoc", so this change should prevent the potential
miscompile seen in the test diffs (we created "fast" ops even though
none existed in the original code).

I don't know if we need to create the temporary IRBuilder objects
used here, so that could be follow-up clean-up.

There's an open question about whether we should require "nsz" in
addition to "reassoc" here. InstCombine uses that combo for its
reassociative folds, but I think codegen is not as strict.
2021-03-04 17:21:32 -05:00
Sanjay Patel 36a489d194 [Analysis][LoopVectorize] rename "Unsafe" variables/methods; NFC
Similar to b3a33553ae, but this shows a TODO and a potential
miscompile is already present.

We are tracking an FP instruction that does *not* have FMF (reassoc)
properties, so calling that "Unsafe" seems opposite of the common
reading.

I also removed one getter method by rolling the null check into
the access. Further simplification may be possible.

The motivation is to clean up the interactions between FMF and
function-level attributes in these classes and their callers.

The new test shows that there is an existing bug somewhere in
the callers. We assumed that the original code was fully 'fast'
and so we produced IR with 'fast' even though it was just 'reassoc'.
2021-03-04 10:40:26 -05:00
Sanjay Patel b3a33553ae [Analysis][LoopVectorize] rename "Unsafe" variables/methods; NFC
We are tracking an FP instruction that does *not* have FMF (reassoc)
properties, so calling that "Unsafe" seems opposite of the common
reading.

I also removed one getter method by rolling the null check into
the access. Further simplification seems possible.

The motivation is to clean up the interactions between FMF and
function-level attributes in these classes and their callers.
2021-03-04 08:53:04 -05:00
Andrei Elovikov b24afec8ae [NFCI][VPlan] Modify Recipes' print methods to honor Indent parameter
Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D97787
2021-03-02 15:32:10 -08:00
Alexey Bataev a054e94e9e [SLP]Merge reorder and reuse shuffles.
It is possible to merge reuse and reorder shuffles and reduce the total
cost of the vectorization tree/number of final instructions.

Differential Revision: https://reviews.llvm.org/D94992
2021-03-02 06:39:47 -08:00
Florian Hahn a6c81d3366
[VPlan] Remove recipes from back to front.
Update the deletion order when destroying VPBasicBlocks. This ensures
recipes that depend on earlier ones in the block are removed first.
Otherwise this may cause issues when recipes have remaining users later
in the block.
2021-03-01 16:06:30 +00:00
Florian Hahn 53dacb7b67
[LV] Generate RT checks up-front and remove them if required.
This patch updates LV to generate the runtime checks just after cost
modeling, to allow a more precise estimate of the actual cost of the
checks. This information will be used in future patches to generate
larger runtime checks in cases where the checks only make up a small
fraction of the expected scalar loop execution time.

The runtime checks are created up-front in a temporary block to allow better
estimating the cost and un-linked from the existing IR. After deciding to
vectorize, the checks are moved backed. If deciding not to vectorize, the
temporary block is completely removed.

This patch is similar in spirit to D71053, but explores a different
direction: instead of delaying the decision on whether to vectorize in
the presence of runtime checks it instead optimistically creates the
runtime checks early and discards them later if decided to not
vectorize. This has the advantage that the cost-modeling decisions
can be kept together and can be done up-front and thus preserving the
general code structure. I think delaying (part) of the decision to
vectorize would also make the VPlan migration a bit harder.

One potential drawback of this patch is that we speculatively
generate IR which we might have to clean up later. However it seems like
the code required to do so is quite manageable.

Reviewed By: lebedev.ri, ebrevnov

Differential Revision: https://reviews.llvm.org/D75980
2021-03-01 10:48:04 +00:00
Sander de Smalen 5e19208d96 [InstructionCost] NFC: Fix up missing cases in LoopVectorize and CodeGenPrep.
This fixes the types of a few more cost variables to be of type InstructionCost.
2021-02-24 14:30:03 +00:00
Florian Hahn 6240f436dd
Recommit "[LV] Allow tryToCreateWidenRecipe to return a VPValue, use for blends."
This reverts the revert commit 437f0bbcd5.

It adds a new toVPRecipeResult, which forces VPRecipeOrVPValueTy to be
constructed with a VPRecipeBase *. This should address ambiguous
constructor issues for recipe sub-types that also inherit from VPValue.
2021-02-24 10:36:02 +00:00
Andrei Elovikov 3605b873f6 [NFC][VPlan] Use VPUser to store block's predicate
Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D96529
2021-02-23 11:08:27 -08:00
Florian Hahn de40423c85
[LV] Ensure fixNonInductionPHIs uses a valid insertion point.
In some cases, Builder's insertion point may be invalidated before using
it in VPTransformState::get. Make sure the insertion point is
up-to-date.

This should fix various sanitizer errors, like
https://lab.llvm.org/buildbot/#/builders/5/builds/4933/steps/9/logs/stdio
2021-02-23 18:51:05 +00:00
Florian Hahn 437f0bbcd5
Revert "[LV] Allow tryToCreateWidenRecipe to return a VPValue, use for blends."
This reverts commit 4efa097eb4, because
some the compilers used for some bots do not support automatic
conversions to PointerUnion.
2021-02-23 16:57:21 +00:00
Florian Hahn 4efa097eb4
[LV] Allow tryToCreateWidenRecipe to return a VPValue, use for blends.
Generalize the return value of tryToCreateWidenRecipe to return either a
newly create recipe or an existing VPValue. Use this to avoid creating
unnecessary VPBlendRecipes.

Fixes PR44800.
2021-02-23 16:52:03 +00:00
David Green dd2dbf7ee2 [TTI] Change getOperandsScalarizationOverhead to take Type args
As a followup to D95291, getOperandsScalarizationOverhead was still
using a VF as a vector factor if the arguments were scalar, and would
assert on certain matrix intrinsics with differently sized vector
arguments. This patch removes the VF arg, instead passing the Types
through directly. This should allow it to more accurately compute the
cost without having to guess at which operands will be vectorized,
something difficult with more complex intrinsics.

This adjusts one SVE test as it is now calling the wrong intrinsic vs
veccall. Without invalid InstructCosts the cost of the scalarized
intrinsic is too low. This should get fixed when the cost of
scalarization is accounted for with scalable types.

Differential Revision: https://reviews.llvm.org/D96287
2021-02-23 13:04:59 +00:00
David Green bd4b61efbd [CostModel] Remove VF from IntrinsicCostAttributes
getIntrinsicInstrCost takes a IntrinsicCostAttributes holding various
parameters of the intrinsic being costed. It can either be called with a
scalar intrinsic (RetTy==Scalar, VF==1), with a vector instruction
(RetTy==Vector, VF==1) or from the vectorizer with a scalar type and
vector width (RetTy==Scalar, VF>1). A RetTy==Vector, VF>1 is considered
an error. Both of the vector modes are expected to be treated the same,
but because this is confusing many backends end up getting it wrong.

Instead of trying work with those two values separately this removes the
VF parameter, widening the RetTy/ArgTys by VF used called from the
vectorizer. This keeps things simpler, but does require some other
modifications to keep things consistent.

Most backends look like this will be an improvement (or were not using
getIntrinsicInstrCost). AMDGPU needed the most changes to keep the code
from c230965ccf working. ARM removed the fix in
dfac521da1, webassembly happens to get a fixup for an SLP cost
issue and both X86 and AArch64 seem to now be using better costs from
the vectorizer.

Differential Revision: https://reviews.llvm.org/D95291
2021-02-23 13:03:26 +00:00
Alexey Bataev 9a4dd4de9d [SLP]No need to mark scatter load pointer as scalar as it gets vectorized.
Pointer operand of scatter loads does not remain scalar in the tree (it
gest vectorized) and thus must not be marked as the scalar that remains
scalar in vectorized form.

Differential Revision: https://reviews.llvm.org/D96818
2021-02-22 11:58:28 -08:00
Florian Hahn c7ee57f1dc
[LV] Directly use incoming value for single VPBlendRecipes.
VPBlendRecipes with single incoming (value, mask) pair are no-ops. Use
the incoming value directly.
2021-02-22 16:10:08 +00:00
Florian Hahn c11fd0df64
[VPlan] Skip VPWidenPHIRecipe in VPInterleavedACcessInfo.
Update unit tests that did not expect VPWidenPHIRecipes after
15a74b64df.
2021-02-22 10:35:09 +00:00
Florian Hahn 15a74b64df
[VPlan] Manage pairs of incoming (VPValue, VPBB) in VPWidenPHIRecipe.
This patch extends VPWidenPHIRecipe to manage pairs of incoming
(VPValue, VPBasicBlock) in the VPlan native path. This is made possible
because we now directly manage defined VPValues for recipes.

By keeping both the incoming value and block in the recipe directly,
code-generation in the VPlan native path becomes independent of the
predecessor ordering when fixing up non-induction phis, which currently
can cause crashes in the VPlan native path.

This fixes PR45958.

Reviewed By: sguggill

Differential Revision: https://reviews.llvm.org/D96773
2021-02-22 09:44:25 +00:00
Benjamin Kramer 59f442e6bb [LV] Fold single-use variable into assert. NFC. 2021-02-19 18:11:39 +01:00
Florian Hahn edc92a1c42
[LV] Remove VPCallback.
Now that all state for generated instructions is managed directly in
VPTransformState, VPCallBack is no longer needed. This patch updates the
last use of `getOrCreateScalarValue` to instead manage the value
directly in VPTransformState and removes VPCallback.

Reviewed By: gilr

Differential Revision: https://reviews.llvm.org/D95383
2021-02-19 12:50:41 +00:00
Joseph Huber c3a3d20093 [LV] Add analysis remark for mixed precision conversions
Floating point conversions inside vectorized loops have performance
implications but are very subtle. The user could specify a floating
point constant, or call a function without realizing that it will
force a change in the vector width. An example of this behaviour is
seen in https://godbolt.org/z/M3nT6c . The vectorizer should indicate
when this happens becuase it is most likely unintended behaviour.

This patch adds a simple check for this behaviour by following floating
point stores in the original loop and checking if a floating point
conversion operation occurs.

Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D95539
2021-02-17 21:37:08 -05:00
Florian Hahn f64c626069
[VPlan] Remove unused Phi member from VPWidenPHIRecipe (NFC).
The member is not needed any longer after recent changes.
2021-02-16 13:53:06 +00:00
Kerry McLaughlin ba1e150d03 [SVE] Add support for scalable vectorization of loops with int/fast FP reductions
This patch enables scalable vectorization of loops with integer/fast reductions, e.g:

```
unsigned sum = 0;
for (int i = 0; i < n; ++i) {
  sum += a[i];
}
```

A new TTI interface, isLegalToVectorizeReduction, has been added to prevent
reductions which are not supported for scalable types from vectorizing.
If the reduction is not supported for a given scalable VF,
computeFeasibleMaxVF will fall back to using fixed-width vectorization.

Reviewed By: david-arm, fhahn, dmgreen

Differential Revision: https://reviews.llvm.org/D95245
2021-02-16 13:50:06 +00:00
Florian Hahn 54a14c264a
[VPlan] Manage scalarized values using VPValues.
This patch updates codegen to use VPValues to manage the generated
scalarized instructions.

Reviewed By: gilr

Differential Revision: https://reviews.llvm.org/D92285
2021-02-16 09:04:10 +00:00
Juneyoung Lee ed253ef772 [LoopVectorize] Fix VPRecipeBuilder::createEdgeMask to correctly generate the mask
This patch fixes pr48832 by correctly generating the mask when a poison value is involved.

Consider this CFG (which is a part of the input):

```
for.body:                                         ; preds = %for.cond
  br i1 true, label %cond.false, label %land.rhs

land.rhs:                                         ; preds = %for.body
  br i1 poison, label %cond.end, label %cond.false

cond.false:                                       ; preds = %for.body, %land.rhs
  br label %cond.end

cond.end:                                         ; preds = %land.rhs, %cond.false
  %cond = phi i32 [ 0, %cond.false ], [ 1, %land.rhs ]

```

The path for.body -> land.rhs -> cond.end should be taken when 'select i1 false, i1 poison, i1 false' holds (which means it's never taken); but VPRecipeBuilder::createEdgeMask was emitting 'and i1 false, poison' instead.
The former one successfully blocks poison propagation whereas the latter one doesn't, making the condition poison and thus causing the miscompilation.

SimplifyCFG has a similar bug (which didn't expose a real-world bug yet), and a patch for this is also ongoing (see https://reviews.llvm.org/D95026).

Reviewed By: bjope

Differential Revision: https://reviews.llvm.org/D95217
2021-02-14 21:12:34 +09:00
Kerry McLaughlin fea06efe7c [SVE][LoopVectorize] Support for vectorization of loops with function calls
Changes `getScalarizationOverhead` to return an invalid cost for scalable VFs
and adds some simple tests for loops containing a function for which
there is a vectorized variant available.

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D96356
2021-02-12 13:47:43 +00:00
Florian Hahn 85fe5c9345
[VPlan] Make VPRecipeBase inherit from VPUser directly (NFC).
The individual recipes have been updated to manage their operands using
VPUser a while back. Now that the transition is done, we can instead
make VPRecipeBase a VPUser and get rid of the toVPUser helper.
2021-02-12 13:06:58 +00:00
David Sherwood 01b87444cb [NFC][Analysis] Change struct VecDesc to use ElementCount
This patch changes the VecDesc struct to use ElementCount
instead of an unsigned VF value, in preparation for
future work that adds support for vectorized versions of
math functions using scalable vectors. Since all I'm doing
in this patch is switching the type I believe it's a
non-functional change. I changed getWidestVF to now return
both the widest fixed-width and scalable VF values, but
currently the widest scalable value will be zero.

Differential Revision: https://reviews.llvm.org/D96011
2021-02-12 11:07:58 +00:00
Sander de Smalen 703130fb01 [TTI] Change TargetTransformInfo::getMinimumVF to return ElementCount
This will be needed in the loop-vectorizer where the minimum VF
requested may be a scalable VF. getMinimumVF now takes an additional
operand 'IsScalableVF' that indicates whether a scalable VF is required.

Reviewed By: kparzysz, rampitec

Differential Revision: https://reviews.llvm.org/D96020
2021-02-11 09:08:48 +00:00
Sander de Smalen be9bbb57f4 [LoopVectorize] NFC: Change selectVectorizationFactor to work on ElementCount.
This patch is NFC and changes occurrences of `unsigned Width`
and `unsigned i` to work on type ElementCount instead.

This patch is a preparatory patch with the ultimate goal of making
`computeMaxVF()` return both a max fixed VF and a max scalable VF,
so that `selectVectorizationFactor()` can pick the most cost-effective
vectorization factor.

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D96019
2021-02-11 08:47:59 +00:00
Sander de Smalen 9db6e97a86 [LoopVectorize] NFC: Change computeFeasibleMaxVF to operate on ElementCount.
This patch is NFC and changes occurrences of `unsigned MaxVectorSize`
to work on type ElementCount.

This patch is a preparatory patch with the ultimate goal of making
`computeMaxVF()` return both a max fixed VF and a max scalable VF,
so that `selectVectorizationFactor()` can pick the most cost-effective
vectorization factor.

Reviewed By: kmclaughlin

Differential Revision: https://reviews.llvm.org/D96018
2021-02-10 08:52:10 +00:00
Florian Hahn fd8afa41eb
[VPlan] Use VPUser to manage CondBit
VP blocks keep track of a condition, which is a VPValue. This patch
updates VPBlockBase to manage the value using VPUser, so
replaceAllUsesWith properly updates the condition bit as well.

This is required to enable VP2VP transformations and it helps with
simplifying some of the code required to manage condition bits.

Reviewed By: gilr

Differential Revision: https://reviews.llvm.org/D95382
2021-02-09 21:53:50 +00:00
Kazu Hirata 302313a264 [Transforms] Use range-based for loops (NFC) 2021-02-08 22:33:53 -08:00
Jinsong Ji 9202806241 Revert "[CostModel] Remove VF from IntrinsicCostAttributes"
This reverts commit 502a67dd7f.

This expose a failure in test-suite build on PowerPC,
revert to unblock buildbot first,
Dave will re-commit in https://reviews.llvm.org/D96287.

Thanks Dave.
2021-02-09 02:14:14 +00:00
Florian Hahn 3bb6dc0b26
[LV] Replace some uses of VectorLoopValueMap with VPTransformState (NFC)
This patch updates some places where VectorLoopValueMap is accessed
directly to instead go through VPTransformState.

As we move towards managing created values exclusively in VPTransformState,
this ensures the use always can fetch the correct value.

This is in preparation for D92285, which switches to managing scalarized
values through VPValues.

In the future, the various fix* functions should be moved directly into
the VPlan codegen stage.

Reviewed By: gilr

Differential Revision: https://reviews.llvm.org/D95757
2021-02-07 18:28:21 +00:00
Adrian Kuegel 7fe41ac3df Revert "[LV] Unconditionally branch from middle to scalar preheader if the scalar loop must execute"
This reverts commit 3e5ce49e53.

Tests started failing on PPC, for example:
http://lab.llvm.org:8011/#/builders/105/builds/5569
2021-02-05 12:51:03 +01:00
David Green 502a67dd7f [CostModel] Remove VF from IntrinsicCostAttributes
getIntrinsicInstrCost takes a IntrinsicCostAttributes holding various
parameters of the intrinsic being costed. It can either be called with a
scalar intrinsic (RetTy==Scalar, VF==1), with a vector instruction
(RetTy==Vector, VF==1) or from the vectorizer with a scalar type and
vector width (RetTy==Scalar, VF>1). A RetTy==Vector, VF>1 is considered
an error. Both of the vector modes are expected to be treated the same,
but because this is confusing many backends end up getting it wrong.

Instead of trying work with those two values separately this removes the
VF parameter, widening the RetTy/ArgTys by VF used called from the
vectorizer. This keeps things simpler, but does require some other
modifications to keep things consistent.

Most backends look like this will be an improvement (or were not using
getIntrinsicInstrCost). AMDGPU needed the most changes to keep the code
from c230965ccf working. ARM removed the fix in
dfac521da1, webassembly happens to get a fixup for an SLP cost
issue and both X86 and AArch64 seem to now be using better costs from
the vectorizer.

Differential Revision: https://reviews.llvm.org/D95291
2021-02-05 09:34:24 +00:00
Craig Topper 11ef356d9e [TargetLowering] Use Align in allowsMisalignedMemoryAccesses.
Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D96097
2021-02-04 19:22:06 -08:00
Philip Reames 3e5ce49e53 [LV] Unconditionally branch from middle to scalar preheader if the scalar loop must execute
If we know that the scalar epilogue is required to run, modify the CFG to end the middle block with an unconditional branch to scalar preheader. This is instead of a conditional branch to either the preheader or the exit block.

The motivation to do this is to support multiple exit blocks. Specifically, the current structure forces us to identify immediate dominators and *which* exit block to branch from in the middle terminator. For the multiple exit case - where we know require scalar will hold - these questions are ill formed.

This is the last change needed to support multiple exit loops, but since the diffs are already large enough, I'm going to land this, and then enable separately. You can think of this as being NFCI-ish prep work, but the changes are a bit too involved for me to feel comfortable tagging the change that way.

Differential Revision: https://reviews.llvm.org/D94892
2021-02-04 17:28:30 -08:00
Florian Hahn daaa0e3501
[VPlan] Manage induction value creation using VPValues.
This patch updates the induction value creation to use VPValues of
recipes to map the created values. This should bring is one step closer
to being able to optimize induction recipes directly in VPlan.

Currently widenIntOrFpInduction also generates vector values for a cast
of the induction, if it exists. Make this explicit by adding the cast
instruction to the values defined by the recipe.

Reviewed By: gilr

Differential Revision: https://reviews.llvm.org/D92284
2021-02-03 17:45:03 +00:00
David Sherwood d4626eb0bd [VPlan][NFC] Introduce constructors for VPIteration
This patch adds constructors to VPIteration as a cleaner way of
initialising the struct and replaces existing constructions of
the form:

  {Part, Lane}

with

  VPIteration(Part, Lane)

I have also added a default constructor, which is used by VPlan.cpp
when deciding whether to replicate a block or not.

This refactoring will be required in a later patch that adds more
members and functions to VPIteration.

Differential Revision: https://reviews.llvm.org/D95676
2021-02-03 08:52:27 +00:00
David Sherwood d4d4ceeb8f [SVE][LoopVectorize] Add masked load/store and gather/scatter support for SVE
This patch updates IRBuilder::CreateMaskedGather/Scatter to work
with ScalableVectorType and adds isLegalMaskedGather/Scatter functions
to AArch64TargetTransformInfo. In addition I've fixed up
isLegalMaskedLoad/Store to return true for supported scalar types,
since this is what the vectorizer asks for.

In LoopVectorize.cpp I've changed
LoopVectorizationCostModel::getInterleaveGroupCost to return an invalid
cost for scalable vectors, since currently this relies upon using shuffle
vector for reversing vectors. In addition, in
LoopVectorizationCostModel::setCostBasedWideningDecision I have assumed
that the cost of scalarising memory ops is infinitely expensive.

I have added some simple masked load/store and gather/scatter tests,
including cases where we use gathers and scatters for conditional invariant
loads and stores.

Differential Revision: https://reviews.llvm.org/D95350
2021-02-02 09:52:39 +00:00
Gil Rapaport d475030dc2 [SCEV] Apply loop guards to divisibility tests
Extend applyLoopGuards() to take into account conditions/assumes proving some
value %v to be divisible by D by rewriting %v to (%v / D) * D. This lets the
loop unroller and the loop vectorizer identify more loops as not requiring
remainder loops.

Differential Revision: https://reviews.llvm.org/D95521
2021-02-02 08:09:39 +02:00
Sanjay Patel bbed5f2f8a [LoopVectorize] improve IR fast-math-flags propagation in reductions
This is another step (see D95452) towards correcting fast-math-flags
bugs in vector reductions.

There are multiple bugs visible in the test diffs, and this is still
not working as it should. We still use function attributes (rather
than FMF) to drive part of the logic, but we are not checking for
the correct FP function attributes.

Note that FMF may not be propagated optimally on selects (example
in https://llvm.org/PR35607 ). That's why I'm proposing to union the
FMF of a fcmp+select pair and avoid regressions on existing vectorizer
tests.

Differential Revision: https://reviews.llvm.org/D95690
2021-02-01 16:21:36 -05:00
Cullen Rhodes 8cda227432 [LV] Fix crash when computing max VF too early
D90687 introduced a crash:

  llvm::LoopVectorizationCostModel::computeMaxVF(llvm::ElementCount, unsigned int):
    Assertion `WideningDecisions.empty() && Uniforms.empty() && Scalars.empty() &&
    "No decisions should have been taken at this point"' failed.

when compiling the following C code:

  typedef struct {
  char a;
  } b;

  b *c;
  int d, e;

  int f() {
    int g = 0;
    for (; d; d++) {
      e = 0;
      for (; e < c[d].a; e++)
        g++;
    }
    return g;
  }

with:

  clang -Os -target hexagon -mhvx -fvectorize -mv67 testcase.c -S -o -

This occurred since prior to D90687 computeFeasibleMaxVF would only be
called in computeMaxVF when a scalar epilogue was allowed, but now it's
always called. This causes the assert above since computeFeasibleMaxVF
collects all viable VFs larger than the default MaxVF, and for each VF
calculates the register usage which results in analysis being done the
assert above guards against. This can occur in computeFeasibleMaxVF if
TTI.shouldMaximizeVectorBandwidth and this target hook is implemented in
the hexagon backend to always return true.

Reported by @iajbar.

Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D94869
2021-02-01 12:14:59 +00:00
Sanjay Patel ab93c18c12 [LoopVectorize] use IR fast-math-flags exclusively (not FP function attributes)
I am trying to untangle the fast-math-flags propagation logic
in the vectorizers (see a6f022127 for SLP).

The loop vectorizer has a mix of checking FP function attributes,
IR-level FMF, and just wrong assumptions.

I am trying to avoid regressions while fixing this, and I think
the IR-level logic is good enough for that, but it's hard to say
for sure. This would be the 1st step in the clean-up.

The existing test that I changed to include 'fast' actually shows
a miscompile: the function only had the equivalent of nnan, but we
created new instructions that had fast (all FMF set). This is
similar to the example in https://llvm.org/PR35538

Differential Revision: https://reviews.llvm.org/D95452
2021-01-27 14:17:11 -05:00
Florian Hahn 28410d17f5
[LoopUtils] Pass SCEVExpander instead SE to addRuntimeChecks.
This gives the user control over which expander to use, which in turn
allows the user to decide what to do with the expanded instructions.

Used in D75980.

Reviewed By: lebedev.ri

Differential Revision: https://reviews.llvm.org/D94295
2021-01-27 17:36:19 +00:00
Florian Hahn 76afbf60ed
[VPlan] Replace uses with new value in VPInstructionsToVPRecipe (NFC).
Now that VPRecipeBase inherits from VPDef, we can always use the new
VPValue for replacement, if the recipe defines one. Given the recipes
that are supported at the moment, all new recipes must have either 0 or
1 defined values.
2021-01-25 19:38:08 +00:00
Florian Hahn 3201274dea
[VPlan] Handle scalarized values in VPTransformState.
This patch adds plumbing to handle scalarized values directly in
VPTransformState.

Reviewed By: gilr

Differential Revision: https://reviews.llvm.org/D92282
2021-01-25 14:21:56 +00:00
Sander de Smalen 171d12489f [SLPVectorizer] NFC: Migrate getVectorCallCosts to use InstructionCost.
This change also changes getReductionCost to return InstructionCost,
and it simplifies two expressions by removing a redundant 'isValid' check.
2021-01-25 12:27:01 +00:00
Sanjay Patel 77adbe6a8c [SLP] fix fast-math requirements for fmin/fmax reductions
a6f0221276 enabled intersection of FMF on reduction instructions,
so it is safe to ease the check here.

There is still some room to improve here - it looks like we
have nearly duplicate flags propagation logic inside of the
LoopUtils helper but it is limited targets that do not form
reduction intrinsics (they form the shuffle expansion).
2021-01-24 08:55:56 -05:00
Nikita Popov c83cff45c7 [IR] Add NoAliasScopeDeclInst (NFC)
Add an intrinsic type class to represent the
llvm.experimental.noalias.scope.decl intrinsic, to make code
working with it a bit nicer by hiding the metadata extraction
from view.
2021-01-23 22:40:32 +01:00
Kazu Hirata 1238378f18 [llvm] Use pop_back_val (NFC) 2021-01-23 10:56:33 -08:00
Sanjay Patel a6f0221276 [SLP] fix fast-math-flag propagation on FP reductions
As shown in the test diffs, we could miscompile by
propagating flags that did not exist in the original
code.

The flags required for fmin/fmax reductions will be
fixed in a follow-up patch.
2021-01-23 11:17:20 -05:00
Anton Rapetov a4914dc1f2 [SLP] do not traverse constant uses
Walking the use list of a Constant (particularly, ConstantData)
is not scalable, since a given constant may be used by many
instructinos in many functions in many modules.

Differential Revision: https://reviews.llvm.org/D94713
2021-01-22 08:14:09 -05:00
David Sherwood 2e080eb00a [SVE] Add support for scalable vectorization of loops with selects and cmps
I have removed an unnecessary assert in LoopVectorizationCostModel::getInstructionCost
that prevented a cost being calculated for select instructions when using
scalable vectors. In addition, I have changed AArch64TTIImpl::getCmpSelInstrCost
to only do special cost calculations for fixed width vectors and fall
back to the base version for scalable vectors.

I have added a simple cost model test for cmps and selects:

  test/Analysis/CostModel/sve-cmpsel.ll

and some simple tests that show we vectorize loops with cmp and select:

  test/Transforms/LoopVectorize/AArch64/sve-basic-vec.ll

Differential Revision: https://reviews.llvm.org/D95039
2021-01-22 09:48:13 +00:00
David Green 39db5753f9 [LV][ARM] Inloop reduction cost modelling
This adds cost modelling for the inloop vectorization added in
745bf6cf44. Up until now they have been modelled as the original
underlying instruction, usually an add. This happens to works OK for MVE
with instructions that are reducing into the same type as they are
working on. But MVE's instructions can perform the equivalent of an
extended MLA as a single instruction:

  %sa = sext <16 x i8> A to <16 x i32>
  %sb = sext <16 x i8> B to <16 x i32>
  %m = mul <16 x i32> %sa, %sb
  %r = vecreduce.add(%m)
  ->
  R = VMLADAV A, B

There are other instructions for performing add reductions of
v4i32/v8i16/v16i8 into i32 (VADDV), for doing the same with v4i32->i64
(VADDLV) and for performing a v4i32/v8i16 MLA into an i64 (VMLALDAV).
The i64 are particularly interesting as there are no native i64 add/mul
instructions, leading to the i64 add and mul naturally getting very
high costs.

Also worth mentioning, under NEON there is the concept of a sdot/udot
instruction which performs a partial reduction from a v16i8 to a v4i32.
They extend and mul/sum the first four elements from the inputs into the
first element of the output, repeating for each of the four output
lanes. They could possibly be represented in the same way as above in
llvm, so long as a vecreduce.add could perform a partial reduction. The
vectorizer would then produce a combination of in and outer loop
reductions to efficiently use the sdot and udot instructions. Although
this patch does not do that yet, it does suggest that separating the
input reduction type from the produced result type is a useful concept
to model. It also shows that a MLA reduction as a single instruction is
fairly common.

This patch attempt to improve the costmodelling of in-loop reductions
by:
 - Adding some pattern matching in the loop vectorizer cost model to
   match extended reduction patterns that are optionally extended and/or
   MLA patterns. This marks the cost of the reduction instruction correctly
   and the sext/zext/mul leading up to it as free, which is otherwise
   difficult to tell and may get a very high cost. (In the long run this
   can hopefully be replaced by vplan producing a single node and costing
   it correctly, but that is not yet something that vplan can do).
 - getExtendedAddReductionCost is added to query the cost of these
   extended reduction patterns.
 - Expanded the ARM costs to account for these expanded sizes, which is a
   fairly simple change in itself.
 - Some minor alterations to allow inloop reduction larger than the highest
   vector width and i64 MVE reductions.
 - An extra InLoopReductionImmediateChains map was added to the vectorizer
   for it to efficiently detect which instructions are reductions in the
   cost model.
 - The tests have some updates to show what I believe is optimal
   vectorization and where we are now.

Put together this can greatly improve performance for reduction loop
under MVE.

Differential Revision: https://reviews.llvm.org/D93476
2021-01-21 21:03:41 +00:00
Sanjay Patel 2f03528f5e [SLP] rename reduction variable to avoid shadowing; NFC
The code structure can likely be improved now that
'OperationData' is gone.
2021-01-21 16:02:38 -05:00
Sanjay Patel d77753381f [SLP] simplify reduction matching
This is NFC-intended and removes the "OperationData"
class which had become nothing more than a recurrence
(reduction) type.

I adjusted the matching logic to distinguish
instructions from non-instructions - that's all that
the "IsLeafValue" member was keeping track of.
2021-01-21 14:58:57 -05:00
Kazu Hirata e53472de68 [Transforms] Use llvm::append_range (NFC) 2021-01-20 21:35:54 -08:00
Sanjay Patel c09be0d2a0 [SLP] reduce reduction code for checking vectorizable ops; NFC
This is another step towards removing `OperationData` and
fixing FMF matching/propagation bugs when forming reductions.
2021-01-20 11:14:48 -05:00
Sanjay Patel 1c54112a57 [SLP] refactor more reduction functions; NFC
We were able to remove almost all of the state from
OperationData, so these don't make sense as members
of that class - just pass the RecurKind in as a param.

More streamlining is possible, but I'm trying to avoid
logic/typo bugs while fixing this. Eventually, we should
not need the `OperationData` class.
2021-01-20 11:14:48 -05:00
Sanjay Patel 8590d24543 [SLP] move reduction createOp functions; NFC
We were able to remove almost all of the state from
OperationData, so these don't make sense as members
of that class - just pass the RecurKind in as a param.
2021-01-20 11:14:48 -05:00
Kazu Hirata 8857202489 [llvm] Use llvm::find (NFC) 2021-01-19 20:19:14 -08:00
Alexey Bataev e463bd53c0 Revert "[SLP]Merge reorder and reuse shuffles."
This reverts commit 438682de6a to fix the
bug with the reducing size of the resulting vector for the entry node
with multiple users.
2021-01-19 11:48:04 -08:00
Jeroen Dobbelaere 121cac01e8 [noalias.decl] Look through llvm.experimental.noalias.scope.decl
Just like llvm.assume, there are a lot of cases where we can just ignore llvm.experimental.noalias.scope.decl.

Reviewed By: nikic

Differential Revision: https://reviews.llvm.org/D93042
2021-01-19 20:09:42 +01:00
David Sherwood c3ce262794 [NFC] Make remaining cost functions in LoopVectorize.cpp use InstructionCost
A previous patch has already changed getInstructionCost to return
an InstructionCost type. This patch changes the other various
getXXXCost functions to return an InstructionCost too. This is a
non-functional change - I've added a few asserts that the costs
are valid in places where we're selecting between vector call
and intrinsic costs. However, since we don't yet return invalid
costs from any of the TTI implementations these asserts should
not fire.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Differential Revision: https://reviews.llvm.org/D94065
2021-01-19 09:08:40 +00:00
Sanjay Patel 5b77ac32b1 [SLP] match maxnum/minnum intrinsics as FP reduction ops
After much refactoring over the last 2 weeks to the reduction
matching code, I think this change is finally ready.

We effectively broke fmax/fmin vector reduction optimization
when we started canonicalizing to intrinsics in instcombine,
so this should restore that functionality for SLP.

There are still FMF problems here as noted in the code comments,
but we should be avoiding miscompiles on those for fmax/fmin by
restricting to full 'fast' ops (negative tests are included).

Fixing FMF propagation is a planned follow-up.

Differential Revision: https://reviews.llvm.org/D94913
2021-01-18 17:37:16 -05:00
Sanjay Patel 3dbbadb8ef [SLP] rename reduction query for min/max ops; NFC
This will avoid confusion once we start matching
min/max intrinsics. All of these hacks to accomodate
cmp+sel idioms should disappear once we canonicalize
to min/max intrinsics.
2021-01-18 09:32:57 -05:00
Sanjay Patel d1c4e859ce [SLP] reduce opcode API dependency in reduction cost calc; NFC
The icmp opcode is now hard-coded in the cost model call.
This will make it easier to eventually remove all opcode
queries for min/max patterns as we transition to intrinsics.
2021-01-18 09:32:57 -05:00
Caroline Concatto 36710c38c1 [NFC]Migrate VectorCombine.cpp to use InstructionCost
This patch changes these functions:
vectorizeLoadInsert
isExtractExtractCheap
foldExtractedCmps
scalarizeBinopOrCmp
getShuffleExtract
foldBitcastShuf
to use the class InstructionCost when calling TTI.get<something>Cost().

This patch is part of a series of patches to use InstructionCost instead of
 unsigned/int for the cost model functions.
See this thread for context:
    http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html
See this patch for the introduction of the type:
    https://reviews.llvm.org/D91174

ps.:This patch adds the test || !NewCost.isValid(), because we want to
return false when:
 !NewCost.isValid && !OldCost.isValid()->the cost to transform it expensive
and
 !NewCost.isValid() && OldCost.isValid()
Therefore for simplication we only add  test for !NewCost.isValid()

Differential Revision: https://reviews.llvm.org/D94069
2021-01-18 13:37:21 +00:00
Sanjay Patel 49b96cd9ef [SLP] remove opcode field from reduction data class
This is NFC-intended and another step towards supporting
intrinsics as reduction candidates.

The remaining bits of the OperationData class do not make
much sense as-is, so I will try to improve that, but I'm
trying to take minimal steps because it's still not clear
how this was intended to work.
2021-01-16 13:55:52 -05:00
Sanjay Patel fcfcc3cc6b [SLP] fix typos; NFC 2021-01-16 13:55:52 -05:00
Sanjay Patel 48dbac5b6b [SLP] remove unnecessary use of 'OperationData'
This is another NFC-intended patch to allow matching
intrinsics (example: maxnum) as candidates for reductions.

It's possible that the loop/if logic can be reduced now,
but it's still difficult to understand how this all works.
2021-01-16 13:55:52 -05:00
Kazu Hirata 19aacdb715 [llvm] Construct SmallVector with iterator ranges (NFC) 2021-01-16 09:40:53 -08:00
Sanjay Patel ceb3cdccd0 [SLP] remove dead code in reduction matching; NFC
To get into this block we had: !A || B || C
and we checked C in the first 'if' clause
leaving !A || B. But the 2nd 'if' is checking:
A && !B --> !(!A || B)
2021-01-15 17:03:26 -05:00
Sanjay Patel 1f21de535d [SLP] remove unused reduction functions; NFC
These were made obsolete by simplifying the code in recent patches.
2021-01-15 14:59:33 -05:00
Sanjay Patel b21905dfe3 [SLP] remove unnecessary state in matching reductions
This is NFC-intended. I'm still trying to figure out
how the loop where this is used works. It does not
seem like we require this data at all, but it's
hard to confirm given the complicated predicates.
2021-01-14 18:32:37 -05:00
Bjorn Pettersson d58512b2e3 [SLP] Don't vectorize stores of non-packed types (like i1, i2)
In the spirit of commit fc783e91e0 (llvm-svn: 248943) we
shouldn't vectorize stores of non-packed types (i.e. types that
has padding between consecutive variables in a scalar layout,
but being packed in a vector layout).

The problem was detected as a miscompile in a downstream test case.

Reviewed By: anton-afanasyev

Differential Revision: https://reviews.llvm.org/D94446
2021-01-14 11:30:33 +01:00
Sanjay Patel 123674a816 [SLP] simplify type check for reductions
This is NFC-intended. The 'valid' call allows int/FP/pointers
for other parts of SLP. The difference here is that we can't
reduce pointers.
2021-01-13 13:30:46 -05:00
Sanjay Patel 9e7895a868 [SLP] reduce code duplication while processing reductions; NFC 2021-01-12 16:03:57 -05:00
Sanjay Patel 92fb5c49e8 [SLP] rename variable to improve readability; NFC
The OperationData in the 2nd block (visiting the operands)
is completely independent of the 1st block.
2021-01-12 16:03:57 -05:00
Sanjay Patel 554be30a42 [SLP] reduce code duplication in processing reductions; NFC 2021-01-12 16:03:57 -05:00
Sanjay Patel 46507a96fc [SLP] reduce code duplication while matching reductions; NFC 2021-01-12 16:03:57 -05:00
Philip Reames 9f61fbd75a [LV] Relax assumption that LCSSA implies single entry
This relates to the ongoing effort to support vectorization of multiple exit loops (see D93317).

The previous code assumed that LCSSA phis were always single entry before the vectorizer ran. This was correct, but only because the vectorizer allowed only a single exiting edge. There's nothing in the definition of LCSSA which requires single entry phis.

A common case where this comes up is with a loop with multiple exiting blocks which all reach a common exit block. (e.g. see the test updates)

Differential Revision: https://reviews.llvm.org/D93725
2021-01-12 12:34:52 -08:00
Florian Hahn eb0371e403 [VPlan] Unify value/recipe printing after VPDef transition.
This patch unifies the way recipes and VPValues are printed after the
transition to VPDef.

VPSlotTracker has been updated to iterate over all recipes and all
their defined values to number those. There is no need to number
values in Value2VPValue.

It also updates a few places that only used slot numbers for
VPInstruction. All recipes now can produce numbered VPValues.
2021-01-11 14:42:46 +00:00
Florian Hahn a94497a342 [VPlan] Move initial quote emission from ::print to ::dumpBasicBlock.
This means there will be no stray " when printing individual recipes
using print()/dump() in a debugger, for example.
2021-01-11 12:22:15 +00:00
David Sherwood 40abeb11f4 [NFC][InstructionCost] Change LoopVectorizationCostModel::getInstructionCost to return InstructionCost
This patch is part of a series of patches that migrate integer
instruction costs to use InstructionCost. In the function
selectVectorizationFactor I have simply asserted that the cost
is valid and extracted the value as is. In future we expect
to encounter invalid costs, but we should filter out those
vectorization factors that lead to such invalid costs.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Differential Revision: https://reviews.llvm.org/D92178
2021-01-11 09:22:37 +00:00
David Sherwood b7ccaca537 [NFC] Remove min/max functions from InstructionCost
Removed the InstructionCost::min/max functions because it's
fine to use std::min/max instead.

Differential Revision: https://reviews.llvm.org/D94301
2021-01-11 09:00:12 +00:00
Sanjay Patel 3f09c77d33 [SLP] fix typo in assert
This snuck into 0aa75fb12f , but I didn't catch it locally.
2021-01-10 13:15:04 -05:00
Sanjay Patel 0aa75fb12f [SLP] put verifyFunction call behind EXPENSIVE_CHECKS
A severe compile-time slowdown from this call is noted in:
https://llvm.org/PR48689
My naive fix was to put it under LLVM_DEBUG ( 267ff79 ),
but that's not limiting in the way we want.
This is a quick fix (or we could just remove the call completely
and rely on some later pass to discover potentially wrong IR?).
A bigger/better fix would be to improve/limit verifyFunction()
as noted in:
https://llvm.org/PR47712

Differential Revision: https://reviews.llvm.org/D94328
2021-01-10 12:32:21 -05:00
Kazu Hirata 6a6e382161 [llvm] Drop unnecessary make_range (NFC) 2021-01-09 09:25:00 -08:00
Florian Hahn 65f578fc0e [VPlan] Keep start value of VPWidenPHIRecipe as VPValue.
Similar to D92129, update VPWidenPHIRecipe  to manage the start value as
VPValue. This allows adjusting the start value as a VPlan transform,
which will be used in a follow-up patch to support reductions during
epilogue vectorization.

Reviewed By: gilr

Differential Revision: https://reviews.llvm.org/D93975
2021-01-09 16:34:15 +00:00
Florian Hahn c493e9216b [VPlan] Move reduction start value creation to widenPHIRecipe.
This was suggested to prepare for D93975.

By moving the start value creation to widenPHInstruction, we set the
stage to manage the start value directly in VPWidenPHIRecipe, which be
used subsequently to set the 'resume' value for reductions during
epilogue vectorization.

It also moves RdxDesc to the recipe, so we do not have to rely on Legal
to look it up later.

Reviewed By: gilr

Differential Revision: https://reviews.llvm.org/D94175
2021-01-08 17:49:43 +00:00
Alexander Belyaev bcbdeafa9c Revert "[SLP]Need shrink the load vector after reordering."
This reverts commit 4284afdf94.

This changes computed values in fused_batchnorm_test_cpu.

Not equal to tolerance rtol=1e-06, atol=0.001
Mismatched value: a is different from b.
not close where = (array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1]), array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1]), array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 1]), array([0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3,
       4, 5]))
not close lhs = [-0.6636615  -0.9804948  -1.148275   -0.68193716 -0.8572368  -0.65046215
 -0.6993756  -1.2244141  -1.0938729  -0.50369143 -0.51830524 -0.738452
 -0.7214286  -0.48115745 -0.9380924  -0.9341769  -0.5916775  -1.2896856
 -0.7264182  -0.9746917  -0.783249   -0.7659018  -0.86214024 -0.47784212]
not close rhs = [ 0.44102234  0.12418899 -0.04359123  0.42274666  0.24744703  0.45422167
  0.40530816 -0.11973029  0.01081094  0.6009924   0.5863786   0.3662318
  0.38325527  0.62352633  0.1665914   0.1705069   0.5130063  -0.18500176
  0.37826565  0.12999213  0.3214348   0.338782    0.24254355  0.62684166]
not close dif = [1.1046839 1.1046838 1.1046838 1.1046839 1.1046839 1.1046839 1.1046838
 1.1046839 1.1046839 1.1046839 1.1046839 1.1046839 1.1046839 1.1046838
 1.1046839 1.1046839 1.1046839 1.1046839 1.1046839 1.1046839 1.1046839
 1.1046839 1.1046838 1.1046838]
not close tol = [0.00100044 0.00100012 0.00100004 0.00100042 0.00100025 0.00100045
 0.00100041 0.00100012 0.00100001 0.0010006  0.00100059 0.00100037
 0.00100038 0.00100062 0.00100017 0.00100017 0.00100051 0.00100019
 0.00100038 0.00100013 0.00100032 0.00100034 0.00100024 0.00100063]
2021-01-08 14:42:26 +01:00
Sanjay Patel 267ff7901c [SLP] limit verifyFunction to debug build (PR48689)
As noted in PR48689, the verifier may have some kind
of exponential behavior that should be addressed
separately. For now, only run it in debug mode to
prevent problems for release+asserts.
That limit is what we had before D80401, and I'm
not sure if there was a reason to change it in that
patch.
2021-01-08 08:10:17 -05:00
Cullen Rhodes 1e7efd397a [LV] Legalize scalable VF hints
In the following loop:

  void foo(int *a, int *b, int N) {
    for (int i=0; i<N; ++i)
      a[i + 4] = a[i] + b[i];
  }

The loop dependence constrains the VF to a maximum of (4, fixed), which
would mean using <4 x i32> as the vector type in vectorization.
Extending this to scalable vectorization, a VF of (4, scalable) implies
a vector type of <vscale x 4 x i32>. To determine if this is legal
vscale must be taken into account. For this example, unless
max(vscale)=1, it's unsafe to vectorize.

For SVE, the number of bits in an SVE register is architecturally
defined to be a multiple of 128 bits with a maximum of 2048 bits, thus
the maximum vscale is 16. In the loop above it is therefore unfeasible
to vectorize with SVE. However, in this loop:

  void foo(int *a, int *b, int N) {
    #pragma clang loop vectorize_width(X, scalable)
    for (int i=0; i<N; ++i)
      a[i + 32] = a[i] + b[i];
  }

As long as max(vscale) multiplied by the number of lanes 'X' doesn't
exceed the dependence distance, it is safe to vectorize. For SVE a VF of
(2, scalable) is within this constraint, since a vector of <16 x 2 x 32>
will have no dependencies between lanes. For any number of lanes larger
than this it would be unsafe to vectorize.

This patch extends 'computeFeasibleMaxVF' to legalize scalable VFs
specified as loop hints, implementing the following behaviour:
  * If the backend does not support scalable vectors, ignore the hint.
  * If scalable vectorization is unfeasible given the loop
    dependence, like in the first example above for SVE, then use a
    fixed VF.
  * Accept scalable VFs if it's safe to do so.
  * Otherwise, clamp scalable VFs that exceed the maximum safe VF.

Reviewed By: sdesmalen, fhahn, david-arm

Differential Revision: https://reviews.llvm.org/D91718
2021-01-08 10:49:44 +00:00
David Green 72fb5ba079 [LV] Don't sink into replication regions
The new test case here contains a first order recurrences and an
instruction that is replicated. The first order recurrence forces an
instruction to be sunk _into_, as opposed to after the replication
region. That causes several things to go wrong including registering
vector instructions multiple times and failing to create dominance
relations correctly.

Instead we should be sinking to after the replication region, which is
what this patch makes sure happens.

Differential Revision: https://reviews.llvm.org/D93629
2021-01-08 09:50:10 +00:00
Kazu Hirata 33bf1cad75 [llvm] Use *Set::contains (NFC) 2021-01-07 20:29:34 -08:00
Sanjay Patel 4c7148d75c [SLP] remove opcode identifier for reduction; NFC
Another step towards allowing intrinsics in reduction matching.
2021-01-07 14:07:27 -05:00
Alexey Bataev 4284afdf94 [SLP]Need shrink the load vector after reordering.
After merging the shuffles, we cannot rely on the previous shuffle
anymore and need to shrink the final shuffle, if it is required.

Reported in D92668

Differential Revision: https://reviews.llvm.org/D93967
2021-01-07 04:50:48 -08:00
Oliver Stannard 76f6b125ce Revert "[llvm] Use BasicBlock::phis() (NFC)"
Reverting because this causes crashes on the 2-stage buildbots, for
example http://lab.llvm.org:8011/#/builders/7/builds/1140.

This reverts commit 9b228f107d.
2021-01-07 09:43:33 +00:00
Kazu Hirata cfeecdf7b6 [llvm] Use llvm::all_of (NFC) 2021-01-06 18:27:36 -08:00
Kazu Hirata 9b228f107d [llvm] Use BasicBlock::phis() (NFC) 2021-01-06 18:27:35 -08:00
Sanjay Patel 4c022b5a41 [SLP] use reduction kind's opcode to create new instructions; NFC
Similar to 5a1d31a28 -
This should be no-functional-change because the reduction kind
opcodes are 1-for-1 mappings to the instructions we are matching
as reductions. But we want to remove the need for the
`OperationData` opcode field because that does not work when
we start matching intrinsics (eg, maxnum) as reduction candidates.
2021-01-06 14:37:44 -05:00
Sanjay Patel 5d24089a70 [SLP] reduce code for propagating flags on reductions; NFC
If we add/change to match intrinsics, this might get more
wordy, but there's no need to list each kind currently.
2021-01-06 14:37:44 -05:00
Florian Hahn 816dba48af
[VPlan] Keep start value in VPWidenIntOrFpInductionRecipe (NFC).
This patch updates VPWidenIntOrFpInductionRecipe to hold the start value
for the induction variable. This makes the start value explicit and
allows for adjusting the start value for a VPlan.

The flexibility will be used in further patches.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D92129
2021-01-06 11:47:33 +00:00
Florian Hahn 0ce5f402e0
[VPlan] Add getLiveInIRValue accessor to VPValue.
This patch adds a new getLiveInIRValue accessor to VPValue, which
returns the underlying value, if the VPValue is defined outside of
VPlan. This is required to handle scalars in VPTransformState, which
requires dealing with scalars defined outside of VPlan.

We can simply check VPValue::Def to determine if the value is defined
inside a VPlan.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D92281
2021-01-06 11:20:42 +00:00
Florian Hahn f73c09caa2
[VPlan] Use public VPValue constructor in VPPRedInstPHIRecipe (NFC).
VPPredInstPHIRecipe does not need access to VPValue via friendship. It
can just use the public constructor,

Discussed as part of D92281.
2021-01-06 10:47:09 +00:00
Juneyoung Lee 4a8e6ed2f7 [SLP,LV] Use poison constant vector for shufflevector/initial insertelement
This patch makes SLP and LV emit operations with initial vectors set to poison constant instead of undef.
This is a part of efforts for using poison vector instead of undef to represent "doesn't care" vector.
The goal is to make nice shufflevector optimizations valid that is currently incorrect due to the tricky interaction between undef and poison (see https://bugs.llvm.org/show_bug.cgi?id=44185 ).

Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D94061
2021-01-06 11:22:50 +09:00
Sanjay Patel 6a03f8ab62 [SLP] reduce code for finding reduction costs; NFC
We can get both (vector/scalar) costs in a single switch
instead of sequentially.
2021-01-05 17:35:54 -05:00
Sanjay Patel 5a1d31a284 [SLP] use reduction kind's opcode for cost model queries; NFC
This should be no-functional-change because the reduction kind
opcodes are 1-for-1 mappings to the instructions we are matching
as reductions. But we want to remove the need for the
`OperationData` opcode field because that does not work when
we start matching intrinsics (eg, maxnum) as reduction candidates.
2021-01-05 15:12:40 -05:00
Sanjay Patel d4a999b453 [SLP] reduce code duplication; NFC 2021-01-05 15:12:40 -05:00
Sanjay Patel 3b8b2c7da2 [SLP] delete unused pairwise reduction option
SLP tries to model 2 forms of vector reductions: pairwise and splitting.
From the cost model code comments, those are defined using an example as:

  /// Pairwise:
  ///  (v0, v1, v2, v3)
  ///  ((v0+v1), (v2+v3), undef, undef)
  /// Split:
  ///  (v0, v1, v2, v3)
  ///  ((v0+v2), (v1+v3), undef, undef)

I don't know the full history of this functionality, but it was partly
added back in D29402. There are apparently no users at this point (no
regression tests change). X86 might have managed to work-around the need
for this through cost model and codegen improvements.

Removing this code makes it easier to continue the work that was started
in D87416 / D88193. The alternative -- if there is some target that is
silently using this option -- is to move this logic into LoopUtils. We
have related/duplicate functionality there via llvm::createTargetReduction().

Differential Revision: https://reviews.llvm.org/D93860
2021-01-05 13:23:07 -05:00
Florian Hahn 8a47e6252a
[VPlan] Re-add interleave group members to plan.
Creating in-loop reductions relies on IR references to map
IR values to VPValues after interleave group creation.

Make sure we re-add the updated member to the plan, so the look-ups
still work as expected

This fixes a crash reported after D90562.
2021-01-05 15:06:47 +00:00
Florian Hahn 38c6933dcc
[LV] Simplify lambda in all_of to directly return hasVF() result. (NFC)
The if in the lambda is not necessary. We can directly return the result
of hasVF.
2021-01-05 10:34:06 +00:00
Sanjay Patel 36263a7ccc [LoopUtils] remove redundant opcode parameter; NFC
While here, rename the inaccurate getRecurrenceBinOp()
because that was also used to get CmpInst opcodes.

The recurrence/reduction kind should always refer to the
expected opcode for a reduction. SLP appears to be the
only direct caller of createSimpleTargetReduction(), and
that calling code ideally should not be carrying around
both an opcode and a reduction kind.

This should allow us to generalize reduction matching to
use intrinsics instead of only binops.
2021-01-04 17:05:28 -05:00
Florian Hahn c50f9b2351
[LV] Clean up trailing whitespace (NFC).
Clean up some stray whitespace that sneaked in recently.
2021-01-02 16:43:13 +00:00
Sanjay Patel c74e8539ff [Analysis] flatten enums for recurrence types
This is almost all mechanical search-and-replace and
no-functional-change-intended (NFC). Having a single
enum makes it easier to match/reason about the
reduction cases.

The goal is to remove `Opcode` from reduction matching
code in the vectorizers because that makes it harder to
adapt the code to handle intrinsics.

The code in RecurrenceDescriptor::AddReductionVar() is
the only place that required closer inspection. It uses
a RecurrenceDescriptor and a second InstDesc to sometimes
overwrite part of the struct. It seem like we should be
able to simplify that logic, but it's not clear exactly
which cmp+sel patterns that we are trying to handle/avoid.
2021-01-01 12:20:16 -05:00
Florian Hahn d9f306aa52
[LV] Fix crash when generating remarks with multi-exit loops.
If DoExtraAnalysis is true (e.g. because remarks are enabled), we
continue with the analysis rather than exiting. Update code to
conditionally check if the ExitBB has phis or not a single predecessor.
Otherwise a nullptr is dereferenced with DoExtraAnalysis.
2021-01-01 13:54:41 +00:00
Sanjay Patel 8ca60db40b [LoopUtils] reduce FMF and min/max complexity when forming reductions
I don't know if there's some way this changes what the vectorizers
may produce for reductions, but I have added test coverage with
3567908 and 5ced712 to show that both passes already have bugs in
this area. Hopefully this does not make things worse before we can
really fix it.
2020-12-30 15:22:26 -05:00
Sanjay Patel 21a3a0225d [SLP] replace local reduction enum with RecurrenceKind; NFCI
I'm not sure if the SLP enum was created before the IVDescriptor
RecurrenceDescriptor / RecurrenceKind existed, but the code in
SLP is now redundant with that class, so it just makes things
more complicated to have both. We eventually call LoopUtils
createSimpleTargetReduction() to create reduction ops, so we
might as well standardize on those enum names.

There's still a question of whether we need to use TTI::ReductionFlags
vs. MinMaxRecurrenceKind, but that can be another clean-up step.

Another option would just be to flatten the enums in RecurrenceDescriptor
into a single enum. There isn't much benefit (smaller switches?) to
having a min/max subset.
2020-12-29 14:52:11 -05:00
Philip Reames 4b33b23877 Reapply "[LV] Vectorize (some) early and multiple exit loops"" w/fix for builder
This reverts commit 4ffcd4fe9a thus restoring e4df6a40da.

The only change from the original patch is to add "llvm::" before the call to empty(iterator_range).  This is a speculative fix for the ambiguity reported on some builders.
2020-12-28 10:13:28 -08:00
Arthur Eubanks 4ffcd4fe9a Revert "[LV] Vectorize (some) early and multiple exit loops"
This reverts commit e4df6a40da.

Breaks Windows bots, e.g. http://45.33.8.238/win/30472/step_4.txt
and http://lab.llvm.org:8011/#/builders/83/builds/2078/steps/5/logs/stdio
2020-12-28 10:05:41 -08:00
Philip Reames e4df6a40da [LV] Vectorize (some) early and multiple exit loops
This patch is a major step towards supporting multiple exit loops in the vectorizer. This patch on it's own extends the loop forms allowed in two ways:

    single exit loops which are not bottom tested
    multiple exit loops w/ a single exit block reached from all exits and no phis in the exit block (because of LCSSA this implies no values defined in the loop used later)

The restrictions on multiple exit loop structures will be removed in follow up patches; disallowing cases for now makes the code changes smaller and more obvious. As before, we can only handle loops with entirely analyzable exits. Removing that restriction is much harder, and is not part of currently planned efforts.

The basic idea here is that we can force the last iteration to run in the scalar epilogue loop (if we have one). From the definition of SCEV's backedge taken count, we know that no earlier iteration can exit the vector body. As such, we can leave the decision on which exit to be taken to the scalar code and generate a bottom tested vector loop which runs all but the last iteration.

The existing code already had the notion of requiring one iteration in the scalar epilogue, this patch is mainly about generalizing that support slightly, making sure we don't try to use this mechanism when tail folding, and updating the code to reflect the difference between a single exit block and a unique exit block (very mechanical).

Differential Revision: https://reviews.llvm.org/D93317
2020-12-28 09:40:42 -08:00
Florian Hahn 0ea3749b3c
[LV] Set up branch from middle block earlier.
Previously the branch from the middle block to the scalar preheader & exit
was being set-up at the end of skeleton creation in completeLoopSkeleton.
Inserting SCEV or runtime checks may result in LCSSA phis being created,
if they are required. Adjusting branches afterwards may break those
PHIs.

To avoid this, we can instead create the branch from the middle block
to the exit after we created the middle block, so we have the final CFG
before potentially adjusting/creating PHIs.

This fixes a crash for the included test case. For the non-crashing
case, this is almost a NFC with respect to the generated code. The
only change is the order of the predecessors of the involved branch
targets.

Note an assertion was moved from LoopVersioning() to
LoopVersioning::versionLoop. Adjusting the branches means loop-simplify
form may be broken before constructing LoopVersioning. But LV only uses
LoopVersioning to annotate the loop instructions with !noalias metadata,
which does not require loop-simplify form.

This is a fix for an existing issue uncovered by D93317.
2020-12-27 18:21:12 +00:00
Kazu Hirata 789d250613 [CodeGen, Transforms] Use *Map::lookup (NFC) 2020-12-27 09:57:27 -08:00
Sanjay Patel badf0f20f3 [SLP] rename reduction variables for readability; NFC
I am hoping to extend the reduction matching code, and it is
hard to distinguish "ReductionData" from "ReducedValueData".
So extend the tree/root metaphor to include leaves.

Another problem is that the name "OperationData" does not
provide insight into its purpose. I'm not sure if we can alter
that underlying data structure to make the code clearer.
2020-12-26 11:20:25 -05:00
Sanjay Patel c4ca108966 [SLP] use switch to improve readability; NFC
This will get more complicated when we handle intrinsics like maxnum.
2020-12-26 10:59:45 -05:00
Kazu Hirata df812115e3 [CodeGen, Transforms] Use llvm::any_of (NFC) 2020-12-24 09:08:36 -08:00
Sanjay Patel 0d15d4b6f4 [SLP] use operand index abstraction for number of operands
I think this is NFC currently, but the bug would be exposed
when we allow binary intrinsics (maxnum, etc) as candidates
for reductions.

The code in matchAssociativeReduction() is using
OperationData::getNumberOfOperands() when comparing whether
the "EdgeToVisit" iterator is in-bounds, so this code must
use the same (potentially offset) operand value to set
the "EdgeToVisit".
2020-12-22 16:05:39 -05:00
Florian Hahn ef4dbb2b7a [LV] Use ScalarEvolution::getURemExpr to reduce duplication.
ScalarEvolution should be able to handle both constant and variable trip
counts using getURemExpr, so we do not have to handle them separately.

This is a small simplification of a56280094e.

Reviewed By: gilr

Differential Revision: https://reviews.llvm.org/D93677
2020-12-22 14:48:42 +00:00
Florian Hahn c0c0ae16c3
[VPlan] Make VPInstruction a VPDef
This patch turns updates VPInstruction to manage the value it defines
using VPDef. The VPValue is used  during VPlan construction and
codegeneration instead of the plain IR reference where possible.

Reviewed By: gilr

Differential Revision: https://reviews.llvm.org/D90565
2020-12-22 09:53:47 +00:00
Gil Rapaport a56280094e [LV] Avoid needless fold tail
When the trip-count is provably divisible by the maximal/chosen VF, folding the
loop's tail during vectorization is redundant. This commit extends the existing
test for constant trip-counts to any trip-count known to be divisible by
maximal/selected VF by SCEV.

Differential Revision: https://reviews.llvm.org/D93615
2020-12-22 10:25:20 +02:00
Florian Hahn f250892373
[VPlan] Make VPRecipeBase inherit from VPDef.
This patch makes VPRecipeBase a direct subclass of VPDef, moving the
SubclassID to VPDef.

Reviewed By: gilr

Differential Revision: https://reviews.llvm.org/D90564
2020-12-21 13:34:00 +00:00
Florian Hahn cd608dc8d3
[VPlan] Use VPDef for VPInterleaveRecipe.
This patch turns updates VPInterleaveRecipe to manage the values it defines
using VPDef. The VPValue is used  during VPlan construction and
codegeneration instead of the plain IR reference where possible.

Reviewed By: gilr

Differential Revision: https://reviews.llvm.org/D90562
2020-12-21 10:56:53 +00:00
David Sherwood 3bf7d47a97 [NFC][InstructionCost] Remove isValid() asserts in SLPVectorizer.cpp
An earlier patch introduced asserts that the InstructionCost is
valid because at that time the ReuseShuffleCost variable was an
unsigned. However, now that the variable is an InstructionCost
instance the asserts can be removed.

See this thread for context:
http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

See this patch for the introduction of the type:
https://reviews.llvm.org/D91174
2020-12-21 09:12:28 +00:00
Sanjay Patel 37d0dda739 [SLP] fix typo; NFC 2020-12-18 16:55:52 -05:00
Sanjay Patel 47aaa99c0e [VectorCombine] allow peeking through GEPs when creating a vector load
This is an enhancement motivated by https://llvm.org/PR16739
(see D92858 for another).

We can look through a GEP to find a base pointer that may be
safe to use for a vector load. If so, then we shuffle (shift)
the necessary vector element over to index 0.

Alive2 proof based on 1 of the regression tests:
https://alive2.llvm.org/ce/z/yPJLkh

The vector translation is independent of endian (verify by
changing to leading 'E' in the datalayout string).

Differential Revision: https://reviews.llvm.org/D93229
2020-12-18 09:25:03 -05:00
Cullen Rhodes 1fd3a04775 [LV] Disable epilogue vectorization for scalable VFs
Epilogue vectorization doesn't support scalable vectorization factors
yet, disable it for now.

Reviewed By: sdesmalen, bmahjour

Differential Revision: https://reviews.llvm.org/D93063
2020-12-17 12:14:03 +00:00
Sanjay Patel 38ebc1a13d [VectorCombine] optimize alignment for load transform
Here's another minimal step suggested by D93229 / D93397 .
(I'm trying to be extra careful in these changes because
load transforms are easy to get wrong.)

We can optimistically choose the greater alignment of a
load and its pointer operand. As the test diffs show, this
can improve what would have been unaligned vector loads
into aligned loads.

When we enhance with gep offsets, we will need to adjust
the alignment calculation to include that offset.

Differential Revision: https://reviews.llvm.org/D93406
2020-12-16 15:25:45 -05:00
Sanjay Patel aaaf0ec72b [VectorCombine] loosen alignment constraint for load transform
As discussed in D93229, we only need a minimal alignment constraint
when querying whether a hypothetical vector load is safe. We still
pass/use the potentially stronger alignment attribute when checking
costs and creating the new load.

There's already a test that changes with the minimum code change,
so splitting this off as a preliminary commit independent of any
gep/offset enhancements.

Differential Revision: https://reviews.llvm.org/D93397
2020-12-16 12:25:18 -05:00
Caroline Concatto be9184bc55 [SLPVectorizer]Migrate getEntryCost to return InstructionCost
This patch also changes:
  the return type of getGatherCost and
  the signature of the debug function dumpTreeCosts
to use InstructionCost.

This patch is part of a series of patches to use InstructionCost instead of
unsigned/int for the cost model functions.

See this thread for context:
http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

See this patch for the introduction of the type:
https://reviews.llvm.org/D91174

Depends on D93049

Differential Revision: https://reviews.llvm.org/D93127
2020-12-16 14:18:40 +00:00