Commit Graph

3283 Commits

Author SHA1 Message Date
Florian Hahn b7315ffc3c
[LAA,LV] Add initial support for pointer-diff memory checks.
This patch adds initial support for a pointer diff based runtime check
scheme for vectorization. This scheme requires fewer computations and
checks than the existing full overlap checking, if it is applicable.

The main idea is to only check if source and sink of a dependency are
far enough apart so the accesses won't overlap in the vector loop. To do
so, it is sufficient to compute the difference and compare it to the
`VF * UF * AccessSize`. It is sufficient to check
`(Sink - Src) <u VF * UF * AccessSize` to rule out a backwards
dependence in the vector loop with the given VF and UF. If Src >=u Sink,
there is not dependence preventing vectorization, hence the overflow
should not matter and using the ULT should be sufficient.

Note that the initial version is restricted in multiple ways:

1. Pointers must only either be read or written, by a single
   instruction (this allows re-constructing source/sink for
   dependences with the available information)
 2. Source and sink pointers must be add-recs, with matching steps
 3. The step must be a constant.
 3. abs(step) == AccessSize.

Most of those restrictions can be relaxed in the future.

See https://github.com/llvm/llvm-project/issues/53590.

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D119078
2022-05-16 15:27:22 +01:00
David Sherwood befc952045 [LoopVectorize] Permit tail-folding for low trip counts using scalable vectors
When the loop vectoriser encounters a known low trip count it tries
to create a single predicated loop in order to get the benefit of
vectorisation and eliminate the scalar tail. However, until now the
vectoriser prevented the use of scalable vectors in this case due
to concerns in the past about stability. I believe that tail-folded
loops using scalable vectors are now sufficiently well tested that
we can enable this. For the same reason I've also enabled it when
optimising for code size too.

Tests added here:

  Transforms/LoopVectorize/AArch64/sve-low-trip-count.ll
  Transforms/LoopVectorize/AArch64/sve-tail-folding-optsize.ll
  Transforms/LoopVectorize/RISCV/low-trip-count.ll

Differential Revision: https://reviews.llvm.org/D121595
2022-05-16 09:14:24 +01:00
Florian Hahn 8b7c3d2179
[LV] Set SCEVCheckCond to nullptr whenever it was used.
Under some circumstances, SCEVExpander will insert new instructions when
expanding a predicate, but the final result of the expansion can be a
false constant.

In those cases, the expanded instructions may later be used by other
expansions, e.g. the trip count. This may trigger an assertion during
SCEVExpander cleanup. To avoid this, always mark the result as used.

Fixes #55100.
2022-05-15 21:52:07 +01:00
Craig Topper b3097eb6cd [SLP] Fix misspelling of 'analyzed'. NFC 2022-05-15 10:30:24 -07:00
Florian Hahn 39552964e1
[VPlan] Improve printing of VPReplicateRecipe with calls.
Suggested as part of D124718.
2022-05-15 15:51:26 +01:00
Alexey Bataev 8b8281f354 [SLP]Do not vectorize non-profitable alternate nodes.
If alternate node has only 2 instructions and the tree is already big
enough, better to skip the vectorization of such nodes, they are not
very profitable (the resulting code cotains 3 instructions instead of
original 2 scalars). SLP can try to vectorize the buildvector sequence
in the next attempt, if it is profitable.

Metric: SLP.NumVectorInstructions

Program                                                                                       SLP.NumVectorInstructions
                                                                               results                   results0 diff
     test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test    72.00                     73.00   1.4%
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test  1186.00                   1198.00   1.0%
     test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test   241.00                    242.00   0.4%
                  test-suite :: MultiSource/Applications/JM/lencod/lencod.test  2131.00                   2139.00   0.4%
 test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test  6377.00                   6384.00   0.1%
test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test  6377.00                   6384.00   0.1%
        test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 12650.00                  12658.00   0.1%
      test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 26169.00                  26147.00  -0.1%
          test-suite :: MultiSource/Benchmarks/Trimaran/enc-3des/enc-3des.test    99.00                     86.00 -13.1%

Gains:
526.blender_r - more vectorized trees.
enc-3des - same.

Others:
510.parest_r - no changes.
miniFE - same
623.xalancbmk_s - some (non-profitable) parts of the trees are not
    vectorized.
523.xalancbmk_r - same
lencod - same
timberwolfmc - same
miniAMR - same

Differential Revision: https://reviews.llvm.org/D125571
2022-05-13 14:28:54 -07:00
Alexey Bataev 85f6b15ee5 [SLP]Do not look for buildvector sequence, if the index is reused.
If the insert indes was used already or is not constant, we should stop
looking for unique buildvector sequence, it mustbe splitted to
2 different buildvectors.
2022-05-13 13:56:02 -07:00
David Sherwood 92c645b5c1 [LoopVectorize] Add overflow checks when tail-folding with scalable vectors
In InnerLoopVectorizer::getOrCreateVectorTripCount there is an
assert that the known minimum value for the VF is a power of 2
when tail-folding is enabled. However, for scalable vectors the
value of vscale may not be a power of 2, which means we have
to worry about the possibility of overflow. I have solved this
problem by adding preheader checks that prevent us from entering
the vector body if the canonical IV would overflow, i.e.

  if ((IntMax - TripCount) < (VF * UF)) ... skip vector loop ...

Differential Revision: https://reviews.llvm.org/D125235
2022-05-13 14:09:43 +01:00
Nikita Popov ed1cb01baf [IRBuilder] Add IsInBounds parameter to CreateGEP()
We commonly want to create either an inbounds or non-inbounds GEP
based on a boolean value, e.g. when preserving inbounds from
existing GEPs. Directly accept such a boolean in the API, rather
than requiring a ternary between CreateGEP and CreateInBoundsGEP.

This change is not entirely NFC, because we now preserve an
inbounds flag in a constant expression edge-case in InstCombine.
2022-05-13 14:30:55 +02:00
Vasileios Porpodas 0950d4060c Recommit "[SLP] Make reordering aware of external vectorizable scalar stores."
This reverts commit c2a7904aba.

Original code review: https://reviews.llvm.org/D125111
2022-05-11 16:47:29 -07:00
Arthur Eubanks c2a7904aba Revert "[SLP] Make reordering aware of external vectorizable scalar stores."
This reverts commit 71bcead98b.

Causes crashes, see comments in D125111.
2022-05-11 15:28:00 -07:00
Alexey Bataev f5d45d70a5 [SLP]Further improvement of the cost model for scalars used in buildvectors.
Further improvement of the cost model for the scalars used in
buildvectors sequences. The main functionality is outlined into
a separate function.
The cost is calculated in the following way:
1. If the Base vector is not undef vector, resizing the very first mask to
have common VF and perform action for 2 input vectors (including non-undef
Base). Other shuffle masks are combined with the resulting after the 1 stage and processed as a shuffle of 2 elements.
2. If the Base is undef vector and have only 1 shuffle mask, perform the
action only for 1 vector with the given mask, if it is not the identity
mask.
3. If > 2 masks are used, perform serie of shuffle actions for 2 vectors,
combing the masks properly between the steps.

The original implementation misses the very first analysis for the Base
vector, so the cost might too optimistic in some cases. But it improves
the cost for the insertelements which are part of the current SLP graph.

Part of D107966.

Differential Revision: https://reviews.llvm.org/D115750
2022-05-11 06:08:55 -07:00
Florian Hahn 635b752211
[VPlan] VPInterleaveRecipe only uses first lane if op not stored.
With opaque pointers, both the stored value and the address can be the
same. Only consider the recipe using the first lane only *if* the
address is not stored.

Fixes #55375.
2022-05-11 11:24:56 +01:00
Vasileios Porpodas 71bcead98b [SLP] Make reordering aware of external vectorizable scalar stores.
The current reordering scheme only checks the ordering of in-tree operands.
There are some cases, however, where we need to adjust the ordering based on
the ordering of a future SLP-tree who's instructions are not part of the
current tree, but are external users.

This patch is a simple implementation of this. We keep track of scalar stores
that are users of TreeEntries and if they look profitable to vectorize, then
we keep track of their ordering. During the reordering step we take this new
index order into account. This can remove some shuffles in cases like in the
lit test.

Differential Revision: https://reviews.llvm.org/D125111
2022-05-10 15:25:35 -07:00
Alexey Bataev 4212ef8a0e Revert "[SLP]Further improvement of the cost model for scalars used in buildvectors."
This reverts commit 99f31acfce and several
others to fix detected crashes, reported in https://reviews.llvm.org/D115750
2022-05-09 13:46:06 -07:00
Alexey Bataev cce80bd8b7 [SLP]Adjust assertion check for scalars in several insertelements.
If the same scalar is inserted several times into the same buildvector,
the mask index can be used already. In this case need to check, that
this scalar is already part of the vectorized buildvector.
2022-05-09 13:07:59 -07:00
Florian Hahn 266ea446ab
Revert "Recommit "[VPlan] Remove uneeded needsVectorIV check.""
This reverts commit 8b48223447.

This triggers an assertion on a test case mentioned in D123720.
Revert while I investigate.
2022-05-09 20:33:14 +01:00
Alexey Bataev 9dc4ced204 [SLP]Try partial store vectorization if supported by target.
We can try to vectorize number of stores less than MinVecRegSize
/ scalar_value_size, if it is allowed by target. Gives an extra
opportunity for the vectorization.

Fixes PR54985.

Differential Revision: https://reviews.llvm.org/D124284
2022-05-09 09:48:15 -07:00
Alexey Bataev 9c3a75eabf [SLP]Fix a crash when preparing a mask for external scalars.
Need to use actual index instead of the tree entry position, since the
insert index may be different than 0. It mean, that we vectorized part
of the buildvector starting from not initial insertelement instruction
beause of some reason.
2022-05-09 07:59:34 -07:00
David Green 6f9e1ea0ef [VectorCombine] Attempt to fold select shuffles from reductions
Given a commutative reduction leading from a shuffle, the order of the
lanes on the shuffle are not important for the result. This means we can
reorder the shuffle to something simpler, which we try shuffling the
first vector lanes first. This was D123494.

The new shuffle may not be profitable though, and if it is not we can
try the folding of select shuffles from D123911. This, with some
adjustment as the output lane ordering is now unimportant, can allow the
final shuffle to simplify given the inputs to the patterns from D123911.
Where as each transformation on their own are not profitable, the
combination is.

We can only support a single shuffle when called from reductions, but we
are able to sort the ReconstructMask, potentially allowing it to
simplify to an identity or concat mask.

Differential Revision: https://reviews.llvm.org/D125086
2022-05-08 10:32:41 +01:00
David Green 802e15c576 [SLP] Cluster ordering for loads
Given a load without a better order, this patch partially sorts the
elements to form clusters of adjacent elements in memory. These clusters
can potentially be loaded in fewer loads, meaning less overall shuffling
(for example loading v4i8 clusters of a v16i8 as a single f32 loads, as
opposed to multiple independent bytes loads and inserts).

Differential Revision: https://reviews.llvm.org/D122145
2022-05-07 14:38:11 +01:00
David Green 100cb9a2ba [VectorCombine] Fold shuffle select pattern
This patch adds a combine to attempt to reduce the costs of certain
select-shuffle patterns. The form of code it attempts to detect is:
  %x = shuffle ...
  %y = shuffle ...
  %a = binop %x, %y
  %b = binop %x, %y
  shuffle %a, %b, selectmask

A classic select-mask will pick items from each lane of a or b. These
do not always have a great lowering on many architectures. This patch
attempts to pack a and b into the lower elements, creating a differently
ordered shuffle for reconstructing the orignal which may be better than
the select mask. This can be better for performance, especially if less
elements of a and b need to be computed and the input shuffles are
cheaper.

Because select-masks are just one form of shuffle, we generalize to any
mask. So long as the backend has decent costmodel for the shuffles, this
can generally improve things when they come up. For more basic cost
models the folds do not appear to be profitable, not getting past the
cost checks.

Differential Revision: https://reviews.llvm.org/D123911
2022-05-06 08:13:18 +01:00
Florian Hahn f9f7aa30f8
[VPlan] Remove dead code to create VPWidenPHIRecipes (NFCI).
After introducing VPWidenPointerInductionRecipe, VPWidenPHIRecipes
should not be created at this point. Turn check into an assert.
2022-05-05 19:29:02 +01:00
Alexey Bataev 99f31acfce [SLP]Further improvement of the cost model for scalars used in buildvectors.
Further improvement of the cost model for the scalars used in
buildvectors sequences. The main functionality is outlined into
a separate function.
The cost is calculated in the following way:
1. If the Base vector is not undef vector, resizing the very first mask to
have common VF and perform action for 2 input vectors (including non-undef
Base). Other shuffle masks are combined with the resulting after the 1 stage and processed as a shuffle of 2 elements.
2. If the Base is undef vector and have only 1 shuffle mask, perform the
action only for 1 vector with the given mask, if it is not the identity
mask.
3. If > 2 masks are used, perform serie of shuffle actions for 2 vectors,
combing the masks properly between the steps.

The original implementation misses the very first analysis for the Base
vector, so the cost might too optimistic in some cases. But it improves
the cost for the insertelements which are part of the current SLP graph.

Part of D107966.

Differential Revision: https://reviews.llvm.org/D115750
2022-05-05 06:04:25 -07:00
Florian Hahn 8b48223447
Recommit "[VPlan] Remove uneeded needsVectorIV check."
This reverts commit f4e1eaa375.

The patch was originally reverted because it uncovered an issue that has
now been fixed in 0ef8ca6d88.
2022-05-04 10:53:42 +01:00
serge-sans-paille 7030654296 [iwyu] Handle regressions in libLLVM header include
Running iwyu-diff on LLVM codebase since fa5a4e1b95 detected a few
regressions, fixing them.

Differential Revision: https://reviews.llvm.org/D124847
2022-05-04 08:32:38 +02:00
Igor Kirillov 4e5e042d9a [LoopVectorize] Support reductions that store intermediary result
Adds ability to vectorize loops containing a store to a loop-invariant
address as part of a reduction that isn't converted to SSA form due to
lack of aliasing info. Runtime checks are generated to ensure the store
does not alias any other accesses in the loop.

Ordered fadd reductions are not yet supported.

Differential Revision: https://reviews.llvm.org/D110235
2022-05-03 10:12:30 +01:00
David Green 6f81903e89 [LV][SLP] Mark fptosi_sat as vectorizable
This adds fptosi_sat and fptoui_sat to the list of trivially
vectorizable functions, mainly so that the loop vectorizer can vectorize
the instruction. Marking them as trivially vectorizable also allows them
to be SLP vectorized, and Scalarized.

The signature of a fptosi_sat requires two type overrides
(@llvm.fptosi.sat.v2i32.v2f32), unlike other intrinsics that often only
take a single. This patch alters hasVectorInstrinsicOverloadedScalarOpd
to isVectorIntrinsicWithOverloadTypeAtArg, so that it can mark the first
operand of the intrinsic as a overloaded (but not scalar) operand.

Differential Revision: https://reviews.llvm.org/D124358
2022-05-03 09:32:34 +01:00
Alexey Bataev e74a73782f [SLP][NFC]Minor code changes for better readability, NFC. 2022-05-02 12:58:25 -07:00
Alexey Bataev 7ea03f0b4e [SLP]Improve reductions analysis and emission, part 1.
Currently SLP vectorizer walks through the instructions and selects
3 main classes of values: 1) reduction operations - instructions with same
reduction opcode (add, mul, min/max, etc.), which build the reduction,
2) reduced values - instructions with the same opcodes, but different
from the reduction opcode, 3) extra arguments - all other values,
instructions from the different basic block rather than the root node,
instructions with to many/less uses.

This scheme is not very efficient. It excludes some instructions and all
non-instruction values from the reductions (constants, proficient
gathers), to many possibly reduced values are marked as extra arguments.
Patch improves this process by introducing a bit extended analysis
stage. During this stage, we still try to select 3 classes of the
values: 1) reduction operations - same as before, 2) possibly reduced
values - all instructions from the current block/non-instructions, which
may build a vectorization tree, 3) extra arguments - instructions from
the different basic blocks. Additionally, an extra sorting of the
possibly reduced values occurs to build the scalar sequences which
highly likely will bed vectorized, e.g. loads are grouped by the
distance between them, constants are grouped together, cmp instructions
are sorted by their compare types and predicates, extractelement
instructions are sorted by the vector operand, etc. Also, these groups
are reordered by their length so the longest group is the first in the
list of the possibly reduced values.

The vectorization process tries to emit the reductions for all these
groups. These reductions, remaining non-vectorized possible reduced
values and extra arguments are then combined into the final expression
just like it was before.

Differential Revision: https://reviews.llvm.org/D114171
2022-05-02 12:03:58 -07:00
Florian Hahn 0ef8ca6d88
[VPlan] Do not create VPWidenCall recipes for scalar vector factors.
'Widen' recipe are only used when actual vector values are generated.
Fix tryToWidenCall to do not create VPWidenCallRecipes for scalar vector
factors.

This was exposed by D123720, because the widened recipes are considered
vector users.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D124718
2022-05-02 19:40:33 +01:00
Simon Pilgrim 34f97a3709 [VectorCombine] Merge isa<>/cast<> into dyn_cast<>. NFC.
We want to handle the the assert in VectorCombine so avoid the repeated isa/cast code.
2022-05-01 20:09:10 +01:00
Simon Pilgrim 09761ce295 [SLPVectorizer] Remove weird unicode character from comment. NFCI.
Whatever it was, Visual Assist really didn't like it....
2022-05-01 16:37:21 +01:00
Alexey Bataev 484fcb9888 [SLP][NFC]Fix a comment. 2022-04-29 09:27:13 -07:00
Anna Thomas 205246cb64 [CompileTime] [Passes] Avoid computing unnecessary analyses. NFC
Similar to c515b2f39e, If there are no loops in the function as seen
through LI, we should avoid computing the remaining expensive analyses
(such as SCEV, BPI).  Reordered the analyses requests and early return
if there are no loops.

The logic of avoiding expensive analyses is applied to LoopVectorizer,
LoopLoadElimination and LoopUnrollPass, i.e. all function passes which operate
on loops.

This is an NFC with compile time improvement.

Differential Revision: https://reviews.llvm.org/D124529
2022-04-29 10:00:06 -04:00
Ricky Zhou 24a133e16f
[LV] Rename CountRoundDown to VectorTripCount (NFC)
The name CountRoundDown is potentially misleading, as the number of
iterations can be rounded up when folding the tail.

Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D119681
2022-04-29 13:50:00 +01:00
Florian Hahn e66127e69b
[VPlan] Simplify & adjust code as suggested in D123005.
Improve code as suggested in D123005. Applied separately, because the
comments where made a diff that has not been rebased to current main.
2022-04-29 13:34:54 +01:00
David Green 7047c47918 [VecCombine] Fix sort comparator logic in foldShuffleFromReductions
I think this sort comparator was overly complex, and the windows
expensive check bot agreed, failing as it was not giving a strict weak
ordering. Change it to use the comparison of the mask values as unsigned
integers. This should sort the undef elements to the end whilst keeping
X<Y otherwise.
2022-04-29 09:30:02 +01:00
Florian Hahn f4e1eaa375
Revert "[VPlan] Remove uneeded needsVectorIV check."
This reverts commit 43842b887e while I
investigate a buildbot failure.

It also reverts the follow-up commit
2883de0514.
2022-04-28 20:16:21 +01:00
David Green ded8187e35 [VectorCombine] Try to reduce shuffle cost for commutative reduction operands
Given a shuffle feeding a commutative reduction, the lane ordering of
the shuffle will not alter the result. This is also true if there are a
number of operations between the reduction and the shuffle, providing
they only operate lane-wise. This patch searches for cases like that in
Vector Combine, allowing us to check the cost of the shuffle vs an
in-order identity shuffle and replace the order if possible. This only
handles a single shuffle at the moment to keep things simple, and is
able to ignore splats that produce results where every result is the
same.

This is a more powerful version of a combine that already happens in
instrcombine, capable of optimizing more cases by looking through more
instructions and being able to cost the shuffle.

Differential Revision: https://reviews.llvm.org/D123494
2022-04-28 19:46:12 +01:00
Alexey Bataev 75e1cf4a6a [COST]Improve cost model for shuffles in SLP.
Introduced masks where they are not added and improved target dependent
cost models to avoid returning of the incorrect cost results after
adding masks.

Differential Revision: https://reviews.llvm.org/D100486
2022-04-28 10:04:41 -07:00
Florian Hahn 2883de0514
[VPlan] Fix comment formatting from 43842b887e. 2022-04-28 16:31:48 +01:00
Florian Hahn 43842b887e
[VPlan] Remove uneeded needsVectorIV check.
Remove one of the last remaining uses of ::needsVectorIV, preparing for
its removal. Now that usesScalars is available and based on the
information explicit in VPlan, there is no need to use the pre-computed
needsVectorIV.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D123720
2022-04-28 16:27:34 +01:00
Alexey Bataev 9861ca0c23 Revert "[COST]Improve cost model for shuffles in SLP."
This reverts commit 29a470e380 to fix
a crash reported in https://reviews.llvm.org/D100486#3479989.
2022-04-28 08:11:56 -07:00
Alexey Bataev 29a470e380 [COST]Improve cost model for shuffles in SLP.
Introduced masks where they are not added and improved target dependent
cost models to avoid returning of the incorrect cost results after
adding masks.

Differential Revision: https://reviews.llvm.org/D100486
2022-04-27 10:56:26 -07:00
Shilei Tian a6b355dd31 [SLP] Fix a typo that causes redundant assertion and potential segment fault
Reviewed By: ABataev

Differential Revision: https://reviews.llvm.org/D124497
2022-04-27 10:07:59 -04:00
Vasileios Porpodas fa8a9fea47 Recommit "[SLP][TTI] Refactoring of `getShuffleCost` `Args` to work like `getArithmeticInstrCost`"
This reverts commit 6a9bbd9f20.

Code review: https://reviews.llvm.org/D124202
2022-04-26 14:02:40 -07:00
Vasileios Porpodas 6a9bbd9f20 Revert "[SLP][TTI] Refactoring of `getShuffleCost` `Args` to work like `getArithmeticInstrCost`"
This reverts commit 55ce296d6f.
2022-04-26 11:25:26 -07:00
Vasileios Porpodas 55ce296d6f [SLP][TTI] Refactoring of `getShuffleCost` `Args` to work like `getArithmeticInstrCost`
Before this patch `Args` was used to pass a broadcat's arguments by SLP.
This patch changes this. `Args` is now used for passing the operands of
the shuffle.

Differential Revision: https://reviews.llvm.org/D124202
2022-04-26 11:11:29 -07:00
Valery N Dmitriev 88b9e46fb5 [SLP] Steer for the best chance in tryToVectorize() when rooting with binary ops.
tryToVectorize() method implements one of searching paths for vectorizable tree roots in SLP vectorizer,
specifically for binary and comparison operations. Order of making probes for various scalar pairs
was defined by its implementation: the instruction operands, then climb over one operand if
the instruction is its sole user and then perform same actions for another operand if previous
attempts failed. Problem with this approach is that among these options we can have more than a
single vectorizable tree candidate and it is not necessarily the one that encountered first.
Trying to build vectorizable tree for each possible combination for just evaluation is expensive.
But we already have lookahead heuristics mechanism which we use for finding best pick among
operands of commutative instructions. It calculates cumulative score for candidates in two
consecutive lanes. This patch introduces use of the heuristics for choosing the best pair among
several combinations. We only try one that looks as most promising for vectorization.
Additional benefit is that we reduce total number of vectorization trees built for probes
because we skip those looking non-profitable early.

Reviewed By: Alexey Bataev (ABataev), Vasileios Porpodas (vporpo)
Differential Revision: https://reviews.llvm.org/D124309
2022-04-25 12:25:33 -07:00
David Green 9727c77d58 [NFC] Rename Instrinsic to Intrinsic 2022-04-25 18:13:23 +01:00
Valery N Dmitriev edf7bed87b [SLP][NFC] Outline lookahead heuristics into a separate helper class.
Minor refactoring to reduce size of functional change D124309:
  look-ahead scoring routines pulled out of VLOperands and formed
  new LookAheadHeuristics helper class.

Reviewed By: Alexey Bataev (ABataev), Vasileios Porpodas (vporpo)
Differential Revision: https://reviews.llvm.org/D124313
2022-04-22 18:59:08 -07:00
Vasileios Porpodas 889588ee97 [SLP] Refactoring isLegalBroadcastLoad() to use `ElementCount`.
Replacing `unsigned` with `ElementCount` in the argument of `isLegalBroadcastLoad()`.
This helps reduce the diff of a future SLP patch for AArch64.
2022-04-21 10:19:00 -07:00
Florian Hahn bea69b232f
[VPlan] Initial modeling of middle block in VPlan.
This patch extends the scope of VPlan to also include the exit (aka
middle) block.

For now, the exit block remains empty, but handling of exit values will
subsequently be moved to VPlan, by adding recipes to model exit values
in the exit block.

As a first step, this will allow fixing #51366.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D123457
2022-04-20 19:34:41 +01:00
Vasileios Porpodas 8d4b5e0833 [NFC][SLP] Improved description of getShallowScore() and getScoreAtLevelRec()
Differential Revision: https://reviews.llvm.org/D124027
2022-04-19 12:15:36 -07:00
Florian Hahn 4026b718b8
[VPlan] Remove unused SCEV forward declaration (NFC). 2022-04-19 17:16:17 +02:00
Alexey Bataev 883571928c Revert "[SLP]Improve reductions analysis and emission, part 1."
This reverts commit 0e1f4d4d3c to fix
a crash reported in PR54976
2022-04-19 06:17:03 -07:00
Florian Hahn a65f2730d2
[VPlan] Expand induction step in VPlan pre-header.
This patch moves SCEV expansion of steps used by
VPWidenIntOrFpInductionRecipes to the pre-header using
VPExpandSCEVRecipe. This ensures that those steps are expanded while the
CFG is in a valid state. Previously, SCEV expansion may happen during
vector body code-generation, during which the CFG may be invalid,
causing issues with SCEV expansion.

Depends on D122095.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D122096
2022-04-19 13:06:39 +02:00
Vasileios Porpodas b1333f03d9 Recommit "[SLP] Support internal users of splat loads"
Code review: https://reviews.llvm.org/D121940

This reverts commit 359dbb0d3d.
2022-04-18 15:58:01 -07:00
Vasileios Porpodas 359dbb0d3d Revert "[SLP] Support internal users of splat loads"
This reverts commit f8e1337115.
2022-04-18 12:12:34 -07:00
Vasileios Porpodas f8e1337115 [SLP] Support internal users of splat loads
Until now we would only accept a broadcast load pattern if it is only used
by a single vector of instructions.

This patch relaxes this, and allows for the broadcast to have more than one
user vector, as long as all of its uses are internal to the SLP graph and
vectorized.

Differential Revision: https://reviews.llvm.org/D121940
2022-04-18 11:59:44 -07:00
Florian Hahn 73f5d7d0d6
[VPlan] Handle equal address and store ops in onlyFirstLaneDemanded.
With opaque pointers, the stored value and address can be the same.

Previously the code in VPWidenMemoryInstructionRecipe::onlyFirstLaneDemanded
incorrectly considers stores with matching store and pointer operands as
only demanding the first lane, causing a crash.
2022-04-15 22:53:33 +02:00
Johannes Doerfert 1fb415fee9 [AMDGPU][FIX] Proper load-store-vectorizer result with opaque pointers
The original code relied on the fact that we needed a bitcast
instruction (for non constant base objects). With opaque pointers there
might not be a bitcast. Always check if reordering is required instead.

Fixes: https://github.com/llvm/llvm-project/issues/54896

Differential Revision: https://reviews.llvm.org/D123694
2022-04-15 13:42:46 -05:00
Florian Hahn 2c14cdf831
[VPlan] Turn external defs in Value -> VPValue mapping.
This addresses an existing TODO by keeping a mapping of external IR
Value * definitions wrapped in VPValues for use in a VPlan.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D123700
2022-04-14 12:03:09 +02:00
serge-sans-paille fa5a4e1b95 [iwyu] Handle regressions in libLLVM header include
Running iwyu-diff on LLVM codebase since a96638e50e detected a few
regressions, fixing them.
2022-04-13 20:53:19 +02:00
Alexey Bataev 0e1f4d4d3c [SLP]Improve reductions analysis and emission, part 1.
Currently SLP vectorizer walks through the instructions and selects
3 main classes of values: 1) reduction operations - instructions with same
reduction opcode (add, mul, min/max, etc.), which build the reduction,
2) reduced values - instructions with the same opcodes, but different
from the reduction opcode, 3) extra arguments - all other values,
instructions from the different basic block rather than the root node,
instructions with to many/less uses.

This scheme is not very efficient. It excludes some instructions and all
non-instruction values from the reductions (constants, proficient
gathers), to many possibly reduced values are marked as extra arguments.
Patch improves this process by introducing a bit extended analysis
stage. During this stage, we still try to select 3 classes of the
values: 1) reduction operations - same as before, 2) possibly reduced
values - all instructions from the current block/non-instructions, which
may build a vectorization tree, 3) extra arguments - instructions from
the different basic blocks. Additionally, an extra sorting of the
possibly reduced values occurs to build the scalar sequences which
highly likely will bed vectorized, e.g. loads are grouped by the
distance between them, constants are grouped together, cmp instructions
are sorted by their compare types and predicates, extractelement
instructions are sorted by the vector operand, etc. Also, these groups
are reordered by their length so the longest group is the first in the
list of the possibly reduced values.

The vectorization process tries to emit the reductions for all these
groups. These reductions, remaining non-vectorized possible reduced
values and extra arguments are then combined into the final expression
just like it was before.

Differential Revision: https://reviews.llvm.org/D114171
2022-04-12 17:46:11 -07:00
Muhammad Omair Javaid 42ebfa8269 Revert "[AArch64] Set maximum VF with shouldMaximizeVectorBandwidth"
This reverts commit 64b6192e81.

This broke LLVM AArch64 buildbot clang-aarch64-sve-vls-2stage:

https://lab.llvm.org/buildbot/#/builders/176/builds/1515

llvm-tblgen crashes after applying this patch.
2022-04-13 04:53:07 +05:00
Florian Hahn 5f1eb74850
[VPlan] Place VPExpandSCEVRecipe in pre-header.
After D121624 models the pre-header in VPlan, VPExpandSCEVRecipes can be
placed there. This ensures SCEV expansion happens before modifying the
CFG during VPlan execution, when CFG is incomplete.

Depends on D121624.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D122095
2022-04-10 10:26:20 +02:00
Florian Hahn 256c6b0ba1
[VPlan] Model pre-header explicitly.
This patch extends the scope of VPlan to also model the pre-header.
The pre-header can be used to place recipes that should be code-gen'd
outside the loop, like SCEV expansion.

Depends on D121623.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D121624
2022-04-09 14:19:47 +02:00
Florian Hahn 467dbcd9f1
[LV] Set debug loc after setting insert point.
This fixes the code to actually use the location of the instruction, if
available. Previously, SetInsertPoint would overwrite the insert point
set from the instruction.
2022-04-08 20:34:40 +02:00
Florian Hahn 29fe998eaa
[VPlan] Preserve debug location when creating branch.
Update createEmptyBasicBlock to preserve the debug location of the
previous terminator.
2022-04-08 17:22:53 +02:00
Florian Hahn 4388c979da
[VPlan] Use vector.body as header name in VPlan native path.
This brings the VPlan block naming in line with the naming of the
generated basic blocks.
2022-04-07 10:31:12 +02:00
Jingu Kang 64b6192e81 [AArch64] Set maximum VF with shouldMaximizeVectorBandwidth
Set the maximum VF of AArch64 with 128 / the size of smallest type in loop.

Differential Revision: https://reviews.llvm.org/D118979
2022-04-05 13:16:52 +01:00
Florian Hahn 1ff022e21b
[LV] Add vector.body block to parent loop during skeleton creation.
When creating induction resume values, SCEV queries may rely on
LoopInfo. Make sure vector.body gets added to the loop of the pre-header
during skeleton construction.

%vector.body will be moved to the vector preheader during VPlan
execution.

Fixes #54745.
2022-04-05 11:54:17 +01:00
Florian Hahn 1817c526e1
[VPlan] Update VPInterleavedAccessInfo to use getVectorLoopRegion.
Update VPInterleavedAccessInfo  to use the generic getVectorLoopRegion
helper instead of relying on the entry block being the top-most vector
loop region.
2022-04-04 10:26:39 +01:00
Florian Hahn 8cd1892725
[VPlan] Remember previous loop and reset vector loop.
At the moment this is NFC, but will be needed once nested loops are also
modeled as regions. Preparation for D123005.
2022-04-04 09:27:15 +01:00
Philip Reames 88de27e3fd [LV] Handle non-integral types when considering interleave widening legality
In general, anywhere we might need to insert a blind bitcast, we need to make sure the types are losslessly convertible.

This fixes pr54634.
2022-04-03 20:16:20 -07:00
Florian Hahn 95b2aa511e
[VPlan] Set VPlan header block name to vector.body.
This brings the VPlan block naming in line with the naming of the
generated basic blocks.
2022-04-02 19:34:32 +01:00
Florian Hahn f8101e4d68
Recommit "[LV] Remove unneeded createHeaderBranch.(NFCI)"
This reverts commit 14e3650f01.

The issue causing the revert were fixed independently in
a08c90a402 and 14e5f9785c.
2022-04-01 16:53:39 +01:00
Florian Hahn 14e5f9785c
[LV] Add SCEV workaround from 80e8025 to epilogue vector code path.
This was exposed by 14e3650f. The recommit of 14e3650f will hit the
problematic code path requiring the workaround.
test case that crashes without the workaround.
2022-04-01 15:14:47 +01:00
Florian Hahn a08c90a402
[LV] Re-use TripCount from EPI.TripCount.
During skeleton construction for the epilogue vector loop, generic
helpers use getOrCreateTripCount, which will re-expand the trip count
computation. Instead, re-use the TripCount created during main loop
vectorization.
2022-04-01 13:47:34 +01:00
Florian Hahn 14e3650f01
Revert "Recommit "[LV] Remove unneeded createHeaderBranch.(NFCI)""
This reverts commit 8378a71b6c.

It looks like this patch uncovered another issue, e.g. see
https://lab.llvm.org/buildbot/#/builders/168/builds/5518
2022-03-31 19:00:48 +01:00
Florian Hahn 8378a71b6c
Recommit "[LV] Remove unneeded createHeaderBranch.(NFCI)"
This reverts the revert commit 2760cdc9c6.

This version pulls in the code to create the vector loop object in VPlan
from D121624.

This is needed because otherwise existing LoopInfo verification will
fail, as a loop block doesn't have in-loop successors now that we
do not replace the branch.

Now that we do not add new loops during skeleton construction, there's
also no need to verify LI there.
2022-03-31 14:48:32 +01:00
Florian Hahn 2760cdc9c6
Revert "[LV] Remove unneeded createHeaderBranch.(NFCI)"
This reverts commit 32bc83d11e.

This is causing bots with expensive-checks to fail. Revert while I
investigate.
2022-03-31 12:32:50 +01:00
Florian Hahn 32bc83d11e
[LV] Remove unneeded createHeaderBranch.(NFCI)
The only remaining use was to get the exit block of the loop. Instead of
relying on the loop, use the successor of VectorHeaderBB
(LoopMiddleBlock) directly to set VPTransformState::CFG::ExitB

Depends on D121621.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D121623
2022-03-31 11:48:52 +01:00
Florian Hahn 2c494f0941
[VPlan] Remove unneeded Loop variable (NFC).
Suggested in D121623. The remaining uses of L can be replaced, reducing
the need for the variable.
2022-03-31 10:34:28 +01:00
David Green b65267ca7b [LV] Invalidate widening decisions after maximizing vector bandwidth
When MaximizeVectorBandwidth is enabled, we can end up (via calls to
collectUniformsAndScalars/setCostBasedWideningDecision through
calculateRegisterUsage) making widening decisions before we have decided
whether to fold the tail by masking. These decisions will be wrong if we
later decided to fold the tail, for example when the trip count is very
low. It will use incorrect costs for loads that should get masked, using
standard memory operation costs instead.

This still at the moment uses the EmulatedMaskMemRefHack costs (a bit
unfortunately), but the old costs without this change were 1, leading to
too optimistic vectorization.

This slightly changes the way that the MaximizeVectorBandwidth option
works to make it easier to test, always honouring the option if it is
set.

Differential Revision: https://reviews.llvm.org/D120215
2022-03-31 09:19:31 +01:00
Florian Hahn e4543af4e6
[VPlan] Track current vector loop in VPTransformState (NFC).
Instead of looking up the vector loop using the header, keep track of
the current vector loop in VPTransformState. This removes the
requirement for the vector header block being part of the loop up front.

A follow-up patch will move the code to generate the Loop object for the
vector loop to VPRegionBlock.

Depends on D121619.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D121621
2022-03-30 22:16:40 +01:00
Florian Hahn e8673f2f20
[LV] Do not create separate latch block in VPlan::execute.
Now that all dependencies on creating the latch block up-front have been
removed, there is no need to create it early.

Depends on D121618.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D121619
2022-03-30 17:31:38 +01:00
Florian Hahn 8a4077fac0
[LV] Pass LoopHeaderBB directly to updateDominatorTree. (NFC)
At the call site, we already know what the vector header block is. Pass
it directly.
2022-03-30 13:11:20 +01:00
Florian Hahn ecb4171dcb
[LV] Handle zero cost loops in selectInterleaveCount.
In some case, like in the added test case, we can reach
selectInterleaveCount with loops that actually have a cost of 0.

Unfortunately a loop cost of 0 is also used to communicate that the cost
has not been computed yet. To resolve the crash, bail out if the cost
remains zero after computing it.

This seems like the best option, as there are multiple code paths that
return a cost of 0 to force a computation in selectInterleaveCount.
Computing the cost at multiple places up front there would unnecessarily
complicate the logic.

Fixes #54413.
2022-03-29 22:52:43 +01:00
Florian Hahn d1d3563278
[LV] Move code to place pointer induction increment to VPlan post-processing.
This patch moves the code to set the correct incoming block for the
backedge value to VPlan::execute.

When generating the phi node, the backedge value is temporarily added
using the pre-header as incoming block. The invalid phi node will be
fixed up during VPlan::execute after main VPlan code generation.
At the same time, the backedge value is also moved to the latch.

This change removes the requirement to create the latch block up-front
for VPWidenInductionPHIRecipe::execute, which in turn will enable
modeling the pre-header in VPlan.

Depends on D121617.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D121618
2022-03-29 20:27:59 +01:00
Philip Reames 7d6e8f2a96 [slp] Delete dead scalar instructions feeding vectorized instructions
If we vectorize a e.g. store, we leave around a bunch of getelementptrs for the individual scalar stores which we removed. We can go ahead and delete them as well.

This is purely for test output quality and readability. It should have no effect in any sane pipeline.

Differential Revision: https://reviews.llvm.org/D122493
2022-03-28 20:10:13 -07:00
Florian Hahn e7bf2ea934
[LV] Move code to place induction increment to VPlan post-processing.
This patch moves the code to set the correct incoming block for the
backedge value to VPlan::execute.

When generating the phi node, the backedge value is temporarily added
using the pre-header as incoming block. The invalid phi node will be
fixed up during VPlan::execute after main VPlan code generation.
At the same time, the backedge value is also moved to the latch.

This change removes the requirement to create the latch block up-front
for VPWidenIntOrFpInductionRecipe::execute, which in turn will enable
modeling the pre-header in VPlan.

As an alternative, the increment could be modeled as separate recipe,
but that would require more work and a bit of redundant code, as we need
to create the step-vector during VPWidenIntOrFpInductionRecipe::execute
anyways, to create the values for different parts.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D121617
2022-03-28 16:20:02 +01:00
Philip Reames f80aaa675f [SLP] Simplify eraseInstruction [NFC]
This simplifies the implementation of eraseInstruction by moving the odd-replace-users-with-undef handling back to the only caller which uses it.  This handling was not obviously correct, so add the asserts which make it clear why this is safe to do at all.  The result is simpler code and stronger assertions.
2022-03-25 12:01:52 -07:00
Philip Reames 48cc9287f5 Reapply "[SLP] Schedule only sub-graph of vectorizable instructions"" (try 3)
The original commit exposed several missing dependencies (e.g. latent bugs in SLP scheduling).  Most of these were fixed over the weekend and have had several days to bake.  The last was fixed this morning after being noticed in manual review of test changes yesterday.  See the review thread for links to each change.

Original commit message follows:

SLP currently schedules all instructions within a scheduling window which stretches from the first instruction potentially vectorized to the last. This window can include a very large number of unrelated instructions which are not being considered for vectorization. This change switches the code to only schedule the sub-graph consisting of the instructions being vectorized and their transitive users.

This has the effect of greatly reducing the amount of work performed in large basic blocks, and thus greatly improves compile time on degenerate examples. To understand the effects, I added some statistics (not planned for upstream contribution). Here's an illustration from my motivating example:

   Before this patch:

   704357 SLP                          - Number of calcDeps actions
   699021 SLP                          - Number of schedule calls
   5598 SLP                          - Number of ReSchedule actions
   59 SLP                          - Number of ReScheduleOnFail actions
   10084 SLP                          - Number of schedule resets
   8523 SLP                          - Number of vector instructions generated

   After this patch:

   102895 SLP                          - Number of calcDeps actions
   161916 SLP                          - Number of schedule calls
   5637 SLP                          - Number of ReSchedule actions
   55 SLP                          - Number of ReScheduleOnFail actions
   10083 SLP                          - Number of schedule resets
   8403 SLP                          - Number of vector instructions generated

I do want to highlight that there is a small difference in number of generated vector instructions. This example is hitting the bailout due to maximum window size, and the change in scheduling is slightly perturbing when and how we hit it. This can be seen in the RescheduleOnFail counter change. Given that, I think we can safely ignore.

The downside of this change can be seen in the large test diff. We group all vectorizable instructions together at the bottom of the scheduling region. This means that vector instructions can move quite far from their original point in code. While maybe undesirable, I don't see this as being a major problem as this pass is not intended to be a general scheduling pass.

For context, it's worth noting that the pre-scheduling that SLP does while building the vector tree is exactly the sub-graph scheduling implemented by this patch.

Differential Revision: https://reviews.llvm.org/D118538
2022-03-25 10:39:23 -07:00
Philip Reames ec858f0201 [SLP] Optimize stacksave dependence handling [NFC]
After writing the commit message for 4b1bace28, realized that the mentioned optimization was rather straight forward.  We already have the code for scanning a block during region initialization, we can simply keep track if we've seen a stacksave or stackrestore.  If we haven't, none of these dependencies are relevant and we can avoid the relatively expensive scans entirely.
2022-03-25 10:04:10 -07:00
Philip Reames a16308c282 [SLP] Explicit track required stacksave/alloca dependency (try 3)
This is an extension of commit b7806c to handle one last case noticed in test changes for D118538.  Again, this is thought to be a latent bug in the existing code, though this time I have not managed to reduce tests for the original algoritthm.

The prior attempt had failed to account for this case:
  %a = alloca i8
  stacksave
  stackrestore
  store i8 0, i8* %a

If we allow '%a' to reorder into the stacksave/restore region, then the alloca will be deallocated before the use.  We will have taken a well defined program, and introduced a use-after-free bug.

There's also an inverse case where the alloca originally follows the stackrestore, and we need to prevent the reordering it above the restore.

Compile time wise, we potentially do an extra scan of the block for each alloca seen in a bundle.  This is significantly more expensive than the stacksave rooted version and is why I'd tried to avoid this in the initial patch.  There is room to optimize this (by essentially caching a "has stacksave" bit per block), but I'm leaving that to future work if it actually shows up in practice.  Since allocas in bundles should be rare in practice, I suspect we can defer the complexity for a long while.
2022-03-25 10:04:10 -07:00
Florian Hahn e47d220230
[LV] Use getVectorLoopRegion to retrieve header. (NFC)
Update all places that currently assume the entry block to the plan is
also the vector loop header to use getVectorLoopRegion instead.

getVectorLoopRegion will keep doing the right thing when the pre-header
is modeled explicitly (and becomes the new entry block in the plan).
2022-03-25 16:57:12 +00:00
Philip Reames d9756fa723 [slp] Factor out a lambda to avoid uplicating code a third time in upcoming patch [nfc] 2022-03-25 09:02:39 -07:00
Fraser Cormack 2e44b7872b [VectorCombine] Insert addrspacecast when crossing address space boundaries
We can not bitcast pointers across different address spaces. This was
previously fixed in D89577 but then in D93229 an enhancement was added
which peeks further through the ponter operand, opening up the
possibility that address-space violations could be introduced.

Instead of bailing as the previous fix did, simply insert an
addrspacecast cast instruction.

Reviewed By: lebedev.ri

Differential Revision: https://reviews.llvm.org/D121787
2022-03-24 19:08:08 +00:00
Simon Pilgrim 597aefa89c Fix unused variable warning by embedding inside assertion 2022-03-24 17:41:24 +00:00
Florian Hahn 46432a0088
[VPlan] Add VPWidenPointerInductionRecipe.
This patch moves pointer induction handling from VPWidenPHIRecipe to its
own recipe. In the process, it adds all information required to generate
code for pointer inductions without relying on Legal to access the list
of induction phis.

Alternatively VPWidenPHIRecipe could also take an optional pointer to InductionDescriptor.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D121615
2022-03-24 14:58:45 +00:00
Alexey Bataev 20973c0841 [SLP][NFC]Fix param name in comments, NFC. 2022-03-24 05:58:42 -07:00
Vasileios Porpodas 39aa202aff Recommit "[SLP] Fix lookahead operand reordering for splat loads." attempt 3, fixed assertion crash.
Original review: https://reviews.llvm.org/D121354

This reverts commit e6ead19b77.
2022-03-23 18:32:17 -07:00
Arthur Eubanks e6ead19b77 Revert "Recommit "[SLP] Fix lookahead operand reordering for splat loads." attempt 2, fixed assertion crash."
This reverts commit 27bd8f9492.

Causes crashes, see comments in D121973
2022-03-23 10:57:45 -07:00
serge-sans-paille 1b89c83254 Cleanup includes: Transforms/Instrumentation & Transforms/Vectorize
Discourse thread: https://discourse.llvm.org/t/include-what-you-use-include-cleanup
Differential Revision: https://reviews.llvm.org/D122181
2022-03-23 11:06:13 +01:00
Vasileios Porpodas 27bd8f9492 Recommit "[SLP] Fix lookahead operand reordering for splat loads." attempt 2, fixed assertion crash.
Original review: https://reviews.llvm.org/D121354

This reverts commit f7d7d2a08d.
2022-03-22 16:41:55 -07:00
Arthur Eubanks f7d7d2a08d Revert "Recommit "[SLP] Fix lookahead operand reordering for splat loads.""
This reverts commit 79613185d3.

Causes crashes, see comments in https://reviews.llvm.org/D121973.
2022-03-22 13:33:49 -07:00
Florian Hahn 50c8588e44
[LV] Remove Loop argument from createInductionResumeValues (NFCI).
createInductionResumeValues only uses its loop argument only to get the
pre-header, but the pre-header is already known (we created/cached it
earlier). Remove the unneeded loop argument.
2022-03-22 14:23:12 +00:00
Vasileios Porpodas 79613185d3 Recommit "[SLP] Fix lookahead operand reordering for splat loads."
Original review: https://reviews.llvm.org/D121354

The original commit 9136145eb0 broke the build on several targets.

Differential Revision: https://reviews.llvm.org/D121973
2022-03-21 15:57:32 -07:00
Philip Reames ee7324b898 Rename mayBeMemoryDependent to mayHaveNonDefUseDependency [nfc] 2022-03-21 10:01:40 -07:00
Alexey Bataev 79a182371e [SLP]Make stricter check for instructions that do not require
scheduling.

Need to check that the instructions with external operands can be
reordered safely before actualy exclude them from the scheduling.
2022-03-21 06:09:12 -07:00
Sophia 72bde608d2 [LV] Fix typo in comment
Reviewed by: fhahn (Florian Hahn)

    Differential Revision: https://reviews.llvm.org/D121781
2022-03-21 20:30:05 +08:00
Florian Hahn 0ebac76e6e
[LV] Remove unneeded Loop argument from completeLoopSkeleton. (NFCI)
completeLoopSkeleton only uses its loop argument only to get the
pre-header, but the pre-header is already known (we created/cached it
earlier). Remove the unneeded loop argument.
2022-03-21 10:07:25 +00:00
Florian Hahn 487629cc61
[LV] Remove dead Loop argument from emitMemRuntimeChecks. (NFC) 2022-03-20 21:01:15 +00:00
Philip Reames b7806c8b37 [SLP] Explicit track required stacksave/alloca dependency
The semantics of an inalloca alloca instruction requires that it not be reordered with a preceeding stacksave intrinsic call.  Unfortunately, there's no def/use edge or memory dependence edge.  (THe memory point is slightly subtle, but in general a new allocation can't alias with a call which executes strictly before it comes into existance.)

I'd tried to tackle this same case previously in 689babdf6, but the fix chosen there turned out to be incomplete.  As such, this change contains a fully revert of the first fix attempt.

This was noticed when investigating problems which surfaced with D118538, but this is definitely an existing bug.  This time around, I managed to reduce a couple of additional cases, including one which was being actively miscompiled even without the new scheduling change.  (See test diffs)

Compile time wise, we only spend extra time when seeing a stacksave (rare), and even then we walk the block at most once per schedule window extension.  Likely a non-issue.
2022-03-20 13:58:45 -07:00
Kazu Hirata bce1bf0ee2 [Transform] Apply clang-tidy fixes for readability-redundant-smartptr-get (NFC) 2022-03-20 10:41:22 -07:00
Philip Reames 6253b77da9 [SLP] Respect control dependence within a block during scheduling
This fixes an active miscompile visible in the test changes.  The basic problem is that the scheduling dependency graph didn't have any edges for control dependence within a single basic block.  The result is that we could (and in some rare cases *did*) perform reorderings within a block which could introduce new undefined behavior along paths which didn't previously contain any.

Impact wise, we have two major cases where control is not guaranteed to reach a later instruction in the block: may throw calls, and calls containing infinite loops.
* The former case was mostly covered by the memory dependencies, and to trigger require a function which can throw, but not write to memory.  In theory, such a case is possible, but not likely in practice.
* The later case is likely more of an issue in practice.  After this code was first written, we changed the IR semantics to allow well defined infinite loops without satisifying mustprogress.  Even for C/C++ - which do imply mustprogress - recent changes to how we treat atomics (e.g. an atomic read does not always imply a write) could expose this issue.  I'm a bit shocked we don't seem to have a bug report which hit this in real code actually.

Compile time wise, this results in a single extra scan of the scheduling window in the common case.  Since we stop scanning at the next instruction which isn't guaranteed to execute, no matter what order we traverse instructions in, we scan the block once.  The exception to this is that when we extend the scheduling window downwards, we invalidate all dependencies, and thus rescan.  So the potentially expensive case is when we a call in a big schedule window which is frequently extended.  We could optimize this case (by caching the last instruction not guaranteeed to transfer execution and scanning only the extended window) and starting there), but I decided to leave the complexity until it mattered.  That same case is already degenerate with memory dependences which is more expensive than the control dependence scan.

We could also consider combining the memory dependence and control dependence sets to reduce memory usage, but since it complicates the code slightly and makes debugging a bit harder, I went with the simplest scheme for now.

This was noticed while trying to understand the failures reported against D118538, but is not otherwise related to that change.
2022-03-19 13:36:24 -07:00
Florian Hahn 1a820ff039
[LV] Remove unnecessary uses of Loop* (NFC).
Update functions that previously took a loop pointer but only to get the
pre-header. Instead, pass the block directly. This removes the
requirement for the loop object to be created up-front.
2022-03-19 20:18:47 +00:00
Philip Reames 1093949cff [SLP] Add comment clarifying assumption that tripped me up [NFC]
I keep thinking this assumption is probably exploitable for a bug in the existing implementation, but all of my attempts at writing a test case have failed.  So for the moment, just document this very subtle assumption.
2022-03-18 11:40:19 -07:00
Kazu Hirata 3e0f7c7881 [Vectorize] Fix an 'unused function' warning
This patch fixes:

  llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:3917:13: error:
  unused function 'needToScheduleSingleInstruction'
  [-Werror,-Wunused-function]
2022-03-18 11:24:57 -07:00
Kazu Hirata b3d8c0d069 [Vectorize] Fix an 'unused variable' warning
This patch fixes:

  llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:8148:18: error:
  unused variable 'SDTE' [-Werror,-Wunused-variable]
2022-03-18 11:24:54 -07:00
Philip Reames 8f108c32bc Revert "[SLP] Optionally preserve MemorySSA"
This reverts commit 1cfa986d68.  See https://github.com/llvm/llvm-project/issues/54256 for why I'm discontinuing the project.

Seperately, it turns out that while this patch does correctly preserve MSSA, it's correct only at the end of the pass; not between vectorization attempts.  Even if we decide to resurrect this, we'll need to fix that before reapplying.
2022-03-18 10:45:59 -07:00
Vasileios Porpodas 9136145eb0 Revert "[SLP] Fix lookahead operand reordering for splat loads." due to build failures
This reverts commit 5efa78985b.
2022-03-17 18:22:04 -07:00
Vasileios Porpodas 5efa78985b [SLP] Fix lookahead operand reordering for splat loads.
Splat loads are inexpensive in X86. For a 2-lane vector we need just one
instruction: `movddup (%reg), xmm0`. Using the standard Splat score leads
to worse code. This patch adds a new score dedicated for splat loads.

Please note that a splat is usually three IR instructions:
- It is usually a load and 2 inserts:
 %ld = load double, double* %gep
 %ins1 = insertelement <2 x double> poison, double %ld, i32 0
 %ins2 = insertelement <2 x double> %ins1, double %ld, i32 1

- But it can also be a load, an insert and a shuffle:
 %ld = load double, double* %gep
 %ins = insertelement <2 x double> poison, double %ld, i32 0
 %shf = shufflevector <2 x double> %ins, <2 x double> poison, <2 x i32> zeroinitializer

Because of this some of the lit tests contain more IR instructions.

Differential Revision: https://reviews.llvm.org/D121354
2022-03-17 18:05:54 -07:00
Alexey Bataev d65cc85977 [SLP]Do not schedule instructions with constants/argument/phi operands and external users.
No need to schedule entry nodes where all instructions are not memory
read/write instructions and their operands are either constants, or
arguments, or phis, or instructions from others blocks, or their users
are phis or from the other blocks.
The resulting vector instructions can be placed at
the beginning of the basic block without scheduling (if operands does
not need to be scheduled) or at the end of the block (if users are
outside of the block).
It may save some compile time and scheduling resources.

Differential Revision: https://reviews.llvm.org/D121121
2022-03-17 11:03:45 -07:00
Florian Hahn 151c144350
[LV] Use usesScalars in widenPHIInstruction.
This uses the existing VPlan helpers to check whether there are scalar
uses of a phi recipe. It remove one of the few remaining dependencies on
the cost model from VPlan code generation.

Depends on D121612.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D121613
2022-03-17 13:16:32 +00:00
Florian Hahn a6e70e4056
[VPlan] VPInterleaveRecipe only requires the first lane of the address.
VPInterleaveRecipe only uses the first lane of the address. Add
onlyFirstLaneUsed implementation. This is needed for a follow-up patch.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D121612
2022-03-17 11:56:43 +00:00
Nikita Popov 1dbeb64493 [SLP] Avoid unnecessary getIncomingValueForBlock() call (NFC)
This code just wants to check all incoming values, we don't care
care what the incoming block is here.
2022-03-17 12:23:46 +01:00
Alexey Bataev 150ea76543 Revert "[SLP]Do not schedule instructions with constants/argument/phi operands and external users."
This reverts commit 1eeb2bfe72 to fix
a bug reported in https://reviews.llvm.org/D121121
2022-03-16 13:54:59 -07:00
Malhar Jajoo a36d269658 [VPlan] Avoid collecting scalars for SVE
This patch ensures scalars (except for uniforms) are no
longer collected (prior to LVP planning phase) for
scalable vectorization.

This is to avoid the chances of generating scalarized
instructions later (during LVP execute phase) as they
are not supported for scalable vectorization.

Relevant test has also been added.

Differential Revision: https://reviews.llvm.org/D121452
2022-03-16 16:33:34 +00:00
Alexey Bataev 1eeb2bfe72 [SLP]Do not schedule instructions with constants/argument/phi operands and external users.
No need to schedule entry nodes where all instructions are not memory
read/write instructions and their operands are either constants, or
arguments, or phis, or instructions from others blocks, or their users
are phis or from the other blocks.
The resulting vector instructions can be placed at
the beginning of the basic block without scheduling (if operands does
not need to be scheduled) or at the end of the block (if users are
outside of the block).
It may save some compile time and scheduling resources.

Differential Revision: https://reviews.llvm.org/D121121
2022-03-16 06:05:43 -07:00
Philip Reames 1cfa986d68 [SLP] Optionally preserve MemorySSA
This initial patch adds code to preserve MemorySSA through a run of SLP vectorizer. The eventual plan is to use MemorySSA to accelerate SLP's memory dependence checking, but we're a ways from that.  In particular, this patch is correct, but really slow. It's being landed so that we can work incrementally in tree, not because it's expected to be useful to anyone just yet.

The broader effort is being tracked in https://github.com/llvm/llvm-project/issues/54256.  Its worth noting expicitly that this may not work out, and if not, we will be reverting all of the MSSA support in SLP at some point in the next few weeks.

Differential Revision: https://reviews.llvm.org/D117926
2022-03-15 16:36:15 -07:00
Florian Hahn ca1b2fc9fb
[LV] Remove LoopVectorBody from InnerLoopVectorizer. (NFCI)
Update places still referencing LoopVectorBody to use the vector loop to
get the vector loop header. This is needed to move vector loop
code-generation to VPlan completely, which in turn is needed to model
pre-header & exit blocks in VPlan as well.
2022-03-15 08:22:31 +00:00
Florian Hahn 4a0481e981
[LV] Check for users of truncated IVs, add more detailed comment.
Add missing outside user check for truncated IVs. Also hoist the code in
the helper with additional explanations.

Fixes #54370.
2022-03-14 19:39:30 +00:00
Nikita Popov 8361c5da30 [SLPVectorizer] Handle external load/store pointer uses with opaque pointers
In this case we may not generate a bitcast, so the new load/store
becomes the external user.
2022-03-14 16:55:09 +01:00
Florian Hahn d621ae30e2
[LV] Remove dead Loop argument from emitMinimumVector... (NFC)
The argument is not used, remove it.
2022-03-14 15:47:40 +00:00
Florian Hahn 3ee2d908a9
[LV] Remove dead Loop argument from emitSCEVChecks. (NFC)
The argument is not used, remove it.
2022-03-14 13:00:03 +00:00
Florian Hahn 8896c36624
[LV] Do not set insert point in completeLoopSkeleton. (NFCI)
The insertion point for the builder used during VPlan code generation is
set during code generation. Setting the insert point here is dead code
and can be removed.
2022-03-14 12:21:26 +00:00
Florian Hahn 1c0fc1f074
[VPlan] Ensure each iv user is only visited once in transform.
If a recipe has multiple uses of an IV, we crash. It causes a crash when
building llvm-test-suite.

Exposed by 95f76bff1c.
2022-03-13 21:42:17 +00:00
Florian Hahn 95f76bff1c
[LV] Create & use VPScalarIVSteps for all scalar users.
This patch is a follow-up to D115953. It updates optimizeInductions
to also introduce new VPScalarIVStepsRecipes if an IV has both vector
and scalar uses.

It updates all uses that only need scalar values to use the newly
created recipe for the scalar steps.

This completes untangling of VPWidenIntOrFpInductionRecipe
code-generation. Now the recipe *only* creates the widened vector
values, as it says on the tin.

The code to genereate IR has been moved directly to
VPWidenIntOrFpInductionRecipe::execute.

Note that the recipe has been updated to hold a reference to
ScalarEvolution, which is needed to expand the step, until we can place
the corresponding SCEV expansion in the pre-header.

Depends on D120827.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D120828
2022-03-13 17:15:24 +00:00
serge-sans-paille ed98c1b376 Cleanup includes: DebugInfo & CodeGen
Discourse thread: https://discourse.llvm.org/t/include-what-you-use-include-cleanup
Differential Revision: https://reviews.llvm.org/D121332
2022-03-12 17:26:40 +01:00
Florian Hahn d3e1094473
[VPlan] Implement VPCanonicalIVPHIRecipe::onlyFirstLaneUsed.
The recipe only uses the first lane of its operands.

Suggested & split off D120827.
2022-03-11 18:07:26 +00:00
Florian Hahn ecea477df3
[VPlan] Helper to check if a recipe uses scalar values of op.
This patch adds a helper to check if a recipe only uses scalars of a
given operand. This is similar to onlyFirstLaneUsed, which was
introduced earlier.

By default, usesScalars falls back on onlyFirstLaneUsed.

Will be used by D120828.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D120827
2022-03-11 13:41:08 +00:00
Florian Hahn a12403cfea
[LV] Do not consider instrs dead if used by phi that's not in plan.
Single value phis won't be modeled in VPlan. If the phi only gets used
outside the loop, the current code misses the fact that the incoming
value is not dead. Update the code to also look through such phis to
check for outside users.

Fixes #54266
2022-03-09 16:04:44 +00:00
Philip Reames a2e9c68fcd [SLP] Extract a helper for buildvector [nfc] 2022-03-07 19:11:40 -08:00
Philip Reames 8ab3befa3f [SLP] Fix spelling in a lambda name [NFC] 2022-03-07 18:52:57 -08:00
Roman Lebedev 2f80ea7f4f
[NFC][LV] Use different braces in debug output
The analysis passes output function name encapsulated in `'` braces,
but LV uses `"`. Harmonizing this may help in creating an update script
for the LV costmodel test checks.

Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D121105
2022-03-07 19:32:37 +03:00
Philip Reames deae979a2c Revert "Reapply "[SLP] Schedule only sub-graph of vectorizable instructions"""
This reverts commit 738042711b. A second, apparently separate, issue has been reported on the original review.
2022-03-03 11:35:34 -08:00
Philip Reames 738042711b Reapply "[SLP] Schedule only sub-graph of vectorizable instructions""
Root issue which triggered the revert was fixed in 689bab.  No changes in the reapplied patch.

Original commit message follows:

SLP currently schedules all instructions within a scheduling window which stretches from the first instr
uction potentially vectorized to the last. This window can include a very large number of unrelated instruct
ions which are not being considered for vectorization. This change switches the code to only schedule the su
b-graph consisting of the instructions being vectorized and their transitive users.

This has the effect of greatly reducing the amount of work performed in large basic blocks, and thus greatly improves compile time on degenerate examples. To understand the effects, I added some statistics (not planned for upstream contribution). Here's an illustration from my motivating example:

    Before this patch:

    704357 SLP                          - Number of calcDeps actions
     699021 SLP                          - Number of schedule calls
       5598 SLP                          - Number of ReSchedule actions
         59 SLP                          - Number of ReScheduleOnFail actions
      10084 SLP                          - Number of schedule resets
       8523 SLP                          - Number of vector instructions generated

    After this patch:

    102895 SLP                          - Number of calcDeps actions
     161916 SLP                          - Number of schedule calls
       5637 SLP                          - Number of ReSchedule actions
         55 SLP                          - Number of ReScheduleOnFail actions
      10083 SLP                          - Number of schedule resets
       8403 SLP                          - Number of vector instructions generated

I do want to highlight that there is a small difference in number of generated vector instructions. This example is hitting the bailout due to maximum window size, and the change in scheduling is slightly perturbing when and how we hit it. This can be seen in the RescheduleOnFail counter change. Given that, I think we can safely ignore.

The downside of this change can be seen in the large test diff. We group all vectorizable instructions together at the bottom of the scheduling region. This means that vector instructions can move quite far from their original point in code. While maybe undesirable, I don't see this as being a major problem as this pass is not intended to be a general scheduling pass.

For context, it's worth noting that the pre-scheduling that SLP does while building the vector tree is exactly the sub-graph scheduling implemented by this patch.

Differential Revision: https://reviews.llvm.org/D118538
2022-03-02 10:47:20 -08:00
Philip Reames 689babdf68 [SLP] Don't try to vectorize allocas
While a collection of allocas are technically vectorizeable - by forming a wider alloca - this was not a transform SLP actually knows how to do.  Instead, we were forming a bundle with missing dependencies, and then relying on the scheduling code to preserve program order if multiple instructions were scheduleable at once.  I haven't been able to write a test case, but I'm 99% sure this was wrong in some edge case.

The unknown op case was flowing down the shufflevector path.  This did result in some splat handling being lost with this change, but the same lack of splat handling is visible in a whole bunch of simple examples for the gather path.  I didn't consider this interesting to fix given how narrow the splat of allocas case is.
2022-03-02 10:08:43 -08:00
Florian Hahn 8777cb66a8
[VPlan] Remove reliance on underlying instr for ScalarIVSteps (NFCI).
Instead of relying on underlying instructions, this patch updates
VPScalarIVStepsRecipe to only store the required type information.

This removes access to unrelated information, as well as avoiding issues
with the same underlying instruction being shared by multiple recipes.

This change should only change the debug output and not cause any
codegen changes, hence NFCI.
2022-03-02 16:23:19 +00:00
Florian Hahn 9e46866c0c
[LV] Remove dead EntryVal argument from buildScalarSteps (NFC).
The EntryVal argument is not needed after recent refactoring. Remove it.
2022-03-02 14:59:22 +00:00
Arthur Eubanks 9c6250ee41 Revert "[SLP] Schedule only sub-graph of vectorizable instructions"
This reverts commit 0539a26d91.

Causes a miscompile, see comments on D118538.

Required updating bottom-to-top-reorder.ll.
2022-03-01 17:31:16 -08:00
Arthur Eubanks 6987ac7903 Revert "[SLP] Remove SchedulingPriority from ScheduleData [NFC]"
This reverts commit a3e9b32c00.

Required for reverting D118538.
2022-03-01 17:28:52 -08:00
serge-sans-paille a494ae43be Cleanup includes: TransformsUtils
Estimation on the impact on preprocessor output:
before: 1065307662
after:  1064800684

Discourse thread: https://discourse.llvm.org/t/include-what-you-use-include-cleanup
Differential Revision: https://reviews.llvm.org/D120741
2022-03-01 21:00:07 +01:00
serge-sans-paille 71c3a5519d Cleanup includes: LLVMAnalysis
Number of lines output by preprocessor:
before: 1065940348
after:  1065307662

Discourse thread: https://discourse.llvm.org/t/include-what-you-use-include-cleanup
Differential Revision: https://reviews.llvm.org/D120659
2022-03-01 18:01:54 +01:00
Philip Reames 8cb0ac5825 [SLP] Check invariant that all instructions in bundle are in same block [NFC] 2022-02-28 13:17:44 -08:00
Alexey Bataev e4b9640867 [SLP]Improve bottom-to-top reordering.
Currently bottom-to-top reordering analysis counts orders of the
operands and then adds natural order counts for the operand users. It is
very conservative, this the user nodes themselves may require
reordering. Patch improves bottom-to-top analysis by checking for the
user nodes if they require/allows the reordring. If the user node must
be reordered, has reused scalars, is an alternate op vectorization node,
is a non-ordered gather node or may allow reordering because of the
reordered operands, such node is considered as the node that allows
reodring and is not counted as a node with the natural order.

Differential Revision: https://reviews.llvm.org/D120492
2022-02-28 06:48:46 -08:00
Florian Hahn b3e8ace198
Recommit "[VPlan] Introduce recipe to build scalar steps."
This reverts the revert commit ff93260bf6.

The underlying issue causing the PPC bot failures has been fixed in
cbaac14734 and a corresponding test case has been added in
ad2cad1c52.

Original message:

    This patch adds a new VPScalarIVStepsRecipe to handle building scalar
    steps.

    In the first patch, it only handles the case where there is no vector
    induction variable needed.

    Reviewed By: Ayal

    Differential Revision: https://reviews.llvm.org/D115953
2022-02-28 14:12:20 +00:00
Florian Hahn cbaac14734
[LV] Remove induction recipes only used outside vector loop.
Exit values of vector inductions are generated completely independent of
the induction recipes. Consider them for removal, if they are not used
in loop.

This fixes a crash exposed by 49b23f451c.
2022-02-28 11:14:22 +00:00
Philip Reames 319265328c [SLP] Remove field unused after 33ce97f to silence buildbots [NFC] 2022-02-27 10:18:10 -08:00
Florian Hahn ff93260bf6
Revert "[VPlan] Introduce recipe to build scalar steps."
This reverts commit 49b23f451c.

This appears to break some PPC build bots. Revert while I investigate.
2022-02-27 17:51:19 +00:00
Philip Reames 33ce97f413 [SLP] Use BatchAA to reduce capture analysis cost [NFC]
SLP makes very heavy use of aliasing queries to construct pointer dependencies for scheduling purposes.  AA internally usings pointerMayBeCaptured to prove some noalias results.  In a local profile, we were spending about 4% of total O2 time in capture tracking.  By using BatchAA interface - which caches capture results - this drops to 2%.

Note that there is no invalidation of BatchAA here.  This assumes that no transformation done by SLP invalidates alias or capture results.  This is the same assumption made by the existing AliasCache, so this is not a new assumption in the code.
2022-02-27 09:47:24 -08:00
Florian Hahn 49b23f451c
[VPlan] Introduce recipe to build scalar steps.
This patch adds a new VPScalarIVStepsRecipe to handle building scalar
steps.

In the first patch, it only handles the case where there is no vector
induction variable needed.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D115953
2022-02-27 17:32:41 +00:00
Florian Hahn 9bc866cc6f
[VPlan] Add recipe to handle SCEV expansion (NFC).
This can be used to explicitly model VPValues that depend on SCEV
expansion, like the step for inductions.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D116288
2022-02-27 12:47:02 +00:00
Florian Hahn da740492b0
[VPlan] Remove dead header-phi recipes.
This patch adds a new transform to remove dead recipes. For now, it only
removes dead recipes in the header, to keep the number tests that require
updating manageable. Future patches will extend this to remove dead
recipes across the whole plan.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D118051
2022-02-26 16:26:39 +00:00
Evgeniy Brevnov 10e99eb7e4 [SLP] "Normal" instructions should not go between PHI and Lading pad
Currently, SLP can insert "shuffle" instruction beween PHI and Landing pad instruction. The problem is demonstrated by LIT test. The solution is to adjust insertion point once we are done with PHI generation.

Differential Revision: https://reviews.llvm.org/D120552
2022-02-26 11:44:26 +07:00
Vasileios Porpodas 4bbc3290a2 [SLP] Fix for the min/max intrinsic cost.
The min/max intrinsic cost is currently too low because in the cost calculation
we subtract the cost of the vector compare as we will not emit it.
For the cost of the vector compare we are currently passing BAD_ICMP_PREDICATE
which returns 3, the worst case cost.
I think we should be passing VecPred instead, since we know the predicates of
the compare instr.

I think this is related to commit b3b993a7ad which introduced the predicate
argument to getCmpSelInstrCost().
https://reviews.llvm.org/rGb3b993a7ad817c3c5801341fa78f34332900eb83

Differential Revision: https://reviews.llvm.org/D120439
2022-02-24 18:08:40 -08:00
Philip Reames ed54296ea3 [SLP] Fastpath instructions not in block being scheduled [nfc] 2022-02-23 13:51:36 -08:00
Philip Reames a4541fdfe4 [SLP] Replace a impossible branch condition with an assert [NFC]
An entire bundle must be inside the scheduling window.  Assert that this property holds as opposed to checking it at runtime.
2022-02-23 13:43:45 -08:00
Philip Reames 9a40f9f681 {SLP] Make it clear ScheduleDataMap is keyed by instructions [NFC] 2022-02-23 13:31:36 -08:00
Philip Reames 9392c0d4ef Revert "[SLP] Remove cap on schedule window size"
This reverts commit 6adf4b039e.  Reverting while investigating https://github.com/llvm/llvm-project/issues/54029
2022-02-23 13:12:07 -08:00
Philip Reames a83441e8cd Revert "[SLP] Simplify extendSchedulingRegion"
This reverts commit 8c85f3a052.
2022-02-23 13:12:07 -08:00
Philip Reames 222e8610f1 [SLP] Rearrange fields in ScheduleData for density [NFC] 2022-02-23 12:33:43 -08:00
Philip Reames a3e9b32c00 [SLP] Remove SchedulingPriority from ScheduleData [NFC]
First step in trying to shrink the memory footprint of ScheduleData to improve cache locality.
2022-02-23 11:43:46 -08:00
Philip Reames 8c85f3a052 [SLP] Simplify extendSchedulingRegion
This change uses instruction's comesBefore method to simplify the code significantly. There's little compile time concern here because getSpillCost already calls comesBefore on every basic block which contains a vectorization candidate. The only additional times we'll build basic block ordering is when we can't schedule a vector candidate anywhere in the containing block.

Differential Revision: https://reviews.llvm.org/D120364
2022-02-23 11:23:38 -08:00
Philip Reames 6adf4b039e [SLP] Remove cap on schedule window size
This cap was first added in 848c1aa45 (back in 2015).  Per the original commit message, the purpose was to avoid a compile time explosion in long basic blocks.  The algorithmic problem in scheduling has now been fixed in 0539a26d.

In the meantime, the code has rotten fairly badly.  Some intermediate refactoring caused the size to only be incremented if *both* iterators advance in the window search.  This causes the size to be badly undercounted when near one end of a basic block.  We no longer have any test which exercises the logic in an intentional way; there's one test which differs with this change, but the changes appear fairly orthoganol to the purpose of the test file.

Unfortunately, we no longer have the original motivating example, so it's possible that it also hits some other issue.  I tested locally with a large example, but even at it's worst, that one doesn't demonstrate anything too extreme even without the algorithmic fix.  It's clearly faster with, but only by ~20% which doesn't seem in line with the original commit message.   If regressions with this patch are seen, please file a bug and I'll try to fix any other algorithmic problems which fall out.
2022-02-23 08:27:45 -08:00
Brendon Cahoon 3cc15e2cb6 [SLP] Fix assert from non-constant index in insertelement
A call to getInsertIndex() in getTreeCost() is returning None,
which causes an assert because a non-constant index value for
insertelement was not expected. This case occurs when the
insertelement index value is defined with a PHI.

Differential Revision: https://reviews.llvm.org/D120223
2022-02-22 15:57:14 -06:00
Philip Reames 8612b11c86 [SLP] Use isInSchedulingRegion consistently [NFC] 2022-02-22 10:27:16 -08:00
Philip Reames 0539a26d91 [SLP] Schedule only sub-graph of vectorizable instructions
SLP currently schedules all instructions within a scheduling window which stretches from the first instruction potentially vectorized to the last. This window can include a very large number of unrelated instructions which are not being considered for vectorization. This change switches the code to only schedule the sub-graph consisting of the instructions being vectorized and their transitive users.

This has the effect of greatly reducing the amount of work performed in large basic blocks, and thus greatly improves compile time on degenerate examples. To understand the effects, I added some statistics (not planned for upstream contribution). Here's an illustration from my motivating example:

Before this patch:

704357 SLP                          - Number of calcDeps actions
 699021 SLP                          - Number of schedule calls
   5598 SLP                          - Number of ReSchedule actions
     59 SLP                          - Number of ReScheduleOnFail actions
  10084 SLP                          - Number of schedule resets
   8523 SLP                          - Number of vector instructions generated

After this patch:

102895 SLP                          - Number of calcDeps actions
 161916 SLP                          - Number of schedule calls
   5637 SLP                          - Number of ReSchedule actions
     55 SLP                          - Number of ReScheduleOnFail actions
  10083 SLP                          - Number of schedule resets
   8403 SLP                          - Number of vector instructions generated

I do want to highlight that there is a small difference in number of generated vector instructions. This example is hitting the bailout due to maximum window size, and the change in scheduling is slightly perturbing when and how we hit it. This can be seen in the RescheduleOnFail counter change. Given that, I think we can safely ignore.

The downside of this change can be seen in the large test diff. We group all vectorizable instructions together at the bottom of the scheduling region. This means that vector instructions can move quite far from their original point in code. While maybe undesirable, I don't see this as being a major problem as this pass is not intended to be a general scheduling pass.

For context, it's worth noting that the pre-scheduling that SLP does while building the vector tree is exactly the sub-graph scheduling implemented by this patch.

Differential Revision: https://reviews.llvm.org/D118538
2022-02-22 10:15:55 -08:00
Kerry McLaughlin 12fb133eba [LoopVectorize] Support conditional in-loop vector reductions
Extends getReductionOpChain to look through Phis which may be part of
the reduction chain. adjustRecipesForReductions will now also create a
CondOp for VPReductionRecipe if the block is predicated and not only if
foldTailByMasking is true.

Changes were required in tryToBlend to ensure that we don't attempt
to convert the reduction Phi into a select by returning a VPBlendRecipe.
The VPReductionRecipe will create a select between the Phi and the reduction.

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D117580
2022-02-22 12:04:35 +00:00
Florian Hahn c141d158e5
[VectorCombine] Remove redundant checks (NFC).
The removed conditions are already checked by the if above.

Fixes #53761.
2022-02-19 21:05:32 +00:00
Philip Reames 3ad0bdae8f [SLP] Address post commit comment from 2e50760 2022-02-18 10:57:15 -08:00
Alexey Bataev b0a0df9809 [SLP]Fix vectorization of the alternate cmp instruction with swapped predicates.
If the alternate cmp instruction is a swapped predicate of the main cmp
instruction, need to generate alternate instruction, not the one with
the swapped predicate. Also, the lane with the alternate opcode should
be selected only, if the corresponding operands are not compatible.

Correctness confirmed:
https://alive2.llvm.org/ce/z/94BG66

Differential Revision: https://reviews.llvm.org/D119855
2022-02-18 04:27:45 -08:00
Alexey Bataev d1cd64ffdd [SLP][NFC]Fix misprint in function name, NFC. 2022-02-17 05:57:51 -08:00
Arthur Eubanks 826fae51d2 [SLPVectorizer][OpaquePtrs] Check GEP source element type
Fixes a miscompile with opaque pointers.

Reviewed By: #opaque-pointers, nikic

Differential Revision: https://reviews.llvm.org/D119980
2022-02-16 14:47:20 -08:00
Philip Reames 2e50760775 [SLP] Add assert that entities are scheduled as expected
Requested in D118538
2022-02-15 12:21:49 -08:00
Anton Afanasyev b7574b092a [SLP] Don't try to vectorize pair with insertelement
Particularly this breaks vectorization of insertelements where some of
intermediate (i.e. not last) insertelements are used externally.

Fixes PR52275
Fixes #51617

Differential Revision: https://reviews.llvm.org/D119679
2022-02-15 16:12:59 +03:00
Anton Afanasyev 954ea0f044 [SLP] Simplify indices processing for insertelements
Get rid of non-constant and undef indices of insertelements
at `buildTree()` stage. Fix bugs.

Differential Revision: https://reviews.llvm.org/D119623
2022-02-14 14:50:44 +03:00
Florian Hahn 2cd22ce0d0
[LV] Pass start value directly to emitTransformedIndex (NFC). 2022-02-12 19:03:32 +00:00
Anton Afanasyev cd685f5736 [NFC][SLP] Set default parameter for Offset equal to zero 2022-02-11 17:22:33 +03:00
Philip Reames 5ba115031d [PSE] Remove assumption that top level predicate is union from public interface [NFC*]
Note that this doesn't actually cause the top level predicate to become a non-union just yet.

The * above comes from a case in the LoopVectorizer where a predicate which is later proven no longer blocks vectorization due to a change from checking if predicates exists to whether the predicate is possibly false.
2022-02-10 16:14:52 -08:00
Simon Pilgrim 6af7c1371a [LoopVectorize] getStepVector - reduce scope of local variable. NFC. 2022-02-10 20:44:25 +00:00
David Green b55d4c2ad8 Revert "[LV] Remove `LoopVectorizationCostModel::useEmulatedMaskMemRefHack()`"
This reverts commit 77a0da926c as we've
received multiple reports of this significantly impacting performance,
in ways that don't seem to just be target specific cost models going
wrong. I would offer some reproducers, but the test changes here seem to
be full of them!

Reverting for now and hopefully we can remove the "hack" more carefully
as we go.
2022-02-09 20:02:54 +00:00
Alexey Bataev 370ea1a199 [SLP][NFC]Fix comment, NFC. 2022-02-09 07:14:14 -08:00
Florian Hahn 8aa122081f
[LV] Pass step to emitTransformedIndex (NFC).
Move out the induction step creation from emitTransformedIndex to the
callers. In some places (e.g. widenIntOrFpInduction) the step is already
created. Passing the step in ensures the steps are kept in sync.
2022-02-09 11:12:45 +00:00
Florian Hahn c9e6678b56
[LV] Move buildScalarSteps out of ILV (NFC).
This makes the function independent of shared state in ILV (ensures no
new dependencies on things like the cost model are introduced) and allows
for use directly in recipe's ::execute functions.
2022-02-08 21:18:40 +00:00
David Green b4c6d1bb37 [LoopVectorizer] Don't perform interleaving of predicated scalar loops
The vectorizer will choose at times to "vectorize" loops with a scalar
factor (VF=1) with interleaving (IC > 1). This can occasionally produce
better code than the unroller (notable for reductions where it can
produce independent reduction chains that are combined after the loop).
At times this is not very beneficial though, for example when runtime
checks are needed or when the scalar code requires predication.

This addresses the second point, preventing the vectorizer from
interleaving when the scalar loop will require predication. This
prevents it from making a bit of a mess, that is worse than the original
and better left for the unroller to unroll if beneficial. It helps
reverse some of the regressions from D118090.

Differential Revision: https://reviews.llvm.org/D118566
2022-02-07 19:34:28 +00:00