Commit Graph

3480 Commits

Author SHA1 Message Date
Philip Reames b5c7213647 [LV] Use early return to simplify code structure 2022-07-22 12:15:14 -07:00
Benjamin Kramer 5a445395e4 [LV] Remove unused variable. NFC. 2022-07-22 17:43:58 +02:00
Philip Reames d7bf81fd51 [LV] Rework widening cost of uniform memory ops for clarity [nfc]
Reorganize the code to make it clear what is and isn't handle, and why.
Restructure bailout to remove (false and confusing) dependence on
CM_Scalarize; just return invalid cost and propagate, that's what it
is for.
2022-07-22 08:35:45 -07:00
Philip Reames bd75350180 [LV] Fix a conceptual mistake around meaning of uniform in isPredicatedInst
This code confuses LV's "Uniform" and LVL/LAI's "Uniform".  Despite the
common name, these are different.
* LVs notion means that only the first lane *of each unrolled part* is
  required.  That is, lanes within a single unroll factor are considered
  uniform.  This allows e.g. widenable memory ops to be considered
  uses of uniform computations.
* LVL and LAI's notion refers to all lanes across all unrollings.

IsUniformMem is in turn defined in terms of LAI's notion.  Thus a
UniformMemOpmeans is a memory operation with a loop invariant address.
This means the same address is accessed in every iteration.

The tweaked piece of code was trying to match a uniform mem op (i.e.
fully loop invariant address), but instead checked for LV's notion of
uniformity.  In theory, this meant with UF > 1, we could speculate
a load which wasn't safe to execute.

This ends up being mostly silent in current code as it is nearly
impossible to create the case where this difference is visible.  The
closest I've come in the test case from 54cb87, but even then, the
incorrect result is only visible in the vplan debug output; before this
change we sink the unsafely speculated load back into the user's predicate
blocks before emitting IR.  Both before and after IR are correct so the
differences aren't "interesting".

The other test changes are uninteresting.  They're cases where LV's uniform
analysis is slightly weaker than SCEV isLoopInvariant.
2022-07-21 15:44:34 -07:00
David Sherwood f15b6b2907 [AArch64] Add target hook for preferPredicateOverEpilogue
This patch adds the AArch64 hook for preferPredicateOverEpilogue,
which currently returns true if SVE is enabled and one of the
following conditions (non-exhaustive) is met:

1. The "sve-tail-folding" option is set to "all", or
2. The "sve-tail-folding" option is set to "all+noreductions"
and the loop does not contain reductions,
3. The "sve-tail-folding" option is set to "all+norecurrences"
and the loop has no first-order recurrences.

Currently the default option is "disabled", but this will be
changed in a later patch.

I've added new tests to show the options behave as expected here:

  Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll

Differential Revision: https://reviews.llvm.org/D129560
2022-07-21 17:20:06 +01:00
Philip Reames 523a526a02 [LV] Fix miscompile due to srem/sdiv speculation safety condition
An srem or sdiv has two cases which can cause undefined behavior, not just one. The existing code did not account for this, and as a result, we miscompiled when we encountered e.g. a srem i64 %v, -1 in a conditional block.

Instead of hand rolling the logic, just use the utility function which exists exactly for this purpose.

Differential Revision: https://reviews.llvm.org/D130106
2022-07-20 05:35:23 -07:00
Florian Hahn 5124b21648
[VPlan] Initial def-use verification.
This patch introduces some initial def-use verification. This catches
cases like the one fixed by D129436.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D129717
2022-07-20 11:06:32 +01:00
William Schmidt bccc9aa81c Don't vectorize PHIs in catchswitch blocks
We currently assert in vectorizeTree(TreeEntry*) when processing a PHI
bundle in a block containing a catchswitch.  We attempt to set the
IRBuilder insertion point following the catchswitch, which is invalid.
This is done so that ShuffleBuilder.finalize() knows where to insert
a shuffle if one is needed.

To avoid this occurring, watch out for catchswitch blocks during
buildTree_rec() processing, and avoid adding PHIs in such blocks to
the vectorizable tree.  It is unlikely that constraining vectorization
over an exception path will cause a noticeable performance loss, so
this seems preferable to trying to anticipate when a shuffle will and
will not be required.
2022-07-19 06:10:17 -07:00
Florian Hahn a75760a269
[LV] Remove unnecessary cast in widenCallInstruction. (NFC) 2022-07-19 11:23:24 +01:00
Florian Hahn 30e53b8c03
[LV] Sink module variable and use State to set it in widenCall. (NFC)
Limits the lifetime of the variable and makes it independent of
CallInst.
2022-07-18 19:41:48 +01:00
Florian Hahn 105032f549
[LV] Use PHI recipe instead of PredRecipe for subsequent uses.
At the moment, the VPPRedInstPHIRecipe is not used in subsequent uses of
the predicate recipe. This incorrectly models the def-use chains, as all
later uses should use the phi recipe. Fix that by delaying recording of
the recipe.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D129436
2022-07-18 09:35:34 +01:00
Kazu Hirata 7094ab4ee7 [llvm] Modernize bool literals (NFC)
Identified with modernize-use-bool-literals.
2022-07-17 18:08:51 -07:00
Florian Hahn cc0ee17951
[LV] Move VPPredInstPHIRecipe::execute to VPlanRecipes.cpp (NFC) 2022-07-17 11:34:23 +01:00
Florian Hahn 6813b41d57
[LV] Avoid creating new run-time VF expression for each runtime checks.
At the moment, the cost of runtime checks for scalable vectors is
overestimated due to creating separate vscale * VF expressions for each
check. Instead re-use the first expression.
2022-07-16 17:24:07 +01:00
David Green 4b7913c357 [VectorCombine] Only consider shuffle uses with the same type.
The backend getShuffleCosts do not currently handle shuffles that change
size very well. Limit the shuffles we collect to the same type to make
sure they do not cause issues as reported in D128732.
2022-07-16 13:23:39 +01:00
Florian Hahn aa00fb02c9
[LV] Use umax(VF * UF, MinProfTC) for scalable vectors.
For scalable vectors, it is not sufficient to only check
MinProfitableTripCount if it is >= VF.getKnownMinValue() * UF, because
this property may not holder for larger values of vscale. In those
cases, compute umax(VF * UF, MinProfTC) instead.

This should fix
https://lab.llvm.org/buildbot/#/builders/197/builds/2262
2022-07-15 10:23:14 -07:00
Mel Chen bd404fbcc8 [LV][NFC] Fix the condition for printing debug messages
Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D128523
2022-07-15 01:47:33 -07:00
Kazu Hirata 611ffcf4e4 [llvm] Use value instead of getValue (NFC) 2022-07-13 23:11:56 -07:00
Florian Hahn ee37ae91b6
[VPlan] Move VPBB verification to separate function (NFC). 2022-07-13 18:53:40 -07:00
Florian Hahn 6f7347b888
[LV] Use PredRecipe directly instead of getOrAddVPValue (NFC).
There is no need to look up the VPValue for Instr, PredRecipe can be
used directly.
2022-07-13 17:01:42 -07:00
Florian Hahn 225e3ec622
[LV] Move VPBranchOnMaskRecipe::execute to VPlanRecipes.cpp (NFC). 2022-07-13 14:39:59 -07:00
David Sherwood 307ace7f20 [LoopVectorize] Ensure the VPReductionRecipe is placed after all it's inputs
When vectorising ordered reductions we call a function
LoopVectorizationPlanner::adjustRecipesForReductions to replace the
existing VPWidenRecipe for the fadd instruction with a new
VPReductionRecipe. We attempt to insert the new recipe in the same
place, but this is wrong because createBlockInMask may have
generated new recipes that VPReductionRecipe now depends upon. I
have changed the insertion code to append the recipe to the
VPBasicBlock instead.

Added a new RUN with tail-folding enabled to the existing test:

  Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll

Differential Revision: https://reviews.llvm.org/D129550
2022-07-13 09:29:25 +01:00
David Sherwood 6b694d600a [LoopVectorize] Change PredicatedBBsAfterVectorization to be per VF
When calculating the cost of Instruction::Br in getInstructionCost
we query PredicatedBBsAfterVectorization to see if there is a
scalar predicated block. However, this meant that the decisions
being made for a given fixed-width VF were affecting the cost for a
scalable VF. As a result we were returning InstructionCost::Invalid
pointlessly for a scalable VF that should have a low cost. I
encountered this for some loops when enabling tail-folding for
scalable VFs.

Test added here:

  Transforms/LoopVectorize/AArch64/sve-tail-folding-cost.ll

Differential Revision: https://reviews.llvm.org/D128272
2022-07-12 14:53:20 +01:00
Florian Hahn 5d135041c5
[LV] Move VPBlendRecipe::execute to VPlanRecipes.cpp (NFC). 2022-07-11 16:01:07 -07:00
David Sherwood 03fee6712a [LoopVectorize] Add option to use active lane mask for loop control flow
Currently, for vectorised loops that use the get.active.lane.mask
intrinsic we only use the mask for predicated vector operations,
such as masked loads and stores, etc. The loop itself is still
controlled by comparing the canonical induction variable with the
trip count. However, for some targets this is inefficient when it's
cheap to use the mask itself to control the loop.

This patch adds support for using the active lane mask for control
flow by:

1. Generating the active lane mask for the next iteration of the
vector loop, rather than the current one. If there are still any
remaining iterations then at least the first bit of the mask will
be set.
2. Extract the first bit of this mask and use this bit for the
conditional branch.

I did this by creating a new VPActiveLaneMaskPHIRecipe that sets
up the initial PHI values in the vector loop pre-header. I've also
made use of the new BranchOnCond VPInstruction for the final
instruction in the loop region.

Differential Revision: https://reviews.llvm.org/D125301
2022-07-11 13:46:55 +01:00
David Sherwood 02d6950d84 [LoopVectorize][NFC] Add optional Name parameter to VPInstruction
This patch is a simple piece of refactoring that now permits users
to create VPInstructions and specify the name of the value being
generated. This is useful for creating more readable/meaningful
names in IR.

Differential Revision: https://reviews.llvm.org/D128982
2022-07-11 09:23:24 +01:00
Florian Hahn 6a4bc452f8
[LV] Move VPWidenGEPRecipe::execute to VPlanRecipes.cpp (NFC). 2022-07-10 17:10:17 -07:00
Florian Hahn 13ae213469
[LV] Move VPWidenRecipe::execute to VPlanRecipes.cpp (NFC). 2022-07-09 18:46:57 -07:00
Florian Hahn 0c27b38849
[VPlan] Move VPWidenSelectRecipe::execute to VPlanRecipes.cpp (NFC).
Depends on D127968.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D127970
2022-07-08 09:35:23 -07:00
Craig Topper 0266773464 [SLP] Add missing space to optimization remark.
Reviewed By: vporpo

Differential Revision: https://reviews.llvm.org/D129330
2022-07-07 23:29:11 -07:00
Florian Hahn bc19b7c3cc
[LV] Remove collectTriviallyDeadInstructions, already handled by VP DCE.
Now that removeDeadRecipes can remove most dead recipes across a whole
VPlan, there is no need to first collect some dead instructions.
Instead removeDeadRecipes can simply clean them up.

Depends D127580.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D128408
2022-07-07 08:40:27 -07:00
Sander de Smalen 519d7876cb [VectorCombine] Avoid creating shuffle for extract-extract pattern on scalable vector.
This addresses https://github.com/llvm/llvm-project/issues/56377

Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D129136
2022-07-07 08:37:04 +00:00
Florian Hahn 17d48c3169
[VPlan] Move remove dead recipes before merging regions.
This can enable additional region merging,  while not losing
opportunities as region merging does not produce dead recipes.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D128831
2022-07-06 20:38:38 -07:00
Nikita Popov 1ed8b29302 [LoopVectorizationLegality] Drop unused variable (NFC) 2022-07-06 10:43:39 +02:00
Nikita Popov 8ee913d83b [IR] Remove Constant::canTrap() (NFC)
As integer div/rem constant expressions are no longer supported,
constants can no longer trap and are always safe to speculate.
Remove the Constant::canTrap() method and its usages.
2022-07-06 10:36:47 +02:00
David Green 5493f8fc59 [VectorCombine] Improve shuffle select shuffle-of-shuffles
This in an extension to the code added in D123911 which added vector
combine folding of shuffle-select patterns, attempting to reduce the
total amount of shuffling required in patterns like:
  %x = shuffle %i1, %i2
  %y = shuffle %i1, %i2
  %a = binop %x, %y
  %b = binop %x, %y
  shuffle %a, %b, selectmask

This patch extends the handing of shuffles that are dependent on one
another, which can arise from the SLP vectorizer, as-in:
  %x = shuffle %i1, %i2
  %y = shuffle %x

The input shuffles can also be emitted, in which case they are treated
like identity shuffles. This patch also attempts to calculate a better
ordering of input shuffles, which can help getting lower cost input
shuffles, pushing complex shuffles further down the tree.

This is a recommit with some additional checks for supported forms and
out-of-bounds mask elements, with some extra tests.

Differential Revision: https://reviews.llvm.org/D128732
2022-07-05 17:16:18 +01:00
Florian Hahn ebb78a95ce
[LV] Remove stray dbgs() call after 774fc63490. 2022-07-05 12:58:18 +01:00
Florian Hahn 774fc63490
[LV] Consider minimum vscale assmuption for RT check cost.
For scalable VFs, the minimum assumed vscale needs to be included in the
cost-computation, otherwise a smaller VF may be used for RT check cost
computation than was used for earlier cost computations.

Fixes a RISCV test failing with UBSan due to both scalar and vector
loops having the same cost.
2022-07-05 09:41:58 +01:00
Nikita Popov b69c75d53f Revert "[VectorCombine] Improve shuffle select shuffle-of-shuffles"
This reverts commit 19a1e20b8a.

Clang crashes while linking bullet from llvm-test-suite in
ReleaseLTO-g cmake configuration.
2022-07-05 09:31:20 +02:00
Florian Hahn 2a82c15f63
[LV] Consider runtime checks profitable if scalar cost is zero.
This fixes an UBSan failure after 644a965c1e. When using
user-provided VFs/ICs (via the force-vector-width /
force-vector-interleave options) the scalar cost is zero, which would
cause divide-by-zero.

When forcing vectorization using the options, the cost of the runtime
checks should not block vectorization.
2022-07-04 21:37:16 +01:00
Florian Hahn 9eb6572786
[LV] Add back CantReorderMemOps remark.
Add back remark unintentionally dropped by 644a965c1e.

I will add a LV test separately, so we do not have to rely on a Clang
test to catch this.
2022-07-04 17:23:47 +01:00
Peter Waller c146af3f46 [LoopVectorize][NFC] Reinstate TTICapture workaround for gcc-6
Fixes #56374.
2022-07-04 14:14:15 +00:00
Florian Hahn 644a965c1e
[LV] Vectorize cases with larger number of RT checks, execute only if profitable.
This patch replaces the tight hard cut-off for the number of runtime
checks with a more accurate cost-driven approach.

The new approach allows vectorization with a larger number of runtime
checks in general, but only executes the vector loop (and runtime checks) if
considered profitable at runtime. Profitable here means that the cost-model
indicates that the runtime check cost + vector loop cost < scalar loop cost.

To do that, LV computes the minimum trip count for which runtime check cost
+ vector-loop-cost < scalar loop cost.

Note that there is still a hard cut-off to avoid excessive compile-time/code-size
increases, but it is much larger than the original limit.

The performance impact on standard test-suites like SPEC2006/SPEC2006/MultiSource
is mostly neutral, but the new approach can give substantial gains in cases where
we failed to vectorize before due to the over-aggressive cut-offs.

On AArch64 with -O3, I didn't observe any regressions outside the noise level (<0.4%)
and there are the following execution time improvements. Both `IRSmk` and `srad` are relatively short running, but the changes are far above the noise level for them on my benchmark system.

```
CFP2006/447.dealII/447.dealII    -1.9%
CINT2017rate/525.x264_r/525.x264_r    -2.2%
ASC_Sequoia/IRSmk/IRSmk       -9.2%
Rodinia/srad/srad     -36.1%
```

`size` regressions on AArch64 with -O3 are

```
MultiSource/Applications/hbd/hbd                 90256.00   106768.00 18.3%
MultiSourc...ks/ASCI_Purple/SMG2000/smg2000     240676.00   257268.00  6.9%
MultiSourc...enchmarks/mafft/pairlocalalign     472603.00   489131.00  3.5%
External/S...2017rate/525.x264_r/525.x264_r     613831.00   630343.00  2.7%
External/S...NT2006/464.h264ref/464.h264ref     818920.00   835448.00  2.0%
External/S...te/538.imagick_r/538.imagick_r    1994730.00  2027754.00  1.7%
MultiSourc...nchmarks/tramp3d-v4/tramp3d-v4    1236471.00  1253015.00  1.3%
MultiSource/Applications/oggenc/oggenc         2108147.00  2124675.00  0.8%
External/S.../CFP2006/447.dealII/447.dealII    4742999.00  4759559.00  0.3%
External/S...rate/510.parest_r/510.parest_r   14206377.00 14239433.00  0.2%
```

Reviewed By: lebedev.ri, ebrevnov, dmgreen

Differential Revision: https://reviews.llvm.org/D109368
2022-07-04 15:11:39 +01:00
David Green 2de05afc19 [SLP] Peek into loads when hitting the RecursionMaxDepth
This patch slightly extends the limit on the RecursionMaxDepth inside
the SLP vectorizer. It does it only when it hits a load (or zext/sext of
a load), which allows it to peek through in the places where it will be
the most valuable, without ballooning out the O(..) by any 2^n factors.

Differential Revision: https://reviews.llvm.org/D122148
2022-07-04 14:22:50 +01:00
David Green 19a1e20b8a [VectorCombine] Improve shuffle select shuffle-of-shuffles
This in an extension to the code added in D123911 which added vector
combine folding of shuffle-select patterns, attempting to reduce the
total amount of shuffling required in patterns like:
  %x = shuffle %i1, %i2
  %y = shuffle %i1, %i2
  %a = binop %x, %y
  %b = binop %x, %y
  shuffle %a, %b, selectmask

This patch extends the handing of shuffles that are dependent on one
another, which can arise from the SLP vectorizer, as-in:
  %x = shuffle %i1, %i2
  %y = shuffle %x

The input shuffles can also be emitted, in which case they are treated
like identity shuffles. This patch also attempts to calculate a better
ordering of input shuffles, which can help getting lower cost input
shuffles, pushing complex shuffles further down the tree.

Differential Revision: https://reviews.llvm.org/D128732
2022-07-04 13:38:43 +01:00
Florian Hahn b4694229aa
[LV] Simplify setDebugLocFromInst by using early exit (NFC).
Suggested as separate improvement in D128657.
2022-07-04 09:25:26 +01:00
Florian Hahn b0da3c6fa4
[VPlan] Move setDebugLocFromInst to VPTransformState (NFC).
The moved helpers are only used for codegen. It will allow moving the
remaining ::execute implementations out of LoopVectorize.cpp.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D128657
2022-07-02 15:18:17 +01:00
Florian Hahn 0dddf04cab
[LV] Don't optimize exit cond during epilogue vectorization.
At the moment, the same VPlan can be used code generation of both the
main vector and epilogue vector loop. This can lead to wrong results, if
the plan is optimized based on the VF of the main vector loop and then
re-used for the epilogue loop.

One example where this is problematic is if the scalar loops need to
execute at least one iteration, e.g. due to interleave groups.

To prevent mis-compiles in the short-term, disable optimizing exit
conditions for VPlans when using epilogue vectorization. The proper fix
is to avoid re-using the same plan for both loops, which will require
support for cloning plans first.

Fixes #56319.
2022-07-01 13:48:38 +01:00
Florian Hahn 583abd0e36
[VPlan] Move addMetadata to VPTransformState (NFC).
The moved helpers are only used for codegen. It will allow moving the
remaining ::execute implementations out of LoopVectorize.cpp.

Depends on D127966.
Depends on D127965.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D127968
2022-07-01 12:03:25 +01:00
Alexey Bataev 4be3fc35aa [SLP][NFC]Cleanup up operands of the removed insertelements, NFC.
Replace all operands of the insertelement instruction, replaced by
shuffles, by poisons to avoid false-positive reports about incorrect function.
2022-06-30 17:51:43 -07:00
Florian Hahn 68884dde70
[LV] Move LoopVersioning creation to LVP::execute.
At the moment LoopVersioning is only created for inner-loop
vectorization. This patch moves it to LVP::execute, which means it will
also be added for epilogue vectorization. As a consequence, the proper
noalias metadata is now also added to epilogue vector loops.

LVer will be moved to VPTransformState as follow-up.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D127966
2022-06-30 12:14:32 +01:00
Florian Hahn 24b5f8e0d0
[VPlan] Make sure optimizeInductions removes wide ind from scalar plan.
In some cases, there may be widened users of inductions even though the
plan includes the scalar VF. In those cases, make sure we still replace
the VPWidenIntOrFpInductionRecipe with scalar steps, as otherwise we may
try to execute a VPWidenIntOrFpInductionRecipe with a scalar VF.

Alternatively the patch could also split the range if needed.

This fixes a crash exposed by D123720.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D128755
2022-06-30 09:11:48 +01:00
Nikita Popov bdba8278d9 [VectorCombine] Avoid ConstantExpr::get() (NFC)
Use IRBuilder APIs instead, which will still constant fold.
2022-06-29 17:17:52 +02:00
Alexey Bataev bf4dcbd2df [SLP]Fix PR56251: Do not remove the reordering from the root node, being used as an operand.
If the root order itself does not require reordering, we can just
remove its reorder mask safely (e.g., if the root node is a vector of
phis). But if this node is used as an operand in the graph, we cannot
delete the reordering, need to keep it. Otherwise the graph nodes are
not synchronized with the operands. It may cause an extra gather
instruction(s) or a compiler crash.
Also, need to be very careful when selecting the gather nodes for
reordering since there might several gather nodes with the same scalars
and we can try to reorder just the same node many times instead of
different nodes.

Differential Revision: https://reviews.llvm.org/D128680
2022-06-28 13:42:05 -07:00
Florian Hahn 03975b7f0e
[VPlan] Move recipe implementations to separate file (NFC).
This patch moves the code for recipe implementations to a separate file.

The benefits are:
 * Keep VPlan.cpp smaller => faster compile-time during parallel builds.
 * Keep code for logical units together

As a follow-up I am also planning on moving all ::execute
implemetnations from LoopVectorize.cpp over to the new file, which
should help to reduce the size of the file a bit.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D127965
2022-06-28 10:34:30 +01:00
Guillaume Chatelet 3c126d5fe4 [Alignment] Replace commonAlignment with std::min
`commonAlignment` is a shortcut to pick the smallest of two `Align`
objects. As-is it doesn't bring much value compared to `std::min`.

Differential Revision: https://reviews.llvm.org/D128345
2022-06-28 07:15:02 +00:00
Philip Reames 20dd3297b1 [LV] Allow scalable vectorization with vscale = 1
This change is a bit subtle. If we have a type like <vscale x 1 x i64>, the vectorizer will currently reject vectorization. The reason is that a type like <1 x i64> is likely to get simply rescalarized, and the vectorizer doesn't want to be in the game of simple unrolling.

(I've given the example in terms of 1 x types which use a single register, but the same issue exists for any N x types which use N registers. e.g. RISCV LMULs.)

This change distinguishes scalable types from fixed types under the reasoning that converting to a scalable type isn't unrolling. Because the actual vscale isn't known until runtime, using a vscale type is potentially very profitable.

This makes an important, but unchecked, assumption. Specifically, the scalable type is assumed to only be legal per the cost model if there's actually a scalable register class which is distinct from the scalar domain. This is, to my knowledge, true for all targets which return non-invalid costs for scalable vector ops today, but in theory, we could have a target decide to lower scalable to fixed length vector or even scalar registers. If that ever happens, we'd need to revisit this code.

In practice, this patch unblocks scalable vectorization for ELEN types on RISCV.

Let me sketch one alternate implementation I considered. We could have restricted this to when we know a minimum value for vscale. Specifically, for the default +v extension for RISCV, we actually know that vscale >= 2 for ELEN types. However, doing it this way means we can't generate scalable vectors when using the various embedded vector extensions which have a minimum vscale of 1.

Differential Revision: https://reviews.llvm.org/D128542
2022-06-27 13:38:57 -07:00
Kazu Hirata a7938c74f1 [llvm] Don't use Optional::hasValue (NFC)
This patch replaces Optional::hasValue with the implicit cast to bool
in conditionals only.
2022-06-25 21:42:52 -07:00
Kazu Hirata 3b7c3a654c Revert "Don't use Optional::hasValue (NFC)"
This reverts commit aa8feeefd3.
2022-06-25 11:56:50 -07:00
Kazu Hirata aa8feeefd3 Don't use Optional::hasValue (NFC) 2022-06-25 11:55:57 -07:00
Alexey Bataev 2faacf61a5 [SLP]Improve shuffles cost estimation where possible.
Improved/fixed cost modeling for shuffles by providing masks, improved
cost model for non-identity insertelements.

Differential Revision: https://reviews.llvm.org/D115462
2022-06-24 09:28:01 -07:00
Florian Hahn cb69ba4faa
[LV] Create RT checks once VF/IC are selected, track scalar cost.
This patch updates LV to generate runtime after the VF & IC are selected. It
allows deciding whether to vectorize with runtime checks or not based on
their cost compared to the vector loop.

It also updates VectorizationFactor to include the scalar cost.

Reviewed By: lebedev.ri, dmgreen

Differential Revision: https://reviews.llvm.org/D75981
2022-06-24 17:42:11 +02:00
Florian Hahn b18141a8f2
[VPlan] Set VFs included in plan before last set of VPTransforms (NFC).
This allows VPlanTransforms to query the VFs included in the plan in the
future.
2022-06-24 10:16:56 +02:00
Philip Reames 46ea4b5ea1 [LV] Avoid a crash when costing a uniform store which doesn't correspond to a legal scatter
If we have an unaligned uniform store, then when costing a scalable VF we can't emit code to scalarize it.  (Well, we could, but we haven't implemented that case.)  This change replaces an assert with a cost-model bailout such that we reject vectorization with the scalable VF instead of crashing.
2022-06-23 12:41:09 -07:00
Alexey Bataev 3b6edef15d [SLP]Fix a crash when reorder masked gather nodes with reused scalars.
If the masked gather nodes must be reordered, we can just reorder
scalars, just like for gather nodes. But if the node contains reused
scalars, it must be handled same way as a regular vectorizable node,
since need to reorder reused mask, not the scalars directly.

Differential Revision: https://reviews.llvm.org/D128360
2022-06-23 11:32:30 -07:00
Florian Hahn 569d84fe99
[VPlan] Remove dead recipes across whole plan.
This extends removeDeadRecipe to remove recipes across the whole plan.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D127580
2022-06-23 13:36:02 +02:00
Fangrui Song 1ffd2d99c2 Revert D115462 "[SLP]Improve shuffles cost estimation where possible."
This reverts commit cac60940b7.

Caused -Os -fsanitize=memory -march=haswell miscompile to pytorch/cpuinfo.
See my latest comment (may update) on D115462.
2022-06-22 23:16:31 -07:00
Fangrui Song a411bc11d6 Revert "[SLP]Fix a crash when insert subvector is out of range."
This reverts commit f1ee2738b3.

Revert due to the revert of a dependent commit `[SLP]Improve shuffles cost estimation where possible.`
2022-06-22 23:16:25 -07:00
Guillaume Chatelet 57ffff6db0 Revert "[NFC] Remove dead code"
This reverts commit 8ba2cbff70.
2022-06-22 14:55:47 +00:00
Guillaume Chatelet 8ba2cbff70 [NFC] Remove dead code 2022-06-22 13:33:58 +00:00
Serguei Katkov 8f891b7c39 [LoopVectorize] Uninitialized phi node leads to a crash in SSAUpdater.
createInductionResumeValues creates a phi node placeholder
without filling incoming values. Then it generates the incoming values.

It includes triggering of SCEV expander which may invoke SSAUpdater.
SSAUpdater has an optimization to detect number of predecessors
basing on incoming values if there is phi node.
In case phi node is not filled with incoming values - the number of predecessors
is detected as 0 and this leads to segmentation fault.

In other words SSAUpdater expects that phi is in good shape while
LoopVectorizer breaks this requirement.

The fix is just prepare all incoming values first and then build a phi node.

Reviewed By: fhahn
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D128033
2022-06-22 10:49:27 +07:00
Vasileios Porpodas 7a9ad25769 Recommit "[SLP][X86] Improve reordering to consider alternate instruction bundles"
This reverts commit 6d6268dcbf.

Review: https://reviews.llvm.org/D125712
2022-06-21 18:35:29 -07:00
Vasileios Porpodas 6d6268dcbf Revert "[SLP][X86] Improve reordering to consider alternate instruction bundles"
This reverts commit 6f88acf410.
2022-06-21 17:07:21 -07:00
Vasileios Porpodas 6f88acf410 [SLP][X86] Improve reordering to consider alternate instruction bundles
During the reordering transformation we should try to avoid reordering bundles
like fadd,fsub because this may block them being matched into a single vector
instruction in x86.
We do this by checking if a TreeEntry is such a pattern and adding it to the
list of TreeEntries with orders that need to be considered.

Differential Revision: https://reviews.llvm.org/D125712
2022-06-21 16:44:48 -07:00
Florian Hahn 88ce403c6a
[LV] Add new block to place recurrence splice, if needed.
In some cases, a recurrence splice instructions needs to be inserted
between to regions, for example if the regions get re-arranged during
sinking.

Fixes #56146.
2022-06-21 21:54:37 +02:00
Alexey Bataev d4ee43153d [SLP][NFC]Fix a warning in a comparison, NFC.
Fixed signedness warning.
2022-06-21 10:19:47 -07:00
Alexey Bataev f1ee2738b3 [SLP]Fix a crash when insert subvector is out of range.
If the OffsetBeg + InsertVecSz is greater than VecSz, need to estimate
the cost as shuffle of 2 vector, not as insert of subvector. Otherwise,
the inserted subvector is out of range and compiler may crash.

Differential Revision: https://reviews.llvm.org/D128071
2022-06-21 07:16:35 -07:00
Kazu Hirata 7a47ee51a1 [llvm] Don't use Optional::getValue (NFC) 2022-06-20 22:45:45 -07:00
Kazu Hirata e0e687a615 [llvm] Don't use Optional::hasValue (NFC) 2022-06-20 10:38:12 -07:00
Kazu Hirata 129b531c9c [llvm] Use value_or instead of getValueOr (NFC) 2022-06-18 23:07:11 -07:00
Kazu Hirata c399b3a608 [Vectorize] Use llvm::is_contained (NFC) 2022-06-18 15:49:15 -07:00
Malhar Jajoo 6bb40552f2 [LoopVectorize] Add support for invariant stores of ordered reductions
Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D126772
2022-06-17 14:56:21 +01:00
Alexey Bataev 76782a65ee [SLP]Use original vector if need to shuffle truncated root.
If the root scalar is mapped to to the smallest bit width, the vector is
truncated and the types between original buildvector and extracted value
mismatched. For extract, we emit sext/zext instructions, for shuffles we
can reuse oringal vector instead of the truncated one.

Differential Revision: https://reviews.llvm.org/D127974
2022-06-16 10:41:18 -07:00
Florian Hahn 949c13649c
[LV] Remove widenPHIInstruction dependence on underlying instr (NFC).
Instead of using the underlying instruction and VF to get the type, use
the type of the incoming value. This removes an unnecessary dependence
on the underlying instruction and enables using the recipe without an
underlying instruction.
2022-06-16 16:03:01 +02:00
Alexey Bataev 7236d49fd5 [SLP]Extend vectorization for scatter vectorize nodes.
Currently scatter vectorize nodes can be emitted only for GEPs with
constant indices. But we can also emit such nodes for GEPs with the same
ptr and non-constant vectorizable/gathered indices, if profitable. Patch
adds support for such nodes and tries to improve handling of GEPs with
non-const indeces for such nodes.

Metric: SLP.NumVectorInstructions

Program                                                                                       SLP.NumVectorInstructions
                                                                                              results                   results0 diff
                    test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  5243.00                   5240.00  -0.1%
                     test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  5243.00                   5240.00  -0.1%
                     test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 27550.00                  27507.00  -0.2%
                               test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test  5395.00                   5380.00  -0.3%
                       test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  5389.00                   5374.00  -0.3%
                    test-suite :: External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp_r.test   961.00                    958.00  -0.3%
                   test-suite :: External/SPEC/CINT2017speed/620.omnetpp_s/620.omnetpp_s.test   961.00                    958.00  -0.3%
                               test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test  5664.00                   5643.00  -0.4%
                       test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 13202.00                  13127.00  -0.6%
                                test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test   212.00                    207.00  -2.4%
                                test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test   890.00                    850.00  -4.5%
                            test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test  1695.00                   1581.00  -6.7%
                                 test-suite :: MultiSource/Applications/JM/lencod/lencod.test  2338.00                   2140.00  -8.5%
                                  test-suite :: SingleSource/UnitTests/matrix-types-spec.test    63.00                     55.00 -12.7%
                             test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test   468.00                    356.00 -23.9%
                                                                           Geomean difference                                     -0.3%

All numbers show increased number of generated vector instructions.

Diff:
SingleSource/Benchmarks/Adobe-C++/loop_unroll - better without LTO, but
need an extra analysis with LTO (with LTO compiler generates
masked_gather, while before regular loads were emitted because of extra
data, availbale at LTO time).
SingleSource/UnitTests/matrix-types-spec - more vector code.
MultiSource/Applications/JM/lencod/lencod - same.
External/SPEC/CINT2006/464.h264ref/464.h264ref - same.
MultiSource/Benchmarks/7zip/7zip-benchmark - same.
External/SPEC/CINT2006/445.gobmk/445.gobmk - no changes.
External/SPEC/CFP2017rate/510.parest_r/510.parest_r - more vector code.
External/SPEC/CFP2006/447.dealII/447.dealII - same
External/SPEC/CINT2017speed/620.omnetpp_s/620.omnetpp_s - same
External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp - same
External/SPEC/CFP2017rate/511.povray_r/511.povray - same
External/SPEC/CFP2006/453.povray/453.povray - same
External/SPEC/CFP2017rate/526.blender_r/526.blender_r - same
External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r - same
External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s - same

Differential Revision: https://reviews.llvm.org/D127219
2022-06-16 06:05:48 -07:00
Florian Hahn 5ff5b460d9
[LV] Remove unneeded CustomBuilder arg from setDebugLocFromInst (NFC).
The only user that passed in a custom builder was passing in
VPTransformState::Builder, which is the same as ILV::Builder.
2022-06-15 18:48:02 +01:00
Alexey Bataev c60c13f7eb [SLP] Improve reordering in presence of constant only nodes.
We can skip the analysis of the constant nodes, their order should not
affect the ordering of the trees/subtrees.

Differential Revision: https://reviews.llvm.org/D127775
2022-06-15 06:17:34 -07:00
Florian Hahn 9129e7bb54
[LV] Replace OrigPHIsToFix in native with VPlan traversal. (NFC)
OrigPHIsToFix is only used in the native path. Collecting phis can be
replaced by iterating over the plan. This also removes another
unnecessary use of a late getVPValue.

This also reduces the coupling between ILV and the VPlan utilities.
2022-06-13 22:20:58 +01:00
Hubert Tong 5efb380c26 [NFC] Undo AIX build compiler workaround
Removes the workaround from https://reviews.llvm.org/D98509#2732628 for
an AIX build compiler issue.

The AIX build compiler product that caused the issue has since been
fixed. Also, the AIX build compiler has been changed to one based on
LLVM.
2022-06-13 17:00:33 -04:00
Florian Hahn 763f2bdba5
[VPlan] Remove dead OrigLoop argument from removeDeadRecipes (NFC).
The use of the argument has been remove a while ago. Remove the dead
argument.
2022-06-11 23:36:47 +01:00
Florian Hahn 85983ca42e
[VPlan] Replace remaining use of needsScalarIV.
All information is already available in VPlan. Note that there are some
test changes, because we now can correctly look through instructions
like truncates to analyze the actual users.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D123541
2022-06-09 12:05:37 +01:00
Florian Hahn cedfd7a2e5
Recommit "[VPlan] Remove uneeded needsVectorIV check."
This reverts commit 266ea446ab.

The reasons for the revert have been addressed by cleaning up condition
handling in VPlan and properly marking VPBranchOnMaskRecipe as using
scalars.

The test case for the revert from D123720 has been added in 3d663308a5.
2022-06-08 14:06:45 +01:00
Florian Hahn b0c9a71be0
[VPlan] Handle VPInst without underlying instr in VPInterleavedAccess.
This violation is hidden while `cast` is missing an isa assertion after
D123901.
2022-06-07 21:00:49 +01:00
David Sherwood 997ecb0036 [LoopVectorize] Add FastMathFlags to the select used for reductions with tail-folding
Based on reviewer comments on https://reviews.llvm.org/D126692 I've
added FastMathFlags to the select instruction used when tail-folding
with reductions. These flags can then be used by InstCombine to
decide upon the most optimal floating point identity value for
fadd/fsub. Doing so unlocks further optimisations, such as folding
selects into masked loads.

Differential Revision: https://reviews.llvm.org/D126778
2022-06-07 10:21:31 +01:00
Florian Hahn eaf48dd9b0
[VPlan] Replace BranchOnCount with BranchOnCond if TC <= UF * VF.
Try to simplify BranchOnCount to `BranchOnCond true` if TC <= UF * VF.

This is an alternative to D121899 which simplifies the VPlan directly
instead of doing so late in code-gen.

The potential benefit of doing this in VPlan is that this may help
cost-modeling in the future. The reason this is done in prepareToExecute
at the moment is that a single plan may be used for multiple VFs/UFs.

There are further simplifications that can be applied as follow ups:

1. Replace inductions with constants
2. Replace vector region with regular block.

Fixes #55354.

Depends on D126679.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D126680
2022-06-06 09:38:53 +01:00
Florian Hahn 416a5080d8
[VPlan] Update vector latch terminator edge to exit block after execution.
Instead of setting the successor to the exit using CFG.ExitBB, set it to
nullptr initially. The successor to the exit block is later set either
through createEmptyBasicBlock or after VPlan execution (because at the
moment, no block is created by VPlan for the exit block, the existing
one is reused).

This also enables BranchOnCond to be used as terminator for the exiting
block of the topmost vector region.

Depends on D126618.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D126679
2022-06-04 21:22:32 +01:00
Alexey Bataev cac60940b7 [SLP]Improve shuffles cost estimation where possible.
Improved/fixed cost modeling for shuffles by providing masks, improved
cost model for non-identity insertelements.

Differential Revision: https://reviews.llvm.org/D115462
2022-06-03 08:06:22 -07:00
Benjamin Kramer a8d2a381a2 [VPlan] Silence another unused variable warning in release builds 2022-06-03 14:07:56 +02:00
Benjamin Kramer 6b7c186390 [VPlan] Inline variable into assertion. NFC.
Avoids a warning in release builds
llvm/lib/Transforms/Vectorize/VPlanHCFGBuilder.cpp:311:14: warning: unused variable 'BrCond' [-Wunused-variable]
      Value *BrCond = Br->getCondition();
2022-06-03 13:59:48 +02:00
Florian Hahn a5bb4a3b4d
[VPlan] Replace CondBit with BranchOnCond VPInstruction.
This patch removes CondBit and Predicate from VPBasicBlock. To do so,
the patch introduces a new branch-on-cond VPInstruction opcode to model
a branch on a condition explicitly.

This addresses a long-standing TODO/FIXME that blocks shouldn't be users
of VPValues. Those extra users can cause issues for VPValue-based
analyses that don't expect blocks. Addressing this fixme should allow us
to re-introduce 266ea446ab.

The generic branch opcode can also be used in follow-up patches.

Depends on D123005.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D126618
2022-06-03 11:48:31 +01:00
Fangrui Song df0f30dc36 Revert "[SLP]Improve shuffles cost estimation where possible."
This reverts commit 9980c99718.

Caused assertion failures: https://reviews.llvm.org/D115462#3555350
2022-06-03 00:30:34 -07:00
Alexey Bataev 9980c99718 [SLP]Improve shuffles cost estimation where possible.
Improved/fixed cost modeling for shuffles by providing masks, improved
cost model for non-identity insertelements.

Differential Revision: https://reviews.llvm.org/D115462
2022-06-02 11:18:14 -07:00
Florian Hahn 4f1c86e3d5
[VPlan] Remove dead VPlan-native special case from BranchOnCount (NFC).
After 05776122b6 this special case doesn't exist any longer.
2022-06-02 12:07:54 +01:00
Alexey Bataev 73020b4540 Revert "[SLP]Improve shuffles cost estimation where possible."
This reverts commit fd5a6ce9dc to fix
a crash detected by a buildbot
https://lab.llvm.org/buildbot/#/builders/179/builds/3805/steps/11/logs/stdio.
2022-06-01 15:44:51 -07:00
Florian Hahn 08482830eb
[LV] Update var name to Exiting, in line with terminology (NFC)
Recently the terminology used has been changed from Exit->Exiting in
line with common LLVM loop terminology. Update a remaining use of the
old terminology.
2022-06-01 22:13:29 +01:00
Alexey Bataev fd5a6ce9dc [SLP]Improve shuffles cost estimation where possible.
Improved/fixed cost modeling for shuffles by providing masks, improved
cost model for non-identity insertelements.

Differential Revision: https://reviews.llvm.org/D115462
2022-06-01 11:01:37 -07:00
Alexey Bataev fe4949942d [SLP]Fix PR55796: insert point for extractelements from different basic blocks.
Extractelement instructions may come from different basic blocks, need
to take it into account when looking for a last instruction in the
bundle to prevent compiler crash.

Differential Revision: https://reviews.llvm.org/D126777
2022-06-01 09:44:53 -07:00
Florian Hahn 05776122b6
[VPlan] Use region for each loop in native path.
This patch updates the VPlan native path to use VPRegionBlocks for all
loops in a loop nest. Up to now, only the outermost loop used a region.

This is a step towards unifying both paths and keep things consistent
between them. It also prepares various code-gen parts for modeling the
pre-header in the inner loop vectorizer (D121624).

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D123005
2022-06-01 10:41:05 +01:00
Florian Hahn d157019482
[VPlan] Remove unused native utilities incompatible with nested regions.
The implementations of VPlanDominatorTree, VPlanLoopInfo and VPlanPredicator
are all incompatible with modeling loops in VPlans as region without
explicit back-edges.

Those pieces are not actively used and only exercised by a few gtest
unit tests. They are at the moment blocking progress towards unifying
the native and inner-loop vectorizer paths in D121624 and D123005.

I think we should not block forward progress on unused pieces of code,
so this patch removes the utilities for now. The plan is to re-introduce
them as needed in a way that is compatible with the unified VPlan scheme
used in both the inner loop vectorizer and the native path.

Reviewed By: sguggill

Differential Revision: https://reviews.llvm.org/D123017
2022-06-01 09:32:59 +01:00
Mel Chen b0fc765350 [NFC] Change LoopVectorizationCostModel::useOrderedReductions() to be a const function.
Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D126200
2022-05-31 05:39:13 -07:00
Florian Hahn 6abce17fc2
[VPlan] Use Exiting-block instead of Exit-block terminology (NFC).
In LLVM's common loop terminology, an exit block is a block outside a
loop with a predecessor inside the loop. An exiting block is a block
inside the loop which branches to an exit block outside the loop.

This patch updates a few places where VPlan was using ExitBlock for a
block exiting a region. Those instances have been updated to use
ExitingBlock.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D126173
2022-05-28 21:16:05 +01:00
Alexey Bataev 7b809c30b9 [SLP]Improve compile time, NFC.
Patch improves compile time. For function calls, which cannot be
vectorized, create a unique group for each such a call instead of
subgroup. It prevents them from being grouped by a subgroups and
attempts for their vectorization.

Also, looks through casts operand to try to check their
groups/subgroups.

Reduces number of vectorization attempts. No changes in the statistics
for SPEC2017/2006/llvm-test-suite.

Differential Revision: https://reviews.llvm.org/D126476
2022-05-26 08:40:59 -07:00
Alexey Bataev 120d52b0ef [SLP]Fix PR55653: emit undefs where required, not poison.
Need to handle a corner case correctly, if all elements are Undefs/Poisons,
need to emit actual values, not just poisons.

Differential Revision: https://reviews.llvm.org/D126298
2022-05-26 08:38:50 -07:00
Simon Pilgrim 14258d6fb5 [SLP] Move canVectorizeLoads implementation to simplify the diff in D105986. NFC. 2022-05-26 15:23:58 +01:00
Alexey Bataev 9139d484d4 [SLP]Fix crash on reordering of ScatterVectorize nodes.
ScatterVectorize nodes should be handled same way as gathers in
reorderBottomToTop function, since we can simple reorder the loads in
this node. Because of that need to include such nodes to the list of
gathered nodes to fix compiler crash.

Differential Revision: https://reviews.llvm.org/D126378
2022-05-26 06:25:58 -07:00
Florian Hahn 390c0ac28d
[LV] Fix indentation in tryToCreateWidenRecipe (NFC). 2022-05-26 08:53:34 +01:00
Alexey Bataev 3bf5c2c8ec [SLP]Do not try to generate ScatterVectorize if it will be scalarized.
SLP should build ScatterVectorize nodes only if they actually end up
with masked gather rather than with scalarization. In the second
scenario better to build a gather node.

Differential Revision: https://reviews.llvm.org/D126379
2022-05-25 14:25:07 -07:00
Alexey Bataev 10f41a2147 [SLP]Fix PR55688: Miscompile due to incorrect nuw/nsw handling.
Need to use all ReductionOps when propagating flags for the reduction
ops, otherwise transformation is not correct. Plus, need to drop nuw/nsw
flags.

Differential Revision: https://reviews.llvm.org/D126371
2022-05-25 13:59:06 -07:00
David Sherwood 87936c7b13 [LoopVectorize] Fix assertion failure in fixReduction when tail-folding
When compiling the attached new test in scalable-reductions-tf.ll we
were hitting this assertion in fixReduction:

  Assertion `isa<PHINode>(U) && "Reduction exit must feed Phi's or select"

The loop contains a reduction and an intermediate store of the reduction
value. When vectorising with tail-folding the contains of 'U' in the
assertion above happened to be a scatter_store. It turns out that we
were still creating a widen recipe for the invariant store, despite
knowing that we can actually sink it. The simplest fix is to change
buildVPlanWithVPRecipes so that we look for invariant stores before
attempting to widen it.

Differential Revision: https://reviews.llvm.org/D126295
2022-05-25 11:46:32 +01:00
Florian Hahn c6e45ea074
[VPlan] Exit earlier when trying to widen with scalar VFs.
This simplifies the code a bit, suggested in D124718.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D125029
2022-05-25 11:05:23 +01:00
Florian Hahn 1ba42dd04b
[VPlan] Use MapVector for LiveOuts for deterministic iteration.
During code-gen, we iterate over the LiveOuts and the differences in
iteration order can cause slightly different outputs.
2022-05-25 09:30:02 +01:00
Vasileios Porpodas 9df0568b07 [SLP] Fix crash caused by reorderBottomToTop().
The crash is caused by incorrect order set by reorderBottomToTop(), which
happens when it is reordering a TreeEntry which has a user that has already been
reordered earlier. Please see the detailed description in the lit test.

Differential Revision: https://reviews.llvm.org/D126099
2022-05-24 12:24:19 -07:00
Alexey Bataev f9c806ae5c [SLP][NFC]Make isFirstInsertElement a weak strict ordering comparator.
To be used correctly in a sort-like function, isFirstInsertElement
function must follow weak strict ordering rule, i.e.
isFirstInsertElement(IE1, IE1) should return false.
2022-05-24 06:02:42 -07:00
Alexey Bataev 319a722f6f [SLP][NFC]Improve compile time, NFC.
Builds UserIgnore list only once as a SmallDenseSet without rebuilding
it between the runs, iterate over gathers instead list of reduction ops,
do some checks in the buildTree_rec only if the corresponding containers
  are not empty.
2022-05-23 12:15:27 -07:00
Benjamin Kramer 2f2ca30d0a Fix an unused variable warning in no-asserts build mode 2022-05-23 19:53:40 +02:00
Jingu Kang bb82f74612 Revert "Revert "[AArch64] Set maximum VF with shouldMaximizeVectorBandwidth""
This reverts commit 42ebfa8269.

The commmit from https://reviews.llvm.org/D125918 has fixed the stage 2 build
failure.

Differential Revision: https://reviews.llvm.org/D118979
2022-05-23 16:15:45 +01:00
Alexey Bataev 2ac5ebedea [SLP]Do not emit extract elements for insertelements users, replace with shuffles directly.
SLP vectorizer emits extracts for externally used vectorized scalars and
estimates the cost for each such extract. But in many cases these
scalars are input for insertelement instructions, forming buildvector,
and instead of extractelement/insertelement pair we can emit/cost
estimate shuffle(s) cost and generate series of shuffles, which can be
further optimized.

Tested using test-suite (+SPEC2017), the tests passed, SLP was able to
generate/vectorize more instructions in many cases and it allowed to reduce
number of re-vectorization attempts (where we could try to vectorize
buildector insertelements again and again).

Differential Revision: https://reviews.llvm.org/D107966
2022-05-23 07:06:45 -07:00
Peter Waller ade47bdc31 [LV] Improve register pressure estimate at high VFs
Previously, `getRegUsageForType` was implemented using
`getTypeLegalizationCost`.  `getRegUsageForType` is used by the loop
vectorizer to estimate the register pressure caused by using a vector
type.  However, `getTypeLegalizationCost` currently only appears to
understand splitting and not scalarization, so significantly
underestimates the register requirements.

Instead, use `getNumRegisters`, which understands when scalarization
can occur (via computeRegisterProperties).

This was discovered while investigating D118979 (Set maximum VF with
shouldMaximizeVectorBandwidth), where under fixed-length 512-bit SVE the
loop vectorizer previously ends up costing an v128i1 as 2 v64i*
registers where it actually occupies 128 i32 registers.

I'm sending this patch early for comment, I'm still doing some sanity checking
with LNT.  I note that getRegisterClassForType appears to return VectorRC even
though the type in question (large vNi1 types) end up occupying scalar
registers. That might be worth fixing too.

Differential Revision: https://reviews.llvm.org/D125918
2022-05-23 07:57:45 +00:00
Florian Hahn 145fe57106
[LV] Use exiting block instead of latch in addUsersInExitBlock.
The latch may not be the exiting block. Use the exiting block instead
when looking up the incoming value of the LCSSA phi node. This fixes a
crash with early-exit loops.
2022-05-22 18:27:41 +01:00
Florian Hahn 97590baead
[LV] Widen ptr-inductions with scalar uses for scalable VFs.
Current codegen only supports scalarization of pointer inductions for
scalable VFs if they are uniform. After 3bebec659 we now may enter the
scalarization code path in VPWidenPointerInductionRecipe::execute for
scalable vectors.

Fall back to widening for scalable vectors if necessary.

This should fix a build failure when bootstrapping LLVM with SVE, e.g.
https://lab.llvm.org/buildbot/#/builders/176/builds/1723
2022-05-22 16:24:13 +01:00
Florian Hahn aeb19817d6
Revert "[SLP]Do not emit extract elements for insertelements users, replace with shuffles directly."
This reverts commit fc9c59c355.

The patch triggers an assertion when building SPEC on X86. Reduced
reproducer shared at D107966.

Also reverts follow-up commit 11a09af76d.
2022-05-21 21:00:01 +01:00
Florian Hahn 3bebec6592
[VPlan] Model first exit values using VPLiveOut.
This patch introduces a new VPLiveOut subclass of VPUser  to model
 exit values explicitly. The initial version handles exit values that
are neither part of induction or reduction chains nor first order
recurrence phis.

Fixes #51366, #54867, #55167, #55459

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D123537
2022-05-21 16:01:38 +01:00
Dmitri Gribenko 11a09af76d Fix an unused variable warning in no-asserts build mode 2022-05-20 17:11:58 +02:00
Alexey Bataev fc9c59c355 [SLP]Do not emit extract elements for insertelements users, replace with shuffles directly.
SLP vectorizer emits extracts for externally used vectorized scalars and
estimates the cost for each such extract. But in many cases these
scalars are input for insertelement instructions, forming buildvector,
and instead of extractelement/insertelement pair we can emit/cost
estimate shuffle(s) cost and generate series of shuffles, which can be
further optimized.

Tested using test-suite (+SPEC2017), the tests passed, SLP was able to
generate/vectorize more instructions in many cases and it allowed to reduce
number of re-vectorization attempts (where we could try to vectorize
buildector insertelements again and again).

Differential Revision: https://reviews.llvm.org/D107966
2022-05-20 05:58:09 -07:00
Alexey Bataev 4e271fc495 [SLP][NFC]Use SmallPtrSet to avoid n*m complexity, NFC. 2022-05-20 05:56:43 -07:00
Florian Hahn cd61d4bd2f
[LV] Do not LoopSimplify/LCSSA after generating main vector loop.
At the moment LV runs LoopSimplify and reconstructs LCSSA form after
generating the main vector loop and before generating the epilogue
vector loop.

In practice, this adds a new exit block for the scalar loop because the
middle block now also branches to the original exit block of the scalar
loop. It also requires adding a new LCSSA phi in the newly created exit
block.

This complicates things when modeling exit values in VPlan, because we
would need to update the VPlan for the epilogue loop to update the newly
created LCSSA phi node.

But none of that should be necessary, as all analysis requiring
loop-simplify form is already done at this point and LCSSA form of the
original loop is not broken.

Reviewed By: bmahjour

Differential Revision: https://reviews.llvm.org/D125810
2022-05-20 09:58:40 +01:00
Florian Hahn c90235f0ef
[LV] Drop wrap flags for reductions using VP def-use chain.
Update clearReductionWrapFlags to use the VPlan def-use chain from the
reduction phi recipe to drop reduction wrap flags.

This addresses an existing FIXME and fixes a crash when instructions in
the reduction chain are not used and have been removed before VPlan
codegeneration.

Fixes #55540.
2022-05-19 20:36:46 +01:00
Tiehu Zhang 3ed9f603fd [LoopVectorize] Don't interleave when the number of runtime checks exceeds the threshold
The runtime check threshold should also restrict interleave count.
Otherwise, too many runtime checks will be generated for some cases.

Reviewed By: fhahn, dmgreen

Differential Revision: https://reviews.llvm.org/D122126
2022-05-19 23:29:00 +08:00
Florian Hahn df56fb44f5
[VPlan] Update VPWidenMemoryInstruction to not inherit from VPValue.
VPWidenMemoryInstruction also models stores which may not produce a value.
This can trip over analyses. Improve the modeling by only adding
VPValues for VPWidenMemoryInstructionRecipes modeling loads.
2022-05-19 16:24:58 +01:00
Jay Foad 6bec3e9303 [APInt] Remove all uses of zextOrSelf, sextOrSelf and truncOrSelf
Most clients only used these methods because they wanted to be able to
extend or truncate to the same bit width (which is a no-op). Now that
the standard zext, sext and trunc allow this, there is no reason to use
the OrSelf versions.

The OrSelf versions additionally have the strange behaviour of allowing
extending to a *smaller* width, or truncating to a *larger* width, which
are also treated as no-ops. A small amount of client code relied on this
(ConstantRange::castOp and MicrosoftCXXNameMangler::mangleNumber) and
needed rewriting.

Differential Revision: https://reviews.llvm.org/D125557
2022-05-19 11:23:13 +01:00
lizhijin 90ea81fcb2 [LV] Widen freeze instead of scalarizing it
This patch changes the strategy for vectorizing freeze instrucion, from
replicating multiple times to widening according to selected VF.

Fixes #54992

Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D125016
2022-05-19 12:28:01 +08:00
Alexey Bataev 7d8060bc19 [SLP]Improve reductions vectorization.
The pattern matching and vectgorization for reductions was not very
effective. Some of of the possible reduction values were marked as
external arguments, SLP could not find some reduction patterns because
of too early attempt to vectorize pair of binops arguments, the cost of
consts reductions was not correct. Patch addresses these issues and
improves the analysis/cost estimation and vectorization of the
reductions.

The most significant changes in SLP.NumVectorInstructions:

Metric: SLP.NumVectorInstructions                                                                                                                                                                                                 [140/14396]

Program                                                                                        results  results0 diff
               test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test   920.00  3548.00 285.7%
                test-suite :: SingleSource/Benchmarks/BenchmarkGame/n-body.test    66.00   122.00  84.8%
      test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test   100.00   128.00  28.0%
 test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test   664.00   810.00  22.0%
                 test-suite :: MultiSource/Benchmarks/mafft/pairlocalalign.test   592.00   687.00  16.0%
  test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test   402.00   426.00   6.0%
                   test-suite :: MultiSource/Applications/JM/lencod/lencod.test  1665.00  1745.00   4.8%
  test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test   135.00   139.00   3.0%
 test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test   135.00   139.00   3.0%
                  test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test   388.00   397.00   2.3%
                   test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test   895.00   914.00   2.1%
    test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test   240.00   244.00   1.7%
           test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test   240.00   244.00   1.7%
             test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test   820.00   832.00   1.5%
              test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test   820.00   832.00   1.5%
       test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 14804.00 14914.00   0.7%
                        test-suite :: MultiSource/Benchmarks/Bullet/bullet.test  8125.00  8183.00   0.7%
           test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test  1330.00  1338.00   0.6%
            test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test  1330.00  1338.00   0.6%
         test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  9832.00  9880.00   0.5%
         test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  5267.00  5291.00   0.5%
       test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  4018.00  4024.00   0.1%
      test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  4018.00  4024.00   0.1%
              test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test   426.00   424.00  -0.5%
               test-suite :: External/SPEC/CFP2017rate/544.nab_r/544.nab_r.test   426.00   424.00  -0.5%
          test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test   201.00   192.00  -4.5%
         test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test   201.00   192.00  -4.5%

644.nab_s and 544.nab_r - reduced number of shuffles but increased number
of useful vectorized instructions.

641.leela_s and 541.leela_r - the function
`@_ZN9FastBoard25get_pattern3_augment_specEiib` is not inlined anymore
but its body gets vectorized successfully. Before, the function was
inlined twice and vectorized just after inlining, currently it is not
required. The vector code looks pretty similar, just like as it was before.

Differential Revision: https://reviews.llvm.org/D111574
2022-05-18 13:22:18 -07:00
Florian Hahn fcfb86483b
[LV] set Header earlier, use variable instead of repeated access (NFC). 2022-05-18 09:29:59 +01:00
Florian Hahn 5b00d13c00
[LV] Fetch vector loop region once and remember it (NFC).
This avoids an unnecessary lookup and makes the code slightly more
compact.
2022-05-17 15:57:23 +01:00
Alexey Bataev b0f0313feb [SLP]Add an extra check for select minmax reduction to avoid crash.
Need to check if the reduction is still (not)cmp-select pattern min/max
reduction to avoid compiler crash during building list of reduction
operations. cmp-sel pattern provides 2 reduction operations, while
intrinsics - just one.
2022-05-17 06:05:52 -07:00
Florian Hahn c1a9d14982
[VPlan] Move usesScalars/onlyFirstLaneUsed to VPUser.
Those helpers model properties of a user and they should also be
available to non-recipe users. This will be used in D123537 for a new
exit value user.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D124936
2022-05-17 11:20:06 +01:00
Alexey Bataev 152072801e [SLP]Check if the root of the buildvector has one use only.
The root of the buildvector can have only one use, otherwise it can be
treated only as a final element of the previous buildvector sequence.
2022-05-16 07:30:36 -07:00
Florian Hahn b7315ffc3c
[LAA,LV] Add initial support for pointer-diff memory checks.
This patch adds initial support for a pointer diff based runtime check
scheme for vectorization. This scheme requires fewer computations and
checks than the existing full overlap checking, if it is applicable.

The main idea is to only check if source and sink of a dependency are
far enough apart so the accesses won't overlap in the vector loop. To do
so, it is sufficient to compute the difference and compare it to the
`VF * UF * AccessSize`. It is sufficient to check
`(Sink - Src) <u VF * UF * AccessSize` to rule out a backwards
dependence in the vector loop with the given VF and UF. If Src >=u Sink,
there is not dependence preventing vectorization, hence the overflow
should not matter and using the ULT should be sufficient.

Note that the initial version is restricted in multiple ways:

1. Pointers must only either be read or written, by a single
   instruction (this allows re-constructing source/sink for
   dependences with the available information)
 2. Source and sink pointers must be add-recs, with matching steps
 3. The step must be a constant.
 3. abs(step) == AccessSize.

Most of those restrictions can be relaxed in the future.

See https://github.com/llvm/llvm-project/issues/53590.

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D119078
2022-05-16 15:27:22 +01:00
David Sherwood befc952045 [LoopVectorize] Permit tail-folding for low trip counts using scalable vectors
When the loop vectoriser encounters a known low trip count it tries
to create a single predicated loop in order to get the benefit of
vectorisation and eliminate the scalar tail. However, until now the
vectoriser prevented the use of scalable vectors in this case due
to concerns in the past about stability. I believe that tail-folded
loops using scalable vectors are now sufficiently well tested that
we can enable this. For the same reason I've also enabled it when
optimising for code size too.

Tests added here:

  Transforms/LoopVectorize/AArch64/sve-low-trip-count.ll
  Transforms/LoopVectorize/AArch64/sve-tail-folding-optsize.ll
  Transforms/LoopVectorize/RISCV/low-trip-count.ll

Differential Revision: https://reviews.llvm.org/D121595
2022-05-16 09:14:24 +01:00
Florian Hahn 8b7c3d2179
[LV] Set SCEVCheckCond to nullptr whenever it was used.
Under some circumstances, SCEVExpander will insert new instructions when
expanding a predicate, but the final result of the expansion can be a
false constant.

In those cases, the expanded instructions may later be used by other
expansions, e.g. the trip count. This may trigger an assertion during
SCEVExpander cleanup. To avoid this, always mark the result as used.

Fixes #55100.
2022-05-15 21:52:07 +01:00
Craig Topper b3097eb6cd [SLP] Fix misspelling of 'analyzed'. NFC 2022-05-15 10:30:24 -07:00
Florian Hahn 39552964e1
[VPlan] Improve printing of VPReplicateRecipe with calls.
Suggested as part of D124718.
2022-05-15 15:51:26 +01:00
Alexey Bataev 8b8281f354 [SLP]Do not vectorize non-profitable alternate nodes.
If alternate node has only 2 instructions and the tree is already big
enough, better to skip the vectorization of such nodes, they are not
very profitable (the resulting code cotains 3 instructions instead of
original 2 scalars). SLP can try to vectorize the buildvector sequence
in the next attempt, if it is profitable.

Metric: SLP.NumVectorInstructions

Program                                                                                       SLP.NumVectorInstructions
                                                                               results                   results0 diff
     test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test    72.00                     73.00   1.4%
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test  1186.00                   1198.00   1.0%
     test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test   241.00                    242.00   0.4%
                  test-suite :: MultiSource/Applications/JM/lencod/lencod.test  2131.00                   2139.00   0.4%
 test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test  6377.00                   6384.00   0.1%
test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test  6377.00                   6384.00   0.1%
        test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 12650.00                  12658.00   0.1%
      test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 26169.00                  26147.00  -0.1%
          test-suite :: MultiSource/Benchmarks/Trimaran/enc-3des/enc-3des.test    99.00                     86.00 -13.1%

Gains:
526.blender_r - more vectorized trees.
enc-3des - same.

Others:
510.parest_r - no changes.
miniFE - same
623.xalancbmk_s - some (non-profitable) parts of the trees are not
    vectorized.
523.xalancbmk_r - same
lencod - same
timberwolfmc - same
miniAMR - same

Differential Revision: https://reviews.llvm.org/D125571
2022-05-13 14:28:54 -07:00
Alexey Bataev 85f6b15ee5 [SLP]Do not look for buildvector sequence, if the index is reused.
If the insert indes was used already or is not constant, we should stop
looking for unique buildvector sequence, it mustbe splitted to
2 different buildvectors.
2022-05-13 13:56:02 -07:00
David Sherwood 92c645b5c1 [LoopVectorize] Add overflow checks when tail-folding with scalable vectors
In InnerLoopVectorizer::getOrCreateVectorTripCount there is an
assert that the known minimum value for the VF is a power of 2
when tail-folding is enabled. However, for scalable vectors the
value of vscale may not be a power of 2, which means we have
to worry about the possibility of overflow. I have solved this
problem by adding preheader checks that prevent us from entering
the vector body if the canonical IV would overflow, i.e.

  if ((IntMax - TripCount) < (VF * UF)) ... skip vector loop ...

Differential Revision: https://reviews.llvm.org/D125235
2022-05-13 14:09:43 +01:00
Nikita Popov ed1cb01baf [IRBuilder] Add IsInBounds parameter to CreateGEP()
We commonly want to create either an inbounds or non-inbounds GEP
based on a boolean value, e.g. when preserving inbounds from
existing GEPs. Directly accept such a boolean in the API, rather
than requiring a ternary between CreateGEP and CreateInBoundsGEP.

This change is not entirely NFC, because we now preserve an
inbounds flag in a constant expression edge-case in InstCombine.
2022-05-13 14:30:55 +02:00
Vasileios Porpodas 0950d4060c Recommit "[SLP] Make reordering aware of external vectorizable scalar stores."
This reverts commit c2a7904aba.

Original code review: https://reviews.llvm.org/D125111
2022-05-11 16:47:29 -07:00
Arthur Eubanks c2a7904aba Revert "[SLP] Make reordering aware of external vectorizable scalar stores."
This reverts commit 71bcead98b.

Causes crashes, see comments in D125111.
2022-05-11 15:28:00 -07:00
Alexey Bataev f5d45d70a5 [SLP]Further improvement of the cost model for scalars used in buildvectors.
Further improvement of the cost model for the scalars used in
buildvectors sequences. The main functionality is outlined into
a separate function.
The cost is calculated in the following way:
1. If the Base vector is not undef vector, resizing the very first mask to
have common VF and perform action for 2 input vectors (including non-undef
Base). Other shuffle masks are combined with the resulting after the 1 stage and processed as a shuffle of 2 elements.
2. If the Base is undef vector and have only 1 shuffle mask, perform the
action only for 1 vector with the given mask, if it is not the identity
mask.
3. If > 2 masks are used, perform serie of shuffle actions for 2 vectors,
combing the masks properly between the steps.

The original implementation misses the very first analysis for the Base
vector, so the cost might too optimistic in some cases. But it improves
the cost for the insertelements which are part of the current SLP graph.

Part of D107966.

Differential Revision: https://reviews.llvm.org/D115750
2022-05-11 06:08:55 -07:00
Florian Hahn 635b752211
[VPlan] VPInterleaveRecipe only uses first lane if op not stored.
With opaque pointers, both the stored value and the address can be the
same. Only consider the recipe using the first lane only *if* the
address is not stored.

Fixes #55375.
2022-05-11 11:24:56 +01:00
Vasileios Porpodas 71bcead98b [SLP] Make reordering aware of external vectorizable scalar stores.
The current reordering scheme only checks the ordering of in-tree operands.
There are some cases, however, where we need to adjust the ordering based on
the ordering of a future SLP-tree who's instructions are not part of the
current tree, but are external users.

This patch is a simple implementation of this. We keep track of scalar stores
that are users of TreeEntries and if they look profitable to vectorize, then
we keep track of their ordering. During the reordering step we take this new
index order into account. This can remove some shuffles in cases like in the
lit test.

Differential Revision: https://reviews.llvm.org/D125111
2022-05-10 15:25:35 -07:00
Alexey Bataev 4212ef8a0e Revert "[SLP]Further improvement of the cost model for scalars used in buildvectors."
This reverts commit 99f31acfce and several
others to fix detected crashes, reported in https://reviews.llvm.org/D115750
2022-05-09 13:46:06 -07:00
Alexey Bataev cce80bd8b7 [SLP]Adjust assertion check for scalars in several insertelements.
If the same scalar is inserted several times into the same buildvector,
the mask index can be used already. In this case need to check, that
this scalar is already part of the vectorized buildvector.
2022-05-09 13:07:59 -07:00
Florian Hahn 266ea446ab
Revert "Recommit "[VPlan] Remove uneeded needsVectorIV check.""
This reverts commit 8b48223447.

This triggers an assertion on a test case mentioned in D123720.
Revert while I investigate.
2022-05-09 20:33:14 +01:00
Alexey Bataev 9dc4ced204 [SLP]Try partial store vectorization if supported by target.
We can try to vectorize number of stores less than MinVecRegSize
/ scalar_value_size, if it is allowed by target. Gives an extra
opportunity for the vectorization.

Fixes PR54985.

Differential Revision: https://reviews.llvm.org/D124284
2022-05-09 09:48:15 -07:00
Alexey Bataev 9c3a75eabf [SLP]Fix a crash when preparing a mask for external scalars.
Need to use actual index instead of the tree entry position, since the
insert index may be different than 0. It mean, that we vectorized part
of the buildvector starting from not initial insertelement instruction
beause of some reason.
2022-05-09 07:59:34 -07:00
David Green 6f9e1ea0ef [VectorCombine] Attempt to fold select shuffles from reductions
Given a commutative reduction leading from a shuffle, the order of the
lanes on the shuffle are not important for the result. This means we can
reorder the shuffle to something simpler, which we try shuffling the
first vector lanes first. This was D123494.

The new shuffle may not be profitable though, and if it is not we can
try the folding of select shuffles from D123911. This, with some
adjustment as the output lane ordering is now unimportant, can allow the
final shuffle to simplify given the inputs to the patterns from D123911.
Where as each transformation on their own are not profitable, the
combination is.

We can only support a single shuffle when called from reductions, but we
are able to sort the ReconstructMask, potentially allowing it to
simplify to an identity or concat mask.

Differential Revision: https://reviews.llvm.org/D125086
2022-05-08 10:32:41 +01:00
David Green 802e15c576 [SLP] Cluster ordering for loads
Given a load without a better order, this patch partially sorts the
elements to form clusters of adjacent elements in memory. These clusters
can potentially be loaded in fewer loads, meaning less overall shuffling
(for example loading v4i8 clusters of a v16i8 as a single f32 loads, as
opposed to multiple independent bytes loads and inserts).

Differential Revision: https://reviews.llvm.org/D122145
2022-05-07 14:38:11 +01:00
David Green 100cb9a2ba [VectorCombine] Fold shuffle select pattern
This patch adds a combine to attempt to reduce the costs of certain
select-shuffle patterns. The form of code it attempts to detect is:
  %x = shuffle ...
  %y = shuffle ...
  %a = binop %x, %y
  %b = binop %x, %y
  shuffle %a, %b, selectmask

A classic select-mask will pick items from each lane of a or b. These
do not always have a great lowering on many architectures. This patch
attempts to pack a and b into the lower elements, creating a differently
ordered shuffle for reconstructing the orignal which may be better than
the select mask. This can be better for performance, especially if less
elements of a and b need to be computed and the input shuffles are
cheaper.

Because select-masks are just one form of shuffle, we generalize to any
mask. So long as the backend has decent costmodel for the shuffles, this
can generally improve things when they come up. For more basic cost
models the folds do not appear to be profitable, not getting past the
cost checks.

Differential Revision: https://reviews.llvm.org/D123911
2022-05-06 08:13:18 +01:00
Florian Hahn f9f7aa30f8
[VPlan] Remove dead code to create VPWidenPHIRecipes (NFCI).
After introducing VPWidenPointerInductionRecipe, VPWidenPHIRecipes
should not be created at this point. Turn check into an assert.
2022-05-05 19:29:02 +01:00
Alexey Bataev 99f31acfce [SLP]Further improvement of the cost model for scalars used in buildvectors.
Further improvement of the cost model for the scalars used in
buildvectors sequences. The main functionality is outlined into
a separate function.
The cost is calculated in the following way:
1. If the Base vector is not undef vector, resizing the very first mask to
have common VF and perform action for 2 input vectors (including non-undef
Base). Other shuffle masks are combined with the resulting after the 1 stage and processed as a shuffle of 2 elements.
2. If the Base is undef vector and have only 1 shuffle mask, perform the
action only for 1 vector with the given mask, if it is not the identity
mask.
3. If > 2 masks are used, perform serie of shuffle actions for 2 vectors,
combing the masks properly between the steps.

The original implementation misses the very first analysis for the Base
vector, so the cost might too optimistic in some cases. But it improves
the cost for the insertelements which are part of the current SLP graph.

Part of D107966.

Differential Revision: https://reviews.llvm.org/D115750
2022-05-05 06:04:25 -07:00
Florian Hahn 8b48223447
Recommit "[VPlan] Remove uneeded needsVectorIV check."
This reverts commit f4e1eaa375.

The patch was originally reverted because it uncovered an issue that has
now been fixed in 0ef8ca6d88.
2022-05-04 10:53:42 +01:00
serge-sans-paille 7030654296 [iwyu] Handle regressions in libLLVM header include
Running iwyu-diff on LLVM codebase since fa5a4e1b95 detected a few
regressions, fixing them.

Differential Revision: https://reviews.llvm.org/D124847
2022-05-04 08:32:38 +02:00
Igor Kirillov 4e5e042d9a [LoopVectorize] Support reductions that store intermediary result
Adds ability to vectorize loops containing a store to a loop-invariant
address as part of a reduction that isn't converted to SSA form due to
lack of aliasing info. Runtime checks are generated to ensure the store
does not alias any other accesses in the loop.

Ordered fadd reductions are not yet supported.

Differential Revision: https://reviews.llvm.org/D110235
2022-05-03 10:12:30 +01:00
David Green 6f81903e89 [LV][SLP] Mark fptosi_sat as vectorizable
This adds fptosi_sat and fptoui_sat to the list of trivially
vectorizable functions, mainly so that the loop vectorizer can vectorize
the instruction. Marking them as trivially vectorizable also allows them
to be SLP vectorized, and Scalarized.

The signature of a fptosi_sat requires two type overrides
(@llvm.fptosi.sat.v2i32.v2f32), unlike other intrinsics that often only
take a single. This patch alters hasVectorInstrinsicOverloadedScalarOpd
to isVectorIntrinsicWithOverloadTypeAtArg, so that it can mark the first
operand of the intrinsic as a overloaded (but not scalar) operand.

Differential Revision: https://reviews.llvm.org/D124358
2022-05-03 09:32:34 +01:00
Alexey Bataev e74a73782f [SLP][NFC]Minor code changes for better readability, NFC. 2022-05-02 12:58:25 -07:00
Alexey Bataev 7ea03f0b4e [SLP]Improve reductions analysis and emission, part 1.
Currently SLP vectorizer walks through the instructions and selects
3 main classes of values: 1) reduction operations - instructions with same
reduction opcode (add, mul, min/max, etc.), which build the reduction,
2) reduced values - instructions with the same opcodes, but different
from the reduction opcode, 3) extra arguments - all other values,
instructions from the different basic block rather than the root node,
instructions with to many/less uses.

This scheme is not very efficient. It excludes some instructions and all
non-instruction values from the reductions (constants, proficient
gathers), to many possibly reduced values are marked as extra arguments.
Patch improves this process by introducing a bit extended analysis
stage. During this stage, we still try to select 3 classes of the
values: 1) reduction operations - same as before, 2) possibly reduced
values - all instructions from the current block/non-instructions, which
may build a vectorization tree, 3) extra arguments - instructions from
the different basic blocks. Additionally, an extra sorting of the
possibly reduced values occurs to build the scalar sequences which
highly likely will bed vectorized, e.g. loads are grouped by the
distance between them, constants are grouped together, cmp instructions
are sorted by their compare types and predicates, extractelement
instructions are sorted by the vector operand, etc. Also, these groups
are reordered by their length so the longest group is the first in the
list of the possibly reduced values.

The vectorization process tries to emit the reductions for all these
groups. These reductions, remaining non-vectorized possible reduced
values and extra arguments are then combined into the final expression
just like it was before.

Differential Revision: https://reviews.llvm.org/D114171
2022-05-02 12:03:58 -07:00
Florian Hahn 0ef8ca6d88
[VPlan] Do not create VPWidenCall recipes for scalar vector factors.
'Widen' recipe are only used when actual vector values are generated.
Fix tryToWidenCall to do not create VPWidenCallRecipes for scalar vector
factors.

This was exposed by D123720, because the widened recipes are considered
vector users.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D124718
2022-05-02 19:40:33 +01:00
Simon Pilgrim 34f97a3709 [VectorCombine] Merge isa<>/cast<> into dyn_cast<>. NFC.
We want to handle the the assert in VectorCombine so avoid the repeated isa/cast code.
2022-05-01 20:09:10 +01:00
Simon Pilgrim 09761ce295 [SLPVectorizer] Remove weird unicode character from comment. NFCI.
Whatever it was, Visual Assist really didn't like it....
2022-05-01 16:37:21 +01:00
Alexey Bataev 484fcb9888 [SLP][NFC]Fix a comment. 2022-04-29 09:27:13 -07:00
Anna Thomas 205246cb64 [CompileTime] [Passes] Avoid computing unnecessary analyses. NFC
Similar to c515b2f39e, If there are no loops in the function as seen
through LI, we should avoid computing the remaining expensive analyses
(such as SCEV, BPI).  Reordered the analyses requests and early return
if there are no loops.

The logic of avoiding expensive analyses is applied to LoopVectorizer,
LoopLoadElimination and LoopUnrollPass, i.e. all function passes which operate
on loops.

This is an NFC with compile time improvement.

Differential Revision: https://reviews.llvm.org/D124529
2022-04-29 10:00:06 -04:00
Ricky Zhou 24a133e16f
[LV] Rename CountRoundDown to VectorTripCount (NFC)
The name CountRoundDown is potentially misleading, as the number of
iterations can be rounded up when folding the tail.

Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D119681
2022-04-29 13:50:00 +01:00
Florian Hahn e66127e69b
[VPlan] Simplify & adjust code as suggested in D123005.
Improve code as suggested in D123005. Applied separately, because the
comments where made a diff that has not been rebased to current main.
2022-04-29 13:34:54 +01:00
David Green 7047c47918 [VecCombine] Fix sort comparator logic in foldShuffleFromReductions
I think this sort comparator was overly complex, and the windows
expensive check bot agreed, failing as it was not giving a strict weak
ordering. Change it to use the comparison of the mask values as unsigned
integers. This should sort the undef elements to the end whilst keeping
X<Y otherwise.
2022-04-29 09:30:02 +01:00
Florian Hahn f4e1eaa375
Revert "[VPlan] Remove uneeded needsVectorIV check."
This reverts commit 43842b887e while I
investigate a buildbot failure.

It also reverts the follow-up commit
2883de0514.
2022-04-28 20:16:21 +01:00
David Green ded8187e35 [VectorCombine] Try to reduce shuffle cost for commutative reduction operands
Given a shuffle feeding a commutative reduction, the lane ordering of
the shuffle will not alter the result. This is also true if there are a
number of operations between the reduction and the shuffle, providing
they only operate lane-wise. This patch searches for cases like that in
Vector Combine, allowing us to check the cost of the shuffle vs an
in-order identity shuffle and replace the order if possible. This only
handles a single shuffle at the moment to keep things simple, and is
able to ignore splats that produce results where every result is the
same.

This is a more powerful version of a combine that already happens in
instrcombine, capable of optimizing more cases by looking through more
instructions and being able to cost the shuffle.

Differential Revision: https://reviews.llvm.org/D123494
2022-04-28 19:46:12 +01:00
Alexey Bataev 75e1cf4a6a [COST]Improve cost model for shuffles in SLP.
Introduced masks where they are not added and improved target dependent
cost models to avoid returning of the incorrect cost results after
adding masks.

Differential Revision: https://reviews.llvm.org/D100486
2022-04-28 10:04:41 -07:00
Florian Hahn 2883de0514
[VPlan] Fix comment formatting from 43842b887e. 2022-04-28 16:31:48 +01:00
Florian Hahn 43842b887e
[VPlan] Remove uneeded needsVectorIV check.
Remove one of the last remaining uses of ::needsVectorIV, preparing for
its removal. Now that usesScalars is available and based on the
information explicit in VPlan, there is no need to use the pre-computed
needsVectorIV.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D123720
2022-04-28 16:27:34 +01:00
Alexey Bataev 9861ca0c23 Revert "[COST]Improve cost model for shuffles in SLP."
This reverts commit 29a470e380 to fix
a crash reported in https://reviews.llvm.org/D100486#3479989.
2022-04-28 08:11:56 -07:00
Alexey Bataev 29a470e380 [COST]Improve cost model for shuffles in SLP.
Introduced masks where they are not added and improved target dependent
cost models to avoid returning of the incorrect cost results after
adding masks.

Differential Revision: https://reviews.llvm.org/D100486
2022-04-27 10:56:26 -07:00
Shilei Tian a6b355dd31 [SLP] Fix a typo that causes redundant assertion and potential segment fault
Reviewed By: ABataev

Differential Revision: https://reviews.llvm.org/D124497
2022-04-27 10:07:59 -04:00
Vasileios Porpodas fa8a9fea47 Recommit "[SLP][TTI] Refactoring of `getShuffleCost` `Args` to work like `getArithmeticInstrCost`"
This reverts commit 6a9bbd9f20.

Code review: https://reviews.llvm.org/D124202
2022-04-26 14:02:40 -07:00
Vasileios Porpodas 6a9bbd9f20 Revert "[SLP][TTI] Refactoring of `getShuffleCost` `Args` to work like `getArithmeticInstrCost`"
This reverts commit 55ce296d6f.
2022-04-26 11:25:26 -07:00
Vasileios Porpodas 55ce296d6f [SLP][TTI] Refactoring of `getShuffleCost` `Args` to work like `getArithmeticInstrCost`
Before this patch `Args` was used to pass a broadcat's arguments by SLP.
This patch changes this. `Args` is now used for passing the operands of
the shuffle.

Differential Revision: https://reviews.llvm.org/D124202
2022-04-26 11:11:29 -07:00
Valery N Dmitriev 88b9e46fb5 [SLP] Steer for the best chance in tryToVectorize() when rooting with binary ops.
tryToVectorize() method implements one of searching paths for vectorizable tree roots in SLP vectorizer,
specifically for binary and comparison operations. Order of making probes for various scalar pairs
was defined by its implementation: the instruction operands, then climb over one operand if
the instruction is its sole user and then perform same actions for another operand if previous
attempts failed. Problem with this approach is that among these options we can have more than a
single vectorizable tree candidate and it is not necessarily the one that encountered first.
Trying to build vectorizable tree for each possible combination for just evaluation is expensive.
But we already have lookahead heuristics mechanism which we use for finding best pick among
operands of commutative instructions. It calculates cumulative score for candidates in two
consecutive lanes. This patch introduces use of the heuristics for choosing the best pair among
several combinations. We only try one that looks as most promising for vectorization.
Additional benefit is that we reduce total number of vectorization trees built for probes
because we skip those looking non-profitable early.

Reviewed By: Alexey Bataev (ABataev), Vasileios Porpodas (vporpo)
Differential Revision: https://reviews.llvm.org/D124309
2022-04-25 12:25:33 -07:00
David Green 9727c77d58 [NFC] Rename Instrinsic to Intrinsic 2022-04-25 18:13:23 +01:00
Valery N Dmitriev edf7bed87b [SLP][NFC] Outline lookahead heuristics into a separate helper class.
Minor refactoring to reduce size of functional change D124309:
  look-ahead scoring routines pulled out of VLOperands and formed
  new LookAheadHeuristics helper class.

Reviewed By: Alexey Bataev (ABataev), Vasileios Porpodas (vporpo)
Differential Revision: https://reviews.llvm.org/D124313
2022-04-22 18:59:08 -07:00
Vasileios Porpodas 889588ee97 [SLP] Refactoring isLegalBroadcastLoad() to use `ElementCount`.
Replacing `unsigned` with `ElementCount` in the argument of `isLegalBroadcastLoad()`.
This helps reduce the diff of a future SLP patch for AArch64.
2022-04-21 10:19:00 -07:00
Florian Hahn bea69b232f
[VPlan] Initial modeling of middle block in VPlan.
This patch extends the scope of VPlan to also include the exit (aka
middle) block.

For now, the exit block remains empty, but handling of exit values will
subsequently be moved to VPlan, by adding recipes to model exit values
in the exit block.

As a first step, this will allow fixing #51366.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D123457
2022-04-20 19:34:41 +01:00
Vasileios Porpodas 8d4b5e0833 [NFC][SLP] Improved description of getShallowScore() and getScoreAtLevelRec()
Differential Revision: https://reviews.llvm.org/D124027
2022-04-19 12:15:36 -07:00
Florian Hahn 4026b718b8
[VPlan] Remove unused SCEV forward declaration (NFC). 2022-04-19 17:16:17 +02:00
Alexey Bataev 883571928c Revert "[SLP]Improve reductions analysis and emission, part 1."
This reverts commit 0e1f4d4d3c to fix
a crash reported in PR54976
2022-04-19 06:17:03 -07:00
Florian Hahn a65f2730d2
[VPlan] Expand induction step in VPlan pre-header.
This patch moves SCEV expansion of steps used by
VPWidenIntOrFpInductionRecipes to the pre-header using
VPExpandSCEVRecipe. This ensures that those steps are expanded while the
CFG is in a valid state. Previously, SCEV expansion may happen during
vector body code-generation, during which the CFG may be invalid,
causing issues with SCEV expansion.

Depends on D122095.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D122096
2022-04-19 13:06:39 +02:00
Vasileios Porpodas b1333f03d9 Recommit "[SLP] Support internal users of splat loads"
Code review: https://reviews.llvm.org/D121940

This reverts commit 359dbb0d3d.
2022-04-18 15:58:01 -07:00
Vasileios Porpodas 359dbb0d3d Revert "[SLP] Support internal users of splat loads"
This reverts commit f8e1337115.
2022-04-18 12:12:34 -07:00
Vasileios Porpodas f8e1337115 [SLP] Support internal users of splat loads
Until now we would only accept a broadcast load pattern if it is only used
by a single vector of instructions.

This patch relaxes this, and allows for the broadcast to have more than one
user vector, as long as all of its uses are internal to the SLP graph and
vectorized.

Differential Revision: https://reviews.llvm.org/D121940
2022-04-18 11:59:44 -07:00
Florian Hahn 73f5d7d0d6
[VPlan] Handle equal address and store ops in onlyFirstLaneDemanded.
With opaque pointers, the stored value and address can be the same.

Previously the code in VPWidenMemoryInstructionRecipe::onlyFirstLaneDemanded
incorrectly considers stores with matching store and pointer operands as
only demanding the first lane, causing a crash.
2022-04-15 22:53:33 +02:00
Johannes Doerfert 1fb415fee9 [AMDGPU][FIX] Proper load-store-vectorizer result with opaque pointers
The original code relied on the fact that we needed a bitcast
instruction (for non constant base objects). With opaque pointers there
might not be a bitcast. Always check if reordering is required instead.

Fixes: https://github.com/llvm/llvm-project/issues/54896

Differential Revision: https://reviews.llvm.org/D123694
2022-04-15 13:42:46 -05:00
Florian Hahn 2c14cdf831
[VPlan] Turn external defs in Value -> VPValue mapping.
This addresses an existing TODO by keeping a mapping of external IR
Value * definitions wrapped in VPValues for use in a VPlan.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D123700
2022-04-14 12:03:09 +02:00
serge-sans-paille fa5a4e1b95 [iwyu] Handle regressions in libLLVM header include
Running iwyu-diff on LLVM codebase since a96638e50e detected a few
regressions, fixing them.
2022-04-13 20:53:19 +02:00
Alexey Bataev 0e1f4d4d3c [SLP]Improve reductions analysis and emission, part 1.
Currently SLP vectorizer walks through the instructions and selects
3 main classes of values: 1) reduction operations - instructions with same
reduction opcode (add, mul, min/max, etc.), which build the reduction,
2) reduced values - instructions with the same opcodes, but different
from the reduction opcode, 3) extra arguments - all other values,
instructions from the different basic block rather than the root node,
instructions with to many/less uses.

This scheme is not very efficient. It excludes some instructions and all
non-instruction values from the reductions (constants, proficient
gathers), to many possibly reduced values are marked as extra arguments.
Patch improves this process by introducing a bit extended analysis
stage. During this stage, we still try to select 3 classes of the
values: 1) reduction operations - same as before, 2) possibly reduced
values - all instructions from the current block/non-instructions, which
may build a vectorization tree, 3) extra arguments - instructions from
the different basic blocks. Additionally, an extra sorting of the
possibly reduced values occurs to build the scalar sequences which
highly likely will bed vectorized, e.g. loads are grouped by the
distance between them, constants are grouped together, cmp instructions
are sorted by their compare types and predicates, extractelement
instructions are sorted by the vector operand, etc. Also, these groups
are reordered by their length so the longest group is the first in the
list of the possibly reduced values.

The vectorization process tries to emit the reductions for all these
groups. These reductions, remaining non-vectorized possible reduced
values and extra arguments are then combined into the final expression
just like it was before.

Differential Revision: https://reviews.llvm.org/D114171
2022-04-12 17:46:11 -07:00
Muhammad Omair Javaid 42ebfa8269 Revert "[AArch64] Set maximum VF with shouldMaximizeVectorBandwidth"
This reverts commit 64b6192e81.

This broke LLVM AArch64 buildbot clang-aarch64-sve-vls-2stage:

https://lab.llvm.org/buildbot/#/builders/176/builds/1515

llvm-tblgen crashes after applying this patch.
2022-04-13 04:53:07 +05:00
Florian Hahn 5f1eb74850
[VPlan] Place VPExpandSCEVRecipe in pre-header.
After D121624 models the pre-header in VPlan, VPExpandSCEVRecipes can be
placed there. This ensures SCEV expansion happens before modifying the
CFG during VPlan execution, when CFG is incomplete.

Depends on D121624.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D122095
2022-04-10 10:26:20 +02:00
Florian Hahn 256c6b0ba1
[VPlan] Model pre-header explicitly.
This patch extends the scope of VPlan to also model the pre-header.
The pre-header can be used to place recipes that should be code-gen'd
outside the loop, like SCEV expansion.

Depends on D121623.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D121624
2022-04-09 14:19:47 +02:00
Florian Hahn 467dbcd9f1
[LV] Set debug loc after setting insert point.
This fixes the code to actually use the location of the instruction, if
available. Previously, SetInsertPoint would overwrite the insert point
set from the instruction.
2022-04-08 20:34:40 +02:00
Florian Hahn 29fe998eaa
[VPlan] Preserve debug location when creating branch.
Update createEmptyBasicBlock to preserve the debug location of the
previous terminator.
2022-04-08 17:22:53 +02:00
Florian Hahn 4388c979da
[VPlan] Use vector.body as header name in VPlan native path.
This brings the VPlan block naming in line with the naming of the
generated basic blocks.
2022-04-07 10:31:12 +02:00
Jingu Kang 64b6192e81 [AArch64] Set maximum VF with shouldMaximizeVectorBandwidth
Set the maximum VF of AArch64 with 128 / the size of smallest type in loop.

Differential Revision: https://reviews.llvm.org/D118979
2022-04-05 13:16:52 +01:00
Florian Hahn 1ff022e21b
[LV] Add vector.body block to parent loop during skeleton creation.
When creating induction resume values, SCEV queries may rely on
LoopInfo. Make sure vector.body gets added to the loop of the pre-header
during skeleton construction.

%vector.body will be moved to the vector preheader during VPlan
execution.

Fixes #54745.
2022-04-05 11:54:17 +01:00
Florian Hahn 1817c526e1
[VPlan] Update VPInterleavedAccessInfo to use getVectorLoopRegion.
Update VPInterleavedAccessInfo  to use the generic getVectorLoopRegion
helper instead of relying on the entry block being the top-most vector
loop region.
2022-04-04 10:26:39 +01:00
Florian Hahn 8cd1892725
[VPlan] Remember previous loop and reset vector loop.
At the moment this is NFC, but will be needed once nested loops are also
modeled as regions. Preparation for D123005.
2022-04-04 09:27:15 +01:00
Philip Reames 88de27e3fd [LV] Handle non-integral types when considering interleave widening legality
In general, anywhere we might need to insert a blind bitcast, we need to make sure the types are losslessly convertible.

This fixes pr54634.
2022-04-03 20:16:20 -07:00
Florian Hahn 95b2aa511e
[VPlan] Set VPlan header block name to vector.body.
This brings the VPlan block naming in line with the naming of the
generated basic blocks.
2022-04-02 19:34:32 +01:00
Florian Hahn f8101e4d68
Recommit "[LV] Remove unneeded createHeaderBranch.(NFCI)"
This reverts commit 14e3650f01.

The issue causing the revert were fixed independently in
a08c90a402 and 14e5f9785c.
2022-04-01 16:53:39 +01:00
Florian Hahn 14e5f9785c
[LV] Add SCEV workaround from 80e8025 to epilogue vector code path.
This was exposed by 14e3650f. The recommit of 14e3650f will hit the
problematic code path requiring the workaround.
test case that crashes without the workaround.
2022-04-01 15:14:47 +01:00
Florian Hahn a08c90a402
[LV] Re-use TripCount from EPI.TripCount.
During skeleton construction for the epilogue vector loop, generic
helpers use getOrCreateTripCount, which will re-expand the trip count
computation. Instead, re-use the TripCount created during main loop
vectorization.
2022-04-01 13:47:34 +01:00
Florian Hahn 14e3650f01
Revert "Recommit "[LV] Remove unneeded createHeaderBranch.(NFCI)""
This reverts commit 8378a71b6c.

It looks like this patch uncovered another issue, e.g. see
https://lab.llvm.org/buildbot/#/builders/168/builds/5518
2022-03-31 19:00:48 +01:00
Florian Hahn 8378a71b6c
Recommit "[LV] Remove unneeded createHeaderBranch.(NFCI)"
This reverts the revert commit 2760cdc9c6.

This version pulls in the code to create the vector loop object in VPlan
from D121624.

This is needed because otherwise existing LoopInfo verification will
fail, as a loop block doesn't have in-loop successors now that we
do not replace the branch.

Now that we do not add new loops during skeleton construction, there's
also no need to verify LI there.
2022-03-31 14:48:32 +01:00
Florian Hahn 2760cdc9c6
Revert "[LV] Remove unneeded createHeaderBranch.(NFCI)"
This reverts commit 32bc83d11e.

This is causing bots with expensive-checks to fail. Revert while I
investigate.
2022-03-31 12:32:50 +01:00
Florian Hahn 32bc83d11e
[LV] Remove unneeded createHeaderBranch.(NFCI)
The only remaining use was to get the exit block of the loop. Instead of
relying on the loop, use the successor of VectorHeaderBB
(LoopMiddleBlock) directly to set VPTransformState::CFG::ExitB

Depends on D121621.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D121623
2022-03-31 11:48:52 +01:00
Florian Hahn 2c494f0941
[VPlan] Remove unneeded Loop variable (NFC).
Suggested in D121623. The remaining uses of L can be replaced, reducing
the need for the variable.
2022-03-31 10:34:28 +01:00
David Green b65267ca7b [LV] Invalidate widening decisions after maximizing vector bandwidth
When MaximizeVectorBandwidth is enabled, we can end up (via calls to
collectUniformsAndScalars/setCostBasedWideningDecision through
calculateRegisterUsage) making widening decisions before we have decided
whether to fold the tail by masking. These decisions will be wrong if we
later decided to fold the tail, for example when the trip count is very
low. It will use incorrect costs for loads that should get masked, using
standard memory operation costs instead.

This still at the moment uses the EmulatedMaskMemRefHack costs (a bit
unfortunately), but the old costs without this change were 1, leading to
too optimistic vectorization.

This slightly changes the way that the MaximizeVectorBandwidth option
works to make it easier to test, always honouring the option if it is
set.

Differential Revision: https://reviews.llvm.org/D120215
2022-03-31 09:19:31 +01:00
Florian Hahn e4543af4e6
[VPlan] Track current vector loop in VPTransformState (NFC).
Instead of looking up the vector loop using the header, keep track of
the current vector loop in VPTransformState. This removes the
requirement for the vector header block being part of the loop up front.

A follow-up patch will move the code to generate the Loop object for the
vector loop to VPRegionBlock.

Depends on D121619.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D121621
2022-03-30 22:16:40 +01:00
Florian Hahn e8673f2f20
[LV] Do not create separate latch block in VPlan::execute.
Now that all dependencies on creating the latch block up-front have been
removed, there is no need to create it early.

Depends on D121618.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D121619
2022-03-30 17:31:38 +01:00
Florian Hahn 8a4077fac0
[LV] Pass LoopHeaderBB directly to updateDominatorTree. (NFC)
At the call site, we already know what the vector header block is. Pass
it directly.
2022-03-30 13:11:20 +01:00
Florian Hahn ecb4171dcb
[LV] Handle zero cost loops in selectInterleaveCount.
In some case, like in the added test case, we can reach
selectInterleaveCount with loops that actually have a cost of 0.

Unfortunately a loop cost of 0 is also used to communicate that the cost
has not been computed yet. To resolve the crash, bail out if the cost
remains zero after computing it.

This seems like the best option, as there are multiple code paths that
return a cost of 0 to force a computation in selectInterleaveCount.
Computing the cost at multiple places up front there would unnecessarily
complicate the logic.

Fixes #54413.
2022-03-29 22:52:43 +01:00
Florian Hahn d1d3563278
[LV] Move code to place pointer induction increment to VPlan post-processing.
This patch moves the code to set the correct incoming block for the
backedge value to VPlan::execute.

When generating the phi node, the backedge value is temporarily added
using the pre-header as incoming block. The invalid phi node will be
fixed up during VPlan::execute after main VPlan code generation.
At the same time, the backedge value is also moved to the latch.

This change removes the requirement to create the latch block up-front
for VPWidenInductionPHIRecipe::execute, which in turn will enable
modeling the pre-header in VPlan.

Depends on D121617.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D121618
2022-03-29 20:27:59 +01:00
Philip Reames 7d6e8f2a96 [slp] Delete dead scalar instructions feeding vectorized instructions
If we vectorize a e.g. store, we leave around a bunch of getelementptrs for the individual scalar stores which we removed. We can go ahead and delete them as well.

This is purely for test output quality and readability. It should have no effect in any sane pipeline.

Differential Revision: https://reviews.llvm.org/D122493
2022-03-28 20:10:13 -07:00
Florian Hahn e7bf2ea934
[LV] Move code to place induction increment to VPlan post-processing.
This patch moves the code to set the correct incoming block for the
backedge value to VPlan::execute.

When generating the phi node, the backedge value is temporarily added
using the pre-header as incoming block. The invalid phi node will be
fixed up during VPlan::execute after main VPlan code generation.
At the same time, the backedge value is also moved to the latch.

This change removes the requirement to create the latch block up-front
for VPWidenIntOrFpInductionRecipe::execute, which in turn will enable
modeling the pre-header in VPlan.

As an alternative, the increment could be modeled as separate recipe,
but that would require more work and a bit of redundant code, as we need
to create the step-vector during VPWidenIntOrFpInductionRecipe::execute
anyways, to create the values for different parts.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D121617
2022-03-28 16:20:02 +01:00
Philip Reames f80aaa675f [SLP] Simplify eraseInstruction [NFC]
This simplifies the implementation of eraseInstruction by moving the odd-replace-users-with-undef handling back to the only caller which uses it.  This handling was not obviously correct, so add the asserts which make it clear why this is safe to do at all.  The result is simpler code and stronger assertions.
2022-03-25 12:01:52 -07:00
Philip Reames 48cc9287f5 Reapply "[SLP] Schedule only sub-graph of vectorizable instructions"" (try 3)
The original commit exposed several missing dependencies (e.g. latent bugs in SLP scheduling).  Most of these were fixed over the weekend and have had several days to bake.  The last was fixed this morning after being noticed in manual review of test changes yesterday.  See the review thread for links to each change.

Original commit message follows:

SLP currently schedules all instructions within a scheduling window which stretches from the first instruction potentially vectorized to the last. This window can include a very large number of unrelated instructions which are not being considered for vectorization. This change switches the code to only schedule the sub-graph consisting of the instructions being vectorized and their transitive users.

This has the effect of greatly reducing the amount of work performed in large basic blocks, and thus greatly improves compile time on degenerate examples. To understand the effects, I added some statistics (not planned for upstream contribution). Here's an illustration from my motivating example:

   Before this patch:

   704357 SLP                          - Number of calcDeps actions
   699021 SLP                          - Number of schedule calls
   5598 SLP                          - Number of ReSchedule actions
   59 SLP                          - Number of ReScheduleOnFail actions
   10084 SLP                          - Number of schedule resets
   8523 SLP                          - Number of vector instructions generated

   After this patch:

   102895 SLP                          - Number of calcDeps actions
   161916 SLP                          - Number of schedule calls
   5637 SLP                          - Number of ReSchedule actions
   55 SLP                          - Number of ReScheduleOnFail actions
   10083 SLP                          - Number of schedule resets
   8403 SLP                          - Number of vector instructions generated

I do want to highlight that there is a small difference in number of generated vector instructions. This example is hitting the bailout due to maximum window size, and the change in scheduling is slightly perturbing when and how we hit it. This can be seen in the RescheduleOnFail counter change. Given that, I think we can safely ignore.

The downside of this change can be seen in the large test diff. We group all vectorizable instructions together at the bottom of the scheduling region. This means that vector instructions can move quite far from their original point in code. While maybe undesirable, I don't see this as being a major problem as this pass is not intended to be a general scheduling pass.

For context, it's worth noting that the pre-scheduling that SLP does while building the vector tree is exactly the sub-graph scheduling implemented by this patch.

Differential Revision: https://reviews.llvm.org/D118538
2022-03-25 10:39:23 -07:00
Philip Reames ec858f0201 [SLP] Optimize stacksave dependence handling [NFC]
After writing the commit message for 4b1bace28, realized that the mentioned optimization was rather straight forward.  We already have the code for scanning a block during region initialization, we can simply keep track if we've seen a stacksave or stackrestore.  If we haven't, none of these dependencies are relevant and we can avoid the relatively expensive scans entirely.
2022-03-25 10:04:10 -07:00
Philip Reames a16308c282 [SLP] Explicit track required stacksave/alloca dependency (try 3)
This is an extension of commit b7806c to handle one last case noticed in test changes for D118538.  Again, this is thought to be a latent bug in the existing code, though this time I have not managed to reduce tests for the original algoritthm.

The prior attempt had failed to account for this case:
  %a = alloca i8
  stacksave
  stackrestore
  store i8 0, i8* %a

If we allow '%a' to reorder into the stacksave/restore region, then the alloca will be deallocated before the use.  We will have taken a well defined program, and introduced a use-after-free bug.

There's also an inverse case where the alloca originally follows the stackrestore, and we need to prevent the reordering it above the restore.

Compile time wise, we potentially do an extra scan of the block for each alloca seen in a bundle.  This is significantly more expensive than the stacksave rooted version and is why I'd tried to avoid this in the initial patch.  There is room to optimize this (by essentially caching a "has stacksave" bit per block), but I'm leaving that to future work if it actually shows up in practice.  Since allocas in bundles should be rare in practice, I suspect we can defer the complexity for a long while.
2022-03-25 10:04:10 -07:00
Florian Hahn e47d220230
[LV] Use getVectorLoopRegion to retrieve header. (NFC)
Update all places that currently assume the entry block to the plan is
also the vector loop header to use getVectorLoopRegion instead.

getVectorLoopRegion will keep doing the right thing when the pre-header
is modeled explicitly (and becomes the new entry block in the plan).
2022-03-25 16:57:12 +00:00
Philip Reames d9756fa723 [slp] Factor out a lambda to avoid uplicating code a third time in upcoming patch [nfc] 2022-03-25 09:02:39 -07:00
Fraser Cormack 2e44b7872b [VectorCombine] Insert addrspacecast when crossing address space boundaries
We can not bitcast pointers across different address spaces. This was
previously fixed in D89577 but then in D93229 an enhancement was added
which peeks further through the ponter operand, opening up the
possibility that address-space violations could be introduced.

Instead of bailing as the previous fix did, simply insert an
addrspacecast cast instruction.

Reviewed By: lebedev.ri

Differential Revision: https://reviews.llvm.org/D121787
2022-03-24 19:08:08 +00:00
Simon Pilgrim 597aefa89c Fix unused variable warning by embedding inside assertion 2022-03-24 17:41:24 +00:00
Florian Hahn 46432a0088
[VPlan] Add VPWidenPointerInductionRecipe.
This patch moves pointer induction handling from VPWidenPHIRecipe to its
own recipe. In the process, it adds all information required to generate
code for pointer inductions without relying on Legal to access the list
of induction phis.

Alternatively VPWidenPHIRecipe could also take an optional pointer to InductionDescriptor.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D121615
2022-03-24 14:58:45 +00:00