Commit Graph

3362 Commits

Author SHA1 Message Date
Kazu Hirata 258531b7ac Remove redundant initialization of Optional (NFC) 2022-08-20 21:18:28 -07:00
Philip Reames b0a2c48e9f [tti] Consolidate getOperandInfo without OperandValueProperties copies [nfc] 2022-08-19 16:22:22 -07:00
Alexey Bataev c167028684 [SLP]Delay vectorization of postponable values for instructions with no users.
The SLP vectorizer tries to find reductions starting from the operands of
instructions with no users, void returns, etc. But such operands can be
postponable instructions, like Cmp, InsertElement or InsertValue. Such
operands still must be postponed; the vectorizer should not try to vectorize
them immediately.

Differential Revision: https://reviews.llvm.org/D131965
2022-08-19 08:39:16 -07:00
Alexey Bataev 0e7ed32c71 [SLP]Cost for a constant buildvector.
In many cases a constant buildvector results in a vector load from a
constant/data pool, so this cost needs to be considered too.

Differential Revision: https://reviews.llvm.org/D126885
2022-08-19 08:02:42 -07:00
Alexey Bataev d53e245951 [COST][NFC]Introduce OperandValueKind in getMemoryOpCost, NFC.
Added OperandValueKind OpdInfo parameter to getMemoryOpCost functions to
better estimate cost with immediate values.

Part of D126885.
2022-08-19 07:33:00 -07:00
Florian Hahn b8709a9d03
[LV] Support fixed order recurrences.
If the incoming previous value of a fixed-order recurrence is a phi in
the header, go through incoming values from the latch until we find a
non-phi value. Use this as the new Previous; all uses in the header
will be dominated by the original phi, but need to be moved after
the non-phi previous value.

At the moment, fixed-order recurrences are modeled as a chain of
first-order recurrences.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D119661
2022-08-18 19:15:52 +01:00
Philip Reames 1436adae2c [LV-L] Add const and move method body out of line [nfc] 2022-08-18 11:10:19 -07:00
Philip Reames c064d3f139 [LV] Use early continue to simplify code [nfc] 2022-08-18 10:31:55 -07:00
Philip Reames 531dd3634d [LV] Restructure isPredicatedInst and isScalarWithPredication (w/a fix for uniform mem ops)
This change reorganizes the code and comments to make the expected semantics of these routines clearer. However, this is *not* an NFC change. The functional change is having isScalarWithPredication return false if the instruction does not need to be predicated. Specifically, for the case of a uniform memory operation we were previously considering it *not* to be a predicated instruction, but *were* considering it to be scalar with predication.

As can be seen with the test changes, this causes uniform memory ops which should have been lowered as uniform-per-part values to instead be lowered via naive scalarization or, if scalarization is infeasible (i.e. scalable vectors), aborted entirely. I also don't trust the code to bail out correctly 100% of the time, so it's possible we had a crash or miscompile from trying to scalarize something which isn't scalarizable. I haven't found a concrete example here, but I am suspicious.

Differential Revision: https://reviews.llvm.org/D131093
2022-08-18 07:14:04 -07:00
Simon Pilgrim 594c5b1a42 [SLP] Update TODO comment about shuffle mask decoding
This is handled in ShuffleVectorInst/getShuffleCost - getInstructionThroughput is (slowly) being removed.
2022-08-17 11:41:46 +01:00
Alexey Bataev 65c7cecb13 [SLP]Fix PR51320: Try to vectorize single store operands.
Currently, we try to vectorize values feeding into stores only if the
slp-vectorize-hor-store option is provided. We can safely enable
vectorization of the value operand of a single store in the basic block
if the operand value is used only in that store.
It should enable extra vectorization and should not increase compile
time significantly.
Fixes https://github.com/llvm/llvm-project/issues/51320

Differential Revision: https://reviews.llvm.org/D131894
2022-08-16 07:25:21 -07:00
Philip Reames e792a353b5 [slp] adjust debug output to include final computed cost 2022-08-15 13:51:39 -07:00
Alexey Bataev 2819126d0c [SLP][NFC]Replace multiple isa calls with single one where possible,
NFC.
2022-08-15 11:56:58 -07:00
Kazu Hirata 50724716cd [Transforms] Qualify auto in range-based for loops (NFC)
Identified with readability-qualified-auto.
2022-08-14 12:51:58 -07:00
Kazu Hirata 109df7f9a4 [llvm] Qualify auto in range-based for loops (NFC)
Identified with readability-qualified-auto.
2022-08-13 12:55:42 -07:00
Dinar Temirbulatov cab6cd6834 [AArch64][LoopVectorize] Introduce trip count minimal value threshold to ignore tail-folding.
After D121595 was committed, I noticed regressions associated with
vectorising loops with small trip counts by tail folding with scalable vectors.
As a solution, I propose introducing a minimum trip count threshold value.

  Differential Revision: https://reviews.llvm.org/D130755
2022-08-09 22:10:17 +01:00
Fangrui Song de9d80c1c5 [llvm] LLVM_FALLTHROUGH => [[fallthrough]]. NFC
With C++17 there is no Clang pedantic warning or MSVC C5051.
2022-08-08 11:24:15 -07:00
Kazu Hirata 0e37ef0186 [Transforms] Fix comment typos (NFC) 2022-08-07 23:55:24 -07:00
Kazu Hirata a2d4501718 [llvm] Fix comment typos (NFC) 2022-08-07 00:16:14 -07:00
Dawid Jurczak 1bd31a6898 [NFC] Add SmallVector constructor to allow creation of SmallVector<T> from ArrayRef of items convertible to type T
Extracted from https://reviews.llvm.org/D129781 and address comment:
https://reviews.llvm.org/D129781#3655571

Differential Revision: https://reviews.llvm.org/D130268
2022-08-05 13:35:41 +02:00
Fangrui Song 7d6017fd31 [TTI] Change new getVectorInstrCost overload to use const reference after D131114
A const reference is preferred over a non-null const pointer.
`Type *` is kept as is to match the other overload.

Reviewed By: davidxl

Differential Revision: https://reviews.llvm.org/D131197
2022-08-04 15:16:51 -07:00
Mingming Liu bc8f2f3649 [AArch64][TTI][NFC] Overload method 'getVectorInstrCost' to provide vector instruction itself, as a context information for cost estimation.
1) Overloaded (instruction-based) method is a wrapper around the current (opcode-based) method.
2) This patch also changes a few callsites (VectorCombine.cpp,
   SLPVectorizer.cpp, CodeGenPrepare.cpp) to call the overloaded method.
3) This is a split of D128302.

Differential Revision: https://reviews.llvm.org/D131114
2022-08-04 12:58:25 -07:00
Philip Reames 569a7f6aa3 [LV] Move definition of isPredicatedInst out of line and make it const [nfc] 2022-08-03 08:53:11 -07:00
Philip Reames a1cab0daae [LV] Use cost base decision for uniform mem op strategy [nfc-ish]
This is mostly a stylistic change to make the uniform memop widening cost
code fit more naturally with the surrounding code.  It's not strictly
speaking NFC as I added in the store with invariant value case, and we
could in theory have a target where a gather/scatter is cheaper than a
single load/store... but it's probably NFC in practice.  Note that the
scatter/gather result can still be overridden later if the result is
uniform-by-parts.
2022-08-03 07:47:24 -07:00
Philip Reames 0b47615fcf [LV] Recognize store of invariant value to invariant address as uniform
This extends the handling of uniform memory operations to handle the case where a store is storing a loop invariant value. Unlike the general case of a store to an invariant address where we must use the last active lane, in this case we can use any lane since all lanes must produce the same result.

For context, the basic structure of the existing code and how the change fits in:
* First, we select a widening strategy. (The result is irrelevant for this patch.)
* Then we determine if a computation is uniform within all lanes of VF. (Note this is the uniform-per-part definition, not LAI's uniform across all unrolled iterations definition.)
* If it is, we overrule the widening strategy, and unconditionally scalarize.
* VPReplicationRecipe - which is what actually does the scalarization - knows how to handle uniform-per-part values including for scalable vectors. However, we do need to know that the expression is safe to execute without predication - e.g. the uniform mem op was unconditional in the original loop. (This part was split off and already landed.)

An obvious question is why not simply implement the generic case? The answer is that I'm going to, but doing so without a canonicalization towards uniform causes regressions due to bad interaction with scalarization/uniformity of values feeding the uniform mem-op. This patch is needed to avoid those regressions.

Differential Revision: https://reviews.llvm.org/D130364
2022-08-02 08:09:49 -07:00
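As a source-level illustration of the pattern described in 0b47615fcf, the sketch below contrasts a store of a loop-invariant value with a store of a varying value, both to an invariant address. The function names and the local `cell` variable standing in for the invariant address are hypothetical and are not taken from the patch.

```cpp
#include <cstddef>

// Store of a loop-invariant value to a loop-invariant address.
// Every iteration (and therefore every vector lane) writes the same value,
// so the store from any single lane produces the final memory state.
int storeInvariantValue(int value, std::size_t n) {
  int cell = 0;  // stands in for the invariant address (assumption of this sketch)
  for (std::size_t i = 0; i < n; ++i)
    cell = value;
  return cell;
}

// Store of a varying value to a loop-invariant address.
// The final memory state is whatever the last iteration wrote, which is why
// the general case has to use the last active lane.
int storeVaryingValue(const int *values, std::size_t n) {
  int cell = 0;
  for (std::size_t i = 0; i < n; ++i)
    cell = values[i];
  return cell;
}
```

In the first function the loop could be replaced by a single store whenever `n` is at least one; in the second, only the value from the final iteration may remain visible.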
David Sherwood 4ef9cb6c17 [AArch64][LoopVectorize] Disable tail-folding for SVE when loop has interleaved accesses
If we have interleave groups in the loop we want to vectorise then
we should fall back on normal vectorisation with a scalar epilogue. In
such cases when tail-folding is enabled we'll almost certainly go on to
create vplans with very high costs for all vector VFs and fall back on
VF=1 anyway. This is likely to be worse than if we'd just used an
unpredicated vector loop in the first place.

Once the vectoriser has proper support for analysing all the costs
for each combination of VF and vectorisation style, then we should
be able to remove this.

Added an extra test here:

  Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll

Differential Revision: https://reviews.llvm.org/D128342
2022-08-02 09:52:33 +01:00
jacquesguan e38af7ba95 [LV] Refactor getExtendedAddReductionCost to support other extended reduction more than Add.
Currently the API getExtendedAddReductionCost is used to determine the cost of an extended Add reduction with an optional Mul. For Arm, this covers the relevant cases. But other targets, for example RISCV, support other kinds of extended reduction, such as FAdd.

This patch makes the following changes:
1. Split getExtendedAddReductionCost into two new APIs: getExtendedReductionCost, which handles the extended reduction with an additional Opcode input, and getMulAccReductionCost, which handles the MLA cases previously handled by getExtendedAddReductionCost.
2. Refactor getReductionPatternCost, adding a constraint to make sure getMulAccReductionCost only handles reductions of Add + Mul.

Differential Revision: https://reviews.llvm.org/D130868
2022-08-02 16:02:38 +08:00
Philip Reames 82c1b136db [LV] Don't predicate uniform mem op stores unnecessarily
We already had the reasoning about uniform mem op loads; if the address is accessed at least once, we know the instruction doesn't need to be predicated to ensure fault safety. For stores, we do need to ensure that the values visible in memory are the same with and without predication. The easiest sub-case to check for is that all the values being stored are the same. Since we know that at least one lane is active, this tells us that the value must be visible.

Warning on confusing terminology: "uniform" vs "uniform mem op" mean two different things here, and this patch is specific to the latter. It would *not* be legal to make this same change for merely "uniform" operations.

Differential Revision: https://reviews.llvm.org/D130637
2022-07-28 08:55:52 -07:00
Florian Hahn 16e0620d6d
[VPlan] Mark VPPredInstPHIRecipe as not having side-effects.
Now that all uses of VPPredInstPHIRecipes are properly modeled, they can
be treated as not having side-effects, enabling removal.
2022-07-27 19:29:26 +01:00
Kazu Hirata 95a932fb15 Remove redundant override specifiers (NFC)
Identified with modernize-use-override.
2022-07-24 22:28:11 -07:00
Kazu Hirata acf648b5e9 Use llvm::less_first and llvm::less_second (NFC) 2022-07-24 16:21:29 -07:00
Kazu Hirata 2d2e2e7ea9 [Vectorize] Remove isConsecutiveLoadOrStore (NFC)
The last use was removed on Jan 4, 2022 in commit
95a93722db.
2022-07-23 13:01:14 -07:00
Philip Reames b5c7213647 [LV] Use early return to simplify code structure 2022-07-22 12:15:14 -07:00
Benjamin Kramer 5a445395e4 [LV] Remove unused variable. NFC. 2022-07-22 17:43:58 +02:00
Philip Reames d7bf81fd51 [LV] Rework widening cost of uniform memory ops for clarity [nfc]
Reorganize the code to make it clear what is and isn't handled, and why.
Restructure bailout to remove (false and confusing) dependence on
CM_Scalarize; just return invalid cost and propagate, that's what it
is for.
2022-07-22 08:35:45 -07:00
Philip Reames bd75350180 [LV] Fix a conceptual mistake around meaning of uniform in isPredicatedInst
This code confuses LV's "Uniform" and LVL/LAI's "Uniform".  Despite the
common name, these are different.
* LV's notion means that only the first lane *of each unrolled part* is
  required.  That is, lanes within a single unroll factor are considered
  uniform.  This allows e.g. widenable memory ops to be considered
  uses of uniform computations.
* LVL and LAI's notion refers to all lanes across all unrollings.

IsUniformMem is in turn defined in terms of LAI's notion.  Thus a
UniformMemOp is a memory operation with a loop invariant address.
This means the same address is accessed in every iteration.

The tweaked piece of code was trying to match a uniform mem op (i.e.
fully loop invariant address), but instead checked for LV's notion of
uniformity.  In theory, this meant with UF > 1, we could speculate
a load which wasn't safe to execute.

This ends up being mostly silent in current code as it is nearly
impossible to create the case where this difference is visible.  The
closest I've come is the test case from 54cb87, but even then, the
incorrect result is only visible in the vplan debug output; before this
change we sink the unsafely speculated load back into the user's predicate
blocks before emitting IR.  Both before and after IR are correct so the
differences aren't "interesting".

The other test changes are uninteresting.  They're cases where LV's uniform
analysis is slightly weaker than SCEV isLoopInvariant.
2022-07-21 15:44:34 -07:00
David Sherwood f15b6b2907 [AArch64] Add target hook for preferPredicateOverEpilogue
This patch adds the AArch64 hook for preferPredicateOverEpilogue,
which currently returns true if SVE is enabled and one of the
following conditions (non-exhaustive) is met:

1. The "sve-tail-folding" option is set to "all", or
2. The "sve-tail-folding" option is set to "all+noreductions"
and the loop does not contain reductions, or
3. The "sve-tail-folding" option is set to "all+norecurrences"
and the loop has no first-order recurrences.

Currently the default option is "disabled", but this will be
changed in a later patch.

I've added new tests to show the options behave as expected here:

  Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll

Differential Revision: https://reviews.llvm.org/D129560
2022-07-21 17:20:06 +01:00
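The three conditions listed in f15b6b2907 can be read as a small decision function. The sketch below is only a restatement of that list: the function name, the `LoopTraits` struct and the string-based option handling are hypothetical, and the commit itself notes that its list of conditions is non-exhaustive.

```cpp
#include <string_view>

// Hypothetical summary of the loop properties the listed conditions refer to.
struct LoopTraits {
  bool hasReductions;
  bool hasFirstOrderRecurrences;
};

// Sketch of the decision described in the commit message, not the real hook.
bool preferPredicateSketch(bool sveEnabled, std::string_view tailFolding,
                           const LoopTraits &loop) {
  if (!sveEnabled)
    return false;
  if (tailFolding == "all")
    return true;
  if (tailFolding == "all+noreductions")
    return !loop.hasReductions;
  if (tailFolding == "all+norecurrences")
    return !loop.hasFirstOrderRecurrences;
  // "disabled" (the default at the time of the commit) and any setting this
  // sketch does not model fall back to a scalar epilogue.
  return false;
}
```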
Philip Reames 523a526a02 [LV] Fix miscompile due to srem/sdiv speculation safety condition
An srem or sdiv has two cases which can cause undefined behavior, not just one. The existing code did not account for this, and as a result, we miscompiled when we encountered e.g. a srem i64 %v, -1 in a conditional block.

Instead of hand rolling the logic, just use the utility function which exists exactly for this purpose.

Differential Revision: https://reviews.llvm.org/D130106
2022-07-20 05:35:23 -07:00
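The two undefined cases that 523a526a02 refers to are division by zero and signed overflow (the minimum value divided by -1). A minimal sketch for 64-bit operands follows; the helper names are hypothetical, and this is not the utility function the commit switched to.

```cpp
#include <cstdint>
#include <limits>

// True if `lhs / rhs` or `lhs % rhs` would be undefined for signed 64-bit
// operands: either the divisor is zero, or the operation overflows.
constexpr bool signedDivRemIsUndefined(std::int64_t lhs, std::int64_t rhs) {
  const bool divByZero = rhs == 0;
  const bool overflow =
      lhs == std::numeric_limits<std::int64_t>::min() && rhs == -1;
  return divByZero || overflow;
}

// Hypothetical guarded remainder: returns 0 instead of invoking undefined
// behavior. A check that only tests for a zero divisor would miss the
// overflow case, which is why a divisor of -1 cannot be blindly speculated.
constexpr std::int64_t guardedSRem(std::int64_t lhs, std::int64_t rhs) {
  return signedDivRemIsUndefined(lhs, rhs) ? 0 : lhs % rhs;
}
```

A divisor of -1 is safe only when the dividend is known not to be the minimum value, so a speculation check has to consider both operands.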
Florian Hahn 5124b21648
[VPlan] Initial def-use verification.
This patch introduces some initial def-use verification. This catches
cases like the one fixed by D129436.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D129717
2022-07-20 11:06:32 +01:00
William Schmidt bccc9aa81c Don't vectorize PHIs in catchswitch blocks
We currently assert in vectorizeTree(TreeEntry*) when processing a PHI
bundle in a block containing a catchswitch.  We attempt to set the
IRBuilder insertion point following the catchswitch, which is invalid.
This is done so that ShuffleBuilder.finalize() knows where to insert
a shuffle if one is needed.

To avoid this occurring, watch out for catchswitch blocks during
buildTree_rec() processing, and avoid adding PHIs in such blocks to
the vectorizable tree.  It is unlikely that constraining vectorization
over an exception path will cause a noticeable performance loss, so
this seems preferable to trying to anticipate when a shuffle will and
will not be required.
2022-07-19 06:10:17 -07:00
Florian Hahn a75760a269
[LV] Remove unnecessary cast in widenCallInstruction. (NFC) 2022-07-19 11:23:24 +01:00
Florian Hahn 30e53b8c03
[LV] Sink module variable and use State to set it in widenCall. (NFC)
Limits the lifetime of the variable and makes it independent of
CallInst.
2022-07-18 19:41:48 +01:00
Florian Hahn 105032f549
[LV] Use PHI recipe instead of PredRecipe for subsequent uses.
At the moment, the VPPredInstPHIRecipe is not used in subsequent uses of
the predicate recipe. This incorrectly models the def-use chains, as all
later uses should use the phi recipe. Fix that by delaying recording of
the recipe.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D129436
2022-07-18 09:35:34 +01:00
Kazu Hirata 7094ab4ee7 [llvm] Modernize bool literals (NFC)
Identified with modernize-use-bool-literals.
2022-07-17 18:08:51 -07:00
Florian Hahn cc0ee17951
[LV] Move VPPredInstPHIRecipe::execute to VPlanRecipes.cpp (NFC) 2022-07-17 11:34:23 +01:00
Florian Hahn 6813b41d57
[LV] Avoid creating new run-time VF expression for each runtime check.
At the moment, the cost of runtime checks for scalable vectors is
overestimated due to creating separate vscale * VF expressions for each
check. Instead re-use the first expression.
2022-07-16 17:24:07 +01:00
David Green 4b7913c357 [VectorCombine] Only consider shuffle uses with the same type.
The backend getShuffleCosts do not currently handle shuffles that change
size very well. Limit the shuffles we collect to the same type to make
sure they do not cause issues as reported in D128732.
2022-07-16 13:23:39 +01:00
Florian Hahn aa00fb02c9
[LV] Use umax(VF * UF, MinProfTC) for scalable vectors.
For scalable vectors, it is not sufficient to only check
MinProfitableTripCount if it is >= VF.getKnownMinValue() * UF, because
this property may not hold for larger values of vscale. In those
cases, compute umax(VF * UF, MinProfTC) instead.

This should fix
https://lab.llvm.org/buildbot/#/builders/197/builds/2262
2022-07-15 10:23:14 -07:00
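The arithmetic behind aa00fb02c9 can be illustrated with concrete numbers. In the sketch below the function and parameter names are hypothetical; it only shows why a comparison made with the known minimum VF does not carry over to larger values of vscale.

```cpp
#include <algorithm>
#include <cstdint>

// For a scalable VF the number of lanes is vscale * knownMinVF, where vscale
// is only known at run time. One vector iteration therefore covers
// knownMinVF * vscale * UF scalar iterations.
std::uint64_t minItersForVectorLoop(std::uint64_t knownMinVF,
                                    std::uint64_t vscale, std::uint64_t UF,
                                    std::uint64_t minProfTC) {
  const std::uint64_t step = knownMinVF * vscale * UF;
  // umax(VF * UF, MinProfTC): never enter the vector loop with fewer
  // iterations than one full vector step, nor fewer than the profitable
  // minimum.
  return std::max(step, minProfTC);
}
```

With `knownMinVF = 4`, `UF = 2` and `minProfTC = 20`, the profitable minimum dominates for `vscale = 1` (step 8), but the step dominates for `vscale = 4` (step 32).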
Mel Chen bd404fbcc8 [LV][NFC] Fix the condition for printing debug messages
Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D128523
2022-07-15 01:47:33 -07:00
Kazu Hirata 611ffcf4e4 [llvm] Use value instead of getValue (NFC) 2022-07-13 23:11:56 -07:00
Florian Hahn ee37ae91b6
[VPlan] Move VPBB verification to separate function (NFC). 2022-07-13 18:53:40 -07:00
Florian Hahn 6f7347b888
[LV] Use PredRecipe directly instead of getOrAddVPValue (NFC).
There is no need to look up the VPValue for Instr, PredRecipe can be
used directly.
2022-07-13 17:01:42 -07:00
Florian Hahn 225e3ec622
[LV] Move VPBranchOnMaskRecipe::execute to VPlanRecipes.cpp (NFC). 2022-07-13 14:39:59 -07:00
David Sherwood 307ace7f20 [LoopVectorize] Ensure the VPReductionRecipe is placed after all its inputs
When vectorising ordered reductions we call a function
LoopVectorizationPlanner::adjustRecipesForReductions to replace the
existing VPWidenRecipe for the fadd instruction with a new
VPReductionRecipe. We attempt to insert the new recipe in the same
place, but this is wrong because createBlockInMask may have
generated new recipes that VPReductionRecipe now depends upon. I
have changed the insertion code to append the recipe to the
VPBasicBlock instead.

Added a new RUN with tail-folding enabled to the existing test:

  Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll

Differential Revision: https://reviews.llvm.org/D129550
2022-07-13 09:29:25 +01:00
David Sherwood 6b694d600a [LoopVectorize] Change PredicatedBBsAfterVectorization to be per VF
When calculating the cost of Instruction::Br in getInstructionCost
we query PredicatedBBsAfterVectorization to see if there is a
scalar predicated block. However, this meant that the decisions
being made for a given fixed-width VF were affecting the cost for a
scalable VF. As a result we were returning InstructionCost::Invalid
pointlessly for a scalable VF that should have a low cost. I
encountered this for some loops when enabling tail-folding for
scalable VFs.

Test added here:

  Transforms/LoopVectorize/AArch64/sve-tail-folding-cost.ll

Differential Revision: https://reviews.llvm.org/D128272
2022-07-12 14:53:20 +01:00
Florian Hahn 5d135041c5
[LV] Move VPBlendRecipe::execute to VPlanRecipes.cpp (NFC). 2022-07-11 16:01:07 -07:00
David Sherwood 03fee6712a [LoopVectorize] Add option to use active lane mask for loop control flow
Currently, for vectorised loops that use the get.active.lane.mask
intrinsic we only use the mask for predicated vector operations,
such as masked loads and stores, etc. The loop itself is still
controlled by comparing the canonical induction variable with the
trip count. However, for some targets this is inefficient when it's
cheap to use the mask itself to control the loop.

This patch adds support for using the active lane mask for control
flow by:

1. Generating the active lane mask for the next iteration of the
vector loop, rather than the current one. If there are still any
remaining iterations then at least the first bit of the mask will
be set.
2. Extracting the first bit of this mask and using this bit for the
conditional branch.

I did this by creating a new VPActiveLaneMaskPHIRecipe that sets
up the initial PHI values in the vector loop pre-header. I've also
made use of the new BranchOnCond VPInstruction for the final
instruction in the loop region.

Differential Revision: https://reviews.llvm.org/D125301
2022-07-11 13:46:55 +01:00
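The control-flow scheme described in 03fee6712a can be modelled in scalar code. The sketch below assumes a fixed `VF` of 4 and uses hypothetical helper names; it mirrors the two steps listed in the commit (compute the mask for the next iteration, then branch on its first lane) and is not the generated IR.

```cpp
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t VF = 4;  // hypothetical fixed vector width
using Mask = std::array<bool, VF>;

// Models get.active.lane.mask(base, tripCount): lane i is active iff
// base + i < tripCount.
Mask activeLaneMask(std::size_t base, std::size_t tripCount) {
  Mask mask{};
  for (std::size_t lane = 0; lane < VF; ++lane)
    mask[lane] = base + lane < tripCount;
  return mask;
}

// Sums `data` with a tail-folded loop whose exit is controlled by the mask.
long maskedSum(const std::vector<int> &data) {
  const std::size_t tripCount = data.size();
  // Initial mask, corresponding to the PHI value set up in the pre-header.
  Mask mask = activeLaneMask(0, tripCount);
  if (!mask[0])
    return 0;  // zero-trip guard (an assumption of this sketch)
  long sum = 0;
  std::size_t index = 0;
  do {
    // Predicated body: only active lanes touch memory.
    for (std::size_t lane = 0; lane < VF; ++lane)
      if (mask[lane])
        sum += data[index + lane];
    index += VF;
    // Step 1: generate the mask for the *next* iteration.
    mask = activeLaneMask(index, tripCount);
    // Step 2: branch on the first lane; if it is set, iterations remain.
  } while (mask[0]);
  return sum;
}
```

The loop never compares `index` against the trip count directly; the first lane of the next mask carries that information.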
David Sherwood 02d6950d84 [LoopVectorize][NFC] Add optional Name parameter to VPInstruction
This patch is a simple piece of refactoring that now permits users
to create VPInstructions and specify the name of the value being
generated. This is useful for creating more readable/meaningful
names in IR.

Differential Revision: https://reviews.llvm.org/D128982
2022-07-11 09:23:24 +01:00
Florian Hahn 6a4bc452f8
[LV] Move VPWidenGEPRecipe::execute to VPlanRecipes.cpp (NFC). 2022-07-10 17:10:17 -07:00
Florian Hahn 13ae213469
[LV] Move VPWidenRecipe::execute to VPlanRecipes.cpp (NFC). 2022-07-09 18:46:57 -07:00
Florian Hahn 0c27b38849
[VPlan] Move VPWidenSelectRecipe::execute to VPlanRecipes.cpp (NFC).
Depends on D127968.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D127970
2022-07-08 09:35:23 -07:00
Craig Topper 0266773464 [SLP] Add missing space to optimization remark.
Reviewed By: vporpo

Differential Revision: https://reviews.llvm.org/D129330
2022-07-07 23:29:11 -07:00
Florian Hahn bc19b7c3cc
[LV] Remove collectTriviallyDeadInstructions, already handled by VP DCE.
Now that removeDeadRecipes can remove most dead recipes across a whole
VPlan, there is no need to first collect some dead instructions.
Instead removeDeadRecipes can simply clean them up.

Depends D127580.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D128408
2022-07-07 08:40:27 -07:00
Sander de Smalen 519d7876cb [VectorCombine] Avoid creating shuffle for extract-extract pattern on scalable vector.
This addresses https://github.com/llvm/llvm-project/issues/56377

Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D129136
2022-07-07 08:37:04 +00:00
Florian Hahn 17d48c3169
[VPlan] Move remove dead recipes before merging regions.
This can enable additional region merging,  while not losing
opportunities as region merging does not produce dead recipes.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D128831
2022-07-06 20:38:38 -07:00
Nikita Popov 1ed8b29302 [LoopVectorizationLegality] Drop unused variable (NFC) 2022-07-06 10:43:39 +02:00
Nikita Popov 8ee913d83b [IR] Remove Constant::canTrap() (NFC)
As integer div/rem constant expressions are no longer supported,
constants can no longer trap and are always safe to speculate.
Remove the Constant::canTrap() method and its usages.
2022-07-06 10:36:47 +02:00
David Green 5493f8fc59 [VectorCombine] Improve shuffle select shuffle-of-shuffles
This is an extension to the code added in D123911 which added vector
combine folding of shuffle-select patterns, attempting to reduce the
total amount of shuffling required in patterns like:
  %x = shuffle %i1, %i2
  %y = shuffle %i1, %i2
  %a = binop %x, %y
  %b = binop %x, %y
  shuffle %a, %b, selectmask

This patch extends the handling of shuffles that are dependent on one
another, which can arise from the SLP vectorizer, as-in:
  %x = shuffle %i1, %i2
  %y = shuffle %x

The input shuffles can also be omitted, in which case they are treated
like identity shuffles. This patch also attempts to calculate a better
ordering of input shuffles, which can help getting lower cost input
shuffles, pushing complex shuffles further down the tree.

This is a recommit with some additional checks for supported forms and
out-of-bounds mask elements, with some extra tests.

Differential Revision: https://reviews.llvm.org/D128732
2022-07-05 17:16:18 +01:00
Florian Hahn ebb78a95ce
[LV] Remove stray dbgs() call after 774fc63490. 2022-07-05 12:58:18 +01:00
Florian Hahn 774fc63490
[LV] Consider minimum vscale assumption for RT check cost.
For scalable VFs, the minimum assumed vscale needs to be included in the
cost-computation, otherwise a smaller VF may be used for RT check cost
computation than was used for earlier cost computations.

Fixes a RISCV test failing with UBSan due to both scalar and vector
loops having the same cost.
2022-07-05 09:41:58 +01:00
Nikita Popov b69c75d53f Revert "[VectorCombine] Improve shuffle select shuffle-of-shuffles"
This reverts commit 19a1e20b8a.

Clang crashes while linking bullet from llvm-test-suite in
ReleaseLTO-g cmake configuration.
2022-07-05 09:31:20 +02:00
Florian Hahn 2a82c15f63
[LV] Consider runtime checks profitable if scalar cost is zero.
This fixes a UBSan failure after 644a965c1e. When using
user-provided VFs/ICs (via the force-vector-width /
force-vector-interleave options) the scalar cost is zero, which would
cause divide-by-zero.

When forcing vectorization using the options, the cost of the runtime
checks should not block vectorization.
2022-07-04 21:37:16 +01:00
Florian Hahn 9eb6572786
[LV] Add back CantReorderMemOps remark.
Add back remark unintentionally dropped by 644a965c1e.

I will add a LV test separately, so we do not have to rely on a Clang
test to catch this.
2022-07-04 17:23:47 +01:00
Peter Waller c146af3f46 [LoopVectorize][NFC] Reinstate TTICapture workaround for gcc-6
Fixes #56374.
2022-07-04 14:14:15 +00:00
Florian Hahn 644a965c1e
[LV] Vectorize cases with larger number of RT checks, execute only if profitable.
This patch replaces the tight hard cut-off for the number of runtime
checks with a more accurate cost-driven approach.

The new approach allows vectorization with a larger number of runtime
checks in general, but only executes the vector loop (and runtime checks) if
considered profitable at runtime. Profitable here means that the cost-model
indicates that the runtime check cost + vector loop cost < scalar loop cost.

To do that, LV computes the minimum trip count for which runtime check cost
+ vector-loop-cost < scalar loop cost.

Note that there is still a hard cut-off to avoid excessive compile-time/code-size
increases, but it is much larger than the original limit.

The performance impact on standard test-suites like SPEC2006/SPEC2006/MultiSource
is mostly neutral, but the new approach can give substantial gains in cases where
we failed to vectorize before due to the over-aggressive cut-offs.

On AArch64 with -O3, I didn't observe any regressions outside the noise level (<0.4%)
and there are the following execution time improvements. Both `IRSmk` and `srad` are relatively short running, but the changes are far above the noise level for them on my benchmark system.

```
CFP2006/447.dealII/447.dealII    -1.9%
CINT2017rate/525.x264_r/525.x264_r    -2.2%
ASC_Sequoia/IRSmk/IRSmk       -9.2%
Rodinia/srad/srad     -36.1%
```

`size` regressions on AArch64 with -O3 are

```
MultiSource/Applications/hbd/hbd                 90256.00   106768.00 18.3%
MultiSourc...ks/ASCI_Purple/SMG2000/smg2000     240676.00   257268.00  6.9%
MultiSourc...enchmarks/mafft/pairlocalalign     472603.00   489131.00  3.5%
External/S...2017rate/525.x264_r/525.x264_r     613831.00   630343.00  2.7%
External/S...NT2006/464.h264ref/464.h264ref     818920.00   835448.00  2.0%
External/S...te/538.imagick_r/538.imagick_r    1994730.00  2027754.00  1.7%
MultiSourc...nchmarks/tramp3d-v4/tramp3d-v4    1236471.00  1253015.00  1.3%
MultiSource/Applications/oggenc/oggenc         2108147.00  2124675.00  0.8%
External/S.../CFP2006/447.dealII/447.dealII    4742999.00  4759559.00  0.3%
External/S...rate/510.parest_r/510.parest_r   14206377.00 14239433.00  0.2%
```

Reviewed By: lebedev.ri, ebrevnov, dmgreen

Differential Revision: https://reviews.llvm.org/D109368
2022-07-04 15:11:39 +01:00
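The inequality stated in 644a965c1e (runtime check cost + vector loop cost < scalar loop cost) can be solved for the trip count. The sketch below uses a simplified cost model chosen for illustration: costs are per iteration, the vector loop cost is scaled by TC / VF as a rational number, and epilogue costs are ignored. The function name and parameters are hypothetical and this is not LV's actual computation.

```cpp
#include <cstdint>
#include <optional>

// Smallest trip count TC for which
//   rtCheckCost + vectorIterCost * (TC / vf) < scalarIterCost * TC
// holds, or nullopt if the vector loop is never cheaper per element.
std::optional<std::uint64_t>
minProfitableTripCount(std::uint64_t rtCheckCost, std::uint64_t scalarIterCost,
                       std::uint64_t vectorIterCost, std::uint64_t vf) {
  // Multiply both sides by vf to stay in integers:
  //   rtCheckCost * vf + vectorIterCost * TC < scalarIterCost * vf * TC
  //   TC * (scalarIterCost * vf - vectorIterCost) > rtCheckCost * vf
  const std::uint64_t scalarCostPerVectorIter = scalarIterCost * vf;
  if (scalarCostPerVectorIter <= vectorIterCost)
    return std::nullopt;  // no trip count can amortise the checks
  const std::uint64_t savingPerVectorIter =
      scalarCostPerVectorIter - vectorIterCost;
  // Smallest integer strictly greater than the quotient.
  return rtCheckCost * vf / savingPerVectorIter + 1;
}
```

For example, with a runtime check cost of 40, a scalar iteration cost of 4, a vector iteration cost of 8 and VF = 4, a trip count of 20 gives 40 + 40 = 80 against a scalar cost of 80 (not profitable), while 21 gives 82 against 84, so the threshold is 21.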
David Green 2de05afc19 [SLP] Peek into loads when hitting the RecursionMaxDepth
This patch slightly extends the limit on the RecursionMaxDepth inside
the SLP vectorizer. It does it only when it hits a load (or zext/sext of
a load), which allows it to peek through in the places where it will be
the most valuable, without ballooning out the O(..) by any 2^n factors.

Differential Revision: https://reviews.llvm.org/D122148
2022-07-04 14:22:50 +01:00
David Green 19a1e20b8a [VectorCombine] Improve shuffle select shuffle-of-shuffles
This is an extension to the code added in D123911 which added vector
combine folding of shuffle-select patterns, attempting to reduce the
total amount of shuffling required in patterns like:
  %x = shuffle %i1, %i2
  %y = shuffle %i1, %i2
  %a = binop %x, %y
  %b = binop %x, %y
  shuffle %a, %b, selectmask

This patch extends the handling of shuffles that are dependent on one
another, which can arise from the SLP vectorizer, as-in:
  %x = shuffle %i1, %i2
  %y = shuffle %x

The input shuffles can also be omitted, in which case they are treated
like identity shuffles. This patch also attempts to calculate a better
ordering of input shuffles, which can help getting lower cost input
shuffles, pushing complex shuffles further down the tree.

Differential Revision: https://reviews.llvm.org/D128732
2022-07-04 13:38:43 +01:00
Florian Hahn b4694229aa
[LV] Simplify setDebugLocFromInst by using early exit (NFC).
Suggested as separate improvement in D128657.
2022-07-04 09:25:26 +01:00
Florian Hahn b0da3c6fa4
[VPlan] Move setDebugLocFromInst to VPTransformState (NFC).
The moved helpers are only used for codegen. It will allow moving the
remaining ::execute implementations out of LoopVectorize.cpp.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D128657
2022-07-02 15:18:17 +01:00
Florian Hahn 0dddf04cab
[LV] Don't optimize exit cond during epilogue vectorization.
At the moment, the same VPlan can be used for code generation of both the
main vector and epilogue vector loop. This can lead to wrong results, if
the plan is optimized based on the VF of the main vector loop and then
re-used for the epilogue loop.

One example where this is problematic is if the scalar loops need to
execute at least one iteration, e.g. due to interleave groups.

To prevent mis-compiles in the short-term, disable optimizing exit
conditions for VPlans when using epilogue vectorization. The proper fix
is to avoid re-using the same plan for both loops, which will require
support for cloning plans first.

Fixes #56319.
2022-07-01 13:48:38 +01:00
Florian Hahn 583abd0e36
[VPlan] Move addMetadata to VPTransformState (NFC).
The moved helpers are only used for codegen. It will allow moving the
remaining ::execute implementations out of LoopVectorize.cpp.

Depends on D127966.
Depends on D127965.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D127968
2022-07-01 12:03:25 +01:00
Alexey Bataev 4be3fc35aa [SLP][NFC]Clean up operands of the removed insertelements, NFC.
Replace all operands of the insertelement instructions that were replaced by
shuffles with poison, to avoid false-positive reports about an incorrect function.
2022-06-30 17:51:43 -07:00
Florian Hahn 68884dde70
[LV] Move LoopVersioning creation to LVP::execute.
At the moment LoopVersioning is only created for inner-loop
vectorization. This patch moves it to LVP::execute, which means it will
also be added for epilogue vectorization. As a consequence, the proper
noalias metadata is now also added to epilogue vector loops.

LVer will be moved to VPTransformState as follow-up.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D127966
2022-06-30 12:14:32 +01:00
Florian Hahn 24b5f8e0d0
[VPlan] Make sure optimizeInductions removes wide ind from scalar plan.
In some cases, there may be widened users of inductions even though the
plan includes the scalar VF. In those cases, make sure we still replace
the VPWidenIntOrFpInductionRecipe with scalar steps, as otherwise we may
try to execute a VPWidenIntOrFpInductionRecipe with a scalar VF.

Alternatively the patch could also split the range if needed.

This fixes a crash exposed by D123720.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D128755
2022-06-30 09:11:48 +01:00
Nikita Popov bdba8278d9 [VectorCombine] Avoid ConstantExpr::get() (NFC)
Use IRBuilder APIs instead, which will still constant fold.
2022-06-29 17:17:52 +02:00
Alexey Bataev bf4dcbd2df [SLP]Fix PR56251: Do not remove the reordering from the root node, being used as an operand.
If the root order itself does not require reordering, we can safely
remove its reorder mask (e.g., if the root node is a vector of
phis). But if this node is used as an operand in the graph, we cannot
delete the reordering and must keep it. Otherwise the graph nodes are
not synchronized with the operands, which may cause extra gather
instruction(s) or a compiler crash.
Also, we need to be very careful when selecting the gather nodes for
reordering, since there might be several gather nodes with the same
scalars and we could end up reordering the same node many times instead
of different nodes.

Differential Revision: https://reviews.llvm.org/D128680
2022-06-28 13:42:05 -07:00
Florian Hahn 03975b7f0e
[VPlan] Move recipe implementations to separate file (NFC).
This patch moves the code for recipe implementations to a separate file.

The benefits are:
 * Keep VPlan.cpp smaller => faster compile-time during parallel builds.
 * Keep code for logical units together

As a follow-up, I am also planning to move all ::execute
implementations from LoopVectorize.cpp over to the new file, which
should help to reduce the size of that file a bit.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D127965
2022-06-28 10:34:30 +01:00
Guillaume Chatelet 3c126d5fe4 [Alignment] Replace commonAlignment with std::min
`commonAlignment` is a shortcut to pick the smallest of two `Align`
objects. As-is it doesn't bring much value compared to `std::min`.

Differential Revision: https://reviews.llvm.org/D128345
2022-06-28 07:15:02 +00:00
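
A minimal sketch of the idea behind 3c126d5fe4 above: if an alignment type is ordered by its byte value, picking the smaller of two alignments needs no dedicated helper, because std::min already does it. ByteAlign and smallerAlign are hypothetical stand-ins written for this illustration; they are not the LLVM `Align` class or any LLVM function.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for an alignment wrapper (not llvm::Align).
// Holds a non-zero power-of-two alignment in bytes.
struct ByteAlign {
  std::uint64_t Value;
  explicit ByteAlign(std::uint64_t V) : Value(V) {
    assert(V != 0 && (V & (V - 1)) == 0 && "alignment must be a power of two");
  }
  // Ordering by byte value is all that std::min needs.
  friend bool operator<(ByteAlign L, ByteAlign R) { return L.Value < R.Value; }
};

// With the ordering above, "the smallest of two alignments" is just std::min.
inline ByteAlign smallerAlign(ByteAlign A, ByteAlign B) {
  return std::min(A, B);
}
```

For example, combining a 16-byte and a 4-byte alignment this way yields the 4-byte alignment, which is the only one both accesses are guaranteed to satisfy.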
Philip Reames 20dd3297b1 [LV] Allow scalable vectorization with vscale = 1
This change is a bit subtle. If we have a type like <vscale x 1 x i64>, the vectorizer will currently reject vectorization. The reason is that a type like <1 x i64> is likely to get simply rescalarized, and the vectorizer doesn't want to be in the game of simple unrolling.

(I've given the example in terms of 1 x types which use a single register, but the same issue exists for any N x types which use N registers. e.g. RISCV LMULs.)

This change distinguishes scalable types from fixed types under the reasoning that converting to a scalable type isn't unrolling. Because the actual vscale isn't known until runtime, using a vscale type is potentially very profitable.

This makes an important, but unchecked, assumption. Specifically, the scalable type is assumed to only be legal per the cost model if there's actually a scalable register class which is distinct from the scalar domain. This is, to my knowledge, true for all targets which return non-invalid costs for scalable vector ops today, but in theory, we could have a target decide to lower scalable to fixed length vector or even scalar registers. If that ever happens, we'd need to revisit this code.

In practice, this patch unblocks scalable vectorization for ELEN types on RISCV.

Let me sketch one alternate implementation I considered. We could have restricted this to when we know a minimum value for vscale. Specifically, for the default +v extension for RISCV, we actually know that vscale >= 2 for ELEN types. However, doing it this way means we can't generate scalable vectors when using the various embedded vector extensions which have a minimum vscale of 1.

Differential Revision: https://reviews.llvm.org/D128542
2022-06-27 13:38:57 -07:00
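
The distinction drawn in 20dd3297b1 above can be sketched roughly as follows. This is a simplified model under stated assumptions, not the vectorizer's actual check: CandidateVF and isVectorCandidate are hypothetical names, and the real decision also depends on the target's cost model, which this sketch ignores.

```cpp
#include <cassert>

// Hypothetical description of a vectorization factor (not an LLVM type).
// MinLanes is the known minimum lane count; for a scalable factor the real
// lane count is MinLanes * vscale, with vscale only known at run time.
struct CandidateVF {
  unsigned MinLanes;
  bool Scalable;
};

// Simplified model of the rule described in the commit message:
//  - a fixed factor of 1 (like <1 x i64>) is rejected, since it would amount
//    to plain unrolling;
//  - a scalable factor with a minimum of 1 (like <vscale x 1 x i64>) is kept,
//    since vscale may turn out to be larger than 1 at run time.
inline bool isVectorCandidate(CandidateVF F) {
  assert(F.MinLanes > 0 && "a factor needs at least one lane");
  if (F.MinLanes > 1)
    return true;
  return F.Scalable;
}
```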
Kazu Hirata a7938c74f1 [llvm] Don't use Optional::hasValue (NFC)
This patch replaces Optional::hasValue with the implicit cast to bool
in conditionals only.
2022-06-25 21:42:52 -07:00
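
The pattern replaced in a7938c74f1 above looks roughly like the sketch below. It uses std::optional from the C++ standard library as a stand-in for llvm::Optional, which is an assumption made so the example is self-contained; note that std::optional spells the explicit query has_value(), whereas the commit refers to Optional::hasValue. widthOrDefault is a hypothetical function used only for illustration.

```cpp
#include <cassert>
#include <optional>

// Hypothetical example function: return the requested width, or 1 if none.
inline int widthOrDefault(const std::optional<int> &Width) {
  // Before the cleanup, the condition would have called the explicit
  // query method (has_value() here). In a conditional, the contextual
  // conversion to bool says the same thing more concisely.
  if (Width)
    return *Width;
  return 1;
}
```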
Kazu Hirata 3b7c3a654c Revert "Don't use Optional::hasValue (NFC)"
This reverts commit aa8feeefd3.
2022-06-25 11:56:50 -07:00
Kazu Hirata aa8feeefd3 Don't use Optional::hasValue (NFC) 2022-06-25 11:55:57 -07:00
Alexey Bataev 2faacf61a5 [SLP]Improve shuffles cost estimation where possible.
Improved/fixed cost modeling for shuffles by providing masks, improved
cost model for non-identity insertelements.

Differential Revision: https://reviews.llvm.org/D115462
2022-06-24 09:28:01 -07:00
Florian Hahn cb69ba4faa
[LV] Create RT checks once VF/IC are selected, track scalar cost.
This patch updates LV to generate runtime checks after the VF & IC are
selected. It allows deciding whether to vectorize with runtime checks or
not based on their cost compared to the vector loop.

It also updates VectorizationFactor to include the scalar cost.

Reviewed By: lebedev.ri, dmgreen

Differential Revision: https://reviews.llvm.org/D75981
2022-06-24 17:42:11 +02:00
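
To illustrate why tracking the scalar cost matters for cb69ba4faa above, the sketch below compares the total cost of a scalar loop with the cost of a vector loop plus its runtime checks. The formula, the function name worthRuntimeChecks, and all parameters are assumptions made for this example; the commit message does not spell out the exact comparison LV performs.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical profitability check (not the LoopVectorize implementation).
// ScalarCost: cost of one scalar iteration.
// VectorCost: cost of one vector iteration, which covers VF scalar iterations.
// RTCheckCost: one-off cost of the runtime checks guarding the vector loop.
// Iterations left over after the vector loop are charged at scalar cost.
inline bool worthRuntimeChecks(std::uint64_t ScalarCost,
                               std::uint64_t VectorCost, std::uint64_t VF,
                               std::uint64_t RTCheckCost,
                               std::uint64_t TripCount) {
  assert(VF > 0 && "vectorization factor must be positive");
  const std::uint64_t VectorIters = TripCount / VF;
  const std::uint64_t Remainder = TripCount % VF;
  const std::uint64_t VectorTotal =
      RTCheckCost + VectorIters * VectorCost + Remainder * ScalarCost;
  const std::uint64_t ScalarTotal = TripCount * ScalarCost;
  return VectorTotal < ScalarTotal;
}
```

Under this model, a short trip count can make the checks cost more than vectorization saves, so the decision can only be made once the VF and the scalar cost are known.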
Florian Hahn b18141a8f2
[VPlan] Set VFs included in plan before last set of VPTransforms (NFC).
This allows VPlanTransforms to query the VFs included in the plan in the
future.
2022-06-24 10:16:56 +02:00
Philip Reames 46ea4b5ea1 [LV] Avoid a crash when costing a uniform store which doesn't correspond to a legal scatter
If we have an unaligned uniform store, then when costing a scalable VF we can't emit code to scalarize it.  (Well, we could, but we haven't implemented that case.)  This change replaces an assert with a cost-model bailout such that we reject vectorization with the scalable VF instead of crashing.
2022-06-23 12:41:09 -07:00
Alexey Bataev 3b6edef15d [SLP]Fix a crash when reorder masked gather nodes with reused scalars.
If the masked gather nodes must be reordered, we can just reorder the
scalars, as for gather nodes. But if the node contains reused
scalars, it must be handled the same way as a regular vectorizable node,
since the reuse mask must be reordered rather than the scalars directly.

Differential Revision: https://reviews.llvm.org/D128360
2022-06-23 11:32:30 -07:00
Florian Hahn 569d84fe99
[VPlan] Remove dead recipes across whole plan.
This extends removeDeadRecipe to remove recipes across the whole plan.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D127580
2022-06-23 13:36:02 +02:00
Fangrui Song 1ffd2d99c2 Revert D115462 "[SLP]Improve shuffles cost estimation where possible."
This reverts commit cac60940b7.

Caused a miscompile of pytorch/cpuinfo with -Os -fsanitize=memory -march=haswell.
See my latest comment (may update) on D115462.
2022-06-22 23:16:31 -07:00
Fangrui Song a411bc11d6 Revert "[SLP]Fix a crash when insert subvector is out of range."
This reverts commit f1ee2738b3.

Revert due to the revert of a dependent commit `[SLP]Improve shuffles cost estimation where possible.`
2022-06-22 23:16:25 -07:00