Commit Graph

2808 Commits

Author SHA1 Message Date
Sander de Smalen 3d549dddf7 [LV] Pass compare predicate to getCmpSelInstrCost.
If the condition of a select is a compare, pass its predicate to
TTI::getCmpSelInstrCost to get a more accurate cost value instead
of passing BAD_ICMP_PREDICATE.

I noticed that the commit message from D90070 had a comment about the
vectorized select predicate possibly being composed of other compares with
different predicate values, but I wasn't able to construct an example
where this was an actual issue. If this is an issue, I guess we could
add another check that the block isn't predicated for any reason.

Reviewed By: dmgreen, fhahn

Differential Revision: https://reviews.llvm.org/D114646
2021-12-06 11:41:27 +00:00
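
A minimal sketch of the cost query described in the change above (not the exact LoopVectorize code; the helper name getSelectCostSketch is illustrative): when the select's condition is a single compare, its predicate is forwarded to TTI::getCmpSelInstrCost instead of BAD_ICMP_PREDICATE.

  #include "llvm/Analysis/TargetTransformInfo.h"
  #include "llvm/IR/Instructions.h"
  using namespace llvm;

  static InstructionCost getSelectCostSketch(const TargetTransformInfo &TTI,
                                             SelectInst *Sel, Type *VecValTy,
                                             Type *VecCondTy,
                                             TTI::TargetCostKind CostKind) {
    // Default: predicate unknown.
    CmpInst::Predicate Pred = CmpInst::BAD_ICMP_PREDICATE;
    // If the condition is a compare, its predicate allows a more precise cost.
    if (auto *Cmp = dyn_cast<CmpInst>(Sel->getCondition()))
      Pred = Cmp->getPredicate();
    return TTI.getCmpSelInstrCost(Instruction::Select, VecValTy, VecCondTy,
                                  Pred, CostKind);
  }
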
Florian Hahn 7c3c352d82
[VPlan] Separate ctors for VPWidenIntOrFpInduction. (NFC)
VPWidenIntOrFpInductionRecipes can either be constructed with a PHI and
an optional cast or a PHI and a trunc instruction. Reflect this in 2
separate constructors. This also simplifies a follow-up change.
2021-12-05 12:15:18 +00:00
Alexey Bataev ba74bb3a22 [SLP]Fix reused extracts cost.
If an extractelement instruction is used multiple times in different tree
entries (either vectorized or gathered), we need to compensate for the scalar
cost of such instructions. They are completely removed if all users are part
of the tree, but the cost should be compensated only once per instruction.

Differential Revision: https://reviews.llvm.org/D114958
2021-12-02 10:52:00 -08:00
Alexey Bataev 8ceccbd321 [SLP]Outline and fix code for finding common insertelement vectors.
Outline the code for finding common vectors in insertelement instructions
into a separate function for future patches. This also improves the process by
adding extra checks for early exit and fixes a bug where a match was always
found because of an erroneous comparison of the same values.

Differential Revision: https://reviews.llvm.org/D114909
2021-12-02 09:18:25 -08:00
Alexey Bataev 92fbd76af5 [SLP]Improve registering and merging of compatible shuffles.
If several shuffle instructions are emitted, some of them might be the same
as, or compatible with (less defined than), previously emitted ones. Such
shuffles can be removed safely, improving the total cost of the vectorized
code.

Differential Revision: https://reviews.llvm.org/D114087
2021-12-02 08:48:29 -08:00
Alexey Bataev afc9e7517a [SLP]Improve cost model for the shuffled extracts.
Improved the cost calculation for shuffled extracts, where possible. The cost
of the extracted scalars needs to be calculated if some of their users are not
insertelements; also improved the total estimation of the shuffled scalars
used in insertelement build vectors.

Differential Revision: https://reviews.llvm.org/D113782
2021-12-01 08:10:57 -08:00
Alexey Bataev cc30fbf242 [SLP]Introduce isUndefVector function to check for undef vectors.
An undefined vector might be not only an UndefValue but also a constant
vector with undef or poison elements; we need to check for this kind of undef
too.

Differential Revision: https://reviews.llvm.org/D114873
2021-12-01 07:46:10 -08:00
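
A sketch of the kind of check described above, under simplified assumptions (the actual SLPVectorizer helper may differ): a value counts as an undef vector if it is an UndefValue itself, or a fixed-width constant vector whose elements are all undef or poison.

  #include "llvm/IR/Constants.h"
  #include "llvm/IR/DerivedTypes.h"
  using namespace llvm;

  static bool isUndefVectorSketch(const Value *V) {
    if (isa<UndefValue>(V)) // Covers undef and poison of any type.
      return true;
    auto *C = dyn_cast<Constant>(V);
    auto *VecTy = C ? dyn_cast<FixedVectorType>(C->getType()) : nullptr;
    if (!VecTy)
      return false;
    for (unsigned I = 0, E = VecTy->getNumElements(); I != E; ++I) {
      Constant *Elt = C->getAggregateElement(I);
      // PoisonValue is a subclass of UndefValue, so both kinds are accepted.
      if (!Elt || !isa<UndefValue>(Elt))
        return false;
    }
    return true;
  }
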
Alexey Bataev ddce6e0561 [SLP]Improve vectorization of cmp instructions sequences.
Make a final attempt to vectorize bundles of compatible cmp instructions after
all other instructions have been processed.

Metric: SLP.NumVectorInstructions

Program                                                                             results results0 diff
        test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test    1.00    5.00  400.0%
                              test-suite :: MultiSource/Benchmarks/PAQ8p/paq8p.test    8.00   11.00   37.5%
                    test-suite :: MultiSource/Benchmarks/Olden/voronoi/voronoi.test   20.00   26.00   30.0%
                test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 1344.00 1648.00   22.6%
               test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 1344.00 1648.00   22.6%
                              test-suite :: MultiSource/Benchmarks/Olden/bh/bh.test  102.00  124.00   21.6%
                test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/CoMD/CoMD.test  118.00  133.00   12.7%
          test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 3233.00 3554.00    9.9%
           test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 3233.00 3554.00    9.9%
                        test-suite :: MultiSource/Benchmarks/Olden/power/power.test   64.00   70.00    9.4%
           test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 7879.00 8604.00    9.2%
           test-suite :: MultiSource/Benchmarks/Prolangs-C/simulator/simulator.test   50.00   54.00    8.0%
                        test-suite :: MultiSource/Applications/sqlite3/sqlite3.test   27.00   29.00    7.4%
             test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 8345.00 8955.00    7.3%
     test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test  694.00  738.00    6.3%
                        test-suite :: MultiSource/Benchmarks/MallocBench/gs/gs.test  361.00  382.00    5.8%
                      test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  409.00  430.00    5.1%
     test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test  140.00  147.00    5.0%
      test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test  140.00  147.00    5.0%
             test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 4013.00 4206.00    4.8%
                       test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test  966.00 1011.00    4.7%
                           test-suite :: SingleSource/Benchmarks/Misc/oourafft.test   65.00   68.00    4.6%
                            test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 4219.00 4381.00    3.8%
                    test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 1911.00 1973.00    3.2%
      test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test   62.00   64.00    3.2%
     test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test   62.00   64.00    3.2%
                 test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test  852.00  877.00    2.9%
                  test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test  852.00  877.00    2.9%
                       test-suite :: MultiSource/Applications/JM/lencod/lencod.test 1624.00 1668.00    2.7%
                         test-suite :: MultiSource/Benchmarks/McCat/18-imp/imp.test   39.00   40.00    2.6%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset.test  613.00  624.00    1.8%
      test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test  378.00  383.00    1.3%
      test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test  293.00  295.00    0.7%
            test-suite :: MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test  297.00  299.00    0.7%
      test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test 5522.00 5534.00    0.2%
     test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test 5522.00 5534.00    0.2%

Differential Revision: https://reviews.llvm.org/D114799
2021-12-01 07:26:29 -08:00
Florian Hahn e44298a8f8
[LV] Move code from vectorizeMemoryInstruction to recipe's execute().
The code in widenMemoryInstruction has already been transitioned
to only rely on information provided by VPWidenMemoryInstructionRecipe
directly.

Moving the code directly to VPWidenMemoryInstructionRecipe::execute
completes the transition for the recipe.

It provides the following advantages:

1. Less indirection, easier to see what's going on.
2. Removes accesses to fields of ILV.

2) in particular ensures that no dependencies on
fields in ILV for vector code generation are re-introduced.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D114324
2021-12-01 14:56:51 +00:00
Alexey Bataev dce6c434ea [SLP]Improve isFixedVectorShuffle and its use.
Extended support for undefined source vectors, extract indices, and non-fixed
vector types; also, there is no need to check the parent of extractelement
instructions with constant indices.

Differential Revision: https://reviews.llvm.org/D114121
2021-11-30 10:10:20 -08:00
Alexey Bataev fc57cfad3c [SLP][NFC]Move static function to make it visible in member function,
NFC.
2021-11-30 09:38:46 -08:00
Philip Reames c41b318423 [LV] Remove unneeded cast to Operator [NFC] 2021-11-30 08:45:13 -08:00
Florian Hahn dab776dd0f
[LV] Move code from widenSelectInstruction to VPWidenSelectRecipe. (NFC)
The code in widenSelectInstruction has already been transitioned
to only rely on information provided by VPWidenSelectRecipe directly.

Moving the code directly to VPWidenSelectRecipe::execute completes
the transition for the recipe.

It provides the following advantages:

1. Less indirection, easier to see what's going on.
2. Removes accesses to fields of ILV.

2) in particular ensures that no dependencies on
fields in ILV for vector code generation are re-introduced.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D114323
2021-11-30 10:32:44 +00:00
Roman Lebedev 8cd782487f
[X86][LoopVectorize] "Fix" `X86TTIImpl::getAddressComputationCost()`
We ask `TTI.getAddressComputationCost()` about the cost of computing a vector
address, and then multiply it by the vector width. This doesn't make any
sense: it implies that we'd do a vector GEP and then scalarize the vector of
pointers, but there is no such thing in the vectorized IR; we perform scalar
GEPs.

This is *especially* bad on X86, and was effectively prohibiting any scalarized
vectorization of gathers/scatters, because `X86TTIImpl::getAddressComputationCost()`
says that the cost of vector address computation is `10` compared to `1` for
scalar.

The computed costs are similar to the ones with D111222+D111220,
but we end up without masked memory intrinsics that we'd then have to
expand later on, without much luck. (D111363)

Differential Revision: https://reviews.llvm.org/D111460
2021-11-30 10:47:56 +03:00
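
An illustrative sketch of the cost-model point made above (hypothetical helper, not the actual patch): a scalarized gather/scatter emits one scalar GEP per lane, so the address cost should be the scalar address-computation cost times the number of lanes rather than the vector address-computation cost times the vector width.

  #include "llvm/Analysis/TargetTransformInfo.h"
  using namespace llvm;

  static InstructionCost scalarizedAddressCost(const TargetTransformInfo &TTI,
                                               Type *ScalarPtrTy, unsigned VF) {
    // One *scalar* address computation per lane; on X86 the scalar query
    // returns a small cost, whereas the vector-of-pointers query is heavily
    // penalized even though no vector GEP is ever emitted.
    InstructionCost Cost = TTI.getAddressComputationCost(ScalarPtrTy);
    Cost *= VF;
    return Cost;
  }
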
Florian Hahn fd71159f64
[LV] Move code from widenInstruction to VPWidenRecipe. (NFC)
The code in widenInstruction has already been transitioned to
only rely on information provided by VPWidenRecipe directly.

Moving the code directly to VPWidenRecipe::execute completes
the transition for the recipe.

It provides the following advantages:

1. Less indirection, easier to see what's going on.
2. Removes accesses to fields of ILV.

2) in particular ensures that no dependencies on
fields in ILV for vector code generation are re-introduced.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D114322
2021-11-29 09:09:00 +00:00
Florian Hahn 3495090b9b
[LV] Move code from widenGEP to VPWidenGEPRecipe (NFC).
The code in widenGEP has already been transitioned to only rely on
information provided by VPWidenGEPRecipe directly.

Moving the code directly to VPWidenGEPRecipe::execute completes
the transition for the recipe.

It provides the following advantages:

1. Less indirection, easier to see what's going on.
2. Removes accesses to fields of ILV.

2) in particular ensures that no dependencies on
fields in ILV for GEP code generation are re-introduced.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D114321
2021-11-28 18:29:18 +00:00
Sander de Smalen 28a4deab92 [LV] Fix incorrectly marking a pointer indvar as 'scalar'.
collectLoopScalars should only add non-uniform nodes to the list if they
are used by a load/store instruction that is marked as CM_Scalarize.

Before this patch, the LV incorrectly marked pointer induction variables
as 'scalar' when they were required to be widened by something else,
such as a compare instruction, and weren't used by a node marked as
'CM_Scalarize'. This case is covered by sve-widen-phi.ll.

This change also allows removing some code where the LV tried to
widen the PHI nodes with a stepvector, even though they were marked as
'scalarAfterVectorization'. Now that the LV is more careful about
marking instructions that need widening as 'scalar', this code has
become redundant.

Differential Revision: https://reviews.llvm.org/D114373
2021-11-28 09:49:28 +00:00
Alexey Bataev fc0aacf324 [SLP]Improve analysis/emission of vector operands for alternate nodes.
The compiler has an analysis for perfect diamond matching, but it does not
support nodes with main/alternate opcodes. The problem is that the scalars
themselves are different and might not match other nodes directly, yet the
operands and main/alternate opcodes might match, so the compiler might reuse
some previously emitted vector instructions. This analysis needs to be
included in the cost model and in the actual vector instruction emission
process.

Differential Revision: https://reviews.llvm.org/D114101
2021-11-26 06:38:02 -08:00
David Sherwood e20391fc5d [LoopVectorize] When tail-folding, don't always predicate uniform loads
In VPRecipeBuilder::handleReplication, if we believe the instruction
is predicated, we proceed to create new VP region blocks even
when the load is uniform and only predicated due to tail-folding.

I have updated isPredicatedInst to avoid treating a uniform load as
predicated when tail-folding, which means we can do a single scalar
load and a vector splat of the value.

Tests added here:

  Transforms/LoopVectorize/AArch64/tail-fold-uniform-memops.ll

Differential Revision: https://reviews.llvm.org/D112552
2021-11-26 11:30:54 +00:00
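
A sketch of the codegen this enables for a uniform load under tail-folding (illustrative helper, not the actual recipe code): a single scalar load followed by a vector splat, instead of a predicated per-lane load.

  #include "llvm/IR/IRBuilder.h"
  using namespace llvm;

  static Value *loadUniformAndSplat(IRBuilderBase &Builder, Value *UniformPtr,
                                    Type *ScalarTy, ElementCount VF) {
    // One scalar load of the uniform address...
    Value *Scalar = Builder.CreateLoad(ScalarTy, UniformPtr, "uniform.load");
    // ...broadcast to all vector lanes.
    return Builder.CreateVectorSplat(VF, Scalar, "uniform.splat");
  }
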
Alexey Bataev 4675a1654c Revert "[SLP]Improve analysis/emission of vector operands for alternate nodes."
This reverts commit 496254cf80 to fix
compiler crashes reported in D114101#3152982.
2021-11-25 05:19:49 -08:00
Alexey Bataev 496254cf80 [SLP]Improve analysis/emission of vector operands for alternate nodes.
The compiler has an analysis for perfect diamond matching, but it does not
support nodes with main/alternate opcodes. The problem is that the scalars
themselves are different and might not match other nodes directly, yet the
operands and main/alternate opcodes might match, so the compiler might reuse
some previously emitted vector instructions. This analysis needs to be
included in the cost model and in the actual vector instruction emission
process.

Differential Revision: https://reviews.llvm.org/D114101
2021-11-24 12:55:24 -08:00
Florian Hahn 2897b67665
[LV] Use OrigLoop instead of induction to get function. (NFC)
Upcoming changes will result in Induction not being set/used in some
cases. Use OrigLoop to get the function instead.
2021-11-24 20:17:44 +00:00
Florian Hahn 8b86752c60
[VPlan] Remove unused VPInstruction constructor. (NFC)
VPInstruction inherits from VPValue, so the constructor taking
ArrayRef<VPValue*> covers all cases that would be covered by the removed
constructor.
2021-11-24 14:06:50 +00:00
Rosie Sumpter df32a39dd0 [LoopVectorize][CostModel] Update cost model for fmuladd intrinsic
This patch updates the cost model for ordered reductions so that a call
to the llvm.fmuladd intrinsic is modelled as a normal fmul instruction
plus the cost of an ordered fadd reduction.

Differential Revision: https://reviews.llvm.org/D111630
2021-11-24 08:50:05 +00:00
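
A sketch of the costing described above, assuming the generic TTI hooks are used directly (the actual code sits in LoopVectorize's cost model, and strictness is conveyed here only via the fast-math flags passed to the reduction-cost hook): the fmuladd call is priced as a plain fmul plus the fadd reduction.

  #include "llvm/Analysis/TargetTransformInfo.h"
  using namespace llvm;

  static InstructionCost
  orderedFMulAddReductionCost(const TargetTransformInfo &TTI, VectorType *VecTy,
                              FastMathFlags FMF, TTI::TargetCostKind CostKind) {
    // Cost of the multiply part of llvm.fmuladd, done as a normal vector fmul.
    InstructionCost MulCost =
        TTI.getArithmeticInstrCost(Instruction::FMul, VecTy, CostKind);
    // Cost of folding the products into the scalar accumulator.
    InstructionCost RedCost =
        TTI.getArithmeticReductionCost(Instruction::FAdd, VecTy, FMF, CostKind);
    return MulCost + RedCost;
  }
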
Rosie Sumpter 2d33327f9d [LoopVectorize] Print fast-math flags for VPReductionRecipe 2021-11-24 08:50:05 +00:00
Rosie Sumpter 991074012a [LoopVectorize] Propagate fast-math flags for VPInstruction
In-loop vector reductions which use the llvm.fmuladd intrinsic involve
the creation of two recipes: a VPReductionRecipe for the fadd and a
VPInstruction for the fmul. If the call to llvm.fmuladd has fast-math flags,
these should be propagated through to the fmul instruction, so an
interface setFastMathFlags has been added to the VPInstruction class to
enable this.

Differential Revision: https://reviews.llvm.org/D113125
2021-11-24 08:50:04 +00:00
Rosie Sumpter c2441b6b89 [LoopVectorize] Add vector reduction support for fmuladd intrinsic
Enables LoopVectorize to handle reduction patterns involving the
llvm.fmuladd intrinsic.

Differential Revision: https://reviews.llvm.org/D111555
2021-11-24 08:50:04 +00:00
Diego Caballero 4348cd42c3 [LV] Drop integer poison-generating flags from instructions that need predication
This patch fixes PR52111. The problem is that LV propagates poison-generating
flags (`nuw`/`nsw`, `exact` and `inbounds`) in instructions that contribute to
the address computation of widened loads/stores that are guarded by a
condition. It may happen that when the code is vectorized and the control flow
within the loop is linearized, these flags lead to generating a poison value
that is effectively used as the base address of the widened load/store. The
fix drops all the integer poison-generating flags from instructions that
contribute to the address computation of a widened load/store whose original
instruction was in a basic block that needed predication and is not predicated
after vectorization.

Reviewed By: fhahn, spatel, nlopes

Differential Revision: https://reviews.llvm.org/D111846
2021-11-22 10:57:29 +00:00
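
A sketch of the mechanism (illustrative; the real patch identifies the affected instructions via the recipes during vectorization): once an instruction is known to feed the address of a widened memory access whose original block needed predication but is unconditional after linearization, its integer poison-generating flags are cleared.

  #include "llvm/ADT/SmallPtrSet.h"
  #include "llvm/IR/Instruction.h"
  using namespace llvm;

  static void
  dropAddressPoisonFlags(const SmallPtrSetImpl<Instruction *> &AddrDefs) {
    for (Instruction *I : AddrDefs)
      // Clears the instruction's poison-generating flags so that executing
      // the now-unconditional address computation cannot produce poison for
      // lanes that the original control flow would have masked off.
      I->dropPoisonGeneratingFlags();
  }
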
Florian Hahn cf8efbd30e
[VPlan] Wrap vector loop blocks in region.
A first step towards modeling preheader and exit blocks in VPlan as well.
Keeping the vector loop in a region allows for changing the VF as we
traverse region boundaries.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D113182
2021-11-20 17:59:48 +00:00
Florian Hahn 76effb001d
[LV] Remove obsolete comment about creating a dummy block (NFC)
No dummy pre-entry block is created since a6c4969f5f. The comment is
stale now and can be removed.

Mentioned by @Ayal in D113182.
2021-11-19 17:17:04 +00:00
Alexey Bataev d1fdf867b1 [SLP][NFC]Introduce TreeEntry::getVectorFactor member function, NFC.
Added TreeEntry::getVectorFactor to get the final vectorization factor,
to simplify the code.

Differential Revision: https://reviews.llvm.org/D114190
2021-11-19 06:32:19 -08:00
David Sherwood 670dd40244 [Analysis] Fix getNumberOfParts to return 0 when the answer is unknown
When asking how many parts are required for a scalable vector type
there are occasions when it cannot be computed. For example, <vscale x 1 x i3>
is one such vector for AArch64+SVE because at the moment no matter how we
promote the i3 type we never end up with a legal vector. This means
that getTypeConversion returns TypeScalarizeScalableVector as the
LegalizeKind, and then getTypeLegalizationCost returns an invalid cost.
This then causes BasicTTImpl::getNumberOfParts to dereference an invalid
cost, which triggers an assert. This patch changes getNumberOfParts to
return 0 for such cases, since the definition of getNumberOfParts in
TargetTransformInfo.h states that we can use a return value of 0 to represent
an unknown answer.

Currently, LoopVectorize.cpp is the only place where we need to check for
0 as a return value, because all other instances will not currently
ask for the number of parts for <vscale x 1 x iX> types.

In addition, I have changed the target-independent interface for
getNumberOfParts to return 1 and assume there is a single register
that can fit the type. The loop vectoriser has lots of tests that are
target-independent and they relied upon the 0 value to mean the
answer is known and that we are not scalarising the vector.

I have added tests here that show we correctly return an invalid cost
for VF=vscale x 1 when the loop contains unusual types such as i7:

  Transforms/LoopVectorize/AArch64/sve-inductions-unusual-types.ll

Differential Revision: https://reviews.llvm.org/D113772
2021-11-17 12:07:09 +00:00
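
A sketch of the convention adopted above, simplified from the description (not the verbatim BasicTTIImpl code): when type legalization produces an invalid cost, the number of parts is reported as 0, and the caller treats 0 as "unknown" and bails out with an invalid cost.

  #include "llvm/Support/InstructionCost.h"
  using namespace llvm;

  // Mirrors the "0 means unknown" return value of getNumberOfParts.
  static unsigned numberOfPartsSketch(InstructionCost LegalizationCost) {
    return LegalizationCost.isValid() ? *LegalizationCost.getValue() : 0;
  }

  // Caller side: a count of 0 propagates as an invalid vectorization cost.
  static InstructionCost scaleByParts(InstructionCost PerPartCost,
                                      unsigned NumParts) {
    if (NumParts == 0)
      return InstructionCost::getInvalid();
    PerPartCost *= NumParts;
    return PerPartCost;
  }
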
Alexey Bataev 900cc1a226 [SLP]Improve cost of the gather nodes.
There is no need to count the final shuffle cost for constants; gathering
constants is just a constant vector plus extra inserts, if required.

Differential Revision: https://reviews.llvm.org/D113770
2021-11-16 06:25:07 -08:00
Alexey Bataev cdf8a53c1d [SLP]Fix windows build, NFC.
Need to add the `IndexIdx` variable to the list of captures.
2021-11-16 06:09:51 -08:00
Alexey Bataev aa9bbb64be [SLP]Adjust GEP indices types when trying to build entries.
Need to adjust the types of GEP indices when building the tree
entries/operands. Otherwise some of the nodes might differ and the
vectorizer is unable to correctly find them and count their cost.

Differential Revision: https://reviews.llvm.org/D113792
2021-11-16 05:44:33 -08:00
Alexey Bataev 224e46d355 [SLP][DOT][NFCI]Output all scalars for the splats, not only the first one. 2021-11-15 10:54:26 -08:00
Alexey Bataev 036207d5f2 [SLP]Improve splat detection.
A bundle of scalars can be treated as a splat not only if all elements
are the same but also if some of them are undef values.

Differential Revision: https://reviews.llvm.org/D113774
2021-11-15 07:50:34 -08:00
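
A sketch of the relaxed splat check (illustrative, not the exact SLPVectorizer code): a bundle still counts as a splat if every defined element is the same value, with undef elements allowed to match anything.

  #include "llvm/ADT/ArrayRef.h"
  #include "llvm/IR/Constants.h"
  using namespace llvm;

  static bool isSplatAllowingUndef(ArrayRef<Value *> VL) {
    Value *Common = nullptr;
    for (Value *V : VL) {
      if (isa<UndefValue>(V)) // Undef lanes never break the splat.
        continue;
      if (!Common)
        Common = V;       // First defined element fixes the splat value.
      else if (Common != V)
        return false;     // Two different defined elements: not a splat.
    }
    return true;          // All-same or all-undef.
  }
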
Alexey Bataev b85152f8b1 [SLP][NFC]Use `isa_and_nonnull` and fix comment, NFC. 2021-11-15 06:49:33 -08:00
Alexey Bataev 6fb5bed7d1 [SLP]Do not create unused gather nodes for scalar arguments of vector intrinsics.
If a vector intrinsic has a scalar argument, we currently still create
a tree entry for this argument. This entry is not used; it just consumes
resources and increases the cost of the tree.

Differential Revision: https://reviews.llvm.org/D113806
2021-11-15 06:11:19 -08:00
Sander de Smalen f835fe8ef7 [LV] Rename blockNeedsPredication to blockNeedsPredicationForAnyReason.
The interface is a convenience function to ask if a block requires
predication when widening, but it's important that there are two
separate concepts to consider:
(A) The block was predicated in the original loop.
(B) The block was unpredicated in the original loop, but requires
    predication because of tail folding.

In the case of (B) we know that at least one lane of the vector will
be executed, which means we can implement a load from a uniform address
with a scalar load + splat (D112552). In the case of predication because
of (A), we cannot do this, because the scalar load itself requires
predication.

The name 'blockNeedsPredication' does not make the distinction between
(A) and (B), hence the reason to rename it.

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D113392
2021-11-15 08:04:20 +00:00
Alexey Bataev 352c46e707 [SLP]Improve vectorization of split loads.
Need to fix the cost estimation for split loads: since we already look at the
subregisters, there is no need to permute them; we just need to estimate the
subregister insert, if it is smaller than the real register. Also, with
split loads it might already be profitable to vectorize smaller trees
with gathering of the loads.

Differential Revision: https://reviews.llvm.org/D107188
2021-11-12 06:13:22 -08:00
Florian Hahn 93931d78cf
[LV] Do not rely on InductionDescriptor::getCastInsts. (NFC)
Now that CastDef is passed as VPValue, there is no need to access
ID.getCastInsts, as CastDef can instead be checked.
2021-11-10 13:03:44 +00:00
Florian Hahn e7f1232cb7
[LV] Move optimized IV recipes to phi section of header after sinking.
Unfortunately, sinking recipes for first-order recurrences relies on
the original position of recipes. So if a recipe needs to be sunk after
an optimized induction, it needs to stay in the original position until
sinking is done. This is causing PR52460.

To fix the crash, keep the recipes in the original position until
sink-after is done.

Post-commit follow-up to c45045bfd0 to address PR52460.
2021-11-10 11:41:08 +00:00
Kerry McLaughlin 6f16ee5e14 Revert "[LoopVectorize] Extract the last lane from a uniform store"
This reverts commit 0d748b4d32.
This is causing some failures when building Spec2017 with scalable
vectors. Reverting to investigate.
2021-11-10 11:21:19 +00:00
Kerry McLaughlin 0d748b4d32 [LoopVectorize] Extract the last lane from a uniform store
Changes VPReplicateRecipe to extract the last lane from an unconditional,
uniform store instruction. collectLoopUniforms will also add stores to
the list of uniform instructions where Legal->isUniformMemOp is true.

setCostBasedWideningDecision now sets the widening decision for
all uniform memory ops to Scalarize, where previously GatherScatter
may have been chosen for scalable stores.

This fixes an assert ("Cannot yet scalarize uniform stores") in
setCostBasedWideningDecision when we have a loop containing a
uniform i1 store and a scalable VF, which we cannot create a scatter for.

Reviewed By: sdesmalen, david-arm, fhahn

Differential Revision: https://reviews.llvm.org/D112725
2021-11-09 14:43:16 +00:00
Florian Hahn acbefbf19f [VPlan] Guard code to dump instructions after d9361bfbe2.
This should fix build failures when built without assertions enabled,
e.g.
    https://lab.llvm.org/buildbot/#/builders/205/builds/172
2021-11-09 10:29:05 +00:00
Florian Hahn d9361bfbe2 [VPlan] Add initial inner-loop VPlan verification.
This patch adds a function to verify general properties of VPlans. The
first check makes sure that all phi-like recipes are at the beginning of
a block, with no other recipes in between.

Note that this may not hold for VPBlendRecipes at the moment,
as other recipes may be inserted before the VPBlendRecipe during mask
creation.

Note that this patch depends on D111300 and D111301, which fix code that
breaks the checked invariant.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D111302
2021-11-09 10:18:28 +00:00
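
A generic sketch of the invariant the new verifier checks (the real code lives in VPlanVerifier and walks VPlan's own recipe classes; the predicate parameter here is a stand-in for "is this recipe phi-like"): within a block, all phi-like recipes must form a prefix, so once a non-phi recipe appears, no further phi-like recipe may follow.

  #include <functional>
  #include <vector>

  template <typename RecipeT>
  static bool
  phisFormPrefix(const std::vector<RecipeT> &Block,
                 const std::function<bool(const RecipeT &)> &IsPhiLike) {
    bool SeenNonPhi = false;
    for (const RecipeT &R : Block) {
      if (IsPhiLike(R)) {
        if (SeenNonPhi)
          return false; // Phi-like recipe after a non-phi recipe: invalid plan.
      } else {
        SeenNonPhi = true;
      }
    }
    return true;
  }
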
Florian Hahn e3bfb6a146
[VPlan] Make sure recurrence splice is not inserted between phis.
All phi-like recipes should be at the beginning of a VPBasicBlock with
no other recipes in between. Ensure that the recurrence-splicing recipe
is not added between phi-like recipes, but after them.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D111301
2021-11-08 17:42:32 +00:00
Sander de Smalen 2829376bb2 [LV] Use VScaleForTuning to fine-tune the cost per lane.
When targeting a specific CPU with scalable vectorization, the knowledge
of that particular CPU's vscale value can be used to tune the cost-model
and make the cost per lane less pessimistic.

If the target implements 'TTI.getVScaleForTuning()', the cost-per-lane
is calculated as:

  Cost / (VScaleForTuning * VF.KnownMinLanes)

Otherwise, it assumes a value of 1, meaning that the behavior
is unchanged and the cost is calculated as:

  Cost / VF.KnownMinLanes

Reviewed By: kmclaughlin, david-arm

Differential Revision: https://reviews.llvm.org/D113209
2021-11-08 16:59:46 +00:00
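
A sketch of the divisor logic given by the formulas above (simplified to plain arithmetic; the real code uses InstructionCost inside LoopVectorize's cost model): a VScaleForTuning of 1 leaves the existing behavior unchanged.

  #include <cstdint>

  static double costPerLaneSketch(double Cost, unsigned KnownMinLanes,
                                  uint64_t VScaleForTuning /* 1 if unknown */) {
    // With a known vscale for the target CPU, each known-min lane stands for
    // VScaleForTuning real lanes, so the per-lane cost becomes smaller.
    return Cost / double(VScaleForTuning * KnownMinLanes);
  }
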
David Sherwood c63b0f471b [NFC][LoopVectorize] Make the createStepForVF interface more caller-friendly
The common use case for calling createStepForVF is currently something
like:

  Value *Step = createStepForVF(Builder, ConstantInt::get(Ty, UF), VF);

and it makes more sense to reduce overall lines of code and change the
function to let it create the constant instead. With my patch this
becomes:

  Value *Step = createStepForVF(Builder, Ty, VF, UF);

and the ConstantInt is created inside createStepForVF instead. A side-effect
of this is that the code in createStepForVF also becomes simpler.

As part of this patch I've also replaced some calls to getRuntimeVF
with calls to createStepForVF, i.e.

  getRuntimeVF(Builder, Count->getType(), VFactor * UFactor) ->
  createStepForVF(Builder, Count->getType(), VFactor, UFactor)

because this feels semantically better.

Differential Revision: https://reviews.llvm.org/D113122
2021-11-08 15:14:14 +00:00
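
A sketch of what the new helper could look like internally (a hypothetical implementation consistent with the calls shown above, not necessarily the committed code): the constant Step * VF is built inside the function, and for scalable VFs it is additionally scaled by the runtime vscale.

  #include "llvm/IR/IRBuilder.h"
  using namespace llvm;

  static Value *createStepForVFSketch(IRBuilderBase &B, Type *Ty,
                                      ElementCount VF, int64_t Step) {
    // Step * known-minimum number of lanes, as a constant of the given type.
    Constant *StepVal = ConstantInt::get(Ty, Step * VF.getKnownMinValue());
    // Scalable VFs additionally multiply by the runtime vscale.
    return VF.isScalable() ? B.CreateVScale(StepVal) : StepVal;
  }
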