In particular, this breaks vectorization of insertelements where some of
the intermediate (i.e. not last) insertelements are used externally.
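For illustration only (a hypothetical IRBuilder snippet, not code from the patch), this is the shape of the affected pattern: a buildvector chain whose intermediate insertelement also has an external use (here, a store):
```
#include "llvm/IR/Constants.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

// Builds a <2 x float> with two insertelements; the first (intermediate)
// insertelement is also stored to memory, i.e. used externally, which is the
// pattern whose vectorization regressed.
int main() {
  LLVMContext Ctx;
  Module M("demo", Ctx);
  Type *FloatTy = Type::getFloatTy(Ctx);
  Type *V2F = FixedVectorType::get(FloatTy, 2);
  Type *PtrTy = PointerType::getUnqual(V2F);
  FunctionType *FTy =
      FunctionType::get(V2F, {FloatTy, FloatTy, PtrTy}, /*isVarArg=*/false);
  Function *F = Function::Create(FTy, Function::ExternalLinkage, "build", M);
  IRBuilder<> B(BasicBlock::Create(Ctx, "entry", F));

  Value *V0 = B.CreateInsertElement(PoisonValue::get(V2F), F->getArg(0),
                                    uint64_t(0), "v0");
  B.CreateStore(V0, F->getArg(2)); // external use of the intermediate value
  Value *V1 = B.CreateInsertElement(V0, F->getArg(1), uint64_t(1), "v1");
  B.CreateRet(V1);

  M.print(outs(), nullptr);
  return 0;
}
```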
Fixes PR52275
Fixes #51617
Differential Revision: https://reviews.llvm.org/D119679
Note that this doesn't actually cause the top-level predicate to become a non-union just yet.
The * above comes from a case in the LoopVectorizer where a predicate which is later proven no longer blocks vectorization, due to a change from checking whether a predicate exists to checking whether the predicate is possibly false.
This reverts commit 77a0da926c as we've
received multiple reports of this significantly impacting performance,
in ways that don't seem to just be target-specific cost models going
wrong. I would offer some reproducers, but the test changes here seem to
be full of them!
Reverting for now and hopefully we can remove the "hack" more carefully
as we go.
Move the creation of the induction step out of emitTransformedIndex to the
callers. In some places (e.g. widenIntOrFpInduction) the step is already
created. Passing the step in ensures the steps are kept in sync.
This makes the function independent of shared state in ILV (ensuring no
new dependencies on things like the cost model are introduced) and allows
it to be used directly in recipes' ::execute functions.
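As a rough analogy in plain C++ (hypothetical names, not the actual LoopVectorize signatures), the shape of the change is to accept the step as an argument instead of (re)creating it inside the function:
```
#include <cstdint>

// Hypothetical stand-in for the induction descriptor state that used to be
// consulted inside emitTransformedIndex.
struct InductionInfo {
  int64_t Start;
  int64_t Step;
};

// Before: the step was derived inside the function from shared state.
int64_t transformedIndexOld(const InductionInfo &II, int64_t Index) {
  int64_t Step = II.Step; // recreated here, can drift from the caller's copy
  return II.Start + Index * Step;
}

// After: callers (which in some cases already have the step, e.g.
// widenIntOrFpInduction) create it once and pass it in, keeping uses in sync.
int64_t transformedIndexNew(int64_t Start, int64_t Step, int64_t Index) {
  return Start + Index * Step;
}

int main() {
  InductionInfo II{/*Start=*/4, /*Step=*/3};
  int64_t Step = II.Step; // created once by the caller
  return transformedIndexOld(II, 5) == transformedIndexNew(II.Start, Step, 5)
             ? 0
             : 1;
}
```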
The vectorizer will at times choose to "vectorize" loops with a scalar
factor (VF=1) with interleaving (IC > 1). This can occasionally produce
better code than the unroller (notably for reductions, where it can
produce independent reduction chains that are combined after the loop).
At times this is not very beneficial though, for example when runtime
checks are needed or when the scalar code requires predication.
This addresses the second point, preventing the vectorizer from
interleaving when the scalar loop will require predication. This
prevents it from making a bit of a mess that is worse than the original
and better left for the unroller to unroll if beneficial. It helps
reverse some of the regressions from D118090.
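For example (a C++ sketch of the transformation, not vectorizer output), interleaving a scalar reduction with VF=1, IC=2 yields two independent reduction chains that are combined after the loop:
```
#include <cstddef>

// Plain scalar reduction.
int sum_scalar(const int *a, size_t n) {
  int sum = 0;
  for (size_t i = 0; i < n; ++i)
    sum += a[i];
  return sum;
}

// Roughly what VF=1, IC=2 interleaving produces: two independent accumulators
// (breaking the loop-carried dependence into two chains), combined after the
// loop, plus a scalar remainder loop.
int sum_interleaved(const int *a, size_t n) {
  int sum0 = 0, sum1 = 0;
  size_t i = 0;
  for (; i + 1 < n; i += 2) {
    sum0 += a[i];
    sum1 += a[i + 1];
  }
  for (; i < n; ++i) // remainder
    sum0 += a[i];
  return sum0 + sum1;
}
```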
Differential Revision: https://reviews.llvm.org/D118566
D43208 extracted `useEmulatedMaskMemRefHack()` from legality into the cost model.
What it essentially does is prevent scalarized vectorization of masked memory operations:
```
// TODO: Cost model for emulated masked load/store is completely
// broken. This hack guides the cost model to use an artificially
// high enough value to practically disable vectorization with such
// operations, except where previously deployed legality hack allowed
// using very low cost values. This is to avoid regressions coming simply
// from moving "masked load/store" check from legality to cost model.
// Masked Load/Gather emulation was previously never allowed.
// Limited number of Masked Store/Scatter emulation was allowed.
```
While I don't really understand what specifically `is completely broken`
was referring to, I believe that at least on X86 with AVX2-or-later
this is no longer true (or at least, I would like to know what is still broken).
So I would like to follow suit after D111460 and likewise disable that hack for AVX2+.
But since this was added for X86 specifically, let's instead completely remove this hack.
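For reference, the affected loops are ones like the C++ sketch below, where the memory accesses are conditional, so vectorization needs masked loads/stores or scalarized, predicated accesses:
```
#include <cstddef>

// When vectorized, the load of b[i] and the store to a[i] must be masked (or
// scalarized behind per-lane branches), since they only execute when cond[i]
// is nonzero. useEmulatedMaskMemRefHack() made such emulated masked accesses
// look prohibitively expensive to the cost model.
void conditional_update(int *a, const int *b, const unsigned char *cond,
                        size_t n) {
  for (size_t i = 0; i < n; ++i)
    if (cond[i])
      a[i] = b[i] + 1;
}
```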
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D114779
In scalarizeInstruction(), isUniformAfterVectorization is used to detect
cases where it is sufficient to always access the first lane. This
should map directly to checking whether the operand is a uniform replicate
recipe.
Differential Revision: https://reviews.llvm.org/D116654
We can simply compute the value of this field on demand. Doing so clarifies the behavior when one of the instructions within a bundle doesn't have valid dependencies. I vaguely think this could change behavior slightly, but none of the test cases are affected, and my attempts to write one by hand have failed.
This also slightly reduces memory usage, but that's a secondary benefit at best.
This adds an assertion that all items in the ready list are in fact schedulable entities ready to be scheduled. This involves changing the ReadyInsts structure to be a set, and fixing a couple of places where we left nodes on the list when they were no longer ready.
The idea here is to have a verify routine we can call during scheduling to ensure broken invariants are reported. The intent is to help in debugging scheduling bugs.
At the moment, only the most basic properties are checked, as several invariants I thought held reported failures when I tried adding them.
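A rough sketch of the intent (hypothetical types, not the actual SLPVectorizer structures): keep ReadyInsts as a set and have a verify routine assert the basic invariant:
```
#include <cassert>
#include <set>

// Hypothetical stand-in for SLP's per-instruction scheduling data.
struct ScheduleData {
  bool IsScheduleEntity = false; // bundle head or non-bundled instruction
  bool Scheduled = false;
  int UnscheduledDeps = 0;       // ready once this reaches zero

  bool isReady() const {
    return IsScheduleEntity && !Scheduled && UnscheduledDeps == 0;
  }
};

struct BlockScheduling {
  // A set (rather than a list) so an entry cannot appear twice and stale
  // entries are easy to erase when they stop being ready.
  std::set<ScheduleData *> ReadyInsts;

  // Called during scheduling to report broken invariants early.
  void verify() const {
    for (const ScheduleData *SD : ReadyInsts)
      assert(SD->isReady() &&
             "item in the ready list is not a ready schedule entity");
  }
};
```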
When the main loop is e.g. VF=vscale x 1 and the epilogue VF cannot
be any smaller, the vectorizer should try to estimate how many lanes are
executed at runtime and allow a suitable fixed-width VF to be chosen. It
can use VScaleForTuning to figure out what a suitable fixed-width VF could
be. For the case where the main loop VF is vscale x 1 and VScaleForTuning=8,
it could still choose an epilogue VF of up to 4.
This was a bit tricky to test, so this patch also introduces a wrapper
function to get 'VScaleForTuning' that also considers vscale_range.
If min and max are equal, then that will be the vscale we compile for.
It makes little sense to tune for a different width if the code
will not be portable to other widths.
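A sketch of the wrapper's logic (a hypothetical helper, not the exact code): use the value pinned by vscale_range when min == max, otherwise fall back to the target's VScaleForTuning:
```
#include <optional>

// Hypothetical sketch: MinVScale/MaxVScale come from the function's
// vscale_range attribute (0 meaning "no upper bound"), TTIVScaleForTuning
// from the target.
std::optional<unsigned>
getVScaleForTuning(unsigned MinVScale, unsigned MaxVScale,
                   std::optional<unsigned> TTIVScaleForTuning) {
  // If vscale_range pins vscale to a single value, tune for exactly that
  // width; tuning for anything else would not be portable anyway.
  if (MinVScale != 0 && MinVScale == MaxVScale)
    return MinVScale;
  return TTIVScaleForTuning;
}

// With a main VF of vscale x 1 and a tuning value of 8, the estimated main
// loop width is 8 lanes, so a fixed-width epilogue VF of up to 4 can be picked.
```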
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D118709
The compiler adds an estimation for external uses during operand
reordering analysis, which makes it tend to prefer duplicates in the
lanes rather than a diamond/shuffled match in the graph. This changes the sizes of
the vector operands and may prevent some vectorization. We don't need
this kind of estimation for the analysis phase, because we just need to
choose the most compatible instruction, and it does not matter whether it has
an external user or is used in a non-matching lane. Instead, we count the number
of unique instructions in the lane and see whether the reassociation changes
the number of unique scalars to a power of 2 or not. Having a power-of-2
number of unique scalars in the lane is considered more profitable than
having a non-power-of-2 number of unique scalars.
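Roughly, the new check looks like this hypothetical helper (not the actual SLP code): count the unique scalars in a lane and prefer the reassociation that yields a power-of-2 count:
```
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for the scalar values in one operand lane.
using Lane = std::vector<const void *>;

static bool isPowerOf2(size_t N) { return N != 0 && (N & (N - 1)) == 0; }

static size_t numUniqueScalars(const Lane &L) {
  return std::unordered_set<const void *>(L.begin(), L.end()).size();
}

// Prefer the candidate whose number of unique scalars is a power of 2: such
// lanes vectorize without growing or splitting the operand vectors.
static bool reassociationIsProfitable(const Lane &Candidate,
                                      const Lane &Current) {
  return isPowerOf2(numUniqueScalars(Candidate)) &&
         !isPowerOf2(numUniqueScalars(Current));
}
```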
Metric: SLP.NumVectorInstructions (before, after, % change)
test-suite :: MultiSource/Benchmarks/FreeBench/distray/distray.test 70.00 86.00 22.9%
test-suite :: External/SPEC/CFP2017rate/544.nab_r/544.nab_r.test 346.00 353.00 2.0%
test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test 346.00 353.00 2.0%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 235.00 239.00 1.7%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 235.00 239.00 1.7%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 8723.00 8834.00 1.3%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 1051.00 1064.00 1.2%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 1628.00 1646.00 1.1%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 1628.00 1646.00 1.1%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 9100.00 9184.00 0.9%
test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 3565.00 3577.00 0.3%
test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 3565.00 3577.00 0.3%
test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 4235.00 4245.00 0.2%
test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 1996.00 1998.00 0.1%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 1671.00 1672.00 0.1%
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 783.00 782.00 -0.1%
test-suite :: SingleSource/Benchmarks/Misc/oourafft.test 69.00 68.00 -1.4%
test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test 207.00 192.00 -7.2%
test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test 207.00 192.00 -7.2%
test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test 89.00 80.00 -10.1%
test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test 89.00 80.00 -10.1%
test-suite :: MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test 260.00 215.00 -17.3%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test 256.00 211.00 -17.6%
MultiSource/Benchmarks/Prolangs-C/TimberWolfMC - pretty much the same.
SingleSource/Benchmarks/Misc/oourafft.test - two <2 x> loads replaced by
one <4 x> load.
External/SPEC/CINT2017speed/641.leela_s - a function gets vectorized and
is not inlined anymore.
External/SPEC/CINT2017rate/541.leela_r - same.
External/SPEC/CINT2017rate/531.deepsjeng_r - changed the order in a
multi-block tree, the result is pretty much the same.
External/SPEC/CINT2017speed/631.deepsjeng_s - same.
MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a - the result is the same
as before.
MultiSource/Benchmarks/MiBench/consumer-jpeg - same.
Differential Revision: https://reviews.llvm.org/D116688
Added support for alternate ops vectorization of cmp instructions.
It allows vectorizing either cmp instructions with the same/swapped
predicate but different (swapped) operand kinds, or cmp instructions
with different predicates and compatible operand kinds.
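For example (illustrative C++ input, not a test from the patch), both of the following pairs of compares can now be vectorized as alternate ops:
```
#include <cstdint>

// Same predicate, swapped operands between the lanes.
void cmp_swapped(const int *a, const int *b, uint8_t *out) {
  out[0] = a[0] < b[0];
  out[1] = b[1] < a[1];
}

// Different predicates on compatible operands.
void cmp_mixed(const int *a, const int *b, uint8_t *out) {
  out[0] = a[0] < b[0];
  out[1] = a[1] == b[1];
}
```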
Differential Revision: https://reviews.llvm.org/D115955
Added support for alternate ops vectorization of cmp instructions.
It allows vectorizing either cmp instructions with the same/swapped
predicate but different (swapped) operand kinds, or cmp instructions
with different predicates and compatible operand kinds.
Differential Revision: https://reviews.llvm.org/D115955
Based on the output of include-what-you-use.
This is a big chunk of changes. It is very likely to break downstream code
unless a lot of care was taken to avoid hidden header dependencies, something
the LLVM codebase doesn't do that well :-/
I've tried to summarize the biggest changes below:
- llvm/include/llvm-c/Core.h: no longer includes llvm-c/ErrorHandling.h
- llvm/IR/DIBuilder.h no longer includes llvm/IR/DebugInfo.h
- llvm/IR/IRBuilder.h no longer includes llvm/IR/IntrinsicInst.h
- llvm/IR/LLVMRemarkStreamer.h no longer includes llvm/Support/ToolOutputFile.h
- llvm/IR/LegacyPassManager.h no longer includes llvm/Pass.h
- llvm/IR/Type.h no longer includes llvm/ADT/SmallPtrSet.h
- llvm/IR/PassManager.h no longer includes llvm/Pass.h nor llvm/Support/Debug.h
And the usual count of preprocessed lines:
$ clang++ -E -Iinclude -I../llvm/include ../llvm/lib/IR/*.cpp -std=c++14 -fno-rtti -fno-exceptions | wc -l
before: 6400831
after: 6189948
200k fewer lines to process is not that bad ;-)
Discourse thread on the topic: https://llvm.discourse.group/t/include-what-you-use-include-cleanup
Differential Revision: https://reviews.llvm.org/D118652
For some reason we limited the epilogue VF to be fixed-width, but there
is not necessarily a reason for doing so. If the main VF=vscale x 16, the
epilogue VF could be either fixed-width, or a scalable VF up to vscale x 8.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D118688
Added support for alternate ops vectorization of cmp instructions.
It allows vectorizing either cmp instructions with the same/swapped
predicate but different (swapped) operand kinds, or cmp instructions
with different predicates and compatible operand kinds.
Differential Revision: https://reviews.llvm.org/D115955
This removes another instance of recipe execution still relying on
the cost model.
Depends on D116554.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D116656
Added support for alternate ops vectorization of cmp instructions.
It allows vectorizing either cmp instructions with the same/swapped
predicate but different (swapped) operand kinds, or cmp instructions
with different predicates and compatible operand kinds.
Differential Revision: https://reviews.llvm.org/D115955
This removes the remaining dependence on LoopVectorizationCostModel from
buildScalarSteps and is required so it can be moved out of ILV.
It also allows us to remove a few unneeded instructions.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D116554
This patch tries to use an existing VPWidenCanonicalIVRecipe
instead of creating another step-vector for canonical
induction recipes in widenIntOrFpInduction.
This has the following benefits:
1. It is a first step towards avoiding setting both vector and scalar values for the
same induction def.
2. It reduces the complexity of widenIntOrFpInduction by making things
more explicit in VPlan.
3. The vector IV only needs to be splatted for block-in masks.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D116123
This reverts commit db49a78900. The reasoning in the patch applied to a downstream branch, and I got myself confused when trying to split apart the pieces. Thankfully, the assert was simply weaker than the actual invariant currently upstream, which is that ReadyInsts is not empty.
The fact that we could have a block with a valid scheduling window but nothing to schedule was surprising to me. After digging through the code, it turns out this can only happen if we don't find anything to directly vectorize. However, the reduction handling code relies on this mode, so we can't simply consider such trees unvectorizable. The assert conveys both that this situation can happen and that it can *only* happen for an immediate gather.
Context: we build the bundle before deciding whether vectorization of the bundle is possible. A side effect of bundle construction is manipulating the scheduling window, so a bundle which isn't vectorizable can cause the creation or expansion of a scheduling window.
There is no need to reorder the top nodes if they are not stores or
insertelement instructions, and each node should be analyzed only
once, when the bottom-to-top analysis is performed.
We still end up with extractelements for the top node scalars, and
the final shuffle just adds an extra cost and currently
crashes the compiler for PHI nodes.
Differential Revision: https://reviews.llvm.org/D116760
This explicitly records whether a scalar IV is needed in the
VPWidenIntOrFpInductionRecipe, to remove a dependence on the cost-model
during its ::execute.
It will also be used in D116123 to determine if a vector phi will be
generated.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D118167
Use shufflevector to do the subvector extracts. This allows a lot more
load merging on AMDGPU and also on NVPTX when <2 x half> is involved.
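For illustration, a minimal IRBuilder example (not code from the patch) of extracting a subvector with a single shufflevector rather than per-element extracts:
```
#include "llvm/ADT/ArrayRef.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

// Emits roughly:
//   %hi = shufflevector <4 x float> %v, <4 x float> poison, <2 x i32> <i32 2, i32 3>
// i.e. the high <2 x float> half is extracted with one shufflevector instead
// of two extractelement/insertelement pairs, a form load merging can see through.
int main() {
  LLVMContext Ctx;
  Module M("demo", Ctx);
  Type *FloatTy = Type::getFloatTy(Ctx);
  Type *V4F = FixedVectorType::get(FloatTy, 4);
  Type *V2F = FixedVectorType::get(FloatTy, 2);
  FunctionType *FTy = FunctionType::get(V2F, {V4F}, /*isVarArg=*/false);
  Function *F = Function::Create(FTy, Function::ExternalLinkage, "hi_half", M);
  IRBuilder<> B(BasicBlock::Create(Ctx, "entry", F));

  // The mask {2, 3} selects the two high lanes; the second vector operand is
  // implicitly poison for the single-input CreateShuffleVector overload.
  Value *Hi = B.CreateShuffleVector(F->getArg(0), ArrayRef<int>{2, 3}, "hi");
  B.CreateRet(Hi);

  M.print(outs(), nullptr);
  return 0;
}
```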
Differential Revision: https://reviews.llvm.org/D117219
isCandidateForEpilogueVectorization will currently return false for loops
which contain reductions. This patch removes this restriction and makes
the following changes to support epilogue vectorisation with reductions:
- `fixReduction`: If fixReduction is being called during vectorisation of the
epilogue, the phi node it creates will need to additionally carry incoming
values from the middle block of the main loop.
- `createEpilogueVectorizedLoopSkeleton`: The incoming values of the phi
created by fixReduction are updated after the vec.epilog.iter.check block
is added. The phi is also moved to the preheader of the epilogue.
- `processLoop`: The start value of any VPReductionPHIRecipe is updated before
vectorising the epilogue loop. The getResumeInstr function added to the ILV
will return the resume instruction associated with the recurrence descriptor.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D116928
This patch updates createBlockInMask to always generate
VPWidenCanonicalIVRecipe and adds a transform to optimize it away later,
if it is not needed.
This is a step towards breaking up VPWidenIntOrFpInductionRecipe and
explicitly distinguishing between vector phis and scalarization.
Split off from D116123.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D117140