Summary:
Get back `const` partially lost in one of recent changes.
Additionally specify explicit qualifiers in few places.
Reviewers: samparker
Reviewed By: samparker
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D82383
D68667 introduced a tighter limit to the number of GEPs to simplify
together. The limit was based on the vector element size of the pointer,
but the pointers themselves are not actually put in vectors.
IIUC we try to vectorize the index computations here, so we should base
the limit on the vector element size of the computation of the index.
This restores the test regression on AArch64 and also restores the
vectorization for a important pattern in SPEC2006/464.h264ref on
AArch64 (@test_i16_extend). We get a large benefit from doing a single
load up front and then processing the index computations in vectors.
Note that we could probably even further improve the AArch64 codegen, if
we would do zexts to i32 instead of i64 for the sub operands and then do
a single vector sext on the result of the subtractions. AArch64 provides
dedicated vector instructions to do so. Sketch of proof in Alive:
https://alive2.llvm.org/ce/z/A4xYAB
Reviewers: craig.topper, RKSimon, xbolva00, ABataev, spatel
Reviewed By: ABataev, spatel
Differential Revision: https://reviews.llvm.org/D82418
This saves creating/destroying a builder every time we
perform some transform.
The tests show instruction ordering diffs resulting from
always inserting at the root instruction now, but those
should be benign.
This doesn't change anything currently, but it would make sense
to create a class-level IRBuilder instead of recreating that
everywhere. As we expand to more optimizations, we will probably
also want to hold things like the DataLayout or other constant
refs in here too.
Move ScalarEvolution::forgetLoopDispositions implementation to ScalarEvolution.cpp to remove the dependency.
Add implicit header dependency to source files where necessary.
"error: 'get' is deprecated: The base class version of get with the scalable
argument defaulted to false is deprecated."
Changed VectorType::get() -> FixedVectorType::get().
This emits new IR intrinsic @llvm.get.active.mask for tail-folded vectorised
loops if the intrinsic is supported by the backend, which is checked by
querying TargetTransform hook emitGetActiveLaneMask.
This intrinsic creates a mask representing active and inactive vector lanes,
which is used by the masked load/store instructions that are created for
tail-folded loops. The semantics of @llvm.get.active.mask are described here in
LangRef:
https://llvm.org/docs/LangRef.html#llvm-get-active-lane-mask-intrinsics
This intrinsic is also used to provide a hint to the backend. That is, the
second argument of the intrinsic represents the back-edge taken count of the
loop. For MVE, for example, we use that to set up tail-predication, which is a
new form of predication in MVE for vector loops that implicitely predicates the
last vector loop iteration by implicitely setting active/inactive lanes, i.e.
the tail loop is predicated. In order to set up a tail-predicated vector loop,
we need to know the number of data elements processed by the vector loop, which
corresponds the the tripcount of the scalar loop, which we can now reconstruct
using @llvm.get.active.mask.
Differential Revision: https://reviews.llvm.org/D79100
Generalize scalarization (recently enhanced with D80885)
to allow compares as well as binops.
Similar to binops, we are avoiding scalarization of a loaded
value because that could avoid a register transfer in codegen.
This requires 1 extra predicate that I am aware of: we do not
want to scalarize the condition value of a vector select. That
might also invert a transform that we do in instcombine that
prefers a vector condition operand for a vector select.
I think this is the final step in solving PR37463:
https://bugs.llvm.org/show_bug.cgi?id=37463
Differential Revision: https://reviews.llvm.org/D81661
Have BasicTTI call the base implementation so that both agree on the
default behaviour, which the default being a cost of '1'. This has
required an X86 specific implementation as it seems to be very
reliant on those instructions being free. Changes are also made to
AMDGPU so that their implementations distinguish between cost kinds,
so that the unrolling isn't affected. PowerPC also has its own
implementation to prevent changes to the reg-usage vectorizer test.
The cost model test changes now reflect that ret instructions are not
generally free.
Differential Revision: https://reviews.llvm.org/D79164
getOrCreateTripCount is used to generate code for the outer loop, but it
requires a computable backedge taken counts. Check that in the VPlan
native path.
Reviewers: Ayal, gilr, rengolin, sguggill
Reviewed By: sguggill
Differential Revision: https://reviews.llvm.org/D81088
scalarizeBinop currently folds
vec_bo((inselt VecC0, V0, Index), (inselt VecC1, V1, Index))
->
inselt(vec_bo(VecC0, VecC1), scl_bo(V0,V1), Index)
This patch extends this to account for cases where one of the vec_bo operands is already all-constant and performs similar cost checks to determine if the scalar binop with a constant still makes sense:
vec_bo((inselt VecC0, V0, Index), VecC1)
->
inselt(vec_bo(VecC0, VecC1), scl_bo(V0,extractelt(V1,Index)), Index)
Fixes PR42174
Differential Revision: https://reviews.llvm.org/D80885
Currently extracting a lane for a VPValue def is not supported, if it is
managed directly by VPTransformState (e.g. because it is created by a
VPInstruction or an external VPValue def).
For now, simply extract the requested lane. In the future, we should
also cache the extracted scalar values, similar to LV.
Reviewers: Ayal, rengolin, gilr, SjoerdMeijer
Reviewed By: SjoerdMeijer
Differential Revision: https://reviews.llvm.org/D80787
LV currently only supports power of 2 vectorization factors, which has
been made explicit with the assertion added in
840450549c.
However, if the widest type is not a power-of-2 the computed MaxVF won't
be a power-of-2 either. This patch updates computeFeasibleMaxVF to
ensure the returned value is a power-of-2 by rounding down to the
nearest power-of-2.
Fixes PR46139.
Reviewers: Ayal, gilr, rengolin
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D80870
relevant aggregate build instructions only (UserCost).
Users are detected with findBuildAggregate routine and the trick is
that following SLP vectorization may end up vectorizing entire list
with smaller chunks. Cost adjustment then is applied for individual
chunks and these adjustments obviously have to be smaller than the
entire aggregate build cost.
Differential Revision: https://reviews.llvm.org/D80773
If it turns out that we can do runtime checks, but there are no
runtime-checks to generate, set RtCheck.Need to false.
This can happen if we can prove statically that the pointers passed in
to canCheckPtrAtRT do not alias. This should not change any results, but
allows us to skip some work and assert that runtime checks are
generated, if LAA indicates that runtime checks are required.
Reviewers: anemet, Ayal
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D79969
Note: This is a recommit of 259abfc7cb,
with some suggested renaming.
If it turns out that we can do runtime checks, but there are no
runtime-checks to generate, set RtCheck.Need to false.
This can happen if we can prove statically that the pointers passed in
to canCheckPtrAtRT do not alias. This should not change any results, but
allows us to skip some work and assert that runtime checks are
generated, if LAA indicates that runtime checks are required.
Reviewers: anemet, Ayal
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D79969
If a loop has a constant trip count known to be a multiple of MaxVF (times user
UF), LV infers that no tail will be generated for any chosen VF. This relies on
the chosen VF's being powers of 2 bound by MaxVF, and assumes MaxVF is a power
of 2. Make sure the latter holds, in particular when MaxVF is set by a memory
dependence distance which may not be a power of 2.
Differential Revision: https://reviews.llvm.org/D80491
Currently we unconditionally get the first lane of the condition
operand, even if we later use the full vector condition. This can result
in some unnecessary instructions being generated.
Suggested as follow-up in D80219.
VPWidenSelectRecipe already contains a VPUser, but it is not used. This
patch updates the code related to VPWidenSelectRecipe to use VPUser for
its operands.
Reviewers: Ayal, gilr, rengolin
Reviewed By: gilr
Differential Revision: https://reviews.llvm.org/D80219
As noted in D80236, moving the pass in the pipeline exposed this
shortcoming. Extra work to recalculate the alias results showed
up as a compile-time slowdown.
Summary:
When handling loops whose VF is 1, fold-tail vectorization sets the
backedge taken count of the original loop with a vector of a single
element. This causes type-mismatch during instruction generartion.
The purpose of this patch is toto address the case of VF==1.
Reviewer: Ayal (Ayal Zaks), bmahjour (Bardia Mahjour), fhahn (Florian Hahn), gilr (Gil Rapaport), rengolin (Renato Golin)
Reviewed By: Ayal (Ayal Zaks), bmahjour (Bardia Mahjour), fhahn (Florian Hahn)
Subscribers: Ayal (Ayal Zaks), rkruppe (Hanna Kruppe), bmahjour (Bardia Mahjour), rogfer01 (Roger Ferrer Ibanez), vkmr (Vineet Kumar), bollu (Siddharth Bhat), hiraditya (Aditya Kumar), llvm-commits (Mailing List llvm-commits)
Tag: LLVM
Differential Revision: https://reviews.llvm.org/D79976
This is a fix for PR45965 - https://bugs.llvm.org/show_bug.cgi?id=45965 -
which was left out of D80106 because of a test failure.
SLP does its own mini-CSE after potentially creating redundant instructions,
so we need to wait for that to complete before running the verifier.
Otherwise, we will see a test failure for
test/Transforms/SLPVectorizer/X86/crash_vectorizeTree.ll (not changed here)
because a phi temporarily has identical but different incoming values for
the same incoming block.
A related, but independent, test that would have been altered here was
fixed with:
rG880df55
The test was escaping verification in SLP without this change because we
were not running verifyFunction() unless SLP actually changed the IR.
Differential Revision: https://reviews.llvm.org/D80401
The algorithm inside getVectorElementSize() is almost O(x^2) complexity and
when, for example, we compile MultiSource/Applications/ClamAV/shared_sha256.c
with 1k instructions inside sha256_transform() function that resulted in almost
~800k iterations. The following change improves the algorithm with the map to
a liner complexity.
Differential Revision: https://reviews.llvm.org/D80241
Combine the two API calls into one by introducing a structure to hold
the relevant data. This has the added benefit of moving the boiler
plate code for arguments and flags, into the constructors. This is
intended to be a non-functional change, but the complicated web of
logic involved here makes it very hard to guarantee.
Differential Revision: https://reviews.llvm.org/D79941
SCEVExpander modifies the underlying function so it is more suitable in
Transforms/Utils, rather than Analysis. This allows using other
transform utils in SCEVExpander.
This patch was originally committed as b8a3c34eee, but broke the
modules build, as LoopAccessAnalysis was using the Expander.
The code-gen part of LAA was moved to lib/Transforms recently, so this
patch can be landed again.
Reviewers: sanjoy.google, efriedma, reames
Reviewed By: sanjoy.google
Differential Revision: https://reviews.llvm.org/D71537
This patch adds VPValue version of the instruction operands to
VPReplicateRecipe and uses them during code-generation.
Reviewers: Ayal, gilr, rengolin
Reviewed By: gilr
Differential Revision: https://reviews.llvm.org/D80114
We can remove a dynamic memory allocation, by checking the number of
operands: no operands = all true, 1 operand = mask.
Reviewers: Ayal, gilr, rengolin
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D80110
LV considers an internally computed MaxVF to decide if a constant trip-count is
a multiple of any subsequently chosen VF, and conclude that no scalar remainder
iterations (tail) will be left for Fold Tail to handle. If an external VF is
provided via -force-vector-width, it must be considered instead of the internal
MaxVF.
If an external UF is provided via -force-vector-interleave, it too must be
considered in addition to MaxVF or user VF.
Fixes PR45679.
Differential Revision: https://reviews.llvm.org/D80085
verifyFunction/verifyModule don't assert or error internally. They
also don't print anything if you don't pass a raw_ostream to them.
So the caller needs to check the result and ideally pass a stream
to get the messages. Otherwise they're just really expensive no-ops.
I've filed PR45965 for another instance in SLPVectorizer
that causes a lit test failure.
Differential Revision: https://reviews.llvm.org/D80106
If both OpA and OpB is an add with NSW/NUW and with the same LHS operand,
we can guarantee that the transformation is safe if we can prove that OpA
won't overflow when IdxDiff added to the RHS of OpA.
Review: https://reviews.llvm.org/D79817
Now that load/store alignment is required, we no longer need most
of them. Also switch the getLoadStoreAlignment() helper to return
Align instead of MaybeAlign.
This is split off from D79799 - where I was proposing to fully iterate
over a function until there are no more transforms. I suspect we are
still going to want to do something like that eventually.
But we can achieve the same gains much more efficiently on the current
set of regression tests just by reversing the order that we visit the
instructions.
This may also reduce the motivation for D79078, but we are still not
getting the optimal pattern for a reduction.
The patch standardizes printing of VPRecipes a bit, by hoisting out the
common emission of \\l\"+\n. It simplifies the code and is also a first
step towards untangling printing from DOT format output, with the goal
of making the DOT output optional and to provide a more concise debug
output if DOT output is disabled.
Reviewers: gilr, Ayal, rengolin
Reviewed By: gilr
Differential Revision: https://reviews.llvm.org/D78883
Summary:
Analyses that are statefull should not be retrieved through a proxy from
an outer IR unit, as these analyses are only invalidated at the end of
the inner IR unit manager.
This patch disallows getting the outer manager and provides an API to
get a cached analysis through the proxy. If the analysis is not
stateless, the call to getCachedResult will assert.
Reviewers: chandlerc
Subscribers: mehdi_amini, eraman, hiraditya, zzheng, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D72893
This was reverted because of a miscompilation. At closer inspection, the
problem was actually visible in a changed llvm regression test too. This
one-line follow up fix/recommit will splat the IV, which is what we are trying
to avoid if unnecessary in general, if tail-folding is requested even if all
users are scalar instructions after vectorisation. Because with tail-folding,
the splat IV will be used by the predicate of the masked loads/stores
instructions. The previous version omitted this, which caused the
miscompilation. The original commit message was:
If tail-folding of the scalar remainder loop is applied, the primary induction
variable is splat to a vector and used by the masked load/store vector
instructions, thus the IV does not remain scalar. Because we now mark
that the IV does not remain scalar for these cases, we don't emit the vector IV
if it is not used. Thus, the vectoriser produces less dead code.
Thanks to Ayal Zaks for the direction how to fix this.
Currently LAA's uses of ScalarEvolutionExpander blocks moving the
expander from Analysis to Transforms. Conceptually the expander does not
fit into Analysis (it is only used for code generation) and
runtime-check generation also seems to be better suited as a
transformation utility.
Reviewers: Ayal, anemet
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D78460
In order to reduce the API surface area (preparation for D78460), remove
a addRuntimeChecks() function and do the additional check in the single
caller.
Reviewers: Ayal, anemet
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D79679
As with the extractelement patterns that are currently in vector-combine,
there are going to be several possible variations on this theme. This
should be the clearest, simplest example.
Scalarization is the right direction for target-independent canonicalization,
and InstCombine has some of those folds already, but it doesn't do this.
I proposed a similar transform in D50992. Here in vector-combine, we can
check the cost model to be sure it's profitable, so there should be less risk.
Differential Revision: https://reviews.llvm.org/D79452
The original patch (rG86dfbc676ebe) exposed an existing bug:
we could wrongly cast a constant expression to BinaryOperator
because the pattern matching allows that. This adds a check
for that case, and there's a reduced test case to verify no
crashing.
Original commit message:
This builds on the or-reduction bailout that was added with D67841.
We still do not have IR-level load combining, although that could
be a target-specific enhancement for -vector-combiner.
The heuristic is narrowly defined to catch the motivating case from
PR39538:
https://bugs.llvm.org/show_bug.cgi?id=39538
...while preserving existing functionality.
That is, there's an unmodified test of pure load/zext/store that is
not seen in this patch at llvm/test/Transforms/SLPVectorizer/X86/cast.ll.
That's the reason for the logic difference to require the 'or'
instructions. The chances that vectorization would actually help a
memory-bound sequence like that seem small, but it looks nicer with:
vpmovzxwd (%rsi), %xmm0
vmovdqu %xmm0, (%rdi)
rather than:
movzwl (%rsi), %eax
movl %eax, (%rdi)
...
In the motivating test, we avoid creating a vector mess that is
unrecoverable in the backend, and SDAG forms the expected bswap
instructions after load combining:
movzbl (%rdi), %eax
vmovd %eax, %xmm0
movzbl 1(%rdi), %eax
vmovd %eax, %xmm1
movzbl 2(%rdi), %eax
vpinsrb $4, 4(%rdi), %xmm0, %xmm0
vpinsrb $8, 8(%rdi), %xmm0, %xmm0
vpinsrb $12, 12(%rdi), %xmm0, %xmm0
vmovd %eax, %xmm2
movzbl 3(%rdi), %eax
vpinsrb $1, 5(%rdi), %xmm1, %xmm1
vpinsrb $2, 9(%rdi), %xmm1, %xmm1
vpinsrb $3, 13(%rdi), %xmm1, %xmm1
vpslld $24, %xmm0, %xmm0
vpmovzxbd %xmm1, %xmm1 # xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
vpslld $16, %xmm1, %xmm1
vpor %xmm0, %xmm1, %xmm0
vpinsrb $1, 6(%rdi), %xmm2, %xmm1
vmovd %eax, %xmm2
vpinsrb $2, 10(%rdi), %xmm1, %xmm1
vpinsrb $3, 14(%rdi), %xmm1, %xmm1
vpinsrb $1, 7(%rdi), %xmm2, %xmm2
vpinsrb $2, 11(%rdi), %xmm2, %xmm2
vpmovzxbd %xmm1, %xmm1 # xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
vpinsrb $3, 15(%rdi), %xmm2, %xmm2
vpslld $8, %xmm1, %xmm1
vpmovzxbd %xmm2, %xmm2 # xmm2 = xmm2[0],zero,zero,zero,xmm2[1],zero,zero,zero,xmm2[2],zero,zero,zero,xmm2[3],zero,zero,zero
vpor %xmm2, %xmm1, %xmm1
vpor %xmm1, %xmm0, %xmm0
vmovdqu %xmm0, (%rsi)
movl (%rdi), %eax
movl 4(%rdi), %ecx
movl 8(%rdi), %edx
movbel %eax, (%rsi)
movbel %ecx, 4(%rsi)
movl 12(%rdi), %ecx
movbel %edx, 8(%rsi)
movbel %ecx, 12(%rsi)
Differential Revision: https://reviews.llvm.org/D78997
If tail-folding of the scalar remainder loop is applied, the primary induction
variable is splat to a vector and used by the masked load/store vector
instructions, thus the IV does not remain scalar. Because we now mark
that the IV does not remain scalar for these cases, we don't emit the vector IV
if it is not used. Thus, the vectoriser produces less dead code.
Thanks to Ayal Zaks for the direction how to fix this.
Differential Revision: https://reviews.llvm.org/D78911
This builds on the or-reduction bailout that was added with D67841.
We still do not have IR-level load combining, although that could
be a target-specific enhancement for -vector-combiner.
The heuristic is narrowly defined to catch the motivating case from
PR39538:
https://bugs.llvm.org/show_bug.cgi?id=39538
...while preserving existing functionality.
That is, there's an unmodified test of pure load/zext/store that is
not seen in this patch at llvm/test/Transforms/SLPVectorizer/X86/cast.ll.
That's the reason for the logic difference to require the 'or'
instructions. The chances that vectorization would actually help a
memory-bound sequence like that seem small, but it looks nicer with:
vpmovzxwd (%rsi), %xmm0
vmovdqu %xmm0, (%rdi)
rather than:
movzwl (%rsi), %eax
movl %eax, (%rdi)
...
In the motivating test, we avoid creating a vector mess that is
unrecoverable in the backend, and SDAG forms the expected bswap
instructions after load combining:
movzbl (%rdi), %eax
vmovd %eax, %xmm0
movzbl 1(%rdi), %eax
vmovd %eax, %xmm1
movzbl 2(%rdi), %eax
vpinsrb $4, 4(%rdi), %xmm0, %xmm0
vpinsrb $8, 8(%rdi), %xmm0, %xmm0
vpinsrb $12, 12(%rdi), %xmm0, %xmm0
vmovd %eax, %xmm2
movzbl 3(%rdi), %eax
vpinsrb $1, 5(%rdi), %xmm1, %xmm1
vpinsrb $2, 9(%rdi), %xmm1, %xmm1
vpinsrb $3, 13(%rdi), %xmm1, %xmm1
vpslld $24, %xmm0, %xmm0
vpmovzxbd %xmm1, %xmm1 # xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
vpslld $16, %xmm1, %xmm1
vpor %xmm0, %xmm1, %xmm0
vpinsrb $1, 6(%rdi), %xmm2, %xmm1
vmovd %eax, %xmm2
vpinsrb $2, 10(%rdi), %xmm1, %xmm1
vpinsrb $3, 14(%rdi), %xmm1, %xmm1
vpinsrb $1, 7(%rdi), %xmm2, %xmm2
vpinsrb $2, 11(%rdi), %xmm2, %xmm2
vpmovzxbd %xmm1, %xmm1 # xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
vpinsrb $3, 15(%rdi), %xmm2, %xmm2
vpslld $8, %xmm1, %xmm1
vpmovzxbd %xmm2, %xmm2 # xmm2 = xmm2[0],zero,zero,zero,xmm2[1],zero,zero,zero,xmm2[2],zero,zero,zero,xmm2[3],zero,zero,zero
vpor %xmm2, %xmm1, %xmm1
vpor %xmm1, %xmm0, %xmm0
vmovdqu %xmm0, (%rsi)
movl (%rdi), %eax
movl 4(%rdi), %ecx
movl 8(%rdi), %edx
movbel %eax, (%rsi)
movbel %ecx, 4(%rsi)
movl 12(%rdi), %ecx
movbel %edx, 8(%rsi)
movbel %ecx, 12(%rsi)
Differential Revision: https://reviews.llvm.org/D78997
getScalarizationOverhead is only ever called with vectors (and we already had a load of cast<VectorType> calls immediately inside the functions).
Followup to D78357
Reviewed By: @samparker
Differential Revision: https://reviews.llvm.org/D79341
Make the kind of cost explicit throughout the cost model which,
apart from making the cost clear, will allow the generic parts to
calculate better costs. It will also allow some backends to
approximate and correlate the different costs if they wish. Another
benefit is that it will also help simplify the cost model around
immediate and intrinsic costs, where we currently have multiple APIs.
RFC thread:
http://lists.llvm.org/pipermail/llvm-dev/2020-April/141263.html
Differential Revision: https://reviews.llvm.org/D79002
The improvements to the x86 vector insert/extract element costs in D74976 resulted in the estimated costs for vector initialization and scalarization increasing higher than should be expected. This is particularly noticeable on pre-SSE4 targets where the available of legal INSERT_VECTOR_ELT ops is more limited.
This patch does 2 things:
1 - it implements X86TTIImpl::getScalarizationOverhead to more accurately represent the typical costs of a ISD::BUILD_VECTOR pattern.
2 - it adds a DemandedElts mask to getScalarizationOverhead to permit the SLP's BoUpSLP::getGatherCost to be rewritten to use it directly instead of accumulating raw vector insertion costs.
This fixes PR45418 where a v4i8 (zext'd to v4i32) was no longer vectorizing.
A future patch should extend X86TTIImpl::getScalarizationOverhead to tweak the EXTRACT_VECTOR_ELT scalarization costs as well.
Reviewed By: @craig.topper
Differential Revision: https://reviews.llvm.org/D78216
The crash that caused the original revert has been fixed in
a3c964a278. I also added a reduced version of the crash reproducer.
This reverts the revert commit 2107af9ccf.
We may want to identify sequences that are not
reductions, but still qualify as load-combines
in the back-end, so make most of the body a
helper function.
When folding tail, branch taken count is computed during initial VPlan execution
and recorded to be used by the compare computing the loop's mask. This recording
should directly set the State, instead of reusing Value2VPValue mapping which
serves original Values present prior to vectorization.
The branch taken count may be a constant Value, which may be used elsewhere in
the loop; trying to employ Value2VPValue for both leads to the issue reported in
https://reviews.llvm.org/D76992#inline-721028
Differential Revision: https://reviews.llvm.org/D78847
One of transforms the loop vectorizer makes is LCSSA formation. In some cases it
is the only transform it makes. We should not drop CFG analyzes if only LCSSA was
formed and no actual CFG changes was made.
We should think of expanding this logic to other passes as well, and maybe make
it a part of PM framework.
Reviewed By: Florian Hahn
Differential Revision: https://reviews.llvm.org/D78360
This reverts commit 9245c7ac13.
This is triggering a segfault in XLA downstream, we'll follow-up with
a reproducer, it is likely influenced by TTI/TLI settings or other
options as a simple `opt -loop-vectorize` invocation on the IR
before the crash does not reproduce immediately.
This patch adds VPValue version of the instruction operands to
VPWidenRecipe and uses them during code-generation.
Similar to D76373 this reduces ingredient def-use usage by ILV as
a step towards full VPlan-based def-use relations.
Reviewers: rengolin, Ayal, gilr
Reviewed By: gilr
Differential Revision: https://reviews.llvm.org/D76992
The individual tryTo* helpers do not need to be public. Also, the
builder contained two consecutive public: sections, which is not
necessary. Moved the remaining public methods after the constructor.
Also make some of the tryTo* helpers const.
Reviewers: gilr, rengolin, Ayal, hsaito
Reviewed by: gilr
Differential Revision: https://reviews.llvm.org/D78288
This patch includes some clean-ups to tryToCreateRecipe, suggested in
D77973.
It includes:
* Renaming tryToCreateRecipe to tryToCreateWidenRecipe.
* Move VPBB insertion logic to caller of tryToCreateWidenRecipe.
* Hoists instruction checks to tryToCreateWidenRecipe, making it
clearer which instructions are handled by which recipe, simplifying
the checks by using early exits.
* Split up handling of induction PHIs and truncates using inductions.
Reviewers: gilr, rengolin, Ayal, hsaito
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D78287
The API for shuffles and reductions uses generic Type parameters,
instead of VectorType, and so assertions and casts are used a lot.
This patch makes those types explicit, which means that the clients
can't be lazy, but results in less ambiguity, and that can only be a
good thing.
Bugzilla: https://bugs.llvm.org/show_bug.cgi?id=45562
Differential Revision: https://reviews.llvm.org/D78357
bitcast (shuf V, MaskC) --> shuf (bitcast V), MaskC'
This is the widen shuffle elements enhancement to D76727.
It builds on the analysis and simplifications in
D77881 and rG6a7e958a423e.
The phase ordering tests show that we can simplify inverse
shuffles across a binop in both directions (widen/narrow or
narrow/widen) now.
There's another potential transform visible in some of the
remaining TODOs - move a bitcasted operand of a shuffle
after the shuffle.
Differential Revision: https://reviews.llvm.org/D78371
First-order recurrences require special treatment when they are live-out;
such treatment is provided by fixFirstOrderRecurrence(), so they should be
included in AllowedExit set.
(Should probably have been included originally in D16197.)
Fixes PR45526: AllowedExit set is used by prepareToFoldTailByMasking() to
check whether the treatment for live-outs also holds when folding the tail,
which is not (yet) the case for first-order recurrences.
Differential Revision: https://reviews.llvm.org/D78210
Cost-modeling decisions are tied to the compute interleave groups
(widening decisions, scalar and uniform values). When invalidating the
interleave groups, those decisions also need to be invalidated.
Otherwise there is a mis-match during VPlan construction.
VPWidenMemoryRecipes created initially are left around w/o converting them
into VPInterleave recipes. Such a conversion indeed should not take place,
and these gather/scatter recipes may in fact be right. The crux is leaving around
obsolete CM_Interleave (and dependent) markings of instructions along with
their costs, instead of recalculating decisions, costs, and recipes.
Alternatively to forcing a complete recompute later on, we could try
to selectively invalidate the decisions connected to the interleave
groups. But we would likely need to run the uniform/scalar value
detection parts again anyways and the extra complexity is probably not
worth it.
Fixes PR45572.
Reviewers: gilr, rengolin, Ayal, hsaito
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D78298
After introducing VPWidenSelectRecipe, the duplicated logic can be
shared.
Reviewers: gilr, rengolin, Ayal, hsaito
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D77973
Handling LoadInst and StoreInst in tryToWiden seems a bit
counter-intuitive, as there is only an assertion for them and in no
case VPWidenRefipes are created for them.
I think it makes sense to move the assertion to handleReplication, where
the non-widened loads and store are handled.
Reviewers: gilr, rengolin, Ayal, hsaito
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D77972
Fix an assert introduced in 41ed5d856c1: a phi with a single predecessor and a
mask is a valid case which is already supported by the code.
Differential Revision: https://reviews.llvm.org/D78115
Summary:
Currently, the internal options -vectorize-loops, -vectorize-slp, and
-interleave-loops do not have much practical effect. This is because
they are used to initialize the corresponding flags in the pass
managers, and those flags are then unconditionally overwritten when
compiling via clang or via LTO from the linkers. The only exception was
-vectorize-loops via opt because of some special hackery there.
While vectorization could still be disabled when compiling via clang,
using -fno-[slp-]vectorize, this meant that there was no way to disable
it when compiling in LTO mode via the linkers. This only affected
ThinLTO, since for regular LTO vectorization is done during the compile
step for scalability reasons. For ThinLTO it is invoked in the LTO
backends. See also the discussion on PR45434.
This patch makes it so the internal options can actually be used to
disable these optimizations. Ultimately, the best long term solution is
to mark the loops with metadata (similar to the approach used to fix
-fno-unroll-loops in D77058), but this enables a shorter term
workaround, and actually makes these internal options useful.
I constant propagated the initial values of these internal flags into
the pass manager flags (for some reasons vectorize-loops and
interleave-loops were initialized to true, while vectorize-slp was
initialized to false). As mentioned above, they are overwritten
unconditionally so this doesn't have any real impact, and these initial
values aren't particularly meaningful.
I then changed the passes to check the internl values and return without
performing the associated optimization when false (I changed the default
of -vectorize-slp to true so the options behave similarly). I was able
to remove the hackery in opt used to get -vectorize-loops=false to work,
as well as a special option there used to disable SLP vectorization.
Finally, I changed thinlto-slp-vectorize-pm.c to:
a) Only test SLP (moved the loop vectorization checking to a new test).
b) Use code that is slp vectorized when it is enabled, and check that
instead of whether the pass is enabled.
c) Test the new behavior of -vectorize-slp.
d) Test both pass managers.
The loop vectorization (and associated interleaving) testing I moved to
a new thinlto-loop-vectorize-pm.c test, with several changes:
a) Changed the flags on the interleaving testing so that it will
actually interleave, and check that.
b) Test the new behavior of -vectorize-loops and -interleave-loops.
c) Test both pass managers.
Reviewers: fhahn, wmi
Subscribers: hiraditya, steven_wu, dexonsmith, cfe-commits, davezarzycki, llvm-commits
Tags: #clang
Differential Revision: https://reviews.llvm.org/D77989
Summary:
Remove usages of asserting vector getters in Type in preparation for the
VectorType refactor. The existence of these functions complicates the
refactor while adding little value.
Reviewers: rriddle, sdesmalen, efriedma
Reviewed By: efriedma
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D77259
Pass from the calling recipe the interleave group itself instead of passing the
group's insertion position and having the function query CM for its interleave
group and making sure that given instruction is the insertion point of.
Differential Revision: https://reviews.llvm.org/D78002
Widening a selects depends on whether the condition is loop invariant or
not. Rather than checking during codegen-time, the information can be
recorded at the VPlan construction time.
This was suggested as part of D76992, to reduce the reliance on
accessing the original underlying IR values.
Reviewers: gilr, rengolin, Ayal, hsaito
Reviewed By: gilr
Differential Revision: https://reviews.llvm.org/D77869
As proposed in D77881, we'll have the related widening operation,
so this name becomes too vague.
While here, change the function signature to take an 'int' rather
than 'size_t' for the scaling factor, add an assert for overflow of
32-bits, and improve the documentation comments.
Default visibility for classes is private, so the private: at the top of
various class definitions is redundant.
Reviewers: gilr, rengolin, Ayal, hsaito
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D77810
InnerLoopVectorizer's code called during VPlan execution still relies on
original IR's def-use relations to decide which vector code to generate,
limiting VPlan transformations ability to modify def-use relations and still
have ILV generate the vector code.
This commit introduces VPValues for VPBlendRecipe to use as the values to
blend. The recipe is generated with VPValues wrapping the phi's incoming values
of the scalar phi. This reduces ingredient def-use usage by ILV as a step
towards full VPlan-based def-use relations.
Differential Revision: https://reviews.llvm.org/D77539
Introduce a new VPWidenCanonicalIVRecipe to generate a canonical vector
induction for use in fold-tail-with-masking, if a primary induction is absent.
The canonical scalar IV having start = 0 and step = VF*UF, created during code
-gen to control the vector loop, is widened into a canonical vector IV having
start = {<Part*VF, Part*VF+1, ..., Part*VF+VF-1> for 0 <= Part < UF} and
step = <VF*UF, VF*UF, ..., VF*UF>.
Differential Revision: https://reviews.llvm.org/D77635
When building a VPlan, BasicBlock::instructionsWithoutDebug() is used to
iterate over the instructions in a block. This means that no recipes
should be created for debug info intrinsics already and we can turn the
early exit into an assertion.
Reviewers: Ayal, gilr, rengolin, aprantl
Reviewed By: aprantl
Differential Revision: https://reviews.llvm.org/D77636
This patch adds VPValue versions for the arguments of the call to
VPWidenCallRecipe and uses them during code-generation.
Similar to D76373 this reduces ingredient def-use usage by ILV as
a step towards full VPlan-based def-use relations.
Reviewers: Ayal, gilr, rengolin
Reviewed By: gilr
Differential Revision: https://reviews.llvm.org/D77655
Now that we have scalable vectors, there's a distinction that isn't
getting captured in the original SequentialType: some vectors don't have
a known element count, so counting the number of elements doesn't make
sense.
In some cases, there's a better way to express the commonality using
other methods. If we're dealing with GEPs, there's GEP methods; if we're
dealing with a ConstantDataSequential, we can query its element type
directly.
In the relatively few remaining cases, I just decided to write out
the type checks. We're talking about relatively few places, and I think
the abstraction doesn't really carry its weight. (See thread "[RFC]
Refactor class hierarchy of VectorType in the IR" on llvmdev.)
Differential Revision: https://reviews.llvm.org/D75661
This patch moves calls to their own recipe, to simplify the transition
to VPUser for operands of VPWidenRecipe, as discussed in D76992.
Subsequently additional information can be added to the recipe rather
than computing it during the execute step.
Reviewers: rengolin, Ayal, gilr, hsaito
Reviewed By: gilr
Differential Revision: https://reviews.llvm.org/D77467
After 49d00824bb, VPWidenRecipe only stores a single instruction.
tryToWiden can simply return the widen recipe, like other helpers in
VPRecipeBuilder.
Extracting to the same index that we are going to insert back into
allows forming select ("blend") shuffles and enables further transforms.
Admittedly, this is a quick-fix for a more general problem that I'm
hoping to solve by adding transforms for patterns that start with an
insertelement.
But this might resolve some regressions known to be caused by the
extract-extract transform (although I have not gotten more details on
those yet).
In the motivating case from PR34724:
https://bugs.llvm.org/show_bug.cgi?id=34724
The combination of subsequent instcombine and codegen transforms gets us this improvement:
vmovshdup %xmm0, %xmm2 ## xmm2 = xmm0[1,1,3,3]
vhaddps %xmm1, %xmm1, %xmm4
vmovshdup %xmm1, %xmm3 ## xmm3 = xmm1[1,1,3,3]
vaddps %xmm0, %xmm2, %xmm0
vaddps %xmm1, %xmm3, %xmm1
vshufps $200, %xmm4, %xmm0, %xmm0 ## xmm0 = xmm0[0,2],xmm4[0,3]
vinsertps $177, %xmm1, %xmm0, %xmm0 ## xmm0 = zero,xmm0[1,2],xmm1[2]
-->
vmovshdup %xmm0, %xmm2 ## xmm2 = xmm0[1,1,3,3]
vhaddps %xmm1, %xmm1, %xmm1
vaddps %xmm0, %xmm2, %xmm0
vshufps $200, %xmm1, %xmm0, %xmm0 ## xmm0 = xmm0[0,2],xmm1[0,3]
Differential Revision: https://reviews.llvm.org/D76623
bitcast (shuf V, MaskC) --> shuf (bitcast V), MaskC'
We do not attempt this in InstCombine because we do not want to change
types and create new shuffle ops that are potentially not lowered as
well as the original code. Here, we can check the cost model to see if
it is worthwhile.
I've aggressively enabled this transform even if the types are the same
size and/or equal cost because moving the bitcast allows InstCombine to
make further simplifications.
In the motivating cases from PR35454:
https://bugs.llvm.org/show_bug.cgi?id=35454
...this is enough to let instcombine and the backend eliminate the
redundant shuffles, but we probably want to extend VectorCombine to
handle the inverse pattern (shuffle-of-bitcast) to get that
simplification directly in IR.
Differential Revision: https://reviews.llvm.org/D76727
Instead, represent the mask as out-of-line data in the instruction. This
should be more efficient in the places that currently use
getShuffleVector(), and paves the way for further changes to add new
shuffles for scalable vectors.
This doesn't change the syntax in textual IR. And I don't currently plan
to change the bitcode encoding in this patch, although we'll probably
need to do something once we extend shufflevector for scalable types.
I expect that once this is finished, we can then replace the raw "mask"
with something more appropriate for scalable vectors. Not sure exactly
what this looks like at the moment, but there are a few different ways
we could handle it. Maybe we could try to describe specific shuffles.
Or maybe we could define it in terms of a function to convert a fixed-length
array into an appropriate scalable vector, using a "step", or something
like that.
Differential Revision: https://reviews.llvm.org/D72467
In InnerLoopVectorizer::getOrCreateTripCount, when the backedge taken
count is a SCEV add expression, its type is defined by the type of the
last operand of the add expression.
In the test case from PR45259, this last operand happens to be a
pointer, which (according to llvm::Type) does not have a primitive size
in bits. In this case, LoopVectorize fails to truncate the SCEV and
crashes as a result.
Uing ScalarEvolution::getTypeSizeInBits makes the truncation work as expected.
https://bugs.llvm.org/show_bug.cgi?id=45259
Differential Revision: https://reviews.llvm.org/D76669
This patch changes VPWidenRecipe to only store a single original IR
instruction. This is the first required step towards modeling it's
operands as VPValues and also towards breaking it up into a
VPInstruction.
Discussed as part of D74695.
Reviewers: Ayal, gilr, rengolin
Reviewed By: gilr
Differential Revision: https://reviews.llvm.org/D76988
This untangles the logic in widenIntOrFpInduction in order to make more
explicit and visible how exactly the induction variable is lowered.
Differential Revision: https://reviews.llvm.org/D76686
InnerLoopVectorizer's code called during VPlan execution still relies on
original IR's def-use relations to decide which vector code to generate,
limiting VPlan transformations ability to modify def-use relations and still
have ILV generate the vector code.
This commit introduces a VPValue for VPWidenMemoryInstructionRecipe to use as
the stored value. The recipe is generated with a VPValue wrapping the stored
value of the scalar store. This reduces ingredient def-use usage by ILV as a
step towards full VPlan-based def-use relations.
Differential Revision: https://reviews.llvm.org/D76373
We need to insert into the Visited set at the same time we insert
into the worklist. Otherwise we may end up pushing the same
instruction to the worklist multiple times, and only adding it to
the visited set later.
The latest improvements to VPValue printing make this mapping clear when
printing the operand. Printing the mapping separately is not required
any longer.
Reviewers: rengolin, hsaito, Ayal, gilr
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D76375
Now that printing VPValues uses the underlying IR value name, if
available, recording the underlying value here improves printing.
Reviewers: rengolin, hsaito, Ayal, gilr
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D76374
The existence of the class is more confusing than helpful, I think; the
commonality is mostly just "GEP is legal", which can be queried using
APIs on GetElementPtrInst.
Differential Revision: https://reviews.llvm.org/D75660
When the an underlying value is available, we can use its name for
printing, as discussed in D73078.
Reviewers: rengolin, hsaito, Ayal, gilr
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D76200
Summary:
SLPVectorizer try to vectorize list of scalar instructions of the same type,
instructions already vectorized are rejected through isValidElementType().
Without this patch, tryToVectorizeList() will first try to determine vectorization
factor of a list of Instructions before checking whether each instruction has unsupported
type or not. For instructions already vectorized for SVE, it will crash at getVectorElementSize(),
where it try to return a fixed size.
This patch make sure invalid element types are rejected before trying to get vectorization
factor. This make sure we are not trying to vectorize instructions already vectorized.
Reviewers: sdesmalen, efriedma, spatel, RKSimon, ABataev, apazos, rengolin
Reviewed By: efriedma
Subscribers: tschuett, hiraditya, rkruppe, psnobl, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76017
Summary:
Support ConstantInt::get() and Constant::getAllOnesValue() for scalable
vector type, this requires ConstantVector::getSplat() to take in 'ElementCount',
instead of 'unsigned' number of element count.
This change is needed for D73753.
Reviewers: sdesmalen, efriedma, apazos, spatel, huntergr, willlovett
Reviewed By: efriedma
Subscribers: tschuett, hiraditya, rkruppe, psnobl, cfe-commits, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74386
Refines the gather/scatter cost model, but also changes the TTI
function getIntrinsicInstrCost to accept an additional parameter
which is needed for the gather/scatter cost evaluation.
This did require trivial changes in some non-ARM backends to
adopt the new parameter.
Extending gathers and truncating scatters are now priced cheaper.
Differential Revision: https://reviews.llvm.org/D75525
It seems like the SLPVectorizer is currently not aware of vector
versions of functions provided by libraries like Accelerate [1].
This patch updates SLPVectorizer to use the same infrastructure
the LoopVectorizer uses to detect vectorizable library functions.
For calls, it computes the cost of an intrinsic call (existing behavior)
and the cost of a vector function library call, if available. Like
LoopVectorizer, it assumes the cost of the vector function is simply the
cost of a call to a vector function.
[1] https://developer.apple.com/documentation/accelerate
Reviewers: ABataev, RKSimon, spatel
Reviewed By: ABataev
Differential Revision: https://reviews.llvm.org/D75878
opcode (extelt V0, Ext0), (ext V1, Ext1) --> extelt (opcode (splat V0, Ext0), V1), Ext1
The first part of this patch generalizes the cost calculation to accept
different extraction indexes. The second part creates a shuffle+extract
before feeding into the existing code to create a vector op+extract.
The patch conservatively uses "TargetTransformInfo::SK_PermuteSingleSrc"
rather than "TargetTransformInfo::SK_Broadcast" (splat specifically
from element 0) because we do not have a more general "SK_Splat"
currently. That does not affect any of the current regression tests,
but we might be able to find some cost model target specialization where
that comes into play.
I suspect that we can expose some missing x86 horizontal op codegen with
this transform, so I'm speculatively adding a debug flag to disable the
binop variant of this transform to allow easier testing.
The test changes show that we're sensitive to cost model diffs (as we
should be), so that means that patches like D74976
should have better coverage.
Differential Revision: https://reviews.llvm.org/D75689
Currently when printing VPValues we use the object address, which makes
it hard to distinguish VPValues as they usually are large numbers with
varying distance between them.
This patch adds a simple slot tracker, similar to the ModuleSlotTracker
used for IR values. In order to dump a VPValue or anything containing a
VPValue, a slot tracker for the enclosing VPlan needs to be created. The
existing VPlanPrinter can take care of that for the existing code. We
assign consecutive numbers to each VPValue we encounter in a reverse
post order traversal of the VPlan.
Reviewers: rengolin, hsaito, fhahn, Ayal, dorit, gilr
Reviewed By: gilr
Differential Revision: https://reviews.llvm.org/D73078
This patch adds a getPlan accessor to VPBlockBase, which finds the entry
block of the plan containing the block and returns the plan set for this
block.
VPBlockBase contains a VPlan pointer, but it should only be set for
the entry block of a plan. This allows moving blocks without updating
the pointer for each moved block and in the future we might introduce a
parent relationship between plans and blocks, similar to the one in LLVM IR.
Reviewers: rengolin, hsaito, fhahn, Ayal, dorit, gilr
Reviewed By: gilr
Differential Revision: https://reviews.llvm.org/D74445
getReductionVars, getInductionVars and getFirstOrderRecurrences were all
being returned from LoopVectorizationLegality as pointers to lists. This
just changes them to be references, cleaning up the interface slightly.
Differential Revision: https://reviews.llvm.org/D75448
This change adds an assertion to prevent tricky bug related to recursive
approach of building vectorization tree. For loop below takes number of
operands directly from tree entry rather than from scalars.
If the entry at this moment turns out incomplete (i.e. not all operands set)
then not all the dependencies will be seen by the scheduler.
This can lead to failed scheduling (and thus failed vectorization)
for perfectly vectorizable tree.
Here is code example which is likely to fire the assertion:
for (i : VL0->getNumOperands()) {
...
TE->setOperand(i, Operands);
buildTree_rec(Operands, Depth + 1,...);
}
Correct way is two steps process: first set all operands to a tree entry
and then recursively process each operand.
Differential Revision: https://reviews.llvm.org/D75296
This patch deletes some dead code out of SLP vectorizer.
Couple of changes taken out of D57059 to slightly lighten it
plus one more similar case fixed.
Differential Revision: https://reviews.llvm.org/D75276
A recent commit
(https://reviews.llvm.org/rG66c120f02560ef528a60924104ead66f330190f1) changed
the cost for calls to functions that have a vector version for some
vectorization factor. However, no check is performed for whether the
vectorization factor matches the current one being cost modeled. This leads to
attempts to widen call instructions to a vectorization factor for which such a
function does not exist, which in turn leads to an assertion failure.
This patch adds the check for vectorization factor (i.e. not just that the
called function has a vector version for some VF, but that it has a vector
version for this VF).
Differential revision: https://reviews.llvm.org/D74944