As discussed in D86798, it's not clear if the caller code
works with a more liberal definition of "commutative" that
includes intrinsics like min/max. This makes the binop
restriction (current functionality is unchanged) explicit
until the code is audited/tested.
The entries in VectorizableTree are not necessarily ordered by their
position in basic blocks. Collect them and order them by dominance so
later instructions are guaranteed to be visited first. For instructions
in different basic blocks, we only scan to the beginning of the block,
so their order does not matter, as long as all instructions in a basic
block are grouped together. Using dominance ensures a deterministic order.
The modified test case contains an example where we compute a wrong
spill cost (2) without this patch, even though there is no call between
any instruction in the bundle.
This seems to have limited practical impact, .e.g on X86 with a recent
Intel Xeon CPU with -O3 -march=native -flto on MultiSource,SPEC2000,SPEC2006
there are no binary changes.
Reviewed By: ABataev
Differential Revision: https://reviews.llvm.org/D82444
Summary:
This patch takes the indices operands of `insertelement`/`insertvalue`
into account while generation of seed elements for `findBuildAggregate()`.
This function has kept the original order of `insert`s before.
Also this patch optimizes `findBuildAggregate()` preventing it from
redundant temporary vector allocations and its multiple reversing.
Fixes llvm.org/pr44067
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D83779
In vectorizeChainsInBlock we try to collect chains of PHI nodes
that have the same element type, but the code is relying upon
the implicit conversion from TypeSize -> uint64_t. For now, I have
modified the code to ignore PHI nodes with scalable types.
Differential Revision: https://reviews.llvm.org/D83542
Currently, getCastInstrCost has limited information about the cast it's
rating, often just the opcode and types. Sometimes there is a context
instruction as well, but it isn't trustworthy: for instance, when the
vectorizer is rating a plan, it calls getCastInstrCost with the old
instructions when, in fact, it's trying to evaluate the cost of the
instruction post-vectorization. Thus, the current system can get the
cost of certain casts incorrect as the correct cost can vary greatly
based on the context in which it's used.
For example, if the vectorizer queries getCastInstrCost to evaluate the
cost of a sext(load) with tail predication enabled, getCastInstrCost
will think it's free most of the time, but it's not always free. On ARM
MVE, a VLD2 group cannot be extended like a normal VLDR can. Similar
situations can come up with how masked loads can be extended when being
split.
To fix that, this path adds a new parameter to getCastInstrCost to give
it a hint about the context of the cast. It adds a CastContextHint enum
which contains the type of the load/store being created by the
vectorizer - one for each of the types it can produce.
Original patch by Pierre van Houtryve
Differential Revision: https://reviews.llvm.org/D79162
I've got the report clang11 issues signed/unsigned mismatch
warning here. For some reason only clang11 seems to issue
this warning.
Differential Revision: https://reviews.llvm.org/D83916
Teaches the SLPVectorizer to use vectorized library functions for
non-intrinsic calls.
This already worked for intrinsics that have vectorized library
functions, thanks to D75878, but schedules with library functions with a
vector variant were being rejected early.
- assume that there are no load/store dependencies between lib
functions with a vector variant; this would otherwise prevent the
bundle from becoming "ready"
- check during legalization that the vector variant can be used
- fix-up where we previously assumed that a call would be an intrinsic
Differential Revision: https://reviews.llvm.org/D82550
At the moment this place does not check maximum size set
by TTI and just creates a maximum possible vectors.
Differential Revision: https://reviews.llvm.org/D82227
The entries in VectorizableTree are not necessarily ordered by their
position in basic blocks. Collect them and order them by dominance so
later instructions are guaranteed to be visited first. For instructions
in different basic blocks, we only scan to the beginning of the block,
so their order does not matter, as long as all instructions in a basic
block are grouped together. Using dominance ensures a deterministic order.
The modified test case contains an example where we compute a wrong
spill cost (2) without this patch, even though there is no call between
any instruction in the bundle.
This seems to have limited practical impact, .e.g on X86 with a recent
Intel Xeon CPU with -O3 -march=native -flto on MultiSource,SPEC2000,SPEC2006
there are no binary changes.
Reviewers: craig.topper, RKSimon, xbolva00, ABataev, spatel
Reviewed By: ABataev
Differential Revision: https://reviews.llvm.org/D82444
D68667 introduced a tighter limit to the number of GEPs to simplify
together. The limit was based on the vector element size of the pointer,
but the pointers themselves are not actually put in vectors.
IIUC we try to vectorize the index computations here, so we should base
the limit on the vector element size of the computation of the index.
This restores the test regression on AArch64 and also restores the
vectorization for a important pattern in SPEC2006/464.h264ref on
AArch64 (@test_i16_extend). We get a large benefit from doing a single
load up front and then processing the index computations in vectors.
Note that we could probably even further improve the AArch64 codegen, if
we would do zexts to i32 instead of i64 for the sub operands and then do
a single vector sext on the result of the subtractions. AArch64 provides
dedicated vector instructions to do so. Sketch of proof in Alive:
https://alive2.llvm.org/ce/z/A4xYAB
Reviewers: craig.topper, RKSimon, xbolva00, ABataev, spatel
Reviewed By: ABataev, spatel
Differential Revision: https://reviews.llvm.org/D82418
relevant aggregate build instructions only (UserCost).
Users are detected with findBuildAggregate routine and the trick is
that following SLP vectorization may end up vectorizing entire list
with smaller chunks. Cost adjustment then is applied for individual
chunks and these adjustments obviously have to be smaller than the
entire aggregate build cost.
Differential Revision: https://reviews.llvm.org/D80773
This is a fix for PR45965 - https://bugs.llvm.org/show_bug.cgi?id=45965 -
which was left out of D80106 because of a test failure.
SLP does its own mini-CSE after potentially creating redundant instructions,
so we need to wait for that to complete before running the verifier.
Otherwise, we will see a test failure for
test/Transforms/SLPVectorizer/X86/crash_vectorizeTree.ll (not changed here)
because a phi temporarily has identical but different incoming values for
the same incoming block.
A related, but independent, test that would have been altered here was
fixed with:
rG880df55
The test was escaping verification in SLP without this change because we
were not running verifyFunction() unless SLP actually changed the IR.
Differential Revision: https://reviews.llvm.org/D80401
The algorithm inside getVectorElementSize() is almost O(x^2) complexity and
when, for example, we compile MultiSource/Applications/ClamAV/shared_sha256.c
with 1k instructions inside sha256_transform() function that resulted in almost
~800k iterations. The following change improves the algorithm with the map to
a liner complexity.
Differential Revision: https://reviews.llvm.org/D80241
Combine the two API calls into one by introducing a structure to hold
the relevant data. This has the added benefit of moving the boiler
plate code for arguments and flags, into the constructors. This is
intended to be a non-functional change, but the complicated web of
logic involved here makes it very hard to guarantee.
Differential Revision: https://reviews.llvm.org/D79941
Now that load/store alignment is required, we no longer need most
of them. Also switch the getLoadStoreAlignment() helper to return
Align instead of MaybeAlign.
The original patch (rG86dfbc676ebe) exposed an existing bug:
we could wrongly cast a constant expression to BinaryOperator
because the pattern matching allows that. This adds a check
for that case, and there's a reduced test case to verify no
crashing.
Original commit message:
This builds on the or-reduction bailout that was added with D67841.
We still do not have IR-level load combining, although that could
be a target-specific enhancement for -vector-combiner.
The heuristic is narrowly defined to catch the motivating case from
PR39538:
https://bugs.llvm.org/show_bug.cgi?id=39538
...while preserving existing functionality.
That is, there's an unmodified test of pure load/zext/store that is
not seen in this patch at llvm/test/Transforms/SLPVectorizer/X86/cast.ll.
That's the reason for the logic difference to require the 'or'
instructions. The chances that vectorization would actually help a
memory-bound sequence like that seem small, but it looks nicer with:
vpmovzxwd (%rsi), %xmm0
vmovdqu %xmm0, (%rdi)
rather than:
movzwl (%rsi), %eax
movl %eax, (%rdi)
...
In the motivating test, we avoid creating a vector mess that is
unrecoverable in the backend, and SDAG forms the expected bswap
instructions after load combining:
movzbl (%rdi), %eax
vmovd %eax, %xmm0
movzbl 1(%rdi), %eax
vmovd %eax, %xmm1
movzbl 2(%rdi), %eax
vpinsrb $4, 4(%rdi), %xmm0, %xmm0
vpinsrb $8, 8(%rdi), %xmm0, %xmm0
vpinsrb $12, 12(%rdi), %xmm0, %xmm0
vmovd %eax, %xmm2
movzbl 3(%rdi), %eax
vpinsrb $1, 5(%rdi), %xmm1, %xmm1
vpinsrb $2, 9(%rdi), %xmm1, %xmm1
vpinsrb $3, 13(%rdi), %xmm1, %xmm1
vpslld $24, %xmm0, %xmm0
vpmovzxbd %xmm1, %xmm1 # xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
vpslld $16, %xmm1, %xmm1
vpor %xmm0, %xmm1, %xmm0
vpinsrb $1, 6(%rdi), %xmm2, %xmm1
vmovd %eax, %xmm2
vpinsrb $2, 10(%rdi), %xmm1, %xmm1
vpinsrb $3, 14(%rdi), %xmm1, %xmm1
vpinsrb $1, 7(%rdi), %xmm2, %xmm2
vpinsrb $2, 11(%rdi), %xmm2, %xmm2
vpmovzxbd %xmm1, %xmm1 # xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
vpinsrb $3, 15(%rdi), %xmm2, %xmm2
vpslld $8, %xmm1, %xmm1
vpmovzxbd %xmm2, %xmm2 # xmm2 = xmm2[0],zero,zero,zero,xmm2[1],zero,zero,zero,xmm2[2],zero,zero,zero,xmm2[3],zero,zero,zero
vpor %xmm2, %xmm1, %xmm1
vpor %xmm1, %xmm0, %xmm0
vmovdqu %xmm0, (%rsi)
movl (%rdi), %eax
movl 4(%rdi), %ecx
movl 8(%rdi), %edx
movbel %eax, (%rsi)
movbel %ecx, 4(%rsi)
movl 12(%rdi), %ecx
movbel %edx, 8(%rsi)
movbel %ecx, 12(%rsi)
Differential Revision: https://reviews.llvm.org/D78997
This builds on the or-reduction bailout that was added with D67841.
We still do not have IR-level load combining, although that could
be a target-specific enhancement for -vector-combiner.
The heuristic is narrowly defined to catch the motivating case from
PR39538:
https://bugs.llvm.org/show_bug.cgi?id=39538
...while preserving existing functionality.
That is, there's an unmodified test of pure load/zext/store that is
not seen in this patch at llvm/test/Transforms/SLPVectorizer/X86/cast.ll.
That's the reason for the logic difference to require the 'or'
instructions. The chances that vectorization would actually help a
memory-bound sequence like that seem small, but it looks nicer with:
vpmovzxwd (%rsi), %xmm0
vmovdqu %xmm0, (%rdi)
rather than:
movzwl (%rsi), %eax
movl %eax, (%rdi)
...
In the motivating test, we avoid creating a vector mess that is
unrecoverable in the backend, and SDAG forms the expected bswap
instructions after load combining:
movzbl (%rdi), %eax
vmovd %eax, %xmm0
movzbl 1(%rdi), %eax
vmovd %eax, %xmm1
movzbl 2(%rdi), %eax
vpinsrb $4, 4(%rdi), %xmm0, %xmm0
vpinsrb $8, 8(%rdi), %xmm0, %xmm0
vpinsrb $12, 12(%rdi), %xmm0, %xmm0
vmovd %eax, %xmm2
movzbl 3(%rdi), %eax
vpinsrb $1, 5(%rdi), %xmm1, %xmm1
vpinsrb $2, 9(%rdi), %xmm1, %xmm1
vpinsrb $3, 13(%rdi), %xmm1, %xmm1
vpslld $24, %xmm0, %xmm0
vpmovzxbd %xmm1, %xmm1 # xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
vpslld $16, %xmm1, %xmm1
vpor %xmm0, %xmm1, %xmm0
vpinsrb $1, 6(%rdi), %xmm2, %xmm1
vmovd %eax, %xmm2
vpinsrb $2, 10(%rdi), %xmm1, %xmm1
vpinsrb $3, 14(%rdi), %xmm1, %xmm1
vpinsrb $1, 7(%rdi), %xmm2, %xmm2
vpinsrb $2, 11(%rdi), %xmm2, %xmm2
vpmovzxbd %xmm1, %xmm1 # xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
vpinsrb $3, 15(%rdi), %xmm2, %xmm2
vpslld $8, %xmm1, %xmm1
vpmovzxbd %xmm2, %xmm2 # xmm2 = xmm2[0],zero,zero,zero,xmm2[1],zero,zero,zero,xmm2[2],zero,zero,zero,xmm2[3],zero,zero,zero
vpor %xmm2, %xmm1, %xmm1
vpor %xmm1, %xmm0, %xmm0
vmovdqu %xmm0, (%rsi)
movl (%rdi), %eax
movl 4(%rdi), %ecx
movl 8(%rdi), %edx
movbel %eax, (%rsi)
movbel %ecx, 4(%rsi)
movl 12(%rdi), %ecx
movbel %edx, 8(%rsi)
movbel %ecx, 12(%rsi)
Differential Revision: https://reviews.llvm.org/D78997
Make the kind of cost explicit throughout the cost model which,
apart from making the cost clear, will allow the generic parts to
calculate better costs. It will also allow some backends to
approximate and correlate the different costs if they wish. Another
benefit is that it will also help simplify the cost model around
immediate and intrinsic costs, where we currently have multiple APIs.
RFC thread:
http://lists.llvm.org/pipermail/llvm-dev/2020-April/141263.html
Differential Revision: https://reviews.llvm.org/D79002
The improvements to the x86 vector insert/extract element costs in D74976 resulted in the estimated costs for vector initialization and scalarization increasing higher than should be expected. This is particularly noticeable on pre-SSE4 targets where the available of legal INSERT_VECTOR_ELT ops is more limited.
This patch does 2 things:
1 - it implements X86TTIImpl::getScalarizationOverhead to more accurately represent the typical costs of a ISD::BUILD_VECTOR pattern.
2 - it adds a DemandedElts mask to getScalarizationOverhead to permit the SLP's BoUpSLP::getGatherCost to be rewritten to use it directly instead of accumulating raw vector insertion costs.
This fixes PR45418 where a v4i8 (zext'd to v4i32) was no longer vectorizing.
A future patch should extend X86TTIImpl::getScalarizationOverhead to tweak the EXTRACT_VECTOR_ELT scalarization costs as well.
Reviewed By: @craig.topper
Differential Revision: https://reviews.llvm.org/D78216
We may want to identify sequences that are not
reductions, but still qualify as load-combines
in the back-end, so make most of the body a
helper function.
The API for shuffles and reductions uses generic Type parameters,
instead of VectorType, and so assertions and casts are used a lot.
This patch makes those types explicit, which means that the clients
can't be lazy, but results in less ambiguity, and that can only be a
good thing.
Bugzilla: https://bugs.llvm.org/show_bug.cgi?id=45562
Differential Revision: https://reviews.llvm.org/D78357
Summary:
Currently, the internal options -vectorize-loops, -vectorize-slp, and
-interleave-loops do not have much practical effect. This is because
they are used to initialize the corresponding flags in the pass
managers, and those flags are then unconditionally overwritten when
compiling via clang or via LTO from the linkers. The only exception was
-vectorize-loops via opt because of some special hackery there.
While vectorization could still be disabled when compiling via clang,
using -fno-[slp-]vectorize, this meant that there was no way to disable
it when compiling in LTO mode via the linkers. This only affected
ThinLTO, since for regular LTO vectorization is done during the compile
step for scalability reasons. For ThinLTO it is invoked in the LTO
backends. See also the discussion on PR45434.
This patch makes it so the internal options can actually be used to
disable these optimizations. Ultimately, the best long term solution is
to mark the loops with metadata (similar to the approach used to fix
-fno-unroll-loops in D77058), but this enables a shorter term
workaround, and actually makes these internal options useful.
I constant propagated the initial values of these internal flags into
the pass manager flags (for some reasons vectorize-loops and
interleave-loops were initialized to true, while vectorize-slp was
initialized to false). As mentioned above, they are overwritten
unconditionally so this doesn't have any real impact, and these initial
values aren't particularly meaningful.
I then changed the passes to check the internl values and return without
performing the associated optimization when false (I changed the default
of -vectorize-slp to true so the options behave similarly). I was able
to remove the hackery in opt used to get -vectorize-loops=false to work,
as well as a special option there used to disable SLP vectorization.
Finally, I changed thinlto-slp-vectorize-pm.c to:
a) Only test SLP (moved the loop vectorization checking to a new test).
b) Use code that is slp vectorized when it is enabled, and check that
instead of whether the pass is enabled.
c) Test the new behavior of -vectorize-slp.
d) Test both pass managers.
The loop vectorization (and associated interleaving) testing I moved to
a new thinlto-loop-vectorize-pm.c test, with several changes:
a) Changed the flags on the interleaving testing so that it will
actually interleave, and check that.
b) Test the new behavior of -vectorize-loops and -interleave-loops.
c) Test both pass managers.
Reviewers: fhahn, wmi
Subscribers: hiraditya, steven_wu, dexonsmith, cfe-commits, davezarzycki, llvm-commits
Tags: #clang
Differential Revision: https://reviews.llvm.org/D77989
Summary:
Remove usages of asserting vector getters in Type in preparation for the
VectorType refactor. The existence of these functions complicates the
refactor while adding little value.
Reviewers: rriddle, sdesmalen, efriedma
Reviewed By: efriedma
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D77259
Now that we have scalable vectors, there's a distinction that isn't
getting captured in the original SequentialType: some vectors don't have
a known element count, so counting the number of elements doesn't make
sense.
In some cases, there's a better way to express the commonality using
other methods. If we're dealing with GEPs, there's GEP methods; if we're
dealing with a ConstantDataSequential, we can query its element type
directly.
In the relatively few remaining cases, I just decided to write out
the type checks. We're talking about relatively few places, and I think
the abstraction doesn't really carry its weight. (See thread "[RFC]
Refactor class hierarchy of VectorType in the IR" on llvmdev.)
Differential Revision: https://reviews.llvm.org/D75661
We need to insert into the Visited set at the same time we insert
into the worklist. Otherwise we may end up pushing the same
instruction to the worklist multiple times, and only adding it to
the visited set later.
The existence of the class is more confusing than helpful, I think; the
commonality is mostly just "GEP is legal", which can be queried using
APIs on GetElementPtrInst.
Differential Revision: https://reviews.llvm.org/D75660
Summary:
SLPVectorizer try to vectorize list of scalar instructions of the same type,
instructions already vectorized are rejected through isValidElementType().
Without this patch, tryToVectorizeList() will first try to determine vectorization
factor of a list of Instructions before checking whether each instruction has unsupported
type or not. For instructions already vectorized for SVE, it will crash at getVectorElementSize(),
where it try to return a fixed size.
This patch make sure invalid element types are rejected before trying to get vectorization
factor. This make sure we are not trying to vectorize instructions already vectorized.
Reviewers: sdesmalen, efriedma, spatel, RKSimon, ABataev, apazos, rengolin
Reviewed By: efriedma
Subscribers: tschuett, hiraditya, rkruppe, psnobl, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76017
It seems like the SLPVectorizer is currently not aware of vector
versions of functions provided by libraries like Accelerate [1].
This patch updates SLPVectorizer to use the same infrastructure
the LoopVectorizer uses to detect vectorizable library functions.
For calls, it computes the cost of an intrinsic call (existing behavior)
and the cost of a vector function library call, if available. Like
LoopVectorizer, it assumes the cost of the vector function is simply the
cost of a call to a vector function.
[1] https://developer.apple.com/documentation/accelerate
Reviewers: ABataev, RKSimon, spatel
Reviewed By: ABataev
Differential Revision: https://reviews.llvm.org/D75878