Further improvement of the cost model for the scalars used in
buildvector sequences. The main functionality is outlined into
a separate function.
The cost is calculated in the following way:
1. If the Base vector is not an undef vector, resize the very first mask to
the common VF and perform the action for 2 input vectors (including the
non-undef Base). Other shuffle masks are combined with the result of the
first stage and processed as a shuffle of 2 vectors.
2. If the Base is an undef vector and there is only 1 shuffle mask, perform
the action only for 1 vector with the given mask, if it is not the identity
mask.
3. If > 2 masks are used, perform a series of shuffle actions for 2 vectors,
combining the masks properly between the steps.
The original implementation misses the very first analysis for the Base
vector, so the cost might be too optimistic in some cases. But it improves
the cost for the insertelements which are part of the current SLP graph.
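A minimal sketch of the kind of buildvector sequence this cost model covers (hypothetical IR, names invented):
; buildvector over a non-undef %base; the inserted scalars come from
; other vectorized values, so the sequence is costed as shuffles
%v0 = insertelement <4 x float> %base, float %a, i32 0
%v1 = insertelement <4 x float> %v0, float %b, i32 1
%v2 = insertelement <4 x float> %v1, float %c, i32 2
%v3 = insertelement <4 x float> %v2, float %d, i32 3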
Part of D107966.
Differential Revision: https://reviews.llvm.org/D115750
The current reordering scheme only checks the ordering of in-tree operands.
There are some cases, however, where we need to adjust the ordering based on
the ordering of a future SLP tree whose instructions are not part of the
current tree, but are external users.
This patch is a simple implementation of this. We keep track of scalar stores
that are users of TreeEntries, and if they look profitable to vectorize, we
keep track of their ordering. During the reordering step we take this new
index order into account. This can remove some shuffles in cases like the one
in the lit test.
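A hypothetical sketch of the pattern (invented names): the adds form the current tree, while the scalar stores are external users that would become a future SLP tree, so their index order is recorded:
%add0 = add i32 %x0, %y0
%add1 = add i32 %x1, %y1
; external users, profitable to vectorize as their own tree later
store i32 %add0, i32* %p0
store i32 %add1, i32* %p1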
Differential Revision: https://reviews.llvm.org/D125111
If the same scalar is inserted several times into the same buildvector,
the mask index may already be in use. In this case we need to check that
this scalar is already a part of the vectorized buildvector.
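For illustration (hypothetical IR), the same scalar inserted twice into one buildvector:
%v0 = insertelement <2 x float> poison, float %s, i32 0
%v1 = insertelement <2 x float> %v0, float %s, i32 1 ; same scalar %s, its mask slot may already be taken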
We can try to vectorize a number of stores smaller than MinVecRegSize
/ scalar_value_size, if it is allowed by the target. This gives an extra
opportunity for vectorization.
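For example (hypothetical IR, assuming a 128-bit MinVecRegSize and a target that permits 64-bit vectors), just two i32 stores may now be vectorized:
store i32 %x0, i32* %g0
store i32 %x1, i32* %g1
; may become a single <2 x i32> store (64 bits < MinVecRegSize) if the target allows it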
Fixes PR54985.
Differential Revision: https://reviews.llvm.org/D124284
Need to use the actual index instead of the tree entry position, since the
insert index may be different than 0. It means that, for some reason, we
vectorized a part of the buildvector starting from a non-initial
insertelement instruction.
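A hypothetical sketch: only the tail of the buildvector is vectorized, so the first vectorized insert has index 2, not 0:
%v0 = insertelement <4 x float> poison, float %a, i32 0 ; scalar part
%v1 = insertelement <4 x float> %v0, float %b, i32 1    ; scalar part
%v2 = insertelement <4 x float> %v1, float %c, i32 2    ; vectorized part starts at insert index 2
%v3 = insertelement <4 x float> %v2, float %d, i32 3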
Given a load without a better order, this patch partially sorts the
elements to form clusters of adjacent elements in memory. These clusters
can potentially be loaded in fewer loads, meaning less overall shuffling
(for example, loading v4i8 clusters of a v16i8 as single f32 loads, as
opposed to multiple independent byte loads and inserts).
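A sketch of the idea (hypothetical pointer names): one wide load covering a cluster of four adjacent bytes, instead of four scalar byte loads plus inserts:
; before: four independent i8 loads and four inserts (elided)
; after: the 4-byte cluster is loaded at once
%cast = bitcast i8* %p to float*
%cluster = load float, float* %cast ; covers 4 adjacent i8 elements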
Differential Revision: https://reviews.llvm.org/D122145
This adds fptosi_sat and fptoui_sat to the list of trivially
vectorizable functions, mainly so that the loop vectorizer can vectorize
the instruction. Marking them as trivially vectorizable also allows them
to be SLP vectorized, and scalarized.
The signature of fptosi_sat requires two type overrides
(@llvm.fptosi.sat.v2i32.v2f32), unlike other intrinsics that often only
take a single one. This patch alters hasVectorInstrinsicOverloadedScalarOpd
to isVectorIntrinsicWithOverloadTypeAtArg, so that it can mark the first
operand of the intrinsic as an overloaded (but not scalar) operand.
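For example, both the scalar and vectorized declarations carry the result and operand overload types:
declare i32 @llvm.fptosi.sat.i32.f32(float)
declare <2 x i32> @llvm.fptosi.sat.v2i32.v2f32(<2 x float>)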
Differential Revision: https://reviews.llvm.org/D124358
Currently the SLP vectorizer walks through the instructions and selects
3 main classes of values: 1) reduction operations - instructions with the
same reduction opcode (add, mul, min/max, etc.), which build the reduction,
2) reduced values - instructions with the same opcodes, but different
from the reduction opcode, 3) extra arguments - all other values,
instructions from a different basic block than the root node,
instructions with too many/too few uses.
This scheme is not very efficient. It excludes some instructions and all
non-instruction values from the reductions (constants, profitable
gathers), and too many possibly reduced values are marked as extra arguments.
The patch improves this process by introducing a slightly extended analysis
stage. During this stage, we still try to select 3 classes of the
values: 1) reduction operations - same as before, 2) possibly reduced
values - all instructions from the current block/non-instructions, which
may build a vectorization tree, 3) extra arguments - instructions from
the different basic blocks. Additionally, an extra sorting of the
possibly reduced values occurs to build the scalar sequences which
will highly likely be vectorized, e.g. loads are grouped by the
distance between them, constants are grouped together, cmp instructions
are sorted by their compare types and predicates, extractelement
instructions are sorted by the vector operand, etc. Also, these groups
are reordered by their length so the longest group is the first in the
list of the possibly reduced values.
The vectorization process tries to emit the reductions for all these
groups. These reductions, the remaining non-vectorized possibly reduced
values, and extra arguments are then combined into the final expression
just like before.
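A hypothetical illustration of the grouping (invented names): an add reduction whose reduced values mix loads and a constant; the loads are grouped by distance and the constant is kept with the other remaining reduced values:
%l0 = load i32, i32* %p0
%l1 = load i32, i32* %p1
%l2 = load i32, i32* %p2
%l3 = load i32, i32* %p3
%s0 = add i32 %l0, %l1
%s1 = add i32 %s0, %l2
%s2 = add i32 %s1, %l3
%s3 = add i32 %s2, 42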
Differential Revision: https://reviews.llvm.org/D114171
Introduced masks where they were not added and improved target-dependent
cost models to avoid returning incorrect cost results after adding masks.
Differential Revision: https://reviews.llvm.org/D100486
Before this patch `Args` was used to pass a broadcast's arguments by SLP.
This patch changes this. `Args` is now used for passing the operands of
the shuffle.
Differential Revision: https://reviews.llvm.org/D124202
The tryToVectorize() method implements one of the search paths for vectorizable tree roots in the SLP vectorizer,
specifically for binary and comparison operations. The order of probing various scalar pairs
was defined by its implementation: the instruction operands first, then climbing over one operand if
the instruction is its sole user, and then performing the same actions for the other operand if the previous
attempts failed. The problem with this approach is that among these options we can have more than a
single vectorizable tree candidate, and it is not necessarily the one encountered first.
Trying to build a vectorizable tree for each possible combination just for evaluation is expensive.
But we already have the lookahead heuristics mechanism, which we use for finding the best pick among
operands of commutative instructions. It calculates a cumulative score for candidates in two
consecutive lanes. This patch introduces the use of this heuristic for choosing the best pair among
several combinations. We only try the one that looks most promising for vectorization.
An additional benefit is that we reduce the total number of vectorization trees built for probes
because we skip those looking non-profitable early.
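A rough, hypothetical sketch of the probing (invented names); the exact candidate set follows the existing implementation, and the lookahead score now picks the most promising pair instead of the first workable one:
; the root pair for %r is (%a, %b); if that pair does not vectorize,
; operands of %a or %b may be tried in their place
%a = mul i32 %x, %y
%b = mul i32 %z, %w
%r = add i32 %a, %b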
Reviewed By: Alexey Bataev (ABataev), Vasileios Porpodas (vporpo)
Differential Revision: https://reviews.llvm.org/D124309
Minor refactoring to reduce the size of the functional change D124309:
look-ahead scoring routines were pulled out of VLOperands and formed
into a new LookAheadHeuristics helper class.
Reviewed By: Alexey Bataev (ABataev), Vasileios Porpodas (vporpo)
Differential Revision: https://reviews.llvm.org/D124313
Until now we would only accept a broadcast load pattern if it was only used
by a single vector of instructions.
This patch relaxes this, and allows the broadcast to have more than one
user vector, as long as all of its uses are internal to the SLP graph and
vectorized.
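A hypothetical example (invented names): one broadcast load feeding two vectorizable instruction bundles, all internal to the SLP graph:
%ld = load double, double* %p
%a0 = fadd double %ld, %x0 ; bundle 1, lane 0
%a1 = fadd double %ld, %x1 ; bundle 1, lane 1
%m0 = fmul double %ld, %y0 ; bundle 2, lane 0
%m1 = fmul double %ld, %y1 ; bundle 2, lane 1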
Differential Revision: https://reviews.llvm.org/D121940
If we vectorize e.g. a store, we leave around a bunch of getelementptrs for the individual scalar stores which we removed. We can go ahead and delete them as well.
This is purely for test output quality and readability. It should have no effect in any sane pipeline.
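A hypothetical before/after sketch:
; before vectorization
%g0 = getelementptr inbounds i32, i32* %p, i64 0
%g1 = getelementptr inbounds i32, i32* %p, i64 1
store i32 %x0, i32* %g0
store i32 %x1, i32* %g1
; after: a single <2 x i32> store through %p; %g0 and %g1 are dead and now deleted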
Differential Revision: https://reviews.llvm.org/D122493
This simplifies the implementation of eraseInstruction by moving the odd-replace-users-with-undef handling back to the only caller which uses it. This handling was not obviously correct, so add the asserts which make it clear why this is safe to do at all. The result is simpler code and stronger assertions.
The original commit exposed several missing dependencies (e.g. latent bugs in SLP scheduling). Most of these were fixed over the weekend and have had several days to bake. The last was fixed this morning after being noticed in manual review of test changes yesterday. See the review thread for links to each change.
Original commit message follows:
SLP currently schedules all instructions within a scheduling window which stretches from the first instruction potentially vectorized to the last. This window can include a very large number of unrelated instructions which are not being considered for vectorization. This change switches the code to only schedule the sub-graph consisting of the instructions being vectorized and their transitive users.
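A hypothetical illustration: only the stores being vectorized (and their transitive users) need scheduling, while the unrelated calls between them previously inflated the window:
store i32 %x0, i32* %g0
call void @unrelated0() ; not part of the vectorized sub-graph,
call void @unrelated1() ; previously still scheduled
store i32 %x1, i32* %g1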
This has the effect of greatly reducing the amount of work performed in large basic blocks, and thus greatly improves compile time on degenerate examples. To understand the effects, I added some statistics (not planned for upstream contribution). Here's an illustration from my motivating example:
Before this patch:
704357 SLP - Number of calcDeps actions
699021 SLP - Number of schedule calls
5598 SLP - Number of ReSchedule actions
59 SLP - Number of ReScheduleOnFail actions
10084 SLP - Number of schedule resets
8523 SLP - Number of vector instructions generated
After this patch:
102895 SLP - Number of calcDeps actions
161916 SLP - Number of schedule calls
5637 SLP - Number of ReSchedule actions
55 SLP - Number of ReScheduleOnFail actions
10083 SLP - Number of schedule resets
8403 SLP - Number of vector instructions generated
I do want to highlight that there is a small difference in the number of generated vector instructions. This example is hitting the bailout due to maximum window size, and the change in scheduling slightly perturbs when and how we hit it. This can be seen in the ReScheduleOnFail counter change. Given that, I think we can safely ignore it.
The downside of this change can be seen in the large test diff. We group all vectorizable instructions together at the bottom of the scheduling region. This means that vector instructions can move quite far from their original point in code. While maybe undesirable, I don't see this as being a major problem as this pass is not intended to be a general scheduling pass.
For context, it's worth noting that the pre-scheduling that SLP does while building the vector tree is exactly the sub-graph scheduling implemented by this patch.
Differential Revision: https://reviews.llvm.org/D118538
After writing the commit message for 4b1bace28, I realized that the mentioned optimization was rather straightforward. We already have the code for scanning a block during region initialization, so we can simply keep track of whether we've seen a stacksave or stackrestore. If we haven't, none of these dependencies are relevant and we can avoid the relatively expensive scans entirely.
This is an extension of commit b7806c to handle one last case noticed in test changes for D118538. Again, this is thought to be a latent bug in the existing code, though this time I have not managed to reduce tests for the original algorithm.
The prior attempt had failed to account for this case:
%a = alloca i8
%ss = call i8* @llvm.stacksave()
call void @llvm.stackrestore(i8* %ss) ; deallocates any alloca moved between the save and restore
store i8 0, i8* %a
If we allow '%a' to reorder into the stacksave/restore region, then the alloca will be deallocated before the use. We will have taken a well-defined program and introduced a use-after-free bug.
There's also an inverse case where the alloca originally follows the stackrestore, and we need to prevent reordering it above the restore, as sketched below.
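Schematically (same style as above):
%ss = call i8* @llvm.stacksave()
call void @llvm.stackrestore(i8* %ss)
%a = alloca i8 ; must not be hoisted above the restore
store i8 0, i8* %a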
Compile time wise, we potentially do an extra scan of the block for each alloca seen in a bundle. This is significantly more expensive than the stacksave rooted version and is why I'd tried to avoid this in the initial patch. There is room to optimize this (by essentially caching a "has stacksave" bit per block), but I'm leaving that to future work if it actually shows up in practice. Since allocas in bundles should be rare in practice, I suspect we can defer the complexity for a long while.
The semantics of an inalloca alloca instruction require that it not be reordered with a preceding stacksave intrinsic call. Unfortunately, there's no def/use edge or memory dependence edge. (The memory point is slightly subtle, but in general a new allocation can't alias with a call which executes strictly before it comes into existence.)
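A hypothetical sketch of the pattern (callee name invented):
%ss = call i8* @llvm.stacksave()
%arg = alloca inalloca i8 ; must stay below the preceding stacksave
call void @use(i8* inalloca(i8) %arg)
call void @llvm.stackrestore(i8* %ss)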
I'd tried to tackle this same case previously in 689babdf6, but the fix chosen there turned out to be incomplete. As such, this change contains a full revert of the first fix attempt.
This was noticed when investigating problems which surfaced with D118538, but this is definitely an existing bug. This time around, I managed to reduce a couple of additional cases, including one which was being actively miscompiled even without the new scheduling change. (See test diffs)
Compile time wise, we only spend extra time when seeing a stacksave (rare), and even then we walk the block at most once per schedule window extension. Likely a non-issue.