result
Summary:
If the same value is used several times as an extra value, SLP
vectorizer takes it into account only once instead of actual number of
using.
For example:
```
int val = 1;
for (int y = 0; y < 8; y++) {
for (int x = 0; x < 8; x++) {
val = val + input[y * 8 + x] + 3;
}
}
```
We have 2 extra rguments: `1` - initial value of horizontal reduction
and `3`, which is added 8*8 times to the reduction. Before the patch we
added `1` to the reduction value and added once `3`, though it must be
added 64 times.
Reviewers: mkuper, mzolotukhin
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D30262
llvm-svn: 295972
result
Summary:
If the same value is used several times as an extra value, SLP
vectorizer takes it into account only once instead of actual number of
using.
For example:
```
int val = 1;
for (int y = 0; y < 8; y++) {
for (int x = 0; x < 8; x++) {
val = val + input[y * 8 + x] + 3;
}
}
```
We have 2 extra rguments: `1` - initial value of horizontal reduction
and `3`, which is added 8*8 times to the reduction. Before the patch we
added `1` to the reduction value and added once `3`, though it must be
added 64 times.
Reviewers: mkuper, mzolotukhin
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D30262
llvm-svn: 295956
result
Summary:
If the same value is used several times as an extra value, SLP
vectorizer takes it into account only once instead of actual number of
using.
For example:
```
int val = 1;
for (int y = 0; y < 8; y++) {
for (int x = 0; x < 8; x++) {
val = val + input[y * 8 + x] + 3;
}
}
```
We have 2 extra rguments: `1` - initial value of horizontal reduction
and `3`, which is added 8*8 times to the reduction. Before the patch we
added `1` to the reduction value and added once `3`, though it must be
added 64 times.
Reviewers: mkuper, mzolotukhin
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D30262
llvm-svn: 295949
Implement isLegalToVectorizeLoadChain for AMDGPU to avoid
producing private address spaces accesses that will need to be
split up later. This was doing the wrong thing in the case
where the queried chain was an even number of elements.
A possible <4 x i32> store was being split into
store <2 x i32>
store i32
store i32
rather than
store <2 x i32>
store <2 x i32>
when legal.
llvm-svn: 295933
Summary:
If the same value is used several times as an extra value, SLP
vectorizer takes it into account only once instead of actual number of
using.
For example:
```
int val = 1;
for (int y = 0; y < 8; y++) {
for (int x = 0; x < 8; x++) {
val = val + input[y * 8 + x] + 3;
}
}
```
We have 2 extra rguments: `1` - initial value of horizontal reduction
and `3`, which is added 8*8 times to the reduction. Before the patch we
added `1` to the reduction value and added once `3`, though it must be
added 64 times.
Reviewers: mkuper, mzolotukhin
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D30262
llvm-svn: 295868
Prevent memory objects of different address spaces to be part of
the same load/store groups when analysing interleaved accesses.
This is fixing pr31900.
Reviewers: HaoLiu, mssimpso, mkuper
Reviewed By: mssimpso, mkuper
Subscribers: llvm-commits, efriedma, mzolotukhin
Differential Revision: https://reviews.llvm.org/D29717
This reverts r295042 (re-applies r295038) with an additional fix for the
buildbot problem.
llvm-svn: 295858
We previously only created a vector phi node for an induction variable if its
step had a constant integer type. However, the step actually only needs to be
loop-invariant. We only handle inductions having loop-invariant steps, so this
patch should enable vector phi node creation for all integer induction
variables that will be vectorized.
Differential Revision: https://reviews.llvm.org/D29956
llvm-svn: 295456
This reapplies commit r294967 with a fix for the execution time regressions
caught by the clang-cmake-aarch64-quick bot. We now extend the truncate
optimization to non-primary induction variables only if the truncate isn't
already free.
Differential Revision: https://reviews.llvm.org/D29847
llvm-svn: 295063
back into a vector
Previously the cost of the existing ExtractElement/ExtractValue
instructions was considered as a dead cost only if it was detected that
they have only one use. But these instructions may be considered
dead also if users of the instructions are also going to be vectorized,
like:
```
%x0 = extractelement <2 x float> %x, i32 0
%x1 = extractelement <2 x float> %x, i32 1
%x0x0 = fmul float %x0, %x0
%x1x1 = fmul float %x1, %x1
%add = fadd float %x0x0, %x1x1
```
This can be transformed to
```
%1 = fmul <2 x float> %x, %x
%2 = extractelement <2 x float> %1, i32 0
%3 = extractelement <2 x float> %1, i32 1
%add = fadd float %2, %3
```
because though `%x0` and `%x1` have 2 users each other, these users are
part of the vectorized tree and we can consider these `extractelement`
instructions as dead.
Differential Revision: https://reviews.llvm.org/D29900
llvm-svn: 295056
Prevent memory objects of different address spaces to be part of
the same load/store groups when analysing interleaved accesses.
This is fixing pr31900.
Reviewers: HaoLiu, mssimpso, mkuper
Reviewed By: mssimpso, mkuper
Subscribers: llvm-commits, efriedma, mzolotukhin
Differential Revision: https://reviews.llvm.org/D29717
llvm-svn: 295038
This reverts commit r294967. This patch caused execution time slowdowns in a
few LLVM test-suite tests, as reported by the clang-cmake-aarch64-quick bot.
I'm reverting to investigate.
llvm-svn: 294973
This patch extends the optimization of truncations whose operand is an
induction variable with a constant integer step. Previously we were only
applying this optimization to the primary induction variable. However, the cost
model assumes the optimization is applied to the truncation of all integer
induction variables (even regardless of step type). The transformation is now
applied to the other induction variables, and I've updated the cost model to
ensure it is better in sync with the transformation we actually perform.
Differential Revision: https://reviews.llvm.org/D29847
llvm-svn: 294967
reductions.
Currently, LLVM supports vectorization of horizontal reduction
instructions with initial value set to 0. Patch supports vectorization
of reduction with non-zero initial values. Also, it supports a
vectorization of instructions with some extra arguments, like:
```
float f(float x[], int a, int b) {
float p = a % b;
p += x[0] + 3;
for (int i = 1; i < 32; i++)
p += x[i];
return p;
}
```
Patch allows vectorization of this kind of horizontal reductions.
Differential Revision: https://reviews.llvm.org/D29727
llvm-svn: 294934
Summary:
This patch starts the implementation as discuss in the following RFC: http://lists.llvm.org/pipermail/llvm-dev/2016-October/106532.html
When optimization duplicates code that will scale down the execution count of a basic block, we will record the duplication factor as part of discriminator so that the offline process tool can find the duplication factor and collect the accurate execution frequency of the corresponding source code. Two important optimization that fall into this category is loop vectorization and loop unroll. This patch records the duplication factor for these 2 optimizations.
The recording will be guarded by a flag encode-duplication-in-discriminators, which is off by default.
Reviewers: probinson, aprantl, davidxl, hfinkel, echristo
Reviewed By: hfinkel
Subscribers: mehdi_amini, anemet, mzolotukhin, llvm-commits
Differential Revision: https://reviews.llvm.org/D26420
llvm-svn: 294782
We previously only created a vector phi node for an induction variable if its
type matched the type of the canonical induction variable.
Differential Revision: https://reviews.llvm.org/D29776
llvm-svn: 294755
Making the cost model selecting between Interleave, GatherScatter or Scalar vectorization form of memory instruction.
The right decision should be done for non-consecutive memory access instrcuctions that may have more than one vectorization solution.
This patch includes the following changes:
- Cost Model calculates the cost of Load/Store vector form and choose the better option between Widening, Interleave, GatherScactter and Scalarization. Cost Model keeps the widening decision.
- Arrays of Uniform and Scalar values are moved from Legality to Cost Model.
- Cost Model collects Uniforms and Scalars per VF. The collection is based on CM decision map of Loadis/Stores vectorization form.
- Vectorization of memory instruction is performed according to the CM decision.
Differential Revision: https://reviews.llvm.org/D27919
llvm-svn: 294503
This breaks when one of the extra values is also a scalar that
participates in the same vectorization tree which we'll end up
reducing.
llvm-svn: 294245
Currently LLVM supports vectorization of horizontal reduction
instructions with initial value set to 0. Patch supports vectorization
of reduction with non-zero initial values. Also it supports a
vectorization of instructions with some extra arguments, like:
float f(float x[], int a, int b) {
float p = a % b;
p += x[0] + 3;
for (int i = 1; i < 32; i++)
p += x[i];
return p;
}
Patch allows vectorization of this kind of horizontal reductions.
Differential Revision: https://reviews.llvm.org/D28961
llvm-svn: 293994
This patch moves some helper functions related to interleaved access
vectorization out of LoopVectorize.cpp and into VectorUtils.cpp. We would like
to use these functions in a follow-on patch that improves interleaved load and
store lowering in (ARM/AArch64)ISelLowering.cpp. One of the functions was
already duplicated there and has been removed.
Differential Revision: https://reviews.llvm.org/D29398
llvm-svn: 293788
By calling getScalarizationOverhead with the CallInst instead of the types of
its arguments, we make sure that only unique call arguments are added to the
scalarization cost.
getScalarizationOverhead() is extended to handle calls by only passing on the
actual call arguments (which is not all the operands).
This also eliminates a wrapper function with the same name.
review: Hal Finkel
llvm-svn: 293459
The jumbled scalar loads will be sorted while building the tree and these accesses will be marked to generate shufflevector after the vectorized load with proper mask.
Reviewers: hfinkel, mssimpso, mkuper
Differential Revision: https://reviews.llvm.org/D26905
Change-Id: I9c0c8e6f91a00076a7ee1465440a3f6ae092f7ad
llvm-svn: 293386
Some checks in SLP horizontal reduction analysis function are performed
several times, though it is enough to perform these checks only once
during an initial attempt at adding candidate for the reduction
instruction/reduced value.
Differential Revision: https://reviews.llvm.org/D29175
llvm-svn: 293274
change the set of uniform instructions in the loop causing an assert
failure.
The problem is that the legalization checking also builds data
structures mapping various facts about the loop body. The immediate
cause was the set of uniform instructions. If these then change when
LCSSA is formed, the data structures would already have been built and
become stale. The included test case triggered an assert in loop
vectorize that was reduced out of the new PM's pipeline.
The solution is to form LCSSA early enough that no information is cached
across the changes made. The only really obvious position is outside of
the main logic to vectorize the loop. This also has the advantage of
removing one case where forming LCSSA could mutate the loop but we
wouldn't track that as a "Changed" state.
If it is significantly advantageous to do some legalization checking
prior to this, we can do a more careful positioning but it seemed best
to just back off to a safe position first.
llvm-svn: 293168
Refactoring to remove duplications of this method.
New method getOperandsScalarizationOverhead() that looks at the present unique
operands and add extract costs for them. Old behaviour was to just add extract
costs for one operand of the type always, which still happens in
getArithmeticInstrCost() if no operands are provided by the caller.
This is a good start of improving on this, but there are more places
that can be improved by using getOperandsScalarizationOverhead().
Review: Hal Finkel
https://reviews.llvm.org/D29017
llvm-svn: 293155
instructions.
If number of instructions in horizontal reduction list is not power of 2
then only PowerOf2Floor(NumberOfInstructions) last elements are actually
vectorized, other instructions remain scalar. Patch tries to vectorize
the remaining elements either.
Differential Revision: https://reviews.llvm.org/D28959
llvm-svn: 293042
Removed data members ReduxWidth and MinVecRegSize + some C++11 stylish
improvements.
Differential Revision: https://reviews.llvm.org/D29010
llvm-svn: 292899
This changes the vectorizer to explicitly use the loopsimplify and lcssa utils,
instead of "requiring" the transformations as if they were analyses.
This is not NFC, since it changes the LCSSA behavior - we no longer run LCSSA
for all loops, but rather only for the loops we expect to modify.
Differential Revision: https://reviews.llvm.org/D28868
llvm-svn: 292456
We currently check whether a reduction has a single outside user. We don't
really need to require that - we just need to make sure a single value is
used externally. The number of external users of that value shouldn't actually
matter.
Differential Revision: https://reviews.llvm.org/D28830
llvm-svn: 292424
If a memory instruction will be vectorized, but it's pointer operand is
non-consecutive-like, the instruction is a gather or scatter operation. Its
pointer operand will be non-uniform. This should fix PR31671.
Reference: https://llvm.org/bugs/show_bug.cgi?id=31671
Differential Revision: https://reviews.llvm.org/D28819
llvm-svn: 292254
a function's CFG when that CFG is unchanged.
This allows transformation passes to simply claim they preserve the CFG
and analysis passes to check for the CFG being preserved to remove the
fanout of all analyses being listed in all passes.
I've gone through and removed or cleaned up as many of the comments
reminding us to do this as I could.
Differential Revision: https://reviews.llvm.org/D28627
llvm-svn: 292054
The removed assert seems bogus - it's perfectly legal for the roots of the
vectorized subtrees to be equal even if the original scalar values aren't,
if the original scalars happen to be equivalent.
This fixes PR31599.
Differential Revision: https://reviews.llvm.org/D28539
llvm-svn: 291692
updated instructions:
pmulld, pmullw, pmulhw, mulsd, mulps, mulpd, divss, divps, divsd, divpd, addpd and subpd.
special optimization case which replaces pmulld with pmullw\pmulhw\pshuf seq.
In case if the real operands bitwidth <= 16.
Differential Revision: https://reviews.llvm.org/D28104
llvm-svn: 291657
arguments much like the CGSCC pass manager.
This is a major redesign following the pattern establish for the CGSCC layer to
support updates to the set of loops during the traversal of the loop nest and
to support invalidation of analyses.
An additional significant burden in the loop PM is that so many passes require
access to a large number of function analyses. Manually ensuring these are
cached, available, and preserved has been a long-standing burden in LLVM even
with the help of the automatic scheduling in the old pass manager. And it made
the new pass manager extremely unweildy. With this design, we can package the
common analyses up while in a function pass and make them immediately available
to all the loop passes. While in some cases this is unnecessary, I think the
simplicity afforded is worth it.
This does not (yet) address loop simplified form or LCSSA form, but those are
the next things on my radar and I have a clear plan for them.
While the patch is very large, most of it is either mechanically updating loop
passes to the new API or the new testing for the loop PM. The code for it is
reasonably compact.
I have not yet updated all of the loop passes to correctly leverage the update
mechanisms demonstrated in the unittests. I'll do that in follow-up patches
along with improved FileCheck tests for those passes that ensure things work in
more realistic scenarios. In many cases, there isn't much we can do with these
until the loop simplified form and LCSSA form are in place.
Differential Revision: https://reviews.llvm.org/D28292
llvm-svn: 291651
This patch delays the fix-up step for external induction variable users until
after the dominator tree has been properly updated. This should fix PR30742.
The SCEVExpander in InductionDescriptor::transform can generate code in the
wrong location if the dominator tree is not up-to-date. We should work towards
keeping the dominator tree up-to-date throughout the transformation.
Reference: https://llvm.org/bugs/show_bug.cgi?id=30742
Differential Revision: https://reviews.llvm.org/D28168
llvm-svn: 291462
This code seems to be target dependent which may not be the same for all targets.
Passed the decision whether the given stride is complex or not to the target by sending stride information via SCEV to getAddressComputationCost instead of 'IsComplex'.
Specifically at X86 targets we dont see any significant address computation cost in case of the strided access in general.
Differential Revision: https://reviews.llvm.org/D27518
llvm-svn: 291106
This patch reapplies r289863. The original patch was reverted because it
exposed a bug causing the loop vectorizer to crash in the Python runtime on
PPC. The underlying issue was fixed with r289958.
llvm-svn: 289975
After r288909, instructions feeding predicated instructions may be scalarized
if profitable. Since these instructions will remain scalar, we shouldn't
attempt to type-shrink them. We should only truncate vector types to their
minimal bit widths. This bug was exposed by enabling the vectorization of loops
containing conditional stores by default.
llvm-svn: 289958
stores by default
This uncovers a crasher in the loop vectorizer on PPC when building the
Python runtime. I'll send the testcase to the review thread for the
original commit.
llvm-svn: 289934
This patch sets the default value of the "-enable-cond-stores-vec" command line
option to "true".
Differential Revision: https://reviews.llvm.org/D27814
llvm-svn: 289863
After r289755, the AssumptionCache is no longer needed. Variables affected by
assumptions are now found by using the new operand-bundle-based scheme. This
new scheme is more computationally efficient, and also we need much less
code...
llvm-svn: 289756
We currently check if the exact trip count is known and is smaller than the
"tiny loop" bound. We should be checking the maximum bound on the trip count
instead.
Differential Revision: https://reviews.llvm.org/D27690
llvm-svn: 289583
This patch ensures the correct minimum bit width during type-shrinking.
Previously when type-shrinking, we always sign-extended values back to their
original width. However, if we are going to sign-extend, and the sign bit is
unknown, we have to increase the minimum bit width by one bit so the
sign-extend will fill the upper bits correctly. If the sign bit is known to be
zero, we can perform a zero-extend instead. This should fix PR31243.
Reference: https://llvm.org/bugs/show_bug.cgi?id=31243
Differential Revision: https://reviews.llvm.org/D27466
llvm-svn: 289470
When trying to vectorize trees that start at insertelement instructions
function tryToVectorizeList() uses vectorization factor calculated as
MinVecRegSize/ScalarTypeSize. But sometimes it does not work as tree
cost for this fixed vectorization factor is too high.
Patch tries to improve the situation. It tries different vectorization
factors from max(PowerOf2Floor(NumberOfVectorizedValues),
MinVecRegSize/ScalarTypeSize) to MinVecRegSize/ScalarTypeSize and tries
to choose the best one.
Differential Revision: https://reviews.llvm.org/D27215
llvm-svn: 289043
This patch attempts to scalarize the operand expressions of predicated
instructions if they were conditionally executed in the original loop. After
scalarization, the expressions will be sunk inside the blocks created for the
predicated instructions. The transformation essentially performs
un-if-conversion on the operands.
The cost model has been updated to determine if scalarization is profitable. It
compares the cost of a vectorized instruction, assuming it will be
if-converted, to the cost of the scalarized instruction, assuming that the
instructions corresponding to each vector lane will be sunk inside a predicated
block, possibly avoiding execution. If it's more profitable to scalarize the
entire expression tree feeding the predicated instruction, the expression will
be scalarized; otherwise, it will be vectorized. We only consider the cost of
the entire expression to accurately estimate the cost of the required
insertelement and extractelement instructions.
Differential Revision: https://reviews.llvm.org/D26083
llvm-svn: 288909
This reverts commit r288497, as it broke the AArch64 build of Compiler-RT's
builtins (twice: once in r288412 and once in r288497). We should investigate
this offline.
llvm-svn: 288508
When trying to vectorize trees that start at insertelement instructions
function tryToVectorizeList() uses vectorization factor calculated as
MinVecRegSize/ScalarTypeSize. But sometimes it does not work as tree
cost for this fixed vectorization factor is too high.
Patch tries to improve the situation. It tries different vectorization
factors from max(PowerOf2Floor(NumberOfVectorizedValues),
MinVecRegSize/ScalarTypeSize) to MinVecRegSize/ScalarTypeSize and tries
to choose the best one.
Differential Revision: https://reviews.llvm.org/D27215
llvm-svn: 288497
When trying to vectorize trees that start at insertelement instructions
function tryToVectorizeList() uses vectorization factor calculated as
MinVecRegSize/ScalarTypeSize. But sometimes it does not work as tree
cost for this fixed vectorization factor is too high.
Patch tries to improve the situation. It tries different vectorization
factors from max(PowerOf2Floor(NumberOfVectorizedValues),
MinVecRegSize/ScalarTypeSize) to MinVecRegSize/ScalarTypeSize and tries
to choose the best one.
Differential Revision: https://reviews.llvm.org/D27215
llvm-svn: 288412
Currently when cost of scalar operations is evaluated the vector type is
used for scalar operations. Patch fixes this issue and fixes evaluation
of the vector operations cost.
Several test showed that vector cost model is too optimistic. It
allowed vectorization of 8 or less add/fadd operations, though scalar
code is faster. Actually, only for 16 or more operations vector code
provides better performance.
Differential Revision: https://reviews.llvm.org/D26277
llvm-svn: 288398
Currently SLP vectorizer tries to vectorize a binary operation and dies
immediately after unsuccessful the first unsuccessfull attempt. Patch
tries to improve the situation, trying to vectorize all binary
operations of all children nodes in the binop tree.
Differential Revision: https://reviews.llvm.org/D25517
llvm-svn: 288115
Summary:
The "getVectorizablePrefix" method would give up if it found an aliasing load for a store chain.
In practice, the aliasing load can be treated as a memory barrier and all stores that precede it
are a valid vectorizable prefix.
Issue found by volkan in D26962. Testcase is a pruned version of the one in the original patch.
Reviewers: jlebar, arsenm, tstellarAMD
Subscribers: mzolotukhin, wdng, nhaehnle, anna, volkan, llvm-commits
Differential Revision: https://reviews.llvm.org/D27008
llvm-svn: 287781
This patch updates a bunch of places where add_dependencies was being explicitly called to add dependencies on intrinsics_gen to instead use the DEPENDS named parameter. This cleanup is needed for a patch I'm working on to add a dependency debugging mode to the build system.
llvm-svn: 287206
The register usage algorithm incorrectly treats instructions whose value is
not used within the loop (e.g. those that do not produce a value).
The algorithm first calculates the usages within the loop. It iterates over
the instructions in order, and records at which instruction index each use
ends (in fact, they're actually recorded against the next index, as this is
when we want to delete them from the open intervals).
The algorithm then iterates over the instructions again, adding each
instruction in turn to a list of open intervals. Instructions are then
removed from the list of open intervals when they occur in the list of uses
ended at the current index.
The problem is, instructions which are not used in the loop are skipped.
However, although they aren't used, the last use of a value may have been
recorded against that instruction index. In this case, the use is not deleted
from the open intervals, which may then bump up the estimated register usage.
This patch fixes the issue by simply moving the "is used" check after the loop
which erases the uses at the current index.
Differential Revision: https://reviews.llvm.org/D26554
llvm-svn: 286969
This is PR28376.
Unfortunately given the current structure of optimization diagnostics we
lack the capability to tell whether the user has
passed -Rpass-analysis=loop-vectorize since this is local to the
front-end (BackendConsumer::OptimizationRemarkHandler).
So rather than printing this even if the user has already
passed -Rpass-analysis, this patch just punts and stops recommending
this option. I don't think that getting this right is worth the
complexity.
Differential Revision: https://reviews.llvm.org/D26563
llvm-svn: 286662
possible pointer-wrap-around concerns, in some cases.
Before this patch, collectConstStridedAccesses (part of interleaved-accesses
analysis) called getPtrStride with [Assume=false, ShouldCheckWrap=true] when
examining all candidate pointers. This is too conservative. Instead, this
patch makes collectConstStridedAccesses use an optimistic approach, calling
getPtrStride with [Assume=true, ShouldCheckWrap=false], and then, once the
candidate interleave groups have been formed, revisits the pointer-wrapping
analysis but only where it matters: namely, in groups that have gaps, and where
the gaps are not at the very end of the group (in which case the loop is
peeled). This second time getPtrStride is called with [Assume=false,
ShouldCheckWrap=true], but this could further be improved to using Assume=true,
once we also add the logic to track that we are not going to meet the scev
runtime checks threshold.
Differential Revision: https://reviews.llvm.org/D25276
llvm-svn: 285517
After successfull horizontal reduction vectorization attempt for PHI node
vectorizer tries to update root binary op by combining vectorized tree
and the ReductionPHI node. But during vectorization this ReductionPHI
can be vectorized itself and replaced by the `undef` value, while the
instruction itself is marked for deletion. This 'marked for deletion'
PHI node then can be used in new binary operation, causing "Use still
stuck around after Def is destroyed" crash upon PHI node deletion.
Also the test is fixed to make it perform actual testing.
Differential Revision: https://reviews.llvm.org/D25671
llvm-svn: 285286
When we predicate an instruction (div, rem, store) we place the instruction in
its own basic block within the vectorized loop. If a predicated instruction has
scalar operands, it's possible to recursively sink these scalar expressions
into the predicated block so that they might avoid execution. This patch sinks
as much scalar computation as possible into predicated blocks. We previously
were able to sink such operands only if they were extractelement instructions.
Differential Revision: https://reviews.llvm.org/D25632
llvm-svn: 285097
Some instructions from the original loop, when vectorized, can become trivially
dead. This happens because of the way we structure the new loop. For example,
we create new induction variables and induction variable "steps" in the new
loop. Thus, when we go to vectorize the original induction variable update, it
may no longer be needed due to the instructions we've already created. This
patch prevents us from creating these redundant instructions. This reduces code
size before simplification and allows greater flexibility in code generation
since we have fewer unnecessary instruction uses.
Differential Revision: https://reviews.llvm.org/D25631
llvm-svn: 284631
This patch modifies the cost calculation of predicated instructions (div and
rem) to avoid the accumulation of rounding errors due to multiple truncating
integer divisions. The calculation for predicated stores will be addressed in a
follow-on patch since we currently don't scale the cost of predicated stores by
block probability.
Differential Revision: https://reviews.llvm.org/D25333
llvm-svn: 284123
Previously, we marked the branch conditions of latch blocks uniform after
vectorization if they were instructions contained in the loop. However, if a
condition instruction has users other than the branch, it may not remain
uniform. This patch ensures the conditions we mark uniform are only used by the
branch. This should fix PR30627.
Reference: https://llvm.org/bugs/show_bug.cgi?id=30627
llvm-svn: 283563
unrolling.
The next code is not vectorized by the SLPVectorizer:
```
int test(unsigned int *p) {
int sum = 0;
for (int i = 0; i < 8; i++)
sum += p[i];
return sum;
}
```
During optimization this loop is fully unrolled and SLPVectorizer is
unable to vectorize it. Patch tries to fix this problem.
Differential Revision: https://reviews.llvm.org/D24796
llvm-svn: 283535
The vectorizer already holds a pointer to one cost model artifact in a member
variable (i.e., MinBWs). As we add more, it will be easier to communicate these
artifacts to the vectorizer if we simply pass a pointer to the cost model
instead.
llvm-svn: 283373
The vectorizer already holds a pointer to the legality analysis in a member
variable, so it makes sense that we would pass it in the constructor.
llvm-svn: 283368
This patch refactors the cost estimation of scalarized loads and stores to
reuse getScalarizationOverhead for the cost of the extractelement and
insertelement instructions we might create. The existing code accounted for
this cost, but it was functionally equivalent to the helper function.
llvm-svn: 283364
The cost model has to estimate the probability of executing predicated blocks.
However, we currently always assume predicated blocks have a 50% chance of
executing (this value is hardcoded in several places throughout the code).
Since we always use the same value, this patch adds a helper function for
getting this uniform probability. The function simplifies some comments and
makes our assumptions more clear. In the future, we may want to extend this
with actual block probability information if it's available.
llvm-svn: 283354
This patch adds a single helper function for checking if an instruction will be
scalarized with predication. Such instructions include conditional stores and
instructions that may divide by zero. Existing checks have been updated to use
the new function.
llvm-svn: 283350
Summary: Added 6 new target hooks for the vectorizer in order to filter types, handle size constraints and decide how to split chains.
Reviewers: tstellarAMD, arsenm
Subscribers: arsenm, mzolotukhin, wdng, llvm-commits, nhaehnle
Differential Revision: https://reviews.llvm.org/D24727
llvm-svn: 283099
When building the steps for scalar induction variables, we previously attempted
to determine if all the scalar users of the induction variable were uniform. If
they were, we would only emit the step corresponding to vector lane zero. This
optimization was too aggressive. We generally don't know the entire set of
induction variable users that will be scalar. We have
isScalarAfterVectorization, but this is only a conservative estimate of the
instructions that will be scalarized. Thus, an induction variable may have
scalar users that aren't already known to be scalar. To avoid emitting unused
steps, we can only check that the induction variable is uniform. This should
fix PR30542.
Reference: https://llvm.org/bugs/show_bug.cgi?id=30542
llvm-svn: 282863
(Recommit after making sure IsVerbose gets properly initialized in
DiagnosticInfoOptimizationBase. See previous commit that takes care of
this.)
OptimizationRemarkAnalysis directly takes the role of the report that is
generated by LAA.
Then we need the magic to be able to turn an LAA remark into an LV
remark. This is done via a new OptimizationRemark ctor.
llvm-svn: 282813
OptimizationRemarkAnalysis directly takes the role of the report that is
generated by LAA.
Then we need the magic to be able to turn an LAA remark into an LV
remark. This is done via a new OptimizationRemark ctor.
llvm-svn: 282758
The last one remaining after which emitAnalysis can be removed is when
we convert the LAA's report to a vectorization report. This requires
converting LAA to the new interface first.
llvm-svn: 282726
This patch ensures that we actually scalarize instructions marked scalar after
vectorization. Previously, such instructions may have been vectorized instead.
Differential Revision: https://reviews.llvm.org/D23889
llvm-svn: 282418
If we identify an instruction as uniform after vectorization, we know that we
should only use the value corresponding to the first vector lane of each unroll
iteration. However, when scalarizing such instructions, we still produce values
for the other vector lanes. This patch prevents us from generating the unused
scalars.
Differential Revision: https://reviews.llvm.org/D24275
llvm-svn: 282087
Simplified GEP cloning in vectorizeMemoryInstruction().
Added an assertion that checks consecutive GEP, which should have only one loop-variant operand.
Differential Revision: https://reviews.llvm.org/D24557
llvm-svn: 281851
This patch moves the processing of pointer induction variables in
collectLoopUniforms from the consecutive pointer phase of the analysis to the
phi node phase. Previously, if a pointer induction variable was used by both a
scalarized non-memory instruction as well as a vectorized memory instruction,
we would incorrectly identify the pointer as uniform. Pointer induction
variables should be treated the same as other phi nodes. That is, they are
uniform if all users of the induction variable and induction variable update
are uniform.
Differential Revision: https://reviews.llvm.org/D24511
llvm-svn: 281485
The test case included in r280979 wasn't checking what it was supposed to be
checking for the predicated store case. Fixing the test revealed that the
multi-use case (when a pointer is used by both vectorized and scalarized memory
accesses) wasn't being handled properly. We can't skip over
non-consecutive-like pointers since they may have looked consecutive-like with
a different memory access.
llvm-svn: 280992
Previously, all consecutive pointers were marked uniform after vectorization.
However, if a consecutive pointer is used by a memory access that is eventually
scalarized, the pointer won't remain uniform after all. An example is
predicated stores. Even though a predicated store may be consecutive, it will
still be scalarized, making it's pointer operand non-uniform.
This patch updates the logic in collectLoopUniforms to consider the cases where
a memory access may be scalarized. If a memory access may be scalarized, its
pointer operand is not marked uniform. The determination of whether a given
memory instruction will be scalarized or not has been moved into a common
function that is used by the vectorizer, cost model, and legality analysis.
Differential Revision: https://reviews.llvm.org/D24271
llvm-svn: 280979
Summary:
LSV replaces multiple adjacent loads with one vectorized load and a
bunch of extractelement instructions. This patch makes the
extractelement instructions' names match those of the original loads,
for (hopefully) improved readability.
Reviewers: asbirlea, tstellarAMD
Subscribers: arsenm, mzolotukhin
Differential Revision: https://reviews.llvm.org/D23748
llvm-svn: 280818
Inheriting from std::iterator uses more boiler-plate than manual
typedefs. Avoid that in both ilist_iterator and
MachineInstrBundleIterator.
This has the side effect of removing ilist_iterator from certain ADL
lookups in namespace std; calls to std::next need to be qualified by
"std::" that didn't have to before. The one case of this in-tree was
operating on a temporary, so I used the more compact operator++.
llvm-svn: 280570
For uniform instructions, we're only required to generate a scalar value for
the first vector lane of each unroll iteration. Thus, if we have a reverse
interleaved group, computing the member index off the scalar GEP corresponding
to the last vector lane of its pointer operand technically makes the GEP
non-uniform. We should compute the member index off the first scalar GEP
instead.
I've added the updated member index computation to the existing reverse
interleaved group test.
llvm-svn: 280497
We can now maintain scalar values in VectorLoopValueMap. Thus, we no longer
have to create temporary vectors with insertelement instructions when handling
pointer induction variables. This case was mistakenly missed from r279649 when
refactoring the other scalarization code.
llvm-svn: 280405
This patch moves the allocation of VectorParts for PHI nodes into the actual
PHI widening code. Previously, we allocated these VectorParts in
vectorizeBlockInLoop, and passed them by reference to widenPHIInstruction. Upon
returning, we would then map the VectorParts in VectorLoopValueMap. This
behavior is problematic for the cases where we only want to generate a scalar
version of a PHI node. For example, if in the future we only generate a scalar
version of an induction variable, we would end up inserting an empty vector
entry into the map once we return to vectorizeBlockInLoop. We now no longer
need to pass VectorParts to the various PHI widening functions, and we can keep
VectorParts allocation as close as possible to the point at which they are
actually mapped in VectorLoopValueMap.
llvm-svn: 280390
Passing the types/opcode check still doesn't guarantee we'll actually vectorize.
Therefore, just make it clear we're attempting to vectorize.
llvm-svn: 280263
Summary:
LSV was using two vector sets (heads and tails) to track pairs of adjiacent position to vectorize.
A recent optimization is trying to obtain the longest chain to vectorize and assumes the positions
in heads(H) and tails(T) match, which is not the case is there are multiple tails for the same head.
e.g.:
i1: store a[0]
i2: store a[1]
i3: store a[1]
Leads to:
H: i1
T: i2 i3
Instead of:
H: i1 i1
T: i2 i3
So the positions for instructions that follow i3 will have different indexes in H/T.
This patch resolves PR29148.
This issue also surfaced the fact that if the chain is too long, and TLI
returns a "not-fast" answer, the whole chain will be abandoned for
vectorization, even though a smaller one would be beneficial.
Added a testcase and FIXME for this.
Reviewers: tstellarAMD, arsenm, jlebar
Subscribers: mzolotukhin, wdng, llvm-commits
Differential Revision: https://reviews.llvm.org/D24057
llvm-svn: 280179
We don't need to limit predication to blocks that have a single incoming
edge, we just need to use the right mask.
This fixes PR30172.
Differential Revision: https://reviews.llvm.org/D24009
llvm-svn: 280148
Reverse iterators to doubly-linked lists can be simpler (and cheaper)
than std::reverse_iterator. Make it so.
In particular, change ilist<T>::reverse_iterator so that it is *never*
invalidated unless the node it references is deleted. This matches the
guarantees of ilist<T>::iterator.
(Note: MachineBasicBlock::iterator is *not* an ilist iterator, but a
MachineInstrBundleIterator<MachineInstr>. This commit does not change
MachineBasicBlock::reverse_iterator, but it does update
MachineBasicBlock::reverse_instr_iterator. See note at end of commit
message for details on bundle iterators.)
Given the list (with the Sentinel showing twice for simplicity):
[Sentinel] <-> A <-> B <-> [Sentinel]
the following is now true:
1. begin() represents A.
2. begin() holds the pointer for A.
3. end() represents [Sentinel].
4. end() holds the poitner for [Sentinel].
5. rbegin() represents B.
6. rbegin() holds the pointer for B.
7. rend() represents [Sentinel].
8. rend() holds the pointer for [Sentinel].
The changes are #6 and #8. Here are some properties from the old
scheme (which used std::reverse_iterator):
- rbegin() held the pointer for [Sentinel] and rend() held the pointer
for A;
- operator*() cost two dereferences instead of one;
- converting from a valid iterator to its valid reverse_iterator
involved a confusing increment; and
- "RI++->erase()" left RI invalid. The unintuitive replacement was
"RI->erase(), RE = end()".
With vector-like data structures these properties are hard to avoid
(since past-the-beginning is not a valid pointer), and don't impose a
real cost (since there's still only one dereference, and all iterators
are invalidated on erase). But with lists, this was a poor design.
Specifically, the following code (which obviously works with normal
iterators) now works with ilist::reverse_iterator as well:
for (auto RI = L.rbegin(), RE = L.rend(); RI != RE;)
fooThatMightRemoveArgFromList(*RI++);
Converting between iterator and reverse_iterator for the same node uses
the getReverse() function.
reverse_iterator iterator::getReverse();
iterator reverse_iterator::getReverse();
Why doesn't iterator <=> reverse_iterator conversion use constructors?
In order to catch and update old code, reverse_iterator does not even
have an explicit conversion from iterator. It wouldn't be safe because
there would be no reasonable way to catch all the bugs from the changed
semantic (see the changes at call sites that are part of this patch).
Old code used this API:
std::reverse_iterator::reverse_iterator(iterator);
iterator std::reverse_iterator::base();
Here's how to update from old code to new (that incorporates the
semantic change), assuming I is an ilist<>::iterator and RI is an
ilist<>::reverse_iterator:
[Old] ==> [New]
reverse_iterator(I) (--I).getReverse()
reverse_iterator(I) ++I.getReverse()
--reverse_iterator(I) I.getReverse()
reverse_iterator(++I) I.getReverse()
RI.base() (--RI).getReverse()
RI.base() ++RI.getReverse()
--RI.base() RI.getReverse()
(++RI).base() RI.getReverse()
delete &*RI, RE = end() delete &*RI++
RI->erase(), RE = end() RI++->erase()
=======================================
Note: bundle iterators are out of scope
=======================================
MachineBasicBlock::iterator, also known as
MachineInstrBundleIterator<MachineInstr>, is a wrapper to represent
MachineInstr bundles. The idea is that each operator++ takes you to the
beginning of the next bundle. Implementing a sane reverse iterator for
this is harder than ilist. Here are the options:
- Use std::reverse_iterator<MBB::i>. Store a handle to the beginning of
the next bundle. A call to operator*() runs a loop (usually
operator--() will be called 1 time, for unbundled instructions).
Increment/decrement just works. This is the status quo.
- Store a handle to the final node in the bundle. A call to operator*()
still runs a loop, but it iterates one time fewer (usually
operator--() will be called 0 times, for unbundled instructions).
Increment/decrement just works.
- Make the ilist_sentinel<MachineInstr> *always* store that it's the
sentinel (instead of just in asserts mode). Then the bundle iterator
can sniff the sentinel bit in operator++().
I initially tried implementing the end() option as part of this commit,
but updating iterator/reverse_iterator conversion call sites was
error-prone. I have a WIP series of patches that implements the final
option.
llvm-svn: 280032
After r279649 when getting a vector value from VectorLoopValueMap, we create an
insertelement sequence on-demand if the value has been scalarized instead of
vectorized. We previously inserted this insertelement sequence before the
value's first vector user. However, this insert location is problematic if that
user is the phi node of a first-order recurrence. With this patch, we move the
insertelement sequence after the last scalar instruction we created when
scalarizing the value. Thus, the value's vector definition in the new loop will
immediately follow its scalar definitions. This should fix PR30183.
Reference: https://llvm.org/bugs/show_bug.cgi?id=30183
llvm-svn: 280001
This patch unifies the data structures we use for mapping instructions from the
original loop to their corresponding instructions in the new loop. Previously,
we maintained two distinct maps for this purpose: WidenMap and ScalarIVMap.
WidenMap maintained the vector values each instruction from the old loop was
represented with, and ScalarIVMap maintained the scalar values each scalarized
induction variable was represented with. With this patch, all values created
for the new loop are maintained in VectorLoopValueMap.
The change allows for several simplifications. Previously, when an instruction
was scalarized, we had to insert the scalar values into vectors in order to
maintain the mapping in WidenMap. Then, if a user of the scalarized value was
also scalar, we had to extract the scalar values from the temporary vector we
created. We now aovid these unnecessary scalar-to-vector-to-scalar conversions.
If a scalarized value is used by a scalar instruction, the scalar value is used
directly. However, if the scalarized value is needed by a vector instruction,
we generate the needed insertelement instructions on-demand.
A common idiom in several locations in the code (including the scalarization
code), is to first get the vector values an instruction from the original loop
maps to, and then extract a particular scalar value. This patch adds
getScalarValue for this purpose along side getVectorValue as an interface into
VectorLoopValueMap. These functions work together to return the requested
values if they're available or to produce them if they're not.
The mapping has also be made less permissive. Entries can be added to
VectorLoopValue map with the new initVector and initScalar functions.
getVectorValue has been modified to return a constant reference to the mapped
entries.
There's no real functional change with this patch; however, in some cases we
will generate slightly different code. For example, instead of an insertelement
sequence following the definition of an instruction, it will now precede the
first use of that instruction. This can be seen in the test case changes.
Differential Revision: https://reviews.llvm.org/D23169
llvm-svn: 279649
div/rem instructions in basic blocks that require predication currently prevent
vectorization. This patch extends the existing mechanism for predicating stores
to handle other instructions and leverages it to predicate divs and rems.
Differential Revision: https://reviews.llvm.org/D22918
llvm-svn: 279620
The test case included with r279125 exposed an existing signed integer
overflow. Since getTreeCost can return INT_MAX, we can't sum this cost together
with other costs, such as getReductionCost.
This patch removes the possibility of assigning a cost of INT_MAX. Since we
were previously using INT_MAX as an indicator for "should not vectorize", we
now explicitly check this condition with "isTreeTinyAndNotFullyVectorizable"
before computing a cost.
This patch adds a run-line to the test case used for r279125 that ensures we
don't vectorize. Previously, this line would vectorize the test case by chance
due to undefined behavior in the cost calculation.
Differential Revision: https://reviews.llvm.org/D23723
llvm-svn: 279562
The test case included in r279125 exposed existing undefined behavior in the
SLP vectorizer that it did not introduce. This patch reapplies the original
patch, but modifies the test case to avoid hitting the undefined behavior. This
allows us to close PR28330 while keeping the UBSan bot happy. The undefined
behavior the original test uncovered will be addressed in a follow-on patch.
Reference: https://llvm.org/bugs/show_bug.cgi?id=28330
llvm-svn: 279370
We abort building vectorizable trees in some cases (e.g., if the maximum
recursion depth is reached, if the region size is too large, etc.). If this
happens for a reduction, we can be left with a root entry that needs to be
gathered. For these cases, we need make sure we actually set VectorizedValue to
the resulting vector.
This patch ensures we properly set VectorizedValue, and it also ensures the
insertelement sequence generated for the gathers is inserted at the correct
location.
Reference: https://llvm.org/bugs/show_bug.cgi?id=28330
Differential Revison: https://reviews.llvm.org/D23410
llvm-svn: 279125
Summary: I later (after r278573) found that LoopIterator.h has some overlapping with LoopBodyTraits. It's good to use LoopBodyTraits because a *Traits struct is algorithm independent.
Reviewers: anemet, nadav, mkuper
Subscribers: mzolotukhin, llvm-commits
Differential Revision: https://reviews.llvm.org/D23529
llvm-svn: 278996
Summary:
In getVectorizablePrefix, this is less efficient (because we have to
iterate over the BB twice), but boy is it simpler. Given how much
trouble we've had here, I think the simplicity gain is worthwhile.
In reorder(), this is actually more efficient, as
DominatorTree::dominates iterates over the BB from the beginning when
the two instructions are in the same BB.
Reviewers: asbirlea
Subscribers: arsenm, llvm-commits, mzolotukhin
Differential Revision: https://reviews.llvm.org/D23472
llvm-svn: 278580
InnerLoopVectorizer shouldn't handle a loop with cycles inside the loop
body, even if that cycle isn't a natural loop.
Fixes PR28541.
Differential Revision: https://reviews.llvm.org/D22952
llvm-svn: 278573