Summary:
Currently most of the time vectors of extractelement instructions are
treated as scalars that must be gathered into vectors. But in some
cases, like when we have extractelement instructions from single vector
with different constant indeces or from 2 vectors of the same size, we
can treat this operations as shuffle of a single vector or blending of 2
vectors.
```
define <2 x i8> @g(<2 x i8> %x, <2 x i8> %y) {
%x0 = extractelement <2 x i8> %x, i32 0
%y1 = extractelement <2 x i8> %y, i32 1
%x0x0 = mul i8 %x0, %x0
%y1y1 = mul i8 %y1, %y1
%ins1 = insertelement <2 x i8> undef, i8 %x0x0, i32 0
%ins2 = insertelement <2 x i8> %ins1, i8 %y1y1, i32 1
ret <2 x i8> %ins2
}
```
can be converted to something like
```
define <2 x i8> @g(<2 x i8> %x, <2 x i8> %y) {
%1 = shufflevector <2 x i8> %x, <2 x i8> %y, <2 x i32> <i32 0, i32 3>
%2 = mul <2 x i8> %1, %1
ret <2 x i8> %2
}
```
Currently this type of conversion is considered as high cost
transformation.
Reviewers: mzolotukhin, delena, mkuper, hfinkel, RKSimon
Subscribers: ashahid, RKSimon, spatel, llvm-commits
Differential Revision: https://reviews.llvm.org/D30200
llvm-svn: 309812
ARM Neon has native support for half-sized vector registers (64 bits). This
is beneficial for example for 2D and 3D graphics. This patch adds the option
to lower MinVecRegSize from 128 via a TTI in the SLP Vectorizer.
*** Performance Analysis
This change was motivated by some internal benchmarks but it is also
beneficial on SPEC and the LLVM testsuite.
The results are with -O3 and PGO. A negative percentage is an improvement.
The testsuite was run with a sample size of 4.
** SPEC
* CFP2006/482.sphinx3 -3.34%
A pretty hot loop is SLP vectorized resulting in nice instruction reduction.
This used to be a +22% regression before rL299482.
* CFP2000/177.mesa -3.34%
* CINT2000/256.bzip2 +6.97%
My current plan is to extend the fix in rL299482 to i16 which brings the
regression down to +2.5%. There are also other problems with the codegen in
this loop so there is further room for improvement.
** LLVM testsuite
* SingleSource/Benchmarks/Misc/ReedSolomon -10.75%
There are multiple small SLP vectorizations outside the hot code. It's a bit
surprising that it adds up to 10%. Some of this may be code-layout noise.
* MultiSource/Benchmarks/VersaBench/beamformer/beamformer -8.40%
The opt-viewer screenshot can be seen at F3218284. We start at a colder store
but the tree leads us into the hottest loop.
* MultiSource/Applications/lambda-0.1.3/lambda -2.68%
* MultiSource/Benchmarks/Bullet/bullet -2.18%
This is using 3D vectors.
* SingleSource/Benchmarks/Shootout-C++/Shootout-C++-lists +6.67%
Noise, binary is unchanged.
* MultiSource/Benchmarks/Ptrdist/anagram/anagram +4.90%
There is an additional SLP in the cold code. The test runs for ~1sec and
prints out over 2000 lines. This is most likely noise.
* MultiSource/Applications/aha/aha +1.63%
* MultiSource/Applications/JM/lencod/lencod +1.41%
* SingleSource/Benchmarks/Misc/richards_benchmark +1.15%
Differential Revision: https://reviews.llvm.org/D31965
llvm-svn: 303116
This caused PR33053.
Original commit message:
> The new experimental reduction intrinsics can now be used, so I'm enabling this
> for AArch64. We will need this for SVE anyway, so it makes sense to do this for
> NEON reductions as well.
>
> The existing code to match shufflevector patterns are replaced with a direct
> lowering of the reductions to AArch64-specific nodes. Tests updated with the
> new, simpler, representation.
>
> Differential Revision: https://reviews.llvm.org/D32247
llvm-svn: 303115
The approach I followed was to emit the remark after getTreeCost concludes
that SLP is profitable. I initially tried emitting them after the
vectorizeRootInstruction calls in vectorizeChainsInBlock but I vaguely
remember missing a few cases for example in HorizontalReduction::tryToReduce.
ORE is placed in BoUpSLP so that it's available from everywhere (notably
HorizontalReduction::tryToReduce).
We use the first instruction in the root bundle as the locator for the remark.
In order to get a sense how far the tree is spanning I've include the size of
the tree in the remark. This is not perfect of course but it gives you at
least a rough idea about the tree. Then you can follow up with -view-slp-tree
to really see the actual tree.
llvm-svn: 302811
The new experimental reduction intrinsics can now be used, so I'm enabling this
for AArch64. We will need this for SVE anyway, so it makes sense to do this for
NEON reductions as well.
The existing code to match shufflevector patterns are replaced with a direct
lowering of the reductions to AArch64-specific nodes. Tests updated with the
new, simpler, representation.
Differential Revision: https://reviews.llvm.org/D32247
llvm-svn: 302678
In tryToVectorizeList, under a very limited circumstance (when entered
from tryToVectorizePair), the values may be reordered (swapped) and the
SLP tree is built with the new order. This extends that to the case when
starting from phis in vectorizeChainsInBlock when there are exactly two
phis. The textual order of phi nodes shouldn't really matter. Without
this change, the loop body in the accompnaying test case is fully vectorized
when we swap the orde of the phis but not with this order. While this
doesn't solve the phi-ordering problem in a general way (for more than 2
phis), this is simple fix that piggybacks on an existing mechanism and
is useful in cases like multiplying two complex numbers.
Differential revision: https://reviews.llvm.org/D32065
llvm-svn: 300574
Use '2>&1 |' and not '|&' to pipe debug output to FileCheck
Hopefully handles a "shell parser error" on
llvm-clang-x86_64-expensive-checks-win
test/Transforms/SLPVectorizer/SystemZ/SLP-cmp-cost-query.ll
llvm-svn: 300064
In getEntryCost(), make the scalar type for a compare instruction that of the
operands, not i1. This is needed in order to call getCmpSelInstrCost() for a
compare in a sensible way, the same way as the LoopVectorizer does.
New test: test/Transforms/SLPVectorizer/SystemZ/SLP-cmp-cost-query.ll
Review: Matthew Simpson
https://reviews.llvm.org/D31601
llvm-svn: 300061
Currently the default C calling convention functions are treated
the same as compute kernels. Make this explicit so the default
calling convention can be changed to a non-kernel.
Converted with perl -pi -e 's/define void/define amdgpu_kernel void/'
on the relevant test directories (and undoing in one place that actually
wanted a non-kernel).
llvm-svn: 298444
This reverts r293386, r294027, r294029 and r296411.
Turns out the SLP tree isn't actually a "tree" and we don't handle
accessing the same packet of loads in several different orders well,
causing miscompiles.
Revert until we can fix this properly.
llvm-svn: 297493
for VectorizeTree() API.This API uses it for proper mask computation to be used in shufflevector IR.
The fix is to compute the mask for out of order memory accesses while building the vectorizable tree
instead of actual vectorization of vectorizable tree.It also needs to recompute the proper Lane for
external use of vectorizable scalars based on shuffle mask.
Reviewers: mkuper
Differential Revision: https://reviews.llvm.org/D30159
Change-Id: Ide8773ce0ad3562f3cf4d1a0ad0f487e2f60ce5d
llvm-svn: 296863
Summary:
The SLP vectorizer should propagate IR-level optimization hints/flags
(nsw, nuw, exact, fast-math) when converting scalar horizontal
reductions instructions into vectors, just like for other vectorized
instructions.
It doe not include IR propagation for extra arguments, we need to handle
original scalar operations for extra args to propagate correct flags.
Reviewers: mkuper, mzolotukhin, hfinkel
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D30418
llvm-svn: 296614
Summary:
We should preserve IR flags for extra args. These IR flags should be
taken from original scalar operations, not from the reduction
operations.
Reviewers: mkuper, mzolotukhin, hfinkel
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D30447
llvm-svn: 296613
Summary:
If horizontal reduction tree starts from the binary operation that is
used in PHI node, but this PHI is not used in horizontal reduction, we
may end up with extra addition of this PHI node after vectorization.
Here is an example:
```
%phi = phi i32 [ %tmp, %end], ...
...
%tmp = add i32 %tmp1, %tmp2
end:
```
after vectorization we always have something like:
```
%phi = phi i32 [ %tmp, %end], ...
...
%red = extractelement <8 x 32> %vec.red, 0
%tmp = add i32 %red, %phi
end:
```
even if `%phi` is not used in reduction tree. Patch considers these PHI
nodes as extra arguments and considers them in the final result iff they
really used in reduction.
Reviewers: mkuper, hfinkel, mzolotukhin
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D30409
llvm-svn: 296606
for VectorizeTree() API.This API uses it for proper mask computation to be used in shufflevector IR.
The fix is to compute the mask for out of order memory accesses while building the vectorizable tree
instead of actual vectorization of vectorizable tree.
Reviewers: mkuper
Differential Revision: https://reviews.llvm.org/D30159
Change-Id: Id1e287f073fa4959713ba545fa4254db5da8b40d
llvm-svn: 296575
result
Summary:
If the same value is used several times as an extra value, SLP
vectorizer takes it into account only once instead of actual number of
using.
For example:
```
int val = 1;
for (int y = 0; y < 8; y++) {
for (int x = 0; x < 8; x++) {
val = val + input[y * 8 + x] + 3;
}
}
```
We have 2 extra rguments: `1` - initial value of horizontal reduction
and `3`, which is added 8*8 times to the reduction. Before the patch we
added `1` to the reduction value and added once `3`, though it must be
added 64 times.
Reviewers: mkuper, mzolotukhin
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D30262
llvm-svn: 295972
result
Summary:
If the same value is used several times as an extra value, SLP
vectorizer takes it into account only once instead of actual number of
using.
For example:
```
int val = 1;
for (int y = 0; y < 8; y++) {
for (int x = 0; x < 8; x++) {
val = val + input[y * 8 + x] + 3;
}
}
```
We have 2 extra rguments: `1` - initial value of horizontal reduction
and `3`, which is added 8*8 times to the reduction. Before the patch we
added `1` to the reduction value and added once `3`, though it must be
added 64 times.
Reviewers: mkuper, mzolotukhin
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D30262
llvm-svn: 295956
result
Summary:
If the same value is used several times as an extra value, SLP
vectorizer takes it into account only once instead of actual number of
using.
For example:
```
int val = 1;
for (int y = 0; y < 8; y++) {
for (int x = 0; x < 8; x++) {
val = val + input[y * 8 + x] + 3;
}
}
```
We have 2 extra rguments: `1` - initial value of horizontal reduction
and `3`, which is added 8*8 times to the reduction. Before the patch we
added `1` to the reduction value and added once `3`, though it must be
added 64 times.
Reviewers: mkuper, mzolotukhin
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D30262
llvm-svn: 295949
Summary:
If the same value is used several times as an extra value, SLP
vectorizer takes it into account only once instead of actual number of
using.
For example:
```
int val = 1;
for (int y = 0; y < 8; y++) {
for (int x = 0; x < 8; x++) {
val = val + input[y * 8 + x] + 3;
}
}
```
We have 2 extra rguments: `1` - initial value of horizontal reduction
and `3`, which is added 8*8 times to the reduction. Before the patch we
added `1` to the reduction value and added once `3`, though it must be
added 64 times.
Reviewers: mkuper, mzolotukhin
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D30262
llvm-svn: 295868
back into a vector
Previously the cost of the existing ExtractElement/ExtractValue
instructions was considered as a dead cost only if it was detected that
they have only one use. But these instructions may be considered
dead also if users of the instructions are also going to be vectorized,
like:
```
%x0 = extractelement <2 x float> %x, i32 0
%x1 = extractelement <2 x float> %x, i32 1
%x0x0 = fmul float %x0, %x0
%x1x1 = fmul float %x1, %x1
%add = fadd float %x0x0, %x1x1
```
This can be transformed to
```
%1 = fmul <2 x float> %x, %x
%2 = extractelement <2 x float> %1, i32 0
%3 = extractelement <2 x float> %1, i32 1
%add = fadd float %2, %3
```
because though `%x0` and `%x1` have 2 users each other, these users are
part of the vectorized tree and we can consider these `extractelement`
instructions as dead.
Differential Revision: https://reviews.llvm.org/D29900
llvm-svn: 295056
reductions.
Currently, LLVM supports vectorization of horizontal reduction
instructions with initial value set to 0. Patch supports vectorization
of reduction with non-zero initial values. Also, it supports a
vectorization of instructions with some extra arguments, like:
```
float f(float x[], int a, int b) {
float p = a % b;
p += x[0] + 3;
for (int i = 1; i < 32; i++)
p += x[i];
return p;
}
```
Patch allows vectorization of this kind of horizontal reductions.
Differential Revision: https://reviews.llvm.org/D29727
llvm-svn: 294934
This breaks when one of the extra values is also a scalar that
participates in the same vectorization tree which we'll end up
reducing.
llvm-svn: 294245
This generalizes memory access sorting to use differences between SCEVs,
instead of relying on constant offsets. That allows us to properly do
SLP vectorization of non-sequentially ordered loads within loops bodies.
Differential Revision: https://reviews.llvm.org/D29425
llvm-svn: 294027
Currently LLVM supports vectorization of horizontal reduction
instructions with initial value set to 0. Patch supports vectorization
of reduction with non-zero initial values. Also it supports a
vectorization of instructions with some extra arguments, like:
float f(float x[], int a, int b) {
float p = a % b;
p += x[0] + 3;
for (int i = 1; i < 32; i++)
p += x[i];
return p;
}
Patch allows vectorization of this kind of horizontal reductions.
Differential Revision: https://reviews.llvm.org/D28961
llvm-svn: 293994
The jumbled scalar loads will be sorted while building the tree and these accesses will be marked to generate shufflevector after the vectorized load with proper mask.
Reviewers: hfinkel, mssimpso, mkuper
Differential Revision: https://reviews.llvm.org/D26905
Change-Id: I9c0c8e6f91a00076a7ee1465440a3f6ae092f7ad
llvm-svn: 293386
instructions.
If number of instructions in horizontal reduction list is not power of 2
then only PowerOf2Floor(NumberOfInstructions) last elements are actually
vectorized, other instructions remain scalar. Patch tries to vectorize
the remaining elements either.
Differential Revision: https://reviews.llvm.org/D28959
llvm-svn: 293042
The removed assert seems bogus - it's perfectly legal for the roots of the
vectorized subtrees to be equal even if the original scalar values aren't,
if the original scalars happen to be equivalent.
This fixes PR31599.
Differential Revision: https://reviews.llvm.org/D28539
llvm-svn: 291692
This adds a combine that canonicalizes a chain of inserts which broadcasts
a value into a single insert + a splat shufflevector.
This fixes PR31286.
Differential Revision: https://reviews.llvm.org/D27992
llvm-svn: 290641
This patch ensures the correct minimum bit width during type-shrinking.
Previously when type-shrinking, we always sign-extended values back to their
original width. However, if we are going to sign-extend, and the sign bit is
unknown, we have to increase the minimum bit width by one bit so the
sign-extend will fill the upper bits correctly. If the sign bit is known to be
zero, we can perform a zero-extend instead. This should fix PR31243.
Reference: https://llvm.org/bugs/show_bug.cgi?id=31243
Differential Revision: https://reviews.llvm.org/D27466
llvm-svn: 289470
Summary:
This change adds some verification in the IR verifier around struct path
TBAA metadata.
Other than some basic sanity checks (e.g. we get constant integers where
we expect constant integers), this checks:
- That by the time an struct access tuple `(base-type, offset)` is
"reduced" to a scalar base type, the offset is `0`. For instance, in
C++ you can't start from, say `("struct-a", 16)`, and end up with
`("int", 4)` -- by the time the base type is `"int"`, the offset
better be zero. In particular, a variant of this invariant is needed
for `llvm::getMostGenericTBAA` to be correct.
- That there are no cycles in a struct path.
- That struct type nodes have their offsets listed in an ascending
order.
- That when generating the struct access path, you eventually reach the
access type listed in the tbaa tag node.
Reviewers: dexonsmith, chandlerc, reames, mehdi_amini, manmanren
Subscribers: mcrosier, llvm-commits
Differential Revision: https://reviews.llvm.org/D26438
llvm-svn: 289402
When trying to vectorize trees that start at insertelement instructions
function tryToVectorizeList() uses vectorization factor calculated as
MinVecRegSize/ScalarTypeSize. But sometimes it does not work as tree
cost for this fixed vectorization factor is too high.
Patch tries to improve the situation. It tries different vectorization
factors from max(PowerOf2Floor(NumberOfVectorizedValues),
MinVecRegSize/ScalarTypeSize) to MinVecRegSize/ScalarTypeSize and tries
to choose the best one.
Differential Revision: https://reviews.llvm.org/D27215
llvm-svn: 289043
This reverts commit r288497, as it broke the AArch64 build of Compiler-RT's
builtins (twice: once in r288412 and once in r288497). We should investigate
this offline.
llvm-svn: 288508
When trying to vectorize trees that start at insertelement instructions
function tryToVectorizeList() uses vectorization factor calculated as
MinVecRegSize/ScalarTypeSize. But sometimes it does not work as tree
cost for this fixed vectorization factor is too high.
Patch tries to improve the situation. It tries different vectorization
factors from max(PowerOf2Floor(NumberOfVectorizedValues),
MinVecRegSize/ScalarTypeSize) to MinVecRegSize/ScalarTypeSize and tries
to choose the best one.
Differential Revision: https://reviews.llvm.org/D27215
llvm-svn: 288497
When trying to vectorize trees that start at insertelement instructions
function tryToVectorizeList() uses vectorization factor calculated as
MinVecRegSize/ScalarTypeSize. But sometimes it does not work as tree
cost for this fixed vectorization factor is too high.
Patch tries to improve the situation. It tries different vectorization
factors from max(PowerOf2Floor(NumberOfVectorizedValues),
MinVecRegSize/ScalarTypeSize) to MinVecRegSize/ScalarTypeSize and tries
to choose the best one.
Differential Revision: https://reviews.llvm.org/D27215
llvm-svn: 288412
Currently when cost of scalar operations is evaluated the vector type is
used for scalar operations. Patch fixes this issue and fixes evaluation
of the vector operations cost.
Several test showed that vector cost model is too optimistic. It
allowed vectorization of 8 or less add/fadd operations, though scalar
code is faster. Actually, only for 16 or more operations vector code
provides better performance.
Differential Revision: https://reviews.llvm.org/D26277
llvm-svn: 288398
Currently SLP vectorizer tries to vectorize a binary operation and dies
immediately after unsuccessful the first unsuccessfull attempt. Patch
tries to improve the situation, trying to vectorize all binary
operations of all children nodes in the binop tree.
Differential Revision: https://reviews.llvm.org/D25517
llvm-svn: 288115
Summary:
This extends FCOPYSIGN support to 512-bit vectors.
I've also added tests to show what the 128-bit and 256-bit cases look like with broadcast loads.
Reviewers: delena, zvi, RKSimon, spatel
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D26791
llvm-svn: 287298
This patch avoids scalarization of CTLZ by instead expanding to use CTPOP (ref: "Hacker's Delight") when the necessary operations are available.
This also adds the necessary cost models for X86 SSE2 targets (the main beneficiary) to ensure vectorization only happens when its useful.
Differential Revision: https://reviews.llvm.org/D25910
llvm-svn: 286233
After successfull horizontal reduction vectorization attempt for PHI node
vectorizer tries to update root binary op by combining vectorized tree
and the ReductionPHI node. But during vectorization this ReductionPHI
can be vectorized itself and replaced by the `undef` value, while the
instruction itself is marked for deletion. This 'marked for deletion'
PHI node then can be used in new binary operation, causing "Use still
stuck around after Def is destroyed" crash upon PHI node deletion.
Also the test is fixed to make it perform actual testing.
Differential Revision: https://reviews.llvm.org/D25671
llvm-svn: 285286
As discussed on PR28461 we currently miss the chance to lower "fptosi <2 x double> %arg to <2 x i32>" to cvttpd2dq due to its use of illegal types.
This patch adds support for fptosi to 2i32 from both 2f64 and 2f32.
It also recognises that cvttpd2dq zeroes the upper 64-bits of the xmm result (similar to D23797) - we still don't do this for the cvttpd2dq/cvttps2dq intrinsics - this can be done in a future patch.
Differential Revision: https://reviews.llvm.org/D23808
llvm-svn: 284459
Fixed copy+paste vector alignment to correct for per-element scalar loads
Increased to 512-bit data sizes in preparation of avx512 tests
llvm-svn: 283748
unrolling.
The next code is not vectorized by the SLPVectorizer:
```
int test(unsigned int *p) {
int sum = 0;
for (int i = 0; i < 8; i++)
sum += p[i];
return sum;
}
```
During optimization this loop is fully unrolled and SLPVectorizer is
unable to vectorize it. Patch tries to fix this problem.
Differential Revision: https://reviews.llvm.org/D24796
llvm-svn: 283535