Summary:
Reworked the previously committed patch to insert shuffles for reused
extract element instructions in the correct position. Previous logic was
incorrect, and might lead to the crash with PHIs and EH instructions.
Reviewers: efriedma, javed.absar
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D50143
llvm-svn: 339166
Summary:
If the ExtractElement instructions can be optimized out during the
vectorization and we need to reshuffle the parent vector, this
ShuffleInstruction may be inserted in the wrong place causing compiler
to produce incorrect code.
Reviewers: spatel, RKSimon, mkuper, hfinkel, javed.absar
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D49928
llvm-svn: 338380
as well as sext(C + x + ...) -> (D + sext(C-D + x + ...))<nuw><nsw>
similar to the equivalent transformation for zext's
if the top level addition in (D + (C-D + x * n)) could be proven to
not wrap, where the choice of D also maximizes the number of trailing
zeroes of (C-D + x * n), ensuring homogeneous behaviour of the
transformation and better canonicalization of such AddRec's
(indeed, there are 2^(2w) different expressions in `B1 + ext(B2 + Y)` form for
the same Y, but only 2^(2w - k) different expressions in the resulting `B3 +
ext((B4 * 2^k) + Y)` form, where w is the bit width of the integral type)
This patch generalizes sext(C1 + C2*X) --> sext(C1) + sext(C2*X) and
sext{C1,+,C2} --> sext(C1) + sext{0,+,C2} transformations added in
r209568 relaxing the requirements the following way:
1. C2 doesn't have to be a power of 2, it's enough if it's divisible by 2
a sufficient number of times;
2. C1 doesn't have to be less than C2, instead of extracting the entire
C1 we can split it into 2 terms: (00...0XXX + YY...Y000), keep the
second one that may cause wrapping within the extension operator, and
move the first one that doesn't affect wrapping out of the extension
operator, enabling further simplifications;
3. C1 and C2 don't have to be positive, splitting C1 like shown above
produces a sum that is guaranteed to not wrap, signed or unsigned;
4. in AddExpr case there could be more than 2 terms, and in case of
AddExpr the 2nd and following terms and in case of AddRecExpr the
Step component don't have to be in the C2*X form or constant
(respectively), they just need to have enough trailing zeros,
which in turn could be guaranteed by means other than arithmetics,
e.g. by a pointer alignment;
5. the extension operator doesn't have to be a sext, the same
transformation works and profitable for zext's as well.
Apparently, optimizations like SLPVectorizer currently fail to
vectorize even rather trivial cases like the following:
double bar(double *a, unsigned n) {
double x = 0.0;
double y = 0.0;
for (unsigned i = 0; i < n; i += 2) {
x += a[i];
y += a[i + 1];
}
return x * y;
}
If compiled with `clang -std=c11 -Wpedantic -Wall -O3 main.c -S -o - -emit-llvm`
(!{!"clang version 7.0.0 (trunk 337339) (llvm/trunk 337344)"})
it produces scalar code with the loop not unrolled with the unsigned `n` and
`i` (like shown above), but vectorized and unrolled loop with signed `n` and
`i`. With the changes made in this commit the unsigned version will be
vectorized (though not unrolled for unclear reasons).
How it all works:
Let say we have an AddExpr that looks like (C + x + y + ...), where C
is a constant and x, y, ... are arbitrary SCEVs. Let's compute the
minimum number of trailing zeroes guaranteed of that sum w/o the
constant term: (x + y + ...). If, for example, those terms look like
follows:
i
XXXX...X000
YYYY...YY00
...
ZZZZ...0000
then the rightmost non-guaranteed-zero bit (a potential one at i-th
position above) can change the bits of the sum to the left (and at
i-th position itself), but it can not possibly change the bits to the
right. So we can compute the number of trailing zeroes by taking a
minimum between the numbers of trailing zeroes of the terms.
Now let's say that our original sum with the constant is effectively
just C + X, where X = x + y + .... Let's also say that we've got 2
guaranteed trailing zeros for X:
j
CCCC...CCCC
XXXX...XX00 // this is X = (x + y + ...)
Any bit of C to the left of j may in the end cause the C + X sum to
wrap, but the rightmost 2 bits of C (at positions j and j - 1) do not
affect wrapping in any way. If the upper bits cause a wrap, it will be
a wrap regardless of the values of the 2 least significant bits of C.
If the upper bits do not cause a wrap, it won't be a wrap regardless
of the values of the 2 bits on the right (again).
So let's split C to 2 constants like follows:
0000...00CC = D
CCCC...CC00 = (C - D)
and represent the whole sum as D + (C - D + X). The second term of
this new sum looks like this:
CCCC...CC00
XXXX...XX00
----------- // let's add them up
YYYY...YY00
The sum above (let's call it Y)) may or may not wrap, we don't know,
so we need to keep it under a sext/zext. Adding D to that sum though
will never wrap, signed or unsigned, if performed on the original bit
width or the extended one, because all that that final add does is
setting the 2 least significant bits of Y to the bits of D:
YYYY...YY00 = Y
0000...00CC = D
----------- <nuw><nsw>
YYYY...YYCC
Which means we can safely move that D out of the sext or zext and
claim that the top-level sum neither sign wraps nor unsigned wraps.
Let's run an example, let's say we're working in i8's and the original
expression (zext's or sext's operand) is 21 + 12x + 8y. So it goes
like this:
0001 0101 // 21
XXXX XX00 // 12x
YYYY Y000 // 8y
0001 0101 // 21
ZZZZ ZZ00 // 12x + 8y
0000 0001 // D
0001 0100 // 21 - D = 20
ZZZZ ZZ00 // 12x + 8y
0000 0001 // D
WWWW WW00 // 21 - D + 12x + 8y = 20 + 12x + 8y
therefore zext(21 + 12x + 8y) = (1 + zext(20 + 12x + 8y))<nuw><nsw>
This approach could be improved if we move away from using trailing
zeroes and use KnownBits instead. For instance, with KnownBits we could
have the following picture:
i
10 1110...0011 // this is C
XX X1XX...XX00 // this is X = (x + y + ...)
Notice that some of the bits of X are known ones, also notice that
known bits of X are interspersed with unknown bits and not grouped on
the rigth or left.
We can see at the position i that C(i) and X(i) are both known ones,
therefore the (i + 1)th carry bit is guaranteed to be 1 regardless of
the bits of C to the right of i. For instance, the C(i - 1) bit only
affects the bits of the sum at positions i - 1 and i, and does not
influence if the sum is going to wrap or not. Therefore we could split
the constant C the following way:
i
00 0010...0011 = D
10 1100...0000 = (C - D)
Let's compute the KnownBits of (C - D) + X:
XX1 1 = carry bit, blanks stand for known zeroes
10 1100...0000 = (C - D)
XX X1XX...XX00 = X
--- -----------
XX X0XX...XX00
Will this add wrap or not essentially depends on bits of X. Adding D
to this sum, however, is guaranteed to not to wrap:
0 X
00 0010...0011 = D
sX X0XX...XX00 = (C - D) + X
--- -----------
sX XXXX XX11
As could be seen above, adding D preserves the sign bit of (C - D) +
X, if any, and has a guaranteed 0 carry out, as expected.
The more bits of (C - D) we constrain, the better the transformations
introduced here canonicalize expressions as it leaves less freedom to
what values the constant part of ((C - D) + x + y + ...) can take.
Reviewed By: mzolotukhin, efriedma
Differential Revision: https://reviews.llvm.org/D48853
llvm-svn: 337943
TTI::getMinMaxReductionCost typically can't handle pointer types - until this is changed its better to limit horizontal reduction to integer/float vector types only.
llvm-svn: 337280
We currently only support binary instructions in the alternate opcode shuffles.
This patch is an initial attempt at adding cast instructions as well, this raises several issues that we probably want to address as we continue to generalize the alternate mechanism:
1 - Duplication of cost determination - we should probably add scalar/vector costs helper functions and get BoUpSLP::getEntryCost to use them instead of determining costs directly.
2 - Support alternate instructions with the same opcode (e.g. casts with different src types) - alternate vectorization of calls with different IntrinsicIDs will require this.
3 - Allow alternates to be a different instruction type - mixing binary/cast/call etc.
4 - Allow passthrough of unsupported alternate instructions - related to PR30787/D28907 'copyable' elements.
Reapplied with fix to only accept 2 different casts if they come from the same source type (PR38154).
Differential Revision: https://reviews.llvm.org/D49135
llvm-svn: 336989
We currently only support binary instructions in the alternate opcode shuffles.
This patch is an initial attempt at adding cast instructions as well, this raises several issues that we probably want to address as we continue to generalize the alternate mechanism:
1 - Duplication of cost determination - we should probably add scalar/vector costs helper functions and get BoUpSLP::getEntryCost to use them instead of determining costs directly.
2 - Support alternate instructions with the same opcode (e.g. casts with different src types) - alternate vectorization of calls with different IntrinsicIDs will require this.
3 - Allow alternates to be a different instruction type - mixing binary/cast/call etc.
4 - Allow passthrough of unsupported alternate instructions - related to PR30787/D28907 'copyable' elements.
Reapplied with fix to only accept 2 different casts if they come from the same source type.
Differential Revision: https://reviews.llvm.org/D49135
llvm-svn: 336812
We currently only support binary instructions in the alternate opcode shuffles.
This patch is an initial attempt at adding cast instructions as well, this raises several issues that we probably want to address as we continue to generalize the alternate mechanism:
1 - Duplication of cost determination - we should probably add scalar/vector costs helper functions and get BoUpSLP::getEntryCost to use them instead of determining costs directly.
2 - Support alternate instructions with the same opcode (e.g. casts with different src types) - alternate vectorization of calls with different IntrinsicIDs will require this.
3 - Allow alternates to be a different instruction type - mixing binary/cast/call etc.
4 - Allow passthrough of unsupported alternate instructions - related to PR30787/D28907 'copyable' elements.
Differential Revision: https://reviews.llvm.org/D49135
llvm-svn: 336804
Summary: It is common to have the following min/max pattern during the intermediate stages of SLP since we only optimize at the end. This patch tries to catch such patterns and allow more vectorization.
%1 = extractelement <2 x i32> %a, i32 0
%2 = extractelement <2 x i32> %a, i32 1
%cond = icmp sgt i32 %1, %2
%3 = extractelement <2 x i32> %a, i32 0
%4 = extractelement <2 x i32> %a, i32 1
%select = select i1 %cond, i32 %3, i32 %4
Author: FarhanaAleen
Reviewed By: ABataev, RKSimon, spatel
Differential Revision: https://reviews.llvm.org/D47608
llvm-svn: 336130
We were always using the opcodes of the first 2 scalars for the costs of the alternate opcode + shuffle. This made sense when we used SK_Alternate and opcodes were guaranteed to be alternating, but this fails for the more general SK_Select case.
This fix exposes an issue demonstrated by the fmul_fdiv_v4f32_const test - the SLM model has v4f32 fdiv costs which are more than twice those of the f32 scalar cost, meaning that the cost model determines that the vectorization is not performant. Unfortunately it completely ignores the fact that the fdiv by a constant will be changed into a fmul by InstCombine for a much lower cost vectorization. But at least we're seeing this now...
llvm-svn: 336095
Alternate opcode handling only supports binary operators, these tests demonstrate missed opportunities to vectorize some sitofp/uitofp and fptosi/fptoui style casts as well as some (successful) float bits manipulations
llvm-svn: 336060
Since D46637 we are better at handling uniform/non-uniform constant Pow2 detection; this patch tweaks the SLP argument handling to support them.
As SLP works with arrays of values I don't think we can easily use the pattern match helpers here.
Differential Revision: https://reviews.llvm.org/D48214
llvm-svn: 335621
Enable tryToVectorizeList to support InstructionsState alternate opcode patterns at a root (build vector etc.) as well as further down the vectorization tree.
NOTE: This patch reduces some of the debug reporting if there are opcode mismatches - I can try to add it back if it proves a problem. But it could get rather messy trying to provide equivalent verbose debug strings via getSameOpcode etc.
Differential Revision: https://reviews.llvm.org/D48488
llvm-svn: 335364
SLP currently only accepts (F)Add/(F)Sub alternate counterpart ops to be merged into an alternate shuffle.
This patch relaxes this to accept any pair of BinaryOperator opcodes instead, assuming the target's cost model accepts the vectorization+shuffle.
Differential Revision: https://reviews.llvm.org/D48477
llvm-svn: 335349
AArch64 was only setting costs for SK_Transpose, which meant that many of the simpler shuffles (e.g. SK_Select and SK_PermuteSingleSrc for larger vector elements) was being severely overestimated by the default shuffle expansion.
This patch adds costs to help improve SLP performance and avoid a regression in reductions introduced by D48174.
I'm not very knowledgeable about AArch64 shuffle lowering so I've kept the extra costs to a minimum - someone who knows this code can add extra costs which should improve vectorization a lot more.
Differential Revision: https://reviews.llvm.org/D48172
llvm-svn: 335329
These were being over cautious for costs for one/two op general shuffles - VSHUFPD doesn't have to replicate the same shuffle in both lanes like VSHUFPS does.
llvm-svn: 335216
D47985 saw the old SK_Alternate 'alternating' shuffle mask replaced with the SK_Select mask which accepts either input operand for each lane, equivalent to a vector select with a constant condition operand.
This patch updates SLPVectorizer to make full use of this SK_Select shuffle pattern by removing the 'isOdd()' limitation.
The AArch64 regression will be fixed by D48172.
Differential Revision: https://reviews.llvm.org/D48174
llvm-svn: 335130
This usually results in better code. Fixes using
inline asm with short2, and also fixes having a different
ABI for function parameters between VI and gfx9.
Partially cleans up the mess used for lowering of the d16
operations. Making v4f16 legal will help clean this up more,
but this requires additional work.
llvm-svn: 332953
In order to set breakpoints on labels and list source code around
labels, we need collect debug information for labels, i.e., label
name, the function label belong, line number in the file, and the
address label located. In order to keep these information in LLVM
IR and to allow backend to generate debug information correctly.
We create a new kind of metadata for labels, DILabel. The format
of DILabel is
!DILabel(scope: !1, name: "foo", file: !2, line: 3)
We hope to keep debug information as much as possible even the
code is optimized. So, we create a new kind of intrinsic for label
metadata to avoid the metadata is eliminated with basic block.
The intrinsic will keep existing if we keep it from optimized out.
The format of the intrinsic is
llvm.dbg.label(metadata !1)
It has only one argument, that is the DILabel metadata. The
intrinsic will follow the label immediately. Backend could get the
label metadata through the intrinsic's parameter.
We also create DIBuilder API for labels to be used by Frontend.
Frontend could use createLabel() to allocate DILabel objects, and use
insertLabel() to insert llvm.dbg.label intrinsic in LLVM IR.
Differential Revision: https://reviews.llvm.org/D45024
Patch by Hsiangkai Wang.
llvm-svn: 331841
Since PTX has grown a <2 x half> datatype vectorization has become more
important. The late LoadStoreVectorizer intentionally only does loads
and stores, but now arithmetic has to be vectorized for optimal
throughput too.
This is still very limited, SLP vectorization happily creates <2 x half>
if it's a legal type but there's still a lot of register moving
happening to get that fed into a vectorized store. Overall it's a small
performance win by reducing the amount of arithmetic instructions.
I haven't really checked what the loop vectorizer does to PTX code, the
cost model there might need some more tweaks. I didn't see it causing
harm though.
Differential Revision: https://reviews.llvm.org/D46130
llvm-svn: 331035
We use getExtractWithExtendCost to calculate the cost of extractelement and
s|zext together when computing the extract cost after vectorization, but we
calculate the cost of extractelement and s|zext separately when computing the
scalar cost which is larger than it should be.
Differential Revision: https://reviews.llvm.org/D45469
llvm-svn: 330143
Summary:
If the load/extractelement/extractvalue instructions are not originally
consecutive, the SLP vectorizer is unable to vectorize them. Patch
allows reordering of such instructions.
Patch does not support reordering of the repeated instruction, this must
be handled in the separate patch.
Reviewers: RKSimon, spatel, hfinkel, mkuper, Ayal, ashahid
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D43776
llvm-svn: 329085
We use two approaches for determining the minimum bitwidth.
* Demanded bits
* Value tracking
If demanded bits doesn't result in a narrower type, we then try value tracking.
We need this if we want to root SLP trees with the indices of getelementptr
instructions since all the bits of the indices are demanded.
But there is a missing piece though. We need to be able to distinguish "demanded
and shrinkable" from "demanded and not shrinkable". For example, the bits of %i
in
%i = sext i32 %e1 to i64
%gep = getelementptr inbounds i64, i64* %p, i64 %i
are demanded, but we can shrink %i's type to i32 because it won't change the
result of the getelementptr. On the other hand, in
%tmp15 = sext i32 %tmp14 to i64
%tmp16 = insertvalue { i64, i64 } undef, i64 %tmp15, 0
it doesn't make sense to shrink %tmp15 and we can skip the value tracking.
Ideas are from Matthew Simpson!
Differential Revision: https://reviews.llvm.org/D44868
llvm-svn: 329035
Summary:
If the load/extractelement/extractvalue instructions are not originally
consecutive, the SLP vectorizer is unable to vectorize them. Patch
allows reordering of such instructions.
Reviewers: RKSimon, spatel, hfinkel, mkuper, Ayal, ashahid
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D43776
llvm-svn: 328980
When building the SLP tree, we look for reuse among the vectorized tree
entries. However, each gather sequence is represented by a unique tree entry,
even though the sequence may be identical to another one. This means, for
example, that a gather sequence with two uses will be counted twice when
computing the cost of the tree. We should only count the cost of the definition
of a gather sequence rather than its uses. During code generation, the
redundant gather sequences are emitted, but we optimize them away with CSE. So
it looks like this problem just affects the cost model.
Differential Revision: https://reviews.llvm.org/D44742
llvm-svn: 328316
This patch provides an implementation of getArithmeticReductionCost for
AArch64. We can specialize the cost of add reductions since they are computed
using the 'addv' instruction.
Differential Revision: https://reviews.llvm.org/D44490
llvm-svn: 327702