llvm-project/llvm/test/Transforms/SLPVectorizer/X86
Roman Tereshin ed047b0184 [SCEV] Add [zs]ext{C,+,x} -> (D + [zs]ext{C-D,+,x})<nuw><nsw> transform
as well as sext(C + x + ...) -> (D + sext(C-D + x + ...))<nuw><nsw>
similar to the equivalent transformation for zext's

if the top level addition in (D + (C-D + x * n)) could be proven to
not wrap, where the choice of D also maximizes the number of trailing
zeroes of (C-D + x * n), ensuring homogeneous behaviour of the
transformation and better canonicalization of such AddRec's

(indeed, there are 2^(2w) different expressions in `B1 + ext(B2 + Y)` form for
the same Y, but only 2^(2w - k) different expressions in the resulting `B3 +
ext((B4 * 2^k) + Y)` form, where w is the bit width of the integral type)

This patch generalizes the sext(C1 + C2*X) --> sext(C1) + sext(C2*X) and
sext{C1,+,C2} --> sext(C1) + sext{0,+,C2} transformations added in
r209568, relaxing the requirements in the following ways:

1. C2 doesn't have to be a power of 2, it's enough if it's divisible by 2
 a sufficient number of times;
2. C1 doesn't have to be less than C2, instead of extracting the entire
  C1 we can split it into 2 terms: (00...0XXX + YY...Y000), keep the
  second one that may cause wrapping within the extension operator, and
  move the first one that doesn't affect wrapping out of the extension
  operator, enabling further simplifications;
3. C1 and C2 don't have to be positive, splitting C1 like shown above
 produces a sum that is guaranteed to not wrap, signed or unsigned;
4. in the AddExpr case there could be more than 2 terms; the 2nd and
  following terms of an AddExpr don't have to be in the C2*X form,
  and the Step component of an AddRecExpr doesn't have to be a
  constant: they just need to have enough trailing zeros, which in
  turn could be guaranteed by means other than arithmetic, e.g. by
  pointer alignment;
5. the extension operator doesn't have to be a sext, the same
  transformation works and is profitable for zext's as well.

Apparently, optimizations like SLPVectorizer currently fail to
vectorize even rather trivial cases like the following:

 double bar(double *a, unsigned n) {
   double x = 0.0;
   double y = 0.0;
   for (unsigned i = 0; i < n; i += 2) {
     x += a[i];
     y += a[i + 1];
   }
   return x * y;
 }

If compiled with `clang -std=c11 -Wpedantic -Wall -O3 main.c -S -o - -emit-llvm`
(!{!"clang version 7.0.0 (trunk 337339) (llvm/trunk 337344)"})

it produces scalar code with the loop not unrolled with the unsigned `n` and
`i` (as shown above), but a vectorized and unrolled loop with signed `n` and
`i`. With the changes made in this commit the unsigned version will be
vectorized (though not unrolled for unclear reasons).
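
To illustrate the index arithmetic involved, here is a minimal sketch in C
(not part of the patch) that uses 8-bit values as a stand-in for the 32-bit
`unsigned` index. The helper names are hypothetical. It assumes that the
obstacle in the unsigned case is that zext(i + 1) cannot in general be
rewritten as zext(i) + 1, because i + 1 is allowed to wrap; for an even i,
as produced by the {0,+,2} induction variable above, the lowest bit is a
guaranteed zero and the rewrite is always valid:

```c
#include <stdint.h>

/* Returns 1 if zext(i + 1) == zext(i) + 1 for this 8-bit i, else 0.
   The inner add is performed in 8 bits, the outer one in 16 bits. */
int zext_add1_commutes(uint8_t i) {
    uint8_t narrow_sum = (uint8_t)(i + 1);   /* wraps to 0 when i == 255 */
    return (uint16_t)narrow_sum == (uint16_t)i + 1;
}

/* Returns 1 if the rewrite holds for every even index 0, 2, ..., 254. */
int holds_for_all_even_indices(void) {
    for (unsigned i = 0; i < 256; i += 2)
        if (!zext_add1_commutes((uint8_t)i))
            return 0;
    return 1;
}
```

The rewrite fails only for i == 255, which an even induction variable never
reaches.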

How it all works:

Let's say we have an AddExpr that looks like (C + x + y + ...), where C
is a constant and x, y, ... are arbitrary SCEVs. Let's compute the
guaranteed minimum number of trailing zeroes of that sum without the
constant term: (x + y + ...). If, for example, those terms look as
follows:

        i
XXXX...X000
YYYY...YY00
   ...
ZZZZ...0000

then the rightmost non-guaranteed-zero bit (a potential one at i-th
position above) can change the bits of the sum to the left (and at
i-th position itself), but it cannot possibly change the bits to the
right. So we can compute the number of trailing zeroes of the sum by
taking the minimum of the numbers of trailing zeroes of the terms.
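
The bound can be checked exhaustively on a small type. The following is a
sketch in C (not part of the patch, helper names are hypothetical) that
works on concrete 8-bit values rather than on SCEVs: for every pair of
values it confirms that the wrapped sum has at least as many trailing
zeroes as the smaller of the two counts.

```c
#include <stdint.h>

/* Number of trailing zero bits of an 8-bit value; 8 for the value 0. */
unsigned trailing_zeros8(uint8_t v) {
    unsigned n = 0;
    if (v == 0)
        return 8;
    while ((v & 1u) == 0) {
        v >>= 1;
        ++n;
    }
    return n;
}

/* Returns 1 if, for all 8-bit a and b, the sum a + b (modulo 2^8) has at
   least min(tz(a), tz(b)) trailing zeroes. */
int sum_keeps_min_trailing_zeros(void) {
    for (unsigned a = 0; a < 256; ++a)
        for (unsigned b = 0; b < 256; ++b) {
            unsigned ta = trailing_zeros8((uint8_t)a);
            unsigned tb = trailing_zeros8((uint8_t)b);
            unsigned bound = ta < tb ? ta : tb;
            if (trailing_zeros8((uint8_t)(a + b)) < bound)
                return 0;
        }
    return 1;
}
```

The minimum is only a lower bound: 4 + 4 = 8 has three trailing zeroes even
though each term guarantees only two.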

Now let's say that our original sum with the constant is effectively
just C + X, where X = x + y + .... Let's also say that we've got 2
guaranteed trailing zeros for X:

         j
CCCC...CCCC
XXXX...XX00  // this is X = (x + y + ...)

Any bit of C to the left of j may in the end cause the C + X sum to
wrap, but the rightmost 2 bits of C (at positions j and j - 1) do not
affect wrapping in any way. If the upper bits cause a wrap, it will be
a wrap regardless of the values of the 2 least significant bits of C.
If the upper bits do not cause a wrap, it won't be a wrap regardless
of the values of the 2 bits on the right (again).
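
This claim can also be brute-forced. The sketch below (C, 8-bit values,
hypothetical helper names, not part of the patch) enumerates every constant
C and every X with 2 trailing zeroes, and checks that clearing the 2 least
significant bits of C changes neither the unsigned nor the signed wrapping
behaviour of C + X. Converting a uint8_t above 127 to int8_t is assumed to
be the usual two's complement reinterpretation.

```c
#include <stdint.h>

/* 1 if a + b overflows as an unsigned 8-bit add. */
int wraps_unsigned8(uint8_t a, uint8_t b) {
    return (unsigned)a + (unsigned)b > 255u;
}

/* 1 if a + b overflows as a signed 8-bit add (operands reinterpreted). */
int wraps_signed8(uint8_t a, uint8_t b) {
    int s = (int)(int8_t)a + (int)(int8_t)b;
    return s < -128 || s > 127;
}

/* Returns 1 if, for every C and every X that is a multiple of 4, the adds
   C + X and (C with its 2 low bits cleared) + X wrap in exactly the same
   cases, both signed and unsigned. */
int low_bits_never_change_wrapping(void) {
    for (unsigned c = 0; c < 256; ++c)
        for (unsigned x = 0; x < 256; x += 4) {
            uint8_t full = (uint8_t)c;
            uint8_t high = (uint8_t)(c & ~3u);
            if (wraps_unsigned8(full, (uint8_t)x) !=
                wraps_unsigned8(high, (uint8_t)x))
                return 0;
            if (wraps_signed8(full, (uint8_t)x) !=
                wraps_signed8(high, (uint8_t)x))
                return 0;
        }
    return 1;
}
```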

So let's split C into 2 constants as follows:

0000...00CC  = D
CCCC...CC00  = (C - D)

and represent the whole sum as D + (C - D + X). The second term of
this new sum looks like this:

CCCC...CC00
XXXX...XX00
-----------  // let's add them up
YYYY...YY00

The sum above (let's call it Y) may or may not wrap, we don't know,
so we need to keep it under a sext/zext. Adding D to that sum though
will never wrap, signed or unsigned, if performed on the original bit
width or the extended one, because all that the final add does is
set the 2 least significant bits of Y to the bits of D:

YYYY...YY00 = Y
0000...00CC = D
-----------  <nuw><nsw>
YYYY...YYCC

Which means we can safely move that D out of the sext or zext and
claim that the top-level sum neither sign wraps nor unsigned wraps.
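
As a sketch of the split itself (C, 8-bit values, not the code from the
patch; `split8`, `split_constant8` and the other names are hypothetical),
the constant is cut at the number of guaranteed trailing zeroes `tz`, which
is assumed here to be smaller than the bit width. The checker then confirms
that adding D to any Y with `tz` trailing zeroes only fills in the low bits
and wraps neither as an unsigned nor as a signed add:

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint8_t d;     /* low tz bits of C, moved out of the extension */
    uint8_t rest;  /* C - D, stays under the extension */
} split8;

/* Splits C at bit position tz; requires tz < 8 in this sketch. */
split8 split_constant8(uint8_t c, unsigned tz) {
    split8 s;
    assert(tz < 8);
    s.d = (uint8_t)(c & ((1u << tz) - 1u));
    s.rest = (uint8_t)(c - s.d);
    return s;
}

/* Returns 1 if, for every Y with at least tz trailing zeroes, Y + D equals
   Y | D and the add is both <nuw> and <nsw> in 8 bits. */
int adding_d_only_sets_low_bits(uint8_t c, unsigned tz) {
    split8 s = split_constant8(c, tz);
    for (unsigned y = 0; y < 256; y += 1u << tz) {
        unsigned wide = y + s.d;
        int signed_y = (int)(int8_t)(uint8_t)y;
        if (wide > 255u || wide != (y | s.d))
            return 0;               /* unsigned wrap or a carry happened */
        if (signed_y + (int)s.d > 127)
            return 0;               /* signed wrap */
    }
    return 1;
}

/* Runs the check for every constant and every tz in 0..7. */
int split_is_always_safe(void) {
    for (unsigned c = 0; c < 256; ++c)
        for (unsigned tz = 0; tz < 8; ++tz)
            if (!adding_d_only_sets_low_bits((uint8_t)c, tz))
                return 0;
    return 1;
}
```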

Let's run an example. Say we're working in i8's and the original
expression (zext's or sext's operand) is 21 + 12x + 8y. So it goes
like this:

0001 0101  // 21
XXXX XX00  // 12x
YYYY Y000  // 8y

0001 0101  // 21
ZZZZ ZZ00  // 12x + 8y

0000 0001  // D
0001 0100  // 21 - D = 20
ZZZZ ZZ00  // 12x + 8y

0000 0001  // D
WWWW WW00  // 21 - D + 12x + 8y = 20 + 12x + 8y

therefore zext(21 + 12x + 8y) = (1 + zext(20 + 12x + 8y))<nuw><nsw>
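
The example can be verified exhaustively. The sketch below (C, not part of
the patch, hypothetical function names) evaluates both sides for every i8
value of x and y, for zext as well as for sext, using 8-bit truncation to
model the i8 arithmetic. A second function shows, as a contrast, that
moving the entire constant 21 out of the zext is not valid in general.

```c
#include <stdint.h>

/* Returns 1 if ext(21 + 12x + 8y) == 1 + ext(20 + 12x + 8y) for all i8
   x and y, for both zero and sign extension. */
int worked_example_holds(void) {
    for (unsigned x = 0; x < 256; ++x)
        for (unsigned y = 0; y < 256; ++y) {
            uint8_t full = (uint8_t)(21u + 12u * x + 8u * y);
            uint8_t rest = (uint8_t)(20u + 12u * x + 8u * y);
            if ((unsigned)full != 1u + (unsigned)rest)
                return 0;                       /* zext mismatch */
            if ((int)(int8_t)full != 1 + (int)(int8_t)rest)
                return 0;                       /* sext mismatch */
        }
    return 1;
}

/* Returns 1 if some x, y exist with
   zext(21 + 12x + 8y) != 21 + zext(12x + 8y). */
int hoisting_whole_constant_is_unsafe(void) {
    for (unsigned x = 0; x < 256; ++x)
        for (unsigned y = 0; y < 256; ++y) {
            uint8_t full = (uint8_t)(21u + 12u * x + 8u * y);
            uint8_t tail = (uint8_t)(12u * x + 8u * y);
            if ((unsigned)full != 21u + (unsigned)tail)
                return 1;
        }
    return 0;
}
```

For instance, with x = 20 and y = 0 the inner sum 21 + 240 wraps to 5 in
i8, while 21 + zext(240) is 261.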

This approach could be improved if we move away from using trailing
zeroes and use KnownBits instead. For instance, with KnownBits we could
have the following picture:

    i
10 1110...0011  // this is C
XX X1XX...XX00  // this is X = (x + y + ...)

Notice that some of the bits of X are known ones, also notice that
known bits of X are interspersed with unknown bits and not grouped on
the right or left.

We can see at the position i that C(i) and X(i) are both known ones,
therefore the (i + 1)th carry bit is guaranteed to be 1 regardless of
the bits of C to the right of i. For instance, the C(i - 1) bit only
affects the bits of the sum at positions i - 1 and i, and does not
influence if the sum is going to wrap or not. Therefore we could split
the constant C the following way:

    i
00 0010...0011  = D
10 1100...0000  = (C - D)

Let's compute the KnownBits of (C - D) + X:

XX1 1            = carry bit, blanks stand for known zeroes
 10 1100...0000  = (C - D)
 XX X1XX...XX00  = X
--- -----------
 XX X0XX...XX00

Whether this add wraps or not essentially depends on the bits of X.
Adding D to this sum, however, is guaranteed not to wrap:

0    X
 00 0010...0011  = D
 sX X0XX...XX00  = (C - D) + X
--- -----------
 sX XXXX   XX11

As could be seen above, adding D preserves the sign bit of (C - D) +
X, if any, and has a guaranteed 0 carry out, as expected.
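
A concrete 8-bit instance of this idea can be checked by brute force. The
bit patterns below are chosen for illustration only and are an assumption,
not the exact pattern from the picture above: C = 1011 1011, X has a known
one at bit 4, known zeroes at bits 1 and 0, and unknown bits elsewhere,
which gives D = 0000 1011 and C - D = 1011 0000. The sketch (C, hypothetical
names, not part of the patch) enumerates every X that matches the known
bits:

```c
#include <stdint.h>

enum {
    KB_C = 0xBB,            /* 1011 1011: the constant C           */
    KB_D = 0x0B,            /* 0000 1011: part moved out of the ext */
    KB_X_KNOWN_ONE = 0x10,  /* bit 4 of X is a known one            */
    KB_X_UNKNOWN = 0xEC     /* bits 7, 6, 5, 3, 2 of X are unknown  */
};

/* Returns 1 if ext(C + X) == D + ext((C - D) + X) for every X consistent
   with the known bits, for both zero and sign extension. */
int knownbits_split_holds(void) {
    for (unsigned raw = 0; raw < 256; ++raw) {
        uint8_t x = (uint8_t)((raw & KB_X_UNKNOWN) | KB_X_KNOWN_ONE);
        uint8_t full = (uint8_t)(KB_C + x);
        uint8_t rest = (uint8_t)((KB_C - KB_D) + x);
        if ((unsigned)full != (unsigned)KB_D + (unsigned)rest)
            return 0;                           /* zext mismatch */
        if ((int)(int8_t)full != KB_D + (int)(int8_t)rest)
            return 0;                           /* sext mismatch */
    }
    return 1;
}
```

By contrast, the trailing-zeroes rule alone would extract only D = 0000 0011
here, since X has just 2 guaranteed trailing zeroes.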

The more bits of (C - D) we constrain, the better the transformations
introduced here canonicalize expressions, as this leaves less freedom
in the values the constant part of ((C - D) + x + y + ...) can take.

Reviewed By: mzolotukhin, efriedma

Differential Revision: https://reviews.llvm.org/D48853

llvm-svn: 337943
2018-07-25 18:01:41 +00:00
PR32086.ll [SLP] Allow vectorization of reversed loads. 2018-02-14 15:29:15 +00:00
PR34635.ll [SLP] Reduce test, NFC. 2017-09-19 13:38:56 +00:00
PR35628_1.ll [SLP] Fix PR35628: Count external uses on extra reduction arguments. 2018-01-08 14:33:11 +00:00
PR35628_2.ll [SLP] Fix PR35628: Count external uses on extra reduction arguments. 2018-01-08 14:33:11 +00:00
PR35777.ll [SLP] Fix vectorization for tree with trunc to minimum required bit width. 2018-01-19 14:40:13 +00:00
PR35865.ll [COST]Fix PR35865: Fix cost model evaluation for shuffle on X86. 2018-01-09 19:08:22 +00:00
PR36280.ll [X86][SSE] Reduce FADD/FSUB/FMUL costs on later targets (PR36280) 2018-02-26 22:10:17 +00:00
addsub.ll [SLPVectorizer] Relax "alternate" opcode vectorisation to work with any SK_Select shuffle pattern 2018-06-20 14:26:28 +00:00
aggregate.ll [SLP] Update tests checks, NFC. 2018-01-05 14:40:04 +00:00
align.ll
alternate-calls.ll [SLPVectorizer][X86] Begin adding alternate tests for call operators 2018-07-02 17:23:45 +00:00
alternate-cast.ll [SLPVectorizer] Add initial alternate opcode support for cast instructions. (REAPPLIED-2) 2018-07-13 11:09:52 +00:00
alternate-fp.ll [SLPVectorizer] Fix alternate opcode + shuffle cost function to correct handle SK_Select patterns. 2018-07-02 11:28:01 +00:00
alternate-int.ll [SLPVectorizer] Ensure alternate/passthrough doesn't vectorize sdiv with undef elts 2018-07-11 14:34:43 +00:00
arith-add.ll [X86] Remove unnecessary -mattr to enable avx512bw when the -mcpu already enabled it. NFC 2018-04-16 18:14:19 +00:00
arith-fp.ll [X86][SLM] Add SLM arithmetic vectorization tests 2017-06-10 19:16:09 +00:00
arith-mul.ll [X86][SLM] Add SLM arithmetic vectorization tests 2017-06-10 19:16:09 +00:00
arith-sub.ll [X86][SLM] Add SLM arithmetic vectorization tests 2017-06-10 19:16:09 +00:00
atomics.ll
bad_types.ll
barriercall.ll
bitreverse.ll [X86] Add missing BITREVERSE costs for SSE2 vectors and i8/i16/i32/i64 scalars 2017-03-15 19:34:55 +00:00
blending-shuffle.ll [SLP] Stop counting cost of gather sequences with multiple uses 2018-03-23 14:18:27 +00:00
bswap.ll
call.ll [ValueTracking] readonly (const) is a requirement for converting sqrt to llvm.sqrt; nnan is not 2017-11-06 22:40:09 +00:00
cast.ll [SLP] Fix PR35047: Fix default cost model for cast op in X86. 2017-11-07 14:23:44 +00:00
cmp_sel.ll [SLPVectorizer][X86] Regenerate some tests. NFCI 2018-04-04 13:53:51 +00:00
commutativity.ll
compare-reduce.ll [SLP] Fix tests checks, NFC. 2018-02-20 18:11:50 +00:00
consecutive-access.ll [SCEV] Add [zs]ext{C,+,x} -> (D + [zs]ext{C-D,+,x})<nuw><nsw> transform 2018-07-25 18:01:41 +00:00
continue_vectorizing.ll
crash_7zip.ll
crash_binaryop.ll
crash_bullet.ll
crash_bullet3.ll
crash_cmpop.ll
crash_dequeue.ll
crash_flop7.ll
crash_gep.ll
crash_lencod.ll
crash_mandeltext.ll
crash_netbsd_decompress.ll
crash_scheduling.ll [Verifier] Add verification for TBAA metadata 2016-12-11 20:07:15 +00:00
crash_sim4b1.ll
crash_smallpt.ll
crash_vectorizeTree.ll
cross_block_slp.ll
cse.ll [X86][SSE] Reduce FADD/FSUB/FMUL costs on later targets (PR36280) 2018-02-26 22:10:17 +00:00
ctlz.ll [VectorLegalizer] Expansion of CTLZ using CTPOP when possible 2016-11-08 14:10:28 +00:00
ctpop.ll
cttz.ll
cycle_dup.ll
debug_info.ll [DebugInfo] Add DILabel metadata and intrinsic llvm.dbg.label. 2018-05-09 02:40:45 +00:00
diamond.ll
external_user.ll
external_user_jumbled_load.ll [SLP] Fix PR36481: vectorize reassociated instructions. 2018-04-03 17:14:47 +00:00
extract-shuffle.ll [SLP] Add extra test for extractelement shuffle, NFC. 2018-01-30 21:06:06 +00:00
extract.ll [SLP] Fix PR36481: vectorize reassociated instructions. 2018-04-03 17:14:47 +00:00
extract_in_tree_user.ll [SLP] Update test checks, NFC. 2018-02-06 20:00:05 +00:00
extractcost.ll
extractelement.ll [SLP] Fix for PR31879: vectorize repeated scalar ops that don't get put 2017-02-14 15:20:48 +00:00
fabs.ll
fcopysign.ll [AVX-512] Support FCOPYSIGN for v16f32 and v8f64 2016-11-18 02:25:34 +00:00
flag.ll
fma.ll
fptosi.ll [SLPVectorizer][X86] Tests to show missed buildvector sitofp/fptosi vectorizations 2016-12-06 13:29:55 +00:00
fptoui.ll
fround.ll
funclet.ll [SLP] Fix tests checks, NFC. 2018-02-20 18:11:50 +00:00
gep.ll
gep_mismatch.ll
hadd.ll [X86][AVX] Reduce v4f64/v4i64 shuffle costs (PR37882) 2018-06-21 11:37:13 +00:00
hoist.ll [SLP] Fix for PR32086: Count InsertElementInstr of the same elements as shuffle. 2018-01-29 16:08:52 +00:00
horizontal-list.ll [SLP] Support for horizontal min/max reduction. 2017-09-25 13:34:59 +00:00
horizontal-minmax.ll [SLPVectorizer] Don't attempt horizontal reduction on pointer types (PR38191) 2018-07-17 13:43:33 +00:00
horizontal.ll [X86][SSE] Reduce FADD/FSUB/FMUL costs on later targets (PR36280) 2018-02-26 22:10:17 +00:00
hsub.ll [X86][AVX] Reduce v4f64/v4i64 shuffle costs (PR37882) 2018-06-21 11:37:13 +00:00
implicitfloat.ll
in-tree-user.ll [SLP] Fix tests checks, NFC. 2018-02-20 18:11:50 +00:00
insert-after-bundle.ll [SLPVectorizer][X86] Cleanup test case. NFCI 2017-08-06 20:50:19 +00:00
insert-element-build-vector.ll [SLP] Take user instructions cost into consideration in insertelement vectorization. 2018-02-12 14:54:48 +00:00
insertvalue.ll [SLP] Fix PR35777: Incorrect handling of aggregate values. 2018-01-08 14:43:06 +00:00
intrinsic.ll
jumbled-load-multiuse.ll [SLPVectorizer][X86] Regenerate some tests. NFCI 2018-04-04 13:53:51 +00:00
jumbled-load-shuffle-placement.ll [SLPVectorizer][X86] Regenerate some tests. NFCI 2018-04-04 13:53:51 +00:00
jumbled-load-used-in-phi.ll [SLP] Fix PR36481: vectorize reassociated instructions. 2018-04-03 17:14:47 +00:00
jumbled-load.ll [SLP] Fix PR36481: vectorize reassociated instructions. 2018-04-03 17:14:47 +00:00
limit.ll [SLP] A test for limiting vectorization of instructions, NFC. 2017-06-30 14:37:32 +00:00
lit.local.cfg
load-merge.ll [SLP] Fix PR35047: Fix default cost model for cast op in X86. 2017-11-07 14:23:44 +00:00
long_chains.ll
loopinvariant.ll
metadata.ll
minimum-sizes.ll [SLP] Add/update tests for SLP vectorizer, NFC. 2018-01-10 21:29:18 +00:00
multi_block.ll
multi_user.ll
non-vectorizable-intrinsic.ll
odd_store.ll [SLPVectorizer][X86] Regenerate some tests. NFCI 2018-04-04 13:53:51 +00:00
operandorder.ll [InstCombine] Canonicalize insert splat sequences into an insert + shuffle 2016-12-28 00:18:08 +00:00
opt.ll
ordering.ll
partail.ll [SLP] Distinguish "demanded and shrinkable" from "demanded and not shrinkable" values when determining the minimum bitwidth 2018-04-03 00:05:10 +00:00
phi.ll
phi3.ll
phi_landingpad.ll
phi_overalignedtype.ll
powof2div.ll [SLPVectorizer] Recognise non uniform power of 2 constants 2018-06-26 16:20:16 +00:00
pr16571.ll
pr16628.ll
pr16899.ll
pr18060.ll
pr19657.ll [SLP] Fix test checks, NFC. 2018-02-21 15:32:58 +00:00
pr23510.ll
pr27163.ll
pr31599.ll [SLP] Remove bogus assert. 2017-01-11 19:23:57 +00:00
pr35497.ll [SLPVectorizer] Add tests related to PR30787, NFCI. 2018-03-29 18:57:03 +00:00
propagate_ir_flags.ll Fixed the lost FastMathFlags for CALL operations in SLPVectorizer. 2016-11-16 00:55:50 +00:00
reassociated-loads.ll [SLP] Fix PR36481: vectorize reassociated instructions. 2018-04-03 17:14:47 +00:00
reduction.ll
reduction2.ll
reduction_loads.ll [SLP] Revert everything that has to do with memory access sorting. 2017-03-10 18:59:07 +00:00
reduction_unrolled.ll [SLP] Update test checks, NFC. 2018-02-06 20:00:05 +00:00
remark_horcost.ll [SLP] Added more missed optimization remarks 2017-11-15 17:04:53 +00:00
remark_listcost.ll [SLP] Added more missed optimization remarks 2017-11-15 17:04:53 +00:00
remark_not_all_parts.ll [SLPVectorizer] Support alternate opcodes in tryToVectorizeList 2018-06-22 16:37:34 +00:00
remark_unsupported.ll [SLP] Added more missed optimization remarks 2017-11-15 17:04:53 +00:00
reorder_phi.ll [X86][SSE] Reduce FADD/FSUB/FMUL costs on later targets (PR36280) 2018-02-26 22:10:17 +00:00
reorder_repeated_ops.ll [SLP] Fix PR36481: vectorize reassociated instructions. 2018-04-03 17:14:47 +00:00
resched.ll [SLPVectorizer] Support alternate opcodes in tryToVectorizeList 2018-06-22 16:37:34 +00:00
return.ll [SLP] Fix tests checks, NFC. 2018-02-20 18:11:50 +00:00
reverse_extract_elements.ll [SLP] Fix for PR32164: Improve vectorization of reverse order of extract operations. 2018-01-16 18:17:01 +00:00
rgb_phi.ll
saxpy.ll
schedule-bundle.ll [SLPVectorizer] Schedule bundle with different opcodes. 2017-08-14 15:40:16 +00:00
schedule_budget.ll
scheduling.ll [SLP] Preserve IR flags when vectorizing horizontal reductions. 2017-03-01 12:43:39 +00:00
sext.ll [SLPVectorizer][X86] Add load extend tests (PR36091) 2018-02-22 12:19:34 +00:00
shift-ashr.ll [SLPVectorizer][X86] Add vectorization tests for vXi64/vXi32/vXi16/VXi8 shifts 2017-05-15 14:27:11 +00:00
shift-lshr.ll [SLPVectorizer][X86] Add vectorization tests for vXi64/vXi32/vXi16/VXi8 shifts 2017-05-15 14:27:11 +00:00
shift-shl.ll [SLPVectorizer][X86] Add vectorization tests for vXi64/vXi32/vXi16/VXi8 shifts 2017-05-15 14:27:11 +00:00
sign-extend.ll [SLP] Take user instructions cost into consideration in insertelement vectorization. 2018-02-12 14:54:48 +00:00
simple-loop.ll
simplebb.ll [X86][SSE] Reduce FADD/FSUB/FMUL costs on later targets (PR36280) 2018-02-26 22:10:17 +00:00
sitofp.ll [SLPVectorizer][X86] Tests to show missed buildvector sitofp/fptosi vectorizations 2016-12-06 13:29:55 +00:00
sqrt.ll
store-jumbled.ll [SLP] Fix PR36481: vectorize reassociated instructions. 2018-04-03 17:14:47 +00:00
stores_vectorize.ll [SLP] Additional tests for stores vectorization, NFC. 2018-03-05 20:20:12 +00:00
tiny-tree.ll [SLPVectorizer][X86] Regenerate some tests. NFCI 2018-04-04 13:53:51 +00:00
uitofp.ll
undef_vect.ll [SLP] Fix crash on propagate IR flags for undef operands of min/max 2017-09-27 17:42:49 +00:00
unreachable.ll
value-bug.ll [SLP] Take user instructions cost into consideration in insertelement vectorization. 2018-02-12 14:54:48 +00:00
vect_copyable_in_binops.ll Revert r319531 "[SLPVectorizer] Failure to beneficially vectorize 'copyable' elements in integer binary ops." 2017-12-01 16:17:24 +00:00
vector.ll [SLP] A test for vectorization of users of extractelement instructions, 2017-03-06 16:26:00 +00:00
vector_gep.ll
vectorize-reorder-reuse.ll [SLP] Additional tests for reorder reuse vectorization, NFC. 2018-04-09 19:02:34 +00:00
zext.ll [SLPVectorizer][X86] Add load extend tests (PR36091) 2018-02-22 12:19:34 +00:00