Commit Graph

350 Commits

Author SHA1 Message Date
Alexey Bataev 6dd29fccb8 [SLP] Support for horizontal min/max reduction.
SLP vectorizer supports horizontal reductions for Add/FAdd binary
operations. Patch adds support for horizontal min/max reductions.
Function getReductionCost() is split to getArithmeticReductionCost() for
binary operation reductions and getMinMaxReductionCost() for min/max
reductions.
Patch fixes PR26956.

Differential revision: https://reviews.llvm.org/D27846

llvm-svn: 312791
2017-09-08 13:49:36 +00:00
Sam Elliott b0c9753691 Keep Optimization Remark Yaml in NewPM
Summary:
The New Pass Manager infrastructure was forgetting to keep around the optimization remark yaml file that the compiler might have been producing. This meant setting the option to '-' for stdout worked, but setting it to a filename didn't give file output (presumably it was deleted because compilation didn't explicitly keep it). This change just ensures that the file is kept if compilation succeeds.

So far I have updated one of the optimization remark output tests to add a version with the new pass manager. It is my intention for this patch to also include changes to all tests that use `-opt-remark-output=` but I wanted to get the code patch ready for review while I was making all those changes.

Fixes https://bugs.llvm.org/show_bug.cgi?id=33951

Reviewers: anemet, chandlerc

Reviewed By: anemet, chandlerc

Subscribers: javed.absar, chandlerc, fhahn, llvm-commits

Differential Revision: https://reviews.llvm.org/D36906

llvm-svn: 311271
2017-08-20 01:30:45 +00:00
Dinar Temirbulatov 7aff8cfa55 [SLPVectorizer] Tighten up VLeft, VRight declaration, remove unnecessary testcase test/Transforms/SLPVectorizer/X86/reorder.ll, NFCI.
llvm-svn: 311223
2017-08-19 03:15:07 +00:00
Dinar Temirbulatov e3ce1b455e [SLPVectorizer] Add opcode parameter to reorderAltShuffleOperands, reorderInputsAccordingToOpcode functions.
Reviewers: mkuper, RKSimon, ABataev, mzolotukhin, spatel, filcab

Subscribers: llvm-commits, rengolin

Differential Revision: https://reviews.llvm.org/D36766

llvm-svn: 311221
2017-08-19 02:54:20 +00:00
Dinar Temirbulatov 7b78f5e52d [SLPVectorizer] Schedule bundle with different opcodes.
This change let us schedule a bundle with different opcodes in it, for example : [ load, add, add, add ]

Reviewers: mkuper, RKSimon, ABataev, mzolotukhin, spatel, filcab

Subscribers: llvm-commits, rengolin

Differential Revision: https://reviews.llvm.org/D36518

llvm-svn: 310847
2017-08-14 15:40:16 +00:00
Alexey Bataev 9581b42589 [SLP] General improvements of SLP vectorization process.
Patch tries to improve two-pass vectorization analysis, existing in SLP vectorizer. What it does:

1. Defines key nodes, that are the vectorization roots. Previously vectorization started if StoreInst or ReturnInst is found. For now, the vectorization started for all Instructions with no users and void types (Terminators, StoreInst) + CallInsts.
2. CmpInsts, InsertElementInsts and InsertValueInsts are stored in the
array. This array is processed only after the vectorization of the
first-after-these instructions key node is finished. Vectorization goes
in reverse order to try to vectorize as much code as possible.

Reviewers: mzolotukhin, Ayal, mkuper, gilr, hfinkel, RKSimon

Subscribers: ashahid, anemet, RKSimon, mssimpso, llvm-commits

Differential Revision: https://reviews.llvm.org/D29826

llvm-svn: 310260
2017-08-07 15:25:49 +00:00
Alexey Bataev 53d523c9eb Revert "[SLP] General improvements of SLP vectorization process."
This reverts commit r310255.

llvm-svn: 310257
2017-08-07 14:51:52 +00:00
Alexey Bataev faace8f1f1 [SLP] General improvements of SLP vectorization process.
Summary:
Patch tries to improve two-pass vectorization analysis, existing in SLP vectorizer. What it does:
1. Defines key nodes, that are the vectorization roots. Previously vectorization started if StoreInst or ReturnInst is found. For now, the vectorization started for all Instructions with no users and void types (Terminators, StoreInst) + CallInsts.
2. CmpInsts, InsertElementInsts and InsertValueInsts are stored in the array. This array is processed only after the vectorization of the first-after-these instructions key node is finished. Vectorization goes in reverse order to try to vectorize as much code as possible.

Reviewers: mzolotukhin, Ayal, mkuper, gilr, hfinkel, RKSimon

Subscribers: ashahid, anemet, RKSimon, mssimpso, llvm-commits

Differential Revision: https://reviews.llvm.org/D29826

llvm-svn: 310255
2017-08-07 14:03:17 +00:00
Simon Pilgrim 7c53c0f586 [SLPVectorizer][X86] Cleanup test case. NFCI
Remove excess attributes/metadata

llvm-svn: 310227
2017-08-06 20:50:19 +00:00
Dinar Temirbulatov cc2294a4eb [SLPVectorizer] Add extra parameter to setInsertPointAfterBundle to handle different opcodes, NFCI.
Differential Revision: https://reviews.llvm.org/D35769

llvm-svn: 310183
2017-08-05 18:43:52 +00:00
Craig Topper ae9b87d10c [InstCombine] Support sext in foldLogicCastConstant
This adds support for sext in foldLogicCastConstant. This is a prerequisite for D36214.

Differential Revision: https://reviews.llvm.org/D36234

llvm-svn: 309880
2017-08-02 20:25:56 +00:00
Alexey Bataev 65e9dd66f7 [SLPVectorizer] Test update, NFC.
llvm-svn: 309814
2017-08-02 14:22:53 +00:00
Alexey Bataev fba97e6e21 [SLP] Fix for PR31880: shuffle and vectorize repeated scalar ops on extracted elements
Summary:
Currently most of the time vectors of extractelement instructions are
treated as scalars that must be gathered into vectors. But in some
cases, like when we have extractelement instructions from single vector
with different constant indeces or from 2 vectors of the same size, we
can treat this operations as shuffle of a single vector or blending of 2
vectors.
```
define <2 x i8> @g(<2 x i8> %x, <2 x i8> %y) {
  %x0 = extractelement <2 x i8> %x, i32 0
  %y1 = extractelement <2 x i8> %y, i32 1
  %x0x0 = mul i8 %x0, %x0
  %y1y1 = mul i8 %y1, %y1
  %ins1 = insertelement <2 x i8> undef, i8 %x0x0, i32 0
  %ins2 = insertelement <2 x i8> %ins1, i8 %y1y1, i32 1
  ret <2 x i8> %ins2
}
```
can be converted to something like
```
define <2 x i8> @g(<2 x i8> %x, <2 x i8> %y) {
  %1 = shufflevector <2 x i8> %x, <2 x i8> %y, <2 x i32> <i32 0, i32 3>
  %2 = mul <2 x i8> %1, %1
  ret <2 x i8> %2
}
```
Currently this type of conversion is considered as high cost
transformation.

Reviewers: mzolotukhin, delena, mkuper, hfinkel, RKSimon

Subscribers: ashahid, RKSimon, spatel, llvm-commits

Differential Revision: https://reviews.llvm.org/D30200

llvm-svn: 309812
2017-08-02 13:25:26 +00:00
Mohammad Shahid 126e91ab09 [SLP]: Add test to resurrect the jumbled load patch. This test has multiple uses
of memory loads by different user

Change-Id: I40b5ba8b810265440f3e55efca77c4b41ca98fa4
llvm-svn: 309544
2017-07-31 07:40:54 +00:00
Alexey Bataev 8843aab83a [SLP] A test for limiting vectorization of instructions, NFC.
llvm-svn: 306828
2017-06-30 14:37:32 +00:00
Matt Arsenault 67cd347e93 AMDGPU: Allow vectorization of packed types
llvm-svn: 305844
2017-06-20 20:38:06 +00:00
Simon Pilgrim 448c2290f1 [X86][SLM] Add SLM arithmetic vectorization tests
As discussed on D33983, as SLM has so many custom costs its worth testing as well.

llvm-svn: 305151
2017-06-10 19:16:09 +00:00
Simon Pilgrim 631109c09c Regenerate test
llvm-svn: 304973
2017-06-08 10:24:49 +00:00
Alexey Bataev 0c2fca94a3 [SLP] Change extension of the test, NFC.
llvm-svn: 304829
2017-06-06 20:27:45 +00:00
Alexey Bataev 7266fec0d4 [SLP] Add a test for fix of PR32164, NFC.
llvm-svn: 304826
2017-06-06 20:11:35 +00:00
Amara Emerson c9916d7e97 Re-commit r302678, fixing PR33053.
The issue was that the AArch64 TTI hook allowed unpacked integer cmp reductions
which didn't have a lowering.

llvm-svn: 303211
2017-05-16 21:29:22 +00:00
Adam Nemet e29686e5c1 [SLP] Enable 64-bit wide vectorization on AArch64
ARM Neon has native support for half-sized vector registers (64 bits).  This
is beneficial for example for 2D and 3D graphics.  This patch adds the option
to lower MinVecRegSize from 128 via a TTI in the SLP Vectorizer.

*** Performance Analysis

This change was motivated by some internal benchmarks but it is also
beneficial on SPEC and the LLVM testsuite.

The results are with -O3 and PGO.  A negative percentage is an improvement.
The testsuite was run with a sample size of 4.

** SPEC

* CFP2006/482.sphinx3  -3.34%

A pretty hot loop is SLP vectorized resulting in nice instruction reduction.
This used to be a +22% regression before rL299482.

* CFP2000/177.mesa     -3.34%
* CINT2000/256.bzip2   +6.97%

My current plan is to extend the fix in rL299482 to i16 which brings the
regression down to +2.5%.  There are also other problems with the codegen in
this loop so there is further room for improvement.

** LLVM testsuite

* SingleSource/Benchmarks/Misc/ReedSolomon               -10.75%

There are multiple small SLP vectorizations outside the hot code.  It's a bit
surprising that it adds up to 10%.  Some of this may be code-layout noise.

* MultiSource/Benchmarks/VersaBench/beamformer/beamformer -8.40%

The opt-viewer screenshot can be seen at F3218284.  We start at a colder store
but the tree leads us into the hottest loop.

* MultiSource/Applications/lambda-0.1.3/lambda            -2.68%
* MultiSource/Benchmarks/Bullet/bullet                    -2.18%

This is using 3D vectors.

* SingleSource/Benchmarks/Shootout-C++/Shootout-C++-lists +6.67%

Noise, binary is unchanged.

* MultiSource/Benchmarks/Ptrdist/anagram/anagram          +4.90%

There is an additional SLP in the cold code.  The test runs for ~1sec and
prints out over 2000 lines. This is most likely noise.

* MultiSource/Applications/aha/aha                        +1.63%
* MultiSource/Applications/JM/lencod/lencod               +1.41%
* SingleSource/Benchmarks/Misc/richards_benchmark         +1.15%

Differential Revision: https://reviews.llvm.org/D31965

llvm-svn: 303116
2017-05-15 21:15:01 +00:00
Hans Wennborg bd6e9e77a7 Revert r302678 "[AArch64] Enable use of reduction intrinsics."
This caused PR33053.

Original commit message:

> The new experimental reduction intrinsics can now be used, so I'm enabling this
> for AArch64. We will need this for SVE anyway, so it makes sense to do this for
> NEON reductions as well.
>
> The existing code to match shufflevector patterns are replaced with a direct
> lowering of the reductions to AArch64-specific nodes. Tests updated with the
> new, simpler, representation.
>
> Differential Revision: https://reviews.llvm.org/D32247

llvm-svn: 303115
2017-05-15 20:59:32 +00:00
Simon Pilgrim 7d2f06ae22 [SLPVectorizer][X86] Add vectorization tests for vXi64/vXi32/vXi16/VXi8 add/sub/mul
llvm-svn: 303074
2017-05-15 15:48:15 +00:00
Simon Pilgrim a12d65972a [SLPVectorizer][X86] Add vectorization tests for vXi64/vXi32/vXi16/VXi8 shifts
llvm-svn: 303069
2017-05-15 14:27:11 +00:00
Adam Nemet 0aca09fc6c [SLP] Emit optimization remarks
The approach I followed was to emit the remark after getTreeCost concludes
that SLP is profitable.  I initially tried emitting them after the
vectorizeRootInstruction calls in vectorizeChainsInBlock but I vaguely
remember missing a few cases for example in HorizontalReduction::tryToReduce.

ORE is placed in BoUpSLP so that it's available from everywhere (notably
HorizontalReduction::tryToReduce).

We use the first instruction in the root bundle as the locator for the remark.
In order to get a sense how far the tree is spanning I've include the size of
the tree in the remark.  This is not perfect of course but it gives you at
least a rough idea about the tree.  Then you can follow up with -view-slp-tree
to really see the actual tree.

llvm-svn: 302811
2017-05-11 17:06:17 +00:00
Amara Emerson 816542ceb3 [AArch64] Enable use of reduction intrinsics.
The new experimental reduction intrinsics can now be used, so I'm enabling this
for AArch64. We will need this for SVE anyway, so it makes sense to do this for
NEON reductions as well.

The existing code to match shufflevector patterns are replaced with a direct
lowering of the reductions to AArch64-specific nodes. Tests updated with the
new, simpler, representation.

Differential Revision: https://reviews.llvm.org/D32247

llvm-svn: 302678
2017-05-10 15:15:38 +00:00
Matt Arsenault 6a288c1e32 Replace hardcoded intrinsic list with speculatable attribute.
No change in which intrinsics should be speculated.

llvm-svn: 301995
2017-05-03 02:26:10 +00:00
Easwaran Raman 76aba5f6d7 [SLP vectorizer] Allow phi node reordering in tryToVectorizeList.
In tryToVectorizeList, under a very limited circumstance (when entered
from tryToVectorizePair), the values may be reordered (swapped) and the
SLP tree is built with the new order. This extends that to the case when
starting from phis in vectorizeChainsInBlock when there are exactly two
phis. The textual order of phi nodes shouldn't really matter. Without
this change, the loop body in the accompnaying test case is fully vectorized
when we swap the orde of the phis but not with this order. While this
doesn't solve the phi-ordering problem in a general way (for more than 2
phis), this is simple fix that piggybacks on an existing mechanism and
is useful in cases like multiplying two complex numbers.

Differential revision: https://reviews.llvm.org/D32065

llvm-svn: 300574
2017-04-18 18:16:57 +00:00
Jonas Paulsson 4707015d46 Fix a RUN line in new test.
Use '2>&1 |' and not '|&' to pipe debug output to FileCheck

Hopefully handles a "shell parser error" on
llvm-clang-x86_64-expensive-checks-win

test/Transforms/SLPVectorizer/SystemZ/SLP-cmp-cost-query.ll

llvm-svn: 300064
2017-04-12 14:25:08 +00:00
Jonas Paulsson 22776892c9 [SLPVectorizer] Pass the right type argument to getCmpSelInstrCost()
In getEntryCost(), make the scalar type for a compare instruction that of the
operands, not i1. This is needed in order to call getCmpSelInstrCost() for a
compare in a sensible way, the same way as the LoopVectorizer does.

New test: test/Transforms/SLPVectorizer/SystemZ/SLP-cmp-cost-query.ll

Review: Matthew Simpson
https://reviews.llvm.org/D31601

llvm-svn: 300061
2017-04-12 13:29:25 +00:00
Matt Arsenault 3dbeefa978 AMDGPU: Mark all unspecified CC functions in tests as amdgpu_kernel
Currently the default C calling convention functions are treated
the same as compute kernels. Make this explicit so the default
calling convention can be changed to a non-kernel.

Converted with perl -pi -e 's/define void/define amdgpu_kernel void/'
on the relevant test directories (and undoing in one place that actually
wanted a non-kernel).

llvm-svn: 298444
2017-03-21 21:39:51 +00:00
Simon Pilgrim 06c70adcf0 [X86] Add missing BITREVERSE costs for SSE2 vectors and i8/i16/i32/i64 scalars
Prep work for PR31810

llvm-svn: 297876
2017-03-15 19:34:55 +00:00
Michael Kuperstein 5fb39a7966 [SLP] Revert everything that has to do with memory access sorting.
This reverts r293386, r294027, r294029 and r296411.

Turns out the SLP tree isn't actually a "tree" and we don't handle
accessing the same packet of loads in several different orders well,
causing miscompiles.

Revert until we can fix this properly.

llvm-svn: 297493
2017-03-10 18:59:07 +00:00
Michael Kuperstein 768d013a03 [SLP] Revert r296863 due to miscompiles.
Details and reproducer are on the email thread for r296863.

llvm-svn: 297103
2017-03-06 23:54:51 +00:00
Alexey Bataev d7db344c17 [SLP] A test for vectorization of users of extractelement instructions,
NFC.

llvm-svn: 297024
2017-03-06 16:26:00 +00:00
Mohammad Shahid bdac9f30c0 [SLP] Fixes the bug due to absence of in order uses of scalars which needs to be available
for VectorizeTree() API.This API uses it for proper mask computation to be used in shufflevector IR.
The fix is to compute the mask for out of order memory accesses while building the vectorizable tree
instead of actual vectorization of vectorizable tree.It also needs to recompute the proper Lane for
external use of vectorizable scalars based on shuffle mask.

Reviewers: mkuper

Differential Revision: https://reviews.llvm.org/D30159

Change-Id: Ide8773ce0ad3562f3cf4d1a0ad0f487e2f60ce5d
llvm-svn: 296863
2017-03-03 10:02:47 +00:00
Hans Wennborg cc4ff78c9d Revert r296575 "[SLP] Fixes the bug due to absence of in order uses of scalars which needs to be available"
It caused miscompiles, e.g. in Chromium (PR32109).

llvm-svn: 296654
2017-03-01 18:57:16 +00:00
Alexey Bataev 4a45efa431 [SLP] Preserve IR flags when vectorizing horizontal reductions.
Summary:
The SLP vectorizer should propagate IR-level optimization hints/flags
(nsw, nuw, exact, fast-math) when converting scalar horizontal
reductions instructions into vectors, just like for other vectorized
instructions.
It doe not include IR propagation for extra arguments, we need to handle
original scalar operations for extra args to propagate correct flags.

Reviewers: mkuper, mzolotukhin, hfinkel

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D30418

llvm-svn: 296614
2017-03-01 12:43:39 +00:00
Alexey Bataev 74e5a36856 [SLP] Preserve IR flags for extra args.
Summary:
We should preserve IR flags for extra args. These IR flags should be
taken from original scalar operations, not from the reduction
operations.

Reviewers: mkuper, mzolotukhin, hfinkel

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D30447

llvm-svn: 296613
2017-03-01 12:22:33 +00:00
Alexey Bataev dfec81107f [SLP] Fix for PR32038: extra add of PHI node when it is not required.
Summary:
If horizontal reduction tree starts from the binary operation that is
used in PHI node, but this PHI is not used in horizontal reduction, we
may end up with extra addition of this PHI node after vectorization.
Here is an example:
```
%phi = phi i32 [ %tmp, %end], ...
...
%tmp = add i32 %tmp1, %tmp2
end:
```
after vectorization we always have something like:

```
%phi = phi i32 [ %tmp, %end], ...
...
%red = extractelement <8 x 32> %vec.red, 0
%tmp = add i32 %red, %phi
end:
```
even if `%phi` is not used in reduction tree. Patch considers these PHI
nodes as extra arguments and considers them in the final result iff they
really used in reduction.

Reviewers: mkuper, hfinkel, mzolotukhin

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D30409

llvm-svn: 296606
2017-03-01 10:50:44 +00:00
Mohammad Shahid 175ffa8c35 [SLP] Fixes the bug due to absence of in order uses of scalars which needs to be available
for VectorizeTree() API.This API uses it for proper mask computation to be used in shufflevector IR.
The fix is to compute the mask for out of order memory accesses while building the vectorizable tree
instead of actual vectorization of vectorizable tree.

Reviewers: mkuper

Differential Revision: https://reviews.llvm.org/D30159

Change-Id: Id1e287f073fa4959713ba545fa4254db5da8b40d
llvm-svn: 296575
2017-03-01 03:51:54 +00:00
Michael Kuperstein c07cca85fb [SLP] Load sorting should not try to sort things that aren't loads.
We may get a VL where the first element is a load, but the others
aren't. Trying to sort such VLs can only lead to sorrow.

llvm-svn: 296411
2017-02-27 23:18:11 +00:00
Alexey Bataev a79c41cf51 [SLP] Use different flags in tests for reduction ops and extra args.
llvm-svn: 296376
2017-02-27 20:22:44 +00:00
Alexey Bataev 0aadc6ef13 [SLP] Modify test to check IR flags propagation for extra args.
llvm-svn: 296369
2017-02-27 19:16:09 +00:00
Alexey Bataev cb78d09d14 [SLP] A test for a fix of PR32038.
llvm-svn: 296349
2017-02-27 16:07:10 +00:00
Alexey Bataev f77d1656af [SLP] Fix for PR32036: Vectorized horizontal reduction returning wrong
result

Summary:
If the same value is used several times as an extra value, SLP
vectorizer takes it into account only once instead of actual number of
using.
For example:
```
int val = 1;
for (int y = 0; y < 8; y++) {
  for (int x = 0; x < 8; x++) {
    val = val + input[y * 8 + x] + 3;
  }
}
```
We have 2 extra rguments: `1` - initial value of horizontal reduction
and `3`, which is added 8*8 times to the reduction. Before the patch we
added `1` to the reduction value and added once `3`, though it must be
added 64 times.

Reviewers: mkuper, mzolotukhin

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D30262

llvm-svn: 295972
2017-02-23 13:37:09 +00:00
Alexey Bataev 14b370c1bf Revert "[SLP] Fix for PR32036: Vectorized horizontal reduction returning wrong"
This reverts commit 7c5141e577d9efd1c8e3087566a38ce6b3a41a84.

llvm-svn: 295957
2017-02-23 11:09:35 +00:00
Alexey Bataev 7ae653285d [SLP] Fix for PR32036: Vectorized horizontal reduction returning wrong
result

Summary:
If the same value is used several times as an extra value, SLP
vectorizer takes it into account only once instead of actual number of
using.
For example:
```
int val = 1;
for (int y = 0; y < 8; y++) {
  for (int x = 0; x < 8; x++) {
    val = val + input[y * 8 + x] + 3;
  }
}
```
We have 2 extra rguments: `1` - initial value of horizontal reduction
and `3`, which is added 8*8 times to the reduction. Before the patch we
added `1` to the reduction value and added once `3`, though it must be
added 64 times.

Reviewers: mkuper, mzolotukhin

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D30262

llvm-svn: 295956
2017-02-23 10:57:15 +00:00
Alexey Bataev 7337212e83 Revert "[SLP] Fix for PR32036: Vectorized horizontal reduction returning wrong"
This reverts commit d83c81ee6a8dea662808ac22b396d1bb0595c89d.

llvm-svn: 295951
2017-02-23 09:59:29 +00:00