Commit Graph

1194 Commits

Author SHA1 Message Date
Max Kazantsev 57d17795b9 [Test] Add one more test for patch [SLP]Improve reductions analysis and emission, part 1.
The original patch leads to malformed phis on this test. Make sure
we're safeguarded from its return until it is fixed.
2022-04-20 13:57:26 +07:00
Alexey Bataev 7adfa31bc6 [SLP][NFC]Add a test for reducing same values, NFC. 2022-04-19 06:48:21 -07:00
Alexey Bataev 883571928c Revert "[SLP]Improve reductions analysis and emission, part 1."
This reverts commit 0e1f4d4d3c to fix
a crash reported in PR54976
2022-04-19 06:17:03 -07:00
Alban Bridonneau 8daffd1dfb Fix SLP score for out of order contiguous loads
SLP uses the distance between pointers to optimize
the getShallowScore. However the current code misses
the case where we are trying to vectorize for VF=4, and the distance
between pointers is 2. In that case the returned score
reflects the case of contiguous loads, when it's not actually
contiguous.

The attached unit tests have 5 loads, where the program order
is not the same as the offset order in the GEPs. So, the choice
of which 4 loads to bundle together matters. If we pick the
first 4, then we can vectorize with VF=4. If we pick the
last 4, then we can only vectorize with VF=2.

This patch makes a more conservative choice, to consider
all distances>1 to not be a case of contiguous load, and
give those cases a lower score.

Reviewed By: ABataev

Differential Revision: https://reviews.llvm.org/D123516
2022-04-19 11:58:01 +01:00
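The scoring change in 8daffd1dfb can be illustrated with a small sketch. This is not the LLVM implementation: the score constants and the function name `load_pair_score` are hypothetical, and only the rule stated in the message (a distance of 1 is contiguous, any distance greater than 1 gets a lower score) is modelled.

```python
# Hypothetical score values; the real constants are not given in the
# commit message.
SCORE_CONSECUTIVE = 4
SCORE_NOT_CONTIGUOUS = 1  # assumed "lower score"


def load_pair_score(offset_a: int, offset_b: int) -> int:
    """Score two loads from the same base by their element distance.

    Only the case described in the commit is modelled: distance 1 is a
    contiguous pair, everything else is conservatively scored lower.
    """
    distance = offset_b - offset_a
    if distance == 1:
        return SCORE_CONSECUTIVE
    # After the patch, a distance of 2 (or more) is no longer reported
    # as if the loads were contiguous.
    return SCORE_NOT_CONTIGUOUS


adjacent = load_pair_score(0, 1)
gap_of_two = load_pair_score(0, 2)
```

With these assumed values, the pair at distance 2 scores strictly lower than the adjacent pair, which is the behaviour the patch asks for.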
Vasileios Porpodas b1333f03d9 Recommit "[SLP] Support internal users of splat loads"
Code review: https://reviews.llvm.org/D121940

This reverts commit 359dbb0d3d.
2022-04-18 15:58:01 -07:00
Vasileios Porpodas 359dbb0d3d Revert "[SLP] Support internal users of splat loads"
This reverts commit f8e1337115.
2022-04-18 12:12:34 -07:00
Vasileios Porpodas f8e1337115 [SLP] Support internal users of splat loads
Until now we would only accept a broadcast load pattern if it is only used
by a single vector of instructions.

This patch relaxes this, and allows for the broadcast to have more than one
user vector, as long as all of its uses are internal to the SLP graph and
vectorized.

Differential Revision: https://reviews.llvm.org/D121940
2022-04-18 11:59:44 -07:00
Alexey Bataev 0e1f4d4d3c [SLP]Improve reductions analysis and emission, part 1.
Currently SLP vectorizer walks through the instructions and selects
3 main classes of values: 1) reduction operations - instructions with same
reduction opcode (add, mul, min/max, etc.), which build the reduction,
2) reduced values - instructions with the same opcodes, but different
from the reduction opcode, 3) extra arguments - all other values,
instructions from the different basic block rather than the root node,
instructions with too many/few uses.

This scheme is not very efficient. It excludes some instructions and all
non-instruction values from the reductions (constants, proficient
gathers), and too many possibly reduced values are marked as extra arguments.
The patch improves this process by introducing a slightly extended analysis
stage. During this stage, we still try to select 3 classes of the
values: 1) reduction operations - same as before, 2) possibly reduced
values - all instructions from the current block/non-instructions, which
may build a vectorization tree, 3) extra arguments - instructions from
the different basic blocks. Additionally, an extra sorting of the
possibly reduced values occurs to build the scalar sequences which
are highly likely to be vectorized, e.g. loads are grouped by the
distance between them, constants are grouped together, cmp instructions
are sorted by their compare types and predicates, extractelement
instructions are sorted by the vector operand, etc. Also, these groups
are reordered by their length so the longest group is the first in the
list of the possibly reduced values.

The vectorization process tries to emit the reductions for all these
groups. These reductions, the remaining non-vectorized possibly reduced
values and extra arguments are then combined into the final expression
just like it was before.

Differential Revision: https://reviews.llvm.org/D114171
2022-04-12 17:46:11 -07:00
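The extra sorting step described in 0e1f4d4d3c (group the possibly reduced values by kind, then put the longest group first) can be sketched as follows. The tuple encoding of values and the name `group_candidates` are assumptions made for illustration; the real patch uses its own keys for loads, constants, compares and extractelements.

```python
from collections import defaultdict


def group_candidates(values, key):
    """Group values by key and return the groups, longest first."""
    groups = defaultdict(list)
    for value in values:
        groups[key(value)].append(value)
    # sorted() is stable, so groups of equal length keep first-seen order.
    return sorted(groups.values(), key=len, reverse=True)


# Hypothetical candidates: (kind, grouping detail).
values = [
    ("load", "A"),
    ("const", None),
    ("load", "A"),
    ("cmp", "slt"),
    ("load", "A"),
    ("const", None),
]
ordered = group_candidates(values, key=lambda v: v)
group_sizes = [len(group) for group in ordered]
```

Here the three loads form the first group, followed by the two constants and the single compare, so the reduction emitter would try the longest sequence first.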
Simon Pilgrim f061c1050b [SLP][X86] Add ray_sphere intersection methods from c-ray benchmark
We're failing to vectorize several comparison reduction patterns.

Issue #43090 was based off this, but while that simplified test case is now folding, the original still fails due to poor cost model values for vXi1 extractions
2022-04-12 19:51:27 +01:00
Philip Reames 7d6e8f2a96 [slp] Delete dead scalar instructions feeding vectorized instructions
If we vectorize, e.g., a store, we leave around a bunch of getelementptrs for the individual scalar stores which we removed. We can go ahead and delete them as well.

This is purely for test output quality and readability. It should have no effect in any sane pipeline.

Differential Revision: https://reviews.llvm.org/D122493
2022-03-28 20:10:13 -07:00
Philip Reames 48cc9287f5 Reapply "[SLP] Schedule only sub-graph of vectorizable instructions"" (try 3)
The original commit exposed several missing dependencies (e.g. latent bugs in SLP scheduling).  Most of these were fixed over the weekend and have had several days to bake.  The last was fixed this morning after being noticed in manual review of test changes yesterday.  See the review thread for links to each change.

Original commit message follows:

SLP currently schedules all instructions within a scheduling window which stretches from the first instruction potentially vectorized to the last. This window can include a very large number of unrelated instructions which are not being considered for vectorization. This change switches the code to only schedule the sub-graph consisting of the instructions being vectorized and their transitive users.

This has the effect of greatly reducing the amount of work performed in large basic blocks, and thus greatly improves compile time on degenerate examples. To understand the effects, I added some statistics (not planned for upstream contribution). Here's an illustration from my motivating example:

   Before this patch:

   704357 SLP                          - Number of calcDeps actions
   699021 SLP                          - Number of schedule calls
     5598 SLP                          - Number of ReSchedule actions
       59 SLP                          - Number of ReScheduleOnFail actions
    10084 SLP                          - Number of schedule resets
     8523 SLP                          - Number of vector instructions generated

   After this patch:

   102895 SLP                          - Number of calcDeps actions
   161916 SLP                          - Number of schedule calls
     5637 SLP                          - Number of ReSchedule actions
       55 SLP                          - Number of ReScheduleOnFail actions
    10083 SLP                          - Number of schedule resets
     8403 SLP                          - Number of vector instructions generated

I do want to highlight that there is a small difference in number of generated vector instructions. This example is hitting the bailout due to maximum window size, and the change in scheduling is slightly perturbing when and how we hit it. This can be seen in the ReScheduleOnFail counter change. Given that, I think we can safely ignore it.

The downside of this change can be seen in the large test diff. We group all vectorizable instructions together at the bottom of the scheduling region. This means that vector instructions can move quite far from their original point in code. While maybe undesirable, I don't see this as being a major problem as this pass is not intended to be a general scheduling pass.

For context, it's worth noting that the pre-scheduling that SLP does while building the vector tree is exactly the sub-graph scheduling implemented by this patch.

Differential Revision: https://reviews.llvm.org/D118538
2022-03-25 10:39:23 -07:00
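The idea behind D118538 (schedule only the instructions being vectorized plus their transitive users, rather than everything in the window) can be sketched with a plain worklist. The function name, the string instruction names, and the `users` map are all hypothetical; this is a simplified model, not the pass's data structures.

```python
from collections import deque


def schedulable_subgraph(seeds, users, window):
    """Return the seeds plus their transitive users inside the window."""
    seen = set()
    work = deque(s for s in seeds if s in window)
    while work:
        inst = work.popleft()
        if inst in seen:
            continue
        seen.add(inst)
        for user in users.get(inst, ()):
            if user in window and user not in seen:
                work.append(user)
    return seen


# Hypothetical block: a and b are vectorized, c uses both, e uses c.
# d and f are unrelated instructions in the same scheduling window.
window = {"a", "b", "c", "d", "e", "f"}
users = {"a": ["c"], "b": ["c"], "c": ["e"], "d": ["f"]}
sub = schedulable_subgraph(["a", "b"], users, window)
```

In this toy window, `d` and `f` are never visited, which mirrors why the counters for calcDeps and schedule calls drop so sharply in the statistics above.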
Philip Reames a121458edc [test,slp] Add another stacksave related dependence test 2022-03-25 08:48:17 -07:00
Vasileios Porpodas 39aa202aff Recommit "[SLP] Fix lookahead operand reordering for splat loads." attempt 3, fixed assertion crash.
Original review: https://reviews.llvm.org/D121354

This reverts commit e6ead19b77.
2022-03-23 18:32:17 -07:00
Arthur Eubanks e6ead19b77 Revert "Recommit "[SLP] Fix lookahead operand reordering for splat loads." attempt 2, fixed assertion crash."
This reverts commit 27bd8f9492.

Causes crashes, see comments in D121973
2022-03-23 10:57:45 -07:00
Vasileios Porpodas 27bd8f9492 Recommit "[SLP] Fix lookahead operand reordering for splat loads." attempt 2, fixed assertion crash.
Original review: https://reviews.llvm.org/D121354

This reverts commit f7d7d2a08d.
2022-03-22 16:41:55 -07:00
Arthur Eubanks f7d7d2a08d Revert "Recommit "[SLP] Fix lookahead operand reordering for splat loads.""
This reverts commit 79613185d3.

Causes crashes, see comments in https://reviews.llvm.org/D121973.
2022-03-22 13:33:49 -07:00
Vasileios Porpodas 79613185d3 Recommit "[SLP] Fix lookahead operand reordering for splat loads."
Original review: https://reviews.llvm.org/D121354

The original commit 9136145eb0 broke the build on several targets.

Differential Revision: https://reviews.llvm.org/D121973
2022-03-21 15:57:32 -07:00
Philip Reames 56727f9472 [test] Add regression test from pr54465
The reported crash regression was (presumably) fixed in 79a1823, but without adding coverage.
2022-03-21 10:49:27 -07:00
Alexey Bataev 79a182371e [SLP]Make stricter check for instructions that do not require
scheduling.

Need to check that the instructions with external operands can be
reordered safely before actually excluding them from the scheduling.
2022-03-21 06:09:12 -07:00
Philip Reames b7806c8b37 [SLP] Explicitly track required stacksave/alloca dependency
The semantics of an inalloca alloca instruction requires that it not be reordered with a preceding stacksave intrinsic call.  Unfortunately, there's no def/use edge or memory dependence edge.  (The memory point is slightly subtle, but in general a new allocation can't alias with a call which executes strictly before it comes into existence.)

I'd tried to tackle this same case previously in 689babdf6, but the fix chosen there turned out to be incomplete.  As such, this change contains a full revert of the first fix attempt.

This was noticed when investigating problems which surfaced with D118538, but this is definitely an existing bug.  This time around, I managed to reduce a couple of additional cases, including one which was being actively miscompiled even without the new scheduling change.  (See test diffs)

Compile time wise, we only spend extra time when seeing a stacksave (rare), and even then we walk the block at most once per schedule window extension.  Likely a non-issue.
2022-03-20 13:58:45 -07:00
Philip Reames 983ed87c61 [slp,tests] Consolidate two test files into one
There are some slight changes to the test lines due to different cost threshold choices in the two command lines, but I don't believe these to be interesting for the purpose of the tests.
2022-03-19 18:24:35 -07:00
Philip Reames 89ab020d02 [tests, SLP] Add coverage for missing dependencies for stacksave intrinsics
The existing scheduling doesn't account for the scheduling restrictions implied by inalloca allocas combined with stacksave/stackrestore.  This adds coverage including one currently miscompiling case.
2022-03-19 18:05:09 -07:00
Philip Reames 6253b77da9 [SLP] Respect control dependence within a block during scheduling
This fixes an active miscompile visible in the test changes.  The basic problem is that the scheduling dependency graph didn't have any edges for control dependence within a single basic block.  The result is that we could (and in some rare cases *did*) perform reorderings within a block which could introduce new undefined behavior along paths which didn't previously contain any.

Impact wise, we have two major cases where control is not guaranteed to reach a later instruction in the block: may throw calls, and calls containing infinite loops.
* The former case was mostly covered by the memory dependencies, and triggering it requires a function which can throw but not write to memory.  In theory, such a case is possible, but not likely in practice.
* The latter case is likely more of an issue in practice.  After this code was first written, we changed the IR semantics to allow well defined infinite loops without satisfying mustprogress.  Even for C/C++ - which do imply mustprogress - recent changes to how we treat atomics (e.g. an atomic read does not always imply a write) could expose this issue.  I'm a bit shocked we don't seem to have a bug report which hit this in real code actually.

Compile time wise, this results in a single extra scan of the scheduling window in the common case.  Since we stop scanning at the next instruction which isn't guaranteed to execute, no matter what order we traverse instructions in, we scan the block once.  The exception to this is that when we extend the scheduling window downwards, we invalidate all dependencies, and thus rescan.  So the potentially expensive case is when we have a call in a big schedule window which is frequently extended.  We could optimize this case (by caching the last instruction not guaranteed to transfer execution and scanning only the extended window, starting there), but I decided to leave the complexity until it mattered.  That same case is already degenerate with memory dependences, which are more expensive than the control dependence scan.

We could also consider combining the memory dependence and control dependence sets to reduce memory usage, but since it complicates the code slightly and makes debugging a bit harder, I went with the simplest scheme for now.

This was noticed while trying to understand the failures reported against D118538, but is not otherwise related to that change.
2022-03-19 13:36:24 -07:00
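The single-scan argument in 6253b77da9 can be sketched as below. This is a deliberately conservative simplification: every later instruction gets an edge to the closest preceding instruction that might not transfer execution, whereas a real implementation would presumably only need edges for instructions that are unsafe to speculate. The names `control_deps` and `may_not_return` are hypothetical.

```python
def control_deps(block, may_not_return):
    """Map instruction index -> index of the barrier it depends on.

    block is a list of instruction names in program order;
    may_not_return is the set of names that might throw or loop forever.
    """
    deps = {}
    last_barrier = None
    for index, inst in enumerate(block):
        if last_barrier is not None:
            # Reordering this instruction above the barrier could
            # introduce undefined behavior on a path that lacked it.
            deps[index] = last_barrier
        if inst in may_not_return:
            last_barrier = index
    return deps


block = ["load1", "call_may_loop", "udiv", "store"]
deps = control_deps(block, may_not_return={"call_may_loop"})
```

Each instruction is visited once, so the cost is one linear pass over the scheduling window, matching the compile-time reasoning in the message.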
Philip Reames bdbcca617a [SLP,tests] Add coverage showing need for control dependencies during scheduling 2022-03-19 09:45:33 -07:00
Philip Reames 3abf8ebd9a [slp][tests] Add missing function attributes
SLP is currently assuming that control dependence in these cases is irrelevant.  This is only valid if none of the lib-funcs involved can throw or infinite loop in the scalar forms.  This appears to be true (or at least we infer the respective attributes) for the libfuncs I spot checked.  This change is mostly for shrinking the diff on an upcoming patch.
2022-03-18 15:51:42 -07:00
Simon Pilgrim 5dde9c1286 [CostModel][X86] Reduce cost of extracting bool vector elements
For constant indices, these are now just a MOVMSK+TEST/BT
2022-03-18 19:02:47 +00:00
Simon Pilgrim b58413da9b [SLP][X86] Add baseline SSE2 test run to lookahead.ll 2022-03-18 14:27:04 +00:00
Vasileios Porpodas 9136145eb0 Revert "[SLP] Fix lookahead operand reordering for splat loads." due to build failures
This reverts commit 5efa78985b.
2022-03-17 18:22:04 -07:00
Vasileios Porpodas 511fa0800f [SLP][NFC] Added a test for a followup patch that enables handling splat loads with uses. 2022-03-17 18:05:54 -07:00
Vasileios Porpodas 5efa78985b [SLP] Fix lookahead operand reordering for splat loads.
Splat loads are inexpensive in X86. For a 2-lane vector we need just one
instruction: `movddup (%reg), xmm0`. Using the standard Splat score leads
to worse code. This patch adds a new score dedicated for splat loads.

Please note that a splat is usually three IR instructions:
- It is usually a load and 2 inserts:
 %ld = load double, double* %gep
 %ins1 = insertelement <2 x double> poison, double %ld, i32 0
 %ins2 = insertelement <2 x double> %ins1, double %ld, i32 1

- But it can also be a load, an insert and a shuffle:
 %ld = load double, double* %gep
 %ins = insertelement <2 x double> poison, double %ld, i32 0
 %shf = shufflevector <2 x double> %ins, <2 x double> poison, <2 x i32> zeroinitializer

Because of this some of the lit tests contain more IR instructions.

Differential Revision: https://reviews.llvm.org/D121354
2022-03-17 18:05:54 -07:00
Vasileios Porpodas b051c836c0 [SLP][NFC] This adds a test for a follow-up patch that fixes a look-ahead operand reordering issue
Differential Revision: https://reviews.llvm.org/D121353
2022-03-17 18:05:53 -07:00
Alexey Bataev d65cc85977 [SLP]Do not schedule instructions with constants/argument/phi operands and external users.
No need to schedule entry nodes where all instructions are not memory
read/write instructions and their operands are either constants, or
arguments, or phis, or instructions from other blocks, or their users
are phis or from the other blocks.
The resulting vector instructions can be placed at
the beginning of the basic block without scheduling (if the operands do
not need to be scheduled) or at the end of the block (if users are
outside of the block).
It may save some compile time and scheduling resources.

Differential Revision: https://reviews.llvm.org/D121121
2022-03-17 11:03:45 -07:00
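The condition described in d65cc85977 can be written as a small predicate. This sketch follows only the wording of the commit message; the `Inst` class, its fields, and the function names are hypothetical, and the stricter safety check added later in 79a182371e is not modelled.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Inst:
    block: str
    kind: str = "inst"  # "inst", "phi", "const", or "arg"
    touches_memory: bool = False
    operands: list = field(default_factory=list)
    users: list = field(default_factory=list)


def is_external_operand(value, block):
    """Constants, arguments, phis, and values from other blocks."""
    return value.kind in ("const", "arg", "phi") or value.block != block


def needs_no_scheduling(bundle):
    """True if the bundle can skip scheduling per the commit text."""
    if any(inst.touches_memory for inst in bundle):
        return False
    operands_external = all(
        is_external_operand(op, inst.block)
        for inst in bundle
        for op in inst.operands
    )
    users_external = all(
        user.kind == "phi" or user.block != inst.block
        for inst in bundle
        for user in inst.users
    )
    return operands_external or users_external


const = Inst(block="", kind="const")
add = Inst(block="bb", operands=[const, const])
load = Inst(block="bb", touches_memory=True, operands=[const])
```

With these definitions, `needs_no_scheduling([add])` holds while `needs_no_scheduling([load])` does not, since loads are memory instructions.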
Alexey Bataev 150ea76543 Revert "[SLP]Do not schedule instructions with constants/argument/phi operands and external users."
This reverts commit 1eeb2bfe72 to fix
a bug reported in https://reviews.llvm.org/D121121
2022-03-16 13:54:59 -07:00
Alexey Bataev 1eeb2bfe72 [SLP]Do not schedule instructions with constants/argument/phi operands and external users.
No need to schedule entry nodes where all instructions are not memory
read/write instructions and their operands are either constants, or
arguments, or phis, or instructions from other blocks, or their users
are phis or from the other blocks.
The resulting vector instructions can be placed at
the beginning of the basic block without scheduling (if the operands do
not need to be scheduled) or at the end of the block (if users are
outside of the block).
It may save some compile time and scheduling resources.

Differential Revision: https://reviews.llvm.org/D121121
2022-03-16 06:05:43 -07:00
Nikita Popov 8361c5da30 [SLPVectorizer] Handle external load/store pointer uses with opaque pointers
In this case we may not generate a bitcast, so the new load/store
becomes the external user.
2022-03-14 16:55:09 +01:00
David Green 8bef17ed59 [AArch64][SLP] Add a test with mutual reductions. NFC 2022-03-09 21:46:57 +00:00
Philip Reames deae979a2c Revert "Reapply "[SLP] Schedule only sub-graph of vectorizable instructions"""
This reverts commit 738042711b. A second, apparently separate, issue has been reported on the original review.
2022-03-03 11:35:34 -08:00
David Green 65c0e45a37 [AArch64] Vector shifts cost 1
The costs of vector shifts was 2 as opposed to 1, as the nodes are
marked custom. Fix this like the others and mark the nodes as cheap.

Differential Revision: https://reviews.llvm.org/D120773
2022-03-03 10:42:57 +00:00
Philip Reames 738042711b Reapply "[SLP] Schedule only sub-graph of vectorizable instructions""
Root issue which triggered the revert was fixed in 689bab.  No changes in the reapplied patch.

Original commit message follows:

SLP currently schedules all instructions within a scheduling window which stretches from the first instruction potentially vectorized to the last. This window can include a very large number of unrelated instructions which are not being considered for vectorization. This change switches the code to only schedule the sub-graph consisting of the instructions being vectorized and their transitive users.

This has the effect of greatly reducing the amount of work performed in large basic blocks, and thus greatly improves compile time on degenerate examples. To understand the effects, I added some statistics (not planned for upstream contribution). Here's an illustration from my motivating example:

    Before this patch:

    704357 SLP                          - Number of calcDeps actions
    699021 SLP                          - Number of schedule calls
      5598 SLP                          - Number of ReSchedule actions
        59 SLP                          - Number of ReScheduleOnFail actions
     10084 SLP                          - Number of schedule resets
      8523 SLP                          - Number of vector instructions generated

    After this patch:

    102895 SLP                          - Number of calcDeps actions
    161916 SLP                          - Number of schedule calls
      5637 SLP                          - Number of ReSchedule actions
        55 SLP                          - Number of ReScheduleOnFail actions
     10083 SLP                          - Number of schedule resets
      8403 SLP                          - Number of vector instructions generated

I do want to highlight that there is a small difference in number of generated vector instructions. This example is hitting the bailout due to maximum window size, and the change in scheduling is slightly perturbing when and how we hit it. This can be seen in the ReScheduleOnFail counter change. Given that, I think we can safely ignore it.

The downside of this change can be seen in the large test diff. We group all vectorizable instructions together at the bottom of the scheduling region. This means that vector instructions can move quite far from their original point in code. While maybe undesirable, I don't see this as being a major problem as this pass is not intended to be a general scheduling pass.

For context, it's worth noting that the pre-scheduling that SLP does while building the vector tree is exactly the sub-graph scheduling implemented by this patch.

Differential Revision: https://reviews.llvm.org/D118538
2022-03-02 10:47:20 -08:00
Philip Reames 689babdf68 [SLP] Don't try to vectorize allocas
While a collection of allocas are technically vectorizeable - by forming a wider alloca - this was not a transform SLP actually knows how to do.  Instead, we were forming a bundle with missing dependencies, and then relying on the scheduling code to preserve program order if multiple instructions were scheduleable at once.  I haven't been able to write a test case, but I'm 99% sure this was wrong in some edge case.

The unknown op case was flowing down the shufflevector path.  This did result in some splat handling being lost with this change, but the same lack of splat handling is visible in a whole bunch of simple examples for the gather path.  I didn't consider this interesting to fix given how narrow the splat of allocas case is.
2022-03-02 10:08:43 -08:00
Philip Reames 29028e47bd [slp] Add tests for cause of D118538 revert 2022-03-02 09:45:17 -08:00
Arthur Eubanks 9c6250ee41 Revert "[SLP] Schedule only sub-graph of vectorizable instructions"
This reverts commit 0539a26d91.

Causes a miscompile, see comments on D118538.

Required updating bottom-to-top-reorder.ll.
2022-03-01 17:31:16 -08:00
Alexey Bataev e4b9640867 [SLP]Improve bottom-to-top reordering.
Currently bottom-to-top reordering analysis counts orders of the
operands and then adds natural order counts for the operand users. It is
very conservative, since the user nodes themselves may require
reordering. The patch improves bottom-to-top analysis by checking whether
the user nodes require/allow reordering. If the user node must
be reordered, has reused scalars, is an alternate op vectorization node,
is a non-ordered gather node or may allow reordering because of the
reordered operands, such a node is considered a node that allows
reordering and is not counted as a node with the natural order.

Differential Revision: https://reviews.llvm.org/D120492
2022-02-28 06:48:46 -08:00
Evgeniy Brevnov 10e99eb7e4 [SLP] "Normal" instructions should not go between PHI and Landing pad
Currently, SLP can insert a "shuffle" instruction between a PHI and a Landing pad instruction. The problem is demonstrated by the LIT test. The solution is to adjust the insertion point once we are done with PHI generation.

Differential Revision: https://reviews.llvm.org/D120552
2022-02-26 11:44:26 +07:00
Vasileios Porpodas 4bbc3290a2 [SLP] Fix for the min/max intrinsic cost.
The min/max intrinsic cost is currently too low because in the cost calculation
we subtract the cost of the vector compare as we will not emit it.
For the cost of the vector compare we are currently passing BAD_ICMP_PREDICATE
which returns 3, the worst case cost.
I think we should be passing VecPred instead, since we know the predicates of
the compare instr.

I think this is related to commit b3b993a7ad which introduced the predicate
argument to getCmpSelInstrCost().
https://reviews.llvm.org/rGb3b993a7ad817c3c5801341fa78f34332900eb83

Differential Revision: https://reviews.llvm.org/D120439
2022-02-24 18:08:40 -08:00
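The arithmetic behind 4bbc3290a2 can be shown with a toy calculation. Only the worst-case compare cost of 3 comes from the commit message; the intrinsic cost, the cost for a known predicate, and the helper name `minmax_vector_cost` are assumed values chosen for illustration.

```python
def minmax_vector_cost(intrinsic_cost: int, cmp_cost: int) -> int:
    """Cost of the min/max intrinsic once the dead compare is removed."""
    return intrinsic_cost - cmp_cost


INTRINSIC_COST = 4       # assumed
WORST_CASE_CMP_COST = 3  # from the commit: BAD_ICMP_PREDICATE returns 3
KNOWN_PRED_CMP_COST = 1  # assumed cost when the real predicate is passed

cost_before = minmax_vector_cost(INTRINSIC_COST, WORST_CASE_CMP_COST)
cost_after = minmax_vector_cost(INTRINSIC_COST, KNOWN_PRED_CMP_COST)
```

Subtracting the worst-case compare cost removes more than the compare actually costs, so `cost_before` comes out lower than `cost_after` and the vector form looks cheaper than it is.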
Vasileios Porpodas 6136f97c69 [SLP][NFC] Test for a follow-up fix of the vector min/max intrinsic cost calculation.
The code in this test should not have been vectorized.
It looks worse than the scalar code.

Differential Revision: https://reviews.llvm.org/D120438
2022-02-24 18:08:39 -08:00
Nikita Popov a266af7211 [InstCombine] Canonicalize SPF to min/max intrinsics
Now that integer min/max intrinsics have good support in both
InstCombine and other passes, start canonicalizing SPF min/max
to intrinsic min/max.

Once this sticks, we can stop matching SPF min/max in various
places, and can remove hacks we have for preventing infinite loops
and breaking of SPF canonicalization.

Differential Revision: https://reviews.llvm.org/D98152
2022-02-24 09:01:20 +01:00
Philip Reames 9392c0d4ef Revert "[SLP] Remove cap on schedule window size"
This reverts commit 6adf4b039e.  Reverting while investigating https://github.com/llvm/llvm-project/issues/54029
2022-02-23 13:12:07 -08:00
Philip Reames 6adf4b039e [SLP] Remove cap on schedule window size
This cap was first added in 848c1aa45 (back in 2015).  Per the original commit message, the purpose was to avoid a compile time explosion in long basic blocks.  The algorithmic problem in scheduling has now been fixed in 0539a26d.

In the meantime, the code has rotted fairly badly.  Some intermediate refactoring caused the size to only be incremented if *both* iterators advance in the window search.  This causes the size to be badly undercounted when near one end of a basic block.  We no longer have any test which exercises the logic in an intentional way; there's one test which differs with this change, but the changes appear fairly orthogonal to the purpose of the test file.

Unfortunately, we no longer have the original motivating example, so it's possible that it also hits some other issue.  I tested locally with a large example, but even at its worst, that one doesn't demonstrate anything too extreme even without the algorithmic fix.  It's clearly faster with, but only by ~20% which doesn't seem in line with the original commit message.   If regressions with this patch are seen, please file a bug and I'll try to fix any other algorithmic problems which fall out.
2022-02-23 08:27:45 -08:00
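The undercounting described in 6adf4b039e can be reproduced in a toy model of a two-directional window search. The function `window_size` and its parameters are hypothetical and do not mirror the real code; the model only contrasts "count a step when both iterators advance" with "count a step when either advances".

```python
def window_size(block_len, start, steps, count_only_if_both):
    """Simulate growing a window up and down from position start."""
    up, down = start, start
    size = 0
    for _ in range(steps):
        up_moved = up > 0
        down_moved = down < block_len - 1
        if up_moved:
            up -= 1
        if down_moved:
            down += 1
        if count_only_if_both:
            counted = up_moved and down_moved
        else:
            counted = up_moved or down_moved
        if counted:
            size += 1
    return size


# Start two instructions from the top of a 100-instruction block.
undercounted = window_size(100, 2, 50, count_only_if_both=True)
counted_either = window_size(100, 2, 50, count_only_if_both=False)
```

Starting near the top of the block, the upward iterator stops after two steps, so the "both" variant reports a size of 2 while the window has actually grown for 50 steps.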
Brendon Cahoon 3cc15e2cb6 [SLP] Fix assert from non-constant index in insertelement
A call to getInsertIndex() in getTreeCost() is returning None,
which causes an assert because a non-constant index value for
insertelement was not expected. This case occurs when the
insertelement index value is defined with a PHI.

Differential Revision: https://reviews.llvm.org/D120223
2022-02-22 15:57:14 -06:00
Alexey Bataev b3f4535a03 [SLP][NFC]Add a test for bottom to top reordering. 2022-02-22 13:46:34 -08:00
Philip Reames 0539a26d91 [SLP] Schedule only sub-graph of vectorizable instructions
SLP currently schedules all instructions within a scheduling window which stretches from the first instruction potentially vectorized to the last. This window can include a very large number of unrelated instructions which are not being considered for vectorization. This change switches the code to only schedule the sub-graph consisting of the instructions being vectorized and their transitive users.

This has the effect of greatly reducing the amount of work performed in large basic blocks, and thus greatly improves compile time on degenerate examples. To understand the effects, I added some statistics (not planned for upstream contribution). Here's an illustration from my motivating example:

Before this patch:

704357 SLP                          - Number of calcDeps actions
699021 SLP                          - Number of schedule calls
  5598 SLP                          - Number of ReSchedule actions
    59 SLP                          - Number of ReScheduleOnFail actions
 10084 SLP                          - Number of schedule resets
  8523 SLP                          - Number of vector instructions generated

After this patch:

102895 SLP                          - Number of calcDeps actions
161916 SLP                          - Number of schedule calls
  5637 SLP                          - Number of ReSchedule actions
    55 SLP                          - Number of ReScheduleOnFail actions
 10083 SLP                          - Number of schedule resets
  8403 SLP                          - Number of vector instructions generated

I do want to highlight that there is a small difference in number of generated vector instructions. This example is hitting the bailout due to maximum window size, and the change in scheduling is slightly perturbing when and how we hit it. This can be seen in the ReScheduleOnFail counter change. Given that, I think we can safely ignore it.

The downside of this change can be seen in the large test diff. We group all vectorizable instructions together at the bottom of the scheduling region. This means that vector instructions can move quite far from their original point in code. While maybe undesirable, I don't see this as being a major problem as this pass is not intended to be a general scheduling pass.

For context, it's worth noting that the pre-scheduling that SLP does while building the vector tree is exactly the sub-graph scheduling implemented by this patch.

Differential Revision: https://reviews.llvm.org/D118538
2022-02-22 10:15:55 -08:00
Alexey Bataev b0a0df9809 [SLP]Fix vectorization of the alternate cmp instruction with swapped predicates.
If the alternate cmp instruction is a swapped predicate of the main cmp
instruction, need to generate alternate instruction, not the one with
the swapped predicate. Also, the lane with the alternate opcode should
be selected only, if the corresponding operands are not compatible.

Correctness confirmed:
https://alive2.llvm.org/ce/z/94BG66

Differential Revision: https://reviews.llvm.org/D119855
2022-02-18 04:27:45 -08:00
Alexey Bataev 27f72eb25e [SLP][NFC]Add another test for swapped main/alternate cmp, NFC. 2022-02-17 09:55:16 -08:00
Shao-Ce SUN 21ac474392 [NFC] Correct typo `interger` to `integer` 2022-02-17 21:17:47 +08:00
Arthur Eubanks 826fae51d2 [SLPVectorizer][OpaquePtrs] Check GEP source element type
Fixes a miscompile with opaque pointers.

Reviewed By: #opaque-pointers, nikic

Differential Revision: https://reviews.llvm.org/D119980
2022-02-16 14:47:20 -08:00
Arthur Eubanks 4e24397805 [test][SLPVectorizer][OpaquePtr] Precommit test 2022-02-16 14:47:20 -08:00
Alexey Bataev d6371a7c60 [SLP][NFC]Add a test for miscompilation of alternate cmp instructions,
NFC.
2022-02-15 08:29:17 -08:00
Anton Afanasyev b7574b092a [SLP] Don't try to vectorize pair with insertelement
Particularly this breaks vectorization of insertelements where some of
intermediate (i.e. not last) insertelements are used externally.

Fixes PR52275
Fixes #51617

Differential Revision: https://reviews.llvm.org/D119679
2022-02-15 16:12:59 +03:00
Anton Afanasyev f16a9dffce [Test][SLP] Add tests for PR52275 2022-02-15 16:12:59 +03:00
Anton Afanasyev 954ea0f044 [SLP] Simplify indices processing for insertelements
Get rid of non-constant and undef indices of insertelements
at `buildTree()` stage. Fix bugs.

Differential Revision: https://reviews.llvm.org/D119623
2022-02-14 14:50:44 +03:00
Simon Pilgrim ea071884b0 [SLP][X86] Add common check prefix for horizontal reduction tests 2022-02-12 21:48:31 +00:00
Philip Reames 9fa3243ffc [tests] Add coverage for SLP reschedule event
This is slightly reduced from the crash reported against D117951.
2022-02-03 12:23:11 -08:00
Alexey Bataev 802ceb8343 [SLP]Excluded external uses from the reordering estimation.
The compiler adds the estimation for the external uses during operands
reordering analysis, which makes it tend to prefer duplicates in the
lanes rather than a diamond/shuffled match in the graph. It changes the sizes of
the vector operands and may prevent some vectorization. We don't need
this kind of estimation for the analysis phase, because we just need to
choose the most compatible instruction and it does not matter if it has
an external user or is used in a non-matching lane. Instead, we count the number
of unique instructions in the lane and see if the reassociation changes
the number of unique scalars to be a power of 2 or not. If we have a power
of 2 unique scalars in the lane, it is considered more profitable
than having a non-power-of-2 number of unique scalars.
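
The power-of-2 check described above can be sketched roughly as follows. This is an illustrative model only, not the SLP vectorizer's code: scalars are modeled as plain integer IDs, and `countUniqueScalars` and `hasPowerOf2UniqueScalars` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical model: each scalar in a lane is identified by an integer ID.
std::size_t countUniqueScalars(const std::vector<int> &Lane) {
  return std::set<int>(Lane.begin(), Lane.end()).size();
}

// True if the number of unique scalars is a non-zero power of 2.
bool hasPowerOf2UniqueScalars(const std::vector<int> &Lane) {
  std::size_t N = countUniqueScalars(Lane);
  return N != 0 && (N & (N - 1)) == 0;
}
```

Under this model, a lane with IDs {1, 2, 1, 2} has two unique scalars and would be preferred over a lane with IDs {1, 2, 3, 1}, which has three.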

Metric: SLP.NumVectorInstructions

                          test-suite :: MultiSource/Benchmarks/FreeBench/distray/distray.test   70.00   86.00   22.9%
                             test-suite :: External/SPEC/CFP2017rate/544.nab_r/544.nab_r.test  346.00  353.00    2.0%
                            test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test  346.00  353.00    2.0%
                         test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test  235.00  239.00    1.7%
                  test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test  235.00  239.00    1.7%
                     test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 8723.00 8834.00    1.3%
                                 test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 1051.00 1064.00    1.2%
                         test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 1628.00 1646.00    1.1%
                          test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 1628.00 1646.00    1.1%
                       test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 9100.00 9184.00    0.9%
                     test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 3565.00 3577.00    0.3%
                    test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 3565.00 3577.00    0.3%
                       test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 4235.00 4245.00    0.2%
                              test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 1996.00 1998.00    0.1%
                                 test-suite :: MultiSource/Applications/JM/lencod/lencod.test 1671.00 1672.00    0.1%

test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test  783.00  782.00   -0.1%
                      test-suite :: SingleSource/Benchmarks/Misc/oourafft.test   69.00   68.00   -1.4%
        test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test  207.00  192.00   -7.2%
         test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test  207.00  192.00   -7.2%
 test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test   89.00   80.00  -10.1%
test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test   89.00   80.00  -10.1%
       test-suite :: MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test  260.00  215.00  -17.3%
 test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test  256.00  211.00  -17.6%

MultiSource/Benchmarks/Prolangs-C/TimberWolfMC - pretty much the same.
SingleSource/Benchmarks/Misc/oourafft.test - 2 <2 x > loads replaced by
one <4 x> load.
External/SPEC/CINT2017speed/641.leela_s - function gets vectorized and
not inlined anymore.
External/SPEC/CINT2017rate/541.leela_r - same
External/SPEC/CINT2017rate/531.deepsjeng_r - changed the order in
multi-block tree, the result is pretty much the same.
External/SPEC/CINT2017speed/631.deepsjeng_s - same.
MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a - the result is the same
as before.
MultiSource/Benchmarks/MiBench/consumer-jpeg - same.

Differential Revision: https://reviews.llvm.org/D116688
2022-02-03 06:50:06 -08:00
Alexey Bataev ad2a0ccf8f [SLP]Alternate vectorization for cmp instructions.
Added support for alternate ops vectorization of the cmp instructions.
It allows vectorizing either cmp instructions with the same/swapped
predicate but different (swapped) operand kinds, or cmp instructions
with different predicates and compatible operand kinds.

Differential Revision: https://reviews.llvm.org/D115955
2022-02-03 06:24:10 -08:00
Alexey Bataev 8a1dfbc4d8 Revert "[SLP]Alternate vectorization for cmp instructions."
This reverts commit 842a2360a8 to fix the
bugs reported by users in https://reviews.llvm.org/D115955#3291538.
2022-02-02 12:06:36 -08:00
Alexey Bataev 842a2360a8 [SLP]Alternate vectorization for cmp instructions.
Added support for alternate ops vectorization of the cmp instructions.
It allows vectorizing either cmp instructions with the same/swapped
predicate but different (swapped) operand kinds, or cmp instructions
with different predicates and compatible operand kinds.

Differential Revision: https://reviews.llvm.org/D115955
2022-02-02 10:32:52 -08:00
Benjamin Kramer 0c3d22a592 Revert "[SLP]Alternate vectorization for cmp instructions."
This reverts commit 83620bd2ad.

It's causing miscompilations, see review comments at
https://reviews.llvm.org/D115955
2022-02-02 13:08:51 +01:00
Alexey Bataev 83620bd2ad [SLP]Alternate vectorization for cmp instructions.
Added support for alternate ops vectorization of the cmp instructions.
It allows vectorizing either cmp instructions with the same/swapped
predicate but different (swapped) operand kinds, or cmp instructions
with different predicates and compatible operand kinds.

Differential Revision: https://reviews.llvm.org/D115955
2022-02-01 09:54:20 -08:00
Alexey Bataev 0dc33c0a9c [SLP][NFC]Add a test for alternate vectorization in cmp instructions
with same/swapped predicate.
2022-02-01 09:28:06 -08:00
Benjamin Kramer 5281f0dab2 Revert "[SLP]Alternate vectorization for cmp instructions."
This reverts commit afaaecc88c.

Crashes when compiling SciPy, test case https://reviews.llvm.org/P8276
2022-02-01 11:40:43 +01:00
Alexey Bataev afaaecc88c [SLP]Alternate vectorization for cmp instructions.
Added support for alternate ops vectorization of the cmp instructions.
It allows vectorizing either cmp instructions with the same/swapped
predicate but different (swapped) operand kinds, or cmp instructions
with different predicates and compatible operand kinds.

Differential Revision: https://reviews.llvm.org/D115955
2022-01-31 11:11:25 -08:00
Alexey Bataev cec8b614f3 [SLP]Do not reorder top nodes if they do not require reordering.
No need to reorder the top nodes if they are not stores or
insertelement instructions, and each node should be analyzed only
once, when the bottom-to-top analysis is performed.
We still end up with extractelements for the top node scalars, and
the final shuffle just adds an extra cost and currently
crashes the compiler for PHI nodes.

Differential Revision: https://reviews.llvm.org/D116760
2022-01-28 09:16:18 -08:00
Philip Reames 15f7857412 [tests] Refresh autogen tests for SLP 2022-01-24 17:05:58 -08:00
eopXD 3cf15af2da [RISCV] Remove experimental prefix from rvv-related extensions.
Extensions affected: +v, +zve*, +zvl*

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D117860
2022-01-22 20:18:40 -08:00
Rosie Sumpter 552eb372cb [LoopVectorize] Pass a vector type to isLegalMaskedGather/Scatter
This is required to query the legality more precisely in the LoopVectorizer.

This adds another TTI function named 'forceScalarizeMaskedGather/Scatter'
to work around the hack introduced for MVE, where
isLegalMaskedGather/Scatter would return an answer by second-guessing
where the function was called from, based on the Type passed in (vector
vs scalar). The new interface makes this explicit. It is also used by
X86 to check for vector widths where gather/scatters aren't profitable
(or don't exist) for certain subtargets.

Differential Revision: https://reviews.llvm.org/D115329
2022-01-12 13:34:12 +00:00
Kito Cheng f142c45f1e [RISCV] Set getMinVectorRegisterBitWidth to 16 if enable fixed length vector code gen for RVV
getMinVectorRegisterBitWidth means what vector types are supported on
this target, and RISC-V actually supports all fixed length vector types with
vector length less than `getMinRVVVectorSizeInBits`, so set it to 16,
which means 2 x i8, the minimal fixed length vector size in theory.

This also fixes an issue where some test cases might become non-vectorizable
when `-riscv-v-vector-bits-min` is set to a larger value, because the vector size is
smaller than `-riscv-v-vector-bits-min`.

For example, the following code can be vectorized by SLP with
`-riscv-v-vector-bits-min=128` or `-riscv-v-vector-bits-min=256`, but
can't be vectorized with `-riscv-v-vector-bits-min=512` or larger:

```
void foo(double *da) {
  da[0] = 0;
  da[1] = 1;
  da[2] = 2;
  da[3] = 3;
}
```
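
One rough way to read the example: the four `double` stores form a 4 x 64 = 256-bit group, which is compared against the minimum vector register width. The sketch below models only that comparison. `groupMeetsMinWidth` is a hypothetical helper, not an LLVM API, and the real vectorizer considers much more than this.

```cpp
#include <cassert>

// Hypothetical helper: does a group of NumElts elements, each EltBits wide,
// reach the minimum vector register width reported by the target?
bool groupMeetsMinWidth(unsigned NumElts, unsigned EltBits,
                        unsigned MinRegBits) {
  assert(NumElts > 0 && EltBits > 0 && "empty group");
  return NumElts * EltBits >= MinRegBits;
}
```

For the four doubles in `foo`, this model accepts minimum widths of 128 and 256 and rejects 512, matching the behavior described above; with the minimum set to 16, the group is accepted.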

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D116534
2022-01-08 11:16:21 +08:00
Alexey Bataev d130df544d [SLP]Improve reordering for the nodes being used in alternate vectorization.
No need to take the order of the scalars being used as part of the
alternate vectorization into account when trying to reorder the whole
graph. Such elements are better reordered in the following phase because
the subtree still ends up in a shuffle.

Part of D116688, fixes the regression in D116690.

Differential Revision: https://reviews.llvm.org/D116740
2022-01-06 11:18:57 -08:00
Alexey Bataev 7cb19fe493 [SLP]Initialize the lane with the given value instead of default 0.
There is a bug in the reordering analysis stage. If the element with the
given hash is not added to the map but has the same number of APOs and
instructions with the same parent, but a different instruction opcode, it will
be initialized with default values and then the counter is increased by
1. But the lane is not updated and defaults to 0 instead of the actual
`Lane` value. As a result, the analysis is useless in
many cases and defaults to lane 0 instead of the actual lane with the
minimum amount of APO operands.

Differential Revision: https://reviews.llvm.org/D116690
2022-01-06 10:57:11 -08:00
Alexey Bataev bf5a688252 [SLP][NFC]Add a test for the extra shuffle after alternate node, NFC. 2022-01-06 06:34:58 -08:00
Alexey Bataev 7171af7445 [SLP][NFC]Add a test for shuffled entries with different vector sizes,
NFC.
2021-12-27 07:35:35 -08:00
Alexey Bataev ab9078f3d3 [SLP]Fix PR52756: SLPVectorizer crashes with assertion VecTy == FinalVecTy.
Need to check for the number of the unique non-constant values since the
unique values may include several constants.

Differential Revision: https://reviews.llvm.org/D115939
2021-12-20 07:21:20 -08:00
Alexey Bataev 4459a11f4d Revert "[SLP]Fix PR52756: SLPVectorizer crashes with assertion VecTy == FinalVecTy."
This reverts commit fcaf290d02 to fix test
mismatch reported in https://lab.llvm.org/buildbot#builders/117/builds/3531
2021-12-20 07:21:18 -08:00
Alexey Bataev fcaf290d02 [SLP]Fix PR52756: SLPVectorizer crashes with assertion VecTy == FinalVecTy.
Need to check for the number of the unique non-constant values since the
unique values may include several constants.

Differential Revision: https://reviews.llvm.org/D115939
2021-12-20 05:15:01 -08:00
Alexey Bataev 65fc992579 [SLP]Early exit out of the reordering if shuffled/perfect diamond match found.
Need to early exit out of the reordering process if the perfect/shuffled match is found in the operands. Such pattern will result in not profitable reordering because of (false positive) external use of scalars.

Differential Revision: https://reviews.llvm.org/D115811
2021-12-16 11:09:49 -08:00
Alexey Bataev 292bbed6ab [SLP][NFC] Add a test for inefficient reordering, NFC. 2021-12-15 11:05:28 -08:00
Alexey Bataev 6f2e087631 [SLP]Do not represent splats as node with the reused scalars.
No need to represent splats as a node with the reused scalars, it may
increase the cost (currently pass just ignores extra shuffle cost and it
is still not correct).

Differential Revision: https://reviews.llvm.org/D115800
2021-12-15 06:33:11 -08:00
Alexey Bataev 46bbd254c1 [SLP][NFC]Add a test for broadcast cost with undefs, NFC. 2021-12-15 05:58:47 -08:00
Alexey Bataev f00da7c3bc [SLP][NFC]Update test checks, NFC. 2021-12-14 07:35:02 -08:00
Alexey Bataev bd05376986 [SLP]Improve multinode analysis.
Changes the preliminary multinode analysis:
1. Introduced scores for reversed loads/extractelements.
2. Improved shallow score calculation.
3. Lowered the cost of external uses (no need to consider it several times, just once).
4. The initial lane for analysis is the one with the minimal possible
   reorderings.

These changes in general shall reduce compile time and improve the
reordering in many cases.

Part of D57059.

Differential Revision: https://reviews.llvm.org/D101109
2021-12-14 06:01:52 -08:00
Philip Reames e6ad9ef4e7 [instcombine] Canonicalize constant index type to i64 for extractelement/insertelement
The basic idea to this is that a) having a single canonical type makes CSE easier, and b) many of our transforms are inconsistent about which types we end up with based on visit order.

I'm restricting this to constants as for non-constants, we'd have to decide whether the simplicity was worth extra instructions. For constants, there are no extra instructions.

We chose the canonical type as i64 arbitrarily.  We might consider changing this to something else in the future if we have cause.

Differential Revision: https://reviews.llvm.org/D115387
2021-12-13 16:56:22 -08:00
Alexey Bataev e5b191a433 [SLP]Improve/fix reordering for gather nodes with extractelements/undefs.
If the gather node is a mix of undef values and extractelement
instructions, need to take the ordering for such nodes into account too.
It allows reordering some (sub)trees and removing some extra shuffles,
improving overall vectorization.
Also, outlined common functionality into a separate function.

Differential Revision: https://reviews.llvm.org/D115358
2021-12-13 10:59:38 -08:00
Alexey Bataev 19c5cf4167 [SLP]Fix comparator for cmp instruction vectorization.
The comparator for the sort functions should provide a strict weak
ordering relation between parameters. The current solution causes a compiler
crash with some standard C++ library implementations, because it does
not meet this criterion. Tried to fix it + it improves the overall
vectorization result.
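
For context, `std::sort` and related algorithms require the comparator to be a strict weak ordering; in particular, `Comp(A, A)` must be false. The sketch below contrasts a comparator that violates this with one that satisfies it. `CmpKey` and both comparators are hypothetical and are not taken from the SLP code.

```cpp
#include <cassert>
#include <tuple>

// Hypothetical sort key for a cmp instruction: predicate and operand kind.
struct CmpKey {
  int Predicate;
  int OperandKind;
};

// Broken: uses <=, so Comp(A, A) is true and irreflexivity is violated.
bool brokenLess(const CmpKey &A, const CmpKey &B) {
  return A.Predicate <= B.Predicate;
}

// Valid strict weak ordering: lexicographic comparison with operator<.
bool strictLess(const CmpKey &A, const CmpKey &B) {
  return std::tie(A.Predicate, A.OperandKind) <
         std::tie(B.Predicate, B.OperandKind);
}
```

Passing something like `brokenLess` to `std::sort` is undefined behavior, which is why the failure may show up only with some standard library implementations.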

Differential Revision: https://reviews.llvm.org/D115268
2021-12-09 10:57:57 -08:00
Alexey Bataev a101a9b64b [SLP]Fix compiler crash when calculating extract cost for undefs.
Need to add an extra check for potential undef values in the
computeExtractCost function to avoid a compiler crash on casting to
instruction.

Differential Revision: https://reviews.llvm.org/D115162
2021-12-06 10:46:13 -08:00
Alexey Bataev ba74bb3a22 [SLP]Fix reused extracts cost.
If the extractelement instruction is used multiple times in the
different tree entries (either vectorized, or gathered), need to
compensate the scalar cost of such instructions. They are completely
removed if all users are part of the tree but we need to compensate the
cost only once for each instruction.

Differential Revision: https://reviews.llvm.org/D114958
2021-12-02 10:52:00 -08:00
Alexey Bataev 8ceccbd321 [SLP]Outline and fix code for finding common insertelement vectors.
Need to outline the code for finding common vectors in insertelement
instructions into a separate function for future patches. It also
improves the process by adding some extra checks for early exit and
fixes a bug where it always finds the match because of erroneous compare
of the same values.

Differential Revision: https://reviews.llvm.org/D114909
2021-12-02 09:18:25 -08:00
Alexey Bataev 92fbd76af5 [SLP]Improve registering and merging of compatible shuffles.
If several shuffle instructions are emitted, some of them might be
the same as/compatible with (less defined than) the previously emitted ones. Such
shuffles can be removed safely, improving the total cost of the
vectorized code.

Differential Revision: https://reviews.llvm.org/D114087
2021-12-02 08:48:29 -08:00
Alexey Bataev 75106413d0 [SLP][NFC]Add a test for extractelements with many uses vectorization, NFC. 2021-12-02 06:30:37 -08:00
Alexey Bataev afc9e7517a [SLP]Improve cost model for the shuffled extracts.
Improved the calculation of the shuffled extracts, where possible. Need
to calculate the cost for the extracted scalars if some users are not
insertelements + improved the total estimation of the shuffled scalars
used in insertelements build vectors.

Differential Revision: https://reviews.llvm.org/D113782
2021-12-01 08:10:57 -08:00
Alexey Bataev cc30fbf242 [SLP]Introduce isUndefVector function to check for undef vectors.
An undefined vector might be not only an UndefValue, but also
a constant vector with undef or poison elements; need to check for this
kind of undef too.

Differential Revision: https://reviews.llvm.org/D114873
2021-12-01 07:46:10 -08:00
Alexey Bataev ddce6e0561 [SLP]Improve vectorization of cmp instructions sequences.
Final attempt to vectorize bundles of compatible cmp instructions after
all other instructions processing.

Metric: SLP.NumVectorInstructions

Program                                                                             results results0 diff
        test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test    1.00    5.00  400.0%
                              test-suite :: MultiSource/Benchmarks/PAQ8p/paq8p.test    8.00   11.00   37.5%
                    test-suite :: MultiSource/Benchmarks/Olden/voronoi/voronoi.test   20.00   26.00   30.0%
                test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 1344.00 1648.00   22.6%
               test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 1344.00 1648.00   22.6%
                              test-suite :: MultiSource/Benchmarks/Olden/bh/bh.test  102.00  124.00   21.6%
                test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/CoMD/CoMD.test  118.00  133.00   12.7%
          test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 3233.00 3554.00    9.9%
           test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 3233.00 3554.00    9.9%
                        test-suite :: MultiSource/Benchmarks/Olden/power/power.test   64.00   70.00    9.4%
           test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 7879.00 8604.00    9.2%
           test-suite :: MultiSource/Benchmarks/Prolangs-C/simulator/simulator.test   50.00   54.00    8.0%
                        test-suite :: MultiSource/Applications/sqlite3/sqlite3.test   27.00   29.00    7.4%
             test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 8345.00 8955.00    7.3%
     test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test  694.00  738.00    6.3%
                        test-suite :: MultiSource/Benchmarks/MallocBench/gs/gs.test  361.00  382.00    5.8%
                      test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  409.00  430.00    5.1%
     test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test  140.00  147.00    5.0%
      test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test  140.00  147.00    5.0%
             test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 4013.00 4206.00    4.8%
                       test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test  966.00 1011.00    4.7%
                           test-suite :: SingleSource/Benchmarks/Misc/oourafft.test   65.00   68.00    4.6%
                            test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 4219.00 4381.00    3.8%
                    test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 1911.00 1973.00    3.2%
      test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test   62.00   64.00    3.2%
     test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test   62.00   64.00    3.2%
                 test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test  852.00  877.00    2.9%
                  test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test  852.00  877.00    2.9%
                       test-suite :: MultiSource/Applications/JM/lencod/lencod.test 1624.00 1668.00    2.7%
                         test-suite :: MultiSource/Benchmarks/McCat/18-imp/imp.test   39.00   40.00    2.6%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset.test  613.00  624.00    1.8%
      test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test  378.00  383.00    1.3%
      test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test  293.00  295.00    0.7%
            test-suite :: MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test  297.00  299.00    0.7%
      test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test 5522.00 5534.00    0.2%
     test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test 5522.00 5534.00    0.2%

Differential Revision: https://reviews.llvm.org/D114799
2021-12-01 07:26:29 -08:00
Alexey Bataev e28174cf56 [SLP][NFC]Add a test for inserting into constant undef vector, NFC. 2021-12-01 06:21:05 -08:00
Alexey Bataev dce6c434ea [SLP]Improve isFixedVectorShuffle and its use.
Extended support for undefined source vector/extract indices/non-fixed
vector types, also no need to check for the parent of the extractelement
instructions with the constant indices.

Differential Revision: https://reviews.llvm.org/D114121
2021-11-30 10:10:20 -08:00
Alexey Bataev fc0aacf324 [SLP]Improve analysis/emission of vector operands for alternate nodes.
Compiler has an analysis for perfect diamond matching but it does not
support nodes with main/alternate opcodes. The problem is that the
scalars themselves are different and might not match directly with other
nodes, but operands and main/alternate opcodes might match and compiler
might reuse some previously emitted vector instructions. Need to include
this analysis in the cost model and actual vector instructions emission
process.

Differential Revision: https://reviews.llvm.org/D114101
2021-11-26 06:38:02 -08:00
Alexey Bataev 6263982172 [SLP][NFC]Add a test for gathered instructions in loop, NFC. 2021-11-26 05:52:48 -08:00
Alexey Bataev 4675a1654c Revert "[SLP]Improve analysis/emission of vector operands for alternate nodes."
This reverts commit 496254cf80 to fix
compiler crashes reported in D114101#3152982.
2021-11-25 05:19:49 -08:00
Alexey Bataev 496254cf80 [SLP]Improve analysis/emission of vector operands for alternate nodes.
Compiler has an analysis for perfect diamond matching but it does not
support nodes with main/alternate opcodes. The problem is that the
scalars themselves are different and might not match directly with other
nodes, but operands and main/alternate opcodes might match and compiler
might reuse some previously emitted vector instructions. Need to include
this analysis in the cost model and actual vector instructions emission
process.

Differential Revision: https://reviews.llvm.org/D114101
2021-11-24 12:55:24 -08:00
Alexey Bataev 02298c15d5 [SLP][NFC]Add a test that reveals the problem in the emission of
vector int division with undefs.
2021-11-22 07:41:07 -08:00
Alexey Bataev 2e7f12d5e9 [SLP][NFC]Add a test for multiple alternate nodes with cost estimation,
NFC.
2021-11-17 09:02:57 -08:00
Alexey Bataev 900cc1a226 [SLP]Improve cost of the gather nodes.
No need to count the final shuffle cost for the constants, gathering of
the constants is just a constant vector + extra inserts, if required.

Differential Revision: https://reviews.llvm.org/D113770
2021-11-16 06:25:07 -08:00
Alexey Bataev 51c0b6843a [SLP][NFC]Add more tests for shuffles that can be optimized after SLP,
NFC.
2021-11-16 05:42:18 -08:00
Alexey Bataev 2d0cab9d3d [SLP][NFC]Add a test for extra shuffle emission, NFC. 2021-11-15 12:14:43 -08:00
Alexey Bataev 036207d5f2 [SLP]Improve splat detection.
A bunch of scalars can be treated as a splat not only if all elements
are the same but also if some of them are undefvalues.
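
The relaxed splat rule can be sketched as follows. This is a simplified model, not the actual implementation: scalars are integer IDs, `std::nullopt` stands in for an undef value, and `isSplatIgnoringUndefs` is a hypothetical name. Treating an all-undef list as "not a splat" is an assumption of this sketch.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical model: a scalar is an integer ID; std::nullopt stands for undef.
using Scalar = std::optional<int>;

// True if all defined elements are the same value and at least one element
// is defined; undef elements are ignored.
bool isSplatIgnoringUndefs(const std::vector<Scalar> &VL) {
  std::optional<int> First;
  for (const Scalar &V : VL) {
    if (!V)
      continue; // undef is compatible with any value
    if (!First)
      First = V;
    else if (*First != *V)
      return false;
  }
  return First.has_value();
}
```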

Differential Revision: https://reviews.llvm.org/D113774
2021-11-15 07:50:34 -08:00
Alexey Bataev 6fb5bed7d1 [SLP]Do not create unused gather nodes for scalar arguments of vector intrinsics.
If the vector intrinsic has scalar argument, we currently still create
a tree entry for this argument. This entry is not used, just consumes
resources and increases the cost of the tree.

Differential Revision: https://reviews.llvm.org/D113806
2021-11-15 06:11:19 -08:00
Alexey Bataev e2a86ab847 [SLP][NFC]Add a test for vector intrinsic with scalar parameter, NFC. 2021-11-12 13:49:56 -08:00
Alexey Bataev 352c46e707 [SLP]Improve vectorization of split loads.
Need to fix the cost estimation for split loads, since we look at the
subregs already, no need to permute them, need just to estimate
subregister insert, if it is smaller than the real register. Also, using
split loads, it might be profitable already to vectorize smaller trees
with gathering of the loads.

Differential Revision: https://reviews.llvm.org/D107188
2021-11-12 06:13:22 -08:00
Anton Afanasyev 1c2ad70fd5 [Test][SLPVectorizer] Precommit test for PR52275 2021-11-06 17:11:02 +03:00
Alexey Bataev 07ef9f513f [SLP]Improve/fix reordering of the gathered graph nodes.
Gathered loads/extractelements/extractvalue instructions should be
checked if they can represent a vector reordering node too and their
order should be taken into account for better graph reordering analysis.
Also, if the gather node has reused scalars, they must be reordered
instead of the scalars themselves.

Differential Revision: https://reviews.llvm.org/D112454
2021-10-28 05:45:09 -07:00
Alexey Bataev f06e332982 Revert "[SLP]Improve/fix reordering of the gathered graph nodes."
This reverts commit 64d1617d18 to fix test
non-stability.
2021-10-27 11:16:58 -07:00
Alexey Bataev 64d1617d18 [SLP]Improve/fix reordering of the gathered graph nodes.
Gathered loads/extractelements/extractvalue instructions should be
checked if they can represent a vector reordering node too and their
order should be taken into account for better graph reordering analysis.
Also, if the gather node has reused scalars, they must be reordered
instead of the scalars themselves.

Differential Revision: https://reviews.llvm.org/D112454
2021-10-27 08:49:13 -07:00
Alexey Bataev 9b12975cbf Revert "[SLP]Improve/fix reordering of the gathered graph nodes."
This reverts commit f719b794bc to fix
instability in tests.
2021-10-27 07:31:36 -07:00
Alexey Bataev f719b794bc [SLP]Improve/fix reordering of the gathered graph nodes.
Gathered loads/extractelements/extractvalue instructions should be
checked if they can represent a vector reordering node too and their
order should be taken into account for better graph reordering analysis.
Also, if the gather node has reused scalars, they must be reordered
instead of the scalars themselves.

Differential Revision: https://reviews.llvm.org/D112454
2021-10-27 06:08:40 -07:00
Alexey Bataev cb4feae7bd [SLP]Fix logical and/or reductions.
Need to emit select(cmp) instructions for poison-safe forms of select
ops. Currently alive reports that `Target is more poisonous than source`
for the operations we are generating for such instructions.

https://alive2.llvm.org/ce/z/FiNiAA

Differential Revision: https://reviews.llvm.org/D112562
2021-10-27 04:25:20 -07:00
Alexey Bataev 5db7568a6a [SLP][NFC]Add a test for poison-free or reduction. 2021-10-26 14:04:05 -07:00
Alexey Bataev 8ba8cf24f7 [SLP][NFC]Add a test for logical reduction with extra op. 2021-10-26 10:14:20 -07:00
Alexey Bataev ce14d1b690 [SLP]Do not reorder reduction nodes.
The final reduction nodes should not be reordered, the order does not
matter for reductions. Also, it might be profitable to vectorize smaller
reduction trees, reduction cost may compensate small tree cost.

Part of D111574

Differential Revision: https://reviews.llvm.org/D112467
2021-10-26 07:41:24 -07:00
Alexey Bataev eb9b75dd4d [SLP]Change the order of the reduction/binops args pair vectorization attempts.
Need to change the order of the reduction/binops args pair vectorization
attempts. Need to try to find the reduction at first and postpone
vectorization of binops args. This may help to find more reduction
patterns and vectorize them.
Part of D111574.

Differential Revision: https://reviews.llvm.org/D112224
2021-10-25 06:27:14 -07:00
Quinn Pham 950f22a5e1 [llvm]Inclusive language: replace master with main
[NFC] This patch fixes a url in a testcase due to the renaming of the branch.
2021-10-22 11:56:44 -05:00
Florian Hahn a4b8979a81 [SLP] Add additional tests which caused crashes with versioning. 2021-10-21 18:17:31 +01:00
Alexey Bataev 3ea7877c8b [SLP]Unify vectorization of PHI and store nodes with improved tiny tree vectorization.
Vectorization of PHIs and stores is very similar, it might be beneficial to
try to revectorize stores (like PHIs) if the total number of stores with
the same/alternate opcode is less than the vector size but number of
stores with the same type is larger than the vector size.

Differential Revision: https://reviews.llvm.org/D109831
2021-10-21 06:25:32 -07:00
Bjorn Pettersson a413663d8f [NewPM][test] Avoid using -enable-new-pm=1 since -passes implies new PM 2021-10-20 15:16:17 +02:00
Simon Pilgrim a3c05982ac [SLP][X86] Improve SLP tests for division/multiplication by +/- pow2
Add PR51436 test as well as some basic multiply tests, and include SSE2 division coverage
2021-10-20 13:30:27 +01:00
Alexey Bataev b9cfa016da [SLP]Fix emission of the shrink shuffles.
Need to follow the order of the reused scalars from the
ReuseShuffleIndices mask rather than rely on the natural order.

Differential Revision: https://reviews.llvm.org/D111898
2021-10-18 13:13:12 -07:00
Alexey Bataev 1312aff768 [SLP]Add a test for shrink shuffle after reorder, NFC. 2021-10-15 09:42:43 -07:00
Alexey Bataev 414abff1fe [SLP]Fix PR52090: clang crashes: Assertion `Index < Length && "Invalid index!"' failed.
Need to check that either Idx is UndefMaskElem and value is UndefValue
or Idx is valid and value is the same as the scalar value in the node.

Differential Revision: https://reviews.llvm.org/D111802
2021-10-14 14:26:29 -07:00
Philip Reames 0658bab870 [SCEV] Infer flags from add/gep in any block
This patch removes a compile time restriction from isSCEVExprNeverPoison. We've strengthened our ability to reason about flags on scopes other than addrecs, and this bailout prevents us from using it. The comment is also suspect as well in that we're in the middle of constructing a SCEV for I. As such, we're going to visit all operands *anyways*.

Differential Revision: https://reviews.llvm.org/D111186
2021-10-06 11:11:54 -07:00
Simon Pilgrim 0776924a17 [CostModel][X86] getCmpSelInstrCost - treat BAD_PREDICATEs the same as the worst case cost predicates for ICMP/FCMP instructions
As suggested on D111024, we should treat getCmpSelInstrCost calls without a specific predicate as matching the worst case predicate cost.

These regressions will be addressed with a mixture of D111024 and fixing other specific getCmpSelInstrCost calls to have realistic predicates.
2021-10-06 10:14:56 +01:00
Alexey Bataev bebe702dbe [SLP]Detect reused scalars in all possible gathers for better vectorization cost.
Some initially gathered nodes missed the check for the reused scalars,
which leads to high gather cost. Such nodes still can be represented as
m gathers + shuffle instead of n gathers, where m < n.

Differential Revision: https://reviews.llvm.org/D111153
2021-10-05 09:43:03 -07:00
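The "m gathers + shuffle instead of n gathers" idea from bebe702dbe can be illustrated with a small Python sketch. It is not the SLP vectorizer's implementation; `dedup_gather` and `apply_shuffle` are hypothetical names, and scalars are modeled as plain hashable values.

```python
def dedup_gather(scalars):
    """Split a gather list into unique scalars plus a reuse mask.

    Hypothetical helper for illustration only; it is not an LLVM API.
    """
    unique = []      # the m scalars that actually need to be gathered
    position = {}    # scalar -> its index in `unique`
    mask = []        # one entry per original lane, indexing into `unique`
    for s in scalars:
        if s not in position:
            position[s] = len(unique)
            unique.append(s)
        mask.append(position[s])
    return unique, mask


def apply_shuffle(values, mask):
    """Rebuild the full-width vector from the gathered values and the mask."""
    return [values[i] for i in mask]
```

For the lanes `["a", "b", "a", "b"]` this gathers two scalars and uses the mask `[0, 1, 0, 1]`, so n = 4 gathers become m = 2 gathers plus one shuffle.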
Kerry McLaughlin c1d46d3461 [SLPVectorizer] Fix crash in isShuffle with scalable vectors
D104809 changed `buildTree_rec` to check for extract element instructions
with scalable types. However, if the extract is extended or truncated,
these changes do not apply and we assert later on in isShuffle(), which
attempts to cast the type of the extract to FixedVectorType.

Reviewed By: ABataev

Differential Revision: https://reviews.llvm.org/D110640
2021-10-01 10:56:44 +01:00
Alexey Bataev f701505c45 [SLP]Improve vectorization of phi nodes by trying wider vectors.
Try to improve vectorization of the PHI nodes by trying to vectorize
similar instructions at the size of the widest possible vectors, then
aggregating with compatible type PHIs and trying to vectorize again and
only if this failed, try smaller sizes of the vector factors for
compatible PHI nodes. This restores performance of several benchmarks
after tuning of the fp/int conversion instructions costs.

Differential Revision: https://reviews.llvm.org/D108740
2021-09-28 07:20:36 -07:00
Alexey Bataev 8bacfb9bed [SLP]No need to schedule/check parent for extract{element/value} instruction.
The extractelement/extractvalue instructions are not required to
be scheduled since they only depend on the source vector/aggregate (with
constant indices); the same applies to the parent basic block checks.
Improves compile time and saves scheduling budget.

Differential Revision: https://reviews.llvm.org/D108703
2021-09-28 06:13:55 -07:00
Jameson Nash e27a6db529 Bad SLPVectorization shufflevector replacement, resulting in write to wrong memory location
We see that it might otherwise do:

  %10 = getelementptr {}**, <2 x {}***> %9, <2 x i32> <i32 10, i32 4>
  %11 = bitcast <2 x {}***> %10 to <2 x i64*>
...
  %27 = extractelement <2 x i64*> %11, i32 0
  %28 = bitcast i64* %27 to <2 x i64>*
  store <2 x i64> %22, <2 x i64>* %28, align 4, !tbaa !2

Which is an out-of-bounds store (the extractelement got offset 10
instead of offset 4 as intended). With the fix, we correctly generate
extractelement for i32 1 and generate correct code.

Differential Revision: https://reviews.llvm.org/D106613
2021-09-27 14:06:13 -04:00
Simon Pilgrim c931d35216 [CostModel][X86] Increase i64 mul cost from 1 to 2
Only the most recent CPUs really support 1cy 64-bit multiplies, and the X64 cost table represents a realistic worst case. The 1cy value was also discouraging vectorization when most vXi64 PMULDQ expansions aren't actually slower than scalarization.

Noticed while investigating PR51436.
2021-09-23 14:48:21 +01:00
Alexey Bataev 173dd896db [SLP][NFC]Add a test to show an issue with incorrectly extracted
pointers.
2021-09-22 09:02:13 -07:00
hyeongyu kim ec8311444a [InstCombine] Update InstCombine to use poison instead of undef for shufflevector's placeholder (2/3)
This patch is for fixing potential shufflevector-related bugs like D93818.
As in D93818, this patch changes shufflevector's default placeholder to poison.
To reduce risk, it was divided into several patches, and this patch is for InstCombineCompares and InstructionCombining.

Reviewed By: spatel

Differential Revision: https://reviews.llvm.org/D110227
2021-09-23 00:14:50 +09:00
Alexey Bataev b6d10beb50 [SLP][NFC]Rename function in the test for better matching of the
transformation.
2021-09-22 05:51:18 -07:00
Anna Thomas 69921f6f45 [InstCombine] Improve TryToSinkInstruction with multiple uses
This patch allows sinking an instruction which can have multiple uses in a
single user. We were previously over-restrictive by looking for exactly one use,
rather than one user.

Also added an API for retrieving a unique undroppable user.

Reviewed By: nikic

Differential Revision: https://reviews.llvm.org/D109700
2021-09-21 10:04:04 -04:00
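The distinction between "exactly one use" and "exactly one user" in 69921f6f45 can be shown with a toy model. The sketch below is Python, not LLVM's API: instructions are modeled as a dict from a user name to its operand list, `count_uses` and `unique_user` are hypothetical helpers, and droppable users are ignored entirely.

```python
def count_uses(value, users):
    """Count operand slots, across all users, that reference `value`."""
    return sum(operands.count(value) for operands in users.values())


def unique_user(value, users):
    """Return the single user of `value`, or None if there are zero or several.

    Hypothetical helper; a user that references `value` in several operand
    slots still counts as one user.
    """
    found = [name for name, operands in users.items() if value in operands]
    return found[0] if len(found) == 1 else None
```

With `users = {"mul": ["x", "x"]}`, `x` has two uses but a single user, which is the case a one-use check would reject and a one-user check accepts.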
Alexey Bataev bc69dd62c0 [SLP]Improve graph reordering.
Reworked reordering algorithm. Originally, the compiler just tried to
detect the most common order in the reorderable nodes (loads, stores,
extractelements, extractvalues) and then fully rebuilt the graph in
the best order. This was not efficient, since it required extra
memory and time for building/rebuilding the tree and doubled the use of
the scheduling budget, which could lead to missed vectorization due to
exhausted scheduling resources.

The patch provides a 2-way approach to the graph reordering problem. First,
all reordering is done in-place: it does not require tree
deleting/rebuilding, it just rotates the scalars/orders/reuses masks in
the graph node.

The first step (top-to-bottom) rotates the whole graph, similarly to the previous
implementation. The compiler counts the number of the most used orders of
the graph nodes with the same vectorization factor and then rotates the
subgraph with the given vectorization factor to the most used order, if
it is not empty. Then it repeats the same procedure for the subgraphs with
the smaller vectorization factor. We can do this because we still need
to reshuffle the smaller subgraph when building operands for the graph
nodes with the larger vectorization factor, so we can rotate just the subgraph,
not the whole graph.

The second step (bottom-to-top) scans through the leaves and tries to
detect the users of the leaves which can be reordered. If the leaves can
be reordered in the best fashion, they are reordered, and so are their users.
This allows removing double shuffles to the same ordering of the operands in
many cases and just reordering the user operations instead. Plus, it moves
the final shuffles closer to the top of the graph and in many cases
allows removing an extra shuffle, because the same procedure is repeated
again and we can again merge some reordering masks and reorder user nodes
instead of the operands.

Also, the patch improves the cost model for gathering of loads, which improves
x264 benchmark in some cases.

Gives about +2% on AVX512 + LTO (more expected for AVX/AVX2) for {625,525}x264,
+3% for 508.namd, improves most of other benchmarks.
The compile and link time are almost the same, though in some cases it
should be better (we're not doing an extra instruction scheduling
anymore) + we may vectorize more code for the large basic blocks again
because of saving scheduling budget.

Differential Revision: https://reviews.llvm.org/D105020
2021-09-20 08:42:19 -07:00
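The first (top-to-bottom) step of bc69dd62c0, counting the most used order among nodes with the same vectorization factor and rotating in place, can be sketched as follows. This is a simplified Python illustration, not the patch's code: `most_used_order` and `reorder_in_place` are hypothetical names, an order is modeled as a tuple of lane indices, the empty tuple stands for "no reordering needed", and the convention that lane i receives the old lane `order[i]` is an assumption of the sketch.

```python
from collections import Counter


def most_used_order(orders):
    """Return the most frequent order among nodes with the same VF.

    An empty tuple means the natural order, in which case nothing needs
    to be rotated. Hypothetical helper for illustration only.
    """
    if not orders:
        return ()
    counts = Counter(tuple(o) for o in orders)
    best, _ = counts.most_common(1)[0]
    return best


def reorder_in_place(scalars, order):
    """Permute `scalars` in place so that lane i holds the old lane order[i]."""
    if not order:
        return  # natural order wins: leave the node untouched
    scalars[:] = [scalars[i] for i in order]
```

The slice assignment mutates the existing list, which mirrors the point of the patch: the node's contents are permuted without deleting and rebuilding the tree.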
Alexey Bataev 2b0b1d5319 [SLP][NFC]Add a test for reorder of alt shuffle operands. 2021-09-17 10:42:45 -07:00
Florian Hahn 2f97ff8e7b [SLP] Add additional memory versioning tests. 2021-09-16 13:31:14 +01:00