Commit Graph

1085 Commits

Author SHA1 Message Date
Philip Reames 7d6e8f2a96 [slp] Delete dead scalar instructions feeding vectorized instructions
If we vectorize a e.g. store, we leave around a bunch of getelementptrs for the individual scalar stores which we removed. We can go ahead and delete them as well.

This is purely for test output quality and readability. It should have no effect in any sane pipeline.

Differential Revision: https://reviews.llvm.org/D122493
2022-03-28 20:10:13 -07:00
Philip Reames 48cc9287f5 Reapply "[SLP] Schedule only sub-graph of vectorizable instructions"" (try 3)
The original commit exposed several missing dependencies (e.g. latent bugs in SLP scheduling).  Most of these were fixed over the weekend and have had several days to bake.  The last was fixed this morning after being noticed in manual review of test changes yesterday.  See the review thread for links to each change.

Original commit message follows:

SLP currently schedules all instructions within a scheduling window which stretches from the first instruction potentially vectorized to the last. This window can include a very large number of unrelated instructions which are not being considered for vectorization. This change switches the code to only schedule the sub-graph consisting of the instructions being vectorized and their transitive users.

This has the effect of greatly reducing the amount of work performed in large basic blocks, and thus greatly improves compile time on degenerate examples. To understand the effects, I added some statistics (not planned for upstream contribution). Here's an illustration from my motivating example:

   Before this patch:

   704357 SLP                          - Number of calcDeps actions
   699021 SLP                          - Number of schedule calls
   5598 SLP                          - Number of ReSchedule actions
   59 SLP                          - Number of ReScheduleOnFail actions
   10084 SLP                          - Number of schedule resets
   8523 SLP                          - Number of vector instructions generated

   After this patch:

   102895 SLP                          - Number of calcDeps actions
   161916 SLP                          - Number of schedule calls
   5637 SLP                          - Number of ReSchedule actions
   55 SLP                          - Number of ReScheduleOnFail actions
   10083 SLP                          - Number of schedule resets
   8403 SLP                          - Number of vector instructions generated

I do want to highlight that there is a small difference in number of generated vector instructions. This example is hitting the bailout due to maximum window size, and the change in scheduling is slightly perturbing when and how we hit it. This can be seen in the RescheduleOnFail counter change. Given that, I think we can safely ignore.

The downside of this change can be seen in the large test diff. We group all vectorizable instructions together at the bottom of the scheduling region. This means that vector instructions can move quite far from their original point in code. While maybe undesirable, I don't see this as being a major problem as this pass is not intended to be a general scheduling pass.

For context, it's worth noting that the pre-scheduling that SLP does while building the vector tree is exactly the sub-graph scheduling implemented by this patch.

Differential Revision: https://reviews.llvm.org/D118538
2022-03-25 10:39:23 -07:00
Philip Reames a121458edc [test,slp] Add another stacksave related dependence test 2022-03-25 08:48:17 -07:00
Vasileios Porpodas 39aa202aff Recommit "[SLP] Fix lookahead operand reordering for splat loads." attempt 3, fixed assertion crash.
Original review: https://reviews.llvm.org/D121354

This reverts commit e6ead19b77.
2022-03-23 18:32:17 -07:00
Arthur Eubanks e6ead19b77 Revert "Recommit "[SLP] Fix lookahead operand reordering for splat loads." attempt 2, fixed assertion crash."
This reverts commit 27bd8f9492.

Causes crashes, see comments in D121973
2022-03-23 10:57:45 -07:00
Vasileios Porpodas 27bd8f9492 Recommit "[SLP] Fix lookahead operand reordering for splat loads." attempt 2, fixed assertion crash.
Original review: https://reviews.llvm.org/D121354

This reverts commit f7d7d2a08d.
2022-03-22 16:41:55 -07:00
Arthur Eubanks f7d7d2a08d Revert "Recommit "[SLP] Fix lookahead operand reordering for splat loads.""
This reverts commit 79613185d3.

Causes crashes, see comments in https://reviews.llvm.org/D121973.
2022-03-22 13:33:49 -07:00
Vasileios Porpodas 79613185d3 Recommit "[SLP] Fix lookahead operand reordering for splat loads."
Original review: https://reviews.llvm.org/D121354

The original commit 9136145eb0 broke the build on several targets.

Differential Revision: https://reviews.llvm.org/D121973
2022-03-21 15:57:32 -07:00
Philip Reames 56727f9472 [test] Add regression test from pr54465
The reported crash regression was (presumably) fixed in 79a1823, but without adding coverage.
2022-03-21 10:49:27 -07:00
Alexey Bataev 79a182371e [SLP]Make stricter check for instructions that do not require
scheduling.

Need to check that the instructions with external operands can be
reordered safely before actualy exclude them from the scheduling.
2022-03-21 06:09:12 -07:00
Philip Reames b7806c8b37 [SLP] Explicit track required stacksave/alloca dependency
The semantics of an inalloca alloca instruction requires that it not be reordered with a preceeding stacksave intrinsic call.  Unfortunately, there's no def/use edge or memory dependence edge.  (THe memory point is slightly subtle, but in general a new allocation can't alias with a call which executes strictly before it comes into existance.)

I'd tried to tackle this same case previously in 689babdf6, but the fix chosen there turned out to be incomplete.  As such, this change contains a fully revert of the first fix attempt.

This was noticed when investigating problems which surfaced with D118538, but this is definitely an existing bug.  This time around, I managed to reduce a couple of additional cases, including one which was being actively miscompiled even without the new scheduling change.  (See test diffs)

Compile time wise, we only spend extra time when seeing a stacksave (rare), and even then we walk the block at most once per schedule window extension.  Likely a non-issue.
2022-03-20 13:58:45 -07:00
Philip Reames 983ed87c61 [slp,tests] Consolidate two test files into one
There are some slight changes to the test lines due to different cost threshold choices in the two command lines, but I don't believe these to be interesting the purpose of the tests.
2022-03-19 18:24:35 -07:00
Philip Reames 89ab020d02 [tests, SLP] Add coverage for missing dependencies for stacksave intrinsics
The existing scheduling doesn't account for the scheduling restrictions implied by inalloca allocas combined with stacksave/stackrestore.  This adds coverage including one currently miscompiling case.
2022-03-19 18:05:09 -07:00
Philip Reames 6253b77da9 [SLP] Respect control dependence within a block during scheduling
This fixes an active miscompile visible in the test changes.  The basic problem is that the scheduling dependency graph didn't have any edges for control dependence within a single basic block.  The result is that we could (and in some rare cases *did*) perform reorderings within a block which could introduce new undefined behavior along paths which didn't previously contain any.

Impact wise, we have two major cases where control is not guaranteed to reach a later instruction in the block: may throw calls, and calls containing infinite loops.
* The former case was mostly covered by the memory dependencies, and to trigger require a function which can throw, but not write to memory.  In theory, such a case is possible, but not likely in practice.
* The later case is likely more of an issue in practice.  After this code was first written, we changed the IR semantics to allow well defined infinite loops without satisifying mustprogress.  Even for C/C++ - which do imply mustprogress - recent changes to how we treat atomics (e.g. an atomic read does not always imply a write) could expose this issue.  I'm a bit shocked we don't seem to have a bug report which hit this in real code actually.

Compile time wise, this results in a single extra scan of the scheduling window in the common case.  Since we stop scanning at the next instruction which isn't guaranteed to execute, no matter what order we traverse instructions in, we scan the block once.  The exception to this is that when we extend the scheduling window downwards, we invalidate all dependencies, and thus rescan.  So the potentially expensive case is when we a call in a big schedule window which is frequently extended.  We could optimize this case (by caching the last instruction not guaranteeed to transfer execution and scanning only the extended window) and starting there), but I decided to leave the complexity until it mattered.  That same case is already degenerate with memory dependences which is more expensive than the control dependence scan.

We could also consider combining the memory dependence and control dependence sets to reduce memory usage, but since it complicates the code slightly and makes debugging a bit harder, I went with the simplest scheme for now.

This was noticed while trying to understand the failures reported against D118538, but is not otherwise related to that change.
2022-03-19 13:36:24 -07:00
Philip Reames bdbcca617a [SLP,tests] Add coverage showing need for control dependencies during scheduling 2022-03-19 09:45:33 -07:00
Philip Reames 3abf8ebd9a [slp][tests] Add missing function attributes
SLP is currently assuming that control dependence in these cases is irrelevant.  This is only valid if none of the lib-funcs involved can throw or infinite loop in the scalar forms.  This appears to be true (or at least we infer the respective attributes) for the libfuncs I spot checked.  This change is mostly for shrunking the diff on an upcoming patch.
2022-03-18 15:51:42 -07:00
Simon Pilgrim 5dde9c1286 [CostModel][X86] Reduce cost of extracting bool vector elements
For constant indices, these are now just a MOVMSK+TEST/BT
2022-03-18 19:02:47 +00:00
Simon Pilgrim b58413da9b [SLP][X86] Add baseline SSE2 test run to lookahead.ll 2022-03-18 14:27:04 +00:00
Vasileios Porpodas 9136145eb0 Revert "[SLP] Fix lookahead operand reordering for splat loads." due to build failures
This reverts commit 5efa78985b.
2022-03-17 18:22:04 -07:00
Vasileios Porpodas 511fa0800f [SLP][NFC] Added a test for a followup patch that enables handling splat loads with uses. 2022-03-17 18:05:54 -07:00
Vasileios Porpodas 5efa78985b [SLP] Fix lookahead operand reordering for splat loads.
Splat loads are inexpensive in X86. For a 2-lane vector we need just one
instruction: `movddup (%reg), xmm0`. Using the standard Splat score leads
to worse code. This patch adds a new score dedicated for splat loads.

Please note that a splat is usually three IR instructions:
- It is usually a load and 2 inserts:
 %ld = load double, double* %gep
 %ins1 = insertelement <2 x double> poison, double %ld, i32 0
 %ins2 = insertelement <2 x double> %ins1, double %ld, i32 1

- But it can also be a load, an insert and a shuffle:
 %ld = load double, double* %gep
 %ins = insertelement <2 x double> poison, double %ld, i32 0
 %shf = shufflevector <2 x double> %ins, <2 x double> poison, <2 x i32> zeroinitializer

Because of this some of the lit tests contain more IR instructions.

Differential Revision: https://reviews.llvm.org/D121354
2022-03-17 18:05:54 -07:00
Vasileios Porpodas b051c836c0 [SLP][NFC] This adds a test for a follow-up patch that fixes a look-ahead operand reordering issue
Differential Revision: https://reviews.llvm.org/D121353
2022-03-17 18:05:53 -07:00
Alexey Bataev d65cc85977 [SLP]Do not schedule instructions with constants/argument/phi operands and external users.
No need to schedule entry nodes where all instructions are not memory
read/write instructions and their operands are either constants, or
arguments, or phis, or instructions from others blocks, or their users
are phis or from the other blocks.
The resulting vector instructions can be placed at
the beginning of the basic block without scheduling (if operands does
not need to be scheduled) or at the end of the block (if users are
outside of the block).
It may save some compile time and scheduling resources.

Differential Revision: https://reviews.llvm.org/D121121
2022-03-17 11:03:45 -07:00
Alexey Bataev 150ea76543 Revert "[SLP]Do not schedule instructions with constants/argument/phi operands and external users."
This reverts commit 1eeb2bfe72 to fix
a bug reported in https://reviews.llvm.org/D121121
2022-03-16 13:54:59 -07:00
Alexey Bataev 1eeb2bfe72 [SLP]Do not schedule instructions with constants/argument/phi operands and external users.
No need to schedule entry nodes where all instructions are not memory
read/write instructions and their operands are either constants, or
arguments, or phis, or instructions from others blocks, or their users
are phis or from the other blocks.
The resulting vector instructions can be placed at
the beginning of the basic block without scheduling (if operands does
not need to be scheduled) or at the end of the block (if users are
outside of the block).
It may save some compile time and scheduling resources.

Differential Revision: https://reviews.llvm.org/D121121
2022-03-16 06:05:43 -07:00
Nikita Popov 8361c5da30 [SLPVectorizer] Handle external load/store pointer uses with opaque pointers
In this case we may not generate a bitcast, so the new load/store
becomes the external user.
2022-03-14 16:55:09 +01:00
David Green 8bef17ed59 [AArch64][SLP] Add a test with mutual reductions. NFC 2022-03-09 21:46:57 +00:00
Philip Reames deae979a2c Revert "Reapply "[SLP] Schedule only sub-graph of vectorizable instructions"""
This reverts commit 738042711b. A second, apparently separate, issue has been reported on the original review.
2022-03-03 11:35:34 -08:00
David Green 65c0e45a37 [AArch64] Vector shifts cost 1
The costs of vector shifts was 2 as opposed to 1, as the nodes are
marked custom. Fix this like the others and mark the nodes as cheap.

Differential Revision: https://reviews.llvm.org/D120773
2022-03-03 10:42:57 +00:00
Philip Reames 738042711b Reapply "[SLP] Schedule only sub-graph of vectorizable instructions""
Root issue which triggered the revert was fixed in 689bab.  No changes in the reapplied patch.

Original commit message follows:

SLP currently schedules all instructions within a scheduling window which stretches from the first instr
uction potentially vectorized to the last. This window can include a very large number of unrelated instruct
ions which are not being considered for vectorization. This change switches the code to only schedule the su
b-graph consisting of the instructions being vectorized and their transitive users.

This has the effect of greatly reducing the amount of work performed in large basic blocks, and thus greatly improves compile time on degenerate examples. To understand the effects, I added some statistics (not planned for upstream contribution). Here's an illustration from my motivating example:

    Before this patch:

    704357 SLP                          - Number of calcDeps actions
     699021 SLP                          - Number of schedule calls
       5598 SLP                          - Number of ReSchedule actions
         59 SLP                          - Number of ReScheduleOnFail actions
      10084 SLP                          - Number of schedule resets
       8523 SLP                          - Number of vector instructions generated

    After this patch:

    102895 SLP                          - Number of calcDeps actions
     161916 SLP                          - Number of schedule calls
       5637 SLP                          - Number of ReSchedule actions
         55 SLP                          - Number of ReScheduleOnFail actions
      10083 SLP                          - Number of schedule resets
       8403 SLP                          - Number of vector instructions generated

I do want to highlight that there is a small difference in number of generated vector instructions. This example is hitting the bailout due to maximum window size, and the change in scheduling is slightly perturbing when and how we hit it. This can be seen in the RescheduleOnFail counter change. Given that, I think we can safely ignore.

The downside of this change can be seen in the large test diff. We group all vectorizable instructions together at the bottom of the scheduling region. This means that vector instructions can move quite far from their original point in code. While maybe undesirable, I don't see this as being a major problem as this pass is not intended to be a general scheduling pass.

For context, it's worth noting that the pre-scheduling that SLP does while building the vector tree is exactly the sub-graph scheduling implemented by this patch.

Differential Revision: https://reviews.llvm.org/D118538
2022-03-02 10:47:20 -08:00
Philip Reames 689babdf68 [SLP] Don't try to vectorize allocas
While a collection of allocas are technically vectorizeable - by forming a wider alloca - this was not a transform SLP actually knows how to do.  Instead, we were forming a bundle with missing dependencies, and then relying on the scheduling code to preserve program order if multiple instructions were scheduleable at once.  I haven't been able to write a test case, but I'm 99% sure this was wrong in some edge case.

The unknown op case was flowing down the shufflevector path.  This did result in some splat handling being lost with this change, but the same lack of splat handling is visible in a whole bunch of simple examples for the gather path.  I didn't consider this interesting to fix given how narrow the splat of allocas case is.
2022-03-02 10:08:43 -08:00
Philip Reames 29028e47bd [slp] Add tests for cause of D118538 revert 2022-03-02 09:45:17 -08:00
Arthur Eubanks 9c6250ee41 Revert "[SLP] Schedule only sub-graph of vectorizable instructions"
This reverts commit 0539a26d91.

Causes a miscompile, see comments on D118538.

Required updating bottom-to-top-reorder.ll.
2022-03-01 17:31:16 -08:00
Alexey Bataev e4b9640867 [SLP]Improve bottom-to-top reordering.
Currently bottom-to-top reordering analysis counts orders of the
operands and then adds natural order counts for the operand users. It is
very conservative, this the user nodes themselves may require
reordering. Patch improves bottom-to-top analysis by checking for the
user nodes if they require/allows the reordring. If the user node must
be reordered, has reused scalars, is an alternate op vectorization node,
is a non-ordered gather node or may allow reordering because of the
reordered operands, such node is considered as the node that allows
reodring and is not counted as a node with the natural order.

Differential Revision: https://reviews.llvm.org/D120492
2022-02-28 06:48:46 -08:00
Evgeniy Brevnov 10e99eb7e4 [SLP] "Normal" instructions should not go between PHI and Lading pad
Currently, SLP can insert "shuffle" instruction beween PHI and Landing pad instruction. The problem is demonstrated by LIT test. The solution is to adjust insertion point once we are done with PHI generation.

Differential Revision: https://reviews.llvm.org/D120552
2022-02-26 11:44:26 +07:00
Vasileios Porpodas 4bbc3290a2 [SLP] Fix for the min/max intrinsic cost.
The min/max intrinsic cost is currently too low because in the cost calculation
we subtract the cost of the vector compare as we will not emit it.
For the cost of the vector compare we are currently passing BAD_ICMP_PREDICATE
which returns 3, the worst case cost.
I think we should be passing VecPred instead, since we know the predicates of
the compare instr.

I think this is related to commit b3b993a7ad which introduced the predicate
argument to getCmpSelInstrCost().
https://reviews.llvm.org/rGb3b993a7ad817c3c5801341fa78f34332900eb83

Differential Revision: https://reviews.llvm.org/D120439
2022-02-24 18:08:40 -08:00
Vasileios Porpodas 6136f97c69 [SLP][NFC] Test for a follow-up fix of the the vector min/max instrinsic cost calculation.
The code in this test should not have been vectorized.
It looks worse than the scalar code.

Differential Revision: https://reviews.llvm.org/D120438
2022-02-24 18:08:39 -08:00
Nikita Popov a266af7211 [InstCombine] Canonicalize SPF to min/max intrinsics
Now that integer min/max intrinsics have good support in both
InstCombine and other passes, start canonicalizing SPF min/max
to intrinsic min/max.

Once this sticks, we can stop matching SPF min/max in various
places, and can remove hacks we have for preventing infinite loops
and breaking of SPF canonicalization.

Differential Revision: https://reviews.llvm.org/D98152
2022-02-24 09:01:20 +01:00
Philip Reames 9392c0d4ef Revert "[SLP] Remove cap on schedule window size"
This reverts commit 6adf4b039e.  Reverting while investigating https://github.com/llvm/llvm-project/issues/54029
2022-02-23 13:12:07 -08:00
Philip Reames 6adf4b039e [SLP] Remove cap on schedule window size
This cap was first added in 848c1aa45 (back in 2015).  Per the original commit message, the purpose was to avoid a compile time explosion in long basic blocks.  The algorithmic problem in scheduling has now been fixed in 0539a26d.

In the meantime, the code has rotten fairly badly.  Some intermediate refactoring caused the size to only be incremented if *both* iterators advance in the window search.  This causes the size to be badly undercounted when near one end of a basic block.  We no longer have any test which exercises the logic in an intentional way; there's one test which differs with this change, but the changes appear fairly orthoganol to the purpose of the test file.

Unfortunately, we no longer have the original motivating example, so it's possible that it also hits some other issue.  I tested locally with a large example, but even at it's worst, that one doesn't demonstrate anything too extreme even without the algorithmic fix.  It's clearly faster with, but only by ~20% which doesn't seem in line with the original commit message.   If regressions with this patch are seen, please file a bug and I'll try to fix any other algorithmic problems which fall out.
2022-02-23 08:27:45 -08:00
Brendon Cahoon 3cc15e2cb6 [SLP] Fix assert from non-constant index in insertelement
A call to getInsertIndex() in getTreeCost() is returning None,
which causes an assert because a non-constant index value for
insertelement was not expected. This case occurs when the
insertelement index value is defined with a PHI.

Differential Revision: https://reviews.llvm.org/D120223
2022-02-22 15:57:14 -06:00
Alexey Bataev b3f4535a03 [SLP][NFC]Add a test for bottom to top reordering. 2022-02-22 13:46:34 -08:00
Philip Reames 0539a26d91 [SLP] Schedule only sub-graph of vectorizable instructions
SLP currently schedules all instructions within a scheduling window which stretches from the first instruction potentially vectorized to the last. This window can include a very large number of unrelated instructions which are not being considered for vectorization. This change switches the code to only schedule the sub-graph consisting of the instructions being vectorized and their transitive users.

This has the effect of greatly reducing the amount of work performed in large basic blocks, and thus greatly improves compile time on degenerate examples. To understand the effects, I added some statistics (not planned for upstream contribution). Here's an illustration from my motivating example:

Before this patch:

704357 SLP                          - Number of calcDeps actions
 699021 SLP                          - Number of schedule calls
   5598 SLP                          - Number of ReSchedule actions
     59 SLP                          - Number of ReScheduleOnFail actions
  10084 SLP                          - Number of schedule resets
   8523 SLP                          - Number of vector instructions generated

After this patch:

102895 SLP                          - Number of calcDeps actions
 161916 SLP                          - Number of schedule calls
   5637 SLP                          - Number of ReSchedule actions
     55 SLP                          - Number of ReScheduleOnFail actions
  10083 SLP                          - Number of schedule resets
   8403 SLP                          - Number of vector instructions generated

I do want to highlight that there is a small difference in number of generated vector instructions. This example is hitting the bailout due to maximum window size, and the change in scheduling is slightly perturbing when and how we hit it. This can be seen in the RescheduleOnFail counter change. Given that, I think we can safely ignore.

The downside of this change can be seen in the large test diff. We group all vectorizable instructions together at the bottom of the scheduling region. This means that vector instructions can move quite far from their original point in code. While maybe undesirable, I don't see this as being a major problem as this pass is not intended to be a general scheduling pass.

For context, it's worth noting that the pre-scheduling that SLP does while building the vector tree is exactly the sub-graph scheduling implemented by this patch.

Differential Revision: https://reviews.llvm.org/D118538
2022-02-22 10:15:55 -08:00
Alexey Bataev b0a0df9809 [SLP]Fix vectorization of the alternate cmp instruction with swapped predicates.
If the alternate cmp instruction is a swapped predicate of the main cmp
instruction, need to generate alternate instruction, not the one with
the swapped predicate. Also, the lane with the alternate opcode should
be selected only, if the corresponding operands are not compatible.

Correctness confirmed:
https://alive2.llvm.org/ce/z/94BG66

Differential Revision: https://reviews.llvm.org/D119855
2022-02-18 04:27:45 -08:00
Alexey Bataev 27f72eb25e [SLP][NFC]Add another test for swapped main/alternate cmp, NFC. 2022-02-17 09:55:16 -08:00
Shao-Ce SUN 21ac474392 [NFC] Correct typo `interger` to `integer` 2022-02-17 21:17:47 +08:00
Arthur Eubanks 826fae51d2 [SLPVectorizer][OpaquePtrs] Check GEP source element type
Fixes a miscompile with opaque pointers.

Reviewed By: #opaque-pointers, nikic

Differential Revision: https://reviews.llvm.org/D119980
2022-02-16 14:47:20 -08:00
Arthur Eubanks 4e24397805 [test][SLPVectorizer][OpaquePtr] Precommit test 2022-02-16 14:47:20 -08:00
Alexey Bataev d6371a7c60 [SLP][NFC]Add a test for miscompilation of alternate cmp instructions,
NFC.
2022-02-15 08:29:17 -08:00
Anton Afanasyev b7574b092a [SLP] Don't try to vectorize pair with insertelement
Particularly this breaks vectorization of insertelements where some of
intermediate (i.e. not last) insertelements are used externally.

Fixes PR52275
Fixes #51617

Differential Revision: https://reviews.llvm.org/D119679
2022-02-15 16:12:59 +03:00