Commit Graph

9 Commits

Author SHA1 Message Date
David Green fad70c3068 [ARM] Improve WLS lowering
Recently we improved the lowering of low overhead loops and tail
predicated loops, but concentrated first on the DLS do style loops. This
extends those improvements over to the WLS while loops, improving the
chance of lowering them successfully. To do this the lowering has to
change a little as the instructions are terminators that produce a value
- something that needs to be treated carefully.

Lowering starts at the Hardware Loop pass, inserting a new
llvm.test.start.loop.iterations that produces both an i1 to control the
loop entry and an i32 similar to the llvm.start.loop.iterations
intrinsic added for do loops. This feeds into the loop phi, properly
gluing the values together:

  %wls = call { i32, i1 } @llvm.test.start.loop.iterations.i32(i32 %div)
  %wls0 = extractvalue { i32, i1 } %wls, 0
  %wls1 = extractvalue { i32, i1 } %wls, 1
  br i1 %wls1, label %loop.ph, label %loop.exit
...
loop:
  %lsr.iv = phi i32 [ %wls0, %loop.ph ], [ %iv.next, %loop ]
  ..
  %iv.next = call i32 @llvm.loop.decrement.reg.i32(i32 %lsr.iv, i32 1)
  %cmp = icmp ne i32 %iv.next, 0
  br i1 %cmp, label %loop, label %loop.exit

The llvm.test.start.loop.iterations need to be lowered through ISel
lowering as a pair of WLS and WLSSETUP nodes, which each get converted
to t2WhileLoopSetup and t2WhileLoopStart Pseudos. This helps prevent
t2WhileLoopStart from being a terminator that produces a value,
something difficult to control at that stage in the pipeline. Instead
the t2WhileLoopSetup produces the value of LR (essentially acting as a
lr = subs rn, 0), t2WhileLoopStart consumes that lr value (the Bcc).

These are then converted into a single t2WhileLoopStartLR at the same
point as t2DoLoopStartTP and t2LoopEndDec. Otherwise we revert the loop
to prevent them from progressing further in the pipeline. The
t2WhileLoopStartLR is a single instruction that takes a GPR and produces
LR, similar to the WLS instruction.

  %1:gprlr = t2WhileLoopStartLR %0:rgpr, %bb.3
  t2B %bb.1
...
bb.2.loop:
  %2:gprlr = PHI %1:gprlr, %bb.1, %3:gprlr, %bb.2
  ...
  %3:gprlr = t2LoopEndDec %2:gprlr, %bb.2
  t2B %bb.3

The t2WhileLoopStartLR can then be treated similar to the other low
overhead loop pseudos, eventually being lowered to a WLS providing the
branches are within range.

Differential Revision: https://reviews.llvm.org/D97729
2021-03-11 17:56:19 +00:00
David Green 54e2876132 [ARM] Update and add extra WLS testing. NFC 2021-03-01 21:46:09 +00:00
Mircea Trofin f9a27df16b [FileCheck] Enforce --allow-unused-prefixes=false for llvm/test/Transforms
Explicitly opt-out llvm/test/Transforms/Attributor.

Verified by flipping the default value of allow-unused-prefixes and
observing that none of the failures were under llvm/test/Transforms.

Differential Revision: https://reviews.llvm.org/D92404
2020-12-09 08:51:38 -08:00
David Green b2ac9681a7 [ARM] Alter t2DoLoopStart to define lr
This changes the definition of t2DoLoopStart from
t2DoLoopStart rGPR
to
GPRlr = t2DoLoopStart rGPR

This will hopefully mean that low overhead loops are more tied together,
and we can more reliably generate loops without reverting or being at
the whims of the register allocator.

This is a fairly simple change in itself, but leads to a number of other
required alterations.

 - The hardware loop pass, if UsePhi is set, now generates loops of the
   form:
       %start = llvm.start.loop.iterations(%N)
     loop:
       %p = phi [%start], [%dec]
       %dec = llvm.loop.decrement.reg(%p, 1)
       %c = icmp ne %dec, 0
       br %c, loop, exit
 - For this a new llvm.start.loop.iterations intrinsic was added, identical
   to llvm.set.loop.iterations but produces a value as seen above, gluing
   the loop together more through def-use chains.
 - This new instrinsic conceptually produces the same output as input,
   which is taught to SCEV so that the checks in MVETailPredication are not
   affected.
 - Some minor changes are needed to the ARMLowOverheadLoop pass, but it has
   been left mostly as before. We should now more reliably be able to tell
   that the t2DoLoopStart is correct without having to prove it, but
   t2WhileLoopStart and tail-predicated loops will remain the same.
 - And all the tests have been updated. There are a lot of them!

This patch on it's own might cause more trouble that it helps, with more
tail-predicated loops being reverted, but some additional patches can
hopefully improve upon that to get to something that is better overall.

Differential Revision: https://reviews.llvm.org/D89881
2020-11-10 15:57:58 +00:00
David Green 44c1a56869 [ARM] Add extra MVE tests for various patches. NFC 2020-11-01 16:24:23 +00:00
Florian Hahn ee1c12708a [SCEV] If Start>=RHS, simplify (Start smin RHS) = RHS for trip counts.
In some cases, it seems like we can get rid of unnecessary s/umins by
using information from the loop guards (unless I am missing something).

One place where this seems to be helpful in practice is when computing
loop trip counts. This patch just changes howManyGreaterThans for now.
Note that this requires a loop for which we can check 'is guarded'.

On SPEC2000/SPEC2006/MultiSource, there are some notable changes for
some programs in the number of loops unrolled and trip counts computed.

```
Same hash: 179 (filtered out)
Remaining: 58
Metric: scalar-evolution.NumTripCountsComputed

Program                                        base    patch   diff
 test-suite...langs-C/compiler/compiler.test    25.00   31.00  24.0%
 test-suite.../Applications/SPASS/SPASS.test   2020.00 2323.00 15.0%
 test-suite...langs-C/allroots/allroots.test    29.00   32.00  10.3%
 test-suite.../Prolangs-C/loader/loader.test    17.00   18.00   5.9%
 test-suite...fice-ispell/office-ispell.test   253.00  265.00   4.7%
 test-suite...006/450.soplex/450.soplex.test   3552.00 3692.00  3.9%
 test-suite...chmarks/MallocBench/gs/gs.test   453.00  470.00   3.8%
 test-suite...ngs-C/assembler/assembler.test    29.00   30.00   3.4%
 test-suite.../Benchmarks/Ptrdist/bc/bc.test   263.00  270.00   2.7%
 test-suite...rks/FreeBench/pifft/pifft.test   722.00  741.00   2.6%
 test-suite...count/automotive-bitcount.test    41.00   42.00   2.4%
 test-suite...0/253.perlbmk/253.perlbmk.test   1417.00 1451.00  2.4%
 test-suite...000/197.parser/197.parser.test   387.00  396.00   2.3%
 test-suite...lications/sqlite3/sqlite3.test   1168.00 1189.00  1.8%
 test-suite...000/255.vortex/255.vortex.test   173.00  176.00   1.7%

Metric: loop-unroll.NumUnrolled

Program                                        base   patch  diff
 test-suite...langs-C/compiler/compiler.test     1.00   3.00 200.0%
 test-suite.../Applications/SPASS/SPASS.test   134.00 234.00 74.6%
 test-suite...count/automotive-bitcount.test     3.00   4.00 33.3%
 test-suite.../Prolangs-C/loader/loader.test     3.00   4.00 33.3%
 test-suite...langs-C/allroots/allroots.test     3.00   4.00 33.3%
 test-suite...Source/Benchmarks/sim/sim.test    10.00  12.00 20.0%
 test-suite...fice-ispell/office-ispell.test    21.00  25.00 19.0%
 test-suite.../Benchmarks/Ptrdist/bc/bc.test    32.00  38.00 18.8%
 test-suite...006/450.soplex/450.soplex.test   300.00 352.00 17.3%
 test-suite...rks/FreeBench/pifft/pifft.test    60.00  69.00 15.0%
 test-suite...chmarks/MallocBench/gs/gs.test    57.00  63.00 10.5%
 test-suite...ngs-C/assembler/assembler.test    10.00  11.00 10.0%
 test-suite...0/253.perlbmk/253.perlbmk.test   145.00 157.00  8.3%
 test-suite...000/197.parser/197.parser.test    43.00  46.00  7.0%
 test-suite...TimberWolfMC/timberwolfmc.test   205.00 214.00  4.4%
 Geomean difference                                           7.6%
```

Fixes https://bugs.llvm.org/show_bug.cgi?id=46939
Fixes https://bugs.llvm.org/show_bug.cgi?id=46924 on X86.

Reviewed By: mkazantsev

Differential Revision: https://reviews.llvm.org/D85046
2020-08-03 17:22:42 +01:00
Sjoerd Meijer b0614509a0 [HardwareLoops] llvm.loop.decrement.reg definition
This is split off from D80316, slightly tightening the definition of overloaded
hardwareloop intrinsic llvm.loop.decrement.reg specifying that both operands
its result have the same type.
2020-05-21 10:48:16 +01:00
Sam Parker 9a92be1b35 [HardwareLoops] Loop counter guard intrinsic
Introduce llvm.test.set.loop.iterations which sets the loop counter
and also produces an i1 after testing that the count is not zero.

Differential Revision: https://reviews.llvm.org/D63809

llvm-svn: 364628
2019-06-28 07:38:16 +00:00
Sam Parker c5ef502ee8 [CodeGen] Generic Hardware Loop Support
Patch which introduces a target-independent framework for generating
hardware loops at the IR level. Most of the code has been taken from
PowerPC CTRLoops and PowerPC has been ported over to use this generic
pass. The target dependent parts have been moved into
TargetTransformInfo, via isHardwareLoopProfitable, with
HardwareLoopInfo introduced to transfer information from the backend.
    
Three generic intrinsics have been introduced:
- void @llvm.set_loop_iterations
  Takes as a single operand, the number of iterations to be executed.
- i1 @llvm.loop_decrement(anyint)
  Takes the maximum number of elements processed in an iteration of
  the loop body and subtracts this from the total count. Returns
  false when the loop should exit.
- anyint @llvm.loop_decrement_reg(anyint, anyint)
  Takes the number of elements remaining to be processed as well as
  the maximum numbe of elements processed in an iteration of the loop
  body. Returns the updated number of elements remaining.

llvm-svn: 362774
2019-06-07 07:35:30 +00:00