Commit Graph

13 Commits

Author SHA1 Message Date
David Green fad70c3068 [ARM] Improve WLS lowering
Recently we improved the lowering of low overhead loops and tail
predicated loops, but concentrated first on the DLS do style loops. This
extends those improvements over to the WLS while loops, improving the
chance of lowering them successfully. To do this the lowering has to
change a little as the instructions are terminators that produce a value
- something that needs to be treated carefully.

Lowering starts at the Hardware Loop pass, inserting a new
llvm.test.start.loop.iterations that produces both an i1 to control the
loop entry and an i32 similar to the llvm.start.loop.iterations
intrinsic added for do loops. This feeds into the loop phi, properly
gluing the values together:

  %wls = call { i32, i1 } @llvm.test.start.loop.iterations.i32(i32 %div)
  %wls0 = extractvalue { i32, i1 } %wls, 0
  %wls1 = extractvalue { i32, i1 } %wls, 1
  br i1 %wls1, label %loop.ph, label %loop.exit
...
loop:
  %lsr.iv = phi i32 [ %wls0, %loop.ph ], [ %iv.next, %loop ]
  ..
  %iv.next = call i32 @llvm.loop.decrement.reg.i32(i32 %lsr.iv, i32 1)
  %cmp = icmp ne i32 %iv.next, 0
  br i1 %cmp, label %loop, label %loop.exit

The llvm.test.start.loop.iterations need to be lowered through ISel
lowering as a pair of WLS and WLSSETUP nodes, which each get converted
to t2WhileLoopSetup and t2WhileLoopStart Pseudos. This helps prevent
t2WhileLoopStart from being a terminator that produces a value,
something difficult to control at that stage in the pipeline. Instead
the t2WhileLoopSetup produces the value of LR (essentially acting as a
lr = subs rn, 0), t2WhileLoopStart consumes that lr value (the Bcc).

These are then converted into a single t2WhileLoopStartLR at the same
point as t2DoLoopStartTP and t2LoopEndDec. Otherwise we revert the loop
to prevent them from progressing further in the pipeline. The
t2WhileLoopStartLR is a single instruction that takes a GPR and produces
LR, similar to the WLS instruction.

  %1:gprlr = t2WhileLoopStartLR %0:rgpr, %bb.3
  t2B %bb.1
...
bb.2.loop:
  %2:gprlr = PHI %1:gprlr, %bb.1, %3:gprlr, %bb.2
  ...
  %3:gprlr = t2LoopEndDec %2:gprlr, %bb.2
  t2B %bb.3

The t2WhileLoopStartLR can then be treated similar to the other low
overhead loop pseudos, eventually being lowered to a WLS providing the
branches are within range.

Differential Revision: https://reviews.llvm.org/D97729
2021-03-11 17:56:19 +00:00
Roman Lebedev b46c085d2b
[NFCI] SCEVExpander: emit intrinsics for integral {u,s}{min,max} SCEV expressions
These intrinsics, not the icmp+select are the canonical form nowadays,
so we might as well directly emit them.

This should not cause any regressions, but if it does,
then then they would needed to be fixed regardless.

Note that this doesn't deal with `SCEVExpander::isHighCostExpansion()`,
but that is a pessimization, not a correctness issue.

Additionally, the non-intrinsic form has issues with undef,
see https://reviews.llvm.org/D88287#2587863
2021-03-06 21:52:46 +03:00
David Green 48230355e9 [ARM] Remove DLS lr, lr
A DLS lr, lr instruction only moves lr to itself. It need not be emitted
on it's own to save a instruction in the loop preheader.

Differential Revision: https://reviews.llvm.org/D78916
2021-02-02 11:09:31 +00:00
David Green b2ac9681a7 [ARM] Alter t2DoLoopStart to define lr
This changes the definition of t2DoLoopStart from
t2DoLoopStart rGPR
to
GPRlr = t2DoLoopStart rGPR

This will hopefully mean that low overhead loops are more tied together,
and we can more reliably generate loops without reverting or being at
the whims of the register allocator.

This is a fairly simple change in itself, but leads to a number of other
required alterations.

 - The hardware loop pass, if UsePhi is set, now generates loops of the
   form:
       %start = llvm.start.loop.iterations(%N)
     loop:
       %p = phi [%start], [%dec]
       %dec = llvm.loop.decrement.reg(%p, 1)
       %c = icmp ne %dec, 0
       br %c, loop, exit
 - For this a new llvm.start.loop.iterations intrinsic was added, identical
   to llvm.set.loop.iterations but produces a value as seen above, gluing
   the loop together more through def-use chains.
 - This new instrinsic conceptually produces the same output as input,
   which is taught to SCEV so that the checks in MVETailPredication are not
   affected.
 - Some minor changes are needed to the ARMLowOverheadLoop pass, but it has
   been left mostly as before. We should now more reliably be able to tell
   that the t2DoLoopStart is correct without having to prove it, but
   t2WhileLoopStart and tail-predicated loops will remain the same.
 - And all the tests have been updated. There are a lot of them!

This patch on it's own might cause more trouble that it helps, with more
tail-predicated loops being reverted, but some additional patches can
hopefully improve upon that to get to something that is better overall.

Differential Revision: https://reviews.llvm.org/D89881
2020-11-10 15:57:58 +00:00
Sjoerd Meijer b0614509a0 [HardwareLoops] llvm.loop.decrement.reg definition
This is split off from D80316, slightly tightening the definition of overloaded
hardwareloop intrinsic llvm.loop.decrement.reg specifying that both operands
its result have the same type.
2020-05-21 10:48:16 +01:00
David Green eecba95067 [ARM] Replace arm vendor with none. NFC 2020-04-22 18:19:35 +01:00
Sam Parker e3a4a13fcc [ARM][LowOverheadLoops] Enable by default
The code is now in a good enough state to pass the bunch of tests that
I have run (after fixing the bugs), so let's enable it by default.

Differential Revision: https://reviews.llvm.org/D65277

llvm-svn: 367297
2019-07-30 08:14:28 +00:00
Sam Parker 98722691b0 [ARM] WLS/LE Code Generation
Backend changes to enable WLS/LE low-overhead loops for armv8.1-m:
1) Use TTI to communicate to the HardwareLoop pass that we should try
   to generate intrinsics that guard the loop entry, as well as setting
   the loop trip count.
2) Lower the BRCOND that uses said intrinsic to an Arm specific node:
   ARMWLS.
3) ISelDAGToDAG the node to a new pseudo instruction:
   t2WhileLoopStart.
4) Add support in ArmLowOverheadLoops to handle the new pseudo
   instruction.

Differential Revision: https://reviews.llvm.org/D63816

llvm-svn: 364733
2019-07-01 08:21:28 +00:00
Sam Parker a6fd919cb3 [ARM] DLS/LE low-overhead loop code generation
Introduce three pseudo instructions to be used during DAG ISel to
represent v8.1-m low-overhead loops. One maps to set_loop_iterations
while loop_decrement_reg is lowered to two, so that we can separate
the decrement and branching operations. The pseudo instructions are
expanded pre-emission, where we can still decide whether we actually
want to generate a low-overhead loop, in a new pass:
ARMLowOverheadLoops. The pass currently bails, reverting to an sub,
icmp and br, in the cases where a call or stack spill/restore happens
between the decrement and branching instructions, or if the loop is
too large.

Differential Revision: https://reviews.llvm.org/D63476

llvm-svn: 364288
2019-06-25 10:45:51 +00:00
Sam Parker 0cf9639a9c [SCEV] Pass NoWrapFlags when expanding an AddExpr
InsertBinop now accepts NoWrapFlags, so pass them through when
expanding a simple add expression.

This is the first re-commit of the functional changes from rL362687,
which was previously reverted.

Differential Revision: https://reviews.llvm.org/D61934

llvm-svn: 363364
2019-06-14 09:19:41 +00:00
Sam Parker 3d42959dd8 Revert rL363156.
The patch was to fix buildbots, but rL363157 should now be fixing it
in a cleaner way.

llvm-svn: 363174
2019-06-12 15:28:00 +00:00
Sam Parker ece316b56a Attempt to fix non-Arm buildbots
Adding REQUIRES: arm to failing tests

llvm-svn: 363156
2019-06-12 12:47:35 +00:00
Sam Parker 757ac02dc8 [ARM] Implement TTI::isHardwareLoopProfitable
Implement the backend target hook to drive the HardwareLoops pass.
The low-overhead branch extension for Arm M-class cores is flexible
enough that we don't have to ensure correctness at this point, except
checking that the loop counter variable can be stored in LR - a
32-bit register. For it to be profitable, we want to avoid loops that
contain function calls, or any other instruction that alters the PC.
    
This implementation uses TargetLoweringInfo, to query type and
operation actions, looks at intrinsic calls and also performs some
manual checks for remainder/division and FP operations.
    
I think this should be a good base to start and extra details can be
filled out later.

Differential Revision: https://reviews.llvm.org/D62907

llvm-svn: 363149
2019-06-12 12:00:42 +00:00