Currently when creating tail predicated loops, we need to validate that
all the live-outs of a loop will be equivalent with and without tail
predication, and if they are not we cannot legally create a
tail-predicated loop, leaving expensive vctp and vpst instructions in
the loop. These notably can include register-allocation instructions
like stack loads and stores, and copys lowered from COPYs to MVE_VORRs.
Instead of trying to prove this is valid late in the pipeline, this
patch introduces a MQPRCopy pseudo instruction that COPY is lowered to.
This can then either be converted to a MVE_VORR where possible, or to a
couple of VMOVD instructions if not. This way they do not behave
differently within and outside of tail-predications regions, and we can
know by construction that they are always valid. The idea is that we can
do the same with stack load and stores, converting them to VLDR/VSTR or
VLDM/VSTM where required to prove tail predication is always valid.
This does unfortunately mean inserting multiple VMOVD instructions,
instead of a single MVE_VORR, but my experiments show it to be an
improvement in general.
Differential Revision: https://reviews.llvm.org/D111048
This allows VMOVL in tail predicated loops so long as the the vector
size the VMOVL is extending into is less than or equal to the size of
the VCTP in the tail predicated loop. These cases represent a
sign-extend-inreg (or zero-extend-inreg), which needn't block tail
predication as in https://godbolt.org/z/hdTsEbx8Y.
For this a vecsize has been added to the TSFlag bits of MVE
instructions, which stores the size of the elements that the MVE
instruction operates on. In the case of multiple size (such as a
MVE_VMOVLs8bh that extends from i8 to i16, the largest size was be
chosen). The sizes are encoded as 00 = i8, 01 = i16, 10 = i32 and 11 =
i64, which often (but not always) comes from the instruction encoding
directly. A unit test was added, and although only a subset of the
vecsizes are currently used, the rest should be useful for other cases.
Differential Revision: https://reviews.llvm.org/D109706
The semantics of tail predication loops means that the value of LR as an
instruction is executed determines the predicate. In other words:
mov r3, #3
DLSTP lr, r3 // Start tail predication, lr==3
VADD.s32 q0, q1, q2 // Lanes 0,1 and 2 are updated in q0.
mov lr, #1
VADD.s32 q0, q1, q2 // Only first lane is updated.
This means that the value of lr cannot be spilled and re-used in tail
predication regions without potentially altering the behaviour of the
program. More lanes than required could be stored, for example, and in
the case of a gather those lanes might not have been setup, leading to
alignment exceptions.
This patch adds a new lr predicate operand to MVE instructions in order
to keep a reference to the lr that they use as a tail predicate. It will
usually hold the zeroreg meaning not predicated, being set to the LR phi
value in the MVETPAndVPTOptimisationsPass. This will prevent it from
being spilled anywhere that it needs to be used.
A lot of tests needed updating.
Differential Revision: https://reviews.llvm.org/D107638
We are running into more and more cases where the liveouts of low
overhead loops do not validate. Add some extra debug messages to make it
clearer why.
I just hit a nasty bug when writing a unit test after calling MF->getFrameInfo()
without declaring the variable as a reference.
Deleting the copy-constructor also showed a place in the ARM backend which was
doing the same thing, albeit it didn't impact correctness there from the looks of it.
This patch makes vector spills valid for tail predication when all loads
from the same stack slot are within the loop
Differential Revision: https://reviews.llvm.org/D105443
This adds t2WhileLoopStartTP, similar to the t2DoLoopStartTP added in
D90591. It keeps a reference to both the tripcount register and the
element count register, so that the ARMLowOverheadLoops pass in the
backend can pick the correct one without having to search for it from
the operand of a VCTP.
Differential Revision: https://reviews.llvm.org/D103236
The findLoopPreheader function will currently not find a preheader if it
branches to multiple different loop headers. This patch adds an option
to relax that, allowing ARMLowOverheadLoops to process more loops
successfully. This helps with WhileLoopStart setup instructions that can
branch/fallthrough to the low overhead loop and to branch to a separate
loop from the same preheader (but I don't believe it is possible for
both loops to be low overhead loops).
Differential Revision: https://reviews.llvm.org/D102747
The Loop start instruction handled by the ARMLowOverheadLoops are:
$lr = t2DoLoopStart $r0
$lr = t2DoLoopStartTP $r1, $r0
$lr = t2WhileLoopStartLR $r0, %bb, implicit-def dead $cpsr
All three of these will have LR as the 0 argument, the trip count as the
1 argument.
This patch updated a few places in ARMLowOverheadLoops where the 0th arg
was being used for t2WhileLoopStartLR instructions as the trip count.
One place was entirely removed as it does not seem valid any more, the
case the code is trying to protect against should not be able to occur
with our correct-by-construction low overhead loops.
Differential Revision: https://reviews.llvm.org/D102620
In function ConvertVPTBlocks(), it is assumed that every instruction
within a vector-predicated block is predicated. This is false for debug
instructions, used by LLVM.
Because of this, an assertion failure is reached when an input contains
debug instructions inside VPT blocks. In non-assert builds, an out of
bounds memory access took place.
The present patch properly covers the case of debug instructions.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D99075
Recently we improved the lowering of low overhead loops and tail
predicated loops, but concentrated first on the DLS do style loops. This
extends those improvements over to the WLS while loops, improving the
chance of lowering them successfully. To do this the lowering has to
change a little as the instructions are terminators that produce a value
- something that needs to be treated carefully.
Lowering starts at the Hardware Loop pass, inserting a new
llvm.test.start.loop.iterations that produces both an i1 to control the
loop entry and an i32 similar to the llvm.start.loop.iterations
intrinsic added for do loops. This feeds into the loop phi, properly
gluing the values together:
%wls = call { i32, i1 } @llvm.test.start.loop.iterations.i32(i32 %div)
%wls0 = extractvalue { i32, i1 } %wls, 0
%wls1 = extractvalue { i32, i1 } %wls, 1
br i1 %wls1, label %loop.ph, label %loop.exit
...
loop:
%lsr.iv = phi i32 [ %wls0, %loop.ph ], [ %iv.next, %loop ]
..
%iv.next = call i32 @llvm.loop.decrement.reg.i32(i32 %lsr.iv, i32 1)
%cmp = icmp ne i32 %iv.next, 0
br i1 %cmp, label %loop, label %loop.exit
The llvm.test.start.loop.iterations need to be lowered through ISel
lowering as a pair of WLS and WLSSETUP nodes, which each get converted
to t2WhileLoopSetup and t2WhileLoopStart Pseudos. This helps prevent
t2WhileLoopStart from being a terminator that produces a value,
something difficult to control at that stage in the pipeline. Instead
the t2WhileLoopSetup produces the value of LR (essentially acting as a
lr = subs rn, 0), t2WhileLoopStart consumes that lr value (the Bcc).
These are then converted into a single t2WhileLoopStartLR at the same
point as t2DoLoopStartTP and t2LoopEndDec. Otherwise we revert the loop
to prevent them from progressing further in the pipeline. The
t2WhileLoopStartLR is a single instruction that takes a GPR and produces
LR, similar to the WLS instruction.
%1:gprlr = t2WhileLoopStartLR %0:rgpr, %bb.3
t2B %bb.1
...
bb.2.loop:
%2:gprlr = PHI %1:gprlr, %bb.1, %3:gprlr, %bb.2
...
%3:gprlr = t2LoopEndDec %2:gprlr, %bb.2
t2B %bb.3
The t2WhileLoopStartLR can then be treated similar to the other low
overhead loop pseudos, eventually being lowered to a WLS providing the
branches are within range.
Differential Revision: https://reviews.llvm.org/D97729
With t2DoLoopDec we can be left with some extra MOV's in the preheaders
of tail predicated loops. This removes them, in the same way we remove
other dead variables.
Differential Revision: https://reviews.llvm.org/D91857
A DLS lr, lr instruction only moves lr to itself. It need not be emitted
on it's own to save a instruction in the loop preheader.
Differential Revision: https://reviews.llvm.org/D78916
A vpt block that just contains either VPST;VCTP or VPT;VCTP, once the
VCTP is removed will become invalid. This fixed the first by removing
the now empty block and bails out for the second, as we have no simple
way of converting a VPT to a VCMP.
Differential Revision: https://reviews.llvm.org/D92369
We currently have problems with the way that low overhead loops are
specified, with LR being spilled between the t2LoopDec and the t2LoopEnd
forcing the entire loop to be reverted late in the backend. As they will
eventually become a single instruction, this patch introduces a
t2LoopEndDec which is the combination of the two, combined before
registry allocation to make sure this does not fail.
Unfortunately this instruction is a terminator that produces a value
(and also branches - it only produces the value around the branching
edge). So this needs some adjustment to phi elimination and the register
allocator to make sure that we do not spill this LR def around the loop
(needing to put a spill after the terminator). We treat the loop very
carefully, making sure that there is nothing else like calls that would
break it's ability to use LR. For that, this adds a
isUnspillableTerminator to opt in the new behaviour.
There is a chance that this could cause problems, and so I have added an
escape option incase. But I have not seen any problems in the testing
that I've tried, and not reverting Low overhead loops is important for
our performance. If this does work then we can hopefully do the same for
t2WhileLoopStart and t2DoLoopStart instructions.
This patch also contains the code needed to convert or revert the
t2LoopEndDec in the backend (which just needs a subs; bne) and the code
pre-ra to create them.
Differential Revision: https://reviews.llvm.org/D91358
This adds code to revert low overhead loops with calls in them before
register allocation. Ideally we would not create low overhead loops with
calls in them to begin with, but that can be difficult to always get
correct. If we want to try and glue together t2LoopDec and t2LoopEnd
into a single instruction, we need to ensure that no instructions use LR
in the loop. (Technically the final code can be better too, as it
doesn't need to use the same registers but that has not been optimized
for here, as reverting loops with calls is expected to be very rare).
It also adds a MVETailPredUtils.h header to share the revert code
between different passes, and provides a place to expand upon, with
RevertLoopWithCall becoming a place to perform other low overhead loop
alterations like removing copies or combining LoopDec and End into a
single instruction.
Differential Revision: https://reviews.llvm.org/D91273
This converts the intermediate VPR use assertion to a condition in the if-statement to protect against assertion failures in case behaviuour is changed.
This is a follow-up to https://reviews.llvm.org/D90935 and implements the post-approval comments.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D91790
This introduces a new pseudo instruction, almost identical to a
t2DoLoopStart but taking 2 parameters - the original loop iteration
count needed for a low overhead loop, plus the VCTP element count needed
for a DLSTP instruction setting up a tail predicated loop. The idea is
that the instruction holds both values and the backend
ARMLowOverheadLoops pass can pick between the two, depending on whether
it creates a tail predicated loop or falls back to a low overhead loop.
To do that there needs to be something that converts a t2DoLoopStart to
a t2DoLoopStartTP, for which this patch repurposes the
MVEVPTOptimisationsPass as a "tail predication and vpt optimisation"
pass. The extra operand for the t2DoLoopStartTP is chosen based on the
operands of VCTP's in the loop, and the instruction is moved as late in
the block as possible to attempt to increase the likelihood of making
tail predicated loops.
Differential Revision: https://reviews.llvm.org/D90591
This changes the definition of t2DoLoopStart from
t2DoLoopStart rGPR
to
GPRlr = t2DoLoopStart rGPR
This will hopefully mean that low overhead loops are more tied together,
and we can more reliably generate loops without reverting or being at
the whims of the register allocator.
This is a fairly simple change in itself, but leads to a number of other
required alterations.
- The hardware loop pass, if UsePhi is set, now generates loops of the
form:
%start = llvm.start.loop.iterations(%N)
loop:
%p = phi [%start], [%dec]
%dec = llvm.loop.decrement.reg(%p, 1)
%c = icmp ne %dec, 0
br %c, loop, exit
- For this a new llvm.start.loop.iterations intrinsic was added, identical
to llvm.set.loop.iterations but produces a value as seen above, gluing
the loop together more through def-use chains.
- This new instrinsic conceptually produces the same output as input,
which is taught to SCEV so that the checks in MVETailPredication are not
affected.
- Some minor changes are needed to the ARMLowOverheadLoop pass, but it has
been left mostly as before. We should now more reliably be able to tell
that the t2DoLoopStart is correct without having to prove it, but
t2WhileLoopStart and tail-predicated loops will remain the same.
- And all the tests have been updated. There are a lot of them!
This patch on it's own might cause more trouble that it helps, with more
tail-predicated loops being reverted, but some additional patches can
hopefully improve upon that to get to something that is better overall.
Differential Revision: https://reviews.llvm.org/D89881
There were cases where a VCMP and a VPST were merged even if the VCMP
didn't have the same defs of its operands as the VPST. This is fixed by
adding RDA checks for the defs. This however gave rise to cases where
the new VPST created would precede the un-merged VCMP and so would fail
a predicate mask assertion since the VCMP wasn't predicated. This was
solved by converting the VCMP to a VPT instead of inserting the new
VPST.
Differential Revision: https://reviews.llvm.org/D90461
This reverts commit 38f625d0d1.
This commit contains some holes in its logic and has been causing
issues since it was commited. The idea sounds OK but some cases were not
handled correctly. Instead of trying to fix that up later it is probably
simpler to revert it and work to reimplement it in a more reliable way.
There are a number of places in RDA where we assume the block will not
be empty. This isn't necessarily true for tail predicated loops where we
have removed instructions. This attempt to make the pass more resilient
to empty blocks, not casting pointers to machine instructions where they
would be invalid.
The test contains a case that was previously failing, but recently been
hidden on trunk. It contains an empty block to begin with to show a
similar error.
Differential Revision: https://reviews.llvm.org/D88926
Before deciding to insert a [W|D]LSTP, check that defining LR with
the element count won't affect any other instructions that should be
taking the iteration count.
Differential Revision: https://reviews.llvm.org/D88549