count register
After my patch at D86087, code that now uses the mov operand rather than
the vctp operand will no longer remove modifications to the vctp operand
as they should. This patch fixes that by explicitly removing
modifications to the vctp operand rather than the register used as the
element count.
Remove the code that tried to look for reduction patterns, since the
vectorizer and isel can now produce predicated arithmetic instructios
within the loop body. This has required some reorganisation and fixes
around live-out and predication checks, as well as looking for cases
where an input/output is initialised to zero.
Differential Revision: https://reviews.llvm.org/D86613
Fix the ARM backend's analyzeBranch so it doesn't ignore predicated
return instructions, and make the MachineVerifier rule more strict.
Differential Revision: https://reviews.llvm.org/D40061
This adapts legalization of intrinsic get.active.lane.mask to the new semantics
as described in D86147. Because the second argument is now the loop tripcount,
we legalize this intrinsic to an 'icmp ULT' instead of an ULE when it was the
backedge-taken count.
Differential Revision: https://reviews.llvm.org/D86302
This adapts tail-predication to the new semantics of get.active.lane.mask as
defined in D86147. This means that:
- we can remove the BTC + 1 overflow checks because now the loop tripcount is
passed in to the intrinsic,
- we can immediately use that value to setup a counter for the number of
elements processed by the loop and don't need to materialize BTC + 1.
Differential Revision: https://reviews.llvm.org/D86303
VLD2/4 instructions cannot be predicated, so we cannot tail predicate
them from autovec. From intrinsics though, they should be valid as they
will just end up loading extra values into off vector lanes, not
effecting the on lanes. The same is true for loads in general where so
long as we are not using the other vector lanes, an unpredicated load
can be converted to a predicated one.
This marks VLD2 and VLD4 instructions as validForTailPredication and
allows any unpredicated load in tail predication loop, which seems to be
valid given the other checks we have.
Differential Revision: https://reviews.llvm.org/D86022
There are some cases where the instruction that sets up the iteration
count for a tail predicated loop cannot be moved before the dlstp,
stopping tail predication entirely. This patch checks if the mov operand
can be used and if so, uses that instead.
Differential Revision: https://reviews.llvm.org/D86087
Similarly as for pointers, even for integers a == b is usually false.
GCC also uses this heuristic.
Reviewed By: ebrevnov
Differential Revision: https://reviews.llvm.org/D85781
Similarly as for pointers, even for integers a == b is usually false.
GCC also uses this heuristic.
Reviewed By: ebrevnov
Differential Revision: https://reviews.llvm.org/D85781
Similarly as for pointers, even for integers a == b is usually false.
GCC also uses this heuristic.
Reviewed By: ebrevnov
Differential Revision: https://reviews.llvm.org/D85781
This pick ups the work on the overflow checks for get.active.lane.mask,
which ensure that it is safe to insert the VCTP intrinisc that enables
tail-predication. For a 2d auto-correlation kernel and its inner loop j:
M = Size - i;
for (j = 0; j < M; j++)
Sum += Input[j] * Input[j+i];
For this inner loop, the SCEV backedge taken count (BTC) expression is:
(-1 + (sext i16 %Size to i32)),+,-1}<nw><%for.body>
and LoopUtil cannotBeMaxInLoop couldn't calculate a bound on this, thus "BTC
cannot be max" could not be determined. So overflow behaviour had to be assumed
in the loop tripcount expression that uses the BTC. As a result
tail-predication had to be forced (with an option) for this case.
This change solves that by using ScalarEvolution's helper
getConstantMaxBackedgeTakenCount which is able to determine the range of BTC,
thus can determine it is safe, so that we no longer need to force tail-predication
as reflected in the changed test cases.
Differential Revision: https://reviews.llvm.org/D85737
VPSEL has slightly different semantics under tail predication (it can
end up selecting from Qn, Qm and Qd). We do not model that at the moment
so they block tail predicated loops from being formed.
This just converts them into a predicated VMOV instead (via a VORR),
allowing tail predication to happen whilst still modelling the original
behaviour of the input.
Differential Revision: https://reviews.llvm.org/D85110
This adds sign/zero extending scalar loads/stores to the MVE
instructions added in D77813, allowing us to create up more post-inc
instructions. These are comparatively simple, compared to LDR/STR (which
may be better turned into an LDRD/LDM), but still require some additions
over MVE instructions. Because there are i12 and i8 variants of the
offset loads/stores dealing with different signs, we may need to convert
an i12 address to a i8 negative instruction. t2LDRBi12 can also be
shrunk to a tLDRi under the right conditions, so we need to be careful
with codesize too.
Differential Revision: https://reviews.llvm.org/D78625
Most MVE instructions can be predicated to fold a select into the
instruction, using the predicate and the selects else as a passthough.
This adds tablegen patterns for most two operand instructions using the
newly added TwoOpPattern from 1030e82598.
Differential Revision: https://reviews.llvm.org/D83222
There were cases where a do-while loop would be converted to a while
loop before finding out that it would be unsafe to expand the SCEV in
this situation and then bailing out of hardware loop conversion.
This patch checks if it would be unsafe to expand the SCEV and if so stops converting the do-while into a while, allowing conversion to a hardware loop.
Differential Revision: https://reviews.llvm.org/D83953
This refactors option -disable-mve-tail-predication to take different arguments
so that we have 1 option to control tail-predication rather than several
different ones.
This is also a prep step for D82953, in which we want to reject reductions
unless that is requested with this option.
Differential Revision: https://reviews.llvm.org/D83133
Whether an instruction is deemed to have side effects in determined by
whether it has a tblgen pattern that emits a single instruction.
Because of the way a lot of the the vcvt instructions are specified
either in dagtodag code or with patterns that emit multiple
instructions, they don't get marked as not having side effects.
This just marks them as not having side effects manually. It can help
especially with instruction scheduling, to not create artificial
barriers, but one of these tests also managed to produce fewer
instructions.
Differential Revision: https://reviews.llvm.org/D81639
Given a loop with two subloops, it should be possible for both to be
converted to hardware loops. That's what this patch does, simply enough.
It slightly alters the loop iterating order to try and convert all
subloops. If one (or more) succeeds, it stops as before.
Differential Revision: https://reviews.llvm.org/D78502
While validating live-out values, record instructions that look like
a reduction. This will comprise of a vector op (for now only vadd),
a vorr (vmov) which store the previous value of vadd and then a vpsel
in the exit block which is predicated upon a vctp. This vctp will
combine the last two iterations using the vmov and vadd into a vector
which can then be consumed by a vaddv.
Once we have determined that it's safe to perform tail-predication,
we need to change this sequence of instructions so that the
predication doesn't produce incorrect code. This involves changing
the register allocation of the vadd so it updates itself and the
predication on the final iteration will not update the falsely
predicated lanes. This mimics what the vmov, vctp and vpsel do and
so we then don't need any of those instructions.
Differential Revision: https://reviews.llvm.org/D75533
After the rewrite of this pass (D79175) I missed one thing: the inserted VCTP
intrinsic can be cloned to exit blocks if there are instructions present in it
that perform the same operation, but this wasn't triggering anymore. However,
it turns out that for handling reductions, see D75533, it's actually easier not
not to have the VCTP in exit blocks, so this removes that code.
This was possible because it turned out that some other code that depended on
this, rematerialization of the trip count enabling more dead code removal
later, wasn't doing much anymore due to more aggressive dead code removal that
was added to the low-overhead loops pass.
Differential Revision: https://reviews.llvm.org/D82773
This patch stops the trunc, rint, round, floor and ceil intrinsics from blocking tail predication.
Differential Revision: https://reviews.llvm.org/D82553
We current extract and convert from a top lane of a f16 vector using a
VMOVX;VCVTB pair. We can simplify that to use a single VCVTT. The
pattern is mostly copied from a vector extract pattern, but produces a
VCVTTHS f32 directly.
This had to move some code around so that ARMInstrVFP had access to the
required pattern frags that were previously part of ARMInstrNEON.
Differential Revision: https://reviews.llvm.org/D81556
To set up a tail-predicated loop, we need to to calculate the number of
elements processed by the loop. We can now use intrinsic
@llvm.get.active.lane.mask() to do this, which is emitted by the vectoriser in
D79100. This intrinsic generates a predicate for the masked loads/stores, and
consumes the Backedge Taken Count (BTC) as its second argument. We can now use
that to reconstruct the loop tripcount, instead of the IR pattern match
approach we were using before.
Many thanks to Eli Friedman and Sam Parker for all their help with this work.
This also adds overflow checks for the different, new expressions that we
create: the loop tripcount, and the sub expression that calculates the
remaining elements to be processed. For the latter, SCEV is not able to
calculate precise enough bounds, so we work around that at the moment, but is
not entirely correct yet, it's conservative. The overflow checks can be
overruled with a force flag, which is thus potentially unsafe (but not really
because the vectoriser is the only place where this intrinsic is emitted at the
moment). It's also good to mention that the tail-predication pass is not yet
enabled by default. We will follow up to see if we can implement these
overflow checks better, either by a change in SCEV or we may want revise the
definition of llvm.get.active.lane.mask.
Differential Revision: https://reviews.llvm.org/D79175
This is split off from D80316, slightly tightening the definition of overloaded
hardwareloop intrinsic llvm.loop.decrement.reg specifying that both operands
its result have the same type.
Previously, the LowOverheadLoops pass couldn't handle VPT blocks
with conditions, or with multiple VCTPs. This patch improves the
LowOverheadLoops pass so it can handle those cases.
It also adds support for VCMPs before the VCTP.
Differential Revision: https://reviews.llvm.org/D78206
This patch adds a new TTI hook to allow targets to tell LSR that
a chain including some instruction is already profitable and
should not be optimized. This patch also adds an implementation
of this TTI hook for ARM so LSR doesn't optimize chains that include
the VCTP intrinsic.
Differential Revision: https://reviews.llvm.org/D79418
Unlike Neon, MVE does not have a way of duplicating from a vector lane,
so a VDUPLANE currently selects to a VDUP(move_from_lane(..)). This
forces that to be done earlier as a dag combine to allow other folds to
happen.
It converts to a VDUP(EXTRACT). On FP16 this is then folded to a
VGETLANEu to prevent it from creating a vmovx;vmovhr pair, using a
single move_from_reg instead.
Differential Revision: https://reviews.llvm.org/D79606
This patch makes the folding of or(A, B) into not(and(not(A), not(B)))
more agressive for I1 vector. This only affects Thumb2 MVE and improves
codegen, because it removes a lot of msr/mrs instructions on VPR.P0.
This patch also adds a xor(vcmp) -> !vcmp fold for MVE.
Differential Revision: https://reviews.llvm.org/D77202
Add a bit more logic into the 'FalseLaneZeros' tracking to enable
horizontal reductions and also make the VADDV variants
validForTailPredication.
Differential Revision: https://reviews.llvm.org/D76708