This adds a few more unpredicated intrinsics to sink splats to, in order
to create more qr instruction variants. Notably this includes
saddsat/uaddsat but also some of the unpredicated mve intrinsics.
Differential Revision: https://reviews.llvm.org/D110333
If the old instructions mentioned a subreg that the new instructions do
not, remove the subrange for that subreg.
For example, in TwoAddressInstructionPass::eliminateRegSequence, if a
use operand in the REG_SEQUENCE has the undef flag then we don't
generate a copy for it so after the elimination there should be no live
interval at all for the corresponding subreg of the def.
This is a small step towards switching TwoAddressInstructionPass over
from LiveVariables to LiveIntervals. Currently this path is only tested
if you explicitly enable -early-live-intervals.
Differential Revision: https://reviews.llvm.org/D110542
As we have to split blocks, we may be left in an invalid loop state
after a WLS is reverted to a DLS. Instead remember the WLS that could
not be fixed and revert them after finishing processing all other loops.
Differential Revision: https://reviews.llvm.org/D110567
The ARMLowOverheadLoops pass recalculates VPT block masks when it
converts VCMP's inside VPT blocks into VPT's. The function to do so
doesn't seem to handle debug info though, leading to invalid block
creation or asserts at compile time. Make sure the function skips any
debug info between the MVE instructions it inspects.
Differential Revision: https://reviews.llvm.org/D110564
In TwoAddressInstructionPass::processTiedPairs, update subranges of the
live interval for RegB as well as the main range.
This is a small step towards switching TwoAddressInstructionPass over
from LiveVariables to LiveIntervals. Currently this path is only tested
if you explicitly enable -early-live-intervals.
Differential Revision: https://reviews.llvm.org/D110526
We align non-fallthrough branches under Cortex-M at O3 to lead to fewer
instruction fetches. This improves that for the block after a LE or
LETP. These blocks will still have terminating branches until the
LowOverheadLoops pass is run (as they are not handled by analyzeBranch,
the branch is not removed until later), so canFallThrough will return
false. These extra branches will eventually be removed, leaving a
fallthrough, so treat them as such and don't add unnecessary alignments.
Differential Revision: https://reviews.llvm.org/D107810
Given:
- A jump table
- Which jumps to the next block
- The next block ends in a WLS
- Where the WLS conditionally jumps to block earlier in the program.
The Arm block placement pass would attempt to move the block containing
the WLS earlier, as the WLS instruction can only branch forward. In
doing so it would add a branch from the jumptable block to the WLS
block, thinking it previously fell-through.
This in itself would be fine, if a little inefficient, but the constant
island pass expects all instructions after a jump-table branch to have
been removed by analyzeBranch. So it gets confused and can assign the
same labels to multiple jump table blocks.
I've changed the condition to the same as used in analyzeBranch.
In repairIntervalsInRange, if the new instructions refer to subregs but
the old instructions did not, make sure any existing live interval for
the superreg is updated to have subranges. Also skip repairing any range
that we have recalculated from scratch, partly for efficiency but also
to avoids some cases that repairOldRegInRange can't handle.
The existing test/CodeGen/AMDGPU/twoaddr-regsequence.mir provides some
test coverage for this change: when TwoAddressInstructionPass converts
REG_SEQUENCE into subreg copies, the live intervals will now get
subranges and MachineVerifier will verify that the subranges are
correct. Unfortunately MachineVerifier does not complain if the
subranges are not present, so the test also passed before this patch.
This patch also fixes ~800 of the ~1500 failures in the whole CodeGen
lit test suite when -early-live-intervals is forced on.
Differential Revision: https://reviews.llvm.org/D110328
This allows VMOVL in tail predicated loops so long as the the vector
size the VMOVL is extending into is less than or equal to the size of
the VCTP in the tail predicated loop. These cases represent a
sign-extend-inreg (or zero-extend-inreg), which needn't block tail
predication as in https://godbolt.org/z/hdTsEbx8Y.
For this a vecsize has been added to the TSFlag bits of MVE
instructions, which stores the size of the elements that the MVE
instruction operates on. In the case of multiple size (such as a
MVE_VMOVLs8bh that extends from i8 to i16, the largest size was be
chosen). The sizes are encoded as 00 = i8, 01 = i16, 10 = i32 and 11 =
i64, which often (but not always) comes from the instruction encoding
directly. A unit test was added, and although only a subset of the
vecsizes are currently used, the rest should be useful for other cases.
Differential Revision: https://reviews.llvm.org/D109706
The vectorizer can sometimes make reverse shuffles from indices that
count down. In MVE, we don't have a 128bit rev instruction, but we can
select this to a VREV64 with some lane movs to swap the two halfs.
Ideally this would use VMOVD's, but only gets as far as VMOVS's at the
moment.
Differential Revision: https://reviews.llvm.org/D69510
This simple heuristic uses the estimated live range length combined
with the number of registers in the class to switch which heuristic to
use. This was taking the raw number of registers in the class, even
though not all of them may be available. AMDGPU heavily relies on
dynamically reserved numbers of registers based on user attributes to
satisfy occupancy constraints, so the raw number is highly misleading.
There are still a few problems here. In the original testcase that
made me notice this, the live range size is incorrect after the
scheduler rearranges instructions, since the instructions don't have
the original InstrDist offsets. Additionally, I think it would be more
appropriate to use the number of disjointly allocatable registers in
the class. For the AMDGPU register tuples, there are a large number of
registers in each tuple class, but only a small fraction can actually
be allocated at the same time since they all overlap with each
other. It seems we do not have a query that corresponds to the number
of independently allocatable registers. Relatedly, I'm still debugging
some allocation failures where overlapping tuples seem to not be
handled correctly.
The test changes are mostly noise. There are a handful of x86 tests
that look like regressions with an additional spill, and a handful
that now avoid a spill. The worst looking regression is likely
test/Thumb2/mve-vld4.ll which introduces a few additional
spills. test/CodeGen/AMDGPU/soft-clause-exceeds-register-budget.ll
shows a massive improvement by completely eliminating a large number
of spills inside a loop.
Currently, opaque pointers are supported in two forms: The
-force-opaque-pointers mode, where all pointers are opaque and
typed pointers do not exist. And as a simple ptr type that can
coexist with typed pointers.
This patch removes support for the mixed mode. You either get
typed pointers, or you get opaque pointers, but not both. In the
(current) default mode, using ptr is forbidden. In -opaque-pointers
mode, all pointers are opaque.
The motivation here is that the mixed mode introduces additional
issues that don't exist in fully opaque mode. D105155 is an example
of a design problem. Looking at D109259, it would probably need
additional work to support mixed mode (e.g. to generate GEPs for
typed base but opaque result). Mixed mode will also end up
inserting many casts between i8* and ptr, which would require
significant additional work to consistently avoid.
I don't think the mixed mode is particularly valuable, as it
doesn't align with our end goal. The only thing I've found it to
be moderately useful for is adding some opaque pointer tests in
between typed pointer tests, but I think we can live without that.
Differential Revision: https://reviews.llvm.org/D109290
I can't seem to wrap my head around the proper fix here,
we should be fine without this requirement, iff we can form this form,
but the naive attempt (https://reviews.llvm.org/D106317) has failed.
So just to unblock the release, put up a restriction.
Fixes https://bugs.llvm.org/show_bug.cgi?id=51125
As an extension to D107866, this adds store(fptosisat(..)) patterns,
similar to the existing fptosi patterns, to prevent unnecessarily moving
into gpr regs where we can use fp stores directly.
Differential Revision: https://reviews.llvm.org/D108378
This extends D107865 to the VFP insructions, lowering llvm.fptosi.sat
and llvm.fptoui.sat to VCVT instructions that inherently perform the
saturate.
Differential Revision: https://reviews.llvm.org/D107866
The semantics of tail predication loops means that the value of LR as an
instruction is executed determines the predicate. In other words:
mov r3, #3
DLSTP lr, r3 // Start tail predication, lr==3
VADD.s32 q0, q1, q2 // Lanes 0,1 and 2 are updated in q0.
mov lr, #1
VADD.s32 q0, q1, q2 // Only first lane is updated.
This means that the value of lr cannot be spilled and re-used in tail
predication regions without potentially altering the behaviour of the
program. More lanes than required could be stored, for example, and in
the case of a gather those lanes might not have been setup, leading to
alignment exceptions.
This patch adds a new lr predicate operand to MVE instructions in order
to keep a reference to the lr that they use as a tail predicate. It will
usually hold the zeroreg meaning not predicated, being set to the LR phi
value in the MVETPAndVPTOptimisationsPass. This will prevent it from
being spilled anywhere that it needs to be used.
A lot of tests needed updating.
Differential Revision: https://reviews.llvm.org/D107638
This adds lowering of the llvm.fptosi.sat and llvm.fptoui.sat intinsics,
selecting a VCVT instruction which under MVE will inherently perform the
saturate.
Differential Revision: https://reviews.llvm.org/D107865
This adds extra MVE vector fptosi.sat and fptoui.sat tests, along with
adding or adjusting the existing scalar tests to cover more
architectures and instruction combinations.
Currently isReallyTriviallyReMaterializableGeneric() implementation
prevents rematerialization on any virtual register use on the grounds
that is not a trivial rematerialization and that we do not want to
extend liveranges.
It appears that LRE logic does not attempt to extend a liverange of
a source register for rematerialization so that is not an issue.
That is checked in the LiveRangeEdit::allUsesAvailableAt().
The only non-trivial aspect of it is accounting for tied-defs which
normally represent a read-modify-write operation and not rematerializable.
The test for a tied-def situation already exists in the
/CodeGen/AMDGPU/remat-vop.mir,
test_no_remat_v_cvt_f32_i32_sdwa_dst_unused_preserve.
The change has affected ARM/Thumb, Mips, RISCV, and x86. For the targets
where I more or less understand the asm it seems to reduce spilling
(as expected) or be neutral. However, it needs a review by all targets'
specialists.
Differential Revision: https://reviews.llvm.org/D106408
Add a variant of mve-vqdmulh tests that uses min/max intrinsics
directly, including a scalar test that shows it misbehaving for min
intrinsics and a fix for the combine to prevent it from misbehaving.
This changes the lowering of saddsat and ssubsat so that instead of
using:
r,o = saddo x, y
c = setcc r < 0
s = c ? INTMAX : INTMIN
ret o ? s : r
into using asr and xor to materialize the INTMAX/INTMIN constants:
r,o = saddo x, y
s = ashr r, BW-1
x = xor s, INTMIN
ret o ? x : r
https://alive2.llvm.org/ce/z/TYufgD
This seems to reduce the instruction count in most testcases across most
architectures. X86 has some custom lowering added to compensate for
cases where it can increase instruction count.
Differential Revision: https://reviews.llvm.org/D105853
This enables subreg liveness in the arm backend when MVE is present,
which allows the register allocator to detect when subregister are
alive/dead, compared to only acting on full registers. This can helps
produce better code on MVE with the way MQPR registers are made up of
SPR registers, but is especially helpful for MQQPR and MQQQQPR
registers, where there are very few "registers" available and being able
to split them up into subregs can help produce much better code.
Differential Revision: https://reviews.llvm.org/D107642
Similar to the MQPR register class as the MVE equivalent to QPR, this
adds MQQPR and MQQQQPR register classes for the MVE equivalents of QQPR
and QQQQPR registers. The MVE MQPR seemed have worked out quite well,
and adding MQQPR and MQQQQPR allows us to a little more accurately
specify the number of registers, calculating register pressure limits a
little better.
Differential Revision: https://reviews.llvm.org/D107463
Currently isReallyTriviallyReMaterializableGeneric() implementation
prevents rematerialization on any virtual register use on the grounds
that is not a trivial rematerialization and that we do not want to
extend liveranges.
It appears that LRE logic does not attempt to extend a liverange of
a source register for rematerialization so that is not an issue.
That is checked in the LiveRangeEdit::allUsesAvailableAt().
The only non-trivial aspect of it is accounting for tied-defs which
normally represent a read-modify-write operation and not rematerializable.
The test for a tied-def situation already exists in the
/CodeGen/AMDGPU/remat-vop.mir,
test_no_remat_v_cvt_f32_i32_sdwa_dst_unused_preserve.
The change has affected ARM/Thumb, Mips, RISCV, and x86. For the targets
where I more or less understand the asm it seems to reduce spilling
(as expected) or be neutral. However, it needs a review by all targets'
specialists.
Differential Revision: https://reviews.llvm.org/D106408
Given a constant operand, the MVE and DAGCombine combines could fight,
each redistributing in the opposite order. Add a guard to the MVE
vecreduce distribution to prevent that.
That seems this test does not check what was stated in the
comment anymore. Just switch to generated checks.
Differential Revision: https://reviews.llvm.org/D107590
This attempts to make more of RDA aware of potentially overlapping
subregisters. Some of this was already in place, with it iterating
through MCRegUnitIterators. This also replaces calls to
LiveRegs.contains(..) with !LiveRegs.available(..), and updates the
isValidRegUseOf and isValidRegDefOf to search subregs.
Differential Revision: https://reviews.llvm.org/D107351
If the block target for a WLSTP instruction is known to be out of range,
and cannot be fixed by the ARMBlockPlacementPass, we can relax it to a
DLSTP (and cmp/branch) to still allow the creation of tail predicated
loops. That is what this patch does, adding extra revert code to the
fallback path of ARMBlockPlacementPass.
Due to the code produced when reverting, this creates a DLSTP between a
Bcc and a Br. As a DLS isn't necessarily a terminator we need to split
the block to move the DLS/Br into.
Differential Revision: https://reviews.llvm.org/D104709
Currently, the default alignment is much larger than the actual size of
the vector in memory. Fix this to use a sane default.
For SVE, temporarily remove lowering of load/store operations for
predicates with less than 16 elements. The layout the backend was
assuming for SVE predicates with less than 16 elements doesn't agree
with the frontend. More work probably needs to be done here.
This change is, strictly speaking, not backwards-compatible at the
bitcode level. But probably nobody is actually depending on that; i1
vectors in memory are rare, and the code that does use them probably
ends up forcing the alignment to something sane anyway. If we think
this is a concern, I can restrict this to scalable vectors for now
(where it's actually causing issues for me at the moment).
Differential Revision: https://reviews.llvm.org/D88994
This distributes reductions based on the relative offset of loads, if
one is found from their operands. Given chains of reductions this will
then sort them in ascending load order, which in turn can help simple
prefetches latch on to increasing strides more easily.
Differential Revision: https://reviews.llvm.org/D106569
This adds a combine for adds of reductions, distributing them so that
they occur sequentially to enable better use of accumulating VADDVA
instructions. It combines:
add(X, add(vecreduce(Y), vecreduce(Z))) ->
add(add(X, vecreduce(Y)), vecreduce(Z))
and
add(add(A, reduce(B)), add(C, reduce(D))) ->
add(add(add(A, C), reduce(B)), reduce(D))
These together distribute the add's so that more reductions can be
selected to VADDVA.
Differential Revision: https://reviews.llvm.org/D106532
Under MVE we can use VADDV/VADDVA's to perform integer add reductions,
so it can be beneficial to use more reductions than summing subvectors
and reducing once. Especially for VMLAV/VMLAVA the mul can be
incorporated into the reduction, producing less instructions.
Some of the test cases currently get larger due to extra integer adds,
but will be improved in a followup patch.
Differential Revision: https://reviews.llvm.org/D106531
These are generated from SLP vectorization, for example
https://godbolt.org/z/ebxdPh1Kz. These backend tests
show cases that we can produce better code for.
This implements the isLoadFromStackSlot and isStoreToStackSlot for MVE
MVE_VSTRWU32 and MVE_VLDRWU32 functions. They behave the same as many
other loads/stores, expecting a FI in Op1 and zero offset in Op2. At the
same time this alters VLDR_P0_off and VSTR_P0_off to use the same code
too, as they too should be returning VPR in Op0, take a FI in Op1 and
zero offset in Op2.
Differential Revision: https://reviews.llvm.org/D106797
The register class required for some MVE loads/stores is more
constrained than the register we use when creating postinc. Make sure we
constrain the register class to keep the code correct.
This relaxes the VMLAV and VADDV reduction recognition code to handle
smaller than legal types, extending them as needed. That was already
handled for some reductions, this extends it to more types in a more
generic way. If a smaller than legal value is found it is extended to
the legal type as needed.
Differential Revision: https://reviews.llvm.org/D106051