Let the sext_inreg be selected to sext.w. Remove the unneeded sext.w
during PostprocessISelDAG.
This gives some other isel patterns an opportunity to match, like
the ADDIPair or matching a mul with an immediate to shXadd.
This becomes possible after D107658 started selecting W instructions
based on their users. The sext.w will be considered a W user, so isel
will often select a W instruction for the sext.w input and we can
just remove the sext.w. Otherwise we can combine the sext.w with
an ADD/SUB/MUL/SLLI to create a new W instruction in parallel
to the original instruction.
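As a standalone illustration of why the sext.w is redundant once its
input is a W instruction (a minimal sketch in plain C++, not the actual
backend code):
  #include <cstdint>
  // addw computes the 32-bit sum and sign-extends it to 64 bits.
  int64_t addw(int64_t a, int64_t b) {
    return (int64_t)(int32_t)(uint32_t)((uint32_t)a + (uint32_t)b);
  }
  // sext.w (addiw rd, rs, 0) sign-extends the low 32 bits of its input.
  int64_t sext_w(int64_t x) { return (int64_t)(int32_t)x; }
  // sext_w(addw(a, b)) == addw(a, b) for all inputs, so once the input of
  // a sext.w has been selected to ADDW/SUBW/MULW/SLLIW, the sext.w is a
  // no-op and can simply be deleted.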
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D107708
DAGCombiner::visitStore can clear the upper bits of constants
used by stores. This prevents them from being recognized as
sign-extended negative values, making them more expensive to
materialize.
This patch uses the hasAllNBitUsers method from D107658 to form
a negative constant if none of the users care about the upper bits.
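For example (a hedged sketch of the idea using a hypothetical
store_low32 helper): on RV64, -1 is a single addi, while the truncated
constant 0x00000000FFFFFFFF needs a longer sequence, yet a 32-bit store
of either value writes identical bytes:
  #include <cstdint>
  #include <cstring>
  // A 32-bit store reads only the low 32 bits of its source register, so
  // storing the sign-extended -1 writes the same bytes as storing the
  // truncated 0xFFFFFFFF constant.
  void store_low32(uint8_t *p, int64_t v) {
    uint32_t lo = (uint32_t)v;
    std::memcpy(p, &lo, sizeof(lo));
  }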
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D108052
We normally select these when the root node is a sext_inreg, but
SimplifyDemandedBits can sometimes bypass the sext_inreg for some
users. This can create a situation where sext_inreg+add/sub/mul/shl
is selected to a W instruction, and then the add/sub/mul/shl is
separately selected to a non-W instruction with the same inputs.
This patch tries to detect when it would still be ok to use a W
instruction without the sext_inreg by checking the direct users.
This can allow the W instruction to CSE with one created for a
sext_inreg+add/sub/mul/shl. To minimize complexity and cost of
checking, we make no attempt to determine if the CSE will happen
and just always use a W instruction when we can.
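The property being relied on can be illustrated in plain C++ (a sketch,
not the actual hasAllNBitUsers code):
  #include <cstdint>
  // add and addw produce the same low 32 bits; they differ only in how
  // the upper 32 bits are filled (addw sign-extends its 32-bit result).
  uint32_t low32_of_add (uint64_t a, uint64_t b) { return (uint32_t)(a + b); }
  uint32_t low32_of_addw(uint64_t a, uint64_t b) {
    int64_t w = (int64_t)(int32_t)(uint32_t)(a + b);  // what addw writes
    return (uint32_t)w;
  }
  // These are equal for all inputs, so when every user of an add only
  // reads the low 32 bits, the W form can be used without changing the
  // result.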
Differential Revision: https://reviews.llvm.org/D107658
Replace some existing isel patterns that are covered by the new
code. SLLIUWPat has been removed in favor of folding its root case
into the new code. The other uses in isel patterns for shXadd.uw
have been switched to using hardcoded AND masks.
This is based on the original version of D49585 from ARM. The final
version of that was made a DAG combine, but I've chosen to keep it
as custom isel. I'm not convinced DAG combine is as good with
shift pairs as it is with and+shift. I saw some issues optimizing
the shifts created by vscale lowering if an and isn't created
from a shift pair.
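For reference, the two forms in question compute the same thing (a
minimal illustration, not the actual patterns):
  #include <cstdint>
  // A shift pair like this...
  uint64_t zext_lo32_via_shifts(uint64_t x) { return (x << 32) >> 32; }
  // ...computes the same value as the and form, which the existing
  // and+shift combines handle well.
  uint64_t zext_lo32_via_and(uint64_t x) { return x & 0xFFFFFFFFu; }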
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D106230
If the upper 32 bits are zero and bit 31 is set, we might be able to
use zext.w to fill in the zeros after using an lui and/or addi.
Most of this patch is plumbing the subtarget features into the constant
materialization.
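A rough sketch of the kind of constant this helps with, assuming the
Zba zext.w (add.uw rd, rs, zero):
  #include <cstdint>
  // lui produces a sign-extended 32-bit value on RV64.
  int64_t lui(uint32_t imm20) { return (int64_t)(int32_t)(imm20 << 12); }
  // zext.w (add.uw rd, rs, zero from Zba) zero-extends the low 32 bits.
  uint64_t zext_w(int64_t x) { return (uint32_t)x; }
  // 0x00000000FFFF0000 has bit 31 set and zero upper bits:
  //   lui(0xFFFF0)  -> 0xFFFFFFFFFFFF0000
  //   zext_w(...)   -> 0x00000000FFFF0000
  // i.e. two instructions instead of a longer shift-based sequence.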
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D105509
This helps us select W instructions in more cases. Most of the
affected tests have had the sign_extend_inreg or AND folded into
sextload/zextload.
Differential Revision: https://reviews.llvm.org/D104079
This uses 3 bits of data instead of 7. I'm wondering if we can use
bitfields for the lookup table key where this would matter.
I also renamed the shift_amount template to log2 since it is used
with more than just an srl now.
All that really matters is that the VLMAX of the preceding
instructions is the same as the VLMAX required by the mask
operation.
Also update the vmsge(u) handling to use the SEW/LMUL we use for
other mask register operations. We were matching it to the compare
before. Some cases would improve if we fixed masked compares to
use a tail agnostic policy. I think they ignore the tail policy
anyway.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D103299
SEW=64 shifts only use the low log2(64) bits of the shift amount. If we're
splatting a 64 bit value in 2 parts, we can avoid splatting the
upper bits and just let the low bits be sign extended. They won't
be read anyway.
For the purposes of the SelectionDAG semantics of the generic ISD opcodes,
if the high half was non-zero or bit 31 of the low half was 1, the shift
was already undefined, so it should be ok to replace the high half with a
sign extension of the low half.
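The scalar analogue of that argument, as a sketch:
  #include <cstdint>
  // A SEW=64 shift only reads the low log2(64) = 6 bits of each amount.
  uint64_t sll64(uint64_t v, uint64_t amt) { return v << (amt & 63); }
  // If the original 64-bit amount had a non-zero upper word, or bit 31 of
  // the lower word set, the amount was >= 64 and the generic ISD shift was
  // already undefined, so substituting sign-extend(low word) for the high
  // word cannot change any defined result.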
In order to be able to find the split i64 value before it becomes
a stack operation, I added a new ISD opcode that will be expanded
to the stack spill in PreprocessISelDAG. This new node is conceptually
similar to BuildPairF64, but it is expanded earlier so that we can
go through regular isel to get the right VLSE opcode for the LMUL.
BuildPairF64 is expanded in a CustomInserter.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D102521
The ComplexPattern is looking for an immediate in a certain range
that has a single use. This can be handled with a PatLeaf since
we aren't matching multiple patterns or checking any complicated
relationships between nodes.
This shrinks the isel table a little bit since tablegen no longer
has to generate patterns with commuted operands. With the PatLeaf,
tablegen can see we're matching an immediate which should always
be on the right hand side of add.
Reviewed By: benshi001
Differential Revision: https://reviews.llvm.org/D102510
The VSEW encoding isn't a useful value to pass around. It's better
to use SEW or log2(SEW) directly. The only real ugliness is that
the vsetvli IR intrinsics use the VSEW encoding, but it's easy
enough to decode that when the intrinsic is processed.
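For reference, decoding it is trivial (a small sketch):
  // VSEW encodes SEW = 8 * 2^VSEW, so SEW = 8 << VSEW and
  // log2(SEW) = VSEW + 3.
  unsigned decodeVSEW(unsigned vsew) { return 8u << vsew; }
  // e.g. decodeVSEW(3) == 64 for SEW=64.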
This shrinks the immediate that the isel table needs to emit for these
instructions. Hoping this allows me to change OPC_EmitInteger to
use a better variable length encoding for representing negative
numbers. Similar to what was done a few months ago for OPC_CheckInteger.
The alternative encoding uses fewer bytes for negative numbers, but
increases the number of bytes needed to encode 64, which was a very
common number in the RISCV table due to SEW=64. By using Log2 this
becomes 6 and is no longer a problem.
This adds a special operand type that is allowed to be either
an immediate or a register. By giving it a unique operand type the
machine verifier will ignore it.
This perturbs a lot of tests but mostly it is just slightly different
instruction orders. Something bad did happen to some min/max reduction
tests. We're spilling vector registers when we weren't before.
Reviewed By: khchen
Differential Revision: https://reviews.llvm.org/D101246
This modifies my previous patch to push the strided load formation
to isel. This gives us the opportunity to fold the splat into a .vx
operation first. Using a scalar register and a .vx operation reduces
vector register pressure which can be important for larger LMULs.
If we can't fold the splat into a .vx operation, then it can make
sense to use a strided load to free up the vector arithmetic
ALU to do actual arithmetic rather than tying it up with vmv.v.x.
Reviewed By: khchen
Differential Revision: https://reviews.llvm.org/D101138
We can have RISCVISelDAGToDAG.cpp call the VT only version by
finding the RISCVTargetLowering object via the Subtarget.
Make the static versions just global static functions in
RISCVISelLowering that can be called by static functions in that
file.
These instructions don't really exist, but we have ways we can
emulate them.
.vv will swap operands and use vmsle(u).vv. .vi will adjust the
immediate and use vmsgt(u).vi when possible. For .vx we need to
use some of the multiple instruction sequences from the V extension
spec.
For unmasked vmsge(u).vx we use:
vmslt{u}.vx vd, va, x; vmnand.mm vd, vd, vd
For cases where mask and maskedoff are the same value, we have
vmsge{u}.vx v0, va, x, v0.t. This is the vd==v0 case, which
requires a temporary, so we use:
vmslt{u}.vx vt, va, x; vmandnot.mm vd, vd, vt
For other masked cases we use this sequence:
vmslt{u}.vx vd, va, x, v0.t; vmxor.mm vd, vd, v0
We trust that register allocation will prevent vd in vmslt{u}.vx
from being v0 since v0 is still needed by the vmxor.
Differential Revision: https://reviews.llvm.org/D100925
This patch adds more optimized codegen for the above SETCC forms,
by matching the '.vi' vector forms when the immediate is a 5-bit signed
immediate plus 1. The immediate can be decremented and the corresponding
SET[U]LE or SET[U]GT forms can be matched.
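The scalar intuition (a minimal sketch; the real patterns operate on
splatted vector immediates):
  #include <cstdint>
  // There is no vmslt.vi or vmsge.vi, but when the splatted constant is
  // c + 1 with c a valid simm5, the compare can be rewritten:
  //   x <  c + 1   ==   x <= c   (vmsle.vi)
  //   x >= c + 1   ==   x >  c   (vmsgt.vi)
  bool lt_form(int64_t x, int64_t c) { return x < c + 1; }
  bool le_form(int64_t x, int64_t c) { return x <= c; }  // same result
                                                         // while c + 1 doesn't overflow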
This work was left as a TODO from D94168.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D100096
Many of the operands are handled the same or in the same order
for all these intrinsics. Factor out the code for selecting and
pushing them into the Operands vector.
Differential Revision: https://reviews.llvm.org/D99923
I missed a few intrinsics in 3dd4aa7d09
when I did this for masked loads and masked segment loads/stores.
Found while trying to share more code between these custom isel
functions.
This adds a new integer materialization strategy mainly targeted
at 64-bit constants like 0xffffffff where there are 32 or more trailing
ones with leading zeros. We can materialize these by using an addi -1
and srli to restore the leading zeros. This matches what gcc does.
I haven't limited it to just these cases though. The implementation
here takes the constant, shifts out all the leading zeros and
shifts ones into the LSBs, creates the new sequence, adds an srli,
and checks if this is shorter than our original strategy.
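Concretely, for 0x00000000FFFFFFFF the new sequence is (shown as a plain
C++ sketch):
  #include <cstdint>
  uint64_t materialize_0xFFFFFFFF() {
    int64_t all_ones = -1;            // addi a0, zero, -1
    return (uint64_t)all_ones >> 32;  // srli a0, a0, 32 -> 0x00000000FFFFFFFF
  }
  // Two instructions, shorter than what the original recursive strategy
  // produces for this constant.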
I've separated the recursive portion into a standalone function
so I could append the new strategy outside of the recursion. Since
external users are no longer using the recursive function, I've
cleaned up the external interface to return the sequence instead of
taking a vector by reference.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D98821
This matches what we do in our isel patterns. In our internal
testing we've found this is needed to make the fast register
allocator happy at -O0. Otherwise it may assign V0 to an earlier
operand and find itself with no registers left when it reaches
the mask operand. By using V0 explicitly, the fast register allocator
will see it when it checks for phys register usages before it
starts allocating vregs. I'll try to update this with a test case.
Unfortunately, this does appear to prevent some instruction reordering
by the pre-RA scheduler which leads to the increased spills seen in
some tests. I suspect that problem could already occur for other
instructions that already used V0 directly.
There's a lot of repeated code here that could do with some
wrapper functions. Not sure if that should be at the level of the
new code that deals with V0. That would require multiple output
parameters to pass the glue, chain and register back. Maybe it
should be at a higher level over the entire set of push_backs.
Reviewed By: frasercrmck, HsiangKai
Differential Revision: https://reviews.llvm.org/D99367
We look for this pattern frequently in isel patterns so it's a
good idea to try to preserve it.
This also lets us remove our special isel handling for srliw
and use a direct pattern match of (srl (and X, 0xffffffff), C)
since no bits will be removed from the and mask.
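The underlying equivalence, as a scalar sketch:
  #include <cstdint>
  // srliw shifts the low 32 bits right and sign-extends the 32-bit result.
  int64_t srliw(int64_t x, unsigned c) {
    return (int64_t)(int32_t)((uint32_t)x >> c);
  }
  // For c in 1..31 the shifted 32-bit value has bit 31 clear, so the sign
  // extension fills with zeros and srliw(x, c) == ((x & 0xffffffff) >> c),
  // which is exactly the (srl (and X, 0xffffffff), C) pattern.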
Differential Revision: https://reviews.llvm.org/D99042
Previously we used selectImm for RV64 and isel patterns for
RV32. This should be NFC, but will allow RV32 and RV64 to share
improvements in the future. For example, it might be useful to
use BSETI from Zbs to make single bit constants.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D98877
This patch addresses a compiler crash resulting from passing a
fixed-length type to one that expects scalable vector types. An
assertion was added to prevent this regressing in the future.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D97868
This patch unifies the two disparate paths for lowering INSERT_SUBVECTOR
operations under one roof. Consequently, with this patch it is possible to
support any fixed-length subvector insertion, not just "cast-like" ones.
As before, support for the insertion of mask vectors will come in a
separate patch.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D97543
This will allow FrameIndex as the base address instead of
emitting a separate ADDI from isel. eliminateFrameIndex will likely turn
it back into an ADDI, but this makes things consistent with the
SDPatterns and VLPatterns.
I only tested one case for simplicity. I can test more if reviewers
want.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D97221
This patch unifies the two disparate paths for lowering
EXTRACT_SUBVECTOR operations under one roof. Consequently, with this
patch it is possible to support any fixed-length subvector extraction,
not just "cast-like" ones.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D97192
We just started using a ComplexPattern for sexti32. This updates
zexti32 to match.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D97231
This patch extends the support for RVV INSERT_SUBVECTOR to cover those
which don't align to a vector register boundary. Like the support for
EXTRACT_SUBVECTOR in D96959, it accomplishes this by extracting the
nearest register-sized subvector (a subregister operation), then sliding
the vector down with VSLIDEDOWN, inserting the subvector to the first
position, and sliding the vector back up again afterwards.
Unlike subvector extraction, for vectors that occupy less than a full
vector register we must preserve the untouched elements. We do this by
lowering to an LMUL=1 INSERT_SUBVECTOR using the above method and
lowering that to a VSLIDEUP with a zero offset. This uses a
tail-undisturbed policy and so has the effect of "sliding in" the
subvector elements while preserving the surrounding ones.
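Very roughly, the effect of a tail-undisturbed VSLIDEUP can be modelled
on plain arrays (a simplification that ignores VL and LMUL details;
slideup_tu is just an illustrative name):
  #include <array>
  #include <cstddef>
  // Writes src's elements starting at 'offset' and leaves every other
  // element of dest untouched, which is how the subvector is "slid in"
  // while the surrounding elements are preserved.
  template <std::size_t N, std::size_t M>
  std::array<int, N> slideup_tu(std::array<int, N> dest,
                                const std::array<int, M> &src,
                                std::size_t offset) {
    for (std::size_t i = 0; i < M && offset + i < N; ++i)
      dest[offset + i] = src[i];
    return dest;
  }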
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D96972
An i64 AssertZExt from a type smaller than i32 has at least 33
leading zeros, which means it has at least 33 sign bits.
Since we have a couple patterns that use two sexti32, I've
switched to a ComplexPattern so tablegen didn't have to generate
9 different permutations.
As noted in the FIXME, maybe we should just call computeNumSignBits,
but we don't have tests that benefit from that yet.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D97130
This should fix the issue reported in D96972.
I don't have a good test case for this without those changes.
Differential Revision: https://reviews.llvm.org/D97082
A previous patch moved the index versions. This moves the rest.
I also removed the custom lowering for VLEFF since we can now
do everything directly in the isel handling.
I had to update getLMUL to handle mask registers to index the
pseudo table correctly for VLE1/VSE1.
This is good for another 15K reduction in llc size.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D97097
This patch extends the support for RVV EXTRACT_SUBVECTOR to cover those
which don't align to a vector register boundary. It accomplishes this by
extracting the nearest register-sized subvector (a subregister
operation), then sliding the vector down with VSLIDEDOWN and extracting
the subvector from the first position (a COPY operation).
Since this procedure involves the use of VSCALE and multiplication, the
handling of such operations is done during lowering to simplify the
implementation and make use of DAG combining. This necessitated moving
some helper functions from RISCVISelDAGToDAG to RISCVTargetLowering.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D96959
We don't currently create memory operands for these intrinsics,
but there was a suggestion of using the indexed load/store
intrinsics to implement isel for scalable vector gather/scatter.
That may propagate the memory operand from the gather/scatter
ISD nodes.
We had more combinations of data and index lmuls than we needed.
Also add some asserts to verify that the IndexVT and data VT have
the same element count when we isel these pseudo instructions.
There are many legal combinations of index and data VTs supported
for these intrinsics. This results in a lot of isel patterns in
RISCVGenDAGISel.inc.
By adding a separate table similar to what we use for segment
load/stores, we can more efficiently manually select these
intrinsics. We should also be able to reuse this table for scalable
vector gather/scatter.
This reduces the llc binary size by ~56K.
Reviewed By: khchen
Differential Revision: https://reviews.llvm.org/D97033
Just like we do for isel patterns, we need to call selectVLOp
to prevent 0 from being selected to X0 by the default isel.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D97021
Intrinsic ID is a 32-bit value which made each row of the table 4
byte aligned. The remaining fields used 5 bytes. This meant 3 bytes
of padding per row.
This patch breaks the table into 4 separate tables and indexes them
by properties we know about the intrinsic. NF, masked,
strided, ordered, etc. The indexed load/store tables have no
padding in their rows now.
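The layout savings, sketched with hypothetical row structs:
  #include <cstdint>
  // A row keyed by the 32-bit intrinsic ID is padded to 4-byte alignment:
  struct RowWithID { uint32_t IntrinsicID; uint8_t Fields[5]; };
  // Indexing by known properties (NF, masked, strided, ordered, ...)
  // instead lets the row drop the ID and pack tightly:
  struct RowNoID { uint8_t Fields[5]; };
  static_assert(sizeof(RowWithID) == 12, "3 bytes of padding per row");
  static_assert(sizeof(RowNoID) == 5, "no padding");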
All together this reduces the size of llc binary by ~28K.
I'm considering adding similar tables for isel of non-segment
load/store as well to cut down the size of the isel table and
probably improve our isel performance. Those tables would need to
be indexed from intrinsics, IR loads/stores, gathers/scatters, and
RISCVISD opcodes. So having a table that can be indexed without using
intrinsic ID is more flexible.
Reviewed By: HsiangKai
Differential Revision: https://reviews.llvm.org/D96894