This shrinks the immediate that isel table needs to emit for these
instructions. Hoping this allows me to change OPC_EmitInteger to
use a better variable length encoding for representing negative
numbers. Similar to what was done a few months ago for OPC_CheckInteger.
The alternative encoding uses less bytes for negative numbers, but
increases the number of bytes need to encode 64 which was a very
common number in the RISCV table due to SEW=64. By using Log2 this
becomes 6 and is no longer a problem.
DAGCombiner was recently taught how to combine STEP_VECTOR nodes,
meaning the step value is no longer guaranteed to be one by the time it
reaches the backend for lowering.
This patch supports such cases on RISC-V by lowering to other step
values to a multiply following the vid.v instruction. It includes a
small optimization for common cases where the multiply can be expressed
as a shift left.
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D100856
Similar for or/xor with 0 in place of -1.
This is the canonical form produced by InstCombine for something like `c ? x & y : x;` Since we have to use control flow to expand select we'll usually end up with a mv in basic block. By folding this we may be able to pull the and/or/xor into the block instead and avoid a mv instruction.
The code here is based on code from ARM that uses this to create predicated instructions. I'm doing it on SELECT_CC so it happens late, but we could do it on select earlier which is what ARM does. I'm not sure if we lose any combine opportunities if we do it earlier.
I left out add and sub because this can separate sext.w from the add/sub. It also made a conditional i64 addition/subtraction on RV32 worse. I guess both of those would be fixed by doing this earlier on select.
The select-binop-identity.ll test has not been commited yet, but I made the diff show the changes to it.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D101485
This replaces D98479.
This allows type legalization to form SPLAT_VECTOR_PARTS so we don't
lose the splattedness when the scalar type is split.
I'm handling SPLAT_VECTOR_PARTS for fixed vectors separately so
we can continue using non-VL nodes for scalable vectors.
I limited to RV32+vXi64 because DAGCombiner::visitBUILD_VECTOR likes
to form SPLAT_VECTOR before seeing if it can replace the BUILD_VECTOR
with other operations. Especially interesting is a splat BUILD_VECTOR of
the extract_vector_elt which can become a splat shuffle, but won't if
we form SPLAT_VECTOR first. We either need to reorder visitBUILD_VECTOR
or add visitSPLAT_VECTOR.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D100803
This seems like a reasonable upper bound on VL. WG discussions for
the V spec would probably allow us to use 2^16 as an upper bound
on VLEN, but this is good enough for now.
This allows us to remove sext and zext if user happens to assign
the size_t result into an int and then uses it as a VL intrinsic
argument which is size_t.
Reviewed By: frasercrmck, rogfer01, arcbbb
Differential Revision: https://reviews.llvm.org/D101472
This is an complementary/alternative fix for D99068. It takes a slightly
different approach by explicitly summing up all of the required split
part type sizes and ensuring we allocate enough space for them. It also
takes the maximum alignment of each part.
Compared with D99068 there are fewer changes to the stack objects in
existing tests. However, @luismarques has shown in that patch that there
are opportunities to reduce our stack usage in the future.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D99087
This adds a special operand type that is allowed to be either
an immediate or register. By giving it a unique operand type the
machine verifier will ignore it.
This perturbs a lot of tests but mostly it is just slightly different
instruction orders. Something bad did happen to some min/max reduction
tests. We're spilling vector registers when we weren't before.
Reviewed By: khchen
Differential Revision: https://reviews.llvm.org/D101246
This modifies my previous patch to push the strided load formation
to isel. This gives us opportunity to fold the splat into a .vx
operation first. Using a scalar register and a .vx operation reduces
vector register pressure which can be important for larger LMULs.
If we can't fold the splat into a .vx operation, then it can make
sense to use a strided load to free up the vector arithmetic
ALU to do actual arithmetic rather than tying it up with vmv.v.x.
Reviewed By: khchen
Differential Revision: https://reviews.llvm.org/D101138
We have several extensions that need i32 to be Custom for
INTRINSIC_WO_CHAIN with RV64 so enable it for all RV64.
For V extension, make i32 Custom for RV64 and i64 Custom for RV32.
When the i32 or i64 is legal, the operation action doesn't matter.
LegalizeDAG checks MVT::Other rather than the real type.
This teaches DAG combine that shift amount operands for grev, gorc
shfl, unshfl only read a few bits.
This also teaches DAG combine that grevw, gorcw, shflw, unshflw,
bcompressw, bdecompressw only consume the lower 32 bits of their
inputs.
In the future we can teach SimplifyDemandedBits to also propagate
demanded bits of the output to the inputs in some cases.
Use getContainerForFixedLengthVector and getRegClassIDForVecVT to
get the register class to use when making a fixed vector type legal.
Inline it into the other two call sites.
I'm looking into using fractional lmul for fixed length vectors
and getLMULForFixedLengthVector returned an integer making it
unable to express this. I considered returning the LMUL
enum, but that seemed like it would introduce more complexity to
convert it for use.
Make it a static function RISCVISelLowering, the only place it
is used.
I think I'm going to make this return a fractional LMULs in some
cases so I'm sorting out where it should live before I start
making changes.
We can have RISCVISelDAGToDAG.cpp call the VT only version by
finding the RISCVTargetLowering object via the Subtarget.
Make the static versions just global static functions in
RISCVISelLowering that can be called by static functions in that
file.
This patch adds support for both scalable- and fixed-length vector code
lowering of the llvm.minnum and llvm.maxnum intrinsics to the equivalent
RVV instructions.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D101035
Implementations are allowed to optimize an x0 stride to perform
less memory accesses. This is the case in SiFive cores.
No idea if this is the case in other implementations. We might
need a tuning flag for this.
Reviewed By: frasercrmck, arcbbb
Differential Revision: https://reviews.llvm.org/D100815
Rather than doing splatting each separately and doing bit manipulation
to merge them in the vector domain, copy the data to the stack
and splat it using a strided load with x0 stride. At least on
some implementations this vector load is optimized to not do
a load for each element.
This is equivalent to how we move i64 to f64 on RV32.
I've only implemented this for the intrinsic fallbacks in this
patch. I think we do similar splatting/shifting/oring in other
places. If this is approved, I'll refactor the others to share
the code.
Differential Revision: https://reviews.llvm.org/D101002
The value is always an immediate and can never be in a register.
This the kind of thing TargetConstant is for.
Saves a step GenDAGISel to convert a Constant to a TargetConstant.
This recognizes the case when Hi is (sra Lo, 31). We can use
SPLAT_VECTOR_I64 rather than splatting the high bits and
combining them in the vector register.
As noted in the FIXME there's a sort of agreement that the any
extra bits stored will be 0.
The generated code is pretty terrible. I was really hoping we
could use a tail undisturbed trick, but tail undisturbed no
longer applies to masked destinations in the current draft
spec.
Fingers crossed that it isn't common to do this. I doubt IR
from clang or the vectorizer would ever create this kind of store.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D100618
This patch extends the lowering of RVV fixed-length vector shuffles to
avoid the default stack expansion and instead lower to vrgather
instructions.
For "permute"-style shuffles where one vector is swizzled, we can lower
to one vrgather. For shuffles involving two vector operands, we lower to
one unmasked vrgather (or splat, where appropriate) followed by a masked
vrgather which blends in the second half.
On occasion, when it's not possible to create a legal BUILD_VECTOR for
the indices, we use vrgatherei16 instructions with 16-bit index types.
For 8-bit element vectors where we may have indices over 255, we have a
fairly blunt fallback to the stack expansion to avoid custom-splitting
of the vector types.
To enable the selection of masked vrgather instructions, this patch
extends the various RISCVISD::VRGATHER nodes to take a passthru operand.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D100549
Prep work for adding intrinsics in the future.
Left an assert that the input is constant in ReplaceNodeResults,
as the intrinsic shouldn't go through that path.
This patch adds RVV codegen support for OR/XOR/AND reductions for both
scalable- and fixed-length vector types. There are a few possible
codegen strategies for each -- vmfirst.m, vmsbf.m, and vmsif.m could be
used to some extent -- but the vpopc.m instruction was chosen since it
produces the scalar result in one instruction, after which scalar
instructions can finish off the computation.
The reductions are lowered identically for both scalable- and
fixed-length vectors, although some alternate strategies may be more
optimal on fixed-length vectors since it's cheaper to get the length of
those types.
Other reduction types were not deemed to be relevant for mask vectors.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D100030
New custom DAG nodes were added to represent operations on CSR. These
nodes are lowered to corresponding pseudo instruction. Using the pseudo
instructions allows to specify different scheduling information for
operations on different system registers. It also make possible to
specify dependencies of instructions on specific system registers.
Differential Revision: https://reviews.llvm.org/D98936
If the constants have a difference of 1 we can convert one to
the other by adding or subtracting the condition.
We have a DAG combine for this, but it only runs before type
legalization. If the select is introduced later during type
legalization or op legalization we will miss it.
We don't need a specific condition, but some conditions are
harder to materialize than others on RISCV. I know that SETLT
will be a single instruction and it is what is used by the
motivating pattern from signed saturating add/sub.
Differential Revision: https://reviews.llvm.org/D99021
This can't use our normal strategy of splatting the scalar and using
a .vv operation instead of .vx.
Instead this patch bitcasts the vector to the equivalent SEW=32
vector and inserts the scalar parts using two vslide1up/down. We
do that unmasked and apply the mask separately at the end with
a vmerge.
For vslide1up there maybe some other options here like getting
i64 into element 0 and using vslideup.vi with this vector as
vd and the original source as vs1. Masking would still need to
be done afterwards.
That idea doesn't work for vslide1down. We need to slidedown and
then insert a single scalar at vl-1 which we could do with a
vslideup, but that assumes vl > 0 which I don't think we can assume.
The i32 double slide1down implemented here is the best I could come
up with and I just made vslide1up consistent.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D99910
We encountered a hang in our internal code base. I'm having trouble
creating a test case because the test that hit it was testing some
code that is not upstream.
It's a bit silly, but it allows us to write stricter type
constraints for isel. There's still some extra type checks in
the generated table due to some type interference limitations
around HWMode.
This patch supports bitcasts from scalar types to fixed-length vectors
and vice versa. It custom-lowers and custom-legalizes them to
EXTRACT_VECTOR_ELT/INSERT_VECTOR_ELT operations, using a single-element
vectors to hold the scalar where appropriate.
Previously, some of these would fail to select, others would be expanded
through stack loads and stores. Effort was made to ensure the codegen
avoids the stack for both legal and illegal scalar types.
Some of the codegen could be improved, but on first glance it looks like
a general optimization of EXTRACT_VECTOR_ELT when extracting an i64
element on RV32.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D99667
Caught in internal testing, these operations are assumed legal by
default, even for scalable vector types. Expand them back into separate
truncations and stores, or loads and extensions.
Also add explicit fixed-length vector tests for these operations, even
though they should have been correct already.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D99654
The W version of orc.b does not exist in Zbp so we need to use
gorci encoding. If we have Zbp, we can use gorciw which can avoid a
sext.w in some cases.
As long as it's a constant we can directly pattern match it
without any problems. It's only when it isn't a constant that
we need to add an AND.
In theory this should allow more target independent optimizations
to remain active.
Forgot to amend the Author.
Original commit message:
Header files are included in a separate patch in case the name needs to be changed.
RV32 / 64:
orc.b
Differential Revision: https://reviews.llvm.org/D99320
The default legalization strategy is PromoteFloat which keeps
half in single precision format through multiple floating point
operations. Conversion to/from float is done at loads, stores,
bitcasts, and other places that care about the exact size being 16
bits.
This patches switches to the alternative method softPromoteHalf.
This aims to keep the type in 16-bit format between every operation.
So we promote to float and immediately round for any arithmetic
operation. This should be closer to the IR semantics since we
are rounding after each operation and not accumulating extra
precision across multiple operations. X86 is the only other
target that enables this today. See https://reviews.llvm.org/D73749
I had to update getRegisterTypeForCallingConv to force f16 to
use f32 when the F extension is enabled. This way we can still
pass it in the lower bits of an FPR for ilp32f and lp64f ABIs.
The softPromoteHalf would otherwise always give i16 as the
argument type.
Reviewed By: asb, frasercrmck
Differential Revision: https://reviews.llvm.org/D99148
We need to splat the scalar separately and use .vv, but there is
no vmsgt(u).vv. So add isel patterns to select vmslt(u).vv with
swapped operands.
We also need to get VT to use for the splat from an operand rather
than the result since the result VT is nxvXi1.
Reviewed By: HsiangKai
Differential Revision: https://reviews.llvm.org/D99704
There's no target independent ISD opcode for MULHSU, so custom
legalize 2*XLen multiplies ourselves. We have to be a little
careful to prefer MULHU or MULHSU.
I thought about doing this in isel by pattern matching the
(add (mul X, (srai Y, XLen-1)), (mulhu X, Y)) pattern. I decided
against this because the add might become part of a chain of adds.
I don't trust DAG combine not to reassociate with other adds making
it difficult to find both pieces again.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D99479
Our CLZW isel pattern is quite easily broken by surrounding code
preventing it from matching sometimes. This usually results in
failing to remove the and X, 0xffffffff inserted by type
legalization. The add with -32 that type legalization also inserts
will often gets combined into other add/sub nodes. That doesn't
usually result in extra code when we don't use clzw.
CTTZ seems to be less fragile, but I wanted to keep it consistent
with CTLZ.
Reviewed By: asb, HsiangKai
Differential Revision: https://reviews.llvm.org/D99317
This adds almost everything required for supporting the new stepvector
intrinsic on RVV. It is lowered to the existing VID_VL SDNode.
The only exception is a limitation that RV32 cannot yet lower the
intrinsic on i64 vectors. This is because the step operand is
(currently) required to be at least as large as the vector element type.
I will look into patching that out and loosening the requirement to only
an integer pointer type.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D99594