Similar to what we do for other loads/stores, use the intrinsic
version that we already have custom isel for.
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D121166
vslide1up/down have this flag set, but the value isn't a splat.
Rename for clarity.
Reviewed By: khchen
Differential Revision: https://reviews.llvm.org/D121037
With Zbb, abs is expanded to (max X, neg) by default. If X has 33 or
more sign bits, we can expand it a little early using negw instead of
neg to save a sext_inreg. If X started as a 32 bit value, type
legalization would have inserted a sext before the abs so X having
33 sign bits should always be true.
Note: I've used ISD::FREEZE here since we increase the number of uses.
Our default expansion for ABS doesn't do that, but I think that's a bug.
We can't do this with custom type legalization because ISD::FREEZE
doesn't propagate sign bits so later DAG combine won't expand be
able to see optmize it.
Alives2 https://alive2.llvm.org/ce/z/Gx3RNe
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D120597
Until Zfinx is supported in CodeGen we need to convert all Zfinx
register classes to GPR.
Remove the zfinx-types.ll test which didn't test anything meaningful
since -mattr=zfinx isn't implemented completely in llc.
Follow up to D93298.
This miscompile was introduced in D119527.
This was a special pattern for rotate+bswap on RV32. It doesn't
work for RV64 since the rotate needs to be half the bitwidth. The
equivalent pattern for RV64 is ROTR ((GREV x, 56), 32) so match
that instead.
This could be generalized further as noted in the new FIXME.
Reviewed By: Chenbing.Zheng
Differential Revision: https://reviews.llvm.org/D120686
This patch added the MC layer support of Zfinx extension.
Authored-by: StephenFan
Co-Authored-by: Shao-Ce Sun
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D93298
This lowers VECTOR_SPLICE of scalable vectors to a slidedown follow by a slideup.
Fixed vectors are encouraged to use shufflevector instruction. The equivalent patch
for fixed vectors is D119039.
I've used a tail agnostic slidedown and limited the VL to only the
elements that will not be overwritten by the slideup. The slideup
uses VLMax for its VL. It unfortunately uses tail undisturbed policy
but it isn't required as there is no tail. We just need the merge
operand to carry the bits for the lower portion of the result.
Care was taken to ensure that either the slideup or slidedown will
be able to use a .vi instruction when the immediate is small. Which
one uses the immediate depends on the sign of the immediate.
Reviewed By: frasercrmck, ABataev
Differential Revision: https://reviews.llvm.org/D119303
Default type legalization will create sext_inreg+abs, but we may
not be able to remove the sext_inreg.
Instead this patch expands abs during type legalization to
Y = sraiw X, 31; subw(xor X, Y), Y) which doesn't require the input
to be sign extended.
This gives a big improvement for some neg-abs tests where the
abs is used more than the the neg. Previously the abs was expanded
a different way before and after type legalization. Now they are
expanded in a similar way enabling more CSE.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D120636
vcpop and vfirst are still useful when VL=0.
vcpop equivalents to li 0 and vfirst equivalents to li -1,
since no mask elements are active.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D120302
Add a new ISD opcode to represent the sign extending behavior of
vmv.x.h. Keep the previous anyext opcode to allow the existing
(fmv_x_anyexth (fmv_h_x X)) combine to keep working without needing
to generate a sign extend.
For fmv.x.w we are able to match the sext_inreg in an isel pattern,
but a 16-bit sext_inreg is lowered to a shift pair before isel. This
seemed like a larger match than we should do in isel.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D118974
Internally to DAGCombiner the SDValues were passed by non-const
reference despite not being modified. They were then passed by
const reference to TLI.
This patch passes them by value which is consistent with the vast
majority of code.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D120420
See https://github.com/llvm/llvm-project/issues/53831 for a full discussion.
The basic issue is that DAGCombiner::visitMUL and
RISCVISelLowering;:transformAddImmMullImm get stuck in a loop, as the
current checks in transformAddImmMulImm aren't sufficient to avoid all
cases where DAGCombiner::isMulAddWithConstProfitable might trigger a
transformation. This patch makes transformAddImmMulImm bail out if C0
(the constant used for multiplication) has more than one use.
Differential Revision: https://reviews.llvm.org/D120332
This generalizes isElementRotate to work when there's only a single
slide needed. I've removed matchShuffleAsSlideDown which is now
redundant.
Reviewed By: frasercrmck, khchen
Differential Revision: https://reviews.llvm.org/D119759
Add the passthru operand for
VMV_V_X_VL, VFMV_V_F_VL and SPLAT_VECTOR_SPLIT_I64_VL also.
The goal is support tail and mask policy in RVV builtins.
We focus on IR part first.
If the passthru operand is undef, we use tail agnostic, otherwise
use tail undisturbed.
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D119688
Part of the shift lowering creates a (sub XLEN-1, ShAmt). When this
value is used we know that ShAmt is [0..XLEN-1]. Since XLEN is a power
of 2 we can replace the sub with an xor. This allows us to use XORI
instead of LI+SUB.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D119411
The goal is support tail and mask policy in RVV builtins.
We focus on IR part first.
If the passthru operand is undef, we use tail agnostic, otherwise
use tail undisturbed.
Add passthru operand for VSLIDE1UP_VL and VSLIDE1DOWN_VL to support
i64 scalar in rv32.
The masked VSLIDE1 would only emit mask undisturbed policy regardless
of giving mask agnostic policy until InsertVSETVLI supports mask agnostic.
Reviewed by: craig.topper, rogfer01
Differential Revision: https://reviews.llvm.org/D117989
This is a more generic version of D119110 that uses MaskedValueIsZero
to do the matching and SimplifyDemandedBits to remove any unneeded
AND instructions.
Tests were taken from D119110.
Reviewed By: Chenbing.Zheng
Differential Revision: https://reviews.llvm.org/D119622
While matching widening multiply, if we matched an extend from i8->i32,
i16->i64 or i8->i64, we need to reintroduce a narrower extend. If we're
matching a vwmulsu we need to use a sext for op0 and a zext for op1.
This bug exists in LLVM 14 and will need to be backported.
Differential Revision: https://reviews.llvm.org/D119618
Move some combine patterns to DAG combine,and
it dealt with fixme left in RISCVInstrInfoZb.td.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D119527
We can lower a vector splice to a vslidedown and a vslideup.
The majority of the matching code here came from X86's code for matching
PALIGNR and VPALIGND/Q.
The slidedown and slideup lowering don't really require it to be concatenation,
but it happened to be an interesting pattern with existing analysis code I
could use.
This helps with cases where the scalar loop optimizer forwarded a load
result from a previous loop iteration. For example, this happens if the
loop uses x[i] and x[i+1] on the same iteration. The scalar optimizer
will forward x[i+1] load from the previous loop to satisfy x[i] on this
loop. When this get vectorized it results in one element of a vector
being forwarded from the previous loop to be concatenated with elements
loaded on this iteration.
Whether that's more efficient than doing a shifted loaded or reloading
the single scalar and using vslide1up is an interesting question.
But that's not something the backend can help with.
Reviewed By: khchen
Differential Revision: https://reviews.llvm.org/D119039
The VLMaxSentinel is represented as TargetConstant, but that's included
in isa<ConstantSDNode>. To keep constant VLs and VLMax separate as long
as possible, use the X0 register during lowering and only convert to
VLMaxSentinel during isel.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D118845
We already had FMA_VL node, but we didn't have masked patterns.
I have not added the fneg variations. I'll do those after I add
llvm.vp.fneg.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D119196
This patch adds an optimization to splat-like operations where the
splatted value is extracted from a identically-sized vector. On RVV we
can splat that via vrgather.vx/vrgather.vi without dropping to scalar
beforehand.
We do have a similar VECTOR_SHUFFLE-specific optimization but that only
works on fixed-length vector types and for those with a constant splat
lane. This patch extends this optimization to make it work on
scalable-vector types and on unknown extract indices.
It is performed during fixed-vector BUILD_VECTOR lowering and during a
new DAGCombine on SPLAT_VECTOR for scalable vectors.
Reviewed By: craig.topper, khchen
Differential Revision: https://reviews.llvm.org/D118456
Add a new ISD opcode to represent the sign extending behavior of
vmv.x.h. Keep the previous anyext opcode to allow the existing
(fmv_x_anyexth (fmv_h_x X)) combine to keep working without needing
to generate a sign extend.
For fmv.x.w we are able to match the sext_inreg in an isel pattern,
but a 16-bit sext_inreg is lowered to a shift pair before isel. This
seemed like a larger match than we should do in isel.
Differential Revision: https://reviews.llvm.org/D118974
Add the vslidedown and interleave patterns that I recently implemented.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D118952
This avoids a crash for scalable vectors and or scalarization for
fixed vectors.
The algorithm is different enough that I don't think it makes sense
to merge with ceil/floor/trunc. Algorithm is adapted from gcc's X86
SSE2 output.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D117247
SPLAT_VECTOR_I64 has the same semantics as RISCVISD::VMV_V_X_VL, it
just assumed VLMax instead of carrying a VL operand.
Include order of RISCVInstrInfoVSDPatterns.td and RISCVInstrInfoVVLPatterns.td
has been swapped to avoid moving riscv_vmv_v_x_vl into
RISCVInstrInfoVSDPatterns.td and to allow moving other "_vl" SDNodes back to
RISCVInstrInfoVVLPatterns.td
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D118841
VLMaxSentintel happens to be represented as -1 TargetConstant. A user
provided -1 would be an ISD::Constant. We shouldn't assume that they
are the same thing. I'm still not entirely convinced that we should be
treating -1 from the user as VLMAX.
Also fix one place that failed to use XLenVT for the VLMaxSentinel,
using MVT::i64 in code that only executes on RV32.
This adds or reuses ISD opcodes for vadd.wv, vaddu.wv, vadd.vv, vaddu.vv
and a similar set for sub.
I've included support for narrowing scalar splats that have known
sign/zero bits similar to what was done for MUL_VL.
The conversion to vwadd.vv proceeds in two phases. First we'll form
a vwadd.wv by narrowing one of the operands. Then we'll visit the
vwadd.wv to try to narrow the other operand. This turned out to be
simpler than catching all the cases in one step. The forming of of
vwadd.wv can happen for either operand for add, but only the right
hand side for sub since sub isn't commutable.
An interesting quirk is that ADD_VL and VZEXT_VL/VSEXT_VL are formed
during vector op legalization, but VMV_V_X_VL isn't usually formed
until op legalization when BUILD_VECTORS are handled. This leads to
VWADD_W_VL forming in one DAG combine round, and then a later DAG combine
round sees the VMV_V_X_VL and needs to commute the operands to get the
splat in position. This alone necessitated a VWADD_W_VL combine function
which made forming vwadd.vv in two stages an easy choice.
I've left out trying hard to form vwadd.wx instructions for now. It would
only save an extend in the scalar domain which isn't as interesting.
Might need to review the test coverage a bit. Most of the vwadd.wv
instructions are coming from vXi64 tests on rv64. The tests were
copy pasted from the existing multiply tests.
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D117954
We convert VLEN to vscale by dividing by RVVBitsPerBlock which is
currently 64. This is only correct if VLEN is evenly divisible by
64. With only Zvl32b we can't assume that.
This patch adds a fatal_error to prevent generating code that may
be broken.
We probably need to look at how we size stack frame objects too.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D118583
We had previously hardcoded this to assume that vector registers
are 128 bits. This was true when only V existed, but after Zve
extensions were added this became incorrect.
This patch adjusts it to support 128, 64, or 32 bit vectors depending
on Zvl. The 128-bit limit is artificial, but we don't have any test
coverage showing that we larger values so I was being conservative.
None of our lit tests depend on this code today due to the custom
lowering of ISD::VSCALE that inserts the appropriate left or right
shift to convert from VLENB to VSCALE. That code was added after
this code in computeKnownBitsForTargetNode.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D118582
masked.atomicrmw.*.i32 intrinsics access an i32 (and then possibly
mask it), so hardcode MVT::i32 as the access type here, rather than
determining it from the pointer element type.
Differential Revision: https://reviews.llvm.org/D118336
This is a slight change because I'm using the ANY_EXTEND result
instead of the original operand, but getNode should constant fold.
While there, add a comment about why the code specifically checks
for a ConstantSDNode.
We can use the RISCVISD::GREV encoding that swaps the bits in
each byte. This allows it to use the existing computeKnownBits
support for RISCVISD::GREV.
We already have an ISD opcode for the more general GREV/GREVI
instructon. We can just use it with the encoding that corresponds
to the behavior of brev8. This is similar to what we do for orc.b
where we use the GORC ISD opcode.
Now the backend promotes mask vector to an i8 vector and extract element from that. We could bitcast to a widen element vector, and extract from it to GPR, then use I instruction to extract the certain bit.
Differential Revision: https://reviews.llvm.org/D117389
We were creating a truncate with the default for the type, but for
VP intrinsics we have a VL that we should use.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D118406
`__gnu_h2f_ieee` and `__gnu_f2h_ieee` are introduce by ARM and set that as
default name for fp16 and fp32 conversion in LLVM.
However RISC-V GCC using default naming scheme for that, which is
`__extendhfsf2` and `__truncsfhf2` for that, that cause runtime ABI
incompatible issue.
Although we didn't have formal runtime ABI spec to specify those naming
convention yet, but I think it would be great to fix the incompatible
issue first.
And I've plan to create a runtime ABI spec undere psABI spec this year.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D118207
According to riscv-v-spec-1.0, widening signed(vs2)-unsigned integer multiply
vwmulsu.vv vd, vs2, vs1, vm # vector-vector
vwmulsu.vx vd, vs2, rs1, vm # vector-scalar
It is worth noting that signed op is only for vs2.
For vwmulsu.vv, we can swap two ops, and don't care which is sign extension,
but for vwmulsu.vx signExt can not be a vector extended from scalar (rs1).
I specifically added two functions ending with _swap in the test case.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D118215
This patch adds support for expanding VP_MERGE through a sequence of
vector operations producing a full-length mask setting up the elements
past EVL/pivot to be false, combining this with the original mask, and
culminating in a full-length vector select.
This expansion should work for any data type, though the only use for
RVV is for boolean vectors, which themselves rely on an expansion for
the VSELECT.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D118058
This reverts commit ef82063207.
- It conflicts with the existing llvm::size in STLExtras, which will now
never be called.
- Calling it without llvm:: breaks C++17 compat
The goal is support tail and mask policy in RVV builtins.
We focus on IR part first.
If the passthru operand is undef, we use tail agnostic, otherwise
use tail undisturbed.
Co-Authored-by: Hsiangkai Wang <Hsiangkai@gmail.com>
Reviewers: craig.topper, frasercrmck
Differential Revision: https://reviews.llvm.org/D117647
Instead use either Type::getPointerElementType() or
Type::getNonOpaquePointerElementType().
This is part of D117885, in preparation for deprecating the API.
This patch introduces new intrinsics that enable the use of vsetvli in
contexts where only the returned vector length is of interest. The
pre-existing intrinsics are marked with side-effects, which prevents
even trivial optimizations on/across them.
These intrinsics are intended to be used in situations where the vector
length is fed in turn to RVV intrinsics or to vector-predication
intrinsics during loop vectorization, for example. Those codegen paths
ensure that instructions are generated with their own implicit vsetvli,
so the vector length and vtype can be relied upon to be correct.
No corresponding C builtins are planned at this stage, though that is a
possibility for the future if the need arises.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D117910
This patch adds lowering of the llvm.vp.merge.* intrinsic
(ISD::VP_MERGE) to RVV vmerge/vfmerge instructions. It introduces a
special pseudo form of vmerge which allows a tied merge operand,
allowing us to specify the tail elements as being equal to the "on
false" operand, using a tied-def constraint and a "tail undisturbed"
policy.
While this strategy allows us to often lower the intrinsic to just one
instruction, it may be less efficient in fixed-vector types as the
number of tail elements may extend far beyond the length of the fixed
vector. Another strategy could be to use a vmerge/vfmerge instruction
with an AVL equal to the length of the vector type, and manipulate the
condition operand such that mask elements greater than the operation's
EVL are false.
I've also observed inefficient codegen in which our 'VF' patterns don't
match raw floating-point SPLAT_VECTORs, which occur in scalable-vector
code.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D117561
This patch follows up on D117697 to help the simple binary operations
behave similarly in the presence of masks.
It also enables CGP sinking support for vp.fdiv and vp.fsub intrinsics,
now that VFRDIV and VFRSUB are consistently matched with a LHS splat for
masked and unmasked variants.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D117783
Instead of passing the both the SDNode* and 2 of the operands
in two different orders, just pass the SDNode * and a bool to
indicate which operand order to test.
While there rename to combineMUL_VLToVWMUL_VL.
This patch brings better splat-matching to our VP support, by sinking
splat operands of VP intrinsics back into the same block as the VP
operation. The list of VP intrinsics we are interested in matches that
of the regular instructions.
Some optimization is still lacking. For instance, our VL nodes aren't
recognized as commutative, so splats must be on the RHS. Because of
this, we limit our sinking of splats to just the RHS operand for now.
Improvement in this regard can come in another patch.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D117703
RISCV only has a unary shuffle that requires places indices in a
register. For interleaving two vectors this means we need at least
two vrgathers and a vmerge to do a shuffle of two vectors.
This patch teaches shuffle lowering to use a widening addu followed
by a widening vmaccu to implement the interleave. First we extract
the low half of both V1 and V2. Then we implement
(zext(V1) + zext(V2)) + (zext(V2) * zext(2^eltbits - 1)) which
simplifies to (zext(V1) + zext(V2) * 2^eltbits). This further
simplifies to (zext(V1) + zext(V2) << eltbits). Then we bitcast the
result back to the original type splitting the wide elements in half.
We can only do this if we have a type with wider elements available.
Because we're using extends we also have to be careful with fractional
lmuls. Floating point types are supported by bitcasting to/from integer.
The tests test a varied combination of LMULs split across VLEN>=128 and
VLEN>=512 tests. There a few tests with shuffle indices commuted as well
as tests for undef indices. There's one test for a vXi64/vXf64 vector which
we can't optimize, but verifies we don't crash.
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D117743
Similar for ceil, trunc, round, and roundeven. This allows us to use
static rounding modes to avoid a libcall.
This is similar to D116771, but for the saturating conversions.
This optimization is done for AArch64 as isel patterns.
RISCV doesn't have instructions for ceil/floor/trunc/round/roundeven
so the operations don't stick around until isel to enable a pattern
match. Thus I've implemented a DAG combine.
I'm only handling saturating to i64 or i32. This could be extended
to other sizes in the future.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D116864
This idea has come up in several reviews -- D115978 and D105902 -- so I
can't take any credit for the idea. Instead of using a constant pool to
lower -0.0, we can emit a sequence of two instructions:
fmv.[hwd].x freg, zero
fsgnjn.[hsd] freg, freg, freg
This is only done when the floating-point type is legal.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D117687
We may not be allowed to use vXiXLen vectors. Consult ELEN to
determine what is allowed. This will become even more important
when Zve32 is added.
Reviewed By: frasercrmck, arcbbb
Differential Revision: https://reviews.llvm.org/D117518
Remove fshl/fshr with constant shift amount isel patterns. Replace
with fsr/fsl with constant isel patterns.
This hack was trying to preserve as much optimization opportunity
for fshl/fshr by constant as possible, but the conversion to
RISCVISD::FSR/FSL happens so late it probably isn't worth much.
The new isel patterns are needed by D117468 anyway.
This reverts the revert commit e328385739.
Accidental demanded bits change has been removed. The demanded bits
code itself was remove in a pre-commit since it isn't tested.
Original commit message:
Previous we used the fshl/fshr operand ordering for simplicity. This
made things confusing when D117468 proposed adding intrinsics for
the instructions. We can't just use the generic funnel shifting
intrinsics because fsl/fsr have different functionality that should
be exposed to software.
Now we use rs1, rs3, rs2/shamt order which matches the instruction
printing order and the order used in this intrinsic header
https://github.com/riscv/riscv-bitmanip/blob/main-history/cproofs/rvintrin.h
Testing may be easier after D117468. Right now we get demanded bits
optimizations done on ISD::FSHL/FSHR before they become FSR/FSL. This
makes it hard to test.
Previous we used the fshl/fshr operand ordering for simplicity. This
made things confusing when D117468 proposed adding intrinsics for
the instructions. We can't just use the generic funnel shifting
intrinsics because fsl/fsr have different functionality that should
be exposed to software.
Now we use rs1, rs3, rs2/shamt order which matches the instruction
printing order and the order used in this intrinsic header
https://github.com/riscv/riscv-bitmanip/blob/main-history/cproofs/rvintrin.h
Currently, users expected VL is the last operand. However, since some
intrinsics has tail policy in the last operand, this rule cannot be used
anymore.
Reviewed By: craig.topper, frasercrmck
Differential Revision: https://reviews.llvm.org/D117452
Current SplatOperand starts from 1 because operand 0 (or 1) is intrinsic
id in SelectionDAG.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D117453
For fixed vectors, the undef will get expanded to an all zeros
build_vector. We don't want that so suppress creating the
insert_subvector.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D117379
We were considering this legal, but later the undef would become an all
zeros vector. This would cause us to need to re-legalize the insert later
into a vslideup with zero vector.
This patch catches the case and directly legalizes it to a scalable
insert.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D117377
When we know the value we're extending is a negative constant then it
makes sense to use SIGN_EXTEND because this may improve code quality in
some cases, particularly when doing a constant splat of an unpacked vector
type. For example, for SVE when splatting the value -1 into all elements
of a vector of type <vscale x 2 x i32> the element type will get promoted
from i32 -> i64. In this case we want the splat value to sign-extend from
(i32 -1) -> (i64 -1), whereas currently it zero-extends from
(i32 -1) -> (i64 0xFFFFFFFF). Sign-extending the constant means we can use
a single mov immediate instruction.
New tests added here:
CodeGen/AArch64/sve-vector-splat.ll
I believe we see some code quality improvements in these existing
tests too:
CodeGen/AArch64/reduce-and.ll
CodeGen/AArch64/unfold-masked-merge-vector-variablemask.ll
The apparent regressions in CodeGen/AArch64/fast-isel-cmp-vec.ll only
occur because the test disables codegen prepare and branch folding.
Differential Revision: https://reviews.llvm.org/D114357
Original patch by @hussainjk.
This patch was split off from D109377 to keep vector legalization
(widening/splitting) separate from vector element legalization
(promoting).
While the original patch added a third overload of
SelectionDAG::getVPStore, this patch takes the liberty of collapsing
those all down to 1, as three overloads seems excessive for a
little-used node.
The original patch also used ModifyToType in places, but that method
still crashes on scalable vector types. Seeing as the other VP
legalization methods only work when all operands need identical
widening, this patch follows in that vein.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D117235
These cases follow the same pattern, so they can be combined
to a unqiue function.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D117378
Specifically the unary shuffle case where the elements being
shifted in are undef. This handles the shuffles produce by expanding
llvm.reduce.mul.
I did not reduce the VL which would increase the number of vsetvlis,
but may improve the execution speed. We'd also want to narrow the
multiplies so we could share vsetvlis between the vslidedown.vi and
the next multiply.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D117239
It appears the code here was written for the inline asm clobbering
a specific register, but it also gets used for named input and
output registers.
For the input and output case, we should honor the VT so we
don't insert conversion instructions around the inline assembly.
For the clobber, case we need to pick the largest register class.
Reviewed By: asb, jrtc27
Differential Revision: https://reviews.llvm.org/D117279
We could use vmv.v.i/vmv.v.x whose eew is 32 to lower the i64 splat vector if the i64 constant scalar could be splitted into two same i32 scalar.
Differential Revision: https://reviews.llvm.org/D117079
When we know the value we're extending is a negative constant then it
makes sense to use SIGN_EXTEND because this may improve code quality in
some cases, particularly when doing a constant splat of an unpacked vector
type. For example, for SVE when splatting the value -1 into all elements
of a vector of type <vscale x 2 x i32> the element type will get promoted
from i32 -> i64. In this case we want the splat value to sign-extend from
(i32 -1) -> (i64 -1), whereas currently it zero-extends from
(i32 -1) -> (i64 0xFFFFFFFF). Sign-extending the constant means we can use
a single mov immediate instruction.
New tests added here:
CodeGen/AArch64/sve-vector-splat.ll
I believe we see some code quality improvements in these existing
tests too:
CodeGen/AArch64/dag-numsignbits.ll
CodeGen/AArch64/reduce-and.ll
CodeGen/AArch64/unfold-masked-merge-vector-variablemask.ll
The apparent regressions in CodeGen/AArch64/fast-isel-cmp-vec.ll only
occur because the test disables codegen prepare and branch folding.
Differential Revision: https://reviews.llvm.org/D114357
This adds support for STRICT_FSETCC(quiet) and STRICT_FSETCCS(signaling).
FEQ matches well to STRICT_FSETCC oeq.
FLT/FLE matches well to STRICT_FSETCCS olt/ole.
Others require commuting operands or multiple instructions.
STRICT_FSETCC olt/ole/ogt/oge/ult/ule/ugt/uge uses FLT/FLE,
but we need to save/restore FFLAGS around them to avoid spurious
exceptions. I've implemented pseudo instructions with a
CustomInserter to insert the save/restore CSR instructions.
Unfortunately, this doesn't honor exceptions for signaling NANs
but I'm not sure if signaling nans are really supported by the
constrained intrinsics.
STRICT_FSETCC one and ueq expand to a pair of FLT instructions
with a save/restore of fflags around each. This could be improved
in the future.
There may be some opportunities to generate better code for strict
comparisons mixed with nonans fast math flags. I've left FIXMEs in
the .td files for that.
Co-Authored-by: ShihPo Hung <shihpo.hung@sifive.com>
Reviewed By: arcbbb
Differential Revision: https://reviews.llvm.org/D116694
Similar for ceil, trunc, round, and roundeven. This allows us to use
static rounding modes to avoid a libcall.
This optimization is done for AArch64 as isel patterns.
RISCV doesn't have instructions for ceil/floor/trunc/round/roundeven
so the operations don't stick around until isel to enable a pattern
match. Thus I've implemented a DAG combine.
We only handle XLen types except i32 on RV64. i32 will be type
legalized to a RISCVISD node. All other types will be type legalized
to XLen and maintain the FP_TO_SINT/UINT ISD opcode.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D116771
The code can only address the whole RV32 address space or the lower 2 GiB
of the RV64 address space in small code model, so 32 bits entry is enough.
Cache hit ratio and code size have some improvements.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D116435
When `Zbt` is enabled, we can generate SELECT for division by power
of 2, so that there is no data dependency.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D114856
When we want to create an splat vector that only the first element is initialized, we could use vmv.s.x or vfmv.s.f to build it.
Differential Revision: https://reviews.llvm.org/D116277
The zextload hook is only used to determine whether to insert a
zero_extend or any_extend for narrow types leaving a basic block.
Returning true from this hook tends to cause any load whose output
leaves the basic block to become an LWU instead of an LW.
Since we tend to prefer sexts for i32 compares on RV64, this can
cause extra sext.w instructions to be created in other basic blocks.
If we use LW instead of LWU this gives the MIR pass from D116397
a better chance of removing them.
Another option might be to teach getPreferredExtendForValue in
FunctionLoweringInfo.cpp about our preference for sign_extend of
i32 compares. That would cause SIGN_EXTEND to be chosen for any
value used by a compare instead of using the isZExtFree heuristic.
That will require code to convert from the llvm::Type* to EVT/MVT
as well as querying the type legalization actions to get the
promoted type in order to call TargetLowering::isSExtCheaperThanZExt.
That seemed like many extra steps when no other target wants it.
Though it would avoid us needing to lean on the MIR pass in some cases.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D116567
This patch adds isel support for STRICT_LRINT/LLRINT/LROUND/LLROUND.
It also adds test cases for f32 and f64 constrained intrinsics that
correspond to the intrinsics in float-intrinsics.ll and
double-intrinsics.ll. Support for promoting the integer argument of
STRICT_FPOWI was added.
I've skipped adding tests for f16 intrinsics, since we don't have libcalls
for them and we have inconsistent support for promoting them in LegalizeDAG.
This will need to be examined more closely.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D116323
After consuming all vector registers, the scalable vector values will be
passed indirectly. The pointer values will be saved in general
registers. If all general registers are used up, we will report an error to
notify users the compiler does not support passing scalable vector
values through the stack. In this patch, we remove the restriction. After
all general registers are used up, we use the stack to save the
pointers which point to the indirect passed scalable vector values.
Differential Revision: https://reviews.llvm.org/D116310
The 'r' constraint uses the GPR class. There is generic support
for bitcasting and extending/truncating non-integer VTs to the
required integer VT. This doesn't work for scalable vectors and
instead crashes.
To prevent this, explicitly reject vectors. Fixed vectors might
work without crashing, but it doesn't seem worthwhile to allow.
While there remove an unnecessary level of indentation in the
"vr" and "vm" constraint handling.
Differential Revision: https://reviews.llvm.org/D115810
For fixed and scalable vectors, each intrinsic x is lowered to vmx.mm,
dropping the mask, which is safe to do as masked-off elements are
undef anyway.
Differential Revision: https://reviews.llvm.org/D115339
-0.0 requires a constant pool. +0.0 can be made with vmv.v.x x0.
Not doing this in getNeutralElement for fear of changing other targets.
Differential Revision: https://reviews.llvm.org/D115978
This adds support for strict conversions between fp types and between
integer and fp.
NOTE: RISCV has static rounding mode instructions, but the constrainted
intrinsic metadata is not used to select static rounding modes. Dynamic
rounding mode is always used.
Differential Revision: https://reviews.llvm.org/D115997
Enable transform (X & Y) == Y ---> (~X & Y) == 0 and (X & Y) != Y ---> (~X & Y) != 0 when have Zbb extension to use more andn instruction.
Differential Revision: https://reviews.llvm.org/D115922
Our Zfhmin support is only MC layer, but these are CodeGen layer
interfaces. If f16 isn't a Legal type for CodeGen with Zfhmin, then
these interfaces should keep their non-Zfh behavior.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D115822
Test that STRICT_FMINNUM/FMAXNUM are lowered to libcalls for f32/f64.
The RISC-V instructions don't match the behavior of fmin/fmax libcalls
with respect to SNaN.
Promoting FMINNUM/FMAXNUM for f16 needs more work outside of the
RISC-V backend.
Reviewed By: asb, arcbbb
Differential Revision: https://reviews.llvm.org/D115680
In order to support constrained FP intrinsics we need to model FRM
dependency. Whether or not a instruction uses FRM is based on a 3
bit field in the instruction. Because of this we can't add
'Uses = [FRM]' to the tablegen descriptions.
This patch examines the immediate after isel and adds an implicit
use of FRM. This idea came from Roger Ferrer Ibanez.
Other ideas:
We could be overly conservative and just pretend all instructions with
frm field read the FRM register. Or we could have pseudoinstructions
for CodeGen with rounding mode.
Reviewed By: asb, frasercrmck, arcbbb
Differential Revision: https://reviews.llvm.org/D115555
The reduction instructions only reads the first element. The
execution time for a splat may take longer with a larger VL.
We should use the smallest VL we can.
Reviewed By: frasercrmck, HsiangKai
Differential Revision: https://reviews.llvm.org/D115536
- `vm` constraint is used for masking operand, which always v0.
- Update testcase, only masking operand should use `vm`, vector mask operations
should just use `vr` for any vector register.
- Revise the description of `vm` constraint.
- This patch also fix issue on RISCVRegisterInfo.td and RISCVISelLowering.cpp.
RISCVRegisterInfo.td:
- The first VT in the list must be the largest total size since the
SelectionDAGBuilder uses the first register in the list as the canonical
type for the register.
RISCVISelLowering.cpp:
- Fix RISCVTargetLowering::splitValueIntoRegisterParts and
RISCVTargetLowering::joinRegisterPartsIntoValue for handling vectors
with different total size, that will happened on fractional LMUL since
fractional LMUL is always occupy one vector register.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D112599
The immediate size check on StepNumerator did not take into account
that vmul.vi does not exist. It also did not account for power of 2
constants that can be done with vshl.vi.
This patch fixes this by moving the conversion from mul to shift
further up. Then we can consider the immediates separately for MUL
vs SHL. For MUL I've allowed simm12 which requires a single addi
before a vmul.vx. For SHL I've allowed any uimm5 which works with
vshl.vi. We could relax these further in the future. This is a
starting point that allows us to emit the same number of instructions
we were already using for smaller numerators.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D115081
This prevents scalarization of fixed vector operations or crashes
on scalable vectors.
We don't have direct support for these operations. To emulate
ftrunc we can convert to the same sized integer and back to fp using
round to zero. We don't need to do a convert if the value is large
enough to have no fractional bits or is a nan.
The ceil and floor lowering would be better if we changed FRM, but
we don't model FRM correctly yet. So I've used the trunc lowering
with a conditional add or subtract with 1.0 if the truncate rounded
in the wrong direction.
There are also missed opportunities to use masked instructions.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D113543
This adds a fold in DAGCombine to create fptosi_sat from sequences for
smin(smax(fptosi(x))) nodes, where the min/max saturate the output of
the fp convert to a specific bitwidth (say INT_MIN and INT_MAX). Because
it is dealing with smin(/smax) in DAG they may currently be ISD::SMIN,
ISD::SETCC/ISD::SELECT, ISD::VSELECT or ISD::SELECT_CC nodes which need
to be handled similarly.
A shouldConvertFpToSat method was added to control when converting may
be profitable. The original fptosi will have a less strict semantics
than the fptosisat, with less values that need to produce defined
behaviour.
This especially helps on ARM/AArch64 where the vcvt instructions
naturally saturate the result.
Differential Revision: https://reviews.llvm.org/D111976
It causes builds to fail with this assert:
llvm/include/llvm/ADT/APInt.h:990:
bool llvm::APInt::operator==(const llvm::APInt &) const:
Assertion `BitWidth == RHS.BitWidth && "Comparison requires equal bit widths"' failed.
See comment on the code review.
> This adds a fold in DAGCombine to create fptosi_sat from sequences for
> smin(smax(fptosi(x))) nodes, where the min/max saturate the output of
> the fp convert to a specific bitwidth (say INT_MIN and INT_MAX). Because
> it is dealing with smin(/smax) in DAG they may currently be ISD::SMIN,
> ISD::SETCC/ISD::SELECT, ISD::VSELECT or ISD::SELECT_CC nodes which need
> to be handled similarly.
>
> A shouldConvertFpToSat method was added to control when converting may
> be profitable. The original fptosi will have a less strict semantics
> than the fptosisat, with less values that need to produce defined
> behaviour.
>
> This especially helps on ARM/AArch64 where the vcvt instructions
> naturally saturate the result.
>
> Differential Revision: https://reviews.llvm.org/D111976
This reverts commit 52ff3b0093.
This adds a fold in DAGCombine to create fptosi_sat from sequences for
smin(smax(fptosi(x))) nodes, where the min/max saturate the output of
the fp convert to a specific bitwidth (say INT_MIN and INT_MAX). Because
it is dealing with smin(/smax) in DAG they may currently be ISD::SMIN,
ISD::SETCC/ISD::SELECT, ISD::VSELECT or ISD::SELECT_CC nodes which need
to be handled similarly.
A shouldConvertFpToSat method was added to control when converting may
be profitable. The original fptosi will have a less strict semantics
than the fptosisat, with less values that need to produce defined
behaviour.
This especially helps on ARM/AArch64 where the vcvt instructions
naturally saturate the result.
Differential Revision: https://reviews.llvm.org/D111976
On RISC-V, icmp is not sunk (as the following snippet shows) which
generates the following suboptimal branch pattern:
```
core_list_find:
lh a2, 2(a1)
seqz a3, a0 <<
bltz a2, .LBB0_5
bnez a3, .LBB0_9 << should sink the seqz
[...]
j .LBB0_9
.LBB0_5:
bnez a3, .LBB0_9 << should sink the seqz
lh a1, 0(a1)
[...]
```
due to an icmp not being sunk.
The blocks after `codegenprepare` look as follows:
```
define dso_local %struct.list_head_s* @core_list_find(%struct.list_head_s* readonly %list, %struct.list_data_s* nocapture readonly %info) local_unnamed_addr #0 {
entry:
%idx = getelementptr inbounds %struct.list_data_s, %struct.list_data_s* %info, i64 0, i32 1
%0 = load i16, i16* %idx, align 2, !tbaa !4
%cmp = icmp sgt i16 %0, -1
%tobool.not37 = icmp eq %struct.list_head_s* %list, null
br i1 %cmp, label %while.cond.preheader, label %while.cond9.preheader
while.cond9.preheader: ; preds = %entry
br i1 %tobool.not37, label %return, label %land.rhs11.lr.ph
```
where the `%tobool.not37` is the result of the icmp that is not sunk.
Note that it is computed in the basic-block up until what becomes the
`bltz` instruction and the `bnez` is a basic-block of its own.
Compare this to what happens on AArch64 (where the icmp is correctly sunk):
```
define dso_local %struct.list_head_s* @core_list_find(%struct.list_head_s* readonly %list, %struct.list_data_s* nocapture readonly %info) local_unnamed_addr #0 {
entry:
%idx = getelementptr inbounds %struct.list_data_s, %struct.list_data_s* %info, i64 0, i32 1
%0 = load i16, i16* %idx, align 2, !tbaa !6
%cmp = icmp sgt i16 %0, -1
br i1 %cmp, label %while.cond.preheader, label %while.cond9.preheader
while.cond9.preheader: ; preds = %entry
%1 = icmp eq %struct.list_head_s* %list, null
br i1 %1, label %return, label %land.rhs11.lr.ph
```
This is caused by sinkCmpExpression() being skipped, if multiple
condition registers are supported.
Given that the check for multiple condition registers affect only
sinkCmpExpression() and shouldNormalizeToSelectSequence(), this change
adjusts the RISC-V target as follows:
* we no longer signal multiple condition registers (thus changing
the behaviour of sinkCmpExpression() back to sinking the icmp)
* we override shouldNormalizeToSelectSequence() to let always select
the preferred normalisation strategy for our backend
With both changes, the test results remain unchanged. Note that without
the target-specific override to shouldNormalizeToSelectSequence(), there
is worse code (more branches) generated for select-and.ll and select-or.ll.
The original test case changes as expected:
```
core_list_find:
lh a2, 2(a1)
bltz a2, .LBB0_5
beqz a0, .LBB0_9 <<
[...]
j .LBB0_9
.LBB0_5:
beqz a0, .LBB0_9 <<
lh a1, 0(a1)
[...]
```
Differential Revision: https://reviews.llvm.org/D98932
If we have a large enough floating point type that can exactly
represent the integer value, we can convert the value to FP and
use the exponent to calculate the leading/trailing zeros.
The exponent will contain log2 of the value plus the exponent bias.
We can then remove the bias and convert from log2 to leading/trailing
zeros.
This doesn't work for zero since the exponent of zero is zero so we
can only do this for CTLZ_ZERO_UNDEF/CTTZ_ZERO_UNDEF. If we need
a value for zero we can use a vmseq and a vmerge to handle it.
We need to be careful to make sure the floating point type is legal.
If it isn't we'll continue using the integer expansion. We could split the vector
and concatenate the results but that needs some additional work and evaluation.
Differential Revision: https://reviews.llvm.org/D111904
Previously these would crash. I don't think these can be generated
directly from C. Not sure if any optimizations can introduce them.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D113527
Not all scalar element types are allowed in vectors so we may not
be able to bitcast to a 1 element vector to use insert/extract.
This will become a bigger issue when the Zve extensions are commited.
For now, I'm using the ELEN limit to limit the element types.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D113219
This is consistent with what we do for other operands that are required
to be constants.
I don't think this results in any real changes. The pattern match
code for isel treats ConstantSDNode and TargetConstantSDNode the same.
These and MULHS/MULHU both default to Legal. Targets need to set
the ones they don't support to Expand.
I think MULHS/MULHU likely has priority in most places so this
change probably isn't directly testable. I found it while looking
at disabling MULHS/MULHU for nxvXi64 as required for Zve64x.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D113325
Similar to D110206, this patch optimizes unmasked vp.load intrinsics to
avoid the need of a vmset instruction to set the mask. It does so by
selecting a riscv_vle intrinsic rather than a riscv_vle_mask intrinsic.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D113022
Add new hasVInstructions() which is currently equivalent.
Replace vector uses of hasStdExtZfh/F/D with new vector specific
versions. The vector spec no longer requires that the vectors implement the
same types as scalar. It only requires that the scalar type is
the maximum size the vectors can support. This is currently
implemented using the scalar rule we were using before.
Add new hasVInstructionsI64() begin using to qualify code that
requires i64 vector elements.
This is all NFC for now, but we can start using this to better
implement D112408 which introduces the Zve extensions.
Reviewed By: frasercrmck, eopXD
Differential Revision: https://reviews.llvm.org/D112496
This can avoid a loss of decoupling with the scalar unit on cores
with decoupled scalar and vector units.
We should support FP too, but those use extract_element and not a
custom ISD node so it is a little different. I also left a FIXME
in the test for i64 extract and store on RV32.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D109482
If one input of a fixed vector multiply is a sign/zero extend and
the other operand is a splat of a scalar, we can use a widening
multiply if the scalar value has sufficient sign/zero bits.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D110028
This particular case was creating a `VMSET_VL` using the old
fixed-length type in order to pass a mask to other custom nodes
operating on the scalable container type. This kind of thing wasn't
caught for us; I only noticed when experimenting with odd-length
vectors, where it was trying to generate an invalid `v3i1` MVT.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D110420
Add the tail policy argument to LLVM IR intrinsics. There are two policies for tail elements. Tail agnostic means users do not care about the values in the tail elements and tail undisturbed means the values in the tail elements need to be kept after the operation. In order to let users control the tail policy, we add an additional argument at the end of the argument list.
For unmasked operations, we have no maskedoff and the tail policy is always tail agnostic. If users want to keep tail elements under unmasked operations, they could use all one mask in the masked operations to do it. So, we only add the additional argument for masked operations for most cases. There are exceptions listed below.
In this patch, we do not handle the following cases to reduce the complexity of the patch. There could be two separate patches for them.
* Use dest argument to control tail policy
vmerge.vvm/vmerge.vxm/vmerge.vim (add _t builtins with additional dest argument)
vfmerge.vfm (add _t builtins with additional dest argument)
vmv.v.v (add _t builtins with additional dest argument)
vmv.v.x (add _t builtins with additional dest argument)
vmv.v.i (add _t builtins with additional dest argument)
vfmv.v.f (add _t builtins with additional dest argument)
vadc.vvm/vadc.vxm/vadc.vim (add _t builtins with additional dest argument)
vsbc.vvm/vsbc.vxm (add _t builtins with additional dest argument)
* Always has tail argument for masked/unmasked intrinsics
Vector Single-Width Integer Multiply-Add Instructions (add _t and _mt builtins)
Vector Widening Integer Multiply-Add Instructions (add _t and _mt builtins)
Vector Single-Width Floating-Point Fused Multiply-Add Instructions (add _t and _mt builtins)
Vector Widening Floating-Point Fused Multiply-Add Instructions (add _t and _mt builtins)
Vector Reduction Operations (add _t and _mt builtins)
Vector Slideup Instructions (add _t and _mt builtins)
Vector Slidedown Instructions (add _t and _mt builtins)
Discussion: https://github.com/riscv/rvv-intrinsic-doc/pull/101
Differential Revision: https://reviews.llvm.org/D105092
This patch adds codegen support for lowering the vector-predicated
reduction intrinsics to RVV instructions. The process is similar to that
of the other reduction intrinsics, save for the fact that every VP
reduction has a start value. We reuse the existing custom "VL" nodes,
adding extra patterns where required to handle non-true masks.
To support these nodes, the `RISCVISD::VECREDUCE_*_VL` nodes have been
given an explicit "merge" operand. This is to faciliate the VP
reductions, where we must be careful to ensure that even if no operation
is performed (when VL=0) we still produce the start value. The RVV
reductions don't update the destination register under these conditions,
so we tie the splatted start value to the output register.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D107657
We can use riscv_vse intrinsic instead of riscv_vse_mask. The code here
is based on similar code for handling masked.scatter and vp.scatter.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D110206
This requires a minor change to CodeGenPrepare to ensure that
shouldSinkOperands will be called for And.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D110106
Optimize (add (mul x, c0), c1) -> (ADDI (MUL (ADDI, c1/c0), c0), c1%c0),
if c1/c0 and c1%c0 are simm12, while c1 is not.
Optimize (add (mul x, c0), c1) -> (MUL (ADDI, c1/c0), c0),
if c1%c0 is zero, and c1/c0 is simm12 while c1 is not.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D108607
For strided accesses the loop vectorizer seems to prefer creating a
vector induction variable with a start value of the form
<i32 0, i32 1, i32 2, ...>. This value will be incremented each
loop iteration by a splat constant equal to the length of the vector.
Within the loop, arithmetic using splat values will be done on this
vector induction variable to produce indices for a vector GEP.
This pass attempts to dig through the arithmetic back to the phi
to create a new scalar induction variable and a stride. We push
all of the arithmetic out of the loop by folding it into the start,
step, and stride values. Then we create a scalar GEP to use as the
base pointer for a strided load or store using the computed stride.
Loop strength reduce will run after this pass and can do some
cleanups to the scalar GEP and induction variable.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D107790
LICM may have pulled out a splat, but with .vx instructions we
can fold it into an operation.
This patch enables CGP to reverse the LICM transform and move the
splat back into the loop.
I've started with the commutable integer operations and shifts, but we can
extend this with more operations in future patches.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D109394
Since i128 isn't a legal C type on RV32, I don't believe
libgcc implements these functions for RV32. compiler-rt
does implement them because i128 support is enabled
in order to handle long double.
This is consistent with 32-bit X86 and ARM.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D109383
This patch adds support for the vector-predicated `VP_STORE` and
`VP_LOAD` nodes. We do this in the same way we lower `MSTORE` and
`MLOAD`: to regular load/store instructions via intrinsics.
One necessary change was made to `SelectionDAGLegalize` so that
`VP_STORE` nodes' operation actions are taken from the stored "value"
operands, in the same vein as `STORE` or `MSTORE`.
Reviewed By: craig.topper, rogfer01
Differential Revision: https://reviews.llvm.org/D108999
This patch adds support for the `VP_SCATTER` and `VP_GATHER` nodes by
lowering them to RVV's `vsox`/`vlux` instructions, respectively. This
process is almost identical to the existing `MSCATTER`/`MGATHER` support.
One extra change was made to `SelectionDAGLegalize` so that
`VP_SCATTER`'s operation action is derived from its stored "value"
operand rather than its return type (which is always the chain).
Reviewed By: craig.topper, rogfer01
Differential Revision: https://reviews.llvm.org/D108987
This patch changes the register class to avoid accidentally setting
the AVL operand to X0 through MachineIR optimizations.
There are cases where we really want to use X0, but we can't get that
past the MachineVerifier with the register class as GPRNoX0. So I've
use a 64-bit -1 as a sentinel for X0. All other immediate values should
be uimm5. I convert it to X0 at the earliest possible point in the VSETVLI
insertion pass to avoid touching the rest of the algorithm. In
SelectionDAG lowering I'm using a -1 TargetConstant to hide it from
instruction selection and treat it differently than if the user
used -1. A user -1 should be selected to a register since it doesn't
fit in uimm5.
This is the rest of the changes started in D109110. As mentioned there,
I don't have a failing test from MachineIR optimizations anymore.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D109116
If the true and false values are the same, we don't need a SELECT_CC.
This would normally be folded before a select is legalized to
select_cc. The test case exploits the late legalization of vscale
to trigger a case where they become identical after legalization.
This works around an issue found on a test case in D107957. In that
case the true/false values were both eventually 0 and the select was
used by a vector AVL operand. The select_cc got expanded to control
flow and a phi, but the phi inputs were both copies from X0. MachineIR
optimizations simplified this to a single copy from X0 going into the
vector instruction. This became the input of a vsetvli after vsetvli
insertion. Then register coalescing folded the copy into the vsetvli.
X0 as the source of a vsetvli is a special encoding and should not be
created by coalesing. We need to fix our vsetvli handling to make sure
this can never happen any other way, but removing the unneeded select
is still a worthwhile optimization.
Similar to D108842, D108844, D108926, D108928, and D108936.
__has_builtin(builtin_mul_overflow) returns true for 32b RISCV targets,
but Clang is deferring to compiler RT when encountering long long types.
If the semantics of __has_builtin mean "the compiler resolves these,
always" then we shouldn't conditionally emit a libcall.
Link: https://bugs.llvm.org/show_bug.cgi?id=28629
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D108939
This adds an ELEN limit for fixed length vectors. This will scalarize
any elements larger than this. It will also disable some fractional
LMULs. For example, if ELEN=32 then mf8 becomes illegal, i32/f32
vectors can't use any fractional LMULs, i16/f16 can only use mf2,
and i8 can use mf2 and mf4.
We may also need something for the scalable vectors, but that has
interactions with the intrinsics and we can't scalarize a scalable
vector.
Longer term this should come from one of the Zve* features
Similar to what we do for add/sub/mul.
This can help remove some sext.w. There are some regressions on
some bswap tests, but I have an idea how to fix that for a follow up.
A new PACKW pattern is added to handle the new sext_inreg placement.
Differential Revision: https://reviews.llvm.org/D108663
This encapsulates the APInt creation and worklist management into
a helper function.
To keep one common interface I've use Log2_32 in places that
previously created a mask by subtracting 1 from a power of 2.
Differential Revision: https://reviews.llvm.org/D108324
We already do this for non-constants RHS. This just removes the
special case. I believe the special case may have been needed
because the ANY_EXTEND of a constant used to create zero extended
constants, but we recently changed that to produce sign extended
constants.
D107658 is needed to prevent some regressions.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D107697
Similar for sub except sub isn't commutative.
Modify the existing and/or/xor folds to also work on ISD::SELECT
and not just RISCVISD::SELECT_CC. This is needed to make sure
we do this transform before type legalization turns i32 add/sub
into add/sub+sign_extend_inreg on RV64. If we don't do this before
that, the sign_extend_inreg will still be after the select.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D107603
Shuffles which are broken into separate halves reveal splats in which
a half is accessed via one index; such operations can be optimized to
use "vrgather.vi".
This optimization could be achieved by adding extra patterns to match
`vrgather_vv_vl` which uses a splat as an index operand, but this patch
instead identifies splat earlier. This way, future optimizations can
build on top of the data gathered here, e.g., to splat-gather dominant
indices and insert any leftovers.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D107449
Previously we converted ISD condition codes to integers and stored
them directly in our MIR instructions. The ISD enum kind of belongs
to SelectionDAG so that seems like incorrect layering.
This patch instead uses a CondCode node on RISCV::SELECT_CC until
isel and then converts it from ISD encoding to a RISCV specific value.
This value can be converted to/from the RISCV branch opcodes in the
RISCV namespace.
My larger motivation is to possibly support a microarchitectural
feature of some CPUs where a short forward branch over a single
instruction can be predicated internally. This will require a new
pseudo instruction for select that needs to carry a branch condition
and live probably until RISCVExpandPseudos. At that point it can be
expanded to control flow without other instructions ending up in the
predicated basic block. Using an ISD encoding in RISCVExpandPseudos
doesn't seem like correct layering.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D107400
The fcvt fp to integer instructions saturate if their input is
infinity or out of range, but the instructions produce a maximum
integer for nan instead of 0 required for the ISD opcodes.
This means we can use the instructions to do the saturating
conversion, but we'll need to fix up the nan case at the end.
We can probably improve the i8 and i16 default codegen as well,
but I'll leave that for a follow up.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D107230
This patch extends the optimization of VID-sequence BUILD_VECTORs
introduced in D104921 to include simple fractional steps composed of a
separated integer numerator and denominator.
A notable limitation in this sequence detection is that only sequences
with steps N/1 or 1/D are found, meaning that the step between elements
and the frequency with which it changes is consistent across the whole
sequence. Fractional steps such as 2/3 won't be matched as those would
involve more complex tracking of state or some level of backtracking.
As is stands, however, this patch is sufficient to match common
interleave-type shuffle indices, for example matching `<0,0,1,1>` (or
commonly `<0,u,1,u>` or `<u,0,u,1>`) to an index sequence divided by 2.
While the optimization is relatively `undef`-tolerant, due to greedy
pattern-matching there even are some simple patterns which confuse the
sequence detection into identifying either a suboptimal sequence or no
sequence at all.
Currently only fractional-step sequences identified as having a
power-of-two denominator are actually lowered to RVV instructions. This
is to avoid introducing divisions into the generated code.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D106533
This patch aims to improve the performance of BUILD_VECTORs which are
identified as containing a dominant element. Given that most
floating-point constants themselves require a load from the constant
pool, it was possible for the optimization to actually increase the
number of individual loads on small vectors. The exception is the zero
constant -- +0.0 -- which can be materialized efficiently.
While this optimization could do with a proper cost model to weigh the
benfits of a single vector load vs. the manipulation of individual
elements -- even for integer vectors which often require several
instructions to materialize -- without a concrete RVV implementation to
work with any heuristic is likely to be both more obtuse and inaccurate.
Until then, this patch fixes at least one known obvious deficiency.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D106963
The sign_extend we insert here can get turned into a zero_extend if
the sign bit is known zero. This can enable a setcc combine that
shrinks compares with zero_extend. This reduces the use count of
the zero_extend allowing other combines to turn it back into an
any_extend.
This restricts the combine to only cases where the result is used
by a CopyToReg. This works for my original motivating case. I
hope the CopyToReg use will prevent any converted extends from
turning back into an any_extend.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D106754
This patch adds support for lowering the saturating vector add/sub
intrinsics to RVV instructions, for both fixed-length and
scalable-vector forms alike.
Note that some of the DAG combines are still not triggering for the
scalable-vector tests. These require a bit more work in the DAGCombiner
itself.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D106651
I stumbled onto a case where our (sext_inreg (assertzexti32 (fptoui X)), i32)
isel pattern can cause an fcvt.wu and fcvt.lu to be emitted if
the assertzexti32 has an additional user. If we add a one use check
it would just cause a fcvt.lu followed by a sext.w when only need
a fcvt.wu to satisfy both users.
To mitigate this I've added custom isel and new ISD opcodes for
fcvt.wu. This allows us to keep know it started life as a conversion
to i32 without needing to match multiple nodes. ComputeNumSignBits
has been taught that this new nodes produces 33 sign bits. To
prevent regressions when we need to zero extend the result of an
(i32 (fptoui X)), I've added a DAG combine to convert it to an
(i64 (fptoui X)) before type legalization. In most cases this would
happen in InstCombine, but a zero_extend can be created for function
returns or arguments.
To keep everything consistent I've added new nodes for fptosi as well.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D106346
Lowering certain float vectors without legal vector types could cause a
crash due to a bad interaction between passing floats via GPRs and
argument splitting. Split vector floats appear just like scalar floats.
Under certain situations we choose to pass these float arguments via
GPRs and use an XLenVT location and set the 'BCvt' info to track how
they must be converted back to floating-point values. However, later
logic for handling split arguments may take over, in which case we lose
the previous information and set the 'Indirect' info, thus incorrectly
lowering to integer types.
I don't believe that we would have come across the notion of split
floating-point arguments before. This patch addresses the issue by
updating the lowering so that split arguments are only passed indirectly
when they are scalar integer types.
This has some change to how we lower some larger illegal float vectors,
as can be seen in 'fastcc-float.ll' where the vector is now passed
partly in registers and partly on the stack.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D102852
This relands a6ca88e908 which was originally
reverted due to overflow bugs in e3fa2b1eab.
This patch teaches the compiler to identify a wider variety of
`BUILD_VECTOR`s which form integer arithmetic sequences, and to lower
them to `vid.v` with modifications for non-unit steps and non-zero
addends.
The sequences handled by this optimization must either be monotonically
increasing or decreasing. Consecutive elements holding the same value
indicate a fractional step which, while simple mathematically,
becomes more complex to handle both in the realm of lossy integer
division and in the presence of `undef`s.
For example, a common "interleaving" shuffle index will be lowered by
LLVM to both `<0,u,1,u,2,...>` and `<u,0,u,1,u,...>` `BUILD_VECTOR`
nodes. Either of these would ideally be lowered to `vid.v` shifted right
by 1. Detection of this sequence in presence of general `undef` values
is more complicated, however: `<0,u,u,1,>` could match either
`<0,0,0,1,>` or `<0,0,1,1,>` depending on later values in the sequence.
Both are possible, so backtracking or multiple passes is inevitable.
Sticking to monotonic sequences keeps the logic simpler as it can be
done in one pass. Fractional steps will likely be a separate
optimization in a future patch.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D104921
The existing rule about the operand type is strange. Instead, just say
the operand is a TargetConstant with the right width. (Legalization
ignores TargetConstants, so it doesn't matter if that width is legal.)
Highlights:
1. I had to substantially rewrite the AArch64 isel patterns to expect a
TargetConstant. Nothing too exotic, but maybe a little hairy. Maybe
worth considering a target-specific node with some dagcombines instead
of this complicated nest of isel patterns.
2. Our behavior on RV32 for vectors of i64 has changed slightly. In
particular, we correctly preserve the width of the arithmetic through
legalization. This changes the DAG a bit. Maybe room for
improvement here.
3. I explicitly defined the behavior around overflow. This is necessary
to make the DAGCombine transforms legal, and I don't think it causes any
practical issues.
Differential Revision: https://reviews.llvm.org/D105673
If we need to shift left anyway we might be able to take advantage
of LUI implicitly shifting its immediate left by 12 to cover part
of the shift. This allows us to use more bits of the LUI immediate
to avoid an ADDI.
isDesirableToCommuteWithShift now considers compressed instruction
opportunities when deciding if commuting should be allowed.
I believe this is the same or similar to one of the optimizations
from D79492.
Reviewed By: luismarques, arcbbb
Differential Revision: https://reviews.llvm.org/D105417
I don't think the semantics of the llvm masked gather intrinsic care
about the order the elements are loaded. For example, type legalization
by splitting will chain them in parallel. This is different than
scatter which we do chain in order.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D106025
RISCV would prefer a sign extended constant since that works better
with our constant materialization. We have an existing TLI hook we
use to control sign extension of setcc operands in type legalization.
That hook happens to do the right check we need here, but might be
straying from its original purpose. With only RISCV defining this
hook in tree, I wasn't sure if it was worth adding another hook
with identical behavior.
This is an alternative to D105785 where I tried to handle this in
the RISCV backend by not creating ANY_EXTENDs in some places.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D105918
We assume VLENB is a multiple of 8 and previously relied on shift
pairs being optimized to an AND+SHL/SHR and computeKnownBits
removing the AND. This doesn't happen if (vlenb >> 3) gets CSEd
to have multiple uses. This patch manually emits the best shift
to workaround this.
If the upper 32 bits are zero and bit 31 is set, we might be able to
use zext.w to fill in the zeros after using an lui and/or addi.
Most of this patch is plumbing the subtarget features into the constant
materialization.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D105509
This patch teaches the compiler to identify a wider variety of
`BUILD_VECTOR`s which form integer arithmetic sequences, and to lower
them to `vid.v` with modifications for non-unit steps and non-zero
addends.
The sequences handled by this optimization must either be monotonically
increasing or decreasing. Consecutive elements holding the same value
indicate a fractional step which, while simple mathematically,
becomes more complex to handle both in the realm of lossy integer
division and in the presence of `undef`s.
For example, a common "interleaving" shuffle index will be lowered by
LLVM to both `<0,u,1,u,2,...>` and `<u,0,u,1,u,...>` `BUILD_VECTOR`
nodes. Either of these would ideally be lowered to `vid.v` shifted right
by 1. Detection of this sequence in presence of general `undef` values
is more complicated, however: `<0,u,u,1,>` could match either
`<0,0,0,1,>` or `<0,0,1,1,>` depending on later values in the sequence.
Both are possible, so backtracking or multiple passes is inevitable.
Sticking to monotonic sequences keeps the logic simpler as it can be
done in one pass. Fractional steps will likely be a separate
optimization in a future patch.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D104921
Using positive zero as the neutral element in 'fadd' reductions, while
it generates better code, is incorrect. The correct neutral element is
negative zero: 0.0 + -0.0 = 0.0, whereas -0.0 + -0.0 = -0.0.
There are perhaps more optimal lowerings of negative zero avoiding
constant-pool loads which could be left as future work.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D105902
We don't really have optimizations for division with a constant
LHS. If we don't use a W instruction we end up needing to sign
or zero extend the RHS to use the 64-bit instruction.
I had to sign_extend i32 constants on the LHS instead of using
any_extend which becomes zero_extend. If we don't do this, constants
that were originally negative become harder to materialize. I think
this problem exists for more of our W instruction cases. For example
(i32 (shl -1, X)), but we don't have lit tests. I'll work on that
as a follow up.
I also left a FIXME for enabling W instruction for RHS constants
under -Oz.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D105769
Similar to D46745, "S" represents an absolute symbolic operand, which
can be used to specify the access models, e.g.
extern int var;
void *addr_via_asm() {
void *ret;
asm("lui %0, %%hi(%1)\naddi %0,%0,%%lo(%1)" : "=r"(ret) : "S"(&var));
return ret;
}
'S' is documented in trunk GCC: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101275
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D105254
Often when lowering vector shuffles, we split the shuffle into two
LHS/RHS shuffles which are then blended together. To do so we split the
original indices into two, indexed into each respective vector. These
two index vectors are then separately lowered as BUILD_VECTORs.
This patch forwards on any undef indices to the BUILD_VECTOR, rather
than having the VECTOR_SHUFFLE lowering decide on an optimal concrete
index. The motiviation for ths change is so that we don't duplicate
optimization logic between the two lowering methods and let BUILD_VECTOR
do what it does best.
Propagating undef in this way allows us, for example, to generate
`vid.v` to produce the LHS indices of commonly-used interleave-type
shuffles. I have designs on further optimizing interleave-type and other
common shuffle patterns in the near future.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D104789
These are fp->int conversions using either RMM or dynamic rounding modes.
The lround and lrint opcodes have a return type of either i32 or
i64 depending on sizeof(long) in the frontend which should follow
xlen. llround/llrint should always return i64 so we'll need a libcall
for those on rv32.
The frontend will only emit the intrinsics if -fno-math-errno is in
effect otherwise a libcall will be emitted which will not use
these ISD opcodes.
gcc also does this optimization.
Reviewed By: arcbbb
Differential Revision: https://reviews.llvm.org/D105206
This adds a DAG combine to detect sext/zext inputs and emit a
new ISD opcode. The extends will either be removed or replaced
with narrower extends.
Isel patterns are used to match add and widening mul to vwmacc
similar to the recently added vmacc patterns.
There's still some work to be to match vmulsu.
We should also rewrite splats that were extended as scalars and
then splatted.
Reviewed By: arcbbb
Differential Revision: https://reviews.llvm.org/D104802
It seems it is possible for DAG combine to create a shl with an
i64 result type and an i32 shift amount. This is ok before type
legalization since the type don't need to match in SelectionDAG.
This results in type legalization calling LowerOperation to
legalize just the amount. We weren't expecting this so we
asserted for not finding a fixed vector shift.
To fix this, I've added a check for the fixed vector case and
returned SDValue() to get the default type legalizer. I've
factored all shifts together and added a fixed vector specific
handler to avoid repeating similar code for each in
LowerOperation.
The particular case I found was exposed by D104581, but the bad
shift is created after that patch triggers.
If type legalization is going to insert a sign_extend for other users
of X and we can fold the sign_extend into ADDW/MULW/SUBW, it is
better to replace the ANY_EXTEND so we don't end up with a separate
ADD/MUL/SUB instruction for the users of the ANY_EXTEND.
I'm only handling setcc uses right now, but there are other
instructions that force sign_extends like ashr.
There are probably other *W instructions we could use in addition
to ADDW/SUBW/MULW.
My motivating case was a loop terminating compare and a phi use
as seen in the new test file.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D104581
This patch optimizes the code generation of vector-type SELECTs (LLVM
select instructions with scalar conditions) by custom-lowering to
VSELECTs (LLVM select instructions with vector conditions) by splatting
the condition to a vector. This avoids the default expansion path which
would either introduce control flow or fully scalarize.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D104772
With the exception of `frem`, this patch supports the current set of VP
floating-point binary intrinsics by lowering them to to RVV instructions. It
does so by using the existing `RISCVISD *_VL` custom nodes as an intermediate
layer. Both scalable and fixed-length vectors are supported by using this
method.
The `frem` node is unsupported due to a lack of available instructions. For
fixed-length vectors we could scalarize but that option is not (currently)
available for scalable-vector types. The support is intentionally left out so
it equivalent for both vector types.
The matching of vector/scalar forms is currently lacking, as scalable vector
types do not lower to the custom `VFMV_V_F_VL` node. We could either make
floating-point scalable vector splats lower to this node, or support the
matching of multiple kinds of splat via a `ComplexPattern`, much like we do for
integer types.
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D104237
This patch adds support for loading and storing unaligned vectors via an
equivalently-sized i8 vector type, which has support in the RVV
specification for byte-aligned access.
This offers a more optimal path for handling of unaligned fixed-length
vector accesses, which are currently scalarized. It also prevents
crashing when `LegalizeDAG` sees an unaligned scalable-vector load/store
operation.
Future work could be to investigate loading/storing via the largest
vector element type for the given alignment, in case that would be more
optimal on hardware. For instance, a 4-byte-aligned nxv2i64 vector load
could loaded as nxv4i32 instead of as nxv16i8.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D104032
This patch changes RVV's policy for its supported list of fixed-length
vector types by capping by vector size rather than element count. Now
all 1024-byte vectors (of supported element types) are supported, rather
than all 256-element vectors.
This is a more natural fit for the architecture, and allows us to, for
example, improve the support for vector bitcasts.
This change necessitated the adding of some new simple types to avoid
"regressing" on the number of currently-supported vectors. We round out
the 1024-byte types by adding `v512i8`, `v1024i8`, `v512i16` and
`v512f16`.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D103884
This patch is a simple fix which registers CONCAT_VECTORS as
custom-lowered for scalable mask vectors. This follows the pattern of
all other scalable-vector types, as the default expansion of
CONCAT_VECTORS cannot handle scalable types, and even if it did it'd go
through the stack and generate worse code.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D103896
Include known bits support so we know we don't need to zext the
output if the input was already zero extended.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D103757
We should be exiting when the shift amount is greater than
the bit width regardless of whether it is a power of 2.
Reported by Simon Pilgrim here https://reviews.llvm.org/D96661
This requires getting a shift amount that is out of bounds that
wasn't already optimized by SelectionDAG. This would be pretty
trick to construct a test for.
Or it would require a non-power of 2 shift amount and a mask
that has runs of ones and zeros of the next lowest power of 2 from
that shift amount. I tried a little to produce a test for this,
but didn't get it to work.
Don't require a specific kind of IRBuilder for TargetLowering hooks.
This allows us to drop the IRBuilder.h include from TargetLowering.h.
Differential Revision: https://reviews.llvm.org/D103759
RVV vectors must be aligned to their element types, so anything less is
unaligned.
For regular loads and stores, our custom-lowering of fixed-length
vectors meant that we opted out of LegalizeDAG's built-in unaligned
expansion. This patch adds that logic in to our custom lower function.
For masked intrinsics, we declare that anything unaligned is not legal,
leaving the ScalarizeMaskedMemIntrin pass to do the expansion for us.
Note that neither of these methods can handle the expansion of
scalable-vector memory ops, so those cases are left alone by this patch.
Scalable loads and stores already go through expansion by default but
hit an assertion, and scalable masked intrinsics will silently generate
incorrect code. It may be prudent to return an error in both of these
cases.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D102493
This patch extends the RISC-V lowering of the 'fastcc' calling
convention to vector types, both fixed-length and scalable. Without this
patch, any function passing or returning vector types by value would
throw a compiler error.
Vectors are handled in 'fastcc' much as they are in the default calling
convention, the noticeable difference being the extended set of scalar
GPR registers that can be used to pass vectors indirectly.
Reviewed By: HsiangKai
Differential Revision: https://reviews.llvm.org/D102505
This patch fixes a bug in lowering scalable-vector types in RISC-V's
main calling convention. When scalable-vector types are split and passed
indirectly, the target is responsible for scaling the offset --
initially set to the known-minimum store size -- by the scalable factor.
Before this we were issuing overlapping loads or stores to the different
parts, leading to incorrect codegen.
Credit to @HsiangKai for spotting this.
Reviewed By: HsiangKai
Differential Revision: https://reviews.llvm.org/D103262
This patch custom lowers FP_TO_[US]INT and [US]INT_TO_FP conversions
between floating-point and boolean vectors. As the default action is
scalarization, this patch both supports scalable-vector conversions and
improves the code generation for fixed-length vectors.
The lowering for these conversions can piggy-back on the existing
lowering, which lowers the operations to a supported narrowing/widening
conversion and then either an extension or truncation.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D103312
This patch adds a way for the target to configure the type it uses for
the explicit vector length operands of VP SDNodes. The type must be a
legal integer type (there is still no target-independent legalization of
this operand) and must currently be at least as big as i32, the type
used by the IR intrinsics. An implicit zero-extension takes place on
targets which choose a larger type. All VP nodes should be created with
this type used for the EVL operand.
This allows 64-bit RISC-V to avoid custom legalization of all VP nodes,
keeping them in their target-independent form for that bit longer.
Reviewed By: simoll
Differential Revision: https://reviews.llvm.org/D103027
DAGCombine's `mergeStoresOfConstantsOrVecElts` optimization is told
whether it's to use vector types and also whether it's to issue a
truncating store. However, the truncating store code path assumes a
scalar integer `ConstantSDNode`, and when using vector types it creates
either a `BUILD_VECTOR` or `CONCAT_VECTORS` to store: neither of which
is a constant.
The `riscv64` target is able to expose a crash here because it switches
on both code paths at the same time. The `f32` is stored as `i32` which
must be promoted to `i64`, necessitating a truncating store.
It also decides later that it prefers a vector store of `v2f32`.
While vector truncating stores are legal, this combine is not able to
emit them. We also don't have a test case. This patch adds an assert to
catch this case more gracefully, and updates one of the caller functions
to the function to turn off the use of truncating stores when preferring
vectors.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D103173
The vector calling convention dictates that when the vector argument
registers are exhaused, GPRs are used to pass the address via the stack.
When the GPRs themselves are exhausted, at best we would previously
crash with an assertion, and at worst we'd generate incorrect code.
This patch addresses this issue by passing fixed-length vectors via the
stack with their full fixed-length size and aligned to their element
type size. Since the calling convention lowering can't yet handle
scalable vector types, this patch adds a fatal error to make it clear
that we are lacking in this regard.
Reviewed By: HsiangKai
Differential Revision: https://reviews.llvm.org/D102422
This patch extends the cases in which the legalizer is able to express
VSELECT in terms of XOR/AND/OR. When dealing with a VSELECT between
boolean vector types, the mask itself is an all-ones or all-ones value
of the operand type, so a 0/1 boolean type behaves identically to a 0/-1
type.
This greatly helps RISC-V which relies on expansion for these nodes. It
also allows scalable-vector bool VSELECTs to use the default expansion,
where before it would crash in SelectionDAG::UnrollVectorOp.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D103147
SEW=64 shifts only uses the log2(64) bits of shift amount. If we're
splatting a 64 bit value in 2 parts, we can avoid splatting the
upper bits and just let the low bits be sign extended. They won't
be read anyway.
For the purposes of SelectionDAG semantics of the generic ISD opcodes,
if hi was non-zero or bit 31 of the low is 1, the shift was already
undefined so it should be ok to replace high with sign extend of low.
In order do be able to find the split i64 value before it becomes
a stack operation, I added a new ISD opcode that will be expanded
to the stack spill in PreprocessISelDAG. This new node is conceptually
similar to BuildPairF64, but I expanded earlier so that we could
go through regular isel to get the right VLSE opcode for the LMUL.
BuildPairF64 is expanded in a CustomInserter.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D102521
This is a replacement for D101938 for inserting vsetvli
instructions where needed. This new version changes how
we track the information in such a way that we can extend
it to be aware of VL/VTYPE changes in other blocks. Given
how much it changes the previous patch, I've decided to
abandon the previous patch and post this from scratch.
For now the pass consists of a single phase that assumes
the incoming state from other basic blocks is unknown. A
follow up patch will extend this with a phase to collect
information about how VL/VTYPE change in each block and
a second phase to propagate this information to the entire
function. This will be used by a third phase to do the
vsetvli insertion.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D102737
RVV code generation does not successfully custom-lower BUILD_VECTOR in all
cases. When it resorts to default expansion it may, on occasion, be expanded to
scalar stores through the stack. Unfortunately these stores may then be picked
up by the post-legalization DAGCombiner which merges them again. The merged
store uses a BUILD_VECTOR which is then expanded, and so on.
This patch addresses the issue by overriding the `mergeStoresAfterLegalization`
hook. A lack of granularity in this method (being passed the scalar type) means
we opt out in almost all cases when RVV fixed-length vector support is enabled.
The only exception to this rule are mask vectors, which are always either
custom-lowered or are expanded to a load from a constant pool.
Reviewed By: HsiangKai
Differential Revision: https://reviews.llvm.org/D102913
The default expansion for BUILD_VECTORs -- save for going through
shuffles -- is to go through the stack. This method only works when the
type is at least byte-sized, so for v2i1 and v4i1 we would crash.
This patch ensures that small mask-type BUILD_VECTORs are always handled
without crashing. We lower to a SETCC of the equivalent i8 type.
This also exposes some pre-existing issues where the lowering when
optimizing for size results in larger code than without. Those will be
tackled in future patches.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D102767
The use of `SelectionDAG::getSplatValue` isn't guaranteed to return a
type-legal splat value as it may implicitly extract a vector element
from another shuffle. It is not permitted to introduce an illegal type
when lowering shuffles.
This patch addresses the crash by adding a boolean flag to
`getSplatValue`, defaulting to false, which when set will ensure a
type-legal return value. If it is unable to do that it will fail to
return a splat value.
I've been through the existing uses of `getSplatValue` in other targets
and was unable to find a need or test cases showing a need to update
their uses. In some cases, the call is made during `LegalizeVectorOps`
which may still produce illegal scalar types. In other situations, the
illegally-typed splat value may be quickly patched up to a legal type
(such as any-extending the returned `extract_vector_elt` up to a legal
type) before `LegalizeDAG` notices.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D102687
Like the element extraction of these vectors, we choose to promote up to
an i8 vector type and perform the insertion there.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D102697
The VSEW encoding isn't a useful value to pass around. It's better
to use SEW or log2(SEW) directly. The only real ugliness is that
the vsetvli IR intrinsics use the VSEW encoding, but it's easy
enough to decode that when the intrinsic is processed.
My thought process is that if v2i64 is an LMUL=1 type then v2i32
should be an LMUL=1/2 type. We limit the fractional LMUL so that
SEW=64 clips to LMUL=1, SEW=32 clips to LMUL=1/2, etc. This
ensures there's always a fractional LMUL available to truncate a type.
This does reduce the number of vsetvlis in some cases.
Some tests increase vsetvlis because the best container type for a
mask type is dependent on the LMUL+SEW that the mask was produced
from, but you can't tell that from the type. I think this is
something we need to solve this in the machine IR when optimizing
vsetvlis.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D101215
This patch extends VectorLegalizer::ExpandSELECT to permit expansion
also for scalable vector types. The only real change is conditionally
checking for BUILD_VECTOR or SPLAT_VECTOR legality depending on the
vector type.
We can use this to fix "cannot select" errors for scalable vector
selects on the RISCV target. Note that in future patches RISCV will
possibly custom-lower vector SELECTs to VSELECTs for branchless codegen.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D102063
This patch supports all of the current set of VP integer binary
intrinsics by lowering them to to RVV instructions. It does so by using
the existing RISCVISD *_VL custom nodes as an intermediate layer. Both
scalable and fixed-length vectors are supported by using this method.
One notable change to the existing vector codegen strategy is that
scalable all-ones and all-zeros mask SPLAT_VECTORs are now lowered to
RISCVISD VMSET_VL and VMCLR_VL nodes to match their fixed-length
BUILD_VECTOR counterparts. This allows them to reuse the existing
"all-ones" VL patterns.
To reduce the size of the phabricator diff, some tests are intentionally
left out and will be added later if the patch is accepted.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D101826
Previously, RISC-V would make legal all fixed-length vectors types whose
size are less than or equal to some function of the minimum value of
VLEN and the maximum-permissible LMUL grouping.
Due to vector legalization issues, this patch instead caps the legal
fixed-length vector types to those with 256 elements. This value was
chosen because it is the longest vector length which has corresponding
MVTs across all supported element types.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D101839
This patch adds support for splatting i1 types to fixed-length or
scalable vector types. It does so by lowering the operation to a SETCC
of the equivalent i8 type.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D101465
This shrinks the immediate that isel table needs to emit for these
instructions. Hoping this allows me to change OPC_EmitInteger to
use a better variable length encoding for representing negative
numbers. Similar to what was done a few months ago for OPC_CheckInteger.
The alternative encoding uses less bytes for negative numbers, but
increases the number of bytes need to encode 64 which was a very
common number in the RISCV table due to SEW=64. By using Log2 this
becomes 6 and is no longer a problem.
DAGCombiner was recently taught how to combine STEP_VECTOR nodes,
meaning the step value is no longer guaranteed to be one by the time it
reaches the backend for lowering.
This patch supports such cases on RISC-V by lowering to other step
values to a multiply following the vid.v instruction. It includes a
small optimization for common cases where the multiply can be expressed
as a shift left.
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D100856
Similar for or/xor with 0 in place of -1.
This is the canonical form produced by InstCombine for something like `c ? x & y : x;` Since we have to use control flow to expand select we'll usually end up with a mv in basic block. By folding this we may be able to pull the and/or/xor into the block instead and avoid a mv instruction.
The code here is based on code from ARM that uses this to create predicated instructions. I'm doing it on SELECT_CC so it happens late, but we could do it on select earlier which is what ARM does. I'm not sure if we lose any combine opportunities if we do it earlier.
I left out add and sub because this can separate sext.w from the add/sub. It also made a conditional i64 addition/subtraction on RV32 worse. I guess both of those would be fixed by doing this earlier on select.
The select-binop-identity.ll test has not been commited yet, but I made the diff show the changes to it.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D101485
This replaces D98479.
This allows type legalization to form SPLAT_VECTOR_PARTS so we don't
lose the splattedness when the scalar type is split.
I'm handling SPLAT_VECTOR_PARTS for fixed vectors separately so
we can continue using non-VL nodes for scalable vectors.
I limited to RV32+vXi64 because DAGCombiner::visitBUILD_VECTOR likes
to form SPLAT_VECTOR before seeing if it can replace the BUILD_VECTOR
with other operations. Especially interesting is a splat BUILD_VECTOR of
the extract_vector_elt which can become a splat shuffle, but won't if
we form SPLAT_VECTOR first. We either need to reorder visitBUILD_VECTOR
or add visitSPLAT_VECTOR.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D100803
This seems like a reasonable upper bound on VL. WG discussions for
the V spec would probably allow us to use 2^16 as an upper bound
on VLEN, but this is good enough for now.
This allows us to remove sext and zext if user happens to assign
the size_t result into an int and then uses it as a VL intrinsic
argument which is size_t.
Reviewed By: frasercrmck, rogfer01, arcbbb
Differential Revision: https://reviews.llvm.org/D101472
This is an complementary/alternative fix for D99068. It takes a slightly
different approach by explicitly summing up all of the required split
part type sizes and ensuring we allocate enough space for them. It also
takes the maximum alignment of each part.
Compared with D99068 there are fewer changes to the stack objects in
existing tests. However, @luismarques has shown in that patch that there
are opportunities to reduce our stack usage in the future.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D99087
This adds a special operand type that is allowed to be either
an immediate or register. By giving it a unique operand type the
machine verifier will ignore it.
This perturbs a lot of tests but mostly it is just slightly different
instruction orders. Something bad did happen to some min/max reduction
tests. We're spilling vector registers when we weren't before.
Reviewed By: khchen
Differential Revision: https://reviews.llvm.org/D101246
This modifies my previous patch to push the strided load formation
to isel. This gives us opportunity to fold the splat into a .vx
operation first. Using a scalar register and a .vx operation reduces
vector register pressure which can be important for larger LMULs.
If we can't fold the splat into a .vx operation, then it can make
sense to use a strided load to free up the vector arithmetic
ALU to do actual arithmetic rather than tying it up with vmv.v.x.
Reviewed By: khchen
Differential Revision: https://reviews.llvm.org/D101138
We have several extensions that need i32 to be Custom for
INTRINSIC_WO_CHAIN with RV64 so enable it for all RV64.
For V extension, make i32 Custom for RV64 and i64 Custom for RV32.
When the i32 or i64 is legal, the operation action doesn't matter.
LegalizeDAG checks MVT::Other rather than the real type.
This teaches DAG combine that shift amount operands for grev, gorc
shfl, unshfl only read a few bits.
This also teaches DAG combine that grevw, gorcw, shflw, unshflw,
bcompressw, bdecompressw only consume the lower 32 bits of their
inputs.
In the future we can teach SimplifyDemandedBits to also propagate
demanded bits of the output to the inputs in some cases.
Use getContainerForFixedLengthVector and getRegClassIDForVecVT to
get the register class to use when making a fixed vector type legal.
Inline it into the other two call sites.
I'm looking into using fractional lmul for fixed length vectors
and getLMULForFixedLengthVector returned an integer making it
unable to express this. I considered returning the LMUL
enum, but that seemed like it would introduce more complexity to
convert it for use.
Make it a static function RISCVISelLowering, the only place it
is used.
I think I'm going to make this return a fractional LMULs in some
cases so I'm sorting out where it should live before I start
making changes.
We can have RISCVISelDAGToDAG.cpp call the VT only version by
finding the RISCVTargetLowering object via the Subtarget.
Make the static versions just global static functions in
RISCVISelLowering that can be called by static functions in that
file.
This patch adds support for both scalable- and fixed-length vector code
lowering of the llvm.minnum and llvm.maxnum intrinsics to the equivalent
RVV instructions.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D101035
Implementations are allowed to optimize an x0 stride to perform
less memory accesses. This is the case in SiFive cores.
No idea if this is the case in other implementations. We might
need a tuning flag for this.
Reviewed By: frasercrmck, arcbbb
Differential Revision: https://reviews.llvm.org/D100815
Rather than doing splatting each separately and doing bit manipulation
to merge them in the vector domain, copy the data to the stack
and splat it using a strided load with x0 stride. At least on
some implementations this vector load is optimized to not do
a load for each element.
This is equivalent to how we move i64 to f64 on RV32.
I've only implemented this for the intrinsic fallbacks in this
patch. I think we do similar splatting/shifting/oring in other
places. If this is approved, I'll refactor the others to share
the code.
Differential Revision: https://reviews.llvm.org/D101002
The value is always an immediate and can never be in a register.
This the kind of thing TargetConstant is for.
Saves a step GenDAGISel to convert a Constant to a TargetConstant.
This recognizes the case when Hi is (sra Lo, 31). We can use
SPLAT_VECTOR_I64 rather than splatting the high bits and
combining them in the vector register.
As noted in the FIXME there's a sort of agreement that the any
extra bits stored will be 0.
The generated code is pretty terrible. I was really hoping we
could use a tail undisturbed trick, but tail undisturbed no
longer applies to masked destinations in the current draft
spec.
Fingers crossed that it isn't common to do this. I doubt IR
from clang or the vectorizer would ever create this kind of store.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D100618
This patch extends the lowering of RVV fixed-length vector shuffles to
avoid the default stack expansion and instead lower to vrgather
instructions.
For "permute"-style shuffles where one vector is swizzled, we can lower
to one vrgather. For shuffles involving two vector operands, we lower to
one unmasked vrgather (or splat, where appropriate) followed by a masked
vrgather which blends in the second half.
On occasion, when it's not possible to create a legal BUILD_VECTOR for
the indices, we use vrgatherei16 instructions with 16-bit index types.
For 8-bit element vectors where we may have indices over 255, we have a
fairly blunt fallback to the stack expansion to avoid custom-splitting
of the vector types.
To enable the selection of masked vrgather instructions, this patch
extends the various RISCVISD::VRGATHER nodes to take a passthru operand.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D100549
Prep work for adding intrinsics in the future.
Left an assert that the input is constant in ReplaceNodeResults,
as the intrinsic shouldn't go through that path.
This patch adds RVV codegen support for OR/XOR/AND reductions for both
scalable- and fixed-length vector types. There are a few possible
codegen strategies for each -- vmfirst.m, vmsbf.m, and vmsif.m could be
used to some extent -- but the vpopc.m instruction was chosen since it
produces the scalar result in one instruction, after which scalar
instructions can finish off the computation.
The reductions are lowered identically for both scalable- and
fixed-length vectors, although some alternate strategies may be more
optimal on fixed-length vectors since it's cheaper to get the length of
those types.
Other reduction types were not deemed to be relevant for mask vectors.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D100030
New custom DAG nodes were added to represent operations on CSR. These
nodes are lowered to corresponding pseudo instruction. Using the pseudo
instructions allows to specify different scheduling information for
operations on different system registers. It also make possible to
specify dependencies of instructions on specific system registers.
Differential Revision: https://reviews.llvm.org/D98936
If the constants have a difference of 1 we can convert one to
the other by adding or subtracting the condition.
We have a DAG combine for this, but it only runs before type
legalization. If the select is introduced later during type
legalization or op legalization we will miss it.
We don't need a specific condition, but some conditions are
harder to materialize than others on RISCV. I know that SETLT
will be a single instruction and it is what is used by the
motivating pattern from signed saturating add/sub.
Differential Revision: https://reviews.llvm.org/D99021
This can't use our normal strategy of splatting the scalar and using
a .vv operation instead of .vx.
Instead this patch bitcasts the vector to the equivalent SEW=32
vector and inserts the scalar parts using two vslide1up/down. We
do that unmasked and apply the mask separately at the end with
a vmerge.
For vslide1up there maybe some other options here like getting
i64 into element 0 and using vslideup.vi with this vector as
vd and the original source as vs1. Masking would still need to
be done afterwards.
That idea doesn't work for vslide1down. We need to slidedown and
then insert a single scalar at vl-1 which we could do with a
vslideup, but that assumes vl > 0 which I don't think we can assume.
The i32 double slide1down implemented here is the best I could come
up with and I just made vslide1up consistent.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D99910
We encountered a hang in our internal code base. I'm having trouble
creating a test case because the test that hit it was testing some
code that is not upstream.
It's a bit silly, but it allows us to write stricter type
constraints for isel. There's still some extra type checks in
the generated table due to some type interference limitations
around HWMode.
This patch supports bitcasts from scalar types to fixed-length vectors
and vice versa. It custom-lowers and custom-legalizes them to
EXTRACT_VECTOR_ELT/INSERT_VECTOR_ELT operations, using a single-element
vectors to hold the scalar where appropriate.
Previously, some of these would fail to select, others would be expanded
through stack loads and stores. Effort was made to ensure the codegen
avoids the stack for both legal and illegal scalar types.
Some of the codegen could be improved, but on first glance it looks like
a general optimization of EXTRACT_VECTOR_ELT when extracting an i64
element on RV32.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D99667
Caught in internal testing, these operations are assumed legal by
default, even for scalable vector types. Expand them back into separate
truncations and stores, or loads and extensions.
Also add explicit fixed-length vector tests for these operations, even
though they should have been correct already.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D99654
The W version of orc.b does not exist in Zbp so we need to use
gorci encoding. If we have Zbp, we can use gorciw which can avoid a
sext.w in some cases.
As long as it's a constant we can directly pattern match it
without any problems. It's only when it isn't a constant that
we need to add an AND.
In theory this should allow more target independent optimizations
to remain active.
Forgot to amend the Author.
Original commit message:
Header files are included in a separate patch in case the name needs to be changed.
RV32 / 64:
orc.b
Differential Revision: https://reviews.llvm.org/D99320
The default legalization strategy is PromoteFloat which keeps
half in single precision format through multiple floating point
operations. Conversion to/from float is done at loads, stores,
bitcasts, and other places that care about the exact size being 16
bits.
This patches switches to the alternative method softPromoteHalf.
This aims to keep the type in 16-bit format between every operation.
So we promote to float and immediately round for any arithmetic
operation. This should be closer to the IR semantics since we
are rounding after each operation and not accumulating extra
precision across multiple operations. X86 is the only other
target that enables this today. See https://reviews.llvm.org/D73749
I had to update getRegisterTypeForCallingConv to force f16 to
use f32 when the F extension is enabled. This way we can still
pass it in the lower bits of an FPR for ilp32f and lp64f ABIs.
The softPromoteHalf would otherwise always give i16 as the
argument type.
Reviewed By: asb, frasercrmck
Differential Revision: https://reviews.llvm.org/D99148
We need to splat the scalar separately and use .vv, but there is
no vmsgt(u).vv. So add isel patterns to select vmslt(u).vv with
swapped operands.
We also need to get VT to use for the splat from an operand rather
than the result since the result VT is nxvXi1.
Reviewed By: HsiangKai
Differential Revision: https://reviews.llvm.org/D99704
There's no target independent ISD opcode for MULHSU, so custom
legalize 2*XLen multiplies ourselves. We have to be a little
careful to prefer MULHU or MULHSU.
I thought about doing this in isel by pattern matching the
(add (mul X, (srai Y, XLen-1)), (mulhu X, Y)) pattern. I decided
against this because the add might become part of a chain of adds.
I don't trust DAG combine not to reassociate with other adds making
it difficult to find both pieces again.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D99479
Our CLZW isel pattern is quite easily broken by surrounding code
preventing it from matching sometimes. This usually results in
failing to remove the and X, 0xffffffff inserted by type
legalization. The add with -32 that type legalization also inserts
will often gets combined into other add/sub nodes. That doesn't
usually result in extra code when we don't use clzw.
CTTZ seems to be less fragile, but I wanted to keep it consistent
with CTLZ.
Reviewed By: asb, HsiangKai
Differential Revision: https://reviews.llvm.org/D99317
This adds almost everything required for supporting the new stepvector
intrinsic on RVV. It is lowered to the existing VID_VL SDNode.
The only exception is a limitation that RV32 cannot yet lower the
intrinsic on i64 vectors. This is because the step operand is
(currently) required to be at least as large as the vector element type.
I will look into patching that out and loosening the requirement to only
an integer pointer type.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D99594
Without Zfh the half type isn't legal, but it could still be
used as an argument/return in IR. Clang will not generate this today.
Previously we promoted the half value to float for arguments and
returns if the F extension is enabled but Zfh isn't. Then depending on
which ABI is enabled we would pass it in either an FPR or a GPR in
float format.
If the F extension isn't enabled, it would get passed in the lower
16 bits of a GPR in half format.
With this patch the value will always in half format and will be
in the lower bits of a GPR or FPR. This should be consistent
with where the bits are located when Zfh is enabled.
I've based this implementation off of how this is done on ARM.
I've manually nan-boxed the value to 32 bits using integer ops.
It looks like flw, fsw, fmv.s, fmv.w.x, fmf.x.w won't
canonicalize nans so should leave the value alone. I think those
are the instructions that could get used on this value.
Reviewed By: kito-cheng
Differential Revision: https://reviews.llvm.org/D98670
We look for this pattern frequently in isel patterns so its a
good idea to try to preserve it.
This also let's us remove our special isel handling for srliw
and use a direct pattern match of (srl (and X, 0xffffffff), C)
since no bits will be removed from the and mask.
Differential Revision: https://reviews.llvm.org/D99042
This patch adds a small optimization for vector shuffle lowering,
detecting shuffles which can be re-expressed as vector selects.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D99270
This patch adds further optimization techniques to RVV BUILD_VECTOR
lowering. It teaches the compiler to find splats of larger vector
element types "hidden" in smaller ones. For example, a v4i8 build_vector
(0x1, 0x2, 0x1, 0x2) could be splat as v2i16 0x0201. This is generally
more optimal than the dominant-element BUILD_VECTORs and so takes
priority.
This optimization is currently limited to all-constant-or-undef
BUILD_VECTORs as those were found to be the most common. There's no
reason this couldn't be extended to other BUILD_VECTORs, but the
additional bit-manipulation instructions may require more sophisticated
heuristics.
There are some cases where the materialization of the larger constant
takes more scalar instructions than it does to build the vector with
vector instructions. We could add heuristics to try and catch this.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D99195
This patch builds upon the initial BUILD_VECTOR work introduced in
D98700. It further optimizes the lowering of BUILD_VECTOR by using
VSELECT operations to effectively insert repeated elements into the
vector with relatively few instructions. This allows us to optimize more
BUILD_VECTORs without significantly increasing the size of the generated
code.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D98969
This patch adds an optimization for mask-vector BUILD_VECTOR nodes whose
elements are all constants or undef. It lowers such operations by
building up the vector via a series of integer operations, in which
multiple mask elements are inserted into a vector at a time via
i8/i16/i32/i64 element types. The final result is then bitcast from that
integer vector.
We restrict this optimization in certain circumstances when optimizing
for size. If we are required to use more than one integer insert
operation, then it will likely increase code size compared with using a
load from a constant pool.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D98860
I've split the gather/scatter custom handler to avoid complicating
it with even more differences between gather/scatter.
Tests are the scalable vector tests with the vscale removed and
dropped the tests that used vector.insert. We're probably not
as thorough on the splitting cases since we use 128 for VLEN here
but scalable vector use a known min size of 64.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D98991
Found by adding asserts to LegalizeDAG to catch incorrect result
types being returned.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D98964
I'm not sure how I failed to notice this before, but when optimizing
dominant-element BUILD_VECTORs we would lower via the scalable container type,
which lost us the information about the fixed length of the vector types. By
lowering via the fixed-length type we can preserve that information and
eliminate redundant vsetvli instructions.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D98938
Returning the scalable-vector container type would present problems when
the fixed-length INSERT_VECTOR_ELT was used by later operations.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D98776
We returned the input chain instead of the output chain from the
new load. This bypasses the load in the chain. I haven't found a
good way to test this yet. IR order prevents my initial attempts
at causing reordering.
This patch adds support for masked scatter intrinsics on scalable vector
types. It is mostly an extension of the earlier masked gather support
introduced in D96263, since the addressing mode legalization is the
same.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D96486
This patch supports the masked gather intrinsics in RVV.
The RVV indexed load/store instructions only support the "unsigned unscaled"
addressing mode; indices are implicitly zero-extended or truncated to XLEN and
are treated as byte offsets. This ISA supports the intrinsics directly, but not
the majority of various forms of the MGATHER SDNode that LLVM combines to. Any
signed or scaled indexing is extended to the XLEN value type and scaled
accordingly. This is done during DAG combining as widening the index types to
XLEN may produce illegal vectors that require splitting, e.g.
nxv16i8->nxv16i64.
Support for scalable-vector CONCAT_VECTORS was added to avoid spilling via the
stack when lowering split legalized index operands.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D96263
Without this patch, bitcasts of fixed-length mask vectors would go
through the stack.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D98779
This patch adds an optimization path for BUILD_VECTOR nodes where the
majority of the elements are identical. These can be splatted, with the
remaining elements patched up with INSERT_VECTOR_ELTs. The threshold can
be tweaked as required - it is currently conservative. Undef elements
are disregarded when judging the dominance of a particular element. This
allows them to be covered by the splat value.
In addition, vectors of 2 elements are always optimized to a splat (for
the upper element) and an insert at element zero.
This optimization is disabled when optimizing for size.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D98700
The InstrEmitter can sometimes insert a copy after an IMPLICIT_DEF
before connecting it to the vector instruction. This occurs when
constrainRegClass reduces to a class with less than 4 registers.
I believe LMUL8 on masked instructions triggers this since the
result can only use the v8, v16, or v24 register group as the mask
is using v0.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D98567
The default promotion uses zero extends that become shifts. We
cam use sign extend instead which is better for RISCV.
I've used two different implementations based on whether we
have minu/maxu instructions.
Differential Revision: https://reviews.llvm.org/D98683
This allows me to introduce similar combines for branches as
we have recently added for SELECT_CC. Some of them are less
useful for standalone setccs and only help branch instructions.
By having a BR_CC node its easier to only affect branches.
I'm using CondCodeSDNode to make isel patterns easier to
write so we can refer to the codes by name. SELECT_CC uses a
constant instead.
I've translated the condition code just like SELECT_CC so
we need less patterns for the swapped conditions. This
includes special cases for X < 1 and X > -1 that get translated
to blez and bgez by using a 0 constant.
computeKnownBitsForTargetNode support for SELECT_CC is added
to allow MaskedValueIsZero to work for cases where the true
and false values of the SELECT_CC are setccs and the
result of the SELECT_CC is used by a BR_CC. This was needed
to avoid regressions in some of the overflow tests.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D98159
The default legalization uses zero extends that require pair of shifts
on RISCV. Instead we can take advantage of the fact that unsigned
compares work equally well on sign extended inputs. This allows
us to use addw/subw and sext.w.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D98233
This patch adds fixed-length vector support to the calling convention
when RVV is used to lower fixed-length vectors. The scheme follows the
regular vector calling convention for the argument/return registers, but
uses scalable vector container types as the LocVTs, and converts to/from
the fixed-length vector value types as required.
Fixed-length vector types may be split when the combination of minimum
VLEN and the maximum allowable LMUL is not large enough to fully contain
the vector. In this case the behaviour differs between fixed-length
vectors passed as parameters and as return values:
1. For return values, vectors must be passed entirely via registers or
via the stack.
2. For parameters, unlike scalar values, split vectors continue to be
passed by value, and are split across multiple registers until there are
no remaining registers. Thus vector parameters may be found partly in
registers and partly on the stack.
As with scalable vectors, the first fixed-length mask vector is passed
via v0. Split mask fixed-length vectors are passed first via v0 and then
via the next available vector register: v8,v9,etc.
The handling of vector return values uses all available argument
registers v8-v23 which does not adhere to the calling convention we're
supposedly implementing, but since this issue affects both fixed-length
and scalable-vector values, it was left as-is.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D97954
Types of fractional LMUL and LMUL=1 are all using VR register class. When
using inline asm, it will use the first type in the register class as the
type for the register. It is not necessary the same as the value type. We
need to use INSERT_SUBVECTOR/EXTRACT_SUBVECToR/BITCAST to make it legal
to put the value in the corresponding register class.
Differential Revision: https://reviews.llvm.org/D97480
This patch optimizes the codegen for INSERT_VECTOR_ELT in various ways.
Primarily, it removes the use of vslidedown during lowering, and the
vector element is inserted entirely using vslideup with a custom VL and
slide index.
Additionally, lowering of i64-element vectors on RV32 has been optimized
in several ways. When the 64-bit value to insert is the same as the
sign-extension of the lower 32-bits, the codegen can follow the regular
path. When this is not possible, a new sequence of two i32 vslide1up
instructions is used to get the vector element into a vector. This
sequence was suggested by @craig.topper. From there, the value is slid
into the final position for more consistent lowering across RV32 and
RV64.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D98250
We don't support any other shuffles currently.
This changes the bswap/bitreverse tests that check for this in
their expansion code. Previously we expanded a byte swapping
shuffle through memory. Now we're scalarizing and doing bit
operations on scalars to swap bytes.
In the future we can probably use vrgather.vx to do a byte swap
shuffle.
This uses a really simple approach of converting to an i8 vector
and extracting. This is probably not the best approach especially
if you know the index is constant.
Other ideas:
-Store to stack temporary using vse1, load as scalar and shift.
-Sort of bitcast the vector to a vector of i8, slide down the
appropriate 8 bit element, copy to scalar, shift down the
correct bit within the 8 bits we extracted. Not exactly sure
how to describe such a bitcast from i1 vector to i8 vector
within the type system for elements less than 8.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D98310
On riscv32, i64 isn't a legal scalar type but we would like to
support scalable vectors of i64.
This patch introduces a new node that can represent a splat made
of multiple scalar values. I've used this new node to solve the current
crashes we experience when getConstant is used after type legalization.
For RISCV, we are now default expanding SPLAT_VECTOR to SPLAT_VECTOR_PARTS
when needed and then handling the SPLAT_VECTOR_PARTS later during
LegalizeOps. I've remove the special case I previously put in for
ABS for D97991 as the default expansion is now able to succesfully
use getConstant.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D98004
Currently we crash in type legalization any time an intrinsic
uses a scalar i64 on RV32.
This patch adds support for type legalizing this to prevent
crashing. I don't promise that it uses the best possible codegen
just that it is functional.
This first version handles 3 cases. vmv.v.x intrinsic, vmv.s.x
intrinsic and intrinsics that take a scalar input, splat it and
then do some operation.
For vmv.v.x we'll either rely on hardware sign extension for
constants or we'll convert it to multiple splats and bit
manipulation.
For vmv.s.x we use a really unoptimal sequence inspired by what
we do for an INSERT_VECTOR_ELT.
For the third case we'll either try to use the .vi form for
constants or convert to a complicated splat and bitmanip and use
the .vv form of the operation.
I've renamed the ExtendOperand field to SplatOperand now use it
specifically for the third case. The first two cases are handled
by custom lowering specifically for those intrinsics.
I haven't updated all tests yet, but I tried to cover a subset
that includes single-width, widening, and narrowing.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D97895
The type legalizer will visit the result before the operands. To
avoid creating an illegal target specific node or falling back to
scalarization, we need to manually split vector operands.
This still doesn't handle the case of non-power of 2 operands
which need to be widened. I'm not sure the type legalizer is
ready for it. I think we would need to insert an
INSERT_SUBVECTOR with the power of 2 type we want, with an undef
first operand, and the non-power of 2 orignal operand as the vector
to insert. Then fill in the neutral elements into the elements the
padded elements. Alternatively we INSERT_SUBVECTOR into a neutral vector.
From there we carry on splitting if needed to get to a legal type
then do the target specific code.
The problem with this is the type legalizer doesn't know how to
widen an insert_subvector yet. We would need to add that including
the handling for a non-undef first vector.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D98292
I've left mask registers to a future patch as we'll need
to convert them to full vectors, shuffle, and then truncate.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D97609
I've included tests that require type legalization to split the
vector. The i64 version of these scalarizes on RV32 due to type
legalization visiting the result before the vector type. So we
have to abort our custom expansion to avoid creating target
specific nodes with an illegal type. Then type legalization ends
up scalarizing. We might be able to fix this by doing custom
splitting for large vectors in our handler to get down to a legal
type.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D98102
Previously we set the value to -1, but the SEW information could
be useful for scheduling.
Reviewed By: frasercrmck, rogfer01
Differential Revision: https://reviews.llvm.org/D98062
The default fixed vector expansion uses sra+xor+add since it can't
see that smax is legal due to our custom handling. So we select
smax(X, sub(0, X)) manually.
Scalable vectors are able to use the smax expansion automatically
for most cases. It crashes in one case because getConstant can't build a
SPLAT_VECTOR for nxvXi64 when i64 scalars aren't legal. So
we manually emit a SPLAT_VECTOR_I64 for that case.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D97991
While working on adding fixed-length vectors to the calling convention,
it was necessary to be able to query for a fixed-length vector container
type without access to an instance of SelectionDAG.
This patch modifies the "main" getContainerForFixedLengthVector function
to use an instance of TargetLowering rather than SelectionDAG, and
preserves the SelectionDAG overload as a wrapper.
An additional non-static version of the function was also added to
simplify the common case in RISCVTargetLowering.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D97925
A setcc can be created during LegalizeDAG after select_cc has been
created. This combine will enable us to fold these late setccs.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D98132
This pattern occurs when lowering for overflow operations
introduce an xor after select_cc has already been formed.
I had to rework another combine that looked for select_cc of an xor
with 1. That xor will now get combined away so we just need to
look for the RHS of the select_cc being 1.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D98130
This patch addresses a compiler crash resulting from passing a
fixed-length type to one that expects scalable vector types. An
assertion was added to prevent this regressing in the future.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D97868
This patch fixes up one case where the fixed-length-vector VL was
dropped (falling back to VLMAX) when inserting vector elements, as the
code would lower via ISD::INSERT_VECTOR_ELT (at index 0) which loses the
fixed-length vector information.
To this end, a custom node, VMV_S_XF_VL, was introduced to carry the VL
operand through to the final instruction. This node wraps the RVV
vmv.s.x and vmv.s.f instructions, which were being selected by
insert_vector_elt anyway.
There should be no observable difference in scalable-vector codegen.
There is still one outstanding drop from fixed-length VL to VLMAX, when
an i64 element is inserted into a vector on RV32; the splat (which is
custom legalized) has no notion of the original fixed-length vector
type.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D97842
This patch enables support for lowering INSERT_VECTOR_ELT on
fixed-length vector types. The strategy follows that for scalable vector
types.
This patch also includes a quick fix to prevent the compiler infinitely
looping between lowering BUILD_VECTOR as VECTOR_SHUFFLE and back again.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D97698
The default expansion of CONCAT_VECTORS goes through the stack. This
patch avoids that penalty by custom-lowering CONCAT_VECTORS to a series
of INSERT_SUBVECTOR nodes. Futher optimizations are possible, but this
is a good start.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D97692
Like with EXTRACT_SUBVECTOR, INSERT_SUBVECTOR poses a problem
for vector masks as RVV isn't able to slide mask types around. We choose
instead to bitcast to equivalently-sized i8 types where we can, else we
zero-extend, perform the operation, and truncate back down.
One test was left disabled due to a crash in the legalizer.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D97559
This patch fixes a bug where the lowering for INSERT_SUBVECTOR and
EXTRACT_SUBVECTOR would insist on first extracting a register-aligned
LMUL1 vector type before perfoming the slide up/down. This was even if
the vector was a fractional LMUL type, in which case the aligned
EXTRACT_SUBVECTOR was invalid.
This issue only occurred for scalable vector types, but a variety of
tests for both scalable and fixed-length vectors have been added to
ensure this does not regress in the future.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D97556
This patch unifies the two disparate paths for lowering INSERT_SUBVECTOR
operations under one roof. Consequently, with this patch it is possible to
support any fixed-length subvector insertion, not just "cast-like" ones.
As before, support for the insertion of mask vectors will come in a
separate patch.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D97543
This patch adds support for extracting subvectors from vector masks.
This can be either extracting a scalable vector from another, or a fixed-length
vector from a fixed-length or scalable vector.
Since RVV lacks a way to slide vector masks down on an element-wise
basis and we don't know the true length of the vector registers, in many
cases we must resort to using equivalently-sized i8 vectors to perform
the operation. When this is not possible we fall back and extend to a
suitable i8 vector.
Support was also added for fixed-length truncation to mask types.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D97475
This patch extends the support for scalable-vector int->fp and fp->int
conversions by additionally handling fixed-length vectors.
The existing scalable-vector lowering re-expresses widening/narrowing by
x4+ conversions as standard nodes. The fixed-length vector support slots
in at "the end" of this process by lowering the now equally-sized and
widening/narrowing by x2 nodes to our custom VL versions.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D97374
This patch extends the support for vector FP_ROUND and FP_EXTEND by
including support for fixed-length vector types. Since fixed-length
vectors use "VL" nodes and scalable vectors can use the standard nodes,
there is slightly more to do in the fixed-length case. A helper function
was introduced to try and reduce the divergent paths. It is expected
that this function will similarly come in useful for lowering the
int-to-fp and fp-to-int operations for fixed-length vectors.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D97301
This patch extends support for our custom-lowering of scalable-vector
truncates to include those of fixed-length vectors. It does this by
co-opting the custom RISCVISD::TRUNCATE_VECTOR node and adding mask and
VL operands. This avoids unnecessary duplication of patterns and
inflation of the ISel table.
Some truncates go through CONCAT_VECTORS which currently isn't
efficiently handled, as it goes through the stack. This can be improved
upon in the future.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D97202
This patch adds support for the custom lowering sign- and zero-extension
of fixed-length vector types. It does so through custom nodes. Since the
source and destination types are (necessarily) of different sizes, it is
possible that the source type is legal whilst the larger destination
type isn't. In this case the legalization makes heavy use of
EXTRACT_SUBVECTOR.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D97194
This patch unifies the two disparate paths for lowering
EXTRACT_SUBVECTOR operations under one roof. Consequently, with this
patch it is possible to support any fixed-length subvector extraction,
not just "cast-like" ones.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D97192