The vector calling convention dictates that when the vector argument
registers are exhaused, GPRs are used to pass the address via the stack.
When the GPRs themselves are exhausted, at best we would previously
crash with an assertion, and at worst we'd generate incorrect code.
This patch addresses this issue by passing fixed-length vectors via the
stack with their full fixed-length size and aligned to their element
type size. Since the calling convention lowering can't yet handle
scalable vector types, this patch adds a fatal error to make it clear
that we are lacking in this regard.
Reviewed By: HsiangKai
Differential Revision: https://reviews.llvm.org/D102422
This patch extends the cases in which the legalizer is able to express
VSELECT in terms of XOR/AND/OR. When dealing with a VSELECT between
boolean vector types, the mask itself is an all-ones or all-ones value
of the operand type, so a 0/1 boolean type behaves identically to a 0/-1
type.
This greatly helps RISC-V which relies on expansion for these nodes. It
also allows scalable-vector bool VSELECTs to use the default expansion,
where before it would crash in SelectionDAG::UnrollVectorOp.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D103147
We aren't going to connect the result to anything so we might
as well avoid allocating a register.
Reviewed By: frasercrmck, HsiangKai
Differential Revision: https://reviews.llvm.org/D102031
SEW=64 shifts only uses the log2(64) bits of shift amount. If we're
splatting a 64 bit value in 2 parts, we can avoid splatting the
upper bits and just let the low bits be sign extended. They won't
be read anyway.
For the purposes of SelectionDAG semantics of the generic ISD opcodes,
if hi was non-zero or bit 31 of the low is 1, the shift was already
undefined so it should be ok to replace high with sign extend of low.
In order do be able to find the split i64 value before it becomes
a stack operation, I added a new ISD opcode that will be expanded
to the stack spill in PreprocessISelDAG. This new node is conceptually
similar to BuildPairF64, but I expanded earlier so that we could
go through regular isel to get the right VLSE opcode for the LMUL.
BuildPairF64 is expanded in a CustomInserter.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D102521
It's conceivable someone could put a vsetvli in inline assembly
so its safer to consider them as barriers. The alternative would
be to trust that the user marks VL and VTYPE registers as clobbers
of the inline assembly if they do that, but hat seems error prone.
I'm assuming inline assembly in vector code is going to be rare.
Reviewed By: frasercrmck, HsiangKai
Differential Revision: https://reviews.llvm.org/D103126
This patch extends D102737 to allow VL/VTYPE changes to be taken
into account before adding an explicit vsetvli.
We do this by using a data flow analysis to propagate VL/VTYPE
information from predecessors until we've determined a value for
every value in the function.
We use this information to determine if a vsetvli needs to be
inserted before the first vector instruction the block.
Differential Revision: https://reviews.llvm.org/D102739
This is a replacement for D101938 for inserting vsetvli
instructions where needed. This new version changes how
we track the information in such a way that we can extend
it to be aware of VL/VTYPE changes in other blocks. Given
how much it changes the previous patch, I've decided to
abandon the previous patch and post this from scratch.
For now the pass consists of a single phase that assumes
the incoming state from other basic blocks is unknown. A
follow up patch will extend this with a phase to collect
information about how VL/VTYPE change in each block and
a second phase to propagate this information to the entire
function. This will be used by a third phase to do the
vsetvli insertion.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D102737
If the local variable `NumOfVReg` isPowerOf2_32(NumOfVReg - 1) or isPowerOf2_32(NumOfVReg + 1), the ADDI and MUL instructions can be replaced with SLLI and ADD(or SUB) instructions.
Based on original patch by StephenFan.
Reviewed By: frasercrmck, StephenFan
Differential Revision: https://reviews.llvm.org/D100577
RVV code generation does not successfully custom-lower BUILD_VECTOR in all
cases. When it resorts to default expansion it may, on occasion, be expanded to
scalar stores through the stack. Unfortunately these stores may then be picked
up by the post-legalization DAGCombiner which merges them again. The merged
store uses a BUILD_VECTOR which is then expanded, and so on.
This patch addresses the issue by overriding the `mergeStoresAfterLegalization`
hook. A lack of granularity in this method (being passed the scalar type) means
we opt out in almost all cases when RVV fixed-length vector support is enabled.
The only exception to this rule are mask vectors, which are always either
custom-lowered or are expanded to a load from a constant pool.
Reviewed By: HsiangKai
Differential Revision: https://reviews.llvm.org/D102913
Unlike normal loads these don't have an extension field, but we know
from TargetLowering whether these are sign-extending or zero-extending,
and so can optimise away unnecessary extensions.
This was noticed on RISC-V, where sign extensions in the calling
convention would result in unnecessary explicit extension instructions,
but this also fixes some Mips inefficiencies. PowerPC sees churn in the
tests as all the zero extensions are only for promoting 32-bit to
64-bit, but these zero extensions are still not optimised away as they
should be, likely due to i32 being a legal type.
This also simplifies the WebAssembly code somewhat, which currently
works around the lack of target-independent combines with some ugly
patterns that break once they're optimised away.
Re-landed with correct handling in ComputeNumSignBits for Tmp == VTBits,
where zero-extending atomics were incorrectly returning 0 rather than
the (slightly confusing) required return value of 1.
Re-landed again after D102819 fixed PowerPC to correctly zero-extend all
of its atomics as it claimed to do, since the combination of that bug
and this optimisation caused buildbot regressions.
Reviewed By: RKSimon, atanasyan
Differential Revision: https://reviews.llvm.org/D101342
The default expansion for BUILD_VECTORs -- save for going through
shuffles -- is to go through the stack. This method only works when the
type is at least byte-sized, so for v2i1 and v4i1 we would crash.
This patch ensures that small mask-type BUILD_VECTORs are always handled
without crashing. We lower to a SETCC of the equivalent i8 type.
This also exposes some pre-existing issues where the lowering when
optimizing for size results in larger code than without. Those will be
tackled in future patches.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D102767
The use of `SelectionDAG::getSplatValue` isn't guaranteed to return a
type-legal splat value as it may implicitly extract a vector element
from another shuffle. It is not permitted to introduce an illegal type
when lowering shuffles.
This patch addresses the crash by adding a boolean flag to
`getSplatValue`, defaulting to false, which when set will ensure a
type-legal return value. If it is unable to do that it will fail to
return a splat value.
I've been through the existing uses of `getSplatValue` in other targets
and was unable to find a need or test cases showing a need to update
their uses. In some cases, the call is made during `LegalizeVectorOps`
which may still produce illegal scalar types. In other situations, the
illegally-typed splat value may be quickly patched up to a legal type
(such as any-extending the returned `extract_vector_elt` up to a legal
type) before `LegalizeDAG` notices.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D102687
Like the element extraction of these vectors, we choose to promote up to
an i8 vector type and perform the insertion there.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D102697
Where the RVV specification writes `vs2, vs1`, our TableGen patterns use
`rs1, rs2`. These differences can easily cause confusion. The VMANDNOT
instruction performs `LHS && !RHS`, and similarly for VMORNOT.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D102606
The select-of-constants transform was asserting that its constant vector
inputs did not implicitly truncate their input without that as an
explicit precondition to the function. This patch relaxes that assertion
into an early return to skip the optimization.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D102393
The MachineBasicBlock::iterator is continuously changing during
generating the frame handling instructions. We should use the DebugLoc
from the caller, instead of getting it from the changing iterator.
If the prologue instructions located in a basic block without any other
instructions after these prologue instructions, the iterator will be
updated to the boundary of the basic block and it is invalid to use the
iterator to access DebugLoc. This patch also fixes the crash when
accessing DebugLoc using the iterator.
Differential Revision: https://reviews.llvm.org/D102386
This patch extends the vector type-conversion and legalization capabilities of
scalable vector types.
Firstly, `vscale x 1` types now behave more like the corresponding `vscale x
2+` types. This enables the integer promotion legalization of extended scalable
types, such as the promotion of `<vscale x 1 x i5>` to `<vscale x 1 x i8>`.
These `vscale x 1` types are also now better handled by
`getVectorTypeBreakdown`, where what looks like older handling for 1-element
fixed-length vector types was spuriously updated to include scalable types.
Widening of scalable types is now better supported, by using `INSERT_SUBVECTOR`
to insert the smaller scalable vector "value" type into the wider scalable
vector "part" type. This allows AArch64 to pass and return `vscale x 1` types
by value by widening.
There are still cases where we are unable to legalize `vscale x 1` types, such
as where expansion would require splitting the vector in two.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D102073
Similar to X86 D73230 and AArch64 D101872
With this change, we can set dso_local in clang's -fpic -fno-semantic-interposition mode,
for default visibility external linkage non-ifunc-non-COMDAT definitions.
For such dso_local definitions, variable access/taking the address of a
function/calling a function will go through a local alias to avoid GOT/PLT.
Reviewed By: jrtc27, luismarques
Differential Revision: https://reviews.llvm.org/D101875
My thought process is that if v2i64 is an LMUL=1 type then v2i32
should be an LMUL=1/2 type. We limit the fractional LMUL so that
SEW=64 clips to LMUL=1, SEW=32 clips to LMUL=1/2, etc. This
ensures there's always a fractional LMUL available to truncate a type.
This does reduce the number of vsetvlis in some cases.
Some tests increase vsetvlis because the best container type for a
mask type is dependent on the LMUL+SEW that the mask was produced
from, but you can't tell that from the type. I think this is
something we need to solve this in the machine IR when optimizing
vsetvlis.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D101215
Limited to splats because we would need to truncate the shift
amount vector otherwise.
I tried to do this with new ISD nodes and a DAG combine to
avoid such a large pattern, but we don't form the splat until
LegalizeDAG and need DAG combine to remove a scalable->fixed->scalable
cast before it becomes visible to the shift node. By the time that
happens we've already visited the truncate node and won't revisit it.
I think I have an idea how to improve i64 on RV32 I'll save for a
follow up.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D102019
For Zvlsseg spilling, we need to convert the pseudo instructions
into multiple vector load/store instructions with appropriate offsets.
For example, for PseudoVSPILL3_M2, we need to convert it to
VS2R %v2, %base
ADDI %base, %base, (vlenb x 2)
VS2R %v4, %base
ADDI %base, %base, (vlenb x 2)
VS2R %v6, %base
We need to keep the size of the offset in the pseudo spilling instructions.
In this case, it is (vlenb x 2).
In the original implementation, we use the size of frame objects divide the
number of vectors in zvlsseg types. The size of frame objects is not
necessary exactly the same as the spilling data. It may be larger than
it. So, we change it to (VLENB x LMUL) in this patch. The calculation is
more direct and easy to understand.
Differential Revision: https://reviews.llvm.org/D101869
This patch extends VectorLegalizer::ExpandSELECT to permit expansion
also for scalable vector types. The only real change is conditionally
checking for BUILD_VECTOR or SPLAT_VECTOR legality depending on the
vector type.
We can use this to fix "cannot select" errors for scalable vector
selects on the RISCV target. Note that in future patches RISCV will
possibly custom-lower vector SELECTs to VSELECTs for branchless codegen.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D102063
Unlike normal loads these don't have an extension field, but we know
from TargetLowering whether these are sign-extending or zero-extending,
and so can optimise away unnecessary extensions.
This was noticed on RISC-V, where sign extensions in the calling
convention would result in unnecessary explicit extension instructions,
but this also fixes some Mips inefficiencies. PowerPC sees churn in the
tests as all the zero extensions are only for promoting 32-bit to
64-bit, but these zero extensions are still not optimised away as they
should be, likely due to i32 being a legal type.
This also simplifies the WebAssembly code somewhat, which currently
works around the lack of target-independent combines with some ugly
patterns that break once they're optimised away.
Re-landed with correct handling in ComputeNumSignBits for Tmp == VTBits,
where zero-extending atomics were incorrectly returning 0 rather than
the (slightly confusing) required return value of 1.
Reviewed By: RKSimon, atanasyan
Differential Revision: https://reviews.llvm.org/D101342
This seems to have broken sanitizers, giving lots of
Assertion `NumBits <= MAX_INT_BITS && "bitwidth too large"' failed.
failures across multiple targets (currently X86 and PowerPC). Reverting
until I have a chance to reproduce and debug.
This reverts commit 6e876f9ded.
Unlike normal loads these don't have an extension field, but we know
from TargetLowering whether these are sign-extending or zero-extending,
and so can optimise away unnecessary extensions.
This was noticed on RISC-V, where sign extensions in the calling
convention would result in unnecessary explicit extension instructions,
but this also fixes some Mips inefficiencies. PowerPC sees churn in the
tests as all the zero extensions are only for promoting 32-bit to
64-bit, but these zero extensions are still not optimised away as they
should be, likely due to i32 being a legal type.
This also simplifies the WebAssembly code somewhat, which currently
works around the lack of target-independent combines with some ugly
patterns that break once they're optimised away.
Reviewed By: RKSimon, atanasyan
Differential Revision: https://reviews.llvm.org/D101342
This patch supports all of the current set of VP integer binary
intrinsics by lowering them to to RVV instructions. It does so by using
the existing RISCVISD *_VL custom nodes as an intermediate layer. Both
scalable and fixed-length vectors are supported by using this method.
One notable change to the existing vector codegen strategy is that
scalable all-ones and all-zeros mask SPLAT_VECTORs are now lowered to
RISCVISD VMSET_VL and VMCLR_VL nodes to match their fixed-length
BUILD_VECTOR counterparts. This allows them to reuse the existing
"all-ones" VL patterns.
To reduce the size of the phabricator diff, some tests are intentionally
left out and will be added later if the patch is accepted.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D101826
Previously, RISC-V would make legal all fixed-length vectors types whose
size are less than or equal to some function of the minimum value of
VLEN and the maximum-permissible LMUL grouping.
Due to vector legalization issues, this patch instead caps the legal
fixed-length vector types to those with 256 elements. This value was
chosen because it is the longest vector length which has corresponding
MVTs across all supported element types.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D101839
This patch adds the two MVTs to fix a legalizer crash when using vector
shuffles of <256 x i16> and <128 x i16> on RISC-V. The legalizer can't
promote the operand of `v256i32 = any_extend_vector_inreg v128i16`.
Reviewed By: craig.topper, RKSimon
Differential Revision: https://reviews.llvm.org/D101769
These tests show inefficient sign extension for AMOs on RISC-V. The
normal CodeGen tests use anyext return values, but if marked signext
then we end up generating unnecessary sign extension instructions. This
can be seen when compiling C that returns an i32 (signed or unsigned),
where the calling convention results in a signext return value.
This patch adds support for splatting i1 types to fixed-length or
scalable vector types. It does so by lowering the operation to a SETCC
of the equivalent i8 type.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D101465
This shrinks the immediate that isel table needs to emit for these
instructions. Hoping this allows me to change OPC_EmitInteger to
use a better variable length encoding for representing negative
numbers. Similar to what was done a few months ago for OPC_CheckInteger.
The alternative encoding uses less bytes for negative numbers, but
increases the number of bytes need to encode 64 which was a very
common number in the RISCV table due to SEW=64. By using Log2 this
becomes 6 and is no longer a problem.
DAGCombiner was recently taught how to combine STEP_VECTOR nodes,
meaning the step value is no longer guaranteed to be one by the time it
reaches the backend for lowering.
This patch supports such cases on RISC-V by lowering to other step
values to a multiply following the vid.v instruction. It includes a
small optimization for common cases where the multiply can be expressed
as a shift left.
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D100856
When rvv vector objects existed, using sp to access the fixed stack object will pass the rvv vector objects field. So the StackOffset needs add a scalable offset of the size of rvv vector objects field
Differential Revision: https://reviews.llvm.org/D100286
Similar for or/xor with 0 in place of -1.
This is the canonical form produced by InstCombine for something like `c ? x & y : x;` Since we have to use control flow to expand select we'll usually end up with a mv in basic block. By folding this we may be able to pull the and/or/xor into the block instead and avoid a mv instruction.
The code here is based on code from ARM that uses this to create predicated instructions. I'm doing it on SELECT_CC so it happens late, but we could do it on select earlier which is what ARM does. I'm not sure if we lose any combine opportunities if we do it earlier.
I left out add and sub because this can separate sext.w from the add/sub. It also made a conditional i64 addition/subtraction on RV32 worse. I guess both of those would be fixed by doing this earlier on select.
The select-binop-identity.ll test has not been commited yet, but I made the diff show the changes to it.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D101485
This replaces D98479.
This allows type legalization to form SPLAT_VECTOR_PARTS so we don't
lose the splattedness when the scalar type is split.
I'm handling SPLAT_VECTOR_PARTS for fixed vectors separately so
we can continue using non-VL nodes for scalable vectors.
I limited to RV32+vXi64 because DAGCombiner::visitBUILD_VECTOR likes
to form SPLAT_VECTOR before seeing if it can replace the BUILD_VECTOR
with other operations. Especially interesting is a splat BUILD_VECTOR of
the extract_vector_elt which can become a splat shuffle, but won't if
we form SPLAT_VECTOR first. We either need to reorder visitBUILD_VECTOR
or add visitSPLAT_VECTOR.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D100803
This seems like a reasonable upper bound on VL. WG discussions for
the V spec would probably allow us to use 2^16 as an upper bound
on VLEN, but this is good enough for now.
This allows us to remove sext and zext if user happens to assign
the size_t result into an int and then uses it as a VL intrinsic
argument which is size_t.
Reviewed By: frasercrmck, rogfer01, arcbbb
Differential Revision: https://reviews.llvm.org/D101472
This is an complementary/alternative fix for D99068. It takes a slightly
different approach by explicitly summing up all of the required split
part type sizes and ensuring we allocate enough space for them. It also
takes the maximum alignment of each part.
Compared with D99068 there are fewer changes to the stack objects in
existing tests. However, @luismarques has shown in that patch that there
are opportunities to reduce our stack usage in the future.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D99087
This adds a special operand type that is allowed to be either
an immediate or register. By giving it a unique operand type the
machine verifier will ignore it.
This perturbs a lot of tests but mostly it is just slightly different
instruction orders. Something bad did happen to some min/max reduction
tests. We're spilling vector registers when we weren't before.
Reviewed By: khchen
Differential Revision: https://reviews.llvm.org/D101246
This modifies my previous patch to push the strided load formation
to isel. This gives us opportunity to fold the splat into a .vx
operation first. Using a scalar register and a .vx operation reduces
vector register pressure which can be important for larger LMULs.
If we can't fold the splat into a .vx operation, then it can make
sense to use a strided load to free up the vector arithmetic
ALU to do actual arithmetic rather than tying it up with vmv.v.x.
Reviewed By: khchen
Differential Revision: https://reviews.llvm.org/D101138
This teaches DAG combine that shift amount operands for grev, gorc
shfl, unshfl only read a few bits.
This also teaches DAG combine that grevw, gorcw, shflw, unshflw,
bcompressw, bdecompressw only consume the lower 32 bits of their
inputs.
In the future we can teach SimplifyDemandedBits to also propagate
demanded bits of the output to the inputs in some cases.
Theses instructions are allowed to write v0 when they are masked.
We'll still never use v0 because of the earlyclobber constraint so
this doesn't really help anything. It just makes the definitions
correct.
While I was there remove an unused multiclass I noticed.
Reviewed By: HsiangKai
Differential Revision: https://reviews.llvm.org/D101118
This patch adds support for both scalable- and fixed-length vector code
lowering of the llvm.minnum and llvm.maxnum intrinsics to the equivalent
RVV instructions.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D101035
Add PromoteIntOp_FP_TO_XINT_SAT to type legalize the bit width
operand from i32 to i64 for RV64.
Add test cases for the saturating intrinsics for half/float/double
and i32/i64. CodeGen is definitely not optimal. We can probably
make use of the native behavior of fcvt instructions in many cases.
Fixes PR50083
These instructions don't really exist, but we have ways we can
emulate them.
.vv will swap operands and use vmsle().vv. .vi will adjust the
immediate and use .vmsgt(u).vi when possible. For .vx we need to
use some of the multiple instruction sequences from the V extension
spec.
For unmasked vmsge(u).vx we use:
vmslt{u}.vx vd, va, x; vmnand.mm vd, vd, vd
For cases where mask and maskedoff are the same value then we have
vmsge{u}.vx v0, va, x, v0.t which is the vd==v0 case that
requires a temporary so we use:
vmslt{u}.vx vt, va, x; vmandnot.mm vd, vd, vt
For other masked cases we use this sequence:
vmslt{u}.vx vd, va, x, v0.t; vmxor.mm vd, vd, v0
We trust that register allocation will prevent vd in vmslt{u}.vx
from being v0 since v0 is still needed by the vmxor.
Differential Revision: https://reviews.llvm.org/D100925
Refactor to use new multiclass instead of individual patterns.
We already supported this due to SEW=64 on RV32, but we didn't have
test cases for all the types we supported.
Part of D100925
We don't have instructions for these, but can swap the operands
to use vmle/vmflt. This makes the IR interface more consistent and
simplifies the frontend implementation.
Part of D100925
Implementations are allowed to optimize an x0 stride to perform
less memory accesses. This is the case in SiFive cores.
No idea if this is the case in other implementations. We might
need a tuning flag for this.
Reviewed By: frasercrmck, arcbbb
Differential Revision: https://reviews.llvm.org/D100815
Rather than doing splatting each separately and doing bit manipulation
to merge them in the vector domain, copy the data to the stack
and splat it using a strided load with x0 stride. At least on
some implementations this vector load is optimized to not do
a load for each element.
This is equivalent to how we move i64 to f64 on RV32.
I've only implemented this for the intrinsic fallbacks in this
patch. I think we do similar splatting/shifting/oring in other
places. If this is approved, I'll refactor the others to share
the code.
Differential Revision: https://reviews.llvm.org/D101002
It is proper to relax non-negative limitation of step_vector.
Also this patch adds more combines for step_vector:
(sub X, step_vector(C)) -> (add X, step_vector(-C))
Differential Revision: https://reviews.llvm.org/D100812
This recognizes the case when Hi is (sra Lo, 31). We can use
SPLAT_VECTOR_I64 rather than splatting the high bits and
combining them in the vector register.
This patch adds incrementally-better support for SPLAT_VECTOR in a
handful of vector combines by changing a few more
isBuildVectorAllOnes/isBuildVectorAllZeros to the equivalent
isConstantSplatVectorAllOnes/Zeros calls.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D100851
This patch fixes a case missed out by D100574, in which RVV scalable
stack offset computations may require three live registers in the case
where the offset's fixed component is 12 bits or larger and has a
scalable component.
Instead of adding an additional emergency spill slot, this patch further
optimizes the scalable stack offset computation sequences to reduce
register usage.
By emitting the sequence to compute the scalable component before the
fixed component, we can free up one scratch register to be reallocated
by the sequence for the fixed component. Doing this saves one register
and thus one additional emergency spill slot.
Compare:
$x5 = LUI 1
$x1 = ADDIW killed $x5, -1896
$x1 = ADD $x2, killed $x1
$x5 = PseudoReadVLENB
$x6 = ADDI $x0, 50
$x5 = MUL killed $x5, killed $x6
$x1 = ADD killed $x1, killed $x5
versus:
$x5 = PseudoReadVLENB
$x1 = ADDI $x0, 50
$x5 = MUL killed $x5, killed $x1
$x1 = LUI 1
$x1 = ADDIW killed $x1, -1896
$x1 = ADD $x2, killed $x1
$x1 = ADD killed $x1, killed $x5
Reviewed By: HsiangKai
Differential Revision: https://reviews.llvm.org/D100847
This patch adds an additional emergency spill slot to RVV code. This is
required as RVV stack offsets may require an additional register to compute.
This patch includes an optimization by @HsiangKai <kai.wang@sifive.com>
to reduce the number of registers required for the computation of stack
offsets from 3 to 2. Otherwise we'd need two additional emergency spill
slots.
Reviewed By: HsiangKai
Differential Revision: https://reviews.llvm.org/D100574
This patch changes ISD::isBuildVectorAllZeros to
ISD::isConstantSplatVectorAllZeros which handles zero sclar vector.
TestPlan: check-llvm
Differential Revision: https://reviews.llvm.org/D100813
This patch relaxes the requirement that the STEP_VECTOR step constant
must be of a type at least as large as the vector element type. This
does not permit its use on targets which have legal vector element types
larger than the largest legal scalar type, such as i64 vectors on RV32.
As such, the requirement has been loosened so that the step operand must
be any scalar type so long as the constant immediate is non-negative and
the value fits inside the vector element type.
This limits combining optimizations in certain circumstances but in
practice it's unlikely to be a hindrance.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D100660
As noted in the FIXME there's a sort of agreement that the any
extra bits stored will be 0.
The generated code is pretty terrible. I was really hoping we
could use a tail undisturbed trick, but tail undisturbed no
longer applies to masked destinations in the current draft
spec.
Fingers crossed that it isn't common to do this. I doubt IR
from clang or the vectorizer would ever create this kind of store.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D100618
This patch extends the lowering of RVV fixed-length vector shuffles to
avoid the default stack expansion and instead lower to vrgather
instructions.
For "permute"-style shuffles where one vector is swizzled, we can lower
to one vrgather. For shuffles involving two vector operands, we lower to
one unmasked vrgather (or splat, where appropriate) followed by a masked
vrgather which blends in the second half.
On occasion, when it's not possible to create a legal BUILD_VECTOR for
the indices, we use vrgatherei16 instructions with 16-bit index types.
For 8-bit element vectors where we may have indices over 255, we have a
fairly blunt fallback to the stack expansion to avoid custom-splitting
of the vector types.
To enable the selection of masked vrgather instructions, this patch
extends the various RISCVISD::VRGATHER nodes to take a passthru operand.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D100549
When trying to clamp a constant index into a scalable vector we can
test if the index is less than the minimum number of elements in the
vector. If so, we can simply return the index because we know it is
guaranteed to fit inside the vector.
Differential Revision: https://reviews.llvm.org/D100639
It has to save all caller-saved registers before a call in the handler.
So don't emit a call that save/restore registers.
Reviewed By: simoncook, luismarques, asb
Differential Revision: https://reviews.llvm.org/D100532
This patch adds more optimized codegen for the above SETCC forms,
by matching the '.vi' vector forms when the immediate is a 5-bit signed
immediate plus 1. The immediate can be decremented and the corresponding
SET[U]LE or SET[U]GT forms can be matched.
This work was left as a TODO from D94168.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D100096
The first source has the same EEW as the destination and the other
source is a scalar so the overlap constraints don't apply to
the unmasked version.
For the masked version we have a constraint that the destination
can't be V0 so that covers the only overlap issue there.
Reviewed By: khchen
Differential Revision: https://reviews.llvm.org/D100217
These require the input to be zero or sign extended. If we have
sext.b, sext.h or zext.h instructions we can use them. Otherwise
we need to use a pair of shifts to accomplish the zero/sign extend
and the final shift.
We don't currently use zext.h when it is available.
This patch adds RVV codegen support for OR/XOR/AND reductions for both
scalable- and fixed-length vector types. There are a few possible
codegen strategies for each -- vmfirst.m, vmsbf.m, and vmsif.m could be
used to some extent -- but the vpopc.m instruction was chosen since it
produces the scalar result in one instruction, after which scalar
instructions can finish off the computation.
The reductions are lowered identically for both scalable- and
fixed-length vectors, although some alternate strategies may be more
optimal on fixed-length vectors since it's cheaper to get the length of
those types.
Other reduction types were not deemed to be relevant for mask vectors.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D100030
If the stack size is larger than 12 bits, we have to use a scratch
register to store the stack size. Before we introduce the scalable stack
offset, we could simplify
%0 = ADDI %stack.0, 0
=>
%scratch = ... # sequence of instructions to move the offset into
%%scratch
%0 = ADD %fp, %scratch
However, if the offset contains scalable part, we need to consider it.
%0 = ADDI %stack.0, 0
=>
%scratch = ... # sequence of instructions to move the offset into
%%scratch
%scratch = ADD %fp, %scratch
%scalable_offset = ... # sequence of instructions for vscaled-offset.
%0 = ADD/SUB %scratch, %scalable_offset
Differential Revision: https://reviews.llvm.org/D100035
This test case shows that we access wrong stack slots when the frame
object has scalable offset under large stack size.
Differential Revision: https://reviews.llvm.org/D100084
If the constants have a difference of 1 we can convert one to
the other by adding or subtracting the condition.
We have a DAG combine for this, but it only runs before type
legalization. If the select is introduced later during type
legalization or op legalization we will miss it.
We don't need a specific condition, but some conditions are
harder to materialize than others on RISCV. I know that SETLT
will be a single instruction and it is what is used by the
motivating pattern from signed saturating add/sub.
Differential Revision: https://reviews.llvm.org/D99021
This can't use our normal strategy of splatting the scalar and using
a .vv operation instead of .vx.
Instead this patch bitcasts the vector to the equivalent SEW=32
vector and inserts the scalar parts using two vslide1up/down. We
do that unmasked and apply the mask separately at the end with
a vmerge.
For vslide1up there maybe some other options here like getting
i64 into element 0 and using vslideup.vi with this vector as
vd and the original source as vs1. Masking would still need to
be done afterwards.
That idea doesn't work for vslide1down. We need to slidedown and
then insert a single scalar at vl-1 which we could do with a
vslideup, but that assumes vl > 0 which I don't think we can assume.
The i32 double slide1down implemented here is the best I could come
up with and I just made vslide1up consistent.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D99910
This allows FoldConstantArithmetic to handle SPLAT_VECTOR in
addition to BUILD_VECTOR. This allows it to support scalable
vectors. I'm also allowing fixed length SPLAT_VECTOR which is
used by some targets, but I'm not familiar enough to write tests
for those targets.
I had to block this function from running on CONCAT_VECTORS to
avoid calling getNode for a CONCAT_VECTORS of 2 scalars.
This can happen because the 2 operand getNode calls this
function for any opcode. Previously we were protected because
CONCAT_VECTORs of BUILD_VECTOR is folded to a larger BUILD_VECTOR
before that call. But it's not always possible to fold a CONCAT_VECTORS
of SPLAT_VECTORs, and we don't even try.
This fixes PR49781 where DAG combine thought constant folding
should be possible, but FoldConstantArithmetic couldn't do it.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D99682
I missed a few intrinsics in 3dd4aa7d09
when I did this for masked loads and masked segment loads/stores.
Found while trying to share more code between these custom isel
functions.
This patch supports bitcasts from scalar types to fixed-length vectors
and vice versa. It custom-lowers and custom-legalizes them to
EXTRACT_VECTOR_ELT/INSERT_VECTOR_ELT operations, using a single-element
vectors to hold the scalar where appropriate.
Previously, some of these would fail to select, others would be expanded
through stack loads and stores. Effort was made to ensure the codegen
avoids the stack for both legal and illegal scalar types.
Some of the codegen could be improved, but on first glance it looks like
a general optimization of EXTRACT_VECTOR_ELT when extracting an i64
element on RV32.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D99667
Caught in internal testing, these operations are assumed legal by
default, even for scalable vector types. Expand them back into separate
truncations and stores, or loads and extensions.
Also add explicit fixed-length vector tests for these operations, even
though they should have been correct already.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D99654
This patch adds a test which shows how the compiler incorrectly sets the
size and alignment of a stack object used to indirectly pass vector
types to functions.
In the particular example, the test passes a <4 x i8> vector type to a
function and creates a stack object of size and alignment equal to 4
bytes. However, the code generated to set up that parameter has been
scalarized and stores each element as individual XLEN-sized values. Thus
on RV32 this stores 16 bytes and on RV64 32 bytes, both of which clobber
the stack. Similarly, the alignment is set up as the alignment
of the vector type, which is not necessarily the natural alignment of XLEN.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D95025
The W version of orc.b does not exist in Zbp so we need to use
gorci encoding. If we have Zbp, we can use gorciw which can avoid a
sext.w in some cases.
Head files are included in a separate patch in case the name needs to be changed.
RV32 / 64:
clmul
clmulh
clmulr
Differential Revision: https://reviews.llvm.org/D99711
Forgot to amend the Author.
Original commit message:
Header files are included in a separate patch in case the name needs to be changed.
RV32 / 64:
orc.b
Differential Revision: https://reviews.llvm.org/D99320
Implementation for RISC-V Zbr extension intrinsic.
Header files are included in separate patch in case the name needs to be changed
RV32 / 64:
crc32b
crc32h
crc32w
crc32cb
crc32ch
crc32cw
RV64 Only:
crc32d
crc32cd
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D99009
For positive constants we try shifting left to remove leading zeros
and fill the bottom bits with 1s. We then materialize that constant
shift it right.
This patch adds a new strategy to try filling the bottom bits with
zeros instead. This catches some additional cases.
RV32 is able to use the llvm.experimental.vector.insert intrinsics too.
This patch ensures they're tested.
Reviewed By: khchen, asb
Differential Revision: https://reviews.llvm.org/D99655
D99717 introduced some test cases which showed that the output of one
vsetvli into another would not be picked up by the RISCVCleanupVSETVLI
pass. This patch teaches the optimization about such a pattern. The
pattern is quite common when using the RVV vsetvli intrinsic to pass the
VL onto other intrinsics.
The second test case introduced by D99717 is left unoptimized by this
patch. It is a rarer case and will require us to rewire any uses of the
redundant vset[i]vli's output to the previous one's.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D99730
This occurs when we type legalize an i64 scalar input on RV32. We
need to manually splat, which requires a vector input. Rather
than special case this in lowering just pattern match it.
The default legalization strategy is PromoteFloat which keeps
half in single precision format through multiple floating point
operations. Conversion to/from float is done at loads, stores,
bitcasts, and other places that care about the exact size being 16
bits.
This patches switches to the alternative method softPromoteHalf.
This aims to keep the type in 16-bit format between every operation.
So we promote to float and immediately round for any arithmetic
operation. This should be closer to the IR semantics since we
are rounding after each operation and not accumulating extra
precision across multiple operations. X86 is the only other
target that enables this today. See https://reviews.llvm.org/D73749
I had to update getRegisterTypeForCallingConv to force f16 to
use f32 when the F extension is enabled. This way we can still
pass it in the lower bits of an FPR for ilp32f and lp64f ABIs.
The softPromoteHalf would otherwise always give i16 as the
argument type.
Reviewed By: asb, frasercrmck
Differential Revision: https://reviews.llvm.org/D99148
We need to splat the scalar separately and use .vv, but there is
no vmsgt(u).vv. So add isel patterns to select vmslt(u).vv with
swapped operands.
We also need to get VT to use for the splat from an operand rather
than the result since the result VT is nxvXi1.
Reviewed By: HsiangKai
Differential Revision: https://reviews.llvm.org/D99704
There's no target independent ISD opcode for MULHSU, so custom
legalize 2*XLen multiplies ourselves. We have to be a little
careful to prefer MULHU or MULHSU.
I thought about doing this in isel by pattern matching the
(add (mul X, (srai Y, XLen-1)), (mulhu X, Y)) pattern. I decided
against this because the add might become part of a chain of adds.
I don't trust DAG combine not to reassociate with other adds making
it difficult to find both pieces again.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D99479
This adds a new integer materialization strategy mainly targeted
at 64-bit constants like 0xffffffff where there are 32 or more trailing
ones with leading zeros. We can materialize these by using an addi -1
and srli to restore the leading zeros. This matches what gcc does.
I haven't limited to just these cases though. The implementation
here takes the constant, shifts out all the leading zeros and
shifts ones into the LSBs, creates the new sequence, adds an srli,
and checks if this is shorter than our original strategy.
I've separated the recursive portion into a standalone function
so I could append the new strategy outside of the recursion. Since
external users are no longer using the recursive function, I've
cleaned up the external interface to return the sequence instead of
taking a vector by reference.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D98821
This allows these optimisations to apply to e.g. `urem i16` directly
before `urem` is promoted to i32 on architectures where i16 operations
are not intrinsically legal (such as on Aarch64). The legalization then
later can happen more directly and generated code gets a chance to avoid
wasting time on computing results in types wider than necessary, in the end.
Seems like mostly an improvement in terms of results at least as far as x86_64 and aarch64 are concerned, with a few regressions here and there. It also helps in preventing regressions in changes like {D87976}.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D88785
Our CLZW isel pattern is quite easily broken by surrounding code
preventing it from matching sometimes. This usually results in
failing to remove the and X, 0xffffffff inserted by type
legalization. The add with -32 that type legalization also inserts
will often gets combined into other add/sub nodes. That doesn't
usually result in extra code when we don't use clzw.
CTTZ seems to be less fragile, but I wanted to keep it consistent
with CTLZ.
Reviewed By: asb, HsiangKai
Differential Revision: https://reviews.llvm.org/D99317
Also modify the simm5_plus1 check because Imm-1 is UB if Imm happens
to be INT64_MIN. I don't think the compiler would optimize based on that in this
usage, but it could fail UBSan or -ftrapv.
Reviewed By: HsiangKai, frasercrmck
Differential Revision: https://reviews.llvm.org/D99637
This adds almost everything required for supporting the new stepvector
intrinsic on RVV. It is lowered to the existing VID_VL SDNode.
The only exception is a limitation that RV32 cannot yet lower the
intrinsic on i64 vectors. This is because the step operand is
(currently) required to be at least as large as the vector element type.
I will look into patching that out and loosening the requirement to only
an integer pointer type.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D99594
Without Zfh the half type isn't legal, but it could still be
used as an argument/return in IR. Clang will not generate this today.
Previously we promoted the half value to float for arguments and
returns if the F extension is enabled but Zfh isn't. Then depending on
which ABI is enabled we would pass it in either an FPR or a GPR in
float format.
If the F extension isn't enabled, it would get passed in the lower
16 bits of a GPR in half format.
With this patch the value will always in half format and will be
in the lower bits of a GPR or FPR. This should be consistent
with where the bits are located when Zfh is enabled.
I've based this implementation off of how this is done on ARM.
I've manually nan-boxed the value to 32 bits using integer ops.
It looks like flw, fsw, fmv.s, fmv.w.x, fmf.x.w won't
canonicalize nans so should leave the value alone. I think those
are the instructions that could get used on this value.
Reviewed By: kito-cheng
Differential Revision: https://reviews.llvm.org/D98670
This matches what we do in our isel patterns. In our internal
testing we've found this is needed to make the fast register
allocator happy at -O0. Otherwise it may assign V0 to an earlier
operand and find itself with no registers left when it reaches
the mask operand. By using V0 explicitly, the fast register allocator
will see it when it checks for phys register usages before it
starts allocating vregs. I'll try to update this with a test case.
Unfortunately, this does appear to prevent some instruction reordering
by the pre-RA scheduler which leads to the increased spills seen in
some tests. I suspect that problem could already occur for other
instructions that already used V0 directly.
There's a lot of repeated code here that could do with some
wrapper functions. Not sure if that should be at the level of the
new code that deals with V0. That would require multiple output
parameters to pass the glue, chain and register back. Maybe it
should be at a higher level over the entire set of push_backs.
Reviewed By: frasercrmck, HsiangKai
Differential Revision: https://reviews.llvm.org/D99367
In D97111 we changed the RVV frame layout when using sp or bp to address
the stack slots so we could address the emergency stack slot. The idea
is to put the RVV objects as far as possible (in offset terms) from the
frame reference register (sp / fp / bp).
When using fp this happens naturally because the RVV objects are already
the top of the stack and due to the constraints of RVV (VLENB being a
power of two >= 128) the stack remains aligned. The rest of this summary
does not apply to this case.
When using sp / bp we need to skip the non-RVV stack slots. The size of
the the non-RVV objects is computed subtracting the callee saved
register size (whose computation is added in D97111 itself) to the total
size of the stack (which does not account for RVV stack slots). However,
when doing so we round to 16 bytes when computing that size and we end
emitting a smaller offset that may belong to a scalar stack slot (see
D98801). So this change removes that rounding.
Also, because we want the RVV objects be between the non-RVV stack slots
and the callee-saved register slots, we need to make sure the RVV
objects are properly aligned to 8 bytes. Adding a padding of 8 would
render the stack unaligned. So when allocating space for RVV (only when
we don't use fp) we need to have extra padding that preserves the stack
alignment. This way we can round to 8 bytes the offset that skips the
non-RVV objects and we do not misalign the whole stack in the way. In
some circumstances this means that the RVV objects may have padding
before (=lower offsets from sp/bp) and after (before the CSR stack
slots).
Differential Revision: https://reviews.llvm.org/D98802
This testcase shows that we attempt to assign the same offset sp + 16 to
two different stack objects.
The fix will come in a later change.
Differential Revision: https://reviews.llvm.org/D98801
While addressing RVV frame layout issues I found this file had
whitespace differences that made diffs noisier than they should be.
Differential Revision: https://reviews.llvm.org/D98800
This is currently performed in SelectionDAGLegalize, here we make it also
happen in LegalizeVectorOps, allowing a target to lower the SETCC condition
codes first in LegalizeVectorOps and then lower to a custom node afterwards,
without having to duplicate all of the SETCC condition legalization in the
target specific lowering.
As a result of this, fixed length floating point SETCC nodes can now be
properly lowered for SVE.
Differential Revision: https://reviews.llvm.org/D98939
We have a special pattern for
(mul (and X, 0xffffffff), (and Y, 0xffffffff)), to optimize the
ANDs to shift. But if a sext_inreg coms first, we'll form a MULW
and limit the effectiveness of the special match. So this patch
adds a larger pattern to suppress the MULW formation by emitting
a sext.w and then the same output we use for the
(mul (and X, 0xffffffff), (and Y, 0xffffffff)). This should all
get CSEd.
This is the issue I was trying to fix with D99029, but that affected
many more tests.
Add the constraint when destination EEW not equals the source EEW for
correctness.
The RVV spec has three register overlap rules and I implement the first
stricter constraint because the others are difficult to enforce.
Reviewed By: frasercrmck, craig.topper
Differential Revision: https://reviews.llvm.org/D98920
We look for this pattern frequently in isel patterns so its a
good idea to try to preserve it.
This also let's us remove our special isel handling for srliw
and use a direct pattern match of (srl (and X, 0xffffffff), C)
since no bits will be removed from the and mask.
Differential Revision: https://reviews.llvm.org/D99042
This patch adds a small optimization for vector shuffle lowering,
detecting shuffles which can be re-expressed as vector selects.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D99270
This patch adds further optimization techniques to RVV BUILD_VECTOR
lowering. It teaches the compiler to find splats of larger vector
element types "hidden" in smaller ones. For example, a v4i8 build_vector
(0x1, 0x2, 0x1, 0x2) could be splat as v2i16 0x0201. This is generally
more optimal than the dominant-element BUILD_VECTORs and so takes
priority.
This optimization is currently limited to all-constant-or-undef
BUILD_VECTORs as those were found to be the most common. There's no
reason this couldn't be extended to other BUILD_VECTORs, but the
additional bit-manipulation instructions may require more sophisticated
heuristics.
There are some cases where the materialization of the larger constant
takes more scalar instructions than it does to build the vector with
vector instructions. We could add heuristics to try and catch this.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D99195
This implements various idioms using ctlz/cttz like Log2, Log2_Ceil,
findFirstSetBit, etc.
Some of these demonstrate that we fail to use clzw because the
idiom breaks the isel patterns we use. The isel pattern we use
is (add (cttz (and X, 0xffffffff)), -32). Some of the idioms
cause the constant on the add to be different.
This patch builds upon the initial BUILD_VECTOR work introduced in
D98700. It further optimizes the lowering of BUILD_VECTOR by using
VSELECT operations to effectively insert repeated elements into the
vector with relatively few instructions. This allows us to optimize more
BUILD_VECTORs without significantly increasing the size of the generated
code.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D98969
This patch adds an optimization for mask-vector BUILD_VECTOR nodes whose
elements are all constants or undef. It lowers such operations by
building up the vector via a series of integer operations, in which
multiple mask elements are inserted into a vector at a time via
i8/i16/i32/i64 element types. The final result is then bitcast from that
integer vector.
We restrict this optimization in certain circumstances when optimizing
for size. If we are required to use more than one integer insert
operation, then it will likely increase code size compared with using a
load from a constant pool.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D98860
I've split the gather/scatter custom handler to avoid complicating
it with even more differences between gather/scatter.
Tests are the scalable vector tests with the vscale removed and
dropped the tests that used vector.insert. We're probably not
as thorough on the splitting cases since we use 128 for VLEN here
but scalable vector use a known min size of 64.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D98991
The reason for generating mv a0, a0 instruction is when the stack object offset is large then int<12>. To deal this situation, in the elimintateFrameIndex function, it will
create a virtual register, which needs the register scavenger to scavenge it. If the machine instruction that contains the stack object and the opcode is ADDI(the addi
was generated by frameindexNode), and then this instruction's destination register was the same as the register that was generated by the register scavenger, then the
mv a0, a0 was generated. So to eliminnate this instruction, in the eliminateFrameIndex function, if the instrution opcode is ADDI, then the virtual register can't be created.
Differential Revision: https://reviews.llvm.org/D92479
If the mul add two users, one of which was a sext.w, the mul
would also be selected to a MULW before our pattern runs. This
causes the ANDs to now be used by the already selected MULW and
the mul we still need to select. They are unneeded on the MULW
since MULW only reads the lower bits. So they get selected to
SLLI+SRLI for the MULW use. The use for the
(mul (and X, 0xffffffff), (and Y, 0xffffffff)) manages to reuse
the SLLI.
The end result is increased register pressure and no improvement
to how soon we can start the MULW.
This optimization is trying to save SRLI instructions needed to
implement the ANDs. If we have zext.w we won't save anything.
Because we don't check that the multiply is the only user of the
AND we might even increase instruction count.
This patterns computes the full 64 bit product of a 32x32 unsigned
multiply. This requires a two pairs of SLLI+SRLI to zero the
upper 32 bits of the inputs.
We can do better than this by using two SLLI to move the lower
bits to the upper bits then use MULHU to compute the product. This
is the high half of a full 64x64 product. Since we put 32 0s in the lower
bits of the inputs we know the 128-bit product will have zeros in the
lower 64 bits. So the upper 64 bits, which MULHU computes, will contain
the original 64 bit product we were after.
The same trick would work for (mul (sext_inreg X, i32), (sext_inreg Y, i32))
using MULHS, but sext_inreg is sext.w which is already one instruction so we
wouldn't save anything.
Differential Revision: https://reviews.llvm.org/D99026
I'm not sure how I failed to notice this before, but when optimizing
dominant-element BUILD_VECTORs we would lower via the scalable container type,
which lost us the information about the fixed length of the vector types. By
lowering via the fixed-length type we can preserve that information and
eliminate redundant vsetvli instructions.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D98938
Since the "LMUL-MAX=2" output for some test functions differed between
RV32 and RV64, the update_llc_test_checks script failed to emit a
unified LMULMAX2 check for them. I'm not sure why it didn't warn about
this.
This patch also takes the opportunity to add unified RV32/RV64 checks to
help shorten the test file when the output for LMULMAX1 and LMULMAX2 is
identical but differs between the two ISAs.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D98944
Returning the scalable-vector container type would present problems when
the fixed-length INSERT_VECTOR_ELT was used by later operations.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D98776
For Zvlsseg, we create several tuple register classes. When spilling for
these tuple register classes, we need to iterate NF times to load/store
these tuple registers.
Differential Revision: https://reviews.llvm.org/D98629
This patch adds support for masked scatter intrinsics on scalable vector
types. It is mostly an extension of the earlier masked gather support
introduced in D96263, since the addressing mode legalization is the
same.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D96486
This patch supports the masked gather intrinsics in RVV.
The RVV indexed load/store instructions only support the "unsigned unscaled"
addressing mode; indices are implicitly zero-extended or truncated to XLEN and
are treated as byte offsets. This ISA supports the intrinsics directly, but not
the majority of various forms of the MGATHER SDNode that LLVM combines to. Any
signed or scaled indexing is extended to the XLEN value type and scaled
accordingly. This is done during DAG combining as widening the index types to
XLEN may produce illegal vectors that require splitting, e.g.
nxv16i8->nxv16i64.
Support for scalable-vector CONCAT_VECTORS was added to avoid spilling via the
stack when lowering split legalized index operands.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D96263
Without this patch, bitcasts of fixed-length mask vectors would go
through the stack.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D98779
This patch changes the operand order of masked vmslt[u]
from (mask, rs1, scalar, maskedoff, vl)
to (maskedoff, rs1, scalar, mask, vl).
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D98839
This patch adds an optimization path for BUILD_VECTOR nodes where the
majority of the elements are identical. These can be splatted, with the
remaining elements patched up with INSERT_VECTOR_ELTs. The threshold can
be tweaked as required - it is currently conservative. Undef elements
are disregarded when judging the dominance of a particular element. This
allows them to be covered by the splat value.
In addition, vectors of 2 elements are always optimized to a splat (for
the upper element) and an insert at element zero.
This optimization is disabled when optimizing for size.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D98700
The InstrEmitter can sometimes insert a copy after an IMPLICIT_DEF
before connecting it to the vector instruction. This occurs when
constrainRegClass reduces to a class with less than 4 registers.
I believe LMUL8 on masked instructions triggers this since the
result can only use the v8, v16, or v24 register group as the mask
is using v0.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D98567
The default promotion uses zero extends that become shifts. We
cam use sign extend instead which is better for RISCV.
I've used two different implementations based on whether we
have minu/maxu instructions.
Differential Revision: https://reviews.llvm.org/D98683
This allows me to introduce similar combines for branches as
we have recently added for SELECT_CC. Some of them are less
useful for standalone setccs and only help branch instructions.
By having a BR_CC node its easier to only affect branches.
I'm using CondCodeSDNode to make isel patterns easier to
write so we can refer to the codes by name. SELECT_CC uses a
constant instead.
I've translated the condition code just like SELECT_CC so
we need less patterns for the swapped conditions. This
includes special cases for X < 1 and X > -1 that get translated
to blez and bgez by using a 0 constant.
computeKnownBitsForTargetNode support for SELECT_CC is added
to allow MaskedValueIsZero to work for cases where the true
and false values of the SELECT_CC are setccs and the
result of the SELECT_CC is used by a BR_CC. This was needed
to avoid regressions in some of the overflow tests.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D98159
The following code-sequence showed up in a testcase (isolated from
SPEC2017) for if-conversion and vectorization when searching for the
maximum in an array:
addi a2, zero, 1
blt a1, a2, .LBB0_5
which can be expressed as `bge zero,a1,.LBB0_5`/`blez a1,/LBB0_5`.
More generally, we want to express (a < 1) as (a <= 0).
This adds the required isel-pattern and updates the testcases.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D98449
This patch addresses a few issues when dealing with scalable-vector
INSERT_SUBVECTOR and EXTRACT_SUBVECTOR nodes.
When legalizing in DAGTypeLegalizer::SplitVecRes_INSERT_SUBVECTOR, we
store the low and high halves to the stack separately. The offset for
the high half was calculated incorrectly.
Additionally, we can optimize this process when we can detect that the
subvector is contained entirely within the low/high split vector type.
While this optimization is valid on scalable vectors, when performing
the 'high' optimization, the subvector must also be a scalable vector.
Note that the 'low' optimization is still conservative: it may be
possible to insert v2i32 into the low half of a split nxv1i32/nxv1i32,
but we can't guarantee it. It is always possible to insert v2i32 into
nxv2i32 or v2i32 into nxv4i32+2 as we know vscale is at least 1.
Lastly, in SelectionDAG::isSplatValue, we early-exit on the extracted subvector value
type being a scalable vector, forgetting that we can also extract a
fixed-length vector from a scalable one.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D98495
The default legalization uses zero extends that require pair of shifts
on RISCV. Instead we can take advantage of the fact that unsigned
compares work equally well on sign extended inputs. This allows
us to use addw/subw and sext.w.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D98233
This patch adds fixed-length vector support to the calling convention
when RVV is used to lower fixed-length vectors. The scheme follows the
regular vector calling convention for the argument/return registers, but
uses scalable vector container types as the LocVTs, and converts to/from
the fixed-length vector value types as required.
Fixed-length vector types may be split when the combination of minimum
VLEN and the maximum allowable LMUL is not large enough to fully contain
the vector. In this case the behaviour differs between fixed-length
vectors passed as parameters and as return values:
1. For return values, vectors must be passed entirely via registers or
via the stack.
2. For parameters, unlike scalar values, split vectors continue to be
passed by value, and are split across multiple registers until there are
no remaining registers. Thus vector parameters may be found partly in
registers and partly on the stack.
As with scalable vectors, the first fixed-length mask vector is passed
via v0. Split mask fixed-length vectors are passed first via v0 and then
via the next available vector register: v8,v9,etc.
The handling of vector return values uses all available argument
registers v8-v23 which does not adhere to the calling convention we're
supposedly implementing, but since this issue affects both fixed-length
and scalable-vector values, it was left as-is.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D97954
Types of fractional LMUL and LMUL=1 are all using VR register class. When
using inline asm, it will use the first type in the register class as the
type for the register. It is not necessary the same as the value type. We
need to use INSERT_SUBVECTOR/EXTRACT_SUBVECToR/BITCAST to make it legal
to put the value in the corresponding register class.
Differential Revision: https://reviews.llvm.org/D97480
This patch change the rvv frame layout that proposed in D94465. In patch D94465, In the eliminateFrameIndex function,
to eliminate the rvv frame index, create temp virtual register is needed. This virtual register should be scavenged by class
RegsiterScavenger. If the machine function has other unused registers, there is no problem. But if there isn't unused registers,
we need a emergency spill slot. Because of the emergency spill slot belongs to the scalar local variables field, to access emergency
spill slot, we need a temp virtual register again. This makes the compiler report the "Incomplete scavenging after 2nd pass" error.
So I change the rvv frame layout as follows:
```
|--------------------------------------|
| arguments passed on the stack |
|--------------------------------------|<--- fp
| callee saved registers |
|--------------------------------------|
| rvv vector objects(local variables |
| and outgoing arguments |
|--------------------------------------|
| realignment field |
|--------------------------------------|
| scalar local variable(also contains|
| emergency spill slot) |
|--------------------------------------|<--- bp
| variable-sized local variables |
|--------------------------------------|<--- sp
```
Differential Revision: https://reviews.llvm.org/D97111
This also briefly tests a larger set of architectures than the more
exhaustive functionality tests for AArch64 and x86.
As requested in D88785
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D98339
This patch optimizes the codegen for INSERT_VECTOR_ELT in various ways.
Primarily, it removes the use of vslidedown during lowering, and the
vector element is inserted entirely using vslideup with a custom VL and
slide index.
Additionally, lowering of i64-element vectors on RV32 has been optimized
in several ways. When the 64-bit value to insert is the same as the
sign-extension of the lower 32-bits, the codegen can follow the regular
path. When this is not possible, a new sequence of two i32 vslide1up
instructions is used to get the vector element into a vector. This
sequence was suggested by @craig.topper. From there, the value is slid
into the final position for more consistent lowering across RV32 and
RV64.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D98250
We don't support any other shuffles currently.
This changes the bswap/bitreverse tests that check for this in
their expansion code. Previously we expanded a byte swapping
shuffle through memory. Now we're scalarizing and doing bit
operations on scalars to swap bytes.
In the future we can probably use vrgather.vx to do a byte swap
shuffle.
This uses a really simple approach of converting to an i8 vector
and extracting. This is probably not the best approach especially
if you know the index is constant.
Other ideas:
-Store to stack temporary using vse1, load as scalar and shift.
-Sort of bitcast the vector to a vector of i8, slide down the
appropriate 8 bit element, copy to scalar, shift down the
correct bit within the 8 bits we extracted. Not exactly sure
how to describe such a bitcast from i1 vector to i8 vector
within the type system for elements less than 8.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D98310
RISCV makes all fixed vector MVTs with size less than or equal
to a command line option legal.
This didn't include v1f16 because it was missing but did include v1f32 and v1f64.
One test is affected where we did test this type, but it is a horizontal
reduction so it is non-sensical. Perhaps we should canonicalize that
away somewhere.
I'm not sure if we should be making v1 types legal, but this will at
least make RISCV consistent across all types.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D98365
Currently we crash in type legalization any time an intrinsic
uses a scalar i64 on RV32.
This patch adds support for type legalizing this to prevent
crashing. I don't promise that it uses the best possible codegen
just that it is functional.
This first version handles 3 cases. vmv.v.x intrinsic, vmv.s.x
intrinsic and intrinsics that take a scalar input, splat it and
then do some operation.
For vmv.v.x we'll either rely on hardware sign extension for
constants or we'll convert it to multiple splats and bit
manipulation.
For vmv.s.x we use a really unoptimal sequence inspired by what
we do for an INSERT_VECTOR_ELT.
For the third case we'll either try to use the .vi form for
constants or convert to a complicated splat and bitmanip and use
the .vv form of the operation.
I've renamed the ExtendOperand field to SplatOperand now use it
specifically for the third case. The first two cases are handled
by custom lowering specifically for those intrinsics.
I haven't updated all tests yet, but I tried to cover a subset
that includes single-width, widening, and narrowing.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D97895
The type legalizer will visit the result before the operands. To
avoid creating an illegal target specific node or falling back to
scalarization, we need to manually split vector operands.
This still doesn't handle the case of non-power of 2 operands
which need to be widened. I'm not sure the type legalizer is
ready for it. I think we would need to insert an
INSERT_SUBVECTOR with the power of 2 type we want, with an undef
first operand, and the non-power of 2 orignal operand as the vector
to insert. Then fill in the neutral elements into the elements the
padded elements. Alternatively we INSERT_SUBVECTOR into a neutral vector.
From there we carry on splitting if needed to get to a legal type
then do the target specific code.
The problem with this is the type legalizer doesn't know how to
widen an insert_subvector yet. We would need to add that including
the handling for a non-undef first vector.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D98292