The constants end up getting zero-extended to i64, but a sign extend
would be better for constant materialization. We're using W
instructions, so either behavior is correct since the upper bits
aren't read.
These are FP->int conversions using either the RMM or dynamic rounding mode.
The lround and lrint opcodes have a return type of either i32 or
i64 depending on sizeof(long) in the frontend, which should follow
XLEN. llround/llrint always return i64, so we'll need a libcall
for those on RV32.
The frontend will only emit the intrinsics if -fno-math-errno is in
effect; otherwise a libcall will be emitted, which will not use
these ISD opcodes.
gcc also does this optimization.
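For reference, a minimal sketch (hypothetical IR, assuming an LP64 target so
long is i64, and f64 operands) of what the frontend emits under
-fno-math-errno:
```
declare i64 @llvm.lround.i64.f64(double)
declare i64 @llvm.llround.i64.f64(double)

define i64 @round_to_long(double %x) {
  ; Under -fno-math-errno clang emits the intrinsic; with errno enabled a
  ; call to the lround libcall is emitted instead and these ISD opcodes are
  ; never involved.
  %r = call i64 @llvm.lround.i64.f64(double %x)
  ret i64 %r
}
```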
Reviewed By: arcbbb
Differential Revision: https://reviews.llvm.org/D105206
This adds a DAG combine to detect sext/zext inputs and emit a
new ISD opcode. The extends will either be removed or replaced
with narrower extends.
Isel patterns are used to match the add and widening mul to vwmacc,
similar to the recently added vmacc patterns.
There's still some work to be done to match vmulsu.
We should also rewrite splats that were extended as scalars and
then splatted.
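A minimal sketch (hypothetical IR) of the kind of pattern the combine and the
new isel patterns target, where the sign-extended multiply feeding an add
becomes a vwmacc:
```
define <vscale x 2 x i64> @widening_macc(<vscale x 2 x i64> %acc,
                                         <vscale x 2 x i32> %a,
                                         <vscale x 2 x i32> %b) {
  ; The combine detects the sext inputs of the mul and emits the widening
  ; multiply opcode; isel then folds the add into vwmacc.
  %a.ext = sext <vscale x 2 x i32> %a to <vscale x 2 x i64>
  %b.ext = sext <vscale x 2 x i32> %b to <vscale x 2 x i64>
  %mul = mul <vscale x 2 x i64> %a.ext, %b.ext
  %sum = add <vscale x 2 x i64> %acc, %mul
  ret <vscale x 2 x i64> %sum
}
```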
Reviewed By: arcbbb
Differential Revision: https://reviews.llvm.org/D104802
This will currently accept the old number of bytes syntax, and convert
it to a scalar. This should be removed in the near future (I think I
converted all of the tests already, but likely missed a few).
Not sure what the exact syntax and policy should be. We can continue
printing the number of bytes for non-generic instructions to avoid
test churn and only allow non-scalar types for generic instructions.
This will currently print the LLT in parentheses, but accept parsing
the existing integers and implicitly converting them to scalars. The
parentheses are a bit ugly, but the parser logic seems unable to cope
without either parentheses or some keyword to indicate the start of a
type.
It seems it is possible for DAG combine to create a shl with an
i64 result type and an i32 shift amount. This is OK before type
legalization since the types don't need to match in SelectionDAG.
This results in type legalization calling LowerOperation to
legalize just the amount. We weren't expecting this, so we
asserted when we didn't find a fixed vector shift.
To fix this, I've added a check for the fixed vector case and
returned SDValue() to fall back to the default type legalization.
I've factored all shifts together and added a fixed-vector-specific
handler to avoid repeating similar code for each shift in
LowerOperation.
The particular case I found was exposed by D104581, but the bad
shift is created after that patch triggers.
We use (and (ctpop X), 1) to represent parity.
The generated code for i32 parity on RV64 has more instructions than
necessary, which I hope to improve in a follow-up patch.
Also add a missing test for i64 ctpop.
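For reference, the parity idiom mentioned above in (hypothetical) IR form:
```
declare i32 @llvm.ctpop.i32(i32)

define i32 @parity_i32(i32 %x) {
  ; Parity is represented as the low bit of the population count.
  %pop = call i32 @llvm.ctpop.i32(i32 %x)
  %par = and i32 %pop, 1
  ret i32 %par
}
```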
I thought this might help with another optimization I was
thinking about, but I don't think it will, so it just wastes
compile time calling computeKnownBits for no benefit.
This reverts commit 81b2f95971.
If type legalization is going to insert a sign_extend for other users
of X and we can fold the sign_extend into ADDW/MULW/SUBW, it is
better to replace the ANY_EXTEND so we don't end up with a separate
ADD/MUL/SUB instruction for the users of the ANY_EXTEND.
I'm only handling setcc uses right now, but there are other
instructions that force sign_extends, like ashr.
There are probably other *W instructions we could use in addition
to ADDW/SUBW/MULW.
My motivating case was a loop-terminating compare and a phi use,
as seen in the new test file.
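A simplified sketch of that motivating shape (hypothetical IR, not the actual
test): the promoted i32 add feeds both the loop-exit compare, which would
otherwise force its own sign_extend, and the phi.
```
define i32 @count_up(i32 %n) {
entry:
  br label %loop
loop:
  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
  %i.next = add i32 %i, 1
  ; On RV64 this compare forces a sign_extend of the promoted add; selecting
  ; ADDW for %i.next makes that sign_extend free.
  %cmp = icmp slt i32 %i.next, %n
  br i1 %cmp, label %loop, label %exit
exit:
  ret i32 %i.next
}
```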
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D104581
This patch teaches the compiler to generate code to handle larger RVV
stack sizes and stack offsets which resolve to an amount larger than 2047
vector registers in size.
The previous behaviour was to assert on such large values, as it was only
able to materialize the constant by feeding it to the 12-bit immediate
of an `ADDI` instruction. The compiler can now materialize this amount
into a temporary register before continuing with the computation.
A test case for this scenario is included which also checks that the
temporary register used to materialize the amount doesn't require an
additional spill slot over what we're already reserving for RVV code.
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D104727
This patch optimizes the code generation of vector-type SELECTs (LLVM
select instructions with scalar conditions) by custom-lowering to
VSELECTs (LLVM select instructions with vector conditions) by splatting
the condition to a vector. This avoids the default expansion path, which
would either introduce control flow or fully scalarize.
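A small sketch of the kind of IR this applies to (hypothetical example; a
scalar i1 condition selecting between whole vectors):
```
define <4 x i32> @select_vec(i1 %cond, <4 x i32> %a, <4 x i32> %b) {
  ; The scalar condition is splatted to <4 x i1> and the operation is
  ; lowered as a VSELECT rather than expanded with branches or per-element
  ; selects.
  %r = select i1 %cond, <4 x i32> %a, <4 x i32> %b
  ret <4 x i32> %r
}
```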
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D104772
This optimization pre-promotes the input and constants for a
switch instruction to a legal type so that all the generated compares
share the same extend. Since RISCV prefers sext for i32 to i64
extends, we should honor that and use sext.w instead of a pair
of shifts.
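A hypothetical example of the kind of switch this affects: with the i32
condition and case constants pre-promoted to i64, the expanded compares on
RV64 can share one sext.w instead of each compare extending on its own.
```
define i32 @classify(i32 %x) {
entry:
  switch i32 %x, label %other [
    i32 1, label %one
    i32 100, label %hundred
  ]
one:
  ret i32 10
hundred:
  ret i32 20
other:
  ret i32 0
}
```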
Reviewed By: jrtc27
Differential Revision: https://reviews.llvm.org/D104612
With regards to overrunning, the langref (llvm/docs/LangRef.rst)
specifies:
(llvm.experimental.vector.insert)
Elements ``idx`` through (``idx`` + num_elements(``subvec``) - 1)
must be valid ``vec`` indices. If this condition cannot be determined
statically but is false at runtime, then the result vector is
undefined.
(llvm.experimental.vector.extract)
Elements ``idx`` through (``idx`` + num_elements(result_type) - 1)
must be valid vector indices. If this condition cannot be determined
statically but is false at runtime, then the result vector is
undefined.
For the non-mixed cases (e.g. inserting/extracting a scalable into/from
another scalable, or inserting/extracting a fixed into/from another
fixed), it is possible to statically check whether or not the above
conditions are met. This was previously missing from the verifier, and
if the conditions were found to be false, the result of the
insertion/extraction would be replaced with an undef.
With regards to invalid indices, the langref (llvm/docs/LangRef.rst)
specifies:
(llvm.experimental.vector.insert)
``idx`` represents the starting element number at which ``subvec``
will be inserted. ``idx`` must be a constant multiple of
``subvec``'s known minimum vector length.
(llvm.experimental.vector.extract)
The ``idx`` specifies the starting element number within ``vec``
from which a subvector is extracted. ``idx`` must be a constant
multiple of the known-minimum vector length of the result type.
Similarly, these conditions were not previously enforced in the
verifier. In some circumstances, invalid indices were permitted
silently, and in other circumstances, an undef was spawned where a
verifier error would have been preferred.
This commit adds verifier checks to enforce the constraints above.
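As a hedged illustration of both constraints in the fixed-from-fixed case
(hypothetical IR; the intrinsic names and mangling follow their current
experimental form):
```
declare <4 x i32> @llvm.experimental.vector.extract.v4i32.v8i32(<8 x i32>, i64)

; OK: extracts elements 4..7 of an 8-element vector, and 4 is a multiple of
; the result type's minimum length.
define <4 x i32> @valid_extract(<8 x i32> %vec) {
  %sub = call <4 x i32> @llvm.experimental.vector.extract.v4i32.v8i32(<8 x i32> %vec, i64 4)
  ret <4 x i32> %sub
}

; Now rejected by the verifier: idx 6 is not a multiple of 4, and elements
; 6..9 overrun the source vector.
define <4 x i32> @invalid_extract(<8 x i32> %vec) {
  %sub = call <4 x i32> @llvm.experimental.vector.extract.v4i32.v8i32(<8 x i32> %vec, i64 6)
  ret <4 x i32> %sub
}
```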
Differential Revision: https://reviews.llvm.org/D104468
If the outer add has a simm12 immediate operand we should prefer using
it rather than materializing the immediate in a register, which would
guarantee an extra instruction and a temporary register. Since we don't
check for a single use on the shl or zext we might generate more
instructions if there is an additional user.
Previously we went directly to the unknown state on a VTYPE mismatch.
If we instead remember the partial match, we can still use the
X0, X0 form of vsetvli in successors if the AVL and the needed SEW/LMUL
ratio match.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D104069
This re-architects the RISCV relocation handling to bring the
implementation closer in line with the implementation in binutils. We
would previously aggressively resolve the relocation. With this
restructuring, we will always emit a paired relocation for any symbolic
difference of the form S±T[±C], where S and T are labels and C is a
constant.
GAS has a special target hook controlled by `RELOC_EXPANSION_POSSIBLE`
which indicates that a fixup may be expanded into multiple relocations.
This is used by the RISCV backend to always emit a paired relocation -
either ADD[WIDTH] + SUB[WIDTH] for text relocations or SET[WIDTH] +
SUB[WIDTH] for a debug info relocation. Irrespective of whether linker
relaxation support is enabled, symbolic difference is always emitted as
a paired relocation.
This change also sinks the target specific behaviour down into the
target specific area rather than exposing it to the shared relocation
handling. In the process, we also sink the "special" handling for debug
information down into the RISCV target. Although this improves the path
for the other targets, this is not necessarily entirely ideal either.
The changes in the debug info emission could be done through another
type of hook as this functionality would be required by any other target
which wishes to do linker relaxation. However, as there are no other
targets in LLVM which currently do this, this is a reasonable thing to
do until such time as the code needs to be shared.
Improve the handling of the relocation (and add a reduced test case from
the Linux kernel) to ensure that we handle complex expressions for
symbolic differences. This ensures that we correctly relocate symbols with
the addends normalized and associated with the addition portion of the
paired relocation.
This change also addresses some review comments from Alex Bradbury about
the relocations meant for use in the DWARF CFA being named incorrectly
(using ADD6 instead of SET6) in the original change which introduced the
relocation type.
This resolves the issues with the symbolic difference emission
sufficiently to enable building the Linux kernel with clang+IAS+lld
(without linker relaxation).
Resolves PR50153, PR50156!
Fixes: ClangBuiltLinux/linux#1023, ClangBuiltLinux/linux#1143
Reviewed By: nickdesaulniers, maskray
Differential Revision: https://reviews.llvm.org/D103539
With the exception of `frem`, this patch supports the current set of VP
floating-point binary intrinsics by lowering them to RVV instructions. It
does so by using the existing `RISCVISD *_VL` custom nodes as an intermediate
layer. Both scalable and fixed-length vectors are supported by using this
method.
The `frem` node is unsupported due to a lack of available instructions. For
fixed-length vectors we could scalarize, but that option is not (currently)
available for scalable-vector types. The support is intentionally left out so
that it is equivalent for both vector types.
The matching of vector/scalar forms is currently lacking, as scalable vector
types do not lower to the custom `VFMV_V_F_VL` node. We could either make
floating-point scalable vector splats lower to this node, or support the
matching of multiple kinds of splat via a `ComplexPattern`, much like we do for
integer types.
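For reference, a minimal sketch of one of the supported intrinsics in
scalable-vector form (hypothetical IR; mask and EVL operands as defined for
the VP intrinsics):
```
declare <vscale x 2 x double> @llvm.vp.fadd.nxv2f64(<vscale x 2 x double>, <vscale x 2 x double>, <vscale x 2 x i1>, i32)

define <vscale x 2 x double> @vp_fadd(<vscale x 2 x double> %a,
                                      <vscale x 2 x double> %b,
                                      <vscale x 2 x i1> %m, i32 %evl) {
  ; Lowered through the RISCVISD *_VL nodes, so the mask and explicit
  ; vector length map directly onto the RVV instruction.
  %r = call <vscale x 2 x double> @llvm.vp.fadd.nxv2f64(<vscale x 2 x double> %a, <vscale x 2 x double> %b, <vscale x 2 x i1> %m, i32 %evl)
  ret <vscale x 2 x double> %r
}
```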
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D104237
This can be seen as a follow up to commit 0ee439b705,
that changed the second argument of __powidf2, __powisf2 and
__powitf2 in compiler-rt from si_int to int. That was to align with
how those runtimes are defined in libgcc.
One thing that seems to have been missing in that patch was making
sure that the rest of LLVM also handles that the argument now depends
on the size of int (rather than always using the 32-bit si_int machine
mode).
When using __builtin_powi for a target with a 16-bit int, clang crashed.
And when emitting libcalls to those rtlib functions, typically when
lowering @llvm.powi, the backend would always prepare the exponent
argument as an i32, which caused miscompiles when the rtlib was
compiled with a 16-bit int.
The solution used here is to use an overloaded type for the second
argument in @llvm.powi. This way clang can use the "correct" type
when lowering __builtin_powi, and then later when emitting the libcall
it is assumed that the type used in @llvm.powi matches the rtlib
function.
One thing that needed some extra attention was that, when vectorizing
calls, several passes did not support intrinsics with more than one
overloaded argument. This patch allows overloading of a
scalar operand by adding hasVectorInstrinsicOverloadedScalarOpd, with
an entry for powi.
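A sketch of what the overload looks like (hypothetical declarations; the
exponent type now carries its own mangling suffix instead of being
hard-coded to i32):
```
; A target with a 32-bit int:
declare double @llvm.powi.f64.i32(double, i32)

; A target with a 16-bit int:
declare float @llvm.powi.f32.i16(float, i16)
```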
Differential Revision: https://reviews.llvm.org/D99439
This patch adds support for loading and storing unaligned vectors via an
equivalently-sized i8 vector type, which has support in the RVV
specification for byte-aligned access.
This offers a more optimal path for handling of unaligned fixed-length
vector accesses, which are currently scalarized. It also prevents
crashing when `LegalizeDAG` sees an unaligned scalable-vector load/store
operation.
Future work could be to investigate loading/storing via the largest
vector element type for the given alignment, in case that would be more
optimal on hardware. For instance, a 4-byte-aligned nxv2i64 vector load
could be loaded as nxv4i32 instead of as nxv16i8.
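A minimal sketch of the case this covers (hypothetical IR): an
element-misaligned scalable-vector load that previously crashed LegalizeDAG
and is now done as an nxv16i8 load plus a bitcast.
```
define <vscale x 2 x i64> @load_unaligned(<vscale x 2 x i64>* %p) {
  ; align 1 is below the natural element alignment, so the access is done
  ; through an equivalently sized byte vector, which RVV permits.
  %v = load <vscale x 2 x i64>, <vscale x 2 x i64>* %p, align 1
  ret <vscale x 2 x i64> %v
}
```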
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D104032
When using FP to access stack objects, the scalable stack objects will
be put at the lower end of the frame. It looks like
```
|-------------------| <-- FP
| callee-saved regs |
|-------------------|
| scalar local vars |
|-------------------|
| RVV local vars |
|-------------------| <-- SP
```
If there are scalar arguments that need to be passed through memory while
there are vector objects on the stack and FP is used for access, the
outgoing scalar arguments will overwrite the vector objects. It looks like
```
|-------------------| <-- FP
| callee-saved regs |
|-------------------|
| scalar local vars |
|-------------------| |-------------------|
| RVV local vars | | outgoing args | <- outgoing arguments
|-------------------| <-- SP |-------------------| overwrite from here.
```
In this patch, we reserve the stack for the outgoing arguments before
function calls if FP is used for access and there are scalable vector frame
objects. It looks like
```
|-------------------| <-- FP
| callee-saved regs |
|-------------------|
| scalar local vars |
|-------------------|
| RVV local vars |
|-------------------|
| outgoing args |
|-------------------| <-- SP
```
Differential Revision: https://reviews.llvm.org/D103622
This helps us select W instructions in more cases. Most of the
affected tests have had the sign_extend_inreg or AND folded into
sextload/zextload.
Differential Revision: https://reviews.llvm.org/D104079
The loads end up becoming sextload/zextload which prevent our
isel patterns from finding the sign_extend_inreg or AND instruction
we need.
The easiest way to fix this is to use computeKnownBits or
ComputeNumSignBits in our isel matching to catch this.
This patch changes RVV's policy for its supported list of fixed-length
vector types by capping by vector size rather than element count. Now
all 1024-byte vectors (of supported element types) are supported, rather
than all 256-element vectors.
This is a more natural fit for the architecture, and allows us to, for
example, improve the support for vector bitcasts.
This change necessitated adding some new simple types to avoid
"regressing" on the number of currently-supported vectors. We round out
the 1024-byte types by adding `v512i8`, `v1024i8`, `v512i16` and
`v512f16`.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D103884
This patch is a simple fix which registers CONCAT_VECTORS as
custom-lowered for scalable mask vectors. This follows the pattern of
all other scalable-vector types, as the default expansion of
CONCAT_VECTORS cannot handle scalable types, and even if it did it'd go
through the stack and generate worse code.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D103896
In most cases, there is a single space after the comma in assembly operands.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D103790
The first source has the same EEW as the destination, but we're
using earlyclobber, which prevents them from ever being the same
register. This patch attempts to work around this.
-For unmasked .wv, add a special TIED pseudo that pretends that
the first operand and the destination must be the same register. This
disables the earlyclobber for that source. Mark the instruction
as convertible to three-address form, which will switch it to the
original untied pseudo when the TwoAddressInstructionPass decides
that keeping them tied would require an extra copy. This uses
code in RISCVInstrInfo.cpp to do the conversion to the untied
opcode.
The untie test case shows that we can generate the untied version.
Not sure it was profitable to do it in this case, but the tests have
really simple IR.
Reviewed By: arcbbb
Differential Revision: https://reviews.llvm.org/D103552
In 0.9 these were defined to leave elements other than 0 in the
destination unmodified. They were changed to use the tail policy
in 0.10. I missed that update.
I assume no one has noticed because in-order cores treat tail
agnostic the same as tail undisturbed. I believe Spike and QEMU do
the same.
Reviewed By: arcbbb, frasercrmck
Differential Revision: https://reviews.llvm.org/D103736
This sets the AllowTruncation flag on isConstOrConstSplat in
isNullOrNullSplat, allowing it to see truncated constant zeroes on
architectures such as AArch64, where only i32 and i64 are legal. As a
truncation of 0 is always 0, this should always be valid, allowing some
extra folding to happen, including some of the cases from D103755.
Differential Revision: https://reviews.llvm.org/D103756
Writes of a mask result are always tail agnostic.
Unfortunately, this seems to have made codegen worse. I can only
think this must be because the vsetvli was acting as some sort
of barrier that prevented some code movement in the scheduler.
Reviewed By: arcbbb
Differential Revision: https://reviews.llvm.org/D103331
This patch optimizes (and r, i) to
(BCLRI (BCLRI r, i0), i1), in which i = ~((1<<i0) | (1<<i1)), or to
(BCLRI (ANDI r, i0), i1), in which i = i0 & ~(1<<i1).
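A hypothetical RV64+Zbs example of the first form: the mask below clears
exactly bits 11 and 31, i.e. i = ~((1<<11) | (1<<31)), which doesn't fit in a
simm12, so two BCLRI instructions avoid materializing the constant.
```
define i64 @clear_two_bits(i64 %r) {
  ; -2147485697 == ~((1 << 11) | (1 << 31))
  %a = and i64 %r, -2147485697
  ret i64 %a
}
```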
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D103743
Include known bits support so we know we don't need to zext the
output if the input was already zero extended.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D103757
All that really matters is that the VLMAX of the preceding
instructions is the same as the VLMAX required by the mask
operation.
Also update the vmsge(u) handling to use the SEW/LMUL we use for
other mask register operations. We were matching it to the compare
before. Some cases will be improved if we fix masked compares to
use the tail agnostic policy. I think they ignore the tail policy
anyway.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D103299
This patch extends the SelectionDAG's ability to constant-fold vector
arithmetic to include support for SPLAT_VECTOR. This is not only for
scalable-vector types but also for fixed-length vector types, which
helps Hexagon in a couple of cases.
The original RISC-V test case was in fact an infinite DAGCombine loop.
The pattern `and (truncate v1), (truncate v2)` can be combined to
`truncate (and v1, v2)`, but the truncate can similarly be combined back
to `and (truncate v1), (truncate v2)` (though, crucially, only when one of
`v1` or `v2` is a constant vector).
It wasn't exposed on fixed-length types because a TRUNCATE of a
constant BUILD_VECTOR was folded into the BUILD_VECTOR itself, whereas
this did not happen for the equivalent (scalable-vector) SPLAT_VECTOR.
Reviewed By: RKSimon, craig.topper
Differential Revision: https://reviews.llvm.org/D103246
If we're not emitting separate fences for the success/failure cases, we
need to pass the merged ordering to the target so it can emit the
correct instructions.
For the PowerPC testcase, we end up with extra fences, but that seems
like an improvement over missing fences. If someone wants to improve
that, the PowerPC backend could be taught to emit the fences after isel,
instead of depending on fences emitted by AtomicExpand.
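A hypothetical example of the case in question: a cmpxchg whose success and
failure orderings differ, so the target needs to see the merged (stronger)
ordering when fences aren't emitted separately.
```
define i32 @cas(i32* %p, i32 %expected, i32 %new) {
  ; Success ordering seq_cst, failure ordering acquire: when AtomicExpand
  ; doesn't insert separate fences, the backend must honor the merged
  ; (seq_cst) ordering when picking instructions.
  %pair = cmpxchg i32* %p, i32 %expected, i32 %new seq_cst acquire
  %old = extractvalue { i32, i1 } %pair, 0
  ret i32 %old
}
```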
Fixes https://bugs.llvm.org/show_bug.cgi?id=33332 .
Differential Revision: https://reviews.llvm.org/D103342
This patch addresses an issue in which fixed-length (VLS) vector RVV
code could fail to reserve an emergency spill slot for frame index
elimination. This is because we were previously only reserving a spill
slot when there were `scalable-vector` frame indices being used.
However, fixed-length codegen uses regular-type frame indices if it
needs to spill.
This patch does the fairly brute-force method of checking ahead of time
whether the function contains any RVV spill instructions, in which case
it reserves one slot. Note that the second RVV slot is still only
reserved for `scalable-vector` frame indices.
This unfortunately causes quite a bit of churn in existing tests, where
we chop and change stack offsets for spill slots.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D103269