Based on D24038.
LLVM has an @llvm.eh.dwarf.cfa intrinsic, used to lower the GCC-compatible __builtin_dwarf_cfa() builtin.
Reviewed By: StephenFan
Differential Revision: https://reviews.llvm.org/D126181
i64 indices aren't supported on Zve32*. Scalarize gathers to prevent
generating illegal instructions.
Since InstCombine will aggressively canonicalize GEP indices to
pointer size, we're pretty much always going to have an i64 index.
Trying to predict when SelectionDAG will find a smaller index from
the TTI hook used by the ScalarizeMaskedMemIntrinPass seems fragile.
To optimize this we probably need an IR pass to rewrite it earlier.
Test RUN lines have also been added to make sure the strided load/store
optimization still works.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D127179
MIR support is totally unusable for AMDGPU without this, since the set
of reserved registers is set from fields here.
Add a clone method to MachineFunctionInfo. This is a subtle variant of
the copy constructor that is required if there are any MIR constructs
that use pointers. Specifically, at minimum fields that reference
MachineBasicBlocks or the MachineFunction need to be adjusted to the
values in the new function.
The default RegisterClass is not enough to model RISCV registers.
We define RISC-V's own register class to model the FP registers.
This helps to better estimate the register pressure in the loop vectorizer.
Reviewed By: kito-cheng
Differential Revision: https://reviews.llvm.org/D126854
This fixes an inconsistency between RV32 and RV64. Still considering
trying to do this peephole during isel, but wanted to fix the
inconsistency first.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D126986
Test changes are because isBaseWithConstantOffset uses computeKnownBits
and that is able to see that an earlier AND instruction guaranteed
alignment so that we can treat an OR as an ADD.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D126970
Previously we had 3 different isel patterns for every scalar load/store
instruction.
This reduces them to a single ComplexPattern that returns the Base
and Offset, or an offset of 0 if no offset was identified.
I've done a similar thing for the 2 isel patterns that match add/or
with FrameIndex and immediate. Using the offset of 0, I was also
able to remove the custom handler for FrameIndex. Happy to split that
into another patch.
We might be able to enhance in the future to remove the post-isel
peephole or the special handling for ADD with constant added by D126576.
A nice side effect is that this removes nearly 3000 bytes from the isel
table.
Differential Revision: https://reviews.llvm.org/D126932
The immediate range check for CSImm12MulBy8 included some values
covered by CSImm12MulBy4. I assume CSImm12MulBy4 had priority due
to pattern order in the td file, but this makes the priority
explicit in the predicate.
If the imm is out of range for an ADDI, we will materialize it in
a register using multiple instructions. If the ADD is used by a
load/store, doPeepholeLoadStoreADDI can try to pull an ADDI from
the constant materialization into the load/store offset. This only
works if the ADD has a single use, otherwise the peephole would have
to rebuild multiple nodes.
This patch instead tries to solve the problem when the add is selected.
We check that the add is only used by loads/stores and if it is
we will select it to (ADDI (ADD X, Imm-Lo12), Lo12). This will enable
the simple case in doPeepholeLoadStoreADDI that can bypass an ADDI
used as a pointer. As a result we can remove the more complicated
peephole from doPeepholeLoadStoreADDI.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D126576
Once we've computed the incoming predecessor state, we should use the same compatibility check with knowledge of MI as we did in phase 2 in order to be consistent across all phases.
Differential Revision: https://reviews.llvm.org/D126574
We enable a custom handler to optimize conversions between scalars
and fixed vectors. Unfortunately, the custom handler picks up scalar
to scalar conversions as well. If the scalar types are both legal,
we wouldn't match any of the fixed vector cases and would return SDValue(),
causing LegalizeDAG to expand the bitcast through memory.
This patch fixes this by checking if it's a scalar to scalar conversion
and returns `Op` if both types are legal.
Differential Revision: https://reviews.llvm.org/D126739
If the adjustment doesn't fit in 12 bits, try to break it into
two 12 bit values before falling back to movImm+add/sub.
This is based on a similar idea from isel.
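For illustration, one way such a split works out arithmetically (the adjustment value and the exact split are assumptions, not necessarily what the patch picks):

#include <assert.h>
#include <stdint.h>

/* Sketch only: split a stack adjustment that is too large for a single
   ADDI (simm12 is [-2048, 2047]) into two simm12-sized pieces.  The
   concrete split chosen by the patch may differ. */
static int fits_simm12(int64_t v) { return v >= -2048 && v <= 2047; }

int main(void) {
  int64_t adjust = 4000;                  /* made-up value, too big for one ADDI */
  int64_t first = adjust > 0 ? 2047 : -2048;
  int64_t second = adjust - first;        /* 1953 here */
  assert(!fits_simm12(adjust));
  assert(fits_simm12(first) && fits_simm12(second));
  assert(first + second == adjust);       /* two ADDIs instead of movImm+add */
  return 0;
}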
Reviewed By: luismarques, reames
Differential Revision: https://reviews.llvm.org/D126392
The immediate for LUI is stored as a 20-bit unsigned value. We need
to sign extend it after shifting by 12 to match the instruction
behavior.
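A minimal C sketch of that behavior (the 20-bit value is arbitrary):

#include <assert.h>
#include <stdint.h>

/* Sketch of the LUI behavior described above: the 20-bit field lands in
   bits [31:12], and on RV64 the 32-bit result is sign extended to 64 bits. */
int main(void) {
  uint32_t imm20 = 0xFFFFF;                   /* arbitrary 20-bit field */
  uint32_t shifted = imm20 << 12;             /* 0xFFFFF000 */
  int64_t lui_result =
      (int64_t)shifted - ((shifted & 0x80000000u) ? 0x100000000LL : 0);
  assert(lui_result == (int64_t)-4096);       /* 0xFFFFFFFFFFFFF000 */
  return 0;
}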
If we find an LUI+ADDI on RV64, it means the constant isn't a
simm32. If it was, we would have emitted LUI+ADDIW from constant
materialization. Make sure the constant is a simm32 before folding.
This appears to match gcc.
A future patch will add support for LUI+ADDIW on RV64.
Originally, `OptLevel` wasn't passed into the `MachineFunctionPass`.
This let the default parameter of `SelectionDAGISel`, which is
`CodeGenOpt::Default`, be passed in. OptLevelChanger captures the
optimization level from the parameter rather than the value
within `TargetMachine`. This let the optimization level be
unintentionally overwritten if a value other than `CodeGenOpt::Default`
was passed.
This patch fixes this by passing the optimization level rather
than using the default value.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D126641
This pattern is what we get after DAG combine for C code like this.
short *ptr1, *ptr2, *ptr3;
unsigned diff = ptr1 - ptr2;
return ptr3[diff];
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D126588
This is a follow up to address a review comment from D124869. When deciding whether to PRE a vsetvli, we can allow non-LMUL1 vsetvlis.
Differential Revision: https://reviews.llvm.org/D126563
When lowering GlobalAddressNodes, we were removing a non-zero offset and
creating a separate ADD.
It already comes out of SelectionDAGBuilder with a separate ADD. The
ADD was being removed by DAGCombiner.
This patch disables the DAG combine so we don't have to reverse it.
Test changes all look to be instruction order changes. Probably due
to different DAG node ordering.
Differential Revision: https://reviews.llvm.org/D126558
Split off from D125021.
We were duplicating logic across different phases. Since we want to
ensure a consistency of logic across phases for correctness, this patch
combines our multiple compatibility checks into one function to better
convey this.
Several methods were made const too.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D126472
A RISCV implementation can choose to implement unaligned load/store support. We currently don't have a way for such a processor to indicate a preference for unaligned load/stores, so add a subtarget feature.
There doesn't appear to be a formal extension for unaligned support. The RISCV Profiles (https://github.com/riscv/riscv-profiles/blob/main/profiles.adoc#rva20u64-profile) docs use the name Zicclsm, but a) that doesn't appear to have actually been standardized, and b) isn't quite what we want here anyway due to the perf comment.
Instead, we can follow precedent from other backends and have a feature flag for the existence of misaligned load/stores with sufficient performance that user code should actually use them.
Differential Revision: https://reviews.llvm.org/D126085
This change reorganizes the majority of frame index resolution into a two step process.
Step 1 - Select which base register we're going to use.
Step 2 - Compute the offset from that base register.
The key point is that this allows us to share the step 2 logic for the SP case. This reduces the code duplication, and (I think) makes the code much easier to follow.
I also went ahead and added assertions into phase 2 to catch errors where we select an illegal base pointer. In general, we can't index from a base register to a stack location if that requires crossing a variable and unknown region. In practice, we have two such cases: dynamic stack realign and var sized objects. Note that crossing the scalable region is fine since while variable, it's a known variability which can be expressed in the offset.
Differential Revision: https://reviews.llvm.org/D126403
During insertion of VSETVLI, we have two related bits of code which decide whether we can reuse a previous vsetvli result. As was pointed out in the original review, these cases can allow any prior state for which we know that VL is the same for any value of AVL.
This was originally separated out of a desire for separate tests and review. As it turns out, finding a test case for this has been quite challenging. Most of the cases I tried, we manage to already get through other chains of logic. We do have one correct test change, but that only exercises one of the two changes.
Differential Revision: https://reviews.llvm.org/D126400
We didn't implement RISCVELFStreamer::reset, which caused some very strange
section output for the attribute section. See D15950 for how
ARM implemented that.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D125905
This patch tries to resolve the inconsistency between the direct and intermediate casts caused by D123975.
It replaces ISD::FP_EXTEND and ISD::FP_ROUND with the RVV VL ops in the lowering of direct casts of FP scalable vectors, to unify with the intermediate cast.
It also changes the FP widening patterns to use the VL ops.
Differential Revision: https://reviews.llvm.org/D125364
be2cb8 fixes the case which triggered the revert. Reapply, and let's see if anything else falls out.
Original commit message:
These asserts are believed to hold after several recent miscompiles have been fixed. If you see an assertion failure on this change, please toggle the default back and make sure you file a bug with a reproducer. We may have as yet uncaught miscompiles lurking in this code.
Differential Revision: https://reviews.llvm.org/D125271
This moves mutation entirely out of the main algorithm.
The immediate trigger is that we hit another case of the same issue I thought we'd fixed in 72925d9. It turns out we hadn't considered the cross block case.
As a brief summary, the issue being fixed is that if we mutate a previous vsetvli in phase 3, there's a possibility that some later use of that vsetvli changes "compatibility". In the cross_block_mutate test, this later vsetvli occurs in another block (and is thus visit order dependent too!). This causes us to fail strict asserts. (To be explicit, the current on by default workaround should compensate. It's only when we turn that off that we have problems.)
Now, I want to explicitly call out an alternate workaround. We could leave the mutation in phase 3, and simply restrict it to the case where the previous vsetvli's GPR result is unused. That covers the case we've actually seen. (I'll note that codegen regressions with a simple form of this were significant. We might have to check specifically for the use outside block case to keep them reasonable, which complicates the workaround slightly.)
Personally, I'm at the point where I want the mutation pulled out just for robustness sake. I'm worried there's yet one more form of this bug we haven't thought about.
The other motivation for this change is that it does give us a couple of minor codegen wins. None appear to be hugely significant, but improvements never hurt right?
Differential Revision: https://reviews.llvm.org/D125270
Update test to check MIR after finalize-isel instead of debug output.
This is of course not the only place we should preserve FMF, but
it's the most obvious one.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D126306
This is a straight forward extension of the PRE transform introduced in D124869 to handle the VLMAX case.
The test changes here look quite positive. This surprised me until I realized that all the tests are using @llvm.vscale to figure out the VLMAX, not the llvm.riscv.vsetvlmax intrinsic. If they'd used the latter, these would have been full redundancy cases and fully handled by the data flow. I'm not really sure if use of vscale here is representative or not. If it is, we should probably look at using VSETVLI to lower vscale rather than a raw read of vlenb and some math.
Differential Revision: https://reviews.llvm.org/D126338
When optimizing for size, this pass searches for instructions that are
prevented from being compressed by one of the following:
1. The use of a single uncompressed register.
2. A base register + offset where the offset is too large to be
compressed and the base register may or may not already be compressed.
In the first case, if there is a compressed register available, then the
uncompressed register is copied to the compressed register and its uses
replaced. This is only done if there are enough uses that code size
would be improved.
In the second case, if a compressed register is available, then the
original base register is copied and adjusted such that:
new_base_register = base_register + adjustment
base_register + large_offset = new_base_register + small_offset
and the uses of the base register are replaced with the new base
register. Again this is only done if there are enough uses for code size
to be improved.
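As a rough sketch of the rebasing arithmetic in C (the offsets, the 0-124 range for c.lw's scaled 5-bit offset field, and the particular adjustment are assumptions for illustration, not the pass's exact heuristic):

#include <assert.h>
#include <stdint.h>

/* Illustration only: rebase a load whose offset is too large for the
   compressed encoding.  Assumes c.lw-style offsets (0..124, multiple of 4);
   the real pass's ranges and adjustment choice may differ. */
int main(void) {
  int64_t base = 0x10000;             /* value held in the original base register */
  int64_t large_offset = 400;         /* too large to compress */
  int64_t adjustment = 384;           /* chosen so the remainder fits */
  int64_t new_base = base + adjustment;
  int64_t small_offset = large_offset - adjustment;   /* 16 */
  assert(small_offset >= 0 && small_offset <= 124 && small_offset % 4 == 0);
  assert(base + large_offset == new_base + small_offset);
  return 0;
}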
This pass was authored by Lewis Revill, with large offset optimization
added by Craig Blackmore.
Differential Revision: https://reviews.llvm.org/D92105
We found untested code where negative frame indices were ostensibly
handled despite it being in a block guarded by !MFI.isFixedObjectIndex.
While the implementation of MachineFrameInfo::isFixedObjectIndex
suggests this is possible (i.e., if a frame index was more negative than the negative
of the number of fixed objects), I couldn't find any test in tree -- for any
target -- where a negative frame index wasn't also a fixed object
offset. I couldn't find a way of creating such an object with the
public MachineFrameInfo creation APIs. Even
MachineFrameInfo::getObjectIndexBegin starts counting at the negative
number of fixed objects, so such frame indices wouldn't be covered by
loops using the provided begin/end methods.
Given all this, an assert that any object encountered in the block is
non-negative seems reasonable.
Reviewed By: StephenFan, kito-cheng
Differential Revision: https://reviews.llvm.org/D126278
When the AVL value does not fit in 5 bits, the register in which this value is stored may be dead when we want to forward it. This patch ensures the kill flags on the register are cleared before forwarding.
Patch by: loralb
Differential Revision: https://reviews.llvm.org/D125971
Instead of matching opcodes to know the format to emit, use an
enum value that we can get from the RISCVMatInt::Inst class.
Change the consumers to use fully covered switches so that we get
a compiler warning if a new kind is added. With the opcode checks
it was easier to forget to update one of the 3 consumers.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D126317
This patch teaches the VSETVLI insertion pass to perform a very limited form of partial redundancy elimination. The motivating example comes from the fixed length vectorization of a simple loop such as:
for (unsigned i = 0; i < a_len; i++)
a[i] += b;
Without this change, the core vector loop and preheader is as follows:
.LBB0_3: # %vector.ph
andi a1, a6, -8
addi a4, a0, 16
mv a5, a1
.LBB0_4: # %vector.body
# =>This Inner Loop Header: Depth=1
addi a3, a4, -16
vsetivli zero, 4, e32, m1, ta, mu
vle32.v v8, (a3)
vle32.v v9, (a4)
vadd.vx v8, v8, a2
vadd.vx v9, v9, a2
vse32.v v8, (a3)
vse32.v v9, (a4)
addi a5, a5, -8
addi a4, a4, 32
bnez a5, .LBB0_4
The key thing to note here is that the execution of the vsetivli only needs to happen once. Since there's no tail folding happening here, the values of the vector configuration registers are invariant through the loop.
After this patch, we hoist the configuration into the preheader and perform it once.
.LBB0_3: # %vector.ph
andi a1, a6, -8
vsetivli zero, 4, e32, m1, ta, mu
addi a4, a0, 16
mv a5, a1
.LBB0_4: # %vector.body
# =>This Inner Loop Header: Depth=1
addi a3, a4, -16
vle32.v v8, (a3)
vle32.v v9, (a4)
vadd.vx v8, v8, a2
vadd.vx v9, v9, a2
vse32.v v8, (a3)
vse32.v v9, (a4)
addi a5, a5, -8
addi a4, a4, 32
bnez a5, .LBB0_4
Differential Revision: https://reviews.llvm.org/D124869
This reverts commit dfe513ae1b.
Tests have been changed to avoid the type legalization bug being
fixed in D126036.
Original commit message:
This will remove masks on the shift amount. We usually get this with
SimplifyDemandedBits in DAGCombine, but that's restricted to cases
where the AND has a single use. selectShiftMaskXLen does not have
that restriction.
This patch fixes another bug in the RVV frame lowering. While some frame
objects with non-default stack IDs (such as scalable-vector alloca
instructions) are considered in the target-independent max alignment
calculations, others (for example, during calling-convention lowering)
are not. This means we'd occasionally align the base of the stack to
only 16 bytes, with no way to ensure that the RVV section contained
within that is aligned to anything higher.
Reviewed By: StephenFan
Differential Revision: https://reviews.llvm.org/D125973
This patch addresses several alignment issues in the stack frame when
RVV objects are taken into account.
One bug is that the RVV stack was never guaranteed to keep the alignment
of the stack *as a whole*. We must maintain a 16-byte aligned stack at
all times, especially when calling other functions. With the standard V
extension, this is conveniently happening since VLEN is at least 128 and
always 16-byte aligned. However, we support Zvl64b which does not
guarantee this. To fix this, the RVV stack size is rounded up to be
aligned to 16 bytes. This in practice generally makes us allocate a
stack sized at least 2*VLEN in size, and a multiple of 2.
|------------------------------| -- <-- FP
| 8-byte callee-save | | |
|------------------------------| | |
| one VLENB-sized RVV object | | |
|------------------------------| | |
| 8-byte local variable | | |
|------------------------------| -- <-- SP (must be aligned to 16)
In the example above, with Zvl64b we are decrementing SP by 24 bytes
which does not leave SP correctly aligned. We therefore introduce an
extra VLENB-sized amount used for alignment. This would therefore ensure
the total stack size was 32 bytes (48 for Zvl128b, 80 for Zvl256b, etc):
|------------------------------| -- <-- FP
| 8-byte callee-save | | |
|------------------------------| | |
| one VLENB-sized padding obj | | |
| one VLENB-sized RVV object | | |
|------------------------------| | |
| 8-byte local variable | | |
|------------------------------| -- <-- SP
A new RVV invariant has been introduced in this patch, which is that the
base of the RVV stack itself is now always aligned to 16 bytes, not 8 as
before. This keeps us more in line with the scalar stack and should be
easier to reason about. The calculation of the RVV padding has thus
changed to be the amount required to align the scalar local variable
section to the RVV section's alignment. This amount is further rounded
up when setting up the initial stack to keep everything aligned:
|------------------------------| -- <-- FP
| 8-byte callee-save |
|------------------------------|
| |
| RVV objects |
| (aligned to at least 16) |
| |
|------------------------------|
| RVV padding of 8 bytes |
|------------------------------|
| 8-byte local variable |
|------------------------------| -- <-- SP
In the example above, it's clear that we need 8 bytes of padding to keep
the RVV section aligned to 16 when using SP. But to keep SP *itself*
aligned to 16 we can't decrement the initial stack pointer by 24 - we
have to round up to 32.
With the RVV section correctly aligned, the second bug fixed by
this patch is that RVV objects themselves are now correctly aligned. We
were previously only guaranteeing an alignment of 8 bytes, even if they
required a higher alignment. This is relatively simple and in practice
we see more rounding up of VLEN amounts to account for alignment in
between objects:
|------------------------------|
| RVV object (aligned to 16) |
|------------------------------|
| no padding necessary |
|------------------------------|
| 2*VLENB RVV object (align 16)|
|------------------------------|
| VLENB alignment padding |
|------------------------------|
| RVV object (align 32) |
|------------------------------|
| 3*VLENB alignment padding |
|------------------------------|
| VLENB RVV object (align 32) |
|------------------------------| -- <-- base of RVV section
Note that a lot of the regressions in codegen owing to the new alignment
rules are correct but actually only strictly necessary for Zvl64b (and
Zvl32b but that's not really supported). I plan a follow-up patch to
take the known VLEN into account when padding for alignment.
Reviewed By: StephenFan
Differential Revision: https://reviews.llvm.org/D125787
Previously, `getRegUsageForType` was implemented using
`getTypeLegalizationCost`. `getRegUsageForType` is used by the loop
vectorizer to estimate the register pressure caused by using a vector
type. However, `getTypeLegalizationCost` currently only appears to
understand splitting and not scalarization, so significantly
underestimates the register requirements.
Instead, use `getNumRegisters`, which understands when scalarization
can occur (via computeRegisterProperties).
This was discovered while investigating D118979 (Set maximum VF with
shouldMaximizeVectorBandwidth), where under fixed-length 512-bit SVE the
loop vectorizer previously ends up costing a v128i1 as 2 v64i*
registers where it actually occupies 128 i32 registers.
I'm sending this patch early for comment; I'm still doing some sanity checking
with LNT. I note that getRegisterClassForType appears to return VectorRC even
though the types in question (large vNi1 types) end up occupying scalar
registers. That might be worth fixing too.
Differential Revision: https://reviews.llvm.org/D125918
We must add padding when using SP or BP to access stack objects.
Checking whether we're missing FP is not sufficient as stack realignment
uses SP too. The test in D125962 explains the specific issue in more
detail.
Split from D125787.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D125964
This reverts commit 86f7d7074a.
The test cases added for this exposed a pre-existing bug that is failing
the expensive checks bot. Reverting so I can revert that patch.
Most clients only used these methods because they wanted to be able to
extend or truncate to the same bit width (which is a no-op). Now that
the standard zext, sext and trunc allow this, there is no reason to use
the OrSelf versions.
The OrSelf versions additionally have the strange behaviour of allowing
extending to a *smaller* width, or truncating to a *larger* width, which
are also treated as no-ops. A small amount of client code relied on this
(ConstantRange::castOp and MicrosoftCXXNameMangler::mangleNumber) and
needed rewriting.
Differential Revision: https://reviews.llvm.org/D125557
In RISCVTargetTransformInfo, enumerating the processor family is not a good way to decide the unroll preference:
it requires enumerating many subtarget families and is hard to update when a new subtarget is added.
Instead, create a feature to distinguish whether a target wants the default unroll preference or not.
Keep TuneSiFive7 because it's a flag that indicates the subtarget family, which may be used in other places.
Differential Revision: https://reviews.llvm.org/D125741
This will remove masks on the shift amount. We usually get this with
SimplifyDemandedBits in DAGCombine, but that's restricted to cases
where the AND has a single use. selectShiftMaskXLen does not have
that restriction.
This reverts commit 79a66ec97b.
The stronger asserts served their purpose; I stumbled across another bug. Will reapply once this one is also fixed.
The bug appears to be a variant of a previous one:
* We mutate an instruction in one block.
* That mutation changes the phase3 results of another block.
This is very similar to a previous issue, except cross block instead of within a single block.
This patch adds a transform to the local prepass in InsertVSETVLI which canonicalizes an AVL of a register from another vsetvli into an immediate or VLMAX when VTYPE is the same. In this patch, I chose to be conservative and avoid arbitrary vreg forwarding due to profitability concerns about possibly overlapping live ranges.
This has the effect of eliminating vsetvli instructions in loops which are walking either VLMAX or a constant number of lanes per iteration.
Differential Revision: https://reviews.llvm.org/D125812
These asserts are believed to hold after several recent miscompiles have been fixed. If you see an assertion failure on this change, please toggle the default back and make sure you file a bug with a reproducer. We may have as yet uncaught miscompiles lurking in this code.
Differential Revision: https://reviews.llvm.org/D125271
With recent fixes to the dataflow in place, we now never pass
Strict=true to isCompatible, so remove the parameter completely.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D125748
Our current implementation of the InsertVSETVLI dataflow allows phase 3 to arrive at a different block end state than the data flow in phase 1/2 computed. This arises because a block which contains instructions (e.g. load or stores) which don't consume all the incoming bits of the VL/VTYPE can be compatible with multiple incoming states. The algorithm effectively changes the SEW on such instructions, and propagates the prior state forward. As phase 3 uses the block input state for this propagation, but phase 1/2 doesn't, this can result in different block end states.
If we don't correct for it, this discrepancy can result in miscompiles. This was the source of multiple recent bugs. However, by now we have fixes for all known correctness issues.
The basic strategy we use is to insert a compensation vsetvli to bring the block state leaving the block back into consistency with the one computed. This is correct, but results in extra vsetvlis being placed at the end of blocks.
This change adjusts the phase 1/2 algorithm to propagate the incoming block state through the block, allowing the compatibility rules to modify the end state. The algorithm may need to run slightly more iterations, but the end result is consistent with what phase 3 does.
The benefit of doing this is two fold.
First, we reverse some of the code quality regressions introduced by the functional fixes.
Second, we simplify the invariants, and allow the strict assertions to be enabled. Several humans, myself included, have found it quite surprising that this invariant didn't hold already, and arguably that confusion is the cause of several of our recent miscompiles in this code.
The downside to this patch is that the dataflow may require additional iterations to stabilize. In the worst case, we go from O(Edges) to O(E + UniquePaths) as the incoming state (and thus the outgoing one) can now change once for each path from the entry block.
Differential Revision: https://reviews.llvm.org/D125232
We've got a lurking problem with our data flow implementation where different phases disagree, resulting in possible miscompiles. D119518 introduced a workaround, but failed to consider blocks which only contain load/stores compatible with their incoming state.
When I went to rebase and simplify D125232, it turned out that not all of the correctness issues had been fixed yet after all. This is the correctness fix accidentally embedded in the original more complicated version.
Note that the test changes here are mostly regressions. It's worth noting that the simplified version of D125232 exactly reverses all the non-functional diffs in the test caused here. D125232 should be the immediate following commit.
Differential Revision: https://reviews.llvm.org/D125703
The existing redundant copy elimination required a virtual register source, but the same logic works for any physreg where we don't have to worry about clobbers. On RISCV, this helps eliminate redundant CSR reads from VLENB.
Differential Revision: https://reviews.llvm.org/D125564
During early gather/scatter enablement two different approaches
were taken to represent scaled indices:
* A Scale operand whereby byte_offsets = Index * Scale
* An IndexType whereby byte_offsets = Index * sizeof(MemVT.ElementType)
Having multiple representations is bad as shown by this patch which
fixes instances where the two are out of sync. The dedicated scale
operand is more flexible and pervasive so this patch removes the
UNSCALED values from IndexType. This means all indices are scaled
but the scale can be one, hence unscaled. SDNodes now use the scale
operand to answer the "isScaledIndex" question.
I toyed with the idea of keeping the UNSCALED enums and helper
functions but because they will have no uses and force SDNodes to
validate the set of supported values I figured it's best to remove
them. We can re-add them if there's a real need. For similar
reasons I've kept the IndexType enum when a bool could be used, as I
think being explicit looks better.
Depends On D123347
Differential Revision: https://reviews.llvm.org/D123381
When checking the legality of vectorizing a reduction, the hasVInstructions() check is unneeded: RISCV can only do loop vectorization with hasVInstructions().
Reviewed By: kito-cheng, craig.topper
Differential Revision: https://reviews.llvm.org/D125460
This patch replaces some for-each sets with the new ArrayRef argument API. Since it already used an array in the definition, I think this change won't cause any ambiguity.
Differential Revision: https://reviews.llvm.org/D125455
We need to use tail undisturbed for vslideup to implement
the vector insert operation correctly.
Ideally, we could use tail agnostic when inserting a subvector
or element at the end of the vector. This will be in a follow-up
patch.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D125545
The name `MCFixedLenDisassembler.h` is out of date after D120958.
Rename it as `MCDecoderOps.h` to reflect the change.
Reviewed By: myhsu
Differential Revision: https://reviews.llvm.org/D124987
When building the final merged node, we were using the original chain
rather than the output chain of the new operation. After some collapsing
of the chain this could cause the loads to be incorrectly scheduled with respect
to later stores.
This was uncovered by SingleSource/Regression/C/gcc-c-torture/execute/pr36038.c
of the llvm testsuite.
https://reviews.llvm.org/D125560
This reverts most of ed242b54c9
I'm seeing failures in our intrinsic testing on qemu that seem
related to this. Reverting while I investigate.
I've left the command line option in place for directed testing.
It defaults to off.
This patch adds minimal support for lowering a read.register intrinsic with vlenb as the argument. Note that vlenb is an implementation constant, so it is never allocatable.
This was split off a patch to eventually replace PseudoReadVLENB with a COPY MI because doing so revealed a couple of optimization opportunities which really seemed to warrant individual patches and tests. To write those patches, I need a way to write the tests involving vlenb, and read.register seemed like the right testing hook.
Differential Revision: https://reviews.llvm.org/D125552
The goal is to support tail and mask policy in RVV builtins.
We focus on IR part first.
If the passthru operand is undef, we use tail agnostic, otherwise
use tail undisturbed.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D125323
We've got a lurking problem with our data flow implementation where different phases disagree, resulting in possible miscompiles. D119518 introduced a workaround, but failed to consider blocks without terminators (e.g. fallthroughs).
I have a deeper rework of the algorithm in flight over in D125232, but this patch is specifically a minimal fix for an active miscompile. That change can be reworked over this once landed.
Differential Revision: https://reviews.llvm.org/D125408
riscv_fma_vl doesn't have a tail, so use the tail_agnostic policy.
We were already doing this for some patterns. I think the patterns
with fneg and mask were added later and I copied the tail policy
from the unmasked patterns.
Reviewed By: khchen
Differential Revision: https://reviews.llvm.org/D125424
RVV makes heavy use of subregisters due to LMUL>1 and segment
load/store tuples. Enabling subregister liveness tracking improves the quality
of the register allocation.
I've added a command line option that can be used to turn it off if it causes compile
time or functional issues. I used the command line to keep the old behavior
for one interesting test case that was testing register allocation.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D125108
This is a followup to D124231.
We can fold the ADDIW in this pattern if we can prove that LUI+ADDI
would have produced the same result as LUI+ADDIW.
This pattern occurs because constant materialization prefers LUI+ADDIW
for all simm32 immediates. Only immediates in the range
0x7ffff800-0x7fffffff require an ADDIW. Other simm32 immediates
work with LUI+ADDI.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D124693
If we have multiple gather/scatter instructions using the same
strided address, we would scalarize it multiple times. I guess
a later pass cleans this up, but I don't know if that's guaranteed.
This patch adds a cache to remember the scalarization we already
created for a previous gather/scatter.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D125326
We should consolidate the operand counting and ordering into
RISCVBaseInfo.h and stop spreading it around.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D125344
This hook determines if SimplifySetcc transforms (X & (C l>>/<< Y))
==/!= 0 into ((X <</l>> Y) & C) ==/!= 0, where C is a constant and
X might be a constant.
The default implementation favors doing the transform if X is not
a constant. Otherwise the code is left alone. There is a provision
that if the target supports a bit test instruction then the transform
will favor ((1 << Y) & X) ==/!= 0. RISCV does not say it has a variable
bit test operation.
RISCV with Zbs does have a BEXT instruction that performs (X >> Y) & 1.
Without Zbs, (X >> Y) & 1 still looks preferable to ((1 << Y) & X) since
we can use ANDI instead of putting a 1 in a register for SLL.
This patch overrides this hook to favor bit extract patterns and
otherwise falls back to the "do the transform if X is not a constant"
heuristic.
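As an illustration of the two forms being weighed (restricted to C == 1, the plain bit-test case; the helper names are made up):

#include <assert.h>
#include <stdint.h>

/* Sketch of the equivalent bit tests discussed above for C == 1.  Which
   form is cheaper depends on the available instructions (e.g. BEXT with
   Zbs, or SRL+ANDI without it). */
static int test_bit_shift_mask(uint64_t x, unsigned y) {
  return (x & (1ULL << y)) != 0;     /* needs the 1 in a register for SLL */
}
static int test_bit_extract(uint64_t x, unsigned y) {
  return ((x >> y) & 1) != 0;        /* maps to BEXT, or SRL then ANDI */
}

int main(void) {
  uint64_t x = 0x123456789abcdef0ULL;
  for (unsigned y = 0; y < 64; ++y)
    assert(test_bit_shift_mask(x, y) == test_bit_extract(x, y));
  return 0;
}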
I've added tests where both C and X are constants with both the shl form
and lshr form. I've also added a test for a switch statement that lowers
to a bit test. That was my original motivation for looking at this.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D124639
Type legalization will want to turn (srl X, Y) into RISCVISD::SRLW,
which will prevent us from using a BEXT instruction.
I don't think there is any precedent for type promotion checking
users to decide how to promote. Instead, I've added this DAG combine to
do it before type legalization.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D124109
This patch is an alternative to a piece of D125270. If we have one vsetvli which is using as AVL the output of another, and the prior AVL can be proven to produce the same VL value as that defining one, we can use the AVL from the prior instruction. This has the effect of removing a state transition on AVL, and will let us use the cheaper 'vsetvli x0, x0, vtype1' form or possibly even skip emitting it entirely.
This builds on the same infrastructure as D125337, and does the analogous extension to working on abstract states instead of only prior explicit vsetvli instructions. This is where the (relatively minor) code improvements come from.
More importantly, this fixes the last case where the state computed in phase 1 and 2 of the algorithm differs from the state computed during phase 3. Note that such differences can cause miscompiles by creating disagreements about contents of the VL and VTYPE registers at block boundaries.
Doing this transform inside the dataflow can cause the compatibility of a later store to change with regards to the current state. test15 in the diff illustrates this case well. What we have is a vsetvli which is mutated by one following vector op, but whose GPR result is used by another. The compatibility logic walks back to the def in this case, and checks to see if it matches the immediate prior state. In phase 1 and 2, it doesn't, and in phase 3 (after mutation) it does because we remove a transition which caused it to differ.
Differential Revision: https://reviews.llvm.org/D125392
This patch is an alternative to a piece of D125270. Its direct motivation is to fix a wrong code bug (described below), but somewhat unexpectedly, it also results in a significant code quality improvement for idiomatic fixed length vector patterns.
The existing transform is simply wrong in its current location. We are correct about the fact that the scalar move itself can use the previous vsetvli, but we lose track of the fact that later instructions might depend on the state change represented. That is, the actual value of VL in the register is different than the abstract state thinks it is. Not simply due to precision of modeling, but e.g. the VL register could contain 3 when the abstract state says it is 1. This is annoyingly hard to demonstrate in practice due to differences in policy flags on the intrinsics, but this is at least a latent wrong code bug.
The code quality benefit comes from the fact we don't need to tie this to explicit vsetvli instructions at all. We can propagate the abstract state, and reduce a) the number of transitions, or b) the cost of those transitions. It turns out we have a bunch of cases - in tests at least - where fixed length AVLs are known non-zero, and we can leave VL unchanged while changing VTYPE.
Differential Revision: https://reviews.llvm.org/D125337
Rather than VP_SEXT/VP_ZEXT/VP_TRUNC, having
VP_SIGN_EXTEND/VP_ZERO_EXTEND/VP_TRUNCATE better matches their non-VP
counterparts.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D125298
The patch makes PseudoReadVL have the vtypes of the corresponding VLEFF/VLSEGFF.
It's useful to get the vtypes of locations of PseudoReadVL without finding the
corresponding VLEFF/VLSEGFF.
It could simplify optimizations in RISCVInsertVSETVLI like D123581.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D125199
This patch adds rvv codegen support for vp.fpext. The lowering of fp_round, vp.fptrunc, fp_extend and vp.fpext share most code so use a common lowering function to handle these four.
And this patch changes the intermediate cast from ISD::FP_EXTEND/ISD::FP_ROUND to the RVV VL version op RISCVISD::FP_EXTEND_VL and RISCVISD::FP_ROUND_VL for scalable vectors.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D123975
These can be converted from masked to unmasked instructions by the
post-process DAG step.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D125239
This fixes the first of several cases where the state computed in phase 1 and 2 of the algorithm differs from the state computed during phase 3. Note that such differences can cause miscompiles by creating disagreements about contents of the VL and VTYPE registers at block boundaries.
In this particular case, we recognize that for the first vsetvli in a block, that if the AVL is a phi of GPR results from previous vsetvlis and the VTYPE field matches, we can avoid emitting a vsetvli as the register contents don't change. Unfortunately, the abstract state does change and that update was lost.
As noted in the test change, this can actually improve results by preserving information until later state transitions in the block. However, this minor codegen improvement is not the motivation for the patch. The motivation is to avoid a case where we break a key internal correctness invariant.
Differential Revision: https://reviews.llvm.org/D125133
If the incoming state hasn't changed, and the step function is fixed by assumption, then the output state can't have changed.
In the current algorithm, this is a very minor win and mostly allows adding tracing output without being horribly verbose.
This assertion should hold for any reasonable data flow algorithm, but is known not to in several cases today. I'd like to go ahead and land this off-by-default, so that we can collaborate on fixes and have a common definition of success.
Differential: https://reviews.llvm.org/D125035
Enable default outlining when the function has the minsize attribute.
`addr-label.ll` crashed after enabling this, so a barrier is added before
instruction selection as a workaround.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D122213
If the GPR destination register of a VSETVLI instruction is unused, we can replace it with X0. This discards the result, and thus reduces register pressure.
Since after the core insertion/lowering algorithm has run, many user written VSETVLIs will have their GPR result unused (as VTYPE/VLEN is now explicitly read instead), this kicks in for most tests which involve a vsetvli intrinsic for fixed length vectorization. (vscale vectorization generally uses the GPR result to know how far to e.g. advance pointers in a loop and these uses are not removed.) When inserting VSETVLIs to lower pseudos, we prefer the X0 form anyways.
Differential Revision: https://reviews.llvm.org/D124961
No reason to special case simm12, movImm handles all immediates.
This also fixes a bug where we weren't passing the frame-setup/destroy
flag to movImm when we were calling it.
riscv-v-vector-bits-min is primarily used to opt-in to the
autovectorizer. The vector width can be determined from Zvl*b.
This patch adds support for treating -1 as meaning use Zvl*b so we can
still opt-in to autovectorization without needing to repeat a
vector width already given by Zvl*b or -mcpu.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D124960
We can use SH1ADD, SH2ADD, SH3ADD to multiply by 3, 5, and 9 respectively.
We could extend this to 3, 5, or 9 multiplied by a power of 2 by also
emitting a SLLI.
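A small C sketch of that arithmetic (the helpers just model what the instructions compute):

#include <assert.h>
#include <stdint.h>

/* shNadd computes (x << N) + y, so passing the same value twice gives
   x*3, x*5 and x*9; a following SLLI extends this to those constants
   times a power of 2. */
static uint64_t sh1add(uint64_t x, uint64_t y) { return (x << 1) + y; }
static uint64_t sh2add(uint64_t x, uint64_t y) { return (x << 2) + y; }
static uint64_t sh3add(uint64_t x, uint64_t y) { return (x << 3) + y; }

int main(void) {
  uint64_t x = 12345;
  assert(sh1add(x, x) == x * 3);
  assert(sh2add(x, x) == x * 5);
  assert(sh3add(x, x) == x * 9);
  assert((sh1add(x, x) << 2) == x * 12);   /* 3 * 2^2 via SH1ADD then SLLI */
  return 0;
}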
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D124824
The result was totally wrong.
We could use the mask undisturbed result to emulate the mask agnostic result.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D124684
Because of shrink wrapping, the block in which to insert the epilog may not
have any instructions (only debug instructions), and the insertion position
may point to MBB.end(), which doesn't have a DebugLoc. This patch fixes this
problem.
The test program was copied from the issue: https://github.com/llvm/llvm-project/issues/53662
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D123679
This fixes a crash from D124231.
We can't fold
(load (add base, (addi src, off1)), off2)
-> (load (add base, src), off1+off2)
if the src is a FrameIndex. FrameIndex cannot be the operand of an
add.
There was an immediate==0 check that I think was trying to catch
the common case of FrameIndex addis where the immediate is 0, but
they can also appear in non-zero form. Instead explicitly check
for a FrameIndex operand.
Putting a node in this list allows the node to be used as the root
of an isel pattern that would then call the ComplexPattern. The
usual case is to use the ComplexPattern as the operand of another
operator.
AddrFI is never used as a root operation. frameindex is handled
directly with custom code in RISCVISelDAGToDAG::Select. So adding
frameindex to the list here serves no purpose.
It's possible that we have a constant that isn't simm32 so we can't
use LUI+ADDIW, but we can use LUI+ADDI. Because ADDI uses a sign
extended constant, it's possible that after subtracting it out, we
end up with a simm32 that maps to LUI.
This patch detects this case after removing Lo12 and before shifting
the value for SLLI.
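A hypothetical worked example of just the LUI+ADDI piece (leaving out the trailing-zero/SLLI handling); the constant is picked purely for illustration:

#include <assert.h>
#include <stdint.h>

/* Illustration only: a 64-bit constant that is not a simm32, but whose
   value minus its sign-extended low 12 bits is a simm32 with zero low
   bits, i.e. exactly what LUI (which sign extends on RV64) can produce. */
int main(void) {
  int64_t val = INT64_C(-2147485696);          /* INT32_MIN - 2048, not a simm32 */
  int64_t lo12 = (int64_t)((uint64_t)val & 0xFFF);   /* low 12 bits: 0x800 */
  if (lo12 >= 2048)
    lo12 -= 4096;                              /* sign extend: -2048 */
  int64_t hi = val - lo12;                     /* INT32_MIN, a simm32 */
  assert(val < INT32_MIN);                     /* LUI+ADDIW can't make val */
  assert(hi >= INT32_MIN && hi <= INT32_MAX);
  assert(((uint64_t)hi & 0xFFF) == 0);         /* low bits clear: one LUI */
  assert(hi + lo12 == val);                    /* LUI hi ; ADDI lo12 rebuilds val */
  return 0;
}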
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D124222
Add a cost model for the broadcast shuffle in RISCVTTIImpl::getShuffleCost
with scalable vectors. The cost model might not be the best.
For scalable vectors, BasicTTIImpl::getShuffleCost returns an invalid cost,
so this patch relies on the existing cost model in BasicTTIImpl.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D124101
This improves opportunities to use bset/bclr/binv. Unfortunately,
there are no W versions of these instructions so this isn't always
a clear win. If we use SLLW we get free sign extend and shift masking,
but need to put a 1 in a register and can't remove an or/xor. If
we use bset/bclr/binv we remove the immediate materialization and
logic op, but might need a mask on the shift amount and sext.w.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D124096
By clearing the HasDummyMask flag from mask register binary operations
and mask load/store.
HasDummyMask was causing an extra operand to get appended when
converting from MachineInstr to MCInst. This extra operand doesn't
appear in the assembly string so was mostly ignored, but it prevented
the alias instruction printing from working correctly.
Reviewed By: arcbbb
Differential Revision: https://reviews.llvm.org/D124424
The default implementation of findCommutedOpIndices picks the
first two source operands. That's exactly what we want for the
scalar FMA instructions.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D124463
Following D118810 that reduced the size of ISel table,
this patch optimizes allone-masked RVV pseudos with TU policy and
swap them out to their unmasked TU pseudos.
Since the UNDEF merge operand is not preserved, we turn it into TA
pseudo regardless of the policy operand.
Reviewed By: craig.topper, frasercrmck
Differential Revision: https://reviews.llvm.org/D121881
vslideup works by leaving elements 0<i<OFFSET undisturbed,
so it needs the destination operand as input for correctness
regardless of policy. Add an operand to indicate the policy.
We also add a policy operand for unmasked vslidedown to keep the interface consistent with vslideup,
because vslidedown only leaves elements 0<i<vstart undisturbed but the user has no way to control vstart.
Reviewed By: rogfer01, craig.topper
Differential Revision: https://reviews.llvm.org/D124186
This fixes llvm-mca's error 'found an unsupported instruction
in the input assembly sequence.' caused by the lack of
scheduling info.
Pseudo function call instructions will be expanded to `auipc`
and `jalr`, so their scheduling info is the combination of the
two.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D123578
Backwards search
The sext.w removal pass (before the new patch) checks if the input to sext.w is already in sign-extended form, so it can eliminate it. It does that by checking every definition/source that reaches the sext.w is an instruction that produces a sign-extended value, either by definition (e.g. ADDW), or it propagates sign-extension (e.g. OR) so we check its sources recursively.
Forward search
Sometimes, one of the sources is an instruction that doesn't always produce a sign-extended value, but it has a W-version that does (e.g. ADD / ADDW). If we transform the ADD to ADDW, the sext.w can be removed (assuming other def paths are satisfied), but this transformation is sound only if every use of this ADD/W only requires the lower 32-bits either directly (like sll %x, 32) or they propagate dependency (lower word of output only depends on lower word of input) so we check its uses recursively.
When searching backwards, if an instruction that can be replaced with W-variant is encountered, this pass runs the forward search to verify it can be replaced, then adds it to a list of fixable instructions. After verifying all paths, it replaces the instruction and removes the sext.w.
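A small C model of the property the forward/backward searches rely on (the helper functions are made up for illustration):

#include <assert.h>
#include <stdint.h>

/* Sketch of why a W instruction can satisfy the backwards search: ADDW
   only reads the low 32 bits of its inputs and its result is already
   sign extended from bit 31, so a following sext.w is redundant. */
static int64_t sext32(uint32_t v) {
  return (int64_t)v - ((v & 0x80000000u) ? 0x100000000LL : 0);
}
static int64_t addw(int64_t a, int64_t b) {
  return sext32((uint32_t)a + (uint32_t)b);
}
static int64_t sext_w(int64_t x) { return sext32((uint32_t)x); }

int main(void) {
  int64_t a = INT64_C(0x123456789abcdef0), b = -42;
  int64_t r = addw(a, b);
  assert(sext_w(r) == r);                    /* sext.w after ADDW is a no-op */
  assert(addw(a & 0xFFFFFFFF, b) == r);      /* only the low 32 bits of a matter */
  return 0;
}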
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D119928
This patch adds custom MIR operand comments to VTYPE immediate operands
in VSETVLI instructions and SEW/LMUL operands in vector codegen pseudo
instructions. The result is intended to be more human-readable and
hopefully maintainable when working with MIR, particularly when
writing or reading test cases.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D124187
We saw a failure caused by unwinding with incomplete CFIs, so we
can't outline CFI instructions when they are needed in EH.
This is a recommit of 0d40688, which was reverted in ce83883 as
related precommit test 360d44e caused some errors.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D122634
If there are fewer than 12 trailing zeros, we'll try to use an ADDI
at the end of the sequence. If we strip trailing zeros and end the
sequence with a SLLI we might find a shorter sequence.
Differential Revision: https://reviews.llvm.org/D124148
We saw a failure caused by unwinding with incomplete CFIs, so we
can't outline CFI instructions when they are needed in EH.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D122634
We can't shift-right negative numbers to divide them, so avoid emitting
such sequences. Use negative numerators as a proxy for this situation, since
the indices are always non-negative.
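As a quick illustration of why, in C (assuming the common arithmetic-shift behavior for right shifts of negative values, which C leaves implementation-defined):

#include <assert.h>

/* Sketch of the difference alluded to above: division truncates toward
   zero, while an arithmetic right shift rounds toward negative infinity,
   so a shift is not a correct division for negative values. */
int main(void) {
  assert(-3 / 2 == -1);       /* truncating division */
  assert(-3 >> 1 == -2);      /* arithmetic shift, rounds down */
  assert(6 / 2 == (6 >> 1));  /* for non-negative values the two agree */
  return 0;
}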
An alternative strategy could be to add a compiler flag to emit division
instructions, which would at least allow us to test the VID sequence
matching itself.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D123796
We haven't been updating this as Zb* instructions have been used
for immediate materialization. They will hit the default case and
trigger an llvm_unreachable. Instead of trying to list them all,
assume instructions that aren't explicitly listed aren't compressible.
Spotted while looking at integer materialization for other reasons.
I haven't seen a crash from this yet.
There's an existing generic combine that does this for legal types.
This patch adds a RISCV specific combine for W instructions.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D123983
This patch fixes a bug when lowering BUILD_VECTOR via VID sequences.
After adding support for fractional steps in D106533, elements with zero
steps may be skipped if no step has yet been computed. This allowed
certain sequences to slip through the cracks, being identified as VID
sequences when in fact they are not.
The fix for this is to perform a second loop over the BUILD_VECTOR to
validate the entire sequence once the step has been computed. This isn't
the most efficient, but on balance the code is more readable and
maintainable than doing back-validation during the first loop.
Fixes the tests introduced in D123785.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D123786
This patch adds rvv codegen support for vp.fptrunc. The lowering of fp_round and vp.fptrunc share most code so use a common lowering function to handle those two, similar to vp.trunc.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D123841
Materializing constants on RISCV is simpler if the constant is sign
extended from i32. By default i32 constant operands of phis are
zero extended.
This patch adds a hook to allow RISCV to override this for i32. We
have an existing isSExtCheaperThanZExt, but it operates on EVT which
we don't have at these places in the code.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D122951
This was added before Zve extensions were defined. I think users
should use Zve32x or Zve32f now. Though we will lose support for limiting
ELEN to 16 or 8, I hope no one was using that.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D123418
Having an enum with names that contain the string representation
of their value doesn't add any value. We can just use the numbers.
Reviewed By: kito-cheng, frasercrmck
Differential Revision: https://reviews.llvm.org/D123417
The scalable vector llvm.experimental.stepvector intrinsic
will crash due to an invalid cost when running the code through the loop unroller.
Reviewed By: kito-cheng
Differential Revision: https://reviews.llvm.org/D122782
There's an assert in LUI+SH*ADD+ADDI materialization that makes sure the
lower 12 bits aren't zero since that case should have been handled as
LUI+ADDI+SH*ADD. But nothing prevented the LUI+SH*ADD+ADDI checks from
running after the earlier code handled it.
The sequence would be the same length or longer so it wouldn't replace
the earlier sequence, but the assert happened before that was checked.
The vector holding the sequence also wasn't reset before the second
check so that guaranteed the sequence would never be found to be
shorter.
This patch fixes this by only trying the second expansion when the
earlier one fails.
Fixes PR54812.
Reviewed By: benshi001
Differential Revision: https://reviews.llvm.org/D123406
Similar to D123217 but for the floating-point patterns. No change in
generated output, while reducing the generated table size.
Reviewed By: arcbbb
Differential Revision: https://reviews.llvm.org/D123291
SLLI is always compressible to C.SLLI as long as the source and dest
registers are the same.
ANDI and SRLI are only compressible if the register is x8-x15. By
using SLLI we have a better chance of generating shorter code.
I had to add one exclusion for the BEXTI case so that its
pattern match could still fire.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D123336
RISCVMachineFunctionInfo has some fields, like VarArgsFrameIndex and
VarArgsSaveSize, that are calculated at the ISel lowering stage. That info is
not contained in MIR files, which means test cases that rely on those fields
can't be reproduced correctly from MIR dump files.
This patch adds MIR read/write support for those fields.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D123178
This patch synchronizes the structure of the templates with those
in RISCVInstrInfoVVLPatterns.td so that we get patterns with .vx
on the left hand side.
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D123255
This matches VPatIntegerSetCCVL_VI_Swappable. But as noted in the
FIXME this may only be needed due to lack of canonicalization on
VP_SETCC.
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D123239
The existing code wasn't getting the subtarget info from the fragment,
so the current status of RVC would be ignored. This would cause a crash
for the new test case when the target then reported it couldn't write
the requested number of code alignment bytes.
Differential Revision: https://reviews.llvm.org/D122236
This patch has no effect on the generated code, whilst mitigating the
increase in ISel table size caused by the recent addition of masked
patterns.
I aim to do the same for floating-point patterns once D123051 lands,
giving us a reason to use masked floating-point patterns.
Reviewed By: arcbbb
Differential Revision: https://reviews.llvm.org/D123217
This patch adds the necessary infrastructure to lower vp.fcmp via
ISD::VP_SETCC to RVV instructions.
Most notably this patch adds cond-code legalization for VP_SETCC,
reusing the existing TargetLowering::LegalizeSetCCCondCode by passing in
additional SDValue parameters for the Mask and EVL. This method then
uses VP operations to legalize the condcode.
There is still a general lack of canonicalization on VP_SETCC as opposed
to SETCC which results in worse code than is theoretically possible.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D123051
This patch adds the minimum required to successfully lower vp.icmp via
the new ISD::VP_SETCC node to RVV instructions.
Regular ISD::SETCC goes through a lot of canonicalization which targets
may rely on, which has not yet been ported to VP_SETCC. It also
supports expansion of individual condition codes and a non-boolean
return type. Support for all of that will follow in later patches.
In the case of RVV this largely isn't a problem as the vector integer
comparison instructions are plentiful enough that it can lower all
VP_SETCC nodes on legal integer vectors except for boolean vectors,
which regular SETCC folds away immediately into logical operations.
Floating-point VP_SETCC operations aren't as well supported in RVV and
the backend relies on condition code expansion, so support for those
operations will come in later patches.
Portions of this code were taken from the VP reference patches.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D122743
We can do this conversion by converting to a same-sized integer type, then comparing the result with 0. The conversion is undefined if the converted FP value doesn't fit in an i1.
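As a quick illustration, here is a minimal C++ sketch of that equivalence, assuming the only defined inputs are 0.0 and 1.0 (the function name is made up, not from the patch):
#include <cassert>
#include <cstdint>

// fptoui to i1 is only defined when the FP value fits in one bit (0.0 or
// 1.0); for those inputs, converting to a full-width integer and comparing
// the result with 0 produces the same boolean.
bool fpToI1ViaInt(double f) { return static_cast<uint64_t>(f) != 0; }

int main() {
  assert(fpToI1ViaInt(0.0) == false);
  assert(fpToI1ViaInt(1.0) == true);
  return 0;
}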
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D122678
If we expand (uaddo X, 1) we previously expanded the overflow calculation
as (X + 1) <u X. This potentially increases the live range of X and
can prevent X+1 from reusing the register that previously held X.
Since we're adding 1, overflow only occurs if X was UINT_MAX in which
case (X+1) would be 0. So this patch adds a special case to expand
the overflow calculation to (X+1) == 0.
This seems to help with uaddo intrinsics that get introduced by
CodeGenPrepare after LSR. Alternatively, we could block the uaddo
transform in CodeGenPrepare for this case.
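To illustrate the special case, a small C++ sketch (function names are made up, not from the patch): adding 1 overflows exactly when the input is UINT_MAX, i.e. exactly when the sum wraps to 0.
#include <cassert>
#include <cstdint>

// Generic expansion keeps X live: overflow = (X + 1) <u X.
bool overflowGeneric(uint32_t x) { return x + 1u < x; }
// Special case for adding 1: overflow = (X + 1) == 0.
bool overflowAddOne(uint32_t x) { return x + 1u == 0u; }

int main() {
  for (uint32_t v : {0u, 1u, 0xFFFFFFFEu, 0xFFFFFFFFu})
    assert(overflowGeneric(v) == overflowAddOne(v));
  return 0;
}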
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D122933
Previously, these isel optimizations were disabled if the AND could
be selected as a ANDI instruction. This patch disables the optimizations
only if the immediate is valid for C.ANDI. If we can't use C.ANDI,
we might be able to compress the shift instructions instead.
I'm not checking the C extension since we have relatively poor test
coverage of the C extension. Without C extension the code size
should be equal. My only concern would be if the shift+andi had
better latency/throughput on a particular CPU.
I did have to add a peephole to match SRLIW if the input is zexti32
to prevent a regression in rv64zbp.ll.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D122701
The splat_vector will be legalized to build_vector eventually
anyway. This patch makes it take fewer steps.
Unfortunately, this results in some codegen changes. It looks
like it comes down to how the nodes were ordered in the topological
sort for isel. Because the build_vector is created earlier we end up
with a different ordering of nodes.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D122185
In D122512, several masked patterns were added to support lowering of
vector-predicated float-to-int and int-to-float conversions. With the
introduction of these patterns, all of the old "unmasked" patterns are
matchable via the DAG post-process introduced in D118810, once the relevant
opcode entries are set up in the helper table.
Locally this reduces the generated isel table by 4%.
Reviewed By: arcbbb
Differential Revision: https://reviews.llvm.org/D122637
Masked compare and vmsbf/vmsif/vmsof are always tail agnostic, so we
can check the maskedoff value to decide the mask policy rather than
having an additional policy operand.
Reviewed By: craig.topper, arcbbb
Differential Revision: https://reviews.llvm.org/D122456
This reverts commit 10fd2822b7.
I have a better implementation for those operations without the
additional policy operand.
Masked compare and vmsbf/vmsif/vmsof are always tail agnostic, so we
can assume an undef maskedoff is mask agnostic.
Differential Revision: https://reviews.llvm.org/D122455
This function now takes a uint64_t instead of an APInt. The caller
is responsible for masking the shift amount, extracting and inserting
into the KnownBits APInts, and inverting to compute zeros.
This is less code and a cleaner division of responsibilities.
Modified DAGCombiner to pass the bit-test input and the shift amount
to hasBitTest. This matches the other call to hasBitTest in TargetLowering.h
This is an alternative to D122454.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D122458
All LLVM backends use MCDisassembler as a base class for their
instruction decoders. Use "const MCDisassembler *" for the decoder
instead of "const void *". Remove unnecessary static casts.
Reviewed By: skan
Differential Revision: https://reviews.llvm.org/D122245
Don't call EltVT.getSizeInBits() or SrcEltVT.getSizeInBits() a second
time; the values are already in the EltSize and SrcEltSize variables.
Refactor some comparisons to use multiplication instead of division.
Previously, when constructing RISCVAsmBackend, MCTargetOptions.ABIName was passed in and stored in RISCVAsmBackend.
But MCTargetOptions.ABIName can only be specified with -target-abi xxx on the command line; if the .ll file has a target-abi attribute, the codegen module ignores it, and the generated object file ends up with an incorrect EFlags value.
https://github.com/llvm/llvm-project/issues/50591 is also caused by this problem.
This patch overrides AsmPrinter::emitFunctionEntryLabel and uses it to set the target ABI value obtained from the .ll file's target-abi attribute, storing the target ABI in RISCVTargetStreamer instead of RISCVAsmBackend.
Differential Revision: https://reviews.llvm.org/D121183
On RV32, we need to type legalize i64 scalar arguments to intrinsics.
We usually do this by splatting the value into a vector separately.
If the scalar happens to be sign extended, we can continue using a .vx
intrinsic.
We already special cased sign extended constants, this extends it
to any sign extended value.
I've only added tests for one case of vadd. Most intrinsics go
through the same check.
Reviewed By: khchen
Differential Revision: https://reviews.llvm.org/D122186
The mask being NoRegister prevented the existing aliases from matching
since NoRegister isn't in the VMV0 register class.
To work around this I've added new aliases that look for zero_reg.
I had to modify tablegen to generate matching code for zero_reg.
As a consequence, I had to change the EmitPriority for an ARM
alias that used zero_reg and had started printing.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D121496
intrinsics.
Those operations are updated under a tail agnostic policy, but they
could be mask agnostic or mask undisturbed.
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D120228
Add the UsesMaskPolicy flag to indicate that an operation's result
would be affected by the mask policy (e.g. mask operations).
It means RISCVInsertVSETVLI should decide the mask policy according
to the mask policy operand or the passthru operand.
If UsesMaskPolicy is false (e.g. unmasked, store, and reduction operations),
the mask policy could be either mask undisturbed or agnostic.
Currently, RISCVInsertVSETVLI defaults UsesMaskPolicy operations to
MA, and everything else to MU, so that the current mask policy is not
changed for unmasked operations.
Add masked-tama, masked-tamu, masked-tuma and masked-tumu test cases.
I didn't add all operations because most implementations use the
same pseudo multiclass. Some tests may be duplicated across different
files (e.g. masked vmacc with tumu shows up in both vmacc-rv32.ll and
masked-tumu).
I think having separate tests just for the policy makes the testing
clearer.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D120226
To model the cost of vector casts, the patch considers most vector
casts to cost the same as their scalar form, and assigns a cost of 1
to the vector form of free scalar casts.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D121771
On RV32, we need to type legalize i64 scalar arguments to intrinsics.
We usually do this by splatting the value into a vector separately.
If the scalar happens to be sign extended, we can continue using a .vx
intrinsic.
We already special cased sign extended constants, this extends it
to any sign extended value.
I've only added tests for one case of vadd. Most intrinsics go
through the same check. I can add more tests if we're concerned.
Differential Revision: https://reviews.llvm.org/D122186
This patch adds single-bit and bit-counting ops to the list of sign-extending ops.
A single-bit write propagates sign-extendedness if it is not in the sign bits.
Bit extraction and bit counting always output a small number, so the result is sign extended.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D121152
RISCVISelDAGToDAG's selectImm uses RISCVTargetLowering::getAddr
(specifically the ConstantPoolSDNode) as of 41454ab256 ("[RISCV] Use
constant pool for large integers"), but nothing explicitly instantiates
any of the templates, the only reason they exist is because of the
various lowering methods in RISCVISelLowering.cpp that themselves use
the methods. However, with inlining, those can end up not existing as
real functions and thus not be exported, leading to link errors. Up
until now this hasn't happened, but for whatever reason D121654 has
triggered this on the sanitizer-ppc64be-linux buildbot, giving:
../../../../lib/libLLVMRISCVCodeGen.a(RISCVISelDAGToDAG.cpp.o): In function `selectImm(llvm::SelectionDAG*, llvm::SDLoc const&, llvm::MVT, long, llvm::RISCVSubtarget const&)':
RISCVISelDAGToDAG.cpp:(.text._ZL9selectImmPN4llvm12SelectionDAGERKNS_5SDLocENS_3MVTElRKNS_14RISCVSubtargetE+0x3d8): undefined reference to `llvm::SDValue llvm::RISCVTargetLowering::getAddr<llvm::ConstantPoolSDNode>(llvm::ConstantPoolSDNode*, llvm::SelectionDAG&, bool) const'
collect2: error: ld returned 1 exit status
Fix this by explicitly instantiating getAddr in its four different forms
so separate translation units can reliably use it.
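A generic C++ sketch of the technique, using hypothetical names rather than the real getAddr signature: explicit instantiation in the defining .cpp file forces out-of-line symbols to be emitted even when every in-file use has been inlined.
// Declared in a header (hypothetical example).
template <typename NodeT> int lowerAddr(NodeT *N);

// Defined in exactly one .cpp file.
template <typename NodeT> int lowerAddr(NodeT *N) { return N ? 1 : 0; }

struct GlobalAddressNode {};
struct ConstantPoolNode {};

// Explicit instantiations: even if every local use of lowerAddr is inlined,
// these force the compiler to emit real definitions that other translation
// units can link against, avoiding "undefined reference" errors.
template int lowerAddr<GlobalAddressNode>(GlobalAddressNode *);
template int lowerAddr<ConstantPoolNode>(ConstantPoolNode *);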
Fixes: 41454ab256 ("[RISCV] Use constant pool for large integers")
Currently we allow half types in vectors if the scalar Zfh extension
is enabled. This behavior is not in line with the vector spec. For f32
and f64 types, the Zve32f, Zve64f, Zve64d, and V extensions explicitly
control the availability of those floating point types in vectors.
In order to make our compiler compliant, we either need to remove all support
for half in vectors or we need an extension to control it.
Draft spec here https://github.com/riscv/riscv-v-spec/pull/780
Reviewed By: kito-cheng
Differential Revision: https://reviews.llvm.org/D121345
Since we have SPLAT_VECTOR_PARTS these days, I don't think we need
to go through extra lengths to avoid introducing an illegal scalar type.
We can just call getConstant using the scalable vector type and let
it create either a SPLAT_VECTOR or a SPLAT_VECTOR_PARTS.
Reviewed By: frasercrmck, rogfer01
Differential Revision: https://reviews.llvm.org/D121645
We have a special case to skip this transform if c1 is 0xffffffff
and x is sext_inreg in order to use sraiw+zext.w. But we were only
checking that we have a sext_inreg opcode, not how many bits are
being sign extended.
This commit adds a check that it is a sext_inreg from i32 so we know for
sure that an sraiw can be created.
Since we mark the pseudos as mayLoad but do not provide any MMOs,
isSafeToMove conservatively returns false, stopping MachineLICM from
hoisting the instructions. PseudoLA_TLS_GD does not actually expand to a
load, so stop marking that as mayLoad to allow it to be hoisted, and for
the others make sure to add MMOs during lowering to indicate they're GOT
loads and thus can be freely moved.
Fixes https://github.com/llvm/llvm-project/issues/54372
Reviewed By: MaskRay, arichardson
Differential Revision: https://reviews.llvm.org/D121654
Select SRLI+SLLI for and i64 %x, imm if the imm is a leading ones mask.
It is useful on RV64 when the mask exceeds simm32 and so cannot be generated by LUI.
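A C++ sketch of the equivalence, using an arbitrary leading-ones mask of 24 ones followed by 40 zeros (the names and the constant are made up):
#include <cassert>
#include <cstdint>

// AND with a leading-ones mask keeps the top bits and clears the rest,
// which SRLI followed by SLLI also does with two shift instructions.
uint64_t andMask(uint64_t x)   { return x & 0xFFFFFF0000000000ULL; }
uint64_t shiftPair(uint64_t x) { return (x >> 40) << 40; }

int main() {
  for (uint64_t v : {0ULL, 0x123456789ABCDEF0ULL, ~0ULL})
    assert(andMask(v) == shiftPair(v));
  return 0;
}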
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D121598
This code handles fixed vector SPLAT_VECTOR, but is never called in
any tests.
We only form fixed vector splat vectors for vXi64 on RV32 as part
of DAGCombine. This will be type legalized to SPLAT_VECTOR_PARTS.
So the Custom handling for SPLAT_VECTOR is never needed.
This patch makes SPLAT_VECTOR for vXi64 'Legal' on RV32 so that
DAGCombine will create it, but there's no need for Custom handler.
It will still be type legalized to SPLAT_VECTOR_PARTS.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D121673
If the type is less than XLenVT, type legalization will turn this
into (srl (bitreverse (bswap (srl (bswap X), C))), C). We can't
completely recover from these shifts. They introduce zeros into
the upper bits of the result and we can't easily tell if they are
needed. By doing a DAG combine early, we avoid introducing these
shifts.
Type legalize narrow RISCVISD::GREV/GORC with constant to a larger
type without switching to W. Detect sext_inreg+gorci/grevi with a
uimm5 immediate during isel to emit GREVIW/GORCIW.
This allows us to better propagate known bits information through
extended bits after type legalization. It will also simplify a
change I'm considering for BREV8 with Zbkb.
A future patch will add computeKnownBits support for GORC.
A further improvement here would be to use hasAllWUsers and
doPeepholeSExtW like we do for SLLIW, but I don't think we have
the test coverage for that yet.
We already do this for RISCVISD::VFMV_S_F_VL and the vfmv.v.f
intrinsic.
Reviewed By: kito-cheng
Differential Revision: https://reviews.llvm.org/D121429
We know the shift amount is a constant with bit 31 clear. anyext
of constant will be either zext or sext which will produce the
same result here. But we really shouldn't rely on that. It would
be valid to put a random number in the upper bits. Our isel patterns
expect the upper bits to be 0 so we should ask for it explicitly.
This doesn't appear to be needed any more. I did some inspecting
of the gcc torture suite and SPEC2006 with this removed and didn't
find any meaningful changes.
I think we're more aggressive about forming ADDIW now using
sign_extend_inreg during type legalization and hasAllWUsers in isel.
This probably helps catch the cases this helped with before.
This helps us form vfnmsub, vfnmadd, and vfmsub from masked VP
intrinsics.
I've used "srcvalue" for the mask parameter in the fneg nodes. We
can't match "V0" because that doesn't ensure the mask the is the same.
Instead it matches two different nodes and generates two copies to
V0 of those separate values.
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D120287
Similar to what we do for other loads/stores, use the intrinsic
version that we already have custom isel for.
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D121166
Most other targets support 'generic', but RISCV issues an error.
This can require a special case in tools that use LLVM but aren't
clang.
This patch treats "generic" the same as an empty string and remaps
it to generic-rv32/generic-rv64 based on the triple. Unfortunately, it
has to be added to RISCV.td because MCSubtargetInfo is constructed and
parses the CPU before RISCVSubtarget's constructor gets a chance
to remap it. The CPU will then be reparsed and the state in the
MCSubtargetInfo subclass will be updated again.
Fixes PR54146.
Reviewed By: khchen
Differential Revision: https://reviews.llvm.org/D121149
vslide1up/down have this flag set, but the value isn't a splat.
Rename for clarity.
Reviewed By: khchen
Differential Revision: https://reviews.llvm.org/D121037
vmsgeu.vi with 0 is always true, but under the masked, mask-undisturbed
policy we still need to keep the inactive elements, which come from maskedoff.
We could return mask directly if it's mask agnostic policy in the future.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D121080
We should not emit a tail agnostic vlse for a tail undisturbed vmv.s.x
In D119688:
- if (IsScalarMove && !Node->getOperand(0).isUndef())
+ bool HasPassthruOperand = Node->getOpcode() != ISD::SPLAT_VECTOR;
+ if (HasPassthruOperand && !IsScalarMove &&
!Node->getOperand(0).isUndef())
break;
The IsScalarMove check in the if statement had been changed.
Differential Revision: https://reviews.llvm.org/D120963
setgt X, -1 is the canonical form of setge X, 0. We can swap the
select operands and use setlt X, X0 when selecting CMOV. This
avoids materializing the -1 in a register.
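A scalar C++ sketch of the operand swap (names are made up):
#include <cassert>
#include <cstdint>

// select (setgt X, -1), A, B  ==  select (setlt X, 0), B, A
int32_t selGT(int32_t x, int32_t a, int32_t b) { return x > -1 ? a : b; }
int32_t selLT(int32_t x, int32_t a, int32_t b) { return x < 0 ? b : a; }

int main() {
  for (int32_t v : {-2, -1, 0, 1})
    assert(selGT(v, 10, 20) == selLT(v, 10, 20));
  return 0;
}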
With Zbb, abs is expanded to (max X, neg) by default. If X has 33 or
more sign bits, we can expand it a little early using negw instead of
neg to save a sext_inreg. If X started as a 32 bit value, type
legalization would have inserted a sext before the abs so X having
33 sign bits should always be true.
Note: I've used ISD::FREEZE here since we increase the number of uses.
Our default expansion for ABS doesn't do that, but I think that's a bug.
We can't do this with custom type legalization because ISD::FREEZE
doesn't propagate sign bits, so a later DAG combine won't be able to
optimize it.
Alive2: https://alive2.llvm.org/ce/z/Gx3RNe
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D120597
The patch adds a very basic cost model for masked memory ops on scalable vectors.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D117884
Until Zfinx is supported in CodeGen we need to convert all Zfinx
register classes to GPR.
Remove the zfinx-types.ll test which didn't test anything meaningful
since -mattr=zfinx isn't implemented completely in llc.
Follow up to D93298.
This miscompile was introduced in D119527.
This was a special pattern for rotate+bswap on RV32. It doesn't
work for RV64 since the rotate needs to be half the bitwidth. The
equivalent pattern for RV64 is ROTR ((GREV x, 56), 32) so match
that instead.
This could be generalized further as noted in the new FIXME.
Reviewed By: Chenbing.Zheng
Differential Revision: https://reviews.llvm.org/D120686
This patch added the MC layer support of Zfinx extension.
Authored-by: StephenFan
Co-Authored-by: Shao-Ce Sun
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D93298
This wraps up from D119053. The 2 headers are moved as described,
fixed file headers and include guards, updated all files where the old
paths were detected (simple grep through the repo), and `clang-format`-ed it all.
Differential Revision: https://reviews.llvm.org/D119876
This lowers VECTOR_SPLICE of scalable vectors to a slidedown followed by a slideup.
Fixed vectors are encouraged to use the shufflevector instruction; the equivalent patch
for fixed vectors is D119039.
I've used a tail agnostic slidedown and limited the VL to only the
elements that will not be overwritten by the slideup. The slideup
uses VLMax for its VL. It unfortunately uses tail undisturbed policy
but it isn't required as there is no tail. We just need the merge
operand to carry the bits for the lower portion of the result.
Care was taken to ensure that either the slideup or slidedown will
be able to use a .vi instruction when the immediate is small. Which
one uses the immediate depends on the sign of the immediate.
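Below is a scalar C++ model of the lowering with made-up values; it sketches the idea rather than the RVV code itself: the slidedown supplies the surviving tail of the first operand and the slideup fills the remaining lanes from the second.
#include <cassert>
#include <cstddef>
#include <vector>

std::vector<int> spliceModel(const std::vector<int> &a,
                             const std::vector<int> &b, size_t off) {
  size_t vl = a.size();
  std::vector<int> res(vl);
  // "slidedown" a by off; only the first vl - off lanes are written.
  for (size_t i = 0; i + off < vl; ++i)
    res[i] = a[i + off];
  // "slideup" b by vl - off; the merge operand carries the lower lanes.
  for (size_t i = vl - off; i < vl; ++i)
    res[i] = b[i - (vl - off)];
  return res;
}

int main() {
  std::vector<int> a{1, 2, 3, 4}, b{5, 6, 7, 8};
  assert((spliceModel(a, b, 1) == std::vector<int>{2, 3, 4, 5}));
  return 0;
}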
Reviewed By: frasercrmck, ABataev
Differential Revision: https://reviews.llvm.org/D119303
Default type legalization will create sext_inreg+abs, but we may
not be able to remove the sext_inreg.
Instead this patch expands abs during type legalization to
Y = sraiw X, 31; subw (xor X, Y), Y, which doesn't require the input
to be sign extended.
This gives a big improvement for some neg-abs tests where the
abs is used more than the neg. Previously the abs was expanded
a different way before and after type legalization. Now they are
expanded in a similar way, enabling more CSE.
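A C++ sketch of the expansion, assuming arithmetic right shift of signed values as on RISC-V (names are made up):
#include <cassert>
#include <cstdint>
#include <cstdlib>

int32_t absExpand(int32_t x) {
  int32_t y = x >> 31;   // sraiw X, 31: all ones if negative, else zero
  return (x ^ y) - y;    // subw (xor X, Y), Y
}

int main() {
  for (int32_t v : {0, 1, -1, 42, -42, INT32_MIN + 1})
    assert(absExpand(v) == std::abs(v));
  return 0;
}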
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D120636
Not only some AMO instructions but also other instructions need to
process (${gpr}) or 0(${gpr}), where the 0 is silently ignored.
This patch generalizes the handling for such usage.
Signed-off-by: Eric Tang <eric.tang@starfivetech.com>
Differential Revision: https://reviews.llvm.org/D120017
In this patch, we add a narrower exclusion for
zeroext (srl x) -> srli (slli x), so that it provides an opportunity
for the selection of sraiw.
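A C++ sketch of the zeroext (srl x) -> srli (slli x) rewrite whose exclusion is being narrowed; the shift amount 7 and the names are made up:
#include <cassert>
#include <cstdint>

// zext(i32 (x >> 7)) on RV64 can be done by shifting the low word to the
// top of the register and then logically shifting it back down.
uint64_t zextSrl(uint64_t x)  { return uint64_t(uint32_t(x) >> 7); }
uint64_t srliSlli(uint64_t x) { return (x << 32) >> (32 + 7); }

int main() {
  for (uint64_t v : {0ULL, 0x00000001FFFFFFFFULL, 0xDEADBEEFCAFEF00DULL})
    assert(zextSrl(v) == srliSlli(v));
  return 0;
}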
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D120467
By failing to lex the token we end up both parsing it as a binary
operator ourselves and parsing it as a unary operator when calling
parseExpression on the RHS. For plus this is harmless but for minus this
parses "foo - 4" as "foo - -4", effectively treating a top-level minus
as a plus.
Fixes https://github.com/llvm/llvm-project/issues/54105
Reviewed By: asb, MaskRay
Differential Revision: https://reviews.llvm.org/D120635
With the condition N->use_empty(), the root node of the DAG always
misses the peephole optimization, so a dummy node is needed.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D119934
vcpop and vfirst are still useful when VL=0.
vcpop is equivalent to li 0 and vfirst is equivalent to li -1,
since no mask elements are active.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D120302
Since D105555, Clang computes the default ABI when -mabi is empty
and encodes it in an LLVM IR module flag.
For correctness, llc needs to use the same target-abi
(Options.MCOptions.ABIName) as the ABI encoded in the IR.
getSubtargetImpl already checks that they match, but only if
Options.MCOptions.ABIName is not empty.
For more robustness we could also check an explicitly given ABI,
but right now we have two different pieces of logic computing the
default ABI.
The front-end ABI defaults to ilp32/ilp32e/lp64, or to
ilp32d/lp64d when there is hardware support for the D extension.
The backend ABI defaults to ilp32/ilp32e/lp64.
Reviewed by: asb, jrtc27
Differential Revision: https://reviews.llvm.org/D118333
This patch changes the version of V extension from 0.1 to 1.0 in RISCVInstrInfoVPseudos.td, RISCVInstrInfoVSDPatterns.td, RISCVInstrInfoVVLPatterns.td, RISCVInstrInfoV.td
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D120525
Add a new ISD opcode to represent the sign extending behavior of
vmv.x.h. Keep the previous anyext opcode to allow the existing
(fmv_x_anyexth (fmv_h_x X)) combine to keep working without needing
to generate a sign extend.
For fmv.x.w we are able to match the sext_inreg in an isel pattern,
but a 16-bit sext_inreg is lowered to a shift pair before isel. This
seemed like a larger match than we should do in isel.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D118974
Internally to DAGCombiner the SDValues were passed by non-const
reference despite not being modified. They were then passed by
const reference to TLI.
This patch passes them by value which is consistent with the vast
majority of code.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D120420
See https://github.com/llvm/llvm-project/issues/53831 for a full discussion.
The basic issue is that DAGCombiner::visitMUL and
RISCVISelLowering::transformAddImmMulImm get stuck in a loop, as the
current checks in transformAddImmMulImm aren't sufficient to avoid all
cases where DAGCombiner::isMulAddWithConstProfitable might trigger a
transformation. This patch makes transformAddImmMulImm bail out if C0
(the constant used for multiplication) has more than one use.
Differential Revision: https://reviews.llvm.org/D120332