A forward abstract state can be in the special SEWLMULRatioOnly state, which means we're not allowed to inspect its fields. The scalar-to-vector move case was missing a guard, and we'd crash on an assert. Test cases included.
According to the vector spec, mf8 is not supported for i8 if ELEN
is 32. Similarly, mf4 is not supported for i16/f16, nor mf2 for i32/f32.
Since RVVBitsPerBlock is 64 and LMUL is calculated as
((MinNumElements * ElementSize) / RVVBitsPerBlock), this means we
need to disable any type with MinNumElements==1. For example, nxv1i8 gives
(1*8)/64 = 1/8 = mf8, which is exactly the case disallowed when ELEN is 32.
For generic IR, these types will now be widened in type legalization.
For RVV intrinsics, we'll probably hit a fatal error somewhere. I plan
to work on disabling the intrinsics in the riscv_vector.h header.
Reviewed By: arcbbb
Differential Revision: https://reviews.llvm.org/D128286
This adds the macrofusion plumbing and support for fusing LUI+ADDI(W).
This is similar to D73643, but handles a different case. Other cases
can be added in the future.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D128393
The failure that caused the previous revert has been fixed
by https://reviews.llvm.org/D126048
Original commit message:
RVV makes heavy use of subregisters due to LMUL>1 and segment
load/store tuples. Enabling subregister liveness tracking improves the quality
of the register allocation.
I've added a command line option that can be used to turn it off if it causes compile
time or functional issues. I used that option to keep the old behavior
for one interesting test case that was testing register allocation.
Reviewed By: kito-cheng
Differential Revision: https://reviews.llvm.org/D128016
The VT we want to shrink to may not be legal especially after type
legalization.
Fixes PR56110.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D128135
Type legalization will convert the bitcast into a vector store and
scalar load.
Instead this patch widens the vector to v8i1 with undef, and bitcasts
it to i8. v8i1->i8 has custom handling for type legalization already to
bitcast to a v1i8 vector and use an extract_element.
The code here was lifted from X86's avx512 support.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D128099
D125335 makes regsOverlap skip following control flow, which was not intended
in the original code.
Differential Revision: https://reviews.llvm.org/D128039
Doing so lets the post-mutation pass leverage the demanded info to rewrite vsetvlis before a store/mask-op to eliminate later vsetvlis.
Sorry for the lack of a store test change; all of my attempts to write something reasonable were already handled by existing logic.
A splat of the value 0 or -1 as a sign-extended 12-bit immediate is always the same bit pattern regardless of the etype used to perform the operation. As a result, we can sometimes avoid introducing a vsetvli just for the purposes of a splat.
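A minimal illustrative sketch (not taken from the patch; the particular configuration is an assumption): splatting -1 with vmv.v.i writes an all-ones pattern into the destination, so the register contents are identical whether the splat is performed under e8, e16, e32, or e64.
```
vsetivli zero, 4, e32, m1, ta, mu
vmv.v.i  v8, -1        # same bit pattern as an e8/e16/e64 splat of -1
```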
Looking at the diffs, we don't get a huge amount of immediate value out of this. We mostly push the vsetvli one instruction down, usually in front of a vmerge. We also don't get the corresponding fixed length vector cases because VL typically is changed despite the actual bits written being the same. Both of these are areas I plan to explore in future patches.
Interestingly, this makes a great example of why we need the forward and backward implementations to be consistent. Before we merged the demanded field handling, if we implemented only the forward direction, we lost the ability to mutate a prior vsetvli and eliminate a later one entirely. This resulted in practical regressions instead of improvements. It's always nice when practice matches theory. :)
Differential Revision: https://reviews.llvm.org/D128006
This allows computeKnownBits to see the constant being loaded.
This recovers the rv64zbp test case changes from D127520.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D127679
MinRCSize is 4 and prevents constrainRegClass from changing the
register class if the new class has size less than 4.
IMPLICIT_DEF gets a unique vreg for each use and will be removed
by the ProcessImplicitDef pass before register allocation. I don't
think there is any reason to prevent constraining the virtual register
to whatever register class the use needs.
The attached test case was previously creating a copy of IMPLICIT_DEF
because vrm8nov0 has 3 registers in it.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D128005
If we're writing to an undef vector (i.e. implicit_def), we can change the value of bits outside the requested write without consequence. This allows us to avoid a VSETVLI just for narrowing the value written.
Differential Revision: https://reviews.llvm.org/D127880
The motivating case, and the only one actually enabled by this patch, is a load or store followed by another op with the same SEW/LMUL ratio.
As an example, consider:
define void @test1(ptr %in, ptr %out) {
entry:
%0 = load <8 x i16>, ptr %in, align 2
%1 = sext <8 x i16> %0 to <8 x i32>
store <8 x i32> %1, ptr %out, align 4
ret void
}
Without this patch, we get:
vsetivli zero, 8, e16, mf4, ta, mu
vle16.v v8, (a0)
vsetvli zero, zero, e32, mf2, ta, mu
vsext.vf2 v9, v8
vse32.v v9, (a1)
ret
Whereas with the patch we get:
vsetivli zero, 8, e32, mf2, ta, mu
vle16.v v8, (a0)
vsext.vf2 v9, v8
vse32.v v9, (a1)
ret
We have rewritten the first vsetvli and thus removed the second one.
As is strongly hinted by the code structure and todos, I am planning on commoning this with all (or most?) of the cases from isCompatible used in the forward data flow. This will be done in a series of following changes - some NFC reworks, and some reviewed optimization extensions.
Differential Revision: https://reviews.llvm.org/D127780
This reverts commit 7207373e1e.
We found another RISC-V bug when landing D126048, and it has been fixed
by D127642 now.
Differential Revision: https://reviews.llvm.org/D126048
RISC-V expands register tuple spills into a series of individual register spills after
the register allocation phase via pseudo instruction expansion. However, part of the
register tuple might still be undefined during spilling, and the machine verifier will
complain that the spill instruction is using an undefined physical register.
The optimal solution would be to do liveness analysis and not emit spills
and reloads for those undefined parts, but accurate liveness info at that point
is not so easy to get.
So the suboptimal solution is to still spill and reload those undefined parts, but
add an implicit use of the super register to the spill instruction; the machine
verifier will then only report a use of an undefined physical register when
the whole super register is undefined. This behavior is also
documented in MachineVerifier::checkLiveness [1].
An example to demonstrate what happens:
```
v10m2 = xxx
# v12m2 not defined yet
PseudoVSPILL2_M2 v10m2_v12m2
...
```
After expansion:
```
v10m2 = xxx
# v12m2 not defined yet
# Expand PseudoVSPILL2_M2 v10m2_v12m2 to 2 vs2r
VS2R_V v10m2
VS2R_V v12m2 # Use undef reg!
```
What this patch did:
```
v10m2 = xxx
# v12m2 not defined yet
# Expand PseudoVSPILL2_M2 v10m2_v12m2 to 2 vs2r
VS2R_V v10m2 implicit v10m2_v12m2
# Use undef reg (v12m2), but v10m2_v12m2 isn't totally undef, so
# that's OK.
VS2R_V v12m2 implicit v10m2_v12m2
```
[1] https://github.com/llvm-mirror/llvm/blob/master/lib/CodeGen/MachineVerifier.cpp#L2016-L2019
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D127642
The VSETVLIInfo right after a VLEFF/VLSEGFF is currently unknown since these
instructions modify VL. An unknown VSETVLIInfo forces a VSET(I)VLI to be inserted
before the next vector operation. However, the next vector operation after a
VLEFF/VLSEGFF may not need a VSET(I)VLI if it uses the same VTYPE and the
resulting vl of the VLEFF/VLSEGFF.
Take the below C code as an example,
vint8m4_t vec_src1 = vle8ff_v_i8m4(str1, &new_vl, vl);
vbool2_t mask1 = vmseq_vx_i8m4_b2(vec_src1, 0, new_vl);
The vsetvli insertion pass adds a redundant vsetvli for that.
Assembly result:
vsetvli a2,a2,e8,m4,ta,mu
vle8ff.v v28,(a0)
csrr a3,vl ; redundant
vsetvli zero,a3,e8,m4,ta,mu ; redundant
vmseq.vi v25,v28,0
After D126794, VLEFF/VLSEGFF has a def carrying the value of VL. The patch considers
there to be a ghost vsetvli right after the VLEFF/VLSEGFF. The ghost VSET(I)VLI uses the
vl output of the VLEFF/VLSEGFF as its AVL and the same VTYPE as the VLEFF/VLSEGFF.
The ghost vsetvli must be redundant, and we can use it to get the VSETVLIInfo
right after the VLEFF/VLSEGFF.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D127576
We were incorrectly creating a VRGATHER node with i1 vector type. We
could support this by promoting the mask to i8 and truncating it, but
for now I want to prevent the crash.
Fixes PR56007.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D127681
This simplifies the isel code by removing the manual load creation.
It also improves our ability to use 0 strided loads for vector splats.
There is an assumption here that Mask and ShiftedMask constants are
cheap enough that they don't become constant pool loads so that our
isel optimizations involving And still work. I believe those constants
are 3 instructions in the worst case.
The rv64zbp-intrinsic.ll changes are a regression caused by the expansion of intrinsics
to RISCVISD nodes also occurring during lowering. So the optimizations
were only happening during the last DAGCombine, which can't see through the
load. I believe we can fix this test by implementing
TargetLowering::getTargetConstantFromLoad for RISC-V or by adding the intrinsic
to computeKnownBitsForTargetNode to enable earlier DAG combine. Since Zbp is not
a ratified extension, I don't view these as blocking this patch.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D127520
Another issue unearthed by D127115
We take a long time to canonicalize an insert_vector_elt chain before being able to convert it into a build_vector - even if they are already in ascending insertion order, we fold the nodes one at a time into the build_vector 'seed', leaving plenty of time for other folds to alter it (in particular recognising when they come from extract_vector_elt resulting in a shuffle_vector that is much harder to fold with).
D127115 makes this particularly difficult as we're almost guaranteed to have lost the sequence before all possible insertions have been folded.
This patch proposes to begin at the last insertion and attempt to collect all the (one-use) insertions right away and create the build_vector before it's too late.
Differential Revision: https://reviews.llvm.org/D127595
We need a preheader and a single latch, but we don't need a dedicated
exit.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D127513
Our optimization pass checks for loop simplify form, before doing
the transform. The loops here aren't in loop simplify form because
the exit block has two predecessors.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D127451
D125887 changed the ctlz/cttz despeculation transform to insert
a freeze for the introduced branch on zero. While this does fix
the "branch on poison" issue, we may still get in trouble if we
pick a different value for the branch and for the ctz argument
(i.e. non-zero for the branch, but zero for the ctz). To avoid
this, we should use the same frozen value in both positions.
This does cause a regression in RISCV codegen by introducing an
additional sext. The DAG looks like this:
t0: ch = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %3
t4: i64 = AssertSext t2, ValueType:ch:i32
t23: i64 = freeze t4
t9: ch = CopyToReg t0, Register:i64 %0, t23
t16: ch = CopyToReg t0, Register:i64 %4, Constant:i64<32>
t18: ch = TokenFactor t9, t16
t25: i64 = sign_extend_inreg t23, ValueType:ch:i32
t24: i64 = setcc t25, Constant:i64<0>, seteq:ch
t28: i64 = and t24, Constant:i64<1>
t19: ch = brcond t18, t28, BasicBlock:ch<cond.end 0x8311f68>
t21: ch = br t19, BasicBlock:ch<cond.false 0x8311e80>
I don't see a really obvious way to improve this, as we can't push
the freeze past the AssertSext (which may produce poison).
Differential Revision: https://reviews.llvm.org/D126638
The patch is a replacement for D125199. PseudoReadVL with vtype raised the concern of
computing the same vtype of the VLEFF/VLSEGFF in two different places, DAGToDAG and
InsertVSETVLI. A VLEFF/VLSEGFF MI with a VL output can still provide the vtype of the
VLEFF/VLSEGFF to the users of its VL.
The patch names the new pseudos with the original VLEFF/VLSEGFF name suffixed with "_VL" and
expands them in the RISCVInsertVSETVLI pass.
This patch also reverts commit 4537aae0d5,
"[RISCV] Make PseudoReadVL have the vtypes of the corresponding VLEFF/VLSEGFF.".
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D126794
For an addition with a simm14 or simm15 immediate that has 2 or 3 trailing zero bits,
we can use a shXadd instruction and an addi to do the addition.
This patch teaches RISCVMergeBaseOffset to see through this pattern.
I don't think the sh1add case occurs because we use two addis for that,
but I implemented it for completeness.
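As a rough illustration (the value and register choices are assumptions, not taken from the patch), adding 8188 (a simm14 value with 2 trailing zero bits) can be expressed as:
```
addi   a1, zero, 2047     # materialize 8188 >> 2
sh2add a0, a1, a0         # a0 = a0 + (2047 << 2) = a0 + 8188
```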
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D127376
In order to make sure the stack pointer is correct throughout the EH region,
we also need to restore the stack pointer from the frame pointer if we
don't preserve stack space within the prologue/epilogue for outgoing variables.
Normally, checking whether a variable-sized object is present is
enough, but we also don't preserve that space in the prologue/epilogue when we
have vector objects on the stack.
Example to show what happens:
```
try {
sp adjust for outgoing args. // 1. Sp changed.
func_call // 2. Exception raised
sp restore // Oh, not restored
} catch {
// 3. And now we are here.
}
// 4. Prepare to return!, restore return address from stack, but...sp is wrong.
// 5. Screw up!
```
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D126861
This matches what we do in IR. For the RISC-V test case, this allows
us to use -8 for the AND mask instead of materializing a constant in a register.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D127335
Adds with immediates in the range [-4096, -2049] or [2048, 4095] get
converted to two ADDIs. Teach RISCVMergeBaseOffset to recognize this
pattern as well.
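For illustration (the specific constant is an assumption), an add of 3000 falls in [2048, 4095] and becomes:
```
addi a0, a0, 2047
addi a0, a0, 953          # 2047 + 953 = 3000
```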
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D126843
LUI+ADDIW always produces a simm32. This allows us to always
fold it into a global offset.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D126729
Based on D24038.
LLVM has an @llvm.eh.dwarf.cfa intrinsic, used to lower the GCC-compatible __builtin_dwarf_cfa() builtin.
Reviewed By: StephenFan
Differential Revision: https://reviews.llvm.org/D126181
The splitter will try to extend a live range into the `r` slot for a use operand.
That works in most situations, however it does not work correctly when the operand
is tied to a def and the def operand is early-clobber.
An example to demonstrate what's wrong:
0 %0 = ...
16 early-clobber %0 = Op %0 (tied-def 0), ...
32 ... = Op %0
Before extend:
%0 = [0r, 0d) [16e, 32d)
We want to extend the range ending at 0d to 16e, not 16r, in this case, but if
we use 16r here we will extend nothing because that point is already contained
in [16e, 32d).
This patch adds a check to detect such cases and adjusts the extension point.
Detailed explanation for testcase: https://reviews.llvm.org/D126047
Reviewed By: MatzeB
Differential Revision: https://reviews.llvm.org/D126048
i64 indices aren't supported on Zve32*. Scalarize gathers to prevent
generating illegal instructions.
Since InstCombine will aggressively canonicalize GEP indices to
pointer size, we're pretty much always going to have an i64 index.
Trying to predict when SelectionDAG will find a smaller index from
the TTI hook used by the ScalarizeMaskedMemIntrinPass seems fragile.
To optimize this we probably need an IR pass to rewrite it earlier.
Test RUN lines have also been added to make sure the strided load/store
optimization still works.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D127179
Use the query that doesn't assert if TracksLiveness isn't set, which
needs to always be available. We also need to start printing liveins
regardless of TracksLiveness.
Move the code that was added for D126896 after the normal recursive calls
to computeKnownBits. This allows us to calculate trailing zeros.
Previously we would break out of the switch before the recursive calls.
This fixes an inconsistency between RV32 and RV64. Still considering
trying to do this peephole during isel, but wanted to fix the
inconsistency first.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D126986
Test changes are because isBaseWithConstantOffset uses computeKnownBits
and that is able to see that an earlier AND instruction guaranteed
alignment so that we can treat an OR as an ADD.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D126970
If the imm is out of range for an ADDI, we will materialize it in
a register using multiple instructions. If the ADD is used by a
load/store, doPeepholeLoadStoreADDI can try to pull an ADDI from
the constant materialization into the load/store offset. This only
works if the ADD has a single use, otherwise the peephole would have
to rebuild multiple nodes.
This patch instead tries to solve the problem when the add is selected.
We check that the add is only used by loads/stores and if it is
we will select it to (ADDI (ADD X, Imm-Lo12), Lo12). This will enable
the simple case in doPeepholeLoadStoreADDI that can bypass an ADDI
used as a pointer. As a result we can remove the more complicated
peephole from doPeepholeLoadStoreADDI.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D126576
If C is non-negative, the result of the smax must also be
non-negative, so all sign bits of the result are 0.
This allows DAGCombiner to remove a zext_inreg in the modified test.
This zext_inreg started as a sext that became zext before type
legalization then was promoted to a zext_inreg.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D126896
One of the operands of the smax is a positive value so computeKnownBits
determines the result of the smax must always be positive. This allows
DAG combiner to convert the sign extend to zero extend before type
legalization.
After type legalization the smax is promoted to i64 by sign extending
its inputs and the zero extend becomes an AND instruction. We are unable
to remove the AND at this point and it becomes a pair of shifts or a
zext.w.
The result of smax has as many sign bits as the minimum of its inputs.
Had we kept the sign extend instead of turning it into a zero extend
it would be removed by DAG combiner after type legalization.
Once we've computed the incoming predecessor state, we should use the same compatibility check with knowledge of MI as we did in phase 2 in order to be consistent across all phases.
Differential Revision: https://reviews.llvm.org/D126574
Adds MVT::v128i2, MVT::v64i4, and implied MVT::i2, MVT::i4.
Keeps MVT::i2, MVT::i4 lowering actions as expand, which should be
removed once targets set this explicitly.
Adjusts 11 lit tests to reflect slightly different behavior during
DAG combine.
Differential Revision: https://reviews.llvm.org/D125247
Adds MVT::v128i2, MVT::v64i4, and implied MVT::i2, MVT::i4.
Keeps MVT::i2, MVT::i4 lowering actions as `expand`, which should be
removed once targets set this explicitly.
Adjusts 11 lit tests to reflect slightly different behavior during
DAG combine.
Differential Revision: https://reviews.llvm.org/D125247
We enable a custom handler to optimize conversions between scalars
and fixed vectors. Unfortunately, the custom handler picks up scalar
to scalar conversions as well. If the scalar types are both legal,
we wouldn't match any of the fixed vector cases and would return SDValue()
causing the LegalizeDAG to expand the bitcast through memory.
This patch fixes this by checking if it's a scalar to scalar conversion
and returns `Op` if both types are legal.
Differential Revision: https://reviews.llvm.org/D126739
As mentioned in D125947, we can reduce codegen results by
adding an explicit hard single-float ABI.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D126640
If the adjustment doesn't fit in 12 bits, try to break it into
two 12 bit values before falling back to movImm+add/sub.
This is based on a similar idea from isel.
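A minimal sketch of the idea (the exact split chosen by the code is an assumption): a stack adjustment of -4094 can be emitted as two ADDIs instead of movImm+sub:
```
addi sp, sp, -2047
addi sp, sp, -2047        # total adjustment of -4094
```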
Reviewed By: luismarques, reames
Differential Revision: https://reviews.llvm.org/D126392
The immediate for LUI is stored as a 20-bit unsigned value. We need
to sign extend it after shifting by 12 to match the instruction
behavior.
If we find an LUI+ADDI on RV64, it means the constant isn't a
simm32. If it was, we would have emitted LUI+ADDIW from constant
materialization. Make sure the constant is a simm32 before folding.
This appears to match gcc.
A future patch will add support for LUI+ADDIW on RV64.
Originally, `OptLevel` wasn't passed into the `MachineFunctionPass`.
This let the default parameter of `SelectionDAGISel`, which is
`CodeGenOpt::Default`, be passed in. OptLevelChanger captures the
optimization level from the parameter, rather than the value
within `TargetMachine`. This lets the optimization level be
unintentionally overwritten if a value other than `CodeGenOpt::Default`
was passed.
This patch fixes this by passing the optimization level rather
than using the default value.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D126641
The optimization level should not be restored to O2.
This is a pre-commit test case to show fix in D126641.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D126677
This pattern is what we get after DAG combine for C code like this.
short *ptr1, *ptr2, *ptr3;
unsigned diff = ptr1 - ptr2;
return ptr3[diff];
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D126588
The tests here show the codegen for something like this C code.
unsigned diff = ptr1 - ptr2;
return ptr3[diff];
The pointer difference is truncated to 32-bits before being used
again as an index. In SelectionDAG this appears as an AND between
a SRL and a SHL. DAGCombiner will remove the shifts leaving only
an AND. The Mask now has 1, 2, or 3 trailing zeros and 31, 30, or 29
leading zeros. We end up falling back to constant materialization
to create this mask.
We could instead use srli followed by slli.uw. Or since
we have an add, we can use srli followed by shXadd.uw.
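For instance, for the i16 case (a mask with 1 trailing zero), a possible sequence with Zba is (a sketch; register names are assumptions):
```
srli      a1, a1, 1       # byte difference >> 1
sh1add.uw a0, a1, a0      # (zext32(a1) << 1) + base = base + (diff & 0x1fffffffe)
```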
Differential Revision: https://reviews.llvm.org/D126589
This is a follow up to address a review comment from D124869. When deciding whether to PRE a vsetvli, we can allow non-LMUL1 vsetvlis.
Differential Revision: https://reviews.llvm.org/D126563
These should be aligned to the natural alignment of the element.
Probably copy/paste mistake from the i32 tests.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D126567
When lowering GlobalAddressNodes, we were removing a non-zero offset and
creating a separate ADD.
It already comes out of SelectionDAGBuilder with a separate ADD. The
ADD was being removed by DAGCombiner.
This patch disables the DAG combine so we don't have to reverse it.
Test changes all look to be instruction order changes. Probably due
to different DAG node ordering.
Differential Revision: https://reviews.llvm.org/D126558
Today, text section prefixes (none, .unlikely, .hot, and .unknown) are determined based on the PGO profile. However, Propeller may deem a function hot when PGO doesn't. Besides, with `-Wl,-keep-text-section-prefix=true`, Propeller cannot enforce a global section ordering, as the linker can only reorder sections within each output section (.text, .text.hot, .text.unlikely).
This patch promotes all functions with Propeller profiles (functions listed in the basic-block-sections profile) to .text.hot. The feature is hidden behind the flag `--bbsections-guided-section-prefix` which defaults to `true`.
The new implementation refactors the parsing of basic block sections profile into a new `BasicBlockSectionsProfileReader` analysis pass. This allows us to use the information earlier in `CodeGenPrepare` in order to set the functions text prefix. `BasicBlockSectionsProfileReader` will be used both by `BasicBlockSections` pass and `CodeGenPrepare`.
Differential Revision: https://reviews.llvm.org/D122930
A RISCV implementation can choose to implement unaligned load/store support. We currently don't have a way for such a processor to indicate a preference for unaligned load/stores, so add a subtarget feature.
There doesn't appear to be a formal extension for unaligned support. The RISCV Profiles (https://github.com/riscv/riscv-profiles/blob/main/profiles.adoc#rva20u64-profile) docs use the name Zicclsm, but a) that doesn't appear to have actually been standardized, and b) it isn't quite what we want here anyway due to the perf comment.
Instead, we can follow precedent from other backends and have a feature flag for the existence of misaligned load/stores with sufficient performance that user code should actually use them.
Differential Revision: https://reviews.llvm.org/D126085
With a fix for an expensive checks build failure exposed by new RISC-V tests.
Something about expanding two rotates in type legalization caused a change
in the remapping tables that the expensive checks verification wasn't expecting.
See comment in the code for how it was fixed.
Tests came from this commit that exposed the bug
[RISCV] Add test cases showing failure to remove mask on rotate amounts.
If the masking AND has multiple users we fail to remove it.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D126036
During insertion of VSETVLI, we have two related bits of code which decide whether we can reuse a previous vsetvli result. As was pointed out in the original review, these cases can allow any prior state for which we know that VL is the same for any value of AVL.
This was originally separated out of a desire for separate tests and review. As it turns out, finding a test case for this has been quite challenging. Most of the cases I tried, we manage to already get through other chains of logic. We do have one correct test change, but that only exercises one of the two changes.
Differential Revision: https://reviews.llvm.org/D126400
reapply 62a9b36fcf and fix module build
failure:
1: remove MachineCycleInfoWrapperPass in MachinePassRegistry.def.
MachineCycleInfoWrapperPass is an analysis pass and should not be there.
2: move the definition of MachineCycleInfoPrinterPass to the cpp file.
Otherwise, there is a module conflict for MachineCycleInfoWrapperPass
between MachinePassRegistry.def and MachineCycleAnalysis.h after
62a9b36fcf.
MachineCycle can handle irreducible loops. Natural loop
analysis (MachineLoop) cannot return the correct loop depth if
the loop is irreducible. And MachineSink is sensitive
to the loop depth; see MachineSinking::isProfitableToSinkTo().
This patch tries to use MachineCycle so that we can handle
irreducible loops better.
Reviewed By: sameerds, MatzeB
Differential Revision: https://reviews.llvm.org/D123995
This moves mutation entirely out of the main algorithm.
The immediate trigger is that we hit another case of the same issue I thought we'd fixed in 72925d9. It turns out we hadn't considered the cross block case.
As a brief summary, the issue being fixed is that if we mutate a previous vsetvli in phase 3, there's a possibility that some later use of that vsetvli changes "compatibility". In the cross_block_mutate test, this later vsetvli occurs in another block (and is thus visit order dependent too!). This causes us to fail strict asserts. (To be explicit, the current on by default workaround should compensate. It's only when we turn that off that we have problems.)
Now, I want to explicitly call out an alternate workaround. We could leave the mutation in phase 3, and simply restrict it to the case where the previous vsetvli's GPR result is unused. That covers the case we've actually seen. (I'll note that codegen regressions with a simple form of this were significant. We might have to check specifically for the use-outside-block case to keep them reasonable, which complicates the workaround slightly.)
Personally, I'm at the point where I want the mutation pulled out just for robustness sake. I'm worried there's yet one more form of this bug we haven't thought about.
The other motivation for this change is that it does give us a couple of minor codegen wins. None appear to be hugely significant, but improvements never hurt right?
Differential Revision: https://reviews.llvm.org/D125270
Update test to check MIR after finalize-isel instead of debug output.
This is of course not the only place we should preserve FMF, but
it's the most obvious one.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D126306
This is a straight forward extension of the PRE transform introduced in D124869 to handle the VLMAX case.
The test changes here look quite positive. This surprised me until I realized that all the tests are using @llvm.vscale to figure out the VLMAX, not the llvm.riscv.vsetvlmax intrinsic. If they'd used the latter, these would have been full redundancy cases and fully handled by the data flow. I'm not really sure if the use of vscale here is representative or not. If it is, we should probably look at using VSETVLI to lower vscale rather than a raw read of vlenb and some math.
Differential Revision: https://reviews.llvm.org/D126338
When optimizing for size, this pass searches for instructions that are
prevented from being compressed by one of the following:
1. The use of a single uncompressed register.
2. A base register + offset where the offset is too large to be
compressed and the base register may or may not already be compressed.
In the first case, if there is a compressed register available, then the
uncompressed register is copied to the compressed register and its uses
replaced. This is only done if there are enough uses that code size
would be improved.
In the second case, if a compressed register is available, then the
original base register is copied and adjusted such that:
new_base_register = base_register + adjustment
base_register + large_offset = new_base_register + small_offset
and the uses of the base register are replaced with the new base
register. Again this is only done if there are enough uses for code size
to be improved.
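A minimal sketch of the second case (offsets, registers, and the adjustment are illustrative assumptions, not taken from the pass):
```
# offsets too large for c.lw (max 124):
lw   a0, 400(a5)
lw   a1, 404(a5)
# becomes:
addi a4, a5, 384          # new_base_register = base_register + adjustment
c.lw a0, 16(a4)           # 400 = 384 + 16
c.lw a1, 20(a4)           # 404 = 384 + 20
```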
This pass was authored by Lewis Revill, with large offset optimization
added by Craig Blackmore.
Differential Revision: https://reviews.llvm.org/D92105
When the AVL value does not fit in 5 bits, the register in which this value is stored may be dead when we want to forward it. This patch ensures the kill flags on the register are cleared before forwarding.
Patch by: loralb
Differential Revision: https://reviews.llvm.org/D125971
This patch teaches the VSETVLI insertion pass to perform a very limited form of partial redundancy elimination. The motivating example comes from the fixed length vectorization of a simple loop such as:
for (unsigned i = 0; i < a_len; i++)
a[i] += b;
Without this change, the core vector loop and preheader is as follows:
.LBB0_3: # %vector.ph
andi a1, a6, -8
addi a4, a0, 16
mv a5, a1
.LBB0_4: # %vector.body
# =>This Inner Loop Header: Depth=1
addi a3, a4, -16
vsetivli zero, 4, e32, m1, ta, mu
vle32.v v8, (a3)
vle32.v v9, (a4)
vadd.vx v8, v8, a2
vadd.vx v9, v9, a2
vse32.v v8, (a3)
vse32.v v9, (a4)
addi a5, a5, -8
addi a4, a4, 32
bnez a5, .LBB0_4
The key thing to note here is that the execution of the vsetivli only needs to happen once. Since there's no tail folding happening here, the values of the vector configuration registers are invariant through the loop.
After this patch, we hoist the configuration into the preheader and perform it once.
.LBB0_3: # %vector.ph
andi a1, a6, -8
vsetivli zero, 4, e32, m1, ta, mu
addi a4, a0, 16
mv a5, a1
.LBB0_4: # %vector.body
# =>This Inner Loop Header: Depth=1
addi a3, a4, -16
vle32.v v8, (a3)
vle32.v v9, (a4)
vadd.vx v8, v8, a2
vadd.vx v9, v9, a2
vse32.v v8, (a3)
vse32.v v9, (a4)
addi a5, a5, -8
addi a4, a4, 32
bnez a5, .LBB0_4
Differential Revision: https://reviews.llvm.org/D124869
This reverts commit dfe513ae1b.
Tests have been changed to avoid the type legalization bug being
fixed in D126036.
Original commit message:
This will remove masks on the shift amount. We usually get this with
SimplifyDemandedBits in DAGCombine, but that's restricted to cases
where the AND has a single use. selectShiftMaskXLen does not have
that restriction.
This patch fixes another bug in the RVV frame lowering. While some frame
objects with non-default stack IDs (such as scalable-vector alloca
instructions) are considered in the target-independent max alignment
calculations, others (for example, during calling-convention lowering)
are not. This means we'd occasionally align the base of the stack to
only 16 bytes, with no way to ensure that the RVV section contained
within that is aligned to anything higher.
Reviewed By: StephenFan
Differential Revision: https://reviews.llvm.org/D125973
This patch addresses several alignment issues in the stack frame when
RVV objects are taken into account.
One bug is that the RVV stack was never guaranteed to keep the alignment
of the stack *as a whole*. We must maintain a 16-byte aligned stack at
all times, especially when calling other functions. With the standard V
extension, this is conveniently happening since VLEN is at least 128 and
always 16-byte aligned. However, we support Zvl64b which does not
guarantee this. To fix this, the RVV stack size is rounded up to be
aligned to 16 bytes. This in practice generally makes us allocate a
stack at least 2*VLEN in size, and a multiple of 2.
|------------------------------| -- <-- FP
| 8-byte callee-save | | |
|------------------------------| | |
| one VLENB-sized RVV object | | |
|------------------------------| | |
| 8-byte local variable | | |
|------------------------------| -- <-- SP (must be aligned to 16)
In the example above, with Zvl64b we are decrementing SP by 12 bytes
which does not leave SP correctly aligned. We therefore introduce an
extra VLENB-sized amount used for alignment. This would therefore ensure
the total stack size was 16 bytes (48 for Zvl128b, 80 for Zvl256b, etc):
|------------------------------| -- <-- FP
| 8-byte callee-save | | |
|------------------------------| | |
| one VLENB-sized padding obj | | |
| one VLENB-sized RVV object | | |
|------------------------------| | |
| 8-byte local variable | | |
|------------------------------| -- <-- SP
A new RVV invariant has been introduced in this patch, which is that the
base of the RVV stack itself is now always aligned to 16 bytes, not 8 as
before. This keeps us more in line with the scalar stack and should be
easier to reason about. The calculation of the RVV padding has thus
changed to be the amount required to align the scalar local variable
section to the RVV section's alignment. This amount is further rounded
up when setting up the initial stack to keep everything aligned:
|------------------------------| -- <-- FP
| 8-byte callee-save |
|------------------------------|
| |
| RVV objects |
| (aligned to at least 16) |
| |
|------------------------------|
| RVV padding of 8 bytes |
|------------------------------|
| 8-byte local variable |
|------------------------------| -- <-- SP
In the example above, it's clear that we need 8 bytes of padding to keep
the RVV section aligned to 16 when using SP. But to keep SP *itself*
aligned to 16 we can't decrement the initial stack pointer by 24 - we
have to round up to 32.
With the RVV section correctly aligned, the second bug fixed by
this patch is that RVV objects themselves are now correctly aligned. We
were previously only guaranteeing an alignment of 8 bytes, even if they
required a higher alignment. This is relatively simple and in practice
we see more rounding up of VLEN amounts to account for alignment in
between objects:
|------------------------------|
| RVV object (aligned to 16) |
|------------------------------|
| no padding necessary |
|------------------------------|
| 2*VLENB RVV object (align 16)|
|------------------------------|
| VLENB alignment padding |
|------------------------------|
| RVV object (align 32) |
|------------------------------|
| 3*VLENB alignment padding |
|------------------------------|
| VLENB RVV object (align 32) |
|------------------------------| -- <-- base of RVV section
Note that a lot of the regressions in codegen owing to the new alignment
rules are correct but actually only strictly necessary for Zvl64b (and
Zvl32b but that's not really supported). I plan a follow-up patch to
take the known VLEN into account when padding for alignment.
Reviewed By: StephenFan
Differential Revision: https://reviews.llvm.org/D125787
MachineCycle can handle irreducible loops. Natural loop
analysis (MachineLoop) cannot return the correct loop depth if
the loop is irreducible. And MachineSink is sensitive
to the loop depth; see MachineSinking::isProfitableToSinkTo().
This patch tries to use MachineCycle so that we can handle
irreducible loops better.
Reviewed By: sameerds, MatzeB
Differential Revision: https://reviews.llvm.org/D123995
Establish the most basic possible test coverage for LSR transformation on RISCV.
Original patch by eopXD (D123458), modified by me to cleanup/simplify tests.
We must add padding when using SP or BP to access stack objects.
Checking whether we're missing FP is not sufficient as stack realignment
uses SP too. The test in D125962 explains the specific issue in more
detail.
Split from D125787.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D125964
This test (and its forthcoming fix) was split off from D125787. It shows
that the logic we use to determine when we need to add extra RVV padding
is insufficient.
In this example, we may have a situation involving dynamic stack
alignment -- but no variable-sized objects -- where we have no FP but
must still use SP to index objects. In this case we also need the
extra RVV padding, otherwise objects may overlap. Specifically, the test
shows that the RVV vector object may clobber the lowest callee-save.
|------------------------------| -- <-- Incoming SP
| 4-byte callee-save (ra) |
|------------------------------| -- <-- SP + VLENB*2 + 60
| 4-byte callee-save (s0) |
|------------------------------| -- <-- SP + VLENB*2 + 56 --
| 4-byte callee-save (s9) | |
|------------------------------| -- <-- SP + VLENB*2 + 52 | RVV object(!!)
| VLENB*2 RVV object | |
|------------------------------| -- <-- SP + 56 --
| 4-byte local object |
|------------------------------| -- <-- SP + 32
| Dead area |
|------------------------------| -- <-- InSP - 2*VLENB - 64
| Possibly-zero realignment |
|------------------------------| -- <-- SP (realigned to 32)
This diagram should help show that when SP==InSP -- e.g., when the incoming SP
is 32-byte aligned, subtracting 2*VLENB+64 may keep it that way -- the RVV
object clobbers the spill of s9.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D125962
This reverts commit 86f7d7074a.
The test cases added for this exposed a pre-existing bug that is failing
the expensive checks bot. Reverting so I can revert that patch.
This is the test which triggered my disabling of the assert in d4545e6. The
issue it reveals is basically the same as from cc0283a6, but in the cross
block case.
We visit block1, mutate the setvli (correctly), and then visit block two and
ask whether the vadd is compatible with the block state. Before mutation, it
wasn't. After mutation, it is. And thus, we have our phase 1 vs 3 difference.
This will remove masks on the shift amount. We usually get this with
SimplifyDemandedBits in DAGCombine, but that's restricted to cases
where the AND has a single use. selectShiftMaskXLen does not have
that restriction.
This function tries to match (a >> 8) | (a << 8) as (bswap a) >> 16.
If the SRL isn't masked and the high bits aren't demanded, we still
need to ensure that bits 23:16 are zero. After the right shift they
will be in bits 15:8 which is where the important bits from the SHL
end up. It's only a bswap if the OR on bits 15:8 only takes the bits
from the SHL.
Fixes PR55484.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D125641
The patch does not pass math flags to float VPCmpIntrinsics because LLParser
could not identify float VPCmpIntrinsics as FPMathOperators.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D125600
This patch adds a transform to the local prepass in InsertVSETVLI which canonicalizes an AVL of a register from another vsetvli into an immediate or VLMAX when VTYPE is the same. In this patch, I chose to be conservative and avoid arbitrary vreg forwarding due to profitability concerns about possibly overlapping live ranges.
This has the effect of eliminating vsetvli instructions in loops which are walking either VLMAX or a constant number of lanes per iteration.
Differential Revision: https://reviews.llvm.org/D125812
The RISC-V stack is assumed to be aligned to 16 bytes and can handle stack
realignment for larger objects, but the "RVV stack" is only ensured to be
aligned to 8 bytes. This means that objects specified at a larger alignment may
be misaligned, not only for 16-byte-aligned RVV objects that don't trigger
realignment, but also for 32-byte-and-larger-aligned objects which do.
The new test checks a variety of alignment configurations, showing the
misaligned cases.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D110933
If we're using shift pairs to mask, then relax the one use limit if the shift amounts are equal - we'll only be generating a single AND node.
AArch64 has a couple of regressions due to this, so I've enforced the existing one use limit inside an AArch64TargetLowering::shouldFoldConstantShiftPairToMask callback.
Part of the work to fix the regressions in D77804
Differential Revision: https://reviews.llvm.org/D125607
This patch uses VP_REDUCE_AND and VP_REDUCE_OR to replace VP_REDUCE_SMAX,VP_REDUCE_SMIN,VP_REDUCE_UMAX and VP_REDUCE_UMIN for mask vector type.
Differential Revision: https://reviews.llvm.org/D125002
This patch adds a simple test which demonstrates a miscompilation of
16-byte-aligned scalar (non-RVV) objects when combined with RVV stack
objects.
The RISCV stack is assumed to be aligned to 16 bytes, and this is
guaranteed/assumed to be true when setting up the stack. However, when
the stack contains RVV objects, we decrement the stack pointer by some
multiple of vlenb, which is only guaranteed to be aligned to 8 bytes.
This means that non-RVV objects specifically requiring 16-byte alignment
fall through the cracks and are misaligned. Objects requiring larger
alignment trigger stack realignment and thus should be okay.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D125382
The documentation for this specifically mentions that this should not
happen. We could think about adding target hooks to permit it (and how
to merge IDs) in the future if that is desirable.
This specific test case was merging a scalable-vector slot into a
non-scalable one and dropping the notion of scalability, meaning we
failed to allocate enough stack space for the object.
Reviewed By: arsenm, MaskRay, sdesmalen
Differential Revision: https://reviews.llvm.org/D125699
Our current implementation of the InsertVSETVLI dataflow allows phase 3 to arrive at a different block end state than the data flow in phase 1/2 computed. This arises because a block which contains instructions (e.g. loads or stores) which don't consume all the incoming bits of the VL/VTYPE can be compatible with multiple incoming states. The algorithm effectively changes the SEW on such instructions, and propagates the prior state forward. As phase 3 uses the block input state for this propagation, but phase 1/2 doesn't, this can result in different block end states.
If we don't correct for it, this discrepancy can result in miscompiles. This was the source of multiple recent bugs. However, by now we have fixes for all known correctness issues.
The basic strategy we use is to insert a compensation vsetvli to bring the block state leaving the block back into consistency with the one computed. This is correct, but results in extra vsetvlis being placed at the end of blocks.
This change adjusts the phase 1/2 algorithm to propagate the incoming block state through the block, allowing the compatibility rules to modify the end state. The algorithm may need to run slightly more iterations, but the end result is consistent with what phase 3 does.
The benefit of doing this is two fold.
First, we reverse some of the code quality regressions introduced by the functional fixes.
Second, we simplify the invariants, and allow the strict assertions to be enabled. Several humans, myself included, have found it quite surprising that invariant didn't hold already, and arguably that confusion is the cause of several of our recent miscompiles in this code.
The downside to this patch is that the dataflow may require additional iterations to stabilize. In the worst case, we go from O(Edges) to O(E + UniquePaths) as the incoming state (and thus the outgoing one) can now change once for each path from the entry block.
Differential Revision: https://reviews.llvm.org/D125232
We've got a lurking problem with our data flow implementation where different phases disagree, resulting in possible miscompiles. D119518 introduced a workaround, but failed to consider blocks which only contain load/stores compatible with their incoming state.
When I went to rebase and simplify D125232, it turned out that not all of the correctness issues had been fixed yet after all. This is the correctness fix accidentally embedded in the original more complicated version.
Note that the test changes here are mostly regressions. It's worth noting that the simplified version of D125232 exactly reverses all the non-functional diffs in the test caused here. D125232 should be the immediate following commit.
Differential Revision: https://reviews.llvm.org/D125703
The existing redundant copy elimination required a virtual register source, but the same logic works for any physreg where we don't have to worry about clobbers. On RISCV, this helps eliminate redundant CSR reads from VLENB.
Differential Revision: https://reviews.llvm.org/D125564
This bug is in generic DAG combine and easily reproducible on many
targets.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D125640
If we use multiply it would be with 0x0101 which is 1 more than a power
of 2. On some targets we would expand this to shl+add. By avoiding the
multiply earlier, we can generate better code.
Note, PowerPC doesn't do the shl+add expansion of multiply so one of
the tests increased in instruction count.
Limiting to scalars because it almost always increased the number of
instructions in vector tests.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D125638
We need to use tail undisturbed for vslideup to implement
vector insert operation correctly.
Ideally, we could use tail agnostic when inserting a subvector
or element at the end of the vector. This will be done in a follow-up
patch.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D125545
Pulled out of D77804 as it's going to be easier to address the regressions individually.
This patch allows SimplifyDemandedBits to call SimplifyMultipleUseDemandedBits in cases where the source operand has other uses, enabling us to peek through the shifted value if we don't demand all the bits/elts.
The lost RISCV gorc2 fold shouldn't be a problem - instcombine would have already destroyed that pattern - see https://github.com/llvm/llvm-project/issues/50553
Differential Revision: https://reviews.llvm.org/D124839
When building the final merged node, we were using the original chain
rather than the output chain of the new operation. After some collapsing
of the chain this could cause the loads to be incorrectly scheduled with respect
to later stores.
This was uncovered by SingleSource/Regression/C/gcc-c-torture/execute/pr36038.c
of the llvm testsuite.
https://reviews.llvm.org/D125560
After a memmove is expanded, we load the higher addresses after we have
stored onto the lower ones, as if we could do a forward copy.
Differential Revision: https://reviews.llvm.org/D125553
This reverts most of ed242b54c9
I'm seeing failures in our intrinsic testing on qemu that seem
related to this. Reverting while I investigate.
I've left the command line option in place for directed testing.
It defaults to off.
This patch adds minimal support for lowering a read.register intrinsic with vlenb as the argument. Note that vlenb is an implementation constant, so it is never allocatable.
This was split off a patch to eventually replace PseudoReadVLENB with a COPY MI because doing so revealed a couple of optimization opportunities which really seemed to warrant individual patches and tests. To write those patches, I need a way to write the tests involving vlenb, and read.register seemed like the right testing hook.
Differential Revision: https://reviews.llvm.org/D125552
The goal is support tail and mask policy in RVV builtins.
We focus on IR part first.
If the passthru operand is undef, we use tail agnostic, otherwise
use tail undisturbed.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D125323
We've got a lurking problem with our data flow implementation where different phases disagree, resulting in possible miscompiles. D119518 introduced a workaround, but failed to consider blocks without terminators (e.g. fallthroughs).
I have a deeper rework of the algorithm in flight over in D125232, but this patch is specifically a minimal fix for an active miscompile. That change can be reworked over this once landed.
Differential Revision: https://reviews.llvm.org/D125408
riscv_fma_vl doesn't have a tail, so use the tail_agnostic policy.
We were already doing this for some patterns. I think the patterns
with fneg and mask were added later and I copied the tail policy
from the unmasked patterns.
Reviewed By: khchen
Differential Revision: https://reviews.llvm.org/D125424
RVV makes heavy use of subregisters due to LMUL>1 and segment
load/store tuples. Enabling subregister liveness tracking improves the quality
of the register allocation.
I've added a command line option that can be used to turn it off if it causes compile
time or functional issues. I used that option to keep the old behavior
for one interesting test case that was testing register allocation.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D125108
This is a followup to D124231.
We can fold the ADDIW in this pattern if we can prove that LUI+ADDI
would have produced the same result as LUI+ADDIW.
This pattern occurs because constant materialization prefers LUI+ADDIW
for all simm32 immediates. Only immediates in the range
0x7ffff800-0x7fffffff require an ADDIW. Other simm32 immediates
work with LUI+ADDI.
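As a concrete illustration (the constant chosen is an assumption), consider 0x7ffffff0, which is in that range on RV64:
```
lui   a0, 0x80000         # a0 = 0xffffffff80000000 (sign-extended)
addiw a0, a0, -16         # 32-bit add + sign extend -> 0x000000007ffffff0
# with ADDI instead, a0 would be 0xffffffff7ffffff0, so folding ADDIW->ADDI is unsafe here
```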
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D124693
If we have multiple gather/scatter instructions using the same
strided address, we would scalarize it multiple times. I guess
a later pass cleans this up, but I don't know if that's guaranteed.
This patch adds a cache to remember the scalarization we already
created for a previous gather/scatter.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D125326
This hook determines if SimplifySetcc transforms (X & (C l>>/<< Y))
==/!= 0 into ((X <</l>> Y) & C) ==/!= 0. Where C is a constant and
X might be a constant.
The default implementation favors doing the transform if X is not
a constant. Otherwise the code is left alone. There is a provision
that if the target supports a bit test instruction then the transform
will favor ((1 << Y) & X) ==/!= 0. RISCV does not say it has a variable
bit test operation.
RISCV with Zbs does have a BEXT instruction that performs (X >> Y) & 1.
Without Zbs, (X >> Y) & 1 still looks preferable to ((1 << Y) & X) since
we can use ANDI instead of putting a 1 in a register for SLL.
This patch overrides this hook to favor bit extract patterns and
otherwise falls back to the "do the transform if X is not a constant"
heuristic.
I've added tests where both C and X are constants with both the shl form
and lshr form. I've also added a test for a switch statement that lowers
to a bit test. That was my original motivation for looking at this.
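A rough sketch of the preferred form (register choices are assumptions):
```
# test whether bit Y (in a2) of X (in a1) is set: (X >> Y) & 1
bext a0, a1, a2           # with Zbs
# without Zbs:
srl  a0, a1, a2
andi a0, a0, 1
```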
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D124639
Type legalization will want to turn (srl X, Y) into RISCVISD::SRLW,
which will prevent us from using a BEXT instruction.
I don't think there is any precedent for type promotion checking
users to decide how to promote. Instead, I've added this DAG combine to
do it before type legalization.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D124109
This patch is an alternative to a piece of D125270. If we have one vsetvli which is using as AVL the output of another, and the prior AVL can be proven to produce the same VL value as that defining one, we can use the AVL from the prior instruction. This has the effect of removing a state transition on AVL, and will let us use the cheaper 'vsetvli x0, x0, vtype1' form or possibly even skip emitting it entirely.
This builds on the same infrastructure as D125337, and does the analogous extension to working on abstract states instead of only prior explicit vsetvli instructions. This is where the (relatively minor) code improvements come from.
More importantly, this fixes the last case where the state computed in phase 1 and 2 of the algorithm differs from the state computed during phase 3. Note that such differences can cause miscompiles by creating disagreements about contents of the VL and VTYPE registers at block boundaries.
Doing this transform inside the dataflow can cause the compatibility of a later store to change with regards to the current state. test15 in the diff illustrates this case well. What we have is a vsetvli which is mutated by one following vector op, but whose GPR result is used by another. The compatibility logic walks back to the def in this case, and checks to see if it matches the immediate prior state. In phase 1 and 2, it doesn't, and in phase 3 (after mutation) it does because we remove a transition which caused it to differ.
Differential Revision: https://reviews.llvm.org/D125392
These variations are chosen to exercise both FRE and PRE cases involving loops which don't change state in the iteration and can thus perform vsetvli in the preheader of the loop only. At the moment, these are essentially all TODOs.
This patch is an alternative to a piece of D125270. Its direct motivation is to fix a wrong code bug (described below), but somewhat unexpectedly, it also results in a significant code quality improvement for idiomatic fixed length vector patterns.
The existing transform is simply wrong in its current location. We are correct about the fact that the scalar move itself can use the previous vsetvli, but we lose track of the fact that later instructions might depend on the state change represented. That is, the actual value of VL in the register is different than the abstract state thinks it is. Not simply due to precision of modeling, but e.g. the VL register could contain 3 when the abstract state says it is 1. This is annoyingly hard to demonstrate in practice due to differences in policy flags on the intrinsics, but this is at least a latent wrong code bug.
The code quality benefit comes from the fact we don't need to tie this to explicit vsetvli instructions at all. We can propagate the abstract state, and reduce a) the number of transitions, or b) the cost of those transitions. It turns out we have a bunch of cases - in tests at least - where fixed length AVLs are known non-zero, and we can leave VL unchanged while changing VTYPE.
Differential Revision: https://reviews.llvm.org/D125337
The patch makes PseudoReadVL have the vtype of the corresponding VLEFF/VLSEGFF.
It's useful to get the vtype at the location of a PseudoReadVL without finding the
corresponding VLEFF/VLSEGFF.
It could simplify optimizations in RISCVInsertVSETVLI like D123581.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D125199
This patch adds rvv codegen support for vp.fpext. The lowering of fp_round, vp.fptrunc, fp_extend and vp.fpext share most code so use a common lowering function to handle these four.
And this patch changes the intermediate cast from ISD::FP_EXTEND/ISD::FP_ROUND to the RVV VL version op RISCVISD::FP_EXTEND_VL and RISCVISD::FP_ROUND_VL for scalable vectors.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D123975
These are aimed at a possible miscompile spotted in the vmv.s.x/f mutation case, but it appears this is a latent bug. Or at least, I haven't been able to construct a case with compatible policy flags via intrinsics.
This fixes the first of several cases where the state computed in phase 1 and 2 of the algorithm differs from the state computed during phase 3. Note that such differences can cause miscompiles by creating disagreements about contents of the VL and VTYPE registers at block boundaries.
In this particular case, we recognize that, for the first vsetvli in a block, if the AVL is a phi of GPR results from previous vsetvlis and the VTYPE field matches, we can avoid emitting a vsetvli because the register contents don't change. Unfortunately, the abstract state does change, and that update was lost.
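A rough sketch of the shape being recognized (block structure and registers are illustrative):

# %bb.1:
  vsetvli a0, a1, e32, m1, ta, mu
  ...
# %bb.2:
  vsetvli a0, a2, e32, m1, ta, mu
  ...
# %bb.3:                              # preds: %bb.1, %bb.2
  vsetvli zero, a0, e32, m1, ta, mu   # AVL is a phi of the a0 defs above

The vsetvli in %bb.3 can be skipped since the register contents don't change, but the abstract state still has to be updated to reflect it.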
As noted in the test change, this can actually improve results by preserving information until later state transitions in the block. However, this minor codegen improvement is not the motivation for the patch. The motivation is to avoid a case where we break a key internal correctness invariant.
Differential Revision: https://reviews.llvm.org/D125133
Enable default outlining when the function has the minsize attribute.
`addr-label.ll` crashed after enabling this, so a barrier is added before
instruction selection as a workaround.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D122213
Add basic tests and some tests for same operands and all undef
operands inspired by PR55271.
i32 umin/umax uses signext to match the RISC-V ABI. i8/i16 use
signext/zeroext to match the operation.
Differential Revision: https://reviews.llvm.org/D124948
If the GPR destination register of a VSETVLI instruction is unused, we can replace it with X0. This discards the result, and thus reduces register pressure.
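For example (registers illustrative), if a0 is never read:

  vsetvli a0, a1, e32, m1, ta, mu

becomes

  vsetvli x0, a1, e32, m1, ta, mu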
Since after the core insertion/lowering algorithm has run, many user written VSETVLIs will have their GPR result unused (as VTYPE/VLEN is now explicitly read instead), this kicks in for most tests which involve a vsetvli intrinsic for fixed length vectorization. (vscale vectorization generally uses the GPR result to know how far to e.g. advance pointers in a loop, and these uses are not removed.) When inserting VSETVLIs to lower pseudos, we prefer the X0 form anyway.
Differential Revision: https://reviews.llvm.org/D124961
No reason to special case simm12; movImm handles all immediates.
This also fixes a bug where we weren't passing the frame-setup/destroy
flag to movImm when calling it.
riscv-v-vector-bits-min is primarily used to opt-in to the
autovectorizer. The vector width can be determined from Zvl*b.
This patch adds support for treating -1 as meaning use Zvl*b so we can
still opt-in to autovectorization without needing to repeat a
vector width already given by Zvl*b or -mcpu.
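One plausible invocation (illustrative; the exact flags depend on how the backend is driven):

  llc -mtriple=riscv64 -mattr=+v,+zvl256b -riscv-v-vector-bits-min=-1 foo.ll

Here the minimum vector width is taken from Zvl256b rather than repeated on the command line.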
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D124960
This test shows incorrect cross-bb insertion. We'd expect to see
a SEW=8 vsetvli, something like:
vsetvli zero, zero, e8, mf8, ta, mu
vluxei64.v v1, (a2), v8, v0.t
But the vsetvli is omitted and an inherited SEW=64
vsetvli is used instead:
vmv1r.v v9, v1
vsetvli a3, zero, e64, m1, ta, mu
vmseq.vi v9, v1, 0
vmv1r.v v8, v0
vmandn.mm v0, v9, v2
beqz a0, .LBB0_2
# %bb.1:
vluxei64.v v1, (a2), v8, v0.t
vmv1r.v v3, v1
The "mask reg op" vmandn.mm in bb.1 appears to be confusing the insertion
process, as it is able to elide its own vsetvli as its VLMAX (SEW=8,
LMUL=MF8) is identical to the previous one (SEW=64, LMUL=1).
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D124089
We can use SH1ADD, SH2ADD, SH3ADD to multiply by 3, 5, and 9 respectively.
We could extend this to 3, 5, or 9 multiplied by a power of 2 by also
emitting a SLLI.
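For example (using a0 as both source and addend):

  sh1add a0, a0, a0    # a0 = (a0 << 1) + a0 = 3 * a0
  sh2add a0, a0, a0    # a0 = (a0 << 2) + a0 = 5 * a0
  sh3add a0, a0, a0    # a0 = (a0 << 3) + a0 = 9 * a0

and the possible extension, e.g. multiplying by 6 = 3 * 2:

  sh1add a0, a0, a0
  slli   a0, a0, 1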
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D124824
The result was totally wrong.
We can use the mask-undisturbed result to emulate the mask-agnostic result.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D124684
When looking for memory uses,
reassociationCanBreakAddressingModePattern should check uses of
the outer ADD rather than the inner ADD. We want to know if the
two ops we're reassociating are used by a load/store.
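A rough sketch of the shape in question (names illustrative):

  (load (add (add base, C1), idx))

The outer add is the address the load actually uses, so its uses, not the inner add's, tell us whether reassociation would break folding C1 into the addressing mode.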
In practice, the existing check usually works because CodeGenPrepare
will make one of the load/stores have an offset of 0 relative to
the split GEP. That will make the inner add have a memory use.
To test this, I've manually split the GEPs so there is no 0 offset
store.
This issue was recently discussed in the original review D60294.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D124644
We try to match a disguised rotate by constant in one of these forms:
(shl (X | Y), C1) | (srl X, C2) --> (rotl X, C1) | (shl Y, C1)
(shl X, C1) | (srl (X | Y), C2) --> (rotl X, C1) | (srl Y, C2)
We may have also looked through an AND to find the shift. If we
did, we need to apply a mask to the result.
I'll add an AArch64 test and pre-commit it and the RISC-V test
tomorrow.
Fixes PR55201.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D124711
Because of shrink wrapping, the block where we insert the epilog may not
contain any real instructions (only debug instructions), and the insertion
point may be MBB.end(), which doesn't have a DebugLoc. This patch fixes
this problem.
The test program was copied from the issue: https://github.com/llvm/llvm-project/issues/53662
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D123679
This fixes a crash from D124231.
We can't fold
(load (add base, (addi src, off1)), off2)
-> (load (add base, src), off1+off2)
if the src is a FrameIndex. FrameIndex cannot be the operand of an
add.
There was an immediate==0 check that I think was trying to catch
the common case of FrameIndex addis where the immediate is 0, but
they can also appear in non-zero form. Instead explicitly check
for a FrameIndex operand.
It's possible that we have a constant that isn't simm32 so we can't
use LUI+ADDIW, but we can use LUI+ADDI. Because ADDI uses a sign
extended constant, it's possible that after subtracting it out, we
end up with a simm32 that maps to LUI.
This patch detects this case after removing Lo12 and before shifting
the value for SLLI.
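As an illustrative example (not taken from the tests), 0xFFFFFFFF7FFFF800 is not a simm32, but after subtracting the sign-extended Lo12 of -2048 we are left with 0xFFFFFFFF80000000, which is a simm32 that LUI can produce:

  lui  a0, 0x80000       # a0 = 0xFFFFFFFF80000000
  addi a0, a0, -2048     # a0 = 0xFFFFFFFF7FFFF800

An ADDIW here would instead produce 0x000000007FFFF800, which is why LUI+ADDIW can't be used for this constant.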
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D124222
This improves opportunities to use bset/bclr/binv. Unfortunately,
there are no W versions of these instructions so this isn't always
a clear win. If we use SLLW we get free sign extend and shift masking,
but need to put a 1 in a register and can't remove an or/xor. If
we use bset/bclr/binv we remove the immediate materialization and
logic op, but might need a mask on the shift amount and sext.w.
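A sketch of the tradeoff for a 32-bit 'x | (1 << y)' on RV64 (registers illustrative):

  # with SLLW: shift-amount masking and sign extension come for free,
  # but we must materialize the 1 and keep the or:
  li   a2, 1
  sllw a2, a2, a1
  or   a0, a0, a2

  # with Zbs: the constant and the or go away, but we may need to mask
  # the shift amount and sign extend the result ourselves:
  andi   a1, a1, 31
  bset   a0, a0, a1
  sext.w a0, a0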
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D124096
The description of SETCC says
/// SetCC operator - This evaluates to a true value iff the condition is
/// true. If the result value type is not i1 then the high bits conform
/// to getBooleanContents.
Without this patch, we sign extended the i1 to the larger type used
regardless of getBooleanContents. This resulted in miscompiles, as
shown in the attached testcase that ended up returning -1 instead of
1 when using -mattr=+v.
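For instance (purely illustrative, not the attached testcase), for a node like

  (setcc i64 X, Y, setlt)

a target reporting ZeroOrOneBooleanContent expects the high bits of the i64 result to be zero; producing a sign-extended 0/-1 instead violates that contract and is visible to any user that looks at those high bits.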
Fixes https://github.com/llvm/llvm-project/issues/55168
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D124618
By clearing the HasDummyMask flag from mask register binary operations
and mask load/store.
HasDummyMask was causing an extra operand to get appended when
converting from MachineInstr to MCInst. This extra operand doesn't
appear in the assembly string so was mostly ignored, but it prevented
the alias instruction printing from working correctly.
Reviewed By: arcbbb
Differential Revision: https://reviews.llvm.org/D124424