Only one caller didn't already have an MVT and that was easy to
fix. Since the return type is MVT and it uses MVT::getVectorVT,
taking an MVT as input makes the most sense.
This custom isel was used to split the lo12 bits of the imm so that
they could be folded into load/store addresses via a post-isel
peephole.
This patch instead splits the immediate during isel and folds the
lo12, removing the need for the post-isel peephole to do anything.
After this we'll be able to remove the post-isel peephole.
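As a rough sketch of the split (illustrative standalone code, not the patch's; it assumes an arithmetic right shift for the sign extension):
```
#include <assert.h>
#include <stdint.h>

/* Split an offset into a sign-extended low 12-bit piece (foldable
   into a load/store address) and a remainder with zero low bits. */
int main(void) {
  int32_t imm = 0x12345;
  int32_t lo12 = (int32_t)((uint32_t)imm << 20) >> 20;
  int32_t hi = imm - lo12; /* low 12 bits are now zero */
  assert(hi + lo12 == imm);
  assert(lo12 >= -2048 && lo12 <= 2047);
  assert((hi & 0xFFF) == 0);
  return 0;
}
```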
Reviewed By: asb, luismarques
Differential Revision: https://reviews.llvm.org/D129450
This restores the old behavior before D129402 when
enableUnalignedScalarMem is false. This fixes a regression spotted
by @asb.
To fix this correctly, we need to consider alignment of the load
we'd be replacing, but that's not possible in the current interface.
Currently, for vectorised loops that use the get.active.lane.mask
intrinsic we only use the mask for predicated vector operations,
such as masked loads and stores, etc. The loop itself is still
controlled by comparing the canonical induction variable with the
trip count. However, for some targets this is inefficient when it's
cheap to use the mask itself to control the loop.
This patch adds support for using the active lane mask for control
flow by:
1. Generating the active lane mask for the next iteration of the
vector loop, rather than the current one. If there are still any
remaining iterations then at least the first bit of the mask will
be set.
2. Extracting the first bit of this mask and using it for the
conditional branch.
I did this by creating a new VPActiveLaneMaskPHIRecipe that sets
up the initial PHI values in the vector loop pre-header. I've also
made use of the new BranchOnCond VPInstruction for the final
instruction in the loop region.
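A rough scalar model of the resulting control flow (illustrative names, not the vectorizer's actual output):
```
#include <stdbool.h>
#include <stdint.h>

/* Lane 0 of get.active.lane.mask(base, tc) is just base < tc. */
static bool first_lane_active(uint64_t base, uint64_t trip_count) {
  return base < trip_count;
}

/* The latch branches on the first lane of the mask computed for the
   next iteration instead of comparing the IV with the trip count. */
void vector_loop(uint64_t trip_count, uint64_t vf) {
  uint64_t iv = 0;
  bool more = first_lane_active(iv, trip_count);
  while (more) {
    /* ... masked vector body predicated on the current lane mask ... */
    iv += vf;
    more = first_lane_active(iv, trip_count); /* mask for next iteration */
  }
}
```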
Differential Revision: https://reviews.llvm.org/D125301
Including the following opcodes:
Select_FPR16_Using_CC_GPR
Select_FPR32_Using_CC_GPR
Select_FPR64_Using_CC_GPR
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D127871
Somehow some tests failed in our downstream because the VFMV+FSD
pattern was matched first. Both the FSD and VSE patterns have the same
complexity, while FSD is matched before VSE in the generated
matcher table.
This problem only occurs in our downstream (so, sorry, I can't
provide a test here), and increasing the value of `AddedComplexity`
fixes it.
Reviewed By: StephenFan, craig.topper
Differential Revision: https://reviews.llvm.org/D129360
I think it only makes sense to return true here if we aren't going
to turn around and create a constant pool for the immediate.
I left out the check for useConstantPoolForLargeInts() thinking
that even if you don't want the compiler to create a constant pool,
you might still want to avoid materializing an integer that is
already available in a global variable.
Test file was copied from AArch64/ARM and has not been committed yet.
Will post separate review for that.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D129402
We have custom isel that tries to select the Lo12 bits using a
separate ADDI that can later be folded into the load/store address
by the post-isel peephole.
This patch disables this if the load/store already had a non-zero
offset. A non-zero offset implies that CodeGenPrepare split several
large offsets used by different loads and stores into a common large
offset and multiple small offsets that could be folded. Folding more
of the lo12 bits changes this common offset by increasing the small
offsets. While this can save an instruction to materialize the common
offset, it can also prevent the small offsets from fitting in a
compressed load/store instruction.
Removing this also simplifies the last piece needed to fold the custom
isel for add into SelectAddrRegImm and remove the post-isel peephole.
The motivation here is to a) bring us closer into alignment with AArch64 under the assumption that codepath is better tested, and b) simplify pattern matching in an upcoming change.
The immediate impact is a significant IR reduction but a fairly minimal change in the generated assembly. Due to a difference in expansion behavior we get a saturating add vs a non-saturating one for the old code, but that's about it. This difference comes down to different handling of overflow, which doesn't seem to be possible here anyway, so the assembly codegen is arguably a minor regression. I don't expect that to matter in practice.
Differential Revision: https://reviews.llvm.org/D129221
This allows fixed length vectors involving splats on the LHS to commute into the _vx form of the instruction. Oddly, the generic canonicalization rules appear to catch the scalable vector cases. I haven't fully dug in to understand why, but I suspect it's because of a difference in how we represent splats (splat_vector vs build_vector).
Differential Revision: https://reviews.llvm.org/D129302
The current implementation renames both registers in a store instruction
if we store the base address into memory via the same base register. That
is OK if the offset is 0; however, the transform is wrong if the offset
isn't 0. A small example:
sd a0, 808(a0)
We should not transform this into:
addi a2, a0, 768
sd a2, 40(a2)
Instead, we should rename only the base address:
addi a2, a0, 768
sd a0, 40(a2)
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D128876
RVV doesn't have an immediate field for memory addressing. Currently
we build MachineInstrs in PEI to compute stack offsets for RVV
load/store instructions. These instructions are added too late to be
optimized by passes such as CSE and LICM.
This patch prevents FrameIndex SDNodes from being matched in RVV load/store
instruction selection patterns, so the FrameIndex SDNodes are instead
selected as `ADDI GPR, targetframeindex`.
This change has two advantages:
1. Stack object address computation can be optimized by machine function
passes.
2. Since the ADDI instruction's destination register can be used as a
temp register, we can save an emergency spill slot.
Differential Revision: https://reviews.llvm.org/D128187
This handles the code we get for:
int foo(unsigned x, int *y) {
  return y[x >> 3];
}
The srl and shl implied by the array index will be combined to
form (srl (and X, C2), C1). We need to reverse this to get back
the shl so it can be folded into SHXADD.
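A quick check of the equivalence being reversed; for y[x >> 3] with 4-byte int elements, C2 is 0xFFFFFFF8 and C1 is 1 (an illustrative standalone check, not compiler code):
```
#include <assert.h>
#include <stdint.h>

/* (x >> 3) << 2 is the byte offset for y[x >> 3]; the DAG turns it
   into ((x & 0xFFFFFFF8) >> 1), which this patch converts back. */
int main(void) {
  for (uint32_t x = 0; x < 100000; ++x)
    assert(((x >> 3) << 2) == ((x & 0xFFFFFFF8u) >> 1));
  return 0;
}
```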
Some more complex cases require checking the relationship of
operands on different nodes of the match. They also require
additional instructions to be created. Using a ComplexPattern
gives us that flexibility.
I'll be adding another pattern in a future patch.
Computing a scalable offset needs up to two scratch registers. We add
scavenging spill slots according to the result of `RISCV::isRVVSpill`
and `RVVStackSize`. Since ADDI is not included in `RISCV::isRVVSpill`,
PEI doesn't add scavenging spill slots for the scratch registers when
ADDI is used to compute scalable stack offsets.
The ADDI instruction has a destination register which can be used as
a scratch register, so one scavenging spill slot is sufficient for
computing scalable stack offsets.
Differential Revision: https://reviews.llvm.org/D128188
This handles the code we get for:
int foo(int* x, unsigned y) {
  return x[y >> 1];
}
The shift right and the shl will get DAG combined into
(shl (and X, 0xfffffffe), 1). We have custom isel to match the
shl+and, but with Zba the (add (shl X, 1), Y) part will get
matched and leave the and to be iseled by itself. This commit
adds a larger pattern that includes the and.
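A quick check of the equivalence for this example; with 4-byte int elements the byte offset is (y >> 1) << 2 (illustrative standalone check):
```
#include <assert.h>
#include <stdint.h>

/* The DAG combines ((y >> 1) << 2) into ((y & 0xFFFFFFFE) << 1);
   the larger pattern now matches the whole thing, including the and. */
int main(void) {
  for (uint32_t y = 0; y < 100000; ++y)
    assert(((y >> 1) << 2) == ((y & 0xFFFFFFFEu) << 1));
  return 0;
}
```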
This allows us to fold global and constant pool addresses into
load/store during isel instead of in the post-isel peephole. I
did not copy the alignment check for ConstantPoolSDNode because it
wasn't tested.
This is a step towards being able to remove the post-isel
peephole.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D128738
where C2 has 32 leading zeros and C3 trailing zeros.
When the shl is used by an add and C is 1, 2, or 3, we end up matching
(add (shl X, C), Y) first. This leaves an and with a constant that
is harder to materialize.
Similar for SH2ADD and SH3ADD.
This is what we get from:
int foo(int* x, unsigned y) {
  return x[y >> 1];
}
This allows us to avoid materializing 0xFFFFFFFE into a register.
This reverts commit 7af3d4ab3d.
RISC-V reverted the shrink wrap patch because of bug 53662. Since the bug
was fixed by D123679, this commit re-enables it.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D128965
getPointerAlignment and ConstantPoolSDNode::getAlign only consider
the alignment of the object. If we already have a non-zero offset
into the object, that offset may have reduced the effective alignment.
Since the base pointer will become an LUI with the old offset, we
need to be sure the new offset fits in the alignment of the address
that will be used to create the LUI immediate.
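A hedged sketch of the underlying arithmetic (the helper name is illustrative, not the actual API): the usable alignment of base + offset is capped by the lowest set bit of the offset.
```
#include <assert.h>
#include <stdint.h>

/* The object's alignment alone is not enough: an existing non-zero
   offset caps the effective alignment at its lowest set bit. */
static uint64_t effective_align(uint64_t obj_align, uint64_t offset) {
  if (offset == 0)
    return obj_align;
  uint64_t off_align = offset & (0 - offset); /* lowest set bit */
  return off_align < obj_align ? off_align : obj_align;
}

int main(void) {
  assert(effective_align(16, 0) == 16);
  assert(effective_align(16, 4) == 4); /* offset 4 reduces it to 4 */
  return 0;
}
```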
I'm not sure it is possible to have a non-zero offset in the
GlobalAddressSDNode or ConstantPoolSDNode at this point today so this
may only be a theoretical bug.
Differential Revision: https://reviews.llvm.org/D129006
This fixes test/DebugInfo/Generic/accel-table-hash-collisions.ll and
cross-cu-inlining.ll when the default triple is riscv. llvm-dwarfdump
--apple-names does not resolve R_RISCV_{ADD,SUB}32 in .apple_names and
.apple_types, and having ADD/SUB causes the decoding failure `Atom[0]:
Error extracting the value`.
We know which instruction we're emitting, so it's OK to directly
encode X0 into the instruction. We only need to create a copy when
a constant 0 is selected without context of which instruction uses it.
Only handle immediates that would produce an ADDI or ADDIW of Lo12
as the final instruction in their materialization.
As the test changes show, this excludes immediates that materialize
with lui+addiw, which is not the same as lui+addi.
These should contain the same thing, but we aren't consistent about
which we use.
Since we call ReplaceNode, it seems more correct to use the initial VT.
Similar for a subtract with a constant left hand side.
(sra (add (shl X, 32), C1<<32), 32) is the canonical IR from InstCombine
for (sext (add (trunc X to i32), C1) to i64).
For RISC-V, we should lower this as addiw, which means turning it into
(sext_inreg (add X, C1)).
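A small check of the identity, written with unsigned arithmetic to keep the C well-defined (illustrative):
```
#include <assert.h>
#include <stdint.h>

/* (sra (add (shl X, 32), C1<<32), 32) equals the sign-extended
   32-bit sum of X and C1, which is exactly what addiw computes. */
int main(void) {
  int64_t x = 0x123456789LL, c1 = 1000;
  int64_t lhs = (int64_t)(((uint64_t)x + (uint64_t)c1) << 32) >> 32;
  int64_t rhs = (int32_t)((uint32_t)x + (uint32_t)c1); /* addiw */
  assert(lhs == rhs);
  return 0;
}
```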
There is an existing DAG combine to convert back to (sext (add (trunc X
to i32), C1) to i64), but it requires isTruncateFree to return true
and for i32 to be a legal type, as it uses sign_extend and truncate
nodes. So that doesn't work for RISC-V.
If the outer sra happens to be used by a shl by constant, it will be
folded and the shift amount of the sra will be changed before we
can do our own DAG combine. This requires us to match the more
general pattern and restore the shl.
I had wanted to do this as a separate (add (shl X, 32), C1<<32) ->
(shl (add X, C1), 32) combine, but that hit an infinite loop for some
values of C1.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D128869
The sext_inreg can often be folded into an earlier instruction by
using a W instruction. The sext_inreg also works better with our ABI.
This is one of the steps to improving the generated code for this https://godbolt.org/z/hssn6sPco
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D128843
This reverts commit 755c84c62c. A bug was reported on the original review thread (https://reviews.llvm.org/D128006), and on inspection this patch is simply wrong. It needs to be checking for VLInBytes, not MaxVL. These happen to be the same when using AVL=VLMAX (which is quite common), but this does not fold when AVL != VLMAX.
Previously we iseled this to a pair of ADDIs and relied on a post-isel
peephole to fold one of the ADDIs into the load/store. Now
we split the immediate into two parts the same way isel does and fold
one of the pieces. If the add has a non-memory use, it will be
selected separately and the larger piece will CSE with the ADDI we
created for the memory use.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D128741
This implements known bits for READ_VLENB using any information known about the minimum and maximum VLEN. There's an additional assumption that VLEN is a power of two.
The motivation here is mostly to remove the last use of getMinVLen, but while I was here, I decided to also fix the bug for VLEN < 128 and handle max from command line generically too.
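A toy model of the resulting known bits (illustrative, not the backend's KnownBits code): since vlenb is a power of two within its bounds, every bit below the minimum's bit and above the maximum's bit is known zero.
```
#include <assert.h>
#include <stdint.h>

/* vlenb = VLEN/8 is a power of two with minv <= vlenb <= maxv. */
static uint64_t known_zero(uint64_t minv, uint64_t maxv) {
  return (minv - 1) | ~(2 * maxv - 1);
}

int main(void) {
  /* e.g. VLEN in [64, 512] gives vlenb in {8, 16, 32, 64} */
  uint64_t kz = known_zero(8, 64);
  for (uint64_t v = 8; v <= 64; v <<= 1)
    assert((v & kz) == 0); /* no possible value hits a known-zero bit */
  return 0;
}
```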
Differential Revision: https://reviews.llvm.org/D128758
The pass was previously limited to LUI+ADDI being used by a single
instruction.
This patch allows the pass to optimize multiple memory operations
that use the same offset. Each of them will receive a separate %lo
relocation. My main motivation is to handle a read-modify-write
where we have a load and store to the same address, but I didn't
restrict it to that case.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D128599
Including the following opcodes:
Select_FPR16_Using_CC_GPR
Select_FPR32_Using_CC_GPR
Select_FPR64_Using_CC_GPR
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D127871
This extends the existing cost model for reductions for scalable vectors.
The existing cost model assumes that reductions are roughly logarithmic in cost for unordered variants and linear for ordered ones. This change keeps that same basic model, and extends it out to the maximum number of elements a scalable vector could possibly have.
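A sketch of that cost shape (illustrative names and constants, not the actual TTI code):
```
#include <stdint.h>

/* Unordered reductions take ~log2(N) combining steps; ordered ones
   take ~N steps. For scalable vectors, N is the maximum element
   count the type could have at the largest possible VLEN. */
static unsigned log2_ceil(uint64_t n) {
  unsigned r = 0;
  while ((1ULL << r) < n)
    ++r;
  return r;
}

uint64_t unordered_reduce_cost(uint64_t max_elts) {
  return log2_ceil(max_elts);
}

uint64_t ordered_reduce_cost(uint64_t max_elts) {
  return max_elts; /* linear in VL, hence strongly biased against */
}
```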
This results in costs which aren't terribly high for unordered reductions, but are for ordered ones. This seems about right; we want to strongly bias away from using scalable ordered reductions if the cost might be linear in VL.
Differential Revision: https://reviews.llvm.org/D127447
SelectBaseAddr was a minor convenience to use since it already
existed for vector load/store. D128187 is going to remove the other
uses of SelectBaseAddr, so it has less reason to exist.
This patch removes the dependency on SelectBaseAddr and adds a new
SelectAddrFrameIndex to share some code with SelectFrameAddrRegImm.
LoopVectorizer uses getVScaleForTuning for deciding how to discount the cost of a potential vector factor by the amount of work performed. Without the callback implemented, the vectorizer was defaulting to an estimated vscale of 1. This results in fixed vectorization looking falsely profitable (since it used the command line VLEN).
The test change is pretty limited since a) we don't have much coverage of the vectorizer with scalable vectors at all, and b) what little coverage we have mostly uses i64 element types. There's a separate issue with <vscale x 1 x i64> which prevents us from getting to this stage of costing, and thus only the one test explicitly written to avoid that is visible in the diff. However, this is actually a very wide impact change as it changes the practical vectorization result when both fixed and scalable is enabled to scalable.
As an aside, I think the vectorizer is a little too strongly biased towards scalable when both are legal, but we can explore that separately. For now, let's just get the cost model working the way it was intended.
Differential Revision: https://reviews.llvm.org/D128547
These are now only used in the implementation of getRealMinVLen, getRealMaxVLen, and useRVVForFixedLengthVectors; make them protected to discourage new users.
The comments in the existing code appear to predate the standardization of the +v extension. In particular, the specification *does* provide a bound on the maximum value VLEN can take. From what I can tell, the LMUL comment was simply a misunderstanding of what this API returns.
This API returns the maximum value that vscale can take at runtime. This is used in the vectorizer to bound the largest scalable VF (e.g. LMUL in RISCV terms) which can be used without violating memory dependence.
Differential Revision: https://reviews.llvm.org/D128538
getRealMaxVLen returns an upper bound on the value of VLEN. We can use this upper bound (which, unless explicitly set on the command line, results in an e8 MaxVLMax much greater than 256) instead of explicitly handling the unknown case separately from the case of a bound greater than 256.
Note as well that this code already implicitly depends on a capped value for VLEN. If infinite VLEN were possible, then 16-bit indices wouldn't be enough.
This doesn't change behavior, it just makes it slightly more obvious what's
going on. Note that getRealMinVLen is always >= getMinRVVVectorSizeInBits.
The first case is a bit tricky, as you have to know that
getMinRVVVectorSizeInBits returns 0 when not set, and thus is equivalent
to the else value clause. The new code structure makes it more obvious we
return 0 unless using RVV for fixed length vectors.
We currently split the immediate almost equally between two addis.
If the immediate is odd, it won't be split exactly equal.
This patch instead gives one addi an immediate of 2047 or -2048 and the
other gets the remainder. If the original immediate is near -2049 or 2048,
this might allow the use of c.addi for the addi that receives the
smaller immediate.
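A sketch of the new split (illustrative):
```
#include <assert.h>
#include <stdint.h>

/* Give the first addi 2047 (or -2048) and the second the remainder;
   near the edge of the simm12 range the remainder fits c.addi. */
int main(void) {
  int32_t imm = 2052; /* just above the simm12 range */
  int32_t first = imm > 0 ? 2047 : -2048;
  int32_t second = imm - first; /* 5: fits c.addi's simm6 */
  assert(first + second == imm);
  assert(second >= -32 && second <= 31);
  return 0;
}
```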
Reviewed By: asb, luismarques
Differential Revision: https://reviews.llvm.org/D128500
This patch adds 3 new _VL RISCVISD opcodes to represent VFMA_VL with
different portions negated. It also adds a DAG combine to peek
through FNEG_VL to create these new opcodes.
This is modeled after similar code from X86.
This makes the isel patterns more regular and reduces the size of
the isel table by ~37K.
The test changes look like regressions, but they point to a bug that
was already there. We aren't able to commute a masked FMA instruction
to improve register allocation because we always use a mask undisturbed
policy. Prior to this patch we matched two multiply operands in a
different order and hid this issue for these test cases, but a different
test still could have encountered it.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D128310
A forward abstract state can be in the special SEWLMULRatioOnly state, which means we're not allowed to inspect its fields. The scalar-to-vector move case was missing a guard, and we'd crash on an assert. Test cases included.
According to the vector spec, mf8 is not supported for i8 if ELEN
is 32. Similarly, mf4 is not supported for i16/f16, nor mf2 for i32/f32.
Since RVVBitsPerBlock is 64 and LMUL is calculated as
((MinNumElements * ElementSize) / RVVBitsPerBlock), this means we
need to disable any type with MinNumElements==1.
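A worked instance of that formula (illustrative):
```
#include <assert.h>

/* For <vscale x 1 x i8>: LMUL = (1 * 8) / 64 = 1/8, i.e. mf8, which
   the spec disallows when ELEN is 32. Every MinNumElements==1 type
   with an element size <= 32 bits lands in a fractional LMUL the
   same way. */
int main(void) {
  int rvv_bits_per_block = 64;
  int min_num_elements = 1, element_size = 8;
  assert(rvv_bits_per_block / (min_num_elements * element_size) == 8);
  return 0;
}
```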
For generic IR, these types will now be widened in type legalization.
For RVV intrinsics, we'll probably hit a fatal error somewhere. I plan
to work on disabling the intrinsics in the riscv_vector.h header.
Reviewed By: arcbbb
Differential Revision: https://reviews.llvm.org/D128286
This adds the macrofusion plumbing and support for fusing LUI+ADDI(W).
This is similar to D73643, but handles a different case. Other cases
can be added in the future.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D128393
This adds RISCVISD opcodes for LA, LA_TLS_IE, and LA_TLS_GD to
remove the creation of MachineSDNodes from get*Addr. This makes the
code consistent with the previous patches that added RISCVISD::HI,
ADD_LO, LLA, and TPREL_ADD.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D128325
Put it before the VL instead of as the first operand. I want to add
passthru to more operands, but the commutable ones like VADD_VL
require the commutable operands to be operand 0 and 1. So we can't
have the passthru as operand 0 for those.
The failure that caused the previous revert has been fixed
by https://reviews.llvm.org/D126048
Original commit message:
RVV makes heavy use of subregisters due to LMUL>1 and segment
load/store tuples. Enabling subregister liveness tracking improves the quality
of the register allocation.
I've added a command line option that can be used to turn it off if it
causes compile time or functional issues. I used that option to keep the
old behavior for one interesting test case that was testing register
allocation.
Reviewed By: kito-cheng
Differential Revision: https://reviews.llvm.org/D128016
Use it in place of VSELECT_VL+VRGATHER*_VL.
This simplifies the isel patterns.
Overall, I think trying to match select+op to create masked instructions
in isel doesn't scale. We either need to do it in DAG combine, a pre-isel
peephole, or a post-isel peephole. I don't yet know which is the right
answer, but for this case it seemed best to be able to request the
masked form directly from lowering.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D128023
This was a bug introduced in d764aa. A pointer type is not a primitive type, and thus we were ending up dividing by zero when computing VLMax.
Differential Revision: https://reviews.llvm.org/D128219
The code being removed is technically correct; if we end up with two VL=0 instructions next to each other, we can avoid a state transition if the second is a scalar move. However, since both ops are also nops, we should simply delete them instead. As such, this compatibility rule simply complicates the code for no purpose.
When working through correctness issues in this pass, I moved a number of transforms which were phrased as mutating prior vsetvli instructions out of the main data flow because mutating prior instructions can invalidate the running dataflow results in subtle ways. We ended up creating both a prepass and a post-pass.
After consideration, I believe the prepass to be redundant, and this change removes it by folding it back into the data flow via a key conceptual change. Instead of phrasing the mutations on instructions, we can phrase them on abstract states. This avoids the dataflow inconsistency problem mentioned above by simply propagating the potential change forward, and thus reflecting its results in the dataflow. Critically, we do so without modifying existing VSETVLI instructions; some of the data flow steps include non-local IR analysis.
Compile time wise, this removes a linear pass, but has the potential to increase the number of iterations for the data flow to converge. That's not an algorithmic complexity change; the needVSETVLI mechanism has the same effect. In practice, I don't see this triggering more iterations, so I think it's likely to be a net win overall. (I didn't do any careful analysis here; just an impression from glancing at a couple tests.)
This has the potential to produce better results, so this isn't strictly speaking NFC.
Differential Revision: https://reviews.llvm.org/D127870
In D127983, I had flipped from using the computed EEW to using the SEW value pulled from the VSETVLI when checking compatibility. This wasn't intentional, though thankfully it appears to be a non-functional difference. The new code does make an unchecked assumption that the initial SEW operand on the load/store is the EEW. This patch clarifies the assumption and adds an assert to make sure this remains true.
Differential Revision: https://reviews.llvm.org/D128085
Type legalization will convert the bitcast into a vector store and
scalar load.
Instead this patch widens the vector to v8i1 with undef, and bitcasts
it to i8. v8i1->i8 has custom handling for type legalization already to
bitcast to a v1i8 vector and use an extract_element.
The code here was lifted from X86's avx512 support.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D128099
Doing so lets the post-mutation pass leverage the demanded info to rewrite vsetvlis before a store/mask-op to eliminate later vsetvlis.
Sorry for the lack of a store test change; all of my attempts to write something reasonable were already handled by existing logic.
This code should be dead. A simple whole-register copy of an IMPLICIT_DEF is simply an IMPLICIT_DEF of its own. (This would not be true for freeze, but is for copy.) If we find a case which gets here with a vector operand copy of an IMPLICIT_DEF, we most likely have an earlier missed optimization anyway. (The most recent case of this was e6c7a3a, found by Craig during review of this patch.) There might be others, and if so, we'll revisit them individually as regressions are reported.
Differential Revision: https://reviews.llvm.org/D127996
A splat of the values 0 and -1 as sign extended 12 bit immediates are always the same bit pattern regardless of the etype used to perform the operation. As a result, we can sometimes avoid introducing a vsetvli just for the purposes of a splat.
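A quick byte-level check of the claim (illustrative):
```
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A splat of -1 writes the same bytes whether done at e8, e16, or
   e32; the same holds trivially for 0. */
int main(void) {
  uint8_t a[8], b[8], c[8];
  int8_t s8 = -1;
  int16_t s16 = -1;
  int32_t s32 = -1;
  for (int i = 0; i < 8; ++i) memcpy(&a[i], &s8, 1);
  for (int i = 0; i < 4; ++i) memcpy(&b[2 * i], &s16, 2);
  for (int i = 0; i < 2; ++i) memcpy(&c[4 * i], &s32, 4);
  assert(memcmp(a, b, 8) == 0 && memcmp(a, c, 8) == 0);
  return 0;
}
```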
Looking at the diffs, we don't get a huge amount of immediate value out of this. We mostly push the vsetvli one instruction down, usually in front of a vmerge. We also don't get the corresponding fixed length vector cases because VL typically is changed despite the actual bits written being the same. Both of these are areas I plan to explore in future patches.
Interestingly, this makes a great example of why we need the forward and backward implementations to be consistent. Before we merged the demanded field handling, if we implemented only the forward direction, we lost the ability to mutate a prior vsetvli and eliminate a later one entirely. This resulted in practical regressions instead of improvements. It's always nice when practice matches theory. :)
Differential Revision: https://reviews.llvm.org/D128006
This allows computeKnownBits to see the constant being loaded.
This recovers the rv64zbp test case changes from D127520.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D127679
Rather than emitting a MachineSDNode from lowering, let isel match it.
This is consistent with the RISCVISD::HI and ADD_LO nodes that were
also added. Having them handled the same way will make D127679 consistent.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D127714
Instead add RISCVISD opcodes that will be selected to LUI/ADDI
during isel.
I'm looking into maybe moving doPeepholeLoadStoreADDI into isel.
Having the ADDI as a RISCVISD node will make it visible to isel.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D127713
This change merges the logic for reasoning about demanded portions of the VTYPE register between the main dataflow algorithm and the backwards mutation post pass. In the process, we get to delete a bunch of now redundant code.
This should be entirely NFC. I included a slight hack (see TODO) to avoid changing behavior in the post pass while being able to use the generalized logic in the prepass. I will fix the TODO in a separate change once this lands.
Differential Revision: https://reviews.llvm.org/D127983
The costing we use for fixed length vector gather and scatter is to simply count up the memory ops, and multiply by a fixed memory op cost. For scalable vectors, we don't actually know how many lanes are active. Instead, we have to end up making a worst case assumption on how many lanes could be active. In the generic +V case, this results in very high costs, but we can do better when we know an upper bound on the VLEN.
There's some obvious ways to improve this - e.g. using information about VL and mask bits from the instruction to reduce the upper bound - but this seems like a reasonable starting point.
The resulting costs do bias us pretty strongly away from generating scatter/gather for generic +V. Without this, we'd be returning an invalid cost and thus definitely not vectorizing, so no major change in practical behavior expected.
Differential Revision: https://reviews.llvm.org/D127541
If we're writing to an undef vector (i.e. implicit_def), we can change the value of bits outside the requested write without consequence. This allows us to avoid a VSETVLI just for narrowing the value written.
Differential Revision: https://reviews.llvm.org/D127880
This change just moves some code around, and extracts out a helper function expected to be useful when reusing the demanded field logic in the forward dataflow.
If the merge operand isn't undef, we need to be using tail undisturbed.
It turns out all of our uses of riscv_slidedown_vl use undef, so this
doesn't affect any tests.
The motivating case, and the only one actually enabled by this patch, is a load or store followed by another op with the same SEW/LMUL ratio.
As an example, consider:
define void @test1(ptr %in, ptr %out) {
entry:
  %0 = load <8 x i16>, ptr %in, align 2
  %1 = sext <8 x i16> %0 to <8 x i32>
  store <8 x i32> %1, ptr %out, align 4
  ret void
}
Without this patch, we get:
  vsetivli zero, 8, e16, mf4, ta, mu
  vle16.v v8, (a0)
  vsetvli zero, zero, e32, mf2, ta, mu
  vsext.vf2 v9, v8
  vse32.v v9, (a1)
  ret
Whereas with the patch we get:
  vsetivli zero, 8, e32, mf2, ta, mu
  vle16.v v8, (a0)
  vsext.vf2 v9, v8
  vse32.v v9, (a1)
  ret
We have rewritten the first vsetvli and thus removed the second one.
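A toy check of the arithmetic that makes this legal: both operations demand the same SEW/LMUL ratio, so VLMAX is unchanged.
```
#include <assert.h>

/* e16 at mf4 and e32 at mf2 both give a SEW/LMUL ratio of 64. */
int main(void) {
  assert(16 * 4 == 32 * 2);
  return 0;
}
```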
As is strongly hinted by the code structure and todos, I am planning on commoning this with all (or most all?) of the cases from isCompatible used in the forward data flow. This will be done in a series of following changes - some NFC reworks, and some reviewed optimization extensions.
Differential Revision: https://reviews.llvm.org/D127780
RISC-V expands register tuple spills into a series of individual register
spills after the register allocation phase via pseudo instruction expansion.
However, part of a register tuple might still be undefined at the point of
the spill, and the machine verifier will complain that the spill instruction
uses an undefined physical register.
The optimal solution would be to do liveness analysis and not emit spills
and reloads for the undefined parts, but accurate liveness info at that
point is not easy to get.
So the suboptimal solution is to still spill and reload the undefined parts,
but add an implicit-use of the super register to the spill instruction. The
machine verifier then only reports a use of an undefined physical register
when the whole super register is undefined, and this behavior is
documented in MachineVerifier::checkLiveness [1].
An example to demonstrate what happens:
```
v10m2 = xxx
# v12m2 not defined yet
PseudoVSPILL2_M2 v10m2_v12m2
...
```
After expansion:
```
v10m2 = xxx
# v12m2 not defined yet
# Expand PseudoVSPILL2_M2 v10m2_v12m2 to 2 vs2r
VS2R_V v10m2
VS2R_V v12m2 # Use undef reg!
```
What this patch did:
```
v10m2 = xxx
# v12m2 not defined yet
# Expand PseudoVSPILL2_M2 v10m2_v12m2 to 2 vs2r
VS2R_V v10m2 implicit v10m2_v12m2
# Use undef reg (v12m2), but v10m2_v12m2 isn't totally undef, so
# that's OK.
VS2R_V v12m2 implicit v10m2_v12m2
```
[1] https://github.com/llvm-mirror/llvm/blob/master/lib/CodeGen/MachineVerifier.cpp#L2016-L2019
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D127642
The VSETVLIInfo right after a VLEFF/VLSEGFF is currently unknown since
those instructions modify VL. An unknown VSETVLIInfo forces a VSET(I)VLI
to be inserted before the next vector operation. But the next vector
operation after a VLEFF/VLSEGFF may not need a VSET(I)VLI if it uses the
same VTYPE and the resulting vl of the VLEFF/VLSEGFF.
Take the below C code as an example:
  vint8m4_t vec_src1 = vle8ff_v_i8m4(str1, &new_vl, vl);
  vbool2_t mask1 = vmseq_vx_i8m4_b2(vec_src1, 0, new_vl);
The vsetvli insertion pass adds a redundant vsetvli for this.
Assembly result:
  vsetvli a2,a2,e8,m4,ta,mu
  vle8ff.v v28,(a0)
  csrr a3,vl ; redundant
  vsetvli zero,a3,e8,m4,ta,mu ; redundant
  vmseq.vi v25,v28,0
After D126794, VLEFF/VLSEGFF have a def of the resulting VL. The patch
considers there to be a ghost vsetvli right after a VLEFF/VLSEGFF. The
ghost VSET(I)VLI uses the vl output of the VLEFF/VLSEGFF as its AVL and
the same VTYPE as the VLEFF/VLSEGFF. The ghost vsetvli must be redundant,
and we can use it to get the VSETVLIInfo right after VLEFF/VLSEGFF.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D127576
Since almost all pseudos have the same form of BaseInstr, we
can just set it as the default value to reduce some lines.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D127632
We would previously fail to handle 64-bit PC-relative relocations on
RISCV. This was exposed by trying to build with
`-fprofile-instr-generate`.
The original changes restricted the relocation handling to the text
segment as the paired relocations are undesirable in at least the debug
and .eh_frame sections. We now make this explicit to handle the general
case for the data relocations as well.
It would be preferable to use `R_RISCV_n_PCREL` when available to avoid
an extra relocation.
Differential Revision: https://reviews.llvm.org/D127549
Reviewed By: luismarques, MaskRay
Fixes: #55971
In an effort to make this code easier to read and extend, this splits out helper functions for the transfer function of the data flow. Due to the other results computed during the phases, we can't completely abstract away everything, but we can abstract the actual state transitions.
The motivation here is the following upcoming changes:
* The fault first load patch - already approved, this will be rebased over - adds another case into the transferAfter path.
* An upcoming patch to fold the local prepass back into the main algorithm greatly complicates the transferBefore logic.
Differential Revision: https://reviews.llvm.org/D127761
This is possibly somewhat subjective, but having an explicitly named flag to track the property required and code structure that more closely matches phase 1/2 of the dataflow seems much easier to read.
Differential Revision: https://reviews.llvm.org/D126893
We were incorrectly creating a VRGATHER node with i1 vector type. We
could support this by promoting the mask to i8 and truncating it, but
for now I want to prevent the crash.
Fixes PR56007.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D127681
If we defer the mutation of the instruction, we can add the assert discussed in D126921. Once we do that, the API becomes subject to revision - but let's do that in a separate change.
This simplifies the isel code by removing the manual load creation.
It also improves our ability to use 0 strided loads for vector splats.
There is an assumption here that Mask and ShiftedMask constants are
cheap enough that they don't become constant pool loads so that our
isel optimizations involving And still work. I believe those constants
are 3 instructions in the worst case.
The rv64zbp-intrinsic.ll changes are a regression caused by the intrinsic
expansion to RISCVISD nodes also occurring during lowering. So the
optimizations were only happening during the last DAGCombine, which can't
see through the load. I believe we can fix this test by implementing
load. I believe we can fix this test by implementing
TargetLowering::getTargetConstantFromLoad for RISC-V or by adding the intrinsic
to computeKnownBitsForTargetNode to enable earlier DAG combine. Since Zbp is not
a ratified extension, I don't view these as blocking this patch.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D127520
These methods don't access any state from RISCVInstrInfo. Make them
free functions in the RISCV namespace.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D127583
Our actual lowering for i1 reductions uses ctpop combined with possibly a vector negate and possibly a logic op afterwards. I believe ctpop to be low cost on all reasonable hardware.
The default costing implementation here was returning quite inconsistent costs. and/or reductions were returning very high costs (because we seem to think moving into scalar registers is very expensive?) and others were returning lower, but still too high, costs (because of the assumed tree-reduction strategy). While we should probably improve the generic costing strategy for i1 vectors, let's start by fixing the immediate problem.
Differential Revision: https://reviews.llvm.org/D127511
This brings us into alignment with AArch64, and in the process fixes a compiler crash bug in uniform store handling in the vectorizer.
Before the recent invalid cost bailout work, this would have also avoided crashes on invalid costs in some cases. I honestly think the vectorizer should gracefully bailout on uniform stores it can't use a scatter for, but it doesn't, so lets take the path of least resistance here. It's also possible that there are other vectorizer bugs AArch64 isn't seeing because of this hook; we don't want to be finding them either.
Differential Revision: https://reviews.llvm.org/D127514
We need a preheader and a single latch, but we don't need a dedicated
exit.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D127513
This prevents them from being assumed legal by the cost model.
This matches what is done for AArch64 SVE.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D123799
The patch is a replacement for D125199. Giving PseudoReadVL the vtype
raised the worry of computing the same vtype of the VLEFF/VLSEGFF in two
different places, DAGToDAG and InsertVSETVLI. A VLEFF/VLSEGFF MI with a VL
output can still provide the vtype of the VLEFF/VLSEGFF to the users of
its VL.
The patch names the new pseudos with the original VLEFF/VLSEGFF names
suffixed by "_VL" and expands them in the RISCVInsertVSETVLI pass.
This patch also reverts commit 4537aae0d5,
"[RISCV] Make PseudoReadVL have the vtypes of the corresponding VLEFF/VLSEGFF.".
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D126794
For an addition with simm14 or simm15 immediates with 2 or 3 trailing
zero bits respectively, we can use a shXadd instruction and an addi to
do the addition.
This patch teaches RISCVMergeBaseOffset to see through this pattern.
I don't think the sh1add case occurs because we use two addis for that,
but I implemented it for completeness.
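A worked instance for the sh3add case (illustrative values):
```
#include <assert.h>
#include <stdint.h>

/* A simm15 immediate with 3 trailing zeros can be added as
     addi   t0, zero, imm >> 3   ; fits in simm12
     sh3add rd, t0, rs           ; rs + (t0 << 3)              */
int main(void) {
  int32_t imm = 12288; /* simm15 with 3 trailing zeros */
  int32_t lo = imm >> 3;
  assert(lo >= -2048 && lo <= 2047); /* valid addi immediate */
  int64_t rs = 0x1000;
  assert(rs + ((int64_t)lo << 3) == rs + imm);
  return 0;
}
```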
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D127376
In order to make sure the stack pointer is correct through the EH region,
we also need to restore the stack pointer from the frame pointer if we
don't preserve stack space within the prologue/epilogue for outgoing
variables. Normally, just checking whether a variable-sized object is
present is enough, but we also don't preserve that space in the
prologue/epilogue when there are vector objects on the stack.
An example to show what happens:
```
try {
sp adjust for outgoing args. // 1. Sp changed.
func_call // 2. Exception raised
sp restore // Oh, not restored
} catch {
// 3. And now we are here.
}
// 4. Prepare to return: restore the return address from the stack, but... sp is wrong.
// 5. Screw up!
```
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D126861
Adds with immediates in the range [-4096, -2049] or [2048, 4095] get
converted to two ADDIs. Teach RISCVMergeBaseOffset to recognize this
pattern as well.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D126843
LUI+ADDIW always produces a simm32. This allows us to always
fold it into a global offset.
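A small model of why this holds (illustrative, not compiler code):
```
#include <assert.h>
#include <stdint.h>

/* ADDIW sign-extends the low 32 bits of its sum, so any LUI+ADDIW
   result fits in a signed 32-bit immediate. */
static int64_t lui(uint32_t imm20) {
  return (int64_t)(int32_t)(imm20 << 12);
}

static int64_t addiw(int64_t rs1, int32_t simm12) {
  return (int64_t)(int32_t)((uint32_t)rs1 + (uint32_t)simm12);
}

int main(void) {
  for (uint32_t hi = 0; hi < (1u << 20); hi += 4093) {
    int64_t v = addiw(lui(hi), -1365);
    assert(v >= INT32_MIN && v <= INT32_MAX);
  }
  return 0;
}
```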
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D126729
The abstract state used in the data flow should not know anything about the instructions which produced the abstract states. Instead, when comparing two states, we can simply use information about the machine instr at that time.
In the old design, basically any use of the instruction flags on the current state (as opposed to a "Require", aka upcoming, state) would be a bug. We don't seem to actually have any such bugs, but we can make this much more obvious with code structure.
Differential Revision: https://reviews.llvm.org/D126921