Add scheduling resource for vector segment load/store instructions in D128886.
I miss to add scheduling resource for pseudo segment instructions.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D130222
This time using N1 instead of N0 since N1 points to the original
setcc. This now affects scheduling as I expected.
Original commit message:
We change seteq<->setne but it doesn't change the semantics
of the setcc. We should keep original debug location. This is
consistent with visitXor in the generic DAGCombiner.
We change seteq<->setne but it doesn't change the semantics
of the setcc. We should keep original debug location. This is
consistent with visitXor in the generic DAGCombiner.
While (sub 0, X) can use x0 for the 0, I believe (add X, -1) is
still preferrable. (addi X, -1) can be compressed, sub with x0 on
the LHS is never compressible.
This introduce an xori in some cases. I don't believe it was the
intention of the original patch. This was an accident because
nonan FP equality compares also use SETEQ/SETNE.
Also pass the correct type to getSetCCInverse.
-Rename variable NnzC -> N0C.
-Use SelectionDAG::getSetCC to reduce code.
-Use SDValue::getOperand instead of operator-> and SDNode::getOperand.
Initial steps to add another similar combine to this code.
We have a good selection of W instructions, so promoting a truncated
value back to i64 is often free.
This appears to be a net code size reduction on SPECINT2006.
This has been split from D130397 as one of the patches needed to
complete that.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D131819
(setcc x, y, eq/neq) are seqz, snez that set rd = 0/1.
addi is used to process immediate, which can save instructions for load immediate.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D131471
Including patterns to select addiw if only the lower 32 bits are used.
I'm not excited about adding this many patterns. I'm looking at whether
we can create the xori during lowering and move the ineg patterns to
DAGCombiner.
The patch uses peephole method to fold merge.vvm and unmasked intrinsics to
masked intrinsics. Using peephole intead of tablegen patterns is to avoid large
auto gnerated code.
Note: The patch ignores segment loads since I don't know how to test them.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D130442
Commit 8922adf646 recently made JITTargetMachineBuilder honor the
hasJIT property of the target. LLVM supports just-in-time compilation
on RISC-V, so set the flag.
Differential Revision: https://reviews.llvm.org/D131617
As extending from or truncating to mask vector do not use the same instructions as the normal cast, this path changed it to 2 which is the number of instructions we used.
Differential Revision: https://reviews.llvm.org/D131552
Prior to this patch, libcalls inserted by the SelectionDAG legalizer
could never be tailcalled. The eligibility of libcalls for tail calling
is is partly determined by checking TargetLowering::isInTailCallPosition
and comparing the return type of the libcall and the calleer.
isInTailCallPosition in turn calls TargetLowering::isUsedByReturnOnly
(which always returns false if not implemented by the target).
This patch provides a minimal implementation of
TargetLowering::isUsedByReturnOnly - enough to support tail calling
libcalls on hard float ABIs. Soft-float ABIs are left for a follow on
patch. libcall-tail-calls.ll also shows missed opportunities to tail
call integer libcalls, but this is due to issues outside of
the isUsedByReturnOnly hook.
Differential Revision: https://reviews.llvm.org/D131087
The cost of convert from or to mask vector is different from other cases. We could not use PowDiff to calculate it. This patch set it to 3 as we use 3 instruction to make it.
Differential Revision: https://reviews.llvm.org/D131149
This adds a +forced-atomics target feature with the same semantics
as +atomics-32 on ARM (D130480). For RISCV targets without the +a
extension, this forces LLVM to assume that lock-free atomics
(up to 32/64 bits for riscv32/64 respectively) are available.
This means that atomic load/store are lowered to a simple load/store
(and fence as necessary), as these are guaranteed to be atomic
(as long as they're aligned). Atomic RMW/CAS are lowered to __sync
(rather than __atomic) libcalls. Responsibility for providing the
__sync libcalls lies with the user (for privileged single-core code
they can be implemented by disabling interrupts). Code using
+forced-atomics and -forced-atomics are not ABI compatible if atomic
variables cross the ABI boundary.
For context, the difference between __sync and __atomic is that the
former are required to be lock-free, while the latter requires a
shared global lock provided by a shared object library. See
https://llvm.org/docs/Atomics.html#libcalls-atomic for a detailed
discussion on the topic.
This target feature will be used by Rust's riscv32i target family
to support the use of atomic load/store without atomic RMW/CAS.
Differential Revision: https://reviews.llvm.org/D130621
The floating point stores use a different register class, it
probably makes sense to have a different SchedRead.
Reviewed By: monkchiang
Differential Revision: https://reviews.llvm.org/D131379
If we're adding a constant that can't use addi we try a few tricks,
one of which is using li+sh3add. We should not do this if lui+add
would work. For example adding 8192. Using sh3add prevents folding
a sext.w to form addw, thus increasing instruction count.
This fixes a bug reported privately by @craig.topper. Here's an example which illustrates the problem:
vsetivli a1, a0, e32, m1, ta, mu # both DefInfo and PrevInfo
vsetivli a2, a1, e32, m4, ta, mu
With the unsound result being:
vsetivli a1, a0, e32, m1, ta, mu
vsetivli a2, a0, e32, m4, ta, mu
Consider the case where this is running on a machine with VLEN=512,. For this case, the VLMAXs are 16 and 64 respectively.
Consider for a0 = 33. The correct result is: a1 = 16, and a2 = 16
After the unsound optimization: a1 = 16 and a2 = 33
This particular example used VLMAXs which differed by more than a power of two. With a difference of only one power of two, there's another form of this bug which involves the AVL < 2 x VLMAX special case, but that ones more complicated to construct as many examples turn out accidentally sound.
This patch takes the approach of simply removing the unsound optimization, but there are multiple sound sub-cases of it. I plan to return to at least a couple of them, but figured it was cleaner to remove the unsound optimization (for ease of backporting), and then review the new optimizations on their own.
Differential Revision: https://reviews.llvm.org/D131264
When folding (sra (add (shl X, 32), C1), 32 - C) -> (shl (sext_inreg (add X, C1), i32), C)
it's possible that the add is used by multiple sras. We should
allow the combine if all the SRAs will eventually be updated.
After transforming all of the sras, the shls will share a single
(sext_inreg (add X, C1), i32).
This pattern occurs if an sra with 32 is used as index in multiple
GEPs with different scales. The shl from the GEPs will be combined
with the sra before we get a chance to match the sra pattern.
When folding (sra (add (shl X, 32), C1), 32 - C) -> (shl (sext_inreg (add X, C1), C)
ignore the use count on the (shl X, 32).
The sext_inreg after the transform is free. So we're only making
2 new instructions, the add and the shl. So we only need to be
concerned with replacing the original sra+add. The original shl
can have other uses. This helps if there are multiple different
constants being added to the same shl.
In RVV, we use vwredsum.vs and vwredsumu.vs for vecreduce.add(ext(Ty A)) if the result type's width is twice of the input vector's SEW-width. In this situation, the cost of extended add reduction should be same as single-width add reduction. So as the vector float widenning reduction.
Differential Revision: https://reviews.llvm.org/D129994
D129980 converts (seteq (i64 (and X, 0xffffffff)), C1) into
(seteq (i64 (sext_inreg X, i32)), C1). If bit 31 of X is 0, it
will be turned back into an 'and' by SimplifyDemandedBits which
can cause an infinite loop.
To prevent this, check if bit 31 is 0 with computeKnownBits before
doing the transformation.
Fixes PR56905.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D131113
This patch ensures consistency in the construction of FP_ROUND nodes
such that they always use ISD::TargetConstant instead of ISD::Constant.
This additionally fixes a bug in the AArch64 SVE backend where patterns
were matching against TargetConstant nodes and sometimes failing when
passed a Constant node.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D130370
An unnecessary sext.w is generated when masking the result of the
riscv_masked_cmpxchg_i64 intrinsic. Implementing handling of the
intrinsic in ComputeNumSignBitsForTargetNode allows it to be removed.
Although this isn't a particularly important optimisation, removing the
sext.w simplifies implementation of an additional cmpxchg-related
optimisation in D130192.
Although I can't produce a test with different codegen for the other
atomics intrinsics, these are added as well for completeness.
Differential Revision: https://reviews.llvm.org/D130191
This used to print from the ADDI where the operand number was
correct. It recently changed to print from the LUI or AUIPC which
needs to use operand 1 instead of 2.
This shows up as a crash with -debug.
I think these pseudos will exist when the post-RA scheduler runs
so they should have sched classes.
Reviewed By: monkchiang
Differential Revision: https://reviews.llvm.org/D130945
* TargetFrameLowering has a TransientStackAlignment field that "returns
the number of bytes to which the stack pointer must be aligned at all
times, even between calls.
* As explained in the [RISC-V calling
convention](https://github.com/riscv-non-isa/riscv-elf-psabi-doc/blob/master/riscv-cc.adoc),
the stack pointer must remain fully aligned throughout execution for
compliant code. This is important for embedded targets that might avoid
realigning the stack pointer for interrupt service routines. Systems
running full OSes may always realign the stack anyway.
* TransientStackAlignment is used in estimateStackSize in
MachineFrameInfo and in PEI::calculateFrameObjectOffsets.
* estimateStackSize is only used in the RISC-V backend for scavenging
slots. It may be possible to craft a function where the difference
is observable, but it wouldn't be a meaningful test.
* calculateFrameObjectOffsets makes use of TransientStackAlignment,
but then sets the stack alignment to the max of that alignment and
MaxAlign, which is unconditionally set to 16 in
RISCVFrameLowering::processFunctionBeforeFrameFinalized
* I've changed this logic to only set MaxAlign if there are RVV frame
objects. There should be no functional change here for either RVV
targets (MaxAlign is set as before) or non-RVV targets
(TransientStackAlign is now 16 anyway).
Differential Revision: https://reviews.llvm.org/D130068
It's possible we have:
lui a0, %hi(sym)
addi a0, %lo(sym)
addi a0, <offset1>
lw a0, <offset2>(a0)
We want to arrive at
lui a0, %hi(sym+offset1+offset2)
lw a0, %lo(sym+offset1+offset2)
We currently fail to do this because we only consider loads/stores
if we didn't find any arithmetic.
This patch splits arithmetic folding and load/store folding into
two separate phases. The load/store folding can no longer assume
the offset in hi/lo is 0 so we must combine the offsets. I've applied
the same simm32 limit that we applied in the arithmetic folding.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D130931
At least based on the lit tests, the coalescer sometimes fails to
propagate the copy from X0 into the branch instruction. This patch
does it manually during isel. The majority of the changes are from
the select patterns.
Some of the changes are just register allocation changes. Only
the Select change affects the whether a b*z instruction is generated
in the tests. I changed the branch pattern for consistency.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D130809
The only iterator we're holding points to HiLUI and we never
delete that so I think it is safe to delete everything else
immediately.
I want to split detectAndFoldOffset into two phases. First, combine
LUI+ADDI with any ADD/ADDI/SHXADD that comes after it. This may
open opportunities to fold the ADDI from the LUI+ADDI into a
load/store address. So the load/store folding should run as a
second phase even if the ADD/ADDI/SHXADD made changes.
In order to do this we need to eagerly delete instructions in the
first phase so that we don't have dead users of the LUI+ADDI
when we start the second phase.
Patches to split the phases will come later.
Reviewed By: asb, luismarques
Differential Revision: https://reviews.llvm.org/D130119
Builds upon D123264, adding support for merging the low part of the LLA
address into the load/store instruction offsets.
Differential Revision: https://reviews.llvm.org/D123265
Expand load address pseudo-instructions earlier (pre-ra) to allow follow-up
patches to fold the addi of PseudoLLA instructions into the immediate
operand of load/store instructions.
Differential Revision: https://reviews.llvm.org/D123264
This adds a merge operand to all of the binary _VL nodes. Including
integer and widening. They all share multiclasses in tablegen
so doing them all at once was easiest.
I plan to use FADD_VL in an upcoming patch. The rest are just for
consistency to keep tablegen working.
This does reduce the isel table size by about 25k so that's nice.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D130816