The transformation is beneficial because vmerge.vvm always needs a mask
operand but vadd.vi may not.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D133255
Unary shuffles such as <0,2,4,6,8,10,12,14> or <1,3,5,7,9,11,13,15>
where half the elements are returned, can be lowered using vnsrl.
SelectionDAGBuilder lowers such shuffles as a build_vector of
extract_elements since the mask has fewer elements than the source.
To fix this, I've enabled the extractSubvectorIsCheap hook to allow
DAGCombine to rebuild the shuffle using 2 extract_subvectors preceding
the shuffle.
I've been very conservative with the extractSubvectorIsCheap hook to minimize
test impact and match what we have test coverage for. This can be
improved in the future.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D133736
As the Zbs extension includes bext[i] for bit extract, we can
unconditionally return true from this hook. This hook causes the DAG
combiner to perform the following canonicalisation:
and (not (srl X, C)), 1 --> (and X, 1<<C) == 0
and (srl (not X), C), 1 --> (and X, 1<<C) == 0
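For illustration, a minimal C++ equivalent of the two forms (the function and
variable names here are hypothetical, not from the patch):
  // Testing a bit after shifting it down becomes an in-place mask test,
  // which Zbs can select with bexti.
  bool bit_is_clear(unsigned x, unsigned c) {
    return ((~(x >> c)) & 1) != 0;        // and (not (srl X, C)), 1
    // canonicalised: return (x & (1u << c)) == 0;
  }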
As simply changing the hook causes a codegen regression, this patch also
modifies a BEXTI pattern to match this canonicalised form.
As BSETINVMask is now used for BEXT as well as BSET and BINV, it has
been renamed to the more generic SingleBitSetMask.
There is one codegen change in bittest.ll for bittest_31_i64 (NOT+BEXTI
rather than NOT+SRLIW). This is neutral in terms of code quality.
Differential Revision: https://reviews.llvm.org/D131482
We use the saturating behavior of fcvt.wu.h/s/d but forgot to
take into account that fcvt.wu will sign extend the saturated
result. According to computeKnownBits a promoted FP_TO_UINT_SAT
is expected to zero extend the saturated value.
In many cases the upper bits aren't demanded so this wouldn't
be an issue. But if computeKnownBits caused an AND to be removed
it would be a bug.
This patch inserts an AND to zero the upper bits.
Unfortunately, this pessimizes code if we aren't able to tell if
the upper bits are demanded. To fix that we could custom type
promote the FP_TO_UINT_SAT with SEXT_INREG after it, but I'll
leave that for future work.
I haven't found a failure from this; I was revisiting the code to
add vector support and spotted it.
Differential Revision: https://reviews.llvm.org/D133746
For remainder:
If (1 << (BitWidth / 2)) % Divisor == 1, we can add the high and low halves
together and use a (BitWidth / 2) urem. If (BitWidth / 2) is a legal integer
type, this urem will be expanded by DAGCombiner using multiply by a magic
constant. We do have to take into account that adding the high and low halves
together can produce a carry, making it a ((BitWidth / 2) + 1)-bit number,
so we also need to add back in the carry from the first addition.
For division:
We can use the above trick to compute the remainder, subtract that
remainder from the dividend, then multiply by the multiplicative
inverse of the Divisor modulo (1 << BitWidth).
This is based on the section "Remainder by Summing Digits" in
Hacker's Delight.
The remainder trick is similar to a trick you may have learned for
determining if a decimal number is divisible by 3. You can add all the
digits together and see if the sum is divisible by 3. If you're not sure
if the sum is divisible by 3, you can add its digits together. This
can be repeated until you have a single decimal digit. If that digit
is 3, 6, or 9, then the original number is divisible by 3. This works
because 10 % 3 == 1.
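As a concrete sketch (mine, not code from the patch) of both tricks in C++,
take BitWidth = 64 and Divisor = 3, for which (1 << 32) % 3 == 1 and the
multiplicative inverse of 3 modulo 2^64 is 0xAAAAAAAAAAAAAAAB:
  #include <cstdint>
  // Remainder: x = hi*2^32 + lo and 2^32 % 3 == 1, so x % 3 == (hi + lo) % 3.
  // The hi+lo sum may carry into bit 32, so the carry is added back in before
  // the 32-bit urem (which DAGCombiner expands via a magic-constant multiply).
  uint32_t urem3(uint64_t x) {
    uint64_t sum = (uint32_t)x + (x >> 32);
    uint32_t folded = (uint32_t)sum + (uint32_t)(sum >> 32); // add back the carry
    return folded % 3;
  }
  // Division: subtract the remainder, then multiply by the multiplicative
  // inverse of 3 modulo 2^64 to get the exact quotient.
  uint64_t udiv3(uint64_t x) {
    return (x - urem3(x)) * 0xAAAAAAAAAAAAAAABULL;
  }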
gcc already does this same trick. There are additional tricks gcc
does for urem as well as srem, udiv, and sdiv that I plan to add in
future patches.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D130862
The default is to use extload which can become a zextload or
sextload if it is followed by an 'and' or sext_inreg.
Sometimes type legalization will introduce an 'and' from promoting
something like 'srl X, C' and a sext_inreg from a setcc. The
'and' could be freely folded with the promoted 'srl' by using srliw,
but the sext_inreg can't be folded into a compare. DAG combiner
will see both of these choices and may decide to fold the 'and'
instead of the 'sext_inreg'. This forces the sext_inreg to become
a sext.w.
By picking sextload in the type legalizer we take this choice away.
Looking at spec2006 compiled with Zba and Zbb, this appeared to be
a net reduction in lines of code in the objdump disassembly output.
This is similar to what we do with i32 add/sub/mul/shl in
type legalization where we always emit a sext_inreg.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D130397
Split out from D129178, this just adds the GlobalMerge tests (other than global-merge-minsize.ll which is testing a specific configuration of the pass when it's enabled) and exposes `-riscv-enable-global-merge` and //doesn't enable it by default//.
Note that the comment "// FIXME: Unify control over GlobalMerge." is copied from the Arm and AArch64 backends, which expose the same flag. Presumably the author is imagining some later refactoring that provides a target-independent flag.
Reviewed By: craig.topper, reames, hiraditya
Differential Revision: https://reviews.llvm.org/D130481
This patch introduces the priority analysis and the priority advisor,
the default implementation, and the scaffolding for introducing the
other implementations of the advisor.
Reviewed By: mtrofin
Differential Revision: https://reviews.llvm.org/D132835
This is a minimalist implementation which simply adds the extension (in the experimental namespace since it's not ratified), and wires up the setting of the required ELF header flag. Future changes will include codegen changes to exploit the stronger memory model.
This is intended to implement v0.1 of the proposed specification which can be found in Chapter 25 of https://github.com/riscv/riscv-isa-manual/releases/download/draft-20220723-10eea63/riscv-spec.pdf.
Differential Revision: https://reviews.llvm.org/D133239
This adds new VFCVT pseudoinstructions that take a rounding mode operand. A custom inserter is used to insert additional instructions to change FRM around the
VFCVT.
Some of this is borrowed from D122860, but takes a somewhat different direction. We may migrate to that patch, but for now I was trying to keep this as independent from
RVV intrinsics as I could.
A followup patch will use this approach for FROUND too.
Still need to fix the cost model.
Reviewed By: arcbbb
Differential Revision: https://reviews.llvm.org/D133238
We can use srliw to shift out the trailing bits and slli to shift
zeros back in. The sign extension of srliw will zero the upper 32 bits
since we will be shifting a 0 into bit 31.
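As a standalone illustration (the specific mask here is my example, not
necessarily a test case from the patch):
  #include <cstdint>
  // Keeping bits [31:4]: srliw shifts out the trailing bits and, since a 0 is
  // shifted into bit 31, its sign extension also zeroes the upper 32 bits;
  // slli then shifts zeros back into the low bits.
  uint64_t keep_bits_31_to_4(uint64_t x) {
    return x & 0xFFFFFFF0u;   // selectable as: srliw a0, a0, 4 ; slli a0, a0, 4
  }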
Don't require that the AND has one use and don't depend on
targetShrinkDemandedConstant turning C2 into 0xffffffff. Instead,
check that the constant is 0xffffffff after replacing any bits
that will be shifted out with 1s.
Another way to fix this might be to prevent SimplifyDemandedBits
from destroying the ANDI after type legalization using
targetShrinkDemandedConstant. That would prevent the CSE that created
this mess. targetShrinkDemandedConstant is currently only enabled after
LegalizeOps. A quick experiment shows we can't just change when it
runs; we would need to try a different heuristic for post-type
legalization.
SimplifyDemandedBits can zero the upper bits and targetShrinkDemandedConstant
isn't always able to recover them.
At least part of that may be because targetShrinkDemandedConstant
only runs in the last DAGCombine. It might be worth seeing what happens
if we move it to post-type legalization.
This builds on D132771 to invert (setlt 0, X) to (setlt X, 1) and
vice versa.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D132798
We can rewrite to (bnez (or/and (setne), Z)) if Z is 0/1.
Alternatively, we could canonicalize to (xor (or/and (setne), Z), 1)
even if there is no branch. The xor would not always get removed,
but it might enable other DeMorgan combines. I decided to be
conservative for this first patch and require the xor to be removed.
I have a couple other invertible setccs I will add in a follow up
patch.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D132771
Mostly just modeled after vp.fneg except there is a
"functional instruction" for fneg while fabs is always an
intrinsic.
Reviewed By: fakepaper56
Differential Revision: https://reviews.llvm.org/D132793
The motivation of this patch is to lower the IR pattern
(vp.merge mask, (add x, y), false, vl) to
(PseudoVADD_VV_<LMUL>_MASK false, x, y, mask, vl).
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D131841
This change enables the use of RISCV's variable length vector registers for fixed length vectors in the IR, and implicitly enables various IR transforms which generate fixed length vectors if legal (e.g. LoopVectorize). Specifically, this enables fixed length vectors which are known to be inbounds of the underlying variable hardware size.
For context, remember that the +V extension provides a minimum VLEN of 128. The embedded variants provide lower minimums. The analogy here is essentially vectorizing for SSE on a machine which may or may not include AVX2/AVX512. We won't get full utilization by default, but we will get some benefit. And of course, with an explicit mcpu we can vectorize to the exact target hardware.
The LV impact is mostly related to vectorizer robustness. In cases we haven't yet fully implemented scalable vectorization support, we can fall back to fixed length vectorization.
SLP has been disabled for now, even when fixed vectors are enabled. See a310637 and associated review. There are a few additional code quality issues which need to be worked through before turning SLP on would be reasonable.
Differential Revision: https://reviews.llvm.org/D131508
In patch D121183, the target ABI is read from the .ll file's target-abi
attribute and set in the RISCVAsmPrinter::emitFunctionEntryLabel
function. In https://github.com/llvm/llvm-project/issues/57242,
an ABI mismatch error can be caused by failing to call
RISCVAsmPrinter::emitFunctionEntryLabel to set the target ABI to the
correct one when the .ll file is empty or a module has no functions.
This patch moves the target-ABI setup to
RISCVAsmPrinter::emitStartOfAsmFile, making sure every .ll file and
every module in LTO reads the target ABI from the module flag and sets it,
with or without functions.
Signed-off-by: xiaojing.zhang <xiaojing.zhang@xcalibyte.com>
Signed-off-by: jianxin.lai <jianxin.lai@xcalibyte.com>
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D132204
SimplifyDemandedBits tries to aggressively turn xor immediates into -1
to match a 'not' instruction. In this case, because X is a boolean, the
upper bits of (xor X, 1) are known to be 0. Because this is an AND
instruction, that means those bits aren't demanded from the other
operand, and thus SimplifyDemandedBits can turn (xor Y, 1) to (not Y).
We need to detect that this has happened to enable the DeMorgan
optimization. To do this we allow one of the xors to use -1 when
the outer operation is And.
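A minimal check (my own, assuming x is 0/1-valued) of the identity this
relies on:
  #include <cassert>
  // With x known to be 0 or 1, only bit 0 of (x ^ 1) can be set, so only bit 0
  // of the other AND operand is demanded, and (y ^ 1) agrees with ~y in bit 0.
  void check(unsigned x, unsigned y) {
    assert(x <= 1);
    assert(((x ^ 1u) & (y ^ 1u)) == ((x ^ 1u) & ~y));
  }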
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D132671
This optimizes xors that appear due to legalizing setge/setle which
require an xor with 1. This reduces the number of xors and may
allow the xor to fold with a beqz or bnez.
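For reference, a source-level example (mine) of where such an xor comes from:
  // RISC-V has no "set if >=" instruction, so a >= b is legalized as the
  // inverse of slt: (xor (setlt a, b), 1), i.e. slt followed by xori 1.
  // That xori is what this patch tries to fold away or into a beqz/bnez.
  bool is_ge(long a, long b) { return a >= b; }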
Differential Revision: https://reviews.llvm.org/D132614
Update `llvm/test/CodeGen/RISCV/branch-relaxation.ll` with
`update_llc_test_checks.py`, according to
https://reviews.llvm.org/D130560#3746417:
>>! In D130560#3746417, @luismarques wrote:
>>>! In D130560#3746379, @luismarques wrote:
>> The tests don't seem to have been properly updated with
>> `update_llc_test_checks.py`.
>> `llvm/test/CodeGen/RISCV/branch-relaxation.ll` contains RV64 RUN
>> lines but the corresponding CHECK lines are missing in
>> some functions.
>
> Looking more closely at this, I guess you tried to only include the
> `CHECK-RV64` and `CHECK-RV32` checks when relevant. That's a good
> instinct but I guess it goes a bit against how we normally use
> `update_llc_test_checks.py`. My understanding of the trade-off of
> using that tool is that the test updates are much easier, even if
> sometimes the CHECKs aren't as tight as something more tailormade.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D132625
This issue was found by building llvm-test-suite with `-Oz`: the linker
complains `dangerous relocation: %pcrel_lo missing matching %pcrel_hi`.
It turns out we outlined the pcrel-lo instruction but left the pcrel-hi
behind. That isn't a problem in general, but here the two end up in
different sections, and a pcrel-hi/pcrel-lo pair (e.g. AUIPC+ADDI) *MUST*
be placed in the same section due to the implementation.
Outlined functions are placed in .text, but the source functions are
placed in .text.<function-name> if function sections are enabled or the
function has a `comdat` attribute.
There are a few possible solutions for this issue:
1. Always disallow instructions with pcrel-lo flags.
2. Only disallow instructions with pcrel-lo flags when function sections are
enabled or the function has a `comdat` attribute.
3. Check whether the corresponding pcrel-hi instruction is also included in
the outlining candidate sequence, and only allow the pcrel-lo when it is.
The first option is the most conservative and might lose some optimization
opportunities; the second keeps those opportunities; and the last is hard
to implement and has no benefit, since the pcrel-hi uses a different label
even when accessing the same symbol.
Using a custom section name can also cause this problem, but that is already
filtered out by RISCVInstrInfo::isFunctionSafeToOutlineFrom.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D132528
In the branch relaxation pass, `j` instructions with an offset over 1MiB
will be relaxed to `jump` pseudo-instructions.
This patch allocates a stack slot for functions with a size greater than
1MiB. If the register scavenger cannot find a scratch register for
`jump`, spill a register to the slot before the jump and restore it
after the jump.
.mbb:
   foo
   j    .dest_bb
   bar
   bar
   bar
.dest_bb:
   baz
The above code will be relaxed to the following code.
.mbb:
   foo
   sd   s11, 0(sp)
   jump .restore_bb, s11
   bar
   bar
   bar
   j    .dest_bb
.restore_bb:
   ld   s11, 0(sp)
.dest_bb:
   baz
Depends on D129999.
Reviewed By: StephenFan
Differential Revision: https://reviews.llvm.org/D130560
Similar to D132211, we can optimize x <s -1 ? x : -1 -> x <s 0 ? x : -1;
the two only differ at x == -1, where both select -1.
Also improve the unsigned case from D132211 to use x != 0, which
will give a bnez instruction that might be compressible.
Differential Revision: https://reviews.llvm.org/D132252
If x == 1,
x > 1 ? x : 1 returns x, which is also 1, and
x > 0 ? x : 1 returns 1.
This reduces the number of instructions that load the immediate 1.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D132211
Following the comment thread of D117235, I added checks for the widening + splitting case, which also causes a split where one of the resulting vectors is empty. Due to the same issues described in that thread, the `fixed-vectors-strided-store.ll` test is missing the widening + splitting case, while the same case in the `strided-vpload.ll` test requires manually splitting the loaded vector.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D121784
This patch enables expansion or custom lowering for some integer
condition codes so that any xori that is needed is created before
the last DAG combine to enable optimization.
I've seen cases where we end up with
(or (xori (setcc), 1), (xori (setcc), 1)) which we would ideally
convert to (xori (and (setcc), (setcc)), 1). This patch doesn't
accomplish that yet, but it should allow us to add DAG
combines as follow ups. Example https://godbolt.org/z/Y4qnvsq1b
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D131729
Use it to fix a bug in the fceil/ffloor lowerings. We were
setting the passthru to IMPLICIT_DEF before and using a mask
agnostic policy. This means where the incoming bits in
the mask were 0 they could be anything in the outgoing mask. We
want those bits in the outgoing mask to be 0. This means we need to
pass the input mask as the passthru.
This generates worse code because we are unable to allocate the
v0 register to the output due to an earlyclobber constraint. We
probably need a special TIED pseudoinstruction and probably custom
isel since you can't use V0 twice in the input pattern.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D132058
I have a couple data points that some microarchitectures prefer
the immediate 0 over x0. Does anyone know of microarchitectures
where the opposite is true?
Unfortunately, this is different than the vncvt.x.x.w alias
from the spec. Perhaps the alias was poorly chosen if x0 isn't
as optimal as immediate 0 on all microarchitectures.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D132041
The current llvm.prefetch intrinsic docs state "The rw, locality and
cache type arguments must be constant integers."
This change:
- Makes arg 3 (cache type) an ImmArg
- Improves the verifier error messages to reference the incorrect
argument.
- Fixes two tests which contradict the docs.
This is needed as the lowering to GlobalISel is different for ImmArgs
compared to other constants. The non-ImmArgs create a G_CONSTANT MIR
instruction, while for ImmArgs the constant is put directly on the
intrinsic's MIR instruction as an immediate.
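For context, a typical source form that lowers to llvm.prefetch (this snippet
is illustrative, not taken from the patch; the cache-type operand is typically
filled in by the frontend as the intrinsic's fourth argument):
  // __builtin_prefetch requires rw and locality to be constant integers,
  // mirroring the intrinsic's ImmArg requirements.
  void prefetch_for_read(const char *p) {
    __builtin_prefetch(p, /*rw=*/0, /*locality=*/3);
  }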
Differential Revision: https://reviews.llvm.org/D132042
Extracted from D131729 where we handled C==0. It's now generalized
to more constants.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D132000
If the success value of a cmpxchg is used in a branch, the expanded
cmpxchg sequence ends up with a redundant branch-to-branch (as the
backend atomics expansion happens as late as possible, passes to
optimise such cases have already run). This patch identifies this case
and avoids it when expanding the cmpxchg.
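A typical source pattern that produces a branch on the cmpxchg success value
(this example is mine, not from the patch):
  #include <cstdint>
  // The boolean result of the compare-exchange feeds a conditional branch;
  // after late atomics expansion this could become a branch-to-branch.
  int32_t update_or_current(int32_t *p, int32_t expected, int32_t desired) {
    if (__atomic_compare_exchange_n(p, &expected, desired, /*weak=*/false,
                                    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
      return desired;
    return expected;  // on failure, holds the value observed in *p
  }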
Note that a similar optimisation is possible for a BEQ on the cmpxchg
success value. As it's hard to imagine a case where real-world code would
do that, this patch doesn't handle it.
Differential Revision: https://reviews.llvm.org/D130192
(sub C, (xori X, 1)) can be folded to (add X, C-1) if X is 0 or 1:
since (xori X, 1) is 1-X for such X, C - (1 - X) = X + (C - 1).
This avoids the xori and in some cases removes an instruction
needed to materialize the constant.
This time using N1 instead of N0 since N1 points to the original
setcc. This now affects scheduling as I expected.
Original commit message:
We change seteq<->setne but it doesn't change the semantics
of the setcc. We should keep the original debug location. This is
consistent with visitXor in the generic DAGCombiner.
While (sub 0, X) can use x0 for the 0, I believe (add X, -1) is
still preferable: (addi X, -1) can be compressed, while sub with x0 on
the LHS is never compressible.
In these test cases we do the transform, but the immediate is too
large to form an ADDI so it didn't save any instructions.
If the constant is opaque or has additional users we shouldn't do
the transform if it doesn't form an ADDI.
This introduces an xori in some cases. I don't believe that was the
intention of the original patch. It was an accident because
no-NaNs FP equality compares also use SETEQ/SETNE.
Also pass the correct type to getSetCCInverse.
In these tests we had (sub C, (seteq X, Y)), which we converted to
(add (setne X, Y), C-1). We don't have an FNE compare instruction,
so this created an XORI to invert an FEQ instruction.
This might be a good idea since it can save a constant materialization,
but does not appear to be the intention of the original patch.
We have a good selection of W instructions, so promoting a truncated
value back to i64 is often free.
This appears to be a net code size reduction on SPECINT2006.
This has been split from D130397 as one of the patches needed to
complete that.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D131819
These are guaranteed not to create undef/poison (although they may pass through) - the associated ISD::VALUETYPE node is also guaranteed never to generate poison
(setcc x, y, eq/neq) is lowered to seqz or snez, which sets rd to 0/1.
addi is used to process the immediate, which can save instructions for
loading the immediate.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D131471
Including patterns to select addiw if only the lower 32 bits are used.
I'm not excited about adding this many patterns. I'm looking at whether
we can create the xori during lowering and move the ineg patterns to
DAGCombiner.
The patch uses a peephole method to fold merge.vvm and unmasked intrinsics
into masked intrinsics. Using a peephole instead of tablegen patterns avoids
large amounts of auto-generated code.
Note: The patch ignores segment loads since I don't know how to test them.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D130442
Prior to this patch, libcalls inserted by the SelectionDAG legalizer
could never be tailcalled. The eligibility of libcalls for tail calling
is partly determined by checking TargetLowering::isInTailCallPosition
and comparing the return types of the libcall and the caller.
isInTailCallPosition in turn calls TargetLowering::isUsedByReturnOnly
(which always returns false if not implemented by the target).
This patch provides a minimal implementation of
TargetLowering::isUsedByReturnOnly - enough to support tail calling
libcalls on hard float ABIs. Soft-float ABIs are left for a follow on
patch. libcall-tail-calls.ll also shows missed opportunities to tail
call integer libcalls, but this is due to issues outside of
the isUsedByReturnOnly hook.
Differential Revision: https://reviews.llvm.org/D131087
This is currently exercising scalarization code path; with vectors enabled, we hit a different code path. Explicitly exercise both so that both configurations have testing.
This adds a +forced-atomics target feature with the same semantics
as +atomics-32 on ARM (D130480). For RISCV targets without the +a
extension, this forces LLVM to assume that lock-free atomics
(up to 32/64 bits for riscv32/64 respectively) are available.
This means that atomic load/store are lowered to a simple load/store
(and fence as necessary), as these are guaranteed to be atomic
(as long as they're aligned). Atomic RMW/CAS are lowered to __sync
(rather than __atomic) libcalls. Responsibility for providing the
__sync libcalls lies with the user (for privileged single-core code
they can be implemented by disabling interrupts). Code built with
+forced-atomics and code built with -forced-atomics are not ABI compatible
if atomic variables cross the ABI boundary.
For context, the difference between __sync and __atomic is that the
former are required to be lock-free, while the latter requires a
shared global lock provided by a shared object library. See
https://llvm.org/docs/Atomics.html#libcalls-atomic for a detailed
discussion on the topic.
This target feature will be used by Rust's riscv32i target family
to support the use of atomic load/store without atomic RMW/CAS.
Differential Revision: https://reviews.llvm.org/D130621
This patch emits table lookup in expandCTTZ.
Context -
https://reviews.llvm.org/D113291 transforms a set of IR instructions into
the cttz intrinsic, but some targets do not support CTTZ or
CTLZ. Hence, I generate a table lookup in TargetLowering::expandCTTZ().
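The lookup is the classic de Bruijn-multiplication scheme; a 32-bit sketch in
C++ (the exact table and constant the patch emits may differ):
  #include <cstdint>
  // Count trailing zeros without a CTTZ/CTLZ instruction: isolate the lowest
  // set bit, multiply by a de Bruijn constant, and use the top 5 bits of the
  // product as a table index. Assumes x != 0.
  static const uint8_t CTTZTable[32] = {
      0,  1,  28, 2,  29, 14, 24, 3,  30, 22, 20, 15, 25, 17, 4,  8,
      31, 27, 13, 23, 21, 19, 16, 7,  26, 12, 18, 6,  11, 5,  10, 9};
  unsigned cttz32(uint32_t x) {
    return CTTZTable[((x & -x) * 0x077CB531u) >> 27];
  }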
Differential Revision: https://reviews.llvm.org/D128911
FoldConstantArithmetic can fold constant vectors hidden behind bitcasts (e.g. vXi64 -> v2Xi32 on 32-bit platforms), but currently bails if either vector contains undef elements. These undefs can often occur due to SimplifyDemandedBits/VectorElts calls recognising that the upper bits are often unnecessary (e.g. funnel-shift/rotate implicit-modulo and AND masks).
This patch adds a basic 'FoldValueWithUndef' handler that will attempt to constant fold if one or both of the ops are undef - so far this just handles the AND and MUL cases where we always fold to zero.
The RISCV codegen increase is interesting - it looks like the BUILD_VECTOR lowering was loading a constant pool entry but now (with all elements defined constant) it can materialize the constant instead?
Differential Revision: https://reviews.llvm.org/D130839
If we're adding a constant that can't use addi we try a few tricks,
one of which is using li+sh3add. We should not do this if lui+add
would work, for example when adding 8192. Using sh3add prevents folding
a sext.w to form addw, thus increasing instruction count.
This fixes a bug reported privately by @craig.topper. Here's an example which illustrates the problem:
  vsetivli a1, a0, e32, m1, ta, mu    # both DefInfo and PrevInfo
  vsetivli a2, a1, e32, m4, ta, mu
With the unsound result being:
  vsetivli a1, a0, e32, m1, ta, mu
  vsetivli a2, a0, e32, m4, ta, mu
Consider the case where this is running on a machine with VLEN=512. For this case, the VLMAXs are 16 and 64 respectively.
Consider a0 = 33. The correct result is: a1 = 16 and a2 = 16.
After the unsound optimization: a1 = 16 and a2 = 33.
This particular example used VLMAXs which differed by more than one power of two. With a difference of only one power of two, there's another form of this bug which involves the AVL < 2 x VLMAX special case, but that one is more complicated to construct, as many examples turn out to be accidentally sound.
This patch takes the approach of simply removing the unsound optimization, but there are multiple sound sub-cases of it. I plan to return to at least a couple of them, but figured it was cleaner to remove the unsound optimization (for ease of backporting), and then review the new optimizations on their own.
Differential Revision: https://reviews.llvm.org/D131264
When folding (sra (add (shl X, 32), C1), 32 - C) -> (shl (sext_inreg (add X, C1), i32), C)
it's possible that the add is used by multiple sras. We should
allow the combine if all the SRAs will eventually be updated.
After transforming all of the sras, the shls will share a single
(sext_inreg (add X, C1), i32).
This pattern occurs if an sra with 32 is used as index in multiple
GEPs with different scales. The shl from the GEPs will be combined
with the sra before we get a chance to match the sra pattern.
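A hypothetical C++ source (names and the constant are mine) for this pattern:
  #include <cstdint>
  // The sign-extended 32-bit index (sra (add (shl X, 32), C1), 32) is used by
  // GEPs with element sizes 4 and 8; each GEP's shl (by 2 and by 3) combines
  // with the sra first, leaving multiple sras fed by the same add.
  int64_t sum_at(int32_t *a, int64_t *b, int64_t x) {
    int64_t idx = (int32_t)(x + 1);
    return a[idx] + b[idx];
  }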
When folding (sra (add (shl X, 32), C1), 32 - C) -> (shl (sext_inreg (add X, C1), i32), C),
ignore the use count on the (shl X, 32).
The sext_inreg after the transform is free. So we're only making
2 new instructions, the add and the shl. So we only need to be
concerned with replacing the original sra+add. The original shl
can have other uses. This helps if there are multiple different
constants being added to the same shl.
D129980 converts (seteq (i64 (and X, 0xffffffff)), C1) into
(seteq (i64 (sext_inreg X, i32)), C1). If bit 31 of X is 0, it
will be turned back into an 'and' by SimplifyDemandedBits which
can cause an infinite loop.
To prevent this, check if bit 31 is 0 with computeKnownBits before
doing the transformation.
Fixes PR56905.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D131113
Although there's good coverage of the libcalls within llvm/test/CodeGen,
it's useful to have tests for all ABI and hard/soft-float combinations
in order to properly test the logic that enables libcall tail calls
(which will be implemented in a follow-up patch).
An unnecessary sext.w is generated when masking the result of the
riscv_masked_cmpxchg_i64 intrinsic. Implementing handling of the
intrinsic in ComputeNumSignBitsForTargetNode allows it to be removed.
Although this isn't a particularly important optimisation, removing the
sext.w simplifies implementation of an additional cmpxchg-related
optimisation in D130192.
Although I can't produce a test with different codegen for the other
atomics intrinsics, these are added as well for completeness.
Differential Revision: https://reviews.llvm.org/D130191
It's possible we have:
  lui a0, %hi(sym)
  addi a0, %lo(sym)
  addi a0, <offset1>
  lw a0, <offset2>(a0)
We want to arrive at
  lui a0, %hi(sym+offset1+offset2)
  lw a0, %lo(sym+offset1+offset2)
We currently fail to do this because we only consider loads/stores
if we didn't find any arithmetic.
This patch splits arithmetic folding and load/store folding into
two separate phases. The load/store folding can no longer assume
the offset in hi/lo is 0 so we must combine the offsets. I've applied
the same simm32 limit that we applied in the arithmetic folding.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D130931
At least based on the lit tests, the coalescer sometimes fails to
propagate the copy from X0 into the branch instruction. This patch
does it manually during isel. The majority of the changes are from
the select patterns.
Some of the changes are just register allocation changes. Only
the Select change affects whether a b*z instruction is generated
in the tests. I changed the branch pattern for consistency.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D130809
Builds upon D123264, adding support for merging the low part of the LLA
address into the load/store instruction offsets.
Differential Revision: https://reviews.llvm.org/D123265
Expand load address pseudo-instructions earlier (pre-ra) to allow follow-up
patches to fold the addi of PseudoLLA instructions into the immediate
operand of load/store instructions.
Differential Revision: https://reviews.llvm.org/D123264
For constants in the range [-2047, 2048] we use addi. If the constant
is -2048 we can use xori. If we don't match this explicitly, we'll
emit an LI for the -2048 followed by an XOR.