Change the cost model to lower a = b * C where C = -(2^n - 2^m) to
  lsl w8, w0, m
  sub w0, w8, w0, lsl n
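As a worked example (a sketch I'm adding, not part of the patch), take n = 4, m = 1, so C = -(16 - 2) = -14; relying on unsigned wraparound, the lowering computes b*2 - b*16:

  #include <stdint.h>
  // b * -14, where -14 = -(2^4 - 2^1): n = 4, m = 1
  uint32_t mul_minus14(uint32_t b) {
    return (b << 1) - (b << 4); // lsl w8, w0, 1; sub w0, w8, w0, lsl 4
  }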
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D134934
When returning from a function with both SCS and PAC-RET enabled, we need to
authenticate the return address from the stack and then load from the SCS,
but this was happening in the reverse order when RETA[AB] were being used.
Fix it by disabling the use of RETA[AB] when SCS is enabled.
Fixes PR58072.
Differential Revision: https://reviews.llvm.org/D134931
This is a port of an existing optimization in AArch64 ISelLowering, handling
a case when the same input vector can be used for both ext inputs.
Differential Revision: https://reviews.llvm.org/D134891
Decomposing the constant 14 can be separated out from D132322.
Change the cost model to lower a = b * C where C = 2^n - 2^m to
  lsl w8, w0, n
  sub w0, w8, w0, lsl m
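For example (a sketch under the same scheme), the constant 14 from D132322 is 2^4 - 2^1, so n = 4, m = 1:

  #include <stdint.h>
  // b * 14, where 14 = 2^4 - 2^1: n = 4, m = 1
  uint32_t mul14(uint32_t b) {
    return (b << 4) - (b << 1); // lsl w8, w0, 4; sub w0, w8, w0, lsl 1
  }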
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D134706
A few issues:
1. There was no legalizer test for G_PTRTOINT
2. Same clamping issue as in many other opcodes
3. AArch64 pointers can only be 64-bit, so in reality we always have to trunc
or extend with any size other than p0 anyway.
This seems to actually produce more correct selection for narrow types as well.
Differential Revision: https://reviews.llvm.org/D107588
This is intended to be equivalent to the s32 + s64 cases in
AArch64TargetLowering::LowerFCOPYSIGN.
Widen everything and then use G_BIT + a mask to handle the actual copysign
operation. Then, narrow back down to s32/s64.
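For reference, the scalar bit trick being widened here is the usual mask-based copysign; a minimal C sketch of the s32 case (my illustration, not code from the patch):

  #include <stdint.h>
  #include <string.h>
  // keep the magnitude bits of x, insert the sign bit of y; the vector
  // form does the same select-by-mask that G_BIT/BIT performs
  float copysign_sketch(float x, float y) {
    uint32_t xi, yi;
    memcpy(&xi, &x, sizeof xi);
    memcpy(&yi, &y, sizeof yi);
    xi = (xi & 0x7fffffffu) | (yi & 0x80000000u);
    memcpy(&x, &xi, sizeof x);
    return x;
  }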
I wasn't sure about what the best/most canonical INSERT_SUBREG-selectable
pattern is. I chose G_INSERT_VECTOR_ELT + an undef vector because it produces
reasonably okay codegen. (It doesn't produce INSERT_SUBREG right now though.)
If there's a better way to do this then I'm happy to change it.
We also have a couple of codegen deficiencies with how we emit vector constants
right now. (We need a GISel equivalent to the tryAdvSIMDModImm64 stuff)
Differential Revision: https://reviews.llvm.org/D108725
This is necessary for custom-legalizing G_FCOPYSIGN.
This is equivalent to the BIT instruction (bitwise insert if true).
Add selection testcases for imported patterns.
Differential Revision: https://reviews.llvm.org/D108714
For gathers which load 8- and 16-bit data and then use that data
as an index, the index can be extended to 32 bits instead of
64 bits.
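A hedged C sketch of the kind of loop this targets (names are illustrative):

  #include <stdint.h>
  // the loaded 8-bit values are reused as gather indices, so extending
  // them to 32 bits instead of 64 is sufficient
  void gather_i8(float *dst, const float *base, const uint8_t *idx, int n) {
    for (int i = 0; i < n; i++)
      dst[i] = base[idx[i]];
  }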
Differential Revision: https://reviews.llvm.org/D130692
This reverts commit 1c62af3e23.
The commit causes the test below to fail. Revert for now to get the bots
back to green.
Failing test:
llvm/test/Transforms/LoopVectorize/AArch64/masked-op-cost.ll
Currently, non-temporal loads over 256 bits are broken up inefficiently. For
example, `v17i32` gets broken into 2 128-bit loads. It is better if we can use
256-bit loads instead.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D133421
These instructions are flag-setting, so the ptest is redundant, but the
TableGen class wasn't setting the element size for the predicate, causing
the checks in AArch64InstrInfo::optimizePTestInstr to fail.
The commit D120104 enabled FeatureFuseAdrpAdd for -mcpu=generic,
allowing the linker to relax adrp;add pairs where possible. D132075
extended that to neoverse-n1; this patch extends it to all other Cortex
and Neoverse CPUs for the same reasons.
Differential Revision: https://reviews.llvm.org/D134521
These names can then be matched by name against 'bits' fields in a
record, to populate an instruction's encoding.
This does _not_ yet change DecoderEmitter to allow by-name matching of
sub-operands. Unlike the encoder, the decoder already defaulted to not
supporting positional matching, and backends had workarounds in place
for the missing decoding support.
Additionally, use this new capability to allow the ARM and AArch64
backends not to require any positional operand matching.
Differential Revision: https://reviews.llvm.org/D131003
This patch removes the AArch64 intrinsics svget/svset/svcreate from LLVM.
It also implements the InstCombine for vector.extract that used to be in svget.
Depends on: D131547
Differential Revision: https://reviews.llvm.org/D131548
They're roughly ARMv8.6. This works in the .td file, but in
AArch64TargetParser.def, marking them v8.6 brings in support for the SM4
cryptographic extension, which we don't actually have. So on the TargetParser
side they're marked as v8.5, with the extra features (BF16 and I8MM) added
manually.
Finally, A16 supports the HCX extension in addition to v8.6. This has no
TargetParser implications.
shuffle (tbl2, tbl2) can be folded into a single tbl4 if the mask for
the selected elements is constant.
Reviewed By: t.p.northover
Differential Revision: https://reviews.llvm.org/D133491
The llvm.aarch64.neon.scalar.sqxtn.i32.i64 intrinsics take and return
integer types, but operate on fp registers. This can create some
inefficiencies in their lowering, where the registers are converted to
fp a little too late. This patch adds lowering for the intrinsics,
creating bitcasts to/from fp types to allow nicer folding later when the
instructions are selected, especially around insert/extracts.
Differential Revision: https://reviews.llvm.org/D134024
This adds some quick tablegen patterns for vector_insert(bitcast(..))
and bitcast(vector_extract(..)), allowing us to avoid a round-trip
through GPRs.
Differential Revision: https://reviews.llvm.org/D134022
Inlining must be disabled when the call-site needs to toggle PSTATE.SM or
when the callee's function body is executed in a different streaming mode than
its caller. This is needed because function calls are the boundaries for
streaming mode changes.
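For illustration only (the attribute spelling here is an assumption based on the clang-side work in D131562):

  // callee has a streaming interface; caller is a normal function. Inlining
  // callee into caller would erase the call boundary where the PSTATE.SM
  // switch happens, so the inliner must refuse this candidate.
  void callee(void) __attribute__((arm_streaming));
  void caller(void) { callee(); }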
More details about the SME attributes and design can be found
in D131562.
Differential Revision: https://reviews.llvm.org/D131581
This patch enables the LSLFast feature for Cortex-A76, Cortex-A77,
Cortex-A78, Cortex-A78C, Cortex-A710, Cortex-X1, Cortex-X2, Neoverse N1,
Neoverse N2, Neoverse V1 and the Neoverse-512TVB pseudo-CPU, in line with
the software optimization guides for those CPUs.
Differential Revision: https://reviews.llvm.org/D134273
This patch removes the intrinsic aarch64.sve.ldN from TableGen in favour of
using aarch64.sve.ldN.sret.
Depends on: D133023
Differential Revision: https://reviews.llvm.org/D133025
Previously this only used the UnsafeFPMath option; it now looks for the
fast math flags on the instructions, using the same flags as other
backends.
When a streaming mode change is (or may be) required for a call, it will
need to restore the original mode after the call, which prevents the use of
tail-call optimization. The same holds true for a call that requires the lazy-save
mechanism to be set up before the call, and possibly restored after.
More details about the SME attributes and design can be found
in D131562.
Reviewed By: aemerson
Differential Revision: https://reviews.llvm.org/D131579
This is a partial port of the code used by the SelectionDAGBuilder to
translate selects.
In particular, see matchSelectPattern in ValueTracking.cpp. This is a
GISel-equivalent of the portion which handles fminnum/fmaxnum/fminimum/fmaximum.
I tried to set it up so it'd be easy to add the non-FP cases. Those are simpler.
On the AArch64-end, it seems like the FP cases are more important for perf
right now, so I bit the bullet and went at the more complicated problem. :)
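As a rough illustration (my sketch, not from the patch), this is the kind of source-level select that, with suitable fast-math flags, ends up matched to fminnum:

  // with nnan/nsz-style semantics, this compare+select can become fminnum
  float select_min(float x, float y) {
    return x < y ? x : y;
  }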
I elected to do this as a post-legalize combine rather than in the
IRTranslator because:
- Deciding which fmax/fmin to use can depend on legalization rules
- Philosophically speaking (TM), putting it in a combine just feels cleaner
- Being able to enable/disable the combine is handy
Another option would be to use the ValueTracking code in the IRTranslator and
match what SelectionDAGBuilder::visitSelect does. I think that may be somewhat
annoying since we'd need to write lowerings back into the selects in the
legalizer. I'm not strongly opposed to the approach.
We'd also want to be careful with vector selects once that's implemented,
since that code explicitly checks if a vector select is legal on the
target. That'd probably need a hook.
From what I can tell, doing this as a combine is probably a cleaner option
long-term.
Differential Revision: https://reviews.llvm.org/D116702
When a function is streaming-compatible and calls a function with a normal or streaming
interface, it may need to enable/disable streaming mode before the call, and
needs to restore PSTATE.SM after the call.
This patch implements this with a Pseudo node that gets expanded to a
conditional branch and smstart/smstop node.
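Conceptually, the expansion behaves like the following C sketch (not the actual MIR; current_pstate_sm is a hypothetical stand-in for the runtime mode query):

  extern int current_pstate_sm(void); // hypothetical PSTATE.SM query
  void callee(void);                  // callee with a streaming interface
  void streaming_compatible_caller(void) {
    int was_streaming = current_pstate_sm();
    if (!was_streaming)
      asm volatile("smstart sm");     // enter streaming mode for the call
    callee();
    if (!was_streaming)
      asm volatile("smstop sm");      // restore the original mode
  }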
More details about the SME attributes and design can be found
in D131562.
Reviewed By: aemerson
Differential Revision: https://reviews.llvm.org/D131578
This patch implements the ABI for calls from:
Normal -> Streaming
Normal -> Streaming-compatible
Streaming -> Normal
Streaming -> Streaming-compatible
Streaming -> Streaming
The compiler inserts SMSTART/SMSTOP instructions before and after the call,
depending on the required transition.
More details about the SME attributes and design can be found
in D131562.
Reviewed By: aemerson
Differential Revision: https://reviews.llvm.org/D131576
On AArch64, doing the vector truncate separately after the fptoui
conversion can be lowered more efficiently using tbl.4, building on
D133495.
https://alive2.llvm.org/ce/z/T538CC
Depends on D133495
Reviewed By: t.p.northover
Differential Revision: https://reviews.llvm.org/D133496
Similar to using tbl to lower vector ZExts, tbl4 can be used to lower
vector truncates.
The initial version supports i32->i8 conversions.
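For example (a sketch of mine), once vectorized, the truncates in a loop like this can be lowered via tbl4:

  #include <stdint.h>
  // each i32 -> i8 truncate maps onto the tbl4-based lowering
  void trunc_i32_to_i8(uint8_t *dst, const uint32_t *src, int n) {
    for (int i = 0; i < n; i++)
      dst[i] = (uint8_t)src[i];
  }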
Depends on D120571
Reviewed By: t.p.northover
Differential Revision: https://reviews.llvm.org/D133495
This patch extends CodeGenPrepare to lower zext v16i8 -> v16i32 in loops
using a wide shuffle creating a v64i8 vector, selecting groups of 3
zero elements and an element from the input.
This is profitable on AArch64 where such shuffles can be lowered to tbl
instructions, but only in loops, because it requires materializing 4
masks, which can be done in the loop preheader.
This is the only reason the transform is part of CGP. If there's a
better alternative I missed, please let me know. The same goes for the
shouldReplaceZExtWithShuffle hook which guards this. I am not sure if
this transform will be beneficial on other targets, but it seems like
there is no other convenient way.
This improves the generated code for loops like the one below in
combination with D96522.
  int foo(uint8_t *p, int N) {
    unsigned long long sum = 0;
    for (int i = 0; i < N; i++, p++) {
      unsigned int v = *p;
      sum += (v < 127) ? v : 256 - v;
    }
    return sum;
  }
https://clang.godbolt.org/z/Wco866MjY
Reviewed By: t.p.northover
Differential Revision: https://reviews.llvm.org/D120571
All in-tree targets pass pointer-sized ConstantSDNodes to the
method. This overload reduces the amount of boilerplate code a bit. It
also makes getCALLSEQ_END consistent with getCALLSEQ_START, which
already takes uint64_ts.
A thread may not have access to SME or TPIDR2_EL0, so in order to
safely query PSTATE.SM in a streaming-compatible function, the
code should call `__arm_sme_state()`, as described in the ABI:
c2bb09c4d4
This means that the value of PSTATE.SM is:
* 0 if the function is non-streaming.
* 1 if the function has `arm_streaming` or `arm_locally_streaming`.
* evaluated at runtime by a call to __arm_sme_state() otherwise.
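A minimal C-level sketch of the runtime case (simplified signature; treating bit 0 as PSTATE.SM is my reading of the ABI document referenced above):

  #include <stdint.h>
  uint64_t __arm_sme_state(void); // SME ABI support routine (simplified)
  static inline int in_streaming_mode(void) {
    return (int)(__arm_sme_state() & 1); // assumption: bit 0 reflects PSTATE.SM
  }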
This patch also adds a calling convention for calls to SME support routines.
At some point we can remove the need for the llvm.aarch64.get.pstatesm() intrinsic
and use function calls (with the corresponding cc) directly instead.
Reviewed By: aemerson
Differential Revision: https://reviews.llvm.org/D131571
Add fix for propagation of !pcsections metadata for expanded atomics,
together with more tests for interesting atomic instructions (based on
llvm/test/CodeGen/AArch64/GlobalISel/arm64-atomic.ll).
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D133710
SLH will fall back to a different technique if X16 is being used,
so there is no need to warn for inline asm use. Only prevent other codegen
from using it.
Reviewed By: kristof.beyls
Differential Revision: https://reviews.llvm.org/D133766
The current code for generating non-temporal loads outputs the wrong assembly for big-endian architectures.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D133789