Change the costmodel to lower a = b * C where C = -(2^n - 2^m) to
lsl w8, w0, m
sub w0, w8, w0, lsl n
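A minimal worked instance, assuming n = 3 and m = 1 so that C = -(2^3 - 2^1) = -6 (the constant, function name, and register choices are illustrative):
```
/* b * -6 == (b << 1) - (b << 3) */
int mul_m6(int b) {
  return b * -6; /* expected lowering (sketch): lsl w8, w0, #1
                                                sub w0, w8, w0, lsl #3 */
}
```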
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D134934
Decomposing the constant 14 can be separated from D132322.
Change the costmodel to lower a = b * C where C = 2^n - 2^m to
lsl w8, w0, n
sub w0, w8, w0, lsl m
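A minimal worked instance using the constant 14 = 2^4 - 2^1 mentioned above, i.e. n = 4 and m = 1 (the function name and register choices are illustrative):
```
/* b * 14 == (b << 4) - (b << 1) */
int mul_14(int b) {
  return b * 14; /* expected lowering (sketch): lsl w8, w0, #4
                                                sub w0, w8, w0, lsl #1 */
}
```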
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D134706
For gathers which load 8- and 16-bit data and then use that data
as an index, the index can be extended to 32 bits instead of
64 bits.
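A hypothetical C loop of the kind this targets: the vectorized form loads the 8-bit index data and uses it for a gather, so the index vector only needs a 32-bit extension (names are illustrative):
```
void gather_u8_idx(float *dst, const float *table,
                   const unsigned char *idx, int n) {
  for (int i = 0; i < n; i++)
    dst[i] = table[idx[i]];
}
```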
Differential Revision: https://reviews.llvm.org/D130692
Currently, non-temporal loads over 256 bits are broken inefficiently. For example, `v17i32` gets broken into 2 128-bit loads. It is better if we can use
256-bit loads instead.
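A hypothetical example of such a wide non-temporal load, written with Clang's __builtin_nontemporal_load (the type and function name are illustrative):
```
typedef int v16i32 __attribute__((vector_size(64)));

/* A 512-bit non-temporal load; splitting it into 256-bit pieces is
   preferable to splitting it into 128-bit pieces. */
v16i32 load_nt_512(const v16i32 *p) {
  return __builtin_nontemporal_load(p);
}
```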
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D133421
This patch removes the AArch64 intrinsics svget/svset/svcreate from LLVM.
It also implements the InstCombine for vector.extract that used to be in svget.
Depends on: D131547
Differential Revision: https://reviews.llvm.org/D131548
shuffle (tbl2, tbl2) can be folded into a single tbl4 if the mask for
the selected elements is constant.
Reviewed By: t.p.northover
Differential Revision: https://reviews.llvm.org/D133491
The llvm.aarch64.neon.scalar.sqxtn.i32.i64 intrinsics take and return
integer types, but operate on fp registers. This can create some
inefficiencies in their lowering, where the registers are converted to
fp a little too late. This patch adds lowering for the intrinsics,
creating bitcasts to/from fp types to allow nicer folding later when the
instructions are selected, especially around insert/extracts.
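A sketch of the kind of pattern that benefits, using the corresponding ACLE intrinsic; the helper name is illustrative and this assumes vqmovnd_s64 maps to the scalar sqxtn intrinsic:
```
#include <arm_neon.h>

/* Narrow a scalar i64 with saturation and insert the result into a vector
   lane; with the new lowering the value can stay in an FP/SIMD register. */
int32x4_t narrow_into_lane(int32x4_t acc, int64_t x) {
  return vsetq_lane_s32(vqmovnd_s64(x), acc, 0);
}
```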
Differential Revision: https://reviews.llvm.org/D134024
This patch removes the aarch64.sve.ldN intrinsics from tablegen in favour of
using aarch64.sve.ldN.sret.
Depends on: D133023
Differential Revision: https://reviews.llvm.org/D133025
When a streaming mode change is (or may be) required for a call, the caller will
need to restore the original mode after the call, which prevents the use of
tail-call optimization. The same holds true for a call that requires the lazy-save
mechanism to be set up before the call, and possibly restored after.
More details about the SME attributes and design can be found
in D131562.
Reviewed By: aemerson
Differential Revision: https://reviews.llvm.org/D131579
When a function is streaming-compatible and calls a function with a normal or streaming
interface, it may need to enable/disable streaming mode before the call, and
needs to restore PSTATE.SM after the call.
This patch implements this with a Pseudo node that gets expanded to a
conditional branch and smstart/smstop node.
More details about the SME attributes and design can be found
in D131562.
Reviewed By: aemerson
Differential Revision: https://reviews.llvm.org/D131578
This patch implements the ABI for calls from:
Normal -> Streaming
Normal -> Streaming-compatible
Streaming -> Normal
Streaming -> Streaming-compatible
Streaming -> Streaming
The compiler inserts SMSTART/SMSTOP instructions before and after the call,
depending on the required transition.
More details about the SME attributes and design can be found
in D131562.
Reviewed By: aemerson
Differential Revision: https://reviews.llvm.org/D131576
On AArch64, a vector truncate done separately after the fptoui
conversion can be lowered more efficiently using tbl.4, building on
D133495.
https://alive2.llvm.org/ce/z/T538CC
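A hypothetical loop that exercises this path: the vectorized float-to-u8 conversion becomes an fptoui to i32 followed by a truncate, and the truncate can then be lowered with tbl (names are illustrative):
```
void fptoui_to_u8(const float *src, unsigned char *dst, int n) {
  for (int i = 0; i < n; i++)
    dst[i] = (unsigned char)src[i];
}
```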
Depends on D133495
Reviewed By: t.p.northover
Differential Revision: https://reviews.llvm.org/D133496
Similar to using tbl to lower vector ZExts, tbl4 can be used to lower
vector truncates.
The initial version supports i32->i8 conversions.
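A hypothetical i32 -> i8 truncating loop that could use the tbl4-based lowering once vectorized (names are illustrative):
```
void trunc_u32_to_u8(const unsigned int *src, unsigned char *dst, int n) {
  for (int i = 0; i < n; i++)
    dst[i] = (unsigned char)src[i];
}
```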
Depends on D120571
Reviewed By: t.p.northover
Differential Revision: https://reviews.llvm.org/D133495
This patch extends CodeGenPrepare to lower zext v16i8 -> v16i32 in loops
using a wide shuffle creating a v64i8 vector, selecting groups of 3
zero elements and an element from the input.
This is profitable on AArch64 where such shuffles can be lowered to tbl
instructions, but only in loops, because it requires materializing 4
masks, which can be done in the loop preheader.
This is the only reason the transform is part of CGP. If there's a
better alternative I missed, please let me know. The same goes for the
shouldReplaceZExtWithShuffle hook which guards this. I am not sure if
this transform will be beneficial on other targets, but it seems like
there is no other convenient way.
This improves the generated code for loops like the one below in
combination with D96522.
#include <stdint.h>

int foo(uint8_t *p, int N) {
  unsigned long long sum = 0;
  for (int i = 0; i < N; i++, p++) {
    unsigned int v = *p;
    sum += (v < 127) ? v : 256 - v;
  }
  return sum;
}
https://clang.godbolt.org/z/Wco866MjY
Reviewed By: t.p.northover
Differential Revision: https://reviews.llvm.org/D120571
All in-tree targets pass pointer-sized ConstantSDNodes to the
method. This overload reduces the amount of boilerplate code a bit. This
also makes getCALLSEQ_END consistent with getCALLSEQ_START, which
already takes uint64_ts.
A thread may not have access to SME or TPIDR2_EL0, so in order to
safely query PSTATE.SM in a streaming-compatible function, the
code should call `__arm_sme_state()`, as described in the ABI:
c2bb09c4d4
This means that the value of PSTATE.SM is:
* 0 if the function is non-streaming.
* 1 if the function has `arm_streaming` or `arm_locally_streaming`.
* evaluated at runtime by a call to __arm_sme_state() otherwise.
This patch also adds a calling convention for calls to SME support routines.
At some point we can remove the need for the llvm.aarch64.get.pstatesm() intrinsic
and use function calls (with the corresponding cc) directly instead.
Reviewed By: aemerson
Differential Revision: https://reviews.llvm.org/D131571
The current code for generating non-temporal loads outputs the wrong assembly for big-endian architectures.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D133789
If the Shuffle is a splat and the operand is a zext/sext, sinking the
operand and the s/zext can help create indexed s/umull. This is
especially useful to prevent i64 mul being scalarized.
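A hypothetical loop showing the motivation: sinking the splatted operand and the sign-extends lets the vectorized multiply become an indexed smull instead of a scalarized i64 multiply (names are illustrative):
```
void mul_by_scalar(const int *a, int b, long long *dst, int n) {
  for (int i = 0; i < n; i++)
    dst[i] = (long long)a[i] * b;
}
```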
Differential Revision: https://reviews.llvm.org/D133355
Also remove the new-pass-manager version of ExpandLargeDivRem because there is no way
yet to access TargetLowering in the new pass manager.
Differential Revision: https://reviews.llvm.org/D133691
LLVM contains a helpful function for getting the size of a C-style
array: `llvm::array_lengthof`. This is useful prior to C++17, but not as
helpful for C++17 or later: `std::size` already has support for C-style
arrays.
Change call sites to use `std::size` instead.
Differential Revision: https://reviews.llvm.org/D133429
Part of initial Arm64EC patchset.
Arm64EC code needs to use functions with a different name, to avoid
using the x64 versions.
Differential Revision: https://reviews.llvm.org/D125417
Part of patchset to add initial support for ARM64EC.
The ARM64EC calling convention is the same as ARM64 for non-varargs
functions, but for varargs, the convention is significantly different.
Basically, only x0-x3 registers are used for passing arguments, and x4
and x5 describe the address/size of the arguments passed in memory. (See
https://docs.microsoft.com/en-us/windows/uwp/porting/arm64ec-abi for
more details; see
https://docs.microsoft.com/en-us/cpp/build/x64-calling-convention for
the x64 calling convention rules, which this convention needs to match.)
Note that this currently doesn't handle i128 arguments correctly; as
noted in review, that's sort of complicated to handle, so I'm leaving it
for a followup.
Differential Revision: https://reviews.llvm.org/D125415
There's no advantage to emitting floating-point scalable accesses,
whereas by lowering them to integer variants we can benefit from
several combines that seek to replace explicit extends/truncates
with extending/truncating accesses.
Differential Revision: https://reviews.llvm.org/D132393
The KCFI sanitizer, enabled with `-fsanitize=kcfi`, implements a
forward-edge control flow integrity scheme for indirect calls. It
uses a !kcfi_type metadata node to attach a type identifier for each
function and injects verification code before indirect calls.
Unlike the current CFI schemes implemented in LLVM, KCFI does not
require LTO, does not alter function references to point to a jump
table, and never breaks function address equality. KCFI is intended
to be used in low-level code, such as operating system kernels,
where the existing schemes can cause undue complications because
of the aforementioned properties. However, unlike the existing
schemes, KCFI is limited to validating only function pointers and is
not compatible with executable-only memory.
KCFI does not provide runtime support, but always traps when a
type mismatch is encountered. Users of the scheme are expected
to handle the trap. With `-fsanitize=kcfi`, Clang emits a `kcfi`
operand bundle to indirect calls, and LLVM lowers this to a
known architecture-specific sequence of instructions for each
callsite to make runtime patching easier for users who require this
functionality.
A KCFI type identifier is a 32-bit constant produced by taking the
lower half of xxHash64 from a C++ mangled typename. If a program
contains indirect calls to assembly functions, they must be
manually annotated with the expected type identifiers to prevent
errors. To make this easier, Clang generates a weak SHN_ABS
`__kcfi_typeid_<function>` symbol for each address-taken function
declaration, which can be used to annotate functions in assembly
as long as at least one C translation unit linked into the program
takes the function address. For example on AArch64, we might have
the following code:
```
.c:
  int f(void);
  int (*p)(void) = f;
  p();

.s:
  .4byte __kcfi_typeid_f
  .global f
f:
  ...
```
Note that X86 uses a different preamble format for compatibility
with Linux kernel tooling. See the comments in
`X86AsmPrinter::emitKCFITypeId` for details.
As users of KCFI may need to locate trap locations for binary
validation and error handling, LLVM can additionally emit the
locations of traps to a `.kcfi_traps` section.
Similarly to other sanitizers, KCFI checking can be disabled for a
function with a `no_sanitize("kcfi")` function attribute.
Relands 67504c9549 with a fix for
32-bit builds.
Reviewed By: nickdesaulniers, kees, joaomoreira, MaskRay
Differential Revision: https://reviews.llvm.org/D119296
What: combine the {aarch64_neon_pmull, aarch64_neon_pmull64} intrinsics into
an AArch64ISD::PMULL SDNode.
How:
1) Add AArch64ISD::PMULL SDNode, and extend aarch64_neon_pmull intrinsic
tablegen pattern for this SDNode.
2) For aarch64_neon_pmull64, canonicalize i64 operands to v1i64 vectors
during legalization.
3) For {aarch64_neon_pmull, aarch64_neon_pmull64}, combine intrinsic to
SDNode.
Why:
1) Adding the SDNode makes it easier to canonicalize i64 inputs (required by
aarch64_neon_pmull64) to vector inputs. Vector inputs carry lane
information, which helps the dag-combiner to combine nodes (e.g. rewrite to a
better node to prepare for instruction selection) and instruction-selection
to emit instructions that use higher-half inputs in place
(i.e., no need to move lane 1 content to lane 0).
2) Using the SDNode for aarch64_neon_pmull64 is NFC, yet without this we
have to move the definitions of {PMULLv1i64, PMULLv2i64} out of their
current group of records without any gain.
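An illustrative source-level example of the higher-half case (requires the AES/PMULL extension; the function name is illustrative):
```
#include <arm_neon.h>

/* With lane information preserved by the SDNode, the high-half pmull64 can
   be selected as PMULL2 directly, without moving lane 1 down to lane 0. */
poly128_t pmull_high(poly64x2_t a, poly64x2_t b) {
  return vmull_high_p64(a, b);
}
```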
Test cases are commented with what is being tested in
`aarch64-pmull2.ll` and `pmull-ldr-merge.ll` under directory
`llvm/test/CodeGen/AArch64`.
Differential Revision: https://reviews.llvm.org/D131047
This change adds support for lowering llvm.prefetch directly using
GlobalISel. Currently, llvm.prefetch falls back to SelectionDAG.
This Change:
- Adds an AArch64-specific G_PREFETCH generic instruction, to be used
where AArch64ISD::PREFETCH is used in SelectionDAG.
- Adds the GINodeEquiv so patterns are translated over to GlobalISel
automatically.
- Corrects the AArch64Prefetch patterns to use a target immediate, which
is needed to get the patterns to translate across correctly.
- Translates the SelectionDAG legalisation of the prefetch intrinsic
into the corresponding GlobalISel legalisation.
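A minimal example of code that reaches llvm.prefetch and therefore this path (the function name is illustrative):
```
/* __builtin_prefetch lowers to llvm.prefetch; with this change it is
   legalised and selected in GlobalISel rather than via the SelectionDAG
   fallback. */
void prefetch_read(const char *p) {
  __builtin_prefetch(p, /*rw=*/0, /*locality=*/3);
}
```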
Differential Revision: https://reviews.llvm.org/D132043
Extends findMoreOptimalIndexType to allow ISD::BUILD_VECTOR based
indices to be truncated when such truncation is lossless. This can
enable the use of 32-bit gather/scatter indices, thus making it less
likely to have to split a gather/scatter in two.
Depends on D125194
Differential Revision: https://reviews.llvm.org/D130533
TargetLowering had the last two InstructionCost-related members, `getTypeLegalizationCost()`
and `getScalingFactorCost()`, but all other costs are processed in TTI.
For example, it is not convenient to use other TTI members in these two
functions when they are overridden in a target.
Minor refactoring: `getTypeLegalizationCost()` now doesn't need a DataLayout
parameter - it was always passed from TTI.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D117723
Currently all non-temporal loads are mapped to `LDP` or `LDR`. This patch will map all the non-temporal 256-bit loads to `LDNP`. Future patches should address other non-temporal loads.
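A hypothetical 256-bit non-temporal load via Clang's builtin that this patch lets us select as a single `LDNP` (the type and function name are illustrative):
```
typedef int v8i32 __attribute__((vector_size(32)));

v8i32 load_nt_256(const v8i32 *p) {
  return __builtin_nontemporal_load(p);
}
```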
Reviewed By: fhahn, dmgreen
Differential Revision: https://reviews.llvm.org/D131773
A bfloat select operation will currently crash, but is allowed from C.
This adds handling for the operation, turning it into a FCSELHrrr if
fullfp16 is present, or converting it to a FCSELSrrr if not. The
FCSELSrrr is created by using INSERT_SUBREG/EXTRACT_SUBREG to convert
the bf16 to an f32 and using the f32 pattern for FCSELSrrr. (I originally
attempted to do this via a tablegen pattern, but it appears that the
nzcv glue is placed onto the wrong node, causing it to be forgotten and
incorrect scheduling to be emitted).
The FCSELSrrr can also be used for fp16 selects when +fullfp16 is not
present, which helps avoid an unnecessary promotion to f32.
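A minimal sketch of C code that reaches a bf16 select after optimization, assuming the target provides the __bf16 type (the function name is illustrative):
```
__bf16 select_bf16(__bf16 a, __bf16 b, int c) {
  /* The mid-end typically turns this branch into a select on bf16 values,
     which previously crashed during instruction selection. */
  __bf16 r;
  if (c)
    r = a;
  else
    r = b;
  return r;
}
```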
Differential Revision: https://reviews.llvm.org/D131253
Currently fcopysign for VLS vectors lowers through NEON even when the
vector width is wider than a NEON vector, causing bad codegen as the
vectors are split. This patch causes SVE to be used for these vectors
instead, giving much better codegen on wide VLS vectors.
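A hypothetical loop that produces such nodes when built with SVE and a fixed vector length (e.g. -msve-vector-bits=512); names are illustrative:
```
#include <math.h>

/* The vectorized copysignf becomes an fcopysign on a VLS vector wider than
   a NEON register, which now lowers via SVE instead of being split. */
void copysign_loop(float *dst, const float *mag, const float *sgn, int n) {
  for (int i = 0; i < n; i++)
    dst[i] = copysignf(mag[i], sgn[i]);
}
```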
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D128642