If the shuffle is a splat and the operand is a zext/sext, sinking both
the operand and the s/zext can help create indexed s/umull. This is
especially useful to prevent i64 muls from being scalarized.
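For illustration, a sketch of the kind of source that benefits (function and parameter names are hypothetical): the vectoriser broadcasts b as a splat shuffle feeding a sext, and sinking both next to the mul lets instruction selection form an indexed smull/smull2 instead of scalarising the i64 multiplies.
```
#include <cstdint>
// Widening multiply by a broadcast scalar; each product is an i64.
void mul_by_scalar(int64_t *dst, const int32_t *a, int32_t b, int n) {
  for (int i = 0; i < n; ++i)
    dst[i] = (int64_t)a[i] * (int64_t)b; // sext(a[i]) * sext(splat(b))
}
```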
Differential Revision: https://reviews.llvm.org/D133355
This extends 4658366d95 to add a note explaining why the register is
reserved:
  note: x13 is clobbered by asynchronous signals when using Arm64EC.
I've added testing for the w/x registers and the v/q/s/d/h floating
point registers.
LLVM will accept, but silently do nothing with, b registers, so they
are not tested here (Clang rejects them, so at least for C you're safe anyway).
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D133701
__declspec(safebuffers) is equivalent to
__attribute__((no_stack_protector)). This information is recorded in
CodeView.
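A minimal sketch of the equivalence (assumes an MSVC-compatible frontend such as clang-cl; function names are illustrative):
```
// Both spellings suppress stack-protector instrumentation for the
// function; the fact is recorded in the CodeView debug info.
__declspec(safebuffers) void msvc_style(void) { char buf[64]; (void)buf; }
__attribute__((no_stack_protector)) void gnu_style(void) { char buf[64]; (void)buf; }
```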
While we are here, add support for strict_gs_check.
Also remove the new-pass-manager version of ExpandLargeDivRem, because
there is no way yet to access TargetLowering in the new pass manager.
Differential Revision: https://reviews.llvm.org/D133691
This patch adds a utility class that will be used in subsequent patches
for parsing the function/callsite attributes and determining whether
changes to PSTATE.SM are needed, or whether a lazy-save mechanism is
required.
It also implements some of the restrictions on the SME attributes
in the IR Verifier pass.
More details about the SME attributes and design can be found
in D131562.
Reviewed By: david-arm, aemerson
Differential Revision: https://reviews.llvm.org/D131570
Fixes #50098
LLVM uses X19 as the frame base pointer if it needs to, meaning you
can get warnings if you clobber it with inline asm.
However, the warning doesn't explain why. The frame base register is
not part of the ABI, so it's pretty confusing to get that warning out
of the blue.
This adds a method to explain why a register is reserved, with X19 as
the first one. The logic is the same as getReservedRegs.
I could have added a return parameter to isASMClobberable and friends,
but found that there are a lot of things that call isReservedReg in
various ways.
So while one more method on the pile isn't great design, it is simpler
right now to do it this way, and we only pay the cost when a reserved
register is actually being used.
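As a sketch, the kind of code that triggers the diagnostic (hypothetical function name):
```
// Clobbering x19 in inline asm warns when LLVM is using it as the
// frame base pointer; the new method supplies the explanatory note.
void clobber_frame_base() {
  asm volatile("mov x19, xzr" ::: "x19");
}
```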
Reviewed By: lenary
Differential Revision: https://reviews.llvm.org/D133213
This is restricted to the 32-bit form of the integer instructions, as
too many test cases would otherwise be clobbered by the updated
register numbering.
From: %reg = INSERT_SUBREG %reg, %subreg, subidx
To:   %reg:subidx = SUBREG_TO_REG 0, %subreg, subidx
This tries to fix the redundant mov instruction seen in D132325, since
SUBREG_TO_REG should not generate code.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D132939
LLVM contains a helpful function for getting the size of a C-style
array: `llvm::array_lengthof`. This is useful prior to C++17, but not as
helpful for C++17 or later: `std::size` already has support for C-style
arrays.
Change call sites to use `std::size` instead.
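A typical call-site change looks like this (illustrative variable names):
```
#include <iterator> // std::size

static const int Table[] = {1, 2, 3, 4};
// Before: unsigned N = llvm::array_lengthof(Table);
// After (C++17):
static_assert(std::size(Table) == 4); // std::size works on C-style arrays
```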
Differential Revision: https://reviews.llvm.org/D133429
Propagate (most) PC sections metadata to MachineInstr when GlobalISel is
doing instruction selection.
This results in support for architectures using GlobalISel (such as
AArch64 at -O0). Not all instructions may be supported yet, and some
require further target-specific handling (such as was done for the
AArch64 pseudo-atomics). Expanding the set of supported instructions
is planned on a case-by-case basis, as new use cases for PC sections
metadata arise.
Reviewed By: vitalybuka
Differential Revision: https://reviews.llvm.org/D130886
Propagate PC sections metadata to MachineInstr when FastISel is doing
instruction selection.
Reviewed By: vitalybuka
Differential Revision: https://reviews.llvm.org/D130884
This diff adjusts the AddedComplexity of BIC to bump its position
in the list of patterns, making LLVM pick it instead of MVN + AND.
MVN + AND requires 2 cycles, and so does e.g. MOV + BIC, but the latter
outperforms the former if the instructions producing the operands of
BIC can be issued in parallel.
One may consider the following example:
ldur x15, [x0, #2] # 4 cycles
mvn x10, x15 # 1 cycle (depends on ldur)
and x9, x10, #0x8080808080808080
vs.
ldur x15, [x0, #2] # 4 cycles
mov x9, #0x8080808080808080 # 1 cycle (can be executed in parallel with ldur)
bic x9, x9, x15 # 1 cycle
Test plan: ninja check-all
Differential revision: https://reviews.llvm.org/D133345
This patch is essentially an alternative to https://reviews.llvm.org/D75836 and was mentioned by @lhames in a comment.
The gist of the issue is that Mach-O has restrictions on which kinds of sections are allowed after debug info has been emitted, which is also properly asserted within LLVM. The problem is that stack maps are currently emitted as one of the last sections in each target-specific AsmPrinter, which would cause the assertion to trigger. The current approach of special-casing the `__LLVM_STACKMAPS` section is not viable either, as downstream users can overwrite the stackmap format using plugins, which may want to use different sections.
This patch fixes the issue by emitting the stack map earlier, right before debug info is emitted. This is implemented by taking the choice of when to emit the StackMap away from the target AsmPrinter and making it in the base class. The only disadvantage of this approach is that the `StackMaps` member is now part of the base class, even for targets that do not support them. This is not a problem functionally, however, as emitting an empty `StackMaps` is a no-op.
Differential Revision: https://reviews.llvm.org/D132708
This patch adds an option, --reserve-regs-for-regalloc, so we can reserve a list
of physical registers. These registers will not be used by the register
allocator, but can still be used where the ABI requires them, such as for
passing arguments to function calls.
Its main purpose is to simulate high register pressure by reserving many
physical registers, which makes it much easier to test and debug register
allocation changes.
Differential Revision: https://reviews.llvm.org/D132717
This adds the ExpandLargeDivRem pass to the default pass pipeline.
The limit at which it expands div/rem instructions is configured
via a new TargetTransformInfo hook (default: no expansion).
The X86, Arm and AArch64 backends implement this hook to expand div/rem
instructions with more than 128 bits.
Differential Revision: https://reviews.llvm.org/D130076
Part of initial Arm64EC patchset.
Arm64EC code needs to use functions with a different name, to avoid
using the x64 versions.
Differential Revision: https://reviews.llvm.org/D125417
Part of patchset to add initial support for ARM64EC.
The ARM64EC calling convention is the same as ARM64 for non-varargs
functions, but for varargs, the convention is significantly different.
Basically, only x0-x3 registers are used for passing arguments, and x4
and x5 describe the address/size of the arguments passed in memory. (See
https://docs.microsoft.com/en-us/windows/uwp/porting/arm64ec-abi for
more details; see
https://docs.microsoft.com/en-us/cpp/build/x64-calling-convention for
the x64 calling convention rules, which this convention needs to match.)
Note that this currently doesn't handle i128 arguments correctly; as
noted in review, that's sort of complicated to handle, so I'm leaving it
for a followup.
Differential Revision: https://reviews.llvm.org/D125415
Part of patchset to add initial support for ARM64EC.
I'm not completely sure I understand the reason for this restriction,
but Microsoft documentation says that asynchronous signals clobber these
registers, so we can't ever use them.
As far as I know, none of these registers have any hardcoded meaning, so
reserving them shouldn't have any significant side-effects.
Differential Revision: https://reviews.llvm.org/D125413
Part of patchset to add initial support for ARM64EC.
Per discussion on review, using the triple arm64ec-pc-windows-msvc. The
parsing works the same way as Apple's alternate Arm ABI "arm64e".
Differential Revision: https://reviews.llvm.org/D125412
This is a simple addition to emitConditionalComparison, to match CCMP
with immediates using getIConstantVRegValWithLookThrough, letting it
select the CCMPri variants of the instructions.
Differential Revision: https://reviews.llvm.org/D131073
If we don't have NEON, we use the generic fallback, which takes 12
instructions. Make sure the costs reflect that.
(On a related note, we could optimize the generic fallback a bit. It
currently uses sequences like lsr+and+add; if we use and+lsr+add
instead, we can fold the lsr into the add.)
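For reference, a sketch of the generic fallback in its usual bit-twiddling form; the reordering mentioned above corresponds to rewriting the second step as shown in the comment:
```
#include <cstdint>
uint64_t popcount64(uint64_t x) {
  x -= (x >> 1) & 0x5555555555555555ULL;                  // 2-bit sums
  // As written this is lsr+and+add; rewriting it as
  //   (x & 0x3333...) + ((x & 0xCCCC...) >> 2)
  // lets the lsr fold into the add (add Xd, Xn, Xm, lsr #2).
  x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
  x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;             // 4-bit sums
  return (x * 0x0101010101010101ULL) >> 56;               // horizontal sum
}
```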
Differential Revision: https://reviews.llvm.org/D133154
1) Tablegen patterns exist to use 'xtn' and 'uzp1' for trunc [1]. Cost table entries are updated based on the actual number of {xtn, uzp1} instructions generated.
2) Without this, an IR instruction like trunc <8 x i16> %v to <8 x i8> is considered free and might be sunk into other basic blocks. As a result, the sunk 'trunc' ends up in a different basic block from its (usually not-free) vector operand and misses the chance to be combined during instruction selection. (examples in [2])
3) It's a lot of effort to teach CodeGenPrepare.cpp to sink the operand of trunc without introducing regressions, since the instruction computing the operand of trunc could be faster (e.g., in throughput) than the instruction corresponding to "trunc (bin-vector-op)". For instance in [3], sinking %1 (the trunc operand) into bb.1 and bb.2 means replacing 2 xtn with 2 shrn (shrn has a throughput of 1 and only utilizes the V1 pipeline), which is not necessarily good, especially since the ushr result needs to be preserved for the store operation in bb.0. Meanwhile, it's too optimistic (for the CodeGenPrepare pass) to assume machine-cse will always be able to de-dup the shrn from various basic blocks into one shrn.
[1] For {v8i16->v8i8, v4i32->v4i16, v2i64->v2i32}, 813ae2871d/llvm/lib/Target/AArch64/AArch64InstrInfo.td (L4472).
For concat (trunc, trunc) -> uzip1, 813ae2871d/llvm/lib/Target/AArch64/AArch64InstrInfo.td (L5428-L5437)
[2] examples
- trunc(umin(X, 255)) -> UQXTRN v8i8 (and other {u,s}x{min,max} pattern for v8i16 operands) from 813ae2871d/llvm/lib/Target/AArch64/AArch64InstrInfo.td (L4515-L4528)
- trunc (AArch64vlshr v8i16, imm) -> SHRNv8i8 (same missed for SHRNv2i32) from 813ae2871d/llvm/lib/Target/AArch64/AArch64InstrInfo.td (L6743-L6748)
[3]
---
; instruction latency / throughput / pipeline on `neoverse-n1`
bb.0:
  %1 = lshr <8 x i16> %10, <i16 4, i16 4, i16 4, i16 4, i16 4, i16 4, i16 4, i16 4> ; ushr, latency 2, throughput 1, pipeline V1
  %2 = trunc <8 x i16> %1 to <8 x i8> ; xtn, latency 2, throughput 2, pipeline V
  store <8 x i16> %1, ptr %addr
  br i1 %cond, label %bb.1, label %bb.2
bb.1:
  %4 = trunc <8 x i16> %1 to <8 x i8> ; xtn
bb.2:
  %5 = trunc <8 x i16> %1 to <8 x i8> ; xtn
---
Differential Revision: https://reviews.llvm.org/D132784
When selecting G_EXTRACT to COPY for extracting a 64-bit GPR from
a 128-bit register pair (XSeqPair) we know enough to constrain the
destination register class to gpr64. Without this it may have only
a register bank and some copy elimination code would assert while
assuming that a register class existed.
The register class has to be set explicitly because we might hit the
COPY -> COPY case where register class can't be inferred.
This would cause the following to crash in selection, where the store
is commented (otherwise the store constrains the register class):
define dso_local i128 @load_atomic_i128_unordered(i128* %p) {
  %pair = cmpxchg i128* %p, i128 0, i128 0 acquire acquire
  %val = extractvalue { i128, i1 } %pair, 0
  ; store i128 %val, i128* %p
  ret i128 %val
}
Differential Revision: https://reviews.llvm.org/D132665
AArch64TTIImpl::getSpliceCost() is now used more aggressively and
LNT (MultiSource/Benchmarks/mafft) exposed a failure case for
<vscale x 1 x i1>. I've tested other element types and whilst they
can be costed they cannot be code generated, so this patch returns
InstructionCost::getInvalid() for all cases.
There's no advantage to emitting floating point scalable accesses,
whereas by lowering them to integer variants we can benefit from
several combines that seek to replace explicit extends/truncates
with extending/truncating accesses.
Differential Revision: https://reviews.llvm.org/D132393
The linker is supposed to detect when an object with /kernel is linked
with another object which is not compiled with /kernel. The linker
detects this by checking bit 30 in @feat.00.
This makes three changes:
1. Split the Load/Store resources into two, Ld0St and Ld1,
since only one of them is capable of stores.
2. Integer ADD and SUB instructions have different latencies
and processor resource usage (pipeline) when they have a shift of
zero vs. non-zero; refer to D8043.
3. Correct the throughput of the scalar DIV instruction.
Reviewed By: dmgreen, bryanpkc
Differential Revision: https://reviews.llvm.org/D132529
The KCFI sanitizer, enabled with `-fsanitize=kcfi`, implements a
forward-edge control flow integrity scheme for indirect calls. It
uses a !kcfi_type metadata node to attach a type identifier for each
function and injects verification code before indirect calls.
Unlike the current CFI schemes implemented in LLVM, KCFI does not
require LTO, does not alter function references to point to a jump
table, and never breaks function address equality. KCFI is intended
to be used in low-level code, such as operating system kernels,
where the existing schemes can cause undue complications because
of the aforementioned properties. However, unlike the existing
schemes, KCFI is limited to validating only function pointers and is
not compatible with executable-only memory.
KCFI does not provide runtime support, but always traps when a
type mismatch is encountered. Users of the scheme are expected
to handle the trap. With `-fsanitize=kcfi`, Clang emits a `kcfi`
operand bundle to indirect calls, and LLVM lowers this to a
known architecture-specific sequence of instructions for each
callsite to make runtime patching easier for users who require this
functionality.
A KCFI type identifier is a 32-bit constant produced by taking the
lower half of xxHash64 from a C++ mangled typename. If a program
contains indirect calls to assembly functions, they must be
manually annotated with the expected type identifiers to prevent
errors. To make this easier, Clang generates a weak SHN_ABS
`__kcfi_typeid_<function>` symbol for each address-taken function
declaration, which can be used to annotate functions in assembly
as long as at least one C translation unit linked into the program
takes the function address. For example on AArch64, we might have
the following code:
```
.c:
int f(void);
int (*p)(void) = f;
p();
.s:
.4byte __kcfi_typeid_f
.global f
f:
...
```
Note that X86 uses a different preamble format for compatibility
with Linux kernel tooling. See the comments in
`X86AsmPrinter::emitKCFITypeId` for details.
As users of KCFI may need to locate trap locations for binary
validation and error handling, LLVM can additionally emit the
locations of traps to a `.kcfi_traps` section.
Similarly to other sanitizers, KCFI checking can be disabled for a
function with a `no_sanitize("kcfi")` function attribute.
Relands 67504c9549 with a fix for
32-bit builds.
Reviewed By: nickdesaulniers, kees, joaomoreira, MaskRay
Differential Revision: https://reviews.llvm.org/D119296
This patch adds a Type operand to the TLI isCheapToSpeculateCttz/isCheapToSpeculateCtlz callbacks, allowing targets to decide whether branches should occur on a type-by-type/legality basis.
For X86, this patch proposes to allow CTTZ speculation for i8/i16 types that will lower to promoted i32 BSF instructions by masking the operand above the msb (we already do something similar for i8/i16 TZCNT). This required a minor tweak to CTTZ lowering - if the src operand is known never zero (i.e. due to the promotion masking) we can remove the CMOV zero src handling.
Although BSF isn't very fast, most CPUs from the last 20 years don't do that bad a job with it, although there are some annoying passthrough EFLAGS dependencies. Additionally, now that we emit 'REP BSF' in most cases, we are tending towards assuming this will most likely be executed as a TZCNT instruction on any semi-modern CPU.
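As a sketch of the masking trick (hypothetical helper; the real change happens during lowering):
```
#include <cstdint>
// For i16 cttz promoted to a 32-bit BSF: OR-ing in the bit just above
// the source MSB makes the operand provably non-zero, so the CMOV
// zero-src handling can be dropped, and the result is still correct
// for all 16-bit inputs (16 for x == 0).
unsigned cttz16(uint16_t x) {
  return __builtin_ctz((uint32_t)x | 0x10000u);
}
```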
Differential Revision: https://reviews.llvm.org/D132520
This commit moves the information on whether a register is constant into
the Tablegen files to allow generating the implementation of
isConstantPhysReg(). I've marked isConstantPhysReg() as final in this
generated file to ensure that changes are made to tablegen instead of
overriding this function, but if that turns out to be too restrictive,
we can remove the qualifier.
This should be pretty much NFC, but I did notice that e.g. the AMDGPU
generated file also includes the LO16/HI16 registers now.
The new isConstant flag will also be used by D131958 to ensure that
constant registers are marked as call-preserved.
Differential Revision: https://reviews.llvm.org/D131962
Add SVE patterns to make use of the predicated smin, umin, smax, and
umax instructions, and add the sve-min-max-pred.ll test file for the
new patterns.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D132122
This has the effect of exposing the power-of-two property for use in memory op costing, but no target actually uses it yet. The main point of this change is simple consistency with the recently changed getArithmeticInstrCost, and to remove the last (interface) use of OperandValueKind.
This patch fixes the list of subtarget features enabled for the
Cortex-X1C processor, including the following:
* Fix incorrect version used for FeatureRCPC:
* Use FEAT_LRCPC2 instead of FEAT_LRCPC.
* Add missing v8.4-A features included in the TRM:
* Flag Manipulation Instructions - FeatureFlagM (FEAT_FlagM)
* Large System Extension 2 - FeatureLSE2 (FEAT_LSE2)
Reviewed By: vhscampos
Differential Revision: https://reviews.llvm.org/D132120
This change completes the process of replacing OperandValueKind and OperandValueProperties, which were previously passed independently in this API, with a single container class which contains both.
This is the change which motivated the whole sequence which preceded it. In an original spike version of this change, I'd noticed a nasty bug: I'd changed the signature without changing names, and as a result, we silently passed additional information through a callsite which previously dropped the power-of-two fact. This might be harmless in most cases, but at least a couple clearly depended for correctness on not passing that property through.
I did my best to split off prior changes which reduced the scope of this one, and which made it possible to use compiler assistance. For instance, every parameter which changes type in this change also changes name. This was intentional, to make sure that every call site possibly affected must show up in the diff. This let me audit each one closely.
This is part of an ongoing transition to use OperandValueInfo, which combines OperandValueKind and OperandValueProperties. This change adds some accessor methods and uses them to simplify backend code. The primary motivation of doing so is removing uses of the parameters so that an upcoming API change is less error-prone.
A fixed-length SK_Splice shuffle vector is lowered to an EXT under
AArch64, which should have a cost of 1.
Differential Revision: https://reviews.llvm.org/D132299
Defaults to TCK_RecipThroughput - as most explicit calls were assuming TCK_RecipThroughput (vectorizers) or were just doing a before-vs-after comparison (vectorcombiner). Calls via getInstructionCost were just dropping the CostKind, so again there should be no change at this time (as getShuffleCost and its expansions don't use CostKind yet) - but it will make it easier for us to better account for size/latency shuffle costs in inline/unroll passes in the future.
Differential Revision: https://reviews.llvm.org/D132287
Combine the aarch64_neon_pmull and aarch64_neon_pmull64 intrinsics to a
target-specific PMULL SDNode.
How:
1) Add AArch64ISD::PMULL SDNode, and extend aarch64_neon_pmull intrinsic
tablegen pattern for this SDNode.
2) For aarch64_neon_pmull64, canonicalize i64 operands to v1i64 vectors
during legalization.
3) For {aarch64_neon_pmull, aarch64_neon_pmull64}, combine intrinsic to
SDNode.
Why:
1) Adding the SDNode makes it easier to canonicalize i64 inputs (required by
aarch64_neon_pmull64) to vector inputs. Vector inputs carry lane
information, which helps the dag-combiner combine nodes (e.g. rewrite to a
better node to prepare for instruction selection) and helps instruction
selection emit instructions that use higher-half inputs in place
(i.e., no need to move lane 1 content to lane 0).
2) Using the SDNode for aarch64_neon_pmull64 is NFC; without it we
would have to move the definition of {PMULLv1i64, PMULLv2i64} out of
their current group of records for no gain.
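For example, with higher-half inputs used in place (requires a crypto-capable target, e.g. -march=armv8-a+aes; the function name is illustrative):
```
#include <arm_neon.h>
// Both operands come from lane 1, so ISel can emit pmull2 directly
// instead of moving lane 1 content down to lane 0 first.
poly128_t mul_high(poly64x2_t a, poly64x2_t b) {
  return vmull_high_p64(a, b); // pmull2 v0.1q, v0.2d, v1.2d
}
```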
Test cases are commented with what is being tested in
`aarch64-pmull2.ll` and `pmull-ldr-merge.ll` under directory
`llvm/test/CodeGen/AArch64`.
Differential Revision: https://reviews.llvm.org/D131047
This change adds support for lowering llvm.prefetch directly using
GlobalISel. Currently, llvm.prefetch falls back to SelectionDAG.
This change:
- Adds an AArch64-specific G_PREFETCH generic instruction, to be used
where AArch64ISD::PREFETCH is used in SelectionDAG.
- Adds the GINodeEquiv so patterns are translated over to GlobalISel
automatically.
- Corrects the AArch64Prefetch patterns to use a target immediate, which
is needed to get the patterns to translate across correctly.
- Translates the SelectionDAG legalisation of the prefetch intrinsic
into the corresponding GlobalISel legalisation.
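For reference, the source-level pattern this covers (rw/locality values are illustrative):
```
// __builtin_prefetch lowers to llvm.prefetch, which GlobalISel now
// selects via G_PREFETCH instead of falling back to SelectionDAG.
void warm(const char *p) {
  __builtin_prefetch(p, /*rw=*/0, /*locality=*/3); // e.g. prfm pldl1keep
}
```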
Differential Revision: https://reviews.llvm.org/D132043
AdrpAdd fusion is already enabled for "generic" and it helps
the linker apply relocation relaxations for more adrp+add pairs.
This patch enables it for -mtune=neoverse-n1.
Differential revision: https://reviews.llvm.org/D132075
If the incoming previous value of a fixed-order recurrence is a phi in
the header, go through incoming values from the latch until we find a
non-phi value. Use this as the new Previous; all uses in the header
will be dominated by the original phi, but need to be moved after
the non-phi previous value.
At the moment, fixed-order recurrences are modeled as a chain of
first-order recurrences.
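A sketch of such a chain at the source level (hypothetical code): after SSA construction, the previous value of c's recurrence is the header phi for b, which the patch looks through to the latch value x[i]:
```
int chain(const int *x, int n) {
  int b = 0, c = 0, s = 0;
  for (int i = 0; i < n; ++i) {
    s += c;   // uses the value from two iterations back
    c = b;    // its "previous" value is itself a header phi
    b = x[i];
  }
  return s;
}
```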
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D119661
Extends findMoreOptimalIndexType to allow ISD::BUILD_VECTOR based
indices to be truncated when such truncation is lossless. This can
enable the use of 32bit gather/scatter indices thus making it less
likely to have to split a gather/scatter in two.
Depends on D125194
Differential Revision: https://reviews.llvm.org/D130533
TargetLowering had the two last InstructionCost-related members,
`getTypeLegalizationCost()` and `getScalingFactorCost()`, while all
other costs are processed in TTI. For example, it is awkward to use
other TTI members in these two functions when they are overridden in a
target.
Minor refactoring: `getTypeLegalizationCost()` no longer needs a
DataLayout parameter; it was always passed from TTI.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D117723
There was no pattern to fold into these instructions. This patch adds
the pattern obtained from the following ACLE intrinsics so that they
generate sqdmlal/sqdmlsl instructions instead of separate sqdmull and
sqadd/sqsub instructions:
- vqdmlalh_s16, vqdmlslh_s16
- vqdmlalh_lane_s16, vqdmlalh_laneq_s16, vqdmlslh_lane_s16,
vqdmlslh_laneq_s16 (when the lane index is 0)
It also modifies the result of the existing pattern for the latter, when
the lane index is not 0, to use the v1i32_indexed instructions instead
of the v4i16_indexed ones.
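For example, one of the listed intrinsics (illustrative wrapper):
```
#include <arm_neon.h>
// Saturating doubling multiply-accumulate: previously sqdmull + sqadd,
// now a single sqdmlal.
int32_t fused(int32_t acc, int16_t b, int16_t c) {
  return vqdmlalh_s16(acc, b, c);
}
```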
Fixes #49997.
Differential Revision: https://reviews.llvm.org/D131700
Currently, non-temporal loads are mapped to `LDP` or `LDR`. This patch maps 256-bit non-temporal loads to `LDNP`. Future patches should address other non-temporal loads.
Reviewed By: fhahn, dmgreen
Differential Revision: https://reviews.llvm.org/D131773
This reverts commit 38c2366b3f.
This patch seems to break bootstrapping LLVM with `-fglobal-isel -O3`
on AArch64 hardware. Without the revert, there are 500+ test
failures for the `check-llvm-codegen-x86` target.
A bfloat select operation will currently crash, but is allowed from C.
This adds handling for the operation, turning it into a FCSELHrrr if
fullfp16 is present, or converting it to a FCSELSrrr if not. The
FCSELSrrr is created via using INSERT_SUBREG/EXTRACT_SUBREG to convert
the bf16 to a f32 and using the f32 pattern for FCSELSrrr. (I originally
attempted to do this via a tablegen pattern, but it appears that the
nzcv glue is places onto the wrong node, causing it to be forgotten and
incorrect scheduling to be emitted).
The FCSELSrrr can also be used for fp16 selects when +fullfp16 is not
present, which helps avoid an unnecessary promotion to f32.
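The previously crashing case, for reference (illustrative function):
```
// Select between bfloat values; now lowered to FCSELHrrr with
// +fullfp16, or via an f32 FCSELSrrr otherwise.
__bf16 pick(bool cond, __bf16 a, __bf16 b) {
  return cond ? a : b;
}
```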
Differential Revision: https://reviews.llvm.org/D131253
Currently fcopysign for VLS vectors lowers through NEON even when the
vector width is wider than a NEON vector, causing bad codegen as the
vectors are split. This patch causes SVE to be used for these vectors
instead, giving much better codegen on wide VLS vectors.
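A sketch of the kind of VLS code affected (assuming e.g. -march=armv8-a+sve -msve-vector-bits=512):
```
#include <math.h>
// A 512-bit-wide copysign; previously split into NEON-sized pieces,
// now lowered through SVE in one go.
void vls_copysign(float *a, const float *b, const float *c) {
  for (int i = 0; i < 16; ++i)
    a[i] = copysignf(b[i], c[i]);
}
```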
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D128642
After D121595 was committed, I noticed regressions associated with
vectorisation by tail folding of loops with small trip counts when
using scalable vectors. As a solution to those issues I propose to
introduce a minimum trip count threshold value.
Differential Revision: https://reviews.llvm.org/D130755
This patch adds the names of the Arm Architecture Reference Manual (ARM)
features to the corresponding Subtarget Features in the AArch64 backend
and target parser.
The aim of this is to make it clearer what architectural features a
subtarget feature might enable (so, which features a CPU must provide to
support that subtarget feature), and so make it easier to add new CPUs
in the future.
Differential Revision: https://reviews.llvm.org/D131257
The register operand of DBG_VALUE is not selected to a proper register
bank in either AArch64 or X86. This would cause getRegClass to crash
after global ISel. After discussion, we think the MIR should assume all
virtual registers have a proper register class set after global ISel,
so this patch fixes the gap of DBG_VALUE for AArch64 and X86.
Differential Revision: https://reviews.llvm.org/D129037
1. Missing instruction information (FTSSEL, FMSB, PFIRST and RDFFR)
is added and CompleteModel is set to one.
2. Information for pseudo SVE instructions is added. Those
instructions are present at the time of scheduling.
3. Resource and latency information for SVE instructions is modified
to be more accurate.
For example, the description for CMPEQ, which consumes one cycle
each of unit FLA and PPR, is as follows.
```
Previous:
def A64FXGI01 : ProcResGroup<[A64FXIPFLA, A64FXIPPR]>;
def A64FXWrite_4Cyc_GI01 : SchedWriteRes<[A64FXGI01]> {...
Modified:
def A64FXGI0 : ProcResGroup<[A64FXIPFLA]>;
def A64FXGI1 : ProcResGroup<[A64FXIPPR]>;
def A64FXWrite_CMP : SchedWriteRes<[A64FXGI0, A64FXGI1]> {...
```
Reference: A64FX Microarchitecture Manual (Table 16-3)
https://github.com/fujitsu/A64FX/blob/master/doc/A64FX_Microarchitecture_Manual_en_1.7.pdf
Reviewed By: dmgreen, kawashima-fj
Differential Revision: https://reviews.llvm.org/D131165
Add patterns to select predicated instructions when lowering:
fadd(a, select(mask, b, splat(0)))
fsub(a, select(mask, b, splat(0)))
'fadd' is unsafe unless the no-signed-zeros fast-math flag is set, since
-0.0 + 0.0 = 0.0
changes the sign. Alive2: https://alive2.llvm.org/ce/z/wbhJh_
Also adds FMA patterns for:
fadd(a, select(mask, mul(b, c), splat(0))) -> fmla(a, mask, b, c)
fsub(a, select(mask, mul(b, c), splat(0))) -> fmls(a, mask, b, c)
These patterns require the 'contract' fast-math flag to be set, and the
fadd 'nsz' flag as above.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D130564
NOTE: i8 vector splats are ignored because the immediate range of
DUP already has full coverage.
Differential Revision: https://reviews.llvm.org/D131078
1) Overloaded (instruction-based) method is a wrapper around the current (opcode-based) method.
2) This patch also changes a few callsites (VectorCombine.cpp,
SLPVectorizer.cpp, CodeGenPrepare.cpp) to call the overloaded method.
3) This is a split of D128302.
Differential Revision: https://reviews.llvm.org/D131114
This patch ensures consistency in the construction of FP_ROUND nodes
such that they always use ISD::TargetConstant instead of ISD::Constant.
This additionally fixes a bug in the AArch64 SVE backend where patterns
were matching against TargetConstant nodes and sometimes failing when
passed a Constant node.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D130370
rGcf97e0ec42b8 made $x18 be treated as callee-saved in functions with
the Windows calling convention on non-Windows OSes.
Here we mark $x18 as callee-saved for functions with the Windows calling
convention on Darwin, as well as on other non-Windows platforms, in
order to prevent some miscompilations (like the miscompilation of
win64cc-darwin-backup-x18.ll).
Since getCalleeSavedRegs doesn't return x18 in the list of callee-saved
registers, assignCalleeSavedSpillSlots and determineCalleeSaves
consider different sets of registers as callee-saved. This causes an
error:
```
Assertion failed: ((!HasCalleeSavedStackSize || getCalleeSavedStackSize() == Size) && "Invalid size calculated for callee saves"), function getCalleeSavedStackSize, file
AArch64MachineFunctionInfo.h, line 292.
```
Differential Revision: https://reviews.llvm.org/D130676
This folds a v4i32 Mul(And(Srl(X, 15), 0x10001), 0xffff) into a v8i16
CMLTz instruction. The Srl and And extract the top bit (whether the
input is negative) and the Mul sets all bits in the i16 half to
1/0 depending on whether that top bit was set. This is equivalent to a v8i16
CMLTz instruction. The same applies to other sizes with equivalent
constants.
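A scalar model of one v4i32 lane may help (hypothetical helper; each 32-bit lane holds two i16 halves):
```
#include <cstdint>
// ((x >> 15) & 0x10001) * 0xffff puts 0xffff into each 16-bit half
// whose sign bit was set, and 0 otherwise, i.e. exactly a
// compare-less-than-zero on the two i16 halves.
uint32_t cmltz_model(uint32_t x) {
  return ((x >> 15) & 0x10001u) * 0xffffu;
}
```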
Differential Revision: https://reviews.llvm.org/D130874
If we have interleave groups in the loop we want to vectorise then
we should fall back on normal vectorisation with a scalar epilogue. In
such cases when tail-folding is enabled we'll almost certainly go on to
create vplans with very high costs for all vector VFs and fall back on
VF=1 anyway. This is likely to be worse than if we'd just used an
unpredicated vector loop in the first place.
Once the vectoriser has proper support for analysing all the costs
for each combination of VF and vectorisation style, then we should
be able to remove this.
Added an extra test here:
Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll
Differential Revision: https://reviews.llvm.org/D128342
2xi64 is the legalized type for wide reductions (like 16xi64) and setting the
cost to 2 makes `load-reduce` and `load-zext-reduce` patterns profitable.
The few performance measurements that I did on an AArch64 machine confirm that
these patterns are actually faster when vectorized.
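One shape of the load-zext-reduce pattern, for illustration (hypothetical function):
```
#include <cstdint>
// A 16 x i32 -> i64 widening sum; the <16 x i64> reduction legalizes
// to operations on 2 x i64, which the new cost of 2 makes profitable
// to vectorize.
uint64_t sum_widen(const uint32_t *p) {
  uint64_t s = 0;
  for (int i = 0; i < 16; ++i)
    s += p[i];
  return s;
}
```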
Differential Revision: https://reviews.llvm.org/D130740
A build vector of two extracted elements is equivalent to an extract
subvector where the inner vector is any-extended to the
extract_vector_elt VT, because extract_vector_elt has the effect of an
any-extend.
(build_vector (extract_elt_i16_to_i32 vec Idx+0) (extract_elt_i16_to_i32 vec Idx+1))
=> (extract_subvector (anyext_i16_to_i32 vec) Idx)
Depends on D130697
Differential Revision: https://reviews.llvm.org/D130698