Currenlty all temporal loads are mapped to `LDP` or `LDR`. This patch will map all the non temporal 256-bit loads into `LDNP`. Future patches should address other non-temporal loads.
Reviewed By: fhahn, dmgreen
Differential Revision: https://reviews.llvm.org/D131773
A bfloat select operation will currently crash, but is allowed from C.
This adds handling for the operation, turning it into a FCSELHrrr if
fullfp16 is present, or converting it to a FCSELSrrr if not. The
FCSELSrrr is created via using INSERT_SUBREG/EXTRACT_SUBREG to convert
the bf16 to a f32 and using the f32 pattern for FCSELSrrr. (I originally
attempted to do this via a tablegen pattern, but it appears that the
nzcv glue is places onto the wrong node, causing it to be forgotten and
incorrect scheduling to be emitted).
The FCSELSrrr can also be used for fp16 selects when +fullfp16 is not
present, which helps avoid an unnecessary promotion to f32.
Differential Revision: https://reviews.llvm.org/D131253
Currently fcopysign for VLS vectors lowers through NEON even when the
vector width is wider than a NEON vector, causing bad codegen as the
vectors are split. This patch causes SVE to be used for these vectors
instead, giving much better codegen on wide VLS vectors.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D128642
This patch ensures consistency in the construction of FP_ROUND nodes
such that they always use ISD::TargetConstant instead of ISD::Constant.
This additionally fixes a bug in the AArch64 SVE backend where patterns
were matching against TargetConstant nodes and sometimes failing when
passed a Constant node.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D130370
This folds a v4i32 Mul(And(Srl(X, 15), 0x10001), 0xffff) into a v8i16
CMLTz instruction. The Srl and And extract the top bit (whether the
input is negative) and the Mul sets all values in the i16 half to all
1/0 depending on if that top bit was set. This is equivalent to a v8i16
CMLTz instruction. The same applies to other sizes with equivalent
constants.
Differential Revision: https://reviews.llvm.org/D130874
A build vector of two extracted elements is equivalent to an extract
subvector where the inner vector is any-extended to the
extract_vector_elt VT, because extract_vector_elt has the effect of an
any-extend.
(build_vector (extract_elt_i16_to_i32 vec Idx+0) (extract_elt_i16_to_i32 vec Idx+1))
=> (extract_subvector (anyext_i16_to_i32 vec) Idx)
Depends on D130697
Differential Revision: https://reviews.llvm.org/D130698
Without this, the intrinsic will be expanded to an integer; thereby an
explicit copy (from GPR to SIMD register) will be codegen'd. This matches the
general convention of using "v1" types to represent scalar integer operations in
vector registers.
The similar approach is observed in D56616, and the pattern likely applies on
other intrinsic that accepts integer scalars (e.g.,
int_aarch64_neon_sqdmulls_scalar)
Differential Revision: https://reviews.llvm.org/D130548
This adds similar heuristics to G_GLOBAL_VALUE, querying the cost of
materializing a specific constant in code size. Doing so prevents us from
sinking constants which require multiple instructions to generate into
use blocks.
Code size savings on CTMark -Os:
Program size.__text
before after diff
ClamAV/clamscan 381940.00 382052.00 0.0%
lencod/lencod 428408.00 428428.00 0.0%
SPASS/SPASS 411868.00 411876.00 0.0%
kimwitu++/kc 449944.00 449944.00 0.0%
Bullet/bullet 463588.00 463556.00 -0.0%
sqlite3/sqlite3 284696.00 284668.00 -0.0%
consumer-typeset/consumer-typeset 414492.00 414424.00 -0.0%
7zip/7zip-benchmark 595244.00 594972.00 -0.0%
mafft/pairlocalalign 247512.00 247368.00 -0.1%
tramp3d-v4/tramp3d-v4 372884.00 372044.00 -0.2%
Geomean difference -0.0%
Differential Revision: https://reviews.llvm.org/D130554
This patch starts small, only detecting sequences of the form
<a, a+n, a+2n, a+3n, ...> where a and n are ConstantSDNodes.
Differential Revision: https://reviews.llvm.org/D125194
This helps fold away the ptest instructions, which needs the knowledge on whether
the general predicate is known to zero the inactive lanes.
This fixes some PTEST regressions introduced by D129282.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D129852
Due to the way fixed length SVE lowering works, we sometimes introduce
ext/trunc nodes very late, these nodes then immediately get converted
into target specific nodes (UUNPKLO/UZP1) before they get a chance to be
folded into a load/store.
This patch introduces target specific dag combines for these nodes so that
we can still create extending loads/truncating stores out of them.
Differential Revision: https://reviews.llvm.org/D128065
This patch recognizes f16 immediates as legal and adds the necessary
patterns. This allows the fadda folding introduced in 05d424d165
to be applied to the f16 cases.
Differential Revision: https://reviews.llvm.org/D129989
As noticed on D129765 and reported on Issue #56531 - aarch64 targets can use the neon ctpop + add-reduce instructions to speed up scalar ctpop instructions, but we fail to do this for parity calculations.
I'm not sure where the cutoff should be for specific CPUs, but i64 (+ i128 special case) shows a definite reduction in instruction count. i32 is about the same (but scalar <-> neon transfers are probably more costly?), and sub-i32 promotion looks to be a definite regression compared to parity expansion optimized for those widths.
Differential Revision: https://reviews.llvm.org/D130246
This patch lowers
duplane128(insert_subvector(undef, bitcast(op(128bitsubvec)), 0), 0)
to
bitcast(duplane128(insert_subvector(undef, op(128bitsubvec), 0), 0)).
This enables floating-point loads to match patterns added in
https://reviews.llvm.org/D130010
Differential Revision: https://reviews.llvm.org/D130013
The "xor (X >> ShiftC), XorC --> (not X) >> ShiftC" fold is currently limited to the XOR mask being a shifted all-bits mask, but we can relax this to only need to match under the demanded bits.
This helps expose more bit extraction/clearing patterns and fixes the PowerPC testCompares*.ll regressions from D127115
Alive2: https://alive2.llvm.org/ce/z/fl7T7K
Differential Revision: https://reviews.llvm.org/D129933
As mentioned on D127115, this patch that attempts to recognise shuffle masks that could be simplified to a AND mask - we already have a similar transform that will fold AND -> 'clear mask' shuffle, but this patch handles cases where the referenced elements are not from the same lane indices but are known to be zero.
Differential Revision: https://reviews.llvm.org/D129150
Currently any legal predicate types will be pattern-matched when
creating a PTEST instruction. This could be a problem in future since
PTEST always uses the .B specifier for the operand, but it is not
always guaranteed that the extra lanes of unpacked types (e.g. nxv4i1)
are zero. This patch ensures the operands of PTEST are type nxv16i1,
where the undef lanes are set to zero.
Differential Revision: https://reviews.llvm.org/D129282/
Try to re-use an already extended operand for SetCC with vector operands
feeding an extended select. Doing so avoids requiring another full
extension of the SET_CC result when lowering the select.
This improves lowering for certain extend/cmp/select patterns operating.
For example with v16i8, this replaces 6 instructions for the extra extension
with 4 separate selects.
This improves the generated code for loops like the one below in
combination with D96522.
int foo(uint8_t *p, int N) {
unsigned long long sum = 0;
for (int i = 0; i < N ; i++, p++) {
unsigned int v = *p;
sum += (v < 127) ? v : 256 - v;
}
return sum;
}
https://clang.godbolt.org/z/Wco866MjY
On the AArch64 cores I have access to, the patch improves performance of
the vector loop by ~10%.
This could be generalized per follow-ups, but the initial version
targets one of the more important cases in combination with D96522.
Alive2 modeling:
* sext EQ https://alive2.llvm.org/ce/z/5upBvb
* sext NE https://alive2.llvm.org/ce/z/zbEcJp
* zext EQ https://alive2.llvm.org/ce/z/_xMwof
* zext NE https://alive2.llvm.org/ce/z/5FwKfc
* zext unsigned predicate: https://alive2.llvm.org/ce/z/iEwLU3
* sext signed predicate: https://alive2.llvm.org/ce/z/aMBega
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D120481
This is done during type legalization since the target representation of
these nodes may not be valid until after type legalization, and after
type legalization the fact that these are dealing with i1 types may be
lost.
Differential Revision: https://reviews.llvm.org/D128996
When the predicate vector is unpacked, we cannot assume anything about the
values in the other lanes. We have to make sure we use the correct
predicate where we know that the other lanes have been zeroed.
Reviewed By: RosieSumpter
Differential Revision: https://reviews.llvm.org/D129081
This patch adds new the following SME intrinsics:
@llvm.aarch64.sme.addva
@llvm.aarch64.sme.addha
Differential Revision: https://reviews.llvm.org/D127861
This patch puts the code to safely bitcast a predicate, and possibly zero
any undefined lanes when doing a widening cast, into one place and merges
the functionality with lowerConvertToSVBool.
This is some cleanup inspired by D128665.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D128926
One motivation to add support for these types are the LD1Q/ST1Q
instructions in SME, for which we have defined a number of load/store
intrinsics which at the moment still take a `<vscale x 16 x i1>` predicate
regardless of their element type.
This patch adds basic support for the nxv1i1 type such that it can be passed/returned
from functions, as well as some basic support to support some existing tests that
result in a nxv1i1 type. It also adds support for splats.
Other operations (e.g. insert/extract subvector, logical ops, etc) will be
supported in follow-up patches.
Reviewed By: paulwalker-arm, efriedma
Differential Revision: https://reviews.llvm.org/D128665
As described in aapcs64 (https://github.com/ARM-software/abi-aa/blob/2022Q1/aapcs64/aapcs64.rst#scalable-vector-registers)
AAVPCS is used only when registers z0-z7 take an SVE argument. This fixes the case where floats occupy the lower bits
of registers z0-z7 but SVE arguments in registers greater than z7 cause a function to use AAVPCS where it should use AAPCS.
Moving SVE function deduction from AArch64RegisterInfo::hasSVEArgsOrReturn to AArch64TargetLowering::LowerFormalArguments
where physical register lowering is more accurate fixes this.
Differential Revision: https://reviews.llvm.org/D127209
This helps ISel decompose the generic offset for the tile into a base + offset.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D128508
When the SME feature is enabled we also gain access to a few extra
SVE2 instructions. This patch adds LLVM IR intrinsics to make use
of these new instructions:
@llvm.aarch64.sve.psel
@llvm.aarch64.sve.revd
@llvm.aarch64.sve.sclamp
@llvm.aarch64.sve.uclamp
Differential Revision: https://reviews.llvm.org/D128332
This patch adds the following intrinsics to support the SME ACLE:
* @llvm.aarch64.sme.mopa: Non-widening outer product + accumulate
* @llvm.aarch64.sme.mops: Non-widening outer product + subtract
* @llvm.aarch64.sme.mopa.wide: Widening outer product + accumulate
* @llvm.aarch64.sme.mops.wide: Widening outer product + subtract
* @llvm.aarch64.sme.smopa.wide: Widening signed sum of outer product + accumulate
* @llvm.aarch64.sme.smops.wide: Widening signed sum of outer product + subtract
* @llvm.aarch64.sme.umopa.wide: Widening unsigned sum of outer product + accumulate
* @llvm.aarch64.sme.umops.wide: Widening unsigned sum of outer product + subtract
* @llvm.aarch64.sme.sumopa.wide: Widening signed by unsigned sum of outer product + accumulate
* @llvm.aarch64.sme.sumops.wide: Widening signed by unsigned sum of outer product + subtract
* @llvm.aarch64.sme.usmopa.wide: Widening unsigned by signed sum of outer product + accumulate
* @llvm.aarch64.sme.usmops.wide: Widening unsigned by signed sum of outer product + subtract
Differential Revision: https://reviews.llvm.org/D127956
Given a vector add or sub from extends that needs more that one 'step'
(i.e i8 to i32 or i16 to i64), we can transform the sequence to
sext(add(ext, ext)), to allow the add(ext, ext) to become a single uaddl
and a larger extend, producing less instructions in total.
https://alive2.llvm.org/ce/z/S2T4k-
Differential Revision: https://reviews.llvm.org/D128426
This patch adds support for:
* Querying the PSTATE.SM state with @llvm.aarch64.sme.get.pstatesm
* Reading/writing the TPIDR2 register with new
@llvm.aarch64.sme.get.tpidr2 and @llvm.aarch64.sme.set.tpidr2
intrinsics.
Tests added here:
CodeGen/AArch64/sme-get-pstatesm.ll
CodeGen/AArch64/sme-read-write-tpidr2.ll
Differential Revision: https://reviews.llvm.org/D127957
This enables an existing transformation that when combined with an
add will emit saba/uaba instructions.
Differential Revision: https://reviews.llvm.org/D128198
As a followup to D128144, this adds extract(DUP(C)) as a canonical
constant to prevent it being transformed back into a BUILD_VECTOR,
leading to an infinite loop.
An AArch64ISD::DUP is just a splat, where the known bits for each lane
are the same as the input. This teaches that to computeKnownBitsForTargetNode.
Problems arise for constants though, as a constant BUILD_VECTOR can be
lowered to an AArch64ISD::DUP, which SimplifyDemandedBits would then
turn back into a constant BUILD_VECTOR leading to an infinite cycle.
This has been prevented by adding a isTargetCanonicalConstantNode node
to prevent the conversion back into a BUILD_VECTOR.
Differential Revision: https://reviews.llvm.org/D128144
The SME zero instruction takes a mask as an input declaring which
64-bit element tiles should be zeroed. There is a 1:1 mapping
between the zero intrinsic and the instruction, however we also
want to make the register allocator aware that some tile
registers are being written to.
We can actually just use the custom inserter for a pseudo instruction
to correctly mark all the appropriate registers in the mask as
implicitly defined by the operation.
Differential Revision: https://reviews.llvm.org/D127843