Currently any legal predicate types will be pattern-matched when
creating a PTEST instruction. This could be a problem in future since
PTEST always uses the .B specifier for the operand, but it is not
always guaranteed that the extra lanes of unpacked types (e.g. nxv4i1)
are zero. This patch ensures the operands of PTEST are type nxv16i1,
where the undef lanes are set to zero.
Differential Revision: https://reviews.llvm.org/D129282/
Try to re-use an already extended operand for SetCC with vector operands
feeding an extended select. Doing so avoids requiring another full
extension of the SET_CC result when lowering the select.
This improves lowering for certain extend/cmp/select patterns.
For example with v16i8, this replaces the 6 instructions needed for the
extra extension with 4 separate selects.
This improves the generated code for loops like the one below in
combination with D96522.
#include <stdint.h>

int foo(uint8_t *p, int N) {
  unsigned long long sum = 0;
  for (int i = 0; i < N; i++, p++) {
    unsigned int v = *p;
    sum += (v < 127) ? v : 256 - v;
  }
  return sum;
}
https://clang.godbolt.org/z/Wco866MjY
On the AArch64 cores I have access to, the patch improves performance of
the vector loop by ~10%.
This could be generalized per follow-ups, but the initial version
targets one of the more important cases in combination with D96522.
Alive2 modeling:
* sext EQ https://alive2.llvm.org/ce/z/5upBvb
* sext NE https://alive2.llvm.org/ce/z/zbEcJp
* zext EQ https://alive2.llvm.org/ce/z/_xMwof
* zext NE https://alive2.llvm.org/ce/z/5FwKfc
* zext unsigned predicate: https://alive2.llvm.org/ce/z/iEwLU3
* sext signed predicate: https://alive2.llvm.org/ce/z/aMBega
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D120481
This is done during type legalization since the target representation of
these nodes may not be valid until after type legalization, and after
type legalization the fact that these are dealing with i1 types may be
lost.
Differential Revision: https://reviews.llvm.org/D128996
When the predicate vector is unpacked, we cannot assume anything about the
values in the other lanes. We have to make sure we use the correct
predicate where we know that the other lanes have been zeroed.
Reviewed By: RosieSumpter
Differential Revision: https://reviews.llvm.org/D129081
This patch adds the following new SME intrinsics:
@llvm.aarch64.sme.addva
@llvm.aarch64.sme.addha
Differential Revision: https://reviews.llvm.org/D127861
This patch puts the code to safely bitcast a predicate, and possibly zero
any undefined lanes when doing a widening cast, into one place and merges
the functionality with lowerConvertToSVBool.
This is some cleanup inspired by D128665.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D128926
One motivation to add support for these types are the LD1Q/ST1Q
instructions in SME, for which we have defined a number of load/store
intrinsics which at the moment still take a `<vscale x 16 x i1>` predicate
regardless of their element type.
This patch adds basic support for the nxv1i1 type so that it can be passed to
and returned from functions, as well as enough support to handle some existing
tests that result in a nxv1i1 type. It also adds support for splats.
Other operations (e.g. insert/extract subvector, logical ops, etc) will be
supported in follow-up patches.
Reviewed By: paulwalker-arm, efriedma
Differential Revision: https://reviews.llvm.org/D128665
As described in AAPCS64 (https://github.com/ARM-software/abi-aa/blob/2022Q1/aapcs64/aapcs64.rst#scalable-vector-registers),
AAVPCS is used only when registers z0-z7 take an SVE argument. This fixes the case where floats occupy the lower bits
of registers z0-z7 but SVE arguments in registers greater than z7 cause a function to use AAVPCS where it should use AAPCS.
The fix is to move SVE function deduction from AArch64RegisterInfo::hasSVEArgsOrReturn to
AArch64TargetLowering::LowerFormalArguments, where physical register lowering is more accurate.
Differential Revision: https://reviews.llvm.org/D127209
This helps ISel decompose the generic offset for the tile into a base + offset.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D128508
When the SME feature is enabled we also gain access to a few extra
SVE2 instructions. This patch adds LLVM IR intrinsics to make use
of these new instructions:
@llvm.aarch64.sve.psel
@llvm.aarch64.sve.revd
@llvm.aarch64.sve.sclamp
@llvm.aarch64.sve.uclamp
Differential Revision: https://reviews.llvm.org/D128332
This patch adds the following intrinsics to support the SME ACLE:
* @llvm.aarch64.sme.mopa: Non-widening outer product + accumulate
* @llvm.aarch64.sme.mops: Non-widening outer product + subtract
* @llvm.aarch64.sme.mopa.wide: Widening outer product + accumulate
* @llvm.aarch64.sme.mops.wide: Widening outer product + subtract
* @llvm.aarch64.sme.smopa.wide: Widening signed sum of outer product + accumulate
* @llvm.aarch64.sme.smops.wide: Widening signed sum of outer product + subtract
* @llvm.aarch64.sme.umopa.wide: Widening unsigned sum of outer product + accumulate
* @llvm.aarch64.sme.umops.wide: Widening unsigned sum of outer product + subtract
* @llvm.aarch64.sme.sumopa.wide: Widening signed by unsigned sum of outer product + accumulate
* @llvm.aarch64.sme.sumops.wide: Widening signed by unsigned sum of outer product + subtract
* @llvm.aarch64.sme.usmopa.wide: Widening unsigned by signed sum of outer product + accumulate
* @llvm.aarch64.sme.usmops.wide: Widening unsigned by signed sum of outer product + subtract
Differential Revision: https://reviews.llvm.org/D127956
Given a vector add or sub from extends that needs more than one 'step'
(e.g. i8 to i32 or i16 to i64), we can transform the sequence to
sext(add(ext, ext)), allowing the add(ext, ext) to become a single uaddl
and a larger extend, producing fewer instructions in total.
https://alive2.llvm.org/ce/z/S2T4k-
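For illustration, here is a C sketch (mine, not taken from the patch or its tests) of the kind of source that needs a two-step extend:

#include <stdint.h>

// Each a[i] + b[i] needs i8 -> i32, i.e. more than one extension step.
// With the combine above, the add is done at i16 as a single uaddl and
// the remaining i16 -> i32 extend is applied once to the sum.
void widen_add(uint32_t *dst, const uint8_t *a, const uint8_t *b, int n) {
  for (int i = 0; i < n; i++)
    dst[i] = (uint32_t)a[i] + (uint32_t)b[i];
}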
Differential Revision: https://reviews.llvm.org/D128426
This patch adds support for:
* Querying the PSTATE.SM state with @llvm.aarch64.sme.get.pstatesm
* Reading/writing the TPIDR2 register with new
@llvm.aarch64.sme.get.tpidr2 and @llvm.aarch64.sme.set.tpidr2
intrinsics.
Tests added here:
CodeGen/AArch64/sme-get-pstatesm.ll
CodeGen/AArch64/sme-read-write-tpidr2.ll
Differential Revision: https://reviews.llvm.org/D127957
This enables an existing transformation that when combined with an
add will emit saba/uaba instructions.
Differential Revision: https://reviews.llvm.org/D128198
As a follow-up to D128144, this adds extract(DUP(C)) as a canonical
constant to prevent it from being transformed back into a BUILD_VECTOR,
which would lead to an infinite loop.
An AArch64ISD::DUP is just a splat, where the known bits for each lane
are the same as the input. This teaches that to computeKnownBitsForTargetNode.
Problems arise for constants though, as a constant BUILD_VECTOR can be
lowered to an AArch64ISD::DUP, which SimplifyDemandedBits would then
turn back into a constant BUILD_VECTOR leading to an infinite cycle.
This has been prevented by adding an isTargetCanonicalConstantNode callback
to prevent the conversion back into a BUILD_VECTOR.
Differential Revision: https://reviews.llvm.org/D128144
The SME zero instruction takes a mask as an input declaring which
64-bit element tiles should be zeroed. There is a 1:1 mapping
between the zero intrinsic and the instruction, however we also
want to make the register allocator aware that some tile
registers are being written to.
We can actually just use the custom inserter for a pseudo instruction
to correctly mark all the appropriate registers in the mask as
implicitly defined by the operation.
Differential Revision: https://reviews.llvm.org/D127843
Similar to the existing (shl (srl x, c1), c2) fold
Part of the work to fix the regressions in D77804
Differential Revision: https://reviews.llvm.org/D125836
These intrinsics return the number of elements in a streaming
vector, for example aarch64.sme.cntsw returns the number of
32-bit elements. When in streaming mode these are equivalent
to aarch64.sve.cntb/h/w/d with an input value of 1.
I have implemented these intrinsics using the rdsvl instruction
and added tests here:
CodeGen/AArch64/SME/sme-intrinsics-rdsvl.ll
Differential Revision: https://reviews.llvm.org/D127853
This patch adds implementations for the read/write SME ACLE intrinsics:
@llvm.aarch64.sme.read.horiz
@llvm.aarch64.sme.read.vert
@llvm.aarch64.sme.write.horiz
@llvm.aarch64.sme.write.vert
These all map to the SME mova instruction.
Differential Revision: https://reviews.llvm.org/D127414
This patch adds implementations for the fill/spill SME ACLE intrinsics:
@llvm.aarch64.sme.ldr
@llvm.aarch64.sme.str
Differential Revision: https://reviews.llvm.org/D127317
This patch adds implementations for the load/store SME ACLE intrinsics:
- @llvm.aarch64.sme.ld1*
- @llvm.aarch64.sme.st1*
Differential Revision: https://reviews.llvm.org/D127210
As a follow-up to D126686, this does the same fold for floating point
add and shuffle. In this case it is limited to reassoc adds, matching either
x[0]+x[1] or x[1]+x[0] for both result[0] and result[1].
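For reference, a scalar sketch of the pattern (my own illustration, not from the patch):

// Both output lanes hold the sum of the same pair, so a single faddp can
// produce it; per the description above, the fadds need the reassoc flag.
void pairwise_sum(float out[2], const float x[2]) {
  out[0] = x[0] + x[1];
  out[1] = x[1] + x[0];
}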
Differential Revision: https://reviews.llvm.org/D127087
This was assuming the vector types were MVTs, but they don't have to be.
Note that the concrete output of the test isn't very useful, since it's
dominated by nonsensical calling convention lowering for the weird types.
Differential Revision: https://reviews.llvm.org/D126505
Bitcasting between unpacked scalable vector types of different
element counts is not a NOP because the live elements are laid out
differently.
01234567
e.g. nxv2i32 = XX??XX??
nxv4f16 = X?X?X?X?
Differential Revision: https://reviews.llvm.org/D126957
I can't remove the function just yet as it is used in the generated .inc files.
I would also like to provide a way to compare alignment with TypeSize since it came up a few times.
Differential Revision: https://reviews.llvm.org/D126910
This adds a fold of add(x, shuffle(x, <1,0,3,2,5,4,...>)) into
shuffle(addp(x), <0,0,1,1,2,2,..>). The ADDP instruction takes two
vectors and returns one, adding adjacent pairs. So we match x in a
custom combine as it is lowered from a v8i32. The original code
would be 2 rev64 and 2 add, with the new code being a single addp
with a zip1;zip2 shuffle, producing smaller code.
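A scalar C illustration (mine, not from the patch) of the matched pattern:

#include <stdint.h>

// Lane i of the result is x[i] + x[i^1], i.e. add(x, shuffle(x, <1,0,3,2,...>)):
// every adjacent pair is summed into both of its lanes.
void pair_sums(int32_t out[8], const int32_t x[8]) {
  for (int i = 0; i < 8; i++)
    out[i] = x[i] + x[i ^ 1];
}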
Differential Revision: https://reviews.llvm.org/D126686
Patch enables custom lowering for MVT::nxv4bf16 because otherwise
the refactored test file triggers a selection failure.
The reason for the refactoring is to highlight cases where the
generated code is wrong.
LowerINSERT_SUBVECTOR emits AArch64ISD::UUNPK## when lowering
scalable vector floating point INSERT_SUBVECTOR. However, these
nodes only make sense for integer types and thus isel patterns do
not exist for floating point, which leads to isel failures.
This patch ensures floating point operands are cast to integer
before the core lowering takes place.
Fixes: #55037
Differential Revision: https://reviews.llvm.org/D126487
Following discussion on D120261 and D121208 it seems better to remove the
concept of Streaming SVE from the subtarget/assembler predicates and
instead reason about 'SVE' and 'SME' as higher-level features, rather
than trying to model this runtime mode through explicit feature flags.
This patch is largely NFC.
Reviewed By: paulwalker-arm, david-arm
Differential Revision: https://reviews.llvm.org/D125977
If both a v2i32 DUP(x) and a v4i32 DUP(x) node exists, we can re-use the
larger node using a vector extract to obtain the smaller. This comes up
in the smull/smlal code, but needs a small fixup to allow the smull2
code in tryExtendDUPToExtractHigh/performAddSubLongCombine to still
match smull2 extracts.
Differential Revision: https://reviews.llvm.org/D126449
If the fma operates on a legal vector type, the indexed variants can be
used, if the second operand is a splat of a valid index.
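As an illustration (my own sketch, not from the patch), code where the second fma operand is a splat of one vector lane:

// b[2] is loop-invariant, so the vectorised multiply's second operand is a
// splat of lane 2, letting the indexed FMLA (fmla v.4s, v.4s, v.s[2]) be used.
void fma_by_lane(float *dst, const float *a, const float b[4], int n) {
  for (int i = 0; i < n; i++)
    dst[i] += a[i] * b[2];
}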
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D126234
Currently reassociating add expressions can lead to failing to select
(u|s)mlal. Implement isReassocProfitable to skip reassociating
expressions that can be lowered to (u|s)mlal.
The same issue exists for the *mlsl variants as well, but the DAG
combiner doesn't use the isReassocProfitable hook before reassociating.
To be fixed in a follow-up commit as this requires DAGCombiner changes
as well.
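A sketch (mine, not from the patch) of a source loop whose add chain should not be reassociated away from smlal:

#include <stdint.h>

// acc + sext(a[i]) * sext(b[i]) maps onto smlal; reassociating the adds can
// separate the accumulate from the widening multiply and lose the selection.
int32_t dot16(const int16_t *a, const int16_t *b, int n) {
  int32_t acc = 0;
  for (int i = 0; i < n; i++)
    acc += (int32_t)a[i] * (int32_t)b[i];
  return acc;
}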
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D125895
Lower fixed length MGATHER/MSCATTER operations to scalable vector
equivalents, which are then lowered to SVE specific nodes. This
two stage process is in preparation for making scalable vector
MGATHER/MSCATTER operations legal.
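For illustration (my own example, not from the patch's tests), a loop that fixed-length vectorisation turns into an MGATHER on a fixed vector type:

#include <stdint.h>

// Each vectorised block gathers base[idx[i]] for a fixed number of lanes;
// that fixed-length MGATHER is now wrapped in scalable containers and then
// lowered to the SVE-specific gather nodes.
void gather8(int32_t *dst, const int32_t *base, const int32_t *idx) {
  for (int i = 0; i < 8; i++)
    dst[i] = base[idx[i]];
}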
Differential Revision: https://reviews.llvm.org/D125192
This patch implements a target-specific optimization that replaces
the cmp and csel from cttz with an and mask.
Recommitted with a fix for truncated value sizes.
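As an illustration, one guarded-cttz idiom that plausibly benefits; this is my own assumption about the shape of the pattern, not taken from the patch:

#include <stdint.h>

// On AArch64, cttz lowers to rbit + clz, and clz of 0 is the bit width (32).
// Since 32 & 31 == 0, the x == 0 guard below can be covered by a single
// 'and' with 31 instead of a cmp + csel.
uint32_t guarded_ctz(uint32_t x) {
  return x ? (uint32_t)__builtin_ctz(x) : 0u;
}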
Differential Revision: https://reviews.llvm.org/D123782
A TBL instruction will fill out-of-range values with 0's, something used
in D121139 to turn tbl2 with a zero input into tbl1s. This works OK for
v16i8, but for v8i8 the input is still treated as a v16i8, so
out-of-range values (like a lane index of 8) would end up loading values
from the top half of the input register. Clean this up by detecting such
lane indices and making sure they use values that are genuinely out of range.
There is a fix for swapped indices of 64bit input vectors too, which
could be incorrectly adjusted if the zerovector was the first operand.
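For reference, a scalar C model (my own illustration) of the TBL semantics being relied on:

#include <stdint.h>

// TBL indexes the whole 16-byte table register and only yields 0 for indices
// past its end. For a v8i8 input, index 8 still lands inside the register, so
// it has to be remapped to a genuinely out-of-range value to get the zeroing.
uint8_t tbl_lane(const uint8_t table[16], uint8_t index) {
  return index < 16 ? table[index] : 0;
}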
Fixes #55545
Differential Revision: https://reviews.llvm.org/D125865
Most clients only used these methods because they wanted to be able to
extend or truncate to the same bit width (which is a no-op). Now that
the standard zext, sext and trunc allow this, there is no reason to use
the OrSelf versions.
The OrSelf versions additionally have the strange behaviour of allowing
extending to a *smaller* width, or truncating to a *larger* width, which
are also treated as no-ops. A small amount of client code relied on this
(ConstantRange::castOp and MicrosoftCXXNameMangler::mangleNumber) and
needed rewriting.
Differential Revision: https://reviews.llvm.org/D125557
Similar to D123386, this adds D-Movs to the AArch64 perfect shuffle
tables, lowering the costs a little more. This is a rough improvement
in general, especially if you ignore mov v0.16b, v2.16b type
moves that are often artefacts of the calling convention.
The D register movs are encoded as (0x4 | LaneIdx), and to generate a D
register move we are required to bitcast into a higher type, but it is
otherwise very similar to the S-lane movs already supported.
Differential Revision: https://reviews.llvm.org/D125477
If we're using shift pairs to mask, then relax the one use limit if the shift amounts are equal - we'll only be generating a single AND node.
AArch64 has a couple of regressions due to this, so I've enforced the existing one use limit inside an AArch64TargetLowering::shouldFoldConstantShiftPairToMask callback.
Part of the work to fix the regressions in D77804
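For reference, the masking equivalence in question (an illustrative sketch, not from the patch):

#include <stdint.h>

// When the two shift amounts are equal, the shift pair just clears the low
// c bits, i.e. it is a single AND with a constant mask.
uint32_t mask_low_bits(uint32_t x) {
  const unsigned c = 4;
  return (x >> c) << c; // same as x & 0xFFFFFFF0u
}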
Differential Revision: https://reviews.llvm.org/D125607
NEON does not have native support for xor reductions. However, when
reducing predicate vectors the operation is synonymous with an add
reduction that is supported.
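A scalar illustration (mine) of why an add reduction can stand in for the xor reduction on i1 values:

#include <stdbool.h>

// XOR of booleans is addition modulo 2, so xor-reducing a predicate vector
// equals bit 0 of the sum of its lanes.
bool xor_reduce(const bool *p, int n) {
  unsigned sum = 0;
  for (int i = 0; i < n; i++)
    sum += p[i]; // add reduction
  return sum & 1; // == p[0] ^ p[1] ^ ... ^ p[n-1]
}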
Differential Revision: https://reviews.llvm.org/D125605
During early gather/scatter enablement two different approaches
were taken to represent scaled indices:
* A Scale operand whereby byte_offsets = Index * Scale
* An IndexType whereby byte_offsets = Index * sizeof(MemVT.ElementType)
Having multiple representations is bad as shown by this patch which
fixes instances where the two are out of sync. The dedicated scale
operand is more flexible and pervasive so this patch removes the
UNSCALED values from IndexType. This means all indices are scaled
but the scale can be one, hence unscaled. SDNodes now use the scale
operand to answer the "isScaledIndex" question.
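A small sketch (mine, not from the patch) of the addressing rule, where an 'unscaled' index is simply Scale == 1:

#include <stdint.h>

// byte_offset = Index * Scale; e.g. gathering i32 elements with Scale == 4
// and indices <0,1,2,3> touches byte offsets <0,4,8,12> from the base.
uint64_t element_address(uint64_t base, int64_t index, uint64_t scale) {
  return base + (uint64_t)index * scale;
}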
I toyed with the idea of keeping the UNSCALED enums and helper
functions but because they will have no uses and force SDNodes to
validate the set of supported values I figured it's best to remove
them. We can re-add them if there's a real need. For similar
reasons I've kept the IndexType enum when a bool could be used as I
think being explicit looks better.
Depends On D123347
Differential Revision: https://reviews.llvm.org/D123381