First patch in a series adding MC layer support for the Arm Scalable
Matrix Extension.
This patch adds the following features:
sme, sme-i64, sme-f64
The sme-i64 and sme-f64 flags are for the optional I16I64 and F64F64
features.
If a target supports I16I64 then the following instructions are
implemented:
* 64-bit integer ADDHA and ADDVA variants (D105570).
* SMOPA, SMOPS, SUMOPA, SUMOPS, UMOPA, UMOPS, USMOPA, and USMOPS
instructions that accumulate 16-bit integer outer products into 64-bit
integer tiles.
If a target supports F64F64 then the FMOPA and FMOPS instructions that
accumulate double-precision floating-point outer products into
double-precision tiles are implemented.
Outer products are implemented in D105571.
The reference can be found here:
https://developer.arm.com/documentation/ddi0602/2021-06
Reviewed By: CarolineConcatto
Differential Revision: https://reviews.llvm.org/D105569
This sets the latency of stores to 1 in the Cortex-A55 scheduling model,
to better match the values given in the software optimization guide.
The latency of a store in normal llvm scheduling does not appear to have
a lot of uses. If the store has no outputs then the latency is somewhat
meaningless (and pre/post increment update operands use the WriteAdr
write for those operands instead). The one place it does alter things is
the latency between a store and the end of the scheduling region, which
can in turn have an effect on the critical path length. As a result a
latency of 1 is more correct and offers ever-so-slightly better
scheduling of instructions near the end of the block.
They are marked as RetireOOO to keep the llvm-mca from introducing
stalls where non would exist.
Differential Revision: https://reviews.llvm.org/D105541
This adds custom lowering for truncating stores when operating on
fixed length vectors in SVE. It also includes a DAG combine to
fold extends followed by truncating stores into non-truncating
stores in order to prevent this pattern appearing once truncating
stores are supported.
Currently truncating stores are not used in certain cases where
the size of the vector is larger than the target vector width.
Differential Revision: https://reviews.llvm.org/D104471
The original motivation for this was to implement moreElementsVector of shuffles
on AArch64, which resulted in complex sequences of artifacts like unmerge(unmerge(concat...))
which the combiner couldn't handle. It seemed here that the better option,
instead of writing ever-more-complex combines, was to have a way to find
the original "non-artifact" source registers for a given definition, walking
through arbitrary expressions of unmerge/concat/insert. As long as the bits
aren't extended or truncated, this is a pretty simple algorithm that avoids
the need for lots of combines and instead jumps straight to the final result
we want.
I've only used this new technique in 2 places within tryCombineUnmerge, using it
in more general situations resulted in infinite loops in AMDGPU. So for now
it's used when we would otherwise fail to combine and that seems to work.
In order to support looking through G_INSERTs, I also had to add it as an
artifact in isArtifact(), which caused a whole lot of issues in tests. AMDGPU
started infinite looping since full legalization of G_INSERT doensn't seem to
be there. To work around this, I've temporarily added a CLI option to use the
old behaviour so that the MIR tests will still run and terminate.
Other minor changes include no longer making >128b G_MERGE/UNMERGE legal.
We never had isel support for that anyway and it was a remnant of the legacy
legalizer rules. However being legal prevented the combiner from checking if it
was dead and deleting them.
Differential Revision: https://reviews.llvm.org/D104355
This patch removes the IsPairwiseForm flag from the Reduction Cost TTI
hooks, along with some accompanying code for pattern matching reductions
from trees starting at extract elements. IsPairWise is now assumed to be
false, which was the predominant way that the value was used from both
the Loop and SLP vectorizers. Since the adjustments such as D93860, the
SLP vectorizer has not relied upon this distinction between paiwise and
non-pairwise reductions.
This also removes some code that was detecting reductions trees starting
from extract elements inside the costmodel. This case was
double-counting costs though, adding the individual costs on the
individual instruction _and_ the total cost of the reduction. Removing
it changes the costs in llvm/test/Analysis/CostModel/X86/reduction.ll to
not double count. The cost of reduction intrinsics is still tested
through the various tests in
llvm/test/Analysis/CostModel/X86/reduce-xyz.ll.
Differential Revision: https://reviews.llvm.org/D105484
SelectionDAG's equivalents in ISD::InputArg/OutputArg track the
original argument index. Mips relies on this, and its currently
reinventing its own parallel CallLowering infrastructure which tracks
these indexes on the side. Add this to help move towards deleting the
custom mips handling.
Additionally, lower the floating point compare SVE intrinsics to
SETCC_MERGE_ZERO ISD nodes to avoid duplicating ISel patterns.
Differential Revision: https://reviews.llvm.org/D105486
This patch prevents GlobalISel from optimizing out redundant branch
instructions when compiling without optimizations.
The motivating example is code like the following common pattern in
Swift, where users expect to be able to set a breakpoint on the early
exit:
public func f(b: Bool) {
guard b else {
return // I would like to set a breakpoint here.
}
...
}
The patch modifies two places in GlobalISEL: The first one is in
IRTranslator.cpp where the removal of redundant branches is made
conditional on the optimization level. The second one is in
AArch64InstructionSelector.cpp where an -O0 *only* optimization is
being removed.
Disabling these optimizations increases code size at -O0 by
~8%. However, doing so improves debuggability, and debug builds are
the primary reason why developers compile without optimizations. We
thus concluded that this is the right trade-off.
rdar://79515454
Differential Revision: https://reviews.llvm.org/D105238
This patch adds a TTI function, isElementTypeLegalForScalableVector, to query
whether it is possible to vectorize a given element type. This is called by
isLegalToVectorizeInstTypesForScalable to reject scalable vectorization if
any of the instruction types in the loop are unsupported, e.g:
int foo(__int128_t* ptr, int N)
#pragma clang loop vectorize_width(4, scalable)
for (int i=0; i<N; ++i)
ptr[i] = ptr[i] + 42;
This example currently crashes if we attempt to vectorize since i128 is not a
supported type for scalable vectorization.
Reviewed By: sdesmalen, david-arm
Differential Revision: https://reviews.llvm.org/D102253
This avoids the use of the vector unit for copying from scalar to
vector. There is an extra ptrue instruction, but a predicate register
with the ptrue pattern populated is likely to be free in the context of
real code.
Tests were generated from a template to cover the axes mentioned at the
top of the test file.
Co-authored-by: Francesco Petrogalli <francesco.petrogalli@arm.com>
Differential Revision: https://reviews.llvm.org/D103170
For the following case:
t8: i32 = or t7, t4
t10: i32 = ORRWrs t8, t8, TargetConstant:i32<73>
Current code wrongly returns (t8 >> shiftConstant) as the
UsefulBits of t8, which in fact is (t8 | (t8 >> shiftConstant)).
Reviewed by: sdesmalen, mdchen
Differential Revision: https://reviews.llvm.org/D102759
This patch adds a new ShuffleKind SK_Splice and then handle the cost in
getShuffleCost, as in experimental.vector.reverse.
Differential Revision: https://reviews.llvm.org/D104630
Improve codegen when lowering the common vector shuffle case from the
vectorizer (op1[last]:op2[0:last-1]). This patch only handles this
common case as it is difficult to handle this more generally when using
fixed length vectors, due to being unable to use the SVE ext instruction.
Differential Revision: https://reviews.llvm.org/D105289
Loads of <4 x i8> vectors were modeled as extremely expensive. And while we
don't have a load instruction that supports this, it isn't that expensive to
create a vector of i8 elements. The codegen for this was fixed/optimised in
D105110. This now tweaks the cost model and enables SLP vectorisation of my
motivating case loadi8.ll.
Differential Revision: https://reviews.llvm.org/D103629
This adds simple patterns for signed and unsigned saturating extract
narrow instructions. They combine a min/max/truncate into a single
instruction, providing that the immediates on the min/max are correct
for the saturation type. This is just handled in tablegen with some
extra patterns.
v2i64->v2i32 is not handled here as the min/max nodes are not legal,
making the lowering quite different.
Differential Revision: https://reviews.llvm.org/D103263
Target-independent code only knows how to spill to the stack; instead,
use AArch64ISD::REINTERPRET_CAST.
Differential Revision: https://reviews.llvm.org/D104573
Inserting into a smaller-than-legal scalable vector would result in an
internal compiler error. For example, inserting a <vscale x 4 x i8> into
a <vscale x 8 x i8> (both illegal vector types for SVE) would cause a
crash.
This crash was happening because there was no code to promote (legalise)
the result of an INSERT_SUBVECTOR node.
This patch implements PromoteIntRes_INSERT_SUBVECTOR, which legalises
the ISD node. This is currently done by going through memory. This is
necessary because of the requirement that the SubVec parameter of the
INSERT_SUBVECTOR node must be smaller than the Vec parameter, which
means that INSERT_SUBVECTOR cannot always have a legal result/operand
types.
Co-Authored-by: Joe Ellis <joe.ellis@arm.com>
Differential Revision: https://reviews.llvm.org/D102766
Since gather lowering can now lower to nodes that may need expansion via
the vector legalizer, do MGATHER lowering via vector legalizer.
Additionally, as part of adding passthru support for fixed typed
gathers, fix passthru support for scalable types.
Depends on D104910
Differential Revision: https://reviews.llvm.org/D104217
This enables proper lowering of non-byte sized loads. We still aren't
faithfully preserving memory types everywhere, so the legality checks
still only consider the size.
Looking at PostDominatorTree::dominates, we can see that has the same
logic (with the addition of handling Phi nodes - which are not used as inputs in
this pass) as the helper function.
Reviewed By: eugenis
Differential Revision: https://reviews.llvm.org/D105141
GlobalISel is relying on regular MachineMemOperands to track all of
the memory properties of accesses. Just the raw byte size is
insufficent to disambiguate all situations. For example, if we need to
split an unaligned extending load, we need to know the number of bits
in the original source value and can't infer it from the result
type. This is also a problem for extending vector loads.
This does decrease the maximum representable size from the full
uint64_t bytes to a maximum of 16-bits. No in tree testcases hit this,
other than places using UINT64_MAX for unknown sizes. This may be an
issue for G_MEMCPY and co., although they can just use unknown size
for large static sizes. This also has potential for backend abuse by
relying on the type when it really shouldn't be relevant after
selection.
This does not include the necessary MIR printer/parser changes to
represent this.
Adds legalizer, register bank select, and instruction
select support for G_SBFX and G_UBFX. These opcodes generate
scalar or vector ALU bitfield extract instructions for
AMDGPU. The instructions allow both constant or register
values for the offset and width operands.
The 32-bit scalar version is expanded to a sequence that
combines the offset and width into a single register.
There are no 64-bit vgpr bitfield extract instructions, so the
operations are expanded to a sequence of instructions that
implement the operation. If the width is a constant,
then the 32-bit bitfield extract instructions are used.
Moved the AArch64 specific code for creating G_SBFX to
CombinerHelper.cpp so that it can be used by other targets.
Only bitfield extracts with constant offset and width values
are handled currently.
Differential Revision: https://reviews.llvm.org/D100149
This adds support for Armv9-A's Realm Management Extension, including
three new system registers - MFAR_EL3, GPCCR_EL3 and GPTBR_EL3 - and
four new TLBI instructions.
The reference for the Realm Management Extension can be found at: https://developer.arm.com/documentation/ddi0615/aa.
Based on patches by Victor Campos.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D104773
This ports the AArch64 SABD and USBD over to DAG Combine, where they can be
used by more backends (notably MVE in a follow-up patch). The matching code
has changed very little, just to handle legal operations and types
differently. It selects from (ABS (SUB (EXTEND a), (EXTEND b))), producing
a ubds/abdu which is zexted to the original type.
Differential Revision: https://reviews.llvm.org/D91937
To reflect that the size may be scalable, a TypeSize is returned
instead of an unsigned. In places where the result is used,
it currently relies on an implicit cast of TypeSize -> uint64_t,
which asserts that the type is not scalable.
This patch is NFC for fixed-width vectors.
Reviewed By: aemerson
Differential Revision: https://reviews.llvm.org/D104454