Internally to DAGCombiner the SDValues were passed by non-const
reference despite not being modified. They were then passed by
const reference to TLI.
This patch passes them by value, which is consistent with the vast
majority of code.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D120420
We have a combine for converting mul(dup(ext(..)), ...) into
mul(ext(dup(..)), ..), for allowing more uses of smull and umull
instructions. Currently it looks for vector insert and shuffle vectors
to detect the element that we can convert to a vector extend. Not all
cases will have a shufflevector/insert element though.
This started by extending the recognition to buildvectors (with elements
that may be individually extended). The new method seems to cover all
the cases that the old method captured though, as the shuffle will
eventually be lowered to buildvectors, so the old method has been
removed to keep the code a little simpler. The new code detects legal
build_vector(ext(a), ext(b), ..), converting them to ext(build_vector(a,
b, ..)) provided all the extends/types match up.
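For illustration, a minimal IR sketch (function and value names are hypothetical) of the shape the combine now recognises; each element is individually sign-extended, so the whole build_vector can be rewritten as a sext of a v4i16 build_vector, feeding smull:
```
define <4 x i32> @buildvec_of_exts(i16 %a, i16 %b, i16 %c, i16 %d) {
  ; Each lane is extended separately before insertion.
  %ae = sext i16 %a to i32
  %be = sext i16 %b to i32
  %ce = sext i16 %c to i32
  %de = sext i16 %d to i32
  %v0 = insertelement <4 x i32> undef, i32 %ae, i32 0
  %v1 = insertelement <4 x i32> %v0, i32 %be, i32 1
  %v2 = insertelement <4 x i32> %v1, i32 %ce, i32 2
  %v3 = insertelement <4 x i32> %v2, i32 %de, i32 3
  ret <4 x i32> %v3
}
```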
Differential Revision: https://reviews.llvm.org/D120018
LR is modified at the moment of the call and before any use is read.
Reviewers: reames
Reviewed By: reames
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D120114
We have some duplicate patterns between the AArch64ISD::UMULL (/SMULL)
and the int_aarch64_neon_umull (/smull) intrinsics. They did not
replicate all the patterns though, leaving gaps where instructions
like umlal2 could not be selected. This commons all the patterns by converting
all int_aarch64_neon_umull intrinsics to UMULL nodes and removing the
duplicate for umull/smull intrinsics, so that all instructions go
through the same tablegen pattern.
This improves some of the longer-than-legal mla patterns, helping them
replace ext with umlal2.
Differential Revision: https://reviews.llvm.org/D119887
(vselect (setcc (condcode) (_) (_)) (a) (op (a) (b)))
=> (vselect (setcc (!condcode) (_) (_)) (op (a) (b)) (a))
As a follow up to D117689, invert the operand order and condition
in order to fold vselects into predicated instructions.
Differential Revision: https://reviews.llvm.org/D119424
This also requires adjustment to code in AArch64ISelLowering so that
vector_extract is distributed over strict_fadd.
Differential Revision: https://reviews.llvm.org/D118489
This consists of marking the various strict opcodes as legal, and
adjusting instruction selection patterns so that 'op' is 'any_op'.
FP16 and vector instructions additionally require some extra work in
lowering and legalization, so we can't set IsStrictFPEnabled just yet.
Also more work needs to be done for full strict fp support (marking
instructions that can raise exceptions as such, and modelling FPCR use
for controlling rounding).
Differential Revision: https://reviews.llvm.org/D114946
This is some NFC (hopefully!) cleanup for performCommonVectorExtendCombine
and related methods, removing conditions that cannot occur and otherwise
cleaning up the code a little.
This ports the aarch64 combines for HADD and RHADD over to DAG combine,
so that they can be used in more architectures (notably MVE in a
followup patch). They are renamed to AVGFLOOR and AVGCEIL in the
process, to avoid confusion with instructions such as X86 hadd. The code
was also rewritten slightly to remove the AArch64 idiosyncrasies.
The general pattern for a AVGFLOORS is
%xe = sext i8 %x to i32
%ye = sext i8 %y to i32
%a = add i32 %xe, %ye
%r = lshr i32 %a, 1
%t = trunc i32 %r to i8
An AVGFLOORU is equivalent with zext. Because of the truncate,
lshr is equivalent to ashr here, as the top bits are not demanded. An AVGCEIL
additionally rounds up, so it includes an extra add of 1.
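For comparison, a hedged sketch of the unsigned ceiling variant (names hypothetical); the only difference from AVGFLOORU is the extra add of 1 before the shift:
```
define i8 @avgceilu(i8 %x, i8 %y) {
  %xe = zext i8 %x to i32
  %ye = zext i8 %y to i32
  %a = add i32 %xe, %ye
  ; The rounding add distinguishes AVGCEILU from AVGFLOORU.
  %a1 = add i32 %a, 1
  %r = lshr i32 %a1, 1
  %t = trunc i32 %r to i8
  ret i8 %t
}
```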
Differential Revision: https://reviews.llvm.org/D106237
These nodes provide an indirection that is not necessary because
SVE has unpredicated add/sub instructions and there's no downside
to using them for partial register operations. In fact, the test
changes show that unifying how fixed-length and scalable vector
add/sub are lowered enables better use of existing isel patterns.
Differential Revision: https://reviews.llvm.org/D119355
When lowering the get.active.lane.mask intrinsic with a fixed-width
predicate vector result, we can actually make use of the SVE whilelo
instruction when SVE is enabled. We do this by carefully choosing
a sensible VT for the whilelo instruction, then promoting it to an
integer vector, i.e. nxv16i1 -> nxv16i8. We can then extract a v16i8
subvector and truncate back to the original return type, i.e. v16i1.
This leads to a significant improvement in code quality.
Differential Revision: https://reviews.llvm.org/D116664
There are several places where hasBF16 is used to protect code that
has no requirement for the +bf16 feature. The lowering code uses
stock SVE instructions for things like loads and stores and so is
safe even when +bf16 is not available.
NOTE: Currently the nxvbf16 type is not legal unless the +bf16
feature is available, but that isn't an issue because the affected
code is post type legalisation.
NOTE: This patch mirrors previous work that removed the same
redundant protection from isel patterns where the resulting
selection emitted stock SVE instructions.
Differential Revision: https://reviews.llvm.org/D119328
The decision is perhaps arbitrary but I figure zeroing has no
dependency on the value being loaded.
Differential Revision: https://reviews.llvm.org/D119327
Previously the code in AArch64TargetLowering::ReconstructShuffle assumed
the input vectors were always fixed-width, however this is not always
the case since you can extract elements from scalable vectors and insert
into fixed-width ones. We were hitting crashes here for two different
cases:
1. When lowering a fixed-length vector extract from a scalable vector
with i1 element types. This happens because the i1 elements
get promoted to larger integer types for fixed-width vectors, which leads
to sequences of INSERT_VECTOR_ELT and EXTRACT_VECTOR_ELT nodes. In this
case AArch64TargetLowering::ReconstructShuffle will still fail to make
a transformation, but at least it no longer crashes.
2. When lowering a sequence of extractelement/insertelement operations
on mixed fixed-width/scalable vectors.
For now, I've just changed AArch64TargetLowering::ReconstructShuffle to
bail out if it finds a scalable vector.
Tests for both instances described above have been added here:
(1) CodeGen/AArch64/sve-extract-fixed-vector.ll
(2) CodeGen/AArch64/sve-fixed-length-reshuffle.ll
Differential Revision: https://reviews.llvm.org/D116602
The lowering code for shuffle_vector has a code path that looks through
extract_subvector, this code path did not properly account for the
potential presence of larger than Neon vector types and could produce
unselectable DAG nodes.
Differential Revision: https://reviews.llvm.org/D119252
This patch adds custom lowering support for ISD::MUL with v1i64 and v2i64
types when SVE is enabled, regardless of the minimum SVE vector length. We
do this because NEON simply does not have 64-bit vector multiplies, so we
want to take advantage of these instructions in SVE.
I've updated the 128-bit min SVE vector bits tests here:
CodeGen/AArch64/sve-fixed-length-int-arith.ll
CodeGen/AArch64/sve-fixed-length-int-mulh.ll
CodeGen/AArch64/sve-fixed-length-int-rem.ll
Differential Revision: https://reviews.llvm.org/D118802
We currently use emitConjunction to create CCMP conjunctions from the
conditions of selects, helping turn ands/ors into more optimal ccmp
sequences that don't need to go through csels. This extends that to also
be used whilst lowering brcond, giving more opportunity for better
condition generation.
Differential Revision: https://reviews.llvm.org/D118650
This patch modifies the FCOPYSIGN lowering to go through the BSP
pseudo-instruction. This allows the same lowering code for NEON,
SVE and SVE2.
As part of this, lowering for BSP for SVE and SVE2 is also added.
For SVE and NEON this patch is NFC.
Differential Revision: https://reviews.llvm.org/D118394
Previously useSVEForFixedLengthVectorVT only allowed SVE usage when
the target SVE register length was known to be at least 256bit.
This was true even for NEON sized vectors, which was an artificial
restriction imposed during early SVE bring up. This now changes so
that callers can opt to use SVE for NEON sized vectors regardless
of the SVE register length.
The patch is NFC because for all places where OverrideNEON is used
we now explicitly also check that SVE code generation for larger
than NEON vectors is enabled. The intent is that over time these
extra checks will either be removed or the lowering disabled if the
SVE usage proves not beneficial.
Differential Revision: https://reviews.llvm.org/D118957
In AArch64ISelLowering.cpp this patch implements this fold:
1) GEP (%ptr, shl((stepvector(A) + splat(%offset)), splat(B)))
into GEP (%ptr + (%offset << B), step_vector(A << B))
The above transform simplifies the index operand so that it can be expressed
as i32 elements.
This allows using only one gather/scatter assembly instruction instead of two.
Patch by Paul Walker (@paulwalker-arm).
Depends on D117900
Differential Revision: https://reviews.llvm.org/D118345
In AArch64ISelLowering.cpp this patch implements this fold:
GEP (%ptr, (splat(%offset) + stepvector(A)))
into GEP ((%ptr + %offset), stepvector(A))
The above transform simplifies the index operand so that it can be expressed
as i32 elements.
This allows using only one gather/scatter assembly instruction instead of two.
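As a rough sketch (names and types are hypothetical, not from the patch), a gather whose index is splat(%offset) + stepvector; after the fold the offset is folded into the base pointer and the remaining index is a plain step vector:
```
define <vscale x 4 x i32> @gather(i32* %ptr, i64 %offset, <vscale x 4 x i1> %pg) {
  %step = call <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()
  %ins = insertelement <vscale x 4 x i64> undef, i64 %offset, i32 0
  %splat = shufflevector <vscale x 4 x i64> %ins, <vscale x 4 x i64> undef, <vscale x 4 x i32> zeroinitializer
  ; Index = splat(%offset) + stepvector; the fold moves %offset into %ptr.
  %idx = add <vscale x 4 x i64> %splat, %step
  %gep = getelementptr i32, i32* %ptr, <vscale x 4 x i64> %idx
  %r = call <vscale x 4 x i32> @llvm.masked.gather.nxv4i32.nxv4p0i32(<vscale x 4 x i32*> %gep, i32 4, <vscale x 4 x i1> %pg, <vscale x 4 x i32> undef)
  ret <vscale x 4 x i32> %r
}
declare <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()
declare <vscale x 4 x i32> @llvm.masked.gather.nxv4i32.nxv4p0i32(<vscale x 4 x i32*>, i32, <vscale x 4 x i1>, <vscale x 4 x i32>)
```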
Patch by Paul Walker (@paulwalker-arm).
Depends on D118459
Differential Revision: https://reviews.llvm.org/D117900
This teaches AArch64TargetLowering::shouldSinkOperands to sink the
operands of aarch64_neon_pmull intrinsic.
Differential Revision: https://reviews.llvm.org/D117944
Given an (integer) vecreduce, we know the order of the inputs does not matter.
We can convert UADDV(add(zext(extract_lo(x)), zext(extract_hi(x)))) into
UADDV(UADDLP(x)). This can also happen through an extra add, where we transform
UADDV(add(y, add(zext(extract_lo(x)), zext(extract_hi(x))))).
This makes sure the same thing happens in signed cases too, which requires adding
a new SADDLP node.
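A hedged IR example (names hypothetical) of an input that now becomes UADDV(UADDLP(x)); the two shuffles model extract_lo/extract_hi of the v16i8 input:
```
define i32 @uaddv_of_halves(<16 x i8> %x) {
  %lo = shufflevector <16 x i8> %x, <16 x i8> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %hi = shufflevector <16 x i8> %x, <16 x i8> undef, <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
  %loe = zext <8 x i8> %lo to <8 x i32>
  %hie = zext <8 x i8> %hi to <8 x i32>
  ; The reduction input order doesn't matter, so lo/hi can be re-paired.
  %add = add <8 x i32> %loe, %hie
  %r = call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> %add)
  ret i32 %r
}
declare i32 @llvm.vector.reduce.add.v8i32(<8 x i32>)
```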
Differential Revision: https://reviews.llvm.org/D118107
LLVM has a couple of ways of producing ccmp - either from chains in isel
or from a later ifcvt style pass. This adds a simple DAG combine to
capture more cases, converting and(csel(0, 1, cc0), csel(0, 1, cc1))
into a csel(ccmp(.., cc0)), depending on cc1 (a SUBS in this case).
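A small hedged example (hypothetical function) of a source shape that benefits: each icmp becomes a csel(0, 1, cc) in the DAG, and the 'and' of the two can now fold to cmp + ccmp + cset rather than two csets and an and:
```
define i1 @and_of_setcc(i32 %a, i32 %b) {
  %c0 = icmp sgt i32 %a, 0
  %c1 = icmp slt i32 %b, 7
  %r = and i1 %c0, %c1
  ret i1 %r
}
```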
Differential Revision: https://reviews.llvm.org/D118327
This patch adds custom lowering support for ISD::SDIV and ISD::UDIV
when SVE is enabled, regardless of the minimum SVE vector length. We do
this because NEON simply does not have vector integer divide support, so
we want to take advantage of these instructions in SVE.
As part of this patch I've also simplified LowerToPredicatedOp to avoid
re-asking the same question about whether we should be using SVE for
fixed length vectors. Once we've made the decision to call
LowerToPredicatedOp, then we should simply assert we should be using SVE.
I've updated the 128-bit min SVE vector bits tests here:
CodeGen/AArch64/sve-fixed-length-int-div.ll
CodeGen/AArch64/sve-fixed-length-int-rem.ll
Differential Revision: https://reviews.llvm.org/D117871
Whilst adding legal types <-> register classes for Streaming SVE in
D118561 I noticed that the hasSVE predication block sets operation actions
for opcodes that may not be legal in Streaming SVE. Move these operations to
the later hasSVE block, which loops over the same types.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D118560
Pointer element types do not imply that the pointer is ABI aligned.
We should be using either an explicit align attribute here, or fall
back to an alignment of 1. This fixes a new element type access
introduced in D117764.
I don't think this makes any practical difference though, as the
lowering does not depend on alignment.
Differential Revision: https://reviews.llvm.org/D118681
New target SDNodes are added: AArch64ISD::MOPS_MEMSET, etc.
Each intrinsic is translated to one of these in SelectionDAGBuilder
via EmitTargetCodeForMOPS.
A custom lowering routine for INTRINSIC_W_CHAIN is added to handle
llvm.aarch64.mops.memset.tag. This takes a separate path from the common
intrinsics but ultimately ends up in the same EmitMOPS().
This is part 4/4 of a series of patches split from
https://reviews.llvm.org/D117405 to facilitate reviewing.
Patch by Tomas Matheson, Lucas Prates and Son Tuan Vu.
Differential Revision: https://reviews.llvm.org/D117764
The optimization added in D118139 causes a crash on the added test case
while trying to zero extend a vector of floats.
Fix the crash by bailing out for floating point operands.
Reviewed By: DavidTruby
Differential Revision: https://reviews.llvm.org/D118615
AArch64ISD::PFALSE does not provide any value, in fact it can
prevent common combines from firing. We only needed to lower
to PFALSE until ISD::SPLAT_VECTOR became generally available.
Differential Revision: https://reviews.llvm.org/D118469
Currently, the clang.arc.attachedcall bundle takes an optional function
argument. Depending on whether the argument is present, calls with this
bundle have the following semantics:
- on x86, with the argument present, the call is lowered to:
call _target
mov rax, rdi
call _objc_retainAutoreleasedReturnValue
- on AArch64, without the argument, the call is lowered to:
bl _target
mov x29, x29
and the objc runtime call is expected to be emitted separately.
That's because, on x86, the objc runtime checks for both the mov and
the call, and treats the combination as the ARC autorelease elision
marker.
But on AArch64, it only checks for the dedicated NOP marker, as that's
historically been sufficiently unique. Thanks to that, the runtime call
wasn't required to be adjacent to the NOP marker, so it wasn't emitted
as part of the bundle sequence.
This patch unifies both architectures: on AArch64, we now emit all
3 instructions for the bundle. This guarantees that the runtime call
is adjacent to the marker in the sequence, and that's information the
runtime can use to further optimize this.
This helps simplify some of the handling, in particular
BundledRetainClaimRVs, which no longer needs to know whether the bundle
is sufficient or not: it now always should be.
Note that this does not include an AutoUpgrade for the nullary bundles,
as they are only produced in ObjCContract as part of the obj/asm emission
pipeline, and are not expected to be in bitcode.
Differential Revision: https://reviews.llvm.org/D118214
When a comparison is extended and it would be free to extend the
arguments to that comparison, we can propagate the extend into those arguments.
This prevents extra instructions being generated to extend the result of the
comparison, which is not free to extend.
This is a resubmission of D116812 with fixes that need another review.
Differential Revision: https://reviews.llvm.org/D118139
This adds the following changes:
* Fold: vselect(<all active predicate>, x, y) => x
* Extend isAllActivePredicate to take vscale_range into account, e.g.
isAllActivePredicate(vl16) for nxv16i1 and vscale == 1 => true.
isAllActivePredicate(vl32) for nxv16i1 and vscale == 2 => true.
Differential Revision: https://reviews.llvm.org/D118147
The ISel patterns for PFALSE help recognise the instructions as being
free of side-effects, which helps MachineCSE remove redundant
PFALSE instructions.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D118054
This reverts commit ef82063207.
- It conflicts with the existing llvm::size in STLExtras, which will now
never be called.
- Calling it without llvm:: breaks C++17 compat
When wider vectors are used, for example with fixed width SVE,
there are no patterns to select AArch64ISD::LD1LANEpost
nodes, so we should do an early exit.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D117674
NOTE: This patch also includes tests that highlight those cases
where the existing DAG combine doesn't yet work well for SVE.
Differential Revision: https://reviews.llvm.org/D117873
Instead use either Type::getPointerElementType() or
Type::getNonOpaquePointerElementType().
This is part of D117885, in preparation for deprecating the API.
AArch64 supports unsigned shift right and accumulate. In case we see an
unsigned shift right followed by an OR, we can turn them into a USRA
instruction, given that the operands of the OR have no common bits.
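A minimal hedged sketch (hypothetical names); the shl leaves only bits 28-31 set and the lshr only bits 0-27, so the operands provably share no bits and the or can be selected as usra:
```
define <4 x i32> @usra_candidate(<4 x i32> %a, <4 x i32> %b) {
  %hi = shl <4 x i32> %a, <i32 28, i32 28, i32 28, i32 28>
  %lo = lshr <4 x i32> %b, <i32 4, i32 4, i32 4, i32 4>
  ; With no common bits, this or is equivalent to an add, i.e. usra.
  %o = or <4 x i32> %hi, %lo
  ret <4 x i32> %o
}
```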
Differential Revision: https://reviews.llvm.org/D114405
When a comparison is extended and it would be free to extend the
arguments to that comparison, we can propagate the extend into those arguments.
This prevents extra instructions being generated to extend the result of the
comparison, which is not free to extend.
Differential Revision: https://reviews.llvm.org/D116812
For certain negative indices passed to the VECTOR_SPLICE operation
we can actually directly use the SVE splice instruction by creating
the appropriate predicate. The predicate needs to be constructed in
such a way that all but the last -idx elements are false. We can do
this efficiently using a combination of 'ptrue' (with the appropriate
fixed pattern, e.g. vl1, vl2, etc.) and 'rev'. The advantage of using
these instructions to generate the predicate is they do not set any
flags, unlike the whilelo instruction. This is critical when the splice
operation is in a loop, since we want MachineLICM to hoist the
predicate generation out of the loop.
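A hedged example (names hypothetical): an index of -4 keeps the last four elements of the first operand followed by the leading elements of the second, so it can use ptrue vl4 + rev to build the predicate for splice:
```
define <vscale x 4 x i32> @splice_last4(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b) {
  %r = call <vscale x 4 x i32> @llvm.experimental.vector.splice.nxv4i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b, i32 -4)
  ret <vscale x 4 x i32> %r
}
declare <vscale x 4 x i32> @llvm.experimental.vector.splice.nxv4i32(<vscale x 4 x i32>, <vscale x 4 x i32>, i32)
```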
Differential Revision: https://reviews.llvm.org/D115863
When constructDup is passed an extract_subvector it tries to use
extract_subvector's operand directly when creating the DUPLANE.
This is invalid when extracting from a scalable vector because the
necessary DUPLANE ISel patterns do not exist.
NOTE: This patch is an update to https://reviews.llvm.org/D110524
that originally fixed this but introduced a bug when the result
VT is 64 bits. I've restructured the code so the critical final
else block is entered when necessary.
Differential Revision: https://reviews.llvm.org/D116442
Attempt to lower a shuffle as a permute instruction (zip/uzp/trn) for fixed length SVE.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D113376
This patch partially resolves an issue for VLS code generation
where a mask is generated from a smaller width integer comparison
than the instruction using the mask requires.
Instead of sign extending a p register by converting it to a z
register, extending that, and converting back, we just
do an unpack of the p register.
A separate issue causes the code generation to still be poor when
the mask generation would fit in a neon register, as we then use
a neon comparison operation and have to convert that to a p register.
This will be resolved in a separate patch.
Reviewed By: peterwaller-arm
Differential Revision: https://reviews.llvm.org/D111221
This extends the custom lowering for truncating stores on
fixed length vectors in SVE to support masked truncating stores.
It also adds a DAG combine for truncates followed by masked
stores.
Reviewed By: peterwaller-arm, paulwalker-arm
Differential Revision: https://reviews.llvm.org/D108115
Attempt to lower a shuffle as a permute instruction (rev/revb/revh/revw) for fixed length SVE.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D114960
Fix a couple of things that were causing stack protection to not work
correctly in functions that have scalable vectors on the stack:
* Use TypeSize when determining if accesses to a variable are
considered out-of-bounds so that the behaviour is correct for
scalable vectors.
* When stack protection is enabled move the stack protector location
to the top of the SVE locals, so that any overflow in them (or the
other locals which are below that) will be detected.
Fixes: https://github.com/llvm/llvm-project/issues/51137
Differential Revision: https://reviews.llvm.org/D111631
-(Za + Zm * Zn) != (-Za + Zm * (-Zn))
when the FMA produces a zero output (e.g. all zero inputs can produce -0
output)
Add a PatFrag to check presence of nsz on the fneg, add tests which
ensure the combine does not fire in the absence of nsz.
See https://reviews.llvm.org/D90901 for a similar discussion on X86.
Differential Revision: https://reviews.llvm.org/D109525
Restrict duplicate FP_EXTEND/FP_TRUNC -> LOAD/STORE DAG combines to only
larger than NEON types, as these are the ones for which there is custom
lowering.
Update tests so that they go through memory to improve validation.
Differential Revision: https://reviews.llvm.org/D115166
f526c600c0 had a concern raised because of an invalid typesize request
on a scalable vector, which this patch addresses.
Prevent shouldReduceLoadWidth from attempting to query the bit size, and
add a regression test in sve-extract-fixed-vector.ll.
Differential Revision: https://reviews.llvm.org/D115156
By duplicating these dag combines we can bypass the legality checks that
they do, this allows us to perform these combines on larger than legal
fixed types, which in turn allows us to bring the same benefits D114580
brought but to larger than legal fixed types.
Depends on D114580
Differential Revision: https://reviews.llvm.org/D114628
This adds a fold in DAGCombine to create fptosi_sat from sequences for
smin(smax(fptosi(x))) nodes, where the min/max saturate the output of
the fp convert to a specific bitwidth (say INT_MIN and INT_MAX). Because
it is dealing with smin(/smax) in DAG they may currently be ISD::SMIN,
ISD::SETCC/ISD::SELECT, ISD::VSELECT or ISD::SELECT_CC nodes which need
to be handled similarly.
A shouldConvertFpToSat method was added to control when converting may
be profitable. The original fptosi has less strict semantics than the
fptosi_sat, with fewer values that need to produce defined behaviour.
This especially helps on ARM/AArch64 where the vcvt instructions
naturally saturate the result.
Differential Revision: https://reviews.llvm.org/D111976
It causes builds to fail with this assert:
llvm/include/llvm/ADT/APInt.h:990:
bool llvm::APInt::operator==(const llvm::APInt &) const:
Assertion `BitWidth == RHS.BitWidth && "Comparison requires equal bit widths"' failed.
See comment on the code review.
> This adds a fold in DAGCombine to create fptosi_sat from sequences for
> smin(smax(fptosi(x))) nodes, where the min/max saturate the output of
> the fp convert to a specific bitwidth (say INT_MIN and INT_MAX). Because
> it is dealing with smin(/smax) in DAG they may currently be ISD::SMIN,
> ISD::SETCC/ISD::SELECT, ISD::VSELECT or ISD::SELECT_CC nodes which need
> to be handled similarly.
>
> A shouldConvertFpToSat method was added to control when converting may
> be profitable. The original fptosi will have a less strict semantics
> than the fptosisat, with less values that need to produce defined
> behaviour.
>
> This especially helps on ARM/AArch64 where the vcvt instructions
> naturally saturate the result.
>
> Differential Revision: https://reviews.llvm.org/D111976
This reverts commit 52ff3b0093.
This adds a fold in DAGCombine to create fptosi_sat from sequences for
smin(smax(fptosi(x))) nodes, where the min/max saturate the output of
the fp convert to a specific bitwidth (say INT_MIN and INT_MAX). Because
it is dealing with smin(/smax) in DAG they may currently be ISD::SMIN,
ISD::SETCC/ISD::SELECT, ISD::VSELECT or ISD::SELECT_CC nodes which need
to be handled similarly.
A shouldConvertFpToSat method was added to control when converting may
be profitable. The original fptosi has less strict semantics than the
fptosi_sat, with fewer values that need to produce defined behaviour.
This especially helps on ARM/AArch64 where the vcvt instructions
naturally saturate the result.
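A hedged IR sketch (hypothetical function) of the matched shape: the smax/smin clamp saturates the converted value to the i16 range, so the whole sequence can become a saturating convert:
```
define i16 @sat_convert(float %f) {
  %i = fptosi float %f to i32
  ; Clamp to [INT16_MIN, INT16_MAX], i.e. smin(smax(fptosi(x))).
  %lo = call i32 @llvm.smax.i32(i32 %i, i32 -32768)
  %hi = call i32 @llvm.smin.i32(i32 %lo, i32 32767)
  %t = trunc i32 %hi to i16
  ret i16 %t
}
declare i32 @llvm.smax.i32(i32, i32)
declare i32 @llvm.smin.i32(i32, i32)
```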
Differential Revision: https://reviews.llvm.org/D111976
I tried to exercise the existing combine patterns in performConcatVectorsCombine
for scalable vectors and at the moment it doesn't seem possible. Parts of
the code currently assume we're dealing with fixed-width vectors with calls
to getVectorNumElements(), therefore I've decided to simply bail out early
for scalable vectors.
Added a test here to show that we don't crash when attempting to combine
truncate + concat:
CodeGen/AArch64/concat_vector-truncate-combine.ll
Differential Revision: https://reviews.llvm.org/D114600
This allows the generic DAG combine to fold fp_extend/fp_trunc into
loads/stores which we can then lower into an integer extending
load/truncating store plus an FP_EXTEND/FP_ROUND.
The nuance here is that fixed-type FP_EXTEND/FP_ROUND require unpacked
types hence lowering them introduces an unpack/zip. By allowing these
nodes to be combined with loads/store we make it much easier to have
this unpack/zip combined into the load/store by our custom lowering.
Differential Revision: https://reviews.llvm.org/D114580
In most common cases the @llvm.get.active.lane.mask intrinsic maps directly
to the SVE whilelo instruction, which already takes overflow into account.
However, currently in SelectionDAGBuilder::visitIntrinsicCall we always lower
this immediately to a generic sequence of instructions that explicitly
take overflow into account. This makes it very difficult to then later
transform back into a single whilelo instruction. Therefore, this patch
introduces a new TLI function called shouldExpandGetActiveLaneMask that asks if
we should lower/expand this to a sequence of generic ISD nodes, or instead
just leave it as an intrinsic for the target to lower.
You can see the significant improvement in code quality for some of the
tests in this file:
CodeGen/AArch64/active_lane_mask.ll
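A hedged example of the intrinsic form that can now survive to ISel (types chosen for illustration); with the new hook returning false this can map onto a single whilelo:
```
define <16 x i1> @lane_mask(i64 %index, i64 %tc) {
  %m = call <16 x i1> @llvm.get.active.lane.mask.v16i1.i64(i64 %index, i64 %tc)
  ret <16 x i1> %m
}
declare <16 x i1> @llvm.get.active.lane.mask.v16i1.i64(i64, i64)
```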
Differential Revision: https://reviews.llvm.org/D114542
If we only demand bits from one half of a rotation pattern, see if we can simplify to a logical shift.
For the ARM/AArch64 rev16/32 patterns, I had to drop a fold to prevent srl(bswap()) -> rotr(bswap) -> srl(bswap) infinite loops. I've replaced this with an isel PatFrag which should do the same task.
Reapplied with fix for AArch64 rev patterns to matching the ARM fix.
https://alive2.llvm.org/ce/z/iroxki (rol -> shl by amt iff demanded bits has at least as many trailing zeros as the shift amount)
https://alive2.llvm.org/ce/z/4ez_U- (ror -> shl by revamt iff demanded bits has at least as many trailing zeros as the reverse shift amount)
https://alive2.llvm.org/ce/z/cD7dR- (ror -> lshr by amt iff demanded bits has at least as many leading zeros as the shift amount)
https://alive2.llvm.org/ce/z/_XGHtQ (rol -> lshr by revamt iff demanded bits has at least as many leading zeros as the reverse shift amount)
Differential Revision: https://reviews.llvm.org/D114354
This teaches AArch64TargetLowering::shouldSinkOperands to sink splat
shuffles to certain neon intrinsics, so that they can make use of the
lane variants of the instructions that are available.
Differential Revision: https://reviews.llvm.org/D112994
STATEPOINT instruction behavior is similar to that of a call instruction.
On AArch64 the BL instruction implicitly defines the lr register, so the
STATEPOINT instruction should do the same.
However STATEPOINT is a general pseudo instruction and I could not find
a way to override the list of implicit defs for a specific target.
So this patch post-processes the inserted STATEPOINT instruction by
adding an implicit dead def for lr.
Reviewers: reames, loicottet, ostannard
Reviewed By: reames
Subscribers: danilaml, hiraditya, kristof.beyls, llvm-commits, yrouban
Differential Revision: https://reviews.llvm.org/D111114
With fullfp16, it is cheaper to cast the {U,S}INT_TO_FP operand to i16
first, rather than promoting it to i32. The custom lowering for
{U,S}INT_TO_FP already supports that, it just needs to be used.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D113601
This extends performFpToIntCombine to work on FP16 vectors as well as
the f32 and f64 vectors it already supported.
Differential Revision: https://reviews.llvm.org/D113297
Similar to D113199 but dealing with the vector size, this extends the
fptosi+fmul to fixed point fold to handle fptosi.sat nodes that are
equally viable, so long as the saturation width matches the output
width.
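A small hedged sketch (hypothetical function): a multiply by 2^16 followed by a saturating convert of matching width, which can fold to a single fcvtzs with 16 fractional bits:
```
define i32 @fixed_point_sat(float %x) {
  ; Scale by 2^16, then saturating convert; saturation width == output width.
  %m = fmul float %x, 65536.0
  %i = call i32 @llvm.fptosi.sat.i32.f32(float %m)
  ret i32 %i
}
declare i32 @llvm.fptosi.sat.i32.f32(float)
```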
Differential Revision: https://reviews.llvm.org/D113200
This adds FP type support to the SVE Container type list as a supplement to D112303.
Reviewed By: peterwaller-arm, paulwalker-arm
Differential Revision: https://reviews.llvm.org/D113333
When inserting an unpacked FP subvector into a packed vector we
can simply cast the unpacked value into a packed value, since
both types are legal for SVE. We can then use this as the input
for the UZP instruction. This avoids us expanding the operation
by going through the stack.
Differential Revision: https://reviews.llvm.org/D113270
Performing the rearrangement for add/sub and mul instructions to match the madd/msub pattern
Reviewed By: dmgreen, sdesmalen, david-arm
Differential Revision: https://reviews.llvm.org/D111862
When creating a UUNPKLO/HI node with an undef input, the
output should also be undef. I've added a target DAG combine
function to ensure we avoid creating an unnecessary uunpklo/hi
instruction.
Differential Revision: https://reviews.llvm.org/D113266
This interface should not have existed in the first place, let alone
be a public member.
It allows calling `ElementCount::get(..)->getValue()`, which is ambiguous.
The interfaces to be used are either getFixedValue() or getKnownMinValue().
For NEON, FMA matching is done in the MachineCombiner, and not the
DAGCombiner. That causes problems with VLS lowering, since the
vectors are fixed width at the DAGCombiner, but are scalable in
the MachineCombiner. This patch corrects it by matching FMAs for
VLS vectors in the DAGCombiner.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D112557
This adds support for SVE structured loads/stores to the relevant target
hooks, such that we can support these instructions in the InterleavedAccess
pass.
Depends on D112078
Differential Revision: https://reviews.llvm.org/D112303
This patch enables the use of reciprocal estimates for SVE
when both the -Ofast and -mrecip flags are used.
Reviewed By: david-arm, paulwalker-arm
Differential Revision: https://reviews.llvm.org/D111657
Inspired by D111968, provide a isNegatedPowerOf2() wrapper instead of obfuscating code with (-Value).isPowerOf2() patterns, which I'm sure are likely avenues for typos.....
Differential Revision: https://reviews.llvm.org/D111998
Try to widen element type to get a new mask value for a better permutation
sequence, so that we can use NEON shuffle instructions, such as zip1/2,
uzp1/2, trn1/2, rev, ins, etc.
For example:
shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 6, i32 7, i32 2, i32 3>
is equivalent to:
shufflevector <2 x i64> %a, <2 x i64> %b, <2 x i32> <i32 3, i32 1>
Finally, we can get:
mov v0.d[0], v1.d[1]
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D111619
Similar to D111236, this improves the lowering of vector fptosi.sat and
fptoui.sat, using legal converts and further saturating from there with
min/max. f64 are excluded for the moment due to producing worse code in
places compared to the unrolling.
Differential Revision: https://reviews.llvm.org/D111787
Improve the lowering of scalar fptosi.sat and fptoui.sat for saturating
widths smaller than legal types by using the fact that the legal type
will saturate under aarch64, and saturating the result further using
min/max.
Differential Revision: https://reviews.llvm.org/D111236
The lowering for EXTRACT_SUBVECTOR should not be called during type
legalization, only as part of lowering, hence return SDValue() when
called on illegal types.
This also adds missing tests for extracting fixed types from illegal
scalable types.
Differential Revision: https://reviews.llvm.org/D111412
AAPCS requires an i1 argument to be zero-extended to 8 bits by the
caller. Emit a new AArch64ISD::ASSERT_ZEXT_BOOL hint (or AssertZExt
for GlobalISel) to enable some optimization opportunities. In
particular, when the argument is forwarded to the callee, we can avoid
zero-extension and use it as-is.
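A hedged sketch of the forwarding case the hint enables (hypothetical functions); the caller already received %b zero-extended, so it can pass it straight through without re-extending:
```
declare void @callee(i1 zeroext)

define void @caller(i1 zeroext %b) {
  ; %b is known zero-extended on entry, so no extra and/uxtb is needed here.
  call void @callee(i1 zeroext %b)
  ret void
}
```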
Differential Revision: https://reviews.llvm.org/D107160
Stop using APInt constructors and methods that were soft-deprecated in
D109483. This fixes all the uses I found in llvm, except for the APInt
unit tests which should still test the deprecated methods.
Differential Revision: https://reviews.llvm.org/D110807
Like normal atomicrmw operations, at -O0 the simple register-allocator can
insert spills into the LL/SC loop if it's expanded and visible when regalloc
runs. This can cause the operation to never succeed by repeatedly clearing the
monitor. Instead expand to a cmpxchg, which has a pseudo-instruction for -O0.
v8.4 says that normal loads/stores of 128 bits are single-copy atomic if
they're properly aligned (which all LLVM atomics are) so we no longer need to
do a full RMW operation to guarantee we got a clean read.
This extends the custom lowering for extending loads on
fixed length vectors in SVE to support masked extending loads.
The existing tests for correct behaviour of masked extending loads
exhibit bad code generation due to the legalisation of i1 vectors.
They have been left as-is and new tests have been added that do not
exhibit this behaviour.
Differential Revision: https://reviews.llvm.org/D108200
When combining an 'and' of an unsigned unpack and a shuffle instruction,
bail early if the shuffle is not constructed from a constant integer.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D109556
Soft deprecate isNullValue/isAllOnesValue and update in-tree
callers. This matches the changes to the APInt interface from
D109483.
Reviewed By: lattner
Differential Revision: https://reviews.llvm.org/D109535
This renames the primary methods for creating a zero value to `getZero`
instead of `getNullValue` and renames predicates like `isAllOnesValue`
to simply `isAllOnes`. This achieves two things:
1) This starts standardizing predicates across the LLVM codebase,
following (in this case) ConstantInt. The word "Value" doesn't
convey anything of merit, and is missing in some of the other things.
2) Calling an integer "null" doesn't make any sense. The original sin
here is mine and I've regretted it for years. This moves us to calling
it "zero" instead, which is correct!
APInt is widely used and I don't think anyone is keen to take massive source
breakage on anything so core, at least not all in one go. As such, this
doesn't actually delete any entrypoints, it "soft deprecates" them with a
comment.
Included in this patch are changes to a bunch of the codebase, but there are
more. We should normalize SelectionDAG and other APIs as well, which would
make the API change more mechanical.
Differential Revision: https://reviews.llvm.org/D109483
The isEssentiallyExtractHighSubvector function currently calls
getVectorNumElements on a type that in specific cases might be scalable.
Since this function only has correct behaviour at the moment on fixed
types anyway, the function can just return false when given a scalable type.
Differential Revision: https://reviews.llvm.org/D109163
When lowering a fixed length gather/scatter the index type is assumed to
be the same as the memory type, this is incorrect in cases where the
extension of the index has been folded into the addressing mode.
For now add a temporary workaround to fix the codegen faults caused by
this by preventing the removal of this extension. At a later date the
lowering for SVE gather/scatters will be redesigned to improve the way
addressing modes are handled.
As a short term side effect of this change, the addressing modes
generated for fixed length gather/scatters will not be optimal.
Differential Revision: https://reviews.llvm.org/D109145
For vectors that are exactly equal to getMaxSVEVectorSizeInBits, just use
AArch64SVEPredPattern::all, which can enable the use of unpredicated ptrue when available.
TestPlan: check-llvm
Differential Revision: https://reviews.llvm.org/D108706
This patch solely moves the conversion between SVE predicate patterns
and element numbers into two small functions. It's a pre-commit patch for
optimizing ptrue with a known SVE register width.
Differential Revision: https://reviews.llvm.org/D108705
This allows the instruction selector to realize that it can directly
broadcast the low byte of the memset value, rather than replicating
it to a 64-bit GPR before broadcasting.
This fixes PR50985.
Differential Revision: https://reviews.llvm.org/D108354
If Orig produces more than one value (rare) with different types (rarer) then
we need to make sure we check against the one that Orig actually represents,
not just the first type.
Unfortunately because of the combination of things that need to happen I wasn't
able to produce a test.
Follow-up to D107068, attempt to fold nested concat_vectors/undefs, as long as both the vector and inner subvector types are legal.
This exposed the same issue in ARM's MVE LowerCONCAT_VECTORS_i1 (raised as PR51365) and AArch64's performConcatVectorsCombine which both assumed concat_vectors only took 2 subvector operands.
Differential Revision: https://reviews.llvm.org/D107597
This patch enables extending loads for fixed length SVE code generation.
There is a slight regression here in the mulh tests; since these tests
load the parameter and then extend it, these are treated as extending
loads, which are merged, preventing the mulh instruction from being
generated. As this affects scalable SVE codegen as well this should be
addressed in a separate patch.
Reviewed By: bsmith
Differential Revision: https://reviews.llvm.org/D107057
This was checking extends as shuffles, whereas we should be checking
the operands. This helps sink the shuffles, creating more addl/subl
instructions.
Differential Revision: https://reviews.llvm.org/D107623
I was originally going to try to implement this in target-independent
code, but it's actually sort of tricky to generate the correct sequence
for vectors like nxv2f32. So just stick this in target-specific code,
at least for now.
Differential Revision: https://reviews.llvm.org/D107608
The patterns for fixed length gather/scatter with 32-bit offsets and
64-bit memory type are slightly different from the rest of the patterns,
as such the lowering needs to be slightly different to ensure the
correct types are used.
Differential Revision: https://reviews.llvm.org/D107576
IR typically creates INSERT_SUBVECTOR patterns as a widening of the subvector with undefs to pad to the destination size, followed by a shuffle for the actual insertion - SelectionDAGBuilder has to do something similar for shuffles when source/destination vectors are different sizes.
This combine attempts to recognize these patterns by looking for a shuffle of a subvector (from a CONCAT_VECTORS) that starts at a modulo of its size into an otherwise identity shuffle of the base vector.
This uncovered a couple of target-specific issues as we haven't often created INSERT_SUBVECTOR nodes in generic code - aarch64 could only handle insertions into the bottom of undefs (i.e. a vector widening), and x86-avx512 vXi1 insertion wasn't keeping track of undef elements in the base vector.
Fixes PR50053
Differential Revision: https://reviews.llvm.org/D107068
Don't know how to custom expand this
UNREACHABLE executed at llvm-project/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:16788
The fix is to provide missing expansions for:
case ISD::STRICT_FP_TO_UINT:
case ISD::STRICT_FP_TO_SINT:
A test case is provided.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D107452
This patch legalizes the Machine Value Type introduced in D94096 for loads
and stores. A new target hook named getAsmOperandValueType() is added which
maps i512 to MVT::i64x8. GlobalISel falls back to DAG for legalization.
Differential Revision: https://reviews.llvm.org/D94097
An incorrect mask type when lowering an SVE gather/scatter was causing
a codegen fault which manifested as the incorrect predicate size being
used for an SVE gather/scatter (e.g. p0.b rather than p0.d).
Fixes PR51182.
Differential Revision: https://reviews.llvm.org/D106943
This patch implements vector_splice in tablegen for all cases when the
Immediate is positive and lower than the known minimum value of
a scalable vector.
Vector_splice can be implemented using SVE instruction EXT.
For instance :
@llvm.experimental.vector.splice(Vector_1, Vector_2, Imm)
@llvm.experimental.vector.splice(<A,B,C,D>, <E,F,G,H>, 1) ==> <B, C, D, E>
EXT Vector_1, Vector_2, Imm // Vector_1 = B, C, D + Vector_2 = E
Depends on D105633
Differential Revision: https://reviews.llvm.org/D106273
This patch implements vector_splice in tablegen for:
a) when the immediate is equal to -1 (Imm == -1) and uses:
INSR + LASTB
For instance :
@llvm.experimental.vector.splice(Vector_1, Vector_2, -1)
@llvm.experimental.vector.splice(<A,B,C,D>, <E,F,G,H>, -1) ==> <D, E, F, G>
LASTB RegLast, Vector_1 // RegLast = D
INSR Res, (Vector_2 >> 1), RegLast // Res = D + E, F, G
Differential Revision: https://reviews.llvm.org/D105633
This adds custom lowering for truncating stores when operating on
fixed length vectors in SVE. It also includes a DAG combine to
fold extends followed by truncating stores into non-truncating
stores in order to prevent this pattern appearing once truncating
stores are supported.
Currently truncating stores are not used in certain cases where
the size of the vector is larger than the target vector width.
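A hedged VLS-style example (hypothetical, assuming e.g. a 512-bit SVE register size so the vector is wider than NEON): the trunc + store pair can now use a truncating store:
```
define void @store_trunc(<16 x i32> %v, <16 x i16>* %p) {
  %t = trunc <16 x i32> %v to <16 x i16>
  store <16 x i16> %t, <16 x i16>* %p
  ret void
}
```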
Differential Revision: https://reviews.llvm.org/D104471
Basically two parts to this fix:
1. Stop using AtomicExpand to expand cmpxchg i128
2. Fix AArch64ExpandPseudoInsts to use a correct expansion.
From ARM architecture reference:
To atomically load two 64-bit quantities, perform a Load-Exclusive
pair/Store-Exclusive pair sequence of reading and writing the same value
for which the Store-Exclusive pair succeeds, and use the read values
from the Load-Exclusive pair.
Fixes https://bugs.llvm.org/show_bug.cgi?id=51102
Differential Revision: https://reviews.llvm.org/D106039
This removes the promotion of NEON AND, OR and XOR nodes to v2i32/v4i32,
treating them the same as the AArch64 and MVE backends where we just add
the relevant patterns for each legal type. This prevents a lot of
bitcasts from being added to the DAG, which have the potential to make
optimizations more difficult. It does mean adding extra patterns, and
some codegen can change due to the types now being legal, not promoted.
Differential Revision: https://reviews.llvm.org/D105588
The case for nxv2f32/nxv2i32 was already covered by D104573.
This patch builds on top of that by making the mechanism work for
nxv2[b]f16/nxv2i16, nxv4[b]f16/nxv4i16 as well.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D106138
This is mostly a minor convenience, but the pattern seems frequent
enough to be worthwhile (and we'll probably add more uses in the
future).
Differential Revision: https://reviews.llvm.org/D105850
This adds custom lowering for truncating stores when operating on
fixed length vectors in SVE. It also includes a DAG combine to
fold extends followed by truncating stores into non-truncating
stores in order to prevent this pattern appearing once truncating
stores are supported.
Currently truncating stores are not used in certain cases where
the size of the vector is larger than the target vector width.
Differential Revision: https://reviews.llvm.org/D104471
Additionally, lower the floating point compare SVE intrinsics to
SETCC_MERGE_ZERO ISD nodes to avoid duplicating ISel patterns.
Differential Revision: https://reviews.llvm.org/D105486
This patch adds a new ShuffleKind SK_Splice and then handles the cost in
getShuffleCost, as in experimental.vector.reverse.
Differential Revision: https://reviews.llvm.org/D104630
Improve codegen when lowering the common vector shuffle case from the
vectorizer (op1[last]:op2[0:last-1]). This patch only handles this
common case as it is difficult to handle this more generally when using
fixed length vectors, due to being unable to use the SVE ext instruction.
Differential Revision: https://reviews.llvm.org/D105289
Target-independent code only knows how to spill to the stack; instead,
use AArch64ISD::REINTERPRET_CAST.
Differential Revision: https://reviews.llvm.org/D104573
Inserting into a smaller-than-legal scalable vector would result in an
internal compiler error. For example, inserting a <vscale x 4 x i8> into
a <vscale x 8 x i8> (both illegal vector types for SVE) would cause a
crash.
This crash was happening because there was no code to promote (legalise)
the result of an INSERT_SUBVECTOR node.
This patch implements PromoteIntRes_INSERT_SUBVECTOR, which legalises
the ISD node. This is currently done by going through memory. This is
necessary because of the requirement that the SubVec parameter of the
INSERT_SUBVECTOR node must be smaller than the Vec parameter, which
means that INSERT_SUBVECTOR cannot always have legal result/operand
types.
Co-Authored-by: Joe Ellis <joe.ellis@arm.com>
Differential Revision: https://reviews.llvm.org/D102766
Since gather lowering can now lower to nodes that may need expansion via
the vector legalizer, do MGATHER lowering via vector legalizer.
Additionally, as part of adding passthru support for fixed typed
gathers, fix passthru support for scalable types.
Depends on D104910
Differential Revision: https://reviews.llvm.org/D104217
This ports the AArch64 SABD and UABD over to DAG Combine, where they can be
used by more backends (notably MVE in a follow-up patch). The matching code
has changed very little, just to handle legal operations and types
differently. It selects from (ABS (SUB (EXTEND a), (EXTEND b))), producing
an abds/abdu which is zexted to the original type.
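A hedged sketch of the matched shape (names hypothetical): abs(sub(zext, zext)) narrows back to a uabd on the original element type:
```
define <8 x i16> @uabd_shape(<8 x i8> %a, <8 x i8> %b) {
  %ae = zext <8 x i8> %a to <8 x i16>
  %be = zext <8 x i8> %b to <8 x i16>
  %d = sub <8 x i16> %ae, %be
  ; is_int_min_poison=true is fine: a difference of zexts can't be INT_MIN.
  %abs = call <8 x i16> @llvm.abs.v8i16(<8 x i16> %d, i1 true)
  ret <8 x i16> %abs
}
declare <8 x i16> @llvm.abs.v8i16(<8 x i16>, i1)
```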
Differential Revision: https://reviews.llvm.org/D91937
This custom lowers <4 x i8> vector loads using a 32-bit load, followed by 2
SSHLL instructions to extend it to e.g. a <4 x i32> vector. Before, it was
really inefficient and expensive to construct a <4 x i32> for this as 4 byte
loads and 4 moves were used. With this improvement SLP vectorisation might for
example become profitable, see D103629.
Differential Revision: https://reviews.llvm.org/D104782
This is a mechanical change. This actually also renames the
similarly named methods in the SmallString class, however these
methods don't seem to be used outside of the llvm subproject, so
this doesn't break building of the rest of the monorepo.
This also adds new interfaces for the fixed- and scalable case:
* LLT::fixed_vector
* LLT::scalable_vector
The strategy for migrating to the new interfaces was as follows:
* If the new LLT is a (modified) clone of another LLT, taking the
same number of elements, then use LLT::vector(OtherTy.getElementCount())
or if the number of elements is halved/doubled, it uses .divideCoefficientBy(2)
or operator*. That is because there is no reason to specifically restrict
the types to 'fixed_vector'.
* If the algorithm works on the number of elements (as unsigned), then
just use fixed_vector. This will need to be fixed up in the future when
modifying the algorithm to also work for scalable vectors, and will
then need additional tests to confirm the behaviour works the same for
scalable vectors.
* If the test used the `/*Scalable=*/true` flag of LLT::vector, then
this is replaced by LLT::scalable_vector.
Reviewed By: aemerson
Differential Revision: https://reviews.llvm.org/D104451
This only applies to FastIsel. GlobalIsel seems to sidestep
the issue.
This fixes https://bugs.llvm.org/show_bug.cgi?id=46996
One of the things we do in llvm is decide if a type needs
consecutive registers. Previously, we just checked if it
was an array or not.
(plus an SVE specific check that is not changing here)
This causes some confusion when you arbitrary IR like:
```
%T1 = type { double, i1 };
define [ 1 x %T1 ] @foo() {
entry:
ret [ 1 x %T1 ] zeroinitializer
}
```
We see it is an array so we call CC_AArch64_Custom_Block
which bails out when it sees the i1, a type we don't want
to put into a block.
This leaves the location of the double in some kind of
intermediate state and leads to odd codegen, which then crashes
the backend because it doesn't know how to implement
what it's been asked for.
You get this:
```
renamable $d0 = FMOVD0
$w0 = COPY killed renamable $d0
```
Rather than this:
```
$d0 = FMOVD0
$w0 = COPY $wzr
```
The backend knows how to copy 64 bit to 64 bit registers,
but not 64 to 32. It can certainly be taught how but the real
issue seems to be us even trying to assign a register block
in the first place.
This change makes the logic of
AArch64TargetLowering::functionArgumentNeedsConsecutiveRegisters
a bit more in depth. If we find an array, also check that all the
nested aggregates in that array have a single member type.
Then CC_AArch64_Custom_Block's assumption of a type that looks
like [ N x type ] will be valid and we get the expected codegen.
New tests have been added to exercise these situations. Note that
some of the output is not ABI compliant. The aim of this change is
to simply handle these situations and not to make our processing
of arbitrary IR ABI compliant.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D104123
Currently, loop strength reduce is not handling loops with scalable stride very well.
Take a loop vectorized with the scalable vector type <vscale x 8 x i16> for instance,
(refer to test/CodeGen/AArch64/sve-lsr-scaled-index-addressing-mode.ll added).
Memory accesses are incremented by "16*vscale", while the induction variable is incremented
by "8*vscale". The scaling factor "2" needs to be extracted to build the candidate
formula, i.e., "reg(%in) + 2*reg({0,+,(8 * %vscale)})", so that the addrec register
reg({0,+,(8*vscale)}) can be reused among Address and ICmpZero LSRUses to enable
optimal solution selection.
This patch allows LSR getExactSDiv to recognize special cases like "C1*X*Y /s C2*X*Y",
and pull out "C1 /s C2" as the scaling factor whenever possible. Without this change, LSR
misses candidate formulae with the proper scale factor to leverage the target's scaled-index
addressing mode.
Note: This patch doesn't fully fix AArch64 isLegalAddressingMode for scalable
vectors, but it allows a simple valid scale to pass through.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D103939
Given a vecreduce_add node, detect the below pattern and convert it to the node
sequence with UABDL, [S|U]ABD and UADDLP.
i32 vecreduce_add(
v16i32 abs(
v16i32 sub(
v16i32 [sign|zero]_extend(v16i8 a), v16i32 [sign|zero]_extend(v16i8 b))))
=================>
i32 vecreduce_add(
v4i32 UADDLP(
v8i16 add(
v8i16 zext(
v8i8 [S|U]ABD low8:v16i8 a, low8:v16i8 b
v8i16 zext(
v8i8 [S|U]ABD high8:v16i8 a, high8:v16i8 b
Differential Revision: https://reviews.llvm.org/D104042
Don't require a specific kind of IRBuilder for TargetLowering hooks.
This allows us to drop the IRBuilder.h include from TargetLowering.h.
Differential Revision: https://reviews.llvm.org/D103759
This NEG node is just a vector negation, easily represented as a SUB
zero. Removing it from the one place it is generated is essentially an
NFC, but can allow some extra folding. The updated tests are now loading
different constant literals, which have already been negated.
Differential Revision: https://reviews.llvm.org/D103703
setcc (csel 0, 1, cond, X), 1, ne ==> csel 0, 1, !cond, X
Where X is a condition code setting instruction.
Co-authored-by: Paul Walker <paul.walker@arm.com>
Differential Revision: https://reviews.llvm.org/D103256
If a cmpxchg specifies acquire or seq_cst on failure, make sure we
generate code consistent with that ordering even if the success ordering
is not acquire/seq_cst.
At one point, it was ambiguous whether this sort of construct was valid,
but the C++ standard and LLVM now accept arbitrary combinations of
success/failure orderings.
This doesn't address the corresponding issue in AtomicExpand. (This was
reported as https://bugs.llvm.org/show_bug.cgi?id=33332 .)
Fixes https://bugs.llvm.org/show_bug.cgi?id=50512.
Differential Revision: https://reviews.llvm.org/D103284
SwiftTailCC has a different set of requirements than the C calling convention
for a tail call. The exact argument sequence doesn't have to match, but fewer
ABI-affecting attributes are allowed.
Also make sure the musttail diagnostic triggers if a musttail call isn't
actually a tail call.
This adds custom lowering for the MLOAD and MSTORE ISD nodes when
passed fixed length vectors in SVE. This is done by converting the
vectors to VLA vectors and using the VLA code generation.
Fixed length extending loads and truncating stores currently produce
correct code, but do not use the built in extend/truncate in the
load and store instructions. This will be fixed in a future patch.
Differential Revision: https://reviews.llvm.org/D101834
bswap.v2i16 + sitofp in LLVM IR generate a sequence of:
- REV32 + USHR for bswap.v2i16
- SHL + SSHR + SCVTF for sext to v2i32 and scvt
The shift instructions are excessive as noted in PR24820, and they can
be optimized to just SSHR.
Differential Revision: https://reviews.llvm.org/D102333
The implementation just extends the vector to a larger element type, and
extracts from that. Not fancy, but generates reasonable code.
There was discussion in the review of doing the promotion in
target-independent code, but I'm sticking with this to avoid making
LegalizeDAG infrastructure more complicated.
Differential Revision: https://reviews.llvm.org/D87651
Adding lowering support for bitreverse.
Previously, lowering bitreverse would expand it into a series of other instructions. This patch makes it so this produces a single rbit instruction instead.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D102397
Swift's new concurrency features are going to require guaranteed tail calls so
that they don't consume excessive amounts of stack space. This would normally
mean "tailcc", but there are also Swift-specific ABI desires that don't
naturally go along with "tailcc" so this adds another calling convention that's
the combination of "swiftcc" and "tailcc".
Support is added for AArch64 and X86 for now.
AArch64's fcvt* instructions implement the saturating behaviour that the
fpto*i.sat intrinsics require, in cases where the destination width
matches the saturation width. Lowering them removes a lot of unnecessary
generated code.
Only scalar lowerings are supported for now.
Differential Revision: https://reviews.llvm.org/D102353