Similar to the existing (shl (srl x, c1), c2) fold
Part of the work to fix the regressions in D77804
Differential Revision: https://reviews.llvm.org/D125836
These intrinsics return the number of elements in a streaming
vector, for example aarch64.sme.cntsw returns the number of
32-bit elements. When in streaming mode these are equivalent
to aarch64.sve.cntb/h/w/d with an input value of 1.
I have implemented these intrinsics using the rdsvl instruction
and added tests here:
CodeGen/AArch64/SME/sme-intrinsics-rdsvl.ll
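For illustration, a minimal use (a sketch, not one of the added tests;
it assumes the no-argument i64 signature described above):

  declare i64 @llvm.aarch64.sme.cntsw()

  define i64 @num_sw_elements() {
    ; in streaming mode, expected to lower to rdsvl plus a shift
    %n = call i64 @llvm.aarch64.sme.cntsw()
    ret i64 %n
  }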
Differential Revision: https://reviews.llvm.org/D127853
This patch adds implementations for the read/write SME ACLE intrinsics:
@llvm.aarch64.sme.read.horiz
@llvm.aarch64.sme.read.vert
@llvm.aarch64.sme.write.horiz
@llvm.aarch64.sme.write.vert
These all map to the SME mova instruction.
Differential Revision: https://reviews.llvm.org/D127414
This patch adds implementations for the fill/spill SME ACLE intrinsics:
@llvm.aarch64.sme.ldr
@llvm.aarch64.sme.str
Differential Revision: https://reviews.llvm.org/D127317
This patch adds implementations for the load/store SME ACLE intrinsics:
- @llvm.aarch64.sme.ld1*
- @llvm.aarch64.sme.st1*
Differential Revision: https://reviews.llvm.org/D127210
As a follow-up to D126686, this does the same fold for floating point
add and shuffle. In this case it is limited, via the reassoc flag, to
either x[0]+x[1] or x[1]+x[0] for both result[0] and result[1].
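An illustrative IR shape for the fold (hypothetical function, not taken
from the patch's tests):

  define <2 x float> @pair_sum(<2 x float> %x) {
    ; both result lanes are x[0]+x[1] / x[1]+x[0], so with reassoc a
    ; single faddp can produce them
    %rev = shufflevector <2 x float> %x, <2 x float> poison, <2 x i32> <i32 1, i32 0>
    %sum = fadd reassoc <2 x float> %x, %rev
    ret <2 x float> %sum
  }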
Differential Revision: https://reviews.llvm.org/D127087
This was assuming the vector types were MVTs, but they don't have to be.
Note that the concrete output of the test isn't very useful, since it's
dominated by nonsensical calling convention lowering for the weird types.
Differential Revision: https://reviews.llvm.org/D126505
Bitcasting between unpacked scalable vector types of different
element counts is not a NOP because the live elements are laid out
differently.
               01234567
e.g. nxv2i32 = XX??XX??
     nxv4f16 = X?X?X?X?
Differential Revision: https://reviews.llvm.org/D126957
I can't remove the function just yet as it is used in the generated .inc files.
I would also like to provide a way to compare alignment with TypeSize since it came up a few times.
Differential Revision: https://reviews.llvm.org/D126910
This adds a fold of add(x, shuffle(x, <1,0,3,2,5,4,...>)) into
shuffle(addp(x), <0,0,1,1,2,2,...>). The ADDP instruction takes two
vectors and returns one, adding adjacent pairs. So we match x in a
custom combine as it is lowered from a v8i32. The original code
would be 2 rev64 and 2 add instructions, with the new code being a
single addp with a zip1;zip2 shuffle, producing smaller code.
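A sketch of the input pattern in IR (hypothetical test function):

  define <4 x i32> @adjacent_pairs(<4 x i32> %x) {
    ; each result lane is the sum of an adjacent pair of lanes of x,
    ; which is exactly what addp computes
    %swap = shufflevector <4 x i32> %x, <4 x i32> poison, <4 x i32> <i32 1, i32 0, i32 3, i32 2>
    %sum = add <4 x i32> %x, %swap
    ret <4 x i32> %sum
  }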
Differential Revision: https://reviews.llvm.org/D126686
Patch enables custom lowering for MVT::nxv4bf16 because otherwise
the refactored test file triggers a selection failure.
The reason for the refactoring is to highlight cases where the
generated code is wrong.
LowerINSERT_SUBVECTOR emits AArch64ISD::UUNPK## when lowering
scalable vector floating point INSERT_SUBVECTOR. However, these
nodes only make sense for integer types and thus isel patterns do
not exist for floating point, which leads to isel failures.
This patch ensures floating point operands are cast to integer
before the core lowering takes place.
Fixes: #55037
Differential Revision: https://reviews.llvm.org/D126487
Following discussion on D120261 and D121208 it seems better to remove the
concept of Streaming SVE from the subtarget/assembler predicates and
instead reason about 'SVE' and 'SME' as higher level features, rather
than trying to model this runtime mode through explicit feature flags.
This patch is largely NFC.
Reviewed By: paulwalker-arm, david-arm
Differential Revision: https://reviews.llvm.org/D125977
If both a v2i32 DUP(x) and a v4i32 DUP(x) node exists, we can re-use the
larger node using a vector extract to obtain the smaller. This comes up
in the smull/smlal code, but needs a small fixup to allow the smull2
code in tryExtendDUPToExtractHigh/performAddSubLongCombine to still
match smull2 extracts.
Differential Revision: https://reviews.llvm.org/D126449
If the fma operates on a legal vector type, the indexed variants can be
used, if the second operand is a splat of a valid index.
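For example (an illustrative sketch, not from the patch's tests), the
splat of lane 1 below should allow the indexed form:

  declare <4 x float> @llvm.fma.v4f32(<4 x float>, <4 x float>, <4 x float>)

  define <4 x float> @fma_lane(<4 x float> %a, <4 x float> %b, <4 x float> %acc) {
    ; second multiplicand is a splat of b[1], matching fmla v.4s, v.4s, v.s[1]
    %splat = shufflevector <4 x float> %b, <4 x float> poison, <4 x i32> <i32 1, i32 1, i32 1, i32 1>
    %r = call <4 x float> @llvm.fma.v4f32(<4 x float> %a, <4 x float> %splat, <4 x float> %acc)
    ret <4 x float> %r
  }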
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D126234
Currently reassociating add expressions can lead to failing to select
(u|s)mlal. Implement isReassocProfitable to skip reassociating
expressions that can be lowered to (u|s)mlal.
The same issue exists for the *mlsl variants as well, but the DAG
combiner doesn't use the isReassocProfitable hook before reassociating.
To be fixed in a follow-up commit as this requires DAGCombiner changes
as well.
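The shape that must survive reassociation looks like this (illustrative
sketch):

  define <4 x i32> @smlal_shape(<4 x i32> %acc, <4 x i16> %a, <4 x i16> %b) {
    ; add(acc, mul(sext(a), sext(b))) selects directly to smlal
    %sa = sext <4 x i16> %a to <4 x i32>
    %sb = sext <4 x i16> %b to <4 x i32>
    %mul = mul <4 x i32> %sa, %sb
    %r = add <4 x i32> %acc, %mul
    ret <4 x i32> %r
  }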
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D125895
Lower fixed length MGATHER/MSCATTER operations to scalable vector
equivalents, which are then lowered to SVE specific nodes. This
two stage process is in preparation for making scalable vector
MGATHER/MSCATTER operations legal.
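A fixed-length example of the kind now funnelled through the scalable
path (a sketch; alignment and passthru chosen arbitrarily):

  declare <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr>, i32, <4 x i1>, <4 x i32>)

  define <4 x i32> @gather4(<4 x ptr> %ptrs, <4 x i1> %mask) {
    %r = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> %ptrs, i32 4, <4 x i1> %mask, <4 x i32> zeroinitializer)
    ret <4 x i32> %r
  }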
Differential Revision: https://reviews.llvm.org/D125192
This patch implements a target-specific optimization that replaces
the cmp and csel from cttz with an and mask.
Recommitted with a fix for truncated value sizes.
Differential Revision: https://reviews.llvm.org/D123782
A TBL instruction will fill out-of-range values with 0's, something used
in D121139 to turn tbl2 with a zero input into tbl1s. This works OK for
v16i8, but for v8i8 the input is still treated as a v16i8, so
out-of-range values (like a lane index of 8) would end up loading values
from the top half of the input register. Clean this up by detecting
such values and making sure the indices used really are out of range.
There is also a fix for swapped indices of 64-bit input vectors, which
could be incorrectly adjusted if the zero vector was the first operand.
Fixes #55545
Differential Revision: https://reviews.llvm.org/D125865
Most clients only used these methods because they wanted to be able to
extend or truncate to the same bit width (which is a no-op). Now that
the standard zext, sext and trunc allow this, there is no reason to use
the OrSelf versions.
The OrSelf versions additionally have the strange behaviour of allowing
extending to a *smaller* width, or truncating to a *larger* width, which
are also treated as no-ops. A small amount of client code relied on this
(ConstantRange::castOp and MicrosoftCXXNameMangler::mangleNumber) and
needed rewriting.
Differential Revision: https://reviews.llvm.org/D125557
Similar to D123386, this adds D-Movs to the AArch64 perfect shuffle
tables, lowering the costs a little more. This is a rough
improvement in general, especially if you ignore mov v0.16b, v2.16b type
moves that are often artefacts of the calling convention.
The D register movs are encoded as (0x4 | LaneIdx), and to generate a D
register move we are required to bitcast into a higher type, but it is
otherwise very similar to the S-lane movs already supported.
Differential Revision: https://reviews.llvm.org/D125477
If we're using shift pairs to mask, then relax the one use limit if the shift amounts are equal - we'll only be generating a single AND node.
AArch64 has a couple of regressions due to this, so I've enforced the existing one use limit inside an AArch64TargetLowering::shouldFoldConstantShiftPairToMask callback.
Part of the work to fix the regressions in D77804
Differential Revision: https://reviews.llvm.org/D125607
NEON does not have native support for xor reductions. However, when
reducing predicate vectors the operation is synonymous with an add
reduction that is supported.
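The case in question looks like this (illustrative only):

  declare i1 @llvm.vector.reduce.xor.v16i1(<16 x i1>)

  define i1 @parity(<16 x i1> %p) {
    ; xor across i1 lanes is the sum modulo 2, i.e. an add reduction
    %r = call i1 @llvm.vector.reduce.xor.v16i1(<16 x i1> %p)
    ret i1 %r
  }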
Differential Revision: https://reviews.llvm.org/D125605
During early gather/scatter enablement two different approaches
were taken to represent scaled indices:
* A Scale operand whereby byte_offsets = Index * Scale
* An IndexType whereby byte_offsets = Index * sizeof(MemVT.ElementType)
Having multiple representations is bad as shown by this patch which
fixes instances where the two are out of sync. The dedicated scale
operand is more flexible and pervasive so this patch removes the
UNSCALED values from IndexType. This means all indices are scaled
but the scale can be one, hence unscaled. SDNodes now use the scale
operand to answer the "isScaledIndex" question.
I toyed with the idea of keeping the UNSCALED enums and helper
functions but because they will have no uses and force SDNodes to
validate the set of supported values I figured it's best to remove
them. We can re-add them if there's a real need. For similar
reasons I've kept the IndexType enum when a bool could be used, as I
think being explicit looks better.
Depends On D123347
Differential Revision: https://reviews.llvm.org/D123381
In some situations we can visit 64-bit vector extract elements in
tryCombineFixedPointConvert, where an assert fires as they are expected
to have been converted to 128-bit. Turn the assert into an if statement,
bailing out and letting the extract be handled first.
Also invert some ifs, using early exits to reduce indentation.
Fixes #55417
`performFlagSettingCombine` is a generalised version of `performANDSCombine` which also works on `ADCS` and `SBCS`.
Differential revision: https://reviews.llvm.org/D124464
When a fixed length load is lowered to an SVE masked load, the
result chain is currently set to the input chain of the old load,
rather than the result chain of the new load. This may cause stores
to be incorrectly reordered.
Fixes https://github.com/llvm/llvm-project/issues/55281.
Differential Revision: https://reviews.llvm.org/D125464
When extracting the first lane of a predicate created using the
llvm.get.active.lane.mask intrinsic, it should give the same codegen as
when the predicate is created using the llvm.aarch64.sve.whilelo
intrinsic, since get.active.lane.mask is lowered to whilelo. This patch
ensures the codegen is the same by recognizing
llvm.get.active.lane.mask as a flag-setting operation in this case.
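A sketch of the pattern (hypothetical function):

  declare <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64, i64)

  define i1 @first_active(i64 %base, i64 %n) {
    ; extracting lane 0 should reuse the flags of the whilelo this lowers to
    %mask = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 %base, i64 %n)
    %bit = extractelement <vscale x 4 x i1> %mask, i64 0
    ret i1 %bit
  }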
Differential Revision: https://reviews.llvm.org/D125215
Conversions to SVBool are already considered a nop if they are
converting an operand from a ptrue or a cmp, because those zero the
extra predicate lanes by construction.
This patch adds 2 similar cases:
- Wide compares, which were not directly recognized by the test for
other forms of cmp
- Splats of 1, which will be generated as ptrue, and as such will
also zero the extra predicate lanes.
Reviewed By: paulwalker-arm, peterwaller-arm
Differential Revision: https://reviews.llvm.org/D124908
This patch implements a target-specific optimization that replaces
the cmp and csel from cttz with an and mask.
Differential Revision: https://reviews.llvm.org/D123782
Add helper functions to query the signed and scaled properties
of ISD::IndexType along with functions to change them.
Remove setIndexType from MaskedGatherSDNode because it only has
one usage and typically should only be changed alongside its
index operand.
Minimise the direct use of the enum values to lay the groundwork
for more refactoring.
Differential Revision: https://reviews.llvm.org/D123347
13403a70e4 introduced a bug where we
generate the outgoing carry inverted, which in turn breaks the
lowering of @llvm.usub.sat.i128, returning the normal difference on
saturation and zero otherwise.
Note that AArch64 has peculiar semantics where the subtraction
instructions generate borrow inverted. The problem is that we mix the
two forms of semantics -- the normal carry and inverted carry -- in
the area of extended precision subtractions. Specifically, we have
three problems:
- lowerADDSUBCARRY takes the non-inverted incoming carry from a
subtraction and feeds it to SBCS without inverting it first.
- lowerADDSUBCARRY makes available the outgoing carry from SBCS
without inverting it.
- foldOverflowCheck folds:
(SBC{S} l r (CMP (CSET LO carry) 1)) => (SBC{S} l r carry)
When the incoming carry flag is set, CSET LO results in zero. CMP
in turn generates a borrow, *clearing* the carry flag. Instead, we
should fold:
(SBC{S} l r (CMP 0 (CSET LO carry))) => (SBC{S} l r carry)
When the incoming carry flag is set, CSET LO results in zero. CMP
does not generate a borrow, *setting* the carry flag.
IIUC, we should use the normal (that is, non-inverted) semantics for
carry everywhere.
This patch fixes the three problems above.
This patch does not add any new testcases because we have plenty of
them covering the instruction in question. In particular,
@u128_saturating_sub is identical to the testcase in the motivating
issue.
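For reference, that testcase has roughly this shape:

  declare i128 @llvm.usub.sat.i128(i128, i128)

  define i128 @u128_saturating_sub(i128 %a, i128 %b) {
    ; must return 0 whenever a < b; the inverted outgoing carry broke this
    %r = call i128 @llvm.usub.sat.i128(i128 %a, i128 %b)
    ret i128 %r
  }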
Fixes: #55253
Differential Revision: https://reviews.llvm.org/D124976
This is essentially a refactoring patch but allows more cases to
be caught, hence the output changes to some tests.
Differential Revision: https://reviews.llvm.org/D122994
getGatherScatterIndexIsExtended currently looks through all
SIGN_EXTEND_INREG operations regardless of their input type. This
patch restricts the code to only look through i32->i64 extensions,
which are the ones supported implicitly by SVE addressing modes.
Differential Revision: https://reviews.llvm.org/D123318
refineUniformBase and selectGatherScatterAddrMode both attempt the
transformation:
base(0) + index(A+splat(B)) => base(B) + index(A)
However, this is only safe when index is not implicitly scaled.
Differential Revision: https://reviews.llvm.org/D123222
Given a shuffle with 4 elements of size 16 or 32, we can use the costs
directly from the PerfectShuffle tables to get a slightly more accurate
cost for the resulting shuffle.
Differential Revision: https://reviews.llvm.org/D123409
This teaches the perfect shuffle tables about lane inserts, that can
help reduce the cost of many entries. Many of the shuffle masks are
one-away from being correct, and a simple lane move can be a lot simpler
than trying to use ext/zip/etc. Because they are not exactly like the
other masks handled in the perfect shuffle tables, they require special
casing to generate them, with a special InsOp Operator.
The lane to insert into is encoded as the RHSID, and the lane to move
from is grabbed from the original mask. This helps reduce the maximum perfect
shuffle entry cost to 3, with many more shuffles being generatable in a
single instruction.
Differential Revision: https://reviews.llvm.org/D123386
The perfect shuffle tables encode a cost of either 0 (a nop-copy) or 1
(a single instruction) with a cost encoding of 0 in the upper 2 bits.
All perfect shuffles with any cost are then marked as legal shuffles
though (the maximum encoded cost is 3), which can confuse the DAG
combiner into thinking the shuffles are cheaper than they should be.
Limiting legal shuffles to single instructions seems to do better in
most cases, producing fewer instructions for complex shuffles. There are
some cases that now become tbl, which may be better or worse depending
on whether the instruction is in a loop and the tbl load can be hoisted
out.
Differential Revision: https://reviews.llvm.org/D123377
1. Expanding X%C into the equivalent X-X/C*C is not always the fastest
path if no matching SDIV exists, so first check whether the target has
a faster lowering for SREM alone.
2. Add an AArch64 fast path for SREM by a power of 2.
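An example that takes the new fast path (sketch):

  define i32 @srem_by_8(i32 %x) {
    ; srem by a power of two, with no sdiv that could share the expansion
    %r = srem i32 %x, 8
    ret i32 %r
  }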
Fixes https://github.com/llvm/llvm-project/issues/54649
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D122968
Similar to D122281, we should first exclude all scalable vector
extending stores and then selectively enable those which we directly
support.
Also merge the integer and float scalable vector types into
scalable_vector_valuetypes.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D123449
Handle unsupported passthru values before lowering the gather to
target specific nodes. This is a simplification that's on the road
to moving more of MGATHER lowering into td based isel.
Differential Revision: https://reviews.llvm.org/D123683
For strict FP16 to work correctly, some changes in lowering and
legalization are needed:
* SelectionDAGLegalize::PromoteNode was missing handling for some
strict fp opcodes.
* Some of the custom lowering of strict fp operations needed to be
adjusted to work with FP16.
* Custom lowering needed to be added for round-to-int operations.
With this, and the previous patches for the rest of the strict fp
isel, we can set IsStrictFPEnabled = true.
Differential Revision: https://reviews.llvm.org/D115620
The existing code was not updating the uses of loads that it recreated,
leading to incorrect chains which could break the ordering between
nodes. This moves the code to a combine instead, and makes sure we
update the chain references. This does mean it happens earlier -
potentially before the concats are simplified. This can lead to
inefficiencies in the codegen, which will be fixed in followups.
The lowering code did not use the scale operand of MGATHER/MSCATTER
nodes, but instead assumed scaled indices were always scaled based
on the element type of the memory type. This patch adds the missing
support by rewriting the nodes as unscaled variants.
Differential Revision: https://reviews.llvm.org/D123670
We were previously lowering to the incorrect instructions for the
setcc DAG node when using the SETUEQ and SETONE floating point
condition codes. I have fixed this by marking the SETONE code
as Expand and letting the SETUNE code be legal. I have also
fixed up the patterns for FCMNE_PPzZZ and FCMNE_PPzZ0 to use
the correct opcode.
Differential Revision: https://reviews.llvm.org/D121905
Perfect shuffle costs are always encoded less than 4, and shouldn't
really have a cost more than 3, so it makes no sense to check it when
generating shuffles. The perfect shuffle is likely always better than a
tbl too (although that may depend on whether it is in a loop).
Use the same enum as the other atomic instructions for consistency, in
preparation for addition of another strategy.
Introduce a new "Expand" option, since the store expansion does not
use cmpxchg. Alternatively, the existing CmpXChg strategy could be
renamed to Expand.
In COFF, the immediates in IMAGE_REL_ARM64_PAGEBASE_REL21 relocations
are limited to 21 bit signed, i.e. the offset has to be less than
(1 << 20). The previous limit did intend to cover this case, but
had missed that the 21-bit field was signed.
This fixes issue https://github.com/llvm/llvm-project/issues/54753.
Differential Revision: https://reviews.llvm.org/D123160
The WHILELO/WHILELS instructions are very important for SVE loops, and
are themselves flag-setting operations, so add them.
Reviewed By: paulwalker-arm, david-arm
Differential Revision: https://reviews.llvm.org/D122796
Following the discussion in D122281, we are missing an ISD::AND combine
for MLOAD because it relies on BuildVectorSDNode, which fails for
scalable vectors. This patch handles that, so we can circle back to the
MVT::nxv2i32 type.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D122703
Extracting the last active element will output LASTB + WHILELS, and
WHILELS itself is a flag-setting operation, so prefer it.
Reviewed By: paulwalker-arm, sdesmalen
Differential Revision: https://reviews.llvm.org/D122551
D120018 altered this combine to work on buildvectors as opposed to
shuffle dup's. This works well for dups and other things that are
expanded into buildvectors. Some shuffles are legal though, and stay as
vector_shuffle through lowering. This expands the transform to also
handle shuffles, so that we can turn mul(shuffle(sext)) into
mul(sext(shuffle)) and more readily make smull/umull instructions. This
can come up from the SLP vectorizer adding shuffles that are costed from
extends.
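An illustrative input (hypothetical, not from the patch's tests):

  define <4 x i32> @mul_shuffled_exts(<4 x i16> %a, <4 x i16> %b) {
    ; the combine rewrites this to mul(sext(shuffle(a)), sext(shuffle(b))),
    ; whose mul of two sexts matches smull
    %sa = sext <4 x i16> %a to <4 x i32>
    %sb = sext <4 x i16> %b to <4 x i32>
    %pa = shufflevector <4 x i32> %sa, <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
    %pb = shufflevector <4 x i32> %sb, <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
    %m = mul <4 x i32> %pa, %pb
    ret <4 x i32> %m
  }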
Differential Revision: https://reviews.llvm.org/D123012
D113200 introduced an error where it was converting FP_TO_SI_SAT with
multiply to a fixed point floating point convert. The saturation
bitwidth needs to be equal to the floating point width, or else the
routine would truncate the result as opposed to saturating it.
Fixes #54601
Following the discussion in D120953, we should first exclude all
scalable vector extending loads and then selectively enable those which
we directly support. This patch refactors towards that (truncating
stores are not touched), and for more scalable vector types it tries to
reduce the number of masked loads in favour of more unpklo/hi
instructions.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D122281
The default expansion for buildvectors is to extract each element and
insert them into a new vector. That involves a lot of copying to/from
the GPR registers. TBL3 and TBL4 can be relatively slow instructions
with the mask needing to be loaded from a constant pool, but they should
always be better than all the moves to/from GPRs.
Differential Revision: https://reviews.llvm.org/D121137
This reverts commit edb7ba714a.
This changes BLR_BTI to take variable_ops meaning that we can accept
a register or a label. The pattern still expects one argument so we'll
never get more than one. Then later we can check the type of the operand
to choose BL or BLR to emit.
(this is what BLR_RVMARKER does but I missed this detail of it first time around)
Also require NoSLSBLRMitigation which I missed in the first version.
Some implementations of setjmp will end with a br instead of a ret.
This means that the next instruction after a call to setjmp must be
a "bti j" (j for jump) to make this work when branch target identification
is enabled.
The BTI extension was added in armv8.5-a but the bti instruction is in the
hint space. This means we can emit it for any architecture version as long
as branch target enforcement flags are passed.
The starting point for the hint number is 32 then call adds 2, jump adds 4.
Hence "hint #36" for a "bti j" (and "hint #34" for the "bti c" you see
at the start of functions).
The existing Arm command line option -mno-bti-at-return-twice has been
applied to AArch64 as well.
Support is added to SelectionDAG ISel and GlobalISel. FastISel will
defer to SelectionDAG.
Based on the change done for M profile Arm in https://reviews.llvm.org/D112427
Fixes #48888
Reviewed By: danielkiss
Differential Revision: https://reviews.llvm.org/D121707
Trying to reduce the number of masked loads in favour of more unpklo/hi
instructions. Both ISD::ZEXTLOAD and ISD::SEXTLOAD are supported for
extensions from legal types.
Test cases for both normal and masked loads have been added to guard
against compile crashes.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D120953
When N > 12, (2^N - 1) is not a legal add immediate
(isLegalAddImmediate will return false), and if the SetCC input uses
such a number the DAG combiner will generate one more SRL instruction.
So combine [setcc (srl x, imm), 0, ne] to
[setcc (and x, (-1 << imm)), 0, ne] to get better optimization in
emitComparison
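For example (sketch):

  define i1 @any_high_bits(i64 %x) {
    ; expected to become setcc(and x, -1 << 13), 0, ne, i.e. a single tst
    %hi = lshr i64 %x, 13
    %ne = icmp ne i64 %hi, 0
    ret i1 %ne
  }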
Fixes https://github.com/llvm/llvm-project/issues/54283
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D121449
When inverting the compare predicate trySwapVSelectOperands is
incorrectly using the type of the select's cond operand rather
than the type of cond's operands. This means we're treating all
inversions as if they're integer.
Differential Revision: https://reviews.llvm.org/D121968
We already have custom lowering for v4i8 load, which loads as a f32,
converts to a vector and bitcasts and extends the result to a v4i16.
This adds some custom lowering of concat(v4i8 load, ...) to keep the
result as an f32 and create a buildvector of the resulting f32 loads.
This helps not create all the extends and bitcasts, which are often
difficult to fully clean up.
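A sketch of the pattern now handled:

  define <8 x i8> @concat_loads(ptr %p, ptr %q) {
    ; both v4i8 loads can stay as f32 loads feeding one buildvector
    %a = load <4 x i8>, ptr %p, align 4
    %b = load <4 x i8>, ptr %q, align 4
    %c = shufflevector <4 x i8> %a, <4 x i8> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
    ret <8 x i8> %c
  }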
Differential Revision: https://reviews.llvm.org/D121400
If we already have a AArch64ISD::ANDS node with identical operands, we
can merge any ISD::AND into it, reducing the instruction count by
calculating the value and the flags in a single operation. This code is
taken from the X86 backend, and could also handle AArch64ISD::ADDS and
AArch64ISD::SUBS, but I couldn't find any test cases where it came up.
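A hypothetical example of the shape involved:

  define i32 @and_and_flags(i32 %a, i32 %b) {
    ; the and is needed both for its value and for the zero test;
    ; a single ands can now provide both
    %and = and i32 %a, %b
    %iszero = icmp eq i32 %and, 0
    %r = select i1 %iszero, i32 1, i32 %and
    ret i32 %r
  }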
Differential Revision: https://reviews.llvm.org/D118584
Includes verifier changes checking the elementtype, clang codegen
changes to emit the elementtype, and ISel changes using the elementtype.
Reviewed By: #opaque-pointers, nikic
Differential Revision: https://reviews.llvm.org/D120527
When building the LLVM test suite with SVE I discovered a crash
when compiling some Halide tests, which occurs because we try to
use SVE to lower 64-bit vector multiplies and there is no
vscale_range attribute on the function. In this case the min SVE
vector bits was 0, which caused an assert in LowerToPredicatedOp
to fire. I have amended the asserts in this function to check that the
fixed-width type is legal. If the fixed-width type is larger than NEON's 128 bits
and is legal then it must be because we've set the min SVE vector
bits to something > 128. Or if the min SVE bits is 0, then the only
legal types allowed are 128 bit types - for any other types the assert
will fire.
Tests added here:
CodeGen/AArch64/sve-fixed-length-no-vscale-range.ll
Differential Revision: https://reviews.llvm.org/D121297
While checking whether tail call optimization is possible, the calling
convention applied to fixed arguments is not the correct one.
This implies for DarwinPCS that all arguments of a vararg function will
go to the stack although fixed ones can go in registers.
This prevents non-virtual thunks from being tail optimized although they are
marked as musttail.
Differential Revision: https://reviews.llvm.org/D120622
A TBL instruction will use zero for any out of range values. We can use
this in GenerateTBL to help turn a TBL2 into a TBL1, avoiding the need
to materialise the zero.
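Illustrative shuffle (hypothetical): the lanes taken from the zero
vector become out-of-range tbl indices instead, so only one table
register is needed:

  define <8 x i8> @tbl_with_zeros(<16 x i8> %a) {
    ; indices >= 16 select from the all-zero second operand
    %s = shufflevector <16 x i8> %a, <16 x i8> zeroinitializer, <8 x i32> <i32 0, i32 16, i32 2, i32 18, i32 4, i32 20, i32 6, i32 22>
    ret <8 x i8> %s
  }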
Differential Revision: https://reviews.llvm.org/D121139
Materialize: i1 = extract_vector_elt t37, Constant:i64<0>
... into: "ptrue p, all" + PTEST
The test of lane 0 can use the P register directly, and the "ptrue all"
instruction is loop invariant, which benefits SVE once it is hoisted
out of the loop.
Reviewed By: david-arm, paulwalker-arm
Differential Revision: https://reviews.llvm.org/D120891
When lowering large v16f32->v16i8 fp_to_si_sat, the fp_to_si_sat node is
split several times, creating an illegal v4i8 concat that gets expanded
into a BUILD_VECTOR. After some combining and other legalisation, it
ends up as a buildvector that extracts from 4 vectors, looking like
BUILDVECTOR(a0,a1,a2,a3,b0,b1,b2,b3,c0,c1,c2,c3,d0,d1,d2,d3). That is
really a v16i32->v16i8 truncate in disguise.
This adds a ReconstructTruncateFromBuildVector method to detect the
pattern, converting it back into the legal "concat(trunc(concat(trunc(a),
trunc(b))), trunc(concat(trunc(c), trunc(d))))" tree. The extracted
nodes could also be v4i16, in which case the truncates are not needed.
All those truncates and concats then become uzip1's, which is much
better than expanding by moving vector lanes around.
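The motivating kind of input (sketch):

  declare <16 x i8> @llvm.fptosi.sat.v16i8.v16f32(<16 x float>)

  define <16 x i8> @big_convert(<16 x float> %x) {
    ; splitting this used to leave a buildvector of extracts; it can now
    ; be rebuilt as a tree of truncating concats (uzp1s)
    %r = call <16 x i8> @llvm.fptosi.sat.v16i8.v16f32(<16 x float> %x)
    ret <16 x i8> %r
  }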
Differential Revision: https://reviews.llvm.org/D119469
When triggered during operation legalisation the affected combine
generates a splat_vector that when custom lowered for SVE fixed
length code generation, results in the original precombine sequence
and thus we enter a legalisation/combine hang.
NOTE: The patch contains no tests because I observed this issue
only when combined with other work that might never become public.
The current way AArch64 lowers ISD::SPLAT_VECTOR meant a specific
test was not possible so I'm hoping the DAGCombiner fix can be seen
as obvious. The AArch64ISelLowering change is required to maintain
existing code quality.
Differential Revision: https://reviews.llvm.org/D120735
This turns uzp1(x, undef) into concat(truncate(x), undef), as the truncate
is simpler and can often be optimized away, and it helps some of the
insert-subvector tests optimize more cleanly.
Differential Revision: https://reviews.llvm.org/D120879
The compiler currently crashes for scalable types when compiling with
+sme, e.g.
define <vscale x 4 x i32> @foo(<vscale x 4 x i32> %a) {
ret <vscale x 4 x i32> %a
}
since it doesn't know how to legalize the types. SME implies a subset of
SVE (+streaming-sve), so the hasSVE predication in the backend needs
extending to consider types/operations that are legal in Streaming SVE.
This is the first patch adding legal types <-> register classes. Before
making the change +sve(2) was temporarily replaced with +sme in all the
intrinsics tests to see what failed, and again after making the change.
For all the tests that passed after adding the legal types another RUN
line has been added for +streaming-sve. More patches to follow.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D118561
This patch addresses @paulwalker-arm's comment on D117900 to
only update/write the by-ref operands iff the function returns
true. It also handles a few more cases where a series of added
offsets can be folded into the base pointer, rather than just looking
at a single offset.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D119728