D113200 introduced an error when converting FP_TO_SI_SAT with a
multiply to a fixed-point floating-point convert. The saturation
bitwidth needs to be equal to the floating-point width, or else the
routine would truncate the result as opposed to saturating it.
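For illustration, a reduced sketch of the kind of input affected
(function name and values hypothetical, not taken from the patch's tests):
; saturating to i8 from f32: the saturation width is narrower than the
; floating-point width, so a fixed-point convert would truncate
define i8 @sat_mul(float %x) {
  %m = fmul float %x, 4.0
  %s = call i8 @llvm.fptosi.sat.i8.f32(float %m)
  ret i8 %s
}
declare i8 @llvm.fptosi.sat.i8.f32(float)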
Fixes #54601
According to the discussion in D120953, we should first exclude all scalable
vector extending loads and then selectively enable those which we directly
support. This patch is intended as that refactoring (truncating stores are
not touched), and for more scalable vector types it will try to reduce the
number of masked loads in favour of more unpklo/hi instructions.
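For example, an extending load of the sort being selectively enabled
(a hypothetical reduction):
define <vscale x 4 x i32> @sext_load(ptr %p) {
  ; extends from one legal scalable type to another
  %v = load <vscale x 4 x i16>, ptr %p
  %e = sext <vscale x 4 x i16> %v to <vscale x 4 x i32>
  ret <vscale x 4 x i32> %e
}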
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D122281
The default expansion for buildvectors is to extract each element and
insert them into a new vector. That involves a lot of copying to/from
the GPR registers. TBL3 and TBL4 can be relatively slow instructions
with the mask needing to be loaded from a constant pool, but they should
always be better than all the moves to/from GPRs.
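As a sketch of the kind of buildvector this applies to (hypothetical;
lane choices arbitrary), picking lanes from three sources cannot be
expressed as a two-input shufflevector, so it reaches codegen as a
BUILD_VECTOR of extracts:
define <4 x i8> @pick_lanes(<8 x i8> %a, <8 x i8> %b, <8 x i8> %c) {
  %a0 = extractelement <8 x i8> %a, i64 0
  %b3 = extractelement <8 x i8> %b, i64 3
  %c5 = extractelement <8 x i8> %c, i64 5
  %c7 = extractelement <8 x i8> %c, i64 7
  %v0 = insertelement <4 x i8> undef, i8 %a0, i64 0
  %v1 = insertelement <4 x i8> %v0, i8 %b3, i64 1
  %v2 = insertelement <4 x i8> %v1, i8 %c5, i64 2
  %v3 = insertelement <4 x i8> %v2, i8 %c7, i64 3
  ret <4 x i8> %v3
}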
Differential Revision: https://reviews.llvm.org/D121137
This reverts commit edb7ba714a.
This changes BLR_BTI to take variable_ops meaning that we can accept
a register or a label. The pattern still expects one argument so we'll
never get more than one. Then later we can check the type of the operand
to choose BL or BLR to emit.
(this is what BLR_RVMARKER does but I missed this detail of it first time around)
Also require NoSLSBLRMitigation which I missed in the first version.
Some implementations of setjmp will end with a br instead of a ret.
This means that the next instruction after a call to setjmp must be
a "bti j" (j for jump) to make this work when branch target identification
is enabled.
The BTI extension was added in armv8.5-a but the bti instruction is in the
hint space. This means we can emit it for any architecture version as long
as branch target enforcement flags are passed.
The starting point for the hint number is 32, then call adds 2 and jump adds 4.
Hence "hint #36" for a "bti j" (and "hint #34" for the "bti c" you see
at the start of functions).
The existing Arm command line option -mno-bti-at-return-twice has been
applied to AArch64 as well.
Support is added to SelectionDAG Isel and GlobalIsel. FastIsel will
defer to SelectionDAG.
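For illustration, a call of the affected shape (reduced sketch):
declare i32 @setjmp(ptr) returns_twice
define i32 @f(ptr %env) {
  ; with branch target enforcement enabled, codegen now places
  ; "bti j" (hint #36) after this call, since some setjmp
  ; implementations return with br rather than ret
  %r = call i32 @setjmp(ptr %env) returns_twice
  ret i32 %r
}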
Based on the change done for M profile Arm in https://reviews.llvm.org/D112427
Fixes #48888
Reviewed By: danielkiss
Differential Revision: https://reviews.llvm.org/D121707
Trying to reduce the number of masked loads in favour of more unpklo/hi
instructions. Both ISD::ZEXTLOAD and ISD::SEXTLOAD are supported for
extensions from legal types.
Test cases for both normal and masked loads have been added to guard
against compile crashes.
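For example, an extending masked load of the kind affected (reduced
sketch; whether it becomes one load plus unpklo/hi depends on the types
involved):
define <vscale x 16 x i16> @masked_zext(ptr %p, <vscale x 16 x i1> %mask) {
  %v = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr %p, i32 1, <vscale x 16 x i1> %mask, <vscale x 16 x i8> zeroinitializer)
  %e = zext <vscale x 16 x i8> %v to <vscale x 16 x i16>
  ret <vscale x 16 x i16> %e
}
declare <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr, i32, <vscale x 16 x i1>, <vscale x 16 x i8>)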
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D120953
When N > 12, (2^N - 1) is not a legal add immediate (isLegalAddImmediate
returns false), and if the SetCC input uses such a number, the DAG combiner
will generate one more SRL instruction. So combine [setcc (srl x, imm), 0, ne]
into [setcc (and x, (-1 << imm)), 0, ne] to get better optimization in
emitComparison.
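Concretely, with imm = 16 (a hypothetical reduction):
define i1 @cmp_srl(i64 %x) {
  ; (x >> 16) != 0 is the same as (x & 0xffffffffffff0000) != 0, and the
  ; mask is a legal logical immediate, so emitComparison can use a tst
  %s = lshr i64 %x, 16
  %c = icmp ne i64 %s, 0
  ret i1 %c
}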
Fixes https://github.com/llvm/llvm-project/issues/54283
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D121449
When inverting the compare predicate trySwapVSelectOperands is
incorrectly using the type of the select's cond operand rather
than the type of cond's operands. This means we're treating all
inversions as if they're integer.
Differential Revision: https://reviews.llvm.org/D121968
We already have custom lowering for a v4i8 load, which loads it as an f32,
converts that to a vector, and bitcasts and extends the result to a v4i16.
This adds some custom lowering of concat(v4i8 load, ...) to keep the
result as an f32 and create a buildvector of the resulting f32 loads.
This helps not create all the extends and bitcasts, which are often
difficult to fully clean up.
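For example (a hypothetical reduction):
define <8 x i8> @concat_v4i8(ptr %p, ptr %q) {
  ; each v4i8 load stays an f32 load, and the two f32 results feed a
  ; buildvector, instead of going through extends and bitcasts
  %a = load <4 x i8>, ptr %p, align 4
  %b = load <4 x i8>, ptr %q, align 4
  %c = shufflevector <4 x i8> %a, <4 x i8> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  ret <8 x i8> %c
}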
Differential Revision: https://reviews.llvm.org/D121400
If we already have an AArch64ISD::ANDS node with identical operands, we
can merge any ISD::AND into it, reducing the instruction count by
calculating the value and the flags in a single operation. This code is
taken from the X86 backend, and could also handle AArch64ISD::ADDS and
AArch64ISD::SUBS, but I couldn't find any test cases where it came up.
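A sketch of the kind of input where this can fire (hypothetical):
define i32 @ands_reuse(i32 %x, i32 %y) {
  ; the icmp wants the flags of (x & y) and the select wants its value;
  ; a single ANDS can now provide both
  %a = and i32 %x, %y
  %c = icmp eq i32 %a, 0
  %r = select i1 %c, i32 %x, i32 %a
  ret i32 %r
}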
Differential Revision: https://reviews.llvm.org/D118584
Includes verifier changes checking the elementtype, clang codegen
changes to emit the elementtype, and ISel changes using the elementtype.
Reviewed By: #opaque-pointers, nikic
Differential Revision: https://reviews.llvm.org/D120527
When building the LLVM test suite with SVE I discovered a crash
when compiling some Halide tests, which occurs because we try to
use SVE to lower 64-bit vector multiplies and there is no
vscale_range attribute on the function. In this case the min SVE
vector bits was 0, which caused an assert in LowerToPredicatedOp
to fire. I have amended the asserts in this function to check that the
fixed-width type is legal. If the fixed-width type is larger than NEON
and is legal then it must be because we've set the min SVE vector
bits to something > 128. Or if the min SVE bits is 0, then the only
legal types allowed are 128 bit types - for any other types the assert
will fire.
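A reduced sketch of the kind of function that hit the assert (not the
Halide test itself):
; no vscale_range attribute, so the min SVE vector width is unknown
define <2 x i64> @mul_v2i64(<2 x i64> %a, <2 x i64> %b) #0 {
  ; 64-bit element multiplies have no NEON instruction, so with +sve
  ; they are lowered via predicated SVE operations
  %m = mul <2 x i64> %a, %b
  ret <2 x i64> %m
}
attributes #0 = { "target-features"="+sve" }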
Tests added here:
CodeGen/AArch64/sve-fixed-length-no-vscale-range.ll
Differential Revision: https://reviews.llvm.org/D121297
While checking whether tail call optimization is possible, the calling
convention applied to fixed arguments is not the correct one.
This implies for DarwinPCS that all arguments of a vararg function will
go to the stack although fixed ones can go in registers.
This prevents non-virtual thunks from being tail-call optimized even
though they are marked as musttail.
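For example, a forwarding thunk of the affected shape (hypothetical):
declare i32 @callee(ptr, ...)
define i32 @thunk(ptr %this, ...) {
  ; under DarwinPCS the fixed argument %this can be passed in a register;
  ; treating it as stack-bound blocked the tail call
  %r = musttail call i32 (ptr, ...) @callee(ptr %this, ...)
  ret i32 %r
}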
Differential Revision: https://reviews.llvm.org/D120622
A TBL instruction will use zero for any out of range values. We can use
this in GenerateTBL to help turn a TBL2 into a TBL1, avoiding the need
to materialise the zero.
Differential Revision: https://reviews.llvm.org/D121139
Materialize : i1 = extract_vector_elt t37, Constant:i64<0>
... into: "ptrue p, all" + PTEST
The test bit of lane 0 can use the P register directly, and the instruction
"ptrue all" is loop invariant, which will be beneficial to SVE after hoisting
out of the loop.
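In IR terms the pattern is (reduced sketch):
define i1 @lane0(<vscale x 16 x i1> %p) {
  ; lane 0 of a predicate can be tested with "ptrue p, all" + PTEST
  ; instead of moving the lane out to a general purpose register
  %b = extractelement <vscale x 16 x i1> %p, i64 0
  ret i1 %b
}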
Reviewed By: david-arm, paulwalker-arm
Differential Revision: https://reviews.llvm.org/D120891
When lowering large v16f32->v16i8 fp_to_si_sat, the fp_to_si_sat node is
split several times, creating an illegal v4i8 concat that gets expanded
into a BUILD_VECTOR. After some combining and other legalisation, it
ends up as a buildvector that extracts from 4 vectors, looking like
BUILDVECTOR(a0,a1,a2,a3,b0,b1,b2,b3,c0,c1,c2,c3,d0,d1,d2,d3). That is
really a v16i32->v16i8 truncate in disguise.
This adds a ReconstructTruncateFromBuildVector method to detect the
pattern, converting it back into the legal "concat(trunc(concat(trunc(a),
trunc(b))), trunc(concat(trunc(c), trunc(d))))" tree. The extracted
nodes could also be v4i16, in which case the truncates are not needed.
All those truncates and concats then become uzip1's, which is much
better than expanding by moving vector lanes around.
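The originating case, reduced (assuming the llvm.fptosi.sat intrinsic form):
define <16 x i8> @fptosi_sat(<16 x float> %x) {
  ; previously expanded via a BUILD_VECTOR of extracts; now rebuilt as a
  ; tree of truncates and concats that select to uzp1 instructions
  %s = call <16 x i8> @llvm.fptosi.sat.v16i8.v16f32(<16 x float> %x)
  ret <16 x i8> %s
}
declare <16 x i8> @llvm.fptosi.sat.v16i8.v16f32(<16 x float>)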
Differential Revision: https://reviews.llvm.org/D119469
When triggered during operation legalisation the affected combine
generates a splat_vector that, when custom lowered for SVE fixed
length code generation, results in the original precombine sequence,
and thus we enter a legalisation/combine hang.
NOTE: The patch contains no tests because I observed this issue
only when combined with other work that might never become public.
The current way AArch64 lowers ISD::SPLAT_VECTOR meant a specific
test was not possible so I'm hoping the DAGCombiner fix can be seen
as obvious. The AArch64ISelLowering change is required to maintain
existing code quality.
Differential Revision: https://reviews.llvm.org/D120735
This turns uzp1(x, undef) into concat(truncate(x), undef), as the truncate
is simpler and can often be optimized away, and it helps some of the
insert-subvector tests optimize more cleanly.
Differential Revision: https://reviews.llvm.org/D120879
The compiler currently crashes for scalable types when compiling with
+sme, e.g.
define <vscale x 4 x i32> @foo(<vscale x 4 x i32> %a) {
ret <vscale x 4 x i32> %a
}
since it doesn't know how to legalize the types. SME implies a subset of
SVE (+streaming-sve), so the hasSVE predication in the backend needs
extending to consider types/operations that are legal in Streaming SVE.
This is the first patch adding legal types <-> register classes. Before
making the change +sve(2) was temporarily replaced with +sme in all the
intrinsics tests to see what failed, and again after making the change.
For all the tests that passed after adding the legal types another RUN
line has been added for +streaming-sve. More patches to follow.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D118561
This patch addresses @paulwalker-arm's comment on D117900 to
only update/write the by-ref operands iff the function returns
true. It also handles a few more cases where a series of added
offsets can be folded into the base pointer, rather than just looking
at a single offset.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D119728
Internally to DAGCombiner the SDValues were passed by non-const
reference despite not being modified. They were then passed by
const reference to TLI.
This patch passes them by value which is consistent with the vast
majority of code.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D120420
We have a combine for converting mul(dup(ext(..)), ...) into
mul(ext(dup(..)), ..), for allowing more uses of smull and umull
instructions. Currently it looks for vector insert and shuffle vectors
to detect the element that we can convert to a vector extend. Not all
cases will have a shufflevector/insert element though.
This started by extending the recognition to buildvectors (with elements
that may be individually extended). The new method seems to cover all
the cases that the old method captured though, as the shuffle will
eventually be lowered to buildvectors, so the old method has been
removed to keep the code a little simpler. The new code detects legal
build_vector(ext(a), ext(b), ..), converting them to ext(build_vector(a,
b, ..)) providing all the extends/types match up.
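For instance, a mul fed by individually extended elements (hypothetical
reduction):
define <4 x i32> @smull_bv(<4 x i16> %a, i16 %b0, i16 %b1) {
  %ae = sext <4 x i16> %a to <4 x i32>
  %b0e = sext i16 %b0 to i32
  %b1e = sext i16 %b1 to i32
  %v0 = insertelement <4 x i32> undef, i32 %b0e, i64 0
  %v1 = insertelement <4 x i32> %v0, i32 %b1e, i64 1
  %v2 = insertelement <4 x i32> %v1, i32 %b0e, i64 2
  %v3 = insertelement <4 x i32> %v2, i32 %b1e, i64 3
  ; the second operand is build_vector(sext, sext, ..), now treated as
  ; sext(build_vector(..)) so the mul can select to smull
  %m = mul <4 x i32> %ae, %v3
  ret <4 x i32> %m
}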
Differential Revision: https://reviews.llvm.org/D120018
LR is modified at the moment of the call and before any use is read.
Reviewers: reames
Reviewed By: reames
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D120114
We have some duplicate patterns between the AArch64ISD::UMULL (/SMULL)
and the int_aarch64_neon_umull (/smull) intrinsics. They did not
replicate all the patterns though, leaving some gaps on instructions
like umlal2 from codegen. This commons all the patterns by converting
all int_aarch64_neon_umull intrinsics to UMULL nodes and removing the
duplicate for umull/smull intrinsics, so that all instructions go
through the same tablegen pattern.
This improves some of the longer-than-legal mla patterns, helping them
replace ext with umlal2.
Differential Revision: https://reviews.llvm.org/D119887
(vselect (setcc (condcode) (_) (_)) (a) (op (a) (b)))
=> (vselect (setcc (!condcode) (_) (_)) (op (a) (b)) (a))
As a follow-up to D117689, invert the operand order and condition
in order to fold vselects into predicated instructions.
Differential Revision: https://reviews.llvm.org/D119424
This also requires adjustment to code in AArch64ISelLowering so that
vector_extract is distributed over strict_fadd.
Differential Revision: https://reviews.llvm.org/D118489
This consists of marking the various strict opcodes as legal, and
adjusting instruction selection patterns so that 'op' is 'any_op'.
FP16 and vector instructions additionally require some extra work in
lowering and legalization, so we can't set IsStrictFPEnabled just yet.
Also more work needs to be done for full strict fp support (marking
instructions that can raise exceptions as such, and modelling FPCR use
for controlling rounding).
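For example, a constrained operation that can now be selected directly
(sketch):
define float @strict_fadd(float %a, float %b) strictfp {
  %r = call strictfp float @llvm.experimental.constrained.fadd.f32(float %a, float %b, metadata !"round.dynamic", metadata !"fpexcept.strict")
  ret float %r
}
declare float @llvm.experimental.constrained.fadd.f32(float, float, metadata, metadata)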
Differential Revision: https://reviews.llvm.org/D114946
This is some NFC (hopefully!) cleanup for performCommonVectorExtendCombine
and related methods, removing conditions that cannot occur and otherwise
cleaning up the code a little.
This ports the aarch64 combines for HADD and RHADD over to DAG combine,
so that they can be used in more architectures (notably MVE in a
followup patch). They are renamed to AVGFLOOR and AVGCEIL in the
process, to avoid confusion with instructions such as X86 hadd. The code
was also rewritten slightly to remove the AArch64 idiosyncrasies.
The general pattern for an AVGFLOORS is
%xe = sext i8 %x to i32
%ye = sext i8 %y to i32
%a = add i32 %xe, %ye
%r = lshr i32 %a, 1
%t = trunc i32 %r to i8
An AVGFLOORU is equivalent with zext. Because of the truncate,
lshr is equivalent to ashr, as the top bits are not demanded. An AVGCEIL
also includes an extra rounding, so it includes an extra add of 1.
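For completeness, the rounding unsigned form matched as AVGCEILU, in the
same style as the pattern above:
%xe = zext i8 %x to i32
%ye = zext i8 %y to i32
%a = add i32 %xe, %ye
%a1 = add i32 %a, 1 ; the extra rounding add
%r = lshr i32 %a1, 1
%t = trunc i32 %r to i8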
Differential Revision: https://reviews.llvm.org/D106237