Commit Graph

1445 Commits

Author SHA1 Message Date
Vitaly Buka 16fecdfa70 Revert "[AArch64] Add `foldCSELOfCSEl` DAG combine"
Breaks ubsan on buildbot, details in D125504

This reverts commit 6f9423ef06.
2022-08-16 20:29:37 -07:00
Karl Meakin 6f9423ef06 [AArch64] Add `foldCSELOfCSEl` DAG combine
Differential Revision: https://reviews.llvm.org/D125504
2022-08-16 12:49:11 +01:00
Zain Jaffal 7155ed4289 [AArch64] Add support for 256-bit non-temporal loads
Currently non-temporal loads are mapped to ordinary `LDP` or `LDR` instructions. This patch maps 256-bit non-temporal loads to `LDNP` instead. Future patches should address other non-temporal loads.

Reviewed By: fhahn, dmgreen

Differential Revision: https://reviews.llvm.org/D131773
2022-08-16 12:19:36 +01:00
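
For illustration, a minimal C sketch of a 256-bit non-temporal load (not part of the commit; the vector typedef is illustrative). Clang's __builtin_nontemporal_load emits a load with !nontemporal metadata, which is the kind of access this patch can lower to LDNP:

    #include <stdint.h>

    /* A 256-bit (32-byte) vector type; illustrative name. */
    typedef int32_t v8i32 __attribute__((vector_size(32)));

    v8i32 load_nt(const v8i32 *p) {
        /* Non-temporal vector load; can select LDNP with this patch. */
        return __builtin_nontemporal_load(p);
    }
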
Vitaly Buka e0e960923f [AArch64] Fix signed integer overflow in CSINC case
Follow-up to D131815, which overflows on different values.
2022-08-15 15:04:20 -07:00
Vitaly Buka f1596952f9 [AArch64] Fix signed integer overflow in CSINC case
https://lab.llvm.org/staging/#/builders/224/builds/2/steps/16/logs/stdio

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D131815
2022-08-13 13:12:09 -07:00
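
A minimal sketch of the bug class being fixed (the exact expression is not shown in the commit, so this models the usual UBSan-flagged pattern): signed arithmetic on a boundary constant overflows, so the computation is done in unsigned instead.

    #include <stdint.h>

    /* UB if c == INT64_MAX: signed overflow, which UBSan flags. */
    int64_t next_bad(int64_t c) { return c + 1; }

    /* Well defined: unsigned arithmetic wraps, then convert back. */
    int64_t next_good(int64_t c) { return (int64_t)((uint64_t)c + 1); }
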
David Green a9e9dd9a3a [AArch64] Add bf16 select handling
A bfloat select operation will currently crash, but is allowed from C.
This adds handling for the operation, turning it into a FCSELHrrr if
fullfp16 is present, or converting it to a FCSELSrrr if not. The
FCSELSrrr is created by using INSERT_SUBREG/EXTRACT_SUBREG to convert
the bf16 to an f32 and using the f32 pattern for FCSELSrrr. (I originally
attempted to do this via a tablegen pattern, but it appears that the
nzcv glue is placed onto the wrong node, causing it to be dropped and
incorrect scheduling to be emitted.)

The FCSELSrrr can also be used for fp16 selects when +fullfp16 is not
present, which helps avoid an unnecessary promotion to f32.

Differential Revision: https://reviews.llvm.org/D131253
2022-08-11 14:20:36 +01:00
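
A minimal C reproducer sketch for the select described above (assuming a Clang build where __bf16 values can be passed and selected; not taken from the commit's tests):

    /* A plain ternary on bfloat values produces the bf16 select node
       that previously crashed and is now lowered to FCSELHrrr/FCSELSrrr. */
    __bf16 pick(__bf16 a, __bf16 b, int c) {
        return c ? a : b;
    }
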
David Truby b1b9c39629 [AArch64][SVE] Use SVE for VLS fcopysign for wide vectors
Currently fcopysign for VLS vectors lowers through NEON even when the
vector width is wider than a NEON vector, causing bad codegen as the
vectors are split. This patch causes SVE to be used for these vectors
instead, giving much better codegen on wide VLS vectors.

Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D128642
2022-08-10 10:17:19 +00:00
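
A sketch of the kind of loop affected, assuming a fixed SVE vector width is specified (the flag value is illustrative, e.g. -O2 -msve-vector-bits=512):

    #include <math.h>

    /* Vectorizes to fcopysign on VLS vectors wider than a NEON register,
       which this patch now lowers through SVE instead of splitting. */
    void vls_copysign(double *restrict dst, const double *restrict mag,
                      const double *restrict sgn, int n) {
        for (int i = 0; i < n; i++)
            dst[i] = copysign(mag[i], sgn[i]);
    }
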
Fangrui Song de9d80c1c5 [llvm] LLVM_FALLTHROUGH => [[fallthrough]]. NFC
With C++17 there is no Clang pedantic warning or MSVC C5051.
2022-08-08 11:24:15 -07:00
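
For reference, a minimal sketch of the mechanical change (illustrative function, not from the commit); the attribute is standard C++17 and the same syntax is valid C23:

    int classify(int c) {
        switch (c) {
        case '\t':
            [[fallthrough]];   /* was: LLVM_FALLTHROUGH; */
        case ' ':
            return 1;
        default:
            return 0;
        }
    }
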
David Truby 9a976f3661 [llvm] Always use TargetConstant for FP_ROUND ISD Nodes
This patch ensures consistency in the construction of FP_ROUND nodes
such that they always use ISD::TargetConstant instead of ISD::Constant.

This additionally fixes a bug in the AArch64 SVE backend where patterns
were matching against TargetConstant nodes and sometimes failing when
passed a Constant node.

Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D130370
2022-08-03 14:02:11 +01:00
David Green 1206f72e31 [AArch64] Fold Mul(And(Srl(X, 15), 0x10001), 0xffff) to CMLTz
This folds a v4i32 Mul(And(Srl(X, 15), 0x10001), 0xffff) into a v8i16
CMLTz instruction. The Srl and And extract the top bit of each i16 half
(whether the input is negative), and the Mul sets every bit of that half
to 1 or 0 depending on whether the top bit was set. This is equivalent
to a v8i16 CMLTz instruction. The same applies to other sizes with
equivalent constants.

Differential Revision: https://reviews.llvm.org/D130874
2022-08-02 13:01:59 +01:00
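
A small self-checking C sketch of why the fold is sound, modeling one i32 lane as two i16 halves (names and test values are illustrative, not from the commit):

    #include <assert.h>
    #include <stdint.h>

    /* Srl+And extract each half's sign bit; Mul by 0xffff smears it. */
    static uint32_t mul_form(uint32_t x) {
        return ((x >> 15) & 0x10001u) * 0xffffu;
    }

    /* Per-i16 "compare less than zero": all-ones if negative, else 0. */
    static uint32_t cmltz_form(uint32_t x) {
        uint32_t lo = ((int16_t)(x & 0xffffu) < 0) ? 0xffffu : 0u;
        uint32_t hi = ((int16_t)(x >> 16) < 0) ? 0xffffu : 0u;
        return (hi << 16) | lo;
    }

    int main(void) {
        uint32_t tests[] = {0u, 0x8000u, 0x80000000u, 0x7fff7fffu,
                            0xffffffffu};
        for (unsigned i = 0; i < sizeof tests / sizeof *tests; i++)
            assert(mul_form(tests[i]) == cmltz_form(tests[i]));
        return 0;
    }
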
Kazu Hirata 12b29900a1 Use any_of (NFC) 2022-07-30 10:35:56 -07:00
Matt Devereau a8b726ac65 [AArch64][SVE] Change DupLane128Combine Index comparison to 0
IdxInsert == IdxDupLane is incorrect: IdxInsert is the starting element number,
whereas IdxDupLane is the index of a quadword.
2022-07-29 14:31:00 +00:00
David Sherwood 487fa6f8c3 [AArch64][DAGCombine] Add performBuildVectorCombine 'extract_elt ~> anyext'
A build vector of two extracted elements is equivalent to an extract
subvector where the inner vector is any-extended to the
extract_vector_elt VT, because extract_vector_elt has the effect of an
any-extend.

  (build_vector (extract_elt_i16_to_i32 vec Idx+0) (extract_elt_i16_to_i32 vec Idx+1))
  => (extract_subvector (anyext_i16_to_i32 vec) Idx)

Depends on D130697

Differential Revision: https://reviews.llvm.org/D130698
2022-07-29 09:51:09 +01:00
Amara Emerson 65246d3eb4 Use hasNItemsOrLess() in MRI::hasAtMostUserInstrs(). 2022-07-27 11:42:14 -07:00
Mingming Liu 34348814e1 [AArch64] Explicitly use v1i64 type for llvm.aarch64.neon.pmull64
Without this, the intrinsic operands will be expanded to integers and an
explicit copy (from GPR to SIMD register) will be codegen'd. This matches the
general convention of using "v1" types to represent scalar integer operations in
vector registers.

A similar approach is used in D56616, and the pattern likely applies to
other intrinsics that accept integer scalars (e.g.,
int_aarch64_neon_sqdmulls_scalar).

Differential Revision: https://reviews.llvm.org/D130548
2022-07-27 11:11:16 -07:00
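
A sketch using the ACLE intrinsic that lowers to llvm.aarch64.neon.pmull64 (assumes a target with the PMULL feature, e.g. -march=armv8-a+aes; the loads-from-memory shape is what benefits here):

    #include <arm_neon.h>

    /* With the v1i64 change, operands loaded from memory can stay in
       SIMD registers instead of round-tripping through GPRs. */
    poly128_t pmull_from_mem(const poly64_t *a, const poly64_t *b) {
        return vmull_p64(*a, *b);
    }
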
Amara Emerson 19cdd1908b [AArch64][GlobalISel] Add heuristics for localizing G_CONSTANT.
This adds heuristics similar to those for G_GLOBAL_VALUE, querying the
code-size cost of materializing a specific constant. Doing so prevents us
from sinking constants which require multiple instructions to generate
into use blocks.

Code size savings on CTMark -Os:
Program                                       size.__text
                                              before         after           diff
ClamAV/clamscan                               381940.00      382052.00       0.0%
lencod/lencod                                 428408.00      428428.00       0.0%
SPASS/SPASS                                   411868.00      411876.00       0.0%
kimwitu++/kc                                  449944.00      449944.00       0.0%
Bullet/bullet                                 463588.00      463556.00      -0.0%
sqlite3/sqlite3                               284696.00      284668.00      -0.0%
consumer-typeset/consumer-typeset             414492.00      414424.00      -0.0%
7zip/7zip-benchmark                           595244.00      594972.00      -0.0%
mafft/pairlocalalign                          247512.00      247368.00      -0.1%
tramp3d-v4/tramp3d-v4                         372884.00      372044.00      -0.2%
                           Geomean difference                               -0.0%

Differential Revision: https://reviews.llvm.org/D130554
2022-07-27 10:51:16 -07:00
Paul Walker e5c892dd85 [SVE][SelectionDAG] Use INDEX to generate matching instances of BUILD_VECTOR.
This patch starts small, only detecting sequences of the form
<a, a+n, a+2n, a+3n, ...> where a and n are ConstantSDNodes.

Differential Revision: https://reviews.llvm.org/D125194
2022-07-26 15:28:37 +00:00
Sander de Smalen a41ddf178e [AArch64][SVE] Sink ptrue into loop if it is used by PTEST.
This helps fold away the ptest instructions, which requires knowing whether
the general predicate is known to zero the inactive lanes.

This fixes some PTEST regressions introduced by D129282.

Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D129852
2022-07-26 15:07:41 +01:00
Sander de Smalen 370ff43a15 [AArch64][SVE] Consider more intrinsics in 'isZeroingInactiveLanes'.
This fixes some PTEST regressions introduced by D129282.

Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D129851
2022-07-26 15:07:41 +01:00
Bradley Smith 953a98ef8d [AArch64][SVE] Fold target specific ext/trunc nodes into loads/stores
Due to the way fixed-length SVE lowering works, we sometimes introduce
ext/trunc nodes very late; these nodes then immediately get converted
into target-specific nodes (UUNPKLO/UZP1) before they get a chance to be
folded into a load/store.

This patch introduces target specific dag combines for these nodes so that
we can still create extending loads/truncating stores out of them.

Differential Revision: https://reviews.llvm.org/D128065
2022-07-25 15:24:05 +00:00
Cullen Rhodes c04ff587dc [AArch64] Combine setcc (iN (bitcast (vNi1 X))) with vecreduce_or
Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D130163
2022-07-25 12:14:33 +00:00
Rosie Sumpter 034a27e688 [AArch64] Add f16 fpimm patterns
This patch recognizes f16 immediates as legal and adds the necessary
patterns. This allows the fadda folding introduced in 05d424d165
to be applied to the f16 cases.

Differential Revision: https://reviews.llvm.org/D129989
2022-07-25 09:08:10 +01:00
Simon Pilgrim 939cf9b1be [AArch64] Use neon instructions for i64/i128 ISD::PARITY calculation
As noticed on D129765 and reported in Issue #56531, AArch64 targets can use the NEON ctpop + add-reduce instructions to speed up scalar ctpop instructions, but we fail to do this for parity calculations.

I'm not sure where the cutoff should be for specific CPUs, but i64 (+ the i128 special case) shows a definite reduction in instruction count. i32 is about the same (but scalar <-> NEON transfers are probably more costly?), and sub-i32 promotion looks to be a definite regression compared to parity expansion optimized for those widths.

Differential Revision: https://reviews.llvm.org/D130246
2022-07-22 17:24:17 +01:00
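
A sketch of source code that reaches ISD::PARITY (the i128 split below is an illustrative model of the special case, not the backend's exact expansion):

    /* Lowers to ISD::PARITY; with this patch i64 uses NEON CNT plus an
       add-reduce instead of a scalar XOR/shift chain. */
    int parity64(unsigned long long x) {
        return __builtin_parityll(x);
    }

    int parity128(unsigned __int128 x) {
        return __builtin_parityll((unsigned long long)x) ^
               __builtin_parityll((unsigned long long)(x >> 64));
    }
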
Cullen Rhodes bf268a05cd [AArch64] Emit vector FP cmp when LE is used with fast-math
Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D130093
2022-07-22 07:53:55 +00:00
Matt Devereau cd3d7bf15d [AArch64][SVE] Add DAG-Combine to push bitcasts from floating point loads after DUPLANE128
This patch lowers
  duplane128(insert_subvector(undef, bitcast(op(128bitsubvec)), 0), 0)
to
  bitcast(duplane128(insert_subvector(undef, op(128bitsubvec), 0), 0)).

This enables floating-point loads to match patterns added in
https://reviews.llvm.org/D130010

Differential Revision: https://reviews.llvm.org/D130013
2022-07-21 11:00:10 +00:00
Simon Pilgrim 0f6b0461b0 [DAG] SimplifyDemandedBits - relax "xor (X >> ShiftC), XorC --> (not X) >> ShiftC" to match only demanded bits
The "xor (X >> ShiftC), XorC --> (not X) >> ShiftC" fold is currently limited to the XOR mask being a shifted all-bits mask, but we can relax this to only need to match under the demanded bits.

This helps expose more bit extraction/clearing patterns and fixes the PowerPC testCompares*.ll regressions from D127115

Alive2: https://alive2.llvm.org/ce/z/fl7T7K

Differential Revision: https://reviews.llvm.org/D129933
2022-07-19 10:59:07 +01:00
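
A small self-checking C sketch of the relaxation (constants chosen for illustration): 0xFF is not a shifted all-bits mask for the >> 28, but it matches under the 0xF bits actually demanded by the surrounding AND:

    #include <assert.h>
    #include <stdint.h>

    static uint32_t before(uint32_t x) { return ((x >> 28) ^ 0xFFu) & 0xFu; }
    static uint32_t after(uint32_t x)  { return (~x >> 28) & 0xFu; }

    int main(void) {
        /* Exhaustive over the top nibble, with varied low bits. */
        for (uint32_t n = 0; n < 16; n++) {
            uint32_t x = (n << 28) | 0x0123456u;
            assert(before(x) == after(x));
        }
        return 0;
    }
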
Simon Pilgrim 4f10516325 [AArch64] isDesirableToCommuteWithShift - add explicit ShiftLHS variable instead of altering an incoming argument variable
As discussed on D129995, altering the 'N' variable to point to shift's source value was confusing.
2022-07-18 13:28:07 +01:00
Simon Pilgrim 259c36e7c1 [DAG] Add asserts to isDesirableToCommuteWithShift overrides to ensure it's being called from a shift. NFC. 2022-07-18 13:11:24 +01:00
Simon Pilgrim a5d0122f75 [DAG] Canonicalize non-inlane shuffle -> AND if all non-inlane referenced elements are known zero
As mentioned on D127115, this patch attempts to recognise shuffle masks that can be simplified to an AND mask. We already have a similar transform that folds AND -> 'clear mask' shuffle; this patch handles cases where the referenced elements are not from the same lane indices but are known to be zero.

Differential Revision: https://reviews.llvm.org/D129150
2022-07-16 11:38:24 +01:00
Rosie Sumpter e5edc1b5ee [AArch64][SVE] Ensure PTEST operands have type nxv16i1
Currently any legal predicate types will be pattern-matched when
creating a PTEST instruction. This could be a problem in future since
PTEST always uses the .B specifier for the operand, but it is not
always guaranteed that the extra lanes of unpacked types (e.g. nxv4i1)
are zero. This patch ensures the operands of PTEST are type nxv16i1,
where the undef lanes are set to zero.

Differential Revision: https://reviews.llvm.org/D129282
2022-07-12 09:27:59 +01:00
David Green 4334cbd49b [AArch64] Remove incorrect use of DemandElts
This call to computeKnownBits was passing in a 0xff mask, apparently
expecting it to be used as a DemandedBits mask rather than a DemandedElts mask.
2022-07-08 11:38:00 +01:00
Florian Hahn afdedd405e [AArch64] Try to re-use extended operand for SETCC with vector ops.
Try to re-use an already extended operand for SetCC with vector operands
feeding an extended select. Doing so avoids requiring another full
extension of the SET_CC result when lowering the select.

This improves lowering for certain extend/cmp/select patterns. For
example, with v16i8 this replaces the 6 instructions needed for the extra
extension with 4 separate selects.

This improves the generated code for loops like the one below in
combination with D96522.

    #include <stdint.h>

    unsigned long long foo(uint8_t *p, int N) {
      unsigned long long sum = 0;
      for (int i = 0; i < N; i++, p++) {
        unsigned int v = *p;
        sum += (v < 127) ? v : 256 - v;
      }
      return sum;
    }

https://clang.godbolt.org/z/Wco866MjY

On the AArch64 cores I have access to, the patch improves performance of
the vector loop by ~10%.

This could be generalized in follow-ups, but the initial version
targets one of the more important cases in combination with D96522.

Alive2 modeling:
* sext EQ https://alive2.llvm.org/ce/z/5upBvb
* sext NE https://alive2.llvm.org/ce/z/zbEcJp
* zext EQ https://alive2.llvm.org/ce/z/_xMwof
* zext NE https://alive2.llvm.org/ce/z/5FwKfc
* zext unsigned predicate: https://alive2.llvm.org/ce/z/iEwLU3
* sext signed predicate: https://alive2.llvm.org/ce/z/aMBega

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D120481
2022-07-07 16:50:00 -07:00
Bradley Smith 60d6be5dd3 [LegalizeTypes] Replace vecreduce_xor/or/and with vecreduce_add/umax/umin if not legal
This is done during type legalization since the target representation of
these nodes may not be valid until after type legalization, and after
type legalization the fact that these are dealing with i1 types may be
lost.

Differential Revision: https://reviews.llvm.org/D128996
2022-07-07 09:33:54 +00:00
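
A scalar C sketch of why the swap is sound for i1 lanes (illustrative, not from the commit): an OR reduction over i1 values asks "is any lane set", which a UMAX reduction answers identically once the values are just 0/1.

    #include <stdbool.h>

    /* For i1 lanes, an OR reduction gives the same result as a UMAX
       reduction; likewise AND matches UMIN and XOR matches ADD mod 2. */
    bool any_set(const bool *lanes, int n) {
        bool acc = false;
        for (int i = 0; i < n; i++)
            acc = lanes[i] > acc ? lanes[i] : acc;  /* umax == or on i1 */
        return acc;
    }
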
Sander de Smalen 5d4f6ce229 [AArch64][SVE] Zero other lanes when doing OR reduction on unpacked predicate using ptest.
When the predicate vector is unpacked, we cannot assume anything about the
values in the other lanes. We have to make sure we use the correct
predicate where we know that the other lanes have been zeroed.

Reviewed By: RosieSumpter

Differential Revision: https://reviews.llvm.org/D129081
2022-07-06 16:12:44 +00:00
Sander de Smalen 95e08824fa [AArch64] Add support for various operations on nxv1i1 types.
The supported operations are:
* Logical operations (and, or, xor, bic)
* Logical reductions (and, or, xor, [us]min, [us]max)
* Conversions to/from svbool_t
* Predicate count (CNTP)

Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D128835
2022-07-06 15:57:11 +00:00
David Sherwood 77b13a57a9 [AArch64][SME] Add SME addha/va intrinsics
This patch adds the following new SME intrinsics:

  @llvm.aarch64.sme.addva
  @llvm.aarch64.sme.addha

Differential Revision: https://reviews.llvm.org/D127861
2022-07-05 09:47:17 +01:00
Sander de Smalen bf89d24f53 [AArch64] NFC: Move safe predicate casting to a separate function.
This patch puts the code to safely bitcast a predicate, and possibly zero
any undefined lanes when doing a widening cast, into one place and merges
the functionality with lowerConvertToSVBool.

This is some cleanup inspired by D128665.

Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D128926
2022-07-04 10:32:54 +00:00
Sander de Smalen 690db16422 [AArch64] Make nxv1i1 types a legal type for SVE.
One motivation for adding support for these types is the LD1Q/ST1Q
instructions in SME, for which we have defined a number of load/store
intrinsics which at the moment still take a `<vscale x 16 x i1>` predicate
regardless of their element type.

This patch adds basic support for the nxv1i1 type such that it can be passed/returned
from functions, along with enough support to handle some existing tests that
result in an nxv1i1 type. It also adds support for splats.

Other operations (e.g. insert/extract subvector, logical ops, etc) will be
supported in follow-up patches.

Reviewed By: paulwalker-arm, efriedma

Differential Revision: https://reviews.llvm.org/D128665
2022-07-01 15:11:13 +00:00
Matt Devereau 5166345f50 [SVE][AArch64] Refine hasSVEArgsOrReturn
As described in aapcs64 (https://github.com/ARM-software/abi-aa/blob/2022Q1/aapcs64/aapcs64.rst#scalable-vector-registers),
AAVPCS is used only when registers z0-z7 take an SVE argument. This fixes the case where floats occupy the lower bits
of registers z0-z7 but SVE arguments in registers greater than z7 cause a function to use AAVPCS where it should use AAPCS.

Moving SVE function deduction from AArch64RegisterInfo::hasSVEArgsOrReturn to AArch64TargetLowering::LowerFormalArguments,
where physical register lowering is more accurate, fixes this.

Differential Revision: https://reviews.llvm.org/D127209
2022-07-01 13:24:55 +00:00
Matt Devereau 018a0dd5c8 [AArch64][SVE] Create AArch64ISD node for DUPQLANE128
Create an AArch64ISD node instead of emitting machine node DUP_ZZI_Q.
This allows a simpler DAG combine for work previously attempted
in https://reviews.llvm.org/D128503

Differential Revision: https://reviews.llvm.org/D128902
2022-07-01 11:46:24 +00:00
Sander de Smalen fbefc62a96 [AArch64][SME] Sink tile offset operands into the loop for load/store instructions.
This helps ISel decompose the generic offset for the tile into a base + offset.

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D128508
2022-06-28 10:28:36 +01:00
David Sherwood 054faac9f9 [AArch64][SME] Add SVE2 psel, uclamp, sclamp and revd IR intrinsics
When the SME feature is enabled we also gain access to a few extra
SVE2 instructions. This patch adds LLVM IR intrinsics to make use
of these new instructions:

  @llvm.aarch64.sve.psel
  @llvm.aarch64.sve.revd
  @llvm.aarch64.sve.sclamp
  @llvm.aarch64.sve.uclamp

Differential Revision: https://reviews.llvm.org/D128332
2022-06-28 10:25:06 +01:00
David Sherwood f916ee0fb1 [AArch64][SME] Add SME outer product intrinsics
This patch adds the following intrinsics to support the SME ACLE:

  * @llvm.aarch64.sme.mopa: Non-widening outer product + accumulate
  * @llvm.aarch64.sme.mops: Non-widening outer product + subtract
  * @llvm.aarch64.sme.mopa.wide: Widening outer product + accumulate
  * @llvm.aarch64.sme.mops.wide: Widening outer product + subtract
  * @llvm.aarch64.sme.smopa.wide: Widening signed sum of outer product + accumulate
  * @llvm.aarch64.sme.smops.wide: Widening signed sum of outer product + subtract
  * @llvm.aarch64.sme.umopa.wide: Widening unsigned sum of outer product + accumulate
  * @llvm.aarch64.sme.umops.wide: Widening unsigned sum of outer product + subtract
  * @llvm.aarch64.sme.sumopa.wide: Widening signed by unsigned sum of outer product + accumulate
  * @llvm.aarch64.sme.sumops.wide: Widening signed by unsigned sum of outer product + subtract
  * @llvm.aarch64.sme.usmopa.wide: Widening unsigned by signed sum of outer product + accumulate
  * @llvm.aarch64.sme.usmops.wide: Widening unsigned by signed sum of outer product + subtract

Differential Revision: https://reviews.llvm.org/D127956
2022-06-28 09:41:44 +01:00
David Green 03c65c0d32 [AArch64] Convert vector add(ext, ext) into ext(add(ext, ext))
Given a vector add or sub from extends that needs more than one 'step'
(i.e. i8 to i32 or i16 to i64), we can transform the sequence to
sext(add(ext, ext)), allowing the add(ext, ext) to become a single uaddl
plus a larger extend and producing fewer instructions in total.
https://alive2.llvm.org/ce/z/S2T4k-

Differential Revision: https://reviews.llvm.org/D128426
2022-06-24 10:04:28 +01:00
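
A sketch of a loop that produces the multi-step extend (function and names are illustrative, not from the commit):

    #include <stdint.h>

    /* Each a[i] + b[i] needs an i8 -> i32 extend; with this combine the
       add becomes a single uaddl (i8 -> i16) followed by one larger
       extend, rather than two full extensions. */
    void widen_add(uint32_t *restrict dst, const uint8_t *restrict a,
                   const uint8_t *restrict b, int n) {
        for (int i = 0; i < n; i++)
            dst[i] = (uint32_t)a[i] + (uint32_t)b[i];
    }
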
Paul Walker e8716179eb [SVE] Make ISD::SPLAT_VECTOR a legal operation.
The implication of this patch is that AArch64ISD::DUP no longer
supports scalable vectors.

Differential Revision: https://reviews.llvm.org/D128265
2022-06-23 00:42:47 +01:00
David Sherwood aa0a413df8 [AArch64][SME] Add some SME PSTATE setting/query intrinsics
This patch adds support for:

* Querying the PSTATE.SM state with @llvm.aarch64.sme.get.pstatesm
* Reading/writing the TPIDR2 register with new
@llvm.aarch64.sme.get.tpidr2 and @llvm.aarch64.sme.set.tpidr2
intrinsics.

Tests added here:

  CodeGen/AArch64/sme-get-pstatesm.ll
  CodeGen/AArch64/sme-read-write-tpidr2.ll

Differential Revision: https://reviews.llvm.org/D127957
2022-06-22 10:26:45 +01:00
Paul Walker 7b285ae0e8 [SVE] Lower "unpredicated" sabd/uabd intrinsics to ISD::ABDS/U.
This enables an existing transformation that, when combined with an
add, will emit saba/uaba instructions.

Differential Revision: https://reviews.llvm.org/D128198
2022-06-22 00:02:51 +01:00
David Green 3f81841474 [AArch64] Add Extract(DUP(C)) as a canonical constant.
As a follow-up to D128144, this adds extract(DUP(C)) as a canonical
constant to prevent it from being transformed back into a BUILD_VECTOR,
which would lead to an infinite loop.
2022-06-21 09:51:22 +01:00
David Green c0ecbfa4fd [AArch64] Known bits for AArch64ISD::DUP
An AArch64ISD::DUP is just a splat, where the known bits for each lane
are the same as the input. This teaches that to computeKnownBitsForTargetNode.

Problems arise for constants though, as a constant BUILD_VECTOR can be
lowered to an AArch64ISD::DUP, which SimplifyDemandedBits would then
turn back into a constant BUILD_VECTOR, leading to an infinite cycle.
This has been prevented by adding an isTargetCanonicalConstantNode hook
to block the conversion back into a BUILD_VECTOR.

Differential Revision: https://reviews.llvm.org/D128144
2022-06-20 19:11:57 +01:00
David Sherwood 013358632e [AArch64][SME] Add the zero intrinsic
The SME zero instruction takes a mask as an input declaring which
64-bit element tiles should be zeroed. There is a 1:1 mapping
between the zero intrinsic and the instruction, however we also
want to make the register allocator aware that some tile
registers are being written to.

We can actually just use the custom inserter for a pseudo instruction
to correctly mark all the appropriate registers in the mask as
implicitly defined by the operation.

 Differential Revision: https://reviews.llvm.org/D127843
2022-06-20 14:27:59 +01:00