Commit Graph

1445 Commits

Author SHA1 Message Date
Simon Pilgrim e4a124dda5 [DAG] Fold (srl (shl x, c1), c2) -> and(shl/srl(x, c3), m)
Similar to the existing (shl (srl x, c1), c2) fold
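
As a rough sketch (the constants are illustrative, not taken from the
patch), with c1=8 and c2=4 on i32 the matched pattern is:

  define i32 @src(i32 %x) {
    ; (srl (shl x, 8), 4)
    %s = shl i32 %x, 8
    %r = lshr i32 %s, 4
    ret i32 %r
  }

which can instead be a single shift by c3 = c1-c2 = 4 plus a mask,
i.e. and(shl(x, 4), 0x0FFFFFF0).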

Part of the work to fix the regressions in D77804

Differential Revision: https://reviews.llvm.org/D125836
2022-06-20 08:37:38 +01:00
David Sherwood 6f6fa5aa10 [AArch64][SME] Add SME cntsb/h/w/d intrinsics
These intrinsics return the number of elements in a streaming
vector, for example aarch64.sme.cntsw returns the number of
32-bit elements. When in streaming mode these are equivalent
to aarch64.sve.cntb/h/w/d with an input value of 1.

I have implemented these intrinsics using the rdsvl instruction
and added tests here:

  CodeGen/AArch64/SME/sme-intrinsics-rdsvl.ll
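
A minimal IR sketch of how one of these intrinsics might be called
(the function name is hypothetical):

  declare i64 @llvm.aarch64.sme.cntsw()

  define i64 @num_words() {
    ; number of 32-bit elements in a streaming vector
    %n = call i64 @llvm.aarch64.sme.cntsw()
    ret i64 %n
  }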

Differential Revision: https://reviews.llvm.org/D127853
2022-06-16 10:50:25 +01:00
David Sherwood 5fa2416ea0 [AArch64][SME] Add SME read/write intrinsics that map to the mova instruction
This patch adds implementations for the read/write SME ACLE intrinsics:

  @llvm.aarch64.sme.read.horiz
  @llvm.aarch64.sme.read.vert
  @llvm.aarch64.sme.write.horiz
  @llvm.aarch64.sme.write.vert

These all map to the SME mova instruction.

Differential Revision: https://reviews.llvm.org/D127414
2022-06-15 10:31:07 +01:00
David Sherwood bd61664167 [AArch64][SME] Add ldr/str (fill/spill) intrinsics
This patch adds implementations for the fill/spill SME ACLE intrinsics:

    @llvm.aarch64.sme.ldr
    @llvm.aarch64.sme.str

Differential Revision: https://reviews.llvm.org/D127317
2022-06-14 13:58:22 +01:00
Rosie Sumpter 2c4e44752d [AArch64][SME] Add load/store intrinsics
This patch adds implementations for the load/store SME ACLE intrinsics:
  - @llvm.aarch64.sme.ld1*
  - @llvm.aarch64.sme.st1*

Differential Revision: https://reviews.llvm.org/D127210
2022-06-14 11:11:22 +01:00
Guillaume Chatelet d1a27d0b9c [NFC][Alignment] Use proper version of getAlign 2022-06-13 12:59:38 +00:00
David Green 338fd211e7 [AArch64] Generate FADDP from shuffled fadd
As a follow up to D126686, this does the same fold for floating point
add and shuffle. In this case it is limited to reassoc either x[0]+x[1]
or x[1]+x[0] for both result[0] and result[1].
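
A hedged sketch of an input that should now become faddp (the
function name is illustrative):

  define <2 x float> @faddp_like(<2 x float> %x) {
    ; swap the two lanes, then fadd with reassoc: each result lane
    ; computes x[0]+x[1] (or x[1]+x[0])
    %rev = shufflevector <2 x float> %x, <2 x float> poison, <2 x i32> <i32 1, i32 0>
    %add = fadd reassoc <2 x float> %x, %rev
    ret <2 x float> %add
  }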

Differential Revision: https://reviews.llvm.org/D127087
2022-06-11 14:16:37 +01:00
Ahmed Bougacha c68b469e07 [AArch64][SVE] Don't crash on pre-legalizer types in extload combine.
This was assuming the vector types were MVTs, but they don't have to be.

Note that the concrete output of the test isn't very useful, since it's
dominated by nonsensical calling convention lowering for the weird types.

Differential Revision: https://reviews.llvm.org/D126505
2022-06-09 10:33:21 -07:00
Paul Walker a1121c31d8 [SVE] Fix incorrect code generation for bitcasts of unpacked vector types.
Bitcasting between unpacked scalable vector types of different
element counts is not a NOP because the live elements are laid out
differently.
               01234567
e.g. nxv2i32 = XX??XX??
     nxv4f16 = X?X?X?X?
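
A bitcast of the kind this patch fixes might look like the sketch
below; both types occupy the same number of bits so the IR is valid,
but the in-register layouts above differ:

  define <vscale x 4 x half> @cast(<vscale x 2 x i32> %v) {
    %r = bitcast <vscale x 2 x i32> %v to <vscale x 4 x half>
    ret <vscale x 4 x half> %r
  }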

Differential Revision: https://reviews.llvm.org/D126957
2022-06-08 10:30:07 +01:00
Guillaume Chatelet 0788186182 [Alignment][NFC] Remove usage of MemSDNode::getAlignment
I can't remove the function just yet as it is used in the generated .inc files.
I would also like to provide a way to compare alignment with TypeSize since it came up a few times.

Differential Revision: https://reviews.llvm.org/D126910
2022-06-07 13:52:20 +00:00
David Green 4ea1b43527 [AArch64] Generate ADDP from shuffled add
This adds a fold of add(x, shuffle(x, <1,0,3,2,5,4,...>)) into
shuffle(addp(x), <0,0,1,1,2,2,...>). The ADDP instruction takes two
vectors and returns one, adding adjacent pairs. So we match x in a
custom combine as it is lowered from a v8i32. The original code
would be 2 rev64 and 2 add, with the new code being a single addp
with a zip1;zip2 shuffle, producing smaller code.
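
A minimal sketch of the matched shape (v4i32 shown for brevity; the
combine described above matches it as lowered from a v8i32):

  define <4 x i32> @addp_like(<4 x i32> %x) {
    ; swap adjacent lanes, then add: each pair of lanes sums x[i]+x[i^1]
    %s = shufflevector <4 x i32> %x, <4 x i32> poison, <4 x i32> <i32 1, i32 0, i32 3, i32 2>
    %a = add <4 x i32> %x, %s
    ret <4 x i32> %a
  }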

Differential Revision: https://reviews.llvm.org/D126686
2022-06-06 11:39:51 +01:00
Paul Walker 2dde272db7 [SVE] Refactor sve-bitcast.ll to include all combinations for legal types.
Patch enables custom lowering for MVT::nxv4bf16 because otherwise
the refactored test file triggers a selection failure.

The reason for the refactoring is to highlight cases where the
generated code is wrong.
2022-06-03 12:09:19 +01:00
Paul Walker 48ea26a387 [SVE] Fixed custom lowering of ISD::INSERT_SUBVECTOR.
LowerINSERT_SUBVECTOR emits AArch64ISD::UUNPK## when lowering
scalable vector floating point INSERT_SUBVECTOR. However, these
nodes only make sense for integer types and thus isel patterns do
not exist for floating point, which leads to isel failures.

This patch ensures floating point operands are cast to integer
before the core lowering takes place.

Fixes: #55037

Differential Revision: https://reviews.llvm.org/D126487
2022-06-02 14:51:04 +01:00
Paul Walker 1fe4953d89 [SVE] Remove custom lowering of scalable vector MGATHER & MSCATTER operations.
Differential Revision: https://reviews.llvm.org/D126255
2022-06-02 11:19:52 +01:00
Sander de Smalen 9c38fc111b [AArch64] Remove references to Streaming SVE from target features.
Following discussion on D120261 and D121208 it seems better to remove the
concept of Streaming SVE from the subtarget/assembler predicates and
instead reason about 'SVE' and 'SME' as the higher-level features, rather
than trying to model this runtime mode through explicit feature flags.

This patch is largely NFC.

Reviewed By: paulwalker-arm, david-arm

Differential Revision: https://reviews.llvm.org/D125977
2022-05-31 16:25:01 +02:00
David Green 9a3144d078 [AArch64] Reuse larger DUP if available
If both a v2i32 DUP(x) and a v4i32 DUP(x) node exists, we can re-use the
larger node using a vector extract to obtain the smaller. This comes up
in the smull/smlal code, but needs a small fixup to allow the smull2
code in tryExtendDUPToExtractHigh/performAddSubLongCombine to still
match smull2 extracts.

Differential Revision: https://reviews.llvm.org/D126449
2022-05-29 19:42:13 +01:00
Florian Hahn 786c687810 [AArch64] Add support for FMA intrinsics to shouldSinkOperands.
If the fma operates on a legal vector type, the indexed variants can be
used if the second operand is a splat of a valid lane index.

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D126234
2022-05-27 10:37:03 +01:00
Zongwei Lan ad73ce318e [Target] use getSubtarget<> instead of static_cast<>(getSubtarget())
Differential Revision: https://reviews.llvm.org/D125391
2022-05-26 11:22:41 -07:00
Florian Hahn 0cc981e021 [AArch64] implement isReassocProfitable, disable for (u|s)mlal.
Currently reassociating add expressions can lead to failing to select
(u|s)mlal. Implement isReassocProfitable to skip reassociating
expressions that can be lowered to (u|s)mlal.

The same issue exists for the *mlsl variants as well, but the DAG
combiner doesn't use the isReassocProfitable hook before reassociating.
To be fixed in a follow-up commit as this requires DAGCombiner changes
as well.

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D125895
2022-05-23 09:39:00 +01:00
David Green 6ef5e242f2 [AArch64] Fix assumptions on input type of tryCombineFixedPointConvert
It is possible for the input type to not be v2i64 or v4i32, so weaken
the assertion to a return, fixing the crash in the new test.

Fixes #55606
2022-05-23 08:55:54 +01:00
Paul Walker 258dac43d6 [SVE] Enable use of 32bit gather/scatter indices for fixed length vectors
Differential Revision: https://reviews.llvm.org/D125193
2022-05-22 12:32:30 +01:00
Paul Walker 216f546c84 [SVE] Refactor lowering for fixed length MGATHER/MSCATTER.
Lower fixed length MGATHER/MSCATTER operations to scalable vector
equivalents, which are then lowered to SVE specific nodes. This
two stage process is in preparation for making scalable vector
MGATHER/MSCATTER operations legal.

Differential Revision: https://reviews.llvm.org/D125192
2022-05-21 10:14:45 +01:00
Rahul Anand R 534ea8bca5 [AArch64] Generate AND in place of CSEL for predicated CTTZ
This patch implements a target-specific optimization that replaces
the cmp and csel from cttz with an and mask.

Recommitted with a fix for truncated value sizes.

Differential Revision: https://reviews.llvm.org/D123782
2022-05-20 13:41:32 +01:00
David Green 602f81ec33 [AArch64] Fix zero element TBL indices
A TBL instruction will fill out-of-range values with 0's, something used
in D121139 to turn tbl2 with a zero input into tbl1s. This works OK for
v16i8, but for v8i8 the input is still treated as a v16i8, so
out-of-range values (like a lane index of 8) would end up loading values
from the top half of the input register. Clean this up by detecting the
out-of-range values and remapping them to indices that really are out of range.
There is a fix for swapped indices of 64bit input vectors too, which
could be incorrectly adjusted if the zerovector was the first operand.
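
For illustration, a hypothetical v8i8 shuffle of this shape selects
from x and a zero vector; in shuffle terms index 8 means "lane 0 of
the zero input", but as a raw TBL index on a 16-byte register it
would read the top half of x:

  define <8 x i8> @tbl_zero(<8 x i8> %x) {
    %s = shufflevector <8 x i8> %x, <8 x i8> zeroinitializer,
           <8 x i32> <i32 0, i32 8, i32 2, i32 8, i32 4, i32 8, i32 6, i32 8>
    ret <8 x i8> %s
  }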

Fixes #55545

Differential Revision: https://reviews.llvm.org/D125865
2022-05-19 13:54:35 +01:00
Jay Foad 6bec3e9303 [APInt] Remove all uses of zextOrSelf, sextOrSelf and truncOrSelf
Most clients only used these methods because they wanted to be able to
extend or truncate to the same bit width (which is a no-op). Now that
the standard zext, sext and trunc allow this, there is no reason to use
the OrSelf versions.

The OrSelf versions additionally have the strange behaviour of allowing
extending to a *smaller* width, or truncating to a *larger* width, which
are also treated as no-ops. A small amount of client code relied on this
(ConstantRange::castOp and MicrosoftCXXNameMangler::mangleNumber) and
needed rewriting.

Differential Revision: https://reviews.llvm.org/D125557
2022-05-19 11:23:13 +01:00
David Green 4c6a070a2c [AArch64] Teach perfect shuffles tables about D-lane movs
Similar to D123386, this adds D-Movs to the AArch64 perfect shuffle
tables, lowering the costs a little more. This is generally an
improvement, especially if you ignore mov v0.16b, v2.16b type
moves that are often artefacts of the calling convention.

The D register movs are encoded as (0x4 | LaneIdx), and to generate a D
register move we are required to bitcast into a higher type, but it is
otherwise very similar to the S-lane mov's already supported.

Differential Revision: https://reviews.llvm.org/D125477
2022-05-17 18:16:45 +01:00
Simon Pilgrim d40b7f0d5a [DAG] Fold (shl (srl x, c), c) -> and(x, m) even if srl has other uses
If we're using shift pairs to mask, then relax the one use limit if the shift amounts are equal - we'll only be generating a single AND node.

AArch64 has a couple of regressions due to this, so I've enforced the existing one use limit inside an AArch64TargetLowering::shouldFoldConstantShiftPairToMask callback.

Part of the work to fix the regressions in D77804

Differential Revision: https://reviews.llvm.org/D125607
2022-05-17 13:40:11 +01:00
Paul Walker ee8aa351e4 [AArch64] Use ADDV for boolean xor reductions.
NEON does not have native support for xor reductions. However, when
reducing predicate vectors the operation is synonymous with an add
reduction that is supported.

Differential Revision: https://reviews.llvm.org/D125605
2022-05-16 22:34:12 +01:00
Paul Walker 7dd05ba9ed [SelectionDAG] Remove duplicate "is scaled" information from gather/scatter SDNodes.
During early gather/scatter enablement two different approaches
were taken to represent scaled indices:

* A Scale operand whereby byte_offsets = Index * Scale
* An IndexType whereby byte_offsets = Index * sizeof(MemVT.ElementType)

Having multiple representations is bad as shown by this patch which
fixes instances where the two are out of sync. The dedicated scale
operand is more flexible and pervasive so this patch removes the
UNSCALED values from IndexType. This means all indices are scaled
but the scale can be one, hence unscaled. SDNodes now use the scale
operand to answer the "isScaledIndex" question.

I toyed with the idea of keeping the UNSCALED enums and helper
functions but because they will have no uses and force SDNodes to
validate the set of supported values I figured it's best to remove
them. We can re-add them if there's a real need. For similar
reasons I've kept the IndexType enum when a bool could be used as I
think being explicit looks better.

Depends On D123347

Differential Revision: https://reviews.llvm.org/D123381
2022-05-16 20:47:52 +01:00
David Green 4c3e51ecfa [AArch64] Handle 64bit vectors in tryCombineFixedPointConvert
Under some situations we can visit 64bit vector extract elements in
tryCombineFixedPointConvert, where an assert fires as they are expected
to have been converted to 128bit. Turn the assert into an if statement,
bailing out and letting the extract be handled first.

Also invert some ifs, using early exits to reduce indentation.

Fixes #55417
2022-05-16 11:08:47 +01:00
Karl Meakin 0298cce257 [AArch64] Add `foldADCToCINC` DAG combine.
Differential revision: https://reviews.llvm.org/D123781
2022-05-12 22:21:20 +01:00
Karl Meakin d29fc6e7d2 [AArch64] Replace `performANDSCombine` with `performFlagSettingCombine`.
`performFlagSettingCombine` is a generalised version of `performANDSCombine` which also works on `ADCS` and `SBCS`.

Differential revision: https://reviews.llvm.org/D124464
2022-05-12 22:17:23 +01:00
Nikita Popov 44d85259d0 [AArch64] Preserve chain when lowering fixed length load to SVE (PR55281)
When a fixed length load is lowered to an SVE masked load, the
result chain is currently set to the input chain of the old load,
rather than the result chain of the new load. This may cause stores
to be incorrectly reordered.

Fixes https://github.com/llvm/llvm-project/issues/55281.

Differential Revision: https://reviews.llvm.org/D125464
2022-05-12 16:03:32 +02:00
Alban Bridonneau e6635377e5 [NFC] Change comment number in aarch64 isel 2022-05-11 15:41:33 +00:00
David Green 442c351b2b Revert "[AArch64] Generate AND in place of CSEL for predicated CTTZ"
This reverts commit 7dcd0ea683 due to
issues reported postcommit with the correctness of truncated cttzs.
2022-05-10 17:17:03 +01:00
Amaury Séchet 59fea9380d [AArch64] Remove ADDC, ADDE, SUBC, SUBE support, use the CARRY ops instead
This cleans up tech debt. Similar to D33390.

Reviewed By: Kmeakin

Differential Revision: https://reviews.llvm.org/D125150
2022-05-10 00:23:15 +00:00
Rosie Sumpter 1a2665902f [AArch64][SVE] Improve codegen when extracting first lane of active lane mask
When extracting the first lane of a predicate created using the
llvm.get.active.lane.mask intrinsic, it should give the same codegen as
when the predicate is created using the llvm.aarch64.sve.whilelo
intrinsic, since get.active.lane.mask is lowered to whilelo. This patch
ensures the codegen is the same by recognizing
llvm.get.active.lane.mask as a flag-setting operation in this case.
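
A sketch of the motivating pattern (types chosen for illustration):

  declare <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64, i64)

  define i1 @first_lane(i64 %i, i64 %n) {
    ; the mask lowers to whilelo; extracting lane 0 can now reuse the flags
    %m = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 %i, i64 %n)
    %b = extractelement <vscale x 4 x i1> %m, i64 0
    ret i1 %b
  }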

Differential Revision: https://reviews.llvm.org/D125215
2022-05-09 13:56:04 +01:00
Alban Bridonneau fef81131d9 [SVE] Optimize new cases for lowerConvertToSVBool
Conversions to SVBool are already considered a nop if they are
converting an operand from a ptrue or a cmp, because these zero
the extra predicate lanes by construction.

This patch adds 2 similar cases:
- Wide cmps, which were not directly recognized by the test
  for other forms of cmp
- Splats of 1, which will be generated as ptrue, and as such
  will also zero the extra predicate lanes.

Reviewed By: paulwalker-arm, peterwaller-arm

Differential Revision: https://reviews.llvm.org/D124908
2022-05-09 10:17:57 +00:00
Rahul Anand R 7dcd0ea683 [AArch64] Generate AND in place of CSEL for predicated CTTZ
This patch implements a target-specific optimization that replaces
the cmp and csel from cttz with an and mask.

Differential Revision: https://reviews.llvm.org/D123782
2022-05-09 10:28:20 +01:00
Paul Walker 702c4ade22 [ISD::IndexType] Helper functions for common queries.
Add helper functions to query the signed and scaled properties
of ISD::IndexType along with functions to change them.

Remove setIndexType from MaskedGatherSDNode because it only has
one usage and typically should only be changed alongside its
index operand.

Minimise the direct use of the enum values to lay the groundwork
for more refactoring.

Differential Revision: https://reviews.llvm.org/D123347
2022-05-07 11:23:42 +01:00
Kazu Hirata fffb6e6afd [AArch64] Fix sub with carry
13403a70e4 introduced a bug where we
generate the outgoing carry inverted, which in turn breaks the
lowering of @llvm.usub.sat.i128, returning the normal difference on
saturation and zero otherwise.

Note that AArch64 has peculiar semantics where the subtraction
instructions generate the borrow inverted.  The problem is that we mix the
two forms of semantics -- the normal carry and inverted carry -- in
the area of extended precision subtractions.  Specifically, we have
three problems:

- lowerADDSUBCARRY takes the non-inverted incoming carry from a
  subtraction and feeds it to SBCS without inverting it first.

- lowerADDSUBCARRY makes available the outgoing carry from SBCS
  without inverting it.

- foldOverflowCheck folds:

  (SBC{S} l r (CMP (CSET LO carry) 1)) => (SBC{S} l r carry)

  When the incoming carry flag is set, CSET LO results in zero.  CMP
  in turn generates a borrow, *clearing* the carry flag.  Instead, we
  should fold:

  (SBC{S} l r (CMP 0 (CSET LO carry))) => (SBC{S} l r carry)

  When the incoming carry flag is set, CSET LO results in zero.  CMP
  does not generate a borrow, *setting* the carry flag.

IIUC, we should use the normal (that is, non-inverted) semantics for
carry everywhere.

This patch fixes the three problems above.

This patch does not add any new testcases because we have plenty of
them covering the instruction in question.  In particular,
@u128_saturating_sub is identical to the testcase in the motivating
issue.
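
A sketch of the motivating case (per the message above, the in-tree
@u128_saturating_sub test has the same shape):

  declare i128 @llvm.usub.sat.i128(i128, i128)

  define i128 @u128_saturating_sub(i128 %a, i128 %b) {
    ; the i128 subtract splits into SUBS + SBCS, where the carry
    ; semantics described above matter
    %r = call i128 @llvm.usub.sat.i128(i128 %a, i128 %b)
    ret i128 %r
  }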

Fixes: #55253

Differential Revision: https://reviews.llvm.org/D124976
2022-05-06 11:04:17 -07:00
Paul Walker b481512485 [SVE] Move reg+reg gather/scatter addressing optimisations from lowering into DAG combine.
This is essentially a refactoring patch but allows more cases to
be caught, hence the output changes to some tests.

Differential Revision: https://reviews.llvm.org/D122994
2022-04-29 17:42:33 +01:00
Paul Walker 59588f0a3d [SVE][ISel] Ensure explicit gather/scatter offset extension isn't lost.
getGatherScatterIndexIsExtended currently looks through all
SIGN_EXTEND_INREG operations regardless of their input type.  This
patch restricts the code to only look through i32->i64 extensions,
which are the ones supported implicitly by SVE addressing modes.

Differential Revision: https://reviews.llvm.org/D123318
2022-04-29 14:20:13 +01:00
Paul Walker 7a0b897e86 [DAGCombiner][SVE] Ensure MGATHER/MSCATTER addressing mode combines preserve index scaling
refineUniformBase and selectGatherScatterAddrMode both attempt the
transformation:

  base(0) + index(A+splat(B)) => base(B) + index(A)

However, this is only safe when index is not implicitly scaled.

Differential Revision: https://reviews.llvm.org/D123222
2022-04-29 12:35:16 +01:00
David Green d6327050e0 [AArch64] Use PerfectShuffle costs in AArch64TTIImpl::getShuffleCost
Given a shuffle with 4 elements of size 16 or 32, we can use the costs
directly from the PerfectShuffle tables to get a slightly more accurate
cost for the resulting shuffle.

Differential Revision: https://reviews.llvm.org/D123409
2022-04-27 12:09:01 +01:00
David Green 9727c77d58 [NFC] Rename Instrinsic to Intrinsic 2022-04-25 18:13:23 +01:00
Karl Meakin 81904454f7 [AArch64] Add `foldOverflowCheck` DAG combine
Differential Revision: https://reviews.llvm.org/D123779
2022-04-21 14:56:38 +01:00
Karl Meakin 13403a70e4 [AArch64] Add lowerings for {ADD,SUB}CARRY and S{ADD,SUB}O_CARRY
Differential Revision: https://reviews.llvm.org/D123322
2022-04-21 14:56:37 +01:00
David Green 73dc996428 [AArch64] Add lane moves to PerfectShuffle tables
This teaches the perfect shuffle tables about lane inserts, that can
help reduce the cost of many entries. Many of the shuffle masks are
one-away from being correct, and a simple lane move can be a lot simpler
than trying to use ext/zip/etc. Because they are not exactly like the
other masks handled in the perfect shuffle tables, they require special
casing to generate them, with a special InsOp Operator.

The lane to insert into is encoded as the RHSID, and the move from is
grabbed from the original mask. This helps reduce the maximum perfect
shuffle entry cost to 3, with many more shuffles being generatable in a
single instruction.

Differential Revision: https://reviews.llvm.org/D123386
2022-04-19 14:49:50 +01:00
David Green cc9495f679 [AArch64] Only mark cost 1 perfect shuffles as legal
The perfect shuffle tables encode a cost of either 0 (a nop-copy) or 1
(a single instruction) with a cost encoding of 0 in the upper 2 bits.
All perfect shuffles with any cost are then marked as legal shuffles
though (the maximum encoded cost is 3), which can confuse the DAG
combiner into thinking the shuffles are cheaper than they should be.

Limiting legal shuffles to single instructions seems to do better in
most cases, producing fewer instructions for complex shuffles. There are
some cases that now become tbl, which may be better or worse depending
on whether the instruction is in a loop and the tbl load can be hoisted
out.

Differential Revision: https://reviews.llvm.org/D123377
2022-04-19 12:58:55 +01:00
chenglin.bi 222adf338a [Arch64][SelectionDAG] Add target-specific implementation of srem
1. Expanding X%C to the equivalent of X-X/C*C is not always the fastest path if no SDIV pair exists, so first check whether the target has a faster lowering for SREM alone.
2. Add an AArch64 fast path for the SREM-only pow2 case (sketched below).
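
A sketch of the kind of case the pow2 fast path targets:

  define i32 @rem8(i32 %x) {
    ; srem by a power of two can avoid the full X-X/C*C expansion
    %r = srem i32 %x, 8
    ret i32 %r
  }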

Fix https://github.com/llvm/llvm-project/issues/54649

Reviewed By: efriedma

Differential Revision: https://reviews.llvm.org/D122968
2022-04-19 02:49:42 +08:00
chenglin.bi acfc025a72 Revert "[Arch64][SelectionDAG] Add target-specific implementation of srem"
This reverts commit 9d9eddd3dd.
2022-04-18 10:35:09 +08:00
chenglin.bi 9d9eddd3dd [Arch64][SelectionDAG] Add target-specific implementation of srem
Expanding X%C to the equivalent of X-X/C*C is not always the fastest path if no SDIV pair exists, so first check whether the target has a faster lowering for SREM alone. Add an AArch64 fast path for the SREM-only pow2 case.

Fix https://github.com/llvm/llvm-project/issues/54649

Reviewed By: efriedma

Differential Revision: https://reviews.llvm.org/D122968
2022-04-16 12:29:11 +08:00
zhongyunde 49cb4fef02 [AArch64][SelectionDAG] Refactor to support more scalable vector extending stores
Similar to D122281, we should first exclude all scalable vector extending
stores and then selectively enable those which we directly support.

Also merge the integer and floating-point scalable vector types into
scalable_vector_valuetypes.

Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D123449
2022-04-15 19:11:40 +08:00
Paul Walker a5a258e208 [SVE] Refactor MGATHER lowering for unsupported passthru values.
Handle unsupported passthru values before lowering the gather to
target specific nodes. This is a simplification that's on the road
to moving more of MGATHER lowering into td based isel.

Differential Revision: https://reviews.llvm.org/D123683
2022-04-14 17:26:43 +01:00
John Brawn 12c1022679 [AArch64] Lowering and legalization of strict FP16
For strict FP16 to work correctly, some changes are needed in lowering
and legalization:
 * SelectionDAGLegalize::PromoteNode was missing handling for some
   strict fp opcodes.
 * Some of the custom lowering of strict fp operations needed to be
   adjusted to work with FP16.
 * Custom lowering needed to be added for round-to-int operations.

With this, and the previous patches for the rest of the strict fp
isel, we can set IsStrictFPEnabled = true.

Differential Revision: https://reviews.llvm.org/D115620
2022-04-14 16:51:22 +01:00
David Green 1ba8f4f67d [AArch64] Move v4i8 concat load lowering to a combine.
The existing code was not updating the uses of loads that it recreated,
leading to incorrect chains which could break the ordering between
nodes. This moves the code to a combine instead, and makes sure we
update the chain references. This does mean it happens earlier -
potentially before the concats are simplified. This can lead to
inefficiencies in the codegen, which will be fixed in followups.
2022-04-14 15:19:33 +01:00
Paul Walker 0c44115e51 [SVE] Add support for non-element-type sized scaling when lowering MGATHER/MSCATTER.
The lowering code did not use the scale operand of MGATHER/MSCATTER
nodes, but instead assumed scaled indices were always scaled based
on the element type of the memory type. This patch adds the missing
support by rewriting the nodes as unscaled variants.

Differential Revision: https://reviews.llvm.org/D123670
2022-04-14 11:54:46 +01:00
David Sherwood 44271e7c55 [AArch64][SVE] Fix lowering of "fcmp ueq/one" when using SVE
We were previously lowering to the incorrect instructions for the
setcc DAG node when using the SETUEQ and SETONE floating point
condition codes. I have fixed this by marking the SETONE code
as Expand and letting the SETUNE code be legal. I have also
fixed up the patterns for FCMNE_PPzZZ and FCMNE_PPzZ0 to use
the correct opcode.

Differential Revision: https://reviews.llvm.org/D121905
2022-04-13 10:24:03 +01:00
David Green a93607c479 [AArch64] Remove always true Perfect cost check. NFC
Perfect shuffle costs are always encoded less than 4, and shouldn't
really have a cost more than 3, so it makes no sense to check it when
generating shuffles. The perfect shuffle is likely always better than a
tbl too (although that may depend on whether it is in a loop).
2022-04-08 12:16:34 +01:00
Matt Arsenault c4ea925f50 AtomicExpand: Change return type for shouldExpandAtomicStoreInIR
Use the same enum as the other atomic instructions for consistency, in
preparation for addition of another strategy.

Introduce a new "Expand" option, since the store expansion does not
use cmpxchg. Alternatively, the existing CmpXChg strategy could be
renamed to Expand.
2022-04-06 22:34:04 -04:00
Martin Storsjö 8d7a17b7c8 [AArch64] Fix the upper limit for folded address offsets for COFF
In COFF, the immediates in IMAGE_REL_ARM64_PAGEBASE_REL21 relocations
are limited to 21 bit signed, i.e. the offset has to be less than
(1 << 20). The previous limit did intend to cover for this case, but
had missed that the 21 bit field was signed.

This fixes issue https://github.com/llvm/llvm-project/issues/54753.

Differential Revision: https://reviews.llvm.org/D123160
2022-04-06 22:54:13 +03:00
zhongyunde 9a2d5cc1da [SVE][AArch64] Enable first active true vector combine for INTRINSIC_WO_CHAIN
The WHILELO/WHILELS instructions are very important for SVE loops, and
are themselves flag-setting operations, so add them.

Reviewed By: paulwalker-arm, david-arm

Differential Revision: https://reviews.llvm.org/D122796
2022-04-06 21:01:37 +08:00
zhongyunde 19e5235147 [AArch64][InstCombine] Fold MLOAD and zero extensions into MLOAD
Following the discussion in D122281, we are missing an ISD::AND combine for MLOAD
because it relies on BuildVectorSDNode, which fails for scalable vectors.
This patch handles that case, so we can circle back to the type MVT::nxv2i32.

Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D122703
2022-04-06 20:50:42 +08:00
zhongyunde 251637690a [AArch64] Enhance last active true vector combine
Extracting the last active lane will output LASTB + WHILELS, and WHILELS
is itself a flag-setting operation, so prefer this form.

Reviewed By: paulwalker-arm, sdesmalen

Differential Revision: https://reviews.llvm.org/D122551
2022-04-06 09:54:28 +08:00
David Green 3b9833597e [AArch64] Alter mull buildvectors(ext(..)) combine to work on shuffles
D120018 altered this combine to work on buildvectors as opposed to
shuffle dup's. This works well for dups and other things that are
expanded into buildvectors. Some shuffles are legal though, and stay as
vector_shuffle through lowering. This expands the transform to also
handle shuffles, so that we can turn mul(shuffle(sext(..))) into
mul(sext(shuffle(..))) and more readily make smull/umull instructions. This
can come up from the SLP vectorizer adding shuffles that are costed from
extends.
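
A hedged sketch of the shape now handled (types and the splat mask
are illustrative):

  define <8 x i16> @mull_shuffle(<8 x i8> %a, <8 x i8> %b) {
    ; shuffle of a sext: pushing the shuffle below the extend keeps
    ; the operand narrow, so the multiply can become smull
    %ea = sext <8 x i8> %a to <8 x i16>
    %sa = shufflevector <8 x i16> %ea, <8 x i16> poison, <8 x i32> zeroinitializer
    %eb = sext <8 x i8> %b to <8 x i16>
    %m = mul <8 x i16> %sa, %eb
    ret <8 x i16> %m
  }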

Differential Revision: https://reviews.llvm.org/D123012
2022-04-04 23:07:47 +01:00
David Green 60f57b3658 [AArch64] Ensure fixed point fptoi_sat has correct saturation width
D113200 introduced an error where it was converting FP_TO_SI_SAT with
multiply to a fixed point floating point convert. The saturation
bitwidth needs to be equal to the floating point width, or else the
routine would truncate the result as opposed to saturating it.

Fixes #54601
2022-03-29 10:12:44 +01:00
Shao-Ce SUN 662b9fa02c [NFC][CodeGen] Add a setTargetDAGCombine use ArrayRef
Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D122557
2022-03-29 09:53:24 +08:00
zhongyunde c3fe025bd4 [AArch64][SelectionDAG] Refactor to support more scalable vector extending loads
Following the discussion in D120953, we should first exclude all scalable vector
extending loads and then selectively enable those which we directly support.

This patch refactors for the above (truncating stores are not touched), and
more scalable vector types will try to reduce the number of masked loads in favour
of more unpklo/hi instructions.

Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D122281
2022-03-27 21:18:01 +08:00
David Green 693d3b7e76 [AArch64] Lower 3 and 4 sources buildvectors to TBL
The default expansion for buildvectors is to extract each element and
insert them into a new vector. That involves a lot of copying to/from
the GPR registers. TBL3 and TBL4 can be relatively slow instructions
with the mask needing to be loaded from a constant pool, but they should
always be better than all the moves to/from GPRs.

Differential Revision: https://reviews.llvm.org/D121137
2022-03-26 21:10:43 +00:00
Simon Pilgrim e699b5da44 [AArch64] isProfitableToHoist - remove nullptr test
User is dereferenced on the main codepath so the null test is likely superfluous
2022-03-25 10:27:16 +00:00
David Green 3d8d60e147 Revert "[AArch64] Lower 3 and 4 sources buildvectors to TBL"
This reverts commit ec93b28909 as problems
with it have been reported.
2022-03-25 10:03:10 +00:00
David Green ec93b28909 [AArch64] Lower 3 and 4 sources buildvectors to TBL
The default expansion for buildvectors is to extract each element and
insert them into a new vector. That involves a lot of copying to/from
the GPR registers. TBL3 and TBL4 can be relatively slow instructions
with the mask needing to be loaded from a constant pool, but they should
always be better than all the moves to/from GPRs.

Differential Revision: https://reviews.llvm.org/D121137
2022-03-24 10:02:33 +00:00
David Green 54bc9ad2e8 [AArch64] Make some methods static. NFC 2022-03-24 08:55:27 +00:00
David Spickett c3b98194df Reland "[llvm][AArch64] Insert "bti j" after call to setjmp"
This reverts commit edb7ba714a.

This changes BLR_BTI to take variable_ops meaning that we can accept
a register or a label. The pattern still expects one argument so we'll
never get more than one. Then later we can check the type of the operand
to choose BL or BLR to emit.

(this is what BLR_RVMARKER does but I missed this detail of it first time around)

Also require NoSLSBLRMitigation which I missed in the first version.
2022-03-23 11:43:43 +00:00
David Spickett edb7ba714a Revert "[llvm][AArch64] Insert "bti j" after call to setjmp"
This reverts commit eb5ecbbcbb
due to failures on buildbots with expensive checks enabled.
2022-03-23 10:43:20 +00:00
David Spickett eb5ecbbcbb [llvm][AArch64] Insert "bti j" after call to setjmp
Some implementations of setjmp will end with a br instead of a ret.
This means that the next instruction after a call to setjmp must be
a "bti j" (j for jump) to make this work when branch target identification
is enabled.

The BTI extension was added in armv8.5-a but the bti instruction is in the
hint space. This means we can emit it for any architecture version as long
as branch target enforcement flags are passed.

The starting point for the hint number is 32; then call adds 2 and jump adds 4.
Hence "hint #36" for a "bti j" (and "hint #34" for the "bti c" you see
at the start of functions).

The existing Arm command line option -mno-bti-at-return-twice has been
applied to AArch64 as well.

Support is added to SelectionDAG Isel and GlobalIsel. FastIsel will
defer to SelectionDAG.
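
A sketch of IR that should now get a "bti j" after the call (the
attribute spelling follows the AArch64 tests; details may vary):

  declare i32 @setjmp(ptr) returns_twice

  define i32 @calls_setjmp(ptr %env) "branch-target-enforcement"="true" {
    %r = call i32 @setjmp(ptr %env)
    ret i32 %r
  }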

Based on the change done for M profile Arm in https://reviews.llvm.org/D112427

Fixes #48888

Reviewed By: danielkiss

Differential Revision: https://reviews.llvm.org/D121707
2022-03-23 09:51:02 +00:00
zhongyunde 828b89bc0b [AArch64][SelectionDAG] Supports unpklo/hi instructions to reduce the number of loads
Trying to reduce the number of masked loads in favour of more unpklo/hi
instructions. Both ISD::ZEXTLOAD and ISD::SEXTLOAD are supported for
extensions from legal types.

Test cases for both normal and masked loads are added to guard against
compile crashes.

Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D120953
2022-03-21 23:47:33 +08:00
chenglin.bi dd3b90e4d7 [AArch64] Combine ISD::SETCC into AArch64ISD::ANDS
When N > 12, (2^N - 1) is not a legal add immediate (isLegalAddImmediate will return false).
And if the SetCC input uses this number, the DAG combiner will generate one more SRL instruction.
So combine [setcc (srl x, imm), 0, ne] to [setcc (and x, (-1 << imm)), 0, ne] to get better optimization in emitComparison.
Fix https://github.com/llvm/llvm-project/issues/54283
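
A sketch of the pattern, with imm = 13 so that 2^13 - 1 is not a
legal add immediate:

  define i1 @cmp_srl(i64 %x) {
    ; [setcc (srl x, 13), 0, ne] -> [setcc (and x, (-1 << 13)), 0, ne]
    %s = lshr i64 %x, 13
    %c = icmp ne i64 %s, 0
    ret i1 %c
  }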

Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D121449
2022-03-19 13:04:16 +00:00
Paul Walker f46fe36d59 [AArch64] Fix incorrect getSetCCInverse usage within trySwapVSelectOperands.
When inverting the compare predicate trySwapVSelectOperands is
incorrectly using the type of the select's cond operand rather
than the type of cond's operands. This means we're treating all
inversions as if they're integer.

Differential Revision: https://reviews.llvm.org/D121968
2022-03-19 12:36:14 +00:00
David Green fe6057a293 [AArch64] Custom lower concat(v4i8 load, ...)
We already have custom lowering for v4i8 load, which loads as a f32,
converts to a vector and bitcasts and extends the result to a v4i16.
This adds some custom lowering of concat(v4i8 load, ...) to keep the
result as an f32 and create a buildvector of the resulting f32 loads.
This helps not create all the extends and bitcasts, which are often
difficult to fully clean up.
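
The custom-lowered shape is roughly this sketch (a concat of two
v4i8 loads; names are illustrative):

  define <8 x i8> @concat_v4i8(ptr %a, ptr %b) {
    %la = load <4 x i8>, ptr %a, align 4
    %lb = load <4 x i8>, ptr %b, align 4
    ; concat(v4i8 load, v4i8 load) is kept as f32 loads + buildvector
    %c = shufflevector <4 x i8> %la, <4 x i8> %lb,
           <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
    ret <8 x i8> %c
  }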

Differential Revision: https://reviews.llvm.org/D121400
2022-03-18 11:58:02 +00:00
David Green 0b6df40c52 [AArch64] Combine ISD::AND into AArch64ISD::ANDS
If we already have a AArch64ISD::ANDS node with identical operands, we
can merge any ISD::AND into it, reducing the instruction count by
calculating the value and the flags in a single operation. This code is
taken from the X86 backend, and could also handle AArch64ISD::ADDS and
AArch64ISD::SUBS, but I couldn't find any test cases where it came up.

Differential Revision: https://reviews.llvm.org/D118584
2022-03-17 09:44:11 +00:00
zhongyunde 3568333815 [AArch64] Perform last active true vector combine
The test bit of lane EC-1 can use the P register directly, eg:
Materialize : Idx = (add (mul vscale, NumEls), -1)
               i1 = extract_vector_elt t37, Constant:i64<Idx>
    ... into: "ptrue p, all" + PTEST

Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D121180
2022-03-15 01:25:03 +08:00
Arthur Eubanks 250620f76e [OpaquePtr][AArch64] Use elementtype on ldxr/stxr
Includes verifier changes checking the elementtype, clang codegen
changes to emit the elementtype, and ISel changes using the elementtype.

Reviewed By: #opaque-pointers, nikic

Differential Revision: https://reviews.llvm.org/D120527
2022-03-14 10:09:59 -07:00
David Sherwood aeeb1199b4 [AArch64][SVE] Change the asserts in LowerToPredicatedOp to check for legal types
When building the LLVM test suite with SVE I discovered a crash
when compiling some Halide tests, which occurs because we try to
use SVE to lower 64-bit vector multiplies and there is no
vscale_range attribute on the function. In this case the min SVE
vector bits was 0, which caused an assert in LowerToPredicatedOp
to fire. I have amended the asserts in this function to check that the
fixed-width type is legal. If the fixed-width type is larger than NEON
and is legal then it must be because we've set the min SVE vector
bits to something > 128. Or if the min SVE bits is 0, then the only
legal types allowed are 128 bit types - for any other types the assert
will fire.

Tests added here:

  CodeGen/AArch64/sve-fixed-length-no-vscale-range.ll

Differential Revision: https://reviews.llvm.org/D121297
2022-03-11 09:57:58 +00:00
Philippe Valembois 26cd258420 [AArch64] Use correct calling convention for each vararg
While checking if tail call optimization is possible, the calling
convention applied to fixed arguments is not the correct one.
This implies for DarwinPCS that all arguments of a vararg function will
go to the stack although fixed ones can go in registers.

This prevents non-virtual thunks from being tail optimized although they are
marked as musttail.

Differential Revision: https://reviews.llvm.org/D120622
2022-03-10 15:07:25 -08:00
David Green 4899e2cab4 [AArch64] Fix type in comment. NFC 2022-03-10 15:03:27 +00:00
David Green 21a97a2ac1 [AArch64] TBL uses zero for out of range elements.
A TBL instruction will use zero for any out of range values. We can use
this in GenerateTBL to help turn a TBL2 into a TBL1, avoiding the need
to materialise the zero.

Differential Revision: https://reviews.llvm.org/D121139
2022-03-10 14:45:13 +00:00
zhongyunde c22c8b151b [AArch64] Perform first active true vector combine
Materialize : i1 = extract_vector_elt t37, Constant:i64<0>
   ... into: "ptrue p, all" + PTEST
The test bit of lane 0 can use the P register directly, and the instruction
“ptrue all” is loop invariant, which will be beneficial to SVE after hoisting
it out of the loop.

Reviewed By: david-arm, paulwalker-arm

Differential Revision: https://reviews.llvm.org/D120891
2022-03-08 01:10:21 +08:00
David Green d9633d1490 [AArch64] Turn truncating buildvectors into truncates
When lowering large v16f32->v16i8 fp_to_si_sat, the fp_to_si_sat node is
split several times, creating an illegal v4i8 concat that gets expanded
into a BUILD_VECTOR. After some combining and other legalisation, it
ends up as a buildvector that extracts from 4 vectors, looking like
BUILDVECTOR(a0,a1,a2,a3,b0,b1,b2,b3,c0,c1,c2,c3,d0,d1,d2,d3). That is
really a v16i32->v16i8 truncate in disguise.

This adds a ReconstructTruncateFromBuildVector method to detect the
pattern, converting it back into the legal "concat(trunc(concat(trunc(a),
trunc(b))), trunc(concat(trunc(c), trunc(d))))" tree. The extracted
nodes could also be v4i16, in which case the truncates are not needed.
All those truncates and concats then become uzip1's, which is much
better than expanding by moving vector lanes around.

Differential Revision: https://reviews.llvm.org/D119469
2022-03-07 09:42:54 +00:00
Benjamin Kramer fbce4a7803 Drop some more global std::maps. NFCI. 2022-03-06 13:28:29 +01:00
Karl Meakin 43a0016f3d Extend `performANDCSELCombine` to `performANDORCSELCombine`
Differential Revision: https://reviews.llvm.org/D120422
2022-03-04 15:09:59 +00:00
Paul Walker 42b4a6227e [DAGCombine] Prevent illegal ISD::SPLAT_VECTOR operations post legalisation.
When triggered during operation legalisation, the affected combine
generates a splat_vector that, when custom lowered for SVE fixed
length code generation, results in the original pre-combine sequence
and thus we enter a legalisation/combine hang.

NOTE: The patch contains no tests because I observed this issue
only when combined with other work that might never become public.
The current way AArch64 lowers ISD::SPLAT_VECTOR meant a specific
test was not possible so I'm hoping the DAGCombiner fix can be seen
as obvious. The AArch64ISelLowering change is required to maintain
existing code quality.

Differential Revision: https://reviews.llvm.org/D120735
2022-03-04 11:54:03 +00:00
David Green e348b09bb5 [AArch64] Turn UZP1 with undef operand into truncate
This turns uzp1(x, undef) into concat(truncate(x), undef), as the truncate
is simpler and can often be optimized away, and it helps some of the
insert-subvector tests optimize more cleanly.

Differential Revision: https://reviews.llvm.org/D120879
2022-03-04 11:12:26 +00:00
Cullen Rhodes 616586794b [AArch64] Add legal types for Streaming SVE
The compiler currently crashes for scalable types when compiling with
+sme, e.g.

  define <vscale x 4 x i32> @foo(<vscale x 4 x i32> %a) {
    ret <vscale x 4 x i32> %a
  }

since it doesn't know how to legalize the types. SME implies a subset of
SVE (+streaming-sve), so the hasSVE predication in the backend needs
extending to consider types/operations that are legal in Streaming SVE.

This is the first patch adding legal types <-> register classes. Before
making the change +sve(2) was temporarily replaced with +sme in all the
intrinsics tests to see what failed, and again after making the change.
For all the tests that passed after adding the legal types another RUN
line has been added for +streaming-sve. More patches to follow.

Reviewed By: sdesmalen

Differential Revision: https://reviews.llvm.org/D118561
2022-03-03 09:51:14 +00:00
Sander de Smalen eac2638ec1 [AArch64][SVE] Fold away SETCC if original input was predicate vector.
This adds the following two folds:

Fold 1:
   setcc_merge_zero(
       all_active, extend(nxvNi1 ...), != splat(0))
  -> nxvNi1 ...

Fold 2:
   setcc_merge_zero(
       pred, extend(nxvNi1 ...), != splat(0))
  -> nxvNi1 and(pred, ...)

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D119334
2022-02-28 14:12:43 +00:00
Sander de Smalen 201e3686ab [AArch64][SVE] Handle more cases in findMoreOptimalIndexType.
This patch addresses @paulwalker-arm's comment on D117900 to
only update/write the by-ref operands iff the function returns
true. It also handles a few more cases where a series of added
offsets can be folded into the base pointer, rather than just looking
at a single offset.

Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D119728
2022-02-28 12:13:52 +00:00
Paul Walker 16ee102964 [SVE] Add missing splat patterns for bfloat vectors.
Differential Revision: https://reviews.llvm.org/D120496
2022-02-25 16:53:39 +00:00
Paul Walker 8ca5be93cc [SVE] Don't custom lower constant predicate ISD:SPLAT_VECTOR operations.
Differential Revision: https://reviews.llvm.org/D120340
2022-02-25 11:32:37 +00:00
Arthur Eubanks 6aa285eb85 [OpaquePtr][AArch64] Use load/store value type instead of pointer type for ldnt1/stnt1 alignment 2022-02-24 16:56:13 -08:00