Commit Graph

1488 Commits

zhongyunde 4a549be9c3 [AArch64] Lower multiplication by a negative constant to shl+sub+shl
Change the cost model to lower a = b * C where C = -(2^n - 2^m) to
            lsl     w8, w0, m
            sub     w0, w8, w0, lsl n
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D134934
2022-10-01 21:27:42 +08:00
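A minimal sketch, not taken from the patch above, of a multiply this lowering targets: for C = -14 = -(2^4 - 2^1) the pattern gives n = 4 and m = 1 (function name is illustrative).

```
/* Multiply by a negative constant of the form -(2^n - 2^m).
   For -14 the commit's pattern gives roughly:
       lsl  w8, w0, #1          ; b << 1
       sub  w0, w8, w0, lsl #4  ; (b << 1) - (b << 4) = -14 * b  */
int mul_by_minus14(int b) {
  return b * -14;
}
```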
Florian Hahn fe49ba84d3
[AArch64] Reflow comment in AArch64ISelLowering.cpp (NFC). 2022-09-30 17:17:04 +01:00
Zain Jaffal fca8730793
[AArch64] Refactor opcode selection for LowerMUL (NFC)
Move the logic for selecting `NewOpc` out of `LowerMUL`

Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D134875
2022-09-30 16:48:02 +01:00
Zain Jaffal 661403b85c
[AArch64] Add support for 128-bit non temporal loads.
Adding to the work done in `D131773`, here we add support for 128-bit loads.

Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D132559
2022-09-30 11:04:04 +01:00
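A hedged sketch of how such a load might appear at the source level, using clang's non-temporal builtin (the typedef and function name are illustrative):

```
typedef int int32x4 __attribute__((vector_size(16)));

/* 128-bit non-temporal load via clang's builtin; with this change the
   backend can select a non-temporal load instruction for it instead of a
   plain load. */
int32x4 load_nt(int32x4 *p) {
  return __builtin_nontemporal_load(p);
}
```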
zhongyunde 4d15e7b21b [AArch64] Lower multiplication by a constant (NFC)
Refactor according to https://reviews.llvm.org/D134706#inline-1298952
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D134848
2022-09-30 01:37:28 +08:00
zhongyunde 62a51c357c [AArch64] Lower multiplication by a constant int to shl+sub+shl
Decomposing the constant 14 is split out from D132322.
Change the cost model to lower a = b * C where C = 2^n - 2^m to
        lsl     w8, w0, n
        sub     w0, w8, w0, lsl m
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D134706
2022-09-30 01:31:06 +08:00
Matt Devereau 0a4771a7e8 [AArch64][SVE] Expand gather index to 32 bits instead of 64 bits
For gathers which load 8- and 16-bit data and then use that data
as an index, the index can be extended to 32 bits instead of
64 bits.

Differential Revision: https://reviews.llvm.org/D130692
2022-09-28 14:42:12 +00:00
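A hedged illustration of the access shape described above (names are made up, and whether a given loop actually vectorizes to SVE gathers depends on the target and flags):

```
/* An 8-bit value is itself loaded by a gather and then reused as an index;
   with this change the gathered index only needs widening to 32 bits. */
void gather_chain(const int *restrict pos, const unsigned char *restrict idx8,
                  const float *restrict table, float *restrict out, int n) {
  for (int i = 0; i < n; i++)
    out[i] = table[idx8[pos[i]]];
}
```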
Florian Hahn 2d3c260362
[AArch64] break non-temporal loads over 256 bits into 256-bit loads and a smaller load
Currently, non-temporal loads over 256 bits are broken up inefficiently. For example, `v17i32` gets broken into 2 128-bit loads. It is better if we can use
256-bit loads instead.

Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D133421
2022-09-28 15:20:26 +01:00
Caroline Concatto 5431bf27bd [AArch64] Remove svget/svset/svcreate from llvm
This patch removes the aarch64 intrinsics svget/svset/svcreate from llvm.
It also implements the InstCombine for vector.extract that used to be in svget.

Depends on: D131547

Differential Revision: https://reviews.llvm.org/D131548
2022-09-23 10:48:43 +01:00
Florian Hahn ac434afed8
[AArch64] Try to fold shuffle (tbl2, tbl2) to tbl4.
shuffle (tbl2, tbl2) can be folded into a single tbl4 if the mask for
the selected elements is constant.

Reviewed By: t.p.northover

Differential Revision: https://reviews.llvm.org/D133491
2022-09-21 19:15:56 +01:00
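A hedged sketch of the targeted shape written with NEON intrinsics; whether this exact source triggers the fold depends on how the shuffle reaches the DAG.

```
#include <arm_neon.h>

/* Two tbl2 lookups combined by a constant-mask shuffle; the combine can
   fold this pattern into a single tbl4. */
uint8x16_t two_lookups(uint8x16x2_t lo, uint8x16x2_t hi, uint8x16_t idx) {
  uint8x16_t a = vqtbl2q_u8(lo, idx);
  uint8x16_t b = vqtbl2q_u8(hi, idx);
  return __builtin_shufflevector(a, b, 0, 16, 2, 18, 4, 20, 6, 22,
                                 8, 24, 10, 26, 12, 28, 14, 30);
}
```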
David Green 4f78e022ee [AArch64] Lower scalar sqxtn intrinsics to use fp registers
The llvm.aarch64.neon.scalar.sqxtn.i32.i64 intrinsics take and return
integer types, but operate on fp registers. This can create some
inefficiencies in their lowering, where the registers are converted to
fp a little too late. This patch adds lowering for the intrinsics,
creating bitcasts to/from fp types to allow nicer folding later when the
instructions are selected, especially around insert/extracts.

Differential Revision: https://reviews.llvm.org/D134024
2022-09-21 10:46:43 +01:00
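A hedged example of a pattern that benefits, written with ACLE intrinsics (the function name is illustrative):

```
#include <arm_neon.h>

/* A scalar saturating narrow whose result is inserted straight into a vector
   lane; keeping the value in an FP register avoids an extra GPR round trip. */
int32x4_t narrow_into_lane(int64_t x, int32x4_t v) {
  int32_t n = vqmovnd_s64(x);      /* maps to the scalar sqxtn intrinsic */
  return vsetq_lane_s32(n, v, 0);
}
```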
Caroline Concatto d32b8fdbdb [LLVM][AArch64] Replace aarch64.sve.ld by aarch64.sve.ldN.sret
This patch removes the intrinsic aarch64.sve.ldN from tablegen in favour of
using aarch64.sve.ldN.sret.

Depends on: D133023

Differential Revision: https://reviews.llvm.org/D133025
2022-09-20 13:15:07 +01:00
Sander de Smalen bed214cf0f [AArch64][SME] Add intrinsics for enabling/disabling ZA.
This adds the intrinsics:
* void @llvm.aarch64.sme.za.enable() -> smstart za
* void @llvm.aarch64.sme.za.disable()  -> smstop za

Reviewed By: aemerson

Differential Revision: https://reviews.llvm.org/D133894
2022-09-17 16:41:42 +00:00
Sander de Smalen 5fae000f36 [AArch64][SME] Disable tail-call optimization when streaming mode change or lazy-save may be required.
When a streaming mode change is (or may be) required for a call, it will
need to restore the original mode after the call, which prevents the use of
tail-call optimization. The same holds true for a call that requires the lazy-save
mechanism to be set up before the call, and possibly restored after.

More details about the SME attributes and design can be found
in D131562.

Reviewed By: aemerson

Differential Revision: https://reviews.llvm.org/D131579
2022-09-17 16:15:07 +00:00
Sander de Smalen bd4935c175 [AArch64][SME] Implement ABI for calls from streaming-compatible functions.
When a function is streaming-compatible and calls a function with a normal or streaming
interface, it may need to enable/disable streaming mode before the call, and
needs to restore PSTATE.SM after the call.

This patch implements this with a Pseudo node that gets expanded to a
conditional branch and smstart/smstop node.

More details about the SME attributes and design can be found
in D131562.

Reviewed By: aemerson

Differential Revision: https://reviews.llvm.org/D131578
2022-09-16 14:48:37 +00:00
Sander de Smalen b00c36c295 [AArch64][SME] Implement ABI for calls to/from streaming functions.
This patch implements the ABI for calls from:

  Normal -> Streaming
  Normal -> Streaming-compatible
  Streaming -> Normal
  Streaming -> Streaming-compatible
  Streaming -> Streaming

The compiler inserts SMSTART/SMSTOP instructions before and after the call,
depending on the required transition.

More details about the SME attributes and design can be found
in D131562.

Reviewed By: aemerson

Differential Revision: https://reviews.llvm.org/D131576
2022-09-16 14:07:47 +00:00
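A sketch under assumptions: the attribute spelling below is a hypothetical stand-in for the SME streaming attributes described in D131562; the point is only where the compiler inserts the mode switches.

```
/* Hypothetical attribute spelling for a callee with a streaming interface. */
void streaming_callee(void) __attribute__((arm_streaming));

void normal_caller(void) {
  /* Normal -> Streaming: the compiler wraps the call in SMSTART before it
     and SMSTOP after it, per the ABI implemented in this patch. */
  streaming_callee();
}
```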
Florian Hahn 6b86b481e3
[AArch64] Use tbl for truncating vector FPtoUI conversions.
On AArch64, doing the vector truncate separately after the fptoui
conversion can be lowered more efficiently using tbl4, building on
D133495.

https://alive2.llvm.org/ce/z/T538CC

Depends on D133495

Reviewed By: t.p.northover

Differential Revision: https://reviews.llvm.org/D133496
2022-09-16 14:57:43 +01:00
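A minimal sketch of a loop that yields this pattern once vectorized, an fptoui followed by a vector truncate (names are illustrative):

```
/* float -> u8 conversion: the truncate half of the fptoui+trunc pair can be
   lowered with tbl lookups. */
void fp_to_u8(const float *restrict src, unsigned char *restrict dst, int n) {
  for (int i = 0; i < n; i++)
    dst[i] = (unsigned char)src[i];
}
```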
Florian Hahn 8491d01cc3
[AArch64] Lower vector trunc using tbl.
Similar to using tbl to lower vector ZExts, tbl4 can be used to lower
vector truncates.

The initial version supports i32->i8 conversions.

Depends on D120571

Reviewed By: t.p.northover

Differential Revision: https://reviews.llvm.org/D133495
2022-09-16 12:42:49 +01:00
Florian Hahn 5871f18827
[AArch64] Lower extending uitofp using tbl.
On AArch64, doing the zero-extend separately first can be lowered more
efficiently using tbl, building on D120571.

https://alive2.llvm.org/ce/z/8Je595

Depends on D120571

Reviewed By: t.p.northover

Differential Revision: https://reviews.llvm.org/D133494
2022-09-16 10:20:25 +01:00
Florian Hahn 81a11da762
[CGP,AArch64] Replace zexts with shuffle that can be lowered using tbl.
This patch extends CodeGenPrepare to lower zext v16i8 -> v16i32 in loops
using a wide shuffle creating a v64i8 vector, selecting groups of 3
zero elements and an element from the input.

This is profitable on AArch64 where such shuffles can be lowered to tbl
instructions, but only in loops, because it requires materializing 4
masks, which can be done in the loop preheader.

This is the only reason the transform is part of CGP. If there's a
better alternative I missed, please let me know. The same goes for the
shouldReplaceZExtWithShuffle hook which guards this. I am not sure if
this transform will be beneficial on other targets, but it seems like
there is no other convenient way.

This improves the generated code for loops like the one below in
combination with D96522.

    #include <stdint.h>

    int foo(uint8_t *p, int N) {
      unsigned long long sum = 0;
      for (int i = 0; i < N; i++, p++) {
        unsigned int v = *p;
        sum += (v < 127) ? v : 256 - v;
      }
      return sum;
    }

https://clang.godbolt.org/z/Wco866MjY

Reviewed By: t.p.northover

Differential Revision: https://reviews.llvm.org/D120571
2022-09-15 19:18:13 +01:00
Sergei Barannikov c6acb4eb0f [SDAG] Add `getCALLSEQ_END` overload taking `uint64_t`s
All in-tree targets pass pointer-sized ConstantSDNodes to the
method. This overload reduced amount of boilerplate code a bit.  This
also makes getCALLSEQ_END consistent with getCALLSEQ_START, which
already takes uint64_ts.
2022-09-15 14:02:12 -04:00
Sander de Smalen 45d28779c5 [AArch64][SME] Fix lowering of llvm.aarch64.get.pstatesm()
A thread may not have access to SME or TPIDR2_EL0, so in order to
safely query PSTATE.SM in a streaming-compatible function, the
code should call `__arm_sme_state()`, as described in the ABI:

  c2bb09c4d4

This means that the value of pstate.sm is:
* 0 if the function is non-streaming.
* 1 if the function has `arm_streaming` or `arm_locally_streaming`.
* evaluated at runtime by a call to __arm_sme_state() otherwise.

This patch also adds a calling convention for calls to SME support routines.

At some point we can remove the need for the llvm.aarch64.get.pstatesm() intrinsic
and use function calls (with the corresponding cc) directly instead.

Reviewed By: aemerson

Differential Revision: https://reviews.llvm.org/D131571
2022-09-15 15:14:13 +00:00
Zain Jaffal d1dec04d76
[AArch64] Disable nontemporal load for Big Endian
The current code for generating nontemporal loads outputs the wrong assembly for big-endian architectures.

Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D133789
2022-09-14 14:49:55 +01:00
David Green 993b203b6a [AArch64] Sink splat(s/zext(..)) to uses
If the Shuffle is a splat and the operand is a zext/sext, sinking the
operand and the s/zext can help create indexed s/umull. This is
especially useful to prevent i64 mul being scalarized.

Differential Revision: https://reviews.llvm.org/D133355
2022-09-13 15:47:41 +01:00
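A hedged example of a loop where this helps (names are illustrative): the scalar is sign-extended and splatted across a vector, and sinking the splat next to the multiply lets the backend form an indexed smull instead of scalarising the i64 multiply.

```
void widening_mul(const int *restrict a, long long *restrict out,
                  int scale, int n) {
  /* (long long)a[i] * (long long)scale: a sext feeding a splatted operand
     of a 64-bit multiply. */
  for (int i = 0; i < n; i++)
    out[i] = (long long)a[i] * (long long)scale;
}
```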
Matthias Gehre c1502425ba Move TargetTransformInfo::maxLegalDivRemBitWidth -> TargetLowering::maxSupportedDivRemBitWidth
Also remove new-pass-manager version of ExpandLargeDivRem because there is no way
yet to access TargetLowering in the new pass manager.

Differential Revision: https://reviews.llvm.org/D133691
2022-09-12 17:06:16 +01:00
Joe Loser 5e96cea1db [llvm] Use std::size instead of llvm::array_lengthof
LLVM contains a helpful function for getting the size of a C-style
array: `llvm::array_lengthof`. This is useful prior to C++17, but not as
helpful for C++17 or later: `std::size` already has support for C-style
arrays.

Change call sites to use `std::size` instead.

Differential Revision: https://reviews.llvm.org/D133429
2022-09-08 09:01:53 -06:00
Eli Friedman 2b9cec6244 [ARM64EC 5/?] Fix names of __chkstk and __security_check_cookie.
Part of initial Arm64EC patchset.

Arm64EC code needs to use functions with a different name, to avoid
using the x64 versions.

Differential Revision: https://reviews.llvm.org/D125417
2022-09-05 13:19:54 -07:00
Eli Friedman 5637ec0983 [ARM64EC 4/?] Add LLVM support for varargs calling convention.
Part of patchset to add initial support for ARM64EC.

The ARM64EC calling convention is the same as ARM64 for non-varargs
functions, but for varargs, the convention is significantly different.
Basically, only x0-x3 registers are used for passing arguments, and x4
and x5 describe the address/size of the arguments passed in memory. (See
https://docs.microsoft.com/en-us/windows/uwp/porting/arm64ec-abi for
more details; see
https://docs.microsoft.com/en-us/cpp/build/x64-calling-convention for
the x64 calling convention rules, which this convention needs to match.)

Note that this currently doesn't handle i128 arguments correctly; as
noted in review, that's sort of complicated to handle, so I'm leaving it
for a followup.

Differential Revision: https://reviews.llvm.org/D125415
2022-09-05 13:05:48 -07:00
Kazu Hirata 7d8c2d17eb [llvm] Use range-based for loops (NFC)
Identified with modernize-loop-convert.
2022-09-03 23:27:25 -07:00
Hassnaa Hamdi a6d9c944df [AArch64 - SVE]: Use SVE to lower reduce.fadd.
Differential Revision: https://reviews.llvm.org/D132573

Skip custom lowering for v1f64 so that it is expanded instead, because it has only one lane.

Differential Revision: https://reviews.llvm.org/D132959
2022-08-31 12:31:06 +00:00
Stephen Long 40999cbd93 [SVE] Fix SVEDup0 matching -0.0f
Because of D128669, CPY is being used to zero active lanes even in the case of -0.0f. This patch checks for floating point positive zero. That way SVEDup0 won't match -0.0f.

Fixes https://github.com/llvm/llvm-project/issues/57428

Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D132880
2022-08-30 11:07:17 -07:00
Paul Walker 11b4dce7d3 [SVE] Lower fixed-length floating point loads and stores to integer variants.
There's no advantage to emitting floating-point scalable accesses,
whereas by lowering them to integer variants we can benefit from
several combines that seek to replace explicit extends/truncates
with extending/truncating accesses.

Differential Revision: https://reviews.llvm.org/D132393
2022-08-26 11:10:23 +01:00
Usman Nadeem 46768052e0 [AArch64][DAGCombine] Fix a bug in performBuildVectorCombine where it could produce an invalid EXTRACT_SUBVECTOR
EXTRACT_SUBVECTOR requires that Idx be a constant multiple of ResultType's
known minimum vector length.

Something like this will produce an invalid extract_subvector:

t1: v4i16 = .....
t2: i32 = extract_vector_elt t1, Constant:i64<1>
t3: i32 = extract_vector_elt t1, Constant:i64<2>
t4: v2i32 = BUILD_VECTOR t2, t3
// produces
t5: v2i32 = extract_subvector t...., Constant:i64<1>

Differential Revision: https://reviews.llvm.org/D132517

Change-Id: I7a5acf054edee3e89c0f85a28d8869256403ce08
2022-08-24 16:24:19 -07:00
Sami Tolvanen cff5bef948 KCFI sanitizer
The KCFI sanitizer, enabled with `-fsanitize=kcfi`, implements a
forward-edge control flow integrity scheme for indirect calls. It
uses a !kcfi_type metadata node to attach a type identifier for each
function and injects verification code before indirect calls.

Unlike the current CFI schemes implemented in LLVM, KCFI does not
require LTO, does not alter function references to point to a jump
table, and never breaks function address equality. KCFI is intended
to be used in low-level code, such as operating system kernels,
where the existing schemes can cause undue complications because
of the aforementioned properties. However, unlike the existing
schemes, KCFI is limited to validating only function pointers and is
not compatible with executable-only memory.

KCFI does not provide runtime support, but always traps when a
type mismatch is encountered. Users of the scheme are expected
to handle the trap. With `-fsanitize=kcfi`, Clang emits a `kcfi`
operand bundle to indirect calls, and LLVM lowers this to a
known architecture-specific sequence of instructions for each
callsite to make runtime patching easier for users who require this
functionality.

A KCFI type identifier is a 32-bit constant produced by taking the
lower half of xxHash64 from a C++ mangled typename. If a program
contains indirect calls to assembly functions, they must be
manually annotated with the expected type identifiers to prevent
errors. To make this easier, Clang generates a weak SHN_ABS
`__kcfi_typeid_<function>` symbol for each address-taken function
declaration, which can be used to annotate functions in assembly
as long as at least one C translation unit linked into the program
takes the function address. For example on AArch64, we might have
the following code:

```
.c:
  int f(void);
  int (*p)(void) = f;
  p();

.s:
  .4byte __kcfi_typeid_f
  .global f
  f:
    ...
```

Note that X86 uses a different preamble format for compatibility
with Linux kernel tooling. See the comments in
`X86AsmPrinter::emitKCFITypeId` for details.

As users of KCFI may need to locate trap locations for binary
validation and error handling, LLVM can additionally emit the
locations of traps to a `.kcfi_traps` section.

Similarly to other sanitizers, KCFI checking can be disabled for a
function with a `no_sanitize("kcfi")` function attribute.

Relands 67504c9549 with a fix for
32-bit builds.

Reviewed By: nickdesaulniers, kees, joaomoreira, MaskRay

Differential Revision: https://reviews.llvm.org/D119296
2022-08-24 22:41:38 +00:00
Sami Tolvanen a79060e275 Revert "KCFI sanitizer"
This reverts commit 67504c9549 as using
PointerEmbeddedInt to store 32 bits breaks 32-bit arm builds.
2022-08-24 19:30:13 +00:00
Sami Tolvanen 67504c9549 KCFI sanitizer
The KCFI sanitizer, enabled with `-fsanitize=kcfi`, implements a
forward-edge control flow integrity scheme for indirect calls. It
uses a !kcfi_type metadata node to attach a type identifier for each
function and injects verification code before indirect calls.

Unlike the current CFI schemes implemented in LLVM, KCFI does not
require LTO, does not alter function references to point to a jump
table, and never breaks function address equality. KCFI is intended
to be used in low-level code, such as operating system kernels,
where the existing schemes can cause undue complications because
of the aforementioned properties. However, unlike the existing
schemes, KCFI is limited to validating only function pointers and is
not compatible with executable-only memory.

KCFI does not provide runtime support, but always traps when a
type mismatch is encountered. Users of the scheme are expected
to handle the trap. With `-fsanitize=kcfi`, Clang emits a `kcfi`
operand bundle to indirect calls, and LLVM lowers this to a
known architecture-specific sequence of instructions for each
callsite to make runtime patching easier for users who require this
functionality.

A KCFI type identifier is a 32-bit constant produced by taking the
lower half of xxHash64 from a C++ mangled typename. If a program
contains indirect calls to assembly functions, they must be
manually annotated with the expected type identifiers to prevent
errors. To make this easier, Clang generates a weak SHN_ABS
`__kcfi_typeid_<function>` symbol for each address-taken function
declaration, which can be used to annotate functions in assembly
as long as at least one C translation unit linked into the program
takes the function address. For example on AArch64, we might have
the following code:

```
.c:
  int f(void);
  int (*p)(void) = f;
  p();

.s:
  .4byte __kcfi_typeid_f
  .global f
  f:
    ...
```

Note that X86 uses a different preamble format for compatibility
with Linux kernel tooling. See the comments in
`X86AsmPrinter::emitKCFITypeId` for details.

As users of KCFI may need to locate trap locations for binary
validation and error handling, LLVM can additionally emit the
locations of traps to a `.kcfi_traps` section.

Similarly to other sanitizers, KCFI checking can be disabled for a
function with a `no_sanitize("kcfi")` function attribute.

Reviewed By: nickdesaulniers, kees, joaomoreira, MaskRay

Differential Revision: https://reviews.llvm.org/D119296
2022-08-24 18:52:42 +00:00
Jakub Kuderski 6fa87ec10f [ADT] Deprecate is_splat and replace all uses with all_equal
See the discussion thread for more details:
https://discourse.llvm.org/t/adt-is-splat-and-empty-ranges/64692

Reviewed By: dblaikie

Differential Revision: https://reviews.llvm.org/D132335
2022-08-23 11:36:27 -04:00
wanglian 53bc7d5f08 [AArch64][NFC] Replace setOperationAction and AddPromotedToType
with setOperationPromotedToType.

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D132213
2022-08-22 09:59:47 +08:00
Mingming Liu 945a306501 [AArch64] Change aarch64_neon_pmull{,64} intrinsic ISel through a new
SDNode.

How:
1) Add AArch64ISD::PMULL SDNode, and extend aarch64_neon_pmull intrinsic
   tablegen pattern for this SDNode.
2) For aarch64_neon_pmull64, canonicalize i64 operands to v1i64 vectors
   during legalization.
3) For {aarch64_neon_pmull, aarch64_neon_pmull64}, combine intrinsic to
   SDNode.

Why:
1) Adding the SDNode makes it easier to canonicalize i64 inputs (required by
   aarch64_neon_pmull64) to vector inputs. Vector inputs carry lane
   information, which helps the dag-combiner to combine nodes (e.g. rewrite to a
   better node to prepare for instruction selection) and instruction-selection
   to emit instructions that use higher-half inputs in place
   (i.e., no need to move lane 1 content to lane 0).
2) Using the SDNode for aarch64_neon_pmull64 is NFC, yet without this we
   have to move the definition of {PMULLv1i64, PMULLv2i64} out of its
   current group of records without gains.

Test cases are commented with what is being tested in
`aarch64-pmull2.ll` and `pmull-ldr-merge.ll` under directory
`llvm/test/CodeGen/AArch64`.

Differential Revision: https://reviews.llvm.org/D131047
2022-08-19 13:17:13 -07:00
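A hedged example of a caller that benefits, using the ACLE intrinsic for the high-half polynomial multiply (requires the AES/PMULL target feature; the function name is illustrative):

```
#include <arm_neon.h>

/* vmull_high_p64 consumes the high halves of its operands; the lane
   information carried by the new PMULL node lets the backend use them in
   place instead of first moving lane 1 down to lane 0. */
poly128_t pmull_high(poly64x2_t a, poly64x2_t b) {
  return vmull_high_p64(a, b);
}
```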
Archibald Elliott 270c179afd [AArch64][GISel] Lower llvm.prefetch
This change adds support for lowering llvm.prefetch directly using
GlobalISel. Currently, llvm.prefetch falls back to SelectionDAG.

This Change:
- Adds an AArch64-specific G_PREFETCH generic instruction, to be used
  where AArch64ISD::PREFETCH is used in SelectionDAG.
- Adds the GINodeEquiv so patterns are translated over to GlobalISel
  automatically.
- Corrects the AArch64Prefetch patterns to use a target immediate, which
  is needed to get the patterns to translate across correctly.
- Translates the SelectionDAG legalisation of the prefetch intrinsic
  into the corresponding GlobalISel legalisation.

Differential Revision: https://reviews.llvm.org/D132043
2022-08-19 09:11:18 +01:00
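For reference, a small sketch of source that reaches llvm.prefetch, which this change teaches GlobalISel to select directly:

```
/* __builtin_prefetch(addr, rw, locality) lowers to the llvm.prefetch
   intrinsic; rw=0 requests a read prefetch, locality=3 keeps the line in
   cache. */
void prefetch_row(const double *p) {
  __builtin_prefetch(p, 0, 3);
}
```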
Karl Meakin 71f0ec242f [AArch64] Add `foldCSELOfCSEL` combine.
This time more conservative.

Differential Revision: https://reviews.llvm.org/D125504
2022-08-19 01:04:29 +01:00
Paul Walker 96c8d615d6 [SVE] Extend findMoreOptimalIndexType so BUILD_VECTORs do not force 64bit indices.
Extends findMoreOptimalIndexType to allow ISD::BUILD_VECTOR based
indices to be truncated when such truncation is lossless. This can
enable the use of 32-bit gather/scatter indices, thus making it less
likely to have to split a gather/scatter in two.

Depends on D125194

Differential Revision: https://reviews.llvm.org/D130533
2022-08-18 18:00:53 +01:00
Daniil Fukalov 7ed3d81333 [NFCI] Move cost estimation from TargetLowering to TargetTransformInfo.
TargetLowering had the last two InstructionCost-related members, `getTypeLegalizationCost()`
and `getScalingFactorCost()`, but all other costs are processed in TTI.

E.g. it is not convenient to use other TTI members in these two functions
overridden in a target.

Minor refactoring: `getTypeLegalizationCost()` now doesn't need DataLayout
parameter - it was always passed from TTI.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D117723
2022-08-18 00:38:55 +03:00
Vitaly Buka 16fecdfa70 Revert "[AArch64] Add `foldCSELOfCSEl` DAG combine"
Breaks ubsan on buildbot, details in D125504

This reverts commit 6f9423ef06.
2022-08-16 20:29:37 -07:00
Karl Meakin 6f9423ef06 [AArch64] Add `foldCSELOfCSEl` DAG combine
Differential Revision: https://reviews.llvm.org/D125504
2022-08-16 12:49:11 +01:00
Zain Jaffal 7155ed4289
[AArch64] Add support for 256-bit non temporal loads
Currently all non-temporal loads are mapped to `LDP` or `LDR`. This patch maps all non-temporal 256-bit loads to `LDNP`. Future patches should address other non-temporal loads.

Reviewed By: fhahn, dmgreen

Differential Revision: https://reviews.llvm.org/D131773
2022-08-16 12:19:36 +01:00
Vitaly Buka e0e960923f [AArch64] Fix signed integer overflow in CSINC case
Follow-up to D131815, which overflows on different
values.
2022-08-15 15:04:20 -07:00
Vitaly Buka f1596952f9 [AArch64] Fix signed integer overflow in CSINC case
https://lab.llvm.org/staging/#/builders/224/builds/2/steps/16/logs/stdio

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D131815
2022-08-13 13:12:09 -07:00
David Green a9e9dd9a3a [AArch64] Add bf16 select handling
A bfloat select operation will currently crash, but is allowed from C.
This adds handling for the operation, turning it into a FCSELHrrr if
fullfp16 is present, or converting it to a FCSELSrrr if not. The
FCSELSrrr is created via using INSERT_SUBREG/EXTRACT_SUBREG to convert
the bf16 to a f32 and using the f32 pattern for FCSELSrrr. (I originally
attempted to do this via a tablegen pattern, but it appears that the
nzcv glue is placed onto the wrong node, causing it to be forgotten and
incorrect scheduling to be emitted).

The FCSELSrrr can also be used for fp16 selects when +fullfp16 is not
present, which helps avoid an unnecessary promotion to f32.

Differential Revision: https://reviews.llvm.org/D131253
2022-08-11 14:20:36 +01:00
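A hedged sketch of the previously crashing pattern, assuming a target with the bf16 extension; the if/else form is used because the commit only states that a bfloat select is reachable from C, and the optimizer typically turns this into a select.

```
/* A select on __bf16 values; with this patch it lowers to FCSELH when
   +fullfp16 is available, or via an f32 FCSELS otherwise. */
__bf16 select_bf16(int c, __bf16 a, __bf16 b) {
  __bf16 r;
  if (c)
    r = a;
  else
    r = b;
  return r;
}
```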
David Truby b1b9c39629 [AArch64][SVE] Use SVE for VLS fcopysign for wide vectors
Currently fcopysign for VLS vectors lowers through NEON even when the
vector width is wider than a NEON vector, causing bad codegen as the
vectors are split. This patch causes SVE to be used for these vectors
instead, giving much better codegen on wide VLS vectors.

Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D128642
2022-08-10 10:17:19 +00:00
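A sketch of the kind of loop affected, assuming VLS code generation with a fixed SVE width wider than a NEON register (e.g. something like -msve-vector-bits=512; names are illustrative):

```
#include <math.h>

/* copysign over wide fixed-length vectors now lowers through SVE instead of
   being split into NEON-sized pieces. */
void vls_copysign(float *restrict dst, const float *restrict mag,
                  const float *restrict sgn, int n) {
  for (int i = 0; i < n; i++)
    dst[i] = copysignf(mag[i], sgn[i]);
}
```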