Commit Graph

8184 Commits

Author SHA1 Message Date
Phoebe Wang 02fe96b240 [X86][FP16] Do not split FP64->FP16 to FP64->FP32->FP16
Truncation from double to half is not always identical to truncating to float first and then to half. https://godbolt.org/z/56s9517hd
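
A minimal C++ sketch of the double-rounding hazard (assuming a compiler with `_Float16` support on x86, e.g. recent Clang; the constant below is an illustrative example, not taken from the godbolt link):

```
#include <cstdio>

int main() {
  // Half-precision values around 2048 are spaced 2 apart, so 2049 is a tie point.
  double D = 2049.0000000001; // just above the tie

  _Float16 Direct = (_Float16)D;          // nearest half value is 2050
  _Float16 ViaFloat = (_Float16)(float)D; // float rounds to 2049.0f, then ties-to-even gives 2048

  std::printf("direct=%g via-float=%g\n", (double)Direct, (double)ViaFloat);
  return 0;
}
```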

On the other hand, expanding to float and then to double is always identical to expanding to double directly. https://godbolt.org/z/Ye8vbYPnY

Reviewed By: RKSimon, skan

Differential Revision: https://reviews.llvm.org/D130151
2022-07-22 08:36:05 +08:00
Bing1 Yu e01bf5a3e2 [X86] Promote v32f16's fadd into v32f32's fadd when it is avx512 without avx512fp16
Reviewed By: pengfei

Differential Revision: https://reviews.llvm.org/D130059
2022-07-19 14:37:50 +08:00
Benjamin Kramer 9234a7c0df [X86][FP16] Don't crash when lowering SELECT on fp16 vectors
This is a regression from f187948162
2022-07-18 13:41:00 +02:00
Phoebe Wang f187948162 [X86][FP16] Enable vector support for FP16 emulation
This is a follow-up of D107082, which enables vector support according to the psABI.

Reviewed By: skan

Differential Revision: https://reviews.llvm.org/D127982
2022-07-16 09:38:58 +08:00
Simon Pilgrim 66bfd1ba8c [X86] Move isInRange(ArrayRef<int>) inside assert to fix NDEBUG builds. NFC.
Fix unused static function warning introduced by D129207
2022-07-12 21:51:07 +01:00
Xiang1 Zhang a45dd3d814 [X86] Support -mstack-protector-guard-symbol
Reviewed By: nickdesaulniers

Differential Revision: https://reviews.llvm.org/D129346
2022-07-12 10:17:00 +08:00
Xiang1 Zhang 643786213b Revert "[X86] Support -mstack-protector-guard-symbol"
This reverts commit efbaad1c4a due to missing review info.
2022-07-12 10:14:32 +08:00
Xiang1 Zhang efbaad1c4a [X86] Support -mstack-protector-guard-symbol 2022-07-12 10:13:48 +08:00
Simon Pilgrim 97868fb972 [X86] isTargetShuffleEquivalent - attempt to match SM_SentinelZero shuffle mask elements using known bits
If the combined shuffle mask requires zero elements, we don't currently have much chance of matching them against the expected source vector. This patch uses the SelectionDAG::MaskedVectorIsZero wrapper to attempt to determine if the expected element we want to use is already known to be zero.

I've also tightened up the ExpectedMask assertion to always be in range - we're never giving it a target shuffle mask that has sentinels at all - allowing us to remove some of the confusing bounds checks.

This attempts to address some of the regressions uncovered by D129150 where we more aggressively fold shuffles as AND / 'clear' masks which results in more combined shuffles using SM_SentinelZero.

Differential Revision: https://reviews.llvm.org/D129207
2022-07-11 15:29:44 +01:00
Phoebe Wang 8fb083d33e [X86][FP16] Add constrained FP support for scalar emulation
This is a follow up patch to support constrained FP in FP16 emulation.

Reviewed By: skan

Differential Revision: https://reviews.llvm.org/D128114
2022-07-08 20:33:42 +08:00
Phoebe Wang 6c535f9f1b [X86][FP16] Fix crash when lowering copysign for f16
This is to address the assertion fail reported in https://reviews.llvm.org/D107082#3635612
Not sure if it is a problem of promoting FCOPYSIGN + libcall FP_ROUND.
The promotion sets the rounding mode to 1: a442c62888/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp (L4810-L4814)
while the libcall cannot handle a rounding mode equal to 1: a442c62888/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp (L4324-L4328)
So change the action to Expand to work around the problem.

Reviewed By: clementval, MaskRay

Differential Revision: https://reviews.llvm.org/D129294
2022-07-07 19:17:26 -07:00
Simon Pilgrim fbb51ac0ba [X86] LowerShift - lower some shuffles directly to X86ISD::PSHUFLW nodes.
These are expected to lower to X86ISD::PSHUFLW but we were seeing some regressions in D129150 because it'd managed to exploit the masking of the shift amounts to create unintended clear masks instead.
2022-07-06 18:01:03 +01:00
Shilei Tian 1023ddaf77 [LLVM] Add the support for fmax and fmin in atomicrmw instruction
This patch adds the support for `fmax` and `fmin` operations in `atomicrmw`
instruction. For now (at least in this patch), the instruction will be expanded
to a CAS loop. There are already a couple of targets supporting the feature; I'll
create follow-up patches to enable them accordingly.
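
As a rough C++ illustration of the CAS-loop expansion (a hand-written equivalent, not the actual generated code):

```
#include <atomic>
#include <cmath>

// Returns the previous value, as atomicrmw does.
float atomic_fmax(std::atomic<float> &Addr, float Val) {
  float Old = Addr.load();
  // compare_exchange_weak refreshes Old with the current value on failure,
  // so we retry until the max is installed atomically.
  while (!Addr.compare_exchange_weak(Old, std::fmax(Old, Val)))
    ;
  return Old;
}
```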

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D127041
2022-07-06 10:57:53 -04:00
Paul Robinson 08e4fe6c61 [X86] Add RDPRU instruction
Add support for the RDPRU instruction on Zen2 processors.

User-facing features:

- Clang option -m[no-]rdpru to enable/disable the feature
- Support is implicit for znver2/znver3 processors
- Preprocessor symbol __RDPRU__ to indicate support
- Header rdpruintrin.h to define intrinsics
- "rdpru" mnemonic supported for assembler code

Internal features:

- Clang builtin __builtin_ia32_rdpru
- IR intrinsic @llvm.x86.rdpru

Differential Revision: https://reviews.llvm.org/D128934
2022-07-06 07:17:47 -07:00
Craig Topper 2bfca35614 [X86] Disable combineVectorSizedSetCCEquality for soft float.
The vector types aren't legal with soft float.
Also disable under NoImplicitFloat for good measure.

Fixes PR56351.

Differential Revision: https://reviews.llvm.org/D129060
2022-07-04 08:33:30 -07:00
Simon Pilgrim 26708fa166 Revert rG057db2002bb3: [X86] combineAndnp - constant fold ANDNP(C,X) -> AND(~C,X)
If the LHS op has a single use then using the more general AND op is likely to allow commutation, load folding, generic folds etc.

Reverted due to reports from @alexfh about it causing an infinite loop (repro still pending).
2022-07-01 10:36:09 +01:00
Simon Pilgrim e961e05d59 [SLP][X86] Add 32-bit vector stores to help vectorization opportunities
Building on the work on D124284, this patch tags v4i8 and v2i16 vector loads as custom, enabling SLP to try to vectorize these types ending in a partial store (using the SSE MOVD instruction) - we already do something similar for 64-bit vector types.

Differential Revision: https://reviews.llvm.org/D127604
2022-06-30 20:25:50 +01:00
Craig Topper 3706bdad4a [X86] Remove unnecessary COPY from EmitLoweredCascadedSelect.
I believe we already checked that the destination of the first
CMOV is only used by the second CMOV so I don't think there is any
reason we need the PHI to write the register that was used by the
first CMOV. We can directly use the second CMOV destination and
avoid the copy.

This may be a left over from when the cascaded select handling
was part of the main algorithm before it was refactored in D35685.

Reviewed By: pengfei

Differential Revision: https://reviews.llvm.org/D128124
2022-06-28 09:33:33 -07:00
Simon Pilgrim 0b998053db [X86] combineConcatVectorOps - IsConcatFree must check extraction index
Identified in the regression reported by @alexfh on rGb5d7beeb9792 - IsConcatFree wasn't ensuring the subvector extraction index matched the position it would be concatenated back into.
2022-06-27 11:46:49 +01:00
Simon Pilgrim ac4cb1775b [X86] fold (and (mul x, c1), c2) -> (mul x, (and c1, c2)) iff c2 is all/no bits mask
Noticed on D128216 - if we're zeroing out vector elements of a mul/mulh result then see if we can merge the and-mask into the mul by just multiplying by zero.

Ideally we'd make this generic (similar to the existing foldSelectWithIdentityConstant?), but these cases are appearing very late, after the constants have been lowered to constant-pool loads.
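
A small C++ sketch of the per-element identity this relies on, where each lane of c2 is either all-bits or no-bits (values here are illustrative):

```
#include <cassert>
#include <cstdint>

int main() {
  uint16_t X[4]  = {3, 100, 7, 42};
  uint16_t C1[4] = {5, 9, 11, 2};
  uint16_t C2[4] = {0xFFFF, 0, 0xFFFF, 0}; // each lane is an all/no bits mask

  for (int I = 0; I < 4; ++I) {
    uint16_t Lhs = (uint16_t)((X[I] * C1[I]) & C2[I]); // and(mul(x,c1),c2)
    uint16_t Rhs = (uint16_t)(X[I] * (C1[I] & C2[I])); // mul(x,and(c1,c2))
    assert(Lhs == Rhs); // all-ones lane: both x*c1; zero lane: both 0
  }
  return 0;
}
```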
2022-06-21 15:10:43 +01:00
Simon Pilgrim 057db2002b [X86] combineAndnp - constant fold ANDNP(C,X) -> AND(~C,X)
If the LHS op has a single use then using the more general AND op is likely to allow commutation, load folding, generic folds etc.
2022-06-21 12:31:01 +01:00
Simon Pilgrim 843d43e62a [X86] computeKnownBitsForTargetNode - add X86ISD::VBROADCAST_LOAD handling
This requires us to override the isTargetCanonicalConstantNode callback introduced in D128144, so we can recognise the various cases where a VBROADCAST_LOAD constant is being reused at different vector widths to prevent infinite loops.
2022-06-21 11:48:01 +01:00
Simon Pilgrim 8254966062 [X86] LowerINSERT_VECTOR_ELT - always lower v32i8/v16i16 allones insertions on AVX1 as OR ops
v32i8/v16i16 blend shuffles on AVX1 will expand to OR(AND,ANDN) patterns which can be easily broken by other combines
2022-06-20 18:43:03 +01:00
Simon Pilgrim e4a124dda5 [DAG] Fold (srl (shl x, c1), c2) -> and(shl/srl(x, c3), m)
Similar to the existing (shl (srl x, c1), c2) fold

Part of the work to fix the regressions in D77804
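
A quick C++ check of the underlying bit identity (constants chosen only for illustration):

```
#include <cassert>
#include <cstdint>

int main() {
  for (uint32_t X : {0u, 1u, 0x12345678u, 0xFFFFFFFFu}) {
    // c1 > c2: (x << 5) >> 3  ==  (x << 2) & (~0u >> 3)
    assert(((X << 5) >> 3) == ((X << 2) & (0xFFFFFFFFu >> 3)));
    // c1 < c2: (x << 3) >> 5  ==  (x >> 2) & (~0u >> 5)
    assert(((X << 3) >> 5) == ((X >> 2) & (0xFFFFFFFFu >> 5)));
  }
  return 0;
}
```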

Differential Revision: https://reviews.llvm.org/D125836
2022-06-20 08:37:38 +01:00
Simon Pilgrim ba3f2667b6 [DAG] Add MaskedVectorIsZero helper
Equivalent to MaskedValueIsZero, except it checks whether all of the demanded vector elements are known to be zero.
2022-06-19 17:56:30 +01:00
Simon Pilgrim 41455dd1dc [X86] Remove isTargetShuffleSplat and just use SelectionDAG::isSplatValue
shuffle(splat(x)) -> splat(x); it doesn't have to be a target-specific broadcast
2022-06-19 11:22:57 +01:00
Simon Pilgrim ac3f967382 [X86] canonicalizeShuffleWithBinOps - merge shuffles across binops if either source op is a known splat
The shuffle of a splat (with no undefs) should always be removed
2022-06-18 17:14:00 +01:00
Simon Pilgrim f42f2b7005 [X86] canonicalizeShuffleWithBinOps - merge unary shuffles across binops if either source op is a foldable load
This mostly handles folding of constants that have already become loads, but we expose some generic load cases as well.

This also exposes the chance to merge unary shuffles across X86ISD::ANDNP nodes with different scalar widths
2022-06-18 15:58:54 +01:00
Simon Pilgrim 3c9123af9f [X86] isShuffleFoldableLoad - ensure the load has one use.
We'll only fold the load if it has one use. Makes no difference to existing tests but will be necessary for an upcoming patch to improve load folding as part of canonicalizeShuffleWithBinOps.
2022-06-18 14:51:55 +01:00
Phoebe Wang 655ba9c8a1 Reland "Reland "Reland "Reland "[X86][RFC] Enable `_Float16` type support on X86 following the psABI""""
This resolves problems reported in commit 1a20252978.
1. Promote XINT_TO_FP nodes to float lowering
2. Bail f16 out of the shuffle combine because the vector type is not legal in this version
2022-06-17 21:34:05 +08:00
Benjamin Kramer 1a20252978 Revert "Reland "Reland "Reland "[X86][RFC] Enable `_Float16` type support on X86 following the psABI""""
This reverts commit 04a3d5f3a1.

I see two more issues:

- uitofp/sitofp from i32/i64 to half now generate
  __floatsihf/__floatdihf, which exist in neither compiler-rt nor
  libgcc

- This crashes when legalizing the bitcast:
```
; RUN: llc < %s -mcpu=skx
define void @main.45(ptr nocapture readnone %retval, ptr noalias nocapture readnone %run_options, ptr noalias nocapture readnone %params, ptr noalias nocapture readonly %buffer_table, ptr noalias nocapture readnone %status, ptr noalias nocapture readnone %prof_counters) local_unnamed_addr {
entry:
  %fusion = load ptr, ptr %buffer_table, align 8
  %0 = getelementptr inbounds ptr, ptr %buffer_table, i64 1
  %Arg_1.2 = load ptr, ptr %0, align 8
  %1 = getelementptr inbounds ptr, ptr %buffer_table, i64 2
  %Arg_0.1 = load ptr, ptr %1, align 8
  %2 = load half, ptr %Arg_0.1, align 8
  %3 = bitcast half %2 to i16
  %4 = and i16 %3, 32767
  %5 = icmp eq i16 %4, 0
  %6 = and i16 %3, -32768
  %broadcast.splatinsert = insertelement <4 x half> poison, half %2, i64 0
  %broadcast.splat = shufflevector <4 x half> %broadcast.splatinsert, <4 x half> poison, <4 x i32> zeroinitializer
  %broadcast.splatinsert9 = insertelement <4 x i16> poison, i16 %4, i64 0
  %broadcast.splat10 = shufflevector <4 x i16> %broadcast.splatinsert9, <4 x i16> poison, <4 x i32> zeroinitializer
  %broadcast.splatinsert11 = insertelement <4 x i16> poison, i16 %6, i64 0
  %broadcast.splat12 = shufflevector <4 x i16> %broadcast.splatinsert11, <4 x i16> poison, <4 x i32> zeroinitializer
  %broadcast.splatinsert13 = insertelement <4 x i16> poison, i16 %3, i64 0
  %broadcast.splat14 = shufflevector <4 x i16> %broadcast.splatinsert13, <4 x i16> poison, <4 x i32> zeroinitializer
  %wide.load = load <4 x half>, ptr %Arg_1.2, align 8
  %7 = fcmp uno <4 x half> %broadcast.splat, %wide.load
  %8 = fcmp oeq <4 x half> %broadcast.splat, %wide.load
  %9 = bitcast <4 x half> %wide.load to <4 x i16>
  %10 = and <4 x i16> %9, <i16 32767, i16 32767, i16 32767, i16 32767>
  %11 = icmp eq <4 x i16> %10, zeroinitializer
  %12 = and <4 x i16> %9, <i16 -32768, i16 -32768, i16 -32768, i16 -32768>
  %13 = or <4 x i16> %12, <i16 1, i16 1, i16 1, i16 1>
  %14 = select <4 x i1> %11, <4 x i16> %9, <4 x i16> %13
  %15 = icmp ugt <4 x i16> %broadcast.splat10, %10
  %16 = icmp ne <4 x i16> %broadcast.splat12, %12
  %17 = or <4 x i1> %15, %16
  %18 = select <4 x i1> %17, <4 x i16> <i16 -1, i16 -1, i16 -1, i16 -1>, <4 x i16> <i16 1, i16 1, i16 1, i16 1>
  %19 = add <4 x i16> %18, %broadcast.splat14
  %20 = select i1 %5, <4 x i16> %14, <4 x i16> %19
  %21 = select <4 x i1> %8, <4 x i16> %9, <4 x i16> %20
  %22 = bitcast <4 x i16> %21 to <4 x half>
  %23 = select <4 x i1> %7, <4 x half> <half 0xH7E00, half 0xH7E00, half 0xH7E00, half 0xH7E00>, <4 x half> %22
  store <4 x half> %23, ptr %fusion, align 16
  ret void
}
```

llc: llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp:977: void (anonymous namespace)::SelectionDAGLegalize::LegalizeOp(llvm::SDNode *): Assertion `(TLI.getTypeAction(*DAG.getContext(), Op.getValueType()) == TargetLowering::TypeLegal || Op.getOpcode() == ISD::TargetConstant || Op.getOpcode() == ISD::Register) && "Unexpected illegal type!"' failed.
2022-06-17 09:43:07 +02:00
Phoebe Wang 04a3d5f3a1 Reland "Reland "Reland "[X86][RFC] Enable `_Float16` type support on X86 following the psABI"""
Fix the crash on lowering X86ISD::FCMP.
2022-06-17 12:12:17 +08:00
Paul Robinson ff0122dcce [PS5] Emit ud2 for ubsan trap 2022-06-16 11:20:10 -07:00
Frederik Gossen 3cd5696a33 Revert "Reland "Reland "[X86][RFC] Enable `_Float16` type support on X86 following the psABI"""
This reverts commit e1c5afa47d.

This introduces crashes in the JAX backend on CPU. A reproducer in LLVM is
below. Let me know if you have trouble reproducing this.

; ModuleID = '__compute_module'
source_filename = "__compute_module"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-grtev4-linux-gnu"

@0 = private unnamed_addr constant [4 x i8] c"\00\00\00?"
@1 = private unnamed_addr constant [4 x i8] c"\1C}\908"
@2 = private unnamed_addr constant [4 x i8] c"?\00\\4"
@3 = private unnamed_addr constant [4 x i8] c"%ci1"
@4 = private unnamed_addr constant [4 x i8] zeroinitializer
@5 = private unnamed_addr constant [4 x i8] c"\00\00\00\C0"
@6 = private unnamed_addr constant [4 x i8] c"\00\00\00B"
@7 = private unnamed_addr constant [4 x i8] c"\94\B4\C22"
@8 = private unnamed_addr constant [4 x i8] c"^\09B6"
@9 = private unnamed_addr constant [4 x i8] c"\15\F3M?"
@10 = private unnamed_addr constant [4 x i8] c"e\CC\\;"
@11 = private unnamed_addr constant [4 x i8] c"d\BD/>"
@12 = private unnamed_addr constant [4 x i8] c"V\F4I="
@13 = private unnamed_addr constant [4 x i8] c"\10\CB,<"
@14 = private unnamed_addr constant [4 x i8] c"\AC\E3\D6:"
@15 = private unnamed_addr constant [4 x i8] c"\DC\A8E9"
@16 = private unnamed_addr constant [4 x i8] c"\C6\FA\897"
@17 = private unnamed_addr constant [4 x i8] c"%\F9\955"
@18 = private unnamed_addr constant [4 x i8] c"\B5\DB\813"
@19 = private unnamed_addr constant [4 x i8] c"\B4W_\B2"
@20 = private unnamed_addr constant [4 x i8] c"\1Cc\8F\B4"
@21 = private unnamed_addr constant [4 x i8] c"~3\94\B6"
@22 = private unnamed_addr constant [4 x i8] c"3Yq\B8"
@23 = private unnamed_addr constant [4 x i8] c"\E9\17\17\BA"
@24 = private unnamed_addr constant [4 x i8] c"\F1\B2\8D\BB"
@25 = private unnamed_addr constant [4 x i8] c"\F8t\C2\BC"
@26 = private unnamed_addr constant [4 x i8] c"\82[\C2\BD"
@27 = private unnamed_addr constant [4 x i8] c"uB-?"
@28 = private unnamed_addr constant [4 x i8] c"^\FF\9B\BE"
@29 = private unnamed_addr constant [4 x i8] c"\00\00\00A"

; Function Attrs: uwtable
define void @main.158(ptr %retval, ptr noalias %run_options, ptr noalias %params, ptr noalias %buffer_table, ptr noalias %status, ptr noalias %prof_counters) #0 {
entry:
  %fusion.invar_address.dim.1 = alloca i64, align 8
  %fusion.invar_address.dim.0 = alloca i64, align 8
  %0 = getelementptr inbounds ptr, ptr %buffer_table, i64 1
  %Arg_0.1 = load ptr, ptr %0, align 8, !invariant.load !0, !dereferenceable !1, !align !2
  %1 = getelementptr inbounds ptr, ptr %buffer_table, i64 0
  %fusion = load ptr, ptr %1, align 8, !invariant.load !0, !dereferenceable !1, !align !2
  store i64 0, ptr %fusion.invar_address.dim.0, align 8
  br label %fusion.loop_header.dim.0

return:                                           ; preds = %fusion.loop_exit.dim.0
  ret void

fusion.loop_header.dim.0:                         ; preds = %fusion.loop_exit.dim.1, %entry
  %fusion.indvar.dim.0 = load i64, ptr %fusion.invar_address.dim.0, align 8
  %2 = icmp uge i64 %fusion.indvar.dim.0, 3
  br i1 %2, label %fusion.loop_exit.dim.0, label %fusion.loop_body.dim.0

fusion.loop_body.dim.0:                           ; preds = %fusion.loop_header.dim.0
  store i64 0, ptr %fusion.invar_address.dim.1, align 8
  br label %fusion.loop_header.dim.1

fusion.loop_header.dim.1:                         ; preds = %fusion.loop_body.dim.1, %fusion.loop_body.dim.0
  %fusion.indvar.dim.1 = load i64, ptr %fusion.invar_address.dim.1, align 8
  %3 = icmp uge i64 %fusion.indvar.dim.1, 1
  br i1 %3, label %fusion.loop_exit.dim.1, label %fusion.loop_body.dim.1

fusion.loop_body.dim.1:                           ; preds = %fusion.loop_header.dim.1
  %4 = getelementptr inbounds [3 x [1 x half]], ptr %Arg_0.1, i64 0, i64 %fusion.indvar.dim.0, i64 0
  %5 = load half, ptr %4, align 2, !invariant.load !0, !noalias !3
  %6 = fpext half %5 to float
  %7 = call float @llvm.fabs.f32(float %6)
  %constant.121 = load float, ptr @29, align 4
  %compare.2 = fcmp ole float %7, %constant.121
  %8 = zext i1 %compare.2 to i8
  %constant.120 = load float, ptr @0, align 4
  %multiply.95 = fmul float %7, %constant.120
  %constant.119 = load float, ptr @5, align 4
  %add.82 = fadd float %multiply.95, %constant.119
  %constant.118 = load float, ptr @4, align 4
  %multiply.94 = fmul float %add.82, %constant.118
  %constant.117 = load float, ptr @19, align 4
  %add.81 = fadd float %multiply.94, %constant.117
  %multiply.92 = fmul float %add.82, %add.81
  %constant.116 = load float, ptr @18, align 4
  %add.79 = fadd float %multiply.92, %constant.116
  %multiply.91 = fmul float %add.82, %add.79
  %subtract.87 = fsub float %multiply.91, %add.81
  %constant.115 = load float, ptr @20, align 4
  %add.78 = fadd float %subtract.87, %constant.115
  %multiply.89 = fmul float %add.82, %add.78
  %subtract.86 = fsub float %multiply.89, %add.79
  %constant.114 = load float, ptr @17, align 4
  %add.76 = fadd float %subtract.86, %constant.114
  %multiply.88 = fmul float %add.82, %add.76
  %subtract.84 = fsub float %multiply.88, %add.78
  %constant.113 = load float, ptr @21, align 4
  %add.75 = fadd float %subtract.84, %constant.113
  %multiply.86 = fmul float %add.82, %add.75
  %subtract.83 = fsub float %multiply.86, %add.76
  %constant.112 = load float, ptr @16, align 4
  %add.73 = fadd float %subtract.83, %constant.112
  %multiply.85 = fmul float %add.82, %add.73
  %subtract.81 = fsub float %multiply.85, %add.75
  %constant.111 = load float, ptr @22, align 4
  %add.72 = fadd float %subtract.81, %constant.111
  %multiply.83 = fmul float %add.82, %add.72
  %subtract.80 = fsub float %multiply.83, %add.73
  %constant.110 = load float, ptr @15, align 4
  %add.70 = fadd float %subtract.80, %constant.110
  %multiply.82 = fmul float %add.82, %add.70
  %subtract.78 = fsub float %multiply.82, %add.72
  %constant.109 = load float, ptr @23, align 4
  %add.69 = fadd float %subtract.78, %constant.109
  %multiply.80 = fmul float %add.82, %add.69
  %subtract.77 = fsub float %multiply.80, %add.70
  %constant.108 = load float, ptr @14, align 4
  %add.68 = fadd float %subtract.77, %constant.108
  %multiply.79 = fmul float %add.82, %add.68
  %subtract.75 = fsub float %multiply.79, %add.69
  %constant.107 = load float, ptr @24, align 4
  %add.67 = fadd float %subtract.75, %constant.107
  %multiply.77 = fmul float %add.82, %add.67
  %subtract.74 = fsub float %multiply.77, %add.68
  %constant.106 = load float, ptr @13, align 4
  %add.66 = fadd float %subtract.74, %constant.106
  %multiply.76 = fmul float %add.82, %add.66
  %subtract.72 = fsub float %multiply.76, %add.67
  %constant.105 = load float, ptr @25, align 4
  %add.65 = fadd float %subtract.72, %constant.105
  %multiply.74 = fmul float %add.82, %add.65
  %subtract.71 = fsub float %multiply.74, %add.66
  %constant.104 = load float, ptr @12, align 4
  %add.64 = fadd float %subtract.71, %constant.104
  %multiply.73 = fmul float %add.82, %add.64
  %subtract.69 = fsub float %multiply.73, %add.65
  %constant.103 = load float, ptr @26, align 4
  %add.63 = fadd float %subtract.69, %constant.103
  %multiply.71 = fmul float %add.82, %add.63
  %subtract.67 = fsub float %multiply.71, %add.64
  %constant.102 = load float, ptr @11, align 4
  %add.62 = fadd float %subtract.67, %constant.102
  %multiply.70 = fmul float %add.82, %add.62
  %subtract.66 = fsub float %multiply.70, %add.63
  %constant.101 = load float, ptr @28, align 4
  %add.61 = fadd float %subtract.66, %constant.101
  %multiply.68 = fmul float %add.82, %add.61
  %subtract.65 = fsub float %multiply.68, %add.62
  %constant.100 = load float, ptr @27, align 4
  %add.60 = fadd float %subtract.65, %constant.100
  %subtract.64 = fsub float %add.60, %add.62
  %multiply.66 = fmul float %subtract.64, %constant.120
  %constant.99 = load float, ptr @6, align 4
  %divide.4 = fdiv float %constant.99, %7
  %add.59 = fadd float %divide.4, %constant.119
  %multiply.65 = fmul float %add.59, %constant.118
  %constant.98 = load float, ptr @3, align 4
  %add.58 = fadd float %multiply.65, %constant.98
  %multiply.64 = fmul float %add.59, %add.58
  %constant.97 = load float, ptr @7, align 4
  %add.57 = fadd float %multiply.64, %constant.97
  %multiply.63 = fmul float %add.59, %add.57
  %subtract.63 = fsub float %multiply.63, %add.58
  %constant.96 = load float, ptr @2, align 4
  %add.56 = fadd float %subtract.63, %constant.96
  %multiply.62 = fmul float %add.59, %add.56
  %subtract.62 = fsub float %multiply.62, %add.57
  %constant.95 = load float, ptr @8, align 4
  %add.55 = fadd float %subtract.62, %constant.95
  %multiply.61 = fmul float %add.59, %add.55
  %subtract.61 = fsub float %multiply.61, %add.56
  %constant.94 = load float, ptr @1, align 4
  %add.54 = fadd float %subtract.61, %constant.94
  %multiply.60 = fmul float %add.59, %add.54
  %subtract.60 = fsub float %multiply.60, %add.55
  %constant.93 = load float, ptr @10, align 4
  %add.53 = fadd float %subtract.60, %constant.93
  %multiply.59 = fmul float %add.59, %add.53
  %subtract.59 = fsub float %multiply.59, %add.54
  %constant.92 = load float, ptr @9, align 4
  %add.52 = fadd float %subtract.59, %constant.92
  %subtract.58 = fsub float %add.52, %add.54
  %multiply.58 = fmul float %subtract.58, %constant.120
  %9 = call float @llvm.sqrt.f32(float %7)
  %10 = fdiv float 1.000000e+00, %9
  %multiply.57 = fmul float %multiply.58, %10
  %11 = trunc i8 %8 to i1
  %12 = select i1 %11, float %multiply.66, float %multiply.57
  %13 = fptrunc float %12 to half
  %14 = getelementptr inbounds [3 x [1 x half]], ptr %fusion, i64 0, i64 %fusion.indvar.dim.0, i64 0
  store half %13, ptr %14, align 2, !alias.scope !3
  %invar.inc1 = add nuw nsw i64 %fusion.indvar.dim.1, 1
  store i64 %invar.inc1, ptr %fusion.invar_address.dim.1, align 8
  br label %fusion.loop_header.dim.1

fusion.loop_exit.dim.1:                           ; preds = %fusion.loop_header.dim.1
  %invar.inc = add nuw nsw i64 %fusion.indvar.dim.0, 1
  store i64 %invar.inc, ptr %fusion.invar_address.dim.0, align 8
  br label %fusion.loop_header.dim.0

fusion.loop_exit.dim.0:                           ; preds = %fusion.loop_header.dim.0
  br label %return
}

; Function Attrs: nocallback nofree nosync nounwind readnone speculatable willreturn
declare float @llvm.fabs.f32(float %0) #1

; Function Attrs: nocallback nofree nosync nounwind readnone speculatable willreturn
declare float @llvm.sqrt.f32(float %0) #1

attributes #0 = { uwtable "denormal-fp-math"="preserve-sign" "no-frame-pointer-elim"="false" }
attributes #1 = { nocallback nofree nosync nounwind readnone speculatable willreturn }

!0 = !{}
!1 = !{i64 6}
!2 = !{i64 8}
!3 = !{!4}
!4 = !{!"buffer: {index:0, offset:0, size:6}", !5}
!5 = !{!"XLA global AA domain"}
2022-06-15 18:04:42 -04:00
Phoebe Wang e1c5afa47d Reland "Reland "[X86][RFC] Enable `_Float16` type support on X86 following the psABI""
Fixed the missing SQRT promotion and added several missing operations too.
2022-06-15 23:00:18 +08:00
Thomas Joerg 37455b1f71 Revert "Reland "[X86][RFC] Enable `_Float16` type support on X86 following the psABI""
This reverts commit 6e02e27536.

This introduces a crash in the backend. Reproducer in MLIR's LLVM
dialect follows. Let me know if you have trouble reproducing this.

module {
  llvm.func @malloc(i64) -> !llvm.ptr<i8>
  llvm.func @_mlir_ciface_tf_report_error(!llvm.ptr<i8>, i32, !llvm.ptr<i8>)
  llvm.mlir.global internal constant @error_message_2208944672953921889("failed to allocate memory at loc(\22-\22:3:8)\00")
  llvm.func @_mlir_ciface_tf_alloc(!llvm.ptr<i8>, i64, i64, i32, i32, !llvm.ptr<i32>) -> !llvm.ptr<i8>
  llvm.func @Rsqrt_CPU_DT_HALF_DT_HALF(%arg0: !llvm.ptr<i8>, %arg1: i64, %arg2: !llvm.ptr<i8>) -> !llvm.struct<(i64, ptr<i8>)> attributes {llvm.emit_c_interface, tf_entry} {
    %0 = llvm.mlir.constant(8 : i32) : i32
    %1 = llvm.mlir.constant(8 : index) : i64
    %2 = llvm.mlir.constant(2 : index) : i64
    %3 = llvm.mlir.constant(dense<0.000000e+00> : vector<4xf16>) : vector<4xf16>
    %4 = llvm.mlir.constant(dense<[0, 1, 2, 3]> : vector<4xi32>) : vector<4xi32>
    %5 = llvm.mlir.constant(dense<1.000000e+00> : vector<4xf16>) : vector<4xf16>
    %6 = llvm.mlir.constant(false) : i1
    %7 = llvm.mlir.constant(1 : i32) : i32
    %8 = llvm.mlir.constant(0 : i32) : i32
    %9 = llvm.mlir.constant(4 : index) : i64
    %10 = llvm.mlir.constant(0 : index) : i64
    %11 = llvm.mlir.constant(1 : index) : i64
    %12 = llvm.mlir.constant(-1 : index) : i64
    %13 = llvm.mlir.null : !llvm.ptr<f16>
    %14 = llvm.getelementptr %13[%9] : (!llvm.ptr<f16>, i64) -> !llvm.ptr<f16>
    %15 = llvm.ptrtoint %14 : !llvm.ptr<f16> to i64
    %16 = llvm.alloca %15 x f16 {alignment = 32 : i64} : (i64) -> !llvm.ptr<f16>
    %17 = llvm.alloca %15 x f16 {alignment = 32 : i64} : (i64) -> !llvm.ptr<f16>
    %18 = llvm.mlir.null : !llvm.ptr<i64>
    %19 = llvm.getelementptr %18[%arg1] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
    %20 = llvm.ptrtoint %19 : !llvm.ptr<i64> to i64
    %21 = llvm.alloca %20 x i64 : (i64) -> !llvm.ptr<i64>
    llvm.br ^bb1(%10 : i64)
  ^bb1(%22: i64):  // 2 preds: ^bb0, ^bb2
    %23 = llvm.icmp "slt" %22, %arg1 : i64
    llvm.cond_br %23, ^bb2, ^bb3
  ^bb2:  // pred: ^bb1
    %24 = llvm.bitcast %arg2 : !llvm.ptr<i8> to !llvm.ptr<struct<(ptr<f16>, ptr<f16>, i64)>>
    %25 = llvm.getelementptr %24[%10, 2] : (!llvm.ptr<struct<(ptr<f16>, ptr<f16>, i64)>>, i64) -> !llvm.ptr<i64>
    %26 = llvm.add %22, %11  : i64
    %27 = llvm.getelementptr %25[%26] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
    %28 = llvm.load %27 : !llvm.ptr<i64>
    %29 = llvm.getelementptr %21[%22] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
    llvm.store %28, %29 : !llvm.ptr<i64>
    llvm.br ^bb1(%26 : i64)
  ^bb3:  // pred: ^bb1
    llvm.br ^bb4(%10, %11 : i64, i64)
  ^bb4(%30: i64, %31: i64):  // 2 preds: ^bb3, ^bb5
    %32 = llvm.icmp "slt" %30, %arg1 : i64
    llvm.cond_br %32, ^bb5, ^bb6
  ^bb5:  // pred: ^bb4
    %33 = llvm.bitcast %arg2 : !llvm.ptr<i8> to !llvm.ptr<struct<(ptr<f16>, ptr<f16>, i64)>>
    %34 = llvm.getelementptr %33[%10, 2] : (!llvm.ptr<struct<(ptr<f16>, ptr<f16>, i64)>>, i64) -> !llvm.ptr<i64>
    %35 = llvm.add %30, %11  : i64
    %36 = llvm.getelementptr %34[%35] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
    %37 = llvm.load %36 : !llvm.ptr<i64>
    %38 = llvm.mul %37, %31  : i64
    llvm.br ^bb4(%35, %38 : i64, i64)
  ^bb6:  // pred: ^bb4
    %39 = llvm.bitcast %arg2 : !llvm.ptr<i8> to !llvm.ptr<ptr<f16>>
    %40 = llvm.getelementptr %39[%11] : (!llvm.ptr<ptr<f16>>, i64) -> !llvm.ptr<ptr<f16>>
    %41 = llvm.load %40 : !llvm.ptr<ptr<f16>>
    %42 = llvm.getelementptr %13[%11] : (!llvm.ptr<f16>, i64) -> !llvm.ptr<f16>
    %43 = llvm.ptrtoint %42 : !llvm.ptr<f16> to i64
    %44 = llvm.alloca %7 x i32 : (i32) -> !llvm.ptr<i32>
    llvm.store %8, %44 : !llvm.ptr<i32>
    %45 = llvm.call @_mlir_ciface_tf_alloc(%arg0, %31, %43, %8, %7, %44) : (!llvm.ptr<i8>, i64, i64, i32, i32, !llvm.ptr<i32>) -> !llvm.ptr<i8>
    %46 = llvm.bitcast %45 : !llvm.ptr<i8> to !llvm.ptr<f16>
    %47 = llvm.icmp "eq" %31, %10 : i64
    %48 = llvm.or %6, %47  : i1
    %49 = llvm.mlir.null : !llvm.ptr<i8>
    %50 = llvm.icmp "ne" %45, %49 : !llvm.ptr<i8>
    %51 = llvm.or %50, %48  : i1
    llvm.cond_br %51, ^bb7, ^bb13
  ^bb7:  // pred: ^bb6
    %52 = llvm.urem %31, %9  : i64
    %53 = llvm.sub %31, %52  : i64
    llvm.br ^bb8(%10 : i64)
  ^bb8(%54: i64):  // 2 preds: ^bb7, ^bb9
    %55 = llvm.icmp "slt" %54, %53 : i64
    llvm.cond_br %55, ^bb9, ^bb10
  ^bb9:  // pred: ^bb8
    %56 = llvm.mul %54, %11  : i64
    %57 = llvm.add %56, %10  : i64
    %58 = llvm.add %57, %10  : i64
    %59 = llvm.getelementptr %41[%58] : (!llvm.ptr<f16>, i64) -> !llvm.ptr<f16>
    %60 = llvm.bitcast %59 : !llvm.ptr<f16> to !llvm.ptr<vector<4xf16>>
    %61 = llvm.load %60 {alignment = 2 : i64} : !llvm.ptr<vector<4xf16>>
    %62 = "llvm.intr.sqrt"(%61) : (vector<4xf16>) -> vector<4xf16>
    %63 = llvm.fdiv %5, %62  : vector<4xf16>
    %64 = llvm.getelementptr %46[%58] : (!llvm.ptr<f16>, i64) -> !llvm.ptr<f16>
    %65 = llvm.bitcast %64 : !llvm.ptr<f16> to !llvm.ptr<vector<4xf16>>
    llvm.store %63, %65 {alignment = 2 : i64} : !llvm.ptr<vector<4xf16>>
    %66 = llvm.add %54, %9  : i64
    llvm.br ^bb8(%66 : i64)
  ^bb10:  // pred: ^bb8
    %67 = llvm.icmp "ult" %53, %31 : i64
    llvm.cond_br %67, ^bb11, ^bb12
  ^bb11:  // pred: ^bb10
    %68 = llvm.mul %53, %12  : i64
    %69 = llvm.add %31, %68  : i64
    %70 = llvm.mul %53, %11  : i64
    %71 = llvm.add %70, %10  : i64
    %72 = llvm.trunc %69 : i64 to i32
    %73 = llvm.mlir.undef : vector<4xi32>
    %74 = llvm.insertelement %72, %73[%8 : i32] : vector<4xi32>
    %75 = llvm.shufflevector %74, %73 [0 : i32, 0 : i32, 0 : i32, 0 : i32] : vector<4xi32>, vector<4xi32>
    %76 = llvm.icmp "slt" %4, %75 : vector<4xi32>
    %77 = llvm.add %71, %10  : i64
    %78 = llvm.getelementptr %41[%77] : (!llvm.ptr<f16>, i64) -> !llvm.ptr<f16>
    %79 = llvm.bitcast %78 : !llvm.ptr<f16> to !llvm.ptr<vector<4xf16>>
    %80 = llvm.intr.masked.load %79, %76, %3 {alignment = 2 : i32} : (!llvm.ptr<vector<4xf16>>, vector<4xi1>, vector<4xf16>) -> vector<4xf16>
    %81 = llvm.bitcast %16 : !llvm.ptr<f16> to !llvm.ptr<vector<4xf16>>
    llvm.store %80, %81 : !llvm.ptr<vector<4xf16>>
    %82 = llvm.load %81 {alignment = 2 : i64} : !llvm.ptr<vector<4xf16>>
    %83 = "llvm.intr.sqrt"(%82) : (vector<4xf16>) -> vector<4xf16>
    %84 = llvm.fdiv %5, %83  : vector<4xf16>
    %85 = llvm.bitcast %17 : !llvm.ptr<f16> to !llvm.ptr<vector<4xf16>>
    llvm.store %84, %85 {alignment = 2 : i64} : !llvm.ptr<vector<4xf16>>
    %86 = llvm.load %85 : !llvm.ptr<vector<4xf16>>
    %87 = llvm.getelementptr %46[%77] : (!llvm.ptr<f16>, i64) -> !llvm.ptr<f16>
    %88 = llvm.bitcast %87 : !llvm.ptr<f16> to !llvm.ptr<vector<4xf16>>
    llvm.intr.masked.store %86, %88, %76 {alignment = 2 : i32} : vector<4xf16>, vector<4xi1> into !llvm.ptr<vector<4xf16>>
    llvm.br ^bb12
  ^bb12:  // 2 preds: ^bb10, ^bb11
    %89 = llvm.mul %2, %1  : i64
    %90 = llvm.mul %arg1, %2  : i64
    %91 = llvm.add %90, %11  : i64
    %92 = llvm.mul %91, %1  : i64
    %93 = llvm.add %89, %92  : i64
    %94 = llvm.alloca %93 x i8 : (i64) -> !llvm.ptr<i8>
    %95 = llvm.bitcast %94 : !llvm.ptr<i8> to !llvm.ptr<ptr<f16>>
    llvm.store %46, %95 : !llvm.ptr<ptr<f16>>
    %96 = llvm.getelementptr %95[%11] : (!llvm.ptr<ptr<f16>>, i64) -> !llvm.ptr<ptr<f16>>
    llvm.store %46, %96 : !llvm.ptr<ptr<f16>>
    %97 = llvm.getelementptr %95[%2] : (!llvm.ptr<ptr<f16>>, i64) -> !llvm.ptr<ptr<f16>>
    %98 = llvm.bitcast %97 : !llvm.ptr<ptr<f16>> to !llvm.ptr<i64>
    llvm.store %10, %98 : !llvm.ptr<i64>
    %99 = llvm.bitcast %94 : !llvm.ptr<i8> to !llvm.ptr<struct<(ptr<f16>, ptr<f16>, i64, i64)>>
    %100 = llvm.getelementptr %99[%10, 3] : (!llvm.ptr<struct<(ptr<f16>, ptr<f16>, i64, i64)>>, i64) -> !llvm.ptr<i64>
    %101 = llvm.getelementptr %100[%arg1] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
    %102 = llvm.sub %arg1, %11  : i64
    llvm.br ^bb14(%102, %11 : i64, i64)
  ^bb13:  // pred: ^bb6
    %103 = llvm.mlir.addressof @error_message_2208944672953921889 : !llvm.ptr<array<42 x i8>>
    %104 = llvm.getelementptr %103[%10, %10] : (!llvm.ptr<array<42 x i8>>, i64, i64) -> !llvm.ptr<i8>
    llvm.call @_mlir_ciface_tf_report_error(%arg0, %0, %104) : (!llvm.ptr<i8>, i32, !llvm.ptr<i8>) -> ()
    %105 = llvm.mul %2, %1  : i64
    %106 = llvm.mul %2, %10  : i64
    %107 = llvm.add %106, %11  : i64
    %108 = llvm.mul %107, %1  : i64
    %109 = llvm.add %105, %108  : i64
    %110 = llvm.alloca %109 x i8 : (i64) -> !llvm.ptr<i8>
    %111 = llvm.bitcast %110 : !llvm.ptr<i8> to !llvm.ptr<ptr<f16>>
    llvm.store %13, %111 : !llvm.ptr<ptr<f16>>
    %112 = llvm.getelementptr %111[%11] : (!llvm.ptr<ptr<f16>>, i64) -> !llvm.ptr<ptr<f16>>
    llvm.store %13, %112 : !llvm.ptr<ptr<f16>>
    %113 = llvm.getelementptr %111[%2] : (!llvm.ptr<ptr<f16>>, i64) -> !llvm.ptr<ptr<f16>>
    %114 = llvm.bitcast %113 : !llvm.ptr<ptr<f16>> to !llvm.ptr<i64>
    llvm.store %10, %114 : !llvm.ptr<i64>
    %115 = llvm.call @malloc(%109) : (i64) -> !llvm.ptr<i8>
    "llvm.intr.memcpy"(%115, %110, %109, %6) : (!llvm.ptr<i8>, !llvm.ptr<i8>, i64, i1) -> ()
    %116 = llvm.mlir.undef : !llvm.struct<(i64, ptr<i8>)>
    %117 = llvm.insertvalue %10, %116[0] : !llvm.struct<(i64, ptr<i8>)>
    %118 = llvm.insertvalue %115, %117[1] : !llvm.struct<(i64, ptr<i8>)>
    llvm.return %118 : !llvm.struct<(i64, ptr<i8>)>
  ^bb14(%119: i64, %120: i64):  // 2 preds: ^bb12, ^bb15
    %121 = llvm.icmp "sge" %119, %10 : i64
    llvm.cond_br %121, ^bb15, ^bb16
  ^bb15:  // pred: ^bb14
    %122 = llvm.getelementptr %21[%119] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
    %123 = llvm.load %122 : !llvm.ptr<i64>
    %124 = llvm.getelementptr %100[%119] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
    llvm.store %123, %124 : !llvm.ptr<i64>
    %125 = llvm.getelementptr %101[%119] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
    llvm.store %120, %125 : !llvm.ptr<i64>
    %126 = llvm.mul %120, %123  : i64
    %127 = llvm.sub %119, %11  : i64
    llvm.br ^bb14(%127, %126 : i64, i64)
  ^bb16:  // pred: ^bb14
    %128 = llvm.call @malloc(%93) : (i64) -> !llvm.ptr<i8>
    "llvm.intr.memcpy"(%128, %94, %93, %6) : (!llvm.ptr<i8>, !llvm.ptr<i8>, i64, i1) -> ()
    %129 = llvm.mlir.undef : !llvm.struct<(i64, ptr<i8>)>
    %130 = llvm.insertvalue %arg1, %129[0] : !llvm.struct<(i64, ptr<i8>)>
    %131 = llvm.insertvalue %128, %130[1] : !llvm.struct<(i64, ptr<i8>)>
    llvm.return %131 : !llvm.struct<(i64, ptr<i8>)>
  }
  llvm.func @_mlir_ciface_Rsqrt_CPU_DT_HALF_DT_HALF(%arg0: !llvm.ptr<struct<(i64, ptr<i8>)>>, %arg1: !llvm.ptr<i8>, %arg2: !llvm.ptr<struct<(i64, ptr<i8>)>>) attributes {llvm.emit_c_interface, tf_entry} {
    %0 = llvm.load %arg2 : !llvm.ptr<struct<(i64, ptr<i8>)>>
    %1 = llvm.extractvalue %0[0] : !llvm.struct<(i64, ptr<i8>)>
    %2 = llvm.extractvalue %0[1] : !llvm.struct<(i64, ptr<i8>)>
    %3 = llvm.call @Rsqrt_CPU_DT_HALF_DT_HALF(%arg1, %1, %2) : (!llvm.ptr<i8>, i64, !llvm.ptr<i8>) -> !llvm.struct<(i64, ptr<i8>)>
    llvm.store %3, %arg0 : !llvm.ptr<struct<(i64, ptr<i8>)>>
    llvm.return
  }
}
2022-06-15 13:24:24 +02:00
Benjamin Kramer fb34d531af Promote bf16 to f32 when the target doesn't support it
This is modeled after the half-precision fp support. Two new nodes are
introduced for casting from and to bf16. Since casting from bf16 is a
simple operation I opted to always directly lower it to integer
arithmetic. The other way round is more complicated if you want to
preserve IEEE semantics, so it's handled by a new __truncsfbf2
compiler-rt builtin.
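
A minimal C++ sketch of the cheap direction, bf16 -> float via integer arithmetic (illustrative, not the actual lowering code):

```
#include <cstdint>
#include <cstring>

// bf16 is the top 16 bits of an IEEE-754 binary32, so widening is just a shift.
float bf16_to_float(uint16_t Bits) {
  uint32_t Wide = (uint32_t)Bits << 16;
  float F;
  std::memcpy(&F, &Wide, sizeof(F));
  return F;
}
```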

This is of course very bare bones, but sufficient to get a semi-softened
fadd on x86.

Possible future improvements:
 - Targets with bf16 conversion instructions can now make fp_to_bf16 legal
 - The software conversion to bf16 can be replaced by a trivial
   implementation under fast math.

Differential Revision: https://reviews.llvm.org/D126953
2022-06-15 12:56:31 +02:00
Simon Pilgrim 4fd561415e [X86] needCarryOrOverflowFlag/onlyZeroFlagUsed - merge identical switch cases. NFCI.
Makes it easier to grok and fixes various bugprone-branch-clone warnings.
2022-06-15 10:40:22 +01:00
Phoebe Wang 6e02e27536 Reland "[X86][RFC] Enable `_Float16` type support on X86 following the psABI"
Disabled 2 MLIR tests because the runtime doesn't support `_Float16`; see
the issue here: https://github.com/llvm/llvm-project/issues/55992
2022-06-15 09:15:31 +08:00
Simon Pilgrim 64eea34420 [X86] combineEXTEND_VECTOR_INREG - don't attempt to shuffle combine ANY_EXTEND_VECTOR_INREG without SSE41
Without SSE41, ANY_EXTEND_VECTOR_INREG nodes are likely to be prematurely combined to a target shuffle preventing generic sign extension folds.

Fixes a number of sign-extend regressions in D127115.
2022-06-13 17:42:04 +01:00
Mehdi Amini 5d8298a768 Revert "[X86][RFC] Enable `_Float16` type support on X86 following the psABI"
This reverts commit 2d2da259c8.

This breaks MLIR integration test (JIT crashing), reverting in the
meantime.
2022-06-12 15:14:37 +00:00
Simon Pilgrim b5d7beeb97 [X86] combineConcatVectorOps - add support for concatenation of VSELECT/BLENDV nodes (REAPPLIED)
If the LHS/RHS selection operands can be cheaply concatenated back together then replace 2 x 128-bit selection nodes with 1 x 256-bit node

Addresses the regression introduced in the bug fix from rGd5af6a38082b39ae520a328e44dc29ebcb036bb2

REAPPLIED with a fix for the bug identified in rGea8fb3b60196
2022-06-12 15:40:36 +01:00
Phoebe Wang 2d2da259c8 [X86][RFC] Enable `_Float16` type support on X86 following the psABI
GCC and Clang/LLVM will support `_Float16` on X86 in C/C++, following
the latest X86 psABI. (https://gitlab.com/x86-psABIs)

_Float16 arithmetic will be performed using native half-precision. If
native arithmetic instructions are not available, it will be performed
at a higher precision (currently always float) and then truncated down
to _Float16 immediately after each single arithmetic operation.
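
In source terms, a hedged sketch of what that means for a single operation:

```
// With native FP16 (AVX512-FP16) this is one half-precision add; without it,
// the operation is performed as (float)A + (float)B and the result is
// truncated straight back to _Float16 after this single operation.
_Float16 add_halves(_Float16 A, _Float16 B) {
  return A + B;
}
```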

Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D107082
2022-06-12 11:40:00 +08:00
Simon Pilgrim 7841d09449 [X86][AVX512] Retain pmuldq broadcast loads on 32-bit targets
Don't demand just the lower 32-bits on 32-bit AVX512 targets, to preserve 64-bit broadcast load patterns
2022-06-11 19:30:00 +01:00
Simon Pilgrim 6eaea225c7 [X86] combineTargetShuffle - break if-else chain. NFC.
(style) Both cases always continue.
2022-06-11 09:16:39 +01:00
Simon Pilgrim 89d2b1e4f7 [X86] emitOrXorXorTree - break if-else chain. NFC.
(style) Both cases always return.
2022-06-11 09:16:38 +01:00
Simon Pilgrim 5acbb2dda2 [X86] combineMulToPMADDWD - don't bitcast the source ops before splitting to ensure we split the build vectors early
Fixes a regression on D127115 - splitting was creating extract_subvector(bitcast(build_vector())) patterns which prevented the build vectors being split before being bitcast to vXi16 types, resulting in various issues with further folding of the (now legal) build vectors
2022-06-10 13:44:49 +01:00
Simon Pilgrim 7ac33b8aac [X86] Remove !VT.is128BitVector() check. NFCI.
The code is inside an if(VT.is256BitVector() || VT.is512BitVector()) condition
2022-06-09 21:39:45 +01:00
Simon Pilgrim 72a049d778 [X86][AVX2] LowerINSERT_VECTOR_ELT - support v4i64 insertion as BLENDI(X, SCALAR_TO_VECTOR(Y)) 2022-06-09 21:18:10 +01:00
Simon Pilgrim 1a02db9882 [X86] canonicalizeShuffleWithBinOps - add TODO for X86ISD::ANDNP bitwise handling
Its just as safe to move shuffles across X86ISD::ANDNP as any other logical bitop, they just tend to appear too late to matter.

Noticed while triaging D127115 regressions.
2022-06-09 12:18:26 +01:00
Simon Pilgrim 9a76337fee [X86] combineMOVMSK - constant fold with getTargetConstantBitsFromNode not just BUILD_VECTOR
Help avoid a regression in D127115
2022-06-08 17:48:55 +01:00
Guillaume Chatelet 0788186182 [Alignment][NFC] Remove usage of MemSDNode::getAlignment
I can't remove the function just yet as it is used in the generated .inc files.
I would also like to provide a way to compare alignment with TypeSize since it came up a few times.

Differential Revision: https://reviews.llvm.org/D126910
2022-06-07 13:52:20 +00:00
Simon Pilgrim f5507978a3 [X86] getFauxShuffleMask - add VSELECT/BLENDV handling
First step towards enabling shuffle combining starting from VSELECT/BLENDV nodes - this should eventually help improve the codegen reported at Issue #54819
2022-06-07 14:46:25 +01:00
Simon Pilgrim 1b6d3bdc82 [X86] foldMaskedMergeImpl - pass SDLoc by const reference not value. 2022-06-07 12:36:30 +01:00
Simon Pilgrim 63e3035dbe [X86] LowerGC_TRANSITION - remove redundant SDLoc(). 2022-06-07 10:57:58 +01:00
Shilei Tian 0c3e6e5717 [NFC] Remove trailing whitespace 2022-06-06 18:59:13 -04:00
Eric Christopher 93cb6b9c83 Revert "[X86] combineConcatVectorOps - add support for concatenation VSELECT/BLENDV nodes"
See the original commit for a testcase.

This reverts commit ea8fb3b601.
2022-06-03 12:31:11 -07:00
Simon Pilgrim de2b543505 [X86] LowerVSETCC - merge getConstant() calls with flipped/unflipped sign masks. NFCI. 2022-06-01 15:09:48 +01:00
Sanjay Patel 3a503a4a9c [x86] fix miscompile from wrongly identified fneg
We may need to peek through a bitcast when identifying an fneg idiom
via its pool constant, but we can't allow a different-sized constant
in that match.

This is noted in issue #55758 with an example that needs fast-math,
but as the test here shows, this has potential to miscompile more
generally (no fast-math required).

Differential Revision: https://reviews.llvm.org/D126775
2022-06-01 09:56:33 -04:00
Simon Pilgrim f6dbb0b6fb [X86] Fix typo in extraction type introduced in rGed0303aa2251e4484a2b4ff7f236c9f7cdfb2092
It doesn't look like we have test coverage for this at the moment :(
2022-06-01 12:31:27 +01:00
Simon Pilgrim ea8fb3b601 [X86] combineConcatVectorOps - add support for concatenation VSELECT/BLENDV nodes
If the LHS/RHS selection operands can be cheaply concatenated back together then replace 2 x 128-bit selection nodes with 1 x 256-bit node

Addresses the regression introduced in the bug fix from rGd5af6a38082b39ae520a328e44dc29ebcb036bb2
2022-06-01 10:46:06 +01:00
Simon Pilgrim d5af6a3808 [X86] LowerMINMAX - split v4i64 types on AVX1 targets (Issue #55648)
Originally we tried to use default expansion for v4i64 types to make it easier to concatenate the results back together, but this can cause infinite loop issues with existing VSELECT splitting code in narrowExtractedVectorSelect if we have other uses of the VSELECT results (e.g. reduction patterns).

To fix the infinite loop, this patch always splits MIN/MAX v4i64 nodes during lowering and I've added a TODO for combineConcatVectorOps to investigate when we can cheaply concatenate VSELECT/BLENDV nodes together.

Fixes #55648 - regression test case will be added in a follow up.
2022-05-31 17:28:56 +01:00
Simon Pilgrim af0113cf77 [X86] combineEXTRACT_SUBVECTOR - pull out repeated getVectorNumElements() calls. NFC. 2022-05-31 16:13:54 +01:00
Simon Pilgrim b9443cb6fa [X86] narrowExtractedVectorSelect - don't peek through bitcasts to find source vector
We don't seem to need this for any test coverage and it was making tracking of the uses() of the source vector more difficult

Noticed while investigating Issue #55648
2022-05-31 14:57:18 +01:00
Simon Pilgrim ed0303aa22 [X86] LowerTRUNCATE - avoid creating extract_subvector(bitcast(vec)) patterns
We have a generic DAG combine to attempt to fold extract_subvector(bitcast(vec)) -> bitcast(extract_subvector(vec)) but if we create these patterns late in lowering then we often miss them.

Noticed while investigating Issue #55648 which gets caught in an infinite loop trying to split extract_subvector(bitcast(vselect()) patterns - this doesn't fix the issue yet but reduces the regressions from the WIP fix.
2022-05-31 14:30:56 +01:00
Zongwei Lan ad73ce318e [Target] use getSubtarget<> instead of static_cast<>(getSubtarget())
Differential Revision: https://reviews.llvm.org/D125391
2022-05-26 11:22:41 -07:00
Craig Topper 06fee478d2 [X86] Add isSimple check to the load combine in combineExtractVectorElt.
I think we need to be sure the load isn't volatile before we
duplicate and shrink it.

Reviewed By: spatel

Differential Revision: https://reviews.llvm.org/D126353
2022-05-25 09:11:11 -07:00
Jay Foad 6bec3e9303 [APInt] Remove all uses of zextOrSelf, sextOrSelf and truncOrSelf
Most clients only used these methods because they wanted to be able to
extend or truncate to the same bit width (which is a no-op). Now that
the standard zext, sext and trunc allow this, there is no reason to use
the OrSelf versions.
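
For example (a sketch against the APInt API as described above):

```
#include "llvm/ADT/APInt.h"

void example() {
  llvm::APInt X(16, 42);
  llvm::APInt Widened = X.zext(32); // genuine extension
  llvm::APInt Same = X.zext(16);    // same width: now a no-op, previously needed zextOrSelf
  (void)Widened;
  (void)Same;
}
```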

The OrSelf versions additionally have the strange behaviour of allowing
extending to a *smaller* width, or truncating to a *larger* width, which
are also treated as no-ops. A small amount of client code relied on this
(ConstantRange::castOp and MicrosoftCXXNameMangler::mangleNumber) and
needed rewriting.

Differential Revision: https://reviews.llvm.org/D125557
2022-05-19 11:23:13 +01:00
Simon Pilgrim 320545b577 [X86] Rename combineCONCAT_VECTORS\INSERT_SUBVECTOR\EXTRACT_SUBVECTOR to match Opcode name. NFCI.
It's a lot easier to quickly search for the combine when it actually contains the name of the opcode it combines.
2022-05-17 18:37:53 +01:00
Simon Pilgrim c64f5d44ad [X86] Attempt to fold EFLAGS into X86ISD::ADD/SUB ops
We already use combineAddOrSubToADCOrSBB to fold extended EFLAGS results into ISD::ADD/SUB ops as X86ISD::ADC/SBB carry ops.

This patch extends this to also try to fold EFLAGS results with X86ISD::ADD/SUB ops
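
For flavor, one C++ pattern of the kind this family of folds targets (an assumed example, not necessarily the commit's motivating case): the compare's carry can be folded into the addition as an ADC rather than materialised with SETCC first.

```
#include <cstdint>

// cmp A, B sets the carry flag when A < B (unsigned); folding that flag into
// the add lets the backend emit cmp + adc instead of cmp + setc + add.
uint64_t add_carry(uint64_t X, uint64_t A, uint64_t B) {
  return X + (A < B);
}
```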

Differential Revision: https://reviews.llvm.org/D125642
2022-05-17 10:59:24 +01:00
Simon Pilgrim b3077f563d [X86] Move combineAddOrSubToADCOrSBB earlier. NFC.
Make it easier to reuse in X86 ADD/SUB combines in an upcoming patch.
2022-05-15 22:06:33 +01:00
Simon Pilgrim fd1f0c51ef [X86] lowerShuffleAsLanePermuteAndSHUFP always succeeds, so just return the result. NFC. 2022-05-15 15:53:36 +01:00
Simon Pilgrim c0f59be358 [X86] Pull out repeated isShuffleMaskInputInPlace calls. NFC. 2022-05-15 15:35:09 +01:00
Simon Pilgrim 32162cf291 [X86] lowerV4I64Shuffle - try harder to lower to PERMQ(BLENDD(V1,V2)) pattern 2022-05-15 14:57:58 +01:00
Simon Pilgrim bc90bbb759 [X86] LowerAVG - fix cut+paste typo. NFC. 2022-05-14 17:42:09 +01:00
Simon Pilgrim 98f82d69bd [X86] LowerStore - use is64BitVector() wrapper. NFCI. 2022-05-13 15:30:18 +01:00
Matthias Braun cd19af74c0 Avoid 8 and 16bit switch conditions on x86
This adds a `TargetLoweringBase::getSwitchConditionType` callback to
give targets a chance to control the type used in
`CodeGenPrepare::optimizeSwitchInst`.

Implement callback for X86 to avoid i8 and i16 types where possible as
they often incur extra zero-extensions.

This is NFC for non-X86 targets.

Differential Revision: https://reviews.llvm.org/D124894
2022-05-10 10:00:10 -07:00
Simon Pilgrim 980f41d7c4 [X86] (style) Use auto for dyn_cast<> results 2022-05-01 17:15:18 +01:00
Simon Pilgrim d4f06ec874 [X86] (style) Don't use auto for non obvious types 2022-05-01 17:10:21 +01:00
Simon Pilgrim 92235e3bf4 [X86] lowerShuffleAsRepeatedMaskAndLanePermute - permit 32-bit sublane permute for unary v32i8 cases
Increase the likelihood that we can lower to a permd(pshufb()) pattern, but only after we've attempted with 64-bit sublane permutes first

Fixes #55066
2022-04-30 11:00:28 +01:00
Simon Pilgrim b424055b52 [X86] lowerShuffleAsRepeatedMaskAndLanePermute - move the sublane split code into a lambda helper. NFC.
This is a NFC cleanup as part of the work on #55066 - the idea being that we will be able to check for multiple sub lane scales.
2022-04-29 16:03:50 +01:00
Simon Pilgrim 3562f855b7 [X86] SimplifyDemandedVectorEltsForTargetNode - fold (uniform) shift(0,x) -> 0 2022-04-29 12:08:47 +01:00
Simon Pilgrim 336a1233b2 [X86] SimplifyDemandedVectorEltsForTargetNode - fold shift(0,x) -> 0 2022-04-29 11:32:54 +01:00
Simon Pilgrim 6c44e398ec [X86] combineShuffle - reuse SDLoc. NFCI. 2022-04-29 10:30:11 +01:00
Simon Pilgrim 2d7f0b1c22 [X86] Fold ANDNP(undef,x)/ANDNP(x,undef) -> 0
Matches the fold in DAGCombiner::visitANDLike.
2022-04-29 10:20:48 +01:00
Simon Pilgrim ab17ed0723 [X86] Don't fold AND(SRL(X,Y),1) -> SETCC(BT(X,Y)) on BMI2 targets
With BMI2 we have SHRX which is a lot quicker than regular x86 shifts.

Fixes #55138
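
The pattern in question, roughly, in C++ (the instruction choice itself is the backend's; shown only as commentary):

```
#include <cstdint>

// Extracts bit Y of X. Without BMI2 this is attractive as bt + setc; with
// BMI2, SHRX takes its shift count from any register and does not touch
// EFLAGS, so keeping the shift + and form is preferred.
bool test_bit(uint32_t X, uint32_t Y) {
  return (X >> Y) & 1;
}
```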
2022-04-28 21:28:16 +01:00
Simon Pilgrim 9e3b7e8e65 [X86] getTargetVShiftByConstNode - use SelectionDAG::FoldConstantArithmetic to perform constant folding. NFCI.
Remove some unnecessary code duplication.
2022-04-28 17:10:20 +01:00
Simon Pilgrim de7cee24b6 [X86] getBT - attempt to peek through aext(and(trunc(x),c)) mask/modulo
Ideally we'd fold this with generic DAGCombiner, but that only works for !isTruncateFree cases - we might be able to adapt IsDesirableToPromoteOp to find truncated src ops in the future, but for now just use this peephole.

Noticed in Issue #55138
2022-04-28 16:10:26 +01:00
Simon Pilgrim ed8dffef4c [X86] getFauxShuffle - don't assume an UNDEF src element for AND/ANDNP results in an UNDEF shuffle mask index
The other src element might be zero, guaranteeing a zero result.

Fixes #55157
2022-04-28 12:32:58 +01:00
Simon Pilgrim e378577524 [X86] Use is128BitLaneRepeatedShuffleMask wrapper. NFC.
We don't need to know the actual repeated mask.
2022-04-27 21:09:57 +01:00
Simon Pilgrim 03482bccad [X86] collectConcatOps - add ability to collect from vector 'widening' patterns
Recognise insert_subvector(undef, x, lo/hi) patterns where we double the width of a vector - creating an UNDEF subvector on the fly.
2022-04-27 15:38:58 +01:00
Xiang1 Zhang c430f0f532 [X86] Add use condition for combineSetCCMOVMSK
Reviewed by RKSimon, LuoYuanke

Differential Revision: https://reviews.llvm.org/D123652
2022-04-26 16:42:50 +08:00
Simon Pilgrim e8305c0b8f [X86] combineX86ShuffleChain - don't fold to truncate(concat(V1,V2)) if it was already a PACK op
Fixes #55050
2022-04-25 17:13:44 +01:00
Craig Topper c6fdb1de47 [X86] Move some hasOneUse checks after checking what the opcode is.
Calling hasOneUse can be expensive on nodes with multiple results.
Especially when some results are Chains. By checking the opcode first,
we can avoid walking the uses if it isn't an interesting node,
and thus avoid calling hasOneUse on a node that might have many uses.

Found by profiling the IR given in D123857.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D123881
2022-04-16 14:18:58 -07:00
Craig Topper 9d86bf825c [X86] Move hasOneUse check after opcode check. NFC
Checking opcode is cheap. hasOneUse might not be if the node has
multiple results. By checking the opcode we can rule out nodes
with multiple results we aren't interested in.
2022-04-15 17:20:57 -07:00
Liu, Chen3 bf60a5af0a [X86] Convert unsigned int 0 to floating point with the FILD instruction.
unsigned int 0 will be converted to float/double -0.0 when the rounding
mode is set to 'FE_DOWNWARD'. Use the FILD instruction instead of SSE
instructions on 32-bit targets if strictfp is enabled.
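
A small C++ illustration of the sign-of-zero rule behind this (IEEE-754: an exact-zero difference is -0.0 under round-downward, which is why a bias-subtraction SSE lowering - an assumption about the exact sequence - returns -0.0 for an input of 0, while FILD of the integer 0 yields +0.0):

```
#include <cfenv>
#include <cmath>
#include <cstdio>

int main() {
  std::fesetround(FE_DOWNWARD);
  volatile double Magic = 0x1p52; // the kind of bias constant a subtraction-based lowering uses
  double Zero = Magic - Magic;    // exact zero: rounds to -0.0 under FE_DOWNWARD
  std::printf("signbit=%d\n", std::signbit(Zero) ? 1 : 0); // prints 1 (build with -frounding-math)
  return 0;
}
```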

Differential Revision: https://reviews.llvm.org/D123660
2022-04-13 20:06:15 +08:00
Simon Pilgrim 0488c6638b [X86] getFauxShuffleMask - remove use DemandedElts TODO
Most of the getTargetShuffleInputs recursive calls have now gone and the remaining uses aren't likely to benefit from a DemandedElts mask
2022-04-12 15:36:30 +01:00
Simon Pilgrim 1e803d305a Revert rG88ff6f70c45f2767576c64dde28cbfe7a90916ca "[X86] Extend vselect(cond, pshufb(x), pshufb(y)) -> or(pshufb(x), pshufb(y)) to include inner or(pshufb(x), pshufb(y)) chains"
Reverting while I investigate reports of internal test regressions/failures
2022-04-11 10:42:43 +01:00
Simon Pilgrim 88ff6f70c4 [X86] Extend vselect(cond, pshufb(x), pshufb(y)) -> or(pshufb(x), pshufb(y)) to include inner or(pshufb(x), pshufb(y)) chains 2022-04-10 13:04:53 +01:00
Simon Pilgrim c74d729bd6 [X86] combineExtractSubvector - fold extract_subvector(insert_subvector(V,X,C1),C1)
extract_subvector(insert_subvector(V,X,C1),C1) -> insert_subvector(extract_subvector(V,C1),X,0)

More aggressively attempt to reduce the width of an extract_subvector source - we currently only do this if we're inserting into a zero vector (i.e. canonicalizing to the AVX implicit zero upper elts pattern).

But if we're extracting from the same point as the inner insert_subvector then the fold is still relatively trivial - we can probably do even better if we can ensure the subvector isn't badly split.
2022-04-10 11:03:08 +01:00