Phoebe Wang
02fe96b240
[X86][FP16] Do not split FP64->FP16 to FP64->FP32->FP16
...
Truncation from double to half is not always identical to truncating to float first and then to half. https://godbolt.org/z/56s9517hd
On the other hand, expanding to float and then to double is always identical to expanding to double directly. https://godbolt.org/z/Ye8vbYPnY
Reviewed By: RKSimon, skan
Differential Revision: https://reviews.llvm.org/D130151
2022-07-22 08:36:05 +08:00
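A minimal IR sketch (function names are illustrative) contrasting the two lowerings the commit above distinguishes; for doubles near a half-precision rounding boundary the split form rounds twice and can give a different result, which is why it is no longer used.
```
; Direct truncation: the double is rounded once, straight to half.
define half @trunc_direct(double %x) {
  %h = fptrunc double %x to half
  ret half %h
}

; Split truncation: rounds to float first, then to half. The extra
; intermediate rounding is what makes this form unsafe in general.
define half @trunc_via_float(double %x) {
  %f = fptrunc double %x to float
  %h = fptrunc float %f to half
  ret half %h
}
```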
Bing1 Yu
e01bf5a3e2
[X86] Promote v32f16's fadd into v32f32's fadd when it is avx512 without avx512fp16
...
Reviewed By: pengfei
Differential Revision: https://reviews.llvm.org/D130059
2022-07-19 14:37:50 +08:00
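A sketch of the input shape the commit above covers (the llc invocation is an assumption): a <32 x half> fadd on an AVX512 target without avx512fp16, which is now expected to be performed as <32 x float> and truncated back.
```
; RUN: llc < %s -mtriple=x86_64-- -mattr=+avx512bw
define <32 x half> @fadd_v32f16(<32 x half> %a, <32 x half> %b) {
  %r = fadd <32 x half> %a, %b
  ret <32 x half> %r
}
```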
Benjamin Kramer
9234a7c0df
[X86][FP16] Don't crash when lowering SELECT on fp16 vectors
...
This is a regression from f187948162
2022-07-18 13:41:00 +02:00
Phoebe Wang
f187948162
[X86][FP16] Enable vector support for FP16 emulation
...
This is a follow-up of D107082, which enables vector support according to the psABI.
Reviewed By: skan
Differential Revision: https://reviews.llvm.org/D127982
2022-07-16 09:38:58 +08:00
Simon Pilgrim
66bfd1ba8c
[X86] Move isInRange(ArrayRef<int>) inside assert to fix NDEBUG builds. NFC.
...
Fix unused static function warning introduced by D129207
2022-07-12 21:51:07 +01:00
Xiang1 Zhang
a45dd3d814
[X86] Support -mstack-protector-guard-symbol
...
Reviewed By: nickdesaulniers
Differential Revision: https://reviews.llvm.org/D129346
2022-07-12 10:17:00 +08:00
Xiang1 Zhang
643786213b
Revert "[X86] Support -mstack-protector-guard-symbol"
...
This reverts commit efbaad1c4a due to missing review info.
2022-07-12 10:14:32 +08:00
Xiang1 Zhang
efbaad1c4a
[X86] Support -mstack-protector-guard-symbol
2022-07-12 10:13:48 +08:00
Simon Pilgrim
97868fb972
[X86] isTargetShuffleEquivalent - attempt to match SM_SentinelZero shuffle mask elements using known bits
...
If the combined shuffle mask requires zero elements, we don't currently have much chance of matching them against the expected source vector. This patch uses the SelectionDAG::MaskedVectorIsZero wrapper to attempt to determine if the expected element we want to use is already known to be zero.
I've also tightened up the ExpectedMask assertion to always be in range - we're never giving it a target shuffle mask that has sentinels at all - allowing us to remove some of the confusing bounds checks.
This attempts to address some of the regressions uncovered by D129150 where we more aggressively fold shuffles as AND / 'clear' masks which results in more combined shuffles using SM_SentinelZero.
Differential Revision: https://reviews.llvm.org/D129207
2022-07-11 15:29:44 +01:00
Phoebe Wang
8fb083d33e
[X86][FP16] Add constrained FP support for scalar emulation
...
This is a follow up patch to support constrained FP in FP16 emulation.
Reviewed By: skan
Differential Revision: https://reviews.llvm.org/D128114
2022-07-08 20:33:42 +08:00
Phoebe Wang
6c535f9f1b
[X86][FP16] Fix crash when lowering copysign for f16
...
This is to address the assertion failure reported in https://reviews.llvm.org/D107082#3635612
Not sure if it is a problem of promoting FCOPYSIGN + libcall FP_ROUND.
The promotion will set the rounding mode to 1: a442c62888/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp (L4810-L4814)
while the libcall cannot handle a rounding mode equal to 1: a442c62888/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp (L4324-L4328)
So change the action to Expand to work around the problem.
Reviewed By: clementval, MaskRay
Differential Revision: https://reviews.llvm.org/D129294
2022-07-07 19:17:26 -07:00
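A minimal IR shape of the operation the fix above is about (whether this exact form reproduces the original assertion is an assumption):
```
declare half @llvm.copysign.f16(half, half)

define half @copysign_f16(half %mag, half %sign) {
  %r = call half @llvm.copysign.f16(half %mag, half %sign)
  ret half %r
}
```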
Simon Pilgrim
fbb51ac0ba
[X86] LowerShift - lower some shuffles directly to X86ISD::PSHUFLW nodes.
...
These are expected to lower to X86ISD::PSHUFLW but we were seeing some regressions in D129150 because it'd managed to exploit the masking of the shift amounts to create unintended clear masks instead.
2022-07-06 18:01:03 +01:00
Shilei Tian
1023ddaf77
[LLVM] Add the support for fmax and fmin in atomicrmw instruction
...
This patch adds support for `fmax` and `fmin` operations in the `atomicrmw`
instruction. For now (at least in this patch), the instruction will be expanded
to a CAS loop. There are already a couple of targets supporting the feature; I'll
create another patch (or patches) to enable them accordingly.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D127041
2022-07-06 10:57:53 -04:00
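In IR the new operations look like the sketch below (the ordering is just an illustrative choice); without dedicated target support they are expanded to a compare-exchange loop.
```
define float @atomic_fmax(ptr %p, float %v) {
  %old = atomicrmw fmax ptr %p, float %v seq_cst
  ret float %old
}

define float @atomic_fmin(ptr %p, float %v) {
  %old = atomicrmw fmin ptr %p, float %v seq_cst
  ret float %old
}
```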
Paul Robinson
08e4fe6c61
[X86] Add RDPRU instruction
...
Add support for the RDPRU instruction on Zen2 processors.
User-facing features:
- Clang option -m[no-]rdpru to enable/disable the feature
- Support is implicit for znver2/znver3 processors
- Preprocessor symbol __RDPRU__ to indicate support
- Header rdpruintrin.h to define intrinsics
- "rdpru" mnemonic supported for assembler code
Internal features:
- Clang builtin __builtin_ia32_rdpru
- IR intrinsic @llvm.x86.rdpru
Differential Revision: https://reviews.llvm.org/D128934
2022-07-06 07:17:47 -07:00
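A rough IR sketch of the new intrinsic from the list above; the exact signature and the register-selector encoding used here are assumptions, not taken from the patch.
```
; Assumed signature: register selector as i32, 64-bit EDX:EAX result as i64.
declare i64 @llvm.x86.rdpru(i32)

define i64 @read_counter() {
  ; Selector value 0 is assumed here purely for illustration.
  %v = call i64 @llvm.x86.rdpru(i32 0)
  ret i64 %v
}
```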
Craig Topper
2bfca35614
[X86] Disable combineVectorSizedSetCCEquality for soft float.
...
The vector types aren't legal with soft float.
Also disable under NoImplicitFloat for good measure.
Fixes PR56351.
Differential Revision: https://reviews.llvm.org/D129060
2022-07-04 08:33:30 -07:00
Simon Pilgrim
26708fa166
Revert rG057db2002bb3: [X86] combineAndnp - constant fold ANDNP(C,X) -> AND(~C,X)
...
If the LHS op has a single use then using the more general AND op is likely to allow commutation, load folding, generic folds etc.
Reverted due to reports from @alexfh about it causing an infinite loop (repro still pending).
2022-07-01 10:36:09 +01:00
Simon Pilgrim
e961e05d59
[SLP][X86] Add 32-bit vector stores to help vectorization opportunities
...
Building on the work on D124284, this patch tags v4i8 and v2i16 vector loads as custom, enabling SLP to try to vectorize these types ending in a partial store (using the SSE MOVD instruction) - we already do something similar for 64-bit vector types.
Differential Revision: https://reviews.llvm.org/D127604
2022-06-30 20:25:50 +01:00
Craig Topper
3706bdad4a
[X86] Remove unnecessary COPY from EmitLoweredCascadedSelect.
...
I believe we already checked that the destination of the first
CMOV is only used by the second CMOV so I don't think there is any
reason we need the PHI to write the register that was used by the
first CMOV. We can directly use the second CMOV destination and
avoid the copy.
This may be a leftover from when the cascaded select handling
was part of the main algorithm, before it was refactored in D35685.
Reviewed By: pengfei
Differential Revision: https://reviews.llvm.org/D128124
2022-06-28 09:33:33 -07:00
Simon Pilgrim
0b998053db
[X86] combineConcatVectorOps - IsConcatFree must check extraction index
...
Identified in the regression reported by @alexfh on rGb5d7beeb9792 - IsConcatFree wasn't ensuring the subvector extraction index matched the position it would be concatenated back into.
2022-06-27 11:46:49 +01:00
Simon Pilgrim
ac4cb1775b
[X86] fold (and (mul x, c1), c2) -> (mul x, (and c1, c2)) iff c2 is all/no bits mask
...
Noticed on D128216 - if we're zeroing out vector elements of a mul/mulh result then see if we can merge the and-mask into the mul by just multiplying by zero.
Ideally we'd make this generic (similar to the existing foldSelectWithIdentityConstant?), but these cases are appearing very late, after the constants have been lowered to constant-pool loads.
2022-06-21 15:10:43 +01:00
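A worked instance of the fold above (constants chosen for illustration): every lane of the AND mask is either all ones or all zeros, so the mask can be folded into the multiply constant.
```
define <4 x i32> @and_of_mul(<4 x i32> %x) {
  %m = mul <4 x i32> %x, <i32 7, i32 7, i32 7, i32 7>
  %r = and <4 x i32> %m, <i32 -1, i32 0, i32 -1, i32 0>
  ; folds to: mul %x, <i32 7, i32 0, i32 7, i32 0>
  ret <4 x i32> %r
}
```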
Simon Pilgrim
057db2002b
[X86] combineAndnp - constant fold ANDNP(C,X) -> AND(~C,X)
...
If the LHS op has a single use then using the more general AND op is likely to allow commutation, load folding, generic folds etc.
2022-06-21 12:31:01 +01:00
Simon Pilgrim
843d43e62a
[X86] computeKnownBitsForTargetNode - add X86ISD::VBROADCAST_LOAD handling
...
This requires us to override the isTargetCanonicalConstantNode callback introduced in D128144, so we can recognise the various cases where a VBROADCAST_LOAD constant is being reused at different vector widths to prevent infinite loops.
2022-06-21 11:48:01 +01:00
Simon Pilgrim
8254966062
[X86] LowerINSERT_VECTOR_ELT - always lower v32i8/v16i16 allones insertions on AVX1 as OR ops
...
v32i8/v16i16 blend shuffles on AVX1 will expand to OR(AND,ANDN) patterns which can be easily broken by other combines
2022-06-20 18:43:03 +01:00
Simon Pilgrim
e4a124dda5
[DAG] Fold (srl (shl x, c1), c2) -> and(shl/srl(x, c3), m)
...
Similar to the existing (shl (srl x, c1), c2) fold
Part of the work to fix the regressions in D77804
Differential Revision: https://reviews.llvm.org/D125836
2022-06-20 08:37:38 +01:00
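A worked instance of the fold above with c1 = 8 and c2 = 4 on i32 (values chosen for illustration):
```
define i32 @srl_of_shl(i32 %x) {
  %s = shl i32 %x, 8
  %r = lshr i32 %s, 4
  ; folds to: and (shl %x, 4), 0x0FFFFFF0, i.e.
  ;   %t = shl i32 %x, 4
  ;   %r = and i32 %t, 268435440
  ret i32 %r
}
```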
Simon Pilgrim
ba3f2667b6
[DAG] Add MaskedVectorIsZero helper
...
Equivalent to MaskedValueIsZero, except it checks whether all of the demanded vector elements are known to be zero
2022-06-19 17:56:30 +01:00
Simon Pilgrim
41455dd1dc
[X86] Remove isTargetShuffleSplat and just use SelectionDAG::isSplatValue
...
shuffle(splat(x)) -> splat(x), it doesn't have to be a target specific broadcast
2022-06-19 11:22:57 +01:00
Simon Pilgrim
ac3f967382
[X86] canonicalizeShuffleWithBinOps - merge shuffles across binops if either source op is a known splat
...
The shuffle of a splat (with no undefs) should always be removed
2022-06-18 17:14:00 +01:00
Simon Pilgrim
f42f2b7005
[X86] canonicalizeShuffleWithBinOps - merge unary shuffles across binops if either source op is a foldable load
...
This mostly handles folding of constants that have already become loads, but we expose some generic load cases as well.
This also exposes the chance to merge unary shuffles across X86ISD::ANDNP nodes with different scalar widths
2022-06-18 15:58:54 +01:00
Simon Pilgrim
3c9123af9f
[X86] isShuffleFoldableLoad - ensure the load has one use.
...
We'll only fold the load if it has one use. Makes no difference to existing tests but will be necessary for an upcoming patch to improve load folding as part of canonicalizeShuffleWithBinOps.
2022-06-18 14:51:55 +01:00
Phoebe Wang
655ba9c8a1
Reland "Reland "Reland "Reland "[X86][RFC] Enable `_Float16` type support on X86 following the psABI""""
...
This resolves problems reported in commit 1a20252978.
1. Promote to float lowering for XINT_TO_FP nodes
2. Bail f16 out of shuffle combining since the vector type is not legal in this version
2022-06-17 21:34:05 +08:00
Benjamin Kramer
1a20252978
Revert "Reland "Reland "Reland "[X86][RFC] Enable `_Float16` type support on X86 following the psABI""""
...
This reverts commit 04a3d5f3a1.
I see two more issues:
- uitofp/sitofp from i32/i64 to half now generates
__floatsihf/__floatdihf, which exist in neither compiler-rt nor
libgcc
- This crashes when legalizing the bitcast:
```
; RUN: llc < %s -mcpu=skx
define void @main.45(ptr nocapture readnone %retval, ptr noalias nocapture readnone %run_options, ptr noalias nocapture readnone %params, ptr noalias nocapture readonly %buffer_table, ptr noalias nocapture readnone %status, ptr noalias nocapture readnone %prof_counters) local_unnamed_addr {
entry:
%fusion = load ptr, ptr %buffer_table, align 8
%0 = getelementptr inbounds ptr, ptr %buffer_table, i64 1
%Arg_1.2 = load ptr, ptr %0, align 8
%1 = getelementptr inbounds ptr, ptr %buffer_table, i64 2
%Arg_0.1 = load ptr, ptr %1, align 8
%2 = load half, ptr %Arg_0.1, align 8
%3 = bitcast half %2 to i16
%4 = and i16 %3, 32767
%5 = icmp eq i16 %4, 0
%6 = and i16 %3, -32768
%broadcast.splatinsert = insertelement <4 x half> poison, half %2, i64 0
%broadcast.splat = shufflevector <4 x half> %broadcast.splatinsert, <4 x half> poison, <4 x i32> zeroinitializer
%broadcast.splatinsert9 = insertelement <4 x i16> poison, i16 %4, i64 0
%broadcast.splat10 = shufflevector <4 x i16> %broadcast.splatinsert9, <4 x i16> poison, <4 x i32> zeroinitializer
%broadcast.splatinsert11 = insertelement <4 x i16> poison, i16 %6, i64 0
%broadcast.splat12 = shufflevector <4 x i16> %broadcast.splatinsert11, <4 x i16> poison, <4 x i32> zeroinitializer
%broadcast.splatinsert13 = insertelement <4 x i16> poison, i16 %3, i64 0
%broadcast.splat14 = shufflevector <4 x i16> %broadcast.splatinsert13, <4 x i16> poison, <4 x i32> zeroinitializer
%wide.load = load <4 x half>, ptr %Arg_1.2, align 8
%7 = fcmp uno <4 x half> %broadcast.splat, %wide.load
%8 = fcmp oeq <4 x half> %broadcast.splat, %wide.load
%9 = bitcast <4 x half> %wide.load to <4 x i16>
%10 = and <4 x i16> %9, <i16 32767, i16 32767, i16 32767, i16 32767>
%11 = icmp eq <4 x i16> %10, zeroinitializer
%12 = and <4 x i16> %9, <i16 -32768, i16 -32768, i16 -32768, i16 -32768>
%13 = or <4 x i16> %12, <i16 1, i16 1, i16 1, i16 1>
%14 = select <4 x i1> %11, <4 x i16> %9, <4 x i16> %13
%15 = icmp ugt <4 x i16> %broadcast.splat10, %10
%16 = icmp ne <4 x i16> %broadcast.splat12, %12
%17 = or <4 x i1> %15, %16
%18 = select <4 x i1> %17, <4 x i16> <i16 -1, i16 -1, i16 -1, i16 -1>, <4 x i16> <i16 1, i16 1, i16 1, i16 1>
%19 = add <4 x i16> %18, %broadcast.splat14
%20 = select i1 %5, <4 x i16> %14, <4 x i16> %19
%21 = select <4 x i1> %8, <4 x i16> %9, <4 x i16> %20
%22 = bitcast <4 x i16> %21 to <4 x half>
%23 = select <4 x i1> %7, <4 x half> <half 0xH7E00, half 0xH7E00, half 0xH7E00, half 0xH7E00>, <4 x half> %22
store <4 x half> %23, ptr %fusion, align 16
ret void
}
```
llc: llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp:977: void (anonymous namespace)::SelectionDAGLegalize::LegalizeOp(llvm::SDNode *): Assertion `(TLI.getTypeAction(*DAG.getContext(), Op.getValueType()) == TargetLowering::TypeLegal || Op.getOpcode() == ISD::TargetConstant || Op.getOpcode() == ISD::Register) && "Unexpected illegal type!"' failed.
2022-06-17 09:43:07 +02:00
Phoebe Wang
04a3d5f3a1
Reland "Reland "Reland "[X86][RFC] Enable `_Float16` type support on X86 following the psABI"""
...
Fix the crash on lowering X86ISD::FCMP.
2022-06-17 12:12:17 +08:00
Paul Robinson
ff0122dcce
[PS5] Emit ud2 for ubsan trap
2022-06-16 11:20:10 -07:00
Frederik Gossen
3cd5696a33
Revert "Reland "Reland "[X86][RFC] Enable `_Float16` type support on X86 following the psABI"""
...
This reverts commit e1c5afa47d.
This introduces crashes in the JAX backend on CPU. A reproducer in LLVM is
below. Let me know if you have trouble reproducing this.
; ModuleID = '__compute_module'
source_filename = "__compute_module"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-grtev4-linux-gnu"
@0 = private unnamed_addr constant [4 x i8] c"\00\00\00?"
@1 = private unnamed_addr constant [4 x i8] c"\1C}\908"
@2 = private unnamed_addr constant [4 x i8] c"?\00\\4"
@3 = private unnamed_addr constant [4 x i8] c"%ci1"
@4 = private unnamed_addr constant [4 x i8] zeroinitializer
@5 = private unnamed_addr constant [4 x i8] c"\00\00\00\C0"
@6 = private unnamed_addr constant [4 x i8] c"\00\00\00B"
@7 = private unnamed_addr constant [4 x i8] c"\94\B4\C22"
@8 = private unnamed_addr constant [4 x i8] c"^\09B6"
@9 = private unnamed_addr constant [4 x i8] c"\15\F3M?"
@10 = private unnamed_addr constant [4 x i8] c"e\CC\\;"
@11 = private unnamed_addr constant [4 x i8] c"d\BD/>"
@12 = private unnamed_addr constant [4 x i8] c"V\F4I="
@13 = private unnamed_addr constant [4 x i8] c"\10\CB,<"
@14 = private unnamed_addr constant [4 x i8] c"\AC\E3\D6:"
@15 = private unnamed_addr constant [4 x i8] c"\DC\A8E9"
@16 = private unnamed_addr constant [4 x i8] c"\C6\FA\897"
@17 = private unnamed_addr constant [4 x i8] c"%\F9\955"
@18 = private unnamed_addr constant [4 x i8] c"\B5\DB\813"
@19 = private unnamed_addr constant [4 x i8] c"\B4W_\B2"
@20 = private unnamed_addr constant [4 x i8] c"\1Cc\8F\B4"
@21 = private unnamed_addr constant [4 x i8] c"~3\94\B6"
@22 = private unnamed_addr constant [4 x i8] c"3Yq\B8"
@23 = private unnamed_addr constant [4 x i8] c"\E9\17\17\BA"
@24 = private unnamed_addr constant [4 x i8] c"\F1\B2\8D\BB"
@25 = private unnamed_addr constant [4 x i8] c"\F8t\C2\BC"
@26 = private unnamed_addr constant [4 x i8] c"\82[\C2\BD"
@27 = private unnamed_addr constant [4 x i8] c"uB-?"
@28 = private unnamed_addr constant [4 x i8] c"^\FF\9B\BE"
@29 = private unnamed_addr constant [4 x i8] c"\00\00\00A"
; Function Attrs: uwtable
define void @main.158(ptr %retval, ptr noalias %run_options, ptr noalias %params, ptr noalias %buffer_table, ptr noalias %status, ptr noalias %prof_counters) #0 {
entry:
%fusion.invar_address.dim.1 = alloca i64, align 8
%fusion.invar_address.dim.0 = alloca i64, align 8
%0 = getelementptr inbounds ptr, ptr %buffer_table, i64 1
%Arg_0.1 = load ptr, ptr %0, align 8, !invariant.load !0 , !dereferenceable !1 , !align !2
%1 = getelementptr inbounds ptr, ptr %buffer_table, i64 0
%fusion = load ptr, ptr %1, align 8, !invariant.load !0 , !dereferenceable !1 , !align !2
store i64 0, ptr %fusion.invar_address.dim.0, align 8
br label %fusion.loop_header.dim.0
return: ; preds = %fusion.loop_exit.dim.0
ret void
fusion.loop_header.dim.0: ; preds = %fusion.loop_exit.dim.1, %entry
%fusion.indvar.dim.0 = load i64, ptr %fusion.invar_address.dim.0, align 8
%2 = icmp uge i64 %fusion.indvar.dim.0, 3
br i1 %2, label %fusion.loop_exit.dim.0, label %fusion.loop_body.dim.0
fusion.loop_body.dim.0: ; preds = %fusion.loop_header.dim.0
store i64 0, ptr %fusion.invar_address.dim.1, align 8
br label %fusion.loop_header.dim.1
fusion.loop_header.dim.1: ; preds = %fusion.loop_body.dim.1, %fusion.loop_body.dim.0
%fusion.indvar.dim.1 = load i64, ptr %fusion.invar_address.dim.1, align 8
%3 = icmp uge i64 %fusion.indvar.dim.1, 1
br i1 %3, label %fusion.loop_exit.dim.1, label %fusion.loop_body.dim.1
fusion.loop_body.dim.1: ; preds = %fusion.loop_header.dim.1
%4 = getelementptr inbounds [3 x [1 x half]], ptr %Arg_0.1, i64 0, i64 %fusion.indvar.dim.0, i64 0
%5 = load half, ptr %4, align 2, !invariant.load !0 , !noalias !3
%6 = fpext half %5 to float
%7 = call float @llvm.fabs.f32(float %6)
%constant.121 = load float, ptr @29, align 4
%compare.2 = fcmp ole float %7, %constant.121
%8 = zext i1 %compare.2 to i8
%constant.120 = load float, ptr @0, align 4
%multiply.95 = fmul float %7, %constant.120
%constant.119 = load float, ptr @5, align 4
%add.82 = fadd float %multiply.95, %constant.119
%constant.118 = load float, ptr @4, align 4
%multiply.94 = fmul float %add.82, %constant.118
%constant.117 = load float, ptr @19, align 4
%add.81 = fadd float %multiply.94, %constant.117
%multiply.92 = fmul float %add.82, %add.81
%constant.116 = load float, ptr @18, align 4
%add.79 = fadd float %multiply.92, %constant.116
%multiply.91 = fmul float %add.82, %add.79
%subtract.87 = fsub float %multiply.91, %add.81
%constant.115 = load float, ptr @20, align 4
%add.78 = fadd float %subtract.87, %constant.115
%multiply.89 = fmul float %add.82, %add.78
%subtract.86 = fsub float %multiply.89, %add.79
%constant.114 = load float, ptr @17, align 4
%add.76 = fadd float %subtract.86, %constant.114
%multiply.88 = fmul float %add.82, %add.76
%subtract.84 = fsub float %multiply.88, %add.78
%constant.113 = load float, ptr @21, align 4
%add.75 = fadd float %subtract.84, %constant.113
%multiply.86 = fmul float %add.82, %add.75
%subtract.83 = fsub float %multiply.86, %add.76
%constant.112 = load float, ptr @16, align 4
%add.73 = fadd float %subtract.83, %constant.112
%multiply.85 = fmul float %add.82, %add.73
%subtract.81 = fsub float %multiply.85, %add.75
%constant.111 = load float, ptr @22, align 4
%add.72 = fadd float %subtract.81, %constant.111
%multiply.83 = fmul float %add.82, %add.72
%subtract.80 = fsub float %multiply.83, %add.73
%constant.110 = load float, ptr @15, align 4
%add.70 = fadd float %subtract.80, %constant.110
%multiply.82 = fmul float %add.82, %add.70
%subtract.78 = fsub float %multiply.82, %add.72
%constant.109 = load float, ptr @23, align 4
%add.69 = fadd float %subtract.78, %constant.109
%multiply.80 = fmul float %add.82, %add.69
%subtract.77 = fsub float %multiply.80, %add.70
%constant.108 = load float, ptr @14, align 4
%add.68 = fadd float %subtract.77, %constant.108
%multiply.79 = fmul float %add.82, %add.68
%subtract.75 = fsub float %multiply.79, %add.69
%constant.107 = load float, ptr @24, align 4
%add.67 = fadd float %subtract.75, %constant.107
%multiply.77 = fmul float %add.82, %add.67
%subtract.74 = fsub float %multiply.77, %add.68
%constant.106 = load float, ptr @13, align 4
%add.66 = fadd float %subtract.74, %constant.106
%multiply.76 = fmul float %add.82, %add.66
%subtract.72 = fsub float %multiply.76, %add.67
%constant.105 = load float, ptr @25, align 4
%add.65 = fadd float %subtract.72, %constant.105
%multiply.74 = fmul float %add.82, %add.65
%subtract.71 = fsub float %multiply.74, %add.66
%constant.104 = load float, ptr @12, align 4
%add.64 = fadd float %subtract.71, %constant.104
%multiply.73 = fmul float %add.82, %add.64
%subtract.69 = fsub float %multiply.73, %add.65
%constant.103 = load float, ptr @26, align 4
%add.63 = fadd float %subtract.69, %constant.103
%multiply.71 = fmul float %add.82, %add.63
%subtract.67 = fsub float %multiply.71, %add.64
%constant.102 = load float, ptr @11, align 4
%add.62 = fadd float %subtract.67, %constant.102
%multiply.70 = fmul float %add.82, %add.62
%subtract.66 = fsub float %multiply.70, %add.63
%constant.101 = load float, ptr @28, align 4
%add.61 = fadd float %subtract.66, %constant.101
%multiply.68 = fmul float %add.82, %add.61
%subtract.65 = fsub float %multiply.68, %add.62
%constant.100 = load float, ptr @27, align 4
%add.60 = fadd float %subtract.65, %constant.100
%subtract.64 = fsub float %add.60, %add.62
%multiply.66 = fmul float %subtract.64, %constant.120
%constant.99 = load float, ptr @6, align 4
%divide.4 = fdiv float %constant.99, %7
%add.59 = fadd float %divide.4, %constant.119
%multiply.65 = fmul float %add.59, %constant.118
%constant.98 = load float, ptr @3, align 4
%add.58 = fadd float %multiply.65, %constant.98
%multiply.64 = fmul float %add.59, %add.58
%constant.97 = load float, ptr @7, align 4
%add.57 = fadd float %multiply.64, %constant.97
%multiply.63 = fmul float %add.59, %add.57
%subtract.63 = fsub float %multiply.63, %add.58
%constant.96 = load float, ptr @2, align 4
%add.56 = fadd float %subtract.63, %constant.96
%multiply.62 = fmul float %add.59, %add.56
%subtract.62 = fsub float %multiply.62, %add.57
%constant.95 = load float, ptr @8, align 4
%add.55 = fadd float %subtract.62, %constant.95
%multiply.61 = fmul float %add.59, %add.55
%subtract.61 = fsub float %multiply.61, %add.56
%constant.94 = load float, ptr @1, align 4
%add.54 = fadd float %subtract.61, %constant.94
%multiply.60 = fmul float %add.59, %add.54
%subtract.60 = fsub float %multiply.60, %add.55
%constant.93 = load float, ptr @10, align 4
%add.53 = fadd float %subtract.60, %constant.93
%multiply.59 = fmul float %add.59, %add.53
%subtract.59 = fsub float %multiply.59, %add.54
%constant.92 = load float, ptr @9, align 4
%add.52 = fadd float %subtract.59, %constant.92
%subtract.58 = fsub float %add.52, %add.54
%multiply.58 = fmul float %subtract.58, %constant.120
%9 = call float @llvm.sqrt.f32(float %7)
%10 = fdiv float 1.000000e+00, %9
%multiply.57 = fmul float %multiply.58, %10
%11 = trunc i8 %8 to i1
%12 = select i1 %11, float %multiply.66, float %multiply.57
%13 = fptrunc float %12 to half
%14 = getelementptr inbounds [3 x [1 x half]], ptr %fusion, i64 0, i64 %fusion.indvar.dim.0, i64 0
store half %13, ptr %14, align 2, !alias.scope !3
%invar.inc1 = add nuw nsw i64 %fusion.indvar.dim.1, 1
store i64 %invar.inc1, ptr %fusion.invar_address.dim.1, align 8
br label %fusion.loop_header.dim.1
fusion.loop_exit.dim.1: ; preds = %fusion.loop_header.dim.1
%invar.inc = add nuw nsw i64 %fusion.indvar.dim.0, 1
store i64 %invar.inc, ptr %fusion.invar_address.dim.0, align 8
br label %fusion.loop_header.dim.0
fusion.loop_exit.dim.0: ; preds = %fusion.loop_header.dim.0
br label %return
}
; Function Attrs: nocallback nofree nosync nounwind readnone speculatable willreturn
declare float @llvm.fabs.f32(float %0) #1
; Function Attrs: nocallback nofree nosync nounwind readnone speculatable willreturn
declare float @llvm.sqrt.f32(float %0) #1
attributes #0 = { uwtable "denormal-fp-math"="preserve-sign" "no-frame-pointer-elim"="false" }
attributes #1 = { nocallback nofree nosync nounwind readnone speculatable willreturn }
!0 = !{}
!1 = !{i64 6}
!2 = !{i64 8}
!3 = !{!4}
!4 = !{!"buffer: {index:0, offset:0, size:6}", !5}
!5 = !{!"XLA global AA domain"}
2022-06-15 18:04:42 -04:00
Phoebe Wang
e1c5afa47d
Reland "Reland "[X86][RFC] Enable `_Float16` type support on X86 following the psABI""
...
Fixed the missing SQRT promotion. Adding several missing operations too.
2022-06-15 23:00:18 +08:00
Thomas Joerg
37455b1f71
Revert "Reland "[X86][RFC] Enable `_Float16` type support on X86 following the psABI""
...
This reverts commit 6e02e27536.
This introduces a crash in the backend. Reproducer in MLIR's LLVM
dialect follows. Let me know if you have trouble reproducing this.
module {
llvm.func @malloc(i64) -> !llvm.ptr<i8>
llvm.func @_mlir_ciface_tf_report_error(!llvm.ptr<i8>, i32, !llvm.ptr<i8>)
llvm.mlir.global internal constant @error_message_2208944672953921889("failed to allocate memory at loc(\22-\22:3:8)\00")
llvm.func @_mlir_ciface_tf_alloc(!llvm.ptr<i8>, i64, i64, i32, i32, !llvm.ptr<i32>) -> !llvm.ptr<i8>
llvm.func @Rsqrt_CPU_DT_HALF_DT_HALF(%arg0: !llvm.ptr<i8>, %arg1: i64, %arg2: !llvm.ptr<i8>) -> !llvm.struct<(i64, ptr<i8>)> attributes {llvm.emit_c_interface, tf_entry} {
%0 = llvm.mlir.constant(8 : i32) : i32
%1 = llvm.mlir.constant(8 : index) : i64
%2 = llvm.mlir.constant(2 : index) : i64
%3 = llvm.mlir.constant(dense<0.000000e+00> : vector<4xf16>) : vector<4xf16>
%4 = llvm.mlir.constant(dense<[0, 1, 2, 3]> : vector<4xi32>) : vector<4xi32>
%5 = llvm.mlir.constant(dense<1.000000e+00> : vector<4xf16>) : vector<4xf16>
%6 = llvm.mlir.constant(false) : i1
%7 = llvm.mlir.constant(1 : i32) : i32
%8 = llvm.mlir.constant(0 : i32) : i32
%9 = llvm.mlir.constant(4 : index) : i64
%10 = llvm.mlir.constant(0 : index) : i64
%11 = llvm.mlir.constant(1 : index) : i64
%12 = llvm.mlir.constant(-1 : index) : i64
%13 = llvm.mlir.null : !llvm.ptr<f16>
%14 = llvm.getelementptr %13[%9] : (!llvm.ptr<f16>, i64) -> !llvm.ptr<f16>
%15 = llvm.ptrtoint %14 : !llvm.ptr<f16> to i64
%16 = llvm.alloca %15 x f16 {alignment = 32 : i64} : (i64) -> !llvm.ptr<f16>
%17 = llvm.alloca %15 x f16 {alignment = 32 : i64} : (i64) -> !llvm.ptr<f16>
%18 = llvm.mlir.null : !llvm.ptr<i64>
%19 = llvm.getelementptr %18[%arg1] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
%20 = llvm.ptrtoint %19 : !llvm.ptr<i64> to i64
%21 = llvm.alloca %20 x i64 : (i64) -> !llvm.ptr<i64>
llvm.br ^bb1(%10 : i64)
^bb1(%22: i64): // 2 preds: ^bb0, ^bb2
%23 = llvm.icmp "slt" %22, %arg1 : i64
llvm.cond_br %23, ^bb2, ^bb3
^bb2: // pred: ^bb1
%24 = llvm.bitcast %arg2 : !llvm.ptr<i8> to !llvm.ptr<struct<(ptr<f16>, ptr<f16>, i64)>>
%25 = llvm.getelementptr %24[%10, 2] : (!llvm.ptr<struct<(ptr<f16>, ptr<f16>, i64)>>, i64) -> !llvm.ptr<i64>
%26 = llvm.add %22, %11 : i64
%27 = llvm.getelementptr %25[%26] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
%28 = llvm.load %27 : !llvm.ptr<i64>
%29 = llvm.getelementptr %21[%22] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
llvm.store %28, %29 : !llvm.ptr<i64>
llvm.br ^bb1(%26 : i64)
^bb3: // pred: ^bb1
llvm.br ^bb4(%10, %11 : i64, i64)
^bb4(%30: i64, %31: i64): // 2 preds: ^bb3, ^bb5
%32 = llvm.icmp "slt" %30, %arg1 : i64
llvm.cond_br %32, ^bb5, ^bb6
^bb5: // pred: ^bb4
%33 = llvm.bitcast %arg2 : !llvm.ptr<i8> to !llvm.ptr<struct<(ptr<f16>, ptr<f16>, i64)>>
%34 = llvm.getelementptr %33[%10, 2] : (!llvm.ptr<struct<(ptr<f16>, ptr<f16>, i64)>>, i64) -> !llvm.ptr<i64>
%35 = llvm.add %30, %11 : i64
%36 = llvm.getelementptr %34[%35] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
%37 = llvm.load %36 : !llvm.ptr<i64>
%38 = llvm.mul %37, %31 : i64
llvm.br ^bb4(%35, %38 : i64, i64)
^bb6: // pred: ^bb4
%39 = llvm.bitcast %arg2 : !llvm.ptr<i8> to !llvm.ptr<ptr<f16>>
%40 = llvm.getelementptr %39[%11] : (!llvm.ptr<ptr<f16>>, i64) -> !llvm.ptr<ptr<f16>>
%41 = llvm.load %40 : !llvm.ptr<ptr<f16>>
%42 = llvm.getelementptr %13[%11] : (!llvm.ptr<f16>, i64) -> !llvm.ptr<f16>
%43 = llvm.ptrtoint %42 : !llvm.ptr<f16> to i64
%44 = llvm.alloca %7 x i32 : (i32) -> !llvm.ptr<i32>
llvm.store %8, %44 : !llvm.ptr<i32>
%45 = llvm.call @_mlir_ciface_tf_alloc(%arg0, %31, %43, %8, %7, %44) : (!llvm.ptr<i8>, i64, i64, i32, i32, !llvm.ptr<i32>) -> !llvm.ptr<i8>
%46 = llvm.bitcast %45 : !llvm.ptr<i8> to !llvm.ptr<f16>
%47 = llvm.icmp "eq" %31, %10 : i64
%48 = llvm.or %6, %47 : i1
%49 = llvm.mlir.null : !llvm.ptr<i8>
%50 = llvm.icmp "ne" %45, %49 : !llvm.ptr<i8>
%51 = llvm.or %50, %48 : i1
llvm.cond_br %51, ^bb7, ^bb13
^bb7: // pred: ^bb6
%52 = llvm.urem %31, %9 : i64
%53 = llvm.sub %31, %52 : i64
llvm.br ^bb8(%10 : i64)
^bb8(%54: i64): // 2 preds: ^bb7, ^bb9
%55 = llvm.icmp "slt" %54, %53 : i64
llvm.cond_br %55, ^bb9, ^bb10
^bb9: // pred: ^bb8
%56 = llvm.mul %54, %11 : i64
%57 = llvm.add %56, %10 : i64
%58 = llvm.add %57, %10 : i64
%59 = llvm.getelementptr %41[%58] : (!llvm.ptr<f16>, i64) -> !llvm.ptr<f16>
%60 = llvm.bitcast %59 : !llvm.ptr<f16> to !llvm.ptr<vector<4xf16>>
%61 = llvm.load %60 {alignment = 2 : i64} : !llvm.ptr<vector<4xf16>>
%62 = "llvm.intr.sqrt"(%61) : (vector<4xf16>) -> vector<4xf16>
%63 = llvm.fdiv %5, %62 : vector<4xf16>
%64 = llvm.getelementptr %46[%58] : (!llvm.ptr<f16>, i64) -> !llvm.ptr<f16>
%65 = llvm.bitcast %64 : !llvm.ptr<f16> to !llvm.ptr<vector<4xf16>>
llvm.store %63, %65 {alignment = 2 : i64} : !llvm.ptr<vector<4xf16>>
%66 = llvm.add %54, %9 : i64
llvm.br ^bb8(%66 : i64)
^bb10: // pred: ^bb8
%67 = llvm.icmp "ult" %53, %31 : i64
llvm.cond_br %67, ^bb11, ^bb12
^bb11: // pred: ^bb10
%68 = llvm.mul %53, %12 : i64
%69 = llvm.add %31, %68 : i64
%70 = llvm.mul %53, %11 : i64
%71 = llvm.add %70, %10 : i64
%72 = llvm.trunc %69 : i64 to i32
%73 = llvm.mlir.undef : vector<4xi32>
%74 = llvm.insertelement %72, %73[%8 : i32] : vector<4xi32>
%75 = llvm.shufflevector %74, %73 [0 : i32, 0 : i32, 0 : i32, 0 : i32] : vector<4xi32>, vector<4xi32>
%76 = llvm.icmp "slt" %4, %75 : vector<4xi32>
%77 = llvm.add %71, %10 : i64
%78 = llvm.getelementptr %41[%77] : (!llvm.ptr<f16>, i64) -> !llvm.ptr<f16>
%79 = llvm.bitcast %78 : !llvm.ptr<f16> to !llvm.ptr<vector<4xf16>>
%80 = llvm.intr.masked.load %79, %76, %3 {alignment = 2 : i32} : (!llvm.ptr<vector<4xf16>>, vector<4xi1>, vector<4xf16>) -> vector<4xf16>
%81 = llvm.bitcast %16 : !llvm.ptr<f16> to !llvm.ptr<vector<4xf16>>
llvm.store %80, %81 : !llvm.ptr<vector<4xf16>>
%82 = llvm.load %81 {alignment = 2 : i64} : !llvm.ptr<vector<4xf16>>
%83 = "llvm.intr.sqrt"(%82) : (vector<4xf16>) -> vector<4xf16>
%84 = llvm.fdiv %5, %83 : vector<4xf16>
%85 = llvm.bitcast %17 : !llvm.ptr<f16> to !llvm.ptr<vector<4xf16>>
llvm.store %84, %85 {alignment = 2 : i64} : !llvm.ptr<vector<4xf16>>
%86 = llvm.load %85 : !llvm.ptr<vector<4xf16>>
%87 = llvm.getelementptr %46[%77] : (!llvm.ptr<f16>, i64) -> !llvm.ptr<f16>
%88 = llvm.bitcast %87 : !llvm.ptr<f16> to !llvm.ptr<vector<4xf16>>
llvm.intr.masked.store %86, %88, %76 {alignment = 2 : i32} : vector<4xf16>, vector<4xi1> into !llvm.ptr<vector<4xf16>>
llvm.br ^bb12
^bb12: // 2 preds: ^bb10, ^bb11
%89 = llvm.mul %2, %1 : i64
%90 = llvm.mul %arg1, %2 : i64
%91 = llvm.add %90, %11 : i64
%92 = llvm.mul %91, %1 : i64
%93 = llvm.add %89, %92 : i64
%94 = llvm.alloca %93 x i8 : (i64) -> !llvm.ptr<i8>
%95 = llvm.bitcast %94 : !llvm.ptr<i8> to !llvm.ptr<ptr<f16>>
llvm.store %46, %95 : !llvm.ptr<ptr<f16>>
%96 = llvm.getelementptr %95[%11] : (!llvm.ptr<ptr<f16>>, i64) -> !llvm.ptr<ptr<f16>>
llvm.store %46, %96 : !llvm.ptr<ptr<f16>>
%97 = llvm.getelementptr %95[%2] : (!llvm.ptr<ptr<f16>>, i64) -> !llvm.ptr<ptr<f16>>
%98 = llvm.bitcast %97 : !llvm.ptr<ptr<f16>> to !llvm.ptr<i64>
llvm.store %10, %98 : !llvm.ptr<i64>
%99 = llvm.bitcast %94 : !llvm.ptr<i8> to !llvm.ptr<struct<(ptr<f16>, ptr<f16>, i64, i64)>>
%100 = llvm.getelementptr %99[%10, 3] : (!llvm.ptr<struct<(ptr<f16>, ptr<f16>, i64, i64)>>, i64) -> !llvm.ptr<i64>
%101 = llvm.getelementptr %100[%arg1] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
%102 = llvm.sub %arg1, %11 : i64
llvm.br ^bb14(%102, %11 : i64, i64)
^bb13: // pred: ^bb6
%103 = llvm.mlir.addressof @error_message_2208944672953921889 : !llvm.ptr<array<42 x i8>>
%104 = llvm.getelementptr %103[%10, %10] : (!llvm.ptr<array<42 x i8>>, i64, i64) -> !llvm.ptr<i8>
llvm.call @_mlir_ciface_tf_report_error(%arg0, %0, %104) : (!llvm.ptr<i8>, i32, !llvm.ptr<i8>) -> ()
%105 = llvm.mul %2, %1 : i64
%106 = llvm.mul %2, %10 : i64
%107 = llvm.add %106, %11 : i64
%108 = llvm.mul %107, %1 : i64
%109 = llvm.add %105, %108 : i64
%110 = llvm.alloca %109 x i8 : (i64) -> !llvm.ptr<i8>
%111 = llvm.bitcast %110 : !llvm.ptr<i8> to !llvm.ptr<ptr<f16>>
llvm.store %13, %111 : !llvm.ptr<ptr<f16>>
%112 = llvm.getelementptr %111[%11] : (!llvm.ptr<ptr<f16>>, i64) -> !llvm.ptr<ptr<f16>>
llvm.store %13, %112 : !llvm.ptr<ptr<f16>>
%113 = llvm.getelementptr %111[%2] : (!llvm.ptr<ptr<f16>>, i64) -> !llvm.ptr<ptr<f16>>
%114 = llvm.bitcast %113 : !llvm.ptr<ptr<f16>> to !llvm.ptr<i64>
llvm.store %10, %114 : !llvm.ptr<i64>
%115 = llvm.call @malloc(%109) : (i64) -> !llvm.ptr<i8>
"llvm.intr.memcpy"(%115, %110, %109, %6) : (!llvm.ptr<i8>, !llvm.ptr<i8>, i64, i1) -> ()
%116 = llvm.mlir.undef : !llvm.struct<(i64, ptr<i8>)>
%117 = llvm.insertvalue %10, %116[0] : !llvm.struct<(i64, ptr<i8>)>
%118 = llvm.insertvalue %115, %117[1] : !llvm.struct<(i64, ptr<i8>)>
llvm.return %118 : !llvm.struct<(i64, ptr<i8>)>
^bb14(%119: i64, %120: i64): // 2 preds: ^bb12, ^bb15
%121 = llvm.icmp "sge" %119, %10 : i64
llvm.cond_br %121, ^bb15, ^bb16
^bb15: // pred: ^bb14
%122 = llvm.getelementptr %21[%119] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
%123 = llvm.load %122 : !llvm.ptr<i64>
%124 = llvm.getelementptr %100[%119] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
llvm.store %123, %124 : !llvm.ptr<i64>
%125 = llvm.getelementptr %101[%119] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
llvm.store %120, %125 : !llvm.ptr<i64>
%126 = llvm.mul %120, %123 : i64
%127 = llvm.sub %119, %11 : i64
llvm.br ^bb14(%127, %126 : i64, i64)
^bb16: // pred: ^bb14
%128 = llvm.call @malloc(%93) : (i64) -> !llvm.ptr<i8>
"llvm.intr.memcpy"(%128, %94, %93, %6) : (!llvm.ptr<i8>, !llvm.ptr<i8>, i64, i1) -> ()
%129 = llvm.mlir.undef : !llvm.struct<(i64, ptr<i8>)>
%130 = llvm.insertvalue %arg1, %129[0] : !llvm.struct<(i64, ptr<i8>)>
%131 = llvm.insertvalue %128, %130[1] : !llvm.struct<(i64, ptr<i8>)>
llvm.return %131 : !llvm.struct<(i64, ptr<i8>)>
}
llvm.func @_mlir_ciface_Rsqrt_CPU_DT_HALF_DT_HALF(%arg0: !llvm.ptr<struct<(i64, ptr<i8>)>>, %arg1: !llvm.ptr<i8>, %arg2: !llvm.ptr<struct<(i64, ptr<i8>)>>) attributes {llvm.emit_c_interface, tf_entry} {
%0 = llvm.load %arg2 : !llvm.ptr<struct<(i64, ptr<i8>)>>
%1 = llvm.extractvalue %0[0] : !llvm.struct<(i64, ptr<i8>)>
%2 = llvm.extractvalue %0[1] : !llvm.struct<(i64, ptr<i8>)>
%3 = llvm.call @Rsqrt_CPU_DT_HALF_DT_HALF(%arg1, %1, %2) : (!llvm.ptr<i8>, i64, !llvm.ptr<i8>) -> !llvm.struct<(i64, ptr<i8>)>
llvm.store %3, %arg0 : !llvm.ptr<struct<(i64, ptr<i8>)>>
llvm.return
}
}
2022-06-15 13:24:24 +02:00
Benjamin Kramer
fb34d531af
Promote bf16 to f32 when the target doesn't support it
...
This is modeled after the half-precision fp support. Two new nodes are
introduced for casting from and to bf16. Since casting from bf16 is a
simple operation I opted to always directly lower it to integer
arithmetic. The other way round is more complicated if you want to
preserve IEEE semantics, so it's handled by a new __truncsfbf2
compiler-rt builtin.
This is of course very bare bones, but sufficient to get a semi-softened
fadd on x86.
Possible future improvements:
- Targets with bf16 conversion instructions can now make fp_to_bf16 legal
- The software conversion to bf16 can be replaced by a trivial
implementation under fast math.
Differential Revision: https://reviews.llvm.org/D126953
2022-06-15 12:56:31 +02:00
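A minimal IR sketch of the case the commit above enables; per the description, the bfloat add is expected to be legalized by extending both operands to float, adding there, and truncating back through __truncsfbf2 (loads and stores are used here to keep calling-convention questions out of the example).
```
define void @fadd_bf16(ptr %a, ptr %b, ptr %out) {
  %x = load bfloat, ptr %a
  %y = load bfloat, ptr %b
  %r = fadd bfloat %x, %y
  store bfloat %r, ptr %out
  ret void
}
```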
Simon Pilgrim
4fd561415e
[X86] needCarryOrOverflowFlag/onlyZeroFlagUsed - merge identical switch cases. NFCI.
...
Makes it easier to grok and fixes various bugprone-branch-clone warnings.
2022-06-15 10:40:22 +01:00
Phoebe Wang
6e02e27536
Reland "[X86][RFC] Enable `_Float16` type support on X86 following the psABI"
...
Disabled 2 MLIR tests because the runtime doesn't support `_Float16`; see
the issue here: https://github.com/llvm/llvm-project/issues/55992
2022-06-15 09:15:31 +08:00
Simon Pilgrim
64eea34420
[X86] combineEXTEND_VECTOR_INREG - don't attempt to shuffle combine ANY_EXTEND_VECTOR_INREG without SSE41
...
Without SSE41, ANY_EXTEND_VECTOR_INREG nodes are likely to be prematurely combined to a target shuffle preventing generic sign extension folds.
Fixes a number of sign-extend regressions in D127115.
2022-06-13 17:42:04 +01:00
Mehdi Amini
5d8298a768
Revert "[X86][RFC] Enable `_Float16` type support on X86 following the psABI"
...
This reverts commit 2d2da259c8.
This breaks an MLIR integration test (JIT crashing); reverting in the
meantime.
2022-06-12 15:14:37 +00:00
Simon Pilgrim
b5d7beeb97
[X86] combineConcatVectorOps - add support for concatenation of VSELECT/BLENDV nodes (REAPPLIED)
...
If the LHS/RHS selection operands can be cheaply concatenated back together then replace 2 x 128-bit selection nodes with 1 x 256-bit node
Addresses the regression introduced in the bug fix from rGd5af6a38082b39ae520a328e44dc29ebcb036bb2
REAPPLIED with a fix for the bug identified in rGea8fb3b60196
2022-06-12 15:40:36 +01:00
Phoebe Wang
2d2da259c8
[X86][RFC] Enable `_Float16` type support on X86 following the psABI
...
GCC and Clang/LLVM will support `_Float16` on X86 in C/C++, following
the latest X86 psABI (https://gitlab.com/x86-psABIs).
_Float16 arithmetic will be performed using native half-precision. If
native arithmetic instructions are not available, it will be performed
at a higher precision (currently always float) and then truncated down
to _Float16 immediately after each single arithmetic operation.
Reviewed By: LuoYuanke
Differential Revision: https://reviews.llvm.org/D107082
2022-06-12 11:40:00 +08:00
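A minimal IR sketch of the emulation behaviour described above (the llc invocation is an assumption): without AVX512FP16, the half add is expected to be performed in float and truncated back to half immediately afterwards.
```
; RUN: llc < %s -mtriple=x86_64-- -mattr=+avx2
define half @fadd_f16(half %a, half %b) {
  %r = fadd half %a, %b
  ret half %r
}
```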
Simon Pilgrim
7841d09449
[X86][AVX512] Retain pmuldq broadcast loads on 32-bit targets
...
Don't demand just the lower 32 bits on 32-bit AVX512 targets, to preserve 64-bit broadcast load patterns
2022-06-11 19:30:00 +01:00
Simon Pilgrim
6eaea225c7
[X86] combineTargetShuffle - break if-else chain. NFC.
...
(style) Both cases always continue.
2022-06-11 09:16:39 +01:00
Simon Pilgrim
89d2b1e4f7
[X86] emitOrXorXorTree - break if-else chain. NFC.
...
(style) Both cases always return.
2022-06-11 09:16:38 +01:00
Simon Pilgrim
5acbb2dda2
[X86] combineMulToPMADDWD - don't bitcast the source ops before splitting to ensure we split the build vectors early
...
Fixes a regression on D127115 - splitting was creating extract_subvector(bitcast(build_vector())) patterns which prevented the build vectors being split before being bitcast to vXi16 types, resulting in various issues with further folding of the (now legal) build vectors
2022-06-10 13:44:49 +01:00
Simon Pilgrim
7ac33b8aac
[X86] Remove !VT.is128BitVector() check. NFCI.
...
The code is inside an if(VT.is256BitVector() || VT.is512BitVector()) condition
2022-06-09 21:39:45 +01:00
Simon Pilgrim
72a049d778
[X86][AVX2] LowerINSERT_VECTOR_ELT - support v4i64 insertion as BLENDI(X, SCALAR_TO_VECTOR(Y))
2022-06-09 21:18:10 +01:00
Simon Pilgrim
1a02db9882
[X86] canonicalizeShuffleWithBinOps - add TODO for X86ISD::ANDNP bitwise handling
...
It's just as safe to move shuffles across X86ISD::ANDNP as any other logical bitop; they just tend to appear too late to matter.
Noticed while triaging D127115 regressions.
2022-06-09 12:18:26 +01:00
Simon Pilgrim
9a76337fee
[X86] combineMOVMSK - constant fold with getTargetConstantBitsFromNode not just BUILD_VECTOR
...
Help avoid a regression in D127115
2022-06-08 17:48:55 +01:00
Guillaume Chatelet
0788186182
[Alignment][NFC] Remove usage of MemSDNode::getAlignment
...
I can't remove the function just yet as it is used in the generated .inc files.
I would also like to provide a way to compare alignment with TypeSize since it came up a few times.
Differential Revision: https://reviews.llvm.org/D126910
2022-06-07 13:52:20 +00:00
Simon Pilgrim
f5507978a3
[X86] getFauxShuffleMask - add VSELECT/BLENDV handling
...
First step towards enabling shuffle combining starting from VSELECT/BLENDV nodes - this should eventually help improve the codegen reported at Issue #54819
2022-06-07 14:46:25 +01:00
Simon Pilgrim
1b6d3bdc82
[X86] foldMaskedMergeImpl - pass SDLoc by const reference not value.
2022-06-07 12:36:30 +01:00
Simon Pilgrim
63e3035dbe
[X86] LowerGC_TRANSITION - remove redundant SDLoc().
2022-06-07 10:57:58 +01:00
Shilei Tian
0c3e6e5717
[NFC] Remove trailing whitespace
2022-06-06 18:59:13 -04:00
Eric Christopher
93cb6b9c83
Revert "[X86] combineConcatVectorOps - add support for concatenation VSELECT/BLENDV nodes"
...
See the original commit for a testcase.
This reverts commit ea8fb3b601.
2022-06-03 12:31:11 -07:00
Simon Pilgrim
de2b543505
[X86] LowerVSETCC - merge getConstant() calls with flipped/unflipped sign masks. NFCI.
2022-06-01 15:09:48 +01:00
Sanjay Patel
3a503a4a9c
[x86] fix miscompile from wrongly identified fneg
...
We may need to peek through a bitcast when identifying an fneg idiom
via its pool constant, but we can't allow a different-sized constant
in that match.
This is noted in issue #55758 with an example that needs fast-math,
but as the test here shows, this has potential to miscompile more
generally (no fast-math required).
Differential Revision: https://reviews.llvm.org/D126775
2022-06-01 09:56:33 -04:00
Simon Pilgrim
f6dbb0b6fb
[X86] Fix typo in extraction type introduced in rGed0303aa2251e4484a2b4ff7f236c9f7cdfb2092
...
It doesn't look like we have test coverage for this at the moment :(
2022-06-01 12:31:27 +01:00
Simon Pilgrim
ea8fb3b601
[X86] combineConcatVectorOps - add support for concatenation VSELECT/BLENDV nodes
...
If the LHS/RHS selection operands can be cheaply concatenated back together then replace 2 x 128-bit selection nodes with 1 x 256-bit node
Addresses the regression introduced in the bug fix from rGd5af6a38082b39ae520a328e44dc29ebcb036bb2
2022-06-01 10:46:06 +01:00
Simon Pilgrim
d5af6a3808
[X86] LowerMINMAX - split v4i64 types on AVX1 targets (Issue #55648 )
...
Originally we tried to use default expansion for v4i64 types to make it easier to concatenate the results back together, but this can cause infinite loop issues with existing VSELECT splitting code in narrowExtractedVectorSelect if we have other uses of the VSELECT results (e.g. reduction patterns).
To fix the infinite loop, this patch always splits MIN/MAX v4i64 nodes during lowering and I've added a TODO for combineConcatVectorOps to investigate when we can cheaply concatenate VSELECT/BLENDV nodes together.
Fixes #55648 - regression test case will be added in a follow up.
2022-05-31 17:28:56 +01:00
Simon Pilgrim
af0113cf77
[X86] combineEXTRACT_SUBVECTOR - pull out repeated getVectorNumElements() calls. NFC.
2022-05-31 16:13:54 +01:00
Simon Pilgrim
b9443cb6fa
[X86] narrowExtractedVectorSelect - don't peek through bitcasts to find source vector
...
We don't seem to need this for any test coverage and it was making tracking of the uses() of the source vector more difficult
Noticed while investigating Issue #55648
2022-05-31 14:57:18 +01:00
Simon Pilgrim
ed0303aa22
[X86] LowerTRUNCATE - avoid creating extract_subvector(bitcast(vec)) patterns
...
We have a generic DAG combine to attempt to fold extract_subvector(bitcast(vec)) -> bitcast(extract_subvector(vec)) but if we create these patterns late in lowering then we often miss them.
Noticed while investigating Issue #55648 which gets caught in an infinite loop trying to split extract_subvector(bitcast(vselect()) patterns - this doesn't fix the issue yet but reduces the regressions from the WIP fix.
2022-05-31 14:30:56 +01:00
Zongwei Lan
ad73ce318e
[Target] use getSubtarget<> instead of static_cast<>(getSubtarget())
...
Differential Revision: https://reviews.llvm.org/D125391
2022-05-26 11:22:41 -07:00
Craig Topper
06fee478d2
[X86] Add isSimple check to the load combine in combineExtractVectorElt.
...
I think we need to be sure the load isn't volatile before we
duplicate and shrink it.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D126353
2022-05-25 09:11:11 -07:00
Jay Foad
6bec3e9303
[APInt] Remove all uses of zextOrSelf, sextOrSelf and truncOrSelf
...
Most clients only used these methods because they wanted to be able to
extend or truncate to the same bit width (which is a no-op). Now that
the standard zext, sext and trunc allow this, there is no reason to use
the OrSelf versions.
The OrSelf versions additionally have the strange behaviour of allowing
extending to a *smaller* width, or truncating to a *larger* width, which
are also treated as no-ops. A small amount of client code relied on this
(ConstantRange::castOp and MicrosoftCXXNameMangler::mangleNumber) and
needed rewriting.
Differential Revision: https://reviews.llvm.org/D125557
2022-05-19 11:23:13 +01:00
Simon Pilgrim
320545b577
[X86] Rename combineCONCAT_VECTORS\INSERT_SUBVECTOR\EXTRACT_SUBVECTOR to match Opcode name. NFCI.
...
It's a lot easier to quickly search for the combine when it actually contains the name of the opcode it combines.
2022-05-17 18:37:53 +01:00
Simon Pilgrim
c64f5d44ad
[X86] Attempt to fold EFLAGS into X86ISD::ADD/SUB ops
...
We already use combineAddOrSubToADCOrSBB to fold extended EFLAGS results into ISD::ADD/SUB ops as X86ISD::ADC/SBB carry ops.
This patch extends this to also try to fold EFLAGS results with X86ISD::ADD/SUB ops
Differential Revision: https://reviews.llvm.org/D125642
2022-05-17 10:59:24 +01:00
Simon Pilgrim
b3077f563d
[X86] Move combineAddOrSubToADCOrSBB earlier. NFC.
...
Make it easier to reuse in X86 ADD/SUB combines in an upcoming patch.
2022-05-15 22:06:33 +01:00
Simon Pilgrim
fd1f0c51ef
[X86] lowerShuffleAsLanePermuteAndSHUFP always succeeds, so just return the result. NFC.
2022-05-15 15:53:36 +01:00
Simon Pilgrim
c0f59be358
[X86] Pull out repeated isShuffleMaskInputInPlace calls. NFC.
2022-05-15 15:35:09 +01:00
Simon Pilgrim
32162cf291
[X86] lowerV4I64Shuffle - try harder to lower to PERMQ(BLENDD(V1,V2)) pattern
2022-05-15 14:57:58 +01:00
Simon Pilgrim
bc90bbb759
[X86] LowerAVG - fix cut+paste typo. NFC.
2022-05-14 17:42:09 +01:00
Simon Pilgrim
98f82d69bd
[X86] LowerStore - use is64BitVector() wrapper. NFCI.
2022-05-13 15:30:18 +01:00
Matthias Braun
cd19af74c0
Avoid 8 and 16bit switch conditions on x86
...
This adds a `TargetLoweringBase::getSwitchConditionType` callback to
give targets a chance to control the type used in
`CodeGenPrepare::optimizeSwitchInst`.
Implement callback for X86 to avoid i8 and i16 types where possible as
they often incur extra zero-extensions.
This is NFC for non-X86 targets.
Differential Revision: https://reviews.llvm.org/D124894
2022-05-10 10:00:10 -07:00
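A sketch of the kind of switch the change above targets (names illustrative): with an i8 condition, CodeGenPrepare can now widen the condition to the type the target prefers, so the case comparisons don't each pay for a zero-extension.
```
define i32 @classify(i8 %c) {
entry:
  switch i8 %c, label %other [
    i8 1, label %one
    i8 2, label %two
  ]
one:
  ret i32 10
two:
  ret i32 20
other:
  ret i32 0
}
```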
Simon Pilgrim
980f41d7c4
[X86] (style) Use auto for dyn_cast<> results
2022-05-01 17:15:18 +01:00
Simon Pilgrim
d4f06ec874
[X86] (style) Don't use auto for non obvious types
2022-05-01 17:10:21 +01:00
Simon Pilgrim
92235e3bf4
[X86] lowerShuffleAsRepeatedMaskAndLanePermute - permit 32-bit sublane permute for unary v32i8 cases
...
Increase the likelihood that we can lower to a permd(pshufb()) pattern, but only after we've attempted with 64-bit sublane permutes first
Fixes #55066
2022-04-30 11:00:28 +01:00
Simon Pilgrim
b424055b52
[X86] lowerShuffleAsRepeatedMaskAndLanePermute - move the sublane split code into a lambda helper. NFC.
...
This is a NFC cleanup as part of the work on #55066 - the idea being that we will be able to check for multiple sub lane scales.
2022-04-29 16:03:50 +01:00
Simon Pilgrim
3562f855b7
[X86] SimplifyDemandedVectorEltsForTargetNode - fold (uniform) shift(0,x) -> 0
2022-04-29 12:08:47 +01:00
Simon Pilgrim
336a1233b2
[X86] SimplifyDemandedVectorEltsForTargetNode - fold shift(0,x) -> 0
2022-04-29 11:32:54 +01:00
Simon Pilgrim
6c44e398ec
[X86] combineShuffle - reuse SDLoc. NFCI.
2022-04-29 10:30:11 +01:00
Simon Pilgrim
2d7f0b1c22
[X86] Fold ANDNP(undef,x)/ANDNP(x,undef) -> 0
...
Matches the fold in DAGCombiner::visitANDLike.
2022-04-29 10:20:48 +01:00
Simon Pilgrim
ab17ed0723
[X86] Don't fold AND(SRL(X,Y),1) -> SETCC(BT(X,Y)) on BMI2 targets
...
With BMI2 we have SHRX which is a lot quicker than regular x86 shifts.
Fixes #55138
2022-04-28 21:28:16 +01:00
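The pattern in question, sketched in IR (function name illustrative): on BMI2 targets the variable shift is kept as SHRX plus AND rather than being rewritten to BT plus SETCC.
```
define i32 @bit_test(i32 %x, i32 %y) {
  %s = lshr i32 %x, %y
  %b = and i32 %s, 1
  ret i32 %b
}
```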
Simon Pilgrim
9e3b7e8e65
[X86] getTargetVShiftByConstNode - use SelectionDAG::FoldConstantArithmetic to perform constant folding. NFCI.
...
Remove some unnecessary code duplication.
2022-04-28 17:10:20 +01:00
Simon Pilgrim
de7cee24b6
[X86] getBT - attempt to peek through aext(and(trunc(x),c)) mask/modulo
...
Ideally we'd fold this with generic DAGCombiner, but that only works for !isTruncateFree cases - we might be able to adapt IsDesirableToPromoteOp to find truncated src ops in the future, but for now just use this peephole.
Noticed in Issue #55138
2022-04-28 16:10:26 +01:00
Simon Pilgrim
ed8dffef4c
[X86] getFauxShuffle - don't assume an UNDEF src element for AND/ANDNP results in an UNDEF shuffle mask index
...
The other src element might be zero, guaranteeing zero.
Fixes #55157
2022-04-28 12:32:58 +01:00
Simon Pilgrim
e378577524
[X86] Use is128BitLaneRepeatedShuffleMask wrapper. NFC.
...
We don't need to know the actual repeated mask.
2022-04-27 21:09:57 +01:00
Simon Pilgrim
03482bccad
[X86] collectConcatOps - add ability to collect from vector 'widening' patterns
...
Recognise insert_subvector(undef, x, lo/hi) patterns where we double the width of a vector - creating an UNDEF subvector on the fly.
2022-04-27 15:38:58 +01:00
Xiang1 Zhang
c430f0f532
[X86] Add use condition for combineSetCCMOVMSK
...
Reviewed By: RKSimon, LuoYuanke
Differential Revision: https://reviews.llvm.org/D123652
2022-04-26 16:42:50 +08:00
Simon Pilgrim
e8305c0b8f
[X86] combineX86ShuffleChain - don't fold to truncate(concat(V1,V2)) if it was already a PACK op
...
Fixes #55050
2022-04-25 17:13:44 +01:00
Craig Topper
c6fdb1de47
[X86] Move some hasOneUse checks after checking what the opcode is.
...
Calling hasOneUse can be expensive on nodes with multiple results.
Especially when some results are Chains. By checking the opcode first,
we can avoid walking the uses if it isn't an interesting node,
and thus avoid calling hasOneUse on a node that might have many uses.
Found by profiling the IR given in D123857.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D123881
2022-04-16 14:18:58 -07:00
Craig Topper
9d86bf825c
[X86] Move hasOneUse check after opcode check. NFC
...
Checking opcode is cheap. hasOneUse might not be if the node has
multiple results. By checking the opcode we can rule out nodes
with multiple results we aren't interested in.
2022-04-15 17:20:57 -07:00
Liu, Chen3
bf60a5af0a
[X86] Covert unsigned int 0 to float-point with FILD instruction.
...
Unsigned int 0 will be converted to float/double -0.0 when the rounding
mode is set to 'FE_DOWNWARD'. Use the FILD instruction instead of SSE
instructions on 32-bit targets if strictfp is enabled.
Differential Revision: https://reviews.llvm.org/D123660
2022-04-13 20:06:15 +08:00
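The affected conversions are the strict (constrained) unsigned-to-float ones; a minimal IR sketch of that shape (rounding/exception metadata chosen for illustration):
```
declare float @llvm.experimental.constrained.uitofp.f32.i32(i32, metadata, metadata)

; On 32-bit targets this is now lowered with FILD rather than SSE instructions.
define float @u32_to_f32(i32 %x) strictfp {
  %r = call float @llvm.experimental.constrained.uitofp.f32.i32(i32 %x, metadata !"round.dynamic", metadata !"fpexcept.strict") strictfp
  ret float %r
}
```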
Simon Pilgrim
0488c6638b
[X86] getFauxShuffleMask - remove use DemandedElts TODO
...
Most of the getTargetShuffleInputs recursive calls have now gone and the remaining uses aren't likely to benefit from a DemandedElts mask
2022-04-12 15:36:30 +01:00
Simon Pilgrim
1e803d305a
Revert rG88ff6f70c45f2767576c64dde28cbfe7a90916ca "[X86] Extend vselect(cond, pshufb(x), pshufb(y)) -> or(pshufb(x), pshufb(y)) to include inner or(pshufb(x), pshufb(y)) chains"
...
Reverting while I investigate reports of internal test regressions/failures
2022-04-11 10:42:43 +01:00
Simon Pilgrim
88ff6f70c4
[X86] Extend vselect(cond, pshufb(x), pshufb(y)) -> or(pshufb(x), pshufb(y)) to include inner or(pshufb(x), pshufb(y)) chains
2022-04-10 13:04:53 +01:00
Simon Pilgrim
c74d729bd6
[X86] combineExtractSubvector - fold extract_subvector(insert_subvector(V,X,C1),C1)
...
extract_subvector(insert_subvector(V,X,C1),C1) -> insert_subvector(extract_subvector(V,C1),X,0)
More aggressively attempt to reduce the width of an extract_subvector source - we currently only do this if we're inserting into a zero vector (i.e. canonicalizing to the AVX implicit zero upper elts pattern).
But if we're extracting from the same point as the inner insert_subvector then the fold is still relatively trivial - we can probably do even better if we can ensure the subvector isn't badly split.
2022-04-10 11:03:08 +01:00