Luo, Yuanke
5fb4134210
[X86][DAGISel] Don't widen shuffle element with AVX512
...
Currently the X86 shuffle lowering would widen the element type for
shuffle if the mask element value is adjacent. For below example
%t2 = add nsw <16 x i32> %t0, %t1
%t3 = sub nsw <16 x i32> %t0, %t1
%t4 = shufflevector <16 x i32> %t2, <16 x i32> %t3,
<16 x i32> <i32 16, i32 17, i32 2, i32 3, i32 4,
i32 5, i32 6, i32 7, i32 8, i32 9, i32 10,
i32 11, i32 12, i32 13, i32 14, i32 15>
ret <16 x i32> %t4
Compiler would transform the shuffle to
%t4 = shufflevector <8 x i64> %t2, <8 x i64> %t3,
<8 x i64> <i32 8, i32 1, i32 2, i32 3, i32 4,
i32 5, i32 6, i32 7>
This may lose the oppotunity to let ISel select mask instruction when
avx512 is enabled.
This patch is to prevent the tranform when avx512 feature is enabled.
Thank Simon for the idea.
Differential Revision: https://reviews.llvm.org/D129537
2022-07-26 11:56:03 +08:00
Craig Topper
00060a7b97
[X86] Custom type legalize v2i32 smulo/umulo to use a single pmuldq/pmuludq.
...
With SSE4.1 and above we were using 3 multiply instructions. This
was due to type legalization widening to v4i32 and the low half
being done with pmulld while the high half used two pmuldq/pmuludq.
Instead of that, we can use a single pmuludq/pmuldq to calculate
the full product at once, extract the high and low bits and compare
to check for overflow.
I've restricted SMULO to sse4.1 to get pmuldq. We can probably
do a fixup to pmuludq on earlier targets, but that's for another day.
I was going through my git stash and found an early version of this patch
from a year or two ago so I went ahead and finished it.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D130432
2022-07-25 09:12:35 -07:00
Simon Pilgrim
0708771cce
[DAG] MaskedVectorIsZero - don't bother with (-1).isSubsetOf mask check. NFC.
...
Just use KnownBits::isZero() to ensure all the bits are known zero.
2022-07-24 13:12:21 +01:00
Simon Pilgrim
69d1e805ce
[X86] combineAndnp - remove unused variable. NFC.
2022-07-24 11:32:44 +01:00
Simon Pilgrim
ce81a0df67
[X86][SSE] Enable X86ISD::ANDNP constant folding
2022-07-24 11:07:34 +01:00
Simon Pilgrim
293899c64b
[X86] Don't assume an AND/ANDNP element is undef/undemanded just because one element is undef
...
For mask ops like these, the other operand's corresponding element might be zero (result = zero) - so we must demand all the bits and that element.
This appears to be what D128570 was trying to fix - both sides of the funnel shift mask of the vXi64 (legalized to v2Xi32) were incorrectly simplifying the upper 32-bit halves to undef, resulting in bad folds later on.
I intend to address the test case regressions, but this close to the release branch I'd prefer to get a fix in first.
2022-07-24 10:53:38 +01:00
Simon Pilgrim
676a03d8a5
[X86] matchBinaryShuffle - limit SHUFFLE(X,Y) -> OR(X,Y) cases to where X + Y are the same width as the result
...
Minor bit of prep work toward not unnecessarily widening shuffle operands in combineX86ShufflesRecursively, instead only widening in combineX86ShuffleChain if we actual find a match - see Issue #45319
2022-07-23 16:56:45 +01:00
Arnold Schwaighofer
58e6ee0e1f
llvm.swift.async.context.addr cannot be modeled as NoMem because we don't want it to be cse'd accross async suspends
...
An async suspend models the split between two partial async functions.
`llvm.swift.async.context.addr ` will have a different value in the two
partial functions so it is not correct to generally CSE the instruction.
rdar://97336162
Differential Revision: https://reviews.llvm.org/D130201
2022-07-22 11:50:58 -07:00
Phoebe Wang
02fe96b240
[X86][FP16] Do not split FP64->FP16 to FP64->FP32->FP16
...
Truncation from double to half is not always identical to truncating to float first and then to half. https://godbolt.org/z/56s9517hd
On the other hand, expanding to float and then to double is always identical to expanding to double directly. https://godbolt.org/z/Ye8vbYPnY
Reviewed By: RKSimon, skan
Differential Revision: https://reviews.llvm.org/D130151
2022-07-22 08:36:05 +08:00
Bing1 Yu
e01bf5a3e2
[X86] Promote v32f16's fadd into v32f32's fadd when it is avx512 without avx512fp16
...
Reviewed By: pengfei
Differential Revision: https://reviews.llvm.org/D130059
2022-07-19 14:37:50 +08:00
Benjamin Kramer
9234a7c0df
[X86][FP16] Don't crash when lowering SELECT on fp16 vectors
...
This is a regression from f187948162
2022-07-18 13:41:00 +02:00
Phoebe Wang
f187948162
[X86][FP16] Enable vector support for FP16 emulation
...
This is follow up of D107082, which enable vector support according to psABI.
Reviewed By: skan
Differential Revision: https://reviews.llvm.org/D127982
2022-07-16 09:38:58 +08:00
Simon Pilgrim
66bfd1ba8c
[X86] Move isInRange(ArrayRef<int>) inside assert to fix NDEBUG builds. NFC.
...
Fix unused static function warning introduced by D129207
2022-07-12 21:51:07 +01:00
Xiang1 Zhang
a45dd3d814
[X86] Support -mstack-protector-guard-symbol
...
Reviewed By: nickdesaulniers
Differential Revision: https://reviews.llvm.org/D129346
2022-07-12 10:17:00 +08:00
Xiang1 Zhang
643786213b
Revert "[X86] Support -mstack-protector-guard-symbol"
...
This reverts commit efbaad1c4a
.
due to miss adding review info.
2022-07-12 10:14:32 +08:00
Xiang1 Zhang
efbaad1c4a
[X86] Support -mstack-protector-guard-symbol
2022-07-12 10:13:48 +08:00
Simon Pilgrim
97868fb972
[X86] isTargetShuffleEquivalent - attempt to match SM_SentinelZero shuffle mask elements using known bits
...
If the combined shuffle mask requires zero elements, we don't currently have much chance of matching them against the expected source vector. This patch uses the SelectionDAG::MaskedVectorIsZero wrapper to attempt to determine if the expected lement we want to use is already known to be zero.
I've also tightened up the ExpectedMask assertion to always be in range - we're never giving it a target shuffle mask that has sentinels at all - allowing to remove some of the confusing bounds checks.
This attempts to address some of the regressions uncovered by D129150 where we more aggressively fold shuffles as AND / 'clear' masks which results in more combined shuffles using SM_SentinelZero.
Differential Revision: https://reviews.llvm.org/D129207
2022-07-11 15:29:44 +01:00
Phoebe Wang
8fb083d33e
[X86][FP16] Add constrained FP support for scalar emulation
...
This is a follow up patch to support constrained FP in FP16 emulation.
Reviewed By: skan
Differential Revision: https://reviews.llvm.org/D128114
2022-07-08 20:33:42 +08:00
Phoebe Wang
6c535f9f1b
[X86][FP16] Fix crash when lowering copysign for f16
...
This is to address the assertion fail reported in https://reviews.llvm.org/D107082#3635612
Not sure if it is a problem of promoting FCOPYSIGN + libcall FP_ROUND.
The promoting will set the rounding mode to 1 a442c62888/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp (L4810-L4814)
While libcall cannot handle the rounding mode equals to 1 a442c62888/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp (L4324-L4328)
So changing the action to Expand to workaround the problem.
Reviewed By: clementval, MaskRay
Differential Revision: https://reviews.llvm.org/D129294
2022-07-07 19:17:26 -07:00
Simon Pilgrim
fbb51ac0ba
[X86] LowerShift - lower some shuffles directly to X86ISD::PSHUFLW nodes.
...
These are expected to lower to X86ISD::PSHUFLW but we were seeing some regressions in D129150 because it'd managed to exploit the masking of the shift amounts to create unintended clear masks instead.
2022-07-06 18:01:03 +01:00
Shilei Tian
1023ddaf77
[LLVM] Add the support for fmax and fmin in atomicrmw instruction
...
This patch adds the support for `fmax` and `fmin` operations in `atomicrmw`
instruction. For now (at least in this patch), the instruction will be expanded
to CAS loop. There are already a couple of targets supporting the feature. I'll
create another patch(es) to enable them accordingly.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D127041
2022-07-06 10:57:53 -04:00
Paul Robinson
08e4fe6c61
[X86] Add RDPRU instruction
...
Add support for the RDPRU instruction on Zen2 processors.
User-facing features:
- Clang option -m[no-]rdpru to enable/disable the feature
- Support is implicit for znver2/znver3 processors
- Preprocessor symbol __RDPRU__ to indicate support
- Header rdpruintrin.h to define intrinsics
- "rdpru" mnemonic supported for assembler code
Internal features:
- Clang builtin __builtin_ia32_rdpru
- IR intrinsic @llvm.x86.rdpru
Differential Revision: https://reviews.llvm.org/D128934
2022-07-06 07:17:47 -07:00
Craig Topper
2bfca35614
[X86] Disable combineVectorSizedSetCCEquality for soft float.
...
The vector types aren't legal with soft float.
Also disable under NoImplicitFloat for good measure.
Fixes PR56351.
Differential Revision: https://reviews.llvm.org/D129060
2022-07-04 08:33:30 -07:00
Simon Pilgrim
26708fa166
Revert rG057db2002bb3: [X86] combineAndnp - constant fold ANDNP(C,X) -> AND(~C,X)
...
If the LHS op has a single use then using the more general AND op is likely to allow commutation, load folding, generic folds etc.
Reverted due to reports from @alexfh about it causing an infinite loop (repro still pending).
2022-07-01 10:36:09 +01:00
Simon Pilgrim
e961e05d59
[SLP][X86] Add 32-bit vector stores to help vectorization opportunities
...
Building on the work on D124284, this patch tags v4i8 and v2i16 vector loads as custom, enabling SLP to try to vectorize these types ending in a partial store (using the SSE MOVD instruction) - we already do something similar for 64-bit vector types.
Differential Revision: https://reviews.llvm.org/D127604
2022-06-30 20:25:50 +01:00
Craig Topper
3706bdad4a
[X86] Remove unnecessary COPY from EmitLoweredCascadedSelect.
...
I believe we already checked that the destination of the first
CMOV is only used by the second CMOV so I don't think there is any
reason we need the PHI to write the register that was used by the
first CMOV. We can directly use the second CMOV destination and
avoid the copy.
This may be a left over from when the cascaded select handling
was part of the main algorithm before it was refactored in D35685.
Reviewed By: pengfei
Differential Revision: https://reviews.llvm.org/D128124
2022-06-28 09:33:33 -07:00
Simon Pilgrim
0b998053db
[X86] combineConcatVectorOps - IsConcatFree must check extraction index
...
Identified in the regression reported by @alexfh on rGb5d7beeb9792 - IsConcatFree wasn't ensuring the subvector extraction index matched the position it would be concatenated back into.
2022-06-27 11:46:49 +01:00
Simon Pilgrim
ac4cb1775b
[X86] fold (and (mul x, c1), c2) -> (mul x, (and c1, c2)) iff c2 is all/no bits mask
...
Noticed on D128216 - if we're zeroing out vector elements of a mul/mulh result then see if we can merge the and-mask into the mul by just multiplying by zero.
Ideally we'd make this generic (similar to the existing foldSelectWithIdentityConstant?), but these cases are appearing very late, after the constants have been lowered to constant-pool loads.
2022-06-21 15:10:43 +01:00
Simon Pilgrim
057db2002b
[X86] combineAndnp - constant fold ANDNP(C,X) -> AND(~C,X)
...
If the LHS op has a single use then using the more general AND op is likely to allow commutation, load folding, generic folds etc.
2022-06-21 12:31:01 +01:00
Simon Pilgrim
843d43e62a
[X86] computeKnownBitsForTargetNode - add X86ISD::VBROADCAST_LOAD handling
...
This requires us to override the isTargetCanonicalConstantNode callback introduced in D128144, so we can recognise the various cases where a VBROADCAST_LOAD constant is being reused at different vector widths to prevent infinite loops.
2022-06-21 11:48:01 +01:00
Simon Pilgrim
8254966062
[X86] LowerINSERT_VECTOR_ELT - always lower v32i8/v16i16 allones insertions on AVX1 as OR ops
...
v32i8/v16i16 blend shuffles on AVX1 will expand to OR(AND,ANDN) patterns which can be easily broken by other combines
2022-06-20 18:43:03 +01:00
Simon Pilgrim
e4a124dda5
[DAG] Fold (srl (shl x, c1), c2) -> and(shl/srl(x, c3), m)
...
Similar to the existing (shl (srl x, c1), c2) fold
Part of the work to fix the regressions in D77804
Differential Revision: https://reviews.llvm.org/D125836
2022-06-20 08:37:38 +01:00
Simon Pilgrim
ba3f2667b6
[DAG] Add MaskedVectorIsZero helper
...
Equivalent to MaskedValueIsZero, except its checking if all of the demanded vectors elements are known to be zero
2022-06-19 17:56:30 +01:00
Simon Pilgrim
41455dd1dc
[X86] Remove isTargetShuffleSplat and just use SelectionDAG::isSplatValue
...
shuffle(splat(x)) -> splat(x), it doesn't have to be a target specific broadcast
2022-06-19 11:22:57 +01:00
Simon Pilgrim
ac3f967382
[X86] canonicalizeShuffleWithBinOps - merge shuffles across binops if either source op is a known splat
...
The shuffle of a splat (with no undefs) should always be removed
2022-06-18 17:14:00 +01:00
Simon Pilgrim
f42f2b7005
[X86] canonicalizeShuffleWithBinOps - merge unary shuffles across binops if either source op is a foldable load
...
This mostly handles folding of constants that have already become loads, but we expose some generic load cases as well.
This also exposes the chance to merge unary shuffles across X86ISD::ANDNP nodes with different scalar widths
2022-06-18 15:58:54 +01:00
Simon Pilgrim
3c9123af9f
[X86] isShuffleFoldableLoad - ensure the load has one use.
...
We'll only fold the load if has one use. Makes no difference to existing tests but will be necessary for an upcoming patch to improve load folding as part of canonicalizeShuffleWithBinOps.
2022-06-18 14:51:55 +01:00
Phoebe Wang
655ba9c8a1
Reland "Reland "Reland "Reland "[X86][RFC] Enable `_Float16` type support on X86 following the psABI""""
...
This resolves problems reported in commit 1a20252978
.
1. Promote to float lowering for nodes XINT_TO_FP
2. Bail out f16 from shuffle combine due to vector type is not legal in the version
2022-06-17 21:34:05 +08:00
Benjamin Kramer
1a20252978
Revert "Reland "Reland "Reland "[X86][RFC] Enable `_Float16` type support on X86 following the psABI""""
...
This reverts commit 04a3d5f3a1
.
I see two more issues:
- uitofp/sitofp from i32/i64 to half now generates
__floatsihf/__floatdihf, which exists in neither compiler-rt nor
libgcc
- This crashes when legalizing the bitcast:
```
; RUN: llc < %s -mcpu=skx
define void @main.45(ptr nocapture readnone %retval, ptr noalias nocapture readnone %run_options, ptr noalias nocapture readnone %params, ptr noalias nocapture readonly %buffer_table, ptr noalias nocapture readnone %status, ptr noalias nocapture readnone %prof_counters) local_unnamed_addr {
entry:
%fusion = load ptr, ptr %buffer_table, align 8
%0 = getelementptr inbounds ptr, ptr %buffer_table, i64 1
%Arg_1.2 = load ptr, ptr %0, align 8
%1 = getelementptr inbounds ptr, ptr %buffer_table, i64 2
%Arg_0.1 = load ptr, ptr %1, align 8
%2 = load half, ptr %Arg_0.1, align 8
%3 = bitcast half %2 to i16
%4 = and i16 %3, 32767
%5 = icmp eq i16 %4, 0
%6 = and i16 %3, -32768
%broadcast.splatinsert = insertelement <4 x half> poison, half %2, i64 0
%broadcast.splat = shufflevector <4 x half> %broadcast.splatinsert, <4 x half> poison, <4 x i32> zeroinitializer
%broadcast.splatinsert9 = insertelement <4 x i16> poison, i16 %4, i64 0
%broadcast.splat10 = shufflevector <4 x i16> %broadcast.splatinsert9, <4 x i16> poison, <4 x i32> zeroinitializer
%broadcast.splatinsert11 = insertelement <4 x i16> poison, i16 %6, i64 0
%broadcast.splat12 = shufflevector <4 x i16> %broadcast.splatinsert11, <4 x i16> poison, <4 x i32> zeroinitializer
%broadcast.splatinsert13 = insertelement <4 x i16> poison, i16 %3, i64 0
%broadcast.splat14 = shufflevector <4 x i16> %broadcast.splatinsert13, <4 x i16> poison, <4 x i32> zeroinitializer
%wide.load = load <4 x half>, ptr %Arg_1.2, align 8
%7 = fcmp uno <4 x half> %broadcast.splat, %wide.load
%8 = fcmp oeq <4 x half> %broadcast.splat, %wide.load
%9 = bitcast <4 x half> %wide.load to <4 x i16>
%10 = and <4 x i16> %9, <i16 32767, i16 32767, i16 32767, i16 32767>
%11 = icmp eq <4 x i16> %10, zeroinitializer
%12 = and <4 x i16> %9, <i16 -32768, i16 -32768, i16 -32768, i16 -32768>
%13 = or <4 x i16> %12, <i16 1, i16 1, i16 1, i16 1>
%14 = select <4 x i1> %11, <4 x i16> %9, <4 x i16> %13
%15 = icmp ugt <4 x i16> %broadcast.splat10, %10
%16 = icmp ne <4 x i16> %broadcast.splat12, %12
%17 = or <4 x i1> %15, %16
%18 = select <4 x i1> %17, <4 x i16> <i16 -1, i16 -1, i16 -1, i16 -1>, <4 x i16> <i16 1, i16 1, i16 1, i16 1>
%19 = add <4 x i16> %18, %broadcast.splat14
%20 = select i1 %5, <4 x i16> %14, <4 x i16> %19
%21 = select <4 x i1> %8, <4 x i16> %9, <4 x i16> %20
%22 = bitcast <4 x i16> %21 to <4 x half>
%23 = select <4 x i1> %7, <4 x half> <half 0xH7E00, half 0xH7E00, half 0xH7E00, half 0xH7E00>, <4 x half> %22
store <4 x half> %23, ptr %fusion, align 16
ret void
}
```
llc: llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp:977: void (anonymous namespace)::SelectionDAGLegalize::LegalizeOp(llvm::SDNode *): Assertion `(TLI.getTypeAction(*DAG.getContext(), Op.getValueType()) == TargetLowering::TypeLegal || Op.getOpcode() == ISD::TargetConstant || Op.getOpcode() == ISD::Register) && "Unexpected illegal type!"' failed.
2022-06-17 09:43:07 +02:00
Phoebe Wang
04a3d5f3a1
Reland "Reland "Reland "[X86][RFC] Enable `_Float16` type support on X86 following the psABI"""
...
Fix the crash on lowering X86ISD::FCMP.
2022-06-17 12:12:17 +08:00
Paul Robinson
ff0122dcce
[PS5] Emit ud2 for ubsan trap
2022-06-16 11:20:10 -07:00
Frederik Gossen
3cd5696a33
Revert "Reland "Reland "[X86][RFC] Enable `_Float16` type support on X86 following the psABI"""
...
This reverts commit e1c5afa47d
.
This introduces crashes in the JAX backend on CPU. A reproducer in LLVM is
below. Let me know if you have trouble reproducing this.
; ModuleID = '__compute_module'
source_filename = "__compute_module"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-grtev4-linux-gnu"
@0 = private unnamed_addr constant [4 x i8] c"\00\00\00?"
@1 = private unnamed_addr constant [4 x i8] c"\1C}\908"
@2 = private unnamed_addr constant [4 x i8] c"?\00\\4"
@3 = private unnamed_addr constant [4 x i8] c"%ci1"
@4 = private unnamed_addr constant [4 x i8] zeroinitializer
@5 = private unnamed_addr constant [4 x i8] c"\00\00\00\C0"
@6 = private unnamed_addr constant [4 x i8] c"\00\00\00B"
@7 = private unnamed_addr constant [4 x i8] c"\94\B4\C22"
@8 = private unnamed_addr constant [4 x i8] c"^\09B6"
@9 = private unnamed_addr constant [4 x i8] c"\15\F3M?"
@10 = private unnamed_addr constant [4 x i8] c"e\CC\\;"
@11 = private unnamed_addr constant [4 x i8] c"d\BD/>"
@12 = private unnamed_addr constant [4 x i8] c"V\F4I="
@13 = private unnamed_addr constant [4 x i8] c"\10\CB,<"
@14 = private unnamed_addr constant [4 x i8] c"\AC\E3\D6:"
@15 = private unnamed_addr constant [4 x i8] c"\DC\A8E9"
@16 = private unnamed_addr constant [4 x i8] c"\C6\FA\897"
@17 = private unnamed_addr constant [4 x i8] c"%\F9\955"
@18 = private unnamed_addr constant [4 x i8] c"\B5\DB\813"
@19 = private unnamed_addr constant [4 x i8] c"\B4W_\B2"
@20 = private unnamed_addr constant [4 x i8] c"\1Cc\8F\B4"
@21 = private unnamed_addr constant [4 x i8] c"~3\94\B6"
@22 = private unnamed_addr constant [4 x i8] c"3Yq\B8"
@23 = private unnamed_addr constant [4 x i8] c"\E9\17\17\BA"
@24 = private unnamed_addr constant [4 x i8] c"\F1\B2\8D\BB"
@25 = private unnamed_addr constant [4 x i8] c"\F8t\C2\BC"
@26 = private unnamed_addr constant [4 x i8] c"\82[\C2\BD"
@27 = private unnamed_addr constant [4 x i8] c"uB-?"
@28 = private unnamed_addr constant [4 x i8] c"^\FF\9B\BE"
@29 = private unnamed_addr constant [4 x i8] c"\00\00\00A"
; Function Attrs: uwtable
define void @main.158(ptr %retval, ptr noalias %run_options, ptr noalias %params, ptr noalias %buffer_table, ptr noalias %status, ptr noalias %prof_counters) #0 {
entry:
%fusion.invar_address.dim.1 = alloca i64, align 8
%fusion.invar_address.dim.0 = alloca i64, align 8
%0 = getelementptr inbounds ptr, ptr %buffer_table, i64 1
%Arg_0.1 = load ptr, ptr %0, align 8, !invariant.load !0 , !dereferenceable !1 , !align !2
%1 = getelementptr inbounds ptr, ptr %buffer_table, i64 0
%fusion = load ptr, ptr %1, align 8, !invariant.load !0 , !dereferenceable !1 , !align !2
store i64 0, ptr %fusion.invar_address.dim.0, align 8
br label %fusion.loop_header.dim.0
return: ; preds = %fusion.loop_exit.dim.0
ret void
fusion.loop_header.dim.0: ; preds = %fusion.loop_exit.dim.1, %entry
%fusion.indvar.dim.0 = load i64, ptr %fusion.invar_address.dim.0, align 8
%2 = icmp uge i64 %fusion.indvar.dim.0, 3
br i1 %2, label %fusion.loop_exit.dim.0, label %fusion.loop_body.dim.0
fusion.loop_body.dim.0: ; preds = %fusion.loop_header.dim.0
store i64 0, ptr %fusion.invar_address.dim.1, align 8
br label %fusion.loop_header.dim.1
fusion.loop_header.dim.1: ; preds = %fusion.loop_body.dim.1, %fusion.loop_body.dim.0
%fusion.indvar.dim.1 = load i64, ptr %fusion.invar_address.dim.1, align 8
%3 = icmp uge i64 %fusion.indvar.dim.1, 1
br i1 %3, label %fusion.loop_exit.dim.1, label %fusion.loop_body.dim.1
fusion.loop_body.dim.1: ; preds = %fusion.loop_header.dim.1
%4 = getelementptr inbounds [3 x [1 x half]], ptr %Arg_0.1, i64 0, i64 %fusion.indvar.dim.0, i64 0
%5 = load half, ptr %4, align 2, !invariant.load !0 , !noalias !3
%6 = fpext half %5 to float
%7 = call float @llvm.fabs.f32(float %6)
%constant.121 = load float, ptr @29, align 4
%compare.2 = fcmp ole float %7, %constant.121
%8 = zext i1 %compare.2 to i8
%constant.120 = load float, ptr @0, align 4
%multiply.95 = fmul float %7, %constant.120
%constant.119 = load float, ptr @5, align 4
%add.82 = fadd float %multiply.95, %constant.119
%constant.118 = load float, ptr @4, align 4
%multiply.94 = fmul float %add.82, %constant.118
%constant.117 = load float, ptr @19, align 4
%add.81 = fadd float %multiply.94, %constant.117
%multiply.92 = fmul float %add.82, %add.81
%constant.116 = load float, ptr @18, align 4
%add.79 = fadd float %multiply.92, %constant.116
%multiply.91 = fmul float %add.82, %add.79
%subtract.87 = fsub float %multiply.91, %add.81
%constant.115 = load float, ptr @20, align 4
%add.78 = fadd float %subtract.87, %constant.115
%multiply.89 = fmul float %add.82, %add.78
%subtract.86 = fsub float %multiply.89, %add.79
%constant.114 = load float, ptr @17, align 4
%add.76 = fadd float %subtract.86, %constant.114
%multiply.88 = fmul float %add.82, %add.76
%subtract.84 = fsub float %multiply.88, %add.78
%constant.113 = load float, ptr @21, align 4
%add.75 = fadd float %subtract.84, %constant.113
%multiply.86 = fmul float %add.82, %add.75
%subtract.83 = fsub float %multiply.86, %add.76
%constant.112 = load float, ptr @16, align 4
%add.73 = fadd float %subtract.83, %constant.112
%multiply.85 = fmul float %add.82, %add.73
%subtract.81 = fsub float %multiply.85, %add.75
%constant.111 = load float, ptr @22, align 4
%add.72 = fadd float %subtract.81, %constant.111
%multiply.83 = fmul float %add.82, %add.72
%subtract.80 = fsub float %multiply.83, %add.73
%constant.110 = load float, ptr @15, align 4
%add.70 = fadd float %subtract.80, %constant.110
%multiply.82 = fmul float %add.82, %add.70
%subtract.78 = fsub float %multiply.82, %add.72
%constant.109 = load float, ptr @23, align 4
%add.69 = fadd float %subtract.78, %constant.109
%multiply.80 = fmul float %add.82, %add.69
%subtract.77 = fsub float %multiply.80, %add.70
%constant.108 = load float, ptr @14, align 4
%add.68 = fadd float %subtract.77, %constant.108
%multiply.79 = fmul float %add.82, %add.68
%subtract.75 = fsub float %multiply.79, %add.69
%constant.107 = load float, ptr @24, align 4
%add.67 = fadd float %subtract.75, %constant.107
%multiply.77 = fmul float %add.82, %add.67
%subtract.74 = fsub float %multiply.77, %add.68
%constant.106 = load float, ptr @13, align 4
%add.66 = fadd float %subtract.74, %constant.106
%multiply.76 = fmul float %add.82, %add.66
%subtract.72 = fsub float %multiply.76, %add.67
%constant.105 = load float, ptr @25, align 4
%add.65 = fadd float %subtract.72, %constant.105
%multiply.74 = fmul float %add.82, %add.65
%subtract.71 = fsub float %multiply.74, %add.66
%constant.104 = load float, ptr @12, align 4
%add.64 = fadd float %subtract.71, %constant.104
%multiply.73 = fmul float %add.82, %add.64
%subtract.69 = fsub float %multiply.73, %add.65
%constant.103 = load float, ptr @26, align 4
%add.63 = fadd float %subtract.69, %constant.103
%multiply.71 = fmul float %add.82, %add.63
%subtract.67 = fsub float %multiply.71, %add.64
%constant.102 = load float, ptr @11, align 4
%add.62 = fadd float %subtract.67, %constant.102
%multiply.70 = fmul float %add.82, %add.62
%subtract.66 = fsub float %multiply.70, %add.63
%constant.101 = load float, ptr @28, align 4
%add.61 = fadd float %subtract.66, %constant.101
%multiply.68 = fmul float %add.82, %add.61
%subtract.65 = fsub float %multiply.68, %add.62
%constant.100 = load float, ptr @27, align 4
%add.60 = fadd float %subtract.65, %constant.100
%subtract.64 = fsub float %add.60, %add.62
%multiply.66 = fmul float %subtract.64, %constant.120
%constant.99 = load float, ptr @6, align 4
%divide.4 = fdiv float %constant.99, %7
%add.59 = fadd float %divide.4, %constant.119
%multiply.65 = fmul float %add.59, %constant.118
%constant.98 = load float, ptr @3, align 4
%add.58 = fadd float %multiply.65, %constant.98
%multiply.64 = fmul float %add.59, %add.58
%constant.97 = load float, ptr @7, align 4
%add.57 = fadd float %multiply.64, %constant.97
%multiply.63 = fmul float %add.59, %add.57
%subtract.63 = fsub float %multiply.63, %add.58
%constant.96 = load float, ptr @2, align 4
%add.56 = fadd float %subtract.63, %constant.96
%multiply.62 = fmul float %add.59, %add.56
%subtract.62 = fsub float %multiply.62, %add.57
%constant.95 = load float, ptr @8, align 4
%add.55 = fadd float %subtract.62, %constant.95
%multiply.61 = fmul float %add.59, %add.55
%subtract.61 = fsub float %multiply.61, %add.56
%constant.94 = load float, ptr @1, align 4
%add.54 = fadd float %subtract.61, %constant.94
%multiply.60 = fmul float %add.59, %add.54
%subtract.60 = fsub float %multiply.60, %add.55
%constant.93 = load float, ptr @10, align 4
%add.53 = fadd float %subtract.60, %constant.93
%multiply.59 = fmul float %add.59, %add.53
%subtract.59 = fsub float %multiply.59, %add.54
%constant.92 = load float, ptr @9, align 4
%add.52 = fadd float %subtract.59, %constant.92
%subtract.58 = fsub float %add.52, %add.54
%multiply.58 = fmul float %subtract.58, %constant.120
%9 = call float @llvm.sqrt.f32(float %7)
%10 = fdiv float 1.000000e+00, %9
%multiply.57 = fmul float %multiply.58, %10
%11 = trunc i8 %8 to i1
%12 = select i1 %11, float %multiply.66, float %multiply.57
%13 = fptrunc float %12 to half
%14 = getelementptr inbounds [3 x [1 x half]], ptr %fusion, i64 0, i64 %fusion.indvar.dim.0, i64 0
store half %13, ptr %14, align 2, !alias.scope !3
%invar.inc1 = add nuw nsw i64 %fusion.indvar.dim.1, 1
store i64 %invar.inc1, ptr %fusion.invar_address.dim.1, align 8
br label %fusion.loop_header.dim.1
fusion.loop_exit.dim.1: ; preds = %fusion.loop_header.dim.1
%invar.inc = add nuw nsw i64 %fusion.indvar.dim.0, 1
store i64 %invar.inc, ptr %fusion.invar_address.dim.0, align 8
br label %fusion.loop_header.dim.0
fusion.loop_exit.dim.0: ; preds = %fusion.loop_header.dim.0
br label %return
}
; Function Attrs: nocallback nofree nosync nounwind readnone speculatable willreturn
declare float @llvm.fabs.f32(float %0) #1
; Function Attrs: nocallback nofree nosync nounwind readnone speculatable willreturn
declare float @llvm.sqrt.f32(float %0) #1
attributes #0 = { uwtable "denormal-fp-math"="preserve-sign" "no-frame-pointer-elim"="false" }
attributes #1 = { nocallback nofree nosync nounwind readnone speculatable willreturn }
!0 = !{}
!1 = !{i64 6}
!2 = !{i64 8}
!3 = !{!4}
!4 = !{!"buffer: {index:0, offset:0, size:6}", !5}
!5 = !{!"XLA global AA domain"}
2022-06-15 18:04:42 -04:00
Phoebe Wang
e1c5afa47d
Reland "Reland "[X86][RFC] Enable `_Float16` type support on X86 following the psABI""
...
Fixed the missing SQRT promotion. Adding several missing operations too.
2022-06-15 23:00:18 +08:00
Thomas Joerg
37455b1f71
Revert "Reland "[X86][RFC] Enable `_Float16` type support on X86 following the psABI""
...
This reverts commit 6e02e27536
.
This introduces a crash in the backend. Reproducer in MLIR's LLVM
dialect follows. Let me know if you have trouble reproducing this.
module {
llvm.func @malloc(i64) -> !llvm.ptr<i8>
llvm.func @_mlir_ciface_tf_report_error(!llvm.ptr<i8>, i32, !llvm.ptr<i8>)
llvm.mlir.global internal constant @error_message_2208944672953921889("failed to allocate memory at loc(\22-\22:3:8)\00")
llvm.func @_mlir_ciface_tf_alloc(!llvm.ptr<i8>, i64, i64, i32, i32, !llvm.ptr<i32>) -> !llvm.ptr<i8>
llvm.func @Rsqrt_CPU_DT_HALF_DT_HALF(%arg0: !llvm.ptr<i8>, %arg1: i64, %arg2: !llvm.ptr<i8>) -> !llvm.struct<(i64, ptr<i8>)> attributes {llvm.emit_c_interface, tf_entry} {
%0 = llvm.mlir.constant(8 : i32) : i32
%1 = llvm.mlir.constant(8 : index) : i64
%2 = llvm.mlir.constant(2 : index) : i64
%3 = llvm.mlir.constant(dense<0.000000e+00> : vector<4xf16>) : vector<4xf16>
%4 = llvm.mlir.constant(dense<[0, 1, 2, 3]> : vector<4xi32>) : vector<4xi32>
%5 = llvm.mlir.constant(dense<1.000000e+00> : vector<4xf16>) : vector<4xf16>
%6 = llvm.mlir.constant(false) : i1
%7 = llvm.mlir.constant(1 : i32) : i32
%8 = llvm.mlir.constant(0 : i32) : i32
%9 = llvm.mlir.constant(4 : index) : i64
%10 = llvm.mlir.constant(0 : index) : i64
%11 = llvm.mlir.constant(1 : index) : i64
%12 = llvm.mlir.constant(-1 : index) : i64
%13 = llvm.mlir.null : !llvm.ptr<f16>
%14 = llvm.getelementptr %13[%9] : (!llvm.ptr<f16>, i64) -> !llvm.ptr<f16>
%15 = llvm.ptrtoint %14 : !llvm.ptr<f16> to i64
%16 = llvm.alloca %15 x f16 {alignment = 32 : i64} : (i64) -> !llvm.ptr<f16>
%17 = llvm.alloca %15 x f16 {alignment = 32 : i64} : (i64) -> !llvm.ptr<f16>
%18 = llvm.mlir.null : !llvm.ptr<i64>
%19 = llvm.getelementptr %18[%arg1] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
%20 = llvm.ptrtoint %19 : !llvm.ptr<i64> to i64
%21 = llvm.alloca %20 x i64 : (i64) -> !llvm.ptr<i64>
llvm.br ^bb1(%10 : i64)
^bb1(%22: i64): // 2 preds: ^bb0, ^bb2
%23 = llvm.icmp "slt" %22, %arg1 : i64
llvm.cond_br %23, ^bb2, ^bb3
^bb2: // pred: ^bb1
%24 = llvm.bitcast %arg2 : !llvm.ptr<i8> to !llvm.ptr<struct<(ptr<f16>, ptr<f16>, i64)>>
%25 = llvm.getelementptr %24[%10, 2] : (!llvm.ptr<struct<(ptr<f16>, ptr<f16>, i64)>>, i64) -> !llvm.ptr<i64>
%26 = llvm.add %22, %11 : i64
%27 = llvm.getelementptr %25[%26] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
%28 = llvm.load %27 : !llvm.ptr<i64>
%29 = llvm.getelementptr %21[%22] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
llvm.store %28, %29 : !llvm.ptr<i64>
llvm.br ^bb1(%26 : i64)
^bb3: // pred: ^bb1
llvm.br ^bb4(%10, %11 : i64, i64)
^bb4(%30: i64, %31: i64): // 2 preds: ^bb3, ^bb5
%32 = llvm.icmp "slt" %30, %arg1 : i64
llvm.cond_br %32, ^bb5, ^bb6
^bb5: // pred: ^bb4
%33 = llvm.bitcast %arg2 : !llvm.ptr<i8> to !llvm.ptr<struct<(ptr<f16>, ptr<f16>, i64)>>
%34 = llvm.getelementptr %33[%10, 2] : (!llvm.ptr<struct<(ptr<f16>, ptr<f16>, i64)>>, i64) -> !llvm.ptr<i64>
%35 = llvm.add %30, %11 : i64
%36 = llvm.getelementptr %34[%35] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
%37 = llvm.load %36 : !llvm.ptr<i64>
%38 = llvm.mul %37, %31 : i64
llvm.br ^bb4(%35, %38 : i64, i64)
^bb6: // pred: ^bb4
%39 = llvm.bitcast %arg2 : !llvm.ptr<i8> to !llvm.ptr<ptr<f16>>
%40 = llvm.getelementptr %39[%11] : (!llvm.ptr<ptr<f16>>, i64) -> !llvm.ptr<ptr<f16>>
%41 = llvm.load %40 : !llvm.ptr<ptr<f16>>
%42 = llvm.getelementptr %13[%11] : (!llvm.ptr<f16>, i64) -> !llvm.ptr<f16>
%43 = llvm.ptrtoint %42 : !llvm.ptr<f16> to i64
%44 = llvm.alloca %7 x i32 : (i32) -> !llvm.ptr<i32>
llvm.store %8, %44 : !llvm.ptr<i32>
%45 = llvm.call @_mlir_ciface_tf_alloc(%arg0, %31, %43, %8, %7, %44) : (!llvm.ptr<i8>, i64, i64, i32, i32, !llvm.ptr<i32>) -> !llvm.ptr<i8>
%46 = llvm.bitcast %45 : !llvm.ptr<i8> to !llvm.ptr<f16>
%47 = llvm.icmp "eq" %31, %10 : i64
%48 = llvm.or %6, %47 : i1
%49 = llvm.mlir.null : !llvm.ptr<i8>
%50 = llvm.icmp "ne" %45, %49 : !llvm.ptr<i8>
%51 = llvm.or %50, %48 : i1
llvm.cond_br %51, ^bb7, ^bb13
^bb7: // pred: ^bb6
%52 = llvm.urem %31, %9 : i64
%53 = llvm.sub %31, %52 : i64
llvm.br ^bb8(%10 : i64)
^bb8(%54: i64): // 2 preds: ^bb7, ^bb9
%55 = llvm.icmp "slt" %54, %53 : i64
llvm.cond_br %55, ^bb9, ^bb10
^bb9: // pred: ^bb8
%56 = llvm.mul %54, %11 : i64
%57 = llvm.add %56, %10 : i64
%58 = llvm.add %57, %10 : i64
%59 = llvm.getelementptr %41[%58] : (!llvm.ptr<f16>, i64) -> !llvm.ptr<f16>
%60 = llvm.bitcast %59 : !llvm.ptr<f16> to !llvm.ptr<vector<4xf16>>
%61 = llvm.load %60 {alignment = 2 : i64} : !llvm.ptr<vector<4xf16>>
%62 = "llvm.intr.sqrt"(%61) : (vector<4xf16>) -> vector<4xf16>
%63 = llvm.fdiv %5, %62 : vector<4xf16>
%64 = llvm.getelementptr %46[%58] : (!llvm.ptr<f16>, i64) -> !llvm.ptr<f16>
%65 = llvm.bitcast %64 : !llvm.ptr<f16> to !llvm.ptr<vector<4xf16>>
llvm.store %63, %65 {alignment = 2 : i64} : !llvm.ptr<vector<4xf16>>
%66 = llvm.add %54, %9 : i64
llvm.br ^bb8(%66 : i64)
^bb10: // pred: ^bb8
%67 = llvm.icmp "ult" %53, %31 : i64
llvm.cond_br %67, ^bb11, ^bb12
^bb11: // pred: ^bb10
%68 = llvm.mul %53, %12 : i64
%69 = llvm.add %31, %68 : i64
%70 = llvm.mul %53, %11 : i64
%71 = llvm.add %70, %10 : i64
%72 = llvm.trunc %69 : i64 to i32
%73 = llvm.mlir.undef : vector<4xi32>
%74 = llvm.insertelement %72, %73[%8 : i32] : vector<4xi32>
%75 = llvm.shufflevector %74, %73 [0 : i32, 0 : i32, 0 : i32, 0 : i32] : vector<4xi32>, vector<4xi32>
%76 = llvm.icmp "slt" %4, %75 : vector<4xi32>
%77 = llvm.add %71, %10 : i64
%78 = llvm.getelementptr %41[%77] : (!llvm.ptr<f16>, i64) -> !llvm.ptr<f16>
%79 = llvm.bitcast %78 : !llvm.ptr<f16> to !llvm.ptr<vector<4xf16>>
%80 = llvm.intr.masked.load %79, %76, %3 {alignment = 2 : i32} : (!llvm.ptr<vector<4xf16>>, vector<4xi1>, vector<4xf16>) -> vector<4xf16>
%81 = llvm.bitcast %16 : !llvm.ptr<f16> to !llvm.ptr<vector<4xf16>>
llvm.store %80, %81 : !llvm.ptr<vector<4xf16>>
%82 = llvm.load %81 {alignment = 2 : i64} : !llvm.ptr<vector<4xf16>>
%83 = "llvm.intr.sqrt"(%82) : (vector<4xf16>) -> vector<4xf16>
%84 = llvm.fdiv %5, %83 : vector<4xf16>
%85 = llvm.bitcast %17 : !llvm.ptr<f16> to !llvm.ptr<vector<4xf16>>
llvm.store %84, %85 {alignment = 2 : i64} : !llvm.ptr<vector<4xf16>>
%86 = llvm.load %85 : !llvm.ptr<vector<4xf16>>
%87 = llvm.getelementptr %46[%77] : (!llvm.ptr<f16>, i64) -> !llvm.ptr<f16>
%88 = llvm.bitcast %87 : !llvm.ptr<f16> to !llvm.ptr<vector<4xf16>>
llvm.intr.masked.store %86, %88, %76 {alignment = 2 : i32} : vector<4xf16>, vector<4xi1> into !llvm.ptr<vector<4xf16>>
llvm.br ^bb12
^bb12: // 2 preds: ^bb10, ^bb11
%89 = llvm.mul %2, %1 : i64
%90 = llvm.mul %arg1, %2 : i64
%91 = llvm.add %90, %11 : i64
%92 = llvm.mul %91, %1 : i64
%93 = llvm.add %89, %92 : i64
%94 = llvm.alloca %93 x i8 : (i64) -> !llvm.ptr<i8>
%95 = llvm.bitcast %94 : !llvm.ptr<i8> to !llvm.ptr<ptr<f16>>
llvm.store %46, %95 : !llvm.ptr<ptr<f16>>
%96 = llvm.getelementptr %95[%11] : (!llvm.ptr<ptr<f16>>, i64) -> !llvm.ptr<ptr<f16>>
llvm.store %46, %96 : !llvm.ptr<ptr<f16>>
%97 = llvm.getelementptr %95[%2] : (!llvm.ptr<ptr<f16>>, i64) -> !llvm.ptr<ptr<f16>>
%98 = llvm.bitcast %97 : !llvm.ptr<ptr<f16>> to !llvm.ptr<i64>
llvm.store %10, %98 : !llvm.ptr<i64>
%99 = llvm.bitcast %94 : !llvm.ptr<i8> to !llvm.ptr<struct<(ptr<f16>, ptr<f16>, i64, i64)>>
%100 = llvm.getelementptr %99[%10, 3] : (!llvm.ptr<struct<(ptr<f16>, ptr<f16>, i64, i64)>>, i64) -> !llvm.ptr<i64>
%101 = llvm.getelementptr %100[%arg1] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
%102 = llvm.sub %arg1, %11 : i64
llvm.br ^bb14(%102, %11 : i64, i64)
^bb13: // pred: ^bb6
%103 = llvm.mlir.addressof @error_message_2208944672953921889 : !llvm.ptr<array<42 x i8>>
%104 = llvm.getelementptr %103[%10, %10] : (!llvm.ptr<array<42 x i8>>, i64, i64) -> !llvm.ptr<i8>
llvm.call @_mlir_ciface_tf_report_error(%arg0, %0, %104) : (!llvm.ptr<i8>, i32, !llvm.ptr<i8>) -> ()
%105 = llvm.mul %2, %1 : i64
%106 = llvm.mul %2, %10 : i64
%107 = llvm.add %106, %11 : i64
%108 = llvm.mul %107, %1 : i64
%109 = llvm.add %105, %108 : i64
%110 = llvm.alloca %109 x i8 : (i64) -> !llvm.ptr<i8>
%111 = llvm.bitcast %110 : !llvm.ptr<i8> to !llvm.ptr<ptr<f16>>
llvm.store %13, %111 : !llvm.ptr<ptr<f16>>
%112 = llvm.getelementptr %111[%11] : (!llvm.ptr<ptr<f16>>, i64) -> !llvm.ptr<ptr<f16>>
llvm.store %13, %112 : !llvm.ptr<ptr<f16>>
%113 = llvm.getelementptr %111[%2] : (!llvm.ptr<ptr<f16>>, i64) -> !llvm.ptr<ptr<f16>>
%114 = llvm.bitcast %113 : !llvm.ptr<ptr<f16>> to !llvm.ptr<i64>
llvm.store %10, %114 : !llvm.ptr<i64>
%115 = llvm.call @malloc(%109) : (i64) -> !llvm.ptr<i8>
"llvm.intr.memcpy"(%115, %110, %109, %6) : (!llvm.ptr<i8>, !llvm.ptr<i8>, i64, i1) -> ()
%116 = llvm.mlir.undef : !llvm.struct<(i64, ptr<i8>)>
%117 = llvm.insertvalue %10, %116[0] : !llvm.struct<(i64, ptr<i8>)>
%118 = llvm.insertvalue %115, %117[1] : !llvm.struct<(i64, ptr<i8>)>
llvm.return %118 : !llvm.struct<(i64, ptr<i8>)>
^bb14(%119: i64, %120: i64): // 2 preds: ^bb12, ^bb15
%121 = llvm.icmp "sge" %119, %10 : i64
llvm.cond_br %121, ^bb15, ^bb16
^bb15: // pred: ^bb14
%122 = llvm.getelementptr %21[%119] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
%123 = llvm.load %122 : !llvm.ptr<i64>
%124 = llvm.getelementptr %100[%119] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
llvm.store %123, %124 : !llvm.ptr<i64>
%125 = llvm.getelementptr %101[%119] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
llvm.store %120, %125 : !llvm.ptr<i64>
%126 = llvm.mul %120, %123 : i64
%127 = llvm.sub %119, %11 : i64
llvm.br ^bb14(%127, %126 : i64, i64)
^bb16: // pred: ^bb14
%128 = llvm.call @malloc(%93) : (i64) -> !llvm.ptr<i8>
"llvm.intr.memcpy"(%128, %94, %93, %6) : (!llvm.ptr<i8>, !llvm.ptr<i8>, i64, i1) -> ()
%129 = llvm.mlir.undef : !llvm.struct<(i64, ptr<i8>)>
%130 = llvm.insertvalue %arg1, %129[0] : !llvm.struct<(i64, ptr<i8>)>
%131 = llvm.insertvalue %128, %130[1] : !llvm.struct<(i64, ptr<i8>)>
llvm.return %131 : !llvm.struct<(i64, ptr<i8>)>
}
llvm.func @_mlir_ciface_Rsqrt_CPU_DT_HALF_DT_HALF(%arg0: !llvm.ptr<struct<(i64, ptr<i8>)>>, %arg1: !llvm.ptr<i8>, %arg2: !llvm.ptr<struct<(i64, ptr<i8>)>>) attributes {llvm.emit_c_interface, tf_entry} {
%0 = llvm.load %arg2 : !llvm.ptr<struct<(i64, ptr<i8>)>>
%1 = llvm.extractvalue %0[0] : !llvm.struct<(i64, ptr<i8>)>
%2 = llvm.extractvalue %0[1] : !llvm.struct<(i64, ptr<i8>)>
%3 = llvm.call @Rsqrt_CPU_DT_HALF_DT_HALF(%arg1, %1, %2) : (!llvm.ptr<i8>, i64, !llvm.ptr<i8>) -> !llvm.struct<(i64, ptr<i8>)>
llvm.store %3, %arg0 : !llvm.ptr<struct<(i64, ptr<i8>)>>
llvm.return
}
}
2022-06-15 13:24:24 +02:00
Benjamin Kramer
fb34d531af
Promote bf16 to f32 when the target doesn't support it
...
This is modeled after the half-precision fp support. Two new nodes are
introduced for casting from and to bf16. Since casting from bf16 is a
simple operation I opted to always directly lower it to integer
arithmetic. The other way round is more complicated if you want to
preserve IEEE semantics, so it's handled by a new __truncsfbf2
compiler-rt builtin.
This is of course very bare bones, but sufficient to get a semi-softened
fadd on x86.
Possible future improvements:
- Targets with bf16 conversion instructions can now make fp_to_bf16 legal
- The software conversion to bf16 can be replaced by a trivial
implementation under fast math.
Differential Revision: https://reviews.llvm.org/D126953
2022-06-15 12:56:31 +02:00
Simon Pilgrim
4fd561415e
[X86] needCarryOrOverflowFlag/onlyZeroFlagUsed - merge identical switch cases. NFCI.
...
Makes it easier to grok and fixes various bugprone-branch-clone warnings.
2022-06-15 10:40:22 +01:00
Phoebe Wang
6e02e27536
Reland "[X86][RFC] Enable `_Float16` type support on X86 following the psABI"
...
Disabled 2 mlir tests due to the runtime doesn't support `_Float16`, see
the issue here https://github.com/llvm/llvm-project/issues/55992
2022-06-15 09:15:31 +08:00
Simon Pilgrim
64eea34420
[X86] combineEXTEND_VECTOR_INREG - don't attempt to shuffle combine ANY_EXTEND_VECTOR_INREG without SSE41
...
Without SSE41, ANY_EXTEND_VECTOR_INREG nodes are likely to be prematurely combined to a target shuffle preventing generic sign extension folds.
Fixes a number of sign-extend regressions in D127115.
2022-06-13 17:42:04 +01:00
Mehdi Amini
5d8298a768
Revert "[X86][RFC] Enable `_Float16` type support on X86 following the psABI"
...
This reverts commit 2d2da259c8
.
This breaks MLIR integration test (JIT crashing), reverting in the
meantime.
2022-06-12 15:14:37 +00:00
Simon Pilgrim
b5d7beeb97
[X86] combineConcatVectorOps - add support for concatenation of VSELECT/BLENDV nodes (REAPPLIED)
...
If the LHS/RHS selection operands can be cheaply concatenated back together then replace 2 x 128-bit selection nodes with 1 x 256-bit node
Addresses the regression introduced in the bug fix from rGd5af6a38082b39ae520a328e44dc29ebcb036bb2
REAPPLIED with for bug identified in rGea8fb3b60196
2022-06-12 15:40:36 +01:00