Commit Graph

780 Commits

Author SHA1 Message Date
Simon Pilgrim 5583d4e87b [CostModel][X86] Add cost kinds test coverage for select operators 2022-08-13 18:08:08 +01:00
Simon Pilgrim 3482523dcd [CostModel] Rename vselect-cost.ll to select.ll
This covers more than just vector select costs
2022-08-13 17:30:42 +01:00
Simon Pilgrim a22bbbfb73 [CostModel][X86] Add cost kinds test coverage for fp comparisons 2022-08-13 17:20:49 +01:00
Simon Pilgrim 6ba1427225 [CostModel][X86] Add cost kinds test coverage for fp arithmetic operators 2022-08-13 16:42:43 +01:00
Simon Pilgrim d2ce2f1b5c [CostModel][X86] Sync masked-intrinsic-cost.ll and masked-intrinsic-cost-inseltpoison.ll
We'd lost some type test coverage in masked-intrinsic-cost-inseltpoison.ll
2022-08-11 12:05:50 +01:00
Phoebe Wang f187948162 [X86][FP16] Enable vector support for FP16 emulation
This is follow up of D107082, which enable vector support according to psABI.

Reviewed By: skan

Differential Revision: https://reviews.llvm.org/D127982
2022-07-16 09:38:58 +08:00
Nabeel Omer 0d41794335 [SLP] Add cost model for `llvm.powi.*` intrinsics (REAPPLIED)
Patch was reverted in 4c5f10a due to buildbot failures, now being
reapplied with updated AArch64 and RISCV tests.

This patch adds handling for the llvm.powi.* intrinsics in
BasicTTIImplBase::getIntrinsicInstrCost() and improves vectorization.
Closes #53887.

Differential Revision: https://reviews.llvm.org/D128172
2022-06-24 10:23:19 +00:00
Nabeel Omer 4c5f10aeeb Revert rGe6ccb57bb3f6b761f2310e97fd6ca99eff42f73e "[SLP] Add cost model for `llvm.powi.*` intrinsics"
This reverts commit e6ccb57bb3.
2022-06-21 15:05:55 +00:00
Nabeel Omer e6ccb57bb3 [SLP] Add cost model for `llvm.powi.*` intrinsics
This patch adds handling for the llvm.powi.* intrinsics in
BasicTTIImplBase::getIntrinsicInstrCost() and improves vectorization.
Closes #53887.

Differential Revision: https://reviews.llvm.org/D128172
2022-06-21 14:40:34 +00:00
Phoebe Wang 655ba9c8a1 Reland "Reland "Reland "Reland "[X86][RFC] Enable `_Float16` type support on X86 following the psABI""""
This resolves problems reported in commit 1a20252978.
1. Promote to float lowering for nodes XINT_TO_FP
2. Bail out f16 from shuffle combine due to vector type is not legal in the version
2022-06-17 21:34:05 +08:00
Benjamin Kramer 1a20252978 Revert "Reland "Reland "Reland "[X86][RFC] Enable `_Float16` type support on X86 following the psABI""""
This reverts commit 04a3d5f3a1.

I see two more issues:

- uitofp/sitofp from i32/i64 to half now generates
  __floatsihf/__floatdihf, which exists in neither compiler-rt nor
  libgcc

- This crashes when legalizing the bitcast:
```
; RUN: llc < %s -mcpu=skx
define void @main.45(ptr nocapture readnone %retval, ptr noalias nocapture readnone %run_options, ptr noalias nocapture readnone %params, ptr noalias nocapture readonly %buffer_table, ptr noalias nocapture readnone %status, ptr noalias nocapture readnone %prof_counters) local_unnamed_addr {
entry:
  %fusion = load ptr, ptr %buffer_table, align 8
  %0 = getelementptr inbounds ptr, ptr %buffer_table, i64 1
  %Arg_1.2 = load ptr, ptr %0, align 8
  %1 = getelementptr inbounds ptr, ptr %buffer_table, i64 2
  %Arg_0.1 = load ptr, ptr %1, align 8
  %2 = load half, ptr %Arg_0.1, align 8
  %3 = bitcast half %2 to i16
  %4 = and i16 %3, 32767
  %5 = icmp eq i16 %4, 0
  %6 = and i16 %3, -32768
  %broadcast.splatinsert = insertelement <4 x half> poison, half %2, i64 0
  %broadcast.splat = shufflevector <4 x half> %broadcast.splatinsert, <4 x half> poison, <4 x i32> zeroinitializer
  %broadcast.splatinsert9 = insertelement <4 x i16> poison, i16 %4, i64 0
  %broadcast.splat10 = shufflevector <4 x i16> %broadcast.splatinsert9, <4 x i16> poison, <4 x i32> zeroinitializer
  %broadcast.splatinsert11 = insertelement <4 x i16> poison, i16 %6, i64 0
  %broadcast.splat12 = shufflevector <4 x i16> %broadcast.splatinsert11, <4 x i16> poison, <4 x i32> zeroinitializer
  %broadcast.splatinsert13 = insertelement <4 x i16> poison, i16 %3, i64 0
  %broadcast.splat14 = shufflevector <4 x i16> %broadcast.splatinsert13, <4 x i16> poison, <4 x i32> zeroinitializer
  %wide.load = load <4 x half>, ptr %Arg_1.2, align 8
  %7 = fcmp uno <4 x half> %broadcast.splat, %wide.load
  %8 = fcmp oeq <4 x half> %broadcast.splat, %wide.load
  %9 = bitcast <4 x half> %wide.load to <4 x i16>
  %10 = and <4 x i16> %9, <i16 32767, i16 32767, i16 32767, i16 32767>
  %11 = icmp eq <4 x i16> %10, zeroinitializer
  %12 = and <4 x i16> %9, <i16 -32768, i16 -32768, i16 -32768, i16 -32768>
  %13 = or <4 x i16> %12, <i16 1, i16 1, i16 1, i16 1>
  %14 = select <4 x i1> %11, <4 x i16> %9, <4 x i16> %13
  %15 = icmp ugt <4 x i16> %broadcast.splat10, %10
  %16 = icmp ne <4 x i16> %broadcast.splat12, %12
  %17 = or <4 x i1> %15, %16
  %18 = select <4 x i1> %17, <4 x i16> <i16 -1, i16 -1, i16 -1, i16 -1>, <4 x i16> <i16 1, i16 1, i16 1, i16 1>
  %19 = add <4 x i16> %18, %broadcast.splat14
  %20 = select i1 %5, <4 x i16> %14, <4 x i16> %19
  %21 = select <4 x i1> %8, <4 x i16> %9, <4 x i16> %20
  %22 = bitcast <4 x i16> %21 to <4 x half>
  %23 = select <4 x i1> %7, <4 x half> <half 0xH7E00, half 0xH7E00, half 0xH7E00, half 0xH7E00>, <4 x half> %22
  store <4 x half> %23, ptr %fusion, align 16
  ret void
}
```

llc: llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp:977: void (anonymous namespace)::SelectionDAGLegalize::LegalizeOp(llvm::SDNode *): Assertion `(TLI.getTypeAction(*DAG.getContext(), Op.getValueType()) == TargetLowering::TypeLegal || Op.getOpcode() == ISD::TargetConstant || Op.getOpcode() == ISD::Register) && "Unexpected illegal type!"' failed.
2022-06-17 09:43:07 +02:00
Phoebe Wang 04a3d5f3a1 Reland "Reland "Reland "[X86][RFC] Enable `_Float16` type support on X86 following the psABI"""
Fix the crash on lowering X86ISD::FCMP.
2022-06-17 12:12:17 +08:00
Frederik Gossen 3cd5696a33 Revert "Reland "Reland "[X86][RFC] Enable `_Float16` type support on X86 following the psABI"""
This reverts commit e1c5afa47d.

This introduces crashes in the JAX backend on CPU. A reproducer in LLVM is
below. Let me know if you have trouble reproducing this.

; ModuleID = '__compute_module'
source_filename = "__compute_module"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-grtev4-linux-gnu"

@0 = private unnamed_addr constant [4 x i8] c"\00\00\00?"
@1 = private unnamed_addr constant [4 x i8] c"\1C}\908"
@2 = private unnamed_addr constant [4 x i8] c"?\00\\4"
@3 = private unnamed_addr constant [4 x i8] c"%ci1"
@4 = private unnamed_addr constant [4 x i8] zeroinitializer
@5 = private unnamed_addr constant [4 x i8] c"\00\00\00\C0"
@6 = private unnamed_addr constant [4 x i8] c"\00\00\00B"
@7 = private unnamed_addr constant [4 x i8] c"\94\B4\C22"
@8 = private unnamed_addr constant [4 x i8] c"^\09B6"
@9 = private unnamed_addr constant [4 x i8] c"\15\F3M?"
@10 = private unnamed_addr constant [4 x i8] c"e\CC\\;"
@11 = private unnamed_addr constant [4 x i8] c"d\BD/>"
@12 = private unnamed_addr constant [4 x i8] c"V\F4I="
@13 = private unnamed_addr constant [4 x i8] c"\10\CB,<"
@14 = private unnamed_addr constant [4 x i8] c"\AC\E3\D6:"
@15 = private unnamed_addr constant [4 x i8] c"\DC\A8E9"
@16 = private unnamed_addr constant [4 x i8] c"\C6\FA\897"
@17 = private unnamed_addr constant [4 x i8] c"%\F9\955"
@18 = private unnamed_addr constant [4 x i8] c"\B5\DB\813"
@19 = private unnamed_addr constant [4 x i8] c"\B4W_\B2"
@20 = private unnamed_addr constant [4 x i8] c"\1Cc\8F\B4"
@21 = private unnamed_addr constant [4 x i8] c"~3\94\B6"
@22 = private unnamed_addr constant [4 x i8] c"3Yq\B8"
@23 = private unnamed_addr constant [4 x i8] c"\E9\17\17\BA"
@24 = private unnamed_addr constant [4 x i8] c"\F1\B2\8D\BB"
@25 = private unnamed_addr constant [4 x i8] c"\F8t\C2\BC"
@26 = private unnamed_addr constant [4 x i8] c"\82[\C2\BD"
@27 = private unnamed_addr constant [4 x i8] c"uB-?"
@28 = private unnamed_addr constant [4 x i8] c"^\FF\9B\BE"
@29 = private unnamed_addr constant [4 x i8] c"\00\00\00A"

; Function Attrs: uwtable
define void @main.158(ptr %retval, ptr noalias %run_options, ptr noalias %params, ptr noalias %buffer_table, ptr noalias %status, ptr noalias %prof_counters) #0 {
entry:
  %fusion.invar_address.dim.1 = alloca i64, align 8
  %fusion.invar_address.dim.0 = alloca i64, align 8
  %0 = getelementptr inbounds ptr, ptr %buffer_table, i64 1
  %Arg_0.1 = load ptr, ptr %0, align 8, !invariant.load !0, !dereferenceable !1, !align !2
  %1 = getelementptr inbounds ptr, ptr %buffer_table, i64 0
  %fusion = load ptr, ptr %1, align 8, !invariant.load !0, !dereferenceable !1, !align !2
  store i64 0, ptr %fusion.invar_address.dim.0, align 8
  br label %fusion.loop_header.dim.0

return:                                           ; preds = %fusion.loop_exit.dim.0
  ret void

fusion.loop_header.dim.0:                         ; preds = %fusion.loop_exit.dim.1, %entry
  %fusion.indvar.dim.0 = load i64, ptr %fusion.invar_address.dim.0, align 8
  %2 = icmp uge i64 %fusion.indvar.dim.0, 3
  br i1 %2, label %fusion.loop_exit.dim.0, label %fusion.loop_body.dim.0

fusion.loop_body.dim.0:                           ; preds = %fusion.loop_header.dim.0
  store i64 0, ptr %fusion.invar_address.dim.1, align 8
  br label %fusion.loop_header.dim.1

fusion.loop_header.dim.1:                         ; preds = %fusion.loop_body.dim.1, %fusion.loop_body.dim.0
  %fusion.indvar.dim.1 = load i64, ptr %fusion.invar_address.dim.1, align 8
  %3 = icmp uge i64 %fusion.indvar.dim.1, 1
  br i1 %3, label %fusion.loop_exit.dim.1, label %fusion.loop_body.dim.1

fusion.loop_body.dim.1:                           ; preds = %fusion.loop_header.dim.1
  %4 = getelementptr inbounds [3 x [1 x half]], ptr %Arg_0.1, i64 0, i64 %fusion.indvar.dim.0, i64 0
  %5 = load half, ptr %4, align 2, !invariant.load !0, !noalias !3
  %6 = fpext half %5 to float
  %7 = call float @llvm.fabs.f32(float %6)
  %constant.121 = load float, ptr @29, align 4
  %compare.2 = fcmp ole float %7, %constant.121
  %8 = zext i1 %compare.2 to i8
  %constant.120 = load float, ptr @0, align 4
  %multiply.95 = fmul float %7, %constant.120
  %constant.119 = load float, ptr @5, align 4
  %add.82 = fadd float %multiply.95, %constant.119
  %constant.118 = load float, ptr @4, align 4
  %multiply.94 = fmul float %add.82, %constant.118
  %constant.117 = load float, ptr @19, align 4
  %add.81 = fadd float %multiply.94, %constant.117
  %multiply.92 = fmul float %add.82, %add.81
  %constant.116 = load float, ptr @18, align 4
  %add.79 = fadd float %multiply.92, %constant.116
  %multiply.91 = fmul float %add.82, %add.79
  %subtract.87 = fsub float %multiply.91, %add.81
  %constant.115 = load float, ptr @20, align 4
  %add.78 = fadd float %subtract.87, %constant.115
  %multiply.89 = fmul float %add.82, %add.78
  %subtract.86 = fsub float %multiply.89, %add.79
  %constant.114 = load float, ptr @17, align 4
  %add.76 = fadd float %subtract.86, %constant.114
  %multiply.88 = fmul float %add.82, %add.76
  %subtract.84 = fsub float %multiply.88, %add.78
  %constant.113 = load float, ptr @21, align 4
  %add.75 = fadd float %subtract.84, %constant.113
  %multiply.86 = fmul float %add.82, %add.75
  %subtract.83 = fsub float %multiply.86, %add.76
  %constant.112 = load float, ptr @16, align 4
  %add.73 = fadd float %subtract.83, %constant.112
  %multiply.85 = fmul float %add.82, %add.73
  %subtract.81 = fsub float %multiply.85, %add.75
  %constant.111 = load float, ptr @22, align 4
  %add.72 = fadd float %subtract.81, %constant.111
  %multiply.83 = fmul float %add.82, %add.72
  %subtract.80 = fsub float %multiply.83, %add.73
  %constant.110 = load float, ptr @15, align 4
  %add.70 = fadd float %subtract.80, %constant.110
  %multiply.82 = fmul float %add.82, %add.70
  %subtract.78 = fsub float %multiply.82, %add.72
  %constant.109 = load float, ptr @23, align 4
  %add.69 = fadd float %subtract.78, %constant.109
  %multiply.80 = fmul float %add.82, %add.69
  %subtract.77 = fsub float %multiply.80, %add.70
  %constant.108 = load float, ptr @14, align 4
  %add.68 = fadd float %subtract.77, %constant.108
  %multiply.79 = fmul float %add.82, %add.68
  %subtract.75 = fsub float %multiply.79, %add.69
  %constant.107 = load float, ptr @24, align 4
  %add.67 = fadd float %subtract.75, %constant.107
  %multiply.77 = fmul float %add.82, %add.67
  %subtract.74 = fsub float %multiply.77, %add.68
  %constant.106 = load float, ptr @13, align 4
  %add.66 = fadd float %subtract.74, %constant.106
  %multiply.76 = fmul float %add.82, %add.66
  %subtract.72 = fsub float %multiply.76, %add.67
  %constant.105 = load float, ptr @25, align 4
  %add.65 = fadd float %subtract.72, %constant.105
  %multiply.74 = fmul float %add.82, %add.65
  %subtract.71 = fsub float %multiply.74, %add.66
  %constant.104 = load float, ptr @12, align 4
  %add.64 = fadd float %subtract.71, %constant.104
  %multiply.73 = fmul float %add.82, %add.64
  %subtract.69 = fsub float %multiply.73, %add.65
  %constant.103 = load float, ptr @26, align 4
  %add.63 = fadd float %subtract.69, %constant.103
  %multiply.71 = fmul float %add.82, %add.63
  %subtract.67 = fsub float %multiply.71, %add.64
  %constant.102 = load float, ptr @11, align 4
  %add.62 = fadd float %subtract.67, %constant.102
  %multiply.70 = fmul float %add.82, %add.62
  %subtract.66 = fsub float %multiply.70, %add.63
  %constant.101 = load float, ptr @28, align 4
  %add.61 = fadd float %subtract.66, %constant.101
  %multiply.68 = fmul float %add.82, %add.61
  %subtract.65 = fsub float %multiply.68, %add.62
  %constant.100 = load float, ptr @27, align 4
  %add.60 = fadd float %subtract.65, %constant.100
  %subtract.64 = fsub float %add.60, %add.62
  %multiply.66 = fmul float %subtract.64, %constant.120
  %constant.99 = load float, ptr @6, align 4
  %divide.4 = fdiv float %constant.99, %7
  %add.59 = fadd float %divide.4, %constant.119
  %multiply.65 = fmul float %add.59, %constant.118
  %constant.98 = load float, ptr @3, align 4
  %add.58 = fadd float %multiply.65, %constant.98
  %multiply.64 = fmul float %add.59, %add.58
  %constant.97 = load float, ptr @7, align 4
  %add.57 = fadd float %multiply.64, %constant.97
  %multiply.63 = fmul float %add.59, %add.57
  %subtract.63 = fsub float %multiply.63, %add.58
  %constant.96 = load float, ptr @2, align 4
  %add.56 = fadd float %subtract.63, %constant.96
  %multiply.62 = fmul float %add.59, %add.56
  %subtract.62 = fsub float %multiply.62, %add.57
  %constant.95 = load float, ptr @8, align 4
  %add.55 = fadd float %subtract.62, %constant.95
  %multiply.61 = fmul float %add.59, %add.55
  %subtract.61 = fsub float %multiply.61, %add.56
  %constant.94 = load float, ptr @1, align 4
  %add.54 = fadd float %subtract.61, %constant.94
  %multiply.60 = fmul float %add.59, %add.54
  %subtract.60 = fsub float %multiply.60, %add.55
  %constant.93 = load float, ptr @10, align 4
  %add.53 = fadd float %subtract.60, %constant.93
  %multiply.59 = fmul float %add.59, %add.53
  %subtract.59 = fsub float %multiply.59, %add.54
  %constant.92 = load float, ptr @9, align 4
  %add.52 = fadd float %subtract.59, %constant.92
  %subtract.58 = fsub float %add.52, %add.54
  %multiply.58 = fmul float %subtract.58, %constant.120
  %9 = call float @llvm.sqrt.f32(float %7)
  %10 = fdiv float 1.000000e+00, %9
  %multiply.57 = fmul float %multiply.58, %10
  %11 = trunc i8 %8 to i1
  %12 = select i1 %11, float %multiply.66, float %multiply.57
  %13 = fptrunc float %12 to half
  %14 = getelementptr inbounds [3 x [1 x half]], ptr %fusion, i64 0, i64 %fusion.indvar.dim.0, i64 0
  store half %13, ptr %14, align 2, !alias.scope !3
  %invar.inc1 = add nuw nsw i64 %fusion.indvar.dim.1, 1
  store i64 %invar.inc1, ptr %fusion.invar_address.dim.1, align 8
  br label %fusion.loop_header.dim.1

fusion.loop_exit.dim.1:                           ; preds = %fusion.loop_header.dim.1
  %invar.inc = add nuw nsw i64 %fusion.indvar.dim.0, 1
  store i64 %invar.inc, ptr %fusion.invar_address.dim.0, align 8
  br label %fusion.loop_header.dim.0

fusion.loop_exit.dim.0:                           ; preds = %fusion.loop_header.dim.0
  br label %return
}

; Function Attrs: nocallback nofree nosync nounwind readnone speculatable willreturn
declare float @llvm.fabs.f32(float %0) #1

; Function Attrs: nocallback nofree nosync nounwind readnone speculatable willreturn
declare float @llvm.sqrt.f32(float %0) #1

attributes #0 = { uwtable "denormal-fp-math"="preserve-sign" "no-frame-pointer-elim"="false" }
attributes #1 = { nocallback nofree nosync nounwind readnone speculatable willreturn }

!0 = !{}
!1 = !{i64 6}
!2 = !{i64 8}
!3 = !{!4}
!4 = !{!"buffer: {index:0, offset:0, size:6}", !5}
!5 = !{!"XLA global AA domain"}
2022-06-15 18:04:42 -04:00
Phoebe Wang e1c5afa47d Reland "Reland "[X86][RFC] Enable `_Float16` type support on X86 following the psABI""
Fixed the missing SQRT promotion. Adding several missing operations too.
2022-06-15 23:00:18 +08:00
Thomas Joerg 37455b1f71 Revert "Reland "[X86][RFC] Enable `_Float16` type support on X86 following the psABI""
This reverts commit 6e02e27536.

This introduces a crash in the backend. Reproducer in MLIR's LLVM
dialect follows. Let me know if you have trouble reproducing this.

module {
  llvm.func @malloc(i64) -> !llvm.ptr<i8>
  llvm.func @_mlir_ciface_tf_report_error(!llvm.ptr<i8>, i32, !llvm.ptr<i8>)
  llvm.mlir.global internal constant @error_message_2208944672953921889("failed to allocate memory at loc(\22-\22:3:8)\00")
  llvm.func @_mlir_ciface_tf_alloc(!llvm.ptr<i8>, i64, i64, i32, i32, !llvm.ptr<i32>) -> !llvm.ptr<i8>
  llvm.func @Rsqrt_CPU_DT_HALF_DT_HALF(%arg0: !llvm.ptr<i8>, %arg1: i64, %arg2: !llvm.ptr<i8>) -> !llvm.struct<(i64, ptr<i8>)> attributes {llvm.emit_c_interface, tf_entry} {
    %0 = llvm.mlir.constant(8 : i32) : i32
    %1 = llvm.mlir.constant(8 : index) : i64
    %2 = llvm.mlir.constant(2 : index) : i64
    %3 = llvm.mlir.constant(dense<0.000000e+00> : vector<4xf16>) : vector<4xf16>
    %4 = llvm.mlir.constant(dense<[0, 1, 2, 3]> : vector<4xi32>) : vector<4xi32>
    %5 = llvm.mlir.constant(dense<1.000000e+00> : vector<4xf16>) : vector<4xf16>
    %6 = llvm.mlir.constant(false) : i1
    %7 = llvm.mlir.constant(1 : i32) : i32
    %8 = llvm.mlir.constant(0 : i32) : i32
    %9 = llvm.mlir.constant(4 : index) : i64
    %10 = llvm.mlir.constant(0 : index) : i64
    %11 = llvm.mlir.constant(1 : index) : i64
    %12 = llvm.mlir.constant(-1 : index) : i64
    %13 = llvm.mlir.null : !llvm.ptr<f16>
    %14 = llvm.getelementptr %13[%9] : (!llvm.ptr<f16>, i64) -> !llvm.ptr<f16>
    %15 = llvm.ptrtoint %14 : !llvm.ptr<f16> to i64
    %16 = llvm.alloca %15 x f16 {alignment = 32 : i64} : (i64) -> !llvm.ptr<f16>
    %17 = llvm.alloca %15 x f16 {alignment = 32 : i64} : (i64) -> !llvm.ptr<f16>
    %18 = llvm.mlir.null : !llvm.ptr<i64>
    %19 = llvm.getelementptr %18[%arg1] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
    %20 = llvm.ptrtoint %19 : !llvm.ptr<i64> to i64
    %21 = llvm.alloca %20 x i64 : (i64) -> !llvm.ptr<i64>
    llvm.br ^bb1(%10 : i64)
  ^bb1(%22: i64):  // 2 preds: ^bb0, ^bb2
    %23 = llvm.icmp "slt" %22, %arg1 : i64
    llvm.cond_br %23, ^bb2, ^bb3
  ^bb2:  // pred: ^bb1
    %24 = llvm.bitcast %arg2 : !llvm.ptr<i8> to !llvm.ptr<struct<(ptr<f16>, ptr<f16>, i64)>>
    %25 = llvm.getelementptr %24[%10, 2] : (!llvm.ptr<struct<(ptr<f16>, ptr<f16>, i64)>>, i64) -> !llvm.ptr<i64>
    %26 = llvm.add %22, %11  : i64
    %27 = llvm.getelementptr %25[%26] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
    %28 = llvm.load %27 : !llvm.ptr<i64>
    %29 = llvm.getelementptr %21[%22] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
    llvm.store %28, %29 : !llvm.ptr<i64>
    llvm.br ^bb1(%26 : i64)
  ^bb3:  // pred: ^bb1
    llvm.br ^bb4(%10, %11 : i64, i64)
  ^bb4(%30: i64, %31: i64):  // 2 preds: ^bb3, ^bb5
    %32 = llvm.icmp "slt" %30, %arg1 : i64
    llvm.cond_br %32, ^bb5, ^bb6
  ^bb5:  // pred: ^bb4
    %33 = llvm.bitcast %arg2 : !llvm.ptr<i8> to !llvm.ptr<struct<(ptr<f16>, ptr<f16>, i64)>>
    %34 = llvm.getelementptr %33[%10, 2] : (!llvm.ptr<struct<(ptr<f16>, ptr<f16>, i64)>>, i64) -> !llvm.ptr<i64>
    %35 = llvm.add %30, %11  : i64
    %36 = llvm.getelementptr %34[%35] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
    %37 = llvm.load %36 : !llvm.ptr<i64>
    %38 = llvm.mul %37, %31  : i64
    llvm.br ^bb4(%35, %38 : i64, i64)
  ^bb6:  // pred: ^bb4
    %39 = llvm.bitcast %arg2 : !llvm.ptr<i8> to !llvm.ptr<ptr<f16>>
    %40 = llvm.getelementptr %39[%11] : (!llvm.ptr<ptr<f16>>, i64) -> !llvm.ptr<ptr<f16>>
    %41 = llvm.load %40 : !llvm.ptr<ptr<f16>>
    %42 = llvm.getelementptr %13[%11] : (!llvm.ptr<f16>, i64) -> !llvm.ptr<f16>
    %43 = llvm.ptrtoint %42 : !llvm.ptr<f16> to i64
    %44 = llvm.alloca %7 x i32 : (i32) -> !llvm.ptr<i32>
    llvm.store %8, %44 : !llvm.ptr<i32>
    %45 = llvm.call @_mlir_ciface_tf_alloc(%arg0, %31, %43, %8, %7, %44) : (!llvm.ptr<i8>, i64, i64, i32, i32, !llvm.ptr<i32>) -> !llvm.ptr<i8>
    %46 = llvm.bitcast %45 : !llvm.ptr<i8> to !llvm.ptr<f16>
    %47 = llvm.icmp "eq" %31, %10 : i64
    %48 = llvm.or %6, %47  : i1
    %49 = llvm.mlir.null : !llvm.ptr<i8>
    %50 = llvm.icmp "ne" %45, %49 : !llvm.ptr<i8>
    %51 = llvm.or %50, %48  : i1
    llvm.cond_br %51, ^bb7, ^bb13
  ^bb7:  // pred: ^bb6
    %52 = llvm.urem %31, %9  : i64
    %53 = llvm.sub %31, %52  : i64
    llvm.br ^bb8(%10 : i64)
  ^bb8(%54: i64):  // 2 preds: ^bb7, ^bb9
    %55 = llvm.icmp "slt" %54, %53 : i64
    llvm.cond_br %55, ^bb9, ^bb10
  ^bb9:  // pred: ^bb8
    %56 = llvm.mul %54, %11  : i64
    %57 = llvm.add %56, %10  : i64
    %58 = llvm.add %57, %10  : i64
    %59 = llvm.getelementptr %41[%58] : (!llvm.ptr<f16>, i64) -> !llvm.ptr<f16>
    %60 = llvm.bitcast %59 : !llvm.ptr<f16> to !llvm.ptr<vector<4xf16>>
    %61 = llvm.load %60 {alignment = 2 : i64} : !llvm.ptr<vector<4xf16>>
    %62 = "llvm.intr.sqrt"(%61) : (vector<4xf16>) -> vector<4xf16>
    %63 = llvm.fdiv %5, %62  : vector<4xf16>
    %64 = llvm.getelementptr %46[%58] : (!llvm.ptr<f16>, i64) -> !llvm.ptr<f16>
    %65 = llvm.bitcast %64 : !llvm.ptr<f16> to !llvm.ptr<vector<4xf16>>
    llvm.store %63, %65 {alignment = 2 : i64} : !llvm.ptr<vector<4xf16>>
    %66 = llvm.add %54, %9  : i64
    llvm.br ^bb8(%66 : i64)
  ^bb10:  // pred: ^bb8
    %67 = llvm.icmp "ult" %53, %31 : i64
    llvm.cond_br %67, ^bb11, ^bb12
  ^bb11:  // pred: ^bb10
    %68 = llvm.mul %53, %12  : i64
    %69 = llvm.add %31, %68  : i64
    %70 = llvm.mul %53, %11  : i64
    %71 = llvm.add %70, %10  : i64
    %72 = llvm.trunc %69 : i64 to i32
    %73 = llvm.mlir.undef : vector<4xi32>
    %74 = llvm.insertelement %72, %73[%8 : i32] : vector<4xi32>
    %75 = llvm.shufflevector %74, %73 [0 : i32, 0 : i32, 0 : i32, 0 : i32] : vector<4xi32>, vector<4xi32>
    %76 = llvm.icmp "slt" %4, %75 : vector<4xi32>
    %77 = llvm.add %71, %10  : i64
    %78 = llvm.getelementptr %41[%77] : (!llvm.ptr<f16>, i64) -> !llvm.ptr<f16>
    %79 = llvm.bitcast %78 : !llvm.ptr<f16> to !llvm.ptr<vector<4xf16>>
    %80 = llvm.intr.masked.load %79, %76, %3 {alignment = 2 : i32} : (!llvm.ptr<vector<4xf16>>, vector<4xi1>, vector<4xf16>) -> vector<4xf16>
    %81 = llvm.bitcast %16 : !llvm.ptr<f16> to !llvm.ptr<vector<4xf16>>
    llvm.store %80, %81 : !llvm.ptr<vector<4xf16>>
    %82 = llvm.load %81 {alignment = 2 : i64} : !llvm.ptr<vector<4xf16>>
    %83 = "llvm.intr.sqrt"(%82) : (vector<4xf16>) -> vector<4xf16>
    %84 = llvm.fdiv %5, %83  : vector<4xf16>
    %85 = llvm.bitcast %17 : !llvm.ptr<f16> to !llvm.ptr<vector<4xf16>>
    llvm.store %84, %85 {alignment = 2 : i64} : !llvm.ptr<vector<4xf16>>
    %86 = llvm.load %85 : !llvm.ptr<vector<4xf16>>
    %87 = llvm.getelementptr %46[%77] : (!llvm.ptr<f16>, i64) -> !llvm.ptr<f16>
    %88 = llvm.bitcast %87 : !llvm.ptr<f16> to !llvm.ptr<vector<4xf16>>
    llvm.intr.masked.store %86, %88, %76 {alignment = 2 : i32} : vector<4xf16>, vector<4xi1> into !llvm.ptr<vector<4xf16>>
    llvm.br ^bb12
  ^bb12:  // 2 preds: ^bb10, ^bb11
    %89 = llvm.mul %2, %1  : i64
    %90 = llvm.mul %arg1, %2  : i64
    %91 = llvm.add %90, %11  : i64
    %92 = llvm.mul %91, %1  : i64
    %93 = llvm.add %89, %92  : i64
    %94 = llvm.alloca %93 x i8 : (i64) -> !llvm.ptr<i8>
    %95 = llvm.bitcast %94 : !llvm.ptr<i8> to !llvm.ptr<ptr<f16>>
    llvm.store %46, %95 : !llvm.ptr<ptr<f16>>
    %96 = llvm.getelementptr %95[%11] : (!llvm.ptr<ptr<f16>>, i64) -> !llvm.ptr<ptr<f16>>
    llvm.store %46, %96 : !llvm.ptr<ptr<f16>>
    %97 = llvm.getelementptr %95[%2] : (!llvm.ptr<ptr<f16>>, i64) -> !llvm.ptr<ptr<f16>>
    %98 = llvm.bitcast %97 : !llvm.ptr<ptr<f16>> to !llvm.ptr<i64>
    llvm.store %10, %98 : !llvm.ptr<i64>
    %99 = llvm.bitcast %94 : !llvm.ptr<i8> to !llvm.ptr<struct<(ptr<f16>, ptr<f16>, i64, i64)>>
    %100 = llvm.getelementptr %99[%10, 3] : (!llvm.ptr<struct<(ptr<f16>, ptr<f16>, i64, i64)>>, i64) -> !llvm.ptr<i64>
    %101 = llvm.getelementptr %100[%arg1] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
    %102 = llvm.sub %arg1, %11  : i64
    llvm.br ^bb14(%102, %11 : i64, i64)
  ^bb13:  // pred: ^bb6
    %103 = llvm.mlir.addressof @error_message_2208944672953921889 : !llvm.ptr<array<42 x i8>>
    %104 = llvm.getelementptr %103[%10, %10] : (!llvm.ptr<array<42 x i8>>, i64, i64) -> !llvm.ptr<i8>
    llvm.call @_mlir_ciface_tf_report_error(%arg0, %0, %104) : (!llvm.ptr<i8>, i32, !llvm.ptr<i8>) -> ()
    %105 = llvm.mul %2, %1  : i64
    %106 = llvm.mul %2, %10  : i64
    %107 = llvm.add %106, %11  : i64
    %108 = llvm.mul %107, %1  : i64
    %109 = llvm.add %105, %108  : i64
    %110 = llvm.alloca %109 x i8 : (i64) -> !llvm.ptr<i8>
    %111 = llvm.bitcast %110 : !llvm.ptr<i8> to !llvm.ptr<ptr<f16>>
    llvm.store %13, %111 : !llvm.ptr<ptr<f16>>
    %112 = llvm.getelementptr %111[%11] : (!llvm.ptr<ptr<f16>>, i64) -> !llvm.ptr<ptr<f16>>
    llvm.store %13, %112 : !llvm.ptr<ptr<f16>>
    %113 = llvm.getelementptr %111[%2] : (!llvm.ptr<ptr<f16>>, i64) -> !llvm.ptr<ptr<f16>>
    %114 = llvm.bitcast %113 : !llvm.ptr<ptr<f16>> to !llvm.ptr<i64>
    llvm.store %10, %114 : !llvm.ptr<i64>
    %115 = llvm.call @malloc(%109) : (i64) -> !llvm.ptr<i8>
    "llvm.intr.memcpy"(%115, %110, %109, %6) : (!llvm.ptr<i8>, !llvm.ptr<i8>, i64, i1) -> ()
    %116 = llvm.mlir.undef : !llvm.struct<(i64, ptr<i8>)>
    %117 = llvm.insertvalue %10, %116[0] : !llvm.struct<(i64, ptr<i8>)>
    %118 = llvm.insertvalue %115, %117[1] : !llvm.struct<(i64, ptr<i8>)>
    llvm.return %118 : !llvm.struct<(i64, ptr<i8>)>
  ^bb14(%119: i64, %120: i64):  // 2 preds: ^bb12, ^bb15
    %121 = llvm.icmp "sge" %119, %10 : i64
    llvm.cond_br %121, ^bb15, ^bb16
  ^bb15:  // pred: ^bb14
    %122 = llvm.getelementptr %21[%119] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
    %123 = llvm.load %122 : !llvm.ptr<i64>
    %124 = llvm.getelementptr %100[%119] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
    llvm.store %123, %124 : !llvm.ptr<i64>
    %125 = llvm.getelementptr %101[%119] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
    llvm.store %120, %125 : !llvm.ptr<i64>
    %126 = llvm.mul %120, %123  : i64
    %127 = llvm.sub %119, %11  : i64
    llvm.br ^bb14(%127, %126 : i64, i64)
  ^bb16:  // pred: ^bb14
    %128 = llvm.call @malloc(%93) : (i64) -> !llvm.ptr<i8>
    "llvm.intr.memcpy"(%128, %94, %93, %6) : (!llvm.ptr<i8>, !llvm.ptr<i8>, i64, i1) -> ()
    %129 = llvm.mlir.undef : !llvm.struct<(i64, ptr<i8>)>
    %130 = llvm.insertvalue %arg1, %129[0] : !llvm.struct<(i64, ptr<i8>)>
    %131 = llvm.insertvalue %128, %130[1] : !llvm.struct<(i64, ptr<i8>)>
    llvm.return %131 : !llvm.struct<(i64, ptr<i8>)>
  }
  llvm.func @_mlir_ciface_Rsqrt_CPU_DT_HALF_DT_HALF(%arg0: !llvm.ptr<struct<(i64, ptr<i8>)>>, %arg1: !llvm.ptr<i8>, %arg2: !llvm.ptr<struct<(i64, ptr<i8>)>>) attributes {llvm.emit_c_interface, tf_entry} {
    %0 = llvm.load %arg2 : !llvm.ptr<struct<(i64, ptr<i8>)>>
    %1 = llvm.extractvalue %0[0] : !llvm.struct<(i64, ptr<i8>)>
    %2 = llvm.extractvalue %0[1] : !llvm.struct<(i64, ptr<i8>)>
    %3 = llvm.call @Rsqrt_CPU_DT_HALF_DT_HALF(%arg1, %1, %2) : (!llvm.ptr<i8>, i64, !llvm.ptr<i8>) -> !llvm.struct<(i64, ptr<i8>)>
    llvm.store %3, %arg0 : !llvm.ptr<struct<(i64, ptr<i8>)>>
    llvm.return
  }
}
2022-06-15 13:24:24 +02:00
Nabeel Omer 245604a96f [X86][SLP] Basic test coverage for llvm.powi
This patch introduces basic test coverage for llvm.powi.* intrinsics.

Differential Revision: https://reviews.llvm.org/D127492
2022-06-15 11:13:54 +01:00
Phoebe Wang 6e02e27536 Reland "[X86][RFC] Enable `_Float16` type support on X86 following the psABI"
Disabled 2 mlir tests due to the runtime doesn't support `_Float16`, see
the issue here https://github.com/llvm/llvm-project/issues/55992
2022-06-15 09:15:31 +08:00
Mehdi Amini 5d8298a768 Revert "[X86][RFC] Enable `_Float16` type support on X86 following the psABI"
This reverts commit 2d2da259c8.

This breaks MLIR integration test (JIT crashing), reverting in the
meantime.
2022-06-12 15:14:37 +00:00
Phoebe Wang 2d2da259c8 [X86][RFC] Enable `_Float16` type support on X86 following the psABI
GCC and Clang/LLVM will support `_Float16` on X86 in C/C++, following
the latest X86 psABI. (https://gitlab.com/x86-psABIs)

_Float16 arithmetic will be performed using native half-precision. If
native arithmetic instructions are not available, it will be performed
at a higher precision (currently always float) and then truncated down
to _Float16 immediately after each single arithmetic operation.

Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D107082
2022-06-12 11:40:00 +08:00
Simon Pilgrim 6c80267d0f [CostModel][X86] getScalarizationOverhead - improve extraction costs for > 128-bit vectors
We were using the default getScalarizationOverhead expansion for extraction costs, which adds up all the individual element extraction costs.

This is fine for 128-bit vectors, but for 256/512-bit vectors each element extraction also has to account for extracting the upper 128-bit subvector extraction before it can handle the element. For scalarization costs we only need to extract each demanded subvector once.

Differential Revision: https://reviews.llvm.org/D125527
2022-05-24 15:18:08 +01:00
Simon Pilgrim 47be07074a [CostModel][X86] Auto generate partial interleaved load LV costs using UTC_ARGS --filter control 2022-05-12 17:46:41 +01:00
Simon Pilgrim 14e83ada16 [CostModel][X86] Auto generate masked load/store LV costs using UTC_ARGS --filter control
Also fix a sse42 -> sse4.2 typo so that we actually test costs for sse4.2
2022-05-12 17:40:40 +01:00
Simon Pilgrim a5c45c4dc1 [CostModel][X86] Auto generate gather/scatter LV costs using UTC_ARGS --filter control
Also fix a sse42 -> sse4.2 typo so that we actually test costs for sse4.2
2022-05-12 17:39:06 +01:00
Simon Pilgrim 3d107ce2b2 [CostModel][X86] Relax fcmp costs on SSE41 targets or later
Only pre-SSE41 targets double-pump the fp comparison ops
2022-05-06 13:29:40 +01:00
Simon Pilgrim cbfa857346 [CostModel][X86] Adjust 128-bit select costs to account for slow BLENDV op
Based off the script from D103695 - Jaguar, Bulldozer, Silvermont (et al) and Haswell all have slow BLENDV ops, so adjust the worse case cost values
2022-05-06 13:07:34 +01:00
Simon Pilgrim d21bf51494 [CostModel][X86] Adjust pre-SSE41 fp scalar select costs to account for vector ops
Based off the script from D103695, we now mainly use BLENDV or OR(AND,ANDN) to select scalar float/double ops
2022-05-06 11:41:55 +01:00
Simon Pilgrim f0e8c1d6d9 [CostModel][X86] Adjust 256-bit select costs to account for slow BLENDV op
Based off the script from D103695, on AVX1, Jaguar/Bulldozer both have low throughput for ymm select patterns (BLENDV + OR(AND,ANDN))), and even on AVX2 Haswell still struggles with BLENDV ops
2022-05-06 11:27:37 +01:00
Simon Pilgrim 4236a10717 [CostModel][X86] Add more complete float/double select cost test coverage
We were only testing basic vector types
2022-05-06 10:45:36 +01:00
Simon Pilgrim 86bb7df6e6 [CostModel][X86] getScalarizationOverhead - handle vXi1 extracts with MOVMSK (pre-AVX512)
We can quickly extract multiple elements of a bool vector using MOVMSK ops - since we don't know what generated the vXi1, I've been optimistic and assumed we can use PMOVMSKB to extract the maximum number of bools with a single op.

The MOVMSK pattern isn't great for extract+insert round trips as vXi1 type legalization can interfere with this a lot - so this relies on us remaining good at using getScalarizationOverhead properly (and tagging both Insert and Extract modes) for those round trip cases.

The AVX512 KMOV codegen for bool extraction is a bit of a mess so for now I've not included that - the per-element cost is a lot more accurate for current codegen.
2022-05-02 09:58:39 +01:00
Simon Pilgrim d5198cf92f [CostModel][X86] Check for 'null op' truncations
If the legalized src/dst types are the same, assume the "truncation" is free.

This fixes some edge cases such as mul lo/hi ops and bool vectors which will get legalized back to legal vector widths
2022-05-01 12:03:40 +01:00
Simon Pilgrim c2964746e3 [CostModel][X86] Reduce cost of vector selects on SSE2/AVX1 targets
Based off the script from D103695, we were exaggerating the cost of the OR(AND(X,M),AND(Y,~M)) expansion using instruction count instead of effective throughput
2022-05-01 09:32:14 +01:00
Alexey Bataev 371412e065 [COST]Fix crash for non-power-2 vector shuffle mask.
Need to normalizize the mask to avoid possible crashes during attempts
to estimate cost of the very long shuffles with non-power-2 number of
elements in masks.
2022-04-29 07:28:07 -07:00
Alexey Bataev 75e1cf4a6a [COST]Improve cost model for shuffles in SLP.
Introduced masks where they are not added and improved target dependent
cost models to avoid returning of the incorrect cost results after
adding masks.

Differential Revision: https://reviews.llvm.org/D100486
2022-04-28 10:04:41 -07:00
Alexey Bataev ac23cf738a [COST][NFC]Add a test for non-power-2 shuffles, NFC. 2022-04-28 09:08:28 -07:00
Alexey Bataev 9861ca0c23 Revert "[COST]Improve cost model for shuffles in SLP."
This reverts commit 29a470e380 to fix
a crash reported in https://reviews.llvm.org/D100486#3479989.
2022-04-28 08:11:56 -07:00
Alexey Bataev 29a470e380 [COST]Improve cost model for shuffles in SLP.
Introduced masks where they are not added and improved target dependent
cost models to avoid returning of the incorrect cost results after
adding masks.

Differential Revision: https://reviews.llvm.org/D100486
2022-04-27 10:56:26 -07:00
Vasileios Porpodas c7bb5ac5ca [NFC] Renamed /test/Analysis/CostModel/X86/splat-load.ll test and added more checks.
Renamed test/Analysis/CostModel/X86/splat-load.ll to shuffle-load.ll
to align it with AArch64's similar test.

Also added a complete list of checks for all vector combinations up to 512-bits.

Differential Revision: https://reviews.llvm.org/D124528
2022-04-27 09:47:43 -07:00
David Green 4a8c13a6f4 [CostModel] Add basic fptoi_sat costs
This adds some basic fptosi_sat and fptoui_sat target independent cost
modelling. The fptosi_sat is modelled as a fmin/fmax to saturate the
value, followed by a fp convert. The signed values then have an
additional fcmp+select for handling Nan correctly.

The AArch64/Arm costs may be more incorrect, as the instruction exist
natively. This can be fixed with target specific cost updates.

Differential Revision: https://reviews.llvm.org/D124269
2022-04-27 09:30:00 +01:00
Vasileios Porpodas fa8a9fea47 Recommit "[SLP][TTI] Refactoring of `getShuffleCost` `Args` to work like `getArithmeticInstrCost`"
This reverts commit 6a9bbd9f20.

Code review: https://reviews.llvm.org/D124202
2022-04-26 14:02:40 -07:00
Vasileios Porpodas 6a9bbd9f20 Revert "[SLP][TTI] Refactoring of `getShuffleCost` `Args` to work like `getArithmeticInstrCost`"
This reverts commit 55ce296d6f.
2022-04-26 11:25:26 -07:00
Vasileios Porpodas 55ce296d6f [SLP][TTI] Refactoring of `getShuffleCost` `Args` to work like `getArithmeticInstrCost`
Before this patch `Args` was used to pass a broadcat's arguments by SLP.
This patch changes this. `Args` is now used for passing the operands of
the shuffle.

Differential Revision: https://reviews.llvm.org/D124202
2022-04-26 11:11:29 -07:00
David Green 1159984802 [CostModel] Add fptoi_sat costmodel tests. NFC 2022-04-25 18:44:35 +01:00
Vasileios Porpodas e83ad23daf [TTI] Pre-commit cost model tests splat-loads. 2022-04-21 14:45:51 -07:00
Roman Lebedev be5c15c7ae
[NFC][Costmodel][LV][X86] Refresh one or two interleaved load/store tests 2022-04-15 17:43:18 +03:00
Simon Pilgrim d663166acb [CostModel][X86] Reduce cost of v2i64 icmp base cost on SSE2 targets
Based off the script from D103695, we were exaggerating the cost of the v2i64 comparison expansion using instruction count instead of effective throughput
2022-03-30 09:11:55 +01:00
Simon Pilgrim 5dde9c1286 [CostModel][X86] Reduce cost of extracting bool vector elements
For constant indices, these are now just a MOVMSK+TEST/BT
2022-03-18 19:02:47 +00:00
Simon Pilgrim 4455c5cdea [CostModel][X86] Update RUN -passes=* to double quotes to appease update scripts on windows 2022-03-18 11:44:18 +00:00
Roman Lebedev 2f80ea7f4f
[NFC][LV] Use different braces in debug output
The analysis passes output function name encapsulated in `'` braces,
but LV uses `"`. Harmonizing this may help in creating an update script
for the LV costmodel test checks.

Reviewed By: fhahn

Differential Revision: https://reviews.llvm.org/D121105
2022-03-07 19:32:37 +03:00
Arthur Eubanks 15ba588d6d [test] Migrate '-analyze -cost-model' to '-passes=print<cost-model>' 2022-02-09 15:42:16 -08:00
David Green b55d4c2ad8 Revert "[LV] Remove `LoopVectorizationCostModel::useEmulatedMaskMemRefHack()`"
This reverts commit 77a0da926c as we've
received multiple reports of this significantly impacting performance,
in ways that don't seem to just be target specific cost models going
wrong. I would offer some reproducers, but the test changes here seem to
be full of them!

Reverting for now and hopefully we can remove the "hack" more carefully
as we go.
2022-02-09 20:02:54 +00:00