Commit Graph

22902 Commits

Author SHA1 Message Date
Simon Pilgrim 97868fb972 [X86] isTargetShuffleEquivalent - attempt to match SM_SentinelZero shuffle mask elements using known bits
If the combined shuffle mask requires zero elements, we don't currently have much chance of matching them against the expected source vector. This patch uses the SelectionDAG::MaskedVectorIsZero wrapper to attempt to determine if the expected element we want to use is already known to be zero.

I've also tightened up the ExpectedMask assertion to always be in range - we're never giving it a target shuffle mask that has sentinels at all - allowing us to remove some of the confusing bounds checks.

This attempts to address some of the regressions uncovered by D129150 where we more aggressively fold shuffles as AND / 'clear' masks which results in more combined shuffles using SM_SentinelZero.

Differential Revision: https://reviews.llvm.org/D129207
2022-07-11 15:29:44 +01:00
Nicolai Hähnle ede600377c ManagedStatic: remove many straightforward uses in llvm
(Reapply after revert in e9ce1a5880 due to
Fuchsia test failures. Removed changes in lib/ExecutionEngine/ other
than error categories, to be checked in more detail and reapplied
separately.)

Bulk remove many of the more trivial uses of ManagedStatic in the llvm
directory, either by defining a new getter function or, in many cases,
moving the static variable directly into the only function that uses it.

Differential Revision: https://reviews.llvm.org/D129120
2022-07-10 10:29:15 +02:00
Nicolai Hähnle e9ce1a5880 Revert "ManagedStatic: remove many straightforward uses in llvm"
This reverts commit e6f1f06245.

Reverting due to a failure on the fuchsia-x86_64-linux buildbot.
2022-07-10 09:54:30 +02:00
Nicolai Hähnle e6f1f06245 ManagedStatic: remove many straightforward uses in llvm
Bulk remove many of the more trivial uses of ManagedStatic in the llvm
directory, either by defining a new getter function or, in many cases,
moving the static variable directly into the only function that uses it.

Differential Revision: https://reviews.llvm.org/D129120
2022-07-10 09:15:08 +02:00
Phoebe Wang 8fb083d33e [X86][FP16] Add constrained FP support for scalar emulation
This is a follow up patch to support constrained FP in FP16 emulation.

Reviewed By: skan

Differential Revision: https://reviews.llvm.org/D128114
2022-07-08 20:33:42 +08:00
Sanjay Patel 8b75671314 [SDAG] try to replace subtract-from-constant with xor
This is almost the same as the abandoned D48529, but it
allows splat vector constants too.

This replaces the x86-specific code that was added with
the alternate patch D48557 with the original generic
combine.

This transform is a less restricted form of an existing
InstCombine and the proposed SDAG equivalent for that
in D128080:
https://alive2.llvm.org/ce/z/OUm6N_

Differential Revision: https://reviews.llvm.org/D128123
2022-07-08 08:14:24 -04:00
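A minimal C++ sketch (not taken from the patch) of the scalar identity behind this combine: if every set bit of x is also set in the constant C, the subtraction borrows nothing and C - x equals C ^ x; the splat-vector case applies the same reasoning per element. The constant below is an arbitrary illustrative choice.

```cpp
#include <cassert>
#include <cstdint>

int main() {
  // Hypothetical demonstration: when (x & ~C) == 0 the subtraction C - x
  // cannot borrow, so it simply clears x's bits from C, i.e. C - x == C ^ x.
  const uint32_t C = 0x0000FF0F; // arbitrary illustrative constant
  for (uint32_t x = 0; x <= 0xFFFFu; ++x)
    if ((x & ~C) == 0)
      assert(C - x == (C ^ x));
  return 0;
}
```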
Haohai Wen 18a1085e02 [X86] Fix collectLeaves for adds used by phi that forms loop
When an add has additional users, we should identify whether the add's
user is a phi that forms a loop, rather than the root's.

Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D129169
2022-07-08 10:39:02 +08:00
Phoebe Wang 6c535f9f1b [X86][FP16] Fix crash when lowering copysign for f16
This is to address the assertion failure reported in https://reviews.llvm.org/D107082#3635612
Not sure if it is a problem of promoting FCOPYSIGN + libcall FP_ROUND.
The promotion will set the rounding mode to 1: a442c62888/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp (L4810-L4814)
while the libcall cannot handle a rounding mode of 1: a442c62888/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp (L4324-L4328)
So change the action to Expand to work around the problem.

Reviewed By: clementval, MaskRay

Differential Revision: https://reviews.llvm.org/D129294
2022-07-07 19:17:26 -07:00
Nicolai Hähnle 64a78c8501 Remove unnecessary includes of ManagedStatic.h
Differential Revision: https://reviews.llvm.org/D129115
2022-07-07 14:29:20 +02:00
Tim Northover 8d9dc83f35 X86: add newline to end of FMA instruction comments.
The newline is used by Disassembler.cpp (`emitComments`) to work out how to
format them properly, and if there's no newline it goes into an infinite loop.

Unfortunately I couldn't get llvm-objdump to be affected, only the MacOS otool
utility which dlopens libLTO.
2022-07-07 12:35:28 +01:00
Simon Pilgrim fbb51ac0ba [X86] LowerShift - lower some shuffles directly to X86ISD::PSHUFLW nodes.
These are expected to lower to X86ISD::PSHUFLW but we were seeing some regressions in D129150 because it'd managed to exploit the masking of the shift amounts to create unintended clear masks instead.
2022-07-06 18:01:03 +01:00
Shilei Tian 1023ddaf77 [LLVM] Add the support for fmax and fmin in atomicrmw instruction
This patch adds support for `fmax` and `fmin` operations in the `atomicrmw`
instruction. For now (at least in this patch), the instruction will be expanded
to a CAS loop. There are already a couple of targets supporting the feature.
I'll create another patch (or patches) to enable them accordingly.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D127041
2022-07-06 10:57:53 -04:00
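A hedged C++ sketch of what the CAS-loop expansion described in the commit above amounts to, written with std::atomic for illustration; the actual lowering happens on LLVM IR (presumably in the AtomicExpand pass), and the function name here is made up.

```cpp
#include <atomic>
#include <cmath>

// Illustrative only: the semantics of `atomicrmw fmax ptr, val` lowered to a
// compare-and-swap loop. fmin is identical with std::fmin.
float atomic_fmax(std::atomic<float> &addr, float operand) {
  float expected = addr.load();
  float desired;
  do {
    desired = std::fmax(expected, operand);                   // NaN-aware maximum
  } while (!addr.compare_exchange_weak(expected, desired));   // retry on contention
  return expected;                                            // atomicrmw yields the old value
}
```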
Paul Robinson 08e4fe6c61 [X86] Add RDPRU instruction
Add support for the RDPRU instruction on Zen2 processors.

User-facing features:

- Clang option -m[no-]rdpru to enable/disable the feature
- Support is implicit for znver2/znver3 processors
- Preprocessor symbol __RDPRU__ to indicate support
- Header rdpruintrin.h to define intrinsics
- "rdpru" mnemonic supported for assembler code

Internal features:

- Clang builtin __builtin_ia32_rdpru
- IR intrinsic @llvm.x86.rdpru

Differential Revision: https://reviews.llvm.org/D128934
2022-07-06 07:17:47 -07:00
Craig Topper 2bfca35614 [X86] Disable combineVectorSizedSetCCEquality for soft float.
The vector types aren't legal with soft float.
Also disable under NoImplicitFloat for good measure.

Fixes PR56351.

Differential Revision: https://reviews.llvm.org/D129060
2022-07-04 08:33:30 -07:00
Simon Pilgrim 26708fa166 Revert rG057db2002bb3: [X86] combineAndnp - constant fold ANDNP(C,X) -> AND(~C,X)
If the LHS op has a single use then using the more general AND op is likely to allow commutation, load folding, generic folds etc.

Reverted due to reports from @alexfh about it causing an infinite loop (repro still pending).
2022-07-01 10:36:09 +01:00
Simon Pilgrim e961e05d59 [SLP][X86] Add 32-bit vector stores to help vectorization opportunities
Building on the work on D124284, this patch tags v4i8 and v2i16 vector loads as custom, enabling SLP to try to vectorize these types ending in a partial store (using the SSE MOVD instruction) - we already do something similar for 64-bit vector types.

Differential Revision: https://reviews.llvm.org/D127604
2022-06-30 20:25:50 +01:00
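An illustrative C++ kernel (not from the patch) of the kind of code this change targets: four adjacent i8 stores that the SLP vectorizer can now turn into a single 32-bit vector store, lowered on X86 via MOVD, mirroring what already happens for 64-bit-wide groups.

```cpp
#include <cstdint>

// Hypothetical example: with v4i8 types tagged as custom, SLP can vectorize
// this into one 32-bit store instead of four scalar byte stores.
void add_one_4xi8(uint8_t *dst, const uint8_t *src) {
  dst[0] = uint8_t(src[0] + 1);
  dst[1] = uint8_t(src[1] + 1);
  dst[2] = uint8_t(src[2] + 1);
  dst[3] = uint8_t(src[3] + 1);
}
```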
Amir Ayupov cb75faf40c [X86][BOLT] Use getOperandType to determine memory access size
Generate INSTRINFO_OPERAND_TYPE table in X86GenInstrInfo.inc.

This diff adds support for instructions that were previously reported as having
memory access size 0. It replaces the heuristic of looking at instruction
register width to determine memory access width by instead checking the memory
operand type using tablegen-provided tables.

Reviewed By: skan

Differential Revision: https://reviews.llvm.org/D126116
2022-06-30 00:25:32 -07:00
Luo, Yuanke 5cb0979870 [X86][AMX] Split greedy RA for tile register
When we fill the shape into the tile configure memory, the shape is taken
from the AMX pseudo instruction. However, the register holding the shape may
be split or spilled by greedy RA. That can cause the shape to be written to
the config memory after ldtilecfg has executed, so the shape configuration
would be wrong.
This patch splits the tile register allocation out of greedy register
allocation, so that after the tile registers are allocated the shape
registers are still virtual registers. The shape registers may only be
redefined or multi-defined by the phi elimination and two-address passes,
which doesn't affect the tile register configuration.

Differential Revision: https://reviews.llvm.org/D128584
2022-06-29 10:35:43 +08:00
Craig Topper 3706bdad4a [X86] Remove unnecessary COPY from EmitLoweredCascadedSelect.
I believe we already checked that the destination of the first
CMOV is only used by the second CMOV so I don't think there is any
reason we need the PHI to write the register that was used by the
first CMOV. We can directly use the second CMOV destination and
avoid the copy.

This may be a left over from when the cascaded select handling
was part of the main algorithm before it was refactored in D35685.

Reviewed By: pengfei

Differential Revision: https://reviews.llvm.org/D128124
2022-06-28 09:33:33 -07:00
Simon Pilgrim 0b998053db [X86] combineConcatVectorOps - IsConcatFree must check extraction index
Identified in the regression reported by @alexfh on rGb5d7beeb9792 - IsConcatFree wasn't ensuring the subvector extraction index matched the position it would be concatenated back into.
2022-06-27 11:46:49 +01:00
Nikita Popov 1061511008 [X86PreAMXConfig] Use IRBuilder to insert instructions (NFC)
Use an IRBuilder to insert instructions in preWriteTileCfg().
While here, also remove some unnecessary bool return values.

There are some test changes because the IRBuilder folds
"trunc i16 8 to i8" to "i8 8", and that has knock-on effects on
instruction naming.

I ran into this when converting tests to opaque pointers and
noticed that this pass introduces unnecessary "bitcast ptr to ptr"
instructions.
2022-06-22 17:28:48 +02:00
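A minimal sketch, not the actual X86PreAMXConfig code, of the IRBuilder style the commit above moves to; the helper name and operands are assumptions. Constant operands such as `trunc i16 8 to i8` are folded by the builder, which is what causes the test-name churn mentioned in the message.

```cpp
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Hypothetical helper: store an 8-bit row count into the tile-config buffer,
// letting IRBuilder create (and constant-fold) the intermediate trunc.
static void storeRow(Instruction *InsertBefore, Value *ConfigPtr, Value *Row) {
  IRBuilder<> Builder(InsertBefore);                            // insert before the given instruction
  Value *Row8 = Builder.CreateTrunc(Row, Builder.getInt8Ty());  // "trunc i16 8 to i8" folds to "i8 8"
  Builder.CreateStore(Row8, ConfigPtr);                         // plain store, no manual bitcasts
}
```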
Nikita Popov fbb72530fe [X86PreAMXConfig] Use MapVector to fix non-determinism
We generate code by iterating over this map, so make sure that the
order is deterministic.
2022-06-22 16:57:33 +02:00
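A small sketch, under the assumption that the map is keyed by pointers, of why llvm::MapVector helps here: it iterates in insertion order, so code generated by walking the map is the same from run to run.

```cpp
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/MapVector.h"
#include "llvm/IR/Value.h"

using namespace llvm;

// Illustrative only: MapVector remembers first-insertion order, unlike a
// DenseMap keyed by Value*, whose iteration order depends on pointer values.
static void numberValues(ArrayRef<Value *> Vals) {
  MapVector<Value *, unsigned> Order;
  for (Value *V : Vals)
    Order.insert({V, (unsigned)Order.size()}); // duplicates keep their first slot
  for (const auto &KV : Order)
    (void)KV; // iteration here is deterministic (insertion order)
}
```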
Vasileios Porpodas 7a9ad25769 Recommit "[SLP][X86] Improve reordering to consider alternate instruction bundles"
This reverts commit 6d6268dcbf.

Review: https://reviews.llvm.org/D125712
2022-06-21 18:35:29 -07:00
Vasileios Porpodas 6d6268dcbf Revert "[SLP][X86] Improve reordering to consider alternate instruction bundles"
This reverts commit 6f88acf410.
2022-06-21 17:07:21 -07:00
Vasileios Porpodas 6f88acf410 [SLP][X86] Improve reordering to consider alternate instruction bundles
During the reordering transformation we should try to avoid reordering bundles
like fadd/fsub because this may block them from being matched into a single
vector instruction on x86.
We do this by checking if a TreeEntry is such a pattern and adding it to the
list of TreeEntries with orders that need to be considered.

Differential Revision: https://reviews.llvm.org/D125712
2022-06-21 16:44:48 -07:00
Simon Pilgrim ac4cb1775b [X86] fold (and (mul x, c1), c2) -> (mul x, (and c1, c2)) iff c2 is all/no bits mask
Noticed on D128216 - if we're zeroing out vector elements of a mul/mulh result then see if we can merge the and-mask into the mul by just multiplying by zero.

Ideally we'd make this generic (similar to the existing foldSelectWithIdentityConstant?), but these cases are appearing very late, after the constants have been lowered to constant-pool loads.
2022-06-21 15:10:43 +01:00
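A scalar C++ illustration (assumptions: unsigned 16-bit lanes and mask elements that are either all-ones or all-zeros, as the combine requires) of why the clear mask can be folded into the multiply.

```cpp
#include <cassert>
#include <cstdint>

int main() {
  // When the mask element c2 is all-bits or no-bits, (x * c1) & c2 equals
  // x * (c1 & c2): the all-ones case is a no-op and the all-zeros case makes
  // both sides zero, so the AND can be merged into the multiply constant.
  for (uint32_t x = 0; x < 256; ++x)
    for (uint32_t c1 = 0; c1 < 256; ++c1)
      for (uint16_t c2 : {uint16_t(0x0000), uint16_t(0xFFFF)})
        assert(uint16_t((x * c1) & c2) == uint16_t(x * (c1 & c2)));
  return 0;
}
```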
Simon Pilgrim 057db2002b [X86] combineAndnp - constant fold ANDNP(C,X) -> AND(~C,X)
If the LHS op has a single use then using the more general AND op is likely to allow commutation, load folding, generic folds etc.
2022-06-21 12:31:01 +01:00
Simon Pilgrim 843d43e62a [X86] computeKnownBitsForTargetNode - add X86ISD::VBROADCAST_LOAD handling
This requires us to override the isTargetCanonicalConstantNode callback introduced in D128144, so we can recognise the various cases where a VBROADCAST_LOAD constant is being reused at different vector widths to prevent infinite loops.
2022-06-21 11:48:01 +01:00
Kazu Hirata 7a47ee51a1 [llvm] Don't use Optional::getValue (NFC) 2022-06-20 22:45:45 -07:00
Phoebe Wang edcc68e86f [X86] Make sure SF is updated when optimizing for `jg/jge/jl/jle`
This fixes issue #56103.

Reviewed By: mingmingl

Differential Revision: https://reviews.llvm.org/D128122
2022-06-21 09:09:27 +08:00
Simon Pilgrim 8254966062 [X86] LowerINSERT_VECTOR_ELT - always lower v32i8/v16i16 allones insertions on AVX1 as OR ops
v32i8/v16i16 blend shuffles on AVX1 will expand to OR(AND,ANDN) patterns which can be easily broken by other combines
2022-06-20 18:43:03 +01:00
Kazu Hirata e0e687a615 [llvm] Don't use Optional::hasValue (NFC) 2022-06-20 10:38:12 -07:00
Simon Pilgrim e4a124dda5 [DAG] Fold (srl (shl x, c1), c2) -> and(shl/srl(x, c3), m)
Similar to the existing (shl (srl x, c1), c2) fold

Part of the work to fix the regressions in D77804

Differential Revision: https://reviews.llvm.org/D125836
2022-06-20 08:37:38 +01:00
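A hedged scalar C++ sketch of the equivalence this combine relies on, for 32-bit values and in-range shift amounts; the helper name is made up.

```cpp
#include <cassert>
#include <cstdint>

// (x << c1) >> c2 (logical shifts) can be rewritten as a single shift by
// c3 = |c1 - c2| followed by a mask that clears the bits the srl would drop.
uint32_t shl_srl_fold(uint32_t x, unsigned c1, unsigned c2) {
  uint32_t m = UINT32_MAX >> c2;           // bits that survive the srl
  return c1 >= c2 ? (x << (c1 - c2)) & m   // net left shift
                  : (x >> (c2 - c1)) & m;  // net right shift
}

int main() {
  for (uint32_t x : {0u, 1u, 0x80000001u, 0xDEADBEEFu, UINT32_MAX})
    for (unsigned c1 = 0; c1 < 32; ++c1)
      for (unsigned c2 = 0; c2 < 32; ++c2)
        assert(((x << c1) >> c2) == shl_srl_fold(x, c1, c2));
  return 0;
}
```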
Amir Ayupov c0128549b0 [TableGen][X86] Add Size field to X86MemOperand class
Set Size appropriately in operand definitions and query it for dumping memory
operand size table `getMemOperandSize` (follow-up use D126116) and
`X86Disassembler::getMemOperandSize`.

Excerpt from a produced `getMemOperandSize` table for X86:

```
static int getMemOperandSize(int OpType) {
  switch (OpType) {
  default: return 0;
  case OpTypes::i8mem:
  case OpTypes::i8mem_NOREX:
    return 8;

  case OpTypes::f16mem:
  case OpTypes::i16mem:
    return 16;

  case OpTypes::f32mem:
  case OpTypes::i32mem:
    return 32;
...
```

Reviewed By: skan, pengfei

Differential Revision: https://reviews.llvm.org/D127787
2022-06-19 11:46:56 -07:00
Simon Pilgrim ba3f2667b6 [DAG] Add MaskedVectorIsZero helper
Equivalent to MaskedValueIsZero, except it checks whether all of the demanded vector elements are known to be zero
2022-06-19 17:56:30 +01:00
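A hedged usage sketch; the exact signature of the new helper is an assumption here rather than a quote from the patch, and the wrapper function is invented for illustration.

```cpp
#include "llvm/CodeGen/SelectionDAG.h"

using namespace llvm;

// Hypothetical: ask whether every demanded lane of V is already known to be
// zero (via KnownBits analysis), e.g. before matching it against a zeroed
// shuffle mask element. Assumed signature: MaskedVectorIsZero(Op, DemandedElts).
static bool demandedLanesAreZero(SelectionDAG &DAG, SDValue V,
                                 const APInt &DemandedElts) {
  return DAG.MaskedVectorIsZero(V, DemandedElts);
}
```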
Simon Pilgrim 41455dd1dc [X86] Remove isTargetShuffleSplat and just use SelectionDAG::isSplatValue
shuffle(splat(x)) -> splat(x), it doesn't have to be a target specific broadcast
2022-06-19 11:22:57 +01:00
Kazu Hirata 129b531c9c [llvm] Use value_or instead of getValueOr (NFC) 2022-06-18 23:07:11 -07:00
Kazu Hirata 47b39c5157 [X86] Use default member initialization (NFC)
Identified with modernize-use-default-member-init.
2022-06-18 12:11:58 -07:00
Kazu Hirata 1590d39f2e [X86] Use default member initialization (NFC)
Identified with modernize-use-default-member-init.
2022-06-18 12:08:07 -07:00
Kazu Hirata 7c987bb4d9 [X86] Use default member initialization (NFC)
Identified with modernize-use-default-member-init.
2022-06-18 12:05:34 -07:00
Simon Pilgrim ac3f967382 [X86] canonicalizeShuffleWithBinOps - merge shuffles across binops if either source op is a known splat
The shuffle of a splat (with no undefs) should always be removed
2022-06-18 17:14:00 +01:00
Simon Pilgrim f42f2b7005 [X86] canonicalizeShuffleWithBinOps - merge unary shuffles across binops if either source op is a foldable load
This mostly handles folding of constants that have already become loads, but we expose some generic load cases as well.

This also exposes the chance to merge unary shuffles across X86ISD::ANDNP nodes with different scalar widths
2022-06-18 15:58:54 +01:00
Kazu Hirata 621f58e716 [Target, CodeGen] Use isImm(), isReg(), etc (NFC) 2022-06-18 07:41:04 -07:00
Simon Pilgrim 3c9123af9f [X86] isShuffleFoldableLoad - ensure the load has one use.
We'll only fold the load if it has one use. Makes no difference to existing tests but will be necessary for an upcoming patch to improve load folding as part of canonicalizeShuffleWithBinOps.
2022-06-18 14:51:55 +01:00
Phoebe Wang 655ba9c8a1 Reland "Reland "Reland "Reland "[X86][RFC] Enable `_Float16` type support on X86 following the psABI""""
This resolves problems reported in commit 1a20252978.
1. Promote-to-float lowering for XINT_TO_FP nodes
2. Bail out f16 from shuffle combine because the vector type is not legal in this version
2022-06-17 21:34:05 +08:00
Benjamin Kramer 1a20252978 Revert "Reland "Reland "Reland "[X86][RFC] Enable `_Float16` type support on X86 following the psABI""""
This reverts commit 04a3d5f3a1.

I see two more issues:

- uitofp/sitofp from i32/i64 to half now generates
  __floatsihf/__floatdihf, which exists in neither compiler-rt nor
  libgcc

- This crashes when legalizing the bitcast:
```
; RUN: llc < %s -mcpu=skx
define void @main.45(ptr nocapture readnone %retval, ptr noalias nocapture readnone %run_options, ptr noalias nocapture readnone %params, ptr noalias nocapture readonly %buffer_table, ptr noalias nocapture readnone %status, ptr noalias nocapture readnone %prof_counters) local_unnamed_addr {
entry:
  %fusion = load ptr, ptr %buffer_table, align 8
  %0 = getelementptr inbounds ptr, ptr %buffer_table, i64 1
  %Arg_1.2 = load ptr, ptr %0, align 8
  %1 = getelementptr inbounds ptr, ptr %buffer_table, i64 2
  %Arg_0.1 = load ptr, ptr %1, align 8
  %2 = load half, ptr %Arg_0.1, align 8
  %3 = bitcast half %2 to i16
  %4 = and i16 %3, 32767
  %5 = icmp eq i16 %4, 0
  %6 = and i16 %3, -32768
  %broadcast.splatinsert = insertelement <4 x half> poison, half %2, i64 0
  %broadcast.splat = shufflevector <4 x half> %broadcast.splatinsert, <4 x half> poison, <4 x i32> zeroinitializer
  %broadcast.splatinsert9 = insertelement <4 x i16> poison, i16 %4, i64 0
  %broadcast.splat10 = shufflevector <4 x i16> %broadcast.splatinsert9, <4 x i16> poison, <4 x i32> zeroinitializer
  %broadcast.splatinsert11 = insertelement <4 x i16> poison, i16 %6, i64 0
  %broadcast.splat12 = shufflevector <4 x i16> %broadcast.splatinsert11, <4 x i16> poison, <4 x i32> zeroinitializer
  %broadcast.splatinsert13 = insertelement <4 x i16> poison, i16 %3, i64 0
  %broadcast.splat14 = shufflevector <4 x i16> %broadcast.splatinsert13, <4 x i16> poison, <4 x i32> zeroinitializer
  %wide.load = load <4 x half>, ptr %Arg_1.2, align 8
  %7 = fcmp uno <4 x half> %broadcast.splat, %wide.load
  %8 = fcmp oeq <4 x half> %broadcast.splat, %wide.load
  %9 = bitcast <4 x half> %wide.load to <4 x i16>
  %10 = and <4 x i16> %9, <i16 32767, i16 32767, i16 32767, i16 32767>
  %11 = icmp eq <4 x i16> %10, zeroinitializer
  %12 = and <4 x i16> %9, <i16 -32768, i16 -32768, i16 -32768, i16 -32768>
  %13 = or <4 x i16> %12, <i16 1, i16 1, i16 1, i16 1>
  %14 = select <4 x i1> %11, <4 x i16> %9, <4 x i16> %13
  %15 = icmp ugt <4 x i16> %broadcast.splat10, %10
  %16 = icmp ne <4 x i16> %broadcast.splat12, %12
  %17 = or <4 x i1> %15, %16
  %18 = select <4 x i1> %17, <4 x i16> <i16 -1, i16 -1, i16 -1, i16 -1>, <4 x i16> <i16 1, i16 1, i16 1, i16 1>
  %19 = add <4 x i16> %18, %broadcast.splat14
  %20 = select i1 %5, <4 x i16> %14, <4 x i16> %19
  %21 = select <4 x i1> %8, <4 x i16> %9, <4 x i16> %20
  %22 = bitcast <4 x i16> %21 to <4 x half>
  %23 = select <4 x i1> %7, <4 x half> <half 0xH7E00, half 0xH7E00, half 0xH7E00, half 0xH7E00>, <4 x half> %22
  store <4 x half> %23, ptr %fusion, align 16
  ret void
}
```

llc: llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp:977: void (anonymous namespace)::SelectionDAGLegalize::LegalizeOp(llvm::SDNode *): Assertion `(TLI.getTypeAction(*DAG.getContext(), Op.getValueType()) == TargetLowering::TypeLegal || Op.getOpcode() == ISD::TargetConstant || Op.getOpcode() == ISD::Register) && "Unexpected illegal type!"' failed.
2022-06-17 09:43:07 +02:00
Phoebe Wang 04a3d5f3a1 Reland "Reland "Reland "[X86][RFC] Enable `_Float16` type support on X86 following the psABI"""
Fix the crash on lowering X86ISD::FCMP.
2022-06-17 12:12:17 +08:00
Paul Robinson ff0122dcce [PS5] Emit ud2 for ubsan trap 2022-06-16 11:20:10 -07:00
Paul Robinson 77b00098f2 [PS5] Use same debug trap instruction as PS4 2022-06-16 11:03:03 -07:00
Frederik Gossen 3cd5696a33 Revert "Reland "Reland "[X86][RFC] Enable `_Float16` type support on X86 following the psABI"""
This reverts commit e1c5afa47d.

This introduces crashes in the JAX backend on CPU. A reproducer in LLVM is
below. Let me know if you have trouble reproducing this.

; ModuleID = '__compute_module'
source_filename = "__compute_module"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-grtev4-linux-gnu"

@0 = private unnamed_addr constant [4 x i8] c"\00\00\00?"
@1 = private unnamed_addr constant [4 x i8] c"\1C}\908"
@2 = private unnamed_addr constant [4 x i8] c"?\00\\4"
@3 = private unnamed_addr constant [4 x i8] c"%ci1"
@4 = private unnamed_addr constant [4 x i8] zeroinitializer
@5 = private unnamed_addr constant [4 x i8] c"\00\00\00\C0"
@6 = private unnamed_addr constant [4 x i8] c"\00\00\00B"
@7 = private unnamed_addr constant [4 x i8] c"\94\B4\C22"
@8 = private unnamed_addr constant [4 x i8] c"^\09B6"
@9 = private unnamed_addr constant [4 x i8] c"\15\F3M?"
@10 = private unnamed_addr constant [4 x i8] c"e\CC\\;"
@11 = private unnamed_addr constant [4 x i8] c"d\BD/>"
@12 = private unnamed_addr constant [4 x i8] c"V\F4I="
@13 = private unnamed_addr constant [4 x i8] c"\10\CB,<"
@14 = private unnamed_addr constant [4 x i8] c"\AC\E3\D6:"
@15 = private unnamed_addr constant [4 x i8] c"\DC\A8E9"
@16 = private unnamed_addr constant [4 x i8] c"\C6\FA\897"
@17 = private unnamed_addr constant [4 x i8] c"%\F9\955"
@18 = private unnamed_addr constant [4 x i8] c"\B5\DB\813"
@19 = private unnamed_addr constant [4 x i8] c"\B4W_\B2"
@20 = private unnamed_addr constant [4 x i8] c"\1Cc\8F\B4"
@21 = private unnamed_addr constant [4 x i8] c"~3\94\B6"
@22 = private unnamed_addr constant [4 x i8] c"3Yq\B8"
@23 = private unnamed_addr constant [4 x i8] c"\E9\17\17\BA"
@24 = private unnamed_addr constant [4 x i8] c"\F1\B2\8D\BB"
@25 = private unnamed_addr constant [4 x i8] c"\F8t\C2\BC"
@26 = private unnamed_addr constant [4 x i8] c"\82[\C2\BD"
@27 = private unnamed_addr constant [4 x i8] c"uB-?"
@28 = private unnamed_addr constant [4 x i8] c"^\FF\9B\BE"
@29 = private unnamed_addr constant [4 x i8] c"\00\00\00A"

; Function Attrs: uwtable
define void @main.158(ptr %retval, ptr noalias %run_options, ptr noalias %params, ptr noalias %buffer_table, ptr noalias %status, ptr noalias %prof_counters) #0 {
entry:
  %fusion.invar_address.dim.1 = alloca i64, align 8
  %fusion.invar_address.dim.0 = alloca i64, align 8
  %0 = getelementptr inbounds ptr, ptr %buffer_table, i64 1
  %Arg_0.1 = load ptr, ptr %0, align 8, !invariant.load !0, !dereferenceable !1, !align !2
  %1 = getelementptr inbounds ptr, ptr %buffer_table, i64 0
  %fusion = load ptr, ptr %1, align 8, !invariant.load !0, !dereferenceable !1, !align !2
  store i64 0, ptr %fusion.invar_address.dim.0, align 8
  br label %fusion.loop_header.dim.0

return:                                           ; preds = %fusion.loop_exit.dim.0
  ret void

fusion.loop_header.dim.0:                         ; preds = %fusion.loop_exit.dim.1, %entry
  %fusion.indvar.dim.0 = load i64, ptr %fusion.invar_address.dim.0, align 8
  %2 = icmp uge i64 %fusion.indvar.dim.0, 3
  br i1 %2, label %fusion.loop_exit.dim.0, label %fusion.loop_body.dim.0

fusion.loop_body.dim.0:                           ; preds = %fusion.loop_header.dim.0
  store i64 0, ptr %fusion.invar_address.dim.1, align 8
  br label %fusion.loop_header.dim.1

fusion.loop_header.dim.1:                         ; preds = %fusion.loop_body.dim.1, %fusion.loop_body.dim.0
  %fusion.indvar.dim.1 = load i64, ptr %fusion.invar_address.dim.1, align 8
  %3 = icmp uge i64 %fusion.indvar.dim.1, 1
  br i1 %3, label %fusion.loop_exit.dim.1, label %fusion.loop_body.dim.1

fusion.loop_body.dim.1:                           ; preds = %fusion.loop_header.dim.1
  %4 = getelementptr inbounds [3 x [1 x half]], ptr %Arg_0.1, i64 0, i64 %fusion.indvar.dim.0, i64 0
  %5 = load half, ptr %4, align 2, !invariant.load !0, !noalias !3
  %6 = fpext half %5 to float
  %7 = call float @llvm.fabs.f32(float %6)
  %constant.121 = load float, ptr @29, align 4
  %compare.2 = fcmp ole float %7, %constant.121
  %8 = zext i1 %compare.2 to i8
  %constant.120 = load float, ptr @0, align 4
  %multiply.95 = fmul float %7, %constant.120
  %constant.119 = load float, ptr @5, align 4
  %add.82 = fadd float %multiply.95, %constant.119
  %constant.118 = load float, ptr @4, align 4
  %multiply.94 = fmul float %add.82, %constant.118
  %constant.117 = load float, ptr @19, align 4
  %add.81 = fadd float %multiply.94, %constant.117
  %multiply.92 = fmul float %add.82, %add.81
  %constant.116 = load float, ptr @18, align 4
  %add.79 = fadd float %multiply.92, %constant.116
  %multiply.91 = fmul float %add.82, %add.79
  %subtract.87 = fsub float %multiply.91, %add.81
  %constant.115 = load float, ptr @20, align 4
  %add.78 = fadd float %subtract.87, %constant.115
  %multiply.89 = fmul float %add.82, %add.78
  %subtract.86 = fsub float %multiply.89, %add.79
  %constant.114 = load float, ptr @17, align 4
  %add.76 = fadd float %subtract.86, %constant.114
  %multiply.88 = fmul float %add.82, %add.76
  %subtract.84 = fsub float %multiply.88, %add.78
  %constant.113 = load float, ptr @21, align 4
  %add.75 = fadd float %subtract.84, %constant.113
  %multiply.86 = fmul float %add.82, %add.75
  %subtract.83 = fsub float %multiply.86, %add.76
  %constant.112 = load float, ptr @16, align 4
  %add.73 = fadd float %subtract.83, %constant.112
  %multiply.85 = fmul float %add.82, %add.73
  %subtract.81 = fsub float %multiply.85, %add.75
  %constant.111 = load float, ptr @22, align 4
  %add.72 = fadd float %subtract.81, %constant.111
  %multiply.83 = fmul float %add.82, %add.72
  %subtract.80 = fsub float %multiply.83, %add.73
  %constant.110 = load float, ptr @15, align 4
  %add.70 = fadd float %subtract.80, %constant.110
  %multiply.82 = fmul float %add.82, %add.70
  %subtract.78 = fsub float %multiply.82, %add.72
  %constant.109 = load float, ptr @23, align 4
  %add.69 = fadd float %subtract.78, %constant.109
  %multiply.80 = fmul float %add.82, %add.69
  %subtract.77 = fsub float %multiply.80, %add.70
  %constant.108 = load float, ptr @14, align 4
  %add.68 = fadd float %subtract.77, %constant.108
  %multiply.79 = fmul float %add.82, %add.68
  %subtract.75 = fsub float %multiply.79, %add.69
  %constant.107 = load float, ptr @24, align 4
  %add.67 = fadd float %subtract.75, %constant.107
  %multiply.77 = fmul float %add.82, %add.67
  %subtract.74 = fsub float %multiply.77, %add.68
  %constant.106 = load float, ptr @13, align 4
  %add.66 = fadd float %subtract.74, %constant.106
  %multiply.76 = fmul float %add.82, %add.66
  %subtract.72 = fsub float %multiply.76, %add.67
  %constant.105 = load float, ptr @25, align 4
  %add.65 = fadd float %subtract.72, %constant.105
  %multiply.74 = fmul float %add.82, %add.65
  %subtract.71 = fsub float %multiply.74, %add.66
  %constant.104 = load float, ptr @12, align 4
  %add.64 = fadd float %subtract.71, %constant.104
  %multiply.73 = fmul float %add.82, %add.64
  %subtract.69 = fsub float %multiply.73, %add.65
  %constant.103 = load float, ptr @26, align 4
  %add.63 = fadd float %subtract.69, %constant.103
  %multiply.71 = fmul float %add.82, %add.63
  %subtract.67 = fsub float %multiply.71, %add.64
  %constant.102 = load float, ptr @11, align 4
  %add.62 = fadd float %subtract.67, %constant.102
  %multiply.70 = fmul float %add.82, %add.62
  %subtract.66 = fsub float %multiply.70, %add.63
  %constant.101 = load float, ptr @28, align 4
  %add.61 = fadd float %subtract.66, %constant.101
  %multiply.68 = fmul float %add.82, %add.61
  %subtract.65 = fsub float %multiply.68, %add.62
  %constant.100 = load float, ptr @27, align 4
  %add.60 = fadd float %subtract.65, %constant.100
  %subtract.64 = fsub float %add.60, %add.62
  %multiply.66 = fmul float %subtract.64, %constant.120
  %constant.99 = load float, ptr @6, align 4
  %divide.4 = fdiv float %constant.99, %7
  %add.59 = fadd float %divide.4, %constant.119
  %multiply.65 = fmul float %add.59, %constant.118
  %constant.98 = load float, ptr @3, align 4
  %add.58 = fadd float %multiply.65, %constant.98
  %multiply.64 = fmul float %add.59, %add.58
  %constant.97 = load float, ptr @7, align 4
  %add.57 = fadd float %multiply.64, %constant.97
  %multiply.63 = fmul float %add.59, %add.57
  %subtract.63 = fsub float %multiply.63, %add.58
  %constant.96 = load float, ptr @2, align 4
  %add.56 = fadd float %subtract.63, %constant.96
  %multiply.62 = fmul float %add.59, %add.56
  %subtract.62 = fsub float %multiply.62, %add.57
  %constant.95 = load float, ptr @8, align 4
  %add.55 = fadd float %subtract.62, %constant.95
  %multiply.61 = fmul float %add.59, %add.55
  %subtract.61 = fsub float %multiply.61, %add.56
  %constant.94 = load float, ptr @1, align 4
  %add.54 = fadd float %subtract.61, %constant.94
  %multiply.60 = fmul float %add.59, %add.54
  %subtract.60 = fsub float %multiply.60, %add.55
  %constant.93 = load float, ptr @10, align 4
  %add.53 = fadd float %subtract.60, %constant.93
  %multiply.59 = fmul float %add.59, %add.53
  %subtract.59 = fsub float %multiply.59, %add.54
  %constant.92 = load float, ptr @9, align 4
  %add.52 = fadd float %subtract.59, %constant.92
  %subtract.58 = fsub float %add.52, %add.54
  %multiply.58 = fmul float %subtract.58, %constant.120
  %9 = call float @llvm.sqrt.f32(float %7)
  %10 = fdiv float 1.000000e+00, %9
  %multiply.57 = fmul float %multiply.58, %10
  %11 = trunc i8 %8 to i1
  %12 = select i1 %11, float %multiply.66, float %multiply.57
  %13 = fptrunc float %12 to half
  %14 = getelementptr inbounds [3 x [1 x half]], ptr %fusion, i64 0, i64 %fusion.indvar.dim.0, i64 0
  store half %13, ptr %14, align 2, !alias.scope !3
  %invar.inc1 = add nuw nsw i64 %fusion.indvar.dim.1, 1
  store i64 %invar.inc1, ptr %fusion.invar_address.dim.1, align 8
  br label %fusion.loop_header.dim.1

fusion.loop_exit.dim.1:                           ; preds = %fusion.loop_header.dim.1
  %invar.inc = add nuw nsw i64 %fusion.indvar.dim.0, 1
  store i64 %invar.inc, ptr %fusion.invar_address.dim.0, align 8
  br label %fusion.loop_header.dim.0

fusion.loop_exit.dim.0:                           ; preds = %fusion.loop_header.dim.0
  br label %return
}

; Function Attrs: nocallback nofree nosync nounwind readnone speculatable willreturn
declare float @llvm.fabs.f32(float %0) #1

; Function Attrs: nocallback nofree nosync nounwind readnone speculatable willreturn
declare float @llvm.sqrt.f32(float %0) #1

attributes #0 = { uwtable "denormal-fp-math"="preserve-sign" "no-frame-pointer-elim"="false" }
attributes #1 = { nocallback nofree nosync nounwind readnone speculatable willreturn }

!0 = !{}
!1 = !{i64 6}
!2 = !{i64 8}
!3 = !{!4}
!4 = !{!"buffer: {index:0, offset:0, size:6}", !5}
!5 = !{!"XLA global AA domain"}
2022-06-15 18:04:42 -04:00
Simon Pilgrim 4204361fed [X86] X86InstrInfo.cpp - fix signed/unsigned promotion warnings in addImm calls
addImm takes a int64_t arg but we were using uint64_t types
2022-06-15 18:21:43 +01:00
Paul Robinson 654a835c3f [PS5] Trap after noreturn calls, with special case for stack-check-fail 2022-06-15 09:02:17 -07:00
Phoebe Wang e1c5afa47d Reland "Reland "[X86][RFC] Enable `_Float16` type support on X86 following the psABI""
Fixed the missing SQRT promotion. Adding several missing operations too.
2022-06-15 23:00:18 +08:00
Thomas Joerg 37455b1f71 Revert "Reland "[X86][RFC] Enable `_Float16` type support on X86 following the psABI""
This reverts commit 6e02e27536.

This introduces a crash in the backend. Reproducer in MLIR's LLVM
dialect follows. Let me know if you have trouble reproducing this.

module {
  llvm.func @malloc(i64) -> !llvm.ptr<i8>
  llvm.func @_mlir_ciface_tf_report_error(!llvm.ptr<i8>, i32, !llvm.ptr<i8>)
  llvm.mlir.global internal constant @error_message_2208944672953921889("failed to allocate memory at loc(\22-\22:3:8)\00")
  llvm.func @_mlir_ciface_tf_alloc(!llvm.ptr<i8>, i64, i64, i32, i32, !llvm.ptr<i32>) -> !llvm.ptr<i8>
  llvm.func @Rsqrt_CPU_DT_HALF_DT_HALF(%arg0: !llvm.ptr<i8>, %arg1: i64, %arg2: !llvm.ptr<i8>) -> !llvm.struct<(i64, ptr<i8>)> attributes {llvm.emit_c_interface, tf_entry} {
    %0 = llvm.mlir.constant(8 : i32) : i32
    %1 = llvm.mlir.constant(8 : index) : i64
    %2 = llvm.mlir.constant(2 : index) : i64
    %3 = llvm.mlir.constant(dense<0.000000e+00> : vector<4xf16>) : vector<4xf16>
    %4 = llvm.mlir.constant(dense<[0, 1, 2, 3]> : vector<4xi32>) : vector<4xi32>
    %5 = llvm.mlir.constant(dense<1.000000e+00> : vector<4xf16>) : vector<4xf16>
    %6 = llvm.mlir.constant(false) : i1
    %7 = llvm.mlir.constant(1 : i32) : i32
    %8 = llvm.mlir.constant(0 : i32) : i32
    %9 = llvm.mlir.constant(4 : index) : i64
    %10 = llvm.mlir.constant(0 : index) : i64
    %11 = llvm.mlir.constant(1 : index) : i64
    %12 = llvm.mlir.constant(-1 : index) : i64
    %13 = llvm.mlir.null : !llvm.ptr<f16>
    %14 = llvm.getelementptr %13[%9] : (!llvm.ptr<f16>, i64) -> !llvm.ptr<f16>
    %15 = llvm.ptrtoint %14 : !llvm.ptr<f16> to i64
    %16 = llvm.alloca %15 x f16 {alignment = 32 : i64} : (i64) -> !llvm.ptr<f16>
    %17 = llvm.alloca %15 x f16 {alignment = 32 : i64} : (i64) -> !llvm.ptr<f16>
    %18 = llvm.mlir.null : !llvm.ptr<i64>
    %19 = llvm.getelementptr %18[%arg1] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
    %20 = llvm.ptrtoint %19 : !llvm.ptr<i64> to i64
    %21 = llvm.alloca %20 x i64 : (i64) -> !llvm.ptr<i64>
    llvm.br ^bb1(%10 : i64)
  ^bb1(%22: i64):  // 2 preds: ^bb0, ^bb2
    %23 = llvm.icmp "slt" %22, %arg1 : i64
    llvm.cond_br %23, ^bb2, ^bb3
  ^bb2:  // pred: ^bb1
    %24 = llvm.bitcast %arg2 : !llvm.ptr<i8> to !llvm.ptr<struct<(ptr<f16>, ptr<f16>, i64)>>
    %25 = llvm.getelementptr %24[%10, 2] : (!llvm.ptr<struct<(ptr<f16>, ptr<f16>, i64)>>, i64) -> !llvm.ptr<i64>
    %26 = llvm.add %22, %11  : i64
    %27 = llvm.getelementptr %25[%26] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
    %28 = llvm.load %27 : !llvm.ptr<i64>
    %29 = llvm.getelementptr %21[%22] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
    llvm.store %28, %29 : !llvm.ptr<i64>
    llvm.br ^bb1(%26 : i64)
  ^bb3:  // pred: ^bb1
    llvm.br ^bb4(%10, %11 : i64, i64)
  ^bb4(%30: i64, %31: i64):  // 2 preds: ^bb3, ^bb5
    %32 = llvm.icmp "slt" %30, %arg1 : i64
    llvm.cond_br %32, ^bb5, ^bb6
  ^bb5:  // pred: ^bb4
    %33 = llvm.bitcast %arg2 : !llvm.ptr<i8> to !llvm.ptr<struct<(ptr<f16>, ptr<f16>, i64)>>
    %34 = llvm.getelementptr %33[%10, 2] : (!llvm.ptr<struct<(ptr<f16>, ptr<f16>, i64)>>, i64) -> !llvm.ptr<i64>
    %35 = llvm.add %30, %11  : i64
    %36 = llvm.getelementptr %34[%35] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
    %37 = llvm.load %36 : !llvm.ptr<i64>
    %38 = llvm.mul %37, %31  : i64
    llvm.br ^bb4(%35, %38 : i64, i64)
  ^bb6:  // pred: ^bb4
    %39 = llvm.bitcast %arg2 : !llvm.ptr<i8> to !llvm.ptr<ptr<f16>>
    %40 = llvm.getelementptr %39[%11] : (!llvm.ptr<ptr<f16>>, i64) -> !llvm.ptr<ptr<f16>>
    %41 = llvm.load %40 : !llvm.ptr<ptr<f16>>
    %42 = llvm.getelementptr %13[%11] : (!llvm.ptr<f16>, i64) -> !llvm.ptr<f16>
    %43 = llvm.ptrtoint %42 : !llvm.ptr<f16> to i64
    %44 = llvm.alloca %7 x i32 : (i32) -> !llvm.ptr<i32>
    llvm.store %8, %44 : !llvm.ptr<i32>
    %45 = llvm.call @_mlir_ciface_tf_alloc(%arg0, %31, %43, %8, %7, %44) : (!llvm.ptr<i8>, i64, i64, i32, i32, !llvm.ptr<i32>) -> !llvm.ptr<i8>
    %46 = llvm.bitcast %45 : !llvm.ptr<i8> to !llvm.ptr<f16>
    %47 = llvm.icmp "eq" %31, %10 : i64
    %48 = llvm.or %6, %47  : i1
    %49 = llvm.mlir.null : !llvm.ptr<i8>
    %50 = llvm.icmp "ne" %45, %49 : !llvm.ptr<i8>
    %51 = llvm.or %50, %48  : i1
    llvm.cond_br %51, ^bb7, ^bb13
  ^bb7:  // pred: ^bb6
    %52 = llvm.urem %31, %9  : i64
    %53 = llvm.sub %31, %52  : i64
    llvm.br ^bb8(%10 : i64)
  ^bb8(%54: i64):  // 2 preds: ^bb7, ^bb9
    %55 = llvm.icmp "slt" %54, %53 : i64
    llvm.cond_br %55, ^bb9, ^bb10
  ^bb9:  // pred: ^bb8
    %56 = llvm.mul %54, %11  : i64
    %57 = llvm.add %56, %10  : i64
    %58 = llvm.add %57, %10  : i64
    %59 = llvm.getelementptr %41[%58] : (!llvm.ptr<f16>, i64) -> !llvm.ptr<f16>
    %60 = llvm.bitcast %59 : !llvm.ptr<f16> to !llvm.ptr<vector<4xf16>>
    %61 = llvm.load %60 {alignment = 2 : i64} : !llvm.ptr<vector<4xf16>>
    %62 = "llvm.intr.sqrt"(%61) : (vector<4xf16>) -> vector<4xf16>
    %63 = llvm.fdiv %5, %62  : vector<4xf16>
    %64 = llvm.getelementptr %46[%58] : (!llvm.ptr<f16>, i64) -> !llvm.ptr<f16>
    %65 = llvm.bitcast %64 : !llvm.ptr<f16> to !llvm.ptr<vector<4xf16>>
    llvm.store %63, %65 {alignment = 2 : i64} : !llvm.ptr<vector<4xf16>>
    %66 = llvm.add %54, %9  : i64
    llvm.br ^bb8(%66 : i64)
  ^bb10:  // pred: ^bb8
    %67 = llvm.icmp "ult" %53, %31 : i64
    llvm.cond_br %67, ^bb11, ^bb12
  ^bb11:  // pred: ^bb10
    %68 = llvm.mul %53, %12  : i64
    %69 = llvm.add %31, %68  : i64
    %70 = llvm.mul %53, %11  : i64
    %71 = llvm.add %70, %10  : i64
    %72 = llvm.trunc %69 : i64 to i32
    %73 = llvm.mlir.undef : vector<4xi32>
    %74 = llvm.insertelement %72, %73[%8 : i32] : vector<4xi32>
    %75 = llvm.shufflevector %74, %73 [0 : i32, 0 : i32, 0 : i32, 0 : i32] : vector<4xi32>, vector<4xi32>
    %76 = llvm.icmp "slt" %4, %75 : vector<4xi32>
    %77 = llvm.add %71, %10  : i64
    %78 = llvm.getelementptr %41[%77] : (!llvm.ptr<f16>, i64) -> !llvm.ptr<f16>
    %79 = llvm.bitcast %78 : !llvm.ptr<f16> to !llvm.ptr<vector<4xf16>>
    %80 = llvm.intr.masked.load %79, %76, %3 {alignment = 2 : i32} : (!llvm.ptr<vector<4xf16>>, vector<4xi1>, vector<4xf16>) -> vector<4xf16>
    %81 = llvm.bitcast %16 : !llvm.ptr<f16> to !llvm.ptr<vector<4xf16>>
    llvm.store %80, %81 : !llvm.ptr<vector<4xf16>>
    %82 = llvm.load %81 {alignment = 2 : i64} : !llvm.ptr<vector<4xf16>>
    %83 = "llvm.intr.sqrt"(%82) : (vector<4xf16>) -> vector<4xf16>
    %84 = llvm.fdiv %5, %83  : vector<4xf16>
    %85 = llvm.bitcast %17 : !llvm.ptr<f16> to !llvm.ptr<vector<4xf16>>
    llvm.store %84, %85 {alignment = 2 : i64} : !llvm.ptr<vector<4xf16>>
    %86 = llvm.load %85 : !llvm.ptr<vector<4xf16>>
    %87 = llvm.getelementptr %46[%77] : (!llvm.ptr<f16>, i64) -> !llvm.ptr<f16>
    %88 = llvm.bitcast %87 : !llvm.ptr<f16> to !llvm.ptr<vector<4xf16>>
    llvm.intr.masked.store %86, %88, %76 {alignment = 2 : i32} : vector<4xf16>, vector<4xi1> into !llvm.ptr<vector<4xf16>>
    llvm.br ^bb12
  ^bb12:  // 2 preds: ^bb10, ^bb11
    %89 = llvm.mul %2, %1  : i64
    %90 = llvm.mul %arg1, %2  : i64
    %91 = llvm.add %90, %11  : i64
    %92 = llvm.mul %91, %1  : i64
    %93 = llvm.add %89, %92  : i64
    %94 = llvm.alloca %93 x i8 : (i64) -> !llvm.ptr<i8>
    %95 = llvm.bitcast %94 : !llvm.ptr<i8> to !llvm.ptr<ptr<f16>>
    llvm.store %46, %95 : !llvm.ptr<ptr<f16>>
    %96 = llvm.getelementptr %95[%11] : (!llvm.ptr<ptr<f16>>, i64) -> !llvm.ptr<ptr<f16>>
    llvm.store %46, %96 : !llvm.ptr<ptr<f16>>
    %97 = llvm.getelementptr %95[%2] : (!llvm.ptr<ptr<f16>>, i64) -> !llvm.ptr<ptr<f16>>
    %98 = llvm.bitcast %97 : !llvm.ptr<ptr<f16>> to !llvm.ptr<i64>
    llvm.store %10, %98 : !llvm.ptr<i64>
    %99 = llvm.bitcast %94 : !llvm.ptr<i8> to !llvm.ptr<struct<(ptr<f16>, ptr<f16>, i64, i64)>>
    %100 = llvm.getelementptr %99[%10, 3] : (!llvm.ptr<struct<(ptr<f16>, ptr<f16>, i64, i64)>>, i64) -> !llvm.ptr<i64>
    %101 = llvm.getelementptr %100[%arg1] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
    %102 = llvm.sub %arg1, %11  : i64
    llvm.br ^bb14(%102, %11 : i64, i64)
  ^bb13:  // pred: ^bb6
    %103 = llvm.mlir.addressof @error_message_2208944672953921889 : !llvm.ptr<array<42 x i8>>
    %104 = llvm.getelementptr %103[%10, %10] : (!llvm.ptr<array<42 x i8>>, i64, i64) -> !llvm.ptr<i8>
    llvm.call @_mlir_ciface_tf_report_error(%arg0, %0, %104) : (!llvm.ptr<i8>, i32, !llvm.ptr<i8>) -> ()
    %105 = llvm.mul %2, %1  : i64
    %106 = llvm.mul %2, %10  : i64
    %107 = llvm.add %106, %11  : i64
    %108 = llvm.mul %107, %1  : i64
    %109 = llvm.add %105, %108  : i64
    %110 = llvm.alloca %109 x i8 : (i64) -> !llvm.ptr<i8>
    %111 = llvm.bitcast %110 : !llvm.ptr<i8> to !llvm.ptr<ptr<f16>>
    llvm.store %13, %111 : !llvm.ptr<ptr<f16>>
    %112 = llvm.getelementptr %111[%11] : (!llvm.ptr<ptr<f16>>, i64) -> !llvm.ptr<ptr<f16>>
    llvm.store %13, %112 : !llvm.ptr<ptr<f16>>
    %113 = llvm.getelementptr %111[%2] : (!llvm.ptr<ptr<f16>>, i64) -> !llvm.ptr<ptr<f16>>
    %114 = llvm.bitcast %113 : !llvm.ptr<ptr<f16>> to !llvm.ptr<i64>
    llvm.store %10, %114 : !llvm.ptr<i64>
    %115 = llvm.call @malloc(%109) : (i64) -> !llvm.ptr<i8>
    "llvm.intr.memcpy"(%115, %110, %109, %6) : (!llvm.ptr<i8>, !llvm.ptr<i8>, i64, i1) -> ()
    %116 = llvm.mlir.undef : !llvm.struct<(i64, ptr<i8>)>
    %117 = llvm.insertvalue %10, %116[0] : !llvm.struct<(i64, ptr<i8>)>
    %118 = llvm.insertvalue %115, %117[1] : !llvm.struct<(i64, ptr<i8>)>
    llvm.return %118 : !llvm.struct<(i64, ptr<i8>)>
  ^bb14(%119: i64, %120: i64):  // 2 preds: ^bb12, ^bb15
    %121 = llvm.icmp "sge" %119, %10 : i64
    llvm.cond_br %121, ^bb15, ^bb16
  ^bb15:  // pred: ^bb14
    %122 = llvm.getelementptr %21[%119] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
    %123 = llvm.load %122 : !llvm.ptr<i64>
    %124 = llvm.getelementptr %100[%119] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
    llvm.store %123, %124 : !llvm.ptr<i64>
    %125 = llvm.getelementptr %101[%119] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
    llvm.store %120, %125 : !llvm.ptr<i64>
    %126 = llvm.mul %120, %123  : i64
    %127 = llvm.sub %119, %11  : i64
    llvm.br ^bb14(%127, %126 : i64, i64)
  ^bb16:  // pred: ^bb14
    %128 = llvm.call @malloc(%93) : (i64) -> !llvm.ptr<i8>
    "llvm.intr.memcpy"(%128, %94, %93, %6) : (!llvm.ptr<i8>, !llvm.ptr<i8>, i64, i1) -> ()
    %129 = llvm.mlir.undef : !llvm.struct<(i64, ptr<i8>)>
    %130 = llvm.insertvalue %arg1, %129[0] : !llvm.struct<(i64, ptr<i8>)>
    %131 = llvm.insertvalue %128, %130[1] : !llvm.struct<(i64, ptr<i8>)>
    llvm.return %131 : !llvm.struct<(i64, ptr<i8>)>
  }
  llvm.func @_mlir_ciface_Rsqrt_CPU_DT_HALF_DT_HALF(%arg0: !llvm.ptr<struct<(i64, ptr<i8>)>>, %arg1: !llvm.ptr<i8>, %arg2: !llvm.ptr<struct<(i64, ptr<i8>)>>) attributes {llvm.emit_c_interface, tf_entry} {
    %0 = llvm.load %arg2 : !llvm.ptr<struct<(i64, ptr<i8>)>>
    %1 = llvm.extractvalue %0[0] : !llvm.struct<(i64, ptr<i8>)>
    %2 = llvm.extractvalue %0[1] : !llvm.struct<(i64, ptr<i8>)>
    %3 = llvm.call @Rsqrt_CPU_DT_HALF_DT_HALF(%arg1, %1, %2) : (!llvm.ptr<i8>, i64, !llvm.ptr<i8>) -> !llvm.struct<(i64, ptr<i8>)>
    llvm.store %3, %arg0 : !llvm.ptr<struct<(i64, ptr<i8>)>>
    llvm.return
  }
}
2022-06-15 13:24:24 +02:00
Simon Pilgrim cf2072bcad [X86] X86TargetTransformInfo.cpp - use InstructionCost type to accumulate instructions costs 2022-06-15 12:21:01 +01:00
Benjamin Kramer fb34d531af Promote bf16 to f32 when the target doesn't support it
This is modeled after the half-precision fp support. Two new nodes are
introduced for casting from and to bf16. Since casting from bf16 is a
simple operation I opted to always directly lower it to integer
arithmetic. The other way round is more complicated if you want to
preserve IEEE semantics, so it's handled by a new __truncsfbf2
compiler-rt builtin.

This is of course very bare bones, but sufficient to get a semi-softened
fadd on x86.

Possible future improvements:
 - Targets with bf16 conversion instructions can now make fp_to_bf16 legal
 - The software conversion to bf16 can be replaced by a trivial
   implementation under fast math.

Differential Revision: https://reviews.llvm.org/D126953
2022-06-15 12:56:31 +02:00
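A minimal C++ sketch of the "simple operation" in the bf16 -> f32 direction, assuming IEEE-754 binary32 and bf16 stored as the top 16 bits of a float; the reverse direction needs rounding, hence the __truncsfbf2 compiler-rt builtin mentioned above.

```cpp
#include <cstdint>
#include <cstring>

// Widening bf16 to f32 is pure integer arithmetic: place the 16-bit pattern
// in the high half of a 32-bit word and reinterpret it as a float.
float bf16_to_float(uint16_t bf16_bits) {
  uint32_t bits = uint32_t(bf16_bits) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f)); // bit-level reinterpretation, no conversion
  return f;
}
```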
Simon Pilgrim 4fd561415e [X86] needCarryOrOverflowFlag/onlyZeroFlagUsed - merge identical switch cases. NFCI.
Makes it easier to grok and fixes various bugprone-branch-clone warnings.
2022-06-15 10:40:22 +01:00
Amir Ayupov 5965878d4d [X86][NFC] Use mnemonic tables in validateInstruction 4/4
Group switch cases by opcode:
- VGATHERDPD
- VGATHERDPS
- VGATHERQPD
- VGATHERQPS
- VPGATHERDD
- VPGATHERDQ
- VPGATHERQD
- VPGATHERQQ

Distinguish masked vs non-masked forms by EVEX encoding.

Reviewed By: skan, craig.topper

Differential Revision: https://reviews.llvm.org/D127719
2022-06-14 19:53:44 -07:00
Luo, Yuanke 54ec8e25fc [X86][AMX] Fix klockwork issue. 2022-06-15 09:26:59 +08:00
Phoebe Wang 6e02e27536 Reland "[X86][RFC] Enable `_Float16` type support on X86 following the psABI"
Disabled 2 MLIR tests because the runtime doesn't support `_Float16`; see
the issue here: https://github.com/llvm/llvm-project/issues/55992
2022-06-15 09:15:31 +08:00
Amir Ayupov 6226e46c5f [X86][NFC] Use mnemonic tables in validateInstruction 3/4
Group switch cases by opcode:
- V4FMADDPS
- V4FMADDSS
- V4FNMADDPS
- V4FNMADDSS
- VP4DPWSSDS
- VP4DPWSSD

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D127718
2022-06-14 12:11:47 -07:00
Amir Ayupov df16c077dc [X86][NFC] Use mnemonic tables in validateInstruction 2/4
Group switch cases by opcode:
- VFCMULCPH
- VFCMULCSH
- VFMULCPH
- VFMULCSH

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D127717
2022-06-14 12:09:37 -07:00
Amir Ayupov 4bf928bce4 [X86][NFC] Use mnemonic tables in validateInstruction 1/4
Group switch cases by opcode:
- VFCMADDCPH
- VFCMADDCSH
- VFMADDCPH
- VFMADDCSH

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D127716
2022-06-14 12:06:23 -07:00
Simon Pilgrim 64eea34420 [X86] combineEXTEND_VECTOR_INREG - don't attempt to shuffle combine ANY_EXTEND_VECTOR_INREG without SSE41
Without SSE41, ANY_EXTEND_VECTOR_INREG nodes are likely to be prematurely combined to a target shuffle preventing generic sign extension folds.

Fixes a number of sign-extend regressions in D127115.
2022-06-13 17:42:04 +01:00
Maksim Panchenko 8f6512fea0 [X86][Disassembler] Fix displacement operand size for symbolizer
On 64-bit X86, the 0x66 operand-size override prefix will change the size of
the instruction operand, e.g. from 32 bits to 16 bits, but it will not
modify the size of the displacement operand used for memory addressing,
which will always be 32 bits.

Reviewed By: skan, rafauler

Differential Revision: https://reviews.llvm.org/D126726
2022-06-13 00:14:43 -07:00
Kazu Hirata 92ab024f81 [X86] Use default member initialization (NFC)
Identified with modernize-use-default-member-init.
2022-06-12 18:30:46 -07:00
Jez Ng d4bcb45db7 [MC][re-land] Omit DWARF unwind info if compact unwind is present where eligible
This reverts commit d941d59783.

Differential Revision: https://reviews.llvm.org/D122258
2022-06-12 17:24:19 -04:00
Mehdi Amini 5d8298a768 Revert "[X86][RFC] Enable `_Float16` type support on X86 following the psABI"
This reverts commit 2d2da259c8.

This breaks MLIR integration test (JIT crashing), reverting in the
meantime.
2022-06-12 15:14:37 +00:00
Jez Ng d941d59783 Revert "[MC] Omit DWARF unwind info if compact unwind is present where eligible"
This reverts commit ef501bf85d.
2022-06-12 10:47:08 -04:00
Simon Pilgrim b5d7beeb97 [X86] combineConcatVectorOps - add support for concatenation of VSELECT/BLENDV nodes (REAPPLIED)
If the LHS/RHS selection operands can be cheaply concatenated back together then replace 2 x 128-bit selection nodes with 1 x 256-bit node

Addresses the regression introduced in the bug fix from rGd5af6a38082b39ae520a328e44dc29ebcb036bb2

REAPPLIED with a fix for the bug identified in rGea8fb3b60196
2022-06-12 15:40:36 +01:00
Jez Ng ef501bf85d [MC] Omit DWARF unwind info if compact unwind is present where eligible
Previously, omitting unnecessary DWARF unwinds was only done in two
cases:
* For Darwin + aarch64, if no DWARF unwind info is needed for all the
  functions in a TU, then the `__eh_frame` section would be omitted
  entirely. If any one function needed DWARF unwind, then MC would emit
  DWARF unwind entries for all the functions in the TU.
* For watchOS, MC would omit DWARF unwind on a per-function basis, as
  long as compact unwind was available for that function.

This diff makes it so that we omit DWARF unwind on a per-function basis
for Darwin + aarch64 as well. In addition, we introduce the flag
`--emit-dwarf-unwind=` which can toggle between `always`,
`no-compact-unwind` (only emit DWARF when CU cannot be emitted for a
given function), and the target platform `default`.  `no-compact-unwind`
is particularly useful for newer x86_64 platforms: we don't want to omit
DWARF unwind for x86_64 in general due to possible backwards compat
issues, but we should make it possible for people to opt into this
behavior if they are only targeting newer platforms.

**Motivation:** I'm working on adding support for `__eh_frame` to LLD,
but I'm concerned that we would suffer a perf hit. Processing compact
unwind is already expensive, and that's a simpler format than EH frames.
Given that MC currently produces one EH frame entry for every compact
unwind entry, I don't think processing them will be cheap. I tried to do
something clever on LLD's end to drop the unnecessary EH frames at parse
time, but this made the code significantly more complex. So I'm looking
at fixing this at the MC level instead.

**Addendum:** It turns out that there was a latent bug in the X86
backend when `OmitDwarfIfHaveCompactUnwind` is naively enabled, which is
not too surprising given that this combination has not been heretofore
used.

For functions that have unwind info that cannot be encoded with CU, MC
would end up dropping both the compact unwind entry (OK; existing
behavior) as well as the DWARF entries (not OK).  This diff fixes things
so that we emit the DWARF entry, as well as a CU entry with encoding
`UNWIND_X86_MODE_DWARF` -- this basically tells the unwinder to look for
the DWARF entry. I'm not 100% sure the `UNWIND_X86_MODE_DWARF` CU entry
is necessary, this was the simplest fix. ld64 seems to be able to handle
both the absence and presence of this CU entry. Ultimately ld64 (and
LLD) will synthesize `UNWIND_X86_MODE_DWARF` if it is absent, so there
is no impact to the final binary size.

Reviewed By: davide, lhames

Differential Revision: https://reviews.llvm.org/D122258
2022-06-12 10:03:56 -04:00
Phoebe Wang 2d2da259c8 [X86][RFC] Enable `_Float16` type support on X86 following the psABI
GCC and Clang/LLVM will support `_Float16` on X86 in C/C++, following
the latest X86 psABI. (https://gitlab.com/x86-psABIs)

_Float16 arithmetic will be performed using native half-precision. If
native arithmetic instructions are not available, it will be performed
at a higher precision (currently always float) and then truncated down
to _Float16 immediately after each single arithmetic operation.

Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D107082
2022-06-12 11:40:00 +08:00
Simon Pilgrim 7841d09449 [X86][AVX512] Retain pmuldq broadcast loads on 32-bit targets
Don't demand just the lower 32 bits on 32-bit AVX512 targets, to preserve 64-bit broadcast load patterns
2022-06-11 19:30:00 +01:00
Simon Pilgrim 6eaea225c7 [X86] combineTargetShuffle - break if-else chain. NFC.
(style) Both cases always continue.
2022-06-11 09:16:39 +01:00
Simon Pilgrim 89d2b1e4f7 [X86] emitOrXorXorTree - break if-else chain. NFC.
(style) Both cases always return.
2022-06-11 09:16:38 +01:00
Fangrui Song adf4142f76 [MC] De-capitalize SwitchSection. NFC
Add SwitchSection to return switchSection. The API will be removed soon.
2022-06-10 22:50:55 -07:00
Eli Friedman 0ff51d5dde Fix interaction of CFI instructions with MachineOutliner.
1. When checking if a candidate contains a CFI instruction, actually
iterate over all of the instructions, instead of stopping halfway
through.
2. Make sure copied CFI directives refer to the correct instruction.

Fixes https://github.com/llvm/llvm-project/issues/55842

Differential Revision: https://reviews.llvm.org/D126930
2022-06-10 13:37:49 -07:00
Guillaume Chatelet 38637ee477 [clang] Add support for __builtin_memset_inline
In the same spirit as D73543, and in reply to https://reviews.llvm.org/D126768#3549920, this patch adds support for `__builtin_memset_inline`.

The idea is to get support from the compiler to easily write efficient memory function implementations.

This patch could be split in two:
 - one for the LLVM part adding the `llvm.memset.inline.*` intrinsics.
 - and another one for the Clang part providing the intrinsic as a builtin.

Differential Revision: https://reviews.llvm.org/D126903
2022-06-10 13:13:59 +00:00
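A hedged usage sketch: the builtin is assumed to mirror memset's argument order with a size that must be a compile-time constant, and the guaranteed-inline expansion maps to the new llvm.memset.inline intrinsic; the function name and buffer size here are illustrative.

```cpp
// Illustrative only: zero a fixed-size block without ever emitting a call to
// the libc memset, which is what makes this usable inside a memory-function
// implementation itself.
void clear_block(char *dst) {
  __builtin_memset_inline(dst, 0, 64); // size must be a constant expression
}
```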
Simon Pilgrim 5acbb2dda2 [X86] combineMulToPMADDWD - don't bitcast the source ops before splitting to ensure we split the build vectors early
Fixes a regression in D127115 - splitting was creating extract_subvector(bitcast(build_vector())) patterns which prevented the build vectors from being split before being bitcast to vXi16 types, resulting in various issues with further folding of the (now legal) build vectors
2022-06-10 13:44:49 +01:00
Simon Pilgrim 7ac33b8aac [X86] Remove !VT.is128BitVector() check. NFCI.
The code is inside an if(VT.is256BitVector() || VT.is512BitVector()) condition
2022-06-09 21:39:45 +01:00
Simon Pilgrim 72a049d778 [X86][AVX2] LowerINSERT_VECTOR_ELT - support v4i64 insertion as BLENDI(X, SCALAR_TO_VECTOR(Y)) 2022-06-09 21:18:10 +01:00
Simon Pilgrim 1a02db9882 [X86] canonicalizeShuffleWithBinOps - add TODO for X86ISD::ANDNP bitwise handling
It's just as safe to move shuffles across X86ISD::ANDNP as any other logical bitop; they just tend to appear too late to matter.

Noticed while triaging D127115 regressions.
2022-06-09 12:18:26 +01:00
Guillaume Chatelet dc3367970e [SelectionDAG] Handle bzero/memset libcalls globally instead of per target
Differential Revision: https://reviews.llvm.org/D127279
2022-06-09 08:34:55 +00:00
Simon Pilgrim 9a76337fee [X86] combineMOVMSK - constant fold with getTargetConstantBitsFromNode not just BUILD_VECTOR
Help avoid a regression in D127115
2022-06-08 17:48:55 +01:00
Matt Arsenault cc5a1b3dd9 llvm-reduce: Add cloning of target MachineFunctionInfo
MIR support is totally unusable for AMDGPU without this, since the set
of reserved registers is set from fields here.

Add a clone method to MachineFunctionInfo. This is a subtle variant of
the copy constructor that is required if there are any MIR constructs
that use pointers. Specifically, at minimum fields that reference
MachineBasicBlocks or the MachineFunction need to be adjusted to the
values in the new function.
2022-06-07 10:14:48 -04:00
Guillaume Chatelet 0788186182 [Alignment][NFC] Remove usage of MemSDNode::getAlignment
I can't remove the function just yet as it is used in the generated .inc files.
I would also like to provide a way to compare alignment with TypeSize since it came up a few times.

Differential Revision: https://reviews.llvm.org/D126910
2022-06-07 13:52:20 +00:00
Simon Pilgrim f5507978a3 [X86] getFauxShuffleMask - add VSELECT/BLENDV handling
First step towards enabling shuffle combining starting from VSELECT/BLENDV nodes - this should eventually help improve the codegen reported at Issue #54819
2022-06-07 14:46:25 +01:00
Simon Pilgrim 5cea1553b8 [X86] X86SpeculativeLoadHardening.cpp - pass DebugLoc by const reference not value. 2022-06-07 12:38:05 +01:00
Simon Pilgrim 1b6d3bdc82 [X86] foldMaskedMergeImpl - pass SDLoc by const reference not value. 2022-06-07 12:36:30 +01:00
Simon Pilgrim 63e3035dbe [X86] LowerGC_TRANSITION - remove redundant SDLoc(). 2022-06-07 10:57:58 +01:00
Fangrui Song 15d82c62dc [MC] De-capitalize MCStreamer functions
Follow-up to c031378ce0 .
The class is mostly consistent now.
2022-06-07 00:31:02 -07:00
Shilei Tian 0c3e6e5717 [NFC] Remove trailing whitespace 2022-06-06 18:59:13 -04:00
Fangrui Song 77e300ffdf [MC] Change EndOfStatement "unexpected tokens in .xxx directive " to "expected newline" 2022-06-05 15:11:01 -07:00
Kazu Hirata 9a8e65de8c [Target] Use MachineBasicBlock::erase (NFC) 2022-06-04 22:41:24 -07:00
Eric Christopher 93cb6b9c83 Revert "[X86] combineConcatVectorOps - add support for concatenation VSELECT/BLENDV nodes"
See the original commit for a testcase.

This reverts commit ea8fb3b601.
2022-06-03 12:31:11 -07:00
Simon Pilgrim de2b543505 [X86] LowerVSETCC - merge getConstant() calls with flipped/unflipped sign masks. NFCI. 2022-06-01 15:09:48 +01:00
Sanjay Patel 3a503a4a9c [x86] fix miscompile from wrongly identified fneg
We may need to peek through a bitcast when identifying an fneg idiom
via its pool constant, but we can't allow a different-sized constant
in that match.

This is noted in issue #55758 with an example that needs fast-math,
but as the test here shows, this has potential to miscompile more
generally (no fast-math required).

Differential Revision: https://reviews.llvm.org/D126775
2022-06-01 09:56:33 -04:00
Simon Pilgrim f6dbb0b6fb [X86] Fix typo in extraction type introduced in rGed0303aa2251e4484a2b4ff7f236c9f7cdfb2092
It doesn't look like we have test coverage for this at the moment :(
2022-06-01 12:31:27 +01:00
Simon Pilgrim ea8fb3b601 [X86] combineConcatVectorOps - add support for concatenation VSELECT/BLENDV nodes
If the LHS/RHS selection operands can be cheaply concatenated back together then replace 2 x 128-bit selection nodes with 1 x 256-bit node

Addresses the regression introduced in the bug fix from rGd5af6a38082b39ae520a328e44dc29ebcb036bb2
2022-06-01 10:46:06 +01:00
Phoebe Wang a2ea5b496b [X86] Add support for `-mharden-sls=[none|all|return|indirect-jmp]`
The patch addresses the feature request from https://github.com/ClangBuiltLinux/linux/issues/1633. The implementation borrows a lot from aarch64.

Reviewed By: nickdesaulniers, MaskRay

Differential Revision: https://reviews.llvm.org/D126137
2022-06-01 09:45:04 +08:00
Simon Pilgrim d5af6a3808 [X86] LowerMINMAX - split v4i64 types on AVX1 targets (Issue #55648)
Originally we tried to use default expansion for v4i64 types to make it easier to concatenate the results back together, but this can cause infinite loop issues with existing VSELECT splitting code in narrowExtractedVectorSelect if we have other uses of the VSELECT results (e.g. reduction patterns).

To fix the infinite loop, this patch always splits MIN/MAX v4i64 nodes during lowering and I've added a TODO for combineConcatVectorOps to investigate when we can cheaply concatenate VSELECT/BLENDV nodes together.

Fixes #55648 - regression test case will be added in a follow up.
2022-05-31 17:28:56 +01:00
Simon Pilgrim af0113cf77 [X86] combineEXTRACT_SUBVECTOR - pull out repeated getVectorNumElements() calls. NFC. 2022-05-31 16:13:54 +01:00
Simon Pilgrim b9443cb6fa [X86] narrowExtractedVectorSelect - don't peek through bitcasts to find source vector
We don't seem to need this for any test coverage and it was making tracking of the uses() of the source vector more difficult

Noticed while investigating Issue #55648
2022-05-31 14:57:18 +01:00
Simon Pilgrim ed0303aa22 [X86] LowerTRUNCATE - avoid creating extract_subvector(bitcast(vec)) patterns
We have a generic DAG combine to attempt to fold extract_subvector(bitcast(vec)) -> bitcast(extract_subvector(vec)) but if we create these patterns late in lowering then we often miss them.

Noticed while investigating Issue #55648 which gets caught in an infinite loop trying to split extract_subvector(bitcast(vselect()) patterns - this doesn't fix the issue yet but reduces the regressions from the WIP fix.
2022-05-31 14:30:56 +01:00
Simon Pilgrim d384a4c530 [X86] Adjust vector test costs to match SoG (Issue #54889)
znver1/2 models were incorrectly modelling the latency/throughput/uops and znver1 ymm variants also require double pumping.

Now matches what I can decipher from the AMD SoG, Agner and instlatx64 numbers vs the llvm-exegesis report provided by @fabian-r
2022-05-31 09:14:06 +01:00
Xiang1 Zhang 5d5aba78db [X86][NFC] Refine X86 Domain Reassignment for compile time
Differential Revision: https://reviews.llvm.org/D126622
2022-05-31 10:10:40 +08:00
Simon Pilgrim 14cc4674bf [X86] Adjust vector fp test costs to match int test costs
znver1/2 models were missing the vtestps/pd overrides to match the vptest integer equivalents.

Noticed while investigating Issue #54889
2022-05-30 09:50:15 +01:00
Simon Pilgrim 1956f28037 [X86] Adjust vector extend to ymm to match SoG (Issue #54889)
znver1 ymm variants of VPMOVSX**/VPMOVZX** instructions require double pumping.

Now matches AMD SoG, Agner and instlatx64 numbers.

Thanks to @fabian-r for the report
2022-05-30 08:58:56 +01:00
Simon Pilgrim c99690462e [X86] Adjust vector shift costs to match SoG (Issue #54889)
znver1/2 models were incorrectly modelling the fpupipe (should be pipe2 for shift-by-scalar-amount and pipe1 for shift-by-element-amount) and znver1 ymm variants also require double pumping.

Now matches AMD SoG, Agner and instlatx64 numbers.

Thanks to @fabian-r for the report
2022-05-29 17:55:39 +01:00
eopXD 6a84579243 [LSR][TTI][PowerPC][SystemZ][X86] Add const-ness to TTI::isLSRCostLess. NFC
Reviewed By: Meinersbur

Differential Revision: https://reviews.llvm.org/D126350
2022-05-27 15:22:23 -07:00
Luo, Yuanke aaaf9cede7 [X86][AMX] Replace LDTILECFG with PLDTILECFGV on auto-config.
There is an intrinsic `@llvm.x86.ldtilecfg` which is lowered to LDTILECFG.
This intrinsic is open for users to configure tile registers by
themselves. There is a chance that `@llvm.x86.ldtilecfg` would be mixed
with the new AMX intrinsics, which depend on the compiler to configure tile
registers. A separate pseudo instruction PLDTILECFGV avoids
unexpected behaviour when `@llvm.x86.ldtilecfg` is mixed with the new AMX
intrinsics. Though users should not mix the two programming models, the
compiler should avoid crashes or UB when they are mixed.

Differential Revision: https://reviews.llvm.org/D126519
2022-05-27 16:38:35 +08:00
Zongwei Lan ad73ce318e [Target] use getSubtarget<> instead of static_cast<>(getSubtarget())
Differential Revision: https://reviews.llvm.org/D125391
2022-05-26 11:22:41 -07:00
Fangrui Song 9ee15bba47 [MC] Lower case the first letter of EmitCOFF* EmitWin* EmitCV*. NFC 2022-05-26 00:14:08 -07:00
Maksim Panchenko bed9efed71 [MCDisassembler] Disambiguate Size parameter in tryAddingSymbolicOperand()
MCSymbolizer::tryAddingSymbolicOperand() overloaded the Size parameter
to specify either the instruction size or the operand size depending on
the architecture. However, for proper symbolic disassembly on X86, we
need to know both sizes, as an instruction can have two operands, and
the instruction size cannot be reliably calculated based on the operand
offset and its size. Hence, split Size into OpSize and InstSize.

For X86, the new interface allows us to fix a couple of issues:
  * Correctly adjust the value of PC-relative operands.
  * Set operand size to zero when the operand is specified implicitly.

Differential Revision: https://reviews.llvm.org/D126101
2022-05-25 13:44:32 -07:00
Craig Topper 06fee478d2 [X86] Add isSimple check to the load combine in combineExtractVectorElt.
I think we need to be sure the load isn't volatile before we
duplicate and shrink it.

Reviewed By: spatel

Differential Revision: https://reviews.llvm.org/D126353
2022-05-25 09:11:11 -07:00
Simon Pilgrim 6c80267d0f [CostModel][X86] getScalarizationOverhead - improve extraction costs for > 128-bit vectors
We were using the default getScalarizationOverhead expansion for extraction costs, which adds up all the individual element extraction costs.

This is fine for 128-bit vectors, but for 256/512-bit vectors each element extraction also has to account for extracting the upper 128-bit subvector extraction before it can handle the element. For scalarization costs we only need to extract each demanded subvector once.

Differential Revision: https://reviews.llvm.org/D125527
2022-05-24 15:18:08 +01:00
Luo, Yuanke 3b1de7ab60 [X86][AMX] Reduce the compiling time for non-amx code.
Differential Revision: https://reviews.llvm.org/D126280
2022-05-24 18:02:51 +08:00
Luo, Yuanke 496156ac57 [X86][AMX] Multiple configure for AMX register.
The previous solution depended on variable names to record the shape
information. However, that is not reliable, because in release builds the
compiler does not set variable names. It can be worked around with the
additional option `fno-discard-value-names`, but that is not acceptable
for users.
This patch preconfigures the tile registers with machine instructions,
following the same approach as single configure. In the future we can fall
back to multiple configure when single configure fails due to shape
dependency issues.
The algorithm to configure the tile registers is simple in this patch; we
may improve it in the future. It configures tile registers per basic
block, and the compiler spills a tile register if it is live out of the
basic block. After configuration there should be no spill across a tile
configure in register allocation. Just like fast register allocation,
the algorithm walks the instructions in reverse order. When the shape
dependency is not met, it inserts ldtilecfg after the last instruction
that defines the shape.
In post configuration the compiler also walks the basic block to collect the
physical tile register numbers and generates instructions to fill the stack
slots with the corresponding shape information.
TODO: There is follow-up work in D125602. The risk is that modifying the
fast RA may cause regressions, as fast RA is used for different targets.
We may create an independent RA for tile registers.

Differential Revision: https://reviews.llvm.org/D125075
2022-05-24 13:18:42 +08:00
Luo, Yuanke d5999bd3f7 [X86][AMX][NFC] Refactor X86LowerAMXCast.cpp
Change static function to X86LowerAMXCast member function.

Differential Revision: https://reviews.llvm.org/D126058
2022-05-20 19:32:09 +08:00
Bill Wendling 6e00a34cdb [AArch64] Add support for -fzero-call-used-regs
Support the "-fzero-call-used-regs" option on AArch64. This involves much less
specialized code than the X86 version. Most of the checks can be done with
TableGen.

Reviewed By: nickdesaulniers, MaskRay

Differential Revision: https://reviews.llvm.org/D124836
2022-05-19 16:58:28 -07:00
Sotiris Apostolakis a094ad03f3 [NFC] Fix typos in X86CmovConversion 2022-05-19 15:13:11 +00:00
Jay Foad 6bec3e9303 [APInt] Remove all uses of zextOrSelf, sextOrSelf and truncOrSelf
Most clients only used these methods because they wanted to be able to
extend or truncate to the same bit width (which is a no-op). Now that
the standard zext, sext and trunc allow this, there is no reason to use
the OrSelf versions.

The OrSelf versions additionally have the strange behaviour of allowing
extending to a *smaller* width, or truncating to a *larger* width, which
are also treated as no-ops. A small amount of client code relied on this
(ConstantRange::castOp and MicrosoftCXXNameMangler::mangleNumber) and
needed rewriting.

Differential Revision: https://reviews.llvm.org/D125557
2022-05-19 11:23:13 +01:00
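As a hedged illustration of the behaviour change above (the helper below is hypothetical, not from the patch): the plain `zext` now accepts the current bit width as a no-op, so the `OrSelf` variant is no longer needed.
```cpp
#include "llvm/ADT/APInt.h"
#include <algorithm>
using namespace llvm;

// Hypothetical helper: widen a value to at least 32 bits.
// With this change, zext() to the same width is a no-op, so a previous
// zextOrSelf() call can simply become zext().
APInt widenToAtLeast32(const APInt &V) {
  unsigned NewWidth = std::max(V.getBitWidth(), 32u);
  return V.zext(NewWidth); // no-op when V is already >= 32 bits wide
}
```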
Simon Pilgrim 320545b577 [X86] Rename combineCONCAT_VECTORS\INSERT_SUBVECTOR\EXTRACT_SUBVECTOR to match Opcode name. NFCI.
Its a lot easier to quickly search for the combine when it actually contains the name of the opcode it combines.
2022-05-17 18:37:53 +01:00
Simon Pilgrim c64f5d44ad [X86] Attempt to fold EFLAGS into X86ISD::ADD/SUB ops
We already use combineAddOrSubToADCOrSBB to fold extended EFLAGS results into ISD::ADD/SUB ops as X86ISD::ADC/SBB carry ops.

This patch extends this to also try to fold EFLAGS results with X86ISD::ADD/SUB ops

Differential Revision: https://reviews.llvm.org/D125642
2022-05-17 10:59:24 +01:00
Sanjay Patel be7f09f7b2 [IR] create and use helper functions that test the signbit; NFCI 2022-05-16 11:26:23 -04:00
Simon Pilgrim b3077f563d [X86] Move combineAddOrSubToADCOrSBB earlier. NFC.
Make it easier to reuse in X86 ADD/SUB combines in an upcoming patch.
2022-05-15 22:06:33 +01:00
Simon Pilgrim 896557e129 [X86] Adjust fadd costs to match SoG
znver1/2 models were incorrectly modelling these on fpupipe 0 instead of 2/3 and znver1 ymm variants also require double pumping.

Now matches AMD SoG, Agner and instlatx64 numbers.

Thanks to @fabian-r for the report
2022-05-15 21:28:29 +01:00
Simon Pilgrim fd1f0c51ef [X86] lowerShuffleAsLanePermuteAndSHUFP always succeeds, so just return the result. NFC. 2022-05-15 15:53:36 +01:00
Simon Pilgrim c0f59be358 [X86] Pull out repeated isShuffleMaskInputInPlace calls. NFC. 2022-05-15 15:35:09 +01:00
Simon Pilgrim 32162cf291 [X86] lowerV4I64Shuffle - try harder to lower to PERMQ(BLENDD(V1,V2)) pattern 2022-05-15 14:57:58 +01:00
Simon Pilgrim bc90bbb759 [X86] LowerAVG - fix cut+paste typo. NFC. 2022-05-14 17:42:09 +01:00
Simon Pilgrim 98f82d69bd [X86] LowerStore - use is64BitVector() wrapper. NFCI. 2022-05-13 15:30:18 +01:00
Mingming Liu cb22cb2691 [X86] Fix 80 column violation in X86InstrInfo.cpp. NFC
Differential Revision: https://reviews.llvm.org/D125345
2022-05-10 19:56:14 -07:00
Mingming Liu 852f3d9987 Revert "[NFC] Run clang-format on llvm/lib/Target/X86/X86InstroInfo.cpp"
This reverts commit 8bef5476de.

Need to revert, update commit message and reapply.
2022-05-10 19:53:31 -07:00
Mingming Liu 8bef5476de [NFC] Run clang-format on llvm/lib/Target/X86/X86InstroInfo.cpp
Differential Revision: https://reviews.llvm.org/D125345
2022-05-10 17:56:51 -07:00
Mingming Liu fc58d7a326 [Peephole-opt][X86] Enhance peephole opt to see through SUBREG_TO_REG
(following AND) and eliminates redundant TEST instruction.

Differential Revision: https://reviews.llvm.org/D124118
2022-05-10 15:56:20 -07:00
Mingming Liu 1555c41abb Revert "Enhance peephole optimization."
This reverts commit d84ca05ef7.

Will revert, update commit message and re-commit.
2022-05-10 13:59:05 -07:00
Mingming Liu d84ca05ef7 Enhance peephole optimization.
Differential Revision: https://reviews.llvm.org/D124118
2022-05-10 12:35:35 -07:00
Matthias Braun cd19af74c0 Avoid 8 and 16bit switch conditions on x86
This adds a `TargetLoweringBase::getSwitchConditionType` callback to
give targets a chance to control the type used in
`CodeGenPrepare::optimizeSwitchInst`.

Implement callback for X86 to avoid i8 and i16 types where possible as
they often incur extra zero-extensions.

This is NFC for non-X86 targets.

Differential Revision: https://reviews.llvm.org/D124894
2022-05-10 10:00:10 -07:00
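A rough sketch of what such a target override could look like; the hook name is taken from the commit message above, but the signature and body here are assumptions rather than the actual upstream implementation.
```cpp
// Assumed signature/body, for illustration only - not the real X86 override.
MVT X86TargetLowering::getSwitchConditionType(LLVMContext &Context,
                                              EVT ConditionVT) const {
  // Prefer a 32-bit condition to avoid the zero-extensions that i8/i16
  // switch conditions tend to incur on x86.
  if (ConditionVT.isScalarInteger() && ConditionVT.getSizeInBits() < 32)
    return MVT::i32;
  return TargetLoweringBase::getSwitchConditionType(Context, ConditionVT);
}
```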
Simon Pilgrim 6824cf1ab7 [X86] Set some more plausible latencies for horizontal add/subs on znver1
These are all microcoded/multi-pipe nightmares on Ryzen, but we shouldn't just be using the WriteMicrocoded class which is for REALLY bad microcoded nightmares - instead use the same approximate latencies as znver2 (Agner and uops.info both suggest similar values) - and make sure we use the FPU defs for both

Fixes #53242
2022-05-08 15:48:42 +01:00
Simon Pilgrim eeb44579f1 [X86] Add description comments to SandyBridge for COPY/WriteZero/WriteVecMaskedGatherWriteback cases. NFC.
Match other models.

Use X86WriteRes for WriteVecMaskedGatherWriteback like other models as well.
2022-05-07 10:42:19 +01:00
Simon Pilgrim 3d107ce2b2 [CostModel][X86] Relax fcmp costs on SSE41 targets or later
Only pre-SSE41 targets double-pump the fp comparison ops
2022-05-06 13:29:40 +01:00
Simon Pilgrim cbfa857346 [CostModel][X86] Adjust 128-bit select costs to account for slow BLENDV op
Based off the script from D103695 - Jaguar, Bulldozer, Silvermont (et al) and Haswell all have slow BLENDV ops, so adjust the worst case cost values
2022-05-06 13:07:34 +01:00
Simon Pilgrim d21bf51494 [CostModel][X86] Adjust pre-SSE41 fp scalar select costs to account for vector ops
Based off the script from D103695, we now mainly use BLENDV or OR(AND,ANDN) to select scalar float/double ops
2022-05-06 11:41:55 +01:00
Simon Pilgrim f0e8c1d6d9 [CostModel][X86] Adjust 256-bit select costs to account for slow BLENDV op
Based off the script from D103695, on AVX1, Jaguar/Bulldozer both have low throughput for ymm select patterns (BLENDV + OR(AND,ANDN))), and even on AVX2 Haswell still struggles with BLENDV ops
2022-05-06 11:27:37 +01:00
Nick Desaulniers 18fd09ab64 [X86SchedSandyBridge] update cost of COPY to 1 cycle from 0
To match the cost of other scheduling models. This is expected to
schedule mov instructions around INLINEASM less frequently for the
default machine scheduler (pre-RA scheduling).

Suggested by Craig Topper.

Link: https://github.com/llvm/llvm-project/issues/41914

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D122350
2022-05-05 11:14:22 -07:00
Luo, Yuanke 373ce14760 [X86][AMX] Replace PXOR instruction with SET0 in AMX pre config.
To generate a zero value, the PXOR instruction needs 3 operands that are
tied to the same vreg. This is not good in SSA form, and with undef values
the two-address instruction pass may convert
`%0:vr128 = PXORrr undef %0, undef %0`
to `%1:vr128 = PXORrr undef %1:vr128(tied-def 0), undef %0:vr128`,
which is not expected.
It can be simplified to the SET0 instruction, which only takes 1 destination
operand and is more friendly to the two-address instruction pass and the
register allocation pass:
`%0:vr128 = V_SET0`
Also add an AVX1 code path so that it is consistent with other code.

Differential Revision: https://reviews.llvm.org/D124903
2022-05-05 10:44:57 +08:00
Craig Topper 589517925b [X86] Call initializeX86PreTileConfigPass from LLVMInitializeX86Target.
Without this, the pass doesn't show up in print-before/after-all.

Differential Revision: https://reviews.llvm.org/D124973
2022-05-04 19:09:06 -07:00
Luo, Yuanke fe7d0067bd [X86][AMX] Add mayLoad/mayStore property for AMX instructions. 2022-05-03 14:48:22 +08:00
Simon Pilgrim 59dc8ce95a [X86] Reduce some superfluous diffs between znver1/znver2 models. NFC
znver2 is a mainly a search+replace of the znver1 model, but for no reason the HADD and DPPS have been moved around - try to keep these in sync (no actual changes in the models).
2022-05-02 16:45:43 +01:00
Simon Pilgrim ce9c0faca1 [X86][AMX] combineLdSt - don't dereference dyn_cast. NFC
This leads to null pointer dereference warnings - use cast<> which will assert that the cast is correct.
2022-05-02 16:45:43 +01:00
Simon Pilgrim c7662dc3e5 [X86] MOVDDUP has the same sched behaviour as MOVSHDUP/MOVSLDUP on Skylake
Fixes an old TODO - confirmed on Agner + uops.info
2022-05-02 12:50:37 +01:00
Simon Pilgrim 86bb7df6e6 [CostModel][X86] getScalarizationOverhead - handle vXi1 extracts with MOVMSK (pre-AVX512)
We can quickly extract multiple elements of a bool vector using MOVMSK ops - since we don't know what generated the vXi1, I've been optimistic and assumed we can use PMOVMSKB to extract the maximum number of bools with a single op.

The MOVMSK pattern isn't great for extract+insert round trips as vXi1 type legalization can interfere with this a lot - so this relies on us remaining good at using getScalarizationOverhead properly (and tagging both Insert and Extract modes) for those round trip cases.

The AVX512 KMOV codegen for bool extraction is a bit of a mess so for now I've not included that - the per-element cost is a lot more accurate for current codegen.
2022-05-02 09:58:39 +01:00
Simon Pilgrim 980f41d7c4 [X86] (style) Use auto for dyn_cast<> results 2022-05-01 17:15:18 +01:00
Simon Pilgrim d4f06ec874 [X86] (style) Don't use auto for non obvious types 2022-05-01 17:10:21 +01:00
Simon Pilgrim d5198cf92f [CostModel][X86] Check for 'null op' truncations
If the legalized src/dst types are the same, assume the "truncation" is free.

This fixes some edge cases such as mul lo/hi ops and bool vectors which will get legalized back to legal vector widths
2022-05-01 12:03:40 +01:00
Simon Pilgrim c2964746e3 [CostModel][X86] Reduce cost of vector selects on SSE2/AVX1 targets
Based off the script from D103695, we were exaggerating the cost of the OR(AND(X,M),AND(Y,~M)) expansion using instruction count instead of effective throughput
2022-05-01 09:32:14 +01:00
Simon Pilgrim 92235e3bf4 [X86] lowerShuffleAsRepeatedMaskAndLanePermute - permit 32-bit sublane permute for unary v32i8 cases
Increase the likelihood that we can lower to a permd(pshufb()) pattern, but only after we've attempted with 64-bit sublane permutes first

Fixes #55066
2022-04-30 11:00:28 +01:00
Simon Pilgrim b424055b52 [X86] lowerShuffleAsRepeatedMaskAndLanePermute - move the sublane split code into a lambda helper. NFC.
This is a NFC cleanup as part of the work on #55066 - the idea being that we will be able to check for multiple sub lane scales.
2022-04-29 16:03:50 +01:00
Alexey Bataev 371412e065 [COST]Fix crash for non-power-2 vector shuffle mask.
Need to normalize the mask to avoid possible crashes during attempts
to estimate the cost of very long shuffles with a non-power-of-2 number of
elements in masks.
2022-04-29 07:28:07 -07:00
Simon Pilgrim 3562f855b7 [X86] SimplifyDemandedVectorEltsForTargetNode - fold (uniform) shift(0,x) -> 0 2022-04-29 12:08:47 +01:00
Simon Pilgrim 336a1233b2 [X86] SimplifyDemandedVectorEltsForTargetNode - fold shift(0,x) -> 0 2022-04-29 11:32:54 +01:00
Simon Pilgrim 6c44e398ec [X86] combineShuffle - reuse SDLoc. NFCI. 2022-04-29 10:30:11 +01:00
Simon Pilgrim 2d7f0b1c22 [X86] Fold ANDNP(undef,x)/ANDNP(x,undef) -> 0
Matches the fold in DAGCombiner::visitANDLike.
2022-04-29 10:20:48 +01:00
Simon Pilgrim ab17ed0723 [X86] Don't fold AND(SRL(X,Y),1) -> SETCC(BT(X,Y)) on BMI2 targets
With BMI2 we have SHRX which is a lot quicker than regular x86 shifts.

Fixes #55138
2022-04-28 21:28:16 +01:00
Alan Zhao 3333c28fc0 [llvm-ml] Improve indirect call parsing
In MASM, if a QWORD symbol is passed to a jmp or call instruction in
64-bit mode or a DWORD or WORD symbol is passed in 32-bit mode, then
MSVC's assembler recognizes that as an indirect call. Additionally, if
the operand is qualified as a ptr, then that should also be an indirect
call.

Furthermore, in 64-bit mode, such operands are implicitly rip-relative
(in fact, MSVC's assembler ml64.exe does not allow explicitly specifying
rip as a base register.)

To keep this patch managable, this patch does not include:
* error messages for wrong operand types (e.g. passing a QWORD in 32-bit
  mode)
* resolving indirect calls if the symbol is declared after its first
  use (llvm-ml currently only runs a single pass).
* implementing the extern keyword (required to resolve
  https://crbug.com/762167.)

This patch is likely missing a bunch of edge cases, so please do point
them out in the review.

Reviewed By: epastor, hans, MaskRay

Committed By: epastor (on behalf of ayzhao)

Differential Revision: https://reviews.llvm.org/D124413
2022-04-28 13:17:19 -04:00
Simon Pilgrim a9215ed9cc [InstCombine][X86] simplifyDemandedVectorEltsIntrinsic - handle avx2 per-element vector shifts 2022-04-28 18:14:54 +01:00
Alexey Bataev 75e1cf4a6a [COST]Improve cost model for shuffles in SLP.
Introduced masks where they were not added and improved target-dependent
cost models to avoid returning incorrect cost results after adding masks.

Differential Revision: https://reviews.llvm.org/D100486
2022-04-28 10:04:41 -07:00
Simon Pilgrim 9e3b7e8e65 [X86] getTargetVShiftByConstNode - use SelectionDAG::FoldConstantArithmetic to perform constant folding. NFCI.
Remove some unnecessary code duplication.
2022-04-28 17:10:20 +01:00
Alexey Bataev 9861ca0c23 Revert "[COST]Improve cost model for shuffles in SLP."
This reverts commit 29a470e380 to fix
a crash reported in https://reviews.llvm.org/D100486#3479989.
2022-04-28 08:11:56 -07:00
Simon Pilgrim de7cee24b6 [X86] getBT - attempt to peek through aext(and(trunc(x),c)) mask/modulo
Ideally we'd fold this with generic DAGCombiner, but that only works for !isTruncateFree cases - we might be able to adapt IsDesirableToPromoteOp to find truncated src ops in the future, but for now just use this peephole.

Noticed in Issue #55138
2022-04-28 16:10:26 +01:00
Simon Pilgrim ed8dffef4c [X86] getFauxShuffle - don't assume an UNDEF src element for AND/ANDNP results in an UNDEF shuffle mask index
The other src element might be zero, guaranteeing zero.

Fixes #55157
2022-04-28 12:32:58 +01:00
Luo, Yuanke 942ec5c36d [X86][AMX] combine tile cast and load/store instruction.
The `llvm.x86.cast.tile.to.vector` intrinsic is lowered to
`llvm.x86.tilestored64.internal` and `load <256 x i32>`. The
`llvm.x86.cast.vector.to.tile` is lowered to `store <256 x i32>` and
`llvm.x86.tileloadd64.internal`. When `llvm.x86.cast.tile.to.vector` is
used by `store <256 x i32>` or `load <256 x i32>` is used by
`llvm.x86.cast.vector.to.tile`, they can be combined by
`llvm.x86.tilestored64.internal` and `llvm.x86.tileloadd64.internal`.

Differential Revision: https://reviews.llvm.org/D124378
2022-04-28 14:55:21 +08:00
Shengchen Kan 6a6b0e4a63 [X86] Check the address in machine verifier
1. The scale factor must be 1, 2, 4, 8
2. The displacement must fit in 32-bit signed integer

Noticed by: https://github.com/llvm/llvm-project/issues/55091

Reviewed By: pengfei

Differential Revision: https://reviews.llvm.org/D124455
2022-04-28 10:05:39 +08:00
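The two constraints above, written out as a hedged standalone sketch (an assumed helper; the real check operates on MachineOperands inside the machine verifier, not on raw integers):
```cpp
#include <cstdint>

// The scale must be 1/2/4/8 and the displacement must fit in a 32-bit
// signed integer.
bool isValidX86AddressParts(int64_t Scale, int64_t Disp) {
  bool ScaleOK = Scale == 1 || Scale == 2 || Scale == 4 || Scale == 8;
  bool DispOK = Disp >= INT32_MIN && Disp <= INT32_MAX;
  return ScaleOK && DispOK;
}
```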
Bill Wendling 8f2ec974d1 [X86] Move target-generic code into CodeGen [NFC]
This code is the same for all platforms.

Differential Revision: https://reviews.llvm.org/D124566
2022-04-27 15:37:28 -07:00
Simon Pilgrim e378577524 [X86] Use is128BitLaneRepeatedShuffleMask wrapper. NFC.
We don't need to know the actual repeated mask.
2022-04-27 21:09:57 +01:00
Alexey Bataev 29a470e380 [COST]Improve cost model for shuffles in SLP.
Introduced masks where they were not added and improved target-dependent
cost models to avoid returning incorrect cost results after adding masks.

Differential Revision: https://reviews.llvm.org/D100486
2022-04-27 10:56:26 -07:00
Simon Pilgrim 03482bccad [X86] collectConcatOps - add ability to collect from vector 'widening' patterns
Recognise insert_subvector(undef, x, lo/hi) patterns where we double the width of a vector - creating an UNDEF subvector on the fly.
2022-04-27 15:38:58 +01:00
Vasileios Porpodas fa8a9fea47 Recommit "[SLP][TTI] Refactoring of `getShuffleCost` `Args` to work like `getArithmeticInstrCost`"
This reverts commit 6a9bbd9f20.

Code review: https://reviews.llvm.org/D124202
2022-04-26 14:02:40 -07:00
Vasileios Porpodas 6a9bbd9f20 Revert "[SLP][TTI] Refactoring of `getShuffleCost` `Args` to work like `getArithmeticInstrCost`"
This reverts commit 55ce296d6f.
2022-04-26 11:25:26 -07:00
Vasileios Porpodas 55ce296d6f [SLP][TTI] Refactoring of `getShuffleCost` `Args` to work like `getArithmeticInstrCost`
Before this patch, `Args` was used by SLP to pass a broadcast's arguments.
This patch changes this. `Args` is now used for passing the operands of
the shuffle.

Differential Revision: https://reviews.llvm.org/D124202
2022-04-26 11:11:29 -07:00
Xiang1 Zhang c430f0f532 [X86] Add use condition for combineSetCCMOVMSK
Reviewed by RKSimon, LuoYuanke

Differential Revision: https://reviews.llvm.org/D123652
2022-04-26 16:42:50 +08:00
Luo, Yuanke f3ad7ea03a [X86][AMX] Report error when shapes are not pre-defined.
Instead of reporting a fatal error, this patch emits an error message and exits
when shapes are not pre-defined. This makes compilation fail rather than
crash.

Differential Revision: https://reviews.llvm.org/D124342
2022-04-26 14:57:25 +08:00
David Green 9727c77d58 [NFC] Rename Instrinsic to Intrinsic 2022-04-25 18:13:23 +01:00
Simon Pilgrim e8305c0b8f [X86] combineX86ShuffleChain - don't fold to truncate(concat(V1,V2)) if it was already a PACK op
Fixes #55050
2022-04-25 17:13:44 +01:00
Vasileios Porpodas 889588ee97 [SLP] Refactoring isLegalBroadcastLoad() to use `ElementCount`.
Replacing `unsigned` with `ElementCount` in the argument of `isLegalBroadcastLoad()`.
This helps reduce the diff of a future SLP patch for AArch64.
2022-04-21 10:19:00 -07:00
gpei-dev 3e6b904f0a Force insert zero-idiom and break false dependency of dest register for several instructions.
The related instructions are:

VPERMD/Q/PS/PD
VRANGEPD/PS/SD/SS
VGETMANTSS/SD/SH
VGETMANTPS/PD - mem version only
VPMULLQ
VFMULCSH/PH
VFCMULCSH/PH

Differential Revision: https://reviews.llvm.org/D116072
2022-04-21 16:47:13 +08:00
Matt Arsenault 3659780d58 MachineModuleInfo: Remove UsesMorestackAddr
This is x86 specific, and adds statefulness to
MachineModuleInfo. Instead of explicitly tracking this, infer if we
need to declare the symbol based on the reference previously inserted.

This produces a small change in the output due to the move from
AsmPrinter::doFinalization to X86's emitEndOfAsmFile. This will now be
moved relative to other end of file fields, which I'm assuming doesn't
matter (e.g. the __morestack_addr declaration is now after the
.note.GNU-split-stack part)

This also produces another small change in code if the module happened
to define/declare __morestack_addr, but I assume that's invalid and
doesn't really matter.
2022-04-20 11:10:20 -04:00
Matt Arsenault d7938b1a81 MachineModuleInfo: Move HasSplitStack handling to AsmPrinter
This is used to emit one field in doFinalization for the module. We
can accumulate this when emitting all individual functions directly in
the AsmPrinter, rather than accumulating additional state in
MachineModuleInfo.

Move the special case behavior predicate into MachineFrameInfo to
share it. This now promotes it to generic behavior. I'm assuming this
is fine because no other target implements adjustForSegmentedStacks,
or has tests using the split-stack attribute.
2022-04-20 10:54:29 -04:00
Matt Arsenault 209e7ef874 X86: Do not use ValueMap for PreallocatedIds
ValueMap should only be necessary if the IR values can be
replaced. This is only used during codegen, when it's illegal to
change the underlying IR. This allows using the default copy
constructor for X86MachineFunctionInfo.

I'm not happy about targets keeping state here that's only used in one
specific pass, but we don't have a better place to put it right now.
2022-04-19 21:07:47 -04:00
Craig Topper c6fdb1de47 [X86] Move some hasOneUse checks after checking what the opcode is.
Calling hasOneUse can be expensive on nodes with multiple results.
Especially when some results are Chains. By checking the opcode first,
we can avoid walking the uses if it isn't an interesting node,
and thus avoid calling hasOneUse on a node that might have many uses.

Found by profiling the IR given in D123857.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D123881
2022-04-16 14:18:58 -07:00
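A hedged sketch of the check ordering being described (the combine below is hypothetical): the cheap opcode test runs first so the potentially expensive use-list walk is only done for interesting nodes.
```cpp
#include "llvm/CodeGen/SelectionDAGNodes.h"
using namespace llvm;

// Hypothetical combine skeleton illustrating the cheap-check-first ordering.
static SDValue combineExample(SDNode *N) {
  if (N->getOpcode() != ISD::ADD) // cheap: just an integer compare
    return SDValue();
  if (!N->hasOneUse())            // may walk many uses, incl. chain results
    return SDValue();
  // ... the actual transform would go here ...
  return SDValue();
}
```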
Craig Topper 9d86bf825c [X86] Move hasOneUse check after opcode check. NFC
Checking opcode is cheap. hasOneUse might not be if the node has
multiple results. By checking the opcode we can rule out nodes
with multiple results we aren't interested in.
2022-04-15 17:20:57 -07:00
Simon Pilgrim a305d8f44e [X86] Adjust fsetcc/fmin/fmax costs to match SoG (Issue #54889)
znver1/2 models were incorrectly modelling these as 3 cycle latency instructions on the wrong pipe and znver1 ymm variants also require double pumping.

Now matches AMD SoG, Agner and instlatx64 numbers.

Thanks to @fabian-r for the report
2022-04-14 13:27:33 +01:00
Liu, Chen3 bf60a5af0a [X86] Convert unsigned int 0 to floating point with FILD instruction.
unsigned int 0 will be converted to float/double -0.0 when the rounding
mode is set to 'FE_DOWNWARD'. Use the FILD instruction instead of SSE
instructions on 32-bit targets if strictfp is enabled.

Differential Revision: https://reviews.llvm.org/D123660
2022-04-13 20:06:15 +08:00
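A small hedged C++ reproduction of the rounding-mode behaviour behind this fix (the program is illustrative, not from the patch): under FE_DOWNWARD an exactly-zero result of a subtraction is -0.0 by IEEE-754 rules, which is why a subtraction-based SSE uint-to-FP sequence can turn 0 into -0.0 while FILD does not.
```cpp
#include <cfenv>
#include <cmath>
#include <cstdio>

// With round-toward-negative, an exact zero produced by x - x carries a
// negative sign.
int main() {
  std::fesetround(FE_DOWNWARD);
  volatile double x = 42.0;  // volatile to keep the compiler from folding
  double z = x - x;          // exactly zero; its sign depends on the mode
  std::printf("z = %g, signbit = %d\n", z, (int)std::signbit(z));
  return 0;
}
```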
Jonas Paulsson 46f83caebc [InlineAsm] Add support for address operands ("p").
This patch adds support for inline assembly address operands using the "p"
constraint on X86 and SystemZ.

This was in fact broken on X86 (see example at
https://reviews.llvm.org/D110267, Nov 23).

These operands should probably be treated the same as memory operands by
CodeGenPrepare, which have been commented with "TODO" there.

Review: Xiang Zhang and Ulrich Weigand

Differential Revision: https://reviews.llvm.org/D122220
2022-04-13 12:50:21 +02:00
Harald van Dijk 3337f50625 [X86] Fix handling of maskmovdqu in x32 differently
This reverts the functional changes of D103427 but keeps its tests, and
reimplements the functionality by reusing the existing 32-bit
MASKMOVDQU and VMASKMOVDQU instructions as suggested by skan in review.
These instructions were previously predicated on Not64BitMode. This
reimplementation restores the disassembly of a class of instructions,
which will see a test added in followup patch D122449.

These instructions are in 64-bit mode special cased in
X86MCInstLower::Lower, because we use flags with one meaning for subtly
different things: we have an AdSize32 class which indicates both that
the instruction needs a 0x67 prefix and that the text form of the
instruction implies a 0x67 prefix. These instructions are special in
needing a 0x67 prefix but having a text form that does *not* imply a
0x67 prefix, so we encode this in MCInst as an instruction that has an
explicit address size override.

Note that originally VMASKMOVDQU64 was special cased to be excluded from
disassembly, as we cannot distinguish between VMASKMOVDQU and
VMASKMOVDQU64 and rely on the fact that these are indistinguishable, or
close enough to it, at the MCInst level that it does not matter which we
use. Because VMASKMOVDQU now receives special casing, even though it
does not make a difference in the current implementation, as a
precaution VMASKMOVDQU is excluded from disassembly rather than
VMASKMOVDQU64.

Reviewed By: RKSimon, skan

Differential Revision: https://reviews.llvm.org/D122540
2022-04-12 18:32:14 +01:00
Simon Pilgrim 0488c6638b [X86] getFauxShuffleMask - remove use DemandedElts TODO
Most of the getTargetShuffleInputs recursive calls have now gone and the remaining uses aren't likely to benefit from a DemandedElts mask
2022-04-12 15:36:30 +01:00
Simon Pilgrim 058a33d3c9 [X86] Account for high uop/resource usage in BSF/BSR instructions
znver1/2 models were incorrectly modelling these as single uop instructions, instead of the microcoded nightmares they really are.

Now matches AMD SoG, Agner and instlatx64 numbers.

Fixes #54811
2022-04-11 11:20:09 +01:00
Simon Pilgrim 1e803d305a Revert rG88ff6f70c45f2767576c64dde28cbfe7a90916ca "[X86] Extend vselect(cond, pshufb(x), pshufb(y)) -> or(pshufb(x), pshufb(y)) to include inner or(pshufb(x), pshufb(y)) chains"
Reverting while I investigate reports of internal test regressions/failures
2022-04-11 10:42:43 +01:00
Simon Pilgrim 88ff6f70c4 [X86] Extend vselect(cond, pshufb(x), pshufb(y)) -> or(pshufb(x), pshufb(y)) to include inner or(pshufb(x), pshufb(y)) chains 2022-04-10 13:04:53 +01:00
Simon Pilgrim c74d729bd6 [X86] combineExtractSubvector - fold extract_subvector(insert_subvector(V,X,C1),C1)
extract_subvector(insert_subvector(V,X,C1),C1) -> insert_subvector(extract_subvector(V,C1),X,0)

More aggressively attempt to reduce the width of an extract_subvector source - we currently only do this if we're inserting into a zero vector (i.e. canonicalizing to the AVX implicit zero upper elts pattern).

But if we're extracting from the same point as the inner insert_subvector then the fold is still relatively trivial - we can probably do even better if we can ensure the subvector isn't badly split.
2022-04-10 11:03:08 +01:00
Luo, Yuanke 690bed0cec [X86][AMX] Fix infinite loop of getShape.
When walking the user chain to get the shape of a phi node, if there is a
phi node in the chain, we should walk to the users of that phi node instead
of the original phi node.
2022-04-10 14:44:51 +08:00
Simon Pilgrim 30a01bccda [X86] Fold concat(pshufb(x,y),pshufb(z,w)) -> pshufb(concat(x,z),concat(y,w)) 2022-04-09 16:05:50 +01:00
Simon Pilgrim 97ee923248 [X86] lowerV64I8Shuffle - attempt to fold to SHUFFLE(ALIGNR(X,Y)) and OR(PSHUFB(X),PSHUFB(Y)) 2022-04-09 14:09:39 +01:00
Simon Pilgrim 3d4bb78fbe [X86][SSE] combineSelect - more aggressively create zero elements in the or(pshufb(x), pshufb(y)) fold
When we fold vselect(cond, pshufb(x), pshufb(y)) -> or(pshufb(x), pshufb(y)), ensure we convert all undef elements to zero elements - this should help us expose more known zero elements for deeper chains of these cases.

Noticed while triaging Issue #54819
2022-04-09 12:53:00 +01:00
Simon Pilgrim f5b4507486 [X86] Reduce some superfluous diffs between znver1/znver2 models. NFC
znver2 is a mainly a search+replace of the znver1 model, but for no reason some lines have been moved around - try to keep these in sync (no actual changes in the models).
2022-04-09 10:59:18 +01:00
Nikita Popov 3075e5d2ef [X86][FastISel] Fix with.overflow + select eflags clobber (PR54369)
Don't try to directly use the with.overflow flag result in a cmov
if we need to materialize constants between the instruction
producing the overflow flag and the cmov. The current code is
careful to check that there are no other instructions in between,
but misses the constant materialization case (which may clobber
eflags via xor or constant expression evaluation).

Fixes https://github.com/llvm/llvm-project/issues/54369.

Differential Revision: https://reviews.llvm.org/D122825
2022-04-08 16:12:28 +02:00
Simon Pilgrim 5626bd4289 [X86] Fix SLM scheduler model for PMULLD (PR37059)
Adjust the PMULLD entry to match the Intel AoM numbers - PMULLD is a uop nightmare on SLM and we should model it as such.

We had reports of internal regressions the last time this was attempted (rG13a0f83a05ff), but no public repros, and tests I did last year when I had access to a SLM box failed to see anything. My hunch is that the more aggressive PMULLD -> PMADDWD folds we now perform might have helped. We can revisit this again if we ever receive an actual repro.

Fixes #36407
2022-04-08 10:07:06 +01:00
chenglin.bi f72b3a506b [x86] Replace getNodeIfExists to doesNodeExist when only check node exist
Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D123224
2022-04-08 00:33:05 +08:00
Simon Pilgrim cf3a09369a [X86] Enable fast variable per-lane shuffle tuning on all Ryzen targets (PR44795)
rGa3b8695bf592 enabled this for znver3, but AMD SoG, Agner and uops.info all agree that even znver1 has a fast per-lane shuffle op (VPSHUFB), but cross-lane shuffles seem to be slow (PERMPS etc.)

Fixes #44140

Differential Revision: https://reviews.llvm.org/D123306
2022-04-07 16:00:52 +01:00
Simon Pilgrim a1df2ef5cb [X86] Ensure ZN3Tuning inherits from ZN2Tuning instead of ZNTuning
At the moment ZN2Tuning is just a copy of ZNTuning, but we should try to keep a clean inheritance.
2022-04-07 14:01:15 +01:00
Wei Xiao 842d0bf931 [x86] Improve select lowering for smin(x, 0) & smax(x, 0)
smin(x, 0):
  (select (x < 0), x, 0) -> ((x >> (size_in_bits(x)-1))) & x

smax(x, 0):
  (select (x > 0), x, 0) -> (~(x >> (size_in_bits(x)-1))) & x
  The comparison is testing for a positive value, so we have to invert the sign
  bit mask; only do that transform if the target has a bitwise 'and not'
  instruction (the invert is free).

The transform is performed only when CMP has a single user to avoid
increasing total instruction number.

https://alive2.llvm.org/ce/z/euUnNm
https://alive2.llvm.org/ce/z/37339J

Differential Revision: https://reviews.llvm.org/D123109
2022-04-07 15:53:24 +08:00
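A hedged scalar C++ rendering of the two folds above (the patch itself works on SelectionDAG nodes, and an arithmetic right shift of negative values is assumed):
```cpp
#include <cstdint>

// (x >> 31) is all-ones for negative x and 0 otherwise (assuming an
// arithmetic shift), so the AND keeps x exactly when the select would.
int32_t sminWithZero(int32_t x) { return (x >> 31) & x; }  // smin(x, 0)
int32_t smaxWithZero(int32_t x) { return ~(x >> 31) & x; } // smax(x, 0)
```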
Matt Arsenault c4ea925f50 AtomicExpand: Change return type for shouldExpandAtomicStoreInIR
Use the same enum as the other atomic instructions for consistency, in
preparation for addition of another strategy.

Introduce a new "Expand" option, since the store expansion does not
use cmpxchg. Alternatively, the existing CmpXChg strategy could be
renamed to Expand.
2022-04-06 22:34:04 -04:00
Roman Lebedev 9be6e7b0f2 [X86] `lowerBuildVectorAsBroadcast()`: with AVX512VL, allow i64->XMM broadcasts from constant pool
Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D123221
2022-04-06 18:33:40 +03:00
Shengchen Kan f4661b5a55 [X86] Fold MMX_MOVD64from64rr + store to MMX_MOVQ64mr instead of MMX_MOVD64from64mr in auto-generated table
This is a follow-up patch for D122241.
2022-04-06 21:33:57 +08:00
Shengchen Kan 4d21497006 [X86] Remove TB_NO_REVERSE for 2 memory folding entries
```
X86::MMX_MOVD64from64rr -> X86::MMX_MOVQ64mr
X86::MMX_MOVD64grr -> X86::MMX_MOVD64mr
```

These two entries were added in llvm-svn: 372770.
I think these two should be reversible.

Reviewed By: RKSimon, pengfei

Differential Revision: https://reviews.llvm.org/D122217
2022-04-06 17:21:12 +08:00
Martin Storsjö 46776f7556 Fix warnings about variables that are set but only used in debug mode
Add void casts to mark the variables used, next to the places where
they are used in assert or `LLVM_DEBUG()` expressions.

Differential Revision: https://reviews.llvm.org/D123117
2022-04-06 10:01:46 +03:00
Shengchen Kan 81b10f8200 [X86][tablgen] Consider the mnemonic when auto-generating memory folding table
Intuitively, the memory folding pair should have the same mnemonic.

This patch removes
```
{X86::SENDUIPI,X86::VMXON}
```
in the auto-generated table.
And `NotMemoryFoldable` for `TPAUSE` and `CLWB` can be saved.
```
{X86::MOVLHPSrr,X86::MOVHPSrm}
{X86::VMOVLHPSZrr,X86::VMOVHPSZ128rm}
{X86::VMOVLHPSrr,X86::VMOVHPSrm}
```
It seems the three pairs above were mistakenly killed.
But we can add them back manually later.

Reviewed By: Amir

Differential Revision: https://reviews.llvm.org/D122477
2022-04-06 12:53:05 +08:00
Pierre Gousseau a3d5f1cf5d [x86] Fix infinite loop inside DAG combiner with lzcnt feature.
The issue affects targets supporting fast-lzcnt such as btver2.
This removes extraneous zext/trunc node insertions to fix the infinite
loop.
This fixes Issue https://github.com/llvm/llvm-project/issues/54694

Differential Revision: https://reviews.llvm.org/D122900

Reviewed By: RKSimon, spatel, lebedev.ri
2022-04-05 17:32:10 +01:00
Wei Xiao ca33d74ca5 [X86] Improve x86-partial-reduction to support abs intrinsic
The current implementation only recognizes the absolute-value operation
implemented by a select instruction. This patch adds support for the abs intrinsic.

Differential Revision: https://reviews.llvm.org/D122777
2022-04-05 11:32:09 +08:00
Simon Pilgrim ffe0cc82db [X86] Add XOR(X, MIN_SIGNED_VALUE) -> ADD(X, MIN_SIGNED_VALUE) isel patterns (PR52267)
Improve chances of folding to LEA patterns

Differential Revision: https://reviews.llvm.org/D123043
2022-04-04 19:47:06 +01:00
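Why the two forms are interchangeable, as a hedged check (hypothetical helper): the constant has only the top bit set, so adding it toggles that bit with the carry discarded, exactly like the XOR.
```cpp
#include <cassert>
#include <cstdint>

// For the sign-bit constant, ADD and XOR agree on every input: the addition
// cannot touch the low bits, and any carry out of bit 31 is simply dropped.
void checkXorAddEquivalence(uint32_t X) {
  const uint32_t MinSigned = 0x80000000u;
  assert((X ^ MinSigned) == (X + MinSigned));
}
```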
Simon Pilgrim 623d4b5787 [X86] Support optional NOT stages in the AND(SRL(X,Y),1) -> SETCC(BT(X,Y)) fold
Extension to D122891, peek through NOT() ops, adjusting the condcode as we go.
2022-04-04 10:51:26 +01:00
Simon Pilgrim fbfd78f7aa [X86] lowerShuffleAsRepeatedMaskAndLanePermute - allow v16i32 sub-lane permutes for v64i8 shuffles
Without VBMI, we are better off permuting v16i32 sub-lanes, even though it's a variable shuffle, if it allows us to then shuffle v64i8 in-lane repeated masks (PSHUFB etc.)

Fixes #54658
2022-04-03 10:05:10 +01:00
Simon Pilgrim 76cd11f303 [DAG] Add llvm::isMinSignedConstant helper. NFC
Pulled out of D122754
2022-04-01 17:47:34 +01:00
Simon Pilgrim c64f37f818 [X86] matchAddressRecursively - add XOR(X, MIN_SIGNED_VALUE) handling
Allows us to fold XOR(X, MIN_SIGNED_VALUE) == ADD(X, MIN_SIGNED_VALUE) into LEA patterns

As mentioned on PR52267.

Differential Revision: https://reviews.llvm.org/D122815
2022-04-01 17:26:29 +01:00
Simon Pilgrim b8652fbcbb [X86] Fold AND(SRL(X,Y),1) -> SETCC(BT(X,Y)) (RECOMMITTED)
As noticed on PR39174, if we're extracting a single non-constant bit index, then try to use BT+SETCC instead to avoid messing around moving the shift amount to the ECX register, using slow x86 shift ops etc.

Recommitted with a fix to ensure we zext/trunc the SETCC result to the original type.

Differential Revision: https://reviews.llvm.org/D122891
2022-04-01 16:59:06 +01:00
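A hedged source-level view of the pattern this targets (hypothetical function): extracting one variable bit, which previously lowered to a shift through ECX plus an AND and can now become BT + SETCC.
```cpp
#include <cstdint>

// The AND(SRL(X,Y),1) pattern at the source level: test bit Y of X.
bool testBit(uint32_t X, uint32_t Y) {
  return (X >> Y) & 1u;
}
```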
Simon Pilgrim 5a457bd2fa Revert rGa5f637bcbb7d1e08ce637f113fc117c3f4b2b110 "[X86] Fold AND(SRL(X,Y),1) -> SETCC(BT(X,Y))"
Investigating a sanitizer-windows buildbot breakage
2022-04-01 16:48:24 +01:00
Simon Pilgrim 9afa6811ad [X86] lowerShuffleAsRepeatedMaskAndLanePermute - allow 64-bit sublane shuffling on AVX512BW v64i8 shuffles
We were only performing this on 256-bit vectors on AVX2 targets

Noticed while triaging Issue #54658
2022-04-01 16:40:10 +01:00
Simon Pilgrim a5f637bcbb [X86] Fold AND(SRL(X,Y),1) -> SETCC(BT(X,Y))
As noticed on PR39174, if we're extracting a single non-constant bit index, then try to use BT+SETCC instead to avoid messing around moving the shift amount to the ECX register, using slow x86 shift ops etc.

Differential Revision: https://reviews.llvm.org/D122891
2022-04-01 16:07:56 +01:00
Simon Pilgrim 3245cfb8d3 [X86] Add getBT helper node for attempting to create a X86ISD::BT node
Avoids repeating all the extension/legalization wrappers in every use
2022-04-01 11:48:25 +01:00
Simon Pilgrim 919b657080 Revert rGff2d1bb2b749bd8a5697c25d2380b7c97a59ae06 "[X86] Add getBT helper node for attempting to create a X86ISD::BT node"
Typo means that this doesn't return a value in all cases.
2022-04-01 11:21:00 +01:00
Simon Pilgrim ff2d1bb2b7 [X86] Add getBT helper node for attempting to create a X86ISD::BT node
Avoids repeating all the extension/legalization wrapper in every use
2022-04-01 11:12:23 +01:00
Simon Pilgrim cb5c4a5917 [X86] lowerV8I16Shuffle - use explicit SmallVector<SDValue, 4> width to avoid MSVC AVX alignment bug
As discussed on Issue #54645 - building llc with /AVX can result in incorrectly aligned structs
2022-04-01 10:54:24 +01:00
Fangrui Song ac6878b330 [X86] Set frame-setup/frame-destroy on prologue/epilogue CFI instructions
This approach is used by AArch64/RISCV to make frame-setup/frame-destroy
instructions contiguous instead of being interleaved by CFI instructions. Code
checking `MBBI->getFlag(MachineInstr::FrameSetup) || MBBI->isCFIInstruction()`
can be simplified to just check FrameSetup.

This helps locate all CFI instructions in the prologue, which can be handy to use
.cfi_remember_state/.cfi_restore_state to decrease unwind table size (D114545).

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D122541
2022-03-31 23:04:50 -07:00
Matt Arsenault f635be3014 X86/GlobalISel: Use LLT form of getMachineMemOperand 2022-03-31 18:49:23 -04:00
Simon Pilgrim 535211c3eb [X86] Remove redundant FIXME
lowerV64I8Shuffle has been extended a lot since this was added.
2022-03-31 18:05:52 +01:00
Simon Pilgrim fac1729924 [X86] lowerV64I8Shuffle - don't use lowerShuffleWithPERMV until we've tried simpler options
Shuffle combining will still lower to this with better fast cross lane checks.

Noticed while triaging Issue #54658
2022-03-31 18:05:51 +01:00
Sanjay Patel 4a54e3eed3 [x86] try to replace 0.0 in fcmp with negated operand
This inverts a fold recently added to IR with:
3491f2f4b0

We can put -bidirectional on the Alive2 examples to show that
the reverse transforms work:
https://alive2.llvm.org/ce/z/8iVQwB

The motivation for the IR change was to improve matching to
'fabs' in IR (see https://github.com/llvm/llvm-project/issues/38828 ),
but it regressed x86 codegen for 'not-quite-fabs' patterns like
(X > -X) ? X : -X.
Ie, when there is no fast-math (nsz), the cmp+select is not a proper
fabs operation, but it does map nicely to the unusual NAN semantics
of MINSS/MAXSS.

I drafted this as a target-independent fold, but it doesn't appear to
help any other targets and seems to cause regressions for SystemZ at
least.

Differential Revision: https://reviews.llvm.org/D122726
2022-03-31 09:17:49 -04:00
Luo, Yuanke 6753eb0c90 [X86][AMX] Materialize undef or zero value to tilezero
The AMX combiner would store undef or zero to the stack and invoke tileload
to load the data into a tile register. To avoid the store/load, we can
materialize the undef or zero value with tilezero.

Differential Revision: https://reviews.llvm.org/D122714
2022-03-31 19:10:28 +08:00
Simon Pilgrim 481b185620 [X86] combineCarryThroughADD - recognise X86ISD::ADD(AND(X,1),-1) pattern can be folded to X86ISD::BT
As mentioned on D122482, if we've generated a masked overflow test see if we can fold it to X86ISD::BT to feed a X86ISD::ADC/SBB

Differential Revision: https://reviews.llvm.org/D122572
2022-03-31 09:52:55 +01:00
Luo, Yuanke 1141c8b6fc [X86][AMX] Fix bug for amx cast transform
After combining AMX cast operations, some AMX cast intrinsics may become dead
code. This patch deletes such dead code to avoid a crash.
2022-03-30 17:22:30 +08:00
Simon Pilgrim 6697e3354f [X86] combineADC - fold ADC(C1,C2,Carry) -> ADC(0,C1+C2,Carry)
If we're not relying on the flag result, we can fold the constants together into the RHS immediate operand and set the LHS operand to zero, simplifying for further folds.

We could do something similar if the flag result is in use and the constant fold doesn't affect it, but I don't have any real test cases for this yet.

As suggested by @davezarzycki on Issue #35256

Differential Revision: https://reviews.llvm.org/D122482
2022-03-30 09:11:55 +01:00
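A hedged scalar model of the value-level equivalence used above (flags are ignored here, matching the "not relying on the flag result" precondition; the helpers are hypothetical):
```cpp
#include <cassert>
#include <cstdint>

// Value of an add-with-carry, ignoring the produced flags.
uint32_t adcValue(uint32_t A, uint32_t B, bool CarryIn) {
  return A + B + (CarryIn ? 1u : 0u);
}

// With two constant operands, only C1 + C2 matters for the value, so the
// constants can be merged and the other operand replaced with zero.
void checkAdcConstFold(uint32_t C1, uint32_t C2, bool CarryIn) {
  assert(adcValue(C1, C2, CarryIn) == adcValue(0u, C1 + C2, CarryIn));
}
```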
Simon Pilgrim d663166acb [CostModel][X86] Reduce cost of v2i64 icmp base cost on SSE2 targets
Based off the script from D103695, we were exaggerating the cost of the v2i64 comparison expansion using instruction count instead of effective throughput
2022-03-30 09:11:55 +01:00
Simon Pilgrim 1ec109ec58 [X86] combineCarryThroughADD - remove unused peek through of SEXT/AEXT nodes. 2022-03-29 17:22:50 +01:00
Shao-Ce SUN 662b9fa02c [NFC][CodeGen] Add a setTargetDAGCombine use ArrayRef
Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D122557
2022-03-29 09:53:24 +08:00
Simon Pilgrim 8a1956dfa5 [X86] lowerV64I8Shuffle - attempt to match with lowerShuffleAsLanePermuteAndPermute
Fixes #54562
2022-03-28 17:21:27 +01:00
Kazu Hirata 6212871968 [Target] Apply clang-tidy fixes for readability-redundant-member-init (NFC) 2022-03-27 22:22:37 -07:00
Phoebe Wang 674d52e8ce [X86] Refactor X86ScalarSSEf16/32/64 with hasFP16/SSE1/SSE2. NFCI
This is used for f16 emulation. We emulate f16 for SSE2 targets and
above. Refactoring makes future code cleaner.

Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D122475
2022-03-27 12:24:02 +08:00
Shengchen Kan dc68ca3eff [X86][tablgen] Rename field hasREX_WPrefix to hasREX_W for X86Inst. NFC
To make it more like hasVEX_L and hasEVEX_K, etc.
2022-03-26 23:14:08 +08:00
Shengchen Kan 271e8d2495 [X86][tablgen] Refine the class RecognizableInstr. NFCI
1. Add comments to explain why we set `isAsmParserOnly` for XACQUIRE and XRELEASE
2. Check `X86Inst` in the constructor of `RecognizableInstrBase` so that
   we can avoid the case where one of its fields is not initialized but
   accessed by a user (e.g. in X86EVEX2VEXTablesEmitter.cpp)
3. Move `Rec` from `RecognizableInstrBase` to `RecognizableInstr` to reduce
   size of `RecognizableInstrBase`
4. Remove out-of-date comments for shouldBeEmitted() (filter() was removed)
5. Add a basic field `IsAsmParserOnly` and remove the field
   `ShouldBeEmitted` b/c we can deduce it w/ little overhead
2022-03-26 22:41:49 +08:00