This is a more general version of D109273, though it doesn't
peek through bitcasts or rearrange broadcasts.
Reviewed By: LuoYuanke
Differential Revision: https://reviews.llvm.org/D109295
d8faf03807 implemented general-regs-only for X86 by disabling all features
with vector instructions. But the CRC32 instruction in SSE4.2 ISA, which uses
only GPRs, also becomes unavailable. This patch adds a CRC32 feature for this
instruction and allows it to be used with general-regs-only.
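A usage sketch, assuming a compiler with this feature (e.g. clang with -mgeneral-regs-only plus the new crc32 feature flag): the CRC32 intrinsics only touch GPRs, so code like this can now build without vector instructions.
```
#include <nmmintrin.h> // _mm_crc32_u32 (SSE4.2 CRC32 intrinsics)
#include <cstdio>

int main() {
  // CRC32 only reads and writes general-purpose registers, so it is
  // safe to use even when vector registers are off limits.
  unsigned crc = 0;
  crc = _mm_crc32_u32(crc, 0xDEADBEEFu);
  std::printf("%08x\n", crc);
}
```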
Reviewed By: pengfei
Differential Revision: https://reviews.llvm.org/D105462
The xmm variants have half the throughput (and +1cy latency) of the mmx variants, but are still 1uop.
I still need to do more thorough testing of SLM on test-suite before fixing the obvious bad numbers for WritePMULLD.
But this helps the D103695 helper script get to more accurate numbers for vXi32 multiplies of extended operands (i.e. we can use PMADDWD, PMULLW/PMULHW, etc.). Matches what Intel AoM / Agner / llvm-exegesis report.
For RMW instructions, the load and store hold the MEC for an extra cycle, but within the same single uop. This is alluded to in the Intel AOM:
"The MEC also owns the MEC RSV, which is responsible for scheduling of all loads and stores. Load and
store instructions go through addresses generation phase in program order to avoid on-the-fly memory
ordering later in the pipeline. Therefore, an unknown address will stall younger memory instructions."
Noticed while trying to get a cheap SLM test box up and running with llvm-exegesis - RMW arithmetic is always 1uop - which matches what Agner / InstLatX64 report as well.
These were all set to the same best-case mul i32 values (which seems to be the only version of MUL that SLM actually performs well with).
Noticed while trying to improve multiplication costs for vectorization via the D103695 helper script. Confirmed with Intel AoM / Agner / InstLatX64.
SLM PBLENDVB is just as bad as BLENDVPD/PS - so model it as such, fixing the rr vs rm uops diff as well. The Intel AoM appears to have a copy+paste typo with PBLENDW, it doesn't match Agner or InstLatX64.
Noticed while investigating some of the weird discrepancies reported by the D103695 helper script (SLM had much better vector shift throughputs than it should).
Fixing this link error:
ld: error: relocation R_X86_64_PC32 cannot be used against symbol __asan_report_load...; recompile with -fPIC
Reviewed By: vitalybuka
Differential Revision: https://reviews.llvm.org/D109183
https://reviews.llvm.org/D56686 was supposed to allow these to
work on Windows without needing to enable the xsave feature to
match MSVC. It seems this didn't work because the backend isel
patterns would still block it.
This patch removes the predicates from the isel patterns.
Fixes PR51706.
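A usage sketch for what this unblocks, assuming a Windows-targeting toolchain whose immintrin.h exposes _xgetbv (as MSVC's does): the intrinsic should now select without enabling the xsave feature.
```
#include <immintrin.h> // _xgetbv
#include <cstdio>

int main() {
  // Read XCR0, the XSAVE enabled-feature mask; previously this failed
  // isel on Windows unless the xsave feature was enabled.
  unsigned long long xcr0 = _xgetbv(0);
  std::printf("XCR0 = %llx\n", xcr0);
}
```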
Reviewed By: pengfei
Differential Revision: https://reviews.llvm.org/D109097
PMADDWD(v8i16 x, v8i16 y) == (v4i32) { (int)x[0]*y[0] + (int)x[1]*y[1], ..., (int)x[6]*y[6] + (int)x[7]*y[7] }
Currently combineMulToPMADDWD only folds cases where the upper 17 bits of both vXi32 inputs are known zero (i.e. in each 2xi16 pair, the first half is positive and the second half is zero). This can be relaxed to require only one zero-extended input, as long as the other input has at least 17 sign bits.
That way the sign of the result is still preserved, and the second half is still zero.
Noticed while investigating PR47437.
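A scalar model of the relaxed fold (plain C++, not the combine itself): when x's upper 17 bits are zero and y fits in i16 as a signed value, each PMADDWD lane reduces to an ordinary 32-bit multiply.
```
#include <cstdint>
#include <cstdio>

// Per-lane PMADDWD: split each i32 into two i16 halves, multiply
// pairwise with sign extension, and sum the two products.
static int32_t pmaddwd_lane(uint32_t x, uint32_t y) {
  int32_t lo = (int32_t)(int16_t)(x & 0xFFFF) * (int32_t)(int16_t)(y & 0xFFFF);
  int32_t hi = (int32_t)(int16_t)(x >> 16) * (int32_t)(int16_t)(y >> 16);
  return lo + hi;
}

int main() {
  uint32_t x = 0x5678; // upper 17 bits zero, so the hi product is 0
  int32_t y = -1234;   // at least 17 sign bits: fits in i16 as signed
  int32_t viaPmaddwd = pmaddwd_lane(x, (uint32_t)y);
  int32_t viaMul = (int32_t)x * y;
  std::printf("%d %d\n", viaPmaddwd, viaMul); // same value
}
```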
Differential Revision: https://reviews.llvm.org/D108522
Please refer to
https://lists.llvm.org/pipermail/llvm-dev/2021-September/152440.html
(and that whole thread.)
TLDR: the original patch had no prior RFC, yet it had some changes that
really need a proper RFC discussion. It won't be productive to discuss
such an RFC, once it's actually posted, while said patch is already
committed, because that introduces bias towards already-committed stuff,
and the tree is potentially in a broken state meanwhile.
While the end result of discussion may lead back to the current design,
it may also not lead to the current design.
Therefore I take it upon myself
to revert the tree back to the last known good state.
This reverts commit 4c4093e6e3.
This reverts commit 0a2b1ba33a.
This reverts commit d9873711cb.
This reverts commit 791006fb8c.
This reverts commit c22b64ef66.
This reverts commit 72ebcd3198.
This reverts commit 5fa6039a5f.
This reverts commit 9efda541bf.
This reverts commit 94d3ff09cf.
Icelake, Rocketlake and Tigerlake targets currently use the SkylakeServer scheduler model, despite being later microarchitectures, leading to both reported bugs (PR48110) and discrepancies when comparing llvm-mca reports to other profiling tools (OSACA, uops, uica, etc.). And tbh I'm getting sick of llvm-mca getting blamed for what are backend scheduler model issues :-(
This patch doesn't attempt to fix any of these discrepancies - there should be no changes in codegen - it's a setup patch that copies the skx model, renames all the resources, adds the additional ports (but doesn't reference them yet) and updates the llvm-exegesis pfm counter mappings (based off https://sourceforge.net/p/perfmon2/libpfm4/ci/master/tree/lib/events/intel_icl_events.h).
This should make it trivial for anyone with hardware access to use llvm-exegesis reports to iteratively improve the model (my attempts to get hold of a cheap Tiger Lake box haven't been fruitful yet...).
I will copy the SkylakeServer llvm-mca resource tests as follow-up commits - the diff should entirely be the resource renames.
Differential Revision: https://reviews.llvm.org/D108914
The backend generally uses 64-bit immediates (e.g. what
MachineOperand::getImm() returns), so use that for analyzeCompare()
and optimizeCompareInst() as well. This avoids truncation for
targets that support immediates larger than 32-bit. In particular, we
can avoid the bug-prone value normalization hack in the AArch64
target.
This is a followup to D108076.
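A self-contained illustration (not code from the patch) of why 32-bit storage is bug-prone here: two distinct 64-bit immediates can collapse to the same 32-bit value, which the int64_t interface avoids.
```
#include <cstdint>
#include <cstdio>

int main() {
  int64_t immA = 0x100000000; // 2^32: a valid 64-bit compare immediate
  int64_t immB = 0;
  int a32 = (int)immA;        // truncates to 0
  int b32 = (int)immB;
  std::printf("distinct as 64-bit: %d, equal after truncation: %d\n",
              (int)(immA != immB), (int)(a32 == b32));
}
```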
Differential Revision: https://reviews.llvm.org/D108875
The GLM/SLM special cases are still tested first, but now after the MUL/DIV/REM pattern detection - this will be necessary for when we make the SLM vXi32 MUL canonicalization generic to improve PMULLW/PMULHW/PMADDWD cost support etc.
Remove method X86LowerAMXType::getRowFromCol since it's not used, and
it's causing a warning.
Reviewed By: LuoYuanke
Differential Revision: https://reviews.llvm.org/D108862
This fixes another reproducer from https://bugs.llvm.org/show_bug.cgi?id=51615
And again, the fix lies not in the code added in D105390.
In this case, we completely fail to check that the "broadcast-from-mem" we create
can actually fold the load; here, its operand was not a load at all:
```
Combining: t16: v8i32 = vector_shuffle<0,u,u,u,0,u,u,u> t14, undef:v8i32
Creating new node: t29: i32 = undef
RepeatLoad:
t8: i32 = truncate t7
t7: i64 = extract_vector_elt t5, Constant:i64<0>
t5: v2i64,ch = load<(load (s128) from %ir.arg)> t0, t2, undef:i64
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t1: i64 = Register %0
t4: i64 = undef
t3: i64 = Constant<0>
Combining: t15: v8i32 = undef
```
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D108821
The x86_fp80 format allows values that do not fit into any IEEE-754 category.
Previously they were recognized by the intrinsic __builtin_isnan as NaNs.
Now this intrinsic is implemented using the FXAM instruction, which
distinguishes between NaNs and unsupported values. This can make some
programs behave differently.
As a solution, this fix changes the lowering of the intrinsic. If floating
point exceptions are ignored, llvm.isnan is lowered into an unordered
comparison, as __builtin_isnan was implemented earlier. In strictfp
functions the intrinsic is lowered using FXAM, which does not raise
exceptions even for signaling NaNs, as required by the IEEE-754 and C
standards.
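A minimal sketch of the unordered-comparison lowering used in the non-strictfp case (illustrative C++, not the lowering code itself): a self-comparison with != is unordered exactly when the operand is a NaN.
```
#include <cmath>
#include <cstdio>

static bool isnan_via_unordered(long double x) {
  // x != x is an unordered compare: true iff x is a NaN
  // (assumes default IEEE-conformant FP semantics, no -ffast-math).
  return x != x;
}

int main() {
  std::printf("%d %d\n", (int)isnan_via_unordered(std::nanl("")),
              (int)isnan_via_unordered(1.0L));
}
```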
Differential Revision: https://reviews.llvm.org/D108037
There are no paths into LowerFormalParms that have already specified
the sret register. We always materialize a virtual and then assign it
to the physical reg at the point of the return.
Differential Revision: https://reviews.llvm.org/D108762
Exegesis is faulty: sometimes, when measuring throughput^-1, it
produces snippets that have loop-carried dependencies,
which must be what caused me to measure this incorrectly originally.
After looking much more carefully, the inverse throughput should match
that of the MULX w/ reg op.
As per llvm-exegesis measurements.
Looks like NoRegister has some effect on the final generated code. My guess is that some optimization kicks in at the end?
When I use -S to dump the assembly I get the correct version with 'shrq $3, %r8':
```
movq %r9, %r8
shrq $3, %r8
movsbl 2147450880(%r8), %r8d
```
But when I disassemble the final binary, I get RAX instead of R8:
```
mov %r9,%r8
shr $0x3,%rax
movsbl 0x7fff8000(%r8),%r8d
```
Reviewed By: vitalybuka
Differential Revision: https://reviews.llvm.org/D108745
Even though https://bugs.llvm.org/show_bug.cgi?id=51615
appears to be introduced by D105390, the fix lies here.
We cannot replace a wide volatile load with a broadcast-from-memory,
because that would narrow the load, which isn't legal for volatiles.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D108757
It turns out that SchedWrite WriteIMulH was always assigned to the low half of
the result of a MULX (rather than to the high half).
To avoid confusion, this patch swaps the two MULX writes in the tablegen
definition of MULX32/64. That way, write names better describe what they
actually refer to; this also avoids further complications if in the future we decide
to reuse the same MulH writes to also model other scalar integer multiply
instructions. I also had to swap the latency values for the two MULX writes to
make sure that the change is effectively an NFC. In fact, none of the existing
x86 tests were affected by this small refactoring.
This patch also fixes a bug in MCA: a wrong latency value was propagated for
instructions that perform multiple writes to the same register. This last issue
was found by Roman while testing MULX on targets that define a different latency
for the Low/High part of the result.
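For reference, a quick sketch of the two halves these writes model (assuming BMI2 hardware and the _mulx_u64 intrinsic; compile with -mbmi2):
```
#include <immintrin.h> // _mulx_u64 (BMI2)
#include <cstdio>

int main() {
  unsigned long long hi;
  // MULX produces a full 128-bit product: the intrinsic returns the low
  // 64 bits and stores the high 64 bits - the half WriteIMulH models.
  unsigned long long lo = _mulx_u64(0xFFFFFFFFFFFFFFFFULL, 2, &hi);
  std::printf("lo=%llx hi=%llx\n", lo, hi); // lo=fffffffffffffffe hi=1
}
```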
Differential Revision: https://reviews.llvm.org/D108727
In-register structure returns are not special, and handled by lowering
to multiple-value tuples. We can tail-call from non-sret fns to
structure-returning functions, except on i686 where the sret pointer
is callee-pop.
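A minimal sketch of the distinction (assuming the x86-64 SysV ABI; the function names are made up for the example):
```
#include <cstdio>

struct Pair { int x, y; };  // small: returned in registers as a value tuple

Pair makePair(int a, int b) { return {a, b}; }

// A non-sret function tail-calling a structure-returning function: the
// Pair comes back in registers, so nothing blocks the tail call.
Pair forward(int a, int b) { return makePair(a, b); }

int main() {
  Pair p = forward(3, 4);
  std::printf("%d %d\n", p.x, p.y);
}
```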
Differential Revision: https://reviews.llvm.org/D105807
The implementation uses the int_asan_check_memaccess intrinsic to instrument the code. The intrinsic is replaced by a call to a function which performs the access check. The generated function names encode the input register name as a number using the Reg - X86::NoRegister formula.
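As an illustration of that encoding, here is a hypothetical sketch (the name prefix and register values below are made up; only the Reg - X86::NoRegister formula comes from the patch):
```
#include <cstdio>
#include <string>

// Stand-ins for LLVM's generated register enum, where X86::NoRegister is
// 0 and each register gets a small unique number (value here is made up).
enum : unsigned { NoRegister = 0, R8 = 58 };

// Hypothetical name builder: encode the register operand as a number
// relative to NoRegister, as described above.
std::string checkFnName(unsigned Reg, bool IsWrite, unsigned Size) {
  return std::string("__asan_check_") + (IsWrite ? "store" : "load") +
         std::to_string(Size) + "_rn" + std::to_string(Reg - NoRegister);
}

int main() {
  std::printf("%s\n", checkFnName(R8, /*IsWrite=*/true, 8).c_str());
}
```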
Reviewed By: vitalybuka
Differential Revision: https://reviews.llvm.org/D107850
Before this patch, WriteIMulH reported a latency value which is correct for the
RR variant of MULX, but not for the RM variant.
This patch fixes the issue by introducing a new WriteIMulHLd, which is meant to
be used only by the RM variant of MULX.
Differential Revision: https://reviews.llvm.org/D108701
Patch 9588b685c6 introduced a dependency on ASAN. But it didn't
explicitly put LLVMInstrumentation among the library dependencies,
so the build fails if we're building LLVM as shared libraries
(i.e. -DBUILD_SHARED_LIBS=ON).
This patch explicitly links X86CodeGen against the Instrumentation
component.
Differential Revision: https://reviews.llvm.org/D108662
We don't have any vXi8 shift instructions (other than on XOP which is handled separately), so replace the shl(x,1) -> add(x,x) fold with shl(x,1) -> add(freeze(x),freeze(x)) to avoid the undef issues identified in PR50468: if x is undef, the two uses in add(x,x) may each take a different value, so the result isn't guaranteed to be even; freezing x first pins both uses to the same concrete value.
Split off from D106675 as I'm still looking at whether we can fix the vXi16/i32/i64 issues with the D106679 alternative.
Differential Revision: https://reviews.llvm.org/D108139
Over in D105657, we started dropping instruction numbers (that become
variable locations) from call instructions, as we can't correctly represent
the x87 FP stack. Unfortunately, it turns out that the "special FP
instructions" that this pass transforms includes "every call instruction"
[0]. Thus, we've ended up dropping all return values from all calls. Ouch.
This patch adds a filter: only drop instruction numbers from calls if they
return something on the FP stack. Seeing how LLVM only allows a single
return value, this should drop instruction numbers on anything that returns
a float, and nothing else.
Rather than writing a new test, I've modified the original one to have a
positive and negative case: drop instruction number on a call with an
FP-stack modification, keep it on a plain call.
Differential Revision: https://reviews.llvm.org/D108580
Intended to be NFC. ARM/AArch64 don't appear to need adjustment.
TargetMachine::shouldAssumeDSOLocal is expected to be very simple, ideally
matching isDSOLocal(). The IR producers are expected to set dso_local correctly.
(While some may think this function can make producers' work easier, the
function is really not in a good position to set dso_local. See the various
special cases we duplicate from clang CodeGenModule.cpp.)
Reviewed By: mstorsjo
Differential Revision: https://reviews.llvm.org/D108514
Extend matchShuffleAsBlend to not only match against known in-place elements for BLEND shuffles, but also to use isElementEquivalent to determine whether the shuffle mask's referenced element is the same as the in-place element.
This allows us to replace a number of insertps instructions with more general blendps instructions (better opportunities for commutation, concatenation etc.).
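A small illustration of the equivalence being exploited (assuming SSE4.1; compile with -msse4.1): when the inserted lane comes from the same position of the other operand, INSERTPS matches the more commutation-friendly BLENDPS.
```
#include <smmintrin.h> // SSE4.1 intrinsics
#include <cstdio>

int main() {
  __m128 a = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
  __m128 b = _mm_setr_ps(5.0f, 6.0f, 7.0f, 8.0f);
  // insertps: take element 1 of b, place it into element 1 of a.
  __m128 ins = _mm_insert_ps(a, b, 0x50);
  // blendps with mask 0b0010 selects element 1 from b: same result.
  __m128 bl = _mm_blend_ps(a, b, 0x2);
  float i4[4], b4[4];
  _mm_storeu_ps(i4, ins);
  _mm_storeu_ps(b4, bl);
  std::printf("%g %g %g %g vs %g %g %g %g\n", i4[0], i4[1], i4[2], i4[3],
              b4[0], b4[1], b4[2], b4[3]);
}
```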
Before lowering shuffles, see if we can merge horizontal ops or canonicalize the shuffle mask to point to the same LHS/RHS of the HOps when an HOp's args are repeated.
Broadwell is mainly a die shrink of Haswell, but the model had many of the scheduling classes in different orders, making side-by-side comparisons very difficult.
The InstRW overrides are still quite different, but at least that part of the side-by-side diff is now in the same position.
This was noticed while I was trying to investigate diffs between llvm-mca and other perf analyzers in https://uica.uops.info/ - we used to be able to do diffs between most of the models very easily, but we seem to have lost that simplicity as classes have been altered, models have been refined and other models have rotted.