Commit Graph

21822 Commits

Author SHA1 Message Date
Craig Topper da3ef8b756 [X86] Handle inverted inputs when matching VPTERNLOG from 2 binary ops.
This is a more general version of D109273. Though it doesn't
peek through bitcasts or rearange broadcasts.

Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D109295
2021-09-06 17:44:52 -07:00
Fangrui Song 76529b4468 [X86] Simplify condition guarding emitCalleeSavedFrameMoves. NFC 2021-09-06 15:54:02 -07:00
Fangrui Song 4f1e410a1b [X86] Simplify two hasFP(F). NFC 2021-09-06 15:47:40 -07:00
Tianqing Wang 12fa608af4 [X86] Add CRC32 feature.
d8faf03807 implemented general-regs-only for X86 by disabling all features
with vector instructions. But the CRC32 instruction in SSE4.2 ISA, which uses
only GPRs, also becomes unavailable. This patch adds a CRC32 feature for this
instruction and allows it to be used with general-regs-only.

Reviewed By: pengfei

Differential Revision: https://reviews.llvm.org/D105462
2021-09-06 17:24:30 +08:00
Simon Pilgrim f114ef3731 [CostModel][X86] Add generic costs for vXi32 MUL -> v2Xi16 PMADDDW folds
Based off the improved fold in D108522

This should eventually allow us to replace the SLM only cost patterns with generic versions.
2021-09-05 16:08:11 +01:00
Simon Pilgrim 2005ae15a6 [X86][SLM] WriteVecIMul instructions only take 1uop (REAPPLIED)
The xmm variant have half the throughput (and +1cy latency) of the mmx variants, but are still 1uop.

I still need to do more thorough testing of SLM on test-suite before fixing the obvious bad numbers for WritePMULLD.

But this helps the D103695 helper script get to more accurate numbers for vXi32 multiplies of extended operands (i.e. we can use PMADDWD, PMULLW/PMULHW etc). Matches what Intel AoM / Agner / llvm-exegesis reports.
2021-09-04 15:03:56 +01:00
Simon Pilgrim ac51d69208 Revert rG994da657076900f5ad7fe593c3b5e5f89ab3d53d "[X86][SLM] WriteVecIMul instructions only take 1uop"
This changed some codegen tests that I forgot about in my rebase, I'll recommit shortly with a fix.
2021-09-04 13:39:10 +01:00
Simon Pilgrim 994da65707 [X86][SLM] WriteVecIMul instructions only take 1uop
The xmm variant have half the throughput (and +1cy latency) of the mmx variants, but are still 1uop.

I still need to do more thorough testing of SLM on test-suite before fixing the obvious bad numbers for WritePMULLD.

But this helps the D103695 helper script get to more accurate numbers for vXi32 multiplies of extended operands (i.e. we can use PMADDWD, PMULLW/PMULHW etc). Matches what Intel AoM / Agner / llvm-exegesis reports.
2021-09-04 13:21:34 +01:00
Simon Pilgrim c6371020a8 [X86][SLM] RMW instructions don't require an extra uop
For RMW instructions, the load and store hold the MEC for an extra cycle, but within the same single uop. This is alluded to in the Intel AOM:

"The MEC also owns the MEC RSV, which is responsible for scheduling of all loads and stores. Load and
store instructions go through addresses generation phase in program order to avoid on-the-fly memory
ordering later in the pipeline. Therefore, an unknown address will stall younger memory instructions."

Noticed while trying to get a cheap SLM test box up and running with llvm-exegesis - RMW arithmetic is always 1uop - and matches what Agner / InstLatX64 report as well.
2021-09-04 13:21:34 +01:00
Simon Pilgrim da965a77d5 [X86][SLM] Fix MUL uops, latency and throughput
These were all set to the same best case mul i32 values (which seems to be the only version of MUL that SLM actually performs well with).

Noticed while trying to improve multiplication costs for vectorization via the D103695 helper script. Confirmed with Intel AoM / Agner / InstLatX64.
2021-09-04 13:21:34 +01:00
Simon Pilgrim 7d062d2c47 [X86][Atom] MUL/DIV instructions require both ports, not either.
Noticed while trying to improve multiplication costs for vectorization via the D103695 helper script. Confirmed with Intel AoM.
2021-09-04 11:58:09 +01:00
Simon Pilgrim 0d0f39b0f3 [X86][Atom] Add missing UOps override to AtomWriteResPair multiclass
Make it easier to describe microcoded instructions.
2021-09-04 11:58:09 +01:00
Simon Pilgrim 6ba0b9f68a [X86][SLM] Fix PBLENDVB uops and throughput
SLM PBLENDVB is just as bad as BLENDVPD/PS - so model it as such, fixing the rr vs rm uops diff as well. The Intel AoM appears to have a copy+paste typo with PBLENDW, it doesn't match Agner or InstLatX64.

Noticed while investigating some of the weird discrepancies reported by the D103695 helper script (SLM had much better vector shift throughputs than it should).
2021-09-03 11:31:29 +01:00
Kirill Stoimenov cf53c6c971 [asan] Fixed link error by setting jump symbol to R_X86_64_PLT32.
Fixing this link error:
ld: error: relocation R_X86_64_PC32 cannot be used against symbol __asan_report_load...; recompile with -fPIC

Reviewed By: vitalybuka

Differential Revision: https://reviews.llvm.org/D109183
2021-09-02 21:50:56 +00:00
Craig Topper 3e89cc5cda [X86] Remove isel predicates for xgetbv/xsetbv instructions so they can work on Windows.
https://reviews.llvm.org/D56686  was supposed to allow these to
work on Windows without needing to enable the xsave feature to
match MSVC. It seems this didn't work because the backend isel
patterns would still block it.

This patch removes the predicates from the isel patterns.

Fixes PR51706.

Reviewed By: pengfei

Differential Revision: https://reviews.llvm.org/D109097
2021-09-02 10:25:02 -07:00
Simon Pilgrim d66d520fe1 [X86][SSE] combineMulToPMADDWD - improve recognition of sign/zero extended upper bits
PMADDWD(v8i16 x, v8i16 y) == (v4i32) { (int)x[0]*y[0] + (int)x[1]*y[1], ..., (int)x[6]*y[6] + (int)x[7]*y[7] }

Currently combineMulToPMADDWD only folds cases where the upper 17 bits of both vXi32 inputs are known zero (i.e. the first half is positive and the second half of the pair is zero in each 2xi16 pair), this can be relaxed to only require one zero-extended input if the other input has at least 17 sign bits.

That way the sign of the result is still preserved, and the second half is still zero.

Noticed while investigating PR47437.

Differential Revision: https://reviews.llvm.org/D108522
2021-09-02 17:36:22 +01:00
Roman Lebedev 3f1f08f0ed
Revert @llvm.isnan intrinsic patchset.
Please refer to
https://lists.llvm.org/pipermail/llvm-dev/2021-September/152440.html
(and that whole thread.)

TLDR: the original patch had no prior RFC, yet it had some changes that
really need a proper RFC discussion. It won't be productive to discuss
such an RFC, once it's actually posted, while said patch is already
committed, because that introduces bias towards already-committed stuff,
and the tree is potentially in broken state meanwhile.

While the end result of discussion may lead back to the current design,
it may also not lead to the current design.

Therefore i take it upon myself
to revert the tree back to last known good state.

This reverts commit 4c4093e6e3.
This reverts commit 0a2b1ba33a.
This reverts commit d9873711cb.
This reverts commit 791006fb8c.
This reverts commit c22b64ef66.
This reverts commit 72ebcd3198.
This reverts commit 5fa6039a5f.
This reverts commit 9efda541bf.
This reverts commit 94d3ff09cf.
2021-09-02 13:53:56 +03:00
Simon Pilgrim b0acd6c369 [X86] Fold PMADD(x,0) or PMADD(0,x) -> 0
Pulled out of D108522 - handle zero-operand cases for PMADDWD/VPMADDUBSW ops
2021-09-02 10:48:50 +01:00
Wang, Pengfei 74043caef2 [X86] Enable half type support in inline assembly constraints
Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D105799
2021-09-01 09:29:31 +08:00
Simon Pilgrim 9e2d14c285 [X86] Copy X86SchedSkylakeServer.td to X86SchedIceLake.td
Icelake, Rocketlake and Tigerlake targets currently use the SkylakeServer scheduler model, despite being a later microarchitecture, leading to both reported bugs (PR48110) and discrepancies when comparing llvm-mca reports to other profiling tools (OSACA, uops, uica, etc.). And tbh I'm getting sick of llvm-mca getting blamed for what are backend scheduler model issues :-(

This patch doesn't attempt to fix any of these discrepancies - there should be no changes in codegen - its a setup patch that copies the skx model, renames all the resources, adds the additional ports (but doesn't reference them yet) and updates the llvm-exegesis pfm counter mappings (based off https://sourceforge.net/p/perfmon2/libpfm4/ci/master/tree/lib/events/intel_icl_events.h).

This should make it trivial for anyone with hardware access to use llvm-exegesis reports to iteratively improve the model (my attempts to get hold of a cheap tiger lake box haven't been fruitful yet....).

I will copy the SkylakeServer llvm-mca resource tests as follow up commits - the diff should entirely be the resource renames.

Differential Revision: https://reviews.llvm.org/D108914
2021-08-31 11:57:20 +01:00
Nikita Popov 0529e2e018 [InstrInfo] Use 64-bit immediates for analyzeCompare() (NFCI)
The backend generally uses 64-bit immediates (e.g. what
MachineOperand::getImm() returns), so use that for analyzeCompare()
and optimizeCompareInst() as well. This avoids truncation for
targets that support immediates larger 32-bit. In particular, we
can avoid the bugprone value normalization hack in the AArch64
target.

This is a followup to D108076.

Differential Revision: https://reviews.llvm.org/D108875
2021-08-30 19:46:04 +02:00
Simon Pilgrim af2920ec6f [TTI][X86] getArithmeticInstrCost - move opcode canonicalization before all target-specific costs. NFCI.
The GLM/SLM special cases still get tested first but after the the MUL/DIV/REM pattern detection - this will be necessary for when we make the SLM vXi32 MUL canonicalization generic to improve PMULLW/PMULHW/PMADDDW cost support etc.
2021-08-30 12:24:59 +01:00
Wang, Pengfei ab40dbfe03 [X86] AVX512FP16 instructions enabling 6/6
Enable FP16 complex FMA instructions.

Ref.: https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html

Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D105269
2021-08-30 13:08:45 +08:00
Vince Bridgers 55ba1de7c5 [X86] Remove X86LowerAMXType::getRowFromCol from X86LowerAMXType.cpp
Remove method X86LowerAMXType::getRowFromCol since it's not used, and
it's causing a warning.

Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D108862
2021-08-29 12:27:34 -05:00
Roman Lebedev 6734018041
[Codegen][X86] EltsFromConsecutiveLoads(): if only have AVX1, ensure that the "load" is actually foldable (PR51615)
This fixes another reproducer from https://bugs.llvm.org/show_bug.cgi?id=51615
And again, the fix lies not in the code added in D105390

In this case, we completely don't check that the "broadcast-from-mem" we create
can actually fold the load. In this case, it's operand was not a load at all:
```
Combining: t16: v8i32 = vector_shuffle<0,u,u,u,0,u,u,u> t14, undef:v8i32
Creating new node: t29: i32 = undef
RepeatLoad:
t8: i32 = truncate t7
  t7: i64 = extract_vector_elt t5, Constant:i64<0>
    t5: v2i64,ch = load<(load (s128) from %ir.arg)> t0, t2, undef:i64
      t2: i64,ch = CopyFromReg t0, Register:i64 %0
        t1: i64 = Register %0
      t4: i64 = undef
    t3: i64 = Constant<0>
Combining: t15: v8i32 = undef

```

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D108821
2021-08-27 20:26:53 +03:00
Serge Pavlov cdbe569fb6 [X86] Implement llvm.isnan(x86_fp80) as unordered comparison
x86_fp80 format allows values that do not fit any of IEEE-754 category.
Previously they were recognized by intrinsic __builtin_isnan as NaNs.
Now this intrinsic is implemented using instruction FXAM, which
distinguish between NaNs and unsupported values. It can make some
programs behave differently.

As a solution, this fix changes lowering of the intrinsic. If floating
point exceptions are ignored, llvm.isnan is lowered into unordered
comparison, as __buildtin_isnan was implemented earlier. In strictfp
functions the intrinsic is lowered using FXAM, which does not raise
exceptions even for signaling NaN, as required by IEEE-754 and C
standards.

Differential Revision: https://reviews.llvm.org/D108037
2021-08-27 18:06:07 +07:00
Nathan Sidwell 199ac3a839 [NFC][X86] Sret return register cleanup
There are no paths into LowerFormalParms that have already specified
the sret register. We always materialize a virtual and then assign it
to the physical reg at the point of the return.

Differential Revision: https://reviews.llvm.org/D108762
2021-08-27 04:03:49 -07:00
Roman Lebedev d4d459e747
[X86] AMD Zen 3: MULX w/ mem operand has the same throughput as with reg op
Exegesis is faulty and sometimes when measuring throughput^-1
produces snippets that have loop-carried dependencies,
which must be what caused me to incorrectly measure it originally.

After looking much more carefully, the inverse throughput should match
that of the MULX w/ reg op.

As per llvm-exegesis measurements.
2021-08-27 13:27:05 +03:00
Roman Lebedev 0f04936a2d
[X86] AMD Zen 3: MULX produces low part of the result in 3cy, +1cy for high part
As per llvm-exegesis measurements.
2021-08-27 13:27:05 +03:00
Kirill Stoimenov 2e83a0efb9 [asan] Fixed a runtime crash.
Looks like the NoRegister has some effect on the final code that is generated. My guess is that some optimization kicks in at the end?

When I use -S to dump the assembly I get the correct version with 'shrq    $3, %r8':
        movq    %r9, %r8
        shrq    $3, %r8
        movsbl  2147450880(%r8), %r8d

But, when I disassemble the final binary I get RAX in stead of R8:
        mov    %r9,%r8
        shr    $0x3,%rax
        movsbl 0x7fff8000(%r8),%r8d

Reviewed By: vitalybuka

Differential Revision: https://reviews.llvm.org/D108745
2021-08-26 20:30:25 +00:00
Roman Lebedev a8125bf4a8
[X86][Codegen] PR51615: don't replace wide volatile load with narrow broadcast-from-memory
Even though https://bugs.llvm.org/show_bug.cgi?id=51615
appears to be introduced by D105390, the fix lies here.

We can not replace a wide volatile load with a broadcast-from-memory,
because that would narrow the load, which isn't legal for volatiles.

Reviewed By: spatel

Differential Revision: https://reviews.llvm.org/D108757
2021-08-26 18:46:49 +03:00
Simon Pilgrim c17f5afa88 [X86] getShape - don't dereference dyn_cast<>
dyn_cast can return nullptr, use cast<> to assert we have the correct type.
2021-08-26 15:08:13 +01:00
Simon Pilgrim 47f2affa08 Fix MSVC "result of 32-bit shift implicitly converted to 64 bits" warning. NFCI. 2021-08-26 15:08:12 +01:00
Andrea Di Biagio 4a5b191703 [X86][MCA] Address the latest issues with MULX reported in PR51495.
It turns out that SchedWrite WriteIMulH was always assigned to the low half of
the result of a MULX (rather than to the high half).

To avoid confusion, this patch swaps the two MULX writes in the tablegen
definition of MULX32/64.  That way, write names better describe what they
actually refer to; this also avoids further complications if in future we decide
to reuse the same MulH writes to also model other scalar integer multiply
instructions.  I also had to swap the latency values for the two MULX writes to
make sure that the change is effectively an NFC. In fact, none of the existing
x86 tests were affected by this small refactoring.

This patch also fixes a bug in MCA: a wrong latency value was propagated for
instructions that perform multiple writes to a same register.  This last issue
was found by Roman while testing MULX on targets that define a different latency
for the Low/High part of the result.

Differential Revision: https://reviews.llvm.org/D108727
2021-08-26 12:08:20 +01:00
Nathan Sidwell ab55cc6cef [X86] pr51000 in-register struct return tailcalling
In-register structure returns are not special, and handled by lowering
to multiple-value tuples.  We can tail-call from non-sret fns to
structure-returning functions, except on i686 where the sret pointer
is callee-pop.

Differential Revision: https://reviews.llvm.org/D105807
2021-08-25 10:15:50 -07:00
Kirill Stoimenov 832aae738b [asan] Implemented intrinsic for the custom calling convention similar used by HWASan for X86.
The implementation uses the int_asan_check_memaccess intrinsic to instrument the code. The intrinsic is replaced by a call to a function which performs the access check. The generated function names encode the input register name as a number using Reg - X86::NoRegister formula.

Reviewed By: vitalybuka

Differential Revision: https://reviews.llvm.org/D107850
2021-08-25 15:31:46 +00:00
Andrea Di Biagio 5f848b311f [X86][SchedModel] Fix latency the Hi register write of MULX (PR51495).
Before this patch, WriteIMulH reported a latency value which is correct for the
RR variant of MULX, but not for the RM variant.

This patch fixes the issue by introducing a new WriteIMulHLd, which is meant to
be used only by the RM variant of MULX.

Differential Revision: https://reviews.llvm.org/D108701
2021-08-25 16:12:09 +01:00
Min-Yih Hsu d2bb6d512c [X86] Add explicit library dependency on LLVMInstrumentation
Patch 9588b685c6 introduced dependency on ASAN. But it didn't
explicitly put LLVMInstrumentation as one of the library dependencies
such that the build will fail if we're building LLVM as shared libraries
(i.e. -DBUILD_SHARED_LIBS=ON).
This patch explicitly links X86CodeGen against the Instrumentation
component.

Differential Revision: https://reviews.llvm.org/D108662
2021-08-24 14:08:17 -07:00
Kirill Stoimenov b97ca3aca1 Revert "[asan] Implemented intrinsic for the custom calling convention similar used by HWASan for X86."
This reverts commit 9588b685c6. Breaks a bunch of builds.

Reviewed By: GMNGeoffrey

Differential Revision: https://reviews.llvm.org/D108658
2021-08-24 13:21:20 -07:00
Kirill Stoimenov 9588b685c6 [asan] Implemented intrinsic for the custom calling convention similar used by HWASan for X86.
The implementation uses the int_asan_check_memaccess intrinsic to instrument the code. The intrinsic is replaced by a call to a function which performs the access check. The generated function names encode the input register name as a number using Reg - X86::NoRegister formula.

Reviewed By: vitalybuka

Differential Revision: https://reviews.llvm.org/D107850
2021-08-24 19:34:34 +00:00
Simon Pilgrim 307890f85b [X86] Freeze vXi8 shl(x,1) -> add(x,x) vector fold (PR50468)
We don't have any vXi8 shift instructions (other than on XOP which is handled separately), so replace the shl(x,1) -> add(x,x) fold with shl(x,1) -> add(freeze(x),freeze(x)) to avoid the undef issues identified in PR50468.

Split off from D106675 as I'm still looking at whether we can fix the vXi16/i32/i64 issues with the D106679 alternative.

Differential Revision: https://reviews.llvm.org/D108139
2021-08-24 16:08:24 +01:00
Jeremy Morse 992e21eeee [DebugInfo][InstrRef] Fix over-droppage of locations in X86FloatingPoint
Over in D105657, we started dropping instruction numbers (that become
variable locations) from call instructions, as we can't correctly represent
the x87 FP stack. Unfortunately, it turns out that the "special FP
instructions" that this pass transforms includes "every call instruction"
[0]. Thus, we've ended up dropping all return values from all calls. Ouch.

This patch adds a filter: only drop instruction numbers from calls if they
return something on the FP stack. Seeing how LLVM only allows a single
return value, this should drop instruction numbers on anything that returns
a float, and nothing else.

Rather than writing a new test, I've modified the original one to have a
positive and negative case: drop instruction number on a call with an
FP-stack modification, keep it on a plain call.

Differential Revision: https://reviews.llvm.org/D108580
2021-08-24 10:24:07 +01:00
Liu, Chen3 b7795eb646 [X86] Building constant vector which element type is half will cause assertion fail.
Fix assertion fail when building con constant vector which element type is half.

Differential Revision: https://reviews.llvm.org/D108612
2021-08-24 14:34:30 +08:00
Wang, Pengfei c728bd5bba [X86] AVX512FP16 instructions enabling 5/6
Enable FP16 FMA instructions.

Ref.: https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html

Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D105268
2021-08-24 09:07:19 +08:00
Fangrui Song ba6e15d8cc [TargetMachine] Move COFF special case for ExternalSymbolSDNode from shouldAssumeDSOLocal to X86Subtarget
Intended to be NFC. ARM/AArch64 don't appear to need adjustment.

TargetMachine::shouldAssumeDSOLocal is expected to be very simple, ideally
matching isDSOLocal(). The IR producers are expected to set dso_local correctly.
(While some may think this function can make producers' work easier, the
function is really not in a good position to set dso_local. See the various
special cases we duplicate from clang CodeGenModule.cpp.)

Reviewed By: mstorsjo

Differential Revision: https://reviews.llvm.org/D108514
2021-08-23 13:54:40 -07:00
Simon Pilgrim 805fb1f6c1 [X86] combineMul - move MUL_IMM comment inside function. NFC.
combineMul is now used for other things as well as the mul-with-constant expansion - move the comment to where its actually relevant.
2021-08-22 18:27:03 +01:00
Simon Pilgrim 352df10a23 [X86][AVX] matchShuffleAsBlend - use isElementEquivalent to help match broadcast/repeated elements
Extend matchShuffleAsBlend to not only match against known in-place elements for BLEND shuffles, but use isElementEquivalent to determine if the shuffle mask's referenced element is the same as the in-place element.

This allows us to replace a number of insertps instructions with more general blendps instructions (better opportunities for commutation, concatenation etc.).
2021-08-22 15:26:17 +01:00
Simon Pilgrim 96fb3eef66 Fix signed/unsigned comparison warning. NFCI. 2021-08-22 15:02:19 +01:00
Simon Pilgrim a1c892b439 [X86][SSE] lowerVECTOR_SHUFFLE - canonicalize with horizontal ops.
Before lowering shuffles, see if we can merge horizontal ops or canonicalize the shuffle mask to point to the same LHS/RHS of the HOps when an HOp's args are repeated.
2021-08-22 14:54:48 +01:00
Simon Pilgrim 8533e782ef [X86] Try to sync HSW + BDW model class defs to simplify comparisons. NFC.
Broadwell is mainly a die shrink of Haswell, but the model had many of the scheduling classes in different orders, making side-by-side comparisons very difficult.

The InstRW overrides are still quite different, but at least that part of the side-by-side diff is now in the same position.

This was noticed while I was trying to investigate diffs between llvm-mca and other perf analyzers in https://uica.uops.info/ - we used to be able to do diffs between most of the models very easily, but we seem to have lost that simplicity as classes have been altered, models have been refined and other models have rotted.
2021-08-22 13:02:51 +01:00