Commit Graph

65 Commits

Author SHA1 Message Date
Simon Pilgrim 29475e0286 [X86] Add scheduler classes for zmm vector reg-reg move instructions
Basic zmm reg-reg moves (with predication) are more port limited than xmm/ymm moves, so we need to add a separate class for them.

We still appear to be missing move-elimination patterns for most of the intel models, which looks to be one of the main diffs for basic codegen analysis between llvm-mca and uops.info

Load/stores are a bit messier and might be better handled as overrides.
2021-12-27 12:13:42 +00:00
Simon Pilgrim 41052fd699 [X86][MMX] Remove superfluous 'i' from MMX cvt opnames. NFCI.
This is a very old copy+paste typo - none of these cvt ops have an immediate operand.

Noticed while trying to merge MMX instructions into some existing SSE instruction scheduler instregex patterns.
2021-12-12 17:59:16 +00:00
Simon Pilgrim 0a08813cad [X86][MMX] Remove superfluous 'i' from MMX binop opnames. NFCI.
This is a very old copy+paste typo - none of these binops have an immediate operand.

Noticed while trying to merge MMX instructions into some existing SSE instruction scheduler instregex patterns.
2021-12-12 17:59:16 +00:00
Roman Lebedev d4d459e747
[X86] AMD Zen 3: MULX w/ mem operand has the same throughput as with reg op
Exegesis is faulty and sometimes when measuring throughput^-1
produces snippets that have loop-carried dependencies,
which must be what caused me to incorrectly measure it originally.

After looking much more carefully, the inverse throughput should match
that of the MULX w/ reg op.

As per llvm-exegesis measurements.
2021-08-27 13:27:05 +03:00
Roman Lebedev 0f04936a2d
[X86] AMD Zen 3: MULX produces low part of the result in 3cy, +1cy for high part
As per llvm-exegesis measurements.
2021-08-27 13:27:05 +03:00
Andrea Di Biagio 5f848b311f [X86][SchedModel] Fix latency the Hi register write of MULX (PR51495).
Before this patch, WriteIMulH reported a latency value which is correct for the
RR variant of MULX, but not for the RM variant.

This patch fixes the issue by introducing a new WriteIMulHLd, which is meant to
be used only by the RM variant of MULX.

Differential Revision: https://reviews.llvm.org/D108701
2021-08-25 16:12:09 +01:00
Andrea Di Biagio 35d4292a73 [X86][SchedModels] Fix missing ReadAdvance for MULX and ADCX/ADOX (PR51494)
Before this patch, instructions MULX32rm and MULX64rm were missing a ReadAdvance
for the implicit read of register EDX/RDX.  This patch fixes the issue, and it
also introduces a new SchedWrite for the two variants of MULX. The general idea
behind this last change is to eventually decrease the number of InstRW in the
scheduling models.

This patch also adds a ReadAdvance for the implicit read of EFLAGS in ADCX/ADOX.

Differential Revision: https://reviews.llvm.org/D108372
2021-08-20 17:39:51 +01:00
Roman Lebedev 834aafa55b
[NFC] AMD Zen 3: fix typo in a comment 2021-06-19 22:05:17 +03:00
Roman Lebedev 69caacc626
[X86] AMD Zen 3: don't confuse shift and shuffle, NFC
These proc res groups occupy the exact same pipes,
so this doesn't affect the modelling,
but it's confusing nontheless.
2021-06-17 21:07:35 +03:00
Roman Lebedev 88da6c1ead
[X86] Schedule-model second (mask) output of GATHER instruction
Much like `mulx`'s `WriteIMulH`, there are two outputs of
AVX2 GATHER instructions. This was changed back in rL160110,
but the sched model change wasn't present.

So right now, for sched models that are marked as complete
(`znver3` only now), codegen'ning `GATHER` results in a crash:
```
DefIdx 1 exceeds machine model writes for early-clobber renamable $ymm3, dead early-clobber renamable $ymm2 = VPGATHERDDYrm killed renamable $ymm3(tied-def 0), undef renamable $rax, 4, renamable $ymm0, 0, $noreg, killed renamable $ymm2(tied-def 1) :: (load 32, align 1)
```
https://godbolt.org/z/Ks7zW7WGh

I'm guessing we need to deal with this like we deal with `WriteIMulH`.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D104205
2021-06-15 12:04:33 +03:00
Roman Lebedev 852497711d
[X86] AMD Zen 3: double the LoopMicroOpBufferSize
While the IndVars issue (PR50384) has been resolved,
and the compile performance improved, a new blocker emerged,
the codegen machine instruction scheduling is also quadratic.
So we still can't really specify the right value here.

Filed PR50584.
2021-06-05 01:23:58 +03:00
Roman Lebedev 75ea0abaae
[X86] AMD Zen 3: fix MULX modelling - don't forget about WriteIMulH (PR50387)
Otherwise lack thereof will be caught by a defensive check during
scheduling, and we'll crash.

I've literally never seen this syntax before..
2021-05-18 19:58:04 +03:00
Roman Lebedev 3cc3960766
[X86] AMD Zen 3: cap LoopMicroOpBufferSize to workaround PR50384 (quadratic IndVars runtime)
While i would like to keep the right value here,
i would also like to be able to actually compile
e.g. vanilla test-suite.

256 is a pretty random guess, it should be pretty good enough
for serious loops, but small enough to result in tolerant
compile times for certain edge cases.

https://bugs.llvm.org/show_bug.cgi?id=50384
2021-05-18 15:56:57 +03:00
Roman Lebedev 1fc1c88704
[X86] AMD Zen 3: same-reg AVX YMM VPCMPGT{B,W,D,Q} is a zero-cycle(!) dep-breaking zero-idiom
As measured by exegesis, and confirmed by ref docs.
2021-05-14 20:23:05 +03:00
Roman Lebedev 2f8572d8e2
[X86] AMD Zen 3: same-reg AVX XMM VPCMPGT{B,W,D,Q} is a zero-cycle(!) dep-breaking zero-idiom
As measured by exegesis, and confirmed by ref docs.
2021-05-14 20:23:04 +03:00
Roman Lebedev f8f7c765a0
[X86] AMD Zen 3: same-reg SSE XMM PCMPGT{B,W,D,Q} is a 1-cycle(!) dep-breaking zero-idiom
As measured by exegesis, and confirmed by ref docs.
2021-05-14 20:23:04 +03:00
Roman Lebedev 26eeb6e650
[X86] AMD Zen 3: same-reg AVX YMM VPSUBUS{B,W} is a 1-cycle(!) dep-breaking zero-idiom
Not really mentioned in ref docs, but measures as such.
Yes, this one is also not zero-cycle.
2021-05-14 20:23:03 +03:00
Roman Lebedev 41a5dcdf87
[X86] AMD Zen 3: same-reg AVX XMM VPSUBUS{B,W} is a 1-cycle(!) dep-breaking zero-idiom
Not really mentioned in ref docs, but measures as such.
Yes, this one is also not zero-cycle.
2021-05-14 20:23:03 +03:00
Roman Lebedev 6733fe5c0d
[X86] AMD Zen 3: same-reg SSE XMM PSUBUS{B,W} is a 1-cycle(!) dep-breaking zero-idiom
Not really mentioned in ref docs, but measures as such.
2021-05-14 20:23:03 +03:00
Roman Lebedev 555e1d2987
[X86] AMD Zen 3: same-reg AVX YMM VPSUBS{B,W} is a 1-cycle(!) dep-breaking zero-idiom
Not really mentioned in ref docs, but measures as such.
Yes, this one is also not zero-cycle.
2021-05-14 20:23:02 +03:00
Roman Lebedev 012417c980
[X86] AMD Zen 3: same-reg AVX XMM VPSUBS{B,W} is a 1-cycle(!) dep-breaking zero-idiom
Not really mentioned in ref docs, but measures as such.
Yes, this one is also not zero-cycle.
2021-05-14 20:23:02 +03:00
Roman Lebedev 29c4f892fe
[X86] AMD Zen 3: same-reg SSE XMM PSUBS{B,W} is a 1-cycle(!) dep-breaking zero-idiom
Not really mentioned in ref docs, but measures as such.
2021-05-14 20:23:02 +03:00
Roman Lebedev 93f2642871
[X86] AMD Zen 3: same-reg AVX YMM VPSUB{B,W,D,Q} is a zero-cycle(!) dep-breaking zero-idiom
As confirmed by the exegesis measurements, and ref docs.
2021-05-14 20:23:01 +03:00
Roman Lebedev 7a45b96e04
[X86] AMD Zen 3: same-reg AVX XMM VPSUB{B,W,D,Q} is a zero-cycle(!) dep-breaking zero-idiom
As confirmed by the exegesis measurements, and ref docs.
2021-05-14 20:23:01 +03:00
Roman Lebedev 1ea8be214f
[X86] AMD Zen 3: same-reg SSE XMM PSUB{B,W,D,Q} is a 1-cycle(!) dep-breaking zero-idiom
As confirmed by the exegesis measurements, and ref docs.
2021-05-14 20:23:00 +03:00
Roman Lebedev ce22f53916
[X86] AMD Zen 3: same-reg AVX YMM VPANDN is a zero-cycle(!) dep-breaking zero-idiom
As confirmed by exegesis measurements, and ref docs.
2021-05-14 20:23:00 +03:00
Roman Lebedev 44c2b4fe91
[X86] AMD Zen 3: same-reg AVX XMM VPANDN is a zero-cycle(!) dep-breaking zero-idiom
As confirmed by exegesis measurements, and ref docs.
2021-05-14 20:23:00 +03:00
Roman Lebedev a72cacb53f
[X86] AMD Zen 3: same-reg SSE XMM PANDN is a 1-cycle(!) dep-breaking zero-idiom
As confirmed by the exegesis measurements, and ref docs.
2021-05-14 20:22:59 +03:00
Roman Lebedev 1d73c2b8cf
[X86] AMD Zen 3: same-reg AVX YMM VPXOR is a zero-cycle(!) dep-breaking zero-idiom
As confirmed by exegesis measurements, and ref docs.
2021-05-14 20:22:59 +03:00
Roman Lebedev 31669b5073
[X86] AMD Zen 3: same-reg AVX XMM VPXOR is a zero-cycle(!) dep-breaking zero-idiom
As confirmed by exegesis measurements, and ref docs.
2021-05-14 20:22:58 +03:00
Roman Lebedev 498bf365f4
[X86] AMD Zen 3: same-reg SSE XMM PXOR is a 1-cycle(!) dep-breaking zero-idiom
As confirmed by the exegesis measurements, and ref docs.
2021-05-14 20:22:58 +03:00
Roman Lebedev 4af4afe014
[X86] AMD Zen 3: same-reg AVX YMM VANDNPD is a zero-cycle(!) dep-breaking zero-idiom
As confirmed by exegesis measurements, and ref docs.
2021-05-14 14:06:24 +03:00
Roman Lebedev 17f99a8a41
[X86] AMD Zen 3: same-reg AVX XMM VANDNPD is a zero-cycle(!) dep-breaking zero-idiom
As confirmed by exegesis measurements, and ref docs.
2021-05-14 14:06:24 +03:00
Roman Lebedev 38ceb46fb0
[X86] AMD Zen 3: same-reg SSE XMM ANDNPD is a 1-cycle(!) dep-breaking zero-idiom
As confirmed by exegesis measurements, and ref docs.
2021-05-14 14:06:24 +03:00
Roman Lebedev d8a595b81c
[X86] AMD Zen 3: same-reg AVX YMM VANDNPS is a zero-cycle(!) dep-breaking zero-idiom
As confirmed by exegesis measurements, and ref docs.
2021-05-14 14:06:24 +03:00
Roman Lebedev fd4cbc822b
[X86] AMD Zen 3: same-reg AVX XMM VANDNPS is a zero-cycle(!) dep-breaking zero-idiom
As confirmed by exegesis measurements, and ref docs.
2021-05-14 14:06:23 +03:00
Roman Lebedev f38dcbecb6
[X86] AMD Zen 3: same-reg SSE XMM ANDNPS is a 1-cycle(!) dep-breaking zero-idiom
Same as SSE XMM XORPS/XORPD, it is not zero-cycle, even though it breaks the deps.
As confirmed by the exegesis measurements, and ref docs.
2021-05-14 14:06:23 +03:00
Roman Lebedev 43a7f130a7
[X86] AMD Zen 3: same-reg AVX YMM VXORPD is a zero-cycle(!) dep-breaking zero-idiom
As confirmed by exegesis measurements, and ref docs.
2021-05-14 11:56:07 +03:00
Roman Lebedev 336b9dbe88
[X86] AMD Zen 3: same-reg AVX XMM VXORPD is a zero-cycle(!) dep-breaking zero-idiom
As confirmed by exegesis measurements, and ref docs.
2021-05-14 11:56:07 +03:00
Roman Lebedev 9c596bc541
[X86] AMD Zen 3: same-reg SSE XMM XORPD is a 1-cycle(!) dep-breaking zero-idiom
Same as with it's float friend, unlike their AVX versions.
As confirmed by exegesis, and ref docs.
2021-05-14 11:56:07 +03:00
Roman Lebedev 59554c01ab
[X86] AMD Zen 3: same-reg AVX YMM VXORPS is a zero-cycle(!) dep-breaking zero-idiom
As confirmed by exegesis, and ref docs.
2021-05-14 11:56:06 +03:00
Roman Lebedev 26c1bffe67
[X86] AMD Zen 3: same-reg AVX XMM VXORPS is a zero-cycle(!) dep-breaking zero-idiom
Unlike it's legacy SSE XMM XORPS version, which measures as being 1-cycle,
this one is certainly a zero-cycle instruction, in addition to both of them
being dependency breaking.

As confirmed by exegesis measurements, and ref docs.
2021-05-14 11:56:06 +03:00
Roman Lebedev aa0dcb3ba4
[X86] AMD Zen 3: same-reg SSE XMM XORPS is a 1-cycle(!) dep-breaking one-idiom
While both the SOG and Agner insist that it is zero-cycle,
i can not confirm that claim. While it clearly breaks the dependency,
i can not come up with a snippet, or measurement approach,
to end up with IPC bigger than 4, which, to me, means that it actually
consumes execution resource of an FP unit for a cycle.
2021-05-14 00:03:36 +03:00
Roman Lebedev 6a64c462eb
[X86] AMD Zen 3: same-reg AVX YMM VPCMP is dep breaking one-idiom
As measured by exegesis, and confirmed by ref docs.
Still not zero-cycle :)
2021-05-10 23:49:27 +03:00
Roman Lebedev 2953245337
[X86] AMD Zen 3: same-reg AVX XMM VPCMP is dep breaking one-idiom
As measured by exegesis, and confirmed by ref docs.
Again, it's not zero-cycle.
2021-05-10 23:49:26 +03:00
Roman Lebedev 0f3bcb97ef
[X86] AMD Zen 3: same-reg SSE XMM PCMP is dep breaking one-idiom
As measured by exegesis, and confirmed by ref docs.
Much like with MMX PCMP, it does actually have to execute, though.
2021-05-10 23:49:26 +03:00
Roman Lebedev b24edfff4f
[X86] AMD Zen 3: same-reg PCMPEQ is an MMX all-ones dep breaking idiom
They are, however, not zero-cycle, and do actually execute.

As measured by exegesis, and confirmed by ref docs.
2021-05-10 23:49:26 +03:00
Roman Lebedev 08cf2776ac
[X86] AMD Zen 3: sub-32-bit CMP also break dependencies
They measure as having the same effect as 32-bit CMP.
2021-05-10 20:57:38 +03:00
Roman Lebedev be23d5e814
[X86] AMD Zen 3: same-reg CMP is a zero-cycle dependency-breaking instruction
As measured by exegesis, and confirmed by ref docs.
2021-05-10 00:03:20 +03:00
Roman Lebedev 11b0568dce
[X86] AMD Zen 3: same-reg SBB is a dependency-breaking instruction
As confirmed by exegesis measurements, and ref docs.
It does actually execute.

While there, bump latency for MULX32rr, that seems to match measurements.
2021-05-10 00:03:20 +03:00