Commit Graph

661 Commits

Author SHA1 Message Date
Simon Pilgrim 0c9c92ffc0 [X86][XOP] Tidyup VPHADD/VPHSUB unary horizontal ops default schedule class
Based off Agner and AMD SoG tables, the XOP VPHADD/VPHSUB unary horizontal ops are as fast as basic arithmetic ops, not the slower SSSE3 binary horizontal add/sub ops. This also matches what the bdver2 model already lists.

Noticed while investigating reduction add optimizations.
2022-03-03 12:07:48 +00:00
David Green 61b616755a Partially revert "[SchedModels][CortexA55] Add ASIMD integer instructions"
The Cortex-A55 scheduling model is used for -mcpu=generic, meaning it
can have a wider effect than just the A55. The changes to the A55
scheduling model seems to have caused performance regressions on
Cortex-A510 device which have latencies closer to the original and
different forwarding paths.

This partially reverts the changes from D117003, at least until we can
do something to improve Cortex-A510. According to my results, this
improves the A510 results without altering the A55 very much.
2022-02-28 10:58:52 +00:00
Pavel Kosov 37fa99eda0 [SchedModels][CortexA55] Add ASIMD integer instructions
Depends on D114642

Original review https://reviews.llvm.org/D112201

OS Laboratory. Huawei Russian Research Institute. Saint-Petersburg

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D117003
2022-02-17 13:41:57 +03:00
Craig Topper 56d6ccd4cb [X86] Update register RCL/RCR by 1 and immediate scheduling for Intel CPUs
Most Intel CPU scheduler files lumped the immediate and 1 instructions
together, but uops.info shows they are quite different.

For the most part the by 1 instructions were pretty accurate to the uops.info
data except the latency was 3 instead of 2 as uops.info indicates.

The by immediate instructions need 7 or 8 uops and have higher latency.

It looks like the 8-bit by immediate instructions may need even more
uops, but I just lumped them with the 16/32/64.

Noticed while checking out PR53648. So mostly I cared about the by 1
instructions.

Reviewed By: RKSimon, pengfei

Differential Revision: https://reviews.llvm.org/D119217
2022-02-08 09:20:20 -08:00
Simon Pilgrim 6eb8fc9244 [X86] Add some missing dependency-breaking zero idiom patterns to scheduler models
Many of the x86 scheduler models are not accounting for their microarch's ability to handle dependency-breaking zero idioms (pxor xmm0,xmm0 etc.), which is causing some notable differences when comparing llvm-mca reports to iaca, uops.info etc.

These are based on the Intel AoMs and Agner's docs which list the instructions handled on each cpu model - there may be more, although tbh the xor/pxor/xorps/xorpd are by far the most commonly encountered.

Once this is in place we also need to review missing support for 'allones' idioms and reg-reg move elimination, but this needs fixing first.

@lebedev.ri The Barcelona test changes are due to the cpu still being tagged as using the SandyBridge model, if/when you get back to D63628 these will need to be addressed.

Based on an original patch by @andreadb (Andrea Di Biagio)

Differential Revision: https://reviews.llvm.org/D117497
2022-01-19 11:29:33 +00:00
Simon Pilgrim 8ea579203d [MCA][X86] Add missing zero-idioms test file coverage
atom/slm have no/limited zero-idioms handling but we should test all the common instructions anyhow

znver1/znver2 were just missing - I've copied the Haswell tests for consistent test coverage
2022-01-17 16:04:39 +00:00
Patrick Holland 85e6e748d4 [MCA] Switching from conservatively guessing which instructions are
memory-barrier instructions to providing targets and developers a convenient
way to explicitly declare which instructions are memory-barriers.

Differential Revision: https://reviews.llvm.org/D116779
2022-01-11 13:50:14 -08:00
Pavel Kosov 34a91d7748 [SchedModels][CortexA55] Fix scheduling of FP loads
Patch fixes scheduling of FP load instructions with pre/post increment adding WriteAdr for address operand.

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D116361

OS Laboratory. Huawei Russian Research Institute. Saint-Petersburg
2022-01-10 10:14:45 +03:00
Simon Pilgrim a0a0eb192e [X86] Use WriteVecMove scheduler classes for VPMOVM2* instructions
These match the port behaviour of reg-reg predicated xmm/ymm/zmm moves

Fixes #34958
2021-12-27 13:21:29 +00:00
Simon Pilgrim 29475e0286 [X86] Add scheduler classes for zmm vector reg-reg move instructions
Basic zmm reg-reg moves (with predication) are more port limited than xmm/ymm moves, so we need to add a separate class for them.

We still appear to be missing move-elimination patterns for most of the intel models, which looks to be one of the main diffs for basic codegen analysis between llvm-mca and uops.info

Load/stores are a bit messier and might be better handled as overrides.
2021-12-27 12:13:42 +00:00
Simon Pilgrim 948ae472a6 [MCA][X86] Add AVX512 vector move instruction test coverage 2021-12-27 11:43:39 +00:00
Simon Pilgrim 67cce1ceee [X86] Adjust some IceLake fp shuffle schedule classes (PR48110)
The IceLake scheduler model is still mainly a copy of the SkylakeServer model.

This patch adjusts the fp shuffle classes to account for most instructions now working on Port 1 as well as Port 5.

This is based off Agner + uops.info as well as the PR48110 report.

Differential Revision: https://reviews.llvm.org/D115752
2021-12-19 13:00:11 +00:00
Simon Pilgrim 74d1fc742a [X86] Adjust some IceLake integer shuffle schedule classes (PR48110)
The IceLake scheduler model is still mainly a copy of the SkylakeServer model.

This patch adjusts the integer shuffle classes to account for most instructions now working on Port 1 as well as Port 5.

This is based off Agner + uops.info as well as the PR48110 report.

Differential Revision: https://reviews.llvm.org/D115547
2021-12-14 18:56:13 +00:00
Simon Pilgrim d9655eec05 [MCA][X86] Add AVX512 subvector broadcast instruction test coverage 2021-12-13 18:48:25 +00:00
Simon Pilgrim fe1b5b56c6 [MCA][X86] Add AVX512 movddup/movshdup/movsldup instruction test coverage
As noted on D115547
2021-12-13 18:04:56 +00:00
Simon Pilgrim b04c646711 [MCA][X86] Add AVX512 broadcast instruction test coverage
As noted on D115547
2021-12-13 17:45:16 +00:00
Simon Pilgrim 9ad5969b5e [X86][Atom] Fix CVT uops + port usage
Fix overrides to use both ports. Update the uops counts + port usage based off the most recent llvm-exegesis captures (PR36895) and what Intel AoM / Agner reports as well.
2021-12-12 22:57:53 +00:00
Simon Pilgrim 4c1d248397 [MCA][X86] Fix duplicated cvtsi2ss/cvtsi2sd i32 + i64 folded tests
Specify the integer width to ensure we're testing the correct instruction
2021-12-12 22:48:45 +00:00
Simon Pilgrim c02f9791c6 [X86][AVX512] Remove xmm->xmm vpmovsx/vpmovzx rm overrides
The XMM evex cases have the same behaviour as the SSE41 versions, which already uses WriteShuffleX.Folded
2021-12-12 16:08:10 +00:00
Simon Pilgrim fc02ceb12a [X86][AVX512] Use WriteShuffleX for xmm->xmm extensions
The XMM evex cases have the same behaviour as the SSE41 versions, which already uses WriteShuffleX
2021-12-12 15:22:32 +00:00
Simon Pilgrim 7f09aee0f6 [MCA][X86] Add missing VPMOVSX/VPMOVZX from AVX512 tests 2021-12-10 18:12:57 +00:00
Simon Pilgrim 6fae235885 [MCA][X86] Add missing ALIGND/ALIGNQ from AVX512F/AVX512VL tests 2021-12-10 15:59:52 +00:00
Simon Pilgrim b025b062d6 [MCA][X86] Add missing PALIGNR from AVX512BW/AVX512BWVL tests 2021-12-10 15:59:52 +00:00
Simon Pilgrim ebcc92ccda [MCA][X86] Add missing PSLLDQ/PSRLDQ from AVX512BW/AVX512BWVL tests 2021-12-10 15:59:51 +00:00
Simon Pilgrim 550bf36732 [MCA][X86] Add missing PACKSS/PACKUS from AVX512BW/AVX512BWVL tests 2021-12-10 15:59:51 +00:00
Simon Pilgrim 80ce01c6fd [MCA][X86] Add missing PSHUFLW from AVX512BWVL tests 2021-12-10 14:02:37 +00:00
Andrew Savonichev 420300c0d8 [MCA] Remove the warning about experimental support for in-order CPU
There are not a lot of bug reports for this feature, so let's mark it
stable.

Differential Revision: https://reviews.llvm.org/D114701
2021-12-07 15:27:51 +03:00
Roman Lebedev 2f364f6f0d
[NFC][X86][MCA] Add forgotten test coverage for AVX512's VPMOVM2[BWDQ] / VPMOV[BWDQ]2M 2021-11-20 13:09:18 +03:00
Simon Pilgrim 0bb32b1b21 [X86][SLM] Fix BitTest+Set uops + port usage
Both ports are required for BitTest ops. Update the uops counts + port usage based off the most recent llvm-exegesis captures and what Intel AoM / Agner reports as well.
2021-10-17 18:13:15 +01:00
Simon Pilgrim 5ed5df4802 [X86][SLM] Fix uops for PCMPISTR/PCMPISTR instructions
Based off a recent llvm-exegesis capture and what Intel AoM / Agner reports as well.
2021-10-17 18:13:14 +01:00
Simon Pilgrim 680afaaa5d [X86][SLM] Fix uops for PCLMULQDQ
Based off a recent llvm-exegesis capture and what Intel AoM / Agner reports as well.
2021-10-17 18:13:14 +01:00
Simon Pilgrim 498c7236bc [X86][SLM] +1uop for PSHUFBrm xmm
Extra 1uop for folded pshufb ops, based off a recent llvm-exegesis capture and what Intel AoM / Agner reports as well.
2021-10-17 18:13:14 +01:00
Simon Pilgrim 7cae0daee6 [X86][Atom] Fix BSR/BSF uops + port usage
Both ports are required for BitScan ops. Update the uops counts + port usage based off the most recent llvm-exegesis captures (PR36895) and what Intel AoM / Agner reports as well.
2021-10-02 19:09:44 +01:00
Simon Pilgrim 8e7f6039fa [X86] Atom SSE shift-by-variable take 2uops/3uops not 1uop
Based off the most recent llvm-exegesis captures (PR36895) and what Intel AoM / Agner / InstLatX64 reports as well.
2021-10-02 12:28:41 +01:00
David Green e9adcbde31 [AArch64] Model Cortex-A55 Q register NEON instructions
Cortex-A55 has 2 64bit NEON vector units, meaning a 128bit instruction
requires taking both units (and can only be issued as the first
instruction in a dual issue pair). This patch models that by splitting
the WriteV SchedWrite into two - the WriteVd that reads/writes only
64bit operands, and the WriteVq that read/writes 128bit registers. The
A55 schedule then uses this distinction to model the WriteVq as taking
both resource units, and starting a Schedule Group and WriteVd as taking
one as before.

I believe this is more correct, even if it does not lead to much better
performance.

Differential Revision: https://reviews.llvm.org/D108766
2021-09-29 16:55:31 +01:00
Simon Pilgrim dade83c02a [X86][SLM] Fix ADDQ/SUBQ/CMPEQQ throughput to account for running on either port.
Testing on a SLM box suggests these can run on either port, but the throughput is 4cy on either (inc MMX versions). Confirmed with Intel AoM / Agner / InstLatX64.
2021-09-24 10:06:14 +01:00
Simon Pilgrim f855ef2601 [X86][Atom] Fix FP uops + port usage
Both ports are required in most cases. Update the uops counts + port usage based off the most recent llvm-exegesis captures (PR36895) and what Intel AoM / Agner / InstLatX64 reports as well.

Noticed while trying to improve fp costs for vectorization via the D103695 helper script.
2021-09-19 20:39:20 +01:00
Simon Pilgrim cf8fac7d07 [X86][Atom] Specific uops for all IMUL/IDIV instructions
Based off a mixture of llvm-exegesis captures (PR36895) and Intel AoM / Agner / InstLatX64 reports.
2021-09-19 16:58:52 +01:00
Simon Pilgrim e381d8b243 [X86][Atom] Fix (U)COMISS/SD uops, latency and throughput
Both ports are required, for reg and mem variants - we can also use the WriteFComX class directly and remove the unnecessary InstRW overrides. Matches what Intel AoM / Agner / InstLatX64 report as well.
2021-09-19 12:44:44 +01:00
Simon Pilgrim 5ebe95e256 [X86][Atom] Fix integer shuffles uops, latency and throughput
The MMX pack/unpck shuffles don't need an override - they have the same behaviour as other shuffles (Port0 only).
The SSE pslldq/psrldq shuffles don't need an override - they have the same behaviour as other shuffles (Port0 only).
The SSE pshufb shuffles use 4uops (+1 load).

Noticed the pslldq/psrldq issue while trying to improve reduction costs via the D103695 helper script, and fixed the others while reviewing. Confirmed with Intel AoM / Agner / InstLatX64.
2021-09-17 12:11:54 +01:00
Simon Pilgrim 65ad09da0e [X86][SLM] Fix DIVPD/DIVPS/RCPPS/RSQRTPS/SQRTPD/SQRTPS/DPPD/DPPS uops, latency and throughput
The packed variants of the instructions had been modelled as the same as the scalar variants.

Reported during a run of llvm-exegesis on a cheap SLM box and matches what Agner / InstLatX64 report as well.
2021-09-13 08:36:43 +01:00
Simon Pilgrim df975e4590 [X86][SLM] Fix PSAD/MPSAD uops, latency and throughput
Noticed while trying to improve generic reduction costs via the D103695 helper script. Confirmed with Intel AoM / Agner / InstLatX64.
2021-09-11 11:44:09 +01:00
Simon Pilgrim 484944ac3b [X86][SLM] Fix HADD/HSUB uops, latency and throughput
Noticed while trying to improve generic reduction costs via the D103695 helper script. Confirmed with Intel AoM / Agner / InstLatX64.
2021-09-11 11:44:09 +01:00
Simon Pilgrim 2005ae15a6 [X86][SLM] WriteVecIMul instructions only take 1uop (REAPPLIED)
The xmm variant have half the throughput (and +1cy latency) of the mmx variants, but are still 1uop.

I still need to do more thorough testing of SLM on test-suite before fixing the obvious bad numbers for WritePMULLD.

But this helps the D103695 helper script get to more accurate numbers for vXi32 multiplies of extended operands (i.e. we can use PMADDWD, PMULLW/PMULHW etc). Matches what Intel AoM / Agner / llvm-exegesis reports.
2021-09-04 15:03:56 +01:00
Simon Pilgrim ac51d69208 Revert rG994da657076900f5ad7fe593c3b5e5f89ab3d53d "[X86][SLM] WriteVecIMul instructions only take 1uop"
This changed some codegen tests that I forgot about in my rebase, I'll recommit shortly with a fix.
2021-09-04 13:39:10 +01:00
Simon Pilgrim 994da65707 [X86][SLM] WriteVecIMul instructions only take 1uop
The xmm variant have half the throughput (and +1cy latency) of the mmx variants, but are still 1uop.

I still need to do more thorough testing of SLM on test-suite before fixing the obvious bad numbers for WritePMULLD.

But this helps the D103695 helper script get to more accurate numbers for vXi32 multiplies of extended operands (i.e. we can use PMADDWD, PMULLW/PMULHW etc). Matches what Intel AoM / Agner / llvm-exegesis reports.
2021-09-04 13:21:34 +01:00
Simon Pilgrim c6371020a8 [X86][SLM] RMW instructions don't require an extra uop
For RMW instructions, the load and store hold the MEC for an extra cycle, but within the same single uop. This is alluded to in the Intel AOM:

"The MEC also owns the MEC RSV, which is responsible for scheduling of all loads and stores. Load and
store instructions go through addresses generation phase in program order to avoid on-the-fly memory
ordering later in the pipeline. Therefore, an unknown address will stall younger memory instructions."

Noticed while trying to get a cheap SLM test box up and running with llvm-exegesis - RMW arithmetic is always 1uop - and matches what Agner / InstLatX64 report as well.
2021-09-04 13:21:34 +01:00
Simon Pilgrim da965a77d5 [X86][SLM] Fix MUL uops, latency and throughput
These were all set to the same best case mul i32 values (which seems to be the only version of MUL that SLM actually performs well with).

Noticed while trying to improve multiplication costs for vectorization via the D103695 helper script. Confirmed with Intel AoM / Agner / InstLatX64.
2021-09-04 13:21:34 +01:00
Simon Pilgrim 7d062d2c47 [X86][Atom] MUL/DIV instructions require both ports, not either.
Noticed while trying to improve multiplication costs for vectorization via the D103695 helper script. Confirmed with Intel AoM.
2021-09-04 11:58:09 +01:00
Simon Pilgrim 6ba0b9f68a [X86][SLM] Fix PBLENDVB uops and throughput
SLM PBLENDVB is just as bad as BLENDVPD/PS - so model it as such, fixing the rr vs rm uops diff as well. The Intel AoM appears to have a copy+paste typo with PBLENDW, it doesn't match Agner or InstLatX64.

Noticed while investigating some of the weird discrepancies reported by the D103695 helper script (SLM had much better vector shift throughputs than it should).
2021-09-03 11:31:29 +01:00