
675 Commits

Author SHA1 Message Date
Cullen Rhodes 767b26a4e2 [MCA] Support multiple comma-separated -mattr features
Reviewed By: myhsu

Differential Revision: https://reviews.llvm.org/D129479
2022-07-12 08:20:11 +00:00
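As a rough illustration of what accepting comma-separated -mattr values means, here is a minimal Python sketch. It is not llvm-mca's implementation: the helper name `parse_mattr` and the example feature strings are assumptions made for illustration only.

```python
def parse_mattr(values):
    """Flatten repeated and comma-separated -mattr values into one list.

    `values` holds the raw strings given to each -mattr occurrence,
    e.g. ["+sve,+sve2", "+bf16"]. Hypothetical helper, not an LLVM API.
    """
    features = []
    for value in values:
        for item in value.split(","):
            item = item.strip()
            if item:  # skip empty pieces such as a trailing comma
                features.append(item)
    return features


if __name__ == "__main__":
    print(parse_mattr(["+sve,+sve2", "+bf16"]))
```

The point of the sketch is only that one option value may carry several features, and that repeated occurrences accumulate in order.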
Cullen Rhodes d1c51d45f0 [AArch64] Use Neoverse N2 sched model as default for:
  - Cortex-A710
  - Cortex-X2
  - Neoverse-V1
  - Neoverse-512tvb

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D129203
2022-07-08 13:34:13 +00:00
Cullen Rhodes 03af9ba680 [AArch64] Initial sched model for Neoverse N2
The optimization guide can be found here:
https://developer.arm.com/documentation/PJDOC-466751330-18256/latest/

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D128631
2022-07-08 09:39:13 +00:00
Jay Foad d393538c7f [AMDGPU] Add a GFX11 MCA test
This mostly just tests that DPFP is 1/32 rate on GFX11, instead of 1/16
rate as on GFX10.
2022-06-14 13:47:29 +01:00
Simon Pilgrim d384a4c530 [X86] Adjust vector test costs to match SoG (Issue #54889)
znver1/2 models were incorrectly modelling the latency/throughput/uops and znver1 ymm variants also require double pumping.

Now matches what I can decipher from the AMD SoG, Agner and instlatx64 numbers vs the llvm-exegesis report provided by @fabian-r
2022-05-31 09:14:06 +01:00
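The "double pumping" mentioned in the commit above can be sketched as follows, assuming a 128-bit FP datapath on znver1 so that a 256-bit ymm operation is executed as two 128-bit halves. The function name and default values are hypothetical; the real model lives in LLVM's TableGen scheduler files, not in Python.

```python
import math


def pumped_uops(vector_bits, base_uops=1, datapath_bits=128):
    """Estimate the uop count when a vector op is wider than the datapath.

    Assumption: each extra pass through the datapath repeats the base
    uop count, so a 256-bit op on a 128-bit datapath costs twice as much.
    """
    passes = math.ceil(vector_bits / datapath_bits)
    return base_uops * passes


if __name__ == "__main__":
    print(pumped_uops(128))  # xmm on a 128-bit datapath
    print(pumped_uops(256))  # ymm on a 128-bit datapath
```

Passing `datapath_bits=256` models a core with a full-width datapath, where the ymm variant needs no extra pass.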
Simon Pilgrim 14cc4674bf [X86] Adjust vector fp test costs to match int test costs
znver1/2 models were missing the vtestps/pd overrides to match the vptest integer equivalents.

Noticed while investigating Issue #54889
2022-05-30 09:50:15 +01:00
Simon Pilgrim 1956f28037 [X86] Adjust vector extend to ymm to match SoG (Issue #54889)
znver1 ymm variants of VPMOVSX**/VPMOVZX** instructions require double pumping.

Now matches AMD SoG, Agner and instlatx64 numbers.

Thanks to @fabian-r for the report
2022-05-30 08:58:56 +01:00
Simon Pilgrim c99690462e [X86] Adjust vector shift costs to match SoG (Issue #54889)
znver1/2 models were incorrectly modelling the fpupipe (should be pipe2 for shift-by-scalar-amount and pipe1 for shift-by-element-amount) and znver1 ymm variants also require double pumping.

Now matches AMD SoG, Agner and instlatx64 numbers.

Thanks to @fabian-r for the report
2022-05-29 17:55:39 +01:00
Simon Pilgrim 896557e129 [X86] Adjust fadd costs to match SoG
znver1/2 models were incorrectly modelling these on fpupipe 0 instead of 2/3 and znver1 ymm variants also require double pumping.

Now matches AMD SoG, Agner and instlatx64 numbers.

Thanks to @fabian-r for the report
2022-05-15 21:28:29 +01:00
Simon Pilgrim 6824cf1ab7 [X86] Set some more plausible latencies for horizontal add/subs on znver1
These are all microcoded/multi-pipe nightmares on Ryzen, but we shouldn't just be using the WriteMicrocoded class which is for REALLY bad microcoded nightmares - instead use the same approximate latencies as znver2 (Agner and uops.info both suggest similar values) - and make sure we use the FPU defs for both

Fixes #53242
2022-05-08 15:48:42 +01:00
Simon Pilgrim c7662dc3e5 [X86] MOVDDUP has the same sched behaviour as MOVSHDUP/MOVSLDUP on Skylake
Fixes an old TODO - confirmed on Agner + uops.info
2022-05-02 12:50:37 +01:00
Simon Pilgrim a305d8f44e [X86] Adjust fsetcc/fmin/fmax costs to match SoG (Issue #54889)
znver1/2 models were incorrectly modelling these as 3 cycle latency instructions on the wrong pipe and znver1 ymm variants also require double pumping.

Now matches AMD SoG, Agner and instlatx64 numbers.

Thanks to @fabian-r for the report
2022-04-14 13:27:33 +01:00
Simon Pilgrim 058a33d3c9 [X86] Account for high uop/resource usage in BSF/BSR instructions
znver1/2 models were incorrectly modelling these as single uop instructions, instead of the microcoded nightmares they really are.

Now matches AMD SoG, Agner and instlatx64 numbers.

Fixes #54811
2022-04-11 11:20:09 +01:00
Simon Pilgrim 5626bd4289 [X86] Fix SLM scheduler model for PMULLD (PR37059)
Adjust the PMULLD entry to match the Intel AoM numbers - PMULLD is a uop nightmare on SLM and we should model it as such.

We had reports of internal regressions the last time this was attempted (rG13a0f83a05ff), but no public repros, and tests I did last year when I had access to a SLM box failed to see anything. My hunch is that the more aggressive PMULLD -> PMADDWD folds we now perform might have helped. We can revisit this again if we ever receive an actual repro.

Fixes #36407
2022-04-08 10:07:06 +01:00
Simon Pilgrim 0c9c92ffc0 [X86][XOP] Tidyup VPHADD/VPHSUB unary horizontal ops default schedule class
Based off Agner and AMD SoG tables, the XOP VPHADD/VPHSUB unary horizontal ops are as fast as basic arithmetic ops, not the slower SSSE3 binary horizontal add/sub ops. This also matches what the bdver2 model already lists.

Noticed while investigating reduction add optimizations.
2022-03-03 12:07:48 +00:00
David Green 61b616755a Partially revert "[SchedModels][CortexA55] Add ASIMD integer instructions"
The Cortex-A55 scheduling model is used for -mcpu=generic, meaning it
can have a wider effect than just the A55. The changes to the A55
scheduling model seem to have caused performance regressions on
Cortex-A510 devices, which have latencies closer to the original and
different forwarding paths.

This partially reverts the changes from D117003, at least until we can
do something to improve Cortex-A510. According to my results, this
improves the A510 results without altering the A55 very much.
2022-02-28 10:58:52 +00:00
Pavel Kosov 37fa99eda0 [SchedModels][CortexA55] Add ASIMD integer instructions
Depends on D114642

Original review https://reviews.llvm.org/D112201

OS Laboratory. Huawei Russian Research Institute. Saint-Petersburg

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D117003
2022-02-17 13:41:57 +03:00
Craig Topper 56d6ccd4cb [X86] Update register RCL/RCR by 1 and immediate scheduling for Intel CPUs
Most Intel CPU scheduler files lumped the immediate and 1 instructions
together, but uops.info shows they are quite different.

For the most part the by 1 instructions were pretty accurate to the uops.info
data except the latency was 3 instead of 2 as uops.info indicates.

The by immediate instructions need 7 or 8 uops and have higher latency.

It looks like the 8-bit by immediate instructions may need even more
uops, but I just lumped them with the 16/32/64.

Noticed while checking out PR53648. So mostly I cared about the by 1
instructions.

Reviewed By: RKSimon, pengfei

Differential Revision: https://reviews.llvm.org/D119217
2022-02-08 09:20:20 -08:00
Simon Pilgrim 6eb8fc9244 [X86] Add some missing dependency-breaking zero idiom patterns to scheduler models
Many of the x86 scheduler models are not accounting for their microarch's ability to handle dependency-breaking zero idioms (pxor xmm0,xmm0 etc.), which is causing some notable differences when comparing llvm-mca reports to iaca, uops.info etc.

These are based on the Intel AoMs and Agner's docs which list the instructions handled on each cpu model - there may be more, although tbh the xor/pxor/xorps/xorpd are by far the most commonly encountered.

Once this is in place we also need to review missing support for 'allones' idioms and reg-reg move elimination, but this needs fixing first.

@lebedev.ri The Barcelona test changes are due to the cpu still being tagged as using the SandyBridge model, if/when you get back to D63628 these will need to be addressed.

Based on an original patch by @andreadb (Andrea Di Biagio)

Differential Revision: https://reviews.llvm.org/D117497
2022-01-19 11:29:33 +00:00
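For readers unfamiliar with dependency-breaking zero idioms, the idea in the commit above can be sketched in Python. This is a simplified illustration, not how the scheduler models or llvm-mca express it: the opcode set is limited to the four instructions the commit names, the function names are hypothetical, and register aliasing is ignored.

```python
# Opcodes named in the commit message; real CPUs recognise more.
ZERO_IDIOM_OPCODES = {"xor", "pxor", "xorps", "xorpd"}


def is_zero_idiom(opcode, dst, src):
    """Return True for forms like `pxor xmm0, xmm0` (same register twice)."""
    return opcode.lower() in ZERO_IDIOM_OPCODES and dst == src


def input_registers(opcode, dst, src):
    """Registers the instruction must wait on before it can execute.

    A zero idiom always produces zero, so it depends on no earlier
    writer of the register; any other two-operand form reads both.
    """
    if is_zero_idiom(opcode, dst, src):
        return set()
    return {dst, src}


if __name__ == "__main__":
    print(input_registers("pxor", "xmm0", "xmm0"))
    print(input_registers("pxor", "xmm0", "xmm1"))
```

A model that misses this treats `pxor xmm0, xmm0` as waiting on the previous value of xmm0, which lengthens the reported dependency chain.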
Simon Pilgrim 8ea579203d [MCA][X86] Add missing zero-idioms test file coverage
atom/slm have no/limited zero-idioms handling but we should test all the common instructions anyhow

znver1/znver2 were just missing - I've copied the Haswell tests for consistent test coverage
2022-01-17 16:04:39 +00:00
Patrick Holland 85e6e748d4 [MCA] Switching from conservatively guessing which instructions are
memory-barrier instructions to providing targets and developers a convenient
way to explicitly declare which instructions are memory-barriers.

Differential Revision: https://reviews.llvm.org/D116779
2022-01-11 13:50:14 -08:00
Pavel Kosov 34a91d7748 [SchedModels][CortexA55] Fix scheduling of FP loads
Patch fixes scheduling of FP load instructions with pre/post increment adding WriteAdr for address operand.

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D116361

OS Laboratory. Huawei Russian Research Institute. Saint-Petersburg
2022-01-10 10:14:45 +03:00
Simon Pilgrim a0a0eb192e [X86] Use WriteVecMove scheduler classes for VPMOVM2* instructions
These match the port behaviour of reg-reg predicated xmm/ymm/zmm moves

Fixes #34958
2021-12-27 13:21:29 +00:00
Simon Pilgrim 29475e0286 [X86] Add scheduler classes for zmm vector reg-reg move instructions
Basic zmm reg-reg moves (with predication) are more port limited than xmm/ymm moves, so we need to add a separate class for them.

We still appear to be missing move-elimination patterns for most of the intel models, which looks to be one of the main diffs for basic codegen analysis between llvm-mca and uops.info

Load/stores are a bit messier and might be better handled as overrides.
2021-12-27 12:13:42 +00:00
Simon Pilgrim 948ae472a6 [MCA][X86] Add AVX512 vector move instruction test coverage 2021-12-27 11:43:39 +00:00
Simon Pilgrim 67cce1ceee [X86] Adjust some IceLake fp shuffle schedule classes (PR48110)
The IceLake scheduler model is still mainly a copy of the SkylakeServer model.

This patch adjusts the fp shuffle classes to account for most instructions now working on Port 1 as well as Port 5.

This is based off Agner + uops.info as well as the PR48110 report.

Differential Revision: https://reviews.llvm.org/D115752
2021-12-19 13:00:11 +00:00
Simon Pilgrim 74d1fc742a [X86] Adjust some IceLake integer shuffle schedule classes (PR48110)
The IceLake scheduler model is still mainly a copy of the SkylakeServer model.

This patch adjusts the integer shuffle classes to account for most instructions now working on Port 1 as well as Port 5.

This is based off Agner + uops.info as well as the PR48110 report.

Differential Revision: https://reviews.llvm.org/D115547
2021-12-14 18:56:13 +00:00
Simon Pilgrim d9655eec05 [MCA][X86] Add AVX512 subvector broadcast instruction test coverage 2021-12-13 18:48:25 +00:00
Simon Pilgrim fe1b5b56c6 [MCA][X86] Add AVX512 movddup/movshdup/movsldup instruction test coverage
As noted on D115547
2021-12-13 18:04:56 +00:00
Simon Pilgrim b04c646711 [MCA][X86] Add AVX512 broadcast instruction test coverage
As noted on D115547
2021-12-13 17:45:16 +00:00
Simon Pilgrim 9ad5969b5e [X86][Atom] Fix CVT uops + port usage
Fix overrides to use both ports. Update the uops counts + port usage based off the most recent llvm-exegesis captures (PR36895) and what Intel AoM / Agner reports as well.
2021-12-12 22:57:53 +00:00
Simon Pilgrim 4c1d248397 [MCA][X86] Fix duplicated cvtsi2ss/cvtsi2sd i32 + i64 folded tests
Specify the integer width to ensure we're testing the correct instruction
2021-12-12 22:48:45 +00:00
Simon Pilgrim c02f9791c6 [X86][AVX512] Remove xmm->xmm vpmovsx/vpmovzx rm overrides
The XMM evex cases have the same behaviour as the SSE41 versions, which already uses WriteShuffleX.Folded
2021-12-12 16:08:10 +00:00
Simon Pilgrim fc02ceb12a [X86][AVX512] Use WriteShuffleX for xmm->xmm extensions
The XMM evex cases have the same behaviour as the SSE41 versions, which already uses WriteShuffleX
2021-12-12 15:22:32 +00:00
Simon Pilgrim 7f09aee0f6 [MCA][X86] Add missing VPMOVSX/VPMOVZX from AVX512 tests 2021-12-10 18:12:57 +00:00
Simon Pilgrim 6fae235885 [MCA][X86] Add missing ALIGND/ALIGNQ from AVX512F/AVX512VL tests 2021-12-10 15:59:52 +00:00
Simon Pilgrim b025b062d6 [MCA][X86] Add missing PALIGNR from AVX512BW/AVX512BWVL tests 2021-12-10 15:59:52 +00:00
Simon Pilgrim ebcc92ccda [MCA][X86] Add missing PSLLDQ/PSRLDQ from AVX512BW/AVX512BWVL tests 2021-12-10 15:59:51 +00:00
Simon Pilgrim 550bf36732 [MCA][X86] Add missing PACKSS/PACKUS from AVX512BW/AVX512BWVL tests 2021-12-10 15:59:51 +00:00
Simon Pilgrim 80ce01c6fd [MCA][X86] Add missing PSHUFLW from AVX512BWVL tests 2021-12-10 14:02:37 +00:00
Andrew Savonichev 420300c0d8 [MCA] Remove the warning about experimental support for in-order CPU
There are not a lot of bug reports for this feature, so let's mark it
stable.

Differential Revision: https://reviews.llvm.org/D114701
2021-12-07 15:27:51 +03:00
Roman Lebedev 2f364f6f0d [NFC][X86][MCA] Add forgotten test coverage for AVX512's VPMOVM2[BWDQ] / VPMOV[BWDQ]2M 2021-11-20 13:09:18 +03:00
Simon Pilgrim 0bb32b1b21 [X86][SLM] Fix BitTest+Set uops + port usage
Both ports are required for BitTest ops. Update the uops counts + port usage based off the most recent llvm-exegesis captures and what Intel AoM / Agner reports as well.
2021-10-17 18:13:15 +01:00
Simon Pilgrim 5ed5df4802 [X86][SLM] Fix uops for PCMPISTR/PCMPISTR instructions
Based off a recent llvm-exegesis capture and what Intel AoM / Agner reports as well.
2021-10-17 18:13:14 +01:00
Simon Pilgrim 680afaaa5d [X86][SLM] Fix uops for PCLMULQDQ
Based off a recent llvm-exegesis capture and what Intel AoM / Agner reports as well.
2021-10-17 18:13:14 +01:00
Simon Pilgrim 498c7236bc [X86][SLM] +1uop for PSHUFBrm xmm
Extra 1uop for folded pshufb ops, based off a recent llvm-exegesis capture and what Intel AoM / Agner reports as well.
2021-10-17 18:13:14 +01:00
Simon Pilgrim 7cae0daee6 [X86][Atom] Fix BSR/BSF uops + port usage
Both ports are required for BitScan ops. Update the uops counts + port usage based off the most recent llvm-exegesis captures (PR36895) and what Intel AoM / Agner reports as well.
2021-10-02 19:09:44 +01:00
Simon Pilgrim 8e7f6039fa [X86] Atom SSE shift-by-variable take 2uops/3uops not 1uop
Based off the most recent llvm-exegesis captures (PR36895) and what Intel AoM / Agner / InstLatX64 reports as well.
2021-10-02 12:28:41 +01:00
David Green e9adcbde31 [AArch64] Model Cortex-A55 Q register NEON instructions
Cortex-A55 has 2 64bit NEON vector units, meaning a 128bit instruction
requires taking both units (and can only be issued as the first
instruction in a dual issue pair). This patch models that by splitting
the WriteV SchedWrite into two - the WriteVd that reads/writes only
64bit operands, and the WriteVq that reads/writes 128bit registers. The
A55 schedule then uses this distinction to model the WriteVq as taking
both resource units, and starting a Schedule Group and WriteVd as taking
one as before.

I believe this is more correct, even if it does not lead to much better
performance.

Differential Revision: https://reviews.llvm.org/D108766
2021-09-29 16:55:31 +01:00
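The issue rule described in the commit above (a 128-bit op takes both 64-bit NEON units and must start an issue group, while a 64-bit op takes one unit) can be sketched with a toy cycle counter. This is only an illustration under strong assumptions: it considers vector ops alone, ignores latencies and forwarding, and uses the hypothetical labels "d" and "q" for 64-bit and 128-bit ops. The actual model is written in TableGen.

```python
def issue_cycles(ops):
    """Count issue cycles for a stream of vector ops on two 64-bit units.

    ops: iterable of "d" (64-bit, one unit) or "q" (128-bit, both units).
    A "q" op always opens a new issue group; a "d" op joins the current
    group if a unit is still free, otherwise it opens a new one.
    """
    cycles = 0
    free_units = 0  # units still available in the current issue group
    for op in ops:
        need = 2 if op == "q" else 1
        if need == 2 or free_units < need:
            cycles += 1      # start a new issue group
            free_units = 2
        free_units -= need
    return cycles


if __name__ == "__main__":
    print(issue_cycles(["d", "d"]))       # two 64-bit ops pair up
    print(issue_cycles(["d", "q", "d"]))  # the q op cannot be second in a pair
```

Under these assumptions a stream of "q" ops issues at half the rate of a stream of "d" ops, which is the distinction the WriteVd/WriteVq split is meant to capture.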
Simon Pilgrim dade83c02a [X86][SLM] Fix ADDQ/SUBQ/CMPEQQ throughput to account for running on either port.
Testing on a SLM box suggests these can run on either port, but the throughput is 4cy on either (inc MMX versions). Confirmed with Intel AoM / Agner / InstLatX64.
2021-09-24 10:06:14 +01:00