Commit Graph

677 Commits

Author SHA1 Message Date
Yuta Mukai 3f561996bf [AArch64] Fix and add A64FX scheduling resource/latency info
1. Missing instruction information (FTSSEL, FMSB, PFIRST and RDFFR)
   is added and CompleteModel is set to one.

2. Information for pseudo SVE instructions is added. Those
   instructions are present at the time of scheduling.

3. Resource and latency information for SVE instructions is modified
   to be more accurate.
   For example, the description for CMPEQ, which consumes one cycle
   each of unit FLA and PPR, is as follows.
```
Previous:
  def A64FXGI01 : ProcResGroup<[A64FXIPFLA, A64FXIPPR]>;
  def A64FXWrite_4Cyc_GI01 : SchedWriteRes<[A64FXGI01]> {...
Modified:
  def A64FXGI0 : ProcResGroup<[A64FXIPFLA]>;
  def A64FXGI1 : ProcResGroup<[A64FXIPPR]>;
  def A64FXWrite_CMP : SchedWriteRes<[A64FXGI0, A64FXGI1]> {...
```

Reference: A64FX Microarchitecture Manual (Table 16-3)
https://github.com/fujitsu/A64FX/blob/master/doc/A64FX_Microarchitecture_Manual_en_1.7.pdf

Reviewed By: dmgreen, kawashima-fj

Differential Revision: https://reviews.llvm.org/D131165
2022-08-09 10:53:40 +09:00
David Green 408378a0b3 [AArch64] Tone down the number of repeated fmov N2 scheduling tests. NFC 2022-08-05 08:11:57 +01:00
Cullen Rhodes 767b26a4e2 [MCA] Support multiple comma-separated -mattr features
Reviewed By: myhsu

Differential Revision: https://reviews.llvm.org/D129479
2022-07-12 08:20:11 +00:00
Cullen Rhodes d1c51d45f0 [AArch64] Use Neoverse N2 sched model as default for:
- Cortex-A710
  - Cortex-X2
  - Neoverse-V1
  - Neoverse-512tvb

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D129203
2022-07-08 13:34:13 +00:00
Cullen Rhodes 03af9ba680 [AArch64] Initial sched model for Neoverse N2
The optimization guide can be found here:
https://developer.arm.com/documentation/PJDOC-466751330-18256/latest/

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D128631
2022-07-08 09:39:13 +00:00
Jay Foad d393538c7f [AMDGPU] Add a GFX11 MCA test
This mostly just tests that DPFP is 1/32 rate on GFX11, instead of 1/16
rate as on GFX10.
2022-06-14 13:47:29 +01:00
Simon Pilgrim d384a4c530 [X86] Adjust vector test costs to match SoG (Issue #54889)
znver1/2 models were incorrectly modelling the latency/throughput/uops and znver1 ymm variants also require double pumping.

Now matches what I can decipher from the AMD SoG, Agner and instlatx64 numbers vs the llvm-exegesis report provided by @fabian-r
2022-05-31 09:14:06 +01:00
Simon Pilgrim 14cc4674bf [X86] Adjust vector fp test costs to match int test costs
znver1/2 models were missing the vtestps/pd overrides to match the vptest integer equivalents.

Noticed while investigating Issue #54889
2022-05-30 09:50:15 +01:00
Simon Pilgrim 1956f28037 [X86] Adjust vector extend to ymm to match SoG (Issue #54889)
znver1 ymm variants of VPMOVSX**/VPMOVZX** instructions require double pumping.

Now matches AMD SoG, Agner and instlatx64 numbers.

Thanks to @fabian-r for the report
2022-05-30 08:58:56 +01:00
Simon Pilgrim c99690462e [X86] Adjust vector shift costs to match SoG (Issue #54889)
znver1/2 models were incorrectly modelling the fpupipe (should be pipe2 for shift-by-scalar-amount and pipe1 for shift-by-element-amount) and znver1 ymm variants also require double pumping.

Now matches AMD SoG, Agner and instlatx64 numbers.

Thanks to @fabian-r for the report
2022-05-29 17:55:39 +01:00
Simon Pilgrim 896557e129 [X86] Adjust fadd costs to match SoG
znver1/2 models were incorrectly modelling these on fpupipe 0 instead of 2/3 and znver1 ymm variants also require double pumping.

Now matches AMD SoG, Agner and instlatx64 numbers.

Thanks to @fabian-r for the report
2022-05-15 21:28:29 +01:00
Simon Pilgrim 6824cf1ab7 [X86] Set some more plausible latencies for horizontal add/subs on znver1
These are all microcoded/multi-pipe nightmares on Ryzen, but we shouldn't just be using the WriteMicrocoded class which is for REALLY bad microcoded nightmares - instead use the same approximate latencies as znver2 (Agner and uops.info both suggest similar values) - and make sure we use the FPU defs for both

Fixes #53242
2022-05-08 15:48:42 +01:00
Simon Pilgrim c7662dc3e5 [X86] MOVDDUP has the same sched behaviour as MOVSHDUP/MOVSLDUP on Skylake
Fixes an old TODO - confirmed on Agner + uops.info
2022-05-02 12:50:37 +01:00
Simon Pilgrim a305d8f44e [X86] Adjust fsetcc/fmin/fmax costs to match SoG (Issue #54889)
znver1/2 models were incorrectly modelling these as 3 cycle latency instructions on the wrong pipe and znver1 ymm variants also require double pumping.

Now matches AMD SoG, Agner and instlatx64 numbers.

Thanks to @fabian-r for the report
2022-04-14 13:27:33 +01:00
Simon Pilgrim 058a33d3c9 [X86] Account for high uop/resource usage in BSF/BSR instructions
znver1/2 models were incorrectly modelling these as single uop instructions, instead of the microcoded nightmares they really are.

Now matches AMD SoG, Agner and instlatx64 numbers.

Fixes #54811
2022-04-11 11:20:09 +01:00
Simon Pilgrim 5626bd4289 [X86] Fix SLM scheduler model for PMULLD (PR37059)
Adjust the PMULLD entry to match the Intel AoM numbers - PMULLD is a uop nightmare on SLM and we should model it as such.

We had reports of internal regressions the last time this was attempted (rG13a0f83a05ff), but no public repros, and tests I did last year when I had access to a SLM box failed to see anything. My hunch is that the more aggressive PMULLD -> PMADDWD folds we now perform might have helped. We can revisit this again if we ever receive an actual repro.

Fixes #36407
2022-04-08 10:07:06 +01:00
Simon Pilgrim 0c9c92ffc0 [X86][XOP] Tidyup VPHADD/VPHSUB unary horizontal ops default schedule class
Based off Agner and AMD SoG tables, the XOP VPHADD/VPHSUB unary horizontal ops are as fast as basic arithmetic ops, not the slower SSSE3 binary horizontal add/sub ops. This also matches what the bdver2 model already lists.

Noticed while investigating reduction add optimizations.
2022-03-03 12:07:48 +00:00
David Green 61b616755a Partially revert "[SchedModels][CortexA55] Add ASIMD integer instructions"
The Cortex-A55 scheduling model is used for -mcpu=generic, meaning it
can have a wider effect than just the A55. The changes to the A55
scheduling model seems to have caused performance regressions on
Cortex-A510 device which have latencies closer to the original and
different forwarding paths.

This partially reverts the changes from D117003, at least until we can
do something to improve Cortex-A510. According to my results, this
improves the A510 results without altering the A55 very much.
2022-02-28 10:58:52 +00:00
Pavel Kosov 37fa99eda0 [SchedModels][CortexA55] Add ASIMD integer instructions
Depends on D114642

Original review https://reviews.llvm.org/D112201

OS Laboratory. Huawei Russian Research Institute. Saint-Petersburg

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D117003
2022-02-17 13:41:57 +03:00
Craig Topper 56d6ccd4cb [X86] Update register RCL/RCR by 1 and immediate scheduling for Intel CPUs
Most Intel CPU scheduler files lumped the immediate and 1 instructions
together, but uops.info shows they are quite different.

For the most part the by 1 instructions were pretty accurate to the uops.info
data except the latency was 3 instead of 2 as uops.info indicates.

The by immediate instructions need 7 or 8 uops and have higher latency.

It looks like the 8-bit by immediate instructions may need even more
uops, but I just lumped them with the 16/32/64.

Noticed while checking out PR53648. So mostly I cared about the by 1
instructions.

Reviewed By: RKSimon, pengfei

Differential Revision: https://reviews.llvm.org/D119217
2022-02-08 09:20:20 -08:00
Simon Pilgrim 6eb8fc9244 [X86] Add some missing dependency-breaking zero idiom patterns to scheduler models
Many of the x86 scheduler models are not accounting for their microarch's ability to handle dependency-breaking zero idioms (pxor xmm0,xmm0 etc.), which is causing some notable differences when comparing llvm-mca reports to iaca, uops.info etc.

These are based on the Intel AoMs and Agner's docs which list the instructions handled on each cpu model - there may be more, although tbh the xor/pxor/xorps/xorpd are by far the most commonly encountered.

Once this is in place we also need to review missing support for 'allones' idioms and reg-reg move elimination, but this needs fixing first.

@lebedev.ri The Barcelona test changes are due to the cpu still being tagged as using the SandyBridge model, if/when you get back to D63628 these will need to be addressed.

Based on an original patch by @andreadb (Andrea Di Biagio)

Differential Revision: https://reviews.llvm.org/D117497
2022-01-19 11:29:33 +00:00
Simon Pilgrim 8ea579203d [MCA][X86] Add missing zero-idioms test file coverage
atom/slm have no/limited zero-idioms handling but we should test all the common instructions anyhow

znver1/znver2 were just missing - I've copied the Haswell tests for consistent test coverage
2022-01-17 16:04:39 +00:00
Patrick Holland 85e6e748d4 [MCA] Switching from conservatively guessing which instructions are
memory-barrier instructions to providing targets and developers a convenient
way to explicitly declare which instructions are memory-barriers.

Differential Revision: https://reviews.llvm.org/D116779
2022-01-11 13:50:14 -08:00
Pavel Kosov 34a91d7748 [SchedModels][CortexA55] Fix scheduling of FP loads
Patch fixes scheduling of FP load instructions with pre/post increment adding WriteAdr for address operand.

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D116361

OS Laboratory. Huawei Russian Research Institute. Saint-Petersburg
2022-01-10 10:14:45 +03:00
Simon Pilgrim a0a0eb192e [X86] Use WriteVecMove scheduler classes for VPMOVM2* instructions
These match the port behaviour of reg-reg predicated xmm/ymm/zmm moves

Fixes #34958
2021-12-27 13:21:29 +00:00
Simon Pilgrim 29475e0286 [X86] Add scheduler classes for zmm vector reg-reg move instructions
Basic zmm reg-reg moves (with predication) are more port limited than xmm/ymm moves, so we need to add a separate class for them.

We still appear to be missing move-elimination patterns for most of the intel models, which looks to be one of the main diffs for basic codegen analysis between llvm-mca and uops.info

Load/stores are a bit messier and might be better handled as overrides.
2021-12-27 12:13:42 +00:00
Simon Pilgrim 948ae472a6 [MCA][X86] Add AVX512 vector move instruction test coverage 2021-12-27 11:43:39 +00:00
Simon Pilgrim 67cce1ceee [X86] Adjust some IceLake fp shuffle schedule classes (PR48110)
The IceLake scheduler model is still mainly a copy of the SkylakeServer model.

This patch adjusts the fp shuffle classes to account for most instructions now working on Port 1 as well as Port 5.

This is based off Agner + uops.info as well as the PR48110 report.

Differential Revision: https://reviews.llvm.org/D115752
2021-12-19 13:00:11 +00:00
Simon Pilgrim 74d1fc742a [X86] Adjust some IceLake integer shuffle schedule classes (PR48110)
The IceLake scheduler model is still mainly a copy of the SkylakeServer model.

This patch adjusts the integer shuffle classes to account for most instructions now working on Port 1 as well as Port 5.

This is based off Agner + uops.info as well as the PR48110 report.

Differential Revision: https://reviews.llvm.org/D115547
2021-12-14 18:56:13 +00:00
Simon Pilgrim d9655eec05 [MCA][X86] Add AVX512 subvector broadcast instruction test coverage 2021-12-13 18:48:25 +00:00
Simon Pilgrim fe1b5b56c6 [MCA][X86] Add AVX512 movddup/movshdup/movsldup instruction test coverage
As noted on D115547
2021-12-13 18:04:56 +00:00
Simon Pilgrim b04c646711 [MCA][X86] Add AVX512 broadcast instruction test coverage
As noted on D115547
2021-12-13 17:45:16 +00:00
Simon Pilgrim 9ad5969b5e [X86][Atom] Fix CVT uops + port usage
Fix overrides to use both ports. Update the uops counts + port usage based off the most recent llvm-exegesis captures (PR36895) and what Intel AoM / Agner reports as well.
2021-12-12 22:57:53 +00:00
Simon Pilgrim 4c1d248397 [MCA][X86] Fix duplicated cvtsi2ss/cvtsi2sd i32 + i64 folded tests
Specify the integer width to ensure we're testing the correct instruction
2021-12-12 22:48:45 +00:00
Simon Pilgrim c02f9791c6 [X86][AVX512] Remove xmm->xmm vpmovsx/vpmovzx rm overrides
The XMM evex cases have the same behaviour as the SSE41 versions, which already uses WriteShuffleX.Folded
2021-12-12 16:08:10 +00:00
Simon Pilgrim fc02ceb12a [X86][AVX512] Use WriteShuffleX for xmm->xmm extensions
The XMM evex cases have the same behaviour as the SSE41 versions, which already uses WriteShuffleX
2021-12-12 15:22:32 +00:00
Simon Pilgrim 7f09aee0f6 [MCA][X86] Add missing VPMOVSX/VPMOVZX from AVX512 tests 2021-12-10 18:12:57 +00:00
Simon Pilgrim 6fae235885 [MCA][X86] Add missing ALIGND/ALIGNQ from AVX512F/AVX512VL tests 2021-12-10 15:59:52 +00:00
Simon Pilgrim b025b062d6 [MCA][X86] Add missing PALIGNR from AVX512BW/AVX512BWVL tests 2021-12-10 15:59:52 +00:00
Simon Pilgrim ebcc92ccda [MCA][X86] Add missing PSLLDQ/PSRLDQ from AVX512BW/AVX512BWVL tests 2021-12-10 15:59:51 +00:00
Simon Pilgrim 550bf36732 [MCA][X86] Add missing PACKSS/PACKUS from AVX512BW/AVX512BWVL tests 2021-12-10 15:59:51 +00:00
Simon Pilgrim 80ce01c6fd [MCA][X86] Add missing PSHUFLW from AVX512BWVL tests 2021-12-10 14:02:37 +00:00
Andrew Savonichev 420300c0d8 [MCA] Remove the warning about experimental support for in-order CPU
There are not a lot of bug reports for this feature, so let's mark it
stable.

Differential Revision: https://reviews.llvm.org/D114701
2021-12-07 15:27:51 +03:00
Roman Lebedev 2f364f6f0d
[NFC][X86][MCA] Add forgotten test coverage for AVX512's VPMOVM2[BWDQ] / VPMOV[BWDQ]2M 2021-11-20 13:09:18 +03:00
Simon Pilgrim 0bb32b1b21 [X86][SLM] Fix BitTest+Set uops + port usage
Both ports are required for BitTest ops. Update the uops counts + port usage based off the most recent llvm-exegesis captures and what Intel AoM / Agner reports as well.
2021-10-17 18:13:15 +01:00
Simon Pilgrim 5ed5df4802 [X86][SLM] Fix uops for PCMPISTR/PCMPISTR instructions
Based off a recent llvm-exegesis capture and what Intel AoM / Agner reports as well.
2021-10-17 18:13:14 +01:00
Simon Pilgrim 680afaaa5d [X86][SLM] Fix uops for PCLMULQDQ
Based off a recent llvm-exegesis capture and what Intel AoM / Agner reports as well.
2021-10-17 18:13:14 +01:00
Simon Pilgrim 498c7236bc [X86][SLM] +1uop for PSHUFBrm xmm
Extra 1uop for folded pshufb ops, based off a recent llvm-exegesis capture and what Intel AoM / Agner reports as well.
2021-10-17 18:13:14 +01:00
Simon Pilgrim 7cae0daee6 [X86][Atom] Fix BSR/BSF uops + port usage
Both ports are required for BitScan ops. Update the uops counts + port usage based off the most recent llvm-exegesis captures (PR36895) and what Intel AoM / Agner reports as well.
2021-10-02 19:09:44 +01:00
Simon Pilgrim 8e7f6039fa [X86] Atom SSE shift-by-variable take 2uops/3uops not 1uop
Based off the most recent llvm-exegesis captures (PR36895) and what Intel AoM / Agner / InstLatX64 reports as well.
2021-10-02 12:28:41 +01:00