znver1/2 models were incorrectly modelling the latency/throughput/uops and znver1 ymm variants also require double pumping.
Now matches what I can decipher from the AMD SoG, Agner and instlatx64 numbers vs the llvm-exegesis report provided by @fabian-r
znver1 ymm variants of VPMOVSX**/VPMOVZX** instructions require double pumping.
Now matches AMD SoG, Agner and instlatx64 numbers.
Thanks to @fabian-r for the report
znver1/2 models were incorrectly modelling the fpupipe (should be pipe2 for shift-by-scalar-amount and pipe1 for shift-by-element-amount) and znver1 ymm variants also require double pumping.
Now matches AMD SoG, Agner and instlatx64 numbers.
Thanks to @fabian-r for the report
znver1/2 models were incorrectly modelling these on fpupipe 0 instead of 2/3 and znver1 ymm variants also require double pumping.
Now matches AMD SoG, Agner and instlatx64 numbers.
Thanks to @fabian-r for the report
These are all microcoded/multi-pipe nightmares on Ryzen, but we shouldn't just be using the WriteMicrocoded class which is for REALLY bad microcoded nightmares - instead use the same approximate latencies as znver2 (Agner and uops.info both suggest similar values) - and make sure we use the FPU defs for both
Fixes#53242
znver1/2 models were incorrectly modelling these as 3 cycle latency instructions on the wrong pipe and znver1 ymm variants also require double pumping.
Now matches AMD SoG, Agner and instlatx64 numbers.
Thanks to @fabian-r for the report
znver1/2 models were incorrectly modelling these as single uop instructions, instead of the microcoded nightmares they really are.
Now matches AMD SoG, Agner and instlatx64 numbers.
Fixes#54811
Adjust the PMULLD entry to match the Intel AoM numbers - PMULLD is a uop nightmare on SLM and we should model it as such.
We had reports of internal regressions the last time this was attempted (rG13a0f83a05ff), but no public repros, and tests I did last year when I had access to a SLM box failed to see anything. My hunch is that the more aggressive PMULLD -> PMADDWD folds we now perform might have helped. We can revisit this again if we ever receive an actual repro.
Fixes#36407
Based off Agner and AMD SoG tables, the XOP VPHADD/VPHSUB unary horizontal ops are as fast as basic arithmetic ops, not the slower SSSE3 binary horizontal add/sub ops. This also matches what the bdver2 model already lists.
Noticed while investigating reduction add optimizations.
The Cortex-A55 scheduling model is used for -mcpu=generic, meaning it
can have a wider effect than just the A55. The changes to the A55
scheduling model seems to have caused performance regressions on
Cortex-A510 device which have latencies closer to the original and
different forwarding paths.
This partially reverts the changes from D117003, at least until we can
do something to improve Cortex-A510. According to my results, this
improves the A510 results without altering the A55 very much.
Most Intel CPU scheduler files lumped the immediate and 1 instructions
together, but uops.info shows they are quite different.
For the most part the by 1 instructions were pretty accurate to the uops.info
data except the latency was 3 instead of 2 as uops.info indicates.
The by immediate instructions need 7 or 8 uops and have higher latency.
It looks like the 8-bit by immediate instructions may need even more
uops, but I just lumped them with the 16/32/64.
Noticed while checking out PR53648. So mostly I cared about the by 1
instructions.
Reviewed By: RKSimon, pengfei
Differential Revision: https://reviews.llvm.org/D119217
Many of the x86 scheduler models are not accounting for their microarch's ability to handle dependency-breaking zero idioms (pxor xmm0,xmm0 etc.), which is causing some notable differences when comparing llvm-mca reports to iaca, uops.info etc.
These are based on the Intel AoMs and Agner's docs which list the instructions handled on each cpu model - there may be more, although tbh the xor/pxor/xorps/xorpd are by far the most commonly encountered.
Once this is in place we also need to review missing support for 'allones' idioms and reg-reg move elimination, but this needs fixing first.
@lebedev.ri The Barcelona test changes are due to the cpu still being tagged as using the SandyBridge model, if/when you get back to D63628 these will need to be addressed.
Based on an original patch by @andreadb (Andrea Di Biagio)
Differential Revision: https://reviews.llvm.org/D117497
atom/slm have no/limited zero-idioms handling but we should test all the common instructions anyhow
znver1/znver2 were just missing - I've copied the Haswell tests for consistent test coverage
memory-barrier instructions to providing targets and developers a convenient
way to explicitly declare which instructions are memory-barriers.
Differential Revision: https://reviews.llvm.org/D116779
Patch fixes scheduling of FP load instructions with pre/post increment adding WriteAdr for address operand.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D116361
OS Laboratory. Huawei Russian Research Institute. Saint-Petersburg
Basic zmm reg-reg moves (with predication) are more port limited than xmm/ymm moves, so we need to add a separate class for them.
We still appear to be missing move-elimination patterns for most of the intel models, which looks to be one of the main diffs for basic codegen analysis between llvm-mca and uops.info
Load/stores are a bit messier and might be better handled as overrides.
The IceLake scheduler model is still mainly a copy of the SkylakeServer model.
This patch adjusts the fp shuffle classes to account for most instructions now working on Port 1 as well as Port 5.
This is based off Agner + uops.info as well as the PR48110 report.
Differential Revision: https://reviews.llvm.org/D115752
The IceLake scheduler model is still mainly a copy of the SkylakeServer model.
This patch adjusts the integer shuffle classes to account for most instructions now working on Port 1 as well as Port 5.
This is based off Agner + uops.info as well as the PR48110 report.
Differential Revision: https://reviews.llvm.org/D115547
Fix overrides to use both ports. Update the uops counts + port usage based off the most recent llvm-exegesis captures (PR36895) and what Intel AoM / Agner reports as well.
Both ports are required for BitTest ops. Update the uops counts + port usage based off the most recent llvm-exegesis captures and what Intel AoM / Agner reports as well.
Both ports are required for BitScan ops. Update the uops counts + port usage based off the most recent llvm-exegesis captures (PR36895) and what Intel AoM / Agner reports as well.
Cortex-A55 has 2 64bit NEON vector units, meaning a 128bit instruction
requires taking both units (and can only be issued as the first
instruction in a dual issue pair). This patch models that by splitting
the WriteV SchedWrite into two - the WriteVd that reads/writes only
64bit operands, and the WriteVq that read/writes 128bit registers. The
A55 schedule then uses this distinction to model the WriteVq as taking
both resource units, and starting a Schedule Group and WriteVd as taking
one as before.
I believe this is more correct, even if it does not lead to much better
performance.
Differential Revision: https://reviews.llvm.org/D108766
Testing on a SLM box suggests these can run on either port, but the throughput is 4cy on either (inc MMX versions). Confirmed with Intel AoM / Agner / InstLatX64.