The znver2 override already matched the WriteBlendY class exactly, and the znver1 override wasn't accounting for ymm double-pumping.
Found with the help of D138359
znver1/znver2 has barely any difference in behaviour between the AVX1/2 variants of these instructions - it looks like it was a copy+paste mistake to miss the AVX2 integer domain instructions in the overrides.
Having said that the override numbers don't appear to match the numbers in the AMD 17h SoGs very well - for instance vperm2f128/vperm2i128 might be microcoded from the AMD sense of >3 uops, but it doesn't have a 100cy latency..... These will need to be further addressed.
This appears to be a slow down vs Skylake (which the model was copied off) - confirmed with uops.info / instlatx64
Noticed as D138359 was reporting that many of the PACKS overrides were redundant, but were in fact incorrect
Reported by D138359 - they were being overridden as WriteMicrocoded despite already being declared WriteMicrocoded
It also fixes a rather funny instregex mismatch that was matching the movsldup shuffle by mistake
D138359 was reporting that this override was superfluous, but it had never been setup - I took the numbers from uops.info (I couldn't find an estimate in Intel docs).
These were missed for some reason - only noticed this while investigating a FIXME in the SandyBridge model
Also sync the znver2/znver3 tests which had been missed when LOCK test coverage was added
D138359 was reporting that the EXTRACTPSrr override was unnecessary, however the AMD SoG and Agner both confirm that both the rr and rm versions take 2uops (matching znver1)
NOTE: For IceLakeServer we actually test TigerLake as that's the only target that supports it (we do something similar for F16C on IvyBridge in the SandyBridge tests).
AMD 15h SoG + Agner both indicate there's no difference between MPSADBWrri + VMPSADBWrri - I can't find any data on the folded variant so I've kept the existing numbers
Removes the last X86 override for WriteMPSAD/WritePSADBW classes - removing a further 3 entries from every sched class table
The znver1/znver2 schedules for CVTPD2PS were incorrectly double pumping the xmm-load variant instead of the ymm variants (znver1 only)
Also, the xmm-load variant was incorrectly using FP03 instead of just FP3
Confirmed by the AMD SoG 17h tables, Agner + uops.info
Another step towards removing a lot of unnecessary overrides from all the x86 scheduler models - these should hopefully be convertible into regular WriteCvtPD2I classes soon.
Fixes a lot of throughput mismatches - the more complicated conversion instructions use ICXPort5+ICXPort01, not ICXPort5+ICXPort015 (ICXPort015 is mainly used for basic Logic + blend ops)
Fixing this should allow us to remove a lot of unnecessary scheduler overrides from IceLakeModel
Confirmed by both Agner + uops.info
Fixes a lot of throughput mismatches - the more complicated conversion instructions use SKXPort5+SKXPort01, not SKXPort5+SKXPort015 (SKXPort015 is mainly used for basic Logic + blend ops)
Fixing this should allow us to remove a lot of unnecessary scheduler overrides from SkylakeServerModel
Confirmed by both Agner + uops.info
Fixes a lot of throughput mismatches - the more complicated conversion instructions use SKLPort5+SKLPort01, not SKLPort5+SKLPort015 (SKLPort015 is mainly used for basic Logic + blend ops)
Fixing this should allow us to remove a lot of unnecessary scheduler overrides from SkylakeClientModel
Confirmed by both Agner + uops.info
Broadwell/Haswell were completely overriding the WriteCvtPD2I class defs - we can remove those overrides entirely by just choosing better class defs.
Also fixes the scheduler for a missing YMM folded case - confirmed with Agner + uops.info that the port usage is correct
None of Haswell/Broadwell/Skylake/Icelake treat CVTTSS2SI64rm differently from CVTSS2SI64rm (or the AVX variants)
Confirmed with Agner, uops.info and Intel AoM
There can be a difference for MOVDDUPrr but not the load folded broadcast that is purely on Port23
Fixes an old TODO (inherited from SkylakeServer which was fixed at c7662dc3e5)
Confirmed on Agner + uops.info
Although some very old x86 hardware would perform the extension as a later stage, every target we have a scheduler for always performs this as part of the load-op (avoid ALU pipes etc.). If anyone wants to model very old hardware they can always override this.
This patch just tags these as WriteLoad directly and removes unnecessary overrides - this cleans up some latency/throughput tests as they aren't being badly modelled as folded ALU ops
Znver1/Znver2 were using vector load latency values (which is what WriteFLoad*/WriteVecLoad* are for) instead of the scalar load latency value
TBH I'm not sure clflush/clzero/prefetch ops should be tagged as WriteLoad but at least this makes us more consistent
Atom was missing a load latency value (so was defaulting to 1cy)
Znver1/Znver2 were using vector load latency values (which is what WriteFLoad*/WriteVecLoad* are for) instead of the scalar load latency value
TBH I'm not sure clflush/clzero/prefetch ops should be tagged as WriteLoad but at least this makes us more consistent
Broadwell/Haswell were completely overriding the class defs - we can remove those overrides entirely by just choosing better class defs (plus a fix for missing mmx folded load).
The WriteCvtSD2SS/WriteCvtPD2PS* classes were mostly unused as the models were needlessly overriding all instructions - in some cases the folded pattern overrides were entirely missing (but I've confirmed they just have an additional Port23 use)
There were a couple of typos (confirmed with Agner/uops.info) - Skylake/Icelake uses Port5+Port01 for XMM/YMM, Skylake uses Port5+Port05 for ZMM but Icelake uses Port5+Port0