llvm-project

Commit Graph

Author	SHA1	Message	Date
Simon Pilgrim	0c9c92ffc0	[X86][XOP] Tidyup VPHADD/VPHSUB unary horizontal ops default schedule class Based off Agner and AMD SoG tables, the XOP VPHADD/VPHSUB unary horizontal ops are as fast as basic arithmetic ops, not the slower SSSE3 binary horizontal add/sub ops. This also matches what the bdver2 model already lists. Noticed while investigating reduction add optimizations.	2022-03-03 12:07:48 +00:00
David Green	61b616755a	Partially revert "[SchedModels][CortexA55] Add ASIMD integer instructions" The Cortex-A55 scheduling model is used for -mcpu=generic, meaning it can have a wider effect than just the A55. The changes to the A55 scheduling model seems to have caused performance regressions on Cortex-A510 device which have latencies closer to the original and different forwarding paths. This partially reverts the changes from D117003, at least until we can do something to improve Cortex-A510. According to my results, this improves the A510 results without altering the A55 very much.	2022-02-28 10:58:52 +00:00
Pavel Kosov	37fa99eda0	[SchedModels][CortexA55] Add ASIMD integer instructions Depends on D114642 Original review https://reviews.llvm.org/D112201 OS Laboratory. Huawei Russian Research Institute. Saint-Petersburg Reviewed By: dmgreen Differential Revision: https://reviews.llvm.org/D117003	2022-02-17 13:41:57 +03:00
Craig Topper	56d6ccd4cb	[X86] Update register RCL/RCR by 1 and immediate scheduling for Intel CPUs Most Intel CPU scheduler files lumped the immediate and 1 instructions together, but uops.info shows they are quite different. For the most part the by 1 instructions were pretty accurate to the uops.info data except the latency was 3 instead of 2 as uops.info indicates. The by immediate instructions need 7 or 8 uops and have higher latency. It looks like the 8-bit by immediate instructions may need even more uops, but I just lumped them with the 16/32/64. Noticed while checking out PR53648. So mostly I cared about the by 1 instructions. Reviewed By: RKSimon, pengfei Differential Revision: https://reviews.llvm.org/D119217	2022-02-08 09:20:20 -08:00
Simon Pilgrim	6eb8fc9244	[X86] Add some missing dependency-breaking zero idiom patterns to scheduler models Many of the x86 scheduler models are not accounting for their microarch's ability to handle dependency-breaking zero idioms (pxor xmm0,xmm0 etc.), which is causing some notable differences when comparing llvm-mca reports to iaca, uops.info etc. These are based on the Intel AoMs and Agner's docs which list the instructions handled on each cpu model - there may be more, although tbh the xor/pxor/xorps/xorpd are by far the most commonly encountered. Once this is in place we also need to review missing support for 'allones' idioms and reg-reg move elimination, but this needs fixing first. @lebedev.ri The Barcelona test changes are due to the cpu still being tagged as using the SandyBridge model, if/when you get back to D63628 these will need to be addressed. Based on an original patch by @andreadb (Andrea Di Biagio) Differential Revision: https://reviews.llvm.org/D117497	2022-01-19 11:29:33 +00:00
Simon Pilgrim	8ea579203d	[MCA][X86] Add missing zero-idioms test file coverage atom/slm have no/limited zero-idioms handling but we should test all the common instructions anyhow znver1/znver2 were just missing - I've copied the Haswell tests for consistent test coverage	2022-01-17 16:04:39 +00:00
Patrick Holland	85e6e748d4	[MCA] Switching from conservatively guessing which instructions are memory-barrier instructions to providing targets and developers a convenient way to explicitly declare which instructions are memory-barriers. Differential Revision: https://reviews.llvm.org/D116779	2022-01-11 13:50:14 -08:00
Pavel Kosov	34a91d7748	[SchedModels][CortexA55] Fix scheduling of FP loads Patch fixes scheduling of FP load instructions with pre/post increment adding WriteAdr for address operand. Reviewed By: dmgreen Differential Revision: https://reviews.llvm.org/D116361 OS Laboratory. Huawei Russian Research Institute. Saint-Petersburg	2022-01-10 10:14:45 +03:00
Simon Pilgrim	a0a0eb192e	[X86] Use WriteVecMove scheduler classes for VPMOVM2* instructions These match the port behaviour of reg-reg predicated xmm/ymm/zmm moves Fixes #34958	2021-12-27 13:21:29 +00:00
Simon Pilgrim	29475e0286	[X86] Add scheduler classes for zmm vector reg-reg move instructions Basic zmm reg-reg moves (with predication) are more port limited than xmm/ymm moves, so we need to add a separate class for them. We still appear to be missing move-elimination patterns for most of the intel models, which looks to be one of the main diffs for basic codegen analysis between llvm-mca and uops.info Load/stores are a bit messier and might be better handled as overrides.	2021-12-27 12:13:42 +00:00
Simon Pilgrim	948ae472a6	[MCA][X86] Add AVX512 vector move instruction test coverage	2021-12-27 11:43:39 +00:00
Simon Pilgrim	67cce1ceee	[X86] Adjust some IceLake fp shuffle schedule classes (PR48110) The IceLake scheduler model is still mainly a copy of the SkylakeServer model. This patch adjusts the fp shuffle classes to account for most instructions now working on Port 1 as well as Port 5. This is based off Agner + uops.info as well as the PR48110 report. Differential Revision: https://reviews.llvm.org/D115752	2021-12-19 13:00:11 +00:00
Simon Pilgrim	74d1fc742a	[X86] Adjust some IceLake integer shuffle schedule classes (PR48110) The IceLake scheduler model is still mainly a copy of the SkylakeServer model. This patch adjusts the integer shuffle classes to account for most instructions now working on Port 1 as well as Port 5. This is based off Agner + uops.info as well as the PR48110 report. Differential Revision: https://reviews.llvm.org/D115547	2021-12-14 18:56:13 +00:00
Simon Pilgrim	d9655eec05	[MCA][X86] Add AVX512 subvector broadcast instruction test coverage	2021-12-13 18:48:25 +00:00
Simon Pilgrim	fe1b5b56c6	[MCA][X86] Add AVX512 movddup/movshdup/movsldup instruction test coverage As noted on D115547	2021-12-13 18:04:56 +00:00
Simon Pilgrim	b04c646711	[MCA][X86] Add AVX512 broadcast instruction test coverage As noted on D115547	2021-12-13 17:45:16 +00:00
Simon Pilgrim	9ad5969b5e	[X86][Atom] Fix CVT uops + port usage Fix overrides to use both ports. Update the uops counts + port usage based off the most recent llvm-exegesis captures (PR36895) and what Intel AoM / Agner reports as well.	2021-12-12 22:57:53 +00:00
Simon Pilgrim	4c1d248397	[MCA][X86] Fix duplicated cvtsi2ss/cvtsi2sd i32 + i64 folded tests Specify the integer width to ensure we're testing the correct instruction	2021-12-12 22:48:45 +00:00
Simon Pilgrim	c02f9791c6	[X86][AVX512] Remove xmm->xmm vpmovsx/vpmovzx rm overrides The XMM evex cases have the same behaviour as the SSE41 versions, which already uses WriteShuffleX.Folded	2021-12-12 16:08:10 +00:00
Simon Pilgrim	fc02ceb12a	[X86][AVX512] Use WriteShuffleX for xmm->xmm extensions The XMM evex cases have the same behaviour as the SSE41 versions, which already uses WriteShuffleX	2021-12-12 15:22:32 +00:00
Simon Pilgrim	7f09aee0f6	[MCA][X86] Add missing VPMOVSX/VPMOVZX from AVX512 tests	2021-12-10 18:12:57 +00:00
Simon Pilgrim	6fae235885	[MCA][X86] Add missing ALIGND/ALIGNQ from AVX512F/AVX512VL tests	2021-12-10 15:59:52 +00:00
Simon Pilgrim	b025b062d6	[MCA][X86] Add missing PALIGNR from AVX512BW/AVX512BWVL tests	2021-12-10 15:59:52 +00:00
Simon Pilgrim	ebcc92ccda	[MCA][X86] Add missing PSLLDQ/PSRLDQ from AVX512BW/AVX512BWVL tests	2021-12-10 15:59:51 +00:00
Simon Pilgrim	550bf36732	[MCA][X86] Add missing PACKSS/PACKUS from AVX512BW/AVX512BWVL tests	2021-12-10 15:59:51 +00:00
Simon Pilgrim	80ce01c6fd	[MCA][X86] Add missing PSHUFLW from AVX512BWVL tests	2021-12-10 14:02:37 +00:00
Andrew Savonichev	420300c0d8	[MCA] Remove the warning about experimental support for in-order CPU There are not a lot of bug reports for this feature, so let's mark it stable. Differential Revision: https://reviews.llvm.org/D114701	2021-12-07 15:27:51 +03:00
Roman Lebedev	2f364f6f0d	[NFC][X86][MCA] Add forgotten test coverage for AVX512's VPMOVM2[BWDQ] / VPMOV[BWDQ]2M	2021-11-20 13:09:18 +03:00
Simon Pilgrim	0bb32b1b21	[X86][SLM] Fix BitTest+Set uops + port usage Both ports are required for BitTest ops. Update the uops counts + port usage based off the most recent llvm-exegesis captures and what Intel AoM / Agner reports as well.	2021-10-17 18:13:15 +01:00
Simon Pilgrim	5ed5df4802	[X86][SLM] Fix uops for PCMPISTR/PCMPISTR instructions Based off a recent llvm-exegesis capture and what Intel AoM / Agner reports as well.	2021-10-17 18:13:14 +01:00
Simon Pilgrim	680afaaa5d	[X86][SLM] Fix uops for PCLMULQDQ Based off a recent llvm-exegesis capture and what Intel AoM / Agner reports as well.	2021-10-17 18:13:14 +01:00
Simon Pilgrim	498c7236bc	[X86][SLM] +1uop for PSHUFBrm xmm Extra 1uop for folded pshufb ops, based off a recent llvm-exegesis capture and what Intel AoM / Agner reports as well.	2021-10-17 18:13:14 +01:00
Simon Pilgrim	7cae0daee6	[X86][Atom] Fix BSR/BSF uops + port usage Both ports are required for BitScan ops. Update the uops counts + port usage based off the most recent llvm-exegesis captures (PR36895) and what Intel AoM / Agner reports as well.	2021-10-02 19:09:44 +01:00
Simon Pilgrim	8e7f6039fa	[X86] Atom SSE shift-by-variable take 2uops/3uops not 1uop Based off the most recent llvm-exegesis captures (PR36895) and what Intel AoM / Agner / InstLatX64 reports as well.	2021-10-02 12:28:41 +01:00
David Green	e9adcbde31	[AArch64] Model Cortex-A55 Q register NEON instructions Cortex-A55 has 2 64bit NEON vector units, meaning a 128bit instruction requires taking both units (and can only be issued as the first instruction in a dual issue pair). This patch models that by splitting the WriteV SchedWrite into two - the WriteVd that reads/writes only 64bit operands, and the WriteVq that read/writes 128bit registers. The A55 schedule then uses this distinction to model the WriteVq as taking both resource units, and starting a Schedule Group and WriteVd as taking one as before. I believe this is more correct, even if it does not lead to much better performance. Differential Revision: https://reviews.llvm.org/D108766	2021-09-29 16:55:31 +01:00
Simon Pilgrim	dade83c02a	[X86][SLM] Fix ADDQ/SUBQ/CMPEQQ throughput to account for running on either port. Testing on a SLM box suggests these can run on either port, but the throughput is 4cy on either (inc MMX versions). Confirmed with Intel AoM / Agner / InstLatX64.	2021-09-24 10:06:14 +01:00
Simon Pilgrim	f855ef2601	[X86][Atom] Fix FP uops + port usage Both ports are required in most cases. Update the uops counts + port usage based off the most recent llvm-exegesis captures (PR36895) and what Intel AoM / Agner / InstLatX64 reports as well. Noticed while trying to improve fp costs for vectorization via the D103695 helper script.	2021-09-19 20:39:20 +01:00
Simon Pilgrim	cf8fac7d07	[X86][Atom] Specific uops for all IMUL/IDIV instructions Based off a mixture of llvm-exegesis captures (PR36895) and Intel AoM / Agner / InstLatX64 reports.	2021-09-19 16:58:52 +01:00
Simon Pilgrim	e381d8b243	[X86][Atom] Fix (U)COMISS/SD uops, latency and throughput Both ports are required, for reg and mem variants - we can also use the WriteFComX class directly and remove the unnecessary InstRW overrides. Matches what Intel AoM / Agner / InstLatX64 report as well.	2021-09-19 12:44:44 +01:00
Simon Pilgrim	5ebe95e256	[X86][Atom] Fix integer shuffles uops, latency and throughput The MMX pack/unpck shuffles don't need an override - they have the same behaviour as other shuffles (Port0 only). The SSE pslldq/psrldq shuffles don't need an override - they have the same behaviour as other shuffles (Port0 only). The SSE pshufb shuffles use 4uops (+1 load). Noticed the pslldq/psrldq issue while trying to improve reduction costs via the D103695 helper script, and fixed the others while reviewing. Confirmed with Intel AoM / Agner / InstLatX64.	2021-09-17 12:11:54 +01:00
Simon Pilgrim	65ad09da0e	[X86][SLM] Fix DIVPD/DIVPS/RCPPS/RSQRTPS/SQRTPD/SQRTPS/DPPD/DPPS uops, latency and throughput The packed variants of the instructions had been modelled as the same as the scalar variants. Reported during a run of llvm-exegesis on a cheap SLM box and matches what Agner / InstLatX64 report as well.	2021-09-13 08:36:43 +01:00
Simon Pilgrim	df975e4590	[X86][SLM] Fix PSAD/MPSAD uops, latency and throughput Noticed while trying to improve generic reduction costs via the D103695 helper script. Confirmed with Intel AoM / Agner / InstLatX64.	2021-09-11 11:44:09 +01:00
Simon Pilgrim	484944ac3b	[X86][SLM] Fix HADD/HSUB uops, latency and throughput Noticed while trying to improve generic reduction costs via the D103695 helper script. Confirmed with Intel AoM / Agner / InstLatX64.	2021-09-11 11:44:09 +01:00
Simon Pilgrim	2005ae15a6	[X86][SLM] WriteVecIMul instructions only take 1uop (REAPPLIED) The xmm variant have half the throughput (and +1cy latency) of the mmx variants, but are still 1uop. I still need to do more thorough testing of SLM on test-suite before fixing the obvious bad numbers for WritePMULLD. But this helps the D103695 helper script get to more accurate numbers for vXi32 multiplies of extended operands (i.e. we can use PMADDWD, PMULLW/PMULHW etc). Matches what Intel AoM / Agner / llvm-exegesis reports.	2021-09-04 15:03:56 +01:00
Simon Pilgrim	ac51d69208	Revert rG994da657076900f5ad7fe593c3b5e5f89ab3d53d "[X86][SLM] WriteVecIMul instructions only take 1uop" This changed some codegen tests that I forgot about in my rebase, I'll recommit shortly with a fix.	2021-09-04 13:39:10 +01:00
Simon Pilgrim	994da65707	[X86][SLM] WriteVecIMul instructions only take 1uop The xmm variant have half the throughput (and +1cy latency) of the mmx variants, but are still 1uop. I still need to do more thorough testing of SLM on test-suite before fixing the obvious bad numbers for WritePMULLD. But this helps the D103695 helper script get to more accurate numbers for vXi32 multiplies of extended operands (i.e. we can use PMADDWD, PMULLW/PMULHW etc). Matches what Intel AoM / Agner / llvm-exegesis reports.	2021-09-04 13:21:34 +01:00
Simon Pilgrim	c6371020a8	[X86][SLM] RMW instructions don't require an extra uop For RMW instructions, the load and store hold the MEC for an extra cycle, but within the same single uop. This is alluded to in the Intel AOM: "The MEC also owns the MEC RSV, which is responsible for scheduling of all loads and stores. Load and store instructions go through addresses generation phase in program order to avoid on-the-fly memory ordering later in the pipeline. Therefore, an unknown address will stall younger memory instructions." Noticed while trying to get a cheap SLM test box up and running with llvm-exegesis - RMW arithmetic is always 1uop - and matches what Agner / InstLatX64 report as well.	2021-09-04 13:21:34 +01:00
Simon Pilgrim	da965a77d5	[X86][SLM] Fix MUL uops, latency and throughput These were all set to the same best case mul i32 values (which seems to be the only version of MUL that SLM actually performs well with). Noticed while trying to improve multiplication costs for vectorization via the D103695 helper script. Confirmed with Intel AoM / Agner / InstLatX64.	2021-09-04 13:21:34 +01:00
Simon Pilgrim	7d062d2c47	[X86][Atom] MUL/DIV instructions require both ports, not either. Noticed while trying to improve multiplication costs for vectorization via the D103695 helper script. Confirmed with Intel AoM.	2021-09-04 11:58:09 +01:00
Simon Pilgrim	6ba0b9f68a	[X86][SLM] Fix PBLENDVB uops and throughput SLM PBLENDVB is just as bad as BLENDVPD/PS - so model it as such, fixing the rr vs rm uops diff as well. The Intel AoM appears to have a copy+paste typo with PBLENDW, it doesn't match Agner or InstLatX64. Noticed while investigating some of the weird discrepancies reported by the D103695 helper script (SLM had much better vector shift throughputs than it should).	2021-09-03 11:31:29 +01:00

1 2 3 4 5 ...

661 Commits