llvm-project

Commit Graph

Author	SHA1	Message	Date
Simon Pilgrim	faff990e9b	[X86] Fix Icelake VPMULLQ zmm pipes and adjust AVX512DQ v8i64 mul costs to match worst case Icelake PMULLQ throughput regressed cf SkylakeServer as it's Pipe0 only Confirmed with Intel SOM, Agner and instlatx64	2022-09-25 14:18:08 +01:00
Simon Pilgrim	4994f87ca1	[X86] Fix bdver2 128-bit shuffles throughputs Noticed while trying to get vector ctpop/ctlz/cttz costs fixed using the script from D103695 - all of these are full-rate but the throughput costs were weirdly high for bdver2 Matches AMD 15h SoG, Agner and instlatx64	2022-09-10 17:34:40 +01:00
Simon Pilgrim	7785bd34e7	[X86] Fix bdver2 128-bit ALU/logic/shift throughputs Noticed while trying to get vector shifts costs fixed using the script from D103695 - all of these are full-rate but the throughput costs were weirdly high for bdver2 Matches AMD 15h SoG, Agner and instlatx64	2022-09-10 16:23:29 +01:00
Simon Pilgrim	05f56f10ed	[X86] Fix VPPERM load folding latency Noticed while investigating BITREVERSE cost numbers with the D103695 script - VPPERM folded loads was using the WriteVarShuffleX defaults and was missing an override like the VPPERM reg-reg variants	2022-09-09 13:57:39 +01:00
Andrea Di Biagio	3262794804	[MCA] Correctly check pipeline availability for partially overlapping resource groups. This patch mostly reverts commit `70b37f4c03` which fixed PR50725. In case of explicit consumption of multiple partially overlapping group resources, the ResourceManager was not correctly checking pipeline resources availability. The fix for PR50725 only partially addressed a few instances of that issue. This is a more general (although, technically slower) fix for that same issue. It also fixes Issue #57548 Thanks to Haohai Wen for the small reproducible.	2022-09-07 12:17:59 +01:00
Simon Pilgrim	bd0801cddf	[X86] Cleanup SLM SSE shift and CMPGTQ scheduler model numbers These were causing weird mismatches for the D103695 script report as I'm trying to enable cost kinds support for vector shift and integer comparisons. The SSE shifts by (non-constant) scalar are half-rate but still only 1uop and PCMPGT is half-rate and only on Pipe0 (although not as slow as PCMPEQQ which we already handle).	2022-09-05 13:44:05 +01:00
Simon Pilgrim	8534f51474	[CostModel][X86] Add CostKinds handling for sqrt intrinsics This was achieved using the 'cost-tables vs llvm-mca' script from D103695 Some of the znver1/znver2 latency/throughput numbers were really weird (some copy+paste afaict) - I've used the numbers from the AMD SoG, which roughly match the 'worst case' range value from Agner	2022-09-04 18:39:21 +01:00
Simon Pilgrim	c444af1c20	[CostModel][X86] Add CostKinds handling for mul ops This was achieved using the 'cost-tables vs llvm-mca' script D103695 Also fix a missing pmullw v16i16 half-rate throughput as znver1 double-pumps - matches numbers from AMD SoG + Agner	2022-09-04 11:59:05 +01:00
Simon Pilgrim	444685de06	[CostModel][X86] Adjust mul v4i32/v8i32 throughput cost Based off the numbers from AMD SoG + Agner - vXi32 are both half-rate, and znver1 double-pumps the v8i32 op We should have caught this earlier as many Intel models have half-rate pmulld already :-(	2022-09-03 18:45:08 +01:00
Simon Pilgrim	bddbd408b7	[X86] Fix fdiv throughput/latency/uops counts Matches znver1/2 numbers from AMD SoG + Agner - no additional uops for folded instructions and znver1 double pumps 256-bit vectors Matches skylake/icelake throughput numbers from Intel AoM + Agner/instlatx64 Noticed while adding fdiv CostKinds support	2022-09-03 15:23:46 +01:00
Simon Pilgrim	bd956b7db3	[X86] Fix fmul throughput/latency/uops counts Matches numbers from AMD SoG + Agner - should always be on FPU Pipes 0+1, no additional uops for folded instructions and znver1 double pumps 256-bit vectors and is always latency = 4cy for f64 multiplies Noticed while adding fmul CostKinds support to the x86 cost models in rG0735200e3f50 and znver1 wasn't being flagged as requiring 2uop for 256-bit vectors	2022-09-03 11:10:51 +01:00
Simon Pilgrim	f8d4da7630	[X86] Fix reciprocal instruction throughput/uops counts Matches numbers from AMD SoG + Agner - should always be on FPU Pipes 0+1, no additional uops for folded instructions and znver1 double pumps 256-bit vectors Noticed while adding CostKinds support to the x86 cost models	2022-09-01 20:25:52 +01:00
Matt Devereau	30b045aba6	[AArch64][SVE] Extend LD1RQ ISel patterns to cover missing addressing modes Add some missing patterns for ld1rq's scalar + scalar addressing mode. Also, adds the scalar + imm and scalar + scalar addressing modes for the patterns added in https://reviews.llvm.org/D130010 Differential Revision: https://reviews.llvm.org/D130993	2022-08-25 13:07:37 +00:00
zhongyunde	3c8f327ce9	[AArch64] Fix sched model for tsv110 Update three changes: 1. Split the Load/Store resources into two, Ld0St and Ld1, since only one of them is capable of stores. 2. Integer ADD and SUB instructions have different latencies and processor resource usage (pipeline) when they have a shift of zero vs. non-zero, refer to D8043 3. The throughput of scalar DIV instruction. Reviewed By: dmgreen, bryanpkc Differential Revision: https://reviews.llvm.org/D132529	2022-08-25 19:20:07 +08:00
Simon Pilgrim	a7441289e2	[X86] Fix znver1 256-bit ALU/Logic/Blend uop counts ymm instructions are double pumped on znver1 - noticed while trying to review size-latency costkinds numbers for D132216 Matches AMD 17h SOG / Agner / uops.info	2022-08-19 19:09:39 +01:00
zhongyunde	319fd6a69c	[NFC][AArch64] precommit sched model for tsv110 Part of the schedule model is not accurate, so we need an initial test to record the changes. This assembly list refers to the basic part of D128631 Reviewed By: dmgreen Differential Revision: https://reviews.llvm.org/D132103	2022-08-18 17:33:45 +08:00
Haohai Wen	f4410d471f	[X86] Add schedule module for Alderlake-P The X86SchedAlderlakeP.td file is automatically generated by schedtool (D130897). Most of the instructions' scheduling information is based on measured ADL-P data in uops.info. Some data is from GLC tpt/lat data provided by intel doc. The rest of the instructions' scheduling information is from skylake client schedule model in order to get a relatively complete model. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D130959	2022-08-18 16:40:14 +08:00
Yuta Mukai	3f561996bf	[AArch64] Fix and add A64FX scheduling resource/latency info 1. Missing instruction information (FTSSEL, FMSB, PFIRST and RDFFR) is added and CompleteModel is set to one. 2. Information for pseudo SVE instructions is added. Those instructions are present at the time of scheduling. 3. Resource and latency information for SVE instructions is modified to be more accurate. For example, the description for CMPEQ, which consumes one cycle each of unit FLA and PPR, is as follows. ``` Previous: def A64FXGI01 : ProcResGroup<[A64FXIPFLA, A64FXIPPR]>; def A64FXWrite_4Cyc_GI01 : SchedWriteRes<[A64FXGI01]> {... Modified: def A64FXGI0 : ProcResGroup<[A64FXIPFLA]>; def A64FXGI1 : ProcResGroup<[A64FXIPPR]>; def A64FXWrite_CMP : SchedWriteRes<[A64FXGI0, A64FXGI1]> {... ``` Reference: A64FX Microarchitecture Manual (Table 16-3) https://github.com/fujitsu/A64FX/blob/master/doc/A64FX_Microarchitecture_Manual_en_1.7.pdf Reviewed By: dmgreen, kawashima-fj Differential Revision: https://reviews.llvm.org/D131165	2022-08-09 10:53:40 +09:00
David Green	408378a0b3	[AArch64] Tone down the number of repeated fmov N2 scheduling tests. NFC	2022-08-05 08:11:57 +01:00
Cullen Rhodes	767b26a4e2	[MCA] Support multiple comma-separated -mattr features Reviewed By: myhsu Differential Revision: https://reviews.llvm.org/D129479	2022-07-12 08:20:11 +00:00
Cullen Rhodes	d1c51d45f0	[AArch64] Use Neoverse N2 sched model as default for: - Cortex-A710 - Cortex-X2 - Neoverse-V1 - Neoverse-512tvb Reviewed By: dmgreen Differential Revision: https://reviews.llvm.org/D129203	2022-07-08 13:34:13 +00:00
Cullen Rhodes	03af9ba680	[AArch64] Initial sched model for Neoverse N2 The optimization guide can be found here: https://developer.arm.com/documentation/PJDOC-466751330-18256/latest/ Reviewed By: dmgreen Differential Revision: https://reviews.llvm.org/D128631	2022-07-08 09:39:13 +00:00
Jay Foad	d393538c7f	[AMDGPU] Add a GFX11 MCA test This mostly just tests that DPFP is 1/32 rate on GFX11, instead of 1/16 rate as on GFX10.	2022-06-14 13:47:29 +01:00
Simon Pilgrim	d384a4c530	[X86] Adjust vector test costs to match SoG (Issue #54889 ) znver1/2 models were incorrectly modelling the latency/throughput/uops and znver1 ymm variants also require double pumping. Now matches what I can decipher from the AMD SoG, Agner and instlatx64 numbers vs the llvm-exegesis report provided by @fabian-r	2022-05-31 09:14:06 +01:00
Simon Pilgrim	14cc4674bf	[X86] Adjust vector fp test costs to match int test costs znver1/2 models were missing the vtestps/pd overrides to match the vptest integer equivalents. Noticed while investigating Issue #54889	2022-05-30 09:50:15 +01:00
Simon Pilgrim	1956f28037	[X86] Adjust vector extend to ymm to match SoG (Issue #54889 ) znver1 ymm variants of VPMOVSX/VPMOVZX instructions require double pumping. Now matches AMD SoG, Agner and instlatx64 numbers. Thanks to @fabian-r for the report	2022-05-30 08:58:56 +01:00
Simon Pilgrim	c99690462e	[X86] Adjust vector shift costs to match SoG (Issue #54889 ) znver1/2 models were incorrectly modelling the fpupipe (should be pipe2 for shift-by-scalar-amount and pipe1 for shift-by-element-amount) and znver1 ymm variants also require double pumping. Now matches AMD SoG, Agner and instlatx64 numbers. Thanks to @fabian-r for the report	2022-05-29 17:55:39 +01:00
Simon Pilgrim	896557e129	[X86] Adjust fadd costs to match SoG znver1/2 models were incorrectly modelling these on fpupipe 0 instead of 2/3 and znver1 ymm variants also require double pumping. Now matches AMD SoG, Agner and instlatx64 numbers. Thanks to @fabian-r for the report	2022-05-15 21:28:29 +01:00
Simon Pilgrim	6824cf1ab7	[X86] Set some more plausible latencies for horizontal add/subs on znver1 These are all microcoded/multi-pipe nightmares on Ryzen, but we shouldn't just be using the WriteMicrocoded class which is for REALLY bad microcoded nightmares - instead use the same approximate latencies as znver2 (Agner and uops.info both suggest similar values) - and make sure we use the FPU defs for both Fixes #53242	2022-05-08 15:48:42 +01:00
Simon Pilgrim	c7662dc3e5	[X86] MOVDDUP has the same sched behaviour as MOVSHDUP/MOVSLDUP on Skylake Fixes an old TODO - confirmed on Agner + uops.info	2022-05-02 12:50:37 +01:00
Simon Pilgrim	a305d8f44e	[X86] Adjust fsetcc/fmin/fmax costs to match SoG (Issue #54889 ) znver1/2 models were incorrectly modelling these as 3 cycle latency instructions on the wrong pipe and znver1 ymm variants also require double pumping. Now matches AMD SoG, Agner and instlatx64 numbers. Thanks to @fabian-r for the report	2022-04-14 13:27:33 +01:00
Simon Pilgrim	058a33d3c9	[X86] Account for high uop/resource usage in BSF/BSR instructions znver1/2 models were incorrectly modelling these as single uop instructions, instead of the microcoded nightmares they really are. Now matches AMD SoG, Agner and instlatx64 numbers. Fixes #54811	2022-04-11 11:20:09 +01:00
Simon Pilgrim	5626bd4289	[X86] Fix SLM scheduler model for PMULLD (PR37059) Adjust the PMULLD entry to match the Intel AoM numbers - PMULLD is a uop nightmare on SLM and we should model it as such. We had reports of internal regressions the last time this was attempted (rG13a0f83a05ff), but no public repros, and tests I did last year when I had access to a SLM box failed to see anything. My hunch is that the more aggressive PMULLD -> PMADDWD folds we now perform might have helped. We can revisit this again if we ever receive an actual repro. Fixes #36407	2022-04-08 10:07:06 +01:00
Simon Pilgrim	0c9c92ffc0	[X86][XOP] Tidyup VPHADD/VPHSUB unary horizontal ops default schedule class Based off Agner and AMD SoG tables, the XOP VPHADD/VPHSUB unary horizontal ops are as fast as basic arithmetic ops, not the slower SSSE3 binary horizontal add/sub ops. This also matches what the bdver2 model already lists. Noticed while investigating reduction add optimizations.	2022-03-03 12:07:48 +00:00
David Green	61b616755a	Partially revert "[SchedModels][CortexA55] Add ASIMD integer instructions" The Cortex-A55 scheduling model is used for -mcpu=generic, meaning it can have a wider effect than just the A55. The changes to the A55 scheduling model seems to have caused performance regressions on Cortex-A510 device which have latencies closer to the original and different forwarding paths. This partially reverts the changes from D117003, at least until we can do something to improve Cortex-A510. According to my results, this improves the A510 results without altering the A55 very much.	2022-02-28 10:58:52 +00:00
Pavel Kosov	37fa99eda0	[SchedModels][CortexA55] Add ASIMD integer instructions Depends on D114642 Original review https://reviews.llvm.org/D112201 OS Laboratory. Huawei Russian Research Institute. Saint-Petersburg Reviewed By: dmgreen Differential Revision: https://reviews.llvm.org/D117003	2022-02-17 13:41:57 +03:00
Craig Topper	56d6ccd4cb	[X86] Update register RCL/RCR by 1 and immediate scheduling for Intel CPUs Most Intel CPU scheduler files lumped the immediate and 1 instructions together, but uops.info shows they are quite different. For the most part the by 1 instructions were pretty accurate to the uops.info data except the latency was 3 instead of 2 as uops.info indicates. The by immediate instructions need 7 or 8 uops and have higher latency. It looks like the 8-bit by immediate instructions may need even more uops, but I just lumped them with the 16/32/64. Noticed while checking out PR53648. So mostly I cared about the by 1 instructions. Reviewed By: RKSimon, pengfei Differential Revision: https://reviews.llvm.org/D119217	2022-02-08 09:20:20 -08:00
Simon Pilgrim	6eb8fc9244	[X86] Add some missing dependency-breaking zero idiom patterns to scheduler models Many of the x86 scheduler models are not accounting for their microarch's ability to handle dependency-breaking zero idioms (pxor xmm0,xmm0 etc.), which is causing some notable differences when comparing llvm-mca reports to iaca, uops.info etc. These are based on the Intel AoMs and Agner's docs which list the instructions handled on each cpu model - there may be more, although tbh the xor/pxor/xorps/xorpd are by far the most commonly encountered. Once this is in place we also need to review missing support for 'allones' idioms and reg-reg move elimination, but this needs fixing first. @lebedev.ri The Barcelona test changes are due to the cpu still being tagged as using the SandyBridge model, if/when you get back to D63628 these will need to be addressed. Based on an original patch by @andreadb (Andrea Di Biagio) Differential Revision: https://reviews.llvm.org/D117497	2022-01-19 11:29:33 +00:00
Simon Pilgrim	8ea579203d	[MCA][X86] Add missing zero-idioms test file coverage atom/slm have no/limited zero-idioms handling but we should test all the common instructions anyhow znver1/znver2 were just missing - I've copied the Haswell tests for consistent test coverage	2022-01-17 16:04:39 +00:00
Patrick Holland	85e6e748d4	[MCA] Switching from conservatively guessing which instructions are memory-barrier instructions to providing targets and developers a convenient way to explicitly declare which instructions are memory-barriers. Differential Revision: https://reviews.llvm.org/D116779	2022-01-11 13:50:14 -08:00
Pavel Kosov	34a91d7748	[SchedModels][CortexA55] Fix scheduling of FP loads Patch fixes scheduling of FP load instructions with pre/post increment adding WriteAdr for address operand. Reviewed By: dmgreen Differential Revision: https://reviews.llvm.org/D116361 OS Laboratory. Huawei Russian Research Institute. Saint-Petersburg	2022-01-10 10:14:45 +03:00
Simon Pilgrim	a0a0eb192e	[X86] Use WriteVecMove scheduler classes for VPMOVM2* instructions These match the port behaviour of reg-reg predicated xmm/ymm/zmm moves Fixes #34958	2021-12-27 13:21:29 +00:00
Simon Pilgrim	29475e0286	[X86] Add scheduler classes for zmm vector reg-reg move instructions Basic zmm reg-reg moves (with predication) are more port limited than xmm/ymm moves, so we need to add a separate class for them. We still appear to be missing move-elimination patterns for most of the intel models, which looks to be one of the main diffs for basic codegen analysis between llvm-mca and uops.info Load/stores are a bit messier and might be better handled as overrides.	2021-12-27 12:13:42 +00:00
Simon Pilgrim	948ae472a6	[MCA][X86] Add AVX512 vector move instruction test coverage	2021-12-27 11:43:39 +00:00
Simon Pilgrim	67cce1ceee	[X86] Adjust some IceLake fp shuffle schedule classes (PR48110) The IceLake scheduler model is still mainly a copy of the SkylakeServer model. This patch adjusts the fp shuffle classes to account for most instructions now working on Port 1 as well as Port 5. This is based off Agner + uops.info as well as the PR48110 report. Differential Revision: https://reviews.llvm.org/D115752	2021-12-19 13:00:11 +00:00
Simon Pilgrim	74d1fc742a	[X86] Adjust some IceLake integer shuffle schedule classes (PR48110) The IceLake scheduler model is still mainly a copy of the SkylakeServer model. This patch adjusts the integer shuffle classes to account for most instructions now working on Port 1 as well as Port 5. This is based off Agner + uops.info as well as the PR48110 report. Differential Revision: https://reviews.llvm.org/D115547	2021-12-14 18:56:13 +00:00
Simon Pilgrim	d9655eec05	[MCA][X86] Add AVX512 subvector broadcast instruction test coverage	2021-12-13 18:48:25 +00:00
Simon Pilgrim	fe1b5b56c6	[MCA][X86] Add AVX512 movddup/movshdup/movsldup instruction test coverage As noted on D115547	2021-12-13 18:04:56 +00:00
Simon Pilgrim	b04c646711	[MCA][X86] Add AVX512 broadcast instruction test coverage As noted on D115547	2021-12-13 17:45:16 +00:00
Simon Pilgrim	9ad5969b5e	[X86][Atom] Fix CVT uops + port usage Fix overrides to use both ports. Update the uops counts + port usage based off the most recent llvm-exegesis captures (PR36895) and what Intel AoM / Agner reports as well.	2021-12-12 22:57:53 +00:00

694 Commits