llvm-project

Author	SHA1	Message	Date
Simon Pilgrim	dade83c02a	[X86][SLM] Fix ADDQ/SUBQ/CMPEQQ throughput to account for running on either port. Testing on a SLM box suggests these can run on either port, but the throughput is 4cy on either (inc MMX versions). Confirmed with Intel AoM / Agner / InstLatX64.	2021-09-24 10:06:14 +01:00
Simon Pilgrim	f855ef2601	[X86][Atom] Fix FP uops + port usage Both ports are required in most cases. Update the uops counts + port usage based off the most recent llvm-exegesis captures (PR36895) and what Intel AoM / Agner / InstLatX64 reports as well. Noticed while trying to improve fp costs for vectorization via the D103695 helper script.	2021-09-19 20:39:20 +01:00
Simon Pilgrim	cf8fac7d07	[X86][Atom] Specific uops for all IMUL/IDIV instructions Based off a mixture of llvm-exegesis captures (PR36895) and Intel AoM / Agner / InstLatX64 reports.	2021-09-19 16:58:52 +01:00
Simon Pilgrim	e381d8b243	[X86][Atom] Fix (U)COMISS/SD uops, latency and throughput Both ports are required, for reg and mem variants - we can also use the WriteFComX class directly and remove the unnecessary InstRW overrides. Matches what Intel AoM / Agner / InstLatX64 report as well.	2021-09-19 12:44:44 +01:00
Simon Pilgrim	5ebe95e256	[X86][Atom] Fix integer shuffles uops, latency and throughput The MMX pack/unpck shuffles don't need an override - they have the same behaviour as other shuffles (Port0 only). The SSE pslldq/psrldq shuffles don't need an override - they have the same behaviour as other shuffles (Port0 only). The SSE pshufb shuffles use 4uops (+1 load). Noticed the pslldq/psrldq issue while trying to improve reduction costs via the D103695 helper script, and fixed the others while reviewing. Confirmed with Intel AoM / Agner / InstLatX64.	2021-09-17 12:11:54 +01:00
Simon Pilgrim	65ad09da0e	[X86][SLM] Fix DIVPD/DIVPS/RCPPS/RSQRTPS/SQRTPD/SQRTPS/DPPD/DPPS uops, latency and throughput The packed variants of the instructions had been modelled as the same as the scalar variants. Reported during a run of llvm-exegesis on a cheap SLM box and matches what Agner / InstLatX64 report as well.	2021-09-13 08:36:43 +01:00
Simon Pilgrim	df975e4590	[X86][SLM] Fix PSAD/MPSAD uops, latency and throughput Noticed while trying to improve generic reduction costs via the D103695 helper script. Confirmed with Intel AoM / Agner / InstLatX64.	2021-09-11 11:44:09 +01:00
Simon Pilgrim	484944ac3b	[X86][SLM] Fix HADD/HSUB uops, latency and throughput Noticed while trying to improve generic reduction costs via the D103695 helper script. Confirmed with Intel AoM / Agner / InstLatX64.	2021-09-11 11:44:09 +01:00
Simon Pilgrim	2005ae15a6	[X86][SLM] WriteVecIMul instructions only take 1uop (REAPPLIED) The xmm variants have half the throughput (and +1cy latency) of the mmx variants, but are still 1uop. I still need to do more thorough testing of SLM on test-suite before fixing the obviously bad numbers for WritePMULLD. But this helps the D103695 helper script get to more accurate numbers for vXi32 multiplies of extended operands (i.e. we can use PMADDWD, PMULLW/PMULHW etc). Matches what Intel AoM / Agner / llvm-exegesis report.	2021-09-04 15:03:56 +01:00
Simon Pilgrim	ac51d69208	Revert rG994da657076900f5ad7fe593c3b5e5f89ab3d53d "[X86][SLM] WriteVecIMul instructions only take 1uop" This changed some codegen tests that I forgot about in my rebase, I'll recommit shortly with a fix.	2021-09-04 13:39:10 +01:00
Simon Pilgrim	994da65707	[X86][SLM] WriteVecIMul instructions only take 1uop The xmm variants have half the throughput (and +1cy latency) of the mmx variants, but are still 1uop. I still need to do more thorough testing of SLM on test-suite before fixing the obviously bad numbers for WritePMULLD. But this helps the D103695 helper script get to more accurate numbers for vXi32 multiplies of extended operands (i.e. we can use PMADDWD, PMULLW/PMULHW etc). Matches what Intel AoM / Agner / llvm-exegesis report.	2021-09-04 13:21:34 +01:00
Simon Pilgrim	c6371020a8	[X86][SLM] RMW instructions don't require an extra uop For RMW instructions, the load and store hold the MEC for an extra cycle, but within the same single uop. This is alluded to in the Intel AOM: "The MEC also owns the MEC RSV, which is responsible for scheduling of all loads and stores. Load and store instructions go through addresses generation phase in program order to avoid on-the-fly memory ordering later in the pipeline. Therefore, an unknown address will stall younger memory instructions." Noticed while trying to get a cheap SLM test box up and running with llvm-exegesis - RMW arithmetic is always 1uop - and matches what Agner / InstLatX64 report as well.	2021-09-04 13:21:34 +01:00
Simon Pilgrim	da965a77d5	[X86][SLM] Fix MUL uops, latency and throughput These were all set to the same best case mul i32 values (which seems to be the only version of MUL that SLM actually performs well with). Noticed while trying to improve multiplication costs for vectorization via the D103695 helper script. Confirmed with Intel AoM / Agner / InstLatX64.	2021-09-04 13:21:34 +01:00
Simon Pilgrim	7d062d2c47	[X86][Atom] MUL/DIV instructions require both ports, not either. Noticed while trying to improve multiplication costs for vectorization via the D103695 helper script. Confirmed with Intel AoM.	2021-09-04 11:58:09 +01:00
Simon Pilgrim	6ba0b9f68a	[X86][SLM] Fix PBLENDVB uops and throughput SLM PBLENDVB is just as bad as BLENDVPD/PS - so model it as such, fixing the rr vs rm uops diff as well. The Intel AoM appears to have a copy+paste typo with PBLENDW, it doesn't match Agner or InstLatX64. Noticed while investigating some of the weird discrepancies reported by the D103695 helper script (SLM had much better vector shift throughputs than it should).	2021-09-03 11:31:29 +01:00
Simon Pilgrim	7ec7272b80	[MCA][X86] Add basic coverage for icelake arch Copy the skylake-avx512 tests for icelake-server coverage. Add icelake/rocketlake/tigerlake test coverage to the relevant generic tests as well.	2021-08-31 12:20:09 +01:00
Roman Lebedev	d4d459e747	[X86] AMD Zen 3: MULX w/ mem operand has the same throughput as with reg op Exegesis is faulty and sometimes when measuring throughput^-1 produces snippets that have loop-carried dependencies, which must be what caused me to incorrectly measure it originally. After looking much more carefully, the inverse throughput should match that of the MULX w/ reg op. As per llvm-exegesis measurements.	2021-08-27 13:27:05 +03:00
Roman Lebedev	0f04936a2d	[X86] AMD Zen 3: MULX produces low part of the result in 3cy, +1cy for high part As per llvm-exegesis measurements.	2021-08-27 13:27:05 +03:00
Roman Lebedev	db2c6cd99c	[NFC][X86][MCA] AMD Zen 3: improve MULX test coverage Latency for MULX isn't right	2021-08-27 13:27:05 +03:00
Andrea Di Biagio	4a5b191703	[X86][MCA] Address the latest issues with MULX reported in PR51495. It turns out that SchedWrite WriteIMulH was always assigned to the low half of the result of a MULX (rather than to the high half). To avoid confusion, this patch swaps the two MULX writes in the tablegen definition of MULX32/64. That way, write names better describe what they actually refer to; this also avoids further complications if in future we decide to reuse the same MulH writes to also model other scalar integer multiply instructions. I also had to swap the latency values for the two MULX writes to make sure that the change is effectively an NFC. In fact, none of the existing x86 tests were affected by this small refactoring. This patch also fixes a bug in MCA: a wrong latency value was propagated for instructions that perform multiple writes to a same register. This last issue was found by Roman while testing MULX on targets that define a different latency for the Low/High part of the result. Differential Revision: https://reviews.llvm.org/D108727	2021-08-26 12:08:20 +01:00
Andrea Di Biagio	6181427bb9	[X86][MCA] Add more tests for MULX (PR51495). llvm-mca still reports a wrong latency for the case where the two destination registers of MULX are the same.	2021-08-25 21:28:21 +01:00
Andrea Di Biagio	5f848b311f	[X86][SchedModel] Fix latency of the Hi register write of MULX (PR51495). Before this patch, WriteIMulH reported a latency value which is correct for the RR variant of MULX, but not for the RM variant. This patch fixes the issue by introducing a new WriteIMulHLd, which is meant to be used only by the RM variant of MULX. Differential Revision: https://reviews.llvm.org/D108701	2021-08-25 16:12:09 +01:00
Andrea Di Biagio	fe13b81ed9	[X86][NFC] Pre-commit llvm-mca tests for PR51495. WriteIMulH reports an incorrect latency for RM variants of MULX.	2021-08-25 14:17:17 +01:00
Andrea Di Biagio	35d4292a73	[X86][SchedModels] Fix missing ReadAdvance for MULX and ADCX/ADOX (PR51494) Before this patch, instructions MULX32rm and MULX64rm were missing a ReadAdvance for the implicit read of register EDX/RDX. This patch fixes the issue, and it also introduces a new SchedWrite for the two variants of MULX. The general idea behind this last change is to eventually decrease the number of InstRW in the scheduling models. This patch also adds a ReadAdvance for the implicit read of EFLAGS in ADCX/ADOX. Differential Revision: https://reviews.llvm.org/D108372	2021-08-20 17:39:51 +01:00
Andrea Di Biagio	2d53e54f0e	[X86][NFC] Pre-commit tests for PR51494	2021-08-18 19:55:21 +01:00
Tozer	6d5e31baaa	Fix 2: [MCParser] Correctly handle CRLF line ends when consuming line comments Fixes an issue with revision `5c6f748c` and `ad40cb88`. Adds an mcpu argument to the test command, preventing an invalid default CPU from being used on some platforms.	2021-08-17 17:13:21 +01:00
Tozer	ad40cb8821	Fix: [MCParser] Correctly handle CRLF line ends when consuming line comments Fixes an issue with revision `5c6f748c`. Move the test added in the above commit into the X86 folder, ensuring that it is only run on targets where its triple is valid.	2021-08-17 16:16:19 +01:00
Andrea Di Biagio	7a1a35a1d1	[X86][SchedModel] Add missing ReadAdvance for some arithmetic ops (PR51318 and PR51322). This fixes a bug where implicit uses of EFLAGS were not marked as ReadAdvance in the RM/MR variants of ADC/SBB (PR51318) This also fixes the absence of ReadAdvance for the register operand of RMW arithmetic instructions (PR51322). Differential Revision: https://reviews.llvm.org/D107367	2021-08-04 17:50:22 +01:00
Andrea Di Biagio	f0658c7a42	[MCA][NFC] Add tests for PR51318 and PR51322. Also, regenerate existing X86 tests using update_mca_test.py.	2021-08-03 17:06:34 +01:00
Simon Pilgrim	d073b19dbf	[X86] Fix SLM FP<->INT throughputs. Noticed while trying to clean up the shift costs model for SSE4 targets using the script in D10369 - SLM double-pumps all the 128-bit vector conversion ops and only use FP0 pipe - numbers taken from Intel AOM + Agner.	2021-07-22 19:39:04 +01:00
Simon Pilgrim	ded8866f4a	[X86][Atom] Fix vector fp<->int resource/throughputs Match what's documented in the Intel AOM - almost all the conversion instructions require BOTH ports (apart from the MMX cvtpi2ps/cvtpi2ps instructions which we already override) - this was being incorrectly modelled as EITHER port. Now that we can use in-order models in llvm-mca, the atom model is a good "worst case scenario" analysis for x86.	2021-07-07 16:52:34 +01:00
Serge Pavlov	b36d214bed	[X86] Add description of FXAM instruction Previously this instruction could be used only in assembler. This change makes it available for compiler also. Scheduling information was copied from FTST instruction, hopefully this can be a satisfactory approximation. Differential Revision: https://reviews.llvm.org/D104853	2021-06-25 12:26:51 +07:00
Andrea Di Biagio	70b37f4c03	[MCA][InstrBuilder] Always check for implicit uses of resource units (PR50725). When instructions are issued to the underlying pipeline resources, the mca::ResourceManager should also check for the presence of extra uses induced by the explicit consumption of multiple partially overlapping group resources. Fixes PR50725	2021-06-16 14:51:12 +01:00
Simon Pilgrim	630820bafc	[X86][SLM] Adjust XMM non-PMULLD throughput costs to half rate. Match what's reported in the costs table, Agner's tables and the Intel AOM	2021-06-09 13:51:40 +01:00
Simon Pilgrim	21aec4fdc5	[X86][SLM] Fix vector PSHUFB + variable shift resource/throughputs Match what's documented in the Intel AOM (+Agner) - PSHUFB xmm is really slow, and mmx/xmm vector shifts are half rate. Noticed while working to get the cost tables to more closely match llvm-mca analysis, in this case for shifts and truncations.	2021-05-26 11:14:21 +01:00
Simon Pilgrim	66978466ba	[X86][Atom] Fix vector variable shift resource/throughputs Match what's documented in the Intel AOM - the non-immediate variants of the PSLL/PSRA/PSRL* shift instructions require BOTH ports - this was being incorrectly modelled as EITHER port. Now that we can use in-order models in llvm-mca, the atom model is a good "worst case scenario" analysis for x86.	2021-05-26 10:30:59 +01:00
Simon Pilgrim	57250f2f3c	[X86][Atom] Fix vector PSHUFB resource/throughputs Match what's documented in the Intel AOM - the XMM variant of PSHUFB requires BOTH ports - this was being incorrectly modelled as EITHER port. Now that we can use in-order models in llvm-mca, the atom model is a good "worst case scenario" analysis for x86.	2021-05-25 17:31:45 +01:00
Simon Pilgrim	a26288e803	[X86][Atom] Fix vector fadd/fcmp/fmul resource/throughputs Match what's documented in the Intel AOM - fadd/fcmp use Port1 and fmul uses Port1, but in many cases BOTH ports are required - this was being incorrectly modelled as EITHER port. Discovered while investigating the correct fptoui costs to fix the regressions in D101555. Now that we can use in-order models in llvm-mca, the atom model is a good "worst case scenario" analysis for x86.	2021-05-20 18:56:58 +01:00
Andrea Di Biagio	9acabe8b6f	[MCA] Unbreak the buildbots by passing flag -mcpu=generic to the new test added by commit `e5d59db469`. This should unbreak buildbot clang-ppc64le-linux-lnt.	2021-05-19 19:12:33 +01:00
Patrick Holland	e5d59db469	[MCA] llvm-mca MCTargetStreamer segfault fix In order to create the code regions for llvm-mca to analyze, llvm-mca creates an AsmCodeRegionGenerator and calls AsmCodeRegionGenerator::parseCodeRegions(). Within this function, both an MCAsmParser and MCTargetAsmParser are created so that MCAsmParser::Run() can be used to create the code regions for us. These parser classes were created for llvm-mc so they are designed to emit code with an MCStreamer and MCTargetStreamer that are expected to be setup and passed into the MCAsmParser constructor. Because llvm-mca doesn’t want to emit any code, an MCStreamerWrapper class gets created instead and passed into the MCAsmParser constructor. This wrapper inherits from MCStreamer and overrides many of the emit methods to just do nothing. The exception is the emitInstruction() method which calls Regions.addInstruction(Inst). This works well and allows llvm-mca to utilize llvm-mc’s MCAsmParser to build our code regions, however there are a few directives which rely on the MCTargetStreamer. llvm-mc assumes that the MCStreamer that gets passed into the MCAsmParser’s constructor has a valid pointer to an MCTargetStreamer. Because llvm-mca doesn’t setup an MCTargetStreamer, when the parser encounters one of those directives, a segfault will occur. In x86, each one of these 7 directives will cause this segfault if they exist in the input assembly to llvm-mca: .cv_fpo_proc .cv_fpo_setframe .cv_fpo_pushreg .cv_fpo_stackalloc .cv_fpo_stackalign .cv_fpo_endprologue .cv_fpo_endproc I haven’t looked at other targets, but I wouldn’t be surprised if some of the other ones also have certain directives which could result in this same segfault. My proposed solution is to simply initialize an MCTargetStreamer after we initialize the MCStreamerWrapper. The MCTargetStreamer requires an ostream object, but we don’t actually want any of these directives to be emitted anywhere, so I use an ostream created with the nulls() function. Since this needs to happen after the MCStreamerWrapper has been initialized, it needs to happen within the AsmCodeRegionGenerator::parseCodeRegions() function. The MCTargetStreamer also needs an MCInstPrinter which is easiest to initialize within the main() function of llvm-mca. So this MCInstPrinter gets constructed within main() then passed into the parseCodeRegions() function as a parameter. (If you feel like it would be appropriate and possible to create the MCInstPrinter within the parseCodeRegions() function, then feel free to modify my solution. That would stop us from having to pass it into the function and would limit its scope / lifetime.) My solution stops the segfault from happening and still passes all of the current (expected) llvm-mca tests. I also added a new test for x86 that checks for this segfault on an input that includes one of the .cv_fpo directives (this test fails without my solution, but passes with it). As far as I can tell, all of the functions that I modified are only called from within llvm-mca so there shouldn’t be any worries about breaking other tools. Differential Revision: https://reviews.llvm.org/D102709	2021-05-19 18:36:10 +01:00
Simon Pilgrim	b14f9a1ebd	[X86][Atom] Fix vector integer shift by immediate resource/throughputs Match what's documented in the Intel AOM (and Agner/instlatx64 agree) - these are all Port0 only. Now that we can use in-order models in llvm-mca, the atom model is a good "worst case scenario" analysis for x86.	2021-05-19 14:39:40 +01:00
Simon Pilgrim	f9b1208681	[X86][Atom] Fix vector integer multiplication resource/throughputs Match what's documented in the Intel AOM (and Agner/instlatx64 agree) - vector integer multiplies are pipelined - all Port0, throughput = 2 @ 128bits, 1 @ 64bits. Noticed while checking reduction costs - now that we can use in-order models in llvm-mca, the atom model is the "worst case scenario" we have in x86.	2021-05-15 14:25:48 +01:00
Roman Lebedev	990e806b36	[NFC][X86][MCA] Add sudo-zero-idiom vperm2f128/vperm2i128 tests - don't break deps While btver2 model states that this pattern is a zero-cycle zero-idiom on Jaguar, it does not appear to be the case on Znver3, here it measures as not being recognized as dep-breaking zero-idiom, let alone a zero-cycle one.	2021-05-14 20:23:05 +03:00
Roman Lebedev	1fc1c88704	[X86] AMD Zen 3: same-reg AVX YMM VPCMPGT{B,W,D,Q} is a zero-cycle(!) dep-breaking zero-idiom As measured by exegesis, and confirmed by ref docs.	2021-05-14 20:23:05 +03:00
Roman Lebedev	2f8572d8e2	[X86] AMD Zen 3: same-reg AVX XMM VPCMPGT{B,W,D,Q} is a zero-cycle(!) dep-breaking zero-idiom As measured by exegesis, and confirmed by ref docs.	2021-05-14 20:23:04 +03:00
Roman Lebedev	f8f7c765a0	[X86] AMD Zen 3: same-reg SSE XMM PCMPGT{B,W,D,Q} is a 1-cycle(!) dep-breaking zero-idiom As measured by exegesis, and confirmed by ref docs.	2021-05-14 20:23:04 +03:00
Roman Lebedev	d2fb4bfba8	[NFC][X86][MCA] AMD Zen 3: add same-reg AVX YMM VPCMPGT{B,W,D,Q} tests	2021-05-14 20:23:04 +03:00
Roman Lebedev	094b493a3a	[NFC][X86][MCA] AMD Zen 3: add same-reg AVX XMM VPCMPGT{B,W,D,Q} tests	2021-05-14 20:23:04 +03:00
Roman Lebedev	1c0ac0b0f2	[NFC][X86][MCA] AMD Zen 3: add same-reg SSE XMM PCMPGT{B,W,D,Q} tests	2021-05-14 20:23:03 +03:00
Roman Lebedev	26eeb6e650	[X86] AMD Zen 3: same-reg AVX YMM VPSUBUS{B,W} is a 1-cycle(!) dep-breaking zero-idiom Not really mentioned in ref docs, but measures as such. Yes, this one is also not zero-cycle.	2021-05-14 20:23:03 +03:00

519 Commits