Commit Graph

627 Commits

Author SHA1 Message Date
David Green e9adcbde31 [AArch64] Model Cortex-A55 Q register NEON instructions
Cortex-A55 has 2 64bit NEON vector units, meaning a 128bit instruction
requires taking both units (and can only be issued as the first
instruction in a dual issue pair). This patch models that by splitting
the WriteV SchedWrite into two - the WriteVd that reads/writes only
64bit operands, and the WriteVq that read/writes 128bit registers. The
A55 schedule then uses this distinction to model the WriteVq as taking
both resource units, and starting a Schedule Group and WriteVd as taking
one as before.

I believe this is more correct, even if it does not lead to much better
performance.

Differential Revision: https://reviews.llvm.org/D108766
2021-09-29 16:55:31 +01:00
Simon Pilgrim dade83c02a [X86][SLM] Fix ADDQ/SUBQ/CMPEQQ throughput to account for running on either port.
Testing on a SLM box suggests these can run on either port, but the throughput is 4cy on either (inc MMX versions). Confirmed with Intel AoM / Agner / InstLatX64.
2021-09-24 10:06:14 +01:00
Simon Pilgrim f855ef2601 [X86][Atom] Fix FP uops + port usage
Both ports are required in most cases. Update the uops counts + port usage based off the most recent llvm-exegesis captures (PR36895) and what Intel AoM / Agner / InstLatX64 reports as well.

Noticed while trying to improve fp costs for vectorization via the D103695 helper script.
2021-09-19 20:39:20 +01:00
Simon Pilgrim cf8fac7d07 [X86][Atom] Specific uops for all IMUL/IDIV instructions
Based off a mixture of llvm-exegesis captures (PR36895) and Intel AoM / Agner / InstLatX64 reports.
2021-09-19 16:58:52 +01:00
Simon Pilgrim e381d8b243 [X86][Atom] Fix (U)COMISS/SD uops, latency and throughput
Both ports are required, for reg and mem variants - we can also use the WriteFComX class directly and remove the unnecessary InstRW overrides. Matches what Intel AoM / Agner / InstLatX64 report as well.
2021-09-19 12:44:44 +01:00
Simon Pilgrim 5ebe95e256 [X86][Atom] Fix integer shuffles uops, latency and throughput
The MMX pack/unpck shuffles don't need an override - they have the same behaviour as other shuffles (Port0 only).
The SSE pslldq/psrldq shuffles don't need an override - they have the same behaviour as other shuffles (Port0 only).
The SSE pshufb shuffles use 4uops (+1 load).

Noticed the pslldq/psrldq issue while trying to improve reduction costs via the D103695 helper script, and fixed the others while reviewing. Confirmed with Intel AoM / Agner / InstLatX64.
2021-09-17 12:11:54 +01:00
Simon Pilgrim 65ad09da0e [X86][SLM] Fix DIVPD/DIVPS/RCPPS/RSQRTPS/SQRTPD/SQRTPS/DPPD/DPPS uops, latency and throughput
The packed variants of the instructions had been modelled as the same as the scalar variants.

Reported during a run of llvm-exegesis on a cheap SLM box and matches what Agner / InstLatX64 report as well.
2021-09-13 08:36:43 +01:00
Simon Pilgrim df975e4590 [X86][SLM] Fix PSAD/MPSAD uops, latency and throughput
Noticed while trying to improve generic reduction costs via the D103695 helper script. Confirmed with Intel AoM / Agner / InstLatX64.
2021-09-11 11:44:09 +01:00
Simon Pilgrim 484944ac3b [X86][SLM] Fix HADD/HSUB uops, latency and throughput
Noticed while trying to improve generic reduction costs via the D103695 helper script. Confirmed with Intel AoM / Agner / InstLatX64.
2021-09-11 11:44:09 +01:00
Simon Pilgrim 2005ae15a6 [X86][SLM] WriteVecIMul instructions only take 1uop (REAPPLIED)
The xmm variant have half the throughput (and +1cy latency) of the mmx variants, but are still 1uop.

I still need to do more thorough testing of SLM on test-suite before fixing the obvious bad numbers for WritePMULLD.

But this helps the D103695 helper script get to more accurate numbers for vXi32 multiplies of extended operands (i.e. we can use PMADDWD, PMULLW/PMULHW etc). Matches what Intel AoM / Agner / llvm-exegesis reports.
2021-09-04 15:03:56 +01:00
Simon Pilgrim ac51d69208 Revert rG994da657076900f5ad7fe593c3b5e5f89ab3d53d "[X86][SLM] WriteVecIMul instructions only take 1uop"
This changed some codegen tests that I forgot about in my rebase, I'll recommit shortly with a fix.
2021-09-04 13:39:10 +01:00
Simon Pilgrim 994da65707 [X86][SLM] WriteVecIMul instructions only take 1uop
The xmm variant have half the throughput (and +1cy latency) of the mmx variants, but are still 1uop.

I still need to do more thorough testing of SLM on test-suite before fixing the obvious bad numbers for WritePMULLD.

But this helps the D103695 helper script get to more accurate numbers for vXi32 multiplies of extended operands (i.e. we can use PMADDWD, PMULLW/PMULHW etc). Matches what Intel AoM / Agner / llvm-exegesis reports.
2021-09-04 13:21:34 +01:00
Simon Pilgrim c6371020a8 [X86][SLM] RMW instructions don't require an extra uop
For RMW instructions, the load and store hold the MEC for an extra cycle, but within the same single uop. This is alluded to in the Intel AOM:

"The MEC also owns the MEC RSV, which is responsible for scheduling of all loads and stores. Load and
store instructions go through addresses generation phase in program order to avoid on-the-fly memory
ordering later in the pipeline. Therefore, an unknown address will stall younger memory instructions."

Noticed while trying to get a cheap SLM test box up and running with llvm-exegesis - RMW arithmetic is always 1uop - and matches what Agner / InstLatX64 report as well.
2021-09-04 13:21:34 +01:00
Simon Pilgrim da965a77d5 [X86][SLM] Fix MUL uops, latency and throughput
These were all set to the same best case mul i32 values (which seems to be the only version of MUL that SLM actually performs well with).

Noticed while trying to improve multiplication costs for vectorization via the D103695 helper script. Confirmed with Intel AoM / Agner / InstLatX64.
2021-09-04 13:21:34 +01:00
Simon Pilgrim 7d062d2c47 [X86][Atom] MUL/DIV instructions require both ports, not either.
Noticed while trying to improve multiplication costs for vectorization via the D103695 helper script. Confirmed with Intel AoM.
2021-09-04 11:58:09 +01:00
Simon Pilgrim 6ba0b9f68a [X86][SLM] Fix PBLENDVB uops and throughput
SLM PBLENDVB is just as bad as BLENDVPD/PS - so model it as such, fixing the rr vs rm uops diff as well. The Intel AoM appears to have a copy+paste typo with PBLENDW, it doesn't match Agner or InstLatX64.

Noticed while investigating some of the weird discrepancies reported by the D103695 helper script (SLM had much better vector shift throughputs than it should).
2021-09-03 11:31:29 +01:00
Simon Pilgrim 7ec7272b80 [MCA][X86] Add basic coverage for icelake arch
Copy the skylake-avx512 tests for icelake-server coverage.

Add icelake/rocketlake/tigerlake test coverage to the relevent generic tests as well.
2021-08-31 12:20:09 +01:00
Roman Lebedev d4d459e747
[X86] AMD Zen 3: MULX w/ mem operand has the same throughput as with reg op
Exegesis is faulty and sometimes when measuring throughput^-1
produces snippets that have loop-carried dependencies,
which must be what caused me to incorrectly measure it originally.

After looking much more carefully, the inverse throughput should match
that of the MULX w/ reg op.

As per llvm-exegesis measurements.
2021-08-27 13:27:05 +03:00
Roman Lebedev 0f04936a2d
[X86] AMD Zen 3: MULX produces low part of the result in 3cy, +1cy for high part
As per llvm-exegesis measurements.
2021-08-27 13:27:05 +03:00
Roman Lebedev db2c6cd99c
[NFC][X86][MCA] AMD Zen 3: improve MULX test coverage
Latency for MULX isn't right
2021-08-27 13:27:05 +03:00
Andrea Di Biagio 4a5b191703 [X86][MCA] Address the latest issues with MULX reported in PR51495.
It turns out that SchedWrite WriteIMulH was always assigned to the low half of
the result of a MULX (rather than to the high half).

To avoid confusion, this patch swaps the two MULX writes in the tablegen
definition of MULX32/64.  That way, write names better describe what they
actually refer to; this also avoids further complications if in future we decide
to reuse the same MulH writes to also model other scalar integer multiply
instructions.  I also had to swap the latency values for the two MULX writes to
make sure that the change is effectively an NFC. In fact, none of the existing
x86 tests were affected by this small refactoring.

This patch also fixes a bug in MCA: a wrong latency value was propagated for
instructions that perform multiple writes to a same register.  This last issue
was found by Roman while testing MULX on targets that define a different latency
for the Low/High part of the result.

Differential Revision: https://reviews.llvm.org/D108727
2021-08-26 12:08:20 +01:00
David Green 6ffc6951a3 [AArch64] Remove unpredictable from narrowing instructions.
Like other similar instructions the xtn2 family do not have side
effects, and explicitly marking them as such can help improve scheduling
freedom.
2021-08-26 09:43:44 +01:00
David Green 9474b03d41 [AArch64] Add a Cortex-A55 NEON scheduler test case. 2021-08-26 09:43:44 +01:00
Andrea Di Biagio 6181427bb9 [X86][MCA] Add more tests for MULX (PR51495).
llvm-mca still reports a wrong latency for the case where
the two destination registers of MULX are the same.
2021-08-25 21:28:21 +01:00
Andrea Di Biagio 5f848b311f [X86][SchedModel] Fix latency the Hi register write of MULX (PR51495).
Before this patch, WriteIMulH reported a latency value which is correct for the
RR variant of MULX, but not for the RM variant.

This patch fixes the issue by introducing a new WriteIMulHLd, which is meant to
be used only by the RM variant of MULX.

Differential Revision: https://reviews.llvm.org/D108701
2021-08-25 16:12:09 +01:00
Andrea Di Biagio fe13b81ed9 [X86][NFC] Pre-commit llvm-mca tests for PR51495.
WriteIMulH reports an incorrect latency for RM variants of MULX.
2021-08-25 14:17:17 +01:00
Patrick Holland e4ebfb5786 [MCA] Adding an AMDGPUCustomBehaviour implementation.
This implementation allows mca to model the desired behaviour of the s_waitcnt
instruction. This patch also adds the RetireOOO flag to the AMDGPU instructions
within the scheduling model. This flag is only used by mca and allows
instructions to finish out-of-order which helps mca's simulations more closely
model the actual device.

Differential Revision: https://reviews.llvm.org/D104730
2021-08-24 13:33:58 -07:00
David Green 50f4ae58eb [AArch64] Correct store ReadAdrBase operand
It appears that the Read operand for stores was being placed on the
first operand (the stored value) not the address base. This adds a
ReadST for the stored value operand, allowing the ReadAdrBase to
correctly act upon the address.

Differential Revision: https://reviews.llvm.org/D108287
2021-08-23 21:07:55 +01:00
David Green 955c9437fd [AArch64] Add Scheduling tests for Load/Store ReadAdv operands. 2021-08-23 21:07:55 +01:00
Andrea Di Biagio 35d4292a73 [X86][SchedModels] Fix missing ReadAdvance for MULX and ADCX/ADOX (PR51494)
Before this patch, instructions MULX32rm and MULX64rm were missing a ReadAdvance
for the implicit read of register EDX/RDX.  This patch fixes the issue, and it
also introduces a new SchedWrite for the two variants of MULX. The general idea
behind this last change is to eventually decrease the number of InstRW in the
scheduling models.

This patch also adds a ReadAdvance for the implicit read of EFLAGS in ADCX/ADOX.

Differential Revision: https://reviews.llvm.org/D108372
2021-08-20 17:39:51 +01:00
Andrea Di Biagio 2d53e54f0e [X86][NFC] Pre-commit tests for PR51494 2021-08-18 19:55:21 +01:00
Tozer 6d5e31baaa Fix 2: [MCParser] Correctly handle CRLF line ends when consuming line comments
Fixes an issue with revision 5c6f748c and ad40cb88.

Adds an mcpu argument to the test command, preventing an invalid default
CPU from being used on some platforms.
2021-08-17 17:13:21 +01:00
Tozer ad40cb8821 Fix: [MCParser] Correctly handle CRLF line ends when consuming line comments
Fixes an issue with revision 5c6f748c.

Move the test added in the above commit into the X86 folder, ensuring
that it is only run on targets where its triple is valid.
2021-08-17 16:16:19 +01:00
Tozer 5c6f748cbc [MCParser] Correctly handle CRLF line ends when consuming line comments
Fixes issue: https://bugs.llvm.org/show_bug.cgi?id=47983

The AsmLexer currently has an issue with lexing line comments in files
with CRLF line endings, in which it reads the carriage return as being
part of the line comment. This causes an error for certain valid comment
layouts; this patch fixes this by excluding the carriage return from the
line comment.

Differential Revision: https://reviews.llvm.org/D90234
2021-08-17 15:52:51 +01:00
Andrea Di Biagio 7a1a35a1d1 [X86][SchedModel] Add missing ReadAdvance for some arithmetic ops (PR51318 and PR51322).
This fixes a bug where implicit uses of EFLAGS were not marked as ReadAdvance in
the RM/MR variants of ADC/SBB (PR51318)

This also fixes the absence of ReadAdvance for the register operand of
RMW arithmetic instructions (PR51322).

Differential Revision: https://reviews.llvm.org/D107367
2021-08-04 17:50:22 +01:00
Andrea Di Biagio f0658c7a42 [MCA][NFC] Add tests for PR51318 and PR51322.
Also, regenerate existing X86 tests using update_mca_test.py.
2021-08-03 17:06:34 +01:00
Andrew Savonichev bcc83a2e83 [MCA] Use LSU for the in-order pipeline
Load/Store unit is used to enforce order of loads and stores if they
alias (controlled by --noalias=false option).

Fixes PR50483 - [MCA] In-order pipeline doesn't track memory
load/store dependencies.

Differential Revision: https://reviews.llvm.org/D103955
2021-07-29 14:40:23 +03:00
Simon Pilgrim d073b19dbf [X86] Fix SLM FP<->INT throughputs.
Noticed while trying to clean up the shift costs model for SSE4 targets using the script in D10369 - SLM double-pumps all the 128-bit vector conversion ops and only use FP0 pipe - numbers taken from Intel AOM + Agner.
2021-07-22 19:39:04 +01:00
Nicholas Guy 9769535efd [AArch64] Update Cortex-A55 SchedModel to improve LDP scheduling
Specifying the latencies of specific LDP variants appears to improve
performance almost universally.

Differential Revision: https://reviews.llvm.org/D105882
2021-07-16 12:00:57 +01:00
Marcos Horro 77f2f0f9b7 [llvm-mca][JSON] Store extra information about driver flags used for the simulation
Added information stored in PipelineOptions and the MCSubtargetInfo.

Bug: https://bugs.llvm.org/show_bug.cgi?id=51041

Reviewed By: andreadb

Differential Revision: https://reviews.llvm.org/D106077
2021-07-16 09:18:40 +02:00
David Green f73334c46d [AArch64] Set the latency of Cortex-A55 stores to 1
This sets the latency of stores to 1 in the Cortex-A55 scheduling model,
to better match the values given in the software optimization guide.

The latency of a store in normal llvm scheduling does not appear to have
a lot of uses. If the store has no outputs then the latency is somewhat
meaningless (and pre/post increment update operands use the WriteAdr
write for those operands instead). The one place it does alter things is
the latency between a store and the end of the scheduling region, which
can in turn have an effect on the critical path length. As a result a
latency of 1 is more correct and offers ever-so-slightly better
scheduling of instructions near the end of the block.

They are marked as RetireOOO to keep the llvm-mca from introducing
stalls where non would exist.

Differential Revision: https://reviews.llvm.org/D105541
2021-07-12 13:39:35 +01:00
Andrea Di Biagio 4fe0fcd1c0 [llvm-mca][JSON] Teach the PipelinePrinter how to deal with anonymous code regions (PR51008)
This patch addresses the last remaining problems reported in PR51008.

Previous fixes for PR51008 worked under the wrong assumption that code regions
are always named (except maybe for the default region, which was automatically
named "main").

In reality, it is quite common for users to declare multiple anonymous regions.
So we cannot really use the region name as the key string of a JSON object.  In
practice, code region names are completely optional.

Using "main" for the default region was also problematic because there can be
another region with that same name.

This patch fixes these issues by introducing a json::array of regions.  Each
region has a "Name" field, which would default to the empty string for anonymous
regions.

Added a few more tests to verify that the JSON file format is still valid, and
that multiple anonymous regions all appear in the final output.
2021-07-10 13:57:52 +01:00
Andrea Di Biagio d919bca875 [llvm-mca][JSON] Further refactoring of the JSON printing logic.
This patch renames object "Resources" to "TargetInfo".

Moved the getJSONTargetInfo method from class InstructionView to the
PipelinePrinter.

Removed uses of std::stringstream.
Removed unused method View::printViewJSON().
2021-07-10 12:38:19 +01:00
Andrea Di Biagio 10cb036223 [llvm-mca] Refactor the logic that prints JSON files.
Moved most of the printing logic into the PipelinePrinter.

This patch also fixes the JSON output when flag -instruction-tables is
specified.
2021-07-09 22:56:39 +01:00
Marcos Horro b11d31eb73 [llvm-mca] Fix JSON format for multiple regions
Instead of printing each region individually when using JSON format,
this patch creates a JSON object which is updated with the values of
each region, printing them at the end. New test is added for JSON output
with multiple regions.

Bug: https://bugs.llvm.org/show_bug.cgi?id=51008

Reviewed By: andreadb

Differential Revision: https://reviews.llvm.org/D105618
2021-07-09 18:04:16 +02:00
Patrick Holland d38b9f1f31 Revert "[MCA] [AMDGPU] Adding an implementation to AMDGPUCustomBehaviour for handling s_waitcnt instructions."
Build failures when building with shared libraries. Reverting until I can fix.

Differential Revision: https://reviews.llvm.org/D104730
2021-07-07 20:48:42 -07:00
Patrick Holland af3baf1761 [MCA] [AMDGPU] Adding an implementation to AMDGPUCustomBehaviour for handling s_waitcnt instructions.
This commit also makes some slight changes to the scheduling model for AMDGPU to set the RetireOOO flag for all scheduling classes.

This flag is only used by llvm-mca and allows instructions to retire out of order.

See the differential link below for a deeper explanation of everything.

Differential Revision: https://reviews.llvm.org/D104730
2021-07-07 14:17:54 -07:00
Simon Pilgrim ded8866f4a [X86][Atom] Fix vector fp<->int resource/throughputs
Match whats documented in the Intel AOM - almost all the conversion instructions requires BOTH ports (apart from the MMX cvtpi2ps/cvtpi2ps instructions which we already override) - this was being incorrectly modelled as EITHER port.

Now that we can use in-order models in llvm-mca, the atom model is a good "worst case scenario" analysis for x86.
2021-07-07 16:52:34 +01:00
Marcos Horro aa13e4fe7e [llvm-mca] Fix JSON output (PR50922)
Based on the discussion in PR50922, minor changes have been done to properly
output a valid JSON.  Removed "not implemented" keys.

Differential Revision: https://reviews.llvm.org/D105064
2021-07-01 12:53:20 +01:00
Serge Pavlov b36d214bed [X86] Add description of FXAM instruction
Previously this instruction could be used only in assembler. This change
makes it available for compiler also. Scheduling information was copied
from FTST instruction, hopefully this can be a satisfactory approximation.

Differential Revision: https://reviews.llvm.org/D104853
2021-06-25 12:26:51 +07:00