llvm-project/llvm/test/tools/llvm-mca/X86
Andrea Di Biagio 528f68144b [X86][BtVer2] Fix latency and throughput of conditional SIMD store instructions.
On BtVer2, conditional SIMD stores are heavily microcoded.
The latency is directly proportional to the number of packed elements extracted
from the input vector. Also, according to micro-benchmarks, most of the
computation seems to be done in the integer unit.

Only a minority of the uOPs is executed by the FPU. The observed behaviour on
the FPU looks similar to this:
 - The input MASK value is moved to the Integer Unit
   -- [a VMOVMSK-like uOP executed on JFPU0].
 - In parallel, each element of the input XMM/YMM is extracted and then sent to
   the Integer Unit through JFPU1.

As expected, a (conditional) store is executed for every extracted element.
Interestingly, a (speculative) load is executed for every extracted element too.
It is as if a "LOAD - BIT_EXTRACT - CMOV" sequence of uOPs is repeated by the
integer unit for every conditionally stored element.
VMASKMOVDQU is a special case: the number of speculative loads is always 2
(presumably, one load per quadword). That means that extra shifts and masking
are performed on (one of) the loaded quadwords before each conditional store
(which also explains the large number of non-FP uOPs retired).
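
The per-element behaviour described above can be sketched as a small software
model. This is only an illustration, not the actual microcode: the function
name masked_store_model, the flat list used as memory, and the packed integer
mask are all hypothetical, and the assumption that the CMOV selects between
the loaded value and the new element is a guess based on the "as if"
description. It only shows why the uOP count grows linearly with the number of
packed elements.

```python
def masked_store_model(memory, base, values, mask_bits):
    """Toy model of a conditional SIMD store (hypothetical, not real microcode).

    memory    -- list acting as element-addressed memory
    base      -- index of the first destination element
    values    -- elements extracted from the input XMM/YMM register
    mask_bits -- integer with bit i set if element i must be stored
                 (stands in for the result of the VMOVMSK-like uOP)
    Returns the list of integer-unit uOP names issued by the model.
    """
    uops = []
    for i, value in enumerate(values):
        addr = base + i
        old = memory[addr]               # speculative load of the old element
        uops.append("LOAD")
        keep = (mask_bits >> i) & 1      # extract the mask bit for element i
        uops.append("BIT_EXTRACT")
        selected = value if keep else old  # assumed role of the CMOV
        uops.append("CMOV")
        memory[addr] = selected          # masked-off elements keep their old value
        uops.append("STORE")
    return uops
```

For example, calling the model with four elements and mask 0b0101 on a
zero-filled memory updates only elements 0 and 2 and issues four uOPs per
element. It does not model fault suppression or the VMASKMOVDQU special case.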

This patch replaces the existing writes for conditional SIMD stores (i.e.
WriteFMaskedStore, and WriteFMaskedStoreY) with the following new writes:

  WriteFMaskedStore32  [ XMM Packed Single ]
  WriteFMaskedStore32Y [ YMM Packed Single ]
  WriteFMaskedStore64  [ XMM Packed Double ]
  WriteFMaskedStore64Y [ YMM Packed Double ]

Added a wrapper class named X86SchedWriteMaskMove in X86Schedule.td to describe
both RM and MR variants for conditional SIMD moves in a single tablegen
definition.
Instances of that class are then passed as input to multiclass avx_movmask_rm
when constructing MASKMOVPS/PD definitions.

Since this patch introduces new writes, I had to update all the X86 scheduling
models.

Differential Revision: https://reviews.llvm.org/D66801

llvm-svn: 370649
2019-09-02 12:32:28 +00:00
..
Atom [MCA][X86] Add tests for LOCK variants of standard X86 arithmetic ops 2019-08-20 11:13:20 +00:00
Barcelona [MCA][X86] Add tests for LOCK variants of standard X86 arithmetic ops 2019-08-20 11:13:20 +00:00
BdVer2 [MCA][X86] Add tests for LOCK variants of standard X86 arithmetic ops 2019-08-20 11:13:20 +00:00
Broadwell [MCA][X86] Add tests for LOCK variants of standard X86 arithmetic ops 2019-08-20 11:13:20 +00:00
BtVer2 [X86][BtVer2] Fix latency and throughput of conditional SIMD store instructions. 2019-09-02 12:32:28 +00:00
Generic [MCA][X86] Add tests for LOCK variants of standard X86 arithmetic ops 2019-08-20 11:13:20 +00:00
Haswell [MCA][X86] Add tests for LOCK variants of standard X86 arithmetic ops 2019-08-20 11:13:20 +00:00
SLM [MCA][X86] Add tests for LOCK variants of standard X86 arithmetic ops 2019-08-20 11:13:20 +00:00
SandyBridge [MCA][X86] Add tests for LOCK variants of standard X86 arithmetic ops 2019-08-20 11:13:20 +00:00
SkylakeClient [MCA][X86] Add tests for LOCK variants of standard X86 arithmetic ops 2019-08-20 11:13:20 +00:00
SkylakeServer [MCA][X86] Add tests for LOCK variants of standard X86 arithmetic ops 2019-08-20 11:13:20 +00:00
Znver1 [MCA][X86] Add tests for LOCK variants of standard X86 arithmetic ops 2019-08-20 11:13:20 +00:00
bextr-read-after-ld.s [X86] AMD Piledriver (BdVer2): major cleanup (mainly inverse throughput) 2019-05-09 13:54:51 +00:00
bzhi-read-after-ld.s
cpus.s [NFC][MCA][X86] Add baseline test coverage for AMD Barcelona (aka K10, fam10h) 2019-06-15 16:12:05 +00:00
default-iterations.s
dispatch_width.s
fma3-read-after-ld-1.s
fma3-read-after-ld-2.s
in-order-cpu.s
intel-syntax.s [X86][BtVer2] Fix latency/throughput of scalar integer MUL instructions. 2019-08-22 15:20:16 +00:00
invalid-assembly-sequence.s
invalid-cpu.s
invalid-empty-file.s
lit.local.cfg [lit] Delete empty lines at the end of lit.local.cfg NFC 2019-06-17 09:51:07 +00:00
llvm-mca-markers-1.s
llvm-mca-markers-2.s [MCA] Don't add a name to the default code region. 2019-05-08 11:00:43 +00:00
llvm-mca-markers-3.s
llvm-mca-markers-4.s
llvm-mca-markers-5.s
llvm-mca-markers-6.s [MCA] Add support for nested and overlapping region markers 2019-05-09 15:18:09 +00:00
llvm-mca-markers-7.s [MCA] Add support for nested and overlapping region markers 2019-05-09 15:18:09 +00:00
llvm-mca-markers-8.s [MCA] Add support for nested and overlapping region markers 2019-05-09 15:18:09 +00:00
llvm-mca-markers-9.s [X86][BtVer2] Fix latency/throughput of scalar integer MUL instructions. 2019-08-22 15:20:16 +00:00
llvm-mca-markers-10.s [X86][BtVer2] Fix latency/throughput of scalar integer MUL instructions. 2019-08-22 15:20:16 +00:00
llvm-mca-markers-11.s [MCA] Add support for nested and overlapping region markers 2019-05-09 15:18:09 +00:00
llvm-mca-markers-12.s [MCA] Add support for nested and overlapping region markers 2019-05-09 15:18:09 +00:00
no-sched-model.s
option-all-stats-1.s [llvm-mca][scheduler-stats] Print issued micro opcodes per cycle. NFCI 2019-04-08 16:05:54 +00:00
option-all-stats-2.s [llvm-mca][scheduler-stats] Print issued micro opcodes per cycle. NFCI 2019-04-08 16:05:54 +00:00
option-all-views-1.s [MCA][Bottleneck Analysis] Teach how to compute a critical sequence of instructions based on the simulation. 2019-06-21 13:32:54 +00:00
option-all-views-2.s [MCA][Bottleneck Analysis] Teach how to compute a critical sequence of instructions based on the simulation. 2019-06-21 13:32:54 +00:00
option-no-stats-1.s [MCA][Bottleneck Analysis] Teach how to compute a critical sequence of instructions based on the simulation. 2019-06-21 13:32:54 +00:00
print-imm-hex-1.s [MCA] Add support for printing immedate values as hex. Also enable lexing of masm binary and hex literals. 2019-08-02 10:38:25 +00:00
print-imm-hex-2.s [MCA] Add support for printing immedate values as hex. Also enable lexing of masm binary and hex literals. 2019-08-02 10:38:25 +00:00
read-after-ld-1.s [NFC][MCA][X86] Add baseline test coverage for AMD Barcelona (aka K10, fam10h) 2019-06-15 16:12:05 +00:00
read-after-ld-2.s
read-after-ld-3.s
register-file-statistics.s [NFC][MCA][X86] Add baseline test coverage for AMD Barcelona (aka K10, fam10h) 2019-06-15 16:12:05 +00:00
scheduler-queue-usage.s [NFC][MCA][X86] Add baseline test coverage for AMD Barcelona (aka K10, fam10h) 2019-06-15 16:12:05 +00:00
show-encoding.s [MCA] Add flag -show-encoding to llvm-mca. 2019-08-09 11:26:27 +00:00
sqrt-rsqrt-rcp-memop.s [NFC][MCA][X86] Add baseline test coverage for AMD Barcelona (aka K10, fam10h) 2019-06-15 16:12:05 +00:00
uop-queue.s [MCA] Add an experimental MicroOpQueue stage. 2019-03-29 12:15:37 +00:00
variable-blend-read-after-ld-1.s [X86] AMD Piledriver (BdVer2): major cleanup (mainly inverse throughput) 2019-05-09 13:54:51 +00:00
variable-blend-read-after-ld-2.s [X86] AMD Piledriver (BdVer2): major cleanup (mainly inverse throughput) 2019-05-09 13:54:51 +00:00