llvm-project

Andrea Di Biagio 528f68144b [X86][BtVer2] Fix latency and throughput of conditional SIMD store instructions.

On BtVer2, conditional SIMD stores are heavily microcoded. The latency is directly proportional to the number of packed elements extracted from the input vector. Also, according to micro-benchmarks, most of the computation seems to be done in the integer unit.

Only a minority of the uOPs are executed by the FPU. The observed behaviour on the FPU looks similar to this:
 - The input MASK value is moved to the Integer Unit (a VMOVMSK-like uOP, executed on JFPU0).
 - In parallel, each element of the input XMM/YMM is extracted and then sent to the Integer Unit through JFPU1.

As expected, a (conditional) store is executed for every extracted element. Interestingly, a (speculative) load is executed for every extracted element too. It is as if a "LOAD - BIT_EXTRACT - CMOV" sequence of uOPs is repeated by the integer unit for every conditionally stored element.

VMASKMOVDQU is a special case: the number of speculative loads is always 2 (presumably, one load per quadword). That means extra shifts and masking are performed on (one of) the loaded quadwords before each conditional store (which also explains the large number of non-FP uOPs retired).

This patch replaces the existing writes for conditional SIMD stores (i.e. WriteFMaskedStore and WriteFMaskedStoreY) with the following new writes:
 - WriteFMaskedStore32  [ XMM Packed Single ]
 - WriteFMaskedStore32Y [ YMM Packed Single ]
 - WriteFMaskedStore64  [ XMM Packed Double ]
 - WriteFMaskedStore64Y [ YMM Packed Double ]

Added a wrapper class named X86SchedWriteMaskMove in X86Schedule.td to describe both RM and MR variants for conditional SIMD moves in a single tablegen definition. Instances of that class are then passed as input to multiclass avx_movmask_rm when constructing MASKMOVPS/PD definitions.

Since this patch introduces new writes, I had to update all the X86 scheduling models.

Differential Revision: https://reviews.llvm.org/D66801
llvm-svn: 370649
2019-09-02 12:32:28 +00:00
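
The per-element "LOAD - BIT_EXTRACT - CMOV" sequence described in the commit message can be illustrated with a small Python model. This is only a sketch of the observed behaviour, not the actual microcode or anything in the LLVM scheduling model: the function name `masked_store_32`, the list-based memory, and the load/store counters are all hypothetical, and the use of the sign bit of each 32-bit mask lane as the store condition is an assumption based on the architectural definition of the masked-move instructions.

```python
def masked_store_32(memory, base, src, mask):
    """Model a conditional store of 32-bit lanes, one lane at a time.

    memory: list of lane-sized values standing in for memory (hypothetical).
    base:   index of the first destination lane in `memory`.
    src:    values of the packed source elements.
    mask:   32-bit integers; the sign bit of each selects the lane (assumption).

    Returns (loads, stores): speculative loads issued and stores committed.
    """
    loads = 0
    stores = 0
    for lane, (value, lane_mask) in enumerate(zip(src, mask)):
        # LOAD: a speculative load happens for every extracted element.
        old = memory[base + lane]
        loads += 1

        # BIT_EXTRACT: isolate the sign bit of the mask lane.
        bit = (lane_mask >> 31) & 1

        # CMOV: select between the new value and the loaded one.
        selected = value if bit else old

        # Conditional store: memory is written only for selected lanes.
        if bit:
            memory[base + lane] = selected
            stores += 1
    return loads, stores
```

In this model the number of loads grows with the number of packed elements, which mirrors the commit's observation that latency is proportional to the element count. For example, calling it with four lanes and a mask that selects lanes 0 and 2 performs four loads and commits two stores.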
add-sequence.s	…
bottleneck-hints-1.s	[MCA][Bottleneck Analysis] Teach how to compute a critical sequence of instructions based on the simulation.	2019-06-21 13:32:54 +00:00
bottleneck-hints-2.s	[MCA][Bottleneck Analysis] Teach how to compute a critical sequence of instructions based on the simulation.	2019-06-21 13:32:54 +00:00
bottleneck-hints-3.s	[MCA][Bottleneck Analysis] Teach how to compute a critical sequence of instructions based on the simulation.	2019-06-21 13:32:54 +00:00
bottleneck-hints-4.s	[MCA][Bottleneck Analysis] Teach how to compute a critical sequence of instructions based on the simulation.	2019-06-21 13:32:54 +00:00
bottleneck-hints-none.s	[llvm-mca] Enable bottleneck analysis when flag -all-views is specified.	2019-06-10 16:56:25 +00:00
clear-super-register-1.s	[X86][BtVer2] Fix latency/throughput of scalar integer MUL instructions.	2019-08-22 15:20:16 +00:00
clear-super-register-2.s	…
cmpxchg-read-advance.s	[X86][BtVer2] Add a read-advance to every implicit register use of CMPXCHG8B/16B.	2019-08-23 12:19:45 +00:00
dependency-breaking-cmp.s	…
dependency-breaking-pcmpeq.s	…
dependency-breaking-pcmpgt.s	…
dependency-breaking-sbb-1.s	…
dependency-breaking-sbb-2.s	[X86][BtVer2] Fix latency/throughput of scalar integer MUL instructions.	2019-08-22 15:20:16 +00:00
dependent-pmuld-paddd.s	…
dot-product.s	…
hadd-read-after-ld-1.s	…
hadd-read-after-ld-2.s	…
instruction-info-view.s	…
int-to-fpu-forwarding-1.s	[MC][X86] Correctly model additional operand latency caused by transfer delays from the integer to the floating point unit.	2019-01-23 16:35:07 +00:00
int-to-fpu-forwarding-2.s	[X86] Remove the suffix on vcvt[u]si2ss/sd register variants in assembly printing.	2019-05-06 21:39:51 +00:00
int-to-fpu-forwarding-3.s	[MC][X86] Correctly model additional operand latency caused by transfer delays from the integer to the floating point unit.	2019-01-23 16:35:07 +00:00
load-store-alias.s	…
memcpy-like-test.s	…
one-idioms.s	…
partial-reg-update-2.s	[X86][BtVer2] Fix latency/throughput of scalar integer MUL instructions.	2019-08-22 15:20:16 +00:00
partial-reg-update-3.s	…
partial-reg-update-4.s	[X86][BtVer2] Fix latency/throughput of scalar integer MUL instructions.	2019-08-22 15:20:16 +00:00
partial-reg-update-5.s	…
partial-reg-update-6.s	[X86][BtVer2] Fix latency/throughput of scalar integer MUL instructions.	2019-08-22 15:20:16 +00:00
partial-reg-update-7.s	[X86][BtVer2] Fix latency/throughput of scalar integer MUL instructions.	2019-08-22 15:20:16 +00:00
partial-reg-update.s	[X86][BtVer2] Fix latency/throughput of scalar integer MUL instructions.	2019-08-22 15:20:16 +00:00
pipes-fpu.s	…
pr37790.s	[X86] Add missing properties on llvm.x86.sse.{st,ld}mxcsr	2019-06-19 08:44:31 +00:00
rank.s	…
rcu-statistics.s	…
read-advance-1.s	…
read-advance-2.s	[X86][BtVer2] Fix latency/throughput of scalar integer MUL instructions.	2019-08-22 15:20:16 +00:00
read-advance-3.s	…
reg-move-elimination-1.s	…
reg-move-elimination-2.s	…
reg-move-elimination-3.s	…
reg-move-elimination-4.s	…
reg-move-elimination-5.s	…
reg-move-elimination-6.s	[MCA] Correctly update register definitions in the PRF after move elimination.	2019-02-18 14:15:25 +00:00
register-files-1.s	…
register-files-2.s	…
register-files-3.s	…
register-files-4.s	[MCA] Always check if scheduler resources are unavailable when reporting dispatch stalls.	2019-02-26 14:19:00 +00:00
register-files-5.s	…
resources-aes.s	…
resources-avx1.s	[X86][BtVer2] Fix latency and throughput of conditional SIMD store instructions.	2019-09-02 12:32:28 +00:00
resources-bmi1.s	[llvm-mca][X86] Add missing tzcntw tests	2019-01-22 14:53:52 +00:00
resources-cmov.s	…
resources-cmpxchg.s	[X86][Btver2] Fix latency and throughput of CMPXCHG instructions.	2019-08-20 10:23:55 +00:00
resources-f16c.s	…
resources-lea.s	…
resources-lzcnt.s	…
resources-mmx.s	…
resources-movbe.s	…
resources-pclmul.s	…
resources-popcnt.s	…
resources-prefetchw.s	…
resources-sse1.s	[X86] Add missing properties on llvm.x86.sse.{st,ld}mxcsr	2019-06-19 08:44:31 +00:00
resources-sse2.s	[X86][BtVer2] Fix latency and throughput of conditional SIMD store instructions.	2019-09-02 12:32:28 +00:00
resources-sse3.s	[llvm-mca][X86] Add missing monitor/mwait tests	2019-01-22 15:48:16 +00:00
resources-sse4a.s	…
resources-sse41.s	…
resources-sse42.s	…
resources-ssse3.s	[X86][BtVer2] Update latency of mmx horizontal operations	2019-01-21 18:04:25 +00:00
resources-x86_32.s	…
resources-x86_64.s	[X86][BtVer2] Fix latency of ALU RMW instructions.	2019-08-23 11:34:10 +00:00
resources-x87.s	[X86] Print all register forms of x87 fadd/fsub/fdiv/fmul as having two arguments where on is %st.	2019-02-04 17:28:18 +00:00
scheduler-queue-usage.s	[llvm-mca][scheduler-stats] Print issued micro opcodes per cycle. NFCI	2019-04-08 16:05:54 +00:00
simple-test.s	…
unsupported-instruction.s	…
vbroadcast-operand-latency.s	…
vec-logic-read-after-ld-1.s	…
vec-logic-read-after-ld-2.s	…
xadd.s	[X86][BtVer2] Fix latency/throughput of scalar integer MUL instructions.	2019-08-22 15:20:16 +00:00
xchg.s	[X86][BtVer2] Fix latency/throughput of scalar integer MUL instructions.	2019-08-22 15:20:16 +00:00
zero-idioms-avx-256.s	…
zero-idioms.s	…