forked from OSchip/llvm-project
On BtVer2, conditional SIMD stores are heavily microcoded. The latency is directly proportional to the number of packed elements extracted from the input vector. Also, according to micro-benchmarks, most of the computation seems to be done in the integer unit. Only a minority of the uOPs are executed by the FPU.

The observed behaviour on the FPU looks similar to this:

- The input MASK value is moved to the Integer Unit (a VMOVMSK-like uOP executed on JFPU0).
- In parallel, each element of the input XMM/YMM is extracted and then sent to the Integer Unit through JFPU1.

As expected, a (conditional) store is executed for every extracted element. Interestingly, a (speculative) load is executed for every extracted element too. It is as if a "LOAD - BIT_EXTRACT - CMOV" sequence of uOPs is repeated by the integer unit for every conditionally stored element.

VMASKMOVDQU is a special case: the number of speculative loads is always 2 (presumably, one load per quadword). That means extra shifts and masking are performed on (one of) the loaded quadwords before each conditional store, which also explains the large number of non-FP uOPs retired.

This patch replaces the existing writes for conditional SIMD stores (i.e. WriteFMaskedStore and WriteFMaskedStoreY) with the following new writes:

- WriteFMaskedStore32 [ XMM Packed Single ]
- WriteFMaskedStore32Y [ YMM Packed Single ]
- WriteFMaskedStore64 [ XMM Packed Double ]
- WriteFMaskedStore64Y [ YMM Packed Double ]

Added a wrapper class named X86SchedWriteMaskMove in X86Schedule.td to describe both RM and MR variants for conditional SIMD moves in a single tablegen definition. Instances of that class are then passed as input to the multiclass avx_movmask_rm when constructing MASKMOVPS/PD definitions.

Since this patch introduces new writes, I had to update all the X86 scheduling models.

Differential Revision: https://reviews.llvm.org/D66801

llvm-svn: 370649

||
---|---|---|
.. | ||
Atom | ||
Barcelona | ||
BdVer2 | ||
Broadwell | ||
BtVer2 | ||
Generic | ||
Haswell | ||
SLM | ||
SandyBridge | ||
SkylakeClient | ||
SkylakeServer | ||
Znver1 | ||
bextr-read-after-ld.s | ||
bzhi-read-after-ld.s | ||
cpus.s | ||
default-iterations.s | ||
dispatch_width.s | ||
fma3-read-after-ld-1.s | ||
fma3-read-after-ld-2.s | ||
in-order-cpu.s | ||
intel-syntax.s | ||
invalid-assembly-sequence.s | ||
invalid-cpu.s | ||
invalid-empty-file.s | ||
lit.local.cfg | ||
llvm-mca-markers-1.s | ||
llvm-mca-markers-2.s | ||
llvm-mca-markers-3.s | ||
llvm-mca-markers-4.s | ||
llvm-mca-markers-5.s | ||
llvm-mca-markers-6.s | ||
llvm-mca-markers-7.s | ||
llvm-mca-markers-8.s | ||
llvm-mca-markers-9.s | ||
llvm-mca-markers-10.s | ||
llvm-mca-markers-11.s | ||
llvm-mca-markers-12.s | ||
no-sched-model.s | ||
option-all-stats-1.s | ||
option-all-stats-2.s | ||
option-all-views-1.s | ||
option-all-views-2.s | ||
option-no-stats-1.s | ||
print-imm-hex-1.s | ||
print-imm-hex-2.s | ||
read-after-ld-1.s | ||
read-after-ld-2.s | ||
read-after-ld-3.s | ||
register-file-statistics.s | ||
scheduler-queue-usage.s | ||
show-encoding.s | ||
sqrt-rsqrt-rcp-memop.s | ||
uop-queue.s | ||
variable-blend-read-after-ld-1.s | ||
variable-blend-read-after-ld-2.s |
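
The "LOAD - BIT_EXTRACT - CMOV" sequence described in the commit message can be pictured with a small scalar model. The following is only an illustrative sketch, not code from the patch: the name `maskedStoreModel` is hypothetical, the element type is fixed to 32 bits (the packed-single case), and the model writes the selected value back unconditionally, which gives the same final memory contents as a conditional store in this single-threaded setting but ignores fault suppression. What it is meant to show is that the work grows linearly with the number of packed elements.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of a conditional SIMD store with 32-bit elements.
// For every element it performs one load, one mask-bit extract, one select and
// one store, mirroring the per-element uOP sequence described in the commit
// message. This is an assumption-based illustration, not the hardware's
// actual microcode.
template <std::size_t N>
void maskedStoreModel(std::array<std::uint32_t, N> &mem,
                      const std::array<std::uint32_t, N> &src,
                      const std::array<std::uint32_t, N> &mask) {
  for (std::size_t i = 0; i < N; ++i) {
    std::uint32_t old = mem[i];                    // speculative LOAD
    bool selected = ((mask[i] >> 31) & 1u) != 0;   // BIT_EXTRACT of the sign bit
    std::uint32_t value = selected ? src[i] : old; // CMOV-like select
    mem[i] = value;                                // write-back (simplified store)
  }
}
```

With `N = 4` the loop body runs four times (the XMM packed-single case) and with `N = 8` eight times (the YMM case), which matches the commit's motivation for giving each element width and vector size its own scheduling write.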