llvm-project

Commit Graph

Author	SHA1	Message	Date
Andrea Di Biagio	815cdbff29	[X86][Btver2] Improved latency/throughput model for scalar int-to-float conversions. Account for bypass delays when computing the latency of scalar int-to-float conversions. On Jaguar we need to account for an extra 6cy latency (see AMD fam16h SOG). This patch also fixes the number of micropcodes for the register-memory variants of scalar int-to-float conversions. Differential Revision: https://reviews.llvm.org/D57148 llvm-svn: 352518	2019-01-29 16:47:27 +00:00
Roman Lebedev	661577466e	[NFC][MCA][X86][BdVer2] Cherry-pick int-to-ivec forwarding tests from BtVer2 llvm-svn: 352317	2019-01-27 14:35:54 +00:00
Simon Pilgrim	c9d33907ef	[llvm-mca][X86] Add some missing DQI tests Match more of the coverage of test\CodeGen\X86\avx512-schedule.ll as discussed on D57244 llvm-svn: 352273	2019-01-26 13:00:46 +00:00
Simon Pilgrim	d36f7730cd	[llvm-mca][X86] Add missing shuffle tests Match the coverage of test\CodeGen\X86\avx512-shuffle-schedule.ll so we can get rid of -print-schedule (and fix PR37160) without losing schedule tests llvm-svn: 352179	2019-01-25 09:17:30 +00:00
Andrea Di Biagio	d768d35515	[MC][X86] Correctly model additional operand latency caused by transfer delays from the integer to the floating point unit. This patch adds a new ReadAdvance definition named ReadInt2Fpu. ReadInt2Fpu allows x86 scheduling models to accurately describe delays caused by data transfers from the integer unit to the floating point unit. ReadInt2Fpu currently defaults to a delay of zero cycles (i.e. no delay) for all x86 models excluding BtVer2. That means, this patch is only a functional change for the Jaguar cpu model only. Tablegen definitions for instructions (V)PINSR* have been updated to account for the new ReadInt2Fpu. That read is mapped to the the GPR input operand. On Jaguar, int-to-fpu transfers are modeled as a +6cy delay. Before this patch, that extra delay was added to the opcode latency. In practice, the insert opcode only executes for 1cy. Most of the actual latency is actually contributed by the so-called operand-latency. According to the AMD SOG for family 16h, (V)PINSR* latency is defined by expression f+1, where f is defined as a forwarding delay from the integer unit to the fpu. When printing instruction latency from MCA (see InstructionInfoView.cpp) and LLC (only when flag -print-schedule is speified), we now need to account for any extra forwarding delays. We do this by checking if scheduling classes declare any negative ReadAdvance entries. Quoting a code comment in TargetSchedule.td: "A negative advance effectively increases latency, which may be used for cross-domain stalls". When computing the instruction latency for the purpose of our scheduling tests, we now add any extra delay to the formula. This avoids regressing existing codegen and mca schedule tests. It comes with the cost of an extra (but very simple) hook in MCSchedModel. Differential Revision: https://reviews.llvm.org/D57056 llvm-svn: 351965	2019-01-23 16:35:07 +00:00
Simon Pilgrim	922b540643	[llvm-mca][X86] Tidyup avx512 placeholder tests Ensure we keep avx512f/bw/dq + vl versions separate, add example broadcast tests - this should allow us to better the test coverage of test\CodeGen\X86\avx512-schedule.ll llvm-svn: 351848	2019-01-22 17:52:15 +00:00
Simon Pilgrim	8e11254132	[llvm-mca][X86] Add VPOPCNTDQ tests Matches test coverage of test\CodeGen\X86\avx512vpopcntdq-schedule.ll llvm-svn: 351842	2019-01-22 17:19:44 +00:00
Simon Pilgrim	90fa50d928	[llvm-mca][X86] Add missing CLWB/CLZERO/FSGSBASE/LWP/MWAITX/RDPID/SHA tests We're getting pretty close to matching/exceeding test coverage of the test\CodeGen\X86\*-schedule.ll files, which should allow us to get rid of -print-schedule and fix PR37160 llvm-svn: 351836	2019-01-22 16:39:28 +00:00
Simon Pilgrim	fc4b1e841e	[llvm-mca][X86] Add missing enter/leave, invlpg/invlpga, rdmsr/wrmsr, rdpmc and rdtsc/rdtscp tests llvm-svn: 351835	2019-01-22 16:29:26 +00:00
Simon Pilgrim	4e03b2496d	[llvm-mca][X86] Add missing mfence/pinsrw tests llvm-svn: 351831	2019-01-22 16:01:08 +00:00
Simon Pilgrim	05198a9b8a	[llvm-mca][X86] Add missing monitor/mwait tests These technically should be under a MONITOR cpuid bit, but we tag them as SSE3 so I've done that here as well. llvm-svn: 351829	2019-01-22 15:48:16 +00:00
Simon Pilgrim	9b3a2f96a1	[llvm-mca][X86] Add missing vperm2i128 tests llvm-svn: 351828	2019-01-22 14:54:24 +00:00
Simon Pilgrim	1d8d6c3bfb	[llvm-mca][X86] Add missing tzcntw tests llvm-svn: 351827	2019-01-22 14:53:52 +00:00
Andrea Di Biagio	a4d1ffc269	[MCA] Add tests for int-to-fpu transfer delays. NFC llvm-svn: 351822	2019-01-22 13:59:08 +00:00
Simon Pilgrim	aa6a4339ac	[X86][BtVer2] SSE2 vector shifts has local forwarding disabled Similar to horizontal ops on D56777, the sse2 (but not mmx) bit shift ops has local forwarding disabled, adding +1cy to the use latency for the result. Differential Revision: https://reviews.llvm.org/D57026 llvm-svn: 351817	2019-01-22 13:27:18 +00:00
Simon Pilgrim	2c69f90171	[X86][BtVer2] X86ISD::VPERMILPV has local forwarding disabled Similar to horizontal ops on D56777, the vpermilpd/vpermilps variable mask ops has local forwarding disabled, adding +1cy to the use latency for the result. Differential Revision: https://reviews.llvm.org/D57022 llvm-svn: 351815	2019-01-22 13:13:57 +00:00
Simon Pilgrim	9b73ae96c5	[X86][BtVer2] Update latency of mmx horizontal operations D56777 added +1cy local forwarding penalty for horizontal operations, but this penalty only affects sse2/xmm variants, the mmx variants don't suffer the penalty. Confirmed with @andreadb llvm-svn: 351755	2019-01-21 18:04:25 +00:00
Andrea Di Biagio	b68dd05c14	[X86][BtVer2] Update the WriteLoad latency. r327630 introduced new write definitions for float/vector loads. Before that revision, WriteLoad was used by both integer/float (scalar/vector) load. So, WriteLoad had to conservatively declare a latency to 5cy. That is because the load-to-use latency for float/vector load is 5cy. Now that we have dedicated writes for float/vector loads, there is no reason why we should keep the latency of WriteLoad to 5cy. At the moment, WriteLoad is only used by scalar integer loads only; we can assume an optimstic 3cy latency for them. This patch changes that latency from 5cy to 3cy, and regenerates the affected scheduling/mca tests. Differential Revision: https://reviews.llvm.org/D56922 llvm-svn: 351742	2019-01-21 12:04:10 +00:00
Andrea Di Biagio	c5f0f5309e	[X86][BtVer2] Update latency of horizontal operations. On Jaguar, horizontal adds/subs have local forwarding disable. That means, we pay a compulsory extra cycle of write-back stage, and the value is not available until the end of that stage. This patch changes the latency of horizontal operations by adding an extra cycle. With this patch, latency numbers now match what is reported by perf. I plan to send another patch to also 'fix' the latency of shuffle operations (on Jaguar, local forwarding is disabled for vector shuffles too). Differential Revision: https://reviews.llvm.org/D56777 llvm-svn: 351366	2019-01-16 18:18:01 +00:00
Simon Pilgrim	99c139f4dc	[llvm-mca][x86] Add RDSEED instruction resource tests for GLM llvm-svn: 348624	2018-12-07 18:37:40 +00:00
Simon Pilgrim	c703ce35b8	[llvm-mca][x86] Add missing AES instruction resource tests Add missing non-VEX instructions llvm-svn: 348623	2018-12-07 18:35:54 +00:00
Simon Pilgrim	c4e2776f3b	[llvm-mca][x86] Add RDRAND/RDSEED instruction resource tests llvm-svn: 348622	2018-12-07 18:29:47 +00:00
Andrea Di Biagio	373a4ccf6c	[llvm-mca][MC] Add the ability to declare which processor resources model load/store queues (PR36666). This patch adds the ability to specify via tablegen which processor resources are load/store queue resources. A new tablegen class named MemoryQueue can be optionally used to mark resources that model load/store queues. Information about the load/store queue is collected at 'CodeGenSchedule' stage, and analyzed by the 'SubtargetEmitter' to initialize two new fields in struct MCExtraProcessorInfo named `LoadQueueID` and `StoreQueueID`. Those two fields are identifiers for buffered resources used to describe the load queue and the store queue. Field `BufferSize` is interpreted as the number of entries in the queue, while the number of units is a throughput indicator (i.e. number of available pickers for loads/stores). At construction time, LSUnit in llvm-mca checks for the presence of extra processor information (i.e. MCExtraProcessorInfo) in the scheduling model. If that information is available, and fields LoadQueueID and StoreQueueID are set to a value different than zero (i.e. the invalid processor resource index), then LSUnit initializes its LoadQueue/StoreQueue based on the BufferSize value declared by the two processor resources. With this patch, we more accurately track dynamic dispatch stalls caused by the lack of LS tokens (i.e. load/store queue full). This is also shown by the differences in two BdVer2 tests. Stalls that were previously classified as generic SCHEDULER FULL stalls, are not correctly classified either as "load queue full" or "store queue full". About the differences in the -scheduler-stats view: those differences are expected, because entries in the load/store queue are not released at instruction issue stage. Instead, those are released at instruction executed stage. This is the main reason why for the modified tests, the load/store queues gets full before PdEx is full. Differential Revision: https://reviews.llvm.org/D54957 llvm-svn: 347857	2018-11-29 12:15:56 +00:00
Andrea Di Biagio	7a7588990b	[llvm-mca] pass -dispatch-stats flag to a couple of tests. NFC This change is in preparation for a patch that fixes PR36666. llvm-mca currently doesn't know if a buffered processor resource describes a load or store queue. So, any dynamic dispatch stall caused by the lack of load/store queue entries is normally reported as a generic SCHEDULER stall. See for example the -dispatch-stats output from the two tests modified by this patch. In future, processor models will be able to tag processor resources that are used to describe load/store queues. That information would then be used by llvm-mca to correctly classify dynamic dispatch stalls caused by the lack of tokens in the LS. llvm-svn: 347662	2018-11-27 15:56:00 +00:00
Andrea Di Biagio	07a8255a78	[llvm-mca][View] Improved Retire Control Unit Statistics. RetireControlUnitStatistics now reports extra information about the ROB and the avg/maximum number of entries consumed over the entire simulation. Example: Retire Control Unit - number of cycles where we saw N instructions retired: [# retired], [# cycles] 0, 109 (17.9%) 1, 102 (16.7%) 2, 399 (65.4%) Total ROB Entries: 64 Max Used ROB Entries: 35 ( 54.7% ) Average Used ROB Entries per cy: 32 ( 50.0% ) Documentation in llvm/docs/CommandGuide/llvmn-mca.rst has been updated to reflect this change. llvm-svn: 347493	2018-11-23 12:12:57 +00:00
Andrea Di Biagio	1cb8a3c690	[llvm-mca] Fix an invalid memory read introduced by r346487. This patch fixes an invalid memory read introduced by r346487. Before this patch, partial register write had to query the latency of the dependent full register write by calling a method on the full write descriptor. However, if the full write is from an already retired instruction, chances are that the EntryStage already reclaimed its memory. In some parial register write tests, valgrind was reporting an invalid memory read. This change fixes the invalid memory access problem. Writes are now responsible for tracking dependent partial register writes, and notify them in the event of instruction issued. That means, partial register writes no longer need to query their associated full write to check when they are ready to execute. Added test X86/BtVer2/partial-reg-update-7.s llvm-svn: 347459	2018-11-22 12:48:57 +00:00
Andrea Di Biagio	dda9032314	[llvm-mca] Correctly update the resource strategy for processor resources with multiple units. When looking at the tests committed by Roman at r346587, I noticed that numbers reported by the resource pressure for PdAGU01 were wrong. In particular, according to the aut-generated CHECK lines in tests memcpy-like-test.s and store-throughput.s, resource pressure for PdAGU01 was not uniformly distributed among the two AGEN pipes. It turns out that the reason why pressure was not correctly distributed, was because the "resource selection strategy" object associated with PdAGU01 was not correctly updated on the event of AGEN pipe used. As a result, llvm-mca was not simulating a round-robin pipeline allocation for PdAGU01. Instead, PdAGU1 was always prioritized over PdAGU0. This patch fixes the issue; now processor resource strategy objects for resources declaring multiple units, are correctly notified in the event of "resource used". llvm-svn: 346650	2018-11-12 13:09:39 +00:00
Roman Lebedev	b428b8b214	[X86][BdVer2] Fix loads/stores throughput for Piledriver (PR39465) There are two AGU units, and per 1cy, there can be either two loads, or a load and a store; but not two stores, or two loads and a store. Additionally, loads shouldn't affect the store scheduler and vice versa. (but should affect the PdEX scheduler.) Required rL346545. Fixes https://bugs.llvm.org/show_bug.cgi?id=39465 llvm-svn: 346587	2018-11-10 14:31:43 +00:00
Roman Lebedev	e105b655a2	[NFC][MCA][BdVer2] Add bdver2 runline into register-file-statistics.s test Missed this one by accident when adding the initial version in rL345463 / rL345462 llvm-svn: 346585	2018-11-10 10:56:58 +00:00
Clement Courbet	e6b727e552	[X86] Fix VZEROUPPER scheduling info on SNB,HSW,BDW,SXL,SKX. Summary: Starting from SNB, VZEROUPPER is handled by the renamer and uses no proc resources. After HSW, it also has zero latency. This fixes PR35606. To reproduce: Uops: llvm-exegesis -mode=uops -opcode-name=VZEROUPPER Latency: echo -e '#LLVM-EXEGESIS-DEFREG XMM0 1\n#LLVM-EXEGESIS-DEFREG XMM1 1\nvzeroupper' \| /tmp/llvm-exegesis -mode=latency -snippets-file=- echo -e '#LLVM-EXEGESIS-DEFREG XMM0 1\n#LLVM-EXEGESIS-DEFREG XMM1 1\nvzeroupper\naddps %xmm0, %xmm1' \| /tmp/llvm-exegesis -mode=latency -snippets-file=- Reviewers: RKSimon, craig.topper, andreadb Subscribers: gbedwell, llvm-commits Differential Revision: https://reviews.llvm.org/D54107 llvm-svn: 346482	2018-11-09 09:49:06 +00:00
Roman Lebedev	3817292069	[NFC][BdVer2] Load and store throughput tests: also check sched stats (PR39465) As noted by Andrea Di Biagio in https://bugs.llvm.org/show_bug.cgi?id=39465 both the loads and stores occupy both the store and load queues. This is clearly wrong. llvm-svn: 346425	2018-11-08 18:15:58 +00:00
Roman Lebedev	2ad16b9371	[NFC][BdVer2] Tests for load and store throughput (PR39465) During review it was noted that while it appears that the Piledriver can do two [consecutive] loads per cycle, it can only do one store per cycle. It was suggested that the sched model incorrectly models that, but it was opted to fix this afterwards. These tests show that the two consecutive loads are modelled correctly, and one consecutive stores is not modelled incorrectly. Unless i'm missing the point. https://bugs.llvm.org/show_bug.cgi?id=39465 llvm-svn: 346404	2018-11-08 14:48:56 +00:00
Andrea Di Biagio	fe3bc1b9bf	[llvm-mca] Add extra counters for move elimination in view RegisterFileStatistics. This patch teaches view RegisterFileStatistics how to report events for optimizable register moves. For each processor register file, view RegisterFileStatistics reports the following extra information: - Number of optimizable register moves - Number of register moves eliminated - Number of zero moves (i.e. register moves that propagate a zero) - Max Number of moves eliminated per cycle. Differential Revision: https://reviews.llvm.org/D53976 llvm-svn: 345865	2018-11-01 18:04:39 +00:00
Roman Lebedev	a5baf86744	AMD BdVer2 (Piledriver) Initial Scheduler model Summary: # Overview This is somewhat partial. * Latencies are good {F7371125} * All of these remaining inconsistencies //appear// to be noise/noisy/flaky. * NumMicroOps are somewhat good {F7371158} * Most of the remaining inconsistencies are from `Ld` / `Ld_ReadAfterLd` classes * Actual unit occupation (pipes, `ResourceCycles`) are undiscovered lands, i did not really look there. They are basically verbatum copy from `btver2` * Many `InstRW`. And there are still inconsistencies left... To be noted: I think this is the first new schedule profile produced with the new next-gen tools like llvm-exegesis! # Benchmark I realize that isn't what was suggested, but i'll start with some "internal" public real-world benchmark i understand - [[ https://github.com/darktable-org/rawspeed \| RawSpeed raw image decoding library ]]. Diff (the exact clang from trunk without/with this patch): ``` Comparing /home/lebedevri/rawspeed/build-old/src/utilities/rsbench/rsbench to /home/lebedevri/rawspeed/build-new/src/utilities/rsbench/rsbench Benchmark Time CPU Time Old Time New CPU Old CPU New ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Canon/EOS 5D Mark II/09.canon.sraw1.cr2/threads:8/real_time_pvalue 0.0000 0.0000 U Test, Repetitions: 25 vs 25 Canon/EOS 5D Mark II/09.canon.sraw1.cr2/threads:8/real_time_mean -0.0607 -0.0604 234 219 233 219 Canon/EOS 5D Mark II/09.canon.sraw1.cr2/threads:8/real_time_median -0.0630 -0.0626 233 219 233 219 Canon/EOS 5D Mark II/09.canon.sraw1.cr2/threads:8/real_time_stddev +0.2581 +0.2587 1 2 1 2 Canon/EOS 5D Mark II/10.canon.sraw2.cr2/threads:8/real_time_pvalue 0.0000 0.0000 U Test, Repetitions: 25 vs 25 Canon/EOS 5D Mark II/10.canon.sraw2.cr2/threads:8/real_time_mean -0.0770 -0.0767 144 133 144 133 Canon/EOS 5D Mark II/10.canon.sraw2.cr2/threads:8/real_time_median -0.0767 -0.0763 144 133 144 133 Canon/EOS 5D Mark II/10.canon.sraw2.cr2/threads:8/real_time_stddev -0.4170 -0.4156 1 0 1 0 Canon/EOS 5DS/2K4A9927.CR2/threads:8/real_time_pvalue 0.0000 0.0000 U Test, Repetitions: 25 vs 25 Canon/EOS 5DS/2K4A9927.CR2/threads:8/real_time_mean -0.0271 -0.0270 463 450 463 450 Canon/EOS 5DS/2K4A9927.CR2/threads:8/real_time_median -0.0093 -0.0093 453 449 453 449 Canon/EOS 5DS/2K4A9927.CR2/threads:8/real_time_stddev -0.7280 -0.7280 13 4 13 4 Canon/EOS 5DS/2K4A9928.CR2/threads:8/real_time_pvalue 0.0004 0.0004 U Test, Repetitions: 25 vs 25 Canon/EOS 5DS/2K4A9928.CR2/threads:8/real_time_mean -0.0065 -0.0065 569 565 569 565 Canon/EOS 5DS/2K4A9928.CR2/threads:8/real_time_median -0.0077 -0.0077 569 564 569 564 Canon/EOS 5DS/2K4A9928.CR2/threads:8/real_time_stddev +1.0077 +1.0068 2 5 2 5 Canon/EOS 5DS/2K4A9929.CR2/threads:8/real_time_pvalue 0.0220 0.0199 U Test, Repetitions: 25 vs 25 Canon/EOS 5DS/2K4A9929.CR2/threads:8/real_time_mean +0.0006 +0.0007 312 312 312 312 Canon/EOS 5DS/2K4A9929.CR2/threads:8/real_time_median +0.0031 +0.0032 311 312 311 312 Canon/EOS 5DS/2K4A9929.CR2/threads:8/real_time_stddev -0.7069 -0.7072 4 1 4 1 Canon/EOS 10D/CRW_7673.CRW/threads:8/real_time_pvalue 0.0004 0.0004 U Test, Repetitions: 25 vs 25 Canon/EOS 10D/CRW_7673.CRW/threads:8/real_time_mean -0.0015 -0.0015 141 141 141 141 Canon/EOS 10D/CRW_7673.CRW/threads:8/real_time_median -0.0010 -0.0011 141 141 141 141 Canon/EOS 10D/CRW_7673.CRW/threads:8/real_time_stddev -0.1486 -0.1456 0 0 0 0 Canon/EOS 40D/_MG_0154.CR2/threads:8/real_time_pvalue 0.6139 0.8766 U Test, Repetitions: 25 vs 25 Canon/EOS 40D/_MG_0154.CR2/threads:8/real_time_mean -0.0008 -0.0005 60 60 60 60 Canon/EOS 40D/_MG_0154.CR2/threads:8/real_time_median -0.0006 -0.0002 60 60 60 60 Canon/EOS 40D/_MG_0154.CR2/threads:8/real_time_stddev -0.1467 -0.1390 0 0 0 0 Canon/EOS 77D/IMG_4049.CR2/threads:8/real_time_pvalue 0.0137 0.0137 U Test, Repetitions: 25 vs 25 Canon/EOS 77D/IMG_4049.CR2/threads:8/real_time_mean +0.0002 +0.0002 275 275 275 275 Canon/EOS 77D/IMG_4049.CR2/threads:8/real_time_median -0.0015 -0.0014 275 275 275 275 Canon/EOS 77D/IMG_4049.CR2/threads:8/real_time_stddev +3.3687 +3.3587 0 2 0 2 Canon/PowerShot G1/crw_1693.crw/threads:8/real_time_pvalue 0.4041 0.3933 U Test, Repetitions: 25 vs 25 Canon/PowerShot G1/crw_1693.crw/threads:8/real_time_mean +0.0004 +0.0004 67 67 67 67 Canon/PowerShot G1/crw_1693.crw/threads:8/real_time_median -0.0000 -0.0000 67 67 67 67 Canon/PowerShot G1/crw_1693.crw/threads:8/real_time_stddev +0.1947 +0.1995 0 0 0 0 Fujifilm/GFX 50S/20170525_0037TEST.RAF/threads:8/real_time_pvalue 0.0074 0.0001 U Test, Repetitions: 25 vs 25 Fujifilm/GFX 50S/20170525_0037TEST.RAF/threads:8/real_time_mean -0.0092 +0.0074 547 542 25 25 Fujifilm/GFX 50S/20170525_0037TEST.RAF/threads:8/real_time_median -0.0054 +0.0115 544 541 25 25 Fujifilm/GFX 50S/20170525_0037TEST.RAF/threads:8/real_time_stddev -0.4086 -0.3486 8 5 0 0 Fujifilm/X-Pro2/_DSF3051.RAF/threads:8/real_time_pvalue 0.3320 0.0000 U Test, Repetitions: 25 vs 25 Fujifilm/X-Pro2/_DSF3051.RAF/threads:8/real_time_mean +0.0015 +0.0204 218 218 12 12 Fujifilm/X-Pro2/_DSF3051.RAF/threads:8/real_time_median +0.0001 +0.0203 218 218 12 12 Fujifilm/X-Pro2/_DSF3051.RAF/threads:8/real_time_stddev +0.2259 +0.2023 1 1 0 0 GoPro/HERO6 Black/GOPR9172.GPR/threads:8/real_time_pvalue 0.0000 0.0001 U Test, Repetitions: 25 vs 25 GoPro/HERO6 Black/GOPR9172.GPR/threads:8/real_time_mean -0.0209 -0.0179 96 94 90 88 GoPro/HERO6 Black/GOPR9172.GPR/threads:8/real_time_median -0.0182 -0.0155 95 93 90 88 GoPro/HERO6 Black/GOPR9172.GPR/threads:8/real_time_stddev -0.6164 -0.2703 2 1 2 1 Kodak/DCS Pro 14nx/D7465857.DCR/threads:8/real_time_pvalue 0.0000 0.0000 U Test, Repetitions: 25 vs 25 Kodak/DCS Pro 14nx/D7465857.DCR/threads:8/real_time_mean -0.0098 -0.0098 176 175 176 175 Kodak/DCS Pro 14nx/D7465857.DCR/threads:8/real_time_median -0.0126 -0.0126 176 174 176 174 Kodak/DCS Pro 14nx/D7465857.DCR/threads:8/real_time_stddev +6.9789 +6.9157 0 2 0 2 Nikon/D850/Nikon-D850-14bit-lossless-compressed.NEF/threads:8/real_time_pvalue 0.0000 0.0000 U Test, Repetitions: 25 vs 25 Nikon/D850/Nikon-D850-14bit-lossless-compressed.NEF/threads:8/real_time_mean -0.0237 -0.0238 474 463 474 463 Nikon/D850/Nikon-D850-14bit-lossless-compressed.NEF/threads:8/real_time_median -0.0267 -0.0267 473 461 473 461 Nikon/D850/Nikon-D850-14bit-lossless-compressed.NEF/threads:8/real_time_stddev +0.7179 +0.7178 3 5 3 5 Olympus/E-M1MarkII/Olympus_EM1mk2__HIRES_50MP.ORF/threads:8/real_time_pvalue 0.6837 0.6554 U Test, Repetitions: 25 vs 25 Olympus/E-M1MarkII/Olympus_EM1mk2__HIRES_50MP.ORF/threads:8/real_time_mean -0.0014 -0.0013 1375 1373 1375 1373 Olympus/E-M1MarkII/Olympus_EM1mk2__HIRES_50MP.ORF/threads:8/real_time_median +0.0018 +0.0019 1371 1374 1371 1374 Olympus/E-M1MarkII/Olympus_EM1mk2__HIRES_50MP.ORF/threads:8/real_time_stddev -0.7457 -0.7382 11 3 10 3 Panasonic/DC-G9/P1000476.RW2/threads:8/real_time_pvalue 0.0000 0.0000 U Test, Repetitions: 25 vs 25 Panasonic/DC-G9/P1000476.RW2/threads:8/real_time_mean -0.0080 -0.0289 22 22 10 10 Panasonic/DC-G9/P1000476.RW2/threads:8/real_time_median -0.0070 -0.0287 22 22 10 10 Panasonic/DC-G9/P1000476.RW2/threads:8/real_time_stddev +1.0977 +0.6614 0 0 0 0 Panasonic/DC-GH5/_T012014.RW2/threads:8/real_time_pvalue 0.0000 0.0000 U Test, Repetitions: 25 vs 25 Panasonic/DC-GH5/_T012014.RW2/threads:8/real_time_mean +0.0132 +0.0967 35 36 10 11 Panasonic/DC-GH5/_T012014.RW2/threads:8/real_time_median +0.0132 +0.0956 35 36 10 11 Panasonic/DC-GH5/_T012014.RW2/threads:8/real_time_stddev -0.0407 -0.1695 0 0 0 0 Panasonic/DC-GH5S/P1022085.RW2/threads:8/real_time_pvalue 0.0000 0.0000 U Test, Repetitions: 25 vs 25 Panasonic/DC-GH5S/P1022085.RW2/threads:8/real_time_mean +0.0331 +0.1307 13 13 6 6 Panasonic/DC-GH5S/P1022085.RW2/threads:8/real_time_median +0.0430 +0.1373 12 13 6 6 Panasonic/DC-GH5S/P1022085.RW2/threads:8/real_time_stddev -0.9006 -0.8847 1 0 0 0 Pentax/645Z/IMGP2837.PEF/threads:8/real_time_pvalue 0.0016 0.0010 U Test, Repetitions: 25 vs 25 Pentax/645Z/IMGP2837.PEF/threads:8/real_time_mean -0.0023 -0.0024 395 394 395 394 Pentax/645Z/IMGP2837.PEF/threads:8/real_time_median -0.0029 -0.0030 395 394 395 393 Pentax/645Z/IMGP2837.PEF/threads:8/real_time_stddev -0.0275 -0.0375 1 1 1 1 Phase One/P65/CF027310.IIQ/threads:8/real_time_pvalue 0.0232 0.0000 U Test, Repetitions: 25 vs 25 Phase One/P65/CF027310.IIQ/threads:8/real_time_mean -0.0047 +0.0039 114 113 28 28 Phase One/P65/CF027310.IIQ/threads:8/real_time_median -0.0050 +0.0037 114 113 28 28 Phase One/P65/CF027310.IIQ/threads:8/real_time_stddev -0.0599 -0.2683 1 1 0 0 Samsung/NX1/2016-07-23-142101_sam_9364.srw/threads:8/real_time_pvalue 0.0000 0.0000 U Test, Repetitions: 25 vs 25 Samsung/NX1/2016-07-23-142101_sam_9364.srw/threads:8/real_time_mean +0.0206 +0.0207 405 414 405 414 Samsung/NX1/2016-07-23-142101_sam_9364.srw/threads:8/real_time_median +0.0204 +0.0205 405 414 405 414 Samsung/NX1/2016-07-23-142101_sam_9364.srw/threads:8/real_time_stddev +0.2155 +0.2212 1 1 1 1 Samsung/NX30/2015-03-07-163604_sam_7204.srw/threads:8/real_time_pvalue 0.0000 0.0000 U Test, Repetitions: 25 vs 25 Samsung/NX30/2015-03-07-163604_sam_7204.srw/threads:8/real_time_mean -0.0109 -0.0108 147 145 147 145 Samsung/NX30/2015-03-07-163604_sam_7204.srw/threads:8/real_time_median -0.0104 -0.0103 147 145 147 145 Samsung/NX30/2015-03-07-163604_sam_7204.srw/threads:8/real_time_stddev -0.4919 -0.4800 0 0 0 0 Samsung/NX3000/_3184416.SRW/threads:8/real_time_pvalue 0.0000 0.0000 U Test, Repetitions: 25 vs 25 Samsung/NX3000/_3184416.SRW/threads:8/real_time_mean -0.0149 -0.0147 220 217 220 217 Samsung/NX3000/_3184416.SRW/threads:8/real_time_median -0.0173 -0.0169 221 217 220 217 Samsung/NX3000/_3184416.SRW/threads:8/real_time_stddev +1.0337 +1.0341 1 3 1 3 Sony/DSLR-A350/DSC05472.ARW/threads:8/real_time_pvalue 0.0001 0.0001 U Test, Repetitions: 25 vs 25 Sony/DSLR-A350/DSC05472.ARW/threads:8/real_time_mean -0.0019 -0.0019 194 193 194 193 Sony/DSLR-A350/DSC05472.ARW/threads:8/real_time_median -0.0021 -0.0021 194 193 194 193 Sony/DSLR-A350/DSC05472.ARW/threads:8/real_time_stddev -0.4441 -0.4282 0 0 0 0 Sony/ILCE-7RM2/14-bit-compressed.ARW/threads:8/real_time_pvalue 0.0000 0.4263 U Test, Repetitions: 25 vs 25 Sony/ILCE-7RM2/14-bit-compressed.ARW/threads:8/real_time_mean +0.0258 -0.0006 81 83 19 19 Sony/ILCE-7RM2/14-bit-compressed.ARW/threads:8/real_time_median +0.0235 -0.0011 81 82 19 19 Sony/ILCE-7RM2/14-bit-compressed.ARW/threads:8/real_time_stddev +0.1634 +0.1070 1 1 0 0 ``` {F7443905} If we look at the `_mean`s, the time column, the biggest win is `-7.7%` (`Canon/EOS 5D Mark II/10.canon.sraw2.cr2`), and the biggest loose is `+3.3%` (`Panasonic/DC-GH5S/P1022085.RW2`); Overall: mean `-0.7436%`, median `-0.23%`, `cbrt(sum(time^3))` = `-8.73%` Looks good so far i'd say. llvm-exegesis details: {F7371117} {F7371125} {F7371128} {F7371144} {F7371158} Reviewers: craig.topper, RKSimon, andreadb, courbet, avt77, spatel, GGanesh Reviewed By: andreadb Subscribers: javed.absar, gbedwell, jfb, llvm-commits Differential Revision: https://reviews.llvm.org/D52779 llvm-svn: 345463	2018-10-27 20:46:30 +00:00
Roman Lebedev	a51921877a	[NFC][X86] Baseline tests for AMD BdVer2 (Piledriver) Scheduler model Adding the baseline tests in a preparatory NFC commit, so that the actual commit shows the diff. Yes, i'm aware that a few of these codegen-based sched tests are testing wrong instructions, i will fix that afterwards. For https://reviews.llvm.org/D52779 llvm-svn: 345462	2018-10-27 20:36:11 +00:00
Reid Kleckner	953bdce68d	[MC] Separate masm integer literal lexer support from inline asm Summary: This renames the IsParsingMSInlineAsm member variable of AsmLexer to LexMasmIntegers and moves it up to MCAsmLexer. This is the only behavior controlled by that variable. I added a public setter, so that it can be set from outside or from the llvm-mc command line. We may need to arrange things so that users can get this behavior from clang, but that's future work. I also put additional hex literal lexing functionality under this flag to fix PR32973. It appears that this hex literal parsing wasn't intended to be enabled in non-masm-style blocks. Now, masm integers (0b1101 and 0ABCh) work in __asm blocks from clang, but 0b label references work when using .intel_syntax in standalone .s files. However, 0b label references will not work from __asm blocks in clang. They will work from GCC inline asm blocks, which it sounds like is important for Crypto++ as mentioned in PR36144. Essentially, we only lex masm literals for inline asm blobs that use intel syntax. If the .intel_syntax directive is used inside a gnu-style inline asm statement, masm literals will not be lexed, which is compatible with gas and llvm-mc standalone .s assembly. This fixes PR36144 and PR32973. Reviewers: Gerolf, avt77 Subscribers: eraman, hiraditya, llvm-commits Differential Revision: https://reviews.llvm.org/D53535 llvm-svn: 345189	2018-10-24 20:23:57 +00:00
Simon Pilgrim	7d27cfdcb2	[X86] Fix Skylake ReadAfterLd for PADDrm etc. Missed in rL343868 as due to their custom InstrRW. llvm-svn: 344600	2018-10-16 09:50:16 +00:00
Andrea Di Biagio	6eebbe0a97	[tblgen][llvm-mca] Add the ability to describe move elimination candidates via tablegen. This patch adds the ability to identify instructions that are "move elimination candidates". It also allows scheduling models to describe processor register files that allow move elimination. A move elimination candidate is an instruction that can be eliminated at register renaming stage. Each subtarget can specify which instructions are move elimination candidates with the help of tablegen class "IsOptimizableRegisterMove" (see llvm/Target/TargetInstrPredicate.td). For example, on X86, BtVer2 allows both GPR and MMX/SSE moves to be eliminated. The definition of 'IsOptimizableRegisterMove' for BtVer2 looks like this: ``` def : IsOptimizableRegisterMove<[ InstructionEquivalenceClass<[ // GPR variants. MOV32rr, MOV64rr, // MMX variants. MMX_MOVQ64rr, // SSE variants. MOVAPSrr, MOVUPSrr, MOVAPDrr, MOVUPDrr, MOVDQArr, MOVDQUrr, // AVX variants. VMOVAPSrr, VMOVUPSrr, VMOVAPDrr, VMOVUPDrr, VMOVDQArr, VMOVDQUrr ], CheckNot<CheckSameRegOperand<0, 1>> > ]>; ``` Definitions of IsOptimizableRegisterMove from processor models of a same Target are processed by the SubtargetEmitter to auto-generate a target-specific override for each of the following predicate methods: ``` bool TargetSubtargetInfo::isOptimizableRegisterMove(const MachineInstr *MI) const; bool MCInstrAnalysis::isOptimizableRegisterMove(const MCInst &MI, unsigned CPUID) const; ``` By default, those methods return false (i.e. conservatively assume that there are no move elimination candidates). Tablegen class RegisterFile has been extended with the following information: - The set of register classes that allow move elimination. - Maxium number of moves that can be eliminated every cycle. - Whether move elimination is restricted to moves from registers that are known to be zero. This patch is structured in three part: A first part (which is mostly boilerplate) adds the new 'isOptimizableRegisterMove' target hooks, and extends existing register file descriptors in MC by introducing new fields to describe properties related to move elimination. A second part, uses the new tablegen constructs to describe move elimination in the BtVer2 scheduling model. A third part, teaches llm-mca how to query the new 'isOptimizableRegisterMove' hook to mark instructions that are candidates for move elimination. It also teaches class RegisterFile how to describe constraints on move elimination at PRF granularity. llvm-mca tests for btver2 show differences before/after this patch. Differential Revision: https://reviews.llvm.org/D53134 llvm-svn: 344334	2018-10-12 11:23:04 +00:00
Andrea Di Biagio	6a0b319549	[llvm-mca][BtVer2] Add tests for optimizable GPR register moves. NFC llvm-svn: 344253	2018-10-11 14:54:54 +00:00
Andrea Di Biagio	1b29ec6531	[llvm-mca][BtVer2] Add two more move-elimination tests. NFC These should test all the optimizable moves on Jaguar. A follow-up patch will teach how to recognize these optimizable register moves. llvm-svn: 344144	2018-10-10 14:46:54 +00:00
Simon Pilgrim	f09fc3bc12	[X86] Move ReadAfterLd functionality into X86FoldableSchedWrite (PR36957) Currently we hardcode instructions with ReadAfterLd if the register operands don't need to be available until the folded load has completed. This doesn't take into account the different load latencies of different memory operands (PR36957). This patch adds a ReadAfterFold def into X86FoldableSchedWrite to replace ReadAfterLd, allowing us to specify the load latency at a scheduler class level. I've added ReadAfterVec*Ld classes that match the XMM/Scl, XMM and YMM/ZMM WriteVecLoad classes that we currently use, we can tweak these values in future patches once this infrastructure is in place. Differential Revision: https://reviews.llvm.org/D52886 llvm-svn: 343868	2018-10-05 17:57:29 +00:00
Simon Pilgrim	6ad03ad34b	[llvm-mca][x86] Add PR36951 ReadAfterLd test case llvm-svn: 343795	2018-10-04 16:26:56 +00:00
Greg Bedwell	dee7bfdb9f	[utils] Ensure that update_mca_test_checks.py writes prefixes in alphabetical order llvm-svn: 343783	2018-10-04 14:42:19 +00:00
Simon Pilgrim	82a3b1c687	[llvm-mca][x86] Add tests demonstrating ReadAfterLd delay llvm-svn: 343773	2018-10-04 13:05:42 +00:00
Simon Pilgrim	0b451a2983	[X86][Btver2] Fix MMX PSHUFB schedule Match AMD Fam16h SOG + llvm-exegesis tests llvm-svn: 343701	2018-10-03 18:18:50 +00:00
Andrea Di Biagio	207e0217f9	[llvm-mca] Add support for move elimination in class RegisterFile. This patch teaches class RegisterFile how to analyze register writes from instructions that are move elimination candidates. In particular, it teaches it how to check if a move can be effectively eliminated by the underlying PRF, and (if necessary) how to perform move elimination. The long term goal is to allow processor models to describe instructions that are valid move elimination candidates. The idea is to let register file definitions in tablegen declare if/when moves can be eliminated. This patch is a non functional change. The logic that performs move elimination is currently disabled. A future patch will add support for move elimination in the processor models, and enable this new code path. llvm-svn: 343691	2018-10-03 15:02:44 +00:00
Simon Pilgrim	c68cc4efbe	[X86][Btver2] Most RMW instructions don't require an additional uop Remove uop on WriteRMW and move it into the few instructions that need it. Match AMD Fam16h SOG + llvm-exegesis tests llvm-svn: 343671	2018-10-03 10:28:43 +00:00
Simon Pilgrim	d11015861c	[X86] ALU/ADC RMW instructions should use the WriteRMW sequence class I was expecting this to be a nfc but Silvermont seems to be setup a little differently: // A folded store needs a cycle on MEC_RSV for the store data, but it does not need an extra port cycle to recompute the address. def : WriteRes<WriteRMW, [SLM_MEC_RSV]>; So moving from WriteStore to WriteRMW reduces predicted port pressure, confirmed by @craig.topper that this is correct. Differential Revision: https://reviews.llvm.org/D52740 llvm-svn: 343670	2018-10-03 10:01:13 +00:00
Simon Pilgrim	860cb5c071	[X86][Btver2] Fix BLENDV and AESDEC schedules Match AMD Fam16h SOG + llvm-exegesis tests llvm-svn: 343597	2018-10-02 15:13:18 +00:00
Simon Pilgrim	e0d2019052	[X86][Btver2] Fix BT(C\|R\|S)mr & BT(C\|R\|S)mi schedule latency + uop counts Match AMD Fam16h SOG + llvm-exegesis tests llvm-svn: 343494	2018-10-01 16:31:30 +00:00

1 2 3 4 5 ...

297 Commits