Commit Graph

368 Commits

Author SHA1 Message Date
Andrea Di Biagio 43003f0fec [MCA] Fix typo in AVX2 gather tests. NFC
llvm-svn: 359397
2019-04-28 10:54:45 +00:00
Andrea Di Biagio 2050dff996 [MCA] Remove wrong comments from a test. NFC
llvm-svn: 358160
2019-04-11 10:15:04 +00:00
Craig Topper 4a32ce39b7 [X86] Make _Int instructions the preferred instructon for the assembly parser and disassembly parser to remove inconsistencies between VEX and EVEX.
Many of our instructions have both a _Int form used by intrinsics and a form
used by other IR constructs. In the EVEX space the _Int versions usually cover
all the capabilities include broadcasting and rounding. While the other version
only covers simple register/register or register/load forms. For this reason
in EVEX, the non intrinsic form is usually marked isCodeGenOnly=1.

In the VEX encoding space we were less consistent, but usually the _Int version
was the isCodeGenOnly version.

This commit makes the VEX instructions match the EVEX instructions. This was
done by manually studying the AsmMatcher table so its possible I missed some
cases, but we should be closer now.

I'm thinking about using the isCodeGenOnly bit to simplify the EVEX2VEX
tablegen code that disambiguates the _Int and non _Int versions. Currently it
checks register class sizes and Record the memory operands come from. I have
some other changes I was looking into for D59266 that may break the memory check.

I had to make a few scheduler hacks to keep the _Int versions from being treated
differently than the non _Int version.

Differential Revision: https://reviews.llvm.org/D60441

llvm-svn: 358138
2019-04-10 21:29:41 +00:00
Andrea Di Biagio f6a60f1f80 [llvm-mca][scheduler-stats] Print issued micro opcodes per cycle. NFCI
It makes more sense to print out the number of micro opcodes that are issued
every cycle rather than the number of instructions issued per cycle.
This behavior is also consistent with the dispatch-stats: numbers from the two
views can now be easily compared.

llvm-svn: 357919
2019-04-08 16:05:54 +00:00
Andrea Di Biagio e074ac60b4 [MCA] Add an experimental MicroOpQueue stage.
This patch adds an experimental stage named MicroOpQueueStage.
MicroOpQueueStage can be used to simulate a hardware micro-op queue (basically,
a decoupling queue between 'decode' and 'dispatch').  Users can specify a queue
size, as well as a optional MaxIPC (which - in the absence of a "Decoders" stage
- can be used to simulate a different throughput from the decoders).

This stage is added to the default pipeline between the EntryStage and the
DispatchStage only if PipelineOption::MicroOpQueue is different than zero. By
default, llvm-mca sets PipelineOption::MicroOpQueue to the value of hidden flag
-micro-op-queue-size.

Throughput from the decoder can be simulated via another hidden flag named
-decoder-throughput.  That flag allows us to quickly experiment with different
frontend throughputs.  For targets that declare a loop buffer, flag
-decoder-throughput allows users to do multiple runs, each time simulating a
different throughput from the decoders.

This stage can/will be extended in future. For example, we could add a "buffer
full" event to notify bottlenecks caused by backpressure. flag
-decoder-throughput would probably go away if in future we delegate to another
stage (DecoderStage?) the simulation of a (potentially variable) throughput from
the decoders. For now, flag -decoder-throughput is "good enough" to run some
simple experiments.

Differential Revision: https://reviews.llvm.org/D59928

llvm-svn: 357248
2019-03-29 12:15:37 +00:00
Roman Lebedev c325be6cef [X86] AMD Piledriver (BdVer2): fine-tune some latencies
Based on llvm-exegesis measurements.

Now that llvm-exegesis is ~2 magnitudes faster, and is a bit smarter,
it is now possible to continue cleanup of the scheduler model.

With this, there are no more latency inconsistencies for the
opcodes that produce stable measurements, and only a few inconsistencies
for unstable measurements (MMX_* opcodes, opcodes that llvm-exegesis
measures by chaining - CMP, TEST, BT, SETcc, CVT, MOV, etc.)

llvm-svn: 357169
2019-03-28 13:40:34 +00:00
Craig Topper c2b35ebc1d [X86] Remove the _alt forms of (V)CMP instructions. Use a combination of custom printing and custom parsing to achieve the same result and more
Similar to previous change done for VPCOM and VPCMP

Differential Revision: https://reviews.llvm.org/D59468

llvm-svn: 356384
2019-03-18 17:59:59 +00:00
Craig Topper 12509d87f3 [X86] Remove the _alt forms of XOP VPCOM instructions. Use a combination of custom printing and custom parsing to achieve the same result and more
Previously we had a regular form of the instruction used when the immediate was 0-7. And _alt form that allowed the full 8 bit immediate. Codegen would always use the 0-7 form since the immediate was always checked to be in range. Assembly parsing would use the 0-7 form when a mnemonic like vpcomtrueb was used. If the immediate was specified directly the _alt form was used. The disassembler would prefer to use the 0-7 form instruction when the immediate was in range and the _alt form otherwise. This way disassembly would print the most readable form when possible.

The assembly parsing for things like vpcomtrueb relied on splitting the mnemonic into 3 pieces. A "vpcom" prefix, an immediate representing the "true", and a suffix of "b". The tablegenerated printing code would similarly print a "vpcom" prefix, decode the immediate into a string, and then print "b".

The _alt form on the other hand parsed and printed like any other instruction with no specialness.

With this patch we drop to one form and solve the disassembly printing issue by doing custom printing when the immediate is 0-7. The parsing code has been tweaked to turn "vpcomtrueb" into "vpcomb" and then the immediate for the "true" is inserted either before or after the other operands depending on at&t or intel syntax.

I'd rather not do the custom printing, but I tried using an InstAlias for each possible mnemonic for all 8 immediates for all 16 combinations of element size, signedness, and memory/register. The code emitted into printAliasInstr ended up checking the number of operands, the register class of each operand, and the immediate for all 256 aliases. This was repeated for both the at&t and intel printer. Despite a lot of common checks between all of the aliases, when compiled with clang at least this commonality was not well optimized. Nor do all the checks seem necessary. Since I want to do a similar thing for vcmpps/pd/ss/sd which have 32 immediate values and 3 encoding flavors, 3 register sizes, etc. This didn't seem to scale well for clang binary size. So custom printing seemed a better trade off.

I also considered just using the InstAlias for the matching and not the printing. But that seemed like it would add a lot of extra rows to the matcher table. Especially given that the 32 immediates for vpcmpps have 46 strings associated with them.

Differential Revision: https://reviews.llvm.org/D59398

llvm-svn: 356343
2019-03-17 21:21:37 +00:00
Craig Topper d0c2dba644 [X86] Correct scheduler information for rotate by constant for Haswell, Broadwell, and Skylake.
Rotate with explicit immediate is a single uop from Haswell on. An immediate of 1 has a dependency on the previous writer of flags, but the other immediate values do not.

The implicit rotate by 1 instruction is 2 uops. But the flags are merged after the rotate uop so the data result does not see the flag dependency. But I don't think we have any way of modeling that.

RORX is 1 uop without the load. 2 uops with the load. We currently model these with WriteShift/WriteShiftLd.

Differential Revision: https://reviews.llvm.org/D59077

llvm-svn: 355636
2019-03-07 21:22:56 +00:00
Craig Topper b3af5d3e57 [X86] Model ADC/SBB with immediate 0 more accurately in the Haswell scheduler model
Haswell and possibly Sandybridge have an optimization for ADC/SBB with immediate 0 to use a single uop flow. This only applies GR16/GR32/GR64 with an 8-bit immediate. It does not apply to GR8. It also does not apply to the implicit AX/EAX/RAX forms.

Differential Revision: https://reviews.llvm.org/D59058

llvm-svn: 355635
2019-03-07 21:22:51 +00:00
Matt Davis 6c5a49ccb9 [llvm-mca] Emit a message when no bottlenecks are identified.
Summary:
Since bottleneck hints are enabled via user request, it can be
confusing if no bottleneck information is presented.  Such is the
case when no bottlenecks are identified.  This patch emits a message
in that case.

Reviewers: andreadb

Reviewed By: andreadb

Subscribers: tschuett, gbedwell, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D59098

llvm-svn: 355628
2019-03-07 19:34:44 +00:00
Simon Pilgrim 3f37538b86 [llvm-mca][X86] Add ADC/SBB with zero test cases
Some targets have fast-path handling for these patterns that we should model.

llvm-svn: 355498
2019-03-06 12:51:16 +00:00
Andrea Di Biagio be3281a281 [MCA] Highlight kernel bottlenecks in the summary view.
This patch adds a new flag named -bottleneck-analysis to print out information
about throughput bottlenecks.

MCA knows how to identify and classify dynamic dispatch stalls. However, it
doesn't know how to analyze and highlight kernel bottlenecks.  The goal of this
patch is to teach MCA how to correlate increases in backend pressure to backend
stalls (and therefore, the loss of throughput).

From a Scheduler point of view, backend pressure is a function of the scheduler
buffer usage (i.e. how the number of uOps in the scheduler buffers changes over
time). Backend pressure increases (or decreases) when there is a mismatch
between the number of opcodes dispatched, and the number of opcodes issued in
the same cycle.  Since buffer resources are limited, continuous increases in
backend pressure would eventually leads to dispatch stalls. So, there is a
strong correlation between dispatch stalls, and how backpressure changed over
time.

This patch teaches how to identify situations where backend pressure increases
due to:
 - unavailable pipeline resources.
 - data dependencies.

Data dependencies may delay execution of instructions and therefore increase the
time that uOps have to spend in the scheduler buffers. That often translates to
an increase in backend pressure which may eventually lead to a bottleneck.
Contention on pipeline resources may also delay execution of instructions, and
lead to a temporary increase in backend pressure.

Internally, the Scheduler classifies instructions based on whether register /
memory operands are available or not.

An instruction is marked as "ready to execute" only if data dependencies are
fully resolved.
Every cycle, the Scheduler attempts to execute all instructions that are ready
to execute. If an instruction cannot execute because of unavailable pipeline
resources, then the Scheduler internally updates a BusyResourceUnits mask with
the ID of each unavailable resource.

ExecuteStage is responsible for tracking changes in backend pressure. If backend
pressure increases during a cycle because of contention on pipeline resources,
then ExecuteStage sends a "backend pressure" event to the listeners.
That event would contain information about instructions delayed by resource
pressure, as well as the BusyResourceUnits mask.

Note that ExecuteStage also knows how to identify situations where backpressure
increased because of delays introduced by data dependencies.

The SummaryView observes "backend pressure" events and prints out a "bottleneck
report".

Example of bottleneck report:

```
Cycles with backend pressure increase [ 99.89% ]
Throughput Bottlenecks:
  Resource Pressure       [ 0.00% ]
  Data Dependencies:      [ 99.89% ]
   - Register Dependencies [ 0.00% ]
   - Memory Dependencies   [ 99.89% ]
```

A bottleneck report is printed out only if increases in backend pressure
eventually caused backend stalls.

About the time complexity:

Time complexity is linear in the number of instructions in the
Scheduler::PendingSet.

The average slowdown tends to be in the range of ~5-6%.
For memory intensive kernels, the slowdown can be significant if flag
-noalias=false is specified. In the worst case scenario I have observed a
slowdown of ~30% when flag -noalias=false was specified.

We can definitely recover part of that slowdown if we optimize class LSUnit (by
doing extra bookkeeping to speedup queries). For now, this new analysis is
disabled by default, and it can be enabled via flag -bottleneck-analysis. Users
of MCA as a library can enable the generation of pressure events through the
constructor of ExecuteStage.

This patch partially addresses https://bugs.llvm.org/show_bug.cgi?id=37494

Differential Revision: https://reviews.llvm.org/D58728

llvm-svn: 355308
2019-03-04 11:52:34 +00:00
Andrea Di Biagio c032e2ab7c [MCA] Always check if scheduler resources are unavailable when reporting dispatch stalls.
Dispatch stall cycles may be associated to multiple dispatch stall events.
Before this patch, each stall cycle was associated with a single stall event.
This patch also improves a couple of code comments, and adds a helper method to
query the Scheduler for dispatch stalls.

llvm-svn: 354877
2019-02-26 14:19:00 +00:00
Craig Topper ce2bd19c49 [X86] Correct some ADC/SBB with immediate scheduler data for Broadwell and Skylake.
Summary:
The AX/EAX/RAX with immediate forms are 2 uops just like the AL with immediate.

The modrm form with r8 and immediate is a single uop just like r16/r32/r64 with immediate.

Reviewers: RKSimon, andreadb

Reviewed By: RKSimon

Subscribers: gbedwell, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D58581

llvm-svn: 354754
2019-02-24 19:23:39 +00:00
Andrea Di Biagio c102e2a227 [MCA] Correctly update register definitions in the PRF after move elimination.
This patch fixes a bug where register writes performed by optimizable register
moves were sometimes wrongly treated like partial register updates.  Before this
patch, llvm-mca wrongly predicted a 1.50 IPC for test reg-move-elimination-6.s
(added by this patch).  With this patch, llvm-mca correctly updates the register
defintions in the PRF, and the IPC for that test is now correctly reported as 2.

llvm-svn: 354271
2019-02-18 14:15:25 +00:00
Craig Topper bf7593ec4a [X86] Print all register forms of x87 fadd/fsub/fdiv/fmul as having two arguments where on is %st.
All of these instructions consume one encoded register and the other register is %st. They either write the result to %st or the encoded register. Previously we printed both arguments when the encoded register was written. And we printed one argument when the result was written to %st. For the stack popping forms the encoded register is always the destination and we didn't print both operands. This was inconsistent with gcc and objdump and just makes the output assembly code harder to read.

This patch changes things to always print both operands making us consistent with gcc and objdump. The parser should still be able to handle the single register forms just as it did before. This also matches the GNU assembler behavior.

llvm-svn: 353061
2019-02-04 17:28:18 +00:00
Craig Topper 7a2944efe1 [X86] Print %st(0) as %st when its implicit to the instruction. Continue printing it as %st(0) when its encoded in the instruction.
This is a step back from the change I made in r352985. This appears to be more consistent with gcc and objdump behavior.

llvm-svn: 353015
2019-02-04 04:15:10 +00:00
Craig Topper f77b858dc3 Revert r352985 "[X86] Print %st(0) as %st to match what gcc inline asm uses as the clobber name to make MS inline asm work correctly"
Looking into gcc and objdump behavior more this was overly aggressive. If the register is encoded in the instruction we should print %st(0), if its implicit we should print %st.

I'll be making a more directed change in a future patch.

llvm-svn: 353013
2019-02-04 04:15:02 +00:00
Craig Topper 5a570dd437 [X86] Print %st(0) as %st to match what gcc inline asm uses as the clobber name to make MS inline asm work correctly
Summary:
When calculating clobbers for MS style inline assembly we fail if the asm clobbers stack top because we print st(0) and try to pass it through the gcc register name check. This was found with when I attempted to make a emms/femms clobber all ST registers. If you use emms/femms in MS inline asm we would try to use st(0) as the clobber name but clang would think that wasn't a valid clobber name.

This also matches what objdump disassembly prints. It's also what is printed by gcc -S.

Reviewers: RKSimon, rnk, efriedma, spatel, andreadb, lebedev.ri

Reviewed By: rnk

Subscribers: eraman, gbedwell, lebedev.ri, llvm-commits

Differential Revision: https://reviews.llvm.org/D57621

llvm-svn: 352985
2019-02-03 07:53:39 +00:00
Roman Lebedev 7857215f8e [X86][BdVer2] Transfer delays from the integer to the floating point unit.
Summary:
I'm unable to find this number in the "AMD SOG for family 15h".
llvm-exegesis measures the latencies of these instructions as `2`,
which matches the latencies specified in "AMD SOG for family 15h".

However if we look at Agner, Microarchitecture, "AMD Bulldozer, Piledriver,
Steamroller and Excavator pipeline", "Data delay between different execution
domains", the int->ivec transfer is listed as `8`..`10`cy of additional latency.

Also, Agner's "Instruction tables", for Piledriver, lists their latencies as `12`,
which is consistent with `2cy` from exegesis / AMD SOG + `10cy` transfer delay.

Additional data point comes from the fact that Agner's "Instruction tables",
for Jaguar, lists their latencies as `8`; and "AMD SOG for family 16h" does
state the `+6cy` int->ivec delay, which is consistent with instr latency of `1` or `2`.

Reviewers: andreadb, RKSimon, craig.topper

Reviewed By: andreadb

Subscribers: gbedwell, courbet, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D57300

llvm-svn: 352861
2019-02-01 11:15:13 +00:00
Andrea Di Biagio 815cdbff29 [X86][Btver2] Improved latency/throughput model for scalar int-to-float conversions.
Account for bypass delays when computing the latency of scalar int-to-float
conversions.
On Jaguar we need to account for an extra 6cy latency (see AMD fam16h SOG).
This patch also fixes the number of micropcodes for the register-memory variants
of scalar int-to-float conversions.

Differential Revision: https://reviews.llvm.org/D57148

llvm-svn: 352518
2019-01-29 16:47:27 +00:00
Roman Lebedev 661577466e [NFC][MCA][X86][BdVer2] Cherry-pick int-to-ivec forwarding tests from BtVer2
llvm-svn: 352317
2019-01-27 14:35:54 +00:00
Simon Pilgrim c9d33907ef [llvm-mca][X86] Add some missing DQI tests
Match more of the coverage of test\CodeGen\X86\avx512-schedule.ll as discussed on D57244 

llvm-svn: 352273
2019-01-26 13:00:46 +00:00
Simon Pilgrim d36f7730cd [llvm-mca][X86] Add missing shuffle tests
Match the coverage of test\CodeGen\X86\avx512-shuffle-schedule.ll so we can get rid of -print-schedule (and fix PR37160) without losing schedule tests

llvm-svn: 352179
2019-01-25 09:17:30 +00:00
Andrea Di Biagio d768d35515 [MC][X86] Correctly model additional operand latency caused by transfer delays from the integer to the floating point unit.
This patch adds a new ReadAdvance definition named ReadInt2Fpu.
ReadInt2Fpu allows x86 scheduling models to accurately describe delays caused by
data transfers from the integer unit to the floating point unit.
ReadInt2Fpu currently defaults to a delay of zero cycles (i.e. no delay) for all
x86 models excluding BtVer2. That means, this patch is only a functional change
for the Jaguar cpu model only.

Tablegen definitions for instructions (V)PINSR* have been updated to account for
the new ReadInt2Fpu. That read is mapped to the the GPR input operand.
On Jaguar, int-to-fpu transfers are modeled as a +6cy delay. Before this patch,
that extra delay was added to the opcode latency. In practice, the insert opcode
only executes for 1cy. Most of the actual latency is actually contributed by the
so-called operand-latency. According to the AMD SOG for family 16h, (V)PINSR*
latency is defined by expression f+1, where f is defined as a forwarding delay
from the integer unit to the fpu.

When printing instruction latency from MCA (see InstructionInfoView.cpp) and LLC
(only when flag -print-schedule is speified), we now need to account for any
extra forwarding delays. We do this by checking if scheduling classes declare
any negative ReadAdvance entries. Quoting a code comment in TargetSchedule.td:
"A negative advance effectively increases latency, which may be used for
cross-domain stalls". When computing the instruction latency for the purpose of
our scheduling tests, we now add any extra delay to the formula. This avoids
regressing existing codegen and mca schedule tests. It comes with the cost of an
extra (but very simple) hook in MCSchedModel.

Differential Revision: https://reviews.llvm.org/D57056

llvm-svn: 351965
2019-01-23 16:35:07 +00:00
Simon Pilgrim 922b540643 [llvm-mca][X86] Tidyup avx512 placeholder tests
Ensure we keep avx512f/bw/dq + vl versions separate, add example broadcast tests - this should allow us to better the test coverage of test\CodeGen\X86\avx512-schedule.ll

llvm-svn: 351848
2019-01-22 17:52:15 +00:00
Simon Pilgrim 8e11254132 [llvm-mca][X86] Add VPOPCNTDQ tests
Matches test coverage of test\CodeGen\X86\avx512vpopcntdq-schedule.ll

llvm-svn: 351842
2019-01-22 17:19:44 +00:00
Simon Pilgrim 90fa50d928 [llvm-mca][X86] Add missing CLWB/CLZERO/FSGSBASE/LWP/MWAITX/RDPID/SHA tests
We're getting pretty close to matching/exceeding test coverage of the test\CodeGen\X86\*-schedule.ll files, which should allow us to get rid of -print-schedule and fix PR37160

llvm-svn: 351836
2019-01-22 16:39:28 +00:00
Simon Pilgrim fc4b1e841e [llvm-mca][X86] Add missing enter/leave, invlpg/invlpga, rdmsr/wrmsr, rdpmc and rdtsc/rdtscp tests
llvm-svn: 351835
2019-01-22 16:29:26 +00:00
Simon Pilgrim 4e03b2496d [llvm-mca][X86] Add missing mfence/pinsrw tests
llvm-svn: 351831
2019-01-22 16:01:08 +00:00
Simon Pilgrim 05198a9b8a [llvm-mca][X86] Add missing monitor/mwait tests
These technically should be under a MONITOR cpuid bit, but we tag them as SSE3 so I've done that here as well.

llvm-svn: 351829
2019-01-22 15:48:16 +00:00
Simon Pilgrim 9b3a2f96a1 [llvm-mca][X86] Add missing vperm2i128 tests
llvm-svn: 351828
2019-01-22 14:54:24 +00:00
Simon Pilgrim 1d8d6c3bfb [llvm-mca][X86] Add missing tzcntw tests
llvm-svn: 351827
2019-01-22 14:53:52 +00:00
Andrea Di Biagio a4d1ffc269 [MCA] Add tests for int-to-fpu transfer delays. NFC
llvm-svn: 351822
2019-01-22 13:59:08 +00:00
Simon Pilgrim aa6a4339ac [X86][BtVer2] SSE2 vector shifts has local forwarding disabled
Similar to horizontal ops on D56777, the sse2 (but not mmx) bit shift ops has local forwarding disabled, adding +1cy to the use latency for the result.

Differential Revision: https://reviews.llvm.org/D57026

llvm-svn: 351817
2019-01-22 13:27:18 +00:00
Simon Pilgrim 2c69f90171 [X86][BtVer2] X86ISD::VPERMILPV has local forwarding disabled
Similar to horizontal ops on D56777, the vpermilpd/vpermilps variable mask ops has local forwarding disabled, adding +1cy to the use latency for the result.

Differential Revision: https://reviews.llvm.org/D57022

llvm-svn: 351815
2019-01-22 13:13:57 +00:00
Simon Pilgrim 9b73ae96c5 [X86][BtVer2] Update latency of mmx horizontal operations
D56777 added +1cy local forwarding penalty for horizontal operations, but this penalty only affects sse2/xmm variants, the mmx variants don't suffer the penalty.

Confirmed with @andreadb

llvm-svn: 351755
2019-01-21 18:04:25 +00:00
Andrea Di Biagio b68dd05c14 [X86][BtVer2] Update the WriteLoad latency.
r327630 introduced new write definitions for float/vector loads.
Before that revision, WriteLoad was used by both integer/float (scalar/vector)
load. So, WriteLoad had to conservatively declare a latency to 5cy. That is
because the load-to-use latency for float/vector load is 5cy.

Now that we have dedicated writes for float/vector loads, there is no reason why
we should keep the latency of WriteLoad to 5cy. At the moment, WriteLoad is only
used by scalar integer loads only; we can assume an optimstic 3cy latency for
them.
This patch changes that latency from 5cy to 3cy, and regenerates the affected
scheduling/mca tests.

Differential Revision: https://reviews.llvm.org/D56922

llvm-svn: 351742
2019-01-21 12:04:10 +00:00
Andrea Di Biagio c5f0f5309e [X86][BtVer2] Update latency of horizontal operations.
On Jaguar, horizontal adds/subs have local forwarding disable.
That means, we pay a compulsory extra cycle of write-back stage, and the value
is not available until the end of that stage.

This patch changes the latency of horizontal operations by adding an extra
cycle. With this patch, latency numbers now match what is reported by perf.

I plan to send another patch to also 'fix' the latency of shuffle operations (on
Jaguar, local forwarding is disabled for vector shuffles too).

Differential Revision: https://reviews.llvm.org/D56777

llvm-svn: 351366
2019-01-16 18:18:01 +00:00
Simon Pilgrim 99c139f4dc [llvm-mca][x86] Add RDSEED instruction resource tests for GLM
llvm-svn: 348624
2018-12-07 18:37:40 +00:00
Simon Pilgrim c703ce35b8 [llvm-mca][x86] Add missing AES instruction resource tests
Add missing non-VEX instructions

llvm-svn: 348623
2018-12-07 18:35:54 +00:00
Simon Pilgrim c4e2776f3b [llvm-mca][x86] Add RDRAND/RDSEED instruction resource tests
llvm-svn: 348622
2018-12-07 18:29:47 +00:00
Andrea Di Biagio 373a4ccf6c [llvm-mca][MC] Add the ability to declare which processor resources model load/store queues (PR36666).
This patch adds the ability to specify via tablegen which processor resources
are load/store queue resources.

A new tablegen class named MemoryQueue can be optionally used to mark resources
that model load/store queues.  Information about the load/store queue is
collected at 'CodeGenSchedule' stage, and analyzed by the 'SubtargetEmitter' to
initialize two new fields in struct MCExtraProcessorInfo named `LoadQueueID` and
`StoreQueueID`.  Those two fields are identifiers for buffered resources used to
describe the load queue and the store queue.
Field `BufferSize` is interpreted as the number of entries in the queue, while
the number of units is a throughput indicator (i.e. number of available pickers
for loads/stores).

At construction time, LSUnit in llvm-mca checks for the presence of extra
processor information (i.e. MCExtraProcessorInfo) in the scheduling model.  If
that information is available, and fields LoadQueueID and StoreQueueID are set
to a value different than zero (i.e. the invalid processor resource index), then
LSUnit initializes its LoadQueue/StoreQueue based on the BufferSize value
declared by the two processor resources.

With this patch, we more accurately track dynamic dispatch stalls caused by the
lack of LS tokens (i.e. load/store queue full). This is also shown by the
differences in two BdVer2 tests. Stalls that were previously classified as
generic SCHEDULER FULL stalls, are not correctly classified either as "load
queue full" or "store queue full".

About the differences in the -scheduler-stats view: those differences are
expected, because entries in the load/store queue are not released at
instruction issue stage. Instead, those are released at instruction executed
stage.  This is the main reason why for the modified tests, the load/store
queues gets full before PdEx is full.

Differential Revision: https://reviews.llvm.org/D54957

llvm-svn: 347857
2018-11-29 12:15:56 +00:00
Andrea Di Biagio 7a7588990b [llvm-mca] pass -dispatch-stats flag to a couple of tests. NFC
This change is in preparation for a patch that fixes PR36666.

llvm-mca currently doesn't know if a buffered processor resource describes a
load or store queue. So, any dynamic dispatch stall caused by the lack of
load/store queue entries is normally reported as a generic SCHEDULER stall. See for
example the -dispatch-stats output from the two tests modified by this patch.

In future, processor models will be able to tag processor resources that are
used to describe load/store queues. That information would then be used by
llvm-mca to correctly classify dynamic dispatch stalls caused by the lack of
tokens in the LS.

llvm-svn: 347662
2018-11-27 15:56:00 +00:00
Andrea Di Biagio 07a8255a78 [llvm-mca][View] Improved Retire Control Unit Statistics.
RetireControlUnitStatistics now reports extra information about the ROB and the
avg/maximum number of entries consumed over the entire simulation.

Example:
  Retire Control Unit - number of cycles where we saw N instructions retired:
  [# retired], [# cycles]
   0,           109  (17.9%)
   1,           102  (16.7%)
   2,           399  (65.4%)

  Total ROB Entries:                64
  Max Used ROB Entries:             35  ( 54.7% )
  Average Used ROB Entries per cy:  32  ( 50.0% )

Documentation in llvm/docs/CommandGuide/llvmn-mca.rst has been updated to
reflect this change.

llvm-svn: 347493
2018-11-23 12:12:57 +00:00
Andrea Di Biagio 1cb8a3c690 [llvm-mca] Fix an invalid memory read introduced by r346487.
This patch fixes an invalid memory read introduced by r346487.
Before this patch, partial register write had to query the latency of the
dependent full register write by calling a method on the full write descriptor.
However, if the full write is from an already retired instruction, chances are
that the EntryStage already reclaimed its memory.
In some parial register write tests, valgrind was reporting an invalid
memory read.

This change fixes the invalid memory access problem. Writes are now responsible
for tracking dependent partial register writes, and notify them in the event of
instruction issued.
That means, partial register writes no longer need to query their associated
full write to check when they are ready to execute.

Added test X86/BtVer2/partial-reg-update-7.s

llvm-svn: 347459
2018-11-22 12:48:57 +00:00
Andrea Di Biagio dda9032314 [llvm-mca] Correctly update the resource strategy for processor resources with multiple units.
When looking at the tests committed by Roman at r346587, I noticed that numbers
reported by the resource pressure for PdAGU01 were wrong.

In particular, according to the aut-generated CHECK lines in tests
memcpy-like-test.s and store-throughput.s, resource pressure for PdAGU01
was not uniformly distributed among the two AGEN pipes.

It turns out that the reason why pressure was not correctly distributed, was
because the "resource selection strategy" object associated with PdAGU01 was not
correctly updated on the event of AGEN pipe used.
As a result, llvm-mca was not simulating a round-robin pipeline allocation for
PdAGU01. Instead, PdAGU1 was always prioritized over PdAGU0.

This patch fixes the issue; now processor resource strategy objects for
resources declaring multiple units, are correctly notified in the event of
"resource used".

llvm-svn: 346650
2018-11-12 13:09:39 +00:00
Roman Lebedev b428b8b214 [X86][BdVer2] Fix loads/stores throughput for Piledriver (PR39465)
There are two AGU units, and per 1cy, there can be either two loads,
or a load and a store; but not two stores, or two loads and a store.

Additionally, loads shouldn't affect the store scheduler and vice versa.
(but *should* affect the PdEX scheduler.)

Required rL346545.
Fixes https://bugs.llvm.org/show_bug.cgi?id=39465

llvm-svn: 346587
2018-11-10 14:31:43 +00:00
Roman Lebedev e105b655a2 [NFC][MCA][BdVer2] Add bdver2 runline into register-file-statistics.s test
Missed this one by accident when adding
the initial version in rL345463 / rL345462

llvm-svn: 346585
2018-11-10 10:56:58 +00:00
Clement Courbet e6b727e552 [X86] Fix VZEROUPPER scheduling info on SNB,HSW,BDW,SXL,SKX.
Summary:
Starting from SNB, VZEROUPPER is handled by the renamer and uses no proc resources.
After HSW, it also has zero latency.

This fixes PR35606.

To reproduce:
Uops:
  llvm-exegesis -mode=uops -opcode-name=VZEROUPPER
Latency:
  echo -e '#LLVM-EXEGESIS-DEFREG XMM0 1\n#LLVM-EXEGESIS-DEFREG XMM1 1\nvzeroupper' | /tmp/llvm-exegesis -mode=latency -snippets-file=-
  echo -e '#LLVM-EXEGESIS-DEFREG XMM0 1\n#LLVM-EXEGESIS-DEFREG XMM1 1\nvzeroupper\naddps %xmm0, %xmm1' | /tmp/llvm-exegesis -mode=latency -snippets-file=-

Reviewers: RKSimon, craig.topper, andreadb

Subscribers: gbedwell, llvm-commits

Differential Revision: https://reviews.llvm.org/D54107

llvm-svn: 346482
2018-11-09 09:49:06 +00:00
Roman Lebedev 3817292069 [NFC][BdVer2] Load and store throughput tests: also check sched stats (PR39465)
As noted by Andrea Di Biagio in https://bugs.llvm.org/show_bug.cgi?id=39465
both the loads and stores occupy both the store and load queues.
This is clearly wrong.

llvm-svn: 346425
2018-11-08 18:15:58 +00:00
Roman Lebedev 2ad16b9371 [NFC][BdVer2] Tests for load and store throughput (PR39465)
During review it was noted that while it appears that
the Piledriver can do two [consecutive] loads per cycle,
it can only do one store per cycle. It was suggested
that the sched model incorrectly models that,
but it was opted to fix this afterwards.

These tests show that the two consecutive loads are
modelled correctly, and one consecutive stores is not
modelled incorrectly. Unless i'm missing the point.

https://bugs.llvm.org/show_bug.cgi?id=39465

llvm-svn: 346404
2018-11-08 14:48:56 +00:00
Andrea Di Biagio fe3bc1b9bf [llvm-mca] Add extra counters for move elimination in view RegisterFileStatistics.
This patch teaches view RegisterFileStatistics how to report events for
optimizable register moves.

For each processor register file, view RegisterFileStatistics reports the
following extra information:
 - Number of optimizable register moves
 - Number of register moves eliminated
 - Number of zero moves (i.e. register moves that propagate a zero)
 - Max Number of moves eliminated per cycle.

Differential Revision: https://reviews.llvm.org/D53976

llvm-svn: 345865
2018-11-01 18:04:39 +00:00
Roman Lebedev a5baf86744 AMD BdVer2 (Piledriver) Initial Scheduler model
Summary:
# Overview
This is somewhat partial.
* Latencies are good {F7371125}
  * All of these remaining inconsistencies //appear// to be noise/noisy/flaky.
* NumMicroOps are somewhat good {F7371158}
  * Most of the remaining inconsistencies are from `Ld` / `Ld_ReadAfterLd` classes
* Actual unit occupation (pipes, `ResourceCycles`) are undiscovered lands, i did not really look there.
  They are basically verbatum copy from `btver2`
* Many `InstRW`. And there are still inconsistencies left...

To be noted:
I think this is the first new schedule profile produced with the new next-gen tools like llvm-exegesis!

# Benchmark
I realize that isn't what was suggested, but i'll start with some "internal" public real-world benchmark i understand - [[ https://github.com/darktable-org/rawspeed | RawSpeed raw image decoding library ]].
Diff (the exact clang from trunk without/with this patch):
```
Comparing /home/lebedevri/rawspeed/build-old/src/utilities/rsbench/rsbench to /home/lebedevri/rawspeed/build-new/src/utilities/rsbench/rsbench
Benchmark                                                                                        Time             CPU      Time Old      Time New       CPU Old       CPU New
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Canon/EOS 5D Mark II/09.canon.sraw1.cr2/threads:8/real_time_pvalue                             0.0000          0.0000      U Test, Repetitions: 25 vs 25
Canon/EOS 5D Mark II/09.canon.sraw1.cr2/threads:8/real_time_mean                              -0.0607         -0.0604           234           219           233           219
Canon/EOS 5D Mark II/09.canon.sraw1.cr2/threads:8/real_time_median                            -0.0630         -0.0626           233           219           233           219
Canon/EOS 5D Mark II/09.canon.sraw1.cr2/threads:8/real_time_stddev                            +0.2581         +0.2587             1             2             1             2
Canon/EOS 5D Mark II/10.canon.sraw2.cr2/threads:8/real_time_pvalue                             0.0000          0.0000      U Test, Repetitions: 25 vs 25
Canon/EOS 5D Mark II/10.canon.sraw2.cr2/threads:8/real_time_mean                              -0.0770         -0.0767           144           133           144           133
Canon/EOS 5D Mark II/10.canon.sraw2.cr2/threads:8/real_time_median                            -0.0767         -0.0763           144           133           144           133
Canon/EOS 5D Mark II/10.canon.sraw2.cr2/threads:8/real_time_stddev                            -0.4170         -0.4156             1             0             1             0
Canon/EOS 5DS/2K4A9927.CR2/threads:8/real_time_pvalue                                          0.0000          0.0000      U Test, Repetitions: 25 vs 25
Canon/EOS 5DS/2K4A9927.CR2/threads:8/real_time_mean                                           -0.0271         -0.0270           463           450           463           450
Canon/EOS 5DS/2K4A9927.CR2/threads:8/real_time_median                                         -0.0093         -0.0093           453           449           453           449
Canon/EOS 5DS/2K4A9927.CR2/threads:8/real_time_stddev                                         -0.7280         -0.7280            13             4            13             4
Canon/EOS 5DS/2K4A9928.CR2/threads:8/real_time_pvalue                                          0.0004          0.0004      U Test, Repetitions: 25 vs 25
Canon/EOS 5DS/2K4A9928.CR2/threads:8/real_time_mean                                           -0.0065         -0.0065           569           565           569           565
Canon/EOS 5DS/2K4A9928.CR2/threads:8/real_time_median                                         -0.0077         -0.0077           569           564           569           564
Canon/EOS 5DS/2K4A9928.CR2/threads:8/real_time_stddev                                         +1.0077         +1.0068             2             5             2             5
Canon/EOS 5DS/2K4A9929.CR2/threads:8/real_time_pvalue                                          0.0220          0.0199      U Test, Repetitions: 25 vs 25
Canon/EOS 5DS/2K4A9929.CR2/threads:8/real_time_mean                                           +0.0006         +0.0007           312           312           312           312
Canon/EOS 5DS/2K4A9929.CR2/threads:8/real_time_median                                         +0.0031         +0.0032           311           312           311           312
Canon/EOS 5DS/2K4A9929.CR2/threads:8/real_time_stddev                                         -0.7069         -0.7072             4             1             4             1
Canon/EOS 10D/CRW_7673.CRW/threads:8/real_time_pvalue                                          0.0004          0.0004      U Test, Repetitions: 25 vs 25
Canon/EOS 10D/CRW_7673.CRW/threads:8/real_time_mean                                           -0.0015         -0.0015           141           141           141           141
Canon/EOS 10D/CRW_7673.CRW/threads:8/real_time_median                                         -0.0010         -0.0011           141           141           141           141
Canon/EOS 10D/CRW_7673.CRW/threads:8/real_time_stddev                                         -0.1486         -0.1456             0             0             0             0
Canon/EOS 40D/_MG_0154.CR2/threads:8/real_time_pvalue                                          0.6139          0.8766      U Test, Repetitions: 25 vs 25
Canon/EOS 40D/_MG_0154.CR2/threads:8/real_time_mean                                           -0.0008         -0.0005            60            60            60            60
Canon/EOS 40D/_MG_0154.CR2/threads:8/real_time_median                                         -0.0006         -0.0002            60            60            60            60
Canon/EOS 40D/_MG_0154.CR2/threads:8/real_time_stddev                                         -0.1467         -0.1390             0             0             0             0
Canon/EOS 77D/IMG_4049.CR2/threads:8/real_time_pvalue                                          0.0137          0.0137      U Test, Repetitions: 25 vs 25
Canon/EOS 77D/IMG_4049.CR2/threads:8/real_time_mean                                           +0.0002         +0.0002           275           275           275           275
Canon/EOS 77D/IMG_4049.CR2/threads:8/real_time_median                                         -0.0015         -0.0014           275           275           275           275
Canon/EOS 77D/IMG_4049.CR2/threads:8/real_time_stddev                                         +3.3687         +3.3587             0             2             0             2
Canon/PowerShot G1/crw_1693.crw/threads:8/real_time_pvalue                                     0.4041          0.3933      U Test, Repetitions: 25 vs 25
Canon/PowerShot G1/crw_1693.crw/threads:8/real_time_mean                                      +0.0004         +0.0004            67            67            67            67
Canon/PowerShot G1/crw_1693.crw/threads:8/real_time_median                                    -0.0000         -0.0000            67            67            67            67
Canon/PowerShot G1/crw_1693.crw/threads:8/real_time_stddev                                    +0.1947         +0.1995             0             0             0             0
Fujifilm/GFX 50S/20170525_0037TEST.RAF/threads:8/real_time_pvalue                              0.0074          0.0001      U Test, Repetitions: 25 vs 25
Fujifilm/GFX 50S/20170525_0037TEST.RAF/threads:8/real_time_mean                               -0.0092         +0.0074           547           542            25            25
Fujifilm/GFX 50S/20170525_0037TEST.RAF/threads:8/real_time_median                             -0.0054         +0.0115           544           541            25            25
Fujifilm/GFX 50S/20170525_0037TEST.RAF/threads:8/real_time_stddev                             -0.4086         -0.3486             8             5             0             0
Fujifilm/X-Pro2/_DSF3051.RAF/threads:8/real_time_pvalue                                        0.3320          0.0000      U Test, Repetitions: 25 vs 25
Fujifilm/X-Pro2/_DSF3051.RAF/threads:8/real_time_mean                                         +0.0015         +0.0204           218           218            12            12
Fujifilm/X-Pro2/_DSF3051.RAF/threads:8/real_time_median                                       +0.0001         +0.0203           218           218            12            12
Fujifilm/X-Pro2/_DSF3051.RAF/threads:8/real_time_stddev                                       +0.2259         +0.2023             1             1             0             0
GoPro/HERO6 Black/GOPR9172.GPR/threads:8/real_time_pvalue                                      0.0000          0.0001      U Test, Repetitions: 25 vs 25
GoPro/HERO6 Black/GOPR9172.GPR/threads:8/real_time_mean                                       -0.0209         -0.0179            96            94            90            88
GoPro/HERO6 Black/GOPR9172.GPR/threads:8/real_time_median                                     -0.0182         -0.0155            95            93            90            88
GoPro/HERO6 Black/GOPR9172.GPR/threads:8/real_time_stddev                                     -0.6164         -0.2703             2             1             2             1
Kodak/DCS Pro 14nx/D7465857.DCR/threads:8/real_time_pvalue                                     0.0000          0.0000      U Test, Repetitions: 25 vs 25
Kodak/DCS Pro 14nx/D7465857.DCR/threads:8/real_time_mean                                      -0.0098         -0.0098           176           175           176           175
Kodak/DCS Pro 14nx/D7465857.DCR/threads:8/real_time_median                                    -0.0126         -0.0126           176           174           176           174
Kodak/DCS Pro 14nx/D7465857.DCR/threads:8/real_time_stddev                                    +6.9789         +6.9157             0             2             0             2
Nikon/D850/Nikon-D850-14bit-lossless-compressed.NEF/threads:8/real_time_pvalue                 0.0000          0.0000      U Test, Repetitions: 25 vs 25
Nikon/D850/Nikon-D850-14bit-lossless-compressed.NEF/threads:8/real_time_mean                  -0.0237         -0.0238           474           463           474           463
Nikon/D850/Nikon-D850-14bit-lossless-compressed.NEF/threads:8/real_time_median                -0.0267         -0.0267           473           461           473           461
Nikon/D850/Nikon-D850-14bit-lossless-compressed.NEF/threads:8/real_time_stddev                +0.7179         +0.7178             3             5             3             5
Olympus/E-M1MarkII/Olympus_EM1mk2__HIRES_50MP.ORF/threads:8/real_time_pvalue                   0.6837          0.6554      U Test, Repetitions: 25 vs 25
Olympus/E-M1MarkII/Olympus_EM1mk2__HIRES_50MP.ORF/threads:8/real_time_mean                    -0.0014         -0.0013          1375          1373          1375          1373
Olympus/E-M1MarkII/Olympus_EM1mk2__HIRES_50MP.ORF/threads:8/real_time_median                  +0.0018         +0.0019          1371          1374          1371          1374
Olympus/E-M1MarkII/Olympus_EM1mk2__HIRES_50MP.ORF/threads:8/real_time_stddev                  -0.7457         -0.7382            11             3            10             3
Panasonic/DC-G9/P1000476.RW2/threads:8/real_time_pvalue                                        0.0000          0.0000      U Test, Repetitions: 25 vs 25
Panasonic/DC-G9/P1000476.RW2/threads:8/real_time_mean                                         -0.0080         -0.0289            22            22            10            10
Panasonic/DC-G9/P1000476.RW2/threads:8/real_time_median                                       -0.0070         -0.0287            22            22            10            10
Panasonic/DC-G9/P1000476.RW2/threads:8/real_time_stddev                                       +1.0977         +0.6614             0             0             0             0
Panasonic/DC-GH5/_T012014.RW2/threads:8/real_time_pvalue                                       0.0000          0.0000      U Test, Repetitions: 25 vs 25
Panasonic/DC-GH5/_T012014.RW2/threads:8/real_time_mean                                        +0.0132         +0.0967            35            36            10            11
Panasonic/DC-GH5/_T012014.RW2/threads:8/real_time_median                                      +0.0132         +0.0956            35            36            10            11
Panasonic/DC-GH5/_T012014.RW2/threads:8/real_time_stddev                                      -0.0407         -0.1695             0             0             0             0
Panasonic/DC-GH5S/P1022085.RW2/threads:8/real_time_pvalue                                      0.0000          0.0000      U Test, Repetitions: 25 vs 25
Panasonic/DC-GH5S/P1022085.RW2/threads:8/real_time_mean                                       +0.0331         +0.1307            13            13             6             6
Panasonic/DC-GH5S/P1022085.RW2/threads:8/real_time_median                                     +0.0430         +0.1373            12            13             6             6
Panasonic/DC-GH5S/P1022085.RW2/threads:8/real_time_stddev                                     -0.9006         -0.8847             1             0             0             0
Pentax/645Z/IMGP2837.PEF/threads:8/real_time_pvalue                                            0.0016          0.0010      U Test, Repetitions: 25 vs 25
Pentax/645Z/IMGP2837.PEF/threads:8/real_time_mean                                             -0.0023         -0.0024           395           394           395           394
Pentax/645Z/IMGP2837.PEF/threads:8/real_time_median                                           -0.0029         -0.0030           395           394           395           393
Pentax/645Z/IMGP2837.PEF/threads:8/real_time_stddev                                           -0.0275         -0.0375             1             1             1             1
Phase One/P65/CF027310.IIQ/threads:8/real_time_pvalue                                          0.0232          0.0000      U Test, Repetitions: 25 vs 25
Phase One/P65/CF027310.IIQ/threads:8/real_time_mean                                           -0.0047         +0.0039           114           113            28            28
Phase One/P65/CF027310.IIQ/threads:8/real_time_median                                         -0.0050         +0.0037           114           113            28            28
Phase One/P65/CF027310.IIQ/threads:8/real_time_stddev                                         -0.0599         -0.2683             1             1             0             0
Samsung/NX1/2016-07-23-142101_sam_9364.srw/threads:8/real_time_pvalue                          0.0000          0.0000      U Test, Repetitions: 25 vs 25
Samsung/NX1/2016-07-23-142101_sam_9364.srw/threads:8/real_time_mean                           +0.0206         +0.0207           405           414           405           414
Samsung/NX1/2016-07-23-142101_sam_9364.srw/threads:8/real_time_median                         +0.0204         +0.0205           405           414           405           414
Samsung/NX1/2016-07-23-142101_sam_9364.srw/threads:8/real_time_stddev                         +0.2155         +0.2212             1             1             1             1
Samsung/NX30/2015-03-07-163604_sam_7204.srw/threads:8/real_time_pvalue                         0.0000          0.0000      U Test, Repetitions: 25 vs 25
Samsung/NX30/2015-03-07-163604_sam_7204.srw/threads:8/real_time_mean                          -0.0109         -0.0108           147           145           147           145
Samsung/NX30/2015-03-07-163604_sam_7204.srw/threads:8/real_time_median                        -0.0104         -0.0103           147           145           147           145
Samsung/NX30/2015-03-07-163604_sam_7204.srw/threads:8/real_time_stddev                        -0.4919         -0.4800             0             0             0             0
Samsung/NX3000/_3184416.SRW/threads:8/real_time_pvalue                                         0.0000          0.0000      U Test, Repetitions: 25 vs 25
Samsung/NX3000/_3184416.SRW/threads:8/real_time_mean                                          -0.0149         -0.0147           220           217           220           217
Samsung/NX3000/_3184416.SRW/threads:8/real_time_median                                        -0.0173         -0.0169           221           217           220           217
Samsung/NX3000/_3184416.SRW/threads:8/real_time_stddev                                        +1.0337         +1.0341             1             3             1             3
Sony/DSLR-A350/DSC05472.ARW/threads:8/real_time_pvalue                                         0.0001          0.0001      U Test, Repetitions: 25 vs 25
Sony/DSLR-A350/DSC05472.ARW/threads:8/real_time_mean                                          -0.0019         -0.0019           194           193           194           193
Sony/DSLR-A350/DSC05472.ARW/threads:8/real_time_median                                        -0.0021         -0.0021           194           193           194           193
Sony/DSLR-A350/DSC05472.ARW/threads:8/real_time_stddev                                        -0.4441         -0.4282             0             0             0             0
Sony/ILCE-7RM2/14-bit-compressed.ARW/threads:8/real_time_pvalue                                0.0000          0.4263      U Test, Repetitions: 25 vs 25
Sony/ILCE-7RM2/14-bit-compressed.ARW/threads:8/real_time_mean                                 +0.0258         -0.0006            81            83            19            19
Sony/ILCE-7RM2/14-bit-compressed.ARW/threads:8/real_time_median                               +0.0235         -0.0011            81            82            19            19
Sony/ILCE-7RM2/14-bit-compressed.ARW/threads:8/real_time_stddev                               +0.1634         +0.1070             1             1             0             0
```
{F7443905}
If we look at the `_mean`s, the time column, the biggest win is `-7.7%` (`Canon/EOS 5D Mark II/10.canon.sraw2.cr2`),
and the biggest loose is `+3.3%` (`Panasonic/DC-GH5S/P1022085.RW2`);
Overall: mean `-0.7436%`, median `-0.23%`, `cbrt(sum(time^3))` = `-8.73%`
Looks good so far i'd say.

llvm-exegesis details:
{F7371117} {F7371125}
{F7371128} {F7371144} {F7371158}

Reviewers: craig.topper, RKSimon, andreadb, courbet, avt77, spatel, GGanesh

Reviewed By: andreadb

Subscribers: javed.absar, gbedwell, jfb, llvm-commits

Differential Revision: https://reviews.llvm.org/D52779

llvm-svn: 345463
2018-10-27 20:46:30 +00:00
Roman Lebedev a51921877a [NFC][X86] Baseline tests for AMD BdVer2 (Piledriver) Scheduler model
Adding the baseline tests in a preparatory NFC commit,
so that the actual commit shows the *diff*.

Yes, i'm aware that a few of these codegen-based sched tests
are testing wrong instructions, i will fix that afterwards.

For https://reviews.llvm.org/D52779

llvm-svn: 345462
2018-10-27 20:36:11 +00:00
Reid Kleckner 953bdce68d [MC] Separate masm integer literal lexer support from inline asm
Summary:
This renames the IsParsingMSInlineAsm member variable of AsmLexer to
LexMasmIntegers and moves it up to MCAsmLexer. This is the only behavior
controlled by that variable. I added a public setter, so that it can be
set from outside or from the llvm-mc command line. We may need to
arrange things so that users can get this behavior from clang, but
that's future work.

I also put additional hex literal lexing functionality under this flag
to fix PR32973. It appears that this hex literal parsing wasn't intended
to be enabled in non-masm-style blocks.

Now, masm integers (0b1101 and 0ABCh) work in __asm blocks from clang,
but 0b label references work when using .intel_syntax in standalone .s
files.

However, 0b label references will *not* work from __asm blocks in clang.
They will work from GCC inline asm blocks, which it sounds like is
important for Crypto++ as mentioned in PR36144.

Essentially, we only lex masm literals for inline asm blobs that use
intel syntax. If the .intel_syntax directive is used inside a gnu-style
inline asm statement, masm literals will not be lexed, which is
compatible with gas and llvm-mc standalone .s assembly.

This fixes PR36144 and PR32973.

Reviewers: Gerolf, avt77

Subscribers: eraman, hiraditya, llvm-commits

Differential Revision: https://reviews.llvm.org/D53535

llvm-svn: 345189
2018-10-24 20:23:57 +00:00
Simon Pilgrim 7d27cfdcb2 [X86] Fix Skylake ReadAfterLd for PADDrm etc.
Missed in rL343868 as due to their custom InstrRW.

llvm-svn: 344600
2018-10-16 09:50:16 +00:00
Andrea Di Biagio 6eebbe0a97 [tblgen][llvm-mca] Add the ability to describe move elimination candidates via tablegen.
This patch adds the ability to identify instructions that are "move elimination
candidates". It also allows scheduling models to describe processor register
files that allow move elimination.

A move elimination candidate is an instruction that can be eliminated at
register renaming stage.
Each subtarget can specify which instructions are move elimination candidates
with the help of tablegen class "IsOptimizableRegisterMove" (see
llvm/Target/TargetInstrPredicate.td).

For example, on X86, BtVer2 allows both GPR and MMX/SSE moves to be eliminated.
The definition of 'IsOptimizableRegisterMove' for BtVer2 looks like this:

```
def : IsOptimizableRegisterMove<[
  InstructionEquivalenceClass<[
    // GPR variants.
    MOV32rr, MOV64rr,

    // MMX variants.
    MMX_MOVQ64rr,

    // SSE variants.
    MOVAPSrr, MOVUPSrr,
    MOVAPDrr, MOVUPDrr,
    MOVDQArr, MOVDQUrr,

    // AVX variants.
    VMOVAPSrr, VMOVUPSrr,
    VMOVAPDrr, VMOVUPDrr,
    VMOVDQArr, VMOVDQUrr
  ], CheckNot<CheckSameRegOperand<0, 1>> >
]>;
```

Definitions of IsOptimizableRegisterMove from processor models of a same
Target are processed by the SubtargetEmitter to auto-generate a target-specific
override for each of the following predicate methods:

```
bool TargetSubtargetInfo::isOptimizableRegisterMove(const MachineInstr *MI)
const;
bool MCInstrAnalysis::isOptimizableRegisterMove(const MCInst &MI, unsigned
CPUID) const;
```

By default, those methods return false (i.e. conservatively assume that there
are no move elimination candidates).

Tablegen class RegisterFile has been extended with the following information:
 - The set of register classes that allow move elimination.
 - Maxium number of moves that can be eliminated every cycle.
 - Whether move elimination is restricted to moves from registers that are
   known to be zero.

This patch is structured in three part:

A first part (which is mostly boilerplate) adds the new
'isOptimizableRegisterMove' target hooks, and extends existing register file
descriptors in MC by introducing new fields to describe properties related to
move elimination.

A second part, uses the new tablegen constructs to describe move elimination in
the BtVer2 scheduling model.

A third part, teaches llm-mca how to query the new 'isOptimizableRegisterMove'
hook to mark instructions that are candidates for move elimination. It also
teaches class RegisterFile how to describe constraints on move elimination at
PRF granularity.

llvm-mca tests for btver2 show differences before/after this patch.

Differential Revision: https://reviews.llvm.org/D53134

llvm-svn: 344334
2018-10-12 11:23:04 +00:00
Andrea Di Biagio 6a0b319549 [llvm-mca][BtVer2] Add tests for optimizable GPR register moves. NFC
llvm-svn: 344253
2018-10-11 14:54:54 +00:00
Andrea Di Biagio 1b29ec6531 [llvm-mca][BtVer2] Add two more move-elimination tests. NFC
These should test all the optimizable moves on Jaguar.
A follow-up patch will teach how to recognize these optimizable register moves.

llvm-svn: 344144
2018-10-10 14:46:54 +00:00
Simon Pilgrim f09fc3bc12 [X86] Move ReadAfterLd functionality into X86FoldableSchedWrite (PR36957)
Currently we hardcode instructions with ReadAfterLd if the register operands don't need to be available until the folded load has completed. This doesn't take into account the different load latencies of different memory operands (PR36957).

This patch adds a ReadAfterFold def into X86FoldableSchedWrite to replace ReadAfterLd, allowing us to specify the load latency at a scheduler class level.

I've added ReadAfterVec*Ld classes that match the XMM/Scl, XMM and YMM/ZMM WriteVecLoad classes that we currently use, we can tweak these values in future patches once this infrastructure is in place.

Differential Revision: https://reviews.llvm.org/D52886

llvm-svn: 343868
2018-10-05 17:57:29 +00:00
Simon Pilgrim 6ad03ad34b [llvm-mca][x86] Add PR36951 ReadAfterLd test case
llvm-svn: 343795
2018-10-04 16:26:56 +00:00
Greg Bedwell dee7bfdb9f [utils] Ensure that update_mca_test_checks.py writes prefixes in alphabetical order
llvm-svn: 343783
2018-10-04 14:42:19 +00:00
Simon Pilgrim 82a3b1c687 [llvm-mca][x86] Add tests demonstrating ReadAfterLd delay
llvm-svn: 343773
2018-10-04 13:05:42 +00:00
Simon Pilgrim 0b451a2983 [X86][Btver2] Fix MMX PSHUFB schedule
Match AMD Fam16h SOG + llvm-exegesis tests

llvm-svn: 343701
2018-10-03 18:18:50 +00:00
Andrea Di Biagio 207e0217f9 [llvm-mca] Add support for move elimination in class RegisterFile.
This patch teaches class RegisterFile how to analyze register writes from
instructions that are move elimination candidates.
In particular, it teaches it how to check if a move can be effectively eliminated
by the underlying PRF, and (if necessary) how to perform move elimination.

The long term goal is to allow processor models to describe instructions that
are valid move elimination candidates.
The idea is to let register file definitions in tablegen declare if/when moves
can be eliminated.

This patch is a non functional change.
The logic that performs move elimination is currently disabled.  A future patch
will add support for move elimination in the processor models, and enable this
new code path.

llvm-svn: 343691
2018-10-03 15:02:44 +00:00
Simon Pilgrim c68cc4efbe [X86][Btver2] Most RMW instructions don't require an additional uop
Remove uop on WriteRMW and move it into the few instructions that need it.

Match AMD Fam16h SOG + llvm-exegesis tests

llvm-svn: 343671
2018-10-03 10:28:43 +00:00
Simon Pilgrim d11015861c [X86] ALU/ADC RMW instructions should use the WriteRMW sequence class
I was expecting this to be a nfc but Silvermont seems to be setup a little differently:

// A folded store needs a cycle on MEC_RSV for the store data, but it does not need an extra port cycle to recompute the address.
def : WriteRes<WriteRMW, [SLM_MEC_RSV]>;

So moving from WriteStore to WriteRMW reduces predicted port pressure, confirmed by @craig.topper that this is correct.

Differential Revision: https://reviews.llvm.org/D52740

llvm-svn: 343670
2018-10-03 10:01:13 +00:00
Simon Pilgrim 860cb5c071 [X86][Btver2] Fix BLENDV and AESDEC schedules
Match AMD Fam16h SOG + llvm-exegesis tests

llvm-svn: 343597
2018-10-02 15:13:18 +00:00
Simon Pilgrim e0d2019052 [X86][Btver2] Fix BT(C|R|S)mr & BT(C|R|S)mi schedule latency + uop counts
Match AMD Fam16h SOG + llvm-exegesis tests

llvm-svn: 343494
2018-10-01 16:31:30 +00:00
Simon Pilgrim 6ddc4e821c [X86][Btver2] Fix BTmr schedule uop counts
Match AMD Fam16h SOG + llvm-exegesis tests

llvm-svn: 343484
2018-10-01 14:42:16 +00:00
Simon Pilgrim a982236e59 [X86][Btver2] Fix masked load schedule
JFPU01 resource usage should match JFPX

Match AMD Fam16h SOG + llvm-exegesis tests

llvm-svn: 343468
2018-10-01 13:12:05 +00:00
Andrea Di Biagio 24ea163007 [X86][BtVer2] Teach how to identify zero-idiom VPERM2F128rr instructions.
This patch adds another variant class to identify zero-idiom VPERM2F128rr
instructions.

On Jaguar, a VPERM wih bit 3 and 7 of the mask set, is a zero-idiom.

Differential Revision: https://reviews.llvm.org/D52663

llvm-svn: 343452
2018-10-01 10:35:13 +00:00
Clement Courbet a933fb237e [X86][Sched] Update scheduling information for VZEROALL on HWS, BDW, SKX, SNB.
Summary:
    While looking at PR35606, I found out that the scheduling info is incorrect.

    One can check that it's really a P5+P6 and not a 2*P56 with:
    echo -e 'vzeroall\nvandps %xmm1, %xmm2, %xmm3' | ./bin/llvm-exegesis -mode=uops -snippets-file=-
    (vandps executes on P5 only)

    Reviewers: craig.topper, RKSimon

    Subscribers: llvm-commits

    Differential Revision: https://reviews.llvm.org/D52541

llvm-svn: 343447
2018-10-01 08:37:48 +00:00
Simon Pilgrim f21083870d [X86] Fix scheduler class for BTmi instructions
This wasn't treated as a folded load instruction

llvm-svn: 343424
2018-09-30 20:19:16 +00:00
Simon Pilgrim b1108399bd [LLVM-MCA][X86] Add missing VCMPESTR/VCMPESTR tests
llvm-svn: 343421
2018-09-30 18:19:00 +00:00
Simon Pilgrim 20623f2343 [LLVM-MCA][X86] Add some AVX512 tests
These are going to be necessary to check I don't mess up when I start cleaning up all the remaining vector integer overrides

llvm-svn: 343414
2018-09-30 17:01:59 +00:00
Simon Pilgrim 4f5693ac8d [X86][Btver2] Fix PCmpIStrI/PCmpIStrM schedules
Missing JFPU0 pipe and double JFPU1 pipe (to match JVALU1) resources

Match AMD Fam16h SOG + llvm-exegesis tests

llvm-svn: 343413
2018-09-30 16:38:38 +00:00
Andrea Di Biagio 6e218d0a57 [llvm-mca] Add a test for zero-idiom VPERM2F128rr. NFC
We don't correctly model the latency and resource usage information for
zero-idiom VPERM2F128rr on Jaguar.

This is demonstrated by the incorrect numbers in the resource pressure view, and
the timeline view.
A follow up patch will fix this problem.

llvm-svn: 343346
2018-09-28 17:47:09 +00:00
Greg Bedwell becbbe0383 [utils] Stricter checking from update_mca_test_checks.py
If any prefixes have been specified on the RUN lines that do not end up
ever actually getting printed, raise an Error. This is either an
indication that the run lines just need cleaning up, or that something
is more fundamentally wrong with the test.

Also raise an Error if there are any blocks which cannot be checked
because they are not uniquely covered by a prefix.

Fixed up a couple of tests where the extra checking flagged up issues.

Differential Revision: https://reviews.llvm.org/D48276

llvm-svn: 343332
2018-09-28 15:39:09 +00:00
Greg Bedwell 2f528f8c1e [utils] Allow better identification of matching blocks in update_mca_test_checks.py
Insert empty blocks to cause the positions of matching blocks to match
across lists where possible so that later stages of the algorithm can
actually identify them as being identical.

Regenerated all tests with this change.

Differential Revision: https://reviews.llvm.org/D52560

llvm-svn: 343331
2018-09-28 15:38:56 +00:00
Simon Pilgrim 428c1196d8 [X86][Btver2] PSUBS/PSUBUS instructions are zero-idioms
Noticed during llvm-exegesis tests, the PSUBS/PSUBUS instructions have the same zero-idiom behaviour to PSUB

llvm-svn: 343321
2018-09-28 14:20:42 +00:00
Simon Pilgrim 3216fd3602 [X86][Btver2] Add zero-idiom tests for PSUBS/PSUBUS instructions
Noticed during llvm-exegesis tests, the PSUBS/PSUBUS instructions have the same zero-idiom behaviour to PSUB

llvm-svn: 343319
2018-09-28 13:53:11 +00:00
Simon Pilgrim 66da1ed29d [X86][Btver2] CVTSS2I/CVTSD2I - add missing JFPU0 pipe
We issue JFPU1->JSTC then JFPU0->JFPA then -> JALU0 (integer pipe)

Match AMD Fam16h SOG + llvm-exegesis tests

llvm-svn: 343314
2018-09-28 13:19:22 +00:00
Simon Pilgrim 17e5981ebf [X86][Btver2] Fix BSF/BSR schedule
Double throughput to account for 2 pipes + fix BSF's latency/uop counts

Match AMD Fam16h SOG + llvm-exegesis tests

llvm-svn: 343311
2018-09-28 10:26:48 +00:00
Simon Pilgrim 280af1c7f0 [X86][BtVer2] Fix PHMINPOS schedule resources typo
PHMINPOS can run on either JFPU pipe

llvm-svn: 343299
2018-09-28 08:21:39 +00:00
Simon Pilgrim 86c7b07ecd [X86][Btver2] (V)MPSADBW instructions take 3uops not 1
llvm-svn: 343238
2018-09-27 17:13:57 +00:00
Simon Pilgrim dd744f158a [X86][Btver2] BTC/BTR/BTS instructions take 2uops not 1
llvm-svn: 343234
2018-09-27 16:39:52 +00:00
Simon Pilgrim c2a88ea64e [X86][Btver2] BLSI/BLSMSK/BLSR instructions take 2uops not 1 (same as TZCNT)
llvm-svn: 343227
2018-09-27 14:57:57 +00:00
Simon Pilgrim 98f503a326 [X86][Btver2] TZCNT instructions take 2uops not 1
llvm-svn: 343200
2018-09-27 12:28:47 +00:00
Simon Pilgrim b56be79e0c Revert rL342916: [X86] Remove shift/rotate by CL memory (RMW) overrides
As suggested by Craig Topper - I'm going to look at cleaning up the RMW sequences instead.

The uops are slightly different to the register variant, so requires a +1uop tweak

llvm-svn: 342969
2018-09-25 13:01:26 +00:00
Simon Pilgrim 0b4ad7596f [X86] Remove shift/rotate by CL memory (RMW) overrides
The uops are slightly different to the register variant, so requires a +1uop tweak

llvm-svn: 342916
2018-09-24 20:11:50 +00:00
Simon Pilgrim 00865a48d1 [X86] Split WriteIMul into 8/16/32/64 implementations (PR36931)
Split WriteIMul by size and also by IMUL multiply-by-imm and multiply-by-reg cases.

This removes all the scheduler overrides for gpr multiplies and stops WriteMULH being ignored for BMI2 MULX instructions.

llvm-svn: 342892
2018-09-24 15:21:57 +00:00
Simon Pilgrim 9202c9fb47 [X86] ROR*mCL instruction models should match ROL*mCL etc.
Confirmed with Craig Topper - fix a typo that was missing a Port4 uop for ROR*mCL instructions on some Intel models.

Yet another step on the scheduler model cleanup marathon......

llvm-svn: 342846
2018-09-23 19:16:01 +00:00
Simon Pilgrim 19952add7c [X86] Added missing RCL/RCR schedule overrides to the generic SNB model
The SandyBridge model was missing schedule values for the RCL/RCR values - instead using the (incredibly optimistic) WriteShift (now WriteRotate) defaults.

I've added overrides with more realistic (slow) values, based on a mixture of Agner/instlatx64 numbers and what later Intel models do as well.

This is necessary to allow WriteRotate to be updated to remove other rotate overrides.

It'd probably be a good idea to investigate a WriteRotateCarry class at some point but its not high priority given the unusualness of these instructions.

llvm-svn: 342842
2018-09-23 17:40:24 +00:00
Clement Courbet 8171bd8e0f [X86][Sched] Add zero idiom sched data to the SNB model.
Summary:
On SNB, renamer-based zeroing does not work for:
 - 16 and 8-bit GPRs[1].
 - MMX [2].
 - ANDN variants [3]

[1] echo 'sub %ax, %ax' | /tmp/llvm-exegesis -mode=uops -snippets-file=-
[2] echo 'pxor %mm0, %mm0' | /tmp/llvm-exegesis -mode=uops -snippets-file=-
[3] echo 'andnps %xmm0, %xmm0' | /tmp/llvm-exegesis -mode=uops -snippets-file=-

Reviewers: RKSimon, andreadb

Subscribers: gbedwell, craig.topper, llvm-commits

Differential Revision: https://reviews.llvm.org/D52358

llvm-svn: 342736
2018-09-21 14:07:20 +00:00
Andrea Di Biagio 4cd5cf9fc8 [X86][BtVer2] Fix latency and resource cycles of AVX 256-bit zero-idioms.
This patch introduces a SchedWriteVariant to describe zero-idiom VXORP(S|D)Yrr
and VANDNP(S|D)Yrr.

This is a follow-up of r342555.

On Jaguar, a VXORPSYrr is 2 macro opcodes. Only one opcode is eliminated at
register-renaming stage. The other opcode has to be executed to set the upper
half of the destination YMM.
Same for VANDNP(S|D)Yrr.

Differential Revision: https://reviews.llvm.org/D52347

llvm-svn: 342728
2018-09-21 12:43:07 +00:00
Andrea Di Biagio 0aea310391 [llvm-mca][BtVer2] Modify ANDN tests in zero-idioms-avx-256.s. NFC
Two test cases should have tested 256-bit variants of VANDN zero-idioms instead
of the 128-bit variants.

llvm-svn: 342655
2018-09-20 15:48:23 +00:00
Andrea Di Biagio 8b6c314be1 [TableGen][SubtargetEmitter] Add the ability for processor models to describe dependency breaking instructions.
This patch adds the ability for processor models to describe dependency breaking
instructions.

Different processors may specify a different set of dependency-breaking
instructions.
That means, we cannot assume that all processors of the same target would use
the same rules to classify dependency breaking instructions.

The main goal of this patch is to provide the means to describe dependency
breaking instructions directly via tablegen, and have the following
TargetSubtargetInfo hooks redefined in overrides by tabegen'd
XXXGenSubtargetInfo classes (here, XXX is a Target name).

```
virtual bool isZeroIdiom(const MachineInstr *MI, APInt &Mask) const {
  return false;
}

virtual bool isDependencyBreaking(const MachineInstr *MI, APInt &Mask) const {
  return isZeroIdiom(MI);
}
```

An instruction MI is a dependency-breaking instruction if a call to method
isDependencyBreaking(MI) on the STI (TargetSubtargetInfo object) evaluates to
true. Similarly, an instruction MI is a special case of zero-idiom dependency
breaking instruction if a call to STI.isZeroIdiom(MI) returns true.
The extra APInt is used for those targets that may want to select which machine
operands have their dependency broken (see comments in code).
Note that by default, subtargets don't know about the existence of
dependency-breaking. In the absence of external information, those method calls
would always return false.

A new tablegen class named STIPredicate has been added by this patch to let
processor models classify instructions that have properties in common. The idea
is that, a MCInstrPredicate definition can be used to "generate" an instruction
equivalence class, with the idea that instructions of a same class all have a
property in common.

STIPredicate definitions are essentially a collection of instruction equivalence
classes.
Also, different processor models can specify a different variant of the same
STIPredicate with different rules (i.e. predicates) to classify instructions.
Tablegen backends (in this particular case, the SubtargetEmitter) will be able
to process STIPredicate definitions, and automatically generate functions in
XXXGenSubtargetInfo.

This patch introduces two special kind of STIPredicate classes named
IsZeroIdiomFunction and IsDepBreakingFunction in tablegen. It also adds a
definition for those in the BtVer2 scheduling model only.

This patch supersedes the one committed at r338372 (phabricator review: D49310).

The main advantages are:
 - We can describe subtarget predicates via tablegen using STIPredicates.
 - We can describe zero-idioms / dep-breaking instructions directly via
   tablegen in the scheduling models.

In future, the STIPredicates framework can be used for solving other problems.
Examples of future developments are:
 - Teach how to identify optimizable register-register moves
 - Teach how to identify slow LEA instructions (each subtarget defining its own
   concept of "slow" LEA).
 - Teach how to identify instructions that have undocumented false dependencies
   on the output registers on some processors only.

It is also (in my opinion) an elegant way to expose knowledge to both external
tools like llvm-mca, and codegen passes.
For example, machine schedulers in LLVM could reuse that information when
internally constructing the data dependency graph for a code region.

This new design feature is also an "opt-in" feature. Processor models don't have
to use the new STIPredicates. It has all been designed to be as unintrusive as
possible.

Differential Revision: https://reviews.llvm.org/D52174

llvm-svn: 342555
2018-09-19 15:57:45 +00:00