Commit Graph

677 Commits

Author SHA1 Message Date
Andrea Di Biagio c9649eb9da [X86][BtVer2] Fix latency/throughput of scalar integer MUL instructions.
Single operand MUL instructions that implicitly set EAX have the following
latency/throughput profile (see below):

imul %cl              # latency: 3cy - uOPs: 1 - 1 JMul
imul %cx              # latency: 3cy - uOPs: 3 - 3 JMul
imul %ecx             # latency: 3cy - uOPs: 2 - 2 JMul
imul %rcx             # latency: 6cy - uOPs: 2 - 4 JMul

mul %cl               # latency: 3cy - uOPs: 1 - 1 JMul
mul %cx               # latency: 3cy - uOPs: 3 - 3 JMul
mul %ecx              # latency: 3cy - uOPs: 2 - 2 JMul
mul %rcx              # latency: 6cy - uOPs: 2 - 4 JMul

Excluding the 64bit variant, which has a latency of 6cy, every other instruction
has a latency of 3cy. However, the number of decoded macro-opcodes (as well as
the resource cycles) depends on the MUL size.

The two operand MULs have a more predictable profile (see below):

imul %dx, %dx         # latency: 3cy - uOPs: 1 - 1 JMul
imul %edx, %edx       # latency: 3cy - uOPs: 1 - 1 JMul
imul %rdx, %rdx       # latency: 6cy - uOPs: 1 - 4 JMul

imul $3, %dx, %dx     # latency: 4cy - uOPs: 2 - 2 JMul
imul $3, %ecx, %ecx   # latency: 3cy - uOPs: 1 - 1 JMul
imul $3, %rdx, %rdx   # latency: 6cy - uOPs: 1 - 4 JMul

This patch updates the values in the Jaguar scheduling model and regenerates
llvm-mca tests.
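
As a rough illustration of how the figures above relate to each other, the
following Python sketch derives a reciprocal throughput from the JMul resource
cycles listed in the tables. The `PROFILES` table and the `rthroughput` helper
are hypothetical names, and the formula (resource cycles divided by the number
of units) assumes a single JMul pipe is the only limiting resource; it is not
the scheduling model itself.

```python
# Hypothetical summary of some profiles quoted in the commit message:
# instruction -> (latency in cycles, uOPs, JMul resource cycles).
PROFILES = {
    "imul %cl": (3, 1, 1),
    "imul %cx": (3, 3, 3),
    "imul %ecx": (3, 2, 2),
    "imul %rcx": (6, 2, 4),
    "imul %rdx, %rdx": (6, 1, 4),
    "imul $3, %dx, %dx": (4, 2, 2),
}


def rthroughput(resource_cycles, num_units=1):
    """Reciprocal throughput if only this resource limits issue (assumption)."""
    return resource_cycles / num_units


for name, (latency, uops, jmul_cycles) in PROFILES.items():
    rt = rthroughput(jmul_cycles)
    print(f"{name:20} latency={latency}cy uOPs={uops} rthroughput={rt:.2f}")
```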

Differential Revision: https://reviews.llvm.org/D66547

llvm-svn: 369661
2019-08-22 15:20:16 +00:00
Andrea Di Biagio c6744055ad [X86][BtVer2] Fix latency and throughput of XCHG and XADD.
On Jaguar, XCHG has a latency of 1cy and decodes to 2 macro-opcodes. Maximum
throughput for XCHG is 1 IPC. The byte exchange has worse latency and decodes to
1 extra uOP; maximum observed throughput is 0.5 IPC.

```
xchgb %cl, %dl           # Latency: 2cy  -  uOPs: 3  -  2 ALU
xchgw %cx, %dx           # Latency: 1cy  -  uOPs: 2  -  2 ALU
xchgl %ecx, %edx         # Latency: 1cy  -  uOPs: 2  -  2 ALU
xchgq %rcx, %rdx         # Latency: 1cy  -  uOPs: 2  -  2 ALU
```

The reg-mem forms of XCHG are atomic operations with an observed latency of
16cy.  The resource usage is similar to the XCHGrr variants. The biggest
difference is the bus-locking, which prevents the LS from issuing other
memory uOPs in parallel until the unlocking store uOP is executed.

```
xchgb %cl, (%rsp)        # Latency: 16cy  -  uOPs: 3 - ECX latency: 11cy
xchgw %cx, (%rsp)        # Latency: 16cy  -  uOPs: 3 - ECX latency: 11cy
xchgl %ecx, (%rsp)       # Latency: 16cy  -  uOPs: 3 - ECX latency: 11cy
xchgq %rcx, (%rsp)       # Latency: 16cy  -  uOPs: 3 - ECX latency: 11cy
```

The exchanged in/out register operand becomes available after 11cy from the
start of execution. Added test xchg.s to verify that we correctly see that
register write committed in 11cy (and not 16cy).

Reg-reg XADD instructions have the same latency/throughput as the byte
exchange (register-register variant).

```
xaddb %cl, %dl           # latency: 2cy  -  uOPs: 3  -  3 ALU
xaddw %cx, %dx           # latency: 2cy  -  uOPs: 3  -  3 ALU
xaddl %ecx, %edx         # latency: 2cy  -  uOPs: 3  -  3 ALU
xaddq %rcx, %rdx         # latency: 2cy  -  uOPs: 3  -  3 ALU
```

The non-atomic RM variants have a latency of 11cy, and decode to 4
macro-opcodes. They still consume 2 ALU pipes, and the exchange in/out register
operand becomes available in 3cy (it matches the 'load-to-use latency').

```
xaddb %cl, (%rsp)        # latency: 11cy  -  uOPs: 4  -  3 ALU
xaddw %cx, (%rsp)        # latency: 11cy  -  uOPs: 4  -  3 ALU
xaddl %ecx, (%rsp)       # latency: 11cy  -  uOPs: 4  -  3 ALU
xaddq %rcx, (%rsp)       # latency: 11cy  -  uOPs: 4  -  3 ALU
```

The atomic XADD variants execute in 16cy. The in/out register operand is
available after 11cy from the start of execution.

```
lock xaddb %cl, (%rsp)   # latency: 16cy - uOPs: 4 - 3 ALU -- ECX latency: 11cy
lock xaddw %cx, (%rsp)   # latency: 16cy - uOPs: 4 - 3 ALU -- ECX latency: 11cy
lock xaddl %ecx, (%rsp)  # latency: 16cy - uOPs: 4 - 3 ALU -- ECX latency: 11cy
lock xaddq %rcx, (%rsp)  # latency: 16cy - uOPs: 4 - 3 ALU -- ECX latency: 11cy
```

Added test xadd.s to verify those latencies as well as read-advance values.

Differential Revision: https://reviews.llvm.org/D66535

llvm-svn: 369642
2019-08-22 11:32:47 +00:00
Andrea Di Biagio 2e897a94f5 [X86][BtVer2] Use ReadAfterLd entries for the register operands of CMPXCHG.
This is a follow-up of r369365.

llvm-svn: 369412
2019-08-20 17:05:56 +00:00
Andrea Di Biagio 16111d3795 [X86][BtVer2] Fix latency and throughput of atomic INC/DEC/NEG/NOT.
Latency and throughput of LOCK INC/DEC/NEG/NOT is always 19cy.
Number of uOPs is still 1.

Differential Revision: https://reviews.llvm.org/D66469

llvm-svn: 369388
2019-08-20 14:31:27 +00:00
Simon Pilgrim 6a3dc3e15c [MCA][X86] Add tests for LOCK variants of standard X86 arithmetic ops
D66424 adds the base support for LOCK so we should be able to add special case support for all these cases in future patches

llvm-svn: 369367
2019-08-20 11:13:20 +00:00
Andrea Di Biagio b1bdd97a26 [X86][Btver2] Fix latency and throughput of CMPXCHG instructions.
On Jaguar, CMPXCHG has a latency of 11cy, and a maximum throughput of 0.33 IPC.
Throughput is capped at 0.33 IPC because of the implicit in/out
dependency on register EAX. In the case of repeated non-atomic CMPXCHG with the
same memory location, store-to-load forwarding occurs and values for subsequent
loads are quickly forwarded from the store buffer.

Interestingly, the functionality in LLVM that computes the reciprocal throughput
doesn't seem to know about RMW instructions. That functionality only looks at
the "consumed resource cycles" for the throughput computation. It should be
fixed/improved by a future patch. In particular, for RMW instructions, that
logic should also take into account the write latency of in/out register
operands.
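
To make the point about RMW instructions concrete, here is a minimal Python
sketch of a throughput estimate that also considers a loop-carried in/out
register dependency. The helper name and the input values are hypothetical:
the 3cy dependency is only inferred from the 0.33 IPC figure quoted above, and
the single resource cycle is a placeholder. This is not how LLVM computes
reciprocal throughput today.

```python
def cycles_per_instruction(resource_cycles, loop_carried_latency):
    """Lower bound on cycles per instruction for a back-to-back sequence.

    Assumption: the sequence is limited either by the consumed resource
    cycles or by a loop-carried dependency on an in/out register,
    whichever is larger.
    """
    return max(resource_cycles, loop_carried_latency)


# Illustrative numbers only: looking at resource cycles alone would suggest
# 1 instruction per cycle, while the in/out dependency caps IPC at ~0.33.
cpi = cycles_per_instruction(resource_cycles=1, loop_carried_latency=3)
print(f"IPC upper bound: {1 / cpi:.2f}")
```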

An atomic CMPXCHG has a latency of ~17cy. Throughput is also limited to
~17cy/inst due to cache locking, which prevents other memory uOPs from starting
to execute before the "lock releasing" store uOP.

CMPXCHG8rr and CMPXCHG8rm are treated specially because they decode to one less
macro opcode. Their latency tends to be the same as the other RR/RM variants. RR
variants are relatively fast (3cy), but still microcoded (5 macro opcodes).

CMPXCHG8B is 11cy and unfortunately doesn't seem to benefit from store-to-load
forwarding. That means, throughput is clearly limited by the in/out dependency
on GPR registers. The uOP composition is sadly unknown (due to the lack of PMCs
for the Integer pipes). I have reused the same mix of consumed resource from the
other CMPXCHG instructions for CMPXCHG8B too.
LOCK CMPXCHG8B is instead 18cycles.

CMPXCHG16B is 32cycles. Up to 38cycles when the LOCK prefix is specified. Due to
the in/out dependencies, throughput is limited to 1 instruction every 32 (or 38)
cycles depending on whether the LOCK prefix is specified or not.
I wouldn't be surprised if the microcode for CMPXCHG16B is similar to 2x
microcode from CMPXCHG8B. So, I have speculatively set the JALU01 consumption to
2x the resource cycles used for CMPXCHG8B.

The two new hasLockPrefix() functions are used by the btver2 scheduling model
to check if a MCInst/MachineInst has a LOCK prefix. Calls to hasLockPrefix() have
been encoded in predicates of variant scheduling classes that describe lat/thr
of CMPXCHG.

Differential Revision: https://reviews.llvm.org/D66424

llvm-svn: 369365
2019-08-20 10:23:55 +00:00
Andrea Di Biagio bf989187c3 [X86] Move scheduling tests for CMPXCHG to the corresponding resources-x86_64.s files. NFC
In D66424 it has been requested to move all the new tests added by r369278 into
resources-x86_64.s. That is because only the 8b/16 ops should be tested by
resources-cmpxchg.s. This partially reverts r369278.

llvm-svn: 369288
2019-08-19 18:20:30 +00:00
Andrea Di Biagio ecbaba672e [X86] Added extensive scheduling model tests for all the CMPXCHG variants. NFC
Addresses a review comment in D66424

llvm-svn: 369279
2019-08-19 17:07:26 +00:00
Andrea Di Biagio cbec9af6bf [MCA] Add flag -show-encoding to llvm-mca.
Flag -show-encoding enables the printing of instruction encodings as part of
the instruction info view.

Example (with flags -mtriple=x86_64--  -mcpu=btver2):

Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[7]: Encoding Size

[1]    [2]    [3]    [4]    [5]    [6]    [7]    Encodings:     Instructions:
 1      2     1.00                         4     c5 f0 59 d0    vmulps   %xmm0, %xmm1, %xmm2
 1      4     1.00                         4     c5 eb 7c da    vhaddps  %xmm2, %xmm2, %xmm3
 1      4     1.00                         4     c5 e3 7c e3    vhaddps  %xmm3, %xmm3, %xmm4

In this example, column Encoding Size is the size in bytes of the instruction
encoding. Column Encodings reports the actual instruction encodings as byte
sequences in hex (objdump style).

The computation of encodings is done by a utility class named mca::CodeEmitter.

In future, I plan to expose the CodeEmitter to the instruction builder, so that
information about instruction encoding sizes can be used by the simulator. That
would be a first step towards simulating the throughput from the decoders in the
hardware frontend.

Differential Revision: https://reviews.llvm.org/D65948

llvm-svn: 368432
2019-08-09 11:26:27 +00:00
Craig Topper 29688f4da0 [X86] Limit vpermil2pd/vpermil2ps immediates to 4 bits in the assembly parser.
The upper 4 bits of the immediate byte are used to encode a
register. We need to limit the explicit immediate to fit in the
remaining 4 bits.
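
A minimal Python sketch of the constraint described above, assuming the
register index occupies bits [7:4] of the immediate byte and the explicit
immediate the low 4 bits. `pack_imm_byte` is a hypothetical helper used for
illustration; it is not the assembly parser code.

```python
def pack_imm_byte(reg_index, imm):
    """Pack a register index (upper 4 bits) and an immediate (lower 4 bits)."""
    if not 0 <= reg_index <= 15:
        raise ValueError("register index must fit in 4 bits")
    if not 0 <= imm <= 15:
        # This is the kind of range check the parser needs to enforce.
        raise ValueError("immediate must fit in 4 bits")
    return (reg_index << 4) | imm


print(hex(pack_imm_byte(3, 2)))
```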

Fixes PR42899.

llvm-svn: 368123
2019-08-07 05:34:27 +00:00
Andrea Di Biagio 207e3af501 [MCA] Add support for printing immediate values as hex. Also enable lexing of masm binary and hex literals.
This patch adds a new llvm-mca flag named -print-imm-hex.

By default, the instruction printer prints immediate operands as decimals. Flag
-print-imm-hex enables the instruction printer to print those operands in hex.

This patch also adds support for MASM binary and hex literal numbers (example
0FFh, 101b).
Added tests to verify the behavior of the new flag. Tests also verify that masm
numeric literal operands are now recognized.
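
For readers unfamiliar with MASM-style literals, the Python sketch below shows
how suffix-based literals such as 0FFh and 101b map to integer values, and how
the same value looks in decimal and in hex. `parse_masm_literal` is a
hypothetical helper; the actual lexing is done by the LLVM assembly lexer and
may handle more cases.

```python
def parse_masm_literal(text):
    """Parse a MASM-style literal such as 0FFh (hex) or 101b (binary)."""
    lowered = text.lower()
    if lowered.endswith("h"):
        return int(lowered[:-1], 16)   # hex digits followed by 'h'
    if lowered.endswith("b"):
        return int(lowered[:-1], 2)    # binary digits followed by 'b'
    return int(lowered, 10)            # plain decimal


for literal in ("0FFh", "101b", "42"):
    value = parse_masm_literal(literal)
    print(f"{literal}: decimal {value}, hex {value:#x}")
```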

Differential Revision: https://reviews.llvm.org/D65588

llvm-svn: 367671
2019-08-02 10:38:25 +00:00
Andrea Di Biagio dd0dc19b1c Set an explicit x86 triple for test bottleneck-analysis.s added by my r364045. NFC
This should unbreak the ppc64 buildbots.

llvm-svn: 364048
2019-06-21 14:05:58 +00:00
Andrea Di Biagio aa9b6468bd [MCA][Bottleneck Analysis] Teach how to compute a critical sequence of instructions based on the simulation.
This patch teaches the bottleneck analysis how to identify and print the most
expensive sequence of instructions according to the simulation. Fixes PR37494.

The goal is to help users identify the sequence of instructions that is most
critical for performance.

A dependency graph is internally used by the bottleneck analysis to describe
data dependencies and processor resource interferences between instructions.

There is one node in the graph for every instruction in the input assembly
sequence. The number of nodes in the graph is independent of the number of
iterations simulated by the tool. It means that a single node of the graph
represents all the possible instances of the same instruction contributed by the
simulated iterations.

Edges are dynamically "discovered" by the bottleneck analysis by observing
instruction state transitions and "backend pressure increase" events generated
by the Execute stage. Information from the events is used to identify critical
dependencies, and materialize edges in the graph. A dependency edge is uniquely
identified by a pair of node identifiers plus an instance of struct
DependencyEdge::Dependency (which provides more details about the actual
dependency kind).

The bottleneck analysis internally ranks dependency edges based on their impact
on the runtime (see field DependencyEdge::Dependency::Cost). To this end, each
edge of the graph has an associated cost. By default, the cost of an edge is a
function of its latency (in cycles). In practice, the cost of an edge is also a
function of the number of cycles where the dependency has been seen as
'contributing to backend pressure increases'. The idea is that the higher the
cost of an edge, the higher is the impact of the dependency on performance. To
put it in another way, the cost of an edge is a measure of criticality for
performance.

Note how the same edge may be found in multiple iterations of the simulated loop.
The logic that adds new edges to the graph checks if an equivalent dependency
already exists (duplicate edges are not allowed). If an equivalent dependency
edge is found, field DependencyEdge::Frequency of that edge is incremented by
one, and the new cost is cumulatively added to the existing edge cost.
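
The merging rule can be sketched in a few lines of Python. The class and
method names below are hypothetical stand-ins (only the Cost and Frequency
fields are named in this message), and starting the frequency of a new edge at
one is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass
class Edge:
    # Hypothetical stand-in for DependencyEdge.
    cost: int = 0
    frequency: int = 0


class DependencyGraph:
    def __init__(self):
        # Key: (from node, to node, dependency kind). Duplicates are merged.
        self.edges = {}

    def add_dependency(self, src, dst, kind, cost):
        edge = self.edges.setdefault((src, dst, kind), Edge())
        edge.frequency += 1  # the dependency was seen in one more iteration
        edge.cost += cost    # costs accumulate across iterations


graph = DependencyGraph()
for iteration_cost in (4, 4, 6):
    graph.add_dependency(0, 1, "register", iteration_cost)
graph.add_dependency(1, 2, "memory", 10)
print(graph.edges)
```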

At the end of simulation, costs are propagated to nodes through the edges of the
graph. The goal is to identify a critical sequence from a node of the root-set
(composed of the nodes of the graph with no predecessors) to a 'sink node' with
no successors.  Note that the graph is intentionally kept acyclic to minimize the
complexity of the critical sequence computation algorithm (complexity is
currently linear in the number of nodes in the graph).

The critical path is finally computed as a sequence of dependency edges. For
edges describing processor resource interferences, the view also prints a
so-called "interference probability" value (by dividing field
DependencyEdge::Frequency by the total number of iterations).
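
As an illustration of the two computations described above, the Python sketch
below finds the most expensive path in a small acyclic graph and derives an
interference probability from an edge frequency. All names and numbers are
hypothetical; the sketch assumes edges always go from a lower to a higher node
index, and it sorts the edges for convenience rather than reproducing the
algorithm used by the view.

```python
def critical_sequence(num_nodes, edges):
    """Return (total cost, node list) of the most expensive path in a DAG.

    `edges` maps (src, dst) -> cost and is assumed to satisfy src < dst, so
    processing edges sorted by source visits nodes in topological order.
    """
    best = [0] * num_nodes     # best accumulated cost reaching each node
    pred = [None] * num_nodes  # predecessor on that best path
    for (src, dst), cost in sorted(edges.items()):
        if best[src] + cost > best[dst]:
            best[dst] = best[src] + cost
            pred[dst] = src
    end = max(range(num_nodes), key=best.__getitem__)
    total = best[end]
    path = []
    while end is not None:
        path.append(end)
        end = pred[end]
    return total, path[::-1]


def interference_probability(frequency, iterations):
    """Fraction of simulated iterations in which the interference was seen."""
    return frequency / iterations


edges = {(0, 1): 3, (0, 2): 10, (1, 3): 4, (2, 3): 1}
print(critical_sequence(4, edges))
print(interference_probability(25, 100))
```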

Examples of critical sequence computations can be found in tests added/modified
by this patch.

On output streams that support colored output, instructions from the critical
sequence are rendered with a different color.

Strictly speaking, the analysis conducted by the bottleneck analysis view is not
a critical path analysis. The cost of an edge doesn't only depend on the
dependency latency. More importantly, the cost of the same edge may be computed
differently by different iterations.

The number of dependencies is discovered dynamically based on the events
generated by the simulator. However, their number is not fixed. This is
especially true for edges that model processor resource interferences; an
interference may not occur in every iteration. For that reason, it makes sense
to also print out a "probability of interference".

By construction, the accuracy of this analysis (as always) is strongly dependent
on the simulation (and therefore the quality of the information available in the
scheduling model).

That being said, the critical sequence effectively identifies a performance
criticality. Instructions from that sequence are expected to have a very big
impact on performance. So, users can take advantage of this information to focus
their attention on specific interactions between instructions.
In my experience, it works quite well in practice, and produces useful
output (in a reasonable amount of time).

Differential Revision: https://reviews.llvm.org/D63543

llvm-svn: 364045
2019-06-21 13:32:54 +00:00
Clement Courbet f7a6fb9f2c Fix r363773: Update Barcelona MCA tests.
llvm-svn: 363781
2019-06-19 10:00:36 +00:00
Roman Lebedev 9f9691c032 [NFC][X86][MCA] Barcelona: add load/store/load-store-throughput tests
llvm-svn: 363775
2019-06-19 08:53:34 +00:00
Roman Lebedev 4358016b03 [NFC][X86][MCA] BdVer2: add load-store-throughput test
llvm-svn: 363774
2019-06-19 08:53:28 +00:00
Clement Courbet 4ef7c2868a [X86] Add missing properties on llvm.x86.sse.{st,ld}mxcsr
Summary:
llvm.x86.sse.stmxcsr only writes to memory.
llvm.x86.sse.ldmxcsr only reads from memory, and might generate an FPE.

Reviewers: craig.topper, RKSimon

Subscribers: llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D62896

llvm-svn: 363773
2019-06-19 08:44:31 +00:00
Fangrui Song ac14f7b10c [lit] Delete empty lines at the end of lit.local.cfg NFC
llvm-svn: 363538
2019-06-17 09:51:07 +00:00
Roman Lebedev 5dd61974f9 [NFC][MCA][X86] Add one more 'clear super register' pattern - movss/movsd load clears high XMM bits
llvm-svn: 363498
2019-06-15 16:12:13 +00:00
Roman Lebedev 680c43b73a [NFC][MCA][X86] Add baseline test coverage for AMD Barcelona (aka K10, fam10h)
Looking into sched model for that CPU ...

llvm-svn: 363497
2019-06-15 16:12:05 +00:00
Andrea Di Biagio c650a9084f [llvm-mca] Enable bottleneck analysis when flag -all-views is specified.
Bottleneck Analysis is one of the many views available in llvm-mca. Therefore,
it should be enabled when flag -all-views is passed in input to the tool.

llvm-svn: 362964
2019-06-10 16:56:25 +00:00
Andrea Di Biagio c2493ce4a4 [MCA][Scheduler] Improved critical memory dependency computation.
This fixes a problem where back-pressure increases caused by register
dependencies were not correctly notified if execution was also delayed by memory
dependencies.

llvm-svn: 361740
2019-05-26 19:50:31 +00:00
Craig Topper 4b08fcdeb1 [X86] Add zero idioms to the haswell, broadwell, and skylake schedule models. Add 256-bit fp xor to sandybridge zero idioms
This copies the Sandy Bridge zero idiom support to later CPUs. Adding the AVX2 and AVX512F/VL instructions as appropriate.

Differential Revision: https://reviews.llvm.org/D62360

llvm-svn: 361690
2019-05-25 04:47:49 +00:00
Craig Topper af6c9df163 [X86][llvm-mca] Add zero idiom tests for Intel CPUs. NFC
This pre-commits tests for D62360

llvm-svn: 361689
2019-05-25 04:47:42 +00:00
Andrea Di Biagio 4e62554bfa [MCA] Add support for nested and overlapping region markers
This patch fixes PR41523
https://bugs.llvm.org/show_bug.cgi?id=41523

Regions can now nest/overlap provided that they have different names.
Anonymous regions cannot overlap.

Region end markers must specify the region name. The only exception is for when
there is only one user-defined region; in that particular case, the region end
marker doesn't need to specify a name.

Incorrect region end markers are no longer ignored. Instead, the tool reports an
error and we exit with an error code.
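
The rules above can be modelled with a small Python checker. This is a
simplified reading of the rules, not the llvm-mca implementation: the
`check_markers` function, the `RegionError` exception and the tuple encoding
of markers are all hypothetical, and an empty name stands for an anonymous
region.

```python
class RegionError(Exception):
    pass


def check_markers(markers):
    """Validate ("begin" | "end", name) pairs; the name "" is anonymous.

    Returns the names of the closed regions, in closing order.
    """
    active = []  # names of the regions that are currently open
    closed = []
    for kind, name in markers:
        if kind == "begin":
            if name in active:
                # Covers two anonymous regions as well as a repeated name.
                raise RegionError(f"overlapping region {name!r}")
            active.append(name)
        else:
            if name == "" and len(active) == 1:
                name = active[0]  # a lone region may be closed without a name
            if name not in active:
                raise RegionError(f"no active region matches {name!r}")
            active.remove(name)
            closed.append(name)
    return closed


# Two named regions may overlap.
print(check_markers([("begin", "foo"), ("begin", "bar"),
                     ("end", "foo"), ("end", "bar")]))
```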

Added test cases to verify the new diagnostic error messages.

Updated the llvm-mca docs to reflect this feature change.

Differential Revision: https://reviews.llvm.org/D61676

llvm-svn: 360351
2019-05-09 15:18:09 +00:00
Roman Lebedev 9db0e72570 [X86] AMD Piledriver (BdVer2): major cleanup (mainly inverse throughput)
I've started this cleanup several times now, but got sidetracked
elsewhere, e.g. by llvm-exegesis problems. Not this time, finally!

This is mainly cleaning up the inverse throughput values,
and a few latencies/uops, based on the llvm-exegesis measured values.

Though this is not complete by any means,
there's certainly more cleanup to be done.

The performance numbers (I've only checked with the RawSpeed benchmark) aren't
really surprising - overall this *slightly* (< -1%) improves perf.

llvm-svn: 360341
2019-05-09 13:54:51 +00:00
Andrea Di Biagio d52a542e4c [MCA] Don't add a name to the default code region.
This is done in preparation for a patch that fixes PR41523.

llvm-svn: 360243
2019-05-08 11:00:43 +00:00
Craig Topper d10a200ceb [X86] Remove the suffix on vcvt[u]si2ss/sd register variants in assembly printing.
We require d/q suffixes on the memory form of these instructions to disambiguate the memory size.
We don't require it on the register forms, but need to support parsing both with and without it.

Previously we always printed the d/q suffix on the register forms, but it's redundant and
inconsistent with gcc and objdump.

After this patch we should support the d/q for parsing, but not print it when it's unneeded.

llvm-svn: 360085
2019-05-06 21:39:51 +00:00
Simon Pilgrim 9d99372f73 [llvm-mca][x86] Fix MMX PMOVMSKB test
This is defined as part of SSE1, XMM PMOVMSKB doesn't appear until SSE2

llvm-svn: 359477
2019-04-29 18:24:30 +00:00
Andrea Di Biagio 43003f0fec [MCA] Fix typo in AVX2 gather tests. NFC
llvm-svn: 359397
2019-04-28 10:54:45 +00:00
Andrea Di Biagio 2050dff996 [MCA] Remove wrong comments from a test. NFC
llvm-svn: 358160
2019-04-11 10:15:04 +00:00
Craig Topper 4a32ce39b7 [X86] Make _Int instructions the preferred instruction for the assembly parser and disassembly parser to remove inconsistencies between VEX and EVEX.
Many of our instructions have both a _Int form used by intrinsics and a form
used by other IR constructs. In the EVEX space the _Int versions usually cover
all the capabilities, including broadcasting and rounding, while the other
version only covers simple register/register or register/load forms. For this
reason in EVEX, the non intrinsic form is usually marked isCodeGenOnly=1.

In the VEX encoding space we were less consistent, but usually the _Int version
was the isCodeGenOnly version.

This commit makes the VEX instructions match the EVEX instructions. This was
done by manually studying the AsmMatcher table so it's possible I missed some
cases, but we should be closer now.

I'm thinking about using the isCodeGenOnly bit to simplify the EVEX2VEX
tablegen code that disambiguates the _Int and non _Int versions. Currently it
checks register class sizes and Record the memory operands come from. I have
some other changes I was looking into for D59266 that may break the memory check.

I had to make a few scheduler hacks to keep the _Int versions from being treated
differently than the non _Int version.

Differential Revision: https://reviews.llvm.org/D60441

llvm-svn: 358138
2019-04-10 21:29:41 +00:00
Andrea Di Biagio f6a60f1f80 [llvm-mca][scheduler-stats] Print issued micro opcodes per cycle. NFCI
It makes more sense to print out the number of micro opcodes that are issued
every cycle rather than the number of instructions issued per cycle.
This behavior is also consistent with the dispatch-stats: numbers from the two
views can now be easily compared.

llvm-svn: 357919
2019-04-08 16:05:54 +00:00
Andrea Di Biagio e074ac60b4 [MCA] Add an experimental MicroOpQueue stage.
This patch adds an experimental stage named MicroOpQueueStage.
MicroOpQueueStage can be used to simulate a hardware micro-op queue (basically,
a decoupling queue between 'decode' and 'dispatch').  Users can specify a queue
size, as well as an optional MaxIPC (which - in the absence of a "Decoders" stage
- can be used to simulate a different throughput from the decoders).

This stage is added to the default pipeline between the EntryStage and the
DispatchStage only if PipelineOption::MicroOpQueue is different than zero. By
default, llvm-mca sets PipelineOption::MicroOpQueue to the value of hidden flag
-micro-op-queue-size.

Throughput from the decoder can be simulated via another hidden flag named
-decoder-throughput.  That flag allows us to quickly experiment with different
frontend throughputs.  For targets that declare a loop buffer, flag
-decoder-throughput allows users to do multiple runs, each time simulating a
different throughput from the decoders.
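
The effect of a bounded queue combined with a decoder throughput limit can be
pictured with the toy Python model below. It is only a sketch under stated
assumptions: the `simulate` function and its parameters are hypothetical, every
instruction is assumed to have at most `dispatch_width` uOPs, and the real
stage interacts with the rest of the pipeline in ways not modelled here.

```python
from collections import deque


def simulate(uops_per_inst, queue_size, max_ipc, dispatch_width, cycles):
    """Toy model: instructions enter a bounded queue, then get dispatched.

    Returns the number of instructions dispatched after `cycles` cycles.
    """
    pending = deque(uops_per_inst)  # instructions not yet decoded
    queue = deque()                 # decoded instructions awaiting dispatch
    queued_uops = 0
    dispatched = 0
    for _ in range(cycles):
        # Dispatch side: drain up to `dispatch_width` uOPs per cycle.
        budget = dispatch_width
        while queue and queue[0] <= budget:
            budget -= queue[0]
            queued_uops -= queue.popleft()
            dispatched += 1
        # Decode side: accept at most `max_ipc` instructions, if they fit.
        accepted = 0
        while (pending and accepted < max_ipc
               and queued_uops + pending[0] <= queue_size):
            queued_uops += pending[0]
            queue.append(pending.popleft())
            accepted += 1
    return dispatched


# Same input, two different decoder throughputs.
print(simulate([1] * 8, queue_size=4, max_ipc=1, dispatch_width=2, cycles=5))
print(simulate([1] * 8, queue_size=4, max_ipc=2, dispatch_width=2, cycles=5))
```

Running the same input with different `max_ipc` values mirrors the idea of
doing multiple runs, each with a different simulated decoder throughput.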

This stage can/will be extended in future. For example, we could add a "buffer
full" event to notify bottlenecks caused by backpressure. Flag
-decoder-throughput would probably go away if in future we delegate to another
stage (DecoderStage?) the simulation of a (potentially variable) throughput from
the decoders. For now, flag -decoder-throughput is "good enough" to run some
simple experiments.

Differential Revision: https://reviews.llvm.org/D59928

llvm-svn: 357248
2019-03-29 12:15:37 +00:00
Roman Lebedev c325be6cef [X86] AMD Piledriver (BdVer2): fine-tune some latencies
Based on llvm-exegesis measurements.

Now that llvm-exegesis is ~2 orders of magnitude faster, and is a bit smarter,
it is now possible to continue cleanup of the scheduler model.

With this, there are no more latency inconsistencies for the
opcodes that produce stable measurements, and only a few inconsistencies
for unstable measurements (MMX_* opcodes, opcodes that llvm-exegesis
measures by chaining - CMP, TEST, BT, SETcc, CVT, MOV, etc.)

llvm-svn: 357169
2019-03-28 13:40:34 +00:00
Craig Topper c2b35ebc1d [X86] Remove the _alt forms of (V)CMP instructions. Use a combination of custom printing and custom parsing to achieve the same result and more
Similar to previous change done for VPCOM and VPCMP

Differential Revision: https://reviews.llvm.org/D59468

llvm-svn: 356384
2019-03-18 17:59:59 +00:00
Craig Topper 12509d87f3 [X86] Remove the _alt forms of XOP VPCOM instructions. Use a combination of custom printing and custom parsing to achieve the same result and more
Previously we had a regular form of the instruction used when the immediate was 0-7, and an _alt form that allowed the full 8 bit immediate. Codegen would always use the 0-7 form since the immediate was always checked to be in range. Assembly parsing would use the 0-7 form when a mnemonic like vpcomtrueb was used. If the immediate was specified directly the _alt form was used. The disassembler would prefer to use the 0-7 form instruction when the immediate was in range and the _alt form otherwise. This way disassembly would print the most readable form when possible.

The assembly parsing for things like vpcomtrueb relied on splitting the mnemonic into 3 pieces. A "vpcom" prefix, an immediate representing the "true", and a suffix of "b". The tablegenerated printing code would similarly print a "vpcom" prefix, decode the immediate into a string, and then print "b".

The _alt form on the other hand parsed and printed like any other instruction with no specialness.

With this patch we drop to one form and solve the disassembly printing issue by doing custom printing when the immediate is 0-7. The parsing code has been tweaked to turn "vpcomtrueb" into "vpcomb" and then the immediate for the "true" is inserted either before or after the other operands depending on at&t or intel syntax.

I'd rather not do the custom printing, but I tried using an InstAlias for each possible mnemonic for all 8 immediates for all 16 combinations of element size, signedness, and memory/register. The code emitted into printAliasInstr ended up checking the number of operands, the register class of each operand, and the immediate for all 256 aliases. This was repeated for both the at&t and intel printer. Despite a lot of common checks between all of the aliases, when compiled with clang at least this commonality was not well optimized. Nor do all the checks seem necessary. Since I want to do a similar thing for vcmpps/pd/ss/sd, which have 32 immediate values, 3 encoding flavors, 3 register sizes, etc., this didn't seem to scale well for clang binary size. So custom printing seemed a better trade off.

I also considered just using the InstAlias for the matching and not the printing. But that seemed like it would add a lot of extra rows to the matcher table. Especially given that the 32 immediates for vpcmpps have 46 strings associated with them.

Differential Revision: https://reviews.llvm.org/D59398

llvm-svn: 356343
2019-03-17 21:21:37 +00:00
Craig Topper d0c2dba644 [X86] Correct scheduler information for rotate by constant for Haswell, Broadwell, and Skylake.
Rotate with explicit immediate is a single uop from Haswell on. An immediate of 1 has a dependency on the previous writer of flags, but the other immediate values do not.

The implicit rotate by 1 instruction is 2 uops. But the flags are merged after the rotate uop so the data result does not see the flag dependency. But I don't think we have any way of modeling that.

RORX is 1 uop without the load. 2 uops with the load. We currently model these with WriteShift/WriteShiftLd.

Differential Revision: https://reviews.llvm.org/D59077

llvm-svn: 355636
2019-03-07 21:22:56 +00:00
Craig Topper b3af5d3e57 [X86] Model ADC/SBB with immediate 0 more accurately in the Haswell scheduler model
Haswell and possibly Sandybridge have an optimization for ADC/SBB with immediate 0 to use a single uop flow. This only applies to GR16/GR32/GR64 with an 8-bit immediate. It does not apply to GR8. It also does not apply to the implicit AX/EAX/RAX forms.

Differential Revision: https://reviews.llvm.org/D59058

llvm-svn: 355635
2019-03-07 21:22:51 +00:00
Matt Davis 6c5a49ccb9 [llvm-mca] Emit a message when no bottlenecks are identified.
Summary:
Since bottleneck hints are enabled via user request, it can be
confusing if no bottleneck information is presented.  Such is the
case when no bottlenecks are identified.  This patch emits a message
in that case.

Reviewers: andreadb

Reviewed By: andreadb

Subscribers: tschuett, gbedwell, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D59098

llvm-svn: 355628
2019-03-07 19:34:44 +00:00
Simon Pilgrim 3f37538b86 [llvm-mca][X86] Add ADC/SBB with zero test cases
Some targets have fast-path handling for these patterns that we should model.

llvm-svn: 355498
2019-03-06 12:51:16 +00:00
Andrea Di Biagio be3281a281 [MCA] Highlight kernel bottlenecks in the summary view.
This patch adds a new flag named -bottleneck-analysis to print out information
about throughput bottlenecks.

MCA knows how to identify and classify dynamic dispatch stalls. However, it
doesn't know how to analyze and highlight kernel bottlenecks.  The goal of this
patch is to teach MCA how to correlate increases in backend pressure to backend
stalls (and therefore, the loss of throughput).

From a Scheduler point of view, backend pressure is a function of the scheduler
buffer usage (i.e. how the number of uOps in the scheduler buffers changes over
time). Backend pressure increases (or decreases) when there is a mismatch
between the number of opcodes dispatched, and the number of opcodes issued in
the same cycle.  Since buffer resources are limited, continuous increases in
backend pressure would eventually lead to dispatch stalls. So, there is a
strong correlation between dispatch stalls, and how backpressure changed over
time.
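
The relationship between backend pressure and dispatch stalls can be sketched
with a small Python model of a single scheduler buffer. The function name and
the per-cycle input lists are hypothetical, and the model ignores everything
except buffer occupancy; it only illustrates why a persistent mismatch between
dispatched and issued uOPs ends in a stall.

```python
def first_stall_cycle(dispatched, issued, buffer_size):
    """Return the first cycle (0-based) in which the buffer cannot accept the
    uOPs dispatched in that cycle, or None if that never happens.

    `dispatched[i]` and `issued[i]` are the uOPs dispatched to and issued
    from the scheduler buffer in cycle i.
    """
    occupancy = 0
    for cycle, (d, i) in enumerate(zip(dispatched, issued)):
        if occupancy + d > buffer_size:
            return cycle                # dispatch stall: buffer is full
        occupancy += d                  # uOPs entering the buffer
        occupancy -= min(i, occupancy)  # uOPs leaving the buffer
    return None


# Dispatching 2 uOPs per cycle while issuing only 1 fills a 4-entry buffer.
print(first_stall_cycle([2] * 10, [1] * 10, buffer_size=4))
# With matching dispatch and issue rates, pressure never builds up.
print(first_stall_cycle([2] * 10, [2] * 10, buffer_size=4))
```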

This patch teaches how to identify situations where backend pressure increases
due to:
 - unavailable pipeline resources.
 - data dependencies.

Data dependencies may delay execution of instructions and therefore increase the
time that uOps have to spend in the scheduler buffers. That often translates to
an increase in backend pressure which may eventually lead to a bottleneck.
Contention on pipeline resources may also delay execution of instructions, and
lead to a temporary increase in backend pressure.

Internally, the Scheduler classifies instructions based on whether register /
memory operands are available or not.

An instruction is marked as "ready to execute" only if data dependencies are
fully resolved.
Every cycle, the Scheduler attempts to execute all instructions that are ready
to execute. If an instruction cannot execute because of unavailable pipeline
resources, then the Scheduler internally updates a BusyResourceUnits mask with
the ID of each unavailable resource.

ExecuteStage is responsible for tracking changes in backend pressure. If backend
pressure increases during a cycle because of contention on pipeline resources,
then ExecuteStage sends a "backend pressure" event to the listeners.
That event would contain information about instructions delayed by resource
pressure, as well as the BusyResourceUnits mask.

Note that ExecuteStage also knows how to identify situations where backpressure
increased because of delays introduced by data dependencies.

The SummaryView observes "backend pressure" events and prints out a "bottleneck
report".

Example of bottleneck report:

```
Cycles with backend pressure increase [ 99.89% ]
Throughput Bottlenecks:
  Resource Pressure       [ 0.00% ]
  Data Dependencies:      [ 99.89% ]
   - Register Dependencies [ 0.00% ]
   - Memory Dependencies   [ 99.89% ]
```

A bottleneck report is printed out only if increases in backend pressure
eventually caused backend stalls.

About the time complexity:

Time complexity is linear in the number of instructions in the
Scheduler::PendingSet.

The average slowdown tends to be in the range of ~5-6%.
For memory intensive kernels, the slowdown can be significant if flag
-noalias=false is specified. In the worst case scenario I have observed a
slowdown of ~30% when flag -noalias=false was specified.

We can definitely recover part of that slowdown if we optimize class LSUnit (by
doing extra bookkeeping to speed up queries). For now, this new analysis is
disabled by default, and it can be enabled via flag -bottleneck-analysis. Users
of MCA as a library can enable the generation of pressure events through the
constructor of ExecuteStage.

This patch partially addresses https://bugs.llvm.org/show_bug.cgi?id=37494

Differential Revision: https://reviews.llvm.org/D58728

llvm-svn: 355308
2019-03-04 11:52:34 +00:00
Andrea Di Biagio c032e2ab7c [MCA] Always check if scheduler resources are unavailable when reporting dispatch stalls.
Dispatch stall cycles may be associated to multiple dispatch stall events.
Before this patch, each stall cycle was associated with a single stall event.
This patch also improves a couple of code comments, and adds a helper method to
query the Scheduler for dispatch stalls.

llvm-svn: 354877
2019-02-26 14:19:00 +00:00
Craig Topper ce2bd19c49 [X86] Correct some ADC/SBB with immediate scheduler data for Broadwell and Skylake.
Summary:
The AX/EAX/RAX with immediate forms are 2 uops just like the AL with immediate.

The modrm form with r8 and immediate is a single uop just like r16/r32/r64 with immediate.

Reviewers: RKSimon, andreadb

Reviewed By: RKSimon

Subscribers: gbedwell, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D58581

llvm-svn: 354754
2019-02-24 19:23:39 +00:00
Andrea Di Biagio c102e2a227 [MCA] Correctly update register definitions in the PRF after move elimination.
This patch fixes a bug where register writes performed by optimizable register
moves were sometimes wrongly treated like partial register updates.  Before this
patch, llvm-mca wrongly predicted a 1.50 IPC for test reg-move-elimination-6.s
(added by this patch).  With this patch, llvm-mca correctly updates the register
definitions in the PRF, and the IPC for that test is now correctly reported as 2.

llvm-svn: 354271
2019-02-18 14:15:25 +00:00
Craig Topper bf7593ec4a [X86] Print all register forms of x87 fadd/fsub/fdiv/fmul as having two arguments where one is %st.
All of these instructions consume one encoded register and the other register is %st. They either write the result to %st or the encoded register. Previously we printed both arguments when the encoded register was written. And we printed one argument when the result was written to %st. For the stack popping forms the encoded register is always the destination and we didn't print both operands. This was inconsistent with gcc and objdump and just makes the output assembly code harder to read.

This patch changes things to always print both operands making us consistent with gcc and objdump. The parser should still be able to handle the single register forms just as it did before. This also matches the GNU assembler behavior.

llvm-svn: 353061
2019-02-04 17:28:18 +00:00
Craig Topper 7a2944efe1 [X86] Print %st(0) as %st when it's implicit to the instruction. Continue printing it as %st(0) when it's encoded in the instruction.
This is a step back from the change I made in r352985. This appears to be more consistent with gcc and objdump behavior.

llvm-svn: 353015
2019-02-04 04:15:10 +00:00
Craig Topper f77b858dc3 Revert r352985 "[X86] Print %st(0) as %st to match what gcc inline asm uses as the clobber name to make MS inline asm work correctly"
Looking into gcc and objdump behavior more, this was overly aggressive. If the register is encoded in the instruction we should print %st(0); if it's implicit we should print %st.

I'll be making a more directed change in a future patch.

llvm-svn: 353013
2019-02-04 04:15:02 +00:00
Craig Topper 5a570dd437 [X86] Print %st(0) as %st to match what gcc inline asm uses as the clobber name to make MS inline asm work correctly
Summary:
When calculating clobbers for MS style inline assembly we fail if the asm clobbers stack top because we print st(0) and try to pass it through the gcc register name check. This was found when I attempted to make emms/femms clobber all ST registers. If you use emms/femms in MS inline asm we would try to use st(0) as the clobber name but clang would think that wasn't a valid clobber name.

This also matches what objdump disassembly prints. It's also what is printed by gcc -S.

Reviewers: RKSimon, rnk, efriedma, spatel, andreadb, lebedev.ri

Reviewed By: rnk

Subscribers: eraman, gbedwell, lebedev.ri, llvm-commits

Differential Revision: https://reviews.llvm.org/D57621

llvm-svn: 352985
2019-02-03 07:53:39 +00:00
Roman Lebedev 7857215f8e [X86][BdVer2] Transfer delays from the integer to the floating point unit.
Summary:
I'm unable to find this number in the "AMD SOG for family 15h".
llvm-exegesis measures the latencies of these instructions as `2`,
which matches the latencies specified in "AMD SOG for family 15h".

However if we look at Agner, Microarchitecture, "AMD Bulldozer, Piledriver,
Steamroller and Excavator pipeline", "Data delay between different execution
domains", the int->ivec transfer is listed as `8`..`10`cy of additional latency.

Also, Agner's "Instruction tables", for Piledriver, lists their latencies as `12`,
which is consistent with `2cy` from exegesis / AMD SOG + `10cy` transfer delay.

Additional data point comes from the fact that Agner's "Instruction tables",
for Jaguar, lists their latencies as `8`; and "AMD SOG for family 16h" does
state the `+6cy` int->ivec delay, which is consistent with instr latency of `1` or `2`.

Reviewers: andreadb, RKSimon, craig.topper

Reviewed By: andreadb

Subscribers: gbedwell, courbet, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D57300

llvm-svn: 352861
2019-02-01 11:15:13 +00:00
Andrea Di Biagio 815cdbff29 [X86][Btver2] Improved latency/throughput model for scalar int-to-float conversions.
Account for bypass delays when computing the latency of scalar int-to-float
conversions.
On Jaguar we need to account for an extra 6cy latency (see AMD fam16h SOG).
This patch also fixes the number of micro-opcodes for the register-memory variants
of scalar int-to-float conversions.

Differential Revision: https://reviews.llvm.org/D57148

llvm-svn: 352518
2019-01-29 16:47:27 +00:00
Roman Lebedev 661577466e [NFC][MCA][X86][BdVer2] Cherry-pick int-to-ivec forwarding tests from BtVer2
llvm-svn: 352317
2019-01-27 14:35:54 +00:00
Simon Pilgrim c9d33907ef [llvm-mca][X86] Add some missing DQI tests
Match more of the coverage of test\CodeGen\X86\avx512-schedule.ll as discussed on D57244 

llvm-svn: 352273
2019-01-26 13:00:46 +00:00
Simon Pilgrim d36f7730cd [llvm-mca][X86] Add missing shuffle tests
Match the coverage of test\CodeGen\X86\avx512-shuffle-schedule.ll so we can get rid of -print-schedule (and fix PR37160) without losing schedule tests

llvm-svn: 352179
2019-01-25 09:17:30 +00:00
Andrea Di Biagio d768d35515 [MC][X86] Correctly model additional operand latency caused by transfer delays from the integer to the floating point unit.
This patch adds a new ReadAdvance definition named ReadInt2Fpu.
ReadInt2Fpu allows x86 scheduling models to accurately describe delays caused by
data transfers from the integer unit to the floating point unit.
ReadInt2Fpu currently defaults to a delay of zero cycles (i.e. no delay) for all
x86 models excluding BtVer2. That means this patch is a functional change
for the Jaguar cpu model only.

Tablegen definitions for instructions (V)PINSR* have been updated to account for
the new ReadInt2Fpu. That read is mapped to the GPR input operand.
On Jaguar, int-to-fpu transfers are modeled as a +6cy delay. Before this patch,
that extra delay was added to the opcode latency. In practice, the insert opcode
only executes for 1cy. Most of the actual latency is contributed by the
so-called operand-latency. According to the AMD SOG for family 16h, (V)PINSR*
latency is defined by expression f+1, where f is defined as a forwarding delay
from the integer unit to the fpu.

When printing instruction latency from MCA (see InstructionInfoView.cpp) and LLC
(only when flag -print-schedule is specified), we now need to account for any
extra forwarding delays. We do this by checking if scheduling classes declare
any negative ReadAdvance entries. Quoting a code comment in TargetSchedule.td:
"A negative advance effectively increases latency, which may be used for
cross-domain stalls". When computing the instruction latency for the purpose of
our scheduling tests, we now add any extra delay to the formula. This avoids
regressing existing codegen and mca schedule tests. It comes with the cost of an
extra (but very simple) hook in MCSchedModel.
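
As a rough illustration of the formula above, the reported latency can be
computed from the opcode latency plus any delay implied by negative ReadAdvance
entries. The function name `reported_latency` is hypothetical, and taking the
largest delay across entries is an assumption of this sketch rather than a
statement about the actual hook in MCSchedModel.

```python
def reported_latency(write_latency, read_advances):
    """Opcode latency plus the largest delay implied by negative advances."""
    # A negative advance of -N cycles is treated as N cycles of extra latency.
    extra_delay = max((-adv for adv in read_advances if adv < 0), default=0)
    return write_latency + extra_delay

# Jaguar (V)PINSR*: 1cy opcode latency, with the int-to-fpu transfer assumed
# to be expressed as a ReadAdvance of -6 cycles (f = 6, so latency is f + 1).
pinsr_latency = reported_latency(1, [-6])

# Other x86 models: ReadInt2Fpu defaults to zero cycles, so nothing is added.
default_latency = reported_latency(1, [0])
```

With these inputs the Jaguar case evaluates to 7 cycles and the default case
stays at 1 cycle.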

Differential Revision: https://reviews.llvm.org/D57056

llvm-svn: 351965
2019-01-23 16:35:07 +00:00
Simon Pilgrim 922b540643 [llvm-mca][X86] Tidyup avx512 placeholder tests
Ensure we keep avx512f/bw/dq + vl versions separate, add example broadcast tests - this should allow us to better the test coverage of test\CodeGen\X86\avx512-schedule.ll

llvm-svn: 351848
2019-01-22 17:52:15 +00:00
Simon Pilgrim 8e11254132 [llvm-mca][X86] Add VPOPCNTDQ tests
Matches test coverage of test\CodeGen\X86\avx512vpopcntdq-schedule.ll

llvm-svn: 351842
2019-01-22 17:19:44 +00:00
Simon Pilgrim 90fa50d928 [llvm-mca][X86] Add missing CLWB/CLZERO/FSGSBASE/LWP/MWAITX/RDPID/SHA tests
We're getting pretty close to matching/exceeding test coverage of the test\CodeGen\X86\*-schedule.ll files, which should allow us to get rid of -print-schedule and fix PR37160

llvm-svn: 351836
2019-01-22 16:39:28 +00:00
Simon Pilgrim fc4b1e841e [llvm-mca][X86] Add missing enter/leave, invlpg/invlpga, rdmsr/wrmsr, rdpmc and rdtsc/rdtscp tests
llvm-svn: 351835
2019-01-22 16:29:26 +00:00
Simon Pilgrim 4e03b2496d [llvm-mca][X86] Add missing mfence/pinsrw tests
llvm-svn: 351831
2019-01-22 16:01:08 +00:00
Simon Pilgrim 05198a9b8a [llvm-mca][X86] Add missing monitor/mwait tests
These technically should be under a MONITOR cpuid bit, but we tag them as SSE3 so I've done that here as well.

llvm-svn: 351829
2019-01-22 15:48:16 +00:00
Simon Pilgrim 9b3a2f96a1 [llvm-mca][X86] Add missing vperm2i128 tests
llvm-svn: 351828
2019-01-22 14:54:24 +00:00
Simon Pilgrim 1d8d6c3bfb [llvm-mca][X86] Add missing tzcntw tests
llvm-svn: 351827
2019-01-22 14:53:52 +00:00
Andrea Di Biagio a4d1ffc269 [MCA] Add tests for int-to-fpu transfer delays. NFC
llvm-svn: 351822
2019-01-22 13:59:08 +00:00
Simon Pilgrim aa6a4339ac [X86][BtVer2] SSE2 vector shifts have local forwarding disabled
Similar to horizontal ops on D56777, the sse2 (but not mmx) bit shift ops have local forwarding disabled, adding +1cy to the use latency for the result.

Differential Revision: https://reviews.llvm.org/D57026

llvm-svn: 351817
2019-01-22 13:27:18 +00:00
Simon Pilgrim 2c69f90171 [X86][BtVer2] X86ISD::VPERMILPV has local forwarding disabled
Similar to horizontal ops on D56777, the vpermilpd/vpermilps variable mask ops have local forwarding disabled, adding +1cy to the use latency for the result.

Differential Revision: https://reviews.llvm.org/D57022

llvm-svn: 351815
2019-01-22 13:13:57 +00:00
Simon Pilgrim 9b73ae96c5 [X86][BtVer2] Update latency of mmx horizontal operations
D56777 added +1cy local forwarding penalty for horizontal operations, but this penalty only affects sse2/xmm variants, the mmx variants don't suffer the penalty.

Confirmed with @andreadb

llvm-svn: 351755
2019-01-21 18:04:25 +00:00
Andrea Di Biagio b68dd05c14 [X86][BtVer2] Update the WriteLoad latency.
r327630 introduced new write definitions for float/vector loads.
Before that revision, WriteLoad was used by both integer/float (scalar/vector)
loads. So, WriteLoad had to conservatively declare a latency of 5cy. That is
because the load-to-use latency for float/vector load is 5cy.

Now that we have dedicated writes for float/vector loads, there is no reason why
we should keep the latency of WriteLoad at 5cy. At the moment, WriteLoad is
used by scalar integer loads only; we can assume an optimistic 3cy latency for
them.
This patch changes that latency from 5cy to 3cy, and regenerates the affected
scheduling/mca tests.

Differential Revision: https://reviews.llvm.org/D56922

llvm-svn: 351742
2019-01-21 12:04:10 +00:00
Andrea Di Biagio c5f0f5309e [X86][BtVer2] Update latency of horizontal operations.
On Jaguar, horizontal adds/subs have local forwarding disabled.
That means, we pay a compulsory extra cycle of write-back stage, and the value
is not available until the end of that stage.

This patch changes the latency of horizontal operations by adding an extra
cycle. With this patch, latency numbers now match what is reported by perf.

I plan to send another patch to also 'fix' the latency of shuffle operations (on
Jaguar, local forwarding is disabled for vector shuffles too).

Differential Revision: https://reviews.llvm.org/D56777

llvm-svn: 351366
2019-01-16 18:18:01 +00:00
Evandro Menezes 946fe976fd [llvm-mca] Update tests for Exynos (NFC)
Update test cases for Exynos M4.

llvm-svn: 350961
2019-01-11 19:36:27 +00:00
Evandro Menezes 9b7b5b1dcc [llvm-mca] Update the Exynos test cases (NFC)
Add more entropy to the test cases.

llvm-svn: 350662
2019-01-08 22:29:56 +00:00
Evandro Menezes 7927a45cdb [llvm-mca] Rename directory for the Cortex tests (NFC)
llvm-svn: 349688
2018-12-19 22:24:42 +00:00
Evandro Menezes 7f37ec7cd3 [llvm-mca] Update Exynos test cases (NFC)
llvm-svn: 349687
2018-12-19 22:24:39 +00:00
Evandro Menezes 5d409b2278 [AArch64] Improve the Exynos M3 pipeline model
llvm-svn: 349652
2018-12-19 17:37:51 +00:00
Evandro Menezes 1cfab9747d [llvm-mca] Split test (NFC)
Split the Exynos test of the register offset addressing mode into separate
loads and stores tests.

llvm-svn: 349651
2018-12-19 17:37:14 +00:00
Evandro Menezes 031abc2bd7 [llvm-mca] Improve test (NFC)
Add more instruction variations for Exynos.

llvm-svn: 349567
2018-12-18 23:19:52 +00:00
Evandro Menezes 4bfd4ce1bc [llvm-mca] Update the Exynos test cases (NFC)
Add more entropy to the test cases.

llvm-svn: 349537
2018-12-18 20:46:03 +00:00
Andrea Di Biagio 4c73711069 [MCA] Add support for BeginGroup/EndGroup.
llvm-svn: 349354
2018-12-17 14:27:33 +00:00
Andrea Di Biagio 4506067593 [MCA] Don't assume that createMCInstrAnalysis() always returns a valid pointer.
Class InstrBuilder wrongly assumed that llvm targets were always able to return
a non-null pointer when createMCInstrAnalysis() was called on them.
This was causing crashes when simulating executions for targets that don't
provide an MCInstrAnalysis object.
This patch fixes the issue by making MCInstrAnalysis optional.

llvm-svn: 349352
2018-12-17 14:00:37 +00:00
Evandro Menezes 53f0d41dc4 [AArch64] Refactor the Exynos scheduling predicates
Refactor the scheduling predicates based on `MCInstPredicate`.  In this
case, for the Exynos processors.

Differential revision: https://reviews.llvm.org/D55345

llvm-svn: 348774
2018-12-10 17:17:26 +00:00
Evandro Menezes 7ea7de55ea [llvm-mca] Add new tests for Exynos (NFC)
llvm-svn: 348766
2018-12-10 16:22:29 +00:00
Simon Pilgrim 99c139f4dc [llvm-mca][x86] Add RDSEED instruction resource tests for GLM
llvm-svn: 348624
2018-12-07 18:37:40 +00:00
Simon Pilgrim c703ce35b8 [llvm-mca][x86] Add missing AES instruction resource tests
Add missing non-VEX instructions

llvm-svn: 348623
2018-12-07 18:35:54 +00:00
Simon Pilgrim c4e2776f3b [llvm-mca][x86] Add RDRAND/RDSEED instruction resource tests
llvm-svn: 348622
2018-12-07 18:29:47 +00:00
Hans Wennborg c56cc3a889 Fix test/tools/llvm-mca/AArch64/Exynos/direct-branch.s on Mac
It was failing as below. Adding a triple seems to help.

--
: 'RUN: at line 2';   /work/llvm.combined/build.release/bin/llvm-mca -march=aarch64 -mcpu=exynos-m1 -resource-pressure=false < /work/llvm.combined/llvm/test/tools/llvm-mca/AArch64/Exynos/direct-branch.s | /work/llvm.combined/build.release/bin/FileCheck /work/llvm.combined/llvm/test/tools/llvm-mca/AArch64/Exynos/direct-branch.s -check-prefixes=ALL,M1
: 'RUN: at line 3';   /work/llvm.combined/build.release/bin/llvm-mca -march=aarch64 -mcpu=exynos-m3 -resource-pressure=false < /work/llvm.combined/llvm/test/tools/llvm-mca/AArch64/Exynos/direct-branch.s | /work/llvm.combined/build.release/bin/FileCheck /work/llvm.combined/llvm/test/tools/llvm-mca/AArch64/Exynos/direct-branch.s -check-prefixes=ALL,M3
--
Exit Code: 1

Command Output (stderr):
--
/work/llvm.combined/llvm/test/tools/llvm-mca/AArch64/Exynos/direct-branch.s:36:12: error: M1-NEXT: expected string not found in input
           ^
<stdin>:21:2: note: scanning from here
 1 0 0.25 b Ltmp0
 ^

--

llvm-svn: 348577
2018-12-07 09:58:33 +00:00
Evandro Menezes 51df880e70 [llvm-mca] Improve test (NFC)
Add more instructions to the test for Cortex.

llvm-svn: 348565
2018-12-07 03:23:36 +00:00
Evandro Menezes 83beb91450 [llvm-mca] Improve test (NFC)
Add a label to make explicit that the branch is short for Exynos.

llvm-svn: 348564
2018-12-07 03:23:14 +00:00
Evandro Menezes 5d42bc7ce8 [llvm-mca] Simplify test (NFC)
llvm-svn: 348395
2018-12-05 18:34:51 +00:00
Evandro Menezes 86953e4350 [llvm-mca] Sort test run lines (NFC)
llvm-svn: 348393
2018-12-05 18:30:06 +00:00
Andrea Di Biagio 373a4ccf6c [llvm-mca][MC] Add the ability to declare which processor resources model load/store queues (PR36666).
This patch adds the ability to specify via tablegen which processor resources
are load/store queue resources.

A new tablegen class named MemoryQueue can be optionally used to mark resources
that model load/store queues.  Information about the load/store queue is
collected at 'CodeGenSchedule' stage, and analyzed by the 'SubtargetEmitter' to
initialize two new fields in struct MCExtraProcessorInfo named `LoadQueueID` and
`StoreQueueID`.  Those two fields are identifiers for buffered resources used to
describe the load queue and the store queue.
Field `BufferSize` is interpreted as the number of entries in the queue, while
the number of units is a throughput indicator (i.e. number of available pickers
for loads/stores).

At construction time, LSUnit in llvm-mca checks for the presence of extra
processor information (i.e. MCExtraProcessorInfo) in the scheduling model.  If
that information is available, and fields LoadQueueID and StoreQueueID are set
to a value different than zero (i.e. the invalid processor resource index), then
LSUnit initializes its LoadQueue/StoreQueue based on the BufferSize value
declared by the two processor resources.
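
The initialization logic described above could look roughly like the sketch
below. All class, function and resource names, as well as the buffer sizes, are
hypothetical stand-ins and not the real MCExtraProcessorInfo or LSUnit
definitions. Handling the two queue IDs independently is also an assumption of
this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class ProcResource:
    name: str
    buffer_size: int

@dataclass
class ExtraProcessorInfo:
    load_queue_id: int = 0    # 0 is the invalid processor resource index
    store_queue_id: int = 0

def init_queue_sizes(resources: List[ProcResource],
                     extra: Optional[ExtraProcessorInfo],
                     default_lq: int = 0,
                     default_sq: int = 0) -> Tuple[int, int]:
    """Pick load/store queue sizes from the scheduling model when available."""
    lq, sq = default_lq, default_sq
    if extra is not None:
        if extra.load_queue_id != 0:
            lq = resources[extra.load_queue_id].buffer_size
        if extra.store_queue_id != 0:
            sq = resources[extra.store_queue_id].buffer_size
    return lq, sq

# Index 0 is reserved as the invalid resource; sizes below are made up.
resources = [
    ProcResource("InvalidUnit", 0),
    ProcResource("HypLoadQueue", 40),
    ProcResource("HypStoreQueue", 24),
]
with_info = init_queue_sizes(resources, ExtraProcessorInfo(1, 2))
without_info = init_queue_sizes(resources, None)
```

When no extra processor information is present, the sketch simply falls back to
the caller-provided defaults.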

With this patch, we more accurately track dynamic dispatch stalls caused by the
lack of LS tokens (i.e. load/store queue full). This is also shown by the
differences in two BdVer2 tests. Stalls that were previously classified as
generic SCHEDULER FULL stalls are now correctly classified either as "load
queue full" or "store queue full".

About the differences in the -scheduler-stats view: those differences are
expected, because entries in the load/store queue are not released at
instruction issue stage. Instead, those are released at instruction executed
stage.  This is the main reason why for the modified tests, the load/store
queues get full before PdEx is full.

Differential Revision: https://reviews.llvm.org/D54957

llvm-svn: 347857
2018-11-29 12:15:56 +00:00
Andrea Di Biagio 7a7588990b [llvm-mca] pass -dispatch-stats flag to a couple of tests. NFC
This change is in preparation for a patch that fixes PR36666.

llvm-mca currently doesn't know if a buffered processor resource describes a
load or store queue. So, any dynamic dispatch stall caused by the lack of
load/store queue entries is normally reported as a generic SCHEDULER stall. See for
example the -dispatch-stats output from the two tests modified by this patch.

In future, processor models will be able to tag processor resources that are
used to describe load/store queues. That information would then be used by
llvm-mca to correctly classify dynamic dispatch stalls caused by the lack of
tokens in the LS.

llvm-svn: 347662
2018-11-27 15:56:00 +00:00
Evandro Menezes 56368c6fa5 [AArch64] Refactor the scheduling predicates (2/3) (NFC)
Refactor the scheduling predicates based on `MCInstPredicate`.  In this
case, `AArch64InstrInfo::hasShiftedReg()`.

Differential revision: https://reviews.llvm.org/D54820

llvm-svn: 347598
2018-11-26 21:47:41 +00:00
Evandro Menezes b02ac8bd21 [AArch64] Refactor the scheduling predicates (1/3) (NFC)
Refactor the scheduling predicates based on `MCInstPredicate`.  In this
case, `AArch64InstrInfo::isScaledAddr()`

Differential revision: https://reviews.llvm.org/D54777

llvm-svn: 347597
2018-11-26 21:47:28 +00:00
Andrea Di Biagio 36296c0484 [llvm-mca] Add support for instructions with a variadic number of operands.
By default, llvm-mca conservatively assumes that a register operand from the
variadic sequence is both a register read and a register write.  That is because
MCInstrDesc doesn't describe extra variadic operands; we don't have enough
dataflow information to tell which register operands from the variadic sequence
are definitions, and which are uses instead.

However, if a variadic instruction is flagged 'mayStore' (but not 'mayLoad'),
and it has no 'unmodeledSideEffects', then llvm-mca (very) optimistically
assumes that any register operand in the variadic sequence is a register read
only. Conversely, if a variadic instruction is marked as 'mayLoad' (but not
'mayStore'), and it has no 'unmodeledSideEffects', then llvm-mca optimistically
assumes that any extra register operand is a register definition only.
These assumptions work quite well for variadic load/store multiple instructions
defined by the ARM backend.
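
The rules above can be summarised with a small sketch. The function name
`classify_variadic_operand` and the string labels are hypothetical; only the
decision logic follows the description in this message.

```python
def classify_variadic_operand(may_load, may_store, has_unmodeled_side_effects):
    """Return how an extra variadic register operand is treated."""
    if not has_unmodeled_side_effects:
        if may_store and not may_load:
            return {"read"}        # store-multiple style: registers are only read
        if may_load and not may_store:
            return {"write"}       # load-multiple style: registers are only written
    # Conservative default: both a register read and a register write.
    return {"read", "write"}

store_multiple = classify_variadic_operand(False, True, False)
load_multiple = classify_variadic_operand(True, False, False)
unknown = classify_variadic_operand(True, True, False)
```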

llvm-svn: 347522
2018-11-25 12:46:24 +00:00
Evandro Menezes 079bf4b7b4 [TableGen] Emit more variant transitions
`llvm-mca` relies on the predicates to be based on `MCSchedPredicate` in order
to resolve the scheduling for variant instructions.  Otherwise, it aborts
the building of the instruction model early.

However, the scheduling model emitter in `TableGen` gives up too soon, unless
all processors use only such predicates.

In order to allow more processors to be used with `llvm-mca`, this patch
emits scheduling transitions if any processor uses these predicates.  The
transition emitted for the processors using legacy predicates is the one
specified with `NoSchedPred`, which is based on `MCSchedPredicate`.

Preferably, `llvm-mca` should instead assume a reasonable default when a
variant transition is not based on `MCSchedPredicate` for a given processor.
This issue should be revisited in the future.

Differential revision: https://reviews.llvm.org/D54648

llvm-svn: 347504
2018-11-23 21:17:33 +00:00
Andrea Di Biagio 7e32cc8353 [llvm-mca] Refactor some of the logic in InstrBuilder, and add a verifyOperands method.
With this change, InstrBuilder emits an error if the MCInst sequence contains an
instruction with a variadic opcode, and a non-zero number of variadic operands.

Currently we don't know how to correctly analyze variadic opcodes. The problem
with variadic operands is that there is no information for them in the opcode
descriptor (i.e. MCInstrDesc). That means, we don't know which variadic operands
are defs, and which are uses.

In future, we could try to conservatively assume that any extra register
operand is both a register use and a register definition.

This patch fixes a subtle bug in the evaluation of read/write operands for ARM
VLD1 with implicit index update. Added test vld1-index-update.s

llvm-svn: 347503
2018-11-23 20:26:57 +00:00
Andrea Di Biagio 07a8255a78 [llvm-mca][View] Improved Retire Control Unit Statistics.
RetireControlUnitStatistics now reports extra information about the ROB and the
avg/maximum number of entries consumed over the entire simulation.

Example:
  Retire Control Unit - number of cycles where we saw N instructions retired:
  [# retired], [# cycles]
   0,           109  (17.9%)
   1,           102  (16.7%)
   2,           399  (65.4%)

  Total ROB Entries:                64
  Max Used ROB Entries:             35  ( 54.7% )
  Average Used ROB Entries per cy:  32  ( 50.0% )

Documentation in llvm/docs/CommandGuide/llvm-mca.rst has been updated to
reflect this change.
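
For reference, the percentages in the example above follow directly from the
raw counters. The short sketch below recomputes them; the helper name `percent`
is hypothetical and this is not the code used by the view.

```python
def percent(part, total):
    """Share of `part` in `total`, rounded to one decimal place."""
    return round(100.0 * part / total, 1)

# [# retired] -> [# cycles], taken from the example report.
retired_histogram = {0: 109, 1: 102, 2: 399}
total_cycles = sum(retired_histogram.values())
shares = {n: percent(c, total_cycles) for n, c in retired_histogram.items()}

# ROB usage relative to the 64 total entries.
rob_max = percent(35, 64)
rob_avg = percent(32, 64)
```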

llvm-svn: 347493
2018-11-23 12:12:57 +00:00
Andrea Di Biagio 1cb8a3c690 [llvm-mca] Fix an invalid memory read introduced by r346487.
This patch fixes an invalid memory read introduced by r346487.
Before this patch, partial register write had to query the latency of the
dependent full register write by calling a method on the full write descriptor.
However, if the full write is from an already retired instruction, chances are
that the EntryStage already reclaimed its memory.
In some partial register write tests, valgrind was reporting an invalid
memory read.

This change fixes the invalid memory access problem. Writes are now responsible
for tracking dependent partial register writes, and notify them in the event of
instruction issued.
That means, partial register writes no longer need to query their associated
full write to check when they are ready to execute.

Added test X86/BtVer2/partial-reg-update-7.s

llvm-svn: 347459
2018-11-22 12:48:57 +00:00
Evandro Menezes d0792170a3 [llvm-mca] Add test case (NFC)
Add test case that will serve as the base for D54820.

llvm-svn: 347440
2018-11-22 00:38:36 +00:00
Evandro Menezes b9f9042648 [llvm-mca] Add test case (NFC)
Fix previous commit r347434.

llvm-svn: 347437
2018-11-21 23:36:40 +00:00
Evandro Menezes 34b32a3019 [llvm-mca] Add test case (NFC)
Add test case that will serve as the base for D54777.

llvm-svn: 347434
2018-11-21 22:57:46 +00:00
Andrea Di Biagio dda9032314 [llvm-mca] Correctly update the resource strategy for processor resources with multiple units.
When looking at the tests committed by Roman at r346587, I noticed that numbers
reported by the resource pressure for PdAGU01 were wrong.

In particular, according to the auto-generated CHECK lines in tests
memcpy-like-test.s and store-throughput.s, resource pressure for PdAGU01
was not uniformly distributed among the two AGEN pipes.

It turns out that the reason why pressure was not correctly distributed, was
because the "resource selection strategy" object associated with PdAGU01 was not
correctly updated on the event of AGEN pipe used.
As a result, llvm-mca was not simulating a round-robin pipeline allocation for
PdAGU01. Instead, PdAGU1 was always prioritized over PdAGU0.
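
The effect of a missing "resource used" notification can be shown with a toy
round-robin selector. The class and function names are hypothetical and do not
mirror llvm-mca's resource strategy classes; which unit ends up favoured in the
broken case is an artifact of this sketch.

```python
class RoundRobinStrategy:
    """Pick units of a multi-unit resource in round-robin order."""

    def __init__(self, num_units):
        self.num_units = num_units
        self.next_unit = 0

    def select(self):
        return self.next_unit

    def used(self, unit):
        # Must be called whenever a unit is consumed; otherwise the
        # selector never advances and one unit is always prioritized.
        self.next_unit = (unit + 1) % self.num_units

def simulate(num_units, uses, notify):
    """Count how often each unit is picked over `uses` allocations."""
    strategy = RoundRobinStrategy(num_units)
    counts = [0] * num_units
    for _ in range(uses):
        unit = strategy.select()
        counts[unit] += 1
        if notify:
            strategy.used(unit)
    return counts

balanced = simulate(2, 10, notify=True)    # strategy is kept up to date
skewed = simulate(2, 10, notify=False)     # strategy is never notified
```

With notifications the ten allocations are split evenly between the two units;
without them a single unit receives all of the pressure.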

This patch fixes the issue; now processor resource strategy objects for
resources declaring multiple units, are correctly notified in the event of
"resource used".

llvm-svn: 346650
2018-11-12 13:09:39 +00:00
Roman Lebedev b428b8b214 [X86][BdVer2] Fix loads/stores throughput for Piledriver (PR39465)
There are two AGU units, and per 1cy, there can be either two loads,
or a load and a store; but not two stores, or two loads and a store.

Additionally, loads shouldn't affect the store scheduler and vice versa.
(but *should* affect the PdEX scheduler.)
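
The per-cycle constraint above can be written as a simple predicate. This is a
sketch of the stated rule only; the function name is hypothetical and it is not
how the scheduling model itself encodes the restriction.

```python
def can_issue_in_one_cycle(loads, stores):
    """Two AGUs: at most two memory ops per cycle, and at most one store."""
    return loads + stores <= 2 and stores <= 1

two_loads = can_issue_in_one_cycle(2, 0)
load_and_store = can_issue_in_one_cycle(1, 1)
two_stores = can_issue_in_one_cycle(0, 2)
two_loads_one_store = can_issue_in_one_cycle(2, 1)
```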

Required rL346545.
Fixes https://bugs.llvm.org/show_bug.cgi?id=39465

llvm-svn: 346587
2018-11-10 14:31:43 +00:00
Roman Lebedev e105b655a2 [NFC][MCA][BdVer2] Add bdver2 runline into register-file-statistics.s test
Missed this one by accident when adding
the initial version in rL345463 / rL345462

llvm-svn: 346585
2018-11-10 10:56:58 +00:00
Clement Courbet e6b727e552 [X86] Fix VZEROUPPER scheduling info on SNB,HSW,BDW,SKL,SKX.
Summary:
Starting from SNB, VZEROUPPER is handled by the renamer and uses no proc resources.
After HSW, it also has zero latency.

This fixes PR35606.

To reproduce:
Uops:
  llvm-exegesis -mode=uops -opcode-name=VZEROUPPER
Latency:
  echo -e '#LLVM-EXEGESIS-DEFREG XMM0 1\n#LLVM-EXEGESIS-DEFREG XMM1 1\nvzeroupper' | /tmp/llvm-exegesis -mode=latency -snippets-file=-
  echo -e '#LLVM-EXEGESIS-DEFREG XMM0 1\n#LLVM-EXEGESIS-DEFREG XMM1 1\nvzeroupper\naddps %xmm0, %xmm1' | /tmp/llvm-exegesis -mode=latency -snippets-file=-

Reviewers: RKSimon, craig.topper, andreadb

Subscribers: gbedwell, llvm-commits

Differential Revision: https://reviews.llvm.org/D54107

llvm-svn: 346482
2018-11-09 09:49:06 +00:00
Roman Lebedev 3817292069 [NFC][BdVer2] Load and store throughput tests: also check sched stats (PR39465)
As noted by Andrea Di Biagio in https://bugs.llvm.org/show_bug.cgi?id=39465
both the loads and stores occupy both the store and load queues.
This is clearly wrong.

llvm-svn: 346425
2018-11-08 18:15:58 +00:00
Roman Lebedev 2ad16b9371 [NFC][BdVer2] Tests for load and store throughput (PR39465)
During review it was noted that while it appears that
the Piledriver can do two [consecutive] loads per cycle,
it can only do one store per cycle. It was suggested
that the sched model incorrectly models that,
but it was opted to fix this afterwards.

These tests show that the two consecutive loads are
modelled correctly, and one consecutive stores is not
modelled incorrectly. Unless i'm missing the point.

https://bugs.llvm.org/show_bug.cgi?id=39465

llvm-svn: 346404
2018-11-08 14:48:56 +00:00
Andrea Di Biagio fe3bc1b9bf [llvm-mca] Add extra counters for move elimination in view RegisterFileStatistics.
This patch teaches view RegisterFileStatistics how to report events for
optimizable register moves.

For each processor register file, view RegisterFileStatistics reports the
following extra information:
 - Number of optimizable register moves
 - Number of register moves eliminated
 - Number of zero moves (i.e. register moves that propagate a zero)
 - Max Number of moves eliminated per cycle.

Differential Revision: https://reviews.llvm.org/D53976

llvm-svn: 345865
2018-11-01 18:04:39 +00:00
Roman Lebedev a5baf86744 AMD BdVer2 (Piledriver) Initial Scheduler model
Summary:
# Overview
This is somewhat partial.
* Latencies are good {F7371125}
  * All of these remaining inconsistencies //appear// to be noise/noisy/flaky.
* NumMicroOps are somewhat good {F7371158}
  * Most of the remaining inconsistencies are from `Ld` / `Ld_ReadAfterLd` classes
* Actual unit occupation (pipes, `ResourceCycles`) are undiscovered lands, i did not really look there.
  They are basically a verbatim copy from `btver2`
* Many `InstRW`. And there are still inconsistencies left...

To be noted:
I think this is the first new schedule profile produced with the new next-gen tools like llvm-exegesis!

# Benchmark
I realize that isn't what was suggested, but i'll start with some "internal" public real-world benchmark i understand - [[ https://github.com/darktable-org/rawspeed | RawSpeed raw image decoding library ]].
Diff (the exact clang from trunk without/with this patch):
```
Comparing /home/lebedevri/rawspeed/build-old/src/utilities/rsbench/rsbench to /home/lebedevri/rawspeed/build-new/src/utilities/rsbench/rsbench
Benchmark                                                                                        Time             CPU      Time Old      Time New       CPU Old       CPU New
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Canon/EOS 5D Mark II/09.canon.sraw1.cr2/threads:8/real_time_pvalue                             0.0000          0.0000      U Test, Repetitions: 25 vs 25
Canon/EOS 5D Mark II/09.canon.sraw1.cr2/threads:8/real_time_mean                              -0.0607         -0.0604           234           219           233           219
Canon/EOS 5D Mark II/09.canon.sraw1.cr2/threads:8/real_time_median                            -0.0630         -0.0626           233           219           233           219
Canon/EOS 5D Mark II/09.canon.sraw1.cr2/threads:8/real_time_stddev                            +0.2581         +0.2587             1             2             1             2
Canon/EOS 5D Mark II/10.canon.sraw2.cr2/threads:8/real_time_pvalue                             0.0000          0.0000      U Test, Repetitions: 25 vs 25
Canon/EOS 5D Mark II/10.canon.sraw2.cr2/threads:8/real_time_mean                              -0.0770         -0.0767           144           133           144           133
Canon/EOS 5D Mark II/10.canon.sraw2.cr2/threads:8/real_time_median                            -0.0767         -0.0763           144           133           144           133
Canon/EOS 5D Mark II/10.canon.sraw2.cr2/threads:8/real_time_stddev                            -0.4170         -0.4156             1             0             1             0
Canon/EOS 5DS/2K4A9927.CR2/threads:8/real_time_pvalue                                          0.0000          0.0000      U Test, Repetitions: 25 vs 25
Canon/EOS 5DS/2K4A9927.CR2/threads:8/real_time_mean                                           -0.0271         -0.0270           463           450           463           450
Canon/EOS 5DS/2K4A9927.CR2/threads:8/real_time_median                                         -0.0093         -0.0093           453           449           453           449
Canon/EOS 5DS/2K4A9927.CR2/threads:8/real_time_stddev                                         -0.7280         -0.7280            13             4            13             4
Canon/EOS 5DS/2K4A9928.CR2/threads:8/real_time_pvalue                                          0.0004          0.0004      U Test, Repetitions: 25 vs 25
Canon/EOS 5DS/2K4A9928.CR2/threads:8/real_time_mean                                           -0.0065         -0.0065           569           565           569           565
Canon/EOS 5DS/2K4A9928.CR2/threads:8/real_time_median                                         -0.0077         -0.0077           569           564           569           564
Canon/EOS 5DS/2K4A9928.CR2/threads:8/real_time_stddev                                         +1.0077         +1.0068             2             5             2             5
Canon/EOS 5DS/2K4A9929.CR2/threads:8/real_time_pvalue                                          0.0220          0.0199      U Test, Repetitions: 25 vs 25
Canon/EOS 5DS/2K4A9929.CR2/threads:8/real_time_mean                                           +0.0006         +0.0007           312           312           312           312
Canon/EOS 5DS/2K4A9929.CR2/threads:8/real_time_median                                         +0.0031         +0.0032           311           312           311           312
Canon/EOS 5DS/2K4A9929.CR2/threads:8/real_time_stddev                                         -0.7069         -0.7072             4             1             4             1
Canon/EOS 10D/CRW_7673.CRW/threads:8/real_time_pvalue                                          0.0004          0.0004      U Test, Repetitions: 25 vs 25
Canon/EOS 10D/CRW_7673.CRW/threads:8/real_time_mean                                           -0.0015         -0.0015           141           141           141           141
Canon/EOS 10D/CRW_7673.CRW/threads:8/real_time_median                                         -0.0010         -0.0011           141           141           141           141
Canon/EOS 10D/CRW_7673.CRW/threads:8/real_time_stddev                                         -0.1486         -0.1456             0             0             0             0
Canon/EOS 40D/_MG_0154.CR2/threads:8/real_time_pvalue                                          0.6139          0.8766      U Test, Repetitions: 25 vs 25
Canon/EOS 40D/_MG_0154.CR2/threads:8/real_time_mean                                           -0.0008         -0.0005            60            60            60            60
Canon/EOS 40D/_MG_0154.CR2/threads:8/real_time_median                                         -0.0006         -0.0002            60            60            60            60
Canon/EOS 40D/_MG_0154.CR2/threads:8/real_time_stddev                                         -0.1467         -0.1390             0             0             0             0
Canon/EOS 77D/IMG_4049.CR2/threads:8/real_time_pvalue                                          0.0137          0.0137      U Test, Repetitions: 25 vs 25
Canon/EOS 77D/IMG_4049.CR2/threads:8/real_time_mean                                           +0.0002         +0.0002           275           275           275           275
Canon/EOS 77D/IMG_4049.CR2/threads:8/real_time_median                                         -0.0015         -0.0014           275           275           275           275
Canon/EOS 77D/IMG_4049.CR2/threads:8/real_time_stddev                                         +3.3687         +3.3587             0             2             0             2
Canon/PowerShot G1/crw_1693.crw/threads:8/real_time_pvalue                                     0.4041          0.3933      U Test, Repetitions: 25 vs 25
Canon/PowerShot G1/crw_1693.crw/threads:8/real_time_mean                                      +0.0004         +0.0004            67            67            67            67
Canon/PowerShot G1/crw_1693.crw/threads:8/real_time_median                                    -0.0000         -0.0000            67            67            67            67
Canon/PowerShot G1/crw_1693.crw/threads:8/real_time_stddev                                    +0.1947         +0.1995             0             0             0             0
Fujifilm/GFX 50S/20170525_0037TEST.RAF/threads:8/real_time_pvalue                              0.0074          0.0001      U Test, Repetitions: 25 vs 25
Fujifilm/GFX 50S/20170525_0037TEST.RAF/threads:8/real_time_mean                               -0.0092         +0.0074           547           542            25            25
Fujifilm/GFX 50S/20170525_0037TEST.RAF/threads:8/real_time_median                             -0.0054         +0.0115           544           541            25            25
Fujifilm/GFX 50S/20170525_0037TEST.RAF/threads:8/real_time_stddev                             -0.4086         -0.3486             8             5             0             0
Fujifilm/X-Pro2/_DSF3051.RAF/threads:8/real_time_pvalue                                        0.3320          0.0000      U Test, Repetitions: 25 vs 25
Fujifilm/X-Pro2/_DSF3051.RAF/threads:8/real_time_mean                                         +0.0015         +0.0204           218           218            12            12
Fujifilm/X-Pro2/_DSF3051.RAF/threads:8/real_time_median                                       +0.0001         +0.0203           218           218            12            12
Fujifilm/X-Pro2/_DSF3051.RAF/threads:8/real_time_stddev                                       +0.2259         +0.2023             1             1             0             0
GoPro/HERO6 Black/GOPR9172.GPR/threads:8/real_time_pvalue                                      0.0000          0.0001      U Test, Repetitions: 25 vs 25
GoPro/HERO6 Black/GOPR9172.GPR/threads:8/real_time_mean                                       -0.0209         -0.0179            96            94            90            88
GoPro/HERO6 Black/GOPR9172.GPR/threads:8/real_time_median                                     -0.0182         -0.0155            95            93            90            88
GoPro/HERO6 Black/GOPR9172.GPR/threads:8/real_time_stddev                                     -0.6164         -0.2703             2             1             2             1
Kodak/DCS Pro 14nx/D7465857.DCR/threads:8/real_time_pvalue                                     0.0000          0.0000      U Test, Repetitions: 25 vs 25
Kodak/DCS Pro 14nx/D7465857.DCR/threads:8/real_time_mean                                      -0.0098         -0.0098           176           175           176           175
Kodak/DCS Pro 14nx/D7465857.DCR/threads:8/real_time_median                                    -0.0126         -0.0126           176           174           176           174
Kodak/DCS Pro 14nx/D7465857.DCR/threads:8/real_time_stddev                                    +6.9789         +6.9157             0             2             0             2
Nikon/D850/Nikon-D850-14bit-lossless-compressed.NEF/threads:8/real_time_pvalue                 0.0000          0.0000      U Test, Repetitions: 25 vs 25
Nikon/D850/Nikon-D850-14bit-lossless-compressed.NEF/threads:8/real_time_mean                  -0.0237         -0.0238           474           463           474           463
Nikon/D850/Nikon-D850-14bit-lossless-compressed.NEF/threads:8/real_time_median                -0.0267         -0.0267           473           461           473           461
Nikon/D850/Nikon-D850-14bit-lossless-compressed.NEF/threads:8/real_time_stddev                +0.7179         +0.7178             3             5             3             5
Olympus/E-M1MarkII/Olympus_EM1mk2__HIRES_50MP.ORF/threads:8/real_time_pvalue                   0.6837          0.6554      U Test, Repetitions: 25 vs 25
Olympus/E-M1MarkII/Olympus_EM1mk2__HIRES_50MP.ORF/threads:8/real_time_mean                    -0.0014         -0.0013          1375          1373          1375          1373
Olympus/E-M1MarkII/Olympus_EM1mk2__HIRES_50MP.ORF/threads:8/real_time_median                  +0.0018         +0.0019          1371          1374          1371          1374
Olympus/E-M1MarkII/Olympus_EM1mk2__HIRES_50MP.ORF/threads:8/real_time_stddev                  -0.7457         -0.7382            11             3            10             3
Panasonic/DC-G9/P1000476.RW2/threads:8/real_time_pvalue                                        0.0000          0.0000      U Test, Repetitions: 25 vs 25
Panasonic/DC-G9/P1000476.RW2/threads:8/real_time_mean                                         -0.0080         -0.0289            22            22            10            10
Panasonic/DC-G9/P1000476.RW2/threads:8/real_time_median                                       -0.0070         -0.0287            22            22            10            10
Panasonic/DC-G9/P1000476.RW2/threads:8/real_time_stddev                                       +1.0977         +0.6614             0             0             0             0
Panasonic/DC-GH5/_T012014.RW2/threads:8/real_time_pvalue                                       0.0000          0.0000      U Test, Repetitions: 25 vs 25
Panasonic/DC-GH5/_T012014.RW2/threads:8/real_time_mean                                        +0.0132         +0.0967            35            36            10            11
Panasonic/DC-GH5/_T012014.RW2/threads:8/real_time_median                                      +0.0132         +0.0956            35            36            10            11
Panasonic/DC-GH5/_T012014.RW2/threads:8/real_time_stddev                                      -0.0407         -0.1695             0             0             0             0
Panasonic/DC-GH5S/P1022085.RW2/threads:8/real_time_pvalue                                      0.0000          0.0000      U Test, Repetitions: 25 vs 25
Panasonic/DC-GH5S/P1022085.RW2/threads:8/real_time_mean                                       +0.0331         +0.1307            13            13             6             6
Panasonic/DC-GH5S/P1022085.RW2/threads:8/real_time_median                                     +0.0430         +0.1373            12            13             6             6
Panasonic/DC-GH5S/P1022085.RW2/threads:8/real_time_stddev                                     -0.9006         -0.8847             1             0             0             0
Pentax/645Z/IMGP2837.PEF/threads:8/real_time_pvalue                                            0.0016          0.0010      U Test, Repetitions: 25 vs 25
Pentax/645Z/IMGP2837.PEF/threads:8/real_time_mean                                             -0.0023         -0.0024           395           394           395           394
Pentax/645Z/IMGP2837.PEF/threads:8/real_time_median                                           -0.0029         -0.0030           395           394           395           393
Pentax/645Z/IMGP2837.PEF/threads:8/real_time_stddev                                           -0.0275         -0.0375             1             1             1             1
Phase One/P65/CF027310.IIQ/threads:8/real_time_pvalue                                          0.0232          0.0000      U Test, Repetitions: 25 vs 25
Phase One/P65/CF027310.IIQ/threads:8/real_time_mean                                           -0.0047         +0.0039           114           113            28            28
Phase One/P65/CF027310.IIQ/threads:8/real_time_median                                         -0.0050         +0.0037           114           113            28            28
Phase One/P65/CF027310.IIQ/threads:8/real_time_stddev                                         -0.0599         -0.2683             1             1             0             0
Samsung/NX1/2016-07-23-142101_sam_9364.srw/threads:8/real_time_pvalue                          0.0000          0.0000      U Test, Repetitions: 25 vs 25
Samsung/NX1/2016-07-23-142101_sam_9364.srw/threads:8/real_time_mean                           +0.0206         +0.0207           405           414           405           414
Samsung/NX1/2016-07-23-142101_sam_9364.srw/threads:8/real_time_median                         +0.0204         +0.0205           405           414           405           414
Samsung/NX1/2016-07-23-142101_sam_9364.srw/threads:8/real_time_stddev                         +0.2155         +0.2212             1             1             1             1
Samsung/NX30/2015-03-07-163604_sam_7204.srw/threads:8/real_time_pvalue                         0.0000          0.0000      U Test, Repetitions: 25 vs 25
Samsung/NX30/2015-03-07-163604_sam_7204.srw/threads:8/real_time_mean                          -0.0109         -0.0108           147           145           147           145
Samsung/NX30/2015-03-07-163604_sam_7204.srw/threads:8/real_time_median                        -0.0104         -0.0103           147           145           147           145
Samsung/NX30/2015-03-07-163604_sam_7204.srw/threads:8/real_time_stddev                        -0.4919         -0.4800             0             0             0             0
Samsung/NX3000/_3184416.SRW/threads:8/real_time_pvalue                                         0.0000          0.0000      U Test, Repetitions: 25 vs 25
Samsung/NX3000/_3184416.SRW/threads:8/real_time_mean                                          -0.0149         -0.0147           220           217           220           217
Samsung/NX3000/_3184416.SRW/threads:8/real_time_median                                        -0.0173         -0.0169           221           217           220           217
Samsung/NX3000/_3184416.SRW/threads:8/real_time_stddev                                        +1.0337         +1.0341             1             3             1             3
Sony/DSLR-A350/DSC05472.ARW/threads:8/real_time_pvalue                                         0.0001          0.0001      U Test, Repetitions: 25 vs 25
Sony/DSLR-A350/DSC05472.ARW/threads:8/real_time_mean                                          -0.0019         -0.0019           194           193           194           193
Sony/DSLR-A350/DSC05472.ARW/threads:8/real_time_median                                        -0.0021         -0.0021           194           193           194           193
Sony/DSLR-A350/DSC05472.ARW/threads:8/real_time_stddev                                        -0.4441         -0.4282             0             0             0             0
Sony/ILCE-7RM2/14-bit-compressed.ARW/threads:8/real_time_pvalue                                0.0000          0.4263      U Test, Repetitions: 25 vs 25
Sony/ILCE-7RM2/14-bit-compressed.ARW/threads:8/real_time_mean                                 +0.0258         -0.0006            81            83            19            19
Sony/ILCE-7RM2/14-bit-compressed.ARW/threads:8/real_time_median                               +0.0235         -0.0011            81            82            19            19
Sony/ILCE-7RM2/14-bit-compressed.ARW/threads:8/real_time_stddev                               +0.1634         +0.1070             1             1             0             0
```
{F7443905}
If we look at the `_mean`s, the time column, the biggest win is `-7.7%` (`Canon/EOS 5D Mark II/10.canon.sraw2.cr2`),
and the biggest loss is `+3.3%` (`Panasonic/DC-GH5S/P1022085.RW2`);
Overall: mean `-0.7436%`, median `-0.23%`, `cbrt(sum(time^3))` = `-8.73%`
Looks good so far i'd say.
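
The summary figures can be reproduced from the table. A minimal sketch, assuming they are taken over the 25 `_mean` rows of the `Time` column (relative change, new vs old); the variable and function names are illustrative only:

```python
import math
from statistics import mean, median

# Relative change of real_time_mean ("Time" column), one value per input
# file, copied from the table in order of appearance.
time_mean_deltas = [
    -0.0607, -0.0770, -0.0271, -0.0065, +0.0006,  # Canon 5D Mark II, 5DS
    -0.0015, -0.0008, +0.0002, +0.0004,           # Canon 10D, 40D, 77D, G1
    -0.0092, +0.0015,                             # Fujifilm
    -0.0209, -0.0098, -0.0237, -0.0014,           # GoPro, Kodak, Nikon, Olympus
    -0.0080, +0.0132, +0.0331,                    # Panasonic
    -0.0023, -0.0047,                             # Pentax, Phase One
    +0.0206, -0.0109, -0.0149,                    # Samsung
    -0.0019, +0.0258,                             # Sony
]


def signed_cbrt(x):
    """Cube root that keeps the sign of x."""
    return math.copysign(abs(x) ** (1.0 / 3.0), x)


summary = {
    "mean": mean(time_mean_deltas),
    "median": median(time_mean_deltas),
    "cbrt(sum(time^3))": signed_cbrt(sum(d ** 3 for d in time_mean_deltas)),
}

for name, value in summary.items():
    print(f"{name}: {value * 100:+.4f}%")
```

Under that assumption the three values come out at roughly `-0.74%`, `-0.23%` and `-8.73%`, matching the numbers quoted above. The cubed sum weights the few large changes much more heavily than the many near-zero ones.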

llvm-exegesis details:
{F7371117} {F7371125}
{F7371128} {F7371144} {F7371158}

Reviewers: craig.topper, RKSimon, andreadb, courbet, avt77, spatel, GGanesh

Reviewed By: andreadb

Subscribers: javed.absar, gbedwell, jfb, llvm-commits

Differential Revision: https://reviews.llvm.org/D52779

llvm-svn: 345463
2018-10-27 20:46:30 +00:00
Roman Lebedev a51921877a [NFC][X86] Baseline tests for AMD BdVer2 (Piledriver) Scheduler model
Adding the baseline tests in a preparatory NFC commit,
so that the actual commit shows the *diff*.

Yes, i'm aware that a few of these codegen-based sched tests
are testing wrong instructions, i will fix that afterwards.

For https://reviews.llvm.org/D52779

llvm-svn: 345462
2018-10-27 20:36:11 +00:00
Reid Kleckner 953bdce68d [MC] Separate masm integer literal lexer support from inline asm
Summary:
This renames the IsParsingMSInlineAsm member variable of AsmLexer to
LexMasmIntegers and moves it up to MCAsmLexer. This is the only behavior
controlled by that variable. I added a public setter, so that it can be
set from outside or from the llvm-mc command line. We may need to
arrange things so that users can get this behavior from clang, but
that's future work.

I also put additional hex literal lexing functionality under this flag
to fix PR32973. It appears that this hex literal parsing wasn't intended
to be enabled in non-masm-style blocks.

Now, masm integers (0b1101 and 0ABCh) work in __asm blocks from clang,
but 0b label references work when using .intel_syntax in standalone .s
files.

However, 0b label references will *not* work from __asm blocks in clang.
They will work from GCC inline asm blocks, which it sounds like is
important for Crypto++ as mentioned in PR36144.

Essentially, we only lex masm literals for inline asm blobs that use
intel syntax. If the .intel_syntax directive is used inside a gnu-style
inline asm statement, masm literals will not be lexed, which is
compatible with gas and llvm-mc standalone .s assembly.

This fixes PR36144 and PR32973.

Reviewers: Gerolf, avt77

Subscribers: eraman, hiraditya, llvm-commits

Differential Revision: https://reviews.llvm.org/D53535

llvm-svn: 345189
2018-10-24 20:23:57 +00:00
Andrea Di Biagio 083addf751 [llvm-mca] Improved error handling and error reporting from class InstrBuilder.
A new class named InstructionError has been added to Support.h in order to
improve the error reporting from class InstrBuilder.
The llvm-mca driver is responsible for handling InstructionError objects, and
printing them out to stderr.

The goal of this patch is to remove all the remaining error handling logic from
the library code.
In particular, this allows us to:
 - Simplify the logic in InstrBuilder by removing a needless dependency from
MCInstrPrinter.
 - Centralize all the error handling logic in a new function named 'runPipeline'
(see llvm-mca.cpp).

This is also a first step towards generalizing class InstrBuilder, so that in
future, we will be able to reuse its logic to also "lower" MachineInstr to
mca::Instruction objects.
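
The split of responsibilities can be sketched as follows. This is only an illustration in Python, with exceptions standing in for LLVM's error objects; `build_instruction`, `run_pipeline` and the opcode table are hypothetical names, not the actual llvm-mca code:

```python
import sys


class InstructionError(Exception):
    """Error raised by library code; carries the offending instruction."""

    def __init__(self, message, inst):
        super().__init__(message)
        self.message = message
        self.inst = inst


# Hypothetical table of opcodes the toy "library" knows how to build.
KNOWN_OPCODES = {"add", "imul", "mov"}


def build_instruction(text):
    """Library code: never prints, only reports failures to the caller."""
    opcode = text.split()[0] if text.split() else ""
    if opcode not in KNOWN_OPCODES:
        raise InstructionError("unsupported instruction", text)
    return {"opcode": opcode, "text": text}


def run_pipeline(lines, err=sys.stderr):
    """Driver code: the single place where errors are printed."""
    built = []
    for line in lines:
        try:
            built.append(build_instruction(line))
        except InstructionError as e:
            print(f"error: {e.message}: '{e.inst}'", file=err)
            return None
    return built
```

Keeping the printing in the driver means the library side needs no instruction printer of its own, which is the dependency the patch removes.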

Differential Revision: https://reviews.llvm.org/D53585

llvm-svn: 345129
2018-10-24 10:56:47 +00:00
Simon Pilgrim 7d27cfdcb2 [X86] Fix Skylake ReadAfterLd for PADDrm etc.
Missed in rL343868 due to their custom InstRW.

llvm-svn: 344600
2018-10-16 09:50:16 +00:00
Andrea Di Biagio 6eebbe0a97 [tblgen][llvm-mca] Add the ability to describe move elimination candidates via tablegen.
This patch adds the ability to identify instructions that are "move elimination
candidates". It also allows scheduling models to describe processor register
files that allow move elimination.

A move elimination candidate is an instruction that can be eliminated at
register renaming stage.
Each subtarget can specify which instructions are move elimination candidates
with the help of tablegen class "IsOptimizableRegisterMove" (see
llvm/Target/TargetInstrPredicate.td).

For example, on X86, BtVer2 allows both GPR and MMX/SSE moves to be eliminated.
The definition of 'IsOptimizableRegisterMove' for BtVer2 looks like this:

```
def : IsOptimizableRegisterMove<[
  InstructionEquivalenceClass<[
    // GPR variants.
    MOV32rr, MOV64rr,

    // MMX variants.
    MMX_MOVQ64rr,

    // SSE variants.
    MOVAPSrr, MOVUPSrr,
    MOVAPDrr, MOVUPDrr,
    MOVDQArr, MOVDQUrr,

    // AVX variants.
    VMOVAPSrr, VMOVUPSrr,
    VMOVAPDrr, VMOVUPDrr,
    VMOVDQArr, VMOVDQUrr
  ], CheckNot<CheckSameRegOperand<0, 1>> >
]>;
```

Definitions of IsOptimizableRegisterMove from processor models of the same
Target are processed by the SubtargetEmitter to auto-generate a target-specific
override for each of the following predicate methods:

```
bool TargetSubtargetInfo::isOptimizableRegisterMove(const MachineInstr *MI)
const;
bool MCInstrAnalysis::isOptimizableRegisterMove(const MCInst &MI, unsigned
CPUID) const;
```

By default, those methods return false (i.e. conservatively assume that there
are no move elimination candidates).
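
As an illustration of what the BtVer2 definition above expresses, here is a small Python model of the predicate. The opcode names are copied from that definition; the `Inst` tuple and the function name are hypothetical and this is not the auto-generated code:

```python
from typing import NamedTuple, Tuple


class Inst(NamedTuple):
    opcode: str
    operands: Tuple[str, ...]  # register names, operand 0 first


# Opcodes listed in the BtVer2 IsOptimizableRegisterMove definition.
BTVER2_MOVE_OPCODES = frozenset({
    "MOV32rr", "MOV64rr",
    "MMX_MOVQ64rr",
    "MOVAPSrr", "MOVUPSrr", "MOVAPDrr", "MOVUPDrr", "MOVDQArr", "MOVDQUrr",
    "VMOVAPSrr", "VMOVUPSrr", "VMOVAPDrr", "VMOVUPDrr",
    "VMOVDQArr", "VMOVDQUrr",
})


def is_optimizable_register_move(inst: Inst) -> bool:
    # Anything not listed is conservatively not a candidate.
    if inst.opcode not in BTVER2_MOVE_OPCODES:
        return False
    # CheckNot<CheckSameRegOperand<0, 1>>: operands 0 and 1 must differ.
    return inst.operands[0] != inst.operands[1]
```

For example, a `MOV64rr` between two different registers is a candidate under this model, while the same opcode with identical operands 0 and 1 is not.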

Tablegen class RegisterFile has been extended with the following information:
 - The set of register classes that allow move elimination.
 - Maximum number of moves that can be eliminated every cycle.
 - Whether move elimination is restricted to moves from registers that are
   known to be zero.
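
A rough model of how these three pieces of information could constrain move elimination, written as a Python sketch. The class, field and method names are hypothetical; the real RegisterFile logic in llvm-mca is more involved:

```python
from dataclasses import dataclass, field
from typing import FrozenSet


@dataclass
class RegisterFileSketch:
    move_elim_classes: FrozenSet[str]   # register classes that allow elimination
    max_moves_per_cycle: int            # per-cycle elimination budget
    zero_moves_only: bool = False       # only moves from known-zero registers
    eliminated_this_cycle: int = field(default=0, init=False)

    def begin_cycle(self) -> None:
        """Reset the per-cycle budget."""
        self.eliminated_this_cycle = 0

    def try_eliminate_move(self, reg_class: str,
                           source_known_zero: bool = False) -> bool:
        """Return True if the move is eliminated, False if it must execute."""
        if reg_class not in self.move_elim_classes:
            return False
        if self.zero_moves_only and not source_known_zero:
            return False
        if self.eliminated_this_cycle >= self.max_moves_per_cycle:
            return False
        self.eliminated_this_cycle += 1
        return True
```

A move that fails any of the three checks is simply treated as a normal instruction in this sketch.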

This patch is structured in three parts:

A first part (which is mostly boilerplate) adds the new
'isOptimizableRegisterMove' target hooks, and extends existing register file
descriptors in MC by introducing new fields to describe properties related to
move elimination.

A second part uses the new tablegen constructs to describe move elimination in
the BtVer2 scheduling model.

A third part teaches llvm-mca how to query the new 'isOptimizableRegisterMove'
hook to mark instructions that are candidates for move elimination. It also
teaches class RegisterFile how to describe constraints on move elimination at
PRF granularity.

llvm-mca tests for btver2 show differences before/after this patch.

Differential Revision: https://reviews.llvm.org/D53134

llvm-svn: 344334
2018-10-12 11:23:04 +00:00
Andrea Di Biagio 6a0b319549 [llvm-mca][BtVer2] Add tests for optimizable GPR register moves. NFC
llvm-svn: 344253
2018-10-11 14:54:54 +00:00
Andrea Di Biagio 1b29ec6531 [llvm-mca][BtVer2] Add two more move-elimination tests. NFC
These should test all the optimizable moves on Jaguar.
A follow-up patch will teach how to recognize these optimizable register moves.

llvm-svn: 344144
2018-10-10 14:46:54 +00:00
Simon Pilgrim f09fc3bc12 [X86] Move ReadAfterLd functionality into X86FoldableSchedWrite (PR36957)
Currently we hardcode instructions with ReadAfterLd if the register operands don't need to be available until the folded load has completed. This doesn't take into account the different load latencies of different memory operands (PR36957).

This patch adds a ReadAfterFold def into X86FoldableSchedWrite to replace ReadAfterLd, allowing us to specify the load latency at a scheduler class level.

I've added ReadAfterVec*Ld classes that match the XMM/Scl, XMM and YMM/ZMM WriteVecLoad classes that we currently use; we can tweak these values in future patches once this infrastructure is in place.
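
To illustrate why the per-class value matters, here is a sketch assuming the usual read-advance semantics: a register operand that is read `read_advance` cycles late shortens the producer-to-consumer latency by that many cycles, clamped at zero. The function name and the cycle counts are made up for the example and do not describe any particular CPU:

```python
def effective_operand_latency(producer_latency: int, read_advance: int) -> int:
    """Cycles the consumer waits on a register operand that it reads
    `read_advance` cycles after issue (i.e. once the folded load is done)."""
    return max(producer_latency - read_advance, 0)


# Hypothetical numbers: a 3-cycle producer feeding the register operand of a
# load-folding instruction, with a short (scalar) and a long (vector) load.
short_load_wait = effective_operand_latency(producer_latency=3, read_advance=3)
long_load_wait = effective_operand_latency(producer_latency=3, read_advance=5)
no_advance_wait = effective_operand_latency(producer_latency=3, read_advance=0)
```

With a single hardcoded value the model would either over- or under-estimate the wait for memory operands whose load latency differs from it.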

Differential Revision: https://reviews.llvm.org/D52886

llvm-svn: 343868
2018-10-05 17:57:29 +00:00
Simon Pilgrim 6ad03ad34b [llvm-mca][x86] Add PR36951 ReadAfterLd test case
llvm-svn: 343795
2018-10-04 16:26:56 +00:00
Greg Bedwell dee7bfdb9f [utils] Ensure that update_mca_test_checks.py writes prefixes in alphabetical order
llvm-svn: 343783
2018-10-04 14:42:19 +00:00
Simon Pilgrim 82a3b1c687 [llvm-mca][x86] Add tests demonstrating ReadAfterLd delay
llvm-svn: 343773
2018-10-04 13:05:42 +00:00
Simon Pilgrim 0b451a2983 [X86][Btver2] Fix MMX PSHUFB schedule
Match AMD Fam16h SOG + llvm-exegesis tests

llvm-svn: 343701
2018-10-03 18:18:50 +00:00
Andrea Di Biagio 207e0217f9 [llvm-mca] Add support for move elimination in class RegisterFile.
This patch teaches class RegisterFile how to analyze register writes from
instructions that are move elimination candidates.
In particular, it teaches it how to check if a move can be effectively eliminated
by the underlying PRF, and (if necessary) how to perform move elimination.

The long term goal is to allow processor models to describe instructions that
are valid move elimination candidates.
The idea is to let register file definitions in tablegen declare if/when moves
can be eliminated.

This patch is a non functional change.
The logic that performs move elimination is currently disabled.  A future patch
will add support for move elimination in the processor models, and enable this
new code path.

llvm-svn: 343691
2018-10-03 15:02:44 +00:00
Simon Pilgrim c68cc4efbe [X86][Btver2] Most RMW instructions don't require an additional uop
Remove uop on WriteRMW and move it into the few instructions that need it.

Match AMD Fam16h SOG + llvm-exegesis tests

llvm-svn: 343671
2018-10-03 10:28:43 +00:00
Simon Pilgrim d11015861c [X86] ALU/ADC RMW instructions should use the WriteRMW sequence class
I was expecting this to be an NFC, but Silvermont seems to be set up a little differently:

// A folded store needs a cycle on MEC_RSV for the store data, but it does not need an extra port cycle to recompute the address.
def : WriteRes<WriteRMW, [SLM_MEC_RSV]>;

So moving from WriteStore to WriteRMW reduces predicted port pressure; @craig.topper confirmed that this is correct.

Differential Revision: https://reviews.llvm.org/D52740

llvm-svn: 343670
2018-10-03 10:01:13 +00:00
Simon Pilgrim 860cb5c071 [X86][Btver2] Fix BLENDV and AESDEC schedules
Match AMD Fam16h SOG + llvm-exegesis tests

llvm-svn: 343597
2018-10-02 15:13:18 +00:00
Simon Pilgrim e0d2019052 [X86][Btver2] Fix BT(C|R|S)mr & BT(C|R|S)mi schedule latency + uop counts
Match AMD Fam16h SOG + llvm-exegesis tests

llvm-svn: 343494
2018-10-01 16:31:30 +00:00
Simon Pilgrim 6ddc4e821c [X86][Btver2] Fix BTmr schedule uop counts
Match AMD Fam16h SOG + llvm-exegesis tests

llvm-svn: 343484
2018-10-01 14:42:16 +00:00
Simon Pilgrim a982236e59 [X86][Btver2] Fix masked load schedule
JFPU01 resource usage should match JFPX

Match AMD Fam16h SOG + llvm-exegesis tests

llvm-svn: 343468
2018-10-01 13:12:05 +00:00
Andrea Di Biagio 24ea163007 [X86][BtVer2] Teach how to identify zero-idiom VPERM2F128rr instructions.
This patch adds another variant class to identify zero-idiom VPERM2F128rr
instructions.

On Jaguar, a VPERM with bits 3 and 7 of the mask set is a zero-idiom.
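
The mask test itself is a two-bit check. A minimal Python sketch (the function and constant names are hypothetical; in the instruction's immediate, bit 3 zeroes the low 128-bit half of the destination and bit 7 zeroes the high half):

```python
ZERO_LOW_HALF = 1 << 3   # imm8 bit 3
ZERO_HIGH_HALF = 1 << 7  # imm8 bit 7


def is_vperm2f128_zero_idiom(imm8: int) -> bool:
    """True when both zeroing bits are set, so the result is all zeros
    regardless of the source registers."""
    mask = ZERO_LOW_HALF | ZERO_HIGH_HALF
    return (imm8 & mask) == mask
```

Any other bits in the immediate are irrelevant once both zeroing bits are set, which is why the check masks them out.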

Differential Revision: https://reviews.llvm.org/D52663

llvm-svn: 343452
2018-10-01 10:35:13 +00:00
Clement Courbet a933fb237e [X86][Sched] Update scheduling information for VZEROALL on HWS, BDW, SKX, SNB.
Summary:
    While looking at PR35606, I found out that the scheduling info is incorrect.

    One can check that it's really a P5+P6 and not a 2*P56 with:
    echo -e 'vzeroall\nvandps %xmm1, %xmm2, %xmm3' | ./bin/llvm-exegesis -mode=uops -snippets-file=-
    (vandps executes on P5 only)

    Reviewers: craig.topper, RKSimon

    Subscribers: llvm-commits

    Differential Revision: https://reviews.llvm.org/D52541

llvm-svn: 343447
2018-10-01 08:37:48 +00:00
Simon Pilgrim f21083870d [X86] Fix scheduler class for BTmi instructions
This wasn't treated as a folded load instruction

llvm-svn: 343424
2018-09-30 20:19:16 +00:00
Simon Pilgrim b1108399bd [LLVM-MCA][X86] Add missing VCMPESTR/VCMPESTR tests
llvm-svn: 343421
2018-09-30 18:19:00 +00:00
Simon Pilgrim 20623f2343 [LLVM-MCA][X86] Add some AVX512 tests
These are going to be necessary to check I don't mess up when I start cleaning up all the remaining vector integer overrides

llvm-svn: 343414
2018-09-30 17:01:59 +00:00
Simon Pilgrim 4f5693ac8d [X86][Btver2] Fix PCmpIStrI/PCmpIStrM schedules
Missing JFPU0 pipe and double JFPU1 pipe (to match JVALU1) resources

Match AMD Fam16h SOG + llvm-exegesis tests

llvm-svn: 343413
2018-09-30 16:38:38 +00:00
Andrea Di Biagio 6e218d0a57 [llvm-mca] Add a test for zero-idiom VPERM2F128rr. NFC
We don't correctly model the latency and resource usage information for
zero-idiom VPERM2F128rr on Jaguar.

This is demonstrated by the incorrect numbers in the resource pressure view, and
the timeline view.
A follow up patch will fix this problem.

llvm-svn: 343346
2018-09-28 17:47:09 +00:00
Greg Bedwell becbbe0383 [utils] Stricter checking from update_mca_test_checks.py
If any prefixes have been specified on the RUN lines that do not end up
ever actually getting printed, raise an Error. This is either an
indication that the run lines just need cleaning up, or that something
is more fundamentally wrong with the test.

Also raise an Error if there are any blocks which cannot be checked
because they are not uniquely covered by a prefix.

Fixed up a couple of tests where the extra checking flagged up issues.
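
A minimal sketch of the first check (prefixes that are specified but never printed). This is not the script's actual code; `Error` and `check_unused_prefixes` are hypothetical names:

```python
class Error(Exception):
    """Raised when the RUN lines and the generated checks disagree."""


def check_unused_prefixes(specified_prefixes, printed_prefixes):
    """Raise Error if any prefix from the RUN lines was never written out."""
    unused = sorted(set(specified_prefixes) - set(printed_prefixes))
    if unused:
        raise Error("unused prefixes: " + ", ".join(unused))


# Passes silently: every specified prefix ended up in the output.
check_unused_prefixes(["ALL", "BTVER2"], {"ALL", "BTVER2"})
```

The second check, for blocks that are not uniquely covered by a prefix, is not modelled here.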

Differential Revision: https://reviews.llvm.org/D48276

llvm-svn: 343332
2018-09-28 15:39:09 +00:00
Greg Bedwell 2f528f8c1e [utils] Allow better identification of matching blocks in update_mca_test_checks.py
Insert empty blocks to cause the positions of matching blocks to match
across lists where possible so that later stages of the algorithm can
actually identify them as being identical.
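
The idea can be sketched with Python's `difflib`. This is an illustration of the padding approach only, not the algorithm the script actually uses; `align_blocks` is a hypothetical helper:

```python
from difflib import SequenceMatcher


def align_blocks(a, b, empty=""):
    """Pad two lists of blocks with empty blocks so that blocks which are
    equal in both lists end up at the same index."""
    out_a, out_b = [], []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out_a.extend(a[i1:i2])
            out_b.extend(b[j1:j2])
        else:
            # 'delete', 'insert' or 'replace': keep each side's blocks and
            # pad the other side with the same number of empty blocks.
            out_a.extend(a[i1:i2])
            out_b.extend([empty] * (i2 - i1))
            out_a.extend([empty] * (j2 - j1))
            out_b.extend(b[j1:j2])
    return out_a, out_b


aligned_a, aligned_b = align_blocks(["A", "B", "C"], ["A", "C"])
```

Here `aligned_a` stays `["A", "B", "C"]` and `aligned_b` becomes `["A", "", "C"]`, so a later index-by-index comparison sees both `A` and `C` as identical.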

Regenerated all tests with this change.

Differential Revision: https://reviews.llvm.org/D52560

llvm-svn: 343331
2018-09-28 15:38:56 +00:00
Simon Pilgrim 428c1196d8 [X86][Btver2] PSUBS/PSUBUS instructions are zero-idioms
Noticed during llvm-exegesis tests: the PSUBS/PSUBUS instructions have the same zero-idiom behaviour as PSUB

llvm-svn: 343321
2018-09-28 14:20:42 +00:00
Simon Pilgrim 3216fd3602 [X86][Btver2] Add zero-idiom tests for PSUBS/PSUBUS instructions
Noticed during llvm-exegesis tests: the PSUBS/PSUBUS instructions have the same zero-idiom behaviour as PSUB

llvm-svn: 343319
2018-09-28 13:53:11 +00:00
Simon Pilgrim 66da1ed29d [X86][Btver2] CVTSS2I/CVTSD2I - add missing JFPU0 pipe
We issue JFPU1->JSTC then JFPU0->JFPA then -> JALU0 (integer pipe)

Match AMD Fam16h SOG + llvm-exegesis tests

llvm-svn: 343314
2018-09-28 13:19:22 +00:00
Simon Pilgrim 17e5981ebf [X86][Btver2] Fix BSF/BSR schedule
Double throughput to account for 2 pipes + fix BSF's latency/uop counts

Match AMD Fam16h SOG + llvm-exegesis tests

llvm-svn: 343311
2018-09-28 10:26:48 +00:00
Simon Pilgrim 280af1c7f0 [X86][BtVer2] Fix PHMINPOS schedule resources typo
PHMINPOS can run on either JFPU pipe

llvm-svn: 343299
2018-09-28 08:21:39 +00:00
Simon Pilgrim 86c7b07ecd [X86][Btver2] (V)MPSADBW instructions take 3uops not 1
llvm-svn: 343238
2018-09-27 17:13:57 +00:00
Simon Pilgrim dd744f158a [X86][Btver2] BTC/BTR/BTS instructions take 2uops not 1
llvm-svn: 343234
2018-09-27 16:39:52 +00:00
Simon Pilgrim c2a88ea64e [X86][Btver2] BLSI/BLSMSK/BLSR instructions take 2uops not 1 (same as TZCNT)
llvm-svn: 343227
2018-09-27 14:57:57 +00:00
Simon Pilgrim 98f503a326 [X86][Btver2] TZCNT instructions take 2uops not 1
llvm-svn: 343200
2018-09-27 12:28:47 +00:00
Simon Pilgrim b56be79e0c Revert rL342916: [X86] Remove shift/rotate by CL memory (RMW) overrides
As suggested by Craig Topper - I'm going to look at cleaning up the RMW sequences instead.

The uops are slightly different to the register variant, so this requires a +1uop tweak

llvm-svn: 342969
2018-09-25 13:01:26 +00:00
Simon Pilgrim 0b4ad7596f [X86] Remove shift/rotate by CL memory (RMW) overrides
The uops are slightly different to the register variant, so this requires a +1uop tweak

llvm-svn: 342916
2018-09-24 20:11:50 +00:00
Simon Pilgrim 00865a48d1 [X86] Split WriteIMul into 8/16/32/64 implementations (PR36931)
Split WriteIMul by size and also by IMUL multiply-by-imm and multiply-by-reg cases.

This removes all the scheduler overrides for gpr multiplies and stops WriteMULH being ignored for BMI2 MULX instructions.

llvm-svn: 342892
2018-09-24 15:21:57 +00:00
Simon Pilgrim 9202c9fb47 [X86] ROR*mCL instruction models should match ROL*mCL etc.
Confirmed with Craig Topper - fix a typo that was missing a Port4 uop for ROR*mCL instructions on some Intel models.

Yet another step on the scheduler model cleanup marathon......

llvm-svn: 342846
2018-09-23 19:16:01 +00:00
Simon Pilgrim 19952add7c [X86] Added missing RCL/RCR schedule overrides to the generic SNB model
The SandyBridge model was missing schedule values for the RCL/RCR instructions - instead using the (incredibly optimistic) WriteShift (now WriteRotate) defaults.

I've added overrides with more realistic (slow) values, based on a mixture of Agner/instlatx64 numbers and what later Intel models do as well.

This is necessary to allow WriteRotate to be updated to remove other rotate overrides.

It'd probably be a good idea to investigate a WriteRotateCarry class at some point but it's not high priority given the unusualness of these instructions.

llvm-svn: 342842
2018-09-23 17:40:24 +00:00
Clement Courbet 8171bd8e0f [X86][Sched] Add zero idiom sched data to the SNB model.
Summary:
On SNB, renamer-based zeroing does not work for:
 - 16 and 8-bit GPRs[1].
 - MMX [2].
 - ANDN variants [3]

[1] echo 'sub %ax, %ax' | /tmp/llvm-exegesis -mode=uops -snippets-file=-
[2] echo 'pxor %mm0, %mm0' | /tmp/llvm-exegesis -mode=uops -snippets-file=-
[3] echo 'andnps %xmm0, %xmm0' | /tmp/llvm-exegesis -mode=uops -snippets-file=-

Reviewers: RKSimon, andreadb

Subscribers: gbedwell, craig.topper, llvm-commits

Differential Revision: https://reviews.llvm.org/D52358

llvm-svn: 342736
2018-09-21 14:07:20 +00:00
Andrea Di Biagio 4cd5cf9fc8 [X86][BtVer2] Fix latency and resource cycles of AVX 256-bit zero-idioms.
This patch introduces a SchedWriteVariant to describe zero-idiom VXORP(S|D)Yrr
and VANDNP(S|D)Yrr.

This is a follow-up of r342555.

On Jaguar, a VXORPSYrr is 2 macro opcodes. Only one opcode is eliminated at
register-renaming stage. The other opcode has to be executed to set the upper
half of the destination YMM.
Same for VANDNP(S|D)Yrr.

Differential Revision: https://reviews.llvm.org/D52347

llvm-svn: 342728
2018-09-21 12:43:07 +00:00
Andrea Di Biagio 0aea310391 [llvm-mca][BtVer2] Modify ANDN tests in zero-idioms-avx-256.s. NFC
Two test cases should have tested 256-bit variants of VANDN zero-idioms instead
of the 128-bit variants.

llvm-svn: 342655
2018-09-20 15:48:23 +00:00
Andrea Di Biagio 8b6c314be1 [TableGen][SubtargetEmitter] Add the ability for processor models to describe dependency breaking instructions.
This patch adds the ability for processor models to describe dependency breaking
instructions.

Different processors may specify a different set of dependency-breaking
instructions.
That means, we cannot assume that all processors of the same target would use
the same rules to classify dependency breaking instructions.

The main goal of this patch is to provide the means to describe dependency
breaking instructions directly via tablegen, and have the following
TargetSubtargetInfo hooks redefined in overrides by tablegen'd
XXXGenSubtargetInfo classes (here, XXX is a Target name).

```
virtual bool isZeroIdiom(const MachineInstr *MI, APInt &Mask) const {
  return false;
}

virtual bool isDependencyBreaking(const MachineInstr *MI, APInt &Mask) const {
  return isZeroIdiom(MI, Mask);
}
```

An instruction MI is a dependency-breaking instruction if a call to method
isDependencyBreaking(MI) on the STI (TargetSubtargetInfo object) evaluates to
true. Similarly, an instruction MI is a special case of zero-idiom dependency
breaking instruction if a call to STI.isZeroIdiom(MI) returns true.
The extra APInt is used for those targets that may want to select which machine
operands have their dependency broken (see comments in code).
Note that by default, subtargets don't know about the existence of
dependency-breaking. In the absence of external information, those method calls
would always return false.
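
The relationship between the two hooks can be sketched with simplified stand-in types. This is only an illustration: `Instr`, `SubtargetInfo` and `ToySubtarget` are hypothetical and are not the real `MachineInstr` / `TargetSubtargetInfo` classes, the opcode strings are made up, and the `Mask` parameter is carried through but not modelled.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for llvm::MachineInstr: an opcode plus two source registers.
struct Instr {
  std::string Opcode;
  unsigned Src1;
  unsigned Src2;
};

// Hypothetical stand-in for TargetSubtargetInfo with the two default hooks.
struct SubtargetInfo {
  virtual ~SubtargetInfo() = default;
  // By default a subtarget knows nothing about zero-idioms.
  virtual bool isZeroIdiom(const Instr &MI, uint64_t &Mask) const { return false; }
  // By default only zero-idioms are dependency-breaking.
  virtual bool isDependencyBreaking(const Instr &MI, uint64_t &Mask) const {
    return isZeroIdiom(MI, Mask);
  }
};

// A toy processor model that opts in. Mask is left untouched in this sketch.
struct ToySubtarget : SubtargetInfo {
  bool isZeroIdiom(const Instr &MI, uint64_t &Mask) const override {
    // XOR of a register with itself always produces zero.
    return MI.Opcode == "XOR" && MI.Src1 == MI.Src2;
  }
  bool isDependencyBreaking(const Instr &MI, uint64_t &Mask) const override {
    if (isZeroIdiom(MI, Mask))
      return true;
    // Comparing a register with itself yields all-ones: the result does not
    // depend on the input, but it is not a zero-idiom.
    return MI.Opcode == "PCMPEQ" && MI.Src1 == MI.Src2;
  }
};
```

In this sketch every zero-idiom is dependency-breaking, while the reverse does not hold, which mirrors the "special case" wording above.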

A new tablegen class named STIPredicate has been added by this patch to let
processor models classify instructions that have properties in common. The idea
is that, a MCInstrPredicate definition can be used to "generate" an instruction
equivalence class, with the idea that instructions of the same class all have a
property in common.

STIPredicate definitions are essentially a collection of instruction equivalence
classes.
Also, different processor models can specify a different variant of the same
STIPredicate with different rules (i.e. predicates) to classify instructions.
Tablegen backends (in this particular case, the SubtargetEmitter) will be able
to process STIPredicate definitions, and automatically generate functions in
XXXGenSubtargetInfo.

This patch introduces two special kinds of STIPredicate classes named
IsZeroIdiomFunction and IsDepBreakingFunction in tablegen. It also adds a
definition for those in the BtVer2 scheduling model only.

This patch supersedes the one committed at r338372 (phabricator review: D49310).

The main advantages are:
 - We can describe subtarget predicates via tablegen using STIPredicates.
 - We can describe zero-idioms / dep-breaking instructions directly via
   tablegen in the scheduling models.

In future, the STIPredicates framework can be used for solving other problems.
Examples of future developments are:
 - Teach how to identify optimizable register-register moves
 - Teach how to identify slow LEA instructions (each subtarget defining its own
   concept of "slow" LEA).
 - Teach how to identify instructions that have undocumented false dependencies
   on the output registers on some processors only.

It is also (in my opinion) an elegant way to expose knowledge to both external
tools like llvm-mca, and codegen passes.
For example, machine schedulers in LLVM could reuse that information when
internally constructing the data dependency graph for a code region.

This new design feature is also an "opt-in" feature. Processor models don't have
to use the new STIPredicates. It has all been designed to be as unintrusive as
possible.

Differential Revision: https://reviews.llvm.org/D52174

llvm-svn: 342555
2018-09-19 15:57:45 +00:00
Simon Pilgrim 1c1335a10d [X86][BMI1] Fix BLSI/BLSMSK/BLSR BMI1 scheduling on btver2
These have the same behaviour as tzcnt on btver2 - confirmed with AMD 16h SOG, Agner and instlatx64.

llvm-svn: 342235
2018-09-14 13:31:14 +00:00
Andrea Di Biagio fb3d9e1449 [X86] Remove wrong ReadAdvance from multiclass sse_fp_unop_s.
A ReadAdvance was incorrectly added to the SchedReadWrite list associated with
the following SSE instructions:

sqrtss
sqrtsd
rsqrtss
rcpss

As a consequence, a wrong operand latency was computed for the register operand
used as the base address of the folded load operand.

This patch removes the wrong ReadAdvance, and updates the llvm-mca test cases.
There is still a problem with correctly modeling partial register writes on XMM
registers. This other problem is currently tracked here:
https://bugs.llvm.org/show_bug.cgi?id=38813

Differential Revision: https://reviews.llvm.org/D51542

llvm-svn: 341326
2018-09-03 16:47:34 +00:00
Andrea Di Biagio a59ec4efa0 [X86][BtVer2] Remove wrong ReadAdvance from AVX vbroadcast(ss|sd|f128) instructions.
The presence of a ReadAdvance for input operand #0 is problematic
because it changes the input latency of the register used as the base address
for the folded load.

A broadcast cannot start executing if the load address hasn't been computed yet.

In the llvm-mca example, the VBROADCASTSS is dependent on the address generated
by the LEAQ.  That means, it cannot start until LEAQ reaches the write-back
stage. If we apply ReadAdvance, then we wrongly assume that the load can start 3
cycles in advance.
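
The effect of a ReadAdvance on operand latency can be illustrated with a small helper. This is a sketch under assumptions: `effectiveOperandLatency` is a hypothetical name, and the 5-cycle producer latency used in the usage note is purely illustrative, not a number taken from the BtVer2 model.

```cpp
#include <algorithm>

// Cycles a consumer has to wait for a producer's result when the consuming
// operand carries a ReadAdvance. A ReadAdvance of N lets the operand be read
// N cycles later, which shortens the visible latency (never below zero).
unsigned effectiveOperandLatency(unsigned WriteLatency, unsigned ReadAdvanceCycles) {
  return WriteLatency - std::min(WriteLatency, ReadAdvanceCycles);
}
```

With a hypothetical 5-cycle producer, `effectiveOperandLatency(5, 3)` yields 2 while `effectiveOperandLatency(5, 0)` yields 5. Applying the advance to the address register of a folded load is what makes the broadcast appear to start 3 cycles too early.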

Differential Revision: https://reviews.llvm.org/D51534

llvm-svn: 341222
2018-08-31 16:05:48 +00:00
Andrea Di Biagio 69da3f3df6 [X86] Add llvm-mca tests that show how operand latency is wrongly computed for SSE sqrtss/sd and rcpss.
According to the timeline view, sqrtss/sd/rcpss start executing before the load
address for the memory operand is available.
This problem is caused by the presence of a ReadAfterLd (a ReadAdvance). Those
unary operations should not specify a ReadAdvance at all.

llvm-svn: 341213
2018-08-31 14:12:13 +00:00
Andrea Di Biagio 0e21ca1278 [X86][BtVer2] Add an llvm-mca test that shows how the read latency of AVX broadcastss on ymm registers is incorrectly set.
llvm-svn: 341197
2018-08-31 10:39:33 +00:00
Andrea Di Biagio b998eae2f2 [X86][BtVer2] Fix WriteFShuffle256 schedule write info.
This patch fixes the number of micro opcodes, and processor resource cycles for
the following AVX instructions:

vinsertf128rr/rm
vperm2f128rr/rm
vbroadcastf128

Tests have been regenerated using the usual scripts in the llvm/utils directory.

Differential Revision: https://reviews.llvm.org/D51492

llvm-svn: 341185
2018-08-31 08:30:47 +00:00
Andrea Di Biagio 8b647dcf4b [llvm-mca] Report the number of dispatched micro opcodes in the DispatchStatistics view.
This patch introduces the following changes to the DispatchStatistics view:
 * DispatchStatistics now reports the number of dispatched opcodes instead of
   the number of dispatched instructions.
 * The "Dynamic Dispatch Stall Cycles" table now also reports the percentage of
   stall cycles against the total simulated cycles.

This change allows users to easily compare dispatch group sizes with the
processor DispatchWidth.
Before this change, it was difficult to correlate the two numbers, since
DispatchStatistics view reported numbers of instructions (instead of opcodes).
DispatchWidth defines the maximum size of a dispatch group in terms of number of
micro opcodes.

The other change introduced by this patch is related to how DispatchStage
generates "instruction dispatch" events.
In particular:
 * There can be multiple dispatch events associated with the same instruction.
 * Each dispatch event now encapsulates the number of dispatched micro opcodes.

The number of micro opcodes declared by an instruction may exceed the processor
DispatchWidth. Therefore, we cannot assume that instructions are always fully
dispatched in a single cycle.
DispatchStage knows already how to handle instructions declaring a number of
opcodes bigger than DispatchWidth. However, DispatchStage always emitted a
single instruction dispatch event (during the first simulated dispatch cycle)
for instructions dispatched.

With this patch, DispatchStage now correctly notifies multiple dispatch events
for instructions that cannot be dispatched in a single cycle.
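
A rough sketch of the idea, assuming the instruction starts a fresh dispatch group and nothing else competes for dispatch slots (the real DispatchStage also accounts for other constraints). `splitDispatch` is a hypothetical helper, not an llvm-mca function.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Returns one entry per dispatch event; each entry is the number of micro
// opcodes dispatched in that cycle for a single instruction.
std::vector<unsigned> splitDispatch(unsigned NumMicroOps, unsigned DispatchWidth) {
  assert(DispatchWidth > 0 && "dispatch width must be positive");
  std::vector<unsigned> Events;
  while (NumMicroOps > 0) {
    unsigned Dispatched = std::min(NumMicroOps, DispatchWidth);
    Events.push_back(Dispatched);
    NumMicroOps -= Dispatched;
  }
  return Events;
}
```

For example, an instruction declaring 5 micro opcodes on a processor with a dispatch width of 2 produces three events carrying 2, 2 and 1 micro opcodes.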

A few views had to be modified. Views can no longer assume that there can only
be one dispatch event per instruction.

Tests (and docs) have been updated.

Differential Revision: https://reviews.llvm.org/D51430

llvm-svn: 341055
2018-08-30 10:50:20 +00:00
Andrew V. Tischenko 62f7a3207b [X86] Improved sched model for X86 CMPXCHG* instructions.
Differential Revision: https://reviews.llvm.org/D50070 

llvm-svn: 341024
2018-08-30 06:26:00 +00:00
Andrea Di Biagio a2eee47450 [llvm-mca] Add fields "Total uOps" and "uOps Per Cycle" to the report generated by the SummaryView.
This patch adds two new fields to the perf report generated by the SummaryView.
Fields are now logically organized into two small groups; only the second group
contains throughput indicators.

Example:
```
Iterations:        100
Instructions:      300
Total Cycles:      414
Total uOps:        700

Dispatch Width:    4
uOps Per Cycle:    1.69
IPC:               0.72
Block RThroughput: 4.0
```

This patch also updates the docs for llvm-mca.
Due to the nature of this change, several tests in the tools/llvm-mca directory
were affected, and had to be updated using script `update_mca_test_checks.py`.
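
The two throughput indicators in the example can be reproduced from the first group of fields. The sketch below is only a worked calculation; the `Summary` struct and helper names are hypothetical and do not correspond to the SummaryView implementation.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical container for the first group of fields in the report.
struct Summary {
  unsigned Iterations;
  unsigned Instructions;
  unsigned TotalCycles;
  unsigned TotalUOps;
};

// Instructions retired per simulated cycle.
double ipc(const Summary &S) {
  assert(S.TotalCycles > 0 && "no simulated cycles");
  return static_cast<double>(S.Instructions) / S.TotalCycles;
}

// Micro opcodes per simulated cycle.
double uopsPerCycle(const Summary &S) {
  assert(S.TotalCycles > 0 && "no simulated cycles");
  return static_cast<double>(S.TotalUOps) / S.TotalCycles;
}

// Rounds to two decimals, matching the precision shown in the report.
double roundTo2(double V) { return std::round(V * 100.0) / 100.0; }
```

With the numbers from the example, 300 / 414 gives an IPC of about 0.72 and 700 / 414 gives about 1.69 uOps per cycle.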

llvm-svn: 340946
2018-08-29 17:56:39 +00:00
Andrea Di Biagio 5221e17fd6 [llvm-mca] Don't disable the SummaryView if flag `-all-stats` is false.
llvm-svn: 340945
2018-08-29 17:40:04 +00:00
Andrea Di Biagio d17d371c40 [llvm-mca][TimelineView] Force the same number of executions for every entry in the 'wait-times' table.
This patch also uses colors to highlight problematic wait-time entries.
A problematic entry is an entry with a high wait time that tends to match (or
exceed) the size of the scheduler's buffer.

Color RED is used if an instruction had to wait an average number of cycles
which is bigger than (or equal to) the size of the underlying scheduler's
buffer.
Color YELLOW is used if the time (in cycles) spent waiting for the
operands or pipeline resources is bigger than half the size of the underlying
scheduler's buffer.
Color MAGENTA is used if an instruction does not consume buffer resources
according to the scheduling model.
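
The colour rules above can be summarised as a small classification function. This is a sketch: `waitTimeColor` and the `Color` enum are hypothetical names, and a `BufferSize` of 0 is used here as an assumed encoding for "does not consume buffer resources".

```cpp
#include <cassert>

enum class Color { Default, Yellow, Red, Magenta };

// Picks the highlight colour for one entry of the wait-times table.
Color waitTimeColor(double AvgWaitCycles, unsigned BufferSize) {
  assert(AvgWaitCycles >= 0.0 && "wait time cannot be negative");
  if (BufferSize == 0)                   // assumed: no buffer resources consumed
    return Color::Magenta;
  if (AvgWaitCycles >= BufferSize)       // waits at least a full buffer's worth
    return Color::Red;
  if (AvgWaitCycles > BufferSize / 2.0)  // waits more than half the buffer size
    return Color::Yellow;
  return Color::Default;
}
```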

llvm-svn: 340825
2018-08-28 14:27:01 +00:00
Andrea Di Biagio b89b96c1b2 [llvm-mca] Improved report generated by the SchedulerStatistics view.
Before this patch, the SchedulerStatistics only printed the maximum number of
buffer entries consumed in each scheduler's queue at a given point of the
simulation.

This patch restructures the reported table, and adds an extra field named
"Average number of used buffer entries" to it.
This patch also uses different colors to help identify bottlenecks caused by
high scheduler's buffer pressure.

llvm-svn: 340746
2018-08-27 14:52:52 +00:00
Andrea Di Biagio a03f2a77f8 [llvm-mca] Fix PR38575: Avoid an invalid implicit truncation of a processor resource mask (an uint64_t value) to unsigned.
This patch fixes a regression introduced at revision 338702.

A processor resource mask was incorrectly implicitly truncated to an unsigned
quantity. Later on, the truncated mask was used to initialize an element of a
vector of processor resource descriptors.
On targets with more than 32 processor resources, some elements of the vector
are left uninitialized. As a consequence, this bug might have eventually caused
a crash due to null dereference in the Scheduler.

This patch fixes PR38575, and adds a test for it.
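
The class of bug can be reproduced in isolation. The sketch below is not the llvm-mca code; `resourceMask`, `truncateMask` and `survivesTruncation` are hypothetical helpers, and `uint32_t` stands in for `unsigned` on targets where it is 32 bits wide.

```cpp
#include <cassert>
#include <cstdint>

// One-hot 64-bit mask identifying the processor resource at ResourceIndex.
uint64_t resourceMask(unsigned ResourceIndex) {
  assert(ResourceIndex < 64 && "mask has only 64 bits");
  return uint64_t{1} << ResourceIndex;
}

// Models the implicit conversion to a 32-bit unsigned: bits 32..63 are dropped.
uint32_t truncateMask(uint64_t Mask) { return static_cast<uint32_t>(Mask); }

// True if the mask still identifies the same resource after truncation.
bool survivesTruncation(uint64_t Mask) { return truncateMask(Mask) == Mask; }
```

A mask for resource 31 survives the conversion, while a mask for resource 32 or above truncates to zero, so any lookup keyed on the truncated value no longer finds the resource.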

llvm-svn: 339768
2018-08-15 12:53:38 +00:00
Andrew V. Tischenko 1fe3375620 [X86] MCA tests for XCHG*, XADD* and CMPXCHG* instructions
Differential Revision: https://reviews.llvm.org/D49912

llvm-svn: 339145
2018-08-07 14:36:43 +00:00
Simon Pilgrim b911d6721d [llvm-mca][x86] Add CMPXCHG instruction resource tests
I've put CMPXCHG8B/CMPXCHG16B in the same file, even though technically they are under separate CPUID bits all targets seem to support both (or neither).

llvm-svn: 338595
2018-08-01 17:25:11 +00:00
Simon Pilgrim 5c4fb14e07 [llvm-mca][x86] Add PREFETCHW instruction resource tests
These aren't just available via 3DNow! so test for them separately as well.

llvm-svn: 338584
2018-08-01 16:34:39 +00:00
Simon Pilgrim dcfa732b2f [llvm-mca][x86] Add PCLMUL instruction resource tests
Renamed the btver2 file that already contained them - the other targets were only testing the AVX versions

llvm-svn: 338583
2018-08-01 16:25:50 +00:00
Andrea Di Biagio 7f3bf5c1f9 [llvm-mca] Correctly update the rank in `Scheduler::select()`.
Found by inspection.

llvm-svn: 338579
2018-08-01 16:06:33 +00:00
Simon Pilgrim 34ac6533f4 [llvm-mca][x86] Add SET/TEST instruction resource tests
llvm-svn: 338576
2018-08-01 15:29:47 +00:00
Simon Pilgrim e364e57ac9 [llvm-mca][x86] Add LEA instruction resource tests
We already added these to btver2, now add them to other targets, even though none of their models treat them specially (yet).

llvm-svn: 338565
2018-08-01 14:25:33 +00:00
Simon Pilgrim 6754913e95 [llvm-mca][x86] Add more x86-64 system instruction resource tests
CPUID, IN/OUT, INS/OUTS, INT, PAUSE, SCAS, UD2, XLAT

llvm-svn: 338563
2018-08-01 14:18:09 +00:00
Simon Pilgrim 5f41ab79c0 [llvm-mca][x86] Add CLFLUSHOPT instruction resource tests
llvm-svn: 338550
2018-08-01 13:34:17 +00:00
Simon Pilgrim bd014f4d91 [llvm-mca][x86] Add CMPS/LODS/MOVS/STOS string instruction resource tests
llvm-svn: 338532
2018-08-01 13:14:45 +00:00
Simon Pilgrim 18d025a732 [llvm-mca][x86] Add STC + STD instruction resource tests
llvm-svn: 338514
2018-08-01 11:00:11 +00:00
Simon Pilgrim 1f4b9cb6fe [llvm-mca][x86] Add 32-bit instruction resource tests
These aren't exhaustive, but cover some instructions that are only available in 32-bit mode (where would we be without good BCD math performance?).

llvm-svn: 338404
2018-07-31 17:33:08 +00:00
Andrea Di Biagio a1852b6194 [llvm-mca][BtVer2] Teach how to identify dependency-breaking idioms.
This patch teaches llvm-mca how to identify dependency breaking instructions on
btver2.

An example of dependency breaking instructions is the zero-idiom XOR (example:
`XOR %eax, %eax`), which always generates zero regardless of the actual value of
the input register operands.
Dependency breaking instructions don't have to wait on their input register
operands before executing. This is because the computation is not dependent on
the inputs.

Not all dependency breaking idioms are also zero-latency instructions. For
example, `CMPEQ %xmm1, %xmm1` is independent of
the value of XMM1, and it generates a vector of all-ones.
That instruction is not eliminated at register renaming stage, and its opcode is
issued to a pipeline for execution. So, the latency is not zero. 

This patch adds a new method named isDependencyBreaking() to the MCInstrAnalysis
interface. That method takes as input an instruction (i.e. MCInst) and a
MCSubtargetInfo.
The default implementation of isDependencyBreaking() conservatively returns
false for all instructions. Targets may override the default behavior for
specific CPUs, and return a value which better matches the subtarget behavior.

In future, we should teach Tablegen how to automatically generate the body of
isDependencyBreaking from scheduling predicate definitions. This would allow us
to expose the knowledge about dependency breaking instructions to the machine
schedulers (and, potentially, other codegen passes).

Differential Revision: https://reviews.llvm.org/D49310

llvm-svn: 338372
2018-07-31 13:21:43 +00:00
Roman Lebedev 52b85377eb [NFC][MCA] ZnVer1: Update RegisterFile to identify false dependencies on partially written registers.
Summary:
Pretty mechanical follow-up for D49196.

As microarchitecture.pdf notes, "20 AMD Ryzen pipeline",
"20.8 Register renaming and out-of-order schedulers":
  The integer register file has 168 physical registers of 64 bits each.
  The floating point register file has 160 registers of 128 bits each.
"20.14 Partial register access":
  The processor always keeps the different parts of an integer register together.
  ...
  An instruction that writes to part of a register will therefore have a false dependence
  on any previous write to the same register or any part of it.

Reviewers: andreadb, courbet, RKSimon, craig.topper, GGanesh

Reviewed By: GGanesh

Subscribers: gbedwell, llvm-commits

Differential Revision: https://reviews.llvm.org/D49393

llvm-svn: 337676
2018-07-23 10:10:13 +00:00
Roman Lebedev d57bd45acc [NFC][MCA] ZnVer1: add partial-reg-update tests
Reviewers: andreadb, courbet, RKSimon, craig.topper, GGanesh

Reviewed By: GGanesh

Subscribers: gbedwell, llvm-commits

Differential Revision: https://reviews.llvm.org/D49392

llvm-svn: 337675
2018-07-23 10:10:04 +00:00
Simon Pilgrim 5e729dcc03 [llvm-mca][x86] Add movsx/movzx instructions to general x86_64 resource tests
llvm-svn: 337586
2018-07-20 17:43:42 +00:00
Andrea Di Biagio b6022aa8d9 [X86][BtVer2] correctly model the latency/throughput of LEA instructions.
This patch fixes the latency/throughput of LEA instructions in the BtVer2
scheduling model.

On Jaguar, a 3-operand LEA has a latency of 2cy, and a reciprocal throughput of
1. That is because it uses one cycle of SAGU followed by 1cy of ALU1.  An LEA
with a "Scale" operand is also slow, and it has the same latency profile as the
3-operand LEA. An LEA16r has a latency of 3cy, and a throughput of 0.5 (i.e.
RThroughput of 2.0).

This patch adds a new TIIPredicate named IsThreeOperandsLEAFn to X86Schedule.td.
The tablegen backend (for instruction-info) expands that definition into this
(file X86GenInstrInfo.inc):
```
static bool isThreeOperandsLEA(const MachineInstr &MI) {
  return (
    (
      MI.getOpcode() == X86::LEA32r
      || MI.getOpcode() == X86::LEA64r
      || MI.getOpcode() == X86::LEA64_32r
      || MI.getOpcode() == X86::LEA16r
    )
    && MI.getOperand(1).isReg()
    && MI.getOperand(1).getReg() != 0
    && MI.getOperand(3).isReg()
    && MI.getOperand(3).getReg() != 0
    && (
      (
        MI.getOperand(4).isImm()
        && MI.getOperand(4).getImm() != 0
      )
      || (MI.getOperand(4).isGlobal())
    )
  );
}
```

A similar method is generated in the X86_MC namespace, and included into
X86MCTargetDesc.cpp (the declaration lives in X86MCTargetDesc.h).

Back to the BtVer2 scheduling model:
A new scheduling predicate named JSlowLEAPredicate now checks if either the
instruction is a three-operands LEA, or it is an LEA with a Scale value
different than 1.
A variant scheduling class uses that new predicate to correctly select the
appropriate latency profile.
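
The selection logic can be sketched outside of tablegen as follows. This is an illustration only: `LeaOperands`, `isThreeOperandsLEA` and `isSlowLEA` here are simplified, hypothetical stand-ins for the generated predicate and JSlowLEAPredicate. A register value of 0 means "no register", and the global-address displacement case and the LEA16r special case are not modelled.

```cpp
#include <cassert>
#include <cstdint>

// Simplified view of an LEA memory operand: base + scale * index + disp.
struct LeaOperands {
  unsigned BaseReg;   // 0 means no base register
  int64_t Scale;      // 1, 2, 4 or 8
  unsigned IndexReg;  // 0 means no index register
  int64_t Disp;       // immediate displacement
};

// Base, index and a non-zero displacement are all present.
bool isThreeOperandsLEA(const LeaOperands &L) {
  return L.BaseReg != 0 && L.IndexReg != 0 && L.Disp != 0;
}

// Slow if it is a three-operand LEA or if it uses a scale other than 1.
bool isSlowLEA(const LeaOperands &L) {
  assert((L.Scale == 1 || L.Scale == 2 || L.Scale == 4 || L.Scale == 8) &&
         "invalid x86 scale");
  return isThreeOperandsLEA(L) || L.Scale != 1;
}
```

A variant scheduling class would then pick the 2cy profile when `isSlowLEA` holds and the default profile otherwise.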

Differential Revision: https://reviews.llvm.org/D49436

llvm-svn: 337469
2018-07-19 16:42:15 +00:00
Simon Pilgrim 03164dfa5e [llvm-mca][x86] Add extend, carry-flag and CMP instructions to general x86_64 resource tests
llvm-svn: 337306
2018-07-17 17:47:35 +00:00
Simon Pilgrim 92da01fed9 [llvm-mca][x86] Add MOVBE resource tests to all supporting targets
SNB doesn't support MOVBE but the numbers in Generic (which use the SNB model) look sane.

llvm-svn: 337305
2018-07-17 17:41:45 +00:00
Simon Pilgrim 94049e8b15 [llvm-mca][x86] Add BSWAP resource tests
llvm-svn: 337302
2018-07-17 17:10:47 +00:00
Simon Pilgrim 99a4f3195b [llvm-mca][x86] Add displacement-only and additional scale=1 LEA tests
llvm-svn: 337298
2018-07-17 16:17:33 +00:00
Simon Pilgrim 17d89ca70e [llvm-mca][x86] Add LEA resource tests (PR32326)
Add llvm-mca tests demonstrating how LEA instructions are currently modelled. Once this is working on btver2 I'll copy the test file to the other target directories.

llvm-svn: 337297
2018-07-17 16:13:29 +00:00
Andrea Di Biagio f84b0a6914 [llvm-mca] Regenerate X86 specific tests. NFC
Not all tests were correctly updated by the update script after r336797.

llvm-svn: 337124
2018-07-15 11:43:11 +00:00
Andrea Di Biagio ff630c2cdc [llvm-mca][BtVer2] teach how to identify false dependencies on partially written
registers.

The goal of this patch is to improve the throughput analysis in llvm-mca for the
case where instructions perform partial register writes.

On x86, partial register writes are quite difficult to model, mainly because
different processors tend to implement different register merging schemes in
hardware.

When the code contains partial register writes, the IPC (instructions per
cycle) estimated by llvm-mca tends to diverge quite significantly from the
observed IPC (using perf).

Modern AMD processors (at least, from Bulldozer onwards) don't rename partial
registers. Quoting Agner Fog's microarchitecture.pdf:
" The processor always keeps the different parts of an integer register together.
For example, AL and AH are not treated as independent by the out-of-order
execution mechanism. An instruction that writes to part of a register will
therefore have a false dependence on any previous write to the same register or
any part of it."

This patch is a first important step towards improving the analysis of partial
register updates. It changes the semantics of RegisterFile descriptors in
tablegen, and teaches llvm-mca how to identify false dependences in the presence
of partial register writes (for more details: see the new code comments in
include/Target/TargetSchedule.h - class RegisterFile).
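
A toy model of the behaviour described above, assuming a processor that never renames partial registers. `RegisterFileModel` and `lookup` are hypothetical and cover only the RAX family. Writes of 32 bits or more are treated as full writes with no false dependence, on the assumption that a 32-bit write zero-extends into the 64-bit register.

```cpp
#include <cassert>
#include <map>
#include <string>

// Super-register and width (in bits) of a named register.
struct RegInfo {
  std::string Super;
  unsigned Bits;
};

// Hypothetical lookup table, limited to the RAX family for brevity.
const RegInfo &lookup(const std::string &Reg) {
  static const std::map<std::string, RegInfo> Table = {
      {"al", {"rax", 8}},   {"ah", {"rax", 8}},   {"ax", {"rax", 16}},
      {"eax", {"rax", 32}}, {"rax", {"rax", 64}}};
  auto It = Table.find(Reg);
  assert(It != Table.end() && "register not covered by this sketch");
  return It->second;
}

class RegisterFileModel {
  std::map<std::string, int> LastWriter; // super-register -> last writer id

public:
  // Records a write by instruction InstrId. Returns the id of the earlier
  // write this one falsely depends on, or -1 if there is none.
  int write(const std::string &Reg, int InstrId) {
    const RegInfo &Info = lookup(Reg);
    int Dep = -1;
    auto It = LastWriter.find(Info.Super);
    // Only 8-bit and 16-bit writes merge with the previous register value.
    if (Info.Bits < 32 && It != LastWriter.end())
      Dep = It->second;
    LastWriter[Info.Super] = InstrId;
    return Dep;
  }
};
```

In this model a write to AL after a write to EAX depends on the EAX write, and a following write to AH depends on the AL write, because all parts are kept together in one physical register.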

This patch doesn't address the case where a write to a part of a register is
followed by a read from the whole register.  On Intel chips, high8 registers
(AH/BH/CH/DH) can be stored in separate physical registers. However, a later
(dirty) read of the full register (example: AX/EAX) triggers a merge uOp, which
adds extra latency (and potentially affects the pipe usage).
This is a very interesting article on the subject with a very informative answer
from Peter Cordes:
https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to

In future, the definition of RegisterFile can be extended with extra information
that may be used to identify delays caused by merge opcodes triggered by a dirty
read of a partial write.

Differential Revision: https://reviews.llvm.org/D49196

llvm-svn: 337123
2018-07-15 11:01:38 +00:00
Andrea Di Biagio e86e6efea1 [llvm-mca][BtVer2] Add tests for dependency breaking instructions.
llvm-svn: 337024
2018-07-13 16:46:51 +00:00
Andrea Di Biagio 483db141e3 [X86] Fix MayLoad/HasSideEffect flag for (V)MOVLPSrm instructions.
Before revision 336728, the "mayLoad" flag for instruction (V)MOVLPSrm was
inferred directly from the "default" pattern associated with the instruction
definition.

r336728 removed special node X86Movlps, and all the patterns associated to it.
Now instruction (V)MOVLPSrm doesn't have a pattern associated to it, and the
'mayLoad/hasSideEffects' flags are left unset.

When the instruction info is emitted by tablegen, method
CodeGenDAGPatterns::InferInstructionFlags() sees that (V)MOVLPSrm doesn't have a
pattern, and flags are undefined. So, it conservatively sets the
"hasSideEffects" flag for it.

As a consequence, we were losing the 'mayLoad' flag, and we were gaining a
'hasSideEffect' flag in its place.
This patch fixes the issue (originally reported by Michael Holmen).

The mca tests show the differences in the instruction info flags.  Instructions
that were affected by this problem were: MOVLPSrm/VMOVLPSrm/VMOVLPSZ128rm.

Differential Revision: https://reviews.llvm.org/D49182

llvm-svn: 336818
2018-07-11 15:27:50 +00:00
Andrea Di Biagio d2e2c053cf [llvm-mca] Use a different character to flag instructions with side-effects in the Instruction Info View. NFC
This makes it easier to identify changes in the instruction info flags.  It also
helps spot potential regressions similar to the one recently introduced at
r336728.

Using the same character to mark MayLoad/MayStore/HasSideEffects is problematic
for llvm-lit. When pattern matching substrings, llvm-lit consumes tabs and
spaces. A change in position of the flag marker may not trigger a test failure.

This patch only changes the character used for flag `hasSideEffects`. The reason
why I didn't touch other flags is because I want to avoid spamming the mailing
list because of the massive diff due to the numerous tests affected by this change.

In future, each instruction flag should be associated with a different character
in the Instruction Info View.

llvm-svn: 336797
2018-07-11 12:44:44 +00:00
Andrea Di Biagio 2b3a4f9c9b [llvm-mca] Add tests for partial register writes.
llvm-mca doesn't know that on modern AMD processors, portions of a general
purpose register are not treated independently. So, a partial register write has
a false dependency on the super-register.

The issue with partial register writes will be addressed by a follow-up patch.

llvm-svn: 336778
2018-07-11 09:50:00 +00:00
Andrea Di Biagio 8834779644 [llvm-mca] report an error if the assembly sequence contains an unsupported instruction.
This is a short-term fix for PR38093.
For now, we llvm::report_fatal_error if the instruction builder finds an
unsupported instruction in the instruction stream.

We need to revisit this fix once we start addressing PR38101.
Essentially, we need a better framework for error handling.

llvm-svn: 336543
2018-07-09 12:30:55 +00:00
Roman Lebedev 0e58dee284 [MCA][X86][NFC] Add BSF/BSR resource tests
Reviewers: RKSimon, andreadb, courbet

Reviewed By: RKSimon

Subscribers: gbedwell, llvm-commits

Differential Revision: https://reviews.llvm.org/D48997

llvm-svn: 336510
2018-07-08 09:50:14 +00:00
Andrea Di Biagio 61c52af9d9 [llvm-mca] improve the instruction issue logic implemented by the Scheduler.
This patch modifies the Scheduler heuristic used to select the next instruction
to issue to the pipelines.

The motivating example is test X86/BtVer2/add-sequence.s, for which llvm-mca
wrongly reported an estimated IPC of 1.50. According to perf, the actual IPC for
that test should have been ~2.00.
It turns out that an IPC of 2.00 for test add-sequence.s cannot possibly be
predicted by a Scheduler that only prioritizes instructions based on their
"age". A similar issue also affected test X86/BtVer2/dependent-pmuld-paddd.s,
for which llvm-mca wrongly estimated an IPC of 0.84 instead of an IPC of 1.00.

Instructions in the ReadyQueue are now ranked based on two factors:
 - The "age" of an instruction.
 - The number of unique users of writes associated with an instruction.

The new logic still prioritizes older instructions over younger instructions to
minimize the pressure on the reorder buffer. However, the number of users of an
instruction now also affects the overall rank. This potentially increases the
ability of the Scheduler to extract instruction level parallelism.  This patch
fixes the problem with the wrong IPC reported for test add-sequence.s and test
dependent-pmuld-paddd.s.

llvm-svn: 336420
2018-07-06 08:08:30 +00:00
Roman Lebedev 0dd27042c6 [X86][BtVer2][MCA][NFC] Add CMPEQ dependency-breaking one-idioms tests
Summary: As per `Agner's Microarchitecture doc
(21.8 AMD Bobcat and Jaguar pipeline - Dependency-breaking instructions)`,
these, like zero-idioms, are dependency-breaking,
although they produce ones and still consume resources.

FIXME: as discussed in D48877, llvm-mca handling is broken for these.

Reviewers: andreadb

Reviewed By: andreadb

Subscribers: gbedwell, RKSimon, llvm-commits

Differential Revision: https://reviews.llvm.org/D48876

llvm-svn: 336292
2018-07-04 17:32:44 +00:00
Fangrui Song f50ad6c311 Replace unused output filenames with /dev/null in tests
Similar to rLLD336129

llvm-svn: 336131
2018-07-02 18:16:44 +00:00
Simon Pilgrim 83125594ed [llvm-mca][x86] Add FMA4 resource tests
We should be ensuring we have (near) complete test coverage of instructions, at least for the generic model.

llvm-svn: 335870
2018-06-28 16:24:13 +00:00
Simon Pilgrim 12f9503d40 [llvm-mca][x86] Add 3dnow! resource tests
We should be ensuring we have (near) complete test coverage of instructions, at least for the generic model.

llvm-svn: 335869
2018-06-28 16:21:22 +00:00
Andrea Di Biagio 2145b13fc9 [llvm-mca][X86] Teach how to identify register writes that implicitly clear the upper portion of a super-register.
This patch teaches llvm-mca how to identify register writes that implicitly zero
the upper portion of a super-register.

On X86-64, a general purpose register is implemented in hardware as a 64-bit
register. Quoting the Intel 64 Software Developer's Manual: "an update to the
lower 32 bits of a 64 bit integer register is architecturally defined to zero
extend the upper 32 bits".  Also, a write to an XMM register performed by an AVX
instruction implicitly zeroes the upper 128 bits of the aliasing YMM register.

This patch adds a new method named clearsSuperRegisters to the MCInstrAnalysis
interface to help identify instructions that implicitly clear the upper portion
of a super-register.  The rest of the patch teaches llvm-mca how to use that new
method to obtain the information, and update the register dependencies
accordingly.

I compared the kernels from tests clear-super-register-1.s and
clear-super-register-2.s against the output from perf on btver2.  Previously
there was a large discrepancy between the estimated IPC and the measured IPC.
Now the differences are mostly in the noise.

Differential Revision: https://reviews.llvm.org/D48225

llvm-svn: 335113
2018-06-20 10:08:11 +00:00
Roman Lebedev d23b6831de [X86][Znver1] Specify Register Files, RCU; FP scheduler capacity.
Summary:
First off: i do not have any access to that processor,
so this is purely theoretical, no benchmarks.

I have been looking into b**d**ver2 scheduling profile, and while cross-referencing
the existing b**t**ver2, znver1 profiles, and the reference docs
(`Software Optimization Guide for AMD Family {15,16,17}h Processors`),
i have noticed that only b**t**ver2 scheduling profile specifies these.

Also, there is no mca test coverage.

Reviewers: RKSimon, craig.topper, courbet, GGanesh, andreadb

Reviewed By: GGanesh

Subscribers: gbedwell, vprasad, ddibyend, shivaram, Ashutosh, javed.absar, llvm-commits

Differential Revision: https://reviews.llvm.org/D47676

llvm-svn: 335099
2018-06-20 07:01:14 +00:00
Clement Courbet e0aa30008f [X86] Fix r335097
Missed `Generic` test in llvm-mca.

llvm-svn: 335098
2018-06-20 06:44:13 +00:00
Clement Courbet 7b9913fb9f [X86] Add sched class WriteLAHFSAHF and fix values.
Summary:
I ran llvm-exegesis on SKX, SKL, BDW, HSW, SNB.
Atom is from Agner and SLM is a guess.
I've left AMD processors alone.

Reviewers: RKSimon, craig.topper

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D48079

llvm-svn: 335097
2018-06-20 06:13:39 +00:00
Roman Lebedev ae0527aac9 [MCA][NFC] Add generic XOP resource tests
Summary:
Based on
* [[ https://support.amd.com/TechDocs/43479.pdf | AMD64 Architecture Programmer’s Manual Volume 6: 128-Bit and 256-Bit XOP and FMA4 Instructions ]],
* [[ https://support.amd.com/TechDocs/24594.pdf | AMD64 Architecture Programmer’s Manual Volume 3: General-Purpose and System Instructions]],
* https://en.wikipedia.org/wiki/XOP_instruction_set

Appears to be only supported in AMD's 15h generation, so only in b**d**ver[1-4],
for which currently llvm has no scheduling profiles.

Reviewers: RKSimon, craig.topper, andreadb, spatel

Reviewed By: RKSimon

Subscribers: gbedwell, llvm-commits

Differential Revision: https://reviews.llvm.org/D48264

llvm-svn: 335034
2018-06-19 09:21:27 +00:00
Roman Lebedev 0d12c1685b [MCA][NFC] Add generic TBM resource tests
Summary:
Based on https://support.amd.com/TechDocs/24594.pdf,
https://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets#TBM_(Trailing_Bit_Manipulation)

Appears to be only supported in AMD's 15h generation, so only in b**d**ver[1-4],
for which currently llvm has no scheduling profiles.

Reviewers: RKSimon, craig.topper, simark, andreadb

Reviewed By: RKSimon

Subscribers: gbedwell, llvm-commits

Differential Revision: https://reviews.llvm.org/D48252

llvm-svn: 335033
2018-06-19 09:21:22 +00:00
Andrea Di Biagio a88281d8ae [llvm-mca] Use an ordered map to collect hardware statistics. NFC.
Histogram entries are now ordered by key.  This should improve their
readability when statistics are printed.

llvm-svn: 334961
2018-06-18 17:04:56 +00:00
Andrea Di Biagio 487da729a2 [llvm-mca] Add tests for XOP and AVX512 instructions that implicitly clear the upper portion of a super-register.
When the destination register of an XOP instruction is an XMM register, bits
[255:128] of the corresponding YMM register are cleared.

When the destination register of an EVEX encoded instruction is an XMM/YMM
register, the upper bits of the corresponding ZMM are cleared.
On processors that feature AVX512, a write to an XMM register always clears the
upper portion of the corresponding ZMM register if the instruction is VEX or
EVEX encoded.

These new tests show some interesting cases which aren't correctly analyzed by
llvm-mca. The lack of knowledge related to the implicit update on the
super-registers is addressed by D48225.

llvm-svn: 334945
2018-06-18 14:00:30 +00:00
Clement Courbet 0d9da88d18 [X86] Fix NOOP sched overrides on BDW/HSW/SKL.
Summary: Noop certainly does not use resources.

Reviewers: RKSimon, craig.topper, andreadb

Subscribers: gbedwell, llvm-commits, gchatelet

Differential Revision: https://reviews.llvm.org/D48028

llvm-svn: 334927
2018-06-18 06:48:22 +00:00
Simon Pilgrim e930f569f7 [llvm-mca][X86] Add some avx512f/avx512vl resource test placeholders
There are a lot of instructions to add under these ISAs (and the other AVX512 variants) but this should demonstrate how to test for the EVEX instructions with different maskings

llvm-svn: 334907
2018-06-17 16:25:48 +00:00
Simon Pilgrim f5ecd8d50d [llvm-mca][x86] Add Generic cpu resource tests
Added a Generic x86 cpu set of resource tests to allow us to check all ISAs.

We currently use SandyBridge as our generic CPU model, but it's better to duplicate these tests in case we change the model. It also means we don't end up polluting the SandyBridge folder with tests for ISAs it doesn't support.

llvm-svn: 334853
2018-06-15 18:35:25 +00:00
Roman Lebedev 9ddf128f79 [MCA] Add -summary-view option
Summary:
While that is indeed a quite interesting summary stat,
there are cases where it does not really add anything
other than consuming extra lines.

Declutters the output of D48190.

Reviewers: RKSimon, andreadb, courbet, craig.topper

Reviewed By: andreadb

Subscribers: javed.absar, gbedwell, llvm-commits

Differential Revision: https://reviews.llvm.org/D48209

llvm-svn: 334833
2018-06-15 14:01:43 +00:00
Roman Lebedev 7c423001e4 [MCA][x86][NFC] Add tests for -register-file-stats, -scheduler-stats
Summary:
There does not seem to be any other tests for this.
Split off from D47676.

Reviewers: RKSimon, craig.topper, courbet, andreadb

Reviewed By: andreadb

Subscribers: javed.absar, gbedwell, llvm-commits

Differential Revision: https://reviews.llvm.org/D48190

llvm-svn: 334832
2018-06-15 14:01:35 +00:00
Andrea Di Biagio 4cafb297d5 [llvm-mca] Add tests for instructions that implicitly clear the upper portion of a super-register.
On x86-64, a write to register EAX implicitly clears the upper half of RAX.
128-bit AVX instructions clear the upper 128 bits of the YMM register that
aliases the XMM definition register.

llvm-mca doesn't know about register writes that implicitly clear the upper
portion of an aliasing super-register. This issue will be fixed in a future patch.

llvm-svn: 334742
2018-06-14 17:48:42 +00:00
Andrea Di Biagio 4729d1ff27 [llvm-mca] Add another test for partial register stalls.
This test checks that a physical register is correctly allocated for the partial
write to register BX.
The ADD instruction has to wait for the write to RBX (and BX) before being
executed.

llvm-svn: 334730
2018-06-14 15:54:34 +00:00
Andrea Di Biagio 0ffb2271a1 [llvm-mca] Fixed a bug in the logic that checks if a memory operation is ready to execute.
Fixes PR37790.

In some (very rare) cases, the LSUnit (Load/Store unit) was wrongly marking a
load (or store) as "ready to execute" effectively bypassing older memory barrier
instructions.

To reproduce this bug, the memory barrier must be the first instruction in the
input assembly sequence, and it doesn't have to perform any register writes.

llvm-svn: 334633
2018-06-13 18:30:14 +00:00
Clement Courbet 7db69cc08a [X86] Fix skylake server scheduling info.
Summary:
This fixes most of the scheduling info for SKX vector operations.
I had to split a lot of the YMM/ZMM classes into separate classes for YMM and ZMM.

The before/after llvm-exegesis analysis are in the phabricator diff.

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D47721

llvm-svn: 334407
2018-06-11 14:37:53 +00:00
Simon Pilgrim 89deac6694 [X86][BtVer2] Add support for all SUB/XOR 32/64 scalar instructions that should match the dependency-breaking 'zero-idiom'
As detailed on Agner's Microarchitecture doc (21.8 AMD Bobcat and Jaguar pipeline - Dependency-breaking instructions), these instructions are dependency breaking and fast-path zero the destination register (and appropriate EFLAGS bits).

llvm-svn: 334303
2018-06-08 17:00:45 +00:00
Simon Pilgrim efb4806bb9 [X86][BtVer2] Remove SBB tests that were accidentally added in rL334296
These aren't true zero-idiom instructions (just dependency breaking).

llvm-svn: 334297
2018-06-08 15:43:00 +00:00
Simon Pilgrim 53766a986d [X86][BtVer2] Add tests for scalar SUB/XOR instructions that should match the dependency-breaking 'zero-idiom'
As detailed on Agner's Microarchitecture doc (21.8 AMD Bobcat and Jaguar pipeline - Dependency-breaking instructions).

llvm-svn: 334296
2018-06-08 15:28:43 +00:00
Simon Pilgrim aafcf9e4a1 [X86][BtVer2] Limit zero idiom tests to a single iteration.
Reduces output size and we're only wanting to check that the instructions are fast-path'd (just Dispatch+Retire) anyhow

llvm-svn: 334292
2018-06-08 15:01:40 +00:00
Simon Pilgrim aef5bdbea1 [X86][BtVer2] Add support for all vector instructions that should match the dependency-breaking 'zero-idiom'
As detailed on Agner's Microarchitecture doc (21.8 AMD Bobcat and Jaguar pipeline - Dependency-breaking instructions), all these instructions are dependency breaking and zero the destination register.

llvm-svn: 334119
2018-06-06 19:06:09 +00:00
Simon Pilgrim 7a48bb6e44 [llvm-mca][x86] Fix all resources-x86_64.s tests to use different registers in reg-reg cases
I noticed while working on zero-idiom + dependency-breaking support (PR36671) that most of our binary instruction tests were reusing the same src registers, which would cause the tests to fail once we enable scalar zero-idiom support on btver2. Fixed in all targets to keep them in sync.

llvm-svn: 334110
2018-06-06 18:20:25 +00:00
Simon Pilgrim 64541ff297 [X86][BtVer2] Add tests for all vector instructions that should match the dependency-breaking 'zero-idiom'
As detailed on Agner's Microarchitecture doc (21.8 AMD Bobcat and Jaguar pipeline - Dependency-breaking instructions), all these instructions are dependency breaking and zero the destination register.

TODO: Scalar instructions still need to be tested (need to check EFLAGS handling).
llvm-svn: 334104
2018-06-06 16:14:37 +00:00
Sanjay Patel 59313be8d3 [CodeGen] assume max/default throughput for unspecified instructions
This is a fix for the problem arising in D47374 (PR37678):
https://bugs.llvm.org/show_bug.cgi?id=37678

We may not have throughput info because it's not specified in the model 
or it's not available with variant scheduling, so assume that those
instructions can execute/complete at max-issue-width.
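
A minimal sketch of the fallback described above, under stated assumptions: `reciprocal_throughput` is a hypothetical helper rather than an LLVM API, and the fallback is taken to be one instruction per issue slot, i.e. `1 / issue_width`.

```python
def reciprocal_throughput(issue_width, modeled_rthroughput=None):
    """Return cycles per instruction, falling back to max issue width.

    modeled_rthroughput is None when the scheduling model has no usable
    throughput information (unspecified, or a variant scheduling class).
    """
    if modeled_rthroughput is None:
        # Assumption: the instruction can issue in every slot, every cycle.
        return 1.0 / issue_width
    return modeled_rthroughput

fallback = reciprocal_throughput(issue_width=4)
modeled = reciprocal_throughput(issue_width=4, modeled_rthroughput=2.0)
```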

Differential Revision: https://reviews.llvm.org/D47723

llvm-svn: 334055
2018-06-05 23:34:45 +00:00
Andrea Di Biagio 757600bccb [llvm-mca] Correctly update the CyclesLeft of a register read in the presence of partial register updates.
This patch fixes the logic in ReadState::cycleEvent(). That method was not
correctly updating field `TotalCycles`.

Added extra code comments in class ReadState to better describe each field.

llvm-svn: 334028
2018-06-05 17:12:02 +00:00
Andrea Di Biagio 39e5a5695f [RFC][patch 3/3] Add support for variant scheduling classes in llvm-mca.
This patch is the last of a sequence of three patches related to LLVM-dev RFC
"MC support for variant scheduling classes".
http://lists.llvm.org/pipermail/llvm-dev/2018-May/123181.html

This fixes PR36672.

The main goal of this patch is to teach llvm-mca how to solve variant scheduling
classes.  This patch does that, plus it adds new variant scheduling classes to
the BtVer2 scheduling model to identify so-called zero-idioms (i.e.
dependency-breaking instructions that are known to generate zero, and that are
optimized out in hardware at the register renaming stage).

Without the BtVer2 change, this patch would not have had any meaningful tests.
This patch is effectively the union of two changes:
 1) a change that teaches llvm-mca how to resolve variant scheduling classes.
 2) a change to the BtVer2 scheduling model that allows us to special-case
    packed XOR zero-idioms (this partially fixes PR36671).

Differential Revision: https://reviews.llvm.org/D47374 

llvm-svn: 333909
2018-06-04 15:43:09 +00:00
Greg Bedwell bbe64af0a0 [llvm-mca] Regenerate a test to remove a double newline
Command used: py update_mca_test_checks.py ..\test\tools\llvm-mca\*\*.s ..\test\tools\llvm-mca\*\*\*.s

llvm-svn: 333893
2018-06-04 12:30:03 +00:00
Roman Lebedev 7b53d1454f [llvm-mca] Make sure not to end the test files with an empty line.
Summary:
It's super irritating.

[properly configured] git client then complains about that double-newline,
and you have to use `--force` to ignore the warning, since even if you
fix it manually, it will be reintroduced the very next runtime :/

Reviewers: RKSimon, andreadb, courbet, craig.topper, javed.absar, gbedwell

Reviewed By: gbedwell

Subscribers: javed.absar, tschuett, gbedwell, llvm-commits

Differential Revision: https://reviews.llvm.org/D47697

llvm-svn: 333887
2018-06-04 11:48:46 +00:00
Clement Courbet 2e41c5a79c [X86] Introduce WriteFLDC for x87 constant loads.
Summary:
{FLDL2E, FLDL2T, FLDLG2, FLDLN2, FLDPI} were using WriteMicrocoded.

 - I've measured the values for Broadwell, Haswell, SandyBridge, Skylake.
 - For ZnVer1 and Atom, values were transferred from InstRWs.
 - For SLM and BtVer2, I've guessed some values :(

Reviewers: RKSimon, craig.topper, andreadb

Subscribers: gbedwell, llvm-commits

Differential Revision: https://reviews.llvm.org/D47585

llvm-svn: 333656
2018-05-31 14:22:01 +00:00
Clement Courbet b78ab5097d [X86] Extract latency of fldz/fld1 in separate classes.
Summary:
 - I've measured the values for Broadwell, Haswell, SandyBridge, Skylake.
 - For ZnVer1 and Atom, values were transferred from `InstRW`s.
 - For SLM and BtVer2, values are from Agner.

This is split off from https://reviews.llvm.org/D47377

Reviewers: RKSimon, andreadb

Subscribers: gbedwell, llvm-commits

Differential Revision: https://reviews.llvm.org/D47523

llvm-svn: 333642
2018-05-31 11:41:27 +00:00
Clement Courbet 07c9ec6f2e [X86][Sched] Add InstRW for CLC on Intel after SNB.
Summary:
After SNB, Intel CPUs can rename CF independently of other EFLAGS,
so the renamer can zero it for free. Note that STC still consumes resources.

To reproduce: `$ llvm-exegesis -mode=uops -opcode-name=CLC`

On SNB:
```
---
key:
  opcode_name:     CLC
  mode:            uops
  config:          ''
cpu_name:        sandybridge
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: '3', value: 0.0014, debug_string: SBPort0 }
  - { key: '4', value: 0.0013, debug_string: SBPort1 }
  - { key: '5', value: 0.0003, debug_string: SBPort4 }
  - { key: '6', value: 0.0029, debug_string: SBPort5 }
  - { key: '10', value: 0.0003, debug_string: SBPort23 }
error:           ''
info:            'instruction is serial, repeating a random one.
Snippet:
CLC
'
...
```

On HSW:
```
---
key:
  opcode_name:     CLC
  mode:            uops
  config:          ''
cpu_name:        haswell
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: '3', value: 0.001, debug_string: HWPort0 }
  - { key: '4', value: 0.0009, debug_string: HWPort1 }
  - { key: '5', value: 0.0004, debug_string: HWPort2 }
  - { key: '6', value: 0.0006, debug_string: HWPort3 }
  - { key: '7', value: 0.0002, debug_string: HWPort4 }
  - { key: '8', value: 0.0012, debug_string: HWPort5 }
  - { key: '9', value: 0.0022, debug_string: HWPort6 }
  - { key: '10', value: 0.0001, debug_string: HWPort7 }
error:           ''
info:            'instruction is serial, repeating a random one.
Snippet:
CLC
'
...

```

Reviewers: craig.topper, RKSimon

Subscribers: gchatelet, llvm-commits

Differential Revision: https://reviews.llvm.org/D47362

llvm-svn: 333392
2018-05-29 06:19:39 +00:00
Simon Pilgrim 0155bf0da9 [X86][SNB] Fix differences between vex/non-vex XMM vector moves (PR37286)
As confirmed by llvm-exegesis, there is no scheduler difference between MOVDQA/MOVDQU and VMOVDQA/VMOVDQU xmm reg-reg moves

Another chapter in the never ending crusade to remove useless InstRW overrides from the x86 scheduler models......

llvm-svn: 333271
2018-05-25 12:18:11 +00:00
Greg Bedwell e790f6fb06 [UpdateTestChecks] Improved update_mca_test_checks block analysis
Previously update_mca_test_checks worked entirely at "block" level where
a block is some sequence of lines delimited by at least one empty line.
This generally worked well, but could sometimes lead to excessive
repetition of check lines for various prefixes if some block was almost
identical between prefixes, but not quite (for example, due to a
different dispatch width in the otherwise identical summary views).

This new analysis attempts to split blocks further in the case where the
following conditions are met:
  a) There is some prefix common to every RUN line (typically 'ALL').
  b) The first line of the block is common to the output with every prefix.
  c) The block has the same number of lines for the output with every prefix.
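
The three conditions can be sketched as a single predicate. This is an illustration only and not the code in update_mca_test_checks.py: `split_into_blocks` and `can_split_further` are hypothetical names, and each block is assumed to be non-empty.

```python
def split_into_blocks(output):
    """Split tool output into blocks: runs of lines separated by empty lines."""
    blocks, current = [], []
    for line in output.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks

def can_split_further(run_prefixes, block_per_run):
    """Check conditions a), b) and c) for one block across all RUN lines.

    run_prefixes:  one collection of check prefixes per RUN line.
    block_per_run: the same block (a list of lines) as produced by each RUN line.
    """
    common = set.intersection(*(set(p) for p in run_prefixes))  # a)
    same_first_line = len({block[0] for block in block_per_run}) == 1  # b)
    same_length = len({len(block) for block in block_per_run}) == 1  # c)
    return bool(common) and same_first_line and same_length

prefixes = [["ALL", "BTVER2"], ["ALL", "HASWELL"]]
summaries = [
    ["Iterations: 100", "Dispatch Width: 2"],
    ["Iterations: 100", "Dispatch Width: 4"],
]
splittable = can_split_further(prefixes, summaries)
```

Here the two summary blocks differ only in the dispatch width, so the shared first line could be checked once under the common prefix and only the differing line repeated per prefix.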

Also, regenerated all llvm-mca test files with the following command:
update_mca_test_checks.py "../test/tools/llvm-mca/*/*.s" "../test/tools/llvm-mca/*/*/*.s"

The new analysis showed a "multiple lines not disambiguated by prefixes" warning
for test "AArch64/Exynos/scheduler-queue-usage.s" so I've also added some
explicit prefixes to each of the RUN lines in that test.

Differential Revision: https://reviews.llvm.org/D47321

llvm-svn: 333204
2018-05-24 16:36:44 +00:00
Andrea Di Biagio 3fc20c9c7f [llvm-mca] Print the "Block RThroughput" in the SummaryView.
This patch implements the "block reciprocal throughput" computation in the
SummaryView.

The block reciprocal throughput is computed as the MAX of:
  - NumMicroOps / DispatchWidth
  - Resource Cycles / #Units   (for every resource consumed).

The block throughput is bounded from above by the hardware dispatch throughput.
That is because the DispatchWidth is an upper bound on how many opcodes can be part
of a single dispatch group.

The block throughput is also limited by the amount of hardware parallelism. The
number of available resource units affects how the resource pressure is
distributed, and also how many blocks can be delivered every cycle.
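
As a worked illustration of the formula above, the following sketch takes the maximum over the dispatch bound and every per-resource bound. The function name and the example numbers are hypothetical and are not taken from llvm-mca.

```python
from fractions import Fraction

def block_rthroughput(num_micro_ops, dispatch_width, resource_usage):
    """Return the block reciprocal throughput in cycles per iteration.

    resource_usage: iterable of (resource_cycles, num_units) pairs,
    one pair for every resource consumed by the block.
    """
    # Bound imposed by the dispatch width.
    bound = Fraction(num_micro_ops, dispatch_width)
    # Bound imposed by each consumed resource.
    for cycles, units in resource_usage:
        bound = max(bound, Fraction(cycles, units))
    return float(bound)

# Hypothetical block: 4 uOps on a 2-wide machine, 3 cycles on a
# single-unit resource and 2 cycles on a two-unit resource.
example = block_rthroughput(4, 2, [(3, 1), (2, 2)])
```

In this example the single-unit resource is the bottleneck: the dispatch bound is 2 cycles, but the resource needs 3 cycles per iteration.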

llvm-svn: 333095
2018-05-23 15:59:27 +00:00
Andrea Di Biagio cb1ed400a4 [llvm-mca] Removed an empty line generated by the timeline view. NFC.
Also, regenerate all tests.

llvm-svn: 332853
2018-05-21 17:11:56 +00:00
Andrea Di Biagio b5757abefb [X86][BtVer2] Add a 'J' prefix to the PRF/RCU defs. NFC
This is to keep the Jaguar model's naming convention. Processor resources all
have a 'J' prefix in the BtVer2 scheduling model.

llvm-svn: 332851
2018-05-21 16:30:26 +00:00
Simon Pilgrim 1273f4ad93 [X86] Add GPR<->XMM Schedule Tags
BtVer2 - fix NumMicroOp and account for the Lat+6cy GPR->XMM and Lat+1cy XMM->GPR delays (see rL332737)

The high number of MOVD/MOVQ equivalent instructions meant that there were a number of missed patterns in SNB/Znver1:
SNB - add missing GPR<->MMX costs (taken from Agner / Intel AOM)
Znver1 - add missing GPR<->XMM MOVQ costs (taken from Agner)

llvm-svn: 332745
2018-05-18 17:58:36 +00:00
Simon Pilgrim 007b50fd35 [X86][BtVer2] Improve simulation of (V)PINSR values
Include the 6cy delay transferring from the GPR to FPU.

llvm-svn: 332737
2018-05-18 17:09:41 +00:00
Simon Pilgrim 3ecb0b80f6 [X86][BtVer2] Partial vector stores (inc MMX) have a 2cy latency
llvm-svn: 332722
2018-05-18 14:22:22 +00:00
Simon Pilgrim c4b8d367a8 [X86][SSE] Ensure vector partial load/stores use the WriteVecLoad/WriteVecStore scheduler classes
Retag some instructions that were missed when we split off vector load/store/moves - MOVQ/MOVD etc.

Fixes BtVer2/SLM which have different behaviours for GPR stores.

llvm-svn: 332718
2018-05-18 14:08:01 +00:00
Simon Pilgrim d749b321b2 [X86][SSE] Ensure float load/stores use the WriteFLoad/WriteFStore scheduler classes
Retag some instructions that were missed when we split off vector load/store/moves - MOVSS/MOVSD/MOVHPS/MOVHPD/MOVLPD/MOVLPS etc.

Fixes BtVer2/SLM which have different behaviours for GPR stores.

llvm-svn: 332714
2018-05-18 13:13:59 +00:00
Simon Pilgrim e389ea0e3e [llvm-mca][X86] Add CMOV test files
llvm-svn: 332622
2018-05-17 16:29:12 +00:00
Simon Pilgrim b5741f5c3d [X86][BtVer2] ADC/SBB take 2cy on an ALU pipe, not 1cy like ADD/SUB
llvm-svn: 332616
2018-05-17 15:43:23 +00:00
Andrea Di Biagio 650b5fc6cb [llvm-mca] add flag -all-views and flag -all-stats.
Flag -all-views enables all the views.
Flag -all-stats enables all the views that print hardware statistics.

llvm-svn: 332602
2018-05-17 12:27:03 +00:00
Simon Pilgrim b4fd145fc3 [llvm-mca][X86] Add ADX test files
llvm-svn: 332595
2018-05-17 11:32:38 +00:00
Simon Pilgrim d5d77dcb46 [X86] Fix typo in instregex for CVTSI642SDrr
llvm-svn: 332510
2018-05-16 18:31:17 +00:00
Andrea Di Biagio 45ccdd1785 [llvm-mca] Regenerate tests after r332381 and r332361. NFC
llvm-svn: 332447
2018-05-16 10:12:06 +00:00
Simon Pilgrim be9a206883 [X86] Split WriteCvtF2F into F32->F64 and F64->F32 scheduler classes
BtVer2 - Fixes schedules for (V)CVTPS2PD instructions

A lot of the Intel models still have too many InstRW overrides for these new classes - this needs cleaning up but I wanted to get the classes in first

llvm-svn: 332376
2018-05-15 17:36:49 +00:00
Simon Pilgrim 891ebcdbaa [X86] Split off F16C WriteCvtPH2PS/WriteCvtPS2PH scheduler classes
Btver2 - VCVTPH2PSYrm needs to double pump the AGU
Broadwell - missing VCVTPS2PH*mr stores extra latency

Allows us to remove the WriteCvtF2FSt conversion store class

llvm-svn: 332357
2018-05-15 14:12:32 +00:00
Simon Pilgrim 2aa395abcf [llvm-mca][x86] Add F16C instruction tests
llvm-svn: 332347
2018-05-15 12:50:06 +00:00
Simon Pilgrim 5bd5e2fd3e [llvm-mca][X86] Add missing SSE4A test file
llvm-svn: 332270
2018-05-14 18:20:40 +00:00
Simon Pilgrim 228d24a2d6 [X86][BtVer2] Fix MMX/YMM integer vector nt store schedules
MMX was missing and YMM was tagged as a fp nt store

llvm-svn: 332269
2018-05-14 18:07:28 +00:00
Simon Pilgrim 4135de2e93 [llvm-mca][x86] Add scalar nt-store instruction tests
llvm-svn: 332262
2018-05-14 17:10:33 +00:00
Simon Pilgrim 7340d88740 [llvm-mca][x86] Add and/not/or/xor instruction tests
llvm-svn: 332257
2018-05-14 16:26:24 +00:00
Simon Pilgrim 661ae7778d [X86][BtVer2] Model ymm move as double pumped instructions
We still need to handle mmx/xmm moves as 'decode-only' no-pipe instructions

llvm-svn: 332109
2018-05-11 17:38:36 +00:00
Simon Pilgrim 706403bab8 [X86][MMX] Tag MMX Move/Load/Store as WriteVec schedule classes
Fixes an issue on SLM/Btver2 where instructions were being treated as scalar loads/stores

llvm-svn: 332104
2018-05-11 16:38:59 +00:00
Simon Pilgrim 032a01f74a [X86][SLM] Vector stores only use the MEC port.
Confirmed by both Agner and Intel's AOM - the IEC/FPC are not required for pure load/stores (even if it's a partial update).

Can't fix WriteStore until all RMW instructions are cleaned up though....

llvm-svn: 332096
2018-05-11 15:16:15 +00:00
Simon Pilgrim 22dd72b995 [X86] Split WriteF/WriteVec Move/Load/Store scheduler classes by vector width
Fixes a SNB issue that was missing vlddqu/vmovntdqa ymm instructions

llvm-svn: 332094
2018-05-11 14:30:54 +00:00
Simon Pilgrim 37fbb7f173 [X86][SNB] Fix typo in PEXTRDmr instregex, was missing VPEXTRDmr.
llvm-svn: 332002
2018-05-10 17:30:49 +00:00
Simon Pilgrim ab34aa8294 [X86] Cleanup WriteFStore/WriteVecStore schedules
MOVNTPD/MOVNTPS should be WriteFStore

Standardized BDW/HSW/SKL/SKX WriteFStore/WriteVecStore - fixes some missed instregex patterns. (V)MASKMOVDQU was already using the default; its cost is increased but is still nowhere near the real cost of that nasty instruction....

llvm-svn: 331864
2018-05-09 11:01:16 +00:00
Simon Pilgrim 2864b46469 [X86] Split off WriteIMul64 from WriteIMul schedule class (PR36931)
This fixes a couple of BtVer2 missing instructions that weren't being handled in the override.

NOTE: There are still a lot of overrides that still need cleaning up!
llvm-svn: 331770
2018-05-08 14:55:16 +00:00
Simon Pilgrim 739d1a68aa [llvm][x86] SandyBridge/IvyBridge don't support BMI1/BMI2
llvm-svn: 331769
2018-05-08 14:20:25 +00:00
Simon Pilgrim 2580554333 [X86] Split WriteIDiv into div/idiv 8/16/32/64 implementations (PR36930)
I've created the necessary classes but there are still a lot of overrides that need cleaning up.

NOTE: The Znver1 model was missing some div/idiv variants in the instregex patterns and wasn't setting the resource cycles at all in the overrides.
llvm-svn: 331767
2018-05-08 13:51:45 +00:00
Simon Pilgrim 4283924e08 [llvm-mca][x86] Add div/idiv, mul/imul and inc/dec/neg/nop instruction tests
llvm-svn: 331765
2018-05-08 13:30:58 +00:00
Simon Pilgrim 061096d2c2 [llvm-mca][x86] Remove addsubpd from SSE2 tests
llvm-svn: 331678
2018-05-07 21:10:48 +00:00
Simon Pilgrim 1233e1234a [X86] Split WriteFAdd/WriteFCmp/WriteFMul schedule classes
Split to support single/double for scalar, XMM and YMM/ZMM instructions - removing InstrRW overrides for these instructions.

Fixes Atom ADDSUBPD instruction and reclassifies VFPCLASS as WriteFCmp which is closer in behaviour.

llvm-svn: 331672
2018-05-07 20:52:53 +00:00
Simon Pilgrim e480ed0b9f [X86][AVX2] Tag VPMOVSX/VPMOVZX ymm instructions as WriteShuffle256
These are more like cross-lane shuffles than regular shuffles - we already do this for AVX512 equivalents.

Differential Revision: https://reviews.llvm.org/D46229

llvm-svn: 331659
2018-05-07 18:25:19 +00:00
Simon Pilgrim 763bf12085 [X86][Znver1] Remove WriteFMul/WriteFRcp InstRW overrides/aliases.
Fixes x87 schedules to more closely match Agner - AMD doesn't tend to "special case" x87 instructions as much as Intel.

llvm-svn: 331645
2018-05-07 16:34:26 +00:00
Simon Pilgrim ac5d0a31ef [X86] Split WriteFDiv schedule classes to support single/double scalar, XMM and YMM/ZMM instructions.
This removes all InstrRW overrides for these instructions - some x87 overrides remain but most use default (and realistic) values.

llvm-svn: 331643
2018-05-07 16:15:46 +00:00
Simon Pilgrim f3ae50fca2 [X86] Split WriteFRcp/WriteFRsqrt/WriteFSqrt schedule classes
WriteFRcp/WriteFRsqrt are split to support scalar, XMM and YMM/ZMM instructions.

WriteFSqrt is split into single/double/long-double sizes and scalar, XMM, YMM and ZMM instructions.

This removes all InstrRW overrides for these instructions.

NOTE: There were a couple of typos in the Znver1 model - notably a 1cy throughput for SQRT that is highly unlikely and doesn't tally with Agner.

NOTE: I had to add Agner's numbers for several targets for WriteFSqrt80.
llvm-svn: 331629
2018-05-07 11:50:44 +00:00
Simon Pilgrim 0e51a125ea [X86] Add WriteEMMS scheduler class
Filled in the missing values from Btver2 SoG or Agner

llvm-svn: 331546
2018-05-04 18:16:13 +00:00
Simon Pilgrim be51b20127 [X86] Add SchedWriteFRnd fp rounding scheduler classes
Split off from SchedWriteFAdd for fp rounding/bit-manipulation instructions.

Fixes an issue on btver2 which only had the ymm version using the JSTC pipe instead of JFPA.

llvm-svn: 331515
2018-05-04 12:59:24 +00:00
Simon Pilgrim 0aed731516 [X86][Znver1] Use SchedAlias to tag microcoded scheduler classes
Avoids extra entries in the class tables.

Found a typo that missed the MMX_PHSUBSW instruction.

llvm-svn: 331488
2018-05-03 22:12:23 +00:00
Simon Pilgrim 350c22c587 [X86][SNB] Fix scheduling of MMX integer multiply instructions.
The entries were being bound to the wrong class.

llvm-svn: 331388
2018-05-02 19:26:14 +00:00
Clement Courbet a1a3095d88 [X86] Fix scheduling info for (V?)SQRTPDm on silvermont.
https://reviews.llvm.org/D46356

llvm-svn: 331356
2018-05-02 13:46:14 +00:00
Andrea Di Biagio e047d3529b [llvm-mca] Correctly handle zero-latency stores that consume pipeline resources.
This fixes PR37293.

We can have scheduling classes with no write latency entries, that still consume
processor resources. We don't want to treat those instructions as zero-latency
instructions; they still have to be issued to the underlying pipelines, so they
still consume resource cycles.

This is likely to be a regression which I have accidentally introduced at
revision 330807. Now, if an instruction has a non-empty set of write processor
resources, we conservatively treat it as a normal (i.e. non zero-latency)
instruction.
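
The conservative rule can be sketched as follows. This is an assumption-laden illustration, not the llvm-mca implementation: `is_zero_latency` and the resource name `StorePort` are hypothetical.

```python
def is_zero_latency(write_latencies, write_resource_cycles):
    """Decide whether an instruction may bypass the pipelines entirely.

    write_latencies:       latencies of the instruction's register writes
                           (empty for a store with no write latency entries).
    write_resource_cycles: mapping of consumed resource -> cycles.
    """
    max_latency = max(write_latencies, default=0)
    # A non-empty resource set means the instruction must still be issued,
    # so it is treated as a normal instruction even if max_latency is 0.
    return max_latency == 0 and not write_resource_cycles

pure_zero_latency = is_zero_latency([], {})
store_using_port = is_zero_latency([], {"StorePort": 1})
```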

llvm-svn: 331193
2018-04-30 15:55:04 +00:00
Andrea Di Biagio 77bd1c748a [llvm-mca] Regenerate test Atom/resources-sse3.s. NFC
Before this change, it wrongly specified -mcpu=slm instead of -mcpu=atom.

llvm-svn: 331170
2018-04-30 12:13:04 +00:00
Andrea Di Biagio e9384eb13b [llvm-mca] Support for in-order CPU for -instruction-tables testing.
Added Intel Atom tests to verify that the tool correctly generates instruction
tables even if the CPU is in-order.

Fixes PR37282.

llvm-svn: 331169
2018-04-30 12:05:34 +00:00
Simon Pilgrim 8962c344f9 [llvm-mca][X86] Add BT resource tests to all models
llvm-svn: 331144
2018-04-29 15:45:31 +00:00
Simon Pilgrim 2d569361fc [llvm-mca][X86] Add add/adc + sub/sbb resource tests to all models
llvm-svn: 331140
2018-04-29 11:03:25 +00:00
Simon Pilgrim 318e9d39ab [llvm-mca][X86] Add double shift resource tests to all relevant models
llvm-svn: 331109
2018-04-28 15:18:49 +00:00
Simon Pilgrim 4d0187c893 [llvm-mca][X86] Add shift/rotate resource tests to all relevant models
I intend to add further instruction tests to the resources-x86_64.s test file as required, but this initial commit is to help remove a load of unnecessary InstRW overrides in a future patch

llvm-svn: 331108
2018-04-28 14:56:18 +00:00
Simon Pilgrim 7574ffd7bc [llvm-mca][X86] Updated fma3 tests after rL330820
llvm-svn: 330822
2018-04-25 13:19:04 +00:00
Andrea Di Biagio 93c49d5e58 [llvm-mca] Default to the native host cpu if flag -mcpu is not specified.
llvm-svn: 330809
2018-04-25 10:18:25 +00:00
Simon Pilgrim 27bc83e228 [X86] Split off PHMINPOSUW to their own schedule class
This also fixes Jaguar's schedule which was treating it as the WriteVecIMul default. 

llvm-svn: 330756
2018-04-24 18:49:25 +00:00
Simon Pilgrim f0945aa0e0 [X86][F16C] Add WriteCvtF2FSt scheduling class
Fixes the classification of VCVTPS2PHmr/VCVTPS2PHYmr which were tagged as WriteCvtF2FLd_WriteRMW (PR36887)

llvm-svn: 330737
2018-04-24 16:43:07 +00:00
Simon Pilgrim 828ef9e013 [X86][BtVer2] Fix VCVTPS2PHmr/VCVTPS2PHYmr latencies
These are stores, not loads, so don't need to account for load latency.

llvm-svn: 330735
2018-04-24 16:26:51 +00:00
Simon Pilgrim f35b8ac196 [X86][IVB] Add F16C resource tests.
Note this is IvyBridge (which shares the model) NOT SandyBridge.

llvm-svn: 330734
2018-04-24 16:22:59 +00:00
Andrea Di Biagio 0626864fa4 [llvm-mca] Default the output asm dialect used by the instruction printer to the input asm dialect.
The instruction printer used by llvm-mca to generate the performance report now
defaults the output assembly format to the format used for the input assembly
file.

On x86, the asm format can be either AT&T or Intel, depending on the
presence/absence of directive `.intel_syntax`.

Users can still specify a different assembly dialect with the command line flag
-output-asm-variant=<uint>.

llvm-svn: 330733
2018-04-24 16:19:08 +00:00
Simon Pilgrim 16299273d0 [X86] Remove unnecessary FMA reg-mem InstRW scheduler overrides.
llvm-svn: 330720
2018-04-24 14:47:11 +00:00
Simon Pilgrim f7d2a93d5f [X86] Add vector element insertion/extraction scheduler classes
Split off pinsr/pextr and extractps instructions.

(Mostly) fixes PR36887.

Note: It might be worth adding a WriteFInsertLd class as well in the future.

Differential Revision: https://reviews.llvm.org/D45929

llvm-svn: 330714
2018-04-24 13:21:41 +00:00
Simon Pilgrim 87ba905fe9 [llvm-mca][X86] Add BMI/LZCNT/POPCNT resource tests to all relevant models
The SandyBridge BMI tests are actually run on IvyBridge as that's the lowest CPU that actually supports the ISAs (but still uses the SandyBridge model).

llvm-svn: 330556
2018-04-22 20:42:24 +00:00
Simon Pilgrim 96855ec39e [X86] Remove unnecessary WriteFVarBlend/WriteVarBlend InstRW overrides.
This also fixes some of the ReadAfterLd issues due to InstRW.

llvm-svn: 330544
2018-04-22 14:43:12 +00:00
Simon Pilgrim 5e9f1da0cd [llvm-mca][X86] Add POPCNT resource test
llvm-svn: 330540
2018-04-22 09:58:00 +00:00
Simon Pilgrim e25aa02bc4 [llvm-mca][X86] Add AVX2 resource tests
llvm-svn: 330512
2018-04-21 16:12:42 +00:00
Simon Pilgrim d73bd154d9 [llvm-mca][X86] Add SSE resource tests to all models
llvm-svn: 330506
2018-04-21 14:16:57 +00:00
Simon Pilgrim 26178d4336 [llvm-mca][X86] Add MMX resource tests
llvm-svn: 330502
2018-04-21 11:28:59 +00:00
Simon Pilgrim 1264066cd7 [llvm-mca][X86] Add X87 resource tests
llvm-svn: 330499
2018-04-21 10:36:19 +00:00
Simon Pilgrim 1803bfb75f [llvm-mca][X86] Add MMX/SSE/AES/CLMUL resource SandyBridge tests
llvm-svn: 330486
2018-04-20 22:04:11 +00:00
Simon Pilgrim 0a6bfb1843 [llvm-mca][X86] Add prefetch instruction resource tests
llvm-svn: 330371
2018-04-19 22:11:58 +00:00
Simon Pilgrim 7209117868 [llvm-mca][FMA] Add FMA resource tests
llvm-svn: 330366
2018-04-19 21:32:22 +00:00
Simon Pilgrim 4a486c13fa [llvm-mca][X86] Add resource test for every out-of-order scheduler model
I've copied and regenerated a resource file from btver2 to every x86 scheduler model supported by llvm-mca so we have at least some basic coverage.

For most this has been the avx1 tests, but for silvermont I've used sse42 as that's the latest it supports.

More will be added later.

llvm-svn: 330352
2018-04-19 18:08:10 +00:00
Simon Pilgrim f209321d61 [llvm-mca][X86] Add mmx instruction to btver2 resource tests
Useful to see scheduler class deltas against xmm equivalents

llvm-svn: 330335
2018-04-19 15:09:46 +00:00
Simon Pilgrim c310bfa193 [llvm-mca][X86] Add mmx versions of SSSE3 instructions
Move PABS instructions incorrectly tested under SSE2 

llvm-svn: 330295
2018-04-18 20:47:48 +00:00
Greg Bedwell 90d141a295 [UpdateTestChecks] Add update_mca_test_checks.py script
This script can be used to regenerate tests in the
test/tools/llvm-mca directory (PR36904).

Regenerated a number of tests using the pattern: test/tools/llvm-mca/*/*/*.s

Differential Revision: https://reviews.llvm.org/D45369

llvm-svn: 330246
2018-04-18 10:27:45 +00:00
Craig Topper e56a2fc5e7 [X86] Add separate scheduling class for PSADBW instruction.
llvm-svn: 330204
2018-04-17 19:35:19 +00:00
Andrea Di Biagio c752616f30 [llvm-mca] Ensure that instructions with a schedule read-advance are always issued in the right order.
Normally, the Scheduler prioritizes older instructions over younger instructions
during the instruction issue stage. In one particular case where a dependent
instruction had a schedule read-advance associated to one of the input operands,
this rule was not correctly applied.

This patch fixes the issue and adds a test to verify that we don't regress that
particular case.

llvm-svn: 330032
2018-04-13 15:19:07 +00:00
Andrea Di Biagio f41ad5c59e [llvm-mca] Renamed BackendStatistics to RetireControlUnitStatistics.
Also, removed flag -verbose in favor of flag -retire-stats.

llvm-svn: 329794
2018-04-11 12:12:53 +00:00
Andrea Di Biagio 1cc29c045e [llvm-mca] Move the logic that prints scheduler statistics from BackendStatistics to its own view.
Added flag -scheduler-stats to print scheduler related statistics.

llvm-svn: 329792
2018-04-11 11:37:46 +00:00
Andrea Di Biagio 821f650bba [llvm-mca] Move the logic that prints dispatch unit statistics from BackendStatistics to its own view.
This patch moves the logic that collects and analyzes dispatch events to the
DispatchStatistics view.

Added flag -dispatch-stats to print statistics related to the dispatch logic.

llvm-svn: 329708
2018-04-10 14:55:14 +00:00
Andrea Di Biagio 074cef3dfb [llvm-mca] Increase the default number of iterations to 100.
llvm-svn: 329694
2018-04-10 12:50:03 +00:00
Andrea Di Biagio c9f409eb6f Reapply "[llvm-mca] Do not separate iterations with a newline in the timeline view."
This reapplies r329403 with a fix for the floating point rounding issue.

llvm-svn: 329680
2018-04-10 09:55:33 +00:00
Andrea Di Biagio c65901282b [llvm-mca] Add the ability to mark regions of code for analysis (PR36875)
This patch teaches llvm-mca how to parse code comments in search for special
"markers" used to select regions of code.

Example:

# LLVM-MCA-BEGIN My Code Region
  ....
# LLVM-MCA-END

The MCAsmLexer now delegates to an object of class MCACommentParser (i.e. an
AsmCommentConsumer) the parsing of code comments to search for begin/end code
region markers.

A comment starting with substring "LLVM-MCA-BEGIN" marks the beginning of a new
region of code.  A comment starting with substring "LLVM-MCA-END" marks the end
of the last region.

This implementation doesn't allow regions to overlap. Each region can have an
optional description; internally, each region is identified by a range of source
code locations (SMLoc).

MCInst objects are added to a region R only if the source location for the
MCInst is in the range of locations specified by R.

By default, the tool allocates an implicit "Default" code region which contains
every source location.  See new tests llvm-mca-marker-*.s for a few examples.

A new Backend object is created for every region. So, the analysis is conducted
on every parsed code region.  The final report is the union of the reports
generated for every code region.  Note that empty regions are skipped.

Special "[#] Code Region - ..." strings are used in the report to mark the
portion which is specific to a code region only. For example, see
llvm-mca-markers-5.s.
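
The marker rules above can be illustrated with a small sketch. It is a simplified text-based stand-in, not the MCACommentParser from the patch: it works on lines instead of SMLoc ranges, and it assumes that markers sit on their own comment lines, that an unterminated region runs to the end of the input, and that instructions outside any region are dropped once a marker has been seen. The function name is hypothetical.

```python
# Hypothetical sketch of region selection driven by comment markers.
BEGIN = "LLVM-MCA-BEGIN"
END = "LLVM-MCA-END"


def parse_regions(asm_text, comment_char="#"):
    """Return a list of (description, instruction_lines), skipping empty regions."""
    regions = []
    default_insts = []   # used only when no marker appears at all
    current = None       # (description, instructions) of the open region
    saw_marker = False

    for raw in asm_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(comment_char):
            comment = line[len(comment_char):].strip()
            if comment.startswith(BEGIN):
                if current is not None:
                    raise ValueError("regions are not allowed to overlap")
                current = (comment[len(BEGIN):].strip(), [])
                saw_marker = True
            elif comment.startswith(END):
                if current is None:
                    raise ValueError("found an END marker without a BEGIN")
                regions.append(current)
                current = None
            continue  # ordinary comments are ignored
        if current is not None:
            current[1].append(line)
        else:
            default_insts.append(line)

    if current is not None:
        regions.append(current)  # assumption: open region extends to end of input
    if not saw_marker:
        regions = [("Default", default_insts)]  # implicit region covering everything
    return [(desc, insts) for desc, insts in regions if insts]
```

Each returned entry corresponds to one analysis run in the description above; a region with no instructions does not appear in the result.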

Differential Revision: https://reviews.llvm.org/D45433

llvm-svn: 329590
2018-04-09 16:39:52 +00:00
Hans Wennborg 6400c03e6a Revert r329403 "[llvm-mca] Do not separate iterations with a newline in the timeline view."
This made AArch64/CortexA57/direct-branch.s fail on Windows, e.g.
http://lab.llvm.org:8011/builders/clang-x86-windows-msvc2015/builds/11251

> Also, update a few tests to minimize the diff in D45369.
> No functional change intended.

llvm-svn: 329569
2018-04-09 13:53:41 +00:00
Simon Pilgrim 86588fc809 [X86][Btver2] Add vector extract costs
llvm-svn: 329524
2018-04-08 11:26:26 +00:00
Andrea Di Biagio 85b8138bc6 [llvm-mca] Do not separate iterations with a newline in the timeline view.
Also, update a few tests to minimize the diff in D45369.
No functional change intended.

llvm-svn: 329403
2018-04-06 15:30:02 +00:00
Andrea Di Biagio c74ad502ce [MC][Tablegen] Allow models to describe the retire control unit for llvm-mca.
This patch adds the ability to describe properties of the hardware retire
control unit.

Tablegen class RetireControlUnit has been added for this purpose (see
TargetSchedule.td).

A RetireControlUnit specifies the size of the reorder buffer, as well as the
maximum number of opcodes that can be retired every cycle.

A zero (or negative) value for the reorder buffer size means: "the size is
unknown". If the size is unknown, then llvm-mca defaults it to the value of
field SchedMachineModel::MicroOpBufferSize.  A zero or negative number of
opcodes retired per cycle means: "there is no restriction on the number of
instructions that can be retired every cycle".
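
As a rough illustration of the defaulting rules above (a sketch only; the function and parameter names are hypothetical and do not mirror the actual llvm-mca code):

```python
# Hypothetical sketch of how the two RetireControlUnit values might be resolved.
def resolve_retire_control_unit(reorder_buffer_size, max_retire_per_cycle,
                                micro_op_buffer_size):
    """Return (effective_rob_size, retire_limit); retire_limit None means unlimited."""
    # Zero or negative size means "unknown": fall back to MicroOpBufferSize.
    if reorder_buffer_size > 0:
        rob_size = reorder_buffer_size
    else:
        rob_size = micro_op_buffer_size
    # Zero or negative retire width means "no restriction".
    retire_limit = max_retire_per_cycle if max_retire_per_cycle > 0 else None
    return rob_size, retire_limit
```

For example, `resolve_retire_control_unit(0, 0, 64)` falls back to a 64-entry buffer with no per-cycle retire limit.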

Models can optionally specify an instance of RetireControlUnit. There can be at
most one RetireControlUnit definition per scheduling model.

Information related to the RCU (RetireControlUnit) is stored in (two new fields
of) MCExtraProcessorInfo.  llvm-mca loads that information when it initializes
the DispatchUnit / RetireControlUnit (see Dispatch.h/Dispatch.cpp).

This patch fixes PR36661.

Differential Revision: https://reviews.llvm.org/D45259

llvm-svn: 329304
2018-04-05 15:41:41 +00:00
Simon Pilgrim 8139a88cb6 [X86][Btver2] Strip unnecessary check prefixes from resources tests
llvm-svn: 329192
2018-04-04 13:25:45 +00:00
Andrea Di Biagio 8dabf4f145 [llvm-mca] Move the logic that prints register file statistics to its own view. NFCI
Before this patch, the "BackendStatistics" view was responsible for printing the
register file usage (as well as many other statistics).

Now users can enable register file usage statistics using the command line flag
`-register-file-stats`. By default, the tool doesn't print register file
statistics.

llvm-svn: 329083
2018-04-03 16:46:23 +00:00
Andrea Di Biagio 9da4d6db33 [MC][Tablegen] Allow the definition of processor register files in the scheduling model for llvm-mca
This patch allows the description of register files in processor scheduling
models. This addresses PR36662.

A new tablegen class named 'RegisterFile' has been added to TargetSchedule.td.
Targets can optionally describe register files for their processors using that
class. In particular, class RegisterFile allows specifying:
 - The total number of physical registers.
 - Which target registers are accessible through the register file.
 - The cost of allocating a register at register renaming stage.

Example (from this patch - see file X86/X86ScheduleBtVer2.td)

  def FpuPRF : RegisterFile<72, [VR64, VR128, VR256], [1, 1, 2]>

Here, FpuPRF describes a register file for MMX/XMM/YMM registers. On Jaguar
(btver2), a YMM register definition consumes 2 physical registers, while MMX/XMM
register definitions only cost 1 physical register.

The syntax allows an empty set of register classes to be specified.  An empty set of
register classes means: this register file models all the registers specified by
the Target.  For each register class, users can specify an optional register
cost. By default, register costs default to 1.  A value of 0 for the number of
physical registers means: "this register file has an unbounded number of
physical registers".
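
The cost semantics above can be sketched as follows. This is an illustrative model, not the llvm-mca RegisterFile class: the class and method names are hypothetical, and register classes are represented by plain strings.

```python
# Hypothetical sketch of cost-aware physical register allocation.
class RegisterFileSketch:
    def __init__(self, num_phys_regs, costs):
        self.num_phys_regs = num_phys_regs  # 0 means an unbounded register file
        self.costs = dict(costs)            # register class name -> cost
        self.used = 0

    def cost_of(self, reg_class):
        # Register costs default to 1 when none is specified.
        return self.costs.get(reg_class, 1)

    def try_allocate(self, reg_class):
        """Reserve physical registers for one definition; False means a stall."""
        cost = self.cost_of(reg_class)
        bounded = self.num_phys_regs != 0
        if bounded and self.used + cost > self.num_phys_regs:
            return False
        self.used += cost
        return True


# Mirrors the FpuPRF example: 72 registers, YMM definitions cost 2.
fpu_prf = RegisterFileSketch(72, {"VR64": 1, "VR128": 1, "VR256": 2})
```

Under this model, 36 YMM definitions exhaust the 72-entry file, and any further allocation fails until registers are freed (freeing is left out of the sketch).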

This patch is structured in two parts.

* Part 1 - MC/Tablegen *

A first part adds the tablegen definition of RegisterFile, and teaches the
SubtargetEmitter how to emit information related to register files.

Information about register files is accessible through an instance of
MCExtraProcessorInfo.
The idea behind this design is to logically partition the processor description
which is only used by external tools (like llvm-mca) from the processor
information used by the llvm machine schedulers.
I think that this design would make it easier for targets to get rid of the extra
processor information if they don't want it.

* Part 2 - llvm-mca related *

The second part of this patch is related to changes to llvm-mca.

The main differences are:
 1) class RegisterFile now needs to take into account the "cost of a register"
when allocating physical registers at register renaming stage.
 2) Point 1. triggered a minor refactoring which led to the removal of the
"maximum 32 register files" restriction.
 3) The BackendStatistics view has been updated so that we can print out extra
details related to each register file implemented by the processor.

The effect of point 3. is also visible in tests register-files-[1..5].s.

Differential Revision: https://reviews.llvm.org/D44980

llvm-svn: 329067
2018-04-03 13:36:24 +00:00
Andrea Di Biagio 6fd62feff8 [llvm-mca] Do not assume that implicit reads cannot be associated with ReadAdvance entries.
Before, the instruction builder incorrectly assumed that only explicit reads
could have been associated with ReadAdvance entries.
This patch fixes the issue and adds a test to verify it.

llvm-svn: 328972
2018-04-02 13:46:49 +00:00
Craig Topper 13a0f83a05 [X86] Add SchedRW for PMULLD
Summary:
It seems many CPUs don't implement this instruction as well as the other vector multiplies, often using a multi-uop flow. Silvermont in particular has a 7 uop flow with 11 cycle throughput. Sandy Bridge implements it as a single uop with 5 cycle latency and 1 cycle throughput. But Haswell and later use 2 uops with 10 cycle latency and 2 cycle throughput.

This patch adds a new X86SchedWritePair we can use to tag this instruction separately. I've provided correct information for Silvermont, Btver2, and Sandy Bridge. I've removed the InstRWs for SandyBridge. I've left Haswell/Broadwell/Skylake InstRWs in place because I wasn't sure how to account for the different load latency between 128 and 256 bits. I also left Znver1 InstRWs in place because the existing values don't match Agner's spreadsheet.

I also left a FIXME in the SandyBridge model because using it as the "generic" model is too optimistic for the 256/512-bit versions, since those are multiple uops on all known CPUs.

Reviewers: RKSimon, GGanesh, courbet

Reviewed By: RKSimon

Subscribers: gchatelet, gbedwell, andreadb, llvm-commits

Differential Revision: https://reviews.llvm.org/D44972

llvm-svn: 328914
2018-03-31 04:54:32 +00:00
Andrea Di Biagio dc97172b2f [X86][BtVer2] Fixed the number of micro opcodes for AVX vector converts and
VSQRT instructions.

There were still a few AVX instructions with an incorrect number of opcodes.
These should be fixed now.

llvm-svn: 328892
2018-03-30 18:53:47 +00:00
Andrea Di Biagio 3eaa26bb64 [X86][BtVer2] Fix the number of uOps for horizontal operations.
llvm-svn: 328886
2018-03-30 18:15:30 +00:00
Andrea Di Biagio 073a9d74ca [X86][BtVer2] Add missing ReadAfterLd to RM variants of AVX horizontal adds and
most vector logic instructions.

Fixed a few InstRW that forgot to specify a ReadAfterLd for the register input
operand.

llvm-svn: 328867
2018-03-30 14:48:08 +00:00
Andrea Di Biagio 42d8ea22c0 [X86][BtVer2] Add tests that show how ReadAfterLd is missing for some
instructions.

In the Btver2 model, there are a few InstRW overrides that don't specify a
ReadAfterLd for the register input operand.

As a result, a few AVX variants of horizontal operations and most vector logic
operations with a folded memory operand don't have a ReadAdvance info associated
to their input register operands.

llvm-svn: 328865
2018-03-30 14:29:33 +00:00
Andrea Di Biagio 01043625cf [X86] Add llvm-mca tests for r328834.
Verify that the ReadAfterLd is correctly applied to FMA and 4-ops variable blend
instructions.

As Craig pointed out in D44726, some Intel models still have to be fixed.

llvm-svn: 328861
2018-03-30 13:38:37 +00:00
Andrea Di Biagio 0823090843 [X86] Add tests to verify the presence of "ReadAfterLd" after r328823.
This change adds a couple of tests to verify the change introduced by revision
328823 ([X86] Correct the placement of ReadAfterLd in BEXTR and BZHI).

llvm-svn: 328859
2018-03-30 11:44:48 +00:00
Andrea Di Biagio 0a837ef6b1 [llvm-mca] Correctly set the ReadAdvance information for register use operands.
The tool was passing the wrong operand index to method
MCSubtargetInfo::getReadAdvanceCycles(). That method requires a "UseIdx", and
not the operand index. This was found when testing X86 code where instructions
had a memory folded operand.

This patch fixes the issue and adds test read-advance-1.s to ensure that
the ReadAfterLd (a ReadAdvance of 3cy) information is correctly used.
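
A minimal sketch of the effect of a ReadAdvance, assuming it simply reduces the producer latency seen by the consuming operand and that the result is clamped at zero (the function name is hypothetical):

```python
# Hypothetical sketch: how long a consumer waits for a register operand.
def cycles_until_operand_ready(write_latency, read_advance):
    """Effective wait in cycles for an operand written with write_latency."""
    # A ReadAdvance lets the operand be read that many cycles later,
    # hiding part (or all) of the producer latency.
    return max(0, write_latency - read_advance)
```

With a ReadAfterLd of 3cy, a register operand produced by a 3cy instruction no longer delays an instruction with a folded load, while a 6cy producer still contributes 3cy.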

llvm-svn: 328790
2018-03-29 14:26:56 +00:00
Andrea Di Biagio 5076b98fb9 [X86][BtVer2] Fix the number of micro opcodes for AES[ENC|DEC] and other YMM instructions.
Similar to r328694. The number of micro opcodes should be 2 for those
instructions.

This was found when testing AVX code for BtVer2 using llvm-mca.

llvm-svn: 328698
2018-03-28 12:12:04 +00:00
Andrea Di Biagio 010924e35c [X86][BtVer2] Fix the number of micro opcodes for a bunch of YMM instructions.
The Jaguar backend natively supports 128-bit data types. Operations on YMM
registers are split into two COPs (complex operations). Each COP consumes a slot
in the dispatch group, and in the reorder buffer.

The scheduling model for Jaguar should mark those instructions as `let
NumMicroOps = 2`.
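
The reasoning above can be written as a small calculation. This is a sketch under the assumption that an operation is split evenly into native-width pieces; the function name and the `native_bits` parameter are illustrative only.

```python
# Hypothetical sketch: micro-op count for a vector operation on a 128-bit datapath.
def num_micro_ops(vector_bits, native_bits=128):
    """Number of native-width operations needed to cover vector_bits."""
    # Ceiling division; anything at or below the native width is one op.
    return max(1, -(-vector_bits // native_bits))
```

A 256-bit YMM operation gives 2, matching the `NumMicroOps = 2` setting, while 128-bit XMM and 64-bit MMX operations give 1.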

This was found when testing AVX code for BtVer2 using llvm-mca.

llvm-svn: 328694
2018-03-28 10:49:33 +00:00
Andrea Di Biagio 9ecb4011ca [llvm-mca] pass the correct set of used registers in checkRAT.
We were incorrectly initializing the array of used registers in method checkRAT.
As a consequence, the number of register file stalls was misreported.

Added a test to cover this case.

llvm-svn: 328629
2018-03-27 15:23:41 +00:00
Simon Pilgrim fcf49df21c [X86][Btver2] Add (U)COMISS/(U)COMISD scheduler costs
Account for the "+i" integer pipe transfer cost (1cy use of JALU0 for GPR PRF write)

llvm-svn: 328573
2018-03-26 19:01:06 +00:00
Simon Pilgrim 86ea53123d [X86][Btver2] Add CVTSI2SD/CVTSI2SS scheduler costs
We still need to account for how Jaguar passes data from GPR -> XMM, which isn't as clean as XMM -> GPR.....

llvm-svn: 328551
2018-03-26 17:02:02 +00:00
Simon Pilgrim 8815105cd5 [X86][Btver2] Add CVTSD2SS/CVTSS2SD scheduler costs
llvm-svn: 328541
2018-03-26 16:24:13 +00:00
Simon Pilgrim aa40148cae [X86][Btver2] Account for the "+i" integer pipe transfer costs (1cy use of JALU0 for GPR PRF write)
llvm-svn: 328536
2018-03-26 16:10:08 +00:00
Simon Pilgrim 0b73b29388 [X86][Btver2] Add CVTSD2SI/CVTSS2SI scheduler costs
Account for the "+i" integer pipe transfer cost (1cy use of JALU0 for GPR PRF write)

This also adds missing vcvttss2si tests 

llvm-svn: 328505
2018-03-26 15:30:47 +00:00
Simon Pilgrim 3aa9344605 [X86][Btver2] Fix YMM BLENDPD/BLENDPS + UNPCKPD/UNPCKPS instruction costs
These should match the YMM MOVDUP/ PERMILPD/PERMILPS + SHUFPD/SHUFPS shuffles instead of using the WriteFShuffle defaults.

llvm-svn: 328501
2018-03-26 14:44:24 +00:00
Andrea Di Biagio 5ffd2c3cfc [llvm-mca] Fix how views are added to the InstructionTables.
This should fix the stack-use-after-scope reported by the asan buildbots after
revision 328493.

llvm-svn: 328499
2018-03-26 14:25:52 +00:00
Simon Pilgrim 67df1cf597 [X86][Btver2] Add (V)SQRTPD/(V)SQRTSD costs
The xmm sd/pd versions were using the WriteFSQRT default which is modelled on sqrtss/sqrtps

llvm-svn: 328497
2018-03-26 14:03:40 +00:00
Andrea Di Biagio ff9c1092b7 [llvm-mca] Add a flag -instruction-info to enable/disable the instruction info view.
llvm-svn: 328493
2018-03-26 13:44:54 +00:00
Simon Pilgrim caa203aed5 [X86][Btver2] Double the AGU and schedule pipe resources for YMM
For 256-bit instructions, both the AGUs and the schedule pipes are double pumped, as are the functional units, which we already model.

llvm-svn: 328491
2018-03-26 13:15:20 +00:00
Andrea Di Biagio d1569290ef [llvm-mca] Add flag -instruction-tables to print the theoretical resource pressure distribution for instructions (PR36874)
The goal of this patch is to address most of PR36874.  To fully fix PR36874 we
need to split the "InstructionInfo" view from the "SummaryView". That would make
it easy to check the latency and rthroughput as well.

The patch reuses all the logic from ResourcePressureView to print out the
"instruction tables".

We have an entry for every instruction in the input sequence. Each entry reports
the theoretical resource pressure distribution. Resource pressure is uniformly
distributed across all the processor resource units of a group.
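
The uniform distribution described above can be sketched like this. It is an illustration rather than the ResourcePressureView logic, and the unit names in the example are used only for demonstration.

```python
# Hypothetical sketch: spread the cycles consumed on a resource group
# evenly across the units that make up the group.
def distribute_pressure(cycles, group_units):
    """Return a mapping of unit name -> theoretical pressure in cycles."""
    if not group_units:
        raise ValueError("a resource group needs at least one unit")
    share = cycles / len(group_units)
    return {unit: share for unit in group_units}


# One cycle on a two-unit ALU group is reported as 0.5 per unit.
example = distribute_pressure(1, ["JALU0", "JALU1"])
```

An instruction that uses several groups would get one such mapping per group, with the per-unit values summed to form its row in the table.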

At the moment, the backend pipeline is not configurable, so the only way to fix
this is by creating a different driver that simply sends instruction events to
the resource pressure view.  That means, we don't use the Backend interface.
Instead, it is simpler to just have a different code-path for when flag
-instruction-tables is specified.

Once Clement addresses bug 36663, then we can port the "instruction tables"
logic into a stage of our configurable pipeline.

Updated the BtVer2 test cases (thanks Simon for the help). Now we pass flag
-instruction-tables to each modified test.

Differential Revision: https://reviews.llvm.org/D44839

llvm-svn: 328487
2018-03-26 12:04:53 +00:00
Simon Pilgrim 6c63e6c222 [X86][Btver2] Cleanup TEST instructions to use JFPA (+JFPX on ymms) function unit
llvm-svn: 328343
2018-03-23 17:59:22 +00:00
Simon Pilgrim e5c0a041ff [X86][Btver2] Cleanup MOVMSK instructions to use JFPA function unit
Add missing non-VEX and (V)PMOVMSKB instructions to the pattern

llvm-svn: 328338
2018-03-23 17:38:59 +00:00
Simon Pilgrim 256f149bf0 [X86][Btver2] Vector permutes use a JFPU01 scheduler pipe and JFPX/JVALU function unit
llvm-svn: 328331
2018-03-23 16:17:56 +00:00