This patch adds a new ReadAdvance definition named ReadInt2Fpu.
ReadInt2Fpu allows x86 scheduling models to accurately describe delays caused by
data transfers from the integer unit to the floating point unit.
ReadInt2Fpu currently defaults to a delay of zero cycles (i.e. no delay) for all
x86 models excluding BtVer2. That means, this patch is only a functional change
for the Jaguar cpu model only.
Tablegen definitions for instructions (V)PINSR* have been updated to account for
the new ReadInt2Fpu. That read is mapped to the the GPR input operand.
On Jaguar, int-to-fpu transfers are modeled as a +6cy delay. Before this patch,
that extra delay was added to the opcode latency. In practice, the insert opcode
only executes for 1cy. Most of the actual latency is actually contributed by the
so-called operand-latency. According to the AMD SOG for family 16h, (V)PINSR*
latency is defined by expression f+1, where f is defined as a forwarding delay
from the integer unit to the fpu.
When printing instruction latency from MCA (see InstructionInfoView.cpp) and LLC
(only when flag -print-schedule is speified), we now need to account for any
extra forwarding delays. We do this by checking if scheduling classes declare
any negative ReadAdvance entries. Quoting a code comment in TargetSchedule.td:
"A negative advance effectively increases latency, which may be used for
cross-domain stalls". When computing the instruction latency for the purpose of
our scheduling tests, we now add any extra delay to the formula. This avoids
regressing existing codegen and mca schedule tests. It comes with the cost of an
extra (but very simple) hook in MCSchedModel.
Differential Revision: https://reviews.llvm.org/D57056
llvm-svn: 351965
to reflect the new license.
We understand that people may be surprised that we're moving the header
entirely to discuss the new license. We checked this carefully with the
Foundation's lawyer and we believe this is the correct approach.
Essentially, all code in the project is now made available by the LLVM
project under our new license, so you will see that the license headers
include that license only. Some of our contributors have contributed
code under our old license, and accordingly, we have retained a copy of
our old license notice in the top-level files in each project and
repository.
llvm-svn: 351636
Field ResourceUnitMask was incorrectly defined as a 'const unsigned' mask. It
should have been a 64 bit quantity instead. That means, ResourceUnitMask was
always implicitly truncated to a 32 bit quantity.
This issue has been found by inspection. Surprisingly, that bug was latent, and
it never negatively affected any existing upstream targets.
This patch fixes the wrong definition of ResourceUnitMask, and adds a bunch of
extra debug prints to help debugging potential issues related to invalid
processor resource masks.
llvm-svn: 350820
This patch adds the ability to specify via tablegen which processor resources
are load/store queue resources.
A new tablegen class named MemoryQueue can be optionally used to mark resources
that model load/store queues. Information about the load/store queue is
collected at 'CodeGenSchedule' stage, and analyzed by the 'SubtargetEmitter' to
initialize two new fields in struct MCExtraProcessorInfo named `LoadQueueID` and
`StoreQueueID`. Those two fields are identifiers for buffered resources used to
describe the load queue and the store queue.
Field `BufferSize` is interpreted as the number of entries in the queue, while
the number of units is a throughput indicator (i.e. number of available pickers
for loads/stores).
At construction time, LSUnit in llvm-mca checks for the presence of extra
processor information (i.e. MCExtraProcessorInfo) in the scheduling model. If
that information is available, and fields LoadQueueID and StoreQueueID are set
to a value different than zero (i.e. the invalid processor resource index), then
LSUnit initializes its LoadQueue/StoreQueue based on the BufferSize value
declared by the two processor resources.
With this patch, we more accurately track dynamic dispatch stalls caused by the
lack of LS tokens (i.e. load/store queue full). This is also shown by the
differences in two BdVer2 tests. Stalls that were previously classified as
generic SCHEDULER FULL stalls, are not correctly classified either as "load
queue full" or "store queue full".
About the differences in the -scheduler-stats view: those differences are
expected, because entries in the load/store queue are not released at
instruction issue stage. Instead, those are released at instruction executed
stage. This is the main reason why for the modified tests, the load/store
queues gets full before PdEx is full.
Differential Revision: https://reviews.llvm.org/D54957
llvm-svn: 347857
RetireControlUnitStatistics now reports extra information about the ROB and the
avg/maximum number of entries consumed over the entire simulation.
Example:
Retire Control Unit - number of cycles where we saw N instructions retired:
[# retired], [# cycles]
0, 109 (17.9%)
1, 102 (16.7%)
2, 399 (65.4%)
Total ROB Entries: 64
Max Used ROB Entries: 35 ( 54.7% )
Average Used ROB Entries per cy: 32 ( 50.0% )
Documentation in llvm/docs/CommandGuide/llvmn-mca.rst has been updated to
reflect this change.
llvm-svn: 347493
This patch teaches view RegisterFileStatistics how to report events for
optimizable register moves.
For each processor register file, view RegisterFileStatistics reports the
following extra information:
- Number of optimizable register moves
- Number of register moves eliminated
- Number of zero moves (i.e. register moves that propagate a zero)
- Max Number of moves eliminated per cycle.
Differential Revision: https://reviews.llvm.org/D53976
llvm-svn: 345865
Summary: This allows to remove `using namespace llvm;` in those *.cpp files
When we want to revisit the decision (everything resides in llvm::mca::*) in the future, we can move things to a nested namespace of llvm::mca::, to conceptually make them separate from the rest of llvm::mca::*
Reviewers: andreadb, mattd
Reviewed By: andreadb
Subscribers: javed.absar, tschuett, gbedwell, llvm-commits
Differential Revision: https://reviews.llvm.org/D53407
llvm-svn: 345612
Also, removed the initialization of vectors used for processor resource masks.
Support function 'computeProcResourceMasks()' already calls method resize on
those vectors.
No functional change intended.
llvm-svn: 345161
Added begin()/end() methods to allow the usage of SourceMgr in foreach loops.
With this change, method getMCInstFromIndex() (as well as a couple of other
methods) are now redundant, and can be removed from the public interface.
llvm-svn: 345147
Summary:
This patch removes the storing of accumulated floating point data
within the llvm-mca library.
This patch splits-up the two quantities: cycles and number of resource units.
By splitting-up these two quantities, we delay the calculation of "cycles per resource unit"
until that value is read, reducing the chance of accumulating floating point error.
I considered using the APFloat, but after measuring performance, for a large (many iteration)
sample, I decided to go with this faster solution.
Reviewers: andreadb, courbet, RKSimon
Reviewed By: andreadb
Subscribers: llvm-commits, javed.absar, tschuett, gbedwell
Differential Revision: https://reviews.llvm.org/D51903
llvm-svn: 341980
This patch introduces the following changes to the DispatchStatistics view:
* DispatchStatistics now reports the number of dispatched opcodes instead of
the number of dispatched instructions.
* The "Dynamic Dispatch Stall Cycles" table now also reports the percentage of
stall cycles against the total simulated cycles.
This change allows users to easily compare dispatch group sizes with the
processor DispatchWidth.
Before this change, it was difficult to correlate the two numbers, since
DispatchStatistics view reported numbers of instructions (instead of opcodes).
DispatchWidth defines the maximum size of a dispatch group in terms of number of
micro opcodes.
The other change introduced by this patch is related to how DispatchStage
generates "instruction dispatch" events.
In particular:
* There can be multiple dispatch events associated with a same instruction
* Each dispatch event now encapsulates the number of dispatched micro opcodes.
The number of micro opcodes declared by an instruction may exceed the processor
DispatchWidth. Therefore, we cannot assume that instructions are always fully
dispatched in a single cycle.
DispatchStage knows already how to handle instructions declaring a number of
opcodes bigger that DispatchWidth. However, DispatchStage always emitted a
single instruction dispatch event (during the first simulated dispatch cycle)
for instructions dispatched.
With this patch, DispatchStage now correctly notifies multiple dispatch events
for instructions that cannot be dispatched in a single cycle.
A few views had to be modified. Views can no longer assume that there can only
be one dispatch event per instruction.
Tests (and docs) have been updated.
Differential Revision: https://reviews.llvm.org/D51430
llvm-svn: 341055
This patch adds two new fields to the perf report generated by the SummaryView.
Fields are now logically organized into two small groups; only the second group
contains throughput indicators.
Example:
```
Iterations: 100
Instructions: 300
Total Cycles: 414
Total uOps: 700
Dispatch Width: 4
uOps Per Cycle: 1.69
IPC: 0.72
Block RThroughput: 4.0
```
This patch also updates the docs for llvm-mca.
Due to the nature of this change, several tests in the tools/llvm-mca directory
were affected, and had to be updated using script `update_mca_test_checks.py`.
llvm-svn: 340946
This patch also uses colors to highlight problematic wait-time entries.
A problematic entry is an entry with an high wait time that tends to match (or
exceed) the size of the scheduler's buffer.
Color RED is used if an instruction had to wait an average number of cycles
which is bigger than (or equal to) the size of the underlying scheduler's
buffer.
Color YELLOW is used if the time (in cycles) spend waiting for the
operands or pipeline resources is bigger than half the size of the underlying
scheduler's buffer.
Color MAGENTA is used if an instruction does not consume buffer resources
according to the scheduling model.
llvm-svn: 340825
Before this patch, the SchedulerStatistics only printed the maximum number of
buffer entries consumed in each scheduler's queue at a given point of the
simulation.
This patch restructures the reported table, and adds an extra field named
"Average number of used buffer entries" to it.
This patch also uses different colors to help identifying bottlenecks caused by
high scheduler's buffer pressure.
llvm-svn: 340746