llvm-project

Commit Graph

Author	SHA1	Message	Date
Andrea Di Biagio	0dc5dc6531	[MCA][NFC] Removed unused method, and fixed a coverity issue. The coverity issue was reported agaist class MCAOperand due to the lack of proper initialization for field Index. No functional change intended.	2021-08-27 12:49:49 +01:00
Patrick Holland	fe01014faa	[MCA] Moved View.h and View.cpp from /tools/llvm-mca/ to /lib/MCA/. Moved View.h and View.cpp from /tools/llvm-mca/Views/ to /lib/MCA/ and /include/llvm/MCA/. This is so that targets can define their own Views within the /lib/Target/ directory (so that the View can use backend functionality). To enable these Views within mca, targets will need to add them to the vector of Views returned by their target's CustomBehaviour::getViews() methods. Differential Revision: https://reviews.llvm.org/D108520	2021-08-25 12:12:47 -07:00
Andrea Di Biagio	45685a1fc4	[MCA] Simplify the rounding logic used in TimelineView::printWaitTimeEntry. This is related to PR51392. Before this patch, the timeline view was rounding doubles to the first decimal, using a logic similar to this: ``` double AverageTime = (double)Input / CumulativeExecutions; double Result = floor((AverageTime * 10) + 0.5) / 10 ``` Here, Input and CumulativeExecutions are both unsigned integers. The last operation is what effectively performs the rounding of AverageTime. PR51392 has been raised because - under specific -m32 configurations of GCC - one of the timeline tests reports slighlty different values (due to a different rounding choice). This patch tries to minimise the propagation of floating-point error by hoisting the multiply by 10, so that it is performed on the unsigned. ``` double AverageTime = (double)(Input * 10) / CumulativeExecutions; floor(AverageTime + 0.5) / 10 ``` So we are trading a floating point multiply for a integer multiply (which can be expanded using a simple MUL or using an `ADD + LEA` sequence). This decrease in floating point operations executed should also help with decreasing the error in the computation.. Strictly speaking, that computation will always be potentially subject to error (depending on what values are passed in input). However, this patch should improve the situation and make bug like PR51392 less frequent.	2021-08-07 11:59:41 +01:00
Marcos Horro	14f77576c9	[llvm-mca] [NFC] Formatting code Applied clang-format to all files. Discarded BottleneckAnalysis.h 80-column width violation since it contains an example of report. Caught some typos and minor style details. Reviewed By: andreadb Differential Revision: https://reviews.llvm.org/D105900	2021-07-13 19:13:59 +02:00
Andrea Di Biagio	d919bca875	[llvm-mca][JSON] Further refactoring of the JSON printing logic. This patch renames object "Resources" to "TargetInfo". Moved the getJSONTargetInfo method from class InstructionView to the PipelinePrinter. Removed uses of std::stringstream. Removed unused method View::printViewJSON().	2021-07-10 12:38:19 +01:00
Andrea Di Biagio	10cb036223	[llvm-mca] Refactor the logic that prints JSON files. Moved most of the printing logic into the PipelinePrinter. This patch also fixes the JSON output when flag -instruction-tables is specified.	2021-07-09 22:56:39 +01:00
Marcos Horro	b11d31eb73	[llvm-mca] Fix JSON format for multiple regions Instead of printing each region individually when using JSON format, this patch creates a JSON object which is updated with the values of each region, printing them at the end. New test is added for JSON output with multiple regions. Bug: https://bugs.llvm.org/show_bug.cgi?id=51008 Reviewed By: andreadb Differential Revision: https://reviews.llvm.org/D105618	2021-07-09 18:04:16 +02:00
Marcos Horro	aa13e4fe7e	[llvm-mca] Fix JSON output (PR50922) Based on the discussion in PR50922, minor changes have been done to properly output a valid JSON. Removed "not implemented" keys. Differential Revision: https://reviews.llvm.org/D105064	2021-07-01 12:53:20 +01:00
Jay Foad	beebe5a056	[MCA] Allow unlimited cycles in the timeline view Change --max-timeline-cycles=0 to mean no limit on the number of cycles. Use this in AMDGPU tests to show all instructions in the timeline view instead of having it arbitrarily truncated. Differential Revision: https://reviews.llvm.org/D104846	2021-06-24 12:54:57 +01:00
Patrick Holland	70040de32d	[MCA][TimelineView] Fixed a bug that was causing instructions outside of the timeline-max-cycles to still be printed. Differential Revision: https://reviews.llvm.org/D104815	2021-06-23 15:05:49 -07:00
Patrick Holland	d03736455c	[MCA] [In-order pipeline] Fix for 0 latency instruction causing assertion to fail. 0 latency instructions now get processed and retired properly within the in-order pipeline. Had to fix a bug within TimelineView.cpp as well that would show up when a 0 latency instruction was the first instruction in the source. Differential Revision: https://reviews.llvm.org/D104675	2021-06-22 10:18:39 -07:00
Patrick Holland	ef16c8eaa5	Reapply "[MCA] Adding the CustomBehaviour class to llvm-mca". The original change was pushed in main as commit `f7a23ecece`. It was then reverted by commit `a04f01bab2` because it caused linker failures on buildbots that don't build the AMDGPU target. -- Some instructions are not defined well enough within the target’s scheduling model for llvm-mca to be able to properly simulate its behaviour. The ideal solution to this situation is to modify the scheduling model, but that’s not always a viable strategy. Maybe other parts of the backend depend on that instruction being modelled the way that it is. Or maybe the instruction is quite complex and it’s difficult to fully capture its behaviour with tablegen. The CustomBehaviour class (which I will refer to as CB frequently) is designed to provide intuitive scaffolding for developers to implement the correct modelling for these instructions. More details are available in the original commit log message (`f7a23ecece`). Differential Revision: https://reviews.llvm.org/D104149	2021-06-16 16:54:48 +01:00
Andrea Di Biagio	a04f01bab2	Revert "[MCA] Adding the CustomBehaviour class to llvm-mca" This reverts commit `f7a23ecece`. It appears to breaks buildbots that don't build the AMDGPU backend.	2021-06-15 21:41:36 +01:00
Patrick Holland	f7a23ecece	[MCA] Adding the CustomBehaviour class to llvm-mca Some instructions are not defined well enough within the target’s scheduling model for llvm-mca to be able to properly simulate its behaviour. The ideal solution to this situation is to modify the scheduling model, but that’s not always a viable strategy. Maybe other parts of the backend depend on that instruction being modelled the way that it is. Or maybe the instruction is quite complex and it’s difficult to fully capture its behaviour with tablegen. The CustomBehaviour class (which I will refer to as CB frequently) is designed to provide intuitive scaffolding for developers to implement the correct modelling for these instructions. Implementation details: llvm-mca does its best to extract relevant register, resource, and memory information from every MCInst when lowering them to an mca::Instruction. It then uses this information to detect dependencies and simulate stalls within the pipeline. For some instructions, the information that gets captured within the mca::Instruction is not enough for mca to simulate them properly. In these cases, there are two main possibilities: 1. The instruction has a dependency that isn’t detected by mca. 2. mca is incorrectly enforcing a dependency that shouldn’t exist. For the rest of this discussion, I will be focusing on (1), but I have put some thought into (2) and I may revisit it in the future. So we have an instruction that has dependencies that aren’t picked up by mca. The basic idea for both pipelines in mca is that when an instruction wants to be dispatched, we first check for register hazards and then we check for resource hazards. This is where CB is injected. If no register or resource hazards have been detected, we make a call to CustomBehaviour::checkCustomHazard() to give the target specific CB the chance to detect and enforce any custom dependencies. The return value for checkCustomHazaard() is an unsigned int representing the (minimum) number of cycles that the instruction needs to stall for. It’s fine to underestimate this value because when StallCycles gets down to 0, we’ll end up checking for all the hazards again before the instruction is actually dispatched. However, it’s important not to overestimate the value and the more accurate your estimate is, the more efficient mca’s execution can be. In general, for checkCustomHazard() to be able to detect these custom dependencies, it needs information about the current instruction and also all of the instructions that are still executing within the pipeline. The mca pipeline uses mca::Instruction rather than MCInst and the current information encoded within each mca::Instruction isn’t sufficient for my use cases. I had to add a few extra attributes to the mca::Instruction class and have them get set by the MCInst during instruction building. For example, the current mca::Instruction doesn’t know its opcode, and it also doesn’t know anything about its immediate operands (both of which I had to add to the class). With information about the current instruction, a list of all currently executing instructions, and some target specific objects (MCSubtargetInfo and MCInstrInfo which the base CB class has references to), developers should be able to detect and enforce most custom dependencies within checkCustomHazard. If you need more information than is present in the mca::Instruction, feel free to add attributes to that class and have them set during the lowering sequence from MCInst. Fortunately, in the in-order pipeline, it’s very convenient for us to pass these arguments to checkCustomHazard. The hazard checking is taken care of within InOrderIssueStage::canExecute(). This function takes a const InstRef as a parameter (representing the instruction that currently wants to be dispatched) and the InOrderIssueStage class maintains a SmallVector<InstRef, 4> which holds all of the currently executing instructions. For the out-of-order pipeline, it’s a bit trickier to get the list of executing instructions and this is why I have held off on implementing it myself. This is the main topic I will bring up when I eventually make a post to discuss and ask for feedback. CB is a base class where targets implement their own derived classes. If a target specific CB does not exist (or we pass in the -disable-cb flag), the base class is used. This base class trivially returns 0 from its checkCustomHazard() implementation (meaning that the current instruction needs to stall for 0 cycles aka no hazard is detected). For this reason, targets or users who choose not to use CB shouldn’t see any negative impacts to accuracy or performance (in comparison to pre-patch llvm-mca). Differential Revision: https://reviews.llvm.org/D104149	2021-06-15 21:30:48 +01:00
Andrea Di Biagio	50770d8de5	[MCA] Refactor the InOrderIssueStage stage. NFCI Moved the logic that checks for RAW hazards from the InOrderIssueStage to the RegisterFile. Changed how the InOrderIssueStage keeps track of backend stalls. Stall events are now generated from method notifyStallEvent(). No functional change intended.	2021-05-27 22:28:04 +01:00
Andrea Di Biagio	de1843e51a	[llvm-mca][View] Update the Register File statistics. Correctly track the number of move eliminated in the Register File statistics.	2021-05-08 19:43:16 +01:00
Andrew Savonichev	292da93d59	[MCA] Disable RCU for InOrderIssueStage This is a follow-up for: D98604 [MCA] Ensure that writes occur in-order When instructions are aligned by the order of writes, they retire in-order naturally. There is no need for an RCU, so it is disabled. Differential Revision: https://reviews.llvm.org/D98628	2021-03-24 13:54:04 +03:00
Peng Guo	91e7a17133	[NFC][llvm-mca] Fix compiler warning Fix clang compiler warning from `-Wrange-loop-analysis`. Reviewed By: andreadb Differential Revision: https://reviews.llvm.org/D95997	2021-02-04 09:44:36 -08:00
Wolfgang Pieb	c6e8f81410	[llvm-mca] Addressing build failures due to missing override specifiers	2021-01-21 17:32:18 -08:00
Wolfgang Pieb	04af1ca2e9	[llvm-mca] Forgot a couple of override specifiers. Differential Revision: https://reviews.llvm.org/D86644	2021-01-21 15:44:14 -08:00
Wolfgang Pieb	d38be2ba0e	[llvm-mca] Initial implementation of serialization using JSON. The views implemented at this time are Summary, Timeline, ResourcePressure and InstructionInfo. Use --json on the command line to obtain JSON output.	2021-01-21 15:15:54 -08:00
Evgeny Leviant	8a7ca143f8	[ARM][SchedModels] Convert IsPredicatedPred to MCSchedPredicate Differential revision: https://reviews.llvm.org/D89553	2020-10-19 11:37:54 +03:00
Wolfgang Pieb	e02920fe55	[llvm-mca][NFC] Refactor handling of views that examine individual instructions, including printing them. Reviewers: andreadb, lebedev.ri Differential Review: https://reviews.llvm.org/D86390 Introduces a new base class "InstructionView" that such views derive from. Other views still use the "View" base class.	2020-08-25 12:12:37 -07:00
Wolfgang Pieb	cf6adecd6a	[llvm-mca][NFC] Refactor views to separate data collection from printing. Reviewed By: andreadb, lebedev.ri Differential Revision: https://reviews.llvm.org/D86177	2020-08-21 11:27:36 -07:00
Fangrui Song	aa708763d3	[MC] Add parameter `Address` to MCInstPrinter::printInst printInst prints a branch/call instruction as `b offset` (there are many variants on various targets) instead of `b address`. It is a convention to use address instead of offset in most external symbolizers/disassemblers. This difference makes `llvm-objdump -d` output unsatisfactory. Add `uint64_t Address` to printInst(), so that it can pass the argument to printInstruction(). `raw_ostream &OS` is moved to the last to be consistent with other print* methods. The next step is to pass `Address` to printInstruction() (generated by tablegen from the instruction set description). We can gradually migrate targets to print addresses instead of offsets. In any case, downstream projects which don't know `Address` can pass 0 as the argument. Reviewed By: jhenderson Differential Revision: https://reviews.llvm.org/D72172	2020-01-06 20:42:22 -08:00
Mark de Wever	536c9a604e	[Tools] Fixes -Wrange-loop-analysis warnings This avoids new warnings due to D68912 adds -Wrange-loop-analysis to -Wall. Differential Revision: https://reviews.llvm.org/D71808	2019-12-22 19:11:17 +01:00
Roman Lebedev	a5e65c1cf7	[MCA] Show aggregate over Average Wait times for the whole snippet (PR43219) Summary: As disscused in https://bugs.llvm.org/show_bug.cgi?id=43219, i believe it may be somewhat useful to show //some// aggregates over all the sea of statistics provided. Example: ``` Average Wait times (based on the timeline view): [0]: Executions [1]: Average time spent waiting in a scheduler's queue [2]: Average time spent waiting in a scheduler's queue while ready [3]: Average time elapsed from WB until retire stage [0] [1] [2] [3] 0. 3 1.0 1.0 4.7 vmulps %xmm0, %xmm1, %xmm2 1. 3 2.7 0.0 2.3 vhaddps %xmm2, %xmm2, %xmm3 2. 3 6.0 0.0 0.0 vhaddps %xmm3, %xmm3, %xmm4 3 3.2 0.3 2.3 <total> ``` I.e. we average the averages. Reviewers: andreadb, mattd, RKSimon Reviewed By: andreadb Subscribers: gbedwell, arphaman, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D68714 llvm-svn: 374361	2019-10-10 14:46:21 +00:00
Andrea Di Biagio	e0900f285b	[MCA] Improved cost computation for loop carried dependencies in the bottleneck analysis. This patch introduces a cut-off threshold for dependency edge frequences with the goal of simplifying the critical sequence computation. This patch also removes the cost normalization for loop carried dependencies. We didn't really need to artificially amplify the cost of loop-carried dependencies since it is already computed as the integral over time of the delay (in cycle). In the absence of backend stalls there is no need for computing a critical sequence. With this patch we early exit from the critical sequence computation if no bottleneck was reported during the simulation. llvm-svn: 372337	2019-09-19 16:05:11 +00:00
Andrea Di Biagio	cbec9af6bf	[MCA] Add flag -show-encoding to llvm-mca. Flag -show-encoding enables the printing of instruction encodings as part of the the instruction info view. Example (with flags -mtriple=x86_64-- -mcpu=btver2): Instruction Info: [1]: #uOps [2]: Latency [3]: RThroughput [4]: MayLoad [5]: MayStore [6]: HasSideEffects (U) [7]: Encoding Size [1] [2] [3] [4] [5] [6] [7] Encodings: Instructions: 1 2 1.00 4 c5 f0 59 d0 vmulps %xmm0, %xmm1, %xmm2 1 4 1.00 4 c5 eb 7c da vhaddps %xmm2, %xmm2, %xmm3 1 4 1.00 4 c5 e3 7c e3 vhaddps %xmm3, %xmm3, %xmm4 In this example, column Encoding Size is the size in bytes of the instruction encoding. Column Encodings reports the actual instruction encodings as byte sequences in hex (objdump style). The computation of encodings is done by a utility class named mca::CodeEmitter. In future, I plan to expose the CodeEmitter to the instruction builder, so that information about instruction encoding sizes can be used by the simulator. That would be a first step towards simulating the throughput from the decoders in the hardware frontend. Differential Revision: https://reviews.llvm.org/D65948 llvm-svn: 368432	2019-08-09 11:26:27 +00:00
Rui Ueyama	4d41c332ef	Revert r367649: Improve raw_ostream so that you can "write" colors using operator<< This reverts commit r367649 in an attempt to unbreak Windows bots. llvm-svn: 367658	2019-08-02 07:22:34 +00:00
Rui Ueyama	a52f982f1c	Improve raw_ostream so that you can "write" colors using operator<< 1. raw_ostream supports ANSI colors so that you can write messages to the termina with colors. Previously, in order to change and reset color, you had to call `changeColor` and `resetColor` functions, respectively. So, if you print out "error: " in red, for example, you had to do something like this: OS.changeColor(raw_ostream::RED); OS << "error: "; OS.resetColor(); With this patch, you can write the same code as follows: OS << raw_ostream::RED << "error: " << raw_ostream::RESET; 2. Add a boolean flag to raw_ostream so that you can disable colored output. If you disable colors, changeColor, operator<<(Color), resetColor and other color-related functions have no effect. Most LLVM tools automatically prints out messages using colors, and you can disable it by passing a flag such as `--disable-colors`. This new flag makes it easy to write code that works that way. Differential Revision: https://reviews.llvm.org/D65564 llvm-svn: 367649	2019-08-02 04:48:30 +00:00
Andrea Di Biagio	aa9b6468bd	[MCA][Bottleneck Analysis] Teach how to compute a critical sequence of instructions based on the simulation. This patch teaches the bottleneck analysis how to identify and print the most expensive sequence of instructions according to the simulation. Fixes PR37494. The goal is to help users identify the sequence of instruction which is most critical for performance. A dependency graph is internally used by the bottleneck analysis to describe data dependencies and processor resource interferences between instructions. There is one node in the graph for every instruction in the input assembly sequence. The number of nodes in the graph is independent from the number of iterations simulated by the tool. It means that a single node of the graph represents all the possible instances of a same instruction contributed by the simulated iterations. Edges are dynamically "discovered" by the bottleneck analysis by observing instruction state transitions and "backend pressure increase" events generated by the Execute stage. Information from the events is used to identify critical dependencies, and materialize edges in the graph. A dependency edge is uniquely identified by a pair of node identifiers plus an instance of struct DependencyEdge::Dependency (which provides more details about the actual dependency kind). The bottleneck analysis internally ranks dependency edges based on their impact on the runtime (see field DependencyEdge::Dependency::Cost). To this end, each edge of the graph has an associated cost. By default, the cost of an edge is a function of its latency (in cycles). In practice, the cost of an edge is also a function of the number of cycles where the dependency has been seen as 'contributing to backend pressure increases'. The idea is that the higher the cost of an edge, the higher is the impact of the dependency on performance. To put it in another way, the cost of an edge is a measure of criticality for performance. Note how a same edge may be found in multiple iteration of the simulated loop. The logic that adds new edges to the graph checks if an equivalent dependency already exists (duplicate edges are not allowed). If an equivalent dependency edge is found, field DependencyEdge::Frequency of that edge is incremented by one, and the new cost is cumulatively added to the existing edge cost. At the end of simulation, costs are propagated to nodes through the edges of the graph. The goal is to identify a critical sequence from a node of the root-set (composed by node of the graph with no predecessors) to a 'sink node' with no successors. Note that the graph is intentionally kept acyclic to minimize the complexity of the critical sequence computation algorithm (complexity is currently linear in the number of nodes in the graph). The critical path is finally computed as a sequence of dependency edges. For edges describing processor resource interferences, the view also prints a so-called "interference probability" value (by dividing field DependencyEdge::Frequency by the total number of iterations). Examples of critical sequence computations can be found in tests added/modified by this patch. On output streams that support colored output, instructions from the critical sequence are rendered with a different color. Strictly speaking the analysis conducted by the bottleneck analysis view is not a critical path analysis. The cost of an edge doesn't only depend on the dependency latency. More importantly, the cost of a same edge may be computed differently by different iterations. The number of dependencies is discovered dynamically based on the events generated by the simulator. However, their number is not fixed. This is especially true for edges that model processor resource interferences; an interference may not occur in every iteration. For that reason, it makes sense to also print out a "probability of interference". By construction, the accuracy of this analysis (as always) is strongly dependent on the simulation (and therefore the quality of the information available in the scheduling model). That being said, the critical sequence effectively identifies a performance criticality. Instructions from that sequence are expected to have a very big impact on performance. So, users can take advantage of this information to focus their attention on specific interactions between instructions. In my experience, it works quite well in practice, and produces useful output (in a reasonable amount time). Differential Revision: https://reviews.llvm.org/D63543 llvm-svn: 364045	2019-06-21 13:32:54 +00:00
Andrea Di Biagio	3b2f5df12c	[MCA] Slightly refactor the bottleneck analysis view. NFCI This patch slightly refactors data structures internally used by the bottleneck analysis to track data and resource dependencies. This patch also updates methods used to print out information about dependency edges when in debug mode. This is the last of a sequence of commits done in preparation for an upcoming patch that fixes PR37494. No functional change intended. llvm-svn: 363677	2019-06-18 12:59:46 +00:00
Andrea Di Biagio	49d8699ecc	[MCA] Fix -Wunused-private-field warning after r362933. NFC This should unbreak the buildbots. llvm-svn: 362935	2019-06-10 13:33:54 +00:00
Andrea Di Biagio	47db08dbb1	[MCA] Further refactor the bottleneck analysis view. NFCI. llvm-svn: 362933	2019-06-10 12:50:08 +00:00
Andrea Di Biagio	065bd45da9	[MCA] Remove unused fields from BottleneckAnalysis. NFC This should appease the buildbots. llvm-svn: 362251	2019-05-31 18:01:42 +00:00
Andrea Di Biagio	312f3a2bbf	[MCA] Refactor class BottleneckAnalysis. NFCI The resource pressure distribution computation is now delegated by class BottleneckAnalysis to an instance of class PressureTracker. Class PressureTracker is also responsible for: - tracking users of processor resource units. - tracking the number of delay cycles caused by increases in backpressure. BottleneckAnalysis internally initializes a dependency graph. Each nodes represents an instruction in the input code sequence. Edges of the dependency graph are critical register/memory/resource dependencies. Dependencies are only added to the graph if they are seen as critical by backend pressure events. The DependencyGraph is currently unused. It is possible to print the dependency graph (see method DependencyGraph::dump()) for debugging purposes. The long term goal is to use the information stored by the dependency graph in order to do critical path computation. llvm-svn: 362246	2019-05-31 17:18:34 +00:00
Andrea Di Biagio	57cef58672	[MCA] Moved the bottleneck analysis to its own file. NFCI llvm-svn: 358554	2019-04-17 06:02:05 +00:00
Andrea Di Biagio	f6a60f1f80	[llvm-mca][scheduler-stats] Print issued micro opcodes per cycle. NFCI It makes more sense to print out the number of micro opcodes that are issued every cycle rather than the number of instructions issued per cycle. This behavior is also consistent with the dispatch-stats: numbers from the two views can now be easily compared. llvm-svn: 357919	2019-04-08 16:05:54 +00:00
Andrea Di Biagio	ddce32e2f3	[MCA] Correctly update the UsedResourceGroups mask in the InstrBuilder. Found by inspection when looking at the debug output of MCA. This problem was latent, and none of the upstream models were affected by it. No functional change intended. llvm-svn: 357000	2019-03-26 15:38:37 +00:00
Matt Davis	6c5a49ccb9	[llvm-mca] Emit a message when no bottlenecks are identified. Summary: Since bottleneck hints are enabled via user request, it can be confusing if no bottleneck information is presented. Such is the case when no bottlenecks are identified. This patch emits a message in that case. Reviewers: andreadb Reviewed By: andreadb Subscribers: tschuett, gbedwell, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D59098 llvm-svn: 355628	2019-03-07 19:34:44 +00:00
Andrea Di Biagio	9735d9011a	[MCA] Correctly initialize struct SummaryView::BackPressureInfo. This should appease the buildbots. llvm-svn: 355309	2019-03-04 12:23:05 +00:00
Andrea Di Biagio	be3281a281	[MCA] Highlight kernel bottlenecks in the summary view. This patch adds a new flag named -bottleneck-analysis to print out information about throughput bottlenecks. MCA knows how to identify and classify dynamic dispatch stalls. However, it doesn't know how to analyze and highlight kernel bottlenecks. The goal of this patch is to teach MCA how to correlate increases in backend pressure to backend stalls (and therefore, the loss of throughput). From a Scheduler point of view, backend pressure is a function of the scheduler buffer usage (i.e. how the number of uOps in the scheduler buffers changes over time). Backend pressure increases (or decreases) when there is a mismatch between the number of opcodes dispatched, and the number of opcodes issued in the same cycle. Since buffer resources are limited, continuous increases in backend pressure would eventually leads to dispatch stalls. So, there is a strong correlation between dispatch stalls, and how backpressure changed over time. This patch teaches how to identify situations where backend pressure increases due to: - unavailable pipeline resources. - data dependencies. Data dependencies may delay execution of instructions and therefore increase the time that uOps have to spend in the scheduler buffers. That often translates to an increase in backend pressure which may eventually lead to a bottleneck. Contention on pipeline resources may also delay execution of instructions, and lead to a temporary increase in backend pressure. Internally, the Scheduler classifies instructions based on whether register / memory operands are available or not. An instruction is marked as "ready to execute" only if data dependencies are fully resolved. Every cycle, the Scheduler attempts to execute all instructions that are ready to execute. If an instruction cannot execute because of unavailable pipeline resources, then the Scheduler internally updates a BusyResourceUnits mask with the ID of each unavailable resource. ExecuteStage is responsible for tracking changes in backend pressure. If backend pressure increases during a cycle because of contention on pipeline resources, then ExecuteStage sends a "backend pressure" event to the listeners. That event would contain information about instructions delayed by resource pressure, as well as the BusyResourceUnits mask. Note that ExecuteStage also knows how to identify situations where backpressure increased because of delays introduced by data dependencies. The SummaryView observes "backend pressure" events and prints out a "bottleneck report". Example of bottleneck report: ``` Cycles with backend pressure increase [ 99.89% ] Throughput Bottlenecks: Resource Pressure [ 0.00% ] Data Dependencies: [ 99.89% ] - Register Dependencies [ 0.00% ] - Memory Dependencies [ 99.89% ] ``` A bottleneck report is printed out only if increases in backend pressure eventually caused backend stalls. About the time complexity: Time complexity is linear in the number of instructions in the Scheduler::PendingSet. The average slowdown tends to be in the range of ~5-6%. For memory intensive kernels, the slowdown can be significant if flag -noalias=false is specified. In the worst case scenario I have observed a slowdown of ~30% when flag -noalias=false was specified. We can definitely recover part of that slowdown if we optimize class LSUnit (by doing extra bookkeeping to speedup queries). For now, this new analysis is disabled by default, and it can be enabled via flag -bottleneck-analysis. Users of MCA as a library can enable the generation of pressure events through the constructor of ExecuteStage. This patch partially addresses https://bugs.llvm.org/show_bug.cgi?id=37494 Differential Revision: https://reviews.llvm.org/D58728 llvm-svn: 355308	2019-03-04 11:52:34 +00:00
Andrea Di Biagio	d882ad5e6e	[MCA][ResourceManager] Add a table that maps processor resource indices to processor resource identifiers. This patch adds a lookup table to speed up resource queries in the ResourceManager. This patch also moves helper function 'getResourceStateIndex()' from ResourceManager.cpp to Support.h, so that we can reuse that logic in the SummaryView (and potentially other views in llvm-mca). No functional change intended. llvm-svn: 354470	2019-02-20 14:53:18 +00:00
Andrea Di Biagio	d768d35515	[MC][X86] Correctly model additional operand latency caused by transfer delays from the integer to the floating point unit. This patch adds a new ReadAdvance definition named ReadInt2Fpu. ReadInt2Fpu allows x86 scheduling models to accurately describe delays caused by data transfers from the integer unit to the floating point unit. ReadInt2Fpu currently defaults to a delay of zero cycles (i.e. no delay) for all x86 models excluding BtVer2. That means, this patch is only a functional change for the Jaguar cpu model only. Tablegen definitions for instructions (V)PINSR* have been updated to account for the new ReadInt2Fpu. That read is mapped to the the GPR input operand. On Jaguar, int-to-fpu transfers are modeled as a +6cy delay. Before this patch, that extra delay was added to the opcode latency. In practice, the insert opcode only executes for 1cy. Most of the actual latency is actually contributed by the so-called operand-latency. According to the AMD SOG for family 16h, (V)PINSR* latency is defined by expression f+1, where f is defined as a forwarding delay from the integer unit to the fpu. When printing instruction latency from MCA (see InstructionInfoView.cpp) and LLC (only when flag -print-schedule is speified), we now need to account for any extra forwarding delays. We do this by checking if scheduling classes declare any negative ReadAdvance entries. Quoting a code comment in TargetSchedule.td: "A negative advance effectively increases latency, which may be used for cross-domain stalls". When computing the instruction latency for the purpose of our scheduling tests, we now add any extra delay to the formula. This avoids regressing existing codegen and mca schedule tests. It comes with the cost of an extra (but very simple) hook in MCSchedModel. Differential Revision: https://reviews.llvm.org/D57056 llvm-svn: 351965	2019-01-23 16:35:07 +00:00
Chandler Carruth	2946cd7010	Update the file headers across all of the LLVM projects in the monorepo to reflect the new license. We understand that people may be surprised that we're moving the header entirely to discuss the new license. We checked this carefully with the Foundation's lawyer and we believe this is the correct approach. Essentially, all code in the project is now made available by the LLVM project under our new license, so you will see that the license headers include that license only. Some of our contributors have contributed code under our old license, and accordingly, we have retained a copy of our old license notice in the top-level files in each project and repository. llvm-svn: 351636	2019-01-19 08:50:56 +00:00
Andrea Di Biagio	97ed076dd1	[MCA] Fix wrong definition of ResourceUnitMask in DefaultResourceStrategy. Field ResourceUnitMask was incorrectly defined as a 'const unsigned' mask. It should have been a 64 bit quantity instead. That means, ResourceUnitMask was always implicitly truncated to a 32 bit quantity. This issue has been found by inspection. Surprisingly, that bug was latent, and it never negatively affected any existing upstream targets. This patch fixes the wrong definition of ResourceUnitMask, and adds a bunch of extra debug prints to help debugging potential issues related to invalid processor resource masks. llvm-svn: 350820	2019-01-10 13:59:13 +00:00
Clement Courbet	cc5e6a72de	[llvm-mca] Move llvm-mca library to llvm/lib/MCA. Summary: See PR38731. Reviewers: andreadb Subscribers: mgorny, javed.absar, tschuett, gbedwell, andreadb, RKSimon, llvm-commits Differential Revision: https://reviews.llvm.org/D55557 llvm-svn: 349332	2018-12-17 08:08:31 +00:00
Andrea Di Biagio	373a4ccf6c	[llvm-mca][MC] Add the ability to declare which processor resources model load/store queues (PR36666). This patch adds the ability to specify via tablegen which processor resources are load/store queue resources. A new tablegen class named MemoryQueue can be optionally used to mark resources that model load/store queues. Information about the load/store queue is collected at 'CodeGenSchedule' stage, and analyzed by the 'SubtargetEmitter' to initialize two new fields in struct MCExtraProcessorInfo named `LoadQueueID` and `StoreQueueID`. Those two fields are identifiers for buffered resources used to describe the load queue and the store queue. Field `BufferSize` is interpreted as the number of entries in the queue, while the number of units is a throughput indicator (i.e. number of available pickers for loads/stores). At construction time, LSUnit in llvm-mca checks for the presence of extra processor information (i.e. MCExtraProcessorInfo) in the scheduling model. If that information is available, and fields LoadQueueID and StoreQueueID are set to a value different than zero (i.e. the invalid processor resource index), then LSUnit initializes its LoadQueue/StoreQueue based on the BufferSize value declared by the two processor resources. With this patch, we more accurately track dynamic dispatch stalls caused by the lack of LS tokens (i.e. load/store queue full). This is also shown by the differences in two BdVer2 tests. Stalls that were previously classified as generic SCHEDULER FULL stalls, are not correctly classified either as "load queue full" or "store queue full". About the differences in the -scheduler-stats view: those differences are expected, because entries in the load/store queue are not released at instruction issue stage. Instead, those are released at instruction executed stage. This is the main reason why for the modified tests, the load/store queues gets full before PdEx is full. Differential Revision: https://reviews.llvm.org/D54957 llvm-svn: 347857	2018-11-29 12:15:56 +00:00
Andrea Di Biagio	07a8255a78	[llvm-mca][View] Improved Retire Control Unit Statistics. RetireControlUnitStatistics now reports extra information about the ROB and the avg/maximum number of entries consumed over the entire simulation. Example: Retire Control Unit - number of cycles where we saw N instructions retired: [# retired], [# cycles] 0, 109 (17.9%) 1, 102 (16.7%) 2, 399 (65.4%) Total ROB Entries: 64 Max Used ROB Entries: 35 ( 54.7% ) Average Used ROB Entries per cy: 32 ( 50.0% ) Documentation in llvm/docs/CommandGuide/llvmn-mca.rst has been updated to reflect this change. llvm-svn: 347493	2018-11-23 12:12:57 +00:00

1 2

69 Commits