llvm-project

Commit Graph

Author	SHA1	Message	Date
wlei	9af46710fe	[llvm-profgen] Move profiled binary loading out of PerfReader Change to use unique pointer of profiled binary to unblock asan. At same time, I realized we can decouple to move the profiled binary loading out of PerfReader, so I made some other related refactors. Reviewed By: hoy Differential Revision: https://reviews.llvm.org/D108254	2021-08-17 17:28:01 -07:00
wlei	f812c19253	[llvm-profgen] Clean up code dealing with multiple binaries As we decided to support only one binary each time, this patch cleans up the related code dealing with multiple binaries. We can use `llvm-profdata` to merge profile from multiple binaries. Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D108002	2021-08-17 12:16:07 -07:00
jamesluox	ee7d20e846	[CSSPGO] Migrate and refactor the decoder of Pseudo Probe Migrate pseudo probe decoding logic in llvm-profgen to MC, so other LLVM-based programs can reuse the existing code. Redesign the object layout of encoded and decoded pseudo probes. Reviewed By: hoy Differential Revision: https://reviews.llvm.org/D106861	2021-08-04 09:21:34 -07:00
wlei	f1affe8dc8	[llvm-profgen][CSSPGO] Support count based aggregated type of hybrid perf script This change tried to integrate a new count based aggregated type of perf script. The only difference of the format is that an aggregated count is added at the head of the original sample which means the same samples are repeated to the given count times. This is used to reduce the perf script size. e.g. ``` 2 4005dc 400634 400684 7f68c5788793 0x4005c8/0x4005dc/P/-/-/0 .... ``` Implemented by a dedicated PerfReader `AggregatedHybridPerfReader`. Differential Revision: https://reviews.llvm.org/D107192	2021-08-03 17:56:35 -07:00
wlei	6da9241aab	[llvm-profgen] Refactor PerfReader to allow different types of perf scripts In order to support different types of perf scripts, this change tried to refactor `PerfReader` by adding the base class `PerfReaderBase` and current HybridPerfReader is derived from it for CS profile generation. Common functions like, passMM2PEvents, extract_lbrs, extract_callstack, etc. can be reused. Next step is to add LBR only reader(for non-CS profile) and aggregated perf scripts reader(do a pre-aggregation of scripts). Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D107014	2021-08-02 17:18:47 -07:00
Hongtao Yu	0712038458	[CSSPGO][llvm-profgen] Allow multiple executable load segments. The linker or post-link optimizer can create an ELF image with multiple executable segments each of which will be loaded separately at run time. This breaks the assumption of llvm-profgen that currently only supports one base load address. What it ends up with is that the subsequent mmap events will be treated as an overwrite of the first mmap event which will in turn screw up address mapping. While it is non-trivial to support multiple separate load addresses and given that on x64 those segments will always be loaded at consecutive addresses (though via separate mmap sys calls), I'm adding an error checking logic to bail out if that's violated and keep using a single load address which is the address of the first executable segment. Also changing the disassembly output from printing section offset to printing the virtual address instead, which matches the behavior of objdump. Differential Revision: https://reviews.llvm.org/D103178	2021-07-13 18:22:24 -07:00
Hongtao Yu	00bfde723b	[NFC][CSSPGO][llvm-profgen] Fix build warning due to an attribute usage.	2021-05-24 12:59:02 -07:00
Hongtao Yu	3b51b51877	[CSSPGO][llvm-profgen] Report samples for untrackable frames. Fixing an issue where samples collected for an untrackable frame are not reported. An untrackable frame refers to a frame whose caller is untrackable due to missing debug info or pseudo probe. Though the frame is connected to its parent frame through the frame pointer chain at runtime, the compiler cannot build the connection without debug info or pseudo probe. In such a case we just need to report the untrackable frame as the base frame and all of its child frames. With more samples reported I'm seeing this improves the performance of an internal benchmark by 2.5%. Reviewed By: wenlei, wlei Differential Revision: https://reviews.llvm.org/D102961	2021-05-24 12:39:12 -07:00
Wenlei He	1410db70b9	[CSSPGO] Add attribute metadata for context profile This change adds an attribute field for metadata of context profile. Currently we have an inline attribute that indicates whether the leaf frame corresponding to a context profile was inlined in the previous build. This will be used to help estimate inlining and be taken into account when trimming context. Changes for that in llvm-profgen will follow. It will also help tuning. Differential Revision: https://reviews.llvm.org/D98823	2021-03-18 22:00:56 -07:00
wlei	426e326a19	[CSSPGO][llvm-profgen] Renovate perfscript check and command line input validation This includes some changes related to PerfReader's input check and command line: 1) There might be thousands of leading MMAP-event lines in the perfscript for a large workload. In this case, the 4k threshold is not sufficient to determine whether it's a hybrid sample. This change renovates `isHybridPerfScript` by going through the script without a threshold limit, checking whether there is a non-empty call stack immediately followed by an LBR sample. It stops once it finds a valid one. 2) Added several input validations for the command line switches in PerfReader. 3) Changed the command line switch `show-disassembly` to `show-disassembly-only`; it prints to stdout and exits early, which leaves an empty output profile. Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D96387	2021-02-12 15:18:50 -08:00
wlei	3869309a0c	[CSSPGO][llvm-profgen] Aggregate samples on call frame trie to speed up profile generation For CS profile generation, the process of call stack unwinding is time-consuming since for each LBR entry we need linear time to generate the context (hash, compression, string concatenation). This change speeds this up by grouping all the call frames within one LBR sample into a trie and aggregating the result (sample counter) on it, deferring the context compression and string generation to the end of unwinding. Specifically, it uses `StackLeaf` as the top frame on the stack and manipulates it (popping or pushing a trie node) dynamically during virtual unwinding so that the raw sample can just be recorded on the leaf node; the path (root to leaf) represents its calling context. In the end, it traverses the trie and generates the context on the fly. Results: Our internal branch shows about a 5X speed-up on some large workloads in the SPEC06 benchmark. Differential Revision: https://reviews.llvm.org/D94110	2021-02-04 08:43:21 -08:00
wlei	08e8bb60cf	Revert "[CSSPGO][llvm-profgen] Aggregate samples on call frame trie to speed up profile generation" This reverts commit `1714ad2336`.	2021-02-03 22:16:05 -08:00
wlei	1714ad2336	[CSSPGO][llvm-profgen] Aggregate samples on call frame trie to speed up profile generation For CS profile generation, the process of call stack unwinding is time-consuming since for each LBR entry we need linear time to generate the context (hash, compression, string concatenation). This change speeds this up by grouping all the call frames within one LBR sample into a trie and aggregating the result (sample counter) on it, deferring the context compression and string generation to the end of unwinding. Specifically, it uses `StackLeaf` as the top frame on the stack and manipulates it (popping or pushing a trie node) dynamically during virtual unwinding so that the raw sample can just be recorded on the leaf node; the path (root to leaf) represents its calling context. In the end, it traverses the trie and generates the context on the fly. Results: Our internal branch shows about a 5X speed-up on some large workloads in the SPEC06 benchmark. Differential Revision: https://reviews.llvm.org/D94110	2021-02-03 18:50:14 -08:00
wlei	c681400b25	[CSSPGO][llvm-profgen] Virtual unwinding with pseudo probe This change extends virtual unwinder to support pseudo probe in llvm-profgen. Please refer to https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s and https://reviews.llvm.org/D89707 for more context about CSSPGO and llvm-profgen. Implementation - Added `ProbeBasedCtxKey` derived from `ContextKey` for sample counter aggregation. As we need string splitting to infer the profile for callee function, string based context introduces more string handling overhead, so here we just use probe pointer based context. - For linear unwinding, as inline context is encoded in each pseudo probe, we don't need to go through each instruction to extract ranges sharing the same inliner. So just record the range for the context. - For probe based context, we should ignore the top frame probe since it will be extracted from the address range. We defer the extraction in `ProfileGeneration`. - Added `PseudoProbeProfileGenerator` for pseudo probe based profile generation. - Added some helper functions to get pseudo probe info (call probe, inline context) from the profiled binary. - Added regression test for unwinder's output The pseudo probe based profile generation will be in the upcoming patch. Test Plan: ninja & ninja check-llvm Differential Revision: https://reviews.llvm.org/D92896	2021-01-13 11:02:58 -08:00
wlei	414930b91b	[CSSPGO][llvm-profgen] Refactor to unify hashable interface for trace sample and context-sensitive counter As we plan to support both CSSPGO and AutoFDO for llvm-profgen, we will have different kinds of perf sample and different kinds of sample counter(cs/non-cs, with/without pseudo probe) which both need to do aggregation in hash map. This change implements the hashable interface(`Hashable`) and the unified base class for them to have better extensibility and reusability. Currently perf trace sample and sample counter with context implemented this `Hashable` and the class hierarchy is like: ``` \| Hashable \| PerfSample \| HybridSample \| LBRSample \| ContextKey \| StringBasedCtxKey \| ProbeBasedCtxKey \| CallsiteBasedCtxKey \| ... ``` - Class specifying `Hashable` should implement `getHashCode` and `isEqual`. Here we make `getHashCode` a non-virtual function to avoid vtable overhead, so derived class should calculate and assign the base class's HashCode manually. This also provides the flexibility for calculating the hash code incrementally(like rolling hash) during frame stack unwinding - `isEqual` is a virtual function, which will have perf overhead. In the future, if we redesign a better hash function, then we can just skip this or switch to non-virtual function. - Added `PerfSample` and `ContextKey` as base class for perf sample and counter context key, leveraging llvm-style RTTI for this. - Added `StringBasedCtxKey` class extending `ContextKey` to use string as context id. 
- Refactor `AggregationCounter` to take all kinds of `PerfSample` as key - Refactor `ContextSampleCounter` to take all kinds of `ContextKey` as key - Other refactoring work: - Create a wrapper class `SampleCounter` to wrap `RangeCounter` and `BranchCounter` - Hoist `ContextId` and `FunctionProfile` out of `populateFunctionBodySamples` and `populateFunctionBoundarySamples` to reuse them in ProfileGenerator Differential Revision: https://reviews.llvm.org/D92584	2021-01-13 11:02:57 -08:00
wlei	1f05b1a9f5	[CSSPGO][llvm-profgen] Context-sensitive profile data generation This stack of changes introduces `llvm-profgen` utility which generates a profile data file from given perf script data files for sample-based PGO. It’s part of(not only) the CSSPGO work. Specifically to support context-sensitive with/without pseudo probe profile, it implements a series of functionalities including perf trace parsing, instruction symbolization, LBR stack/call frame stack unwinding, pseudo probe decoding, etc. Also high throughput is achieved by multiple levels of sample aggregation and compatible format with one stop is generated at the end. Please refer to: https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s for the CSSPGO RFC. This change supports context-sensitive profile data generation into llvm-profgen. With simultaneous sampling for LBR and call stack, we can identify leaf of LBR sample with calling context from stack sample . During the process of deriving fall through path from LBR entries, we unwind LBR by replaying all the calls and returns (including implicit calls/returns due to inlining) backwards on top of the sampled call stack. Then the state of call stack as we unwind through LBR always represents the calling context of current fall through path. we have two types of virtual unwinding 1) LBR unwinding and 2) linear range unwinding. Specifically, for each LBR entry which can be classified into call, return, regular branch, LBR unwinding will replay the operation by pushing, popping or switching leaf frame towards the call stack and since the initial call stack is most recently sampled, the replay should be in anti-execution order, i.e. for the regular case, pop the call stack when LBR is call, push frame on call stack when LBR is return. After each LBR processed, it also needs to align with the next LBR by going through instructions from previous LBR's target to current LBR's source, which we named linear unwinding. 
As instructions in a linear range can come from different functions due to inlining, linear unwinding does the range splitting and records counters through the range with the same inline context. With each fall through path from LBR unwinding, we aggregate each sample into counters by the calling context and eventually generate a full context-sensitive profile (without relying on inlining) to drive the compiler's PGO/FDO. A breakdown of noteworthy changes: - Added `HybridSample` class as the abstraction of a perf sample including LBR stack and call stack * Extended `PerfReader` to auto-detect whether the input perf script output contains CS profile, then do the parsing. Multiple `HybridSample` are extracted * Speed up by aggregating `HybridSample` into `AggregatedSamples` * Added VirtualUnwinder that consumes aggregated `HybridSample` and implements unwinding of calls, returns, and linear paths that contain implicit calls/returns from inlining. Range and branch counters are aggregated by the calling context. Here the calling context is a string; each context is a pair of function name and callsite location info, and the whole context is like `main:1 @ foo:2 @ bar`. * Added ProfileGenerator that accumulates counters by range unfolding or branch target mapping, then generates the context-sensitive function profile including function body, inferred callee head samples, and callsite target samples, and eventually records it into ProfileMap. * Leveraged LLVM built-in writer (`SampleProfWriter`) to support different serialization formats with no stop - `getCanonicalFnName` for callee name and name from ELF section - Added regression test for both unwinding and profile generation Test Plan: ninja & ninja check-llvm Reviewed By: hoy, wenlei, wmi Differential Revision: https://reviews.llvm.org/D89723	2020-12-07 13:48:58 -08:00
wlei	21c91454a8	[llvm-profgen][NFC] Fix build failure on different platform See title. Test Plan: ninja & ninja check-llvm Reviewed By: hoy Differential Revision: https://reviews.llvm.org/D91897	2020-11-20 16:36:04 -08:00
wlei	a94fa86229	[CSSPGO][llvm-profgen] Parse mmap events from perf script This stack of changes introduces `llvm-profgen` utility which generates a profile data file from given perf script data files for sample-based PGO. It’s part of(not only) the CSSPGO work. Specifically to support context-sensitive with/without pseudo probe profile, it implements a series of functionalities including perf trace parsing, instruction symbolization, LBR stack/call frame stack unwinding, pseudo probe decoding, etc. Also high throughput is achieved by multiple levels of sample aggregation and compatible format with one stop is generated at the end. Please refer to: https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s for the CSSPGO RFC. As a starter, this change sets up an entry point by introducing PerfReader to load profiled binaries and perf traces(including perf events and perf samples). For the event, here it parses the mmap2 events from perf script to build the loader snaps, which is used to retrieve the image load address in the subsequent perf tracing parsing. As described in llvm-profgen.rst, the tool being built aims to support multiple input perf data (preprocessed by perf script) as well as multiple input binary images. It should also support dynamic reload/unload shared objects by leveraging the loader snaps being built by this change Reviewed By: wenlei, wmi Differential Revision: https://reviews.llvm.org/D89707	2020-11-20 14:26:26 -08:00
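
Commit f1affe8dc8 above describes a count-based aggregated perf script in which an aggregated count is prepended to the original sample. The following is a minimal Python sketch of reading that format, not the `AggregatedHybridPerfReader` implementation. It assumes the flattened single-line form shown in the commit message, and it assumes that tokens containing `/` are LBR entries whose first two fields are source and target, while the remaining tokens are call stack addresses. The names `Sample`, `parse_aggregated_line`, and `aggregate` are hypothetical.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Tuple


class Sample(NamedTuple):
    """Hypothetical hashable sample: call stack addresses plus LBR (source, target) pairs."""
    call_stack: Tuple[int, ...]
    lbr: Tuple[Tuple[int, int], ...]


def parse_aggregated_line(line: str) -> Tuple[Sample, int]:
    """Split one count-prefixed sample line into (sample, count)."""
    tokens = line.split()
    count = int(tokens[0])  # aggregated count at the head of the sample
    stack, lbr = [], []
    for tok in tokens[1:]:
        if "/" in tok:
            # Assumed LBR entry layout: source/target/flags...
            src, dst = tok.split("/")[:2]
            lbr.append((int(src, 16), int(dst, 16)))
        else:
            stack.append(int(tok, 16))
    return Sample(tuple(stack), tuple(lbr)), count


def aggregate(lines: Iterable[str]) -> Counter:
    """Sum counts of identical samples, treating each line as `count` repeats."""
    totals: Counter = Counter()
    for line in lines:
        if not line.strip():
            continue
        sample, count = parse_aggregated_line(line)
        totals[sample] += count
    return totals
```

Because a line with count N stands for N identical samples, adding N to the sample's counter gives the same totals as reading the unaggregated script.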

18 Commits
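
Commit 3869309a0c above describes aggregating sample counts on a call frame trie and deferring context string generation to the end of unwinding. The sketch below illustrates that idea in Python under simplifying assumptions: each frame is already a string such as `main:1`, a sample is recorded by walking from the root rather than by pushing and popping a `StackLeaf` during unwinding, and the class names `FrameNode` and `CallFrameTrie` are hypothetical. The ` @ ` separator follows the context format quoted in the D89723 message.

```python
from typing import Dict, List, Sequence


class FrameNode:
    """One call frame in the trie; `count` holds samples recorded at this frame."""

    def __init__(self) -> None:
        self.children: Dict[str, "FrameNode"] = {}
        self.count = 0


class CallFrameTrie:
    def __init__(self) -> None:
        self.root = FrameNode()

    def record(self, frames: Sequence[str], count: int = 1) -> None:
        """Record a sample on the leaf of the root-to-leaf path given by `frames`."""
        node = self.root
        for frame in frames:
            node = node.children.setdefault(frame, FrameNode())
        node.count += count

    def contexts(self) -> Dict[str, int]:
        """Traverse the trie once and build context strings only for counted nodes."""
        result: Dict[str, int] = {}
        stack: List[tuple] = [(self.root, [])]
        while stack:
            node, path = stack.pop()
            if node.count:
                result[" @ ".join(path)] = node.count
            for frame, child in node.children.items():
                stack.append((child, path + [frame]))
        return result
```

The point of the design is that string concatenation happens once per distinct context during the final traversal, instead of once per LBR entry while unwinding.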