Turn on `use-context-cost-for-preinliner` to use context-sensitive byte size cost for preinliner decisions by default.
This is a more accurate proxy of inline cost than profile size. Testing on our large workload shows it delivers a measurable CPU improvement.
Differential Revision: https://reviews.llvm.org/D109893
Invalid frame addresses exist in call stack samples due to bad unwinding. This can happen with frame-pointer-based unwinding when callee functions do not set up the frame pointer chain. It isn't common when the program is built with frame pointer omission disabled, but it can still happen with third-party static libs built with the frame pointer omitted.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D109638
Perf script can sometimes give disordered LBR samples like below.
```
b022500
32de0044
3386e1d1
7f118e05720c
7f118df2d81f
0x2a0b9622/0x2a0b9610/P/-/-/1 0x2a0b79ff/0x2a0b9618/P/-/-/2 0x2a0b7a4a/0x2a0b79e8/P/-/-/1 0x2a0b7a33/0x2a0b7a46/P/-/-/1 0x2a0b7a42/0x2a0b7a23/P/-/-/1 0x2a0b7a21/0x2a0b7a37/P/-/-/2 0x2a0b79e6/0x2a0b7a07/P/-/-/1 0x2a0b79d4/0x2a0b79dc/P/-/-/2 0x2a0b7a03/0x2a0b79aa/P/-/-/1 0x2a0b79a8/0x2a0b7a00/P/-/-/234 0x2a0b9613/0x2a0b7930/P/-/-/1 0x2a0b9622/0x2a0b9610/P/-/-/1 0x2a0b79ff/0x2a0b9618/P/-/-/2 0x2a0b7a4a/0x2aWarning:
Processed 10263226 events and lost 1 chunks!
```
Note the truncated last LBR record `0x2a0b7a4a/0x2aWarning:`. Currently llvm-profgen does not detect this, and as a result an uninitialized branch target value is used. The uninitialized value can create bogus instruction ranges, which in turn result in a completely wrong profile. An example looks like
```
.... @ _ZN5folly13loadUnalignedIsEET_PKv]:18446744073709551615:18446744073709551615
1: 18446744073709551615
!CFGChecksum: 4294967295
!Attributes: 0
```
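A minimal standalone sketch of the kind of validation that catches such a truncated record (hypothetical helper, not the actual llvm-profgen parser): an entry is only accepted when both its source and target parse fully as hexadecimal addresses.
```
#include <cstdint>
#include <optional>
#include <string>
#include <utility>

// Parse one LBR entry of the form "0xFROM/0xTO/P/-/-/N"; accept it only when
// both the source and the target fields are well-formed hex addresses.
std::optional<std::pair<uint64_t, uint64_t>>
parseLBREntry(const std::string &Entry) {
  size_t Sep1 = Entry.find('/');
  if (Sep1 == std::string::npos)
    return std::nullopt;
  size_t Sep2 = Entry.find('/', Sep1 + 1);
  if (Sep2 == std::string::npos)
    return std::nullopt; // e.g. "0x2aWarning:" has no second separator
  std::string FromStr = Entry.substr(0, Sep1);
  std::string ToStr = Entry.substr(Sep1 + 1, Sep2 - Sep1 - 1);
  try {
    size_t Pos1 = 0, Pos2 = 0;
    uint64_t From = std::stoull(FromStr, &Pos1, 16);
    uint64_t To = std::stoull(ToStr, &Pos2, 16);
    // Reject fields where only a prefix parses as hex.
    if (Pos1 != FromStr.size() || Pos2 != ToStr.size())
      return std::nullopt;
    return std::make_pair(From, To);
  } catch (...) {
    return std::nullopt; // stoull throws on completely non-numeric input
  }
}
```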
Reviewed By: wenlei, wlei
Differential Revision: https://reviews.llvm.org/D109637
We merge cold contexts by default to save profile size. However, trimming cold contexts after merging doesn't save much size, so trimming now defaults to off to reflect how it's commonly used.
Differential Revision: https://reviews.llvm.org/D109166
This change improves the warning for truncated contexts by: 1) deduplicating the warnings, since one call without a probe can appear in many different contexts, leading to duplicated warnings; 2) rephrasing the message to make it easier to understand, as the term "untracked frame" can be confusing.
Differential Revision: https://reviews.llvm.org/D109115
Adding compiler support for MD5 CS profiles, based on the previous context split work D107299. An MD5 CS profile is about 40% smaller than the string-based extbinary profile. As a result, the compilation is 15% faster.
A few conversions from real names to MD5 names have been made on the sample loader and context tracker side to get it to work.
Reviewed By: wenlei, wmi
Differential Revision: https://reviews.llvm.org/D108342
This change aims at supporting LBR-only sample perf scripts, which are used for regular (non-CS) profile generation. An LBR perf script includes a batch of LBR samples, each of which starts with an instruction pointer followed by a group of 32 LBR entries. The FROM/TO LBR pair and the range between two consecutive entries (the former entry's TO and the latter entry's FROM) are used to infer function profile info.
An example of an LBR perf script (created by `perf script -F ip,brstack -i perf.data`):
```
40062f 0x40062f/0x4005b0/P/-/-/9 0x400645/0x4005ff/P/-/-/1 0x400637/0x400645/P/-/-/1 ...
4005d7 0x4005d7/0x4005e5/P/-/-/8 0x40062f/0x4005b0/P/-/-/6 0x400645/0x4005ff/P/-/-/1 ...
...
```
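As a rough illustration, a standalone sketch of tokenizing such a line into the leading instruction pointer plus FROM/TO pairs might look like the following (hypothetical helper, not the actual `LBRPerfReader` code):
```
#include <cstdint>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

struct LBRSampleLine {
  uint64_t LeadingIP = 0;                               // the leading "ip" field
  std::vector<std::pair<uint64_t, uint64_t>> Branches;  // FROM/TO pairs
};

LBRSampleLine parseLBRLine(const std::string &Line) {
  LBRSampleLine Sample;
  std::istringstream SS(Line);
  std::string Token;
  SS >> Token;                                 // e.g. "40062f"
  Sample.LeadingIP = std::stoull(Token, nullptr, 16);
  while (SS >> Token) {                        // e.g. "0x40062f/0x4005b0/P/-/-/9"
    size_t Sep = Token.find('/');
    if (Sep == std::string::npos)
      continue;
    uint64_t From = std::stoull(Token.substr(0, Sep), nullptr, 16);
    uint64_t To = std::stoull(Token.substr(Sep + 1), nullptr, 16); // stops at next '/'
    Sample.Branches.emplace_back(From, To);
  }
  return Sample;
}
```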
For implementation:
- Added a new child class `LBRPerfReader` for the sample parsing, reusing all the functionality in `extractLBRStack` except for an extension to parse the leading instruction pointer.
- `HybridSample` is reused (just leaving the call stack empty) and the parsed samples are still aggregated in `AggregatedSamples`. After that, range samples, branch samples and address samples are computed and recorded.
- Reused `ContextSampleCounterMap` to store the raw profile; since there is no need to aggregate by context, it just registers one sample counter with a fake context key.
- Unified on `show-raw-profile` instead of `show-unwinder-output` to dump the intermediate raw profile; see the comments for the raw profile format. For CS profiles, the unwinder output is still printed.
Profile generation part will come soon.
Differential Revision: https://reviews.llvm.org/D108153
Currently context strings contain a lot of duplicated function names, which significantly increases the profile size. This change splits the context into a series of {name, offset, discriminator} tuples so that function names used in the context can be replaced by an index into the name table, significantly reducing the size consumed by the context.
A follow-up improvement made in the compiler and profiling tools is to avoid reconstructing full context strings, which is time- and memory-consuming. Instead, a context vector of `StringRef` is adopted to represent the full context in all scenarios. As a result, the previously prevalent profile map, which was keyed by `StringRef`, is now engineered as an unordered map keyed by `SampleContext`. `SampleContext` is reshaped to use an `ArrayRef` to represent a full context for CS profiles. For non-CS profiles, it falls back to using a `StringRef` to represent a contextless function name. Both the `ArrayRef` and `StringRef` objects are underpinned by real array and string objects that are stored in producer buffers. For the compiler, they are maintained by the sample reader. For llvm-profgen, they are maintained in `ProfiledBinary` and `ProfileGenerator`. Full context strings are generated only for debugging and printing.
When it comes to profile format, nothing has changed for the text format, though internally the CS context is implemented as a vector. The extbinary format is only changed for CS profiles, with an additional `SecCSNameTable` section which stores all full contexts logically in the form of `vector<int>`, with each element being an offset pointing into `SecNameTable`. All occurrences of contexts elsewhere are redirected to use the offset into `SecCSNameTable`.
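Conceptually, each stored context then becomes a small vector of indices into the name table, roughly as sketched below (an assumed in-memory layout for illustration, not the exact on-disk encoding):
```
#include <cstdint>
#include <string>
#include <vector>

// SecNameTable: one entry per unique function name.
std::vector<std::string> NameTable = {"main", "foo", "bar"};

// SecCSNameTable: a full context only references names by index, so the
// strings themselves are never duplicated across contexts.
struct ContextFrame {
  uint32_t NameIndex;     // offset into the name table above
  uint32_t LineOffset;    // callsite line offset
  uint32_t Discriminator;
};
using EncodedContext = std::vector<ContextFrame>;

// "main:1 @ foo:2 @ bar" encoded with indices instead of names.
EncodedContext MainFooBar = {{0, 1, 0}, {1, 2, 0}, {2, 0, 0}};
```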
Testing
This is a no-diff change in terms of code quality and profile content (for the text profile).
For our internal large service (aka ads), profile generation time is cut in half, with a 20x smaller string-based extbinary profile generated.
The compile time of ads drops by 25%.
Differential Revision: https://reviews.llvm.org/D107299
The change adds a switch to allow the sample loader to use the global pre-inliner's decisions instead. The pre-inliner in llvm-profgen makes inline decisions globally based on the whole-program profile, with function byte size as the cost proxy.
Since the pre-inliner also adjusts/merges context profiles based on its inline decisions, honoring its decisions in the sample loader leads to better post-inline profile quality, especially for ThinLTO where cross-module profile merging isn't possible without the pre-inliner.
A minor fix in the profile reader is also included. When the pre-inliner is used, we now also turn off the default merging and trimming logic unless it is explicitly requested.
Differential Revision: https://reviews.llvm.org/D108677
This is a follow-up diff for BinarySizeContextTracker to track a zero size for fully optimized-away inlinees. When an inlinee is fully optimized away, we won't be able to get its size through symbolizing instructions, hence we would treat the corresponding context size as unknown. However, by traversing the inlined probe forest, we know what the original inlinees were regardless of optimization. If a context shows up in the inlined probes but not during symbolization, we know it was fully optimized away, hence its size is zero instead of unknown. This should provide more accurate size cost estimation for the pre-inliner to make better inline decisions in llvm-profgen.
Differential Revision: https://reviews.llvm.org/D108350
This change enables llvm-profgen to use accurate context-sensitive, post-optimization function byte size as a cost proxy to drive global preinline decisions.
To do this, BinarySizeContextTracker is introduced to track function byte size under different inline contexts during disassembling. In the preinliner, we can now query context byte size under the switch `context-cost-for-preinliner`. The tracker uses a reverse trie to keep the size of functions under different contexts (callee as parent, caller as child), and it can give the best/longest possible matching context size for a given input context.
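A simplified sketch of the reverse-trie idea with a longest-matching-context size query (assumed structure for illustration only, not the actual BinarySizeContextTracker):
```
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Reverse trie: the callee is the parent and its callers are children, so a
// deeper path means a more specific (longer) context.
struct SizeTrieNode {
  uint64_t FuncSize = 0; // size recorded for this exact context, 0 if unknown
  std::map<std::string, std::unique_ptr<SizeTrieNode>> Callers;
};

// Context is ordered leaf (callee) first, e.g. {"bar", "foo", "main"}.
void addSize(SizeTrieNode &Root, const std::vector<std::string> &Context,
             uint64_t Size) {
  SizeTrieNode *Node = &Root;
  for (const std::string &Frame : Context) {
    std::unique_ptr<SizeTrieNode> &Child = Node->Callers[Frame];
    if (!Child)
      Child = std::make_unique<SizeTrieNode>();
    Node = Child.get();
  }
  Node->FuncSize += Size;
}

// Walk as deep as the trie allows and return the size of the longest matching
// context that actually has a size recorded.
uint64_t getSizeForContext(const SizeTrieNode &Root,
                           const std::vector<std::string> &Context) {
  const SizeTrieNode *Node = &Root;
  uint64_t BestSize = 0;
  for (const std::string &Frame : Context) {
    auto It = Node->Callers.find(Frame);
    if (It == Node->Callers.end())
      break;
    Node = It->second.get();
    if (Node->FuncSize)
      BestSize = Node->FuncSize;
  }
  return BestSize;
}
```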
The new size cost is off by default. There are a few TODOs that need to be addressed: 1) avoid dangling strings from `Offset2LocStackMap`, which will be addressed in the split-context work; 2) use the inlinee's entry probe to make sure we have a correct zero size for inlinees that are completely optimized away after inlining. Some tuning is also needed.
Differential Revision: https://reviews.llvm.org/D108180
Change to use a unique pointer for the profiled binary to unblock ASan.
At the same time, I realized we can decouple things by moving the profiled binary loading out of PerfReader, so I made some other related refactoring.
Reviewed By: hoy
Differential Revision: https://reviews.llvm.org/D108254
As we decided to support only one binary at a time, this patch cleans up the related code dealing with multiple binaries. We can use `llvm-profdata` to merge profiles from multiple binaries.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D108002
Currently we use a centralized string map (`StringMap<FunctionSamples> ProfileMap`) to store the profile while populating samples, which can become a memory usage bottleneck. In an extreme case I saw thousands of samples whose context stack depth is >= 100, and memory consumption was greater than 100GB.
Since the context is used for inlining, we can assume we won't have that many inlinees kept inlined into the same root function, so this change caps the context stack depth and merges the samples to reduce peak memory. This is done after recursion compression.
The default value is -1, meaning no depth limit; in the future we can tune it to a smaller value.
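A small sketch of the capping idea: contexts deeper than the limit are truncated (this sketch assumes the frames nearest the leaf are kept) and their samples are merged under the truncated key.
```
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using Context = std::vector<std::string>; // root first, leaf last

// Merge samples so that no key is deeper than MaxDepth; -1 means no limit.
std::map<Context, uint64_t>
capContextDepth(const std::map<Context, uint64_t> &Samples, int MaxDepth) {
  if (MaxDepth < 0)
    return Samples;
  std::map<Context, uint64_t> Merged;
  for (const auto &[Ctx, Count] : Samples) {
    Context Key = Ctx;
    if (Key.size() > static_cast<size_t>(MaxDepth))
      Key.erase(Key.begin(), Key.end() - MaxDepth); // keep the last MaxDepth frames
    Merged[Key] += Count; // samples sharing a truncated context are merged
  }
  return Merged;
}
```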
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D107800
A performance issue showed up in profile generation, and it turned out the loop at line 525 was the bottleneck.
Moving the code out of the loop fixes the issue; the run time improves from 30+ minutes to ~30 seconds.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D107529
Migrate the pseudo probe decoding logic in llvm-profgen to MC, so other LLVM-based programs can reuse the existing code. Redesign the object layout of encoded and decoded pseudo probes.
Reviewed By: hoy
Differential Revision: https://reviews.llvm.org/D106861
This change integrates a new count-based aggregated type of perf script. The only difference in the format is that an aggregated count is added at the head of the original sample, meaning the same sample is repeated the given count of times. This is used to reduce the perf script size.
e.g.
```
2
4005dc
400634
400684
7f68c5788793
0x4005c8/0x4005dc/P/-/-/0 ....
```
Implemented by a dedicated PerfReader `AggregatedHybridPerfReader`.
Differential Revision: https://reviews.llvm.org/D107192
This change supports running without parsing mmap binary loading events; instead it always assumes the binary is loaded at the preferred address. This is used when we are sure there are no binary load address changes or we have pre-processed the address resolution. Warn if there is an interior mmap event without a leading mmap event.
Reviewed By: hoy
Differential Revision: https://reviews.llvm.org/D107097
In order to support different types of perf scripts, this change refactors `PerfReader` by adding a base class `PerfReaderBase`; the current HybridPerfReader is derived from it for CS profile generation. Common functions like passMM2PEvents, extract_lbrs, extract_callstack, etc. can be reused.
The next step is to add an LBR-only reader (for non-CS profiles) and an aggregated perf script reader (doing a pre-aggregation of scripts).
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D107014
[[noreturn]] can be used since Oct 2016 when the minimum compiler requirement was bumped to GCC 4.8/MSVC 2015.
Note: the definition of LLVM_ATTRIBUTE_NORETURN is kept for now.
The linker or post-link optimizer can create an ELF image with multiple executable segments, each of which will be loaded separately at run time. This breaks the assumption of llvm-profgen, which currently supports only one base load address. What ends up happening is that the subsequent mmap events are treated as an overwrite of the first mmap event, which in turn screws up the address mapping. While it is non-trivial to support multiple separate load addresses, and given that on x64 those segments will always be loaded at consecutive addresses (though via separate mmap sys calls), I'm adding error-checking logic to bail out if that's violated and to keep using a single load address, which is the address of the first executable segment.
Also changing the disassembly output from printing section offset to printing the virtual address instead, which matches the behavior of objdump.
Differential Revision: https://reviews.llvm.org/D103178
In a callback case, a return from internal code, say A, to the external runtime can happen. The external runtime can then call back into another internal routine, say B. Making an artificial branch that looks like a return from A to B can confuse the unwinder into treating the instruction before B as a call instruction.
Reviewed By: wenlei, wmi
Differential Revision: https://reviews.llvm.org/D104546
Add an IsFSDiscriminator parameter to the function getBaseDiscriminatorFromDiscriminator().
This function currently checks the internal flag --enable-fs-discriminator. This is not good because we might change the default value of the internal flag.
Note that we have a default parameter. This is just because create_afdo_tool has a call-site to it. I will remove the default parameter in a later patch.
Differential Revision: https://reviews.llvm.org/D104584
As a follow-up to https://reviews.llvm.org/D104129, I'm cleaning up the dangling-probe-related code in both the compiler and llvm-profgen.
I'm seeing a 5% size win for the pseudo_probe section for SPEC2017 and 10% for Cinder. Certain benchmarks such as 602.gcc see a 20% size win. No obvious difference seen in build time for SPEC2017 and Cinder.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D104477
We were using 0 as an indicator of an invalid offset when computing disjoint ranges. In reality, 0 can be a valid code offset, which corresponds to the first function in the .text section. I'm using UINT64_MAX as the invalid code offset instead.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D104497
If we have seen an inwards transition from external code to internal code, but not a following outwards transition, the inwards transition is likely due to an interrupt, which is usually unpaired. Ignore the current and subsequent entries since they are likely from an unrelated pre-interrupt context.
LBR records from different interrupt contexts are unrelated and should not be mixed together. Currently the OS does this for the task-scheduling interrupt but not for all interrupts.
Reviewed By: wenlei, wlei
Differential Revision: https://reviews.llvm.org/D104276
We were using a `StringMap` object to store all profiles to be emitted. The object is basically an unordered hash table, so updating it while traversing it can cause issues since the underlying bucket array could change.
I'm also moving the `csspgo-preinliner` switch around so that no context trie is constructed (by the constructor of `CSPreInliner`) when the switch is off.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D104267
Previously dangling samples were represented by INT64_MAX in the sample profile, while probes never executed were not reported. This was based on an observation that dangling probes were a smaller portion than zero-count probes. However, with compiler optimizations, dangling probes end up being a large portion of all probes in general, and reporting them does not make sense from a profile size point of view. This change flips sample reporting by reporting zero-count probes instead. This enables a dangling probe to be represented by nothing (a missing entry in the profile). This has a couple of benefits:
1. Reducing sample profile size in optimize mode, even when the number of non-executed probes exceeds the number of dangling probes, since INT64_MAX takes more space than 0 to encode.
2. Binary size savings. No need to encode dangling probes anymore, since missing probes are treated as dangling by the profile reader.
3. Reducing compiler work to track dangling probes. However, for probes that are really dead and removed, we still need the compiler to identify them so that they can be reported as zero-count, instead of being mistreated as dangling probes.
4. Improving counts quality by respecting the counts already collected on the non-dangling copy of a probe. A probe, when duplicated, gets two copies at runtime. If one of them is dangling while the other is not, merging the two probes at profile generation time would cause the real samples collected on the non-dangling one to be discarded. Not reporting the dangling counterpart keeps the real samples.
5. Better readability.
6. Consistency with the non-CS DWARF line-number-based profile. Zero counts are trusted by the compiler's counts inferencer, while missing counts will be inferred by the compiler.
Note that the current patch does not include any work for #3. There will be follow-up changes.
For #1, I've seen for a large Facebook service, the text profile is reduced by 7%. For extbinary profile, the size of LBRProfileSection is reduced by 35%.
For #4, I have seen general counts quality for SPEC2017 is improved by 10%.
Reviewed By: wenlei, wlei, wmi
Differential Revision: https://reviews.llvm.org/D104129
This change provides the option to merge and aggregate cold contexts by their last K frames instead of by the context-less name. The default K = 1 is equivalent to the context-less case.
This is for better perf tuning. More selective merging and trimming will rely on llvm-profgen's preinliner.
Reviewed By: wenlei, hoy
Differential Revision: https://reviews.llvm.org/D104131
Make extended binary the default output format for CSSPGO. This avoids having to pass a flag every time when generating profiles. It also matches llvm-profdata where binary profile is the default (should we switch to extbinary as the default for llvm-profdata?).
We plan to compress name table for context profile, which depends on the built-in compression of extbinary.
Differential Revision: https://reviews.llvm.org/D103650
llvm-profgen uses a profile-summary-based cold threshold to merge and trim cold context profiles. This strikes a good balance between profile size and performance.
We've been using 99.9% as the cutoff to save profile size without affecting performance. This change switches llvm-profgen to use 99.9% instead of 99.9999% as the default cold threshold cutoff.
The redundant switch csprof-cold-thres is also removed and tests are cleaned up.
Differential Revision: https://reviews.llvm.org/D103071
Fixing an issue where samples collected for an untrackable frame are not reported. An untrackable frame refers to a frame whose caller is untrackable due to missing debug info or pseudo probes. Though the frame is connected to its parent frame through the frame pointer chain at runtime, the compiler cannot build that connection without debug info or pseudo probes. In such a case we just need to report the untrackable frame as the base frame along with all of its child frames.
With more samples reported I'm seeing this improves the performance of an internal benchmark by 2.5%.
Reviewed By: wenlei, wlei
Differential Revision: https://reviews.llvm.org/D102961
This makes it possible for targets to define their own MCObjectFileInfo.
This MCObjectFileInfo is then used to determine things like section alignment.
This is a follow up to D101462 and prepares for the RISCV backend defining the
text section alignment depending on the enabled extensions.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D101921
This untangles the MCContext and the MCObjectFileInfo. There is a circular dependency between MCContext and MCObjectFileInfo. Currently this dependency also exists during construction: you can't construct a MOFI without an MCContext, and you can't construct that MCContext without a dummy version of the MOFI first.
This removes this dependency during construction. In a perfect world,
MCObjectFileInfo wouldn't depend on MCContext at all, but only be stored in the
MCContext, like other MC information. This is future work.
This also shifts/adds more information to the MCContext making it more
available to the different targets. Namely:
- TargetTriple
- ObjectFileType
- SubtargetInfo
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D101462
The change adds support for trimming and merging cold contexts when merging CSSPGO profiles using llvm-profdata. This is similar to the context profile trimming in llvm-profgen; however, the flexibility to trim cold contexts after the profile is generated can be useful.
Differential Revision: https://reviews.llvm.org/D100528
Report dangling probes for frames that have real samples collected. Dangling probes are the probes associated with an empty block. When reported, the sample count on a dangling probe will not be trusted by the compiler, and we will rely on the counts inference algorithm to give the probe a reasonable count. This actually fixes a bug where previously only those dangling probes with samples collected were reported.
This patch also fixes two existing issues. Pseudo probes are stored in `Address2ProbesMap` and their pointers are used in `PseudoProbeInlineTree`. Previously `std::vector` was used to store probes, and pointers to probes could become stale as the vector grows. I'm changing `std::vector` to `std::list` instead.
The other issue is that all outlined functions shared the same inline frame previously, due to the unchanged `Index` value used as the dummy inlineSite identifier.
Good results seen for SPEC2017 in general regarding profile quality.
Reviewed By: wenlei, wlei
Differential Revision: https://reviews.llvm.org/D100235
CommandLine.h is indirectly included in ~50% of TUs when building
clang, and VirtualFileSystem.h is large.
(Already remarked by jhenderson on D70769.)
No behavior change.
Differential Revision: https://reviews.llvm.org/D100957
This patch fixed the following issues alongside some refactoring:
1. Fix bugs where the StringRef for a context string outlives the underlying std::string. We now keep a string table in the profile generator to hold the std::strings. We also do the same for bracketed context strings in the profile writer.
2. Make sure profile output strictly follows the (total sample, name) order. Previously, there was an inconsistency between ProfileMap's key and FunctionSamples's name, leading to inconsistent ordering. This is now fixed by introducing context profile canonicalization. Assertions are also added to make sure ProfileMap's key and FunctionSamples's name are always consistent.
3. Enhanced error handling for profile writing to make sure we bubble up errors properly for both llvm-profgen and llvm-profdata when the string table is not populated correctly for the extended binary profile.
4. Keep all internal context representations bracket-free. This avoids creating new strings for context trimming, merging and preinlining. The getNameWithContext API is now simplified accordingly.
5. Factor out the code for context trimming and merging into SampleContextTrimmer in SampleProf.cpp. This enables llvm-profdata to use the trimmer when merging profiles. Changes in llvm-profgen will be in separate patch.
Differential Revision: https://reviews.llvm.org/D100090
Use profiled call edges to augment the top-down order. There are cases where the top-down order computed based on the static call graph doesn't reflect the real execution order. For example:
1. Incomplete static call graph due to unknown indirect call targets. Adjusting the order by considering indirect call edges from the profile can enable the inlining of indirect call targets, by allowing the caller to be processed before them.
2. Mutual call edges in an SCC. The static processing order computed for an SCC may not reflect the call contexts in the context-sensitive profile, thus potential inlining may be overlooked. The function order within an SCC is adjusted to a top-down order based on the profile to favor more inlining.
3. Transitive indirect call edges due to inlining. When a callee function is inlined into a caller function in LTO prelink, every call edge originating from the callee will be transferred to the caller. If any of the transferred edges is indirect, the original profiled indirect edge, even if considered, would not enforce a top-down order from the caller to the potential indirect call target in LTO postlink, since the inlined callee is gone from the static call graph.
4. #3 can happen even for direct call targets, due to functions defined in header files. Header functions, when included into source files, are defined multiple times but only one definition survives due to ODR. Therefore, the LTO prelink inlining done on those dropped definitions can be useless from a local file scope. More importantly, the inlinee, once fully inlined into a to-be-dropped inliner, will have no profile to consume when its outlined version is compiled. This can lead to a profile-less prelink compilation for the outlined version of the inlinee function, which may be called from external modules. While this isn't easy to fix, we rely on the postlink AutoFDO pipeline to optimize the inlinee. Since the surviving copy of the inliner (defined in headers) can be inlined in its local scope in prelink, it may not exist in the merged IR in postlink, and we'll need the profiled call edges to enforce a top-down order for the rest of the functions.
Considering those cases, a profiled call graph completely independent of the static call graph is constructed based on profile data, where function objects are not even needed to handle case #3 and case #4.
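A bare-bones sketch of building a call graph from profiled edges and deriving a top-down (caller-before-callee) order from it; the representation is assumed, and the real profiled call graph also handles SCCs and edge weights:
```
#include <map>
#include <set>
#include <string>
#include <vector>

// Call edges observed in the profile, including indirect-call targets.
using ProfiledCG = std::map<std::string, std::set<std::string>>;

// Post-order DFS; reversing it yields callers before callees on a DAG.
static void visit(const ProfiledCG &CG, const std::string &F,
                  std::set<std::string> &Visited,
                  std::vector<std::string> &Post) {
  if (!Visited.insert(F).second)
    return;
  auto It = CG.find(F);
  if (It != CG.end())
    for (const std::string &Callee : It->second)
      visit(CG, Callee, Visited, Post);
  Post.push_back(F);
}

std::vector<std::string> buildTopDownOrder(const ProfiledCG &CG) {
  std::set<std::string> Visited;
  std::vector<std::string> Post;
  for (const auto &Entry : CG)
    visit(CG, Entry.first, Visited, Post);
  // Reverse post-order approximates a top-down processing order.
  return std::vector<std::string>(Post.rbegin(), Post.rend());
}
```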
I'm seeing an average 0.4% perf win on SPEC2017. For certain benchmarks such as Xalanbmk and GCC, the win is bigger, above 2%.
The change is an enhancement to https://reviews.llvm.org/D95988.
Reviewed By: wmi, wenlei
Differential Revision: https://reviews.llvm.org/D99351
This change sets up a framework in llvm-profgen to estimate inline decisions and adjust the context-sensitive profile based on them. We call it a global pre-inliner in llvm-profgen.
It will serve two purposes:
1) Since context profiles for contexts that are not inlined will be merged into the base profile, if we estimate that a context will not be inlined, we can merge the context profile in the output to save profile size.
2) For ThinLTO, when a context involving functions from different modules is not inlined, we can't merge function profiles across modules, leading to suboptimal post-inline count quality. By estimating some inline decisions, we are able to adjust/merge context profiles beforehand as a mitigation.
The compiler's inline heuristic uses inline cost, which is not available in llvm-profgen. But since inline cost is closely related to size, we can get an estimate through function size from debug info. Because the size we have in llvm-profgen is the final size, it can also be more accurate than the inline cost estimation in the compiler.
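A toy sketch of a size-driven pre-inline decision; the thresholds and the size source are placeholders, not the actual heuristics used by the pre-inliner:
```
#include <cstdint>

// Decide whether a callsite should be considered inlined in the pre-inliner.
// Hot callsites get a larger size budget; cold ones are never inlined.
bool shouldPreInline(uint64_t CallsiteSamples, uint64_t CalleeSizeBytes,
                     uint64_t HotCountThreshold, uint64_t ColdCountThreshold) {
  if (CallsiteSamples < ColdCountThreshold)
    return false;                  // cold: leave the context to be merged out
  uint64_t SizeLimit = CallsiteSamples >= HotCountThreshold
                           ? 2000  // placeholder hot-size limit
                           : 100;  // placeholder normal-size limit
  return CalleeSizeBytes <= SizeLimit;
}
```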
This change only has the framework, with a few TODOs left for follow up patches for a complete implementation:
1) We need to retrieve the size for each function/inlinee from debug info for inlining estimation. Currently we use the number of samples in a profile as a placeholder for the size estimation.
2) Currently the thresholds use the values used by the sample loader inliner, but they need to be tuned since the size here is fully optimized machine code size, instead of an inline cost based on not-yet-fully-optimized IR.
Differential Revision: https://reviews.llvm.org/D99146
Switch to use cold threshold from profile summary for cold context merging and trimming, instead of relying on hard coded values. Minor refactoring included for switch names, etc.
Differential Revision: https://reviews.llvm.org/D98921
This change adds an attribute field to the metadata of a context profile. Currently we have an inline attribute that indicates whether the leaf frame corresponding to a context profile was inlined in the previous build.
This will be used to help estimate inlining and will be taken into account when trimming contexts. Changes for that in llvm-profgen will follow. It will also help tuning.
Differential Revision: https://reviews.llvm.org/D98823
Previously we didn't support keeping the unique linkage name (-funique-internal-linkage-name) in llvm-profgen. As discussed in https://reviews.llvm.org/D96932, we chose to canonicalize it.
Now since "selected" is set as the default parameter of getCanonicalFnName in `D96932`, we don't need to add any attribute here for the previous usage and only fix the missing usage in the pseudo probe decoding.
Differential Revision: https://reviews.llvm.org/D98226
For ThinLTO's prelink compilation, we need to put external inline candidates into an import list attached to the function's entry count metadata. This enables ThinLink to treat such cross-module callees as hot in the summary index, and later helps postlink to import them for profile-guided cross-module inlining.
For AutoFDO, the import list is retrieved by traversing the nested inlinee functions. For CSSPGO, since the profile is flattened, a few things need to happen for it to work:
- When loading an input profile in extended binary format, we need to load all child context profiles whose parent is in the current module, so the context trie for the current module includes potential cross-module inlinees.
- In order to make the above happen, we need to know whether the input profile is a CSSPGO profile before starting to read function profiles, hence a flag for the profile summary section is added.
- When searching for cross-module inline candidates, we need to walk through the context trie instead of the nested inlinee profile (callsite samples of an AutoFDO profile).
- Now that we have more accurate counts with CSSPGO, we switched to using entry count instead of total count to decide if an external callee is potentially beneficial to inline. This makes it consistent with how we determine whether a call target is a potential inline candidate.
Differential Revision: https://reviews.llvm.org/D98590
Previously we errored out when disassembling illegal instructions, and no profile would be generated. In fact illegal instructions are not uncommon, and we'd better skip them and print "unknown" instead of erroring out. This matches the behavior of llvm-objdump (see disassembleObject in llvm-objdump.cpp).
Reviewed By: wlei, wenlei
Differential Revision: https://reviews.llvm.org/D97776
GCC warning:
```
[3397/3703] Building CXX object tools/llvm-profgen/CMakeFiles/llvm-profgen.dir/llvm-profgen.cpp.o
In file included from /llvm-project/llvm/include/llvm/ADT/STLExtras.h:19,
from /llvm-project/llvm/include/llvm/ADT/StringRef.h:12,
from /llvm-project/llvm/include/llvm/ADT/Twine.h:13,
from /llvm-project/llvm/tools/llvm-profgen/ErrorHandling.h:12,
from /llvm-project/llvm/tools/llvm-profgen/llvm-profgen.cpp:13:
/llvm-project/llvm/include/llvm/ADT/Optional.h: In instantiation of ‘void llvm::optional_detail::OptionalStorage<T, <anonymous> >::emplace(Args&& ...) [with Args = {const std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, llvm::sampleprof::LineLocation>}; T = const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>; bool <anonymous> = false]’:
/llvm-project/llvm/include/llvm/ADT/Optional.h:79:7: required from ‘constexpr llvm::optional_detail::OptionalStorage<T, <anonymous> >::OptionalStorage(llvm::optional_detail::OptionalStorage<T, <anonymous> >&&) [with T = const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>; bool <anonymous> = false]’
/llvm-project/llvm/include/llvm/ADT/Optional.h:253:13: required from here
/llvm-project/llvm/include/llvm/ADT/Optional.h:113:12: warning: cast from type ‘const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>*’ to type ‘void*’ casts away qualifiers [-Wcast-qual]
113 | ::new ((void *)std::addressof(value)) T(std::forward<Args>(args)...);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[3398/3703] Building CXX object tools/llvm-profgen/CMakeFiles/llvm-profgen.dir/PerfReader.cpp.o
In file included from /llvm-project/llvm/include/llvm/ADT/STLExtras.h:19,
from /llvm-project/llvm/include/llvm/ADT/StringRef.h:12,
from /llvm-project/llvm/include/llvm/ADT/Twine.h:13,
from /llvm-project/llvm/tools/llvm-profgen/ErrorHandling.h:12,
from /llvm-project/llvm/tools/llvm-profgen/PerfReader.h:11,
from /llvm-project/llvm/tools/llvm-profgen/PerfReader.cpp:8:
/llvm-project/llvm/include/llvm/ADT/Optional.h: In instantiation of ‘void llvm::optional_detail::OptionalStorage<T, <anonymous> >::emplace(Args&& ...) [with Args = {const std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, llvm::sampleprof::LineLocation>}; T = const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>; bool <anonymous> = false]’:
/llvm-project/llvm/include/llvm/ADT/Optional.h:79:7: required from ‘constexpr llvm::optional_detail::OptionalStorage<T, <anonymous> >::OptionalStorage(llvm::optional_detail::OptionalStorage<T, <anonymous> >&&) [with T = const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>; bool <anonymous> = false]’
/llvm-project/llvm/include/llvm/ADT/Optional.h:253:13: required from here
/llvm-project/llvm/include/llvm/ADT/Optional.h:113:12: warning: cast from type ‘const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>*’ to type ‘void*’ casts away qualifiers [-Wcast-qual]
113 | ::new ((void *)std::addressof(value)) T(std::forward<Args>(args)...);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[3399/3703] Building CXX object tools/llvm-profgen/CMakeFiles/llvm-profgen.dir/ProfiledBinary.cpp.o
In file included from /llvm-project/llvm/include/llvm/ADT/STLExtras.h:19,
from /llvm-project/llvm/include/llvm/ADT/ArrayRef.h:15,
from /llvm-project/llvm/include/llvm/ADT/DenseMapInfo.h:18,
from /llvm-project/llvm/include/llvm/ADT/DenseMap.h:16,
from /llvm-project/llvm/include/llvm/ADT/DenseSet.h:16,
from /llvm-project/llvm/include/llvm/ProfileData/SampleProf.h:17,
from /llvm-project/llvm/tools/llvm-profgen/CallContext.h:12,
from /llvm-project/llvm/tools/llvm-profgen/ProfiledBinary.h:12,
from /llvm-project/llvm/tools/llvm-profgen/ProfiledBinary.cpp:9:
/llvm-project/llvm/include/llvm/ADT/Optional.h: In instantiation of ‘void llvm::optional_detail::OptionalStorage<T, <anonymous> >::emplace(Args&& ...) [with Args = {const std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, llvm::sampleprof::LineLocation>}; T = const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>; bool <anonymous> = false]’:
/llvm-project/llvm/include/llvm/ADT/Optional.h:79:7: required from ‘constexpr llvm::optional_detail::OptionalStorage<T, <anonymous> >::OptionalStorage(llvm::optional_detail::OptionalStorage<T, <anonymous> >&&) [with T = const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>; bool <anonymous> = false]’
/llvm-project/llvm/include/llvm/ADT/Optional.h:253:13: required from here
/llvm-project/llvm/include/llvm/ADT/Optional.h:113:12: warning: cast from type ‘const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>*’ to type ‘void*’ casts away qualifiers [-Wcast-qual]
113 | ::new ((void *)std::addressof(value)) T(std::forward<Args>(args)...);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[3404/3703] Building CXX object tools/llvm-profgen/CMakeFiles/llvm-profgen.dir/ProfileGenerator.cpp.o
In file included from /llvm-project/llvm/include/llvm/ADT/STLExtras.h:19,
from /llvm-project/llvm/include/llvm/ADT/StringRef.h:12,
from /llvm-project/llvm/include/llvm/ADT/Twine.h:13,
from /llvm-project/llvm/tools/llvm-profgen/ErrorHandling.h:12,
from /llvm-project/llvm/tools/llvm-profgen/ProfileGenerator.h:11,
from /llvm-project/llvm/tools/llvm-profgen/ProfileGenerator.cpp:9:
/llvm-project/llvm/include/llvm/ADT/Optional.h: In instantiation of ‘void llvm::optional_detail::OptionalStorage<T, <anonymous> >::emplace(Args&& ...) [with Args = {const std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, llvm::sampleprof::LineLocation>}; T = const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>; bool <anonymous> = false]’:
/llvm-project/llvm/include/llvm/ADT/Optional.h:79:7: required from ‘constexpr llvm::optional_detail::OptionalStorage<T, <anonymous> >::OptionalStorage(llvm::optional_detail::OptionalStorage<T, <anonymous> >&&) [with T = const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>; bool <anonymous> = false]’
/llvm-project/llvm/include/llvm/ADT/Optional.h:253:13: required from here
/llvm-project/llvm/include/llvm/ADT/Optional.h:113:12: warning: cast from type ‘const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>*’ to type ‘void*’ casts away qualifiers [-Wcast-qual]
113 | ::new ((void *)std::addressof(value)) T(std::forward<Args>(args)...);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
It appears some instructions don't have debug location info and the symbolizer returns an empty call stack for them, which causes a crash later in profile unwinding. We do not record sample info for them anyway, so this change just filters out those instructions.
As those instructions would appear at the beginning and end of the instruction list, without them we need to add boundary checks for the IP `advance` and `backward` operations.
Also, for pseudo probe based profiles, we don't actually need the symbolized location info, so this change just uses an empty call stack for them. This saves half of the binary loading time.
Differential Revision: https://reviews.llvm.org/D96434
This includes some changes related to PerfReader's input checks and command line switches:
1) It appears there can be thousands of leading MMAP-event lines in the perf script for a large workload. In that case, the 4k threshold is not enough to determine whether it's a hybrid sample. This change reworks `isHybridPerfScript` to go through the script without a threshold limit, checking whether there is a non-empty call stack immediately followed by an LBR sample. It stops once it finds a valid one.
2) Added several input validations for the command line switches in PerfReader.
3) Changed the command line switch `show-disassembly` to `show-disassembly-only`; it prints to stdout and exits early, which leaves an empty output profile.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D96387
To align with https://reviews.llvm.org/D95547, we need to add brackets to the context id before initializing the `SampleContext`.
Also added test cases for the extended binary format on the llvm-profgen side.
Differential Revision: https://reviews.llvm.org/D95929
When we skip a call stack starting with an external address, we should also skip the bottom LBR entry; otherwise it will cause a truncated-context issue.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D95480
This change allows merging and trimming cold context profiles in llvm-profgen to solve the profile size bloat problem. When a profile's total sample count is below a threshold (controlled by a switch), it is considered cold and merged into a base context-less profile, which keeps the profile quality at least as good as the baseline (non-CS).
For example, two input profiles:
[main @ foo @ bar]:60
[main @ bar]:50
Under threshold = 100, the two profiles will be merged into one with the base context, yielding:
[bar]:110
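A minimal sketch of that merge step, assuming contexts are represented as frame vectors and the base profile is keyed by the leaf function name:
```
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using Context = std::vector<std::string>; // e.g. {"main", "foo", "bar"}

// Merge any context profile whose total samples fall below ColdThreshold into
// a context-less base profile keyed by the leaf function name.
std::map<Context, uint64_t>
mergeColdContexts(const std::map<Context, uint64_t> &Profiles,
                  uint64_t ColdThreshold) {
  std::map<Context, uint64_t> Out;
  for (const auto &[Ctx, Total] : Profiles) {
    if (Total >= ColdThreshold || Ctx.size() <= 1) {
      Out[Ctx] += Total;            // hot enough (or already base): keep as is
    } else {
      Context Base = {Ctx.back()};  // base context: leaf function only
      Out[Base] += Total;           // [main @ foo @ bar]:60 + [main @ bar]:50 -> [bar]:110
    }
  }
  return Out;
}
```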
Added two switches:
`--csprof-cold-thres=<value>`: Specifies the total samples threshold for a context profile to be considered cold, with 100 being the default. Any cold context profiles will be merged into the context-less base profile by default.
`--csprof-keep-cold`: Force profile generation to keep cold context profiles instead of dropping them. By default, any cold context will not be written to output profile.
Results:
Though not yet evaluated with the latest CSSPGO, our internal branch shows neutral performance but a significantly reduced profile size. A detailed evaluation of llvm-profgen with CSSPGO will come later.
Differential Revision: https://reviews.llvm.org/D94111
For CS profile generation, the process of call stack unwinding is time-consuming since for each LBR entry we need linear time to generate the context (hashing, compression, string concatenation). This change speeds this up by grouping all the call frames within one LBR sample into a trie and aggregating the results (sample counters) on it, deferring context compression and string generation to the end of unwinding.
Specifically, it uses `StackLeaf` as the top frame on the stack and manipulates it dynamically (popping or pushing a trie node) during virtual unwinding, so that the raw sample can just be recorded on the leaf node; the path (root to leaf) represents its calling context. In the end, it traverses the trie and generates the contexts on the fly.
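A stripped-down sketch of such a call frame trie, with a sample recorded on the current leaf (hypothetical node layout; the real unwinder keeps richer per-node state):
```
#include <cstdint>
#include <map>
#include <memory>
#include <string>

struct FrameTrieNode {
  FrameTrieNode *Parent = nullptr;
  std::map<std::string, std::unique_ptr<FrameTrieNode>> Children; // callee frames
  uint64_t Samples = 0;                                           // counts on this leaf
};

// Push a frame below the current leaf (e.g. when replaying a return).
FrameTrieNode *pushFrame(FrameTrieNode *Leaf, const std::string &Callee) {
  std::unique_ptr<FrameTrieNode> &Child = Leaf->Children[Callee];
  if (!Child) {
    Child = std::make_unique<FrameTrieNode>();
    Child->Parent = Leaf;
  }
  return Child.get();
}

// Pop back to the caller (e.g. when replaying a call).
FrameTrieNode *popFrame(FrameTrieNode *Leaf) {
  return Leaf->Parent ? Leaf->Parent : Leaf;
}

// Record a raw sample on the current leaf; the root-to-leaf path is the
// calling context, only materialized as a string after unwinding finishes.
void recordSample(FrameTrieNode *Leaf, uint64_t Count) { Leaf->Samples += Count; }
```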
Results:
Our internal branch shows about 5X speed-up on some large workloads in SPEC06 benchmark.
Differential Revision: https://reviews.llvm.org/D94110
This change compresses the context string by removing cycles due to recursive functions for CS profile generation. Removing recursion cycles is a way to normalize the calling context, which is better for sample aggregation and also makes context promotion deterministic.
For the implementation, we recognize adjacent repeated frames as cycles and deduplicate them through multiple rounds of iteration.
For example:
Consider an input context stack:
[“a”, “a”, “b”, “c”, “a”, “b”, “c”, “b”, “c”, “d”]
In the first iteration, it removes all adjacent repeated frames of size 1:
[“a”, “b”, “c”, “a”, “b”, “c”, “b”, “c”, “d”]
In the second iteration, it removes all adjacent repeated frames of size 2:
[“a”, “b”, “c”, “a”, “b”, “c”, “d”]
So in the end, we get compressed output:
[“a”, “b”, “c”, “d”]
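A simplified standalone sketch of the deduplication described above (not the actual llvm-profgen implementation, which additionally honors the `compress-recursion` size limit):
```
#include <algorithm>
#include <string>
#include <vector>

// Collapse adjacent repeated subsequences, smallest window first.
std::vector<std::string> compressRecursion(std::vector<std::string> Ctx) {
  for (size_t Size = 1; Size <= Ctx.size() / 2; ++Size) {
    for (size_t I = 0; I + 2 * Size <= Ctx.size();) {
      if (std::equal(Ctx.begin() + I, Ctx.begin() + I + Size,
                     Ctx.begin() + I + Size))
        // Drop the duplicate copy; do not advance I so runs like "a a a"
        // collapse down to a single "a".
        Ctx.erase(Ctx.begin() + I + Size, Ctx.begin() + I + 2 * Size);
      else
        ++I;
    }
  }
  return Ctx;
}

// Example: {"a","a","b","c","a","b","c","b","c","d"} -> {"a","b","c","d"}
```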
Compression is called in two places: once for the sample's context key right after unwinding, and once for the eventual context string id in the ProfileGenerator.
Added a switch `compress-recursion` to control the size of duplicated frames; the default of -1 means no size limit.
Added unit tests and a regression test for this.
Differential Revision: https://reviews.llvm.org/D93556
This change implements the profile generation infra for pseudo probes in llvm-profgen. During virtual unwinding, the raw profile is extracted into range counters and branch counters and aggregated into a sample counter map indexed by the call stack context. This change introduces the last step and produces the eventual profile. Specifically, function body samples are recorded by going through each probe within a range, and callsite target samples are recorded by extracting the callsite probe from a branch's source.
Please refer https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s and https://reviews.llvm.org/D89707 for more context about CSSPGO and llvm-profgen.
**Implementation**
- Extended `PseudoProbeProfileGenerator` for pseudo probe based profile generation.
- `populateBodySamplesWithProbes`, reading the range counter, is responsible for recording function body samples and inferring the caller's body samples.
- `populateBoundarySamplesWithProbes`, reading the branch counter, is responsible for recording call site target samples.
- Each sample is recorded with its calling context (named `ContextId`). Note that the probe-based context key doesn't include the leaf frame probe info, so the `ContextId` string is created from two parts: one from the concatenation of the probe stack strings and the other from the leaf frame probe.
- Added regression test
Test Plan:
ninja & ninja check-llvm
Differential Revision: https://reviews.llvm.org/D92998
This change brings up support for context-sensitive profiles in the extended binary format. Existing sample profile reader/writer/merger code is tweaked to handle bracketed input contexts (like `[...]`). The paired brackets are also needed in extbinary profiles because we don't yet have an otherwise good way to tell calling contexts apart from regular function names, since the context delimiter `@` can appear as part of C++ mangled names.
Reviewed By: wmi, wenlei
Differential Revision: https://reviews.llvm.org/D95547
This change extends virtual unwinder to support pseudo probe in llvm-profgen. Please refer https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s and https://reviews.llvm.org/D89707 for more context about CSSPGO and llvm-profgen.
**Implementation**
- Added `ProbeBasedCtxKey`, derived from `ContextKey`, for sample counter aggregation. As we need string splitting to infer the profile for the callee function, a string-based context introduces more string-handling overhead, so here we just use a probe-pointer-based context.
- For linear unwinding, since the inline context is encoded in each pseudo probe, we don't need to go through each instruction to extract ranges sharing the same inliner; we just record the range for the context.
- For the probe-based context, we should ignore the top frame probe since it will be extracted from the address range; we defer the extraction to `ProfileGeneration`.
- Added `PseudoProbeProfileGenerator` for pseudo probe based profile generation.
- Some helper functions to get pseudo probe info (call probe, inline context) from the profiled binary.
- Added a regression test for the unwinder's output
The pseudo probe based profile generation will be in the upcoming patch.
Test Plan:
ninja & ninja check-llvm
Differential Revision: https://reviews.llvm.org/D92896
As we plan to support both CSSPGO and AutoFDO for llvm-profgen, we will have different kinds of perf samples and different kinds of sample counters (CS/non-CS, with/without pseudo probes), both of which need to be aggregated in a hash map. This change implements the hashable interface (`Hashable`) and unified base classes for them for better extensibility and reusability.
Currently the perf trace sample and the context-based sample counter implement `Hashable`, and the class hierarchy looks like:
```
| Hashable
| PerfSample
| HybridSample
| LBRSample
| ContextKey
| StringBasedCtxKey
| ProbeBasedCtxKey
| CallsiteBasedCtxKey
| ...
```
- Classes implementing `Hashable` should implement `getHashCode` and `isEqual`. Here we make `getHashCode` a non-virtual function to avoid vtable overhead, so a derived class should calculate and assign the base class's hash code manually. This also provides the flexibility to calculate the hash code incrementally (like a rolling hash) during frame stack unwinding; see the sketch after this list.
- `isEqual` is a virtual function, which has some perf overhead. In the future, if we design a better hash function, we can just skip this or switch to a non-virtual function.
- Added `PerfSample` and `ContextKey` as base class for perf sample and counter context key, leveraging llvm-style RTTI for this.
- Added `StringBasedCtxKey` class extending `ContextKey` to use string as context id.
- Refactor `AggregationCounter` to take all kinds of `PerfSample` as key
- Refactor `ContextSampleCounter` to take all kinds of `ContextKey` as key
- Other refactoring work:
- Create a wrapper class `SampleCounter` to wrap `RangeCounter` and `BranchCounter`
- Hoist `ContextId` and `FunctionProfile` out of `populateFunctionBodySamples` and `populateFunctionBoundarySamples` to reuse them in ProfileGenerator
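A condensed sketch of the pattern; standard RTTI and assumed member names are used here for brevity, whereas the real code uses LLVM-style RTTI:
```
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

struct ContextKey {
  uint64_t HashCode = 0;                             // assigned by the derived class
  uint64_t getHashCode() const { return HashCode; }  // non-virtual: no vtable cost
  virtual bool isEqual(const ContextKey &Other) const = 0;
  virtual ~ContextKey() = default;
};

struct StringBasedCtxKey : ContextKey {
  std::vector<std::string> Context;
  void finalize() {
    // The derived class computes the hash and stores it in the base class.
    uint64_t H = 0;
    for (const std::string &Frame : Context)
      H = H * 31 + std::hash<std::string>{}(Frame);
    HashCode = H;
  }
  bool isEqual(const ContextKey &Other) const override {
    const auto *O = dynamic_cast<const StringBasedCtxKey *>(&Other);
    return O && Context == O->Context;
  }
};
```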
Differential Revision: https://reviews.llvm.org/D92584
This change implements pseudo probe decoding and disassembling for llvm-profgen/CSSPGO. Please see https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s and https://reviews.llvm.org/D89707 for more context about CSSPGO and llvm-profgen.
**ELF section format**
Please see the encoding patch (https://reviews.llvm.org/D91878) for more details of the format; the example is copied here:
Two sections (`.pseudo_probe_desc` and `.pseudoprobe`) are emitted in ELF to support pseudo probes.
The format of `.pseudo_probe_desc` section looks like:
```
.section .pseudo_probe_desc,"",@progbits
.quad 6309742469962978389 // Func GUID
.quad 4294967295 // Func Hash
.byte 9 // Length of func name
.ascii "_Z5funcAi" // Func name
.quad 7102633082150537521
.quad 138828622701
.byte 12
.ascii "_Z8funcLeafi"
.quad 446061515086924981
.quad 4294967295
.byte 9
.ascii "_Z5funcBi"
.quad -2016976694713209516
.quad 72617220756
.byte 7
.ascii "_Z3fibi"
```
For each `.pseudoprobe` section, the encoded binary data consists of a single function record corresponding to an outlined function (i.e., a function with a code entry in the `.text` section). A function record has the following format:
```
FUNCTION BODY (one for each outlined function present in the text section)
GUID (uint64)
GUID of the function
NPROBES (ULEB128)
Number of probes originating from this function.
NUM_INLINED_FUNCTIONS (ULEB128)
Number of callees inlined into this function, aka number of
first-level inlinees
PROBE RECORDS
A list of NPROBES entries. Each entry contains:
INDEX (ULEB128)
TYPE (uint4)
0 - block probe, 1 - indirect call, 2 - direct call
ATTRIBUTE (uint3)
reserved
ADDRESS_TYPE (uint1)
0 - code address, 1 - address delta
CODE_ADDRESS (uint64 or ULEB128)
code address or address delta, depending on ADDRESS_TYPE
INLINED FUNCTION RECORDS
A list of NUM_INLINED_FUNCTIONS entries describing each of the inlined
callees. Each record contains:
INLINE SITE
GUID of the inlinee (uint64)
ID of the callsite probe (ULEB128)
FUNCTION BODY
A FUNCTION BODY entry describing the inlined function.
```
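Several of the fields above are ULEB128-encoded. For illustration, a standalone ULEB128 decoder looks roughly like this (LLVM already provides one in `llvm/Support/LEB128.h`):
```
#include <cstdint>

// Decode an unsigned LEB128 value starting at Buf; advances Buf past it.
uint64_t decodeULEB128(const uint8_t *&Buf) {
  uint64_t Value = 0;
  unsigned Shift = 0;
  uint8_t Byte;
  do {
    Byte = *Buf++;
    Value |= uint64_t(Byte & 0x7f) << Shift; // low 7 bits carry payload
    Shift += 7;
  } while (Byte & 0x80);                     // high bit set => more bytes follow
  return Value;
}
```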
**Disassembling**
A switch `--show-pseudo-probe` is added to be used along with `--show-disassembly` to print disassembly code with pseudo probe directives.
For example:
```
00000000002011a0 <foo2>:
2011a0: 50 push rax
2011a1: 85 ff test edi,edi
[Probe]: FUNC: foo2 Index: 1 Type: Block
2011a3: 74 02 je 2011a7 <foo2+0x7>
[Probe]: FUNC: foo2 Index: 3 Type: Block
[Probe]: FUNC: foo2 Index: 4 Type: Block
[Probe]: FUNC: foo Index: 1 Type: Block Inlined: @ foo2:6
2011a5: 58 pop rax
2011a6: c3 ret
[Probe]: FUNC: foo2 Index: 2 Type: Block
2011a7: bf 01 00 00 00 mov edi,0x1
[Probe]: FUNC: foo2 Index: 5 Type: IndirectCall
2011ac: ff d6 call rsi
[Probe]: FUNC: foo2 Index: 4 Type: Block
2011ae: 58 pop rax
2011af: c3 ret
```
**Implementation**
- `PseudoProbeDecoder` is added in ProfiledBinary as the infrastructure for the decoding. It decodes the two sections and generates two maps: `GUIDProbeFunctionMap` stores all the `PseudoProbeFunction` objects, which are the abstraction of a general function; `AddressProbesMap` stores all the pseudo probe info indexed by address.
- All the inline info is encoded into the binary as a trie (`PseudoProbeInlineTree`) and is reconstructed during decoding. Each pseudo probe can get its inline context (`getInlineContext`) by traversing its inline tree node backwards.
Test Plan:
ninja & ninja check-llvm
Differential Revision: https://reviews.llvm.org/D92334
I am experimenting with turning backends into loadable modules and in
that scenario, target specific command line arguments won't be available
until after the targets are initialized.
Also, most other tools initialize targets before parsing arguments.
Reviewed By: wlei
Differential Revision: https://reviews.llvm.org/D93348
For unknown reasons, under sanitizer builds (ASan/MSan/UBSan) the output order of `std::unordered_map<string, ...>` is reversed, making the regression test fail.
This change works around it by using a sorted container to make the output deterministic.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D92816
This stack of changes introduces the `llvm-profgen` utility, which generates a profile data file from given perf script data files for sample-based PGO. It's part of (but not limited to) the CSSPGO work. Specifically, to support context-sensitive profiles with and without pseudo probes, it implements a series of functionalities including perf trace parsing, instruction symbolization, LBR stack/call frame stack unwinding, pseudo probe decoding, etc. High throughput is achieved through multiple levels of sample aggregation, and a compatible profile format is generated in one stop at the end. Please refer to https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s for the CSSPGO RFC.
This change adds context-sensitive profile data generation to llvm-profgen. With simultaneous sampling of the LBR and the call stack, we can identify the leaf of an LBR sample with the calling context from the stack sample. During the process of deriving fall-through paths from LBR entries, we unwind the LBR by replaying all the calls and returns (including implicit calls/returns due to inlining) backwards on top of the sampled call stack. Then the state of the call stack, as we unwind through the LBR, always represents the calling context of the current fall-through path.
We have two types of virtual unwinding: 1) LBR unwinding and 2) linear range unwinding.
Specifically, each LBR entry can be classified as a call, a return, or a regular branch. LBR unwinding replays the operation by pushing, popping or switching the leaf frame on the call stack. Since the initial call stack is the most recently sampled one, the replay is in anti-execution order, i.e., for the regular case, pop the call stack when the LBR is a call and push a frame on the call stack when the LBR is a return. After each LBR is processed, we also need to align with the next LBR by going through the instructions from the previous LBR's target to the current LBR's source, which we call linear unwinding. As instructions in a linear range can come from different functions due to inlining, linear unwinding does range splitting and records counters for each sub-range with the same inline context.
With each fall-through path from LBR unwinding, we aggregate each sample into counters by the calling context and eventually generate a full context-sensitive profile (without relying on inlining) to drive the compiler's PGO/FDO.
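A heavily simplified sketch of the LBR replay described above (hypothetical types; the real VirtualUnwinder also performs linear range unwinding between consecutive entries and records counters):
```
#include <cstdint>
#include <vector>

struct LBREntry { uint64_t Source, Target; bool IsCall, IsReturn; };

// CallStack holds the sampled frames with the leaf at the back (an assumption
// of this sketch). LBR entries come most-recent-first, so iterating them front
// to back replays execution backwards (anti-execution order).
void replayLBR(std::vector<uint64_t> &CallStack,
               const std::vector<LBREntry> &LBRStack) {
  for (const LBREntry &Branch : LBRStack) {
    if (Branch.IsCall && CallStack.size() > 1)
      CallStack.pop_back();                // undo a call: drop the callee frame
    else if (Branch.IsReturn)
      CallStack.push_back(Branch.Source);  // undo a return: restore the callee frame
    // Regular branches only matter for linear unwinding (not shown here).
  }
}
```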
A breakdown of noteworthy changes:
- Added the `HybridSample` class as the abstraction of a perf sample including an LBR stack and a call stack
* Extended `PerfReader` to auto-detect whether the input perf script output contains a CS profile, then do the parsing. Multiple `HybridSample`s are extracted
* Speed up by aggregating `HybridSample`s into `AggregatedSamples`
* Added a VirtualUnwinder that consumes aggregated `HybridSample`s and implements unwinding of calls, returns, and linear paths that contain implicit calls/returns from inlining. Range and branch counters are aggregated by the calling context. Here the calling context is a string; each context entry is a pair of function name and callsite location info, and the whole context looks like `main:1 @ foo:2 @ bar`.
* Added a ProfileGenerator that accumulates counters by range unfolding or branch target mapping, then generates a context-sensitive function profile including function body samples, inferred callee head samples and callsite target samples, eventually recording them into the ProfileMap.
* Leveraged the LLVM built-in writer (`SampleProfWriter`) to support different serialization formats in one stop
- Used `getCanonicalFnName` for callee names and names from the ELF section
- Added regression tests for both unwinding and profile generation
Test Plan:
ninja & ninja check-llvm
Reviewed By: hoy, wenlei, wmi
Differential Revision: https://reviews.llvm.org/D89723
LLVMBuild has been removed from the build system. However, three LLVMBuild.txt
files remain in the tree. This patch simply removes them.
llvm/lib/ExecutionEngine/Orc/TargetProcess/LLVMBuild.txt
llvm/tools/llvm-jitlink/llvm-jitlink-executor/LLVMBuild.txt
llvm/tools/llvm-profgen/LLVMBuild.txt
Differential Revision: https://reviews.llvm.org/D92693
This stack of changes introduces the `llvm-profgen` utility, which generates a profile data file from given perf script data files for sample-based PGO. It's part of (but not limited to) the CSSPGO work. Specifically, to support context-sensitive profiles with and without pseudo probes, it implements a series of functionalities including perf trace parsing, instruction symbolization, LBR stack/call frame stack unwinding, pseudo probe decoding, etc. High throughput is achieved through multiple levels of sample aggregation, and a compatible profile format is generated in one stop at the end. Please refer to https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s for the CSSPGO RFC.
This change adds the support of instruction symbolization. Given the RVA on an instruction pointer, a full calling context can be printed side-by-side with the disassembly code.
E.g.
```
Disassembly of section .text [0x0, 0x4a]:
<funcA>:
0: mov eax, edi funcA:0
2: mov ecx, dword ptr [rip] funcLeaf:2 @ funcA:1
8: lea edx, [rcx + 3] fib:2 @ funcLeaf:2 @ funcA:1
b: cmp ecx, 3 fib:2 @ funcLeaf:2 @ funcA:1
e: cmovl edx, ecx fib:2 @ funcLeaf:2 @ funcA:1
11: sub eax, edx funcLeaf:2 @ funcA:1
13: ret funcA:2
14: nop word ptr cs:[rax + rax]
1e: nop
<funcLeaf>:
20: mov eax, edi funcLeaf:1
22: mov ecx, dword ptr [rip] funcLeaf:2
28: lea edx, [rcx + 3] fib:2 @ funcLeaf:2
2b: cmp ecx, 3 fib:2 @ funcLeaf:2
2e: cmovl edx, ecx fib:2 @ funcLeaf:2
31: sub eax, edx funcLeaf:2
33: ret funcLeaf:3
34: nop word ptr cs:[rax + rax]
3e: nop
<fib>:
40: lea eax, [rdi + 3] fib:2
43: cmp edi, 3 fib:2
46: cmovl eax, edi fib:2
49: ret fib:8
```
Test Plan:
ninja check-llvm
Reviewed By: wenlei, wmi
Differential Revision: https://reviews.llvm.org/D89715
This stack of changes introduces the `llvm-profgen` utility, which generates a profile data file from given perf script data files for sample-based PGO. It's part of (but not limited to) the CSSPGO work. Specifically, to support context-sensitive profiles with and without pseudo probes, it implements a series of functionalities including perf trace parsing, instruction symbolization, LBR stack/call frame stack unwinding, pseudo probe decoding, etc. High throughput is achieved through multiple levels of sample aggregation, and a compatible profile format is generated in one stop at the end. Please refer to https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s for the CSSPGO RFC.
This change enables disassembling the text sections to build various address maps that are potentially used by the virtual unwinder. A switch `--show-disassembly` is being added to print the disassembly code.
Like the llvm-objdump tool, this change leverages existing LLVM components to parse and disassemble ELF binary files. So far X86 is supported.
Test Plan:
ninja check-llvm
Reviewed By: wmi, wenlei
Differential Revision: https://reviews.llvm.org/D89712
This stack of changes introduces the `llvm-profgen` utility, which generates a profile data file from given perf script data files for sample-based PGO. It's part of (but not limited to) the CSSPGO work. Specifically, to support context-sensitive profiles with and without pseudo probes, it implements a series of functionalities including perf trace parsing, instruction symbolization, LBR stack/call frame stack unwinding, pseudo probe decoding, etc. High throughput is achieved through multiple levels of sample aggregation, and a compatible profile format is generated in one stop at the end. Please refer to https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s for the CSSPGO RFC.
As a starter, this change sets up an entry point by introducing PerfReader to load profiled binaries and perf traces (including perf events and perf samples). For events, it parses the mmap2 events from the perf script to build the loader snapshots, which are used to retrieve the image load address in the subsequent perf trace parsing.
As described in llvm-profgen.rst, the tool being built aims to support multiple input perf data files (preprocessed by perf script) as well as multiple input binary images. It should also support dynamically reloaded/unloaded shared objects by leveraging the loader snapshots built by this change
Reviewed By: wenlei, wmi
Differential Revision: https://reviews.llvm.org/D89707