llvm-project

Commit Graph

Author	SHA1	Message	Date
Hongtao Yu	d5a963ab8b	[PseudoProbe] Replace relocation with offset for entry probe. Currently pseudo probe encoding for a function is like: - For the first probe, a relocation from it to its physical position in the code body - For subsequent probes, an incremental offset from the current probe to the previous probe The relocation could potentially cause relocation overflow during link time. I'm now replacing it with an offset from the first probe to the function start address. A source function could be lowered into multiple binary functions due to outlining (e.g, coro-split). Since those binary function have independent link-time layout, to really avoid relocations from .pseudo_probe sections to .text sections, the offset to replace with should really be the offset from the probe's enclosing binary function, rather than from the entry of the source function. This requires some changes to previous section-based emission scheme which now switches to be function-based. The assembly form of pseudo probe directive is also changed correspondingly, i.e, reflecting the binary function name. Most of the source functions end up with only one binary function. For those don't, a sentinel probe is emitted for each of the binary functions with a different name from the source. The sentinel probe indicates the binary function name to differentiate subsequent probes from the ones from a different binary function. For examples, given source function ``` Foo() { … Probe 1 … Probe 2 } ``` If it is transformed into two binary functions: ``` Foo: … Foo.outlined: … ``` The encoding for the two binary functions will be separate: ``` GUID of Foo Probe 1 GUID of Foo Sentinel probe of Foo.outlined Probe 2 ``` Then probe1 will be decoded against binary `Foo`'s address, and Probe 2 will be decoded against `Foo.outlined`. The sentinel probe of `Foo.outlined` makes sure there's not accidental relocation from `Foo.outlined`'s probes to `Foo`'s entry address. On the BOLT side, to be minimal intrusive, the pseudo probe re-encoding sticks with the old encoding format. This is fine since unlike linker, Bolt processes the pseudo probe section as a whole and it is free from relocation overflow issues. The change is downwards compatible as long as there's no mixed use of the old encoding and the new encoding. Reviewed By: wenlei, maksfb Differential Revision: https://reviews.llvm.org/D135912 Differential Revision: https://reviews.llvm.org/D135914 Differential Revision: https://reviews.llvm.org/D136394	2022-10-27 13:28:22 -07:00
wlei	bf97c1e066	[NFC] fix a wrong name change during rebase	2022-10-25 21:16:29 -07:00
wlei	91cc53d5a4	[llvm-profgen] Do not cache the frame location stack during computing inlined context size In `computeInlinedContextSizeForRange`, the offset of range is only used one time, there is no need to cache the frame location stack. Measured on one internal service binary, this can save 2GB memory usage and reduce a small run time (avoid one hash search). Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D128859	2022-10-25 21:08:36 -07:00
wlei	467652486f	[llvm-profgen] Fix inconsistent loading address issues This is to fix two issues related with loading address: 1) When multiple MMAPs occur and their loading address are different, before it only used the first MMap as base address, all perf address after it used the wrong base address. 2) For pseudo probe profile, the address is always based on preferred loading address. If the base address is not equal to the preferred loading address, the pseudo probe address query will be wrong. Solution: Instead of converting the address to offset lazily, right now all the address after parsing are converted on the fly based on preferred loading address in the parsing time. There is no "offset" used in profile generator any more. Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D126827	2022-10-13 23:19:30 -07:00
wlei	7e86b13c63	[CSSPGO][llvm-profgen] Reimplement SampleContextTracker using context trie This is the followup patch to https://reviews.llvm.org/D125246 for the `SampleContextTracker` part. Before the promotion and merging of the context is based on the SampleContext(the array of frame), this causes a lot of cost to the memory. This patch detaches the tracker from using the array ref instead to use the context trie itself. This can save a lot of memory usage and benefit both the compiler's CS inliner and llvm-profgen's pre-inliner. One structure needs to be specially treated is the `FuncToCtxtProfiles`, this is used to get all the functionSamples for one function to do the merging and promoting. Before it search each functions' context and traverse the trie to get the node of the context. Now we don't have the context inside the profile, instead we directly use an auxiliary map `ProfileToNodeMap` for profile , it initialize to create the FunctionSamples to TrieNode relations and keep updating it during promoting and merging the node. Moreover, I was expecting the results before and after remain the same, but I found that the order of FuncToCtxtProfiles matter and affect the results. This can happen on recursive context case, but the difference should be small. Now we don't have the context, so I just used a vector for the order, the result is still deterministic. Measured on one huge size(12GB) profile from one of our internal service. The profile similarity difference is 99.999%, and the running time is improved by 3X(debug mode) and the memory is reduced from 170GB to 90GB. Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D127031	2022-06-27 23:22:21 -07:00
Hongtao Yu	9f732af583	[llvm-profgen] Filter out oversized LBR ranges. As a follow up to {D123271}, LBR ranges that are too big should also be considered as invalid. For example, the last two pairs in the following trace form a range [0x0d7b02b0, 0x368ba706] that covers a ton of functions in the binary. Such oversized range should also be ignored. 0x0c74505f/0x368b99a0 0x368ba706/0x0c745040 0x0d7b1c3f/0x0d7b02b0 Add a defensive check to filter out those ranges based that the valid range should not cross the unconditional branch(Call, return, unconditional jmp). Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D125448	2022-05-12 10:58:50 -07:00
Hongtao Yu	3f97016857	[llvm-profgen] Decoding pseudo probe for profiled function only. Complete pseudo probes decoding can result in large memory usage. In practice only a small porting of the decoded probes are used in profile generation. I'm changing the full decoding mode to be decoding for profiled functions only, though we still do a full scan of the .pseudoprobe section due to a missing table-of-content but we don't have to build the in-memory data structure for functions not sampled. To build the in-memory data structure for profiled functions only, I'm rewriting the previous non-recursive probe decoding logic to be recursive. This is easy to read and maintain. I also have to change the previous representation of unsymbolized context from probe-based stack to address-based stack since the profiled functions are unknown yet by the time of virtual unwinding. The address-based stack will be converted to probe-based stack after virtual unwinding and on-demand probe decoding. I'm seeing 20GB memory is saved for one of our internal large service. Reviewed By: wenlei Differential Revision: https://reviews.llvm.org/D121643	2022-03-23 14:15:11 -07:00
serge-sans-paille	fc97efa409	Cleanup includes: ProfileData Estimation of the impact on preprocessor output: before: 1067349756 after: 1065940348 Discourse thread: https://discourse.llvm.org/t/include-what-you-use-include-cleanup Differential Revision: https://reviews.llvm.org/D120434	2022-02-24 13:25:11 +01:00
wlei	b3a778fb5e	[llvm-profgen] Support symbol loading for debug fission Support to load debug info from dwarf split file, like .dwo, .dwp files. Leverage the `getNonSkeletonUnitDIE(false)` API to achieve this. Add test cause to make sure all the ranges is well retrieved by the loader. Reviewed By: ayermolo, hoy, wenlei Differential Revision: https://reviews.llvm.org/D115973	2022-02-23 09:40:46 -08:00
Hongtao Yu	34e131b0f2	[llvm-profgen] On-demand track optimized-away inlinees for preinliner. Tracking optimized-away inlinees based on all probes in a binary is expansive in terms of memory usage I'm making the tracking on-demand based on profiled functions only. This saves about 10% memory overall for a medium-sized benchmark. Before: note: After parsePerfTraces note: Thu Jan 27 18:42:09 2022 note: VM: 8.68 GB RSS: 8.39 GB note: After computeSizeForProfiledFunctions note: Thu Jan 27 18:42:41 2022 note: VM: 10.63 GB RSS: 10.20 GB note: After generateProbeBasedProfile note: Thu Jan 27 18:45:49 2022 note: VM: 25.00 GB RSS: 24.95 GB note: After postProcessProfiles note: Thu Jan 27 18:49:29 2022 note: VM: 26.34 GB RSS: 26.27 GB After: note: After parsePerfTraces note: Fri Jan 28 12:04:49 2022 note: VM: 8.68 GB RSS: 7.65 GB note: After computeSizeForProfiledFunctions note: Fri Jan 28 12:05:26 2022 note: VM: 8.68 GB RSS: 8.42 GB note: After generateProbeBasedProfile note: Fri Jan 28 12:08:03 2022 note: VM: 22.93 GB RSS: 22.89 GB note: After postProcessProfiles note: Fri Jan 28 12:11:30 2022 note: VM: 24.27 GB RSS: 24.22 GB This should be a no-diff change in terms of profile quality. Reviewed By: wenlei Differential Revision: https://reviews.llvm.org/D118515	2022-02-08 08:33:23 -08:00
wlei	6693c562f9	[llvm-profgen] Support to load debug info from a second binary For reducing binary size purpose, the binary's debug info and executable segment can be separated(like using objcopy --only-keep-debug). Here add support in llvm-profgen to use two binaries as input. The original one is executable binary and added for debug info only binary. Adding a flag `--debug-binary=file-path`, with this, the binary will load debug info from debug binary. Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D115948	2022-01-24 17:14:05 -08:00
wlei	0f53df864e	[CSSPGO][llvm-profgen] Fix external address issues of perf reader (return to external addr part) Before we have an issue with artificial LBR whose source is a return, recalling that "an internal code(A) can return to external address, then from the external address call a new internal code(B), making an artificial branch that looks like a return from A to B can confuse the unwinder". We just ignore the LBRs after this artificial LBR which can miss some samples. This change aims at fixing this by correctly unwinding them instead of ignoring them. List some typical scenarios covered by this change. 1) multiple sequential call back happen in external address, e.g. ``` [ext, call, foo] [foo, return, ext] [ext, call, bar] ``` Unwinder should avoid having foo return from bar. Wrong call stack is like [foo, bar] 2) the call stack before and after external call should be correctly unwinded. ``` {call stack1} {call stack2} [foo, call, ext] [ext, call, bar] [bar, return, ext] [ext, return, foo ] ``` call stack 1 should be the same to call stack2. Both shouldn't be truncated 3) call stack should be truncated after call into external code since we can't do inlining with external code. ``` [foo, call, ext] [ext, call, bar] [bar, call, baz] [baz, return, bar ] [bar, return, ext] ``` the call stack of code in baz should not include foo. ### Implementation: We leverage artificial frame to fix #2 and #3: when we got a return artificial LBR, push an extra artificial frame to the stack. when we pop frame, check if the parent is an artificial frame to pop(fix #2). Therefore, call/ return artificial LBR is just the same as regular LBR which can keep the call stack. While recording context on the trie, artificial frame is used as a tag indicating that we should truncate the call stack(fix #3). To differentiate #1 and #2, we leverage `getCallAddrFromFrameAddr`. Normally the target of the return should be the next inst of a call inst and `getCallAddrFromFrameAddr` will return the address of call inst. Otherwise, getCallAddrFromFrameAddr will return to 0 which is the case of #1. Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D115550	2021-12-14 16:40:54 -08:00
wlei	484a569eea	[llvm-profgen] Fix total samples related issues Since total sample and body sample are used to compute hotness threshold in compiler, we found in some services changing the total samples computation will cause noticeable regression. Hence, here we will revert the changes and just keep all total samples number identical to the old tool. Three changes in this diff: 1. Revert previous diff(https://reviews.llvm.org/D112672: [llvm-profgen] Update total samples by accumulating all its body samples) and put it under a switch. 2. Keep the negative line number. Although compiler doesn't consume the count but it will be used to compute hot threshold. 3. Change to accumulate total samples per byte instead of per instruction. Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D115013	2021-12-08 12:33:41 -08:00
wlei	f15a854567	[llvm-profgen] Truncate the context with zero probe ID Due to the debug info merging, there may have some contexts with zero probe id, we should truncate the context to avoid misleading pre-inliner. Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D114284	2021-11-30 16:21:25 -08:00
wlei	41a681ce09	[FS-AFDO][llvm-profgen] Generate profile with FS-AFDO discriminator In order to support generating profile with FS discriminator, three kind of changes are done in llvm-profgen: 1) Dissassemble .rodata section to check if FS discriminator var ('"__llvm_fs_discriminator__"') exists and set the corresponding flag in the binary. 2) Change the discriminator decoding in `getBaseDiscriminator` and `getDuplicationFactor`. 3) set true for `FunctionSamples::ProfileIsFS` to enable FS functionality in ProfileData. Reviewed By: xur, hoy, wenlei Differential Revision: https://reviews.llvm.org/D113296	2021-11-30 15:57:59 -08:00
wlei	c2e08aba1a	[llvm-profgen] Compute and show profile density AutoFDO performance is sensitive to profile density, i.e., the amount of samples in the profile relative to the program size, because profiles with insufficient samples could be inaccurate due to statistical noise and thus hurt AutoFDO performance. A previous investigation showed that AutoFDO performed better on MySQL with increased amount of samples. Therefore, we implement a profile-density computation feature to give hints about profile density to users and the compiler. We define the density of a profile Prof as follows: - For each function A in the profile, density(A) = total_samples(A) / sizeof(A). - density(Prof) = min(density(A)) for all functions A that are warm (defined below). A function is considered warm if its total-samples is within top N percent of the profile. For implementation, we reuse the `ProfileSummaryBuilder::getHotCountThreshold(..)` as threshold which can be set by percent(`--profile-summary-cutoff-hot`) or by value(`--profile-summary-hot-count`). We also introduce `--hot-function-density-threshold` to set hot function density threshold and will give suggestion if profile density is below it which implies we should increase samples. This also applies for CS profile with all profiles merged into base. Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D113781	2021-11-29 23:54:31 -08:00
Wenlei He	f7976edc1e	[llvm-profgen] Add switch to allow use of first loadable segment for calculating offset Adding `-use-loadable-segment-as-base` to allow use of first loadable segment for calculating offset. By default first executable segment is used for calculating offset. The switch helps compatibility with unsymbolized profile generated from older tools. Differential Revision: https://reviews.llvm.org/D113727	2021-11-15 19:00:27 -08:00
wlei	aab1810006	[llvm-profgen] Fix bug of setting function entry Previously we set `isFuncEntry` flag to true when the funcName from DWARF is equal to the name in symbol table and we use this flag to ignore reporting callsite sample that's from an intra func branch. However, in HHVM, it appears that the symbol table name is inconsistent with the dwarf info func name, it's likely due to `OptimizeGlobalAliases`. This change is a workaround in llvm-profgen side to mark the only one range as the function entry and add warnings for the remaining inconsistence. This also fixed a missing `getCanonicalFnName` for symbol name which caused the mismatching as well. Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D113492	2021-11-12 12:18:43 -08:00
wlei	5bf191a381	[llvm-profgen] Fix index out of bounds error while using ip.advance Previously we assume there're some non-executing sections at the bottom of the text section so that we won't hit the array's bound. But on BOLTed binary, it turned out .bolt section is at the bottom of text section which can be profiled, then it crash llvm-profgen. This change try to fix it. Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D113238	2021-11-05 18:38:40 -07:00
wlei	138202a8c3	[llvm-profgen] Warn on invalid range and show warning summary Two things in this diff: 1) Warn on the invalid range, currently three types of checking, see the detailed message in the code. 2) In some situation, llvm-profgen gives lots of warnings on the truncated stacks which is noisy. This change provides a switch to `--show-detailed-warning` to skip the warnings. Alternatively, we use a summary for those warning and show the percentage of cases with those issues. Example of warning summary. ``` warning: 0.05%(1120/2428958) cases with issue: Profile context truncated due to missing probe for call instruction. warning: 0.00%(2/178637) cases with issue: Range does not belong to any functions, likely from external function. ``` Reviewed By: hoy Differential Revision: https://reviews.llvm.org/D111902	2021-11-02 19:55:55 -07:00
wlei	3f3103c6a9	[llvm-profgen] Fill zero count for all function ranges Allow filling zero count for all the function ranges even there is no samples hitting that function. Add a switch for this. Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D112858	2021-11-01 09:57:05 -07:00
wlei	2f8196db92	[llvm-profgen] Fix bug of populating profile symbol list Previous implementation of populating profile symbol list is wrong, it only included the profiled symbols. Actually it should use all symbols, here this switches to use the symbols from debug info. Also turned the flag off by default. Reviewed By: wenlei, hoy Differential Revision: https://reviews.llvm.org/D111824	2021-10-29 09:59:12 -07:00
wlei	40ca411251	[llvm-profgen] Switch to DWARF-based symbol and ranges It happened a bug that some callsite name in the profile is not a real function, it turned out that there're some non-function symbol from the ELF text section, e.g. the global accessible branch label and also recalled that we can have one function being split into multiple ranges. We shouldn't count samples for those are not the entry of the real function. So this change tried to fix this issue by switching to use the name or ranges from DWARF-based debug info, the range of which assure it's the real function start. For the split functions, we assume that the real entry function's DWARF name should always match the symbol table name. The switching is also consistent with the body samples' symbol which is from DWARF. Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D112282	2021-10-29 09:59:12 -07:00
wlei	ce40843a3f	[llvm-profgen][CSSPGO] On-demand function size computation for preinliner Similar to https://reviews.llvm.org/D110465, we can compute function size on-demand for the functions that's hit by samples. Here we leverage the raw range samples' address to compute a set of sample hit function. Then `BinarySizeContextTracker` just works on those function range for the size. Reviewed By: hoy Differential Revision: https://reviews.llvm.org/D110466	2021-09-28 09:09:38 -07:00
wlei	091c16f76b	[llvm-profgen] On-demand symbolization Previously we do symbolization for all the functions and actually we only need the symbols that's hit by the samples. This can significantly speed up the time for large size binary. Optimization for per-inliner will come along with next patch. Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D110465	2021-09-28 09:09:25 -07:00
wlei	28277e9b48	[AutoFDO][llvm-profgen] Report zero count for unexecuted part of function code In order to be consistent with compiler that interprets zero count as unexecuted(cold), this change reports zero-value count for unexecuted part of function code. For the implementation, it leverages the range counter, initializes all the executed function range with the zero-value. After all ranges are merged and converted into disjoint ranges, the remaining zero count will indicates the unexecuted(cold) part of the function. This change also extends the current `findDisjointRanges` method which now can support adding zero-value range. Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D109713	2021-09-24 14:15:05 -07:00
wlei	d5f2013004	[AutoFDO][llvm-profgen] Profile generation for LBR(non-CS) sample This patch introduces non-CS AutoFDO profile generation into LLVM. The profile is supposed to be well consumed by compiler using `-fprofile-sample-use=[profile]`. After range and branch counters are extracted from the LBR sample, here we go through each addresses for symbolization, create FunctionSamples and populate its sub fields like TotalSamples, BodySamples and HeadSamples etc. For inlined code, as we need to map back to original code, so we always add body samples to the leaf frame's function sample. Reviewed By: wenlei, hoy Differential Revision: https://reviews.llvm.org/D109551	2021-09-24 13:55:34 -07:00
Hongtao Yu	734f4d832c	[llvm-profgen] An option to dump disasm of specified symbols For large app, dumping disasm of the whole program can be slow and result in gianant output. Adding a switch to dump specific symbols only. Reviewed By: wlei Differential Revision: https://reviews.llvm.org/D110079	2021-09-22 10:32:59 -07:00
Hongtao Yu	0057c7185d	[CSSPGO][llvm-profgen] Truncate stack samples with invalid return address. Invalid frame addresses exist in call stack samples due to bad unwinding. This could happen to frame-pointer-based unwinding and the callee functions that do not have the frame pointer chain set up. It isn't common when the program is built with the frame pointer omission disabled, but can still happen with third-party static libs built with frame pointer omitted. Reviewed By: wenlei Differential Revision: https://reviews.llvm.org/D109638	2021-09-14 21:56:22 -07:00
wlei	964053d56f	[llvm-profgen] Support LBR only perf script This change aims at supporting LBR only sample perf script which is used for regular(Non-CS) profile generation. A LBR perf script includes a batch of LBR sample which starts with a frame pointer and a group of 32 LBR entries is followed. The FROM/TO LBR pair and the range between two consecutive entries (the former entry's TO and the latter entry's FROM) will be used to infer function profile info. An example of LBR perf script(created by `perf script -F ip,brstack -i perf.data`) ``` 40062f 0x40062f/0x4005b0/P/-/-/9 0x400645/0x4005ff/P/-/-/1 0x400637/0x400645/P/-/-/1 ... 4005d7 0x4005d7/0x4005e5/P/-/-/8 0x40062f/0x4005b0/P/-/-/6 0x400645/0x4005ff/P/-/-/1 ... ... ``` For implementation: - Extended a new child class `LBRPerfReader` for the sample parsing, reused all the functionalities in `extractLBRStack` except for an extension to parsing leading instruction pointer. - `HybridSample` is reused(just leave the call stack empty) and the parsed samples is still aggregated in `AggregatedSamples`. After that, range samples, branch sample, address samples are computed and recorded. - Reused `ContextSampleCounterMap` to store the raw profile, since it's no need to aggregation by context, here it just registered one sample counter with a fake context key. - Unified to use `show-raw-profile` instead of `show-unwinder-output` to dump the intermediate raw profile, see the comments of the format of the raw profile. For CS profile, it remains to output the unwinder output. Profile generation part will come soon. Differential Revision: https://reviews.llvm.org/D108153	2021-08-31 13:28:17 -07:00
Hongtao Yu	b9db70369b	[CSSPGO] Split context string to deduplicate function name used in the context. Currently context strings contain a lot of duplicated function names and that significantly increase the profile size. This change split the context into a series of {name, offset, discriminator} tuples so function names used in the context can be replaced by the index into the name table and that significantly reduce the size consumed by context. A follow-up improvement made in the compiler and profiling tools is to avoid reconstructing full context strings which is time- and memory- consuming. Instead a context vector of `StringRef` is adopted to represent the full context in all scenarios. As a result, the previous prevalent profile map which was implemented as a `StringRef` is now engineered as an unordered map keyed by `SampleContext`. `SampleContext` is reshaped to using an `ArrayRef` to represent a full context for CS profile. For non-CS profile, it falls back to use `StringRef` to represent a contextless function name. Both the `ArrayRef` and `StringRef` objects are underpinned by real array and string objects that are stored in producer buffers. For compiler, they are maintained by the sample reader. For llvm-profgen, they are maintained in `ProfiledBinary` and `ProfileGenerator`. Full context strings can be generated only in those cases of debugging and printing. When it comes to profile format, nothing has changed to the text format, though internally CS context is implemented as a vector. Extbinary format is only changed for CS profile, with an additional `SecCSNameTable` section which stores all full contexts logically in the form of `vector<int>`, which each element as an offset points to `SecNameTable`. All occurrences of contexts elsewhere are redirected to using the offset of `SecCSNameTable`. Testing This is no-diff change in terms of code quality and profile content (for text profile). For our internal large service (aka ads), the profile generation is cut to half, with a 20x smaller string-based extbinary format generated. The compile time of ads is dropped by 25%. Differential Revision: https://reviews.llvm.org/D107299	2021-08-30 20:09:29 -07:00
Wenlei He	a6f15e9a49	[CSSPGO] Use probe inline tree to track zero size fully optimized context for pre-inliner This is a follow up diff for BinarySizeContextTracker to track zero size for fully optimized inlinee. When an inlinee is fully optimized away, we won't be able to get its size through symbolizing instructions, hence we will treat the corresponding context size as unknown. However by traversing the inlined probe forest, we know what're original inlinees regardless of optimization. If a context show up in inlined probes, but not during symbolization, we know that it's fully optimized away hence its size is zero instead of unknown. It should provide more accurate size cost estimation for pre-inliner to make better inline decisions in llvm-profgen. Differential Revision: https://reviews.llvm.org/D108350	2021-08-25 09:01:11 -07:00
Wenlei He	eca03d2768	[CSSPGO] Track and use context-sensitive post-optimization function size to drive global pre-inliner in llvm-profgen This change enables llvm-profgen to use accurate context-sensitive post-optimization function byte size as a cost proxy to drive global preinline decisions. To do this, BinarySizeContextTracker is introduced to track function byte size under different inline context during disassembling. In preinliner, we can not query context byte size under switch `context-cost-for-preinliner`. The tracker uses a reverse trie to keep size of functions under different context (callee as parent, caller as child), and it can give best/longest possible matching context size for given input context. The new size cost is off by default. There're a few TODOs that needs to addressed: 1) avoid dangling string from `Offset2LocStackMap`, which will be addressed in split context work; 2) using inlinee's entry probe to make sure we have correct zero size for inlinee that's completely optimized away after inlining. Some tuning is also needed. Differential Revision: https://reviews.llvm.org/D108180	2021-08-18 22:50:57 -07:00
wlei	f812c19253	[llvm-profgen] Clean up code dealing with multiple binaries As we decided to support only one binary each time, this patch cleans up the related code dealing with multiple binaries. We can use `llvm-profdata` to merge profile from multiple binaries. Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D108002	2021-08-17 12:16:07 -07:00
jamesluox	ee7d20e846	[CSSPGO] Migrate and refactor the decoder of Pseudo Probe Migrate pseudo probe decoding logic in llvm-profgen to MC, so other LLVM-base program could reuse existing codes. Redesign object layout of encoded and decoded pseudo probes. Reviewed By: hoy Differential Revision: https://reviews.llvm.org/D106861	2021-08-04 09:21:34 -07:00
wlei	fe3ba90830	[llvm-profgen] Support perf script without parsing MMap events This change supports to run without parsing MMap binary loading events instead it always assumes binary is loaded at the preferred address. This is used when we have assured no binary load address changes or we have pre-processed the addresses resolution. Warn if there's interior mmap event but without leading mmap events. Reviewed By: hoy Differential Revision: https://reviews.llvm.org/D107097	2021-08-03 10:01:07 -07:00
wlei	6da9241aab	[llvm-profgen] Refactor PerfReader to allow different types of perf scripts In order to support different types of perf scripts, this change tried to refactor `PerfReader` by adding the base class `PerfReaderBase` and current HybridPerfReader is derived from it for CS profile generation. Common functions like, passMM2PEvents, extract_lbrs, extract_callstack, etc. can be reused. Next step is to add LBR only reader(for non-CS profile) and aggregated perf scripts reader(do a pre-aggregation of scripts). Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D107014	2021-08-02 17:18:47 -07:00
Hongtao Yu	6b04ecaab3	[CSSPGO][llvm-profgen] Fix a missing initalization Fixing a missing initalization that accidentaly caused by https://reviews.llvm.org/D103178 .	2021-07-13 19:49:55 -07:00
Hongtao Yu	597e9c61ce	Revert "[CSSPGO][llvm-profgen] Fix a missing initalization" This reverts commit `fef5f4456a`.	2021-07-13 19:48:58 -07:00
Hongtao Yu	fef5f4456a	[CSSPGO][llvm-profgen] Fix a missing initalization Fixing a missing initalization that accidentaly caused by https://reviews.llvm.org/D103178 .	2021-07-13 19:46:18 -07:00
Hongtao Yu	0712038458	[CSSPGO][llvm-profgen] Allow multiple executable load segments. The linker or post-link optimizer can create an ELF image with multiple executable segments each of which will be loaded separately at run time. This breaks the assumption of llvm-profgen that currently only supports one base load address. What it ends up with is that the subsequent mmap events will be treated as an overwrite of the first mmap event which will in turn screw up address mapping. While it is non-trivial to support multiple separate load addresses and given that on x64 those segments will always be loaded at consecutive addresses (though via separate mmap sys calls), I'm adding an error checking logic to bail out if that's violated and keep using a single load address which is the address of the first executable segment. Also changing the disassembly output from printing section offset to printing the virtual address instead, which matches the behavior of objdump. Differential Revision: https://reviews.llvm.org/D103178	2021-07-13 18:22:24 -07:00
Wenlei He	1410db70b9	[CSSPGO] Add attribute metadata for context profile This changes adds attribute field for metadata of context profile. Currently we have an inline attribute that indicates whether the leaf frame corresponding to a context profile was inlined in previous build. This will be used to help estimating inlining and be taken into account when trimming context. Changes for that in llvm-profgen will follow. It will also help tuning. Differential Revision: https://reviews.llvm.org/D98823	2021-03-18 22:00:56 -07:00
Yang Fan	64fc9cc723	[CSSPGO][llvm-profgen] Fix gcc Wcast-qual warning (NFC) GCC warning: ``` [3397/3703] Building CXX object tools/llvm-profgen/CMakeFiles/llvm-profgen.dir/llvm-profgen.cpp.o In file included from /llvm-project/llvm/include/llvm/ADT/STLExtras.h:19, from /llvm-project/llvm/include/llvm/ADT/StringRef.h:12, from /llvm-project/llvm/include/llvm/ADT/Twine.h:13, from /llvm-project/llvm/tools/llvm-profgen/ErrorHandling.h:12, from /llvm-project/llvm/tools/llvm-profgen/llvm-profgen.cpp:13: /llvm-project/llvm/include/llvm/ADT/Optional.h: In instantiation of ‘void llvm::optional_detail::OptionalStorage<T, <anonymous> >::emplace(Args&& ...) [with Args = {const std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, llvm::sampleprof::LineLocation>}; T = const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>; bool <anonymous> = false]’: /llvm-project/llvm/include/llvm/ADT/Optional.h:79:7: required from ‘constexpr llvm::optional_detail::OptionalStorage<T, <anonymous> >::OptionalStorage(llvm::optional_detail::OptionalStorage<T, <anonymous> >&&) [with T = const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>; bool <anonymous> = false]’ /llvm-project/llvm/include/llvm/ADT/Optional.h:253:13: required from here /llvm-project/llvm/include/llvm/ADT/Optional.h:113:12: warning: cast from type ‘const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>’ to type ‘void’ casts away qualifiers [-Wcast-qual] 113 \| ::new ((void )std::addressof(value)) T(std::forward<Args>(args)...); \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [3398/3703] Building CXX object tools/llvm-profgen/CMakeFiles/llvm-profgen.dir/PerfReader.cpp.o In file included from /llvm-project/llvm/include/llvm/ADT/STLExtras.h:19, from /llvm-project/llvm/include/llvm/ADT/StringRef.h:12, from /llvm-project/llvm/include/llvm/ADT/Twine.h:13, from /llvm-project/llvm/tools/llvm-profgen/ErrorHandling.h:12, from /llvm-project/llvm/tools/llvm-profgen/PerfReader.h:11, from /llvm-project/llvm/tools/llvm-profgen/PerfReader.cpp:8: /llvm-project/llvm/include/llvm/ADT/Optional.h: In instantiation of ‘void llvm::optional_detail::OptionalStorage<T, <anonymous> >::emplace(Args&& ...) [with Args = {const std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, llvm::sampleprof::LineLocation>}; T = const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>; bool <anonymous> = false]’: /llvm-project/llvm/include/llvm/ADT/Optional.h:79:7: required from ‘constexpr llvm::optional_detail::OptionalStorage<T, <anonymous> >::OptionalStorage(llvm::optional_detail::OptionalStorage<T, <anonymous> >&&) [with T = const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>; bool <anonymous> = false]’ /llvm-project/llvm/include/llvm/ADT/Optional.h:253:13: required from here /llvm-project/llvm/include/llvm/ADT/Optional.h:113:12: warning: cast from type ‘const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>’ to type ‘void’ casts away qualifiers [-Wcast-qual] 113 \| ::new ((void )std::addressof(value)) T(std::forward<Args>(args)...); \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [3399/3703] Building CXX object tools/llvm-profgen/CMakeFiles/llvm-profgen.dir/ProfiledBinary.cpp.o In file included from /llvm-project/llvm/include/llvm/ADT/STLExtras.h:19, from /llvm-project/llvm/include/llvm/ADT/ArrayRef.h:15, from /llvm-project/llvm/include/llvm/ADT/DenseMapInfo.h:18, from /llvm-project/llvm/include/llvm/ADT/DenseMap.h:16, from /llvm-project/llvm/include/llvm/ADT/DenseSet.h:16, from /llvm-project/llvm/include/llvm/ProfileData/SampleProf.h:17, from /llvm-project/llvm/tools/llvm-profgen/CallContext.h:12, from /llvm-project/llvm/tools/llvm-profgen/ProfiledBinary.h:12, from /llvm-project/llvm/tools/llvm-profgen/ProfiledBinary.cpp:9: /llvm-project/llvm/include/llvm/ADT/Optional.h: In instantiation of ‘void llvm::optional_detail::OptionalStorage<T, <anonymous> >::emplace(Args&& ...) [with Args = {const std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, llvm::sampleprof::LineLocation>}; T = const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>; bool <anonymous> = false]’: /llvm-project/llvm/include/llvm/ADT/Optional.h:79:7: required from ‘constexpr llvm::optional_detail::OptionalStorage<T, <anonymous> >::OptionalStorage(llvm::optional_detail::OptionalStorage<T, <anonymous> >&&) [with T = const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>; bool <anonymous> = false]’ /llvm-project/llvm/include/llvm/ADT/Optional.h:253:13: required from here /llvm-project/llvm/include/llvm/ADT/Optional.h:113:12: warning: cast from type ‘const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>’ to type ‘void’ casts away qualifiers [-Wcast-qual] 113 \| ::new ((void )std::addressof(value)) T(std::forward<Args>(args)...); \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [3404/3703] Building CXX object tools/llvm-profgen/CMakeFiles/llvm-profgen.dir/ProfileGenerator.cpp.o In file included from /llvm-project/llvm/include/llvm/ADT/STLExtras.h:19, from /llvm-project/llvm/include/llvm/ADT/StringRef.h:12, from /llvm-project/llvm/include/llvm/ADT/Twine.h:13, from /llvm-project/llvm/tools/llvm-profgen/ErrorHandling.h:12, from /llvm-project/llvm/tools/llvm-profgen/ProfileGenerator.h:11, from /llvm-project/llvm/tools/llvm-profgen/ProfileGenerator.cpp:9: /llvm-project/llvm/include/llvm/ADT/Optional.h: In instantiation of ‘void llvm::optional_detail::OptionalStorage<T, <anonymous> >::emplace(Args&& ...) [with Args = {const std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, llvm::sampleprof::LineLocation>}; T = const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>; bool <anonymous> = false]’: /llvm-project/llvm/include/llvm/ADT/Optional.h:79:7: required from ‘constexpr llvm::optional_detail::OptionalStorage<T, <anonymous> >::OptionalStorage(llvm::optional_detail::OptionalStorage<T, <anonymous> >&&) [with T = const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>; bool <anonymous> = false]’ /llvm-project/llvm/include/llvm/ADT/Optional.h:253:13: required from here /llvm-project/llvm/include/llvm/ADT/Optional.h:113:12: warning: cast from type ‘const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>’ to type ‘void’ casts away qualifiers [-Wcast-qual] 113 \| ::new ((void )std::addressof(value)) T(std::forward<Args>(args)...); \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ```	2021-02-18 17:10:50 +08:00
wlei	afd8bd601e	[CSSPGO][llvm-profgen] Filter out the instructions without location info for symbolizer It appears some instructions doesn't have the debug location info and the symbolizer will return an empty call stack for them which will cause some crash later in profile unwinding. Actually we do not record the sample info for them, so this change just filter out those instruction. As those instruction would appears at the begin and end of the instruction list, without them we need to add the boundary check for IP `advance` and `backward`. Also for pseudo probe based profile, we actually don't need the symbolized location info, so here just change to use an empty stack for it. This could save half of the binary loading time. Differential Revision: https://reviews.llvm.org/D96434	2021-02-12 16:47:49 -08:00
wlei	3869309a0c	[CSSPGO][llvm-profgen] Aggregate samples on call frame trie to speed up profile generation For CS profile generation, the process of call stack unwinding is time-consuming since for each LBR entry we need linear time to generate the context( hash, compression, string concatenation). This change speeds up this by grouping all the call frame within one LBR sample into a trie and aggregating the result(sample counter) on it, deferring the context compression and string generation to the end of unwinding. Specifically, it uses `StackLeaf` as the top frame on the stack and manipulates(pop or push a trie node) it dynamically during virtual unwinding so that the raw sample can just be recoded on the leaf node, the path(root to leaf) will represent its calling context. In the end, it traverses the trie and generates the context on the fly. Results: Our internal branch shows about 5X speed-up on some large workloads in SPEC06 benchmark. Differential Revision: https://reviews.llvm.org/D94110	2021-02-04 08:43:21 -08:00
wlei	ac14bb14e7	[CSSPGO][llvm-profgen] Compress recursive cycles in calling context This change compresses the context string by removing cycles due to recursive function for CS profile generation. Removing recursion cycles is a way to normalize the calling context which will be better for the sample aggregation and also make the context promoting deterministic. Specifically for implementation, we recognize adjacent repeated frames as cycles and deduplicated them through multiple round of iteration. For example: Considering a input context string stack: [“a”, “a”, “b”, “c”, “a”, “b”, “c”, “b”, “c”, “d”] For first iteration,, it removed all adjacent repeated frames of size 1: [“a”, “b”, “c”, “a”, “b”, “c”, “b”, “c”, “d”] For second iteration, it removed all adjacent repeated frames of size 2: [“a”, “b”, “c”, “a”, “b”, “c”, “d”] So in the end, we get compressed output: [“a”, “b”, “c”, “d”] Compression will be called in two place: one for sample's context key right after unwinding, one is for the eventual context string id in the ProfileGenerator. Added a switch `compress-recursion` to control the size of duplicated frames, default -1 means no size limit. Added unit tests and regression test for this. Differential Revision: https://reviews.llvm.org/D93556	2021-02-03 22:16:07 -08:00
wlei	6bccdcdb35	Revert "[CSSPGO][llvm-profgen] Compress recursive cycles in calling context" This reverts commit `0609f257dc`.	2021-02-03 22:16:05 -08:00
wlei	08e8bb60cf	Revert "[CSSPGO][llvm-profgen] Aggregate samples on call frame trie to speed up profile generation" This reverts commit `1714ad2336`.	2021-02-03 22:16:05 -08:00
wlei	1714ad2336	[CSSPGO][llvm-profgen] Aggregate samples on call frame trie to speed up profile generation For CS profile generation, the process of call stack unwinding is time-consuming since for each LBR entry we need linear time to generate the context( hash, compression, string concatenation). This change speeds up this by grouping all the call frame within one LBR sample into a trie and aggregating the result(sample counter) on it, deferring the context compression and string generation to the end of unwinding. Specifically, it uses `StackLeaf` as the top frame on the stack and manipulates(pop or push a trie node) it dynamically during virtual unwinding so that the raw sample can just be recoded on the leaf node, the path(root to leaf) will represent its calling context. In the end, it traverses the trie and generates the context on the fly. Results: Our internal branch shows about 5X speed-up on some large workloads in SPEC06 benchmark. Differential Revision: https://reviews.llvm.org/D94110	2021-02-03 18:50:14 -08:00
wlei	0609f257dc	[CSSPGO][llvm-profgen] Compress recursive cycles in calling context This change compresses the context string by removing cycles due to recursive function for CS profile generation. Removing recursion cycles is a way to normalize the calling context which will be better for the sample aggregation and also make the context promoting deterministic. Specifically for implementation, we recognize adjacent repeated frames as cycles and deduplicated them through multiple round of iteration. For example: Considering a input context string stack: [“a”, “a”, “b”, “c”, “a”, “b”, “c”, “b”, “c”, “d”] For first iteration,, it removed all adjacent repeated frames of size 1: [“a”, “b”, “c”, “a”, “b”, “c”, “b”, “c”, “d”] For second iteration, it removed all adjacent repeated frames of size 2: [“a”, “b”, “c”, “a”, “b”, “c”, “d”] So in the end, we get compressed output: [“a”, “b”, “c”, “d”] Compression will be called in two place: one for sample's context key right after unwinding, one is for the eventual context string id in the ProfileGenerator. Added a switch `compress-recursion` to control the size of duplicated frames, default -1 means no size limit. Added unit tests and regression test for this. Differential Revision: https://reviews.llvm.org/D93556	2021-02-03 18:50:14 -08:00

1 2

59 Commits