Commit Graph

59 Commits

Author SHA1 Message Date
Hongtao Yu d5a963ab8b [PseudoProbe] Replace relocation with offset for entry probe.
Currently pseudo probe encoding for a function is like:
	- For the first probe, a relocation from it to its physical position in the code body
	- For subsequent probes, an incremental offset from the current probe to the previous probe

The relocation could potentially cause relocation overflow during link time. I'm now replacing it with an offset from the first probe to the function start address.

A source function could be lowered into multiple binary functions due to outlining (e.g, coro-split). Since those binary function have independent link-time layout, to really avoid relocations from .pseudo_probe sections to .text sections, the offset to replace with should really be the offset from the probe's enclosing binary function, rather than from the entry of the source function. This requires some changes to previous section-based emission scheme which now switches to be function-based. The assembly form of pseudo probe directive is also changed correspondingly, i.e, reflecting the binary function name.

Most of the source functions end up with only one binary function. For those don't, a sentinel probe is emitted for each of the binary functions with a different name from the source. The sentinel probe indicates the binary function name to differentiate subsequent probes from the ones from a different binary function. For examples, given source function

```
Foo() {
  …
  Probe 1
  …
  Probe 2
}
```

If it is transformed into two binary functions:

```
Foo:
   …

Foo.outlined:
   …
```

The encoding for the two binary functions will be separate:

```

GUID of Foo
  Probe 1

GUID of Foo
  Sentinel probe of Foo.outlined
  Probe 2
```

Then probe1 will be decoded against binary `Foo`'s address, and Probe 2 will be decoded against `Foo.outlined`. The sentinel probe of `Foo.outlined` makes sure there's not accidental relocation from `Foo.outlined`'s probes to `Foo`'s entry address.

On the BOLT side, to be minimal intrusive, the pseudo probe re-encoding sticks with the old encoding format. This is fine since unlike linker, Bolt processes the pseudo probe section as a whole and it is free from relocation overflow issues.

The change is downwards compatible as long as there's no mixed use of the old encoding and the new encoding.

Reviewed By: wenlei, maksfb

Differential Revision: https://reviews.llvm.org/D135912
Differential Revision: https://reviews.llvm.org/D135914
Differential Revision: https://reviews.llvm.org/D136394
2022-10-27 13:28:22 -07:00
wlei bf97c1e066 [NFC] fix a wrong name change during rebase 2022-10-25 21:16:29 -07:00
wlei 91cc53d5a4 [llvm-profgen] Do not cache the frame location stack during computing inlined context size
In `computeInlinedContextSizeForRange`, the offset of range is only used one time, there is no need to cache the frame location stack.
Measured on one internal service binary, this can save 2GB memory usage and reduce a small run time (avoid one hash search).

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D128859
2022-10-25 21:08:36 -07:00
wlei 467652486f [llvm-profgen] Fix inconsistent loading address issues
This is to fix two issues related with loading address:

1) When multiple MMAPs occur and their loading address are different, before it only used the first MMap as base address, all perf address after it used the wrong base address.

2) For pseudo probe profile, the address is always based on preferred loading address. If the base address is not equal to the preferred loading address, the pseudo probe address query will be wrong.

Solution: Instead of converting the address to offset lazily, right now all the address after parsing are converted on the fly based on preferred loading address in the parsing time. There is no "offset" used in profile generator any more.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D126827
2022-10-13 23:19:30 -07:00
wlei 7e86b13c63 [CSSPGO][llvm-profgen] Reimplement SampleContextTracker using context trie
This is the followup patch to https://reviews.llvm.org/D125246 for the `SampleContextTracker` part. Before the promotion and merging of the context is based on the SampleContext(the array of frame), this causes a lot of cost to the memory. This patch detaches the tracker from using the array ref instead to use the context trie itself. This can save a lot of memory usage and benefit both the compiler's CS inliner and llvm-profgen's pre-inliner.

One structure needs to be specially treated is the `FuncToCtxtProfiles`, this is used to get all the functionSamples for one function to do the merging and promoting. Before it search each functions' context and traverse the trie to get the node of the context. Now we don't have the context inside the profile, instead we directly use an auxiliary map `ProfileToNodeMap` for profile , it initialize to create the FunctionSamples to TrieNode relations and keep updating it during promoting and merging the node.

Moreover, I was expecting the results before and after remain the same, but I found that the order of FuncToCtxtProfiles matter and affect the results. This can happen on recursive context case, but the difference should be small. Now we don't have the context, so I just used a vector for the order, the result is still deterministic.

Measured on one huge size(12GB) profile from one of our internal service. The profile similarity difference is 99.999%, and the running time is improved by 3X(debug mode) and the memory is reduced from 170GB to 90GB.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D127031
2022-06-27 23:22:21 -07:00
Hongtao Yu 9f732af583 [llvm-profgen] Filter out oversized LBR ranges.
As a follow up to {D123271}, LBR ranges that are too big should also be considered as invalid.

For example, the last two pairs in the following trace form a range [0x0d7b02b0, 0x368ba706] that covers a ton of functions in the binary. Such oversized range should also be ignored.

   0x0c74505f/0x368b99a0 **0x368ba706**/0x0c745040  0x0d7b1c3f/**0x0d7b02b0**

Add a defensive check to filter out those ranges based that the valid range should not cross the unconditional branch(Call, return, unconditional jmp).

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D125448
2022-05-12 10:58:50 -07:00
Hongtao Yu 3f97016857 [llvm-profgen] Decoding pseudo probe for profiled function only.
Complete pseudo probes decoding can result in large memory usage. In practice only a small porting of the decoded probes are used in profile generation. I'm changing the full decoding mode to be decoding for profiled functions only, though we still do a full scan of the .pseudoprobe section due to a missing table-of-content but we don't have to build the in-memory data structure for functions not sampled.

To build the in-memory data structure for profiled functions only, I'm rewriting the previous non-recursive probe decoding logic to be recursive. This is easy to read and maintain.

I also have to change the previous representation of unsymbolized context from probe-based stack to address-based stack since the profiled functions are unknown yet by the time of virtual unwinding. The address-based stack will be converted to probe-based stack after virtual unwinding and on-demand probe decoding.

I'm seeing 20GB memory is saved for one of our internal large service.

Reviewed By: wenlei

Differential Revision: https://reviews.llvm.org/D121643
2022-03-23 14:15:11 -07:00
serge-sans-paille fc97efa409 Cleanup includes: ProfileData
Estimation of the impact on preprocessor output:

before: 1067349756
after: 1065940348

Discourse thread: https://discourse.llvm.org/t/include-what-you-use-include-cleanup
Differential Revision: https://reviews.llvm.org/D120434
2022-02-24 13:25:11 +01:00
wlei b3a778fb5e [llvm-profgen] Support symbol loading for debug fission
Support to load debug info from dwarf split file, like .dwo, .dwp files. Leverage the `getNonSkeletonUnitDIE(false)` API to achieve this.

Add test cause to make sure all the ranges is well retrieved by the loader.

Reviewed By: ayermolo, hoy, wenlei

Differential Revision: https://reviews.llvm.org/D115973
2022-02-23 09:40:46 -08:00
Hongtao Yu 34e131b0f2 [llvm-profgen] On-demand track optimized-away inlinees for preinliner.
Tracking optimized-away inlinees based on all probes in a binary is expansive in terms of memory usage I'm making the tracking on-demand based on profiled functions only. This saves about 10%  memory overall for a medium-sized benchmark.

Before:

   note: After parsePerfTraces
   note: Thu Jan 27 18:42:09 2022
   note: VM: 8.68 GB   RSS: 8.39 GB
   note: After computeSizeForProfiledFunctions
   note: Thu Jan 27 18:42:41 2022
   note: **VM: 10.63 GB   RSS: 10.20 GB**
   note: After generateProbeBasedProfile
   note: Thu Jan 27 18:45:49 2022
   note: VM: 25.00 GB   RSS: 24.95 GB
   note: After postProcessProfiles
   note: Thu Jan 27 18:49:29 2022
   note: VM: 26.34 GB   RSS: 26.27 GB

After:
   note: After parsePerfTraces
   note: Fri Jan 28 12:04:49 2022
   note: VM: 8.68 GB   RSS: 7.65 GB
   note: After computeSizeForProfiledFunctions
   note: Fri Jan 28 12:05:26 2022
   note: **VM: 8.68 GB   RSS: 8.42 GB**
   note: After generateProbeBasedProfile
   note: Fri Jan 28 12:08:03 2022
   note: VM: 22.93 GB   RSS: 22.89 GB
   note: After postProcessProfiles
   note: Fri Jan 28 12:11:30 2022
   note: VM: 24.27 GB   RSS: 24.22 GB

This should be a no-diff change in terms of profile quality.

Reviewed By: wenlei

Differential Revision: https://reviews.llvm.org/D118515
2022-02-08 08:33:23 -08:00
wlei 6693c562f9 [llvm-profgen] Support to load debug info from a second binary
For reducing binary size purpose, the binary's debug info and executable segment can be separated(like using objcopy --only-keep-debug). Here add support in llvm-profgen to use two binaries as input. The original one is executable binary and added for debug info only binary. Adding a flag `--debug-binary=file-path`, with this, the binary will load debug info from debug binary.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D115948
2022-01-24 17:14:05 -08:00
wlei 0f53df864e [CSSPGO][llvm-profgen] Fix external address issues of perf reader (return to external addr part)
Before we have an issue with artificial LBR whose source is a return, recalling that "an internal code(A) can return to external address, then from the external address call a new internal code(B), making an artificial branch that looks like a return from A to B can confuse the unwinder". We just ignore the LBRs after this artificial LBR which can miss some samples. This change aims at fixing this by correctly unwinding them instead of ignoring them.

List some typical scenarios covered by this change.

1)  multiple sequential call back happen in external address, e.g.

```
[ext, call, foo] [foo, return, ext] [ext, call, bar]
```
Unwinder should avoid having foo return from bar. Wrong call stack is like [foo, bar]

2) the call stack before and after external call should be correctly unwinded.
```
 {call stack1}                                            {call stack2}
 [foo, call, ext]  [ext, call, bar]  [bar, return, ext]  [ext, return, foo ]
```
call stack 1 should be the same to call stack2. Both shouldn't be truncated

3) call stack should be truncated after call into external code since we can't do inlining with external code.

```
 [foo, call, ext]  [ext, call, bar]  [bar, call, baz] [baz, return, bar ] [bar, return, ext]
```
the call stack of code in baz should not include foo.

### Implementation:

We leverage artificial frame to fix #2 and #3: when we got a return artificial LBR, push an extra artificial frame to the stack. when we pop frame, check if the parent is an artificial frame to pop(fix #2). Therefore, call/ return artificial LBR is just the same as regular LBR which can keep the call stack.

While recording context on the trie, artificial frame is used as a tag indicating that we should truncate the call stack(fix #3).

To differentiate #1 and #2, we leverage `getCallAddrFromFrameAddr`.  Normally the target of the return should be the next inst of a call inst and `getCallAddrFromFrameAddr` will return the address of call inst. Otherwise, getCallAddrFromFrameAddr will return to 0 which is the case of #1.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D115550
2021-12-14 16:40:54 -08:00
wlei 484a569eea [llvm-profgen] Fix total samples related issues
Since total sample and body sample are used to compute hotness threshold in compiler, we found in some services changing the total samples computation will cause noticeable regression. Hence, here we will revert the changes and just keep all total samples number identical to the old tool.

Three changes in this diff:

1. Revert previous diff(https://reviews.llvm.org/D112672: [llvm-profgen] Update total samples by accumulating all its body samples) and put it under a switch.

2. Keep the negative line number. Although compiler doesn't consume the count but it will be used to compute hot threshold.

3. Change to accumulate total samples per byte instead of per instruction.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D115013
2021-12-08 12:33:41 -08:00
wlei f15a854567 [llvm-profgen] Truncate the context with zero probe ID
Due to the debug info merging, there may have some contexts with zero probe id, we should truncate the context to avoid misleading pre-inliner.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D114284
2021-11-30 16:21:25 -08:00
wlei 41a681ce09 [FS-AFDO][llvm-profgen] Generate profile with FS-AFDO discriminator
In order to support generating profile  with FS discriminator, three kind of changes are done in llvm-profgen:

1) Dissassemble .rodata section to check if FS discriminator var ('"__llvm_fs_discriminator__"') exists and set the corresponding flag in the binary.

2) Change the discriminator decoding in `getBaseDiscriminator` and `getDuplicationFactor`.

3) set true for `FunctionSamples::ProfileIsFS` to enable FS functionality in ProfileData.

Reviewed By: xur, hoy, wenlei

Differential Revision: https://reviews.llvm.org/D113296
2021-11-30 15:57:59 -08:00
wlei c2e08aba1a [llvm-profgen] Compute and show profile density
AutoFDO performance is sensitive to profile density, i.e., the amount of samples in the profile relative to the program size, because profiles with insufficient samples could be inaccurate due to statistical noise and thus hurt AutoFDO performance. A previous investigation showed that AutoFDO performed better on MySQL with increased amount of samples. Therefore, we implement a profile-density computation feature to give hints about profile density to users and the compiler.

We define the density of a profile Prof as follows:

- For each function A in the profile, density(A) = total_samples(A) / sizeof(A).
- density(Prof) = min(density(A)) for all functions A that are warm (defined below).

A function is considered warm if its total-samples is within top N percent of the profile. For implementation, we reuse the `ProfileSummaryBuilder::getHotCountThreshold(..)` as threshold which can be set by percent(`--profile-summary-cutoff-hot`) or by value(`--profile-summary-hot-count`).

We also introduce `--hot-function-density-threshold` to set hot function density threshold and will give suggestion if profile density is below it which implies we should increase samples.

This also applies for CS profile with all profiles merged into base.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D113781
2021-11-29 23:54:31 -08:00
Wenlei He f7976edc1e [llvm-profgen] Add switch to allow use of first loadable segment for calculating offset
Adding `-use-loadable-segment-as-base` to allow use of first loadable segment for calculating offset. By default first executable segment is used for calculating offset. The switch helps compatibility with unsymbolized profile generated from older tools.

Differential Revision: https://reviews.llvm.org/D113727
2021-11-15 19:00:27 -08:00
wlei aab1810006 [llvm-profgen] Fix bug of setting function entry
Previously we set `isFuncEntry` flag  to true when the funcName from DWARF is equal to the name in symbol table and we use this flag to ignore reporting callsite sample that's from an intra func branch. However, in HHVM, it appears that the symbol table name is inconsistent with the dwarf info func name, it's likely due to `OptimizeGlobalAliases`.

This change is a workaround in llvm-profgen side to mark the only one range as the function entry and add warnings for the remaining inconsistence.

This also fixed a missing `getCanonicalFnName` for symbol name which caused the mismatching as well.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D113492
2021-11-12 12:18:43 -08:00
wlei 5bf191a381 [llvm-profgen] Fix index out of bounds error while using ip.advance
Previously we assume there're some non-executing sections at the bottom of the text section so that we won't hit the array's bound. But on BOLTed binary, it turned out .bolt section is at the bottom of text section which can be profiled, then it crash llvm-profgen. This change try to fix it.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D113238
2021-11-05 18:38:40 -07:00
wlei 138202a8c3 [llvm-profgen] Warn on invalid range and show warning summary
Two things in this diff:

1) Warn on the invalid range, currently three types of checking, see the detailed message in the code.

2) In some situation, llvm-profgen gives lots of warnings on the truncated stacks which is noisy. This change provides a switch to `--show-detailed-warning` to skip the warnings. Alternatively, we use a summary for those warning and show the percentage of cases with those issues.

Example of warning summary.
```
warning: 0.05%(1120/2428958) cases with issue: Profile context truncated due to missing probe for call instruction.
warning: 0.00%(2/178637) cases with issue: Range does not belong to any functions, likely from external function.
```

Reviewed By: hoy

Differential Revision: https://reviews.llvm.org/D111902
2021-11-02 19:55:55 -07:00
wlei 3f3103c6a9 [llvm-profgen] Fill zero count for all function ranges
Allow filling zero count for all the function ranges even there is no samples hitting that function. Add a switch for this.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D112858
2021-11-01 09:57:05 -07:00
wlei 2f8196db92 [llvm-profgen] Fix bug of populating profile symbol list
Previous implementation of populating profile symbol list is wrong, it only included the profiled symbols. Actually it should use all symbols, here this switches to use the symbols from debug info. Also turned the flag off by default.

Reviewed By: wenlei, hoy

Differential Revision: https://reviews.llvm.org/D111824
2021-10-29 09:59:12 -07:00
wlei 40ca411251 [llvm-profgen] Switch to DWARF-based symbol and ranges
It happened a bug that some callsite name in the profile is not a real function, it turned out that there're some non-function symbol from the ELF text section, e.g. the global accessible branch label and also recalled that we can have one function being split into multiple ranges. We shouldn't count samples for those are not the entry of the real function.

So this change tried to fix this issue by switching to use the name or ranges from DWARF-based debug info, the range of which assure it's the real function start. For the split functions, we assume that the real entry function's DWARF name should always match the symbol table name.

The switching is also consistent with the body samples' symbol which is from DWARF.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D112282
2021-10-29 09:59:12 -07:00
wlei ce40843a3f [llvm-profgen][CSSPGO] On-demand function size computation for preinliner
Similar to https://reviews.llvm.org/D110465, we can compute function size on-demand for the functions that's hit by samples.

Here we leverage the raw range samples' address to compute a set of sample hit function. Then `BinarySizeContextTracker` just works on those function range for the size.

Reviewed By: hoy

Differential Revision: https://reviews.llvm.org/D110466
2021-09-28 09:09:38 -07:00
wlei 091c16f76b [llvm-profgen] On-demand symbolization
Previously we do symbolization for all the functions and actually we only need the symbols that's hit by the samples.

This can significantly speed up the time for large size binary.

Optimization for per-inliner will come along with next patch.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D110465
2021-09-28 09:09:25 -07:00
wlei 28277e9b48 [AutoFDO][llvm-profgen] Report zero count for unexecuted part of function code
In order to be consistent with compiler that interprets zero count as unexecuted(cold), this change reports zero-value count for unexecuted part of function code. For the implementation, it leverages the range counter, initializes all the executed function range with the zero-value. After all ranges are merged and converted into disjoint ranges, the remaining zero count will indicates the unexecuted(cold) part of the function.

This change also extends the current `findDisjointRanges` method which now can support adding zero-value range.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D109713
2021-09-24 14:15:05 -07:00
wlei d5f2013004 [AutoFDO][llvm-profgen] Profile generation for LBR(non-CS) sample
This patch introduces non-CS AutoFDO profile generation into LLVM. The profile is supposed to be well consumed by compiler using `-fprofile-sample-use=[profile]`.

After range and branch counters are extracted from the LBR sample, here we go through each addresses for symbolization, create FunctionSamples and populate its sub fields like TotalSamples, BodySamples and HeadSamples etc. For inlined code, as we need to map back to original code, so we always add body samples to the leaf frame's function sample.

Reviewed By: wenlei, hoy

Differential Revision: https://reviews.llvm.org/D109551
2021-09-24 13:55:34 -07:00
Hongtao Yu 734f4d832c [llvm-profgen] An option to dump disasm of specified symbols
For large app, dumping disasm of the whole program can be slow and result in gianant output. Adding a switch to dump specific symbols only.

Reviewed By: wlei

Differential Revision: https://reviews.llvm.org/D110079
2021-09-22 10:32:59 -07:00
Hongtao Yu 0057c7185d [CSSPGO][llvm-profgen] Truncate stack samples with invalid return address.
Invalid frame addresses exist in call stack samples due to bad unwinding. This could happen to frame-pointer-based unwinding and the callee functions that do not have the frame pointer chain set up. It isn't common when the program is built with the frame pointer omission disabled, but can still happen with third-party static libs built with frame pointer omitted.

Reviewed By: wenlei

Differential Revision: https://reviews.llvm.org/D109638
2021-09-14 21:56:22 -07:00
wlei 964053d56f [llvm-profgen] Support LBR only perf script
This change aims at supporting LBR only sample perf script which is used for regular(Non-CS) profile generation.  A LBR perf script includes a batch of LBR sample which starts with a frame pointer and a group of 32 LBR entries is followed. The FROM/TO LBR pair and the range between two consecutive entries (the former entry's TO and the latter entry's FROM) will be used to infer function profile info.

An example of LBR perf script(created by `perf script -F ip,brstack -i perf.data`)
```
           40062f 0x40062f/0x4005b0/P/-/-/9  0x400645/0x4005ff/P/-/-/1  0x400637/0x400645/P/-/-/1 ...
           4005d7 0x4005d7/0x4005e5/P/-/-/8  0x40062f/0x4005b0/P/-/-/6  0x400645/0x4005ff/P/-/-/1 ...
           ...
```

For implementation:
 - Extended a new child class `LBRPerfReader` for the sample parsing, reused all the functionalities in `extractLBRStack` except for an extension to parsing leading instruction pointer.
 - `HybridSample` is reused(just leave the call stack empty) and the parsed samples is still aggregated in `AggregatedSamples`. After that, range samples, branch sample, address samples are computed and recorded.
 - Reused `ContextSampleCounterMap` to store the raw profile, since it's no need to aggregation by context, here it just registered one sample counter with a fake context key.
 - Unified to use `show-raw-profile` instead of `show-unwinder-output` to dump the intermediate raw profile, see the comments of the format of the raw profile. For CS profile, it remains to output the unwinder output.

Profile generation part will come soon.

Differential Revision: https://reviews.llvm.org/D108153
2021-08-31 13:28:17 -07:00
Hongtao Yu b9db70369b [CSSPGO] Split context string to deduplicate function name used in the context.
Currently context strings contain a lot of duplicated function names and that significantly increase the profile size. This change split the context into a series of {name, offset, discriminator} tuples so function names used in the context can be replaced by the index into the name table and that significantly reduce the size consumed by context.

A follow-up improvement made in the compiler and profiling tools is to avoid reconstructing full context strings which is  time- and memory- consuming. Instead a context vector of `StringRef` is adopted to represent the full context in all scenarios. As a result, the previous prevalent profile map which was implemented as a `StringRef` is now engineered as an unordered map keyed by `SampleContext`. `SampleContext` is reshaped to using an `ArrayRef` to represent a full context for CS profile. For non-CS profile, it falls back to use `StringRef` to represent a contextless function name. Both the `ArrayRef` and `StringRef` objects are underpinned by real array and string objects that are stored in producer buffers. For compiler, they are maintained by the sample reader. For llvm-profgen, they are maintained in `ProfiledBinary` and `ProfileGenerator`. Full context strings can be generated only in those cases of debugging and printing.

When it comes to profile format, nothing has changed to the text format, though internally CS context is implemented as a vector. Extbinary format is only changed for CS profile, with an additional `SecCSNameTable` section which stores all full contexts logically in the form of `vector<int>`, which each element as an offset points to `SecNameTable`. All occurrences of contexts elsewhere are redirected to using the offset of `SecCSNameTable`.

Testing
This is no-diff change in terms of code quality and profile content (for text profile).

For our internal large service (aka ads), the profile generation is cut to half, with a 20x smaller string-based extbinary format generated.

The compile time of ads is dropped by 25%.

Differential Revision: https://reviews.llvm.org/D107299
2021-08-30 20:09:29 -07:00
Wenlei He a6f15e9a49 [CSSPGO] Use probe inline tree to track zero size fully optimized context for pre-inliner
This is a follow up diff for BinarySizeContextTracker to track zero size for fully optimized inlinee. When an inlinee is fully optimized away, we won't be able to get its size through symbolizing instructions, hence we will treat the corresponding context size as unknown. However by traversing the inlined probe forest, we know what're original inlinees regardless of optimization. If a context show up in inlined probes, but not during symbolization, we know that it's fully optimized away hence its size is zero instead of unknown. It should provide more accurate size cost estimation for pre-inliner to make better inline decisions in llvm-profgen.

Differential Revision: https://reviews.llvm.org/D108350
2021-08-25 09:01:11 -07:00
Wenlei He eca03d2768 [CSSPGO] Track and use context-sensitive post-optimization function size to drive global pre-inliner in llvm-profgen
This change enables llvm-profgen to use accurate context-sensitive post-optimization function byte size as a cost proxy to drive global preinline decisions.

To do this, BinarySizeContextTracker is introduced to track function byte size under different inline context during disassembling. In preinliner, we can not query context byte size under switch `context-cost-for-preinliner`. The tracker uses a reverse trie to keep size of functions under different context (callee as parent, caller as child), and it can give best/longest possible matching context size for given input context.

The new size cost is off by default. There're a few TODOs that needs to addressed: 1) avoid dangling string from `Offset2LocStackMap`, which will be addressed in split context work; 2) using inlinee's entry probe to make sure we have correct zero size for inlinee that's completely optimized away after inlining. Some tuning is also needed.

Differential Revision: https://reviews.llvm.org/D108180
2021-08-18 22:50:57 -07:00
wlei f812c19253 [llvm-profgen] Clean up code dealing with multiple binaries
As we decided to support only one binary each time, this patch cleans up the related code dealing with multiple binaries. We can use `llvm-profdata` to merge profile from multiple binaries.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D108002
2021-08-17 12:16:07 -07:00
jamesluox ee7d20e846 [CSSPGO] Migrate and refactor the decoder of Pseudo Probe
Migrate pseudo probe decoding logic in llvm-profgen to MC, so other LLVM-base program could reuse existing codes. Redesign object layout of encoded and decoded pseudo probes.

Reviewed By: hoy

Differential Revision: https://reviews.llvm.org/D106861
2021-08-04 09:21:34 -07:00
wlei fe3ba90830 [llvm-profgen] Support perf script without parsing MMap events
This change supports to run without parsing MMap binary loading events instead it always assumes binary is loaded at the preferred address. This is used when we have assured no binary load address changes or we have pre-processed the addresses resolution. Warn if there's interior mmap event but without leading mmap events.

Reviewed By: hoy

Differential Revision: https://reviews.llvm.org/D107097
2021-08-03 10:01:07 -07:00
wlei 6da9241aab [llvm-profgen] Refactor PerfReader to allow different types of perf scripts
In order to support different types of perf scripts, this change tried to refactor `PerfReader` by adding the base class `PerfReaderBase` and current HybridPerfReader is derived from it for CS profile generation. Common functions like, passMM2PEvents, extract_lbrs, extract_callstack, etc. can be reused.

Next step is to add LBR only reader(for non-CS profile) and aggregated perf scripts reader(do a pre-aggregation of scripts).

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D107014
2021-08-02 17:18:47 -07:00
Hongtao Yu 6b04ecaab3 [CSSPGO][llvm-profgen] Fix a missing initalization
Fixing a missing initalization that accidentaly caused by https://reviews.llvm.org/D103178 .
2021-07-13 19:49:55 -07:00
Hongtao Yu 597e9c61ce Revert "[CSSPGO][llvm-profgen] Fix a missing initalization"
This reverts commit fef5f4456a.
2021-07-13 19:48:58 -07:00
Hongtao Yu fef5f4456a [CSSPGO][llvm-profgen] Fix a missing initalization
Fixing a missing initalization that accidentaly caused by https://reviews.llvm.org/D103178 .
2021-07-13 19:46:18 -07:00
Hongtao Yu 0712038458 [CSSPGO][llvm-profgen] Allow multiple executable load segments.
The linker or post-link optimizer can create an ELF image with multiple executable segments each of which will be loaded separately at run time. This breaks the assumption of llvm-profgen that currently only supports one base load address. What it ends up with is that the subsequent mmap events will be treated as an overwrite of the first mmap event which will in turn screw up address mapping. While it is non-trivial to support multiple separate load addresses and given that on x64 those segments will always be loaded at consecutive addresses (though via separate mmap
sys calls), I'm adding an error checking logic to bail out if that's violated and keep using a single load address which is the address of the first executable segment.

Also changing the disassembly output from printing section offset to printing the virtual address instead, which matches the behavior of objdump.

Differential Revision: https://reviews.llvm.org/D103178
2021-07-13 18:22:24 -07:00
Wenlei He 1410db70b9 [CSSPGO] Add attribute metadata for context profile
This changes adds attribute field for metadata of context profile. Currently we have an inline attribute that indicates whether the leaf frame corresponding to a context profile was inlined in previous build.

This will be used to help estimating inlining and be taken into account when trimming context. Changes for that in llvm-profgen will follow. It will also help tuning.

Differential Revision: https://reviews.llvm.org/D98823
2021-03-18 22:00:56 -07:00
Yang Fan 64fc9cc723
[CSSPGO][llvm-profgen] Fix gcc Wcast-qual warning (NFC)
GCC warning:
```
[3397/3703] Building CXX object tools/llvm-profgen/CMakeFiles/llvm-profgen.dir/llvm-profgen.cpp.o
In file included from /llvm-project/llvm/include/llvm/ADT/STLExtras.h:19,
                 from /llvm-project/llvm/include/llvm/ADT/StringRef.h:12,
                 from /llvm-project/llvm/include/llvm/ADT/Twine.h:13,
                 from /llvm-project/llvm/tools/llvm-profgen/ErrorHandling.h:12,
                 from /llvm-project/llvm/tools/llvm-profgen/llvm-profgen.cpp:13:
/llvm-project/llvm/include/llvm/ADT/Optional.h: In instantiation of ‘void llvm::optional_detail::OptionalStorage<T, <anonymous> >::emplace(Args&& ...) [with Args = {const std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, llvm::sampleprof::LineLocation>}; T = const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>; bool <anonymous> = false]’:
/llvm-project/llvm/include/llvm/ADT/Optional.h:79:7:   required from ‘constexpr llvm::optional_detail::OptionalStorage<T, <anonymous> >::OptionalStorage(llvm::optional_detail::OptionalStorage<T, <anonymous> >&&) [with T = const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>; bool <anonymous> = false]’
/llvm-project/llvm/include/llvm/ADT/Optional.h:253:13:   required from here
/llvm-project/llvm/include/llvm/ADT/Optional.h:113:12: warning: cast from type ‘const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>*’ to type ‘void*’ casts away qualifiers [-Wcast-qual]
  113 |     ::new ((void *)std::addressof(value)) T(std::forward<Args>(args)...);
      |            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[3398/3703] Building CXX object tools/llvm-profgen/CMakeFiles/llvm-profgen.dir/PerfReader.cpp.o
In file included from /llvm-project/llvm/include/llvm/ADT/STLExtras.h:19,
                 from /llvm-project/llvm/include/llvm/ADT/StringRef.h:12,
                 from /llvm-project/llvm/include/llvm/ADT/Twine.h:13,
                 from /llvm-project/llvm/tools/llvm-profgen/ErrorHandling.h:12,
                 from /llvm-project/llvm/tools/llvm-profgen/PerfReader.h:11,
                 from /llvm-project/llvm/tools/llvm-profgen/PerfReader.cpp:8:
/llvm-project/llvm/include/llvm/ADT/Optional.h: In instantiation of ‘void llvm::optional_detail::OptionalStorage<T, <anonymous> >::emplace(Args&& ...) [with Args = {const std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, llvm::sampleprof::LineLocation>}; T = const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>; bool <anonymous> = false]’:
/llvm-project/llvm/include/llvm/ADT/Optional.h:79:7:   required from ‘constexpr llvm::optional_detail::OptionalStorage<T, <anonymous> >::OptionalStorage(llvm::optional_detail::OptionalStorage<T, <anonymous> >&&) [with T = const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>; bool <anonymous> = false]’
/llvm-project/llvm/include/llvm/ADT/Optional.h:253:13:   required from here
/llvm-project/llvm/include/llvm/ADT/Optional.h:113:12: warning: cast from type ‘const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>*’ to type ‘void*’ casts away qualifiers [-Wcast-qual]
  113 |     ::new ((void *)std::addressof(value)) T(std::forward<Args>(args)...);
      |            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[3399/3703] Building CXX object tools/llvm-profgen/CMakeFiles/llvm-profgen.dir/ProfiledBinary.cpp.o
In file included from /llvm-project/llvm/include/llvm/ADT/STLExtras.h:19,
                 from /llvm-project/llvm/include/llvm/ADT/ArrayRef.h:15,
                 from /llvm-project/llvm/include/llvm/ADT/DenseMapInfo.h:18,
                 from /llvm-project/llvm/include/llvm/ADT/DenseMap.h:16,
                 from /llvm-project/llvm/include/llvm/ADT/DenseSet.h:16,
                 from /llvm-project/llvm/include/llvm/ProfileData/SampleProf.h:17,
                 from /llvm-project/llvm/tools/llvm-profgen/CallContext.h:12,
                 from /llvm-project/llvm/tools/llvm-profgen/ProfiledBinary.h:12,
                 from /llvm-project/llvm/tools/llvm-profgen/ProfiledBinary.cpp:9:
/llvm-project/llvm/include/llvm/ADT/Optional.h: In instantiation of ‘void llvm::optional_detail::OptionalStorage<T, <anonymous> >::emplace(Args&& ...) [with Args = {const std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, llvm::sampleprof::LineLocation>}; T = const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>; bool <anonymous> = false]’:
/llvm-project/llvm/include/llvm/ADT/Optional.h:79:7:   required from ‘constexpr llvm::optional_detail::OptionalStorage<T, <anonymous> >::OptionalStorage(llvm::optional_detail::OptionalStorage<T, <anonymous> >&&) [with T = const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>; bool <anonymous> = false]’
/llvm-project/llvm/include/llvm/ADT/Optional.h:253:13:   required from here
/llvm-project/llvm/include/llvm/ADT/Optional.h:113:12: warning: cast from type ‘const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>*’ to type ‘void*’ casts away qualifiers [-Wcast-qual]
  113 |     ::new ((void *)std::addressof(value)) T(std::forward<Args>(args)...);
      |            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[3404/3703] Building CXX object tools/llvm-profgen/CMakeFiles/llvm-profgen.dir/ProfileGenerator.cpp.o
In file included from /llvm-project/llvm/include/llvm/ADT/STLExtras.h:19,
                 from /llvm-project/llvm/include/llvm/ADT/StringRef.h:12,
                 from /llvm-project/llvm/include/llvm/ADT/Twine.h:13,
                 from /llvm-project/llvm/tools/llvm-profgen/ErrorHandling.h:12,
                 from /llvm-project/llvm/tools/llvm-profgen/ProfileGenerator.h:11,
                 from /llvm-project/llvm/tools/llvm-profgen/ProfileGenerator.cpp:9:
/llvm-project/llvm/include/llvm/ADT/Optional.h: In instantiation of ‘void llvm::optional_detail::OptionalStorage<T, <anonymous> >::emplace(Args&& ...) [with Args = {const std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, llvm::sampleprof::LineLocation>}; T = const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>; bool <anonymous> = false]’:
/llvm-project/llvm/include/llvm/ADT/Optional.h:79:7:   required from ‘constexpr llvm::optional_detail::OptionalStorage<T, <anonymous> >::OptionalStorage(llvm::optional_detail::OptionalStorage<T, <anonymous> >&&) [with T = const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>; bool <anonymous> = false]’
/llvm-project/llvm/include/llvm/ADT/Optional.h:253:13:   required from here
/llvm-project/llvm/include/llvm/ADT/Optional.h:113:12: warning: cast from type ‘const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>*’ to type ‘void*’ casts away qualifiers [-Wcast-qual]
  113 |     ::new ((void *)std::addressof(value)) T(std::forward<Args>(args)...);
      |            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
2021-02-18 17:10:50 +08:00
wlei afd8bd601e [CSSPGO][llvm-profgen] Filter out the instructions without location info for symbolizer
It appears some instructions doesn't have the debug location info and the symbolizer will return an empty call stack for them which will cause some crash later in profile unwinding. Actually we do not record the sample info for them, so this change just filter out those instruction.

As those instruction would appears at the begin and end of the instruction list, without them we need to add the boundary check for IP `advance` and `backward`.

Also for pseudo probe based profile, we actually don't need the symbolized location info, so here just change to use an empty stack for it. This could save half of the binary loading time.

Differential Revision: https://reviews.llvm.org/D96434
2021-02-12 16:47:49 -08:00
wlei 3869309a0c [CSSPGO][llvm-profgen] Aggregate samples on call frame trie to speed up profile generation
For CS profile generation, the process of call stack unwinding is time-consuming since for each LBR entry we need linear time to generate the context( hash, compression, string concatenation). This change speeds up this by grouping all the call frame within one LBR sample into a trie and aggregating the result(sample counter) on it, deferring the context compression and string generation to the end of unwinding.

Specifically, it uses `StackLeaf` as the top frame on the stack and manipulates(pop or push a trie node) it dynamically during virtual unwinding so that the raw sample can just be recoded on the leaf node, the path(root to leaf) will represent its calling context. In the end, it traverses the trie and generates the context on the fly.

Results:
Our internal branch shows about 5X speed-up on some large workloads in SPEC06 benchmark.

Differential Revision: https://reviews.llvm.org/D94110
2021-02-04 08:43:21 -08:00
wlei ac14bb14e7 [CSSPGO][llvm-profgen] Compress recursive cycles in calling context
This change compresses the context string by removing cycles due to recursive function for CS profile generation. Removing recursion cycles is a way to normalize the calling context which will be better for the sample aggregation and also make the context promoting deterministic.
Specifically for implementation, we recognize adjacent repeated frames as cycles and deduplicated them through multiple round of iteration.
For example:
Considering a input context string stack:
[“a”, “a”, “b”, “c”, “a”, “b”, “c”, “b”, “c”, “d”]
For first iteration,, it removed all adjacent repeated frames of size 1:
[“a”, “b”, “c”, “a”, “b”, “c”, “b”, “c”, “d”]
For second iteration, it removed all adjacent repeated frames of size 2:
[“a”, “b”, “c”, “a”, “b”, “c”, “d”]
So in the end, we get compressed output:
[“a”, “b”, “c”, “d”]

Compression will be called in two place: one for sample's context key right after unwinding, one is for the eventual context string id in the ProfileGenerator.
Added a switch `compress-recursion` to control the size of duplicated frames, default -1 means no size limit.
Added unit tests and regression test for this.

Differential Revision: https://reviews.llvm.org/D93556
2021-02-03 22:16:07 -08:00
wlei 6bccdcdb35 Revert "[CSSPGO][llvm-profgen] Compress recursive cycles in calling context"
This reverts commit 0609f257dc.
2021-02-03 22:16:05 -08:00
wlei 08e8bb60cf Revert "[CSSPGO][llvm-profgen] Aggregate samples on call frame trie to speed up profile generation"
This reverts commit 1714ad2336.
2021-02-03 22:16:05 -08:00
wlei 1714ad2336 [CSSPGO][llvm-profgen] Aggregate samples on call frame trie to speed up profile generation
For CS profile generation, the process of call stack unwinding is time-consuming since for each LBR entry we need linear time to generate the context( hash, compression, string concatenation). This change speeds up this by grouping all the call frame within one LBR sample into a trie and aggregating the result(sample counter) on it, deferring the context compression and string generation to the end of unwinding.

Specifically, it uses `StackLeaf` as the top frame on the stack and manipulates(pop or push a trie node) it dynamically during virtual unwinding so that the raw sample can just be recoded on the leaf node, the path(root to leaf) will represent its calling context. In the end, it traverses the trie and generates the context on the fly.

Results:
Our internal branch shows about 5X speed-up on some large workloads in SPEC06 benchmark.

Differential Revision: https://reviews.llvm.org/D94110
2021-02-03 18:50:14 -08:00
wlei 0609f257dc [CSSPGO][llvm-profgen] Compress recursive cycles in calling context
This change compresses the context string by removing cycles due to recursive function for CS profile generation. Removing recursion cycles is a way to normalize the calling context which will be better for the sample aggregation and also make the context promoting deterministic.
Specifically for implementation, we recognize adjacent repeated frames as cycles and deduplicated them through multiple round of iteration.
For example:
Considering a input context string stack:
[“a”, “a”, “b”, “c”, “a”, “b”, “c”, “b”, “c”, “d”]
For first iteration,, it removed all adjacent repeated frames of size 1:
[“a”, “b”, “c”, “a”, “b”, “c”, “b”, “c”, “d”]
For second iteration, it removed all adjacent repeated frames of size 2:
[“a”, “b”, “c”, “a”, “b”, “c”, “d”]
So in the end, we get compressed output:
[“a”, “b”, “c”, “d”]

Compression will be called in two place: one for sample's context key right after unwinding, one is for the eventual context string id in the ProfileGenerator.
Added a switch `compress-recursion` to control the size of duplicated frames, default -1 means no size limit.
Added unit tests and regression test for this.

Differential Revision: https://reviews.llvm.org/D93556
2021-02-03 18:50:14 -08:00