Commit Graph

155 Commits

Author SHA1 Message Date
serge-sans-paille f1985a3f85 Cleanup includes: Transforms/IPO
Preprocessor output diff: -238205 lines
Discourse thread: https://discourse.llvm.org/t/include-what-you-use-include-cleanup
Differential Revision: https://reviews.llvm.org/D122183
2022-03-22 10:06:28 +01:00
Hongtao Yu bc380c0930 [llvm-profgen] Turn on CS nested profile generation by default for CSSPGO.
CS nested profile has an advantage over CS flat profile: it speeds up the build while achieving on-par performance. I'm turning it on by default for CSSPGO.

Reviewed By: wenlei

Differential Revision: https://reviews.llvm.org/D121142
2022-03-08 09:05:27 -08:00
Hongtao Yu 23391febd8 [llvm-profgen] Generating probe-based non-CS profile.
I'm bringing up support for pseudo-probe-based non-CS profile generation. The approach is quite similar to generating a DWARF-based non-CS profile. The main difference is that, for a given linear instruction range, the pseudo probes covered by the range are processed instead of each disassembled instruction. The pseudo probe extraction code is shared with CS probe profile generation.

I'm seeing a 0.7% performance win on one of our large internal benchmarks compared to using a non-CS DWARF-based profile, and a 0.5% win on another large benchmark when combined with profi.
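To make the range-to-probe step concrete, here is a minimal Python sketch. The data layout (inclusive `(start, end)` address ranges, probes keyed by address as `(function, probe_id)` pairs) and the name `accumulate_probe_counts` are hypothetical illustrations, not llvm-profgen's actual C++ data structures.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def accumulate_probe_counts(range_counts, probes_by_addr):
    """Attribute each sampled linear range's count to every pseudo probe it covers.

    range_counts:   {(start, end): count}, inclusive address ranges.
    probes_by_addr: {address: [(function_name, probe_id), ...]} (hypothetical layout).
    Returns {(function_name, probe_id): accumulated_count}.
    """
    addrs = sorted(probes_by_addr)
    body = defaultdict(int)
    for (start, end), count in range_counts.items():
        # Probes inside [start, end] form a contiguous slice of the sorted addresses.
        for addr in addrs[bisect_left(addrs, start):bisect_right(addrs, end)]:
            for probe in probes_by_addr[addr]:
                body[probe] += count
    return dict(body)
```

Because the probe addresses are sorted, each range only visits the probes that fall inside it rather than every probe in the binary.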

Reviewed By: wenlei

Differential Revision: https://reviews.llvm.org/D120335
2022-03-01 18:49:08 -08:00
serge-sans-paille fc97efa409 Cleanup includes: ProfileData
Estimation of the impact on preprocessor output:

before: 1067349756
after: 1065940348

Discourse thread: https://discourse.llvm.org/t/include-what-you-use-include-cleanup
Differential Revision: https://reviews.llvm.org/D120434
2022-02-24 13:25:11 +01:00
serge-sans-paille db29f4374d Cleanup include: DebugInfo/Symbolize
Estimation of the impact on preprocessor output:
before: 1067487786
after: 1067349756

Discourse thread: https://discourse.llvm.org/t/include-what-you-use-include-cleanup
Differential Revision: https://reviews.llvm.org/D120433
2022-02-24 13:25:11 +01:00
wlei b3a778fb5e [llvm-profgen] Support symbol loading for debug fission
Support loading debug info from DWARF split files, such as .dwo and .dwp files. Leverage the `getNonSkeletonUnitDIE(false)` API to achieve this.

Add a test case to make sure all the ranges are correctly retrieved by the loader.

Reviewed By: ayermolo, hoy, wenlei

Differential Revision: https://reviews.llvm.org/D115973
2022-02-23 09:40:46 -08:00
Hongtao Yu 34e131b0f2 [llvm-profgen] On-demand track optimized-away inlinees for preinliner.
Tracking optimized-away inlinees based on all probes in a binary is expensive in terms of memory usage, so I'm making the tracking on-demand, based on profiled functions only. This saves about 10% memory overall for a medium-sized benchmark.

Before:

   note: After parsePerfTraces
   note: Thu Jan 27 18:42:09 2022
   note: VM: 8.68 GB   RSS: 8.39 GB
   note: After computeSizeForProfiledFunctions
   note: Thu Jan 27 18:42:41 2022
   note: **VM: 10.63 GB   RSS: 10.20 GB**
   note: After generateProbeBasedProfile
   note: Thu Jan 27 18:45:49 2022
   note: VM: 25.00 GB   RSS: 24.95 GB
   note: After postProcessProfiles
   note: Thu Jan 27 18:49:29 2022
   note: VM: 26.34 GB   RSS: 26.27 GB

After:
   note: After parsePerfTraces
   note: Fri Jan 28 12:04:49 2022
   note: VM: 8.68 GB   RSS: 7.65 GB
   note: After computeSizeForProfiledFunctions
   note: Fri Jan 28 12:05:26 2022
   note: **VM: 8.68 GB   RSS: 8.42 GB**
   note: After generateProbeBasedProfile
   note: Fri Jan 28 12:08:03 2022
   note: VM: 22.93 GB   RSS: 22.89 GB
   note: After postProcessProfiles
   note: Fri Jan 28 12:11:30 2022
   note: VM: 24.27 GB   RSS: 24.22 GB

This should be a no-diff change in terms of profile quality.

Reviewed By: wenlei

Differential Revision: https://reviews.llvm.org/D118515
2022-02-08 08:33:23 -08:00
Simon Pilgrim 01d5254f3d [llvm-profgen] Use cast<> instead of dyn_cast<> to avoid dereference of nullptr
The pointer is dereferenced immediately, so assert the cast is correct instead of returning nullptr
2022-02-02 14:12:11 +00:00
Simon Pilgrim c56a85fde0 [llvm-profgen] Use cast<> instead of dyn_cast<> to avoid dereference of nullptr
The pointers are dereferenced immediately, so assert the cast is correct instead of returning nullptr
2022-02-02 14:12:10 +00:00
Hongtao Yu 67db31115d [llvm-profgen] Clean up unnecessary memory reservations between phases.
Cleaning up data structures that are not used after a certain point. This further brings down peak memory usage by 15% for a large benchmark.

Before:
   note: Before parsePerfTraces
   note: VM: 40.73 GB   RSS: 39.18 GB
   note: Before parseAndAggregateTrace
   note: VM: 40.73 GB   RSS: 39.18 GB
   note: After parseAndAggregateTrace
   note: VM: 88.93 GB   RSS: 87.97 GB
   note: Before generateUnsymbolizedProfile
   note: VM: 88.95 GB   RSS: 87.99 GB
   note: After generateUnsymbolizedProfile
   note: VM: 93.50 GB   RSS: 92.53 GB
   note: After computeSizeForProfiledFunctions
   note: VM: 101.13 GB   RSS: 99.36 GB
   note: After generateProbeBasedProfile
   note: VM: 215.61 GB   RSS: 210.88 GB
   note: After postProcessProfiles
   note: VM: 237.48 GB   RSS: 212.50 GB

After:
   note: Before parsePerfTraces
   note: VM: 40.73 GB   RSS: 39.18 GB
   note: Before parseAndAggregateTrace
   note: VM: 40.73 GB   RSS: 39.18 GB
   note: After parseAndAggregateTrace
   note: VM: 88.93 GB   RSS: 87.96 GB
   note: Before generateUnsymbolizedProfile
   note: VM: 88.95 GB   RSS: 87.97 GB
   note: After generateUnsymbolizedProfile
   note: VM: 93.50 GB   RSS: 92.51 GB
   note: After computeSizeForProfiledFunctions
   note: VM: 93.50 GB   RSS: 92.53 GB
   note: After generateProbeBasedProfile
   note: VM: 164.87 GB   RSS: 163.55 GB
   note: After postProcessProfiles
   note: VM: 182.28 GB   RSS: 179.43 GB

Reviewed By: wenlei, wlei

Differential Revision: https://reviews.llvm.org/D118677
2022-02-01 16:27:54 -08:00
Hongtao Yu fec57e5b17 Revert "[llvm-profgen] Clean up unnecessary memory reservations between phases."
This reverts commit 057e784b09.
2022-02-01 14:44:48 -08:00
Hongtao Yu 057e784b09 [llvm-profgen] Clean up unnecessary memory reservations between phases.
Cleaning up data structures that are not used after a certain point. This further brings down peak memory usage by 15% for a large benchmark.

Before:
   note: Before parsePerfTraces
   note: VM: 40.73 GB   RSS: 39.18 GB
   note: Before parseAndAggregateTrace
   note: VM: 40.73 GB   RSS: 39.18 GB
   note: After parseAndAggregateTrace
   note: VM: 88.93 GB   RSS: 87.97 GB
   note: Before generateUnsymbolizedProfile
   note: VM: 88.95 GB   RSS: 87.99 GB
   note: After generateUnsymbolizedProfile
   note: VM: 93.50 GB   RSS: 92.53 GB
   note: After computeSizeForProfiledFunctions
   note: VM: 101.13 GB   RSS: 99.36 GB
   note: After generateProbeBasedProfile
   note: VM: 215.61 GB   RSS: 210.88 GB
   note: After postProcessProfiles
   note: VM: 237.48 GB   RSS: 212.50 GB

After:
   note: Before parsePerfTraces
   note: VM: 40.73 GB   RSS: 39.18 GB
   note: Before parseAndAggregateTrace
   note: VM: 40.73 GB   RSS: 39.18 GB
   note: After parseAndAggregateTrace
   note: VM: 88.93 GB   RSS: 87.96 GB
   note: Before generateUnsymbolizedProfile
   note: VM: 88.95 GB   RSS: 87.97 GB
   note: After generateUnsymbolizedProfile
   note: VM: 93.50 GB   RSS: 92.51 GB
   note: After computeSizeForProfiledFunctions
   note: VM: 93.50 GB   RSS: 92.53 GB
   note: After generateProbeBasedProfile
   note: VM: 164.87 GB   RSS: 163.55 GB
   note: After postProcessProfiles
   note: VM: 182.28 GB   RSS: 179.43 GB

Reviewed By: wenlei, wlei

Differential Revision: https://reviews.llvm.org/D118677
2022-02-01 12:48:08 -08:00
wlei 6693c562f9 [llvm-profgen] Support to load debug info from a second binary
To reduce binary size, a binary's debug info and executable segments can be separated (e.g., using `objcopy --only-keep-debug`). This adds support in llvm-profgen for taking two binaries as input: the original executable binary and an additional debug-info-only binary. A new flag, `--debug-binary=file-path`, makes llvm-profgen load debug info from the given debug binary.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D115948
2022-01-24 17:14:05 -08:00
Simon Pilgrim f4aa2a42ed [llvm-profgen] ProfiledBinary::load - use cast<> instead of dyn_cast<> to avoid dereference of nullptr
The pointer is always dereferenced immediately, so assert the cast is correct instead of returning nullptr
2022-01-14 15:51:21 +00:00
Simon Pilgrim 92ba979c28 [llvm-profgen] Pass iteration value by reference in for-range loops to avoid unnecessary copies 2022-01-14 14:49:57 +00:00
Simon Pilgrim 86bbf01d89 [llvm-profgen] CSProfileGenerator::generateLineNumBasedProfile - use cast<> instead of dyn_cast<> to avoid dereference of nullptr
The pointer is always dereferenced immediately below, so assert the cast is correct instead of returning nullptr
2022-01-14 14:49:57 +00:00
Wenlei He 9a2120a6e1 [llvm-profgen] Error out for unsupported AutoFDO profile generate with probe
Error out instead of silently generating an empty profile when trying to generate an AutoFDO profile with a probe binary.

Differential Revision: https://reviews.llvm.org/D116508
2022-01-02 16:38:56 -08:00
wlei b239b2b0db [llvm-profgen] Fix warning of enumerated and non-enumerated type in conditional expression
Differential Revision: https://reviews.llvm.org/D115842
2021-12-16 19:28:55 -08:00
Wenlei He f6f0409f6f [llvm-profgen] Turn on preinliner by default
The preinliner has been tuned on large server workloads and is now ready to be turned on by default. This change also updates the thresholds based on tuning.

Differential Revision: https://reviews.llvm.org/D115770
2021-12-14 17:46:57 -08:00
wlei 0f53df864e [CSSPGO][llvm-profgen] Fix external address issues of perf reader (return to external addr part)
Previously we had an issue with artificial LBRs whose source is a return. Recall that "an internal code (A) can return to an external address, then from the external address call a new internal code (B), making an artificial branch that looks like a return from A to B, which can confuse the unwinder". We simply ignored the LBRs after such an artificial LBR, which can miss some samples. This change fixes this by correctly unwinding them instead of ignoring them.

Some typical scenarios covered by this change:

1) Multiple sequential callbacks happen from an external address, e.g.

```
[ext, call, foo] [foo, return, ext] [ext, call, bar]
```
The unwinder should avoid having foo return from bar; the wrong call stack would be [foo, bar].

2) The call stack before and after an external call should be correctly unwound.
```
 {call stack1}                                            {call stack2}
 [foo, call, ext]  [ext, call, bar]  [bar, return, ext]  [ext, return, foo ]
```
Call stack 1 should be the same as call stack 2, and neither should be truncated.

3) The call stack should be truncated after a call into external code, since we can't inline external code.

```
 [foo, call, ext]  [ext, call, bar]  [bar, call, baz] [baz, return, bar ] [bar, return, ext]
```
The call stack of code in baz should not include foo.

### Implementation:

We leverage an artificial frame to fix #2 and #3: when we get a return artificial LBR, we push an extra artificial frame onto the stack. When we pop a frame, we check whether its parent is an artificial frame and pop that too (fix #2). Therefore, call/return artificial LBRs are handled just like regular LBRs, which preserves the call stack.

While recording the context on the trie, the artificial frame is used as a tag indicating that we should truncate the call stack (fix #3).

To differentiate #1 from #2, we leverage `getCallAddrFromFrameAddr`. Normally, the target of a return is the instruction following a call instruction, and `getCallAddrFromFrameAddr` returns the address of that call instruction. Otherwise, `getCallAddrFromFrameAddr` returns 0, which is case #1.
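A simplified Python sketch of the two artificial-frame rules: popping through an artificial parent (fix #2) and truncating the recorded context at an artificial frame (fix #3). The `ARTIFICIAL` marker, the list-based stack, and both function names are hypothetical; the real unwinder operates on LBR entries and a context trie.

```python
ARTIFICIAL = "<artificial>"  # hypothetical marker for a frame standing in for external code


def pop_frame(stack):
    """Pop the current frame; if its parent is artificial, pop that too (fix #2),
    so the stack looks as it did before the call into external code."""
    stack.pop()
    if stack and stack[-1] == ARTIFICIAL:
        stack.pop()


def context_to_record(stack):
    """Return only the frames above the innermost artificial frame (fix #3).

    The stack is ordered from the outermost caller to the current leaf frame.
    """
    context = []
    for frame in reversed(stack):
        if frame == ARTIFICIAL:
            break
        context.append(frame)
    return context[::-1]
```

With the stack `["foo", ARTIFICIAL, "bar", "baz"]` from scenario 3, the recorded context for baz is `["bar", "baz"]`, so foo is excluded.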

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D115550
2021-12-14 16:40:54 -08:00
wlei 30c3aba998 [llvm-profgen] Fix to use getUntrackedCallsites outside the loop
The unwinder was hoisted out in https://reviews.llvm.org/D115550, so fix the usage of getUntrackedCallsites accordingly.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D115760
2021-12-14 16:40:53 -08:00
wlei 3dcb60db9a [CSSPGO][llvm-profgen] Fix external address issues of perf reader (leading external LBR part)
A sample can hit external addresses directly; in that case, both the top stack frame and the latest LBR target are external addresses. For example:
```
	        ffffffff
 0x4006c8/0xffffffff/P/-/-/0  0x40069b/0x400670/M/-/-/0

 	          ffffffff
	          40067e
0xffffffff/0xffffffff/P/-/-/0  0x4006c8/0xffffffff/P/-/-/0  0x40069b/0x400670/M/-/-/0
```
Previously we ignored the entire sample. However, we found that some internal LBRs can exist in the remaining part of the sample, and the ranges between them are still valid, so we were losing some valid LBRs. Those LBRs will be unwound based on an empty (context-less) call stack.

This change fixes it: instead of ignoring the entire sample, we only ignore the leading external addresses.

Note that the first outgoing LBR is useful, since there is a valid range between its source and the next LBR's target.
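A minimal sketch of the trimming rule, assuming LBR entries are `(source, target)` pairs in perf's most-recent-first order and that `is_external` is a caller-supplied (hypothetical) predicate. A linear range ends at each entry's source, so leading entries whose source is external cannot contribute and are dropped, while the first outgoing entry (internal source, external target) is kept.

```python
def strip_leading_external(lbrs, is_external):
    """Drop leading LBR entries whose source address is external.

    lbrs: [(source, target), ...], most recent first (perf order).
    is_external: predicate on an address (hypothetical, supplied by the caller).
    """
    i = 0
    while i < len(lbrs) and is_external(lbrs[i][0]):
        i += 1
    return lbrs[i:]
```

Applied to the second sample in the example, this keeps `0x4006c8/0xffffffff` and `0x40069b/0x400670` and drops only the leading `0xffffffff/0xffffffff` entry.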

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D115538
2021-12-14 16:40:53 -08:00
wlei 3220571793 [llvm-profgen] Skip disassembling for PLT section
Skip disassembling the .plt section, so that .plt section code is treated as external code.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D115699
2021-12-14 16:40:53 -08:00
Hongtao Yu 5740bb801a [CSSPGO] Use nested context-sensitive profile.
CSSPGO currently employs a flat profile format for context-sensitive profiles. Such a flat profile allows for precisely manipulating contexts that are either inlined or not inlined. This is a benefit over the nested profile format used by non-CS AutoFDO. A downside of this is the longer build time due to parsing and indexing the full CS contexts.

For a CS flat profile, although only the context profiles relevant to a module are loaded when that module is compiled, the cost of figuring out which profiles are relevant is noticeably high when there are many contexts, since the sample reader needs to scan all context strings anyway. In contrast, a nested function profile keeps its related inline subcontexts isolated from other unrelated contexts. Therefore, when compiling a set of functions, unrelated contexts never need to be scanned.

In this change we are exploring using the nested profile format for CSSPGO. This is expected to work based on the assumption that, with a preinliner-computed profile, all contexts are precomputed and expected to be inlined by the compiler. Contexts not expected to be inlined are cut off and returned to the corresponding base profiles (for top-level outlined functions). This naturally forms a nested profile where all nested contexts are expected to be inlined. The compiler is less likely to optimize on derived contexts that are not precomputed.

A CS-nested profile looks exactly the same as a regular nested profile, except that each nested profile can come with attributes. With pseudo probes, a nested profile like the one below can also carry a CFG checksum.

```

main:1968679:12
 2: 24
 3: 28 _Z5funcAi:18
 3.1: 28 _Z5funcBi:30
 3: _Z5funcAi:1467398
  0: 10
  1: 10 _Z8funcLeafi:11
  3: 24
  1: _Z8funcLeafi:1467299
   0: 6
   1: 6
   3: 287884
   4: 287864 _Z3fibi:315608
   15: 23
   !CFGChecksum: 138828622701
   !Attributes: 2
  !CFGChecksum: 281479271677951
  !Attributes: 2
```

Specific work included in this change:
- A recursive profile converter to convert a CS flat profile to a nested profile.
- Extend function checksum and attribute metadata to be stored in a nested way for text and extbinary profiles.
- Unify the sample loader inliner path for CS and preinlined nested profiles.
- Changes in the sample loader to support probe-based nested profiles.
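The first item, converting a flat CS profile into a nested one, can be illustrated with a simplified Python sketch. It assumes the preinliner has already moved non-inlined contexts back to their base profiles, represents a context as a tuple of `(function, callsite)` frames with `None` as the leaf's callsite, and tracks body samples only (no head/total samples, checksums, or attributes). All names here are hypothetical.

```python
def to_nested(flat):
    """Convert {context: body_samples} into a nested profile tree.

    context: ((func, callsite), ..., (leaf_func, None)), outermost caller first.
    body_samples: {line: count}.
    Returns {top_func: node}, where
    node = {"body": {line: count}, "callsites": {callsite: {callee: node}}}.
    """
    def new_node():
        return {"body": {}, "callsites": {}}

    roots = {}
    for context, body in flat.items():
        node = roots.setdefault(context[0][0], new_node())
        # Walk caller/callee pairs, descending one nested level per inlined call.
        for (_caller, callsite), (callee, _) in zip(context, context[1:]):
            node = node["callsites"].setdefault(callsite, {}).setdefault(callee, new_node())
        for line, count in body.items():
            node["body"][line] = node["body"].get(line, 0) + count
    return roots
```

Each flat context becomes a path from its top-level function down through callsites, mirroring the indentation of the text profile shown above.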

I've seen promising results regarding build time. A nested profile can result in a 20% shorter build time than a CS flat profile while keeping on-par performance. This is with -duplicate-contexts-into-base=1.

Test Plan:

Reviewed By: wenlei

Differential Revision: https://reviews.llvm.org/D115205
2021-12-14 14:40:25 -08:00
wlei 484a569eea [llvm-profgen] Fix total samples related issues
Since total samples and body samples are used to compute the hotness threshold in the compiler, we found that in some services changing the total samples computation causes noticeable regressions. Hence, this change reverts those changes and keeps all total sample numbers identical to the old tool.

Three changes in this diff:

1. Revert the previous diff (https://reviews.llvm.org/D112672: [llvm-profgen] Update total samples by accumulating all its body samples) and put it under a switch.

2. Keep negative line numbers. Although the compiler doesn't consume their counts, they are used to compute the hot threshold.

3. Change to accumulate total samples per byte instead of per instruction.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D115013
2021-12-08 12:33:41 -08:00
wlei 27cb3707db [llvm-profgen] Trim cold function profiles for non-CS AutoFDO
This change allows trimming a function profile if it is considered cold for baseline AutoFDO. We reuse the cold threshold from `ProfileSummaryBuilder::getColdCountThreshold(..)`, which can be set by percentage (--profile-summary-cutoff-cold) or by value (--profile-summary-cold-count).
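A minimal sketch of the trimming step, assuming total samples per function are already computed and the cold threshold has been obtained elsewhere (for example from the profile summary). The function name and the `>=` comparison are assumptions, not llvm-profgen's exact code.

```python
def trim_cold_profiles(total_samples, cold_count_threshold):
    """Keep only functions whose total samples reach the cold threshold.

    total_samples: {function_name: total_samples}.
    Assumption: a function is cold when its total is strictly below the threshold.
    """
    return {name: total for name, total in total_samples.items()
            if total >= cold_count_threshold}
```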

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D113785
2021-12-08 12:20:50 -08:00
wlei f15a854567 [llvm-profgen] Truncate the context with zero probe ID
Due to debug info merging, some contexts may have a zero probe ID; we truncate such contexts to avoid misleading the preinliner.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D114284
2021-11-30 16:21:25 -08:00
wlei 41a681ce09 [FS-AFDO][llvm-profgen] Generate profile with FS-AFDO discriminator
To support generating profiles with FS discriminators, three kinds of changes are made in llvm-profgen:

1) Disassemble the .rodata section to check whether the FS discriminator variable (`__llvm_fs_discriminator__`) exists, and set the corresponding flag in the binary.

2) Change the discriminator decoding in `getBaseDiscriminator` and `getDuplicationFactor`.

3) Set `FunctionSamples::ProfileIsFS` to true to enable FS functionality in ProfileData.

Reviewed By: xur, hoy, wenlei

Differential Revision: https://reviews.llvm.org/D113296
2021-11-30 15:57:59 -08:00
Hongtao Yu bf317f6698 [CSSPGO] Sorting nodes in a cycle of profiled call graph.
For nodes that are in a cycle of a profiled call graph, the order computed by the underlying scc_iter depends purely on how those nodes are reached from outside and inside the SCC, based on Tarjan's algorithm. This does not honor profile edge hotness, and thus does not guarantee that hot callsites are inlined before cold callsites. To mitigate that, I'm adding an extra sorter on top of scc_iter to sort SCC functions in the order of callsite hotness, instead of changing the internals of scc_iter.

Sorting on callsite hotness could optimally be based on detecting cycles in a directed call graph, i.e., removing the coldest edge until a cycle is broken. However, detecting cycles isn't cheap. I'm using an MST-based approach, which is faster and appears to deliver some performance wins.
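A Python sketch of one MST-style ordering, assuming SCC-internal edges are given as `(caller, callee, weight)` triples. Kruskal's algorithm with union-find keeps the hottest edges that do not close a cycle (ignoring direction), and a Kahn-style topological sort over the kept edges yields a top-down order. This illustrates the general idea only; the actual llvm-profgen sorter may differ in details such as tie-breaking and traversal order, and the function name is hypothetical.

```python
from collections import defaultdict, deque


def order_scc_by_hotness(nodes, edges):
    """Order the functions of one SCC so that hot caller->callee edges go top-down.

    nodes: functions in the SCC; edges: [(caller, callee, weight)] inside the SCC.
    """
    parent = {n: n for n in nodes}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Maximum spanning forest: consider the hottest edges first.
    kept = []
    for caller, callee, _weight in sorted(edges, key=lambda e: e[2], reverse=True):
        a, b = find(caller), find(callee)
        if a != b:  # this edge does not close an (undirected) cycle
            parent[a] = b
            kept.append((caller, callee))

    # A forest has no cycles, so its directed edges admit a topological order.
    indegree = {n: 0 for n in nodes}
    succs = defaultdict(list)
    for caller, callee in kept:
        succs[caller].append(callee)
        indegree[callee] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for s in succs[n]:
            indegree[s] -= 1
            if indegree[s] == 0:
                queue.append(s)
    return order
```

In a three-node cycle `a -> b -> c -> a`, the coldest edge is the one dropped, so the order follows the two hotter edges.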

Reviewed By: wenlei

Differential Revision: https://reviews.llvm.org/D114204
2021-11-30 09:01:08 -08:00
wlei c2e08aba1a [llvm-profgen] Compute and show profile density
AutoFDO performance is sensitive to profile density, i.e., the amount of samples in the profile relative to the program size, because profiles with insufficient samples could be inaccurate due to statistical noise and thus hurt AutoFDO performance. A previous investigation showed that AutoFDO performed better on MySQL with an increased number of samples. Therefore, we implement a profile-density computation feature to give hints about profile density to users and the compiler.

We define the density of a profile Prof as follows:

- For each function A in the profile, density(A) = total_samples(A) / sizeof(A).
- density(Prof) = min(density(A)) for all functions A that are warm (defined below).

A function is considered warm if its total samples are within the top N percent of the profile. For the implementation, we reuse `ProfileSummaryBuilder::getHotCountThreshold(..)` as the threshold, which can be set by percentage (`--profile-summary-cutoff-hot`) or by value (`--profile-summary-hot-count`).

We also introduce `--hot-function-density-threshold` to set the hot function density threshold; if the profile density is below it, llvm-profgen suggests increasing the number of samples.

This also applies to CS profiles, with all profiles merged into the base.
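The density definition translates directly into code. This sketch assumes per-function `(total_samples, size_in_bytes)` pairs and a precomputed hot count threshold; treating `total >= threshold` as warm and returning 0.0 when no function is warm are assumptions, and the function name is hypothetical.

```python
def profile_density(funcs, hot_count_threshold):
    """Compute density(Prof) = min over warm functions A of total_samples(A) / sizeof(A).

    funcs: {function_name: (total_samples, size_in_bytes)}.
    """
    densities = [total / size for total, size in funcs.values()
                 if total >= hot_count_threshold and size > 0]
    return min(densities) if densities else 0.0
```

Comparing the result against the `--hot-function-density-threshold` value then decides whether to suggest collecting more samples.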

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D113781
2021-11-29 23:54:31 -08:00
Wenlei He f7976edc1e [llvm-profgen] Add switch to allow use of first loadable segment for calculating offset
Adding `-use-loadable-segment-as-base` to allow use of first loadable segment for calculating offset. By default first executable segment is used for calculating offset. The switch helps compatibility with unsymbolized profile generated from older tools.

Differential Revision: https://reviews.llvm.org/D113727
2021-11-15 19:00:27 -08:00
wlei aab1810006 [llvm-profgen] Fix bug of setting function entry
Previously we set the `isFuncEntry` flag to true when the function name from DWARF was equal to the name in the symbol table, and we used this flag to avoid reporting callsite samples that come from an intra-function branch. However, in HHVM, it appears that the symbol table name is inconsistent with the DWARF function name, likely due to `OptimizeGlobalAliases`.

This change is a workaround on the llvm-profgen side: it marks only one range as the function entry and adds warnings for the remaining inconsistencies.

This also fixed a missing `getCanonicalFnName` for symbol name which caused the mismatching as well.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D113492
2021-11-12 12:18:43 -08:00
wlei 5bf191a381 [llvm-profgen] Fix index out of bounds error while using ip.advance
Previously we assumed there were some non-executing sections at the bottom of the text section, so that we would not hit the array's bound. But on a BOLTed binary, the .bolt section turned out to be at the bottom of the text section, and since it can be profiled, llvm-profgen crashed. This change fixes that.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D113238
2021-11-05 18:38:40 -07:00
wlei dc9f037955 [llvm-profgen] Refactor the code of getHashCode
Refactor to generate hash code lazily. Tested on clang self build, no observable generating time regression.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D113059
2021-11-02 19:56:20 -07:00
wlei 138202a8c3 [llvm-profgen] Warn on invalid range and show warning summary
Two things in this diff:

1) Warn on invalid ranges. There are currently three types of checks; see the detailed messages in the code.

2) In some situations, llvm-profgen emits lots of warnings about truncated stacks, which is noisy. This change adds a `--show-detailed-warning` switch; without it, those individual warnings are skipped and a summary showing the percentage of cases with each issue is emitted instead.

Example of a warning summary:
```
warning: 0.05%(1120/2428958) cases with issue: Profile context truncated due to missing probe for call instruction.
warning: 0.00%(2/178637) cases with issue: Range does not belong to any functions, likely from external function.
```
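The percentages in the summary are the issue count divided by the total count, printed with two decimals. A hypothetical Python helper reproducing the format of the example lines:

```python
def warning_summary(issue, num_issues, num_total):
    """Format one summary line: percentage, raw ratio, and the issue text."""
    pct = 100.0 * num_issues / num_total if num_total else 0.0
    return f"warning: {pct:.2f}%({num_issues}/{num_total}) cases with issue: {issue}"
```

For example, 1120 of 2428958 cases is about 0.046%, which rounds to the `0.05%` shown above.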

Reviewed By: hoy

Differential Revision: https://reviews.llvm.org/D111902
2021-11-02 19:55:55 -07:00
wlei 3f3103c6a9 [llvm-profgen] Fill zero count for all function ranges
Allow filling zero counts for all function ranges even if no samples hit the function. A switch is added for this.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D112858
2021-11-01 09:57:05 -07:00
wlei f5537643b8 [llvm-profgen] Update total samples by accumulating all its body samples
As with probe-based profiles, the total samples of a function should be the sum of all its body samples. This patch fixes this with a post-processing update for line-number-based profiles. Tested on our internal services; results showed no performance change.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D112672
2021-10-29 10:36:57 -07:00
Kazu Hirata 3b285ff517 [llvm-profgen] Fix a set-but-unused warning
This patch fixes:

  llvm/tools/llvm-profgen/ProfiledBinary.cpp:357:12: error: variable
  'EndOffset' set but not used [-Werror,-Wunused-but-set-variable]

The last use of the variable was removed on Oct 26 in commit
40ca411251.
2021-10-29 10:19:44 -07:00
wlei 2f8196db92 [llvm-profgen] Fix bug of populating profile symbol list
The previous implementation of populating the profile symbol list was wrong: it only included the profiled symbols, whereas it should include all symbols. This change switches to the symbols from debug info. The flag is also turned off by default.

Reviewed By: wenlei, hoy

Differential Revision: https://reviews.llvm.org/D111824
2021-10-29 09:59:12 -07:00
wlei 40ca411251 [llvm-profgen] Switch to DWARF-based symbol and ranges
We hit a bug where some callsite names in the profile were not real functions. It turned out that there are some non-function symbols in the ELF text section, e.g., globally accessible branch labels; recall also that one function can be split into multiple ranges. We shouldn't count samples for symbols that are not the entry of a real function.

This change fixes the issue by switching to the names and ranges from DWARF-based debug info, whose ranges are guaranteed to start at a real function entry. For split functions, we assume that the real entry function's DWARF name always matches the symbol table name.

This switch is also consistent with the body samples, whose symbols already come from DWARF.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D112282
2021-10-29 09:59:12 -07:00
Hongtao Yu 259e4c5658 [CSSPGO] Trim cold base profiles for the CS preinliner.
Adding support to the CS preinliner to trim cold base profiles. This makes trimming consistent with the inline decision made by the preinliner. Also disable the existing profile merger when preinliner is on unless explicitly specified.

Reviewed By: wenlei, wlei

Differential Revision: https://reviews.llvm.org/D112489
2021-10-27 22:50:27 -07:00
wlei a5f411b7f8 [llvm-profgen] Allow unsymbolized profile as perf input
This change allows an unsymbolized profile as input. The unsymbolized profile is created by `llvm-profgen` with `--skip-symbolization`; it is produced after sample aggregation but before symbolization, so it has a much smaller file size. It can be used for sample merging and trimming, and is also useful for debugging or adding test cases. A switch, `--unsymbolized-profile=file-path`, is added for this.

Format of unsymbolized profile:
```

   [context stack1]    # If it's a CS profile
      number of entries in RangeCounter
      from_1-to_1:count_1
      from_2-to_2:count_2
      ......
      from_n-to_n:count_n
      number of entries in BranchCounter
      src_1->dst_1:count_1
      src_2->dst_2:count_2
      ......
      src_n->dst_n:count_n
    [context stack2]
      ......
```
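A minimal Python reader for the format above, for illustration only. It assumes hexadecimal addresses, that a non-CS profile has no `[context]` line (keyed here by the empty string), and that both counter sections are always present; the function name is hypothetical.

```python
def parse_unsymbolized(text):
    """Parse an unsymbolized profile into
    {context: {"ranges": {(from, to): count}, "branches": {(src, dst): count}}}.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    profiles = {}
    i = 0
    while i < len(lines):
        context = ""
        if lines[i].startswith("["):
            context = lines[i][1:lines[i].index("]")]
            i += 1
        ranges, branches = {}, {}
        num_ranges = int(lines[i])
        i += 1
        for _ in range(num_ranges):
            span, count = lines[i].rsplit(":", 1)
            start, end = span.split("-")
            ranges[(int(start, 16), int(end, 16))] = int(count)
            i += 1
        num_branches = int(lines[i])
        i += 1
        for _ in range(num_branches):
            edge, count = lines[i].rsplit(":", 1)
            src, dst = edge.split("->")
            branches[(int(src, 16), int(dst, 16))] = int(count)
            i += 1
        profiles[context] = {"ranges": ranges, "branches": branches}
    return profiles
```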

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D111750
2021-10-25 23:58:08 -07:00
Kazu Hirata 4e3eebc6bd [tools, utils] Use StringRef::contains (NFC) 2021-10-22 17:22:13 -07:00
Wenlei He e8c245dcd3 [llvm-profgen] Skip duplication factor outside of body sample computation
We incorrectly applied the duplication factor to total samples even though we already accumulate samples instead of taking the MAX. This caused profiles to have bloated total samples for functions with unrolled or vectorized loops. This change fixes the issue for total samples, head samples, and call target samples.

Differential Revision: https://reviews.llvm.org/D112042
2021-10-19 23:10:45 -07:00
Wenlei He a316343e19 [llvm-profgen] Allow generating AutoFDO profile from CSSPGO binary
Add `-use-dwarf-correlation` switch to allow llvm-profgen to generate AutoFDO profile for binaries built with CSSPGO (pseudo-probe).

Differential Revision: https://reviews.llvm.org/D111776
2021-10-14 09:11:56 -07:00
wlei 30ca33eab0 [llvm-profgen] Ignore the whole trace with the leading external branch
The first LBR entry can be an external branch; in that case, we should ignore the whole trace.

```
     7f7448e889e4 0x7f7448e889e4/0x7f7448e88826/P/-/-/1  0x7f7448e8899f/0x7f7448e889d8/P/-/-/4  ...
```

Reviewed By: wenlei, hoy

Differential Revision: https://reviews.llvm.org/D111749
2021-10-13 16:52:29 -07:00
wlei ab5d65e685 [llvm-profgen] Ignore stack samples before aggregation
With `ignore-stack-samples`, we can ignore the call stack before sample aggregation, which reduces some redundant computation.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D111577
2021-10-13 16:52:29 -07:00
Wenlei He da4e5fc861 [llvm-profgen] Deduplicate PID when processing perf input
When parsing mmap events to retrieve PIDs, deduplicate them before passing the PID list to perf script. Perf script errors out when there are duplicated PIDs in the input; however, raw perf data may contain duplicated PIDs for a large binary where more than one mmap is needed to load the executable segment.

Differential Revision: https://reviews.llvm.org/D111384
2021-10-10 13:30:17 -07:00
Reid Kleckner 89b57061f7 Move TargetRegistry.(h|cpp) from Support to MC
This moves the registry higher in the LLVM library dependency stack.
Every client of the target registry needs to link against MC anyway to
actually use the target, so we might as well move this out of Support.

This allows us to ensure that Support doesn't have includes from MC/*.

Differential Revision: https://reviews.llvm.org/D111454
2021-10-08 14:51:48 -07:00
wlei b1a45c62f0 [llvm-profgen] Ignore branch count against outline function
Some transformations, like hot-cold splitting or coroutine splitting, can outline part of a function's ranges. Since the sample loader runs early in the backend, before any splitting happens, the compiler can't recognize those outlined functions, so llvm-profgen should attribute their samples to the original function. This is already done for body range samples, since we use symbols from DWARF, which are created before the split.

But for branch samples, the call from the parent function to its outlined part is not really a call to another function, so we shouldn't add head/callsite samples for it. Instead of DWARF symbols, we use the symbols from the symbol table and ignore functions with special suffixes (like `.cold`, `.resume`) when accumulating callsite/head samples.
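A sketch of the suffix check, assuming outlined fragments are recognized by a dot-separated name component such as `.cold` or `.resume` (so `foo.cold.1` matches but `foo.coldstart` does not). The marker set and function name are hypothetical; the real list of suffixes in llvm-profgen may differ.

```python
OUTLINED_MARKERS = {"cold", "resume"}  # hypothetical; based on the examples above


def is_outlined_part(symbol_name):
    """Return True if a symbol-table name looks like an outlined fragment,
    in which case branches into it should not produce head/callsite samples."""
    return any(part in OUTLINED_MARKERS for part in symbol_name.split(".")[1:])
```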

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D110864
2021-10-07 14:03:34 -07:00