Commit Graph

27 Commits

Author SHA1 Message Date
Wenlei He eca03d2768 [CSSPGO] Track and use context-sensitive post-optimization function size to drive global pre-inliner in llvm-profgen
This change enables llvm-profgen to use accurate context-sensitive post-optimization function byte size as a cost proxy to drive global preinline decisions.

To do this, BinarySizeContextTracker is introduced to track function byte size under different inline context during disassembling. In preinliner, we can not query context byte size under switch `context-cost-for-preinliner`. The tracker uses a reverse trie to keep size of functions under different context (callee as parent, caller as child), and it can give best/longest possible matching context size for given input context.

The new size cost is off by default. There're a few TODOs that needs to addressed: 1) avoid dangling string from `Offset2LocStackMap`, which will be addressed in split context work; 2) using inlinee's entry probe to make sure we have correct zero size for inlinee that's completely optimized away after inlining. Some tuning is also needed.

Differential Revision: https://reviews.llvm.org/D108180
2021-08-18 22:50:57 -07:00
wlei f812c19253 [llvm-profgen] Clean up code dealing with multiple binaries
As we decided to support only one binary each time, this patch cleans up the related code dealing with multiple binaries. We can use `llvm-profdata` to merge profile from multiple binaries.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D108002
2021-08-17 12:16:07 -07:00
jamesluox ee7d20e846 [CSSPGO] Migrate and refactor the decoder of Pseudo Probe
Migrate pseudo probe decoding logic in llvm-profgen to MC, so other LLVM-base program could reuse existing codes. Redesign object layout of encoded and decoded pseudo probes.

Reviewed By: hoy

Differential Revision: https://reviews.llvm.org/D106861
2021-08-04 09:21:34 -07:00
wlei fe3ba90830 [llvm-profgen] Support perf script without parsing MMap events
This change supports to run without parsing MMap binary loading events instead it always assumes binary is loaded at the preferred address. This is used when we have assured no binary load address changes or we have pre-processed the addresses resolution. Warn if there's interior mmap event but without leading mmap events.

Reviewed By: hoy

Differential Revision: https://reviews.llvm.org/D107097
2021-08-03 10:01:07 -07:00
wlei 6da9241aab [llvm-profgen] Refactor PerfReader to allow different types of perf scripts
In order to support different types of perf scripts, this change tried to refactor `PerfReader` by adding the base class `PerfReaderBase` and current HybridPerfReader is derived from it for CS profile generation. Common functions like, passMM2PEvents, extract_lbrs, extract_callstack, etc. can be reused.

Next step is to add LBR only reader(for non-CS profile) and aggregated perf scripts reader(do a pre-aggregation of scripts).

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D107014
2021-08-02 17:18:47 -07:00
Hongtao Yu 6b04ecaab3 [CSSPGO][llvm-profgen] Fix a missing initalization
Fixing a missing initalization that accidentaly caused by https://reviews.llvm.org/D103178 .
2021-07-13 19:49:55 -07:00
Hongtao Yu 597e9c61ce Revert "[CSSPGO][llvm-profgen] Fix a missing initalization"
This reverts commit fef5f4456a.
2021-07-13 19:48:58 -07:00
Hongtao Yu fef5f4456a [CSSPGO][llvm-profgen] Fix a missing initalization
Fixing a missing initalization that accidentaly caused by https://reviews.llvm.org/D103178 .
2021-07-13 19:46:18 -07:00
Hongtao Yu 0712038458 [CSSPGO][llvm-profgen] Allow multiple executable load segments.
The linker or post-link optimizer can create an ELF image with multiple executable segments each of which will be loaded separately at run time. This breaks the assumption of llvm-profgen that currently only supports one base load address. What it ends up with is that the subsequent mmap events will be treated as an overwrite of the first mmap event which will in turn screw up address mapping. While it is non-trivial to support multiple separate load addresses and given that on x64 those segments will always be loaded at consecutive addresses (though via separate mmap
sys calls), I'm adding an error checking logic to bail out if that's violated and keep using a single load address which is the address of the first executable segment.

Also changing the disassembly output from printing section offset to printing the virtual address instead, which matches the behavior of objdump.

Differential Revision: https://reviews.llvm.org/D103178
2021-07-13 18:22:24 -07:00
Wenlei He 1410db70b9 [CSSPGO] Add attribute metadata for context profile
This changes adds attribute field for metadata of context profile. Currently we have an inline attribute that indicates whether the leaf frame corresponding to a context profile was inlined in previous build.

This will be used to help estimating inlining and be taken into account when trimming context. Changes for that in llvm-profgen will follow. It will also help tuning.

Differential Revision: https://reviews.llvm.org/D98823
2021-03-18 22:00:56 -07:00
Yang Fan 64fc9cc723
[CSSPGO][llvm-profgen] Fix gcc Wcast-qual warning (NFC)
GCC warning:
```
[3397/3703] Building CXX object tools/llvm-profgen/CMakeFiles/llvm-profgen.dir/llvm-profgen.cpp.o
In file included from /llvm-project/llvm/include/llvm/ADT/STLExtras.h:19,
                 from /llvm-project/llvm/include/llvm/ADT/StringRef.h:12,
                 from /llvm-project/llvm/include/llvm/ADT/Twine.h:13,
                 from /llvm-project/llvm/tools/llvm-profgen/ErrorHandling.h:12,
                 from /llvm-project/llvm/tools/llvm-profgen/llvm-profgen.cpp:13:
/llvm-project/llvm/include/llvm/ADT/Optional.h: In instantiation of ‘void llvm::optional_detail::OptionalStorage<T, <anonymous> >::emplace(Args&& ...) [with Args = {const std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, llvm::sampleprof::LineLocation>}; T = const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>; bool <anonymous> = false]’:
/llvm-project/llvm/include/llvm/ADT/Optional.h:79:7:   required from ‘constexpr llvm::optional_detail::OptionalStorage<T, <anonymous> >::OptionalStorage(llvm::optional_detail::OptionalStorage<T, <anonymous> >&&) [with T = const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>; bool <anonymous> = false]’
/llvm-project/llvm/include/llvm/ADT/Optional.h:253:13:   required from here
/llvm-project/llvm/include/llvm/ADT/Optional.h:113:12: warning: cast from type ‘const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>*’ to type ‘void*’ casts away qualifiers [-Wcast-qual]
  113 |     ::new ((void *)std::addressof(value)) T(std::forward<Args>(args)...);
      |            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[3398/3703] Building CXX object tools/llvm-profgen/CMakeFiles/llvm-profgen.dir/PerfReader.cpp.o
In file included from /llvm-project/llvm/include/llvm/ADT/STLExtras.h:19,
                 from /llvm-project/llvm/include/llvm/ADT/StringRef.h:12,
                 from /llvm-project/llvm/include/llvm/ADT/Twine.h:13,
                 from /llvm-project/llvm/tools/llvm-profgen/ErrorHandling.h:12,
                 from /llvm-project/llvm/tools/llvm-profgen/PerfReader.h:11,
                 from /llvm-project/llvm/tools/llvm-profgen/PerfReader.cpp:8:
/llvm-project/llvm/include/llvm/ADT/Optional.h: In instantiation of ‘void llvm::optional_detail::OptionalStorage<T, <anonymous> >::emplace(Args&& ...) [with Args = {const std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, llvm::sampleprof::LineLocation>}; T = const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>; bool <anonymous> = false]’:
/llvm-project/llvm/include/llvm/ADT/Optional.h:79:7:   required from ‘constexpr llvm::optional_detail::OptionalStorage<T, <anonymous> >::OptionalStorage(llvm::optional_detail::OptionalStorage<T, <anonymous> >&&) [with T = const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>; bool <anonymous> = false]’
/llvm-project/llvm/include/llvm/ADT/Optional.h:253:13:   required from here
/llvm-project/llvm/include/llvm/ADT/Optional.h:113:12: warning: cast from type ‘const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>*’ to type ‘void*’ casts away qualifiers [-Wcast-qual]
  113 |     ::new ((void *)std::addressof(value)) T(std::forward<Args>(args)...);
      |            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[3399/3703] Building CXX object tools/llvm-profgen/CMakeFiles/llvm-profgen.dir/ProfiledBinary.cpp.o
In file included from /llvm-project/llvm/include/llvm/ADT/STLExtras.h:19,
                 from /llvm-project/llvm/include/llvm/ADT/ArrayRef.h:15,
                 from /llvm-project/llvm/include/llvm/ADT/DenseMapInfo.h:18,
                 from /llvm-project/llvm/include/llvm/ADT/DenseMap.h:16,
                 from /llvm-project/llvm/include/llvm/ADT/DenseSet.h:16,
                 from /llvm-project/llvm/include/llvm/ProfileData/SampleProf.h:17,
                 from /llvm-project/llvm/tools/llvm-profgen/CallContext.h:12,
                 from /llvm-project/llvm/tools/llvm-profgen/ProfiledBinary.h:12,
                 from /llvm-project/llvm/tools/llvm-profgen/ProfiledBinary.cpp:9:
/llvm-project/llvm/include/llvm/ADT/Optional.h: In instantiation of ‘void llvm::optional_detail::OptionalStorage<T, <anonymous> >::emplace(Args&& ...) [with Args = {const std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, llvm::sampleprof::LineLocation>}; T = const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>; bool <anonymous> = false]’:
/llvm-project/llvm/include/llvm/ADT/Optional.h:79:7:   required from ‘constexpr llvm::optional_detail::OptionalStorage<T, <anonymous> >::OptionalStorage(llvm::optional_detail::OptionalStorage<T, <anonymous> >&&) [with T = const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>; bool <anonymous> = false]’
/llvm-project/llvm/include/llvm/ADT/Optional.h:253:13:   required from here
/llvm-project/llvm/include/llvm/ADT/Optional.h:113:12: warning: cast from type ‘const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>*’ to type ‘void*’ casts away qualifiers [-Wcast-qual]
  113 |     ::new ((void *)std::addressof(value)) T(std::forward<Args>(args)...);
      |            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[3404/3703] Building CXX object tools/llvm-profgen/CMakeFiles/llvm-profgen.dir/ProfileGenerator.cpp.o
In file included from /llvm-project/llvm/include/llvm/ADT/STLExtras.h:19,
                 from /llvm-project/llvm/include/llvm/ADT/StringRef.h:12,
                 from /llvm-project/llvm/include/llvm/ADT/Twine.h:13,
                 from /llvm-project/llvm/tools/llvm-profgen/ErrorHandling.h:12,
                 from /llvm-project/llvm/tools/llvm-profgen/ProfileGenerator.h:11,
                 from /llvm-project/llvm/tools/llvm-profgen/ProfileGenerator.cpp:9:
/llvm-project/llvm/include/llvm/ADT/Optional.h: In instantiation of ‘void llvm::optional_detail::OptionalStorage<T, <anonymous> >::emplace(Args&& ...) [with Args = {const std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, llvm::sampleprof::LineLocation>}; T = const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>; bool <anonymous> = false]’:
/llvm-project/llvm/include/llvm/ADT/Optional.h:79:7:   required from ‘constexpr llvm::optional_detail::OptionalStorage<T, <anonymous> >::OptionalStorage(llvm::optional_detail::OptionalStorage<T, <anonymous> >&&) [with T = const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>; bool <anonymous> = false]’
/llvm-project/llvm/include/llvm/ADT/Optional.h:253:13:   required from here
/llvm-project/llvm/include/llvm/ADT/Optional.h:113:12: warning: cast from type ‘const std::pair<std::__cxx11::basic_string<char>, llvm::sampleprof::LineLocation>*’ to type ‘void*’ casts away qualifiers [-Wcast-qual]
  113 |     ::new ((void *)std::addressof(value)) T(std::forward<Args>(args)...);
      |            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
2021-02-18 17:10:50 +08:00
wlei afd8bd601e [CSSPGO][llvm-profgen] Filter out the instructions without location info for symbolizer
It appears some instructions doesn't have the debug location info and the symbolizer will return an empty call stack for them which will cause some crash later in profile unwinding. Actually we do not record the sample info for them, so this change just filter out those instruction.

As those instruction would appears at the begin and end of the instruction list, without them we need to add the boundary check for IP `advance` and `backward`.

Also for pseudo probe based profile, we actually don't need the symbolized location info, so here just change to use an empty stack for it. This could save half of the binary loading time.

Differential Revision: https://reviews.llvm.org/D96434
2021-02-12 16:47:49 -08:00
wlei 3869309a0c [CSSPGO][llvm-profgen] Aggregate samples on call frame trie to speed up profile generation
For CS profile generation, the process of call stack unwinding is time-consuming since for each LBR entry we need linear time to generate the context( hash, compression, string concatenation). This change speeds up this by grouping all the call frame within one LBR sample into a trie and aggregating the result(sample counter) on it, deferring the context compression and string generation to the end of unwinding.

Specifically, it uses `StackLeaf` as the top frame on the stack and manipulates(pop or push a trie node) it dynamically during virtual unwinding so that the raw sample can just be recoded on the leaf node, the path(root to leaf) will represent its calling context. In the end, it traverses the trie and generates the context on the fly.

Results:
Our internal branch shows about 5X speed-up on some large workloads in SPEC06 benchmark.

Differential Revision: https://reviews.llvm.org/D94110
2021-02-04 08:43:21 -08:00
wlei ac14bb14e7 [CSSPGO][llvm-profgen] Compress recursive cycles in calling context
This change compresses the context string by removing cycles due to recursive function for CS profile generation. Removing recursion cycles is a way to normalize the calling context which will be better for the sample aggregation and also make the context promoting deterministic.
Specifically for implementation, we recognize adjacent repeated frames as cycles and deduplicated them through multiple round of iteration.
For example:
Considering a input context string stack:
[“a”, “a”, “b”, “c”, “a”, “b”, “c”, “b”, “c”, “d”]
For first iteration,, it removed all adjacent repeated frames of size 1:
[“a”, “b”, “c”, “a”, “b”, “c”, “b”, “c”, “d”]
For second iteration, it removed all adjacent repeated frames of size 2:
[“a”, “b”, “c”, “a”, “b”, “c”, “d”]
So in the end, we get compressed output:
[“a”, “b”, “c”, “d”]

Compression will be called in two place: one for sample's context key right after unwinding, one is for the eventual context string id in the ProfileGenerator.
Added a switch `compress-recursion` to control the size of duplicated frames, default -1 means no size limit.
Added unit tests and regression test for this.

Differential Revision: https://reviews.llvm.org/D93556
2021-02-03 22:16:07 -08:00
wlei 6bccdcdb35 Revert "[CSSPGO][llvm-profgen] Compress recursive cycles in calling context"
This reverts commit 0609f257dc.
2021-02-03 22:16:05 -08:00
wlei 08e8bb60cf Revert "[CSSPGO][llvm-profgen] Aggregate samples on call frame trie to speed up profile generation"
This reverts commit 1714ad2336.
2021-02-03 22:16:05 -08:00
wlei 1714ad2336 [CSSPGO][llvm-profgen] Aggregate samples on call frame trie to speed up profile generation
For CS profile generation, the process of call stack unwinding is time-consuming since for each LBR entry we need linear time to generate the context( hash, compression, string concatenation). This change speeds up this by grouping all the call frame within one LBR sample into a trie and aggregating the result(sample counter) on it, deferring the context compression and string generation to the end of unwinding.

Specifically, it uses `StackLeaf` as the top frame on the stack and manipulates(pop or push a trie node) it dynamically during virtual unwinding so that the raw sample can just be recoded on the leaf node, the path(root to leaf) will represent its calling context. In the end, it traverses the trie and generates the context on the fly.

Results:
Our internal branch shows about 5X speed-up on some large workloads in SPEC06 benchmark.

Differential Revision: https://reviews.llvm.org/D94110
2021-02-03 18:50:14 -08:00
wlei 0609f257dc [CSSPGO][llvm-profgen] Compress recursive cycles in calling context
This change compresses the context string by removing cycles due to recursive function for CS profile generation. Removing recursion cycles is a way to normalize the calling context which will be better for the sample aggregation and also make the context promoting deterministic.
Specifically for implementation, we recognize adjacent repeated frames as cycles and deduplicated them through multiple round of iteration.
For example:
Considering a input context string stack:
[“a”, “a”, “b”, “c”, “a”, “b”, “c”, “b”, “c”, “d”]
For first iteration,, it removed all adjacent repeated frames of size 1:
[“a”, “b”, “c”, “a”, “b”, “c”, “b”, “c”, “d”]
For second iteration, it removed all adjacent repeated frames of size 2:
[“a”, “b”, “c”, “a”, “b”, “c”, “d”]
So in the end, we get compressed output:
[“a”, “b”, “c”, “d”]

Compression will be called in two place: one for sample's context key right after unwinding, one is for the eventual context string id in the ProfileGenerator.
Added a switch `compress-recursion` to control the size of duplicated frames, default -1 means no size limit.
Added unit tests and regression test for this.

Differential Revision: https://reviews.llvm.org/D93556
2021-02-03 18:50:14 -08:00
wlei c82b24f475 [CSSPGO][llvm-profgen] Pseudo probe based CS profile generation
This change implements profile generation infra for pseudo probe in llvm-profgen. During virtual unwinding, the raw profile is extracted into range counter and branch counter and aggregated to sample counter map indexed by the call stack context. This change introduces the last step and produces the eventual profile. Specifically, the body of function sample is recorded by going through each probe among the range and callsite target sample is recorded by extracting the callsite probe from branch's source.

Please refer https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s and https://reviews.llvm.org/D89707 for more context about CSSPGO and llvm-profgen.

**Implementation**

- Extended `PseudoProbeProfileGenerator` for pseudo probe based profile generation.
- `populateBodySamplesWithProbes` reading range counter is responsible for recording function body samples and inferring caller's body samples.
- `populateBoundarySamplesWithProbes` reading branch counter is responsible for recording call site target samples.
- Each sample is recorded with its calling context(named `ContextId`). Remind that the probe based context key doesn't include the leaf frame probe info, so the `ContextId` string is created from two part: one from the probe stack strings' concatenation and other one from the leaf frame probe.
- Added regression test

Test Plan:

ninja & ninja check-llvm

Differential Revision: https://reviews.llvm.org/D92998
2021-02-03 16:21:53 -08:00
Kazu Hirata 3d1200b9f6 [llvm] Drop unnecessary const from return types (NFC)
Identified with const-return-type.
2021-01-31 10:23:43 -08:00
wlei c681400b25 [CSSPGO][llvm-profgen] Virtual unwinding with pseudo probe
This change extends virtual unwinder to support pseudo probe in llvm-profgen. Please refer https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s and https://reviews.llvm.org/D89707 for more context about CSSPGO and llvm-profgen.

**Implementation**

- Added `ProbeBasedCtxKey` derived from `ContextKey` for sample counter aggregation. As we need string splitting to infer the profile for callee function, string based context introduces more string handling overhead, here we just use probe pointer based context.
- For linear unwinding, as inline context is encoded in each pseudo probe, we don't need to go through each instruction to extract range sharing same inliner. So just record the range for the context.
- For probe based context, we should ignore the top frame probe since it will be extracted from the address range. we defer the extraction in `ProfileGeneration`.
- Added `PseudoProbeProfileGenerator` for pseudo probe based profile generation.
- Some helper function to get pseduo probe info(call probe, inline context) from profiled binary.
- Added regression test for unwinder's output

The pseudo probe based profile generation will be in the upcoming patch.

Test Plan:

ninja & ninja check-llvm

Differential Revision: https://reviews.llvm.org/D92896
2021-01-13 11:02:58 -08:00
wlei b3154d11bc [CSSPGO][llvm-profgen] Pseudo probe decoding and disassembling
This change implements pseudo probe decoding and disassembling for llvm-profgen/CSSPGO. Please see https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s and https://reviews.llvm.org/D89707 for more context about CSSPGO and llvm-profgen.

**ELF section format**
Please see the encoding patch(https://reviews.llvm.org/D91878) for more details of the format, just copy the example here:

Two section(`.pseudo_probe_desc` and  `.pseudoprobe` ) is emitted in ELF to support pseudo probe.
The format of `.pseudo_probe_desc` section looks like:

```
.section   .pseudo_probe_desc,"",@progbits
.quad   6309742469962978389  // Func GUID
.quad   4294967295           // Func Hash
.byte   9                    // Length of func name
.ascii  "_Z5funcAi"          // Func name
.quad   7102633082150537521
.quad   138828622701
.byte   12
.ascii  "_Z8funcLeafi"
.quad   446061515086924981
.quad   4294967295
.byte   9
.ascii  "_Z5funcBi"
.quad   -2016976694713209516
.quad   72617220756
.byte   7
.ascii  "_Z3fibi"
```

For each `.pseudoprobe` section, the encoded binary data consists of a single function record corresponding to an outlined function (i.e, a function with a code entry in the `.text` section). A function record has the following format :

```
FUNCTION BODY (one for each outlined function present in the text section)
    GUID (uint64)
        GUID of the function
    NPROBES (ULEB128)
        Number of probes originating from this function.
    NUM_INLINED_FUNCTIONS (ULEB128)
        Number of callees inlined into this function, aka number of
        first-level inlinees
    PROBE RECORDS
        A list of NPROBES entries. Each entry contains:
          INDEX (ULEB128)
          TYPE (uint4)
            0 - block probe, 1 - indirect call, 2 - direct call
          ATTRIBUTE (uint3)
            reserved
          ADDRESS_TYPE (uint1)
            0 - code address, 1 - address delta
          CODE_ADDRESS (uint64 or ULEB128)
            code address or address delta, depending on ADDRESS_TYPE
    INLINED FUNCTION RECORDS
        A list of NUM_INLINED_FUNCTIONS entries describing each of the inlined
        callees.  Each record contains:
          INLINE SITE
            GUID of the inlinee (uint64)
            ID of the callsite probe (ULEB128)
          FUNCTION BODY
            A FUNCTION BODY entry describing the inlined function.
```

**Disassembling**
A switch `--show-pseudo-probe` is added to use along with `--show-disassembly` to print disassembly code with pseudo probe directives.

For example:
```
00000000002011a0 <foo2>:
  2011a0: 50                    push   rax
  2011a1: 85 ff                 test   edi,edi
  [Probe]:  FUNC: foo2  Index: 1  Type: Block
  2011a3: 74 02                 je     2011a7 <foo2+0x7>
  [Probe]:  FUNC: foo2  Index: 3  Type: Block
  [Probe]:  FUNC: foo2  Index: 4  Type: Block
  [Probe]:  FUNC: foo   Index: 1  Type: Block  Inlined: @ foo2:6
  2011a5: 58                    pop    rax
  2011a6: c3                    ret
  [Probe]:  FUNC: foo2  Index: 2  Type: Block
  2011a7: bf 01 00 00 00        mov    edi,0x1
  [Probe]:  FUNC: foo2  Index: 5  Type: IndirectCall
  2011ac: ff d6                 call   rsi
  [Probe]:  FUNC: foo2  Index: 4  Type: Block
  2011ae: 58                    pop    rax
  2011af: c3                    ret
```

**Implementation**
- `PseudoProbeDecoder` is added in ProfiledBinary as an infra for the decoding. It decoded the two section and generate two map: `GUIDProbeFunctionMap` stores all the `PseudoProbeFunction` which is the abstraction of a general function. `AddressProbesMap` stores all the pseudo probe info indexed by its address.
- All the inline info is encoded into binary as a trie(`PseudoProbeInlineTree`) and will be constructed from the decoding. Each pseudo probe can get its inline context(`getInlineContext`) by traversing its inline tree node backwards.

Test Plan:
ninja & ninja check-llvm

Differential Revision: https://reviews.llvm.org/D92334
2021-01-13 11:02:57 -08:00
Kazu Hirata cd088ba7e6 [llvm] Use llvm::lower_bound and llvm::upper_bound (NFC) 2021-01-05 21:15:59 -08:00
wlei 1f05b1a9f5 [CSSPGO][llvm-profgen] Context-sensitive profile data generation
This stack of changes introduces `llvm-profgen` utility which generates a profile data file from given perf script data files for sample-based PGO. It’s part of(not only) the CSSPGO work. Specifically to support context-sensitive with/without pseudo probe profile, it implements a series of functionalities including perf trace parsing, instruction symbolization, LBR stack/call frame stack unwinding, pseudo probe decoding, etc. Also high throughput is achieved by multiple levels of sample aggregation and compatible format with one stop is generated at the end. Please refer to: https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s for the CSSPGO RFC.

This change supports context-sensitive profile data generation into llvm-profgen. With simultaneous sampling for LBR and call stack, we can identify leaf of LBR sample with calling context from stack sample . During the process of deriving fall through path from LBR entries, we unwind LBR by replaying all the calls and returns (including implicit calls/returns due to inlining) backwards on top of the sampled call stack. Then the state of call stack as we unwind through LBR always represents the calling context of current fall through path.

we have two types of virtual unwinding 1) LBR unwinding and 2) linear range unwinding.
Specifically, for each LBR entry which can be classified into call, return, regular branch, LBR unwinding will replay the operation by pushing, popping or switching leaf frame towards the call stack and since the initial call stack is most recently sampled, the replay should be in anti-execution order, i.e. for the regular case, pop the call stack when LBR is call, push frame on call stack when LBR is return. After each LBR processed, it also needs to align with the next LBR by going through instructions from previous LBR's target to current LBR's source, which we named linear unwinding. As instruction from linear range can come from different function by inlining, linear unwinding will do the range splitting and record counters through the range with same inline context.

With each fall through path from LBR unwinding, we aggregate each sample into counters by the calling context and eventually generate full context sensitive profile (without relying on inlining) to driver compiler's PGO/FDO.

A breakdown of noteworthy changes:
- Added `HybridSample` class as the abstraction perf sample including LBR stack and call stack
* Extended `PerfReader` to implement auto-detect whether input perf script output contains CS profile, then do the parsing. Multiple `HybridSample` are extracted
* Speed up by aggregating  `HybridSample` into `AggregatedSamples`
* Added VirtualUnwinder that consumes aggregated  `HybridSample` and implements unwinding of calls, returns, and linear path that contains implicit call/return from inlining. Ranges and branches counters are aggregated by the calling context.
 Here calling context is string type, each context is a pair of function name and callsite location info, the whole context is like `main:1 @ foo:2 @ bar`.
* Added PorfileGenerater that accumulates counters by ranges unfolding or branch target mapping, then generates context-sensitive function profile including function body, inferring callee's head sample, callsite target samples, eventually records into ProfileMap.

* Leveraged LLVM build-in(`SampleProfWriter`) writer to support different serialization format with no stop
- `getCanonicalFnName` for callee name and name from ELF section
- Added regression test for both unwinding and profile generation

Test Plan:
ninja & ninja check-llvm

Reviewed By: hoy, wenlei, wmi

Differential Revision: https://reviews.llvm.org/D89723
2020-12-07 13:48:58 -08:00
wlei 0196b45cea [CSSPGO][llvm-profgen] Instruction symbolization
This stack of changes introduces `llvm-profgen` utility which generates a profile data file from given perf script data files for sample-based PGO. It’s part of(not only) the CSSPGO work. Specifically to support context-sensitive with/without pseudo probe profile, it implements a series of functionalities including perf trace parsing, instruction symbolization, LBR stack/call frame stack unwinding, pseudo probe decoding, etc. Also high throughput is achieved by multiple levels of sample aggregation and compatible format with one stop is generated at the end. Please refer to: https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s for the CSSPGO RFC.

This change adds the support of instruction symbolization. Given the RVA on an instruction pointer, a full calling context can be printed side-by-side with the disassembly code.
E.g.
```
 Disassembly of section .text [0x0, 0x4a]:

 <funcA>:
     0:	mov	eax, edi                           funcA:0
     2:	mov	ecx, dword ptr [rip]               funcLeaf:2 @ funcA:1
     8:	lea	edx, [rcx + 3]                     fib:2 @ funcLeaf:2 @ funcA:1
     b:	cmp	ecx, 3                             fib:2 @ funcLeaf:2 @ funcA:1
     e:	cmovl	edx, ecx                           fib:2 @ funcLeaf:2 @ funcA:1
    11:	sub	eax, edx                           funcLeaf:2 @ funcA:1
    13:	ret                                        funcA:2
    14:	nop	word ptr cs:[rax + rax]
    1e:	nop

 <funcLeaf>:
    20:	mov	eax, edi                           funcLeaf:1
    22:	mov	ecx, dword ptr [rip]               funcLeaf:2
    28:	lea	edx, [rcx + 3]                     fib:2 @ funcLeaf:2
    2b:	cmp	ecx, 3                             fib:2 @ funcLeaf:2
    2e:	cmovl	edx, ecx                           fib:2 @ funcLeaf:2
    31:	sub	eax, edx                           funcLeaf:2
    33:	ret                                        funcLeaf:3
    34:	nop	word ptr cs:[rax + rax]
    3e:	nop

 <fib>:
    40:	lea	eax, [rdi + 3]                     fib:2
    43:	cmp	edi, 3                             fib:2
    46:	cmovl	eax, edi                           fib:2
    49:	ret                                        fib:8
```

Test Plan:
ninja check-llvm

Reviewed By: wenlei, wmi

Differential Revision: https://reviews.llvm.org/D89715
2020-11-20 14:26:27 -08:00
wlei 32221694cb [CSSPGO][llvm-profgen] Disassemble text sections
This stack of changes introduces `llvm-profgen` utility which generates a profile data file from given perf script data files for sample-based PGO. It’s part of(not only) the CSSPGO work. Specifically to support context-sensitive with/without pseudo probe profile, it implements a series of functionalities including perf trace parsing, instruction symbolization, LBR stack/call frame stack unwinding, pseudo probe decoding, etc. Also high throughput is achieved by multiple levels of sample aggregation and compatible format with one stop is generated at the end. Please refer to: https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s for the CSSPGO RFC.

This change enables disassembling the text sections to build various address maps that are potentially used by the virtual unwinder.  A switch `--show-disassembly` is being added to print the disassembly code.

Like the llvm-objdump tool, this change leverages existing LLVM components to parse and disassemble ELF binary files. So far X86 is supported.

Test Plan:

ninja check-llvm

Reviewed By: wmi, wenlei

Differential Revision: https://reviews.llvm.org/D89712
2020-11-20 14:26:26 -08:00
wlei a94fa86229 [CSSPGO][llvm-profgen] Parse mmap events from perf script
This stack of changes introduces `llvm-profgen` utility which generates a profile data file from given perf script data files for sample-based PGO. It’s part of(not only) the CSSPGO work. Specifically to support context-sensitive with/without pseudo probe profile, it implements a series of functionalities including perf trace parsing, instruction symbolization, LBR stack/call frame stack unwinding, pseudo probe decoding, etc. Also high throughput is achieved by multiple levels of sample aggregation and compatible format with one stop is generated at the end. Please refer to: https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s for the CSSPGO RFC.

As a starter, this change sets up an entry point by introducing PerfReader to load profiled binaries and perf traces(including perf events and perf samples). For the event, here it parses the mmap2 events from perf script to build the loader snaps, which is used to retrieve the image load address in the subsequent perf tracing parsing.

As described in llvm-profgen.rst, the tool being built aims to support multiple input perf data (preprocessed by perf script) as well as multiple input binary images. It should also support dynamic reload/unload shared objects by leveraging the loader snaps being built by this change

Reviewed By: wenlei, wmi

Differential Revision: https://reviews.llvm.org/D89707
2020-11-20 14:26:26 -08:00