Commit Graph

40 Commits

Author SHA1 Message Date
Reid Kleckner 89b57061f7 Move TargetRegistry.(h|cpp) from Support to MC
This moves the registry higher in the LLVM library dependency stack.
Every client of the target registry needs to link against MC anyway to
actually use the target, so we might as well move this out of Support.

This allows us to ensure that Support doesn't have includes from MC/*.

Differential Revision: https://reviews.llvm.org/D111454
2021-10-08 14:51:48 -07:00
wlei 31a5cb3292 [llvm-profgen] Filter out invalid debug line
Differential Revision: https://reviews.llvm.org/D110081
2021-10-04 19:09:06 -07:00
wlei 46cf7d75d9 [llvm-profgen] Add duplication factor for line-number based profile
This change adds a duplication-factor multiplier when accumulating body samples for the line-number based profile. The body sample count becomes `duplication-factor * count`. The base discriminator and duplication factor are decoded from the raw discriminator, which requires some refactoring.
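For illustration, a minimal sketch of that accumulation; the decoded-discriminator struct and helper below are assumptions of this sketch, while the real decoding lives in LLVM's `DILocation` helpers.

```
#include <cstdint>
#include <iostream>

// Hypothetical decoded form of a raw discriminator (illustrative only).
struct DecodedDiscriminator {
  uint32_t BaseDiscriminator;  // used as the line's discriminator in the profile
  uint32_t DuplicationFactor;  // how many times the source line was duplicated
};

// Accumulate a body sample scaled by the duplication factor:
// body count += duplication-factor * count.
void accumulateBodySample(const DecodedDiscriminator &D, uint64_t Count,
                          uint64_t &LineCount) {
  LineCount += static_cast<uint64_t>(D.DuplicationFactor) * Count;
}

int main() {
  DecodedDiscriminator D{/*BaseDiscriminator=*/2, /*DuplicationFactor=*/3};
  uint64_t LineCount = 0;
  accumulateBodySample(D, /*Count=*/10, LineCount);
  std::cout << LineCount << "\n"; // prints 30
}
```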

Differential Revision: https://reviews.llvm.org/D109934
2021-10-04 19:08:55 -07:00
wlei fb29d812e4 [CSSPGO] Rename the field of SampleContextFrame
Differential Revision: https://reviews.llvm.org/D110980
2021-10-04 19:06:59 -07:00
Wenlei He 47d66355ef [llvm-profgen] Fix alignment in preferred base calculation
We used the segment alignment in the ELF header to infer the loader alignment. However, this is incorrect because the loader alignment is always the page size. If a segment needs to be aligned at load time, the linker will set the aligned address as the virtual address in the ELF header.
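A rough sketch of that idea, under the stated assumption that the loader aligns to the page size; the 4K page size and helper names here are illustrative, not the actual llvm-profgen code.

```
#include <cstdint>
#include <iostream>

// Align an address down to a power-of-two alignment.
uint64_t alignDown(uint64_t Addr, uint64_t Align) { return Addr & ~(Align - 1); }

// The loader aligns segments to the page size rather than to the (possibly
// larger) p_align from the ELF header, so derive the preferred base with
// page-size alignment of the executable segment's virtual address.
uint64_t preferredBase(uint64_t ExecSegmentVAddr, uint64_t PageSize = 0x1000) {
  return alignDown(ExecSegmentVAddr, PageSize);
}

int main() {
  // p_vaddr of the first executable PT_LOAD segment (illustrative value).
  std::cout << std::hex << preferredBase(0x201af0) << "\n"; // 201000
}
```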

Differential Revision: https://reviews.llvm.org/D110795
2021-09-29 23:01:10 -07:00
wlei ce40843a3f [llvm-profgen][CSSPGO] On-demand function size computation for preinliner
Similar to https://reviews.llvm.org/D110465, we can compute function sizes on-demand for the functions that are hit by samples.

Here we leverage the raw range samples' addresses to compute the set of functions hit by samples. Then `BinarySizeContextTracker` only works on those functions' ranges to compute sizes.
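An illustrative sketch of deriving the sample-hit function set from raw range samples, so sizes only need to be computed for those functions; the data structures here are simplified assumptions, not the real llvm-profgen types.

```
#include <cstdint>
#include <iostream>
#include <iterator>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// [start, end) address range of one raw range sample.
using RangeSample = std::pair<uint64_t, uint64_t>;

// Function start address -> function name, so the enclosing function of an
// arbitrary address can be found with upper_bound.
std::set<std::string>
collectSampleHitFunctions(const std::map<uint64_t, std::string> &FuncStarts,
                          const std::vector<RangeSample> &Ranges) {
  std::set<std::string> Hit;
  for (const auto &R : Ranges) {
    auto It = FuncStarts.upper_bound(R.first);
    if (It != FuncStarts.begin())
      Hit.insert(std::prev(It)->second); // function containing the range start
  }
  return Hit;
}

int main() {
  std::map<uint64_t, std::string> FuncStarts = {
      {0x1000, "foo"}, {0x2000, "bar"}, {0x3000, "baz"}};
  std::vector<RangeSample> Ranges = {{0x1010, 0x1020}, {0x3008, 0x3010}};
  for (const auto &Name : collectSampleHitFunctions(FuncStarts, Ranges))
    std::cout << Name << "\n"; // baz, foo -- "bar" is never hit, so its size
                               // would never need to be computed
}
```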

Reviewed By: hoy

Differential Revision: https://reviews.llvm.org/D110466
2021-09-28 09:09:38 -07:00
wlei 091c16f76b [llvm-profgen] On-demand symbolization
Previously we symbolized all functions, but we only need the symbols that are hit by samples.

This can significantly speed up processing for large binaries.

The optimization for the pre-inliner will come in the next patch.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D110465
2021-09-28 09:09:25 -07:00
wlei 28277e9b48 [AutoFDO][llvm-profgen] Report zero count for unexecuted part of function code
In order to be consistent with the compiler, which interprets a zero count as unexecuted (cold), this change reports a zero-value count for the unexecuted parts of function code. The implementation leverages the range counter: it initializes all executed function ranges with a zero value, and after all ranges are merged and converted into disjoint ranges, the remaining zero counts indicate the unexecuted (cold) parts of the function.

This change also extends the current `findDisjointRanges` method, which now supports adding zero-value ranges.
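A simplified sweep-line sketch of that idea (not the actual `findDisjointRanges` code): seed the whole executed function range with a zero count, overlay the sampled ranges, and any disjoint piece still at zero marks unexecuted (cold) code.

```
#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

struct Range { uint64_t Begin, End; uint64_t Count; }; // [Begin, End)

std::vector<Range> findDisjointRanges(const std::vector<Range> &Input) {
  std::map<uint64_t, int64_t> Delta; // boundary -> count delta
  for (const auto &R : Input) {
    Delta[R.Begin] += static_cast<int64_t>(R.Count);
    Delta[R.End] -= static_cast<int64_t>(R.Count);
  }
  std::vector<Range> Out;
  int64_t Cur = 0;
  uint64_t Prev = 0;
  bool HasPrev = false;
  for (const auto &[Addr, D] : Delta) {
    if (HasPrev && Addr > Prev)
      Out.push_back({Prev, Addr, static_cast<uint64_t>(Cur)});
    Cur += D;
    Prev = Addr;
    HasPrev = true;
  }
  return Out;
}

int main() {
  // Whole executed function range [0x100, 0x140) seeded with zero, plus one hot range.
  std::vector<Range> In = {{0x100, 0x140, 0}, {0x110, 0x120, 50}};
  for (const auto &R : findDisjointRanges(In))
    std::cout << std::hex << R.Begin << "-" << R.End << std::dec
              << ": " << R.Count << "\n";
  // 100-110: 0, 110-120: 50, 120-140: 0 -> the zero pieces are the cold code
}
```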

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D109713
2021-09-24 14:15:05 -07:00
wlei c2be2d3284 [llvm-profgen] Fix a bug of assertion
The assertion should work on the entire context.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D110268
2021-09-22 18:33:27 -07:00
Hongtao Yu 734f4d832c [llvm-profgen] An option to dump disasm of specified symbols
For a large app, dumping the disassembly of the whole program can be slow and result in giant output. This adds a switch to dump specified symbols only.

Reviewed By: wlei

Differential Revision: https://reviews.llvm.org/D110079
2021-09-22 10:32:59 -07:00
wlei 964053d56f [llvm-profgen] Support LBR only perf script
This change aims at supporting LBR-only sample perf scripts, which are used for regular (non-CS) profile generation. An LBR perf script contains a batch of LBR samples, each of which starts with a frame pointer followed by a group of 32 LBR entries. The FROM/TO LBR pairs and the range between two consecutive entries (the former entry's TO and the latter entry's FROM) are used to infer function profile info.

An example of an LBR perf script (created by `perf script -F ip,brstack -i perf.data`):
```
           40062f 0x40062f/0x4005b0/P/-/-/9  0x400645/0x4005ff/P/-/-/1  0x400637/0x400645/P/-/-/1 ...
           4005d7 0x4005d7/0x4005e5/P/-/-/8  0x40062f/0x4005b0/P/-/-/6  0x400645/0x4005ff/P/-/-/1 ...
           ...
```
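For illustration, a small parser for one line of such a script; the struct and function names are assumptions of this sketch rather than the actual `LBRPerfReader` code.

```
#include <cstdint>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

struct LBREntry { uint64_t From = 0, To = 0; };

bool parseLBRLine(const std::string &Line, uint64_t &LeadingIP,
                  std::vector<LBREntry> &Entries) {
  std::istringstream SS(Line);
  std::string Tok;
  if (!(SS >> Tok))
    return false;
  LeadingIP = std::stoull(Tok, nullptr, 16); // leading frame pointer / IP
  while (SS >> Tok) {
    // Each entry looks like 0xFROM/0xTO/P/-/-/CYCLES; we only need FROM/TO.
    size_t S1 = Tok.find('/');
    size_t S2 = Tok.find('/', S1 + 1);
    if (S1 == std::string::npos || S2 == std::string::npos)
      return false;
    LBREntry E;
    E.From = std::stoull(Tok.substr(0, S1), nullptr, 16);
    E.To = std::stoull(Tok.substr(S1 + 1, S2 - S1 - 1), nullptr, 16);
    Entries.push_back(E);
  }
  return !Entries.empty();
}

int main() {
  uint64_t IP;
  std::vector<LBREntry> Entries;
  std::string Line =
      "40062f 0x40062f/0x4005b0/P/-/-/9  0x400645/0x4005ff/P/-/-/1";
  if (parseLBRLine(Line, IP, Entries))
    std::cout << std::hex << IP << " entries: " << std::dec << Entries.size()
              << "\n"; // 40062f entries: 2
}
```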

For implementation:
 - Added a new child class `LBRPerfReader` for the sample parsing; it reuses all the functionality in `extractLBRStack`, with an extension to parse the leading instruction pointer.
 - `HybridSample` is reused (just leaving the call stack empty) and the parsed samples are still aggregated in `AggregatedSamples`. After that, range samples, branch samples, and address samples are computed and recorded.
 - Reused `ContextSampleCounterMap` to store the raw profile. Since there is no need to aggregate by context, it just registers one sample counter with a fake context key.
 - Unified on `show-raw-profile` instead of `show-unwinder-output` to dump the intermediate raw profile; see the comments for the format of the raw profile. For CS profiles, it still outputs the unwinder output.

The profile generation part will come soon.

Differential Revision: https://reviews.llvm.org/D108153
2021-08-31 13:28:17 -07:00
Hongtao Yu b9db70369b [CSSPGO] Split context string to deduplicate function name used in the context.
Currently context strings contain a lot of duplicated function names, and that significantly increases the profile size. This change splits the context into a series of {name, offset, discriminator} tuples so that function names used in the context can be replaced by an index into the name table, which significantly reduces the size consumed by the context.

A follow-up improvement made in the compiler and profiling tools is to avoid reconstructing full context strings, which is time- and memory-consuming. Instead, a context vector of `StringRef` is adopted to represent the full context in all scenarios. As a result, the previously prevalent profile map, which was keyed by `StringRef`, is now engineered as an unordered map keyed by `SampleContext`. `SampleContext` is reshaped to use an `ArrayRef` to represent a full context for CS profiles. For non-CS profiles, it falls back to using a `StringRef` to represent a contextless function name. Both the `ArrayRef` and `StringRef` objects are underpinned by real array and string objects stored in producer buffers. For the compiler, they are maintained by the sample reader; for llvm-profgen, they are maintained in `ProfiledBinary` and `ProfileGenerator`. Full context strings are generated only for debugging and printing.

When it comes to the profile format, nothing has changed in the text format, though internally a CS context is implemented as a vector. The extbinary format is only changed for CS profiles, with an additional `SecCSNameTable` section which stores all full contexts logically in the form of `vector<int>`, with each element being an offset pointing into `SecNameTable`. All occurrences of contexts elsewhere are redirected to use the offset into `SecCSNameTable`.
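A rough sketch of the split-context idea: a context becomes a sequence of {name, line offset, discriminator} frames instead of one big string, and the profile map is keyed by that sequence. Field names and types here are simplified assumptions, not the exact LLVM definitions.

```
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <tuple>
#include <vector>

struct ContextFrame {
  std::string FuncName; // in LLVM this is a StringRef/index into a name table
  uint32_t LineOffset = 0;
  uint32_t Discriminator = 0;
  bool operator<(const ContextFrame &O) const {
    return std::tie(FuncName, LineOffset, Discriminator) <
           std::tie(O.FuncName, O.LineOffset, O.Discriminator);
  }
};

using SampleContextVec = std::vector<ContextFrame>;

int main() {
  // "main:1 @ foo:2 @ bar" expressed as frames rather than one big string.
  SampleContextVec Ctx = {{"main", 1, 0}, {"foo", 2, 0}, {"bar", 0, 0}};
  std::map<SampleContextVec, uint64_t> ProfileMap; // context -> total samples
  ProfileMap[Ctx] += 100;
  std::cout << ProfileMap.size() << "\n"; // 1
}
```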

Testing
This is a no-diff change in terms of code quality and profile content (for the text profile).

For our internal large service (aka ads), profile generation time is cut in half, with the generated extbinary profile being 20x smaller than the string-based format.

The compile time of ads drops by 25%.

Differential Revision: https://reviews.llvm.org/D107299
2021-08-30 20:09:29 -07:00
Wenlei He a6f15e9a49 [CSSPGO] Use probe inline tree to track zero size fully optimized context for pre-inliner
This is a follow-up diff for BinarySizeContextTracker to track zero size for fully optimized inlinees. When an inlinee is fully optimized away, we won't be able to get its size by symbolizing instructions, so we would treat the corresponding context size as unknown. However, by traversing the inlined probe forest, we know the original inlinees regardless of optimization. If a context shows up in the inlined probes but not during symbolization, we know it was fully optimized away, hence its size is zero instead of unknown. This provides more accurate size cost estimation for the pre-inliner to make better inline decisions in llvm-profgen.

Differential Revision: https://reviews.llvm.org/D108350
2021-08-25 09:01:11 -07:00
Wenlei He eca03d2768 [CSSPGO] Track and use context-sensitive post-optimization function size to drive global pre-inliner in llvm-profgen
This change enables llvm-profgen to use accurate context-sensitive, post-optimization function byte size as a cost proxy to drive global pre-inline decisions.

To do this, BinarySizeContextTracker is introduced to track function byte size under different inline contexts during disassembling. In the pre-inliner, we can now query the context byte size under the switch `context-cost-for-preinliner`. The tracker uses a reverse trie to keep the size of functions under different contexts (callee as parent, caller as child), and it can give the best/longest possible matching context size for a given input context.

The new size cost is off by default. There are a few TODOs that need to be addressed: 1) avoid dangling strings from `Offset2LocStackMap`, which will be addressed in the split-context work; 2) use the inlinee's entry probe to make sure we have a correct zero size for inlinees that are completely optimized away after inlining. Some tuning is also needed.
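An illustrative reverse-trie sketch of the idea behind BinarySizeContextTracker (not its real implementation): the callee is the parent and callers are children, so a query can fall back to the longest matching part of the calling context.

```
#include <cstdint>
#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct SizeTrieNode {
  std::map<std::string, std::unique_ptr<SizeTrieNode>> Children;
  uint64_t Size = 0;
  bool HasSize = false;
};

class SizeContextTracker {
  SizeTrieNode Root;

public:
  // Context is ordered root caller first, callee last, e.g. {"main","foo","bar"}.
  void addSize(const std::vector<std::string> &Context, uint64_t Size) {
    SizeTrieNode *Node = &Root;
    for (auto It = Context.rbegin(); It != Context.rend(); ++It) {
      auto &Child = Node->Children[*It];
      if (!Child)
        Child = std::make_unique<SizeTrieNode>();
      Node = Child.get();
    }
    Node->Size += Size;
    Node->HasSize = true;
  }

  // Return the size recorded for the longest matching context (callee outward).
  uint64_t query(const std::vector<std::string> &Context) const {
    const SizeTrieNode *Node = &Root;
    uint64_t Best = 0;
    for (auto It = Context.rbegin(); It != Context.rend(); ++It) {
      auto Found = Node->Children.find(*It);
      if (Found == Node->Children.end())
        break;
      Node = Found->second.get();
      if (Node->HasSize)
        Best = Node->Size;
    }
    return Best;
  }
};

int main() {
  SizeContextTracker T;
  T.addSize({"main", "foo", "bar"}, 32);
  T.addSize({"bar"}, 48); // context-less size as a fallback
  std::cout << T.query({"main", "foo", "bar"}) << "\n"; // 32 (exact context)
  std::cout << T.query({"baz", "bar"}) << "\n";         // 48 (longest match)
}
```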

Differential Revision: https://reviews.llvm.org/D108180
2021-08-18 22:50:57 -07:00
wlei 856a6a5041 [CSSPGO][llvm-profgen] Trim and merge context beforehand to reduce memory usage
Currently we use a centralized string map (`StringMap<FunctionSamples> ProfileMap`) to store the profile while populating the samples, which can become a memory bottleneck. In an extreme case I saw thousands of samples whose context stack depth is >= 100, and the memory consumption can exceed 100GB.

Since the context here is used for inlining, we can assume we won't have that many inlinees kept inlined at the same root function, so this change caps the context stack depth and merges the samples for peak memory reduction; this is done after recursion compression.

The default value is -1, meaning no depth limit; in the future we can tune it to a smaller one.
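A minimal sketch of the cap-and-merge step; keeping the frames nearest the leaf when truncating is an assumption of this illustration, as are the type names.

```
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Context = std::vector<std::string>; // root caller first, leaf last

// Truncate a context to at most MaxDepth frames; -1 means no depth limit.
Context capContext(const Context &Ctx, int64_t MaxDepth) {
  if (MaxDepth < 0 || static_cast<int64_t>(Ctx.size()) <= MaxDepth)
    return Ctx;
  return Context(Ctx.end() - MaxDepth, Ctx.end());
}

int main() {
  std::map<Context, uint64_t> Merged;
  std::vector<std::pair<Context, uint64_t>> Samples = {
      {{"main", "a", "b", "leaf"}, 10},
      {{"main", "c", "b", "leaf"}, 5}};
  for (const auto &S : Samples)
    Merged[capContext(S.first, /*MaxDepth=*/2)] += S.second;
  // Both samples collapse into the context {"b", "leaf"} with count 15.
  std::cout << Merged.size() << " " << Merged.begin()->second << "\n"; // 1 15
}
```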

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D107800
2021-08-11 16:02:35 -07:00
jamesluox ee7d20e846 [CSSPGO] Migrate and refactor the decoder of Pseudo Probe
Migrate the pseudo probe decoding logic in llvm-profgen to MC, so other LLVM-based programs can reuse the existing code. Redesign the object layout of encoded and decoded pseudo probes.

Reviewed By: hoy

Differential Revision: https://reviews.llvm.org/D106861
2021-08-04 09:21:34 -07:00
Hongtao Yu 0712038458 [CSSPGO][llvm-profgen] Allow multiple executable load segments.
The linker or post-link optimizer can create an ELF image with multiple executable segments, each of which will be loaded separately at run time. This breaks the assumption of llvm-profgen, which currently only supports one base load address. What ends up happening is that subsequent mmap events are treated as an overwrite of the first mmap event, which in turn screws up the address mapping. While it is non-trivial to support multiple separate load addresses, and given that on x64 those segments will always be loaded at consecutive addresses (though via separate mmap syscalls), I'm adding error-checking logic to bail out if that assumption is violated and to keep using a single load address, which is the address of the first executable segment.
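One way to read that check, sketched under the assumption that a single base address is only safe when every executable segment shares the same load bias (virtual address minus file offset); this is an illustration, not the exact llvm-profgen logic.

```
#include <cstdint>
#include <iostream>
#include <vector>

struct ExecSegment { uint64_t FileOffset; uint64_t VAddr; };

bool canUseSingleLoadAddress(const std::vector<ExecSegment> &Segs) {
  if (Segs.empty())
    return false;
  uint64_t Bias = Segs.front().VAddr - Segs.front().FileOffset;
  for (const auto &S : Segs)
    if (S.VAddr - S.FileOffset != Bias)
      return false; // segments not laid out consecutively: bail out with an error
  return true;      // the first executable segment's address works for all
}

int main() {
  std::vector<ExecSegment> Segs = {{0x1000, 0x201000}, {0x3000, 0x203000}};
  std::cout << (canUseSingleLoadAddress(Segs) ? "ok" : "bail out") << "\n"; // ok
}
```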

Also changing the disassembly output from printing section offset to printing the virtual address instead, which matches the behavior of objdump.

Differential Revision: https://reviews.llvm.org/D103178
2021-07-13 18:22:24 -07:00
Rong Xu 8c68eb8306 [SampleFDO] Make FSDiscriminator flag part of function parameters
Add a parameter of IsFSDiscriminator to function
getBaseDiscriminatorFromDiscriminator().

This function currently checks the internal flag of
--enable-fs-discriminator. This is not good because we might
change the default value of the internal flag.

Note that we have a default parameter. This is just
because create_afdo_tool has a call-site to it.
I will remove the default parameter in a later patch.

Differential Revision: https://reviews.llvm.org/D104584
2021-06-21 14:37:45 -07:00
Philipp Krones c2f819af73 [MC] Refactor MCObjectFileInfo initialization and allow targets to create MCObjectFileInfo
This makes it possible for targets to define their own MCObjectFileInfo.
This MCObjectFileInfo is then used to determine things like section alignment.

This is a follow up to D101462 and prepares for the RISCV backend defining the
text section alignment depending on the enabled extensions.

Reviewed By: MaskRay

Differential Revision: https://reviews.llvm.org/D101921
2021-05-23 14:15:23 -07:00
Philipp Krones 632ebc4ab4 [MC] Untangle MCContext and MCObjectFileInfo
This untangles the MCContext and the MCObjectFileInfo. There is a circular dependency between MCContext and MCObjectFileInfo, and currently this dependency also exists during construction: you can't construct a MOFI without an MCContext, without constructing the MCContext with a dummy version of that MOFI first. This removes the dependency during construction. In a perfect world, MCObjectFileInfo wouldn't depend on MCContext at all, but would only be stored in the MCContext, like other MC information. This is future work.

This also shifts/adds more information to the MCContext, making it more available to the different targets. Namely:

- TargetTriple
- ObjectFileType
- SubtargetInfo

Reviewed By: MaskRay

Differential Revision: https://reviews.llvm.org/D101462
2021-05-05 10:03:02 -07:00
Wenlei He 1410db70b9 [CSSPGO] Add attribute metadata for context profile
This change adds an attribute field to the metadata of a context profile. Currently we have an inline attribute that indicates whether the leaf frame corresponding to a context profile was inlined in the previous build.

This will be used to help estimate inlining and will be taken into account when trimming contexts. Changes for that in llvm-profgen will follow. It will also help tuning.

Differential Revision: https://reviews.llvm.org/D98823
2021-03-18 22:00:56 -07:00
wlei dddd590fd0 [CSSPGO][llvm-profgen] Fix getCanonicalFnName usage in llvm-profgen
Previously we didn't support keeping the unique linkage name (-funique-internal-linkage-names) in llvm-profgen. As discussed in https://reviews.llvm.org/D96932, we chose to canonicalize it.

Now, since "selected" is set as the default parameter of getCanonicalFnName in `D96932`, we don't need to add any attribute here for the previous usage and only need to fix the missing usage in the pseudo probe decoding.

Differential Revision: https://reviews.llvm.org/D98226
2021-03-15 21:00:42 -07:00
Hongtao Yu 55356c011b [CSSPGO][llvm-profgen] Continue disassembling after illegal instruction is seen.
Previously we errored out when disassembling illegal instructions and there would be no profile generated. In fact illegal instructions are not uncommon and we'd better skip them and print "unknown" instead of erroring out. This matches the behavior of llvm-objdump (see disassembleObject in llvm-objdump.cpp).
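A conceptual sketch of that skip-and-continue behavior; the decode callback below stands in for the real MCDisassembler-based path, and the fixed one-byte skip is an assumption of this illustration.

```
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

using DecodeFn = bool (*)(const uint8_t *Bytes, size_t Avail, uint64_t &Size);

void disassembleSection(const std::vector<uint8_t> &Bytes, uint64_t Start,
                        DecodeFn TryDecode) {
  uint64_t Offset = 0;
  while (Offset < Bytes.size()) {
    uint64_t Size = 0;
    if (TryDecode(Bytes.data() + Offset, Bytes.size() - Offset, Size)) {
      std::printf("%#llx: <decoded %llu-byte instruction>\n",
                  (unsigned long long)(Start + Offset),
                  (unsigned long long)Size);
    } else {
      // Do not error out: print "unknown", skip ahead, and keep disassembling.
      std::printf("%#llx: unknown\n", (unsigned long long)(Start + Offset));
      Size = Size ? Size : 1;
    }
    Offset += Size;
  }
}

int main() {
  // Pretend every 0x90 byte decodes as a 1-byte nop and anything else fails.
  auto Decode = [](const uint8_t *B, size_t, uint64_t &Size) {
    if (*B == 0x90) { Size = 1; return true; }
    return false;
  };
  std::vector<uint8_t> Bytes = {0x90, 0x06, 0x90}; // 0x06 is illegal in 64-bit mode
  disassembleSection(Bytes, 0x2011a0, Decode);
}
```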

Reviewed By: wlei, wenlei

Differential Revision: https://reviews.llvm.org/D97776
2021-03-03 10:14:10 -08:00
wlei afd8bd601e [CSSPGO][llvm-profgen] Filter out the instructions without location info for symbolizer
It appears some instructions don't have debug location info, and the symbolizer returns an empty call stack for them, which later causes crashes in profile unwinding. We do not record sample info for them anyway, so this change just filters out those instructions.

As those instructions appear at the beginning and end of the instruction list, removing them means we need to add boundary checks for IP `advance` and `backward`.

Also, for pseudo-probe-based profiles we don't actually need the symbolized location info, so this changes to use an empty stack for them. This can save half of the binary loading time.

Differential Revision: https://reviews.llvm.org/D96434
2021-02-12 16:47:49 -08:00
wlei 426e326a19 [CSSPGO][llvm-profgen] Renovate perfscript check and command line input validation
This includes some changes related to PerfReader's input checking and command line handling:

1) It appears there can be thousands of leading MMAP-event lines in the perf script for a large workload. In that case, the 4k-line threshold is not sufficient to determine whether it's a hybrid sample. This change renovates `isHybridPerfScript` to go through the script without a threshold limit, checking whether there is a non-empty call stack immediately followed by an LBR sample; it stops once it finds a valid one (see the sketch after this list).

2) Added several input validations for the command line switches in PerfReader.

3) Changed the command line switch `show-disassembly` to `show-disassembly-only`; it prints to stdout and exits early, which leaves an empty output profile.
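A sketch of the detection logic in (1), scanning without a line threshold; the line classifiers below are simplified assumptions about the perf script text, not the real `isHybridPerfScript`.

```
#include <cctype>
#include <iostream>
#include <sstream>
#include <string>

static bool looksLikeLBRSample(const std::string &Line) {
  return Line.find('/') != std::string::npos; // e.g. 0x40062f/0x4005b0/P/-/-/9
}

static bool looksLikeCallStackFrame(const std::string &Line) {
  // e.g. "  4005dc fib+0x2c": starts with a hex address and has no LBR slashes.
  return !Line.empty() && !looksLikeLBRSample(Line) &&
         Line.find_first_not_of(" \t") != std::string::npos &&
         isxdigit(static_cast<unsigned char>(
             Line[Line.find_first_not_of(" \t")]));
}

bool isHybridPerfScript(std::istream &Script) {
  bool InCallStack = false;
  std::string Line;
  while (std::getline(Script, Line)) {
    if (looksLikeLBRSample(Line)) {
      if (InCallStack)
        return true; // non-empty call stack immediately followed by an LBR sample
      InCallStack = false;
    } else {
      InCallStack = looksLikeCallStackFrame(Line);
    }
  }
  return false;
}

int main() {
  std::istringstream S("  4005dc fib+0x2c\n"
                       "  0x40062f/0x4005b0/P/-/-/9 0x400645/0x4005ff/P/-/-/1\n");
  std::cout << (isHybridPerfScript(S) ? "hybrid" : "lbr-only") << "\n"; // hybrid
}
```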

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D96387
2021-02-12 15:18:50 -08:00
wlei c3aeabaea1 [CSSPGO][llvm-profgen] Add brackets for context id to support extended binary format
To align with https://reviews.llvm.org/D95547, we need to add brackets for context id before initializing the `SampleContext`.

Also added test cases for extended binary format from llvm-profgen side.

Differential Revision: https://reviews.llvm.org/D95929
2021-02-12 01:14:53 -08:00
wlei 3869309a0c [CSSPGO][llvm-profgen] Aggregate samples on call frame trie to speed up profile generation
For CS profile generation, the process of call stack unwinding is time-consuming, since for each LBR entry we need linear time to generate the context (hash, compression, string concatenation). This change speeds that up by grouping all the call frames within one LBR sample into a trie and aggregating the results (sample counters) on it, deferring the context compression and string generation to the end of unwinding.

Specifically, it uses `StackLeaf` as the top frame on the stack and manipulates it (pops or pushes a trie node) dynamically during virtual unwinding, so that the raw sample can just be recorded on the leaf node; the path (root to leaf) represents its calling context. In the end, it traverses the trie and generates the contexts on the fly.

Results:
Our internal branch shows about a 5X speed-up on some large workloads in the SPEC06 benchmark.

Differential Revision: https://reviews.llvm.org/D94110
2021-02-04 08:43:21 -08:00
wlei ac14bb14e7 [CSSPGO][llvm-profgen] Compress recursive cycles in calling context
This change compresses the context string by removing cycles due to recursive functions for CS profile generation. Removing recursion cycles is a way to normalize the calling context, which is better for sample aggregation and also makes the context promotion deterministic.
Specifically, for the implementation, we recognize adjacent repeated frames as cycles and deduplicate them through multiple rounds of iteration.
For example:
Consider an input context string stack:
[“a”, “a”, “b”, “c”, “a”, “b”, “c”, “b”, “c”, “d”]
In the first iteration, it removes all adjacent repeated frames of size 1:
[“a”, “b”, “c”, “a”, “b”, “c”, “b”, “c”, “d”]
In the second iteration, it removes all adjacent repeated frames of size 2:
[“a”, “b”, “c”, “a”, “b”, “c”, “d”]
So in the end, we get the compressed output:
[“a”, “b”, “c”, “d”]
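For illustration, a direct implementation of those iterations (a sketch, not the actual llvm-profgen code); `MaxDedupSize = -1` mirrors the no-limit default of the `compress-recursion` switch mentioned below.

```
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

using Context = std::vector<std::string>;

void compressRecursion(Context &Ctx, int64_t MaxDedupSize = -1) {
  size_t Limit = MaxDedupSize < 0 ? Ctx.size() / 2
                                  : static_cast<size_t>(MaxDedupSize);
  // For growing window sizes, drop the second copy of any two adjacent,
  // identical frame sequences until nothing changes at that size.
  for (size_t Size = 1; Size <= Limit && Size * 2 <= Ctx.size(); ++Size) {
    bool Changed = true;
    while (Changed) {
      Changed = false;
      for (size_t I = 0; I + 2 * Size <= Ctx.size(); ++I) {
        if (std::equal(Ctx.begin() + I, Ctx.begin() + I + Size,
                       Ctx.begin() + I + Size)) {
          Ctx.erase(Ctx.begin() + I + Size, Ctx.begin() + I + 2 * Size);
          Changed = true;
          break; // rescan after each removal
        }
      }
    }
  }
}

int main() {
  Context Ctx = {"a", "a", "b", "c", "a", "b", "c", "b", "c", "d"};
  compressRecursion(Ctx);
  for (const auto &F : Ctx)
    std::cout << F << " "; // a b c d
  std::cout << "\n";
}
```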

Compression is called in two places: one for the sample's context key right after unwinding, and one for the eventual context string id in the ProfileGenerator.
Added a switch `compress-recursion` to control the size of duplicated frames; the default -1 means no size limit.
Added unit tests and a regression test for this.

Differential Revision: https://reviews.llvm.org/D93556
2021-02-03 22:16:07 -08:00
wlei 6bccdcdb35 Revert "[CSSPGO][llvm-profgen] Compress recursive cycles in calling context"
This reverts commit 0609f257dc.
2021-02-03 22:16:05 -08:00
wlei 08e8bb60cf Revert "[CSSPGO][llvm-profgen] Aggregate samples on call frame trie to speed up profile generation"
This reverts commit 1714ad2336.
2021-02-03 22:16:05 -08:00
wlei 1714ad2336 [CSSPGO][llvm-profgen] Aggregate samples on call frame trie to speed up profile generation
For CS profile generation, the process of call stack unwinding is time-consuming, since for each LBR entry we need linear time to generate the context (hash, compression, string concatenation). This change speeds that up by grouping all the call frames within one LBR sample into a trie and aggregating the results (sample counters) on it, deferring the context compression and string generation to the end of unwinding.

Specifically, it uses `StackLeaf` as the top frame on the stack and manipulates it (pops or pushes a trie node) dynamically during virtual unwinding, so that the raw sample can just be recorded on the leaf node; the path (root to leaf) represents its calling context. In the end, it traverses the trie and generates the contexts on the fly.

Results:
Our internal branch shows about a 5X speed-up on some large workloads in the SPEC06 benchmark.

Differential Revision: https://reviews.llvm.org/D94110
2021-02-03 18:50:14 -08:00
wlei 0609f257dc [CSSPGO][llvm-profgen] Compress recursive cycles in calling context
This change compresses the context string by removing cycles due to recursive functions for CS profile generation. Removing recursion cycles is a way to normalize the calling context, which is better for sample aggregation and also makes the context promotion deterministic.
Specifically, for the implementation, we recognize adjacent repeated frames as cycles and deduplicate them through multiple rounds of iteration.
For example:
Consider an input context string stack:
[“a”, “a”, “b”, “c”, “a”, “b”, “c”, “b”, “c”, “d”]
In the first iteration, it removes all adjacent repeated frames of size 1:
[“a”, “b”, “c”, “a”, “b”, “c”, “b”, “c”, “d”]
In the second iteration, it removes all adjacent repeated frames of size 2:
[“a”, “b”, “c”, “a”, “b”, “c”, “d”]
So in the end, we get the compressed output:
[“a”, “b”, “c”, “d”]

Compression is called in two places: one for the sample's context key right after unwinding, and one for the eventual context string id in the ProfileGenerator.
Added a switch `compress-recursion` to control the size of duplicated frames; the default -1 means no size limit.
Added unit tests and a regression test for this.

Differential Revision: https://reviews.llvm.org/D93556
2021-02-03 18:50:14 -08:00
Kazu Hirata 7dc3575ef2 [llvm] Remove redundant return and continue statements (NFC)
Identified with readability-redundant-control-flow.
2021-01-14 20:30:34 -08:00
wlei c681400b25 [CSSPGO][llvm-profgen] Virtual unwinding with pseudo probe
This change extends the virtual unwinder to support pseudo probes in llvm-profgen. Please refer to https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s and https://reviews.llvm.org/D89707 for more context about CSSPGO and llvm-profgen.

**Implementation**

- Added `ProbeBasedCtxKey`, derived from `ContextKey`, for sample counter aggregation. As we would need string splitting to infer the profile for a callee function, a string-based context introduces extra string-handling overhead, so here we just use a probe-pointer-based context.
- For linear unwinding, since the inline context is encoded in each pseudo probe, we don't need to go through each instruction to extract ranges sharing the same inliner; we just record the range for the context.
- For probe-based contexts, we should ignore the top frame probe since it will be extracted from the address range; we defer that extraction to `ProfileGeneration`.
- Added `PseudoProbeProfileGenerator` for pseudo-probe-based profile generation.
- Added some helper functions to get pseudo probe info (call probe, inline context) from the profiled binary.
- Added a regression test for the unwinder's output.

The pseudo probe based profile generation will be in the upcoming patch.

Test Plan:

ninja & ninja check-llvm

Differential Revision: https://reviews.llvm.org/D92896
2021-01-13 11:02:58 -08:00
wlei b3154d11bc [CSSPGO][llvm-profgen] Pseudo probe decoding and disassembling
This change implements pseudo probe decoding and disassembling for llvm-profgen/CSSPGO. Please see https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s and https://reviews.llvm.org/D89707 for more context about CSSPGO and llvm-profgen.

**ELF section format**
Please see the encoding patch (https://reviews.llvm.org/D91878) for more details of the format; the example is just copied here:

Two sections (`.pseudo_probe_desc` and `.pseudoprobe`) are emitted in ELF to support pseudo probes.
The format of the `.pseudo_probe_desc` section looks like:

```
.section   .pseudo_probe_desc,"",@progbits
.quad   6309742469962978389  // Func GUID
.quad   4294967295           // Func Hash
.byte   9                    // Length of func name
.ascii  "_Z5funcAi"          // Func name
.quad   7102633082150537521
.quad   138828622701
.byte   12
.ascii  "_Z8funcLeafi"
.quad   446061515086924981
.quad   4294967295
.byte   9
.ascii  "_Z5funcBi"
.quad   -2016976694713209516
.quad   72617220756
.byte   7
.ascii  "_Z3fibi"
```

For each `.pseudoprobe` section, the encoded binary data consists of a single function record corresponding to an outlined function (i.e., a function with a code entry in the `.text` section). A function record has the following format:

```
FUNCTION BODY (one for each outlined function present in the text section)
    GUID (uint64)
        GUID of the function
    NPROBES (ULEB128)
        Number of probes originating from this function.
    NUM_INLINED_FUNCTIONS (ULEB128)
        Number of callees inlined into this function, aka number of
        first-level inlinees
    PROBE RECORDS
        A list of NPROBES entries. Each entry contains:
          INDEX (ULEB128)
          TYPE (uint4)
            0 - block probe, 1 - indirect call, 2 - direct call
          ATTRIBUTE (uint3)
            reserved
          ADDRESS_TYPE (uint1)
            0 - code address, 1 - address delta
          CODE_ADDRESS (uint64 or ULEB128)
            code address or address delta, depending on ADDRESS_TYPE
    INLINED FUNCTION RECORDS
        A list of NUM_INLINED_FUNCTIONS entries describing each of the inlined
        callees.  Each record contains:
          INLINE SITE
            GUID of the inlinee (uint64)
            ID of the callsite probe (ULEB128)
          FUNCTION BODY
            A FUNCTION BODY entry describing the inlined function.
```
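A skeleton decoder for the FUNCTION BODY layout above; the ULEB128 helper, the little-endian read, and the bit positions assumed for TYPE/ATTRIBUTE/ADDRESS_TYPE in the packed byte are assumptions of this sketch, while the real decoder in llvm-profgen is more involved.

```
#include <cstdint>
#include <cstdio>
#include <vector>

struct Cursor { const uint8_t *Ptr, *End; };

static uint64_t readULEB128(Cursor &C) {
  uint64_t Value = 0;
  unsigned Shift = 0;
  while (C.Ptr < C.End) {
    uint8_t Byte = *C.Ptr++;
    Value |= uint64_t(Byte & 0x7f) << Shift;
    if (!(Byte & 0x80))
      break;
    Shift += 7;
  }
  return Value;
}

static uint64_t readUInt64(Cursor &C) {
  uint64_t Value = 0;
  for (unsigned I = 0; I < 8 && C.Ptr < C.End; ++I)
    Value |= uint64_t(*C.Ptr++) << (8 * I); // assuming little-endian encoding
  return Value;
}

// Recursively decode one FUNCTION BODY record and its inlined callees.
static void decodeFunctionBody(Cursor &C, unsigned Depth) {
  uint64_t GUID = readUInt64(C);
  uint64_t NumProbes = readULEB128(C);
  uint64_t NumInlined = readULEB128(C);
  std::printf("%*sGUID=%llu probes=%llu inlinees=%llu\n", int(Depth * 2), "",
              (unsigned long long)GUID, (unsigned long long)NumProbes,
              (unsigned long long)NumInlined);
  for (uint64_t I = 0; I < NumProbes && C.Ptr < C.End; ++I) {
    uint64_t Index = readULEB128(C);
    uint8_t Packed = *C.Ptr++;    // TYPE:4 | ATTRIBUTE:3 | ADDRESS_TYPE:1 (assumed layout)
    bool IsDelta = Packed & 0x80; // assumed: top bit is ADDRESS_TYPE
    uint64_t Addr = IsDelta ? readULEB128(C) : readUInt64(C);
    (void)Index; (void)Addr;      // a real decoder would build the address-to-probe map here
  }
  for (uint64_t I = 0; I < NumInlined && C.Ptr < C.End; ++I) {
    (void)readUInt64(C);              // INLINE SITE: GUID of the inlinee
    (void)readULEB128(C);             // INLINE SITE: ID of the callsite probe
    decodeFunctionBody(C, Depth + 1); // nested FUNCTION BODY
  }
}

int main() {
  // Minimal handcrafted record: GUID=1, zero probes, zero inlinees.
  std::vector<uint8_t> Bytes = {1, 0, 0, 0, 0, 0, 0, 0, 0, 0};
  Cursor C{Bytes.data(), Bytes.data() + Bytes.size()};
  decodeFunctionBody(C, 0); // prints: GUID=1 probes=0 inlinees=0
}
```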

**Disassembling**
A switch `--show-pseudo-probe` is added to be used along with `--show-disassembly` to print disassembly with pseudo probe directives.

For example:
```
00000000002011a0 <foo2>:
  2011a0: 50                    push   rax
  2011a1: 85 ff                 test   edi,edi
  [Probe]:  FUNC: foo2  Index: 1  Type: Block
  2011a3: 74 02                 je     2011a7 <foo2+0x7>
  [Probe]:  FUNC: foo2  Index: 3  Type: Block
  [Probe]:  FUNC: foo2  Index: 4  Type: Block
  [Probe]:  FUNC: foo   Index: 1  Type: Block  Inlined: @ foo2:6
  2011a5: 58                    pop    rax
  2011a6: c3                    ret
  [Probe]:  FUNC: foo2  Index: 2  Type: Block
  2011a7: bf 01 00 00 00        mov    edi,0x1
  [Probe]:  FUNC: foo2  Index: 5  Type: IndirectCall
  2011ac: ff d6                 call   rsi
  [Probe]:  FUNC: foo2  Index: 4  Type: Block
  2011ae: 58                    pop    rax
  2011af: c3                    ret
```

**Implementation**
- `PseudoProbeDecoder` is added in ProfiledBinary as the infrastructure for decoding. It decodes the two sections and generates two maps: `GUIDProbeFunctionMap` stores all the `PseudoProbeFunction` objects, which are the abstraction of a general function, and `AddressProbesMap` stores all the pseudo probe info indexed by address.
- All the inline info is encoded into the binary as a trie (`PseudoProbeInlineTree`) and is reconstructed during decoding. Each pseudo probe can get its inline context (`getInlineContext`) by traversing its inline tree node backwards.

Test Plan:
ninja & ninja check-llvm

Differential Revision: https://reviews.llvm.org/D92334
2021-01-13 11:02:57 -08:00
wlei 1f05b1a9f5 [CSSPGO][llvm-profgen] Context-sensitive profile data generation
This stack of changes introduces the `llvm-profgen` utility, which generates a profile data file from given perf script data files for sample-based PGO. It's part of (though not only) the CSSPGO work. Specifically, to support context-sensitive profiles with or without pseudo probes, it implements a series of functionalities including perf trace parsing, instruction symbolization, LBR stack/call frame stack unwinding, pseudo probe decoding, etc. High throughput is achieved by multiple levels of sample aggregation, and a compatible format is generated at the end in one stop. Please refer to https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s for the CSSPGO RFC.

This change adds support for context-sensitive profile data generation to llvm-profgen. With simultaneous sampling of the LBR and the call stack, we can identify the leaf of an LBR sample with the calling context from the stack sample. While deriving fall-through paths from LBR entries, we unwind the LBR by replaying all the calls and returns (including implicit calls/returns due to inlining) backwards on top of the sampled call stack. The state of the call stack as we unwind through the LBR then always represents the calling context of the current fall-through path.

We have two types of virtual unwinding: 1) LBR unwinding and 2) linear range unwinding.
Specifically, each LBR entry can be classified as a call, a return, or a regular branch. LBR unwinding replays the operation by pushing, popping, or switching the leaf frame of the call stack. Since the initial call stack is the most recently sampled one, the replay is in anti-execution order, i.e., for the regular case we pop the call stack when the LBR entry is a call and push a frame when it is a return. After each LBR entry is processed, we also need to align with the next LBR entry by going through the instructions from the previous LBR's target to the current LBR's source, which we call linear unwinding. As instructions in a linear range can come from different functions due to inlining, linear unwinding splits the range and records counters for the sub-ranges sharing the same inline context.

With each fall-through path from LBR unwinding, we aggregate each sample into counters by calling context and eventually generate a full context-sensitive profile (without relying on inlining) to drive the compiler's PGO/FDO.
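A conceptual sketch of that replay (branch classification and the address-to-function lookup are stubbed assumptions, and the real unwinder records actual address ranges split by inline context rather than a bare counter):

```
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

enum class BranchKind { Call, Return, Regular };
struct LBREntry { uint64_t From, To; BranchKind Kind; };

struct Unwinder {
  std::vector<std::string> CallStack;                   // root first, leaf last
  std::map<std::string, uint64_t> RangeCountsByContext; // context -> samples
  std::string (*FuncAt)(uint64_t Addr) = nullptr;       // stubbed address-to-function lookup

  std::string currentContext() const {
    std::string Ctx;
    for (const auto &F : CallStack)
      Ctx += (Ctx.empty() ? "" : " @ ") + F;
    return Ctx;
  }

  void unwind(const std::vector<LBREntry> &LBR) {
    // LBR[0] is the most recent branch; replay in anti-execution order.
    for (size_t I = 0; I < LBR.size(); ++I) {
      const LBREntry &E = LBR[I];
      if (E.Kind == BranchKind::Call && !CallStack.empty())
        CallStack.pop_back();                // undo the call: drop the callee frame
      else if (E.Kind == BranchKind::Return)
        CallStack.push_back(FuncAt(E.From)); // undo the return: restore the callee
      // Linear unwinding: the fall-through range [LBR[I+1].To, E.From] runs in
      // whatever context the stack now represents.
      if (I + 1 < LBR.size())
        RangeCountsByContext[currentContext()] += 1;
    }
  }
};

static std::string funcAtStub(uint64_t Addr) {
  return Addr < 0x2000 ? "foo" : "bar"; // toy address-to-function mapping
}

int main() {
  Unwinder U;
  U.FuncAt = funcAtStub;
  U.CallStack = {"main", "foo"}; // sampled call stack, leaf last
  std::vector<LBREntry> LBR = {
      {0x2040, 0x1040, BranchKind::Return}, // most recent: bar returned to foo
      {0x1010, 0x2000, BranchKind::Call},   // earlier: foo called bar
  };
  U.unwind(LBR);
  for (const auto &KV : U.RangeCountsByContext)
    std::cout << KV.first << " -> " << KV.second << "\n"; // main @ foo @ bar -> 1
}
```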

A breakdown of noteworthy changes:
- Added the `HybridSample` class as the abstraction of a perf sample including the LBR stack and call stack.
- Extended `PerfReader` to auto-detect whether the input perf script output contains a CS profile, then do the parsing. Multiple `HybridSample`s are extracted.
- Sped this up by aggregating `HybridSample`s into `AggregatedSamples`.
- Added a VirtualUnwinder that consumes aggregated `HybridSample`s and implements unwinding of calls, returns, and linear paths that contain implicit calls/returns from inlining. Range and branch counters are aggregated by calling context. Here the calling context is a string; each context entry is a pair of function name and callsite location info, and the whole context looks like `main:1 @ foo:2 @ bar`.
- Added a ProfileGenerator that accumulates counters by range unfolding or branch target mapping, then generates the context-sensitive function profile, including the function body, inferred callee head samples, and callsite target samples, eventually recording them into the ProfileMap.
- Leveraged the LLVM built-in writer (`SampleProfWriter`) to support different serialization formats in one stop.
- Used `getCanonicalFnName` for callee names and names from the ELF section.
- Added regression tests for both unwinding and profile generation.

Test Plan:
ninja & ninja check-llvm

Reviewed By: hoy, wenlei, wmi

Differential Revision: https://reviews.llvm.org/D89723
2020-12-07 13:48:58 -08:00
Georgii Rymar 44794cde18 [llvm-profgen] - Fix compilation issue after ELFFile<ELFT> interface update.
`D92560` changed `ELFObjectFile::getELFFile` to return a reference.
2020-12-04 16:09:25 +03:00
wlei 21c91454a8 [llvm-profgen][NFC] Fix build failure on different platforms
See title.
Test Plan:
ninja & ninja check-llvm

Reviewed By: hoy

Differential Revision: https://reviews.llvm.org/D91897
2020-11-20 16:36:04 -08:00
wlei 0196b45cea [CSSPGO][llvm-profgen] Instruction symbolization
This stack of changes introduces the `llvm-profgen` utility, which generates a profile data file from given perf script data files for sample-based PGO. It's part of (though not only) the CSSPGO work. Specifically, to support context-sensitive profiles with or without pseudo probes, it implements a series of functionalities including perf trace parsing, instruction symbolization, LBR stack/call frame stack unwinding, pseudo probe decoding, etc. High throughput is achieved by multiple levels of sample aggregation, and a compatible format is generated at the end in one stop. Please refer to https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s for the CSSPGO RFC.

This change adds support for instruction symbolization. Given the RVA of an instruction pointer, a full calling context can be printed side-by-side with the disassembly code.
E.g.
```
 Disassembly of section .text [0x0, 0x4a]:

 <funcA>:
     0:	mov	eax, edi                           funcA:0
     2:	mov	ecx, dword ptr [rip]               funcLeaf:2 @ funcA:1
     8:	lea	edx, [rcx + 3]                     fib:2 @ funcLeaf:2 @ funcA:1
     b:	cmp	ecx, 3                             fib:2 @ funcLeaf:2 @ funcA:1
     e:	cmovl	edx, ecx                           fib:2 @ funcLeaf:2 @ funcA:1
    11:	sub	eax, edx                           funcLeaf:2 @ funcA:1
    13:	ret                                        funcA:2
    14:	nop	word ptr cs:[rax + rax]
    1e:	nop

 <funcLeaf>:
    20:	mov	eax, edi                           funcLeaf:1
    22:	mov	ecx, dword ptr [rip]               funcLeaf:2
    28:	lea	edx, [rcx + 3]                     fib:2 @ funcLeaf:2
    2b:	cmp	ecx, 3                             fib:2 @ funcLeaf:2
    2e:	cmovl	edx, ecx                           fib:2 @ funcLeaf:2
    31:	sub	eax, edx                           funcLeaf:2
    33:	ret                                        funcLeaf:3
    34:	nop	word ptr cs:[rax + rax]
    3e:	nop

 <fib>:
    40:	lea	eax, [rdi + 3]                     fib:2
    43:	cmp	edi, 3                             fib:2
    46:	cmovl	eax, edi                           fib:2
    49:	ret                                        fib:8
```

Test Plan:
ninja check-llvm

Reviewed By: wenlei, wmi

Differential Revision: https://reviews.llvm.org/D89715
2020-11-20 14:26:27 -08:00
wlei 32221694cb [CSSPGO][llvm-profgen] Disassemble text sections
This stack of changes introduces the `llvm-profgen` utility, which generates a profile data file from given perf script data files for sample-based PGO. It's part of (though not only) the CSSPGO work. Specifically, to support context-sensitive profiles with or without pseudo probes, it implements a series of functionalities including perf trace parsing, instruction symbolization, LBR stack/call frame stack unwinding, pseudo probe decoding, etc. High throughput is achieved by multiple levels of sample aggregation, and a compatible format is generated at the end in one stop. Please refer to https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s for the CSSPGO RFC.

This change enables disassembling the text sections to build various address maps that are potentially used by the virtual unwinder.  A switch `--show-disassembly` is being added to print the disassembly code.

Like the llvm-objdump tool, this change leverages existing LLVM components to parse and disassemble ELF binary files. So far X86 is supported.

Test Plan:

ninja check-llvm

Reviewed By: wmi, wenlei

Differential Revision: https://reviews.llvm.org/D89712
2020-11-20 14:26:26 -08:00