Commit Graph

250 Commits

Author SHA1 Message Date
Rong Xu 5fdaaf7fd8 [SampleFDO] Flow Sensitive Sample FDO (FSAFDO) profile loader
This patch implements Flow Sensitive Sample FDO (FSAFDO) profile
loader. We have two profile loaders for FS profile,
one before RegAlloc and one before BlockPlacement.

To enable it, when -fprofile-sample-use=<profile> is specified,
add "-enable-fs-discriminator=true \
     -disable-ra-fsprofile-loader=false \
     -disable-layout-fsprofile-loader=false"
to turn on the FS profile loaders.

Differential Revision: https://reviews.llvm.org/D107878
2021-08-18 18:37:35 -07:00
wlei f0d41b58da [CSSPGO] Tweak ICP threshold in top-down inliner
This change slightly relaxed the current ICP threshold in top-down inliner, specifically always allow one ICP for it. It shows some perf improvements on SPEC and our internal benchmarks. Also renamed the previous flag. We can also try to turn off PGO ICP in the future.

Reviewed By: wenlei, hoy, wmi

Differential Revision: https://reviews.llvm.org/D106588
2021-07-26 21:49:20 -07:00
Kazu Hirata 4a30a5c8d9 [SampleProfile] Remove ProfileIsValid (NFC)
The last use was removed on Jan 22, 2021 in commit
c9cd9a0066.
2021-07-20 08:07:04 -07:00
Wenlei He f9f3c34e0f [CSSPGO] Turn on iterative-BFI for CSSPGO
Iterative-BFI produces better count quality and performance when evaluated on internal benchmarks. Turning it on by default now for CSSPGO. We can consider turn it on by default for AutoFDO as well in the future.

Differential Revision: https://reviews.llvm.org/D106202
2021-07-16 17:35:49 -07:00
Kazu Hirata 49d66d9f9f [AFDO] Merge function attributes after inlining
This patch teaches the sample profile loader to merge function
attributes after inlining functions.

Without this patch, the compiler could inline a function requiring the
512-bit vector width into its caller without merging function
attributes, triggering a failure during instruction selection.

Differential Revision: https://reviews.llvm.org/D105729
2021-07-09 16:47:12 -07:00
Hongtao Yu bd52495518 [CSSPGO] Undoing the concept of dangling pseudo probe
As a follow-up to https://reviews.llvm.org/D104129, I'm cleaning up the danling probe related code in both the compiler and llvm-profgen.

I'm seeing a 5% size win for the pseudo_probe section for SPEC2017 and 10% for Ciner. Certain benchmark such as 602.gcc has a 20% size win. No obvious difference seen on build time for SPEC2017 and Cinder.

Reviewed By: wenlei

Differential Revision: https://reviews.llvm.org/D104477
2021-06-18 15:14:11 -07:00
Rong Xu 8d581857d7 [SampleFDO] New hierarchical discriminator for FS SampleFDO (llvm-profdata part)
This patch was split from https://reviews.llvm.org/D102246
[SampleFDO] New hierarchical discriminator for Flow Sensitive SampleFDO
This is for llvm-profdata part of change. It sets the bit masks for the
profile reader in llvm-profdata. Also add an internal option
"-fs-discriminator-pass" for show and merge command to process the profile
offline.

This patch also moved setDiscriminatorMaskedBitFrom() to
SampleProfileReader::create() to simplify the interface.

Differential Revision: https://reviews.llvm.org/D103550
2021-06-04 11:22:06 -07:00
Rong Xu 6745ffe4fa [SampleFDO] New hierarchical discriminator for FS SampleFDO (ProfileData part)
This patch was split from https://reviews.llvm.org/D102246
[SampleFDO] New hierarchical discriminator for Flow Sensitive SampleFDO
This is mainly for ProfileData part of change. It will load
FS Profile when such profile is detected. For an extbinary format profile,
create_llvm_prof tool will add a flag to profile summary section.
For other format profiles, the users need to use an internal option
(-profile-isfs) to tell the compiler that the profile uses FS discriminators.

This patch also simplified the bit API used by FS discriminators.

Differential Revision: https://reviews.llvm.org/D103041
2021-06-02 10:32:52 -07:00
Hongtao Yu 4ca6e37b98 [CSSPGO] Overwrite branch weight annotated in previous pass.
Sample profile loader can be run in both LTO prelink and postlink. Currently the counts annoation in postilnk doesn't fully overwrite what's done in prelink. I'm adding a switch (`-overwrite-existing-weights=1`) to enable a full overwrite, which includes:

1. Clear old metadata for calls when their parent block has a zero count. This could be caused by prelink code duplication.

2. Clear indirect call metadata if somehow all the rest targets have a sum of zero count.

3. Overwrite branch weight for basic blocks.

With a CS profile, I was seeing #1 and #2 help reduce code size by preventing post-sample ICP and CGSCC inliner working on obsolete metadata, which come from a partial global inlining in prelink.  It's not expected to work well for non-CS case with a less-accurate post-inline count quality.

It's worth calling out that some prelink optimizations can damage counts quality in an irreversible way. One example is the loop rotate optimization. Due to lack of exact loop entry count (profiling can only give loop iteration count and loop exit count), moving one iteration out of the loop body leaves the rest iteration count unknown. We had to turn off prelink loop rotate to achieve a better postlink counts quality. A even better postlink counts quality can be archived by turning off prelink CGSCC inlining which is not context-sensitive.

Reviewed By: wenlei, wmi

Differential Revision: https://reviews.llvm.org/D102537
2021-05-19 09:12:24 -07:00
Simon Pilgrim e30540a603 SampleProfileLoader::inlineHotFunctionsWithPriority - Fix uninitialized variable warning. NFCI.
findIndirectCallFunctionSamples will leave Sum uninitialized if it returns an empty vector, we don't really use Sum in this case (but we do make a copy that isn't used either) - so ensure we initialize the value to zero to at least silence the static analysis warning.
2021-05-15 15:02:52 +01:00
wlei e475d4d69f [CSSPGO] Fix return value of getProbeWeight
Currently we didn't support multiple return type, we work around to use error_code to represent:

1)  The dangling probe.
2)  Ignore the weight of non-probe instruction

While merging the instructions' weight for the whole BB, it will filter out the error code. But If all instructions of the BB give error_code, the outside logic will mark it as a BB requiring the inference algorithm to infer its weight. This is different from the zero value which will be treated as a cold block.

Fix one place that if we can't find the FunctionSamples in the profile data which indicates the BB is cold, we choose to return zero.

Also refine the comments.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D102007
2021-05-14 14:06:09 -07:00
Hongtao Yu 5f2d730073 [CSSPGO] Fix incorrect prorating indirect call distribution factor that leads to target count loss.
Pseudo probe distribution factor is used to scale down profile samples to avoid misleading the counts inference due to the usage of "maximum" in `getBlockWeight`. For callsites, the scaling down can come from code duplication prior to the sample profile loader (prelink or postlink), or due to the indirect call promotion in sample loader inliner. This patch fixes an issue in sample loader ICP where the leftover indirect callsite scaling down causes the loss of non-promoted call target samples unexpectedly. While the scaling down is to favor BFI/BPI with accurate an callsite count, it doesn't fit in the current distribution factor that represents code duplication changes. Ideally,  we would need two factors, one is for code duplication, the other is for ICP. However this seems over complicated. I'm going to trade one usage (callsite counts) for the other (call target counts).

Seeing perf win on one benchmark (mcf) of SPEC2017 with others unchanged.

Reviewed By: wenlei

Differential Revision: https://reviews.llvm.org/D100993
2021-04-23 11:09:22 -07:00
wlei 6d5132b426 [CSSPGO] Fix incorrect probe distribution factor computation in top-down inliner
We see a regression related to low probe factor(0.01) which prevents some callsites being promoted in ICPPass and later cause the missing inline in CGSCC inliner. The root cause is due to redundant(the second) multiplication of the probe factor and this change try to fix it.

`Sum` does multiply a factor right after findCallSamples but later when using as the parameter in setProbeDistributionFactor, it multiplies one again.

This change could get ~2% perf back on mcf benchmark. In mcf, previously the corresponding factor is 1 and it's the recent feature introducing the <1 factor then trigger this bug.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D99787
2021-04-07 08:48:59 -07:00
spupyrev 22998738e8 [SamplePGO] Keeping prof metadata for IndirectBrInst
Currently prof metadata with branch counts is added only for BranchInst and SwitchInst, but not for IndirectBrInst. As a result, BPI/BFI make incorrect inferences for indirect branches, which can be very hot.
This diff adds metadata for IndirectBrInst, in addition to BranchInst and SwitchInst.

Reviewed By: wmi, wenlei

Differential Revision: https://reviews.llvm.org/D99550
2021-03-30 10:44:48 -07:00
Hongtao Yu 3e3fc431df [CSSPGO] Top-down processing order based on full profile.
Use profiled call edges to augment the top-down order. There are cases that the top-down order computed based on the static call graph doesn't reflect real execution order. For example:

1. Incomplete static call graph due to unknown indirect call targets. Adjusting the order by considering indirect call edges from the profile can enable the inlining of indirect call targets by allowing the caller processed before them.

2. Mutual call edges in an SCC. The static processing order computed for an SCC may not reflect the call contexts in the context-sensitive profile, thus may cause potential inlining to be overlooked. The function order in one SCC is being adjusted to a top-down order based on the profile to favor more inlining.

3. Transitive indirect call edges due to inlining. When a callee function is inlined into into a caller function in LTO prelink, every call edge originated from the callee will be transferred to the caller. If any of the transferred edges is indirect, the original profiled indirect edge, even if considered, would not enforce a top-down order from the caller to the potential indirect call target in LTO postlink since the inlined callee is gone from the static call graph.

4. #3 can happen even for direct call targets, due to functions defined in header files. Header functions, when included into source files, are defined multiple times but only one definition survives due to ODR. Therefore, the LTO prelink inlining done on those dropped definitions can be useless based on a local file scope. More importantly, the inlinee, once fully inlined to a to-be-dropped inliner, will have no profile to consume when its outlined version is compiled. This can lead to a profile-less prelink compilation for the outlined version of the inlinee function which may be called from external modules. while this isn't easy to fix, we rely on the postlink AutoFDO pipeline to optimize the inlinee. Since the survived copy of the inliner (defined in headers) can be inlined in its local scope in prelink, it may not exist in the merged IR in postlink, and we'll need the profiled call edges to enforce a top-down order for the rest of the functions.

Considering those cases, a profiled call graph completely independent of the static call graph is constructed based on profile data, where function objects are not even needed to handle case #3 and case 4.

I'm seeing an average 0.4% perf win out of SPEC2017. For certain benchmark such as Xalanbmk and GCC, the win is bigger, above 2%.

The change is an enhancement to https://reviews.llvm.org/D95988.

Reviewed By: wmi, wenlei

Differential Revision: https://reviews.llvm.org/D99351
2021-03-30 10:42:22 -07:00
Wenlei He 30b0232336 [CSSPGO][llvm-profgen] Context-sensitive global pre-inliner
This change sets up a framework in llvm-profgen to estimate inline decision and adjust context-sensitive profile based on that. We call it a global pre-inliner in llvm-profgen.

It will serve two purposes:
  1) Since context profile for not inlined context will be merged into base profile, if we estimate a context will not be inlined, we can merge the context profile in the output to save profile size.
  2) For thinLTO, when a context involving functions from different modules is not inined, we can't merge functions profiles across modules, leading to suboptimal post-inline count quality. By estimating some inline decisions, we would be able to adjust/merge context profiles beforehand as a mitigation.

Compiler inline heuristic uses inline cost which is not available in llvm-profgen. But since inline cost is closely related to size, we could get an estimate through function size from debug info. Because the size we have in llvm-profgen is the final size, it could also be more accurate than the inline cost estimation in the compiler.

This change only has the framework, with a few TODOs left for follow up patches for a complete implementation:
  1) We need to retrieve size for funciton//inlinee from debug info for inlining estimation. Currently we use number of samples in a profile as place holder for size estimation.
  2) Currently the thresholds are using the values used by sample loader inliner. But they need to be tuned since the size here is fully optimized machine code size, instead of inline cost based on not yet fully optimized IR.

Differential Revision: https://reviews.llvm.org/D99146
2021-03-29 09:46:14 -07:00
Wenlei He 5f59f407f5 [CSSPGO] Minor tweak for inline candidate priority tie breaker
When prioritize call site to consider for inlining in sample loader, use number of samples as a first tier breaker before using name/guid comparison. This would favor smaller functions when hotness is the same (from the same block). We could try to retrieve accurate function size if this turns out to be more important.

Differential Revision: https://reviews.llvm.org/D99370
2021-03-25 21:15:36 -07:00
Wei Mi 14756b70ee [SampleFDO] Don't mix up the existing indirect call value profile with the new
value profile annotated after inlining.

In https://reviews.llvm.org/D96806 and https://reviews.llvm.org/D97350, we
use the magic number -1 in the value profile to avoid repeated indirect call
promotion to the same target for an indirect call. Function updateIDTMetaData
is used to mark an target as being promoted in the value profile with the
magic number. updateIDTMetaData is also used to update the value profile
when an indirect call is inlined and new inline instance profile should be
applied. For the second case, currently updateIDTMetaData mixes up the
existing value profile of the indirect call with the new profile, leading
to the problematic senario that a target count is larger than the total count
in the value profile.

The patch fixes the problem. When updateIDTMetaData is used to update the
value profile after inlining, all the values in the existing value profile
will be dropped except the values with the magic number counts.

Differential Revision: https://reviews.llvm.org/D98835
2021-03-18 09:54:34 -07:00
Wenlei He a5d30421a6 [CSSPGO] Load context profile for external functions in PreLink and populate ThinLTO import list
For ThinLTO's prelink compilation, we need to put external inline candidates into an import list attached to function's entry count metadata. This enables ThinLink to treat such cross module callee as hot in summary index, and later helps postlink to import them for profile guided cross module inlining.

For AutoFDO, the import list is retrieved by traversing the nested inlinee functions. For CSSPGO, since profile is flatterned, a few things need to happen for it to work:

 - When loading input profile in extended binary format, we need to load all child context profile whose parent is in current module, so context trie for current module includes potential cross module inlinee.
 - In order to make the above happen, we need to know whether input profile is CSSPGO profile before start reading function profile, hence a flag for profile summary section is added.
 - When searching for cross module inline candidate, we need to walk through the context trie instead of nested inlinee profile (callsite sample of AutoFDO profile).
 - Now that we have more accurate counts with CSSPGO, we swtiched to use entry count instead of total count to decided if an external callee is potentially beneficial to inline. This make it consistent with how we determine whether call tagert is potential inline candidate.

Differential Revision: https://reviews.llvm.org/D98590
2021-03-15 12:22:15 -07:00
Wenlei He 051f2c144e [SamplePGO] Skip inlinee profile scaling for sample loader inlining
For CGSCC inline, we need to scale down a function's branch weights and entry counts when thee it's inlined at a callsite. This is done through updateCallProfile. Additionally, we also scale the weigths for the inlined clone based on call site count in updateCallerBFI. Neither is needed for inlining during sample profile loader as it's using context profile that is separated from inlinee's own profile. This change skip the inlinee profile scaling for sample loader inlining.

Differential Revision: https://reviews.llvm.org/D98187
2021-03-11 10:18:26 -08:00
Wei Mi ee35784a90 [SampleFDO] Support enabling -funique-internal-linkage-name.
now -funique-internal-linkage-name flag is available, and we want to flip
it on by default since it is beneficial to have separate sample profiles
for different internal symbols with the same name. As a preparation, we
want to avoid regression caused by the flip.

When we flip -funique-internal-linkage-name on, the profile is collected
from binary built without -funique-internal-linkage-name so it has no uniq
suffix, but the IR in the optimized build contains the suffix. This kind of
mismatch may introduce transient regression.

To avoid such mismatch, we introduce a NameTable section flag indicating
whether there is any name in the profile containing uniq suffix. Compiler
will decide whether to keep uniq suffix during name canonicalization
depending on the NameTable section flag. The flag is only available for
extbinary format. For other formats, by default compiler will keep uniq
suffix so they will only experience transient regression when
-funique-internal-linkage-name is just flipped.

Another type of regression is caused by places where we miss to call
getCanonicalFnName. Those places are fixed.

Differential Revision: https://reviews.llvm.org/D96932
2021-03-09 21:41:40 -08:00
Hongtao Yu 4c3d759d00 [CSSPGO] Always use callsite samples as callsite probe counts.
For CS profile, the callsite count of previously inlined callees is populated with the entry count of the callees. Therefore when trying to get a weight for calliste probe after inlinining, the callsite count should always be used. The same fix has already been made for non-probe case.

Reviewed By: wenlei

Differential Revision: https://reviews.llvm.org/D98094
2021-03-08 22:52:36 -08:00
Wei Mi 2357d29335 [SampleFDO] Another fix to prevent repeated indirect call promotion in
sample loader pass.

In https://reviews.llvm.org/rG5fb65c02ca5e91e7e1a00e0efdb8edc899f3e4b9,
to prevent repeated indirect call promotion for the same indirect call
and the same target, we used zero-count value profile to indicate an
indirect call has been promoted for a certain target. We removed
PromotedInsns cache in the same patch. However, there was a problem in
that patch described below, and that problem led me to add PromotedInsns
back as a mitigation in
https://reviews.llvm.org/rG4ffad1fb489f691825d6c7d78e1626de142f26cf.

When we get value profile from metadata by calling getValueProfDataFromInst,
we need to specify the maximum possible number of values we expect to read.
We uses MaxNumPromotions in the last patch so the maximum number of value
information extracted from metadata is MaxNumPromotions. If we have many
values including zero-count values when we write the metadata, some of them
will be dropped when we read them because we only read MaxNumPromotions
values. It will allow repeated indirect call promotion again. We need to
make sure if there are values indicating promoted targets, those values need
to be saved in metadata with higher priority than other values.

The patch fixed that problem. We change to use -1 to represent the count
of a promoted target instead of 0 so it is easier to sort the values.
When we prepare to update the metadata in updateIDTMetaData, we will sort
the values in the descending count order and extract only MaxNumPromotions
values to write into metadata. Since -1 is the max uint64_t number, if we
have equal to or less than MaxNumPromotions of -1 count values, they will
all be kept in metadata. If we have more than MaxNumPromotions of -1 count
values, we will only save MaxNumPromotions such values maximally. In such
case, we have logic in place in doesHistoryAllowICP to guarantee no more
promotion in sample loader pass will happen for the indirect call, because
it has been promoted enough.

With this change, now we can remove PromotedInsns without problem.

Differential Revision: https://reviews.llvm.org/D97350
2021-03-04 18:44:12 -08:00
Hongtao Yu 8985515822 [CSSPGO] Unblocking optimizations by dangling pseudo probes.
This change fixes a couple places where the pseudo probe intrinsic blocks optimizations because they are not naturally removable. To unblock those optimizations, the blocking pseudo probes are moved out of the original blocks and tagged dangling, instead of allowing pseudo probes to be literally removed. The reason is that when the original block is removed, we won't be able to sample it. Instead of assigning it a zero weight, moving all its pseudo probes into another block and marking them dangling should allow the counts inference a chance to assign them a more reasonable weight. We have not seen counts quality degradation from our experiments.

The optimizations being unblocked are:

	1. Removing conditional probes for if-converted branches. Conditional probes are tagged dangling when their homing branch arms are folded so that they will not be over-counted.
	2. Unblocking jump threading from removing empty blocks. Pseudo probe prevents jump threading from removing logically empty blocks that only has one unconditional jump instructions.
	3. Unblocking SimplifyCFG and MIR tail duplicate to thread empty blocks and blocks with redundant branch checks.

Since dangling probes are logically deleted, they should not consume any samples in LTO postLink. This can be achieved by setting their distribution factors to zero when dangled.

Reviewed By: wmi

Differential Revision: https://reviews.llvm.org/D97481
2021-03-03 22:44:42 -08:00
Rong Xu 6103b6ad69 [SampleFDO][NFC] Refactor: make SampleProfileLoaderBaseImpl a template class
This patch makes SampleProfileLoaderBaseImpl a template class so it
can be used in CodeGen transformation.

Noticeable changes:
 * use one template parameter and use IRTraits to get other used
   types an type specific functions.
 * remove the temporary "inline" keywords in previous refactor
   patch.
 * change the template function findEquivalencesFor to a regular
   function. This function has a single caller with type of
   PostDominatorTree. It's simpler to use the type directly
   because MachinePostDominatorTree is not a derived type of
   template DominatorTreeBase.

Differential Revision: https://reviews.llvm.org/D96981
2021-02-25 08:26:17 -08:00
Wei Mi 4ffad1fb48 [SampleFDO] Add PromotedInsns to prevent repeated ICP.
In https://reviews.llvm.org/rG5fb65c02ca5e91e7e1a00e0efdb8edc899f3e4b9,
We use 0 count value profile to memorize which target has been promoted
and prevent repeated ICP for the same target, so we delete PromotedInsns.
However, I found the implementation in the patch has some shortcomings
to be fixed otherwise there will still be repeated ICP. So I add
PromotedInsns back temorarily. Will remove it after I get a thorough fix.
2021-02-19 10:01:49 -08:00
Wei Mi 5fb65c02ca [SampleFDO] Stop repeated indirect call promotion for the same target.
Found a problem in indirect call promotion in sample loader pass. Currently
if an indirect call is promoted for a target, and if the parent function is
inlined into some other function, the indirect call can be promoted for the
same target again. That is redundent which can harm performance and can cause
excessive compile time in some extreme case.

The patch fixes the issue. If a target is promoted for an indirect call, the
patch will write ICP metadata with the target call count being set to 0.
In the later ICP in sample profile loader, if it sees a target has 0 count
for an indirect call, it knows the target has been promoted and won't do
indirect call promotion for the indirect call.

The fix brings 0.1~0.2% performance on our search benchmark.

Differential Revision: https://reviews.llvm.org/D96806
2021-02-18 17:01:32 -08:00
Hongtao Yu e87b1b1d4e [CSSPGO] Use callsite sample counts to annotate indirect call sites.
With CSSPGO all indirect call targets are counted torwards the original indirect call site in the profile, including both inlined and non-inlined targets. Therefore no need to look for callee entry counts. This also fixes the issue where callee entry count doesn't match callsite count due to the nature of CS sampling.

I'm also cleaning up the orginal code that called `findIndirectCallFunctionSamples` just to compute the sum, the return value of which was disgarded.

Reviewed By: wmi, wenlei

Differential Revision: https://reviews.llvm.org/D96990
2021-02-18 14:52:34 -08:00
Rong Xu 7397905ab0 [SampleFDO] Third Try: Refactor SampleProfile.cpp
Apply the patch for the third time after fixing buildbot failures.

Refactor SampleProfile.cpp to use the core code in CodeGen.
The main changes are:
(1) Move SampleProfileLoaderBaseImpl class to a header file.
(2) Split SampleCoverageTracker to a head file and a cpp file.
(3) Move the common codes (common options and callsiteIsHot())
to the common cpp file.
(4) Add inline keyword to avoid duplicated symbols -- they will
be removed later when the class is changed to a template.

Differential Revision: https://reviews.llvm.org/D96455
2021-02-17 15:31:50 -08:00
Vedant Kumar c28622fbf3 Revert "[SampleFDO] Reapply: Refactor SampleProfile.cpp"
Revert "[SampleFDO] Add missing #includes to unbreak modules build after D96455"

This reverts commit c73cbf218a.

Revert "[SampleFDO] Fix MSVC "namespace uses itself" warning (NFC)"

This reverts commit a23e6b321c.

Revert "[SampleFDO] Reapply: Refactor SampleProfile.cpp"

This reverts commit 6fd5ccff72.

Still seeing link failures when building llc (or other tools), due to
the new SampleProfileLoaderBaseImpl.h containing definitions that get
duplicated across multiple TU's.

```
duplicate symbol 'llvm::SampleProfileLoaderBaseImpl::findEquivalenceClasses(llvm::Function&)' in:
    tools/llc/CMakeFiles/llc.dir/llc.cpp.o
    lib/libLLVMInstCombine.a(InstCombineVectorOps.cpp.o)
duplicate symbol 'llvm::SampleProfileLoaderBaseImpl::buildEdges(llvm::Function&)' in:
    tools/llc/CMakeFiles/llc.dir/llc.cpp.o
    lib/libLLVMInstCombine.a(InstCombineVectorOps.cpp.o)
duplicate symbol 'llvm::SampleProfileLoaderBaseImpl::computeDominanceAndLoopInfo(llvm::Function&)' in:
    tools/llc/CMakeFiles/llc.dir/llc.cpp.o
    lib/libLLVMInstCombine.a(InstCombineVectorOps.cpp.o)
duplicate symbol 'llvm::SampleProfileLoaderBaseImpl::getFunctionLoc(llvm::Function&)' in:
    tools/llc/CMakeFiles/llc.dir/llc.cpp.o
    lib/libLLVMInstCombine.a(InstCombineVectorOps.cpp.o)
duplicate symbol 'llvm::SampleProfileLoaderBaseImpl::getBlockWeight(llvm::BasicBlock const*)' in:
    tools/llc/CMakeFiles/llc.dir/llc.cpp.o
    lib/libLLVMInstCombine.a(InstCombineVectorOps.cpp.o)
duplicate symbol 'llvm::SampleProfileLoaderBaseImpl::printBlockWeight(llvm::raw_ostream&, llvm::BasicBlock const*) const' in:
    tools/llc/CMakeFiles/llc.dir/llc.cpp.o
    lib/libLLVMInstCombine.a(InstCombineVectorOps.cpp.o)
duplicate symbol 'llvm::SampleProfileLoaderBaseImpl::printBlockEquivalence(llvm::raw_ostream&, llvm::BasicBlock const*)' in:
    tools/llc/CMakeFiles/llc.dir/llc.cpp.o
    lib/libLLVMInstCombine.a(InstCombineVectorOps.cpp.o)
duplicate symbol 'llvm::SampleProfileLoaderBaseImpl::printEdgeWeight(llvm::raw_ostream&, std::__1::pair<llvm::BasicBlock const*, llvm::BasicBlock const*>)' in:
    tools/llc/CMakeFiles/llc.dir/llc.cpp.o
    lib/libLLVMInstCombine.a(InstCombineVectorOps.cpp.o)
```
2021-02-17 10:22:24 -08:00
Rong Xu 6fd5ccff72 [SampleFDO] Reapply: Refactor SampleProfile.cpp
Reapply patch after fixing buildbot failure.
Refactor SampleProfile.cpp to use the core code in CodeGen.
The main changes are:
(1) Move SampleProfileLoaderBaseImpl class to a header file.
(2) Split SampleCoverageTracker to a head file and a cpp file.
(3) Move the common codes (common options and callsiteIsHot())
to the common cpp file.

Differential Revision: https://reviews.llvm.org/D96455
2021-02-16 16:43:21 -08:00
Mehdi Amini c761fe77bd Revert "[SampleFDO][NFC] Refactor SampleProfile.cpp"
This reverts commit 310b35304c.
The build is broken with -DBUILD_SHARED_LIBS=ON :

lib/ProfileData/CMakeFiles/LLVMProfileData.dir/SampleProfileLoaderBaseUtil.cpp.o: In function `llvm::sampleprofutil::callsiteIsHot(llvm::sampleprof::FunctionSamples const*, llvm::ProfileSummaryInfo*, bool)':
SampleProfileLoaderBaseUtil.cpp:(.text._ZN4llvm14sampleprofutil13callsiteIsHotEPKNS_10sampleprof15FunctionSamplesEPNS_18ProfileSummaryInfoEb+0x1a): undefined reference to `llvm::ProfileSummaryInfo::isColdCount(unsigned long) const'
SampleProfileLoaderBaseUtil.cpp:(.text._ZN4llvm14sampleprofutil13callsiteIsHotEPKNS_10sampleprof15FunctionSamplesEPNS_18ProfileSummaryInfoEb+0x28): undefined reference to `llvm::ProfileSummaryInfo::isHotCount(unsigned long) const'
...
2021-02-16 22:11:42 +00:00
Rong Xu 310b35304c [SampleFDO][NFC] Refactor SampleProfile.cpp
Refactor SampleProfile.cpp to use the core code in CodeGen.
The main changes are:
(1) Move SampleProfileLoaderBaseImpl class to a header file.
(2) Split SampleCoverageTracker to a head file and a cpp file.
(3) Move the common codes (common options and callsiteIsHot())
to the common cpp file.

Differential Revision: https://reviews.llvm.org/D96455
2021-02-16 11:18:21 -08:00
Hongtao Yu de40f6d623 [CSSPGO] Process functions in a top-down order on a dynamic call graph.
Functions are currently processed by the sample profiler loader in a top-down order defined by the static call graph. The order is being adjusted to be a top-down order based on the input context-sensitive profile. One benefit is that the processing order of caller and callee in one SCC would follow the context order in the profile to favor more inlining. Another benefit is that the processing order of caller and callee through an indirect call (which is not on the static call graph) can be honored which in turn allows for more inlining.

The profile top-down order for SCC is also extended to support non-CS profiles.

Two switches `-mllvm -use-profile-indirect-call-edges` and `-mllvm -use-profile-top-down-order` are being introduced.

Reviewed By: wmi

Differential Revision: https://reviews.llvm.org/D95988
2021-02-11 12:36:59 -08:00
Benjamin Kramer 8fb4a4f7bb [SampleFDO] Silence -Wnon-virtual-dtor warning
There's no polymorphic deletion happening here.
2021-02-10 23:37:15 +01:00
Rong Xu db0d7d0ba9 [SampleFDO][NFC] Refactor SampleProfileLoader to reuse in CodeGen
Break SampleProfileLoader into to a base and a derived class.
Base class (SampleProfileLoaderBaseImpl) includes the common
code for IR and MachineIR (CodeGen) sample loader.
It will be templatelized in the later patch.

Inline and Probe related code will remain in the derived class of
SampleProfileLoader and stays in SampleProfile.cpp.

We need to refactor some functions:
(1) getInstWeight() to enable the code sharing -- put the core into
getInstWeightImpl().
(2) emitAnnotation() and propagateWeights() to carve out the code
specific to SampleProfileLoader.
(3) make getInstWeight() and findFunctionSamples() virtual and override
in SampleProfileLoader as they need to access the fields in the derived
class.

Differential Revision: https://reviews.llvm.org/D95832
2021-02-10 13:29:15 -08:00
Simon Pilgrim 476b912e7c SampleProfile.cpp - fix Wdocumentation warning. NFCI.
Remove duplicate param
2021-02-05 11:31:17 +00:00
Kazu Hirata be37475897 [Transforms/IPO] Use range-based for loops (NFC) 2021-02-03 20:41:20 -08:00
Rong Xu b8f13db5b7 [SampleFDO][NFC] Detach SampleProfileLoader from SampleCoverageTracker
This patch detaches SampleProfileLoader from class
SampleCoverageTracker. We plan to move SampleProfileLoader
to a template class. This would remain SampleCoverageTracker
as a class.
Also make callsiteIsHot() as a file static function.

Differential Revision: https://reviews.llvm.org/D95823
2021-02-03 11:38:04 -08:00
Hongtao Yu 3d89b3cbec [CSSPGO] Introducing distribution factor for pseudo probe.
Sample re-annotation is required in LTO time to achieve a reasonable post-inline profile quality. However, we have seen that such LTO-time re-annotation degrades profile quality. This is mainly caused by preLTO code duplication that is done by passes such as loop unrolling, jump threading, indirect call promotion etc, where samples corresponding to a source location are aggregated multiple times due to the duplicates. In this change we are introducing a concept of distribution factor for pseudo probes so that samples can be distributed for duplicated probes scaled by a factor. We hope that optimizations duplicating code well-maintain the branch frequency information (BFI) based on which probe distribution factors are calculated. Distribution factors are updated at the end of preLTO pipeline to reflect an estimated portion of the real execution count.

This change also introduces a pseudo probe verifier that can be run after each IR passes to detect duplicated pseudo probes.

A saturated distribution factor stands for 1.0. A pesudo probe will carry a factor with the value ranged from 0.0 to 1.0. A 64-bit integral distribution factor field that represents [0.0, 1.0] is associated to each block probe. Unfortunately this cannot be done for callsite probes due to the size limitation of a 32-bit Dwarf discriminator. A 7-bit distribution factor is used instead.

Changes are also needed to the sample profile inliner to deal with prorated callsite counts. Call sites duplicated by PreLTO passes, when later on inlined in LTO time, should have the callees’s probe prorated based on the Prelink-computed distribution factors. The distribution factors should also be taken into account when computing hotness for inline candidates. Also, Indirect call promotion results in multiple callisites. The original samples should be distributed across them. This is fixed by adjusting the callisites' distribution factors.

Reviewed By: wmi

Differential Revision: https://reviews.llvm.org/D93264
2021-02-02 11:55:01 -08:00
Wenlei He 1645f465be [CSSPGO] Factor out common part for CSSPGO inline and AFDO inline
Refactoring SampleProfileLoader::inlineHotFunctions to use helpers from CSSPGO inlining and reduce similar code in the inlining loop, plus minor cleanup for AFDO path.

This is resubmit of D95024, with build break and overtighten assertion fixed.

Test Plan:
2021-02-02 07:55:08 -08:00
Adrian Kuegel 48ca6da9d2 Revert "[CSSPGO] Factor out common part for CSSPGO inline and AFDO inline"
This reverts commit 9a03058d63.
2021-02-02 11:51:04 +01:00
Adrian Kuegel 3a65ec4bf9 Revert "Fix build break from D95024"
This reverts commit 09cd849fde.
2021-02-02 11:51:04 +01:00
Wenlei He 09cd849fde Fix build break from D95024 2021-02-02 01:01:12 -08:00
Wenlei He 9a03058d63 [CSSPGO] Factor out common part for CSSPGO inline and AFDO inline
Refactoring SampleProfileLoader::inlineHotFunctions to use helpers from CSSPGO inlining and reduce similar code in the inlining loop, plus minor cleanup for AFDO path.

Test Plan:

Differential Revision: https://reviews.llvm.org/D95024
2021-02-02 00:34:06 -08:00
Wenlei He 6bae5973c4 [CSSPGO] Call site prioritized inlining for sample PGO
This change implemented call site prioritized BFS profile guided inlining for sample profile loader. The new inlining strategy maximize the benefit of context-sensitive profile as mentioned in the follow up discussion of CSSPGO RFC. The change will not affect today's AutoFDO as it's opt-in. CSSPGO now defaults to the new FDO inliner, but can fall back to today's replay inliner using a switch (`-sample-profile-prioritized-inline=0`).

Motivation

With baseline AutoFDO, the inliner in sample profile loader only replays previous inlining, and the use of profile is only for pruning previous inlining that turned out to be cold. Due to the nature of replay, the FDO inliner is simple with hotness being the only decision factor. It has the following limitations that we're improving now for CSSPGO.
 - It doesn't take inline candidate size into account. Since it's doing replay, the size growth is bounded by previous CGSCC inlining. With context-sensitive profile, FDO inliner is no longer limited by previous inlining, so we need to take size into account to avoid significant size bloat.
 - The way it looks at hotness is not accurate. It uses total samples in an inlinee as proxy for hotness, while what really matters for an inline decision is the call site count. This is an unfortunate fall back because call site count and callee entry count are not reliable due to dwarf based correlation, especially for inlinees. Now paired with pseudo-probe, we have accurate call site count and callee's entry count, so we can use that to gauge hotness more accurately.
 - It treats all call sites from a block as hot as long as there's one call site considered hot. This is normally true, but since total samples is used as hotness proxy, this transitiveness within block magnifies the inacurate hotness heuristic. With pseduo-probe and the change above, this is no longer an issue for CSSPGO.

New FDO Inliner

Putting all the requirement for CSSPGO together, we need a top-down call site prioritized BFS inliner. Here're reasons why each component is needed.
 - Top-down: We need a top-down inliner to better leverage context-sensitive profile, so inlining is driven by accurate context profile, and post-inline is also accurate. This is already implemented in https://reviews.llvm.org/D70655.
 - Size Cap: For top-down inliner, taking function size into account for inline decision alone isn't sufficient to control size growth. We also need to explicitly cap size growth because with top-down inlining, we can grow inliner size significantly with large number of smaller inlinees even if each individually passes the cost/size check.
 - Prioritize call sites: With size cap, inlining order also becomes important, because if we stop inlining due to size budget limit, we'd want to use budget towards the most beneficial call sites.
 - BFS inline: Same as call site prioritization, if we stop inlining due to size budget limit, we want a balanced inline tree, rather than going deep on one call path.

Note that the new inliner avoids repeatedly evaluating same set of call site, so it should help with compile time too. For this reason, we could transition today's FDO inliner to use a queue with equal priority to avoid wasted reevaluation of same call site (TODO).

Speculative indirect call promotion and inlining is also supported now with CSSPGO just like baseline AutoFDO.

Tunings and knobs

I created tuning knobs for size growth/cap control, and for hot threshold separate from CGSCC inliner. The default values are selected based on initial tuning with CSSPGO.

Results

Evaluated with an internal LLVM fork couple months ago, plus another change to adjust hot-threshold cutoff for context profile (will send up after this one), the new inliner show ~1% geomean perf win on spec2006 with CSSPGO, while reducing code size too. The measurement was done using train-train setup, MonoLTO w/ new pass manager and pseudo-probe. Note that this is just a starting point - we hope that the new inliner will open up more opportunity with CSSPGO, but it will certainly take more time and effort to make it fully calibrated and ready for bigger workloads (we're working on it).

Differential Revision: https://reviews.llvm.org/D94001
2021-02-01 23:46:34 -08:00
modimo ce7f9cdb50 [InlineAdvisor] Allow replay of inline decisions for the CGSCC inliner from optimization remarks
This change leverages the work done in D83743 to replay in the SampleProfile inliner to also be used in the CGSCC inliner. NOTE: currently restricted to non-ML advisors only.

The added switch `-cgscc-inline-replay=<remarks file>` will replay the inlining decisions in that file where the remarks file is generated via `-Rpass=inline`. The aim here is to make it easier to analyze changes that would modify inlining heuristics to be separated from this behavior. Doing so allows easier examination of assembly and runtime behavior compared to the baseline rather than trying to dig through the large churn caused by inlining.

In LTO compilation, since inlining is done twice you can separately specify replay by passing the flag to the FE (`-cgscc-inline-replay=`) and to the linker (`-Wl,cgscc-inline-replay=`) with the remarks generated from their respective places.

Testing on mysqld by comparing the inline decisions between base (generates remarks.txt) and diff (replay using identical input/tools with remarks.txt) and examining the inlining sites with `diff` shows 14,000 mismatches out of 247,341 for a ~94% replay accuracy. I believe this gap can be narrowed further though for the general case we may never achieve full accuracy. For my personal use, this is close enough to be representative: I set the baseline as the one generated by the replay on identical input/toolset and compare that to my modified input/toolset using the same replay.

Testing:
ninja check-llvm
newly added test correctly replays CGSCC inlining decisions

Reviewed By: mtrofin, wenlei

Differential Revision: https://reviews.llvm.org/D94334
2021-01-25 15:38:57 -08:00
Wei Mi c9cd9a0066 [SampleFDO] Report error when reading a bad/incompatible profile instead of
turning off SampleFDO silently.

Currently sample loader pass turns off SampleFDO optimization silently when
it sees error in reading the profile. This behavior will defeat the tests
which could have caught those bad/incompatible profile problems. This patch
change the behavior to report error.

Differential Revision: https://reviews.llvm.org/D95269
2021-01-25 10:28:23 -08:00
Mircea Trofin ccec2cf1d9 Reland "[NPM][Inliner] Factor ImportedFunctionStats in the InlineAdvisor"
This reverts commit d97f776be5.

The original problem was due to build failures in shared lib builds. D95079
moved ImportedFunctionsInliningStatistics under Analysis, unblocking
this.
2021-01-20 13:33:43 -08:00
Mircea Trofin d97f776be5 Revert "[NPM][Inliner] Factor ImportedFunctionStats in the InlineAdvisor"
This reverts commit e8aec763a5.
2021-01-20 11:19:34 -08:00