llvm-project

Commit Graph

Author	SHA1	Message	Date
Kazu Hirata	b5188591a0	[llvm] Remove redundaunt virtual specifiers (NFC) Identified with modernize-use-override.	2022-07-24 21:50:35 -07:00
Mircea Trofin	7b81a81d5f	[NFC] FunctionSamples::getEntrySamples -> getHeadSamplesEstimate The name `getEntrySamples` was misleading for 2 reasons. One, it's close in name to `Function::getEntryCount`, but the equivalent here is `getHeadSamples`; second, as opposed to the other get* APIs in `FunctionSamples`, it performs an estimate/heuristic rather than just retrieving raw data (or a non-heuristic derivate off that data, like `getMaxCountInside`) The new name should more clearly communicate its intent; and, being close (in name) to `getHeadSamples`, it should allow the reader discover the relation between them. Also updated the doc comments for both `getHeadSamples[Estimate]` so a reader may better understand the relation between them. Differential Revision: https://reviews.llvm.org/D130281	2022-07-22 09:17:59 -07:00
Kazu Hirata	611ffcf4e4	[llvm] Use value instead of getValue (NFC)	2022-07-13 23:11:56 -07:00
wlei	7e86b13c63	[CSSPGO][llvm-profgen] Reimplement SampleContextTracker using context trie This is the followup patch to https://reviews.llvm.org/D125246 for the `SampleContextTracker` part. Before the promotion and merging of the context is based on the SampleContext(the array of frame), this causes a lot of cost to the memory. This patch detaches the tracker from using the array ref instead to use the context trie itself. This can save a lot of memory usage and benefit both the compiler's CS inliner and llvm-profgen's pre-inliner. One structure needs to be specially treated is the `FuncToCtxtProfiles`, this is used to get all the functionSamples for one function to do the merging and promoting. Before it search each functions' context and traverse the trie to get the node of the context. Now we don't have the context inside the profile, instead we directly use an auxiliary map `ProfileToNodeMap` for profile , it initialize to create the FunctionSamples to TrieNode relations and keep updating it during promoting and merging the node. Moreover, I was expecting the results before and after remain the same, but I found that the order of FuncToCtxtProfiles matter and affect the results. This can happen on recursive context case, but the difference should be small. Now we don't have the context, so I just used a vector for the order, the result is still deterministic. Measured on one huge size(12GB) profile from one of our internal service. The profile similarity difference is 99.999%, and the running time is improved by 3X(debug mode) and the memory is reduced from 170GB to 90GB. Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D127031	2022-06-27 23:22:21 -07:00
wlei	aa58b7b1e3	[CSSPGO][llvm-profgen] Reimplement computeSummaryAndThreshold using context trie Follow-up patch to https://reviews.llvm.org/D125246, support `computeSummaryAndThreshold` based on context trie. Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D127026	2022-06-27 23:22:21 -07:00
wlei	eba5749262	[CSSPGO][llvm-profgen] Reimplement CS profile generator using context trie Our investigation showed ProfileMap's key is the bottleneck of the memory consumption for CS profile generation on some large services. This patch tries to optimize it by storing the CS function samples using the context trie tree structure instead of the context frame array ref. Parts of code in `ContextTrieNode` are reused. Our experiment on one internal service showed that the context key's memory can be reduced from 80GB to 300MB. To be compatible with non-CS profiles, the profile writer still needs to use ProfileMap as input, so rebuild the ProfileMap using the context trie in `postProcessProfiles`. The optimization is not complete yet, next step is to reimplement Pre-inliner or profile trimmer, after that, ProfileMap should be small to be written. Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D125246	2022-06-27 23:22:21 -07:00
Kazu Hirata	a7938c74f1	[llvm] Don't use Optional::hasValue (NFC) This patch replaces Optional::hasValue with the implicit cast to bool in conditionals only.	2022-06-25 21:42:52 -07:00
Kazu Hirata	3b7c3a654c	Revert "Don't use Optional::hasValue (NFC)" This reverts commit `aa8feeefd3`.	2022-06-25 11:56:50 -07:00
Kazu Hirata	aa8feeefd3	Don't use Optional::hasValue (NFC)	2022-06-25 11:55:57 -07:00
Kazu Hirata	e0e687a615	[llvm] Don't use Optional::hasValue (NFC)	2022-06-20 10:38:12 -07:00
Corentin Jabot	b62e3a73e1	Replace to_hexString by touhexstr [NFC] LLVM had 2 methods to convert a number to an hexa string, this remove one of them. Differential Revision: https://reviews.llvm.org/D127958	2022-06-16 17:29:50 +02:00
Hongtao Yu	7d293744a8	[CSSPGO][Preinliner] Set default value of sample-profile-inline-limit-max to 3000 The default value of sample-profile-inline-limit-max is defined as 10000 in sampleprofile.cpp. This is too big for cspreinliner which works with assembly size instead of IR size. The value 3000 turns out to be a good tradeoff. Compared to the value 10000, 3000 gives as good performance and code size, but lower build time. Reviewed By: wenlei, wlei Differential Revision: https://reviews.llvm.org/D127330	2022-06-08 12:10:11 -07:00
Fangrui Song	d86a206f06	Remove unneeded cl::ZeroOrMore for cl::opt/cl::list options	2022-06-05 00:31:44 -07:00
Fangrui Song	36c7d79dc4	Remove unneeded cl::ZeroOrMore for cl::opt options Similar to `557efc9a8b`. This commit handles options where cl::ZeroOrMore is more than one line below cl::opt.	2022-06-04 00:10:42 -07:00
Fangrui Song	557efc9a8b	[llvm] Remove unneeded cl::ZeroOrMore for cl::opt options. NFC Some cl::ZeroOrMore were added to avoid the `may only occur zero or one times!` error. More were added due to cargo cult. Since the error has been removed, cl::ZeroOrMore is unneeded. Also remove cl::init(false) while touching the lines.	2022-06-03 21:59:05 -07:00
Hongtao Yu	acfd0a3456	[llvm-profgen] Update callsite body samples by summing up all call target samples. Current profile generation caculcates callsite body samples and call target samples separately. The former is done based on LBR range samples while the latter is done based on branch samples. Note that there's a subtle difference. LBR ranges is formed from two consecutive branch samples. Therefore the last entry in a LBR record will not be counted towards body samples while there's still a chance for it to be counted towards call targets if it is a function call. I'm making sense of the call body samples by updating it to the aggregation of call targets. Reviewed By: wenlei Differential Revision: https://reviews.llvm.org/D122609	2022-05-16 09:13:37 -07:00
Hongtao Yu	9f732af583	[llvm-profgen] Filter out oversized LBR ranges. As a follow up to {D123271}, LBR ranges that are too big should also be considered as invalid. For example, the last two pairs in the following trace form a range [0x0d7b02b0, 0x368ba706] that covers a ton of functions in the binary. Such oversized range should also be ignored. 0x0c74505f/0x368b99a0 0x368ba706/0x0c745040 0x0d7b1c3f/0x0d7b02b0 Add a defensive check to filter out those ranges based that the valid range should not cross the unconditional branch(Call, return, unconditional jmp). Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D125448	2022-05-12 10:58:50 -07:00
Hongtao Yu	a4190037fa	[CSSPGO][Preinliner] Use linear threshold to drive inline decision. The per-callsite size threshold used today to drive preinline decision is based on hotness/coldness cutoff. The default setup is for callsites with a sample count above the hotness cutoff (99%), a 1500 size threshold is used. Any callsite below 99.99% coldness cutoff uses a zero threshold. This has a couple issues: 1. While both cutoffs and size thoresholds are configurable, different applications may need different setups, making a universal setup impractical. 2. The callsites between hotness cutoff and coldness cutoff are not considered as inline candidates, which could be a missing opportunity. 3. Hot callsites always use the same threshold. In reality we may want a bigger threshold for hotter callsites. In this change we are introducing a linear threshold regardless of hot/cold cutoffs. Given a sample space, a threshold is computed for a callsite based on the position of that callsite sample in the whole space. With that we no longer need to define what's hot or cold. Callsites with different hotness will get a different threshold. This should overcome the above three issues. I have seen good results with a universal default setup for two of our internal services. For one service, 0.2% to 0.5% perf improvement over a baseline with a previous default setup, on-par code size. For the second service, 0.5% to 0.8% perf improvement over a baseline with a previous default setup, 0.2% code size increase; on-par performance and code size with a baseline that is with a carefully tuned cutoff to cover enough hot functions. Reviewed By: wenlei Differential Revision: https://reviews.llvm.org/D125023	2022-05-08 22:07:58 -07:00
Hongtao Yu	e36786d15f	[CSSPGO] Rename ProfileIsCSNested and ProfileIsCSFlat To be more clear and definitive, I'm renaming `ProfileIsCSFlat` back to `ProfileIsCS` which stands for full context-sensitive flat profiles. `ProfileIsCSNested` is now renamed to `ProfileIsPreInlined` and is extended to be applicable for CS flat profiles too. More specifically, `ProfileIsPreInlined` is for any kind of profiles (flat or nested) that contain 'ShouldBeInlined' contexts. The flag is encoded in the profile summary section for extbinary profiles and is computed on-the-fly for text profiles. Reviewed By: wenlei Differential Revision: https://reviews.llvm.org/D122602	2022-04-29 17:03:52 -07:00
wlei	bfcb2c1119	[llvm-profgen] Decouple artificial branch from LBR parser and fix external address related issues This patch is fixing two issues for both CS and non-CS. 1) For external-call-internal, the head samples of the the internal function should be recorded. 2) avoid ignoring LBR after meeting the interrupt branch for CS profile LBR parser is shared between CS and non-CS, we found it's error-prone while dealing with artificial branch inside LBR parser. Since artificial branch is mainly used for CS profile unwinding, this patch tries to simplify LBR parser by decoupling artificial branch code from it, the concept of artificial branch is removed and split into two transitional branches(internal-to-external, external-to-internal). Then we leave all the processing of external branch to unwinder. Specifically for unwinder, remembering that we introduce external frame in https://reviews.llvm.org/D115550. We can just take external address as a regular address and reuse current unwind function(unwindCall, unwindReturn). For a normal case, the external frame will match an external LBR, and it will be filtered out by `unwindLinear` without losing any context. The data also shows that the interrupt or standalone LBR pattern(unpaired case) does exist, we choose to handle it by clearing the call stack and keeping unwinding. Here we leverage checking in `unwindLinear`, because a standalone LBR, no matter its type, since it doesn’t have other part to pair, it will eventually cause a wrong linear range, like [external, internal], [internal, external]. Then set the state to invalid there. Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D118177	2022-04-28 16:07:28 -07:00
Wenlei He	526af13eba	Fix llvm-profgen breakage	2022-04-18 11:02:29 -07:00
Wenlei He	17f6cba30d	[llvm-profgen] Add process filter for perf reader For profile generation, we need to filter raw perf samples for binary of interest. Sometimes binary name along isn't enough as we can have binary of the same name running in the system. This change adds a process id filter to allow users to further disambiguiate the input raw samples. Differential Revision: https://reviews.llvm.org/D123869	2022-04-18 09:50:16 -07:00
Hongtao Yu	8a0406dcc8	[llvm-profgen] Filter out invalid LBR ranges. The profiler can sometimes give us a LBR trace that implicates bogus code ranges. For example, 0xc5acb56/0xc66c6c0 0xc628195/0xf31fbb0 0xc611261/0xc628130 0xc5c1a21/0xc6111c0 0x1f7edfd3/0xc5c3a50 0xc5c154f/0x1f7edec0 0xe8eed07/0xc5c11e0 , note that the first two pairs are supposed to form a linear execution range, in this case, it is [0xf31fbb0, 0xc5acb56] , which doesn't make sense. Such bogus ranges should be ruled out to avoid generating a bad profile. I'm fixing this for both CS and non-CS cases. Reviewed By: wenlei Differential Revision: https://reviews.llvm.org/D123271	2022-04-07 21:42:01 -07:00
Hongtao Yu	4c2b57ae48	[llvm-profgen] Fixing a context attribure update issue due to a non-derministic processing order on different platforms. A context can be created by invoking the `getFunctionProfileForContext` function in two ways: - by using a probe and its calling context. - by removing the leaf frame from an existing contexts. The first way is used when generating a function profile for a given LBR range, and the input `WasLeafInlined` is computed depending on the actually probe in the LBR range. The second way is used when using the entry count of an inlinee function profile to update its inliner callsite count, so `WasLeafInlined` is unknown for the inliner frame. The two invocations can happen in different order on different platforms, since the lbr ranges are stored in an unordered_map, and we are making sure `ContextWasInlined` is always set correctly. This should fix the random test failure introduced by D121655 Reviewed By: wenlei Differential Revision: https://reviews.llvm.org/D122844	2022-03-31 17:33:58 -07:00
Hongtao Yu	937924eb49	[llvm-profgen] Read sample profiles for post-processing. Sometimes we would like to run post-processing repeatedly on the original sample profile for tuning. In order to avoid regenerating the original profile from scratch every time, this change adds the support of reading in the original profile (called symbolized profile) and running the post-processor on it. Reviewed By: wenlei Differential Revision: https://reviews.llvm.org/D121655	2022-03-30 13:51:16 -07:00
Hongtao Yu	3f97016857	[llvm-profgen] Decoding pseudo probe for profiled function only. Complete pseudo probes decoding can result in large memory usage. In practice only a small porting of the decoded probes are used in profile generation. I'm changing the full decoding mode to be decoding for profiled functions only, though we still do a full scan of the .pseudoprobe section due to a missing table-of-content but we don't have to build the in-memory data structure for functions not sampled. To build the in-memory data structure for profiled functions only, I'm rewriting the previous non-recursive probe decoding logic to be recursive. This is easy to read and maintain. I also have to change the previous representation of unsymbolized context from probe-based stack to address-based stack since the profiled functions are unknown yet by the time of virtual unwinding. The address-based stack will be converted to probe-based stack after virtual unwinding and on-demand probe decoding. I'm seeing 20GB memory is saved for one of our internal large service. Reviewed By: wenlei Differential Revision: https://reviews.llvm.org/D121643	2022-03-23 14:15:11 -07:00
serge-sans-paille	f1985a3f85	Cleanup includes: Transforms/IPO Preprocessor output diff: -238205 lines Discourse thread: https://discourse.llvm.org/t/include-what-you-use-include-cleanup Differential Revision: https://reviews.llvm.org/D122183	2022-03-22 10:06:28 +01:00
Hongtao Yu	bc380c0930	[llvm-profgen] Turn on CS nested profile generation by default for CSSPGO. CS nested profile has a benefit over the CS flat profile that is to speed up the build while achieve an on-par performance. I'm turning it on by default for CSSPGO. Reviewed By: wenlei Differential Revision: https://reviews.llvm.org/D121142	2022-03-08 09:05:27 -08:00
Hongtao Yu	23391febd8	[llvm-profgen] Generating probe-based non-CS profile. I'm bring up the support of pseudo-probe-based non-CS profile generation. The approach is quite similar to generating dwarf-based non-CS profile. The main difference is for a given linear instruction range, instead of each disassembled instruction, pseudo probes that are covered by the range are processed. The pseudo probe extraction code is shared with CS probe profile generation. I'm seeing 0.7% performance win for one of our internal large benchmark compared to using non-CS dwarf-based profile, and 0.5% win for another large benchmark when combined with profi. Reviewed By: wenlei Differential Revision: https://reviews.llvm.org/D120335	2022-03-01 18:49:08 -08:00
serge-sans-paille	fc97efa409	Cleanup includes: ProfileData Estimation of the impact on preprocessor output: before: 1067349756 after: 1065940348 Discourse thread: https://discourse.llvm.org/t/include-what-you-use-include-cleanup Differential Revision: https://reviews.llvm.org/D120434	2022-02-24 13:25:11 +01:00
serge-sans-paille	db29f4374d	Cleanup include: DebugInfo/Symbolize Estimation of the impact on preprocessor output after: 1067349756 before:1067487786 Discourse thread: https://discourse.llvm.org/t/include-what-you-use-include-cleanup Differential Revision: https://reviews.llvm.org/D120433	2022-02-24 13:25:11 +01:00
wlei	b3a778fb5e	[llvm-profgen] Support symbol loading for debug fission Support to load debug info from dwarf split file, like .dwo, .dwp files. Leverage the `getNonSkeletonUnitDIE(false)` API to achieve this. Add test cause to make sure all the ranges is well retrieved by the loader. Reviewed By: ayermolo, hoy, wenlei Differential Revision: https://reviews.llvm.org/D115973	2022-02-23 09:40:46 -08:00
Hongtao Yu	34e131b0f2	[llvm-profgen] On-demand track optimized-away inlinees for preinliner. Tracking optimized-away inlinees based on all probes in a binary is expansive in terms of memory usage I'm making the tracking on-demand based on profiled functions only. This saves about 10% memory overall for a medium-sized benchmark. Before: note: After parsePerfTraces note: Thu Jan 27 18:42:09 2022 note: VM: 8.68 GB RSS: 8.39 GB note: After computeSizeForProfiledFunctions note: Thu Jan 27 18:42:41 2022 note: VM: 10.63 GB RSS: 10.20 GB note: After generateProbeBasedProfile note: Thu Jan 27 18:45:49 2022 note: VM: 25.00 GB RSS: 24.95 GB note: After postProcessProfiles note: Thu Jan 27 18:49:29 2022 note: VM: 26.34 GB RSS: 26.27 GB After: note: After parsePerfTraces note: Fri Jan 28 12:04:49 2022 note: VM: 8.68 GB RSS: 7.65 GB note: After computeSizeForProfiledFunctions note: Fri Jan 28 12:05:26 2022 note: VM: 8.68 GB RSS: 8.42 GB note: After generateProbeBasedProfile note: Fri Jan 28 12:08:03 2022 note: VM: 22.93 GB RSS: 22.89 GB note: After postProcessProfiles note: Fri Jan 28 12:11:30 2022 note: VM: 24.27 GB RSS: 24.22 GB This should be a no-diff change in terms of profile quality. Reviewed By: wenlei Differential Revision: https://reviews.llvm.org/D118515	2022-02-08 08:33:23 -08:00
Simon Pilgrim	01d5254f3d	[llvm-profgen] Use cast<> instead of dyn_cast<> to avoid dereference of nullptr The pointer is dereferenced immediately, so assert the cast is correct instead of returning nullptr	2022-02-02 14:12:11 +00:00
Simon Pilgrim	c56a85fde0	[llvm-profgen] Use cast<> instead of dyn_cast<> to avoid dereference of nullptr The pointers are dereferenced immediately, so assert the cast is correct instead of returning nullptr	2022-02-02 14:12:10 +00:00
Hongtao Yu	67db31115d	[llvm-profgen] Clean up unnecessary memory reservations between phases. Cleaning up data structures that are not used after a certain point. This further brings down peak memory usage by 15% for a large benchmark. Before: note: Before parsePerfTraces note: VM: 40.73 GB RSS: 39.18 GB note: Before parseAndAggregateTrace note: VM: 40.73 GB RSS: 39.18 GB note: After parseAndAggregateTrace note: VM: 88.93 GB RSS: 87.97 GB note: Before generateUnsymbolizedProfile note: VM: 88.95 GB RSS: 87.99 GB note: After generateUnsymbolizedProfile note: VM: 93.50 GB RSS: 92.53 GB note: After computeSizeForProfiledFunctions note: VM: 101.13 GB RSS: 99.36 GB note: After generateProbeBasedProfile note: VM: 215.61 GB RSS: 210.88 GB note: After postProcessProfiles note: VM: 237.48 GB RSS: 212.50 GB After: note: Before parsePerfTraces note: VM: 40.73 GB RSS: 39.18 GB note: Before parseAndAggregateTrace note: VM: 40.73 GB RSS: 39.18 GB note: After parseAndAggregateTrace note: VM: 88.93 GB RSS: 87.96 GB note: Before generateUnsymbolizedProfile note: VM: 88.95 GB RSS: 87.97 GB note: After generateUnsymbolizedProfile note: VM: 93.50 GB RSS: 92.51 GB note: After computeSizeForProfiledFunctions note: VM: 93.50 GB RSS: 92.53 GB note: After generateProbeBasedProfile note: VM: 164.87 GB RSS: 163.55 GB note: After postProcessProfiles note: VM: 182.28 GB RSS: 179.43 GB Reviewed By: wenlei, wlei Differential Revision: https://reviews.llvm.org/D118677	2022-02-01 16:27:54 -08:00
Hongtao Yu	fec57e5b17	Revert "[llvm-profgen] Clean up unnecessary memory reservations between phases." This reverts commit `057e784b09`.	2022-02-01 14:44:48 -08:00
Hongtao Yu	057e784b09	[llvm-profgen] Clean up unnecessary memory reservations between phases. Cleaning up data structures that are not used after a certain point. This further brings down peak memory usage by 15% for a large benchmark. Before: note: Before parsePerfTraces note: VM: 40.73 GB RSS: 39.18 GB note: Before parseAndAggregateTrace note: VM: 40.73 GB RSS: 39.18 GB note: After parseAndAggregateTrace note: VM: 88.93 GB RSS: 87.97 GB note: Before generateUnsymbolizedProfile note: VM: 88.95 GB RSS: 87.99 GB note: After generateUnsymbolizedProfile note: VM: 93.50 GB RSS: 92.53 GB note: After computeSizeForProfiledFunctions note: VM: 101.13 GB RSS: 99.36 GB note: After generateProbeBasedProfile note: VM: 215.61 GB RSS: 210.88 GB note: After postProcessProfiles note: VM: 237.48 GB RSS: 212.50 GB After: note: Before parsePerfTraces note: VM: 40.73 GB RSS: 39.18 GB note: Before parseAndAggregateTrace note: VM: 40.73 GB RSS: 39.18 GB note: After parseAndAggregateTrace note: VM: 88.93 GB RSS: 87.96 GB note: Before generateUnsymbolizedProfile note: VM: 88.95 GB RSS: 87.97 GB note: After generateUnsymbolizedProfile note: VM: 93.50 GB RSS: 92.51 GB note: After computeSizeForProfiledFunctions note: VM: 93.50 GB RSS: 92.53 GB note: After generateProbeBasedProfile note: VM: 164.87 GB RSS: 163.55 GB note: After postProcessProfiles note: VM: 182.28 GB RSS: 179.43 GB Reviewed By: wenlei, wlei Differential Revision: https://reviews.llvm.org/D118677	2022-02-01 12:48:08 -08:00
wlei	6693c562f9	[llvm-profgen] Support to load debug info from a second binary For reducing binary size purpose, the binary's debug info and executable segment can be separated(like using objcopy --only-keep-debug). Here add support in llvm-profgen to use two binaries as input. The original one is executable binary and added for debug info only binary. Adding a flag `--debug-binary=file-path`, with this, the binary will load debug info from debug binary. Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D115948	2022-01-24 17:14:05 -08:00
Simon Pilgrim	f4aa2a42ed	[llvm-profgen] ProfiledBinary::load - use cast<> instead of dyn_cast<> to avoid dereference of nullptr The pointer is always dereferenced immediately, so assert the cast is correct instead of returning nullptr	2022-01-14 15:51:21 +00:00
Simon Pilgrim	92ba979c28	[llvm-profgen] Pass iteration value by reference in for-range loops to avoid unnecessary copies	2022-01-14 14:49:57 +00:00
Simon Pilgrim	86bbf01d89	[llvm-profgen] CSProfileGenerator::generateLineNumBasedProfile - use cast<> instead of dyn_cast<> to avoid dereference of nullptr The pointer is always dereferenced immediately below, so assert the cast is correct instead of returning nullptr	2022-01-14 14:49:57 +00:00
Wenlei He	9a2120a6e1	[llvm-profgen] Error out for unsupported AutoFDO profile generate with probe Error out instead of siliently generate empty profile when trying to generate AutoFDO profile with probe binary. Differential Revision: https://reviews.llvm.org/D116508	2022-01-02 16:38:56 -08:00
wlei	b239b2b0db	[llvm-profgen] Fix warning of enumerated and non-enumerated type in conditional expression Differential Revision: https://reviews.llvm.org/D115842	2021-12-16 19:28:55 -08:00
Wenlei He	f6f0409f6f	[llvm-profgen] Turn on preinliner by default preinliner has been tuned on large server workloads and it's not ready to be turned on by default. this change also updates the thresholds based on tuning. Differential Revision: https://reviews.llvm.org/D115770	2021-12-14 17:46:57 -08:00
wlei	0f53df864e	[CSSPGO][llvm-profgen] Fix external address issues of perf reader (return to external addr part) Before we have an issue with artificial LBR whose source is a return, recalling that "an internal code(A) can return to external address, then from the external address call a new internal code(B), making an artificial branch that looks like a return from A to B can confuse the unwinder". We just ignore the LBRs after this artificial LBR which can miss some samples. This change aims at fixing this by correctly unwinding them instead of ignoring them. List some typical scenarios covered by this change. 1) multiple sequential call back happen in external address, e.g. ``` [ext, call, foo] [foo, return, ext] [ext, call, bar] ``` Unwinder should avoid having foo return from bar. Wrong call stack is like [foo, bar] 2) the call stack before and after external call should be correctly unwinded. ``` {call stack1} {call stack2} [foo, call, ext] [ext, call, bar] [bar, return, ext] [ext, return, foo ] ``` call stack 1 should be the same to call stack2. Both shouldn't be truncated 3) call stack should be truncated after call into external code since we can't do inlining with external code. ``` [foo, call, ext] [ext, call, bar] [bar, call, baz] [baz, return, bar ] [bar, return, ext] ``` the call stack of code in baz should not include foo. ### Implementation: We leverage artificial frame to fix #2 and #3: when we got a return artificial LBR, push an extra artificial frame to the stack. when we pop frame, check if the parent is an artificial frame to pop(fix #2). Therefore, call/ return artificial LBR is just the same as regular LBR which can keep the call stack. While recording context on the trie, artificial frame is used as a tag indicating that we should truncate the call stack(fix #3). To differentiate #1 and #2, we leverage `getCallAddrFromFrameAddr`. Normally the target of the return should be the next inst of a call inst and `getCallAddrFromFrameAddr` will return the address of call inst. Otherwise, getCallAddrFromFrameAddr will return to 0 which is the case of #1. Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D115550	2021-12-14 16:40:54 -08:00
wlei	30c3aba998	[llvm-profgen] Fix to use getUntrackedCallsites outside the loop Unwinder is hoisted out in https://reviews.llvm.org/D115550, so fix the useage of getUntrackedCallsites. Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D115760	2021-12-14 16:40:53 -08:00
wlei	3dcb60db9a	[CSSPGO][llvm-profgen] Fix external address issues of perf reader (leading external LBR part) We can have the sampling just hit into the external addresses, in that case, both the top stack frame and the latest LBR target are external addresses. For example: ``` ffffffff 0x4006c8/0xffffffff/P/-/-/0 0x40069b/0x400670/M/-/-/0 ffffffff 40067e 0xffffffff/0xffffffff/P/-/-/0 0x4006c8/0xffffffff/P/-/-/0 0x40069b/0x400670/M/-/-/0 ``` Before we will ignore the entire samples. However, we found there exists some internal LBRs in the remaining part of sample, the range between them is still a valid range, we will lose some valid LBRs. Those LBRs will be unwinded based on a empty(context-less) call stack. This change tries to fix it, instead of ignoring the entire sample, we only ignore the leading external addresses. Note that the first outgoing LBR is useful since there is a valid range between it's source and next LBR's target. Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D115538	2021-12-14 16:40:53 -08:00
wlei	3220571793	[llvm-profgen] Skip disassembling for PLT section Skip disassembling .plt section, then .plt section code will be treated as external code. Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D115699	2021-12-14 16:40:53 -08:00
Hongtao Yu	5740bb801a	[CSSPGO] Use nested context-sensitive profile. CSSPGO currently employs a flat profile format for context-sensitive profiles. Such a flat profile allows for precisely manipulating contexts that is either inlined or not inlined. This is a benefit over the nested profile format used by non-CS AutoFDO. A downside of this is the longer build time due to parsing the indexing the full CS contexts. For a CS flat profile, though only the context profiles relevant to a module are loaded when that module is compiled, the cost to figure out what profiles are relevant is noticeably high when there're many contexts, since the sample reader will need to scan all context strings anyway. On the contrary, a nested function profile has its related inline subcontexts isolated from other unrelated contexts. Therefore when compiling a set of functions, unrelated contexts will never need to be scanned. In this change we are exploring using nested profile format for CSSPGO. This is expected to work based on an assumption that with a preinliner-computed profile all contexts are precomputed and expected to be inlined by the compiler. Contexts not expected to be inlined will be cut off and returned to corresponding base profiles (for top-level outlined functions). This naturally forms a nested profile where all nested contexts are expected to be inlined. The compiler will less likely optimize on derived contexts that are not precomputed. A CS-nested profile will look exactly the same with regular nested profile except that each nested profile can come with an attributes. With pseudo probes, a nested profile shown as below can also have a CFG checksum. ``` main:1968679:12 2: 24 3: 28 _Z5funcAi:18 3.1: 28 _Z5funcBi:30 3: _Z5funcAi:1467398 0: 10 1: 10 _Z8funcLeafi:11 3: 24 1: _Z8funcLeafi:1467299 0: 6 1: 6 3: 287884 4: 287864 _Z3fibi:315608 15: 23 !CFGChecksum: 138828622701 !Attributes: 2 !CFGChecksum: 281479271677951 !Attributes: 2 ``` Specific work included in this change: - A recursive profile converter to convert CS flat profile to nested profile. - Extend function checksum and attribute metadata to be stored in nested way for text profile and extbinary profile. - Unifiy sample loader inliner path for CS and preinlined nested profile. - Changes in the sample loader to support probe-based nested profile. I've seen promising results regarding build time. A nested profile can result in a 20% shorter build time than a CS flat profile while keep an on-par performance. This is with -duplicate-contexts-into-base=1. Test Plan: Reviewed By: wenlei Differential Revision: https://reviews.llvm.org/D115205	2021-12-14 14:40:25 -08:00

1 2 3 4

181 Commits