llvm-project

Commit Graph

Author	SHA1	Message	Date
minglotus-6	e2074de6a8	[ProfSampleLoader] When disable-sample-loader-inlining is true, merge profiles of inlined instances to outlining versions. When --disable-sample-loader-inlining is true, skip inline transformation, but merge profiles of inlined instances to outlining versions. Differential Revision: https://reviews.llvm.org/D121862	2022-03-23 13:02:48 -07:00
serge-sans-paille	f1985a3f85	Cleanup includes: Transforms/IPO Preprocessor output diff: -238205 lines Discourse thread: https://discourse.llvm.org/t/include-what-you-use-include-cleanup Differential Revision: https://reviews.llvm.org/D122183	2022-03-22 10:06:28 +01:00
Paul Kirth	964398ccb1	Revert "Revert "Revert "[misexpect] Re-implement MisExpect Diagnostics""" This reverts commit `6cf560d69a`.	2022-03-18 00:21:33 +00:00
Paul Kirth	6cf560d69a	Revert "Revert "[misexpect] Re-implement MisExpect Diagnostics"" I mistakenly reverted my commit, so I'm relanding it. This reverts commit `10866a1df4`.	2022-03-18 00:04:22 +00:00
Paul Kirth	10866a1df4	Revert "[misexpect] Re-implement MisExpect Diagnostics" This reverts commit `e7749d4713`.	2022-03-17 23:54:26 +00:00
Paul Kirth	e7749d4713	[misexpect] Re-implement MisExpect Diagnostics Reimplements MisExpect diagnostics from D66324 to reconstruct its original checking methodology only using MD_prof branch_weights metadata. New checks rely on 2 invariants: 1) For frontend instrumentation, MD_prof branch_weights will always be populated before llvm.expect intrinsics are lowered. 2) for IR and sample profiling, llvm.expect intrinsics will always be lowered before branch_weights are populated from the IR profiles. These invariants allow the checking to assume how the existing branch weights are populated depending on the profiling method used, and emit the correct diagnostics. If these invariants are ever invalidated, the MisExpect related checks would need to be updated, potentially by re-introducing MD_misexpect metadata, and ensuring it always will be transformed the same way as branch_weights in other optimization passes. Frontend based profiling is now enabled without using LLVM Args, by introducing a new CodeGen option, and checking if the -Wmisexpect flag has been passed on the command line. Differential Revision: https://reviews.llvm.org/D115907	2022-03-17 23:46:23 +00:00
Hongtao Yu	07846e3387	[CSSPGO][PriorityInliner] Do not use block weight to drive callsite inlining. The priority-based inliner currently uses block count combined with callee entry count to drive callsite inlining. This doesn't work well with LTO where postlink inlining is driven by prelink-annotated block count which could be based on the merge of all context profiles. I'm fixing it by using callee profile entry count only which should be context-sensitive. I'm seeing 0.2% perf improvement for one of our internal large benchmarks with probe-based non-CS profile. Reviewed By: wenlei Differential Revision: https://reviews.llvm.org/D120784	2022-03-01 18:43:19 -08:00
minglotus-6	142cedc283	[SampleProf][Inliner] Add an option to turn off inliner in sample-profile pass. Use case is offline evaluation (for inliner effectiveness) or debugging. Differential Revision: https://reviews.llvm.org/D120344	2022-02-23 14:21:33 -08:00
minglotus-6	f415d74d1d	[SampleProfile] Handle the case when the option `MaxNumPromotions` is zero. In places where `MaxNumPromotions` is used to allocate an array, bail out early to prevent allocating an array of length 0. Differential Revision: https://reviews.llvm.org/D120295	2022-02-22 21:44:32 -08:00
Hongtao Yu	62ef77ca63	[CSSPGO] Do not merge a context that is already duplicated into the base profile. Do not merge a context that is already duplicated into the base profile. Also fixing a typo caused by previous refactoring. Reviewed By: wenlei Differential Revision: https://reviews.llvm.org/D119735	2022-02-14 18:07:11 -08:00
Hongtao Yu	dee058c670	[CSSPGO] Turn on ext-tsp by default for CSSPGO. I'm seeing ext-tsp helps CSSPGO for our internal large benchmarks so I'm turning it on for CSSPGO. For non-CS AutoFDO, ext-tsp doesn't seem to help, probably because of lower profile count quality. Reviewed By: wenlei Differential Revision: https://reviews.llvm.org/D119048	2022-02-04 19:46:44 -08:00
Kazu Hirata	3710078ceb	[SampleProfile] Reduce indentation with an early return (NFC)	2022-02-03 12:22:23 -08:00
Kazu Hirata	fee57711fe	Use DenseMap::lookup (NFC)	2021-12-17 18:19:25 -08:00
Hongtao Yu	d7b7b64914	[CSSPGO] Warn instead of error out for modules that are not probed. Modules that are not compiled with pseudo probe enabled can still be compiled with a sample profile input, such as in LTO postlink where other modules are probed. Since the profile is unrelated to the current modules, we should warn instead of error out the compilation. Reviewed By: wenlei Differential Revision: https://reviews.llvm.org/D115642	2021-12-14 16:38:50 -08:00
Hongtao Yu	5740bb801a	[CSSPGO] Use nested context-sensitive profile. CSSPGO currently employs a flat profile format for context-sensitive profiles. Such a flat profile allows for precisely manipulating contexts that are either inlined or not inlined. This is a benefit over the nested profile format used by non-CS AutoFDO. A downside of this is the longer build time due to parsing and indexing the full CS contexts. For a CS flat profile, though only the context profiles relevant to a module are loaded when that module is compiled, the cost to figure out what profiles are relevant is noticeably high when there are many contexts, since the sample reader will need to scan all context strings anyway. On the contrary, a nested function profile has its related inline subcontexts isolated from other unrelated contexts. Therefore when compiling a set of functions, unrelated contexts will never need to be scanned. In this change we are exploring using nested profile format for CSSPGO. This is expected to work based on an assumption that with a preinliner-computed profile all contexts are precomputed and expected to be inlined by the compiler. Contexts not expected to be inlined will be cut off and returned to corresponding base profiles (for top-level outlined functions). This naturally forms a nested profile where all nested contexts are expected to be inlined. The compiler will less likely optimize on derived contexts that are not precomputed. A CS-nested profile will look exactly the same as a regular nested profile except that each nested profile can come with attributes. With pseudo probes, a nested profile shown as below can also have a CFG checksum. 
``` main:1968679:12 2: 24 3: 28 _Z5funcAi:18 3.1: 28 _Z5funcBi:30 3: _Z5funcAi:1467398 0: 10 1: 10 _Z8funcLeafi:11 3: 24 1: _Z8funcLeafi:1467299 0: 6 1: 6 3: 287884 4: 287864 _Z3fibi:315608 15: 23 !CFGChecksum: 138828622701 !Attributes: 2 !CFGChecksum: 281479271677951 !Attributes: 2 ``` Specific work included in this change: - A recursive profile converter to convert CS flat profile to nested profile. - Extend function checksum and attribute metadata to be stored in nested way for text profile and extbinary profile. - Unify sample loader inliner path for CS and preinlined nested profile. - Changes in the sample loader to support probe-based nested profile. I've seen promising results regarding build time. A nested profile can result in a 20% shorter build time than a CS flat profile while keeping on-par performance. This is with -duplicate-contexts-into-base=1. Test Plan: Reviewed By: wenlei Differential Revision: https://reviews.llvm.org/D115205	2021-12-14 14:40:25 -08:00
Hongtao Yu	4e24ca1cdc	[CSSPGO] Turn on Profi by default As titled. Reviewed By: wenlei, wlei Differential Revision: https://reviews.llvm.org/D115011	2021-12-02 17:56:38 -08:00
spupyrev	7cc2493daa	profi - a flow-based profile inference algorithm: Part I (out of 3) The benefits of sampling-based PGO crucially depends on the quality of profile data. This diff implements a flow-based algorithm, called profi, that helps to overcome the inaccuracies in a profile after it is collected. Profi is an extended and significantly re-engineered classic MCMF (min-cost max-flow) approach suggested by Levin, Newman, and Haber [2008, Complementing missing and inaccurate profiling using a minimum cost circulation algorithm]. It models profile inference as an optimization problem on a control-flow graph with the objectives and constraints capturing the desired properties of profile data. Three important challenges that are being solved by profi: - "fixing" errors in profiles caused by sampling; - converting basic block counts to edge frequencies (branch probabilities); - dealing with "dangling" blocks having no samples in the profile. The main implementation (and required docs) are in SampleProfileInference.cpp. The worst-time complexity is quadratic in the number of blocks in a function, O(\|V\|^2). However a careful engineering and extensive evaluation shows that the running time is (slightly) super-linear. In particular, instances with 1000 blocks are solved within 0.1 second. The algorithm has been extensively tested internally on prod workloads, significantly improving the quality of generated profile data and providing speedups in the range from 0% to 5%. For "smaller" benchmarks (SPEC06/17), it generally improves the performance (with a few outliers) but extra work in the compiler might be needed to re-tune existing optimization passes relying on profile counts. UPD Dec 1st 2021: - synced the declaration and definition of the option `SampleProfileUseProfi` to use type `cl::opt<bool>`; - added `inline` for `SampleProfileInference<BT>::findUnlikelyJumps` and `SampleProfileInference<BT>::isExit` to avoid linking problems on windows. 
Reviewed By: wenlei, hoy Differential Revision: https://reviews.llvm.org/D109860	2021-12-01 15:30:38 -08:00
Hongtao Yu	bf317f6698	[CSSPGO] Sorting nodes in a cycle of profiled call graph. For nodes that are in a cycle of a profiled call graph, the current order the underlying scc_iter computes purely depends on how those nodes are reached from outside the SCC and inside the SCC, based on the Tarjan algorithm. This does not honor profile edge hotness, thus does not guarantee hot callsites to be inlined prior to cold callsites. To mitigate that, I'm adding an extra sorter on top of scc_iter to sort scc functions in the order of callsite hotness, instead of changing the internal of scc_iter. Sorting on callsite hotness can be optimally based on detecting cycles on a directed call graph, i.e., to remove the coldest edge until a cycle is broken. However, detecting cycles isn't cheap. I'm using an MST-based approach which is faster and appears to deliver some performance wins. Reviewed By: wenlei Differential Revision: https://reviews.llvm.org/D114204	2021-11-30 09:01:08 -08:00
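The MST-based ordering described in bf317f6698 can be illustrated with a small standalone sketch. It is not the code from the commit: the `Edge` struct, the `sortSccByHotness` name, the integer node ids, and the caller-first output order are all assumptions made for illustration. The idea shown is to keep the hottest edges that do not close a cycle (Kruskal with union-find on descending weights) and then walk the surviving, now acyclic, edges in topological order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <queue>
#include <vector>

// Hypothetical edge type: a profiled call from Caller to Callee.
struct Edge {
  int Caller;
  int Callee;
  uint64_t Weight; // callsite hotness
};

// Union-find lookup with path halving.
static int findRoot(std::vector<int> &Parent, int X) {
  while (Parent[X] != X) {
    Parent[X] = Parent[Parent[X]];
    X = Parent[X];
  }
  return X;
}

// Returns the nodes of one SCC in an order derived from a maximum
// spanning forest of its call edges (illustrative sketch only).
std::vector<int> sortSccByHotness(int NumNodes, std::vector<Edge> Edges) {
  // Visit the hottest edges first; stable_sort keeps input order on ties.
  std::stable_sort(Edges.begin(), Edges.end(),
                   [](const Edge &A, const Edge &B) {
                     return A.Weight > B.Weight;
                   });

  std::vector<int> Parent(NumNodes);
  std::iota(Parent.begin(), Parent.end(), 0);
  std::vector<std::vector<int>> Kept(NumNodes);
  std::vector<int> InDegree(NumNodes, 0);

  for (const Edge &E : Edges) {
    assert(E.Caller >= 0 && E.Caller < NumNodes);
    assert(E.Callee >= 0 && E.Callee < NumNodes);
    int A = findRoot(Parent, E.Caller);
    int B = findRoot(Parent, E.Callee);
    if (A == B)
      continue; // a colder edge that would close a cycle is dropped
    Parent[A] = B;
    Kept[E.Caller].push_back(E.Callee);
    ++InDegree[E.Callee];
  }

  // The kept edges form a forest, so Kahn's algorithm visits every node.
  std::queue<int> Ready;
  for (int N = 0; N < NumNodes; ++N)
    if (InDegree[N] == 0)
      Ready.push(N);

  std::vector<int> Order;
  while (!Ready.empty()) {
    int N = Ready.front();
    Ready.pop();
    Order.push_back(N);
    for (int Succ : Kept[N])
      if (--InDegree[Succ] == 0)
        Ready.push(Succ);
  }
  return Order;
}
```

For a three-node cycle where the edge 2 -> 0 is the coldest, that edge is the one dropped, so the walk starts at node 0. Avoiding explicit cycle detection is the trade-off the commit message names: the spanning-forest pass only needs a sort and union-find lookups.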
Mehdi Amini	1392b654ff	Revert "profi - a flow-based profile inference algorithm: Part I (out of 3)" This reverts commit `884b6dd311`. The windows build is broken with a linker error.	2021-11-23 20:10:36 +00:00
spupyrev	884b6dd311	profi - a flow-based profile inference algorithm: Part I (out of 3) The benefits of sampling-based PGO crucially depends on the quality of profile data. This diff implements a flow-based algorithm, called profi, that helps to overcome the inaccuracies in a profile after it is collected. Profi is an extended and significantly re-engineered classic MCMF (min-cost max-flow) approach suggested by Levin, Newman, and Haber [2008, Complementing missing and inaccurate profiling using a minimum cost circulation algorithm]. It models profile inference as an optimization problem on a control-flow graph with the objectives and constraints capturing the desired properties of profile data. Three important challenges that are being solved by profi: - "fixing" errors in profiles caused by sampling; - converting basic block counts to edge frequencies (branch probabilities); - dealing with "dangling" blocks having no samples in the profile. The main implementation (and required docs) are in SampleProfileInference.cpp. The worst-time complexity is quadratic in the number of blocks in a function, O(\|V\|^2). However a careful engineering and extensive evaluation shows that the running time is (slightly) super-linear. In particular, instances with 1000 blocks are solved within 0.1 second. The algorithm has been extensively tested internally on prod workloads, significantly improving the quality of generated profile data and providing speedups in the range from 0% to 5%. For "smaller" benchmarks (SPEC06/17), it generally improves the performance (with a few outliers) but extra work in the compiler might be needed to re-tune existing optimization passes relying on profile counts. Reviewed By: wenlei, hoy Differential Revision: https://reviews.llvm.org/D109860	2021-11-23 11:02:40 -08:00
Philip Reames	065f777d27	Revert "profi - a flow-based profile inference algorithm: Part I (out of 3)" This reverts commit `b00fc19822`. This change fails to build (link) on ubuntu x86,	2021-11-23 09:18:28 -08:00
spupyrev	b00fc19822	profi - a flow-based profile inference algorithm: Part I (out of 3) The benefits of sampling-based PGO crucially depends on the quality of profile data. This diff implements a flow-based algorithm, called profi, that helps to overcome the inaccuracies in a profile after it is collected. Profi is an extended and significantly re-engineered classic MCMF (min-cost max-flow) approach suggested by Levin, Newman, and Haber [2008, Complementing missing and inaccurate profiling using a minimum cost circulation algorithm]. It models profile inference as an optimization problem on a control-flow graph with the objectives and constraints capturing the desired properties of profile data. Three important challenges that are being solved by profi: - "fixing" errors in profiles caused by sampling; - converting basic block counts to edge frequencies (branch probabilities); - dealing with "dangling" blocks having no samples in the profile. The main implementation (and required docs) are in SampleProfileInference.cpp. The worst-time complexity is quadratic in the number of blocks in a function, O(\|V\|^2). However a careful engineering and extensive evaluation shows that the running time is (slightly) super-linear. In particular, instances with 1000 blocks are solved within 0.1 second. The algorithm has been extensively tested internally on prod workloads, significantly improving the quality of generated profile data and providing speedups in the range from 0% to 5%. For "smaller" benchmarks (SPEC06/17), it generally improves the performance (with a few outliers) but extra work in the compiler might be needed to re-tune existing optimization passes relying on profile counts. Reviewed By: wenlei, hoy Differential Revision: https://reviews.llvm.org/D109860	2021-11-23 09:08:30 -08:00
modimo	5caad9b5d3	[InlineAdvisor] Add fallback/format switches and negative remark processing to Replay Inliner Adds the following switches: 1. --sample-profile-inline-replay-fallback/--cgscc-inline-replay-fallback: controls what the replay advisor does for inline sites that are not present in the replay. Options are: 1. Original: defers to original advisor 2. AlwaysInline: inline all sites not in replay 3. NeverInline: inline no sites not in replay 2. --sample-profile-inline-replay-format/--cgscc-inline-replay-format: controls what format should be generated to match against the replay remarks. Options are: 1. Line 2. LineColumn 3. LineDiscriminator 4. LineColumnDiscriminator Adds support for negative inlining decisions. These are denoted by "will not be inlined into" as compared to the positive "inlined into" in the remarks. All of these together with the previous `--sample-profile-inline-replay-scope/--cgscc-inline-replay-scope` allow tweaking in how to apply replay. In my testing, I'm using: 1. --sample-profile-inline-replay-scope/--cgscc-inline-replay-scope = Function to only replay on a function 2. --sample-profile-inline-replay-fallback/--cgscc-inline-replay-fallback = NeverInline since I'm feeding in only positive remarks to the replay system 3. --sample-profile-inline-replay-format/--cgscc-inline-replay-format = Line since I'm generating the remarks from DWARF information from GCC which can conflict quite heavily in column number compared to Clang An alternative configuration could be to do Function, AlwaysInline, Line fallback with negative remarks which closer matches the final call-sites. Note that this can lead to unbounded inlining if a negative remark doesn't match/exist for one reason or another. Updated various tests to cover the new switches and negative remarks Testing: ninja check-all Reviewed By: wenlei, mtrofin Differential Revision: https://reviews.llvm.org/D112040	2021-10-29 12:32:03 -07:00
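The interaction between recorded remarks and the fallback switch in 5caad9b5d3 can be summarized in a short sketch. The enum and function names below are hypothetical and are not LLVM's actual types; only the three fallback options and the positive/negative remark distinction come from the commit message.

```cpp
// Hypothetical names for illustration; these are not LLVM's real types.
enum class ReplayFallback { Original, AlwaysInline, NeverInline };

// What the replay file says about one call site.
enum class ReplayRemark {
  Missing,             // site not present in the replay
  InlinedInto,         // positive remark: "inlined into"
  WillNotBeInlinedInto // negative remark: "will not be inlined into"
};

// Decide whether to inline one call site during replay.
// OriginalAdvice stands in for whatever the original advisor would say.
bool replayDecision(ReplayRemark Remark, ReplayFallback Fallback,
                    bool OriginalAdvice) {
  // A recorded remark, positive or negative, always wins.
  switch (Remark) {
  case ReplayRemark::InlinedInto:
    return true;
  case ReplayRemark::WillNotBeInlinedInto:
    return false;
  case ReplayRemark::Missing:
    break;
  }
  // Otherwise the fallback switch decides.
  switch (Fallback) {
  case ReplayFallback::AlwaysInline:
    return true;
  case ReplayFallback::NeverInline:
    return false;
  case ReplayFallback::Original:
    break;
  }
  return OriginalAdvice;
}
```

Under this reading, feeding only positive remarks with `NeverInline` reproduces exactly the recorded inline tree, while `AlwaysInline` relies on negative remarks to stop inlining, which is why the commit warns about unbounded inlining when a negative remark fails to match.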
modimo	51ce567b38	[SampleProfile] Add all callsites to AllCandidates if InlineReplay is in effect Replay in sample profiling needs to be asked on candidates that may not have counts or below the threshold. If replay is in effect for a function make sure these are captured and also imported during thinLTO. Testing: ninja check-all Reviewed By: wenlei Differential Revision: https://reviews.llvm.org/D112033	2021-10-29 12:04:52 -07:00
modimo	313c657fce	[InlineAdvisor] Add -inline-replay-scope=<Function|Module> to control replay scope The goal is to allow grafting an inline tree from Clang or GCC into a new compilation without affecting other functions. For GCC, we're doing this by extracting the inline tree from dwarf information and generating the equivalent remarks. This allows easier side-by-side asm analysis and a trial way to see if a particular inlining setup provides benefits by itself. Testing: ninja check-all Reviewed By: wenlei, mtrofin Differential Revision: https://reviews.llvm.org/D110658	2021-10-18 13:08:39 -07:00
Mircea Trofin	7d541eb4d4	[inliner] Mandatory inlining decisions produce remarks This also removes the need to disable the mandatory inlining phase in tests. In a departure from the previous remark, we don't output a 'cost' in this case, because there's no such thing. We just report that inlining happened because of the attribute. Differential Revision: https://reviews.llvm.org/D110891	2021-10-05 14:01:25 -07:00
Wenlei He	5f187f0afa	[SamplePGO] Add switch to honor zero count on block level as accurate Add a new LLVM switch `-profile-sample-block-accurate` to trust zero block counts for branches. Currently we leave out such zero counts when annotating branch weight metadata, which would lead to weights being considered as unknown. Differential Revision: https://reviews.llvm.org/D110117	2021-09-21 17:06:37 -07:00
Wenlei He	054487c5b2	[CSSPGO] Honor preinliner decision for ThinLTO importing When pre-inliner decision is used for CSSPGO, we should take that into account for ThinLTO importing as well, so post-link sample loader inliner can favor that decision. This is handled by a small tweak in this patch. It also includes a change to transfer preinliner decision when merging context. Differential Revision: https://reviews.llvm.org/D109088	2021-09-02 17:29:26 -07:00
Kevin Athey	04ed6e7afc	Revert "[CSSPGO] Honor preinliner decision for ThinLTO importing" This reverts commit `a2768b4732`. Breaks sanitizer-x86_64-linux-fast buildbot: https://lab.llvm.org/buildbot/#/builders/5/builds/11334 Log snippet: Testing: 0.. 10.. 20.. 30.. 40.. 50.. 60.. 70.. 80 FAIL: LLVM :: Transforms/SampleProfile/early-inline.ll (65549 of 78729) ****************** TEST 'LLVM :: Transforms/SampleProfile/early-inline.ll' FAILED ****************** Script: -- : 'RUN: at line 1'; /b/sanitizer-x86_64-linux-fast/build/llvm_build_ubsan/bin/opt < /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/test/Transforms/SampleProfile/early-inline.ll -instcombine -sample-profile -sample-profile-file=/b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/test/Transforms/SampleProfile/Inputs/einline.prof -S \| /b/sanitizer-x86_64-linux-fast/build/llvm_build_ubsan/bin/FileCheck /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/test/Transforms/SampleProfile/early-inline.ll -- Exit Code: 2 Command Output (stderr): -- /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/lib/Transforms/IPO/SampleProfile.cpp:1309:53: runtime error: member call on null pointer of type 'llvm::sampleprof::FunctionSamples' #0 0x5a730f8 in shouldInlineCandidate /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/lib/Transforms/IPO/SampleProfile.cpp:1309:53 #1 0x5a730f8 in (anonymous namespace)::SampleProfileLoader::tryInlineCandidate((anonymous namespace)::InlineCandidate&, llvm::SmallVector<llvm::CallBase, 8u>) /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/lib/Transforms/IPO/SampleProfile.cpp:1178:21 #2 0x5a6cda6 in inlineHotFunctions /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/lib/Transforms/IPO/SampleProfile.cpp:1105:13 #3 0x5a6cda6 in (anonymous namespace)::SampleProfileLoader::emitAnnotations(llvm::Function&) /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/lib/Transforms/IPO/SampleProfile.cpp:1633:16 #4 0x5a5fcbe in runOnFunction 
/b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/lib/Transforms/IPO/SampleProfile.cpp:2008:12 #5 0x5a5fcbe in (anonymous namespace)::SampleProfileLoader::runOnModule(llvm::Module&, llvm::AnalysisManager<llvm::Module>, llvm::ProfileSummaryInfo, llvm::CallGraph) /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/lib/Transforms/IPO/SampleProfile.cpp:1922:15 #6 0x5a5de55 in llvm::SampleProfileLoaderPass::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/lib/Transforms/IPO/SampleProfile.cpp:2038:21 #7 0x6552a01 in llvm::detail::PassModel<llvm::Module, llvm::SampleProfileLoaderPass, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Module> >::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/include/llvm/IR/PassManagerInternal.h:88:17 #8 0x57f807c in llvm::PassManager<llvm::Module, llvm::AnalysisManager<llvm::Module> >::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/include/llvm/IR/PassManager.h:526:21 #9 0x37c8522 in llvm::runPassPipeline(llvm::StringRef, llvm::Module&, llvm::TargetMachine, llvm::TargetLibraryInfoImpl, llvm::ToolOutputFile, llvm::ToolOutputFile, llvm::ToolOutputFile, llvm::StringRef, llvm::ArrayRef<llvm::StringRef>, llvm::opt_tool::OutputKind, llvm::opt_tool::VerifierKind, bool, bool, bool, bool, bool) /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/tools/opt/NewPMDriver.cpp:489:7 #10 0x37e7c11 in main /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/tools/opt/opt.cpp:830:12 #11 0x7fbf4de4009a in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2409a) #12 0x379e519 in _start (/b/sanitizer-x86_64-linux-fast/build/llvm_build_ubsan/bin/opt+0x379e519) SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/lib/Transforms/IPO/SampleProfile.cpp:1309:53 in FileCheck error: '<stdin>' is 
empty. FileCheck command line: /b/sanitizer-x86_64-linux-fast/build/llvm_build_ubsan/bin/FileCheck /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/test/Transforms/SampleProfile/early-inline.ll -- ****************** Testing: 0.. 10.. 20.. 30.. 40.. 50.. 60.. 70.. 80 FAIL: LLVM :: Transforms/SampleProfile/inline-cold.ll (65643 of 78729) **************** TEST 'LLVM :: Transforms/SampleProfile/inline-cold.ll' FAILED ****************** Script: -- : 'RUN: at line 4'; /b/sanitizer-x86_64-linux-fast/build/llvm_build_ubsan/bin/opt < /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/test/Transforms/SampleProfile/inline-cold.ll -sample-profile -sample-profile-file=/b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/test/Transforms/SampleProfile/Inputs/inline-cold.prof -S \| /b/sanitizer-x86_64-linux-fast/build/llvm_build_ubsan/bin/FileCheck -check-prefix=NOTINLINE /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/test/Transforms/SampleProfile/inline-cold.ll : 'RUN: at line 5'; /b/sanitizer-x86_64-linux-fast/build/llvm_build_ubsan/bin/opt < /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/test/Transforms/SampleProfile/inline-cold.ll -passes=sample-profile -sample-profile-file=/b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/test/Transforms/SampleProfile/Inputs/inline-cold.prof -S \| /b/sanitizer-x86_64-linux-fast/build/llvm_build_ubsan/bin/FileCheck -check-prefix=NOTINLINE /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/test/Transforms/SampleProfile/inline-cold.ll : 'RUN: at line 8'; /b/sanitizer-x86_64-linux-fast/build/llvm_build_ubsan/bin/opt < /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/test/Transforms/SampleProfile/inline-cold.ll -sample-profile -sample-profile-file=/b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/test/Transforms/SampleProfile/Inputs/inline-cold.prof -sample-profile-inline-size -S \| /b/sanitizer-x86_64-linux-fast/build/llvm_build_ubsan/bin/FileCheck -check-prefix=INLINE 
/b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/test/Transforms/SampleProfile/inline-cold.ll : 'RUN: at line 11'; /b/sanitizer-x86_64-linux-fast/build/llvm_build_ubsan/bin/opt < /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/test/Transforms/SampleProfile/inline-cold.ll -passes=sample-profile -sample-profile-file=/b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/test/Transforms/SampleProfile/Inputs/inline-cold.prof -sample-profile-inline-size -sample-profile-cold-inline-threshold=9999999 -S \| /b/sanitizer-x86_64-linux-fast/build/llvm_build_ubsan/bin/FileCheck -check-prefix=INLINE /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/test/Transforms/SampleProfile/inline-cold.ll : 'RUN: at line 14'; /b/sanitizer-x86_64-linux-fast/build/llvm_build_ubsan/bin/opt < /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/test/Transforms/SampleProfile/inline-cold.ll -passes=sample-profile -sample-profile-file=/b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/test/Transforms/SampleProfile/Inputs/inline-cold.prof -sample-profile-inline-size -sample-profile-cold-inline-threshold=-500 -S \| /b/sanitizer-x86_64-linux-fast/build/llvm_build_ubsan/bin/FileCheck -check-prefix=NOTINLINE /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/test/Transforms/SampleProfile/inline-cold.ll -- Exit Code: 2 Command Output (stderr): -- /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/lib/Transforms/IPO/SampleProfile.cpp:1309:53: runtime error: member call on null pointer of type 'llvm::sampleprof::FunctionSamples' #0 0x5a730f8 in shouldInlineCandidate /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/lib/Transforms/IPO/SampleProfile.cpp:1309:53 #1 0x5a730f8 in (anonymous namespace)::SampleProfileLoader::tryInlineCandidate((anonymous namespace)::InlineCandidate&, llvm::SmallVector<llvm::CallBase, 8u>) /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/lib/Transforms/IPO/SampleProfile.cpp:1178:21 #2 0x5a6cda6 in inlineHotFunctions 
/b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/lib/Transforms/IPO/SampleProfile.cpp:1105:13 #3 0x5a6cda6 in (anonymous namespace)::SampleProfileLoader::emitAnnotations(llvm::Function&) /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/lib/Transforms/IPO/SampleProfile.cpp:1633:16 #4 0x5a5fcbe in runOnFunction /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/lib/Transforms/IPO/SampleProfile.cpp:2008:12 #5 0x5a5fcbe in (anonymous namespace)::SampleProfileLoader::runOnModule(llvm::Module&, llvm::AnalysisManager<llvm::Module>, llvm::ProfileSummaryInfo, llvm::CallGraph) /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/lib/Transforms/IPO/SampleProfile.cpp:1922:15 #6 0x5a5de55 in llvm::SampleProfileLoaderPass::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/lib/Transforms/IPO/SampleProfile.cpp:2038:21 #7 0x6552a01 in llvm::detail::PassModel<llvm::Module, llvm::SampleProfileLoaderPass, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Module> >::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/include/llvm/IR/PassManagerInternal.h:88:17 #8 0x57f807c in llvm::PassManager<llvm::Module, llvm::AnalysisManager<llvm::Module> >::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/include/llvm/IR/PassManager.h:526:21 #9 0x37c8522 in llvm::runPassPipeline(llvm::StringRef, llvm::Module&, llvm::TargetMachine, llvm::TargetLibraryInfoImpl, llvm::ToolOutputFile, llvm::ToolOutputFile, llvm::ToolOutputFile, llvm::StringRef, llvm::ArrayRef<llvm::StringRef>, llvm::opt_tool::OutputKind, llvm::opt_tool::VerifierKind, bool, bool, bool, bool, bool) /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/tools/opt/NewPMDriver.cpp:489:7 #10 0x37e7c11 in main /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/tools/opt/opt.cpp:830:12 #11 0x7fcd534a209a in __libc_start_main 
(/lib/x86_64-linux-gnu/libc.so.6+0x2409a) #12 0x379e519 in _start (/b/sanitizer-x86_64-linux-fast/build/llvm_build_ubsan/bin/opt+0x379e519) SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/lib/Transforms/IPO/SampleProfile.cpp:1309:53 in FileCheck error: '<stdin>' is empty. FileCheck command line: /b/sanitizer-x86_64-linux-fast/build/llvm_build_ubsan/bin/FileCheck -check-prefix=INLINE /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/test/Transforms/SampleProfile/inline-cold.ll -- ****************** Testing: 0.. 10.. 20.. 30.. 40.. 50.. 60.. 70.. 80.. 90.. ****************** Failed Tests (2): LLVM :: Transforms/SampleProfile/early-inline.ll LLVM :: Transforms/SampleProfile/inline-cold.ll	2021-09-02 14:48:31 -07:00
Wenlei He	f7fff46acc	[CSSPGO] Allow inlining recursive call for preinliner When preinliner is used for CSSPGO, we try to honor global preinliner decision as much as we can except for uninlinable callees. We rely on InlineCost::Never to prevent us from illegal inlining. However, it turns out that we use InlineCost::Never for both illegal inlining and some of the "not-so-beneficial" inlining. The most common one is recursive inlining, while it can bloat size a lot during CGSCC bottom-up inlining, it's less of a problem when recursive inlining is guided by profile and done in top-down manner. Ideally it'd be better to have a clear separation between inline legality check vs cost-benefit check, but that requires a bigger change. This change enables InlineCost computation to allow inlining recursive calls, controlled by InlineParams. In SampleLoader, we now enable recursive inlining for CSSPGO when global preinliner decision is used. With this change, we saw a few perf improvements on SPEC2017 with CSSPGO and preinliner on: 2% for povray_r, 6% for xalancbmk_s, 3% omnetpp_s, while size is about the same (no noticeable perf change for all other benchmarks) Differential Revision: https://reviews.llvm.org/D109104	2021-09-02 11:24:27 -07:00
Wenlei He	a2768b4732	[CSSPGO] Honor preinliner decision for ThinLTO importing When pre-inliner decision is used for CSSPGO, we should take that into account for ThinLTO importing as well, so post-link sample loader inliner can favor that decision. This is handled by a small tweak in this patch. It also includes a change to transfer preinliner decision when merging context. Differential Revision: https://reviews.llvm.org/D109088	2021-09-02 08:24:06 -07:00
Wenlei He	c000b8bd5c	[CSSPGO] Use preinliner decision by default when available For CSSPGO, turn on `sample-profile-use-preinliner` by default. This simplifies the use of llvm-profgen preinliner as it's now simply driven by ContextShouldBeInlined flag for each context profile without needing extra compiler switch. Note that llvm-profgen's preinliner is still off by default, under switch `csspgo-preinliner`. Differential Revision: https://reviews.llvm.org/D109111	2021-09-01 23:45:38 -07:00
Hongtao Yu	7ca8030030	[CSSPGO] Enable loading MD5 CS profile. Adding the compiler support of MD5 CS profile based on previous context split work D107299. An MD5 CS profile is about 40% smaller than the string-based extbinary profile. As a result, the compilation is 15% faster. There are a few conversions from real names to md5 names that have been made on the sample loader and context tracker side to get it to work. Reviewed By: wenlei, wmi Differential Revision: https://reviews.llvm.org/D108342	2021-09-01 09:19:47 -07:00
Wenlei He	a45d72e024	[CSSPGO] Add switch for sample loader to honor global pre-inliner decision from llvm-profgen The change adds a switch to allow sample loader to use global pre-inliner's decision instead. The pre-inliner in llvm-profgen makes inline decision globally based on whole program profile and function byte size as cost proxy. Since pre-inliner also adjusts/merges context profile based on its inline decision, honoring its inline decision in sample loader would lead to better post-inline profile quality especially for thinlto where cross module profile merging isn't possible without pre-inliner. Minor fix in profile reader is also included. When pre-inliner is used, we now also turn off the default merging and trimming logic unless it's explicitly asked. Differential Revision: https://reviews.llvm.org/D108677	2021-08-25 17:20:15 -07:00
Rong Xu	5fdaaf7fd8	[SampleFDO] Flow Sensitive Sample FDO (FSAFDO) profile loader This patch implements Flow Sensitive Sample FDO (FSAFDO) profile loader. We have two profile loaders for FS profile, one before RegAlloc and one before BlockPlacement. To enable it, when -fprofile-sample-use=<profile> is specified, add "-enable-fs-discriminator=true \ -disable-ra-fsprofile-loader=false \ -disable-layout-fsprofile-loader=false" to turn on the FS profile loaders. Differential Revision: https://reviews.llvm.org/D107878	2021-08-18 18:37:35 -07:00
wlei	f0d41b58da	[CSSPGO] Tweak ICP threshold in top-down inliner This change slightly relaxed the current ICP threshold in top-down inliner, specifically always allow one ICP for it. It shows some perf improvements on SPEC and our internal benchmarks. Also renamed the previous flag. We can also try to turn off PGO ICP in the future. Reviewed By: wenlei, hoy, wmi Differential Revision: https://reviews.llvm.org/D106588	2021-07-26 21:49:20 -07:00
Kazu Hirata	4a30a5c8d9	[SampleProfile] Remove ProfileIsValid (NFC) The last use was removed on Jan 22, 2021 in commit `c9cd9a0066`.	2021-07-20 08:07:04 -07:00
Wenlei He	f9f3c34e0f	[CSSPGO] Turn on iterative-BFI for CSSPGO Iterative-BFI produces better count quality and performance when evaluated on internal benchmarks. Turning it on by default now for CSSPGO. We can consider turning it on by default for AutoFDO as well in the future. Differential Revision: https://reviews.llvm.org/D106202	2021-07-16 17:35:49 -07:00
Kazu Hirata	49d66d9f9f	[AFDO] Merge function attributes after inlining This patch teaches the sample profile loader to merge function attributes after inlining functions. Without this patch, the compiler could inline a function requiring the 512-bit vector width into its caller without merging function attributes, triggering a failure during instruction selection. Differential Revision: https://reviews.llvm.org/D105729	2021-07-09 16:47:12 -07:00
Hongtao Yu	bd52495518	[CSSPGO] Undoing the concept of dangling pseudo probe As a follow-up to https://reviews.llvm.org/D104129, I'm cleaning up the dangling probe related code in both the compiler and llvm-profgen. I'm seeing a 5% size win for the pseudo_probe section for SPEC2017 and 10% for Cinder. Certain benchmarks such as 602.gcc have a 20% size win. No obvious difference seen on build time for SPEC2017 and Cinder. Reviewed By: wenlei Differential Revision: https://reviews.llvm.org/D104477	2021-06-18 15:14:11 -07:00
Rong Xu	8d581857d7	[SampleFDO] New hierarchical discriminator for FS SampleFDO (llvm-profdata part) This patch was split from https://reviews.llvm.org/D102246 [SampleFDO] New hierarchical discriminator for Flow Sensitive SampleFDO This is for llvm-profdata part of change. It sets the bit masks for the profile reader in llvm-profdata. Also add an internal option "-fs-discriminator-pass" for show and merge command to process the profile offline. This patch also moved setDiscriminatorMaskedBitFrom() to SampleProfileReader::create() to simplify the interface. Differential Revision: https://reviews.llvm.org/D103550	2021-06-04 11:22:06 -07:00
Rong Xu	6745ffe4fa	[SampleFDO] New hierarchical discriminator for FS SampleFDO (ProfileData part) This patch was split from https://reviews.llvm.org/D102246 [SampleFDO] New hierarchical discriminator for Flow Sensitive SampleFDO This is mainly for ProfileData part of change. It will load FS Profile when such profile is detected. For an extbinary format profile, create_llvm_prof tool will add a flag to profile summary section. For other format profiles, the users need to use an internal option (-profile-isfs) to tell the compiler that the profile uses FS discriminators. This patch also simplified the bit API used by FS discriminators. Differential Revision: https://reviews.llvm.org/D103041	2021-06-02 10:32:52 -07:00
Hongtao Yu	4ca6e37b98	[CSSPGO] Overwrite branch weight annotated in previous pass. Sample profile loader can be run in both LTO prelink and postlink. Currently the counts annotation in postlink doesn't fully overwrite what's done in prelink. I'm adding a switch (`-overwrite-existing-weights=1`) to enable a full overwrite, which includes: 1. Clear old metadata for calls when their parent block has a zero count. This could be caused by prelink code duplication. 2. Clear indirect call metadata if somehow all the rest targets have a sum of zero count. 3. Overwrite branch weight for basic blocks. With a CS profile, I was seeing #1 and #2 help reduce code size by preventing post-sample ICP and CGSCC inliner working on obsolete metadata, which come from a partial global inlining in prelink. It's not expected to work well for non-CS case with a less-accurate post-inline count quality. It's worth calling out that some prelink optimizations can damage counts quality in an irreversible way. One example is the loop rotate optimization. Due to lack of exact loop entry count (profiling can only give loop iteration count and loop exit count), moving one iteration out of the loop body leaves the rest iteration count unknown. We had to turn off prelink loop rotate to achieve a better postlink counts quality. An even better postlink counts quality can be achieved by turning off prelink CGSCC inlining which is not context-sensitive. Reviewed By: wenlei, wmi Differential Revision: https://reviews.llvm.org/D102537	2021-05-19 09:12:24 -07:00
Simon Pilgrim	e30540a603	SampleProfileLoader::inlineHotFunctionsWithPriority - Fix uninitialized variable warning. NFCI. findIndirectCallFunctionSamples will leave Sum uninitialized if it returns an empty vector, we don't really use Sum in this case (but we do make a copy that isn't used either) - so ensure we initialize the value to zero to at least silence the static analysis warning.	2021-05-15 15:02:52 +01:00
wlei	e475d4d69f	[CSSPGO] Fix return value of getProbeWeight Currently we don't support multiple return types, so as a workaround we use error_code to represent: 1) The dangling probe. 2) Ignoring the weight of a non-probe instruction. While merging the instructions' weights for the whole BB, it will filter out the error code. But if all instructions of the BB give error_code, the outside logic will mark it as a BB requiring the inference algorithm to infer its weight. This is different from the zero value which will be treated as a cold block. Fix one place so that if we can't find the FunctionSamples in the profile data, which indicates the BB is cold, we choose to return zero. Also refine the comments. Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D102007	2021-05-14 14:06:09 -07:00
Hongtao Yu	5f2d730073	[CSSPGO] Fix incorrect prorating indirect call distribution factor that leads to target count loss. Pseudo probe distribution factor is used to scale down profile samples to avoid misleading the counts inference due to the usage of "maximum" in `getBlockWeight`. For callsites, the scaling down can come from code duplication prior to the sample profile loader (prelink or postlink), or due to the indirect call promotion in sample loader inliner. This patch fixes an issue in sample loader ICP where the leftover indirect callsite scaling down causes the loss of non-promoted call target samples unexpectedly. While the scaling down is to favor BFI/BPI with an accurate callsite count, it doesn't fit in the current distribution factor that represents code duplication changes. Ideally, we would need two factors, one is for code duplication, the other is for ICP. However this seems over complicated. I'm going to trade one usage (callsite counts) for the other (call target counts). Seeing perf win on one benchmark (mcf) of SPEC2017 with others unchanged. Reviewed By: wenlei Differential Revision: https://reviews.llvm.org/D100993	2021-04-23 11:09:22 -07:00
wlei	6d5132b426	[CSSPGO] Fix incorrect probe distribution factor computation in top-down inliner We see a regression related to a low probe factor (0.01) which prevents some callsites being promoted in ICPPass and later causes the missing inline in CGSCC inliner. The root cause is a redundant (second) multiplication of the probe factor, and this change tries to fix it. `Sum` does multiply a factor right after findCallSamples, but later, when used as the parameter in setProbeDistributionFactor, it is multiplied once again. This change could get ~2% perf back on mcf benchmark. In mcf, previously the corresponding factor was 1, and it's the recent feature introducing the <1 factor that triggers this bug. Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D99787	2021-04-07 08:48:59 -07:00
spupyrev	22998738e8	[SamplePGO] Keeping prof metadata for IndirectBrInst Currently prof metadata with branch counts is added only for BranchInst and SwitchInst, but not for IndirectBrInst. As a result, BPI/BFI make incorrect inferences for indirect branches, which can be very hot. This diff adds metadata for IndirectBrInst, in addition to BranchInst and SwitchInst. Reviewed By: wmi, wenlei Differential Revision: https://reviews.llvm.org/D99550	2021-03-30 10:44:48 -07:00
Hongtao Yu	3e3fc431df	[CSSPGO] Top-down processing order based on full profile. Use profiled call edges to augment the top-down order. There are cases that the top-down order computed based on the static call graph doesn't reflect real execution order. For example: 1. Incomplete static call graph due to unknown indirect call targets. Adjusting the order by considering indirect call edges from the profile can enable the inlining of indirect call targets by allowing the caller processed before them. 2. Mutual call edges in an SCC. The static processing order computed for an SCC may not reflect the call contexts in the context-sensitive profile, thus may cause potential inlining to be overlooked. The function order in one SCC is being adjusted to a top-down order based on the profile to favor more inlining. 3. Transitive indirect call edges due to inlining. When a callee function is inlined into a caller function in LTO prelink, every call edge originated from the callee will be transferred to the caller. If any of the transferred edges is indirect, the original profiled indirect edge, even if considered, would not enforce a top-down order from the caller to the potential indirect call target in LTO postlink since the inlined callee is gone from the static call graph. 4. #3 can happen even for direct call targets, due to functions defined in header files. Header functions, when included into source files, are defined multiple times but only one definition survives due to ODR. Therefore, the LTO prelink inlining done on those dropped definitions can be useless based on a local file scope. More importantly, the inlinee, once fully inlined to a to-be-dropped inliner, will have no profile to consume when its outlined version is compiled. This can lead to a profile-less prelink compilation for the outlined version of the inlinee function which may be called from external modules. While this isn't easy to fix, we rely on the postlink AutoFDO pipeline to optimize the inlinee. Since the survived copy of the inliner (defined in headers) can be inlined in its local scope in prelink, it may not exist in the merged IR in postlink, and we'll need the profiled call edges to enforce a top-down order for the rest of the functions. Considering those cases, a profiled call graph completely independent of the static call graph is constructed based on profile data, where function objects are not even needed to handle case #3 and case #4. I'm seeing an average 0.4% perf win out of SPEC2017. For certain benchmarks such as Xalancbmk and GCC, the win is bigger, above 2%. The change is an enhancement to https://reviews.llvm.org/D95988. Reviewed By: wmi, wenlei Differential Revision: https://reviews.llvm.org/D99351	2021-03-30 10:42:22 -07:00
Wenlei He	30b0232336	[CSSPGO][llvm-profgen] Context-sensitive global pre-inliner This change sets up a framework in llvm-profgen to estimate inline decision and adjust context-sensitive profile based on that. We call it a global pre-inliner in llvm-profgen. It will serve two purposes: 1) Since context profile for not inlined context will be merged into base profile, if we estimate a context will not be inlined, we can merge the context profile in the output to save profile size. 2) For thinLTO, when a context involving functions from different modules is not inlined, we can't merge functions profiles across modules, leading to suboptimal post-inline count quality. By estimating some inline decisions, we would be able to adjust/merge context profiles beforehand as a mitigation. Compiler inline heuristic uses inline cost which is not available in llvm-profgen. But since inline cost is closely related to size, we could get an estimate through function size from debug info. Because the size we have in llvm-profgen is the final size, it could also be more accurate than the inline cost estimation in the compiler. This change only has the framework, with a few TODOs left for follow up patches for a complete implementation: 1) We need to retrieve size for function/inlinee from debug info for inlining estimation. Currently we use number of samples in a profile as placeholder for size estimation. 2) Currently the thresholds are using the values used by sample loader inliner. But they need to be tuned since the size here is fully optimized machine code size, instead of inline cost based on not yet fully optimized IR. Differential Revision: https://reviews.llvm.org/D99146	2021-03-29 09:46:14 -07:00


284 Commits