llvm-project

Commit Graph

Author	SHA1	Message	Date
Wenlei He	47d66355ef	[llvm-profgen] Fix alignment in preferred based calculation We used the segment alignment in elf header to assume the loader alignment. However this is incorrect because loader alignment is always the same as page size. If segment needs to be aligned at load time, linker will set aligned address as virtual address in elf header. Differential Revision: https://reviews.llvm.org/D110795	2021-09-29 23:01:10 -07:00
Wenlei He	1f0bc617bd	[llvm-profgen] Allow perf data as input This change enables llvm-profgen to take raw perf data as an alternative input format. Sometimes we need to retrieve events for processes with a matching binary. Using perf data as input allows us to retrieve process IDs from mmap events for the matching binary, then filter by process ID during perf script generation. Differential Revision: https://reviews.llvm.org/D110793	2021-09-29 22:57:35 -07:00
Wenlei He	941191aae4	[llvm-profgen] Refactor and better diagnostics This change contains diagnostics improvements, refactoring and preparation for consuming perf data directly. Diagnostics: - We now have more detailed diagnostics when no mmap is found. - We also print a warning for abnormal transitions to external code. Refactoring: - Simplify input perf trace processing to only allow a single input file. This is because 1) using multiple input perf traces (perf script) is error-prone because we may miss key mmap events. 2) the functionality is not really being used anyway. - Make more functions private for Readers, move non-trivial definitions out of header. Clean up some inconsistencies. - Prepare for consuming perf data as input directly. Differential Revision: https://reviews.llvm.org/D110729	2021-09-29 22:55:50 -07:00
wlei	a03cf331e1	[llvm-profgen] Strip context to support non-CS profile generation for hybrid sample Differential Revision: https://reviews.llvm.org/D109769	2021-09-28 12:20:23 -07:00
wlei	ce40843a3f	[llvm-profgen][CSSPGO] On-demand function size computation for preinliner Similar to https://reviews.llvm.org/D110465, we can compute function size on-demand for the functions that's hit by samples. Here we leverage the raw range samples' address to compute a set of sample hit function. Then `BinarySizeContextTracker` just works on those function range for the size. Reviewed By: hoy Differential Revision: https://reviews.llvm.org/D110466	2021-09-28 09:09:38 -07:00
wlei	091c16f76b	[llvm-profgen] On-demand symbolization Previously we did symbolization for all the functions, but we only need the symbols that are hit by the samples. This can significantly speed up processing for large binaries. Optimization for the preinliner will come along with the next patch. Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D110465	2021-09-28 09:09:25 -07:00
wlei	1422fa5fab	[llvm-profgen] Unify output format of different unsymbolized profiles Differential Revision: https://reviews.llvm.org/D110080	2021-09-24 14:18:00 -07:00
wlei	28277e9b48	[AutoFDO][llvm-profgen] Report zero count for unexecuted part of function code In order to be consistent with the compiler, which interprets zero count as unexecuted (cold), this change reports a zero-value count for the unexecuted part of function code. For the implementation, it leverages the range counter and initializes all the executed function ranges with the zero-value. After all ranges are merged and converted into disjoint ranges, the remaining zero counts will indicate the unexecuted (cold) part of the function. This change also extends the current `findDisjointRanges` method, which can now support adding zero-value ranges. Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D109713	2021-09-24 14:15:05 -07:00
wlei	d5f2013004	[AutoFDO][llvm-profgen] Profile generation for LBR(non-CS) sample This patch introduces non-CS AutoFDO profile generation into LLVM. The profile is supposed to be well consumed by compiler using `-fprofile-sample-use=[profile]`. After range and branch counters are extracted from the LBR sample, here we go through each addresses for symbolization, create FunctionSamples and populate its sub fields like TotalSamples, BodySamples and HeadSamples etc. For inlined code, as we need to map back to original code, so we always add body samples to the leaf frame's function sample. Reviewed By: wenlei, hoy Differential Revision: https://reviews.llvm.org/D109551	2021-09-24 13:55:34 -07:00
wlei	a7cdcf25c1	[llvm-profgen] Ignore invalid perf line in LBR record Similar to https://reviews.llvm.org/D109637, there is a whole invalid line of message in perfscript. ``` warning: Invalid address in LBR record at line 14118674: Processed 14138923 events and lost 1 chunks! warning: Invalid address in LBR record at line 14118676: Check IO/CPU overload! ``` This only happened for LBR only perfscript, hybridperfscript have a check of " 0x" to make sure it's the LBR perf line. Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D110424	2021-09-24 13:44:57 -07:00
wlei	1ed69bb86e	[llvm-profgen] Fix a dangling vector reference in CS line number based generator It seems we missed one spot to persist `SampleContextFrameVector` into the global table (CSProfileGenerator::populateFunctionBoundarySamples:340) which causes a crash. This change tried to fix it in a centralized way i. e. where we generate the `FunctionSamples`. Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D110275	2021-09-22 18:33:28 -07:00
wlei	686cc00067	[llvm-profgen] Fix an out-of-range error during unwinding It happened that the LBR entry target can be the first address of text section which causes an out-of-range crash. So here add a boundary check. Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D110271	2021-09-22 18:33:27 -07:00
wlei	c2be2d3284	[llvm-profgen] Fix a bug of assertion The assertion should work on the entire context. Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D110268	2021-09-22 18:33:27 -07:00
Wenlei He	81c249784f	[llvm-profgen] Use hot threshold for context merging and trimming Without preinliner, we need to tune down the cold count cutoff to merge/trim more context to limit profile size for large components. However it doesn't make sense for cold threshold to be higher than hot threshold, so we now change to use hot threshold as merging/trimming cut off instead. Differential Revision: https://reviews.llvm.org/D110212	2021-09-22 15:01:51 -07:00
Hongtao Yu	734f4d832c	[llvm-profgen] An option to dump disasm of specified symbols For a large app, dumping disasm of the whole program can be slow and result in giant output. Adding a switch to dump specific symbols only. Reviewed By: wlei Differential Revision: https://reviews.llvm.org/D110079	2021-09-22 10:32:59 -07:00
Wenlei He	446e21623c	[llvm-profgen] Use context-sensitive byte size cost for preinliner decisions by default Turn on `use-context-cost-for-preinliner` to use context-sensitive byte size cost for preinliner decisions by default. This is a more accurate proxy of inline cost than profile size. We tested on our large workload that it delivers measurable CPU improvement. Differential Revision: https://reviews.llvm.org/D109893	2021-09-16 10:36:12 -07:00
Hongtao Yu	0057c7185d	[CSSPGO][llvm-profgen] Truncate stack samples with invalid return address. Invalid frame addresses exist in call stack samples due to bad unwinding. This could happen to frame-pointer-based unwinding and the callee functions that do not have the frame pointer chain set up. It isn't common when the program is built with the frame pointer omission disabled, but can still happen with third-party static libs built with frame pointer omitted. Reviewed By: wenlei Differential Revision: https://reviews.llvm.org/D109638	2021-09-14 21:56:22 -07:00
Hongtao Yu	8cbbd7e0b2	[llvm-profgen] Ignore broken LBR samples Perf script can sometimes give disordered LBR samples like below. ``` b022500 32de0044 3386e1d1 7f118e05720c 7f118df2d81f 0x2a0b9622/0x2a0b9610/P/-/-/1 0x2a0b79ff/0x2a0b9618/P/-/-/2 0x2a0b7a4a/0x2a0b79e8/P/-/-/1 0x2a0b7a33/0x2a0b7a46/P/-/-/1 0x2a0b7a42/0x2a0b7a23/P/-/-/1 0x2a0b7a21/0x2a0b7a37/P/-/-/2 0x2a0b79e6/0x2a0b7a07/P/-/-/1 0x2a0b79d4/0x2a0b79dc/P/-/-/2 0x2a0b7a03/0x2a0b79aa/P/-/-/1 0x2a0b79a8/0x2a0b7a00/P/-/-/234 0x2a0b9613/0x2a0b7930/P/-/-/1 0x2a0b9622/0x2a0b9610/P/-/-/1 0x2a0b79ff/0x2a0b9618/P/-/-/2 0x2a0b7a4a/0x2aWarning: Processed 10263226 events and lost 1 chunks! ``` Note the last LBR record, `0x2a0b7a4a/0x2aWarning:`. Currently llvm-profgen does not detect that, and as a result an uninitialized branch target value will be used. The uninitialized value can cause bogus instruction ranges to be created, which in turn will result in a completely wrong profile. An example is like ``` .... @ _ZN5folly13loadUnalignedIsEET_PKv]:18446744073709551615:18446744073709551615 1: 18446744073709551615 !CFGChecksum: 4294967295 !Attributes: 0 ``` Reviewed By: wenlei, wlei Differential Revision: https://reviews.llvm.org/D109637	2021-09-14 12:11:17 -07:00
Wenlei He	a5d3cac033	[llvm-profgen] Turn off cold context trimming by default We merge cold context by default to save profile size. However trimming cold context after merging doesn't save size much, so default to off to reflect how it's commonly used. Differential Revision: https://reviews.llvm.org/D109166	2021-09-02 12:29:06 -07:00
Wenlei He	6eca242e09	[llvm-profgen] Deduplicate and improve warning for truncated context This change improves the warning for truncated context by: 1) deduplicate them as one call without probe can appear in many different context leading to duplicated warnings , 2) rephrase the message to make it easier to understand. The term "untracked frame" can be confusing. Differential Revision: https://reviews.llvm.org/D109115	2021-09-02 09:15:38 -07:00
Wenlei He	f10004e7dd	[CSSPGO] Add stats for pre-inliner Add some stats to help tuning pre-inliner. Differential Revision: https://reviews.llvm.org/D109098	2021-09-01 20:03:50 -07:00
Hongtao Yu	7ca8030030	[CSSPGO] Enable loading MD5 CS profile. Adding the compiler support of MD5 CS profile based on previous context split work D107299. An MD5 CS profile is about 40% smaller than the string-based extbinary profile. As a result, the compilation is 15% faster. There are a few conversions from real names to md5 names that have been made on the sample loader and context tracker side to get it to work. Reviewed By: wenlei, wmi Differential Revision: https://reviews.llvm.org/D108342	2021-09-01 09:19:47 -07:00
wlei	964053d56f	[llvm-profgen] Support LBR only perf script This change aims at supporting LBR only sample perf script which is used for regular(Non-CS) profile generation. A LBR perf script includes a batch of LBR sample which starts with a frame pointer and a group of 32 LBR entries is followed. The FROM/TO LBR pair and the range between two consecutive entries (the former entry's TO and the latter entry's FROM) will be used to infer function profile info. An example of LBR perf script(created by `perf script -F ip,brstack -i perf.data`) ``` 40062f 0x40062f/0x4005b0/P/-/-/9 0x400645/0x4005ff/P/-/-/1 0x400637/0x400645/P/-/-/1 ... 4005d7 0x4005d7/0x4005e5/P/-/-/8 0x40062f/0x4005b0/P/-/-/6 0x400645/0x4005ff/P/-/-/1 ... ... ``` For implementation: - Extended a new child class `LBRPerfReader` for the sample parsing, reused all the functionalities in `extractLBRStack` except for an extension to parsing leading instruction pointer. - `HybridSample` is reused(just leave the call stack empty) and the parsed samples is still aggregated in `AggregatedSamples`. After that, range samples, branch sample, address samples are computed and recorded. - Reused `ContextSampleCounterMap` to store the raw profile, since it's no need to aggregation by context, here it just registered one sample counter with a fake context key. - Unified to use `show-raw-profile` instead of `show-unwinder-output` to dump the intermediate raw profile, see the comments of the format of the raw profile. For CS profile, it remains to output the unwinder output. Profile generation part will come soon. Differential Revision: https://reviews.llvm.org/D108153	2021-08-31 13:28:17 -07:00
Hongtao Yu	b9db70369b	[CSSPGO] Split context string to deduplicate function name used in the context. Currently context strings contain a lot of duplicated function names and that significantly increase the profile size. This change split the context into a series of {name, offset, discriminator} tuples so function names used in the context can be replaced by the index into the name table and that significantly reduce the size consumed by context. A follow-up improvement made in the compiler and profiling tools is to avoid reconstructing full context strings which is time- and memory- consuming. Instead a context vector of `StringRef` is adopted to represent the full context in all scenarios. As a result, the previous prevalent profile map which was implemented as a `StringRef` is now engineered as an unordered map keyed by `SampleContext`. `SampleContext` is reshaped to using an `ArrayRef` to represent a full context for CS profile. For non-CS profile, it falls back to use `StringRef` to represent a contextless function name. Both the `ArrayRef` and `StringRef` objects are underpinned by real array and string objects that are stored in producer buffers. For compiler, they are maintained by the sample reader. For llvm-profgen, they are maintained in `ProfiledBinary` and `ProfileGenerator`. Full context strings can be generated only in those cases of debugging and printing. When it comes to profile format, nothing has changed to the text format, though internally CS context is implemented as a vector. Extbinary format is only changed for CS profile, with an additional `SecCSNameTable` section which stores all full contexts logically in the form of `vector<int>`, which each element as an offset points to `SecNameTable`. All occurrences of contexts elsewhere are redirected to using the offset of `SecCSNameTable`. Testing This is no-diff change in terms of code quality and profile content (for text profile). 
For our internal large service (aka ads), the profile generation is cut to half, with a 20x smaller string-based extbinary format generated. The compile time of ads is dropped by 25%. Differential Revision: https://reviews.llvm.org/D107299	2021-08-30 20:09:29 -07:00
Wenlei He	a45d72e024	[CSSPGO] Add switch for sample loader to honor global pre-inliner decision from llvm-profgen The change adds a switch to allow sample loader to use global pre-inliner's decision instead. The pre-inliner in llvm-profgen makes inline decision globally based on whole program profile and function byte size as cost proxy. Since pre-inliner also adjusts/merges context profile based on its inline decision, honoring its inline decision in sample loader would lead to better post-inline profile quality especially for thinlto where cross module profile merging isn't possible without pre-inliner. Minor fix in profile reader is also included. When pre-inliner is use, we now also turn off the default merging and trimming logic unless it's explicitly asked. Differential Revision: https://reviews.llvm.org/D108677	2021-08-25 17:20:15 -07:00
Wenlei He	a6f15e9a49	[CSSPGO] Use probe inline tree to track zero size fully optimized context for pre-inliner This is a follow up diff for BinarySizeContextTracker to track zero size for fully optimized inlinee. When an inlinee is fully optimized away, we won't be able to get its size through symbolizing instructions, hence we will treat the corresponding context size as unknown. However by traversing the inlined probe forest, we know what're original inlinees regardless of optimization. If a context show up in inlined probes, but not during symbolization, we know that it's fully optimized away hence its size is zero instead of unknown. It should provide more accurate size cost estimation for pre-inliner to make better inline decisions in llvm-profgen. Differential Revision: https://reviews.llvm.org/D108350	2021-08-25 09:01:11 -07:00
Wenlei He	eca03d2768	[CSSPGO] Track and use context-sensitive post-optimization function size to drive global pre-inliner in llvm-profgen This change enables llvm-profgen to use accurate context-sensitive post-optimization function byte size as a cost proxy to drive global preinline decisions. To do this, BinarySizeContextTracker is introduced to track function byte size under different inline context during disassembling. In preinliner, we can now query context byte size under switch `context-cost-for-preinliner`. The tracker uses a reverse trie to keep size of functions under different context (callee as parent, caller as child), and it can give best/longest possible matching context size for given input context. The new size cost is off by default. There are a few TODOs that need to be addressed: 1) avoid dangling string from `Offset2LocStackMap`, which will be addressed in split context work; 2) using inlinee's entry probe to make sure we have correct zero size for inlinee that's completely optimized away after inlining. Some tuning is also needed. Differential Revision: https://reviews.llvm.org/D108180	2021-08-18 22:50:57 -07:00
wlei	9af46710fe	[llvm-profgen] Move profiled binary loading out of PerfReader Change to use unique pointer of profiled binary to unblock asan. At same time, I realized we can decouple to move the profiled binary loading out of PerfReader, so I made some other related refactors. Reviewed By: hoy Differential Revision: https://reviews.llvm.org/D108254	2021-08-17 17:28:01 -07:00
wlei	f812c19253	[llvm-profgen] Clean up code dealing with multiple binaries As we decided to support only one binary each time, this patch cleans up the related code dealing with multiple binaries. We can use `llvm-profdata` to merge profile from multiple binaries. Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D108002	2021-08-17 12:16:07 -07:00
wlei	856a6a5041	[CSSPGO][llvm-profgen] Trim and merge context beforehand to reduce memory usage Currently we use a centralized string map(StringMap<FunctionSamples> ProfileMap) to store the profile while populating the sample, which might cause the memory usage bottleneck. I saw in an extreme case, there are thousands of samples whose context stack depth is >= 100. The memory consumption can be greater than 100GB. As here the context is used for inlining, we can assume we won't have so many of inlinees keeping inlined at the same root function, so this change tried to cap the context stack and merge the samples for peak memory reduction and this is done after recursion compression. The default value is -1 meaning no depth limit, in the future we can tune to a smaller one. Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D107800	2021-08-11 16:02:35 -07:00
wlei	a8a38ef3d9	[llvm-profgen] Fix bug of loop scope mismatch One performance issue happened in profile generation and it turned out the line 525 loop is the bottleneck. Moving the code outside of loop scope can fix this issue. The run time is improved from 30+mins to ~30s. Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D107529	2021-08-05 16:52:57 -07:00
jamesluox	ee7d20e846	[CSSPGO] Migrate and refactor the decoder of Pseudo Probe Migrate pseudo probe decoding logic in llvm-profgen to MC, so other LLVM-base program could reuse existing codes. Redesign object layout of encoded and decoded pseudo probes. Reviewed By: hoy Differential Revision: https://reviews.llvm.org/D106861	2021-08-04 09:21:34 -07:00
wlei	f1affe8dc8	[llvm-profgen][CSSPGO] Support count based aggregated type of hybrid perf script This change tried to integrate a new count based aggregated type of perf script. The only difference of the format is that an aggregated count is added at the head of the original sample which means the same samples are repeated to the given count times. This is used to reduce the perf script size. e.g. ``` 2 4005dc 400634 400684 7f68c5788793 0x4005c8/0x4005dc/P/-/-/0 .... ``` Implemented by a dedicated PerfReader `AggregatedHybridPerfReader`. Differential Revision: https://reviews.llvm.org/D107192	2021-08-03 17:56:35 -07:00
wlei	fe3ba90830	[llvm-profgen] Support perf script without parsing MMap events This change supports to run without parsing MMap binary loading events instead it always assumes binary is loaded at the preferred address. This is used when we have assured no binary load address changes or we have pre-processed the addresses resolution. Warn if there's interior mmap event but without leading mmap events. Reviewed By: hoy Differential Revision: https://reviews.llvm.org/D107097	2021-08-03 10:01:07 -07:00
wlei	6da9241aab	[llvm-profgen] Refactor PerfReader to allow different types of perf scripts In order to support different types of perf scripts, this change tried to refactor `PerfReader` by adding the base class `PerfReaderBase` and current HybridPerfReader is derived from it for CS profile generation. Common functions like, passMM2PEvents, extract_lbrs, extract_callstack, etc. can be reused. Next step is to add LBR only reader(for non-CS profile) and aggregated perf scripts reader(do a pre-aggregation of scripts). Reviewed By: hoy, wenlei Differential Revision: https://reviews.llvm.org/D107014	2021-08-02 17:18:47 -07:00
Fangrui Song	6da3d8b19c	[llvm] Replace LLVM_ATTRIBUTE_NORETURN with C++11 [[noreturn]] [[noreturn]] can be used since Oct 2016 when the minimum compiler requirement was bumped to GCC 4.8/MSVC 2015. Note: the definition of LLVM_ATTRIBUTE_NORETURN is kept for now.	2021-07-28 09:31:14 -07:00
Timm Bäder	d16f154240	[llvm][tools] Hide more unrelated LLVM tool options Differential Revision: https://reviews.llvm.org/D106366	2021-07-21 09:14:04 +02:00
Hongtao Yu	6b04ecaab3	[CSSPGO][llvm-profgen] Fix a missing initialization Fixing a missing initialization that was accidentally caused by https://reviews.llvm.org/D103178 .	2021-07-13 19:49:55 -07:00
Hongtao Yu	597e9c61ce	Revert "[CSSPGO][llvm-profgen] Fix a missing initialization" This reverts commit `fef5f4456a`.	2021-07-13 19:48:58 -07:00
Hongtao Yu	fef5f4456a	[CSSPGO][llvm-profgen] Fix a missing initialization Fixing a missing initialization that was accidentally caused by https://reviews.llvm.org/D103178 .	2021-07-13 19:46:18 -07:00
Hongtao Yu	cda2394d97	[NFC][CSSPGO] Rename the name of an enum value.	2021-07-13 18:30:16 -07:00
Hongtao Yu	0712038458	[CSSPGO][llvm-profgen] Allow multiple executable load segments. The linker or post-link optimizer can create an ELF image with multiple executable segments each of which will be loaded separately at run time. This breaks the assumption of llvm-profgen that currently only supports one base load address. What it ends up with is that the subsequent mmap events will be treated as an overwrite of the first mmap event which will in turn screw up address mapping. While it is non-trivial to support multiple separate load addresses and given that on x64 those segments will always be loaded at consecutive addresses (though via separate mmap sys calls), I'm adding an error checking logic to bail out if that's violated and keep using a single load address which is the address of the first executable segment. Also changing the disassembly output from printing section offset to printing the virtual address instead, which matches the behavior of objdump. Differential Revision: https://reviews.llvm.org/D103178	2021-07-13 18:22:24 -07:00
Hongtao Yu	5c8659801a	[CSSPGO][llvm-profgen] Handle return to external transition. In a callback case, a return from internal code, say A, to external runtime can happen. The external runtime can then call back to another internal routine, say B. Making an artificial branch that looks like a return from A to B can confuse the unwinder to treat the instruction before B as the call instruction. Reviewed By: wenlei, wmi Differential Revision: https://reviews.llvm.org/D104546	2021-06-22 16:24:59 -07:00
Rong Xu	8c68eb8306	[SampleFDO] Make FSDiscriminator flag part of function parameters Add a parameter of IsFSDiscriminator to function getBaseDiscriminatorFromDiscriminator(). This function currently checks the internal flag of --enable-fs-discriminator. This is not good because we might change the default value of the internal flag. Note that we have a default parameter. This is just because create_afdo_tool has a call-site to it. I will remove the default parameter in a later patch. Differential Revision: https://reviews.llvm.org/D104584	2021-06-21 14:37:45 -07:00
Hongtao Yu	bd52495518	[CSSPGO] Undoing the concept of dangling pseudo probe As a follow-up to https://reviews.llvm.org/D104129, I'm cleaning up the dangling probe related code in both the compiler and llvm-profgen. I'm seeing a 5% size win for the pseudo_probe section for SPEC2017 and 10% for Cinder. Certain benchmarks such as 602.gcc have a 20% size win. No obvious difference seen on build time for SPEC2017 and Cinder. Reviewed By: wenlei Differential Revision: https://reviews.llvm.org/D104477	2021-06-18 15:14:11 -07:00
Hongtao Yu	fb19aa0c74	[CSSPGO][llvm-profgen] Fix an issue in findDisjointRanges We were using 0 as an indicator of invalid offset when computing disjoint ranges. In reality, 0 can be an valid code offset which stands for the first function in .text section. I'm using UINT64_MAX as an invalid code offset instead. Reviewed By: wenlei Differential Revision: https://reviews.llvm.org/D104497	2021-06-18 14:38:48 -07:00
Hongtao Yu	8c2c97287e	[CSSPGO][llvm-profgen] Ignore LBR records after interrupt transition If we have seen an inwards transition from external code to internal code, but not a following outwards transition, the inwards transition is likely due to interrupt which is usually unpaired. Ignore current and subsequent entries since they are likely from an unrelated pre-interrupt context. LBR records from different interrupt context are unrelated and they should not be mixed together. Currently the OS does this for task-scheduling interrupt but not for all interrupts. Reviewed By: wenlei, wlei Differential Revision: https://reviews.llvm.org/D104276	2021-06-18 12:13:53 -07:00
Hongtao Yu	c60f1d5d98	[CSSPGO] Fix an invalid hash table reference issue in the CS preinliner. We were using a `StringMap` object to store all profiles to be emitted. The object is basically an unordered hash table, therefore updating it in the process of traversing it may cause issues since the underlying bucket array could change. I'm also moving the `csspgo-preinliner` switch around so that no context trie will be constructed (by the constructor of `CSPreInliner`) when the switch is off. Reviewed By: wenlei Differential Revision: https://reviews.llvm.org/D104267	2021-06-18 11:54:23 -07:00
Hongtao Yu	cef9b96b01	[CSSPGO] Report zero-count probe in profile instead of dangling probes. Previously dangling samples were represented by INT64_MAX in sample profile while probes never executed were not reported. This was based on an observation that dangling probes were only a smaller portion than zero-count probes. However, with compiler optimizations, dangling probes end up becoming a large portion of all probes in general and reporting them does not make sense from a profile size point of view. This change flips sample reporting by reporting zero-count probes instead. This enables a dangling probe to be represented by none (missing entry in profile). This has a couple of benefits: 1. Reducing sample profile size in optimize mode, even when the number of non-executed probes exceeds the number of dangling probes, since INT64_MAX takes more space than 0 to encode. 2. Binary size savings. No need to encode dangling probes anymore, since missing probes are treated as dangling in the profile reader. 3. Reducing compiler work to track dangling probes. However, for probes that are really dead and removed, we still need the compiler to identify them so that they can be reported as zero-count, instead of mistreated as dangling probes. 4. Improving counts quality by respecting the counts already collected on the non-dangling copy of a probe. A probe, when duplicated, gets two copies at runtime. If one of them is dangling while the other is not, merging the two probes at profile generation time will cause the real samples collected on the non-dangling one to be discarded. Not reporting the dangling counterpart will keep the real samples. 5. Better readability. 6. Be consistent with non-CS dwarf line number based profile. Zero counts are trusted by the compiler counts inferencer while missing counts will be inferred by the compiler. Note that the current patch does not include any work for #3. There will be follow-up changes. 
For #1, I've seen for a large Facebook service, the text profile is reduced by 7%. For extbinary profile, the size of LBRProfileSection is reduced by 35%. For #4, I have seen general counts quality for SPEC2017 is improved by 10%. Reviewed By: wenlei, wlei, wmi Differential Revision: https://reviews.llvm.org/D104129	2021-06-16 11:45:29 -07:00
wlei	863184dd69	[CSSPGO] Aggregation by the last K context frames for cold profiles This change provides the option to merge and aggregate cold context by the last k frames instead of context-less name. By default K = 1 means the context-less one. This is for better perf tuning. The more selective merging and trimming will rely on llvm-profgen's preinliner. Reviewed By: wenlei, hoy Differential Revision: https://reviews.llvm.org/D104131	2021-06-14 10:33:43 -07:00
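
The LBR-only perf script format described in 964053d56f (a leading instruction pointer followed by `0xFROM/0xTO/...` entries), together with the broken-record cases from 8cbbd7e0b2 and a7cdcf25c1, can be illustrated with a minimal Python sketch. This is not llvm-profgen's code: the names `parse_lbr_line` and `linear_ranges` are hypothetical, and the sketch assumes entries are listed newest first, so the linear range between two neighbours runs from the older entry's TO to the newer entry's FROM.

```python
import re
from typing import List, Optional, Tuple

# One LBR entry as printed by `perf script -F ip,brstack`: 0xFROM/0xTO/flags...
LBR_ENTRY = re.compile(r"^0x([0-9a-fA-F]+)/0x([0-9a-fA-F]+)/\S*$")
HEX_ADDR = re.compile(r"^[0-9a-fA-F]+$")

Branch = Tuple[int, int]  # (from_address, to_address)


def parse_lbr_line(line: str) -> Optional[Tuple[int, List[Branch]]]:
    """Parse one LBR-only sample line; return None for any malformed line."""
    tokens = line.split()
    # The line must start with a bare hex instruction pointer.
    if not tokens or not HEX_ADDR.match(tokens[0]):
        return None
    ip = int(tokens[0], 16)
    branches: List[Branch] = []
    for tok in tokens[1:]:
        m = LBR_ENTRY.match(tok)
        if m is None:
            # A truncated entry such as `0x2a0b7a4a/0x2aWarning:` invalidates
            # the whole sample instead of yielding a garbage target address.
            return None
        branches.append((int(m.group(1), 16), int(m.group(2), 16)))
    if not branches:
        return None
    return ip, branches


def linear_ranges(branches: List[Branch]) -> List[Tuple[int, int]]:
    """Infer executed ranges between neighbouring entries.

    Assumption (not stated in the commit log): entries are newest first.
    """
    ranges = []
    for newer, older in zip(branches, branches[1:]):
        start, end = older[1], newer[0]
        if start <= end:  # skip pairs that do not form a forward range
            ranges.append((start, end))
    return ranges
```

Rejecting the whole line on the first malformed entry is a deliberate simplification here; whether the real tool drops the sample or only the damaged entry is not something this log specifies.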

101 Commits