This reverts commit 794b7ea960, and
thus restores commit a212d8da94, and
follow on fixes 0cd6763fa9,
e9ff53d42f, and
37c6a25e9a.
Use a hash function (BLAKE3) instead of hash_combine/hash_code, which are
not guaranteed to be stable across executions.
Additionally, it adds a "REQUIRES: x86_64-linux" to the tests that have
raw profile inputs to avoid failures on big endian bots.
Reviewers: snehasish, davidxl
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D128142
This reverts commit a212d8da94, and follow
on fixes 0cd6763fa9,
e9ff53d42f, and
37c6a25e9a.
After re-reading the documentation for hash_combine, I don't think this
is the appropriate hash function for computing the hash used as
a stack id in the metadata, since it is not guaranteed to produce stable
values across executions. I have not hit this problem, but plan to
switch to using an MD5 hash. However, I am hitting an issue with one of
the bots (https://lab.llvm.org/buildbot/#/builders/171/builds/20732)
where the values produced are only the lower 32 bits of the expected
hash values, which I assume is related to the implementation of
hash_combine and hash_code.
I believe I fixed all of the other bot failures with the follow on fixes,
which I'll merge into the new version before reapplying.
This should fix errors like the following from using the raw
memprof_pgo.profraw profile in a212d8da94d08e229aa8d65283e4b116310bba10:
profile uses zlib compression but the profile reader was built without zlib support
E.g.
https://lab.llvm.org/buildbot/#/builders/196/builds/18536
Profile matching and IR annotation for memprof profiles.
See also related RFCs:
RFC: Sanitizer-based Heap Profiler [1]
RFC: A binary serialization format for MemProf [2]
RFC: IR metadata format for MemProf [3]*
* Note that the IR metadata format has changed from the RFC during
implementation, as described in the preceding patch adding the basic
metadata and verification support.
The matching is performed during the normal PGO annotation phase, to
ensure that the inlines applied in the IR at that point are a subset
of the inlines in the profiled binary and thus reflected in the
profile's call stacks. This is important because the call frames are
associated with functions in the profile based on the inlining in the
symbolized call stacks, and this simplifies locating the subset of
profile data relevant for matching onto each function's IR.
The PGOInstrumentationUse pass is enhanced to perform matching for
whatever combination of memprof and regular PGO profile data exists in
the profile.
Using the utilities introduced in D128854:
The memprof profile data for each context is converted to "cold" or
"notcold" based on parameterized thresholds for size, access count, and
lifetime. The memprof allocation contexts are trimmed to the minimal
amount of context required to uniquely identify whether the context is
cold or not cold. For allocations where all profiled contexts have the
same allocation type, no memprof metadata is attached and instead the
allocation call is directly annotated with an attribute specifying the
allocation type. This is the same attribute that will be applied to
allocation calls once cloned for different contexts, and later used
during LibCall simplification to emit allocation hints [4].
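For illustration, a rough sketch of the resulting annotation (the stack
ids, metadata layout, and allocation callee here are illustrative, not
taken from a real profile):
```
; Allocation with both cold and notcold profiled contexts: each !memprof
; MIB node pairs a (trimmed) call stack of stack ids with an allocation type,
; and !callsite identifies this call's own stack id.
%call = call ptr @malloc(i64 %size), !memprof !0, !callsite !5
!0 = !{!1, !3}
!1 = !{!2, !"notcold"}
!2 = !{i64 9086428284934609951, i64 8632435727821051414}
!3 = !{!4, !"cold"}
!4 = !{i64 9086428284934609951, i64 2732490490862098848}
!5 = !{i64 9086428284934609951}

; Allocation where every profiled context has the same type: no metadata,
; just the allocation-type attribute on the call.
%call2 = call ptr @malloc(i64 %size) "memprof"="cold"
```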
Depends on D128141 and D128854.
[1] https://lists.llvm.org/pipermail/llvm-dev/2020-June/142744.html
[2] https://lists.llvm.org/pipermail/llvm-dev/2021-September/153007.html
[3] https://discourse.llvm.org/t/rfc-ir-metadata-format-for-memprof/59165
[4] ab87cf382d
Differential Revision: https://reviews.llvm.org/D128142
We currently instrument CallBrInst but do not annotate it with
the branch weight. This patch enables PGO annotation of CallBrInst.
Differential Revision: https://reviews.llvm.org/D133040
The current implementation promotes a non-cold function in the SampleFDO profile
into a hot function in the FDO profile. This is too aggressive. This patch
promotes a hot function in the SampleFDO profile into a hot function, and a
warm function in SampleFDO into a warm function in FDO.
Differential Revision: https://reviews.llvm.org/D132601
If a function only has a few instructions, instrumentation can significantly increase the size and performance overhead of that function. Add the `-pgo-function-size-threshold` option to select a size threshold so these small functions are not instrumented.
A similar option `-fxray-instruction-threshold=<N>` is used for XRay to reduce binary size overhead [1].
[1] https://www.llvm.org/docs/XRay.html
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D131816
As discussed in [0], this diff adds the `skipprofile` attribute to
prevent the function from being profiled while allowing profiled
functions to be inlined into it. The `noprofile` attribute remains
unchanged.
The `noprofile` attribute is used for functions where it is
dangerous to add instrumentation, while the `skipprofile` attribute is
used to reduce code size or performance overhead.
[0] https://discourse.llvm.org/t/why-does-the-noprofile-attribute-restrict-inlining/64108
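For illustration, a minimal sketch of the new attribute (function names
are made up):
```
; @fast_path itself is not instrumented, but instrumented callees can
; still be inlined into it (unlike with noprofile, which blocks that).
define void @fast_path() skipprofile {
entry:
  call void @instrumented_callee()
  ret void
}

declare void @instrumented_callee()
```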
Reviewed By: phosek
Differential Revision: https://reviews.llvm.org/D130807
This patch reports number of counts being dropped when a hash-mismatch
happens. This information will be helpful to the users -- if the dropped
counts are large, the user should redo the instrumentation build and
recollect the profile.
Differential Revision: https://reviews.llvm.org/D129001
Don't cross reference CSFDO profile and non-CSFDO profile when
checking the function hash. Only return hash_mismatch when
CS bits match, and return unknown_function otherwise.
Differential Revision: https://reviews.llvm.org/D129000
Following some recent discussions, this changes the representation
of callbrs in IR. The current blockaddress arguments are replaced
with `!` label constraints that refer directly to callbr indirect
destinations:
; Before:
%res = callbr i8* asm "", "=r,r,i"(i8* %x, i8* blockaddress(@test8, %foo))
to label %asm.fallthrough [label %foo]
; After:
%res = callbr i8* asm "", "=r,r,!i"(i8* %x)
to label %asm.fallthrough [label %foo]
The benefit of this is that we can easily update the successors of
a callbr, without having to worry about also updating blockaddress
references. This should allow us to remove some limitations:
* Allow unrolling/peeling/rotation of callbr, or any other
clone-based optimizations
(https://github.com/llvm/llvm-project/issues/41834)
* Allow duplicate successors
(https://github.com/llvm/llvm-project/issues/45248)
This is just the IR representation change though, I will follow up
with patches to remove limitations in various transformation passes
that are no longer needed.
Differential Revision: https://reviews.llvm.org/D129288
This enables opaque pointers by default in LLVM. The effect of this
is twofold:
* If IR that contains *neither* explicit ptr nor %T* types is passed
to tools, we will now use opaque pointer mode, unless
-opaque-pointers=0 has been explicitly passed.
* Users of LLVM as a library will now default to opaque pointers.
It is possible to opt-out by calling setOpaquePointers(false) on
LLVMContext.
A cmake option to toggle this default will not be provided. Frontends
or other tools that want to (temporarily) keep using typed pointers
should disable opaque pointers via LLVMContext.
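For illustration, the same function in typed-pointer and opaque-pointer
form (illustrative snippet):
```
; Typed-pointer IR (now requires -opaque-pointers=0):
define i32 @load_val_typed(i32* %p) {
  %v = load i32, i32* %p
  ret i32 %v
}

; Equivalent opaque-pointer IR, the new default:
define i32 @load_val(ptr %p) {
  %v = load i32, ptr %p
  ret i32 %v
}
```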
Differential Revision: https://reviews.llvm.org/D126689
Use logical instead of bitwise and to combine conditions, to avoid
propagating poison from a later condition if an earlier one is
already false. This avoids introducing branch on poison.
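For illustration (value names are made up):
```
; Before: bitwise 'and' propagates poison from %b even when %a is false,
; and branching on poison is immediate UB.
%cond = and i1 %a, %b
br i1 %cond, label %then, label %else

; After: the logical (select) form does not evaluate %b's poison when
; %a is already false.
%cond = select i1 %a, i1 %b, i1 false
br i1 %cond, label %then, label %else
```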
Differential Revision: https://reviews.llvm.org/D125898
Use IRBuilder so that the newly created freeze instructions
automatically gets inserted back into the IC worklist.
The changed worklist processing order leads to some cosmetic
differences in tests.
Fixes https://github.com/llvm/llvm-project/issues/55619.
When using counter relocations, two instructions are emitted to compute
the address of the counter variable.
```
%BiasAdd = add i64 ptrtoint <__profc_>, <__llvm_profile_counter_bias>
%Addr = inttoptr i64 %BiasAdd to i64*
```
When promoting a counter, these instructions might not be available in
the block, so we need to copy them.
This fixes https://github.com/llvm/llvm-project/issues/55125
Reviewed By: phosek
Differential Revision: https://reviews.llvm.org/D125710
While select conditions can be poison, branch on poison is
immediate UB. As such, we need to freeze the condition when
converting a select into a branch.
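For illustration (names are made up):
```
; Before: select tolerates a poison condition.
%v = select i1 %c, i32 %a, i32 %b

; After: a branch does not, so the condition is frozen first.
%c.fr = freeze i1 %c
br i1 %c.fr, label %select.true, label %select.false
```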
Differential Revision: https://reviews.llvm.org/D125398
When a block containing llvm.coro.id is cloned during CHR, it inserts an invalid
PHI node with token type at the beginning of the block containing llvm.coro.begin.
To avoid this, we exclude regions containing llvm.coro.id.
Reviewed By: ChuanqiXu
Differential Revision: https://reviews.llvm.org/D124418
Reimplements MisExpect diagnostics from D66324 to reconstruct its
original checking methodology only using MD_prof branch_weights
metadata.
New checks rely on 2 invariants:
1) For frontend instrumentation, MD_prof branch_weights will always be
populated before llvm.expect intrinsics are lowered.
2) For IR and sample profiling, llvm.expect intrinsics will always be
lowered before branch_weights are populated from the IR profiles.
These invariants allow the checking to assume how the existing branch
weights are populated depending on the profiling method used, and emit
the correct diagnostics. If these invariants are ever invalidated, the
MisExpect related checks would need to be updated, potentially by
re-introducing MD_misexpect metadata, and ensuring it always will be
transformed the same way as branch_weights in other optimization passes.
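For reference, a rough sketch of the IR these checks look at (the weights
are illustrative):
```
; __builtin_expect(x, 1) becomes an llvm.expect intrinsic ...
%expval = call i64 @llvm.expect.i64(i64 %x, i64 1)
%tobool = icmp ne i64 %expval, 0
br i1 %tobool, label %if.then, label %if.end, !prof !0

; ... and profile annotation leaves branch_weights that MisExpect can
; compare against the expectation.
!0 = !{!"branch_weights", i32 2000, i32 1}
```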
Frontend based profiling is now enabled without using LLVM Args, by
introducing a new CodeGen option, and checking if the -Wmisexpect flag
has been passed on the command line.
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D115907
This patch tries to sink instructions when they are only used in a successor block.
This is a further enhancement based on Anna's commit
D109700, which allows sinking an instruction having multiple uses in a single user.
This patch adds support for sinking instructions with multiple users in a single successor block.
It could fix a known issue from rust:
https://github.com/rust-lang/rust/issues/51346#issuecomment-394443610
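For illustration, a sketch of the newly supported pattern (names are
made up):
```
; %add has two users, both in the single successor %use, so it can now
; be sunk out of entry and into %use.
define i32 @f(i1 %c, i32 %a, i32 %b) {
entry:
  %add = add i32 %a, %b
  br i1 %c, label %use, label %exit
use:
  %m = mul i32 %add, %add
  %r = add i32 %m, %add
  ret i32 %r
exit:
  ret i32 0
}
```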
Reviewed By: nikic, reames
Differential Revision: https://reviews.llvm.org/D121585
Reimplements MisExpect diagnostics from D66324 to reconstruct its
original checking methodology only using MD_prof branch_weights
metadata.
New checks rely on 2 invariants:
1) For frontend instrumentation, MD_prof branch_weights will always be
populated before llvm.expect intrinsics are lowered.
2) For IR and sample profiling, llvm.expect intrinsics will always be
lowered before branch_weights are populated from the IR profiles.
These invariants allow the checking to assume how the existing branch
weights are populated depending on the profiling method used, and emit
the correct diagnostics. If these invariants are ever invalidated, the
MisExpect related checks would need to be updated, potentially by
re-introducing MD_misexpect metadata, and ensuring it always will be
transformed the same way as branch_weights in other optimization passes.
Frontend based profiling is now enabled without using LLVM Args, by
introducing a new CodeGen option, and checking if the -Wmisexpect flag
has been passed on the command line.
Differential Revision: https://reviews.llvm.org/D115907
The `SplitIndirectBrCriticalEdges` function was originally designed for
`CodeGenPrepare` and skipped splitting of edges when the destination
block didn't contain any `PHI` instructions. This only makes sense when
reducing COPYs like `CodeGenPrepare`. In the case of
`PGOInstrumentation` or `GCOVProfiling` it would result in missed
counters and wrong results in functions with computed goto.
Differential Revision: https://reviews.llvm.org/D120096
Unfortunately, it seems we really do need to take the long route;
start from the "merge" block, find (all the) "dispatch" blocks,
and deal with each "dispatch" block separately, instead of simply
starting from each "dispatch" block like it would logically make sense,
otherwise we run into a number of other missing folds around
`switch` formation, missing sinking/hoisting and phase ordering.
This reverts commit 85628ce75b.
This reverts commit c5fff90953.
This reverts commit 34a98e1046.
This reverts commit 1e353f0922.
The current `FoldTwoEntryPHINode()` is not quite designed correctly.
It starts from the merge point, and then tries to detect
the 'divergence' point.
Because of that, it is limited to the simple two-predecessor case,
where the PHI completely goes away. But that is rather pessimistic,
and it doesn't make much sense from the cost-model side of things.
For example if there is some other unrelated predecessor of
the merge point, we could split the merge point so that
the then/else blocks first branch to an empty block
and then to the merge point, and then we'd be able to speculate
the then/else code.
But if we'd instead simply start at the divergence point,
and look for the merge point, then we'll just natively support this case.
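For illustration, the simple two-predecessor diamond that the current
merge-point-driven code handles (sketch; names are made up):
```
; FoldTwoEntryPHINode() starts from %merge and must see exactly two
; predecessors before it can speculate %then and form a select. Starting
; from the divergence point in %entry would also handle merge blocks with
; extra, unrelated predecessors.
define i32 @f(i1 %c, i32 %a, i32 %b) {
entry:
  br i1 %c, label %then, label %merge
then:
  %t = add i32 %a, 1
  br label %merge
merge:
  %phi = phi i32 [ %t, %then ], [ %b, %entry ]
  ret i32 %phi
}
```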
There's also the fact that `SpeculativelyExecuteBB()` already does
just that, but only if there is a single block to speculate,
and with a much more restrictive cost model.
But that also means we have code duplication.
Now, sadly, while this is as much NFCI as possible,
there is just no way to cleanly migrate to
the proper implementation. The results *are* going to be different
somewhat because of various phase ordering effects and SimplifyCFG
block iteration strategy.
Use the llvm flag `-pgo-function-entry-coverage` to create single-byte "counters" to track function coverage. This mode has significantly less size overhead in both code and data because
* We mark a function as "covered" with a store instead of an increment which generally requires fewer assembly instructions
* We use a single byte per function rather than 8 bytes per block
The trade-off, of course, is that this mode only tells you if a function has been covered. This is useful, for example, to detect dead code.
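For illustration, a rough sketch of the difference in the lowered
instrumentation (names are illustrative, and the exact byte value stored
in coverage mode is an implementation detail):
```
; Default instrumentation: an 8-byte counter incremented on entry.
%pgocount = load i64, ptr @__profc_foo
%inc = add i64 %pgocount, 1
store i64 %inc, ptr @__profc_foo

; Entry-coverage mode: a single byte is stored to mark @foo as covered.
store i8 0, ptr @__profc_foo
```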
When combined with debug info correlation [0] we are able to create an instrumented Clang binary that is only 150M (the vanilla Clang binary is 143M). That is an overhead of 7M (4.9%) compared to the default instrumentation (without value profiling) which has an overhead of 31M (21.7%).
[0] https://groups.google.com/g/llvm-dev/c/r03Z6JoN7d4
Reviewed By: kyulee
Differential Revision: https://reviews.llvm.org/D116180
In D115311, we're looking to modify clang to emit i constraints rather
than X constraints for callbr's indirect destinations. Prior to doing
so, update all of the existing tests in llvm/ to match.
Reviewed By: void, jyknight
Differential Revision: https://reviews.llvm.org/D115410
This patch prevents the optimizer from producing wide vectors in the IR,
specifically the MMA types (v256i1, v512i1). The idea is that on Power, we only
want to produce these types if the vector_pair and vector_quad types
are used specifically.
To prevent the optimizer from producing these types in the IR,
vectorCostAdjustmentFactor() is updated to return an instruction cost factor or
an invalid instruction cost if the current type is that of an MMA type. An
invalid instruction cost returned by this function signifies to other cost
computing functions to return the maximum instruction cost, to inform the
optimizer that producing these types within the IR is expensive and that
they should not be produced in the first place.
This issue was first seen in the test case included within this patch.
Differential Revision: https://reviews.llvm.org/D113900
The `llvm.instrprof.increment` intrinsic uses `i32` for the index. We should use this same type for the index into the GEP instructions.
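For illustration (the hash, counter count, and names here are made up):
```
; llvm.instrprof.increment takes an i32 index ...
call void @llvm.instrprof.increment(ptr @__profn_foo, i64 1234, i32 2, i32 1)
; ... so the lowered counter address computation should index the counter
; array with the same i32 type:
%addr = getelementptr inbounds [2 x i64], ptr @__profc_foo, i32 0, i32 1
```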
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D114268
The NS==0 condition used by D103717 missed a corner case: if the current copy
does not have a hash suffix (e.g. weak_odr), a copy with value profiling (with a
different CFG) may exist. This is super rare, but is possible with pre-inlining
PGO instrumentation (which can make a weak_odr function inline its callees
differently, sometimes with value profiling and sometimes without).
If the current copy with private profd is prevailing, the non-prevailing copy
may get an undefined symbol if a caller inlining the non-prevailing function
references its profd. If the other copy with non-private profd is prevailing,
the current copy may cause a "relocation to discarded section" linker error.
The fix is straightforward: just keep non-private profd in such a `DataReferencedByCode` case.
With this change, a stage 2 (`-DLLVM_TARGETS_TO_BUILD=X86 -DLLVM_BUILD_INSTRUMENTED=IR`)
clang is 0.08% larger (172431496/172286720-1).
`stat -c %s **/*.o | awk '{s+=$1}END{print s}'` is 0.026% larger.
The majority of D103717's benefits remains.
Reviewed By: xur
Differential Revision: https://reviews.llvm.org/D108432
This reverts commit f653beea88.
It broke Windows coverage-inline.cpp because link.exe has a limitation
that external symbols in IMAGE_COMDAT_SELECT_ASSOCIATIVE don't work.
It essentially dropped the previous size optimization for coverage
because coverage doesn't rename comdat by default.
Needs more investigation what we should do.
The NS==0 condition used by D103717 missed a corner case: if the current copy
does not have a hash suffix (e.g. weak_odr), a copy with value profiling (with a
different CFG) may exist. This is super rare, but is possible with pre-inlining
PGO instrumentation (which can make a weak_odr function inline its callees
differently, sometimes with value profiling and sometimes without).
If the current copy with private profd is prevailing, the non-prevailing copy
may get an undefined symbol if a caller inlining the non-prevailing function
references its profd. If the other copy with non-private profd is prevailing,
the current copy may cause a "relocation to discarded section" linker error.
The fix is straightforward: just keep non-private profd in this case.
With this change, a stage 2 (`-DLLVM_TARGETS_TO_BUILD=X86 -DLLVM_BUILD_INSTRUMENTED=IR`)
clang is 0.08% larger (172431496/172286720-1).
`stat -c %s **/*.o | awk '{s+=$1}END{print s}'` is 0.026% larger.
The majority of D103717's benefits remains.
Reviewed By: xur
Differential Revision: https://reviews.llvm.org/D108432
The IRPGOFlag symbol (__llvm_profile_raw_version) is dropped when
identified as non-prevailing for either regular or thin LTO during
the mixed-LTO mode compilation. This happens in the module where
IRPGOFlag is marked as non-prevailing. This variable
is emitted in the final object from the prevailing module.
This is still problematic because we currently query this symbol
to coordinate some actions between the PGOInstrumentation pass
and the InstrProfiling lowering pass, such as whether to do value
profiling and whether to do comdat renaming.
This problem was brought up by YolandaCY in
https://reviews.llvm.org/D107034, who reported unresolved
symbol linker errors in a CSPGO instrumentation build for Chromium.
This patch lets LTO retain the IRPGOFlag declaration by adding it to
the CompilerUsed list and relaxes the check in isIRPGOFlagSet() when
doing the InstrProfiling lowering.
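For illustration, a rough sketch of what retaining the symbol looks like
in IR (opaque-pointer syntax; the linkage and version value are
illustrative):
```
@__llvm_profile_raw_version = hidden constant i64 8
; Adding the flag to llvm.compiler.used keeps it alive through LTO.
@llvm.compiler.used = appending global [1 x ptr] [ptr @__llvm_profile_raw_version], section "llvm.metadata"
```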
The test case in the patch is from D107034
<https://reviews.llvm.org/D107034>.
Differential Revision: https://reviews.llvm.org/D108581
Previously we would allow promotion even if the byval/inalloca
attributes on the call and the callee didn't match.
It's ok if the byval/inalloca types aren't the same. For example, LTO
importing may rename types.
Fixes PR51397.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D107998
Change `CountersPtr` in `__profd_` to a label difference, which is a link-time
constant. On ELF, when linking a shared object, this requires that `__profc_` is
either private or linkonce/linkonce_odr hidden. On COFF, we need D104564 so that
`.quad a-b` (64-bit label difference) can lower to a 32-bit PC-relative relocation.
```
# ELF: R_X86_64_PC64 (PC-relative)
.quad .L__profc_foo-.L__profd_foo
# Mach-O: a pair of 8-byte X86_64_RELOC_UNSIGNED and X86_64_RELOC_SUBTRACTOR
.quad l___profc_foo-l___profd_foo
# COFF: we actually use IMAGE_REL_AMD64_REL32/IMAGE_REL_ARM64_REL32 so
# the high 32-bit value is zero even if .L__profc_foo < .L__profd_foo
# As compensation, we truncate CountersDelta in the header so that
# __llvm_profile_merge_from_buffer and llvm-profdata reader keep working.
.quad .L__profc_foo-.L__profd_foo
```
(Note: link.exe sorts `.lprfc` before `.lprfd` even if the object writer
has `.lprfd` before `.lprfc`, so we cannot work around by reordering
`.lprfc` and `.lprfd`.)
With this change, a stage 2 (`-DLLVM_TARGETS_TO_BUILD=X86 -DLLVM_BUILD_INSTRUMENTED=IR`)
`ld -pie` linked clang is 1.74% smaller due to fewer R_X86_64_RELATIVE relocations.
```
% readelf -r pie | awk '$3~/R.*/{s[$3]++} END {for (k in s) print k, s[k]}'
R_X86_64_JUMP_SLO 331
R_X86_64_TPOFF64 2
R_X86_64_RELATIVE 476059 # was: 607712
R_X86_64_64 2616
R_X86_64_GLOB_DAT 31
```
The absolute function address (used by llvm-profdata to collect indirect call
targets) can be converted to relative as well, but is not done in this patch.
Differential Revision: https://reviews.llvm.org/D104556
This function is called when some predecessor of an empty return block
ends with a conditional branch, with both successors being empty ret blocks.
Now, because of the way SimplifyCFG works, it might happen to simplify
one of the blocks in a way that makes a conditional branch
effectively unconditional, since its destinations are now identical,
but it might not have actually folded said conditional branch
into an unconditional one yet.
So, we have to check that ourselves first,
especially now that SimplifyCFG aggressively tail-merges
all ret and resume blocks.
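For illustration, the kind of branch this function now has to guard
against (names are made up):
```
; After tail-merging, both successors of the conditional branch are the
; same (empty) return block, so it is effectively unconditional, but it
; may not have been folded into a plain `br label %common.ret` yet.
br i1 %c, label %common.ret, label %common.ret

common.ret:
  ret void
```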
Even if it was an unconditional branch already,
`SimplifyCFGOpt::simplifyReturn()` doesn't call `FoldReturnIntoUncondBranch()`
by default.