For MachO, lower `@llvm.global_dtors` into `@llvm_global_ctors` with
`__cxa_atexit` calls to avoid emitting the deprecated `__mod_term_func`.
Reuse the existing `WebAssemblyLowerGlobalDtors.cpp` to accomplish this.
Enable fallback to the old behavior via Clang driver flag
(`-fregister-global-dtors-with-atexit`) or llc / code generation flag
(`-lower-global-dtors-via-cxa-atexit`). This escape hatch will be
removed in the future.
Differential Revision: https://reviews.llvm.org/D121736
Before this patch the DebugifyLevel option was used for
the synthetic mode, so after this, it will be used in
the original mode as well.
Differential Revision: https://reviews.llvm.org/D115623
Before we start addressing the issue with having
a lot of false positives when using debugify in
the original mode, we have made a few patches that
should speed up the execution of the testing
utility Passes.
For example, when testing a large project
(let's say LLVM project itself), we can face
a lot of potential DI issues. Usually, we use
-verify-each-debuginfo-preserve (that is very
similar to -debugify-each) -- it collects
DI metadata before each Pass, and after the Pass
it checks if the Pass preserved the DI metadata.
However, we can speed up this process, since we
don't need to collect DI metadata before each
Pass -- we could use the DI metadata that are
collected after the previous Pass from
the pipeline as an input for the next Pass.
This patch speeds up the utility for ~2x.
Differential Revision: https://reviews.llvm.org/D115622
Reimplements MisExpect diagnostics from D66324 to reconstruct its
original checking methodology only using MD_prof branch_weights
metadata.
New checks rely on 2 invariants:
1) For frontend instrumentation, MD_prof branch_weights will always be
populated before llvm.expect intrinsics are lowered.
2) for IR and sample profiling, llvm.expect intrinsics will always be
lowered before branch_weights are populated from the IR profiles.
These invariants allow the checking to assume how the existing branch
weights are populated depending on the profiling method used, and emit
the correct diagnostics. If these invariants are ever invalidated, the
MisExpect related checks would need to be updated, potentially by
re-introducing MD_misexpect metadata, and ensuring it always will be
transformed the same way as branch_weights in other optimization passes.
Frontend based profiling is now enabled without using LLVM Args, by
introducing a new CodeGen option, and checking if the -Wmisexpect flag
has been passed on the command line.
Differential Revision: https://reviews.llvm.org/D115907
For MachO, lower `@llvm.global_dtors` into `@llvm_global_ctors` with
`__cxa_atexit` calls to avoid emitting the deprecated `__mod_term_func`.
Reuse the existing `WebAssemblyLowerGlobalDtors.cpp` to accomplish this.
Enable fallback to the old behavior via Clang driver flag
(`-fregister-global-dtors-with-atexit`) or llc / code generation flag
(`-lower-global-dtors-via-cxa-atexit`). This escape hatch will be
removed in the future.
Differential Revision: https://reviews.llvm.org/D121736
This code queries TTI on a single function, which is considered to
be representative. This is a bit odd, but probably fine in practice.
However, I think we should at least avoid querying declarations,
which e.g. will generally lack target attributes, and for which
we don't seem to ever query TTI in other places.
For MachO, lower `@llvm.global_dtors` into `@llvm_global_ctors` with
`__cxa_atexit` calls to avoid emitting the deprecated `__mod_term_func`.
Reuse the existing `WebAssemblyLowerGlobalDtors.cpp` to accomplish this.
Enable fallback to the old behavior via Clang driver flag
(`-fregister-global-dtors-with-atexit`) or llc / code generation flag
(`-lower-global-dtors-via-cxa-atexit`). This escape hatch will be
removed in the future.
Differential Revision: https://reviews.llvm.org/D121327
Extend -wholeprogramdevirt-check to support both the existing
trapping mode on an incorrect devirtualization, as well as a new
mode to fallback to an indirect call on a mismatch. The new mode is
The new mode is useful in cases where we want to enable
devirtualization but cannot fully guarantee whole program visibility
(e.g in the case where LTO has been disabled for a small set of objects
that could potentially override virtual methods without having a symbol
reference to anything in the base class including the vtable).
Remove !prof and !callees metadata (which are used by indirect call
promotion) from both the new direct call and the fallback indirect call
(so that we don't perform another round of promotion on the latter).
Also remove it from the direct call in the non-fallback cases, which
was an oversight, although it didn't seem to cause any issues. Add tests
for the metadata removal covering the various cases.
Differential Revision: https://reviews.llvm.org/D121419
If there are no ctors, then this can have an arbirary zero-sized
value. The current code checks for null, but it could also be
undef or poison.
Replacing the specific null check with a check for
non-ConstantArray.
MinBaseDistance may be odr-used by std::max, leading to an undefined symbol linker error:
```
ld.lld: error: undefined symbol: (anonymous namespace)::MinCostMaxFlow::MinBaseDistance
>>> referenced by SampleProfileInference.cpp:744 (/home/ray/llvm-project/llvm/lib/Transforms/Utils/SampleProfileInference.cpp:744)
>>> lib/Transforms/Utils/CMakeFiles/LLVMTransformUtils.dir/SampleProfileInference.cpp.o:((anonymous namespace)::FlowAdjuster::jumpDistance(llvm::FlowJump*) const)
```
Since llvm-project is still using C++ 14, workaround it with a cast.
The OpenMPIRBuilder has a bug. Specifically, suppose you have two nested openmp parallel regions (writing with MLIR for ease)
```
omp.parallel {
%a = ...
omp.parallel {
use(%a)
}
}
```
As OpenMP only permits pointer-like inputs, the builder will wrap all of the inputs into a stack allocation, and then pass this
allocation to the inner parallel. For example, we would want to get something like the following:
```
omp.parallel {
%a = ...
%tmp = alloc
store %tmp[] = %a
kmpc_fork(outlined, %tmp)
}
```
However, in practice, this is not what currently occurs in the context of nested parallel regions. Specifically to the OpenMPIRBuilder,
the entirety of the function (at the LLVM level) is currently inlined with blocks marking the corresponding start and end of each
region.
```
entry:
...
parallel1:
%a = ...
...
parallel2:
use(%a)
...
endparallel2:
...
endparallel1:
...
```
When the allocation is inserted, it presently inserted into the parent of the entire function (e.g. entry) rather than the parent
allocation scope to the function being outlined. If we were outlining parallel2, the corresponding alloca location would be parallel1.
This causes a variety of bugs, including https://github.com/llvm/llvm-project/issues/54165 as one example.
This PR allows the stack allocation to be created at the correct allocation block, and thus remedies such issues.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D121061
This will let us start moving away from hard-coded attributes in
MemoryBuiltins.cpp and put the knowledge about various attribute
functions in the compilers that emit those calls where it probably
belongs.
Differential Revision: https://reviews.llvm.org/D117921
`ArgInfo` is reduced to only contain a pair of {formal,actual} values.
The specialized function `Fn` and the `Partial` flag are redundant in
this structure. The `Gain` is moved to a new struct `SpecializationInfo`.
The value mappings created by cloneCandidateFunction() are being used
by rewriteCallSites() for matching the formal arguments of recursive
functions.
The list of specializations is passed by reference to calculateGains()
instead of being returned by value.
The `IsPartial` flag is removed from isArgumentInteresting() and
getPossibleConstants() as it's no longer used anywhere in the code.
Differential Revision: https://reviews.llvm.org/D120753
The verify call was taking 50% of the compile time in our internal LLVM
fork when trying to unroll many loops.
Differential Revision: https://reviews.llvm.org/D113028
Currently adding attribute no_sanitize("bounds") isn't disabling
-fsanitize=local-bounds (also enabled in -fsanitize=bounds). The Clang
frontend handles fsanitize=array-bounds which can already be disabled by
no_sanitize("bounds"). However, instrumentation added by the
BoundsChecking pass in the middle-end cannot be disabled by the
attribute.
The fix is very similar to D102772 that added the ability to selectively
disable sanitizer pass on certain functions.
In this patch, if no_sanitize("bounds") is provided, an additional
function attribute (NoSanitizeBounds) is attached to IR to let the
BoundsChecking pass know we want to disable local-bounds checking. In
order to support this feature, the IR is extended (similar to D102772)
to make Clang able to preserve the information and let BoundsChecking
pass know bounds checking is disabled for certain function.
Reviewed By: melver
Differential Revision: https://reviews.llvm.org/D119816
SCEVs ExprValueMap currently tracks not only which IR Values
correspond to a given SCEV expression, but additionally stores that
it may be expanded in the form X+Offset. In theory, this allows
reusing existing IR Values in more cases.
In practice, this doesn't seem to be particularly useful (the test
changes are rather underwhelming) and adds a good bit of complexity.
Per https://github.com/llvm/llvm-project/issues/53905, we have an
invalidation issue with these offseted expressions.
Differential Revision: https://reviews.llvm.org/D120311
Summary:
We use a section to embed offloading code into the host for later
linking. This is normally unique to the translation unit as it is thrown
away during linking. However, if the user performs a relocatable link
the sections will be merged and we won't be able to access the files
stored inside. This patch changes the section variables to have external
linkage and a name defined by the section name, so if two sections are
combined during linking we get an error.
The `SplitIndirectBrCriticalEdges` function was originally designed for
`CodeGenPrepare` and skipped splitting of edges when the destination
block didn't contain any `PHI` instructions. This only makes sense when
reducing COPYs like `CodeGenPrepare`. In the case of
`PGOInstrumentation` or `GCOVProfiling` it would result in missed
counters and wrong result in functions with computed goto.
Differential Revision: https://reviews.llvm.org/D120096
Constants cannot be cyclic, but they can be tree-like. Keep a
visited set to ensure we do not degenerate to exponential run-time.
This fixes the problem reported in https://reviews.llvm.org/D117223#3335482,
though I haven't been able to construct a concise test case for
the issue. This requires a combination of dead constants and the
kind of constant expression tree that textual IR cannot represent
(because the textual representation, unlike the in-memory
representation, is also exponential in size).
This code could be generalized to be type-independent, but for now
just ensure that the same type constraints are enforced with opaque
pointers as with typed pointers.
By convention, memcpy/memmove intrinsics are always used with i8
pointers (though this is not enforced), so in practice this code
was always using an i8 type. Make that explicit.
Of course, i8 is not a very profitable choice, and this code could
be more performant by picking an appropriate larger type. But that
would require additional test coverage and correctness review, and
certainly shouldn't be a decision based on the pointer element type.
this is the first step in unifying some of the logic between hwasan and
mte stack tagging. this only moves around code, changes to converge
different implementations of the same logic follow later.
Reviewed By: eugenis
Differential Revision: https://reviews.llvm.org/D118947
The module flag to indicate use of hostcall is insufficient to catch
all cases where hostcall might be in use by a kernel. This is now
replaced by a function attribute that gets propagated to top-level
kernel functions via their respective call-graph.
If the attribute "amdgpu-no-hostcall-ptr" is absent on a kernel, the
default behaviour is to emit kernel metadata indicating that the
kernel uses the hostcall buffer pointer passed as an implicit
argument.
The attribute may be placed explicitly by the user, or inferred by the
AMDGPU attributor by examining the call-graph. The attribute is
inferred only if the function is not being sanitized, and the
implictarg_ptr does not result in a load of any byte in the hostcall
pointer argument.
Reviewed By: jdoerfert, arsenm, kpyzhov
Differential Revision: https://reviews.llvm.org/D119216
Note that this doesn't actually cause the top level predicate to become a non-union just yet.
The * above comes from a case in the LoopVectorizer where a predicate which is later proven no longer blocks vectorization due to a change from checking if predicates exists to whether the predicate is possibly false.
For those curious, the whole reason for tracking the predicate set seperately as opposed to just immediately registering the dependencies appears to be allowing the printing code to print a result without changing the PSE state. It's slightly questionable if this justifies the complexity, but since we can preserve it with local ugliness, I did so.
As long as *all* the invokes in the set are indirect,
we can merge them, but don't merge direct invokes into the set,
even though it would be legal to do.
PredicatedScalarEvolution has a predicate type for representing A == B. This change generalizes it into something which can represent a A <pred> B.
This generality is currently unused, but is motivated by a couple of recent cases which have come up. In particular, I'm currently playing around with using this to simplify the runtime checking code in LoopVectorizer. Regardless of the outcome of that prototyping, generalizing the compare node seemed useful.
Alloca promotion can only deal with cases where the load/store
types match the alloca type (it explicitly does not support
bitcasted load/stores).
With opaque pointers this is no longer enforced through the pointer
type, so add an explicit check.
If the original invokes had uses, the uses must have been in PHI's,
but that immediately results in the incoming values being incompatible.
But we'll replace uses of the original invokes with the use of the
merged invoke, so as long as the incoming values become compatible
after that, we can merge.
Even if the invokes have normal destination, iff it's the same block,
we can merge them. For now, require that there are no PHI nodes,
and the returned values of invokes aren't used.
I'm seeing ext-tsp helps CSSPGO for our intern large benchmarks so I'm turning on it for CSSPGO. For non-CS AutoFDO, ext-tsp doesn't seem to help, probably because of lower profile counts quality.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D119048
As per LangRef's definition of `noreturn` attribute:
```
noreturn
This function attribute indicates that the function never returns
normally, hence through a return instruction.
This produces undefined behavior at runtime if the function
ever does dynamically return. nnotated functions may still
raise an exception, i.a., nounwind is not implied.
```
So if we `invoke` a `noreturn` function, and the normal destination
of an invoke is not an `unreachable`, point it at the new `unreachable`
block.
The change/fix from the original commit is that we now actually create
the new block, and don't just repurpose the original block,
because said normal destination block could have other users.
This reverts commit db1176ce66,
relanding commit 598833c987.
As per LangRef's definition of `noreturn` attribute:
```
noreturn
This function attribute indicates that the function never returns
normally, hence through a return instruction.
This produces undefined behavior at runtime if the function
ever does dynamically return. nnotated functions may still
raise an exception, i.a., nounwind is not implied.
```
While nowadays SimplifyCFG knows how to hoist code from then-else blocks,
sink code from unconditional predecessors, and even promote the latter
by tail-merging `ret`/`resume` function terminators, that isn't everything.
While i (& others) have been trying to deal with merging/sinking `unreachable`,
apparently perhaps the more impactful remaining problem is merging the `throw`
calls.
If we start at the `landingpad`, all the predecessors are unwind edges of `invoke`s,
and in some cases some of the `invoke`s are mergeable.
```
/// This is a weird mix of hoisting and sinking. Visually, it goes from:
/// [...] [...]
/// | |
/// [invoke0] [invoke1]
/// / \ / \
/// [cont0] [landingpad] [cont1]
/// to:
/// [...] [...]
/// \ /
/// [invoke]
/// / \
/// [cont] [landingpad]
```
This simplifies the IR/CFG, at the cost of debug info and extra PHI nodes.
Note that we don't require for *all* the `invokes` of the `landingpad`
to be mergeable, they can form more than a single set, we gracefully handle that.
For now, i completely disallowed normal destination, PHI nodes and indirect invokes
but that can be supported.
Out of all the CTMark projects, only 7zip is C++, so there isn't much impact:
https://llvm-compile-time-tracker.com/compare.php?from=ba8eb31bd9542828f6424e15a3014f80f14522c8&to=722fc871c84f14157d45c2159bc9c8c7e2825785&stat=size-total
... but there it currently causes size-total decrease.
Differential Revision: https://reviews.llvm.org/D117805
Unfortunately, it seems we really do need to take the long route;
start from the "merge" block, find (all the) "dispatch" blocks,
and deal with each "dispatch" block separately, instead of simply
starting from each "dispatch" block like it would logically make sense,
otherwise we run into a number of other missing folds around
`switch` formation, missing sinking/hoisting and phase ordering.
This reverts commit 85628ce75b.
This reverts commit c5fff90953.
This reverts commit 34a98e1046.
This reverts commit 1e353f0922.
The current `FoldTwoEntryPHINode()` is not quite designed correctly.
It starts from the merge point, and then tries to detect
the 'divergence' point.
Because of that, it is limited to the simple two-predecessor case,
where the PHI completely goes away. but that is rather pessimistic,
and it doesn't make much sense from the costmodel side of things.
For example if there is some other unrelated predecessor of
the merge point, we could split the merge point so that
the then/else blocks first branch to an empty block
and then to the merge point, and then we'd be able to speculate
the then/else code.
But if we'd instead simply start at the divergence point,
and look for the merge point, then we'll just natively support this case.
There's also the fact that `SpeculativelyExecuteBB()` already does
just that, but only if there is a single block to speculate,
and with a much more restrictive cost model.
But that also means we have code duplication.
Now, sadly, while this is as much NFCI as possible,
there is just no way to cleanly migrate to
the proper implementation. The results *are* going to be different
somewhat because of various phase ordering effects and SimplifyCFG
block iteration strategy.
Based on the output of include-what-you-use.
This is a big chunk of changes. It is very likely to break downstream code
unless they took a lot of care in avoiding hidden ehader dependencies, something
the LLVM codebase doesn't do that well :-/
I've tried to summarize the biggest change below:
- llvm/include/llvm-c/Core.h: no longer includes llvm-c/ErrorHandling.h
- llvm/IR/DIBuilder.h no longer includes llvm/IR/DebugInfo.h
- llvm/IR/IRBuilder.h no longer includes llvm/IR/IntrinsicInst.h
- llvm/IR/LLVMRemarkStreamer.h no longer includes llvm/Support/ToolOutputFile.h
- llvm/IR/LegacyPassManager.h no longer include llvm/Pass.h
- llvm/IR/Type.h no longer includes llvm/ADT/SmallPtrSet.h
- llvm/IR/PassManager.h no longer includes llvm/Pass.h nor llvm/Support/Debug.h
And the usual count of preprocessed lines:
$ clang++ -E -Iinclude -I../llvm/include ../llvm/lib/IR/*.cpp -std=c++14 -fno-rtti -fno-exceptions | wc -l
before: 6400831
after: 6189948
200k lines less to process is no that bad ;-)
Discourse thread on the topic: https://llvm.discourse.group/t/include-what-you-use-include-cleanup
Differential Revision: https://reviews.llvm.org/D118652
Cleanup code in peelLoop API. We already have usage of DT without guarding
against a null DT, so this change constant folds the remaining null DT
checks.
Also make the argument a reference so that it is clear the argument is
a nonnull DT.
Extracted from D118472.
Constant expressions with a non-pointer result type used an early
exit that bypassed the later dead constant user check, and resulted
in different optimization outcomes depending on whether dead users
were present or not.
This fixes the issue reported in https://reviews.llvm.org/D117223#3287039.
D116542 adds EmbedBufferInModule which introduces a layer violation
(https://llvm.org/docs/CodingStandards.html#library-layering).
See 2d5f857a1e for detail.
EmbedBufferInModule does not use BitcodeWriter functionality and should be moved
LLVMTransformsUtils. While here, change the function case to the prevailing
convention.
It seems that EmbedBufferInModule just follows the steps of
EmbedBitcodeInModule. EmbedBitcodeInModule calls WriteBitcodeToFile but has IR
update operations which ideally should be refactored to another library.
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D118666
This reverts commit a6b54ddaba.
Apparently it is not safe to modify the condition even if it passes the
hasOneUse test, because StructurizeCFG might have other references to
the condition that are not manifest in the IR use-def chains.
This avoids various cases where StructurizeCFG would otherwise insert an
xor i1 instruction, and it since it generally runs late in the pipeline,
instcombine does not clean up the xor-of-cmp pattern.
Differential Revision: https://reviews.llvm.org/D118478
The pruning cloner already tries to remove unreachable blocks. The
original cloning process will simplify instructions and constant
terminators, and only clone blocks that are reachable at that point.
However, phi nodes can only be simplified after everything has been
cloned. For that reason, additional blocks may become unreachable
after phi simplification.
The code does try to handle this as well, but only removes blocks
that don't have predecessors. It misses unreachable cycles. This
can cause issues if SEH exception handling code is part of an
unreachable cycle, as the inliner is not prepared to deal with that.
This patch instead performs an explicit scan for reachable blocks,
and drops everything else.
Fixes https://github.com/llvm/llvm-project/issues/53206.
Differential Revision: https://reviews.llvm.org/D118449
This reverts commit ef82063207.
- It conflicts with the existing llvm::size in STLExtras, which will now
never be called.
- Calling it without llvm:: breaks C++17 compat
We could keep the non-i8 GEP code for non-opaque pointers, but
there's two reasons I'm dropping it: First, this actually appears
to be dead code, at least it isn't hit in any of our tests. I
expect that this is because we usually expand trip counts, and
those are never pointers (anymore). Second, the non-i8 GEP was
actually incorrect in multiple ways, because it used SCEV type
sizes, which don't match DL type sizes (for pointers) and certainly
don't match type alloc sizes (which is what GEPs actually use).
As such, I'm simplifying the code to always use the i8 GEP code
path if it does get hit.
Summary:
Enable CodeExtractor to construct output functions that partially
aggregate inputs/outputs in their argument list. A use case is the
OMPIRBuilder to create outlined functions for parallel regions that
aggregate in a struct the payload variables for the region while passing
as scalars thread and bound identifiers.
Differential Revision: https://reviews.llvm.org/D96854
We should always be calculating a byte-wise difference here.
Previously this calculated the pointer difference while taking
the pointer element type into account, which is incorrect.
Instead use either Type::getPointerElementType() or
Type::getNonOpaquePointerElementType().
This is part of D117885, in preparation for deprecating the API.
This matches the actual runtime function more closely.
I considered also renaming both RetainRV/UnsafeClaimRV to end with
"ARV", for AutoreleasedReturnValue, but there's less potential
for confusion there.
D108960 added support for SjLj using Wasm EH instructions, which we call
Wasm SjLj going forward. (We call the old SjLj Emscripten SjLj) But it
did not support using Wasm EH and Wasm SjLj together. So far users of
Wasm EH had to use Wasm EH with Emscripten SjLj, which had a certain
limitation and it suffered from bigger code size increases as well.
This enables using Wasm EH and Wasm SjLj together.
1. This redirects `catchswitch` and `cleanupret` that unwind to caller
to `catch.dispatch.longjmp` BB, which is a `catchswitch` BB that
handles longjmps.
2. D108960 converted all longjmpable `call`s to `invokes` that unwind to
`catch.dispatch.longjmp`. This CL checks if the `call` is embedded
within another `catchpad`, and if so, makes it unwind to its nearest
parent's unwind destination, rather than `catch.dispatch.longjmp`.
This is necessary to preserve the scoping structure.
Reviewed By: dschuff
Differential Revision: https://reviews.llvm.org/D117610
The TripCount is not modified by the function so it doesn't need
to be passed by reference. Verified by passing it as const reference
before changing to value.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D117735
This is an extension of **profi** post-processing step that rebalances counts
in CFGs that have basic blocks w/o probes (aka "unknown" blocks). Specifically,
the new version finds many more "unknown" subgraphs and marks more "unknown"
basic blocks as hot (which prevents unwanted optimization passes).
I see up to 0.5% perf on some (large) binaries, e.g., clang-10 and gcc-8.
The algorithm is still linear and yields no build time overhead.
LLVM Programmer’s Manual strongly discourages the use of `std::vector<bool>` and suggests `llvm::BitVector` as a possible replacement.
This patch does just that for llvm.
Reviewed By: dexonsmith
Differential Revision: https://reviews.llvm.org/D117121
After D116332, some icmps no longer fold with the target-independent
constant folder. The SimplifyCFG code assumed that the comparison
would always fold, which is not guaranteed. Explicitly check that the
result is either true or false.
Differential Revision: https://reviews.llvm.org/D117184
This patch fixes an issue in which SSA value reference within a
DIArgList would be unnecessarily dropped by llvm-link, even when
invoking on a single file (which should be a no-op). The reason for the
difference is that the ValueMapper does not refer to the
RF_IgnoreMissingLocals flag for LocalAsMetadata contained within a
DIArgList; this flag is used for direct LocalAsMetadata uses to preserve
SSA references even when the ValueMapper does not have an explicit
mapping for the referenced SSA value, which appears to always be the
case when using llvm-link in this manner.
Differential Revision: https://reviews.llvm.org/D114355
Use the AttributeSet constructor instead. There's no good reason
why AttrBuilder itself should exact the AttributeSet from the
AttributeList. Moving this out of the AttrBuilder generally results
in cleaner code.
The empty() method is a footgun: It only checks whether there are
non-string attributes, which is not at all obvious from its name,
and of dubious usefulness. td_empty() is entirely unused.
Drop these methods in favor of hasAttributes(), which checks
whether there are any attributes, regardless of whether these are
string or enum attributes.
I strongly believe we need some variant of this.
The main problem is e.g. that the glibc's assert has 4 parameters,
but the profitability check is only okay with one extra phi node,
so D116692 doesn't even trigger on most of the expected cases.
While that restriction probably makes sense in normal code, if we
are about to run off of a cliff (into an `unreachable`), this
successor block is unlikely so the cost to setup these PHI nodes
should not be on the hotpath, and shouldn't matter performance-wise.
Likewise, we don't sink if there are unconditional predecessors
UNLESS we'd sink at least one non-speculatable instruction,
which is a performance workaround, but if we are about to run into
`unreachable`, it shouldn't matter.
Note that we only allow the case where there are at
most unconditiona branches on the way to the unreachable block.
Differential Revision: https://reviews.llvm.org/D117045
Similar to memset, memset_pattern{4,8,16} all will return and do not
unwind. Use fallthrough to include all attributes also set for memset.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D114904
analyzeGlobal() looks through non-constexpr cast instructions when
looking for users. However, this particular place only strips the
casts again if they are constexprs. We should be looking through all
casts here.
Not all allocation functions are removable if unused. An example of a non-removable allocation would be a direct call to the replaceable global allocation function in C++. An example of a removable one - at least according to historical practice - would be malloc.
As discussed in https://github.com/llvm/llvm-project/issues/53020 / https://reviews.llvm.org/D116692,
SCEV is forbidden from reasoning about 'backedge taken count'
if the branch condition is a poison-safe logical operation,
which is conservatively correct, but is severely limiting.
Instead, we should have a way to express those
poison blocking properties in SCEV expressions.
The proposed semantics is:
```
Sequential/in-order min/max SCEV expressions are non-commutative variants
of commutative min/max SCEV expressions. If none of their operands
are poison, then they are functionally equivalent, otherwise,
if the operand that represents the saturation point* of given expression,
comes before the first poison operand, then the whole expression is not poison,
but is said saturation point.
```
* saturation point - the maximal/minimal possible integer value for the given type
The lowering is straight-forward:
```
compare each operand to the saturation point,
perform sequential in-order logical-or (poison-safe!) ordered reduction
over those checks, and if reduction returned true then return
saturation point else return the naive min/max reduction over the operands
```
https://alive2.llvm.org/ce/z/Q7jxvH (2 ops)
https://alive2.llvm.org/ce/z/QCRrhk (3 ops)
Note that we don't need to check the last operand: https://alive2.llvm.org/ce/z/abvHQS
Note that this is not commutative: https://alive2.llvm.org/ce/z/FK9e97
That allows us to handle the patterns in question.
Reviewed By: nikic, reames
Differential Revision: https://reviews.llvm.org/D116766
9345ab3a45 updated generateOverflowCheck to skip creating checks that
always evaluate to false. This in turn means that we only need to
create TruncTripCount if it is actually used.
Sink the TruncTripCount creating into ComputeEndCheck, so it is only
created when there's an actual check.
9345ab3a45 updated generateOverflowCheck to skip creating checks that
always evaluate to false. This in turn means that we only need to
compute |Step| * Trip count if the result of the multiplication is
actually used.
Sink the multiplication into ComputeEndCheck, so it is only created
when there's an actual check.
There is no need to sort inserted instructions by dominance, as the
deletion loop still requires RAUW with undef before deleting. Removing
instructions in reverse insertion order should still insure that the
number of uselist updates is kept to a minimum.
9345ab3a45 updated generateOverflowCheck to skip creating checks that
always evaluate to false. This in turn means that we only need to check
for overflows if the result of the multiplication is actually used.
Sink the Or for the overflow check into ComputeEndCheck, so it is only
created when there's an actual check.
Unsigned compares of the form <u 0 are always false. Do not create such
a redundant check in generateOverflowCheck.
The patch introduces a new lambda to create the check, so we can
exit early conveniently and skip creating some instructions feeding the
check.
I am planning to sink a few additional instructions as follow-ups, but I
would prefer to do this separately, to keep the changes and diff
smaller.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D116811
Currently generateOverflowCheck always creates code for Step being
negative and positive, followed by a select at the end depending on
Step's sign.
This patch updates the code to only create either the checks for step
being positive or negative, if the sign is known.
Follow-up to D116696.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D116747
reallocf() is the same as realloc() but frees the input pointer
on failure as well. We can infer the same attributes.
Also combine some cases that infer the same attributes and are
logically related.
I am suspecting a bug around updates of loop info for unreachable exits, but don't have a test case. Running this locally on make check didn't reveal anything, we'll see if the expensive checks bots find it.
This patch updates SCEVExpander::expandUnionPredicate to not create
redundant 'or false, x' instructions. While those are trivially
foldable, they can be easily avoided and hinder code that checks the
size/cost of the generated checks before further folds.
I am planning on look into a few other similar improvements to code
generated by SCEVExpander.
I remember a while ago @lebedev.ri working on doing some trivial folds
like that in IRBuilder itself, but there where concerns that such
changes may subtly break existing code.
Reviewed By: reames, lebedev.ri
Differential Revision: https://reviews.llvm.org/D116696
Track all GlobalObjects that reference a given comdat, which allows
determining whether a function in a comdat is dead without scanning
the whole module.
In particular, this makes filterDeadComdatFunctions() have complexity
O(#DeadFunctions) rather than O(#SymbolsInModule), which addresses
half of the compile-time issue exposed by D115545.
Differential Revision: https://reviews.llvm.org/D115864
This does not appear to cause any problems, and it
fixes#50910
Extra tests with a trunc user were added with:
3a239379
...but they don't match either way, so there's an
opportunity to improve the matching further.
The naming has come up as a source of confusion in several recent reviews. onlyWritesMemory is consist with onlyReadsMemory which we use for the corresponding readonly case as well.
This reverts commit ea75be3d9d and
1eb5b6e850.
That commit caused crashes with compilation e.g. like this
(not fixed by the follow-up commit):
$ cat sqrt.c
float a;
b() { sqrt(a); }
$ clang -target x86_64-linux-gnu -c -O2 sqrt.c
Attributes 'readnone and writeonly' are incompatible!
%sqrtf = tail call float @sqrtf(float %0) #1
in function b
fatal error: error in backend: Broken function found, compilation aborted!
isBitOrNoopPointerCastable() returns true if the types are the
same, but it's not actually possible to create a bitcast for all
such types. The assumption seems to be that the user will omit
creating the cast in that case, as it is unnecessary.
Fixes https://github.com/llvm/llvm-project/issues/52994.
All of these functions would be `readnone`, but can't be on platforms
where they can set `errno`. A `writeonly` function with no pointer
arguments can only write (but never read) global state.
Writeonly theoretically allows these calls to be CSE'd (a writeonly call
with the same arguments will always result in the same global stores) or
hoisted out of loops, but that's not implemented currently.
There are a few functions in this list that could be `readnone` instead
of `writeonly`, if someone is interested.
Differential Revision: https://reviews.llvm.org/D116426
Global ctor evaluation currently models memory as a map from Constant*
to Constant*. For this to be correct, it is required that there is
only a single Constant* referencing a given memory location. The
Evaluator tries to ensure this by imposing certain limitations that
could result in ambiguities (by limiting types, casts and GEP formats),
but ultimately still fails, as can be seen in PR51879. The approach
is fundamentally fragile and will get more so with opaque pointers.
My original thought was to instead store memory for each global as an
offset => value representation. However, we also need to make sure
that we can actually rematerialize the modified global initializer
into a Constant in the end, which may not be possible if we allow
arbitrary writes.
What this patch does instead is to represent globals as a MutableValue,
which is either a Constant* or a MutableAggregate*. The mutable
aggregate exists to allow efficient mutation of individual aggregate
elements, as mutating an element on a Constant would require interning
a new constant. When a write to the Constant* is made, it is converted
into a MutableAggregate* as needed.
I believe this should make the evaluator more robust, compatible
with opaque pointers, and a bit simpler as well.
Fixes https://github.com/llvm/llvm-project/issues/51221.
Differential Revision: https://reviews.llvm.org/D115530
This function returns an upper bound on the number of bits needed
to represent the signed value. Use "Max" to match similar functions
in KnownBits like countMaxActiveBits.
Rename APInt::getMinSignedBits->getSignificantBits. Keeping the old
name around to keep this patch size down. Will do a bulk rename as
follow up.
Rename KnownBits::countMaxSignedBits->countMaxSignificantBits.
Reviewed By: lebedev.ri, RKSimon, spatel
Differential Revision: https://reviews.llvm.org/D116522
If we have an exit which is controlled by a loop invariant condition and which dominates the latch, we know only the copy in the first unrolled iteration can be taken. All other copies are dead.
The change itself is pretty straight forward, but let me add two points of context:
* I'd have expected other transform passes to catch this after unrolling, but I'm seeing multiple examples where we get to the end of O2/O3 without simplifying.
* I'd like to do a stronger change which did CSE during unroll and accounted for invariant expressions (as defined by SCEV instead of trivial ones from LoopInfo), but that doesn't fit cleanly into the current code structure.
Differential Revision: https://reviews.llvm.org/D116496
This list is confusing because it conflates functions attributes
(which are either extractable or not) and other attribute kinds,
which are simply irrelevant for this code.
This fixes a typo/bug when checking for pointer reuse when testing
DI location preservation in the Debugify original mode (when
checking -g generated Debug Info).
Differential Revision: https://reviews.llvm.org/D115621
With Control-Flow Integrity (CFI), the LowerTypeTests pass replaces
function references with CFI jump table references, which is a problem
for low-level code that needs the address of the actual function body.
For example, in the Linux kernel, the code that sets up interrupt
handlers needs to take the address of the interrupt handler function
instead of the CFI jump table, as the jump table may not even be mapped
into memory when an interrupt is triggered.
This change adds the no_cfi constant type, which wraps function
references in a value that LowerTypeTestsModule::replaceCfiUses does not
replace.
Link: https://github.com/ClangBuiltLinux/linux/issues/1353
Reviewed By: nickdesaulniers, pcc
Differential Revision: https://reviews.llvm.org/D108478
After this function call, the LLVM IR would look like the following:
```
if (true)
/* NonVersionedLoop */
else
/* VersionedLoop */
```
Reviewed By: Whitney
Differential Revision: https://reviews.llvm.org/D104631
I noticed we weren't propagating tail flags on calls when
FortifiedLibCallSimplifier.optimizeCall() was replacing calls to runtime
checked calls to the non-checked routines (when safe to do so). Make
sure to check this before replacing the original calls!
Also, avoid any libcall transforms when notail/musttail is present.
PR46734
Fixes: https://github.com/llvm/llvm-project/issues/46079
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D107872
This patch fixes the relative table converter pass for the lookup table
accesses that are resulted in an instruction sequence, where gep is not
immediately followed by a load, such as gep being hoisted outside the loop
or another instruction is inserted in between them. The fix inserts the
call to load.relative.instrinsic in the original place of load instead of gep.
Issue is reported by FreeBSD via https://bugs.freebsd.org/259921.
Differential Revision: https://reviews.llvm.org/D115571
The code claimed to handle nsw/nuw, but those aren't passed via builder state and the explicit IR construction just above never sets them.
The only case this bit of code is actually relevant for is FMF flags. However, dropPoisonGeneratingFlags currently doesn't know about FMF at all, so this was a noop. It's also unneeded, as the caller explicitly configures the flags on the builder before this call, and the flags on the individual ops should be controled by the intrinsic flags anyways. If any of the flags aren't safe to propagate, the caller needs to make that change.
The recurrence lowering code has handling which claims to be about flag intersection, but all the callers pass empty arrays to the arguments. The sole exception is a caller of a method which has the argument, but no implementation.
I don't know what the intent was here, but it certaintly doesn't actually do anything today.
This change allows us to estimate trip count from profile metadata for all multiple exit loops. We still do the estimate only from the latch, but that's fine as it causes us to over estimate the trip count at worst.
Reviewing the uses of the API, all but one are cases where we restrict a loop transformation (unroll, and vectorize respectively) when we know the trip count is short enough. So, as a result, the change makes these passes strictly less aggressive. The test change illustrates a case where we'd previously have runtime unrolled a loop which ran fewer iterations than the unroll factor. This is definitely unprofitable.
The one case where an upper bound on estimate trip count could drive a more aggressive transform is peeling, and I duplicated the logic being removed from the generic estimation there to keep it the same. The resulting heuristic makes no sense and should probably be immediately removed, but we can do that in a separate change.
This was noticed when analyzing regressions on D113939.
I plan to come back and incorporate estimated trip counts from other exits, but that's a minor improvement which can follow separately.
Differential Revision: https://reviews.llvm.org/D115362
This patch adds 4 options for specifying functions, aliases, globals and
structs name prefixes hat don't need to be renamed by MetaRenamer pass.
This is useful if one has some downstream logic that depends directly
on an entity name. MetaRenamer can break this logic, but with the patch
you can tell it not to rename certain names.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D115323
A new basic block ordering improving existing MachineBlockPlacement.
The algorithm tries to find a layout of nodes (basic blocks) of a given CFG
optimizing jump locality and thus processor I-cache utilization. This is
achieved via increasing the number of fall-through jumps and co-locating
frequently executed nodes together. The name follows the underlying
optimization problem, Extended-TSP, which is a generalization of classical
(maximum) Traveling Salesmen Problem.
The algorithm is a greedy heuristic that works with chains (ordered lists)
of basic blocks. Initially all chains are isolated basic blocks. On every
iteration, we pick a pair of chains whose merging yields the biggest increase
in the ExtTSP value, which models how i-cache "friendly" a specific chain is.
A pair of chains giving the maximum gain is merged into a new chain. The
procedure stops when there is only one chain left, or when merging does not
increase ExtTSP. In the latter case, the remaining chains are sorted by
density in decreasing order.
An important aspect is the way two chains are merged. Unlike earlier
algorithms (e.g., based on the approach of Pettis-Hansen), two
chains, X and Y, are first split into three, X1, X2, and Y. Then we
consider all possible ways of gluing the three chains (e.g., X1YX2, X1X2Y,
X2X1Y, X2YX1, YX1X2, YX2X1) and choose the one producing the largest score.
This improves the quality of the final result (the search space is larger)
while keeping the implementation sufficiently fast.
Differential Revision: https://reviews.llvm.org/D113424
A new basic block ordering improving existing MachineBlockPlacement.
The algorithm tries to find a layout of nodes (basic blocks) of a given CFG
optimizing jump locality and thus processor I-cache utilization. This is
achieved via increasing the number of fall-through jumps and co-locating
frequently executed nodes together. The name follows the underlying
optimization problem, Extended-TSP, which is a generalization of classical
(maximum) Traveling Salesmen Problem.
The algorithm is a greedy heuristic that works with chains (ordered lists)
of basic blocks. Initially all chains are isolated basic blocks. On every
iteration, we pick a pair of chains whose merging yields the biggest increase
in the ExtTSP value, which models how i-cache "friendly" a specific chain is.
A pair of chains giving the maximum gain is merged into a new chain. The
procedure stops when there is only one chain left, or when merging does not
increase ExtTSP. In the latter case, the remaining chains are sorted by
density in decreasing order.
An important aspect is the way two chains are merged. Unlike earlier
algorithms (e.g., based on the approach of Pettis-Hansen), two
chains, X and Y, are first split into three, X1, X2, and Y. Then we
consider all possible ways of gluing the three chains (e.g., X1YX2, X1X2Y,
X2X1Y, X2YX1, YX1X2, YX2X1) and choose the one producing the largest score.
This improves the quality of the final result (the search space is larger)
while keeping the implementation sufficiently fast.
Differential Revision: https://reviews.llvm.org/D113424
The earlier usage of wouldInstructionBeTriviallyDead is based on the
assumption that the use_count of that instruction being checked will be
zero. This patch separates the API into two different ones:
1. The strictly conservative one where the instruction is trivially dead iff the uses are dead.
2. The slightly relaxed form, where an instruction is dead along paths where it is not used.
The second form can be used in identifying instructions that are valid
to sink down to uses (D109917).
Reviewed-By: reames
Differential Revision: https://reviews.llvm.org/D114647
This is a continuation of D109860 and D109903.
An important challenge for profile inference is caused by the fact that the
sample profile is collected on a fully optimized binary, while the block and
edge frequencies are consumed on an early stage of the compilation that operates
with a non-optimized IR. As a result, some of the basic blocks may not have
associated sample counts, and it is up to the algorithm to deduce missing
frequencies. The problem is illustrated in the figure where three basic
blocks are not present in the optimized binary and hence, receive no samples
during profiling.
We found that it is beneficial to treat all such blocks equally. Otherwise the
compiler may decide that some blocks are “cold” and apply undesirable
optimizations (e.g., hot-cold splitting) regressing the performance. Therefore,
we want to distribute the counts evenly along the blocks with missing samples.
This is achieved by a post-processing step that identifies "dangling" subgraphs
consisting of basic blocks with no sampled counts; once the subgraphs are
found, we rebalance the flow so as every branch probability is 50:50 within the
subgraphs.
Our experiments indicate up to 1% performance win using the optimization on
some binaries and a significant improvement in the quality of profile counts
(when compared to ground-truth instrumentation-based counts)
{F19093045}
Reviewed By: hoy
Differential Revision: https://reviews.llvm.org/D109980
This is a continuation of D109860.
Traditional flow-based algorithms cannot guarantee that the resulting edge
frequencies correspond to a *connected* flow in the control-flow graph. For
example, for an instance in the attached figure, a flow-based (or any other)
inference algorithm may produce an output in which the hot loop is disconnected
from the entry block (refer to the rightmost graph in the figure). Furthermore,
creating a connected minimum-cost maximum flow is a computationally NP-hard
problem. Hence, we apply a post-processing adjustments to the computed flow
by connecting all isolated flow components ("islands").
This feature helps to keep all blocks with sample counts connected and results
in significant performance wins for some binaries.
{F19077343}
Reviewed By: hoy
Differential Revision: https://reviews.llvm.org/D109903
When doing load/store promotion within LICM, if we
cannot prove that it is safe to sink the store we won't
hoist the load, even though we can prove the load could
be dereferenced and moved outside the loop. This patch
implements the load promotion by moving it in the loop
preheader by inserting proper PHI in the loop. The store
is kept as is in the loop. By doing this, we avoid doing
the load from a memory location in each iteration.
Please consider this small example:
loop {
var = *ptr;
if (var) break;
*ptr= var + 1;
}
After this patch, it will be:
var0 = *ptr;
loop {
var1 = phi (var0, var2);
if (var1) break;
var2 = var1 + 1;
*ptr = var2;
}
This addresses some problems from [0].
[0] https://bugs.llvm.org/show_bug.cgi?id=51193
Differential revision: https://reviews.llvm.org/D113289
Add support for memset_pattern{4,8} similar to the existing
memset_pattern16 handling.
Reviewed By: ab
Differential Revision: https://reviews.llvm.org/D114883
This fixes the assertion failure reported in https://reviews.llvm.org/D114889#3166417,
by making RecursivelyDeleteTriviallyDeadInstructionsPermissive()
more permissive. As the function accepts a WeakTrackingVH, even if
originally only Instructions were inserted, we may end up with
different Value types after a RAUW operation. As such, we should
not assume that the vector only contains instructions.
Notably this matches the behavior of the
RecursivelyDeleteTriviallyDeadInstructions() function variant which
accepts a single value rather than vector.
`memcpy_chk` can be treated like `memcpy`, with the exception that it
may not return (if it aborts the program).
See D114793 for a similar patch for `memset_chk`.
Reviewed By: xbolva00
Differential Revision: https://reviews.llvm.org/D114863
Previously we missed cloning metadata on function declarations because
we don't call CloneFunctionInto() on declarations in CloneModule().
Reviewed By: dexonsmith
Differential Revision: https://reviews.llvm.org/D113812
The benefits of sampling-based PGO crucially depends on the quality of profile
data. This diff implements a flow-based algorithm, called profi, that helps to
overcome the inaccuracies in a profile after it is collected.
Profi is an extended and significantly re-engineered classic MCMF (min-cost
max-flow) approach suggested by Levin, Newman, and Haber [2008, Complementing
missing and inaccurate profiling using a minimum cost circulation algorithm]. It
models profile inference as an optimization problem on a control-flow graph with
the objectives and constraints capturing the desired properties of profile data.
Three important challenges that are being solved by profi:
- "fixing" errors in profiles caused by sampling;
- converting basic block counts to edge frequencies (branch probabilities);
- dealing with "dangling" blocks having no samples in the profile.
The main implementation (and required docs) are in SampleProfileInference.cpp.
The worst-time complexity is quadratic in the number of blocks in a function,
O(|V|^2). However a careful engineering and extensive evaluation shows that
the running time is (slightly) super-linear. In particular, instances with
1000 blocks are solved within 0.1 second.
The algorithm has been extensively tested internally on prod workloads,
significantly improving the quality of generated profile data and providing
speedups in the range from 0% to 5%. For "smaller" benchmarks (SPEC06/17), it
generally improves the performance (with a few outliers) but extra work in
the compiler might be needed to re-tune existing optimization passes relying on
profile counts.
UPD Dec 1st 2021:
- synced the declaration and definition of the option `SampleProfileUseProfi ` to use type `cl::opt<bool`;
- added `inline` for `SampleProfileInference<BT>::findUnlikelyJumps` and `SampleProfileInference<BT>::isExit` to avoid linking problems on windows.
Reviewed By: wenlei, hoy
Differential Revision: https://reviews.llvm.org/D109860
When doing load/store promotion within LICM, if we
cannot prove that it is safe to sink the store we won't
hoist the load, even though we can prove the load could
be dereferenced and moved outside the loop. This patch
implements the load promotion by moving it in the loop
preheader by inserting proper PHI in the loop. The store
is kept as is in the loop. By doing this, we avoid doing
the load from a memory location in each iteration.
Please consider this small example:
loop {
var = *ptr;
if (var) break;
*ptr= var + 1;
}
After this patch, it will be:
var0 = *ptr;
loop {
var1 = phi (var0, var2);
if (var1) break;
var2 = var1 + 1;
*ptr = var2;
}
This addresses some problems from [0].
[0] https://bugs.llvm.org/show_bug.cgi?id=51193
Differential revision: https://reviews.llvm.org/D113289
The memset_chk library function should match memset's attributes with
respect of memory effects (argmemonly, writeonly). It also does not
raise exceptions. It may not return, in case it aborts the program.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D114793
The basic problem we have is that we're trying to reuse an instruction which is mapped to some SCEV. Since we can have multiple such instructions (potentially with different flags), this is analogous to our need to drop flags when performing CSE. A trivial implementation would simply drop flags on any instruction we decided to reuse, and that would be correct.
This patch is almost that trivial patch except that we preserve flags on the reused instruction when existing users would imply UB on overflow already. Adding new users can, at most, refine this program to one which doesn't execute UB which is valid.
In practice, this fixes two conceptual problems with the previous code: 1) a binop could have been canonicalized into a form with different opcode or operands, or 2) the inbounds GEP case which was simply unhandled.
On the test changes, most are pretty straight forward. We loose some flags (in some cases, they'd have been dropped on the next CSE pass anyways). The one that took me the longest to understand was the ashr-expansion test. What's happening there is that we're considering reuse of the mul, previously we disallowed it entirely, now we allow it with no flags. The surrounding diffs are all effects of generating the same mul with a different operand order, and then doing simple DCE.
The loss of the inbounds is unfortunate, but even there, we can recover most of those once we actually treat branch-on-poison as immediate UB.
Differential Revision: https://reviews.llvm.org/D112734
This solves a problem with non-deterministic output from opt due
to not performing dominator tree updates in a deterministic order.
The problem that was analysed indicated that JumpThreading was using
the DomTreeUpdater via llvm::MergeBasicBlockIntoOnlyPred. When
preparing the list of updates to send to DomTreeUpdater::applyUpdates
we iterated over a SmallPtrSet, which didn't give a well-defined
order of updates to perform.
The added domtree-updates.ll test case is an example that would
result in non-deterministic printouts of the domtree. Semantically
those domtree:s are equivalent, but it show the fact that when we
use the domtree iterator the order in which nodes are visited depend
on the order in which dominator tree updates are performed.
Since some passes (at least EarlyCSE) are iterating over nodes in the
dominator tree in a similar fashion as the domtree printer, then the
order in which transforms are applied by such passes, transitively,
also depend on the order in which dominator tree updates are
performed. And taking EarlyCSE as an example the end result could be
different depending on in which order the transforms are applied.
Reviewed By: nikic, kuhar
Differential Revision: https://reviews.llvm.org/D110292
The benefits of sampling-based PGO crucially depends on the quality of profile
data. This diff implements a flow-based algorithm, called profi, that helps to
overcome the inaccuracies in a profile after it is collected.
Profi is an extended and significantly re-engineered classic MCMF (min-cost
max-flow) approach suggested by Levin, Newman, and Haber [2008, Complementing
missing and inaccurate profiling using a minimum cost circulation algorithm]. It
models profile inference as an optimization problem on a control-flow graph with
the objectives and constraints capturing the desired properties of profile data.
Three important challenges that are being solved by profi:
- "fixing" errors in profiles caused by sampling;
- converting basic block counts to edge frequencies (branch probabilities);
- dealing with "dangling" blocks having no samples in the profile.
The main implementation (and required docs) are in SampleProfileInference.cpp.
The worst-time complexity is quadratic in the number of blocks in a function,
O(|V|^2). However a careful engineering and extensive evaluation shows that
the running time is (slightly) super-linear. In particular, instances with
1000 blocks are solved within 0.1 second.
The algorithm has been extensively tested internally on prod workloads,
significantly improving the quality of generated profile data and providing
speedups in the range from 0% to 5%. For "smaller" benchmarks (SPEC06/17), it
generally improves the performance (with a few outliers) but extra work in
the compiler might be needed to re-tune existing optimization passes relying on
profile counts.
Reviewed By: wenlei, hoy
Differential Revision: https://reviews.llvm.org/D109860
The benefits of sampling-based PGO crucially depends on the quality of profile
data. This diff implements a flow-based algorithm, called profi, that helps to
overcome the inaccuracies in a profile after it is collected.
Profi is an extended and significantly re-engineered classic MCMF (min-cost
max-flow) approach suggested by Levin, Newman, and Haber [2008, Complementing
missing and inaccurate profiling using a minimum cost circulation algorithm]. It
models profile inference as an optimization problem on a control-flow graph with
the objectives and constraints capturing the desired properties of profile data.
Three important challenges that are being solved by profi:
- "fixing" errors in profiles caused by sampling;
- converting basic block counts to edge frequencies (branch probabilities);
- dealing with "dangling" blocks having no samples in the profile.
The main implementation (and required docs) are in SampleProfileInference.cpp.
The worst-time complexity is quadratic in the number of blocks in a function,
O(|V|^2). However a careful engineering and extensive evaluation shows that
the running time is (slightly) super-linear. In particular, instances with
1000 blocks are solved within 0.1 second.
The algorithm has been extensively tested internally on prod workloads,
significantly improving the quality of generated profile data and providing
speedups in the range from 0% to 5%. For "smaller" benchmarks (SPEC06/17), it
generally improves the performance (with a few outliers) but extra work in
the compiler might be needed to re-tune existing optimization passes relying on
profile counts.
Reviewed By: wenlei, hoy
Differential Revision: https://reviews.llvm.org/D109860
The if-check above deleted part guarantees that StoreOffset <= LoadOffset
and that StoreOffset + StoreSize >= LoadOffset + LoadSize, and given that
LoadOffset + LoadSize > LoadOffset when LoadSize > 0. Thus, this shows
StoreOffset + StoreSize > LoadOffset is guaranteed given LoadSize > 0,
while it could be meaningless to have a type with nonpositive size, so that
the check could be removed. The values are converted to signed types to
avoid unsigned operation with negative offsets.
Part of revision D100179
Reapply commit c35e8185d8 with fixing problem
reported by mstorsjo
There are still another 2 uses of PointerType::getElementType in
Evaluator when evaluating BitCast's on pointers. BitCast's on pointers
should be removed when opaque ptr is ready, so I just keep them as is.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D114131
This change is mostly about getting rid of some "uninteresting" cases in a follow on deeper heuristic change. If anyone sees actually interesting code differences out of this, please let me know. I'm not expecting this to have much impact at all.
Case 1 - With the single deoptimize non-latch exit, we can't have two exiting blocks sharing an exit block. We can only hit this with a poorly documented debug flag.
Case 2 - Why should we treat epilog cases differently from prolog cases? Or to say it differently, why should starting with a constant control whether a multiple exit loop gets unrolled?
Sorry for the lack of tests here. These are both *exceedingly* narrow cases in practice, and after a while trying, I couldn't come up with a test which did anything "useful" as opposed to simply exercise a random combination of force flags. Note that the legality cases for each are already exercised with force flags.
All of the interesting logic from this routine has been removed, inline the single check into the sole non-assert caller. The assert use has little value with the restructured code and is simply dropped.
This reverts commit c35e8185d8.
mstorsjo reported in the revision thread that one VNCoercion assertion
is violated and seemly in relate to this commit. As per "If a test case
that demonstrates a problem is reported in the commit thread, please
revert and investigate offline", this commit is reverted.
ProfileCount could model invalid values, but a user had no indication
that the getCount method could return bogus data. Optional<ProfileCount>
addresses that, because the user must dereference the optional. In
addition, the patch removes concept duplication.
Differential Revision: https://reviews.llvm.org/D113839
The if-check above deleted part guarantees that StoreOffset <= LoadOffset
and that StoreOffset + StoreSize >= LoadOffset + LoadSize, and given that
LoadOffset + LoadSize > LoadOffset when LoadSize > 0. Thus, this shows
StoreOffset + StoreSize > LoadOffset is guaranteed given LoadSize > 0,
while it could be meaningless to have a type with nonpositive size, so that
the check could be removed.
Part of revision D100179
Reviewed By: nikic
This is one of those wonderful "in theory X doesn't matter, but in practice is does" changes. In this particular case, we shift the IVs inserted by the runtime unroller to clamp iteration count of the loops* from decrementing to incrementing.
Why does this matter? A couple of reasons:
* SCEV doesn't have a native subtract node. Instead, all subtracts (A - B) are represented as A + -1 * B and drops any flags invalidated by such. As a result, SCEV is slightly less good at reasoning about edge cases involving decrementing addrecs than incrementing ones. (You can see this in the inferred flags in some of the test cases.)
* Other parts of the optimizer produce incrementing IVs, and they're common in idiomatic source language. We do have support for reversing IVs, but in general if we produce one of each, the pair will persist surprisingly far through the optimizer before being coalesced. (You can see this looking at nearby phis in the test cases.)
Note that if the hardware prefers decrementing (i.e. zero tested) loops, LSR should convert back immediately before codegen.
* Mostly irrelevant detail: The main loop of the prolog case is handled independently and will simple use the original IV with a changed start value. We could in theory use this scheme for all iteration clamping, but that's a larger and more invasive change.
The unrolling code was previously inserting new cloned blocks at the end of the function. The result of this with typical loop structures is that the new iterations are placed far from the initial iteration.
With unrolling, the general assumption is that the a) the loop is reasonable hot, and b) the first Count-1 copies of the loop are rarely (if ever) loop exiting. As such, placing Count-1 copies out of line is a fairly poor code placement choice. We'd much rather fall through into the hot (non-exiting) path. For code with branch profiles, later layout would fix this, but this may have a positive impact on non-PGO compiled code.
However, the real motivation for this change isn't performance. Its readability and human understanding. Having to jump around long distances in an IR file to trace an unrolled loop structure is error prone and tedious.
Without this patch, passingValueIsAlwaysUndefined will iterate over all
instructions from I to the end of the basic block, even if the use is
outside the block.
This patch adds an early bail out, if the use instruction is outside I's
BB. This can greatly reduce compile-time in cases where very large basic
blocks are involved, with a large number of PHI nodes and incoming
values.
Note that the refactoring makes the handling of the case where I is a
phi and Use is in PHI more explicit as well: for phi nodes, we can also
directly bail out. In the existing code, we would iterate until we reach
the end and return false.
Based on an earlier patch by Matt Wala.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D113293
This is a fix for test failures on expensive checks build caused by db289340c8.
With LLVM_ENABLE_EXPENSIVE_CHECKS enabled the llvm::sort shuffles the given container.
However, the sort is only called when the TTI is passed to replaceCongruentIVs.
In the mentioned patch we pass it TTI, so the sort happens. But due to shuffling
equivalent Phis may appear in different order from run to run.
With the stable_sort instead of sort this is impossible - the order of sorted Phis
is preserved.
Extended value is known to be inside range smaller than full one.
Prevent SCCP to mark such value as overdefined.
Fixes PR52253
Differential Revision: https://reviews.llvm.org/D112721
Added support for peeling loops with exits that are followed either by an
unreachable-terminated block or block that has a terminatnig deoptimize call.
All blocks in the sequence must have an unique successor, maybe except
for the last one.
Reviewed By: mkazantsev
Differential Revision: https://reviews.llvm.org/D110922
It's a no-op, no overflow happens ever: https://alive2.llvm.org/ce/z/Zw89rZ
While generally i don't like such hacks,
we have a very good reason to do this: here we are expanding
a run-time correctness check for the vectorization,
and said `umul_with_overflow` will not be optimized out
before we query the cost of the checks we've generated.
Which means, the cost of run-time checks would be artificially inflated,
and after https://reviews.llvm.org/D109368 that will affect
the minimal trip count for which these checks are even evaluated.
And if they aren't even evaluated, then the vectorized code
certainly won't be run.
We could consider doing this in IRBuilder, but then we'd need to
also teach `CreateExtractValue()` to look into chain of `insertvalue`'s,
and i'm not sure there's precedent for that.
Refs. https://reviews.llvm.org/D109368#3089809
The function simplifyOnce only calls simplifyOnceImpl and does nothing else.
Having this separate helper makes no sense. Removing it.
Patch by Dmitry Bakunevich!
Differential Revision: https://reviews.llvm.org/D112517
Reviewed By: mkazantsev
When peeling a loop, we assume that the latch has a `br` terminator and that
all loop exits are either terminated with an `unreachable` or have a terminating
deoptimize call. So when we peel off the 1st iteration, we change the IDom of
all loop exits to the peeled copy of `NCD(IDom(Exit), Latch)`. This works now,
but if we add logic to support loops with exits that are followed by a block
with an `unreachable` or a terminating deoptimize call, changing the exit's idom
wouldn't be enough and DT would be broken.
For example, let `Exit1` and `Exit2` are loop exits, and each of them
unconditionally branches to the same `unreachable` terminated block. So neither
of the exits dominates this unreachable block. If we change the IDoms of the
exits to some peeled loop block, we don't update the dominators of the unreachable
block. Currently we just don't get to the peeling logic, saying that we can't peel
such loops.
Previously we stored exits' IDoms in a map before peeling a loop and then, after
peeling off one iteration, we changed their IDoms.
Now we use the same logic not only for exits but for all non-loop blocks dominated
by the loop.
So when we add logic to support peeling loops with exits which branch, for example,
to an unreachable-terminated block, we would update the IDoms not only for exits,
but for their successors.
Patch by Dmitry Makogon!
Differential Revision: https://reviews.llvm.org/D111611
Reviewed By: mkazantsev, nikic
Always insert values into ExprValueMap, and instead skip using them
in SCEVExpander if poison-generating flags have been lost. This
ensures that all values that are in ValueExprMap are also in
ExprValueMap, so we can use the latter to invalidate the former.
This change is probably not entirely NFC for the case where
originally the SCEV had no nowrap flags but they were inferred
later, in which case that would now allow reusing the existing
value for expansion.
Differential Revision: https://reviews.llvm.org/D112389
As this API is now internally offset-based, we can accept a starting
offset and remove the need to create a temporary bitcast+gep
sequence to perform an offset load. The API now mirrors the
ConstantFoldLoadFromConst() API.
As discussed in D112016, our current requirement of speculatability
for ephemeral is overly strict: What we really care about is that
the instruction will be DCEd once the assume is dropped. For that
it is sufficient that the instruction is side-effect free and not
a terminator.
In particular, this allows non-dereferenceable loads to be ephemeral
values.
Differential Revision: https://reviews.llvm.org/D112179
At the moment, rewriteLoopExitValue forgets the current phi node in the
loop that collects phis to rewrite. A few lines after the value is
forgotten, SCEV is used again to analyze incoming values and
potentially expand SCEV expression. This means that another SCEV is
created for PN, before the IR is actually updated in the next loop.
This leads to accessing invalid cached expression in combination with
D71539.
PN should only be changed once the actual incoming exit value is set in
the next loop. Moving invalidation there should ensure that PN is
invalidated in all relevant cases.
Reviewed By: mkazantsev
Differential Revision: https://reviews.llvm.org/D111495
As discussed in:
* https://reviews.llvm.org/D94166
* https://lists.llvm.org/pipermail/llvm-dev/2020-September/145031.html
The GlobalIndirectSymbol class lost most of its meaning in
https://reviews.llvm.org/D109792, which disambiguated getBaseObject
(now getAliaseeObject) between GlobalIFunc and everything else.
In addition, as long as GlobalIFunc is not a GlobalObject and
getAliaseeObject returns GlobalObjects, a GlobalAlias whose aliasee
is a GlobalIFunc cannot currently be modeled properly. Creating
aliases for GlobalIFuncs does happen in the wild (e.g. glibc). In addition,
calling getAliaseeObject on a GlobalIFunc will currently return nullptr,
which is undesirable because it should return the object itself for
non-aliases.
This patch refactors the GlobalIFunc class to inherit directly from
GlobalObject, and removes GlobalIndirectSymbol (while inlining the
relevant parts into GlobalAlias and GlobalIFunc). This allows for
calling getAliaseeObject() on a GlobalIFunc to return the GlobalIFunc
itself, making getAliaseeObject() more consistent and enabling
alias-to-ifunc to be properly modeled in the IR.
I exercised some judgement in the API clients of GlobalIndirectSymbol:
some were 'monomorphized' for GlobalAlias and GlobalIFunc, and
some remained shared (with the type adapted to become GlobalValue).
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D108872
This simplifies the return value of addRuntimeCheck from a pair of
instructions to a single `Value *`.
The existing users of addRuntimeChecks were ignoring the first element
of the pair, hence there is not reason to track FirstInst and return
it.
Additionally all users of addRuntimeChecks use the second returned
`Instruction *` just as `Value *`, so there is no need to return an
`Instruction *`. Therefore there is no need to create a redundant
dummy `and X, true` instruction any longer.
Effectively this change should not impact the generated code because the
redundant AND will be folded by later optimizations. But it is easy to
avoid creating it in the first place and it allows more accurately
estimating the cost of the runtime checks.
When peeling a loop, we assume that the latch has a `br` terminator and
that all loop exits are either terminated with an `unreachable` or have
a terminating deoptimize call. So when we peel off the 1st iteration, we
change the IDom of all loop exits to the peeled copy of
`NCD(IDom(Exit), Latch)`. This works now, but if we add logic to support
loops with exits that are followed by a block with an `unreachable` or a
terminating deoptimize call, changing the exit's idom wouldn't be enough
and DT would be broken.
For example, let `Exit1` and `Exit2` are loop exits, and each of them
unconditionally branches to the same `unreachable` terminated block. So
neither of the exits dominates this unreachable block. If we change the
IDoms of the exits to some peeled loop block, we don't update the
dominators of the unreachable block. Currently we just don't get to the
peeling logic, saying that we can't peel such loops.
With this NFC we just insert edges from cloned exiting blocks to their
exits after peeling each iteration (we accumulate the insertion updates
and then after peeling apply the updates to DT).
This patch was a part of D110922.
Patch by Dmitry Makogon!
Differential Revision: https://reviews.llvm.org/D111611
Reviewed By: mkazantsev
Fixes: https://bugs.llvm.org/show_bug.cgi?id=51841
This patch places an arbitrary limit on the size of DIExpressions that
we will produce via salvaging, for performance reasons. This helps to
fix a performance issue observed in the bug above, in which debug values
would be salvaged hundreds of times, producing expressions with over
1000 elements and causing the compiler to hang. Limiting the size of
debug values that we will produce to 128 largely fixes this issue.
Reviewed By: dblaikie, jmorse
Differential Revision: https://reviews.llvm.org/D110332
Rather than checking for loop nest preheaders upfront in IVUsers,
move this requirement into isSafeToExpand() from SCEVExpander.
Historically, LSR did not check whether SCEVs are safe to expand
and fully relied on IVUsers to validate this. Later, support for
non-expandable SCEVs was added via rigid formulas.
Checking this in isSafeToExpand() makes it more obvious what
exactly this check is guarding against, and avoids the awkward
loop nest scan.
This is a followup to https://reviews.llvm.org/D111493#3055286.
Differential Revision: https://reviews.llvm.org/D111681
This patch continues unblocking optimizations that are blocked by pseudo probe instrumentation.
Not exactly like DbgIntrinsics, PseudoProbe intrinsic has other attributes (such as mayread, maywrite, mayhaveSideEffect) that can block optimizations. The issues fixed are:
- Flipped default param of getFirstNonPHIOrDbg API to skip pseudo probes
- Unblocked CSE by avoiding pseudo probe from clobbering memory SSA
- Unblocked induction variable simpliciation
- Allow empty loop deletion by treating probe intrinsic isDroppable
- Some refactoring.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D110847
This patch adds a new cost heuristic that allows peeling a single
iteration off read-only loops, if the loop contains a load that
1. is feeding an exit condition,
2. dominates the latch,
3. is not already known to be dereferenceable,
4. and has a loop invariant address.
If all non-latch exits are terminated with unreachable, such loads
in the loop are guaranteed to be dereferenceable after peeling,
enabling hoisting/CSE'ing them.
This enables vectorization of loops with certain runtime-checks, like
multiple calls to `std::vector::at` if the vector is passed as pointer.
Reviewed By: mkazantsev
Differential Revision: https://reviews.llvm.org/D108114
This patch adds further support for vectorisation of loops that involve
selecting an integer value based on a previous comparison. Consider the
following C++ loop:
int r = a;
for (int i = 0; i < n; i++) {
if (src[i] > 3) {
r = b;
}
src[i] += 2;
}
We should be able to vectorise this loop because all we are doing is
selecting between two states - 'a' and 'b' - both of which are loop
invariant. This just involves building a vector of values that contain
either 'a' or 'b', where the final reduced value will be 'b' if any lane
contains 'b'.
The IR generated by clang typically looks like this:
%phi = phi i32 [ %a, %entry ], [ %phi.update, %for.body ]
...
%pred = icmp ugt i32 %val, i32 3
%phi.update = select i1 %pred, i32 %b, i32 %phi
We already detect min/max patterns, which also involve a select + cmp.
However, with the min/max patterns we are selecting loaded values (and
hence loop variant) in the loop. In addition we only support certain
cmp predicates. This patch adds a new pattern matching function
(isSelectCmpPattern) and new RecurKind enums - SelectICmp & SelectFCmp.
We only support selecting values that are integer and loop invariant,
however we can support any kind of compare - integer or float.
Tests have been added here:
Transforms/LoopVectorize/AArch64/sve-select-cmp.ll
Transforms/LoopVectorize/select-cmp-predicated.ll
Transforms/LoopVectorize/select-cmp.ll
Differential Revision: https://reviews.llvm.org/D108136
This factors out utilities for scanning a bounded block of instructions since we have this code repeated in a bunch of places. The change to InlineFunction isn't strictly NFC as the limit mechanism there didn't handle debug instructions correctly.
Removed obsolete DT verification that should not be there because the
strategy of DT updates has changed.
Differential Revision: https://reviews.llvm.org/D110922