In the current main branch, all cold loops will not be applied non-trivial unswitch. As reported in D129599, skipping these cold loops will incur regression in SPEC benchmark.
Thus, instead of skipping cold loops, now only skipping loops in cold functions.
Reviewed By: alexgatea, aeubanks
Differential Revision: https://reviews.llvm.org/D133275
After https://reviews.llvm.org/rG463aa814182a23 tsan replaces llvm
intrinsics with calls to glibc functions. However this approach is
fragile, as slight changes in pipeline can return llvm intrinsics back.
In particular InstCombine can do that.
Msan/Asan already declare own version of these memory
functions for the similar purpose.
KCSAN, or anything that uses something else than compiler-rt, needs to
implement this callbacks.
Reviewed By: melver
Differential Revision: https://reviews.llvm.org/D133268
This reverts commit fe1f3cfc26.
It looks like this commit breaks building llvm-test-suite.
To reproduce, run `opt -passes=ipsccp` on the IR below.
@g = internal global i32 256, align 4
define void @test() {
entry:
%0 = load i32, ptr @g, align 4
%div = sdiv i32 %0, undef
ret void
}
This pattern is handled more generally in SimplifySelectsFeedingBinaryOp().
Tests to confirm that added to the add.ll test file in the previous commit.
Remove ctx redeclaration.
Format code.
Remove parallel check. Modify tests. Clean-up code.
Fix another test.
Move code to helper functions.
Format file.
Minor fixes.
After https://reviews.llvm.org/rG463aa814182a23 tsan replaces llvm
intrinsics with calls to glibc functions. However this approach is
fragile, as slight changes in pipeline can return llvm intrinsics back.
In particular InstCombine can do that.
Msan/Asan already declare own version of these memory
functions for the similar purpose.
KCSAN, or anything that uses something else than compiler-rt, needs to
implement this callbacks.
Reviewed By: melver
Differential Revision: https://reviews.llvm.org/D133268
This extends the transform added with D81756 to handle div/rem opcodes.
For example:
https://alive2.llvm.org/ce/z/cX6za6
This replicates part of what CVP already does, but the motivating example
from issue #57472 demonstrates a phase ordering problem - we convert
branches to select before CVP runs and miss the transform.
Differential Revision: https://reviews.llvm.org/D133198
SimplifyCFG does some common code hoisting, which is limited
to hoisting a sequence of identical instruction in identical
order and stops at the first non-identical instruction.
This patch allows hoisting instruction pairs over
same-length sequences of non-matching instructions. The
linear asymptotic complexity of the algorithm stays the
same, there's an extra parameter
`simplifycfg-hoist-common-skip-limit` serving to limit
compilation time and/or the size of the hoisted live ranges.
The patch improves SPECv6/525.x264_r by about 10%.
Reviewed By: nikic, dmgreen
Differential Revision: https://reviews.llvm.org/D129370
This used a single check to make sure that the object is both
writable and thread-local. Separate them out to make the
deficiencies in the current code more obvious.
Users of LCSSA may not expect non-phi uses when checking the uses
outside a loop, which may cause crashes. This is due to the fact that we
do not update uses in unreachable blocks.
To ensure all reachable uses outside the loop are phis, update uses in
unreachable blocks to use poison in dead code.
Fixes#57508.
If one of the operands is a transposed splat, the transpose can be
removed.
This is useful to simplify when transposes are distributed to operands
of a matmul:
* k^T -> k
* (A * k)^t -> A^t * k
Differential Revision: https://reviews.llvm.org/D130177
The current code is basically just emulating what the analysis manager does.
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D132581
We are not building up a proper list of load-store candidates because
we are throwing away stores where the type don't match the load.
This patch adds stores with matching store sizes as candidates.
Author of the original patch: David Sherwood.
Differential Revision: https://reviews.llvm.org/D130233
This reverts commit c911befaec.
It has broken LLDB Arm/AArch64 Linux buildbots. I dont really understand
the underlying reason. Reverting for now make buildbot green.
https://reviews.llvm.org/D133036
hasOnlyColdCalls skipped over calls to intrinsics, but it did so after
checking the linkage of the called function. This meant that the presence
of a call to a debug intrinsic could affect the outcome of the
optimization.
In my original reproducer (for an out of tree target) it was particularly
interesting, because the actual IR after GlobalOpt was not different with
debug instrinsics present, so -print-after-all printouts didn't show
anything there.
However, without debuginfo, GlobalOpt went further and ran
BlockFrequencyAnalysis and (more importanly) LoopAnalysis, and later on in
the pipeline, instcombine behaved in different ways when LoopInfo was
present.
So a call to a dbg.declare prevented running LoopAnalysis in
GlobalOpt, which later prevented InstCombine from doing an optimization.
The dbg-intrinsic-loopanalysis.ll testcase tries to expose this.
Then I also noted that adding a dbg.declare actually made the existing
testcase colccc_coldsites.ll generate different code, so I modified that
to now test it behaves the same way with and without the dbg.declare.
Reviewed By: nikic, fhahn
Differential Revision: https://reviews.llvm.org/D133193
Use getPredicateOnEdge method if value is a non-local
compare-with-a-constant instruction, that can give more precise
results than getConstantOnEdge.
Differential Revision: https://reviews.llvm.org/D131956
Currently, we bail out of scalar promotion if the loop may unwind
and the memory may be visible on unwind. This is because we can't
insert stores of the promoted value on unwind edges.
However, nowadays scalar promotion also has support for only
promoting loads, while leaving stores in place. This kind of
promotion is safe even in the presence of unwinding.
Differential Revision: https://reviews.llvm.org/D133111
For noop store of the form of LoadI and StoreI,
An invariant should be kept is that the memory state of the related
MemoryLoc before LoadI is the same as before StoreI.
For this example:
```
define void @pr49927(i32* %q, i32* %p) {
%v = load i32, i32* %p, align 4
store i32 %v, i32* %q, align 4
store i32 %v, i32* %p, align 4
ret void
}
```
Here the definition of the store's destination is different with the
definition of the load's destination, which it seems that the
invariant mentioned above is broken. But the definition of the
store's destination would write a value that is LoadI, actually, the
invariant is still kept. So we can safely ignore it.
Differential Revision: https://reviews.llvm.org/D132657
When we want to add instrumentation after
an instruction, instrumentation still should
keep debug info of the instruction.
Reviewed By: kda, kstoimenov
Differential Revision: https://reviews.llvm.org/D133091
When we do extractvalue (any_mul_with_overflow X, -1) --> (-X and icmp),
which left partly failed to match vector constant with poison element.
This patch try to fix it.
Alive2: https://alive2.llvm.org/ce/z/2rGp_3
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D132996
We currently instrument CallBrInst but do not annotate it with
the branch weight. This patch enables PGO annotation of CallBrInst.
Differential Revision: https://reviews.llvm.org/D133040
Reduces .text size by 1% on our large binary.
On CTMark (-O2 -fsanitize=memory -fsanitize-memory-use-after-dtor -fsanitize-memory-param-retval)
Size -0.4%
Time -0.8%
Reviewed By: kda
Differential Revision: https://reviews.llvm.org/D133071
It's a preparation of to combine shadow checks of the same instruction
Reviewed By: kda, kstoimenov
Differential Revision: https://reviews.llvm.org/D133065
When instrumenting `alloca`s, we use a `SmallSet` (i.e. `SmallPtrSet`). When there are fewer elements than the `SmallSet` size, it behaves like a vector, offering stable iteration order. Once we have too many `alloca`s to instrument, the iteration order becomes unstable. This manifests as non-deterministic builds because of the global constant we create while instrumenting the alloca.
The test added is a simple IR file, but was discovered while building `libcxx/src/filesystem/operations.cpp` from libc++. A reduced C++ example from that:
```
// clang++ -fsanitize=memory -fsanitize-memory-track-origins \
// -fno-discard-value-names -S -emit-llvm \
// -c op.cpp -o op.ll
struct Foo {
~Foo();
};
bool func1(Foo);
void func2(Foo);
void func3(int) {
int f_st, t_st;
Foo f, t;
func1(f) || func1(f) || func1(t) || func1(f) && func1(t);
func2(f);
}
```
Reviewed By: kda
Differential Revision: https://reviews.llvm.org/D133034
This code was relying on a very subtle contract: The expectation
was that for non-allocas, the unwind safety check would already
perform a capture check, so we don't need to perform it later.
This held true when this unwind safety was only handled for allocas
and noalias calls, but became incorrect when byval support was
added.
To avoid this kind of issue, just remove the dependency between the
unwind and thread-safety checks entirely. At worst, this means we
perform a redundant capture check. If this should turn out to be
problematic for compile-time, we can cache that query in a more
explicit way.
The existing predicate doesn't work for a single-element
vector, so make sure we are not crossing scalar/vector types.
Test (was crashing) based on the post-commit example for:
4827771234
When X is a power-of-two or zero and zero input is poison:
ctlz(i32 X) ^ 31 --> cttz(X)
cttz(i32 X) ^ 31 --> ctlz(X)
https://alive2.llvm.org/ce/z/Cs7sFE
Need either follow the original order of the operands for bool logical
ops, or emit freeze instruction to avoid poison propagation.
Differential Revision: https://reviews.llvm.org/D126877
This patch fixes an issue in which CorrelatedValuePropagation::processSRem
would create new instructions to represent the SRem instruction, but would not
correctly copy any existing debug location metadata to the new instruction.
Differential Revision: https://reviews.llvm.org/D132218
This patch moves the cost-based decision whether to use an intrinsic or
library call to the point where the recipe is created. This untangles
code-gen from the cost model and also avoids doing some extra work as
the information is already computed at construction.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D132585
Since D129288, callbr is allowed to have duplicate successors. This patch removes a limitation which prevents optimizations from actually producing such callbrs.
This is probably the riskiest of all the recent callbr changes, because code with incorrect assumptions might be lurking somewhere. I fixed the one case I encountered ahead of time in 8201e3ef5c.
Reviewed By: nickdesaulniers
Differential Revision: https://reviews.llvm.org/D129997
Originally landed as
commit 08860f525a ("[Local] Allow creating callbr with duplicate successors")
Reverted in
commit 1cf6b93df1 ("Revert "[Local] Allow creating callbr with duplicate successors"")
This is NOT nfc. Specifically, the following behavior changes:
* Pointers are now allowed. Both uniform, and constants.
* FP uniform non-constants can now be recognized.
* FP undefs are no longer considered constant. This matches int behavior which we had tests for. FP behavior was untested. Its not clear to me int behavior is reasonable, but it's what tests seem to expect, so go with minimum impact for now.
This simplifies the code and fixes handling for the callbr case,
where the instruction needs to be inserted in the normal
destination, rather than after the terminator.
Originally part of D129660.
Transforms occasionally want to insert an instruction directly
after the definition point of a value. This involves quite a few
different edge cases, e.g. for phi nodes the next insertion point
is not the next instruction, and for invokes and callbrs its not
even in the same block. Additionally, the insertion point may not
exist at all if catchswitch is involved.
This adds a general Instruction::getInsertionPointAfterDef() API to
implement the necessary logic. For now it is used in two places
where this should be mostly NFC. I will follow up with additional
uses where this fixes specific bugs in the existing implementations.
Differential Revision: https://reviews.llvm.org/D129660
For (X op Y) op Z --> (Y op Z) op X
we can still do transform when Y is multi-use. In D131356 limit it to one-use,
this patch remove this limit.
This is still not a complete solution, I add a todo test to show it.
In this case, X and Y are both multi use, we can't differentiate how to convert based on this.
But at least we don't make the code worse,and it can solve half the scenarios.
The pointer operands for the ScatterVectorize node may contain
non-instruction values and they are not checked for "already being
vectorized". Need to check that such pointers are already vectorized and
gather them instead of trying to build vectorize node to avoid compiler
crash.
Differential Revision: https://reviews.llvm.org/D132949
Removed EnableFP parameter in getOperandInfo function since it is not
needed, the operands kinds also controlled by the operation code, which
allows to remove extra check for the type of the operands. Also, added
analysis for uniform constant float values.
This change currently does not trigger any changes in the code since TTI
does not do analysis for constant floats, so it can be considered NFC.
Tested with llvm-test-suite + SPEC2017, no changes.
Differential Revision: https://reviews.llvm.org/D132886
We aleady support the transform: `(X+C1)*CI -> X*CI+C1*CI`
Here the case is a little special as the form of `(X+C1)*CI` is transformed into `(X|C1)*CI`,
so we should also support the transform: `(X|C1)*CI -> X*CI+C1*CI`
Fixes https://github.com/llvm/llvm-project/issues/57278
Reviewed By: bcl5980, spatel, RKSimon
Differential Revision: https://reviews.llvm.org/D132658
Taking the example from the test included in this patch:
$ cat test.cpp -n
1 void fun(int *a, int cond) {
2 if (cond)
3 a[1] = 1;
4 else
5 a[1] = 2;
6 }
mldst-motion will merge and sink the stores in if.then and if.else into
if.end. The resultant PHI, gep and store should be attributed line zero
with the innermost common scope rather than picking a debug location from
one of the original stores.
Reviewed By: djtodoro
Differential Revision: https://reviews.llvm.org/D132741
Current implementation promotes a non-cold function in the SampleFDO profile
into a hot function in the FDO profile. This is too aggressive. This patch
promotes a hot functions in the SampleFDO profile into a hot function, and a
warm function in SampleFDO into a warm function in FDO.
Differential Revision: https://reviews.llvm.org/D132601
I keep finding myself needing to rule this out as a possible source of scalarization, so add debug output like we have for other instructions we decide to scalarize.
This patch changes order of searching for reductions vs other vectorization possibilities.
The idea is if we do not match a reduction it won't be harmful for further attempts to
find vectorizable operations on a vector build sequences. But doing it in the opposite
order we have good chance to ruin opportunity to match a reduction later.
We also don't want to try vectorizing binary operations too early as 2-way vectorization
may effectively prohibit wider ones leading to producing less effective code.
Differential Revision: https://reviews.llvm.org/D132590
This fixes https://github.com/llvm/llvm-project/issues/57336. It was exposed by a recent SCEV change, but appears to have been a long standing issue.
Note that the whole insert into the loop instead of a split exit edge is slightly contrived to begin with; it's there solely because IndVarSimplify preserves the CFG.
Differential Revision: https://reviews.llvm.org/D132571
When estimating the cost of the in-tree vectorized scalars in
buildvector sequences, need to take into account the vectorized
insertelement instruction. The top of the buildvector seuences is the
topmost vectorized insertelement instruction, because it will have
> than 1 use after the vectorization.
For the affected test case improves througput from 21 to 16 (per
llvm-mca).
Differential Revision: https://reviews.llvm.org/D132740
https://alive2.llvm.org/ce/z/j_8Wz9
The arithmetic shift was converted to logical shift with:
246078604c
That does not seem to uncover any other missing/conflicting folds,
so convert directly to signbit test + cast.
We still need to fold the pattern with logical shift to test + cast.
This allows reducing patterns where the output type is not
the same as the input value:
https://alive2.llvm.org/ce/z/nydwFVFixes#57394
The use of std::clamp should be safe here. MinRZ is at most 32, while
kMaxRZ is 1 << 18, so we have MinRZ <= kMaxRZ, avoiding the undefind
behavior of std::clamp.
This addresses a suggestion to simplify the check from D131989. This
also makes it easier to ensure that VPHeaderPHIRecipe::classof checks
for all header phi ids.
This patch replaces calls to greatestCommonDivisor with std::gcd where
both arguments are known to be of unsigned. This means that
std::common_type_t of the two argument types should just be the wider
one of the two.
This patch replaces calls to GreatestCommonDivisor64 with std::gcd
where both arguments are known to be of unsigned types no larger than
64 bits in size.
Add verification that VPHeaderPHIRecipes are only in header VPBBs. Also
adds missing checks for VPPointerInductionRecipe to
VPHeaderPHIRecipe::classof.
Split off from D119661.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D131989