Summary:
The PGO gen/use passes currently fail with an assert failure if there's a
critical edge whose source is an IndirectBr instruction and that edge
needs to be instrumented.
To avoid this in certain cases, split IndirectBr critical edges in the PGO
gen/use passes. This works for blocks with single indirectbr predecessors,
but not for those with multiple indirectbr predecessors (splitting an
IndirectBr critical edge isn't always possible.)
Reviewers: davidxl, xur
Reviewed By: davidxl
Subscribers: efriedma, llvm-commits, mehdi_amini
Differential Revision: https://reviews.llvm.org/D40699
llvm-svn: 320511
Summary:
If we have pattern `store (load(bitcast(select (cmp(V1, V2), &V1,
&V2)))), bitcast)`, but the load is used in other instructions, it leads
to looping in InstCombiner. Patch adds additional check that all users
of the load instructions are stores and then replaces all uses of load
instruction by the new one with new type.
Reviewers: RKSimon, spatel, majnemer
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D41072
llvm-svn: 320510
Summary:
If we have pattern `store (load(bitcast(select (cmp(V1, V2), &V1,
&V2)))), bitcast)`, but the load is used in other instructions, it leads
to looping in InstCombiner. Patch adds additional check that all users
of the load instructions are stores and then replaces all uses of load
instruction by the new one with new type.
Reviewers: RKSimon, spatel, majnemer
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D41072
llvm-svn: 320499
Summary:
If we have pattern `store (load(bitcast(select (cmp(V1, V2), &V1,
&V2)))), bitcast)`, but the load is used in other instructions, it leads
to looping in InstCombiner. Patch adds additional check that all users
of the load instructions are stores and then replaces all uses of load
instruction by the new one with new type.
Reviewers: RKSimon, spatel, majnemer
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D41072
llvm-svn: 320496
Summary:
If we have pattern `store (load(bitcast(select (cmp(V1, V2), &V1,
&V2)))), bitcast)`, but the load is used in other instructions, it leads
to looping in InstCombiner. Patch adds additional check that all users
of the load instructions are stores and then replaces all uses of load
instruction by the new one with new type.
Reviewers: RKSimon, spatel, majnemer
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D41072
llvm-svn: 320488
If we have pattern `store (load(bitcast(select (cmp(V1, V2), &V1,
&V2)))), bitcast)`, but the load is used in other instructions, it leads
to looping in InstCombiner. Patch adds additional check that all users
of the load instructions are stores and then replaces all uses of load
instruction by the new one with new type.
Reviewers: RKSimon, spatel, majnemer
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D41072
llvm-svn: 320483
Summary:
Currently, in InstCombineLoadStoreAlloca, we have simplification
rules for the following cases:
1. load off a null
2. load off a GEP with null base
3. store to a null
This patch adds support for the fourth case which is store into a
GEP with null base. Since this is UB as well (and directly analogous to
the load off a GEP with null base), we can substitute the stored val
with undef in instcombine, so that SimplifyCFG can optimize this code
into unreachable code.
Note: Right now, simplifyCFG hasn't been taught about optimizing
this to unreachable and adding an llvm.trap (this is already done for
the above 3 cases).
Reviewers: majnemer, hfinkel, sanjoy, davide
Reviewed by: sanjoy, davide
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D41026
llvm-svn: 320480
VecValuesToIgnore holds values that will not appear in the vectorized loop.
We should therefore ignore their cost when VF > 1.
Differential Revision: https://reviews.llvm.org/D40883
llvm-svn: 320463
Summary:
This solves PR35616.
We don't want the compiler to generate different code when we compile
with/without -g, so we now ignore debug intrinsics when determining if
the optimization can trigger or not.
Reviewers: junbuml
Subscribers: davide, JDevlieghere, llvm-commits
Differential Revision: https://reviews.llvm.org/D41068
llvm-svn: 320460
The tests fail (opt asserts) on Windows.
> Summary:
> If we have pattern `store (load(bitcast(select (cmp(V1, V2), &V1,
> &V2)))), bitcast)`, but the load is used in other instructions, it leads
> to looping in InstCombiner. Patch adds additional check that all users
> of the load instructions are stores and then replaces all uses of load
> instruction by the new one with new type.
>
> Reviewers: RKSimon, spatel, majnemer
>
> Subscribers: llvm-commits
>
> Differential Revision: https://reviews.llvm.org/D41072
llvm-svn: 320421
Summary:
If we have pattern `store (load(bitcast(select (cmp(V1, V2), &V1,
&V2)))), bitcast)`, but the load is used in other instructions, it leads
to looping in InstCombiner. Patch adds additional check that all users
of the load instructions are stores and then replaces all uses of load
instruction by the new one with new type.
Reviewers: RKSimon, spatel, majnemer
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D41072
llvm-svn: 320407
CreateAddRecFromPHIWithCastsImpl() adds an IncrementNUSW overflow predicate
which allows the PSCEV rewriter to rewrite this scev expression:
(zext i8 {0, + , (trunc i32 step to i8)} to i32)
into
{0, +, (sext i8 (trunc i32 step to i8) to i32)}
But then it adds the wrong Equal predicate:
%step == (zext i8 (trunc i32 %step to i8) to i32).
instead of:
%step == (sext i8 (trunc i32 %step to i8) to i32)
This is fixed here.
Differential Revision: https://reviews.llvm.org/D40641
llvm-svn: 320298
Summary:
This is LLVM instrumentation for the new HWASan tool. It is basically
a stripped down copy of ASan at this point, w/o stack or global
support. Instrumenation adds a global constructor + runtime callbacks
for every load and store.
HWASan comes with its own IR attribute.
A brief design document can be found in
clang/docs/HardwareAssistedAddressSanitizerDesign.rst (submitted earlier).
Reviewers: kcc, pcc, alekseyshl
Subscribers: srhines, mehdi_amini, mgorny, javed.absar, eraman, llvm-commits, hiraditya
Differential Revision: https://reviews.llvm.org/D40932
llvm-svn: 320217
The Debugify pass synthesizes debug info for IR. It's paired with a
CheckDebugify pass which determines how much of the original debug info
is preserved. These passes make it easier to create targeted tests for
debug info preservation.
Here is the Debugify algorithm:
NextLine = 1
for (Instruction &I : M)
attach DebugLoc(NextLine++) to I
NextVar = 1
for (Instruction &I : M)
if (canAttachDebugValue(I))
attach dbg.value(NextVar++) to I
The CheckDebugify pass expects contiguous ranges of DILocations and
DILocalVariables. If it fails to find all of the expected debug info, it
prints a specific error to stderr which can be FileChecked.
This was discussed on llvm-dev in the thread:
"Passes to add/validate synthetic debug info"
Differential Revision: https://reviews.llvm.org/D40512
llvm-svn: 320202
Summary:
If a partially inlined function has debug info, we have to add debug
locations to the call instruction calling the outlined function.
We use the debug location of the first instruction in the outlined
function, as the introduced call transfers control to this statement and
there is no other equivalent line in the source code.
We also use the same debug location for the branch instruction added
to jump from artificial entry block for the outlined function, which just
jumps to the first actual basic block of the outlined function.
Reviewers: davide, aprantl, rriddle, dblaikie, danielcdh, wmi
Reviewed By: aprantl, rriddle, danielcdh
Subscribers: eraman, JDevlieghere, llvm-commits
Differential Revision: https://reviews.llvm.org/D40413
llvm-svn: 320199
Causes unexpected memory issue with New PM this time.
The new PM invalidates BPI but not BFI, leaving the
reference to BPI from BFI invalid.
Abandon this patch. There is a more general solution
which also handles runtime infinite loop (but not statically).
llvm-svn: 320180
Summary:
If we have the code like this:
```
float a, b;
a = std::max(a ,b);
```
it is converted into something like this:
```
%call = call dereferenceable(4) float* @_ZSt3maxIfERKT_S2_S2_(float* nonnull dereferenceable(4) %a.addr, float* nonnull dereferenceable(4) %b.addr)
%1 = bitcast float* %call to i32*
%2 = load i32, i32* %1, align 4
%3 = bitcast float* %a.addr to i32*
store i32 %2, i32* %3, align 4
```
After inlinning this code is converted to the next:
```
%1 = load float, float* %a.addr
%2 = load float, float* %b.addr
%cmp.i = fcmp fast olt float %1, %2
%__b.__a.i = select i1 %cmp.i, float* %a.addr, float* %b.addr
%3 = bitcast float* %__b.__a.i to i32*
%4 = load i32, i32* %3, align 4
%5 = bitcast float* %arrayidx to i32*
store i32 %4, i32* %5, align 4
```
This pattern is not recognized as minmax pattern.
Patch solves this problem by converting sequence
```
store (bitcast, (load bitcast (select ((cmp V1, V2), &V1, &V2))))
```
to a sequence
```
store (,load (select((cmp V1, V2), &V1, &V2)))
```
After this the code is recognized as minmax pattern.
Reviewers: RKSimon, spatel
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D40304
llvm-svn: 320157
As a new access is generated spanning across multiple fields, we need to
propagate alias info from all the fields to form the most generic alias info.
rdar://35602528
Differential Revision: https://reviews.llvm.org/D40617
llvm-svn: 319979
Summary:
An undef extract index can be arbitrarily chosen to be an
out-of-range index value, which would result in the instruction being undef.
This change closes a gap identified while working on lowering vector permute intrinsics
with variable index vectors to pure LLVM IR.
Reviewers: arsenm, spatel, majnemer
Reviewed By: arsenm, spatel
Subscribers: fhahn, nhaehnle, wdng, llvm-commits
Differential Revision: https://reviews.llvm.org/D40231
llvm-svn: 319910
This caused PR35519.
> [memcpyopt] Teach memcpyopt to optimize across basic blocks
>
> This teaches memcpyopt to make a non-local memdep query when a local query
> indicates that the dependency is non-local. This notably allows it to
> eliminate many more llvm.memcpy calls in common Rust code, often by 20-30%.
>
> Fixes PR28958.
>
> Differential Revision: https://reviews.llvm.org/D38374
>
> [memcpyopt] Commit file missed in r319482.
>
> This change was meant to be included with r319482 but was accidentally
> omitted.
llvm-svn: 319873
This uses ConstantRange::makeGuaranteedNoWrapRegion's newly-added handling for subtraction to allow CVP to remove some subtraction overflow checks.
Differential Revision: https://reviews.llvm.org/D40039
llvm-svn: 319807
Summary:
A true or false result is expected from a comparison, but it seems the possibility of undef was overlooked, which could lead to a failed assert. This is fixed by this patch by bailing out if we encounter undef.
The bug is old and the assert has been there since the end of 2014, so it seems this is unusual enough to forego optimization.
Patch by JesperAntonsson.
Reviewers: spatel, eeckstein, hans
Reviewed By: hans
Subscribers: uabelho, llvm-commits
Differential Revision: https://reviews.llvm.org/D40639
llvm-svn: 319768
Follow-up of r316824. This patch supports the vector type for both current and
previous index when factoring out the current one into the previous one.
Differential Revision: https://reviews.llvm.org/D39556
llvm-svn: 319683
(This reapplies r314253. r314253 was reverted on r314482 because of a
correctness regression on P100, but that regression was identified to be
something else.)
Summary:
Don't bail out on constant divisors for divisions that can be narrowed without
introducing control flow . This gives us a 32 bit multiply instead of an
emulated 64 bit multiply in the generated PTX assembly.
Reviewers: jlebar
Subscribers: jholewinski, mcrosier, llvm-commits
Differential Revision: https://reviews.llvm.org/D38265
llvm-svn: 319677
Summary:
Currently, we only support predication for forward loops with step
of 1. This patch enables loop predication for reverse or
countdownLoops, which satisfy the following conditions:
1. The step of the IV is -1.
2. The loop has a singe latch as B(X) = X <pred>
latchLimit with pred as s> or u>
3. The IV of the guard is the decrement
IV of the latch condition (Guard is: G(X) = X-1 u< guardLimit).
This patch was downstream for a while and is the last series of patches
that's from our LP implementation downstream.
Reviewers: apilipenko, mkazantsev, sanjoy
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D40353
llvm-svn: 319659
Turns out we can have comparisons which are indirect users of the induction variable that we can make invariant. In this case, there is no loop invariant value contributing and we'd fail an assert.
The test case was found by a java fuzzer and reduced. It's a real cornercase. You have to have a static loop which we've already proven only executes once, but haven't broken the backedge on, and an inner phi whose result can be constant folded by SCEV using exit count reasoning but not proven by isKnownPredicate. To my knowledge, only the fuzzer has hit this case.
llvm-svn: 319583
These are blocks that haven't not been executed during training. For large
projects this could make a significant difference. For the project, I was
looking at, I got an order of magnitude decrease in the size of the total YAML
files with this and r319235.
Differential Revision: https://reviews.llvm.org/D40678
Re-commit after fixing the failing testcase in rL319576, rL319577 and
rL319578.
llvm-svn: 319581
These are blocks that haven't not been executed during training. For large
projects this could make a significant difference. For the project, I was
looking at, I got an order of magnitude decrease in the size of the total YAML
files with this and r319235.
Differential Revision: https://reviews.llvm.org/D40678
llvm-svn: 319556
It causes builds to fail with "Instruction does not dominate all uses" (PR35497).
> Patch tries to improve vectorization of the following code:
>
> void add1(int * __restrict dst, const int * __restrict src) {
> *dst++ = *src++;
> *dst++ = *src++ + 1;
> *dst++ = *src++ + 2;
> *dst++ = *src++ + 3;
> }
> Allows to vectorize even if the very first operation is not a binary add, but just a load.
>
> Fixed issues related to previous commit.
>
> Reviewers: spatel, mzolotukhin, mkuper, hfinkel, RKSimon, filcab, ABataev
>
> Reviewed By: ABataev, RKSimon
>
> Subscribers: llvm-commits, RKSimon
>
> Differential Revision: https://reviews.llvm.org/D28907
llvm-svn: 319550
Summary:
1/ Operand folding during complex pattern matching for LEAs has been extended, such that it promotes Scale to
accommodate similar operand appearing in the DAG e.g.
T1 = A + B
T2 = T1 + 10
T3 = T2 + A
For above DAG rooted at T3, X86AddressMode will now look like
Base = B , Index = A , Scale = 2 , Disp = 10
2/ During OptimizeLEAPass down the pipeline factorization is now performed over LEAs so that if there is an opportunity
then complex LEAs (having 3 operands) could be factored out e.g.
leal 1(%rax,%rcx,1), %rdx
leal 1(%rax,%rcx,2), %rcx
will be factored as following
leal 1(%rax,%rcx,1), %rdx
leal (%rdx,%rcx) , %edx
3/ Aggressive operand folding for AM based selection for LEAs is sensitive to loops, thus avoiding creation of any complex LEAs within a loop.
4/ Simplify LEA converts (lea (BASE,1,INDEX,0) --> add (BASE, INDEX) which offers better through put.
PR32755 will be taken care of by this pathc.
Previous patch revisions : r313343 , r314886
Reviewers: lsaba, RKSimon, craig.topper, qcolombet, jmolloy, jbhateja
Reviewed By: lsaba, RKSimon, jbhateja
Subscribers: jmolloy, spatel, igorb, llvm-commits
Differential Revision: https://reviews.llvm.org/D35014
llvm-svn: 319543
Summary:
A true or false result is expected from a comparison, but it seems the possibility of undef was overlooked, which could lead to a failed assert. This is fixed by this patch by bailing out if we encounter undef.
The bug is old and the assert has been there since the end of 2014, so it seems this is unusual enough to forego optimization.
Patch by: JesperAntonsson
Reviewers: spatel, eeckstein, hans
Reviewed By: hans
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D40639
llvm-svn: 319537
Patch tries to improve vectorization of the following code:
void add1(int * __restrict dst, const int * __restrict src) {
*dst++ = *src++;
*dst++ = *src++ + 1;
*dst++ = *src++ + 2;
*dst++ = *src++ + 3;
}
Allows to vectorize even if the very first operation is not a binary add, but just a load.
Fixed issues related to previous commit.
Reviewers: spatel, mzolotukhin, mkuper, hfinkel, RKSimon, filcab, ABataev
Reviewed By: ABataev, RKSimon
Subscribers: llvm-commits, RKSimon
Differential Revision: https://reviews.llvm.org/D28907
llvm-svn: 319531
If the thin module has no references to an internal global in the
merged module, we need to make sure to preserve that property if the
global is a member of a comdat group, as otherwise promotion can end
up adding global symbols to the comdat, which is not allowed.
This situation can arise if the external global in the thin module
has dead constant users, which would cause use_empty() to return
false and would cause us to try to promote it. To prevent this from
happening, discard the dead constant users before asking whether a
global is empty.
Differential Revision: https://reviews.llvm.org/D40593
llvm-svn: 319494
This teaches memcpyopt to make a non-local memdep query when a local query
indicates that the dependency is non-local. This notably allows it to
eliminate many more llvm.memcpy calls in common Rust code, often by 20-30%.
Fixes PR28958.
Differential Revision: https://reviews.llvm.org/D38374
llvm-svn: 319482
Currently, SROA splits loads and stores only when they are accessing the whole alloca.
This patch relaxes this limitation to allow splitting a load/store if all other loads and stores to the alloca are disjoint to or fully included in the current load/store. If there is no other load or store that crosses the boundary of the current load/store, the current splitting implementation works as is.
The whole-alloca loads and stores meet this new condition and so they are still splittable.
Here is a simplified motivating example.
struct record {
long long a;
int b;
int c;
};
int func(struct record r) {
for (int i = 0; i < r.c; i++)
r.b++;
return r.b;
}
When updating r.b (or r.c as well), LLVM generates redundant instructions on some platforms (such as x86_64, ppc64); here, r.b and r.c are packed into one 64-bit GPR when the struct is passed as a method argument.
With this patch, the above example is compiled into only few instructions without loop.
Without the patch, unnecessary loop-carried dependency is introduced by SROA and the loop cannot be eliminated by the later optimizers.
Differential Revision: https://reviews.llvm.org/D32998
llvm-svn: 319407
If common type is different we should bail out due to we will not be
able to create a select or Phi of these values.
Basically it is done in ExtAddrMode::compare however it does not work
if we handle the null first and then two values of different types.
so add a check in initializeMap as well. The check in ExtAddrMode::compare
is used as earlier bail out.
Reviewers: reames, john.brawn
Reviewed By: john.brawn
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D40479
llvm-svn: 319292
The core idea is to (re-)introduce some redundancies where their cost is
hidden by the cost of materializing immediates for constant operands of
PHI nodes. When the cost of the redundancies is covered by this,
avoiding materializing the immediate has numerous benefits:
1) Less register pressure
2) Potential for further folding / combining
3) Potential for more efficient instructions due to immediate operand
As a motivating example, consider the remarkably different cost on x86
of a SHL instruction with an immediate operand versus a register
operand.
This pattern turns up surprisingly frequently, but is somewhat rarely
obvious as a significant performance problem.
The pass is entirely target independent, but it does rely on the target
cost model in TTI to decide when to speculate things around the PHI
node. I've included x86-focused tests, but any target that sets up its
immediate cost model should benefit from this pass.
There is probably more that can be done in this space, but the pass
as-is is enough to get some important performance on our internal
benchmarks, and should be generally performance neutral, but help with
more extensive benchmarking is always welcome.
One awkward part is that this pass has to be scheduled after
*everything* that can eliminate these kinds of redundancies. This
includes SimplifyCFG, GVN, etc. I'm open to suggestions about better
places to put this. We could in theory make it part of the codegen pass
pipeline, but there doesn't really seem to be a good reason for that --
it isn't "lowering" in any sense and only relies on pretty standard cost
model based TTI queries, so it seems to fit well with the "optimization"
pipeline model. Still, further thoughts on the pipeline position are
welcome.
I've also only implemented this in the new pass manager. If folks are
very interested, I can try to add it to the old PM as well, but I didn't
really see much point (my use case is already switched over to the new
PM).
I've tested this pretty heavily without issue. A wide range of
benchmarks internally show no change outside the noise, and I don't see
any significant changes in SPEC either. However, the size class
computation in tcmalloc is substantially improved by this, which turns
into a 2% to 4% win on the hottest path through tcmalloc for us, so
there are definitely important cases where this is going to make
a substantial difference.
Differential revision: https://reviews.llvm.org/D37467
llvm-svn: 319164
Summary:
I think we do not need to analyze debug intrinsics here, as they should
not impact codegen. This has 2 benefits: 1) slightly less work to do and
2) avoiding generating optimization remarks for converting calls to
debug intrinsics to tail calls, which are not really helpful for users.
Based on work by Sander de Smalen.
Reviewers: davide, trentxintong, aprantl
Reviewed By: aprantl
Subscribers: llvm-commits, JDevlieghere
Tags: #debug-info
Differential Revision: https://reviews.llvm.org/D40440
llvm-svn: 319158
This is to address a problem similar to those in D37460 for Scalar PRE. We should not
PRE across an instruction that may not pass execution to its successor unless it is safe
to speculatively execute it.
Differential Revision: https://reviews.llvm.org/D38619
llvm-svn: 319147
Revert "[SROA] Propagate !range metadata when moving loads."
Revert "[Mem2Reg] Clang-format unformatted parts of this file. NFCI."
Davide says they broke a bot.
llvm-svn: 319131
This tries to propagate !range metadata to a pre-existing load
when a load is optimized out. This is done instead of adding an
assume because converting loads to and from assumes creates a
lot of IR.
Patch by Ariel Ben-Yehuda.
Differential Revision: https://reviews.llvm.org/D37216
llvm-svn: 319096
enum TailCallKind { TCK_None = 0, TCK_Tail = 1, TCK_MustTail = 2,
TCK_NoTail = 3 };
TCK_NoTail is greater than TCK_Tail so taking the min does not do the
correct thing.
rdar://35639547
llvm-svn: 319075
The current way that trivial addressing modes are detected incorrectly thinks
that null pointers are non-trivial, leading to an infinite loop where we keep
duplicating the same select. Fix this by aware of null when deciding if an
addressing mode is trivial.
Differential Revision: https://reviews.llvm.org/D40447
llvm-svn: 319019
CodeGenPrepare sinks address computations from one basic block to another
and attempts to reuse address computations that have already been sunk. If
the same address computation appears twice with the first instance as an
operand of a load whose result is an operand to a simplifable select,
CodeGenPrepare simplifies the select and recursively erases the now dead
instructions. CodeGenPrepare then attempts to use the erased address
computation for the second load.
Fix this by erasing the cached address value if it has zero uses before
looking for the address value in the sunken address map.
This partially resolves PR35209.
Thanks to Alexander Richardson for reporting the issue!
This fixed version relands r318032 which was reverted in r318049 due to
sanitizer buildbot failures.
Reviewers: john.brawn
Differential Revision: https://reviews.llvm.org/D39841
llvm-svn: 318956
This patch extends the recent work in optimizeMemoryInst to make it able to
combine more ExtAddrMode fields than just the BaseReg.
This fixes some benchmark regressions introduced by r309397, where GVN PRE is
hoisting a getelementptr such that it can no longer be combined into the
addressing mode of the load or store that uses it.
Differential Revision: https://reviews.llvm.org/D38133
llvm-svn: 318949
After the dataflow algorithm proves that an argument is constant,
it replaces it value with the integer constant and drops the lattice
value associated to the DEF.
e.g. in the example we have @f() that's called twice:
call @f(undef, ...)
call @f(2, ...)
`undef` MEET 2 = 2 so we replace the argument and all its uses with
the constant 2.
Shortly after, tryToReplaceWithConstantRange() tries to get the lattice
value for the argument we just replaced, causing an assertion.
This function is a little peculiar as it runs when we're doing replacement
and not as part of the solver but still queries the solver.
The fix is that of checking whether we replaced the value already and
get a temporary lattice value for the constant.
Thanks to Zhendong Su for the report!
Fixes PR35357.
llvm-svn: 318817
properlyDominates() shouldn't be used as sort key. It causes different output between stdlibc++ and libc++.
Instead, I introduced RPOT. In most cases, it works for CSE.
llvm-svn: 318743
Summary:
Add the following heuristics for irreducible loop metadata:
- When an irreducible loop header is missing the loop header weight metadata,
give it the minimum weight seen among other headers.
- Annotate indirectbr targets with the loop header weight metadata (as they are
likely to become irreducible loop headers after indirectbr tail duplication.)
These greatly improve the accuracy of the block frequency info of the Python
interpreter loop (eg. from ~3-16x off down to ~40-55% off) and the Python
performance (eg. unpack_sequence from ~50% slower to ~8% faster than GCC) due to
better register allocation under PGO.
Reviewers: davidxl
Reviewed By: davidxl
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D39980
llvm-svn: 318693
Summary:
SROA can fail in rewriting alloca but still rewrite a phi resulting
in dead instruction elimination. The Changed flag was not being set
correctly, resulting in downstream passes using stale analyses.
The included test case will assert during the second BDCE pass as a
result.
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D39921
llvm-svn: 318677
This patch adds a new abstraction layer to VPlan and leverages it to model the planned
instructions that manipulate masks (AND, OR, NOT), introduced during predication.
The new VPValue and VPUser classes model how data flows into, through and out
of a VPlan, forming the vertices of a planned Def-Use graph. The new
VPInstruction class is a generic single-instruction Recipe that models a
planned instruction along with its opcode, operands and users. See
VectorizationPlan.rst for more details.
Differential Revision: https://reviews.llvm.org/D38676
llvm-svn: 318645
In rL316552, we ban intersection of unsigned latch range with signed range check and vice
versa, unless the entire range check iteration space is known positive. It was a correct
functional fix that saved us from dealing with ambiguous values, but it also appeared
to be a very restrictive limitation. In particular, in the following case:
loop:
%iv = phi i32 [ 0, %preheader ], [ %iv.next, %latch]
%iv.offset = add i32 %iv, 10
%rc = icmp slt i32 %iv.offset, %len
br i1 %rc, label %latch, label %deopt
latch:
%iv.next = add i32 %iv, 11
%cond = icmp i32 ult %iv.next, 100
br it %cond, label %loop, label %exit
Here, the unsigned iteration range is `[0, 100)`, and the safe range for range
check is `[-10, %len - 10)`. For unsigned iteration spaces, we use unsigned
min/max functions for range intersection. Given this, we wanted to avoid dealing
with `-10` because it is interpreted as a very big unsigned value. Semantically, range
check's safe range goes through unsigned border, so in fact it is two disjoint
ranges in IV's iteration space. Intersection of such ranges is not trivial, so we prohibited
this case saying that we are not allowed to intersect such ranges.
What semantics of this safe range actually means is that we can start from `-10` and go
up increasing the `%iv` by one until we reach `%len - 10` (for simplicity let's assume that
`%len - 10` is a reasonably big positive value).
In particular, this safe iteration space includes `0, 1, 2, ..., %len - 11`. So if we were able to return
safe iteration space `[0, %len - 10)`, we could safely intersect it with IV's iteration space. All
values in this range are non-negative, so using signed/unsigned min/max for them is unambiguous.
In this patch, we alter the algorithm of safe range calculation so that it returnes a subset of the
original safe space which is represented by one continuous range that does not go through wrap.
In order to reach this, we use modified SCEV substraction function. It can be imagined as a function
that substracts by `1` (or `-1`) as long as the further substraction does not cause a wrap in IV iteration
space. This allows us to perform IRCE in many situations when we deal with IV space and range check
of different types (in terms of signed/unsigned).
We apply this approach for both matching and not matching types of IV iteration space and the
range check. One implication of this is that now IRCE became smarter in detection of empty safe
ranges. For example, in this case:
loop:
%iv = phi i32 [ %begin, %preheader ], [ %iv.next, %latch]
%iv.offset = sub i32 %iv, 10
%rc = icmp ult i32 %iv.offset, %len
br i1 %rc, label %latch, label %deopt
latch:
%iv.next = add i32 %iv, 11
%cond = icmp i32 ult %iv.next, 100
br it %cond, label %loop, label %exit
If `%len` was less than 10 but SCEV failed to trivially prove that `%begin - 10 >u %len- 10`,
we could end up executing entire loop in safe preloop while the main loop was still generated,
but never executed. Now, cutting the ranges so that if both `begin - 10` and `%len - 10` overflow,
we have a trivially empty range of `[0, 0)`. This in some cases prevents us from meaningless optimization.
Differential Revision: https://reviews.llvm.org/D39954
llvm-svn: 318639
We must collect all AddModes even if they are the same.
This is due to Original value is different but we need all original
values collected as they are used as anchors in common phi finding.
Reviewers: john.brawn, reames
Reviewed By: john.brawn
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D40166
llvm-svn: 318638
As the first test shows, we could transform an llvm intrinsic which never sets errno
into a libcall which could set errno (even though it's marked readnone?), so that's
not ideal.
It's possible that we can also transform a libcall which could set errno to an
intrinsic given the fast-math-flags constraint, but that's deferred to determine
exactly which set of FMF are needed.
Differential Revision: https://reviews.llvm.org/D40150
llvm-svn: 318628