The symptom is that an assertion is triggered. The assertion was added by
me to detect the situation when value is propagated from dead blocks.
(We can certainly get rid of assertion; it is safe to do so, because propagating
value from dead block to alive join node is certainly ok.)
The root cause of this bug is : edge-splitting is conducted on the fly,
the edge being split could be a dead edge, therefore the block that
split the critial edge needs to be flagged "dead" as well.
There are 3 ways to fix this bug:
1) Get rid of the assertion as I mentioned eariler
2) When an dead edge is split, flag the inserted block "dead".
3) proactively split the critical edges connecting dead and live blocks when
new dead blocks are revealed.
This fix go for 3) with additional 2 LOC.
Testing case was added by Rafael the other day.
llvm-svn: 194424
LoopUnswitch's code simplification routine has logic to convert conditional
branches into unconditional branches, after unswitching makes the condition
constant, and then remove any blocks that renders dead. Unfortunately, this
code is dead, currently broken, and furthermore, has never been alive (at least
as far back at 2006).
No functionality change intended.
llvm-svn: 194277
As with the other loop unrolling parameters (the unrolling threshold, partial
unrolling, etc.) runtime unrolling can now also be controlled via the
constructor. This will be necessary for moving non-trivial unrolling late in
the pass manager (after loop vectorization).
No functionality change intended.
llvm-svn: 194027
Partial fix for PR17459: wrong code at -O3 on x86_64-linux-gnu
(affecting trunk and 3.3)
When SCEV expands a recurrence outside of a loop it attempts to scale
by the stride of the recurrence. Chained recurrences don't work that
way. We could compute binomial coefficients, but would hve to
guarantee that the chained AddRec's are in a perfectly reduced form.
llvm-svn: 193438
A landing pad can be jumped to only by the unwind edge of an invoke
instruction. If we eliminate a partially redundant load in a landing pad, it
will create a basic block that violates this constraint. It then leads to other
problems down the line if it tries to merge that basic block with the landing
pad. Avoid this by not eliminating the load in a landing pad.
PR17621
llvm-svn: 193064
Switch instructions were crashing the StructurizeCFG pass, and it's
probably easier anyway if we don't need to handle them in this pass.
Reviewed-by: Christian König <christian.koenig@amd.com>
llvm-svn: 191841
infrastructure.
This was essentially work toward PGO based on a design that had several
flaws, partially dating from a time when LLVM had a different
architecture, and with an effort to modernize it abandoned without being
completed. Since then, it has bitrotted for several years further. The
result is nearly unusable, and isn't helping any of the modern PGO
efforts. Instead, it is getting in the way, adding confusion about PGO
in LLVM and distracting everyone with maintenance on essentially dead
code. Removing it paves the way for modern efforts around PGO.
Among other effects, this removes the last of the runtime libraries from
LLVM. Those are being developed in the separate 'compiler-rt' project
now, with somewhat different licensing specifically more approriate for
runtimes.
llvm-svn: 191835
SROA wants to convert any types of equivalent widths but it's not possible to
convert vectors of pointers to an integer scalar with a single cast. As a
workaround we add a bitcast to the corresponding int ptr type first. This type
of cast used to be an edge case but has become common with SLP vectorization.
Fixes PR17271.
llvm-svn: 191143
The problem of r191017 is that when GVN fabricate a val-number for a dead instruction (in order
to make following expr-PRE happy), it forget to fabricate a leader-table entry for it as well.
llvm-svn: 191118
This is how it ignores the dead code:
1) When a dead branch target, say block B, is identified, all the
blocks dominated by B is dead as well.
2) The PHIs of those blocks in dominance-frontier(B) is updated such
that the operands corresponding to dead predecessors are replaced
by "UndefVal".
Using lattice's jargon, the "UndefVal" is the "Top" in essence.
Phi node like this "phi(v1 bb1, undef xx)" will be optimized into
"v1" if v1 is constant, or v1 is an instruction which dominate this
PHI node.
3) When analyzing the availability of a load L, all dead mem-ops which
L depends on disguise as a load which evaluate exactly same value as L.
4) The dead mem-ops will be materialized as "UndefVal" during code motion.
llvm-svn: 191017
If there are no legal integers, assume 1 byte.
This makes more sense than using the pointer size as
a guess for the maximum GPR width.
It is conceivable to want to use some 64-bit pointers
on a target where 64-bit integers aren't legal.
llvm-svn: 190817
This pass was based on the previous (essentially unused) profiling
infrastructure and the assumption that by ordering the basic blocks at
the IR level in a particular way, the correct layout would happen in the
end. This sometimes worked, and mostly didn't. It also was a really
naive implementation of the classical paper that dates from when branch
predictors were primarily directional and when loop structure wasn't
commonly available. It also didn't factor into the equation
non-fallthrough branches and other machine level details.
Anyways, for all of these reasons and more, I wrote
MachineBlockPlacement, which completely supercedes this pass. It both
uses modern profile information infrastructure, and actually works. =]
llvm-svn: 190748
Allow targets to customize the default behavior of the generic loop unrolling
transformation. This will be used by the PowerPC backend when targeting the A2
core (which is in-order with a deep pipeline), and using more aggressive
defaults is important.
llvm-svn: 190542
Revert unintentional commit (of an unreviewed change).
Original commit message:
Add getUnrollingPreferences to TTI
Allow targets to customize the default behavior of the generic loop unrolling
transformation. This will be used by the PowerPC backend when targeting the A2
core (which is in-order with a deep pipeline), and using more aggressive
defaults is important.
llvm-svn: 189566
Allow targets to customize the default behavior of the generic loop unrolling
transformation. This will be used by the PowerPC backend when targeting the A2
core (which is in-order with a deep pipeline), and using more aggressive
defaults is important.
llvm-svn: 189565
...so that it can be used for z too. Most of the code is the same.
The only real change is to use TargetTransformInfo to test when a sqrt
instruction is available.
The pass is opt-in because at the moment it only handles sqrt.
llvm-svn: 189097
However, opt -O2 doesn't run mem2reg directly so nobody noticed until r188146
when SROA started sending more things directly down the PromoteMemToReg path.
In order to revert r187191, I also revert dependent revisions r187296, r187322
and r188146. Fixes PR16867. Does not add the testcases from that PR, but both
of them should get added for both mem2reg and sroa when this revert gets
unreverted.
llvm-svn: 188327
SROA-based analysis has enough information. This should work now that
both mem2reg *and* the SSAUpdater-based AllocaPromoter have been updated
to be able to promote the types of allocas that the SROA analysis
detects.
I've included tests for the AllocaPromoter that were only possible to
write once we fast-tracked promotable allocas without rewriting them.
This includes a test both for r187347 and r188145.
Original commit log for r187323:
"""
Now that mem2reg understands how to cope with a slightly wider set of uses of
an alloca, we can pre-compute promotability while analyzing an alloca for
splitting in SROA. That lets us short-circuit the common case of a bunch of
trivially promotable allocas. This cuts 20% to 30% off the run time of SROA for
typical frontend-generated IR sequneces I'm seeing. It gets the new SROA to
within 20% of ScalarRepl for such code. My current benchmark for these numbers
is PR15412, but it fits the general pattern of IR emitted by Clang so it should
be widely applicable.
"""
llvm-svn: 188146
the more general set of patterns that are now handled by mem2reg and that we
can detect quickly while doing SROA's initial analysis. Notably, this allows it
to promote through no-op bitcast and GEP sequences. A core part of the
SSAUpdater approach is the ability to test whether a particular instruction is
part of the set being promoted. Testing this becomes significantly more complex
in the world where the operand to every load and store isn't the alloca itself.
I ended up using the approach of walking up the def-chain until we find the
alloca. I benchmarked this against keeping a set of pointer operands and
keeping a set of the loads and stores we care about, and this one seemed faster
although the difference was very small.
No test case yet because currently the rewriting always "fixes" the inputs to
not require this. The next patch which re-enables early promotion of easy cases
in SROA will include a test case that specifically exercises this aspect of the
alloca promoter.
llvm-svn: 188145
our visiting datastructures in the AllocaPromoter/SSAUpdater path of
SROA. Also shift the order if clears around to be more consistent.
No functionality changed here, this is just a cleanup.
llvm-svn: 188144
It is breaking builbots with libgmalloc enabled on Mac OS X.
$ cd llvm ; mkdir release ; cd release
$ ../configure --enable-optimized —prefix=$PWD/install
$ make
$ make check
$ Release+Asserts/bin/llvm-lit -v --param use_gmalloc=1 --param \
gmalloc_path=/usr/lib/libgmalloc.dylib \
../test/Instrumentation/DataFlowSanitizer/args-unreachable-bb.ll
llvm-svn: 188142
This moves removeUnreachableBlocksFromFn from SimplifyCFGPass.cpp
to Utils/Local.cpp and uses it to replace the implementation of
llvm::removeUnreachableBlocks, which appears to do a strict subset
of what removeUnreachableBlocksFromFn does.
Differential Revision: http://llvm-reviews.chandlerc.com/D1334
llvm-svn: 188119
infrastructure to do promotion without a domtree the same smarts about
looking through GEPs, bitcasts, etc., that I just taught mem2reg about.
This way, if SROA chooses to promote an alloca which still has some
noisy instructions this code can cope with them.
I've not used as principled of an approach here for two reasons:
1) This code doesn't really need it as we were already set up to zip
through the instructions used by the alloca.
2) I view the code here as more of a hack, and hopefully a temporary one.
The SSAUpdater path in SROA is a real sore point for me. It doesn't make
a lot of architectural sense for many reasons:
- We're likely to end up needing the domtree anyways in a subsequent
pass, so why not compute it earlier and use it.
- In the future we'll likely end up needing the domtree for parts of the
inliner itself.
- If we need to we could teach the inliner to preserve the domtree. Part
of the re-work of the pass manager will allow this to be very powerful
even in large SCCs with many functions.
- Ultimately, computing a domtree has gotten significantly faster since
the original SSAUpdater-using code went into ScalarRepl. We no longer
use domfrontiers, and much of domtree is lazily done based on queries
rather than eagerly.
- At this point keeping the SSAUpdater-based promotion saves a total of
0.7% on a build of the 'opt' tool for me. That's not a lot of
performance given the complexity!
So I'm leaving this a bit ugly in the hope that eventually we just
remove all of this nonsense.
I can't even readily test this because this code isn't reachable except
through SROA. When I re-instate the patch that fast-tracks allocas
already suitable for promotion, I'll add a testcase there that failed
before this change. Before that, SROA will fix any test case I give it.
llvm-svn: 187347
uses of an alloca, we can pre-compute promotability while analyzing an
alloca for splitting in SROA. That lets us short-circuit the common case
of a bunch of trivially promotable allocas. This cuts 20% to 30% off the
run time of SROA for typical frontend-generated IR sequneces I'm seeing.
It gets the new SROA to within 20% of ScalarRepl for such code. My
current benchmark for these numbers is PR15412, but it fits the general
pattern of IR emitted by Clang so it should be widely applicable.
llvm-svn: 187323
their being optimized out in debug mode. Realistically, this just isn't
going to be the slow part anyways. This also fixes unused variable
warnings that are breaking LLD build bots. =/ I didn't see these at
first, and kept losing track of the fact that they were broken.
llvm-svn: 187297
Adds unit tests for it too.
Split BasicBlockUtils into an analysis-half and a transforms-half, and put the
analysis bits into a new Analysis/CFG.{h,cpp}. Promote isPotentiallyReachable
into llvm::isPotentiallyReachable and move it into Analysis/CFG.
llvm-svn: 187283
Merge consecutive if-regions if they contain identical statements.
Both transformations reduce number of branches. The transformation
is guarded by a target-hook, and is currently enabled only for +R600,
but the correctness has been tested on X86 target using a variety of
CPU benchmarks.
Patch by: Mei Ye
llvm-svn: 187278
schedule an alloca for another iteration in SROA. This only showed up
with a mixture of promotable and unpromotable selects and phis. Added
a test case for this.
llvm-svn: 187031
pending speculation for a phi node. The problem here is that we were
using growth of the specluation set as an indicator of whether
speculation would occur, and if the phi node is already in the set we
don't see it grow. This is a symptom of the fact that this signal is
a total hack.
Unfortunately, I couldn't really come up with a non-hacky way of
signaling that promotion remains valid *after* speculation occurs, such
that we only speculate when all else looks good for promotion. In the
end, I went with at least a much more explicit approach of doing the
work of queuing inside the phi and select processing and setting
a preposterously named flag to convey that we're in the special state of
requiring speculating before promotion.
Thanks to Richard Trieu and Nick Lewycky for the excellent work reducing
a testcase for this from a pretty giant, nasty assert in a big
application. =] The testcase was excellent.
llvm-svn: 187029
implementation of the SROA algorithm. We were using the term 'partition'
in many places that no longer ever represented an actual partition, but
rather just an arbitrary slice of an alloca.
No functionality change intended here. Mostly just renaming of types,
functions, variables, and rewording of comments. Several comments were
rewritten to make a lot more sense in the new structure of things.
The stats are still weird and not reflective of how this really works.
I'll fix those up in a separate patch as it is a touch more semantic of
a change...
llvm-svn: 186659
SROA.
The crux of the issue is that now we track uses of a partition of the
alloca in two places: the iterators over the partitioning uses and the
previously collected split uses vector. We weren't accounting for the
fact that the split uses might invalidate integer widening in ways other
than due to their width (in this case due to being volatile).
Further reduced testcase added to the tests.
llvm-svn: 186655
end of a vector. This was found with ASan. I've had one other report of
a crasher, but thus far been unable to reproduce the crash. It may well
be fixed with this version, and if not I'd like to get more information
from the build bots about what is happening.
See r186316 for the full commit log for the new implementation of the
SROA algorithm.
llvm-svn: 186565
a bot.
This reverts the commit which introduced a new implementation of the
fancy SROA pass designed to reduce its overhead. I'll skip the huge
commit log here, refer to r186316 if you're looking for how this all
works and why it works that way.
llvm-svn: 186332
different core implementation strategy.
Previously, SROA would build a relatively elaborate partitioning of an
alloca, associate uses with each partition, and then rewrite the uses of
each partition in an attempt to break apart the alloca into chunks that
could be promoted. This was very wasteful in terms of memory and compile
time because regardless of how complex the alloca or how much we're able
to do in breaking it up, all of the datastructure work to analyze the
partitioning was done up front.
The new implementation attempts to form partitions of the alloca lazily
and on the fly, rewriting the uses that make up that partition as it
goes. This has a few significant effects:
1) Much simpler data structures are used throughout.
2) No more double walk of the recursive use graph of the alloca, only
walk it once.
3) No more complex algorithms for associating a particular use with
a particular partition.
4) PHI and Select speculation is simplified and happens lazily.
5) More precise information is available about a specific use of the
alloca, removing the need for some side datastructures.
Ultimately, I think this is a much better implementation. It removes
about 300 lines of code, but arguably removes more like 500 considering
that some code grew in the process of being factored apart and cleaned
up for this all to work.
I've re-used as much of the old implementation as possible, which
includes the lion's share of code in the form of the rewriting logic.
The interesting new logic centers around how the uses of a partition are
sorted, and split into actual partitions.
Each instruction using a pointer derived from the alloca gets
a 'Partition' entry. This name is totally wrong, but I'll do a rename in
a follow-up commit as there is already enough churn here. The entry
describes the offset range accessed and the nature of the access. Once
we have all of these entries we sort them in a very specific way:
increasing order of begin offset, followed by whether they are
splittable uses (memcpy, etc), followed by the end offset or whatever.
Sorting by splittability is important as it simplifies the collection of
uses into a partition.
Once we have these uses sorted, we walk from the beginning to the end
building up a range of uses that form a partition of the alloca.
Overlapping unsplittable uses are merged into a single partition while
splittable uses are broken apart and carried from one partition to the
next. A partition is also introduced to bridge splittable uses between
the unsplittable regions when necessary.
I've looked at the performance PRs fairly closely. PR15471 no longer
will even load (the module is invalid). Not sure what is up there.
PR15412 improves by between 5% and 10%, however it is nearly impossible
to know what is holding it up as SROA (the entire pass) takes less time
than reading the IR for that test case. The analysis takes the same time
as running mem2reg on the final allocas. I suspect (without much
evidence) that the new implementation will scale much better however,
and it is just the small nature of the test cases that makes the changes
small and noisy. Either way, it is still simpler and cleaner I think.
llvm-svn: 186316
against a constant."
This reverts commit r186107. It didn't handle wrapping arithmetic in the
loop correctly and thus caused the following C program to count from
0 to UINT64_MAX instead of from 0 to 255 as intended:
#include <stdio.h>
int main() {
unsigned char first = 0, last = 255;
do { printf("%d\n", first); } while (first++ != last);
}
Full test case and instructions to reproduce with just the -indvars pass
sent to the original review thread rather than to r186107's commit.
llvm-svn: 186152
Patch by Michele Scandale!
Adds a special handling of the case where, during the loop exit
condition rewriting, the exit value is a constant of bitwidth lower
than the type of the induction variable: instead of introducing a
trunc operation in order to match correctly the operand types, it
allows to convert the constant value to an equivalent constant,
depending on the initial value of the induction variable and the trip
count, in order have an equivalent comparison between the induction
variable and the new constant.
llvm-svn: 186107
Without the changes introduced into this patch, if TRE saw any allocas at all,
TRE would not perform TRE *or* mark callsites with the tail marker.
Because TRE runs after mem2reg, this inadequacy is not a death sentence. But
given a callsite A without escaping alloca argument, A may not be able to have
the tail marker placed on it due to a separate callsite B having a write-back
parameter passed in via an argument with the nocapture attribute.
Assume that B is the only other callsite besides A and B only has nocapture
escaping alloca arguments (*NOTE* B may have other arguments that are not passed
allocas). In this case not marking A with the tail marker is unnecessarily
conservative since:
1. By assumption A has no escaping alloca arguments itself so it can not
access the caller's stack via its arguments.
2. Since all of B's escaping alloca arguments are passed as parameters with
the nocapture attribute, we know that B does not stash said escaping
allocas in a manner that outlives B itself and thus could be accessed
indirectly by A.
With the changes introduced by this patch:
1. If we see any escaping allocas passed as a capturing argument, we do
nothing and bail early.
2. If we do not see any escaping allocas passed as captured arguments but we
do see escaping allocas passed as nocapture arguments:
i. We do not perform TRE to avoid PR962 since the code generator produces
significantly worse code for the dynamic allocas that would be created
by the TRE algorithm.
ii. If we do not return twice, mark call sites without escaping allocas
with the tail marker. *NOTE* This excludes functions with escaping
nocapture allocas.
3. If we do not see any escaping allocas at all (whether captured or not):
i. If we do not have usage of setjmp, mark all callsites with the tail
marker.
ii. If there are no dynamic/variable sized allocas in the function,
attempt to perform TRE on all callsites in the function.
Based off of a patch by Nick Lewycky.
rdar://14324281.
llvm-svn: 186057
debug statements to add a missing newline. Also canonicalize to '\n' instead of
"\n"; the latter calls a function with a loop the former does not.
llvm-svn: 184897
When a 1-element vector alloca is promoted, a store instruction can often be
rewritten without converting the value to a scalar and using an insertelement
instruction to stuff it into the new alloca. This patch just adds a check
to skip that conversion when it is unnecessary. This turns out to be really
important for some ARM Neon operations where <1 x i64> is used to get around
the fact that i64 is not a legal type.
llvm-svn: 184870
This commit completely removes what is left of the simplify-libcalls
pass. All of the functionality has now been migrated to the instcombine
and functionattrs passes. The following C API functions are now NOPs:
1. LLVMAddSimplifyLibCallsPass
2. LLVMPassManagerBuilderSetDisableSimplifyLibCalls
llvm-svn: 184459
Prior to this change, the considered addressing modes may be invalid since the
maximum and minimum offsets were not taking into account.
This was causing an assertion failure.
The added test case exercices that behavior.
<rdar://problem/14199725> Assertion failed: (CurScaleCost >= 0 && "Legal
addressing mode has an illegal cost!")
llvm-svn: 184341
r183584 tries to derive some info from the code *AFTER* a call and apply
these derived info to the code *BEFORE* the call, which is not always safe
as the call in question may never return, and in this case, the derived
info is invalid.
Thank Duncan for pointing out this potential bug.
rdar://14073661
llvm-svn: 183606
The MemCpyOpt pass is capable of optimizing:
callee(&S); copy N bytes from S to D.
into:
callee(&D);
subject to some legality constraints.
Assertion is triggered when the compiler tries to evalute "sizeof(typeof(D))",
while D is an opaque-typed, 'sret' formal argument of function being compiled.
i.e. the signature of the func being compiled is something like this:
T caller(...,%opaque* noalias nocapture sret %D, ...)
The fix is that when come across such situation, instead of calling some
utility functions to get the size of D's type (which will crash), we simply
assume D has at least N bytes as implified by the copy-instruction.
rdar://14073661
llvm-svn: 183584
IndVarSimplify is willing to move divide instructions outside of their
loop bodies if they are invariant of the loop. However, it may not be
safe to expand them if we do not know if they can trap.
Instead, check to see if it is not safe to expand the instruction and
skip the expansion.
This fixes PR16041.
Testcase by Rafael Ávila de Espíndola.
llvm-svn: 183239
Account for the cost of scaling factor in Loop Strength Reduce when rating the
formulae. This uses a target hook.
The default implementation of the hook is: if the addressing mode is legal, the
scaling factor is free.
<rdar://problem/13806271>
llvm-svn: 183045
Namely, check if the target allows to fold more that one register in the
addressing mode and if yes, adjust the cost accordingly.
Prior to this commit, reg1 + scale * reg2 accesses were artificially preferred
to reg1 + reg2 accesses. Indeed, the cost model wrongly assumed that reg1 + reg2
needs a temporary register for the computation, whereas it was correctly
estimated for reg1 + scale * reg2.
<rdar://problem/13973908>
llvm-svn: 183021
iteration.
This on step toward non-iterative GVN. My local hack suggests that getting rid
of iteration will speedup GVN by 30%+ on a medium sized input (2k LOC, C++).
I cannot explain why not 2x or more at this moment.
llvm-svn: 181532
Test case by Michele Scandale!
Fixes PR10293: Load not hoisted out of loop with multiple exits.
There are few regressions with this patch, now tracked by
rdar:13817079, and a roughly equal number of improvements. The
regressions are almost certainly back luck because LoopRotate has very
little idea of whether rotation is profitable. Doing better requires a
more comprehensive solution.
This checkin is a quick fix that lacks generality (PR10293 has
a counter-example). But it trivially fixes the case in PR10293 without
interfering with other cases, and it does satify the criteria that
LoopRotate is a loop canonicalization pass that should avoid
heuristics and special cases.
I can think of two approaches that would probably be better in
the long run. Ultimately they may both make sense.
(1) LoopRotate should check that the current header would make a good
loop guard, and that the loop does not already has a sufficient
guard. The artifical SimplifiedLoopLatch check would be unnecessary,
and the design would be more general and canonical. Two difficulties:
- We need a strong guarantee that we won't endlessly rotate, so the
analysis would need to be precise in order to avoid the
SimplifiedLoopLatch precondition.
- Analysis like this are usually based on SCEV, which we don't want to
rely on.
(2) Rotate on-demand in late loop passes. This could even be done by
shoving the loop back on the queue after the optimization that needs
it. This could work well when we find LICM opportunities in
multi-branch loops. This requires some work, and it doesn't really
solve the problem of SCEV wanting a loop guard before the analysis.
llvm-svn: 181230
This function consists of following steps:
1. Collect dependent memory accesses.
2. Analyze availability.
3. Perform fully redundancy elimination, or
4. Perform PRE, depending on the availability
Step 2, 3 and 4 are now moved to three helper routines.
llvm-svn: 181047
Actually it took me couple of hours trying to make sense of them and
only to find they are dead code. I guess the original author used
"allSingleSucc" to indicate if there are any critial edge emanating
from some blocks, and tried to perform code motion (actually speculation)
in the presence of these critical edges; but later on he/she changed mind
and decided to perform edge-splitting first.
llvm-svn: 180951