If the loop body contains conditions of the form IndVar < #constant, we
can remove the checks by peeling off #constant iterations.
This improves codegen for PR34364.
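As a hedged illustration (made-up code, not from the patch), consider a loop with such a condition and what peeling two iterations does to it:
```cpp
// Illustrative sketch only; the helpers init/process are invented for the example.
void init(int i);
void process(int i);

void before(int n) {
  for (int i = 0; i < n; ++i) {
    if (i < 2)   // IndVar < #constant
      init(i);
    process(i);
  }
}

void after(int n) {
  // Peeled copies: the "i < 2" check folds to a constant in each of them.
  if (n > 0) { init(0); process(0); }
  if (n > 1) { init(1); process(1); }
  // The remaining loop body no longer needs the check at all.
  for (int i = 2; i < n; ++i)
    process(i);
}
```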
Reviewers: mkuper, mkazantsev, efriedma
Reviewed By: mkazantsev
Differential Revision: https://reviews.llvm.org/D43876
llvm-svn: 327671
Summary:
Before this patch, the call graph in the LoopUnrollPass looks like this:
tryToUnrollLoop
  ApproximateLoopSize
    collectEphemeralValues
    /* Use collected ephemeral values */
  computeUnrollCount
    analyzeLoopUnrollCost
      /* Bail out from the analysis if loop contains CallInst */
This patch moves the collection of ephemeral values up into the tryToUnrollLoop
function and passes the collected values into both ApproximateLoopSize (as
before) and, additionally, analyzeLoopUnrollCost:
tryToUnrollLoop
  collectEphemeralValues
  ApproximateLoopSize(EphValues)
    /* Use EphValues */
  computeUnrollCount(EphValues)
    analyzeLoopUnrollCost(EphValues)
      /* Ignore ephemeral values - they don't contribute to the final cost */
      /* Bail out from the analysis if loop contains CallInst */
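A rough sketch of the reshaped flow is below; the helper signatures are simplified and assumed, not the exact in-tree ones, but CodeMetrics::collectEphemeralValues is the real collection entry point.
```cpp
#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/Analysis/AssumptionCache.h"
#include "llvm/Analysis/CodeMetrics.h"
#include "llvm/Analysis/LoopInfo.h"
using namespace llvm;

// Hypothetical stand-ins for the real helpers (parameter lists trimmed).
unsigned ApproximateLoopSize(Loop *L,
                             const SmallPtrSetImpl<const Value *> &EphValues);
unsigned computeUnrollCount(Loop *L,
                            const SmallPtrSetImpl<const Value *> &EphValues);

// Sketch of tryToUnrollLoop after the change: collect once, thread everywhere.
bool tryToUnrollLoopSketch(Loop *L, AssumptionCache &AC) {
  SmallPtrSet<const Value *, 32> EphValues;
  CodeMetrics::collectEphemeralValues(L, &AC, EphValues);

  unsigned LoopSize = ApproximateLoopSize(L, EphValues); // as before
  unsigned Count = computeUnrollCount(L, EphValues);     // analyzeLoopUnrollCost
                                                         // can now skip EphValues
  return Count > 1 && LoopSize != 0;
}
```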
Reviewers: mzolotukhin, evstupac, sanjoy
Reviewed By: evstupac
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D43931
llvm-svn: 327617
Currently, when AllowRemainder is disabled, the pragma unroll count is not
respected even though there is no remainder. This bug causes a loop
to be fully unrolled in many cases even though the user specifies an unroll
count. This especially affects OpenCL/CUDA, since in many cases a loop
contains convergent instructions and AllowRemainder is currently
disabled for such loops.
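For illustration (assumed kernel-style code, not from the patch), the kind of pragma that should now be honored when the trip count is a multiple of the requested count:
```cpp
// Illustrative only: a loop of the sort found in OpenCL/CUDA kernels.
// With this fix, the requested count of 4 is respected even though
// AllowRemainder is disabled, because 8 % 4 == 0 leaves no remainder.
void add_one(float *a) {
#pragma unroll 4
  for (int i = 0; i < 8; ++i)
    a[i] += 1.0f;
}
```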
Differential Revision: https://reviews.llvm.org/D43826
llvm-svn: 326585
Summary:
This replaces calls to getEntryCount().hasValue() with hasProfileData,
which does the same thing. This refactoring is useful to do before adding
synthetic function entry counts, but it is also a useful cleanup IMO even
otherwise. I have used hasProfileData instead of hasRealProfileData, as
David had earlier suggested, since I think profile implies "real" and I
use the phrase "synthetic entry count" and not "synthetic profile count",
but I am fine calling it hasRealProfileData if you prefer.
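The mechanical replacement, illustrated (F is an llvm::Function):
```cpp
#include "llvm/IR/Function.h"
using namespace llvm;

// Illustrative only: the before/after of the renaming done throughout.
bool isProfiled(const Function &F) {
  // Before: return F.getEntryCount().hasValue();
  return F.hasProfileData(); // same check, clearer intent
}
```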
Reviewers: davidxl, silvas
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D41461
llvm-svn: 321331
This avoids code duplication and allows us to add the disable-unroll metadata elsewhere.
Differential Revision: https://reviews.llvm.org/D38928
llvm-svn: 315850
parameterized emit() calls
Summary: This is a non-functional change to adopt the new emit() API added in r313691.
Reviewed By: anemet
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D38285
llvm-svn: 315476
Summary:
And now that we no longer have to explicitly free() the Loop instances, we can
(with more ease) use the destructor of LoopBase to do what LoopBase::clear() was
doing.
Reviewers: chandlerc
Subscribers: mehdi_amini, mcrosier, llvm-commits
Differential Revision: https://reviews.llvm.org/D38201
llvm-svn: 314375
Summary:
With this change:
- Methods in LoopBase trip an assert if the receiver has been invalidated
- LoopBase::clear frees up the memory held by the LoopBase instance
This change also shuffles things around as necessary to work with this stricter invariant.
Reviewers: chandlerc
Subscribers: mehdi_amini, mcrosier, llvm-commits
Differential Revision: https://reviews.llvm.org/D38055
llvm-svn: 313708
On some targets, the penalty of executing the runtime unrolling checks
and then not executing the unrolled loop can be significantly detrimental to
performance. This results in the need to be more conservative with
the unroll count; keeping a trip count of 2 reduces the overhead as
well as increasing the chance of the unrolled body being executed. But
being conservative leaves performance gains on the table.
This patch enables the unrolling of the remainder loop introduced by
runtime unrolling. This can help reduce the overhead of misunrolled
loops because the cost of non-taken branches is much less than the
cost of the backedge that would normally be executed in the remainder
loop. This allows larger unroll factors to be used without suffering
performance losses with smaller iteration counts.
Differential Revision: https://reviews.llvm.org/D36309
llvm-svn: 310824
results when a loop is completely removed.
This is very hard to manifest as a visible bug. You need to arrange for
there to be a subsequent allocation of a 'Loop' object which gets the
exact same address as the one which the unroll deleted, and you need the
LoopAccessAnalysis results to be significant in the way that they're
stale. And you need a million other things to align.
But when it does, you get a deeply mysterious crash due to actually
finding a stale analysis result. This fixes the issue and tests for it
by directly checking we successfully invalidate things. I have not been
able to get *any* test case to reliably trigger this. Changes to LLVM
itself caused the only test case I ever had to cease to crash.
I've looked pretty extensively at less brittle ways of fixing this and
they are actually very, very hard to do. This is a somewhat strange and
unusual case as we have a pass which is deleting an IR unit, but is not
running within that IR unit's pass framework (which is what handles this
cleanly for the normal loop unroll). And where there isn't a definitive
way to clear *all* of the stale cache entries. And where the pass *is*
updating the core analysis that provides the IR units!
For example, we don't have any of these problems with Function analyses
because it is easy to clear out function analyses when the functions
themselves may have been deleted -- we clear an entire module's worth!
But that is too heavy of a hammer down here in the LoopAnalysisManager
layer.
A better long-term solution IMO is to require that AnalysisManagers
make their keys durable to this kind of thing. Specifically, when
caching an analysis for one IR unit that is conceptually "owned" by
a higher level IR unit, the AnalysisManager should incorporate this into
its data structures so that we can reliably clear these results without
having to teach each and every pass to do so manually as we do here. But
that is a change for another day as it will be a fairly invasive change
to the AnalysisManager infrastructure. Until then, this fortunately
seems to be quite rare.
llvm-svn: 310333
Summary:
Detect when the working set size of a profiled application is huge,
by comparing the number of counts required to reach the hot percentile
in the profile summary to a large threshold*.
When the working set size is determined to be huge, disable peeling
to avoid bloating the working set further.
*Note that the selected threshold (15K) is significantly larger than the
largest working set value in SPEC cpu2006 (which is gcc at around 11K).
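A minimal sketch of the resulting guard, assuming a ProfileSummaryInfo helper along the lines of hasHugeWorkingSetSize() (the exact name and wiring are assumptions here):
```cpp
#include "llvm/Analysis/ProfileSummaryInfo.h"
using namespace llvm;

// Sketch under assumptions: when the profile indicates a huge working set,
// peeling is switched off so it doesn't bloat the working set further.
void adjustPeelingForWorkingSet(ProfileSummaryInfo *PSI, unsigned &PeelCount,
                                bool &AllowPeeling) {
  if (PSI && PSI->hasHugeWorkingSetSize()) {
    PeelCount = 0;
    AllowPeeling = false;
  }
}
```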
Reviewers: davidxl
Subscribers: mehdi_amini, mzolotukhin, eraman, llvm-commits
Differential Revision: https://reviews.llvm.org/D36288
llvm-svn: 310005
Summary:
Peeling should not occur during the full unrolling invocation early
in the pipeline, but rather later with partial and runtime loop
unrolling. The later loop unrolling invocation will also eventually
utilize profile summary and branch frequency information, which
we would like to use to control peeling. And for ThinLTO we want
to delay peeling until the backend (post thin link) phase, just as
we do for most types of unrolling.
Ensure peeling doesn't occur during the full unrolling invocation
by adding a parameter to the shared implementation function, similar
to the way partial and runtime loop unrolling are disabled.
Performance results for ThinLTO suggest this has a neutral to positive
effect on some internal benchmarks.
Reviewers: chandlerc, davidxl
Subscribers: mzolotukhin, llvm-commits, mehdi_amini
Differential Revision: https://reviews.llvm.org/D36258
llvm-svn: 309966
Summary:
This is largely NFC*, in preparation for utilizing ProfileSummaryInfo
and BranchFrequencyInfo analyses. In this patch I am only doing the
splitting for the New PM, but I can do the same for the legacy PM as
a follow-on if this looks good.
*Not NFC since for partial unrolling we lose the updates done to the
loop traversal (adding new sibling and child loops) - according to
Chandler this is not very useful for partial unrolling, but it also
means that the debugging flag -unroll-revisit-child-loops no longer
works for partial unrolling.
Reviewers: chandlerc
Subscribers: mehdi_amini, mzolotukhin, eraman, llvm-commits
Differential Revision: https://reviews.llvm.org/D36157
llvm-svn: 309886
and to expose a handle to represent the actual case rather than having
the iterator return a reference to itself.
All of this allows the iterator to be used with common STL facilities,
standard algorithms, etc.
Doing this exposed some missing facilities in the iterator facade that
I've fixed and required some work to the actual iterator to fully
support the necessary API.
Differential Revision: https://reviews.llvm.org/D31548
llvm-svn: 300032
Summary:
In the current implementation, loop peeling happens after trip-count based partial unrolling and may
sometimes not happen at all because of it (for example, if the trip count is known, but UP.Partial = false). This
is generally bad, all the more so since there are situations where peeling is profitable even if the partial
unrolling is disabled.
This patch is an NFC change which reorders the application of peeling and partial unrolling and prepares the code for
implementation of the said optimizations.
Patch by Max Kazantsev!
Reviewers: sanjoy, anna, reames, apilipenko, igor-laevsky, mkuper
Reviewed By: mkuper
Subscribers: mkuper, llvm-commits, mzolotukhin
Differential Revision: https://reviews.llvm.org/D30243
llvm-svn: 296897
This enables peeling of loops with low dynamic iteration count by default,
when profile information is available.
Differential Revision: https://reviews.llvm.org/D27734
llvm-svn: 295796
factory functions for the two modes the loop unroller is actually used
in in-tree: simplified full-unrolling and the entire thing including
partial unrolling.
I've also wired these up to nice names so you can express both of these
being in a pipeline easily. This is a precursor to actually enabling
these parts of the O2 pipeline.
Differential Revision: https://reviews.llvm.org/D28897
llvm-svn: 293136
Even when we don't create a remainder loop (that is, when we unroll by 2), we
may duplicate nested loops into the remainder. This is complicated by the fact
the remainder may itself be either inserted into an outer loop, or at the top
level. In the latter case, we may need to create new top-level loops.
Differential Revision: https://reviews.llvm.org/D29156
llvm-svn: 293124
loops.
We do this by reconstructing the newly added loops after the unroll
completes to avoid threading pass manager details through all the mess
of the unrolling infrastructure.
I've enabled some extra assertions in the LPM to try and catch issues
here and enabled a bunch of unroller tests to try and make sure this is
sane.
Currently, I'm manually running loop-simplify when needed. That should
go away once it is folded into the LPM infrastructure.
Differential Revision: https://reviews.llvm.org/D28848
llvm-svn: 293011
Summary: Partial unrolling should have a separate threshold from full unrolling.
Reviewers: efriedma, mzolotukhin
Reviewed By: efriedma, mzolotukhin
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D28831
llvm-svn: 292293
a function's CFG when that CFG is unchanged.
This allows transformation passes to simply claim they preserve the CFG
and analysis passes to check for the CFG being preserved to remove the
fanout of all analyses being listed in all passes.
I've gone through and removed or cleaned up as many of the comments
reminding us to do this as I could.
Differential Revision: https://reviews.llvm.org/D28627
llvm-svn: 292054
the latter to the Transforms library.
While the loop PM uses an analysis to form the IR units, the current
plan is to have the PM itself establish and enforce both loop simplified
form and LCSSA. This would be a layering violation in the analysis
library.
Fundamentally, the idea behind the loop PM is to *transform* loops in
addition to running passes over them, so it really seemed like the most
natural place to sink this was into the transforms library.
We can't just move *everything* because we also have loop analyses that
rely on a subset of the invariants. So this patch splits the loop
infrastructure into the analysis management that has to be part of the
analysis library, and the transform-aware pass manager.
This also required splitting the loop analyses' printer passes out to
the transforms library, which makes sense to me as running these will
transform the code into LCSSA in theory.
I haven't split the unittest though because testing one component
without the other seems nearly intractable.
Differential Revision: https://reviews.llvm.org/D28452
llvm-svn: 291662
arguments much like the CGSCC pass manager.
This is a major redesign following the pattern established for the CGSCC layer to
support updates to the set of loops during the traversal of the loop nest and
to support invalidation of analyses.
An additional significant burden in the loop PM is that so many passes require
access to a large number of function analyses. Manually ensuring these are
cached, available, and preserved has been a long-standing burden in LLVM even
with the help of the automatic scheduling in the old pass manager. And it made
the new pass manager extremely unwieldy. With this design, we can package the
common analyses up while in a function pass and make them immediately available
to all the loop passes. While in some cases this is unnecessary, I think the
simplicity afforded is worth it.
This does not (yet) address loop simplified form or LCSSA form, but those are
the next things on my radar and I have a clear plan for them.
While the patch is very large, most of it is either mechanically updating loop
passes to the new API or the new testing for the loop PM. The code for it is
reasonably compact.
I have not yet updated all of the loop passes to correctly leverage the update
mechanisms demonstrated in the unittests. I'll do that in follow-up patches
along with improved FileCheck tests for those passes that ensure things work in
more realistic scenarios. In many cases, there isn't much we can do with these
until the loop simplified form and LCSSA form are in place.
Differential Revision: https://reviews.llvm.org/D28292
llvm-svn: 291651
Summary:
The current loop complete unroll algorithm checks if complete unrolling will reduce the runtime by a certain percentage. If yes, it will apply a fixed boosting factor to the threshold (by discounting cost). The problem with this approach is that the threshold changes abruptly. This patch makes the boosting factor a function of the runtime reduction percentage, capped by a fixed threshold. In this way, the threshold changes continuously.
The patch also simplifies the code by reducing one parameter in UP.
The patch only affects code-gen of two speccpu2006 benchmark:
445.gobmk binary size decreases 0.08%, no performance change.
464.h264ref binary size increases 0.24%, no performance change.
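The shape of the change, as a hedged sketch (the names, scaling, and cap below are illustrative, not the in-tree formula):
```cpp
#include <algorithm>
#include <cstdint>

// Illustrative only: the boost grows with the estimated runtime reduction and
// is capped, so the effective threshold varies continuously instead of
// jumping at a single cutoff percentage.
uint64_t boostedThreshold(uint64_t Threshold, unsigned PercentRuntimeReduction,
                          unsigned MaxBoostPercent) {
  unsigned Boost = std::min(PercentRuntimeReduction * 2, MaxBoostPercent);
  return Threshold * (100 + Boost) / 100;
}
```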
Reviewers: mzolotukhin, chandlerc
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D26989
llvm-svn: 290737
After r289755, the AssumptionCache is no longer needed. Variables affected by
assumptions are now found by using the new operand-bundle-based scheme. This
new scheme is more computationally efficient, and also we need much less
code...
llvm-svn: 289756
This implements PGO-driven loop peeling.
The basic idea is that when the average dynamic trip-count of a loop is known,
based on PGO, to be low, we can expect a performance win by peeling off the
first several iterations of that loop.
Unlike unrolling based on a known trip count, or a trip count multiple, this
doesn't save us the conditional check and branch on each iteration. However,
it does allow us to simplify the straight-line code we get (constant-folding,
etc.). This is important given that we know that we will usually only hit this
code, and not the actual loop.
This is currently disabled by default.
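An illustration (made-up code, not from the patch) of what peeling buys when the profiled trip count is usually 1:
```cpp
// Illustrative only: if PGO shows this loop almost always executes a single
// iteration, peeling that iteration exposes straight-line code where i is a
// known constant, enabling constant folding and similar simplification.
void process(int i);

void hot_path(int n) {
  for (int i = 0; i < n; ++i)     // profile says n is usually 1
    process(i);
}

void hot_path_peeled(int n) {
  if (n > 0) {
    process(0);                   // peeled iteration: i == 0 is a constant
    for (int i = 1; i < n; ++i)   // rarely entered according to the profile
      process(i);
  }
}
```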
Differential Revision: https://reviews.llvm.org/D25963
llvm-svn: 288274
Summary:
For a flat loop, even if it is hot, it is not a good idea to unroll at runtime, so we set a lower partial unroll threshold.
For a hot loop, we set a higher unroll threshold and allow expensive trip-count computation to enable more aggressive unrolling.
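A hedged sketch of the intent (the struct, names, and concrete scaling are assumptions, not the patch's actual parameters):
```cpp
// Sketch under assumptions: "flat" stands for a loop whose profile suggests a
// low trip count per entry, "hot" for a loop that dominates the profile.
struct SketchUnrollPrefs {
  unsigned PartialThreshold;
  bool AllowExpensiveTripCount;
};

SketchUnrollPrefs tunePrefs(bool IsHot, bool IsFlat, SketchUnrollPrefs Prefs) {
  if (IsFlat) {
    // Runtime unrolling rarely pays off here: lower the partial threshold.
    Prefs.PartialThreshold /= 2;
  } else if (IsHot) {
    // Hot loops: raise the threshold and allow expensive trip-count checks.
    Prefs.PartialThreshold *= 3;
    Prefs.AllowExpensiveTripCount = true;
  }
  return Prefs;
}
```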
Reviewers: davidxl, mzolotukhin
Subscribers: sanjoy, mehdi_amini, llvm-commits
Differential Revision: https://reviews.llvm.org/D26527
llvm-svn: 287186
Summary:
Unrolled loop size calculations were moved to a function.
The constant representing the number of optimized instructions
when a "back edge" becomes a "fall through" was replaced with a
variable.
Some comments were added.
Reviewers: mzolotukhin
Differential Revision: http://reviews.llvm.org/D21719
From: Evgeny Stupachenko <evstupac@gmail.com>
llvm-svn: 286389
When we have a loop with a known upper bound on the number of iterations, and
furthermore know that the number of iterations will be either exactly
that upper bound or zero, then we can fully unroll up to that upper bound
keeping only the first loop test to check for the zero iteration case.
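For illustration (not from the patch), a loop whose trip count is known to be either exactly its upper bound or zero:
```cpp
// Illustrative only: the loop runs either exactly 8 times or not at all, so it
// can be fully unrolled into eight copies guarded by a single zero-trip test.
void work(int i);

void example(bool enabled) {
  int n = enabled ? 8 : 0; // trip count is the upper bound (8) or zero
  for (int i = 0; i < n; ++i)
    work(i);
  // After unrolling: if (enabled) { work(0); work(1); ... work(7); }
}
```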
Most of the work here is in plumbing this 'max-or-zero' information from the
part of scalar evolution where it's detected through to loop unrolling. I've
also gone for the safe default of 'false' everywhere but howManyLessThans which
could probably be improved.
Differential Revision: https://reviews.llvm.org/D25682
llvm-svn: 284818
Reapply r284044 after revert in r284051. Krzysztof fixed the error in r284049.
The original summary:
This patch tries to fully unroll loops having break statement like this
for (int i = 0; i < 8; i++) {
  if (a[i] == value) {
    found = true;
    break;
  }
}
GCC can fully unroll such loops, but currently LLVM cannot because LLVM only
supports loops having exact constant trip counts.
The upper bound of the trip count can be obtained from calling
ScalarEvolution::getMaxBackedgeTakenCount(). Part of the patch is the
refactoring work in SCEV to prevent duplicating code.
The feature of using the upper bound is enabled under the same circumstance
when runtime unrolling is enabled since both are used to unroll loops without
knowing the exact constant trip count.
llvm-svn: 284053
This patch tries to fully unroll loops having break statement like this
for (int i = 0; i < 8; i++) {
  if (a[i] == value) {
    found = true;
    break;
  }
}
GCC can fully unroll such loops, but currently LLVM cannot because LLVM only
supports loops having exact constant trip counts.
The upper bound of the trip count can be obtained from calling
ScalarEvolution::getMaxBackedgeTakenCount(). Part of the patch is the
refactoring work in SCEV to prevent duplicating code.
The feature of using the upper bound is enabled under the same circumstance
when runtime unrolling is enabled since both are used to unroll loops without
knowing the exact constant trip count.
Differential Revision: https://reviews.llvm.org/D24790
llvm-svn: 284044
Summary: Debug info should *not* affect optimization decisions. This patch updates the loop unroller cost model so that it is not affected by debug info.
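A minimal sketch of the idea (the real cost model is driven by TargetTransformInfo; this only shows the debug-intrinsic filtering):
```cpp
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/IntrinsicInst.h"
using namespace llvm;

// Sketch only: debug intrinsics are ignored when sizing a block so that
// building with -g cannot change the unrolling decision.
unsigned countNonDebugInstructions(const BasicBlock &BB) {
  unsigned N = 0;
  for (const Instruction &I : BB)
    if (!isa<DbgInfoIntrinsic>(I))
      ++N;
  return N;
}
```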
Reviewers: davidxl, mzolotukhin
Subscribers: haicheng, llvm-commits, mzolotukhin
Differential Revision: https://reviews.llvm.org/D25098
llvm-svn: 282894
This commit enables more unrolling for SystemZ by implementing the
SystemZTargetTransformInfo::getUnrollingPreferences() method.
It has been found that it is better to only unroll moderately, so the
DefaultUnrollRuntimeCount has been moved into UnrollingPreferences in order
to set this to a lower value for SystemZ (4).
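A hedged sketch of the override (the parameter list is simplified, and the actual hook applies additional heuristics):
```cpp
#include "llvm/Analysis/TargetTransformInfo.h"
using namespace llvm;

// Sketch under assumptions: shows only the single knob the summary mentions.
// The in-tree SystemZTTIImpl::getUnrollingPreferences takes more parameters
// and tunes further limits.
void getUnrollingPreferencesSketch(Loop *L,
                                   TargetTransformInfo::UnrollingPreferences &UP) {
  // Unroll only moderately on SystemZ: default runtime unroll count of 4.
  UP.DefaultUnrollRuntimeCount = 4;
}
```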
Reviewers: Evgeny Stupachenko, Ulrich Weigand.
https://reviews.llvm.org/D24451
llvm-svn: 282570
We can't mark ORE (a function pass) preserved as required by the loop
passes because that is how we ensure that the required passes like
LazyBFI are all available any time ORE is used. See the new comments in
the patch.
Instead we use it directly just like the inliner does in D22694.
As expected there is some additional overhead after removing the caching
provided by analysis passes. The worst case I measured was
LNT/CINT2006_ref/401.bzip2, which regresses by 12%. As before, this only
affects -Rpass-with-hotness and not default compilation.
llvm-svn: 279829
Summary:
In clang commit r268509 we started to invoke loop-unroll pass from the
driver even under -Os. However, we happened to not initialize the optsize
thresholds properly, which is fixed with this change.
r268509 led to some big compile time regressions, because we started to
unroll some loops that we didn't unroll before. With this change I hope
to recover most of the regressions. We still are slightly slower than
before, because we do some checks here and there in loop-unrolling
before we bail out, but at least the slowdown is not that huge now.
Reviewers: hfinkel, chandlerc
Subscribers: mzolotukhin, llvm-commits
Differential Revision: https://reviews.llvm.org/D23388
llvm-svn: 279585
One exception here is LoopInfo which must forward-declare it (because
the typedef is in LoopPassManager.h which depends on LoopInfo).
Also, some includes for LoopPassManager.h were needed since that file
provides the typedef.
Besides a general consistently benefit, the extra layer of indirection
allows the mechanical part of https://reviews.llvm.org/D23256 that
requires touching every transformation and analysis to be factored out
cleanly.
Thanks to David for the suggestion.
llvm-svn: 278079
LoopUnroll is a loop pass, so the analysis of OptimizationRemarkEmitter
is added to the common function analysis passes that loop passes
depend on.
The BFI and indirectly BPI used in this pass is computed lazily so no
overhead should be observed unless -pass-remarks-with-hotness is used.
This is how the patch affects the O3 pipeline:
Dominator Tree Construction
Natural Loop Information
Canonicalize natural loops
Loop-Closed SSA Form Pass
Basic Alias Analysis (stateless AA impl)
Function Alias Analysis Results
Scalar Evolution Analysis
+ Lazy Branch Probability Analysis
+ Lazy Block Frequency Analysis
+ Optimization Remark Emitter
Loop Pass Manager
Rotate Loops
Loop Invariant Code Motion
Unswitch loops
Simplify the CFG
Dominator Tree Construction
Basic Alias Analysis (stateless AA impl)
Function Alias Analysis Results
Combine redundant instructions
Natural Loop Information
Canonicalize natural loops
Loop-Closed SSA Form Pass
Scalar Evolution Analysis
+ Lazy Branch Probability Analysis
+ Lazy Block Frequency Analysis
+ Optimization Remark Emitter
Loop Pass Manager
Induction Variable Simplification
Recognize loop idioms
Delete dead loops
Unroll loops
...
llvm-svn: 277203
We just set PreserveLCSSA to always true since we don't have an
analogous method `mustPreserveAnalysisID(LCSSA)`.
Also port LoopInfo verifier pass to test LoopUnrollPass.
llvm-svn: 276063
We do not support splitting cleanuppad or catchswitches. This is
problematic for passes which assume that a loop is in loop simplify
form (the loop would have a dedicated exit block instead of sharing it).
While it isn't great that we don't support this for cleanups, we still
cannot make loop-simplify form an assertable precondition because
indirectbr will also disable these sorts of CFG cleanups.
This fixes PR28132.
llvm-svn: 272739
with user specified count has been applied.
Summary:
Previously, SetLoopAlreadyUnrolled() set the disable pragma only if
there was some loop metadata.
Now it sets the pragma in all cases. This helps to prevent multiple
unrolling when -unroll-count=N is given.
Reviewers: mzolotukhin
Differential Revision: http://reviews.llvm.org/D20765
From: Evgeny Stupachenko <evstupac@gmail.com>
llvm-svn: 272195
In r270478, where I enabled the new heuristic, I posted testing results
that I got when explicitly passing the threshold values via CL options.
However, setting the CL options' init values is not enough to change the
default values of the thresholds, so I'm changing them in another place now.
llvm-svn: 271615
Summary:
Unroll factor (Count) calculations were moved to a new function.
Early exits on a pragma- or "-unroll-count"-defined factor were added.
A new type of unrolling, "Force", was introduced (previously used implicitly).
A new unroll preference, "AllowRemainder", was introduced and set to "true" by default
(it should be set to false for architectures that suffer from it).
Reviewers: hfinkel, mzolotukhin, zzheng
Differential Revision: http://reviews.llvm.org/D19553
From: Evgeny Stupachenko <evstupac@gmail.com>
llvm-svn: 271071
Condition might be simplified to a Constant, but it doesn't have to be
ConstantInt, so we should dyn_cast, instead of cast.
This fixes PR27886.
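In LLVM terms the fix is the usual cast-to-dyn_cast change; roughly (illustrative, not the patch's exact code):
```cpp
#include "llvm/IR/Constants.h"
#include "llvm/Support/Casting.h"
using namespace llvm;

// Illustrative only: cast<> asserts the type matches, so a Constant that is
// not a ConstantInt would hit an assertion; dyn_cast<> returns null and lets
// the caller bail out instead.
bool getKnownCondition(Constant *Simplified, bool &Taken) {
  if (auto *CI = dyn_cast<ConstantInt>(Simplified)) {
    Taken = !CI->isZero();
    return true;
  }
  return false; // simplified to some other Constant; treat as unknown
}
```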
llvm-svn: 270924
Summary:
This patch turns on LoopUnrollAnalyzer by default. To mitigate compile
time regressions, I chose very conservative thresholds for now. Later we
can make them more aggressive, but it might require being smarter about
which loops we're optimizing. E.g. currently the biggest issue is that
with more aggressive thresholds we unroll many cold loops, which
increases compile time for no performance benefit (performance of those
loops is improved, but it doesn't matter since they are cold).
Test results for compile time (using 4 samples to reduce noise):
```
MultiSource/Benchmarks/VersaBench/ecbdes/ecbdes 5.19%
SingleSource/Benchmarks/Polybench/medley/reg_detect/reg_detect 4.19%
MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow 3.39%
MultiSource/Applications/JM/lencod/lencod 1.47%
MultiSource/Benchmarks/Fhourstones-3_1/fhourstones3_1 -6.06%
```
I didn't see any performance changes in the testsuite, but it improves
some internal tests.
Reviewers: hfinkel, chandlerc
Subscribers: llvm-commits, mzolotukhin
Differential Revision: http://reviews.llvm.org/D20482
llvm-svn: 270478
Summary:
...loop after the last iteration.
This is really hard to do correctly. The core problem is that we need to
model liveness through the induction PHIs from iteration to iteration in
order to get the correct results, and we need to correctly de-duplicate
the common subgraphs of instructions feeding some subset of the
induction PHIs. All of this can be driven either from a side effect at
some iteration or from the loop values used after the loop finishes.
This patch implements this by storing the forward-propagating analysis
of each instruction in a cache to recall whether it was free and whether
it has become live and thus counted toward the total unroll cost. Then,
at each sink for a value in the loop, we recursively walk back through
every value that feeds the sink, including looping back through the
iterations as needed, until we have marked the entire input graph as
live. Because we cache this, we never visit instructions more than twice
-- once when we analyze them and put them into the cache, and once when
we count their cost towards the unrolled loop. Also, because the cache
is only two bits and because we are dealing with relatively small
iteration counts, we can store all of this very densely in memory to
avoid this from becoming an excessively slow analysis.
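A self-contained sketch of what such a dense two-bit-per-entry cache could look like (the state names and layout are assumptions, not the patch's actual data structure):
```cpp
#include <cstdint>
#include <vector>

// Sketch under assumptions: two bits of analysis state per instruction per
// iteration, packed four entries to a byte so even large iteration counts
// stay cheap to store.
enum class InstState : uint8_t { Unknown = 0, Free = 1, Live = 2, Counted = 3 };

class UnrollCostCache {
  std::vector<uint8_t> Packed; // four 2-bit states per byte
  unsigned NumInsts;

public:
  UnrollCostCache(unsigned NumInsts, unsigned NumIterations)
      : Packed((size_t(NumInsts) * NumIterations + 3) / 4, 0),
        NumInsts(NumInsts) {}

  InstState get(unsigned Inst, unsigned Iter) const {
    size_t Idx = size_t(Iter) * NumInsts + Inst;
    return InstState((Packed[Idx / 4] >> (2 * (Idx % 4))) & 0x3);
  }

  void set(unsigned Inst, unsigned Iter, InstState S) {
    size_t Idx = size_t(Iter) * NumInsts + Inst;
    Packed[Idx / 4] &= uint8_t(~(0x3u << (2 * (Idx % 4))));
    Packed[Idx / 4] |= uint8_t(uint8_t(S) << (2 * (Idx % 4)));
  }
};
```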
The code here is still pretty gross. I would appreciate suggestions
about better ways to factor or split this up; I've stared too long at
the algorithmic side to really have a good sense of what the design
should probably look like.
Also, it might seem like we should do all of this bottom-up, but I think
that is a red herring. Specifically, the simplification power is *much*
greater working top-down. We can forward propagate very effectively,
even across strange and interesting recurrences around the backedge.
Because we use data to propagate, this doesn't cause a state space
explosion. Doing this level of constant folding, etc, would be very
expensive to do bottom-up because it wouldn't be until the last moment
that you could collapse everything. The current solution is essentially
a top-down simplification with a bottom-up cost accounting which seems
to get the best of both worlds. It makes the simplification incremental
and powerful while leaving everything dead until we *know* it is needed.
Finally, a core property of this approach is its *monotonicity*. At all
times, the current UnrolledCost is a conservatively low estimate. This
ensures that we will never early-exit from the analysis due to exceeding
a threshold when, if we had continued, the cost would have gone back
below the threshold. These kinds of bugs can cause incredibly
hard-to-track-down random changes to behavior.
We could use a similar (but much simpler) technique within the inliner
as well to avoid considering speculated code in the inline cost.
Reviewers: chandlerc
Subscribers: sanjoy, mzolotukhin, llvm-commits
Differential Revision: http://reviews.llvm.org/D11758
llvm-svn: 269388
Before r268509, Clang would disable the loop unroll pass when optimizing
for size. That commit enabled it in order to support unroll pragmas
in -Os builds. However, this regressed binary size in one of Chromium's
DLLs by ~100 KB.
This restores the original behaviour of no unrolling at -Os, but doing it
in LLVM instead of Clang makes more sense, and also allows the pragmas to
keep working.
Differential revision: http://reviews.llvm.org/D20115
llvm-svn: 269124
The original commit was reverted because of a buildbot problem with LazyCallGraph::SCC handling (not related to the OptBisect handling).
Differential Revision: http://reviews.llvm.org/D19172
llvm-svn: 267231
This patch implements an optimization bisect feature, which will allow optimizations to be selectively disabled at compile time in order to track down test failures that are caused by incorrect optimizations.
The bisection is enabled using a new command line option (-opt-bisect-limit). Individual passes that may be skipped call the OptBisect object (via an LLVMContext) to see if they should be skipped based on the bisect limit. A finer level of control (disabling individual transformations) can be managed through an additional OptBisect method, but this is not yet used.
The skip checking in this implementation is based on (and replaces) the skipOptnoneFunction check. Where that check was being called, a new call has been inserted in its place which checks the bisect limit and the optnone attribute. A new function call has been added for module and SCC passes that behaves in a similar way.
Differential Revision: http://reviews.llvm.org/D19172
llvm-svn: 267022
1. Add FullUnrollMaxCount option that works like MaxCount, but also limits
the unroll count for fully unrolled loops. So if a loop has an iteration
count over this, it won't fully unroll.
2. Add CLI options for MaxCount and the new option, so they can be tested
(plus a test).
3. Make partial unrolling obey MaxCount.
An example use-case (the out-of-tree one this is originally designed for) is
that a target's TTI can analyze a loop and decide on a max unroll count separate
from the size threshold, e.g. based on register pressure, and then constrain
LoopUnroll to not exceed that, regardless of the size of the unrolled loop.
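Point 2's options might be declared along these lines (a sketch; the flag names and descriptions here are not necessarily the ones added by the patch):
```cpp
#include "llvm/Support/CommandLine.h"
using namespace llvm;

// Sketch under assumptions: illustrative cl::opt declarations for the two limits.
static cl::opt<unsigned> UnrollMaxCount(
    "unroll-max-count", cl::Hidden,
    cl::desc("Set the max unroll count for partial and runtime unrolling"));

static cl::opt<unsigned> UnrollFullMaxCount(
    "unroll-full-max-count", cl::Hidden,
    cl::desc("Set the max unroll count for full unrolling; loops with a "
             "larger iteration count are not fully unrolled"));
```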
llvm-svn: 265562
Summary:
Specifically, when we perform runtime loop unrolling of a loop that
contains a convergent op, we can only unroll k times, where k divides
the loop trip multiple.
Without this change, we'll happily unroll e.g. the following loop
for (int i = 0; i < N; ++i) {
  if (i == 0) convergent_op();
  foo();
}
into
int i = 0;
if (N % 2 == 1) {
  convergent_op();
  foo();
  ++i;
}
for (; i < N - 1; i += 2) {
  if (i == 0) convergent_op();
  foo();
  foo();
}.
This is unsafe, because we've just added a control-flow dependency to
the convergent op in the prelude.
In general, runtime unrolling loops that contain convergent ops is safe
only if we don't have to emit a prelude, which is the case when the unroll
count divides the trip multiple.
Reviewers: resistor
Subscribers: llvm-commits, mzolotukhin
Differential Revision: http://reviews.llvm.org/D17526
llvm-svn: 263509
Summary: Check that we're using SCEV for the same loop we're simulating. Otherwise, we might try to use the iteration number of the current loop in SCEV expressions for inner/outer loops IVs, which is clearly incorrect.
Reviewers: chandlerc, hfinkel
Subscribers: sanjoy, llvm-commits, mzolotukhin
Differential Revision: http://reviews.llvm.org/D17632
llvm-svn: 261958
routine.
We were getting this wrong in small ways and generally being very
inconsistent about it across loop passes. Instead, let's have a common
place where we do this. One minor downside is that this will require
some analyses like SCEV in more places than they are strictly needed.
However, this seems benign as these analyses are complete no-ops, and
without this consistency we can in many cases end up with the legacy
pass manager scheduling deciding to split up a loop pass pipeline in
order to run the function analysis half-way through. It is very, very
annoying to fix these without just being very pedantic across the board.
The only loop passes I've not updated here are ones that use
AU.setPreservesAll() such as IVUsers (an analysis) and the pass printer.
They seemed less relevant.
With this patch, almost all of the problems in PR24804 around loop pass
pipelines are fixed. The one remaining issue is that we run simplify-cfg
and instcombine in the middle of the loop pass pipeline. We've recently
added some loop variants of these passes that would seem substantially
cleaner to use, but this at least gets us much closer to the previous
state. Notably, the seven loop pass managers is down to three.
I've not updated the loop passes using LoopAccessAnalysis because that
analysis hasn't been fully wired into LoopSimplify/LCSSA, and it isn't
clear that those transforms want to support those forms anyways. They
all run late anyways, so this is harmless. Similarly, LSR is left alone
because it already carefully manages its forms and doesn't need to get
fused into a single loop pass manager with a bunch of other loop passes.
LoopReroll didn't use loop simplified form previously, and I've updated
the test case to match the trivially different output.
Finally, I've also factored all the pass initialization for the passes
that use this technique as well, so that should be done regularly and
reliably.
Thanks to James for the help reviewing and thinking about this stuff,
and Ben for help thinking about it as well!
Differential Revision: http://reviews.llvm.org/D17435
llvm-svn: 261316
Summary:
Unrolling Analyzer is already pretty complicated, and it becomes harder and harder to exercise it with usual IR tests, as with them we can only check the final decision: whether the loop is unrolled or not. This change factors this framework out from LoopUnrollPass to analyses, which allows us to use unit tests.
The change itself is supposed to be NFC, except adding a couple of tests.
I plan to add more tests as I add new functionality and find/fix bugs.
Reviewers: chandlerc, hfinkel, sanjoy
Subscribers: zzheng, sanjoy, llvm-commits
Differential Revision: http://reviews.llvm.org/D16623
llvm-svn: 260169
The layering of where the various loop unroll parameters are
initialized and overridden here was very confusing, making it pretty
difficult to tell just how the various sources interacted. Instead, we
put all of the initialization logic together in a single function so
that it's obvious what overrides what.
llvm-svn: 257426
Currently we're unrolling loops more in minsize than in optsize, which
means -Oz will have a larger code size than -Os. That doesn't make any
sense.
This resolves the FIXME about this in LoopUnrollPass and extends the
optsize test to make sure we use the smaller threshold for minsize as
well.
llvm-svn: 257402
As of r255720, the loop pass manager will DTRT when passes update the
loop info for removed loops, so they no longer need to reach into
LPPassManager APIs to do this kind of transformation. This change very
nearly removes the need for the LPPassManager to even be passed into
loop passes - the only remaining pass that uses the LPM argument is
LoopUnswitch.
llvm-svn: 255797
A large number of loop utility functions take a `Pass *` and reach
into it to find out which analyses to preserve. There are a number of
problems with this:
- The APIs have access to pretty well any Pass state they want, so
it's hard to tell what they may or may not do.
- Other APIs have copied these and pass around a `Pass *` even though
they don't even use it. Some of these just hand a nullptr to the API
since the callers don't even have a pass available.
- Passes in the new pass manager don't work like the current ones, so
the APIs can't be used as is there.
Instead, we should explicitly thread the analysis results that we
actually care about through these APIs. This is both simpler and more
reusable.
llvm-svn: 255669
Apart from checking that the GlobalVariable is a constant, we should check
that it's not a weak constant, in which case we can't propagate its
value.
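The combined check ends up roughly like this (illustrative sketch):
```cpp
#include "llvm/IR/GlobalVariable.h"
using namespace llvm;

// Illustrative only: propagation is safe when the global itself is constant
// and its initializer cannot be replaced at link time (not weak/interposable).
bool canUseInitializer(const GlobalVariable *GV) {
  return GV->isConstant() && GV->hasDefinitiveInitializer();
}
```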
llvm-svn: 248327
We only checked that a global is initialized with constants, which is
incorrect. We should be checking that the GlobalVariable *is* a constant,
not just initialized with one.
llvm-svn: 247769
GlobalsAA must by definition be preserved in function passes, but the pass manager doesn't know that. Make each pass explicitly preserve GlobalsAA.
llvm-svn: 247263