* wrap code blocks in \code ... \endcode;
* refer to parameter names in paragraphs correctly (\arg is not what most
people want -- it starts a new paragraph);
* use \param instead of \arg to document parameters in order to be consistent
with the rest of the codebase.
llvm-svn: 163902
pointless checks in here, bad asserts, and just confusing code. I've
also added a bit more to the comment to clarify what this function is
really trying to do as it was not obvious to Duncan when studying it.
Thanks to Duncan for helping me dig through the issue.
No real functionality changed here in practical cases, and certainly no
test case. This is just cleanup spotted by inspection.
llvm-svn: 163897
inspection by Duncan during review. My suspicion is that we would still
have returned 0 anyways in this case, but doing it sooner is better.
llvm-svn: 163895
deeply suspicious and likely to go away eventually. Also fix a bogus
comment about one of the checks in the vector GEP analysis. Based on
review from Duncan.
llvm-svn: 163894
Originally I had anticipated needing to thread this through more bits of
the SROA pass itself, but that ended up not happening. In the end, this
is a much simpler way to manange the variable.
llvm-svn: 163893
This is essentially a ground up re-think of the SROA pass in LLVM. It
was initially inspired by a few problems with the existing pass:
- It is subject to the bane of my existence in optimizations: arbitrary
thresholds.
- It is overly conservative about which constructs can be split and
promoted.
- The vector value replacement aspect is separated from the splitting
logic, missing many opportunities where splitting and vector value
formation can work together.
- The splitting is entirely based around the underlying type of the
alloca, despite this type often having little to do with the reality
of how that memory is used. This is especially prevelant with unions
and base classes where we tail-pack derived members.
- When splitting fails (often due to the thresholds), the vector value
replacement (again because it is separate) can kick in for
preposterous cases where we simply should have split the value. This
results in forming i1024 and i2048 integer "bit vectors" that
tremendously slow down subsequnet IR optimizations (due to large
APInts) and impede the backend's lowering.
The new design takes an approach that fundamentally is not susceptible
to many of these problems. It is the result of a discusison between
myself and Duncan Sands over IRC about how to premptively avoid these
types of problems and how to do SROA in a more principled way. Since
then, it has evolved and grown, but this remains an important aspect: it
fixes real world problems with the SROA process today.
First, the transform of SROA actually has little to do with replacement.
It has more to do with splitting. The goal is to take an aggregate
alloca and form a composition of scalar allocas which can replace it and
will be most suitable to the eventual replacement by scalar SSA values.
The actual replacement is performed by mem2reg (and in the future
SSAUpdater).
The splitting is divided into four phases. The first phase is an
analysis of the uses of the alloca. This phase recursively walks uses,
building up a dense datastructure representing the ranges of the
alloca's memory actually used and checking for uses which inhibit any
aspects of the transform such as the escape of a pointer.
Once we have a mapping of the ranges of the alloca used by individual
operations, we compute a partitioning of the used ranges. Some uses are
inherently splittable (such as memcpy and memset), while scalar uses are
not splittable. The goal is to build a partitioning that has the minimum
number of splits while placing each unsplittable use in its own
partition. Overlapping unsplittable uses belong to the same partition.
This is the target split of the aggregate alloca, and it maximizes the
number of scalar accesses which become accesses to their own alloca and
candidates for promotion.
Third, we re-walk the uses of the alloca and assign each specific memory
access to all the partitions touched so that we have dense use-lists for
each partition.
Finally, we build a new, smaller alloca for each partition and rewrite
each use of that partition to use the new alloca. During this phase the
pass will also work very hard to transform uses of an alloca into a form
suitable for promotion, including forming vector operations, speculating
loads throguh PHI nodes and selects, etc.
After splitting is complete, each newly refined alloca that is
a candidate for promotion to a scalar SSA value is run through mem2reg.
There are lots of reasonably detailed comments in the source code about
the design and algorithms, and I'm going to be trying to improve them in
subsequent commits to ensure this is well documented, as the new pass is
in many ways more complex than the old one.
Some of this is still a WIP, but the current state is reasonbly stable.
It has passed bootstrap, the nightly test suite, and Duncan has run it
successfully through the ACATS and DragonEgg test suites. That said, it
remains behind a default-off flag until the last few pieces are in
place, and full testing can be done.
Specific areas I'm looking at next:
- Improved comments and some code cleanup from reviews.
- SSAUpdater and enabling this pass inside the CGSCC pass manager.
- Some datastructure tuning and compile-time measurements.
- More aggressive FCA splitting and vector formation.
Many thanks to Duncan Sands for the thorough final review, as well as
Benjamin Kramer for lots of review during the process of writing this
pass, and Daniel Berlin for reviewing the data structures and algorithms
and general theory of the pass. Also, several other people on IRC, over
lunch tables, etc for lots of feedback and advice.
llvm-svn: 163883
* wrap code blocks in \code ... \endcode;
* refer to parameter names in paragraphs correctly (\arg is not what most
people want -- it starts a new paragraph).
llvm-svn: 163790
This function writes out the current values of the counters and then resets
them. This can be used similarly to the __gcov_flush function to sync the
counters when need be. For instance, in a situation where the application
doesn't exit.
<rdar://problem/12185886>
llvm-svn: 163757
a pair of switch/branch where both depend on the value of the same variable and
the default case of the first switch/branch goes to the second switch/branch.
Code clean up and fixed a few issues:
1> handling the case where some cases of the 2nd switch are invalidated
2> correctly calculate the weight for the 2nd switch when it is a conditional eq
Testing case is modified from Alastair's original patch.
llvm-svn: 163635
The lookup tables did not get built in a deterministic order.
This makes them get built in the order that the corresponding phi nodes
were found.
llvm-svn: 163305
This adds a transformation to SimplifyCFG that attemps to turn switch
instructions into loads from lookup tables. It works on switches that
are only used to initialize one or more phi nodes in a common successor
basic block, for example:
int f(int x) {
switch (x) {
case 0: return 5;
case 1: return 4;
case 2: return -2;
case 5: return 7;
case 6: return 9;
default: return 42;
}
This speeds up the code by removing the hard-to-predict jump, and
reduces code size by removing the code for the jump targets.
llvm-svn: 163302
pointers-to-strong-pointers may be in play. These can lead to retains and
releases happening in unstructured ways, foiling the optimizer. This fixes
rdar://12150909.
llvm-svn: 163180
- CodeGenPrepare pass for identifying div/rem ops
- Backend specifies the type mapping using addBypassSlowDivType
- Enabled only for Intel Atom with O2 32-bit -> 8-bit
- Replace IDIV with instructions which test its value and use DIVB if the value
is positive and less than 256.
- In the case when the quotient and remainder of a divide are used a DIV
and a REM instruction will be present in the IR. In the non-Atom case
they are both lowered to IDIVs and CSE removes the redundant IDIV instruction,
using the quotient and remainder from the first IDIV. However,
due to this optimization CSE is not able to eliminate redundant
IDIV instructions because they are located in different basic blocks.
This is overcome by calculating both the quotient (DIV) and remainder (REM)
in each basic block that is inserted by the optimization and reusing the result
values when a subsequent DIV or REM instruction uses the same operands.
- Test cases check for the presents of the optimization when calculating
either the quotient, remainder, or both.
Patch by Tyler Nowicki!
llvm-svn: 163150
Scan the body of the loop and find instructions that may trap.
Use this information when deciding if it is safe to hoist or sink instructions.
Notice that we can optimize the search of instructions that may throw in the case of nested loops.
rdar://11518836
llvm-svn: 163132
For example, the ARM target does not have efficient ISel handling for vector
selects with scalar conditions. This patch adds a TLI hook which allows the
different targets to report which selects are supported well and which selects
should be converted to CF duting codegen prepare.
llvm-svn: 163093
We update until we hit a fixpoint. This is probably slow but also
slightly simplifies the code. It should also fix the occasional
invalid domtrees observed when building with expensive checking.
I couldn't find a case where this had a measurable slowdown, but
if someone finds a pathological case where it does we may have
to find a cleverer way of updating dominators here.
Thanks to Duncan for the test case.
llvm-svn: 163091
Most of the code guarded with ANDROIDEABI are not
ARM-specific, and having no relation with arm-eabi.
Thus, it will be more natural to call this
environment "Android" instead of "ANDROIDEABI".
Note: We are not using ANDROID because several projects
are using "-DANDROID" as the conditional compilation
flag.
llvm-svn: 163087
The old PHI updating code in loop-rotate was replaced with SSAUpdater a while
ago, it has no problems with comples PHIs. What had to be fixed is detecting
whether a loop was already rotated and updating dominators when multiple exits
were present.
This change increases overall code size a bit, mostly due to additional loop
unrolling opportunities. Passes test-suite and selfhost with -verify-dom-info.
Fixes PR7447.
Thanks to Andy for the input on the domtree updating code.
llvm-svn: 162912
This lets the user run the program from a different directory and still have the
.gcda files show up in the correct place.
<rdar://problem/12179524>
llvm-svn: 162855
This disables malloc-specific optimization when -fno-builtin (or -ffreestanding)
is specified. This has been a problem for a long time but became more severe
with the recent memory builtin improvements.
Since the memory builtin functions are used everywhere, this required passing
TLI in many places. This means that functions that now have an optional TLI
argument, like RecursivelyDeleteTriviallyDeadFunctions, won't remove dead
mallocs anymore if the TLI argument is missing. I've updated most passes to do
the right thing.
Fixes PR13694 and probably others.
llvm-svn: 162841
No test case, undefined shifts get folded early, but can occur when other
transforms generate a constant. Thanks to Duncan for bringing this up.
llvm-svn: 162755
No intended behavior change. This was introduced in r162023. With the fixed
algorithm a Release build of ARMInstPrinter.cpp goes from 16s to 10s on a
2011 MBP.
llvm-svn: 162559
optimizations are guarded by the -enable-double-float-shrink LLVM option.
Last bit of PR13574. Patch by Weiming Zhao <weimingz@codeaurora.org>.
llvm-svn: 162368
This optimization is really just replacing allocas wholesale with
globals, there is no scalarization.
The underlying motivation for this patch is to simplify the SROA pass
and focus it on splitting and promoting allocas.
llvm-svn: 162271
where some fact lake a=b dominates a use in a phi, but doesn't dominate the
basic block itself.
This feature could also be implemented by splitting critical edges, but at least
with the current algorithm reasoning about the dominance directly is faster.
The time for running "opt -O2" in the testcase in pr10584 is 1.003 times slower
and on gcc as a single file it is 1.0007 times faster.
llvm-svn: 162023
- memcpy size is wrongly truncated into 32-bit and treat 8GB memcpy is
0-sized memcpy
- as 0-sized memcpy/memset is already removed before SimplifyMemTransfer
and SimplifyMemSet in visitCallInst, replace 0 checking with
assertions.
- replace getZExtValue() with getLimitedValue() according to
Eli Friedman
llvm-svn: 161923
and allow some optimizations to turn conditional branches into unconditional.
This commit adds a simple control-flow optimization which merges two consecutive
basic blocks which are connected by a single edge. This allows the codegen to
operate on larger basic blocks.
rdar://11973998
llvm-svn: 161852
may invalidate its AliasSet because SSAUpdater does not update the AliasSet properly.
This patch teaches SSAUpdater to notify AliasSet that it made changes.
The testcase in PR12901 is too big to be useful and I could not reduce it to a normal size.
rdar://11872059 PR12901
llvm-svn: 161803
multiple scalar promotions on a single loop. This also has the effect of
preserving the order of stores sunk out of loops, which is aesthetically
pleasing, and it happens to fix the testcase in PR13542, though it doesn't
fix the underlying problem.
llvm-svn: 161459
An unsigned value converted to floating-point will always be greater than
a negative constant. Unfortunately InstCombine reversed the check so that
unsigned values were being optimized to always be greater than all positive
floating-point constants. <rdar://problem/12029145>
llvm-svn: 161452
The "findUsedStructTypes" method is very expensive to run. It needs to be
optimized so that LTO can run faster. Splitting this method out of the Module
class will help this occur. For instance, it can keep a list of seen objects so
that it doesn't process them over and over again.
llvm-svn: 161228
This can happen as long as the instruction is not reachable. Instcombine does generate these unreachable malformed selects when doing RAUW
llvm-svn: 160874
might be deliberate "one time" leaks, so that leak checkers can find them.
This is a reapply of r160602 with the fix that this time I'm committing the
code I thought I was committing last time; the I->eraseFromParent() goes
*after* the break out of the loop.
llvm-svn: 160664
r160529 that was subsequently reverted. The fix was to not call
GV->eraseFromParent() right before the caller does the same. The existing
testcases already caught this bug if run under valgrind.
llvm-svn: 160602
GetBestDestForJumpOnUndef() assumes there is at least 1 successor, which isn't
true if the block ends in an indirect branch with no successors. Fix this by
bailing out earlier in this case.
llvm-svn: 160546
Fixes PR13371: indvars pass incorrectly substitutes 'undef' values.
I do not like this fix. It's needed until/unless the meaning of undef
changes. It attempts to be complete according to the IR spec, but I
don't have much confidence in the implementation given the difficulty
testing undefined behavior. Worse, this invalidates some of my
hard-fought work on indvars and LSR to optimize pointer induction
variables. It results benchmark regressions, which I'll track
internally. On x86_64 no LTO I see:
-3% huffbench
-3% 400.perlbench
-8% fhourstones
My only suggestion for recovering is to change the meaning of
undef. If we could trust an arbitrary instruction to produce a some
real value that can be manipulated (e.g. incremented) according to
non-undef rules, then this case could be easily handled with SCEV.
llvm-svn: 160421
This places limits on CollectSubexprs to constrains the number of
reassociation possibilities. It limits the recursion depth and skips
over chains of nested recurrences outside the current loop.
Fixes PR13361. Although underlying SCEV behavior is still potentially bad.
llvm-svn: 160340
It turns out that ASan relied on the at-the-end block insertion order to
(purely by happenstance) disable some LLVM optimizations, which in turn
start firing when the ordering is made more "normal". These
optimizations in turn merge many of the instrumentation reporting calls
which breaks the return address based error reporting in ASan.
We're looking at several different options for fixing this.
llvm-svn: 160256
This is particularly useful to the backend code generators which try to
process things in the incoming function order.
Also, cleanup some uses of IRBuilder to be a bit simpler and more clear.
llvm-svn: 160254
the move of *Builder classes into the Core library.
No uses of this builder in Clang or DragonEgg I could find.
If there is a desire to have an IR-building-support library that
contains all of these builders, that can be easily added, but currently
it seems likely that these add no real overhead to VMCore.
llvm-svn: 160243
IRBuilder, DIBuilder, etc.
This is the proper layering as MDBuilder can't be used (or implemented)
without the Core Metadata representation.
Patches to Clang and Dragonegg coming up.
llvm-svn: 160237
All SCEV expressions used by LSR formulae must be safe to
expand. i.e. they may not contain UDiv unless we can prove nonzero
denominator.
Fixes PR11356: LSR hoists UDiv.
llvm-svn: 160205
%shr = lshr i64 %key, 3
%0 = load i64* %val, align 8
%sub = add i64 %0, -1
%and = and i64 %sub, %shr
ret i64 %and
to:
%shr = lshr i64 %key, 3
%0 = load i64* %val, align 8
%sub = add i64 %0, 2305843009213693951
%and = and i64 %sub, %shr
ret i64 %and
The demanded bit optimization is actually a pessimization because add -1 would
be codegen'ed as a sub 1. Teach the demanded constant shrinking optimization
to check for negated constant to make sure it is actually reducing the width
of the constant.
rdar://11793464
llvm-svn: 160101
This patch removes ~70 lines in InstCombineLoadStoreAlloca.cpp and makes both functions a bit more aggressive than before :)
In theory, we can be more aggressive when removing an alloca than a malloc, because an alloca pointer should never escape, but we are not taking advantage of this anyway
llvm-svn: 159952
This means we can do cheap DSE for heap memory.
Nothing is done if the pointer excapes or has a load.
The churn in the tests is mostly due to objectsize, since we want to make sure we
don't delete the malloc call before evaluating the objectsize (otherwise it becomes -1/0)
llvm-svn: 159876
IntegersSubsetMapping
- Replaced type of Items field from std::list with std::map. In neares future I'll test it with DenseMap and do the correspond replacement
if possible.
llvm-svn: 159703
IntegersSubsetMapping
- Replaced type of Items field from std::list with std::map. In neares future I'll test it with DenseMap and do the correspond replacement
if possible.
llvm-svn: 159659
This was always part of the VMCore library out of necessity -- it deals
entirely in the IR. The .cpp file in fact was already part of the VMCore
library. This is just a mechanical move.
I've tried to go through and re-apply the coding standard's preferred
header sort, but at 40-ish files, I may have gotten some wrong. Please
let me know if so.
I'll be committing the corresponding updates to Clang and Polly, and
Duncan has DragonEgg.
Thanks to Bill and Eric for giving the green light for this bit of cleanup.
llvm-svn: 159421
When both a load/store and its address computation are being vectorized, it can
happen that the address-computation vectorization destroys SCEV's ability
to analyize the relative pointer offsets. As a result (like with the aliasing
analysis info), we need to precompute the necessary information prior to
instruction fusing.
This was found during stress testing (running through the test suite with a very
low required chain length); unfortunately, I don't have a small test case.
llvm-svn: 159332
The original algorithm only used recursive pair fusion of equal-length
types. This is now extended to allow pairing of any types that share
the same underlying scalar type. Because we would still generally
prefer the 2^n-length types, those are formed first. Then a second
set of iterations form the non-2^n-length types.
Also, a call to SimplifyInstructionsInBlock has been added after each
pairing iteration. This takes care of DCE (and a few other things)
that make the following iterations execute somewhat faster. For the
same reason, some of the simple shuffle-combination cases are now
handled internally.
There is some additional refactoring work to be done, but I've had
many requests for this feature, so additional refactoring will come
soon in future commits (as will additional test cases).
llvm-svn: 159330
Maintaining this kind of checking in different places is dangerous, extending
Instruction::isSameOperationAs consolidates this logic into one place. Here
I've added an optional flags parameter and two flags that are important for
vectorization: CompareIgnoringAlignment and CompareUsingScalarTypes.
llvm-svn: 159329
include/llvm/Analysis/DebugInfo.h to include/llvm/DebugInfo.h.
The reasoning is because the DebugInfo module is simply an interface to the
debug info MDNodes and has nothing to do with analysis.
llvm-svn: 159312
Original commit message:
If a constant or a function has linkonce_odr linkage and unnamed_addr, mark it
hidden. Being linkonce_odr guarantees that it is available in every dso that
needs it. Being a constant/function with unnamed_addr guarantees that the
copies don't have to be merged.
llvm-svn: 159272
before the expression root. Any existing operators that are changed to use one
of them needs to be moved between it and the expression root, and recursively
for the operators using that one. When I rewrote RewriteExprTree I accidentally
inverted the logic, resulting in the compacting going down from operators to
operands rather than up from operands to the operators using them, oops. Fix
this, resolving PR12963.
llvm-svn: 159265
// C - zext(bool) -> bool ? C - 1 : C
if (ZExtInst *ZI = dyn_cast<ZExtInst>(Op1))
if (ZI->getSrcTy()->isIntegerTy(1))
return SelectInst::Create(ZI->getOperand(0), SubOne(C), C);
This ends up forming sext i1 instructions that codegen to terrible code. e.g.
int blah(_Bool x, _Bool y) {
return (x - y) + 1;
}
=>
movzbl %dil, %eax
movzbl %sil, %ecx
shll $31, %ecx
sarl $31, %ecx
leal 1(%rax,%rcx), %eax
ret
Without the rule, llvm now generates:
movzbl %sil, %ecx
movzbl %dil, %eax
incl %eax
subl %ecx, %eax
ret
It also helps with ARM (and pretty much any target that doesn't have a sext i1 :-).
The transformation was done as part of Eli's r75531. He has given the ok to
remove it.
rdar://11748024
llvm-svn: 159230
merge all zero-sized alloca's into one, fixing c43204g from the Ada ACATS
conformance testsuite. What happened there was that a variable sized object
was being allocated on the stack, "alloca i8, i32 %size". It was then being
passed to another function, which tested that the address was not null (raising
an exception if it was) then manipulated %size bytes in it (load and/or store).
The optimizers cleverly managed to deduce that %size was zero (congratulations
to them, as it isn't at all obvious), which made the alloca zero size, causing
the optimizers to replace it with null, which then caused the check mentioned
above to fail, and the exception to be raised, wrongly. Note that no loads
and stores were actually being done to the alloca (the loop that does them is
executed %size times, i.e. is not executed), only the not-null address check.
llvm-svn: 159202
- simplifycfg: invoke undef/null -> unreachable
- instcombine: invoke new -> invoke expect(0, 0) (an arbitrary NOOP intrinsic; only done if the allocated memory is unused, of course)
- verifier: allow invoke of intrinsics (to make the previous step work)
llvm-svn: 159146
hidden. Being linkonce_odr guarantees that it is available in every dso that
needs it. Being a constant/function with unnamed_addr guarantees that the
copies don't have to be merged.
llvm-svn: 159136
This allows the user/front-end to specify a model that is better
than what LLVM would choose by default. For example, a variable
might be declared as
@x = thread_local(initialexec) global i32 42
if it will not be used in a shared library that is dlopen'ed.
If the specified model isn't supported by the target, or if LLVM can
make a better choice, a different model may be used.
llvm-svn: 159077
This fixes PR5997.
These transforms were disabled because codegen couldn't deal with other
uses of trunc(x). This is now handled by the peephole pass.
This causes no regressions on x86-64.
llvm-svn: 159003
Original message:
Performance optimizations:
- SwitchInst: case values stored separately from Operands List. It allows to make faster access to individual case value numbers or ranges.
- Optimized IntItem, added APInt value caching.
- Optimized IntegersSubsetGeneric: added optimizations for cases when subset is single number or when subset consists from single numbers only.
llvm-svn: 158997
- provide more extensive set of functions to detect library allocation functions (e.g., malloc, calloc, strdup, etc)
- provide an API to compute the size and offset of an object pointed by
Move a few clients (GVN, AA, instcombine, ...) to the new API.
This implementation is a lot more aggressive than each of the custom implementations being replaced.
Patch reviewed by Nick Lewycky and Chandler Carruth, thanks.
llvm-svn: 158919
I'll admit I'm not entirely satisfied with this change, but it seemed
the cleanest option. Other suggestions quite welcome
The issue is that the traits specializations have static methods which
return the typedef'ed PHI_iterator type. In both the IR and MI layers
this is typedef'ed to a custom iterator class defined in an anonymous
namespace giving the types and the functions returning them internal
linkage. However, because the traits specialization is defined in the
'llvm' namespace (where it has to be, specialized template lives there),
and is in turn used in the templated implementation of the SSAUpdater.
This led to the linkage conflict that Clang now warns about.
The simplest solution to me was just to define the PHI_iterator as
a nested class inside the trait specialization. That way it still
doesn't get scoped widely, it can't be accidentally reused somewhere,
etc. This is a little gross just because nested class definitions are
a little gross, but the alternatives seem more ad-hoc.
llvm-svn: 158799
The present implementation handles only TBAA and FP metadata, discarding everything else.
For debug metadata, the current behavior is maintained (the debug metadata associated with
one of the instructions will be kept, discarding that attached to the other).
This should address PR 13040.
llvm-svn: 158606
Dynamic GEPs created by SROA needed to insert extra "i32 0"
operands to index through structs and arrays to get to the
vector being indexed.
llvm-svn: 158590
For non-address users, Base and Scaled registers are not specially
associated to fit an address mode, so SCEVExpander should apply normal
expansion rules. Otherwise we may sink computation into inner loops
that have already been optimized.
llvm-svn: 158537
example degenerate phi nodes and binops that use themselves in unreachable code.
Thanks to Charles Davis for the testcase that uncovered this can of worms.
llvm-svn: 158508
since then the entire expression must equal zero (similarly for other operations
with an absorbing element). With this in place a bunch of reassociate code for
handling constants is dead since it is all taken care of when linearizing. No
intended functionality change.
llvm-svn: 158398
This patch extends FoldBranchToCommonDest to fold unconditional branches.
For unconditional branches, we fold them if it is easy to update the phi nodes
in the common successors.
rdar://10554090
llvm-svn: 158392
POD type, causing memory corruption when mapping to APInts with bitwidth > 64.
Merge another crash testcase into crash.ll while there.
llvm-svn: 158369
topologies, it is quite possible for a leaf node to have huge multiplicity, for
example: x0 = x*x, x1 = x0*x0, x2 = x1*x1, ... rapidly gives a value which is x
raised to a vast power (the multiplicity, or weight, of x). This patch fixes
the computation of weights by correctly computing them no matter how big they
are, rather than just overflowing and getting a wrong value. It turns out that
the weight for a value never needs more bits to represent than the value itself,
so it is enough to represent weights as APInts of the same bitwidth and do the
right overflow-avoiding dance steps when computing weights. As a side-effect it
reduces the number of multiplies needed in some cases of large powers. While
there, in view of external uses (eg by the vectorizer) I made LinearizeExprTree
static, pushing the rank computation out into users. This is progress towards
fixing PR13021.
llvm-svn: 158358
This saves a cast, and zext is more expensive on platforms with subreg support
than trunc is. This occurs in the BSD implementation of memchr(3), see PR12750.
On the synthetic benchmark from that bug stupid_memchr and bsd_memchr have the
same performance now when not inlining either function.
stupid_memchr: 323.0us
bsd_memchr: 321.0us
memchr: 479.0us
where memchr is the llvm-gcc compiled bsd_memchr from osx lion's libc. When
inlining is enabled bsd_memchr still regresses down to llvm-gcc memchr time,
I haven't fully understood the issue yet, something is grossly mangling the
loop after inlining.
llvm-svn: 158297
-%a + 42
into
42 - %a
previously we were emitting:
-(%a + 42)
This fixes the infinite loop in PR12338. The generated code is still not perfect, though.
Will work on that next
llvm-svn: 158237
problem was that by moving instructions around inside the function, the pass
could accidentally move the iterator being used to advance over the function
too. Fix this by only processing the instruction equal to the iterator, and
leaving processing of instructions that might not be equal to the iterator
to later (later = after traversing the basic block; it could also wait until
after traversing the entire function, but this might make the sets quite big).
Original commit message:
Grab-bag of reassociate tweaks. Unify handling of dead instructions and
instructions to reoptimize. Exploit this to more systematically eliminate
dead instructions (this isn't very useful in practice but is convenient for
analysing some testcase I am working on). No need for WeakVH any more: use
an AssertingVH instead.
llvm-svn: 158226
can move instructions within the instruction list. If the instruction just
happens to be the one the basic block iterator is pointing to, and it is
moved to a different basic block, then we get into an infinite loop due to
the iterator running off the end of the basic block (for some reason this
doesn't fire any assertions). Original commit message:
Grab-bag of reassociate tweaks. Unify handling of dead instructions and
instructions to reoptimize. Exploit this to more systematically eliminate
dead instructions (this isn't very useful in practice but is convenient for
analysing some testcase I am working on). No need for WeakVH any more: use
an AssertingVH instead.
llvm-svn: 158199
There are some that I didn't remove this round because they looked like
obvious stubs. There are dead variables in gtest too, they should be
fixed upstream.
llvm-svn: 158090
instructions to reoptimize. Exploit this to more systematically eliminate
dead instructions (this isn't very useful in practice but is convenient for
analysing some testcase I am working on). No need for WeakVH any more: use
an AssertingVH instead.
llvm-svn: 158073
replacement to make it at least as generic as the instruction being replaced.
This includes:
* dropping nsw/nuw flags
* getting the least restrictive tbaa and fpmath metadata
* merging ranges
Fixes PR12979.
llvm-svn: 157958
inject some code in that will run via the "__mod_init_func" method that
registers the gcov "writeout" function to execute at exit time.
The problem is that the "__mod_term_func" method of specifying d'tors is
deprecated on Darwin. And it can lead to some ambiguities when dealing with
multiple libraries.
<rdar://problem/11110106>
llvm-svn: 157852
- compute size & offset at the same time. The side-effects of this are that we now support negative GEPs. It's now approaching a phase that it can be reused by other passes (e.g., lowering of the objectsize intrinsic)
- use APInt throughout to handle wrap-arounds
- add support for PHI instrumentation
- add a cache (required for recursive PHIs anyway)
- remove hoisting support for now, since it was wrong in a few cases
sorry for the churn here.. tests will follow soon.
llvm-svn: 157775
- hoist checks out of loops where SCEV is smart enough
- add additional statistics to measure how much we loose for not supporting interprocedural and pointers loaded from memory
llvm-svn: 157649
The test case feeds the following into InstCombine's visitSelect:
%tobool8 = icmp ne i32 0, 0
%phitmp = select i1 %tobool8, i32 3, i32 0
Then instcombine replaces the right side of the switch with 0, doesn't notice
that nothing changes and tries again indefinitely.
This fixes PR12897.
llvm-svn: 157587
Implemented IntItem - the wrapper around APInt. Why not to use APInt item directly right now?
1. It will very difficult to implement case ranges as series of small patches. We got several large and heavy patches. Each patch will about 90-120 kb. If you replace ConstantInt with APInt in SwitchInst you will need to changes at the same time all Readers,Writers and absolutely all passes that uses SwitchInst.
2. We can implement APInt pool inside and save memory space. E.g. we use several switches that works with 256 bit items (switch on signatures, or strings). We can avoid value duplicates in this case.
3. IntItem can be easyly easily replaced with APInt.
4. Currenly we can interpret IntItem both as ConstantInt and as APInt. It allows to provide SwitchInst methods that works with ConstantInt for non-updated passes.
Why I need it right now? Currently I need to update SimplifyCFG pass (EqualityComparisons). I need to work with APInts directly a lot, so peaces of code
ConstantInt *V = ...;
if (V->getValue().ugt(AnotherV->getValue()) {
...
}
will look awful. Much more better this way:
IntItem V = ConstantIntVal->getValue();
if (AnotherV < V) {
}
Of course any reviews are welcome.
P.S.: I'm also going to rename ConstantRangesSet to IntegersSubset, and CRSBuilder to IntegersSubsetMapping (allows to map individual subsets of integers to the BasicBlocks).
Since in future these classes will founded on APInt, it will possible to use them in more generic ways.
llvm-svn: 157576
replicating the code for every place it's needed, we instead generate a function
that does that for us. This function is local to the executable, so there
shouldn't be any writing violations.
llvm-svn: 157564
making it stronger and more sane.
Delete the code from tblgen that produced the old code.
Besides being a path forward in intrinsic sanity, this also eliminates a bunch of
machine generated code that was compiled into Function.o
llvm-svn: 157545
then it doesn't alter the instructions composing it, however it would continue
to move the instructions to just before the expression root. Ensure it doesn't
move them either, so now it really does nothing if there is nothing to do. That
commit also ensured that nsw etc flags weren't cleared if the expression was not
being changed. Tweak this a bit so that it doesn't clear flags on the initial
part of a computation either if that part didn't change but later bits did.
llvm-svn: 157518
are passed in. However, those arguments may be in a write-protected area, as far
as the runtime library is concerned. For instance, the data could be placed into
a 'linkedit' section, which isn't writable. Emit the code from
llvm_gcda_increment_indirect_counter directly into the function instead.
Note: The code for this is ugly, and can lead to bloat. We should look into
simplifying this code instead of having all of these branches.
<rdar://problem/11181370>
llvm-svn: 157505
with arbitrary topologies (previously it would give up when hitting a diamond
in the use graph for example). The testcase from PR12764 is now reduced from
a pile of additions to the optimal 1617*%x0+208. In doing this I changed the
previous strategy of dropping all uses for expression leaves to one of dropping
all but one use. This works out more neatly (but required a bunch of tweaks)
and is also safer: some recently fixed bugs during recursive linearization were
because the linearization code thinks it completely owns a node if it has no uses
outside the expression it is linearizing. But if the node was also in another
expression that had been linearized (and thus all uses of the node from that
expression dropped) then the conclusion that it is completely owned by the
expression currently being linearized is wrong. Keeping one use from within each
linearized expression avoids this kind of mistake.
llvm-svn: 157467
LowerSwitch::Clusterify : main functinality was replaced with CRSBuilder::optimize, so big part of Clusterify's code was reduced.
test/Transform/LowerSwitch/feature.ll - this test was refactored: grep + count was replaced with FileCheck usage.
llvm-svn: 157384
leader table. That's because it wasn't expecting instructions to turn up as
leader for a value number that is not its own, but equality propagation could
create this situation. One solution is to have the leader table use a WeakVH
but this slows down GVN by about 5%. Instead just have equality propagation not
add instructions to the leader table, only constants and arguments. In theory
this might cause GVN to run more (each time it changes something it runs again)
but it doesn't seem to occur enough to cause a slow down.
llvm-svn: 157251
so that it can be reused in MemCpyOptimizer. This analysis is needed to remove
an unnecessary memcpy when returning a struct into a local variable.
rdar://11341081
PR12686
llvm-svn: 156776
refactor code a bit to enable future changes to support run-time information
add support to compute allocation sizes at run-time if penalty > 1 (e.g., malloc(x), calloc(x, y), and VLAs)
llvm-svn: 156515
replace the operands of expressions with only one use with undef and generate
a new expression for the original without using RAUW to update the original.
Thus any copies of the original expression held in a vector may end up
referring to some bogus value - and using a ValueHandle won't help since there
is no RAUW. There is already a mechanism for getting the effect of recursion
non-recursively: adding the value to be recursed on to RedoInsts. But it wasn't
being used systematically. Have various places where recursion had snuck in at
some point use the RedoInsts mechanism instead. Fixes PR12169.
llvm-svn: 156379
The primitive conservative heuristic seems to give a slight overall
improvement while not regressing stuff. Make it available to wider
testing. If you notice any speed regressions (or significant code
size regressions) let me know!
llvm-svn: 156258
This came up when a change in block placement formed a cmov and slowed down a
hot loop by 50%:
ucomisd (%rdi), %xmm0
cmovbel %edx, %esi
cmov is a really bad choice in this context because it doesn't get branch
prediction. If we emit it as a branch, an out-of-order CPU can do a better job
(if the branch is predicted right) and avoid waiting for the slow load+compare
instruction to finish. Of course it won't help if the branch is unpredictable,
but those are really rare in practice.
This patch uses a dumb conservative heuristic, it turns all cmovs that have one
use and a direct memory operand into branches. cmovs usually save some code
size, so we disable the transform in -Os mode. In-Order architectures are
unlikely to benefit as well, those are included in the
"predictableSelectIsExpensive" flag.
It would be better to reuse branch probability info here, but BPI doesn't
support select instructions currently. It would make sense to use the same
heuristics as the if-converter pass, which does the opposite direction of this
transform.
Test suite shows a small improvement here and there on corei7-level machines,
but the actual results depend a lot on the used microarchitecture. The
transformation is currently disabled by default and available by passing the
-enable-cgp-select2branch flag to the code generator.
Thanks to Chandler for the initial test case to him and Evan Cheng for providing
me with comments and test-suite numbers that were more stable than mine :)
llvm-svn: 156234
of the CodeExtractor utility. This allows speculatively computing input
and output sets to measure the likely size impact of the code
extraction.
These sets cannot be reused sadly -- we mutate the function prior to
forming the final sets used by the actual extraction.
The interface has been revamped slightly to make it easier to use
correctly by making the interface const and sinking the computation of
the number of exit blocks into the full extraction function and away
from the rest of this logic which just computed two output parameters.
llvm-svn: 156168
and expose it as a utility class rather than as free function wrappers.
The simple free-function interface works well for the bugpoint-specific
pass's uses of code extraction, but in an upcoming patch for more
advanced code extraction, they simply don't expose a rich enough
interface. I need to expose various stages of the process of doing the
code extraction and query information to decide whether or not to
actually complete the extraction or give up.
Rather than build up a new predicate model and pass that into these
functions, just take the class that was actually implementing the
functions and lift it up into a proper interface that can be used to
perform code extraction. The interface is cleaned up and re-documented
to work better in a header. It also is now setup to accept the blocks to
be extracted in the constructor rather than in a method.
In passing this essentially reverts my previous commit here exposing
a block-level query for eligibility of extraction. That is no longer
necessary with the more rich interface as clients can query the
extraction object for eligibility directly. This will reduce the number
of walks of the input basic block sequence by quite a bit which is
useful if this enters the normal optimization pipeline.
llvm-svn: 156163
minor behavior changes with this, but nothing I have seen evidence of in
the wild or expect to be meaningful. The real goal is unifying our logic
and simplifying the interfaces. A summary of the changes follows:
- Make 'callIsSmall' actually accept a callsite so it can handle
intrinsics, and simplify callers appropriately.
- Nuke a completely bogus declaration of 'callIsSmall' that was still
lurking in InlineCost.h... No idea how this got missed.
- Teach the 'isInstructionFree' about the various more intelligent
'free' heuristics that got added to the inline cost analysis during
review and testing. This mostly surrounds int->ptr and ptr->int casts.
- Switch most of the interesting parts of the inline cost analysis that
were essentially computing 'is this instruction free?' to use the code
metrics routine instead. This way we won't keep duplicating logic.
All of this is motivated by the desire to allow other passes to compute
a roughly equivalent 'cost' metric for a particular basic block as the
inline cost analysis. Sadly, re-using the same analysis for both is
really messy because only the actual inline cost analysis is ever going
to go to the contortions required for simplification, SROA analysis,
etc.
llvm-svn: 156140
extraction into a public interface. Also clean it up and apply it more
consistently such that we check for landing pads *anywhere* in the
extracted code, not just in single-block extraction.
This will be used to guide decisions in passes that are planning to
eventually perform a round of code extraction.
llvm-svn: 156114
<rdar://problem/11291436>.
This is a second attempt at a fix for this, the first was r155468. Thanks
to Chandler, Bob and others for the feedback that helped me improve this.
llvm-svn: 155866
Allow the "SplitCriticalEdge" function to split the edge to a landing pad. If
the pass is *sure* that it thinks it knows what it's doing, then it may go ahead
and specify that the landing pad can have its critical edge split. The loop
unswitch pass is one of these passes. It will split the critical edges of all
edges coming from a loop to a landing pad not within the loop. Doing so will
retain important loop analysis information, such as loop simplify.
llvm-svn: 155817
Target specific types should not be vectorized. As a practical matter,
these types are already register matched (at least in the x86 case),
and codegen does not always work correctly (at least in the ppc case,
and this is not worth fixing because ppc_fp128 is currently broken and
will probably go away soon).
llvm-svn: 155729
The required checks are moved to ChainInstruction() itself and the
policy decisions are moved to IVChain::isProfitableInc().
Also cache the ExprBase in IVChain to avoid frequent recomputations.
No functional change intended.
llvm-svn: 155676
elements to minimize the number of multiplies required to compute the
final result. This uses a heuristic to attempt to form near-optimal
binary exponentiation-style multiply chains. While there are some cases
it misses, it seems to at least a decent job on a very diverse range of
inputs.
Initial benchmarks show no interesting regressions, and an 8%
improvement on SPASS. Let me know if any other interesting results (in
either direction) crop up!
Credit to Richard Smith for the core algorithm, and helping code the
patch itself.
llvm-svn: 155616
Original commit message:
Defer some shl transforms to DAGCombine.
The shl instruction is used to represent multiplication by a constant
power of two as well as bitwise left shifts. Some InstCombine
transformations would turn an shl instruction into a bit mask operation,
making it difficult for later analysis passes to recognize the
constsnt multiplication.
Disable those shl transformations, deferring them to DAGCombine time.
An 'shl X, C' instruction is now treated mostly the same was as 'mul X, C'.
These transformations are deferred:
(X >>? C) << C --> X & (-1 << C) (When X >> C has multiple uses)
(X >>? C1) << C2 --> X << (C2-C1) & (-1 << C2) (When C2 > C1)
(X >>? C1) << C2 --> X >>? (C1-C2) & (-1 << C2) (When C1 > C2)
The corresponding exact transformations are preserved, just like
div-exact + mul:
(X >>?,exact C) << C --> X
(X >>?,exact C1) << C2 --> X << (C2-C1)
(X >>?,exact C1) << C2 --> X >>?,exact (C1-C2)
The disabled transformations could also prevent the instruction selector
from recognizing rotate patterns in hash functions and cryptographic
primitives. I have a test case for that, but it is too fragile.
llvm-svn: 155362
While the patch was perfect and defect free, it exposed a really nasty
bug in X86 SelectionDAG that caused an llc crash when compiling lencod.
I'll put the patch back in after fixing the SelectionDAG problem.
llvm-svn: 155181
The shl instruction is used to represent multiplication by a constant
power of two as well as bitwise left shifts. Some InstCombine
transformations would turn an shl instruction into a bit mask operation,
making it difficult for later analysis passes to recognize the
constsnt multiplication.
Disable those shl transformations, deferring them to DAGCombine time.
An 'shl X, C' instruction is now treated mostly the same was as 'mul X, C'.
These transformations are deferred:
(X >>? C) << C --> X & (-1 << C) (When X >> C has multiple uses)
(X >>? C1) << C2 --> X << (C2-C1) & (-1 << C2) (When C2 > C1)
(X >>? C1) << C2 --> X >>? (C1-C2) & (-1 << C2) (When C1 > C2)
The corresponding exact transformations are preserved, just like
div-exact + mul:
(X >>?,exact C) << C --> X
(X >>?,exact C1) << C2 --> X << (C2-C1)
(X >>?,exact C1) << C2 --> X >>?,exact (C1-C2)
The disabled transformations could also prevent the instruction selector
from recognizing rotate patterns in hash functions and cryptographic
primitives. I have a test case for that, but it is too fragile.
llvm-svn: 155136