This patch simply mirrors the attributes we give to @llvm.nvvm.reflect
to the __nvvm_reflect libdevice call. This shaves about 30% of the code
in libdevice away because of CSE opportunities. It's also helps us
figure out that libdevice implementations of transcendental functions
don't have side-effects.
llvm-svn: 265060
Summary:
As discussed on llvm-dev[1].
This change adds the basic boilerplate code around having this intrinsic
in LLVM:
- Changes in Intrinsics.td, and the IR Verifier
- A lowering pass to lower @llvm.experimental.guard to normal
control flow
- Inliner support
[1]: http://lists.llvm.org/pipermail/llvm-dev/2016-February/095523.html
Reviewers: reames, atrick, chandlerc, rnk, JosephTremoulet, echristo
Subscribers: mcrosier, llvm-commits
Differential Revision: http://reviews.llvm.org/D18527
llvm-svn: 264976
Commit r260791 contained an error in that it would introduce a cross-module
reference in the old module. It also introduced O(N^2) complexity in the
module cloner by requiring the entire module to be visited for each function.
Fix both of these problems by avoiding use of the CloneDebugInfoMetadata
function (which is only designed to do intra-module cloning) and cloning
function-attached metadata in the same way that we clone all other metadata.
Differential Revision: http://reviews.llvm.org/D18583
llvm-svn: 264935
Widening a PHI requires us to insert a trunc.
The logical place for this trunc is in the same BB as the PHI.
This is not possible if the BB is terminated by a catchswitch.
This fixes PR27133.
llvm-svn: 264926
Summary:
This gives callers flexibility to pass lambdas with captures, which lets
callers avoid the C-style void*-ptr closure style. (Currently, callers
in clang store state in the PassManagerBuilderBase arg.)
No functional change, and the new API is backwards-compatible.
Reviewers: chandlerc
Subscribers: joker.eph, cfe-commits
Differential Revision: http://reviews.llvm.org/D18613
llvm-svn: 264918
This change prevents the loop vectorizer from vectorizing when all of the vector
types it generates will be scalarized. I've run into this problem on the PPC's QPX
vector ISA, which only holds floating-point vector types. The loop vectorizer
will, however, happily vectorize loops with purely integer computation. Here's
an example:
LV: The Smallest and Widest types: 32 / 32 bits.
LV: The Widest register is: 256 bits.
LV: Found an estimated cost of 0 for VF 1 For instruction: %indvars.iv25 = phi i64 [ 0, %entry ], [ %indvars.iv.next26, %for.body ]
LV: Found an estimated cost of 0 for VF 1 For instruction: %arrayidx = getelementptr inbounds [1600 x i32], [1600 x i32]* %a, i64 0, i64 %indvars.iv25
LV: Found an estimated cost of 0 for VF 1 For instruction: %2 = trunc i64 %indvars.iv25 to i32
LV: Found an estimated cost of 1 for VF 1 For instruction: store i32 %2, i32* %arrayidx, align 4
LV: Found an estimated cost of 1 for VF 1 For instruction: %indvars.iv.next26 = add nuw nsw i64 %indvars.iv25, 1
LV: Found an estimated cost of 1 for VF 1 For instruction: %exitcond27 = icmp eq i64 %indvars.iv.next26, 1600
LV: Found an estimated cost of 0 for VF 1 For instruction: br i1 %exitcond27, label %for.cond.cleanup, label %for.body
LV: Scalar loop costs: 3.
LV: Found an estimated cost of 0 for VF 2 For instruction: %indvars.iv25 = phi i64 [ 0, %entry ], [ %indvars.iv.next26, %for.body ]
LV: Found an estimated cost of 0 for VF 2 For instruction: %arrayidx = getelementptr inbounds [1600 x i32], [1600 x i32]* %a, i64 0, i64 %indvars.iv25
LV: Found an estimated cost of 0 for VF 2 For instruction: %2 = trunc i64 %indvars.iv25 to i32
LV: Found an estimated cost of 2 for VF 2 For instruction: store i32 %2, i32* %arrayidx, align 4
LV: Found an estimated cost of 1 for VF 2 For instruction: %indvars.iv.next26 = add nuw nsw i64 %indvars.iv25, 1
LV: Found an estimated cost of 1 for VF 2 For instruction: %exitcond27 = icmp eq i64 %indvars.iv.next26, 1600
LV: Found an estimated cost of 0 for VF 2 For instruction: br i1 %exitcond27, label %for.cond.cleanup, label %for.body
LV: Vector loop of width 2 costs: 2.
LV: Found an estimated cost of 0 for VF 4 For instruction: %indvars.iv25 = phi i64 [ 0, %entry ], [ %indvars.iv.next26, %for.body ]
LV: Found an estimated cost of 0 for VF 4 For instruction: %arrayidx = getelementptr inbounds [1600 x i32], [1600 x i32]* %a, i64 0, i64 %indvars.iv25
LV: Found an estimated cost of 0 for VF 4 For instruction: %2 = trunc i64 %indvars.iv25 to i32
LV: Found an estimated cost of 4 for VF 4 For instruction: store i32 %2, i32* %arrayidx, align 4
LV: Found an estimated cost of 1 for VF 4 For instruction: %indvars.iv.next26 = add nuw nsw i64 %indvars.iv25, 1
LV: Found an estimated cost of 1 for VF 4 For instruction: %exitcond27 = icmp eq i64 %indvars.iv.next26, 1600
LV: Found an estimated cost of 0 for VF 4 For instruction: br i1 %exitcond27, label %for.cond.cleanup, label %for.body
LV: Vector loop of width 4 costs: 1.
...
LV: Selecting VF: 8.
LV: The target has 32 registers
LV(REG): Calculating max register usage:
LV(REG): At #0 Interval # 0
LV(REG): At #1 Interval # 1
LV(REG): At #2 Interval # 2
LV(REG): At #4 Interval # 1
LV(REG): At #5 Interval # 1
LV(REG): VF = 8
The problem is that the cost model here is not wrong, exactly. Since all of
these operations are scalarized, their cost (aside from the uniform ones) are
indeed VF*(scalar cost), just as the model suggests. In fact, the larger the VF
picked, the lower the relative overhead from the loop itself (and the
induction-variable update and check), and so in a sense, picking the largest VF
here is the right thing to do.
The problem is that vectorizing like this, where all of the vectors will be
scalarized in the backend, isn't really vectorizing, but rather interleaving.
By itself, this would be okay, but then the vectorizer itself also interleaves,
and that's where the problem manifests itself. There's aren't actually enough
scalar registers to support the normal interleave factor multiplied by a factor
of VF (8 in this example). In other words, the problem with this is that our
register-pressure heuristic does not account for scalarization.
While we might want to improve our register-pressure heuristic, I don't think
this is the right motivating case for that work. Here we have a more-basic
problem: The job of the vectorizer is to vectorize things (interleaving aside),
and if the IR it generates won't generate any actual vector code, then
something is wrong. Thus, if every type looks like it will be scalarized (i.e.
will be split into VF or more parts), then don't consider that VF.
This is not a problem specific to PPC/QPX, however. The problem comes up under
SSE on x86 too, and as such, this change fixes PR26837 too. I've added Sanjay's
reduced test case from PR26837 to this commit.
Differential Revision: http://reviews.llvm.org/D18537
llvm-svn: 264904
PGOFuncNames are used as the key to retrieve the Function definition from the
MD5 stored in the profile. For internal linkage function, we prefix the source
file name to the PGOFuncNames. LTO's internalization privatizes many global linkage
symbols. This happens after value profile annotation, but those internal
linkage functions should not have a source prefix. To differentiate compiler
generated internal symbols from original ones, PGOFuncName meta data are
created and attached to the original internal symbols in the value profile
annotation step. If a symbol does not have the meta data, its original linkage
must be non-internal.
Also add a new map that maps PGOFuncName's MD5 value to the function definition.
Differential Revision: http://reviews.llvm.org/D17895
llvm-svn: 264902
These checks are redundant and can be removed
Reviewers: hans
Subscribers: llvm-commits, mzolotukhin
Differential Revision: http://reviews.llvm.org/D18564
llvm-svn: 264872
Prior to this patch, the MemorySSA caching visitor would cache all
calls that it visited. When paired with phi optimization, this can be
problematic. Consider:
define void @foo() {
; 1 = MemoryDef(liveOnEntry)
call void @clobberFunction()
br i1 undef, label %if.end, label %if.then
if.then:
; MemoryUse(??)
call void @readOnlyFunction()
; 2 = MemoryDef(1)
call void @clobberFunction()
br label %if.end
if.end:
; 3 = MemoryPhi(...)
; MemoryUse(?)
call void @readOnlyFunction()
ret void
}
When optimizing MemoryUse(?), we visit defs 1 and 2, so we note to
cache them later. We ultimately end up not being able to optimize
passed the Phi, so we set MemoryUse(?) to point to the Phi. We then
cache the clobbering call for def 1 to be the Phi.
This commit changes this behavior so that we wipe out any calls
added to VisistedCalls while visiting the defs of a phi we couldn't
optimize.
Aside: With this patch, we now can bootstrap clang/LLVM without a
single MemorySSA verifier failure. Woohoo. :)
llvm-svn: 264820
This patch teaches the caching MemorySSA walker a few things:
1. Not to walk Phis we've walked before. It seems that we tried to do
this before, but it didn't work so well in cases like:
define void @foo() {
%1 = alloca i8
%2 = alloca i8
br label %begin
begin:
; 3 = MemoryPhi({%0,liveOnEntry},{%end,2})
; 1 = MemoryDef(3)
store i8 0, i8* %2
br label %end
end:
; MemoryUse(?)
load i8, i8* %1
; 2 = MemoryDef(1)
store i8 0, i8* %2
br label %begin
}
Because we wouldn't put Phis in Q.Visited until we tried to visit them.
So, when trying to optimize MemoryUse(?):
- We would visit 3 above
- ...Which would make us put {%0,liveOnEntry} in Q.Visited
- ...Which would make us visit {%0,liveOnEntry}
- ...Which would make us put {%end,2} in Q.Visited
- ...Which would make us visit {%end,2}
- ...Which would make us visit 3
- ...Which would realize we've already visited everything in 3
- ...Which would make us conservatively return 3.
In the added test-case, (@looped_visitedonlyonce) this behavior would
cause us to give incorrect results. Specifically, we'd visit 4 twice
in the same query, but on the second visit, we'd skip while.cond because
it had been visited, visit if.then/if.then2, and cache "1" as the
clobbering def on the way back.
2. If we try to walk the defs of a {Phi,MemLoc} and see it has been
visited before, just hand back the Phi we're trying to optimize.
I promise this isn't as terrible as it seems. :)
We now insert {Phi,MemLoc} pairs just before walking the Phi's upward
defs. So, we check the cache for the {Phi,MemLoc} pair before checking
if we've already walked the Phi.
The {Phi,MemLoc} pair is (almost?) always guaranteed to have a cache
entry if we've already fully walked it, because we cache as we go.
So, if the {Phi,MemLoc} pair isn't in cache, either:
(a) we must be in the process of visiting it (in which case, we can't
give a better answer in a cache-as-we-go DFS walker)
(b) we visited it, but didn't cache it on the way back (...which seems
to require `ModifyingAccess` to not dominate `StartingAccess`,
so I'm 99% sure that would be an error. If it's not an error, I
haven't been able to get it to happen locally, so I suspect it's
rare.)
- - - - -
As a consequence of this change, we no longer skip upward defs of phis,
so we can kill the `VisitedOnlyOne` check. This gives us better accuracy
than we had before, at the cost of potentially doing a bit more work
when we have a loop.
llvm-svn: 264814
This is effectively NFC, minus the renaming of the options
(-cyclone-prefetch-distance -> -prefetch-distance).
The change was requested by Tim in D17943.
llvm-svn: 264806
We have known races on profile counters, which can be reproduced by enabling
-fsanitize=thread and -fprofile-instr-generate simultaneously on a
multi-threaded program. This patch avoids reporting those races by not
instrumenting the reads and writes coming from the instruction profiler.
llvm-svn: 264805
During ADCE, track which debug info scopes still have live references
from the code, and delete debug info intrinsics for the dead ones.
These intrinsics describe the locations of variables (in registers or
stack slots). If there's no code left corresponding to a variable's
scope, then there's no way to reference the variable in the debugger and
it doesn't matter what its value is.
I add a DEBUG printout when the described location in an SSA register,
in case it helps some trying to track down why locations get lost.
However, we still delete these; the scope itself isn't attached to any
real code, so the ship has already sailed.
llvm-svn: 264800
Since we have moved to a model where functions are imported in bulk from
each source module after making summary-based importing decisions, there
is no longer a need to link metadata as a postpass, and all users have
been removed.
This essentially reverts r255909 and follow-on fixes.
llvm-svn: 264763
When eliminating or merging almost empty basic blocks, the existence of non-trivial PHI nodes
is currently used to recognize potential loops of which the block is the header and keep the block.
However, the current algorithm fails if the loops' exit condition is evaluated only with volatile
values hence no PHI nodes in the header. Especially when such a loop is an outer loop of a nested
loop, the loop is collapsed into a single loop which prevent later optimizations from being
applied (e.g., transforming nested loops into simplified forms and loop vectorization).
The patch augments the existing PHI node-based check by adding a pre-test if the BB actually
belongs to a set of loop headers and not eliminating it if yes.
llvm-svn: 264697
Personality is copied as part of copyFunctionAttributes, but it is
invalid on a declaration. Remove the personality attribute it the
function body is not cloned.
Also add a verifier run over output modules in the llvm-split tool.
llvm-svn: 264667
On OS X El Capitan and iOS 9, the linker supports a new section
attribute, live_support, which allows dead stripping to remove dead
globals along with the ASAN metadata about them.
With this change __asan_global structures are emitted in a new
__DATA,__asan_globals section on Darwin.
Additionally, there is a __DATA,__asan_liveness section with the
live_support attribute. Each entry in this section is simply a tuple
that binds together the liveness of a global variable and its ASAN
metadata structure. Thus the metadata structure will be alive if and
only if the global it references is also alive.
Review: http://reviews.llvm.org/D16737
llvm-svn: 264645
When eliminating or merging almost empty basic blocks, the existence of non-trivial PHI nodes
is currently used to recognize potential loops of which the block is the header and keep the block.
However, the current algorithm fails if the loops' exit condition is evaluated only with volatile
values hence no PHI nodes in the header. Especially when such a loop is an outer loop of a nested
loop, the loop is collapsed into a single loop which prevent later optimizations from being
applied (e.g., transforming nested loops into simplified forms and loop vectorization).
The patch augments the existing PHI node-based check by adding a pre-test if the BB actually
belongs to a set of loop headers and not eliminating it if yes.
llvm-svn: 264596
Don't set the function hotness attribute on the fly. This changes the CFG
branch probability of the caller function, which leads to inconsistent BB
ordering. This patch moves the attribute setting to a separated loop after
the counts in all functions are populated.
Fixes PR27024 - PGO instrumentation profile data is not reflected in correct
basic blocks.
Differential Revision: http://reviews.llvm.org/D18491
llvm-svn: 264594
Summary:
Add a statistic to count the number of imported functions. Also, add a
new -print-imports option to emit a trace of imported functions, that
works even for an NDEBUG build.
Note that emitOptimizationRemark does not work for the above printing as
it expects a Function object and DebugLoc, neither of which we have
with summary-based importing.
This is part 2 of D18487, the first part was committed separately as
r264536.
Reviewers: joker.eph
Subscribers: llvm-commits, joker.eph
Differential Revision: http://reviews.llvm.org/D18487
llvm-svn: 264537
With r264503, aliases are now being added to the GlobalsToImport set
even when their aliasees can't be imported due to their linkage type.
While the importing worked correctly (the aliases imported as
declarations) due to the logic in doImportAsDefinition, there is no
point to adding them to the GlobalsToImport set.
Additionally, with D18487 it was resulting in incorrectly printing a
message indicating that the alias was imported.
To avoid this, delay adding aliases to the GlobalsToImport set until
after the linkage type of the aliasee is checked.
This patch is part of D18487.
llvm-svn: 264536
Summary:
Now that the summary contains the full reference/call graph, we can
replace the existing function importer that loads and inspect the IR
to iteratively walk the call graph by a traversal based purely on the
summary information. Decouple the actual importing decision from any
IR manipulation.
Reviewers: tejohnson
Subscribers: llvm-commits, joker.eph
Differential Revision: http://reviews.llvm.org/D18343
From: Mehdi Amini <mehdi.amini@apple.com>
llvm-svn: 264503
This changes RS4GC to lower calls to ``@llvm.experimental.deoptimize``
to gc.statepoints wrapping ``__llvm_deoptimize``, and changes
``callsGCLeafFunction`` to recognize ``@llvm.experimental.deoptimize``
as a non GC leaf function.
I've had to hard code the ``"__llvm_deoptimize"`` name in
RewriteStatepointsForGC; since ``TargetLibraryInfo`` is available only
during codegen. This isn't without precedent in the codebase, so I'm
not overtly concerned.
llvm-svn: 264456
We try to hoist the insertion point as high as possible to encourage
sharing. However, we must be careful not to hoist into a catchswitch as
it is both an EHPad and a terminator.
llvm-svn: 264344
isDependenceDistanceOfOne asserts that the store and the load access
through the same type. This function is also used by
removeDependencesFromMultipleStores so we need to make sure we filter
out mismatching types before reaching this point.
Now we do this when the initial candidates are gathered.
This is a refinement of the fix made in r262267.
Fixes PR27048.
llvm-svn: 264313
There are a few bugs in the walker that this patch addresses.
Primarily:
- Caching can break when we have multiple BBs without phis
- We weren't optimizing some phis properly
- Because of how the DFS iterator works, there were times where we
wouldn't cache any results of our DFS
I left the test cases with FIXMEs in, because I'm not sure how much
effort it will take to get those to work (read: We'll probably
ultimately have to end up redoing the walker, or we'll have to come up
with some creative caching tricks), and more test coverage = better.
Differential Revision: http://reviews.llvm.org/D18065
llvm-svn: 264180
When you have multiple LCSSA (single-operand) PHIs that are converted
into two-operand PHIs due to versioning, only assert that the PHI
currently being converted has a single operand. I.e. we don't want to
check PHIs that were converted earlier in the loop.
Fixes PR27023.
Thanks to Karl-Johan Karlsson for the minimized testcase!
llvm-svn: 264081
It's a bug fix.
For rerolled loops SE trip count remains unchanged. It leads to incorrect work of the next passes.
My patch just resets SE info for rerolled loop forcing SE to re-evaluate it next time it requested.
I also added a verifier call in the exisitng test to be sure no invalid SE data remain. Without my fix this test would fail with -verify-scev.
Differential Revision: http://reviews.llvm.org/D18316
llvm-svn: 264051
Summary:
Without tree pruning clang has 2,667,552 points.
Wiht only dominators pruning: 1,515,586.
With both dominators & predominators pruning: 1,340,534.
Resubmit of r262103.
Differential Revision: http://reviews.llvm.org/D18341
llvm-svn: 264003
If we have a BB with only MemoryDefs, live-in calculations will ignore
it. This means we get results like this:
define void @foo(i8* %p) {
; 1 = MemoryDef(liveOnEntry)
store i8 0, i8* %p
br i1 undef, label %if.then, label %if.end
if.then:
; 2 = MemoryDef(1)
store i8 1, i8* %p
br label %if.end
if.end:
; 3 = MemoryDef(1)
store i8 2, i8* %p
ret void
}
...When there should be a MemoryPhi in the `if.end` BB.
This patch fixes that behavior.
llvm-svn: 263991
The sinpi/cospi can be replaced with sincospi to remove unnecessary
computations. However, we need to make sure that the calls are within
the same function!
This fixes PR26993.
llvm-svn: 263875
Summary:
ThinLTO is relying on linkInModule to import selected function.
However a lot of "magic" was hidden in linkInModule and the IRMover,
who would rename and promote global variables on the fly.
This is moving to an approach where the steps are decoupled and the
client is reponsible to specify the list of globals to import.
As a consequence some test are changed because they were relying on
the previous behavior which was importing the definition of *every*
single global without control on the client side.
Now the burden is on the client to decide if a global has to be imported
or not.
Reviewers: tejohnson
Subscribers: joker.eph, llvm-commits
Differential Revision: http://reviews.llvm.org/D18122
From: Mehdi Amini <mehdi.amini@apple.com>
llvm-svn: 263863
The loop on IVOperand's incoming values assumes IVOperand to be an
induction variable on the loop over which `S Pred X` is invariant;
otherwise loop invariant incoming values to IVOperand are not guaranteed
to dominate the comparision.
This fixes PR26973.
llvm-svn: 263827
Summary:
These dependencies would be used in the future to reduce the number
of instrumented blocks(http://reviews.llvm.org/rL262103)
This is submitted as a separate CL because of previous problems with
ARM.
Subscribers: aemerson
Differential Revision: http://reviews.llvm.org/D18227
llvm-svn: 263797
Summary:
It can hurt performance to prefetch ahead too much. Be conservative for
now and don't prefetch ahead more than 3 iterations on Cyclone.
Reviewers: hfinkel
Subscribers: llvm-commits, mzolotukhin
Differential Revision: http://reviews.llvm.org/D17949
llvm-svn: 263772
Summary:
And use this TTI for Cyclone. As it was explained in the original RFC
(http://thread.gmane.org/gmane.comp.compilers.llvm.devel/92758), the HW
prefetcher work up to 2KB strides.
I am also adding tests for this and the previous change (D17943):
* Cyclone prefetching accesses with a large stride
* Cyclone not prefetching accesses with a small stride
* Generic Aarch64 subtarget not prefetching either
Reviewers: hfinkel
Subscribers: aemerson, rengolin, llvm-commits, mzolotukhin
Differential Revision: http://reviews.llvm.org/D17945
llvm-svn: 263771
Summary:
Use the new LoopVersioning facility (D16712) to add noalias metadata in
the vector loop if we versioned with memchecks. This can enable some
optimization opportunities further down the pipeline (see the included
test or the benchmark improvement quoted in D16712).
The test also covers the bug I had in the initial version in D16712.
The vectorizer did not previously use LoopVersioning. The reason is
that the vectorizer performs its transformations in single shot. It
creates an empty single-block vector loop that it then populates with
the widened, if-converted instructions. Thus creating an intermediate
versioned scalar loop seems wasteful.
So this patch (rather than bringing in LoopVersioning fully) adds a
special interface to LoopVersioning to allow the vectorizer to add
no-alias annotation while still performing its own versioning.
As the vectorizer propagates metadata from the instructions in the
original loop to the vector instructions we also check the pointer in
the original instruction and see if LoopVersioning can add no-alias
metadata based on the issued memchecks.
Reviewers: hfinkel, nadav, mzolotukhin
Subscribers: mzolotukhin, llvm-commits
Differential Revision: http://reviews.llvm.org/D17191
llvm-svn: 263744
Summary:
If we decide to version a loop to benefit a transformation, it makes
sense to record the now non-aliasing accesses in the newly versioned
loop. This allows non-aliasing information to be used by subsequent
passes.
One example is 456.hmmer in SPECint2006 where after loop distribution,
we vectorize one of the newly distributed loops. To vectorize we
version this loop to fully disambiguate may-aliasing accesses. If we
add the noalias markers, we can use the same information in a later DSE
pass to eliminate some dead stores which amounts to ~25% of the
instructions of this hot memory-pipeline-bound loop. The overall
performance improves by 18% on our ARM64.
The scoped noalias annotation is added in LoopVersioning. The patch
then enables this for loop distribution. A follow-on patch will enable
it for the vectorizer. Eventually this should be run by default when
versioning the loop but first I'd like to get some feedback whether my
understanding and application of scoped noalias metadata is correct.
Essentially my approach was to have a separate alias domain for each
versioning of the loop. For example, if we first version in loop
distribution and then in vectorization of the distributed loops, we have
a different set of memchecks for each versioning. By keeping the scopes
in different domains they can conveniently be defined independently
since different alias domains don't affect each other.
As written, I also have a separate domain for each loop. This is not
necessary and we could save some metadata here by using the same domain
across the different loops. I don't think it's a big deal either way.
Probably the best is to review the tests first to see if I mapped this
problem correctly to scoped noalias markers. I have plenty of comments
in the tests.
Note that the interface is prepared for the vectorizer which needs the
annotateInstWithNoAlias API. The vectorizer does not use LoopVersioning
so we need a way to pass in the versioned instructions. This is also
why the maps have to become part of the object state.
Also currently, we only have an AA-aware DSE after the vectorizer if we
also run the LTO pipeline. Depending how widely this triggers we may
want to schedule a DSE toward the end of the regular pass pipeline.
Reviewers: hfinkel, nadav, ashutosh.nema
Subscribers: mssimpso, aemerson, llvm-commits, mcrosier
Differential Revision: http://reviews.llvm.org/D16712
llvm-svn: 263743
This is similar to D18133 where we allowed profile weights on select instructions.
This extends that change to also allow the 'unpredictable' attribute of branches to apply to selects.
A test to check that 'unpredictable' metadata is preserved when cloning instructions was checked in at:
http://reviews.llvm.org/rL263648
Differential Revision: http://reviews.llvm.org/D18220
llvm-svn: 263716
This splits out the logic that maps the `"statepoint-id"` attribute into
the actual statepoint ID, and the `"statepoint-num-patch-bytes"`
attribute into the number of patchable bytes the statpeoint is lowered
into. The new home of this logic is in IR/Statepoint.cpp, and this
refactoring will support similar functionality when lowering calls with
deopt operand bundles in the future.
llvm-svn: 263685
Summary:
Fix LSRInstance::HoistInsertPosition() to check the original insert
position block first for a canonical insertion point that is dominated
by all inputs. This leads to SCEV being able to reuse more instructions
since it currently tracks the instructions it creates for reuse by
keeping a table of <Value, insert point> pairs.
Reviewers: atrick
Subscribers: mcrosier, mzolotukhin, llvm-commits
Differential Revision: http://reviews.llvm.org/D18001
llvm-svn: 263644
The latent bug that LLE exposed in the LoopVectorizer was resolved
(PR26952).
The pass can be disabled with -mllvm -enable-loop-load-elim=0
llvm-svn: 263595
There is something strange going on with debug info (.eh_frame_hdr)
disappearing when msan.module_ctor are placed in comdat sections.
Moving this functionality under flag, disabled by default.
llvm-svn: 263579
This was a latent bug that got exposed by the change to add LoopSimplify
as a dependence to LoopLoadElimination. Since LoopInfo was corrupted
after LV, LoopSimplify mis-compiled nbench in the test-suite (more
details in the PR).
The problem was that when we create the blocks for predicated stores we
didn't add those to any loops.
The original testcase for store predication provides coverage for this
assuming we verify LI on the way out of LV.
Fixes PR26952.
llvm-svn: 263565
Since the static getGlobalIdentifier and getGUID methods are now called
for global values other than functions, reflect that by moving these
methods to the GlobalValue class.
llvm-svn: 263524
(Resubmitting after fixing missing file issue)
With the changes in r263275, there are now more than just functions in
the summary. Completed the renaming of data structures (started in
r263275) to reflect the wider scope. In particular, changed the
FunctionIndex* data structures to ModuleIndex*, and renamed related
variables and comments. Also renamed the files to reflect the changes.
A companion clang patch will immediately succeed this patch to reflect
this renaming.
llvm-svn: 263513
Summary:
Specifically, when we perform runtime loop unrolling of a loop that
contains a convergent op, we can only unroll k times, where k divides
the loop trip multiple.
Without this change, we'll happily unroll e.g. the following loop
for (int i = 0; i < N; ++i) {
if (i == 0) convergent_op();
foo();
}
into
int i = 0;
if (N % 2 == 1) {
convergent_op();
foo();
++i;
}
for (; i < N - 1; i += 2) {
if (i == 0) convergent_op();
foo();
foo();
}.
This is unsafe, because we've just added a control-flow dependency to
the convergent op in the prelude.
In general, runtime unrolling loops that contain convergent ops is safe
only if we don't have emit a prelude, which occurs when the unroll count
divides the trip multiple.
Reviewers: resistor
Subscribers: llvm-commits, mzolotukhin
Differential Revision: http://reviews.llvm.org/D17526
llvm-svn: 263509
Summary: This now try to reorder instructions in order to help create the optimizable pattern.
Reviewers: craig.topper, spatel, dexonsmith, Prazek, chandlerc, joker.eph, majnemer
Differential Revision: http://reviews.llvm.org/D16523
llvm-svn: 263503
With the changes in r263275, there are now more than just functions in
the summary. Completed the renaming of data structures (started in
r263275) to reflect the wider scope. In particular, changed the
FunctionIndex* data structures to ModuleIndex*, and renamed related
variables and comments. Also renamed the files to reflect the changes.
A companion clang patch will immediately succeed this patch to reflect
this renaming.
llvm-svn: 263490
As noted in:
https://llvm.org/bugs/show_bug.cgi?id=26636
This doesn't accomplish anything on its own. It's the first step towards preserving
and using branch weights with selects.
The next step would be to make sure we're propagating the info in all of the other
places where we create selects (SimplifyCFG, InstCombine, etc). I don't think there's
an easy fix to make this happen; we have to look at each transform individually to
determine how to correctly propagate the weights.
Along with that step, we need to then use the weights when making subsequent transform
decisions such as discussed in http://reviews.llvm.org/D16836.
The inliner test is independent but closely related. It verifies that metadata is
preserved when both branches and selects are cloned.
Differential Revision: http://reviews.llvm.org/D18133
llvm-svn: 263482
Summary:
Previously we had a notion of convergent functions but not of convergent
calls. This is insufficient to correctly analyze calls where the target
is unknown, e.g. indirect calls.
Now a call is convergent if it targets a known-convergent function, or
if it's explicitly marked as convergent. As usual, we can remove
convergent where we can prove that no convergent operations are
performed in the call.
Originally landed as r261544, then reverted in r261544 for (incidental)
build breakage. Re-landed here with no changes.
Reviewers: chandlerc, jingyue
Subscribers: llvm-commits, tra, jhen, hfinkel
Differential Revision: http://reviews.llvm.org/D17739
llvm-svn: 263481
The motivating example is this
for (j = n; j > 1; j = i) {
i = j / 2;
}
The signed division is safely to be changed to an unsigned division (j is known
to be larger than 1 from the loop guard) and later turned into a single shift
without considering the sign bit.
llvm-svn: 263406
This follows up on the related AVX instruction transforms, but this
one is too strange to do anything more with. Intel's behavioral
description of this instruction in its Software Developer's Manual
is tragi-comic.
llvm-svn: 263340
commit ae14bf6488e8441f0f6d74f00455555f6f3943ac
Author: Mehdi Amini <mehdi.amini@apple.com>
Date: Fri Mar 11 17:15:50 2016 +0000
Remove PreserveNames template parameter from IRBuilder
Summary:
Following r263086, we are now relying on a flag on the Context to
discard Value names in release builds.
Reviewers: chandlerc
Subscribers: mzolotukhin, llvm-commits
Differential Revision: http://reviews.llvm.org/D18023
From: Mehdi Amini <mehdi.amini@apple.com>
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@263258
91177308-0d34-0410-b5e6-96231b3b80d8
until we can figure out what to do about clang and Release build testing.
This reverts commit 263258.
llvm-svn: 263321
Summary:
This intrinsic, together with deoptimization operand bundles, allow
frontends to express transfer of control and frame-local state from
one (typically more specialized, hence faster) version of a function
into another (typically more generic, hence slower) version.
In languages with a fully integrated managed runtime this intrinsic can
be used to implement "uncommon trap" like functionality. In unmanaged
languages like C and C++, this intrinsic can be used to represent the
slow paths of specialized functions.
Note: this change does not address how `@llvm.experimental_deoptimize`
is lowered. That will be done in a later change.
Reviewers: chandlerc, rnk, atrick, reames
Subscribers: llvm-commits, kmod, mjacob, maksfb, mcrosier, JosephTremoulet
Differential Revision: http://reviews.llvm.org/D17732
llvm-svn: 263281
Value profile instrumentation treats inline asm calls like they are
indirect calls. This causes problems when the 'Callee' is passed to a
ptrtoint cast -- the verifier rightly claims that this is bogus and
crashes opt.
llvm-svn: 263278
Summary:
This patch adds support for including a full reference graph including
call graph edges and other GV references in the summary.
The reference graph edges can be used to make importing decisions
without materializing any source modules, can be used in the plugin
to make file staging decisions for distributed build systems, and is
expected to have other uses.
The call graph edges are recorded in each function summary in the
bitcode via a list of <CalleeValueIds, StaticCount> tuples when no PGO
data exists, or <CalleeValueId, StaticCount, ProfileCount> pairs when
there is PGO, where the ValueId can be mapped to the function GUID via
the ValueSymbolTable. In the function index in memory, the call graph
edges reference the target via the CalleeGUID instead of the
CalleeValueId.
The reference graph edges are recorded in each summary record with a
list of referenced value IDs, which can be mapped to value GUID via the
ValueSymbolTable.
Addtionally, a new summary record type is added to record references
from global variable initializers. A number of bitcode records and data
structures have been renamed to reflect the newly expanded scope of the
summary beyond functions. More cleanup will follow.
Reviewers: joker.eph, davidxl
Subscribers: joker.eph, llvm-commits
Differential Revision: http://reviews.llvm.org/D17212
llvm-svn: 263275
Summary:
Following r263086, we are now relying on a flag on the Context to
discard Value names in release builds.
Reviewers: chandlerc
Subscribers: mzolotukhin, llvm-commits
Differential Revision: http://reviews.llvm.org/D18023
From: Mehdi Amini <mehdi.amini@apple.com>
llvm-svn: 263258
Summary:
Following r263086, we are replacing this by a runtime check.
More cleanup will follow on the IRBuilder itself, but I submitted
this patch separately as SROA has a fancy "prefixInserter" class
that needs extra-love.
Reviewers: chandlerc
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D18022
From: Mehdi Amini <mehdi.amini@apple.com>
llvm-svn: 263256
member type.
Because of how this type is used by the ValueTable, it cannot actually
have hidden visibility. GCC actually nicely warns about this but Clang
just silently ... I don't even know. =/ We should do a better job either
way though.
This should resolve a bunch of the GCC warnings about visibility that
the port of GVN triggered and make the visibility story a bit more
correct.
llvm-svn: 263250
much to my horror, so use variables to fix it in place.
This terrifies me. Both basic-aa and memdep will provide more precise
information when the domtree and/or the loop info is available. Because
of this, if your pass (like GVN) requires domtree, and then queries
memdep or basic-aa, it will get more precise results. If it does this in
the other order, it gets less precise results.
All of the ideas I have for fixing this are, essentially, terrible. Here
I've just caused us to stop having unspecified behavior as different
implementations evaluate the order of these arguments differently. I'm
actually rather glad that they do, or the fragility of memdep and
basic-aa would have gone on unnoticed. I've left comments so we don't
immediately break this again. This should fix bots whose host compilers
evaluate the order of arguments differently from Clang.
llvm-svn: 263231
This was originally a pointer to support pass managers which didn't use
AnalysisManagers. However, that doesn't realistically come up much and
the complexity of supporting it doesn't really make sense.
In fact, *many* parts of the pass manager were just assuming the pointer
was never null already. This at least makes it much more explicit and
clear.
llvm-svn: 263219
Since the names are used in a loop this does more work in debug builds. In
release builds value names are generally discarded so we don't have to do
the concatenation at all. It's also simpler code, no functional change
intended.
llvm-svn: 263215
tests to run GVN in both modes.
This is mostly the boring refactoring just like SROA and other complex
transformation passes. There is some trickiness in that GVN's
ValueNumber class requires hand holding to get to compile cleanly. I'm
open to suggestions about a better pattern there, but I tried several
before settling on this. I was trying to balance my desire to sink as
much implementation detail into the source file as possible without
introducing overly many layers of abstraction.
Much like with SROA, the design of this system is made somewhat more
cumbersome by the need to support both pass managers without duplicating
the significant state and logic of the pass. The same compromise is
struck here.
I've also left a FIXME in a doxygen comment as the GVN pass seems to
have pretty woeful documentation within it. I'd like to submit this with
the FIXME and let those more deeply familiar backfill the information
here now that we have a nice place in an interface to put that kind of
documentaiton.
Differential Revision: http://reviews.llvm.org/D18019
llvm-svn: 263208
llvm::getDISubprogram walks the instructions in a function, looking for one in the scope of the current function, so that it can find the !dbg entry for the subprogram itself.
Now that !dbg is attached to functions, this should not be necessary. This patch changes all uses to just query the subprogram directly on the function.
Ideally this should be NFC, but in reality its possible that a function:
has no !dbg (in which case there's likely a bug somewhere in an opt pass), or
that none of the instructions had a scope referencing the function, so we used to not find the !dbg on the function but now we will
Reviewed by Duncan Exon Smith.
Differential Revision: http://reviews.llvm.org/D18074
llvm-svn: 263184
The code assumed that we always had a preheader without making the pass
dependent on LoopSimplify.
Thanks to Mattias Eriksson V for reporting this.
llvm-svn: 263173
of, and I misdiagnosed for months and months.
Andrea has had a patch for this forever, but I just couldn't see how
it was fixing the root cause of the problem. It didn't make sense to me,
even though the patch was perfectly good and the analysis of the actual
failure event was *fantastic*.
Well, I came back to it today because the patch has sat for *far* too
long and needs attention and decided I wouldn't let it go until I really
understood what was going on. After quite some time in the debugger,
I finally realized that in fact I had just missed an important case with
my previous attempt to fix PR22093 in r225149. Not only do we need to
handle loads that won't be split, but stores-of-loads that we won't
split. We *do* actually have enough logic in the presplitting to form
new slices for split stores.... *unless* we decided not to split them!
I'm so sorry that it took me this long to come to the realization that
this is the issue. It seems so obvious in hind sight (of course).
Anyways, the fix becomes *much* smaller and more focused. The fact that
we're left doing integer smashing is related to the FIXME in my original
commit: fundamentally, we're not aggressive about pre-splitting for
loads and stores to the same alloca. If we want to get aggressive about
this, it'll need both what Andrea had put into the proposed fix, but
also a *lot* more logic to essentially iteratively pre-split the alloca
until we can't do any more. As I said in that commit log, its really
unclear that this is the right call. Instead, the integer blending and
letting targets lower this to narrower stores seems slightly better. But
we definitely shouldn't really go down that path just to fix this bug.
Again, tons of thanks are owed to Andrea and others at Sony for working
on this bug. I really should have seen what was going on here and
re-directed them sooner. =////
llvm-svn: 263121
We already have the instruction extracted into 'I', just cast that to
a store the way we do for loads. Also, we don't enter the if unless SI
is non-null, so don't test it again for null.
I'm pretty sure the entire test there can be nuked, but this is just the
trivial cleanup.
llvm-svn: 263112
MinVecRegSize is currently hardcoded to 128; this patch adds a cl::opt
to allow changing it. I tried not to change any existing behavior for the default
case.
Differential revision: http://reviews.llvm.org/D13278
llvm-svn: 263089
need to be changed for porting to the new pass manager.
Also sink the comment on the ValueTable class back to that class instead
of it dangling on an anonymous namespace.
No functionality changed.
llvm-svn: 263084
This is a fairly straightforward port to the new pass manager with one
exception. It removes a very questionable use of releaseMemory() in
the old pass to invalidate its caches between runs on a function.
I don't think this is really guaranteed to be safe. I've just used the
more direct port to the new PM to address this by nuking the results
object each time the pass runs. While this could cause some minor malloc
traffic increase, I don't expect the compile time performance hit to be
noticable, and it makes the correctness and other aspects of the pass
much easier to reason about. In some cases, it may make things faster by
making the sets and maps smaller with better locality. Indeed, the
measurements collected by Bruno (thanks!!!) show mostly compile time
improvements.
There is sadly very limited testing at this point as there are only two
tests of memdep, and both rely on GVN. I'll be porting GVN next and that
will exercise this heavily though.
Differential Revision: http://reviews.llvm.org/D17962
llvm-svn: 263082
This patch teaches LICM's implementation of store promotion to exploit the fact that the memory location being accessed might be provable thread local. The fact it's thread local weakens the requirements for where we can insert stores since no other thread can observe the write. This allows us perform store promotion even in cases where the store is not guaranteed to execute in the loop.
Two key assumption worth drawing out is that this assumes a) no-capture is strong enough to imply no-escape, and b) standard allocation functions like malloc, calloc, and operator new return values which can be assumed not to have previously escaped.
In future work, it would be nice to generalize this so that it works without directly seeing the allocation site. I believe that the nocapture return attribute should be suitable for this purpose, but haven't investigated carefully. It's also likely that we could support unescaped allocas with similar reasoning, but since SROA and Mem2Reg should destroy those, they're less interesting than they first might seem.
Differential Revision: http://reviews.llvm.org/D16783
llvm-svn: 263072
When checking whether an smin is positive, we can move the comparison to one of the inputs if the other is known positive. If the known positive one is the min, then the other can't be negative. If the other is the min, then we compute the min.
Differential Revision: http://reviews.llvm.org/D17873
llvm-svn: 263059
I somehow missed this. The case in GCC (global_alloc) was similar to
the new testcase except it had an array of structs rather than a two
dimensional array.
Fixes RP26885.
llvm-svn: 263058
As part of r251146 InstCombine was extended to call computeKnownBits on
every value in the function to determine whether it happens to be
constant. This increases typical compiletime by 1-3% (5% in irgen+opt
time) in my measurements. On the other hand this case did not trigger
once in the whole llvm-testsuite.
This patch introduces the notion of ExpensiveCombines which are only
enabled for OptLevel > 2. I removed the check in InstructionSimplify as
that is called from various places where the OptLevel is not known but
given the rarity of the situation I think a check in InstCombine is
enough.
Differential Revision: http://reviews.llvm.org/D16835
llvm-svn: 263047
Original commit message:
calculate builtin_object_size if argument is a removable pointer
This patch fixes calculating correct value for builtin_object_size function
when pointer is used only in builtin_object_size function call and never
after that.
Patch by Strahinja Petrovic.
Differential Revision: http://reviews.llvm.org/D17337
Reland the original change with a small modification (first do a null check
and then do the cast) to satisfy ubsan.
llvm-svn: 263011
TSan instrumentation functions for atomic stores, loads, and cmpxchg work on
integer value types. This patch adds casts before calling TSan instrumentation
functions in cases where the value is a pointer.
Differential Revision: http://reviews.llvm.org/D17833
llvm-svn: 262876
This lets select sub-targets enable this pass. The patch implements the
idea from the recent llvm-dev thread:
http://thread.gmane.org/gmane.comp.compilers.llvm.devel/94925
The goal is to enable the LoopDataPrefetch pass for the Cyclone
sub-target only within Aarch64.
Positive and negative tests will be included in an upcoming patch that
enables selective prefetching of large-strided accesses on Cyclone.
llvm-svn: 262844
This reverts commit r262250.
It causes SPEC2006/gcc to generate wrong result (166.s) in AArch64 when
running with *ref* data set. The error happens with
"-Ofast -flto -fuse-ld=gold" or "-O3 -fno-strict-aliasing".
llvm-svn: 262839
This code has been successfully used to bootstrap libc++ in a no-asserts
mode for a very long time, so the code that follows cannot be completely
incorrect. I've added a test that shows the current behavior for this
kind of code with DFSan. If it is desirable for DFSan to do something
special when processing an invoke of a variadic function, it can be
added, but we shouldn't keep an assert that we've been ignoring due to
release builds anyways.
llvm-svn: 262829
Given that we're not actually reducing the instruction count in the included
regression tests, I think we would call this a canonicalization step.
The motivation comes from the example in PR26702:
https://llvm.org/bugs/show_bug.cgi?id=26702
If we hoist the bitwise logic ahead of the bitcast, the previously unoptimizable
example of:
define <4 x i32> @is_negative(<4 x i32> %x) {
%lobit = ashr <4 x i32> %x, <i32 31, i32 31, i32 31, i32 31>
%not = xor <4 x i32> %lobit, <i32 -1, i32 -1, i32 -1, i32 -1>
%bc = bitcast <4 x i32> %not to <2 x i64>
%notnot = xor <2 x i64> %bc, <i64 -1, i64 -1>
%bc2 = bitcast <2 x i64> %notnot to <4 x i32>
ret <4 x i32> %bc2
}
Simplifies to the expected:
define <4 x i32> @is_negative(<4 x i32> %x) {
%lobit = ashr <4 x i32> %x, <i32 31, i32 31, i32 31, i32 31>
ret <4 x i32> %lobit
}
Differential Revision: http://reviews.llvm.org/D17583
llvm-svn: 262645
This patch provides the following infrastructure for PGO enhancements in inliner:
Enable the use of block level profile information in inliner
Incremental update of block frequency information during inlining
Update the function entry counts of callees when they get inlined into callers.
Differential Revision: http://reviews.llvm.org/D16381
llvm-svn: 262636
Summary: With discriminator, LineLocation can uniquely identify a callsite without the need to specifying callee name. Remove Callee function name from the key, and put it in the value (FunctionSamples).
Reviewers: davidxl, dnovillo
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D17827
llvm-svn: 262634
The vectorization of first-order recurrences (r261346) caused PR26734. When
detecting these recurrences, we need to ensure that the previous value is
actually defined inside the loop. This patch includes the fix and test case.
llvm-svn: 262624
Summary: This is the last step toward supporting aggregate memory access in instcombine. This explodes stores of arrays into a serie of stores for each element, allowing them to be optimized.
Reviewers: joker.eph, reames, hfinkel, majnemer, mgrang
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D17828
llvm-svn: 262530