Summary:
This change could be way off-piste, I'm looking for any feedback on whether it's an acceptable approach.
It never seems to be a problem to gobble up as many reduction values as can be found, and then to attempt to reduce the resulting tree. Some of the workloads I'm looking at have been aggressively unrolled by hand, and by selecting reduction widths that are not constrained by a vector register size, it becomes possible to profitably vectorize. My test case shows such an unrolling which SLP was not vectorizing (on neither ARM nor X86) before this patch, but with it does vectorize.
I measure no significant compile time impact of this change when combined with D13949 and D14063. There are also no significant performance regressions on ARM/AArch64 in SPEC or LNT.
The more principled approach I thought of was to generate several candidate tree's and use the cost model to pick the cheapest one. That seemed like quite a big design change (the algorithms seem very much one-shot), and would likely be a costly thing for compile time. This seemed to do the job at very little cost, but I'm worried I've misunderstood something!
Reviewers: nadav, jmolloy
Subscribers: mssimpso, llvm-commits, aemerson
Differential Revision: http://reviews.llvm.org/D14116
llvm-svn: 251428
Summary:
Currently, when the SLP vectorizer considers whether a phi is part of a reduction, it dismisses phi's whose incoming blocks are not the same as the block containing the phi. For the patterns I'm looking at, extending this rule to allow phis whose incoming block is a containing loop latch allows me to vectorize certain workloads.
There is no significant compile-time impact, and combined with D13949, no performance improvement measured in ARM/AArch64 in any of SPEC2000, SPEC2006 or LNT.
Reviewers: jmolloy, mcrosier, nadav
Subscribers: mssimpso, nadav, aemerson, llvm-commits
Differential Revision: http://reviews.llvm.org/D14063
llvm-svn: 251425
Summary:
Certain workloads, in particular sum-of-absdiff loops, can be vectorized using SLP if it can treat select instructions as reduction values.
The test case is a bit awkward. The AArch64 cost model needs some tuning to not be so pessimistic about selects. I've had to tweak the SLP threshold here.
Reviewers: jmolloy, mzolotukhin, spatel, nadav
Subscribers: nadav, mssimpso, aemerson, llvm-commits
Differential Revision: http://reviews.llvm.org/D13949
llvm-svn: 251424
When emitting a remark for a conditional branch annotation, the remark
uses the line location information of the conditional branch in the
message. In some cases, that information is unavailable and the
optimization would segfaul. I'm still not sure whether this is a bug or
WAI, but the optimizer should not die because of this.
llvm-svn: 251420
We want to insert no-op casts as close as possible to the def. This is
tricky when the cast is of a PHI node and the BasicBlocks between the
def and the use cannot hold any instructions. Iteratively walk EH pads
until we hit a non-EH pad.
This fixes PR25326.
llvm-svn: 251393
We should remove noalias along with dereference and dereference_or_null attributes
because statepoint could potentially touch the entire heap including noalias objects.
Differential Revision: http://reviews.llvm.org/D14032
llvm-svn: 251333
This adds a couple of optimization remarks to the SamplePGO
transformation. When it decides to inline a hot function (to mimic the
inline stack and repeat useful inline decisions in the original build).
It will also report branch destinations. For instance, given the code
fragment:
6 if (i < 1000)
7 sum -= i;
8 else
9 sum += -i * rand();
If the 'else' branch is taken most of the time, building this code with
-Rpass=sample-profile will produce:
a.cc:9:14: remark: most popular destination for conditional branches at small.cc:6:9 [-Rpass=sample-profile]
sum += -i * rand();
^
llvm-svn: 251330
Android libc provides a fixed TLS slot for the unsafe stack pointer,
and this change implements direct access to that slot on AArch64 via
__builtin_thread_pointer() + offset.
This change also moves more code into TargetLowering and its
target-specific subclasses to get rid of target-specific codegen
in SafeStackPass.
This change does not touch the ARM backend because ARM lowers
builting_thread_pointer as aeabi_read_tp, which is not available
on Android.
The previous iteration of this change was reverted in r250461. This
version leaves the generic, compiler-rt based implementation in
SafeStack.cpp instead of moving it to TargetLoweringBase in order to
allow testing without a TargetMachine.
llvm-svn: 251324
Vectorization of memory instruction (Load/Store) is possible when the pointer is coming from GEP. The GEP analysis allows to estimate the profit.
In some cases we have a "bitcast" between GEP and memory instruction.
I added code that skips the "bitcast".
http://reviews.llvm.org/D13886
llvm-svn: 251291
Summary:
InstCombine tries to transform GEP(PHI(GEP1, GEP2, ..)) into GEP(GEP(PHI(...))
when possible. However, this may leave the old PHI node around. Even if we
do end up folding the GEPs, having an extra PHI node might not be beneficial.
This change makes the transformation more conservative. We now only do this if
the PHI has only one use, and can therefore be removed after the transformation.
Reviewers: jmolloy, majnemer
Subscribers: mcrosier, mssimpso, llvm-commits
Differential Revision: http://reviews.llvm.org/D13887
llvm-svn: 251281
First, the motivation: LLVM currently does not realize that:
((2072 >> (L == 0)) >> 7) & 1 == 0
where L is some arbitrary value. Whether you right-shift 2072 by 7 or by 8, the
lowest-order bit is always zero. There are obviously several ways to go about
fixing this, but the generic solution pursued in this patch is to teach
computeKnownBits something about shifts by a non-constant amount. Previously,
we would give up completely on these. Instead, in cases where we know something
about the low-order bits of the shift-amount operand, we can combine (and
together) the associated restrictions for all shift amounts consistent with
that knowledge. As a further generalization, I refactored all of the logic for
all three kinds of shifts to have this capability. This works well in the above
case, for example, because the dynamic shift amount can only be 0 or 1, and
thus we can say a lot about the known bits of the result.
This brings us to the second part of this change: Even when we know all of the
bits of a value via computeKnownBits, nothing used to constant-fold the result.
This introduces the necessary code into InstCombine and InstSimplify. I've
added it into both because:
1. InstCombine won't automatically pick up the associated logic in
InstSimplify (InstCombine uses InstSimplify, but not via the API that
passes in the original instruction).
2. Putting the logic in InstCombine allows the resulting simplifications to become
part of the iterative worklist
3. Putting the logic in InstSimplify allows the resulting simplifications to be
used by everywhere else that calls SimplifyInstruction (inlining, unrolling,
and many others).
And this requires a small change to our definition of an ephemeral value so
that we don't break the rest case from r246696 (where the icmp feeding the
@llvm.assume, is also feeding a br). Under the old definition, the icmp would
not be considered ephemeral (because it is used by the br), but this causes the
assume to remove itself (in addition to simplifying the branch structure), and
it seems more-useful to prevent that from happening.
llvm-svn: 251146
After some look-ahead PRE was added for GEPs, an instruction could end
up in the table of candidates before it was actually inspected. When
this happened the pass might decide it was the best candidate to
replace itself. This didn't go well.
Should fix PR25291
llvm-svn: 251145
Summary: Currently SimplifyResume can convert an invoke instruction to a call instruction if its landing pad is trivial. In practice we could have several invoke instructions with trivial landing pads and share a common rethrow block, and in the common rethrow block, all the landing pads join to a phi node. The patch extends SimplifyResume to check the phi of landing pad and their incoming blocks. If any of them is trivial, remove it from the phi node and convert the invoke instruction to a call instruction.
Reviewers: hfinkel, reames
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D13718
llvm-svn: 251061
The test case wasn't testing what it was commented to be testing; and
when I tried to fix the test I noticed that SCEV does not support the
simplification that the test was supposed to test.
This change removes the test case to avoid confusion.
llvm-svn: 251053
Summary:
An unsigned comparision is equivalent to is corresponding signed version
if both the operands being compared are positive. Teach SCEV to use
this fact when profitable.
Reviewers: atrick, hfinkel, reames, nlewycky
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D13687
llvm-svn: 251051
Summary:
- A s< (A + C)<nsw> if C > 0
- A s<= (A + C)<nsw> if C >= 0
- (A + C)<nsw> s< A if C < 0
- (A + C)<nsw> s<= A if C <= 0
Right now `C` needs to be a constant, but we can later generalize it to
be a non-constant if needed.
Reviewers: atrick, hfinkel, reames, nlewycky
Subscribers: sanjoy, llvm-commits
Differential Revision: http://reviews.llvm.org/D13686
llvm-svn: 251050
Summary:
This uses `ScalarEvolution::getRange` and not potentially control
dependent `nsw` and `nuw` bits on the arithmetic instruction.
Reviewers: atrick, hfinkel, nlewycky
Subscribers: llvm-commits, sanjoy
Differential Revision: http://reviews.llvm.org/D13613
llvm-svn: 251048
SimplifyTerminatorOnSelect didn't consider the possibility that the
condition might be related to one of PHI nodes.
This fixes PR25267.
llvm-svn: 250922
In some cases (as illustrated in the unittest), lineno can be less than the heade_lineno because the function body are included from some other files. In this case, offset will be negative. This patch makes clang still able to match the profile to IR in this situation.
http://reviews.llvm.org/D13914
llvm-svn: 250873
`normalizeForInvokeSafepoint` in RewriteStatepointsForGC.cpp, as it is
written today, deals with `gc.relocate` and `gc.result` uses of a
statepoint equally well. This change documents this fact and adds a
test case.
There is no functional change here -- only documentation of existing
functionality.
llvm-svn: 250784
Allow LLVM to optimize the sequence like the following:
%inc = add nsw i32 %i, 1
%cmp = icmp slt %n, %inc
into:
%cmp = icmp sle i32 %n, %i
The case is not handled previously due to the complexity of compuation of %n.
Hence, LLVM cannot swap operands of icmp accordingly.
llvm-svn: 250746
This was originally checked in at r250527, but reverted at r250570 because of PR25222.
There were at least 2 problems:
1. The cost check was checking for an instruction with an exact cost of TCC_Expensive;
that should have been >=.
2. The cause of the clang stage 1 failures was illegally sinking 'call' instructions;
we can't sink instructions that may have side effects / are not safe to execute speculatively.
Fixed those conditions in sinkSelectOperand() and added test cases.
Original commit message:
This is a follow-up to the discussion in D12882.
Ideally, we would like SimplifyCFG to be able to form select instructions even when the operands
are expensive (as defined by the TTI cost model) because that may expose further optimizations.
However, we would then like a later pass like CodeGenPrepare to undo that transformation if the
target would likely benefit from not speculatively executing an expensive op (this patch).
Once we have this safety mechanism in place, we can adjust SimplifyCFG to restore its
select-formation behavior that changed with r248439.
Differential Revision: http://reviews.llvm.org/D13297
llvm-svn: 250743
This patch improves support for combining the SSE4A EXTRQ(I) and INSERTQ(I) intrinsics:
1 - Converts INSERTQ/EXTRQ calls to INSERTQI/EXTRQI if the 'bit index' and 'length' operands are constant
2 - Converts INSERTQI/EXTRQI calls to shufflevector if the bit index/length are both byte aligned (we can already lower shuffles to INSERTQI/EXTRQI if its useful)
3 - Constant folding support
4 - Add zeroinitializer handling
Differential Revision: http://reviews.llvm.org/D13348
llvm-svn: 250609
The number of samples collected at the head of a function only make
sense for top-level functions (i.e., those actually called as opposed to
being inlined inside another).
Head samples essentially count the time spent inside the function's
prologue. This clearly doesn't make sense for inlined functions, so we
were always emitting 0 in those.
llvm-svn: 250539
Ideally, we would like SimplifyCFG to be able to form select instructions even when the operands
are expensive (as defined by the TTI cost model) because that may expose further optimizations.
However, we would then like a later pass like CodeGenPrepare to undo that transformation if the
target would likely benefit from not speculatively executing an expensive op (this patch).
Once we have this safety mechanism in place, we can adjust SimplifyCFG to restore its
select-formation behavior that changed with r248439.
Differential Revision: http://reviews.llvm.org/D13297
llvm-svn: 250527
The `"statepoint-id"` and `"statepoint-num-patch-bytes"` attributes are
used solely to determine properties of the `gc.statepoint` being
created. Once the `gc.statepoint` is in place, these should be removed.
llvm-svn: 250491
Summary:
This is a step towards using operand bundles to carry deopt state till
RewriteStatepointsForGC. The change adds a flag to
RewriteStatepointsForGC that teaches it to pick up deopt state from a
`"deopt"` operand bundle attached to the `call` or `invoke` it is
wrapping.
The command line flag added, `-rs4gc-use-deopt-bundles`, will only exist
for a short while. Once we are able to pipe deopt bundle state through
the full optimization pipeline without problems, we will "constant fold"
`-rs4gc-use-deopt-bundles` to `true`.
Reviewers: swaroop.sridhar, reames
Subscribers: llvm-commits, sanjoy
Differential Revision: http://reviews.llvm.org/D13372
llvm-svn: 250489
Summary:
`cloneArithmeticIVUser` currently trips over expression like `add %iv,
-1` when `%iv` is being zero extended -- it tries to construct the
widened use as `add %iv.zext, zext(-1)` and (correctly) fails to prove
equivalence to `zext(add %iv, -1)` (here the SCEV for `%iv` is
`{1,+,1}`).
This change teaches `IndVars` to try sign extending the non-IV operand
if that makes the newly constructed IV use equivalent to the widened
narrow IV use.
Reviewers: atrick, hfinkel, reames
Subscribers: sanjoy, llvm-commits
Differential Revision: http://reviews.llvm.org/D13717
llvm-svn: 250483
Android libc provides a fixed TLS slot for the unsafe stack pointer,
and this change implements direct access to that slot on AArch64 via
__builtin_thread_pointer() + offset.
This change also moves more code into TargetLowering and its
target-specific subclasses to get rid of target-specific codegen
in SafeStackPass.
This change does not touch the ARM backend because ARM lowers
builting_thread_pointer as aeabi_read_tp, which is not available
on Android.
llvm-svn: 250456
Turns out this approach is buggy. In discussion about follow on work, Sanjoy pointed out that we could be subject to circular logic problems.
Consider:
if (i u< L) leave()
if ((i + 1) u< L) leave()
print(a[i] + a[i+1])
If we know that L is less than UINT_MAX, we could possible prove (in a control dependent way) that i + 1 does not overflow. This gives us:
if (i u< L) leave()
if ((i +nuw 1) u< L) leave()
print(a[i] + a[i+1])
If we now do the transform this patch proposed, we end up with:
if ((i +nuw 1) u< L) leave_appropriately()
print(a[i] + a[i+1])
That would be a miscompile when i==-1. The problem here is that the control dependent nuw bits got used to prove something about the first condition. That's obviously invalid.
This won't happen today, but since I plan to enhance LVI/CVP with exactly that transform at some point in the not too distant future...
llvm-svn: 250430
With r250345 and r250343, we start to observe the following failure
when bootstrap clang with lto and pgo:
PHI node entries do not match predecessors!
%.sroa.029.3.i = phi %"class.llvm::SDNode.13298"* [ null, %30953 ], [ null, %31017 ], [ null, %30998 ], [ null, %_ZN4llvm8dyn_castINS_14ConstantSDNodeENS_7SDValueEEENS_10cast_rettyIT_T0_E8ret_typeERS5_.exit.i.1804 ], [ null, %30975 ], [ null, %30991 ], [ null, %_ZNK4llvm3EVT13getScalarTypeEv.exit.i.1812 ], [ %..sroa.029.0.i, %_ZN4llvm11SmallVectorIiLj8EED1Ev.exit.i.1826 ], !dbg !451895
label %30998
label %_ZNK4llvm3EVTeqES0_.exit19.thread.i
LLVM ERROR: Broken function found, compilation aborted!
I will re-commit this if the bot does not recover.
llvm-svn: 250366
Currently in JumpThreading pass, the branch weight metadata is not updated after CFG modification. Consider the jump threading on PredBB, BB, and SuccBB. After jump threading, the weight on BB->SuccBB should be adjusted as some of it is contributed by the edge PredBB->BB, which doesn't exist anymore. This patch tries to update the edge weight in metadata on BB->SuccBB by scaling it by 1 - Freq(PredBB->BB) / Freq(BB->SuccBB).
This is the third attempt to submit this patch, while the first two led to failures in some FDO tests. After investigation, it is the edge weight normalization that caused those failures. In this patch the edge weight normalization is fixed so that there is no zero weight in the output and the sum of all weights can fit in 32-bit integer. Several unit tests are added.
Differential revision: http://reviews.llvm.org/D10979
llvm-svn: 250345
This is a cleaned up patch from the one written by John Regehr based on the findings of the Souper superoptimizer.
The basic idea here is that input bits that are known zero reduce the maximum count that the intrinsic could return. We know that the number of bits required to represent a particular count is at most log2(N)+1.
Differential Revision: http://reviews.llvm.org/D13253
llvm-svn: 250338
Binary encoded profiles used to encode all function names inline at
every reference. This is clearly suboptimal in terms of space. This
patch fixes this by adding a name table to the header of the file.
llvm-svn: 250241
We forgot to append the terminatepad's arguments which resulted in us
treating the old terminatepad as an argument to the new terminatepad
causing us to crash immediately. Instead, add the old terminatepad's
arguments to the new terminatepad.
This fixes PR25155.
llvm-svn: 250234
As discussed in D13348 - the INSERTQI range combining code is wrong in that it confuses the insertion bit index with an extraction bit index.
The remaining legal combines are very unlikely (especially once we've converted to shuffles in D13348) so I'm removing the optimization.
llvm-svn: 250160
In JumpThreading pass, the branch weight metadata is not updated after CFG modification. Consider the jump threading on PredBB, BB, and SuccBB. After jump threading, the weight on BB->SuccBB should be adjusted as some of it is contributed by the edge PredBB->BB, which doesn't exist anymore. This patch tries to update the edge weight in metadata on BB->SuccBB by scaling it by 1 - Freq(PredBB->BB) / Freq(BB->SuccBB).
Differential revision: http://reviews.llvm.org/D10979
llvm-svn: 250089
GlobalOpt currently merges stores into the initialisers of internal,
externally_initialized globals, but should not do so as the value of the global
may change between the initialiser and any code in the module being run.
llvm-svn: 250035
C semantics force sub-int-sized values (e.g. i8, i16) to be promoted to int
type (e.g. i32) whenever arithmetic is performed on them.
For targets with native i8 or i16 operations, usually InstCombine can shrink
the arithmetic type down again. However InstCombine refuses to create illegal
types, so for targets without i8 or i16 registers, the lengthening and
shrinking remains.
Most SIMD ISAs (e.g. NEON) however support vectors of i8 or i16 even when
their scalar equivalents do not, so during vectorization it is important to
remove these lengthens and truncates when deciding the profitability of
vectorization.
The algorithm this uses starts at truncs and icmps, trawling their use-def
chains until they terminate or instructions outside the loop are found (or
unsafe instructions like inttoptr casts are found). If the use-def chains
starting from different root instructions (truncs/icmps) meet, they are
unioned. The demanded bits of each node in the graph are ORed together to form
an overall mask of the demanded bits in the entire graph. The minimum bitwidth
that graph can be truncated to is the bitwidth minus the number of leading
zeroes in the overall mask.
The intention is that this algorithm should "first do no harm", so it will
never insert extra cast instructions. This is why the use-def graphs are
unioned, so that subgraphs with different minimum bitwidths do not need casts
inserted between them.
This algorithm works hard to reduce compile time impact. DemandedBits are only
queried if there are extends of illegal types and if a truncate to an illegal
type is seen. In the general case, this results in a simple linear scan of the
instructions in the loop.
No non-noise compile time impact was seen on a clang bootstrap build.
llvm-svn: 250032
Doing so could cause the post-unswitching convergent ops to be
control-dependent on the unswitch condition where they were not before.
This check could be refined to allow unswitching where the convergent
operation was already control-dependent on the unswitch condition.
llvm-svn: 249874
With this patch we can now read and write inline stacks in sample
profiles to the binary encoded profiles.
In a subsequent patch, I will add a string table to the binary encoding.
Right now function names are emitted as strings every time we find them.
This is too bloated and will produce large files in applications with
lots of inlining.
llvm-svn: 249861
Pass MemCpyOpt doesn't check if a store instruction is nontemporal.
As a consequence, adjacent nontemporal stores are always merged into a
memset call.
Example:
;;;
define void @foo(<4 x float>* nocapture %p) {
entry:
store <4 x float> zeroinitializer, <4 x float>* %p, align 16, !nontemporal !0
%p1 = getelementptr inbounds <4 x float>, <4 x float>* %dst, i64 1
store <4 x float> zeroinitializer, <4 x float>* %p1, align 16, !nontemporal !0
ret void
}
!0 = !{i32 1}
;;;
In this example, the two nontemporal stores are combined to a memset of zero
which does not preserve the nontemporal hint. Later on the backend (tested on a
x86-64 corei7) expands that memset call into a sequence of two normal 16-byte
aligned vector stores.
opt -memcpyopt example.ll -S -o - | llc -mcpu=corei7 -o -
Before:
xorps %xmm0, %xmm0
movaps %xmm0, 16(%rdi)
movaps %xmm0, (%rdi)
With this patch, we no longer merge nontemporal stores into calls to memset.
In this example, llc correctly expands the two stores into two movntps:
xorps %xmm0, %xmm0
movntps %xmm0, 16(%rdi)
movntps %xmm0, (%rdi)
In theory, we could extend the usage of !nontemporal metadata to memcpy/memset
calls. However a change like that would only have the effect of forcing the
backend to expand !nontemporal memsets back to sequences of store instructions.
A memset library call would not have exactly the same semantic of a builtin
!nontemporal memset call. So, SelectionDAG will have to conservatively expand
it back to a sequence of !nontemporal stores (effectively undoing the merging).
Differential Revision: http://reviews.llvm.org/D13519
llvm-svn: 249820
Summary:
These non-semantic changes will help make a later change adding
support for deopt operand bundles more streamlined.
Reviewers: reames, swaroop.sridhar
Subscribers: sanjoy, llvm-commits
Differential Revision: http://reviews.llvm.org/D13491
llvm-svn: 249779
Summary:
`getConstantEvolutionLoopExitValue` and `ComputeExitCountExhaustively`
assumed all phi nodes in the loop header have the same order of incoming
values. This is not correct, and this commit changes
`getConstantEvolutionLoopExitValue` and `ComputeExitCountExhaustively`
to lookup the backedge value of a phi node using the loop's latch block.
Unfortunately, there is still some code duplication
`getConstantEvolutionLoopExitValue` and `ComputeExitCountExhaustively`.
At some point in the future we should extract out a helper class /
method that can evolve constant evolution phi nodes across iterations.
Fixes 25060. Thanks to Mattias Eriksson for the spot-on analysis!
Depends on D13457.
Reviewers: atrick, hfinkel
Subscribers: materi, llvm-commits
Differential Revision: http://reviews.llvm.org/D13458
llvm-svn: 249712
This is a partial fix for PR24886:
https://llvm.org/bugs/show_bug.cgi?id=24886
Without this IR transform, the backend (x86 at least) was producing inefficient code.
This patch is making 2 assumptions:
1. The canonical form of a fabs() operation is, in fact, the LLVM fabs() intrinsic.
2. The high bit of an FP value is always the sign bit; as noted in the bug report, this isn't specified by the LangRef.
Differential Revision: http://reviews.llvm.org/D13076
llvm-svn: 249702
This was requested in D13076: if we're going to canonicalize to fabs(), ValueTracking
should know that fabs() clears sign bits.
In this patch (as in D13076), we're not handling vectors yet even though computeKnownBits'
fabs() case itself should be vector-ready via the splat in this patch.
Fixing this will require follow-on patches to correct other logic that uses 'getScalarType'.
Differential Revision: http://reviews.llvm.org/D13222
llvm-svn: 249701
Summary:
After r249211, SCEV can see through some LCSSA phis. Add a
`replacementPreservesLCSSAForm` check before replacing uses of these phi
nodes with a simplified use of the induction variable to avoid breaking
LCSSA.
Fixes 25047.
Depends on D13460.
Reviewers: atrick, hfinkel
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D13461
llvm-svn: 249575
Summary:
Some target intrinsics can access multiple elements, using the pointer as a
base address (e.g. AArch64 ld4). When trying to CSE such instructions,
it must be checked the available value comes from a compatible instruction
because the pointer is not enough to discriminate whether the value is
correct.
Reviewers: ssijaric
Subscribers: mcrosier, llvm-commits, aemerson
Differential Revision: http://reviews.llvm.org/D13475
llvm-svn: 249523
This will allow us to optimize code such as:
int f(int *p) {
int x;
return p == &x;
}
as well as:
int *allocate(void);
int f() {
int x;
int *p = allocate();
return p == &x;
}
The folding can only be done under certain circumstances. Even though p and &x
cannot alias, the comparison must still return true if the pointer
representations are equal. If a user successfully generates a p that's a
correct guess for &x, comparison should return true even though p is an invalid
pointer.
This patch argues that if the address of the alloca isn't observable outside the
function, the function can act as-if the address is impossible to guess from the
outside. The tricky part is keeping the act consistent: if we fold p == &x to
false in one place, we must make sure to fold any other comparisons based on
those pointers similarly. To ensure that, we only fold when &x is involved
exactly once in comparison instructions.
Differential Revision: http://reviews.llvm.org/D13358
llvm-svn: 249490
Summary:
After r249211, `getSCEV(X) == getSCEV(Y)` does not guarantee that X and
Y are related in the dominator tree, even if X is an operand to Y (I've
included a toy example in comments, and a real example as a test case).
This commit changes `SimplifyIndVar` to require a `DominatorTree`. I
don't think this is a problem because `ScalarEvolution` requires it
anyway.
Fixes PR25051.
Depends on D13459.
Reviewers: atrick, hfinkel
Subscribers: joker.eph, llvm-commits, sanjoy
Differential Revision: http://reviews.llvm.org/D13460
llvm-svn: 249471
This is a cleaned up patch from the one written by John Regehr based on the findings of the Souper superoptimizer.
When writing tests, I was surprised to find that instsimplify apparently doesn't know how to collapse bit test sequences based purely on known bits. This required me to split my tests across both instsimplify and instcombine.
Differential Revision: http://reviews.llvm.org/D13250
llvm-svn: 249453
As mentioned in the bug, I'd missed the presence of a getScalarType in the caller of the new implies method. As a result, when we ended up with a implication over two vectors, we'd trip an assert and crash.
Differential Revision: http://reviews.llvm.org/D13441
llvm-svn: 249442
If the mask of a select instruction is a ConstantVector, method
SimplifyDemandedVectorElts iterates over the mask elements to identify which
values are selected from the select inputs.
Before this patch, method SimplifyDemandedVectorElts always used method
Constant::isNullValue() to check if a value in the mask was zero. Unfortunately
that method always returns false when called on a ConstantExpr.
This patch fixes the problem in SimplifyDemandedVectorElts by adding an explicit
check for ConstantExpr values. Now, if a value in the mask is a ConstantExpr, we
avoid calling isNullValue() on it.
Fixes PR24922.
Differential Revision: http://reviews.llvm.org/D13219
llvm-svn: 249390
Otherwise, the map will observe changes as long as MergeFunctions is alive. This
is bad because follow-up passes could replace-all-uses-with on the key of an
entry in the map. The value handle callback of ValueMap however asserts that the
key type matches.
rdar://22971893
llvm-svn: 249327
The most important part required to make clang
devirtualization works ( ͡°͜ʖ ͡°).
The code is able to find non local dependencies, but unfortunatelly
because the caller can only handle local dependencies, I had to add
some restrictions to look for dependencies only in the same BB.
http://reviews.llvm.org/D12992
llvm-svn: 249196
Summary:
This change teaches SCEV that to prove `A u< B` it is sufficient to
prove each of these facts individually:
- B >= 0
- A s< B
- A >= 0
In practice, SCEV sometimes finds it easier to prove these facts
individually than to prove `A u< B` as one atomic step.
Reviewers: reames, atrick, nlewycky, hfinkel
Subscribers: sanjoy, llvm-commits
Differential Revision: http://reviews.llvm.org/D13042
llvm-svn: 249168
When trying to optimize fortified library functions use the right
location to insert new instructions in order to preserve correct
def-use order.
This fixes an issue where a misplaced instruction definition would
happen to be *after* one of its use after a RAUW, forming invalid IR.
This behavior was introduced by r227250.
Differential Revision: http://reviews.llvm.org/D13301
rdar://problem/22802369
llvm-svn: 249092
Summary:
Some passes may open up opportunities for optimizations, leaving empty
lifetime start/end ranges. For example, with the following code:
void foo(char *, char *);
void bar(int Size, bool flag) {
for (int i = 0; i < Size; ++i) {
char text[1];
char buff[1];
if (flag)
foo(text, buff); // BBFoo
}
}
the loop unswitch pass will create 2 versions of the loop, one with
flag==true, and the other one with flag==false, but always leaving
the BBFoo basic block, with lifetime ranges covering the scope of the for
loop. Simplify CFG will then remove BBFoo in the case where flag==false,
but will leave the lifetime markers.
This patch teaches InstCombine to remove trivially empty lifetime marker
ranges, that is ranges ending right after they were started (ignoring
debug info or other lifetime markers in the range).
This fixes PR24598: excessive compile time after r234581.
Reviewers: reames, chandlerc
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D13305
llvm-svn: 249018
Summary:
The instructions SeenExprs records may be deleted during rewriting.
FindClosestMatchingDominator should ignore these deleted instructions.
Fixes PR24301.
Reviewers: grosser
Subscribers: grosser, llvm-commits
Differential Revision: http://reviews.llvm.org/D13315
llvm-svn: 248983
Summary:
Given an array of i2 elements, 4 consecutive scalar loads will be lowered to
i8-sized loads and thus will access 4 consecutive bytes in memory. If we
vectorize these loads into a single <4 x i2> load, it'll access only 1 byte in
memory. Hence, we should prohibit vectorization in such cases.
PS: Initial patch was proposed by Arnold.
Reviewers: aschwaighofer, nadav, hfinkel
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D13277
llvm-svn: 248943
Usually large blocks are not a problem. But if a large block (> 10k instructions)
contains many (potential) chains of vector instructions, and those chains are
spread over a wide range of instructions, then scheduling becomes a compile time problem.
This change introduces a limit for the accumulate scheduling region size of a block.
For real-world functions this limit will never be exceeded (it's about 10x larger than
the maximum value seen in the test-suite and external test suite).
llvm-svn: 248917
This patch teaches InstCombiner how to convert a SSSE3/AVX2 byte shuffle to a
builtin shuffle if the mask is constant.
Converting byte shuffle intrinsic calls to builtin shuffles can help finding
more opportunities for combining shuffles later on in selection dag.
We may end up with byte shuffles with constant masks as the result of inlining.
Differential Revision: http://reviews.llvm.org/D13252
llvm-svn: 248913
This commit changes the interface of the vld[1234], vld[234]lane, and vst[1234],
vst[234]lane ARM neon intrinsics and associates an address space with the
pointer that these intrinsics take. This changes, e.g.,
<2 x i32> @llvm.arm.neon.vld1.v2i32(i8*, i32)
to
<2 x i32> @llvm.arm.neon.vld1.v2i32.p0i8(i8*, i32)
This change ensures that address spaces are fully taken into account in the ARM
target during lowering of interleaved loads and stores.
Differential Revision: http://reviews.llvm.org/D12985
llvm-svn: 248887
Currently SimplifyDemandedVectorElts can only peek through bitcasts if the vectors have the same number of elements.
This patch fixes and enables some existing (disabled) code to support bitcasting to vectors with more/fewer elements. It currently only accepts cases when vectors alias cleanly (i.e. number of elements are an exact multiple of the other vector).
This was added to improve the demanded vector elements support for SSE vector shifts which require the __m128i (<2 x i64>) argument type to be bitcast to the vector type for the builtin shift. I've added extra tests for various additional bitcasts.
Differential Revision: http://reviews.llvm.org/D12935
llvm-svn: 248784
Summary: This patch adds block frequency analysis to LoopUnswitch pass to recognize hot/cold regions. For cold regions the pass only performs trivial unswitches since they do not increase code size, and for hot regions everything works as before. This helps to minimize code growth in cold regions and be more aggressive in hot regions. Currently the default cold regions are blocks with frequencies below 20% of function entry frequency, and it can be adjusted via -loop-unswitch-cold-block-frequency flag. The entire feature is controlled via -loop-unswitch-with-block-frequency flag and it is off by default.
Reviewers: broune, silvas, dnovillo, reames
Subscribers: davidxl, llvm-commits
Differential Revision: http://reviews.llvm.org/D11605
llvm-svn: 248777
Place new and update dbg.declare calls immediately after the
corresponding alloca.
Current code in replaceDbgDeclareForAlloca puts the new dbg.declare
at the end of the basic block. LLVM codegen has problems emitting
debug info in a situation when dbg.declare appears after all uses of
the variable. This usually kinda works for inlining and ASan (two
users of this function) but not for SafeStack (see the pending change
in http://reviews.llvm.org/D13178).
llvm-svn: 248769
`ScalarEvolution::isImpliedCondOperandsViaNoOverflow` tries to cast the
operand type of the comparison it is given to an `IntegerType`. This is
incorrect because it could actually be simplifying a comparison between
two pointers. Switch it to using `getTypeSizeInBits` instead, which
does the right thing for both pointers and integers.
Fixed PR24956.
llvm-svn: 248743
Patch by Jake VanAdrighem!
Summary:
Fix the way we sort the llvm.used and llvm.compiler.used members.
This bug seems to have been introduced in rL183756 through a set of improper casts to GlobalValue*. In subsequent patches this problem was missed and transformed into a getName call on a ConstantExpr.
Reviewers: silvas
Subscribers: silvas, llvm-commits
Differential Revision: http://reviews.llvm.org/D12851
llvm-svn: 248728
This was split off of http://reviews.llvm.org/D13040 to make it easier to test the correctness of the implication logic. For the moment, this only handles a single easy case which shows up when eliminating and combining range checks. In the (near) future, I plan to extend this for other cases which show up in range checks, but I wanted to make those changes incrementally once the framework was in place.
At the moment, the implication logic will be used by three places. One in InstSimplify (this review) and two in SimplifyCFG (http://reviews.llvm.org/D13040 & http://reviews.llvm.org/D13070). Can anyone think of other locations this style of reasoning would make sense?
Differential Revision: http://reviews.llvm.org/D13074
llvm-svn: 248719
Originally, debug intrinsics and annotation intrinsics may prevent
the loop to be rerolled, now they are ignored.
Differential Revision: http://reviews.llvm.org/D13150
llvm-svn: 248718
Before this change `HasSameValue` would return true for distinct
`alloca` instructions if they happened to be allocating the same
type (`alloca` instructions are not specified as reading memory). This
change adds an explicit whitelist of instruction types for which
"identical" instructions compute the same value.
Fixes PR24952.
llvm-svn: 248690
This is one step towards solving PR24766:
https://llvm.org/bugs/show_bug.cgi?id=24766
We were not producing the same IR for these two C functions because the store
to the temp bool causes extra zexts:
#include <stdbool.h>
bool switchy(char x1, char x2, char condition) {
bool conditionMet = false;
switch (condition) {
case 0: conditionMet = (x1 == x2); break;
case 1: conditionMet = (x1 <= x2); break;
}
return conditionMet;
}
bool switchy2(char x1, char x2, char condition) {
switch (condition) {
case 0: return (x1 == x2);
case 1: return (x1 <= x2);
}
return false;
}
As noted in the code comments, this test case manages to avoid the more general existing
phi optimizations where there are only 2 phi inputs or where there are no constant phi
args mixed in with the casts ops. It seems like a corner case, but if we don't catch it,
then I don't think we can get SimplifyCFG to further optimize towards the canonical form
for this function shown in the bug report.
Differential Revision: http://reviews.llvm.org/D12866
llvm-svn: 248689
Summary:
Factor the code that rewrites invokes to calls and rewrites WinEH
terminators to their "unwind to caller" equivalents into a helper in
Utils/Local, and use it in the three places I'm aware of that need to do
this.
Reviewers: andrew.w.kaylor, majnemer, rnk
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D13152
llvm-svn: 248677
Summary:
This is the second part of fixing bug 24848 https://llvm.org/bugs/show_bug.cgi?id=24848.
If both operands of a comparison have range metadata, they should be used to constant fold the comparison.
Reviewers: sanjoy, hfinkel
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D13177
llvm-svn: 248650
Summary:
If the trip count of a specific backedge is `N`, then we know that
backedge is effectively guarded by the condition `{0,+,1} u< N`. This
change teaches SCEV to use this condition to prove things in
`isLoopBackedgeGuardedByCond`.
Depends on D12948
Depends on D12949
The original checkin, r248608 had to be backed out due to an issue with
a ObjCXX unit test. That issue is now fixed, so re-landing.
Reviewers: atrick, reames, majnemer, hfinkel
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D12950
llvm-svn: 248638
Summary:
This change teaches SCEV's `isImpliedCond` two new identities:
A u< B u< -C => (A + C) u< (B + C)
A s< B s< INT_MIN - C => (A + C) s< (B + C)
While these are useful on their own, they're really intended to support
D12950.
The original checkin, r248606 had to be backed out due to an issue with
a ObjCXX unit test. That issue is now fixed, so re-landing.
Reviewers: atrick, reames, majnemer, nlewycky, hfinkel
Subscribers: aadg, sanjoy, llvm-commits
Differential Revision: http://reviews.llvm.org/D12948
llvm-svn: 248637
This is a fix for PR22723:
https://llvm.org/bugs/show_bug.cgi?id=22723
My first attempt at this was to change what I thought was the root problem:
xor (zext i1 X to i32), 1 --> zext (xor i1 X, true) to i32
...but we create the opposite pattern in InstCombiner::visitZExt(), so infinite loop!
My next idea was to fix the matchIfNot() implementation in PatternMatch, but that would
mean potentially returning a different size for the match than what was input. I think
this would require all users of m_Not to check the size of the returned match, so I
abandoned that idea.
I settled on just fixing the exact case presented in the PR. This patch does allow the
2 functions in PR22723 to compile identically (x86):
bool test(bool x, bool y) { return !x | !y; }
bool test(bool x, bool y) { return !x || !y; }
...
andb %sil, %dil
xorb $1, %dil
movb %dil, %al
retq
Differential Revision: http://reviews.llvm.org/D12705
llvm-svn: 248634
BranchProbability now is represented by its numerator and denominator in uint32_t type. This patch changes this representation into a fixed point that is represented by the numerator in uint32_t type and a constant denominator 1<<31. This is quite similar to the representation of BlockMass in BlockFrequencyInfoImpl.h. There are several pros and cons of this change:
Pros:
1. It uses only a half space of the current one.
2. Some operations are much faster like plus, subtraction, comparison, and scaling by an integer.
Cons:
1. Constructing a probability using arbitrary numerator and denominator needs additional calculations.
2. It is a little less precise than before as we use a fixed denominator. For example, 1 - 1/3 may not be exactly identical to 1 / 3 (this will lead to many BranchProbability unit test failures). This should not matter when we only use it for branch probability. If we use it like a rational value for some precise calculations we may need another construct like ValueRatio.
One important reason for this change is that we propose to store branch probabilities instead of edge weights in MachineBasicBlock. We also want clients to use probability instead of weight when adding successors to a MBB. The current BranchProbability has more space which may be a concern.
Differential revision: http://reviews.llvm.org/D12603
llvm-svn: 248633
Summary:
If the trip count of a specific backedge is `N`, then we know that
backedge is effectively guarded by the condition `{0,+,1} u< N`. This
change teaches SCEV to use this condition to prove things in
`isLoopBackedgeGuardedByCond`.
Depends on D12948
Depends on D12949
Reviewers: atrick, reames, majnemer, hfinkel
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D12950
llvm-svn: 248608
Summary:
This change teaches SCEV's `isImpliedCond` two new identities:
A u< B u< -C => (A + C) u< (B + C)
A s< B s< INT_MIN - C => (A + C) s< (B + C)
While these are useful on their own, they're really intended to support
D12950.
Reviewers: atrick, reames, majnemer, nlewycky, hfinkel
Subscribers: aadg, sanjoy, llvm-commits
Differential Revision: http://reviews.llvm.org/D12948
llvm-svn: 248606
...because that's what the cost model was intended to do.
As discussed in D12882, this fix has a temporary unintended consequence for
SimplifyCFG: it causes us to not speculate an fdiv. However, two wrongs make
PR24818 right, and two wrongs make PR24343 act right even though it's really
still wrong.
I intend to correct SimplifyCFG and add to CodeGenPrepare to account for this
cost model change and preserve the righteousness for the bug report cases.
https://llvm.org/bugs/show_bug.cgi?id=24818https://llvm.org/bugs/show_bug.cgi?id=24343
Differential Revision: http://reviews.llvm.org/D12882
llvm-svn: 248439
Patch by: simoncook
Unlike BitCasts, AddrSpaceCasts do not always produce an output the same size as its input, which was previously assumed. This fixes cases where two address spaces do not have the same size pointer, as an assertion failure would occur when trying to prove deferenceability. LoopUnswitch is used in the particular test, but LICM also exhibits the same problem.
Differential Revision: http://reviews.llvm.org/D13008
llvm-svn: 248422
Add two new ways of accessing the unsafe stack pointer:
* At a fixed offset from the thread TLS base. This is very similar to
StackProtector cookies, but we plan to extend it to other backends
(ARM in particular) soon. Bionic-side implementation here:
https://android-review.googlesource.com/170988.
* Via a function call, as a fallback for platforms that provide
neither a fixed TLS slot, nor a reasonable TLS implementation (i.e.
not emutls).
This is a re-commit of a change in r248357 that was reverted in
r248358.
llvm-svn: 248405
Summary:
This is the first part of fixing bug 24848 https://llvm.org/bugs/show_bug.cgi?id=24848.
When range metadata is provided, it should be used to constant fold comparisons with constant values.
Reviewers: sanjoy, hfinkel
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D12988
llvm-svn: 248402
This changes the behavior of AddAligntmentAssumptions to match its
comment. I.e, prove the asserted alignment in the context of the caller,
not the callee.
Thanks to Mehdi Amini for seeing the issue here! Also to Artur Pilipenko
who also saw a fix for the issue.
rdar://22521387
Differential Revision: http://reviews.llvm.org/D12997
llvm-svn: 248390
Invoking a function which returns an aggregate can sometimes be
transformed to return a scalar value. However, this means that we need
to create an insertvalue instruction(s) to recreate the correct
aggregate type. We achieved this by inserting an insertvalue
instruction at the invoke's normal successor. However, this is not
feasible if the normal successor uses the invoke's return value inside a
PHI node.
Instead, split the edge between the invoke and the unwind successor and
create the insertvalue instruction in the new basic block. The new
basic block's successor will be the old invoke successor which leaves
us with IR which is well behaved.
This fixes PR24906.
llvm-svn: 248387
This change allows dead store elimination to remove zero and null stores into memory freshly allocated with calloc-like function.
Differential Revision: http://reviews.llvm.org/D13021
llvm-svn: 248374
Add two new ways of accessing the unsafe stack pointer:
* At a fixed offset from the thread TLS base. This is very similar to
StackProtector cookies, but we plan to extend it to other backends
(ARM in particular) soon. Bionic-side implementation here:
https://android-review.googlesource.com/170988.
* Via a function call, as a fallback for platforms that provide
neither a fixed TLS slot, nor a reasonable TLS implementation (i.e.
not emutls).
llvm-svn: 248357
Apart from checking that GlobalVariable is a constant, we should check
that it's not a weak constant, in which case we can't propagate its
value.
llvm-svn: 248327
ARM counterpart to r248291:
In the comparison failure block of a cmpxchg expansion, the initial
ldrex/ldxr will not be followed by a matching strex/stxr.
On ARM/AArch64, this unnecessarily ties up the execution monitor,
which might have a negative performance impact on some uarchs.
Instead, release the monitor in the failure block.
The clrex instruction was designed for this: use it.
Also see ARMARM v8-A B2.10.2:
"Exclusive access instructions and Shareable memory locations".
Differential Revision: http://reviews.llvm.org/D13033
llvm-svn: 248294
We know that an argmemonly function can only access memory pointed to by it's pointer arguments. Rather than needing to consider all possible stores as aliasing (as we do for a readonly function), we can only consider the aliasing of the pointer arguments.
Note that this change only addresses hoisting. I'm thinking about how to address speculation safety as well, but that will be a different change.
FYI, argmemonly disallows accessing memory through non-pointer typed arguments.
Differential Revision: http://reviews.llvm.org/D12771
llvm-svn: 248220
We're currently losing any fast-math flags when synthesizing fcmps for
min/max reductions. In LV, make sure we copy over the scalar inst's
flags. In LoopUtils, we know we only ever match patterns with
hasUnsafeAlgebra, so apply that to any synthesized ops.
llvm-svn: 248201
Because -indvars widens induction variables through arithmetic,
`NeverNegative` cannot be a property of the `WidenIV` (a `WidenIV`
manages information for all transitive uses of an IV being widened,
including uses of `-1 * IV`). Instead it must live on `NarrowIVDefUse`
which manages information for a specific def-use edge in the transitive
use list of an induction variable.
This change also adds a test case that demonstrates the problem with
r248045.
llvm-svn: 248107
Summary:
If an induction variable is provably non-negative, its sign extension is
equal to its zero extension. This means narrow uses like
icmp slt iNarrow %indvar, %rhs
can be widened into
icmp slt iWide zext(%indvar), sext(%rhs)
Reviewers: atrick, mcrosier, hfinkel
Subscribers: hfinkel, reames, llvm-commits
Differential Revision: http://reviews.llvm.org/D12745
llvm-svn: 248045
Currently LazyValueInfo will report only alloca's as having nonnull range.
For loads with !nonnull metadata it will bailout with no additional information.
Same is true for calls returning nonnull pointers.
This change extends LazyValueInfo to handle additional nonnull instructions.
Differential Revision: http://reviews.llvm.org/D12932
llvm-svn: 247985
The SSE4A instructions EXTRQ/INSERTQ only use the lower 64-bits (or less) for many of their input vector operands and all of them have undefined upper 64-bits results.
Differential Revision: http://reviews.llvm.org/D12680
llvm-svn: 247934
This test uses a gcov file generated in a little-endian host. The gcov
reader does not allow different endianness, so the test fails on big
endian hosts.
XFAILing for now.
llvm-svn: 247920
This adds enough machinery to support reading simple GCC AutoFDO
profiles. It now supports reading flat profiles (no function calls).
Subsequent patches will add support for:
- Inlined calls (in particular, the inline call stack is not traversed
to accumulate samples).
- Working sets and modules. These are used mostly for GCC's LIPO
optimizations, so they're not needed in LLVM atm. I'm not sure that
we will ever need them. For now, I've if0'd around the calls.
The patch also adds support in GCOV.h for gcov version V704 (generated
by GCC's profile conversion tool).
llvm-svn: 247874
Summary:
`signum(x)` is sometimes implemented as `(x >> 63) | (-x >>> 63)` (for
an `i64` `x`). This change adds a matcher for that pattern, and an
instcombine rule to optimize `signum(x) s< 1`.
Later, we can also consider optimizing:
icmp slt signum(x), 0 --> icmp slt x, 0
icmp sle signum(x), 1 --> true
etc.
Reviewers: majnemer
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D12703
llvm-svn: 247846
Clang now passes the adjectives as an argument to catchpad.
Getting the CatchObj working is simply a matter of threading another
static alloca through codegen, first as an alloca, then as a frame
index, and finally as a frame offset.
llvm-svn: 247844
When building LLVM as a (potentially dynamic) library that can be linked against
by multiple compilers, the default triple is not really meaningful.
We allow to explicitely set it to an empty string when configuring LLVM.
In this case, said "target independent" tests in the test suite that are using
the default triple are disabled by matching the newly available feature
"default_triple".
Reviewers: probinson, echristo
Differential Revision: http://reviews.llvm.org/D12660
From: Mehdi Amini <mehdi.amini@apple.com>
llvm-svn: 247775
We only checked that a global is initialized with constants, which is
incorrect. We should be checking that GlobalVariable *is* a constant,
not just initialized with it.
llvm-svn: 247769
In `IndVarSimplify::ExpandSCEVIfNeeded`,
`SCEVExpander::findExistingExpansion` may return an `llvm::Value` that
differs in type from the SCEV it was asked to find an expansion for (but
computes the same value). In such cases, we fall back on
`expandCodeFor`; and rely on LLVM to CSE the two equivalent
expressions (different only by a no-op cast) into a single computation.
I tried a few other approaches to fixing PR24783, all of which turned
out to be more complex than this current version:
1. Move the `ExpandSCEVIfNeeded` logic into `expandCodeFor`. This got
problematic because currently we do not pass in the `Loop *` into
`expandCodeFor`. Changing the interface to do this is a more
invasive change, and really does not make much semantic sense unless
the SCEV being passed in is an add recurrence.
There is also the problem of `expandCodeFor` being used in places
other than `indvars` -- there may be performance / correctness
issues elsewhere if `expandCodeFor` is moved from always generating
IR from scratch to cache-like model.
2. Have `findExistingExpansion` only return expression with the correct
type. This would make `isHighCostExpansionHelper` and thus
`isHighCostExpansion` more conservative than necessary.
3. Insert casts on the value returned by `findExistingExpansion` if
needed using `InsertNoopCastOfTo`. This is complicated because
`InsertNoopCastOfTo` depends on internal state of its
`SCEVExpander` (specifically `Builder.GetInserPoint()`), and this
may not be set up when `ExpandSCEVIfNeeded` is called.
4. Manually insert casts on the value returned by
`findExistingExpansion` if needed using `InsertNoopCastOfTo` via
`CastInst::Create`. This is probably workable, but figuring out the
location where the cast instruction needs to be inserted has enough
edge cases (arguments, constants, invokes, LCSSA must be preserved)
makes me feel what I have right now is simplest solution.
llvm-svn: 247749
The patch extends the optimization to cases where the constant's
magnitude is so small or large that the rounding of the conversion
is irrelevant. The "so small" case includes negative zero.
Differential review: http://reviews.llvm.org/D11210
llvm-svn: 247708
LazuValueInfo can prove that value is nonnull based on the context information.
Make use of this ability to infer nonnull attributes for the call arguments.
Differential Revision: http://reviews.llvm.org/D12836
llvm-svn: 247707
Summary:
This change lets a `PlaceSafepoints` client change how wide the trip
count of a loop has to be for the loop to be considerd "counted", via
`CountedLoopTripWidth`. It also removes the boolean `SkipCounted` flag
and the `upperTripBound` constant -- we can get the old behavior of
`SkipCounted` == `false` by setting `CountedLoopTripWidth` to `13` (2 ^
13 == 8192).
Reviewers: reames
Subscribers: llvm-commits, sanjoy
Differential Revision: http://reviews.llvm.org/D12789
llvm-svn: 247656
Summary: This patch replaces isKnownNonNull() with isKnownNonNullAt() when checking nullness of passing arguments at callsite. In this way it can handle cases where the argument does not have nonnull attribute but has a dominating null check from the CFG. It also adds assertions in isKnownNonNull() and isKnownNonNullFromDominatingCondition() to make sure the value checked is pointer type (as defined in LLVM document). These assertions might trip failures in things which are not covered under llvm/test, but fixes should be pretty obvious.
Reviewers: reames
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D12779
llvm-svn: 247587
GetElementPointers must have the first argument's type compared
for structural equivalence. Previously the code erroneously compared the
pointer's type, but this code was dead because all pointer types (of the
same address space) are the same. The pointee must be compared instead
(using the type stored in the GEP, not from the pointer type which will
be erased anyway).
Author: jrkoenig
Reviewers: dschuff, nlewycky, jfb
Subscribers: nlewycky, llvm-commits
Differential revision: http://reviews.llvm.org/D12820
llvm-svn: 247570
Improved InstCombine support for CVTPH2PS (F16C half 2 float conversion):
<4 x float> @llvm.x86.vcvtph2ps.128(<8 x i16>) - only uses the bottom 4 i16 elements for the conversion.
Added constant folding support.
Differential Revision: http://reviews.llvm.org/D12731
llvm-svn: 247504
In some ways this is a very boring port to the new pass manager as there
are no interesting analyses or dependencies or other oddities.
However, this does introduce the first good example of a transformation
pass with non-trivial state porting to the new pass manager. I've tried
to carve out patterns here to replicate elsewhere, and would appreciate
comments on whether folks like these patterns:
- A common need in the new pass manager is to effectively lift the pass
class and some of its state into a public header file. Prior to this,
LLVM used anonymous namespaces to provide "module private" types and
utilities, but that doesn't scale to cases where a public header file
is needed and the new pass manager will exacerbate that. The pattern
I've adopted here is to use the namespace-cased-name of the core pass
(what would be a module if we had them) as a module-private namespace.
Then utility and other code can be declared and defined in this
namespace. At some point in the future, we could even have
(conditionally compiled) code that used modules features when
available to do the same basic thing.
- I've split the actual pass run method in two in order to expose
a private method usable by the old pass manager to wrap the new class
with a minimum of duplicated code. I actually looked at a bunch of
ways to automate or generate these, but they are all quite terrible
IMO. The fundamental need is to extract the set of analyses which need
to cross this interface boundary, and that will end up being too
unpredictable to effectively encapsulate IMO. This is also
a relatively small amount of boiler plate that will live a relatively
short time, so I'm not too worried about the fact that it is boiler
plate.
The rest of the patch is totally boring but results in a massive diff
(sorry). It just moves code around and removes or adds qualifiers to
reflect the new name and nesting structure.
Differential Revision: http://reviews.llvm.org/D12773
llvm-svn: 247501
The rest of the EH pads are fine, since they have at most one label and
take fewer operands for the personality.
Old catchpad vs. new:
%5 = catchpad [i8* bitcast (i32 ()* @"\01?filt$0@0@main@@" to i8*)] to label %__except.ret.10 unwind label %catchendblock.9
-----
%5 = catchpad [i8* bitcast (i32 ()* @"\01?filt$0@0@main@@" to i8*)]
to label %__except.ret.10 unwind label %catchendblock.9
llvm-svn: 247433
Summary: This patch replaces isKnownNonNull() with isKnownNonNullAt() when checking nullness of passing arguments at callsite. In this way it can handle cases where the argument does not have nonnull attribute but has a dominating null check from the CFG.
Reviewers: reames
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D12779
llvm-svn: 247356
Summary: This patch replaces isKnownNonNull() with isKnownNonNullAt() when checking nullness of gc.relocate return value. In this way it can handle cases where the relocated value does not have nonnull attribute but has a dominating null check from the CFG.
Reviewers: reames
Subscribers: llvm-commits, sanjoy
Differential Revision: http://reviews.llvm.org/D12772
llvm-svn: 247353
This patch enables small size reductions in which the source types are smaller
than the reduction type (e.g., computing an i16 sum from the values in an i8
array). The previous behavior was to only allow small size reductions if the
source types and reduction type were the same. The change accounts for the fact
that the existing sign- and zero-extend instructions in these cases should
still be included in the cost model.
Differential Revision: http://reviews.llvm.org/D12770
llvm-svn: 247337
This change correctly sets the attributes on the callsites
generated in thunks. This makes sure things such as sret, sext, etc.
are correctly set, so that the call can be a proper tailcall.
Also, the transfer of attributes in the replaceDirectCallers function
appears to be unnecessary, but until this is confirmed it will remain.
Author: jrkoenig
Reviewers: dschuff, jfb
Subscribers: llvm-commits, nlewycky
Differential revision: http://reviews.llvm.org/D12581
llvm-svn: 247313
This is a follow up to http://reviews.llvm.org/D11995 implementing the suggestion by Hans.
If we know some of the bits of the value being switched on, we know that the maximum number of unique cases covers the unknown bits. This allows to eliminate switch defaults for large integers (i32) when most bits in the value are known.
Note that I had to make the transform contingent on not having any dead cases. This is conservatively correct with the old code, but required for the new code since we might have a dead case which varies one of the known bits. Counting that towards our number of covering cases would be bad. If we do have dead cases, we'll eliminate them first, then revisit the possibly dead default.
Differential Revision: http://reviews.llvm.org/D12497
llvm-svn: 247309
removes cast by performing the lshr on smaller types. However, currently there
is no trunc(lshr (sext A), Cst) variant.
This patch add such optimization by transforming trunc(lshr (sext A), Cst)
to ashr A, Cst.
Differential Revision: http://reviews.llvm.org/D12520
llvm-svn: 247271
This change is simply enhancing the existing inference algorithm to handle insertelement instructions by conservatively inserting a new instruction to propagate the vector of associated base pointers. In the process, I'm ripping out the peephole optimizations which mostly helped cover the fact this hadn't been done.
Note that most of the newly inserted nodes will be nearly immediately removed by the post insertion optimization pass introduced in 246718. Arguably, we should be trying harder to avoid the malloc traffic here, but I'd rather get the code correct, then worry about compile time.
Unlike previous extensions of the algorithm to handle more case, I discovered the existing code was causing miscompiles in some cases. In particular, we had an implicit assumption that the peephole covered *all* insert element instructions, so if we had a value directly based on a insert element the peephole didn't cover, we proceeded as if it were a base anyways. Not good. I believe we had the same issue with shufflevector which is why I adjusted the predicate for them as well.
Differential Revision: http://reviews.llvm.org/D12583
llvm-svn: 247210
Visit disjoint sets in a deterministic order based on the maximum BitSetNM
index, otherwise the order in which we visit them will depend on pointer
comparisons. This was being exposed by MSan.
llvm-svn: 247201
with the new pass manager, and no longer relying on analysis groups.
This builds essentially a ground-up new AA infrastructure stack for
LLVM. The core ideas are the same that are used throughout the new pass
manager: type erased polymorphism and direct composition. The design is
as follows:
- FunctionAAResults is a type-erasing alias analysis results aggregation
interface to walk a single query across a range of results from
different alias analyses. Currently this is function-specific as we
always assume that aliasing queries are *within* a function.
- AAResultBase is a CRTP utility providing stub implementations of
various parts of the alias analysis result concept, notably in several
cases in terms of other more general parts of the interface. This can
be used to implement only a narrow part of the interface rather than
the entire interface. This isn't really ideal, this logic should be
hoisted into FunctionAAResults as currently it will cause
a significant amount of redundant work, but it faithfully models the
behavior of the prior infrastructure.
- All the alias analysis passes are ported to be wrapper passes for the
legacy PM and new-style analysis passes for the new PM with a shared
result object. In some cases (most notably CFL), this is an extremely
naive approach that we should revisit when we can specialize for the
new pass manager.
- BasicAA has been restructured to reflect that it is much more
fundamentally a function analysis because it uses dominator trees and
loop info that need to be constructed for each function.
All of the references to getting alias analysis results have been
updated to use the new aggregation interface. All the preservation and
other pass management code has been updated accordingly.
The way the FunctionAAResultsWrapperPass works is to detect the
available alias analyses when run, and add them to the results object.
This means that we should be able to continue to respect when various
passes are added to the pipeline, for example adding CFL or adding TBAA
passes should just cause their results to be available and to get folded
into this. The exception to this rule is BasicAA which really needs to
be a function pass due to using dominator trees and loop info. As
a consequence, the FunctionAAResultsWrapperPass directly depends on
BasicAA and always includes it in the aggregation.
This has significant implications for preserving analyses. Generally,
most passes shouldn't bother preserving FunctionAAResultsWrapperPass
because rebuilding the results just updates the set of known AA passes.
The exception to this rule are LoopPass instances which need to preserve
all the function analyses that the loop pass manager will end up
needing. This means preserving both BasicAAWrapperPass and the
aggregating FunctionAAResultsWrapperPass.
Now, when preserving an alias analysis, you do so by directly preserving
that analysis. This is only necessary for non-immutable-pass-provided
alias analyses though, and there are only three of interest: BasicAA,
GlobalsAA (formerly GlobalsModRef), and SCEVAA. Usually BasicAA is
preserved when needed because it (like DominatorTree and LoopInfo) is
marked as a CFG-only pass. I've expanded GlobalsAA into the preserved
set everywhere we previously were preserving all of AliasAnalysis, and
I've added SCEVAA in the intersection of that with where we preserve
SCEV itself.
One significant challenge to all of this is that the CGSCC passes were
actually using the alias analysis implementations by taking advantage of
a pretty amazing set of loop holes in the old pass manager's analysis
management code which allowed analysis groups to slide through in many
cases. Moving away from analysis groups makes this problem much more
obvious. To fix it, I've leveraged the flexibility the design of the new
PM components provides to just directly construct the relevant alias
analyses for the relevant functions in the IPO passes that need them.
This is a bit hacky, but should go away with the new pass manager, and
is already in many ways cleaner than the prior state.
Another significant challenge is that various facilities of the old
alias analysis infrastructure just don't fit any more. The most
significant of these is the alias analysis 'counter' pass. That pass
relied on the ability to snoop on AA queries at different points in the
analysis group chain. Instead, I'm planning to build printing
functionality directly into the aggregation layer. I've not included
that in this patch merely to keep it smaller.
Note that all of this needs a nearly complete rewrite of the AA
documentation. I'm planning to do that, but I'd like to make sure the
new design settles, and to flesh out a bit more of what it looks like in
the new pass manager first.
Differential Revision: http://reviews.llvm.org/D12080
llvm-svn: 247167
Predicating stores requires creating extra blocks. It's much cleaner if we do this in one pass instead of mutating the CFG while writing vector instructions.
Besides which we can make use of helper functions to update domtree for us, reducing the work we need to do.
llvm-svn: 247139
This change extends the bitset lowering pass to support bitsets that may
contain either functions or global variables. A function bitset is lowered to
a jump table that is laid out before one of the functions in the bitset.
Also add support for non-string bitset identifier names. This allows for
distinct metadata nodes to stand in for names with internal linkage,
as done in D11857.
Differential Revision: http://reviews.llvm.org/D11856
llvm-svn: 247080
- Move tests only exercising instsimplify to instsimplify's apint-or.ll
- Actually test the CHECK lines in instsimplify's apint-or.ll
- Merge the remaining tests in apint-or1.ll and apint-or2.ll, use FileCheck
llvm-svn: 247045
removes cast by performing the lshr on smaller types. However, currently there
is no trunc(lshr (sext A), Cst) variant.
This patch add such optimization by transforming trunc(lshr (sext A), Cst)
to ashr A, Cst.
Differential Revision: http://reviews.llvm.org/D12520
llvm-svn: 246997
Trivial multiplication by zero may survive the worklist. We tried to
reassociate the multiplication with a division instruction, causing us
to divide by zero; bail out instead.
This fixes PR24726.
llvm-svn: 246939
This adds a basic cost model for interleaved-access vectorization (and a better
default for shuffles), and enables interleaved-access vectorization by default.
The relevant difference from the default cost model for interleaved-access
vectorization, is that on PPC, the shuffles that end up being used are *much*
cheaper than modeling the process with insert/extract pairs (which are
quite expensive, especially on older cores).
llvm-svn: 246824
On the A2, with an eye toward QPX unaligned-load merging, we should always use
aggressive interleaving. It is generally superior to only using concatenation
unrolling.
llvm-svn: 246819
Summary:
This function was not taking into account that the
input type could be a vector, and wasn't properly
working for vector types.
This caused an assert when building spec2k6 perlbmk for armv8.
Reviewers: rengolin, mzolotukhin
Subscribers: silviu.baranga, mzolotukhin, rengolin, eugenis, jmolloy, aemerson, llvm-commits
Differential Revision: http://reviews.llvm.org/D12559
llvm-svn: 246759
Summary:
Add a `cleanupendpad` instruction, used to mark exceptional exits out of
cleanups (for languages/targets that can abort a cleanup with another
exception). The `cleanupendpad` instruction is similar to the `catchendpad`
instruction in that it is an EH pad which is the target of unwind edges in
the handler and which itself has an unwind edge to the next EH action.
The `cleanupendpad` instruction, similar to `cleanupret` has a `cleanuppad`
argument indicating which cleanup it exits. The unwind successors of a
`cleanuppad`'s `cleanupendpad`s must agree with each other and with its
`cleanupret`s.
Update WinEHPrepare (and docs/tests) to accomodate `cleanupendpad`.
Reviewers: rnk, andrew.w.kaylor, majnemer
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D12433
llvm-svn: 246751
There was infinite loop because it was trying to change assume(true) into
assume(true)
Also added handling when assume(false) appear
http://reviews.llvm.org/D12516
llvm-svn: 246697
After hitting @llvm.assume(X) we can:
- propagate equality that X == true
- if X is icmp/fcmp (with eq operation), and one of operand
is constant we can change all variables with constants in the same BasicBlock
http://reviews.llvm.org/D11918
llvm-svn: 246695
We were bailing to two places if our runtime checks failed. If the initial overflow check failed, we'd go to ScalarPH. If any other check failed, we'd go to MiddleBlock. This caused us to have to have an extra PHI per induction and reduction as the vector loop's exit block was not dominated by its latch.
There's no need to have this behavior - if we just always go to ScalarPH we can get rid of a bunch of complexity.
llvm-svn: 246637
This reduces the complexity of createEmptyBlock() and will open the door to further refactoring.
The test change is simply because we're now constant folding a trivial test.
llvm-svn: 246634
There's no need to widen canonical induction variables. It's just as efficient to create a *new*, wide, induction variable.
Consider, if we widen an indvar, then we'll have to truncate it before its uses anyway (1 trunc). If we create a new indvar instead, we'll have to truncate that instead (1 trunc) [besides which IndVars should go and clean up our mess after us anyway on principle].
This lets us remove a ton of special-casing code.
llvm-svn: 246631
Vectorized loops only ever have one induction variable. All induction PHIs from the scalar loop are rewritten to be in terms of this single indvar.
We were trying very hard to pick an indvar that already existed, even if that indvar wasn't canonical (didn't start at zero). But trying so hard is really fruitless - creating a new, canonical, indvar only results in one extra add in the worst case and that add is trivially easy to push through the PHI out of the loop by instcombine.
If we try and be less clever here and instead let instcombine clean up our mess (as we do in many other places in LV), we can remove unneeded complexity.
llvm-svn: 246630
Summary:
This change turns on by default interleaved access vectorization
for AArch64.
We also clean up some tests which were spedifically enabling this
behaviour.
Reviewers: rengolin
Subscribers: aemerson, llvm-commits, rengolin
Differential Revision: http://reviews.llvm.org/D12149
llvm-svn: 246542
Summary:
This change turns on by default interleaved access vectorization on ARM,
as it has shown to be beneficial on ARM.
Reviewers: rengolin
Subscribers: aemerson, llvm-commits, rengolin
Differential Revision: http://reviews.llvm.org/D12146
llvm-svn: 246541
Teach FunctionAttr to infer the nonnull attribute on return values of functions which never return a potentially null value. This is done both via a conservative local analysis for the function itself and a optimistic per-SCC analysis. If no function in the SCC returns anything which could be null (other than values from other functions in the SCC), we can conclude no function returned a null pointer. Even if some function within the SCC returns a null pointer, we may be able to locally conclude that some don't.
Differential Revision: http://reviews.llvm.org/D9688
llvm-svn: 246476
If asked to prove a predicate about a value produced by a PHI node, LazyValueInfo was unable to do so even if the predicate was known to be true for each input to the PHI. This prevented JumpThreading from eliminating a provably redundant branch.
The problematic test case looks something like this:
ListNode *p = ...;
while (p != null) {
if (!p) return;
x = g->x; // unrelated
p = p->next
}
The null check at the top of the loop is redundant since the value of 'p' is null checked on entry to the loop and before executing the backedge. This resulted in us a) executing an extra null check per iteration and b) not being able to LICM unrelated loads after the check since we couldn't prove they would execute or that their dereferenceability wasn't effected by the null check on the first iteration.
Differential Revision: http://reviews.llvm.org/D12383
llvm-svn: 246465
Summary:
JumpThreading shouldn't duplicate a convergent call, because that would move a convergent call into a control-inequivalent location. For example,
if (cond) {
...
} else {
...
}
convergent_call();
if (cond) {
...
} else {
...
}
should not be optimized to
if (cond) {
...
convergent_call();
...
} else {
...
convergent_call();
...
}
Test Plan: test/Transforms/JumpThreading/basic.ll
Patch by Xuetian Weng.
Reviewers: resistor, arsenm, jingyue
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D12484
llvm-svn: 246415
I'm working on adding !dbg attachments to functions (PR23367), which
we'll use to determine the canonical subprogram for a function (instead
of the `subprograms:` array in the compile units). This updates a few
old tests in preparation.
Transforms/Mem2Reg/ConvertDebugInfo2.ll had an old-style grep+count
based test that would start to fail because I've added an extra line
with `!dbg`. Instead, explicitly `CHECK` for what I think the test
actually cares about.
All three testcases have subprograms with a valid `function:` reference
-- which means my upgrade script will add a `!dbg` attachment -- but
that aren't referenced from any compile unit. I suspect these testcases
were handreduced over-zealously (or have bitrotted?). Add a reference
from the compile unit so that upcoming Verifier checks won't fail here.
llvm-svn: 246351
This reverts isSafeToSpeculativelyExecute's use of ReadNone until we
split ReadNone into two pieces: one attribute which reasons about how
the function reasons about memory and another attribute which determines
how it may be speculated, CSE'd, trap, etc.
llvm-svn: 246331
As a follow-up to r246098, require `DISubprogram` definitions
(`isDefinition: true`) to be 'distinct'. Specifically, add an assembler
check, a verifier check, and bitcode upgrading logic to combat testcase
bitrot after the `DIBuilder` change.
While working on the testcases, I realized that
test/Linker/subprogram-linkonce-weak-odr.ll isn't relevant anymore. Its
purpose was to check for a corner case in PR22792 where two subprogram
definitions match exactly and share the same metadata node. The new
verifier check, requiring that subprogram definitions are 'distinct',
precludes that possibility.
I updated almost all the IR with the following script:
git grep -l -E -e '= !DISubprogram\(.* isDefinition: true' |
grep -v test/Bitcode |
xargs sed -i '' -e 's/= \(!DISubprogram(.*, isDefinition: true\)/= distinct \1/'
Likely some variant of would work for out-of-tree testcases.
llvm-svn: 246327