We previously made a change to getUserCost to return a Invalid cost
when one of the TTI costs returned '-1' (meaning 'unknown' or
'infinitely expensive'). It makes no sense to say that:
shufflevector <2 x i8> %x, <2 x i8> %y, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
has an invalid cost. Perhaps the cost is not known, but the IR is valid
and can be code-generated. Invalid should only be used for IR that
cannot possibly be code-generated and where a cost is nonsensical.
With more passes now asserting that the cost must be valid, it is possible
that those assertions will fail for perfectly valid IR. An incomplete
cost-model probably shouldn't be a reason for the compiler to break.
It's better to consider these costs as 'very expensive' and ignore them
for other reasons. At some point, we should consider replacing -1 with
some other mechanism.
Reviewed By: paulwalker-arm, dmgreen
Differential Revision: https://reviews.llvm.org/D99502
Lookup tables generate non PIC-friendly code, which requires dynamic relocation as described in:
https://bugs.llvm.org/show_bug.cgi?id=45244
This patch adds a new pass that converts lookup tables to relative lookup tables to make them PIC-friendly.
Differential Revision: https://reviews.llvm.org/D94355
This can only happen if offset types that are larger than the
pointer size are involved. The previous implementation did not
assert in this case because it initialized the APInts to the
width of one of the variables -- though I strongly suspect it
did not compute correct results in this case.
Fixes https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=32621
reported by fhahn.
If the sizes of both memory locations are unknown, we can only
perform a check on the underlying objects. There's no point in
going through GEP decomposition in this case.
The current linear expression decomposition handles zext/sext by
decomposing the casted operand, and then checking NUW/NSW flags
to determine whether the extension can be distributed. This has
some disadvantages:
First, it is not possible to perform a partial decomposition. If
we have zext((x + C1) +<nuw> C2) then we will fail to decompose
the expression entirely, even though it would be safe and
profitable to decompose it to zext(x + C1) +<nuw> zext(C2)
Second, we may end up performing unnecessary decompositions,
which will later be discarded because they lack nowrap flags
necessary for extensions.
Third, correctness of the code is not entirely obvious: At a high
level, we encounter zext(x -<nuw> C) in the form of a zext on the
linear expression x + (-C) with nuw flag set. Notably, this case
must be treated as zext(x) + -zext(C) rather than zext(x) + zext(-C).
The code handles this correctly by speculatively zexting constants
to the final bitwidth, and performing additional fixup if the
actual extension turns out to be an sext. This was not immediately
obvious to me.
This patch inverts the approach: An ExtendedValue represents a
zext(sext(V)), and linear expression decomposition will try to
decompose V further, either by absorbing another sext/zext into the
ExtendedValue, or by distributing zext(sext(x op C)) over a binary
operator with appropriate nsw/nuw flags. At each step we can
determine whether distribution is legal and abort with a partial
decomposition if not. We also know which extensions we need to
apply to constants, and don't need to speculate or fixup.
While explicit sext instructions were handled correctly, the
implicit sext that occurs if the offset is smaller than the
pointer size blindly assumed that sext(X * Scale + Offset) is the
same as sext(X) * Scale + Offset, which is obviously not correct.
Fix this by extracting the code that handles linear expression
extension and reusing it for the implicit sext as well.
A number of variables need to be correctly initialized on entry
to GetLinearExpression() for the implementation to behave reasonably.
The fact that SExtBits can currenlty be non-zero on entry is a bug,
as demonstrated by the added test: For implicit sexts by the GEP,
we do currently skip legality checks.
Currently, we'd produce an incorrect decomposition, because we
already recursively called GetLinearExpression(), so the Scale=1,
Offset=0 will not necessarily be relative to the shl itself.
Now, this doesn't actually matter for functional correctness,
because such a shift is poison anyway, so its okay to return
an incorrect decomposition. It's still unnecessarily confusing
though, and we can easily avoid this by checking the bitwidth
earlier.
Nowrap flags between mul and shl differ in that mul nsw allows
multiplication of 1 * INT_MIN, while shl nsw does not. This means
that it is always fine to transfer shl nowrap flags to muls, but
not necessarily the other way around. In this case the NUW/NSW
results refer to mul/add operations, so it's fine to retain the
flags from the shl.
Handle (x << s) != (y << s) where x != y and the shifts are
non-wrapping. Once again, this establishes parity with the
corresponing mul fold that already exists. The shift case is
more powerful because we don't need to guard against multiplies
by zero.
This handles the pattern X != X << C for non-zero X and C and a
non-overflowing shift. This establishes parity with the corresponing
fold for multiplies.
This is mainly for clarity: It doesn't make sense to do any
negative/positive checks when dealing with a nuw add/mul. These
only make sense to nsw add/mul.
This patch enables the cost-benefit-analysis-based inliner by default
if we have instrumentation profile.
- SPEC CPU 2017 shows a 0.4% improvement.
- An internal large benchmark shows a 0.9% reduction in the cycle
count along with 14.6% reduction in the number of call instructions
executed.
Differential Revision: https://reviews.llvm.org/D98213
getPointersDiff would previously round down the difference between two
pointers to a multiple of the element size of the pointee, which could
result in a pointer value being decreased a little.
Alexey Bataev has graciously agreed to add a testcase for this;
submitting the bugfix now to unblock.
loop:
%cmp.0 = phi i32 [ 3, %entry ], [ %inc, %loop ]
%pos.0 = phi i32 [ 1, %entry ], [ %cmp.0, %loop ]
...
%inc = add i32 %cmp.0, 1
br label %loop
On above example, %pos.0 uses previous iteration's %cmp.0 with backedge
according to PHI's instruction's defintion. If the %inc is not same among
iterations, we can say the two PHIs are not same.
Differential Revision: https://reviews.llvm.org/D98422
Rather than special-casing assume in BasicAA getModRefBehavior(),
do this one level higher, in the attribute handling of CallBase.
For assumes with operand bundles, the inaccessiblememonly attribute
applies regardless of operand bundles.
With cost-benefit analysis for inlining, we bypass the cost-threshold by returning inline result from call analyzer early.
However the cost and threshold are still available from call analyzer, and when cost is actually higher than threshold, we incorrect set the reason.
The change makes the decision from cost-benefit analysis explicit. It's mostly NFC, except that it allows the priority-based sample loader inliner used by CSSPGO to use cost-benefit heuristic.
Differential Revision: https://reviews.llvm.org/D99302
This patch enables the cost-benefit-analysis-based inliner by default
if we have instrumentation profile.
- SPEC CPU 2017 shows a 0.4% improvement.
- An internal large benchmark shows a 0.9% reduction in the cycle
count along with 14.6% reduction in the number of call instructions
executed.
Differential Revision: https://reviews.llvm.org/D98213
This is similar to the select logic just ahead of the new code.
Min/max choose exactly one value from the inputs, so if both of
those are a power-of-2, then the result must be a power-of-2.
This might help with D98152, but we likely still need other
pieces of the puzzle to avoid regressions.
The change in PatternMatch.h is needed to build with clang.
It's possible there is a better way to deal with the 'const'
incompatibities.
Differential Revision: https://reviews.llvm.org/D99276
SCEV currently tries to prove implications of x pred y by also
trying to imply ~y pred ~x. This is expensive in terms of
compile-time (in fact, the majority of isImpliedCond compile-time
is spent here) and generally not fruitful. The issue is that this
also swaps the operands and thus breaks canonical ordering. If
originally we were trying to prove an implication like
X > C1 -> Y > C2, then we'll now try to prove X > C1 -> C3 > ~Y,
which will not work.
The only real case where we can get some use out of this transform
is if the original conditions were in the form X > C1 -> Y < C2, were
then swapped to X > C1 -> C2 > Y and are then swapped again here to
X > C1 -> ~Y > C3.
As such, handle this at a higher level, where we are doing the
swapping in the first place. There's four different ways that we
can line up a predicate and a swapped predicate, so we use some
heuristics to pick some profitable way.
Because we now try this transform at a higher level
(isImpliedCondOperands rather than isImpliedCondOperandsHelper),
we can also prove additional facts. Of the added tests, one was
proven previously while the other wasn't.
Differential Revision: https://reviews.llvm.org/D90926
Lookup tables generate non PIC-friendly code, which requires dynamic relocation as described in:
https://bugs.llvm.org/show_bug.cgi?id=45244
This patch adds a new pass that converts lookup tables to relative lookup tables to make them PIC-friendly.
Differential Revision: https://reviews.llvm.org/D94355
FindAvailableLoadedValue() relies on FindAvailablePtrLoadStore() to run
the alias analysis when searching for an equivalent value. However,
FindAvailablePtrLoadStore() calls the alias analysis framework with a
memory location for the load constructed from an address and a size,
which thus lacks TBAA metadata info. This commit modifies
FindAvailablePtrLoadStore() to accept an optional memory location as
parameter to allow FindAvailableLoadedValue() to create it based on the
load instruction, which would then have TBAA metadata info attached.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D99206
This patch changes the interface to take a RegisterKind, to indicate
whether the register bitwidth of a scalar register, fixed-width vector
register, or scalable vector register must be returned.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D98874
As mentioned in [[ https://reviews.llvm.org/D96979 | D96979 ]], I'm extending the **IsGuaranteedLoopInvariant** check also to the `MemorySSA.cpp` file.
@fhahn For now I didn't unify the function into `MemorySSA.h` because, as you mentioned, it's not directly MSSA related. I'm open to suggestions to find a better place so we can improve the unification process.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D97155
Added getPointersDiff function to LoopAccessAnalysis and used it instead
direct calculatoin of the distance between pointers and/or
isConsecutiveAccess function in SLP vectorizer to improve compile time
and detection of stores consecutive chains.
Part of D57059
Differential Revision: https://reviews.llvm.org/D98967
This fixes a regression reported on D99022: If a call has operand
bundles, then the inaccessiblememonly attribute on the function
will be ignored, as operand bundles can affect modref behavior in
the general case. However, for assume operand bundles in particular
this is not the case.
Adjust getModRefBehavior() to always report inaccessiblememonly
for assumes, regardless of presence of operand bundles.
Added getPointersDiff function to LoopAccessAnalysis and used it instead
direct calculatoin of the distance between pointers and/or
isConsecutiveAccess function in SLP vectorizer to improve compile time
and detection of stores consecutive chains.
Part of D57059
Differential Revision: https://reviews.llvm.org/D98967
This select of ctpop with 0 pattern can get left behind after
loop idiom recognize converts a loop to ctpop. LLVM 10 was able
to optimize this, but LLVM 11 and later is not. The difference
seems to be that some select transforms are now limited based
on canCreateUndefOrPoison.
Teaching canCreateUndefOrPoison about ctpop restores the
LLVM 10 codegen.
Differential Revision: https://reviews.llvm.org/D99207
Lookup tables generate non PIC-friendly code, which requires dynamic relocation as described in:
https://bugs.llvm.org/show_bug.cgi?id=45244
This patch adds a new pass that converts lookup tables to relative lookup tables to make them PIC-friendly.
Differential Revision: https://reviews.llvm.org/D94355
This is an alternative to D98391/D98585, playing things more
conservatively. If AllowRefinement == false, then we don't use
InstSimplify methods at all, and instead explicitly implement a
small number of non-refining folds. Most cases are handled by
constant folding, and I only had to add three folds to cover
our unit tests / test-suite. While this may lose some optimization
power, I think it is safer to approach from this direction, given
how many issues this code has already caused.
Differential Revision: https://reviews.llvm.org/D99027
These intrinsics don't need to be marked as arbitrary writing,
it's sufficient to write inaccessible memory (aka "side effect")
to preserve control dependencies. This means less special-casing
in BasicAA. This is intended as an alternative to D98925.
Differential Revision: https://reviews.llvm.org/D99022
This is no-functional-change intended (NFC), but needed to allow
optimizer passes to use the API. See D98898 for a proposed usage
by SimplifyCFG.
I'm simplifying the code by removing the cl::opt. That was added
back with the original commit in D19488, but I don't see any
evidence in regression tests that it was used. Target-specific
overrides can use the usual patterns to adjust as necessary.
We could also restore that cl::opt, but it was not clear to me
exactly how to do it in the convoluted TTI class structure.
This patch exploits the knowledge that we may be running many fewer than bitwidth iterations of the loop, and may be able to disallow the overflow case. This patch specifically implements only the shl case, but this can be generalized to ashr and lshr without difficulty.
Differential Revision: https://reviews.llvm.org/D98222
Switch to use cold threshold from profile summary for cold context merging and trimming, instead of relying on hard coded values. Minor refactoring included for switch names, etc.
Differential Revision: https://reviews.llvm.org/D98921