Canonicalize ((x + C1) & C2) --> ((x & C2) + C1) for suitable constants
C1 and C2, instead of the other way round. This should allow more
constant ADDs to be matched as part of addressing modes for loads and
stores.
Differential Revision: https://reviews.llvm.org/D130080
Don't demand low order bits from the LHS of an Add if:
- they are not demanded in the result, and
- they are known to be zero in the RHS, so they can't possibly
overflow and affect higher bit positions
This is intended to avoid a regression from a future patch to change
the order of canonicalization of ADD and AND.
Differential Revision: https://reviews.llvm.org/D130075
This removes the last use of OperandValueKind from the client side API, and (once this is fully plumbed through TTI implementation) allow use of the same properties in store costing as arithmetic costing.
This completes the client side transition to the OperandValueInfo version of this routine. Backend TTI implementations still use the prior versions for now.
OperandValueKind and OperandValueProperties both provide facts about the operands of an instruction for purposes of cost modeling. We've discussed merging them several times; before I plumb through more flags, let's go ahead and do so.
This change only adds the client side interface for getArithmeticInstrCost and makes a couple of minor changes in client code to prove that it works. Target TTI implementations still use the split flags. I'm deliberately splitting what could be one big change into a series of smaller ones so that I can lean on the compiler to catch errors along the way.
Initial implementation had too weak requirements to positive/negative
range crossings. Not crossing zero with nuw is not enough for two reasons:
- If ArLHS has negative step, it may turn from positive to negative
without crossing 0 boundary from left to right (and crossing right to
left doesn't count for unsigned);
- If ArLHS crosses SINT_MAX boundary, it still turns from positive to
negative;
In fact we require that ArLHS always stays non-negative or negative,
which an be enforced by the following set of preconditions:
- both nuw and nsw;
- positive step (looks liftable);
Because of positive step, boundary crossing is only possible from left
part to the right part. And because of no-wrap flags, it is guaranteed
to never happen.
(X op Y) op Z --> (Y op Z) op X
This isn't a complete solution (see TODO tests for possible refinements),
but it shows some nice wins and doesn't seem to cause any harm. I think
the most potential danger is from conflicting with other folds and causing
an infinite loop - that's the reason for avoiding patterns with constant
operands.
Alternatively, we could try this in the reassociate pass, but we would not
immediately see all of the logic folds that instcombine provides. I also
looked at improving ValueTracking's isImpliedCondition() (and we should
still add some enhancements there), but that would not work in general for
bitwise logic reduction.
The tests that reduce completely to 0/-1 are motivated by issue #56653.
Differential Revision: https://reviews.llvm.org/D131356
Defaults to TCK_RecipThroughput - as most explicit calls were assuming TCK_RecipThroughput (vectorizers) or was just doing a before-vs-after comparison (vectorcombiner). Calls via getInstructionCost were just dropping the CostKind, so again there should be no change at this time (as getShuffleCost and its expansions don't use CostKind yet) - but it will make it easier for us to better account for size/latency shuffle costs in inline/unroll passes in the future.
Differential Revision: https://reviews.llvm.org/D132287
SLP vectorizer tries to find the reductions starting the operands of the
instructions with no-users/void returns/etc. But such operands can be
postponable instructions, like Cmp, InsertElement or InsertValue. Such
operands still must be postponed, vectorizer should not try to vectorize
them immediately.
Differential Revision: https://reviews.llvm.org/D131965
In many cases constant buildvector results in a vector load from a
constant/data pool. Need to consider this cost too.
Differential Revision: https://reviews.llvm.org/D126885
(~A | C) | (A ^ B) --> ~(A & B) | C
https://alive2.llvm.org/ce/z/Qw3aiJ
This extends the existing fold (just above the new match)
to peek through another 'or' instruction.
This should let the motivating case from issue #57174
simplify completely.
Use `std::clamp` directly from the standard library now that LLVM is built with
C++17 standards mode.
Differential Revision: https://reviews.llvm.org/D131869
If the incoming previous value of a fixed-order recurrence is a phi in
the header, go through incoming values from the latch until we find a
non-phi value. Use this as the new Previous, all uses in the header
will be dominated by the original phi, but need to be moved after
the non-phi previous value.
At the moment, fixed-order recurrences are modeled as a chain of
first-order recurrences.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D119661
The code would use first non-phi instruction as an insertion point, however
this could lead to freeze getting inserted between phi and landingpad
causing a verifier assert.
Differential Revision: https://reviews.llvm.org/D132105
This change reorganizes the code and comments to make the expected semantics of these routines more clear. However, this is *not* an NFC change. The functional change is having isScalarWithPredication return false if the instruction does not need predicated. Specifically, for the case of a uniform memory operation we were previously considering it *not* to be a predicated instruction, but *were* considering it to be scalable with predication.
As can be seen with the test changes, this causes uniform memory ops which should have been lowered as uniform-per-parts values to instead be lowering via naive scalarization or if scalarization is infeasible (i.e. scalable vectors) aborted entirely. I also don't trust the code to bail out correctly 100% of the time, so it's possible we had a crash or miscompile from trying to scalarize something which isn't scalaralizable. I haven't found a concrete example here, but I am suspicious.
Differential Revision: https://reviews.llvm.org/D131093
* Replace getUserCost with getInstructionCost, covering all cost kinds.
* Remove getInstructionLatency, it's not implemented by any backends, and we should fold the functionality into getUserCost (now getInstructionCost) to make it easier for targets to handle the cost kinds with their existing cost callbacks.
Original Patch by @samparker (Sam Parker)
Differential Revision: https://reviews.llvm.org/D79483
NewGVN tables are not cleared out between the initial run of NewGVN and the verification. In case of phi-of-ops optimization, OpSafeForPHIOfOps goes out of sync between the two runs. One operand might not be safe for one basic block, but it might be safe for one of its successors. In this case, the operand will be added in OpSafeForPHIOfOps map. In verification phase, we reuse OpSafeForPHIOfOps without updating it again. As a result, the operand will be considered safe for phi-of-ops optimization even for the case that it is not. This patch fixes this problem.
Fix for 53807.
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D130910
In D131869 we noticed that we jump through some hoops because we parse the
tolerance option used in MisExpect.cpp into a 64-bit integer. This is
unnecessary, since the value can only be in the range [0, 100).
This patch changes the underlying type to be 32-bit from where it is
parsed in Clang through to it's use in LLVM.
Reviewed By: jloser
Differential Revision: https://reviews.llvm.org/D131935
If a function only has a few instructions, instrumentation can significantly increase the size and performance overhead of that function. Add the `-pgo-function-size-threshold` option to select a size threshold so these small functions are not instrumented.
A similar option `-fxray-instruction-threshold=<N>` is used for XRay to reduce binary size overhead [1].
[1] https://www.llvm.org/docs/XRay.html
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D131816
Currently, clang ignores the 0 initialisation in finite math
For example:
```
double f_prod = 0;
double arr[1000];
for (size_t i = 0; i < 1000; i++) {
f_prod *= arr[i];
}
```
Clang will ignore that `f_prod` is set to zero and it will generate assembly to iterate over the loop.
Reviewed By: fhahn, spatel
Differential Revision: https://reviews.llvm.org/D131672
We don't have a dominator tree in this pass, so we
can't bail out sooner by checking for unreachable
code, but this is a minimal fix for the example in
issue #56875.
Currently, we try to vectorize values, feeding into stores, only if
slp-vectorize-hor-store option is provided. We can safely enable
vectorization of the value operand of a single store in the basic block,
if the operand value is used only in store.
It should enable extra vectorization and should not increase compile
time significantly.
Fixes https://github.com/llvm/llvm-project/issues/51320
Differential Revision: https://reviews.llvm.org/D131894
This should fix these build bot errors:
Step 6 (build-check-mlir-build-only) failure: build (failure)
C:\buildbot\mlir-x64-windows-ninja\llvm-project\llvm\lib\Transforms\Scalar\EarlyCSE.cpp(124): error C2220: the following warning is treated as an error
C:\buildbot\mlir-x64-windows-ninja\llvm-project\llvm\lib\Transforms\Scalar\EarlyCSE.cpp(124): warning C4996: 'llvm::Optional<llvm::fp::ExceptionBehavior>::getValue': Use value instead.
C:\buildbot\mlir-x64-windows-ninja\llvm-project\llvm\lib\Transforms\Scalar\EarlyCSE.cpp(129): warning C4996: 'llvm::Optional<llvm::RoundingMode>::getValue': Use value instead.
C:\buildbot\mlir-x64-windows-ninja\llvm-project\llvm\lib\Transforms\Scalar\EarlyCSE.cpp(1386): warning C4996: 'llvm::Optional<llvm::fp::ExceptionBehavior>::getValue': Use value instead.
C:\buildbot\mlir-x64-windows-ninja\llvm-project\llvm\lib\Transforms\Scalar\EarlyCSE.cpp(1388): warning C4996: 'llvm::Optional<llvm::RoundingMode>::getValue': Use value instead.
Previously we would only CSE constrained FP intrinsics in the default
floating point environment. Exception behavior of "strict" is still not
allowed since we are not allowed to remove any traps in that case.
There are no restrictions on CSE across function calls inside a function.
Differential Revision: https://reviews.llvm.org/D112256
Using the legacy PM for the optimization pipeline is deprecated and in
the process of being removed. This is a small step in that direction.
For an example of migrating to the new PM:
853b57fe80
The basic patterns look like this:
https://alive2.llvm.org/ce/z/MDj9EC
The tests have a use of the overflow value too.
Otherwise, existing folds should reduce already.
This was noted as a missing IR fold in:
926e7312b2
Hopefully, this makes it easier to implement a backend
fix because we should get the same IR regardless of
whether the source used builtins or inline code.
(A | ?) | (A ^ B) --> (A | ?) | B
https://alive2.llvm.org/ce/z/dbNQw4
This extends the existing transform to peek through
another 'or' instruction for the common operand.
This is the underlying missing fold that should allow
issue #56711 and issue #57120 to reduce even more.
Allows for even more savings in the binary image while simultaneously removing the name of the offending stack variable.
Depends on D131631
Reviewed By: vitalybuka
Differential Revision: https://reviews.llvm.org/D131728
We're seeing non-determinism with loading sample profiles. It seems to
be related to the order in which we merge FunctionSamples in
promoteMergeNotInlinedContextSamples(). Use a MapVector to iterate over
NonInlinedCallSites in the order entries were inserted.
Reviewed By: wenlei, davidxl
Differential Revision: https://reviews.llvm.org/D131592
The goal is to reduce the size of the MSAN with track origins binary, by making
the variable name locations constant which will allow the linker to compress
them.
Follows: https://reviews.llvm.org/D131415
Reviewed By: vitalybuka
Differential Revision: https://reviews.llvm.org/D131631
Contextual knowledge may be used to prove invariance of some conditions.
For example, in this case:
```
; %len >= 0
guard(%iv = {start,+,1}<nuw> <s %len)
guard(%iv = {start,+,1}<nuw> <u %len)
```
the 2nd check always fails if `start` is negative and always passes otherwise.
It looks like there are more opportunities of this kind that are still to be
implemented in the future.
Differential Revision: https://reviews.llvm.org/D129753
Reviewed By: apilipenko
Closing https://github.com/llvm/llvm-project/issues/56329
The problem happens when we try to simplify the suspend points. We might
break the assumption that the final suspend lives in the last slot of
Shape.CoroSuspends. This patch tries to main the assumption and fixes
the problem.
We manage to iteratively achieve this result with no extra
uses, and the reassociate pass can also do this, but this
pattern falls through the cracks in the example from
issue #57053.
Other sanitizers (ASan, TSan, see added tests) already handle
memcpy.inline and memset.inline by not relying on InstVisitor to turn
the intrinsics into calls. Only MSan instrumentation currently does not
support them due to missing InstVisitor callbacks.
Fix it by actually making InstVisitor handle Mem*InlineInst.
While the mem*.inline intrinsics promise no calls to external functions
as an optimization, for the sanitizers we need to break this guarantee
since access into the runtime is required either way, and performance
can no longer be guaranteed. All other cases, where generating a call is
incorrect, should instead use no_sanitize.
Fixes: https://github.com/llvm/llvm-project/issues/57048
Reviewed By: vitalybuka, dvyukov
Differential Revision: https://reviews.llvm.org/D131577
This is done by calling __msan_set_alloca_origin and providing the location of the variable by using the call stack.
This is prepatory work for dropping variable names when track-origins is enabled.
Reviewed By: vitalybuka
Differential Revision: https://reviews.llvm.org/D131205
The relevant property of allocation functions of interest here is
their uniqueness (in the sense of disjoint provenance), which is
encoded by the noalias return attribute.
Differential Revision: https://reviews.llvm.org/D130225
After D121595 was commited, I noticed regressions assosicated with small trip
count numbersvectorisation by tail folding with scalable vectors. As a solution
for those issues I propose to introduce the minimal trip count threshold value.
Differential Revision: https://reviews.llvm.org/D130755
The RelLookupTableConverter pass currently only supports 64-bit
pointers. This is currently enforced using an isArch64Bit() check
on the target triple. However, we consider x32 to be a 64-bit target,
even though the pointers are 32-bit. (And independently of that
specific example, there may be address spaces with different pointer
sizes.)
As such, add an additional guard for the size of the pointers that
are actually part of the lookup table.
Differential Revision: https://reviews.llvm.org/D131399
With profile data, non-trivial LoopUnswitch will only apply on non-cold loops, as unswitching cold loops may not gain much benefit but significantly increase the code size.
Reviewed By: aeubanks, asbirlea
Differential Revision: https://reviews.llvm.org/D129599
We are seeing significant performance loss when an alloca fails to get promoted
to register. I have observed that this is due to the common type found when
attempting to rewrite partition users being unviable for promotion. While if we
would have continue looking for a type, we would have found a subtype in the
original allocated type that would have enabled promotion. Thus first check if
the initial common type found is promotion viable and if not then continue
looking instead of stopping with the initial common type found.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D128073
When EarlyCSE tries to common vector masked loads/stores, it first checks that
they have same base operand and then assumes that this is enough for mask types
to be equal. This is true for typed pointers but false for opaque ones -
two loads of different vector sizes from same base pointer '%b' are the same,
`ptr %b`. (For typed pointers, `%b` was cast to vector pointer type so bases
were different).
Change assert to return from lambda `isSubmask` so this transformation properly
works with opaque pointers.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D131251
This fixes 69 llvm tests that failed when EXPENSIVE_CHECKS was enabled.
llvm/test/Transforms/IROutliner/outlining-commutative-operands-opposite-order.ll
is one example.
When we have EXPENSIVE_CHECKS, _GLIBCXX_DEBUG is defined. This means
that libstdc++ will call the compare function to check if it is
implemented correctly (that !(a < a) is true).
This happens even if there is only one item and here, we expect
to see one return void or multiple return constant integer.
Don't sort if we have 1 item, but do assert that it is the 1
ret void we expect. In the comparator, assert that neither
Value is a nullptr in case one ended up in a the list somehow.
Reviewed By: AndrewLitteken
Differential Revision: https://reviews.llvm.org/D130230
Closing https://github.com/llvm/llvm-project/issues/56919
It is meaningless to preserve the lifetime markers for the spilled
allocas in the coroutine frames and it would block some optimizations
too.
A const reference is preferred over a non-null const pointer.
`Type *` is kept as is to match the other overload.
Reviewed By: davidxl
Differential Revision: https://reviews.llvm.org/D131197
COFF has a verifier check that private global variables don't have a comdat of the same name.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D131043
1) Overloaded (instruction-based) method is a wrapper around the current (opcode-based) method.
2) This patch also changes a few callsites (VectorCombine.cpp,
SLPVectorizer.cpp, CodeGenPrepare.cpp) to call the overloaded method.
3) This is a split of D128302.
Differential Revision: https://reviews.llvm.org/D131114
In contrast to AAPotentialValues, the constant values version can
contain implicit `undef` in the set. We had an assertion that could
misfire before. Handle it properly now.
As discussed in [0], this diff adds the `skipprofile` attribute to
prevent the function from being profiled while allowing profiled
functions to be inlined into it. The `noprofile` attribute remains
unchanged.
The `noprofile` attribute is used for functions where it is
dangerous to add instrumentation to while the `skipprofile` attribute is
used to reduce code size or performance overhead.
[0] https://discourse.llvm.org/t/why-does-the-noprofile-attribute-restrict-inlining/64108
Reviewed By: phosek
Differential Revision: https://reviews.llvm.org/D130807
BoundsChecking uses ObjectSizeOffsetEvaluator to keep track of the
underlying size/offset of pointers in allocations. However,
ObjectSizeOffsetVisitor (something ObjectSizeOffsetEvaluator
uses to check for constant sizes/offsets)
doesn't quite treat sizes and offsets the same way as
BoundsChecking. BoundsChecking wants to know the size of the
underlying allocation and the current pointer's offset within
it, but ObjectSizeOffsetVisitor only cares about the size
from the pointer to the end of the underlying allocation.
This only comes up when merging two size/offset pairs. Add a new mode to
ObjectSizeOffsetVisitor which cares about the underlying size/offset
rather than the size from the current pointer to the end of the
allocation.
Fixes a false positive with -fsanitize=bounds.
Reviewed By: vitalybuka, asbirlea
Differential Revision: https://reviews.llvm.org/D131001
This is the 2nd patch of the two-patch series (D130188, D130189) that
fix PR56275 (https://github.com/llvm/llvm-project/issues/56275) which
is a missed opportunity for loop interchange.
As follow-up on the dependence analysis (DA) patch D130188, this patch
normalizes DA results in loop interchange, such that negative dependence
vectors queried by loop interchange are reversed to be non-negative.
Now all tests in PR56275 can get interchanged. Those tests are added
in lit test as `pr56275.ll`.
Reviewed By: kawashima-fj, bmahjour, Meinersbur, #loopoptwg
Differential Revision: https://reviews.llvm.org/D130189
During LTO a local promoted to a global gets a unique suffix based on
a hash of the module IR. This means that changes in the local's module
can affect the contents in another module that imported it (because the name
of the imported promoted local is changed, but that doesn't reflect a
real change in the importing module). So any tool that's
validating changes to the importing module will see a superficial change.
Instead of using the module hash, we can use the "source_filename" if it
exists to generate a unique identifier that doesn't change due to LTO
shenanigans.
Differential Revision: https://reviews.llvm.org/D128863
This is mostly a stylistic change to make the uniform memop widening cost
code fit more naturally with the sourounding code. Its not strictly
speaking NFC as I added in the store with invariant value case, and we
could in theory have a target where a gather/scatter is cheaper than a
single load/store... but it's probably NFC in practice. Note that the
scatter/gather result can still be overriden later if the result is
uniform-by-parts.
Mark ModRefInfo as a bitmask enum, which allows using normal
& and | operators on it. This supersedes various functions like
unionModRef() and intersectModRef(). I think this makes the code
cleaner than going through helper functions...
Differential Revision: https://reviews.llvm.org/D130870
In this patch we replace common code patterns with the use of utility
functions for dealing with profiling metadata. There should be no change
in functionality, as the existing checks should be preserved in all
cases.
Reviewed By: bogner, davidxl
Differential Revision: https://reviews.llvm.org/D128860
This extends the handling of uniform memory operations to handle the case where a store is storing a loop invariant value. Unlike the general case of a store to an invariant address where we must use the last active lane, in this case we can use any lane since all lanes must produce the same result.
For context, the basic structure of the existing code and how the change fits in:
* First, we select a widening strategy. (The result is irrelevant for this patch.)
* Then we determine if a computation is uniform within all lanes of VF. (Note this is the uniform-per-part definition, not LAI's uniform across all unrolled iterations definition.)
* If it is, we overrule the widening strategy, and unconditionally scalarize.
* VPReplicationRecipe - which is what actually does the scalarization - knows how to handle unform-per-part values including for scalable vectors. However, we do need to know that the expression is safe to execute without predication - e.g. the uniform mem op was unconditional in the original loop. (This part was split off and already landed.)
An obvious question is why not simply implement the generic case? The answer is that I'm going to, but doing so without a canonicalization towards uniform causes regressions due to bad interaction with scalarization/uniformity of values feeding the uniform mem-op. This patch is needed to avoid those regressions.
Differential Revision: https://reviews.llvm.org/D130364
If we have interleave groups in the loop we want to vectorise then
we should fall back on normal vectorisation with a scalar epilogue. In
such cases when tail-folding is enabled we'll almost certainly go on to
create vplans with very high costs for all vector VFs and fall back on
VF=1 anyway. This is likely to be worse than if we'd just used an
unpredicated vector loop in the first place.
Once the vectoriser has proper support for analysing all the costs
for each combination of VF and vectorisation style, then we should
be able to remove this.
Added an extra test here:
Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll
Differential Revision: https://reviews.llvm.org/D128342
Now the API getExtendedAddReductionCost is used to determine the cost of extended Add reduction with optional Mul. For Arm, it could cover the cases. But for other target, for example: RISCV, they support other kinds of extended recution, such as FAdd.
This patch does the following changes:
1, Split getExtendedAddReductionCost into 2 new API: getExtendedReductionCost which handles the extended reduction with addtional input of Opcode; getMulAccReductionCost which handle the MLA cases the getExtendedAddReductionCost.
2, Refactor getReductionPatternCost, add some contraint condition to make sure the getMulAccReductionCost should only handle the reuction of Add + Mul.
Differential Revision: https://reviews.llvm.org/D130868
Reflect in the pointer's offset the length of the leading part
of the consumed string preceding the first converted digit.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D130912
SimplifyCFG does some common code hoisting, which is limited to hoisting a
sequence of identical instruction in identical order and stops at the first
non-identical instruction.
This patch allows hoisting instruction pairs over same-length sequences of
non-matching instructions. The linear asymptotic complexity of the algorithm
stays the same, there's an extra parameter `simplifycfg-hoist-common-skip-limit`
serving to limit compilation time and/or the size of the hoisted live ranges.
The patch improves SPECv6/525.x264_r by about 10%.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D129370
This is a follow-up to 2ebfda2417
(replace "if" with "else if" since the cases nuw/nsw
were meant to be handled separately).
Test plan:
1/ ninja check-llvm check-clang check-lld
2/ Bootstrapped LLVM/Clang pass tests
The isa<Constant> check could misfire on an instruction with 2 constant
operands. This bug was introduced with bb789381fc (D36988).
See issue #56810 for a C source example that exposed the bug.
https://alive2.llvm.org/ce/z/3jYbEH
We should choose one of these forms, and the option that uses
the narrow type allows the motivating example from issue #56294
to reduce. In the best case (no 'not' needed and 'trunc' remains),
this does remove an instruction.
Note that there is what looks like a regression because there
is an existing canonicalization that turns trunc into and+icmp.
That is a long-standing transform, and I'm not sure what effect
reversing it would have.
We already had the reasoning about uniform mem op loads; if the address is accessed at least once, we know the instruction doesn't need predicated to ensure fault safety. For stores, we do need to ensure that the values visible in memory are the same with and without predication. The easiest sub-case to check for is that all the values being stored are the same. Since we know that at least one lane is active, this tells us that the value must be visible.
Warning on confusing terminology: "uniform" vs "uniform mem op" mean two different things here, and this patch is specific to the later. It would *not* be legal to make this same change for merely "uniform" operations.
Differential Revision: https://reviews.llvm.org/D130637
This patch introduces the inline cost priority into the
module inliner, which uses the same computation as
InlineCost.
Reviewed By: kazu
Differential Revision: https://reviews.llvm.org/D130012
This patch introduces the inline cost priority into the
module inliner, which uses the same computation as
InlineCost.
Reviewed By: kazu
Differential Revision: https://reviews.llvm.org/D130012
This works with any logic + extend:
https://alive2.llvm.org/ce/z/vzsqQD
The motivating case is from issue #56294, but that's still not optimal
(it should simplify completely).
In this patch we replace common code patterns with the use of utility
functions for dealing with profiling metadata. There should be no change
in functionality, as the existing checks should be preserved in all
cases.
Reviewed By: bogner, davidxl
Differential Revision: https://reviews.llvm.org/D128860
I am playing with the LoopDataPrefetch pass and found out that it
bails to work with a pointer in a non-zero address space. This
patch adds the target callback to check if an address space is to
be considered for prefetching. Default implementation still only
allows address space 0, so this is NFCI.
This does not currently affect any known targets, but seems to be
generally useful for the future.
Differential Revision: https://reviews.llvm.org/D129795
The behaviour of this patch is not great, but it has some side-effects
that are required for OpenMPOpt to work. The problem is that when we use
`-mlink-builtin-bitcode` we only import used symbols from the runtime.
Then OpenMPOpt will insert calls to symbols that were not previously
included. This patch removed this implicit behaviour as these functions
were kept alive by the `noinline` simply because it kept calls to them
in the module. This caused regression in some tests that relied on some
OpenMPOpt passes without using LTO. Reverting for the LLVM15 release but
will try to fix it more correctly on main.
This reverts commit d61d72dae6.
Fixes#56752
Instructions between two adjacent loops will be hoisted above the first
loop, or sunk below the second to facilitate loop fusion. Hoisting will
be attempted for an instruction that dominates the first loop.
Otherwise, sinking this instructions will be attempted.
Instructions with side effects will not be considered for sinking or
hoisting. Hoisting/sinking of any instructions between loops will only
be performed if all the instructions can be moved. As well,
sinking/hoisting is considered for each instruction in isolation,
without taking into account sinking/hoisting decisions for other
instructions in the preheader.
Differential Revision: https://reviews.llvm.org/D118076
Added alloca optimization which was missed during the implemenation of D112098.
Reviewed By: vitalybuka
Differential Revision: https://reviews.llvm.org/D130503
This is an alternate to D129155 that uses TTI.haveFastSqrt() to avoid a
potential miscompile for programs with reads of errno. Moving the transform
to AggressiveInstCombine provides access to TTI.
If a sqrt call has "nnan", that implies that the input argument is never
negative because sqrt of {negative number} --> NAN.
If the argument is never negative and the call can be lowered without a
libcall, then we can assume that errno accesses are unchanged after lowering,
so the call can be translated to the LLVM intrinsic (which is expected to
become inline code).
This affects codegen for targets like x86 that have sqrt instructions, but
still have to conservatively assume that a libcall may be needed to set
errno as shown in issue #52620 and issue #56383.
This patch won't solve those examples - we will need to extend this to use
CannotBeOrderedLessThanZero or similar, enhance that analysis for new
operators, and/or deal with llvm.assume too.
Differential Revision: https://reviews.llvm.org/D129167
WinEHPrepare marks any function call from EH funclets as unreachable, if it's not a nounwind intrinsic or has no proper funclet bundle operand. This
affects ARC intrinsics on Windows, because they are lowered to regular function calls in the PreISelIntrinsicLowering pass. It caused silent binary truncations and crashes during unwinding with the GNUstep ObjC runtime: https://github.com/gnustep/libobjc2/issues/222
This patch adds a new function `llvm::IntrinsicInst::mayLowerToFunctionCall()` that aims to collect all affected intrinsic IDs.
* Clang CodeGen uses it to determine whether or not it must emit a funclet bundle operand.
* PreISelIntrinsicLowering asserts that the function returns true for all ObjC runtime calls it lowers.
* LLVM uses it to determine whether or not a funclet bundle operand must be propagated to inlined call sites.
Reviewed By: theraven
Differential Revision: https://reviews.llvm.org/D128190
Turning on opaque pointers has uncovered an issue with WPD where we currently pattern match away `assume(type.test)` in WPD so that a later LTT doesn't resolve the type test to undef and introduce an `assume(false)`. The pattern matching can fail in cases where we transform two `assume(type.test)`s into `assume(phi(type.test.1, type.test.2))`.
Currently we create `assume(type.test)` for all virtual calls that might be devirtualized. This is to support `-Wl,--lto-whole-program-visibility`.
To prevent this, all virtual calls that may not be in the same LTO module instead use a new `llvm.public.type.test` intrinsic in place of the `llvm.type.test`. Then when we know if `-Wl,--lto-whole-program-visibility` is passed or not, we can either replace all `llvm.public.type.test` with `llvm.type.test`, or replace all `llvm.public.type.test` with `true`. This prevents WPD from trying to pattern match away `assume(type.test)` for public virtual calls when failing the pattern matching will result in miscompiles.
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D128955
We previously used the `noinline` attributes to specify some defintions
which should be kept alive in the runtime. These were then stripped
immediately in the OpenMPOpt module pass. However, Since the changes in
D130298, we not explicitly state which functions will have external
visiblity in the bitcode library. Additionally the OpenMPOpt module pass
should run before the inliner pass, so this shouldn't make a difference
in whether or not the functions will be alive for the initial pass of
OpenMPOpt. This should simplify the interface, and additionally save
time spend on scanning funciton names for noinline.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D130368
In D129523, it was noted that there is are some questionable naked casts
from Instruction to BinaryOperator, which could be addressed by doing a
dyn_cast directly to BinaryOperator, avoiding the need for the later cast.
This cleans up that casting.
Reviewed By: nikic, spatel, RKSimon
Differential Revision: https://reviews.llvm.org/D130448
In D129523, it was noted that the approach to check whether a value can
have FastMathFlags was done in different ways, and they should be made
consistent. This patch makes minor changes to fix that.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D130408
If we look at a write, we should not enact the "has been written to"
logic introduced to avoid spurious write -> read dependences. Doing so
lead to elimination of stores we needed, which is obviously bad.
The name `getEntrySamples` was misleading for 2 reasons. One, it's
close in name to `Function::getEntryCount`, but the equivalent here is
`getHeadSamples`; second, as opposed to the other get* APIs in
`FunctionSamples`, it performs an estimate/heuristic rather than just
retrieving raw data (or a non-heuristic derivate off that data, like
`getMaxCountInside`)
The new name should more clearly communicate its intent; and, being
close (in name) to `getHeadSamples`, it should allow the reader discover
the relation between them.
Also updated the doc comments for both `getHeadSamples[Estimate]` so a
reader may better understand the relation between them.
Differential Revision: https://reviews.llvm.org/D130281
Reorganize the code to make it clear what is and isn't handle, and why.
Restructure bailout to remove (false and confusing) dependence on
CM_Scalarize; just return invalid cost and propagate, that's what it
is for.
The internalize pass supports an option to provide a list of symbols
that should not be internalized. THis is useful retaining certain
defintions that should be kept alive. However, this interface is
somewhat difficult to use as it requires knowing every single symbol's
name and specifying it. Many APIs provide common prefixes for the
symbols exported by the library, so it would make sense to be able to
match these using a simple glob pattern. This patch changes the handling
from a simple string comparison to a glob pattern match.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D130319
If a function is non-recursive we only performed intra-procedural
reasoning for reachability (via AA::isPotentiallyReachable). However,
if it is re-entrant that doesn't mean we can't reach. Instead of this
problematic logic in the reachability reasoning we utilize logic in
AAPointerInfo. If a location is for sure written by a function it can
be re-entrant or recursive we know only intra-procedural reasoning is
sufficient.
The existing code doesn't expect dummy values (undef, poison, null-derived
constants etc) as arguments of these intrinsics. However, they can be there
in unreached code. Currently we fail trying to find base for them.
Handle these cases separately. Return null as base for them to be consistent
with the handling in the main algorithm in findBaseDefiningValue.
Differential Revision: https://reviews.llvm.org/D129561
Reviewed By: apilipenko
If we have a dominating must-write access we do not need to know the
initial value of some object to perform reasoning about the potential
values. The dominating must-write has overwritten the initial value.
This code confuses LV's "Uniform" and LVL/LAI's "Uniform". Despite the
common name, these are different.
* LVs notion means that only the first lane *of each unrolled part* is
required. That is, lanes within a single unroll factor are considered
uniform. This allows e.g. widenable memory ops to be considered
uses of uniform computations.
* LVL and LAI's notion refers to all lanes across all unrollings.
IsUniformMem is in turn defined in terms of LAI's notion. Thus a
UniformMemOpmeans is a memory operation with a loop invariant address.
This means the same address is accessed in every iteration.
The tweaked piece of code was trying to match a uniform mem op (i.e.
fully loop invariant address), but instead checked for LV's notion of
uniformity. In theory, this meant with UF > 1, we could speculate
a load which wasn't safe to execute.
This ends up being mostly silent in current code as it is nearly
impossible to create the case where this difference is visible. The
closest I've come in the test case from 54cb87, but even then, the
incorrect result is only visible in the vplan debug output; before this
change we sink the unsafely speculated load back into the user's predicate
blocks before emitting IR. Both before and after IR are correct so the
differences aren't "interesting".
The other test changes are uninteresting. They're cases where LV's uniform
analysis is slightly weaker than SCEV isLoopInvariant.
This probably should have been part of D123089, but the effects of it
don't show up until we start removing functions from the table in
D130107. Oops.
Differential Revision: https://reviews.llvm.org/D130184
The InstCombine test is reduced from issue #56601. Without the more
liberal match for ConstantExpr, we try to rearrange constants in
Negator forever.
Alternatively, we could adjust the definition of m_ImmConstant to be
more conservative, but that's probably a larger patch, and I don't
see any downside to changing m_ConstantExpr. We never capture and
modify a ConstantExpr; transforms just want to avoid it.
Differential Revision: https://reviews.llvm.org/D130286
This patch adds the AArch64 hook for preferPredicateOverEpilogue,
which currently returns true if SVE is enabled and one of the
following conditions (non-exhaustive) is met:
1. The "sve-tail-folding" option is set to "all", or
2. The "sve-tail-folding" option is set to "all+noreductions"
and the loop does not contain reductions,
3. The "sve-tail-folding" option is set to "all+norecurrences"
and the loop has no first-order recurrences.
Currently the default option is "disabled", but this will be
changed in a later patch.
I've added new tests to show the options behave as expected here:
Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll
Differential Revision: https://reviews.llvm.org/D129560
Replace the value-accepting isReallocLikeFn() overload with a
getReallocatedOperand() function, which returns which operand is
the one being reallocated. Currently, this is always the first one,
but once allockind(realloc) is respected, the reallocated operand
will be determined by the allocptr parameter attribute.
Remove isFreeCall() in favor of getFreedOperand(). Replace the
two remaining uses with a getFreedOperand() != nullptr check, as
they only care that something is getting freed. (The usage in DSE
is correct as such. The allocator-related checks in CFLGraph look
rather questionable in general.)
Use getFreedOperand() instead of isFreeCall() to remove the
implicit assumption that any pointer operand to a free function
is the operand being freed. This won't actually matter until we
handle allockind(free).
We currently assume in a number of places that free-like functions
free their first argument. This is true for all hardcoded free-like
functions, but with the new attribute-based design, the freed
argument is supposed to be indicated by the allocptr attribute.
To make sure we handle this correctly once allockind(free) is
respected, add a getFreedOperand() helper which returns the freed
argument, rather than just indicating whether the call frees *some*
argument.
This migrates most but not all users of isFreeCall() to the new
API. The remaining users are a bit more tricky.
Reapply the patch with getObjectSize() replaced by getAllocSize().
The former will also look through calls that return their argument,
and we'll end up placing dereferenceable attributes on intrinsics
like llvm.launder.invariant.group. While this isn't wrong, it also
doesn't seem to be particularly useful. For now, use getAllocSize()
instead, which sticks closer to the original behavior of this code.
-----
This code is just interested in the allocsize, not any other
allocator properties.
We were quite conservative when it came to PHI node handling to avoid
recursive reasoning. Now we check more direct if we have seen a PHI
already or not. This allows non-recursive PHI chains to be handled.
This also exposed a bug as we did only model the effect of one loop
traversal. `phi_no_store_3` has been adapted to show how we would have
used `undef` instead of `1` before. With this patch we don't replace
it at all, which is expected as we do not argue about loop iterations
(or alignments).
If we only have exact accesses we should never require the bit-pattern
to be uniform (in this case 0). Only a non-exact access should force us
to require only 0 values.
If we are right shifting a multiply by a negated power of 2 where
the power of 2 is the same as the shift amount, we can replace with
a negate followed by an And.
New tests have not been committed yet but the patch shows the diffs.
Let me know if you want any changes or additional tests.
Differential Revision: https://reviews.llvm.org/D130103
Put AllocationFn check before I->willReturn can allow CodeGenPrepare to remove useless malloc instruction
Differential Revision: https://reviews.llvm.org/D130126
An srem or sdiv has two cases which can cause undefined behavior, not just one. The existing code did not account for this, and as a result, we miscompiled when we encountered e.g. a srem i64 %v, -1 in a conditional block.
Instead of hand rolling the logic, just use the utility function which exists exactly for this purpose.
Differential Revision: https://reviews.llvm.org/D130106
When F calls G calls H, G is nounwind, and G is inlined into F, then the
inlined call-site to H should be effectively nounwind so as not to lose
information during inlining.
If H itself is nounwind (which often happens when H is an intrinsic), we
no longer mark the callsite explicitly as nounwind. Previously, there
were cases where the inlined call-site of H differs from a pre-existing
call-site of H in F *only* in the explicitly added nounwind attribute,
thus preventing common subexpression elimination.
v2:
- just check CI->doesNotThrow
v3 (resubmit after revert at 3443788087):
- update Clang tests
Differential Revision: https://reviews.llvm.org/D129860
This patch introduces some initial def-use verification. This catches
cases like the one fixed by D129436.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D129717
For the longest time we used `AAValueSimplify` and
`genericValueTraversal` to determine "potential values". This was
problematic for many reasons:
- We recomputed the result a lot as there was no caching for the 9
locations calling `genericValueTraversal`.
- We added the idea of "intra" vs. "inter" procedural simplification
only as an afterthought. `genericValueTraversal` did offer an option
but `AAValueSimplify` did not. Thus, we might end up with "too much"
simplification in certain situations and then gave up on it.
- Because `genericValueTraversal` was not a real `AA` we ended up with
problems like the infinite recursion bug (#54981) as well as code
duplication.
This patch introduces `AAPotentialValues` and replaces the
`AAValueSimplify` uses with it. `genericValueTraversal` is folded into
`AAPotentialValues` as are the instruction simplifications performed in
`AAValueSimplify` before. We further distinguish "intra" and "inter"
procedural simplification now.
`AAValueSimplify` was not deleted as we haven't ported the
re-materialization of instructions yet. There are other differences over
the former handling, e.g., we may not fold trivially foldable
instructions right now, e.g., `add i32 1, 1` is not folded to `i32 2`
but if an operand would be simplified to `i32 1` we would fold it still.
We are also even more aware of function/SCC boundaries in CGSCC passes,
which is good even if some tests look like they regress.
Fixes: https://github.com/llvm/llvm-project/issues/54981
Note: A previous version was flawed and consequently reverted in
6555558a80.
powi() is not a standard math library function; it is specified
with non-strict semantics in the LangRef. We currently require
'afn' to do this transform when it needs a sqrt(), so I just
extended that requirement to the whole-number exponent too.
This bug was introduced with:
b17754bcaa
...where we deferred expansion of pow() to later passes.
Use the FreeBSD AArch64 memory layout values when building for it.
These are based on the x86_64 values, scaled to take into account the
larger address space on AArch64.
Reviewed by: vitalybuka
Differential Revision: https://reviews.llvm.org/D125883
AArch64 has a larger address space than 64 but x86. Use the larger
shadow offset on FreeBSD AArch64.
Reviewed by: vitalybuka
Differential Revision: https://reviews.llvm.org/D125873
We currently assert in vectorizeTree(TreeEntry*) when processing a PHI
bundle in a block containing a catchswitch. We attempt to set the
IRBuilder insertion point following the catchswitch, which is invalid.
This is done so that ShuffleBuilder.finalize() knows where to insert
a shuffle if one is needed.
To avoid this occurring, watch out for catchswitch blocks during
buildTree_rec() processing, and avoid adding PHIs in such blocks to
the vectorizable tree. It is unlikely that constraining vectorization
over an exception path will cause a noticeable performance loss, so
this seems preferable to trying to anticipate when a shuffle will and
will not be required.
Since D129288, callbr is allowed to have duplicate successors. This
patch removes a limitation which prevents optimizations from actually
producing such callbrs.
Differential Revision: https://reviews.llvm.org/D129997
One of the transforms in LoopSimplifyCFG demands that the LCSSA form is
truly maintained for all values, tokens included, otherwise it may end up creating
a use that is not dominated by def (and Phi creation for tokens is impossible).
Detect this situation and prevent transform for it early.
Differential Revision: https://reviews.llvm.org/D129984
Reviewed By: efriedma
The flag `-fcs-profile-generate` for enabling CSIRPGO moves the pass
`pgo-instrumentation` after inlining. Function entry coverage works fine
with this change, so remove the assert. I had originally left this
assert in because I had not tested this at the time.
Reviewed By: davidxl, MaskRay
Differential Revision: https://reviews.llvm.org/D129407
When F calls G calls H, G is nounwind, and G is inlined into F, then the
inlined call-site to H should be effectively nounwind so as not to lose
information during inlining.
If H itself is nounwind (which often happens when H is an intrinsic), we
no longer mark the callsite explicitly as nounwind. Previously, there
were cases where the inlined call-site of H differs from a pre-existing
call-site of H in F *only* in the explicitly added nounwind attribute,
thus preventing common subexpression elimination.
v2:
- just check CI->doesNotThrow
Differential Revision: https://reviews.llvm.org/D129860
If DISubpogram was not cloned (e.g. we are cloning a function that has other
functions inlined into it, and subprograms of the inlined functions are
not supposed to be cloned), it doesn't make sense to clone its DILexicalBlocks
as well. Otherwise we'll get duplicated DILexicalBlocks that may confuse
debug info emission in AsmPrinter.
I believe it also makes no sense cloning any DILocalVariables or maybe
other local entities, if their parent subprogram was not cloned, cause
they will be dangling and will not participate in futher emission.
Reviewed By: aprantl
Differential Revision: https://reviews.llvm.org/D127102
As callbr is now allowed to have duplicate destinations, we can
have a callbr with a unique successor. Make sure it doesn't get
dropped, as we still need to preserve the side-effect.
We only need to check the block for the normal/default destination,
not for other destinations. Using the value in those would be
illegal anyway.
The callbr case cannot actually happen here, because callbr is
currently limited to inline asm. Retaining it to match the spirit
of the original code.
At the moment, the VPPRedInstPHIRecipe is not used in subsequent uses of
the predicate recipe. This incorrectly models the def-use chains, as all
later uses should use the phi recipe. Fix that by delaying recording of
the recipe.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D129436
Callbr is no longer an indirect terminator in the sense that is
relevant here (that it's successors cannot be updated). The primary
effect of this change is that callbr no longer prevents formation
of loop simplify form.
I decided to drop the isIndirectTerminator() method entirely and
replace it with isa<IndirectBrInst>() checks. I assume this method
was added to abstract over indirectbr and callbr, but it never
really caught on, and there is nothing left to abstract anymore
at this point.
Differential Revision: https://reviews.llvm.org/D129849
At the moment, the cost of runtime checks for scalable vectors is
overestimated due to creating separate vscale * VF expressions for each
check. Instead re-use the first expression.
The backend getShuffleCosts do not currently handle shuffles that change
size very well. Limit the shuffles we collect to the same type to make
sure they do not cause issues as reported in D128732.
This patch reports number of counts being dropped when a hash-mismatch
happens. This information will be helpful to the users -- if the dropped
counts are large, the user should redo the instrumentation build and
recollect the profile.
Differential Revision: https://reviews.llvm.org/D129001
This patch improves FDO hash-mismatch handling:
(1) filter out warnings to weak functions.
Weak functions definition will be overridden by a strong definition by linker.
The hash mismatch in profile use compilation is expected.
Make the profile hash mismatch warning under the existing option (default true).
(2) add an option to trace the hash of functions with the specific string.
Note that an empty string parameter will trace all functions.
Differential Revision: https://reviews.llvm.org/D129002
Changes since initial commit:
* Wrapping a pointer in an SCEV unknown hides the base, and SCEV is only able to compute a subtraction when the bases are known to be equal. This results in a SCEVCouldNotCompute flowing forward and triggering asserts. Test case added in d767b392.
* isLoopInvariant returns true for instructions outside the loop, but not necessarily *above* the loop. Since this code is allowed to visit uses of an IV outside of a loop, we have to make sure the operands of the compare are both invariant and dominating the header. Test case added in 2aed3cdb.
Original commit message follows...
The ICmpZero matching is checking to see if the expression is loop invariant per SCEV and expandable. This allows expressions inside the loop which can be made loop invariant to be seamlessly expanded, but is overly conservative for expressions which already *are* loop invariant.
As a simple justification for why this is correct, consider a loop invariant urem as RHS vs an alternate function with that same urem wrapped inside a helper call. Why would it be legal to match the later, but not the former?
Differential Revision: https://reviews.llvm.org/D129793
Compiling with '-ffast-math' tuns on all the FastMathFlags (FMF), as
expected, and that enables FP reassociation. Only the two FMF flags
'reassoc' and 'nsz' are technically required to perform reassociation,
but disabling other unrelated FMF bits is needlessly suppressing the
optimization.
This patch fixes that needless suppression, and makes appropriate
adjustments to test-cases, fixing some outstanding TODOs in the process.
Fixes: #56483
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D129523
For scalable vectors, it is not sufficient to only check
MinProfitableTripCount if it is >= VF.getKnownMinValue() * UF, because
this property may not holder for larger values of vscale. In those
cases, compute umax(VF * UF, MinProfTC) instead.
This should fix
https://lab.llvm.org/buildbot/#/builders/197/builds/2262
The ICmpZero matching is checking to see if the expression is loop invariant per SCEV and expandable. This allows expressions inside the loop which can be made loop invariant to be seamlessly expanded, but is overly conservative for expressions which already *are* loop invariant.
As a simple justification for why this is correct, consider a loop invariant urem as RHS vs an alternate function with that same urem wrapped inside a helper call. Why would it be legal to match the later, but not the former?
Differential Revision: https://reviews.llvm.org/D129793
When folding a binop into a select, we need to ensure that one
of the select arms actually does constant fold, otherwise we'll
create two binop instructions and perform the reverse transform.
Ensure this by performing an explicit constant folding attempt,
and failing the transform if neither side simplifies.
A simple alternative here would have been to limit the fold to
ImmConstants, but given the current representation of scalable
vector splats, this wouldn't be ideal.
This is a followup to D129630, which switches LSR to the member
isSafeToExpand() variant, and removes the freestanding function.
This is done by creating the SCEVExpander early (already during the
analysis phase). Because the SCEVExpander is now available for the
whole lifetime of LSRInstance, I've also made it into a member
variable, rather than passing it around in even more places.
Differential Revision: https://reviews.llvm.org/D129769
I happened to notice a two places where the enum was being pass
directly to the bool IsSigned argument of createExtendInst. This
was functionally ok since SignExtended in the enum has value
of 1, but the code shouldn't rely on that.
Using an enum class prevents the enum from being convertible to bool,
but does make writing the enum values more verbose. Since we now
have to write ExtendKind:: in front of them, I've shortened the
names of ZeroExtended and SignExtended.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D129733
Motivation here is to unblock LSRs ability to use ICmpZero uses - the major effect of which is to enable count down IVs. The test changes reflect this goal, but the potential impact is much broader since this isn't a change in LSR at all.
SCEVExpander needs(*) to prove that expanding the expression is safe anywhere the SCEV expression is valid. In general, we can't expand any node which might fault (or exhibit UB) unless we can either a) prove it won't fault, or b) guard the faulting case. We'd been allowing non-zero constants here; this change extends it to non-zero values.
vscale is never zero. This is already implemented in ValueTracking, and this change just adds the same logic in SCEV's range computation (which in turn drives isKnownNonZero). We should common up some logic here, but let's do that in separate changes.
(*) As an aside, "needs" is such an interesting word here. First, we don't actually need to guard this at all; we could choose to emit a select for the RHS of ever udiv and remove this code entirely. Secondly, the property being checked here is way too strong. What the client actually needs is to expand the SCEV at some particular point in some particular loop. In the examples, the original urem dominates that loop and yet we completely ignore that information when analyzing legality. I don't plan to actively pursue either direction, just noting it for future reference.
Differential Revision: https://reviews.llvm.org/D129710