At the moment, LoopAccessAnalysis is a loop analysis for the new pass
manager. The issue with that is that LAI caches SCEV expressions and
modifications in a loop may impact SCEV expressions in other loops, but
we do not have a convenient way to invalidate LAI for other loops
withing a loop pipeline.
To avoid this issue, turn it into a function analysis which returns a
manager object that keeps track of the individual LAI objects per loop.
Fixes#50940.
Fixes#51669.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D134606
I'm not sure how to test this because we seem to constant-fold
all examples already. We changed this code to use the common
isNonNegative() helper, so it should not be necessary to avoid
a constant. This makes the code uniform for all transforms.
Collect more statistics for scalar promotion. In particular,
keep track of how many promotion candidates there were, and
whether it is a load or a load/store promotion.
For noop store of the form of LoadI and StoreI,
An invariant should be kept is that the memory state of the related
MemoryLoc before LoadI is the same as before StoreI.
For this example:
```
define void @pr49927(i32* %q, i32* %p) {
%v = load i32, i32* %p, align 4
store i32 %v, i32* %q, align 4
store i32 %v, i32* %p, align 4
ret void
}
```
Here the definition of the store's destination is different with the
definition of the load's destination, which it seems that the
invariant mentioned above is broken. But the definition of the
store's destination would write a value that is LoadI, actually, the
invariant is still kept. So we can safely ignore it.
Fixes https://github.com/llvm/llvm-project/issues/49271
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D132657
Interestingly, MathExtras.h doesn't use <cmath> declaration, so move it out of
that header and include it when needed.
No functional change intended, but there's no longer a transitive include
fromMathExtras.h to cmath.
This is purely NFC restructure in advance of a change which actually exposes zero strides. This is mostly because I find this interface confusing each time I look at it.
Using these queries with a context instruction and without a cache
seems to be about 2x slower than with it so this theoretically
improves compile time.
During structurization process, we may place non-predecessor blocks
between the predecessors of a block in the structurized CFG. Take
the typical while-break case as an example:
```
/---A(v=...)
| / \
^ B C
| \ /|
\---L |
\ /
E (r = phi (v:C)...)
```
After structurization, the CFG would be look like:
```
/---A
| |\
| | C
| |/
| F1
^ |\
| | B
| |/
| F2
| |\
| | L
\ |/
\--F3
|
E
```
We can see that block B is placed between the predecessors(C/L) of E.
During phi reconstruction, to achieve the same sematics as before, we
are reconstructing the PHIs as:
F1: v1 = phi (v:C), (undef:A)
F3: r = phi (v1:F2), ...
But this is also saying that `v1` would be live through B, which is not
quite necessary. The idea in the change is to say the incoming value
from B is Undef for the PHI in E. With this change, the reconstructed
PHI would be:
F1: v1 = phi (v:C), (undef:A)
F2: v2 = phi (v1:F1), (undef:B)
F3: r = phi (v2:F2), ...
Reviewed by: sameerds
Differential Revision: https://reviews.llvm.org/D132450
The instruction simplification will try to simplify the affected phis.
In some cases, this might extend the liveness of values. For example:
BB0:
| \
| BB1
| /
BB2:phi (BB0, v), (BB1, undef)
The phi in BB2 will be simplified to v as v dominates BB2, but this is
increasing the number of active values in BB1. By setting CanUseUndef
to false, we will not simplify the phi in this way, this would help
register pressure. This is mandatory for the later change to help
reducing VGPR pressure for AMDGPU.
Reviewed by: foad, sameerds
Differential Revision: https://reviews.llvm.org/D132449
LoopDeletion may hoist instructions out of a loop using
makeLoopInvariant without invalidating the SCEV for the moved
instruction.
Moving the instruction to a different block may change its
cached block disposition, so invalidate the cached info.
Fixes#57837.
This is a bugfix patch that resolves the following two bugs in loop interchange:
1. PR57148 which is an assertion error due to of loss of LCSSA form after interchange,
as referred to test1() in pr57148.ll.
2. Use before def for the outermost loop induction variables after interchange,
as referred to test2() in pr57148.ll.
The fix in this patch is that:
1. In cases where the LCSSA form is not maintained after interchange, we update the IR
to the LCSSA form again.
2. We split the phi nodes in the inner loop header into a separate basic block to avoid
the situation where use of the outer indvar appears before its def after interchange.
Previously we already did this for innermost loops, now we do it for non-innermost
loops (e.g., middle loops) as well.
Reviewed By: bmahjour, Meinersbur, #loopoptwg
Differential Revision: https://reviews.llvm.org/D132055
This patch is to resolve the bug reported and discussed in
https://reviews.llvm.org/D124926#3718761 and https://reviews.llvm.org/D124926#3719876.
The problem is that loop interchange is a loopnest pass under the new pass manager,
but the loop nest may not be constructed correctly by the loop pass manager after
running loop interchange and before running the next pass, which might cause problems
when it continues running the next pass.
The reason that the loop nest is constructed incorrectly is that the outermost
loop might have changed after interchange, and what was the original outermost
loop is not the current outermost loop anymore. Constructing the loop nest based
on the original outermost loop would generate an invalid loop nest.
The fix in this patch is that, in the loop pass manager before running each loopnest
pass, we re-cosntruct the loop nest based on the current outermost loop, if LPMUpdater
notifies the loop pass manager that the previous loop nest has been invalidated by passes
like loop interchange.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D132199
Commit de3445e0ef (https://reviews.llvm.org/D132096) made
changes to isVectorPromotionViable basically doing
// Create Vector with size of V, and each element of type Ty
...
uint64_t ElementSize = DL.getTypeStoreSizeInBits(Ty).getFixedSize();
uint64_t VectorSize = DL.getTypeSizeInBits(V).getFixedSize();
...
VectorType *VTy = VectorType::get(Ty, VectorSize / ElementSize, false);
Not quite sure why it uses the TypeStoreSize for the ElementSize,
but the new vector would only match in size with the old vector in
situations when the TypeStoreSize equals the TypeSize for Ty.
Therefore this patch adds a typeSizeEqualsStoreSize check as yet
another condition for allowing the the new type as a promotion
candidate.
Without this fix the new @test15 test would fail with an assert
like this:
opt: ../lib/Transforms/Scalar/SROA.cpp:1966:
auto isVectorPromotionViable(llvm::sroa::Partition &,
const llvm::DataLayout &)
::(anonymous class)::operator()(llvm::VectorType *,
llvm::VectorType *) const:
Assertion `DL.getTypeSizeInBits(RHSTy).getFixedSize() ==
DL.getTypeSizeInBits(LHSTy).getFixedSize() &&
"Cannot have vector types of different sizes!"' failed.
...
#8 isVectorPromotionViable(...)::$_10::operator()...
#9 llvm::SROAPass::rewritePartition(...)
#10 llvm::SROAPass::splitAlloca(...)
#11 llvm::SROAPass::runOnAlloca(...)
#12 llvm::SROAPass::runImpl(...)
#13 llvm::SROAPass::run(...)
Reviewed By: MatzeB
Differential Revision: https://reviews.llvm.org/D134032
The type information of the store values can diverge when checking for valid
mask store candidates to eliminate via DSE. This patch checks for equivalence
wrt to size and element count.
Reviewed By: fhahn, rui.zhang
Differential Revision: https://reviews.llvm.org/D132700
When the IV is only used by the terminating condition (say IV-A) and the loop
has a predictable back-edge count and we have another IV (say IV-B) that is an
affine add recursion, we will be able to calculate the terminating value of
IV-B in the loop pre-header. This patch adds attempts to replace IV-B as the
new terminating condition and remove IV-A. It is safe to do so since IV-A is
only used as the terminating condition.
This transformation is suitable to be appended after LSR as it may optimize the
loop into the situation mentioned above. The transformation can reduce number of
IV-s in the loop by one.
A cli option `lsr-term-fold` is added and default disabled.
Reviewed By: mcberg2021, craig.topper
Differential Revision: https://reviews.llvm.org/D132443
This bug was found by recent improvement in SCEV verifier. The code in LoopFuse
directly reassigns blocks to be a part of a different loop, which should automatically
invalidate all related cached loop dispositions.
Differential Revision: https://reviews.llvm.org/D134173
Reviewed By: nikic
SimplifyCFG folds
bool foo() {
if (cond1) return false;
if (cond2) return false;
return true;
}
as
bool foo() {
if (cond1 | cond2) return false
return true;
}
'cond2' is called 'bonus insts' in branch folding since they introduce overhead
since the original CFG could do early exit but the folded CFG always executes
them. SimplifyCFG calculates the costs of 'bonus insts' of a folding a BB into
its predecessor BB which shares the destination. If it is below bonus-inst-threshold,
SimplifyCFG will fold that BB into its predecessor and cond2 will always be executed.
When SimplifyCFG calculates the cost of 'bonus insts', it only consider 'bonus' insts
in the current BB to be considered for folding. This causes issue for unrolled loops
which share destinations, e.g.
bool foo(int *a) {
for (int i = 0; i < 32; i++)
if (a[i] > 0) return false;
return true;
}
After unrolling, it becomes
bool foo(int *a) {
if(a[0]>0) return false
if(a[1]>0) return false;
//...
if(a[31]>0) return false;
return true;
}
SimplifyCFG will merge each BB with its predecessor BB,
and ends up with 32 'bonus insts' which are always executed, which
is much slower than the original CFG.
The root cause is that SimplifyCFG does not consider the
accumulated cost of 'bonus insts' which are folded from
different BB's.
This patch fixes that by introducing a ValueMap to track
costs of 'bonus insts' coming from different BB's into
the same BB, and cuts off if the accumulated cost
exceeds a threshold.
Reviewed by: Artem Belevich, Florian Hahn, Nikita Popov, Matt Arsenault
Differential Revision: https://reviews.llvm.org/D132408
f213128b29 didn't account for the possibility that the result of
decompose may be empty. Fix that by explicitly checking. Use a newly
introduced helper to also reduce some duplication.
Thanks @bjope for finding the issue!
This is similar to the existing signed instruction folds.
We get the obvious minimal patterns in other passes, but
this avoids potential missed folds when the multi-block
tests are converted to selects.
Keep track if variables are known positive during constraint
decomposition, aggregate the information when building the constraint
object and encode the extra information as constraints to be used during
reasoning.
Currently, FunctionModRefBehavior tracks whether the function reads
or writes memory (ModRefInfo) and which locations it can access
(argmem, inaccessiblemem and other). This patch changes it to track
ModRef information per-location instead.
To give two examples of why this is useful:
* D117095 highlights a weakness of ModRef modelling in the presence
of operand bundles. For a memcpy call with deopt operand bundle,
we want to say that it can read any memory, but only write argument
memory. This would allow them to be treated like any other calls.
However, we currently can't express this and have to say that it
can read or write any memory.
* D127383 would ideally be modelled as a separate threadid location,
where threadid Refs outside pre-split coroutines can be ignored
(like other accesses to constant memory). The current representation
does not allow modelling this precisely.
The patch as implemented is intended to be NFC, but there are some
obvious opportunities for improvements and simplification. To fully
capitalize on this we would also want to change the way we represent
memory attributes on functions, but that's a larger change, and I
think it makes sense to separate out the FunctionModRefBehavior
refactoring.
Differential Revision: https://reviews.llvm.org/D130896
Instead of checking if any of the new indices has a non-zero coefficient
before using the constraint, do this directly when constructing the
constraint.
This patch adds additional vector types to be considered when doing
promotion in SROA, based on the types of the store and load slices. This
provides more promotion opportunities, by potentially using an optimal
"intermediate" vector type.
For example, the following code would currently not be promoted to a
vector, since `__m128i` is a `<2 x i64>` vector.
```
__m128i packfoo0(int a, int b, int c, int d) {
int r[4] = {a, b, c, d};
__m128i rm;
std::memcpy(&rm, r, sizeof(rm));
return rm;
}
```
```
packfoo0(int, int, int, int):
mov dword ptr [rsp - 24], edi
mov dword ptr [rsp - 20], esi
mov dword ptr [rsp - 16], edx
mov dword ptr [rsp - 12], ecx
movaps xmm0, xmmword ptr [rsp - 24]
ret
```
By also considering the types of the elements, we could find that the
`<4 x i32>` type would be valid for promotion, hence removing the memory
accesses for this function. In other words, we can explore other new
vector types, with the same size but different element types based on
the load and store instructions from the Slices, which can provide us
more promotion opportunities.
Additionally, the step for removing duplicate elements from the
`CandidateTys` vector was not using an equality comparator, which has
been fixed.
Differential Revision: https://reviews.llvm.org/D132096
As shown in the examples in issue #57683, we allow matching
vectors with poison (undef) in this transform (and possibly more),
but we can't then use the partially defined value as a replacement
value in other expressions blindly.
This seems to be avoided in simpler examples of reassociation,
and other passes should be able to clean up the redundant op
seen in these tests.
The previous implementation of time tracing in NewPassManager is direct but messive.
The key codes are like the demo below:
```
/// Runs the function pass across every function in the module.
PreservedAnalyses run(LazyCallGraph::SCC &C, CGSCCAnalysisManager &AM,
LazyCallGraph &CG, CGSCCUpdateResult &UR) {
/// ...
PreservedAnalyses PassPA;
{
TimeTraceScope TimeScope(Pass.name());
PassPA = Pass.run(F, FAM);
}
/// ...
}
```
It can be bothered to judge where should we add the tracing codes by hands.
With the PassInstrumentation framework, we can easily add `Before/After` callback
functions to add time tracing codes.
Differential Revision: https://reviews.llvm.org/D131960
If there are non-load/store users of the promoted pointer, we
currently abort promotion. However, having such users isn't really
relevant to the transform. We already separately check that a)
there are no instructions that modref the promoted pointer and
b) that a pointer capture disables store promotion.
In the affected @test_captured_in_loop test case we have a readnone
capture of the promoted pointer, which means that load promotion
can be performed (while store promotion cannot).
Differential Revision: https://reviews.llvm.org/D133485
The original commit ( fe1f3cfc26 ) was reverted because it could
crash / assert when trying to fold a value that was replaced
by a constant. In that case, there might not be an entry for the
constant in the solver yet.
This version adds a check for that possibility along with tests to
exercise that pattern (they used to crash).
Original commit message:
This extends the transform added with D81756 to handle div/rem opcodes.
For example:
https://alive2.llvm.org/ce/z/cX6za6
This replicates part of what CVP already does, but the motivating example
from issue #57472 demonstrates a phase ordering problem - we convert
branches to select before CVP runs and miss the transform.
Differential Revision: https://reviews.llvm.org/D133198