SI is the terminator of BB, so the edge we are adding obviously already existed.
Indeed, this change does not have a test coverage change.
This failure has been exposed in an existing test coverage
by a follow-up patch that switches to lazy domtreeupdater mode,
and removes domtree verification from
SimplifyCFGOpt::simplifyOnce()/SimplifyCFGOpt::run(),
IOW it does not appear feasible to add dedicated test coverage here.
Currently SimplifyCFG drops the debug locations of 'bonus' instructions.
Such instructions are moved before the first branch. The reason for the
current behavior is that this could lead to surprising debug stepping,
if the block that's folded is dead.
In case the first branch and the instructions to be folded have the same
debug location, this shouldn't be an issue and we can keep the debug
location.
Reviewed By: vsk
Differential Revision: https://reviews.llvm.org/D93662
I have added it in d15d81c because it *seemed* correct, was holding
for all the tests so far, and was validating the fix added in the same
commit, but as David Major is pointing out (with a reproducer),
the assertion isn't really correct after all. So remove it.
Note that the d15d81c still fine.
If the predecessor is a switch, and BB is not the default destination,
multiple cases could have the same destination. and it doesn't
make sense to re-process the predecessor, because we won't make any changes,
once is enough.
I'm not sure this can be really tested, other than via the assertion
being added here, which fires without the fix.
One would hope that it would have been already canonicalized into an
unconditional branch, but that isn't really guaranteed to happen
with SimplifyCFG's visitation order.
... which requires not removing a DomTree edge if the switch's default
still points at that destination, because it can't be removed;
... and not processing the same predecessor more than once.
... which requires not deleting an edge that just got deleted,
because we could be dealing with a block that didn't go through
ConstantFoldTerminator() yet, and thus has a degenerate cond br
with matching true/false destinations.
Notably, this doesn't switch *every* case, remaining cases
don't actually pass sanity checks in non-permissve mode,
and therefore require further analysis.
Note that SimplifyCFG still defaults to not preserving DomTree by default,
so this is effectively a NFC change.
This is NFC since SimplifyCFG still currently defaults to not preserving DomTree.
SimplifyCFGOpt::simplifyOnce() is only be called from SimplifyCFGOpt::run(),
and can not be called externally, since SimplifyCFGOpt is defined in .cpp
This avoids some needless verifications, and is thus a bit faster
without sacrificing precision.
We only need to remove non-TrueBB/non-FalseBB successors,
and we only need to do that once. We don't need to insert
any new edges, because no new successors will be added.
There is a number of transforms in SimplifyCFG that take DomTree out of
DomTreeUpdater, and do updates manually. Until they are fixed,
user passes are unable to claim that PDT is preserved.
Note that the default for SimplifyCFG is still not to preserve DomTree,
so this is still effectively NFC.
This pretty much concludes patch series for updating SimplifyCFG
to preserve DomTree. All 318 dedicated `-simplifycfg` tests now pass
with `-simplifycfg-require-and-preserve-domtree=1`.
There are a few leftovers that apparently don't have good test coverage.
I do not yet know what gaps in test coverage will the wider-scale testing
reveal, but the default flip might be close.
We might be dealing with an unreachable code,
so the bonus instruction we clone might be self-referencing.
There is a sanity check that all uses of bonus instructions
that are not in the original block with said bonus instructions
are PHI nodes, and that is obviously not the case
for self-referencing instructions..
So if we find such an use, just rewrite it.
Thanks to Mikael Holmén for the reproducer!
Fixes https://bugs.llvm.org/show_bug.cgi?id=48450#c8
... so just ensure that we pass DomTreeUpdater it into it.
Apparently, there were no dedicated tests just for that functionality,
so i'm adding one here.
And that exposes that a number of tests don't *actually* manage to
maintain DomTree validity, which is inline with my observations.
Once again, SimlifyCFG pass currently does not require/preserve DomTree
by default, so this is effectively NFC.
Pretty boring, removeUnwindEdge() already known how to update DomTree,
so if we are to call it, we must first flush our own pending updates;
otherwise, we just stop predecessors from branching to us,
and for certain predecessors, stop their predecessors from
branching to them also.
... so just ensure that we pass DomTreeUpdater it into it.
Fixes DomTree preservation for a number of tests,
all of which are marked as such so that they do not regress.
... so just ensure that we pass DomTreeUpdater it into it.
Fixes DomTree preservation for a large number of tests,
all of which are marked as such so that they do not regress.
When folding a branch to a common destination, preserve !annotation on
the created instruction, if the terminator of the BB that is going to be
removed has !annotation. This should ensure that !annotation is attached
to the instructions that 'replace' the original terminator.
Reviewed By: jdoerfert, lebedev.ri
Differential Revision: https://reviews.llvm.org/D93410
... so just ensure that we pass DomTreeUpdater it into it.
Fixes DomTree preservation for a large number of tests,
all of which are marked as such so that they do not regress.
... so just ensure that we pass DomTreeUpdater it into it.
Fixes DomTree preservation for a large number of tests,
all of which are marked as such so that they do not regress.
Two observations:
1. Unavailability of DomTree makes it impossible to make
`FoldBranchToCommonDest()` transform in certain cases,
where the successor is dominated by predecessor,
because we then don't have PHI's, and can't recreate them,
well, without handrolling 'is dominated by' check,
which doesn't really look like a great solution to me.
2. Avoiding invalidating DomTree in SimplifyCFG will
decrease the number of `Dominator Tree Construction` by 5
(from 28 now, i.e. -18%) in `-O3` old-pm pipeline
(as per `llvm/test/Other/opt-O3-pipeline.ll`)
This might or might not be beneficial for compile time.
So the plan is to make SimplifyCFG preserve DomTree, and then
eventually make DomTree fully required and preserved by the pass.
Now, SimplifyCFG is ~7KLOC. I don't think it will be nice
to do all this uplifting in a single mega-commit,
nor would it be possible to review it in any meaningful way.
But, i believe, it should be possible to do this in smaller steps,
introducing the new behavior, in an optional way, off-by-default,
opt-in option, and gradually fixing transforms one-by-one
and adding the flag to appropriate test coverage.
Then, eventually, the default should be flipped,
and eventually^2 the flag removed.
And that is what is happening here - when the new off-by-default option
is specified, DomTree is required and is claimed to be preserved,
and SimplifyCFG-internal assertions verify that the DomTree is still OK.
Even though d38205144f was mostly a correct
fix for the external non-PHI users, it's not a *generally* correct fix,
because the 'placeholder' values in those trivial PHI's we create
shouldn't be *always* 'undef', but the PHI itself for the backedges,
else we end up with wrong value, as the `@pr48450_2` test shows.
But we can't just do that, because we can't check that the PHI
can be it's own incoming value when coming from certain predecessor,
because we don't have a dominator tree.
So until we can address this correctness problem properly,
ensure that we don't perform the transformation
if there are such problematic external uses.
Making dominator tree available there is going to be involved,
since `-simplifycfg` pass currently does not preserve/update domtree...
In particular, if the successor block, which is about to get a new
predecessor block, currently only has a single predecessor,
then the bonus instructions will be directly used within said successor,
which is fine, since the block with bonus instructions dominates that
successor. But once there's a new predecessor, the IR is no longer valid,
and we don't fix it, because we only update PHI nodes.
Which means, the live-out bonus instructions must be exclusively used
by the PHI nodes in successor blocks. So we have to form trivial PHI nodes.
which will then be successfully updated to recieve cloned bonus instns.
This all works fine, except for the fact that we don't have access to
the dominator tree, and we don't ignore unreachable code,
so we sometimes do end up having to deal with some weird IR.
Fixes https://bugs.llvm.org/show_bug.cgi?id=48450
There is no correctness need for that, and since we allow live-out
uses, this could theoretically happen, because currently nothing
will move the cond to right before the branch in those tests.
But regardless, lifting that restriction even makes the transform
easier to understand.
This makes the transform happen in 81 more cases (+0.55%)
)
This was orginally committed in 2245fb8aaa.
but was immediately reverted in f3abd54958
because of a PHI handling issue.
Original commit message:
1. It doesn't make sense to enforce that the bonus instruction
is only used once in it's basic block. What matters is
whether those user instructions fit within our budget, sure,
but that is another question.
2. It doesn't make sense to enforce that said bonus instructions
are only used within their basic block. Perhaps the branch
condition isn't using the value computed by said bonus instruction,
and said bonus instruction is simply being calculated
to be used in successors?
So iff we can clone bonus instructions, to lift these restrictions,
we just need to carefully update their external uses
to use the new cloned instructions.
Notably, this transform (even without this change) appears to be
poison-unsafe as per alive2, but is otherwise (including the patch) legal.
We don't introduce any new PHI nodes, but only "move" the instructions
around, i'm not really seeing much potential for extra cost modelling
for the transform, especially since now we allow at most one such
bonus instruction by default.
This causes the fold to fire +11.4% more (13216 -> 14725)
as of vanilla llvm test-suite + RawSpeed.
The motivational pattern is IEEE-754-2008 Binary16->Binary32
extension code:
ca57d77fb2/src/librawspeed/common/FloatingPoint.h (L115-L120)
^ that should be a switch, but it is not now: https://godbolt.org/z/bvja5v
That being said, even thought this seemed like this would fix it: https://godbolt.org/z/xGq3TM
apparently that fold is happening somewhere else afterall,
so something else also has a similar 'artificial' restriction.
Many bots are unhappy, at the very least missed a few codegen tests,
and possibly this has a logic hole inducing a miscompile
(will be really awesome to have ready reproducer..)
Need to investigate.
This reverts commit 2245fb8aaa.
1. It doesn't make sense to enforce that the bonus instruction
is only used once in it's basic block. What matters is
whether those user instructions fit within our budget, sure,
but that is another question.
2. It doesn't make sense to enforce that said bonus instructions
are only used within their basic block. Perhaps the branch
condition isn't using the value computed by said bonus instruction,
and said bonus instruction is simply being calculated
to be used in successors?
So iff we can clone bonus instructions, to lift these restrictions,
we just need to carefully update their external uses
to use the new cloned instructions.
Notably, this transform (even without this change) appears to be
poison-unsafe as per alive2, but is otherwise (including the patch) legal.
We don't introduce any new PHI nodes, but only "move" the instructions
around, i'm not really seeing much potential for extra cost modelling
for the transform, especially since now we allow at most one such
bonus instruction by default.
This causes the fold to fire +11.4% more (13216 -> 14725)
as of vanilla llvm test-suite + RawSpeed.
The motivational pattern is IEEE-754-2008 Binary16->Binary32
extension code:
ca57d77fb2/src/librawspeed/common/FloatingPoint.h (L115-L120)
^ that should be a switch, but it is not now: https://godbolt.org/z/bvja5v
That being said, even thought this seemed like this would fix it: https://godbolt.org/z/xGq3TM
apparently that fold is happening somewhere else afterall,
so something else also has a similar 'artificial' restriction.
This change introduces a new IR intrinsic named `llvm.pseudoprobe` for pseudo-probe block instrumentation. Please refer to https://reviews.llvm.org/D86193 for the whole story.
A pseudo probe is used to collect the execution count of the block where the probe is instrumented. This requires a pseudo probe to be persisting. The LLVM PGO instrumentation also instruments in similar places by placing a counter in the form of atomic read/write operations or runtime helper calls. While these operations are very persisting or optimization-resilient, in theory we can borrow the atomic read/write implementation from PGO counters and cut it off at the end of compilation with all the atomics converted into binary data. This was our initial design and we’ve seen promising sample correlation quality with it. However, the atomics approach has a couple issues:
1. IR Optimizations are blocked unexpectedly. Those atomic instructions are not going to be physically present in the binary code, but since they are on the IR till very end of compilation, they can still prevent certain IR optimizations and result in lower code quality.
2. The counter atomics may not be fully cleaned up from the code stream eventually.
3. Extra work is needed for re-targeting.
We choose to implement pseudo probes based on a special LLVM intrinsic, which is expected to have most of the semantics that comes with an atomic operation but does not block desired optimizations as much as possible. More specifically the semantics associated with the new intrinsic enforces a pseudo probe to be virtually executed exactly the same number of times before and after an IR optimization. The intrinsic also comes with certain flags that are carefully chosen so that the places they are probing are not going to be messed up by the optimizer while most of the IR optimizations still work. The core flags given to the special intrinsic is `IntrInaccessibleMemOnly`, which means the intrinsic accesses memory and does have a side effect so that it is not removable, but is does not access memory locations that are accessible by any original instructions. This way the intrinsic does not alias with any original instruction and thus it does not block optimizations as much as an atomic operation does. We also assign a function GUID and a block index to an intrinsic so that they are uniquely identified and not merged in order to achieve good correlation quality.
Let's now look at an example. Given the following LLVM IR:
```
define internal void @foo2(i32 %x, void (i32)* %f) !dbg !4 {
bb0:
%cmp = icmp eq i32 %x, 0
br i1 %cmp, label %bb1, label %bb2
bb1:
br label %bb3
bb2:
br label %bb3
bb3:
ret void
}
```
The instrumented IR will look like below. Note that each `llvm.pseudoprobe` intrinsic call represents a pseudo probe at a block, of which the first parameter is the GUID of the probe’s owner function and the second parameter is the probe’s ID.
```
define internal void @foo2(i32 %x, void (i32)* %f) !dbg !4 {
bb0:
%cmp = icmp eq i32 %x, 0
call void @llvm.pseudoprobe(i64 837061429793323041, i64 1)
br i1 %cmp, label %bb1, label %bb2
bb1:
call void @llvm.pseudoprobe(i64 837061429793323041, i64 2)
br label %bb3
bb2:
call void @llvm.pseudoprobe(i64 837061429793323041, i64 3)
br label %bb3
bb3:
call void @llvm.pseudoprobe(i64 837061429793323041, i64 4)
ret void
}
```
Reviewed By: wmi
Differential Revision: https://reviews.llvm.org/D86490
This reverts the revert commit 408c4408fa.
This version of the patch includes a fix for a crash caused by
treating ICmp/FCmp constant expressions as instructions.
Original message:
On some targets, like AArch64, vector selects can be efficiently lowered
if the vector condition is a compare with a supported predicate.
This patch adds a new argument to getCmpSelInstrCost, to indicate the
predicate of the feeding select condition. Note that it is not
sufficient to use the context instruction when querying the cost of a
vector select starting from a scalar one, because the condition of the
vector select could be composed of compares with different predicates.
This change greatly improves modeling the costs of certain
compare/select patterns on AArch64.
I am also planning on putting up patches to make use of the new argument in
SLPVectorizer & LV.
CallInst::updateProfWeight() creates branch_weights with i64 instead of i32.
To be more consistent everywhere and remove lots of casts from uint64_t
to uint32_t, use i64 for branch_weights.
Reviewed By: davidxl
Differential Revision: https://reviews.llvm.org/D88609
On some targets, like AArch64, vector selects can be efficiently lowered
if the vector condition is a compare with a supported predicate.
This patch adds a new argument to getCmpSelInstrCost, to indicate the
predicate of the feeding select condition. Note that it is not
sufficient to use the context instruction when querying the cost of a
vector select starting from a scalar one, because the condition of the
vector select could be composed of compares with different predicates.
This change greatly improves modeling the costs of certain
compare/select patterns on AArch64.
I am also planning on putting up patches to make use of the new argument in
SLPVectorizer & LV.
Reviewed By: dmgreen, RKSimon
Differential Revision: https://reviews.llvm.org/D90070
CallInst::updateProfWeight() creates branch_weights with i64 instead of i32.
To be more consistent everywhere and remove lots of casts from uint64_t
to uint32_t, use i64 for branch_weights.
Reviewed By: davidxl
Differential Revision: https://reviews.llvm.org/D88609
Modify FoldBranchToCommonDest to consider the cost of inserting
instructions when attempting to combine predicates to fold blocks.
The threshold can be controlled via a new option:
-simplifycfg-branch-fold-threshold which defaults to '2' to allow
the insertion of a not and another logical operator.
Differential Revision: https://reviews.llvm.org/D86526
Before we speculatively execute a basic block, query the cost of
inserting the necessary select instructions against the phi folding
threshold. For non-trivial insertions, a more accurate decision can
probably be made during machine if-conversion. With minsize we query
the CodeSize cost, otherwise we use SizeAndLatency.
Differential Revision: https://reviews.llvm.org/D82438
SimplifyCFG has two main folds for resumes - one when resume is directly
using the landingpad, and the other one where resume is using a PHI node.
While for the first case, we were already correctly ignoring all the
PHI nodes, and both the debug info intrinsics and lifetime intrinsics,
in the PHI-based-one, we weren't ignoring PHI's in the resume block,
and weren't ignoring lifetime intrinsics. That is clearly a bug.
On RawSpeed library, this results in +9.34% (+81) more invoke->call folds,
-0.19% (-39) landing pads, -0.24% (-81) invoke instructions
but +51 call instructions and -132 basic blocks.
Though, the run-time performance impact appears to be within the noise.
We do not thread blocks with convergent calls, but this check was missing
when we decide to insert PR Phis into it (which we only do for threading).
Differential Revision: https://reviews.llvm.org/D83936
Reviewed By: nikic
Common code sinking is already guarded with a (with default-off!) flag,
so add a flag for hoisting, too.
D84108 will hopefully make hoisting off-by-default too.
SimplifyCFG was incorrectly reporting to the pass manager that it had not made
changes after folding away a PHI. This is detected in the EXPENSIVE_CHECKS
build when the function's hash changes.
Differential Revision: https://reviews.llvm.org/D83985