This was disabled as we were looking for a weird CGSCC problem. I
think/hope we fixed it as there were a lot of updates recently. I could
never reproduce this locally so I'll use the pre-commit phab builds to
confirm this suspicion and if they seem to be happy I'll assume this is
fixed.
Reviewed By: sstefan1
Differential Revision: https://reviews.llvm.org/D87266
In `MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.cpp` we initialized
attributes until stack frame ~35k caused space to run out. The initial
size 1024 is pretty much random.
If we know that the abs operand is known negative, we can replace
it with a neg.
To avoid computing known bits twice, I've removed the fold for the
non-negative case from InstSimplify. Both the non-negative and the
negative case are handled by InstCombine now, with one known bits call.
Differential Revision: https://reviews.llvm.org/D87196
D66230 attempted to fix a problem where when there are allocas used before CoroBegin.
It keeps allocas and their uses stay in put if there are no escapse/changes to the data before CoroBegin.
Unfortunately that's incorrect.
Consider this code:
%var = alloca i32
%1 = getelementptr .. %var; stays put
%f = call i8* @llvm.coro.begin
store ... %1
After this fix, %1 will now stay put, however if a store happens after coro.begin and hence modifies the content, this change will not be reflected in the coroutine frame (and will eventually be DCEed).
To generalize the problem, if any alias ptr is created before coro.begin for an Alloca and that alias ptr is latter written into after coro.begin, it will lead to incorrect behavior.
There are also a few other minor issues, such as incorrect dominate condition check in the ptr visitor, unhandled memory intrinsics and etc.
Ths patch attempts to fix some of these issue, and make it more robust to deal with aliases.
While visiting through the alloca pointer, we also keep track of all aliases created that will be used after CoroBegin. We track the offset of each alias, and then reacreate these aliases after CoroBegin using these offset.
It's worth noting that this is not perfect and there will still be cases we cannot handle. I think it's impractical to handle all cases given the current design.
This patch makes it more robust and should be a pure win.
In the meantime, we need to think about what how to completely elimiante these issues, likely through the route as @rjmccall mentioned in D66230.
Differential Revision: https://reviews.llvm.org/D86859
In GenerateConstantOffsetsImpl, we may generate non canonical Formula
if BaseRegs of that Formula is updated and includes a recurrent expr reg
related with current loop while its ScaledReg is not.
Patched by: mdchen
Reviewed By: qcolombet
Differential Revision: https://reviews.llvm.org/D86939
The CloneFunctionInto has implicit requirements with regards to the
linkage and visibility of the function. We now update these after we did
the CloneFunctionInto on the copy with the same linkage and visibility
as the original.
One run line was different and probably introduced for the manually
added function attribute & name checks. We can do this with the script
and a check prefix used for the other run lines as well.
This was reverted in 503deec218
because it caused gigantic increase (3x) in branch mispredictions
in certain benchmarks on certain CPU's,
see https://reviews.llvm.org/D84108#2227365.
It has since been investigated and here are the results:
https://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20200907/827578.html
> It's an amazingly severe regression, but it's also all due to branch
> mispredicts (about 3x without this). The code layout looks ok so there's
> probably something else to deal with. I'm not sure there's anything we can
> reasonably do so we'll just have to take the hit for now and wait for
> another code reorganization to make the branch predictor a bit more happy :)
>
> Thanks for giving us some time to investigate and feel free to recommit
> whenever you'd like.
>
> -eric
So let's just reland this.
Original commit message:
I've been looking at missed vectorizations in one codebase.
One particular thing that stands out is that some of the loops
reach vectorizer in a rather mangled form, with weird PHI's,
and some of the loops aren't even in a rotated form.
After taking a more detailed look, that happened because
the loop's headers were too big by then. It is evident that
SimplifyCFG's common code hoisting transform is at fault there,
because the pattern it handles is precisely the unrotated
loop basic block structure.
Surprizingly, `SimplifyCFGOpt::HoistThenElseCodeToIf()` is enabled
by default, and is always run, unlike it's friend, common code sinking
transform, `SinkCommonCodeFromPredecessors()`, which is not enabled
by default and is only run once very late in the pipeline.
I'm proposing to harmonize this, and disable common code hoisting
until //late// in pipeline. Definition of //late// may vary,
here currently i've picked the same one as for code sinking,
but i suppose we could enable it as soon as right after
loop rotation happens.
Experimentation shows that this does indeed unsurprizingly help,
more loops got rotated, although other issues remain elsewhere.
Now, this undoubtedly seriously shakes phase ordering.
This will undoubtedly be a mixed bag in terms of both compile- and
run- time performance, codesize. Since we no longer aggressively
hoist+deduplicate common code, we don't pay the price of said hoisting
(which wasn't big). That may allow more loops to be rotated,
so we pay that price. That, in turn, that may enable all the transforms
that require canonical (rotated) loop form, including but not limited to
vectorization, so we pay that too. And in general, no deduplication means
more [duplicate] instructions going through the optimizations. But there's still
late hoisting, some of them will be caught late.
As per benchmarks i've run {F12360204}, this is mostly within the noise,
there are some small improvements, some small regressions.
One big regression i saw i fixed in rG8d487668d09fb0e4e54f36207f07c1480ffabbfd, but i'm sure
this will expose many more pre-existing missed optimizations, as usual :S
llvm-compile-time-tracker.com thoughts on this:
http://llvm-compile-time-tracker.com/compare.php?from=e40315d2b4ed1e38962a8f33ff151693ed4ada63&to=c8289c0ecbf235da9fb0e3bc052e3c0d6bff5cf9&stat=instructions
* this does regress compile-time by +0.5% geomean (unsurprizingly)
* size impact varies; for ThinLTO it's actually an improvement
The largest fallout appears to be in GVN's load partial redundancy
elimination, it spends *much* more time in
`MemoryDependenceResults::getNonLocalPointerDependency()`.
Non-local `MemoryDependenceResults` is widely-known to be, uh, costly.
There does not appear to be a proper solution to this issue,
other than silencing the compile-time performance regression
by tuning cut-off thresholds in `MemoryDependenceResults`,
at the cost of potentially regressing run-time performance.
D84609 attempts to move in that direction, but the path is unclear
and is going to take some time.
If we look at stats before/after diffs, some excerpts:
* RawSpeed (the target) {F12360200}
* -14 (-73.68%) loops not rotated due to the header size (yay)
* -272 (-0.67%) `"Number of live out of a loop variables"` - good for vectorizer
* -3937 (-64.19%) common instructions hoisted
* +561 (+0.06%) x86 asm instructions
* -2 basic blocks
* +2418 (+0.11%) IR instructions
* vanilla test-suite + RawSpeed + darktable {F12360201}
* -36396 (-65.29%) common instructions hoisted
* +1676 (+0.02%) x86 asm instructions
* +662 (+0.06%) basic blocks
* +4395 (+0.04%) IR instructions
It is likely to be sub-optimal for when optimizing for code size,
so one might want to change tune pipeline by enabling sinking/hoisting
when optimizing for size.
Reviewed By: mkazantsev
Differential Revision: https://reviews.llvm.org/D84108
This reverts commit 503deec218.
For intrinsics supported by ConstantRange, compute the result range
based on the argument ranges. We do this independently of whether
some or all of the input ranges are full, as we can often still
constrain the result in some way.
Differential Revision: https://reviews.llvm.org/D87183
This was supposed to be an NFC cleanup, but there's
a real logic difference (did not drop 'nsw') visible
in some tests in addition to an efficiency improvement.
This is because in the case where we have 2 GEPs,
the code was *always* swapping the operands and
negating the result. But if we have 2 GEPs, we
should *never* need swapping/negation AFAICT.
This is part of improving flags propagation noticed
with PR47430.
Normal dead code elimination ignores assume intrinsics, so we fail to
delete assumes that are not meaningful (and potentially worse if they
cause conflicts with other assumptions).
The motivating example in https://llvm.org/PR47416 suggests that we
might have problems upstream from here (difference between C and C++),
but this should be a cheap way to make sure we remove more dead code.
Differential Revision: https://reviews.llvm.org/D87149
Modify FoldBranchToCommonDest to consider the cost of inserting
instructions when attempting to combine predicates to fold blocks.
The threshold can be controlled via a new option:
-simplifycfg-branch-fold-threshold which defaults to '2' to allow
the insertion of a not and another logical operator.
Differential Revision: https://reviews.llvm.org/D86526
This addresses the remaining issue from D87188. Due to a series of
folds, we may end up with abs-of-abs represented as
x == 0 ? -abs(x) : abs(x). Rather than recognizing this as a special
abs pattern and doing an abs-of-abs fold on it afterwards,
I'm directly folding this to one of the select operands in InstSimplify.
The general pattern falls into the "select with operand replaced"
category, but that fold is not powerful enough to recognize that
both hands of the select are the same for value zero.
Differential Revision: https://reviews.llvm.org/D87197
This adjusts the description of `llvm.memcpy` to also allow operands
to be equal. This is in line with what Clang currently expects.
This change is intended to be temporary and followed by re-introduce
a variant with the non-overlapping guarantee for cases where we can
actually ensure that property in the front-end.
See the links below for more details:
http://lists.llvm.org/pipermail/cfe-dev/2020-August/066614.html
and PR11763.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D86815
Similar to D87168, but for abs. If we have a dominating x >= 0
condition, then we know that abs(x) is x. This fold is in
InstCombine, because we need to create a sub instruction for
the x < 0 case.
Differential Revision: https://reviews.llvm.org/D87184
If we have a dominating condition that x >= y, then umax(x, y) is x,
etc. I'm doing this in InstSimplify as the corresponding transform
for the select form is also done there.
Differential Revision: https://reviews.llvm.org/D87168
When a switch case is folded into default's case, that's an IR change that
should be reported, update ConstantFoldTerminator accordingly.
Differential Revision: https://reviews.llvm.org/D87142
Some tests from multibuild-malloc-free.ll do not actually use malloc or
free and where split out to multiblock-throwing.ll, but not removed from
the original file. This patch cleans that up. It also moves @test22 to
simple.ll, because it does not involve multiple blocks.
When checking call sites, give special handling to indirect call, as the
callee may be unknown and can lead to nullptr dereference later. Assume
conservatively that the ICV always changes in such case.
Reviewed By: sstefan1
Differential Revision: https://reviews.llvm.org/D87104