When the associated value is undef, we immediately forced to indicate a pessimistic fixpoint so far.
This patch changes the initialization to check the attribute given in IR at first and to indicate an optimistic fixpoint when it is given.
This change will enable us to catch , for example, the following case in AAUB.
```
call void @foo(i32 noundef undef)
```
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D86983
Summary:
Analyses are preserved in MemCpyOptimizer.
Get analyses before running the pass and store the pointers, instead of
using lambdas and getting them every time on demand.
Reviewers: lenary, deadalnix, mehdi_amini, nikic, efriedma
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74494
Interleave for small loops that have reductions inside,
which breaks dependencies and expose.
This gives very significant performance improvements for some benchmarks.
Because small loops could be in very hot functions in real applications.
Differential Revision: https://reviews.llvm.org/D81416
Loop Idiom Recognize Pass (LIRP) attempts to transform loops with subscripted arrays
into memcpy/memset function calls. In some particular situation, this transformation
introduces negative impacts. For example: https://bugs.llvm.org/show_bug.cgi?id=47300
This patch will enable users to disable a particular part of the transformation, while
he/she can still enjoy the benefit brought about by the rest of LIRP. The default
behavior stays unchanged: no part of LIRP is disabled by default.
Reviewed By: etiotto (Ettore Tiotto)
Differential Revision: https://reviews.llvm.org/D86262
getValuesInOffloadArrays goes through the offload arrays in __tgt_target_data_begin_mapper getting the values stored in them before the call is issued.
call void @__tgt_target_data_begin_mapper(arg0, arg1,
i8** %offload_baseptrs, i8** %offload_ptrs, i64* %offload_sizes,
...)
Diferential Revision: https://reviews.llvm.org/D86300
The 1st try was reverted because I missed an assert that
needed softening.
As discussed in D86798 / rG09652721 , we were potentially
returning a different result for whether an Instruction
is commutable depending on if we call the base class or
derived class method.
This requires relaxing asserts in GVN, but that pass
seems to be working otherwise.
NewGVN requires more work because it uses different
code paths for numbering binops and calls.
For an instruction in the basic block BB, SinkingPass enumerates basic blocks
dominated by BB and BB's successors. For each enumerated basic block,
SinkingPass uses `AllUsesDominatedByBlock` to check whether the basic
block dominates all of the instruction's users. This is inefficient.
Use the nearest common dominator of all users to avoid enumerating the
candidate. The nearest common dominator may be in a parent loop which is
not beneficial. In that case, find the ancestors in the dominator tree.
In the case that the instruction has no user, with this change we will
not perform unnecessary move. This causes some amdgpu test changes.
A stage-2 x86-64 clang is a byte identical with this change.
As discussed in D86798 / rG09652721 , we were potentially
returning a different result for whether an Instruction
is commutable depending on if we call the base class or
derived class method.
This requires relaxing an assert in GVN, but that pass
seems to be working otherwise.
NewGVN requires more work because it uses different
code paths for numbering binops and calls.
addRuntimeChecks uses SCEVExpander, which relies on the DT/LoopInfo to
be up-to-date. Changing the CFG afterwards may invalidate some inserted
instructions, especially LCSSA phis.
Reorder the code to first update the CFG and then create the runtime
checks. This should not have any impact on the generated code, as we
adjust the CFG and generate runtime checks together.
Fixes PR47343.
As discussed in D86798, it's not clear if the caller code
works with a more liberal definition of "commutative" that
includes intrinsics like min/max. This makes the binop
restriction (current functionality is unchanged) explicit
until the code is audited/tested.
The problem with module slice has been addressed in D86319
Introduce two new AAs. AAICVTrackerFunctionReturned which checks if a
function can have a unique ICV value after it is finished, and
AAICVCallSiteReturned which checks AAICVTrackerFunctionReturned for a
call site. This enables us to check the value of a call and if it
changes the ICV. This also changes the approach in
`getReplacementValues()` to a worklist-based approach so we can explore
all relevant BBs.
Differential Revision: https://reviews.llvm.org/D85544
Summary:
The module slice describes which functions we can analyze and transform
while working on an SCC as part of the Attributor-CGSCC pass. So far we
simply restricted it to the SCC.
Reviewers: jdoerfert
Differential Revision: https://reviews.llvm.org/D86319
This is the next patch of D86842
When we check `noundef` attribute violation at callsites, we do not have to require `nonnull` in the following two cases.
1. An argument is known to be simplified to undef
2. An argument is known to be dead
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D86845
Even though `noundef` IR attribute might be attached to non-void type values, AANoUndef is mistakenly identified for pointer type values only.
This patch fixes that.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D86737
Move bail out when optimizing for size before runtime check generation.
In that case, we do not use the result of the expansion, the expanded
instruction will be dead and cleaned up later.
By doing the check before expanding the runtime-checks, we can save a
bit of unnecessary work.
EarlyCSE has a mode to verify the invariant that hash equality equals
key equality, but EliminateDuplicatePHINodes() doesn't.
I've verified that this would have caught the stage2-stage3 mismatches
5ec2b757cc revert has fixed,
that were introduced last time in 3e69871ab5.
This patch fixes AANoUndef manifestation.
We should not manifest noundef for positions that will be changed to undef.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D86835
Handling the new min/max intrinsics is the motivation, but it
turns out that we have a bunch of other intrinsics with this
missing bit of analysis too.
The FP min/max tests show that we are intersecting FMF,
so that part should be safe too.
As noted in https://llvm.org/PR46897 , there is a commutative
property specifier for intrinsics, but no corresponding function
attribute, and so apparently no uses of that bit. We may want to
remove that next.
Follow-up patches should wire up the Instruction::isCommutative()
to this IntrinsicInst specialization. That requires updating
callers to be aware of the more general commutative property
(not just binops).
Differential Revision: https://reviews.llvm.org/D86798
The original take 1 was 6102310d81,
which taught InstSimplify to do that, which seemed better at time,
since we got EarlyCSE support for free.
However, it was proven that we can not do that there,
the simplified-to PHI would not be reachable from the original PHI,
and that is not something InstSimplify is allowed to do,
as noted in the commit ed90f15efb
that reverted it:
> It appears to cause compilation non-determinism and caused stage3 mismatches.
Then there was take 2 3e69871ab5,
which was InstCombine-specific, but it again showed stage2-stage3 differences,
and reverted in bdaa3f86a0.
This is quite alarming.
Here, let's try to change how we find existing PHI candidate:
due to the worklist order, and the way PHI nodes are inserted
(it may be inserted as the first one, or maybe not), let's look at *all*
PHI nodes in the block.
Effects on vanilla llvm test-suite + RawSpeed:
```
| statistic name | baseline | proposed | Δ | % | \|%\| |
|----------------------------------------------------|-----------|-----------|-------:|---------:|---------:|
| asm-printer.EmittedInsts | 7942329 | 7942457 | 128 | 0.00% | 0.00% |
| assembler.ObjectBytes | 254295632 | 254312480 | 16848 | 0.01% | 0.01% |
| correlated-value-propagation.NumPhis | 18412 | 18347 | -65 | -0.35% | 0.35% |
| early-cse.NumCSE | 2183283 | 2183267 | -16 | 0.00% | 0.00% |
| early-cse.NumSimplify | 550105 | 541842 | -8263 | -1.50% | 1.50% |
| instcombine.NumAggregateReconstructionsSimplified | 73 | 4506 | 4433 | 6072.60% | 6072.60% |
| instcombine.NumCombined | 3640311 | 3644419 | 4108 | 0.11% | 0.11% |
| instcombine.NumDeadInst | 1778204 | 1783205 | 5001 | 0.28% | 0.28% |
| instcombine.NumPHICSEs | 0 | 22490 | 22490 | 0.00% | 0.00% |
| instcombine.NumWorklistIterations | 2023272 | 2024400 | 1128 | 0.06% | 0.06% |
| instcount.NumCallInst | 1758395 | 1758802 | 407 | 0.02% | 0.02% |
| instcount.NumInvokeInst | 59478 | 59502 | 24 | 0.04% | 0.04% |
| instcount.NumPHIInst | 330557 | 330545 | -12 | 0.00% | 0.00% |
| instcount.TotalBlocks | 1077138 | 1077220 | 82 | 0.01% | 0.01% |
| instcount.TotalFuncs | 101442 | 101441 | -1 | 0.00% | 0.00% |
| instcount.TotalInsts | 8831946 | 8832606 | 660 | 0.01% | 0.01% |
| simplifycfg.NumHoistCommonCode | 24186 | 24187 | 1 | 0.00% | 0.00% |
| simplifycfg.NumInvokes | 4300 | 4410 | 110 | 2.56% | 2.56% |
| simplifycfg.NumSimpl | 1019813 | 999767 | -20046 | -1.97% | 1.97% |
```
So it fires 22490 times, which is less than ~24k the take 1 did,
but more than what take 2 did (22228 times)
.
It allows foldAggregateConstructionIntoAggregateReuse() to actually work
after PHI-of-extractvalue folds did their thing. Previously SimplifyCFG
would have done this PHI CSE, of all places. Additionally, allows some
more `invoke`->`call` folds to happen (+110, +2.56%).
All in all, expectedly, this catches less things overall,
but all the motivational cases are still caught, so all good.
While the original variant with doing this in InstSimplify (rightfully)
caused questions and ultimately was detected to be a culprit
of stage2-stage3 mismatch, it was expected that
InstCombine-based implementation would be fine.
But apparently it's not, as
http://lab.llvm.org:8011/builders/clang-with-thin-lto-ubuntu/builds/24095/steps/compare-compilers/logs/stdio
suggests.
Which suggests that somewhere in InstCombine there is a loop
over nondeterministically sorted container, which causes
different worklist ordering.
This reverts commit 3e69871ab5.
As we've established, if it takes more than two iterations
(one to perform folding and one to ensure that no folding opportunities
remain) per function, then there are worklist management issues.
So it may be interesting to keep track of it.
The original take was 6102310d81,
which taught InstSimplify to do that, which seemed better at time,
since we got EarlyCSE support for free.
However, it was proven that we can not do that there,
the simplified-to PHI would not be reachable from the original PHI,
and that is not something InstSimplify is allowed to do,
as noted in the commit ed90f15efb
that reverted it :
> It appears to cause compilation non-determinism and caused stage3 mismatches.
However InstCombine already does many different optimizations,
so it should be a safe place to do it here.
Note that we still can't just compare incoming values ranges,
because there is no guarantee that these PHI's we'd simplify to
were already re-visited and sorted.
However coming up with a test is problematic.
Effects on vanilla llvm test-suite + RawSpeed:
```
| statistic name | baseline | proposed | Δ | % | |%| |
|----------------------------------------------------|-----------|-----------|-------:|---------:|---------:|
| instcombine.NumPHICSEs | 0 | 22228 | 22228 | 0.00% | 0.00% |
| asm-printer.EmittedInsts | 7942329 | 7942456 | 127 | 0.00% | 0.00% |
| assembler.ObjectBytes | 254295632 | 254313792 | 18160 | 0.01% | 0.01% |
| early-cse.NumCSE | 2183283 | 2183272 | -11 | 0.00% | 0.00% |
| early-cse.NumSimplify | 550105 | 541842 | -8263 | -1.50% | 1.50% |
| instcombine.NumAggregateReconstructionsSimplified | 73 | 4506 | 4433 | 6072.60% | 6072.60% |
| instcombine.NumCombined | 3640311 | 3666911 | 26600 | 0.73% | 0.73% |
| instcombine.NumDeadInst | 1778204 | 1783318 | 5114 | 0.29% | 0.29% |
| instcount.NumCallInst | 1758395 | 1758804 | 409 | 0.02% | 0.02% |
| instcount.NumInvokeInst | 59478 | 59502 | 24 | 0.04% | 0.04% |
| instcount.NumPHIInst | 330557 | 330549 | -8 | 0.00% | 0.00% |
| instcount.TotalBlocks | 1077138 | 1077221 | 83 | 0.01% | 0.01% |
| instcount.TotalFuncs | 101442 | 101441 | -1 | 0.00% | 0.00% |
| instcount.TotalInsts | 8831946 | 8832611 | 665 | 0.01% | 0.01% |
| simplifycfg.NumInvokes | 4300 | 4410 | 110 | 2.56% | 2.56% |
| simplifycfg.NumSimpl | 1019813 | 999740 | -20073 | -1.97% | 1.97% |
```
So it fires ~22k times, which is less than ~24k the take 1 did.
It allows foldAggregateConstructionIntoAggregateReuse() to actually work
after PHI-of-extractvalue folds did their thing. Previously SimplifyCFG
would have done this PHI CSE, of all places. Additionally, allows some
more `invoke`->`call` folds to happen (+110, +2.56%).
All in all, expectedly, this catches less things overall,
but all the motivational cases are still caught, so all good.
instruction can decrement the reference count, not whether it can alter
it
This prevents the state transition from S_Use to S_CanRelease when doing
a bottom-up traversal and the transition from S_Retain to S_CanRelease
when doing a top-down traversal when the visited instruction can
increment the ref count but cannot decrement it. This allows the ARC
optimizer to remove retain/release pairs which were previously not
removed.
rdar://problem/21793154
There's a special case in hasAttribute for None when pImpl is null. If pImpl is not null we dispatch to pImpl->hasAttribute which will always return false for Attribute::None.
So if we just want to check for None its sufficient to just check that pImpl is null. Which can even be done inline.
This patch adds a helper for that case which I hope will speed up our getSubtargetImpl implementations.
Differential Revision: https://reviews.llvm.org/D86744
Since doInitialization() in the legacy pass modifies the module, the NPM
pass is a Module pass.
Reviewed By: ahatanak, ychen
Differential Revision: https://reviews.llvm.org/D86178
This patch fixes this crash https://gcc.godbolt.org/z/Ps8d1e
And gives SROA the ability to remove assumes if it allows promoting an alloca to register
Without removing assumes when it can't promote to register.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D86570
This patch changes ElementCount so that the Min and Scalable
members are now private and can only be accessed via the get
functions getKnownMinValue() and isScalable(). In addition I've
added some other member functions for more commonly used operations.
Hopefully this makes the class more useful and will reduce the
need for calling getKnownMinValue().
Differential Revision: https://reviews.llvm.org/D86065
This changes getDomMemoryDef to check if a Current is a valid
candidate for elimination before checking for reads. Before the change,
we were spending a lot of compile-time in checking for read accesses for
Current that might not even be removable.
This patch flips the logic, so we skip Current if they cannot be
removed before checking all their uses. This is much more efficient in
practice.
It also adds a more aggressive limit for checking partially overlapping
stores. The main problem with overlapping stores is that we do not know
if they will lead to elimination until seeing all of them. This patch
limits adds a new limit for overlapping store candidates, which keeps
the number of modified overlapping stores roughly the same.
This is another substantial compile-time improvement (while also
increasing the number of stores eliminated). Geomean -O3 -0.67%,
ReleaseThinLTO -0.97%.
http://llvm-compile-time-tracker.com/compare.php?from=0a929b6978a068af8ddb02d0d4714a2843dd8ba9&to=2e630629b43f64b60b282e90f0d96082fde2dacc&stat=instructions
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D86487
strspn, strncmp, strcspn, strcasecmp, strncasecmp, memcmp, memchr,
memrchr, memcpy, memmove, memcpy, mempcpy, strchr, strrchr, bcmp
should all only access memory through their arguments.
I broke out strcoll, strcasecmp, strncasecmp because the result
depends on the locale, which might get accessed through memory.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D86724
Even if noundef is deduced for a position, we should not manifest it when the position is dead.
This is because the associated values with dead positions are replaced with undef values by AAIsDead.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D86565
If we query an AA with `Attributor::getAAFor` in `AbstractAttribute::manifest`, the AA may be updated.
This patch makes use of the phase flag in Attributor, and handle `getAAFor` behavior according to the flag.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D86635
Add a new flag that indicates which stage in the process we are in.
This flag is introduced for handling behavior of `getAAFor` according to the stage. (discussed in D86635)
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D86678
See RFC for background:
http://lists.llvm.org/pipermail/llvm-dev/2020-June/142744.html
Note that the runtime changes will be sent separately (hopefully this
week, need to add some tests).
This patch includes the LLVM pass to instrument memory accesses with
either inline sequences to increment the access count in the shadow
location, or alternatively to call into the runtime. It also changes
calls to memset/memcpy/memmove to the equivalent runtime version.
The pass is modeled on the address sanitizer pass.
The clang changes add the driver option to invoke the new pass, and to
link with the upcoming heap profiling runtime libraries.
Currently there is no attempt to optimize the instrumentation, e.g. to
aggregate updates to the same memory allocation. That will be
implemented as follow on work.
Differential Revision: https://reviews.llvm.org/D85948
Currently we bail out early for strlen calls with a GEP operand, if none
of the GEP specific optimizations fire. But there could be later
optimizations that still apply, which we currently miss out on.
An example is that we do not apply the following optimization
strlen(x) == 0 --> *x == 0
Unless I am missing something, there seems to be no reason for bailing
out early there.
Fixes PR47149.
Reviewed By: lebedev.ri, xbolva00
Differential Revision: https://reviews.llvm.org/D85886
For DSE with MemorySSA it is beneficial to manually traverse the
defining access, instead of using a MemorySSA walker, so we can
better control the number of steps together with other limits and
also weed out invalid/unprofitable paths early on.
This patch requires a follow-up patch to be most effective, which I will
share soon after putting this patch up.
This temporarily XFAIL's the limit tests, because we now explore more
MemoryDefs that may not alias/clobber the killing def. This will be
improved/fixed by the follow-up patch.
This patch also renames some `Dom*` variables to `Earlier*`, because the
dominance relation is not really used/important here and potentially
confusing.
This patch allows us to aggressively cut down compile time, geomean
-O3 -0.64%, ReleaseThinLTO -1.65%, at the expense of fewer stores
removed. Subsequent patches will increase the number of removed stores
again, while keeping compile-time in check.
http://llvm-compile-time-tracker.com/compare.php?from=d8e3294118a8c5f3f97688a704d5a05b67646012&to=0a929b6978a068af8ddb02d0d4714a2843dd8ba9&stat=instructions
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D86486
Currently, an undef value is reduced to 0 when it is added to a set of potential values.
This patch introduces a flag for under values. By this, for example, we can merge two states `{undef}`, `{1}` to `{1}` (because we can reduce the undef to 1).
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D85592
As discussed in
http://lists.llvm.org/pipermail/llvm-dev/2020-July/143801.html.
Currently no users outside of unit tests.
Replace all instances in tests of -constprop with -instsimplify.
Notable changes in tests:
* vscale.ll - @llvm.sadd.sat.nxv16i8 is evaluated by instsimplify, use a fake intrinsic instead
* InsertElement.ll - insertelement undef is removed by instsimplify in @insertelement_undef
llvm/test/Transforms/ConstProp moved to llvm/test/Transforms/InstSimplify/ConstProp
Reviewed By: lattner, nikic
Differential Revision: https://reviews.llvm.org/D85159
and indirect call promotion candidate.
Profile remapping is a feature to match a function in the module with its
profile in sample profile if the function name and the name in profile look
different but are equivalent using given remapping rules. This is a useful
feature to keep the performance stable by specifying some remapping rules
when sampleFDO targets are going through some large scale function signature
change.
However, currently profile remapping support is only valid for outline
function profile in SampleFDO. It cannot match a callee with an inline
instance profile if they have different but equivalent names. We found
that without the support for inline instance profile, remapping is less
effective for some large scale change.
To add that support, before any remapping lookup happens, all the names
in the profile will be inserted into remapper and the Key to the name
mapping will be recorded in a map called NameMap in the remapper. During
name lookup, a Key will be returned for the given name and it will be used
to extract an equivalent name in the profile from NameMap. So with the help
of the NameMap, we can translate any given name to an equivalent name in
the profile if it exists. Whenever we try to match a name in the module to
a name in the profile, we will try the match with the original name first,
and if it doesn't match, we will use the equivalent name got from remapper
to try the match for another time. In this way, the patch can enhance the
profile remapping support for searching inline instance and searching
indirect call promotion candidate.
In a planned large scale change of int64 type (long long) to int64_t (long),
we found the performance of a google internal benchmark degraded by 2% if
nothing was done. If existing profile remapping was enabled, the performance
degradation dropped to 1.2%. If the profile remapping with the current patch
was enabled, the performance degradation further dropped to 0.14% (Note the
experiment was done before searching indirect call promotion candidate was
added. We hope with the remapping support of searching indirect call promotion
candidate, the degradation can drop to 0% in the end. It will be evaluated
post commit).
Differential Revision: https://reviews.llvm.org/D86332
As FIXME said, they really should be checking for a single user,
not use, so let's do that. It is not *that* unusual to have
the same value as incoming value in a PHI node, not unlike
how a PHI may have the same incoming basic block more than once.
There isn't a nice way to do that, Value::users() isn't uniqified,
and Value only tracks it's uses, not Users, so the check is
potentially costly since it does indeed potentially involes
traversing the entire use list of a value.
This implements 2 different vectorisation fallback strategies if tail-folding
fails: 1) don't vectorise at all, or 2) vectorise using a scalar epilogue. This
can be controlled with option -prefer-predicate-over-epilogue, that has been
changed to take a numeric value corresponding to the tail-folding preference
and preferred fallback.
Patch by: Pierre van Houtryve, Sjoerd Meijer.
Differential Revision: https://reviews.llvm.org/D79783
This patch produces an edge-based interface in AAIsDead.
By this, we can query a set of basic blocks that are directly reachable from a given basic block.
This is specifically useful for implementation of AAReachability.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D85547
While since D86306 we do it's sibling fold for `insertvalue`,
we should also do this for `extractvalue`'s.
And unlike that one, the results here are, quite honestly, shocking,
as it can be observed here on vanilla llvm test-suite + RawSpeed results:
```
| statistic name | baseline | proposed | Δ | % | |%| |
|----------------------------------------------------|-----------|-----------|--------:|--------:|-------:|
| asm-printer.EmittedInsts | 7945095 | 7942507 | -2588 | -0.03% | 0.03% |
| assembler.ObjectBytes | 273209920 | 273069800 | -140120 | -0.05% | 0.05% |
| early-cse.NumCSE | 2183363 | 2183398 | 35 | 0.00% | 0.00% |
| early-cse.NumSimplify | 541847 | 550017 | 8170 | 1.51% | 1.51% |
| instcombine.NumAggregateReconstructionsSimplified | 2139 | 108 | -2031 | -94.95% | 94.95% |
| instcombine.NumCombined | 3601364 | 3635448 | 34084 | 0.95% | 0.95% |
| instcombine.NumConstProp | 27153 | 27157 | 4 | 0.01% | 0.01% |
| instcombine.NumDeadInst | 1694521 | 1765022 | 70501 | 4.16% | 4.16% |
| instcombine.NumPHIsOfExtractValues | 0 | 37546 | 37546 | 0.00% | 0.00% |
| instcombine.NumSunkInst | 63158 | 63686 | 528 | 0.84% | 0.84% |
| instcount.NumBrInst | 874304 | 871857 | -2447 | -0.28% | 0.28% |
| instcount.NumCallInst | 1757657 | 1758402 | 745 | 0.04% | 0.04% |
| instcount.NumExtractValueInst | 45623 | 11483 | -34140 | -74.83% | 74.83% |
| instcount.NumInsertValueInst | 4983 | 580 | -4403 | -88.36% | 88.36% |
| instcount.NumInvokeInst | 61018 | 59478 | -1540 | -2.52% | 2.52% |
| instcount.NumLandingPadInst | 35334 | 34215 | -1119 | -3.17% | 3.17% |
| instcount.NumPHIInst | 344428 | 331116 | -13312 | -3.86% | 3.86% |
| instcount.NumRetInst | 100773 | 100772 | -1 | 0.00% | 0.00% |
| instcount.TotalBlocks | 1081154 | 1077166 | -3988 | -0.37% | 0.37% |
| instcount.TotalFuncs | 101443 | 101442 | -1 | 0.00% | 0.00% |
| instcount.TotalInsts | 8890201 | 8833747 | -56454 | -0.64% | 0.64% |
| instsimplify.NumSimplified | 75822 | 75707 | -115 | -0.15% | 0.15% |
| simplifycfg.NumHoistCommonCode | 24203 | 24197 | -6 | -0.02% | 0.02% |
| simplifycfg.NumHoistCommonInstrs | 48201 | 48195 | -6 | -0.01% | 0.01% |
| simplifycfg.NumInvokes | 2785 | 4298 | 1513 | 54.33% | 54.33% |
| simplifycfg.NumSimpl | 997332 | 1018189 | 20857 | 2.09% | 2.09% |
| simplifycfg.NumSinkCommonCode | 7088 | 6464 | -624 | -8.80% | 8.80% |
| simplifycfg.NumSinkCommonInstrs | 15117 | 14021 | -1096 | -7.25% | 7.25% |
```
... which tells us that this new fold fires whopping 38k times,
increasing the amount of SimplifyCFG's `invoke`->`call` transforms by +54% (+1513) (again, D85787 did that last time),
decreasing total instruction count by -0.64% (-56454),
and sharply decreasing count of `insertvalue`'s (-88.36%, i.e. 9 times less)
and `extractvalue`'s (-74.83%, i.e. four times less).
This causes geomean -0.01% binary size decrease
http://llvm-compile-time-tracker.com/compare.php?from=4d5ca22b8adfb6643466e4e9f48ba14bb48938bc&to=97dacca0111cb2ae678204e52a3cee00e3a69208&stat=size-text
and, ignoring `O0-g`, is a geomean -0.01%..-0.05% compile-time improvement
http://llvm-compile-time-tracker.com/compare.php?from=4d5ca22b8adfb6643466e4e9f48ba14bb48938bc&to=97dacca0111cb2ae678204e52a3cee00e3a69208&stat=instructions
The other thing that tells is, is that while this is a massive win for `invoke`->`call` transform
`InstCombinerImpl::foldAggregateConstructionIntoAggregateReuse()` fold,
which is supposed to be dealing with such aggregate reconstructions,
fires a lot less now. There are two reasons why:
1. After this fold, as it can be seen in tests, we may (will) end up with trivially redundant PHI nodes.
We don't CSE them in InstCombine presently, which means that EarlyCSE needs to run and then InstCombine rerun.
2. But then, EarlyCSE not only manages to fold such redundant PHI's,
it also sees that the extract-insert chain recreates the original aggregate,
and replaces it with the original aggregate.
The take-aways are
1. We maybe should do most trivial, same-BB PHI CSE in InstCombine
2. I need to check if what other patterns remain, and how they can be resolved.
(i.e. i wonder if `foldAggregateConstructionIntoAggregateReuse()` might go away)
This is a reland of the original commit fcb51d8c24,
because originally i forgot to ensure that the base aggregate types match.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D86530
This reverts commit fcb51d8c24.
As buildbots report, there's apparently some missing check to ensure
that the types of incoming values match the type of PHI.
Let's revert for a moment.
While since D86306 we do it's sibling fold for `insertvalue`,
we should also do this for `extractvalue`'s.
And unlike that one, the results here are, quite honestly, shocking,
as it can be observed here on vanilla llvm test-suite + RawSpeed results:
```
| statistic name | baseline | proposed | Δ | % | |%| |
|----------------------------------------------------|-----------|-----------|--------:|--------:|-------:|
| asm-printer.EmittedInsts | 7945095 | 7942507 | -2588 | -0.03% | 0.03% |
| assembler.ObjectBytes | 273209920 | 273069800 | -140120 | -0.05% | 0.05% |
| early-cse.NumCSE | 2183363 | 2183398 | 35 | 0.00% | 0.00% |
| early-cse.NumSimplify | 541847 | 550017 | 8170 | 1.51% | 1.51% |
| instcombine.NumAggregateReconstructionsSimplified | 2139 | 108 | -2031 | -94.95% | 94.95% |
| instcombine.NumCombined | 3601364 | 3635448 | 34084 | 0.95% | 0.95% |
| instcombine.NumConstProp | 27153 | 27157 | 4 | 0.01% | 0.01% |
| instcombine.NumDeadInst | 1694521 | 1765022 | 70501 | 4.16% | 4.16% |
| instcombine.NumPHIsOfExtractValues | 0 | 37546 | 37546 | 0.00% | 0.00% |
| instcombine.NumSunkInst | 63158 | 63686 | 528 | 0.84% | 0.84% |
| instcount.NumBrInst | 874304 | 871857 | -2447 | -0.28% | 0.28% |
| instcount.NumCallInst | 1757657 | 1758402 | 745 | 0.04% | 0.04% |
| instcount.NumExtractValueInst | 45623 | 11483 | -34140 | -74.83% | 74.83% |
| instcount.NumInsertValueInst | 4983 | 580 | -4403 | -88.36% | 88.36% |
| instcount.NumInvokeInst | 61018 | 59478 | -1540 | -2.52% | 2.52% |
| instcount.NumLandingPadInst | 35334 | 34215 | -1119 | -3.17% | 3.17% |
| instcount.NumPHIInst | 344428 | 331116 | -13312 | -3.86% | 3.86% |
| instcount.NumRetInst | 100773 | 100772 | -1 | 0.00% | 0.00% |
| instcount.TotalBlocks | 1081154 | 1077166 | -3988 | -0.37% | 0.37% |
| instcount.TotalFuncs | 101443 | 101442 | -1 | 0.00% | 0.00% |
| instcount.TotalInsts | 8890201 | 8833747 | -56454 | -0.64% | 0.64% |
| instsimplify.NumSimplified | 75822 | 75707 | -115 | -0.15% | 0.15% |
| simplifycfg.NumHoistCommonCode | 24203 | 24197 | -6 | -0.02% | 0.02% |
| simplifycfg.NumHoistCommonInstrs | 48201 | 48195 | -6 | -0.01% | 0.01% |
| simplifycfg.NumInvokes | 2785 | 4298 | 1513 | 54.33% | 54.33% |
| simplifycfg.NumSimpl | 997332 | 1018189 | 20857 | 2.09% | 2.09% |
| simplifycfg.NumSinkCommonCode | 7088 | 6464 | -624 | -8.80% | 8.80% |
| simplifycfg.NumSinkCommonInstrs | 15117 | 14021 | -1096 | -7.25% | 7.25% |
```
... which tells us that this new fold fires whopping 38k times,
increasing the amount of SimplifyCFG's `invoke`->`call` transforms by +54% (+1513) (again, D85787 did that last time),
decreasing total instruction count by -0.64% (-56454),
and sharply decreasing count of `insertvalue`'s (-88.36%, i.e. 9 times less)
and `extractvalue`'s (-74.83%, i.e. four times less).
This causes geomean -0.01% binary size decrease
http://llvm-compile-time-tracker.com/compare.php?from=4d5ca22b8adfb6643466e4e9f48ba14bb48938bc&to=97dacca0111cb2ae678204e52a3cee00e3a69208&stat=size-text
and, ignoring `O0-g`, is a geomean -0.01%..-0.05% compile-time improvement
http://llvm-compile-time-tracker.com/compare.php?from=4d5ca22b8adfb6643466e4e9f48ba14bb48938bc&to=97dacca0111cb2ae678204e52a3cee00e3a69208&stat=instructions
The other thing that tells is, is that while this is a massive win for `invoke`->`call` transform
`InstCombinerImpl::foldAggregateConstructionIntoAggregateReuse()` fold,
which is supposed to be dealing with such aggregate reconstructions,
fires a lot less now. There are two reasons why:
1. After this fold, as it can be seen in tests, we may (will) end up with trivially redundant PHI nodes.
We don't CSE them in InstCombine presently, which means that EarlyCSE needs to run and then InstCombine rerun.
2. But then, EarlyCSE not only manages to fold such redundant PHI's,
it also sees that the extract-insert chain recreates the original aggregate,
and replaces it with the original aggregate.
The take-aways are
1. We maybe should do most trivial, same-BB PHI CSE in InstCombine
2. I need to check if what other patterns remain, and how they can be resolved.
(i.e. i wonder if `foldAggregateConstructionIntoAggregateReuse()` might go away)
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D86530
This patch helps getGuaranteedNonPoisonOp find multiple non-poison operands.
Instead of special-casing llvm.assume, I think it is also a viable option to
add noundef to Intrinsics.td. If it makes sense, I'll make a patch for that.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D86477
The 1st attempt (rG557b890) was reverted because it caused miscompiles.
That bug is avoided here by changing the order of folds and as verified
in the new tests.
Original commit message:
InstCombine currently has odd rules for folding insert-extract chains to shuffles,
so we miss collapsing seemingly simple cases as shown in the tests here.
But poison makes this not quite as easy as we might have guessed. Alive2 tests to
show the subtle difference (similar to the regression tests):
https://alive2.llvm.org/ce/z/hp4hv3 (this is ok)
https://alive2.llvm.org/ce/z/ehEWaN (poison leakage)
SLP tends to create these patterns (as shown in the SLP tests), and this could
help with solving PR16739.
Differential Revision: https://reviews.llvm.org/D86460
This adapts LV to the new semantics of get.active.lane.mask as discussed in
D86147, which means that the LV now emits intrinsic get.active.lane.mask with
the loop tripcount instead of the backedge-taken count as its second argument.
The motivation for this is described in D86147.
Differential Revision: https://reviews.llvm.org/D86304
In getCastInstrCost when the instruction is a truncate we were relying
upon the implicit TypeSize -> uint64_t cast when asking if a given type
has the same size as a legal integer. I've changed the code to only
ask the question if the type is fixed length.
I have also changed InstCombinerImpl::SimplifyDemandedUseBits to bail
out for now if the type is a scalable vector.
I've added the following new tests:
Analysis/CostModel/AArch64/sve-trunc.ll
Transforms/InstCombine/AArch64/sve-trunc.ll
for both of these fixes.
Differential revision: https://reviews.llvm.org/D86432
Currently we repeatedly check the same uses for read clobbers in some
cases. We can avoid unnecessary checks by keeping track of the memory
accesses we already found read clobbers for. To do so, we just add
memory access causing read-clobbers to a set. Note that marking all
visited accesses as read-clobbers would be to pessimistic, as that might
include accesses not on any path to the actual read clobber.
If we do not find any read-clobbers, we can add all visited instructions
to another set and use that to skip the same accesses in the next call.
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D75025
InstCombine currently has odd rules for folding insert-extract chains to shuffles,
so we miss collapsing seemingly simple cases as shown in the tests here.
But poison makes this not quite as easy as we might have guessed. Alive2 tests to
show the subtle difference (similar to the regression tests):
https://alive2.llvm.org/ce/z/hp4hv3 (this is ok)
https://alive2.llvm.org/ce/z/ehEWaN (poison leakage)
SLP tends to create these patterns (as shown in the SLP tests), and this could
help with solving PR16739.
Differential Revision: https://reviews.llvm.org/D86460
The "takeName" logic at the end of ScalarizerVisitor::finish
could end up renaming global variables when having simplified
and extractelement instruction to simply pick a single vector
element. If the input vector to the extractelement instruction
held pointers to global variables we ended up renaming the global
variable.
The patch make sure we only take the name of the replaced Op when
we have added new instructions that might need a useful name.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D86472
This reverts commit 2e43acfed8.
LLVMCoroutines (the library which contains Coroutines.h) depends on LLVMipo (the
library which contains SampleProfile.cpp). It is inappropriate for
SampleProfile.cpp to depent on Coroutines.h (circular dependency).
The test inverted dependencies as well:
llvm/test/Transforms/Coroutines/coro-inline.ll uses -sample-profile.
Using callCapturesBefore potentially improves the precision and the
number of stores we can remove. But in practice, it seems to have very
little impact in terms of stores removed. For example, for
SPEC2000/SPEC2006/MultiSource with -O3 -flto, ~50 more stores are
removed (out of ~26900 stores removed). But in terms of compile-time, it
is very expensive and the patch gives substantial compile-time
improvements: Geomean O3 -0.24%, ReleaseThinLTO -0.47%, ReleaseLTO-g
-0.39%.
http://llvm-compile-time-tracker.com/compare.php?from=612a0bff88ed906c83b82f079d4c49e5fecfb9d0&to=e6c86b96d20d97dd88e903a409bd8d39b6114312&stat=instructions
summary:
When callee coroutine function is inlined into caller coroutine
function before coro-split pass, llvm will emits "coroutine should
have exactly one defining @llvm.coro.begin". It seems that coro-early
pass can not handle this quiet well.
So we believe that unsplited coroutine function should not be inlined.
This patch fix such issue by not inlining function if it has attribute
"coroutine.presplit" (it means the function has not been splited) to
fix this issue
TestPlan: check-llvm
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D85812
Changes:
* Change `ToVectorTy` to deal directly with `ElementCount` instances.
* `VF == 1` replaced with `VF.isScalar()`.
* `VF > 1` and `VF >=2` replaced with `VF.isVector()`.
* `VF <=1` is replaced with `VF.isZero() || VF.isScalar()`.
* Replaced the uses of `llvm::SmallSet<ElementCount, ...>` with
`llvm::SmallSetVector<ElementCount, ...>`. This avoids the need of an
ordering function for the `ElementCount` class.
* Bits and pieces around printing the `ElementCount` to string streams.
To guarantee that this change is a NFC, `VF.Min` and asserts are used
in the following places:
1. When it doesn't make sense to deal with the scalable property, for
example:
a. When computing unrolling factors.
b. When shuffle masks are built for fixed width vector types
In this cases, an
assert(!VF.Scalable && "<mgs>") has been added to make sure we don't
enter coepaths that don't make sense for scalable vectors.
2. When there is a conscious decision to use `FixedVectorType`. These
uses of `FixedVectorType` will likely be removed in favour of
`VectorType` once the vectorizer is generic enough to deal with both
fixed vector types and scalable vector types.
3. When dealing with building constants out of the value of VF, for
example when computing the vectorization `step`, or building vectors
of indices. These operation _make sense_ for scalable vectors too,
but changing the code in these places to be generic and make it work
for scalable vectors is to be submitted in a separate patch, as it is
a functional change.
4. When building the potential VFs in VPlan. Making the VPlan generic
enough to handle scalable vectorization factors is a functional change
that needs a separate patch. See for example `void
LoopVectorizationPlanner::buildVPlans(unsigned MinVF, unsigned
MaxVF)`.
5. The class `IntrinsicCostAttribute`: this class still uses `unsigned
VF` as updating the field to use `ElementCount` woudl require changes
that could result in changing the behavior of the compiler. Will be done
in a separate patch.
7. When dealing with user input for forcing the vectorization
factor. In this case, adding support for scalable vectorization is a
functional change that migh require changes at command line.
Note that in some places the idiom
```
unsigned VF = ...
auto VTy = FixedVectorType::get(ScalarTy, VF)
```
has been replaced with
```
ElementCount VF = ...
assert(!VF.Scalable && ...);
auto VTy = VectorType::get(ScalarTy, VF)
```
The assertion guarantees that the new code is (at least in debug mode)
functionally equivalent to the old version. Notice that this change had been
possible because none of the methods that are specific to `FixedVectorType`
were used after the instantiation of `VTy`.
Reviewed By: rengolin, ctetreau
Differential Revision: https://reviews.llvm.org/D85794
Changes:
* Change `ToVectorTy` to deal directly with `ElementCount` instances.
* `VF == 1` replaced with `VF.isScalar()`.
* `VF > 1` and `VF >=2` replaced with `VF.isVector()`.
* `VF <=1` is replaced with `VF.isZero() || VF.isScalar()`.
* Add `<` operator to `ElementCount` to be able to use
`llvm::SmallSetVector<ElementCount, ...>`.
* Bits and pieces around printing the ElementCount to string streams.
* Added a static method to `ElementCount` to represent a scalar.
To guarantee that this change is a NFC, `VF.Min` and asserts are used
in the following places:
1. When it doesn't make sense to deal with the scalable property, for
example:
a. When computing unrolling factors.
b. When shuffle masks are built for fixed width vector types
In this cases, an
assert(!VF.Scalable && "<mgs>") has been added to make sure we don't
enter coepaths that don't make sense for scalable vectors.
2. When there is a conscious decision to use `FixedVectorType`. These
uses of `FixedVectorType` will likely be removed in favour of
`VectorType` once the vectorizer is generic enough to deal with both
fixed vector types and scalable vector types.
3. When dealing with building constants out of the value of VF, for
example when computing the vectorization `step`, or building vectors
of indices. These operation _make sense_ for scalable vectors too,
but changing the code in these places to be generic and make it work
for scalable vectors is to be submitted in a separate patch, as it is
a functional change.
4. When building the potential VFs in VPlan. Making the VPlan generic
enough to handle scalable vectorization factors is a functional change
that needs a separate patch. See for example `void
LoopVectorizationPlanner::buildVPlans(unsigned MinVF, unsigned
MaxVF)`.
5. The class `IntrinsicCostAttribute`: this class still uses `unsigned
VF` as updating the field to use `ElementCount` woudl require changes
that could result in changing the behavior of the compiler. Will be done
in a separate patch.
7. When dealing with user input for forcing the vectorization
factor. In this case, adding support for scalable vectorization is a
functional change that migh require changes at command line.
Differential Revision: https://reviews.llvm.org/D85794
Avoid computing InvisibleToCallerBefore/AfterRet up front. In most
cases, this information is not really needed. Instead, introduce helper
functions to compute and cache the result on demand.
Notably, this also does not use PointerMayBeCapturedBefore for
isInvisibleToCallerBeforeRet, as it requires the killing MemoryDef as
starting instruction, making the caching ineffective. But it appears the
use of PointerMayBeCapturedBefore has very limited benefits in practice
(e.g. on SPEC2000/SPEC2006/MultiSource there are no binary changes with
-O3 -flto). Refrain from using it for now, to limit-compile-time.
This gives some nice compile-time improvements:
http://llvm-compile-time-tracker.com/compare.php?from=db9345f6810f379a36752dc52caf5230585d0ebd&to=b4d091047e1b8a3d377d200137b79d03aca65663&stat=instructions
Limit elimination of stores at the end of a function to MemoryDefs with
a single underlying object, to save compile time.
In practice, the case with multiple underlying objects seems not very
important in practice. For -O3 -flto on MultiSource/SPEC2000/SPEC2006
this results in a total of 2 more stores being eliminated.
We can always re-visit that in the future.
Similar to the existing transform - peek through a select
to match a value and its negation.
https://alive2.llvm.org/ce/z/MXi5KG
define i8 @src(i1 %b, i8 %x) {
%0:
%neg = sub i8 0, %x
%sel = select i1 %b, i8 %x, i8 %neg
%abs = abs i8 %sel, 1
ret i8 %abs
}
=>
define i8 @tgt(i1 %b, i8 %x) {
%0:
%abs = abs i8 %x, 1
ret i8 %abs
}
Transformation seems to be correct!
This reverses the existing transform that would uniformly canonicalize any 'xor' after any shift. In the case of logical shifts, that turns a 'not' into an arbitrary 'xor' with constant, and that's probably not as good for analysis, SCEV, or codegen.
The SCEV motivating case is discussed in:
http://bugs.llvm.org/PR47136
There's an analysis motivating case at:
http://bugs.llvm.org/PR38781
I did draft a patch that would do the same for 'ashr' but that's questionable because it's just swapping the position of a 'not' and uncovers at least 2 missing folds that we would probably need to deal with as preliminary steps.
Alive proofs:
https://rise4fun.com/Alive/BBV
Name: shift right of 'not'
Pre: C2 == (-1 u>> C1)
%a = lshr i8 %x, C1
%r = xor i8 %a, C2
=>
%n = xor i8 %x, -1
%r = lshr i8 %n, C1
Name: shift left of 'not'
Pre: C2 == (-1 << C1)
%a = shl i8 %x, C1
%r = xor i8 %a, C2
=>
%n = xor i8 %x, -1
%r = shl i8 %n, C1
Name: ashr of 'not'
%a = ashr i8 %x, C1
%r = xor i8 %a, -1
=>
%n = xor i8 %x, -1
%r = ashr i8 %n, C1
Differential Revision: https://reviews.llvm.org/D86243
As disscussed in post-commit review starting with
https://reviews.llvm.org/D84108#2227365
while this appears to be mostly a win overall, especially code-size-wise,
this appears to shake //certain// code pattens in a way that is extremely
unfavorable for performance (+30% runtime regression)
on certain CPU's (i personally can't reproduce).
So until the behaviour is better understood, and a path forward is mapped,
let's back this out for now.
This reverts commit 1d51dc38d8.
- Adds a command line option to seed only selected functions.
- Makes seed allow listing exclusive to assertions enabled builds.
Reviewed By: sstefan1
Differential Revision: https://reviews.llvm.org/D86129