If a function had at most one return block, the pass would return false
regardless if an unified unreachable block was created.
This patch fixes that by refactoring runOnFunction into two separate
helper functions for handling the unreachable blocks respectively the
return blocks, as suggested by @bjope in a review comment.
This was caught using the check introduced by D80916.
Reviewed By: serge-sans-paille
Differential Revision: https://reviews.llvm.org/D85818
This patch follows D85345 and adds more noundef attributes to return values/arguments of library functions
that are mostly about accessing the file system or processes.
A few functions like `chmod` or `times` use typedef `mode_t` and `clock_t`.
They are neither struct nor union, so they cannot contain undef even if they're lowered to iN in IR. So, it is fine to add noundef to them.
- clock_t's actual type is size_t (C17, 7.27.1.3), so it isn't struct or union.
- For mode_t, either int or long is used in practice because programmers use bit manipulation. So, I think it is okay that it's never aggregate in practice.
After this patch, the remaining library functions are those that eagerly participate in optimizations: they can be removed, reordered, or
introduced by a transformation from primitive IR operations.
For them, a few testings is needed, since it may not be valid to add noundef anymore even if C standard says it's okay.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D85894
The MemorySSAWrapperPass depends on AAResultsWrapperPass and if
MemorySSA is preserved but AAResultsWrapperPass is not, this could lead
to a crash when updating the last user of the MemorySSAWrapperPass.
Alternatively AAResultsWrapperPass could be marked preserved by GVN, but
I am not sure if that would be safe. I am not sure what is required in
order to preserve AAResultsWrapperPass. At the moment, it seems like a
couple of passes that do similar transforms to GVN are preserving it.
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D87137
This commit cleans up the ::initialize method of various AAs in the
following ways:
- If an associated function is required, give up on declarations.
This was discovered as a real problem when lots of llvm.dbg.XXX
call sites were assumed `noreturn` until proven otherwise. That
does not make any sense and caused huge regressions and missed
deductions.
- Require more associated declarations for function interface AAs.
- Use the IRAttribute::initialize to determine if function interface
AAs can be used in IPO, don't replicate the checks (especially
isFunctionIPOAmendable) all over the place. Arguably the function
declaration check should be moved to some central place to.
As we handle callback calls we need to disambiguate the call site
argument number from the callee argument number. While always equal in
non-callback calls, a callback comes with a partial parameter-argument
mapping so there is no implicit correspondence. Here we split
`IRPosition::getArgNo()` into two public functions, `getCallSiteArgNo()`
and `getCalleeArgNo()`. Usages are adjusted to pick the right one for
their purpose. This fixed some problems that would have been exposed as
we more aggressively optimize callbacks.
This was disabled as we were looking for a weird CGSCC problem. I
think/hope we fixed it as there were a lot of updates recently. I could
never reproduce this locally so I'll use the pre-commit phab builds to
confirm this suspicion and if they seem to be happy I'll assume this is
fixed.
Reviewed By: sstefan1
Differential Revision: https://reviews.llvm.org/D87266
In `MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.cpp` we initialized
attributes until stack frame ~35k caused space to run out. The initial
size 1024 is pretty much random.
If we know that the abs operand is known negative, we can replace
it with a neg.
To avoid computing known bits twice, I've removed the fold for the
non-negative case from InstSimplify. Both the non-negative and the
negative case are handled by InstCombine now, with one known bits call.
Differential Revision: https://reviews.llvm.org/D87196
D66230 attempted to fix a problem where when there are allocas used before CoroBegin.
It keeps allocas and their uses stay in put if there are no escapse/changes to the data before CoroBegin.
Unfortunately that's incorrect.
Consider this code:
%var = alloca i32
%1 = getelementptr .. %var; stays put
%f = call i8* @llvm.coro.begin
store ... %1
After this fix, %1 will now stay put, however if a store happens after coro.begin and hence modifies the content, this change will not be reflected in the coroutine frame (and will eventually be DCEed).
To generalize the problem, if any alias ptr is created before coro.begin for an Alloca and that alias ptr is latter written into after coro.begin, it will lead to incorrect behavior.
There are also a few other minor issues, such as incorrect dominate condition check in the ptr visitor, unhandled memory intrinsics and etc.
Ths patch attempts to fix some of these issue, and make it more robust to deal with aliases.
While visiting through the alloca pointer, we also keep track of all aliases created that will be used after CoroBegin. We track the offset of each alias, and then reacreate these aliases after CoroBegin using these offset.
It's worth noting that this is not perfect and there will still be cases we cannot handle. I think it's impractical to handle all cases given the current design.
This patch makes it more robust and should be a pure win.
In the meantime, we need to think about what how to completely elimiante these issues, likely through the route as @rjmccall mentioned in D66230.
Differential Revision: https://reviews.llvm.org/D86859
In GenerateConstantOffsetsImpl, we may generate non canonical Formula
if BaseRegs of that Formula is updated and includes a recurrent expr reg
related with current loop while its ScaledReg is not.
Patched by: mdchen
Reviewed By: qcolombet
Differential Revision: https://reviews.llvm.org/D86939
The CloneFunctionInto has implicit requirements with regards to the
linkage and visibility of the function. We now update these after we did
the CloneFunctionInto on the copy with the same linkage and visibility
as the original.
One run line was different and probably introduced for the manually
added function attribute & name checks. We can do this with the script
and a check prefix used for the other run lines as well.
This was reverted in 503deec218
because it caused gigantic increase (3x) in branch mispredictions
in certain benchmarks on certain CPU's,
see https://reviews.llvm.org/D84108#2227365.
It has since been investigated and here are the results:
https://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20200907/827578.html
> It's an amazingly severe regression, but it's also all due to branch
> mispredicts (about 3x without this). The code layout looks ok so there's
> probably something else to deal with. I'm not sure there's anything we can
> reasonably do so we'll just have to take the hit for now and wait for
> another code reorganization to make the branch predictor a bit more happy :)
>
> Thanks for giving us some time to investigate and feel free to recommit
> whenever you'd like.
>
> -eric
So let's just reland this.
Original commit message:
I've been looking at missed vectorizations in one codebase.
One particular thing that stands out is that some of the loops
reach vectorizer in a rather mangled form, with weird PHI's,
and some of the loops aren't even in a rotated form.
After taking a more detailed look, that happened because
the loop's headers were too big by then. It is evident that
SimplifyCFG's common code hoisting transform is at fault there,
because the pattern it handles is precisely the unrotated
loop basic block structure.
Surprizingly, `SimplifyCFGOpt::HoistThenElseCodeToIf()` is enabled
by default, and is always run, unlike it's friend, common code sinking
transform, `SinkCommonCodeFromPredecessors()`, which is not enabled
by default and is only run once very late in the pipeline.
I'm proposing to harmonize this, and disable common code hoisting
until //late// in pipeline. Definition of //late// may vary,
here currently i've picked the same one as for code sinking,
but i suppose we could enable it as soon as right after
loop rotation happens.
Experimentation shows that this does indeed unsurprizingly help,
more loops got rotated, although other issues remain elsewhere.
Now, this undoubtedly seriously shakes phase ordering.
This will undoubtedly be a mixed bag in terms of both compile- and
run- time performance, codesize. Since we no longer aggressively
hoist+deduplicate common code, we don't pay the price of said hoisting
(which wasn't big). That may allow more loops to be rotated,
so we pay that price. That, in turn, that may enable all the transforms
that require canonical (rotated) loop form, including but not limited to
vectorization, so we pay that too. And in general, no deduplication means
more [duplicate] instructions going through the optimizations. But there's still
late hoisting, some of them will be caught late.
As per benchmarks i've run {F12360204}, this is mostly within the noise,
there are some small improvements, some small regressions.
One big regression i saw i fixed in rG8d487668d09fb0e4e54f36207f07c1480ffabbfd, but i'm sure
this will expose many more pre-existing missed optimizations, as usual :S
llvm-compile-time-tracker.com thoughts on this:
http://llvm-compile-time-tracker.com/compare.php?from=e40315d2b4ed1e38962a8f33ff151693ed4ada63&to=c8289c0ecbf235da9fb0e3bc052e3c0d6bff5cf9&stat=instructions
* this does regress compile-time by +0.5% geomean (unsurprizingly)
* size impact varies; for ThinLTO it's actually an improvement
The largest fallout appears to be in GVN's load partial redundancy
elimination, it spends *much* more time in
`MemoryDependenceResults::getNonLocalPointerDependency()`.
Non-local `MemoryDependenceResults` is widely-known to be, uh, costly.
There does not appear to be a proper solution to this issue,
other than silencing the compile-time performance regression
by tuning cut-off thresholds in `MemoryDependenceResults`,
at the cost of potentially regressing run-time performance.
D84609 attempts to move in that direction, but the path is unclear
and is going to take some time.
If we look at stats before/after diffs, some excerpts:
* RawSpeed (the target) {F12360200}
* -14 (-73.68%) loops not rotated due to the header size (yay)
* -272 (-0.67%) `"Number of live out of a loop variables"` - good for vectorizer
* -3937 (-64.19%) common instructions hoisted
* +561 (+0.06%) x86 asm instructions
* -2 basic blocks
* +2418 (+0.11%) IR instructions
* vanilla test-suite + RawSpeed + darktable {F12360201}
* -36396 (-65.29%) common instructions hoisted
* +1676 (+0.02%) x86 asm instructions
* +662 (+0.06%) basic blocks
* +4395 (+0.04%) IR instructions
It is likely to be sub-optimal for when optimizing for code size,
so one might want to change tune pipeline by enabling sinking/hoisting
when optimizing for size.
Reviewed By: mkazantsev
Differential Revision: https://reviews.llvm.org/D84108
This reverts commit 503deec218.
For intrinsics supported by ConstantRange, compute the result range
based on the argument ranges. We do this independently of whether
some or all of the input ranges are full, as we can often still
constrain the result in some way.
Differential Revision: https://reviews.llvm.org/D87183
This was supposed to be an NFC cleanup, but there's
a real logic difference (did not drop 'nsw') visible
in some tests in addition to an efficiency improvement.
This is because in the case where we have 2 GEPs,
the code was *always* swapping the operands and
negating the result. But if we have 2 GEPs, we
should *never* need swapping/negation AFAICT.
This is part of improving flags propagation noticed
with PR47430.
Normal dead code elimination ignores assume intrinsics, so we fail to
delete assumes that are not meaningful (and potentially worse if they
cause conflicts with other assumptions).
The motivating example in https://llvm.org/PR47416 suggests that we
might have problems upstream from here (difference between C and C++),
but this should be a cheap way to make sure we remove more dead code.
Differential Revision: https://reviews.llvm.org/D87149
Modify FoldBranchToCommonDest to consider the cost of inserting
instructions when attempting to combine predicates to fold blocks.
The threshold can be controlled via a new option:
-simplifycfg-branch-fold-threshold which defaults to '2' to allow
the insertion of a not and another logical operator.
Differential Revision: https://reviews.llvm.org/D86526
This addresses the remaining issue from D87188. Due to a series of
folds, we may end up with abs-of-abs represented as
x == 0 ? -abs(x) : abs(x). Rather than recognizing this as a special
abs pattern and doing an abs-of-abs fold on it afterwards,
I'm directly folding this to one of the select operands in InstSimplify.
The general pattern falls into the "select with operand replaced"
category, but that fold is not powerful enough to recognize that
both hands of the select are the same for value zero.
Differential Revision: https://reviews.llvm.org/D87197
This adjusts the description of `llvm.memcpy` to also allow operands
to be equal. This is in line with what Clang currently expects.
This change is intended to be temporary and followed by re-introduce
a variant with the non-overlapping guarantee for cases where we can
actually ensure that property in the front-end.
See the links below for more details:
http://lists.llvm.org/pipermail/cfe-dev/2020-August/066614.html
and PR11763.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D86815
Similar to D87168, but for abs. If we have a dominating x >= 0
condition, then we know that abs(x) is x. This fold is in
InstCombine, because we need to create a sub instruction for
the x < 0 case.
Differential Revision: https://reviews.llvm.org/D87184
If we have a dominating condition that x >= y, then umax(x, y) is x,
etc. I'm doing this in InstSimplify as the corresponding transform
for the select form is also done there.
Differential Revision: https://reviews.llvm.org/D87168
When a switch case is folded into default's case, that's an IR change that
should be reported, update ConstantFoldTerminator accordingly.
Differential Revision: https://reviews.llvm.org/D87142
Some tests from multibuild-malloc-free.ll do not actually use malloc or
free and where split out to multiblock-throwing.ll, but not removed from
the original file. This patch cleans that up. It also moves @test22 to
simple.ll, because it does not involve multiple blocks.
When checking call sites, give special handling to indirect call, as the
callee may be unknown and can lead to nullptr dereference later. Assume
conservatively that the ICV always changes in such case.
Reviewed By: sstefan1
Differential Revision: https://reviews.llvm.org/D87104
As discussed in D86843, -earlycse-debug-hash should be used in more regression
tests to catch inconsistency between the hashing and the equivalence check.
Differential Revision: https://reviews.llvm.org/D86863
This patch moves the tests for the old MemDepAnalysis based DSE
implementation to the MemDepAnalysis subdirectory and updates them to
pass -enable-dse-memoryssa=false.
This is in preparation for the switch to MemorySSA-backed DSE.
This patch updates MemCpyOpt to preserve MemorySSA. It uses the
MemoryDef at the insertion point of the builder and inserts the new def
after that def.
In some cases, we just modify a memory instruction. In that case, get
the defining access, then remove the memory access and add a new one.
If the defining access is in a different block, insert a new def at the
beginning of the current block, otherwise after the defining access.
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D86651
Although LLVM supports vectorization of loops containing log2, it did not support using SVML implementation of it. Added support so that when clang is invoked with -fveclib=SVML now an appropriate SVML library log2 implementation will be invoked.
Follow up on: https://reviews.llvm.org/D77114
Tests:
Added unit tests to svml-calls.ll, svml-calls-finite.ll. Can be run with llvm-lint.
Created a simple c++ file that tests log2, and used clang+ to build it, and output final assembly.
Reviewed By: wenlei, craig.topper
Differential Revision: https://reviews.llvm.org/D86730
Preserve MemorySSA if it is available before running GVN.
DSE with MemorySSA will run closely after GVN. If GVN and 2 other
passes preserve MemorySSA, DSE can re-use MemorySSA used by LICM
when doing LTO.
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D86534
Currently IPSCCP (and others like CVP/GVN) blindly propagate pointer
equalities. In certain cases, that leads to dereferenceable pointers
being replaced, as in the example test case.
I think this is not allowed, as it introduces an access of an
un-dereferenceable pointer. Note that the pointer is inbounds, but one
past the last element, so it is valid, but not dereferenceable.
This patch is mostly to highlight the issue and start a discussion.
Currently it only checks for specifically looking
one-past-the-last-element pointers with array typed bases.
This causes the mis-compile outlined in
https://stackoverflow.com/questions/55754313/is-this-gcc-clang-past-one-pointer-comparison-behavior-conforming-or-non-standar
In the test case, if we replace %p with the GEP for the store, we
subsequently determine that the store and the load cannot alias, because
they are to different underlying objects.
Note that Alive2 seems to think that the replacement is valid:
https://alive2.llvm.org/ce/z/2rorhk
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D85332
This helps SelectionDAGBuilder recognize the splat can be used as a uniform base.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D86371
First, shuffle cost for scalable type is not known for scalable type;
Second, we cannot reason if the narrowed shuffle mask for scalable type
is a splat or not.
E.g., Bitcast splat vector from type <vscale x 4 x i32> to <vscale x 8 x i16>
will involve narrowing shuffle mask <vscale x 4 x i32> zeroinitializer to
<vscale x 8 x i32> with element sequence of <0, 1, 0, 1, ...>, which cannot be
reasoned if it's a valid splat or not.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D86995
In IPSCCP when a function is optimized to return undef, it should clear the returned attribute for all its input arguments
and its corresponding call sites.
The bug is exposed when the value of an input argument of the function is assigned to a physical register and
because of the argument having a returned attribute, the value of this physical register will continue to be used
as the function return value right after the call instruction returns, even if the value that this register holds may
be clobbered during the function call. This potentially results in incorrect values being used afterwards.
Reviewed By: jdoerfert, fhahn
Differential Revision: https://reviews.llvm.org/D84220
When marking a global variable constant, and simplifying users using
CleanupConstantGlobalUsers(), the pass could incorrectly return false if
there were still some uses left, and no further optimizations was done.
This was caught using the check introduced by D80916.
This fixes PR46749.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D85837
These transforms will now be performed irrespective of the number of uses for the expression "1.0/sqrt(X)":
1.0/sqrt(X) * X => X/sqrt(X)
X * 1.0/sqrt(X) => X/sqrt(X)
We already handle more general cases, and we are intentionally not creating extra (and likely expensive)
fdiv ops in IR. This pattern is the exception to the rule because we always expect the Backend to reduce
X/sqrt(X) to sqrt(X), if it has the necessary (reassoc) fast-math-flags.
Ref: DagCombiner optimizes the X/sqrt(X) to sqrt(X).
Differential Revision: https://reviews.llvm.org/D86726
This is an enhancement to D81766 to allow loading the minimum target
vector type into an IR vector with a different number of elements.
In one of the motivating tests from PR16739, SLP creates <2 x float>
load ops mixed with <4 x float> insert ops, so we want to handle that
pattern in addition to potential oversized vectors created by the
vectorizers.
For now, we are assuming the insert/extract subvector with undef is
free because there is no exact corresponding TTI modeling for that.
Differential Revision: https://reviews.llvm.org/D86160
This patch makes it possible for AAUB to use information from AANoUndef.
This is the next patch of D86983
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D86984
Interleave for small loops that have reductions inside,
which breaks dependencies and expose.
This gives very significant performance improvements for some benchmarks.
Because small loops could be in very hot functions in real applications.
Differential Revision: https://reviews.llvm.org/D81416
Loop Idiom Recognize Pass (LIRP) attempts to transform loops with subscripted arrays
into memcpy/memset function calls. In some particular situation, this transformation
introduces negative impacts. For example: https://bugs.llvm.org/show_bug.cgi?id=47300
This patch will enable users to disable a particular part of the transformation, while
he/she can still enjoy the benefit brought about by the rest of LIRP. The default
behavior stays unchanged: no part of LIRP is disabled by default.
Reviewed By: etiotto (Ettore Tiotto)
Differential Revision: https://reviews.llvm.org/D86262
getValuesInOffloadArrays goes through the offload arrays in __tgt_target_data_begin_mapper getting the values stored in them before the call is issued.
call void @__tgt_target_data_begin_mapper(arg0, arg1,
i8** %offload_baseptrs, i8** %offload_ptrs, i64* %offload_sizes,
...)
Diferential Revision: https://reviews.llvm.org/D86300
The 1st try was reverted because I missed an assert that
needed softening.
As discussed in D86798 / rG09652721 , we were potentially
returning a different result for whether an Instruction
is commutable depending on if we call the base class or
derived class method.
This requires relaxing asserts in GVN, but that pass
seems to be working otherwise.
NewGVN requires more work because it uses different
code paths for numbering binops and calls.
As discussed in D86798 / rG09652721 , we were potentially
returning a different result for whether an Instruction
is commutable depending on if we call the base class or
derived class method.
This requires relaxing an assert in GVN, but that pass
seems to be working otherwise.
NewGVN requires more work because it uses different
code paths for numbering binops and calls.
addRuntimeChecks uses SCEVExpander, which relies on the DT/LoopInfo to
be up-to-date. Changing the CFG afterwards may invalidate some inserted
instructions, especially LCSSA phis.
Reorder the code to first update the CFG and then create the runtime
checks. This should not have any impact on the generated code, as we
adjust the CFG and generate runtime checks together.
Fixes PR47343.
The problem with module slice has been addressed in D86319
Introduce two new AAs. AAICVTrackerFunctionReturned which checks if a
function can have a unique ICV value after it is finished, and
AAICVCallSiteReturned which checks AAICVTrackerFunctionReturned for a
call site. This enables us to check the value of a call and if it
changes the ICV. This also changes the approach in
`getReplacementValues()` to a worklist-based approach so we can explore
all relevant BBs.
Differential Revision: https://reviews.llvm.org/D85544
Summary:
The module slice describes which functions we can analyze and transform
while working on an SCC as part of the Attributor-CGSCC pass. So far we
simply restricted it to the SCC.
Reviewers: jdoerfert
Differential Revision: https://reviews.llvm.org/D86319
This is the next patch of D86842
When we check `noundef` attribute violation at callsites, we do not have to require `nonnull` in the following two cases.
1. An argument is known to be simplified to undef
2. An argument is known to be dead
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D86845
Even though `noundef` IR attribute might be attached to non-void type values, AANoUndef is mistakenly identified for pointer type values only.
This patch fixes that.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D86737
Replace the check for poison-producing instructions in
SimplifyWithOpReplaced() with the generic helper canCreatePoison()
that properly handles poisonous shifts and thus avoids the problem
from PR47322.
This additionally fixes a bug in IIQ.UseInstrInfo=false mode, which
previously could have caused this code to ignore poison flags.
Setting UseInstrInfo=false should reduce the possible optimizations,
not increase them.
This is not a full solution to the problem, as poison could be
introduced more indirectly. This is just a minimal, easy to backport
fix.
Differential Revision: https://reviews.llvm.org/D86834
This got reverted because a dependency was reverted. It has since
been reapplied, so reapply this as well.
-----
Related to D69686. As noted there, LVI currently behaves differently
for integer and pointer values: For integers, the block value is always
valid inside the basic block, while for pointers it is only valid at
the end of the basic block. I believe the integer behavior is the
correct one, and CVP relies on it via its getConstantRange() uses.
The reason for the special pointer behavior is that LVI checks whether
a pointer is dereferenced in a given basic block and marks it as
non-null in that case. Of course, this information is valid only after
the dereferencing instruction, or in conservative approximation,
at the end of the block.
This patch changes the treatment of dereferencability: Instead of
including it inside the block value, we instead treat it as something
similar to an assume (it essentially is a non-nullness assume) and
incorporate this information in intersectAssumeOrGuardBlockValueConstantRange()
if the context instruction is the terminator of the basic block.
This happens either when determining an edge-value internally in LVI,
or when a terminator was explicitly passed to getValueAt(). The latter
case makes this more powerful than the previous implementation as
a side-effect, and this does actually seem benefitial in practice.
Of course, we do not want to recompute dereferencability on each
intersectAssume call, so we need a new cache for this. The
dereferencability analysis requires walking the entire basic block
and computing underlying objects of all memory operands. This was
previously done separately for each queried pointer value. In the
new implementation (both because this makes the caching simpler,
and because it is faster), I instead only walk the full BB once and
cache all the dereferenced pointers. So the traversal is now performed
only once per BB, instead of once per queried pointer value.
I think the overall model now makes more sense than before, and there
will be no more pitfalls due to differing integer/pointer behavior.
Differential Revision: https://reviews.llvm.org/D69914
Handling the new min/max intrinsics is the motivation, but it
turns out that we have a bunch of other intrinsics with this
missing bit of analysis too.
The FP min/max tests show that we are intersecting FMF,
so that part should be safe too.
As noted in https://llvm.org/PR46897 , there is a commutative
property specifier for intrinsics, but no corresponding function
attribute, and so apparently no uses of that bit. We may want to
remove that next.
Follow-up patches should wire up the Instruction::isCommutative()
to this IntrinsicInst specialization. That requires updating
callers to be aware of the more general commutative property
(not just binops).
Differential Revision: https://reviews.llvm.org/D86798
The original take 1 was 6102310d81,
which taught InstSimplify to do that, which seemed better at time,
since we got EarlyCSE support for free.
However, it was proven that we can not do that there,
the simplified-to PHI would not be reachable from the original PHI,
and that is not something InstSimplify is allowed to do,
as noted in the commit ed90f15efb
that reverted it:
> It appears to cause compilation non-determinism and caused stage3 mismatches.
Then there was take 2 3e69871ab5,
which was InstCombine-specific, but it again showed stage2-stage3 differences,
and reverted in bdaa3f86a0.
This is quite alarming.
Here, let's try to change how we find existing PHI candidate:
due to the worklist order, and the way PHI nodes are inserted
(it may be inserted as the first one, or maybe not), let's look at *all*
PHI nodes in the block.
Effects on vanilla llvm test-suite + RawSpeed:
```
| statistic name | baseline | proposed | Δ | % | \|%\| |
|----------------------------------------------------|-----------|-----------|-------:|---------:|---------:|
| asm-printer.EmittedInsts | 7942329 | 7942457 | 128 | 0.00% | 0.00% |
| assembler.ObjectBytes | 254295632 | 254312480 | 16848 | 0.01% | 0.01% |
| correlated-value-propagation.NumPhis | 18412 | 18347 | -65 | -0.35% | 0.35% |
| early-cse.NumCSE | 2183283 | 2183267 | -16 | 0.00% | 0.00% |
| early-cse.NumSimplify | 550105 | 541842 | -8263 | -1.50% | 1.50% |
| instcombine.NumAggregateReconstructionsSimplified | 73 | 4506 | 4433 | 6072.60% | 6072.60% |
| instcombine.NumCombined | 3640311 | 3644419 | 4108 | 0.11% | 0.11% |
| instcombine.NumDeadInst | 1778204 | 1783205 | 5001 | 0.28% | 0.28% |
| instcombine.NumPHICSEs | 0 | 22490 | 22490 | 0.00% | 0.00% |
| instcombine.NumWorklistIterations | 2023272 | 2024400 | 1128 | 0.06% | 0.06% |
| instcount.NumCallInst | 1758395 | 1758802 | 407 | 0.02% | 0.02% |
| instcount.NumInvokeInst | 59478 | 59502 | 24 | 0.04% | 0.04% |
| instcount.NumPHIInst | 330557 | 330545 | -12 | 0.00% | 0.00% |
| instcount.TotalBlocks | 1077138 | 1077220 | 82 | 0.01% | 0.01% |
| instcount.TotalFuncs | 101442 | 101441 | -1 | 0.00% | 0.00% |
| instcount.TotalInsts | 8831946 | 8832606 | 660 | 0.01% | 0.01% |
| simplifycfg.NumHoistCommonCode | 24186 | 24187 | 1 | 0.00% | 0.00% |
| simplifycfg.NumInvokes | 4300 | 4410 | 110 | 2.56% | 2.56% |
| simplifycfg.NumSimpl | 1019813 | 999767 | -20046 | -1.97% | 1.97% |
```
So it fires 22490 times, which is less than ~24k the take 1 did,
but more than what take 2 did (22228 times)
.
It allows foldAggregateConstructionIntoAggregateReuse() to actually work
after PHI-of-extractvalue folds did their thing. Previously SimplifyCFG
would have done this PHI CSE, of all places. Additionally, allows some
more `invoke`->`call` folds to happen (+110, +2.56%).
All in all, expectedly, this catches less things overall,
but all the motivational cases are still caught, so all good.
While the original variant with doing this in InstSimplify (rightfully)
caused questions and ultimately was detected to be a culprit
of stage2-stage3 mismatch, it was expected that
InstCombine-based implementation would be fine.
But apparently it's not, as
http://lab.llvm.org:8011/builders/clang-with-thin-lto-ubuntu/builds/24095/steps/compare-compilers/logs/stdio
suggests.
Which suggests that somewhere in InstCombine there is a loop
over nondeterministically sorted container, which causes
different worklist ordering.
This reverts commit 3e69871ab5.
The original take was 6102310d81,
which taught InstSimplify to do that, which seemed better at time,
since we got EarlyCSE support for free.
However, it was proven that we can not do that there,
the simplified-to PHI would not be reachable from the original PHI,
and that is not something InstSimplify is allowed to do,
as noted in the commit ed90f15efb
that reverted it :
> It appears to cause compilation non-determinism and caused stage3 mismatches.
However InstCombine already does many different optimizations,
so it should be a safe place to do it here.
Note that we still can't just compare incoming values ranges,
because there is no guarantee that these PHI's we'd simplify to
were already re-visited and sorted.
However coming up with a test is problematic.
Effects on vanilla llvm test-suite + RawSpeed:
```
| statistic name | baseline | proposed | Δ | % | |%| |
|----------------------------------------------------|-----------|-----------|-------:|---------:|---------:|
| instcombine.NumPHICSEs | 0 | 22228 | 22228 | 0.00% | 0.00% |
| asm-printer.EmittedInsts | 7942329 | 7942456 | 127 | 0.00% | 0.00% |
| assembler.ObjectBytes | 254295632 | 254313792 | 18160 | 0.01% | 0.01% |
| early-cse.NumCSE | 2183283 | 2183272 | -11 | 0.00% | 0.00% |
| early-cse.NumSimplify | 550105 | 541842 | -8263 | -1.50% | 1.50% |
| instcombine.NumAggregateReconstructionsSimplified | 73 | 4506 | 4433 | 6072.60% | 6072.60% |
| instcombine.NumCombined | 3640311 | 3666911 | 26600 | 0.73% | 0.73% |
| instcombine.NumDeadInst | 1778204 | 1783318 | 5114 | 0.29% | 0.29% |
| instcount.NumCallInst | 1758395 | 1758804 | 409 | 0.02% | 0.02% |
| instcount.NumInvokeInst | 59478 | 59502 | 24 | 0.04% | 0.04% |
| instcount.NumPHIInst | 330557 | 330549 | -8 | 0.00% | 0.00% |
| instcount.TotalBlocks | 1077138 | 1077221 | 83 | 0.01% | 0.01% |
| instcount.TotalFuncs | 101442 | 101441 | -1 | 0.00% | 0.00% |
| instcount.TotalInsts | 8831946 | 8832611 | 665 | 0.01% | 0.01% |
| simplifycfg.NumInvokes | 4300 | 4410 | 110 | 2.56% | 2.56% |
| simplifycfg.NumSimpl | 1019813 | 999740 | -20073 | -1.97% | 1.97% |
```
So it fires ~22k times, which is less than ~24k the take 1 did.
It allows foldAggregateConstructionIntoAggregateReuse() to actually work
after PHI-of-extractvalue folds did their thing. Previously SimplifyCFG
would have done this PHI CSE, of all places. Additionally, allows some
more `invoke`->`call` folds to happen (+110, +2.56%).
All in all, expectedly, this catches less things overall,
but all the motivational cases are still caught, so all good.
As discussed in https://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20200824/824235.html
even though it seems worthwhile doing so in InstSimplify,
we really can't do that there, because the other PHI wouldn't be
def-reachable from the original PHI.
So ignoring whether or not EarlyCSE should also know to do this,
InstCombine is the place.
instruction can decrement the reference count, not whether it can alter
it
This prevents the state transition from S_Use to S_CanRelease when doing
a bottom-up traversal and the transition from S_Retain to S_CanRelease
when doing a top-down traversal when the visited instruction can
increment the ref count but cannot decrement it. This allows the ARC
optimizer to remove retain/release pairs which were previously not
removed.
rdar://problem/21793154
Since doInitialization() in the legacy pass modifies the module, the NPM
pass is a Module pass.
Reviewed By: ahatanak, ychen
Differential Revision: https://reviews.llvm.org/D86178
This patch fixes this crash https://gcc.godbolt.org/z/Ps8d1e
And gives SROA the ability to remove assumes if it allows promoting an alloca to register
Without removing assumes when it can't promote to register.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D86570
This changes getDomMemoryDef to check if a Current is a valid
candidate for elimination before checking for reads. Before the change,
we were spending a lot of compile-time in checking for read accesses for
Current that might not even be removable.
This patch flips the logic, so we skip Current if they cannot be
removed before checking all their uses. This is much more efficient in
practice.
It also adds a more aggressive limit for checking partially overlapping
stores. The main problem with overlapping stores is that we do not know
if they will lead to elimination until seeing all of them. This patch
limits adds a new limit for overlapping store candidates, which keeps
the number of modified overlapping stores roughly the same.
This is another substantial compile-time improvement (while also
increasing the number of stores eliminated). Geomean -O3 -0.67%,
ReleaseThinLTO -0.97%.
http://llvm-compile-time-tracker.com/compare.php?from=0a929b6978a068af8ddb02d0d4714a2843dd8ba9&to=2e630629b43f64b60b282e90f0d96082fde2dacc&stat=instructions
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D86487
This patch adds support for memcmp in MemoryLocation::getForArgument.
memcmp reads from the first 2 arguments up to the number of bytes of the
third argument.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D86725
strspn, strncmp, strcspn, strcasecmp, strncasecmp, memcmp, memchr,
memrchr, memcpy, memmove, memcpy, mempcpy, strchr, strrchr, bcmp
should all only access memory through their arguments.
I broke out strcoll, strcasecmp, strncasecmp because the result
depends on the locale, which might get accessed through memory.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D86724
Even if noundef is deduced for a position, we should not manifest it when the position is dead.
This is because the associated values with dead positions are replaced with undef values by AAIsDead.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D86565
The NPM doesn't support call-site alwaysinline as described in the comments.
Also make NPM runs more similar to legacy PM runs.
Reviewed By: ychen, asbirlea
Differential Revision: https://reviews.llvm.org/D86663
Apparently, we don't do this, neither in EarlyCSE, nor in InstSimplify,
nor in (old) GVN, but do in NewGVN and SimplifyCFG of all places..
While i could teach EarlyCSE how to hash PHI nodes,
we can't really do much (anything?) even if we find two identical
PHI nodes in different basic blocks, same-BB case is the interesting one,
and if we teach InstSimplify about it (which is what i wanted originally,
https://reviews.llvm.org/D86530), we get EarlyCSE support for free.
So i would think this is pretty uncontroversial.
On vanilla llvm test-suite + RawSpeed, this has the following effects:
```
| statistic name | baseline | proposed | Δ | % | \|%\| |
|----------------------------------------------------|-----------|-----------|-------:|---------:|---------:|
| instsimplify.NumPHICSE | 0 | 23779 | 23779 | 0.00% | 0.00% |
| asm-printer.EmittedInsts | 7942328 | 7942392 | 64 | 0.00% | 0.00% |
| assembler.ObjectBytes | 273069192 | 273084704 | 15512 | 0.01% | 0.01% |
| correlated-value-propagation.NumPhis | 18412 | 18539 | 127 | 0.69% | 0.69% |
| early-cse.NumCSE | 2183283 | 2183227 | -56 | 0.00% | 0.00% |
| early-cse.NumSimplify | 550105 | 542090 | -8015 | -1.46% | 1.46% |
| instcombine.NumAggregateReconstructionsSimplified | 73 | 4506 | 4433 | 6072.60% | 6072.60% |
| instcombine.NumCombined | 3640264 | 3664769 | 24505 | 0.67% | 0.67% |
| instcombine.NumDeadInst | 1778193 | 1783183 | 4990 | 0.28% | 0.28% |
| instcount.NumCallInst | 1758401 | 1758799 | 398 | 0.02% | 0.02% |
| instcount.NumInvokeInst | 59478 | 59502 | 24 | 0.04% | 0.04% |
| instcount.NumPHIInst | 330557 | 330533 | -24 | -0.01% | 0.01% |
| instcount.TotalInsts | 8831952 | 8832286 | 334 | 0.00% | 0.00% |
| simplifycfg.NumInvokes | 4300 | 4410 | 110 | 2.56% | 2.56% |
| simplifycfg.NumSimpl | 1019808 | 999607 | -20201 | -1.98% | 1.98% |
```
I.e. it fires ~24k times, causes +110 (+2.56%) more `invoke` -> `call`
transforms, and counter-intuitively results in *more* instructions total.
That being said, the PHI count doesn't decrease that much,
and looking at some examples, it seems at least some of them
were previously getting PHI CSE'd in SimplifyCFG of all places..
I'm adjusting `Instruction::isIdenticalToWhenDefined()` at the same time.
As a comment in `InstCombinerImpl::visitPHINode()` already stated,
there are no guarantees on the ordering of the operands of a PHI node,
so if we just naively compare them, we may false-negatively say that
the nodes are not equal when the only difference is operand order,
which is especially important since the fold is in InstSimplify,
so we can't rely on InstCombine sorting them beforehand.
Fixing this for the general case is costly (geomean +0.02%),
and does not appear to catch anything in test-suite, but for
the same-BB case, it's trivial, so let's fix at least that.
As per http://llvm-compile-time-tracker.com/compare.php?from=04879086b44348cad600a0a1ccbe1f7776cc3cf9&to=82bdedb888b945df1e9f130dd3ac4dd3c96e2925&stat=instructions
this appears to cause geomean +0.03% compile time increase (regression),
but geomean -0.01%..-0.04% code size decrease (improvement).
Currently we bail out early for strlen calls with a GEP operand, if none
of the GEP specific optimizations fire. But there could be later
optimizations that still apply, which we currently miss out on.
An example is that we do not apply the following optimization
strlen(x) == 0 --> *x == 0
Unless I am missing something, there seems to be no reason for bailing
out early there.
Fixes PR47149.
Reviewed By: lebedev.ri, xbolva00
Differential Revision: https://reviews.llvm.org/D85886
For DSE with MemorySSA it is beneficial to manually traverse the
defining access, instead of using a MemorySSA walker, so we can
better control the number of steps together with other limits and
also weed out invalid/unprofitable paths early on.
This patch requires a follow-up patch to be most effective, which I will
share soon after putting this patch up.
This temporarily XFAIL's the limit tests, because we now explore more
MemoryDefs that may not alias/clobber the killing def. This will be
improved/fixed by the follow-up patch.
This patch also renames some `Dom*` variables to `Earlier*`, because the
dominance relation is not really used/important here and potentially
confusing.
This patch allows us to aggressively cut down compile time, geomean
-O3 -0.64%, ReleaseThinLTO -1.65%, at the expense of fewer stores
removed. Subsequent patches will increase the number of removed stores
again, while keeping compile-time in check.
http://llvm-compile-time-tracker.com/compare.php?from=d8e3294118a8c5f3f97688a704d5a05b67646012&to=0a929b6978a068af8ddb02d0d4714a2843dd8ba9&stat=instructions
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D86486
Currently, an undef value is reduced to 0 when it is added to a set of potential values.
This patch introduces a flag for under values. By this, for example, we can merge two states `{undef}`, `{1}` to `{1}` (because we can reduce the undef to 1).
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D85592
As discussed in
http://lists.llvm.org/pipermail/llvm-dev/2020-July/143801.html.
Currently no users outside of unit tests.
Replace all instances in tests of -constprop with -instsimplify.
Notable changes in tests:
* vscale.ll - @llvm.sadd.sat.nxv16i8 is evaluated by instsimplify, use a fake intrinsic instead
* InsertElement.ll - insertelement undef is removed by instsimplify in @insertelement_undef
llvm/test/Transforms/ConstProp moved to llvm/test/Transforms/InstSimplify/ConstProp
Reviewed By: lattner, nikic
Differential Revision: https://reviews.llvm.org/D85159
This is a reboot of D84655, now performing the inner icmp
simplification query without undef folds.
It should be possible to handle the current foldMinMaxSharedOp()
fold based on this, by moving the logic into icmp of min/max instead,
making it more general. We can't drop the folds for constant operands,
because those also allow undef, which we exclude here.
The tests use assumes for exhaustive coverage, and have a few
more examples of misc folds we get based on icmp simplification.
Differential Revision: https://reviews.llvm.org/D85929
InstSimplify should do all transformations that ConstProp does, but
one thing that ConstProp does that InstSimplify wouldn't is inline
vector instructions that are constants, e.g. into a ret.
Previously vector instructions wouldn't be inlined in InstSimplify
because llvm::Simplify*Instruction() would return nullptr for specific
instructions, such as vector instructions that were actually constants,
if it couldn't simplify them.
This changes SimplifyInsertElementInst, SimplifyExtractElementInst, and
SimplifyShuffleVectorInst to return a vector constant when possible.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D85946
and indirect call promotion candidate.
Profile remapping is a feature to match a function in the module with its
profile in sample profile if the function name and the name in profile look
different but are equivalent using given remapping rules. This is a useful
feature to keep the performance stable by specifying some remapping rules
when sampleFDO targets are going through some large scale function signature
change.
However, currently profile remapping support is only valid for outline
function profile in SampleFDO. It cannot match a callee with an inline
instance profile if they have different but equivalent names. We found
that without the support for inline instance profile, remapping is less
effective for some large scale change.
To add that support, before any remapping lookup happens, all the names
in the profile will be inserted into remapper and the Key to the name
mapping will be recorded in a map called NameMap in the remapper. During
name lookup, a Key will be returned for the given name and it will be used
to extract an equivalent name in the profile from NameMap. So with the help
of the NameMap, we can translate any given name to an equivalent name in
the profile if it exists. Whenever we try to match a name in the module to
a name in the profile, we will try the match with the original name first,
and if it doesn't match, we will use the equivalent name got from remapper
to try the match for another time. In this way, the patch can enhance the
profile remapping support for searching inline instance and searching
indirect call promotion candidate.
In a planned large scale change of int64 type (long long) to int64_t (long),
we found the performance of a google internal benchmark degraded by 2% if
nothing was done. If existing profile remapping was enabled, the performance
degradation dropped to 1.2%. If the profile remapping with the current patch
was enabled, the performance degradation further dropped to 0.14% (Note the
experiment was done before searching indirect call promotion candidate was
added. We hope with the remapping support of searching indirect call promotion
candidate, the degradation can drop to 0% in the end. It will be evaluated
post commit).
Differential Revision: https://reviews.llvm.org/D86332
This patch adds NoUndef to Intrinsics.td.
The attribute is attached to llvm.assume's operand, because llvm.assume(undef)
is UB.
It is attached to pointer operands of several memory accessing intrinsics
as well.
This change makes ValueTracking::getGuaranteedNonPoisonOps' intrinsic check
unnecessary, so it is removed.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D86576
As FIXME said, they really should be checking for a single user,
not use, so let's do that. It is not *that* unusual to have
the same value as incoming value in a PHI node, not unlike
how a PHI may have the same incoming basic block more than once.
There isn't a nice way to do that, Value::users() isn't uniqified,
and Value only tracks it's uses, not Users, so the check is
potentially costly since it does indeed potentially involes
traversing the entire use list of a value.
This implements 2 different vectorisation fallback strategies if tail-folding
fails: 1) don't vectorise at all, or 2) vectorise using a scalar epilogue. This
can be controlled with option -prefer-predicate-over-epilogue, that has been
changed to take a numeric value corresponding to the tail-folding preference
and preferred fallback.
Patch by: Pierre van Houtryve, Sjoerd Meijer.
Differential Revision: https://reviews.llvm.org/D79783
MVE Gather scatter codegeneration is looking a lot better than it used
to, but still has some issues. The instructions we currently model as 1
cycle per element, which is a bit low for some cases. Increasing the
cost by the MVECostFactor brings them in-line with our other instruction
costs. This will have the effect of only generating then when the extra
benefit is more likely to overcome some of the issues. Notably in
running out of registers and vectorizing loops that could otherwise be
SLP vectorized.
In the short-term whilst we look at other ways of dealing with those
more directly, we can increase the costs of gathers to make them more
likely to be beneficial when created.
Differential Revision: https://reviews.llvm.org/D86444
While since D86306 we do it's sibling fold for `insertvalue`,
we should also do this for `extractvalue`'s.
And unlike that one, the results here are, quite honestly, shocking,
as it can be observed here on vanilla llvm test-suite + RawSpeed results:
```
| statistic name | baseline | proposed | Δ | % | |%| |
|----------------------------------------------------|-----------|-----------|--------:|--------:|-------:|
| asm-printer.EmittedInsts | 7945095 | 7942507 | -2588 | -0.03% | 0.03% |
| assembler.ObjectBytes | 273209920 | 273069800 | -140120 | -0.05% | 0.05% |
| early-cse.NumCSE | 2183363 | 2183398 | 35 | 0.00% | 0.00% |
| early-cse.NumSimplify | 541847 | 550017 | 8170 | 1.51% | 1.51% |
| instcombine.NumAggregateReconstructionsSimplified | 2139 | 108 | -2031 | -94.95% | 94.95% |
| instcombine.NumCombined | 3601364 | 3635448 | 34084 | 0.95% | 0.95% |
| instcombine.NumConstProp | 27153 | 27157 | 4 | 0.01% | 0.01% |
| instcombine.NumDeadInst | 1694521 | 1765022 | 70501 | 4.16% | 4.16% |
| instcombine.NumPHIsOfExtractValues | 0 | 37546 | 37546 | 0.00% | 0.00% |
| instcombine.NumSunkInst | 63158 | 63686 | 528 | 0.84% | 0.84% |
| instcount.NumBrInst | 874304 | 871857 | -2447 | -0.28% | 0.28% |
| instcount.NumCallInst | 1757657 | 1758402 | 745 | 0.04% | 0.04% |
| instcount.NumExtractValueInst | 45623 | 11483 | -34140 | -74.83% | 74.83% |
| instcount.NumInsertValueInst | 4983 | 580 | -4403 | -88.36% | 88.36% |
| instcount.NumInvokeInst | 61018 | 59478 | -1540 | -2.52% | 2.52% |
| instcount.NumLandingPadInst | 35334 | 34215 | -1119 | -3.17% | 3.17% |
| instcount.NumPHIInst | 344428 | 331116 | -13312 | -3.86% | 3.86% |
| instcount.NumRetInst | 100773 | 100772 | -1 | 0.00% | 0.00% |
| instcount.TotalBlocks | 1081154 | 1077166 | -3988 | -0.37% | 0.37% |
| instcount.TotalFuncs | 101443 | 101442 | -1 | 0.00% | 0.00% |
| instcount.TotalInsts | 8890201 | 8833747 | -56454 | -0.64% | 0.64% |
| instsimplify.NumSimplified | 75822 | 75707 | -115 | -0.15% | 0.15% |
| simplifycfg.NumHoistCommonCode | 24203 | 24197 | -6 | -0.02% | 0.02% |
| simplifycfg.NumHoistCommonInstrs | 48201 | 48195 | -6 | -0.01% | 0.01% |
| simplifycfg.NumInvokes | 2785 | 4298 | 1513 | 54.33% | 54.33% |
| simplifycfg.NumSimpl | 997332 | 1018189 | 20857 | 2.09% | 2.09% |
| simplifycfg.NumSinkCommonCode | 7088 | 6464 | -624 | -8.80% | 8.80% |
| simplifycfg.NumSinkCommonInstrs | 15117 | 14021 | -1096 | -7.25% | 7.25% |
```
... which tells us that this new fold fires whopping 38k times,
increasing the amount of SimplifyCFG's `invoke`->`call` transforms by +54% (+1513) (again, D85787 did that last time),
decreasing total instruction count by -0.64% (-56454),
and sharply decreasing count of `insertvalue`'s (-88.36%, i.e. 9 times less)
and `extractvalue`'s (-74.83%, i.e. four times less).
This causes geomean -0.01% binary size decrease
http://llvm-compile-time-tracker.com/compare.php?from=4d5ca22b8adfb6643466e4e9f48ba14bb48938bc&to=97dacca0111cb2ae678204e52a3cee00e3a69208&stat=size-text
and, ignoring `O0-g`, is a geomean -0.01%..-0.05% compile-time improvement
http://llvm-compile-time-tracker.com/compare.php?from=4d5ca22b8adfb6643466e4e9f48ba14bb48938bc&to=97dacca0111cb2ae678204e52a3cee00e3a69208&stat=instructions
The other thing that tells is, is that while this is a massive win for `invoke`->`call` transform
`InstCombinerImpl::foldAggregateConstructionIntoAggregateReuse()` fold,
which is supposed to be dealing with such aggregate reconstructions,
fires a lot less now. There are two reasons why:
1. After this fold, as it can be seen in tests, we may (will) end up with trivially redundant PHI nodes.
We don't CSE them in InstCombine presently, which means that EarlyCSE needs to run and then InstCombine rerun.
2. But then, EarlyCSE not only manages to fold such redundant PHI's,
it also sees that the extract-insert chain recreates the original aggregate,
and replaces it with the original aggregate.
The take-aways are
1. We maybe should do most trivial, same-BB PHI CSE in InstCombine
2. I need to check if what other patterns remain, and how they can be resolved.
(i.e. i wonder if `foldAggregateConstructionIntoAggregateReuse()` might go away)
This is a reland of the original commit fcb51d8c24,
because originally i forgot to ensure that the base aggregate types match.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D86530
This reverts commit fcb51d8c24.
As buildbots report, there's apparently some missing check to ensure
that the types of incoming values match the type of PHI.
Let's revert for a moment.
While since D86306 we do it's sibling fold for `insertvalue`,
we should also do this for `extractvalue`'s.
And unlike that one, the results here are, quite honestly, shocking,
as it can be observed here on vanilla llvm test-suite + RawSpeed results:
```
| statistic name | baseline | proposed | Δ | % | |%| |
|----------------------------------------------------|-----------|-----------|--------:|--------:|-------:|
| asm-printer.EmittedInsts | 7945095 | 7942507 | -2588 | -0.03% | 0.03% |
| assembler.ObjectBytes | 273209920 | 273069800 | -140120 | -0.05% | 0.05% |
| early-cse.NumCSE | 2183363 | 2183398 | 35 | 0.00% | 0.00% |
| early-cse.NumSimplify | 541847 | 550017 | 8170 | 1.51% | 1.51% |
| instcombine.NumAggregateReconstructionsSimplified | 2139 | 108 | -2031 | -94.95% | 94.95% |
| instcombine.NumCombined | 3601364 | 3635448 | 34084 | 0.95% | 0.95% |
| instcombine.NumConstProp | 27153 | 27157 | 4 | 0.01% | 0.01% |
| instcombine.NumDeadInst | 1694521 | 1765022 | 70501 | 4.16% | 4.16% |
| instcombine.NumPHIsOfExtractValues | 0 | 37546 | 37546 | 0.00% | 0.00% |
| instcombine.NumSunkInst | 63158 | 63686 | 528 | 0.84% | 0.84% |
| instcount.NumBrInst | 874304 | 871857 | -2447 | -0.28% | 0.28% |
| instcount.NumCallInst | 1757657 | 1758402 | 745 | 0.04% | 0.04% |
| instcount.NumExtractValueInst | 45623 | 11483 | -34140 | -74.83% | 74.83% |
| instcount.NumInsertValueInst | 4983 | 580 | -4403 | -88.36% | 88.36% |
| instcount.NumInvokeInst | 61018 | 59478 | -1540 | -2.52% | 2.52% |
| instcount.NumLandingPadInst | 35334 | 34215 | -1119 | -3.17% | 3.17% |
| instcount.NumPHIInst | 344428 | 331116 | -13312 | -3.86% | 3.86% |
| instcount.NumRetInst | 100773 | 100772 | -1 | 0.00% | 0.00% |
| instcount.TotalBlocks | 1081154 | 1077166 | -3988 | -0.37% | 0.37% |
| instcount.TotalFuncs | 101443 | 101442 | -1 | 0.00% | 0.00% |
| instcount.TotalInsts | 8890201 | 8833747 | -56454 | -0.64% | 0.64% |
| instsimplify.NumSimplified | 75822 | 75707 | -115 | -0.15% | 0.15% |
| simplifycfg.NumHoistCommonCode | 24203 | 24197 | -6 | -0.02% | 0.02% |
| simplifycfg.NumHoistCommonInstrs | 48201 | 48195 | -6 | -0.01% | 0.01% |
| simplifycfg.NumInvokes | 2785 | 4298 | 1513 | 54.33% | 54.33% |
| simplifycfg.NumSimpl | 997332 | 1018189 | 20857 | 2.09% | 2.09% |
| simplifycfg.NumSinkCommonCode | 7088 | 6464 | -624 | -8.80% | 8.80% |
| simplifycfg.NumSinkCommonInstrs | 15117 | 14021 | -1096 | -7.25% | 7.25% |
```
... which tells us that this new fold fires whopping 38k times,
increasing the amount of SimplifyCFG's `invoke`->`call` transforms by +54% (+1513) (again, D85787 did that last time),
decreasing total instruction count by -0.64% (-56454),
and sharply decreasing count of `insertvalue`'s (-88.36%, i.e. 9 times less)
and `extractvalue`'s (-74.83%, i.e. four times less).
This causes geomean -0.01% binary size decrease
http://llvm-compile-time-tracker.com/compare.php?from=4d5ca22b8adfb6643466e4e9f48ba14bb48938bc&to=97dacca0111cb2ae678204e52a3cee00e3a69208&stat=size-text
and, ignoring `O0-g`, is a geomean -0.01%..-0.05% compile-time improvement
http://llvm-compile-time-tracker.com/compare.php?from=4d5ca22b8adfb6643466e4e9f48ba14bb48938bc&to=97dacca0111cb2ae678204e52a3cee00e3a69208&stat=instructions
The other thing that tells is, is that while this is a massive win for `invoke`->`call` transform
`InstCombinerImpl::foldAggregateConstructionIntoAggregateReuse()` fold,
which is supposed to be dealing with such aggregate reconstructions,
fires a lot less now. There are two reasons why:
1. After this fold, as it can be seen in tests, we may (will) end up with trivially redundant PHI nodes.
We don't CSE them in InstCombine presently, which means that EarlyCSE needs to run and then InstCombine rerun.
2. But then, EarlyCSE not only manages to fold such redundant PHI's,
it also sees that the extract-insert chain recreates the original aggregate,
and replaces it with the original aggregate.
The take-aways are
1. We maybe should do most trivial, same-BB PHI CSE in InstCombine
2. I need to check if what other patterns remain, and how they can be resolved.
(i.e. i wonder if `foldAggregateConstructionIntoAggregateReuse()` might go away)
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D86530
We only need the C++ type and the corresponding TF Enum. The other
parameter was used for the output spec json file, but we can just
standardize on the C++ type name there.
Differential Revision: https://reviews.llvm.org/D86549
This patch helps getGuaranteedNonPoisonOp find multiple non-poison operands.
Instead of special-casing llvm.assume, I think it is also a viable option to
add noundef to Intrinsics.td. If it makes sense, I'll make a patch for that.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D86477
The legacy LoopVectorize has a dependency on InjectTLIMappingsLegacy.
That cannot be expressed in the new PM since they are both normal
passes. Explicitly add -inject-tli-mappings as a pass.
Follow-up to https://reviews.llvm.org/D86492.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D86561
The legacy SLPVectorizer has a dependency on InjectTLIMappingsLegacy.
That cannot be expressed in the new PM since they are both normal
passes. Explicitly add -inject-tli-mappings as a pass.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D86492
The 1st attempt (rG557b890) was reverted because it caused miscompiles.
That bug is avoided here by changing the order of folds and as verified
in the new tests.
Original commit message:
InstCombine currently has odd rules for folding insert-extract chains to shuffles,
so we miss collapsing seemingly simple cases as shown in the tests here.
But poison makes this not quite as easy as we might have guessed. Alive2 tests to
show the subtle difference (similar to the regression tests):
https://alive2.llvm.org/ce/z/hp4hv3 (this is ok)
https://alive2.llvm.org/ce/z/ehEWaN (poison leakage)
SLP tends to create these patterns (as shown in the SLP tests), and this could
help with solving PR16739.
Differential Revision: https://reviews.llvm.org/D86460
The 1st draft of D86460 (reverted) would show miscompiles with these tests
because the undef element tracking went wrong and became visible in the
shuffle masks.
This adapts LV to the new semantics of get.active.lane.mask as discussed in
D86147, which means that the LV now emits intrinsic get.active.lane.mask with
the loop tripcount instead of the backedge-taken count as its second argument.
The motivation for this is described in D86147.
Differential Revision: https://reviews.llvm.org/D86304
In getCastInstrCost when the instruction is a truncate we were relying
upon the implicit TypeSize -> uint64_t cast when asking if a given type
has the same size as a legal integer. I've changed the code to only
ask the question if the type is fixed length.
I have also changed InstCombinerImpl::SimplifyDemandedUseBits to bail
out for now if the type is a scalable vector.
I've added the following new tests:
Analysis/CostModel/AArch64/sve-trunc.ll
Transforms/InstCombine/AArch64/sve-trunc.ll
for both of these fixes.
Differential revision: https://reviews.llvm.org/D86432
If we use training algorithms that don't need partial rewards, we don't
need to worry about an ir2native model. In that case, training logs
won't contain a 'delta_size' feature either (since that's the partial
reward).
Differential Revision: https://reviews.llvm.org/D86481
InstCombine currently has odd rules for folding insert-extract chains to shuffles,
so we miss collapsing seemingly simple cases as shown in the tests here.
But poison makes this not quite as easy as we might have guessed. Alive2 tests to
show the subtle difference (similar to the regression tests):
https://alive2.llvm.org/ce/z/hp4hv3 (this is ok)
https://alive2.llvm.org/ce/z/ehEWaN (poison leakage)
SLP tends to create these patterns (as shown in the SLP tests), and this could
help with solving PR16739.
Differential Revision: https://reviews.llvm.org/D86460
The "takeName" logic at the end of ScalarizerVisitor::finish
could end up renaming global variables when having simplified
and extractelement instruction to simply pick a single vector
element. If the input vector to the extractelement instruction
held pointers to global variables we ended up renaming the global
variable.
The patch make sure we only take the name of the replaced Op when
we have added new instructions that might need a useful name.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D86472
This reverts commit 2e43acfed8.
LLVMCoroutines (the library which contains Coroutines.h) depends on LLVMipo (the
library which contains SampleProfile.cpp). It is inappropriate for
SampleProfile.cpp to depent on Coroutines.h (circular dependency).
The test inverted dependencies as well:
llvm/test/Transforms/Coroutines/coro-inline.ll uses -sample-profile.
summary:
When callee coroutine function is inlined into caller coroutine
function before coro-split pass, llvm will emits "coroutine should
have exactly one defining @llvm.coro.begin". It seems that coro-early
pass can not handle this quiet well.
So we believe that unsplited coroutine function should not be inlined.
This patch fix such issue by not inlining function if it has attribute
"coroutine.presplit" (it means the function has not been splited) to
fix this issue
TestPlan: check-llvm
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D85812
Avoid computing InvisibleToCallerBefore/AfterRet up front. In most
cases, this information is not really needed. Instead, introduce helper
functions to compute and cache the result on demand.
Notably, this also does not use PointerMayBeCapturedBefore for
isInvisibleToCallerBeforeRet, as it requires the killing MemoryDef as
starting instruction, making the caching ineffective. But it appears the
use of PointerMayBeCapturedBefore has very limited benefits in practice
(e.g. on SPEC2000/SPEC2006/MultiSource there are no binary changes with
-O3 -flto). Refrain from using it for now, to limit-compile-time.
This gives some nice compile-time improvements:
http://llvm-compile-time-tracker.com/compare.php?from=db9345f6810f379a36752dc52caf5230585d0ebd&to=b4d091047e1b8a3d377d200137b79d03aca65663&stat=instructions
If gather/scatters are enabled, ARMTargetTransformInfo now allows
tail predication for loops with a much wider range of strides, up
to anything that is loop invariant.
Differential Revision: https://reviews.llvm.org/D85410
Limit elimination of stores at the end of a function to MemoryDefs with
a single underlying object, to save compile time.
In practice, the case with multiple underlying objects seems not very
important in practice. For -O3 -flto on MultiSource/SPEC2000/SPEC2006
this results in a total of 2 more stores being eliminated.
We can always re-visit that in the future.
Similar to the existing transform - peek through a select
to match a value and its negation.
https://alive2.llvm.org/ce/z/MXi5KG
define i8 @src(i1 %b, i8 %x) {
%0:
%neg = sub i8 0, %x
%sel = select i1 %b, i8 %x, i8 %neg
%abs = abs i8 %sel, 1
ret i8 %abs
}
=>
define i8 @tgt(i1 %b, i8 %x) {
%0:
%abs = abs i8 %x, 1
ret i8 %abs
}
Transformation seems to be correct!
This reverses the existing transform that would uniformly canonicalize any 'xor' after any shift. In the case of logical shifts, that turns a 'not' into an arbitrary 'xor' with constant, and that's probably not as good for analysis, SCEV, or codegen.
The SCEV motivating case is discussed in:
http://bugs.llvm.org/PR47136
There's an analysis motivating case at:
http://bugs.llvm.org/PR38781
I did draft a patch that would do the same for 'ashr' but that's questionable because it's just swapping the position of a 'not' and uncovers at least 2 missing folds that we would probably need to deal with as preliminary steps.
Alive proofs:
https://rise4fun.com/Alive/BBV
Name: shift right of 'not'
Pre: C2 == (-1 u>> C1)
%a = lshr i8 %x, C1
%r = xor i8 %a, C2
=>
%n = xor i8 %x, -1
%r = lshr i8 %n, C1
Name: shift left of 'not'
Pre: C2 == (-1 << C1)
%a = shl i8 %x, C1
%r = xor i8 %a, C2
=>
%n = xor i8 %x, -1
%r = shl i8 %n, C1
Name: ashr of 'not'
%a = ashr i8 %x, C1
%r = xor i8 %a, -1
=>
%n = xor i8 %x, -1
%r = ashr i8 %n, C1
Differential Revision: https://reviews.llvm.org/D86243
As disscussed in post-commit review starting with
https://reviews.llvm.org/D84108#2227365
while this appears to be mostly a win overall, especially code-size-wise,
this appears to shake //certain// code pattens in a way that is extremely
unfavorable for performance (+30% runtime regression)
on certain CPU's (i personally can't reproduce).
So until the behaviour is better understood, and a path forward is mapped,
let's back this out for now.
This reverts commit 1d51dc38d8.
The legacy PM alias analysis pipeline by default includes basic-aa.
When running `opt -foo-pass` under the NPM and -disable-basic-aa is not
specified, use basic-aa.
This decreases the number of check-llvm failures under NPM from 913 to 752.
Reviewed By: ychen, asbirlea
Differential Revision: https://reviews.llvm.org/D86167
- Adds a command line option to seed only selected functions.
- Makes seed allow listing exclusive to assertions enabled builds.
Reviewed By: sstefan1
Differential Revision: https://reviews.llvm.org/D86129
Currently, `AANoUndefImpl::initialize` mistakenly always indicates optimistic fixpoint for function returned position.
This is because an associated value is `Function` in the case, and `isGuaranteedNotToBeUndefOrPoison` returns true for Function.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D86361
If some of gc live value are not used in gc.relocate we can remove them
from gc-live bundle of statepoint instruction.
Also the CL removes duplicated Values in gc-live bundle.
Reviewers: reames, dantrushin
Reviewed By: dantrushin
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D85959