The re-apply includes fixes to clang tests that were missed in
the original commit.
Original message:
Prior to this patch we would only set to undef the unused arguments of the
external functions. The rationale was that unused arguments of internal
functions wouldn't need to be turned into undef arguments because they
should have been simply eliminated by the time we reach that code.
This is actually not true because there are plenty of cases where we can't
remove unused arguments. For instance, if the internal function is used in
an indirect call, it may not be possible to change the function signature.
Yet, for statically known call-sites we would still like to mark the unused
arguments as undef.
This patch enables the "set undef arguments" optimization on internal
functions when we encounter cases where internal functions cannot be
optimized. I.e., whenever an internal function is marked "live".
Differential Revision: https://reviews.llvm.org/D124699
It makes sense to make a non-byval promotion attempt first and then
fall back to the byval one. The non-byval ('usual') promotion is
generally better, for example it does promotion even when a structure
has more elements than 'MaxElements' but not all of them are actually
used in the function.
Differential Revision: https://reviews.llvm.org/D124514
As discussed in issue #37809, this transform is not safe
if the input is an undefined value.
This is similar to a recent change for urem:
d428f09b2c
There is no difference in codegen on the basic examples,
but this could lead to regressions. We may need to
improve freeze analysis or lowering if that happens.
Presumably, in real cases that are similar to the tests
where a subsequent transform removes the select, we
will also be able to remove the freeze by seeing that
the parameter has 'noundef'.
As discussed in issue #37809, this transform is not safe
if the input is an undefined value.
There is no difference in codegen on the basic examples,
but this could lead to regressions. We may need to
improve freeze analysis or lowering if that happens.
If there is a freeze %x, we currently replace all other uses of %x
with freeze %x -- as long as they are dominated by the freeze
instruction. This patch extends this behavior to cases where we
did not originally dominate the use by moving the freeze
instruction directly after the definition of the frozen value.
The motivation can be seen in test @combine_and_after_freezing_uses:
Canonicalizing everything to freeze %x allows folds that are based
on value identity (i.e. same operand occurring in two places) to
trigger. This also covers the case from D125248.
Differential Revision: https://reviews.llvm.org/D125321
Further improvement of the cost model for the scalars used in
buildvectors sequences. The main functionality is outlined into
a separate function.
The cost is calculated in the following way:
1. If the Base vector is not undef vector, resizing the very first mask to
have common VF and perform action for 2 input vectors (including non-undef
Base). Other shuffle masks are combined with the resulting after the 1 stage and processed as a shuffle of 2 elements.
2. If the Base is undef vector and have only 1 shuffle mask, perform the
action only for 1 vector with the given mask, if it is not the identity
mask.
3. If > 2 masks are used, perform serie of shuffle actions for 2 vectors,
combing the masks properly between the steps.
The original implementation misses the very first analysis for the Base
vector, so the cost might too optimistic in some cases. But it improves
the cost for the insertelements which are part of the current SLP graph.
Part of D107966.
Differential Revision: https://reviews.llvm.org/D115750
With opaque pointers, both the stored value and the address can be the
same. Only consider the recipe using the first lane only *if* the
address is not stored.
Fixes#55375.
We're having a hard time booting the ARCH=i386 Linux kernel with clang
after removing -ffreestanding because instcombine was dropping inreg
from callers during libcall simplification, but not the callees defined
in different translation units. This led the callers and callees to have
wildly different calling conventions, which (predictably) blew up at
runtime.
Infer the inreg param attrs on function declarations from the module
metadata "NumRegisterParameters." This allows us to boot the ARCH=i386
Linux kernel (w/ -ffreestanding removed).
Fixes: https://github.com/llvm/llvm-project/issues/53645
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D125285
The current reordering scheme only checks the ordering of in-tree operands.
There are some cases, however, where we need to adjust the ordering based on
the ordering of a future SLP-tree who's instructions are not part of the
current tree, but are external users.
This patch is a simple implementation of this. We keep track of scalar stores
that are users of TreeEntries and if they look profitable to vectorize, then
we keep track of their ordering. During the reordering step we take this new
index order into account. This can remove some shuffles in cases like in the
lit test.
Differential Revision: https://reviews.llvm.org/D125111
shuffle (cast X), (cast Y), Mask --> cast (shuffle X, Y, Mask)
This is similar to a recent transform with fneg ( b331a7ebc1 ),
but this is intentionally the most conservative first step to
try to avoid regressions in codegen. There are several
restrictions that could be removed as follow-up enhancements.
Note that a cast with a unary shuffle is currently canonicalized
in the other direction (shuffle after cast - D103038 ). We might
want to invert that to be consistent with this patch.
Previously we took the old name and always appended a numberic suffix.
Since we're doing a 1:1 replacement, it's clearer to keep the original
name exactly.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D125281
This makes the output IR more readable since we're doing a one to
one replacement.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D125280
The EnableReuseStorageInFrame option is designed for testing only.
But it is better to use *_PASS_WITH_PARAMS macro to keep consistent with
other passes.
30a12f3f63 switched the type check
to use the GEP result type rather than the GEP operand type.
However, the GEP result types may match even if the operand types
don't, in case GEPs with scalar/vector base and vector index
are compared.
Fixes https://github.com/llvm/llvm-project/issues/55363.
This is a following cleanup for the previous work D123918. I missed
serveral places which still use legacy pass managers. This patch tries
to remove them.
When a callee function is inlined via an invoke instruction, every function call inside the callee, if not an invoke, will be converted to an invoke after cloned to the caller body. I found that during the conversion the !prof metadata was dropped. This in turned caused a cloned indirect call not properly promoted in subsequent passes.
The particular scenario I was investigating was with AutoFDO and thinLTO. In prelink, no ICP was triggered (neither by the sample loader nor PGO ICP), no indirect call was promoted. This is because 1) the particular indirect call did not have inlined samples; and 2) PGO ICP was intentionally disabled. After inlining, the prof metadata was dropped. Then in postlink, PGO ICP jumped in but didn't do anything. Thus the opportunity was missed.
I'm making a simple fix to preserve !prof metadata when converting call to invoke.
Reviewed By: davidxl
Differential Revision: https://reviews.llvm.org/D125249
If the same scalar is inserted several times into the same buildvector,
the mask index can be used already. In this case need to check, that
this scalar is already part of the vectorized buildvector.
We can try to vectorize number of stores less than MinVecRegSize
/ scalar_value_size, if it is allowed by target. Gives an extra
opportunity for the vectorization.
Fixes PR54985.
Differential Revision: https://reviews.llvm.org/D124284
Need to use actual index instead of the tree entry position, since the
insert index may be different than 0. It mean, that we vectorized part
of the buildvector starting from not initial insertelement instruction
beause of some reason.
Given a commutative reduction leading from a shuffle, the order of the
lanes on the shuffle are not important for the result. This means we can
reorder the shuffle to something simpler, which we try shuffling the
first vector lanes first. This was D123494.
The new shuffle may not be profitable though, and if it is not we can
try the folding of select shuffles from D123911. This, with some
adjustment as the output lane ordering is now unimportant, can allow the
final shuffle to simplify given the inputs to the patterns from D123911.
Where as each transformation on their own are not profitable, the
combination is.
We can only support a single shuffle when called from reductions, but we
are able to sort the ReconstructMask, potentially allowing it to
simplify to an identity or concat mask.
Differential Revision: https://reviews.llvm.org/D125086
When a PHINode has an incoming block from outside the region, it must be handled specially when assigning a global value number to each incoming value. A PHINode has multiple predecessors, and we must handle this case rather than only the single predecessor case.
Reviewer: paquette
Differential Revision: https://reviews.llvm.org/D124777
Given a load without a better order, this patch partially sorts the
elements to form clusters of adjacent elements in memory. These clusters
can potentially be loaded in fewer loads, meaning less overall shuffling
(for example loading v4i8 clusters of a v16i8 as a single f32 loads, as
opposed to multiple independent bytes loads and inserts).
Differential Revision: https://reviews.llvm.org/D122145
As shown in https://github.com/llvm/llvm-project/issues/55150 -
the existing fold may be wrong when converting to a signed value.
This is a quick fix to avoid the miscompile.
I added tests/comments for all of the signed/unsigned combinations
at either side of the boundary width, and tried to confirm with Alive2:
https://alive2.llvm.org/ce/z/3p9DSu
There are already some TODO items in the test file that suggest
possible refinements, so the regression with ui->FP->si is probably ok.
It seems unlikely that we'd see these kind of edge cases with
non-byte-width integer types in real code. The potential miscompile
went undetected for several years.
This and 747c6a0c73fixes#55150.
Differential Revision: https://reviews.llvm.org/D124692
If a constrained intrinsic call was replaced by some value, it was not
removed in some cases. The dangling instruction resulted in useless
instructions executed in runtime. It happened because constrained
intrinsics usually have side effect, it is used to model the interaction
with floating-point environment. In some cases side effect is actually
absent or can be ignored.
This change adds specific treatment of constrained intrinsics so that
their side effect can be removed if it actually absents.
Differential Revision: https://reviews.llvm.org/D118426
Splatting a bit of constant-index across a value:
sext (ashr (trunc iN X to iM), M-1) to iN --> ashr (shl X, N-M), N-1
If the dest type is different, use a cast (adjust use check).
https://alive2.llvm.org/ce/z/acAan3
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D124590
For the unary shuffle pattern, this is opposite to what we try
to do with binops, but it seems better to keep it consistent
with the motivating binary shuffle pattern. On that, it is
clearly better on the usual no-extra uses case.
There is a chance that this will pull an fneg away from some
other binop and cause a regression in codegen, but that should
be invertible in the backend. The transform is birectional:
https://alive2.llvm.org/ce/z/kKaKCUhttps://alive2.llvm.org/ce/z/3DesfwFixes#45631
Try to push an icmp into a select even if the icmp operand isn't
constant - perform a generic SimplifyICmpInst instead.
This doesn't appear to impact compile-time much, and forming
logical and/or is generally profitable, as we have very good
support for them.
D113035 enhanced the matching of bitwise selects from vector types. This
change unfortunately introduced crashes as it tries to cast scalable
vector types to integers.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D124997
After D97756, collectHomogenousInstGraphLoopInvariants may collect
conditions for both logical ANDs and logical ORs in case the root is a
select that matches both logical AND & OR.
This means the function won't return invariant values of either AND/OR
chains, but both. This can result in incorrect transformations.
See llvm/test/Transforms/SimpleLoopUnswitch/trivial-unswitch-logical-and-or.ll.
Without the patch, Alive2 rejects the modified tests with:
Source and target don't have the same return domain.
Note that this also applies to the test case added in D97756
(@test_partial_condition_unswitch_or_select). We can't unswitch on
%cond6, because the graph leading to it contains and AND and an OR.
This only fixes trivial unswitching for now, but a similar problem
likely exists with non-trivial unswitching.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D124526
Factor our InstrumentationIRBuilder and share it between ThreadSanitizer
and SanitizerCoverage. Simplify its usage at the same time (use function
of passed Instruction or BasicBlock).
This class may be used in other instrumentation passes in future.
NFCI.
Reviewed By: nickdesaulniers
Differential Revision: https://reviews.llvm.org/D125038
This patch adds a combine to attempt to reduce the costs of certain
select-shuffle patterns. The form of code it attempts to detect is:
%x = shuffle ...
%y = shuffle ...
%a = binop %x, %y
%b = binop %x, %y
shuffle %a, %b, selectmask
A classic select-mask will pick items from each lane of a or b. These
do not always have a great lowering on many architectures. This patch
attempts to pack a and b into the lower elements, creating a differently
ordered shuffle for reconstructing the orignal which may be better than
the select mask. This can be better for performance, especially if less
elements of a and b need to be computed and the input shuffles are
cheaper.
Because select-masks are just one form of shuffle, we generalize to any
mask. So long as the backend has decent costmodel for the shuffles, this
can generally improve things when they come up. For more basic cost
models the folds do not appear to be profitable, not getting past the
cost checks.
Differential Revision: https://reviews.llvm.org/D123911
Re-materialize for debug instructions would cause a different code
generated if we enabled `-g`. This is bad. So we disable to
re-materialize for debug instructions.
When building with debug info enabled, some load/store instructions do
not have a DebugLocation attached. When using the default IRBuilder, it
attempts to copy the DebugLocation from the insertion-point instruction.
When there's no DebugLocation, no attempt is made to add one.
This is problematic for inserted calls, where the enclosing function has
debug info but the call ends up without a DebugLocation in e.g. LTO
builds that verify that both the enclosing function and calls to
inlinable functions have debug info attached.
This issue was noticed in Linux kernel KCSAN builds with LTO and debug
info enabled:
| ...
| inlinable function call in a function with debug info must have a !dbg location
| call void @__tsan_read8(i8* %432)
| ...
To fix, ensure that all calls to the runtime have a DebugLocation
attached, where the possibility exists that the insertion-point might
not have any DebugLocation attached to it.
Reviewed By: nickdesaulniers
Differential Revision: https://reviews.llvm.org/D124937
Further improvement of the cost model for the scalars used in
buildvectors sequences. The main functionality is outlined into
a separate function.
The cost is calculated in the following way:
1. If the Base vector is not undef vector, resizing the very first mask to
have common VF and perform action for 2 input vectors (including non-undef
Base). Other shuffle masks are combined with the resulting after the 1 stage and processed as a shuffle of 2 elements.
2. If the Base is undef vector and have only 1 shuffle mask, perform the
action only for 1 vector with the given mask, if it is not the identity
mask.
3. If > 2 masks are used, perform serie of shuffle actions for 2 vectors,
combing the masks properly between the steps.
The original implementation misses the very first analysis for the Base
vector, so the cost might too optimistic in some cases. But it improves
the cost for the insertelements which are part of the current SLP graph.
Part of D107966.
Differential Revision: https://reviews.llvm.org/D115750
We cannot skip the freezing the condition if the unswitched branch
executes, if the condition is a chain of ANDs/ORs. For example, if if we
have an AND %c1, %c2 with %c1 == undef and %c2 == 0, there would be no
branch on undef in the original code, but a branch on undef if we
unswitch %c1.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D124603
If a constrained intrinsic call was replaced by some value, it was not
removed in some cases. The dangling instruction resulted in useless
instructions executed in runtime. It happened because constrained
intrinsics usually have side effect, it is used to model the interaction
with floating-point environment. In some cases it is correct behavior
but often the side effect is actually absent or can be ignored.
This change adds specific treatment of constrained intrinsics so that
their side effect can be removed if it actually absents.
Differential Revision: https://reviews.llvm.org/D118426
This patch switches the PGO implementation on AIX from using the runtime
registration-based section tracking to the __start_SECNAME/__stop_SECNAME
based. In order to enable the recognition of __start_SECNAME/__stop_SECNAME
symbols in the AIX linker, the -bdbg:namedsects:ss needs to be used.
Reviewed By: jsji, MaskRay, davidxl
Differential Revision: https://reviews.llvm.org/D124857
This check is in the related fold for binops,
but it was missed when the code was adapted
for intrinsics in 432c199e84. The new test
would crash when trying to create a new
intrinsic with mismatched types.
This extends 432c199e84 and 9c4770eaab with an intrinsic
cited directly in issue #46238
Eventually, we will want to use llvm::isTriviallyVectorizable()
or create some new API for this list, but for now, I am intentionally
making a minimum change to reduce risk and only affect an intrinsic
with regression tests in place.
https://alive2.llvm.org/ce/z/sD-JVv
This extends 432c199e84 with a 3 arg intrinsic to demonstrate
that the code works with the extra operand.
Eventually, we will want to use llvm::isTriviallyVectorizable()
or create some new API for this list, but for now, I am intentionally
making a minimum change to reduce risk and only affect an intrinsic
with regression tests in place.
As a follow-up to D124632, I'm turning on unlimited size caps for inlining with preinlined profile. It should be safe as a preinlined profile has "bounded" inline contexts.
No noticeable size or perf delta was seen with two of our internal large services, but I think this is still a good change to be consistent with the other case.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D124793
The two fields have the same meaning. Their values come from the reader. Therefore I'm removing one.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D124788
This is an intrinsic version of the existing fold for binops.
As a first step, I only allowed min/max, but the code is set
up to make adding more intrinsics easy (with more or less than
2 arguments).
This (and possible follow-ups) are discussed in issue #46238.
Per feedback on D123086 after submit.
Also added a test for vec_malloc et al attribute inference to show it's
doing the right thing.
The new tests exposed a defect, corrected by adding vec_free to the list of
free functions in MemoryBuiltins.cpp, which had been overlooked all the
way back in D94710, over a year ago.
Differential Revision: https://reviews.llvm.org/D124859
Adds ability to vectorize loops containing a store to a loop-invariant
address as part of a reduction that isn't converted to SSA form due to
lack of aliasing info. Runtime checks are generated to ensure the store
does not alias any other accesses in the loop.
Ordered fadd reductions are not yet supported.
Differential Revision: https://reviews.llvm.org/D110235
This adds fptosi_sat and fptoui_sat to the list of trivially
vectorizable functions, mainly so that the loop vectorizer can vectorize
the instruction. Marking them as trivially vectorizable also allows them
to be SLP vectorized, and Scalarized.
The signature of a fptosi_sat requires two type overrides
(@llvm.fptosi.sat.v2i32.v2f32), unlike other intrinsics that often only
take a single. This patch alters hasVectorInstrinsicOverloadedScalarOpd
to isVectorIntrinsicWithOverloadTypeAtArg, so that it can mark the first
operand of the intrinsic as a overloaded (but not scalar) operand.
Differential Revision: https://reviews.llvm.org/D124358
We don't need to insert a load of the dynamic shadow address unless there
are interesting memory accesses to profile.
Split out of D124703.
Differential Revision: https://reviews.llvm.org/D124797
Suppress instrumentation of PGO counter accesses, which is unnecessary
and costly. Also suppress accesses to other compiler inserted variables
starting with "__llvm". This is a slightly expanded variant of what is
done for tsan in shouldInstrumentReadWriteFromAddress.
Differential Revision: https://reviews.llvm.org/D124703
Currently SLP vectorizer walks through the instructions and selects
3 main classes of values: 1) reduction operations - instructions with same
reduction opcode (add, mul, min/max, etc.), which build the reduction,
2) reduced values - instructions with the same opcodes, but different
from the reduction opcode, 3) extra arguments - all other values,
instructions from the different basic block rather than the root node,
instructions with to many/less uses.
This scheme is not very efficient. It excludes some instructions and all
non-instruction values from the reductions (constants, proficient
gathers), to many possibly reduced values are marked as extra arguments.
Patch improves this process by introducing a bit extended analysis
stage. During this stage, we still try to select 3 classes of the
values: 1) reduction operations - same as before, 2) possibly reduced
values - all instructions from the current block/non-instructions, which
may build a vectorization tree, 3) extra arguments - instructions from
the different basic blocks. Additionally, an extra sorting of the
possibly reduced values occurs to build the scalar sequences which
highly likely will bed vectorized, e.g. loads are grouped by the
distance between them, constants are grouped together, cmp instructions
are sorted by their compare types and predicates, extractelement
instructions are sorted by the vector operand, etc. Also, these groups
are reordered by their length so the longest group is the first in the
list of the possibly reduced values.
The vectorization process tries to emit the reductions for all these
groups. These reductions, remaining non-vectorized possible reduced
values and extra arguments are then combined into the final expression
just like it was before.
Differential Revision: https://reviews.llvm.org/D114171
'Widen' recipe are only used when actual vector values are generated.
Fix tryToWidenCall to do not create VPWidenCallRecipes for scalar vector
factors.
This was exposed by D123720, because the widened recipes are considered
vector users.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D124718
Prior to this patch we would only set to undef the unused arguments of the
external functions. The rationale was that unused arguments of internal
functions wouldn't need to be turned into undef arguments because they
should have been simply eliminated by the time we reach that code.
This is actually not true because there are plenty of cases where we can't
remove unused arguments. For instance, if the internal function is used in
an indirect call, it may not be possible to change the function signature.
Yet, for statically known call-sites we would still like to mark the unused
arguments as undef.
This patch enables the "set undef arguments" optimization on internal
functions when we encounter cases where internal functions cannot be
optimized. I.e., whenever an internal function is marked "live".
Differential Revision: https://reviews.llvm.org/D124699
libcalls." (was 0f8c626). This reverts commit 14d9390.
The patch previously failed to recognize cases where user had defined a
function alias with an identical name as that of the library
function. Module::getFunction() would then return nullptr which is what the
sanitizer discovered.
In this updated version a new function isLibFuncEmittable() has as well been
introduced which is now used instead of TLI->has() anytime a library function
is to be emitted . It additionally also makes sure there is e.g. no function
alias with the same name in the module.
Reviewed By: Eli Friedman
Differential Revision: https://reviews.llvm.org/D123198
If there are pre-existing dead instructions, the order we visit replaced
values can cause us sometimes to not delete dead instructions.
The added test non-deterministically failed without the change.
Normally the index type will already be canonicalized here, but
this is not guaranteed depending on visitation order. The code
was already accounting for a potentially needed sext, but a trunc
may also be needed.
Add a ConstantExpr::getSExtOrTrunc() helper method to make this
simpler. This matches the corresponding IRBuilder method in behavior.
Fixes https://github.com/llvm/llvm-project/issues/55228.
Per the guidance in
https://llvm.org/docs/Atomics.html#atomics-and-ir-optimization,
an atomic load from a constant global can be dropped, as there can
be no stores to synchronize with. Any write to the constant global
would be UB.
IPSCCP will already drop such loads, but the main helper in Local
doesn't recognize this currently. This is motivated by D118387.
Differential Revision: https://reviews.llvm.org/D124241
X86 codegen uses function attribute `min-legal-vector-width` to select the proper ABI. The intention of the attribute is to reflect user's requirement when they passing or returning vector arguments. So Clang front-end will iterate the vector arguments and set `min-legal-vector-width` to the width of the maximum for both caller and callee.
It is assumed any middle end optimizations won't care of the attribute expect inlining and argument promotion.
- For inlining, we will propagate the attribute of inlined functions because the inlining functions become the newer caller.
- For argument promotion, we check the `min-legal-vector-width` of the caller and callee and refuse to promote when they don't match.
The problem comes from the optimizations' combination, as shown by https://godbolt.org/z/zo3hba8xW. The caller `foo` has two callees `bar` and `baz`. When doing argument promotion, both `foo` and `bar` has the same `min-legal-vector-width`. So the argument was promoted to vector. Then the inlining inlines `baz` to `foo` and updates `min-legal-vector-width`, which results in ABI mismatch between `foo` and `bar`.
This patch fixes the problem by expanding the concept of `min-legal-vector-width` to indicator of functions arguments. That says, any passes touch functions arguments have to set `min-legal-vector-width` to the value reflects the width of vector arguments. It makes sense to me because any arguments modifications are ABI related and should response for the ABI compatibility.
Differential Revision: https://reviews.llvm.org/D123284
In some cases, it is not enough to freeze the final AND/OR operation
when chaining a number of invariant conditions together.
After creating a chain of ANDs/ORs, we assume all unswitched operands to
be either true or false. But if any of the operands is poison, the rest
of the operands could have any value after branching on the frozen
condition.
To avoid that, freeze individual operands, if needed. In some cases this
may lead to unnecessary freezes, but it seems required at least for some
cases (see trivial-unswitch-freeze-individual-conditions.ll)
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D124554
Trivial unswitching can also introduce new branches on undef/poison.
Freeze the conditions if needed.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D124549
This patch removes an old hack in visitSelectInst that was written to avoid miscompilation bugs in loop unswitch.
(Added via https://reviews.llvm.org/D35811)
The legacy loop unswitch pass will be removed after D124376, and the new simple loop unswitch pass correctly uses freeze to avoid introducing UB after D124252.
Since the hack is not necessary anymore, this patch removes it.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D124426
We have seen that the prioirty inliner delivered on-par performance with the old inliner for probe-only CSSPGO profile, as long as without a size budget. I'm turning on the priority inliner for probe-only profile by default.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D124632
To be more clear and definitive, I'm renaming `ProfileIsCSFlat` back to `ProfileIsCS` which stands for full context-sensitive flat profiles. `ProfileIsCSNested` is now renamed to `ProfileIsPreInlined` and is extended to be applicable for CS flat profiles too. More specifically, `ProfileIsPreInlined` is for any kind of profiles (flat or nested) that contain 'ShouldBeInlined' contexts. The flag is encoded in the profile summary section for extbinary profiles and is computed on-the-fly for text profiles.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D122602
TI->getBitWidth can be > 64 and in those cases the shift will be UB due
to the exponent being too large.
To fix this, cap the shift at 63. I think this should work out fine,
because TableSize is itself a 64 bit type and the maximum table size
must fit in the type. Also, if we would underestimate the size here, at
most we get an extra ZExt.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D124608
Similar to c515b2f39e, If there are no loops in the function as seen
through LI, we should avoid computing the remaining expensive analyses
(such as SCEV, BPI). Reordered the analyses requests and early return
if there are no loops.
The logic of avoiding expensive analyses is applied to LoopVectorizer,
LoopLoadElimination and LoopUnrollPass, i.e. all function passes which operate
on loops.
This is an NFC with compile time improvement.
Differential Revision: https://reviews.llvm.org/D124529
This removes memset with undef char. We already do this for stores
of undef value.
This comes with the caveat that this optimization is not, strictly
speaking, legal for undef values, because we might be overwriting
a poison value. However, our entire load/store model currently still
operates on undef values, so we need to support undef here as well
for internal consistency.
Once https://github.com/llvm/llvm-project/issues/52930 is resolved,
these and related folds can be limited to poison -- I've added
FIXMEs to that effect.
Differential Revision: https://reviews.llvm.org/D124173
The name CountRoundDown is potentially misleading, as the number of
iterations can be rounded up when folding the tail.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D119681
This is an edge-case where we don't convert to bitwise and/or based
on implies poison reasoning, so explicitly try to perform the fold
in logical form. The transform itself is poison-safe, as both icmps
are based on the same value and any nowrap flags are discarded as
part of the fold (https://alive2.llvm.org/ce/z/aCwC8b for the used
example).
This fold handles a special subset of foldAndOrOfICmpsUsingRanges(),
use the more generic implementation instead.
The result can differ if a representation using a range comparison
is possible, in which case that is preferred over masking. There is
a canonicalization opportunity here.
This is the de Morgan conjugated variant of the existing fold for
ors. Implement this by switching the range code to always work
on ors and perform invert operands at the start and end. This makes
reasoning easier and makes the extension more obviosuly correct.
The legacy LoopUnswitch pass is only used in the legacy pass manager
pipeline, which is deprecated.
The NewPM replacement is SimpleLoopUnswitch and I think it is time to
remove the legacy LoopUnswitch code.
Fixes#31000.
Reviewed By: aeubanks, Meinersbur, asbirlea
Differential Revision: https://reviews.llvm.org/D124376
We can express this fold more naturally when working on the constant
range implementation. This change is not entirely NFC, because the
code now also handles cases that don't match the precise pattern
this previously looked for, e.g. we can omit an add on one of the
ranges.
I think this sort comparator was overly complex, and the windows
expensive check bot agreed, failing as it was not giving a strict weak
ordering. Change it to use the comparison of the mask values as unsigned
integers. This should sort the undef elements to the end whilst keeping
X<Y otherwise.
Replace the condition value with the known constant value on the
threaded edge. This happens implicitly with phi threading because
we replace with the incoming value, but not for non-phi threading.
SimplifyCFG implements basic jump threading, if a branch is
performed on a phi node with constant operands. However,
InstCombine canonicalizes such phis to the condition value of a
previous branch, if possible. SimplifyCFG does support this as
well, but only in the very limited case where the same condition
is used in a direct predecessor -- notably, this does not include
the common diamond pattern (i.e. two consecutive if/elses on the
same condition).
This patch extends the code to look back a limited number of
blocks to find a branch on the same value, rather than only
looking at the direct predecessor.
Fixes https://github.com/llvm/llvm-project/issues/54980.
Differential Revision: https://reviews.llvm.org/D124159
Given a shuffle feeding a commutative reduction, the lane ordering of
the shuffle will not alter the result. This is also true if there are a
number of operations between the reduction and the shuffle, providing
they only operate lane-wise. This patch searches for cases like that in
Vector Combine, allowing us to check the cost of the shuffle vs an
in-order identity shuffle and replace the order if possible. This only
handles a single shuffle at the moment to keep things simple, and is
able to ignore splats that produce results where every result is the
same.
This is a more powerful version of a combine that already happens in
instrcombine, capable of optimizing more cases by looking through more
instructions and being able to cost the shuffle.
Differential Revision: https://reviews.llvm.org/D123494
Introduced masks where they are not added and improved target dependent
cost models to avoid returning of the incorrect cost results after
adding masks.
Differential Revision: https://reviews.llvm.org/D100486
The structure ArgPart and alias OffsetAndArgPart have been moved
into the anonymous namespace. NFC.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D124617
The condition should be 'ArgParts.size() > MaxElements', so that if we
have exactly 3 elements in the 'ArgParts' vector, the promotion should
be allowed because the 'MaxElement' threshold is not exceeded yet.
The default value for 'MaxElement' has been decreased to 2 in order
to avoid an actual change in argument promoting behavior. However,
this changes byval argument transformation behavior by allowing
adding not more than 2 arguments to the function instead of 3 allowed
before.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D124178
Remove one of the last remaining uses of ::needsVectorIV, preparing for
its removal. Now that usesScalars is available and based on the
information explicit in VPlan, there is no need to use the pre-computed
needsVectorIV.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D123720
Some loop counters ('i', 'e') and variables ('type') were named not
in accordance with the code style and clang-tidy issues warnings
about the using of such variables. This patch renames the variables
and fixes some typos in the comments within the source file.
Differential Revision: https://reviews.llvm.org/D123662
When using opaque pointers, convert GEPs into offset representation
of the form P + V1 * Scale1 + V2 * Scale2 + ... + ConstantOffset.
This allows us to recognize equivalent address calculations even if
the GEPs don't use the same source element type.
This fixes an opaque pointer codegen regression seen in rustc.
Differential Revision: https://reviews.llvm.org/D124527
They can already be available, and even if not, DT/LI can be available.
We should not recompute them. Old PM is unchanged because it would
require changing dependencies, and we don't care enough about it.
Differential Revision: https://reviews.llvm.org/D124439
Reviewed By: nikic, aeubanks
isNoopAddrSpaceCast is expecting SrcAS is different from DestAS.
If the two AS are the same, consider ptrtoint/inttoptr as noop cast.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D123573
Previously all entries in global_ctors had to have the void()* type and
we'd skip evaluating bitcasted functions. With opaque pointers we may
see the function directly.
Fixes#55147.
Reviewed By: #opaque-pointers, nikic
Differential Revision: https://reviews.llvm.org/D124553
Using the legacy PM for the optimization pipeline was deprecated in 13.0.0.
Following recent changes to remove non-core features of the legacy
PM/optimization pipeline, remove ThreadSanitizerLegacyPass.
Reviewed By: #sanitizers, vitalybuka
Differential Revision: https://reviews.llvm.org/D124209
Introduced masks where they are not added and improved target dependent
cost models to avoid returning of the incorrect cost results after
adding masks.
Differential Revision: https://reviews.llvm.org/D100486
When a block containing llvm.coro.id is cloned during CHR, it inserts an invalid
PHI node with token type to the beginning of the block containing llvm.coro.begin.
To avoid such case, we exclude regions with llvm.coro.id.
Reviewed By: ChuanqiXu
Differential Revision: https://reviews.llvm.org/D124418
IRCE is a function pass that operates on loops. If there are no loops in
the function (as seen through LI), we should avoid computing the
remaining expensive analyses (such as BPI). Reordered the analyses
requests and early return if there are no loops. This is an NFC with
compile time improvement.
The same will be done in a follow-up patch for the loop vectorizer.
Reviewed-By: nikic
Differential Revision: https://reviews.llvm.org/D124478
This relands commit 8f550368b1.
The test is amended with REQUIRES: x86-registered-target, in line with
the other debuginfo-scev-salvage tests.
Differential Revision: https://reviews.llvm.org/D120169
Second of two patches to extend SCEV-based salvaging to dbg.value
intrinsics that have multiple location ops pre-LSR. This second patch
adds the core implementation.
Reviewers: @StephenTozer, @djtodoro
Differential Revision: https://reviews.llvm.org/D120169
First of two patches that extends SCEV-based salvaging to enable
salvaging of dbg.value instrinsics that have multiple locations ops
before the Loop Strength Reduction pass.
The existing single-op SCEV-based salvaging can generate variadic
dbg.value intrinsics in order to salvage a dbg.value that has a single
location op. If a dbg.value has multiple location ops before LSR, and
LSR optimises away one or more of the location operands, then currently
no salvaging will be attempted.
Salvaging can now be added, but first this patch cleans up consistency
in both the code and comments, and applies some refactoring to make
application of the new salvaging implementation more straightforward.
- Use SCEVDbgValueBuilder for both types of recovery expressions:
IV-offset based and iteration count based.
- Combine the functions that write the final DIExpression.
- Move some static functions into member functions.
Reviewers: @Orlando
Differential Revision: https://reviews.llvm.org/D120168
Currently, two GEPs will only be combined if the result element
type of one is the same as the source element type of the other.
However, this means we may miss folding opportunities where the
second GEP could be rewritten using a different element type. This
is especially relevant for opaque pointers, where constant GEPs
often use i8 element type.
Address this by converting GEP indices to offsets, adding them,
and then converting them back to indices. The first (inner) GEP
is allowed to have variable indices as well, in which case only
the constant suffix is converted into an offset.
This should address the regression reported in
https://reviews.llvm.org/D123300#3467615.
Differential Revision: https://reviews.llvm.org/D124459
I found this bug when performing a two-stage build of clang with
Function Specialization enabled and tuned aggressively. The crash
appears only on release builds.
Fixes https://github.com/llvm/llvm-project/issues/55000.
Before accessing the contents of the ArgInfo iterator inside
SCCPInstVisitor::markArgInFuncSpecialization, we should be
checking that the iterator is valid.
Differential Revision: https://reviews.llvm.org/D124114
canonicalizeClampLike canonicalizes the ule/ugt comparisons to ult/uge,
respectively. However, it does not update the variable holding the
comparison predicate type after doing this. Later code fails to handle
the non-canonical predicate type (specifically, the swap of
ThresholdLowIncl and ThresholdHighExcl when Pred0 has been canonicalized
from ugt to uge). This leads to the miscompile reported in PR53252. Fix
this by updating the comparison predicate after canonicalizing.
Fixes#53252
Differential Revision: https://reviews.llvm.org/D119690
The callback is expected to create a branch to the ContinuationBB (sometimes called FiniBB in some lambdas) argument when finishing. This creates problems:
1. The InsertPoint used for CodeGenIP does not need to be the end of a block. If it is not, a naive callback will insert a branch instruction into the middle of the block.
2. The BasicBlock the CodeGenIP is pointing to may or may not have a terminator. There is an conflict where to branch to if the block already has a terminator.
3. Some API functions work only with block having a terminator. Some workarounds have been used to insert a temporary terminator that is removed again.
4. Some callbacks are sensitive to whether the BasicBlock has a terminator or not. This creates a callback ordering problem where different callback may have different behaviour depending on whether a previous callback created a terminator or not. The problem also exists for FinalizeCallbackTy where some callbacks do create branch to another "continue" block, but unlike BodyGenCallbackTy does not receive the target as argument. This is not addressed in this patch.
With this patch, the callback receives an CodeGenIP into a BasicBlock where to insert instructions. If it has to insert control flow, it can split the block at that position as needed but otherwise no separate ContinuationBB is needed. In particular, a callback can be empty without breaking the emitted IR. If the caller needs the control flow to branch to a specific target, it can insert the branch instruction itself and pass an InsertPoint before the terminator to the callback.
Certain frontends such as Clang may expect the current IRBuilder position to be at the end of a basic block. In this case its callbacks must split the block at CodeGenIP before setting the IRBuilder position such that the instructions after CodeGenIP are moved to another basic block and before returning create a new branch instruction to the split block.
Some utility functions such as `splitBB` are supporting correct splitting of BasicBlocks, independent of whether they have a terminator or not, returning/setting the InsertPoint of an IRBuilder to the end of split predecessor block, and optionally omitting creating a branch to the split successor block to be added later.
Reviewed By: kiranchandramohan
Differential Revision: https://reviews.llvm.org/D118409
We can always replace the undef elements in a vector constant
with regular constants to get rid of the freeze:
https://alive2.llvm.org/ce/z/nfRb4F
The select diffs show that we might do better by adjusting the
logic for a frozen select condition. We may also want to refine
the vector constant replacement to consider forming a splat.
Differential Revision: https://reviews.llvm.org/D123962
Before this patch `Args` was used to pass a broadcat's arguments by SLP.
This patch changes this. `Args` is now used for passing the operands of
the shuffle.
Differential Revision: https://reviews.llvm.org/D124202
This continues the push away from hard-coded knowledge about functions
towards attributes. We'll use this to annotate free(), realloc() and
cousins and obviate the hard-coded list of free functions.
Differential Revision: https://reviews.llvm.org/D123083
This reorganizes the code as a preparation for D123865:
* Use more descriptive names for variables
* Simplify a condition by use an already calculated value
for `MaxPeelCount`
* Remove a duplicate log entry
* Report basic values for loop costs
Differential Revision: https://reviews.llvm.org/D124388
Since the size of most of SCC's is 1, the PriorityInlineOrder would not change the inline
order in SCC inliner.
Reviewed By: kazu
Differential Revision: https://reviews.llvm.org/D123608
At the moment, unfeasible default destinations are not handled properly
in removeNonFeasibleEdges. So far, only unfeasible cases are removed,
but later code expects unreachable blocks to have no predecessors.
This is causing the crash reported in PR49573.
If the default destination is unfeasible it won't be executed. Create
a new unreachable block on demand and use that as default
destination.
Note that at the moment this only is relevant for cases where
resolvedUndefsIn marks the first case as executable. Regular switch
handling has a FIXME/TODO to support determining whether the default
case is feasible or not.
Fixes#48917.
Differential Revision: https://reviews.llvm.org/D113497
Don't check whether an input of BDV can be pruned if the input
is the BDV itself. BDV is present in the states map, so in case
the input is the BDV itself, we'd return false. So explicitly check this case.
Differential Revision: https://reviews.llvm.org/D123846
We may be able to make the ValueTracking wrapper smarter
in the future (for example, analyze a simple recurrence),
so this will automatically benefit if that happens.
tryToVectorize() method implements one of searching paths for vectorizable tree roots in SLP vectorizer,
specifically for binary and comparison operations. Order of making probes for various scalar pairs
was defined by its implementation: the instruction operands, then climb over one operand if
the instruction is its sole user and then perform same actions for another operand if previous
attempts failed. Problem with this approach is that among these options we can have more than a
single vectorizable tree candidate and it is not necessarily the one that encountered first.
Trying to build vectorizable tree for each possible combination for just evaluation is expensive.
But we already have lookahead heuristics mechanism which we use for finding best pick among
operands of commutative instructions. It calculates cumulative score for candidates in two
consecutive lanes. This patch introduces use of the heuristics for choosing the best pair among
several combinations. We only try one that looks as most promising for vectorization.
Additional benefit is that we reduce total number of vectorization trees built for probes
because we skip those looking non-profitable early.
Reviewed By: Alexey Bataev (ABataev), Vasileios Porpodas (vporpo)
Differential Revision: https://reviews.llvm.org/D124309
This AliasPtr is being created always from an Int64 even for targets
where 32 bit is the proper type. e.g. “thumbv7-none-linux-android16”.
This causes the assert in the `get` func to fail as we're getting a 32
bit from the APInt.
Fix this by simply always just getting the type from the value instead.
Reviewed By: ChuanqiXu
Differential Revision: https://reviews.llvm.org/D123272
Using the legacy PM for the optimization pipeline was deprecated in 13.0.0.
Following recent changes to remove non-core features of the legacy
PM/optimization pipeline, remove AddressSanitizerLegacyPass...
...,
ModuleAddressSanitizerLegacyPass, and ASanGlobalsMetadataWrapperPass.
MemorySanitizerLegacyPass was removed in D123894.
AddressSanitizerLegacyPass was removed in D124216.
Reviewed By: #sanitizers, vitalybuka
Differential Revision: https://reviews.llvm.org/D124337
This fixes a series of mis-compiles by SimpleLoopUnswitch.
My measurements showed no performance regression with -O3 on AArch64
in SPEC2006, SPEC2017 and a set of internal benchmarks.
Fixes#50387, #50430
Depends on D124251.
Reviewed By: nikic, aqjune
Differential Revision: https://reviews.llvm.org/D124252
Logic in this pass assumes that all users of loop instructions are
either in the same loop or are LCSSA Phis. In fact, there can also
be users in unreachable blocks that currently break assertions.
Such users don't need to go to the next round of simplifications.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D124368
This patch add a function foldSelectWithFCmpToFabs, and do more combine for
fneg-of-fabs.
With 'nsz':
fold (X < +/-0.0) ? X : -X or (X <= +/-0.0) ? X : -X to -fabs(x)
fold (X > +/-0.0) ? X : -X or (X >= +/-0.0) ? X : -X to -fabs(x)
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D123830
Minor refactoring to reduce size of functional change D124309:
look-ahead scoring routines pulled out of VLOperands and formed
new LookAheadHeuristics helper class.
Reviewed By: Alexey Bataev (ABataev), Vasileios Porpodas (vporpo)
Differential Revision: https://reviews.llvm.org/D124313
MisExpect diagnostics should not prevent compilation from succeeding, and the
assertion is insufficient to prevent division by zero in release builds.
This patch addresses that by replacing the assert with an early return.
Additionally, it disables MisExpect diagnostics when using sample profiling,
since this is the only known case where this error has manifested.
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D124302
We only need to insert a Freeze instruction if any of the conditions
may be poison. Similar checks are already done in the other places
SimpleLoopUnswitch creates Freeze instruction.
Reviewed By: aeubanks, efriedma
Differential Revision: https://reviews.llvm.org/D124259
Folds are supposed to always be added in conjugated pairs for and
and or. Merge the two functions to make folds for which this is
currently not the case more obvious.
1d90e53044 switch this code to store
the predicates and operands in variables, but retained a
swapOperands() call here. Thus the commuted cases were no longer
folded. Additionally, as the change was not reported, the next
InstCombine iteration would not pick it up either.
Reapplying without changes, after a fix to a dependent patch.
-----
Rather than creating a PHI node and then using the PHI threading
code, directly handle this case in
FoldCondBranchOnValueKnownInPredecessor().
This change is supposed to be NFC-ish, but may cause changes due
to different transform order.
Reapply with SmallMapVector instead of SmallDenseMap, which should
address the non-determinism issue.
-----
This general threading transform can be performed whenever we know
a constant value for the condition in a predecessor, which would
currently just be the case of a phi node with constant arguments.
Using the legacy PM for the optimization pipeline was deprecated in 13.0.0.
Following recent changes to remove non-core features of the legacy
PM/optimization pipeline, remove AddressSanitizerLegacyPass,
ModuleAddressSanitizerLegacyPass, and ASanGlobalsMetadataWrapperPass.
MemorySanitizerLegacyPass was removed in D123894.
Reviewed By: #sanitizers, vitalybuka
Differential Revision: https://reviews.llvm.org/D124216
Using the legacy PM for the optimization pipeline was deprecated in 13.0.0.
Following recent changes to remove non-core features of the legacy
PM/optimization pipeline, remove AddressSanitizerLegacyPass,
ModuleAddressSanitizerLegacyPass, and ASanGlobalsMetadataWrapperPass.
MemorySanitizerLegacyPass was removed in D123894.
Reviewed By: #sanitizers, vitalybuka
Differential Revision: https://reviews.llvm.org/D124216
This emits an `st_size` that represents the actual useable size of an object before the redzone is added.
Reviewed By: vitalybuka, MaskRay, hctim
Differential Revision: https://reviews.llvm.org/D123010
The first attempt at this missed a check to make sure the offset
constant was in range and caused many bot failures.
That was missed in the Alive2 proof because on overshift creates
poison rather than the assert from APInt. Here's an alternate
attempt at a proof using count-trailing-zeros:
https://alive2.llvm.org/ce/z/pnXQYR
Original commit message:
This is similar to an existing pre-shift-of-constant fold:
8a9c70fc01
...but in this case, we need no-wrap on the shl and a negative
offset:
https://alive2.llvm.org/ce/z/_RVz99
This reverts commit 3df86e799e.
This reverts commit 8988254667.
`[SimplifyCFG] Handle branch on same condition in pred more directly`
caused non-determinism when compiling opt with a bootstrapped clang.
I have to revert the dependent commit as well.
Using the legacy PM for the optimization pipeline was deprecated in 13.0.0.
Following recent changes to remove non-core features of the legacy
PM/optimization pipeline, remove GCOVProfilerLegacyPass.
I have checked many LLVM users and only llvm-hs[1] uses the legacy gcov pass.
[1]: https://github.com/llvm-hs/llvm-hs/issues/392
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D123829
Using the legacy PM for the optimization pipeline was deprecated in 13.0.0.
Following recent changes to remove non-core features of the legacy
PM/optimization pipeline, remove MemorySanitizerLegacyPass.
Differential Revision: https://reviews.llvm.org/D123894
Debugify in OriginalDebugInfo mode, does (DebugInfo) collect-before-pass & check-after-pass
for each instruction, which is pretty expensive. When used to analyze DebugInfo losses
in large projects (like LLVM), this raises the build time unacceptably.
This patch introduces a limit for the number of processed functions per compile unit.
By default, the limit is set to UINT_MAX (practically unlimited), and by using the introduced
option -debugify-func-limit the limit could be set to any positive integer number.
Differential revision: https://reviews.llvm.org/D115714
Rather than creating a PHI node and then using the PHI threading
code, directly handle this case in
FoldCondBranchOnValueKnownInPredecessor().
This change is supposed to be NFC-ish, but may cause changes due
to different transform order.
This general threading transform can be performed whenever we know
a constant value for the condition in a predecessor, which would
currently just be the case of a phi node with constant arguments.
The legacy passes are deprecated now and would be removed in near
future. This patch tries to remove legacy passes in coroutines.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D123918
This patch extends the scope of VPlan to also include the exit (aka
middle) block.
For now, the exit block remains empty, but handling of exit values will
subsequently be moved to VPlan, by adding recipes to model exit values
in the exit block.
As a first step, this will allow fixing #51366.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D123457
We're making a recursive call here and everything in the function
assumes we're looking at scalars. This would be violated if we
looked through a bitcast from vectors.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D124015
This is not expected to have a functional difference as discussed in the
post-commit comments for 8a9c70fc01. All of the motivating tests for
the older fold still optimize as expected because other code can infer
the 'nuw'.
BlockIsSimpleEnoughToThreadThrough() already checks that the phi
(and all other instructions) are not used outside the block, so
this one-use check is not necessary for legality. I also don't
see any reason why it would be necessary for profitability (in
fact, those extra uses will be replaced with constants, which
should be generally profitable).
test/Transforms/InstCombine/pr39177.ll failed in a -DLLVM_USE_SANITIZER=Undefined build.
```
lib/Transforms/Utils/BuildLibCalls.cpp:1217:17: runtime error: reference binding to null pointer of type 'llvm::Function'
```
`Function &F = *M->getFunction(Name);`
This reverts commit 0f8c626723.
The patch adds SPIRV-specific MC layer implementation, SPIRV object
file support and SPIRVInstPrinter.
Differential Revision: https://reviews.llvm.org/D116462
Authors: Aleksandr Bezzubikov, Lewis Crawford, Ilia Diachkov,
Michal Paszkowski, Andrey Tretyakov, Konrad Trifunovic
Co-authored-by: Aleksandr Bezzubikov <zuban32s@gmail.com>
Co-authored-by: Ilia Diachkov <iliya.diyachkov@intel.com>
Co-authored-by: Michal Paszkowski <michal.paszkowski@outlook.com>
Co-authored-by: Andrey Tretyakov <andrey1.tretyakov@intel.com>
Co-authored-by: Konrad Trifunovic <konrad.trifunovic@intel.com>
Reimplements MisExpect diagnostics from D66324 to reconstruct its
original checking methodology only using MD_prof branch_weights
metadata.
New checks rely on 2 invariants:
1) For frontend instrumentation, MD_prof branch_weights will always be
populated before llvm.expect intrinsics are lowered.
2) for IR and sample profiling, llvm.expect intrinsics will always be
lowered before branch_weights are populated from the IR profiles.
These invariants allow the checking to assume how the existing branch
weights are populated depending on the profiling method used, and emit
the correct diagnostics. If these invariants are ever invalidated, the
MisExpect related checks would need to be updated, potentially by
re-introducing MD_misexpect metadata, and ensuring it always will be
transformed the same way as branch_weights in other optimization passes.
Frontend based profiling is now enabled without using LLVM Args, by
introducing a new CodeGen option, and checking if the -Wmisexpect flag
has been passed on the command line.
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D115907
A new set of overloaded functions named getOrInsertLibFunc() are now supposed
to be used instead of getOrInsertFunction() when building a libcall from
within an LLVM optimizer(). The idea is that this new function also makes
sure that any mandatory argument attributes are added to the function
prototype (after calling getOrInsertFunction()).
inferLibFuncAttributes() is renamed to inferNonMandatoryLibFuncAttrs() as it
only adds attributes that are not necessary for correctness but merely
helping with later optimizations.
Generally, the front end is responsible for building a correct function
prototype with the needed argument attributes. If the middle end however is
the one creating the call, e.g. when replacing one libcall with another, it
then must take this responsibility.
This continues the work of properly handling argument extension if required
by the target ABI when building a lib call. getOrInsertLibFunc() now does
this for all libcalls currently built by any LLVM optimizer. It is expected
that when in the future a new optimization builds a new libcall with an
integer argument it is to be added to getOrInsertLibFunc() with the proper
handling. Note that not all targets have it in their ABI to sign/zero extend
integer arguments to the full register width, but this will be done
selectively as determined by getExtAttrForI32Param().
Review: Eli Friedman, Nikita Popov, Dávid Bolvanský
Differential Revision: https://reviews.llvm.org/D123198
With 'nuw' we can convert the increment of the shift amount
into a pre-shift (constant fold) of the shifted constant:
https://alive2.llvm.org/ce/z/FkTyR2
Fixes issue #41976
This patch moves SCEV expansion of steps used by
VPWidenIntOrFpInductionRecipes to the pre-header using
VPExpandSCEVRecipe. This ensures that those steps are expanded while the
CFG is in a valid state. Previously, SCEV expansion may happen during
vector body code-generation, during which the CFG may be invalid,
causing issues with SCEV expansion.
Depends on D122095.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D122096
This change could reduce the time we call `declaresCoroEarlyIntrinsics`.
And it is helpful for future changes.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D123925
This reverts commit af0285122f.
The test "libomp::loop_dispatch.c" on builder
openmp-gcc-x86_64-linux-debian fails from time-to-time.
See #54969. This patch is unrelated.
The description was ambiguous about the behavior
when boths select arms are constant or both arms
are not constant. I don't think there's any
evidence to support either way, but this matches
the code with a more specified description.
We can extend this to deal with vector constants
with undef/poison elements. Currently, those don't
get folded anywhere.
The OMPScheduleType enum stores the constants from libomp's internal sched_type in kmp.h and are used by several kmp API functions. The enum values have an internal structure, namely each scheduling algorithm (e.g.) exists in four variants: unordered, orderend, normerge unordered, and nomerge ordered.
This patch (basically a followup to D114940) splits the "ordered" and "nomerge" bits into separate flags, as was already done for the "monotonic" and "nonmonotonic", so we can apply bit flags operations on them. It also now contains all possible combinations according to kmp's sched_type. Deriving of the OMPScheduleType enum from clause parameters has been moved form MLIR's OpenMPToLLVMIRTranslation.cpp to OpenMPIRBuilder to make available for clang as well. Since the primary purpose of the flag is the binary interface to libomp, it has been made more private to LLVMFrontend. The primary interface for generating worksharing-loop using OpenMPIRBuilder code becomes `applyWorkshareLoop` which derives the OMPScheduleType automatically and calls the appropriate emitter function.
While this is mostly a NFC refactor, it still applies the following functional changes:
* The logic from OpenMPToLLVMIRTranslation to derive the OMPScheduleType also applies to clang. Most notably, it now applies the nonmonotonic flag for non-static schedules by default.
* In OpenMPToLLVMIRTranslation, the nonmonotonic default flag was previously not applied if the simd modifier was used. I assume this was a bug, since the effect was due to `loop.schedule_modifier()` returning `mlir::omp::ScheduleModifier::none` instead of `llvm::Optional::None`.
* In OpenMPToLLVMIRTranslation, the nonmonotonic default flag was set even if ordered was specified, in breach to what the comment before citing the OpenMP specification says. I assume this was an oversight.
The ordered flag with parameter was not considered in this patch. Changes will need to be made (e.g. adding/modifying function parameters) when support for it is added. The lengthy names of the enum values can be discussed, for the moment this is avoiding reusing previously existing enum value names such as `StaticChunked` to avoid confusion.
Reviewed By: peixin
Differential Revision: https://reviews.llvm.org/D123403
Until now we would only accept a broadcast load pattern if it is only used
by a single vector of instructions.
This patch relaxes this, and allows for the broadcast to have more than one
user vector, as long as all of its uses are internal to the SLP graph and
vectorized.
Differential Revision: https://reviews.llvm.org/D121940
Using the legacy PM for the optimization pipeline was deprecated in 13.0.0.
Following recent changes to remove non-core features of the legacy
PM/optimization pipeline, remove the (Thin)LTO pipelines.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D123882
This patch renames the mergefunc-sanity to mergefunc-verify and renames the related functions to use more
inclusive language
Reviewed By: cebowleratibm
Differential Revision: https://reviews.llvm.org/D114374
When we run the CGSCC pass we should only invest time on the SCC. We can
initialize AAs with information from the module slice but we should not
update those AAs. We make an exception for are call site of the SCC as
they are helpful providing information for the SCC.
Minor modifications to pointer privatization allow us to perform it even
in the CGSCC pass, similar to ArgumentPromotion.
Issue: https://github.com/llvm/llvm-project/issues/54430
For incoming values of phi nodes added to an outlined function to accommodate different exit paths in the function, when a value is a constant that is passed into the outlined function as an argument, we find the corresponding value in the first extracted function used to fill the overall outlined function. When this value is an argument, the corresponding value used will be the old value, prior to outlining. This patch maintains a mapping from these values to arguments, and uses this mapping to update the added phi node accordingly.
Reviewers: paquette
Recommit of d6eb480afb
Differential Revision: https://reviews.llvm.org/D122206
The previous patch introduced the offloading binary format so we can
store some metada along with the binary image. This patch introduces
using this inside the linker wrapper and Clang instead of the previous
method that embedded the metadata in the section name.
Differential Revision: https://reviews.llvm.org/D122683
Instead of lengthy constructors we can now set the members of a
read-only struct before the Attributor is created. Should make it
clearer what is configurable and also help introducing new options in
the future. This actually added IsModulePass and avoids deduction
through the Function set size. No functional change was intended.
With opaque pointers, the stored value and address can be the same.
Previously the code in VPWidenMemoryInstructionRecipe::onlyFirstLaneDemanded
incorrectly considers stores with matching store and pointer operands as
only demanding the first lane, causing a crash.
Legacy PM for optimization pipeline was deprecated in 13.0.0 and Clang dropped
legacy PM support in D123609. This change removes legacy PM passes for PGO so
that downstream projects won't be able to use it. It seems appropriate to start
removing such "add-on" features like instrumentations, before we remove more
stuff after 15.x is branched.
I have checked many LLVM users and only ldc[1] uses the legacy PGO pass.
[1]: https://github.com/ldc-developers/ldc/issues/3961
Reviewed By: davidxl
Differential Revision: https://reviews.llvm.org/D123834
This can cause crashes by accidentally optimizing out checks for
extern_weak_func != nullptr, when replaced with a known-not-null wrapper.
This solution isn't perfect (only avoids replacement on specific patterns)
but should address common cases.
Internal reference: b/185245029
Reviewed By: vitalybuka
Differential Revision: https://reviews.llvm.org/D123701
Issue: https://github.com/llvm/llvm-project/issues/54430
For incoming values of phi nodes added to an outlined function to accommodate different exit paths in the function, when a value is a constant that is passed into the outlined function as an argument, we find the corresponding value in the first extracted function used to fill the overall outlined function. When this value is an argument, the corresponding value used will be the old value, prior to outlining. This patch maintains a mapping from these values to arguments, and uses this mapping to update the added phi node accordingly.
Reviewers: paquette
Differential Revision: https://reviews.llvm.org/D122206
Issue: https://github.com/llvm/llvm-project/issues/54431
PHINodes that need to be generated to accommodate a PHINode outside the region due to different output paths need to have their own numbering to determine the number of output schemes required to properly handle all the outlined regions. This numbering was previously only determined by the order and values of the incoming values, as well as the parent block of the PHINode. This adds the incoming blocks to the calculation of a hash value for these PHINodes as well, and the supporting infrastructure to give each block in a region a corresponding canonical numbering.
Reviewer: paquette
Differential Revision: https://reviews.llvm.org/D122207
This addresses an existing TODO by keeping a mapping of external IR
Value * definitions wrapped in VPValues for use in a VPlan.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D123700
The NewDefault was used to simplify the updating of PHI nodes, but it
causes some inefficiency for target that will run structurizer later. For
example, for a simple two-case switch, the extra NewDefault is causing
unstructured CFG like:
O
/ \
O O
/ \ / \
C1 ND C2
\ | /
\ | /
D
The change is to avoid the ND(NewDefault) block, that is we will get a
structured CFG for above example like:
O
/ \
/ \
O O
/ \ / \
C1 \ / C2
\-> D <-/
The IR change introduced by this patch should be trivial to other targets,
so I am doing this unconditionally.
Fall-through among the cases will also cause unstructured CFG, but it need
more work and will be addressed in a separate change.
Reviewed by: arsenm
Differential Revision: https://reviews.llvm.org/D123607
This reverts commit e810d55809.
The commit was not taken into account the fact that strduped string could be
modified. Checking if such modification happens would make the function very
costly, without a test case in mind it's not worth the effort.
Updated LowerGuardIntrinsic and LowerWidenableCondition to check for
users of the respective intrinsic, instead of checking for guards and
widenable conditions by traversing the entire function.
This is an NFC. Should save some compile time.
This reverts the revert commit 1ddc719680.
This version of the patch sets the initial available value to poison,
which resolves an issue with the SSAUpdater breaking LCSSA form.
C11 specifies memchr() as follows:
> The memchr function locates the first occurrence of c (converted
> to an unsigned char) in the initial n characters (each interpreted
> as unsigned char) of the object pointed to by s. The implementation
> shall behave as if it reads the characters sequentially and stops
> as soon as a matching character is found.
In particular, it is well-defined to specify a memchr size larger
than the underlying object, as long as the character is found before
the end of the object.
Differential Revision: https://reviews.llvm.org/D123665
This should be "NFC" as written, but it will make D122485 smaller
and give us more flexibility to experiment with optimization level
vs. compile-time.
Differential Revision: https://reviews.llvm.org/D123625
Currently SLP vectorizer walks through the instructions and selects
3 main classes of values: 1) reduction operations - instructions with same
reduction opcode (add, mul, min/max, etc.), which build the reduction,
2) reduced values - instructions with the same opcodes, but different
from the reduction opcode, 3) extra arguments - all other values,
instructions from the different basic block rather than the root node,
instructions with to many/less uses.
This scheme is not very efficient. It excludes some instructions and all
non-instruction values from the reductions (constants, proficient
gathers), to many possibly reduced values are marked as extra arguments.
Patch improves this process by introducing a bit extended analysis
stage. During this stage, we still try to select 3 classes of the
values: 1) reduction operations - same as before, 2) possibly reduced
values - all instructions from the current block/non-instructions, which
may build a vectorization tree, 3) extra arguments - instructions from
the different basic blocks. Additionally, an extra sorting of the
possibly reduced values occurs to build the scalar sequences which
highly likely will bed vectorized, e.g. loads are grouped by the
distance between them, constants are grouped together, cmp instructions
are sorted by their compare types and predicates, extractelement
instructions are sorted by the vector operand, etc. Also, these groups
are reordered by their length so the longest group is the first in the
list of the possibly reduced values.
The vectorization process tries to emit the reductions for all these
groups. These reductions, remaining non-vectorized possible reduced
values and extra arguments are then combined into the final expression
just like it was before.
Differential Revision: https://reviews.llvm.org/D114171
We need to explicitly query the shadow here, because it is lazily
initialized for byval arguments. Without opaque pointers this used to
mostly work out, because there would be a bitcast to `i8*` present, and
that would query, and copy in case of byval, the argument shadow.
Reviewed By: vitalybuka, eugenis
Differential Revision: https://reviews.llvm.org/D123602
We need to explicitly query the shadow here, because it is lazily
initialized for byval arguments. Without opaque pointers this used to
mostly work out, because there would be a bitcast to `i8*` present, and
that would query, and copy in case of byval, the argument shadow.
Reviewed By: vitalybuka, eugenis
Differential Revision: https://reviews.llvm.org/D123602
This renames functions for more general usage (and current capitalization style)
before a proposed logic change in D122485.
Differential Revision: https://reviews.llvm.org/D123614
This diff extends foldSelectInstWithICmp to handle the case icmp(X) ? f(X) : C
when f(X) is guaranteed to be equal to C for all X in the exact range of the inverse predicate.
This addresses the issue https://github.com/llvm/llvm-project/issues/54089.
Differential revision: https://reviews.llvm.org/D123159
Test plan: make check-all
The test is already simplified, and I'm not sure how
to write a test to exercise the new clause. But it
protects the 2-bit pattern from miscompiling as noted
in D123453.
https://alive2.llvm.org/ce/z/QPyVfv
(If we managed to fall into the mul transform, it
would wrongly create a zero on this pattern.)
IMO when user provide unroll pragma, compiler should always respect it.
It is not clear to me why loop unroll pass currently ensure that the
unrolled loop size is limited by PragmaUnrollThreshold.
Reviewed By: Meinersbur
Differential Revision: https://reviews.llvm.org/D119148
When only a store is sunk, there is no need to create a load in the
pre-header, as the result of the load will never get used.
The dead load can can introduce UB, if the function is marked as
writeonly.
Fixes#51248.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D123473
After D121624 models the pre-header in VPlan, VPExpandSCEVRecipes can be
placed there. This ensures SCEV expansion happens before modifying the
CFG during VPlan execution, when CFG is incomplete.
Depends on D121624.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D122095
This patch extends the scope of VPlan to also model the pre-header.
The pre-header can be used to place recipes that should be code-gen'd
outside the loop, like SCEV expansion.
Depends on D121623.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D121624
This fixes the code to actually use the location of the instruction, if
available. Previously, SetInsertPoint would overwrite the insert point
set from the instruction.
And thread DSE's ephemeral values to EarliestEscapeInfo.
This allows more precise analysis in DSEState::isReadClobber() via BatchAA.
Followup to D123162.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D123342
Loop Strength Reduce sometimes optimizes away all uses of an induction variable
from a loop but leaves the IV increments. When the only remaining use of the IV
is the PHI in the exit block, this patch will call rewriteLoopExitValues to
replace the exit block PHI with the final value of the IV to skip the updates
in each loop iteration.
Differential Revision: https://reviews.llvm.org/D118808
This makes MemorySSA in LoopSink required, and removes the AST-based
implementation, as well as the related support code in LICM.
Differential Revision: https://reviews.llvm.org/D123288
It actually implements support for seeing through loads, using alias analysis to
refine the result.
This is rather limited, but I didn't want to rely on more than available
analysis at that point (to be gentle with compilation time), and it does seem to
catch common scenario, as showcased by the included tests.
Differential Revision: https://reviews.llvm.org/D122431
Currently, the utility supports lowering of non atomic memory transfer routines only. This patch adds support for atomic version of memcopy. This may be useful for targets not supporting atomic memcopy.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D118443
Similar to the problem in 0bb25b4603, bitcasts that are inserted must
dominate all uses. When rewriting "values" with "new values" that have
the updated address space, we may replace the "new value" with a bitcast
if one of the original users is an addresspace cast. This bitcast must
be inserted before ALL users, not only before the addresspace cast.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D122964
By adding a parameter to function FoldOpIntoSelect, we can fold more Ops to Select.
For this example, we tend to fold the division instruction,
so we no longer care whether SelectInst is one use.
This patch slove TODO left in InstCombine/div.ll.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D122967
This is part of being able to get rid of two more columns in
MemoryBuiltins.cpp's large table. We'll have two more changes before
we can finish the job.
Differential Revision: https://reviews.llvm.org/D119582
Sometimes we can infer an align from an allocalign but the function
already promised it'd be more-aligned than the allocalign and there's an
existing align that we shouldn't reduce. Make sure we handle that
correctly.
Differential Revision: https://reviews.llvm.org/D121642
This is a lshr equivalent to D122340 - if we don't demand any of the additional sign bits introduced by the ashr, the lshr can be treated as an ashr and we can remove the shift entirely if we only demand already known sign bits.
Another step towards PR21929
https://alive2.llvm.org/ce/z/6f3kjq
Differential Revision: https://reviews.llvm.org/D123118
LoopSink with the legacy pass manager still uses AST, because we
can't compute MemorySSA conditionally. I think now that the legacy
pass manager will be removed soon(TM) we don't need to care about
compile-time impact here anymore. Additionally, since MemorySSA is
no longer eagerly optimized, the impact is actually not that high
anymore (~0.2% geomean regression on CTMark).
This just makes legacy PM and new PM behavior line up -- as a
followup I'll drop these options entirely and make MemorySSA use
mandatory.
Differential Revision: https://reviews.llvm.org/D123216
Motivated by pr43326 (https://bugs.llvm.org/show_bug.cgi?id=43326), where a slightly
modified case is as follows.
void f(int e[10][10][10], int f[10][10][10]) {
for (int a = 0; a < 10; a++)
for (int b = 0; b < 10; b++)
for (int c = 0; c < 10; c++)
f[c][b][a] = e[c][b][a];
}
The ideal optimal access pattern after running interchange is supposed to be the following
void f(int e[10][10][10], int f[10][10][10]) {
for (int c = 0; c < 10; c++)
for (int b = 0; b < 10; b++)
for (int a = 0; a < 10; a++)
f[c][b][a] = e[c][b][a];
}
Currently loop interchange is limited to picking up the innermost loop and finding an order
that is locally optimal for it. However, the pass failed to produce the globally optimal
loop access order. For more complex examples what we get could be quite far from the
globally optimal ordering.
What is proposed in this patch is to do a "bubble-sort" fashion when doing interchange.
By comparing neighbors in `LoopList` in each iteration, we would be able to move each loop
onto a most appropriate place, hence this is an approach that tries to achieve the
globally optimal ordering.
The motivating example above is added as a test case.
Reviewed By: Meinersbur
Differential Revision: https://reviews.llvm.org/D120386
Add void casts to mark the variables used, next to the places where
they are used in assert or `LLVM_DEBUG()` expressions.
Differential Revision: https://reviews.llvm.org/D123117
By specification, source and destination of llvm.memcpy.* must either be equal or non-overlapping. This semantics is hard or impossible to figure out once lowered. This patch explicitly marks loads from source and stores to destination as not aliasing if source and destination is known to be not equal.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D118441
The Attributor, as many other parts in LLVM, uses pointer equivalence
for `llvm::Value`s. This only works as long as `llvm::Value`s are
dynamically unique, or, to be exact, we will never end up with the same
`llvm::Value` representing two dynamic instances. We already provided a
helper to check the former, namely `AA::isDynamicallyUnique`, however we
could not check the latter. In this patch we move the logic into a
separate AA which helps with the growing complexity and use cases. We
also extend the interface to answer the second question rather than the
first. So we do not determine dynamically uniqueness but if we might end
up with the `llvm::Value` describing a different dynamic instance. Note
that the latter is very much tied to the Attributor capabilities to look
through memory, recursion, etc. so we need to update the logic as we go.
We look through loads in the "generic value traversal" and we
consequently don't need to look through them again in AAValueSimplify*.
The test changes stem from the fact that we allowed any simplified
value, incl. non-dynamically unique ones, as long as the underlying
memory was an alloca. This doesn't seem to make sense as allocas do not
protect against dynamically non-unique values. We need to make the
unique check better rather than excluding allocas. That in mind, we can
remove a lot of code by simply relying on the generic value traversal
load look through.
To soften the blow some minor adjustments have been made that allow more
simplification through the now used scheme and some tests have been
given a `norecurse` for now.
With D106397 we ensured that `AAReachability` will not answer queries for
potentially recursive functions. This was necessary as we did not treat
recursion explicitly otherwise. Now that we have
`AA::isPotentiallyReachable` we can make `AAReachability` a purely
intra-procedural AA which does not care about recursion.
`AA::isPotentiallyReachable`, however, does already deal with "going
back" the call graph and can now do so for potentially recursive
functions.
Add statistics to count overall devirtualized targets as well as the
various types of devirtualizations applied at callsites.
Differential Revision: https://reviews.llvm.org/D123152
If we ignore droppable users everything only used in llvm.assume (among
other things) is going to be deleted as dead. This is not helpful.
Instead we want to only delete things we actually don't need anymore. A
follow up will deal with loads in a smarter way.
Partially inlining a libcall that has the musttail attribute
leads to broken LLVM IR, triggering an assertion in the IR verifier.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D123116
When creating induction resume values, SCEV queries may rely on
LoopInfo. Make sure vector.body gets added to the loop of the pre-header
during skeleton construction.
%vector.body will be moved to the vector preheader during VPlan
execution.
Fixes#54745.
getExtAttrForI32Param() is the method to be used for determining the type of
extension attribute (if any) that is to be added for a signed/unsigned
argument.
Previously, the SExt attribute was always added to the i32 ldexp* argument as
it was expected to be ignored by targets not needing it. This patch now
changes this so that it is only added for the targets that need it in the
first place.
Putchar() argument is now also extended as required by the target (SystemZ in
the test), to fix the issue below. Many more libcalls will be handled
similarly in a following patch.
Fixes https://github.com/llvm/llvm-project/issues/54532.
Differential Revision: https://reviews.llvm.org/D123030
Review: Eli Friedman
When simplify values we might end up with an instruction from a
different scope or just one that does not dominate the use. If the
instruction can be reproduced without side-effect (incl. UB) we can
now do that. For now this is mostly used for speculatable (intrinsic)
calls but as we learn to make things like arguments or loads available
this will become more powerful.
This will also allow us to remove dead stores more easily in a follow
up.
Update VPInterleavedAccessInfo to use the generic getVectorLoopRegion
helper instead of relying on the entry block being the top-most vector
loop region.
If both the character and string are known, but the length
potentially isn't, we can optimize the memchr() call to a select
of either the known position of the character or null.
Split off from https://reviews.llvm.org/D122836.
Handle the simple constant char case before the bitmask optimization.
This will allow extending the code to handle a non-constant size
argument in a followup change.
Split out from https://reviews.llvm.org/D122836.
If the memchr() size is 1, then we can convert the call into a
single-byte comparison. This works even if both the string and the
character are unknown.
Split off from https://reviews.llvm.org/D122836.
As discussed on https://github.com/llvm/llvm-project/issues/54682,
MemorySSA currently has a bug when computing the clobber of calls
that access loop-varying locations. I think a "proper" fix for this
on the MemorySSA side might be non-trivial, but we can easily work
around this in MemCpyOpt:
Currently, MemCpyOpt uses a location-less getClobberingMemoryAccess()
call to find a clobber on either the src or dest location, and then
refines it for the src and dest clobber. This was intended as an
optimization, as the location-less API is cached, while the
location-affected APIs are not.
However, I don't think this really makes a difference in practice,
because I don't think anything will use the cached clobbers on
those calls later anyway. On CTMark, this patch seems to be very
mildly positive actually.
So I think this is a reasonable way to avoid the problem for now,
though MemorySSA should also get a fix.
Differential Revision: https://reviews.llvm.org/D122911
The range calculation in walkForwards() assumes that the ranges of
the operands have already been calculated. With the used visit
order, this is not necessarily the case when there are multiple
roots. (There is nothing guaranteeing that instructions are visited
in topological order.)
Fix this by queuing instructions for reprocessing if the operand
ranges haven't been calculated yet.
Fixes https://github.com/llvm/llvm-project/issues/54669.
Differential Revision: https://reviews.llvm.org/D122817
I didn't dig into this very much because it appears to be totally valid
(especially once these properties can come from attributes instead
of only from hard-coded library functions) for TLI to not be defined,
and nothing broke when I added this check, including with all my other
patches applied.
Differential Revision: https://reviews.llvm.org/D122917
The search for the clobbering call is fairly expensive if uses are not optimized at construction. Defer the clobber walk to the point in the implementation we need it; there are a bunch of bailouts before that point. (e.g. If the source pointer is not an alloca, we can't do callslotopt.)
On a test case which involves a bunch of copies from argument pointers, this switches memcpyopt from > 1/2 second to < 10ms.
This is a hacky fix for:
https://github.com/llvm/llvm-project/issues/54558
As discussed there, codegen regressed when we opened up this transform
to allow extra uses ( 61580d0949 ), and it's not clear how to
undo the transforms at the later stage of compilation.
As noted in the code comments, there's a set of remaining folds that
are still limited to one-use, so we can try harder to refine and
expand the limitations on these folds, but it's likely to be an
up-and-down battle as we find and overcome similar regressions.
Differential Revision: https://reviews.llvm.org/D122909
This is a retry of 9397bdc67e - that was reverted until
we had a clang warning in place to alert users about a
possible mistake in source. The warning was added with
ab982eace6.
This is noted as a missing clang warning in #54222,
but it is also a missing optimization opportunity.
Alive2 proofs:
https://alive2.llvm.org/ce/z/Q8drDqhttps://alive2.llvm.org/ce/z/pE6LRt
I don't see a single conversion for all predicates
using "getFCmpCode" logic, so other predicates are
left as a TODO item.
These two are equivalent,
and i *think* the `and` form is more-ish canonical.
General proof: https://alive2.llvm.org/ce/z/RrF5s6
If constant on the (outer) `xor` is an `undef`,
the whole lane is dead: https://alive2.llvm.org/ce/z/mu4Sh2
However, if the constant on the (inner) `or` is an `undef`,
we must sanitize it first: https://alive2.llvm.org/ce/z/MHYJL7
I guess, producing a zero `and`-mask is optimal in that case.
alive-tv is happy about the entirety of `xor-of-or.ll`.
This refactor makes it easier to extend the logic to collect information
from blocks in the future, without even further increasing the size of
eliminateConstriants.
This was exposed by 14e3650f. The recommit of 14e3650f will hit the
problematic code path requiring the workaround.
test case that crashes without the workaround.
During skeleton construction for the epilogue vector loop, generic
helpers use getOrCreateTripCount, which will re-expand the trip count
computation. Instead, re-use the TripCount created during main loop
vectorization.
If the frame pointer is an argument of the original pointer (which
happens with opaque pointers), then we currently first replace the
argument with undef, which will prevent later replacement of the
old frame pointer with the new one.
Fix this by replacing arguments with some dummy instructions first,
and then replacing those with undef later. This gives us a chance
to replace the frame pointer before it becomes undef.
Fixes https://github.com/llvm/llvm-project/issues/54523.
Differential Revision: https://reviews.llvm.org/D122375
This isn't expected to reduce compilation times as 'max-iters' is set to
one by default, but it helps with recursive functions that require higher
iteration counts.
Differential Revision: https://reviews.llvm.org/D122819
Reimplements MisExpect diagnostics from D66324 to reconstruct its
original checking methodology only using MD_prof branch_weights
metadata.
New checks rely on 2 invariants:
1) For frontend instrumentation, MD_prof branch_weights will always be
populated before llvm.expect intrinsics are lowered.
2) for IR and sample profiling, llvm.expect intrinsics will always be
lowered before branch_weights are populated from the IR profiles.
These invariants allow the checking to assume how the existing branch
weights are populated depending on the profiling method used, and emit
the correct diagnostics. If these invariants are ever invalidated, the
MisExpect related checks would need to be updated, potentially by
re-introducing MD_misexpect metadata, and ensuring it always will be
transformed the same way as branch_weights in other optimization passes.
Frontend based profiling is now enabled without using LLVM Args, by
introducing a new CodeGen option, and checking if the -Wmisexpect flag
has been passed on the command line.
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D115907
Instead of first creating a lambda for calculating the range,
then collecting the ranges for the operands, and then calling the
lambda on those ranges, we can first calculate the operand ranges
and then calculate the result directly in the switch.
This reverts the revert commit 2760cdc9c6.
This version pulls in the code to create the vector loop object in VPlan
from D121624.
This is needed because otherwise existing LoopInfo verification will
fail, as a loop block doesn't have in-loop successors now that we
do not replace the branch.
Now that we do not add new loops during skeleton construction, there's
also no need to verify LI there.
According to the current design, if a floating point operation is
represented by a constrained intrinsic somewhere in a function, all
floating point operations in the function must be represented by
constrained intrinsics. It imposes additional requirements to inlining
mechanism. If non-strictfp function is inlined into strictfp function,
all ordinary FP operations must be replaced with their constrained
counterparts.
Inlining strictfp function into non-strictfp is not implemented as it
would require replacement of all FP operations in the host function,
which now is undesirable due to expected performance loss.
Differential Revision: https://reviews.llvm.org/D69798
This fixes a TODO in constantArgPropagation() to make it feature complete.
However, I do find myself in agreement with the review comments in
https://reviews.llvm.org/D106426. I don't think we should pursue
specializing such recursive functions as the code size increase becomes
linear to 'max-iters'. Compiling the modified test just with -O3 (no
function specialization) generates the same code.
Differential Revision: https://reviews.llvm.org/D122755
The only remaining use was to get the exit block of the loop. Instead of
relying on the loop, use the successor of VectorHeaderBB
(LoopMiddleBlock) directly to set VPTransformState::CFG::ExitB
Depends on D121621.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D121623
Allow receiving memcpy/memset/memmove instrumentation by using __asan or
__hwasan prefixed versions for AddressSanitizer and HWAddressSanitizer
respectively when compiling in kernel mode, by passing params
-asan-kernel-mem-intrinsic-prefix or -hwasan-kernel-mem-intrinsic-prefix.
By default the kernel-specialized versions of both passes drop the
prefixes for calls generated by memintrinsics. This assumes that all
locations that can lower the intrinsics to libcalls can safely be
instrumented. This unfortunately is not the case when implicit calls to
memintrinsics are inserted by the compiler in no_sanitize functions [1].
To solve the issue, normal memcpy/memset/memmove need to be
uninstrumented, and instrumented code should instead use the prefixed
versions. This also aligns with ASan behaviour in user space.
[1] https://lore.kernel.org/lkml/Yj2yYFloadFobRPx@lakrids/
Reviewed By: glider
Differential Revision: https://reviews.llvm.org/D122724
When MaximizeVectorBandwidth is enabled, we can end up (via calls to
collectUniformsAndScalars/setCostBasedWideningDecision through
calculateRegisterUsage) making widening decisions before we have decided
whether to fold the tail by masking. These decisions will be wrong if we
later decided to fold the tail, for example when the trip count is very
low. It will use incorrect costs for loads that should get masked, using
standard memory operation costs instead.
This still at the moment uses the EmulatedMaskMemRefHack costs (a bit
unfortunately), but the old costs without this change were 1, leading to
too optimistic vectorization.
This slightly changes the way that the MaximizeVectorBandwidth option
works to make it easier to test, always honouring the option if it is
set.
Differential Revision: https://reviews.llvm.org/D120215
According to the LLVM debug info update guide: https://llvm.org/docs/HowToUpdateDebugInfo.html,
"Hoisting identical instructions which appear in several successor
blocks into a predecessor block. In this case there is no single
merged instruction. The rule for dropping locations applies".
Thanks to Yuanbo Li for reporting this.
Reviewed By: dblaikie
Reviewers: sebpop, tejohnson, dblaikie
Differential Revision: https://reviews.llvm.org/D122730
Factor in the TBAA of adjacent stores instead of just the head store
when merging stores into a memset. We were seeing GVN remove a load that
had a TBAA that matched the 2nd store because GVN determined it didn't
match the TBAA of the memset. The memset had the TBAA of only the first
store.
i.e. Loading the field pi_ of shared_count after memset to create an
array of shared_ptr
template<class T>
class shared_ptr {
T *p;
shared_count refcount;
};
class shared_count {
sp_counted_base *pi_;
};
Differential Revision: https://reviews.llvm.org/D122205
Instead of looking up the vector loop using the header, keep track of
the current vector loop in VPTransformState. This removes the
requirement for the vector header block being part of the loop up front.
A follow-up patch will move the code to generate the Loop object for the
vector loop to VPRegionBlock.
Depends on D121619.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D121621
Avoids merge errors when opaque pointers are loaded into different types.
Reviewed by: jcranmer-intel, hiraditya
Differential Revision: https://reviews.llvm.org/D122521
Now that all dependencies on creating the latch block up-front have been
removed, there is no need to create it early.
Depends on D121618.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D121619
In some case, like in the added test case, we can reach
selectInterleaveCount with loops that actually have a cost of 0.
Unfortunately a loop cost of 0 is also used to communicate that the cost
has not been computed yet. To resolve the crash, bail out if the cost
remains zero after computing it.
This seems like the best option, as there are multiple code paths that
return a cost of 0 to force a computation in selectInterleaveCount.
Computing the cost at multiple places up front there would unnecessarily
complicate the logic.
Fixes#54413.
DXIL is wrapped in a container format defined by the DirectX 11
specification. Codebases differ in calling this format either DXBC or
DXILContainer.
Since eventually we want to add support for DXBC as a target
architecture and the format is used by DXBC and DXIL, I've termed it
DXContainer here.
Most of the changes in this patch are just adding cases to switch
statements to address warnings.
Reviewed By: pete
Differential Revision: https://reviews.llvm.org/D122062
This patch moves the code to set the correct incoming block for the
backedge value to VPlan::execute.
When generating the phi node, the backedge value is temporarily added
using the pre-header as incoming block. The invalid phi node will be
fixed up during VPlan::execute after main VPlan code generation.
At the same time, the backedge value is also moved to the latch.
This change removes the requirement to create the latch block up-front
for VPWidenInductionPHIRecipe::execute, which in turn will enable
modeling the pre-header in VPlan.
Depends on D121617.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D121618
According to definition of canonical form, it is a canonical
if scale reg does not contain addrec for loop L then none of bases
should contain addrec for this loop.
The critical word here is "contains".
Current checker of canonical form checks not "containing" property
but "is". So it does not check whether it contains but whether it is.
Fix the checker and canonicalizing utility to follow definition.
Without this fix in the test attached the base formula looking as
reg((-1 * {0,+,8}<nuw><nsw><%bb2>)<nsw>) + 1*reg((8 * (%arg /u 8))<nuw>)
is considered as conanocial while base contains an addrec.
And modified formula we want to insert
reg({0,+,8}<nuw><nsw><%bb2>) + 1*reg((-8 * (%arg /u 8)))
is considered as not canonical.
Reviewed By: mkazantsev
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D122457
We have the same code repeated in both callers, sink it into callee.
The motivation here isn't just code style, we can also defer the relatively expensive aliasing checks until the cheap structural preconditions have been validated. (e.g. Don't bother aliasing if src is not an alloca.) This helps compile time significantly.
If we vectorize a e.g. store, we leave around a bunch of getelementptrs for the individual scalar stores which we removed. We can go ahead and delete them as well.
This is purely for test output quality and readability. It should have no effect in any sane pipeline.
Differential Revision: https://reviews.llvm.org/D122493
Inline assembly is scary but we need to support it for the OpenMP GPU
device runtime. The new assumption expresses the fact that it may not
have call semantics, that is, it will not call another function but
simply perform an operation or side-effect. This is important for
reachability in the presence of inline assembly.
Differential Revision: https://reviews.llvm.org/D109986
Before we gave up if a call through bitcast had parameter attributes.
Interestingly, we allowed attributes for the return value already. We
now handle both the same way, namely, we drop the ones that are
incompatible with the new type and keep the rest. This cannot cause
"more UB" than initially present.
Differential Revision: https://reviews.llvm.org/D119967
Reimplements MisExpect diagnostics from D66324 to reconstruct its
original checking methodology only using MD_prof branch_weights
metadata.
New checks rely on 2 invariants:
1) For frontend instrumentation, MD_prof branch_weights will always be
populated before llvm.expect intrinsics are lowered.
2) for IR and sample profiling, llvm.expect intrinsics will always be
lowered before branch_weights are populated from the IR profiles.
These invariants allow the checking to assume how the existing branch
weights are populated depending on the profiling method used, and emit
the correct diagnostics. If these invariants are ever invalidated, the
MisExpect related checks would need to be updated, potentially by
re-introducing MD_misexpect metadata, and ensuring it always will be
transformed the same way as branch_weights in other optimization passes.
Frontend based profiling is now enabled without using LLVM Args, by
introducing a new CodeGen option, and checking if the -Wmisexpect flag
has been passed on the command line.
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D115907
This patch moves the code to set the correct incoming block for the
backedge value to VPlan::execute.
When generating the phi node, the backedge value is temporarily added
using the pre-header as incoming block. The invalid phi node will be
fixed up during VPlan::execute after main VPlan code generation.
At the same time, the backedge value is also moved to the latch.
This change removes the requirement to create the latch block up-front
for VPWidenIntOrFpInductionRecipe::execute, which in turn will enable
modeling the pre-header in VPlan.
As an alternative, the increment could be modeled as separate recipe,
but that would require more work and a bit of redundant code, as we need
to create the step-vector during VPWidenIntOrFpInductionRecipe::execute
anyways, to create the values for different parts.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D121617
This was reusing a cast to GlobalVariable to check for an
Instruction, which means we'll try to dereference a null pointer
if it's not actually a GlobalVariable. We should be casting
MTI->getSource() instead.
I don't think this problem is really specific to opaque pointers,
but it certainly makes it a lot easier to reproduce.
Fixes https://github.com/llvm/llvm-project/issues/54572.
The current implementation of Function Specialization does not allow
specializing more than one arguments per function call, which is a
limitation I am lifting with this patch.
My main challenge was to choose the most suitable ADT for storing the
specializations. We need an associative container for binding all the
actual arguments of a specialization to the function call. We also
need a consistent iteration order across executions. Lastly we want
to be able to sort the entries by Gain and reject the least profitable
ones.
MapVector fits the bill but not quite; erasing elements is expensive
and using stable_sort messes up the indices to the underlying vector.
I am therefore using the underlying vector directly after calculating
the Gain.
Differential Revision: https://reviews.llvm.org/D119880
This simplifies the implementation of eraseInstruction by moving the odd-replace-users-with-undef handling back to the only caller which uses it. This handling was not obviously correct, so add the asserts which make it clear why this is safe to do at all. The result is simpler code and stronger assertions.
The original commit exposed several missing dependencies (e.g. latent bugs in SLP scheduling). Most of these were fixed over the weekend and have had several days to bake. The last was fixed this morning after being noticed in manual review of test changes yesterday. See the review thread for links to each change.
Original commit message follows:
SLP currently schedules all instructions within a scheduling window which stretches from the first instruction potentially vectorized to the last. This window can include a very large number of unrelated instructions which are not being considered for vectorization. This change switches the code to only schedule the sub-graph consisting of the instructions being vectorized and their transitive users.
This has the effect of greatly reducing the amount of work performed in large basic blocks, and thus greatly improves compile time on degenerate examples. To understand the effects, I added some statistics (not planned for upstream contribution). Here's an illustration from my motivating example:
Before this patch:
704357 SLP - Number of calcDeps actions
699021 SLP - Number of schedule calls
5598 SLP - Number of ReSchedule actions
59 SLP - Number of ReScheduleOnFail actions
10084 SLP - Number of schedule resets
8523 SLP - Number of vector instructions generated
After this patch:
102895 SLP - Number of calcDeps actions
161916 SLP - Number of schedule calls
5637 SLP - Number of ReSchedule actions
55 SLP - Number of ReScheduleOnFail actions
10083 SLP - Number of schedule resets
8403 SLP - Number of vector instructions generated
I do want to highlight that there is a small difference in number of generated vector instructions. This example is hitting the bailout due to maximum window size, and the change in scheduling is slightly perturbing when and how we hit it. This can be seen in the RescheduleOnFail counter change. Given that, I think we can safely ignore.
The downside of this change can be seen in the large test diff. We group all vectorizable instructions together at the bottom of the scheduling region. This means that vector instructions can move quite far from their original point in code. While maybe undesirable, I don't see this as being a major problem as this pass is not intended to be a general scheduling pass.
For context, it's worth noting that the pre-scheduling that SLP does while building the vector tree is exactly the sub-graph scheduling implemented by this patch.
Differential Revision: https://reviews.llvm.org/D118538
After writing the commit message for 4b1bace28, realized that the mentioned optimization was rather straight forward. We already have the code for scanning a block during region initialization, we can simply keep track if we've seen a stacksave or stackrestore. If we haven't, none of these dependencies are relevant and we can avoid the relatively expensive scans entirely.
This is an extension of commit b7806c to handle one last case noticed in test changes for D118538. Again, this is thought to be a latent bug in the existing code, though this time I have not managed to reduce tests for the original algoritthm.
The prior attempt had failed to account for this case:
%a = alloca i8
stacksave
stackrestore
store i8 0, i8* %a
If we allow '%a' to reorder into the stacksave/restore region, then the alloca will be deallocated before the use. We will have taken a well defined program, and introduced a use-after-free bug.
There's also an inverse case where the alloca originally follows the stackrestore, and we need to prevent the reordering it above the restore.
Compile time wise, we potentially do an extra scan of the block for each alloca seen in a bundle. This is significantly more expensive than the stacksave rooted version and is why I'd tried to avoid this in the initial patch. There is room to optimize this (by essentially caching a "has stacksave" bit per block), but I'm leaving that to future work if it actually shows up in practice. Since allocas in bundles should be rare in practice, I suspect we can defer the complexity for a long while.
CoverageMappingModuleGen generates a coverage mapping record
even for unused functions with internal linkage, e.g.
static int foo() { return 100; }
Clang frontend eliminates such functions, but InstrProfiling pass
still pulls in profile runtime since there is a coverage record.
Fuchsia uses runtime counter relocation, and pulling in profile
runtime for unused functions causes a linker error:
undefined hidden symbol: __llvm_profile_counter_bias.
Since 389dc94d4b, we do not hook profile runtime for the binaries
that none of its translation units have been instrumented in Fuchsia.
This patch extends that for the instrumented binaries that
consist of only unused functions.
Differential Revision: https://reviews.llvm.org/D122336
Update all places that currently assume the entry block to the plan is
also the vector loop header to use getVectorLoopRegion instead.
getVectorLoopRegion will keep doing the right thing when the pre-header
is modeled explicitly (and becomes the new entry block in the plan).
Probe-based profile leads to a better performance when combined with profi and ext-tsp block layout. I'm turning them on by default.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D122442
We already do this for SelectionDAG, but we're missing it here.
Noticed while re-triaging PR21929
Differential Revision: https://reviews.llvm.org/D122340
We can not bitcast pointers across different address spaces. This was
previously fixed in D89577 but then in D93229 an enhancement was added
which peeks further through the ponter operand, opening up the
possibility that address-space violations could be introduced.
Instead of bailing as the previous fix did, simply insert an
addrspacecast cast instruction.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D121787
Most intrinsics, especially "default" ones, will not call back into the
IR module. `nocallback` encodes this nicely. As it was not used before,
this patch also makes use of `nocallback` in the Attributor which
results in many more `norecurse` deductions.
Tablegen part is mechanical, test updates by script.
Differential Revision: https://reviews.llvm.org/D118680
This patch moves pointer induction handling from VPWidenPHIRecipe to its
own recipe. In the process, it adds all information required to generate
code for pointer inductions without relying on Legal to access the list
of induction phis.
Alternatively VPWidenPHIRecipe could also take an optional pointer to InductionDescriptor.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D121615
Add the "(0 - X) --> (X * -1)" reverse identity to the list of alternate form binops.
We need a little hack to make the existing logic work because it does not expect to
move constants from op0 to op1, but the code comment hopefully makes that clear.
I don't think there are any other identities like that.
Fixes#54364
Differential Revision: https://reviews.llvm.org/D122390
For MachO, lower `@llvm.global_dtors` into `@llvm_global_ctors` with
`__cxa_atexit` calls to avoid emitting the deprecated `__mod_term_func`.
Reuse the existing `WebAssemblyLowerGlobalDtors.cpp` to accomplish this.
Enable fallback to the old behavior via Clang driver flag
(`-fregister-global-dtors-with-atexit`) or llc / code generation flag
(`-lower-global-dtors-via-cxa-atexit`). This escape hatch will be
removed in the future.
Differential Revision: https://reviews.llvm.org/D121736
There is potential for endless recursion if we try to determine the
underlying objects of a load, just to end up with the load as underlying
object. A proper solution will require us to pass a visited set around.
This will happen as we cleanup genericValueTraversal soon.
When --disable-sample-loader-inlining is true, skip inline transformation, but merge profiles of inlined instances to outlining versions.
Differential Revision: https://reviews.llvm.org/D121862
CoroSplit lowers various coroutine intrinsics. It's a CGSCC pass and
CGSCC passes don't run on unreachable functions. Normally GlobalDCE will
come along and delete unreachable functions, but we don't run GlobalDCE
under -O0, so an unreachable function with coroutine intrinsics may
never have CoroSplit run on it.
This patch adds GlobalDCE when coroutines intrinsics are present. It
also now runs all coroutine passes conditional when coroutine intrinsics
are present. This should also solve the -O0 regression reported in
D105877 due to LazyCallGraph construction.
Fixes https://github.com/llvm/llvm-project/issues/54117
Reviewed By: ChuanqiXu
Differential Revision: https://reviews.llvm.org/D122275
EarlyCSE currently optimizes all MemoryUses upfront. However,
EarlyCSE only actually queries the clobbering memory access for
a subset of uses, namely those where a CSE candidate has already
been identified. Delaying use optimization to the clobber query
improves compile-time in practice.
This change is not NFC because EarlyCSE has a limit on the number
of clobber queries (EarlyCSEMssaOptCap), in which case it falls
back to the defining access. The defining access for uses will now
no longer coincide with the optimized access.
If there are performance regressions from this change, we should
be able to address them by raising this limit.
Differential Revision: https://reviews.llvm.org/D121987
The first attempt at this missed a validity check.
This version includes a test of the narrow source
type for modulo-16-bits.
Original commit message:
This is the IR counterpart to 370ebc9d9a
which provided a bswap narrowing fix for issue #53867.
Here we can be more general (although I'm not sure yet
what would happen for illegal types in codegen - too
rare to worry about?):
https://alive2.llvm.org/ce/z/3-CPfo
This will be more effective if we have moved the shift
after the bswap as proposed in D122010, but it is
independent of that patch.
Differential Revision: https://reviews.llvm.org/D122166
The motivation for this is that while both memcpyopt and dse will catch this case, both are limited by MSSA's walk back threshold when finding clobbers. As such, if you have a memcpy of an otherwise dead alloca placed towards the end of a long basic block with lots of other memory instructions, it would be missed. This is a bit undesirable for such an "obviously" useless bit of code.
As noted in comments, we should probably generalize instcombine's escape analysis peephole (see visitAllocInst) to allow read xor write. Doing that would subsume this code in a more general way, but is also a more involved change. For the moment, I went with the easiest fix.
createInductionResumeValues only uses its loop argument only to get the
pre-header, but the pre-header is already known (we created/cached it
earlier). Remove the unneeded loop argument.
When shifting by a byte-multiple:
bswap (shl X, C) --> lshr (bswap X), C
bswap (lshr X, C) --> shl (bswap X), C
This is an IR implementation of a transform suggested in D120648.
The "swaps cancel" test models the motivating optimization from
that proposal.
Alive2 checks (as noted in the other review, we could use
knownbits to handle shift-by-variable-amount, but that can be an
enhancement patch):
https://alive2.llvm.org/ce/z/pXUaRfhttps://alive2.llvm.org/ce/z/ZnaMLf
Differential Revision: https://reviews.llvm.org/D122010
Before this patch the DebugifyLevel option was used for
the synthetic mode, so after this, it will be used in
the original mode as well.
Differential Revision: https://reviews.llvm.org/D115623
Rather than iterating over users and comparing operands, iterate
over uses and check operand number. Otherwise, we'll end up
promoting a store twice if it has two equal operands.
This can only happen with opaque pointers, as otherwise both
operands differ by a level of indirection, so a bitcast would have
to be involved.
Fixes https://github.com/llvm/llvm-project/issues/54495.
This is the IR counterpart to 370ebc9d9a
which provided a bswap narrowing fix for issue #53867.
Here we can be more general (although I'm not sure yet
what would happen for illegal types in codegen - too
rare to worry about?):
https://alive2.llvm.org/ce/z/3-CPfo
This will be more effective if we have moved the shift
after the bswap as proposed in D122010, but it is
independent of that patch.
Differential Revision: https://reviews.llvm.org/D122166
Before we start addressing the issue with having
a lot of false positives when using debugify in
the original mode, we have made a few patches that
should speed up the execution of the testing
utility Passes.
For example, when testing a large project
(let's say LLVM project itself), we can face
a lot of potential DI issues. Usually, we use
-verify-each-debuginfo-preserve (that is very
similar to -debugify-each) -- it collects
DI metadata before each Pass, and after the Pass
it checks if the Pass preserved the DI metadata.
However, we can speed up this process, since we
don't need to collect DI metadata before each
Pass -- we could use the DI metadata that are
collected after the previous Pass from
the pipeline as an input for the next Pass.
This patch speeds up the utility for ~2x.
Differential Revision: https://reviews.llvm.org/D115622
The CoroSplit pass would check the existence of coroutine intrinsic
before starting work. It is not necessary and wasteful since it would
iterate over the Module.
This patch also removes the constraint on the corresponding of the
SmallVector for the possible coroutines in the Modules. The original
value is 4. Given coroutines is used actually in practice. 4 is really
relatively a low threshold.
When outlining a phi node, if the the incoming branch is a block contained in the region and the branch from that block is not outlined, we create broken code. The fix is to recognize when that branch from the included incoming block is not contained, and ignore the region.
Reviewer: paquette
Differential Revision: https://reviews.llvm.org/D121311
There is a bunch of code improvements in the patch: marking as const everything what can be
const and fixing some typos in comments.
Also the patch removes the shadowing parameter TTI from the rewriteWithNewAddressSpaces
method, the TTI parameter is not required because the same field is in the class.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D121671
completeLoopSkeleton only uses its loop argument only to get the
pre-header, but the pre-header is already known (we created/cached it
earlier). Remove the unneeded loop argument.
Since the IROutliner is performing an optimization, it should not outline from functions explicitly marked with optnone. This adds an extra check and test to make sure this does not occur.
Reviewers: paquette
Differential Revision: https://reviews.llvm.org/D121567
The semantics of an inalloca alloca instruction requires that it not be reordered with a preceeding stacksave intrinsic call. Unfortunately, there's no def/use edge or memory dependence edge. (THe memory point is slightly subtle, but in general a new allocation can't alias with a call which executes strictly before it comes into existance.)
I'd tried to tackle this same case previously in 689babdf6, but the fix chosen there turned out to be incomplete. As such, this change contains a fully revert of the first fix attempt.
This was noticed when investigating problems which surfaced with D118538, but this is definitely an existing bug. This time around, I managed to reduce a couple of additional cases, including one which was being actively miscompiled even without the new scheduling change. (See test diffs)
Compile time wise, we only spend extra time when seeing a stacksave (rare), and even then we walk the block at most once per schedule window extension. Likely a non-issue.
This fixes an active miscompile visible in the test changes. The basic problem is that the scheduling dependency graph didn't have any edges for control dependence within a single basic block. The result is that we could (and in some rare cases *did*) perform reorderings within a block which could introduce new undefined behavior along paths which didn't previously contain any.
Impact wise, we have two major cases where control is not guaranteed to reach a later instruction in the block: may throw calls, and calls containing infinite loops.
* The former case was mostly covered by the memory dependencies, and to trigger require a function which can throw, but not write to memory. In theory, such a case is possible, but not likely in practice.
* The later case is likely more of an issue in practice. After this code was first written, we changed the IR semantics to allow well defined infinite loops without satisifying mustprogress. Even for C/C++ - which do imply mustprogress - recent changes to how we treat atomics (e.g. an atomic read does not always imply a write) could expose this issue. I'm a bit shocked we don't seem to have a bug report which hit this in real code actually.
Compile time wise, this results in a single extra scan of the scheduling window in the common case. Since we stop scanning at the next instruction which isn't guaranteed to execute, no matter what order we traverse instructions in, we scan the block once. The exception to this is that when we extend the scheduling window downwards, we invalidate all dependencies, and thus rescan. So the potentially expensive case is when we a call in a big schedule window which is frequently extended. We could optimize this case (by caching the last instruction not guaranteeed to transfer execution and scanning only the extended window) and starting there), but I decided to leave the complexity until it mattered. That same case is already degenerate with memory dependences which is more expensive than the control dependence scan.
We could also consider combining the memory dependence and control dependence sets to reduce memory usage, but since it complicates the code slightly and makes debugging a bit harder, I went with the simplest scheme for now.
This was noticed while trying to understand the failures reported against D118538, but is not otherwise related to that change.
Update functions that previously took a loop pointer but only to get the
pre-header. Instead, pass the block directly. This removes the
requirement for the loop object to be created up-front.
With debug information enabled (-g) Clang will wrap the actual target
region into a new function which is called from the "kernel". The problem
is that the "kernel" is now basically a wrapper without all the things
we expect. More importantly, if we end up asking for an AAKernelInfo
for the "target region function" we might try to turn it into SPMD mode.
That used to cause an assertion as that function doesn't have an
appropriately named `_exec_mode` global. While the global is going away
soon we still need to make sure to properly handle this case, e.g.,
perform optimizations reliably.
Differential Revision: https://reviews.llvm.org/D122043
Generalize D99629 for ELF. A default visibility non-local symbol is preemptible
in a -shared link. `isInterposable` is an insufficient condition.
Moreover, a non-preemptible alias may be referenced in a sub constant expression
which intends to lower to a PC-relative relocation. Replacing the alias with a
preemptible aliasee may introduce a linker error.
Respect dso_preemptable and suppress optimization to fix the abose issues. With
the change, `alias = 345` will not be rewritten to use aliasee in a `-fpic`
compile.
```
int aliasee;
extern int alias __attribute__((alias("aliasee"), visibility("hidden")));
void foo() { alias = 345; } // intended to access the local copy
```
While here, refine the condition for the alias as well.
For some binary formats like COFF, `isInterposable` is a sufficient condition.
But I think canonicalization for the changed case has little advantage, so I
don't bother to add the `Triple(M.getTargetTriple()).isOSBinFormatELF()` or
`getPICLevel/getPIELevel` complexity.
For instrumentations, it's recommended not to create aliases that refer to
globals that have a weak linkage or is preemptible. However, the following is
supported and the IR needs to handle such cases.
```
int aliasee __attribute__((weak));
extern int alias __attribute__((alias("aliasee")));
```
There are other places where GlobalAlias isInterposable usage may need to be
fixed.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D107249
I keep thinking this assumption is probably exploitable for a bug in the existing implementation, but all of my attempts at writing a test case have failed. So for the moment, just document this very subtle assumption.
This patch fixes:
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:3917:13: error:
unused function 'needToScheduleSingleInstruction'
[-Werror,-Wunused-function]
[SCCP] do not clean up dead blocks that have their address taken
Fixes a crash observed in IPSCCP.
Because the SCCPSolver has already internalized BlockAddresses as
Constants or ConstantExprs, we don't want to try to update their Values
in the ValueLatticeElement. Instead, continue to propagate these
BlockAddress Constants, continue converting BasicBlocks to unreachable,
but don't delete the "dead" BasicBlocks which happen to have their
address taken. Leave replacing the BlockAddresses to another pass.
Fixes: https://github.com/llvm/llvm-project/issues/54238
Fixes: https://github.com/llvm/llvm-project/issues/54251
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D121744
This reverts commit 1cfa986d68. See https://github.com/llvm/llvm-project/issues/54256 for why I'm discontinuing the project.
Seperately, it turns out that while this patch does correctly preserve MSSA, it's correct only at the end of the pass; not between vectorization attempts. Even if we decide to resurrect this, we'll need to fix that before reapplying.
Quote from the LLVM Language Reference
If ptr is a stack-allocated object and it points to the first byte of the
object, the object is initially marked as dead. ptr is conservatively
considered as a non-stack-allocated object if the stack coloring algorithm
that is used in the optimization pipeline cannot conclude that ptr is a
stack-allocated object.
By replacing the alloca pointer with the tagged address before this change,
we confused the stack coloring algorithm.
Reviewed By: eugenis
Differential Revision: https://reviews.llvm.org/D121835
Failed on buildbot:
/home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/build/bin/llc: error: : error: unable to get target for 'aarch64-unknown-linux-android29', see --version and --triple.
FileCheck error: '<stdin>' is empty.
FileCheck command line: /home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/build/bin/FileCheck /home/buildbot/buildbot-root/llvm-project/llvm/test/Instrumentation/HWAddressSanitizer/stack-coloring.ll --check-prefix=COLOR
This reverts commit 208b923e74.
This adds a new option to control AllowSpeculation added in D119965 when
using `-passes=...`.
This allows reproducing #54023 using opt.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D121944
Quote from the LLVM Language Reference
If ptr is a stack-allocated object and it points to the first byte of the
object, the object is initially marked as dead. ptr is conservatively
considered as a non-stack-allocated object if the stack coloring algorithm
that is used in the optimization pipeline cannot conclude that ptr is a
stack-allocated object.
By replacing the alloca pointer with the tagged address before this change,
we confused the stack coloring algorithm.
Reviewed By: eugenis
Differential Revision: https://reviews.llvm.org/D121835
The way the pass is actually used in the optimization pipeline,
TLI will be available, but this is not the case when running just
-lower-constant-intrinsics in tests, which ends up being quite
confusing.
Require TLI unconditionally, as we usually do.
This changes MemorySSA to be constructed in unoptimized form.
MemorySSA::ensureOptimizedUses() can be called to optimize all
uses (once). This should be done by passes where having optimized
uses is beneficial, either because we're going to query all uses
anyway, or because we're doing def-use walks.
This should help reduce the compile-time impact of MemorySSA for
some use cases (the reason why I started looking into this is
D117926), which can avoid optimizing all uses upfront, and instead
only optimize those that are actually queried.
Actually, we have an existing use-case for this, which is EarlyCSE.
Disabling eager use optimization there gives a significant
compile-time improvement, because EarlyCSE will generally only query
clobbers for a subset of all uses (this change is not included in
this patch).
Differential Revision: https://reviews.llvm.org/D121381
LoopSimplifyCFG may process loops that are not in
loop-simplify/canonical form. For loops not in canonical form, exit
blocks may be reachable from non-loop blocks and we cannot consider them
as dead if they only are not reachable from the loop itself.
Unfortunately the smallest test I could come up with requires running
multiple passes:
-passes='loop-mssa(loop-instsimplify,loop-simplifycfg,simple-loop-unswitch)'
The reason is that loops are canonicalized at the beginning of loop
pipelines, so a later transform has to break canonical form in a way
that breaks LoopSimplifyCFG's dead-exit analysis.
Alternatively we could try to require all loop passes to maintain
canonical form. That in turn would also require additional verification.
Fixes#54023, #49931.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D121925
This patch tries to sink instructions when they are only used in a successor block.
This is a further enhancement patch based on Anna's commit:
D109700, which allows sinking an instruction having multiple uses in a single user.
In this patch, sink instructions with multiple users in a single successor block will be supported.
It could fix a known issue from rust:
https://github.com/rust-lang/rust/issues/51346#issuecomment-394443610
Reviewed By: nikic, reames
Differential Revision: https://reviews.llvm.org/D121585
Splat loads are inexpensive in X86. For a 2-lane vector we need just one
instruction: `movddup (%reg), xmm0`. Using the standard Splat score leads
to worse code. This patch adds a new score dedicated for splat loads.
Please note that a splat is usually three IR instructions:
- It is usually a load and 2 inserts:
%ld = load double, double* %gep
%ins1 = insertelement <2 x double> poison, double %ld, i32 0
%ins2 = insertelement <2 x double> %ins1, double %ld, i32 1
- But it can also be a load, an insert and a shuffle:
%ld = load double, double* %gep
%ins = insertelement <2 x double> poison, double %ld, i32 0
%shf = shufflevector <2 x double> %ins, <2 x double> poison, <2 x i32> zeroinitializer
Because of this some of the lit tests contain more IR instructions.
Differential Revision: https://reviews.llvm.org/D121354
Reimplements MisExpect diagnostics from D66324 to reconstruct its
original checking methodology only using MD_prof branch_weights
metadata.
New checks rely on 2 invariants:
1) For frontend instrumentation, MD_prof branch_weights will always be
populated before llvm.expect intrinsics are lowered.
2) for IR and sample profiling, llvm.expect intrinsics will always be
lowered before branch_weights are populated from the IR profiles.
These invariants allow the checking to assume how the existing branch
weights are populated depending on the profiling method used, and emit
the correct diagnostics. If these invariants are ever invalidated, the
MisExpect related checks would need to be updated, potentially by
re-introducing MD_misexpect metadata, and ensuring it always will be
transformed the same way as branch_weights in other optimization passes.
Frontend based profiling is now enabled without using LLVM Args, by
introducing a new CodeGen option, and checking if the -Wmisexpect flag
has been passed on the command line.
Differential Revision: https://reviews.llvm.org/D115907
Failures in `InlineFunction()` are caught after D121722, but `emitInlinedIntoBasedOnCost()` should only be called when inlining is successful. This also removes an unnecessary call to `shouldInline()` which always returned `InlineCost::getAlways()`.
Reviewed By: kyulee, nikic
Differential Revision: https://reviews.llvm.org/D121946
We often failed in the assertion, non-deterministically with a large IR:
```
Assertion `notDifferentParent(LocA.Ptr, LocB.Ptr) && "BasicAliasAnalysis doesn't support interprocedural queries."
```
Looking at the comment in https://reviews.llvm.org/D87806, it appears it's actually a module pass for new PM while the legacy PM still works as a function pass.
The fix is to align the same behavior in between new PM and old PM, which initializes ObjCARCContract for each function.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D121949
No need to schedule entry nodes where all instructions are not memory
read/write instructions and their operands are either constants, or
arguments, or phis, or instructions from others blocks, or their users
are phis or from the other blocks.
The resulting vector instructions can be placed at
the beginning of the basic block without scheduling (if operands does
not need to be scheduled) or at the end of the block (if users are
outside of the block).
It may save some compile time and scheduling resources.
Differential Revision: https://reviews.llvm.org/D121121
For MachO, lower `@llvm.global_dtors` into `@llvm_global_ctors` with
`__cxa_atexit` calls to avoid emitting the deprecated `__mod_term_func`.
Reuse the existing `WebAssemblyLowerGlobalDtors.cpp` to accomplish this.
Enable fallback to the old behavior via Clang driver flag
(`-fregister-global-dtors-with-atexit`) or llc / code generation flag
(`-lower-global-dtors-via-cxa-atexit`). This escape hatch will be
removed in the future.
Differential Revision: https://reviews.llvm.org/D121736
When we build clang without asserts we should still check the result of
`InlineFunction()` to be sure there wasn't an error. Otherwise we could
incorrectly merge attributes in the next line.
This also removes a redundent call to `getCaller()`.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D121722
An analysis may just be interested in checking if an instruction is
atomic but system scoped or single-thread scoped, like ThreadSanitizer's
isAtomic(). Unfortunately Instruction::isAtomic() can only answer the
"atomic" part of the question, but to also check scope becomes rather
verbose.
To simplify and reduce redundancy, introduce a common helper
getAtomicSyncScopeID() which returns the scope of an atomic operation.
Start using it in ThreadSanitizer.
NFCI.
Reviewed By: dvyukov
Differential Revision: https://reviews.llvm.org/D121910
This uses the existing VPlan helpers to check whether there are scalar
uses of a phi recipe. It remove one of the few remaining dependencies on
the cost model from VPlan code generation.
Depends on D121612.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D121613
VPInterleaveRecipe only uses the first lane of the address. Add
onlyFirstLaneUsed implementation. This is needed for a follow-up patch.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D121612
Reapply with an explicit check for multi-edges, as the expected
behavior of multi-edge dominance is unclear (D120811).
-----
For conditional branches, we know the value is i1 0 or i1 1 along
the outgoing edges. For switches we can apply exactly the same
optimization, just with the known values determined by the switch
cases.
This patch ensures scalars (except for uniforms) are no
longer collected (prior to LVP planning phase) for
scalable vectorization.
This is to avoid the chances of generating scalarized
instructions later (during LVP execute phase) as they
are not supported for scalable vectorization.
Relevant test has also been added.
Differential Revision: https://reviews.llvm.org/D121452
When a load extends past the extent of the alloca, SROA will
restrict the slice size to extend to the end of the alloca only.
However, presplitting was asserting that the load size and the
slice size match exactly, which does not hold in this case.
Relax the assertion to only require that the load size is greater
or equal than the slice size.
No need to schedule entry nodes where all instructions are not memory
read/write instructions and their operands are either constants, or
arguments, or phis, or instructions from others blocks, or their users
are phis or from the other blocks.
The resulting vector instructions can be placed at
the beginning of the basic block without scheduling (if operands does
not need to be scheduled) or at the end of the block (if users are
outside of the block).
It may save some compile time and scheduling resources.
Differential Revision: https://reviews.llvm.org/D121121
This patch adds initial argmemonly inference, by checking the underlying
objects of locations returned by MemoryLocation.
I think this should cover most cases, except function calls to other
argmemonly functions.
I'm not sure if there's a reason why we don't infer those yet.
Additional argmemonly can improve codegen in some cases. It also makes
it easier to come up with a C reproducer for 7662d1687b (already fixed,
but I'm trying to see if C/C++ fuzzing could help to uncover similar
issues.)
Compile-time impact:
NewPM-O3: +0.01%
NewPM-ReleaseThinLTO: +0.03%
NewPM-ReleaseLTO+g: +0.05%
https://llvm-compile-time-tracker.com/compare.php?from=067c035012fc061ad6378458774ac2df117283c6&to=fe209d4aab5b593bd62d18c0876732ddcca1614d&stat=instructions
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D121415
This code queries TTI on a single function, which is considered to
be representative. This is a bit odd, but probably fine in practice.
However, I think we should at least avoid querying declarations,
which e.g. will generally lack target attributes, and for which
we don't seem to ever query TTI in other places.
This initial patch adds code to preserve MemorySSA through a run of SLP vectorizer. The eventual plan is to use MemorySSA to accelerate SLP's memory dependence checking, but we're a ways from that. In particular, this patch is correct, but really slow. It's being landed so that we can work incrementally in tree, not because it's expected to be useful to anyone just yet.
The broader effort is being tracked in https://github.com/llvm/llvm-project/issues/54256. Its worth noting expicitly that this may not work out, and if not, we will be reverting all of the MSSA support in SLP at some point in the next few weeks.
Differential Revision: https://reviews.llvm.org/D117926
Update FunctionAttrs to use FunctionModRefBehavior instead
MemoryAccessKind.
This allows for adding support for inferring argmemonly and others,
see D121415.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D121460
This can be viewed as swapping the select arms:
https://alive2.llvm.org/ce/z/jUvFMJ
...so we don't have the 'nsz' problem with the more general fold.
This unlocks other folds for the motivating fabs example.
This was discussed in issue #38828.
Hardcode the function type as ParallelTask, which is the guaranteed
pointee type of this runtime function argument (if pointee types
exist). The elimination of the callee bitcast is left for InstCombine.
Differential Revision: https://reviews.llvm.org/D120885
Update places still referencing LoopVectorBody to use the vector loop to
get the vector loop header. This is needed to move vector loop
code-generation to VPlan completely, which in turn is needed to model
pre-header & exit blocks in VPlan as well.
For MachO, lower `@llvm.global_dtors` into `@llvm_global_ctors` with
`__cxa_atexit` calls to avoid emitting the deprecated `__mod_term_func`.
Reuse the existing `WebAssemblyLowerGlobalDtors.cpp` to accomplish this.
Enable fallback to the old behavior via Clang driver flag
(`-fregister-global-dtors-with-atexit`) or llc / code generation flag
(`-lower-global-dtors-via-cxa-atexit`). This escape hatch will be
removed in the future.
Differential Revision: https://reviews.llvm.org/D121327
When matching PHINodes when margining functions the IROutliner only checks that an incoming value exists in phi node in overall function. It doesn't check the length, the order, or that the incoming block also matches. In the given example, we see that both phi nodes have the same incoming values, but from different blocks.
The fix is to to enforce stricter a match of the incoming value, and the incoming block as well when matching the created phi nodes.
Reviewers: paquette
Differential Revision: https://reviews.llvm.org/D121310
If we have a logical and/or in select form and the true/false operand
is an fcmp with poison generating FMF, we won't be able to fold it
to an and/or instruction. This prevents us from optimizing the case
where it is a logical operation of two fcmps with identical operands.
This patch adds explicit checks for this case that doesn't rely on
converting to and/or to do the optimization. It reuses the existing
foldLogicOfFCmps, but adds a new flag to disable the other combine
that is inside that function.
FMF flags from the two FCmps are intersected using the logic added in
D121243. The FIXME has been updated to indicate that we can only use
a union for the non-select form.
This allows us to optimize cases like this from compare-fp-3.c in the
gcc torture suite with fast math.
void
test1 (float x, float y)
{
if ((x==y) && (x!=y))
link_error0();
}
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D121323
2 of the 3 callsite of IRMover::move() pass empty lambda functions. Just
make this parameter llvm::unique_function.
Came about via discussion in D120781. Probably worth making this change
regardless of the resolution of D120781.
Reviewed By: dexonsmith
Differential Revision: https://reviews.llvm.org/D121630
The IR Outliner is supposed to extract the outputs contained in an external phi node and place them into a phi node contained within the outlined function. However, when the output values of two outlined functions with two different output sets are contained within the same phi node, they are counted as the same exit path when first analyzed. In reality, these create two different phi nodes, creating an inconsistency, resulting in a mismatch in the expected number of output paths and a crash. This fixes that counting when analyzing the outputs by also analyzing the incoming blocks rather than just the incoming values.
Reviewer: paquette
Differential Revision: https://reviews.llvm.org/D121313
Extend -wholeprogramdevirt-check to support both the existing
trapping mode on an incorrect devirtualization, as well as a new
mode to fallback to an indirect call on a mismatch. The new mode is
The new mode is useful in cases where we want to enable
devirtualization but cannot fully guarantee whole program visibility
(e.g in the case where LTO has been disabled for a small set of objects
that could potentially override virtual methods without having a symbol
reference to anything in the base class including the vtable).
Remove !prof and !callees metadata (which are used by indirect call
promotion) from both the new direct call and the fallback indirect call
(so that we don't perform another round of promotion on the latter).
Also remove it from the direct call in the non-fallback cases, which
was an oversight, although it didn't seem to cause any issues. Add tests
for the metadata removal covering the various cases.
Differential Revision: https://reviews.llvm.org/D121419
When there are two external phi nodes for two different outlined regions, when compressing the created phi nodes between the two regions, the matching for the second phi node in the second region matches the first phi node created for the first region rather than the second phi node created for the first region. This adds an extra output path where there should not be one.
The fix is the ignore phi nodes that have already been matched for each region.
Reviewer: paquette
Differential Revision: https://reviews.llvm.org/D121312
The insertion point for the builder used during VPlan code generation is
set during code generation. Setting the insert point here is dead code
and can be removed.
Context: I needed this for https://github.com/google/iree/pull/8474 .
I found that TSan instrumentation expects vector sizes to be <= 16,
and in my project (IREE) we have tests with higher vector sizes.
That left some test functions uninstrumented, resulting in crashes as
instrumented code called into them.
Differential Revision: https://reviews.llvm.org/D121182
This patch is a follow-up to D115953. It updates optimizeInductions
to also introduce new VPScalarIVStepsRecipes if an IV has both vector
and scalar uses.
It updates all uses that only need scalar values to use the newly
created recipe for the scalar steps.
This completes untangling of VPWidenIntOrFpInductionRecipe
code-generation. Now the recipe *only* creates the widened vector
values, as it says on the tin.
The code to genereate IR has been moved directly to
VPWidenIntOrFpInductionRecipe::execute.
Note that the recipe has been updated to hold a reference to
ScalarEvolution, which is needed to expand the step, until we can place
the corresponding SCEV expansion in the pre-header.
Depends on D120827.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D120828
Before we used the capture tracker to follow pointer uses, now we do it
explicitly ourselves through the Attributor API. There are multiple
benefits: For one, the boilerplate is cut down by a lot. The class,
potential copies vector, etc. is all not needed anymore. We also do
avoid explicitly looking through memory here, something that was
duplicated and should only live in the `checkForAllUses~ helper. More
importantly, as we do simplifications we need to make sure all parties
are in sync when they reason about uses. The old way did not allow us to
do this but the new one does as every use visiting AA goes through
`checkForAllUses` now..
As replacements will become more complex it is better to have a single
AA responsible for replacing a use. Before this patch AAValueSimplify*
and AAValueSimplifyReturned could both try to replace the returned
value. The latter was marginally better for the old pass manager
when a function was already carrying a `returned` attribute and when
the context of the return instruction was important. The second
shortcoming was resolved by looking for return attributes in the
AAValueSimplifyCallSiteReturned initialization. The old PM impact is
not concerning.
This is yet another step towards the removal of AAReturnedValues, the
very first AA we should now try to eliminate due to the overlapping
logic with value simplification.
When we look through memory for a store we used to allow any other use
of the memory that is reachable. This is generally OK but we need to
make sure to actually let the user look at these properly. For now,
we simply require loads (via exact reloads).
There was some ad-hoc handling of liveness and manifest to avoid
breaking CGSCC guarantees. Things always slipped through though.
This cleanup will:
1) Prevent us from manifesting any "information" outside the CGSCC.
This might be too conservative but we need to opt-in to annotation
not try to avoid some problematic ones.
2) Avoid running any liveness analysis outside the CGSCC. We did have
some AAIsDeadFunction handling to this end but we need this for all
AAIsDead classes. The reason is that AAIsDead information is only
correct if we actually manifest it, since we don't (see point 1) we
cannot actually derive/use it at all. We are currently trying to
avoid running any AA updates outside the CGSCC but that seems to
impact things quite a bit.
3) Assert, don't check, that our modifications (during cleanup) modifies
only CGSCC functions.
In an attempt to remove the memory leak we introduced a double free.
The problem was that we allowed a plain copy of the state and it was
actually used. The use was useless, so it is gone now. The copy
constructor is gone as well. The move constructor ensures the Accesses
pointers are owned by a single state, I hope.
Reported by: https://lab.llvm.org/buildbot/#/builders/16/builds/25820
This patch adds a helper to check if a recipe only uses scalars of a
given operand. This is similar to onlyFirstLaneUsed, which was
introduced earlier.
By default, usesScalars falls back on onlyFirstLaneUsed.
Will be used by D120828.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D120827
X (any pred) -X --> X (any pred) 0.0
This works with all FP values and preserves FMF.
Alive2 examples:
https://alive2.llvm.org/ce/z/dj6jhp
This can also create one of the patterns that we match as "fabs"
as shown in one of the test diffs.
This reverts commit 9397bdc67e.
This optimization is likely to surprise programmers as seen
in post-commit comments, so we should add a clang warning
first (that is proposed in D121306).
If there are no ctors, then this can have an arbirary zero-sized
value. The current code checks for null, but it could also be
undef or poison.
Replacing the specific null check with a check for
non-ConstantArray.
As discussed on Issue #32161 this fold can be generalized a lot more than it currently is, but this patch at least adds vector support.
Differential Revision: https://reviews.llvm.org/D121358
As far as I can tell, these names are only intended to be
informative, so just use a generic "PointerType" for opaque pointers.
The code in solveDIType() also treats pointers as basic types (and
does not try to encode the pointed-to type further), so I believe
this should be fine.
Differential Revision: https://reviews.llvm.org/D121280
We are already doing this in the split functions while we clone. This just
handles the original function.
I also updated the coroutine split test to validate that we are always referring
to the msg in the context object instead of in a shadow copy.
rdar://83957028
Reviewed By: aprantl
Differential Revision: https://reviews.llvm.org/D121324
This patch is motivated by pr48057 where an output dependency is not detected
since loop interchange did not check a store instruction with itself.
Fixed that deficiency.
Reviewed By: bmahjour, Meinersbur, #loopoptwg
Differential Revision: https://reviews.llvm.org/D118102
The existing handling produced crash for test case (attached with patch).
Now the function transferSRADebugInfo is modified to
- Ignore the current variable if it starts after the current Fragment.
- Ignore the current variable if it ends before the current Fragment.
- Generate (!DIExpression()) if current variable completely fits the
current Fragment.
- Otherwise (as earlier), generate the DW_OP_LLVM_fragment in IR if current
Fragment partially defines current variable.
Reviewed By: aprantl
Differential Revision: https://reviews.llvm.org/D121107
As a result of adding multiblock outlining, it became possible to outline the entirety of basic block, and branches that only pointed to the basic blocks contained in the outlined section. This means that there are no exit paths, and no return statement. There was a previous assertion from the older version of the outliner that explicitly made sure there was a return statement. This removes that assertion.
Reviewers: paquette
Differential Revision: https://reviews.llvm.org/D120868
This patch intersects the fast math flags from the two fcmps instead
of dropping them.
I poked at this a bunch with Alive2 for nnan and ninf flags and it seemed
to check out. With the other flags it told me "Couldn't prove the
correctness of the transformation". Not sure if I should just preserve
nnan and ninf?
Reviewed By: spatel, lebedev.ri
Differential Revision: https://reviews.llvm.org/D121243
Single value phis won't be modeled in VPlan. If the phi only gets used
outside the loop, the current code misses the fact that the incoming
value is not dead. Update the code to also look through such phis to
check for outside users.
Fixes#54266
This patch adds PrettyStackEntries before running passes. The entries
include the pass name and the IR unit the pass runs on.
The information is used the print additional information when a pass
crashes, including the name and a reference to the IR unit on which it
crashed. This is similar to the behavior of the legacy pass manager.
The improved stack trace now includes:
Stack dump:
0. Program arguments: bin/opt -loop-vectorize -force-vector-width=4 crash.ll
1. Running pass 'ModuleToFunctionPassAdaptor' on module 'crash.ll'
2. Running pass 'LoopVectorizePass' on function '@a'
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D120993
MinBaseDistance may be odr-used by std::max, leading to an undefined symbol linker error:
```
ld.lld: error: undefined symbol: (anonymous namespace)::MinCostMaxFlow::MinBaseDistance
>>> referenced by SampleProfileInference.cpp:744 (/home/ray/llvm-project/llvm/lib/Transforms/Utils/SampleProfileInference.cpp:744)
>>> lib/Transforms/Utils/CMakeFiles/LLVMTransformUtils.dir/SampleProfileInference.cpp.o:((anonymous namespace)::FlowAdjuster::jumpDistance(llvm::FlowJump*) const)
```
Since llvm-project is still using C++ 14, workaround it with a cast.
After moving the CanAdd check in c60cdb44f7 and using it for
the assume cases as well, the passed in block may not have a branch
instruction as terminator. This can trigger the assertion. Given the new
use case, it doesn't add value any longer and can be removed.
Fixes https://github.com/llvm/llvm-project/issues/54281
This is noted as a missing clang warning in #54222
(and we should still make that enhancement).
Alive2 proofs:
https://alive2.llvm.org/ce/z/Q8drDqhttps://alive2.llvm.org/ce/z/pE6LRt
I don't see a single conversion for all predicates
using "getFCmpCode" logic, so other predicates are
left as a TODO item.
Introduce a new attribute "function-inline-cost-multiplier" which
multiplies the inline cost of a call site (or all calls to a callee) by
the multiplier.
When processing the list of calls created by inlining, check each call
to see if the new call's callee is in the same SCC as the original
callee. If so, set the "function-inline-cost-multiplier" attribute of
the new call site to double the original call site's attribute value.
This does not happen when the original call site is intra-SCC.
This is an alternative to D120584, which marks the call sites as
noinline.
Hopefully fixes PR45253.
Reviewed By: davidxl
Differential Revision: https://reviews.llvm.org/D121084
If the user disables de-globalization we did not seed the AAHeapToShared
and AAHeapToStack but we still could end up with them through in-flight
lookups. With this patch we disable AAHeapToShared completely if the
user disabled de-globalization. Heap-2-stack is still run though.
Differential Revision: https://reviews.llvm.org/D121059
Currently, when instrumenting indirect calls, this uses
CallBase::getCalledFunction to determine whether a given callsite is
eligible.
However, that returns null if:
this is an indirect function invocation or the function signature
does not match the call signature.
So, we end up instrumenting direct calls where the callee is a bitcast
ConstantExpr, even though we presumably don't need to.
Use isIndirectCall to ignore those funky direct calls.
Differential Revision: https://reviews.llvm.org/D119594
The analysis passes output function name encapsulated in `'` braces,
but LV uses `"`. Harmonizing this may help in creating an update script
for the LV costmodel test checks.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D121105
When decomposing constraints for unsigned conditions, we can use
negative values by zero-extending them, as long as they are less than
the maximum constraint value.
Fixes https://github.com/llvm/llvm-project/issues/54224
Add missing CanAdd check before adding a condition from an assume
to the successor blocks. When adding information from assume to
successor blocks we need to perform the same CanAdd as we do for adding
a condition from a branch.
Fixes https://github.com/llvm/llvm-project/issues/54217
Dropping this restriction seems to work fine (there are no assertion
failures), so it appears that either the updater got smarter or the
problematic cases are restricted elsewhere.
If doing this still causes issues, then the place to address it
would probably be 8f5bdaf481/llvm/lib/Transforms/IPO/Attributor.cpp (L1856-L1859),
which already prevents replacement outside the SCC, so I'm not
quite sure what this check is intended to avoid.
Differential Revision: https://reviews.llvm.org/D120987
Only determine the frame layout based on dereferenceable and align
attributes, and remove the type-based fallback, which is incompatible
with opaque pointers. The dereferenceable attribute is required,
while the align attribute uses default alignment of 1 (commonly,
align 1 attributes do not get placed, relying on default alignment).
The CoroSplit pass producing the resume function adds the necessary
attributes in 7daed35911/llvm/lib/Transforms/Coroutines/CoroSplit.cpp (L840),
and their presence is checked in coro-debug.ll at least.
Differential Revision: https://reviews.llvm.org/D120988
With opaque pointers, after splitRetconCoroutine() the FramePtr
may be an Argument rather than an Instruction. With typed pointers,
this currently doesn't happen because the FramePtr would be a
bitcast instruction.
Fix this by making FramePtr a Value and adding a helper for the
"after FramePtr" insertion point, which would be the start of the
function in the Argument case.
Differential Revision: https://reviews.llvm.org/D120994
This patch extends ConstraintElimination to also remove dead variables
when removing a constraint. When a constraint is removed because it is
out of scope, all new variables added for this constraint can also be
removed.
This keeps the total size of the systems much smaller, because it
reduces the number of variables drastically.
It also fixes a bug where variables where removed incorrectly.
Fixes https://github.com/llvm/llvm-project/issues/54228
This check is not compatible with opaque pointers. We can avoid
it by adjusting the getPointerAlignment() implementation to avoid
creating unnecessary ptrtoint expressions for bitcasted pointers.
The code already uses OnlyIfReduced to not create an expression
if it does not simplify, and this makes sure that folding a
bitcast and ptrtoint into a ptrtoint doesn't count as a
simplification.
Differential Revision: https://reviews.llvm.org/D120904
Currently, we hardly ever actually run SCEV verification, even in
tests with -verify-scev. This is because the NewPM LPM does not
verify SCEV. The reason for this is that SCEV verification can
actually change the result of subsequent SCEV queries, which means
that you see different transformations depending on whether
verification is enabled or not.
To allow verification in the LPM, this limits verification to
BECounts that have actually been cached. It will not calculate
new BECounts.
BackedgeTakenInfo::getExact() is still not entirely readonly,
it still calls getUMinFromMismatchedTypes(). But I hope that this
is not problematic in the same way. (This could be avoided by
performing the umin in the other SCEV instance, but this would
require duplicating some of the code.)
Differential Revision: https://reviews.llvm.org/D120551
We already look through memory to determine where a value that is stored
might pop up again (potential copies). This patch introduces the other
direction with similar logic. If a value is loaded, we can follow all
the accesses to the pointer (or better object) and try to determine what
value might have been stored.
Both `undef` and `nullptr` are maximally aligned. This is especially
important as we often see `undef` until a proper value has been
identified during simplification.
With D106397 we used CFG reasoning to filter out writes that will not
interfere with a given load instruction. With this patch we use the
same logic (modulo the reversal in reachability check order) for store
instructions. As an example, we can now proof stores to shared memory
are dead if all the loads of the shared memory are not reachable from
them.
Heap-2-stack and heap-2-shared can replace an allocation call with
something else. To avoid us deriving information from the allocator
implementation we register a simplification callback now that will
force us to stop at the call site. We probably should create the
replacement memory eagerly and return that instead though.
While we can use range information when we derive dereferenceability we
must make sure to pick he right end of the range. Before we always went
with the minimal offset, which is not correct if we want to combine
the base dereferenceability with some offset. In that case it's the
maximum that gives the correct result.
This simply makes the function argument of the
`Attributor::checkForAllInstructions` helper explicit so one can iterate
over instructions in other functions.
The OpenMPIRBuilder has a bug. Specifically, suppose you have two nested openmp parallel regions (writing with MLIR for ease)
```
omp.parallel {
%a = ...
omp.parallel {
use(%a)
}
}
```
As OpenMP only permits pointer-like inputs, the builder will wrap all of the inputs into a stack allocation, and then pass this
allocation to the inner parallel. For example, we would want to get something like the following:
```
omp.parallel {
%a = ...
%tmp = alloc
store %tmp[] = %a
kmpc_fork(outlined, %tmp)
}
```
However, in practice, this is not what currently occurs in the context of nested parallel regions. Specifically to the OpenMPIRBuilder,
the entirety of the function (at the LLVM level) is currently inlined with blocks marking the corresponding start and end of each
region.
```
entry:
...
parallel1:
%a = ...
...
parallel2:
use(%a)
...
endparallel2:
...
endparallel1:
...
```
When the allocation is inserted, it presently inserted into the parent of the entire function (e.g. entry) rather than the parent
allocation scope to the function being outlined. If we were outlining parallel2, the corresponding alloca location would be parallel1.
This causes a variety of bugs, including https://github.com/llvm/llvm-project/issues/54165 as one example.
This PR allows the stack allocation to be created at the correct allocation block, and thus remedies such issues.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D121061
Skip phi nodes in the preheader. They may not be considered loop
invariant by the assertion below.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D121010
There seems to be one more uncaught problem, SROA may now end up trying
to re-re-repromote the just-promoted shadow alloca, and do that endlessly.
This reverts commit adc0984d81.
This is inspired by the original variant of D109749 by Graham Hunter,
but is a more general version.
Roughly, instead of promoting the alloca, we call it
a shadow/backing alloca, go through all it's slices,
clone(!) instructions that operated on it,
but make them operate on the cloned alloca,
and promote cloned alloca instead.
This keeps the shadow/backing alloca, and all the original instructions
around, which results in said shadow/backing alloca being
a perfect mirror/representation of the promoted alloca's content,
so calls that take the alloca as arguments (non-capturingly!)
can be supported.
For now, we require that the calls also don't modify the alloca's content,
but that is only to simplify the initial implementation,
and that will be supported in a follow-up.
Overall, this leads to *smaller* codesize:
https://llvm-compile-time-tracker.com/compare.php?from=a8b4f5bbab62091835205f3d648902432a4a5b58&to=aeae054055b125b011c1122f82c86457e159436f&stat=size-total
and is roughly neutral compile-time wise:
https://llvm-compile-time-tracker.com/compare.php?from=a8b4f5bbab62091835205f3d648902432a4a5b58&to=aeae054055b125b011c1122f82c86457e159436f&stat=instructions
This relands commit 703240c71f,
that was reverted by commit 7405581f7c,
because the assertion `isa<LoadInst>(OrigInstr)` didn't hold in practice,
as the newly added test `@select_of_ptrs` shows:
If the pointers into alloca are used by select's/PHI's, then even if
we manage to fracture the alloca, some sub-alloca's will likely remain.
And if there are any non-capturing calls, then we will also decide to
keep the original backing alloca around, and we suddenly ~doubled
the alloca size, and the amount of memory traffic.
I'm not sure if this is a problem or we could live with it,
but let's leave that for later...
Reviewed By: djtodoro
Differential Revision: https://reviews.llvm.org/D113520
This will let us start moving away from hard-coded attributes in
MemoryBuiltins.cpp and put the knowledge about various attribute
functions in the compilers that emit those calls where it probably
belongs.
Differential Revision: https://reviews.llvm.org/D117921
The custom state machine had a check for surplus threads that filtered
the main thread if the kernel was executed by a single warp only. We
now first check for the main thread, then for surplus threads, avoiding
to filter the former out.
Fixes#54214.
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D121011
Prior to this change LLVM would happily elide a call to any allocation
function and a call to any free function operating on the same unused
pointer. This can cause problems in some obscure cases, for example if
the body of operator::new can be inlined but the body of
operator::delete can't, as in this example from jyknight:
#include <stdlib.h>
#include <stdio.h>
int allocs = 0;
void *operator new(size_t n) {
allocs++;
void *mem = malloc(n);
if (!mem) abort();
return mem;
}
__attribute__((noinline)) void operator delete(void *mem) noexcept {
allocs--;
free(mem);
}
void deleteit(int*i) { delete i; }
int main() {
int*i = new int;
deleteit(i);
if (allocs != 0)
printf("MEMORY LEAK! allocs: %d\n", allocs);
}
This patch addresses the issue by introducing the concept of an
allocator function family and uses it to make sure that alloc/free
function pairs are only removed if they're in the same family.
Differential Revision: https://reviews.llvm.org/D117356
This check is not relevant for correctness, it can only avoid
walking some recursive uses if the cast is to a non-function
pointer type. As this distinction will no longer be possible
with opaque pointers and all users will have to be walked
anyway, I'm dropping the check in advance.
Recommit without changes over 53abe3ff66,
which addressed the cause of the reported crash.
-----
With opaque pointers, the zero-offset load will generally not use
a GEP. Allow a direct load without GEP, which is treated the same
way as a zero-offset GEP.
Use the overload that support moving into an empty block. I don't
think that this situation can occur right now, but it can happen
with the change from e7fb1c15cb,
and the test is derived from the issue reported there.
Per discussion on
https://reviews.llvm.org/D59709#inline-1148734, this seems like the
right course of action. `canBeOmittedFromSymbolTable()` subsumes and
generalizes the previous logic. In addition to handling `linkonce_odr`
`unnamed_addr` globals, we now also internalize `linkonce_odr` +
`local_unnamed_addr` constants.
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D120173
This builds on @fhahn's D112313, and caches the liveOnEntry node as a optimized access. D112313 tied to only cache a known clobber. This change adds caching the fact that no clobber exists. It still does not cache may-clobber results.
Differential Revision: https://reviews.llvm.org/D120842
The similar getICmpCode and getPredForICmpCode are already there.
This moves FP for consistency.
I think InstCombine is currently the only user of both.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D120754
We will check a bit later that the constant is in fact a function,
so the separate check for a function pointer type is largely
redunant. Also simplify the cast stripping with
stripPointerCasts().
`ArgInfo` is reduced to only contain a pair of {formal,actual} values.
The specialized function `Fn` and the `Partial` flag are redundant in
this structure. The `Gain` is moved to a new struct `SpecializationInfo`.
The value mappings created by cloneCandidateFunction() are being used
by rewriteCallSites() for matching the formal arguments of recursive
functions.
The list of specializations is passed by reference to calculateGains()
instead of being returned by value.
The `IsPartial` flag is removed from isArgumentInteresting() and
getPossibleConstants() as it's no longer used anywhere in the code.
Differential Revision: https://reviews.llvm.org/D120753
To make this actually trigger, we also need to check whether the
function types differ, which is a hidden cast under opaque pointers.
The transform is somewhat less relevant there because it is
primarily about pointer bitcasts, but it can also happen with other
bit- or pointer-castable types.
Byval handling is easier with opaque pointers because there is no
need to adjust the byval type, we only need to make sure that it's
still a pointer.
The logic for handling this was fixed in
8d7f118ab2, but the check for byval
on the callee was retained. This resulted in a weird situation
where the transform would work depending on whether the byval
was only on the call or on both the call and the function.
There is a general WalkerStepLimit adjustment higher up in the
loop, and I don't see any reason why this particular case would
need additional adjustment. Furthermore, this could underflow.
Root issue which triggered the revert was fixed in 689bab. No changes in the reapplied patch.
Original commit message follows:
SLP currently schedules all instructions within a scheduling window which stretches from the first instr
uction potentially vectorized to the last. This window can include a very large number of unrelated instruct
ions which are not being considered for vectorization. This change switches the code to only schedule the su
b-graph consisting of the instructions being vectorized and their transitive users.
This has the effect of greatly reducing the amount of work performed in large basic blocks, and thus greatly improves compile time on degenerate examples. To understand the effects, I added some statistics (not planned for upstream contribution). Here's an illustration from my motivating example:
Before this patch:
704357 SLP - Number of calcDeps actions
699021 SLP - Number of schedule calls
5598 SLP - Number of ReSchedule actions
59 SLP - Number of ReScheduleOnFail actions
10084 SLP - Number of schedule resets
8523 SLP - Number of vector instructions generated
After this patch:
102895 SLP - Number of calcDeps actions
161916 SLP - Number of schedule calls
5637 SLP - Number of ReSchedule actions
55 SLP - Number of ReScheduleOnFail actions
10083 SLP - Number of schedule resets
8403 SLP - Number of vector instructions generated
I do want to highlight that there is a small difference in number of generated vector instructions. This example is hitting the bailout due to maximum window size, and the change in scheduling is slightly perturbing when and how we hit it. This can be seen in the RescheduleOnFail counter change. Given that, I think we can safely ignore.
The downside of this change can be seen in the large test diff. We group all vectorizable instructions together at the bottom of the scheduling region. This means that vector instructions can move quite far from their original point in code. While maybe undesirable, I don't see this as being a major problem as this pass is not intended to be a general scheduling pass.
For context, it's worth noting that the pre-scheduling that SLP does while building the vector tree is exactly the sub-graph scheduling implemented by this patch.
Differential Revision: https://reviews.llvm.org/D118538
While a collection of allocas are technically vectorizeable - by forming a wider alloca - this was not a transform SLP actually knows how to do. Instead, we were forming a bundle with missing dependencies, and then relying on the scheduling code to preserve program order if multiple instructions were scheduleable at once. I haven't been able to write a test case, but I'm 99% sure this was wrong in some edge case.
The unknown op case was flowing down the shufflevector path. This did result in some splat handling being lost with this change, but the same lack of splat handling is visible in a whole bunch of simple examples for the gather path. I didn't consider this interesting to fix given how narrow the splat of allocas case is.
The verify call was taking 50% of the compile time in our internal LLVM
fork when trying to unroll many loops.
Differential Revision: https://reviews.llvm.org/D113028
Instead of relying on underlying instructions, this patch updates
VPScalarIVStepsRecipe to only store the required type information.
This removes access to unrelated information, as well as avoiding issues
with the same underlying instruction being shared by multiple recipes.
This change should only change the debug output and not cause any
codegen changes, hence NFCI.
This is a recommit without changes. I originally reverted this
due to a significant code-size regression on tramp3d-v4, however
further investigation showed that in the tramp3d-v4 case this
change enables additional optimizations (in particular more
jump threading), which happens to reduce the size of a function
just enough to be eligible for inlining at hot callsites, which
results in the code size increase. As such, this was just bad
luck.
-----
This one-use limitation is artificial, we do not increase
instruction count if we perform the fold with multiple uses. The
motivating case is shown in @sub_eq_zero_select, where the one-use
limitation causes us to miss a subsequent select fold.
I believe the backend is pretty good about reusing flag-producing
subs for cmps with same operands, so I think doing this is fine.
Differential Revision: https://reviews.llvm.org/D120337
For conditional branches, we know the value is i1 0 or i1 1 along
the outgoing edges. For switches we can apply exactly the same
optimization, just with the known values determined by the switch
cases.
The priority-based inliner currenlty uses block count combined with callee entry count to drive callsite inlining. This doesn't work well with LTO where postlink inlining is driven by prelink-annotated block count which could be based on the merge of all context profiles. I'm fixing it by using callee profile entry count only which should be context-sensitive.
I'm seeing 0.2% perf improvment for one of our internal large benchmarks with probe-based non-CS profile.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D120784
Previously there was a debug flag to print the module after
optimizations. Sometimes we wanted to print the module before
optimizations so this is being split into two flags.
`-openmp-opt-print-module` is now `-openmp-opt-print-module-after`.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D120768
Instead of passing an InstCmpInt * and a bool just pass the predicate
from the caller.
I'm considering moving the similar FCmp functions from InstCombine
over here and this makes the interface consistent with what is used
for FCmp.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D120609
Currently adding attribute no_sanitize("bounds") isn't disabling
-fsanitize=local-bounds (also enabled in -fsanitize=bounds). The Clang
frontend handles fsanitize=array-bounds which can already be disabled by
no_sanitize("bounds"). However, instrumentation added by the
BoundsChecking pass in the middle-end cannot be disabled by the
attribute.
The fix is very similar to D102772 that added the ability to selectively
disable sanitizer pass on certain functions.
In this patch, if no_sanitize("bounds") is provided, an additional
function attribute (NoSanitizeBounds) is attached to IR to let the
BoundsChecking pass know we want to disable local-bounds checking. In
order to support this feature, the IR is extended (similar to D102772)
to make Clang able to preserve the information and let BoundsChecking
pass know bounds checking is disabled for certain function.
Reviewed By: melver
Differential Revision: https://reviews.llvm.org/D119816
This transform can still be applied if there are more than two
phi inputs, as long as phi inputs with the same value are dominated
by the same idom edge.
Treat the icmp and sub symmetrically, and require that one of them
has one use, not the icmp in particular. This could be further
relaxed in the abs (but not nabs) case to not check one-use at
all.
This should no longer be necessary now that we canonicalize to
intrinsics. This may not be entirely NFC in practice if worklist
order gets inverted and we perform demanded bits simplification
of a select user before the select is canonicalized.
A function is basically dead when:
* it has no uses
* it has only self-referencing uses (it's recursive)
Differential Revision: https://reviews.llvm.org/D119878
This is a clean-up patch. The functional pass was rolled into the module pass in D112732.
Reviewed By: vitalybuka, aeubanks
Differential Revision: https://reviews.llvm.org/D120674
extractvalue (any_mul_with_overflow X, -1), 0 --> -X
There are similar other potential transforms that we could do as
noted by the last TODO in the test diffs.
Fixes#54053
Currently bottom-to-top reordering analysis counts orders of the
operands and then adds natural order counts for the operand users. It is
very conservative, this the user nodes themselves may require
reordering. Patch improves bottom-to-top analysis by checking for the
user nodes if they require/allows the reordring. If the user node must
be reordered, has reused scalars, is an alternate op vectorization node,
is a non-ordered gather node or may allow reordering because of the
reordered operands, such node is considered as the node that allows
reodring and is not counted as a node with the natural order.
Differential Revision: https://reviews.llvm.org/D120492
This reverts the revert commit ff93260bf6.
The underlying issue causing the PPC bot failures has been fixed in
cbaac14734 and a corresponding test case has been added in
ad2cad1c52.
Original message:
This patch adds a new VPScalarIVStepsRecipe to handle building scalar
steps.
In the first patch, it only handles the case where there is no vector
induction variable needed.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D115953
Exit values of vector inductions are generated completely independent of
the induction recipes. Consider them for removal, if they are not used
in loop.
This fixes a crash exposed by 49b23f451c.
Only call it for intrinsic min/max. The moved implementation is
unchanged apart from the one-use check: It is now hardcoded to
one-use, without the two-use special case for SPF.
SLP makes very heavy use of aliasing queries to construct pointer dependencies for scheduling purposes. AA internally usings pointerMayBeCaptured to prove some noalias results. In a local profile, we were spending about 4% of total O2 time in capture tracking. By using BatchAA interface - which caches capture results - this drops to 2%.
Note that there is no invalidation of BatchAA here. This assumes that no transformation done by SLP invalidates alias or capture results. This is the same assumption made by the existing AliasCache, so this is not a new assumption in the code.
This patch adds a new VPScalarIVStepsRecipe to handle building scalar
steps.
In the first patch, it only handles the case where there is no vector
induction variable needed.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D115953
This can be used to explicitly model VPValues that depend on SCEV
expansion, like the step for inductions.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D116288
This patch adds a new transform to remove dead recipes. For now, it only
removes dead recipes in the header, to keep the number tests that require
updating manageable. Future patches will extend this to remove dead
recipes across the whole plan.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D118051
Currently, SLP can insert "shuffle" instruction beween PHI and Landing pad instruction. The problem is demonstrated by LIT test. The solution is to adjust insertion point once we are done with PHI generation.
Differential Revision: https://reviews.llvm.org/D120552
With opaque pointers, the zero-offset load will generally not use
a GEP. Allow a direct load without GEP, which is treated the same
way as a zero-offset GEP.
We already have a check for !InstQueries.empty(), so move the for-range over InstQueries inside to avoid the AAReachability uninitialized variable static analysis warnings.
Now that we canonicalize SPF min/max to intrinsics, there's no
need to canonicalize the structure of the SPF min/max itself
anymore. This is conceptually NFC, but in practice does slightly
impact results due to folding order differences.
SCEVs ExprValueMap currently tracks not only which IR Values
correspond to a given SCEV expression, but additionally stores that
it may be expanded in the form X+Offset. In theory, this allows
reusing existing IR Values in more cases.
In practice, this doesn't seem to be particularly useful (the test
changes are rather underwhelming) and adds a good bit of complexity.
Per https://github.com/llvm/llvm-project/issues/53905, we have an
invalidation issue with these offseted expressions.
Differential Revision: https://reviews.llvm.org/D120311
Expand `TruncInstCombine` to handle loops by adding `phi` nodes
to expression graph.
Reviewed by: RKSimon, lebedev.ri
(recommit of fixed f84d732f, reverted by 8ad6d5e after sanitizer breakage)
Differential Revision: https://reviews.llvm.org/D109817
The min/max intrinsic cost is currently too low because in the cost calculation
we subtract the cost of the vector compare as we will not emit it.
For the cost of the vector compare we are currently passing BAD_ICMP_PREDICATE
which returns 3, the worst case cost.
I think we should be passing VecPred instead, since we know the predicates of
the compare instr.
I think this is related to commit b3b993a7ad which introduced the predicate
argument to getCmpSelInstrCost().
https://reviews.llvm.org/rGb3b993a7ad817c3c5801341fa78f34332900eb83
Differential Revision: https://reviews.llvm.org/D120439
Summary:
We use a section to embed offloading code into the host for later
linking. This is normally unique to the translation unit as it is thrown
away during linking. However, if the user performs a relocatable link
the sections will be merged and we won't be able to access the files
stored inside. This patch changes the section variables to have external
linkage and a name defined by the section name, so if two sections are
combined during linking we get an error.
Now that integer min/max intrinsics have good support in both
InstCombine and other passes, start canonicalizing SPF min/max
to intrinsic min/max.
Once this sticks, we can stop matching SPF min/max in various
places, and can remove hacks we have for preventing infinite loops
and breaking of SPF canonicalization.
Differential Revision: https://reviews.llvm.org/D98152
The `SplitIndirectBrCriticalEdges` function was originally designed for
`CodeGenPrepare` and skipped splitting of edges when the destination
block didn't contain any `PHI` instructions. This only makes sense when
reducing COPYs like `CodeGenPrepare`. In the case of
`PGOInstrumentation` or `GCOVProfiling` it would result in missed
counters and wrong result in functions with computed goto.
Differential Revision: https://reviews.llvm.org/D120096
This change uses instruction's comesBefore method to simplify the code significantly. There's little compile time concern here because getSpillCost already calls comesBefore on every basic block which contains a vectorization candidate. The only additional times we'll build basic block ordering is when we can't schedule a vector candidate anywhere in the containing block.
Differential Revision: https://reviews.llvm.org/D120364
Prior to this change, LLVM would attempt to optimize an
aligned_alloc(33, ...) call to the stack. This flunked an assertion when
trying to emit the alloca, which crashed LLVM. Avoid that with extra
checks.
Differential Revision: https://reviews.llvm.org/D119604
Prior to this change, LLVM would attempt to optimize an
aligned_alloc(33, ...) call to the stack. This flunked an assertion when
trying to emit the alloca, which crashed LLVM. Avoid that with extra
checks.
Differential Revision: https://reviews.llvm.org/D119604
This cap was first added in 848c1aa45 (back in 2015). Per the original commit message, the purpose was to avoid a compile time explosion in long basic blocks. The algorithmic problem in scheduling has now been fixed in 0539a26d.
In the meantime, the code has rotten fairly badly. Some intermediate refactoring caused the size to only be incremented if *both* iterators advance in the window search. This causes the size to be badly undercounted when near one end of a basic block. We no longer have any test which exercises the logic in an intentional way; there's one test which differs with this change, but the changes appear fairly orthoganol to the purpose of the test file.
Unfortunately, we no longer have the original motivating example, so it's possible that it also hits some other issue. I tested locally with a large example, but even at it's worst, that one doesn't demonstrate anything too extreme even without the algorithmic fix. It's clearly faster with, but only by ~20% which doesn't seem in line with the original commit message. If regressions with this patch are seen, please file a bug and I'll try to fix any other algorithmic problems which fall out.
Rather than queuing up actions, have one function that does the
log2() fold in the obvious way, but with a flag that allows us
to check whether the fold will succeed without actually performing
it.
What we're really doing here is converting Op0 udiv Op1 into
Op0 lshr log2(Op1), so phrase it in that way. Actually pushing
the lshr into the log2(Op1) expression should be seen as a separate
transform.