InstCombine can replace memcpy to an alloca with a pointer directly to the
source in certain cases. Unfortunately, it also did so for volatile memcpys.
This patch makes it stop doing that.
This was discovered in D136822.
Differential Revision: https://reviews.llvm.org/D137031
Replace custom code to check if only the first lane is used by generic
helper `onlyFirstLaneUsed`. This enables VPlan-based sinking in a few
additional cases and was suggested in D133760.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D136368
X * ((1 << Z) + 1) --> (X << Z) + X
https://alive2.llvm.org/ce/z/P-7WK9
It's possible that we could do better with propagating
no-wrap, but this carries over the existing logic and
appears to be correct.
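For illustration, a minimal before/after sketch of the fold in IR (hypothetical function and value names, not taken from the actual tests):
```
define i8 @src(i8 %x, i8 %z) {
  %shl = shl i8 1, %z
  %add1 = add i8 %shl, 1
  %mul = mul i8 %x, %add1
  ret i8 %mul
}

define i8 @tgt(i8 %x, i8 %z) {
  %shl = shl i8 %x, %z
  %res = add i8 %shl, %x
  ret i8 %res
}
```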
The naming differences on the existing folds are a result
of using getName() to set the final value via Builder.
That makes it easier to transfer no-wrap rather than the
gymnastics required from the raw create instruction APIs.
The `FunctionSpecialization` pass needs loop analysis results for its
cost function. For this purpose, it computes the `DominatorTree` and
`LoopInfo` for a function in `getSpecializationBonus`. This function,
however, is called O(number of call sites x number of arguments) times,
but the DominatorTree/LoopInfo can be computed just once.
This patch plugs into the PassManager infrastructure to obtain
LoopInfo for a function and removes the ad-hoc computation from
`getSpecializationBonus`.
Reviewed By: ChuanqiXu, labrinea
Differential Revision: https://reviews.llvm.org/D136332
[recommitting after recommitting a dependency]
This patch reorders the traversal of function call sites and function
formal parameters to:
* do various argument feasibility checks (`isArgumentInteresting`)
only once per argument, i.e. doing N-args checks instead of
N-calls x N-args checks.
* do hash table lookups only once per call site, i.e. N-calls
lookups/inserts instead of N-call x N-args lookups/inserts.
Reviewed By: ChuanqiXu, labrinea
Differential Revision: https://reviews.llvm.org/D135968
Change-Id: I7d21081a2479cbdb62deac15f903d6da4f3b8529
In InstCombine we treat i8/i16 as desirable, even if they are not legal.
The current logic in shouldChangeType will decide to convert from an
illegal but desirable type (such as an i8) to an illegal and undesirable
type (such as i3). This patch prevents changing the switch conditions to
an irregular type on targets like Arm/AArch64 where i8/i16 are not legal.
This is the same issue as https://reviews.llvm.org/D54115. In the case I
was looking at, it was converting an i32 switch to an i8 switch, which
then became an i3 switch.
Differential Revision: https://reviews.llvm.org/D136763
There's no reason to shrink a constant or simplify
an operand in 2 steps.
This matches what we currently do for 'add' (although that
seems like it should be altered to handle the commutative
case).
LSR may suggest a less profitable transformation for the loop. This
patch adds a check to prevent LSR from generating worse code than what
we already have.
Since LSR affects nearly all targets, the patch is guarded by the
option 'lsr-drop-solution' and defaults to disabled for now.
The next step should be extending a TTI interface to allow target(s)
to enable this enhancement.
A debug log message is added to inform the user when the LSR solution
is skipped.
Reviewed By: Meinersbur, #loopoptwg
Differential Revision: https://reviews.llvm.org/D126043
When calculating the specialization bonus for a given function argument,
we recursively traverse the chain of (certain) users, accumulating the
instruction costs. Then we exponentially increase the bonus to account
for loop nests. This is problematic for two reasons: (a) the users might
not themselves be inside the loop nest, (b) if they are, we are accounting
for it multiple times. Instead, we should adjust the bonus before
traversing the user chain.
This reduces the instruction count for CTMark (newPM-O3) when Function
Specialization is enabled without actually reducing the amount of
specializations performed (geomean: -0.001% non-LTO, -0.406% LTO).
Differential Revision: https://reviews.llvm.org/D136692
This is copying the code that was added for 'add' with D130075.
(That patch removed a fallthrough in the cases, but we can
probably still share at least some code again as a follow-up
cleanup; I didn't want to risk it here.)
The reasoning is similar to the carry propagation for 'add':
if we don't demand low bits of the subtraction and the
subtrahend (aka RHS or operand 1) is known zero in those low
bits, then there can't be any borrowing required from the
higher bits of operand 0, so the low bits don't matter.
Also, the no-wrap flags can be propagated (and I think that
should be true for add too).
Here's an attempt to prove that in Alive2:
https://alive2.llvm.org/ce/z/xqh7Pa
(can add nsw or nuw to src and tgt, and it should still pass)
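As a hedged illustration of the idea (hypothetical IR, not the exact regression test): only the high bits of the subtraction are demanded and the subtrahend has its low bits known zero, so the low bits of operand 0 (set by the 'or' below) cannot affect the result and can be dropped.
```
define i8 @src(i8 %x, i8 %y) {
  %lhs = or i8 %x, 7        ; low 3 bits of operand 0
  %rhs = and i8 %y, -8      ; subtrahend: low 3 bits known zero
  %sub = sub i8 %lhs, %rhs
  %r = and i8 %sub, -8      ; only the high 5 bits are demanded
  ret i8 %r
}

define i8 @tgt(i8 %x, i8 %y) {
  %rhs = and i8 %y, -8
  %sub = sub i8 %x, %rhs
  %r = and i8 %sub, -8
  ret i8 %r
}
```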
Differential Revision: https://reviews.llvm.org/D136788
[fixed test to work with reverse iteration]
The `FunctionSpecialization` pass has support for specialising
functions, which are called with literal arguments. This functionality
is disabled by default and is enabled with the option
`-function-specialization-for-literal-constant`. There are a few
issues with the implementation, though:
* even with the default, the pass will still specialise based on
floating-point literals
* even when it's enabled, the pass will specialise only for the `i1`
type (or `i2` if all of the possible 4 values occur, or `i3` if all
of the possible 8 values occur, etc)
The reason for this is an incorrect check of the lattice value of the
function formal parameter. The lattice value is `overdefined` when the
constant range of the possible arguments is the full set, and this is
what makes the specialisation trigger. However, if the set of the
possible arguments is not the full set, that must not prevent the
specialisation.
This patch changes the pass to NOT consider a formal parameter when
specialising a function if the lattice value for that parameter is:
* unknown or undef
* a constant
* a constant range with a single element
on the basis that specialisation is pointless for those cases.
It also changes the criteria for picking up an actual argument to
specialise. The argument must be:
* an LLVM IR constant, or
* have a `constant` lattice value, or
* have a `constantrange` lattice value with a single element.
Reviewed By: ChuanqiXu
Differential Revision: https://reviews.llvm.org/D135893
Change-Id: Iea273423176082ec51339aa66a5fe9fea83557ee
The struct OffsetAndSize is a simple tuple of two int64_t. Treating it as a
derived class of std::pair has no special benefit, but it makes the code
verbose since we need get/set functions that avoid using "first" and "second" in
client code. Eliminating the std::pair makes this more readable.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D136745
When a profile is stale, profile mismatches can happen and the mismatched samples are discarded, so we'd like to compute mismatch metrics to quantify how stale the profile is, which will suggest that the user refresh the profile if the numbers are high.
Two sets of metrics are introduced here:
- (Num_of_mismatched_funchash / Total_profiled_funchash), (Samples_of_mismatched_func_hash / Samples_of_profiled_function): this leverages the checksums attribute of FunctionSamples, which is a feature of pseudo probes. When the source code CFG changes, the function checksums will be different and the sample loader will later discard the whole function's samples; these metrics show the percentage of samples discarded for this reason.
- (Num_of_mismatched_callsite / Total_profiled_callsite), (Samples_of_mismatched_callsite / Samples_of_profiled_callsite): this shows how many callsite locations mismatch, as callsite location mismatches affect inlining, which is highly correlated with performance. It goes through all the callsite locations in the IR and the profile, uses the call target name to match them, and reports the number of samples in the profile that don't match an IR callsite.
This is implemented in a new class (SampleProfileMatcher) and guarded by a switch ("--report-profile-staleness"); we plan to extend it with a fuzzy profile matching feature in the future.
Reviewed By: hoy, wenlei, davidxl
Differential Revision: https://reviews.llvm.org/D136627
If we run LTO optimization we might end up introducing a custom state machine
and later transforming the region into SPMD. This is a problem. While a follow-up
will introduce a check for the SPMD conversion, this already prevents the
eager custom state machine generation. Only if the kernel init function is
defined, rather than declared, will we emit a custom state machine. SPMD-ization
can happen eagerly though. Tests are adjusted via a weak definition. An LTO
test was added to verify this works as expected.
Differential Revision: https://reviews.llvm.org/D136740
This was reverted because it was breaking when targeting Darwin which
tried to export these symbols which are now hidden. It should be safe
to just stop attempting to export these symbols in the clang driver,
though Apple folks will need to change their TAPI allow list described
in the commit where these symbols were originally exported
f538018562
Then it was reverted again because it broke tests on macOS; they should be
fixed now.
Bug: https://github.com/llvm/llvm-project/issues/58265
Differential Revision: https://reviews.llvm.org/D135340
This patch reorders the traversal of function call sites and function
formal parameters to:
* do various argument feasibility checks (`isArgumentInteresting`) only once per argument, i.e. doing N-args checks instead of N-calls x N-args checks.
* do hash table lookups only once per call site, i.e. N-calls lookups/inserts instead of N-call x N-args lookups/inserts.
Reviewed By: ChuanqiXu, labrinea
Differential Revision: https://reviews.llvm.org/D135968
When rewriting the call sites to call the new specialised functions, a
single call site can be matched by two different specialisations - a
"less specialised" version of the function and a "more specialised"
version of the function, e.g. for a function
void f(int x, int y)
the call like `f(1, 2)` could be matched by either
void f.1(int x /* int y == 2 */);
or
void f.2(/* int x == 1, int y == 2 */);
The `FunctionSpecialisation` pass tries to match specialisation in the
order of decreasing gain, so "more specialised" functions are
preferred to "less specialised" functions. This breaks, however, when
using the flag `-force-function-specialization`, in which case the
cost/benefit analysis is not performed and all the specialisations are
equally preferable.
This patch makes the pass calculate specialisation gain and order the
specialisations accordingly even when `-force-function-specialization`
is used, under the assumption that this flag has a purely debugging
purpose and it is reasonable to ignore the extra computing effort it
incurs.
Reviewed By: ChuanqiXu, labrinea
Differential Revision: https://reviews.llvm.org/D136180
The `FunctionSpecialization` pass has support for specialising
functions, which are called with literal arguments. This functionality
is disabled by default and is enabled with the option
`-function-specialization-for-literal-constant`. There are a few
issues with the implementation, though:
* even with the default, the pass will still specialise based on
floating-point literals
* even when it's enabled, the pass will specialise only for the `i1`
type (or `i2` if all of the possible 4 values occur, or `i3` if all
of the possible 8 values occur, etc)
The reason for this is an incorrect check of the lattice value of the
function formal parameter. The lattice value is `overdefined` when the
constant range of the possible arguments is the full set, and this is
what makes the specialisation trigger. However, if the set of the
possible arguments is not the full set, that must not prevent the
specialisation.
This patch changes the pass to NOT consider a formal parameter when
specialising a function if the lattice value for that parameter is:
* unknown or undef
* a constant
* a constant range with a single element
on the basis that specialisation is pointless for those cases.
It also changes the criteria for picking up an actual argument to
specialise. The argument must be:
* an LLVM IR constant, or
* have a `constant` lattice value, or
* have a `constantrange` lattice value with a single element.
Reviewed By: ChuanqiXu
Differential Revision: https://reviews.llvm.org/D135893
When collecting the possible constant arguments to
specialise a function the compiler will abandon the search
on the first argument that is for some reason unsuitable as
a specialisation constant. Thus, depending on the traversal
order of the functions and call sites, the compiler can end
up with a different set of possible constants, hence with a
different set of specialisations.
With this patch, the compiler will skip unsuitable
constants, but nevertheless will continue searching for
more.
Reviewed By: ChuanqiXu
Differential Revision: https://reviews.llvm.org/D135867
Epilogue loop vectorization is a feature in the vectorizer intended to avoid running fully scalar code when the vector length of the main loop turns out to be either longer than the trip count of the actual loop, or to leave a huge remainder.
In practice, this feature appears to not have been well tuned. I honestly don't think it should be on by default at all, but it definitely shouldn't be on for RISCV. Note that other targets have also disabled it, but they've done so via disabling interleaving - which is, well, completely unrelated - and we don't want to do that for RISCV.
In the near term, many examples I'm seeing have terrible codegen for epilogue vectorization. We are greatly increasing code size for little value at reasonable VLEN values for small types. In the long term, the cases that epilogue vectorization are intended to handle are likely better handled via tail folding on RISCV.
As an aside, I also don't really trust the correctness of epilogue vectorization. The code structure is such that otherwise straightforward changes sometimes break only epilogue vectorization. The reuse of an existing vplan without careful validation opens significant room for nasty bugs. Given how rarely the code is exercised, that is not a good combination.
As such, this patch introduces a TTI hook, and completely disables epilogue vectorization on RISCV.
Differential Revision: https://reviews.llvm.org/D136695
Small functions with size under a given threshold are not
considered for specialisation on the presumption that they
are easy to inline. This does not apply to `noinline`
functions, though.
Reviewed By: ChuanqiXu
Differential Revision: https://reviews.llvm.org/D135862
Using the legacy PM for the optimization pipeline was deprecated in 13.0.0.
Following recent changes to remove non-core features of the legacy
PM/optimization pipeline, remove DataFlowSanitizerLegacyPass.
Differential Revision: https://reviews.llvm.org/D124594
This reverts commit bd7949bcd8.
Revert this patch since reviewers have different opinions regarding
the approach in post-commit review.
Will open RFC for further discussion.
Differential Revision: https://reviews.llvm.org/D132408
This reverts commit 04877284b4.
Looks like this is still breaking the test
Profile-x86_64 :: instrprof-darwin-dead-strip.c
(see comment on https://reviews.llvm.org/D135340).
This addresses a bug where vector versions of ctlz are creating false positive reports.
Depends on D136369
Reviewed By: vitalybuka
Differential Revision: https://reviews.llvm.org/D136523
When -asan-max-inline-poisoning-size=0, all shadow memory access should be
outlined (through asan calls). This was not occurring when partial poisoning
was required on the right side of a variable's redzone. This diff contains
the changes necessary to implement and utilize __asan_set_shadow_01() through
__asan_set_shadow_07(). The change is necessary for the full abstraction of
the asan implementation and will enable experimentation with alternate strategies.
Differential Revision: https://reviews.llvm.org/D136197
This is a sibling transform to the fold just above it. That was changed
to allow the corresponding commuted patterns with:
3073074562e1bd759ea58628e6df70
This was reverted because it was breaking when targeting Darwin which
tried to export these symbols which are now hidden. It should be safe
to just stop attempting to export these symbols in the clang driver,
though Apple folks will need to change their TAPI allow list described
in the commit where these symbols were originally exported
f538018562
Bug: https://github.com/llvm/llvm-project/issues/58265
Differential Revision: https://reviews.llvm.org/D135340
Improve O(N^2) to O(N) in some cases, and reduce the number of allocations by
reserving memory.
Also, improve the analysis of loads as reduction values to avoid analysing
non-vectorizable cases.
This teaches the SCCP Solver how to constant fold more intrinsics. Constant
folding appears to be just as good as D115737 but much, much lower in code
change impact as suggested by nikic.
The constrained floating-point intrinsics all take at least one metadata
argument and were the motivation for the change.
Differential Revision: https://reviews.llvm.org/D136466
(sign|resign) + (auth|resign) can be folded by omitting the middle
sign+auth component if the key and discriminator match.
Differential Revision: https://reviews.llvm.org/D132383
Without a freeze, this transform can leak poison to the output:
https://alive2.llvm.org/ce/z/GJuF9i
This makes the transform as uniform as possible, and it can help
reduce patterns like issue #58313 (although that particular
example probably still needs another transform).
Differential Revision: https://reviews.llvm.org/D136527
This doesn't touch objc-arc-contract because that's in the codegen pipeline.
However, this does move its corresponding initialize function into initializeCodegen().
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D135041
When the common value is part of either select condition,
this is safe to reduce. Otherwise, it is not poison-safe
(with the select form of the pattern):
https://alive2.llvm.org/ce/z/FxQTzB
This is another patch motivated by issue #58313.
1) Use a static array of pointer to retain the dummy vars.
2) Associate liveness of the array with that of the runtime hook variable
__llvm_profile_runtime.
3) Perform the runtime initialization through the runtime hook variable.
4) Preserve the runtime hook variable using the -u linker flag.
Reviewed By: hubert.reinterpretcast
Differential Revision: https://reviews.llvm.org/D136192
This is obviously correct for real logic instructions,
and it also works for the poison-safe variants that use
selects:
https://alive2.llvm.org/ce/z/wyHiwX
This is motivated by the lack of 'xor' folding seen in issue #58313.
This more general fold should help reduce some of those patterns,
but I'm not sure if this specific case does anything for that
particular example.
This allows the regular bitwise logic opcodes in addition to the
poison-safe select variants:
https://alive2.llvm.org/ce/z/8xB9gy
Handling commuted variants safely is likely trickier, so that's
left to another patch.
Additional SCEV verification highlighted a case where the cached loop
dispositions were incorrect after simplifying a condition in IndVars
and moving the user in LoopDeletion. Fix it by invalidating the ICmp and
all its users.
Fixes #58515.
This implements IR and bitcode support for the memory attribute,
as specified in https://reviews.llvm.org/D135597.
The new attribute is not used for anything yet (and as such, the
old memory attributes are unaffected).
Differential Revision: https://reviews.llvm.org/D135592
Currently, compiling a program with the `-pg` flag will result in an
undefined symbol error for `.mcount`. This revision fixes the call to
use `__mcount`, which requires a pointer argument to a pointer-sized
object (unique per inserted call) on AIX.
This is only a partial fix. This patch should fix the `-pg` flag's
behaviour on AIX to work with code you are compiling, but it will not
link against standard libraries with `mcount` instrumentation calls. The
next step is to add profiled libraries to the linker search paths in the
Clang driver for the AIX toolchain when linking with `-pg`.
Differential Revision: https://reviews.llvm.org/D135384
Canonicalize GEP of GEP by swapping the GEP with some suffix constant indices to the back (and the GEP with all constant indices to the back of that); this allows more constant-index GEP merging to happen. Exceptions are: if swapping violates use-def relations, or if it anti-optimizes LICM.
For constant-indexed GEP of GEP, if they cannot be merged directly, they will be cast to i8* and merged.
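A rough sketch of the reordering (hypothetical IR, using opaque pointers and no inbounds for simplicity):
```
define ptr @src(ptr %p, i64 %i) {
  %g1 = getelementptr i32, ptr %p, i64 4
  %g2 = getelementptr i32, ptr %g1, i64 %i
  ret ptr %g2
}

; the constant-index GEP is swapped to the back, where it can merge
; with any further constant-index GEPs
define ptr @tgt(ptr %p, i64 %i) {
  %g1 = getelementptr i32, ptr %p, i64 %i
  %g2 = getelementptr i32, ptr %g1, i64 4
  ret ptr %g2
}
```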
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D125845
The code in buildScalarSteps already properly handles creating the
scalar induction values with VF = 1. Use it directly instead of using
extra code to handle that case.
Suggested by @Ayal in D133760.
Reapplying after the fix for volatile modelling in D135863.
-----
Don't add argmem if the pointer is clearly not an argument (e.g.
a global). I don't think this makes a difference right now, but
gives more obvious results with D135780.
Bug introduced in e239198cdb.
The assert() makes the assumption that the resulting shuffle mask
will always select elements from both vectors; this is untrue in the
case of two shuffles being folded if the former shuffle has a mask with
undef elements in it. In such a case folding the shuffles might result
in a mask which only selects from one of the vectors because the other
elements (in the mask) are undef.
Differential Revision: https://reviews.llvm.org/D136256
Per LangRef, volatile operations are allowed to access the location
of their pointer argument, plus inaccessible memory:
> Any volatile operation can have side effects, and any volatile
> operation can read and/or modify state which is not accessible
> via a regular load or store in this module.
> [...]
> The allowed side-effects for volatile accesses are limited. If
> a non-volatile store to a given address would be legal, a volatile
> operation may modify the memory at that address. A volatile
> operation may not modify any other memory accessible by the
> module being compiled. A volatile operation may not call any
> code in the current module.
FunctionAttrs currently does not model this and ends up marking
functions with volatile accesses on arguments as argmemonly,
even though they should be inaccessiblemem_or_argmemonly.
Differential Revision: https://reviews.llvm.org/D135863
This reverts commit 8ef3fd8d59.
I mentioned that GlobalAlias was not handled. It turns out GlobalAlias has to be handled in the same patch (as opposed to in a follow-up),
as otherwise clang codegen of C5/D5 constructor/destructor would regress (https://reviews.llvm.org/D135427#3869003).
@Ayal suggested a better named helper than using `!getDef()` to check if
a value is invariant across all parts.
The property we are using here is that the VPValue is defined outside
any vector loop region. There's a TODO left to handle recipes defined in
pre-header blocks.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D133666
This patch extends the load merge/widen in AggressiveInstCombine() to handle reverse load patterns.
Differential Revision: https://reviews.llvm.org/D135137
Followup to D135962 to rename remaining uses of
FunctionModRefBehavior to MemoryEffects. Does not touch API names
yet, but also updates variable names FMRB/MRB to ME, to match the
new type name.
Generalized the cost model estimation. Improved cost model estimation
for repeated scalars (no need to count their cost anymore), improved
cost model for extractelement instructions.
cpu2017
511.povray_r 0.57
520.omnetpp_r -0.98
521.wrf_r -0.01
525.x264_r 3.59 <+
526.blender_r -0.12
531.deepsjeng_r -0.07
538.imagick_r -1.42
Geometric mean: 0.21
Differential Revision: https://reviews.llvm.org/D115757
Generalized the cost model estimation. Improved cost model estimation
for repeated scalars (no need to count their cost anymore), improved
cost model for extractelement instructions.
cpu2017
511.povray_r 0.57
520.omnetpp_r -0.98
521.wrf_r -0.01
525.x264_r 3.59 <+
526.blender_r -0.12
531.deepsjeng_r -0.07
538.imagick_r -1.42
Geometric mean: 0.21
Differential Revision: https://reviews.llvm.org/D115757
Fixes an SROA crash.
Fallout from opaque pointers since with typed pointers we'd bail out at the bitcast.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D136119
Moving an instruction can invalidate the cached block dispositions of
the corresponding SCEV. Invalidate the cached dispositions.
Also fixes a copy-paste error in forgetBlockAndLoopDispositions where
the start expression S was removed from BlockDispositions in the loop
but not the current values. This was also exposed by the new test case.
Fixes #58439.
When SimplifyLibCalls fails to optimize printf and sprintf it adds
NoUndef/NonNull/Dereferenceable attributes. This patch adds the same attributes
if SimplifyLibCalls optimizes printf/sprintf into the integer-only
iprintf/siprintf.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D136140
When unrolling, the exit values in LCSSA phis will get updated.
Invalidate cached SCEV values for those phis in case SCEV looked through
an exit phi.
Fixes #58340.
If the arithmetic for indices of inbounds GEPs overflows, the result is
poison. This means it is also OK for the coefficients to overflow. GEP
decomposition is limited to cases where the index size is <= 64 bit,
which can be represented by int64_t used for the coefficients in the
constraint system.
This reverts commit 233659c7ae.
I see some sanitizer buildbot failures. Not sure if this change is
causing them, but let's see if a revert returns the bots to green...
We can't assume that operand 0 is the negated operand because
the matcher handles "fsub -0.0, X" (and also +0.0 with FMF).
By capturing the extract within the match, we avoid the bug
and make the transform more robust (can't assume that this
pass will only see canonical IR).
Relative to the previous attempt, this is rebased over the
InstSimplify fix in ac74e7a780,
which addresses the miscompile reported in PR58401.
-----
foldOpIntoPhi() currently only folds operations into the phi if all
but one operands constant-fold. The two exceptions to this are freeze
and select, where we allow more general simplification.
This patch makes foldOpIntoPhi() generally simplification based and
removes all the instruction-specific logic. We just try to simplify
the instruction for each operand, and for the (potentially) one
non-simplified operand, we move it into the new block with adjusted
operands.
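A hypothetical sketch of the generalized behaviour (the incoming value from %entry simplifies to %x, which is not a constant; the other incoming value does not simplify, so the 'or' is moved into its predecessor):
```
define i8 @before(i1 %c, i8 %x, i8 %y) {
entry:
  br i1 %c, label %if, label %join
if:
  br label %join
join:
  %phi = phi i8 [ 0, %entry ], [ %y, %if ]
  %r = or i8 %phi, %x
  ret i8 %r
}

define i8 @after(i1 %c, i8 %x, i8 %y) {
entry:
  br i1 %c, label %if, label %join
if:
  %or = or i8 %y, %x
  br label %join
join:
  %phi = phi i8 [ %x, %entry ], [ %or, %if ]
  ret i8 %phi
}
```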
This fixes https://github.com/llvm/llvm-project/issues/57448, which
was my original motivation for the change.
Differential Revision: https://reviews.llvm.org/D134954
LoopFlatten has been in the code base off by default for years, but this
enables it to run by default. Downstream this has been running for
years, so it has been exposed to quite some code. Then around the time
we switched to the NPM, several fixes went in related to updating the
MemorySSA state and we moved it to a loop pass manager, which both
helped prevent rerunning certain analysis passes, and thus helped a
bit with compile-times.
About compile-times, adding a pass isn't free, but this should see only
very minor increases. The pass is relatively simple and there shouldn't
be anything algorithmically expensive because all it does is look at
inner/outer loops and check assumptions on loop increments and
indices. If we see increases, I expect this to mainly come from
invalidation of analysis info, and perhaps subsequent passes to trigger
and do more. Despite its simplicity/restrictions, it triggers in most
code-bases, which makes it worth enabling by default.
Differential Revision: https://reviews.llvm.org/D109958
Another alternative to fix the thread identification problem in
coroutines.
We plan to fix this problem by unifying memory effecting attributes. See
https://discourse.llvm.org/t/rfc-unify-memory-effect-attributes/65579.
But it may be a long-term project. And it is a pity that coroutines
haven't been able to resume on different threads for years. So this is a
temporary fix. It may cause unnecessary performance regressions for
coroutines, but correctness is more important. This fix is planned to be
reverted once we are actually able to unify the memory effect attributes.
Reviewed By: jdoerfert, rjmccall
Differential Revision: https://reviews.llvm.org/D135550
Instead of duplicating the existing decomposition code for GEP indices,
just reuse it by calling the existing decompose function on
the index expression and multiplying the result's coefficients by the scale of
the index.
This both reduces code duplication and generalizes the pattern we can
handle.
If both operands of an `add nsw` are known positive, it can be treated
the same as `add nuw` and added to the unsigned system.
https://alive2.llvm.org/ce/z/6gprff
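A sketch of the kind of fact this enables (hypothetical IR; whether this exact form folds depends on how the pass collects the dominating conditions):
```
define i1 @f(i8 %a, i8 %b) {
entry:
  %a.nonneg = icmp sge i8 %a, 0
  %b.nonneg = icmp sge i8 %b, 0
  %both = and i1 %a.nonneg, %b.nonneg
  br i1 %both, label %then, label %else

then:                       ; %a >= 0 and %b >= 0 here
  %s = add nsw i8 %a, %b    ; cannot wrap unsigned either
  %cmp = icmp uge i8 %s, %a ; provably true in the unsigned system
  ret i1 %cmp

else:
  ret i1 false
}
```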
If the divisor is a power-of-2 or negative-power-of-2 and the dividend
is known to have at least as many trailing zeros as the divisor, the division is exact:
https://alive2.llvm.org/ce/z/UGBksM (general proof)
https://alive2.llvm.org/ce/z/D4yPS- (examples based on regression tests)
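For example (a hypothetical sketch; the dividend below has at least 3 trailing zeros while the divisor only needs 2, so the division has no remainder):
```
define i8 @src(i8 %x) {
  %num = shl i8 %x, 3      ; low 3 bits are known zero
  %div = sdiv i8 %num, 4
  ret i8 %div
}

define i8 @tgt(i8 %x) {
  %num = shl i8 %x, 3
  %div = sdiv exact i8 %num, 4
  ret i8 %div
}
```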
This isn't the most direct optimization (we could create ashr in these
examples instead of relying on existing folds for exact divides), but
it's possible that there's a more general constraint than just a pow2
divisor, so this might be extended in the future.
This should solve issue #58348.
Differential Revision: https://reviews.llvm.org/D135970
This should be functionally equivalent - both calls are thin
wrappers around computeKnownBits(). We'll probably want to use
known-bits directly in follow-up patches because that could
determine "exact" for example (see issue #58348).
Support decomposition for `mul/shl nuw` with constant operand for unsigned
queries. Those expressions should not wrap in the unsigned sense and can
be added directly to the unsigned system.
makeLoopInvariant may recursively move its operands to make them
invariant, before moving the passed in instruction. Those recursively
moved instructions are currently missed when invalidating block and loop
dispositions.
To address this, move the invalidation code to Loop::makeLoopInvariant.
Fixes #58314.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D135909
`ProvenanceAnalysis::related()` was assuming that the order of parameters for `relatedCheck()` was not affecting
the result but this was not the case when both parameters were `PHINode`s.
Due to this assumption `ProvenanceAnalysis::related()` was ordering the parameters based on pointer value which resulted in
non-deterministic behavior.
To address this, change `relatedPHI()` so that it gives the same result independent of the parameter order.
rdar://100325456
Differential Revision: https://reviews.llvm.org/D135376
Instead of checking whether a map entry exists to decide if we should
initialize it or add to it, we can rely on the map entry being constructed
and initialized to 0 before the addition happens.
For the std::max case, I've made a reference to the map entry to
avoid looking it up twice.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D135977
avoiding an assertion.
A BB with a nonzero count, whose successor blocks all have 0 counts, could
cause an assertion. Don't create any branch weights in this case.
Reviewed By: xur
Differential Revision: https://reviews.llvm.org/D134203
(A | B) & ~(A & B) --> A ^ B
https://alive2.llvm.org/ce/z/qpFMns
We already have the equivalent fold for real
logic instructions, but this pattern may occur
with selects too.
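A rough sketch of the select (poison-safe) form of the pattern, using the usual logical and/or select encodings (hypothetical value names):
```
define i1 @src(i1 %a, i1 %b) {
  %or  = select i1 %a, i1 true, i1 %b    ; A | B (logical)
  %and = select i1 %a, i1 %b, i1 false   ; A & B (logical)
  %not = xor i1 %and, true
  %r   = select i1 %or, i1 %not, i1 false
  ret i1 %r
}

define i1 @tgt(i1 %a, i1 %b) {
  %r = xor i1 %a, %b
  ret i1 %r
}
```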
This is part of solving issue #58313.
This was reverted because it was breaking when targeting Darwin which
tried to export these symbols which are now hidden. It should be safe
to just stop attempting to export these symbols in the clang driver,
though Apple folks will need to change their TAPI allow list described
in the commit where these symbols were originally exported
f538018562
Bug: https://github.com/llvm/llvm-project/issues/58265
Differential Revision: https://reviews.llvm.org/D135340
Move logic to check and replace conditions to a helper function. This
isolates the code, allows using early returns, reduces the
indentation and simplifies eliminateConstraints.
When generating masked gather nodes, the SLP vectorizer accounts for the cost
of the GEPs for loads as part of the scalar-to-vector transformation cost
estimation. But it does not do so for vectorized loads/stores, while it
may remove some of the GEPs completely. Because of this, in
some cases a masked gather operation can appear much more profitable
than regular vectorization (masked-gather cost + vector GEP - scalar
loads + GEPs compared to vectorized loads - scalar loads).
Added analysis of the removed scalar GEPs for vectorized load/store nodes for better cost estimation.
Differential Revision: https://reviews.llvm.org/D135282
This reverts commit b05f5b90a1.
There are thread sanitizer buildbot failures in simple_stack.c.
I think that's because this ended up affecting the handling of
volatile accesses to allocas. Reverting for now.
Don't add argmem if the pointer is clearly not an argument (e.g.
a global). I don't think this makes a difference right now, but
gives more obvious results with D135780.
Limit pointer decomposition to pointers with index sizes of at most 64
bits. int64_t is used for coefficients, so as long as the index size <=
64 bits we should be able to represent all pointer offsets.
Pointer decomposition is limited to inbounds GEPs, so if an index
computation would overflow, the result is poison; hence it doesn't matter
that the coefficient overflows.
This allows replacing MulOverflow with regular multiplications.
We have to account for accesses to argument memory via captures.
I don't think there's any way to make this produce incorrect
results right now (because as soon as "other" is set, we lose the
ability to infer argmemonly), but this avoids incorrect results
once we have a more precise representation.
The code for inferring memory attributes on arguments claims that
inalloca/preallocated arguments are always clobbered:
d71ad41080/llvm/lib/Transforms/IPO/FunctionAttrs.cpp (L640-L642)
However, we would still infer memory attributes for the whole
function without taking this into account, so we could still end
up inferring readnone for the function. This adds an argument
clobber if there are any inalloca/preallocated arguments.
Differential Revision: https://reviews.llvm.org/D135783
Add a check (can be disabled via a flag) that the pipeline we generate is actually parsable.
Can be disabled because we don't expect to handle every pass in -print-pipeline-passes.
Fixes #58280.
Reviewed By: ChuanqiXu
Differential Revision: https://reviews.llvm.org/D135703
(X << Z) / (Y << Z) --> X / Y
https://alive2.llvm.org/ce/z/CLKzqT
This requires a surprising "nuw" constraint because we have
to guard against immediate UB via signed-div overflow with
-1 divisor.
This extends 008a89037a and is another transform
derived from issue #58137.
Need to set the insert point for extractelement to point to the first
instruction in the node to avoid a possible crash during the external-uses
combining process. Without it we may end up with an incorrect
transformation.
Differential Revision: https://reviews.llvm.org/D135591
(X << Z) / (Y << Z) --> X / Y
https://alive2.llvm.org/ce/z/E5eaxU
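A sketch of one shape of the fold (hypothetical IR; with nuw on both shifts neither shift drops bits, so the quotient is unchanged; see the Alive2 proof above for the precise constraints):
```
define i8 @src(i8 %x, i8 %y, i8 %z) {
  %xs = shl nuw i8 %x, %z
  %ys = shl nuw i8 %y, %z
  %d = udiv i8 %xs, %ys
  ret i8 %d
}

define i8 @tgt(i8 %x, i8 %y, i8 %z) {
  %d = udiv i8 %x, %y
  ret i8 %d
}
```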
This fixes the motivating example from issue #58137,
but it is not the most general transform. We should
probably also convert left-shift in the divisor to
right-shift in the dividend for that, but that exposes
another missed canonicalization for shifts and adds.
The freeze instruction in some cases makes codegen worse, so we need to be very
careful when emitting it. Instead, improve the analysis in the isUndefVector
function to generate a mask of unused elements and use it in the analysis.
Differential Revision: https://reviews.llvm.org/D135382
optimizeInductions may leave dead recipes which can prevent sinking.
Sinking on the other hand should not introduce new dead recipes, so
clean up dead recipes before sinking.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D133762
They have been scattered over the code. For better structuring, perform
them in one place. A potential compile-time (CT) drop is possible because we collect exit
blocks twice, but it's a small price to pay for much better code structure.
These are semantically two different stages, but they were entwined in the
old implementation. Now cost computation does not do legality checks,
and they are all done beforehand.
For a local linkage GlobalObject in a non-prevailing COMDAT, it remains defined while its
leader has been made available_externally. This violates the COMDAT rule that
its members must be retained or discarded as a unit.
To fix this, update the regular LTO change D34803 to track local linkage
GlobalValues, and port the code to ThinLTO (GlobalAliases are not handled.)
This fixes two problems.
(a) `__cxx_global_var_init` in a non-prevailing COMDAT group used to
linger around (unreferenced, hence benign), and is now correctly discarded.
```
int foo();
inline int v = foo();
```
(b) Fix https://github.com/llvm/llvm-project/issues/58215:
as a size optimization, we place private `__profd_` in a COMDAT with a
`__profc_` key. When FuncImport.cpp makes `__profc_` available_externally due to
a non-prevailing COMDAT, `__profd_` incorrectly remains private. This change
makes the `__profd_` available_externally.
```
cat > c.h <<'eof'
extern void bar();
inline __attribute__((noinline)) void foo() {}
eof
cat > m1.cc <<'eof'
#include "c.h"
int main() {
bar();
foo();
}
eof
cat > m2.cc <<'eof'
#include "c.h"
__attribute__((noinline)) void bar() {
foo();
}
eof
clang -O2 -fprofile-generate=./t m1.cc m2.cc -flto -fuse-ld=lld -o t_gen
rm -fr t && ./t_gen && llvm-profdata show -function=foo t/default_*.profraw
clang -O2 -fprofile-generate=./t m1.cc m2.cc -flto=thin -fuse-ld=lld -o t_gen
rm -fr t && ./t_gen && llvm-profdata show -function=foo t/default_*.profraw
```
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D135427
((Op1 * X) / Y) / Op1 --> X / Y
https://alive2.llvm.org/ce/z/JYxWjA
InstSimplify handles the more basic mul+div pattern with
shared operand, but we don't seem to have any reassociation
folds to handle cases where the common op is further away.
This is a generalization of 9cff4711ac and another
transform derived from issue #58137.
When Index is variable but still trivially known to be equal, we can use the Value
from before the insertion, possibly eliminating the vector.
Reverts a functional change from:
Author: Philip Reames <listmail@philipreames.com>
Date: Wed Dec 8 12:21:10 2021 -0800
[instcombine] A couple style tweaks to visitExtractElementInst [nfc]
Thanks to Michele Scandale for identifying the bug
Differential Revision: https://reviews.llvm.org/D135625
insertelt DestVec, (fneg (extractelt SrcVec, Index)), Index --> shuffle DestVec, (fneg SrcVec), Mask
This is a specialized form of what could be a more general fold for a binop.
It's also possible that fneg is overlooked by SLP in this kind of
insert/extract pattern since it's a unary op.
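A minimal sketch in IR (hypothetical types and index):
```
define <4 x float> @src(<4 x float> %dest, <4 x float> %vec) {
  %e = extractelement <4 x float> %vec, i32 2
  %neg = fneg float %e
  %r = insertelement <4 x float> %dest, float %neg, i32 2
  ret <4 x float> %r
}

define <4 x float> @tgt(<4 x float> %dest, <4 x float> %vec) {
  %negvec = fneg <4 x float> %vec
  %r = shufflevector <4 x float> %dest, <4 x float> %negvec, <4 x i32> <i32 0, i32 1, i32 6, i32 3>
  ret <4 x float> %r
}
```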
This shows up in the motivating example from issue #58139, but it won't solve
it (that probably requires some x86-specific backend changes). There are also
some small enhancements (see TODO comments) that can be done as follow-up
patches.
Differential Revision: https://reviews.llvm.org/D135278
This reverts commit 4fbe33593c. It causes linking errors, with details provided internally. (Hopefully the author/reviewers will be able to upstream the internal repro).
If the single-thread model is used, or the
-licm-force-thread-model-single flag is specified, skip checks
related to thread-safety. This means that store promotion for
conditionally executed stores only requires proof of
dereferenceability and writability, but not of thread-safety. For
example, this enables promotion of stores to (non-constant) globals,
as well as captured allocas.
Fixes https://github.com/llvm/llvm-project/issues/50537.
Differential Revision: https://reviews.llvm.org/D130466
When determining the initial value of the object, use the constant
folding API to load a given type at a given offset in the global
initializer. This makes it work for cases where the load doesn't
directly correspond to an aggregate member.
Differential Revision: https://reviews.llvm.org/D135435
The current decomposition for GEPs did not correctly handle cases where
GEPs access different source types. Adjust the constraints by including
the indexed type-size as coefficients.
Further generalization to allow GEPs with more than one index is a
needed general follow-up improvement.
If the location ptr to be killed is in no loop and the Function does not
have irreducible loops, then we can regard it as loop invariant.
Differential Revision: https://reviews.llvm.org/D135369
Move common logic shared by callers of getConstraint that use the result
to query the constraint system to a new helper getConstraintForSolving.
This includes common legality checks (i.e. not an equality constraint,
no new variables) and the logic to query the unsigned system if possible
for signed predicates.
See the updated linkonce_resolution_comdat.ll. For a local linkage GV in a
non-prevailing COMDAT, it remains defined while its leader has been made
available_externally. This violates the COMDAT rule that its members must be
retained or discarded as a unit.
To fix this, update the regular LTO change D34803 to track local linkage
GlobalValues, and port the code to ThinLTO (GlobalAliases are not handled.)
Fix https://github.com/llvm/llvm-project/issues/58215:
as a size optimization, we place private `__profd_` in a COMDAT with a
`__profc_` key. When FuncImport.cpp makes `__profc_` available_externally due to
a non-prevailing COMDAT, `__profd_` incorrectly remains private. This change
makes the `__profd_` available_externally.
```
cat > c.h <<'eof'
extern void bar();
inline __attribute__((noinline)) void foo() {}
eof
cat > m1.cc <<'eof'
#include "c.h"
int main() {
bar();
foo();
}
eof
cat > m2.cc <<'eof'
#include "c.h"
__attribute__((noinline)) void bar() {
foo();
}
eof
clang -O2 -fprofile-generate=./t m1.cc m2.cc -flto -fuse-ld=lld -o t_gen
rm -fr t && ./t_gen && llvm-profdata show -function=foo t/default_*.profraw
# one _Z3foov
clang -O2 -fprofile-generate=./t m1.cc m2.cc -flto=thin -fuse-ld=lld -o t_gen
rm -fr t && ./t_gen && llvm-profdata show -function=foo t/default_*.profraw
# one _Z3foov
```
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D135427
The 1st attempt failed to update the test checks as expected.
Original commit message:
sdiv exact X, (1<<ShAmt) --> ashr exact X, ShAmt (if shl is non-negative)
https://alive2.llvm.org/ce/z/kB6VF7
It would probably be better to use ValueTracking to replace this
and the existing transform above it, but the analysis does not
account for the no-wrap properly, and it's not immediately clear
to me how to fix it.
sdiv exact X, (1<<ShAmt) --> ashr exact X, ShAmt (if shl is non-negative)
https://alive2.llvm.org/ce/z/kB6VF7
It would probably be better to use ValueTracking to replace this
and the existing transform above it, but the analysis does not
account for the no-wrap properly, and it's not immediately clear
to me how to fix it.
The logic added in 3771310eed was placed sub-optimally. Applying the
transform in ::getConstraint meant that it would also impact conditions
that are added to the system by the signed <-> unsigned transfer logic.
This meant we failed to add some signed facts to the signed system. To
make sure we still add as many useful facts to the signed/unsigned
systems, move the logic to the point where we query the system.
Clear all dispositions if there are any dead blocks (which will get
removed later) and also clear dispositions for removed instructions.
Clearing all dispositions in case there are dead blocks happens first,
which should avoid traversing SCEV use-lists for invalidating
dispositions for individual values.
Fixes #58179.
This patch was added way back in the beginning of the work which became the statepoint infrastructure. The idea was that safepoints could be inserted late in the optimization pipeline. This is true if the only concern is garbage collection, but this approach turned out to be incompatible with the requirement to also support deoptimization at safepoints.
In theory, this pass would still be quite useful for an AOT-compiled language which wants to support garbage collection, but we have no known users, and haven't for over 5 years. Time to remove unused code. If someone wants to use this, restoring it would not be hard. The immediate motivation for removal is that this is one of the last passes remaining which hasn't been ported to the new pass manager, and the (straightforward) work to do so is not justified for unused code.
Differential Revision: https://reviews.llvm.org/D135371
Extend forgetBlockAndLoopDisposition to allow clearing information for a
single value. This can be useful when only a single value is changed,
e.g. because the instruction is moved.
We also need to clear the cached values for all SCEV users, because they
may depend on the starting value's disposition.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D134614
Loop peeling currently requires that a) the latch is exiting
b) a branch and c) other exits are unreachable/deopt. This patch
removes all of these limitations, and adds the necessary branch
weight updating support. It essentially works the same way as
before with latch -> exiting terminator and
loop trip count -> per exit trip count.
It's worth noting that there are still other limitations in
profitability heuristics: This patch enables peeling of loops to
make conditions invariant (which is pretty much always highly
profitable if possible), while peeling to make loads dereferenceable
still checks that non-latch exits are unreachable and PGO-based
peeling has even more conditions. Those checks could be relaxed
later if we consider those cases profitable.
The motivation for this change is that loops using iterator adaptors
in Rust often optimize very badly, and end up with a loop phi of the
form phi(true, false) in the final result. Peeling eliminates that
phi and conditions based on it, which enables a lot of follow-on
simplification.
Differential Revision: https://reviews.llvm.org/D134803
As LoopPredication performs non-equivalent transforms removing some
checks from loops, other passes may not be able to perform transforms
they'd be able to do if the checks were left in loops.
This patch makes LoopPredication insert assumes of the replaced
conditions either after a guard call or in the true block of
widenable condition branch.
Differential Revision: https://reviews.llvm.org/D135354
Relative to the previous attempt, this adjusts simplification to
use the correct context instruction: We need to use the terminator
of the incoming block, not the original instruction.
-----
foldOpIntoPhi() currently only folds operations into the phi if all
but one operands constant-fold. The two exceptions to this are freeze
and select, where we allow more general simplification.
This patch makes foldOpIntoPhi() generally simplification based and
removes all the instruction-specific logic. We just try to simplify
the instruction for each operand, and for the (potentially) one
non-simplified operand, we move it into the new block with adjusted
operands.
This fixes https://github.com/llvm/llvm-project/issues/57448, which
was my original motivation for the change.
Differential Revision: https://reviews.llvm.org/D134954
Added analysis for invariant extractelement instructions and improved
detection of the CSE blocks for generated extractelement instructions.
Differential Revision: https://reviews.llvm.org/D135279
The limitation in LibCallSimplifier::optimizeStringLength to only
optimize when the string is an i8 array was changed already in
commit 50ec0b5dce back in 2017.
We still only simplify when 's' points at an array of 'CharSize', so
the comment is still valid in the sense that we do not support
arbitrary array types.
Differential Revision: https://reviews.llvm.org/D135261
Make sure conditions with constant operands come before conditions
without constant operands. This increases the effectiveness of the
current signed <-> unsigned fact transfer logic.
If a call base use will not capture a pointer we can approximate the
effects. This is important especially for readnone/only uses. Even
may-write uses are not too bad with reachability in place. Capturing
is the problem as we lose track of update sides.
If we have a constant aggregate, e.g., as an initializer, we usually
failed to extract the proper value/type from it. This patch provides the
size and offset information necessary to extract the right part of the
constant.
This was already handled correctly below, but not checked for the
original store pointer operand. Encountered when converting tests
to opaque pointers, where the intermediate bitcast goes away.
In the case of non-opaque pointers, when combining consecutive loads,
we need to bitcast the pointer source to the combined type size; otherwise
asserts are triggered.
Differential Revision: https://reviews.llvm.org/D135249
The infinite loop seen on buildbots should be fixed by
11897708c0 (assuming there are not
multiple infinite combine loops...)
-----
foldOpIntoPhi() currently only folds operations into the phi if all
but one operands constant-fold. The two exceptions to this are freeze
and select, where we allow more general simplification.
This patch makes foldOpIntoPhi() generally simplification based and
removes all the instruction-specific logic. We just try to simplify
the instruction for each operand, and for the (potentially) one
non-simplified operand, we move it into the new block with adjusted
operands.
This fixes https://github.com/llvm/llvm-project/issues/57448, which
was my original motivation for the change.
Differential Revision: https://reviews.llvm.org/D134954
Rather than inserting a ptrtoint + inttoptr pair, directly replace
the inttoptr with the new phi node. This ensures that no other
transform can undo it before the pair gets folded away.
This avoids the infinite loop when combined with D134954.
This is NFCI in the sense that it shouldn't make a difference, but
could due to different worklist order.
SimpleLoopUnswitch may remove blocks from loops. Clear block and loop
dispositions in that case, to clean up invalid entries in the cache.
Fixes #58158.
Fixes #58159.
Loop versioning changes the control flow, which may impact SCEVs cached
for other loops in LoopAccessInfoManager. Clear the manager after
making changes.
Fixes #57825.
Depends on D134609.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D134611
isOuterMostDepPositive()
The function isOuterMostDepPositive() is checked after negative dependence
vectors are normalized to be non-negative, so there will not be any negative
dependency ('>' as the outermost non-equal sign) after normalization. And
therefore the check in isOuterMostDepPositive() is irrelevant and redundant.
Reviewed By: congzhe
Differential Revision: https://reviews.llvm.org/D132982
In the canonical form of the shuffle, the poison/undef operand is the
second operand; the patch tries to emit the canonical form for partial
vectorization of the buildvector sequence.
Also, this patch starts emitting a freeze instruction for shuffles with undef indices if the second shuffle operand is undef, not poison. It is an initial step towards D93818, where undef mask elements are treated as returning a poison value.
Differential Revision: https://reviews.llvm.org/D134377
Reapply with a fix for the case where an operand simplified back
to the original phi: We need to map this case to the new phi node.
-----
foldOpIntoPhi() currently only folds operations into the phi if all
but one operands constant-fold. The two exceptions to this are freeze
and select, where we allow more general simplification.
This patch makes foldOpIntoPhi() generally simplification based and
removes all the instruction-specific logic. We just try to simplify
the instruction for each operand, and for the (potentially) one
non-simplified operand, we move it into the new block with adjusted
operands.
This fixes https://github.com/llvm/llvm-project/issues/57448, which
was my original motivation for the change.
This currently does not make much of a difference (only one test is
affected), but it is helpful e.g. for the out-of-tree CHERI target where
Builder.CreateMemCpy() can add attributes other than parameter alignment.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D135075
The helpers in BuildLibCalls normally expect that the Value
arguments already have the correct type (matching the lib call
signature). An exception has been emitFPutC, which cast the Char
argument to 'int' using CreateIntCast. This patch moves the cast to
the caller instead of doing it inside emitFPutC.
I think it makes sense to make the BuildLibCall APIs a bit
more consistent this way, despite the need to handle the int cast
in two different places now.
Differential Revision: https://reviews.llvm.org/D135066
Stop assuming that an 'int' is 32 bits in helpers that emit libcalls
to lib functions that had 'int' in the signature. For most targets
this is NFC. For a target with a 16-bit 'int' type this could help
detect attempts to emit a libcall with an incorrect signature.
Similarly we now derive the type mapping to 'size_t' by asking TLI
about the size of 'size_t'. This should be NFC (at least for in-tree
targets) since getSizeTSize(), in TLI, is deriving the size in the
same way as DataLayout::getIntPtrType().
Differential Revision: https://reviews.llvm.org/D135065
Lots of BuildLibCalls helpers are using Builder::getInt32Ty to get
a type matching an 'int', and DataLayout::getIntPtrType to get a
type matching 'size_t'. The former is not true for all targets, since
an 'int' isn't always 32 bits. And the latter is a bit weird as well,
since the definition of DataLayout::getIntPtrType doesn't clearly map
it to 'size_t'.
This patch is not aiming at solving any such problems. It is merely
highlighting when a libcall is expecting to use 'int' and 'size_t'
by naming the types as IntTy and SizeTTy when preparing the type
signatures for the emitted libcalls.
Differential Revision: https://reviews.llvm.org/D135064
Use LoopAccessInfoManager directly instead of various GetLAA lambdas.
Depends on D134608.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D134609
If nonnull is already set, we currently skip setting both nonnull
and dereferenceable. Make these independent, to avoid regressions
when additional nonnull attributes are inferred earlier.
foldOpIntoPhi() currently only folds operations into the phi if all
but one operands constant-fold. The two exceptions to this are freeze
and select, where we allow more general simplification.
This patch makes foldOpIntoPhi() generally simplification based and
removes all the instruction-specific logic. We just try to simplify
the instruction for each operand, and for the (potentially) one
non-simplified operand, we move it into the new block with adjusted
operands.
This fixes https://github.com/llvm/llvm-project/issues/57448, which
was my original motivation for the change.
Recent improvements to the code structure mean we don't need to reset
the condition's predicate in the IR and later restore it. Remove the
restorer logic.
llvm/lib/Transforms/Utils/CodeLayout.cpp uses std::abs() with a double argument,
which is provided by the cmath header, which is not explicitly included into CodeLayout.cpp.
The implicit include in llvm/include/llvm/Support/MathExtras.h was removed in
commit 16544cbe64
Insert an explicit include of cmath into CodeLayout.cpp in order to fix the build on macOS.
Committed on behalf of alsemenov (Aleksei Semenov)
Reviewed By: thieta
Differential Revision: https://reviews.llvm.org/D135072
Added a helper in TargetLibraryInfo to get size of "size_t" in bits,
given a Module reference. The new getSizeTSize helper is using the
same strategy as for example isValidProtoForLibFunc has been using
in the past, assuming that the size can be derived by asking
DataLayout about the size/type of a pointer to int.
FortifiedLibCallSimplifier::optimizeStrpCpyChk was changed to use
the new getSizeTSize helper instead of assuming that sizeof(size_t)
is equal to sizeof(int*) by itself (that is the assumption used in
TargetLibraryInfoImpl::getSizeTSize so the result will be the same).
Having a common helper for this ensures that we use the same strategy
when deriving the size of "size_t" in different parts of the code.
One bonus with this refactoring (basing it on Module instead of just
DataLayout) is that it makes it easier to override this for a specific
target triple, in case the assumption of using getPointerSizeInBits
doesn't hold.
Differential Revision: https://reviews.llvm.org/D110585
This is an unusual canonicalization because we create an extra instruction,
but it's likely better for analysis and codegen (similar reasoning as D133399).
InstCombine::Negator may create this kind of multiply from negate and shift,
but this should not conflict because of the narrow negation.
I don't know how to create a fully general proof for this kind of transform in
Alive2, but here's an example with bitwidths similar to one of the regression
tests:
https://alive2.llvm.org/ce/z/J3jTjR
Differential Revision: https://reviews.llvm.org/D133667
At the moment, LoopAccessAnalysis is a loop analysis for the new pass
manager. The issue with that is that LAI caches SCEV expressions and
modifications in a loop may impact SCEV expressions in other loops, but
we do not have a convenient way to invalidate LAI for other loops
within a loop pipeline.
To avoid this issue, turn it into a function analysis which returns a
manager object that keeps track of the individual LAI objects per loop.
Fixes #50940.
Fixes #51669.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D134606
Update both memprof and callsite metadata to reflect inlined functions.
For callsite metadata this is simply a concatenation of each cloned
call's call stack with that of the inlined callsite's.
For memprof metadata, each profiled memory info block (MIB) is either
moved to the cloned allocation call or left on the original allocation
call depending on whether its context matches the newly refined call
stack context on the cloned call. We also reapply context trimming
optimizations based on the refined set of contexts on each of the calls
(cloned and original).
Depends on D128142.
Reviewed By: snehasish
Differential Revision: https://reviews.llvm.org/D128143
This reverts commit 0d7f3464ce and
commit f9403ca41e. The latter was
"Profile matching and IR annotation for memprof profiles." and was left
from a bad rebase from a commit already pushed upstream.
Update both memprof and callsite metadata to reflect inlined functions.
For callsite metadata this is simply a concatenation of each cloned
call's call stack with that of the inlined callsite's.
For memprof metadata, each profiled memory info block (MIB) is either
moved to the cloned allocation call or left on the original allocation
call depending on whether its context matches the newly refined call
stack context on the cloned call. We also reapply context trimming
optimizations based on the refined set of contexts on each of the calls
(cloned and original), via utilities in MemoryProfileInfo.
Depends on D128142.
Differential Revision: https://reviews.llvm.org/D128143
See also related RFCs:
RFC: Sanitizer-based Heap Profiler [1]
RFC: A binary serialization format for MemProf [2]
RFC: IR metadata format for MemProf [3]*
* Note that the IR metadata format has changed from the RFC during
implementation, as described in the preceding patch adding the basic
metadata and verification support.
The matching is performed during the normal PGO annotation phase, to
ensure that the inlines applied in the IR at that point are a subset
of the inlines in the profiled binary and thus reflected in the
profile's call stacks. This is important because the call frames are
associated with functions in the profile based on the inlining in the
symbolized call stacks, and this simplifies locating the subset of
profile data relevant for matching onto each function's IR.
The PGOInstrumentationUse pass is enhanced to perform matching for
whatever combination of memprof and regular PGO profile data exists in
the profile.
Using the utilities introduced in D128854:
The memprof profile data for each context is converted to "cold" or
"notcold" based on parameterized thresholds for size, access count, and
lifetime. The memprof allocation contexts are trimmed to the minimal
amount of context required to uniquely identify whether the context is
cold or not cold. For allocations where all profiled contexts have the
same allocation type, no memprof metadata is attached and instead the
allocation call is directly annotated with an attribute specifying the
allocation type. This is the same attribute that will be applied to
allocation calls once cloned for different contexts, and later used
during LibCall simplification to emit allocation hints [4].
Depends on D128141 and D128854.
[1] https://lists.llvm.org/pipermail/llvm-dev/2020-June/142744.html
[2] https://lists.llvm.org/pipermail/llvm-dev/2021-September/153007.html
[3] https://discourse.llvm.org/t/rfc-ir-metadata-format-for-memprof/59165
[4] ab87cf382d
Differential Revision: https://reviews.llvm.org/D128142
I'm not sure how to test this because we seem to constant-fold
all examples already. We changed this code to use the common
isNonNegative() helper, so it should not be necessary to avoid
a constant. This makes the code uniform for all transforms.
Collect more statistics for scalar promotion. In particular,
keep track of how many promotion candidates there were, and
whether it is a load or a load/store promotion.
Revert rGef89409a59f3b79ae143b33b7d8e6ee6285aa42f "Fix 'unused-lambda-capture' gcc warning. NFCI."
Revert rG926ccfef032d206dcbcdf74ca1e3a9ebf4d1be45 "[SLP] ScalarizationOverheadBuilder - demand all elements for scalarization if the extraction index is unknown / out of bounds"
Revert ScalarizationOverheadBuilder sequence from D134605 - when accumulating extraction costs by Type (instead of specific Value), we are not distinguishing enough when they are coming from the same source or not, and we always just count the cost once. This needs addressing before we can use getScalarizationOverhead properly.
breakLoopBackedge may remove blocks and loops. Also clear block &
loop disposition to avoid the cache containing invalid blocks and loops.
The coverage for the change is provided when using an ASAN build of opt
to run the LoopDeletion unit tests; without the fix, pointers to invalid
objects would be used.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D134663
Move LCSSA fixup from ::expandCodeForImpl to ::expand(). This has
the advantage that we directly preserve LCSSA nodes here instead of
relying on doing so in rememberInstruction. It also ensures that we
don't add the non-LCSSA-safe value to InsertedExpressions.
Alternative to D132704.
Fixes #57000.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D134739
For a no-op store formed by a LoadI/StoreI pair, an invariant that
should be kept is that the memory state of the related MemoryLoc
before LoadI is the same as before StoreI.
For this example:
```
define void @pr49927(i32* %q, i32* %p) {
%v = load i32, i32* %p, align 4   ; LoadI
store i32 %v, i32* %q, align 4    ; intervening store; the defining access for StoreI, but it writes %v
store i32 %v, i32* %p, align 4    ; StoreI: a no-op, it stores the value loaded from %p back to %p
ret void
}
```
Here the memory definition reaching the store's destination is different
from the definition reaching the load's destination, which makes it look
as if the invariant mentioned above is broken. But the definition of the
store's destination writes the value produced by LoadI, so the invariant
is actually still kept and we can safely ignore it.
Fixes https://github.com/llvm/llvm-project/issues/49271
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D132657
Debugify in OriginalDebugInfo mode (verify-each-debuginfo-preserve), when used
in parallel builds of large projects, can produce an incorrect report. More
precisely, simultaneous writes to the JSON report file could form incorrect JSON
objects describing the found Debug Info bugs.
This patch uses a lock/unlock mechanism to protect the JSON report file and also
makes the script llvm/utils/llvm-original-di-preservation.py resilient to corrupted
lines in the report file, ensuring the HTML report can still be created.
Differential Revision: https://reviews.llvm.org/D115616
The previous version of the patch would incorrectly convert an
existing argmemonly attribute into an inaccessiblemem_or_argmemonly
attribute.
-----
This updates checkFunctionMemoryAccess() to infer a precise
FunctionModRefBehavior, rather than an approximation split into
read/write and argmemonly.
Afterwards, we still map this back to imprecise function attributes.
This still allows us to infer some cases that we previously did not
handle, namely inaccessiblememonly and inaccessiblemem_or_argmemonly.
In practice, this means we get better memory attributes in the
presence of intrinsics like @llvm.assume.
Differential Revision: https://reviews.llvm.org/D134527
Factor out the logic to create induction resume values for a specific
induction. This will be used in D92132 to support widened IVs during
epilogue vectorization.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D134211
Simplify the code by using CastInst::CreateBitOrPointerCast directly. By
not going through the builder, the temporary instruction also won't get
registered in InsertedValues & co, which means less work overall and
simplifies the clean-up.
https://reviews.llvm.org/D134254 introduced an issue on the Fuchsia
target, which does not unconditionally emit the runtime hook.
It used containsProfilingIntrinsics(M) after intrinsics are lowered.
So, this patch fixes the issue by capturing the result of that
function invocation before intrinsics are lowered.
Differential Revision: https://reviews.llvm.org/D134841
Interestingly, MathExtras.h doesn't use any <cmath> declarations, so move
the include out of that header and include it where needed.
No functional change intended, but there is no longer a transitive include
from MathExtras.h to cmath.
- Before this patch, loop metadata (if it exists) will override the metadata of each predecessor; if the predecessor block already has loop metadata, the original loop metadata won't be preserved and could cause missed loop transformations (see 'test2' in llvm/test/Transforms/SimplifyCFG/preserve-llvm-loop-metadata.ll).
To illustrate how inner-loop metadata might be dropped before this patch:
CFG Before
entry
|
v
---> while.cond -------------> while.end
| |
| v
| while.body
| |
| v
| for.body <---- (md1)
| | |______|
| v
| while.cond.exit (md2)
| |
|_______|
CFG After
entry
|
v
---> while.cond.rewrite -------------> while.end
| |
| v
| while.body
| |
| v
| for.body <---- (md2)
|_______| |______|
Basically, when 'while.cond.exit' is folded into 'while.cond', 'md2' overrides 'md1' and 'md1' is dropped from the CFG.
Differential Revision: https://reviews.llvm.org/D134152
The patch simplifies some of the patterns as below
1. (ZExt(L1) << shift1) | (ZExt(L2) << shift2) -> ZExt(L3) << shift1
2. (ZExt(L1) << shift1) | ZExt(L2) -> ZExt(L3)
The pattern is indicative of the fact that the loads are being merged into a wider load and the only use of this pattern is with a wider load. In this case, for non-atomic/non-volatile loads, reduce the pattern to a combined load, which improves the cost of inlining, unrolling, vectorization, etc.
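As a rough illustration of the kind of rewrite this enables (types, offsets and the little-endian assumption are hypothetical, not taken from the tests):
```
; before: two adjacent i8 loads merged through zext/shl/or
%p1 = getelementptr i8, ptr %p, i64 1
%l1 = load i8, ptr %p        ; low byte
%l2 = load i8, ptr %p1       ; high byte
%z1 = zext i8 %l1 to i16
%z2 = zext i8 %l2 to i16
%hi = shl i16 %z2, 8
%or = or i16 %hi, %z1

; after (assuming a little-endian target): a single wider load
%val = load i16, ptr %p
```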
Fix the error reported on reverse load merge.
Differential Revision: https://reviews.llvm.org/D127392
We don't combine generic shuffles together in IR, but select
shuffles are a special-case because a select shuffle of a
select shuffle is just another select shuffle; codegen is
expected to efficiently lower those (select shuffles are also
the canonical form of a vector select with constant condition).
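A small sketch of why combining is safe here (lane values chosen purely for illustration):
```
; %s1 selects lanes: [a0, b1, b2, a3]
%s1 = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 0, i32 5, i32 6, i32 3>
; %s2 selects lanes from %s1 and %b: [a0, b1, b2, b3]
%s2 = shufflevector <4 x i32> %s1, <4 x i32> %b, <4 x i32> <i32 0, i32 1, i32 6, i32 7>
; ...which is itself just a select shuffle of the original operands:
%s  = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 0, i32 5, i32 6, i32 7>
```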
A User like the PHINode may be visited multiple times for the same pointer along
different def-use edges. The uninitialized state of OffsetInfo at the first
visit needs to be distinct from the Unknown value that may be assigned after
processing the PHINode. Without that, a PHINode with all inputs Unknown is never
followed to its uses. This results in incorrect optimization because some
interfering accesses are missed.
Differential Revision: https://reviews.llvm.org/D134704
After deleting a loop, the block and loop dispositions need to be
cleared. As we don't know which SCEVs in the loop/blocks may be
impacted, completely clear the cache. This should also fix some cases
where deleted loops remained in the LoopDispositions cache.
This fixes a verification failure surfaced by D134531.
Reviewed By: mkazantsev
Differential Revision: https://reviews.llvm.org/D134613
Fixes #57572
Generally the LICM pass is responsible for moving code that calculates an
invariant address out of the loop, as it only needs to be calculated once.
But in the rare case where that does not happen, we would fail to vectorize
the loop.
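A minimal sketch of the situation (the IR is illustrative, not taken from the issue):
```
loop:
  %iv   = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
  ; invariant address computed inside the loop; LICM would normally move
  ; this out of the loop, but if it does not, the vectorizer used to give up
  %addr = getelementptr inbounds i32, ptr %base, i64 %off
  %gep  = getelementptr inbounds i32, ptr %src, i64 %iv
  %v    = load i32, ptr %gep
  store i32 %v, ptr %addr
  %iv.next = add i64 %iv, 1
  %ec   = icmp eq i64 %iv.next, %n
  br i1 %ec, label %exit, label %loop
```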
Differential Revision: https://reviews.llvm.org/D133687
This is a purely NFC restructuring in advance of a change which actually exposes zero strides. It is mostly motivated by the fact that I find this interface confusing each time I look at it.
Follow up to D133580; adjust the cost model to prefer uniform store lowering for scalable stores which are unpredicated.
The impact here isn't in the uniform store lowering quality itself. InstCombine happily converts the scatter form into the single store form. The main impact is in letting the rest of the cost model make choices based on the knowledge that the vector will be scalarized on use.
Differential Revision: https://reviews.llvm.org/D134460
Instead of accumulating all extraction costs separately and then adjusting for repeated subvector extractions, this patch collects all the extractions and then converts to calls to getScalarizationOverhead to improve the accuracy of the costs.
I'm not entirely satisfied with the getExtractWithExtendCost handling yet - this still just adds all the getExtractWithExtendCost costs together - it really needs to be replaced with a "getScalarizationOverheadWithExtend", but that will require further refactoring first.
This replaces my initial attempt in D124769.
Differential Revision: https://reviews.llvm.org/D134605
This updates checkFunctionMemoryAccess() to infer a precise
FunctionModRefBehavior, rather than an approximation split into
read/write and argmemonly.
Afterwards, we still map this back to imprecise function attributes.
This still allows us to infer some cases that we previously did not
handle, namely inaccessiblememonly and inaccessiblemem_or_argmemonly.
In practice, this means we get better memory attributes in the
presence of intrinsics like @llvm.assume.
Differential Revision: https://reviews.llvm.org/D134527
After unrolling a loop, the block and loop dispositions need to be
cleared. As we don't know which SCEVs in the loop/blocks may be
impacted, completely clear the cache. This should also fix some cases
where deleted loops remained in the LoopDispositions cache.
This fixes a verification failure surfaced by D134531.
I am planning on reviewing/updating the existing uses of
forgetLoopDispositions to check if they should be replaced by
forgetBlockAndLoopDispositions.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D134612
Second patch in the series to remove the legacy PM and the
associated -enable-new-pm=0 flag. This one targets a pass that
has not been ported to the new PM - PruneEH.
Discussion about this can be found in D44415.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D134686
This is a test to verify that we do not crash with the
problem noted in issue #57986. The root problem should
be fixed with a prior change to InstSimplify.
Using these queries with a context instruction and without a cache
seems to be about 2x slower than with it, so this theoretically
improves compile time.
During the structurization process, we may place non-predecessor blocks
between the predecessors of a block in the structurized CFG. Take
the typical while-break case as an example:
```
/---A(v=...)
| / \
^ B C
| \ /|
\---L |
\ /
E (r = phi (v:C)...)
```
After structurization, the CFG would look like:
```
/---A
| |\
| | C
| |/
| F1
^ |\
| | B
| |/
| F2
| |\
| | L
\ |/
\--F3
|
E
```
We can see that block B is placed between the predecessors (C/L) of E.
During phi reconstruction, to achieve the same semantics as before, we
are reconstructing the PHIs as:
F1: v1 = phi (v:C), (undef:A)
F3: r = phi (v1:F2), ...
But this is also saying that `v1` would be live through B, which is not
quite necessary. The idea in the change is to say the incoming value
from B is Undef for the PHI in E. With this change, the reconstructed
PHI would be:
F1: v1 = phi (v:C), (undef:A)
F2: v2 = phi (v1:F1), (undef:B)
F3: r = phi (v2:F2), ...
Reviewed by: sameerds
Differential Revision: https://reviews.llvm.org/D132450
The instruction simplification will try to simplify the affected phis.
In some cases, this might extend the liveness of values. For example:
BB0:
| \
| BB1
| /
BB2:phi (BB0, v), (BB1, undef)
The phi in BB2 will be simplified to v, as v dominates BB2, but this
increases the number of active values in BB1. By setting CanUseUndef
to false, we will not simplify the phi in this way, which helps
register pressure. This is mandatory for the later change to help
reduce VGPR pressure for AMDGPU.
Reviewed by: foad, sameerds
Differential Revision: https://reviews.llvm.org/D132449
This reverts commit 794b7ea960, and
thus restores commit a212d8da94, and
follow on fixes 0cd6763fa9,
e9ff53d42f, and
37c6a25e9a.
Use a hash function (BLAKE3) instead of hash_combine/hash_code which are
not guaranteed to be stable across executions.
Additionally, it adds a "REQUIRES: x86_64-linux" to the tests that have
raw profile inputs to avoid failures on big endian bots.
Reviewers: snehasish, davidxl
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D128142
The dependent code has been changed quite a lot since 151c144, which
b73d2c8 effectively reverts. Now we run into a case where lowering
no longer expects/supports the pre-151c144 behavior.
Update the code dealing with scalable pointer inductions to also check
for uniformity in combination with isScalarAfterVectorization. This
should ensure scalable pointer inductions are handled properly during
epilogue vectorization.
Fixes #57912.
When store vectorization is infeasible, it's helpful to have a debug-log indication of why. A case I've hit a couple of times now is accidentally using -march instead of -mtriple and getting the default TTI results. This causes max-vf to become 1, and thus hits the added logging line.
We allow the target to report different costs depending on properties of the operands; given this, we have to make sure we pass the right set of operands and account for the fact that different scalar instructions can have operands with different properties.
As a motivating example, consider a set of multiplies which each multiply by a constant (but not all the same constant). Most of the constants are powers of two (but not all).
If the target doesn't have support for non-uniform constant immediates, this will likely require constant materialization and a non-uniform multiply. However, depending on the balance of target costs for constant scalar multiplies vs a single vector multiply, this might or might not be a profitable vectorization.
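For instance (constants chosen arbitrarily), a group of scalar multiplies like:
```
%m0 = mul i32 %a0, 4
%m1 = mul i32 %a1, 8
%m2 = mul i32 %a2, 6    ; not a power of two
%m3 = mul i32 %a3, 16
; vectorizes to a single multiply by the non-uniform constant
; <i32 4, i32 8, i32 6, i32 16>, which may need to be materialized
```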
This ends up basically being a rewrite of the existing code. Normally, I'd scope the change more narrowly, but I kept noticing things which seemed highly suspicious, and none of the existing code appears to have any test coverage at all. I think this is a case where simply throwing out the existing code and starting from scratch is reasonable.
This is a follow on to Alexey's D126885, but also handles the arithmetic instruction case since the existing code appears to have the same problem.
Differential Revision: https://reviews.llvm.org/D132566
LoopDeletion may hoist instructions out of a loop using
makeLoopInvariant without invalidating the SCEV for the moved
instruction.
Moving the instruction to a different block may change its
cached block disposition, so invalidate the cached info.
Fixes #57837.
The mul-by-constant cost models handle power-of-2 constants, but not negated power-of-2 constants, despite the backends handling both.
This patch adds the OperandValueProperties::OP_NegatedPowerOf2 enum and wires it for use for basic mul cost analysis and SLP handling.
Fixes #50778
Differential Revision: https://reviews.llvm.org/D111968
MemoryLocation::getOrNone() already has the necessary logic to
handle different instruction types. Use it, rather than repeating
a subset of the logic. This adds support for previously unhandled
instructions like atomicrmw.
After 20d798bd47, SCEV looks through PHIs with a single incoming
value. This means adding a new incoming value may change the SCEV for a
phi. Add missing invalidation when an existing PHI is reused during
LoopVersioning. New incoming values will be added later from the
versioned loop.
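A rough sketch of the scenario (names are illustrative, not from the test case):
```
; Initially the reused phi has a single incoming value, so SCEV can look
; through it and cache the SCEV of %iv.next for it.
%lcssa = phi i64 [ %iv.next, %loop.latch ]
; LoopVersioning later adds a second incoming value from the versioned
; loop, e.g.:
;   %lcssa = phi i64 [ %iv.next, %loop.latch ], [ %iv.next.v, %loop.latch.v ]
; so the previously cached SCEV for %lcssa must be invalidated.
```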
Similar issues have been fixed by also adding missing invalidation.
Fixes #57825.
Note that the test case unfortunately requires running loop-vectorize
followed by loop-load-elimination, which does the actual versioning. I
don't think it is possible to reproduce the failure without that
combination.
The patch simplifies some of the patterns as below
1. (ZExt(L1) << shift1) | (ZExt(L2) << shift2) -> ZExt(L3) << shift1
2. (ZExt(L1) << shift1) | ZExt(L2) -> ZExt(L3)
The pattern is indicative of the fact that the loads are being merged into a wider load and the only use of this pattern is with a wider load. In this case, for non-atomic/non-volatile loads, reduce the pattern to a combined load, which improves the cost of inlining, unrolling, vectorization, etc.
Differential Revision: https://reviews.llvm.org/D127392
This reverts commit a212d8da94, and follow
on fixes 0cd6763fa9,
e9ff53d42f, and
37c6a25e9a.
After re-reading the documentation for hash_combine, I don't think this
is the appropriate hash function to use for computing the hash to use as
a stack id in the metadata, since it is not guaranteed to produce stable
values across executions. I have not hit this problem, but plan to
switch to using an MD5 hash. I am hitting an issue with one of the bots
(https://lab.llvm.org/buildbot/#/builders/171/builds/20732)
where the values produced are only the lower 32 bits of the expected
hash values, however, which I assume is related to the implementation of
hash_combine and hash_code.
I believe I fixed all of the other bot failures with the follow on fixes,
which I'll merge into the new version before reapplying.
Profile matching and IR annotation for memprof profiles.
See also related RFCs:
RFC: Sanitizer-based Heap Profiler [1]
RFC: A binary serialization format for MemProf [2]
RFC: IR metadata format for MemProf [3]*
* Note that the IR metadata format has changed from the RFC during
implementation, as described in the preceding patch adding the basic
metadata and verification support.
The matching is performed during the normal PGO annotation phase, to
ensure that the inlines applied in the IR at that point are a subset
of the inlines in the profiled binary and thus reflected in the
profile's call stacks. This is important because the call frames are
associated with functions in the profile based on the inlining in the
symbolized call stacks, and this simplifies locating the subset of
profile data relevant for matching onto each function's IR.
The PGOInstrumentationUse pass is enhanced to perform matching for
whatever combination of memprof and regular PGO profile data exists in
the profile.
Using the utilities introduced in D128854:
The memprof profile data for each context is converted to "cold" or
"notcold" based on parameterized thresholds for size, access count, and
lifetime. The memprof allocation contexts are trimmed to the minimal
amount of context required to uniquely identify whether the context is
cold or not cold. For allocations where all profiled contexts have the
same allocation type, no memprof metadata is attached and instead the
allocation call is directly annotated with an attribute specifying the
allocation type. This is the same attribute that will be applied to
allocation calls once cloned for different contexts, and later used
during LibCall simplification to emit allocation hints [4].
Depends on D128141 and D128854.
[1] https://lists.llvm.org/pipermail/llvm-dev/2020-June/142744.html
[2] https://lists.llvm.org/pipermail/llvm-dev/2021-September/153007.html
[3] https://discourse.llvm.org/t/rfc-ir-metadata-format-for-memprof/59165
[4] ab87cf382d
Differential Revision: https://reviews.llvm.org/D128142
This extends the previously added uniform store case to handle stores of loop varying values to a loop invariant address. Note that the placement of this code only allows unpredicated stores; this is important for correctness. (That is "IsPredicated" is always false at this point in the function.)
This patch does not include scalable types. The diff felt "large enough" as it were; I'll handle that in a separate patch. (It requires some changes to cost modeling.)
Differential Revision: https://reviews.llvm.org/D133580
For the case where the constant is a power of two rather than zero,
the fold is incorrect, because it fails to check that the bit set
in the LHS matches the bit in the RHS.
Rather than fixing this, remove the power of two handling entirely,
as a different fold will already canonicalize such comparisons to
use a zero constant.
Fixes https://github.com/llvm/llvm-project/issues/57899.
Perform the simplifyWithOpReplaced() fold even for non-bool
selects. This subsumes a number of recently added folds for
zext/sext of the condition.
We still need to manually handle variations with both sext/zext
and not, because simplifyWithOpReplaced() only performs one
level of replacements.
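One example of the kind of non-bool select this can now simplify (illustrative, not taken from the tests):
```
%c = icmp eq i32 %x, 0
%t = sub i32 %y, %x
%r = select i1 %c, i32 %t, i32 %y
; In the true arm %x can be replaced by 0, so %t simplifies to %y and the
; whole select folds to %y.
```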
We can handle vectors inside simplifyWithOpReplaced(), as long as
cross-lane operations are excluded. The equality can hold (or not
hold) for each vector lane independently, so we shouldn't use the
replacement value from other lanes.
I believe the only operations relevant here are shufflevector (where
all previous bugs were seen) and calls (which might use shuffle-like
intrinsics and would require more careful classification).
Differential Revision: https://reviews.llvm.org/D134348
This is a bugfix patch that resolves the following two bugs in loop interchange:
1. PR57148, which is an assertion error due to loss of LCSSA form after interchange,
as exercised by test1() in pr57148.ll.
2. Use before def for the outermost loop induction variables after interchange,
as exercised by test2() in pr57148.ll.
The fix in this patch is that:
1. In cases where the LCSSA form is not maintained after interchange, we update the IR
to the LCSSA form again.
2. We split the phi nodes in the inner loop header into a separate basic block to avoid
the situation where use of the outer indvar appears before its def after interchange.
Previously we already did this for innermost loops; now we do it for non-innermost
loops (e.g., middle loops) as well.
Reviewed By: bmahjour, Meinersbur, #loopoptwg
Differential Revision: https://reviews.llvm.org/D132055
This patch is to resolve the bug reported and discussed in
https://reviews.llvm.org/D124926#3718761 and https://reviews.llvm.org/D124926#3719876.
The problem is that loop interchange is a loopnest pass under the new pass manager,
but the loop nest may not be constructed correctly by the loop pass manager after
running loop interchange and before running the next pass, which might cause problems
when it continues running the next pass.
The reason that the loop nest is constructed incorrectly is that the outermost
loop might have changed after interchange, and what was the original outermost
loop is not the current outermost loop anymore. Constructing the loop nest based
on the original outermost loop would generate an invalid loop nest.
The fix in this patch is that, in the loop pass manager before running each loopnest
pass, we reconstruct the loop nest based on the current outermost loop, if LPMUpdater
notifies the loop pass manager that the previous loop nest has been invalidated by passes
like loop interchange.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D132199
Instrumentation just ORs the shadow of the inputs.
I assume some result shadow bits could be reset if we went into the specifics of particular checks,
but as-is this is still an improvement over the existing default strict instruction handler, where
every set bit of the input shadow is reported as an error.
Reviewed By: kda
Differential Revision: https://reviews.llvm.org/D134123
`(A * -2**C) + B --> B - (A << C)`
https://alive2.llvm.org/ce/z/A6BWkf
This inverts what Negator was doing before:
D134310 / 0f32a5dea0
Analysis and codegen are generally better without multiply,
so we should favor this form even if we trade add for sub
(because those are generally equivalent cost operations).
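For example (bit width chosen arbitrarily):
```
; before: A * -2**C + B
%m  = mul i32 %a, -8
%r  = add i32 %m, %b
; after: B - (A << C)
%s  = shl i32 %a, 3
%r2 = sub i32 %b, %s
```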
This stops Negator from transforming:
`C1 - shl X, C2 --> mul X, (1<<C2) + C1`
...in the general case. There does not seem to be any analysis
benefit to using mul in IR, and there's definitely downside in
codegen (particularly when the multiply has to be expanded).
If `C1` is 0, then there's a stronger argument that the single
mul is a better canonicalization than negate-of-shl, but we may
want to remove that too.
This was noted as a potential conflict for D133667.
Differential Revision: https://reviews.llvm.org/D134310
Commit de3445e0ef (https://reviews.llvm.org/D132096) made
changes to isVectorPromotionViable basically doing
// Create Vector with size of V, and each element of type Ty
...
uint64_t ElementSize = DL.getTypeStoreSizeInBits(Ty).getFixedSize();
uint64_t VectorSize = DL.getTypeSizeInBits(V).getFixedSize();
...
VectorType *VTy = VectorType::get(Ty, VectorSize / ElementSize, false);
Not quite sure why it uses the TypeStoreSize for the ElementSize,
but the new vector would only match in size with the old vector in
situations when the TypeStoreSize equals the TypeSize for Ty.
Therefore this patch adds a typeSizeEqualsStoreSize check as yet
another condition for allowing the new type as a promotion
candidate.
Without this fix the new @test15 test would fail with an assert
like this:
opt: ../lib/Transforms/Scalar/SROA.cpp:1966:
auto isVectorPromotionViable(llvm::sroa::Partition &,
const llvm::DataLayout &)
::(anonymous class)::operator()(llvm::VectorType *,
llvm::VectorType *) const:
Assertion `DL.getTypeSizeInBits(RHSTy).getFixedSize() ==
DL.getTypeSizeInBits(LHSTy).getFixedSize() &&
"Cannot have vector types of different sizes!"' failed.
...
#8 isVectorPromotionViable(...)::$_10::operator()...
#9 llvm::SROAPass::rewritePartition(...)
#10 llvm::SROAPass::splitAlloca(...)
#11 llvm::SROAPass::runOnAlloca(...)
#12 llvm::SROAPass::runImpl(...)
#13 llvm::SROAPass::run(...)
Reviewed By: MatzeB
Differential Revision: https://reviews.llvm.org/D134032
The type information of the store values can diverge when checking for valid
mask store candidates to eliminate via DSE. This patch checks for equivalence
with respect to size and element count.
Reviewed By: fhahn, rui.zhang
Differential Revision: https://reviews.llvm.org/D132700
These patterns were previously only implemented for i1 type but can be extended for any integer type by also handling zext and sext operands.
Differential Revision: https://reviews.llvm.org/D134142
If one of the operands in a matrix multiplication is negated, we can optimise the expression by moving the negation to whichever of the operands or the result has the fewest elements.
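A sketch of the idea for a 2x2 by 2x1 multiply through the matrix intrinsic (the exact intrinsic mangling and the shapes are illustrative assumptions, not taken from the patch):
```
; before: negate the 2x2 operand (4 elements)
%na = fneg <4 x double> %a
%r1 = call <2 x double> @llvm.matrix.multiply.v2f64.v4f64.v2f64(<4 x double> %na, <2 x double> %b, i32 2, i32 2, i32 1)

; after: negate the 2x1 operand (2 elements) instead, since (-A)*B == A*(-B)
%nb = fneg <2 x double> %b
%r2 = call <2 x double> @llvm.matrix.multiply.v2f64.v4f64.v2f64(<4 x double> %a, <2 x double> %nb, i32 2, i32 2, i32 1)
```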
Reviewed By: spatel, fhahn
Differential Revision: https://reviews.llvm.org/D133300
With the recent addition of new parameter MergeAttributes (D134117),
callers need to specify several default parameters before getting to
specify the new parameter.
This patch reorders the parameters so that callers do not have to
specify as many default parameters.
Differential Revision: https://reviews.llvm.org/D134125
The bug reported in [0] has been fixed.
The issue was that we had not checked whether the global variables that
represent cttz tables were constant.
A new negative test covering this is added in
negative-lower-table-based-cttz.ll.
[0] https://reviews.llvm.org/rGdf868edee561eb973edd85ec9df41c67aa0bff6b
When the IV is only used by the terminating condition (say IV-A) and the loop
has a predictable back-edge count and we have another IV (say IV-B) that is an
affine add recurrence, we will be able to calculate the terminating value of
IV-B in the loop pre-header. This patch attempts to use IV-B in the new
terminating condition and to remove IV-A. It is safe to do so since IV-A is
only used as the terminating condition.
This transformation is suitable to be appended after LSR as it may optimize the
loop into the situation mentioned above. The transformation can reduce the
number of IVs in the loop by one.
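A hand-wavy sketch of the transformation (the IR is illustrative only):
```
; before: %i (IV-A) is only used by the exit test; %p (IV-B) does the real work
loop:
  %i = phi i64 [ 0, %preheader ], [ %i.next, %loop ]
  %p = phi ptr [ %base, %preheader ], [ %p.next, %loop ]
  store i32 0, ptr %p
  %p.next = getelementptr i8, ptr %p, i64 4
  %i.next = add i64 %i, 1
  %ec = icmp eq i64 %i.next, %n
  br i1 %ec, label %exit, label %loop

; after: the terminating value of %p is computed once in the preheader
; and %i is removed
preheader:
  %bytes = shl i64 %n, 2
  %end = getelementptr i8, ptr %base, i64 %bytes
  br label %loop
loop:
  %p = phi ptr [ %base, %preheader ], [ %p.next, %loop ]
  store i32 0, ptr %p
  %p.next = getelementptr i8, ptr %p, i64 4
  %ec = icmp eq ptr %p.next, %end
  br i1 %ec, label %exit, label %loop
```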
A CLI option `lsr-term-fold` is added and disabled by default.
Reviewed By: mcberg2021, craig.topper
Differential Revision: https://reviews.llvm.org/D132443
The compiler does not reorder the gather nodes with reused scalars, it just
does it for the operands of the user nodes. This currently does not affect
the compiler but breaks the internal logic of the SLP graph. In the future, it
is supposed to actually use all nodes instead of just the list of operands,
and this will affect the vectorization result.
Also, added an early check to avoid complex logic in the cost estimation
analysis, which should improve compile time a bit.
MSan has a default handler for unknown instructions which
previously applied to these as well. However, depending on the
mask, not all pointers or the passthru part will be used. This
allows other passes to insert undef into some arguments.
As a result, the default strict instruction handler can produce false reports.
Reviewed By: kda, kstoimenov
Differential Revision: https://reviews.llvm.org/D133678
Epilogue vectorization uses isScalarAfterVectorization to check if
widened versions for inductions need to be generated and bails out in
those cases.
At the moment, there are scenarios where isScalarAfterVectorization
returns true but VPWidenPointerInduction::onlyScalarsGenerated would
return false, causing widening.
This can lead to widened phis with incorrect start values being created
in the epilogue vector body.
This patch addresses the issue by storing the cost-model decision in
VPWidenPointerInductionRecipe and restoring the behavior before 151c144.
This effectively reverts 151c144, but the long-term fix is to properly
support widened inductions during epilogue vectorization.
Fixes #57712.
My understanding is that NoImplicitFloat, despite its name, is
supposed to disable all vectors, not just float vectors.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D134084
This is required because if there is a pure loop-invariant instruction, Loop Rotation
may decide to not clone it and just hoist it instead. If SCEV has previously cached
that it was loop-variant (not being smart enough to prove invariance), we may end
up with inconsistent cache state (which may later trigger false-negative assertion
failures checking that something was invariant).
This is a conservative fix that unconditionally drops the dispositions. We could
drop them only if the hoisting has actually happened, but it would take some time
to understand whether that is safe with all the other things this function does.
Differential Revision: https://reviews.llvm.org/D134167
Reviewed By: fhahn
This bug was found by recent improvement in SCEV verifier. The code in LoopFuse
directly reassigns blocks to be a part of a different loop, which should automatically
invalidate all related cached loop dispositions.
Differential Revision: https://reviews.llvm.org/D134173
Reviewed By: nikic
SimplifyCFG folds
bool foo() {
if (cond1) return false;
if (cond2) return false;
return true;
}
as
bool foo() {
if (cond1 | cond2) return false
return true;
}
'cond2' is called a 'bonus inst' in branch folding because it introduces overhead:
the original CFG could exit early, but the folded CFG always executes
it. SimplifyCFG calculates the cost of the 'bonus insts' of folding a BB into
its predecessor BB which shares the destination. If it is below bonus-inst-threshold,
SimplifyCFG will fold that BB into its predecessor and cond2 will always be executed.
When SimplifyCFG calculates the cost of 'bonus insts', it only considers 'bonus' insts
in the current BB. This causes issues for unrolled loops
which share destinations, e.g.
bool foo(int *a) {
for (int i = 0; i < 32; i++)
if (a[i] > 0) return false;
return true;
}
After unrolling, it becomes
bool foo(int *a) {
if(a[0]>0) return false
if(a[1]>0) return false;
//...
if(a[31]>0) return false;
return true;
}
SimplifyCFG will merge each BB with its predecessor BB,
ending up with 32 'bonus insts' that are always executed, which
is much slower than the original CFG.
The root cause is that SimplifyCFG does not consider the
accumulated cost of 'bonus insts' which are folded from
different BB's.
This patch fixes that by introducing a ValueMap to track
costs of 'bonus insts' coming from different BB's into
the same BB, and cuts off if the accumulated cost
exceeds a threshold.
Reviewed by: Artem Belevich, Florian Hahn, Nikita Popov, Matt Arsenault
Differential Revision: https://reviews.llvm.org/D132408
This was originally part of D133788. There are no visible
regressions. All of the diffs show a large unsigned constant
becoming a small negative constant. This should be better
for analysis (and slightly less compile-time) and codegen.