Constfold constrained variants of operations fadd, fsub, fmul, fdiv,
frem, fma and fmuladd.
The change also sets up some means to support for removal of unused
constrained intrinsics. They are declared as accessing memory to model
interaction with floating point environment, so they were not removed,
as they have side effect. Now constrained intrinsics that have
"fpexcept.ignore" as exception behavior are removed if they have no uses.
As for intrinsics that have exception behavior other than "fpexcept.ignore",
they can be removed if it is known that they do not raise floating point
exceptions. It happens when doing constant folding, attributes of such
intrinsic are changed so that the intrinsic is not claimed as accessing
memory.
Differential Revision: https://reviews.llvm.org/D102673
A simplification callback can mean that the IR value is modified beyond
the apparent IR semantics. That is, a `i1 true` could be replaced by an
`i1 false` based on high-level domain-specific information. If a user
provides a simplification callback we will not look at the IR but
instead give up if the callback returns a nullptr.
SPMDization D102307 detects incompatible OpenMP runtime calls to abort converting a target region to SPMD mode. Calls to memory allocation/de-allocation routines kmpc_alloc_shared, kmpc_free_shared are incompatible unless they are removed by AAHeapToStack/AAHeapToShared analysis. This patch extends SPMDization detection to include AAHeapToStack/AAHeapToShared analysis results for enlarging the scope of possible SPMDized regions detected.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D105634
Fixing a typo in SampleContextTracker to use debug name when debug linkage name is no present. This should only affect C programs.
Saw 0.6% perf win on Cinder which is mostly C code.
Reviewed By: wenlei, wmi
Differential Revision: https://reviews.llvm.org/D106599
This function is called when some predecessor of an empty return block
ends with a conditional branch, with both successors being empty ret blocks.
Now, because of the way SimplifyCFG works, it might happen to simplify
one of the blocks in a way that makes a conditional branch
into an unconditional one, since it's destinations are now identical,
but it might not have actually simplified said conditional branch
into an unconditional one yet.
So, we have to check that ourselves first,
especially now that SimplifyCFG aggressively tail-merges
all ret and resume blocks.
Even if it was an unconditional branch already,
`SimplifyCFGOpt::simplifyReturn()` doesn't call `FoldReturnIntoUncondBranch()`
by default.
The logical (select) form of and/or will now be a source of problems.
We don't really account for it's inverted form, yet it exists,
and presumably we should treat it just like non-inverted form:
https://alive2.llvm.org/ce/z/BU9AXkhttps://bugs.llvm.org/show_bug.cgi?id=51149 reports a reportedly-serious
perf regression that will hopefully be mitigated by this.
We should only add the fake lowering entry for the matrix remark if the
transpose is not lowered on its own. `MapVector::insert` is used to insert
the entry during proper lowering which does not overwrite the fake entry in
the map.
We actually had test coverage for this but the reference output code was
wrong; it was storing undef rather than the transposed column.
Also add an assert that would have caught this.
Differential Revision: https://reviews.llvm.org/D106457
D101977 added `BooleanStateWithPtrSetVector` to store pointers to a set meanwhile
tracking boolean state. One of the limitation is that it can only store pointer.
We might want it to store other types of values, such as integer for parallel
level. This patch generalizes the idea and create `BooleanStateWithSetVector`.
`BooleanStateWithPtrSetVector` therefore becomes a type alias of `BooleanStateWithSetVector`.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106149
This patch avoids computing discounts for predicated instructions when the
VF is scalable.
There is no support for vectorization of loops with division because the
vectorizer cannot guarantee that zero divisions will not happen.
This loop now does not use VF scalable
```
for (long long i = 0; i < n; i++)
if (cond[i])
a[i] /= b[i];
```
Differential Revision: https://reviews.llvm.org/D101916
The purpose of patch is to learn Loop idiom recognition pass how to recognize simple memmove patterns
in similar way like GCC: https://godbolt.org/z/fh95e83od
LoopIdiomRecognize already has machinery for memset and memcpy recognition, patch tries to extend exisiting capabilities with minimal effort.
Differential Revision: https://reviews.llvm.org/D104464
As noticed on D106352, after we've folded "(select C, (gep Ptr, Idx), Ptr) -> (gep Ptr, (select C, Idx, 0))" if the inner Ptr was also a (now one use) gep we could then merge the geps, using the sum of the indices instead.
I've limited this to basic 2-op geps - a more general case further down InstCombinerImpl.visitGetElementPtrInst doesn't have the one-use limitation but only creates the add if it can be created via SimplifyAddInst.
https://alive2.llvm.org/ce/z/f8pLfD (Thanks Roman!)
Differential Revision: https://reviews.llvm.org/D106450
If we remove a non-intrinsic instruction we need to tell the (old) call
graph about it. This caused problems with some features down the line as
they allowed to removed calls more aggressively.
If we have a recursive function we could create multiple instantiations
of an SSA value, one per recursive invocation of the function. This is a
problem as we use SSA value equality in various places. The basic idea
follows from this test:
```
static int r(int c, int *a) {
int X;
return c ? r(false, &X) : a == &X;
}
int test(int c) {
return r(c, undef);
}
```
If we look through the argument `a` we will end up with `X`. Using SSA
value equality we will fold `a == &X` to true and return true even
though it should have been false because `a` and `&X` are from different
instantiations of the function.
Various tests for this have been placed in value-simplify-instances.ll
and this commit fixes them all by avoiding to produce simplified values
that could be non-unique at runtime. Thus, the result of a simplify
value call will always be unique at runtime or the original value, both
do not allow to accidentally compare two instances of a value with each
other and conclude they are equal statically (pointer equivalence) while
they are unequal at runtime.
A call that is analyzed in an optimization needs to be verified against
the name and type of the runtime function to avoid that we look at
arguments that do not exist (anymore). This can happen if the signature
was rewritten. Since we will not set RFI.Declaration if the type doesn't
match we can use it (if it's not null) to determine if the signature is
as expected.
Differential Revision: https://reviews.llvm.org/D106341
This patch strips the NoInline attribute from known OpenMP runtime functions.
This is done so that we can denote certain runtime functions as NoInline to
ensure their call sites are intact so they can be checked by OpenMPOpt. We
don't wan't this noinline attribute to remain for any functions after OpenMPOpt
has been run however.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106482
This patch adds the ability to fold `__kmpc_is_generic_main_thread_id` if we
know for a fact that it is executed by the initial thread using
AAExecutionDomain. This combined with folding `__kmpc_is_spmd_exec_mode` will
allow us to fully fold `__kmpc_is_generic_main_thread`.
Depends on D106438 D106437
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106439
Function internalization can sometimes occur in situations where we want to
keep the call sites intact. This patch adds an option to disable function
internalization and prevents the device runtime from being internalized while
creating the bitcode library.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106438
Qualified kernels can be transformed from generic-mode to SPMD mode using an
optimization in OpenMPOpt. This patch introduces a new execution mode to
indicate kernels that have been transformed from generic-mode to SPMD-mode.
These kernels have SPMD-mode execution, but need generic-mode semantics for
scheduling the blocks and threads. Without this far too few blocks will be
scheduled for a generic region as SPMD mode expects the trip count to be
divided by the number of threads.
Reviewed By: ggeorgakoudis
Differential Revision: https://reviews.llvm.org/D106460
This removes an abuse of ELF linker behaviors while keeping Mach-O/COFF linker
behaviors unchanged.
ELF: when module_ctor is in a comdat, this patch removes reliance on a linker
abuse (an SHT_INIT_ARRAY in a section group retains the whole group) by using
SHF_GNU_RETAIN. No linker behavior difference when module_ctor is not in a comdat.
Mach-O: module_ctor gets `N_NO_DEAD_STRIP`. No linker behavior difference
because module_ctor is already referenced by a `S_MOD_INIT_FUNC_POINTERS`
section (GC root).
PE/COFF: no-op. SanitizerCoverage already appends module_ctor to `llvm.used`.
Other sanitizers: llvm.used for local linkage is not implemented in
`TargetLoweringObjectFileCOFF::emitLinkerDirectives` (once implemented or
switched to a non-local linkage, COFF can use module_ctor in comdat (i.e.
generalize ELF-specific rL301586)).
There is no object file size difference.
Reviewed By: vitalybuka
Differential Revision: https://reviews.llvm.org/D106246
We need to make sure that the value types are the same. Otherwise
we both may not have the necessary dereferenceability implication,
nor can we directly form the desired select pattern.
Without opaque pointers this is enforced implicitly through the
pointer comparison.
Manifesting AbstractAttributes may add new BBs in the IR. This patch provides an interface to register those BBs in the Attributor so that those BBs and containing instructions are not deleted as dead.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106383
There is no need for a non-const argument interface and the const argument modification covers existing and upcoming use cases.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106418
In weird cases, the inliner will inline internal recursive functions,
sometimes causing them to have no more uses, in which case the
inliner will mark the function to be deleted. The function is
actually deleted after the call to
updateCGAndAnalysisManagerForCGSCCPass(). In
updateCGAndAnalysisManagerForCGSCCPass(), UR.UpdatedC may be set to
the SCC containing the function to be deleted. Then the inliner calls
CG.removeDeadFunction() which can cause that SCC to be deleted, even
though it's still stored in UR.UpdatedC.
We could potentially check in the wrappers/pass managers if UR.UpdatedC
is in UR.InvalidatedSCCs before doing anything with it, but it's safer
to do this as close to possible to the call to CG.removeDeadFunction()
to avoid issues with allocating a new SCC in the same address as
the deleted one.
It's hard to find a small test case since we need to have recursive
internal functions be reachable from non-internal functions, yet they
need to become non-recursive and not referenced by other functions when
inlined.
Similar to https://reviews.llvm.org/D106306.
Fixes PR50788.
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D106405
Make getLatchCmpInst non-static and use it in LoopFlatten as a more
robust way of identifying the compare.
Differential Revision: https://reviews.llvm.org/D106256
In the textual format, `noduplicates` means no COMDAT/section group
deduplication is performed. Therefore, if both sets of sections are retained, and
they happen to define strong external symbols with the same names,
there will be a duplicate definition linker error.
In PE/COFF, the selection kind lowers to `IMAGE_COMDAT_SELECT_NODUPLICATES`.
The name describes the corollary instead of the immediate semantics. The name
can cause confusion to other binary formats (ELF, wasm) which have implemented/
want to implement the "no deduplication" selection kind. Rename it to be clearer.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D106319
When adding noalias/alias.scope metadata, we analyze the instructions
of the original callee, and then place metadata on the corresponding
inlined instructions in the caller as provided by VMap. However, this
assumes that this actually a clone of the instruction, rather than
the result of simplification. If simplification occurred, the
instruction that VMap points to may not have any relationship as far
as ModRef behavior is concerned.
Fix this by tracking simplified instructions during cloning and then
only processing instructions that have not been simplified. This is
done with an additional map form original to cloned instruction,
into which we only insert if no simplification is performed. The
mapping in VMap can then be compared to this map. If they're the
same, the instruction hasn't been simplified. (I originally wanted
to only track a set of simplified instructions, but that wouldn't
work if the instruction only gets simplified afterwards, e.g. based
on rewritten phis.)
Fixes https://bugs.llvm.org/show_bug.cgi?id=50589.
Differential Revision: https://reviews.llvm.org/D106242
Create an internal alias with the original name for static functions
that are renamed in promoteInternals to avoid breaking inline
assembly references to them. This version uses module inline assembly
to avoid issues with LowerTypeTestsModule.
Relands commmit 8e3b5cb39e with arch
specific tests fixed.
Link: https://github.com/ClangBuiltLinux/linux/issues/1354
Reviewed By: nickdesaulniers, pcc
Differential Revision: https://reviews.llvm.org/D104058
SPMDization in D102307 does not change the RequiresFullRuntime argument of kmpc_target_init/deinit calls. However, the constraints of SPMDization detection for converting a target region to SPMD mode should guarantee that the region does not require full runtime support. Hence, this patch sets RequiresFullRuntime to false for improved execution performance.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D105556
Currently the Instruction cost of getReductionPatternCost returns an
Invalid cost to specify "did not find the pattern". This changes that to
return an Optional with None specifying not found, allowing Invalid to
mean an infinite cost as is used elsewhere.
Differential Revision: https://reviews.llvm.org/D106140
This patch removes the assertion when VF is scalable and replaces
getKnownMinValue() by getFixedValue(), so it still guards the code against
scalable vector types.
The assertions were used to guarantee that getknownMinValue were not used for
scalable vectors.
Differential Revision: https://reviews.llvm.org/D106359
This patch adds a VPFirstOrderRecurrencePHIRecipe, to further untangle
VPWidenPHIRecipe into distinct recipes for distinct use cases/lowering.
See D104989 for a new recipe for reduction phis.
This patch also introduces a new `FirstOrderRecurrenceSplice`
VPInstruction opcode, which is used to make the forming of the vector
recurrence value explicit in VPlan. This more accurately models def-uses
in VPlan and also simplifies code-generation. Now, the vector recurrence
values are created at the right place during VPlan-codegeneration,
rather than during post-VPlan fixups.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D105008
AAMemoryBehaviorFloating used a custom use tracking mechanism even
though checkForAllUses exists and is already more powerful. Further,
AAMemoryBehaviorFloating uses AANoCapture to guarantee that there are no
aliases and following the uses is sufficient. This is an OK assumption
if checkForAllUses is used but custom tracking is easily out of sync
with AANoCapture and problems follow.
As with other patches before, the simplification callback interface
requires us to go through the Attributor::getAssumedSimplified API first
before we recurs.
It is unclear if the problem can be explicitly tested with our current
infrastructure.
We first simplify the operands of a compare and then reason on the
simplified versions, e.g., with AANonNull.
This does improve the simplification capabilities but also fixes a
potential problem that has not yet been observed by simplifying the
operands first.
A byval argument is a different value in the caller and callee, we
cannot propagate the information as part of AAValueSimplify. Users that
want to deal with byval arguments need to specifically perform the
argument -> call site step. We do not do this for now.
This patch introduces AAPointerInfo which tracks the uses of a pointer
and places them in "bins" based on their offset from the base and access
size.
As with other AAs, any pointer can be tracked but it is up to the user
to make sense of the results. The user in this patch is AAValueSimplify
and AAPotentialValues which both utilize AAPointerInfo to determine the
value of a load. For now, this is restricted to loads of allocas and
internal globals. Through the use of AAPointerInfo and the "bins" we can
track struct members separately. The users also know that storing only
zeros (at unknown indices) will result in loading only 0 (from unknown
indices). Other than that, the users are flow and context insensitive
(for now).
To deal with the "bins" more easily, AAPointerInfo provides a
forallInterfearingAccesses that applies a callback on all accesses
that might interfere with a given load or store.
Differential Revision: https://reviews.llvm.org/D104432
As a first step to simplify loads we only handle `null` and `undef`
underlying objects, as well as objects that have the load as a single user.
Loads of those values can be replaced by the initializer, if any.
Proper reasoning is introduced in a follow up patch
Differential Revision: https://reviews.llvm.org/D103862
We did not properly use SPMDCompatibilityTracker in various places.
This patch makes sure we look at the validity properly and also fix
the state if we can.
Differential Revision: https://reviews.llvm.org/D106085
We need the compiler generated variable to override the weak symbol of
the same name inside the profile runtime, but using LinkOnceODRLinkage
results in weak symbol being emitted in which case the symbol selected
by the linker is going to depend on the order of inputs which can be
fragile.
This change replaces the use of weak definition inside the runtime with
a weak alias. We place the compiler generated symbol inside a COMDAT
group so dead definition can be garbage collected by the linker.
We also disable the use of runtime counter relocation on Darwin since
Mach-O doesn't support weak external references, but Darwin already uses
a different continous mode that relies on overmapping so runtime counter
relocation isn't needed there.
Differential Revision: https://reviews.llvm.org/D105176
The patch does not depend on the availability of the library functions for
memcpy/memset as it operates on LLVM intrinsics. The optimizations are useful
on the targets that have these functions disabled (e.g. NVPTX & AMDGPU).
Differential Revision: https://reviews.llvm.org/D104801
This patch adds a new pass called LNICM which is a LoopNest version of LICM and a test case to show how LNICM works.
Basically, LNICM only hoists invariants out of loop nest (not a loop) to keep/make perfect loop nest. This enables later optimizations that require perfect loop nest.
Reviewed By: Whitney
Differential Revision: https://reviews.llvm.org/D104180
The incoming values for PHI nodes may come from unreachable BasicBlocks,
need to handle this case.
Differential Revision: https://reviews.llvm.org/D106264
This fixes the lower and upper bound calculation of a
RuntimeCheckingPtrGroup when it has more than one loop
invariant pointers. Resolves PR50686.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D104148
Replace code which identifies induction phi with helper function
getInductionVariable to improve robustness.
Differential Revision: https://reviews.llvm.org/D106045
The inttoptr/ptrtoint roundtrip optimization is not always correct.
We are working towards removing this optimization and adding support
to specific cases where this optimization works. This patch is the
first one on this line.
Consider the example:
%i = ptrtoint i8* %X to i64
%p = inttoptr i64 %i to i16*
%cmp = icmp eq i8* %load, %p
In this specific case, the inttoptr/ptrtoint optimization is correct
as it only compares the pointer values. In this patch, we fold
inttoptr/ptrtoint to a bitcast (if src and dest types are different).
Differential Revision: https://reviews.llvm.org/D105088
This pattern is visible in unrolled and vectorized loops.
Although the backend seems to be able to reassociate to
ideal form in the examples I looked at, we might as well
do that in IR for efficiency.
This patch fixed two issues found when folding `__kmpc_is_spmd_exec_mode`:
1. When the reaching kernels are empty, it should not fold to generic mode.
2. When creating AA for the caller when updating information, the dependency
should be required.
Reviewed By: ye-luo
Differential Revision: https://reviews.llvm.org/D106209
Iterative-BFI produces better count quality and performance when evaluated on internal benchmarks. Turning it on by default now for CSSPGO. We can consider turn it on by default for AutoFDO as well in the future.
Differential Revision: https://reviews.llvm.org/D106202
Create an internal alias with the original name for static functions
that are renamed in promoteInternals to avoid breaking inline
assembly references to them. This version uses module inline assembly
to avoid issues with LowerTypeTestsModule.
Link: https://github.com/ClangBuiltLinux/linux/issues/1354
Reviewed By: nickdesaulniers, pcc
Differential Revision: https://reviews.llvm.org/D104058
Part of D105020. Also, fixed FIXMEs that need to use wider vector type
when trying to calculate the cost of reused scalars. This may cause
regressions unless D100486 is landed to improve the cost estimations
for long vectors shuffling.
Differential Revision: https://reviews.llvm.org/D106060
The cost of the InsertSubvector shuffle kind cost is not complete and
may end up with just extracts + inserts costs in many cases. Added
a workaround to represent it as a generic PermuteSingleSrc, which is
still pessimistic but better than InsertSubvector.
Differential Revision: https://reviews.llvm.org/D105827
This patch adds unique idenfitiers to the existing OpenMP remarks. This makes
it easier to identify the corresponding documentation for each remark that will
be hosted in the OpenMP webpage.
Depends on D105898
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D105939
This patch rewrites and reworks a few of the existing remarks to make the mmore
concise and consistent prior to writing the documentation for them.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D105898
We already know that we need to check whether lcssa
phis are supported in inner loop exit block or in
outer loop exit block, and we have logic to check
them already. Presumably the inner loop latch does
not have lcssa phis and there is no code that deals
with lcssa phis in the inner loop latch. However,
that assumption is not true, when we have loops
with more than two-level nesting. This patch adds
checks for lcssa phis in the inner latch.
Reviewed By: Whitney
Differential Revision: https://reviews.llvm.org/D102300
This patch returns an Invalid cost from getInstructionCost() for alloca
instructions if the VF is scalable, as otherwise loops which contain
these instructions will crash when attempting to scalarize the alloca.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D105824
The original patch was:
https://reviews.llvm.org/D105806
There were some issues with undeterministic behaviour of the sorting
function, which led to scalable-call.ll passing and/or failing. This
patch fixes the issue by numbering all instructions in the array first,
and using that number as the order, which should provide a consistent
ordering.
This reverts commit a607f64118.
This patch addresses assertion failure in case when the only found formula for LSR
is `1*reg => reg` which was supposed to be an impossible situation, however there
is a test that shows it is possible.
In this case, we can use scale register with scale of 1 as the missing base register.
Reviewed By: huihuiz, reames
Differential Revision: https://reviews.llvm.org/D105009
A common use of `ChangeStatus` is as follows:
```
ChangeStatus Changed = ChangeStatus::UNCHANGED;
Changed |= foo();
```
where `foo` returns `ChangeStatus` as well. Currently `ChangeStatus` doesn't
support compound assignment, we have to write as
```
Changed = Changed | foo();
```
which is not that convenient.
This patch add the support for compound assignment for `ChangeStatus`. Compound
assignment is usually implemented as a member function, and binary arithmetic
operator is therefore implemented using compound assignment. However, unlike
regular C++ class, enum class doesn't support member functions. As a result, they
can only be implemented in the way shown in the patch.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106109
In the device runtime there are many function calls to `__kmpc_is_spmd_exec_mode`
to query the execution mode of current kernels. In many cases, user programs
only contain target region executing in one mode. As a consequence, those runtime
function calls will only return one value. If we can get rid of these function
calls during compliation, it can potentially improve performance.
In this patch, we use `AAKernelInfo` to analyze kernel execution. Basically, for
each kernel (device) function `F`, we collect all kernel entries `K` that can
reach `F`. A new AA, `AAFoldRuntimeCall`, is created for each call site. In each
iteration, it will check all reaching kernel entries, and update the folded value
accordingly.
In the future we will support more function.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D105787
This bug was introduced with D105730 / 25ee55c0ba .
If we are not converting all of the operations of a reduction
into a vector op, we need to preserve the existing select form
of the remaining ops. Otherwise, we are potentially leaking
poison where it did not in the original code.
Alive2 agrees that the version that freezes some inputs
and then falls back to scalar is correct:
https://alive2.llvm.org/ce/z/erF4K2
This implements the elementtype attribute specified in D105407. It
just adds the attribute and the specified verifier rules, but
doesn't yet make use of it anywhere.
Differential Revision: https://reviews.llvm.org/D106008
Fixes some regressions with -fstrict-vtable-pointers in llvm-test-suite.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D106017
This change enables vectorization of multiple exit loops when the exit count is statically computable. That requirement - shared with the rest of LV - in turn requires each exit to be analyzeable and to dominate the latch.
The majority of work to support this was done in a set of previous patches. In particular,, 72314466 avoids having multiple edges from the middle block to the exits, and 4b33b2387 which added support for non-latch single exit and multiple exits with a single exiting block. As a result, this change is basically just removing a bailout and adjusting some tests now that the prerequisite work is done and has stuck in tree for a bit.
Differential Revision: https://reviews.llvm.org/D105817
`SinkCommonCodeFromPredecessors()` doesn't itself ensure that duplicate PHI nodes aren't created.
I suppose, we could teach it to do that on-the-fly (& account for the already-existing PHI nodes,
& adjust costmodel), the diff will be bigger than this.
The alternative is to schedule a new EarlyCSE pass invocation somewhere later in the pipeline.
Clearly, we don't have any EarlyCSE runs in module optimization passline, so this pattern isn't cleaned up...
That would perhaps better, but it will again have some compile time impact.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D106010
The sort function for emitting an OptRemark was not deterministic,
which caused scalable-call.ll to fail on some buildbots. This patch
fixes that.
This patch also fixes an issue where `Instruction::comesBefore()`
is called when two Instructions are in different basic blocks,
which would otherwise cause an assertion failure.
SystemZ ABI requires zero-extending function parameters to 64-bit. The
compiler is free to optimize the code around this assumption, e.g.
failing to zero-extend __tsan_atomic32_load()'s morder may cause
crashes in to_mo() switch table lookup.
Fix by adding zeroext attributes to TSan's FunctionCallees, similar to
how it was done in commit 3bc439bdff ("[MSan] Add instrumentation for
SystemZ"). This is a no-op on arches that don't need it.
Reviewed By: dvyukov
Differential Revision: https://reviews.llvm.org/D105629
This patch make coroutine passes run by default in LLVM pipeline. Now
the clang and opt could handle IR inputs containing coroutine intrinsics
without special options.
It should be fine. On the one hand, the coroutine passes seems to be stable
since there are already many projects using coroutine feature.
On the other hand, the coroutine passes should do nothing for IR who doesn't
contain coroutine intrinsic.
Test Plan: check-llvm
Reviewed by: lxfind, aeubanks
Differential Revision: https://reviews.llvm.org/D105877
This patch adds a feature to AACallEdges AbstractAttribute that allows
users to ask if there is a unknown callee that isn't a inline assembly.
This feature is needed by some of it's users.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D105992
The bug was that evaluateBitcastFromPtr attempts a narrowing to a struct's 0th
element of a store that covers other elements. While this is okay on the load
side, applying it to stores causes us to miss the writes to the additionally
covered elements.
rdar://79503568
Differential revision: https://reviews.llvm.org/D105838
This set of folds was added recently with:
c7b658aeb50c400e895340b752d28d
...and I noted that this wasn't likely to fire in code derived
from C/C++ source because of nsw in particular. But I didn't
notice that I had placed the code above the no-wrap block
of transforms.
This is likely the cause of regressions noted from the previous
commit because -- as shown in the test diffs -- we may have
transformed into a compare with an arbitrary constant rather
than a simpler signbit test.
This patch emits remarks for instructions that have invalid costs for
a given set of vectorization factors. Some example output:
t.c:4:19: remark: Instruction with invalid costs prevented vectorization at VF=(vscale x 1): load
dst[i] = sinf(src[i]);
^
t.c:4:14: remark: Instruction with invalid costs prevented vectorization at VF=(vscale x 1, vscale x 2, vscale x 4): call to llvm.sin.f32
dst[i] = sinf(src[i]);
^
t.c:4:12: remark: Instruction with invalid costs prevented vectorization at VF=(vscale x 1): store
dst[i] = sinf(src[i]);
^
Reviewed By: fhahn, kmclaughlin
Differential Revision: https://reviews.llvm.org/D105806
The cost of the InsertSubvector shuffle kind cost is not complete and
may end up with just extracts + inserts costs in many cases. Added
a workaround to represent it as a generic PermuteSingleSrc, which is
still pessimistic but better than InsertSubvector.
Differential Revision: https://reviews.llvm.org/D105827
This has been a work-in-progress for a long time...we finally have all of
the pieces in place to handle vectorization of compare code as shown in:
https://llvm.org/PR41312
To do this (see PhaseOrdering tests), we converted SimplifyCFG and
InstCombine to the poison-safe (select) forms of the logic ops, so now we
need to have SLP recognize those patterns and insert a freeze op to make
a safe reduction:
https://alive2.llvm.org/ce/z/NH54Ah
We get the minimal patterns with this patch, but the PhaseOrdering tests
show that we still need adjustments to get the ideal IR in some or all of
the motivating cases.
Differential Revision: https://reviews.llvm.org/D105730
As discussed on PR50183, we already fold to prefer 'select-of-idx' vs 'select-of-gep':
define <4 x i32>* @select0a(<4 x i32>* %a0, i64 %a1, i1 %a2, i64 %a3) {
%gep0 = getelementptr inbounds <4 x i32>, <4 x i32>* %a0, i64 %a1
%gep1 = getelementptr inbounds <4 x i32>, <4 x i32>* %a0, i64 %a3
%sel = select i1 %a2, <4 x i32>* %gep0, <4 x i32>* %gep1
ret <4 x i32>* %sel
}
-->
define <4 x i32>* @select1a(<4 x i32>* %a0, i64 %a1, i1 %a2, i64 %a3) {
%sel = select i1 %a2, i64 %a1, i64 %a3
%gep = getelementptr inbounds <4 x i32>, <4 x i32>* %a0, i64 %sel
ret <4 x i32>* %gep
}
This patch adds basic handling for the 'fallthrough' cases where the gep idx == 0 has been folded away to the base address:
define <4 x i32>* @select0(<4 x i32>* %a0, i64 %a1, i1 %a2) {
%gep = getelementptr inbounds <4 x i32>, <4 x i32>* %a0, i64 %a1
%sel = select i1 %a2, <4 x i32>* %a0, <4 x i32>* %gep
ret <4 x i32>* %sel
}
-->
define <4 x i32>* @select1(<4 x i32>* %a0, i64 %a1, i1 %a2) {
%sel = select i1 %a2, i64 0, i64 %a1
%gep = getelementptr inbounds <4 x i32>, <4 x i32>* %a0, i64 %sel
ret <4 x i32>* %gep
}
Reapplied with a fix for the bpf "-bpf-disable-avoid-speculation" tests
Differential Revision: https://reviews.llvm.org/D105901
This patch fixes code that incorrectly handled dbg.values with duplicate
location operands, i.e. !DIArgList(i32 %a, i32 %a). The errors in
question were caused by either applying an update to dbg.value multiple
times when the update is only valid once, or by updating the
DIExpression for only the first instance of a value that appears
multiple times.
Differential Revision: https://reviews.llvm.org/D105831
In the device runtime there are many function calls to `__kmpc_is_spmd_exec_mode`
to query the execution mode of current kernels. In many cases, user programs
only contain target region executing in one mode. As a consequence, those runtime
function calls will only return one value. If we can get rid of these function
calls during compliation, it can potentially improve performance.
In this patch, we use `AAKernelInfo` to analyze kernel execution. Basically, for
each kernel (device) function `F`, we collect all kernel entries `K` that can
reach `F`. A new AA, `AAFoldRuntimeCall`, is created for each call site. In each
iteration, it will check all reaching kernel entries, and update the folded value
accordingly.
In the future we will support more function.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D105787
To help with debugging non-trivial unswitching issues.
Don't care about the legacy pass, nobody is using it.
If a pass's string params are empty (e.g. "simple-loop-unswitch"), don't
default to the empty constructor for the pass params. We should still
let the parser take care of it in case the parser has its own defaults.
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D105933
Make sure getMinusSCEV() didn't return a pointer. The following check
would never succeed if it was a pointer, anyway, but calling
getMulExpr() on a pointer SCEV now asserts.
Handle the missing fold reported in PR50816, which is a variant of the existing ashr(sub_nsw(X,Y),bw-1) --> sext(icmp_sgt(X,Y)) fold.
We also handle the lshr(or(neg(x),x),bw-1) --> zext(icmp_ne(x,0)) equivalent - https://alive2.llvm.org/ce/z/SnZmSj
We still allow multi uses of the neg(x) - as this is likely to let us further simplify other uses of the neg - but not multi uses of the or() which would increase instruction count.
Differential Revision: https://reviews.llvm.org/D105764
Just like intrinsics are not tracked for IFI.InlinedCalls, they should not be tracked for IFI.InlinedCallSites.
In the current top-of-tree this change is a NFC, but the full restrict patches (D68484) potentially trigger an read-after-free
if intrinsics are also added to the InlindeCallSites, due to a late optimization potentially removing some of the inlined intrinsics.
Also see https://lists.llvm.org/pipermail/llvm-dev/2021-July/151722.html for a discussion about the problem.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D105805
This patch fixes the problem of SimplifyBranchOnICmpChain that occurs
when extra values are Undef or poison.
Suppose the %mode is 51 and the %Cond is poison, and let's look at the
case below.
```
%A = icmp ne i32 %mode, 0
%B = icmp ne i32 %mode, 51
%C = select i1 %A, i1 %B, i1 false
%D = select i1 %C, i1 %Cond, i1 false
br i1 %D, label %T, label %F
=>
br i1 %Cond, label %switch.early.test, label %F
switch.early.test:
switch i32 %mode, label %T [
i32 51, label %F
i32 0, label %F
]
```
incorrectness: https://alive2.llvm.org/ce/z/BWScX
Code before transformation will not raise UB because %C and %D is false,
and it will not use %Cond. But after transformation, %Cond is being used
immediately, and it will raise UB.
This problem can be solved by adding freeze instruction.
correctness: https://alive2.llvm.org/ce/z/x9x4oY
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D104569
Continuing from D105763, this allows placing certain properties
about attributes in the TableGen definition. In particular, we
store whether an attribute applies to fn/param/ret (or a combination
thereof). This information is used by the Verifier, as well as the
ForceFunctionAttrs pass. I also plan to use this in LLParser,
which also duplicates info on which attributes are valid where.
This keeps metadata about attributes in one place, and makes it
more likely that it stays in sync, rather than in various
functions spread across the codebase.
Differential Revision: https://reviews.llvm.org/D105780
This is now the same as isIntAttrKind(), so use that instead, as
it does not require manual maintenance. The naming is also more
accurate in that both int and type attributes have an argument,
but this method was only targeting int attributes.
I initially wanted to tighten the AttrBuilder assertion, but we
have some in-tree uses that would violate it.
This is the pattern from the description of:
https://llvm.org/PR50816
There might be a way to generalize this to a smaller or more
generic pattern, but I have not found it yet.
https://alive2.llvm.org/ce/z/ShzJoF
define i1 @src(i8 %x) {
%add = add i8 %x, -1
%xor = xor i8 %x, -1
%and = and i8 %add, %xor
%r = icmp slt i8 %and, 0
ret i1 %r
}
define i1 @tgt(i8 %x) {
%r = icmp eq i8 %x, 0
ret i1 %r
}
This new test demonstrates a case where a base ptr is generated
twice for the same value: the first one is generated while
the gc.get.pointer.base() is inlined, the second is generated
for the statepoint. This happens because the methods
inlineGetBaseAndOffset() and insertParsePoints() do not share
their defining value cache used by the findBasePointer() method.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D103240
The patch templatize PriorityInlinerOrder so that it can accept any type priority metric.
Reviewed By: kazu
Differential Revision: https://reviews.llvm.org/D104972
As with other Attributor interfaces we often want to know if assumed
information was used to answer a query. This is important if only
known information is allowed or if known information can lead to an
early fixpoint. The users have been adjusted but none of them utilizes
the new information yet.
The const version of VPValue::getVPValue still had a default value for
the value index. Remove the default value and use getVPSingleValue
instead, which is the proper function.
This patch adds support for hoisting the division and maybe the
remainder for control flow graphs like this.
```
PredBB
| \
| Rem
| /
Div
```
If we have DivRem we'll hoist both to PredBB. If not we'll just
hoist Div and expand Rem using the Div.
This improves our codegen for something like this
```
__uint128_t udivmodti4(__uint128_t dividend, __uint128_t divisor, __uint128_t *remainder) {
if (remainder != 0)
*remainder = dividend % divisor;
return dividend / divisor;
}
```
Reviewed By: spatel, lebedev.ri
Differential Revision: https://reviews.llvm.org/D87555
In D104569, Freeze was inserted just before br to solve the `branching on undef` miscompilation problem.
But value analysis was being disturbed by added freeze.
```
v = load ptr
cond = freeze(icmp (and v, const), const')
br cond, ...
```
The case in which value analysis disturbed is as above.
By changing freeze to add immediately after load, value analysis will be successful again.
```
v = load ptr
freeze(icmp (and v, const), const')
=>
v = load ptr
v' = freeze v
icmp (and v', const), const'
```
In this patch, I propose the above optimization.
With this patch, the poison will not spread as the freeze is performed early.
Reviewed By: nikic, lebedev.ri
Differential Revision: https://reviews.llvm.org/D105392
AllocationInfo and DeallocationInfo objects themselves are allocated
with the Attributor bump allocator and do not need to be deallocated.
That said, the sets in AllocationInfo and DeallocationInfo need to be
destroyed to avoid memory leaks.
In the spirit of TRegions [0], this patch analyzes a kernel and tracks
if it can be executed in SPMD-mode. If so, we flip the arguments of
the __kmpc_target_init and deinit call to enable the mode. We also
update the `<kernel>_exec_mode` flag to indicate to the runtime we
changed the mode to SPMD.
The code analysis is done interprocedurally by extending the
AAKernelInfo abstract attribute to track SPMD compatibility as well.
[0] https://link.springer.com/chapter/10.1007/978-3-030-28596-8_11
Differential Revision: https://reviews.llvm.org/D102307
In the spirit of TRegions [0], this patch creates a custom state
machine for a generic target region based on the potentially called
parallel regions.
The code analysis is done interprocedurally via an abstract attribute
(AAKernelInfo). All outermost parallel regions are collected and we
check if there might be unknown outermost parallel regions for which
we need an indirect call. Other AAKernelInfo extensions are expected.
[0] https://link.springer.com/chapter/10.1007/978-3-030-28596-8_11
Differential Revision: https://reviews.llvm.org/D101977
In the spirit of TRegions [0], this patch provides a simpler and uniform
interface for a kernel to set up the device runtime. The OMPIRBuilder is
used for reuse in Flang. A custom state machine will be generated in the
follow up patch.
The "surplus" threads of the "master warp" will not exit early anymore
so we need to use non-aligned barriers. The new runtime will not have an
extra warp but also require these non-aligned barriers.
[0] https://link.springer.com/chapter/10.1007/978-3-030-28596-8_11
This was in parts extracted from D59319.
Reviewed By: ABataev, JonChesterfield
Differential Revision: https://reviews.llvm.org/D101976
When we talk to outside analyse, e.g., LVI and ScalarEvolution, we need
to be careful with the query. The particular error occurred because we
folded a PHI node before the LVI query but the context location was now
not dominated by the value anymore. This is not supported by LVI so we
have to filter these situations before we query the outside analyses.
In order to simplify future extensions, e.g., the merge of
AAHeapToShared in to AAHeapToStack, we reorganize AAHeapToStack and the
state we keep for each malloc-like call. The result is also less
confusing as we only track malloc-like calls, not all calls. Further, we
only perform the updates necessary for a malloc-like to argue it can go
to the stack, e.g., we won't check all uses if we moved on to the
"must-be-freed" argument.
This patch also uses Attributor helps to simplify the allocated size,
alignment, and the potentially freed objects.
Overall, this is mostly a reorganization and only the use of the
optimistic helpers should change (=improve) the capabilities a bit.
Differential Revision: https://reviews.llvm.org/D104993
We have to be careful when we replace values to not use a non-dominating
instruction. It makes sense that simplification offers those as
"simplified values" but we can't manifest them in the IR without PHI
nodes. In the future we should consider potentially adding those PHI
nodes.
We should use AAValueSimplify for all value simplification, however
there was some leftover logic that predates AAValueSimplify in
AAReturnedValues. This remove the AAReturnedValues part and provides a
replacement by making AAValueSimplifyReturned strong enough to handle
all previously covered cases. Further, this improve
AAValueSimplifyCallSiteReturned to handle returned arguments.
AAReturnedValues is now much easier and the collected returned
values/instructions are now from the associated function only, making it
much more sane. We also do not have the brittle logic anymore that looks
for unresolved calls. Instead, we use AAValueSimplify to handle
recursion.
Useful code has been split into helper functions, e.g., an Attributor
interface to get a simplified value.
Differential Revision: https://reviews.llvm.org/D103860
As the `llvm::getUnderlyingObjects` helper, the optimistic version
collects objects that might be the base of a given pointer. In contrast
to the llvm variant, the optimistic one will use assumed information,
e.g., about select conditions or dead blocks, to provide a more precise
result.
Differential Revision: https://reviews.llvm.org/D103859
Not all attributes are able to handle the interprocedural step and
follow the uses into a call site. Let them be able to combine call site
uses instead. This might result in some unused values/arguments being
leftover but it removes problems where we misused "is dead" even though
it was actually "is simplified/replaced".
We explicitly check for dead values due to constant propagation in
`AAIsDeadValueImpl::areAllUsesAssumedDead` instead.
Differential Revision: https://reviews.llvm.org/D103858
Broke check-clang, see https://reviews.llvm.org/D102307#2869065
Ran `git revert -n ebbe149a6f08535ede848a531a601ae6591cfbc5..269416d41908bb670f67af689155d5ab8eea689a`
As with other Attributor interfaces we often want to know if assumed
information was used to answer a query. This is important if only
known information is allowed or if known information can lead to an
early fixpoint. The users have been adjusted but none of them utilizes
the new information yet.
In the spirit of TRegions [0], this patch analyzes a kernel and tracks
if it can be executed in SPMD-mode. If so, we flip the arguments of
the __kmpc_target_init and deinit call to enable the mode. We also
update the `<kernel>_exec_mode` flag to indicate to the runtime we
changed the mode to SPMD.
The code analysis is done interprocedurally by extending the
AAKernelInfo abstract attribute to track SPMD compatibility as well.
[0] https://link.springer.com/chapter/10.1007/978-3-030-28596-8_11
Differential Revision: https://reviews.llvm.org/D102307
When we talk to outside analyse, e.g., LVI and ScalarEvolution, we need
to be careful with the query. The particular error occurred because we
folded a PHI node before the LVI query but the context location was now
not dominated by the value anymore. This is not supported by LVI so we
have to filter these situations before we query the outside analyses.
We have to be careful when we replace values to not use a non-dominating
instruction. It makes sense that simplification offers those as
"simplified values" but we can't manifest them in the IR without PHI
nodes. In the future we should consider potentially adding those PHI
nodes.
In the spirit of TRegions [0], this patch creates a custom state
machine for a generic target region based on the potentially called
parallel regions.
The code analysis is done interprocedurally via an abstract attribute
(AAKernelInfo). All outermost parallel regions are collected and we
check if there might be unknown outermost parallel regions for which
we need an indirect call. Other AAKernelInfo extensions are expected.
[0] https://link.springer.com/chapter/10.1007/978-3-030-28596-8_11
Differential Revision: https://reviews.llvm.org/D101977
In the spirit of TRegions [0], this patch provides a simpler and uniform
interface for a kernel to set up the device runtime. The OMPIRBuilder is
used for reuse in Flang. A custom state machine will be generated in the
follow up patch.
The "surplus" threads of the "master warp" will not exit early anymore
so we need to use non-aligned barriers. The new runtime will not have an
extra warp but also require these non-aligned barriers.
[0] https://link.springer.com/chapter/10.1007/978-3-030-28596-8_11
This was in parts extracted from D59319.
Reviewed By: ABataev, JonChesterfield
Differential Revision: https://reviews.llvm.org/D101976
In order to simplify future extensions, e.g., the merge of
AAHeapToShared in to AAHeapToStack, we reorganize AAHeapToStack and the
state we keep for each malloc-like call. The result is also less
confusing as we only track malloc-like calls, not all calls. Further, we
only perform the updates necessary for a malloc-like to argue it can go
to the stack, e.g., we won't check all uses if we moved on to the
"must-be-freed" argument.
This patch also uses Attributor helps to simplify the allocated size,
alignment, and the potentially freed objects.
Overall, this is mostly a reorganization and only the use of the
optimistic helpers should change (=improve) the capabilities a bit.
Differential Revision: https://reviews.llvm.org/D104993
We should use AAValueSimplify for all value simplification, however
there was some leftover logic that predates AAValueSimplify in
AAReturnedValues. This remove the AAReturnedValues part and provides a
replacement by making AAValueSimplifyReturned strong enough to handle
all previously covered cases. Further, this improve
AAValueSimplifyCallSiteReturned to handle returned arguments.
AAReturnedValues is now much easier and the collected returned
values/instructions are now from the associated function only, making it
much more sane. We also do not have the brittle logic anymore that looks
for unresolved calls. Instead, we use AAValueSimplify to handle
recursion.
Useful code has been split into helper functions, e.g., an Attributor
interface to get a simplified value.
Differential Revision: https://reviews.llvm.org/D103860
As the `llvm::getUnderlyingObjects` helper, the optimistic version
collects objects that might be the base of a given pointer. In contrast
to the llvm variant, the optimistic one will use assumed information,
e.g., about select conditions or dead blocks, to provide a more precise
result.
Differential Revision: https://reviews.llvm.org/D103859
Not all attributes are able to handle the interprocedural step and
follow the uses into a call site. Let them be able to combine call site
uses instead. This might result in some unused values/arguments being
leftover but it removes problems where we misused "is dead" even though
it was actually "is simplified/replaced".
We explicitly check for dead values due to constant propagation in
`AAIsDeadValueImpl::areAllUsesAssumedDead` instead.
Differential Revision: https://reviews.llvm.org/D103858
Instead of performing the isMoreProfitable() operation on
InstructionCost::CostTy the operation is performed on InstructionCost
directly, so that it can handle the case where one of the costs is
Invalid.
This patch also changes the CostTy to be int64_t, so that the type is
wide enough to deal with multiplications with e.g. `unsigned MaxTripCount`.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D105113
Rules:
1. SCEVUnknown is a pointer if and only if the LLVM IR value is a
pointer.
2. SCEVPtrToInt is never a pointer.
3. If any other SCEV expression has no pointer operands, the result is
an integer.
4. If a SCEVAddExpr has exactly one pointer operand, the result is a
pointer.
5. If a SCEVAddRecExpr's first operand is a pointer, and it has no other
pointer operands, the result is a pointer.
6. If every operand of a SCEVMinMaxExpr is a pointer, the result is a
pointer.
7. Otherwise, the SCEV expression is invalid.
I'm not sure how useful rule 6 is in practice. If we exclude it, we can
guarantee that ScalarEvolution::getPointerBase always returns a
SCEVUnknown, which might be a helpful property. Anyway, I'll leave that
for a followup.
This is basically mop-up at this point; all the changes with significant
functional effects have landed. Some of the remaining changes could be
split off, but I don't see much point.
Differential Revision: https://reviews.llvm.org/D105510
This patch teaches the sample profile loader to merge function
attributes after inlining functions.
Without this patch, the compiler could inline a function requiring the
512-bit vector width into its caller without merging function
attributes, triggering a failure during instruction selection.
Differential Revision: https://reviews.llvm.org/D105729
This makes it clearer when we have encountered the extra arg.
Also, we may need to adjust the way the operand iteration
works when handling logical and/or.
This change is intended as initial setup. The plan is to add
more semantic checks later. I plan to update the documentation
as more semantic checks are added (instead of documenting the
details up front). Most of the code closely mirrors that for
the Swift calling convention. Three places are marked as
[FIXME: swiftasynccc]; those will be addressed once the
corresponding convention is introduced in LLVM.
Reviewed By: rjmccall
Differential Revision: https://reviews.llvm.org/D95561
This is NFC-intended currently (so no test diffs). The motivation
is to eventually allow matching for poison-safe logical-and and
logical-or (these are in the form of a select-of-bools).
( https://llvm.org/PR41312 )
Those patterns will not have all of the same constraints as min/max
in the form of cmp+sel. We may also end up removing the cmp+sel
min/max matching entirely (if we canonicalize to intrinsics), so
this will make that step easier.
This reverts commit 52aeacfbf5.
There isn't full agreement on a path forward yet, but there is agreement that
this shouldn't land as-is. See discussion on https://reviews.llvm.org/D105338
Also reverts unreviewed "[clang] Improve `-Wnull-dereference` diag to be more in-line with reality"
This reverts commit f4877c78c0.
And all the related changes to tests:
This reverts commit 9a0152799f.
This reverts commit 3f7c9cc274.
This reverts commit 329f8197ef.
This reverts commit aa9f58cc2c.
This reverts commit 2df37d5ddd.
This reverts commit a72a441812.
Currently InstructionSimplify.cpp knows how to simplify floating point
instructions that have a NaN operand. It does not know how to handle the
matching constrained FP intrinsic.
This patch teaches it how to simplify so long as the exception handling
is not "fpexcept.strict".
Differential Revision: https://reviews.llvm.org/D103169
This reverts commit 4e413e1621,
which landed almost 10 months ago under premise that the original behavior
didn't match reality and was breaking users, even though it was correct as per
the LangRef. But the LangRef change still hasn't appeared, which might suggest
that the affected parties aren't really worried about this problem.
Please refer to discussion in:
* https://reviews.llvm.org/D87399 (`Revert "[InstCombine] erase instructions leading up to unreachable"`)
* https://reviews.llvm.org/D53184 (`[LangRef] Clarify semantics of volatile operations.`)
* https://reviews.llvm.org/D87149 (`[InstCombine] erase instructions leading up to unreachable`)
clang has `-Wnull-dereference` which will diagnose the obvious cases
of null dereference, it was adjusted in f4877c78c0,
but it will only catch the cases where the pointer is a null literal,
it will not catch the cases where an arbitrary store is expected to trap.
Differential Revision: https://reviews.llvm.org/D105338
Added check for switch-terminated blocks in loops.
Now if a block is terminated with a switch, we try to find out which of the
cases is taken on 1st iteration and mark corresponding edge from the block
to the case successor as live.
Patch by Dmitry Makogon!
Differential Revision: https://reviews.llvm.org/D105688
Reviewed By: nikic, mkazantsev
This patch removes the IsPairwiseForm flag from the Reduction Cost TTI
hooks, along with some accompanying code for pattern matching reductions
from trees starting at extract elements. IsPairWise is now assumed to be
false, which was the predominant way that the value was used from both
the Loop and SLP vectorizers. Since the adjustments such as D93860, the
SLP vectorizer has not relied upon this distinction between paiwise and
non-pairwise reductions.
This also removes some code that was detecting reductions trees starting
from extract elements inside the costmodel. This case was
double-counting costs though, adding the individual costs on the
individual instruction _and_ the total cost of the reduction. Removing
it changes the costs in llvm/test/Analysis/CostModel/X86/reduction.ll to
not double count. The cost of reduction intrinsics is still tested
through the various tests in
llvm/test/Analysis/CostModel/X86/reduce-xyz.ll.
Differential Revision: https://reviews.llvm.org/D105484
There was an alias between 'simplifycfg' and 'simplify-cfg' in the
PassRegistry. That was the original reason for this patch, which
effectively removes the alias.
This patch also replaces all occurrances of 'simplify-cfg'
by 'simplifycfg'. Reason for choosing that form for the name is
that it matches the DEBUG_TYPE for the pass, and the legacy PM name
and also how it is spelled out in other passes such as
'loop-simplifycfg', and in other options such as
'simplifycfg-merge-cond-stores'.
I for some reason the name should be changed to 'simplify-cfg' in
the future, then I think such a renaming should be more widely done
and not only impacting the PassRegistry.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D105627
C++23 will make these conversions ambiguous - so fix them to make the
codebase forward-compatible with C++23 (& a follow-up change I've made
will make this ambiguous/invalid even in <C++23 so we don't regress
this & it generally improves the code anyway)
Patch tries to improve the vectorization of stores. Originally, we just
check the type and the base pointer of the store.
Patch adds some extra checks to avoid non-profitable vectorization
cases. It includes analysis of the scalar values to be stored and
triggers the vectorization attempt only if the scalar values have
same/alt opcode and are from same basic block, i.e. we don't end up
immediately with the gather node, which is not profitable.
This also improves compile time by filtering out non-profitable cases.
Part of D57059.
Differential Revision: https://reviews.llvm.org/D104122
Revived D101297 in its original form + added some changes in X86
legalization cehcking for masked gathers.
This solution is the most stable and the most correct one. We have to
check the legality before trying to build the masked gather in SLP.
Without this check we have incorrect cost (for SLP) in case if the masked gather
is not legal/slower than the gather. And we're missing some
vectorization opportunities.
This can be fixed in the cost model, but in this case we need to add
special checks for the cost of GEPs for ScatterVectorize node, add
special check for small trees, etc., i.e. there are a lot of corner
cases here and there, which insrease code base and make it harder to
maintain the code.
> Can't we rely on cost model to deal with this? This can be profitable for futher vectorization, when we can start from such gather loads as seed.
The question from D101297. Actually, no, it can't. Actually, simple
gather may give us better result, especially after we started
vectorization of insertelements. Plus, like I said before, the cost for
non-legal masked gathers leads to missed vectorization opportunities.
Differential Revision: https://reviews.llvm.org/D105042
Some of the SPEC tests end up with reduction+(sext/zext(<n x i1>) to <n x im>) pattern, which can be transformed to [-]zext/trunc(ctpop(bitcast <n x i1> to in)) to im.
Also, reduction+(<n x i1>) can be transformed to ctpop(bitcast <n x i1> to in) & 1 != 0.
Differential Revision: https://reviews.llvm.org/D105587
- ``externally_initialized`` variables would be initialized or modified
elsewhere. Particularly, CUDA or HIP may have host code to initialize
or modify ``externally_initialized`` device variables, which may not
be explicitly referenced on the device side but may still be used
through the host side interfaces. Not preserving them triggers the
elimination of them in the GlobalDCE and breaks the user code.
Reviewed By: yaxunl
Differential Revision: https://reviews.llvm.org/D105135
This adds support for opaque pointers to expandAddToGEP() by always
generating an i8 GEP for opaque pointers. After looking at some other
cases (constexpr GEP folding, SROA GEP generation), I've come around
to the idea that we should use i8 GEPs for opaque pointers, because
the alternative would be to guess a GEP type from surrounding code,
which will not be reliable. Ultimately, i8 GEPs is where we want to
end up anyway, and opaque pointers just make that the natural choice.
There are a couple of other places in SCEVExpander that check pointer
element types, I plan to update those when I run across usable test
coverage that doesn't assert elsewhere.
Differential Revision: https://reviews.llvm.org/D105398
The reduction matching was probably only dealing with binops
when it was written, but we have now generalized it to handle
select and intrinsics too, so assert on that too.
Before this patch we would normally use the ABI alignment which can be
to high for the context alginment.
For spilled values we don't need ABI alignment, since the frame entry's
address is not escaped.
rdar://79664965
Differential Revision: https://reviews.llvm.org/D105288
Resubmit after the following changes:
* Fix a latent bug related to unrolling with required epilogue (see e49d65f). I believe this is the cause of the prior PPC buildbot failure.
* Disable non-latch exits for epilogue vectorization to be safe (9ffa90d)
* Split out assert movement (600624a) to reduce churn if this gets reverted again.
Previous commit message (try 3)
Resubmit after fixing test/Transforms/LoopVectorize/ARM/mve-gather-scatter-tailpred.ll
Previous commit message...
This is a resubmit of 3e5ce4 (which was reverted by 7fe41ac). The original commit caused a PPC build bot failure we never really got to the bottom of. I can't reproduce the issue, and the bot owner was non-responsive. In the meantime, we stumbled across an issue which seems possibly related, and worked around a latent bug in 80e8025. My best guess is that the original patch exposed that latent issue at higher frequency, but it really is just a guess.
Original commit message follows...
If we know that the scalar epilogue is required to run, modify the CFG to end the middle block with an unconditional branch to scalar preheader. This is instead of a conditional branch to either the preheader or the exit block.
The motivation to do this is to support multiple exit blocks. Specifically, the current structure forces us to identify immediate dominators and *which* exit block to branch from in the middle terminator. For the multiple exit case - where we know require scalar will hold - these questions are ill formed.
This is the last change needed to support multiple exit loops, but since the diffs are already large enough, I'm going to land this, and then enable separately. You can think of this as being NFCIish prep work, but the changes are a bit too involved for me to feel comfortable tagging the review that way.
Differential Revision: https://reviews.llvm.org/D94892
Before we replaced value by registering all their uses. However, as we
replace a value old uses become stale. We now replace values explicitly
and keep track of "new values" when doing so to avoid replacing only
uses in stale/old values but not their replacements.
We often need to deal with the value lattice that contains none and
undef as special values. A simple helper makes this much nicer.
Differential Revision: https://reviews.llvm.org/D103857
When we do simplification via AAPotentialValues or AAValueConstantRange
we need to simplify the operands of an instruction we deconstruct first.
This does not only improve the result, see for example range.ll, but is
required as we allow outside AAs to provide simplification rules via
callbacks. If we do ignore the simplification rules and base other
simplifications on the IR instead we can create an inconsistent state.
As part of making ScalarEvolution's handling of pointers consistent, we
want to forbid multiplying a pointer by -1 (or any other value). This
means we can't blindly subtract pointers.
There are a few ways we could deal with this:
1. We could completely forbid subtracting pointers in getMinusSCEV()
2. We could forbid subracting pointers with different pointer bases
(this patch).
3. We could try to ptrtoint pointer operands.
The option in this patch is more friendly to non-integral pointers: code
that works with normal pointers will also work with non-integral
pointers. And it seems like there are very few places that actually
benefit from the third option.
As a minimal patch, the ScalarEvolution implementation of getMinusSCEV
still ends up subtracting pointers if they have the same base. This
should eliminate the shared pointer base, but eventually we'll need to
rewrite it to avoid negating the pointer base. I plan to do this as a
separate step to allow measuring the compile-time impact.
This doesn't cause obvious functional changes in most cases; the one
case that is significantly affected is ICmpZero handling in LSR (which
is the source of almost all the test changes). The resulting changes
seem okay to me, but suggestions welcome. As an alternative, I tried
explicitly ptrtoint'ing the operands, but the result doesn't seem
obviously better.
I deleted the test lsr-undef-in-binop.ll becuase I couldn't figure out
how to repair it to test what it was actually trying to test.
Recommitting with fix to MemoryDepChecker::isDependent.
Differential Revision: https://reviews.llvm.org/D104806
When skimming through old review discussion, I noticed a post commit comment on an earlier patch which had gone unaddressed. Better late (4 months), than never right?
I'm not aware of an active problem with the combination of non-latch exits and epilogue vectorization, but the interaction was not considered and I'm not modivated to make epilogue vectorization work with early exits. If there were a bug in the interaction, it would be pretty hard to hit right now (as we canonicalize towards bottom tested loops), but an upcoming change to allow multiple exit loops will greatly increase the chance for error. Thus, let's play it safe for now.
As part of making ScalarEvolution's handling of pointers consistent, we
want to forbid multiplying a pointer by -1 (or any other value). This
means we can't blindly subtract pointers.
There are a few ways we could deal with this:
1. We could completely forbid subtracting pointers in getMinusSCEV()
2. We could forbid subracting pointers with different pointer bases
(this patch).
3. We could try to ptrtoint pointer operands.
The option in this patch is more friendly to non-integral pointers: code
that works with normal pointers will also work with non-integral
pointers. And it seems like there are very few places that actually
benefit from the third option.
As a minimal patch, the ScalarEvolution implementation of getMinusSCEV
still ends up subtracting pointers if they have the same base. This
should eliminate the shared pointer base, but eventually we'll need to
rewrite it to avoid negating the pointer base. I plan to do this as a
separate step to allow measuring the compile-time impact.
This doesn't cause obvious functional changes in most cases; the one
case that is significantly affected is ICmpZero handling in LSR (which
is the source of almost all the test changes). The resulting changes
seem okay to me, but suggestions welcome. As an alternative, I tried
explicitly ptrtoint'ing the operands, but the result doesn't seem
obviously better.
I deleted the test lsr-undef-in-binop.ll becuase I couldn't figure out
how to repair it to test what it was actually trying to test.
Differential Revision: https://reviews.llvm.org/D104806
Code assumes that uses of single predecessor phis are not live accross
suspend points. Cleanup any single predecessor phis preceeding the code
making this assumption.
rdar://76020301
Differential Revision: https://reviews.llvm.org/D105488
Compare type IDs and DFS numbering for basic block instead of addresses
to fix non-determinism.
Differential Revision: https://reviews.llvm.org/D105031
The resume partial functions generated for swift suspend points will now
use a Swift mangling suffix.
Await resume partial functions will use the suffix 'TQ'[0-9]+'_' (e.g "...TQ0_")
and suspend resume partial functions will use the suffix 'TY'[0-9]+'_'
(e.g "...TY1_").
Reviewed By: nate_chandler
Differential Revision: https://reviews.llvm.org/D104144
This reverts commit 706bbfb35b.
The committed version moves the definition of VPReductionPHIRecipe out
of an ifdef only intended for ::print helpers. This should resolve the
build failures that caused the revert
This patch adds a TTI function, isElementTypeLegalForScalableVector, to query
whether it is possible to vectorize a given element type. This is called by
isLegalToVectorizeInstTypesForScalable to reject scalable vectorization if
any of the instruction types in the loop are unsupported, e.g:
int foo(__int128_t* ptr, int N)
#pragma clang loop vectorize_width(4, scalable)
for (int i=0; i<N; ++i)
ptr[i] = ptr[i] + 42;
This example currently crashes if we attempt to vectorize since i128 is not a
supported type for scalable vectorization.
Reviewed By: sdesmalen, david-arm
Differential Revision: https://reviews.llvm.org/D102253
This reverts commit 3fed6d443f,
bbcbf21ae6 and
6c3451cd76.
The changes causing build failures with certain configurations, e.g.
https://lab.llvm.org/buildbot/#/builders/67/builds/3365/steps/6/logs/stdio
lib/libLLVMVectorize.a(LoopVectorize.cpp.o): In function `llvm::VPRecipeBuilder::tryToCreateWidenRecipe(llvm::Instruction*, llvm::ArrayRef<llvm::VPValue*>, llvm::VFRange&, std::unique_ptr<llvm::VPlan, std::default_delete<llvm::VPlan> >&) [clone .localalias.8]':
LoopVectorize.cpp:(.text._ZN4llvm15VPRecipeBuilder22tryToCreateWidenRecipeEPNS_11InstructionENS_8ArrayRefIPNS_7VPValueEEERNS_7VFRangeERSt10unique_ptrINS_5VPlanESt14default_deleteISA_EE+0x63b): undefined reference to `vtable for llvm::VPReductionPHIRecipe'
collect2: error: ld returned 1 exit status
This patch is a first step towards splitting up VPWidenPHIRecipe into
separate recipes for the 3 distinct cases they model:
1. reduction phis,
2. first-order recurrence phis,
3. pointer induction phis.
This allows untangling the code generation and allows us to reduce the
reliance on LoopVectorizationCostModel during VPlan code generation.
Discussed/suggested in D100102, D100113, D104197.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D104989
Splits `getSmallestAndWidestTypes` into two functions, one of which now collects
a list of all element types found in the loop (`ElementTypesInLoop`). This ensures we do not
have to iterate over all instructions in the loop again in other places, such as in D102253
which disables scalable vectorization of a loop if any of the instructions use invalid types.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D105437
that release the retained object
This patch fixes what looks like a longstanding bug in ARC optimizer
where it reverses the order of objc_retain calls and objc_release calls
that retain and release the same object.
The code in ARC optimizer that is responsible for code motion takes the
following steps:
1. Traverse the CFG bottom-up and determine how far up objc_release
calls can be moved. Determine the insertion points for the
objc_release calls, but don't actually move them.
2. Traverse the CFG top-down and determine how far down objc_retain
calls can be moved. Determine the insertion points for the
objc_retain calls, but don't actually move them.
3. Try to move the objc_retain and objc_release calls if they can't be
removed.
The problem is that the insertion points for the objc_retain calls are
determined in step 2 without taking into consideration the insertion
points for objc_release calls determined in step 1, so the order of an
objc_retain call and an objc_release call can be reversed, which is
incorrect, even though each step is correct in isolation.
To fix this bug, this patch teaches the top-down traversal step to take
into consideration the insertion points for objc_release calls
determined in the bottom-up traversal step. Code motion for an
objc_retain call is disabled if there is a possibility that it can be
moved past an objc_release call that releases the retained object.
rdar://79292791
Differential Revision: https://reviews.llvm.org/D104953
This follows up patches for the unsigned siblings:
0c400e8953c7b658aeb5
We are translating an offset signed compare to its
unsigned equivalent when one end of the range is
at the limit (zero or unsigned max).
(X + C2) >s C --> X <u (SMAX - C) (if C == C2 - 1)
(X + C2) <s C --> X >u (C ^ SMAX) (if C == C2)
This probably does not show up much in IR derived
from C/C++ source because that would likely have
'nsw', and we have folds for that already.
As with the previous unsigned transforms, the folds
could be generalized to handle non-constant patterns:
https://alive2.llvm.org/ce/z/Y8Xrrm
; sgt
define i1 @src(i8 %a, i8 %c) {
%c2 = add i8 %c, 1
%t = add i8 %a, %c2
%ov = icmp sgt i8 %t, %c
ret i1 %ov
}
define i1 @tgt(i8 %a, i8 %c) {
%c_off = sub i8 127, %c ; SMAX
%ov = icmp ult i8 %a, %c_off
ret i1 %ov
}
https://alive2.llvm.org/ce/z/c8uhnk
; slt
define i1 @src(i8 %a, i8 %c) {
%t = add i8 %a, %c
%ov = icmp slt i8 %t, %c
ret i1 %ov
}
define i1 @tgt(i8 %a, i8 %c) {
%c_offnot = xor i8 %c, 127 ; SMAX
%ov = icmp ugt i8 %a, %c_offnot
ret i1 %ov
}
The function vectorizeChainsInBlock does not support scalable vector,
because function like canReuseExtract and isCommutative in the code
path assert with scalable vectors.
This patch avoids vectorizing blocks that have extract instructions with scalable
vector..
Differential Revision: https://reviews.llvm.org/D104809
This patch fixes an issue which occurred in CodeGenPrepare and
HWAddressSanitizer, which both at some point create a map of Old->New
instructions and update dbg.value uses of these. They did this by
iterating over the dbg.value's location operands, and if an instance of
the old instruction was found, replaceVariableLocationOp would be
called on that dbg.value. This would cause an error if the same operand
appeared multiple times as a location operand, as the first call to
replaceVariableLocationOp would update all uses of the old instruction,
invalidating the old iterator and eventually hitting an assertion.
This has been fixed by no longer iterating over the dbg.value's location
operands directly, but by first collecting them into a set and then
iterating over that, ensuring that we never attempt to replace a
duplicated operand multiple times.
Differential Revision: https://reviews.llvm.org/D105129
This API is not compatible with opaque pointers, the method
accepting an explicit pointer element type should be used instead.
Thankfully there were few in-tree users. The BPF case still ends
up using the pointer element type for now and needs something like
D105407 to avoid doing so.
Same as other CreateLoad-style APIs, these need an explicit type
argument to support opaque pointers.
Differential Revision: https://reviews.llvm.org/D105395
This replaces the current ad-hoc implementation,
by syncing the code from InstCombine's implementation in `InstCombinerImpl::visitUnreachableInst()`,
with one exception that here in SimplifyCFG we are allowed to remove EH instructions.
Effectively, this now allows SimplifyCFG to remove calls (iff they won't throw and will return),
arithmetic/logic operations, etc.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D105374
D74751 added `ClearDSOLocalOnDeclarations` and dropped dso_local for
isDeclarationForLinker `GlobalValue`s. It missed a case for imported
declarations (`doImportAsDefinition` is false while `isPerformingImport` is
true). This can lead to a linker error for a default visibility symbol in
`ld.lld -shared`.
When `ClearDSOLocalOnDeclarations` is true, we check
`isPerformingImport() && !doImportAsDefinition(&GV)` along with
`GV.isDeclarationForLinker()`. The new condition checks an imported declaration.
This patch fixes a `LLVMPolly.so` link error using a trunk clang -DLLVM_ENABLE_LTO=Thin.
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D104986
Mimics similar change for InstCombine:
ce192ced2b / D104602
All these uses are in blocks that aren't reachable from function's entry,
and said blocks are removed by SimplifyCFG itself,
so we can't really test this change.
This tries to bail out if the PHI is in a `catchswitch` BB in
InstCombine. A PHI cannot be combined into a non-PHI instruction if it
is in a `catchswitch` BB, because `catchswitch` BB cannot have any
non-PHI instruction other than `catchswitch` itself.
The given test case started crashing after D98058.
Reviewed By: lebedev.ri, rnk
Differential Revision: https://reviews.llvm.org/D105309
Somewhat related to D105338.
While it is up for discussion whether or not volatile store traps,
so far there has been no complaints that volatile load/cmpxchg/atomicrmw also may trap.
And even if simplifycfg currently concervatively believes that to be the case,
instcombine does not: https://godbolt.org/z/5vhv4K5b8
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D105343
In the original review D87149 it was mentioned that this approach was tried,
and it lead to infinite combine loops, but i'm not seeing anything like that now,
neither in the `check-llvm`, nor on some codebases i tried.
This is a recommit of d9d65527c2,
which i immediately reverted because i have messed up something
during branch switch, and 597ccc92ce
accidentally ended up being pushed, which was very much not the intention.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D105339
In the original review D87149 it was mentioned that this approach was tried,
and it lead to infinite combine loops, but i'm not seeing anything like that now,
neither in the `check-llvm`, nor on some codebases i tried.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D105339
The compiler should not ignore UndefValue when gathering the scalars,
otherwise the resulting code may be less defined than the original one.
Also, grouped scalars to insert them at first to reduce the analysis in
further passes.
Differential Revision: https://reviews.llvm.org/D105275
If the store address does not dominate the matrix multiply, try to hoist
address computation instructions without side-effects and/or memory
reads before the multiply, to allow fusion.
Reviewed By: thegameg
Differential Revision: https://reviews.llvm.org/D105193
With 'for' loop there is is a single place where 'Current' is adjusted. It helps to avoid copy paste and makes a bit easy to understand overall loop controll flow.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D101044
Previously we used the vector type, but we're loading/storing
invididual elements so I think only element alignment should matter.
Noticed while looking at the code for something else so I don't
have a test case.
Differential Revision: https://reviews.llvm.org/D105220
We need the compiler generated variable to override the weak symbol of
the same name inside the profile runtime, but using LinkOnceODRLinkage
results in weak symbol being emitted which leads to an issue where the
linker might choose either of the weak symbols potentially disabling the
runtime counter relocation.
This change replaces the use of weak definition inside the runtime with
an external weak reference to address the issue. We also place the
compiler generated symbol inside a COMDAT group so dead definition can
be garbage collected by the linker.
Differential Revision: https://reviews.llvm.org/D105176
This follows up to D104665 (which added umulo handling alongside the existing uaddo case), and generalizes for the remaining overflow intrinsics.
I went to add analogous handling to LVI, and discovered that LVI already had a more general implementation. Instead, we can port was LVI does to instcombine. (For context, LVI uses makeExactNoWrapRegion to constrain the value 'x' in blocks reached after a branch on the condition `op.with.overflow(x, C).overflow`.)
Differential Revision: https://reviews.llvm.org/D104932
In lots of places we were calling setDebugLocFromInst and passing
in the same Builder member variable found in InnerLoopVectorizer.
I personally found this confusing so I've changed the interface
to take an Optional<IRBuilder<> *> and we can now pass in None
when we want to use the class member variable.
Differential Revision: https://reviews.llvm.org/D105100
Now we lack a benchmark to measure the performance change for each
commit.
Since coro elide is the main optimization in coroutine module, I wonder
it may be an estimation to count the number of elided coroutine in
private code bases.
e.g., for a certain commit, if we found that the number of elided goes
down, we could find it before the commit check-in.
Reviewed By: lxfind
Differential Revision: https://reviews.llvm.org/D105095
This is one sibling of the fold added with c7b658aeb5 .
(X + C2) <u C --> X >s ~C2 (if C == C2 + SMIN)
I'm still not sure how to describe it best, but we're
translating 2 constants from an unsigned range comparison
to signed because that eliminates the offset (add) op.
This could be extended to handle the more general (non-constant)
pattern too:
https://alive2.llvm.org/ce/z/K-fMBf
define i1 @src(i8 %a, i8 %c2) {
%t = add i8 %a, %c2
%c = add i8 %c2, 128 ; SMIN
%ov = icmp ult i8 %t, %c
ret i1 %ov
}
define i1 @tgt(i8 %a, i8 %c2) {
%not_c2 = xor i8 %c2, -1
%ov = icmp sgt i8 %a, %not_c2
ret i1 %ov
}
Relevant discussion can be found at: https://lists.llvm.org/pipermail/llvm-dev/2021-January/148197.html
In the existing design, An SCC that contains a coroutine will go through the folloing passes:
Inliner -> CoroSplitPass (fake) -> FunctionSimplificationPipeline -> Inliner -> CoroSplitPass (real) -> FunctionSimplificationPipeline
The first CoroSplitPass doesn't do anything other than putting the SCC back to the queue so that the entire pipeline can repeat.
As you can see, we run Inliner twice on the SCC consecutively without doing any real split, which is unnecessary and likely unintended.
What we really wanted is this:
Inliner -> FunctionSimplificationPipeline -> CoroSplitPass -> FunctionSimplificationPipeline
(note that we don't really need to run Inliner again on the ramp function after split).
Hence the way we do it here is to move CoroSplitPass to the end of the CGSCC pipeline, make it once for real, insert the newly generated SCCs (the clones) back to the pipeline so that they can be optimized, and also add a function simplification pipeline after CoroSplit to optimize the post-split ramp function.
This approach also conforms to how the new pass manager works instead of relying on an adhoc post split cleanup, making it ready for full switch to new pass manager eventually.
By looking at some of the changes to the tests, we can already observe that this changes allows for more optimizations applied to coroutines.
Reviewed By: aeubanks, ChuanqiXu
Differential Revision: https://reviews.llvm.org/D95807
There must be a better way to describe this pattern in words?
(X + C2) >u C --> X <s -C2 (if C == C2 + SMAX)
This could be extended to handle the more general (non-constant)
pattern too:
https://alive2.llvm.org/ce/z/rdfNFP
define i1 @src(i8 %a, i8 %c1) {
%t = add i8 %a, %c1
%c2 = add i8 %c1, 127 ; SMAX
%ov = icmp ugt i8 %t, %c2
ret i1 %ov
}
define i1 @tgt(i8 %a, i8 %c1) {
%neg_c1 = sub i8 0, %c1
%ov = icmp slt i8 %a, %neg_c1
ret i1 %ov
}
The pattern was noticed as a by-product of D104932.
We already implemented this for the select form, but the intrinsic form was missing. Note that this doesn't change poison behavior as 1 is non-poison, and the optimized form is still poison exactly when x is.
The remarks will trigger on some functions that are marked cold, such as the
`__muldc3` intrinsic functions. Change the remarks to avoid these functions.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D105196
This patch adds additional remarks, suggesting the use of `noescape` for failed
globalization and indicating when internalization failed.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D105150
Now the option is off by default. Since we are not sure if this option
would make the compile time increase aggressively. Although we tested it
on SPEC2017, we may need to test more to make it on by default.
Reviewed By: SjoerdMeijer
Differential Revision: https://reviews.llvm.org/D104365
Now we lack a benchmark to measure the performance change for each
commit.
Since coro elide is the main optimization in coroutine module, I wonder
it may be an estimation to count the number of elided coroutine in
private code bases.
e.g., for a certain commit, if we found that the number of elided goes
down, we could find it before the commit check-in.
Reviewed By: lxfind
Differential Revision: https://reviews.llvm.org/D105095
This allows application code checks if origin tracking is on before
printing out traces.
-dfsan-track-origins can be 0,1,2.
The current code only distinguishes 1 and 2 in compile time, but not at runtime.
Made runtime distinguish 1 and 2 too.
Reviewed By: browneee
Differential Revision: https://reviews.llvm.org/D105128
The code was previously relying on the fact that an incorrectly
typed global would result in the insertion of a BitCast constant
expression. With opaque pointers, this is no longer the case, so
we should check the type explicitly.
After SLP + LTO we may have have reduction(shuffle V, poison,
mask). This can be simplified to just reduction(V) if the mask is only
for single vector and just all elements from this vector are permuted,
without reusing, replacing with undefs and/or other values, etc.
Differential Revision: https://reviews.llvm.org/D105053
If we unroll a loop in the vectorizer (without vectorizing), and the cost model requires a epilogue be generated for correctness, the code generation must actually do so.
The included test case on an unmodified opt will access memory one past the expected bound. As a result, this patch is fixing a latent miscompile.
Differential Revision: https://reviews.llvm.org/D103700
While we might eventually want to disallow allocas that do not have the
alloca-AS set, it seems undesirable to crash on them. Add a cast when
required so that we can support such allocas (at least here).
Differential Revision: https://reviews.llvm.org/D104866
This patch fixes a crash when the target instruction for sinking is
dead. In that case, no recipe is created and trying to get the recipe
for it results in a crash. To ensure all sink targets are alive, find &
use the first previous alive instruction.
Note that the case where the sink source is dead is already handled.
Found by
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=35320
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D104603
Previously in setCostBasedWideningDecision if we encountered an
invariant store we just assumed that we could scalarize the store
and called getUniformMemOpCost to get the associated cost.
However, for scalable vectors this is not an option because it is
not currently possibly to scalarize the store. At the moment we
crash in VPReplicateRecipe::execute when trying to scalarize the
store.
Therefore, I have changed setCostBasedWideningDecision so that if
we are storing a scalable vector out to a uniform address and the
target supports scatter instructions, then we should use those
instead.
Tests have been added here:
Transforms/LoopVectorize/AArch64/sve-inv-store.ll
Differential Revision: https://reviews.llvm.org/D104624
In all of these, the value must be an instruction for us to succeed anyway,
so change it to maybe hopefully make further changes more straight-forward.
We could use a bigger hammer and bail out on any constant
expression, but there's a regression test that appears to
validly do the transform (although it may not have been
intending to check that optimization).
Currently OpenMPOpt will only check if a function is a kernel before deciding not to internalize it. Any uncalled function that gets internalized will be trivially dead in the module so this is unnnecessary.
Depends on D102423
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D104890
There is a constraint that coro.suspend instructions need to be in their
own blocks. The coro split pass initially creates IR that obeys this constraint
(which is later checked). Sinking rematerializable instructions into these
blocks breaks that constraint.
Instead rematerialize in the predecessor block to the suspend's single
predecessor block.
Differential Revision: https://reviews.llvm.org/D104051
Compiler crashes at an assertion while casting operands to PtrToIntInst at some cases when
ptrtoint is present as an explicit operand to inttoptr. Explicit instruction operator as
operand can not be casted to an Instruction.
This patch replaces cast from PtrToInst to Operator which are later checked for constant
expressions.
Differential Revision: https://reviews.llvm.org/D105002
Increase the number of attributor iterations on a GPU target. I forgot to
change this in D104416.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D104920
Currently we will allow loops with a fixed width VF of 1 to vectorize
if the -enable-strict-reductions flag is set. However, the loop vectorizer
will not use ordered reductions if `VF.isScalar()` and the resulting
vectorized loop will be out of order.
This patch removes `VF.isVector()` when checking if ordered reductions
should be used. Also, instead of converting the FAdds to reductions if the
VF = 1, operands of the FAdds are changed such that the order is preserved.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D104533
Sinking scalar operands into predicated-triangle regions may allow
merging regions. This patch adds a VPlan-to-VPlan transform that tries
to merge predicate-triangle regions after sinking.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D100260
We can exploit branches by `undef` condition. Frankly, the LangRef says that
such branches are UB, so we can assume that all outgoing edges of such blocks
are dead.
However, from practical perspective, we know that this is not supported correctly
in some other places. So we are being conservative about it.
Branch by undef is treated in the following way:
- If it is a loop-exiting branch, we always assume it exits the loop;
- If not, we arbitrarily assume it takes `true` value.
Differential Revision: https://reviews.llvm.org/D104689
Reviewed By: nikic
For the start shortening optimization, always use a i8 type for
the GEP, as it is a raw offset calculation.
Handling of non-i8* memset/memcpy arguments requires insertion
of casts. These cases were previously miscompiled, as the offset
calculation was performed on the wrong type.
Apparently, it is legal to use memcpy/memset with pointer types
other than i8*. Prior to 81fcdae68c
this case was silently miscompiled, as the i8 offset calculation
was performed on some other type. Now it would crash due to a
type mismatch. Fix this by inserting an explicit bitcast to i8*.
This is an extension of the handling for unary intrinsics and
follows the logic that we use for binary ops.
We don't canonicalize to min/max intrinsics yet, but this might
help unlock other folds seen in D98152.
This patch updates VPWidenPHI recipes for first-order recurrences to
also track the incoming value from the back-edge. Similar to D99294,
which did the same for reductions.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D104197
This fixes a bug at LibCallSimplifier::optimizeMemChr which does the following transformation:
```
// memchr("\r\n", C, 2) != nullptr -> (1 << C & ((1 << '\r') | (1 << '\n')))
// != 0
// after bounds check.
```
As written above, a bounds check on C (whether it is less than integer bitwidth) is done before doing `1 << C` otherwise 1 << C will overflow.
If the bounds check is false, the result of (1 << C & ...) must not be used at all, otherwise the result of shift (which is poison) will contaminate the whole results.
A correct way to encode this is `select i1 (bounds check), (1 << C & ...), false` because select does not allow the unused operand to contaminate the result.
However, this optimization was introducing `and (bounds check), (1 << C & ...)` which cannot do that.
The bug was found from compilation of this C++ code: https://reviews.llvm.org/rG2fd3037ac615#1007197
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D104901
The metadata added in D102361 introduces a module flag that we can check
to determine if the module was compiled with `-fopenmp` enables. We can
now check for the precense of this instead of scanning the call graph
for OpenMP runtime functions.
Depends on D102361
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D102423
Types should be defined in function scope instead of a local lexical scope. Field types should be defined inside in its parent type scope.
We were seeing a type defined in a local scope causing trouble to the dwarf emitter where a context is required to be a funciton scope, a namespace or a global scope.
Reviewed By: aprantl
Differential Revision: https://reviews.llvm.org/D104937
If we have a umul.with.overflow where the multiply result is not used and one of the operands is a constant, we can perform the overflow check cheaper with a comparison then by performing the multiply and extracting the overflow flag.
(Noticed when looking at the conditions SCEV emits for overflow checks.)
Differential Revision: https://reviews.llvm.org/D104665
Rather than relying on pointer type equality (which, for a change,
is silently incorrect with opaque pointers) check that the GEP
source element types match.
WPD currently assumes that there is a one to one correspondence between
type test assume sequences and virtual calls. However, with
-fstrict-vtable-pointers this may not be true. This ends up causing
crashes when we try to optimize a virtual call more than once (
applyUniformRetValOpt()/applyUniqueRetValOpt()/applyVirtualConstProp()/applySingleImplDevirt()).
applySingleImplDevirt() actually didn't previous crash because it would
replace the devirtualized call with the same direct call. Adding an
assert that the call is indirect causes the corresponding test to crash
with the rest of the patch.
This makes Chrome successfully build with -fstrict-vtable-pointers + WPD.
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D104798
The whole transform can be dropped once we have fully transitioned
to opaque pointers (as it's purpose is to remove no-op pointer
casts). For now, make sure that it handles opaque pointers correctly.
- When emitting libcalls, do not only pass the calling convention from the
function prototype but also the attributes.
- Do not pass attributes from e.g. libc memcpy to llvm.memcpy.
Review: Reid Kleckner, Eli Friedman, Arthur Eubanks
Differential Revision: https://reviews.llvm.org/D103992
Similar to what we already do for `ret` terminators.
As noted by @rnk, clang seems to already generate a single `ret`/`resume`,
so this isn't likely to cause widespread changes.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D104849
This patch enables the salvaging of debug values that may be calculated
from more than one SSA value, such as with binary operators that do not
use a constant argument. The actual functionality for this behaviour is
added in a previous commit (c7270567), but with the ability to actually
emit the resulting debug values switched off.
The reason for this is that the prior patch has been reverted several
times due to issues discovered downstream, some time after the actual
landing of the patch. The patch in question is rather large and touches
several widely used header files, and all issues discovered are more
related to the handling of variadic debug values as a whole rather than
the details of the patch itself. Therefore, to minimize the build time
impact and risk of conflicts involved in any potential future
revert/reapply of that patch, this significantly smaller patch (that
touches no header files) will instead be used as the capstone to enable
variadic debug value salvaging.
The review linked to this patch is mostly implemented by the previous
commit, c7270567, but also contains the changes in this patch.
Differential Revision: https://reviews.llvm.org/D91722
Based ontop of D104598, which is a NFCI-ish refactoring.
Here, a restriction, that only empty blocks can be merged, is lifted.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D104597
This is a partial reapply of the original commit and the followup commit
that were previously reverted; this reapply also includes a small fix
for a potential source of non-determinism, but also has a small change
to turn off variadic debug value salvaging, to ensure that any future
revert/reapply steps to disable and renable this feature do not risk
causing conflicts.
Differential Revision: https://reviews.llvm.org/D91722
This reverts commit 386b66b2fc.
This enable no_sanitize C++ attribute to exclude globals from hwasan
testing, and automatically excludes other sanitizers' globals (such as
ubsan location descriptors).
Differential Revision: https://reviews.llvm.org/D104825
Create an internal alias with the original name for static functions
that are renamed in promoteInternals to avoid breaking inline
assembly references to them.
This relands commit 4474958d3a
with a fix to a use-of-uninitialized-value error that tripped
MemorySanitizer.
Link: https://github.com/ClangBuiltLinux/linux/issues/1354
Reviewed By: nickdesaulniers, pcc
Differential Revision: https://reviews.llvm.org/D104058
This attribute uses Attributor's internal 'optimistic' call graph
information to answer queries about function call reachability.
Functions can become reachable over time as new call edges are
discovered.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D104599
Make getPointersDiff() and sortPtrAccesses() compatible with opaque
pointers by explicitly passing in the element type instead of
determining it from the pointer element type.
The SLPVectorizer result is slightly non-optimal in that unnecessary
pointer bitcasts are added.
Differential Revision: https://reviews.llvm.org/D104784
If a ctlz operation is performed on higher datatype and then
downcasted, then this can be optimized by doing a ctlz operation
on a lower datatype and adding the difference bitsize to the result
of ctlz to provide the same output:
https://alive2.llvm.org/ce/z/8uup9M
The original problem is shown in
https://llvm.org/PR50173
Differential Revision: https://reviews.llvm.org/D103788
This is part of improving floating-point patterns seen in:
https://llvm.org/PR39480
We don't require any FMF because the 2 potential corner cases
(-0.0 and NaN) are correctly handled without FMF:
1. -0.0 is treated as strictly less than +0.0 with
maximum/minimum, so fabs/fneg work as expected.
2. +/- 0.0 with maxnum/minnum is indeterminate, so
transforming to fabs/fneg is more defined.
3. The sign of a NaN may be altered by this transform,
but that is allowed in the default FP environment.
If there are FMF, they are propagated from the min/max call to
one or both new operands which seems to agree with Alive2:
https://alive2.llvm.org/ce/z/bem_xC
This changes the approach taken to tail-merge the blocks
to always create a new block instead of trying to reuse some block,
and generalizes it to support dealing not with just the `ret` in the future.
This effectively lifts the CallBr restriction, although this isn't really intentional.
That is the only non-NFC change here, i'm not sure if it's reasonable/feasible to temporarily retain it.
Other restrictions of the transform remain.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D104598
With regards to overrunning, the langref (llvm/docs/LangRef.rst)
specifies:
(llvm.experimental.vector.insert)
Elements ``idx`` through (``idx`` + num_elements(``subvec``) - 1)
must be valid ``vec`` indices. If this condition cannot be determined
statically but is false at runtime, then the result vector is
undefined.
(llvm.experimental.vector.extract)
Elements ``idx`` through (``idx`` + num_elements(result_type) - 1)
must be valid vector indices. If this condition cannot be determined
statically but is false at runtime, then the result vector is
undefined.
For the non-mixed cases (e.g. inserting/extracting a scalable into/from
another scalable, or inserting/extracting a fixed into/from another
fixed), it is possible to statically check whether or not the above
conditions are met. This was previously missing from the verifier, and
if the conditions were found to be false, the result of the
insertion/extraction would be replaced with an undef.
With regards to invalid indices, the langref (llvm/docs/LangRef.rst)
specifies:
(llvm.experimental.vector.insert)
``idx`` represents the starting element number at which ``subvec``
will be inserted. ``idx`` must be a constant multiple of
``subvec``'s known minimum vector length.
(llvm.experimental.vector.extract)
The ``idx`` specifies the starting element number within ``vec``
from which a subvector is extracted. ``idx`` must be a constant
multiple of the known-minimum vector length of the result type.
Similarly, these conditions were not previously enforced in the
verifier. In some circumstances, invalid indices were permitted
silently, and in other circumstances, an undef was spawned where a
verifier error would have been preferred.
This commit adds verifier checks to enforce the constraints above.
Differential Revision: https://reviews.llvm.org/D104468
Follow-up on Roman's idea expressed in D103959.
- If a Phi has undefined inputs from live blocks:
- and no other inputs, assume it is undef itself;
- and exactly one non-undef input, we can assume that all undefs are equal to this input.
Differential Revision: https://reviews.llvm.org/D104618
Reviewed By: lebedev.ri, nikic
Zero factor leads to division by zero and failure of corresponding
assert as shown in PR50765. We should filter out such factors.
Differential Revision: https://reviews.llvm.org/D104702
Reviewed By: huihuiz, reames
This patch makes PriorityInlineOrder lazily updated.
The PriorityInlineOrder would lazily update the desirability of a call site if it's decreasing.
Reviewed By: kazu
Differential Revision: https://reviews.llvm.org/D104654
This patch fixes a problem with the AAExecutionDomain attributor not
checking if it is in a valid state. This can cause it to incorrectly
return that a block is executed in a single threaded context after the
attributor failed for any reason.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D103186
After landing the globalization optimizations, the precense of globalization on
the device that was not put in shared or stack memory is a failed optimization
with performance consequences so it should indicate a missed remark.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D104735
When the load type is changed to ptr, we need the load pointer type
to also be ptr, because it's not allowed to create a pointer to an
opaque pointer. This is achieved by adjusting the getPointerTo() API
to return an opaque pointer for an opaque pointer base type.
Differential Revision: https://reviews.llvm.org/D104718
Right now the Attributor defaults to 32 fixed point iterations unless it is set
explicitly by a command line flag. This patch allows this to be configured when
the attributor instance is created. The maximum is then increased in OpenMPOpt
if the target is a kernel. This is because the globalization analysis can result
in larger iteration counts due to many dependent instances running at once.
Depends on D102444
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D104416
Summary:
This patch adds support for the Attributor to emit remarks on behalf of some
other pass. The attributor can now optionally take a callback function that
returns an OptimizationRemarkEmitter object when given a Function pointer. If
this is availible then a remark will be emitted for the corresponding pass
name.
Depends on D102197
Reviewed By: sstefan1 thegameg
Differential Revision: https://reviews.llvm.org/D102444
Summary:
The changes to globalization introduced in D97680 introduce a large amount of overhead by default. The old globalization method would always ignore globalization code if executing in SPMD mode. This wasn't strictly correct as data sharing is still possible in SPMD mode. The new interface is correct but introduces globalization code even when unnecessary. This optimization will use the existing HeapToStack transformation in the attributor to allow for unneeded globalization to be replaced with thread-private stack memory. This is done using the newly introduced library instances for the RTL functions added in D102087.
Depends on D97818
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D102197
Summary:
Currently the attributor needs to give up if a function has external linkage.
This means that the optimization introduced in D97818 will only apply to static
functions. This change uses the Attributor to internalize OpenMP device
routines by making a copy of each function with private linkage and replacing
the uses in the module with it. This allows for the optimization to be applied
to any regular function.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D102824
Summary:
The changes introduced in D97680 create a simpler interface to code that needs
to be globalized. This interface is used to simplify the globalization calls in
the middle end. We can check any globalization call that is only called by a
single thread in the team and replace it with a static shared memory buffer.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D97818
This adds support for addrspace casts involving opaque pointers to
InstCombine, as well as the isEliminableCastPair() helper
(otherwise the assertion failure would just move there).
Add PointerType::hasSameElementTypeAs() to hide the element type
details.
Differential Revision: https://reviews.llvm.org/D104668
There was a bug from cost calculation for partially invariant unswitch.
The costs of non-duplicated blocks are substracted from the total LoopCost, so
anything that is duplicated should not be counted.
Differential Revision: https://reviews.llvm.org/D103816
Summary:
Memory globalization is required to maintain OpenMP standard semantics for data sharing between
worker and master threads. The GPU cannot share data between its threads so must allocate global or
shared memory to store the data in. Currently this is implemented fully in the frontend using the
`__kmpc_data_sharing_push_stack` and __kmpc_data_sharing_pop_stack` functions to emulate standard
CPU stack sharing. The front-end scans the target region for variables that escape the region and
must be shared between the threads. Each variable then has a field created for it in a global record
type.
This patch replaces this functinality with a single allocation command, effectively mimicing an
alloca instruction for the variables that must be shared between the threads. This will be much
slower than the current solution, but makes it much easier to optimize as we can analyze each
variable independently and determine if it is not captured. In the future, we can replace these
calls with an `alloca` and small allocations can be pushed to shared memory.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D97680