Currently context strings contain a lot of duplicated function names and that significantly increase the profile size. This change split the context into a series of {name, offset, discriminator} tuples so function names used in the context can be replaced by the index into the name table and that significantly reduce the size consumed by context.
A follow-up improvement made in the compiler and profiling tools is to avoid reconstructing full context strings which is time- and memory- consuming. Instead a context vector of `StringRef` is adopted to represent the full context in all scenarios. As a result, the previous prevalent profile map which was implemented as a `StringRef` is now engineered as an unordered map keyed by `SampleContext`. `SampleContext` is reshaped to using an `ArrayRef` to represent a full context for CS profile. For non-CS profile, it falls back to use `StringRef` to represent a contextless function name. Both the `ArrayRef` and `StringRef` objects are underpinned by real array and string objects that are stored in producer buffers. For compiler, they are maintained by the sample reader. For llvm-profgen, they are maintained in `ProfiledBinary` and `ProfileGenerator`. Full context strings can be generated only in those cases of debugging and printing.
When it comes to profile format, nothing has changed to the text format, though internally CS context is implemented as a vector. Extbinary format is only changed for CS profile, with an additional `SecCSNameTable` section which stores all full contexts logically in the form of `vector<int>`, which each element as an offset points to `SecNameTable`. All occurrences of contexts elsewhere are redirected to using the offset of `SecCSNameTable`.
Testing
This is no-diff change in terms of code quality and profile content (for text profile).
For our internal large service (aka ads), the profile generation is cut to half, with a 20x smaller string-based extbinary format generated.
The compile time of ads is dropped by 25%.
Differential Revision: https://reviews.llvm.org/D107299
Currently, the IROutliner uses a simple metric to outline the largest amount
of IR possible to outline first if it fits the cost model. This is model
loses out on smaller blocks of code that have higher reductions in cost that
are contained within larger blocks of IR.
This reverses the order, where we calculate all of the costs first, and then
reorder and extract items based on the calculated results.
Reviewers: paquette
Differential Revision: https://reviews.llvm.org/D106440
Occasionally instructions are between the last instruction in a region,
and the following instruction as identified by the Candidate. This
adds an extra check right before splitting a candidate that excludes the region from being split/checked for outlining to remove errors.
Tests Added:
Tranforms/IROuutliner/outlining-extra-bitcasts.ll
Reviewer: paquette, jroelofs
Differential Revision: https://reviews.llvm.org/D104142
Previously, we'd expand *ALL* the SCEV's eagerly, because we needed to
check with `isValidRewrite()`, and discard bad rewrite candidates,
but now that we do not do that, we also don't need to always expand.
In particular, this avoids expanding potentially-huge SCEV's that we
would discard anyways because they are high-cost and we aren't
rewriting aggressively.
`isValidRewrite()` checks that the both the original SCEV,
and the rewrite SCEV have the same base pointer.
I //believe//, after all the recent SCEV improvements,
this invariant is already enforced by SCEV itself.
I originally tried changing it into an assert in D108043,
but that showed that it triggers on e.g. https://reviews.llvm.org/D108043#2946621,
where SCEV manages to forward the store to load,
test added.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D108655
After applying VPlan-to-VPlan transformations, using IR references to
query VPlan values may be incorrect, as the IR is not in sync with the
VPlan any longer.
To better detect such mis-matches, this patch introduces a new flag to
VPlans to indicate whether it is safe to query VPValues using IR values.
getVPValue is updated to assert if it is called when the flag indicates
it is not safe any longer.
There is an escape hatch via an extra argument, because there are 3
places that need to be fixed first. Those are
1. truncateToMinimalBitwidths
2. clearReductionWrapFlags
3. fixLCSSAPHIs
As a first step, this flag will help preventing new code from violating
this property.
Any suggestions with respect to naming very welcome!
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D108573
We cannot leak any equivalency information by comparing against null
since null never has virtual metadata associated with it (when null is
not a valid dereferenceable pointer).
Instcombine seems to make sure that a null will be on the RHS, so we
don't have to check both operands.
This fixes a missed optimization in llvm-test-suite's MultiSource lambda
benchmark under -fstrict-vtable-pointers.
Reviewed By: Prazek
Differential Revision: https://reviews.llvm.org/D108734
ExposePointerBase() in SCEVExpander implements basically the same
functionality as removePointerBase() in SCEV, so reuse it.
The SCEVExpander code assumes that the pointer operand on adds is
the last one -- I'm not sure that always holds. As such this might
not be strictly NFC.
There can only be one pointer operand in an add expression, and
we have sorted operands to guarantee that it is the first. As
such, the pointer check for other operands is dead code.
When the initial relationship between two pairs of values between
similar sections is ambiguous to commutativity, arguments to the
outlined functions can be passed in such that the order is incorrect,
causing miscompilations. This adds a canonical mapping to each
similarity section, so that we can maintain the relationship of global
value numbering from one section to another.
Added Tests:
Transforms/IROutliner/outlining-commutative-operands-opposite-order.ll
unittests/Analysis/IRSimilarityIdentifierTest.cpp - IRSimilarityCandidate:CanonicalNumbering
Reviewers: jroelofs, jpaquette, yroux
Differential Revision: https://reviews.llvm.org/D104143
This is another followup to D106591. Even if there is an
instruction that clobbers one of the loads, this doesn't matter if
it happens before the loads. Those instructions aren't affected by
the transform at all.
The gep-references-bb.ll is modified to preserve the spirit of the
test, as the store to @g no longer impacts the transform.
Differential Revision: https://reviews.llvm.org/D108782
We'd added support a while back from breaking the backedge if SCEV can prove the trip count is zero. However, we used the exact trip count which requires *all* exits be analyzeable. I noticed while writing test cases for another patch that this disallows cases where one exit is provably taken paired with another which is unknown. This patch adds the upper bound case.
We could use a symbolic max trip count here instead, but we use an isKnownNonZero filter (presumably for compile time?) for the first-iteration reasoning. I decided this was a more obvious incremental step, and we could go back and untangle the schemes separately.
Differential Revision: https://reviews.llvm.org/D108833
IIUC we can't emit `memcmp` between pointers in addressspaces,
doing so will trigger an assertion since the signature of the memcmp
will not match it's arguments (https://bugs.llvm.org/show_bug.cgi?id=48661).
This PR disables the attempt to merge icmps,
when the pointer is in an addressspace.
Reviewed By: #julialang, vtjnash
Differential Revision: https://reviews.llvm.org/D94813
Recursion can happen when we see a PHI use the second time or when we
look at a store value operand use again. We already visited the
potential copies and doing so again will just cause endless looping.
Reviewed By: kuter
Differential Revision: https://reviews.llvm.org/D108190
For now we do should not treat byval arguments as local copies performed
on the call edge, though, in general we should. To make that happen we
need to teach various passes, e.g., DSE, about the copy effect of a
byval. That would also allow us to mark functions only accessing byval
arguments as readnone again, atguably their acceses have no effect
outside of the function, like accesses to allocas.
Reviewed By: kuter
Differential Revision: https://reviews.llvm.org/D108140
Changes since aec08e:
* Adjust placement of a closing brace so that the general case actually runs. Turns out we had *no* coverage of the switch case. I added one in eae90fd.
* Drop .llvm.loop.* metadata from the new branch as there is no longer a loop to annotate.
Original commit message:
This special cases an unconditional latch and a conditional branch latch exit to improve codegen and test readability. I am hoping to reuse this function in the runtime unroll code, but without this change, the test diffs are far too complex to assess.
We try to forward a stored-once-constant-value from one global access
to another, but that's not safe if the constant value is an expression
that can trap.
The tests are reduced from the miscompile examples in:
https://llvm.org/PR47578
Differential Revision: https://reviews.llvm.org/D108771
Reworked reordering algorithm. Originally, the compiler just tried to
detect the most common order in the reordarable nodes (loads, stores,
extractelements,extractvalues) and then fully rebuilding the graph in
the best order. This was not effecient, since it required an extra
memory and time for building/rebuilding tree, double the use of the
scheduling budget, which could lead to missing vectorization due to
exausted scheduling resources.
Patch provide 2-way approach for graph reodering problem. At first, all
reordering is done in-place, it doe not required tree
deleting/rebuilding, it just rotates the scalars/orders/reuses masks in
the graph node.
The first step (top-to bottom) rotates the whole graph, similarly to the previous
implementation. Compiler counts the number of the most used orders of
the graph nodes with the same vectorization factor and then rotates the
subgraph with the given vectorization factor to the most used order, if
it is not empty. Then repeats the same procedure for the subgraphs with
the smaller vectorization factor. We can do this because we still need
to reshuffle smaller subgraph when buildiong operands for the graph
nodes with lasrger vectorization factor, we can rotate just subgraph,
not the whole graph.
The second step (bottom-to-top) scans through the leaves and tries to
detect the users of the leaves which can be reordered. If the leaves can
be reorder in the best fashion, they are reordered and their user too.
It allows to remove double shuffles to the same ordering of the operands in
many cases and just reorder the user operations instead. Plus, it moves
the final shuffles closer to the top of the graph and in many cases
allows to remove extra shuffle because the same procedure is repeated
again and we can again merge some reordering masks and reorder user nodes
instead of the operands.
Also, patch improves cost model for gathering of loads, which improves
x264 benchmark in some cases.
Gives about +2% on AVX512 + LTO (more expected for AVX/AVX2) for {625,525}x264,
+3% for 508.namd, improves most of other benchmarks.
The compile and link time are almost the same, though in some cases it
should be better (we're not doing an extra instruction scheduling
anymore) + we may vectorize more code for the large basic blocks again
because of saving scheduling budget.
Differential Revision: https://reviews.llvm.org/D105020
The Code Extractor does not provide an easy mechanism for determining the
inputs and outputs after extraction has occurred, this patch gives the
ability to pass in empty SetVectors to be filled with the inputs and
outputs if they need to be analyzed.
Added Tests:
- InputOutputMonitoring in unittests/Transforms/Utils/CodeExtractorTests.cpp
Reviewers: paquette
Differential Revision: https://reviews.llvm.org/D106991
Since LICM has now unconditionally moved to MemorySSA based form, all
passes that run in same LPM as LICM need to preserve MemorySSA (i.e. our
downstream pipeline).
Added loop-mssa to all tests and perform -verify-memoryssa within
LoopPredication itself.
Differential Revision: https://reviews.llvm.org/D108724
Reworked reordering algorithm. Originally, the compiler just tried to
detect the most common order in the reordarable nodes (loads, stores,
extractelements,extractvalues) and then fully rebuilding the graph in
the best order. This was not effecient, since it required an extra
memory and time for building/rebuilding tree, double the use of the
scheduling budget, which could lead to missing vectorization due to
exausted scheduling resources.
Patch provide 2-way approach for graph reodering problem. At first, all
reordering is done in-place, it doe not required tree
deleting/rebuilding, it just rotates the scalars/orders/reuses masks in
the graph node.
The first step (top-to bottom) rotates the whole graph, similarly to the previous
implementation. Compiler counts the number of the most used orders of
the graph nodes with the same vectorization factor and then rotates the
subgraph with the given vectorization factor to the most used order, if
it is not empty. Then repeats the same procedure for the subgraphs with
the smaller vectorization factor. We can do this because we still need
to reshuffle smaller subgraph when buildiong operands for the graph
nodes with lasrger vectorization factor, we can rotate just subgraph,
not the whole graph.
The second step (bottom-to-top) scans through the leaves and tries to
detect the users of the leaves which can be reordered. If the leaves can
be reorder in the best fashion, they are reordered and their user too.
It allows to remove double shuffles to the same ordering of the operands in
many cases and just reorder the user operations instead. Plus, it moves
the final shuffles closer to the top of the graph and in many cases
allows to remove extra shuffle because the same procedure is repeated
again and we can again merge some reordering masks and reorder user nodes
instead of the operands.
Also, patch improves cost model for gathering of loads, which improves
x264 benchmark in some cases.
Gives about +2% on AVX512 + LTO (more expected for AVX/AVX2) for {625,525}x264,
+3% for 508.namd, improves most of other benchmarks.
The compile and link time are almost the same, though in some cases it
should be better (we're not doing an extra instruction scheduling
anymore) + we may vectorize more code for the large basic blocks again
because of saving scheduling budget.
Differential Revision: https://reviews.llvm.org/D105020
This patch adds initial support to use facts from @llvm.assume calls. It
intentionally does not handle all possible cases to keep things simple
initially.
For now, the condition from an assume is made available on entry to the
containing block, if the assume is guaranteed to execute. Otherwise it
is only made available in the successor blocks.
The change adds a switch to allow sample loader to use global pre-inliner's decision instead. The pre-inliner in llvm-profgen makes inline decision globally based on whole program profile and function byte size as cost proxy.
Since pre-inliner also adjusts/merges context profile based on its inline decision, honoring its inline decision in sample loader would lead to better post-inline profile quality especially for thinlto where cross module profile merging isn't possible without pre-inliner.
Minor fix in profile reader is also included. When pre-inliner is use, we now also turn off the default merging and trimming logic unless it's explicitly asked.
Differential Revision: https://reviews.llvm.org/D108677
The instruction extractelement/extractvalue are not required to
be scheduled since they only depend on the source vector/aggregate (with
constant indices), smae applies to the parent basic block checks.
Improves compile time and saves scheduling budget.
Differential Revision: https://reviews.llvm.org/D108703
This is a follow up diff for BinarySizeContextTracker to track zero size for fully optimized inlinee. When an inlinee is fully optimized away, we won't be able to get its size through symbolizing instructions, hence we will treat the corresponding context size as unknown. However by traversing the inlined probe forest, we know what're original inlinees regardless of optimization. If a context show up in inlined probes, but not during symbolization, we know that it's fully optimized away hence its size is zero instead of unknown. It should provide more accurate size cost estimation for pre-inliner to make better inline decisions in llvm-profgen.
Differential Revision: https://reviews.llvm.org/D108350
The implementation uses the int_asan_check_memaccess intrinsic to instrument the code. The intrinsic is replaced by a call to a function which performs the access check. The generated function names encode the input register name as a number using Reg - X86::NoRegister formula.
Reviewed By: vitalybuka
Differential Revision: https://reviews.llvm.org/D107850
Support for peeling with multiple exit blocks was added in D63921/77bb3a486fa6.
So far it has only been enabled for loops where all non-latch exits are
'de-optimizing' exits (D63923). But peeling of multi-exit loops can be
highly beneficial in other cases too, like if all non-latch exiting
blocks are unreachable.
The motivating case are loops with runtime checks, like the C++ example
below. The main issue preventing vectorization is that the invariant
accesses to load the bounds of B is conditionally executed in the loop
and cannot be hoisted out. If we peel off the first iteration, they
become dereferenceable in the loop, because they must execute before the
loop is executed, as all non-latch exits are terminated with
unreachable. This subsequently allows hoisting the loads and runtime
checks out of the loop, allowing vectorization of the loop.
int sum(std::vector<int> *A, std::vector<int> *B, int N) {
int cost = 0;
for (int i = 0; i < N; ++i)
cost += A->at(i) + B->at(i);
return cost;
}
This gives a ~20-30% increase of score for Geekbench5/HDR on AArch64.
Note that this requires a follow-up improvement to the peeling cost
model to actually peel iterations off loops as above. I will share that
shortly.
Also, peeling of multi-exits might be beneficial for exit blocks with
other terminators, but I would like to keep the scope limited to known
high-reward cases for now.
I removed the option to disable peeling for multi-deopt exits because
the code is more general now. Alternatively, the option could also be
generalized, but I am not sure if there's much value in the option?
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D108108
This change fixes issue found by Markus: https://reviews.llvm.org/rG11338e998df1
Before this patch following code was transformed to memmove:
for (int i = 15; i >= 1; i--) {
p[i] = p[i-1];
sum += p[i-1];
}
However load from p[i-1] is used not only by store to p[i] but also by sum computation.
Therefore we cannot emit memmove in loop header.
Differential Revision: https://reviews.llvm.org/D107964
The NS==0 condition used by D103717 missed a corner case: if the current copy
does not have a hash suffix (e.g. weak_odr), a copy with value profiling (with a
different CFG) may exist. This is super rare, but is possible with pre-inlining
PGO instrumentation (which can make a weak_odr function inlines its callees
differently, sometimes with value profiling while sometimes without).
If the current copy with private profd is prevailing, the non-prevailing copy
may get an undefined symbol if a caller inlining the non-prevailing function
references its profd. If the other copy with non-private profd is prevailing,
the current copy may cause a "relocation to discarded section" linker error.
The fix is straightforward: just keep non-private profd in such a `DataReferencedByCode` case.
With this change, a stage 2 (`-DLLVM_TARGETS_TO_BUILD=X86 -DLLVM_BUILD_INSTRUMENTED=IR`)
clang is 0.08% larger (172431496/172286720-1).
`stat -c %s **/*.o | awk '{s+=$1}END{print s}' is 0.026% larger.
The majority of D103717's benefits remains.
Reviewed By: xur
Differential Revision: https://reviews.llvm.org/D108432
This reverts commit f653beea88.
It broke Windows coverage-inline.cpp because link.exe has a limitation
that external symbols in IMAGE_COMDAT_SELECT_ASSOCIATIVE don't work.
It essentially dropped the previous size optimization for coverage
because coverage doesn't rename comdat by default.
Needs more investigation what we should do.
The NS==0 condition used by D103717 missed a corner case: if the current copy
does not have a hash suffix (e.g. weak_odr), a copy with value profiling (with a
different CFG) may exist. This is super rare, but is possible with pre-inlining
PGO instrumentation (which can make a weak_odr function inlines its callees
differently, sometimes with value profiling while sometimes without).
If the current copy with private profd is prevailing, the non-prevailing copy
may get an undefined symbol if a caller inlining the non-prevailing function
references its profd. If the other copy with non-private profd is prevailing,
the current copy may cause a "relocation to discarded section" linker error.
The fix is straightforward: just keep non-private profd in this case.
With this change, a stage 2 (`-DLLVM_TARGETS_TO_BUILD=X86 -DLLVM_BUILD_INSTRUMENTED=IR`)
clang is 0.08% larger (172431496/172286720-1).
`stat -c %s **/*.o | awk '{s+=$1}END{print s}' is 0.026% larger.
The majority of D103717's benefits remains.
Reviewed By: xur
Differential Revision: https://reviews.llvm.org/D108432
The implementation uses the int_asan_check_memaccess intrinsic to instrument the code. The intrinsic is replaced by a call to a function which performs the access check. The generated function names encode the input register name as a number using Reg - X86::NoRegister formula.
Reviewed By: vitalybuka
Differential Revision: https://reviews.llvm.org/D107850
The IRPGOFlag symbol (__llvm_profile_raw_version) is dropped when
identified as non-prevailing for either regular or thin LTO during
the mixed-LTO mode compilation. This happens in the module where
IRPGOFlag is marked as non-prevailing. This variable
is emitted in the final object from the prevailing module.
This is still problematic because we currently query this symbol
to coordinate some actions between PGOInstrumentation pass
and InstrProfiling lowering pass, like whether to do value
profiling, whether to do comdat renaming.
This problem is bought up by YolandaCY in
https://reviews.llvm.org/D107034
YolandCY reported unresolved symbol linker errors in
CSPGO instrumentation build for chromium.
This patch let LTO retain IRPGOFlag decl by adding it to
CompilerUsed list and relax the check in isIRPGOFlagSet() when
doing the InstrProfiling lowering.
The test case in the patch is from D107034
<https://reviews.llvm.org/D107034>.
Differential Revision: https://reviews.llvm.org/D108581
Add `ashr` instruction to the DAG post-dominated by `trunc`, allowing
`TruncInstCombine` to reduce bitwidth of expressions containing
these instructions.
We should be shifting by less than the target bitwidth.
Also it is sufficient to require that all truncated bits
of the value-to-be-shifted are sign bits (all zeros or ones) and
one sign bit is left untruncated: https://alive2.llvm.org/ce/z/Ajo2__
Part of https://reviews.llvm.org/D107766
Differential Revision: https://reviews.llvm.org/D108355
This is a re-try of 3aa009cc87 which was reverted at
9577fac0fd because it caused an infinite loop.
For the extra test case, either re-ordering the transforms
or adding the extra clause to avoid sub-of-sub is enough
to prevent the infinite compile, but I'm doing both to be
safer.
Original commit message:
The motivation was to get min/max intrinsics to parity
with cmp+select idioms, but this unlocks a few more
folds because isFreeToInvert recognizes add/sub with
constants too.
In the min/max example, we have too many extra uses
for smaller folds to improve things, but this fold
is able to eliminate uses even though we can't reduce
the number of instructions.
Reverted (manually due to merge conflicts) while regressions reported on PR51540 are investigated
As noticed on D106352, after we've folded "(select C, (gep Ptr, Idx), Ptr) -> (gep Ptr, (select C, Idx, 0))" if the inner Ptr was also a (now one use) gep we could then merge the geps, using the sum of the indices instead.
I've limited this to basic 2-op geps - a more general case further down InstCombinerImpl.visitGetElementPtrInst doesn't have the one-use limitation but only creates the add if it can be created via SimplifyAddInst.
https://alive2.llvm.org/ce/z/f8pLfD (Thanks Roman!)
Differential Revision: https://reviews.llvm.org/D106450
This is a followup to D106591. MergeICmps currently only allows
sinking the loads past either instructions that don't write to
memory at all, or simple loads/stores that don't modify the memory
the loads access.
The "simple loads/stores" part of this check doesn't seem necessary
to me -- AA isModRef() already accurately models any operation
that may clobber the memory. For example, in the adjusted test case
the transform is still fine if the call to @foo() isn't readonly,
but inaccessiblememonly -- in both cases, the call cannot modify
the loaded memory.
Differential Revision: https://reviews.llvm.org/D108517
It would waste time to specialize a function which would inline finally.
This patch did two things:
- Don't specialize functions which are always-inline.
- Don't spescialize functions whose lines of code are less than threshold
(100 by default).
For spec2017int, this patch could reduce the number of specialized
functions by 33%. Then the compile time didn't increase for every
benchmark.
Reviewed By: SjoerdMeijer, xbolva00, snehasish
Differential Revision: https://reviews.llvm.org/D107897
4ad41902e8 changed this code to
propagate Changed if scalar GEP PRE is performed. However, as
implemented this would skip the load PRE entirely if GEP indices
were PREd. Make sure load PRE runs even if Changed is already
true.
This likely has no functional effect as load PRE would then
occur on a later GVN iteration.
This special cases an unconditional latch and a conditional branch latch exit to improve codegen and test readability. I am hoping to reuse this function in the runtime unroll code, but without this change, the test diffs are far too complex to assess.
The motivation was to get min/max intrinsics to parity
with cmp+select idioms, but this unlocks a few more
folds because isFreeToInvert recognizes add/sub with
constants too.
In the min/max example, we have too many extra uses
for smaller folds to improve things, but this fold
is able to eliminate uses even though we can't reduce
the number of instructions.
Adjusting the reduction recipes still relies on references to the
original IR, which can become outdated by the first-order recurrence
handling. Until reduction recipe construction does not require IR
references, move it before first-order recurrence handling, to prevent a
crash as exposed by D106653.
This may overlap partially with the reassociate pass,
but it seems simple enough that we should try it here
in InstCombine to enable other folds.
This shows up as an opportunity and potential regression
if we improve a subtract fold with 'not' ops to be more
general.
Letting it take SCEV allows further modification on the function to optimize
if the StoreSize / Stride is runtime determined.
The plan is to let memcpy / memmove deal with runtime-determined sizes, just
like what D107353 did to memset.
Reviewed By: bmahjour
Differential Revision: https://reviews.llvm.org/D108289
Currently it's possible to silently use a loop pass that does not
preserve MemorySSA in a loop-mssa pass manager, as we don't
statically know which loop passes preserve MemorySSA (as was the
case with the legacy pass manager).
However, we can at least add a check after the fact that if
MemorySSA is used, then it should also have been preserved.
Hopefully this will reduce confusion as seen in
https://bugs.llvm.org/show_bug.cgi?id=51020.
Differential Revision: https://reviews.llvm.org/D108399
This reverts commit f4122398e7 to
investigate a crash exposed by it.
The patch breaks building the code below with `clang -O2 --target=aarch64-linux`
int a;
double b, c;
void d() {
for (; a; a++) {
b += c;
c = a;
}
}
When the stack is not reset it keeps previously visited Basic Block
which results in bugs where an instruction is hoisted to a
predecessor where the instruction was not fully anticipable.
Differential Revision: https://reviews.llvm.org/D108425
The purpose of __attribute__((disable_sanitizer_instrumentation)) is to
prevent all kinds of sanitizer instrumentation applied to a certain
function, Objective-C method, or global variable.
The no_sanitize(...) attribute drops instrumentation checks, but may
still insert code preventing false positive reports. In some cases
though (e.g. when building Linux kernel with -fsanitize=kernel-memory
or -fsanitize=thread) the users may want to avoid any kind of
instrumentation.
Differential Revision: https://reviews.llvm.org/D108029
The only thing that function should do as per it's semantic,
is to ensure that the switch's default is a block consisting only of
an `unreachable` terminator.
So let's just create such a block and update switch's default
to point to it. There should be no need for all this weird dance
around predecessors/successors.
The COFF specific `DataReferencedByCode` complexity (D103372 D103717) is due to
a link.exe limitation: an external symbol in IMAGE_COMDAT_SELECT_ASSOCIATIVE is
not really dropped, so it can cause duplicate definition error.
This patch extends the runtime unrolling infrastructure to support unrolling a loop with multiple exiting blocks branching to the same exit block used by the latch. It intentionally does not include a cost model change to enable this functionality unless appropriate force flags are used.
This is the prolog companion to D107381. Since this was LGTMed, a problem with DT updating was reported against that patch. I roled in the analogous fix here as it seemed obvious, and not worth re-review.
As an aside, our prolog form leaves a lot of potential value on the floor when there is an invariant load or invariant condition in the loop being runtime unrolled. We should probably consider a "required prolog" heuristic. (Alternatively, maybe we should be peeling these cases more aggressively?)
Differential Revision: https://reviews.llvm.org/D108262
As reported on https://bugs.llvm.org/show_bug.cgi?id=51020, the
guard widening pass doesn't preserve MemorySSA, so it can no
longer be scheduled in the same loop pass manager as LICM. However,
the loop-schedule.ll test indicates that this is supposed to work.
Fix this by preserving MemorySSA if available, as this seems to be
trivial in this case (we only need to drop the memory access for
the removed guards).
Differential Revision: https://reviews.llvm.org/D108386
In 94d0914, I added support for unrolling of multiple exit loops which have multiple exits reaching the latch. Per reports on the review post commit, I'd missed updating the domtree for one case. This fix addresses that ommission.
There's no new test as this is covered by existing tests with expensive verification turned on.
Folding a GEP from outside to inside a loop will materialize an add where there wasn't an equivalent operation before. Check the containing loops before making this fold.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D107935
This reverts commit 9934a5b2ed.
This patch may cause miscompiles because it missed a constraint
as shown in the examples from:
https://llvm.org/PR51531
This makes the intrinsic logic match the cmp+select idiom folds
just below. It's not clearly a win either way unless we think
that a 'not' op costs more than min/max.
The cmp+select folds on these patterns are more extensive than
the intrinsics currently and may have some complicated interactions,
so I'm trying to make those line up and bring the optimizations
for intrinsics up to parity.
There is an assertion failure in computeOverflowForUnsignedMul
(used in checkOverflow) due to the inner and outer trip counts
having different types. This occurs when the IV has been widened,
but the loop components are not successfully rediscovered.
This is fixed by some refactoring of the code in findLoopComponents
which identifies the trip count of the loop.
Differential Revision: https://reviews.llvm.org/D108107
Refactored implementation of AddressSanitizerPass and
HWAddressSanitizerPass to use pass options similar to passes like
MemorySanitizerPass. This makes sure that there is a single mapping
from class name to pass name (needed by D108298), and options like
-debug-only and -print-after makes a bit more sense when (despite
that it is the unparameterized pass name that should be used in those
options).
A result of the above is that some pass names are removed in favor
of the parameterized versions:
- "khwasan" is now "hwasan<kernel;recover>"
- "kasan" is now "asan<kernel>"
- "kmsan" is now "msan<kernel>"
Differential Revision: https://reviews.llvm.org/D105007
I have added a new TTI interface called enableOrderedReductions() that
controls whether or not ordered reductions should be enabled for a
given target. By default this returns false, whereas for AArch64 it
returns true and we rely upon the cost model to make sensible
vectorisation choices. It is still possible to override the new TTI
interface by setting the command line flag:
-force-ordered-reductions=true|false
I have added a new RUN line to show that we use ordered reductions by
default for SVE and Neon:
Transforms/LoopVectorize/AArch64/strict-fadd.ll
Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll
Differential Revision: https://reviews.llvm.org/D106653
This change enables llvm-profgen to use accurate context-sensitive post-optimization function byte size as a cost proxy to drive global preinline decisions.
To do this, BinarySizeContextTracker is introduced to track function byte size under different inline context during disassembling. In preinliner, we can not query context byte size under switch `context-cost-for-preinliner`. The tracker uses a reverse trie to keep size of functions under different context (callee as parent, caller as child), and it can give best/longest possible matching context size for given input context.
The new size cost is off by default. There're a few TODOs that needs to addressed: 1) avoid dangling string from `Offset2LocStackMap`, which will be addressed in split context work; 2) using inlinee's entry probe to make sure we have correct zero size for inlinee that's completely optimized away after inlining. Some tuning is also needed.
Differential Revision: https://reviews.llvm.org/D108180
This patch implements Flow Sensitive Sample FDO (FSAFDO) profile
loader. We have two profile loaders for FS profile,
one before RegAlloc and one before BlockPlacement.
To enable it, when -fprofile-sample-use=<profile> is specified,
add "-enable-fs-discriminator=true \
-disable-ra-fsprofile-loader=false \
-disable-layout-fsprofile-loader=false"
to turn on the FS profile loaders.
Differential Revision: https://reviews.llvm.org/D107878
MSSA-based LICM has been enabled by default for a few years now.
This drops the old AST-based implementation. Using loop(licm) will
result in a fatal error, the use of loop-mssa(licm) is required
(or just licm, which defaults to loop-mssa).
Note that the core canSinkOrHoistInst() logic has to retain AST
support for now, because it is shared with LoopSink.
Differential Revision: https://reviews.llvm.org/D108244
Decoupling the unrolling logic into three different functions. The shouldPragmaUnroll() covers the 1st and 2nd priorities of the previous code, the shouldFullUnroll() covers the 3rd, and the shouldPartialUnroll() covers the 5th. The output of each function, Optional<unsigned>, could be a value for UP.Count, which means unrolling factor has been set, or None, which means decision hasn't been made yet and should try the next priority.
Reviewed By: mtrofin, jdoerfert
Differential Revision: https://reviews.llvm.org/D106001
When dealing with memmove, we also add the load instruction to the ignored
instructions list passed to `mayLoopAccessLocation`. Renaming "Stores" to
"IgnoredInsts" to be more precise.
Differential Revision: https://reviews.llvm.org/D108275
This very occasionally causes to an assertion failure in the compiler.
Turning off until we can get to the bottom of this.
Reviewed By: hctim
Differential Revision: https://reviews.llvm.org/D108282
Currently, AAKernelInfo will fail on an assertion if we attempt to run
it on a kernel without the init / deinit runtime calls. However, this
occurs for global constructors on the device. This will cause OpenMPOpt
to crash whenever global constructors are present. This patch removes
this assertion and just gives up instead.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D108258
This patch extends the runtime unrolling infrastructure to support unrolling a loop with multiple exiting blocks branching to the same exit block used by the latch. It intentionally does not include a cost model change to enable this functionality unless appropriate force flags are used.
I decided to restrict this to the epilogue case. Given the changes ended up being pretty generic, we may be able to unblock the prolog case too, but I want to do that in a separate change to reduce the amount of code we all have to understand at one time.
Differential Revision: https://reviews.llvm.org/D107381
In the provided test case, we were trying to set the global's
initializer to `i32* null` when the global's value type was `@0`.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D108232
Unfortunatley the IR Verifier doesn't reject debug intrinsics that
have nullptr as arguments, so coro::salvageDebugInfo for now also
needs to deal with them.
rdar://81979541
This was a diagnostic option used to demonstrate a weakness in
the AST-based LICM implementation. This problem does not exist
in the MSSA-based LICM implementation, which has been enabled
for a long time now. As such, this option is no longer relevant.
This patch adds an extra option to print the module after running one of
the OpenMPOpt passes if debugging is enabled. This makes it much easier
to inspect the effects of this pass when doing debugging.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D108146
his is a fix for PR43678, and is an alternate patch to D105723.
The basic issue we're running into is that LSR + SCEVExpander are moving the very instruction whose operand we're in the process of expanding. This breaks the subtle and ill-documented invariant which let LSR work. (Full story can be found here: https://reviews.llvm.org/D105723#2878473)
Rather than attempting a fix, this change just removes the optimization entirely. The code is entirely untested, and removing it appears to have no impact I can find. This code was added back in 2014 by 1e12f8563d with a single test which does not seem to actually test the hoisting logic.
From a philosophical standpoint, it also seems very strange to have the expander implementing optimizations which should live in a dedicated transform pass.
Differential Revision: https://reviews.llvm.org/D106178
Removed AArch64 usage of the getMaxVScale interface, replacing it with
the vscale_range(min, max) IR Attribute.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D106277
If both operands are negated, we can invert the min/max and do
the negation after:
smax (neg nsw X), (neg nsw Y) --> neg nsw (smin X, Y)
smin (neg nsw X), (neg nsw Y) --> neg nsw (smax X, Y)
This is visible as a remaining regression in D98152. I don't see
a way to generalize this for 'unsigned' or adapt Negator to
handle it. This only appears to be safe with 'nsw':
https://alive2.llvm.org/ce/z/GUy1zJ
Differential Revision: https://reviews.llvm.org/D108165
This option has been enabled by default for quite a while now.
The practical impact of removing the option is that MSSA use
cannot be disabled in default pipelines (both LPM and NPM) and
in manual LPM invocations. NPM can still choose to enable/disable
MSSA using loop vs loop-mssa.
The next step will be to require MSSA for LICM and drop the
AST-based implementation entirely.
Differential Revision: https://reviews.llvm.org/D108075
LoopLoadElimination, LoopVersioning and LoopVectorize currently
fetch MemorySSA when construction LoopAccessAnalysis. However,
LoopAccessAnalysis does not actually use MemorySSA and we can pass
nullptr instead.
This saves one MemorySSA calculation in the default pipeline, and
thus improves compile-time.
Differential Revision: https://reviews.llvm.org/D108074
This adds a call to matchSAddSubSat from smin/smax instrinsics, allowing
the same patterns to match if the canonical form of a min/max is an
intrinsics, not a icmp/select.
Differential Revision: https://reviews.llvm.org/D108077
Currently/previously, while SCEV guaranteed that it produces the same value,
the way it was produced may be illegal IR, so we have an ugly check that
the replacement is valid.
But now that the SCEV strictness wrt the pointer/integer types has been improved,
i believe this invariant is already upheld by the SCEV itself, natively.
I think we should add an assertion, wait for a week, and then, if all is good,
rip out all this checking.
Or we could just do the latter directly i guess.
This reverts commit rL127839.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D108043
After recent changes, exit counts and BE taken counts are always
integers, so convert these to assertions.
While here, also convert the loop invariance checks to asserts.
Exit counts are always loop invariant.
Previously we emitted a "does not support scalable vectors"
remark for all targets whenever vectorisation is attempted. This
pollutes the output for architectures that don't support scalable
vectors and is likely confusing to the user.
Instead this patch introduces a debug message that reports when
scalable vectorisation is allowed by the target and only issues
the previous remark when scalable vectorisation is specifically
requested, for example:
#pragma clang loop vectorize_width(2, scalable)
Differential Revision: https://reviews.llvm.org/D108028
The current LIR does not deal with runtime-determined memset-size. This patch
utilizes SCEV and check if the PointerStrideSCEV and the MemsetSizeSCEV are equal.
Before comparison the pass would try to fold the expression that is already
protected by the loop guard.
Testcase file `memset-runtime.ll`, `memset-runtime-debug.ll` added.
This patch deals with proper loop-idiom. Proceeding patch wants to deal with SCEV-s
that are inequal after folding with the loop guards.
Reviewed By: lebedev.ri, Whitney
Differential Revision: https://reviews.llvm.org/D107353
Another simple cleanups set in DSE. CheckCache is removed since 1f1145006b and in consequence KnownNoReads is useless.
Also update description of MemorySSAScanLimit which default value is 150 instead 100.
Differential Revision: https://reviews.llvm.org/D107812
Previously we would allow promotion even if the byval/inalloca
attributes on the call and the callee didn't match.
It's ok if the byval/inalloca types aren't the same. For example, LTO
importing may rename types.
Fixes PR51397.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D107998
AttributeList::hasAttribute() is confusing. In an attempt to change the
name to something that suggests using other methods, fix up some
existing uses.
AttributeList::hasAttribute() is confusing, use clearer methods like
hasParamAttr()/hasRetAttr().
Add hasRetAttr() since it was missing from AttributeList.
It might changed the condition of a branch into a constant,
so we should restart and constant-fold terminator,
instead of continuing with the tautological "conditional" branch.
This fixes the issue reported at https://reviews.llvm.org/rGf30a7dff8a5b32919951dcbf92e4a9d56c4679ff
There is an assertion failure in computeOverflowForUnsignedMul
(used in checkOverflow) due to the inner and outer trip counts
having different types. This occurs when the IV has been widened,
but the loop components are not successfully rediscovered.
This is fixed by some refactoring of the code in findLoopComponents
which identifies the trip count of the loop.
Besides SPMDization, other analysis and optimization for original, frontend-generated SPMD regions uses information from the AAKernelInfoFunction attribute. This fix makes sure disabling SPMDization through the corresponding option applies only to generic mode regions, which should not be SPMDized, while it leaves unaffected the attribute state of original SPMD regions.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D108001
This is a re-try of 6de1dbbd09 which was reverted because
it missed a null check. Extra test for that failure added.
Original commit message:
This is an adaptation of D41603 and another step on the way
to canonicalizing to the intrinsic forms of min/max.
See D98152 for status.
This reverts the revert 28c04794df.
The failing MLIR test that caused the revert should be fixed in this
version.
Also includes a PPC test fix previously in 1f87c7c478.
We really shouldn't deal with a conditional branch that can be trivially
constant-folded into an unconditional branch.
Indeed, barring failure to trigger BB reprocessing, that should be true,
so let's assert as much, and hope the assertion never fires.
If it does, we have a bug to fix.
Mainly, i want to add an assertion that `SimplifyCFGOpt::simplifyCondBranch()`
doesn't get asked to deal with non-unconditional branches,
and if i do that, then said assertion fires on existing tests,
and this is what prevents it from firing.
This is a direct translation of the select folds added with
D53033 / D53036 and another step towards canonicalization
using the intrinsics (see D98152).
Currently, LNICM pass does not support sinking instructions out of loop nest.
This patch enables LNICM to sink down as many instructions to the exit block of outermost loop as possible.
Reviewed By: Whitney
Differential Revision: https://reviews.llvm.org/D107219
This patch adjusts the intrinsics definition of
llvm.matrix.column.major.load and llvm.matrix.column.major.store to
allow overloading the type of the stride. The bitwidth of the stride is
used to perform the offset computation.
This fixes a crash when using __builtin_matrix_column_major_load or
__builtin_matrix_column_major_store on 32 bit platforms. The stride argument
of the builtins are defined as `size_t`, which is 32 bits wide on 32 bit
platforms.
Note that we still perform offset computations with 64 bit width on 32
bit platforms for accesses that do not take a user-specified stride.
This can be fixed separately.
Fixes PR51304.
Reviewed By: erichkeane
Differential Revision: https://reviews.llvm.org/D107349
Originally committed as ffc3fb665d
Reverted in fcf2d5f402 due to an
assertion failure.
Original commit message:
Allow the folding even if there is an
intervening bitcast.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D106667
release call
findSafeStoreForStoreStrongContraction checks whether it's safe to move
the release call to the store by inspecting all instructions between the
two, but was ignoring retain instructions. This was causing objects to
be released and deallocated before they were retained.
rdar://81668577
This is a quick fix for a motivating case that looks like this:
https://godbolt.org/z/GeMqzMc38
As noted, we might be able to restore the min/max patterns
with select folds, or we just wait for this to become easier
with canonicalization to min/max intrinsics.
When enable CSPGO for ThinLTO, there are profile cfg mismatch warnings that will cause lld-link errors (with /WX)
due to source changes (e.g. `#if` code runs for profile generation but not for profile use)
To disable it we have to use an internal "/mllvm:-no-pgo-warn-mismatch" option.
In contrast clang uses option ”-Wno-backend-plugin“ to avoid such warnings and gcc has an explicit "-Wno-coverage-mismatch" option.
Add "lto-pgo-warn-mismatch" option to lld COFF/ELF to help turn on/off the profile mismatch warnings explicitly when build with ThinLTO and CSPGO.
Differential Revision: https://reviews.llvm.org/D104431
When enable CSPGO for ThinLTO, there are profile cfg mismatch warnings that will cause lld-link errors (with /WX).
To disable it we have to use an internal "/mllvm:-no-pgo-warn-mismatch" option.
In contrast clang uses option ”-Wno-backend-plugin“ to avoid such warnings and gcc has an explicit "-Wno-coverage-mismatch" option.
Add this "lto-pgo-warn-mismatch" option to lld to help turn on/off the profile mismatch warnings explicitly when build with ThinLTO and CSPGO.
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D104431
When none of the translation units in the binary have been instrumented
we shouldn't need to link the profile runtime. However, because we pass
-u__llvm_profile_runtime on Linux and Fuchsia, the runtime would still
be pulled in and incur some overhead. On Fuchsia which uses runtime
counter relocation, it also means that we cannot reference the bias
variable unconditionally.
This change modifies the InstrProfiling pass to pull in the profile
runtime only when needed by declaring the __llvm_profile_runtime symbol
in the translation unit only when needed. For now we restrict this only
for Fuchsia, but this can be later expanded to other platforms. This
approach was already used prior to 9a041a7522, but we changed it
to always generate the __llvm_profile_runtime due to a TAPI limitation,
but that limitation may no longer apply, and it certainly doesn't apply
on platforms like Fuchsia.
Differential Revision: https://reviews.llvm.org/D98061
PHI nodes are not pass through but change their value, we have to
account for that to avoid missing stores.
Follow up for D107798 to fix PR51249 for good.
Differential Revision: https://reviews.llvm.org/D107808
AAPointerInfoFloating needs to visit all uses and some multiple times if
we go through PHI nodes. Attributor::checkForAllUses keeps a visited set
so we don't recurs endlessly. We now allow recursion for non-phi uses so
we track all pointer offsets via PHI nodes properly without endless
recursion.
This replaces the first attempt D107579.
Differential Revision: https://reviews.llvm.org/D107798
To avoid simplification with wrong constants we need to make sure we
know that we won't perform specific optimizations based on the users
request. The non-SPMDzation and non-CustomStateMachine flags did only
prevent the final transformation but allowed to value simplification
to go ahead.
Differential Revision: https://reviews.llvm.org/D107862
Some files still contained the old University of Illinois Open Source
Licence header. This patch replaces that with the Apache 2 with LLVM
Exception licence.
Differential Revision: https://reviews.llvm.org/D107528
This patch removes the hand-rolled implementation of salvageDebugInfo
for cast and GEPs and replaces it with a call into
llvm::salvageDebugInfoImpl().
A side-effect of this is that additional redundant convert operations
are introduced, but those don't have any negative effect on the
resulting DWARF expression.
rdar://80227769
Differential Revision: https://reviews.llvm.org/D107384
This patch refactors / simplifies salvageDebugInfoImpl(). The goal
here is to simplify the implementation of coro::salvageDebugInfo() in
a followup patch.
1. Change the return value to I.getOperand(0). Currently users of
salvageDebugInfoImpl() assume that the first operand is
I.getOperand(0). This patch makes this information explicit. A
nice side-effect of this change is that it allows us to salvage
expressions such as add i8 1, %a in the future.
2. Factor out the creation of a DIExpression and return an array of
DIExpression operations instead. This change allows users that
call salvageDebugInfoImpl() in a loop to avoid the costly
creation of temporary DIExpressions and to defer the creation of
a DIExpression until the end.
This patch does not change any functionality.
rdar://80227769
Differential Revision: https://reviews.llvm.org/D107383
When converting a store into a memset, we currently insert the new
MemoryDef after the store MemoryDef, which requires all uses to be
renamed to the new def using a whole block scan. Instead, we can
insert the new MemoryDef before the store and not rename uses,
because we know that the location is immediately overwritten, so
all uses should still refer to the old MemoryDef. Those uses will
get renamed when the old MemoryDef is actually dropped, which is
efficient.
I expect something similar can be done for some of the other MSSA
updates in MemCpyOpt. This is an alternative to D107513, at least
for this particular case.
Differential Revision: https://reviews.llvm.org/D107702
The intrinsics have an extra chunk of known bits logic
compared to the normal cmp+select idiom. That allows
folding the icmp in each case to something better, but
that then opposes the canonical form of min/max that
we try to form for a select.
I'm carving out a narrow exception to preserve all
existing regression tests while avoiding the inf-loop.
It seems unlikely that this is the only bug like this
left, but this should fix:
https://llvm.org/PR51419
Avoid stack overflow errors on systems with small stack sizes
by removing recursion in FoldCondBranchOnPHI.
This is a simple change as the recursion was only iteratively
calling the function again on the same arguments.
Ideally this would be compiled to a tail call, but there is
no guarantee.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D107803
I have updated cheapToScalarize to also consider the case when
extracting lanes from a stepvector intrinsic. This required removing
the existing 'bool IsConstantExtractIndex' and passing in the actual
index as a Value instead. We do this because we need to know if the
index is <= known minimum number of elements returned by the stepvector
intrinsic. Effectively, when extracting lane X from a stepvector we
know the value returned is also X.
New tests added here:
Transforms/InstCombine/vscale_extractelement.ll
Differential Revision: https://reviews.llvm.org/D106358
We use the CurrentBlock to determine whether we have already processed a
block. Don't reuse this variable for setting where we should insert the
rematerialization. The rematerialization block is different to the
current block when we rematerialize for coro suspend block users.
Differential Revision: https://reviews.llvm.org/D107573
Teach LV to use masked-store to support interleave-store-group with
gaps (instead of scatters/scalarization).
The symmetric case of using masked-load to support
interleaved-load-group with gaps was introduced a while ago, by
https://reviews.llvm.org/D53668; This patch completes the store-scenario
leftover from D53668, and solves PR50566.
Reviewed by: Ayal Zaks
Differential Revision: https://reviews.llvm.org/D104750
The MemorySSA-based implementation has been enabled for a few months
(since D94376). This patch drops the old MDA-based implementation
entirely.
I've kept this to only the basic cleanup of dropping various
conditions -- the code could be further cleaned up now that there
is only one implementation.
Differential Revision: https://reviews.llvm.org/D102113
In this patch, the "nnan" requirement is removed for the canonicalization of select with fcmp to fabs.
(i) FSub logic: Remove check for nnan flag presence in fsub. Example: https://alive2.llvm.org/ce/z/751svg (fsub).
(ii) FNeg logic: Remove check for the presence of nnan and nsz flag in fneg. Example: https://alive2.llvm.org/ce/z/a_fsdp (fneg).
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D106872
Both patterns are equivalent (https://alive2.llvm.org/ce/z/jfCViF),
so we should have a preference. It seems like mask+negation is better
than two shifts.
After refactoring the phi recipes, we can now iterate over all header
phis in a VPlan to detect reductions when it comes to fixing them up
when tail folding.
This reduces the coupling with the cost model & legal by using the
information directly available in VPlan. It also removes a call to
getOrAddVPValue, which references the original IR value which may
become outdated after VPlan transformations.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D100102
There may be some generalizations (see test comments) of these patterns,
but this should handle the cases motivated by:
https://llvm.org/PR51315https://llvm.org/PR51259
The backend may want to transform differently, but at least for
the x86 examples that I looked at, there does not appear to be
any significant perf diff either way.
Attempt to enable MemCpyOpt unconditionally in D104801 uncovered the fact that
there are users that do not expect LLVM to materialize `memset` intrinsic.
While other passes can do that, too, MemCpyOpt triggers it more frequently and
breaks sanitizers and some downstream users.
For now introduce a flag to force-enable the flag and opt-in only CUDA
compilation with NVPTX back-end.
Differential Revision: https://reviews.llvm.org/D106401
- Loads from the constant memory (either explicit one or as the source
of memory transfer intrinsics) won't alias any stores.
Reviewed By: asbirlea, efriedma
Differential Revision: https://reviews.llvm.org/D107605
This patch adds more instructions to the Uniforms list, for example certain
intrinsics that are uniform by definition or whose operands are loop invariant.
This list includes:
1. The intrinsics 'experimental.noalias.scope.decl' and 'sideeffect', which
are always uniform by definition.
2. If intrinsics 'lifetime.start', 'lifetime.end' and 'assume' have
loop invariant input operands then these are also uniform too.
Also, in VPRecipeBuilder::handleReplication we check if an instruction is
uniform based purely on whether or not the instruction lives in the Uniforms
list. However, there are certain cases where calls to some intrinsics can
be effectively treated as uniform too. Therefore, we now also treat the
following cases as uniform for scalable vectors:
1. If the 'assume' intrinsic's operand is not loop invariant, then we
are free to treat this as uniform anyway since it's only a performance
hint. We will get the benefit for the first lane.
2. When the input pointers for 'lifetime.start' and 'lifetime.end' are loop
variant then for scalable vectors we assume these still ultimately come
from the broadcast of an alloca. We do not support scalable vectorisation
of loops containing alloca instructions, hence the alloca itself would
be invariant. If the pointer does not come from an alloca then the
intrinsic itself has no effect.
I have updated the assume test for fixed width, since we now treat it
as uniform:
Transforms/LoopVectorize/assume.ll
I've also added new scalable vectorisation tests for other intriniscs:
Transforms/LoopVectorize/scalable-assume.ll
Transforms/LoopVectorize/scalable-lifetime.ll
Transforms/LoopVectorize/scalable-noalias-scope-decl.ll
Differential Revision: https://reviews.llvm.org/D107284
The may get changed before specialization by RunSCCPSolver. In other
words, the pass may change the function without specialization happens.
Add test and comment to reveal this.
And it may return No Changed if the function get changed by
RunSCCPSolver before the specialization. It looks like a potential bug.
Test Plan: check-all
Reviewed By: https://reviews.llvm.org/D107622
Differential Revision: https://reviews.llvm.org/D107622
Noticed that the computation for function specialization cost of a
function wouldn't change during the traversal of the arguments for the
function. We could hoist the computation out of the traversal. I
observed about ~1% improvement on compile time for spec2017. But I guess
it may not be precise. This should be NFC and fine.
Reviewed By: Sjoerd Meijer
Differential Revision: https://reviews.llvm.org/D107621
This is recommit of the patch 16ff91ebcc,
reverted in 0c28a7c990 because it had
an error in call of getFastMathFlags (base type should be FPMathOperator
but not Instruction). The original commit message is duplicated below:
Clang has builtin function '__builtin_isnan', which implements C
library function 'isnan'. This function now is implemented entirely in
clang codegen, which expands the function into set of IR operations.
There are three mechanisms by which the expansion can be made.
* The most common mechanism is using an unordered comparison made by
instruction 'fcmp uno'. This simple solution is target-independent
and works well in most cases. It however is not suitable if floating
point exceptions are tracked. Corresponding IEEE 754 operation and C
function must never raise FP exception, even if the argument is a
signaling NaN. Compare instructions usually does not have such
property, they raise 'invalid' exception in such case. So this
mechanism is unsuitable when exception behavior is strict. In
particular it could result in unexpected trapping if argument is SNaN.
* Another solution was implemented in https://reviews.llvm.org/D95948.
It is used in the cases when raising FP exceptions by 'isnan' is not
allowed. This solution implements 'isnan' using integer operations.
It solves the problem of exceptions, but offers one solution for all
targets, however some can do the check in more efficient way.
* Solution implemented by https://reviews.llvm.org/D96568 introduced a
hook 'clang::TargetCodeGenInfo::testFPKind', which injects target
specific code into IR. Now only SystemZ implements this hook and it
generates a call to target specific intrinsic function.
Although these mechanisms allow to implement 'isnan' with enough
efficiency, expanding 'isnan' in clang has drawbacks:
* The operation 'isnan' is hidden behind generic integer operations or
target-specific intrinsics. It complicates analysis and can prevent
some optimizations.
* IR can be created by tools other than clang, in this case treatment
of 'isnan' has to be duplicated in that tool.
Another issue with the current implementation of 'isnan' comes from the
use of options '-ffast-math' or '-fno-honor-nans'. If such option is
specified, 'fcmp uno' may be optimized to 'false'. It is valid
optimization in general, but it results in 'isnan' always returning
'false'. For example, in some libc++ implementations the following code
returns 'false':
std::isnan(std::numeric_limits<float>::quiet_NaN())
The options '-ffast-math' and '-fno-honor-nans' imply that FP operation
operands are never NaNs. This assumption however should not be applied
to the functions that check FP number properties, including 'isnan'. If
such function returns expected result instead of actually making
checks, it becomes useless in many cases. The option '-ffast-math' is
often used for performance critical code, as it can speed up execution
by the expense of manual treatment of corner cases. If 'isnan' returns
assumed result, a user cannot use it in the manual treatment of NaNs
and has to invent replacements, like making the check using integer
operations. There is a discussion in https://reviews.llvm.org/D18513#387418,
which also expresses the opinion, that limitations imposed by
'-ffast-math' should be applied only to 'math' functions but not to
'tests'.
To overcome these drawbacks, this change introduces a new IR intrinsic
function 'llvm.isnan', which realizes the check as specified by IEEE-754
and C standards in target-agnostic way. During IR transformations it
does not undergo undesirable optimizations. It reaches instruction
selection, where is lowered in target-dependent way. The lowering can
vary depending on options like '-ffast-math' or '-ffp-model' so the
resulting code satisfies requested semantics.
Differential Revision: https://reviews.llvm.org/D104854
All information to fix-up the reduction phi nodes in the vectorized loop
is available in VPlan now. This patch moves the code to do so, to make
this clearer. Fixing up the loop exit value still relies on other
information and remains outside of VPlan for now.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D100113
Now the recursive functions may get specialized many times when
`func-specialization-max-iters` increases. See discussion in
https://reviews.llvm.org/D106426 for details.
For a very large module, __llvm_gcov_reset can become very large.
__llvm_gcov_reset previously emitted stores to a bunch of globals in one
huge basic block. MemCpyOpt would turn many of these stores into
memsets, and updating MemorySSA would be extremely slow.
Verified that this makes the compile time of certain files go down
drastically (20min -> 5min).
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D107538
The SCEV-based salvaging method caches dbg.value information pre-LSR so
that salvaging may be attempted post-LSR. If the dbg.value are already
undef pre-LSR then a salvage attempt would be fruitless, so avoid
caching them.
Reviewed By: StephenTozer
Differential Revision: https://reviews.llvm.org/D107448
If the vectorized insertelements instructions form indentity subvector
(the subvector at the beginning of the long vector), it is just enough
to extend the vector itself, no need to generate inserting subvector
shuffle.
Differential Revision: https://reviews.llvm.org/D107494
Since all operands to ExtractValue must be loop-invariant when we deem
the loop vectorizable, we can consider ExtractValue to be uniform.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D107286
In SimplifyCFG we may simplify the CFG by speculatively executing
certain stores, when they are preceded by a store to the same
location. This patch allows such speculation also when the stores are
similarly preceded by a load.
In order for this transformation to be correct we need to ensure that
the memory location is writable and the store in the new location does
not introduce a data race.
Local objects (created by an `alloca` instruction) are always
writable, so once we are past a read from a location it is valid to
also write to that same location.
Seeing just a load does not guarantee absence of a data race (unlike
if we see a store) - the load may still be part of a race, just not
causing undefined behaviour
(cf. https://llvm.org/docs/Atomics.html#optimization-outside-atomic).
In the original program, a data race might have been prevented by the
condition, but once we move the store outside the condition, we must
be sure a data race wasn't possible anyway, no matter what the
condition evaluates to.
One way to be sure that a local object is never concurrently
read/written is check that its address never escapes the function.
Hence this transformation is restricted to local, non-escaping
objects.
Reviewed By: nikic, lebedev.ri
Differential Revision: https://reviews.llvm.org/D107281
We can only trust the range of the index if it is guaranteed
non-poison.
Fixes PR50949.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D107364
This patch adds more instructions to the Uniforms list, for example certain
intrinsics that are uniform by definition or whose operands are loop invariant.
This list includes:
1. The intrinsics 'experimental.noalias.scope.decl' and 'sideeffect', which
are always uniform by definition.
2. If intrinsics 'lifetime.start', 'lifetime.end' and 'assume' have
loop invariant input operands then these are also uniform too.
Also, in VPRecipeBuilder::handleReplication we check if an instruction is
uniform based purely on whether or not the instruction lives in the Uniforms
list. However, there are certain cases where calls to some intrinsics can
be effectively treated as uniform too. Therefore, we now also treat the
following cases as uniform for scalable vectors:
1. If the 'assume' intrinsic's operand is not loop invariant, then we
are free to treat this as uniform anyway since it's only a performance
hint. We will get the benefit for the first lane.
2. When the input pointers for 'lifetime.start' and 'lifetime.end' are loop
variant then for scalable vectors we assume these still ultimately come
from the broadcast of an alloca. We do not support scalable vectorisation
of loops containing alloca instructions, hence the alloca itself would
be invariant. If the pointer does not come from an alloca then the
intrinsic itself has no effect.
I have updated the assume test for fixed width, since we now treat it
as uniform:
Transforms/LoopVectorize/assume.ll
I've also added new scalable vectorisation tests for other intriniscs:
Transforms/LoopVectorize/scalable-assume.ll
Transforms/LoopVectorize/scalable-lifetime.ll
Transforms/LoopVectorize/scalable-noalias-scope-decl.ll
Differential Revision: https://reviews.llvm.org/D107284
This change wasn't strictly necessary for D106164 and could be removed.
This patch addresses the post-commit comments from @fhahn on D106164, and
also changes sve-widen-gep.ll to use the same IR test as shown in
pointer-induction.ll.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D106878
Letting it take SCEV allows further modification on the function to optimize
if the StoreSize / Stride is runtime determined.
This is a preceeding of D107353.
The big picture is to let LoopIdiom deal with runtime-determined sizes.
Reviewed By: Whitney, lebedev.ri
Differential Revision: https://reviews.llvm.org/D104595
Rather than blocking the whole MemCpyOpt pass if the libcalls are
not available, only disable creation of new memset/memcpy intrinsics
where only load/stores were used previously. This only affects the
store merging and load-store conversion optimization. Other
optimizations are derived from existing intrinsics, which are
well-defined in the absence of libcalls -- not having the libcalls
just means that call simplification won't convert them to intrinsics.
This is a weaker variation of D104801, which dropped these checks
entirely. Ideally we would not couple emission of intrinsics to
libcall availability at all, but as the intrinsics may be legalized
to libcalls we need to be a bit careful right now.
Differential Revision: https://reviews.llvm.org/D106769
This patch expands SPMDization (converting generic execution mode to SPMD for target regions) by guarding code regions that should be executed only by the main thread. Specifically, it generates guarded regions, which only the main thread executes, and the synchronization with worker threads using simple barriers. For correctness, the patch aborts SPMDization for target regions if the same code executes in a parallel region, thus must be not be guarded. This check is implemented using the ParallelLevels AA.
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D106892
Rather than emitting the bias variable lazily as needed, emit it
eagerly. This allows profile runtime to refer to this variable
unconditionally without having to use the weak reference. The bias
variable is in a COMDAT so there'll never be more than one instance,
and if it's not needed, linker should be able to GC it, so the overhead
should be minimal.
Differential Revision: https://reviews.llvm.org/D107377
SCEV-based salvaging in LSR translates SCEVs to DIExpressions. SCEVs may
contain very large integers but the translation does not support
integers greater than 64 bits. This patch adds checks to ensure
conversions of these large integers is not attempted. A regression test
is added to ensure no such translation is attempted.
Reviewed by: StephenTozer
PR: https://bugs.llvm.org/show_bug.cgi?id=51329
Differential Revision: https://reviews.llvm.org/D107438
Clang has builtin function '__builtin_isnan', which implements C
library function 'isnan'. This function now is implemented entirely in
clang codegen, which expands the function into set of IR operations.
There are three mechanisms by which the expansion can be made.
* The most common mechanism is using an unordered comparison made by
instruction 'fcmp uno'. This simple solution is target-independent
and works well in most cases. It however is not suitable if floating
point exceptions are tracked. Corresponding IEEE 754 operation and C
function must never raise FP exception, even if the argument is a
signaling NaN. Compare instructions usually does not have such
property, they raise 'invalid' exception in such case. So this
mechanism is unsuitable when exception behavior is strict. In
particular it could result in unexpected trapping if argument is SNaN.
* Another solution was implemented in https://reviews.llvm.org/D95948.
It is used in the cases when raising FP exceptions by 'isnan' is not
allowed. This solution implements 'isnan' using integer operations.
It solves the problem of exceptions, but offers one solution for all
targets, however some can do the check in more efficient way.
* Solution implemented by https://reviews.llvm.org/D96568 introduced a
hook 'clang::TargetCodeGenInfo::testFPKind', which injects target
specific code into IR. Now only SystemZ implements this hook and it
generates a call to target specific intrinsic function.
Although these mechanisms allow to implement 'isnan' with enough
efficiency, expanding 'isnan' in clang has drawbacks:
* The operation 'isnan' is hidden behind generic integer operations or
target-specific intrinsics. It complicates analysis and can prevent
some optimizations.
* IR can be created by tools other than clang, in this case treatment
of 'isnan' has to be duplicated in that tool.
Another issue with the current implementation of 'isnan' comes from the
use of options '-ffast-math' or '-fno-honor-nans'. If such option is
specified, 'fcmp uno' may be optimized to 'false'. It is valid
optimization in general, but it results in 'isnan' always returning
'false'. For example, in some libc++ implementations the following code
returns 'false':
std::isnan(std::numeric_limits<float>::quiet_NaN())
The options '-ffast-math' and '-fno-honor-nans' imply that FP operation
operands are never NaNs. This assumption however should not be applied
to the functions that check FP number properties, including 'isnan'. If
such function returns expected result instead of actually making
checks, it becomes useless in many cases. The option '-ffast-math' is
often used for performance critical code, as it can speed up execution
by the expense of manual treatment of corner cases. If 'isnan' returns
assumed result, a user cannot use it in the manual treatment of NaNs
and has to invent replacements, like making the check using integer
operations. There is a discussion in https://reviews.llvm.org/D18513#387418,
which also expresses the opinion, that limitations imposed by
'-ffast-math' should be applied only to 'math' functions but not to
'tests'.
To overcome these drawbacks, this change introduces a new IR intrinsic
function 'llvm.isnan', which realizes the check as specified by IEEE-754
and C standards in target-agnostic way. During IR transformations it
does not undergo undesirable optimizations. It reaches instruction
selection, where is lowered in target-dependent way. The lowering can
vary depending on options like '-ffast-math' or '-ffp-model' so the
resulting code satisfies requested semantics.
Differential Revision: https://reviews.llvm.org/D104854
This adds support for specialising recursive functions. For example:
int Global = 1;
void recursiveFunc(int *arg) {
if (*arg < 4) {
print(*arg);
recursiveFunc(*arg + 1);
}
}
void main() {
recursiveFunc(&Global);
}
After 3 iterations of function specialisation, followed by inlining of the
specialised versions of recursiveFunc, the main function looks like this:
void main() {
print(1);
print(2);
print(3);
}
To support this, the following has been added:
- Update the solver and state of the new specialised functions,
- An optimisation to propagate constant stack values after each iteration of
function specialisation, which is necessary for the next iteration to
recognise the constant values and trigger.
Specialising recursive functions is (at the moment) controlled by option
-func-specialization-max-iters and is opt-in for compile-time reasons. I.e.,
the default is -func-specialization-max-iters=1, but for the example above we
would need to use -func-specialization-max-iters=3. Future work is to see if we
can increase the default, or improve the cost-model/heuristics to control
compile-times.
Differential Revision: https://reviews.llvm.org/D106426
Currently, in OptimizeGlobalAddressOfMalloc, the transformation for global loads assumes that they have the same Type. With the support of ConstantExpr (https://reviews.llvm.org/D106589), this may not be true any more (as seen in the test case), and we miss the code to handle this, This is to fix that.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D107397
This transform has been restricted to legal types since
https://reviews.llvm.org/rG65df808f6254617b9eee931d00e95d900610b660
in 2012.
This is particularly restrictive on RISCV64 which only has i64
as a legal integer type. i32 is a very common type in code
generated from C, but we won't form a lookup table with it.
This also effects other common types like i8/i16 types on ARM,
AArch64, RISCV, etc.
This patch proposes to allow power of 2 types larger than 8 bit, if
they will fit in the largest legal integer type in DataLayout.
These types are common in C code so generally well handled in
the backends.
We could probably do this for other types like i24 and rely on
alignment and padding to allow the backend to use a single wider
load. This isn't my main concern right now and it will need more
tests.
We could also allow larger types up to some limit and let the
backend split into multiple loads, but we need to define that
limit. It's also not my main concern right now.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D107233
If the vectorized insertelements instructions form indentity subvector
(the subvector at the beginning of the long vector), it is just enough
to extend the vector itself, no need to generate inserting subvector
shuffle.
Differential Revision: https://reviews.llvm.org/D107344
Fixed type assertion failure caused by trying to fold a masked load with a
select where the select condition is a scalar value
Reviewed By: sdesmalen, lebedev.ri
Differential Revision: https://reviews.llvm.org/D107372
The inttoptr/ptrtoint roundtrip optimization is not always correct.
We are working towards removing this optimization and adding support to specific cases where this optimization works.
In this patch, we focus on phi-node operands with inttoptr casts.
We know that ptrtoint( inttoptr( ptrtoint x) ) is same as ptrtoint (x).
So, we want to remove this roundtrip cast which goes through phi-node.
Reviewed By: aqjune
Differential Revision: https://reviews.llvm.org/D106289
I'm renaming the flag because a future patch will add a new
enableOrderedReductions() TTI interface and so the meaning of this
flag will change to be one of forcing the target to enable/disable
them. Also, since other places in LoopVectorize.cpp use the word
'Ordered' instead of 'strict' I changed the flag to match.
Differential Revision: https://reviews.llvm.org/D107264
I'm not sure this is the best way to approach this,
but the situation is rather not very detectable unless we explicitly call it out when refusing to advise to unroll.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D107271
This aligns the multiple exit costing with all the other cost decisions. Note that UnrollAndJam, which is the only other caller of the original home of this code, unconditionally bails out of multiple exit loops.
The option to not preserve LCSSA is in fact not tested at all in upstream. I was tempted to just remove the code entirely, but realized I didn't need to for my actual goal.
This patch updates VPInterleaveRecipe::print to print the actual defined
VPValues for load groups and the store VPValue operands for store
groups.
The IR references may become outdated while transforming the VPlan and
the defined and stored VPValues always are up-to-date.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D107223
When the limit of the inner loop is a known integer, the InstCombine
pass now causes the transformation e.g. imcp ult i32 %inc, tripcount ->
icmp ult %j, tripcount-step (where %j is the inner loop induction
variable and %inc is add %j, step), which is now accounted for when
identifying the trip count of the loop. This is also an acceptable use
of %j (provided the step is 1) so is ignored as long as the compare
that it's used in is also the condition of the inner branch.
Differential Revision: https://reviews.llvm.org/D105802
I'm working on extending the OptimizeGlobalAddressOfMalloc to handle some more general cases. This is to add support of the ConstantExpr use of the global variables. The function allUsesOfLoadedValueWillTrapIfNull is now iterative with the added CE use of GV. Also, the recursive function valueIsOnlyUsedLocallyOrStoredToOneGlobal is changed to iterative using a worklist with the GEP case added.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D106589
I don't know much about this pass, but we need a stronger
check on the memset length arg to avoid an assert. The
current code was added with D59000.
The test is reduced from:
https://llvm.org/PR50910
Differential Revision: https://reviews.llvm.org/D106462
Additional asserts were added to ScalarEvolution to enforce
pointer/int type rules. An assert is triggered when the LSR pass
attempts to extend a pointer SCEV in GenerateTruncates.
This patch changes GenerateTruncates to exit early if the Formaula
contains a ScaledReg or BaseReg with a pointer type.
Differential Revision: https://reviews.llvm.org/D107185
Change `CountersPtr` in `__profd_` to a label difference, which is a link-time
constant. On ELF, when linking a shared object, this requires that `__profc_` is
either private or linkonce/linkonce_odr hidden. On COFF, we need D104564 so that
`.quad a-b` (64-bit label difference) can lower to a 32-bit PC-relative relocation.
```
# ELF: R_X86_64_PC64 (PC-relative)
.quad .L__profc_foo-.L__profd_foo
# Mach-O: a pair of 8-byte X86_64_RELOC_UNSIGNED and X86_64_RELOC_SUBTRACTOR
.quad l___profc_foo-l___profd_foo
# COFF: we actually use IMAGE_REL_AMD64_REL32/IMAGE_REL_ARM64_REL32 so
# the high 32-bit value is zero even if .L__profc_foo < .L__profd_foo
# As compensation, we truncate CountersDelta in the header so that
# __llvm_profile_merge_from_buffer and llvm-profdata reader keep working.
.quad .L__profc_foo-.L__profd_foo
```
(Note: link.exe sorts `.lprfc` before `.lprfd` even if the object writer
has `.lprfd` before `.lprfc`, so we cannot work around by reordering
`.lprfc` and `.lprfd`.)
With this change, a stage 2 (`-DLLVM_TARGETS_TO_BUILD=X86 -DLLVM_BUILD_INSTRUMENTED=IR`)
`ld -pie` linked clang is 1.74% smaller due to fewer R_X86_64_RELATIVE relocations.
```
% readelf -r pie | awk '$3~/R.*/{s[$3]++} END {for (k in s) print k, s[k]}'
R_X86_64_JUMP_SLO 331
R_X86_64_TPOFF64 2
R_X86_64_RELATIVE 476059 # was: 607712
R_X86_64_64 2616
R_X86_64_GLOB_DAT 31
```
The absolute function address (used by llvm-profdata to collect indirect call
targets) can be converted to relative as well, but is not done in this patch.
Differential Revision: https://reviews.llvm.org/D104556
Replace insertelement instructions for splats with just single
insertelement + broadcast shuffle. Also, try to merge these instructions
if they come from the same/shuffled gather node.
Differential Revision: https://reviews.llvm.org/D107104
For the nodes with reused scalars the user may be not only of the size
of the final shuffle but also of the size of the scalars themselves,
need to check for this. It is safe to just modify the check here, since
the order of the scalars themselves is preserved, only indeces of the
reused scalars are changed. So, the users with the same size as the
number of scalars in the node, will not be affected, they still will get
the operands in the required order.
Reported by @mstorsjo in D105020.
Differential Revision: https://reviews.llvm.org/D107080
If the instruction was previously deleted, it should not be treated as
an external user. This fixes cost estimation and removes dead
extractelement instructions.
Differential Revision: https://reviews.llvm.org/D107106
Need to check that the minimum acceptable vector factor is at least 2,
not 0, to avoid compiler crash during gathered loads analysis.
Differential Revision: https://reviews.llvm.org/D107058
This work provides four flags to disable four different sets of OpenMP optimizations. These flags take effect in llvm/lib/Transforms/IPO/OpenMPOpt.cpp and include the following:
- openmp-opt-disable-deglobalization: Defaults to false, adding this flag sets the variable DisableOpenMPOptDeglobalization to true. This prevents AA registration for HeapToStack and HeapToShared.
- openmp-opt-disable-spmdization: Defaults to false, adding this flag sets the variable DisableOpenMPOptSPMDization to true. This indicates a pessimistic fixpoint in changeToSPMDMode.
- openmp-opt-disable-folding: Defaults to false, adding this flag sets the variable DisableOpenMPOptFolding to true. This indicates a pessimistic fixpoint in the attributor init for AAFoldRuntimeCall.
- openmp-opt-disable-state-machine-rewrite: Defaults to false, adding this flag sets the variable DisableOpenMPOptStateMachineRewrite to true. This first prevents changes to the state machine in rewriteDeviceCodeStateMachine by returning before changes are made, and if a custom state machine is built in buildCustomStateMachine, stops by returning a pessimistic fixpoint.
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D106802
Patch by Mohammad Fawaz
This issues started happening after
b373b5990d
Basically, if the memcpy is volatile, the collectUsers() function should
return false, just like we do for volatile loads.
Differential Revision: https://reviews.llvm.org/D106950
SCEVToIterCountExpr only expects to be fed affine expressions, but
DbgRewriteSalvageableDVIs is feeding it non-affine induction variables.
Following this up with an obvious fix, will add test coverage too if
this avoids D105207 being reverted.
When the trip count of the inner loop is a constant, the InstCombine
pass now causes the transformation e.g. imcp ult i32 %inc, tripcount ->
icmp ult %j, tripcount-step (where %j is the inner loop induction
variable and %inc is add %j, step), which is now accounted for when
identifying the trip count of the loop. This is also an acceptable use
of %j (provided the step is 1) so is ignored as long as the compare
that it's used in is also the condition of the inner branch.
Differential Revision: https://reviews.llvm.org/D105802
The current implementation of function internalization creats a copy of each
function and replaces every use. This has the downside that the external
versions of the functions will call into the internalized versions of the
functions. This prevents them from being fully independent of eachother. This
patch replaces the current internalization scheme with a method that creates
all the copies of the functions intended to be internalized first and then
replaces the uses as long as their caller is not already internalized.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106931
Reapply commit d675b594f4 that was
reverted due to buildbot failures. A simple fix has been applied to
remove an assertion.
Differential Revision: https://reviews.llvm.org/D105207
The SCEV method getBackedgeTakenCount() returns a SCEVCouldNotCompute
object if the backedge-taken count is unpredictable. This fix ensures
there is no longer an attempt to use such an object to find the trip
count.
Patch by: Rosie Sumpter.
Differential Revision: https://reviews.llvm.org/D106970
This is a second attempt to fix the EXPENSIVE_CHECKS issue that was mentioned In D91661#2875179 by @jroelofs.
(The first attempt was in D105983)
D91661 more or less completely reverted D49126 and by doing so also removed the cleanup logic of the created declarations and calls.
This patch is a replacement for D91661 (which must itself be reverted first). It replaces the custom declaration creation with the
generic version and shows the test impact. It also tracks the number of NamedValues to detect if a new prototype was added instead
of looking at the available users of a prototype.
Reviewed By: jroelofs
Differential Revision: https://reviews.llvm.org/D106147
This reverts commit 77080a1eb6.
This change introduced issues detected with EXPENSIVE_CHECKS. Reverting to restore the
needed function cleanup. A next patch will then just improve on the name mangling.
[[noreturn]] can be used since Oct 2016 when the minimum compiler requirement was bumped to GCC 4.8/MSVC 2015.
Note: the definition of LLVM_ATTRIBUTE_NORETURN is kept for now.
Reapply commit 796b84d26f that was
reverted due to reports of crashes. A minor change now guards against
getVariableLocationOperand() returning a nullptr.
Differential Revision: https://reviews.llvm.org/D106659
This transform was added with e38b7e8948
and as shown in:
https://llvm.org/PR51241
...it could crash without an extra check of the blocks.
There might be a more compact way to write this constraint,
but we can't just count the successors/predecessors without
affecting a test that includes a switch instruction.
Reworked reordering algorithm. Originally, the compiler just tried to
detect the most common order in the reordarable nodes (loads, stores,
extractelements,extractvalues) and then fully rebuilding the graph in
the best order. This was not effecient, since it required an extra
memory and time for building/rebuilding tree, double the use of the
scheduling budget, which could lead to missing vectorization due to
exausted scheduling resources.
Patch provide 2-way approach for graph reodering problem. At first, all
reordering is done in-place, it doe not required tree
deleting/rebuilding, it just rotates the scalars/orders/reuses masks in
the graph node.
The first step (top-to bottom) rotates the whole graph, similarly to the previous
implementation. Compiler counts the number of the most used orders of
the graph nodes with the same vectorization factor and then rotates the
subgraph with the given vectorization factor to the most used order, if
it is not empty. Then repeats the same procedure for the subgraphs with
the smaller vectorization factor. We can do this because we still need
to reshuffle smaller subgraph when buildiong operands for the graph
nodes with lasrger vectorization factor, we can rotate just subgraph,
not the whole graph.
The second step (bottom-to-top) scans through the leaves and tries to
detect the users of the leaves which can be reordered. If the leaves can
be reorder in the best fashion, they are reordered and their user too.
It allows to remove double shuffles to the same ordering of the operands in
many cases and just reorder the user operations instead. Plus, it moves
the final shuffles closer to the top of the graph and in many cases
allows to remove extra shuffle because the same procedure is repeated
again and we can again merge some reordering masks and reorder user nodes
instead of the operands.
Also, patch improves cost model for gathering of loads, which improves
x264 benchmark in some cases.
Gives about +2% on AVX512 + LTO (more expected for AVX/AVX2) for {625,525}x264,
+3% for 508.namd, improves most of other benchmarks.
The compile and link time are almost the same, though in some cases it
should be better (we're not doing an extra instruction scheduling
anymore) + we may vectorize more code for the large basic blocks again
because of saving scheduling budget.
Differential Revision: https://reviews.llvm.org/D105020
As suggested in D105008, move the code that fixes up the backedge value
for first order recurrences to VPlan::execute.
Now all that remains in fixFirstOrderRecurrences is the code responsible
for creating the exit values in the middle block.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D106244
This makes a couple of changes to the costing of MLA reduction patterns,
to more accurately cost various patterns that can come up from
vectorization.
- The Arm implementation of getExtendedAddReductionCost is altered to
only provide costs for legal or smaller types. Larger than legal types
need to be split, which currently does not work very well, especially
for predicated reductions where the predicate may be legal but needs to
be split. Currently we limit it to legal or smaller input types.
- The getReductionPatternCost has learnt that reduce(ext(mul(ext, ext))
is a pattern that can come up, and can be treated the same as
reduce(mul(ext, ext)) providing the extension types match.
- And it has been adjusted to not count the ext in reduce(mul(ext, ext))
as part of a reduce(mul) pattern.
Together these changes help to more accurately cost the mla reductions
in cases such as where the extend types don't match or the extend
opcodes are different, picking better vector factors that don't result
in expanded reductions.
Differential Revision: https://reviews.llvm.org/D106166