This fold handles a special subset of foldAndOrOfICmpsUsingRanges(),
so use the more generic implementation instead.
The result can differ if a representation using a range comparison
is possible, in which case that is preferred over masking. There is
a canonicalization opportunity here.
This is the de Morgan conjugated variant of the existing fold for
ors. Implement this by switching the range code to always work
on ors and inverting the operands at the start and end. This makes
reasoning easier and makes the extension more obviously correct.
The legacy LoopUnswitch pass is only used in the legacy pass manager
pipeline, which is deprecated.
The NewPM replacement is SimpleLoopUnswitch and I think it is time to
remove the legacy LoopUnswitch code.
Fixes #31000.
Reviewed By: aeubanks, Meinersbur, asbirlea
Differential Revision: https://reviews.llvm.org/D124376
We can express this fold more naturally when working on the constant
range implementation. This change is not entirely NFC, because the
code now also handles cases that don't match the precise pattern
this previously looked for, e.g. we can omit an add on one of the
ranges.
I think this sort comparator was overly complex, and the windows
expensive check bot agreed, failing as it was not giving a strict weak
ordering. Change it to use the comparison of the mask values as unsigned
integers. This should sort the undef elements to the end whilst keeping
X<Y otherwise.
Replace the condition value with the known constant value on the
threaded edge. This happens implicitly with phi threading because
we replace with the incoming value, but not for non-phi threading.
SimplifyCFG implements basic jump threading when a branch is
performed on a phi node with constant operands. However,
InstCombine canonicalizes such phis to the condition value of a
previous branch, if possible. SimplifyCFG does support this as
well, but only in the very limited case where the same condition
is used in a direct predecessor -- notably, this does not include
the common diamond pattern (i.e. two consecutive if/elses on the
same condition).
This patch extends the code to look back a limited number of
blocks to find a branch on the same value, rather than only
looking at the direct predecessor.
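A minimal C sketch of the diamond pattern described above (hypothetical code, for illustration only):
```c
// Two consecutive if/elses on the same condition: after the first branch,
// the value of 'c' is known on each path, so the second branch can be
// threaded instead of being re-tested.
int diamond(int c, int x) {
  int y;
  if (c)
    y = x + 1;
  else
    y = x - 1;
  if (c)        // same condition as above: now threadable
    y *= 2;
  return y;
}
```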
Fixes https://github.com/llvm/llvm-project/issues/54980.
Differential Revision: https://reviews.llvm.org/D124159
Given a shuffle feeding a commutative reduction, the lane ordering of
the shuffle will not alter the result. This is also true if there are a
number of operations between the reduction and the shuffle, provided
they only operate lane-wise. This patch searches for cases like that in
Vector Combine, allowing us to check the cost of the shuffle vs an
in-order identity shuffle and replace the order if possible. This only
handles a single shuffle at the moment to keep things simple, and is
able to ignore splats that produce results where every result is the
same.
This is a more powerful version of a combine that already happens in
InstCombine, capable of optimizing more cases by looking through more
instructions and being able to cost the shuffle.
Differential Revision: https://reviews.llvm.org/D123494
Introduced masks where they were not added before and improved
target-dependent cost models to avoid returning incorrect cost results
after adding masks.
Differential Revision: https://reviews.llvm.org/D100486
The structure ArgPart and alias OffsetAndArgPart have been moved
into the anonymous namespace. NFC.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D124617
The condition should be 'ArgParts.size() > MaxElements', so that if we
have exactly 3 elements in the 'ArgParts' vector, the promotion should
be allowed because the 'MaxElements' threshold is not exceeded yet.
The default value for 'MaxElements' has been decreased to 2 in order
to avoid an actual change in argument promotion behavior. However,
this changes byval argument transformation behavior by allowing no
more than 2 arguments to be added to the function instead of the 3
allowed before.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D124178
Remove one of the last remaining uses of ::needsVectorIV, preparing for
its removal. Now that usesScalars is available and based on the
information explicit in VPlan, there is no need to use the pre-computed
needsVectorIV.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D123720
Some loop counters ('i', 'e') and variables ('type') were not named
in accordance with the code style, and clang-tidy issues warnings
about the use of such variables. This patch renames the variables
and fixes some typos in the comments within the source file.
Differential Revision: https://reviews.llvm.org/D123662
When using opaque pointers, convert GEPs into offset representation
of the form P + V1 * Scale1 + V2 * Scale2 + ... + ConstantOffset.
This allows us to recognize equivalent address calculations even if
the GEPs don't use the same source element type.
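An illustrative C-level sketch (hypothetical functions, assuming a 4-byte int): both computations reduce to the same offset form p + 4*i once the GEPs are expressed as offsets, even though their source element types differ.
```c
// Both functions compute the same address; with opaque pointers the GEPs
// differ only in source element type, and the offset representation
// recognizes them as equivalent (assuming sizeof(int) == 4).
int *addr_int(int *p, long i)   { return &p[i]; }
int *addr_char(char *p, long i) { return (int *)&p[4 * i]; }
```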
This fixes an opaque pointer codegen regression seen in rustc.
Differential Revision: https://reviews.llvm.org/D124527
They can already be available, and even if not, DT/LI can be available.
We should not recompute them. Old PM is unchanged because it would
require changing dependencies, and we don't care enough about it.
Differential Revision: https://reviews.llvm.org/D124439
Reviewed By: nikic, aeubanks
isNoopAddrSpaceCast expects SrcAS to be different from DestAS.
If the two address spaces are the same, consider ptrtoint/inttoptr a noop cast.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D123573
Previously all entries in global_ctors had to have the void()* type and
we'd skip evaluating bitcasted functions. With opaque pointers we may
see the function directly.
Fixes #55147.
Reviewed By: #opaque-pointers, nikic
Differential Revision: https://reviews.llvm.org/D124553
Using the legacy PM for the optimization pipeline was deprecated in 13.0.0.
Following recent changes to remove non-core features of the legacy
PM/optimization pipeline, remove ThreadSanitizerLegacyPass.
Reviewed By: #sanitizers, vitalybuka
Differential Revision: https://reviews.llvm.org/D124209
Introduced masks where they were not added before and improved
target-dependent cost models to avoid returning incorrect cost results
after adding masks.
Differential Revision: https://reviews.llvm.org/D100486
When a block containing llvm.coro.id is cloned during CHR, an invalid
PHI node with token type gets inserted at the beginning of the block containing
llvm.coro.begin. To avoid such cases, we exclude regions with llvm.coro.id.
Reviewed By: ChuanqiXu
Differential Revision: https://reviews.llvm.org/D124418
IRCE is a function pass that operates on loops. If there are no loops in
the function (as seen through LI), we should avoid computing the
remaining expensive analyses (such as BPI). Reorder the analysis
requests and return early if there are no loops. This is NFC with a
compile-time improvement.
The same will be done in a follow-up patch for the loop vectorizer.
Reviewed-By: nikic
Differential Revision: https://reviews.llvm.org/D124478
This relands commit 8f550368b1.
The test is amended with REQUIRES: x86-registered-target, in line with
the other debuginfo-scev-salvage tests.
Differential Revision: https://reviews.llvm.org/D120169
Second of two patches to extend SCEV-based salvaging to dbg.value
intrinsics that have multiple location ops pre-LSR. This second patch
adds the core implementation.
Reviewers: @StephenTozer, @djtodoro
Differential Revision: https://reviews.llvm.org/D120169
First of two patches that extend SCEV-based salvaging to enable
salvaging of dbg.value intrinsics that have multiple location ops
before the Loop Strength Reduction pass.
The existing single-op SCEV-based salvaging can generate variadic
dbg.value intrinsics in order to salvage a dbg.value that has a single
location op. If a dbg.value has multiple location ops before LSR, and
LSR optimises away one or more of the location operands, then currently
no salvaging will be attempted.
Salvaging can now be added, but first this patch cleans up consistency
in both the code and comments, and applies some refactoring to make
application of the new salvaging implementation more straightforward.
- Use SCEVDbgValueBuilder for both types of recovery expressions:
IV-offset based and iteration count based.
- Combine the functions that write the final DIExpression.
- Move some static functions into member functions.
Reviewers: @Orlando
Differential Revision: https://reviews.llvm.org/D120168
Currently, two GEPs will only be combined if the result element
type of one is the same as the source element type of the other.
However, this means we may miss folding opportunities where the
second GEP could be rewritten using a different element type. This
is especially relevant for opaque pointers, where constant GEPs
often use i8 element type.
Address this by converting GEP indices to offsets, adding them,
and then converting them back to indices. The first (inner) GEP
is allowed to have variable indices as well, in which case only
the constant suffix is converted into an offset.
This should address the regression reported in
https://reviews.llvm.org/D123300#3467615.
Differential Revision: https://reviews.llvm.org/D124459
I found this bug when performing a two-stage build of clang with
Function Specialization enabled and tuned aggressively. The crash
appears only on release builds.
Fixes https://github.com/llvm/llvm-project/issues/55000.
Before accessing the contents of the ArgInfo iterator inside
SCCPInstVisitor::markArgInFuncSpecialization, we should be
checking that the iterator is valid.
Differential Revision: https://reviews.llvm.org/D124114
canonicalizeClampLike canonicalizes the ule/ugt comparisons to ult/uge,
respectively. However, it does not update the variable holding the
comparison predicate type after doing this. Later code fails to handle
the non-canonical predicate type (specifically, the swap of
ThresholdLowIncl and ThresholdHighExcl when Pred0 has been canonicalized
from ugt to uge). This leads to the miscompile reported in PR53252. Fix
this by updating the comparison predicate after canonicalizing.
Fixes #53252
Differential Revision: https://reviews.llvm.org/D119690
The callback is expected to create a branch to the ContinuationBB (sometimes called FiniBB in some lambdas) argument when finishing. This creates problems:
1. The InsertPoint used for CodeGenIP does not need to be the end of a block. If it is not, a naive callback will insert a branch instruction into the middle of the block.
2. The BasicBlock the CodeGenIP is pointing to may or may not have a terminator. There is a conflict about where to branch if the block already has a terminator.
3. Some API functions only work with blocks that have a terminator. Workarounds have been used to insert a temporary terminator that is removed again.
4. Some callbacks are sensitive to whether the BasicBlock has a terminator or not. This creates a callback ordering problem where different callbacks may behave differently depending on whether a previous callback created a terminator or not. The problem also exists for FinalizeCallbackTy, where some callbacks do create a branch to another "continue" block but, unlike BodyGenCallbackTy, do not receive the target as an argument. This is not addressed in this patch.
With this patch, the callback receives a CodeGenIP into a BasicBlock where to insert instructions. If it has to insert control flow, it can split the block at that position as needed, but otherwise no separate ContinuationBB is needed. In particular, a callback can be empty without breaking the emitted IR. If the caller needs the control flow to branch to a specific target, it can insert the branch instruction itself and pass an InsertPoint before the terminator to the callback.
Certain frontends such as Clang may expect the current IRBuilder position to be at the end of a basic block. In this case their callbacks must split the block at CodeGenIP before setting the IRBuilder position, such that the instructions after CodeGenIP are moved to another basic block, and before returning create a new branch instruction to the split block.
Some utility functions such as `splitBB` are supporting correct splitting of BasicBlocks, independent of whether they have a terminator or not, returning/setting the InsertPoint of an IRBuilder to the end of split predecessor block, and optionally omitting creating a branch to the split successor block to be added later.
Reviewed By: kiranchandramohan
Differential Revision: https://reviews.llvm.org/D118409
We can always replace the undef elements in a vector constant
with regular constants to get rid of the freeze:
https://alive2.llvm.org/ce/z/nfRb4F
The select diffs show that we might do better by adjusting the
logic for a frozen select condition. We may also want to refine
the vector constant replacement to consider forming a splat.
Differential Revision: https://reviews.llvm.org/D123962
Before this patch `Args` was used to pass a broadcast's arguments by SLP.
This patch changes this. `Args` is now used for passing the operands of
the shuffle.
Differential Revision: https://reviews.llvm.org/D124202
This continues the push away from hard-coded knowledge about functions
towards attributes. We'll use this to annotate free(), realloc() and
cousins and obviate the hard-coded list of free functions.
Differential Revision: https://reviews.llvm.org/D123083
This reorganizes the code as a preparation for D123865:
* Use more descriptive names for variables
* Simplify a condition by using an already calculated value
for `MaxPeelCount`
* Remove a duplicate log entry
* Report basic values for loop costs
Differential Revision: https://reviews.llvm.org/D124388
Since the size of most SCCs is 1, the PriorityInlineOrder would not change the inline
order in the SCC inliner.
Reviewed By: kazu
Differential Revision: https://reviews.llvm.org/D123608
At the moment, unfeasible default destinations are not handled properly
in removeNonFeasibleEdges. So far, only unfeasible cases are removed,
but later code expects unreachable blocks to have no predecessors.
This is causing the crash reported in PR49573.
If the default destination is unfeasible it won't be executed. Create
a new unreachable block on demand and use that as default
destination.
Note that at the moment this only is relevant for cases where
resolvedUndefsIn marks the first case as executable. Regular switch
handling has a FIXME/TODO to support determining whether the default
case is feasible or not.
Fixes #48917.
Differential Revision: https://reviews.llvm.org/D113497
Don't check whether an input of a BDV can be pruned if the input
is the BDV itself. The BDV is present in the states map, so if the input
is the BDV itself we'd return false; explicitly check this case instead.
Differential Revision: https://reviews.llvm.org/D123846
We may be able to make the ValueTracking wrapper smarter
in the future (for example, analyze a simple recurrence),
so this will automatically benefit if that happens.
The tryToVectorize() method implements one of the search paths for vectorizable tree roots in the SLP vectorizer,
specifically for binary and comparison operations. The order of probing various scalar pairs
was defined by its implementation: the instruction operands, then climb over one operand if
the instruction is its sole user, and then perform the same actions for the other operand if previous
attempts failed. The problem with this approach is that among these options we can have more than a
single vectorizable tree candidate, and it is not necessarily the one encountered first.
Trying to build vectorizable tree for each possible combination for just evaluation is expensive.
But we already have lookahead heuristics mechanism which we use for finding best pick among
operands of commutative instructions. It calculates cumulative score for candidates in two
consecutive lanes. This patch introduces the use of this heuristic for choosing the best pair among
several combinations. We only try the one that looks most promising for vectorization.
An additional benefit is that we reduce the total number of vectorization trees built for probes
because we skip those that look non-profitable early.
Reviewed By: Alexey Bataev (ABataev), Vasileios Porpodas (vporpo)
Differential Revision: https://reviews.llvm.org/D124309
This AliasPtr is always being created from an Int64 even for targets
where 32 bit is the proper type, e.g. "thumbv7-none-linux-android16".
This causes the assert in the `get` func to fail as we're getting a 32
bit value from the APInt.
Fix this by simply always getting the type from the value instead.
Reviewed By: ChuanqiXu
Differential Revision: https://reviews.llvm.org/D123272
Using the legacy PM for the optimization pipeline was deprecated in 13.0.0.
Following recent changes to remove non-core features of the legacy
PM/optimization pipeline, remove AddressSanitizerLegacyPass...,
ModuleAddressSanitizerLegacyPass, and ASanGlobalsMetadataWrapperPass.
MemorySanitizerLegacyPass was removed in D123894.
AddressSanitizerLegacyPass was removed in D124216.
Reviewed By: #sanitizers, vitalybuka
Differential Revision: https://reviews.llvm.org/D124337
This fixes a series of mis-compiles by SimpleLoopUnswitch.
My measurements showed no performance regression with -O3 on AArch64
in SPEC2006, SPEC2017 and a set of internal benchmarks.
Fixes #50387, #50430
Depends on D124251.
Reviewed By: nikic, aqjune
Differential Revision: https://reviews.llvm.org/D124252
Logic in this pass assumes that all users of loop instructions are
either in the same loop or are LCSSA Phis. In fact, there can also
be users in unreachable blocks that currently break assertions.
Such users don't need to go to the next round of simplifications.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D124368
This patch adds a function foldSelectWithFCmpToFabs and does more combining for
fneg-of-fabs.
With 'nsz':
fold (X < +/-0.0) ? X : -X or (X <= +/-0.0) ? X : -X to -fabs(x)
fold (X > +/-0.0) ? X : -X or (X >= +/-0.0) ? X : -X to fabs(x)
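An illustrative C-level view of the two select patterns (hypothetical functions; at the IR level the fold additionally requires the 'nsz' fast-math flag):
```c
// Select of X and -X based on a compare against zero:
double neg_abs(double x) { return (x < 0.0) ? x : -x; } /* -> -fabs(x) */
double pos_abs(double x) { return (x > 0.0) ? x : -x; } /* ->  fabs(x) */
```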
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D123830
Minor refactoring to reduce size of functional change D124309:
look-ahead scoring routines pulled out of VLOperands and formed
new LookAheadHeuristics helper class.
Reviewed By: Alexey Bataev (ABataev), Vasileios Porpodas (vporpo)
Differential Revision: https://reviews.llvm.org/D124313
MisExpect diagnostics should not prevent compilation from succeeding, and the
assertion is insufficient to prevent division by zero in release builds.
This patch addresses that by replacing the assert with an early return.
Additionally, it disables MisExpect diagnostics when using sample profiling,
since this is the only known case where this error has manifested.
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D124302
We only need to insert a Freeze instruction if any of the conditions
may be poison. Similar checks are already done in the other places where
SimpleLoopUnswitch creates Freeze instructions.
Reviewed By: aeubanks, efriedma
Differential Revision: https://reviews.llvm.org/D124259
Folds are supposed to always be added in conjugated pairs for 'and'
and 'or'. Merge the two functions to make folds for which this is
currently not the case more obvious.
1d90e53044 switched this code to store
the predicates and operands in variables, but retained a
swapOperands() call here. Thus the commuted cases were no longer
folded. Additionally, as the change was not reported, the next
InstCombine iteration would not pick it up either.
Reapplying without changes, after a fix to a dependent patch.
-----
Rather than creating a PHI node and then using the PHI threading
code, directly handle this case in
FoldCondBranchOnValueKnownInPredecessor().
This change is supposed to be NFC-ish, but may cause changes due
to different transform order.
Reapply with SmallMapVector instead of SmallDenseMap, which should
address the non-determinism issue.
-----
This general threading transform can be performed whenever we know
a constant value for the condition in a predecessor, which would
currently just be the case of a phi node with constant arguments.
Using the legacy PM for the optimization pipeline was deprecated in 13.0.0.
Following recent changes to remove non-core features of the legacy
PM/optimization pipeline, remove AddressSanitizerLegacyPass,
ModuleAddressSanitizerLegacyPass, and ASanGlobalsMetadataWrapperPass.
MemorySanitizerLegacyPass was removed in D123894.
Reviewed By: #sanitizers, vitalybuka
Differential Revision: https://reviews.llvm.org/D124216
This emits an `st_size` that represents the actual usable size of an object before the redzone is added.
Reviewed By: vitalybuka, MaskRay, hctim
Differential Revision: https://reviews.llvm.org/D123010
The first attempt at this missed a check to make sure the offset
constant was in range and caused many bot failures.
That was missed in the Alive2 proof because an overshift creates
poison rather than the assert from APInt. Here's an alternate
attempt at a proof using count-trailing-zeros:
https://alive2.llvm.org/ce/z/pnXQYR
Original commit message:
This is similar to an existing pre-shift-of-constant fold:
8a9c70fc01
...but in this case, we need no-wrap on the shl and a negative
offset:
https://alive2.llvm.org/ce/z/_RVz99
This reverts commit 3df86e799e.
This reverts commit 8988254667.
`[SimplifyCFG] Handle branch on same condition in pred more directly`
caused non-determinism when compiling opt with a bootstrapped clang.
I have to revert the dependent commit as well.
Using the legacy PM for the optimization pipeline was deprecated in 13.0.0.
Following recent changes to remove non-core features of the legacy
PM/optimization pipeline, remove GCOVProfilerLegacyPass.
I have checked many LLVM users and only llvm-hs[1] uses the legacy gcov pass.
[1]: https://github.com/llvm-hs/llvm-hs/issues/392
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D123829
Using the legacy PM for the optimization pipeline was deprecated in 13.0.0.
Following recent changes to remove non-core features of the legacy
PM/optimization pipeline, remove MemorySanitizerLegacyPass.
Differential Revision: https://reviews.llvm.org/D123894
Debugify in OriginalDebugInfo mode does (DebugInfo) collect-before-pass & check-after-pass
for each instruction, which is pretty expensive. When used to analyze DebugInfo losses
in large projects (like LLVM), this raises the build time unacceptably.
This patch introduces a limit on the number of processed functions per compile unit.
By default, the limit is set to UINT_MAX (practically unlimited), and by using the introduced
option -debugify-func-limit the limit can be set to any positive integer.
Differential revision: https://reviews.llvm.org/D115714
Rather than creating a PHI node and then using the PHI threading
code, directly handle this case in
FoldCondBranchOnValueKnownInPredecessor().
This change is supposed to be NFC-ish, but may cause changes due
to different transform order.
This general threading transform can be performed whenever we know
a constant value for the condition in a predecessor, which would
currently just be the case of a phi node with constant arguments.
The legacy passes are deprecated now and will be removed in the near
future. This patch removes the legacy passes in coroutines.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D123918
This patch extends the scope of VPlan to also include the exit (aka
middle) block.
For now, the exit block remains empty, but handling of exit values will
subsequently be moved to VPlan, by adding recipes to model exit values
in the exit block.
As a first step, this will allow fixing #51366.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D123457
We're making a recursive call here and everything in the function
assumes we're looking at scalars. This would be violated if we
looked through a bitcast from vectors.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D124015
This is not expected to have a functional difference as discussed in the
post-commit comments for 8a9c70fc01. All of the motivating tests for
the older fold still optimize as expected because other code can infer
the 'nuw'.
BlockIsSimpleEnoughToThreadThrough() already checks that the phi
(and all other instructions) are not used outside the block, so
this one-use check is not necessary for legality. I also don't
see any reason why it would be necessary for profitability (in
fact, those extra uses will be replaced with constants, which
should be generally profitable).
test/Transforms/InstCombine/pr39177.ll failed in a -DLLVM_USE_SANITIZER=Undefined build.
```
lib/Transforms/Utils/BuildLibCalls.cpp:1217:17: runtime error: reference binding to null pointer of type 'llvm::Function'
```
`Function &F = *M->getFunction(Name);`
This reverts commit 0f8c626723.
The patch adds SPIRV-specific MC layer implementation, SPIRV object
file support and SPIRVInstPrinter.
Differential Revision: https://reviews.llvm.org/D116462
Authors: Aleksandr Bezzubikov, Lewis Crawford, Ilia Diachkov,
Michal Paszkowski, Andrey Tretyakov, Konrad Trifunovic
Co-authored-by: Aleksandr Bezzubikov <zuban32s@gmail.com>
Co-authored-by: Ilia Diachkov <iliya.diyachkov@intel.com>
Co-authored-by: Michal Paszkowski <michal.paszkowski@outlook.com>
Co-authored-by: Andrey Tretyakov <andrey1.tretyakov@intel.com>
Co-authored-by: Konrad Trifunovic <konrad.trifunovic@intel.com>
Reimplements MisExpect diagnostics from D66324 to reconstruct its
original checking methodology only using MD_prof branch_weights
metadata.
New checks rely on 2 invariants:
1) For frontend instrumentation, MD_prof branch_weights will always be
populated before llvm.expect intrinsics are lowered.
2) for IR and sample profiling, llvm.expect intrinsics will always be
lowered before branch_weights are populated from the IR profiles.
These invariants allow the checking to assume how the existing branch
weights are populated depending on the profiling method used, and emit
the correct diagnostics. If these invariants are ever invalidated, the
MisExpect related checks would need to be updated, potentially by
re-introducing MD_misexpect metadata, and ensuring it always will be
transformed the same way as branch_weights in other optimization passes.
Frontend based profiling is now enabled without using LLVM Args, by
introducing a new CodeGen option, and checking if the -Wmisexpect flag
has been passed on the command line.
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D115907
A new set of overloaded functions named getOrInsertLibFunc() is now supposed
to be used instead of getOrInsertFunction() when building a libcall from
within an LLVM optimizer. The idea is that this new function also makes
sure that any mandatory argument attributes are added to the function
prototype (after calling getOrInsertFunction()).
inferLibFuncAttributes() is renamed to inferNonMandatoryLibFuncAttrs() as it
only adds attributes that are not necessary for correctness but merely
help with later optimizations.
Generally, the front end is responsible for building a correct function
prototype with the needed argument attributes. If the middle end however is
the one creating the call, e.g. when replacing one libcall with another, it
then must take this responsibility.
This continues the work of properly handling argument extension if required
by the target ABI when building a lib call. getOrInsertLibFunc() now does
this for all libcalls currently built by any LLVM optimizer. It is expected
that when in the future a new optimization builds a new libcall with an
integer argument it is to be added to getOrInsertLibFunc() with the proper
handling. Note that not all targets have it in their ABI to sign/zero extend
integer arguments to the full register width, but this will be done
selectively as determined by getExtAttrForI32Param().
Review: Eli Friedman, Nikita Popov, Dávid Bolvanský
Differential Revision: https://reviews.llvm.org/D123198
With 'nuw' we can convert the increment of the shift amount
into a pre-shift (constant fold) of the shifted constant:
https://alive2.llvm.org/ce/z/FkTyR2
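An illustrative C-level sketch (hypothetical values; the IR-level fold needs 'nuw' on the shl, and the shift amounts are assumed to stay in range here):
```c
// Shifting the constant by (x + 1) is the same as pre-shifting the
// constant by 1 and then shifting by x: 0x0F000000 << 1 == 0x1E000000.
unsigned before(unsigned x) { return 0x0F000000u << (x + 1); }
unsigned after(unsigned x)  { return 0x1E000000u << x; }
```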
Fixes issue #41976
This patch moves SCEV expansion of steps used by
VPWidenIntOrFpInductionRecipes to the pre-header using
VPExpandSCEVRecipe. This ensures that those steps are expanded while the
CFG is in a valid state. Previously, SCEV expansion may happen during
vector body code-generation, during which the CFG may be invalid,
causing issues with SCEV expansion.
Depends on D122095.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D122096
This change could reduce the number of times we call `declaresCoroEarlyIntrinsics`,
and it is helpful for future changes.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D123925
This reverts commit af0285122f.
The test "libomp::loop_dispatch.c" on builder
openmp-gcc-x86_64-linux-debian fails from time to time.
See #54969. This patch is unrelated.
The description was ambiguous about the behavior
when both select arms are constant or both arms
are not constant. I don't think there's any
evidence to support either way, but this matches
the code with a more specified description.
We can extend this to deal with vector constants
with undef/poison elements. Currently, those don't
get folded anywhere.
The OMPScheduleType enum stores the constants from libomp's internal sched_type in kmp.h and is used by several kmp API functions. The enum values have an internal structure, namely each scheduling algorithm exists in four variants: unordered, ordered, nomerge unordered, and nomerge ordered.
This patch (basically a followup to D114940) splits the "ordered" and "nomerge" bits into separate flags, as was already done for the "monotonic" and "nonmonotonic" bits, so we can apply bit-flag operations on them. It also now contains all possible combinations according to kmp's sched_type. Deriving the OMPScheduleType enum from clause parameters has been moved from MLIR's OpenMPToLLVMIRTranslation.cpp to OpenMPIRBuilder to make it available for clang as well. Since the primary purpose of the flag is the binary interface to libomp, it has been made more private to LLVMFrontend. The primary interface for generating a worksharing-loop using OpenMPIRBuilder becomes `applyWorkshareLoop`, which derives the OMPScheduleType automatically and calls the appropriate emitter function.
While this is mostly an NFC refactor, it still applies the following functional changes:
* The logic from OpenMPToLLVMIRTranslation to derive the OMPScheduleType also applies to clang. Most notably, it now applies the nonmonotonic flag for non-static schedules by default.
* In OpenMPToLLVMIRTranslation, the nonmonotonic default flag was previously not applied if the simd modifier was used. I assume this was a bug, since the effect was due to `loop.schedule_modifier()` returning `mlir::omp::ScheduleModifier::none` instead of `llvm::Optional::None`.
* In OpenMPToLLVMIRTranslation, the nonmonotonic default flag was set even if ordered was specified, in breach of what the comment before it, citing the OpenMP specification, says. I assume this was an oversight.
The ordered flag with parameter was not considered in this patch. Changes will need to be made (e.g. adding/modifying function parameters) when support for it is added. The lengthy names of the enum values can be discussed, for the moment this is avoiding reusing previously existing enum value names such as `StaticChunked` to avoid confusion.
Reviewed By: peixin
Differential Revision: https://reviews.llvm.org/D123403
Until now we would only accept a broadcast load pattern if it is only used
by a single vector of instructions.
This patch relaxes this, and allows for the broadcast to have more than one
user vector, as long as all of its uses are internal to the SLP graph and
vectorized.
Differential Revision: https://reviews.llvm.org/D121940
Using the legacy PM for the optimization pipeline was deprecated in 13.0.0.
Following recent changes to remove non-core features of the legacy
PM/optimization pipeline, remove the (Thin)LTO pipelines.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D123882
This patch renames the mergefunc-sanity to mergefunc-verify and renames the related functions to use more
inclusive language
Reviewed By: cebowleratibm
Differential Revision: https://reviews.llvm.org/D114374
When we run the CGSCC pass we should only invest time on the SCC. We can
initialize AAs with information from the module slice but we should not
update those AAs. We make an exception for the call sites of the SCC as
they are helpful in providing information for the SCC.
Minor modifications to pointer privatization allow us to perform it even
in the CGSCC pass, similar to ArgumentPromotion.
Issue: https://github.com/llvm/llvm-project/issues/54430
For incoming values of phi nodes added to an outlined function to accommodate different exit paths in the function, when a value is a constant that is passed into the outlined function as an argument, we find the corresponding value in the first extracted function used to fill the overall outlined function. When this value is an argument, the corresponding value used will be the old value, prior to outlining. This patch maintains a mapping from these values to arguments, and uses this mapping to update the added phi node accordingly.
Reviewers: paquette
Recommit of d6eb480afb
Differential Revision: https://reviews.llvm.org/D122206
The previous patch introduced the offloading binary format so we can
store some metadata along with the binary image. This patch introduces
using this inside the linker wrapper and Clang instead of the previous
method that embedded the metadata in the section name.
Differential Revision: https://reviews.llvm.org/D122683
Instead of lengthy constructors we can now set the members of a
read-only struct before the Attributor is created. Should make it
clearer what is configurable and also help introducing new options in
the future. This actually added IsModulePass and avoids deduction
through the Function set size. No functional change was intended.
With opaque pointers, the stored value and address can be the same.
Previously the code in VPWidenMemoryInstructionRecipe::onlyFirstLaneDemanded
incorrectly considered stores with matching stored-value and pointer operands as
only demanding the first lane, causing a crash.
Legacy PM for optimization pipeline was deprecated in 13.0.0 and Clang dropped
legacy PM support in D123609. This change removes legacy PM passes for PGO so
that downstream projects won't be able to use it. It seems appropriate to start
removing such "add-on" features like instrumentations, before we remove more
stuff after 15.x is branched.
I have checked many LLVM users and only ldc[1] uses the legacy PGO pass.
[1]: https://github.com/ldc-developers/ldc/issues/3961
Reviewed By: davidxl
Differential Revision: https://reviews.llvm.org/D123834
This can cause crashes by accidentally optimizing out checks for
extern_weak_func != nullptr, when replaced with a known-not-null wrapper.
This solution isn't perfect (only avoids replacement on specific patterns)
but should address common cases.
Internal reference: b/185245029
Reviewed By: vitalybuka
Differential Revision: https://reviews.llvm.org/D123701
Issue: https://github.com/llvm/llvm-project/issues/54430
For incoming values of phi nodes added to an outlined function to accommodate different exit paths in the function, when a value is a constant that is passed into the outlined function as an argument, we find the corresponding value in the first extracted function used to fill the overall outlined function. When this value is an argument, the corresponding value used will be the old value, prior to outlining. This patch maintains a mapping from these values to arguments, and uses this mapping to update the added phi node accordingly.
Reviewers: paquette
Differential Revision: https://reviews.llvm.org/D122206
Issue: https://github.com/llvm/llvm-project/issues/54431
PHINodes that need to be generated to accommodate a PHINode outside the region due to different output paths need to have their own numbering to determine the number of output schemes required to properly handle all the outlined regions. This numbering was previously only determined by the order and values of the incoming values, as well as the parent block of the PHINode. This adds the incoming blocks to the calculation of a hash value for these PHINodes as well, and the supporting infrastructure to give each block in a region a corresponding canonical numbering.
Reviewer: paquette
Differential Revision: https://reviews.llvm.org/D122207
This addresses an existing TODO by keeping a mapping of external IR
Value * definitions wrapped in VPValues for use in a VPlan.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D123700
The NewDefault was used to simplify the updating of PHI nodes, but it
causes some inefficiency for targets that will run the structurizer later. For
example, for a simple two-case switch, the extra NewDefault is causing
unstructured CFG like:
      O
     / \
    O   O
   / \ / \
  C1  ND  C2
   \  |  /
    \ | /
      D
The change is to avoid the ND (NewDefault) block; that is, we will get a
structured CFG for the above example like:
      O
     / \
    /   \
   O     O
  / \   / \
 C1  \ /  C2
  \-> D <-/
The IR change introduced by this patch should be trivial for other targets,
so I am doing this unconditionally.
Fall-through among the cases will also cause an unstructured CFG, but it needs
more work and will be addressed in a separate change.
Reviewed by: arsenm
Differential Revision: https://reviews.llvm.org/D123607
This reverts commit e810d55809.
The commit did not take into account the fact that the strdup'ed string could be
modified. Checking whether such a modification happens would make the function very
costly, and without a test case in mind it's not worth the effort.
Updated LowerGuardIntrinsic and LowerWidenableCondition to check for
users of the respective intrinsic, instead of checking for guards and
widenable conditions by traversing the entire function.
This is NFC and should save some compile time.
This reverts the revert commit 1ddc719680.
This version of the patch sets the initial available value to poison,
which resolves an issue with the SSAUpdater breaking LCSSA form.
C11 specifies memchr() as follows:
> The memchr function locates the first occurrence of c (converted
> to an unsigned char) in the initial n characters (each interpreted
> as unsigned char) of the object pointed to by s. The implementation
> shall behave as if it reads the characters sequentially and stops
> as soon as a matching character is found.
In particular, it is well-defined to specify a memchr size larger
than the underlying object, as long as the character is found before
the end of the object.
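A small C example of the guarantee this relies on (illustrative only):
```c
#include <assert.h>
#include <string.h>

int main(void) {
  char s[3] = "ax";
  /* The size (100) exceeds the 3-byte object, but this is well-defined:
     'x' is found at s[1], so memchr never reads past it. */
  void *p = memchr(s, 'x', 100);
  assert(p == &s[1]);
  return 0;
}
```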
Differential Revision: https://reviews.llvm.org/D123665
This should be "NFC" as written, but it will make D122485 smaller
and give us more flexibility to experiment with optimization level
vs. compile-time.
Differential Revision: https://reviews.llvm.org/D123625
Currently the SLP vectorizer walks through the instructions and selects
3 main classes of values: 1) reduction operations - instructions with the same
reduction opcode (add, mul, min/max, etc.), which build the reduction,
2) reduced values - instructions with the same opcodes, but different
from the reduction opcode, 3) extra arguments - all other values,
instructions from a different basic block than the root node,
instructions with too many/too few uses.
This scheme is not very efficient. It excludes some instructions and all
non-instruction values from the reductions (constants, profitable
gathers), and too many possibly reduced values are marked as extra arguments.
The patch improves this process by introducing a slightly extended analysis
stage. During this stage, we still try to select 3 classes of
values: 1) reduction operations - same as before, 2) possibly reduced
values - all instructions from the current block/non-instructions, which
may build a vectorization tree, 3) extra arguments - instructions from
the different basic blocks. Additionally, an extra sorting of the
possibly reduced values occurs to build the scalar sequences which
highly likely will be vectorized, e.g. loads are grouped by the
distance between them, constants are grouped together, cmp instructions
are sorted by their compare types and predicates, extractelement
instructions are sorted by the vector operand, etc. Also, these groups
are reordered by their length so the longest group is the first in the
list of the possibly reduced values.
The vectorization process tries to emit the reductions for all these
groups. These reductions, remaining non-vectorized possible reduced
values and extra arguments are then combined into the final expression
just like it was before.
Differential Revision: https://reviews.llvm.org/D114171
We need to explicitly query the shadow here, because it is lazily
initialized for byval arguments. Without opaque pointers this used to
mostly work out, because there would be a bitcast to `i8*` present, and
that would query, and copy in case of byval, the argument shadow.
Reviewed By: vitalybuka, eugenis
Differential Revision: https://reviews.llvm.org/D123602
This renames functions for more general usage (and current capitalization style)
before a proposed logic change in D122485.
Differential Revision: https://reviews.llvm.org/D123614
This diff extends foldSelectInstWithICmp to handle the case icmp(X) ? f(X) : C
when f(X) is guaranteed to be equal to C for all X in the exact range of the inverse predicate.
This addresses the issue https://github.com/llvm/llvm-project/issues/54089.
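A hedged C sketch of such a case (hypothetical functions, not taken from the patch): when the condition is false, x lies in [0, 3], where x & ~3u already equals the constant 0, so the select collapses to the true arm.
```c
unsigned before(unsigned x) { return (x > 3u) ? (x & ~3u) : 0u; }
unsigned after(unsigned x)  { return x & ~3u; } /* same result for all x */
```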
Differential revision: https://reviews.llvm.org/D123159
Test plan: make check-all
The test is already simplified, and I'm not sure how
to write a test to exercise the new clause. But it
protects the 2-bit pattern from miscompiling as noted
in D123453.
https://alive2.llvm.org/ce/z/QPyVfv
(If we managed to fall into the mul transform, it
would wrongly create a zero on this pattern.)
IMO when the user provides an unroll pragma, the compiler should always respect it.
It is not clear to me why the loop unroll pass currently ensures that the
unrolled loop size is limited by PragmaUnrollThreshold.
Reviewed By: Meinersbur
Differential Revision: https://reviews.llvm.org/D119148
When only a store is sunk, there is no need to create a load in the
pre-header, as the result of the load will never get used.
The dead load can introduce UB if the function is marked as
writeonly.
Fixes #51248.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D123473
After D121624 models the pre-header in VPlan, VPExpandSCEVRecipes can be
placed there. This ensures SCEV expansion happens before modifying the
CFG during VPlan execution, when CFG is incomplete.
Depends on D121624.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D122095
This patch extends the scope of VPlan to also model the pre-header.
The pre-header can be used to place recipes that should be code-gen'd
outside the loop, like SCEV expansion.
Depends on D121623.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D121624
This fixes the code to actually use the location of the instruction, if
available. Previously, SetInsertPoint would overwrite the insert point
set from the instruction.
And thread DSE's ephemeral values to EarliestEscapeInfo.
This allows more precise analysis in DSEState::isReadClobber() via BatchAA.
Followup to D123162.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D123342
Loop Strength Reduce sometimes optimizes away all uses of an induction variable
from a loop but leaves the IV increments. When the only remaining use of the IV
is the PHI in the exit block, this patch will call rewriteLoopExitValues to
replace the exit block PHI with the final value of the IV to skip the updates
in each loop iteration.
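A minimal C sketch of the situation (hypothetical code): after LSR removes all in-loop uses of the counter, only the exit value remains and can be rewritten as the closed-form final IV value.
```c
int count_up_to(int n) {
  int i = 0;
  for (int j = 0; j < n; ++j)
    ++i;              /* IV increment with no other in-loop use of i */
  return i;           /* exit value; can become (n > 0) ? n : 0 */
}
```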
Differential Revision: https://reviews.llvm.org/D118808
This makes MemorySSA in LoopSink required, and removes the AST-based
implementation, as well as the related support code in LICM.
Differential Revision: https://reviews.llvm.org/D123288
It actually implements support for seeing through loads, using alias analysis to
refine the result.
This is rather limited, but I didn't want to rely on more than the available
analysis at that point (to be gentle with compilation time), and it does seem to
catch common scenarios, as showcased by the included tests.
Differential Revision: https://reviews.llvm.org/D122431
Currently, the utility supports lowering of non-atomic memory transfer routines only. This patch adds support for the atomic version of memcpy. This may be useful for targets not supporting atomic memcpy.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D118443
Similar to the problem in 0bb25b4603, bitcasts that are inserted must
dominate all uses. When rewriting "values" with "new values" that have
the updated address space, we may replace the "new value" with a bitcast
if one of the original users is an addresspace cast. This bitcast must
be inserted before ALL users, not only before the addresspace cast.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D122964
By adding a parameter to the function FoldOpIntoSelect, we can fold more Ops into Select.
For this example, we tend to fold the division instruction,
so we no longer care whether the SelectInst has one use.
This patch solves the TODO left in InstCombine/div.ll.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D122967
This is part of being able to get rid of two more columns in
MemoryBuiltins.cpp's large table. We'll have two more changes before
we can finish the job.
Differential Revision: https://reviews.llvm.org/D119582
Sometimes we can infer an align from an allocalign but the function
already promised it'd be more-aligned than the allocalign and there's an
existing align that we shouldn't reduce. Make sure we handle that
correctly.
Differential Revision: https://reviews.llvm.org/D121642
This is an lshr equivalent of D122340 - if we don't demand any of the additional sign bits introduced by the ashr, the lshr can be treated as an ashr and we can remove the shift entirely if we only demand already known sign bits.
Another step towards PR21929
https://alive2.llvm.org/ce/z/6f3kjq
Differential Revision: https://reviews.llvm.org/D123118
LoopSink with the legacy pass manager still uses AST, because we
can't compute MemorySSA conditionally. I think now that the legacy
pass manager will be removed soon(TM) we don't need to care about
compile-time impact here anymore. Additionally, since MemorySSA is
no longer eagerly optimized, the impact is actually not that high
anymore (~0.2% geomean regression on CTMark).
This just makes legacy PM and new PM behavior line up -- as a
followup I'll drop these options entirely and make MemorySSA use
mandatory.
Differential Revision: https://reviews.llvm.org/D123216
Motivated by pr43326 (https://bugs.llvm.org/show_bug.cgi?id=43326), where a slightly
modified case is as follows.
void f(int e[10][10][10], int f[10][10][10]) {
  for (int a = 0; a < 10; a++)
    for (int b = 0; b < 10; b++)
      for (int c = 0; c < 10; c++)
        f[c][b][a] = e[c][b][a];
}
The ideal optimal access pattern after running interchange is supposed to be the following
void f(int e[10][10][10], int f[10][10][10]) {
  for (int c = 0; c < 10; c++)
    for (int b = 0; b < 10; b++)
      for (int a = 0; a < 10; a++)
        f[c][b][a] = e[c][b][a];
}
Currently loop interchange is limited to picking up the innermost loop and finding an order
that is locally optimal for it. However, the pass failed to produce the globally optimal
loop access order. For more complex examples what we get could be quite far from the
globally optimal ordering.
What is proposed in this patch is to do a "bubble-sort" fashion when doing interchange.
By comparing neighbors in `LoopList` in each iteration, we would be able to move each loop
onto a most appropriate place, hence this is an approach that tries to achieve the
globally optimal ordering.
The motivating example above is added as a test case.
Reviewed By: Meinersbur
Differential Revision: https://reviews.llvm.org/D120386
Add void casts to mark the variables used, next to the places where
they are used in assert or `LLVM_DEBUG()` expressions.
Differential Revision: https://reviews.llvm.org/D123117
By specification, the source and destination of llvm.memcpy.* must either be equal or non-overlapping. This property is hard or impossible to figure out once lowered. This patch explicitly marks loads from the source and stores to the destination as not aliasing if the source and destination are known to be not equal.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D118441
The Attributor, as many other parts in LLVM, uses pointer equivalence
for `llvm::Value`s. This only works as long as `llvm::Value`s are
dynamically unique, or, to be exact, we will never end up with the same
`llvm::Value` representing two dynamic instances. We already provided a
helper to check the former, namely `AA::isDynamicallyUnique`, however we
could not check the latter. In this patch we move the logic into a
separate AA which helps with the growing complexity and use cases. We
also extend the interface to answer the second question rather than the
first. So we do not determine dynamically uniqueness but if we might end
up with the `llvm::Value` describing a different dynamic instance. Note
that the latter is very much tied to the Attributor capabilities to look
through memory, recursion, etc. so we need to update the logic as we go.
We look through loads in the "generic value traversal" and we
consequently don't need to look through them again in AAValueSimplify*.
The test changes stem from the fact that we allowed any simplified
value, incl. non-dynamically unique ones, as long as the underlying
memory was an alloca. This doesn't seem to make sense as allocas do not
protect against dynamically non-unique values. We need to make the
unique check better rather than excluding allocas. That in mind, we can
remove a lot of code by simply relying on the generic value traversal
load look through.
To soften the blow some minor adjustments have been made that allow more
simplification through the now used scheme and some tests have been
given a `norecurse` for now.
With D106397 we ensured that `AAReachability` will not answer queries for
potentially recursive functions. This was necessary as we did not treat
recursion explicitly otherwise. Now that we have
`AA::isPotentiallyReachable` we can make `AAReachability` a purely
intra-procedural AA which does not care about recursion.
`AA::isPotentiallyReachable`, however, does already deal with "going
back" the call graph and can now do so for potentially recursive
functions.
Add statistics to count overall devirtualized targets as well as the
various types of devirtualizations applied at callsites.
Differential Revision: https://reviews.llvm.org/D123152
If we ignore droppable users everything only used in llvm.assume (among
other things) is going to be deleted as dead. This is not helpful.
Instead we want to only delete things we actually don't need anymore. A
follow up will deal with loads in a smarter way.
Partially inlining a libcall that has the musttail attribute
leads to broken LLVM IR, triggering an assertion in the IR verifier.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D123116
When creating induction resume values, SCEV queries may rely on
LoopInfo. Make sure vector.body gets added to the loop of the pre-header
during skeleton construction.
%vector.body will be moved to the vector preheader during VPlan
execution.
Fixes #54745.
getExtAttrForI32Param() is the method to be used for determining the type of
extension attribute (if any) that is to be added for a signed/unsigned
argument.
Previously, the SExt attribute was always added to the i32 ldexp* argument as
it was expected to be ignored by targets not needing it. This patch now
changes this so that it is only added for the targets that need it in the
first place.
The putchar() argument is now also extended as required by the target (SystemZ in
the test), to fix the issue below. Many more libcalls will be handled
similarly in a following patch.
Fixes https://github.com/llvm/llvm-project/issues/54532.
Differential Revision: https://reviews.llvm.org/D123030
Review: Eli Friedman
When simplifying values we might end up with an instruction from a
different scope or just one that does not dominate the use. If the
instruction can be reproduced without side-effect (incl. UB) we can
now do that. For now this is mostly used for speculatable (intrinsic)
calls but as we learn to make things like arguments or loads available
this will become more powerful.
This will also allow us to remove dead stores more easily in a follow
up.
Update VPInterleavedAccessInfo to use the generic getVectorLoopRegion
helper instead of relying on the entry block being the top-most vector
loop region.
If both the character and string are known, but the length
potentially isn't, we can optimize the memchr() call to a select
of either the known position of the character or null.
Split off from https://reviews.llvm.org/D122836.
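A hedged C illustration (hypothetical functions): the string and character are known, only the length n is not, so the call becomes a select on n.
```c
#include <stddef.h>
#include <string.h>

static const char s[] = "hello"; /* first 'l' is at index 2 */

const char *find_before(size_t n) { return memchr(s, 'l', n); }
const char *find_after(size_t n)  { return (n > 2) ? s + 2 : NULL; }
```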
Handle the simple constant char case before the bitmask optimization.
This will allow extending the code to handle a non-constant size
argument in a followup change.
Split out from https://reviews.llvm.org/D122836.
If the memchr() size is 1, then we can convert the call into a
single-byte comparison. This works even if both the string and the
character are unknown.
Split off from https://reviews.llvm.org/D122836.
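A hedged C illustration of the size-1 case (hypothetical functions):
```c
#include <stddef.h>
#include <string.h>

void *find_before(const void *s, int c) { return memchr(s, c, 1); }
void *find_after(const void *s, int c) {
  /* memchr compares the first byte against (unsigned char)c. */
  return (*(const unsigned char *)s == (unsigned char)c) ? (void *)s : NULL;
}
```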
As discussed on https://github.com/llvm/llvm-project/issues/54682,
MemorySSA currently has a bug when computing the clobber of calls
that access loop-varying locations. I think a "proper" fix for this
on the MemorySSA side might be non-trivial, but we can easily work
around this in MemCpyOpt:
Currently, MemCpyOpt uses a location-less getClobberingMemoryAccess()
call to find a clobber on either the src or dest location, and then
refines it for the src and dest clobber. This was intended as an
optimization, as the location-less API is cached, while the
location-affected APIs are not.
However, I don't think this really makes a difference in practice,
because I don't think anything will use the cached clobbers on
those calls later anyway. On CTMark, this patch seems to be very
mildly positive actually.
So I think this is a reasonable way to avoid the problem for now,
though MemorySSA should also get a fix.
Differential Revision: https://reviews.llvm.org/D122911
The range calculation in walkForwards() assumes that the ranges of
the operands have already been calculated. With the used visit
order, this is not necessarily the case when there are multiple
roots. (There is nothing guaranteeing that instructions are visited
in topological order.)
Fix this by queuing instructions for reprocessing if the operand
ranges haven't been calculated yet.
Fixes https://github.com/llvm/llvm-project/issues/54669.
Differential Revision: https://reviews.llvm.org/D122817
I didn't dig into this very much because it appears to be totally valid
(especially once these properties can come from attributes instead
of only from hard-coded library functions) for TLI to not be defined,
and nothing broke when I added this check, including with all my other
patches applied.
Differential Revision: https://reviews.llvm.org/D122917
The search for the clobbering call is fairly expensive if uses are not optimized at construction. Defer the clobber walk to the point in the implementation we need it; there are a bunch of bailouts before that point. (e.g. If the source pointer is not an alloca, we can't do callslotopt.)
On a test case which involves a bunch of copies from argument pointers, this switches memcpyopt from > 1/2 second to < 10ms.
This is a hacky fix for:
https://github.com/llvm/llvm-project/issues/54558
As discussed there, codegen regressed when we opened up this transform
to allow extra uses ( 61580d0949 ), and it's not clear how to
undo the transforms at the later stage of compilation.
As noted in the code comments, there's a set of remaining folds that
are still limited to one-use, so we can try harder to refine and
expand the limitations on these folds, but it's likely to be an
up-and-down battle as we find and overcome similar regressions.
Differential Revision: https://reviews.llvm.org/D122909
This is a retry of 9397bdc67e - that was reverted until
we had a clang warning in place to alert users about a
possible mistake in source. The warning was added with
ab982eace6.
This is noted as a missing clang warning in #54222,
but it is also a missing optimization opportunity.
Alive2 proofs:
https://alive2.llvm.org/ce/z/Q8drDq
https://alive2.llvm.org/ce/z/pE6LRt
I don't see a single conversion for all predicates
using "getFCmpCode" logic, so other predicates are
left as a TODO item.
These two are equivalent,
and I *think* the `and` form is more-ish canonical.
General proof: https://alive2.llvm.org/ce/z/RrF5s6
If constant on the (outer) `xor` is an `undef`,
the whole lane is dead: https://alive2.llvm.org/ce/z/mu4Sh2
However, if the constant on the (inner) `or` is an `undef`,
we must sanitize it first: https://alive2.llvm.org/ce/z/MHYJL7
I guess, producing a zero `and`-mask is optimal in that case.
alive-tv is happy about the entirety of `xor-of-or.ll`.
This refactor makes it easier to extend the logic to collect information
from blocks in the future, without even further increasing the size of
eliminateConstraints.
This was exposed by 14e3650f. The recommit of 14e3650f will hit the
problematic code path requiring the workaround.
Also includes a test case that crashes without the workaround.
During skeleton construction for the epilogue vector loop, generic
helpers use getOrCreateTripCount, which will re-expand the trip count
computation. Instead, re-use the TripCount created during main loop
vectorization.
If the frame pointer is an argument of the original function (which
happens with opaque pointers), then we currently first replace the
argument with undef, which will prevent later replacement of the
old frame pointer with the new one.
Fix this by replacing arguments with some dummy instructions first,
and then replacing those with undef later. This gives us a chance
to replace the frame pointer before it becomes undef.
Fixes https://github.com/llvm/llvm-project/issues/54523.
Differential Revision: https://reviews.llvm.org/D122375
This isn't expected to reduce compilation times as 'max-iters' is set to
one by default, but it helps with recursive functions that require higher
iteration counts.
Differential Revision: https://reviews.llvm.org/D122819
Reimplements MisExpect diagnostics from D66324 to reconstruct its
original checking methodology only using MD_prof branch_weights
metadata.
New checks rely on 2 invariants:
1) For frontend instrumentation, MD_prof branch_weights will always be
populated before llvm.expect intrinsics are lowered.
2) for IR and sample profiling, llvm.expect intrinsics will always be
lowered before branch_weights are populated from the IR profiles.
These invariants allow the checking to assume how the existing branch
weights are populated depending on the profiling method used, and emit
the correct diagnostics. If these invariants are ever invalidated, the
MisExpect related checks would need to be updated, potentially by
re-introducing MD_misexpect metadata, and ensuring it always will be
transformed the same way as branch_weights in other optimization passes.
Frontend based profiling is now enabled without using LLVM Args, by
introducing a new CodeGen option, and checking if the -Wmisexpect flag
has been passed on the command line.
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D115907
Instead of first creating a lambda for calculating the range,
then collecting the ranges for the operands, and then calling the
lambda on those ranges, we can first calculate the operand ranges
and then calculate the result directly in the switch.
This reverts the revert commit 2760cdc9c6.
This version pulls in the code to create the vector loop object in VPlan
from D121624.
This is needed because otherwise existing LoopInfo verification will
fail, as a loop block doesn't have in-loop successors now that we
do not replace the branch.
Now that we do not add new loops during skeleton construction, there's
also no need to verify LI there.
According to the current design, if a floating point operation is
represented by a constrained intrinsic somewhere in a function, all
floating point operations in the function must be represented by
constrained intrinsics. This imposes additional requirements on the
inlining mechanism. If a non-strictfp function is inlined into a strictfp function,
all ordinary FP operations must be replaced with their constrained
counterparts.
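For a rough idea of what that replacement looks like when emitting IR, a hedged sketch using IRBuilder's constrained-FP mode (helper name is hypothetical):
```
#include "llvm/IR/IRBuilder.h"
using namespace llvm;

// Recreate an ordinary fadd as its constrained counterpart by putting the
// builder into FP-constrained mode, roughly what inlining a non-strictfp
// callee into a strictfp caller requires for every FP operation.
static Value *emitConstrainedFAdd(Instruction *InsertPt, Value *LHS, Value *RHS) {
  IRBuilder<> B(InsertPt);
  B.setIsFPConstrained(true);
  B.setDefaultConstrainedExcept(fp::ebStrict);
  B.setDefaultConstrainedRounding(RoundingMode::Dynamic);
  // In constrained mode, CreateFAdd emits llvm.experimental.constrained.fadd.
  return B.CreateFAdd(LHS, RHS);
}
```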
Inlining a strictfp function into a non-strictfp one is not implemented, as it
would require replacement of all FP operations in the host function,
which now is undesirable due to expected performance loss.
Differential Revision: https://reviews.llvm.org/D69798
This fixes a TODO in constantArgPropagation() to make it feature complete.
However, I do find myself in agreement with the review comments in
https://reviews.llvm.org/D106426. I don't think we should pursue
specializing such recursive functions as the code size increase becomes
linear to 'max-iters'. Compiling the modified test just with -O3 (no
function specialization) generates the same code.
Differential Revision: https://reviews.llvm.org/D122755
The only remaining use was to get the exit block of the loop. Instead of
relying on the loop, use the successor of VectorHeaderBB
(LoopMiddleBlock) directly to set VPTransformState::CFG::ExitB
Depends on D121621.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D121623
Allow receiving memcpy/memset/memmove instrumentation by using __asan or
__hwasan prefixed versions for AddressSanitizer and HWAddressSanitizer
respectively when compiling in kernel mode, by passing params
-asan-kernel-mem-intrinsic-prefix or -hwasan-kernel-mem-intrinsic-prefix.
By default the kernel-specialized versions of both passes drop the
prefixes for calls generated by memintrinsics. This assumes that all
locations that can lower the intrinsics to libcalls can safely be
instrumented. This unfortunately is not the case when implicit calls to
memintrinsics are inserted by the compiler in no_sanitize functions [1].
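A hedged C++ illustration of how such an implicit call can arise (example code, not taken from the kernel report):
```
// Even inside a no_sanitize function the compiler may lower an aggregate
// copy to an implicit call to memcpy; if memcpy itself is instrumented,
// that call is still checked despite the no_sanitize attribute.
struct Big { char bytes[256]; };

__attribute__((no_sanitize("address")))
void copy_big(Big *dst, const Big *src) {
  *dst = *src; // may be lowered by the compiler to a call to memcpy
}
```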
To solve the issue, normal memcpy/memset/memmove need to be
uninstrumented, and instrumented code should instead use the prefixed
versions. This also aligns with ASan behaviour in user space.
[1] https://lore.kernel.org/lkml/Yj2yYFloadFobRPx@lakrids/
Reviewed By: glider
Differential Revision: https://reviews.llvm.org/D122724
When MaximizeVectorBandwidth is enabled, we can end up (via calls to
collectUniformsAndScalars/setCostBasedWideningDecision through
calculateRegisterUsage) making widening decisions before we have decided
whether to fold the tail by masking. These decisions will be wrong if we
later decided to fold the tail, for example when the trip count is very
low. It will use incorrect costs for loads that should get masked, using
standard memory operation costs instead.
This still at the moment uses the EmulatedMaskMemRefHack costs (a bit
unfortunately), but the old costs without this change were 1, leading to
too optimistic vectorization.
This slightly changes the way that the MaximizeVectorBandwidth option
works to make it easier to test, always honouring the option if it is
set.
Differential Revision: https://reviews.llvm.org/D120215
According to the LLVM debug info update guide: https://llvm.org/docs/HowToUpdateDebugInfo.html,
"Hoisting identical instructions which appear in several successor
blocks into a predecessor block. In this case there is no single
merged instruction. The rule for dropping locations applies".
Thanks to Yuanbo Li for reporting this.
Reviewed By: dblaikie
Reviewers: sebpop, tejohnson, dblaikie
Differential Revision: https://reviews.llvm.org/D122730
Factor in the TBAA of adjacent stores instead of just the head store
when merging stores into a memset. We were seeing GVN remove a load that
had a TBAA that matched the 2nd store because GVN determined it didn't
match the TBAA of the memset. The memset had the TBAA of only the first
store.
i.e. Loading the field pi_ of shared_count after memset to create an
array of shared_ptr
```
template<class T>
class shared_ptr {
  T *p;
  shared_count refcount;
};
class shared_count {
  sp_counted_base *pi_;
};
```
Differential Revision: https://reviews.llvm.org/D122205
Instead of looking up the vector loop using the header, keep track of
the current vector loop in VPTransformState. This removes the
requirement for the vector header block being part of the loop up front.
A follow-up patch will move the code to generate the Loop object for the
vector loop to VPRegionBlock.
Depends on D121619.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D121621
Avoids merge errors when opaque pointers are loaded into different types.
Reviewed by: jcranmer-intel, hiraditya
Differential Revision: https://reviews.llvm.org/D122521
Now that all dependencies on creating the latch block up-front have been
removed, there is no need to create it early.
Depends on D121618.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D121619
In some cases, like in the added test case, we can reach
selectInterleaveCount with loops that actually have a cost of 0.
Unfortunately a loop cost of 0 is also used to communicate that the cost
has not been computed yet. To resolve the crash, bail out if the cost
remains zero after computing it.
This seems like the best option, as there are multiple code paths that
return a cost of 0 to force a computation in selectInterleaveCount.
Computing the cost at multiple places up front there would unnecessarily
complicate the logic.
Fixes #54413.
DXIL is wrapped in a container format defined by the DirectX 11
specification. Codebases differ in calling this format either DXBC or
DXILContainer.
Since eventually we want to add support for DXBC as a target
architecture and the format is used by DXBC and DXIL, I've termed it
DXContainer here.
Most of the changes in this patch are just adding cases to switch
statements to address warnings.
Reviewed By: pete
Differential Revision: https://reviews.llvm.org/D122062
This patch moves the code to set the correct incoming block for the
backedge value to VPlan::execute.
When generating the phi node, the backedge value is temporarily added
using the pre-header as incoming block. The invalid phi node will be
fixed up during VPlan::execute after main VPlan code generation.
At the same time, the backedge value is also moved to the latch.
This change removes the requirement to create the latch block up-front
for VPWidenInductionPHIRecipe::execute, which in turn will enable
modeling the pre-header in VPlan.
Depends on D121617.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D121618
According to the definition of canonical form, a formula is canonical
if, when the scale register does not contain an addrec for loop L, none
of the base registers contain an addrec for this loop either.
The critical word here is "contains".
The current checker of canonical form does not check the "containing"
property but the "is" property: it checks whether a register is an
addrec rather than whether it contains one.
Fix the checker and the canonicalizing utility to follow the definition.
Without this fix, in the attached test the base formula
reg((-1 * {0,+,8}<nuw><nsw><%bb2>)<nsw>) + 1*reg((8 * (%arg /u 8))<nuw>)
is considered canonical even though a base contains an addrec,
while the modified formula we want to insert,
reg({0,+,8}<nuw><nsw><%bb2>) + 1*reg((-8 * (%arg /u 8))),
is considered not canonical.
Reviewed By: mkazantsev
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D122457
We have the same code repeated in both callers; sink it into the callee.
The motivation here isn't just code style; we can also defer the relatively expensive aliasing checks until the cheap structural preconditions have been validated. (e.g. don't bother with aliasing checks if src is not an alloca.) This helps compile time significantly.
If we vectorize e.g. a store, we leave around a bunch of getelementptrs for the individual scalar stores which we removed. We can go ahead and delete them as well.
This is purely for test output quality and readability. It should have no effect in any sane pipeline.
Differential Revision: https://reviews.llvm.org/D122493
Inline assembly is scary but we need to support it for the OpenMP GPU
device runtime. The new assumption expresses the fact that it may not
have call semantics, that is, it will not call another function but
simply perform an operation or side-effect. This is important for
reachability in the presence of inline assembly.
Differential Revision: https://reviews.llvm.org/D109986
Before we gave up if a call through bitcast had parameter attributes.
Interestingly, we allowed attributes for the return value already. We
now handle both the same way, namely, we drop the ones that are
incompatible with the new type and keep the rest. This cannot cause
"more UB" than initially present.
Differential Revision: https://reviews.llvm.org/D119967
Reimplements MisExpect diagnostics from D66324 to reconstruct its
original checking methodology only using MD_prof branch_weights
metadata.
New checks rely on 2 invariants:
1) For frontend instrumentation, MD_prof branch_weights will always be
populated before llvm.expect intrinsics are lowered.
2) for IR and sample profiling, llvm.expect intrinsics will always be
lowered before branch_weights are populated from the IR profiles.
These invariants allow the checking to assume how the existing branch
weights are populated depending on the profiling method used, and emit
the correct diagnostics. If these invariants are ever invalidated, the
MisExpect related checks would need to be updated, potentially by
re-introducing MD_misexpect metadata, and ensuring it always will be
transformed the same way as branch_weights in other optimization passes.
Frontend based profiling is now enabled without using LLVM Args, by
introducing a new CodeGen option, and checking if the -Wmisexpect flag
has been passed on the command line.
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D115907
This patch moves the code to set the correct incoming block for the
backedge value to VPlan::execute.
When generating the phi node, the backedge value is temporarily added
using the pre-header as incoming block. The invalid phi node will be
fixed up during VPlan::execute after main VPlan code generation.
At the same time, the backedge value is also moved to the latch.
This change removes the requirement to create the latch block up-front
for VPWidenIntOrFpInductionRecipe::execute, which in turn will enable
modeling the pre-header in VPlan.
As an alternative, the increment could be modeled as separate recipe,
but that would require more work and a bit of redundant code, as we need
to create the step-vector during VPWidenIntOrFpInductionRecipe::execute
anyways, to create the values for different parts.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D121617
This was reusing a cast to GlobalVariable to check for an
Instruction, which means we'll try to dereference a null pointer
if it's not actually a GlobalVariable. We should be casting
MTI->getSource() instead.
I don't think this problem is really specific to opaque pointers,
but it certainly makes it a lot easier to reproduce.
Fixes https://github.com/llvm/llvm-project/issues/54572.
The current implementation of Function Specialization does not allow
specializing more than one arguments per function call, which is a
limitation I am lifting with this patch.
My main challenge was to choose the most suitable ADT for storing the
specializations. We need an associative container for binding all the
actual arguments of a specialization to the function call. We also
need a consistent iteration order across executions. Lastly we want
to be able to sort the entries by Gain and reject the least profitable
ones.
MapVector fits the bill but not quite; erasing elements is expensive
and using stable_sort messes up the indices to the underlying vector.
I am therefore using the underlying vector directly after calculating
the Gain.
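A rough sketch of that approach with hypothetical names (SpecializationInfo, keepMostProfitable), sorting the underlying vector by Gain and dropping the least profitable tail:
```
#include "llvm/ADT/SmallVector.h"
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Keep candidate specializations in a plain vector; once each Gain is known,
// sort by Gain and trim the least profitable entries. Erasing from a
// MapVector or stable_sort'ing it would disturb the index-based mapping.
struct SpecializationInfo {
  llvm::SmallVector<unsigned, 4> ArgNos; // formal arguments being specialized
  int64_t Gain = 0;
};

static void keepMostProfitable(llvm::SmallVector<SpecializationInfo, 8> &Specs,
                               std::size_t MaxSpecs) {
  std::stable_sort(Specs.begin(), Specs.end(),
                   [](const SpecializationInfo &A, const SpecializationInfo &B) {
                     return A.Gain > B.Gain; // highest gain first
                   });
  if (Specs.size() > MaxSpecs)
    Specs.resize(MaxSpecs);
}
```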
Differential Revision: https://reviews.llvm.org/D119880
This simplifies the implementation of eraseInstruction by moving the odd-replace-users-with-undef handling back to the only caller which uses it. This handling was not obviously correct, so add the asserts which make it clear why this is safe to do at all. The result is simpler code and stronger assertions.
The original commit exposed several missing dependencies (e.g. latent bugs in SLP scheduling). Most of these were fixed over the weekend and have had several days to bake. The last was fixed this morning after being noticed in manual review of test changes yesterday. See the review thread for links to each change.
Original commit message follows:
SLP currently schedules all instructions within a scheduling window which stretches from the first instruction potentially vectorized to the last. This window can include a very large number of unrelated instructions which are not being considered for vectorization. This change switches the code to only schedule the sub-graph consisting of the instructions being vectorized and their transitive users.
This has the effect of greatly reducing the amount of work performed in large basic blocks, and thus greatly improves compile time on degenerate examples. To understand the effects, I added some statistics (not planned for upstream contribution). Here's an illustration from my motivating example:
Before this patch:
704357 SLP - Number of calcDeps actions
699021 SLP - Number of schedule calls
5598 SLP - Number of ReSchedule actions
59 SLP - Number of ReScheduleOnFail actions
10084 SLP - Number of schedule resets
8523 SLP - Number of vector instructions generated
After this patch:
102895 SLP - Number of calcDeps actions
161916 SLP - Number of schedule calls
5637 SLP - Number of ReSchedule actions
55 SLP - Number of ReScheduleOnFail actions
10083 SLP - Number of schedule resets
8403 SLP - Number of vector instructions generated
I do want to highlight that there is a small difference in the number of generated vector instructions. This example is hitting the bailout due to maximum window size, and the change in scheduling is slightly perturbing when and how we hit it. This can be seen in the RescheduleOnFail counter change. Given that, I think we can safely ignore the difference.
The downside of this change can be seen in the large test diff. We group all vectorizable instructions together at the bottom of the scheduling region. This means that vector instructions can move quite far from their original point in code. While maybe undesirable, I don't see this as being a major problem as this pass is not intended to be a general scheduling pass.
For context, it's worth noting that the pre-scheduling that SLP does while building the vector tree is exactly the sub-graph scheduling implemented by this patch.
Differential Revision: https://reviews.llvm.org/D118538
After writing the commit message for 4b1bace28, realized that the mentioned optimization was rather straight forward. We already have the code for scanning a block during region initialization, we can simply keep track if we've seen a stacksave or stackrestore. If we haven't, none of these dependencies are relevant and we can avoid the relatively expensive scans entirely.
This is an extension of commit b7806c to handle one last case noticed in test changes for D118538. Again, this is thought to be a latent bug in the existing code, though this time I have not managed to reduce tests for the original algorithm.
The prior attempt had failed to account for this case:
```
%a = alloca i8
stacksave
stackrestore
store i8 0, i8* %a
```
If we allow '%a' to reorder into the stacksave/restore region, then the alloca will be deallocated before the use. We will have taken a well defined program, and introduced a use-after-free bug.
There's also an inverse case where the alloca originally follows the stackrestore, and we need to prevent the reordering it above the restore.
Compile time wise, we potentially do an extra scan of the block for each alloca seen in a bundle. This is significantly more expensive than the stacksave rooted version and is why I'd tried to avoid this in the initial patch. There is room to optimize this (by essentially caching a "has stacksave" bit per block), but I'm leaving that to future work if it actually shows up in practice. Since allocas in bundles should be rare in practice, I suspect we can defer the complexity for a long while.
CoverageMappingModuleGen generates a coverage mapping record
even for unused functions with internal linkage, e.g.
static int foo() { return 100; }
Clang frontend eliminates such functions, but InstrProfiling pass
still pulls in profile runtime since there is a coverage record.
Fuchsia uses runtime counter relocation, and pulling in profile
runtime for unused functions causes a linker error:
undefined hidden symbol: __llvm_profile_counter_bias.
Since 389dc94d4b, we do not hook profile runtime for the binaries
that none of its translation units have been instrumented in Fuchsia.
This patch extends that for the instrumented binaries that
consist of only unused functions.
Differential Revision: https://reviews.llvm.org/D122336
Update all places that currently assume the entry block to the plan is
also the vector loop header to use getVectorLoopRegion instead.
getVectorLoopRegion will keep doing the right thing when the pre-header
is modeled explicitly (and becomes the new entry block in the plan).
Probe-based profiling leads to better performance when combined with profi and ext-tsp block layout. I'm turning them on by default.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D122442
We already do this for SelectionDAG, but we're missing it here.
Noticed while re-triaging PR21929
Differential Revision: https://reviews.llvm.org/D122340
We can not bitcast pointers across different address spaces. This was
previously fixed in D89577 but then in D93229 an enhancement was added
which peeks further through the pointer operand, opening up the
possibility that address-space violations could be introduced.
Instead of bailing as the previous fix did, simply insert an
addrspacecast cast instruction.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D121787
Most intrinsics, especially "default" ones, will not call back into the
IR module. `nocallback` encodes this nicely. As it was not used before,
this patch also makes use of `nocallback` in the Attributor which
results in many more `norecurse` deductions.
Tablegen part is mechanical, test updates by script.
Differential Revision: https://reviews.llvm.org/D118680
This patch moves pointer induction handling from VPWidenPHIRecipe to its
own recipe. In the process, it adds all information required to generate
code for pointer inductions without relying on Legal to access the list
of induction phis.
Alternatively VPWidenPHIRecipe could also take an optional pointer to InductionDescriptor.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D121615
Add the "(0 - X) --> (X * -1)" reverse identity to the list of alternate form binops.
We need a little hack to make the existing logic work because it does not expect to
move constants from op0 to op1, but the code comment hopefully makes that clear.
I don't think there are any other identities like that.
Fixes #54364
Differential Revision: https://reviews.llvm.org/D122390
For MachO, lower `@llvm.global_dtors` into `@llvm.global_ctors` with
`__cxa_atexit` calls to avoid emitting the deprecated `__mod_term_func`.
Reuse the existing `WebAssemblyLowerGlobalDtors.cpp` to accomplish this.
Enable fallback to the old behavior via Clang driver flag
(`-fregister-global-dtors-with-atexit`) or llc / code generation flag
(`-lower-global-dtors-via-cxa-atexit`). This escape hatch will be
removed in the future.
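For reference, a hedged source-level sketch of what the lowering amounts to (names hypothetical):
```
// Instead of an entry in the deprecated __mod_term_func section, a
// constructor registers the destructor with __cxa_atexit so it runs at exit.
extern "C" int __cxa_atexit(void (*func)(void *), void *arg, void *dso_handle);
extern "C" void *__dso_handle;

static void my_global_dtor(void *) { /* body of the @llvm.global_dtors entry */ }

__attribute__((constructor)) static void register_my_global_dtor() {
  __cxa_atexit(my_global_dtor, nullptr, &__dso_handle);
}
```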
Differential Revision: https://reviews.llvm.org/D121736
There is potential for endless recursion if we try to determine the
underlying objects of a load, just to end up with the load as underlying
object. A proper solution will require us to pass a visited set around.
This will happen as we cleanup genericValueTraversal soon.
When --disable-sample-loader-inlining is true, skip the inline transformation, but merge profiles of inlined instances into their outlined versions.
Differential Revision: https://reviews.llvm.org/D121862
CoroSplit lowers various coroutine intrinsics. It's a CGSCC pass and
CGSCC passes don't run on unreachable functions. Normally GlobalDCE will
come along and delete unreachable functions, but we don't run GlobalDCE
under -O0, so an unreachable function with coroutine intrinsics may
never have CoroSplit run on it.
This patch adds GlobalDCE when coroutine intrinsics are present. It
also now runs all coroutine passes conditionally, only when coroutine intrinsics
are present. This should also solve the -O0 regression reported in
D105877 due to LazyCallGraph construction.
Fixes https://github.com/llvm/llvm-project/issues/54117
Reviewed By: ChuanqiXu
Differential Revision: https://reviews.llvm.org/D122275
EarlyCSE currently optimizes all MemoryUses upfront. However,
EarlyCSE only actually queries the clobbering memory access for
a subset of uses, namely those where a CSE candidate has already
been identified. Delaying use optimization to the clobber query
improves compile-time in practice.
This change is not NFC because EarlyCSE has a limit on the number
of clobber queries (EarlyCSEMssaOptCap), in which case it falls
back to the defining access. The defining access for uses will now
no longer coincide with the optimized access.
If there are performance regressions from this change, we should
be able to address them by raising this limit.
Differential Revision: https://reviews.llvm.org/D121987
The first attempt at this missed a validity check.
This version includes a test of the narrow source
type for modulo-16-bits.
Original commit message:
This is the IR counterpart to 370ebc9d9a
which provided a bswap narrowing fix for issue #53867.
Here we can be more general (although I'm not sure yet
what would happen for illegal types in codegen - too
rare to worry about?):
https://alive2.llvm.org/ce/z/3-CPfo
This will be more effective if we have moved the shift
after the bswap as proposed in D122010, but it is
independent of that patch.
Differential Revision: https://reviews.llvm.org/D122166
The motivation for this is that while both memcpyopt and dse will catch this case, both are limited by MSSA's walk back threshold when finding clobbers. As such, if you have a memcpy of an otherwise dead alloca placed towards the end of a long basic block with lots of other memory instructions, it would be missed. This is a bit undesirable for such an "obviously" useless bit of code.
As noted in comments, we should probably generalize instcombine's escape analysis peephole (see visitAllocInst) to allow read xor write. Doing that would subsume this code in a more general way, but is also a more involved change. For the moment, I went with the easiest fix.
createInductionResumeValues uses its loop argument only to get the
pre-header, but the pre-header is already known (we created/cached it
earlier). Remove the unneeded loop argument.
When shifting by a byte-multiple:
bswap (shl X, C) --> lshr (bswap X), C
bswap (lshr X, C) --> shl (bswap X), C
This is an IR implementation of a transform suggested in D120648.
The "swaps cancel" test models the motivating optimization from
that proposal.
Alive2 checks (as noted in the other review, we could use
knownbits to handle shift-by-variable-amount, but that can be an
enhancement patch):
https://alive2.llvm.org/ce/z/pXUaRf
https://alive2.llvm.org/ce/z/ZnaMLf
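A quick way to sanity-check the identities at the source level for a byte-multiple shift (C == 8), using the __builtin_bswap32 builtin; illustrative only:
```
#include <cassert>
#include <cstdint>

void check(uint32_t x) {
  // bswap (shl X, 8) == lshr (bswap X), 8
  assert(__builtin_bswap32(x << 8) == (__builtin_bswap32(x) >> 8));
  // bswap (lshr X, 8) == shl (bswap X), 8
  assert(__builtin_bswap32(x >> 8) == (__builtin_bswap32(x) << 8));
}
```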
Differential Revision: https://reviews.llvm.org/D122010
Before this patch the DebugifyLevel option was used for
the synthetic mode, so after this, it will be used in
the original mode as well.
Differential Revision: https://reviews.llvm.org/D115623
Rather than iterating over users and comparing operands, iterate
over uses and check operand number. Otherwise, we'll end up
promoting a store twice if it has two equal operands.
This can only happen with opaque pointers, as otherwise both
operands differ by a level of indirection, so a bitcast would have
to be involved.
Fixes https://github.com/llvm/llvm-project/issues/54495.
This is the IR counterpart to 370ebc9d9a
which provided a bswap narrowing fix for issue #53867.
Here we can be more general (although I'm not sure yet
what would happen for illegal types in codegen - too
rare to worry about?):
https://alive2.llvm.org/ce/z/3-CPfo
This will be more effective if we have moved the shift
after the bswap as proposed in D122010, but it is
independent of that patch.
Differential Revision: https://reviews.llvm.org/D122166
Before we start addressing the issue with having
a lot of false positives when using debugify in
the original mode, we have made a few patches that
should speed up the execution of the testing
utility Passes.
For example, when testing a large project
(let's say LLVM project itself), we can face
a lot of potential DI issues. Usually, we use
-verify-each-debuginfo-preserve (that is very
similar to -debugify-each) -- it collects
DI metadata before each Pass, and after the Pass
it checks if the Pass preserved the DI metadata.
However, we can speed up this process, since we
don't need to collect DI metadata before each
Pass -- we could use the DI metadata that are
collected after the previous Pass from
the pipeline as an input for the next Pass.
This patch speeds up the utility by ~2x.
Differential Revision: https://reviews.llvm.org/D115622
The CoroSplit pass would check the existence of coroutine intrinsic
before starting work. It is not necessary and wasteful since it would
iterate over the Module.
This patch also removes the constraint on the inline size of the
SmallVector for the possible coroutines in the module. The original
value was 4, which is a relatively low threshold given how coroutines
are actually used in practice.
When outlining a phi node, if the incoming branch is a block contained in the region and the branch from that block is not outlined, we create broken code. The fix is to recognize when that branch from the included incoming block is not contained, and ignore the region.
Reviewer: paquette
Differential Revision: https://reviews.llvm.org/D121311
There are a bunch of code improvements in the patch: marking everything
that can be const as const and fixing some typos in comments.
The patch also removes the shadowing parameter TTI from the
rewriteWithNewAddressSpaces method; the TTI parameter is not required
because the class already has a field of the same name.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D121671
completeLoopSkeleton uses its loop argument only to get the
pre-header, but the pre-header is already known (we created/cached it
earlier). Remove the unneeded loop argument.
Since the IROutliner is performing an optimization, it should not outline from functions explicitly marked with optnone. This adds an extra check and test to make sure this does not occur.
Reviewers: paquette
Differential Revision: https://reviews.llvm.org/D121567
The semantics of an inalloca alloca instruction require that it not be reordered with a preceding stacksave intrinsic call. Unfortunately, there's no def/use edge or memory dependence edge. (The memory point is slightly subtle, but in general a new allocation can't alias with a call which executes strictly before it comes into existence.)
I'd tried to tackle this same case previously in 689babdf6, but the fix chosen there turned out to be incomplete. As such, this change contains a fully revert of the first fix attempt.
This was noticed when investigating problems which surfaced with D118538, but this is definitely an existing bug. This time around, I managed to reduce a couple of additional cases, including one which was being actively miscompiled even without the new scheduling change. (See test diffs)
Compile time wise, we only spend extra time when seeing a stacksave (rare), and even then we walk the block at most once per schedule window extension. Likely a non-issue.
This fixes an active miscompile visible in the test changes. The basic problem is that the scheduling dependency graph didn't have any edges for control dependence within a single basic block. The result is that we could (and in some rare cases *did*) perform reorderings within a block which could introduce new undefined behavior along paths which didn't previously contain any.
Impact wise, we have two major cases where control is not guaranteed to reach a later instruction in the block: may throw calls, and calls containing infinite loops.
* The former case was mostly covered by the memory dependencies, and to trigger require a function which can throw, but not write to memory. In theory, such a case is possible, but not likely in practice.
* The latter case is likely more of an issue in practice. After this code was first written, we changed the IR semantics to allow well defined infinite loops without satisfying mustprogress. Even for C/C++ - which do imply mustprogress - recent changes to how we treat atomics (e.g. an atomic read does not always imply a write) could expose this issue. I'm a bit shocked we don't seem to have a bug report which hit this in real code actually.
Compile time wise, this results in a single extra scan of the scheduling window in the common case. Since we stop scanning at the next instruction which isn't guaranteed to execute, no matter what order we traverse instructions in, we scan the block once. The exception to this is that when we extend the scheduling window downwards, we invalidate all dependencies, and thus rescan. So the potentially expensive case is when we have a call in a big schedule window which is frequently extended. We could optimize this case (by caching the last instruction not guaranteed to transfer execution, scanning only the extended window, and starting there), but I decided to leave the complexity until it mattered. That same case is already degenerate with memory dependences, which is more expensive than the control dependence scan.
We could also consider combining the memory dependence and control dependence sets to reduce memory usage, but since it complicates the code slightly and makes debugging a bit harder, I went with the simplest scheme for now.
This was noticed while trying to understand the failures reported against D118538, but is not otherwise related to that change.
Update functions that previously took a loop pointer but only to get the
pre-header. Instead, pass the block directly. This removes the
requirement for the loop object to be created up-front.
With debug information enabled (-g) Clang will wrap the actual target
region into a new function which is called from the "kernel". The problem
is that the "kernel" is now basically a wrapper without all the things
we expect. More importantly, if we end up asking for an AAKernelInfo
for the "target region function" we might try to turn it into SPMD mode.
That used to cause an assertion as that function doesn't have an
appropriately named `_exec_mode` global. While the global is going away
soon we still need to make sure to properly handle this case, e.g.,
perform optimizations reliably.
Differential Revision: https://reviews.llvm.org/D122043
Generalize D99629 for ELF. A default visibility non-local symbol is preemptible
in a -shared link. `isInterposable` is an insufficient condition.
Moreover, a non-preemptible alias may be referenced in a sub constant expression
which intends to lower to a PC-relative relocation. Replacing the alias with a
preemptible aliasee may introduce a linker error.
Respect dso_preemptable and suppress optimization to fix the above issues. With
the change, `alias = 345` will not be rewritten to use aliasee in a `-fpic`
compile.
```
int aliasee;
extern int alias __attribute__((alias("aliasee"), visibility("hidden")));
void foo() { alias = 345; } // intended to access the local copy
```
While here, refine the condition for the alias as well.
For some binary formats like COFF, `isInterposable` is a sufficient condition.
But I think canonicalization for the changed case has little advantage, so I
don't bother to add the `Triple(M.getTargetTriple()).isOSBinFormatELF()` or
`getPICLevel/getPIELevel` complexity.
For instrumentations, it's recommended not to create aliases that refer to
globals that have weak linkage or are preemptible. However, the following is
supported and the IR needs to handle such cases.
```
int aliasee __attribute__((weak));
extern int alias __attribute__((alias("aliasee")));
```
There are other places where GlobalAlias isInterposable usage may need to be
fixed.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D107249
I keep thinking this assumption is probably exploitable for a bug in the existing implementation, but all of my attempts at writing a test case have failed. So for the moment, just document this very subtle assumption.
This patch fixes:
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:3917:13: error:
unused function 'needToScheduleSingleInstruction'
[-Werror,-Wunused-function]
[SCCP] do not clean up dead blocks that have their address taken
Fixes a crash observed in IPSCCP.
Because the SCCPSolver has already internalized BlockAddresses as
Constants or ConstantExprs, we don't want to try to update their Values
in the ValueLatticeElement. Instead, continue to propagate these
BlockAddress Constants, continue converting BasicBlocks to unreachable,
but don't delete the "dead" BasicBlocks which happen to have their
address taken. Leave replacing the BlockAddresses to another pass.
Fixes: https://github.com/llvm/llvm-project/issues/54238
Fixes: https://github.com/llvm/llvm-project/issues/54251
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D121744
This reverts commit 1cfa986d68. See https://github.com/llvm/llvm-project/issues/54256 for why I'm discontinuing the project.
Separately, it turns out that while this patch does correctly preserve MSSA, it's correct only at the end of the pass, not between vectorization attempts. Even if we decide to resurrect this, we'll need to fix that before reapplying.
Quote from the LLVM Language Reference
If ptr is a stack-allocated object and it points to the first byte of the
object, the object is initially marked as dead. ptr is conservatively
considered as a non-stack-allocated object if the stack coloring algorithm
that is used in the optimization pipeline cannot conclude that ptr is a
stack-allocated object.
By replacing the alloca pointer with the tagged address before this change,
we confused the stack coloring algorithm.
Reviewed By: eugenis
Differential Revision: https://reviews.llvm.org/D121835
Failed on buildbot:
/home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/build/bin/llc: error: : error: unable to get target for 'aarch64-unknown-linux-android29', see --version and --triple.
FileCheck error: '<stdin>' is empty.
FileCheck command line: /home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/build/bin/FileCheck /home/buildbot/buildbot-root/llvm-project/llvm/test/Instrumentation/HWAddressSanitizer/stack-coloring.ll --check-prefix=COLOR
This reverts commit 208b923e74.
This adds a new option to control AllowSpeculation added in D119965 when
using `-passes=...`.
This allows reproducing #54023 using opt.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D121944
Quote from the LLVM Language Reference
If ptr is a stack-allocated object and it points to the first byte of the
object, the object is initially marked as dead. ptr is conservatively
considered as a non-stack-allocated object if the stack coloring algorithm
that is used in the optimization pipeline cannot conclude that ptr is a
stack-allocated object.
By replacing the alloca pointer with the tagged address before this change,
we confused the stack coloring algorithm.
Reviewed By: eugenis
Differential Revision: https://reviews.llvm.org/D121835
The way the pass is actually used in the optimization pipeline,
TLI will be available, but this is not the case when running just
-lower-constant-intrinsics in tests, which ends up being quite
confusing.
Require TLI unconditionally, as we usually do.
This changes MemorySSA to be constructed in unoptimized form.
MemorySSA::ensureOptimizedUses() can be called to optimize all
uses (once). This should be done by passes where having optimized
uses is beneficial, either because we're going to query all uses
anyway, or because we're doing def-use walks.
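A minimal usage sketch (helper name hypothetical), relying only on the ensureOptimizedUses() entry point described above:
```
#include "llvm/Analysis/MemorySSA.h"
using namespace llvm;

// A pass that benefits from eagerly optimized uses (e.g. because it walks
// def-use chains or will query most uses anyway) can request that once,
// right after acquiring MemorySSA.
static void prepareMSSA(MemorySSA &MSSA) {
  MSSA.ensureOptimizedUses(); // optimizes every MemoryUse exactly once
}
```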
This should help reduce the compile-time impact of MemorySSA for
some use cases (the reason why I started looking into this is
D117926), which can avoid optimizing all uses upfront, and instead
only optimize those that are actually queried.
Actually, we have an existing use-case for this, which is EarlyCSE.
Disabling eager use optimization there gives a significant
compile-time improvement, because EarlyCSE will generally only query
clobbers for a subset of all uses (this change is not included in
this patch).
Differential Revision: https://reviews.llvm.org/D121381
LoopSimplifyCFG may process loops that are not in
loop-simplify/canonical form. For loops not in canonical form, exit
blocks may be reachable from non-loop blocks and we cannot consider them
dead merely because they are not reachable from the loop itself.
Unfortunately the smallest test I could come up with requires running
multiple passes:
-passes='loop-mssa(loop-instsimplify,loop-simplifycfg,simple-loop-unswitch)'
The reason is that loops are canonicalized at the beginning of loop
pipelines, so a later transform has to break canonical form in a way
that breaks LoopSimplifyCFG's dead-exit analysis.
Alternatively we could try to require all loop passes to maintain
canonical form. That in turn would also require additional verification.
Fixes #54023, #49931.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D121925
This patch tries to sink instructions when they are only used in a successor block.
This is a further enhancement patch based on Anna's commit:
D109700, which allows sinking an instruction having multiple uses in a single user.
In this patch, sink instructions with multiple users in a single successor block will be supported.
It could fix a known issue from rust:
https://github.com/rust-lang/rust/issues/51346#issuecomment-394443610
Reviewed By: nikic, reames
Differential Revision: https://reviews.llvm.org/D121585
Splat loads are inexpensive in X86. For a 2-lane vector we need just one
instruction: `movddup (%reg), xmm0`. Using the standard Splat score leads
to worse code. This patch adds a new score dedicated for splat loads.
Please note that a splat is usually three IR instructions:
- It is usually a load and 2 inserts:
    %ld = load double, double* %gep
    %ins1 = insertelement <2 x double> poison, double %ld, i32 0
    %ins2 = insertelement <2 x double> %ins1, double %ld, i32 1
- But it can also be a load, an insert and a shuffle:
    %ld = load double, double* %gep
    %ins = insertelement <2 x double> poison, double %ld, i32 0
    %shf = shufflevector <2 x double> %ins, <2 x double> poison, <2 x i32> zeroinitializer
Because of this some of the lit tests contain more IR instructions.
Differential Revision: https://reviews.llvm.org/D121354
Reimplements MisExpect diagnostics from D66324 to reconstruct its
original checking methodology only using MD_prof branch_weights
metadata.
New checks rely on 2 invariants:
1) For frontend instrumentation, MD_prof branch_weights will always be
populated before llvm.expect intrinsics are lowered.
2) for IR and sample profiling, llvm.expect intrinsics will always be
lowered before branch_weights are populated from the IR profiles.
These invariants allow the checking to assume how the existing branch
weights are populated depending on the profiling method used, and emit
the correct diagnostics. If these invariants are ever invalidated, the
MisExpect related checks would need to be updated, potentially by
re-introducing MD_misexpect metadata, and ensuring it always will be
transformed the same way as branch_weights in other optimization passes.
Frontend based profiling is now enabled without using LLVM Args, by
introducing a new CodeGen option, and checking if the -Wmisexpect flag
has been passed on the command line.
Differential Revision: https://reviews.llvm.org/D115907
Failures in `InlineFunction()` are caught after D121722, but `emitInlinedIntoBasedOnCost()` should only be called when inlining is successful. This also removes an unnecessary call to `shouldInline()` which always returned `InlineCost::getAlways()`.
Reviewed By: kyulee, nikic
Differential Revision: https://reviews.llvm.org/D121946
We often failed in the assertion, non-deterministically with a large IR:
```
Assertion `notDifferentParent(LocA.Ptr, LocB.Ptr) && "BasicAliasAnalysis doesn't support interprocedural queries."
```
Looking at the comment in https://reviews.llvm.org/D87806, it appears it's actually a module pass for new PM while the legacy PM still works as a function pass.
The fix is to align the same behavior in between new PM and old PM, which initializes ObjCARCContract for each function.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D121949
No need to schedule entry nodes where all instructions are not memory
read/write instructions and their operands are either constants,
arguments, phis, or instructions from other blocks, or their users
are phis or are in other blocks.
The resulting vector instructions can be placed at
the beginning of the basic block without scheduling (if the operands do
not need to be scheduled) or at the end of the block (if the users are
outside of the block).
It may save some compile time and scheduling resources.
Differential Revision: https://reviews.llvm.org/D121121
For MachO, lower `@llvm.global_dtors` into `@llvm.global_ctors` with
`__cxa_atexit` calls to avoid emitting the deprecated `__mod_term_func`.
Reuse the existing `WebAssemblyLowerGlobalDtors.cpp` to accomplish this.
Enable fallback to the old behavior via Clang driver flag
(`-fregister-global-dtors-with-atexit`) or llc / code generation flag
(`-lower-global-dtors-via-cxa-atexit`). This escape hatch will be
removed in the future.
Differential Revision: https://reviews.llvm.org/D121736
When we build clang without asserts we should still check the result of
`InlineFunction()` to be sure there wasn't an error. Otherwise we could
incorrectly merge attributes in the next line.
This also removes a redundant call to `getCaller()`.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D121722
An analysis may just be interested in checking if an instruction is
atomic but system scoped or single-thread scoped, like ThreadSanitizer's
isAtomic(). Unfortunately Instruction::isAtomic() can only answer the
"atomic" part of the question, but to also check scope becomes rather
verbose.
To simplify and reduce redundancy, introduce a common helper
getAtomicSyncScopeID() which returns the scope of an atomic operation.
Start using it in ThreadSanitizer.
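A small usage sketch (isRelevantAtomic is hypothetical; the helper itself is the one introduced here):
```
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Treat an instruction as interesting if it is atomic and not restricted to
// single-thread scope, which previously required checking each atomic
// instruction kind separately.
static bool isRelevantAtomic(const Instruction *I) {
  if (auto SSID = getAtomicSyncScopeID(I)) // empty if I is not atomic
    return *SSID != SyncScope::SingleThread;
  return false;
}
```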
NFCI.
Reviewed By: dvyukov
Differential Revision: https://reviews.llvm.org/D121910
This uses the existing VPlan helpers to check whether there are scalar
uses of a phi recipe. It remove one of the few remaining dependencies on
the cost model from VPlan code generation.
Depends on D121612.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D121613
VPInterleaveRecipe only uses the first lane of the address. Add
onlyFirstLaneUsed implementation. This is needed for a follow-up patch.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D121612
Reapply with an explicit check for multi-edges, as the expected
behavior of multi-edge dominance is unclear (D120811).
-----
For conditional branches, we know the value is i1 0 or i1 1 along
the outgoing edges. For switches we can apply exactly the same
optimization, just with the known values determined by the switch
cases.
This patch ensures scalars (except for uniforms) are no
longer collected (prior to LVP planning phase) for
scalable vectorization.
This is to avoid the chances of generating scalarized
instructions later (during LVP execute phase) as they
are not supported for scalable vectorization.
Relevant test has also been added.
Differential Revision: https://reviews.llvm.org/D121452
When a load extends past the extent of the alloca, SROA will
restrict the slice size to extend to the end of the alloca only.
However, presplitting was asserting that the load size and the
slice size match exactly, which does not hold in this case.
Relax the assertion to only require that the load size is greater
or equal than the slice size.
No need to schedule entry nodes where all instructions are not memory
read/write instructions and their operands are either constants,
arguments, phis, or instructions from other blocks, or their users
are phis or are in other blocks.
The resulting vector instructions can be placed at
the beginning of the basic block without scheduling (if the operands do
not need to be scheduled) or at the end of the block (if the users are
outside of the block).
It may save some compile time and scheduling resources.
Differential Revision: https://reviews.llvm.org/D121121
This patch adds initial argmemonly inference, by checking the underlying
objects of locations returned by MemoryLocation.
I think this should cover most cases, except function calls to other
argmemonly functions.
I'm not sure if there's a reason why we don't infer those yet.
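Roughly, the check boils down to something like the following sketch (helper name hypothetical):
```
#include "llvm/Analysis/MemoryLocation.h"
#include "llvm/Analysis/ValueTracking.h"
#include "llvm/IR/Argument.h"
using namespace llvm;

// A location counts towards argmemonly if its underlying object is one of
// the function's own arguments.
static bool locationIsArgMem(const MemoryLocation &Loc) {
  const Value *UO = getUnderlyingObject(Loc.Ptr);
  return isa<Argument>(UO);
}
```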
Additional argmemonly can improve codegen in some cases. It also makes
it easier to come up with a C reproducer for 7662d1687b (already fixed,
but I'm trying to see if C/C++ fuzzing could help to uncover similar
issues.)
Compile-time impact:
NewPM-O3: +0.01%
NewPM-ReleaseThinLTO: +0.03%
NewPM-ReleaseLTO+g: +0.05%
https://llvm-compile-time-tracker.com/compare.php?from=067c035012fc061ad6378458774ac2df117283c6&to=fe209d4aab5b593bd62d18c0876732ddcca1614d&stat=instructions
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D121415
This code queries TTI on a single function, which is considered to
be representative. This is a bit odd, but probably fine in practice.
However, I think we should at least avoid querying declarations,
which e.g. will generally lack target attributes, and for which
we don't seem to ever query TTI in other places.
This initial patch adds code to preserve MemorySSA through a run of SLP vectorizer. The eventual plan is to use MemorySSA to accelerate SLP's memory dependence checking, but we're a ways from that. In particular, this patch is correct, but really slow. It's being landed so that we can work incrementally in tree, not because it's expected to be useful to anyone just yet.
The broader effort is being tracked in https://github.com/llvm/llvm-project/issues/54256. It's worth noting explicitly that this may not work out, and if not, we will be reverting all of the MSSA support in SLP at some point in the next few weeks.
Differential Revision: https://reviews.llvm.org/D117926
Update FunctionAttrs to use FunctionModRefBehavior instead
MemoryAccessKind.
This allows for adding support for inferring argmemonly and others,
see D121415.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D121460
This can be viewed as swapping the select arms:
https://alive2.llvm.org/ce/z/jUvFMJ
...so we don't have the 'nsz' problem with the more general fold.
This unlocks other folds for the motivating fabs example.
This was discussed in issue #38828.
Hardcode the function type as ParallelTask, which is the guaranteed
pointee type of this runtime function argument (if pointee types
exist). The elimination of the callee bitcast is left for InstCombine.
Differential Revision: https://reviews.llvm.org/D120885
Update places still referencing LoopVectorBody to use the vector loop to
get the vector loop header. This is needed to move vector loop
code-generation to VPlan completely, which in turn is needed to model
pre-header & exit blocks in VPlan as well.
For MachO, lower `@llvm.global_dtors` into `@llvm.global_ctors` with
`__cxa_atexit` calls to avoid emitting the deprecated `__mod_term_func`.
Reuse the existing `WebAssemblyLowerGlobalDtors.cpp` to accomplish this.
Enable fallback to the old behavior via Clang driver flag
(`-fregister-global-dtors-with-atexit`) or llc / code generation flag
(`-lower-global-dtors-via-cxa-atexit`). This escape hatch will be
removed in the future.
Differential Revision: https://reviews.llvm.org/D121327
When matching PHINodes when merging functions, the IROutliner only checks that an incoming value exists in a phi node in the overall function. It doesn't check the length, the order, or that the incoming block also matches. In the given example, we see that both phi nodes have the same incoming values, but from different blocks.
The fix is to enforce a stricter match of the incoming value, and the incoming block as well, when matching the created phi nodes.
Reviewers: paquette
Differential Revision: https://reviews.llvm.org/D121310
If we have a logical and/or in select form and the true/false operand
is an fcmp with poison generating FMF, we won't be able to fold it
to an and/or instruction. This prevents us from optimizing the case
where it is a logical operation of two fcmps with identical operands.
This patch adds explicit checks for this case that doesn't rely on
converting to and/or to do the optimization. It reuses the existing
foldLogicOfFCmps, but adds a new flag to disable the other combine
that is inside that function.
FMF flags from the two FCmps are intersected using the logic added in
D121243. The FIXME has been updated to indicate that we can only use
a union for the non-select form.
This allows us to optimize cases like this from compare-fp-3.c in the
gcc torture suite with fast math.
```
void
test1 (float x, float y)
{
  if ((x==y) && (x!=y))
    link_error0();
}
```
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D121323
2 of the 3 callsites of IRMover::move() pass empty lambda functions. Just
make this parameter llvm::unique_function.
Came about via discussion in D120781. Probably worth making this change
regardless of the resolution of D120781.
Reviewed By: dexonsmith
Differential Revision: https://reviews.llvm.org/D121630
The IR Outliner is supposed to extract the outputs contained in an external phi node and place them into a phi node contained within the outlined function. However, when the output values of two outlined functions with two different output sets are contained within the same phi node, they are counted as the same exit path when first analyzed. In reality, these create two different phi nodes, creating an inconsistency, resulting in a mismatch in the expected number of output paths and a crash. This fixes that counting when analyzing the outputs by also analyzing the incoming blocks rather than just the incoming values.
Reviewer: paquette
Differential Revision: https://reviews.llvm.org/D121313
Extend -wholeprogramdevirt-check to support both the existing
trapping mode on an incorrect devirtualization, as well as a new
mode to fall back to an indirect call on a mismatch. The new mode is
useful in cases where we want to enable
devirtualization but cannot fully guarantee whole program visibility
(e.g. in the case where LTO has been disabled for a small set of objects
that could potentially override virtual methods without having a symbol
reference to anything in the base class including the vtable).
Remove !prof and !callees metadata (which are used by indirect call
promotion) from both the new direct call and the fallback indirect call
(so that we don't perform another round of promotion on the latter).
Also remove it from the direct call in the non-fallback cases, which
was an oversight, although it didn't seem to cause any issues. Add tests
for the metadata removal covering the various cases.
Differential Revision: https://reviews.llvm.org/D121419
When there are two external phi nodes for two different outlined regions, when compressing the created phi nodes between the two regions, the matching for the second phi node in the second region matches the first phi node created for the first region rather than the second phi node created for the first region. This adds an extra output path where there should not be one.
The fix is to ignore phi nodes that have already been matched for each region.
Reviewer: paquette
Differential Revision: https://reviews.llvm.org/D121312
The insertion point for the builder used during VPlan code generation is
set during code generation. Setting the insert point here is dead code
and can be removed.
Context: I needed this for https://github.com/google/iree/pull/8474 .
I found that TSan instrumentation expects vector sizes to be <= 16,
and in my project (IREE) we have tests with higher vector sizes.
That left some test functions uninstrumented, resulting in crashes as
instrumented code called into them.
Differential Revision: https://reviews.llvm.org/D121182
This patch is a follow-up to D115953. It updates optimizeInductions
to also introduce new VPScalarIVStepsRecipes if an IV has both vector
and scalar uses.
It updates all uses that only need scalar values to use the newly
created recipe for the scalar steps.
This completes untangling of VPWidenIntOrFpInductionRecipe
code-generation. Now the recipe *only* creates the widened vector
values, as it says on the tin.
The code to generate IR has been moved directly to
VPWidenIntOrFpInductionRecipe::execute.
Note that the recipe has been updated to hold a reference to
ScalarEvolution, which is needed to expand the step, until we can place
the corresponding SCEV expansion in the pre-header.
Depends on D120827.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D120828
Before we used the capture tracker to follow pointer uses, now we do it
explicitly ourselves through the Attributor API. There are multiple
benefits: For one, the boilerplate is cut down by a lot. The class,
potential copies vector, etc. is all not needed anymore. We also do
avoid explicitly looking through memory here, something that was
duplicated and should only live in the `checkForAllUses` helper. More
importantly, as we do simplifications we need to make sure all parties
are in sync when they reason about uses. The old way did not allow us to
do this but the new one does as every use visiting AA goes through
`checkForAllUses` now.
As replacements will become more complex it is better to have a single
AA responsible for replacing a use. Before this patch AAValueSimplify*
and AAValueSimplifyReturned could both try to replace the returned
value. The latter was marginally better for the old pass manager
when a function was already carrying a `returned` attribute and when
the context of the return instruction was important. The second
shortcoming was resolved by looking for return attributes in the
AAValueSimplifyCallSiteReturned initialization. The old PM impact is
not concerning.
This is yet another step towards the removal of AAReturnedValues, the
very first AA we should now try to eliminate due to the overlapping
logic with value simplification.
When we look through memory for a store we used to allow any other use
of the memory that is reachable. This is generally OK but we need to
make sure to actually let the user look at these properly. For now,
we simply require loads (via exact reloads).
There was some ad-hoc handling of liveness and manifest to avoid
breaking CGSCC guarantees. Things always slipped through though.
This cleanup will:
1) Prevent us from manifesting any "information" outside the CGSCC.
This might be too conservative, but we need to opt in to annotations
rather than try to avoid some problematic ones.
2) Avoid running any liveness analysis outside the CGSCC. We did have
some AAIsDeadFunction handling to this end but we need this for all
AAIsDead classes. The reason is that AAIsDead information is only
correct if we actually manifest it, since we don't (see point 1) we
cannot actually derive/use it at all. We are currently trying to
avoid running any AA updates outside the CGSCC but that seems to
impact things quite a bit.
3) Assert, don't check, that our modifications (during cleanup) modify
only CGSCC functions.
In an attempt to remove the memory leak we introduced a double free.
The problem was that we allowed a plain copy of the state and it was
actually used. The use was useless, so it is gone now. The copy
constructor is gone as well. The move constructor ensures the Accesses
pointers are owned by a single state, I hope.
Reported by: https://lab.llvm.org/buildbot/#/builders/16/builds/25820
This patch adds a helper to check if a recipe only uses scalars of a
given operand. This is similar to onlyFirstLaneUsed, which was
introduced earlier.
By default, usesScalars falls back on onlyFirstLaneUsed.
Will be used by D120828.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D120827
X (any pred) -X --> X (any pred) 0.0
This works with all FP values and preserves FMF.
Alive2 examples:
https://alive2.llvm.org/ce/z/dj6jhp
This can also create one of the patterns that we match as "fabs"
as shown in one of the test diffs.
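A minimal before/after sketch of the fold, hand-written here with `olt` standing in for an arbitrary predicate:
```
define i1 @src(float %x) {
  %neg = fneg float %x
  %cmp = fcmp olt float %x, %neg
  ret i1 %cmp
}

define i1 @tgt(float %x) {
  %cmp = fcmp olt float %x, 0.000000e+00
  ret i1 %cmp
}
```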
This reverts commit 9397bdc67e.
This optimization is likely to surprise programmers as seen
in post-commit comments, so we should add a clang warning
first (that is proposed in D121306).
If there are no ctors, then this can have an arbitrary zero-sized
value. The current code checks for null, but it could also be
undef or poison.
Replacing the specific null check with a check for
non-ConstantArray.
As discussed on Issue #32161 this fold can be generalized a lot more than it currently is, but this patch at least adds vector support.
Differential Revision: https://reviews.llvm.org/D121358
As far as I can tell, these names are only intended to be
informative, so just use a generic "PointerType" for opaque pointers.
The code in solveDIType() also treats pointers as basic types (and
does not try to encode the pointed-to type further), so I believe
this should be fine.
Differential Revision: https://reviews.llvm.org/D121280
We are already doing this in the split functions while we clone. This just
handles the original function.
I also updated the coroutine split test to validate that we are always referring
to the msg in the context object instead of in a shadow copy.
rdar://83957028
Reviewed By: aprantl
Differential Revision: https://reviews.llvm.org/D121324
This patch is motivated by pr48057 where an output dependency is not detected
since loop interchange did not check a store instruction with itself.
Fixed that deficiency.
Reviewed By: bmahjour, Meinersbur, #loopoptwg
Differential Revision: https://reviews.llvm.org/D118102
The existing handling produced a crash for the test case (attached with the patch).
Now the function transferSRADebugInfo is modified to
- Ignore the current variable if it starts after the current Fragment.
- Ignore the current variable if it ends before the current Fragment.
- Generate (!DIExpression()) if the current variable completely fits the
current Fragment.
- Otherwise (as earlier), generate the DW_OP_LLVM_fragment in IR if the current
Fragment partially defines the current variable.
Reviewed By: aprantl
Differential Revision: https://reviews.llvm.org/D121107
As a result of adding multiblock outlining, it became possible to outline the entirety of a basic block, along with branches that only pointed to the basic blocks contained in the outlined section. This means that there are no exit paths, and no return statement. There was a previous assertion from the older version of the outliner that explicitly made sure there was a return statement. This removes that assertion.
Reviewers: paquette
Differential Revision: https://reviews.llvm.org/D120868
This patch intersects the fast math flags from the two fcmps instead
of dropping them.
I poked at this a bunch with Alive2 for nnan and ninf flags and it seemed
to check out. With the other flags it told me "Couldn't prove the
correctness of the transformation". Not sure if I should just preserve
nnan and ninf?
Reviewed By: spatel, lebedev.ri
Differential Revision: https://reviews.llvm.org/D121243
Single value phis won't be modeled in VPlan. If the phi only gets used
outside the loop, the current code misses the fact that the incoming
value is not dead. Update the code to also look through such phis to
check for outside users.
Fixes #54266
This patch adds PrettyStackEntries before running passes. The entries
include the pass name and the IR unit the pass runs on.
The information is used to print additional information when a pass
crashes, including the name and a reference to the IR unit on which it
crashed. This is similar to the behavior of the legacy pass manager.
The improved stack trace now includes:
Stack dump:
0. Program arguments: bin/opt -loop-vectorize -force-vector-width=4 crash.ll
1. Running pass 'ModuleToFunctionPassAdaptor' on module 'crash.ll'
2. Running pass 'LoopVectorizePass' on function '@a'
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D120993
MinBaseDistance may be odr-used by std::max, leading to an undefined symbol linker error:
```
ld.lld: error: undefined symbol: (anonymous namespace)::MinCostMaxFlow::MinBaseDistance
>>> referenced by SampleProfileInference.cpp:744 (/home/ray/llvm-project/llvm/lib/Transforms/Utils/SampleProfileInference.cpp:744)
>>> lib/Transforms/Utils/CMakeFiles/LLVMTransformUtils.dir/SampleProfileInference.cpp.o:((anonymous namespace)::FlowAdjuster::jumpDistance(llvm::FlowJump*) const)
```
Since llvm-project is still using C++14, work around it with a cast.
After moving the CanAdd check in c60cdb44f7 and using it for
the assume cases as well, the passed in block may not have a branch
instruction as terminator. This can trigger the assertion. Given the new
use case, it doesn't add value any longer and can be removed.
Fixes https://github.com/llvm/llvm-project/issues/54281
This is noted as a missing clang warning in #54222
(and we should still make that enhancement).
Alive2 proofs:
https://alive2.llvm.org/ce/z/Q8drDq
https://alive2.llvm.org/ce/z/pE6LRt
I don't see a single conversion for all predicates
using "getFCmpCode" logic, so other predicates are
left as a TODO item.
Introduce a new attribute "function-inline-cost-multiplier" which
multiplies the inline cost of a call site (or all calls to a callee) by
the multiplier.
When processing the list of calls created by inlining, check each call
to see if the new call's callee is in the same SCC as the original
callee. If so, set the "function-inline-cost-multiplier" attribute of
the new call site to double the original call site's attribute value.
This does not happen when the original call site is intra-SCC.
This is an alternative to D120584, which marks the call sites as
noinline.
Hopefully fixes PR45253.
Reviewed By: davidxl
Differential Revision: https://reviews.llvm.org/D121084
If the user disables de-globalization we do not seed the AAHeapToShared
and AAHeapToStack, but we could still end up with them through in-flight
lookups. With this patch we disable AAHeapToShared completely if the
user disabled de-globalization. Heap-2-stack is still run though.
Differential Revision: https://reviews.llvm.org/D121059
Currently, when instrumenting indirect calls, this uses
CallBase::getCalledFunction to determine whether a given callsite is
eligible.
However, that returns null if:
this is an indirect function invocation or the function signature
does not match the call signature.
So, we end up instrumenting direct calls where the callee is a bitcast
ConstantExpr, even though we presumably don't need to.
Use isIndirectCall to ignore those funky direct calls.
Differential Revision: https://reviews.llvm.org/D119594
The analysis passes output the function name encapsulated in `'` quotes,
but LV uses `"`. Harmonizing this may help in creating an update script
for the LV costmodel test checks.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D121105
When decomposing constraints for unsigned conditions, we can use
negative values by zero-extending them, as long as they are less than
the maximum constraint value.
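As a hand-written illustration (not taken from the patch), a negative constant in an unsigned condition is simply treated as its zero-extended value:
```
define i1 @ult_negative_constant(i8 %x) {
  ; -10 as an unsigned 8-bit value is 246, so this decomposes to x u< 246
  %c = icmp ult i8 %x, -10
  ret i1 %c
}
```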
Fixes https://github.com/llvm/llvm-project/issues/54224
Add missing CanAdd check before adding a condition from an assume
to the successor blocks. When adding information from assume to
successor blocks we need to perform the same CanAdd as we do for adding
a condition from a branch.
Fixes https://github.com/llvm/llvm-project/issues/54217
Dropping this restriction seems to work fine (there are no assertion
failures), so it appears that either the updater got smarter or the
problematic cases are restricted elsewhere.
If doing this still causes issues, then the place to address it
would probably be 8f5bdaf481/llvm/lib/Transforms/IPO/Attributor.cpp (L1856-L1859),
which already prevents replacement outside the SCC, so I'm not
quite sure what this check is intended to avoid.
Differential Revision: https://reviews.llvm.org/D120987
Only determine the frame layout based on dereferenceable and align
attributes, and remove the type-based fallback, which is incompatible
with opaque pointers. The dereferenceable attribute is required,
while the align attribute uses default alignment of 1 (commonly,
align 1 attributes do not get placed, relying on default alignment).
The CoroSplit pass producing the resume function adds the necessary
attributes in 7daed35911/llvm/lib/Transforms/Coroutines/CoroSplit.cpp (L840),
and their presence is checked in coro-debug.ll at least.
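For illustration, a hand-written sketch of the attributes the frame layout is now derived from (the constants are made up):
```
; dereferenceable gives the frame size, align gives the frame alignment
define void @coro.resume(ptr dereferenceable(64) align 8 %frame.ptr) {
entry:
  ret void
}
```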
Differential Revision: https://reviews.llvm.org/D120988
With opaque pointers, after splitRetconCoroutine() the FramePtr
may be an Argument rather than an Instruction. With typed pointers,
this currently doesn't happen because the FramePtr would be a
bitcast instruction.
Fix this by making FramePtr a Value and adding a helper for the
"after FramePtr" insertion point, which would be the start of the
function in the Argument case.
Differential Revision: https://reviews.llvm.org/D120994
This patch extends ConstraintElimination to also remove dead variables
when removing a constraint. When a constraint is removed because it is
out of scope, all new variables added for this constraint can also be
removed.
This keeps the total size of the systems much smaller, because it
reduces the number of variables drastically.
It also fixes a bug where variables were removed incorrectly.
Fixes https://github.com/llvm/llvm-project/issues/54228
This check is not compatible with opaque pointers. We can avoid
it by adjusting the getPointerAlignment() implementation to avoid
creating unnecessary ptrtoint expressions for bitcasted pointers.
The code already uses OnlyIfReduced to not create an expression
if it does not simplify, and this makes sure that folding a
bitcast and ptrtoint into a ptrtoint doesn't count as a
simplification.
Differential Revision: https://reviews.llvm.org/D120904
Currently, we hardly ever actually run SCEV verification, even in
tests with -verify-scev. This is because the NewPM LPM does not
verify SCEV. The reason for this is that SCEV verification can
actually change the result of subsequent SCEV queries, which means
that you see different transformations depending on whether
verification is enabled or not.
To allow verification in the LPM, this limits verification to
BECounts that have actually been cached. It will not calculate
new BECounts.
BackedgeTakenInfo::getExact() is still not entirely readonly,
it still calls getUMinFromMismatchedTypes(). But I hope that this
is not problematic in the same way. (This could be avoided by
performing the umin in the other SCEV instance, but this would
require duplicating some of the code.)
Differential Revision: https://reviews.llvm.org/D120551
We already look through memory to determine where a value that is stored
might pop up again (potential copies). This patch introduces the other
direction with similar logic. If a value is loaded, we can follow all
the accesses to the pointer (or better object) and try to determine what
value might have been stored.
Both `undef` and `nullptr` are maximally aligned. This is especially
important as we often see `undef` until a proper value has been
identified during simplification.
With D106397 we used CFG reasoning to filter out writes that will not
interfere with a given load instruction. With this patch we use the
same logic (modulo the reversal in reachability check order) for store
instructions. As an example, we can now prove stores to shared memory
are dead if all the loads of the shared memory are not reachable from
them.
Heap-2-stack and heap-2-shared can replace an allocation call with
something else. To avoid us deriving information from the allocator
implementation we register a simplification callback now that will
force us to stop at the call site. We probably should create the
replacement memory eagerly and return that instead though.
While we can use range information when we derive dereferenceability we
must make sure to pick the right end of the range. Before we always went
with the minimal offset, which is not correct if we want to combine
the base dereferenceability with some offset. In that case it's the
maximum that gives the correct result.
This simply makes the function argument of the
`Attributor::checkForAllInstructions` helper explicit so one can iterate
over instructions in other functions.
The OpenMPIRBuilder has a bug. Specifically, suppose you have two nested openmp parallel regions (writing with MLIR for ease)
```
omp.parallel {
  %a = ...
  omp.parallel {
    use(%a)
  }
}
```
As OpenMP only permits pointer-like inputs, the builder will wrap all of the inputs into a stack allocation, and then pass this
allocation to the inner parallel. For example, we would want to get something like the following:
```
omp.parallel {
  %a = ...
  %tmp = alloc
  store %tmp[] = %a
  kmpc_fork(outlined, %tmp)
}
```
However, in practice, this is not what currently occurs in the context of nested parallel regions. Specifically to the OpenMPIRBuilder,
the entirety of the function (at the LLVM level) is currently inlined with blocks marking the corresponding start and end of each
region.
```
entry:
...
parallel1:
%a = ...
...
parallel2:
use(%a)
...
endparallel2:
...
endparallel1:
...
```
When the allocation is inserted, it is presently inserted into the parent of the entire function (e.g. entry) rather than the parent
allocation scope of the function being outlined. If we were outlining parallel2, the corresponding alloca location would be parallel1.
This causes a variety of bugs, including https://github.com/llvm/llvm-project/issues/54165 as one example.
This PR allows the stack allocation to be created at the correct allocation block, and thus remedies such issues.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D121061
Skip phi nodes in the preheader. They may not be considered loop
invariant by the assertion below.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D121010
There seems to be one more uncaught problem, SROA may now end up trying
to re-re-repromote the just-promoted shadow alloca, and do that endlessly.
This reverts commit adc0984d81.
This is inspired by the original variant of D109749 by Graham Hunter,
but is a more general version.
Roughly, instead of promoting the alloca, we call it
a shadow/backing alloca, go through all its slices,
clone(!) instructions that operated on it,
but make them operate on the cloned alloca,
and promote cloned alloca instead.
This keeps the shadow/backing alloca, and all the original instructions
around, which results in said shadow/backing alloca being
a perfect mirror/representation of the promoted alloca's content,
so calls that take the alloca as arguments (non-capturingly!)
can be supported.
For now, we require that the calls also don't modify the alloca's content,
but that is only to simplify the initial implementation,
and that will be supported in a follow-up.
Overall, this leads to *smaller* codesize:
https://llvm-compile-time-tracker.com/compare.php?from=a8b4f5bbab62091835205f3d648902432a4a5b58&to=aeae054055b125b011c1122f82c86457e159436f&stat=size-total
and is roughly neutral compile-time wise:
https://llvm-compile-time-tracker.com/compare.php?from=a8b4f5bbab62091835205f3d648902432a4a5b58&to=aeae054055b125b011c1122f82c86457e159436f&stat=instructions
This relands commit 703240c71f,
that was reverted by commit 7405581f7c,
because the assertion `isa<LoadInst>(OrigInstr)` didn't hold in practice,
as the newly added test `@select_of_ptrs` shows:
If the pointers into alloca are used by select's/PHI's, then even if
we manage to fracture the alloca, some sub-alloca's will likely remain.
And if there are any non-capturing calls, then we will also decide to
keep the original backing alloca around, and we suddenly ~doubled
the alloca size, and the amount of memory traffic.
I'm not sure if this is a problem or we could live with it,
but let's leave that for later...
Reviewed By: djtodoro
Differential Revision: https://reviews.llvm.org/D113520
This will let us start moving away from hard-coded attributes in
MemoryBuiltins.cpp and put the knowledge about various attribute
functions in the compilers that emit those calls where it probably
belongs.
Differential Revision: https://reviews.llvm.org/D117921
The custom state machine had a check for surplus threads that filtered
the main thread if the kernel was executed by a single warp only. We
now first check for the main thread, then for surplus threads, avoiding
filtering the former out.
Fixes #54214.
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D121011
Prior to this change LLVM would happily elide a call to any allocation
function and a call to any free function operating on the same unused
pointer. This can cause problems in some obscure cases, for example if
the body of operator new can be inlined but the body of
operator delete can't, as in this example from jyknight:
#include <stdlib.h>
#include <stdio.h>
int allocs = 0;
void *operator new(size_t n) {
  allocs++;
  void *mem = malloc(n);
  if (!mem) abort();
  return mem;
}
__attribute__((noinline)) void operator delete(void *mem) noexcept {
  allocs--;
  free(mem);
}
void deleteit(int*i) { delete i; }
int main() {
  int*i = new int;
  deleteit(i);
  if (allocs != 0)
    printf("MEMORY LEAK! allocs: %d\n", allocs);
}
This patch addresses the issue by introducing the concept of an
allocator function family and uses it to make sure that alloc/free
function pairs are only removed if they're in the same family.
Differential Revision: https://reviews.llvm.org/D117356
This check is not relevant for correctness, it can only avoid
walking some recursive uses if the cast is to a non-function
pointer type. As this distinction will no longer be possible
with opaque pointers and all users will have to be walked
anyway, I'm dropping the check in advance.
Recommit without changes over 53abe3ff66,
which addressed the cause of the reported crash.
-----
With opaque pointers, the zero-offset load will generally not use
a GEP. Allow a direct load without GEP, which is treated the same
way as a zero-offset GEP.
Use the overload that supports moving into an empty block. I don't
think that this situation can occur right now, but it can happen
with the change from e7fb1c15cb,
and the test is derived from the issue reported there.
Per discussion on
https://reviews.llvm.org/D59709#inline-1148734, this seems like the
right course of action. `canBeOmittedFromSymbolTable()` subsumes and
generalizes the previous logic. In addition to handling `linkonce_odr`
`unnamed_addr` globals, we now also internalize `linkonce_odr` +
`local_unnamed_addr` constants.
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D120173
This builds on @fhahn's D112313, and caches the liveOnEntry node as an optimized access. D112313 tried to only cache a known clobber. This change adds caching the fact that no clobber exists. It still does not cache may-clobber results.
Differential Revision: https://reviews.llvm.org/D120842
The similar getICmpCode and getPredForICmpCode are already there.
This moves FP for consistency.
I think InstCombine is currently the only user of both.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D120754
We will check a bit later that the constant is in fact a function,
so the separate check for a function pointer type is largely
redundant. Also simplify the cast stripping with
stripPointerCasts().
`ArgInfo` is reduced to only contain a pair of {formal,actual} values.
The specialized function `Fn` and the `Partial` flag are redundant in
this structure. The `Gain` is moved to a new struct `SpecializationInfo`.
The value mappings created by cloneCandidateFunction() are being used
by rewriteCallSites() for matching the formal arguments of recursive
functions.
The list of specializations is passed by reference to calculateGains()
instead of being returned by value.
The `IsPartial` flag is removed from isArgumentInteresting() and
getPossibleConstants() as it's no longer used anywhere in the code.
Differential Revision: https://reviews.llvm.org/D120753
To make this actually trigger, we also need to check whether the
function types differ, which is a hidden cast under opaque pointers.
The transform is somewhat less relevant there because it is
primarily about pointer bitcasts, but it can also happen with other
bit- or pointer-castable types.
Byval handling is easier with opaque pointers because there is no
need to adjust the byval type, we only need to make sure that it's
still a pointer.
The logic for handling this was fixed in
8d7f118ab2, but the check for byval
on the callee was retained. This resulted in a weird situation
where the transform would work depending on whether the byval
was only on the call or on both the call and the function.
There is a general WalkerStepLimit adjustment higher up in the
loop, and I don't see any reason why this particular case would
need additional adjustment. Furthermore, this could underflow.
Root issue which triggered the revert was fixed in 689bab. No changes in the reapplied patch.
Original commit message follows:
SLP currently schedules all instructions within a scheduling window which stretches from the first instruction potentially vectorized to the last. This window can include a very large number of unrelated instructions which are not being considered for vectorization. This change switches the code to only schedule the sub-graph consisting of the instructions being vectorized and their transitive users.
This has the effect of greatly reducing the amount of work performed in large basic blocks, and thus greatly improves compile time on degenerate examples. To understand the effects, I added some statistics (not planned for upstream contribution). Here's an illustration from my motivating example:
Before this patch:
704357 SLP - Number of calcDeps actions
699021 SLP - Number of schedule calls
5598 SLP - Number of ReSchedule actions
59 SLP - Number of ReScheduleOnFail actions
10084 SLP - Number of schedule resets
8523 SLP - Number of vector instructions generated
After this patch:
102895 SLP - Number of calcDeps actions
161916 SLP - Number of schedule calls
5637 SLP - Number of ReSchedule actions
55 SLP - Number of ReScheduleOnFail actions
10083 SLP - Number of schedule resets
8403 SLP - Number of vector instructions generated
I do want to highlight that there is a small difference in number of generated vector instructions. This example is hitting the bailout due to maximum window size, and the change in scheduling is slightly perturbing when and how we hit it. This can be seen in the RescheduleOnFail counter change. Given that, I think we can safely ignore.
The downside of this change can be seen in the large test diff. We group all vectorizable instructions together at the bottom of the scheduling region. This means that vector instructions can move quite far from their original point in code. While maybe undesirable, I don't see this as being a major problem as this pass is not intended to be a general scheduling pass.
For context, it's worth noting that the pre-scheduling that SLP does while building the vector tree is exactly the sub-graph scheduling implemented by this patch.
Differential Revision: https://reviews.llvm.org/D118538
While a collection of allocas is technically vectorizable - by forming a wider alloca - this was not a transform SLP actually knows how to do. Instead, we were forming a bundle with missing dependencies, and then relying on the scheduling code to preserve program order if multiple instructions were schedulable at once. I haven't been able to write a test case, but I'm 99% sure this was wrong in some edge case.
The unknown op case was flowing down the shufflevector path. This did result in some splat handling being lost with this change, but the same lack of splat handling is visible in a whole bunch of simple examples for the gather path. I didn't consider this interesting to fix given how narrow the splat of allocas case is.
The verify call was taking 50% of the compile time in our internal LLVM
fork when trying to unroll many loops.
Differential Revision: https://reviews.llvm.org/D113028
Instead of relying on underlying instructions, this patch updates
VPScalarIVStepsRecipe to only store the required type information.
This removes access to unrelated information, as well as avoiding issues
with the same underlying instruction being shared by multiple recipes.
This change should only change the debug output and not cause any
codegen changes, hence NFCI.
This is a recommit without changes. I originally reverted this
due to a significant code-size regression on tramp3d-v4, however
further investigation showed that in the tramp3d-v4 case this
change enables additional optimizations (in particular more
jump threading), which happens to reduce the size of a function
just enough to be eligible for inlining at hot callsites, which
results in the code size increase. As such, this was just bad
luck.
-----
This one-use limitation is artificial, we do not increase
instruction count if we perform the fold with multiple uses. The
motivating case is shown in @sub_eq_zero_select, where the one-use
limitation causes us to miss a subsequent select fold.
I believe the backend is pretty good about reusing flag-producing
subs for cmps with same operands, so I think doing this is fine.
Differential Revision: https://reviews.llvm.org/D120337
For conditional branches, we know the value is i1 0 or i1 1 along
the outgoing edges. For switches we can apply exactly the same
optimization, just with the known values determined by the switch
cases.
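A hand-written sketch of the idea (not taken from the patch):
```
define void @example(i32 %x) {
entry:
  switch i32 %x, label %default [
    i32 0, label %bb.zero
    i32 1, label %bb.one
  ]

bb.zero:                                          ; %x is known to be 0 here
  ret void

bb.one:                                           ; %x is known to be 1 here
  ret void

default:
  ret void
}
```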
The priority-based inliner currently uses block count combined with callee entry count to drive callsite inlining. This doesn't work well with LTO where postlink inlining is driven by prelink-annotated block count which could be based on the merge of all context profiles. I'm fixing it by using callee profile entry count only which should be context-sensitive.
I'm seeing 0.2% perf improvement for one of our internal large benchmarks with probe-based non-CS profile.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D120784
Previously there was a debug flag to print the module after
optimizations. Sometimes we wanted to print the module before
optimizations so this is being split into two flags.
`-openmp-opt-print-module` is now `-openmp-opt-print-module-after`.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D120768
Instead of passing an ICmpInst * and a bool, just pass the predicate
from the caller.
I'm considering moving the similar FCmp functions from InstCombine
over here and this makes the interface consistent with what is used
for FCmp.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D120609
Currently, adding the attribute no_sanitize("bounds") does not disable
-fsanitize=local-bounds (also enabled in -fsanitize=bounds). The Clang
frontend handles fsanitize=array-bounds which can already be disabled by
no_sanitize("bounds"). However, instrumentation added by the
BoundsChecking pass in the middle-end cannot be disabled by the
attribute.
The fix is very similar to D102772 that added the ability to selectively
disable sanitizer pass on certain functions.
In this patch, if no_sanitize("bounds") is provided, an additional
function attribute (NoSanitizeBounds) is attached to IR to let the
BoundsChecking pass know we want to disable local-bounds checking. In
order to support this feature, the IR is extended (similar to D102772)
to make Clang able to preserve the information and let BoundsChecking
pass know bounds checking is disabled for certain functions.
Reviewed By: melver
Differential Revision: https://reviews.llvm.org/D119816
This transform can still be applied if there are more than two
phi inputs, as long as phi inputs with the same value are dominated
by the same idom edge.
Treat the icmp and sub symmetrically, and require that one of them
has one use, not the icmp in particular. This could be further
relaxed in the abs (but not nabs) case to not check one-use at
all.
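For reference, a hand-written sketch of the abs pattern being matched, where the select is fed by both the icmp and the sub:
```
define i32 @abs_pattern(i32 %x) {
  %neg = sub i32 0, %x
  %cmp = icmp slt i32 %x, 0
  %abs = select i1 %cmp, i32 %neg, i32 %x
  ret i32 %abs
}
```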
This should no longer be necessary now that we canonicalize to
intrinsics. This may not be entirely NFC in practice if worklist
order gets inverted and we perform demanded bits simplification
of a select user before the select is canonicalized.
A function is basically dead when:
* it has no uses
* it has only self-referencing uses (it's recursive)
Differential Revision: https://reviews.llvm.org/D119878
This is a clean-up patch. The functional pass was rolled into the module pass in D112732.
Reviewed By: vitalybuka, aeubanks
Differential Revision: https://reviews.llvm.org/D120674
extractvalue (any_mul_with_overflow X, -1), 0 --> -X
There are similar other potential transforms that we could do as
noted by the last TODO in the test diffs.
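A hand-written before/after sketch, using the signed intrinsic (the unsigned variant behaves the same for this pattern):
```
declare { i32, i1 } @llvm.smul.with.overflow.i32(i32, i32)

define i32 @src(i32 %x) {
  %mul = call { i32, i1 } @llvm.smul.with.overflow.i32(i32 %x, i32 -1)
  %val = extractvalue { i32, i1 } %mul, 0
  ret i32 %val
}

define i32 @tgt(i32 %x) {
  %neg = sub i32 0, %x
  ret i32 %neg
}
```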
Fixes #54053
Currently bottom-to-top reordering analysis counts orders of the
operands and then adds natural order counts for the operand users. It is
very conservative, since the user nodes themselves may require
reordering. This patch improves bottom-to-top analysis by checking for the
user nodes if they require/allow the reordering. If the user node must
be reordered, has reused scalars, is an alternate op vectorization node,
is a non-ordered gather node or may allow reordering because of the
reordered operands, such a node is considered as one that allows
reordering and is not counted as a node with the natural order.
Differential Revision: https://reviews.llvm.org/D120492
This reverts the revert commit ff93260bf6.
The underlying issue causing the PPC bot failures has been fixed in
cbaac14734 and a corresponding test case has been added in
ad2cad1c52.
Original message:
This patch adds a new VPScalarIVStepsRecipe to handle building scalar
steps.
In the first patch, it only handles the case where there is no vector
induction variable needed.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D115953
Exit values of vector inductions are generated completely independently of
the induction recipes. Consider them for removal, if they are not used
in the loop.
This fixes a crash exposed by 49b23f451c.
Only call it for intrinsic min/max. The moved implementation is
unchanged apart from the one-use check: It is now hardcoded to
one-use, without the two-use special case for SPF.
SLP makes very heavy use of aliasing queries to construct pointer dependencies for scheduling purposes. AA internally uses pointerMayBeCaptured to prove some noalias results. In a local profile, we were spending about 4% of total O2 time in capture tracking. By using the BatchAA interface - which caches capture results - this drops to 2%.
Note that there is no invalidation of BatchAA here. This assumes that no transformation done by SLP invalidates alias or capture results. This is the same assumption made by the existing AliasCache, so this is not a new assumption in the code.
This patch adds a new VPScalarIVStepsRecipe to handle building scalar
steps.
In the first patch, it only handles the case where there is no vector
induction variable needed.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D115953
This can be used to explicitly model VPValues that depend on SCEV
expansion, like the step for inductions.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D116288
This patch adds a new transform to remove dead recipes. For now, it only
removes dead recipes in the header, to keep the number of tests that require
updating manageable. Future patches will extend this to remove dead
recipes across the whole plan.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D118051
Currently, SLP can insert a "shuffle" instruction between a PHI and a landing pad instruction. The problem is demonstrated by the LIT test. The solution is to adjust the insertion point once we are done with PHI generation.
Differential Revision: https://reviews.llvm.org/D120552
With opaque pointers, the zero-offset load will generally not use
a GEP. Allow a direct load without GEP, which is treated the same
way as a zero-offset GEP.
We already have a check for !InstQueries.empty(), so move the for-range over InstQueries inside to avoid the AAReachability uninitialized variable static analysis warnings.
Now that we canonicalize SPF min/max to intrinsics, there's no
need to canonicalize the structure of the SPF min/max itself
anymore. This is conceptually NFC, but in practice does slightly
impact results due to folding order differences.
SCEVs ExprValueMap currently tracks not only which IR Values
correspond to a given SCEV expression, but additionally stores that
it may be expanded in the form X+Offset. In theory, this allows
reusing existing IR Values in more cases.
In practice, this doesn't seem to be particularly useful (the test
changes are rather underwhelming) and adds a good bit of complexity.
Per https://github.com/llvm/llvm-project/issues/53905, we have an
invalidation issue with these offseted expressions.
Differential Revision: https://reviews.llvm.org/D120311
Expand `TruncInstCombine` to handle loops by adding `phi` nodes
to expression graph.
Reviewed by: RKSimon, lebedev.ri
(recommit of fixed f84d732f, reverted by 8ad6d5e after sanitizer breakage)
Differential Revision: https://reviews.llvm.org/D109817
The min/max intrinsic cost is currently too low because in the cost calculation
we subtract the cost of the vector compare as we will not emit it.
For the cost of the vector compare we are currently passing BAD_ICMP_PREDICATE
which returns 3, the worst case cost.
I think we should be passing VecPred instead, since we know the predicates of
the compare instr.
I think this is related to commit b3b993a7ad which introduced the predicate
argument to getCmpSelInstrCost().
https://reviews.llvm.org/rGb3b993a7ad817c3c5801341fa78f34332900eb83
Differential Revision: https://reviews.llvm.org/D120439
Summary:
We use a section to embed offloading code into the host for later
linking. This is normally unique to the translation unit as it is thrown
away during linking. However, if the user performs a relocatable link
the sections will be merged and we won't be able to access the files
stored inside. This patch changes the section variables to have external
linkage and a name defined by the section name, so if two sections are
combined during linking we get an error.
Now that integer min/max intrinsics have good support in both
InstCombine and other passes, start canonicalizing SPF min/max
to intrinsic min/max.
Once this sticks, we can stop matching SPF min/max in various
places, and can remove hacks we have for preventing infinite loops
and breaking of SPF canonicalization.
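For illustration, a hand-written sketch of the two forms (value names are made up):
```
declare i32 @llvm.smin.i32(i32, i32)

; select-pattern-flavor (SPF) form
define i32 @spf_form(i32 %a, i32 %b) {
  %cmp = icmp slt i32 %a, %b
  %min = select i1 %cmp, i32 %a, i32 %b
  ret i32 %min
}

; canonical intrinsic form
define i32 @canonical_form(i32 %a, i32 %b) {
  %min = call i32 @llvm.smin.i32(i32 %a, i32 %b)
  ret i32 %min
}
```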
Differential Revision: https://reviews.llvm.org/D98152
The `SplitIndirectBrCriticalEdges` function was originally designed for
`CodeGenPrepare` and skipped splitting of edges when the destination
block didn't contain any `PHI` instructions. This only makes sense when
reducing COPYs like `CodeGenPrepare`. In the case of
`PGOInstrumentation` or `GCOVProfiling` it would result in missed
counters and wrong results in functions with computed goto.
Differential Revision: https://reviews.llvm.org/D120096
This change uses instruction's comesBefore method to simplify the code significantly. There's little compile time concern here because getSpillCost already calls comesBefore on every basic block which contains a vectorization candidate. The only additional times we'll build basic block ordering is when we can't schedule a vector candidate anywhere in the containing block.
Differential Revision: https://reviews.llvm.org/D120364
Prior to this change, LLVM would attempt to optimize an
aligned_alloc(33, ...) call to the stack. This flunked an assertion when
trying to emit the alloca, which crashed LLVM. Avoid that with extra
checks.
Differential Revision: https://reviews.llvm.org/D119604
This cap was first added in 848c1aa45 (back in 2015). Per the original commit message, the purpose was to avoid a compile time explosion in long basic blocks. The algorithmic problem in scheduling has now been fixed in 0539a26d.
In the meantime, the code has rotten fairly badly. Some intermediate refactoring caused the size to only be incremented if *both* iterators advance in the window search. This causes the size to be badly undercounted when near one end of a basic block. We no longer have any test which exercises the logic in an intentional way; there's one test which differs with this change, but the changes appear fairly orthogonal to the purpose of the test file.
Unfortunately, we no longer have the original motivating example, so it's possible that it also hits some other issue. I tested locally with a large example, but even at its worst, that one doesn't demonstrate anything too extreme even without the algorithmic fix. It's clearly faster with, but only by ~20% which doesn't seem in line with the original commit message. If regressions with this patch are seen, please file a bug and I'll try to fix any other algorithmic problems which fall out.
Rather than queuing up actions, have one function that does the
log2() fold in the obvious way, but with a flag that allows us
to check whether the fold will succeed without actually performing
it.
What we're really doing here is converting Op0 udiv Op1 into
Op0 lshr log2(Op1), so phrase it in that way. Actually pushing
the lshr into the log2(Op1) expression should be seen as a separate
transform.
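For illustration, a hand-written sketch of the kind of fold this covers, where log2(Op1) is visible as a shift amount:
```
define i32 @src(i32 %x, i32 %n) {
  %y = shl i32 1, %n
  %d = udiv i32 %x, %y
  ret i32 %d
}

define i32 @tgt(i32 %x, i32 %n) {
  %d = lshr i32 %x, %n
  ret i32 %d
}
```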
This one-use limitation is artificial, we do not increase
instruction count if we perform the fold with multiple uses. The
motivating case is shown in @sub_eq_zero_select, where the one-use
limitation causes us to miss a subsequent select fold.
I believe the backend is pretty good about reusing flag-producing
subs for cmps with same operands, so I think doing this is fine.
Differential Revision: https://reviews.llvm.org/D120337
In places where `MaxNumPromotions` is used to allocated an array, bail out early to prevent allocating an array of length 0.
Differential Revision: https://reviews.llvm.org/D120295
A call to getInsertIndex() in getTreeCost() is returning None,
which causes an assert because a non-constant index value for
insertelement was not expected. This case occurs when the
insertelement index value is defined with a PHI.
Differential Revision: https://reviews.llvm.org/D120223
The "Correlated Value Propagation" pass was missing a case when handling select instructions. It was only handling the "false" constant value, while in NVPTX the select may have the condition (and thus the branches) inverted, for example:
```
loop:
%phi = phi i32* [ %sel, %loop ], [ %x, %entry ]
%f = tail call i32* @f(i32* %phi)
%cmp1 = icmp ne i32* %f, %y
%sel = select i1 %cmp1, i32* %f, i32* null
%cmp2 = icmp eq i32* %sel, null
br i1 %cmp2, label %return, label %loop
```
But the select condition can be inverted:
```
%cmp1 = icmp eq i32* %f, %y
%sel = select i1 %cmp1, i32* null, i32* %f
```
The fix is to enhance "Correlated Value Propagation" to handle both branches of the select instruction.
Reviewed By: nikic, lebedev.ri
Differential Revision: https://reviews.llvm.org/D119643
SLP currently schedules all instructions within a scheduling window which stretches from the first instruction potentially vectorized to the last. This window can include a very large number of unrelated instructions which are not being considered for vectorization. This change switches the code to only schedule the sub-graph consisting of the instructions being vectorized and their transitive users.
This has the effect of greatly reducing the amount of work performed in large basic blocks, and thus greatly improves compile time on degenerate examples. To understand the effects, I added some statistics (not planned for upstream contribution). Here's an illustration from my motivating example:
Before this patch:
704357 SLP - Number of calcDeps actions
699021 SLP - Number of schedule calls
5598 SLP - Number of ReSchedule actions
59 SLP - Number of ReScheduleOnFail actions
10084 SLP - Number of schedule resets
8523 SLP - Number of vector instructions generated
After this patch:
102895 SLP - Number of calcDeps actions
161916 SLP - Number of schedule calls
5637 SLP - Number of ReSchedule actions
55 SLP - Number of ReScheduleOnFail actions
10083 SLP - Number of schedule resets
8403 SLP - Number of vector instructions generated
I do want to highlight that there is a small difference in number of generated vector instructions. This example is hitting the bailout due to maximum window size, and the change in scheduling is slightly perturbing when and how we hit it. This can be seen in the RescheduleOnFail counter change. Given that, I think we can safely ignore.
The downside of this change can be seen in the large test diff. We group all vectorizable instructions together at the bottom of the scheduling region. This means that vector instructions can move quite far from their original point in code. While maybe undesirable, I don't see this as being a major problem as this pass is not intended to be a general scheduling pass.
For context, it's worth noting that the pre-scheduling that SLP does while building the vector tree is exactly the sub-graph scheduling implemented by this patch.
Differential Revision: https://reviews.llvm.org/D118538
D118623 added code to fold not-of-compare into a compare
with the inverted predicate, if the compare had no other
uses. This relies on accurate use lists in the IR but it
was run before setPhiValues, when some phi inputs are still
stored in a data structure on the side, instead of being
real uses in the IR. The effect was that a phi that should
be using the original compare result would now get an
inverted result instead.
Fix this by moving simplifyConditions after setPhiValues.
Differential Revision: https://reviews.llvm.org/D120312
This patch is the first in a series of patches to upstream the support for Apple's DriverKit. Once complete, it will allow targeting DriverKit platform with Clang similarly to AppleClang.
This code was originally authored by JF Bastien.
Differential Revision: https://reviews.llvm.org/D118046
Extends getReductionOpChain to look through Phis which may be part of
the reduction chain. adjustRecipesForReductions will now also create a
CondOp for VPReductionRecipe if the block is predicated and not only if
foldTailByMasking is true.
Changes were required in tryToBlend to ensure that we don't attempt
to convert the reduction Phi into a select by returning a VPBlendRecipe.
The VPReductionRecipe will create a select between the Phi and the reduction.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D117580
Constants cannot be cyclic, but they can be tree-like. Keep a
visited set to ensure we do not degenerate to exponential run-time.
This fixes the problem reported in https://reviews.llvm.org/D117223#3335482,
though I haven't been able to construct a concise test case for
the issue. This requires a combination of dead constants and the
kind of constant expression tree that textual IR cannot represent
(because the textual representation, unlike the in-memory
representation, is also exponential in size).
Currently writtenBetween can miss clobbers of Loc between End and Start,
if End is a MemoryUse.
To guarantee we see all write clobbers of Loc between Start and End
for MemoryUses, restrict to Start and End being in the same block
and check all accesses between them.
This fixes 2 mis-compiles illustrated in
llvm/test/Transforms/MemCpyOpt/memcpy-byval-forwarding-clobbers.ll
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D119929
This patch simplifies constraint handling by removing the
ConstraintListTy wrapper struct and moving the Preconditions directly
into ConstraintTy. This reduces the amount of memory needed for managing
constraints.
The only use case for ConstraintListTy was adding 2 constraints to model
ICMP_EQ conditions. But this can be handled by adding an IsEq flag. When
adding an equality constraint, we need to add the constraint and the
inverted constraint.
One of the optimizations performed in OpenMPOpt pushes globalized
variables to static shared memory. This is preferable to keeping the
runtime call in all cases, however if too many variables are pushed to
shared memory the kernel will crash. Since this is an optimization and
not something the user specified explicitly, there should be an option
to limit this optimization in those cases. This patch introduces the
`-openmp-opt-shared-limit=` option to limit the amount of bytes that
will be placed in shared memory from HeapToShared.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D120079
If the alternate cmp instruction is a swapped predicate of the main cmp
instruction, we need to generate the alternate instruction, not the one with
the swapped predicate. Also, the lane with the alternate opcode should
be selected only if the corresponding operands are not compatible.
Correctness confirmed:
https://alive2.llvm.org/ce/z/94BG66
Differential Revision: https://reviews.llvm.org/D119855
For ASan this will effectively serve as a synonym for
__attribute__((no_sanitize("address"))).
Adding the disable_sanitizer_instrumentation to functions will drop the
sanitize_XXX attributes on the IR level.
This is the third reland of https://reviews.llvm.org/D114421.
Now that TSan test is fixed (https://reviews.llvm.org/D120050) there
should be no deadlocks.
Differential Revision: https://reviews.llvm.org/D120055
When we scan vtables for a particular vload in ScanVTableLoad and an entry in
one possible vtable is invalid (null or non-fptr), we bail in a wrong way -- we
completely stop the scanning of vtables and this results in dropped dependencies
and incorrectly removed vfuncs from vtables. Let's fix that by correcting the
bailing logic to keep iterating and only skip the invalid entries.
Differential Revision: https://reviews.llvm.org/D120006
LICM will speculatively hoist code outside of loops. This requires removing information, like alias analysis (https://github.com/llvm/llvm-project/issues/53794), range information (https://bugs.llvm.org/show_bug.cgi?id=50550), among others. Prior to https://reviews.llvm.org/D99249, LICM would only be run after LoopRotate. Running Loop Rotate prior to LICM prevents an instruction hoist from being speculative, if it was conditionally executed by the iteration (as is commonly emitted by clang and other frontends). Adding the additional LICM pass first, however, forces all of these instructions to be considered speculative, even if they are not speculative after LoopRotate. This destroys information, resulting in performance losses from discarding this additional information.
This PR modifies LICM to accept a "speculative" parameter which allows LICM to be set to perform information-losing speculative hoists or not. Phase ordering is then modified to not perform the information-losing speculative hoists until after loop rotate is performed, preserving this additional information.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D119965
This will bail out on target specific intrinsics. If those are deemed
important enough for EarlyCSE to handle, we can augment MemIntrinsicInfo
with an access type for TargetTransformInfo::getTgtMemIntrinsic() to
handle.
Reviewed By: #opaque-pointers, nikic
Differential Revision: https://reviews.llvm.org/D120077
This patch adds the '__kmpc_get_hardware_num_threads_in_block'
OpenMP RTL function to the externalization RAII struct. This was getting
optimized out and then being replaced with an undefined value once added
back in, causing bugs for complex reductions.
Fixes #53909.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D120076
With
668c5c688b
we introduced an ordering issue revealed by the reverse iteration
buildbot. Depending on the order of the map that tracks the AAIsDead AAs
we ended up with slightly different attributes. This is not totally
unexpected and can happen. We should however be deterministic in our
orderings to avoid such issues.
The assertion verifying that a newly computed value matches what is
already cached used stripPointerCasts() to strip bitcasts, however the
values can be not only pointers, but also vectors of pointers. That is
problematic because stripPointerCasts() doesn't handle vectors of
pointers. This patch introduces an ad-hoc utility function to strip all
bitcasts regardless of the value type.
Reviewed By: skatkov, reames
Differential Revision: https://reviews.llvm.org/D119994
Removing dead constants should not count as making a change to the
module. This means that RemoveUnusedGlobalValue simplifies to just
calling removeDeadConstantUsers, so inline it.
Differential Revision: https://reviews.llvm.org/D120052
A generalization like this was suggested in D119754.
This is the inverse direction of D119851,
and we get all of the folds there plus the one that was missed.
There is precedence for this kind of transform in instcombine
with "or" instructions (but strangely only with that one opcode AFAICT).
Similar justification as in the other patch:
The line between instcombine and reassociate for these kinds of folds
is blurry. This doesn't appear to have much cost and gives us the
expected wins from repeated folds as seen in the last set of test diffs.
Differential Revision: https://reviews.llvm.org/D119955
This code could be generalized to be type-independent, but for now
just ensure that the same type constraints are enforced with opaque
pointers as with typed pointers.
When we move an allocation from the heap to the stack we need to
allocate it in the alloca AS and then cast the result. This also
means the alloca is now inserted right before the allocation call rather
than after it.
Fixes https://github.com/llvm/llvm-project/issues/53858
When we use liveness for edges during the `genericValueTraversal` we
need to make sure to use the AAIsDead of the correct function. This
patch adds the proper logic and some simple caching scheme. We also
add an assertion to the `isEdgeDead` call to make sure future misuse
is detected earlier.
Fixes https://github.com/llvm/llvm-project/issues/53872
`UsedAssumedInformation` is a return argument utilized to determine what
information is known. Most APIs used it already but
`genericValueTraversal` did not. This adds it to `genericValueTraversal`
and replaces `AllCallSitesKnown` of `checkForAllCallSites` with the
commonly used `UsedAssumedInformation`.
This was supposed to be a NFC commit, then the test change appeared.
Turns out, we had one user of `AllCallSitesKnown` (AANoReturn) and the
way we set `AllCallSitesKnown` was wrong as we ignored the fact some
call sites were optimistically assumed dead. Included a dedicated test
for this as well now.
Fixes https://github.com/llvm/llvm-project/issues/53884
By convention, memcpy/memmove intrinsics are always used with i8
pointers (though this is not enforced), so in practice this code
was always using an i8 type. Make that explicit.
Of course, i8 is not a very profitable choice, and this code could
be more performant by picking an appropriate larger type. But that
would require additional test coverage and correctness review, and
certainly shouldn't be a decision based on the pointer element type.
We only need to do propagation on use instructions of the original
value, rather than the replacing const value which might have lots
of irrelevant uses. This is done by caching uses before replacing.
Differential Revision: https://reviews.llvm.org/D119815
Integer min/max operations are associative:
max (max X, C0), C1 --> max X, (max C0, C1) --> max X, NewC
https://alive2.llvm.org/ce/z/wW5HVM
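A hand-written sketch with made-up constants:
```
declare i32 @llvm.smax.i32(i32, i32)

define i32 @src(i32 %x) {
  %m1 = call i32 @llvm.smax.i32(i32 %x, i32 13)
  %m2 = call i32 @llvm.smax.i32(i32 %m1, i32 42)
  ret i32 %m2
}

define i32 @tgt(i32 %x) {
  ; NewC = max(13, 42) = 42
  %m = call i32 @llvm.smax.i32(i32 %x, i32 42)
  ret i32 %m
}
```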
This would avoid a regression when we canonicalize to min/max intrinsics
(see D98152 ).
Differential Revision: https://reviews.llvm.org/D119754
In particular, this breaks vectorization of insertelements where some of the
intermediate (i.e. not last) insertelements are used externally.
Fixes PR52275
Fixes#51617
Differential Revision: https://reviews.llvm.org/D119679
Do not merge a context that is already duplicated into the base profile.
Also fixing a typo caused by previous refactoring.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D119735
There was a fixme in the code pertaining to attributing functions as
noreturn. By using reachability, if none of the blocks that are
reachable from the entry return, then the function is noreturn.
Previously, the code only checked if any blocks returned. If they're
unreachable, then they don't matter.
This improves codegen for the Linux kernel.
Fixes: https://github.com/ClangBuiltLinux/linux/issues/1563
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D119571
This introduces a new "ptrauth" operand bundle to be used in
call/invoke. At the IR level, it's semantically equivalent to an
@llvm.ptrauth.auth followed by an indirect call, but it provides
additional hardening by preventing the intermediate raw pointer from
being exposed.
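Roughly, at the IR level (a sketch with a hypothetical key and discriminator; not exact frontend output):
```
  ; expanded form: the raw, authenticated pointer is visible as an IR value
  %signed.int = ptrtoint ptr %fp to i64
  %raw.int    = call i64 @llvm.ptrauth.auth(i64 %signed.int, i32 0, i64 %disc)
  %raw        = inttoptr i64 %raw.int to ptr
  call void %raw()

  ; bundle form: authentication is folded into the call, so no raw pointer escapes
  call void %fp() [ "ptrauth"(i32 0, i64 %disc) ]
```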
This mostly adds the IR definition, verifier checks, and support in
a couple of general helper functions. Clang IRGen and backend support
will come separately.
Note that we'll eventually want to support this bundle in indirectbr as
well, for similar reasons. indirectbr currently doesn't support bundles
at all, and the IR data structures need to be updated to allow that.
Differential Revision: https://reviews.llvm.org/D113685
If the function types differ, the call arguments don't necessarily
correspond to the function arguments. It's likely not worthwhile to
handle this more precisely, but at least we shouldn't crash.
While this might be marginally more precise, we generally don't
bother with this in InstCombine, and let the IRBuilder assign the
debug location. I don't see why this one fold, out of the thousands
done in InstCombine, should be treated specially.
This ensures that if we have a dbg.addr in a coroutine funclet that is on one of
our function arguments, that the dbg.addr is not mapped to undef and also that
later it isn't hoisted to the front of the basic block. Instead it remains at
its original cloned location.
rdar://83957028
Differential Revision: https://reviews.llvm.org/D119576
This is the first step in unifying some of the logic between hwasan and
MTE stack tagging. This only moves around code; changes to converge
different implementations of the same logic follow later.
Reviewed By: eugenis
Differential Revision: https://reviews.llvm.org/D118947
If we assume `llvm.amdgcn.s.barrier` is aligned we may remove it and
cause OpenMP GPU applications on the AMD GPU to be stuck or wrongly
synchronized.
Reported by Carlo Bertolli.
```
always_inline foo() { }
bar () {
noinline foo();
}
```
We should prefer call site attribute over attribute on decl. This is fix for AlwaysInliner, similar fix is needed for normal Inliner (follow up).
Related to https://reviews.llvm.org/D119061
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D119553
If a cast is needed when replacing uses with newly created values, the
cast must be inserted after the instruction that defines the new value.
Fixes: SWDEV-321215
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D119524
The module flag to indicate use of hostcall is insufficient to catch
all cases where hostcall might be in use by a kernel. This is now
replaced by a function attribute that gets propagated to top-level
kernel functions via their respective call-graph.
If the attribute "amdgpu-no-hostcall-ptr" is absent on a kernel, the
default behaviour is to emit kernel metadata indicating that the
kernel uses the hostcall buffer pointer passed as an implicit
argument.
The attribute may be placed explicitly by the user, or inferred by the
AMDGPU attributor by examining the call-graph. The attribute is
inferred only if the function is not being sanitized, and the
implicitarg_ptr does not result in a load of any byte in the hostcall
pointer argument.
Reviewed By: jdoerfert, arsenm, kpyzhov
Differential Revision: https://reviews.llvm.org/D119216
If the source value we could infer an address space from went through a
ptrtoint/inttoptr pair, this would fail since bitcast can't change the
address space.
Fixes issue 53665.
Rather than checking that the type is the same (which is always
the case, given how these are part of the same phi) check that the
source element type is the same. With opaque pointers, this is no
longer implied.
To avoid incorrectly merging GEPs with different source types
under opaque pointers.
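For example (a sketch), with opaque pointers both GEPs below have identical operand and result types, so only the source element type distinguishes them:
```
  %g1 = getelementptr i32, ptr %p, i64 1   ; advances by 4 bytes
  %g2 = getelementptr i64, ptr %p, i64 1   ; advances by 8 bytes
```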
To avoid increasing the Expression structure size, this reuses the
existing type member. The code does not rely on this to be the
expression result type, it's only used as a disambiguator.
In addition to the self-recursion check, also check whether there
is more than one node in the SCC, which implies that there is a
larger cycle. I believe checking SCC structure (rather than
something like norecurse) is the right thing to do here, because
this is specifically about preventing infinite loops over the SCC.
Fixes https://github.com/llvm/llvm-project/issues/42028.
Differential Revision: https://reviews.llvm.org/D119418
The last use of this file was removed in late 2013 in ea56494625;
it was in PathProfiling.cpp, which had an overview comment
of the overall approach.
Similar functionality lives in the slightly more cryptically named
CFGMST.h in this same directory. A similar overview comment is
in PGOInstrumentation.cpp.
No behavior change.
Note that this doesn't actually cause the top level predicate to become a non-union just yet.
The * above comes from a case in the LoopVectorizer where a predicate which is later proven no longer blocks vectorization, due to the change from checking whether predicates exist to checking whether the predicate is possibly false.
Minor efficiency fix. There is no reason to perform the same set lookup
repeatedly in the inner loop as it is invariant there.
Differential Revision: https://reviews.llvm.org/D119474
New users might want to check bins without a load or store instruction
at hand. Since we use those instructions only to find the offset and
size of the access anyway, we can expose an offset and size interface
to the outside world as well.
This commit mainly moves code around and exposes a class (OffsetAndSize)
as well as a method forallInterferingAccesses in AAPointerInfo.
Differential Revision: https://reviews.llvm.org/D119249
The oversight caused us to ignore call sites that are effectively dead
when we computed reachability (or, more precisely, the call edges of a
function). The problem is that loads in the readonly callee might depend
on stores prior to the call. If we do not track the call edge, we
mistakenly assume that the store before the call cannot reach the load.
The problem is nicely visible in:
`llvm/test/Transforms/Attributor/ArgumentPromotion/basictest.ll`
Caused by D118673.
Fixes https://github.com/llvm/llvm-project/issues/53726
When we privatize a pointer (~argument promotion) we introduce new
private allocas as replacement. These need to be placed in the alloca
address space as later passes cannot properly deal with them otherwise.
Fixes https://github.com/llvm/llvm-project/issues/53725
Replace the matchSelectPattern pattern match with the more general m_SMin so that it can handle smin intrinsics as well as the icmp+select pattern.
Noticed while reviewing regressions from D98152.
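Both forms below should now be recognized (a minimal sketch with hypothetical values):
```
  ; intrinsic form
  %m1 = call i32 @llvm.smin.i32(i32 %a, i32 %b)
  ; icmp+select form
  %c  = icmp slt i32 %a, %b
  %m2 = select i1 %c, i32 %a, i32 %b
```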
For those curious, the whole reason for tracking the predicate set separately as opposed to just immediately registering the dependencies appears to be allowing the printing code to print a result without changing the PSE state. It's slightly questionable if this justifies the complexity, but since we can preserve it with local ugliness, I did so.
This reverts commit 77a0da926c as we've
received multiple reports of this significantly impacting performance,
in ways that don't seem to just be target specific cost models going
wrong. I would offer some reproducers, but the test changes here seem to
be full of them!
Reverting for now and hopefully we can remove the "hack" more carefully
as we go.
With typed pointers the pointer operand type checks the address space
and the load/store type. With opaque pointers we have to check the
load/store type separately.
Move out the induction step creation from emitTransformedIndex to the
callers. In some places (e.g. widenIntOrFpInduction) the step is already
created. Passing the step in ensures the steps are kept in sync.
This rewrites ArgPromotion to be based on offsets rather than GEP
structure. We inspect all loads at constant offsets and remember
which types are loaded at which offsets. Then we promote based on
those types.
This generalizes ArgPromotion to work with bitcasted loads, and
is compatible with opaque pointers.
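A sketch of the kind of case this now handles (hypothetical IR; the promoted argument names are illustrative only):
```
  ; before: the callee loads an i32 at offset 0 and an i64 at offset 8
  define internal i64 @callee(ptr %p) {
    %v0  = load i32, ptr %p
    %gep = getelementptr i8, ptr %p, i64 8
    %v8  = load i64, ptr %gep
    %ext = zext i32 %v0 to i64
    %sum = add i64 %v8, %ext
    ret i64 %sum
  }

  ; after promotion: the loaded values are passed directly
  define internal i64 @callee(i32 %p.0.val, i64 %p.8.val) {
    %ext = zext i32 %p.0.val to i64
    %sum = add i64 %p.8.val, %ext
    ret i64 %sum
  }
```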
This patch also fixes incorrect handling of alignment during
argument promotion. Previously, the implementation only checked
that the pointer is dereferenceable, but was happy to speculate
overaligned loads. (I would have fixed this separately in advance,
but I found this hard to do with the previous implementation
approach).
Differential Revision: https://reviews.llvm.org/D118685
This makes the function independent of shared state in ILV (ensures no
new dependencies on things like the cost model are introduced) and allows
for use directly in recipe's ::execute functions.
As long as *all* the invokes in the set are indirect,
we can merge them, but don't merge direct invokes into the set,
even though it would be legal to do so.
This patch replaces the function we emit the remark on when we run into
the fix-point limit. Previously we got a function to emit a remark on
from the worklist's associated function. However, the worklist may not
always have an associated function in the case of global variables.
Replace this with the function set, and if there are no functions don't
emit the remark.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D119248
PredicatedScalarEvolution has a predicate type for representing A == B. This change generalizes it into something which can represent A <pred> B.
This generality is currently unused, but is motivated by a couple of recent cases which have come up. In particular, I'm currently playing around with using this to simplify the runtime checking code in LoopVectorizer. Regardless of the outcome of that prototyping, generalizing the compare node seemed useful.
Alloca promotion can only deal with cases where the load/store
types match the alloca type (it explicitly does not support
bitcasted load/stores).
With opaque pointers this is no longer enforced through the pointer
type, so add an explicit check.
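For example (a minimal sketch), under opaque pointers nothing in the pointer type reveals the mismatch below, so it must be checked against the load/store types directly:
```
  %a = alloca i64
  store i64 0, ptr %a
  %v = load i32, ptr %a   ; load type differs from the alloca type: not promotable here
```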
If the original invokes had uses, the uses must have been in PHI's,
but that immediately results in the incoming values being incompatible.
But we'll replace uses of the original invokes with the use of the
merged invoke, so as long as the incoming values become compatible
after that, we can merge.
Even if the invokes have a normal destination, iff it's the same block,
we can merge them. For now, require that there are no PHI nodes,
and the returned values of invokes aren't used.
Before walking all the callers, check whether we have a
dereferenceable attribute directly on the argument.
Also make it clearer that the code currently does not treat
alignment correctly.
The helper `Attributor::checkForAllReturnedValuesAndReturnInsts`
simplifies the returned value optimistically. In `AAUndefinedBehavior`
we cannot use such optimistic values when deducing UB. As a result, we
assumed UB for the return value of a function because we initially
(=optimistically) thought the function return is `undef`. While we later
adjusted this properly, the `AAUndefinedBehavior` was under the
impression the return value is "known" (=fixed) and could never change.
To correct this we use `Attributor::checkForAllInstructions` and then
manually perform simplification of the return value, only allowing
known values to be used. This actually matches the other UB deductions.
Fixes#53647
The vectorizer will choose at times to "vectorize" loops with a scalar
factor (VF=1) with interleaving (IC > 1). This can occasionally produce
better code than the unroller (notably for reductions, where it can
produce independent reduction chains that are combined after the loop).
At times this is not very beneficial though, for example when runtime
checks are needed or when the scalar code requires predication.
This addresses the second point, preventing the vectorizer from
interleaving when the scalar loop will require predication. This
prevents it from making a bit of a mess that is worse than the original
and better left for the unroller to unroll if beneficial. It helps
reverse some of the regressions from D118090.
Differential Revision: https://reviews.llvm.org/D118566
This is a follow-up suggested in D119060.
Instead of checking each of the bottom 2 bits individually,
we can check them together and handle the possibility that
we demand both together.
https://alive2.llvm.org/ce/z/C2ihC2
Differential Revision: https://reviews.llvm.org/D119139
D43208 extracted `useEmulatedMaskMemRefHack()` from legality into cost model.
What it essentially does is prevents scalarized vectorization of masked memory operations:
```
// TODO: Cost model for emulated masked load/store is completely
// broken. This hack guides the cost model to use an artificially
// high enough value to practically disable vectorization with such
// operations, except where previously deployed legality hack allowed
// using very low cost values. This is to avoid regressions coming simply
// from moving "masked load/store" check from legality to cost model.
// Masked Load/Gather emulation was previously never allowed.
// Limited number of Masked Store/Scatter emulation was allowed.
```
While I don't really understand what specifically `is completely broken`
was referring to, I believe that at least on X86 with AVX2-or-later
this is no longer true (or at least, I would like to know what is still broken).
So I would like to follow suit after D111460 and likewise disable that hack for AVX2+.
But since this was added for X86 specifically, let's instead just completely remove this hack.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D114779
Enabled loop interchange support for floating point reductions
if it is allowed to reorder floating point operations.
Previously, when we encountered a floating point PHI node in the
outer loop exit block, we bailed out since we could not detect
floating point reductions in the early days. Now we remove this
limitation since we are able to detect floating point reductions.
Reviewed By: #loopoptwg, Meinersbur
Differential Revision: https://reviews.llvm.org/D117450
In scalarizeInstruction(), isUniformAfterVectorization is used to detect
cases where it is sufficient to always access the first lane. This
should map directly to checking whether the operand is a uniform replicate
recipe.
Differential Revision: https://reviews.llvm.org/D116654
I'm seeing ext-tsp helps CSSPGO for our internal large benchmarks, so I'm turning it on for CSSPGO. For non-CS AutoFDO, ext-tsp doesn't seem to help, probably because of lower profile count quality.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D119048
As per LangRef's definition of `noreturn` attribute:
```
noreturn
This function attribute indicates that the function never returns
normally, hence through a return instruction.
This produces undefined behavior at runtime if the function
ever does dynamically return. Annotated functions may still
raise an exception, i.e., nounwind is not implied.
```
So if we `invoke` a `noreturn` function, and the normal destination
of an invoke is not an `unreachable`, point it at the new `unreachable`
block.
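A rough sketch of the rewrite (hypothetical labels):
```
  ; before
  invoke void @noreturn_fn() to label %normal unwind label %lpad

  ; after: the normal destination points at a newly created block
  invoke void @noreturn_fn() to label %noret.dest unwind label %lpad

  noret.dest:
    unreachable
```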
The change/fix from the original commit is that we now actually create
the new block, and don't just repurpose the original block,
because said normal destination block could have other users.
This reverts commit db1176ce66,
relanding commit 598833c987.
As per LangRef's definition of `noreturn` attribute:
```
noreturn
This function attribute indicates that the function never returns
normally, hence through a return instruction.
This produces undefined behavior at runtime if the function
ever does dynamically return. Annotated functions may still
raise an exception, i.e., nounwind is not implied.
```
Changes the remark to emit on the function call that captures the globalized
variable instead of the globalized variable itself. The user should be able to
see which variable it was in the argument list of the function.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106980
We can simply compute the value of this field on demand. Doing so clarifies the behavior when one of the instructions within a bundle doesn't have valid dependencies. I vaguely think this could change behavior slightly, but none of the test cases are affected, and my attempts to write one by hand have failed.
This also minorly reduces memory usage, but that's a secondary value at best.
While nowadays SimplifyCFG knows how to hoist code from then-else blocks,
sink code from unconditional predecessors, and even promote the latter
by tail-merging `ret`/`resume` function terminators, that isn't everything.
While I (& others) have been trying to deal with merging/sinking `unreachable`,
perhaps the more impactful remaining problem is merging the `throw`
calls.
If we start at the `landingpad`, all the predecessors are unwind edges of `invoke`s,
and in some cases some of the `invoke`s are mergeable.
```
/// This is a weird mix of hoisting and sinking. Visually, it goes from:
/// [...] [...]
/// | |
/// [invoke0] [invoke1]
/// / \ / \
/// [cont0] [landingpad] [cont1]
/// to:
/// [...] [...]
/// \ /
/// [invoke]
/// / \
/// [cont] [landingpad]
```
This simplifies the IR/CFG, at the cost of debug info and extra PHI nodes.
Note that we don't require *all* the `invoke`s of the `landingpad`
to be mergeable; they can form more than a single set, and we gracefully handle that.
For now, I completely disallowed normal destinations, PHI nodes and indirect invokes,
but that can be supported.
Out of all the CTMark projects, only 7zip is C++, so there isn't much impact:
https://llvm-compile-time-tracker.com/compare.php?from=ba8eb31bd9542828f6424e15a3014f80f14522c8&to=722fc871c84f14157d45c2159bc9c8c7e2825785&stat=size-total
... but there it currently causes size-total decrease.
Differential Revision: https://reviews.llvm.org/D117805
This patch adds initial support for signed conditions. To do so,
ConstraintElimination maintains two separate systems, one with facts
from signed and one for unsigned conditions.
To start with this means information from signed and unsigned conditions
is kept completely separate. When it is safe to do so, information from
signed conditions may be also transferred to the unsigned system and
vice versa. That's left for follow-ups.
In the initial version, de-composition of signed values just handles
constants and otherwise just uses the value, without trying to
decompose the operation. Again this can be extended in follow-up
changes.
The main benefit of this limited signed support is proving >=s 0
pre-conditions added in D118799. But even this initial version also
fixes PR53273.
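As an illustrative sketch (hypothetical IR) of the kind of fact the signed system can derive:
```
  define i1 @src(i32 %x, i32 %n) {
  entry:
    %nonneg = icmp sge i32 %x, 0
    %lt     = icmp slt i32 %x, %n
    %both   = and i1 %nonneg, %lt
    br i1 %both, label %then, label %else

  then:                         ; here %x >=s 0 and %x <s %n are known
    %pos = icmp sgt i32 %n, 0   ; provably true, since %n >s %x >=s 0
    ret i1 %pos

  else:
    ret i1 false
  }
```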
Depends on D118799.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D118806
With this patch pre-conditions can be added to a list of constraints.
Constraints with pre-conditions can only be used if all pre-conditions
are satisfied when the constraint is used.
The pre-conditions at the moment are specified as a list of
(Predicate, Value *, Value *) tuples. This allows checking them easily,
like any other condition, using the existing infrastructure.
This then is used to limit GEP decomposition to cases where we can
prove that offsets are signed positive.
This fixes a couple of incorrect transforms where GEP offsets were
assumed to be signed positive, but they were not.
Note that this effectively disables GEP decomposition, as there's no
support for reasoning about signed predicates. D118806 adds initial
signed support.
Fixes PR49624.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D118799
This header is very large (3M lines once expanded) and was included in locations
where DWARF-specific information was not needed.
More specifically, this commit suppresses the dependencies on
llvm/BinaryFormat/Dwarf.h in two headers: llvm/IR/IRBuilder.h and
llvm/IR/DebugInfoMetadata.h. As these headers (esp. the former) are widely used,
this has a decent impact on number of preprocessed lines generated during
compilation of LLVM, as showcased below.
This is achieved by moving some definitions back to the .cpp file, no
performance impact implied[0].
As a consequence of that patch, downstream users may need to manually include some extra
files:
llvm/IR/IRBuilder.h no longer includes llvm/BinaryFormat/Dwarf.h
llvm/IR/DebugInfoMetadata.h no longer includes llvm/BinaryFormat/Dwarf.h
In some situations, code may be relying on the fact that
llvm/BinaryFormat/Dwarf.h was including llvm/ADT/Triple.h; this hidden
dependency now needs to be made explicit.
$ clang++ -E -Iinclude -I../llvm/include ../llvm/lib/Transforms/Scalar/*.cpp -std=c++14 -fno-rtti -fno-exceptions | wc -l
after: 10978519
before: 11245451
Related Discourse thread: https://llvm.discourse.group/t/include-what-you-use-include-cleanup
[0] https://llvm-compile-time-tracker.com/compare.php?from=fa7145dfbf94cb93b1c3e610582c495cb806569b&to=995d3e326ee1d9489145e20762c65465a9caeab4&stat=instructions
Differential Revision: https://reviews.llvm.org/D118781
This makes the statepoint methods in IRBuilder accept a
FunctionCallee, which carries both the callee and function type.
This is used to add the elementtype attribute to the statepoint call.
RS4GC requires an additional tweak to actually preserve that attribute
-- previously the attributes on the call were completely overwritten.
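At the IR level, the callee operand of the statepoint then carries the attribute, roughly like (a sketch modeled on the LangRef statepoint form):
```
  %tok = call token (i64, i32, ptr, i32, i32, ...)
         @llvm.experimental.gc.statepoint.p0(i64 2882400000, i32 0,
                ptr elementtype(void ()) @callee, i32 0, i32 0, i32 0, i32 0)
```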
Differential Revision: https://reviews.llvm.org/D118886
This adds the assertion that all items in the ready list are in-fact scheduleable entities ready to be scheduled. This involves changing the ReadyInsts structure to be a set, and fixing a couple places where we left nodes on the list when they were no longer ready.
Finding the re-materialization chain for a derived pointer does not depend on
the call site. To avoid repeating this work for each call site, it can be extracted into
a separate routine.
Reviewers: reames, dantrushin
Reviewed By: reames
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D118676
The idea here is to have a verify routine we can call during scheduling to ensure broken invariants are reported. The intent is to help in debugging scheduling bugs.
At the moment, only the most basic properties are checked, as adding several invariants I thought held reported failures.
When the main loop is e.g. VF=vscale x 1 and the epilogue VF cannot
be any smaller, the vectorizer should try to estimate how many lanes are
executed at runtime and allow a suitable fixed-width VF to be chosen. It
can use VScaleForTuning to figure out what a suitable fixed-width VF could
be. For the case where the main loop VF is VF=vscale x 1, and VScaleForTuning=8,
it could still choose an epilogue VF up to VF=4.
This was a bit tricky to test, so this patch also introduces a wrapper
function to get 'VScaleForTuning' by also considering vscale_range.
If min and max are equal, then that will be the vscale we compile for.
It makes little sense to tune for a different width if the code
will not be portable for other widths.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D118709
The compiler adds an estimation for the external uses during operand
reordering analysis, which makes it tend to prefer duplicates in the
lanes rather than a diamond/shuffled match in the graph. This changes the sizes of
the vector operands and may prevent some vectorization. We don't need
this kind of estimation for the analysis phase, because we just need to
choose the most compatible instruction; it does not matter if it has an
external user or is used in a non-matching lane. Instead, we count the number
of unique instructions in the lane and see if the reassociation changes
the number of unique scalars to a power of 2 or not. If we have a power
of 2 unique scalars in the lane, it is considered more profitable
than having a non-power-of-2 number of unique scalars.
Metric: SLP.NumVectorInstructions
test-suite :: MultiSource/Benchmarks/FreeBench/distray/distray.test 70.00 86.00 22.9%
test-suite :: External/SPEC/CFP2017rate/544.nab_r/544.nab_r.test 346.00 353.00 2.0%
test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test 346.00 353.00 2.0%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 235.00 239.00 1.7%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 235.00 239.00 1.7%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 8723.00 8834.00 1.3%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 1051.00 1064.00 1.2%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 1628.00 1646.00 1.1%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 1628.00 1646.00 1.1%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 9100.00 9184.00 0.9%
test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 3565.00 3577.00 0.3%
test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 3565.00 3577.00 0.3%
test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 4235.00 4245.00 0.2%
test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 1996.00 1998.00 0.1%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 1671.00 1672.00 0.1%
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 783.00 782.00 -0.1%
test-suite :: SingleSource/Benchmarks/Misc/oourafft.test 69.00 68.00 -1.4%
test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test 207.00 192.00 -7.2%
test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test 207.00 192.00 -7.2%
test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test 89.00 80.00 -10.1%
test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test 89.00 80.00 -10.1%
test-suite :: MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test 260.00 215.00 -17.3%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test 256.00 211.00 -17.6%
MultiSource/Benchmarks/Prolangs-C/TimberWolfMC - pretty much the same.
SingleSource/Benchmarks/Misc/oourafft.test - 2 <2 x > loads replaced by
one <4 x> load.
External/SPEC/CINT2017speed/641.leela_s - function gets vectorized and
not inlined anymore.
External/SPEC/CINT2017rate/541.leela_r - same
External/SPEC/CINT2017rate/531.deepsjeng_r - changed the order in the
multi-block tree; the result is pretty much the same.
External/SPEC/CINT2017speed/631.deepsjeng_s - same.
MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a - the result is the same
as before.
MultiSource/Benchmarks/MiBench/consumer-jpeg - same.
Differential Revision: https://reviews.llvm.org/D116688
Added support for alternate ops vectorization of cmp instructions.
It allows vectorizing either cmp instructions with the same/swapped
predicate but different (swapped) operand kinds, or cmp instructions
with different predicates and compatible operand kinds.
Differential Revision: https://reviews.llvm.org/D115955
Unfortunately, it seems we really do need to take the long route;
start from the "merge" block, find (all the) "dispatch" blocks,
and deal with each "dispatch" block separately, instead of simply
starting from each "dispatch" block as would logically make sense;
otherwise we run into a number of other missing folds around
`switch` formation, missing sinking/hoisting, and phase ordering.
This reverts commit 85628ce75b.
This reverts commit c5fff90953.
This reverts commit 34a98e1046.
This reverts commit 1e353f0922.
Assertion added in f50821cff0 confirms that the DT is indeed nonnull.
Change it to a reference instead of a pointer to make this explicit in
FusionCandidate.
Suggested in D118472.
Added support for alternate ops vectorization of cmp instructions.
It allows vectorizing either cmp instructions with the same/swapped
predicate but different (swapped) operand kinds, or cmp instructions
with different predicates and compatible operand kinds.
Differential Revision: https://reviews.llvm.org/D115955
This is a fix for a use-after-free found by the address sanitizer when
compiling GCC: https://github.com/llvm/llvm-project/issues/52821
The Function Specialization pass may remove instructions, cached
inside the PredicateBase class, which are later being dereferenced
from the SCCPInstVisitor class. To prevent the dangling references
I am lazily deleting the dead instructions after the Solver has run.
Differential Revision: https://reviews.llvm.org/D118591
The current `FoldTwoEntryPHINode()` is not quite designed correctly.
It starts from the merge point, and then tries to detect
the 'divergence' point.
Because of that, it is limited to the simple two-predecessor case,
where the PHI completely goes away. But that is rather pessimistic,
and it doesn't make much sense from the cost model side of things.
For example if there is some other unrelated predecessor of
the merge point, we could split the merge point so that
the then/else blocks first branch to an empty block
and then to the merge point, and then we'd be able to speculate
the then/else code.
But if we'd instead simply start at the divergence point,
and look for the merge point, then we'll just natively support this case.
There's also the fact that `SpeculativelyExecuteBB()` already does
just that, but only if there is a single block to speculate,
and with a much more restrictive cost model.
But that also means we have code duplication.
Now, sadly, while this is as much NFCI as possible,
there is just no way to cleanly migrate to
the proper implementation. The results *are* going to be different
somewhat because of various phase ordering effects and SimplifyCFG
block iteration strategy.
After adding another value kind in 8a12cae862, Value * pointers do not
have enough available empty bits to store the kind (e.g. on ARM).
To address this, the patch replaces the PointerIntPair with separate
value and kind fields.
This patch extends the available-value logic to detect loads
of pointer-selects that can be replaced by a value select.
For example, consider the code below:
loop:
%sel.phi = phi i32* [ %start, %ph ], [ %sel, %loop ]
%l = load %ptr
%l.sel = load %sel.phi
%sel = select cond, %ptr, %sel.phi
...
exit:
%res = load %sel
use(%res)
The load of the pointer phi can be replaced by a load of the start value
outside the loop and a new phi/select chain based on the loaded values,
as illustrated below
%l.start = load %start
loop:
%sel.phi.prom = phi i32 [ %l.start, %ph ], [ %sel.prom, %loop ]
%l = load %ptr
%sel.prom = select cond, %l, %sel.phi.prom
...
exit:
use(%sel.prom)
This is a first step towards allowing vectorizing loops using common libc++
library functions, like std::min_element (https://clang.godbolt.org/z/6czGzzqbs):
#include <vector>
#include <algorithm>
int foo(const std::vector<int> &V) {
return *std::min_element(V.begin(), V.end());
}
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D118143
Based on the output of include-what-you-use.
This is a big chunk of changes. It is very likely to break downstream code
unless they took a lot of care in avoiding hidden header dependencies, something
the LLVM codebase doesn't do that well :-/
I've tried to summarize the biggest changes below:
- llvm/include/llvm-c/Core.h: no longer includes llvm-c/ErrorHandling.h
- llvm/IR/DIBuilder.h no longer includes llvm/IR/DebugInfo.h
- llvm/IR/IRBuilder.h no longer includes llvm/IR/IntrinsicInst.h
- llvm/IR/LLVMRemarkStreamer.h no longer includes llvm/Support/ToolOutputFile.h
- llvm/IR/LegacyPassManager.h no longer includes llvm/Pass.h
- llvm/IR/Type.h no longer includes llvm/ADT/SmallPtrSet.h
- llvm/IR/PassManager.h no longer includes llvm/Pass.h nor llvm/Support/Debug.h
And the usual count of preprocessed lines:
$ clang++ -E -Iinclude -I../llvm/include ../llvm/lib/IR/*.cpp -std=c++14 -fno-rtti -fno-exceptions | wc -l
before: 6400831
after: 6189948
200k fewer lines to process is not that bad ;-)
Discourse thread on the topic: https://llvm.discourse.group/t/include-what-you-use-include-cleanup
Differential Revision: https://reviews.llvm.org/D118652
For some reason we limited the epilogue VF to be fixed-width, but there
is not necessarily a reason for doing so. If the main VF=vscale x 16, the
epilogue VF could be either fixed-width, or a scalable VF up to vscale x 8.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D118688
The code paths analyzed (all constructor invocations of fusion
candidate) pass in a non-null DT.
Adding this assert as requested in D118472 before converting this to a
reference argument.
Cleanup code in peelLoop API. We already have usage of DT without guarding
against a null DT, so this change constant folds the remaining null DT
checks.
Also make the argument a reference so that it is clear the argument is
a nonnull DT.
Extracted from D118472.
setjmp can return twice, but PostDominatorTree is unaware of this. As
such, it overestimates postdominance, leaving some cases (see attached
compiler-rt) where memory does not get untagged on return. This causes
false positives later in the program execution.
This is a crude workaround to unblock use-after-scope for now; in the
longer term, PostDominatorTree should be made aware of returns_twice
functions, as this may cause problems elsewhere.
Reviewed By: eugenis
Differential Revision: https://reviews.llvm.org/D118647
We have an InstCombine rule to remove identical consecutive fences.
We can extend this to remove weaker fences when we have a consecutive
stronger fence.
As stated in the LangRef, a fence with a stronger ordering also implies
ordering weaker than itself: "A fence which has seq_cst ordering, in addition to
having both acquire and release semantics specified above, participates in the
global program order of other seq_cst operations and/or fences."
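A minimal sketch of the intended fold:
```
  ; before
  fence release
  fence seq_cst

  ; after: the seq_cst fence already implies release semantics
  fence seq_cst
```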
Reviewed-By: reames
Differential Revision: https://reviews.llvm.org/D118607
Generalize D99629 for ELF. A default visibility non-local symbol is preemptible
in a -shared link. `isInterposable` is an insufficient condition.
Moreover, a non-preemptible alias may be referenced in a sub constant expression
which intends to lower to a PC-relative relocation. Replacing the alias with a
preemptible aliasee may introduce a linker error.
Respect dso_preemptable and suppress the optimization to fix the above issues. With
the change, `alias = 345` will not be rewritten to use the aliasee in a `-fpic`
compile.
```
int aliasee;
extern int alias __attribute__((alias("aliasee"), visibility("hidden")));
void foo() { alias = 345; } // intended to access the local copy
```
While here, refine the condition for the alias as well.
For some binary formats like COFF, `isInterposable` is a sufficient condition.
But I think canonicalization for the changed case has little advantage, so I
don't bother to add the `Triple(M.getTargetTriple()).isOSBinFormatELF()` or
`getPICLevel/getPIELevel` complexity.
For instrumentations, it's recommended not to create aliases that refer to
globals that have weak linkage or are preemptible. However, the following is
supported and the IR needs to handle such cases.
```
int aliasee __attribute__((weak));
extern int alias __attribute__((alias("aliasee")));
```
There are other places where GlobalAlias isInterposable usage may need to be
fixed.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D107249
Added support for alternate ops vectorization of cmp instructions.
It allows vectorizing either cmp instructions with the same/swapped
predicate but different (swapped) operand kinds, or cmp instructions
with different predicates and compatible operand kinds.
Differential Revision: https://reviews.llvm.org/D115955