The test is a crasher reduced from:
https://llvm.org/PR49993
linearFunctionTestReplace() assumes that we have an add recurrence,
so check for that as a condition of matching a loop counter.
Differential Revision: https://reviews.llvm.org/D101291
This patch simplifies VPSlotTracker by using the recursive traversal
iterator to traverse all blocks in a VPlan in reverse post-order when
numbering VPValues in a plan.
This depends on a fix to RPOT (D100169). It also extends the traversal
unit tests to check RPOT.
Reviewed By: a.elovikov
Differential Revision: https://reviews.llvm.org/D100176
The ModulePassManager should already have taken care of all analysis
invalidation. Without this change, upcoming changes will cause more
invalidation than necessary.
Reviewed By: mtrofin
Differential Revision: https://reviews.llvm.org/D101320
LLVM does not have valid assembly backends for atomicrmw on local memory. However, as this memory is thread local, we should be able to lower this to the relevant load/store.
Differential Revision: https://reviews.llvm.org/D98650
As a follow-up to D95982, this patch continues unblocking optimizations that are blocked by pseudu probe instrumention.
The optimizations unblocked are:
- In-block load propagation.
- In-block dead store elimination
- Memory copy optimization that turns stores to consecutive memories into a memset.
These optimizations are local to a block, so they shouldn't affect the profile quality.
Reviewed By: wmi
Differential Revision: https://reviews.llvm.org/D100075
In LLVM_ENABLE_STATS=0 builds, `llvm::Statistic` maps to `llvm::NoopStatistic`
but has 3 mostly unused pointers. GlobalOpt considers that the pointers can
potentially retain allocated objects, so GlobalOpt cannot optimize out the
`NoopStatistic` variables (see D69428 for more context), wasting 23KiB for stage
2 clang.
This patch makes `NoopStatistic` empty and thus reclaims the wasted space. The
clang size is even smaller than applying D69428 (slightly smaller in both .bss and
.text).
```
# This means the D69428 optimization on clang is mostly nullified by this patch.
HEAD+D69428: size(.bss) = 0x0725a8
HEAD+D101211: size(.bss) = 0x072238
# bloaty - HEAD+D69428 vs HEAD+D101211
# With D101211, we also save a lot of string table space (.rodata).
FILE SIZE VM SIZE
-------------- --------------
-0.0% -32 -0.0% -24 .eh_frame
-0.0% -336 [ = ] 0 .symtab
-0.0% -360 [ = ] 0 .strtab
[ = ] 0 -0.2% -880 .bss
-0.0% -2.11Ki -0.0% -2.11Ki .rodata
-0.0% -2.89Ki -0.0% -2.89Ki .text
-0.0% -5.71Ki -0.0% -5.88Ki TOTAL
```
Note: LoopFuse is a disabled pass. For now this patch adds
`#if LLVM_ENABLE_STATS` so `OptimizationRemarkMissed` is skipped in
LLVM_ENABLE_STATS==0 builds. If these `OptimizationRemarkMissed` are useful in
LLVM_ENABLE_STATS==0 builds, we can replace `llvm::Statistic` with
`llvm::TrackingStatistic`, or use a different abstraction to keep track of the strings.
Similarly, skip the code in `mlir/lib/Pass/PassStatistics.cpp` which
calls `getName`/`getDesc`/`getValue`.
Reviewed By: lattner
Differential Revision: https://reviews.llvm.org/D101211
LLVM does not have valid assembly backends for atomicrmw on local memory. However, as this memory is thread local, we should be able to lower this to the relevant load/store.
Differential Revision: https://reviews.llvm.org/D98650
When replacing a conditional branch by an unconditional one because the targets are identical, transfer the metadata to the new branch instruction.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D101226
In LLVM_ENABLE_STATS=0 builds, `llvm::Statistic` maps to `llvm::NoopStatistic`
but has 3 unused pointers. GlobalOpt considers that the pointers can potentially
retain allocated objects, so GlobalOpt cannot optimize out the `NoopStatistic`
variables (see D69428 for more context), wasting 23KiB for stage 2 clang.
This patch makes `NoopStatistic` empty and thus reclaims the wasted space. The
clang size is even smaller than applying D69428 (slightly smaller in both .bss and
.text).
```
# This means the D69428 optimization on clang is mostly nullified by this patch.
HEAD+D69428: size(.bss) = 0x0725a8
HEAD+D101211: size(.bss) = 0x072238
# bloaty - HEAD+D69428 vs HEAD+D101211
# With D101211, we also save a lot of string table space (.rodata).
FILE SIZE VM SIZE
-------------- --------------
-0.0% -32 -0.0% -24 .eh_frame
-0.0% -336 [ = ] 0 .symtab
-0.0% -360 [ = ] 0 .strtab
[ = ] 0 -0.2% -880 .bss
-0.0% -2.11Ki -0.0% -2.11Ki .rodata
-0.0% -2.89Ki -0.0% -2.89Ki .text
-0.0% -5.71Ki -0.0% -5.88Ki TOTAL
```
Note: LoopFuse is a disabled pass. This patch adds `#if LLVM_ENABLE_STATS` so
`OptimizationRemarkMissed` is skipped in LLVM_ENABLE_STATS==0 builds. If these
`OptimizationRemarkMissed` are useful and not noisy, we can replace
`llvm::Statistic` with `llvm::TrackingStatistic` in the future.
Reviewed By: lattner
Differential Revision: https://reviews.llvm.org/D101211
This applies the D100251 mechanism to the gcov instrumentation pass.
With this patch, `-fno-omit-frame-pointer` in
`clang -fprofile-arcs -O1 -fno-omit-frame-pointer` will be respected for synthesized
`__llvm_gcov_writeout,__llvm_gcov_reset,__llvm_gcov_init` functions: the frame pointer
will be kept (note: on many targets -O1 eliminates the frame pointer by default).
`clang -fno-exceptions -fno-asynchronous-unwind-tables -g -fprofile-arcs` will
produce .debug_frame instead of .eh_frame.
Fix: https://github.com/ClangBuiltLinux/linux/issues/955
Reviewed By: nickdesaulniers
Differential Revision: https://reviews.llvm.org/D101129
When replacing a conditional branch by an unconditional one because the condition is a constant, transfer the metadata to the new branch instruction.
Part of fix for llvm.org/PR50060
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D101141
When transforming a loop terminating condition into a "max" comparison,
the DebugLoc from the old condition should be set on the newly created
comparison. They are the same operation, just optimized. Fixes PR48067.
Differential Revision: https://reviews.llvm.org/D98218
When iterating over const blocks, the base type in the lambdas needs
to use const VPBlockBase *, otherwise it cannot be used with input
iterators over const VPBlockBase.
Also adjust the type of the input iterator range to const &, as it
does not take ownership of the input range.
This patch adds a blocksOnly helpers which take an iterator range
over VPBlockBase * or const VPBlockBase * and returns an interator
range that only include BlockTy blocks. The accesses are casted to
BlockTy.
Reviewed By: a.elovikov
Differential Revision: https://reviews.llvm.org/D101093
This patch updates performSymbolicPredicateInfoEvaluation to manage
registering additional dependencies using ExprResult. Similar to D99987,
this fixes an issues where we failed to track the correct dependency for
a phi-of-ops value, which is marked as temporary.
Fixes PR49873.
Reviewed By: asbirlea, ruiling
Differential Revision: https://reviews.llvm.org/D100560
performSymbolicEvaluation is used to obtain the symbolic expression when
visiting instructions and this is used to determine their congruence
class.
performSymbolicEvaluation only creates expressions for certain
instructions (via createExpression). For unsupported instructions,
'unknown' expression are created.
The use of createExpression in processOutgoingEdges means we may
simplify the condition in processOutgoingEdges to a constant in the
initial round of processing, but we use Unknown(I) for the congruence
class. If an operand of I changes the expression Unknown(I) stays the
same, so there is no update of the congruence class of I. Hence it
won't get re-visited. So if an operand of I changes in a way that causes
createExpression to return different result, this update is missed.
This patch updates the code to use performSymbolicEvaluation, to be
symmetric with the congruence class updating code.
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D99990
For example:
```
int src(unsigned int a, unsigned int b)
{
return __builtin_popcount(a << 16) + __builtin_popcount(b >> 16);
}
int tgt(unsigned int a, unsigned int b)
{
return __builtin_popcount((a << 16) | (b >> 16));
}
```
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D101210
While doing speculative execution opt, it conservatively drops all insn's debug info in the merged `ThenBB`(see the loop at line 2384) including the dangling probe. The missing debug info of the dangling probe will cause the wrong inference computation.
So we should avoid dropping the debug info from pseudo probe, this change try to fix this by moving the to-be dangling probe to the merging target BB before the debug info is dropped.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D101195
Pseudo probe distribution factor is used to scale down profile samples to avoid misleading the counts inference due to the usage of "maximum" in `getBlockWeight`. For callsites, the scaling down can come from code duplication prior to the sample profile loader (prelink or postlink), or due to the indirect call promotion in sample loader inliner. This patch fixes an issue in sample loader ICP where the leftover indirect callsite scaling down causes the loss of non-promoted call target samples unexpectedly. While the scaling down is to favor BFI/BPI with accurate an callsite count, it doesn't fit in the current distribution factor that represents code duplication changes. Ideally, we would need two factors, one is for code duplication, the other is for ICP. However this seems over complicated. I'm going to trade one usage (callsite counts) for the other (call target counts).
Seeing perf win on one benchmark (mcf) of SPEC2017 with others unchanged.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D100993
As discussed in https://llvm.org/PR50096 , we could
convert the 'not' into a 'sub' and see the same
fold. That's because we already have another demanded
bits optimization for 'sub'.
We could add a related transform for
odd-number-of-type-bits, but that seems unlikely
to be practical.
https://alive2.llvm.org/ce/z/TWJZXr
This patch adds a new iterator to traverse through VPRegionBlocks and a
GraphTraits specialization using the iterator to traverse through
VPRegionBlocks.
Because there is already a GraphTraits specialization for VPBlockBase *
and co, a new VPBlockRecursiveTraversalWrapper helper is introduced.
This allows us to provide a new GraphTraits specialization for that
type. Users can use the new recursive traversal by using this wrapper.
The graph trait visits both the entry block of a region, as well as all
its successors. Exit blocks of a region implicitly have their parent
region's successors. This ensures all blocks in a region are visited
before any blocks in a successor region when doing a reverse post-order
traversal of the graph.
Reviewed By: a.elovikov
Differential Revision: https://reviews.llvm.org/D100175
This recommits 4f5da356ff, including
explicit implementations of move a constructor and deleted copy
constructors/assignment operators, to fix failures with some compilers.
This reverts the revert 74854d00e8.
Previous build failures were caused by an error in bitcode reading and
writing for DIArgList metadata, which has been fixed in e5d844b587.
There were also some unnecessary asserts that were being triggered on
certain builds, which have been removed.
This reverts commit dad5caa59e.
If we are using a simplified value, we need to add an extra
dependency this value , because changes to the class of the
simplified value may require us to invalidate any decision based on
that value.
This is done by adding such values as additional users, however the
current code does not excludes temporary instructions.
At the moment, this means that we miss those dependencies for
phi-of-ops, because they are temporary instructions at this point. We
instead need to add the extra dependencies to the root instruction of
the phi-of-ops.
This patch pushes the responsibility of adding extra users to the
callers of createExpression & performSymbolicEvaluation. At those
points, it is clearer which real instruction to pick.
Alternatively we could either pass the 'real' instruction as additional
argument or use another map, but I think the approach in the patch makes
things a bit easier to follow.
Fixes PR35074.
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D99987
Fixes PR47627
This fix suppresses rerolling a loop which has an unrerollable
instruction.
Sample IR for the explanation below:
```
define void @foo([2 x i32]* nocapture %a) {
entry:
br label %loop
loop:
; base instruction
%indvar = phi i64 [ 0, %entry ], [ %indvar.next, %loop ]
; unrerollable instructions
%stptrx = getelementptr inbounds [2 x i32], [2 x i32]* %a, i64 %indvar, i64 0
store i32 999, i32* %stptrx, align 4
; extra simple arithmetic operations, used by root instructions
%plus20 = add nuw nsw i64 %indvar, 20
%plus10 = add nuw nsw i64 %indvar, 10
; root instruction 0
%ldptr0 = getelementptr inbounds [2 x i32], [2 x i32]* %a, i64 %plus20, i64 0
%value0 = load i32, i32* %ldptr0, align 4
%stptr0 = getelementptr inbounds [2 x i32], [2 x i32]* %a, i64 %plus10, i64 0
store i32 %value0, i32* %stptr0, align 4
; root instruction 1
%ldptr1 = getelementptr inbounds [2 x i32], [2 x i32]* %a, i64 %plus20, i64 1
%value1 = load i32, i32* %ldptr1, align 4
%stptr1 = getelementptr inbounds [2 x i32], [2 x i32]* %a, i64 %plus10, i64 1
store i32 %value1, i32* %stptr1, align 4
; loop-increment and latch
%indvar.next = add nuw nsw i64 %indvar, 1
%exitcond = icmp eq i64 %indvar.next, 5
br i1 %exitcond, label %exit, label %loop
exit:
ret void
}
```
In the loop rerolling pass, `%indvar` and `%indvar.next` are appended
to the `LoopIncs` vector in the `LoopReroll::DAGRootTracker::findRoots`
function.
Before this fix, two instructions with `unrerollable instructions`
comment above are marked as `IL_All` at the end of the
`LoopReroll::DAGRootTracker::collectUsedInstructions` function,
as well as instructions with `extra simple arithmetic operations`
comment and `loop-increment and latch` comment. It is incorrect
because `IL_All` means that the instruction should be executed in all
iterations of the rerolled loop but the `store` instruction should
not.
This fix rejects instructions which may have side effects and don't
belong to def-use chains of any root instructions and reductions.
See https://bugs.llvm.org/show_bug.cgi?id=47627 for more information.
This patch is supposed to solve: https://bugs.llvm.org/show_bug.cgi?id=50075
The function `__dfsan_mem_transfer_callback` takes a `Len` argument of type `i64`; however, when processing a `MemTransferInst` such as `llvm.memcpy.p0i8.p0i8.i32`, the `len` argument has type `i32`. In order to make the type of `len` compatible with the one of the callback argument, this change zero-extends it when necessary.
Reviewed By: stephan.yichao.zhao, gbalats
Differential Revision: https://reviews.llvm.org/D101048
Both the alias and aliasee linkage are important.
PR27866 provides some background.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D99629
This change effectively reverts 86664638, but since there have been some changes on top and I wanted to leave the tests in, it's not a mechanical revert.
Why revert this now? Two main reasons:
1) There are continuing discussion around what the semantics of nofree. I am getting increasing uncomfortable with the seeming possibility we might redefine nofree in a way incompatible with these changes.
2) There was a reported miscompile triggered by this change (https://github.com/emscripten-core/emscripten/issues/9443). At first, I was making good progress on tracking down the issues exposed and those issues appeared to be unrelated latent bugs. Now that we've found at least one bug in the original change, and the investigation has stalled, I'm no longer comfortable leaving this in tree. In retrospect, I probably should have reverted this earlier and investigated the issues once the triggering change was out of tree.
The first version of origin tracking tracks only memory stores. Although
this is sufficient for understanding correct flows, it is hard to figure
out where an undefined value is read from. To find reading undefined values,
we still have to do a reverse binary search from the last store in the chain
with printing and logging at possible code paths. This is
quite inefficient.
Tracking memory load instructions can help this case. The main issues of
tracking loads are performance and code size overheads.
With tracking only stores, the code size overhead is 38%,
memory overhead is 1x, and cpu overhead is 3x. In practice #load is much
larger than #store, so both code size and cpu overhead increases. The
first blocker is code size overhead: link fails if we inline tracking
loads. The workaround is using external function calls to propagate
metadata. This is also the workaround ASan uses. The cpu overhead
is ~10x. This is a trade off between debuggability and performance,
and will be used only when debugging cases that tracking only stores
is not enough.
Reviewed By: gbalats
Differential Revision: https://reviews.llvm.org/D100967
We can skip check for undefs trying to find perfect/shuffled tree
entries matching, they can be ignored completely improving the final
cost/vectorization results.
Differential Revision: https://reviews.llvm.org/D101061
This commit fixes a bug where the loop vectoriser fails to predicate
loads/stores when interleaving for targets that support masked
loads and stores.
Code such as:
1 void foo(int *restrict data1, int *restrict data2)
2 {
3 int counter = 1024;
4 while (counter--)
5 if (data1[counter] > data2[counter])
6 data1[counter] = data2[counter];
7 }
... could previously be transformed in such a way that the predicated
store implied by:
if (data1[counter] > data2[counter])
data1[counter] = data2[counter];
... was lost, resulting in miscompiles.
This bug was causing some tests in llvm-test-suite to fail when built
for SVE.
Differential Revision: https://reviews.llvm.org/D99569
1. No need to call `areAllUsersVectorized` as later the cost is
calculated only if the instruction has one use and gets vectorized.
2. Need to calculate the cost of the dead extractelement more precisely,
taking the vector type of the vector operand, not the resulting
vector type.
Part of D57059.
Differential Revision: https://reviews.llvm.org/D99980
In quite a few cases in LoopVectorize.cpp we call createStepForVF
with a step value of 0, which leads to unnecessary generation of
llvm.vscale intrinsic calls. I've optimised IRBuilder::CreateVScale
and createStepForVF to return 0 when attempting to multiply
vscale by 0.
Differential Revision: https://reviews.llvm.org/D100763
This patch allows PRE of the following type of loads:
```
preheader:
br label %loop
loop:
br i1 ..., label %merge, label %clobber
clobber:
call foo() // Clobbers %p
br label %merge
merge:
...
br i1 ..., label %loop, label %exit
```
Into
```
preheader:
%x0 = load %p
br label %loop
loop:
%x.pre = phi(x0, x2)
br i1 ..., label %merge, label %clobber
clobber:
call foo() // Clobbers %p
%x1 = load %p
br label %merge
merge:
x2 = phi(x.pre, x1)
...
br i1 ..., label %loop, label %exit
```
So instead of loading from %p on every iteration, we load only when the actual clobber happens.
The typical pattern which it is trying to address is: hot loop, with all code inlined and
provably having no side effects, and some side-effecting calls on cold path.
The worst overhead from it is, if we always take clobber block, we make 1 more load
overall (in preheader). It only matters if loop has very few iteration. If clobber block is not taken
at least once, the transform is neutral or profitable.
There are several improvements prospect open up:
- We can sometimes be smarter in loop-exiting blocks via split of critical edges;
- If we have block frequency info, we can handle multiple clobbers. The only obstacle now is that
we don't know if their sum is colder than the header.
Differential Revision: https://reviews.llvm.org/D99926
Reviewed By: reames
Summary: The original logic seems to be we could collecting a CoroBegin
if one of the terminators could be dominated by one of coro.destroy,
which doesn't make sense.
This patch rewrites the logics to collect CoroBegin if all of
terminators are dominated by one coro.destroy. If there is no such
coro.destroy, we would call hasEscapePath to evaluate if we should
collect it.
Test Plan: check-llvm
Reviewed by: lxfind
Differential Revision: https://reviews.llvm.org/D100614
This revision simplifies Clang codegen for parallel regions in OpenMP GPU target offloading and corresponding changes in libomptarget: SPMD/non-SPMD parallel calls are unified under a single `kmpc_parallel_51` runtime entry point for parallel regions (which will be commonized between target, host-side parallel regions), data sharing is internalized to the runtime. Tests have been auto-generated using `update_cc_test_checks.py`. Also, the revision contains changes to OpenMPOpt for remark creation on target offloading regions.
Reviewed By: jdoerfert, Meinersbur
Differential Revision: https://reviews.llvm.org/D95976
On ELF targets, if a function has uwtable or personality, or does not have
nounwind (`needsUnwindTableEntry`), it marks that `.eh_frame` is needed in the module.
Then, a function gets `.eh_frame` if `needsUnwindTableEntry` or `-g[123]` is specified.
(i.e. If -g[123], every function gets `.eh_frame`.
This behavior is strange but that is the status quo on GCC and Clang.)
Let's take asan as an example. Other sanitizers are similar.
`asan.module_[cd]tor` has no attribute. `needsUnwindTableEntry` returns true,
so every function gets `.eh_frame` if `-g[123]` is specified.
This is the root cause that
`-fno-exceptions -fno-asynchronous-unwind-tables -g` produces .debug_frame
while
`-fno-exceptions -fno-asynchronous-unwind-tables -g -fsanitize=address` produces .eh_frame.
This patch
* sets the nounwind attribute on sanitizer module ctor/dtor.
* let Clang emit a module flag metadata "uwtable" for -fasynchronous-unwind-tables. If "uwtable" is set, sanitizer module ctor/dtor additionally get the uwtable attribute.
The "uwtable" mechanism is generic: synthesized functions not cloned/specialized
from existing ones should consider `Function::createWithDefaultAttr` instead of
`Function::create` if they want to get some default attributes which
have more of module semantics.
Other candidates: "frame-pointer" (https://github.com/ClangBuiltLinux/linux/issues/955https://github.com/ClangBuiltLinux/linux/issues/1238), dso_local, etc.
Differential Revision: https://reviews.llvm.org/D100251
This makes the memcpy-memcpy and memcpy-memset optimizations work for
variable sizes as long as they are equal, relaxing the old restriction
that they are constant integers. If they're not equal, the old
requirement that they are constant integers with certain size
restrictions is used.
The implementation works by pushing the length tests further down in the
code, which reveals some places where it's enough that the lengths are
equal (but not necessarily constant).
Differential Revision: https://reviews.llvm.org/D100870
Trying to evaluate a GEP would assert with
"Ty == cast<PointerType>(C->getType()->getScalarType())->getElementType()"
because the type of the pointer we would evaluate the GEP argument to
would be a different type than the GEP was expecting. We should treat
pointer stripping as a bitcast.
The test adds a redundant GEP that would crash due to type mismatch.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D100970
Discovered during attributor testing comparing stats with
and without the attributor. Willreturn should not be inferred
for nonexact definitions.
Differential Revision: https://reviews.llvm.org/D100988
Fix for PR49984
This was discovered during Attributor testing.
Memset was always created with alignment of 1
and in case when strncpy alignment was changed
it triggered an assertion in the AttrBuilder.
Memset will now be created with appropriate alignment.
Differential Revision: https://reviews.llvm.org/D100875
CommandLine.h is indirectly included in ~50% of TUs when building
clang, and VirtualFileSystem.h is large.
(Already remarked by jhenderson on D70769.)
No behavior change.
Differential Revision: https://reviews.llvm.org/D100957
FunctionAnalysisManagerCGSCCProxy should not be preserved if any of its
keys may be invalid. Since we are not removing/adding functions in
FuncAttrs, it's fine to preserve it.
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D100893
This reverts commit 13ec913bdf.
This commit introduces new uses of the overflow checking intrinsics that
depend on implementations in compiler-rt, which Windows users generally
do not link against. I filed an issue (somewhere) to make clang
auto-link the builtins library to resolve this situation, but until that
happens, it isn't reasonable for the optimizer to introduce new link
time dependencies.
It used to be that all of our intrinsics were call instructions, but over time, we've added more and more invokable intrinsics. According to the verifier, we're up to 8 right now. As IntrinsicInst is a sub-class of CallInst, this puts us in an awkward spot where the idiomatic means to check for intrinsic has a false negative if the intrinsic is invoked.
This change switches IntrinsicInst from being a sub-class of CallInst to being a subclass of CallBase. This allows invoked intrinsics to be instances of IntrinsicInst, at the cost of requiring a few more casts to CallInst in places where the intrinsic really is known to be a call, not an invoke.
After this lands and has baked for a couple days, planned cleanups:
Make GCStatepointInst a IntrinsicInst subclass.
Merge intrinsic handling in InstCombine and use idiomatic visitIntrinsicInst entry point for InstVisitor.
Do the same in SelectionDAG.
Do the same in FastISEL.
Differential Revision: https://reviews.llvm.org/D99976
This is a more convoluted form of the same pattern "sext of NSW trunc",
but in this case the operand of trunc was a right-shift,
and the truncation chops off just the zero bits that were shifted-in.
Summary:
This patch registers OpenMPOpt as a Module pass in addition to a CGSCC
pass. This is so certain optimzations that are sensitive to intact
call-sites can happen before inlining. The old `openmpopt` pass name is
changed to `openmp-opt-cgscc` and `openmp-opt` calls the Module pass.
The current module pass only runs a single check but will be expanded in
the future.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D99202
SLP supports perfect diamond matching for the vectorized tree entries
but do not support it for gathered entries and does not support
non-perfect (shuffled) matching with 1 or 2 tree entries. Patch adds
support for this matching to improve cost of the vectorized tree.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D100495
This fixes a subtle and nasty bug in my 86664638. The problem is that free(nullptr) is well defined (and common).
The specification for the nofree attributes talks about memory objects, and doesn't explicitly address null, but I think it's reasonable to assume that nofree doesn't disallow a call to free(nullptr). If it did, we'd have to prove nonnull on an argument to ever infer nofree which doesn't seem to be the intent.
This was found by Nuno and Alive2 over in https://reviews.llvm.org/D100141#2697374.
Differential Revision: https://reviews.llvm.org/D100779
SLP supports perfect diamond matching for the vectorized tree entries
but do not support it for gathered entries and does not support
non-perfect (shuffled) matching with 1 or 2 tree entries. Patch adds
support for this matching to improve cost of the vectorized tree.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D100495
SLP supports perfect diamond matching for the vectorized tree entries
but do not support it for gathered entries and does not support
non-perfect (shuffled) matching with 1 or 2 tree entries. Patch adds
support for this matching to improve cost of the vectorized tree.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D100495
Rather than maintaining two separate values, a `float` for the per-lane
cost and a Width for the VF, maintain a single VectorizationFactor which
comprises the two and also removes the need for converting an integer value
to float.
This simplifies the query when asking if one VF is more profitable than
another when we want to extend this for scalable vectors (which may
require additional options to determine if e.g. a scalable VF of the
some cost, is more profitable than a fixed VF of the same cost).
The patch isn't entirely NFC because it also fixes an issue in
selectEpilogueVectorizationFactor, where the cost passed to ProfitableVFs
no longer truncates the floating-point cost from `float` to `unsigned` to
then perform the calculation on the truncated cost. It now does
a cost comparison with the correct precision.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D100121
This patch is related to https://reviews.llvm.org/D100032 which define
some illegal types or operations for x86_amx. There are no arguments,
arrays, pointers, vectors or constants of x86_amx.
Reviewed By: pengfei
Differential Revision: https://reviews.llvm.org/D100472
Previously we would use the type of the pointee to determine what to
cast the result of constant folding a load. To aid with opaque pointer
types, we should explicitly pass the type of the load rather than
looking at pointee types.
ConstantFoldLoadThroughBitcast() converts the const prop'd value to the
proper load type (e.g. [1 x i32] -> i32). Instead of calling this in
every intermediate step like bitcasts, we only call this when we
actually see the global initializer value.
In some existing uses of this API, we don't know the exact type we're
loading from immediately (e.g. first we visit a bitcast, then we visit
the load using the bitcast). In those cases we have to manually call
ConstantFoldLoadThroughBitcast() when simplifying the load to make sure
that we cast to the proper type.
Reviewed By: dblaikie
Differential Revision: https://reviews.llvm.org/D100718
This patch improves https://reviews.llvm.org/D76971 (Deduce attributes for aligned_alloc in InstCombine) and implements "TODO" item mentioned in the review of that patch.
> The function aligned_alloc() is the same as memalign(), except for the added restriction that size should be a multiple of alignment.
Currently, we simply bail out if we see a non-constant size - change that.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D100785
The unnamedaddr property of a function is lost when using
`-fwhole-program-vtables` and thinlto which causes size increase under linker's
safe icf mode.
The size increase of chrome on Linux when switching from all icf to safe icf
drops from 5 MB to 3 MB after this change, and from 6 MB to 4 MB on Windows.
There is a repro:
```
# a.h
struct A {
virtual int f();
virtual int g();
};
# a.cpp
#include "a.h"
int A::f() { return 10; }
int A::g() { return 10; }
# main.cpp
#include "a.h"
int g(A* a) {
return a->f();
}
int main(int argv, char** args) {
A a;
return g(&a);
}
$ clang++ -O2 -ffunction-sections -flto=thin -fwhole-program-vtables -fsplit-lto-unit -c main.cpp -o main.o && clang++ -Wl,--icf=safe -fuse-ld=lld -flto=thin main.o -o a.out && llvm-readobj -t a.out | grep -A 1 -e _ZN1A1fEv -e _ZN1A1gEv
Name: _ZN1A1fEv (480)
Value: 0x201830
--
Name: _ZN1A1gEv (490)
Value: 0x201840
```
Differential Revision: https://reviews.llvm.org/D100498
SLP supports perfect diamond matching for the vectorized tree entries
but do not support it for gathered entries and does not support
non-perfect (shuffled) matching with 1 or 2 tree entries. Patch adds
support for this matching to improve cost of the vectorized tree.
Differential Revision: https://reviews.llvm.org/D100495
This is mostly stylistic cleanup after D100226, but not entirely. When skimming the code, I found one case where we weren't accounting for attributes on the callsite at all. I'm also suspicious we had some latent bugs related to operand bundles (which are supposed to be able to *override* attributes on declarations), but I don't have concrete test cases for those, just suspicions.
Aside: The only case left in the file which directly checks attributes on the declaration is the norecurse logic. I left that because I didn't understand it; it looks obviously wrong, so I suspect I'm misinterpreting the intended semantics of the attribute.
Differential Revision: https://reviews.llvm.org/D100689
This change fixes a latent bug which was exposed by a change currently in review (https://reviews.llvm.org/D99802#2685032).
The story on this is a bit involved. Without this change, what ended up happening with the pending review was that we'd strip attributes off intrinsics, and then selectiondag would fail to lower the intrinsic. Why? Because the lowering of the intrinsic relies on the presence of the readonly attribute. We don't have a matcher to select the case where there's a glue node needed.
Now, on the surface, this still seems like a codegen bug. However, here it gets fun. I was unable to reproduce this with a standalone test at all, and was pretty much struck until skatkov provided the critical detail. This reproduces only when RS4GC and codegen are run in the same process and context. Why? Because it turns out we can't roundtrip the stripped attribute through serialized IR!
We'll happily print out the missing attribute, but when we parse it back, the auto-upgrade logic has a side effect of blindly overwriting attributes on intrinsics with those specified in Intrinsics.td. This makes it impossible to exercise SelectionDAG from a standalone test case.
At this point, I decided to treat this an RS4GC bug as a) we don't need to strip in this case, and b) I could write a test which shows the correct behavior to ensure this doesn't break again in the future.
As an aside, I'd originally set out to handle libfuncs too - since in theory they might have the same issues - but backed away quickly when I realized how the semantics of builtin, nobuiltin, and no-builtin-x all interacted. I'm utterly convinced that no part of the optimizer handles that correctly, and decided not to open that can of worms here.
During store promotion, we check whether the pointer was captured
to exclude potential reads from other threads. However, we're only
interested in captures before or inside the loop. Check this using
PointerMayBeCapturedBefore against the loop header.
Differential Revision: https://reviews.llvm.org/D100706
I guess this case hasn't come up thus far, and i'm not sure if it can
really happen for the existing usages, thus no test in *this* commit.
But, the following commit adds test coverage,
there we'd expirience a crash without this fix.
Currently, InsertNoopCastOfTo() would implicitly insert that cast,
but now that we have SCEVPtrToIntExpr, i'm hoping we could stop
InsertNoopCastOfTo() from doing that. But first all users must be fixed.
Move the findDbg* functions into lib/IR/DebugInfo.cpp from
lib/Transforms/Utils/Local.cpp.
D99169 adds a call to a function (findDbgUsers) that lives in
lib/Transforms/Utils/Local.cpp (LLVMTransformUtils) from lib/IR/Value.cpp
(LLVMCore). The Core lib doesn't include TransformUtils. The builtbots caught
this here: https://lab.llvm.org/buildbot/#/builders/109/builds/12664. This patch
moves the function, and the 3 similar ones for consistency, into DebugInfo.cpp
which is part of LLVMCore.
Reviewed By: dblaikie, rnk
Differential Revision: https://reviews.llvm.org/D100632
Recently processMinMaxIntrinsic has been added and we started to observe a number of analysis get invalidated after CVP. The problem is CVP conservatively returns 'true' even if there were no modifications to IR. I found one more place besides processMinMaxIntrinsic which has the same problem. I think processMinMaxIntrinsic and similar should better have boolean return status to prevent similar issue reappear in future.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D100538
This reverts commit fa6b54c44a.
The commited patch broke mlir tests. It seems that mlir tests depend on coroutine function properties set in CoroEarly pass.
Presplit coroutines cannot be inlined. During AlwaysInliner we check if a function is a presplit coroutine, if so we skip inlining.
The presplit coroutine attributes are set in CoroEarly pass.
However in O0 pipeline, AlwaysInliner runs before CoroEarly, so the attribute isn't set yet and will still inline the coroutine.
This causes Clang to crash: https://bugs.llvm.org/show_bug.cgi?id=49920
To fix this, we set the attributes in the Clang front-end instead of in CoroEarly pass.
Reviewed By: rjmccall, ChuanqiXu
Differential Revision: https://reviews.llvm.org/D100282