Don't outline machine instructions which are using jump table indexes
since they are materialized as local labels (like the already handled
case of constant pools).
Reviewed By: paquette
Differential Revision: https://reviews.llvm.org/D109436
This moves one mid-size function out of line, inlines the
trivial tcAnd/tcOr/tcXor/tcComplement methods into their only
caller, and moves the magic/umagic functions into SelectionDAG
since they are implementation details of its algorithm. This
also removes the unit tests for magic, but these are already
tested in the divide lowering logic for various targets.
This also upgrades some C style comments to C++.
Differential Revision: https://reviews.llvm.org/D109476
Similar to other code which handles creating the function frame.
If LR isn't live-in to the block that we're inserting the call into, we'll get
a MachineVerifier error.
This is a port of a combine which matches a pattern where a wide type scalar
value is stored by several narrow stores. It folds it into a single store or
a BSWAP and a store if the targets supports it.
Assuming little endian target:
i8 *p = ...
i32 val = ...
p[0] = (val >> 0) & 0xFF;
p[1] = (val >> 8) & 0xFF;
p[2] = (val >> 16) & 0xFF;
p[3] = (val >> 24) & 0xFF;
=>
*((i32)p) = val;
On CTMark AArch64 -Os this results in a good amount of savings:
Program before after diff
SPASS 412792 412788 -0.0%
kc 432528 432512 -0.0%
lencod 430112 430096 -0.0%
consumer-typeset 419156 419128 -0.0%
bullet 475840 475752 -0.0%
tramp3d-v4 367760 367628 -0.0%
clamscan 383388 383204 -0.0%
pairlocalalign 249764 249476 -0.1%
7zip-benchmark 570100 568860 -0.2%
sqlite3 287628 286920 -0.2%
Geomean difference -0.1%
Differential Revision: https://reviews.llvm.org/D109419
We were returning a tuple when all but one caller only cared about one piece of the return value. That one caller can inline the complexity, and we can simplify all other uses.
Previously the CodeExtractor created exit stubs, and the subsequent return value of the outlined function based on the order of out-of-region blocks after splitting any phi nodes, and collecting the blocks to be outlined. This could cause differences in order if there was a difference of exit block phi nodes between the two regions. This patch moves the collection of the output target blocks to be before this occurs, so that the assignment of target block to output value will be the same, regardless of the contents of the output block.
Reviewers: paquette, roelofs
Differential Revision: https://reviews.llvm.org/D108657
Since i128 isn't a legal C type on RV32, I don't believe
libgcc implements these functions for RV32. compiler-rt
does implement them because i128 support is enabled
in order to handle long double.
This is consistent with 32-bit X86 and ARM.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D109383
Users of delinearization assume that the the offset into the array element is zero. In most cases it will indeed be zero, but if it is not, the delinearization has to fail since it violates that assumption without the API even allowing to signal to the caller that the by offset is non-zero.
This bug caused Polly to miscompile blender (526.blender_r from SPEC CPU 2017) in -polly-process-unprofitable mode. The SCEV expression incorrectly delinearized has been reduced in the test case byte_offset.ll. The dropped offset into the array element of size 4 (a float) is ((sext i32 %mul7.i4534 to i64) + {(sext i32 %i1 to i64),+,((sext i32 (1 + ((1 + %shl.i.i) * (1 + %shl.i.i)) + %shl.i.i) to i64) * (sext i32 %i1 to i64))}<%for.body703>). This significant component was just dropped, and the wrong pointer was computed when regenerating code from the remaining delinearized subscripts. This occurred during blender's subsurface scattering implementation. As a result, blender's rendering diverged from the reference image.
Patch D108885 would also fix the API.
Reviewed By: bmahjour
Differential Revision: https://reviews.llvm.org/D109133
Since we might end up using multiple threads when logging information in the DWARFTransformer, the handleDie() method must use the supplied stream named "OS" when logging warnings and errors. When we use multiple threads, we log to a thread specific stream buffer and then use a mutex to ensure our output doesn't overlap when we emit warnings and errors after a thread is done.
Differential Revision: https://reviews.llvm.org/D109401
Make the following changes in order to support opaque pointers in SROA:
* Generate i8 GEPs for opaque pointers.
* Explicitly enforce that promotable allocas only have stores of
the alloca type -- previously this was implicitly enforced.
* Replace a check for pointer element type with load/store type.
Differential Revision: https://reviews.llvm.org/D109259
The implementation is mostly copied from MemDepAnalysis. We want to look
at all loads and stores to the same pointer operand. Bitcasts and zero
GEPs of a pointer are considered the same pointer value. We choose the
most dominating instruction.
Since updating MemorySSA with invariant.group is non-trivial, for now
handling of invariant.group is not cached in any way, so it's part of
the walker. The number of loads/stores with invariant.group is small for
now anyway. We can revisit if this actually noticeably affects compile
times.
To avoid invariant.group affecting optimized uses, we need to have
optimizeUsesInBlock() not use invariant.group in any way.
Co-authored-by: Piotr Padlewski <prazek@google.com>
Reviewed By: asbirlea, nikic, Prazek
Differential Revision: https://reviews.llvm.org/D109134
None of this logic has anything to do with SCEV's internals, it just uses the existing public APIs. As a result, we can move the code from ScalarEvolution.cpp/hpp to Delinearization.cpp/hpp with only minor changes.
This was discussed in advance on today's loop opt call. It turned out to be easy as hoped.
integer 0/1 for the operand of bundle "clang.arc.attachedcall"
https://reviews.llvm.org/D102996 changes the operand of bundle
"clang.arc.attachedcall". This patch makes changes to llvm that are
needed to handle the new IR.
This should make it easier to understand what the IR is doing and also
simplify some of the passes as they no longer have to translate the
integer values to the runtime functions.
Differential Revision: https://reviews.llvm.org/D103000
D104143 introduced canonical value numbering between regions, which allows for the easy identification of items across a region, eliminating the need in the outliner to create parallel lists of instructions for each region, and replace output values in a less convoluted way.
Additionally, in a future commit, the output values will not necessarily be recorded values from the region itself, it could be a combination value where the actual value being output is a PHINode instead. This new method allows us to handle the replacement of the output value to the stored value with the corresponding item in the same place for both normal output values, and PHINode outputs instead of handling the different types of outputs in different locations.
Reviewers: paquette, roelofs
Differential Revision: https://reviews.llvm.org/D108656
This patch changes SPMDization to not trigger for regions with no
parallelism. Otherwise, this will introduce unnecessary barriers that
will slow the single-threaded region down.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D109438
Otherwise we end up with an extra conditional jump, following by an
unconditional jump off the end of a function. ie.
bb.0:
BT32rr ..
JCC_1 %bb.4 ...
bb.1:
BT32rr ..
JCC_1 %bb.2 ...
JMP_1 %bb.3
bb.2:
...
bb.3.unreachable:
bb.4:
...
Should be equivalent to:
bb.0:
BT32rr ..
JCC_1 %bb.4 ...
JMP_1 %bb.2
bb.1:
bb.2:
...
bb.3.unreachable:
bb.4:
...
This can occur since at the higher level IR (Instruction) SwitchInsts
are required to have BBs for default destinations, even when it can be
deduced that such BBs are unreachable.
For most programs, this isn't an issue, just wasted instructions since the
unreachable has been statically proven.
The x86_64 Linux kernel when built with CONFIG_LTO_CLANG_THIN=y fails to
boot though once D106056 is re-applied. D106056 makes it more likely
that correlation-propagation (CVP) can deduce that the default case of
SwitchInsts are unreachable. The x86_64 kernel uses a binary post
processor called objtool, which emits this warning:
vmlinux.o: warning: objtool: cfg80211_edmg_chandef_valid()+0x169: can't
find jump dest instruction at .text.cfg80211_edmg_chandef_valid+0x17b
I haven't debugged precisely why this causes a failure at boot time, but
fixing this very obvious jump off the end of the function fixes the
warning and boot problem.
Link: https://bugs.llvm.org/show_bug.cgi?id=50080
Fixes: https://github.com/ClangBuiltLinux/linux/issues/679
Fixes: https://github.com/ClangBuiltLinux/linux/issues/1440
Reviewed By: hans
Differential Revision: https://reviews.llvm.org/D109103
This should have been the 4 byte version in the first place. Unfortunatelly there is no easy way to add a test as both the 1 byte and 4 byte version are printed as 'jmp' in the assembly code.
Reviewed By: kda
Differential Revision: https://reviews.llvm.org/D109453
This is consistent with the RVV intrinsic patterns. This has been
shown to prevent some "ran out of registers" errors in our internal
testing.
Unfortunately, there are some regressions on LMUL=8 tests in here.
I think the lack of registers with LMUL=8 just makes it very hard
to schedule correctly.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D109245
Currently, we only deal with the case where we can match
the number of low bits to be kept, i.e.:
```
x & ((1 << y) - 1)
```
will extract low `y` bits of `x`.
But what will
```
x & (-1 >> y)
```
do?
Logically, it will extract `bitwidth(x) - y` low bits, i.e.:
```
x & ~(-1 << (bitwidth(x)-y))
```
... except we can't do such a transformation in IR in general,
because if we wanted to extract all the bits `(-1 >> 0)` is fine,
but `-1 << bitwidth(x)` would be `poison`: https://alive2.llvm.org/ce/z/BKJZfw,
Yet, here with BMI's BEXTR and BMI2's BZHI we don't have any such problems with edge-cases.
So what we can do is: https://alive2.llvm.org/ce/z/gm5M2B
As briefly discussed with @craig.topper, this appears to be not worse than what we'd end up with currently (a pair of shifts):
* https://godbolt.org/z/nsPb8bejs (direct data dependency, sequential execution)
* https://godbolt.org/z/7bj3zeh1d (no direct data dependency, parallel execution)
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D107923
The expansion of these pseudos creates ADD instructions. Those
ADDs modify a GPR so that it is no longer contains the same value
as the input base pointer. Therefore, I believe we should have a
GPR as a Def on these instructions and expansion should get the
destination register for the ADDs from that operand.
At least in our tests here this works out so that register
scavenging picks the same register as the base pointer.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D109405
When we start outlining across branches, there is the possibility that we will have two different blocks with different output locations, or a single branch that goes to two blocks outside of the region that is being outlined. While the CodeExtractor provides most of the mechanisms by using the return value of the extracted function as the input to a switch statement to correctly branch to the correct location, we need special handling for different output schemas to each location.
This is done by repeating the existing storing scheme for each different exit block. We have a map from the return values used, to the basic block that is used to store the outputs for that particular exit block within the outlined function. Then if needed, we create a switch statement for each return block to branch to the correct set of stored outputs.
Reviewers: paquette
Differential Revision: https://reviews.llvm.org/D106993
The call to emitCodeAlignment was missing a STI which is required
after D45962.
emitCodeAlignment has a default parameter of 0 for MaxBytesToEmit.
Explicitly passing 0 here was interpreted as as nullptr for the STI.
This could possibly be avoided by taking STI as a const reference in
emitCodeAlignment.
Differential Revision: https://reviews.llvm.org/D109425
When handling register spill for indirect debug value LiveDebugValues pass doesn't add
DW_OP_deref operator which may in some cases cause debugger to return value address, instead
of value while machine register holding that address is spilled.
Differential revision: https://reviews.llvm.org/D109142
This patch fixes a variety of crashes resulting from the `MemCpyOptPass`
casting `TypeSize` to a constant integer, whether implicitly or
explicitly.
Since the `MemsetRanges` requires a constant size to work, all but one
of the fixes in this patch simply involve skipping the various
optimizations for scalable types as cleanly as possible.
The optimization of `byval` parameters, however, has been updated to
work on scalable types in theory. In practice, this optimization is only
valid when the length of the `memcpy` is known to be larger than the
scalable type size, which is currently never the case. This could
perhaps be done in the future using the `vscale_range` attribute.
Some implicit casts have been left as they were, under the knowledge
they are only called on aggregate types. These should never be
scalably-sized.
Reviewed By: nikic, tra
Differential Revision: https://reviews.llvm.org/D109329
This patch extends the preliminary support for vector-predicated (VP)
operation legalization to include promotion of illegal integer vector
types.
Integer promotion of binary VP operations is relatively simple and
piggy-backs on the non-VP logic, but passing the two extra mask and VP
operands through to the promoted operation.
Tests have been added to the RISC-V target to cover the basic scenarios
for integer promotion for both fixed- and scalable-vector types.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D108288
This patch implements extract_subvector for predicate types when
the input type is more than twice the size of the subvector that
is being extracted.
Reviewed By: CarolineConcatto
Differential Revision: https://reviews.llvm.org/D109314
This improvement adds "assume" after removal of branch basing on UB in successor block.
Consider the following example:
```
pred:
x = ...
cond = x > 10
br cond, bb, other.succ
bb:
phi [nullptr, pred], ... // other possible preds
load(phi) // UB if we came from pred
other.succ:
// here we know that x <= 10, but this knowledge is lost
// after the branch is turned to unconditional unless we
// preserve it with assume.
```
If we remove the branch basing on knowledge about UB in a successor block,
then the fact that x <= 10 is other.succ might be lost if this condition is
not inferrable from any dominating condition. To preserve this knowledge, we
can add assume intrinsic with (possibly inverted) branch condition.
Patch by Dmitry Bakunevich!
Differential Revision: https://reviews.llvm.org/D109054
Reviewed By: lebedev.ri
The alignment of vector variable arguments in callee side is 4, which is
aligned with MSVC. But the caller aligns them to the size of vector
arguments. It results in run fails. This patch fixes this problem by
trimming it to 4 bytes for variable arguments on Win32.
Fixed vector arguments are passed by pointer on Win32. So they don't have
the problem.
I don't find a doc in MSDN for this calling conversion, so I did several
experiments here: https://godbolt.org/z/n1zn1Gx1z
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D108887
The basic problem being solved is that we largely give up when encountering a trip count involving an IV which is not an addrec. We will fall back to the brute force constant eval, but that doesn't have the information about the fact that we can't cycle back through the same set of values.
There's a high level design question of whether this is the right place to handle this, and if not, where that place is. The major alternative here would be to return a conservative upper bound, and then rely on two invocations of indvars to add the facts to the narrow IV, and then reconstruct SCEV. (I have not implemented the alternative and am not 100% sure this would work out.) That's arguably more in line with existing code, but I find this substantially easier to reason about. During review, no one expressed a strong opinion, so we went with this one.
Differential Revision: D108651
Both Wasm & Emscripten SjLj handling has a restriction that `setjmp`
cannot be called indirectly. I thought we have been erroring out on
indirect uses of `setjmp`, but some recent CL disrupted the logic and
we are not erroring out anymore.
We currently
1. Collect functions that contain `setjmp` calls in `SetjmpUsers`. This
only counts direct calls:
8f77dc459e/llvm/lib/Target/WebAssembly/WebAssemblyLowerEmscriptenEHSjLj.cpp (L869-L878)
2. Run `runSjLjOnFunction` only on those `SetjmpUsers`. Within
`runSjLjOnFunction`, if we see an indirect use of `setjmp`, we error
out:
8f77dc459e/llvm/lib/Target/WebAssembly/WebAssemblyLowerEmscriptenEHSjLj.cpp (L1218-L1221)
So if there are only indirect setjmp calls within the module,
`SetjmpUsers` will be empty, and `runSjLjOnFunction` is not even entered
once. And the indirect `setjmp` call will error out at link time. So in
this CL we check for the indirect uses of `setjmp` upfront before we
enter `runSjLjOnFunction`.
Also this currently errors out on `invoke @setjmp`, which can only occur
when using Wasm EH + Wasm SjLj within a function. We recently added Wasm
SjLj support but we don't support using Wasm EH + Wasm SjLj in the same
function yet. We plan to add this support very soon, so I don't think
it's worth creating another test file just for this. (This is an error
test so it needs its own file)
Reviewed By: dschuff
Differential Revision: https://reviews.llvm.org/D109375
Many `flang` tests currently `FAIL` on Solaris because the module files
aren't found. I could trace this to `sys::fs::getMainExecutable` not being
implemented.
This patch does this and fixes all affected `flang` tests.
Tested on `amd64-pc-solaris2.11`.
Differential Revision: https://reviews.llvm.org/D109374
Follow on to D109029. I realized we had no mention of mustprogrress in the comment (as it prexisted mustprogress in the codebase). In the process of adding it, I tweaked the preconditions into something I think is more clear. Note that mustprogress is checked in the code.
Differential Revision: https://reviews.llvm.org/D109091
This could go either direction since the instruction
count is the same either way, but there are a few
reasons to prefer this:
1. We already do the related transform with 'and'
(see just above the new code).
2. We try (too hard) to compensate for not having this
and possibly other folds in transformZExtICmp(),
and that leads to bugs like https://llvm.org/PR51762 .
3. Codegen looks better across a variety of targets.
https://alive2.llvm.org/ce/z/uEgn4P
This patch adds improvements for sext/zext of a vector extract in Global
ISel.
For example, this piece of code:
define i64 @si64(<4 x i32> %0, i32 %1) {
%3 = extractelement <4 x i32> %0, i64 1
%s = sext i32 %3 to i64
ret i64 %s
}
Used to have this lowering:
si64:
mov s0, v0.s[1]
fmov w8, s0
sxtw x0, w8
ret
Whereas this patch makes it lower to this:
si64:
smov x0, v0.h[0]
ret
Differential Revision: https://reviews.llvm.org/D108137
Index is 0 when the return value has the returned attribute. But the
return value cannot have the returned attribute, so the check is
pointless.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D109334
On X86, the stackprobe emission code chooses the `R11D` register, which
is illegal on i686. This ends up wrapping around to `EBX`, which does
not get properly callee-saved within the stack probing prologue,
clobbering the register for the callers.
We fix this by explicitly using `EAX` as the stack probe register.
Reviewed By: pengfei
Differential Revision: https://reviews.llvm.org/D109203
Functions can have a personality function, as well as prefix and
prologue data as additional operands. Unused operands are assigned
a dummy value of i1* null. This patch addresses multiple issues in
use-list order preservation for these:
* Fix verify-uselistorder to also enumerate the dummy values.
This means that now use-list order values of these values are
shuffled even if there is no other mention of i1* null in the
module. This results in failures of Assembler/call-arg-is-callee.ll,
Assembler/opaque-ptr.ll and Bitcode/use-list-order2.ll.
* The use-list order prediction in ValueEnumerator does not take
into account the fact that a global may use a value more than
once and leaves uses in the same global effectively unordered.
We should be comparing the operand number here, as we do for
the more general case.
* While we enumerate all operands of a function together (which
seems sensible to me), the bitcode reader would first resolve
prefix data for all function, then prologue data for all
functions, then personality functions for all functions. Change
this to resolve all operands for a given function together
instead.
Differential Revision: https://reviews.llvm.org/D109282
Support opaque pointers in SymbolicallyEvaluateGEP() by using the
value type of a GlobalValue base or falling back to i8 if there
isn't one. We don't unconditionally generate i8 GEPs here because
that would lose inrange attribues, and because some optimizations
on globals currently rely on GEP types (e.g. the globals SROA
mentioned in the comment).
Differential Revision: https://reviews.llvm.org/D109297
Copying IR during linking causes a type mismatch due to the field being missing in IRMover/Valuemapper. Adds the full range of typed attributes including elementtype attribute in the copy functions.
Patch by Chenyang Liu
Differential Revision: https://reviews.llvm.org/D108796
We're trying to get the parameter index of sret and see if it's part of
a function's varargs.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D109335
Similar to D108842, D108844, and D108926.
__has_builtin(builtin_mul_overflow) returns true for 32b x86 targets,
but Clang is deferring to compiler RT when encountering long long types.
This breaks ARCH=i386 + CONFIG_BLK_DEV_NBD=y builds of the Linux kernel
that are using builtin_mul_overflow with these types for these targets.
If the semantics of __has_builtin mean "the compiler resolves these,
always" then we shouldn't conditionally emit a libcall.
This will still need to be worked around in the Linux kernel in order to
continue to support these builds of the Linux kernel for this
target with older releases of clang.
Link: https://bugs.llvm.org/show_bug.cgi?id=28629
Link: https://bugs.llvm.org/show_bug.cgi?id=35922
Link: https://github.com/ClangBuiltLinux/linux/issues/1438
Reviewed By: lebedev.ri, RKSimon
Differential Revision: https://reviews.llvm.org/D108928
Using the similarity found from the IRSimilarity Identifier, we take regions with structural similarity, and deduplicate them into a separate function. The Code Extractor is able to provide most of this functionality.
For simplicity, we start by only outlining regions with a single entry and single exit branch, this reduces the complexity in handling phi nodes outside the region, and handling many sets of outputs for each of the different exit blocks.
Reviewer: paquette
Differential Revision: https://reviews.llvm.org/D106990
This patch adds a fix to do early if conversion to select when
conditional branch not using physical register to prevent the crash when
expanding ISEL instruction.
Reviewed By: lei, kamaub, PowerPC
Differential revision: https://reviews.llvm.org/D108302
It doesn't appear to be possible to generate this from tests atm, but it matches what we do in sse12_ord_cmp
Fixes unused template arg identified in D109359
This is exposed by enabling FastIsel on 64bit AIX.
We are generating XSRSP regardless of the arch,
which may be wrong when -mcpu=pwr7.
The fix is to guard the generation in P8 only.
Reviewed By: qiucf
Differential Revision: https://reviews.llvm.org/D109365
Using PUNPKLO/HI instead of ZIP1/ZIP2, because that avoids
having to generate a predicate with all lanes inactive (PFALSE).
Reviewed By: CarolineConcatto
Differential Revision: https://reviews.llvm.org/D109312
On some architectures such as Arm and X86 the encoding for a nop may
change depending on the subtarget in operation at the time of
encoding. This change replaces the per module MCSubtargetInfo retained
by the targets AsmBackend in favour of passing through the local
MCSubtargetInfo in operation at the time.
On Arm using the architectural NOP instruction can have a performance
benefit on some implementations.
For Arm I've deleted the copy of the AsmBackend's MCSubtargetInfo to
limit the chances of this causing problems in the future. I've not
done this for other targets such as X86 as there is more frequent use
of the MCSubtargetInfo and it looks to be for stable properties that
we would not expect to vary per function.
This change required threading STI through MCNopsFragment and
MCBoundaryAlignFragment.
I've attempted to take into account the in tree experimental backends.
Differential Revision: https://reviews.llvm.org/D45962
In preparation for passing the MCSubtargetInfo (STI) through to writeNops
so that it can use the STI in operation at the time, we need to record the
STI in operation when a MCAlignFragment may write nops as padding. The
STI is currently unused, a further patch will pass it through to
writeNops.
There are many places that can create an MCAlignFragment, in most cases
we can find out the STI in operation at the time. In a few places this
isn't possible as we are in initialisation or finalisation, or are
emitting constant pools. When possible I've tried to find the most
appropriate existing fragment to obtain the STI from, when none is
available use the per module STI.
For constant pools we don't actually need to use EmitCodeAlign as the
constant pools are data anyway so falling through into it via an
executable NOP is no better than falling through into data padding.
This is a prerequisite for D45962 which uses the STI to emit the
appropriate NOP for the STI. Which can differ per fragment.
Note that involves an interface change to InitSections. It is now
called initSections and requires a SubtargetInfo as a parameter.
Differential Revision: https://reviews.llvm.org/D45961
Legalizing G_MUL for non-standard types (like i33) generated an error. Putting
minScalar and maxScalar instead of clampScalar. Also using new rule, instead
of widening to the next power of 2, widen to the next multiple of the passed
argument (32 in this case), so instead of widening i65 to i128, we widen it to
i96.
Patch by: Mateja Marjanovic
Differential Revision: https://reviews.llvm.org/D109228
Add implementation for the legalization of G_ROTL and G_ROTR machine
instructions. They are very similar to funnel shift instructions, the only
difference is funnel shifts have 3 operands, whereas rotate instructions have
two operands, the first being the register that is being rotated and the second
being the number of shifts. The legalization of G_ROTL/G_ROTR is just lowering
them into funnel shift instructions if they are legal.
Patch by: Mateja Marjanovic
Differential Revision: https://reviews.llvm.org/D105347
Add KnownBits handling and unit tests for X*X self-multiplication cases which guarantee that bit1 of their results will be zero - see PR48683.
https://alive2.llvm.org/ce/z/NN_eaR
The next step will be to add suitable test coverage so this can be enabled in ValueTracking/DAG/GlobalISel - currently only a single Analysis/ScalarEvolution test is affected.
Differential Revision: https://reviews.llvm.org/D108992
Legalize G_MEMCPY, G_MEMMOVE, G_MEMSET and G_MEMCPY_INLINE.
Corresponding intrinsics are replaced by a loop that uses loads/stores in
AMDGPULowerIntrinsics pass unless their length is a constant lower then
MemIntrinsicExpandSizeThresholdOpt (default 1024). Any G_MEM* instruction that
reaches legalizer should have a const length argument and should be expanded
into appropriate number of loads + stores.
Differential Revision: https://reviews.llvm.org/D108357
This patch adds support for the vector-predicated `VP_STORE` and
`VP_LOAD` nodes. We do this in the same way we lower `MSTORE` and
`MLOAD`: to regular load/store instructions via intrinsics.
One necessary change was made to `SelectionDAGLegalize` so that
`VP_STORE` nodes' operation actions are taken from the stored "value"
operands, in the same vein as `STORE` or `MSTORE`.
Reviewed By: craig.topper, rogfer01
Differential Revision: https://reviews.llvm.org/D108999
This patch adds support for the `VP_SCATTER` and `VP_GATHER` nodes by
lowering them to RVV's `vsox`/`vlux` instructions, respectively. This
process is almost identical to the existing `MSCATTER`/`MGATHER` support.
One extra change was made to `SelectionDAGLegalize` so that
`VP_SCATTER`'s operation action is derived from its stored "value"
operand rather than its return type (which is always the chain).
Reviewed By: craig.topper, rogfer01
Differential Revision: https://reviews.llvm.org/D108987
When expanding pseudo insts, in order to create a new machine instr, we use BuildMI,
which will add implicit operands by default. And transferImpOps will also copy implicit
operands from old ones. Finally, duplicate implicit operands are added to the same inst.
Sometimes this can cause correctness issues. Like below inst,
renamable $w18 = nsw SUBSWrr renamable $w30, renamable $w14, implicit-def dead $nzcv
After expanding, it will become
$w18 = SUBSWrs renamable $w13, renamable $w14, 0, implicit-def $nzcv, implicit-def dead $nzcv
A redundant implicit-def $nzcv is added, but the dead flag is missing.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D109069
This is a more general version of D109273. Though it doesn't
peek through bitcasts or rearange broadcasts.
Reviewed By: LuoYuanke
Differential Revision: https://reviews.llvm.org/D109295
Due to a typo, this replaced %x with umax(C1, umin(C2, %x + C3))
rather than umax(C1, umin(C2, %x)). This didn't make a difference
for the existing tests, because the result is only used for range
calculation, and %x will usually have an unknown starting range,
and the additional offset keeps it unknown. However, if %x already
has a known range, we may compute a result range that is too
small.
The current IRSimilarityIdentifier does not try to find similarity across blocks, this patch provides a mechanism to compare two branches against one another, to find similarity across basic blocks, rather than just within them.
This adds a step in the similarity identification process that labels all of the basic blocks so that we can identify the relative branching locations. Within an IRSimilarityCandidate we use these relative locations to determine whether if the branching to other relative locations in the same region is the same between branches. If they are, we consider them similar.
We do not consider the relative location of the branch if the target branch is outside of the region. In this case, both branches must exit to a location outside the region, but the exact relative location does not matter.
Reviewers: paquette, yroux
Differential Revision: https://reviews.llvm.org/D106989
In case of a virtual register tied to a phys-def, the register class needs to
be computed. Make sure that this works generally also with fast regalloc by
using TLI.getRegClassFor() whenever possible, and make only the case of
'Untyped' use getMinimalPhysRegClass().
Fixes https://bugs.llvm.org/show_bug.cgi?id=51699.
Review: Ulrich Weigand
Differential Revision: https://reviews.llvm.org/D109291
I'm not sure if there is a better way or another bug
still here, but this is enough to avoid the loop from:
https://llvm.org/PR51657
The test requires multiple blocks and datalayout to
trigger the problem path.
This transform should be updated to use better
variable names and code comments. It could
also create the shift-of-shift directly instead
of relying on another combine for that.
This transform is written in a confusing style,
and I suspect it is at fault for a more serious
bug noted in PR51567.
But it's been around forever, so I'm making the
minimal change to fix another bug - it could
increase instructions because it was not checking
uses.
FeaturePMU was created in AArch64 to accommodate one missing system
register, PMMIR_EL1, in commit ffcd7698ae.
However, the Performance Monitors extension already had a target
feature, which is called FeaturePerfMon. Therefore, FeaturePMU is
redundant.
This patch removes FeaturePMU and merges its contents into
FeaturePerfMon.
Reviewed By: dnsampaio
Differential Revision: https://reviews.llvm.org/D109246
The isEssentiallyExtractHighSubvector function currently calls
getVectorNumElements on a type that in specific cases might be scalable.
Since this function only has correct behaviour at the moment on scalable
types anyway, the function can just return false when given a fixed type.
Differential Revision: https://reviews.llvm.org/D109163
If the vector is a splat of some scalar value, findScalarElement()
can simply return the scalar value if it knows the requested lane
is in the vector.
This is only needed for scalable vectors, because the InsertElement/ShuffleVector
case is already handled explicitly for the fixed-width case.
This helps to recognize an InstCombine fold like:
extractelt(bitcast(splat(%v))) -> bitcast(%v)
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D107254
d8faf03807 implemented general-regs-only for X86 by disabling all features
with vector instructions. But the CRC32 instruction in SSE4.2 ISA, which uses
only GPRs, also becomes unavailable. This patch adds a CRC32 feature for this
instruction and allows it to be used with general-regs-only.
Reviewed By: pengfei
Differential Revision: https://reviews.llvm.org/D105462
Currently, we use SExtValue to decide whether to invert tbz or tbnz.
However, for the case zext (xor x, c), we should use ZExt rather
than SExt otherwise we will generate totally opposite branches.
Reviewed By: paquette
Differential Revision: https://reviews.llvm.org/D108755
Given a select_cc producing a constant and a invertion of the constant
for a comparison more than zero, we can produce an xor with ashr
instead, which produces smaller code. The ashr either sets all bits or
clear all bits depending on if the value is negative. This is then xor'd
with the constant to optionally negate the value.
https://alive2.llvm.org/ce/z/DTFaBZ
This includes a OneUseCheck on the Cmp, which seems to make thinks a
little worse and will be removed in a followup.
Differential Revision: https://reviews.llvm.org/D109149
Pulled out of D109149, this folds set_cc seteq (ashr X, BW-1), -1 ->
set_cc setlt X, 0 to prevent some regressions later on when folding
select_cc setgt X, -1, C, ~C -> xor (ashr X, BW-1), C
Differential Revision: https://reviews.llvm.org/D109214
Recommit of 707ce34b06. Don't introduce a
dependency to the LLVMPasses component, instead register the required
passes individually.
Add methods for loop unrolling to the OpenMPIRBuilder class and use them in Clang if `-fopenmp-enable-irbuilder` is enabled. The unrolling methods are:
* `unrollLoopFull`
* `unrollLoopPartial`
* `unrollLoopHeuristic`
`unrollLoopPartial` and `unrollLoopHeuristic` can use compiler heuristics to automatically determine the unroll factor. If possible, that is if no CanonicalLoopInfo is required to pass to another method, metadata for LLVM's LoopUnrollPass is added. Otherwise the unroll factor is determined using the same heurstics as user by LoopUnrollPass. Not requiring a CanonicalLoopInfo, especially with `unrollLoopHeuristic` allows greater flexibility.
With full unrolling and partial unrolling with known unroll factor, instead of duplicating instructions by the OpenMPIRBuilder, the full unroll is still delegated to the LoopUnrollPass. In case of partial unrolling the loop is first tiled using the existing `tileLoops` methods, then the inner loop fully unrolled using the same mechanism.
Reviewed By: jdoerfert, kiranchandramohan
Differential Revision: https://reviews.llvm.org/D107764
The temporary object was used as a workaround when the target parser may
change STI. D14346 made the MCSubtargetInfo argument to
createMCAsmParser const, so we no longer need the temporary object.
As part of the nontrivial unswitching we could end up removing child
loops. This patch add a notification to the pass manager when
that happens (using the markLoopAsDeleted callback).
Without this there could be stale LoopAccessAnalysis results cached
in the analysis manager. Those analysis results are cached based on
a Loop* as key. Since the BumpPtrAllocator used to allocate
Loop objects could be resetted between different runs of for
example the loop-distribute pass (running on different functions),
a new Loop object could be created using the same Loop pointer.
And then when requiring the LoopAccessAnalysis for the loop we
got the stale (corrupt) result from the destroyed loop.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D109257
The current inconsistency confuse contributors which coding guidlines to follow.
It would be better to have it consistent using clang-format tool.
Reviewed By: mhjacobson
Differential Revision: https://reviews.llvm.org/D109270
The xmm variant have half the throughput (and +1cy latency) of the mmx variants, but are still 1uop.
I still need to do more thorough testing of SLM on test-suite before fixing the obvious bad numbers for WritePMULLD.
But this helps the D103695 helper script get to more accurate numbers for vXi32 multiplies of extended operands (i.e. we can use PMADDWD, PMULLW/PMULHW etc). Matches what Intel AoM / Agner / llvm-exegesis reports.
The xmm variant have half the throughput (and +1cy latency) of the mmx variants, but are still 1uop.
I still need to do more thorough testing of SLM on test-suite before fixing the obvious bad numbers for WritePMULLD.
But this helps the D103695 helper script get to more accurate numbers for vXi32 multiplies of extended operands (i.e. we can use PMADDWD, PMULLW/PMULHW etc). Matches what Intel AoM / Agner / llvm-exegesis reports.
For RMW instructions, the load and store hold the MEC for an extra cycle, but within the same single uop. This is alluded to in the Intel AOM:
"The MEC also owns the MEC RSV, which is responsible for scheduling of all loads and stores. Load and
store instructions go through addresses generation phase in program order to avoid on-the-fly memory
ordering later in the pipeline. Therefore, an unknown address will stall younger memory instructions."
Noticed while trying to get a cheap SLM test box up and running with llvm-exegesis - RMW arithmetic is always 1uop - and matches what Agner / InstLatX64 report as well.
These were all set to the same best case mul i32 values (which seems to be the only version of MUL that SLM actually performs well with).
Noticed while trying to improve multiplication costs for vectorization via the D103695 helper script. Confirmed with Intel AoM / Agner / InstLatX64.
The change here is basically the same as in D108880: Rather than
looking at bitcasts, look at calls and their function type. We
still need to look through bitcasts to find those calls.
The change in llvm/test/CodeGen/WebAssembly/add-prototypes-conflict.ll
is due to different visitation order. add-prototypes-opaque-ptrs.ll
is a copy of add-prototypes.ll with -force-opaque-pointers.
Differential Revision: https://reviews.llvm.org/D109256
This does add some extra superfluous whitespace (eg: "int *") intended
to make the Simplified Template Names work easier - this makes the
DIE-based names match more exactly the clang-generated names, so it's
easier to identify cases that don't generate matching names.
(arguably we could change clang to skip that whitespace or add some
fuzzy matching to accommodate differences in certain whitespace - but
this seemed easier and fairly low-impact)
As an extension to D107866, this adds store(fptosisat(..)) patterns,
similar to the existing fptosi patterns, to prevent unnecessarily moving
into gpr regs where we can use fp stores directly.
Differential Revision: https://reviews.llvm.org/D108378
This extends D107865 to the VFP insructions, lowering llvm.fptosi.sat
and llvm.fptoui.sat to VCVT instructions that inherently perform the
saturate.
Differential Revision: https://reviews.llvm.org/D107866
This patch changes the register class to avoid accidentally setting
the AVL operand to X0 through MachineIR optimizations.
There are cases where we really want to use X0, but we can't get that
past the MachineVerifier with the register class as GPRNoX0. So I've
use a 64-bit -1 as a sentinel for X0. All other immediate values should
be uimm5. I convert it to X0 at the earliest possible point in the VSETVLI
insertion pass to avoid touching the rest of the algorithm. In
SelectionDAG lowering I'm using a -1 TargetConstant to hide it from
instruction selection and treat it differently than if the user
used -1. A user -1 should be selected to a register since it doesn't
fit in uimm5.
This is the rest of the changes started in D109110. As mentioned there,
I don't have a failing test from MachineIR optimizations anymore.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D109116
As noticed in https://reviews.llvm.org/D105688, it would be great to move
handling of ICmpInst which was in canProveExitOnFirstIteration() to
getValueOnFirstIteration().
Patch by Dmitry Makogon!
Differential Revision: https://reviews.llvm.org/D108978
Reviewed By: reames
This allows constructing a MemDesc from a MachineMemoryOperand, a pattern that starts to show up more frequently.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D109161
SLM PBLENDVB is just as bad as BLENDVPD/PS - so model it as such, fixing the rr vs rm uops diff as well. The Intel AoM appears to have a copy+paste typo with PBLENDW, it doesn't match Agner or InstLatX64.
Noticed while investigating some of the weird discrepancies reported by the D103695 helper script (SLM had much better vector shift throughputs than it should).
This reapplies 71d7fed3bc which was
reverted by 3e2bd82f02. This change
includes the fix for breaking the sanitizer bots.
As seen in https://bugs.llvm.org/show_bug.cgi?id=48880 the current
implementation for parsing grouped short options can return unclear
error messages. This change fixes the example given in the ticket in
which a flag is incorrectly given an argument. Also when parsing a
group we now keep reading past the first incorrect option and output
errors for all incorrect options in the group.
Differential Revision: https://reviews.llvm.org/D108770
This is important as with exceptions enabled, non-POD allocas often have
two lifetime ends: the exception handler, and the normal one.
Reviewed By: eugenis
Differential Revision: https://reviews.llvm.org/D108365
Set up basic infrastructure for 64-bit ARM architecture support in JITLink. It allows for loading a minimal object file and resolving a single relocation. Advanced features like GOT and PLT handling or relaxations were intentionally left out for the moment.
This patch follows the idea to keep implementations for ARM (32-bit) and Aaarch64 (64-bit) separate, because:
* it might be easier to share code with the MachO "arm64" JITLink backend
* LLVM has individual targets for ARM and Aaarch64 as well
Reviewed By: lhames
Differential Revision: https://reviews.llvm.org/D108986
A small subset of the NEON instruction set is legal in streaming mode.
This patch adds support for the following vector to integer move
instructions:
0x00 1110 0000 0001 0010 11xx xxxx xxxx # SMOV W|Xd,Vn.B[0]
0x00 1110 0000 0010 0010 11xx xxxx xxxx # SMOV W|Xd,Vn.H[0]
0100 1110 0000 0100 0010 11xx xxxx xxxx # SMOV Xd,Vn.S[0]
0000 1110 0000 0001 0011 11xx xxxx xxxx # UMOV Wd,Vn.B[0]
0000 1110 0000 0010 0011 11xx xxxx xxxx # UMOV Wd,Vn.H[0]
0000 1110 0000 0100 0011 11xx xxxx xxxx # UMOV Wd,Vn.S[0]
0100 1110 0000 1000 0011 11xx xxxx xxxx # UMOV Xd,Vn.D[0]
Only the zero index variants are legal, all others indexes are illegal.
To support this, new instructions are defined specifically for zero
index which is hardcoded, along an implicit 'VectorIndex0' operand.
Since the index operand is implicit and takes no bits in the encoding,
custom decoding is required to add the operand.
I'm not sure if this is the best approach but the predicate constraint
on a subset of an operand is unusual. Would be interested to hear some
alternatives.
The instructions are predicated on 'HasNEONorStreamingSVE', i.e. they're
enabled by either +neon or +streaming-sve. This follows on from the work
in D106272 to support the subset of SVE(2) instructions that are legal
in streaming mode.
Depends on D107902.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D107903
This patch basically enables fast-isel for AIX 64-bit subtarget
(previously enabled only for ELF 64). The initial motivation is to
introduce branch folding to AIX generated code for correct debug
behavior. I also saw some compiling time improvement in a few LLVM
test-suite benchmarks. (toast, dbms, cjpeg, burg, etc.)
Reviewed By: jsji
Differential Revision: https://reviews.llvm.org/D98844
Add support for ordered directive in the OpenMPIRBuilder.
This patch also modidies clang to use the ordered directive when the
option -fopenmp-enable-irbuilder is enabled.
Also fix one ICE when parsing one canonical for loop with the relational
operator LE or GE in openmp region by replacing unary increment
operation of the expression of the variable "Expr A" minus the variable
"Expr B" (++(Expr A - Expr B)) with binary addition operation of the
experssion of the variable "Expr A" minus the variable "Expr B" and the
expression with constant value "1" (Expr A - Expr B + "1").
Reviewed By: Meinersbur, kiranchandramohan
Differential Revision: https://reviews.llvm.org/D107430
The attached testcase crashes without the patch (Not the same accesses
in the same order).
When we move instructions before another instruction, we also need to
update the memory accesses corresponding to it.
Reviewed-By: asbirlea
Differential Revision: https://reviews.llvm.org/D109197
verifyFunction can be really slow on large functions. This can significantly slow down compilation in production.
Given that coroutine passes are fairly stable now, we should only run it in debug mode.
Differential Revision: https://reviews.llvm.org/D109198
When pre-inliner decision is used for CSSPGO, we should take that into account for ThinLTO importing as well, so post-link sample loader inliner can favor that decision. This is handled by a small tweak in this patch. It also includes a change to transfer preinliner decision when merging context.
Differential Revision: https://reviews.llvm.org/D109088
This ISD node/wrapper represents am address which is relative to a base
address and therefore lowers to `i32.const` rather than `global.get`.
Use this wrapper type for TLS-relative addresses, paving the way for the
non-REL wrapper to be used to external TLS address once those are
supported.
Differential Revision: https://reviews.llvm.org/D109179
This patch adds support for unrolling inner loops using epilogue unrolling. The basic issue is that the original latch exit block of the inner loop could be outside the outer loop. When we clone the inner loop and split the latch exit, the cloned blocks need to be in the outer loop.
Differential Revision: https://reviews.llvm.org/D108476
All ExecutorProcessControl subclasses must provide a JITLinkMemoryManager object
that can be used to allocate memory in the executor process. The
EPCGenericJITLinkMemoryManager class provides an off-the-shelf
JITLinkMemoryManager implementation for JITs that do not need (or cannot
provide) a specialized JITLinkMemoryManager implementation. This simplifies the
process of creating new ExecutorProcessControl implementations.
This adds the following combines:
```
x = ... 0 or 1
c = icmp eq x, 1
->
c = x
```
and
```
x = ... 0 or 1
c = icmp ne x, 0
->
c = x
```
When the target's true value for the relevant types is 1.
This showed up in the following situation:
https://godbolt.org/z/M5jKexWTW
SDAG currently supports the `ne` case, but not the `eq` case. This can probably
be further generalized, but I don't feel like thinking that hard right now.
This gives some minor code size improvements across the board on CTMark at
-Os for AArch64. (0.1% for 7zip and pairlocalalign in particular.)
Differential Revision: https://reviews.llvm.org/D109130
Fixing this link error:
ld: error: relocation R_X86_64_PC32 cannot be used against symbol __asan_report_load...; recompile with -fPIC
Reviewed By: vitalybuka
Differential Revision: https://reviews.llvm.org/D109183
When preinliner is used for CSSPGO, we try to honor global preinliner decision as much as we can except for uninlinable callees. We rely on InlineCost::Never to prevent us from illegal inlining.
However, it turns out that we use InlineCost::Never for both illeagle inlining and some of the "not-so-beneficial" inlining.
The most common one is recursive inlining, while it can bloat size a lot during CGSCC bottom-up inlining, it's less of a problem when recursive inlining is guided by profile and done in top-down manner.
Ideally it'd be better to have a clear separation between inline legality check vs cost-benefit check, but that requires a bigger change.
This change enables InlineCost computation to allow inlining recursive calls, controlled by InlineParams. In SampleLoader, we now enable recursive inlining for CSSPGO when global preinliner decision is used.
With this change, we saw a few perf improvements on SPEC2017 with CSSPGO and preinliner on: 2% for povray_r, 6% for xalancbmk_s, 3% omnetpp_s, while size is about the same (no noticeable perf change for all other benchmarks)
Differential Revision: https://reviews.llvm.org/D109104
This is a followup to D104662 to generate slightly nicer code for
pointer overflow checks. Bypass expandAddToGEP and instead
explicitly generate i8 GEPs. This saves some bitcasts and negates
the value in a more obvious way. In particular, this prevents SCEV
from looking through the umul.with.overflow, same as in the integer
case.
The wrapping-pointer-ni.ll test deserves a comment: Previously,
this generated a typed GEP which used the umulo argument rather
than the multiplication result. This results in more compact IR in
that case, but effectively does the multiplication twice, the
second one is just hidden in the GEP. Reusing the umulo result
seems pretty reasonable to me.
Differential Revision: https://reviews.llvm.org/D109093
This add support for SjLj using Wasm exception handling instructions:
https://github.com/WebAssembly/exception-handling/blob/master/proposals/exception-handling/Exceptions.md
This does not yet support the mixed use of EH and SjLj within a
function. It will be added in a follow-up CL.
This currently passes all SjLj Emscripten tests for wasm0/1/2/3/s,
except for the below:
- `test_longjmp_standalone`: Uses Node
- `test_dlfcn_longjmp`: Uses NodeRAWFS
- `test_longjmp_throw`: Mixes EH and SjLj
- `test_exceptions_longjmp1`: Mixes EH and SjLj
- `test_exceptions_longjmp2`: Mixes EH and SjLj
- `test_exceptions_longjmp3`: Mixes EH and SjLj
Reviewed By: dschuff, tlively
Differential Revision: https://reviews.llvm.org/D108960
Similar to D108842 and D108844.
__has_builtin(builtin_mul_overflow) returns true for 32b MIPS targets,
but Clang is deferring to compiler RT when encountering long long types.
This breaks MIPS malta_defconfig builds of the Linux kernel that are
using __builtin_mul_overflow with these types for these targets.
If the semantics of __has_builtin mean "the compiler resolves these,
always" then we shouldn't conditionally emit a libcall.
This will still need to be worked around in the Linux kernel in order to
continue to support malta_defconfig builds of the Linux kernel for this
target with older releases of clang.
Link: https://bugs.llvm.org/show_bug.cgi?id=28629
Link: https://github.com/ClangBuiltLinux/linux/issues/1438
Reviewed By: rengolin
Differential Revision: https://reviews.llvm.org/D108926
This patch introduces four new string attributes: function-inline-cost,
function-inline-threshold, call-inline-cost and call-threshold-bonus.
These attributes allow you to selectively override some aspects of
InlineCost analysis. That would allow us to test inliner separately from
the InlineCost analysis.
That could be useful when you're trying to write tests for inliner and
you need to test some very specific situation, like "the inline cost has
to be this high", or "the threshold has to be this low". Right now every
time someone does that, they have get creative to come up with a way to
make the InlineCost give them the number they need (like adding ~30
load/add pairs for a trivial test). This process can be somewhat tedious
which can discourage some people from writing enough tests for their
changes. Also, that results in tests that are fragile and can be easily
broken without anyone noticing it because the test writer can't
explicitly control what input the inliner will get from the inline cost
analysis.
These new attributes will alleviate those problems to an extent.
Reviewed By: mtrofin
Differential Revision: https://reviews.llvm.org/D109033
https://reviews.llvm.org/D56686 was supposed to allow these to
work on Windows without needing to enable the xsave feature to
match MSVC. It seems this didn't work because the backend isel
patterns would still block it.
This patch removes the predicates from the isel patterns.
Fixes PR51706.
Reviewed By: pengfei
Differential Revision: https://reviews.llvm.org/D109097
PMADDWD(v8i16 x, v8i16 y) == (v4i32) { (int)x[0]*y[0] + (int)x[1]*y[1], ..., (int)x[6]*y[6] + (int)x[7]*y[7] }
Currently combineMulToPMADDWD only folds cases where the upper 17 bits of both vXi32 inputs are known zero (i.e. the first half is positive and the second half of the pair is zero in each 2xi16 pair), this can be relaxed to only require one zero-extended input if the other input has at least 17 sign bits.
That way the sign of the result is still preserved, and the second half is still zero.
Noticed while investigating PR47437.
Differential Revision: https://reviews.llvm.org/D108522
When pre-inliner decision is used for CSSPGO, we should take that into account for ThinLTO importing as well, so post-link sample loader inliner can favor that decision. This is handled by a small tweak in this patch. It also includes a change to transfer preinliner decision when merging context.
Differential Revision: https://reviews.llvm.org/D109088
When lowering a fixed length gather/scatter the index type is assumed to
be the same as the memory type, this is incorrect in cases where the
extension of the index has been folded into the addressing mode.
For now add a temporary workaround to fix the codegen faults caused by
this by preventing the removal of this extension. At a later date the
lowering for SVE gather/scatters will be redesigned to improve the way
addressing modes are handled.
As a short term side effect of this change, the addressing modes
generated for fixed length gather/scatters will not be optimal.
Differential Revision: https://reviews.llvm.org/D109145
If a sext_inreg is up for isel, and all its users are W instructions,
we can skip emitting the sext_inreg. This helpful if the producing
instruction can't become a W instruction.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D108966
X0 has special meaning for vsetvli, we need to make sure we never
create it a vsetvli that uses it by accident. This could happen
if the register coalescer coalesces a copy from X0 into this
instruction.
This patch splits the instruction so that we can have GPRNoX0
register class to use for the cases where we don't want the source
to be X0. The verifier won't let us explicitly use X0 on a GPRNoX0
operand so we need a separate pseudo for those cases.
I don't currently have a failing example for this. There was a
failure in D107957, but the coalescable copy from that example
should have been optimized away much earlier so I've fixed that.
This is not a complete fix. We still need to prevent the same
possible issue on the AVL operand of all of the vector instruction
pseudos. I don't want to make two versions of all of those so we
need to find a different solution for those. I have an idea I'm
going to try.
Differential Revision: https://reviews.llvm.org/D109110
Extend SILoadStoreOptimizer to merge into DWORDX8 variant of S_BUFFER_LOAD.
Merging into DWORDX2 and DWORDX4 variants is handled already.
Differential Revision: https://reviews.llvm.org/D108909
The semantics of tail predication loops means that the value of LR as an
instruction is executed determines the predicate. In other words:
mov r3, #3
DLSTP lr, r3 // Start tail predication, lr==3
VADD.s32 q0, q1, q2 // Lanes 0,1 and 2 are updated in q0.
mov lr, #1
VADD.s32 q0, q1, q2 // Only first lane is updated.
This means that the value of lr cannot be spilled and re-used in tail
predication regions without potentially altering the behaviour of the
program. More lanes than required could be stored, for example, and in
the case of a gather those lanes might not have been setup, leading to
alignment exceptions.
This patch adds a new lr predicate operand to MVE instructions in order
to keep a reference to the lr that they use as a tail predicate. It will
usually hold the zeroreg meaning not predicated, being set to the LR phi
value in the MVETPAndVPTOptimisationsPass. This will prevent it from
being spilled anywhere that it needs to be used.
A lot of tests needed updating.
Differential Revision: https://reviews.llvm.org/D107638
Please refer to
https://lists.llvm.org/pipermail/llvm-dev/2021-September/152440.html
(and that whole thread.)
TLDR: the original patch had no prior RFC, yet it had some changes that
really need a proper RFC discussion. It won't be productive to discuss
such an RFC, once it's actually posted, while said patch is already
committed, because that introduces bias towards already-committed stuff,
and the tree is potentially in broken state meanwhile.
While the end result of discussion may lead back to the current design,
it may also not lead to the current design.
Therefore i take it upon myself
to revert the tree back to last known good state.
This reverts commit 4c4093e6e3.
This reverts commit 0a2b1ba33a.
This reverts commit d9873711cb.
This reverts commit 791006fb8c.
This reverts commit c22b64ef66.
This reverts commit 72ebcd3198.
This reverts commit 5fa6039a5f.
This reverts commit 9efda541bf.
This reverts commit 94d3ff09cf.
Breaks build with -DBUILD_SHARED_LIBS=ON
```
CMake Error: The inter-target dependency graph contains the following strongly connected component (cycle):
"LLVMFrontendOpenMP" of type SHARED_LIBRARY
depends on "LLVMPasses" (weak)
"LLVMipo" of type SHARED_LIBRARY
depends on "LLVMFrontendOpenMP" (weak)
"LLVMCoroutines" of type SHARED_LIBRARY
depends on "LLVMipo" (weak)
"LLVMPasses" of type SHARED_LIBRARY
depends on "LLVMCoroutines" (weak)
depends on "LLVMipo" (weak)
At least one of these targets is not a STATIC_LIBRARY. Cyclic dependencies are allowed only among static libraries.
CMake Generate step failed. Build files cannot be regenerated correctly.
```
This reverts commit 707ce34b06.
This patch extends D107904's introduction of vector-predicated (VP)
operation legalization to include vector splitting.
When the result of a binary VP operation needs splitting, all of its
operands are split in kind. The two operands and the mask are split as
usual, and the vector-length parameter EVL is "split" such that the low
and high halves each execute the correct number of elements.
Tests have been added to the RISC-V target to show splitting several
scenarios for fixed- and scalable-vector types. Without support for
`umax` (e.g. in the `B` extension) the generated code starts to branch.
Ideally a cost model would prevent their insertion in the first place.
Through these tests many opportunities for better codegen can be seen:
combining known-undef VP operations and for constant-folding operations
on `ISD::VSCALE`, to name but a few.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D107957
llvm.vp.select extends the regular select instruction with an explicit
vector length (%evl).
All lanes with indexes at and above %evl are
undefined. Lanes below %evl are taken from the first input where the
mask is true and from the second input otherwise.
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D105351
Several FP instructions (fadd, fsub, etc.) were incorrectly assigned
a higher cost for SVE because they have custom lowering, however we
know they are legal. This patch explicitly assigns a cost of 2 to
these opcodes.
Tests added here:
Analysis/CostModel/AArch64/arith-fp-sve.ll
Differential Revision: https://reviews.llvm.org/D108993
sh_info links to a section, therefore SHF_INFO_LINK should be set as GNU as
does. The issue has been benign because linkers kindly combines relocation
sections w/ and w/o the flag.
Add methods for loop unrolling to the OpenMPIRBuilder class and use them in Clang if `-fopenmp-enable-irbuilder` is enabled. The unrolling methods are:
* `unrollLoopFull`
* `unrollLoopPartial`
* `unrollLoopHeuristic`
`unrollLoopPartial` and `unrollLoopHeuristic` can use compiler heuristics to automatically determine the unroll factor. If possible, that is if no CanonicalLoopInfo is required to pass to another method, metadata for LLVM's LoopUnrollPass is added. Otherwise the unroll factor is determined using the same heurstics as user by LoopUnrollPass. Not requiring a CanonicalLoopInfo, especially with `unrollLoopHeuristic` allows greater flexibility.
With full unrolling and partial unrolling with known unroll factor, instead of duplicating instructions by the OpenMPIRBuilder, the full unroll is still delegated to the LoopUnrollPass. In case of partial unrolling the loop is first tiled using the existing `tileLoops` methods, then the inner loop fully unrolled using the same mechanism.
Reviewed By: jdoerfert, kiranchandramohan
Differential Revision: https://reviews.llvm.org/D107764
For CSSPGO, turn on `sample-profile-use-preinliner` by default. This simplifies the use of llvm-profgen preinliner as it's now simply driven by ContextShouldBeInlined flag for each context profile without needing extra compiler switch.
Note that llvm-profgen's preinliner is still off by default, under switch `csspgo-preinliner`.
Differential Revision: https://reviews.llvm.org/D109111
Added opt option -print-pipeline-passes to print a -passes compatible
string describing the built pass pipeline.
As an example:
$ opt -enable-new-pm=1 -adce -licm -simplifycfg -o /dev/null /dev/null -print-pipeline-passes
verify,function(adce),function(loop-mssa(licm)),function(simplifycfg<bonus-inst-threshold=1;no-forward-switch-cond;no-switch-to-lookup;keep-loops;no-hoist-common-insts;no-sink-common-insts>),verify,BitcodeWriterPass
At the moment this is best-effort only and there are some known
limitations:
- Not all passes accepting parameters will print their parameters
(currently only implemented for simplifycfg).
- Some ClassName to pass-name mappings are not unique.
- Some ClassName to pass-name mappings are missing (e.g.
BitcodeWriterPass).
Differential Revision: https://reviews.llvm.org/D108298
Added opt option -print-pipeline-passes to print a -passes compatible
string describing the built pass pipeline.
As an example:
$ opt -enable-new-pm=1 -adce -licm -simplifycfg -o /dev/null /dev/null -print-pipeline-passes
verify,function(adce),function(loop-mssa(licm)),function(simplifycfg<bonus-inst-threshold=1;no-forward-switch-cond;no-switch-to-lookup;keep-loops;no-hoist-common-insts;no-sink-common-insts>),verify,BitcodeWriterPass
At the moment this is best-effort only and there are some known
limitations:
- Not all passes accepting parameters will print their parameters
(currently only implemented for simplifycfg).
- Some ClassName to pass-name mappings are not unique.
- Some ClassName to pass-name mappings are missing (e.g.
BitcodeWriterPass).
This is a case I'd missed in 6a8237. The odd bit here is that missing the edge removal update seems to produce MemorySSA which verifies, but is still corrupt in a way which bothers following passes. I wasn't able to reduce a single pass test case, which is why the reported test case is taken as is.
Differential Revision: https://reviews.llvm.org/D109068
I believe, the profitability reasoning here is correct
"sub"reg is already located within the 0'th subreg of wider reg,
so if we have suvector insertion at index 0 into undef,
then it's always free do to.
After this, D109065 finally avoids the regression in D108382.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D109074
Similar to D97585.
D25456 used `S_ATTR_LIVE_SUPPORT` to ensure the data variable will be retained
or discarded as a unit with the counter variable, so llvm.compiler.used is
sufficient. It allows ld to dead strip unneeded profc and profd variables.
Reviewed By: vsk
Differential Revision: https://reviews.llvm.org/D105445
This adds lowering of the llvm.fptosi.sat and llvm.fptoui.sat intinsics,
selecting a VCVT instruction which under MVE will inherently perform the
saturate.
Differential Revision: https://reviews.llvm.org/D107865
There's a silent bug in our reasoning about zero strides. We assume that having a single static exit implies that if that exit is not taken, then the loop must be infinite. This ignores the potential for abnormal exits via exceptions. Consider the following example:
for (uint_8 i = 0; i < 1; i += 0) {
throw_on_thousandth_call();
}
Our reasoning is such that we'd conclude this loop can't take the backedge as that would lead to a (presumed) infinite loop.
In practice, this is a silent bug because the loopIsFiniteByAssumption returns false strictly more often than the loopHaNoAbnormalExits property. We could reasonable want to change that in the future, so fixing the codeflow now is worthwhile.
Differential Revision: https://reviews.llvm.org/D109029
Currently, the truncate selection dag node is expanded as a bitwise AND plus compare to 1. This change enables scalar comparison in the pattern if the truncate node is uniform.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D108925
As mentioned in D108833, the logic for figuring out if a backedge is dead was somewhat interwoven with the SCEV based logic and the symbolic eval logic. This is my attempt at making the code easier to follow.
Note that this is only NFC after the work done in 29fa37ec. Thanks to Nikita for catching that case.
Differential Revision: https://reviews.llvm.org/D108848
With opaque pointers, no actual bitcasts will be present. Instead,
there will be a mismatch between the call FunctionType and the
function ValueType. Change the code to collect CallBases
specifically (rather than general Uses) and compare these types.
RAUW is no longer performed, as there would no longer be any
bitcasts that can be RAUWd.
Differential Revision: https://reviews.llvm.org/D108880
We'd special cased this logic to use pointer types for non-integral pointers, but there's no reason we can't do that for all pointer types. Doing it this was has a few advantages:
a) The code itself becomes more straight forward, and easier to test.
b) We avoid introducing ptrtoint into programs which didn't have them in the source.
c) The resulting codegen is easier to analyze and simplify (mostly due to lack of ptrtoint).
Note that there are some test diffs, but a) running them through instcombine helps a ton, and b) there's enough missing obvious transforms on both before and after IR that it's clear this isn't performance sensitive.
This is mostly motivated by cleaning up mentions of non-integrals to have a clearer idea of what we actually need to support.
Differential Revision: https://reviews.llvm.org/D104662
If the true and false values are the same, we don't need a SELECT_CC.
This would normally be folded before a select is legalized to
select_cc. The test case exploits the late legalization of vscale
to trigger a case where they become identical after legalization.
This works around an issue found on a test case in D107957. In that
case the true/false values were both eventually 0 and the select was
used by a vector AVL operand. The select_cc got expanded to control
flow and a phi, but the phi inputs were both copies from X0. MachineIR
optimizations simplified this to a single copy from X0 going into the
vector instruction. This became the input of a vsetvli after vsetvli
insertion. Then register coalescing folded the copy into the vsetvli.
X0 as the source of a vsetvli is a special encoding and should not be
created by coalesing. We need to fix our vsetvli handling to make sure
this can never happen any other way, but removing the unneeded select
is still a worthwhile optimization.
With the context split work, the context-based (an array of strings) sorting performed at profile load time is way more expansive than single-string-based sorting. This is likely due to auxiliary operations done on each array element, such as indirect references, std::min operations, also likely cache misses. In this change I'm presorting profiles during profile generation time to avoid sorting at compile time.
Compared to the previous context-split work, this effectively cuts down compile time by 20% for one of our large services and brings us closer to non-CS build, with still a small gap in build time.
Reviewed By: wenlei, wmi
Differential Revision: https://reviews.llvm.org/D109036
Store the used element type in the InductionDescriptor. For typed
pointers, it remains the pointer element type. For opaque pointers,
we always use an i8 element type, such that the step is a simple
offset.
A previous version of this patch instead tried to guess the element
type from an induction GEP, but this is not reliable, as the GEP
may be hidden (see @both in iv_outside_user.ll).
Differential Revision: https://reviews.llvm.org/D104795
This extends D108921 into a generic rule applied to constructing ExitLimits along all paths. The remaining paths (primarily howFarToZero) don't have the same reasoning about UB sensitivity as the howManyLessThan ones did. Instead, the remain cause for max counts being more precise than exact counts is that we apply context sensitive loop guards on the max path, and not on the exact path. That choice is mildly suspect, but out of scope of this patch.
The MVETailPredication.cpp change deserves a bit of explanation. We were previously figuring out that two SCEVs happened to be equal because the happened to be identical. When we optimized one with context sensitive information, but not the other, we lost the ability to prove them equal. So, cover this case by subtracting and then applying loop guards again. Without this, we see changes in test/CodeGen/Thumb2/mve-blockplacement.ll
Differential Revision: https://reviews.llvm.org/D109015
This is used by BOLT to do patching of DebugInfo section, and Line Table. Directly by using find, and through getAttrFieldOffsetForUnit.
Reviewed By: dblaikie
Differential Revision: https://reviews.llvm.org/D107874
isFreeToInvert allows min/max with 'not' on both operands,
so easing the argument restriction catches the case where
that operand has one use.
We already handle the sub-patterns when there are less uses:
https://alive2.llvm.org/ce/z/8Jatm_
...but this is another step towards parity with the
equivalent icmp+select idioms ( D98152 ).
Differential Revision: https://reviews.llvm.org/D109059
This mimics the code for the corresponding cmp-select idiom.
This also prevents an infinite loop because isFreeToInvert
does not match constant expressions.
So this patch solves the same problem as D108814 and obsoletes
it, but my main motivation is to enhance the pattern matching
to allow more invertible ops. That change will be a follow-up
patch on top of this one.
Differential Revision: https://reviews.llvm.org/D109058
The OutermostLoad condition is supposed to strip the outermost
DW_OP_deref operation because dbg.declares are implicitly
indirect. This patch makes sure the heuristic is only applied to
dbg.declare intrinsics and only if the outermost instruction is a
load.
This was found while qualifying the latest Swift compiler rebranch.
rdar://82037764
libdevice bitcode provided by NVIDIA is linked with clang/LLVM-generated IR
which uses nvptx*-nvidia-cuda triple. We need to mark them as compatible.
Differential Revision: https://reviews.llvm.org/D108835
This is part one of a couple of patches to fully rename these methods.
I've made the mistake of assuming that these indexes are for parameters
multiple times, but actually they're based off of a weird indexing
scheme AttributeList::AttrIndex where 0 is the return value and ~0 is
the function. Hopefully renaming these methods will make this clearer.
Ideally users should use more specific methods like
AttributeList::getFnAttr().
This patch simply adds the name that we want in the end. This is so the
removal of the methods with the original names happens in a separate
change to make it easier for downstream users.
This touches all relevant methods in AttributeList, CallBase, and Function.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D108788
Previously extra wide v4f32 to v4f64 extending loads would be legalized to v2f32
to v2f64 extending loads, which would then be scalarized by legalization. (v2f32
to v2f64 extending loads not produced by legalization were already being emitted
correctly.) Instead, mark v2f32 to v2f64 extending loads as legal and explicitly
lower them using promote_low. This regresses the addressing modes supported for
the extloads not produced by legalization, but that's a fine trade off for now.
Differential Revision: https://reviews.llvm.org/D108496
When we have an any-extending FPR bank load, none of the tablegen patterns
match and we fall back to the C++ selector. Like with the truncating stores
that were fixed recently, the C++ wasn't able to handle it and ended up
generating invalid copies between different size regclasses.
This change adds handling for this case, splitting the load into a regular
load and a SUBREG_TO_REG to extend it into the original wide destination reg.
Adding the compiler support of MD5 CS profile based on pervious context split work D107299. A MD5 CS profile is about 40% smaller than the string-based extbinary profile. As a result, the compilation is 15% faster.
There are a few conversion from real names to md5 names that have been made on the sample loader and context tracker side to get it work.
Reviewed By: wenlei, wmi
Differential Revision: https://reviews.llvm.org/D108342
The load store vectorizer currently uses isNoAlias() to determine
whether memory-accessing instructions should prevent vectorization.
However, this only works for loads and stores. Additionally, a
couple of intrinsics like assume are special-cased to be ignored.
Instead use getModRefInfo() to generically determine whether the
instruction accesses/modifies the relevant location. This will
automatically handle all inaccessiblememonly intrinsics correctly
(as well as other calls that don't modref for other reasons).
This requires generalizing the code a bit, as it was previously
only considering loads and stored in particular.
Differential Revision: https://reviews.llvm.org/D109020
We have a large compile showing occasional non-deterministic behavior
that is due to DIArgList not being properly uniqued in some cases. I
tracked this down to handleChangedOperands, for which there is a custom
implementation for DIArgList, that does not take care of re-uniquing
after updating the DIArgList Args, unlike the default version of
handleChangedOperands for MDNode.
Since the Args in the DIArgList form the key for the store, this seems
to be occasionally breaking the lookup in that DenseSet. Specifically,
when invoking DIArgList::get() from replaceVariableLocationOp, very
occasionally it returns a new DIArgList object, when one already exists
having the same exact Args pointers. This in turn causes a subsequent
call to Instruction::isIdenticalToWhenDefined on those two otherwise
identical DIArgList objects during a later pass to return false, leading
to different IR in those rare cases.
I modified DIArgList::handleChangedOperands to perform similar
re-uniquing as the MDNode version used by other metadata node types.
This also necessitated a change to the context destructor, since in some
cases we end up with DIArgList as distinct nodes: DIArgList is the only
metadata node type to have a custom dropAllReferences, so we need to
invoke that version on DIArgList in the DistinctMDNodes store to clean
it up properly.
Differential Revision: https://reviews.llvm.org/D108968
If MatchRegexp is an invalid regex, an error message will be printed
using SourceManager::PrintMessage via AddRegExToRegEx.
PrintMessage relies on the input being a StringRef into a string managed
by SourceManager. At the moment, a StringRef to a std::string
allocated in the caller of AddRegExToRegEx is passed. If the regex is
invalid, this StringRef is passed to PrintMessage, where it will crash,
because it does not point to a string managed via SourceMgr.
This patch fixes the crash by turning MatchRegexp into a StringRef If
we use MatchStr, we directly use that StringRef, which points into a
string from SourceMgr. Otherwise, MatchRegexp gets assigned
Format.getWildcardRegex(), which returns a std::string. To extend the
lifetime, assign it to a std::string variable WildcardRegexp and assign
MatchRegexp to a stringref to WildcardRegexp. WildcardRegexp should
always be valid, so we should never have to print an error message
via the SoureMgr I think.
Fixes PR49319.
Reviewed By: thopre
Differential Revision: https://reviews.llvm.org/D109050
Configure and use the TSFlags in TargetRegisterClass to
have unique flags for VGPR and AGPR register classes.
The vector register class queries like `hasVGPRs` will
now become more efficient with just a bitwise operation.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D108815
For a variable in a comdat nodeduplicate, its initializer may be significant.
E.g. its content may be implicitly referenced by another comdat member (or
required to parallel to another comdat member by the runtime when explicit
section is used). We can clone it into an unnamed private linkage variable to
preserve its content.
This partially fixes PR51394 (Sony's proprietary linker using LTO): no error
will be reported. This is partial because we do not guarantee the global
variable order if the runtime has parallel section requirement.
---
There is a similar issue for regular LTO, but unrelated to PR51394:
with lib/LTO (using either ld.lld or LLVMgold.so), linking two modules
with a weak function of the same name, can leave one weak profc and two
private profd, due to lib/LTO's current deficiency that it mixes the two
concepts together: comdat selection and symbol resolution. If the issue
is considered important, we should suppress private profd for the weak+
regular LTO case.
Reviewed By: phosek
Differential Revision: https://reviews.llvm.org/D108879
While initializing the LDS pointers within entry basic block of kernel(s), make
sure that the entry basic block is split after alloca instructions.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D108971
A new LLVM specific TAG DW_TAG_LLVM_annotation is added.
The name is suggested by Paul Robinson ([1]).
Currently, this tag is used to output __attribute__((btf_tag("string")))
annotations in dwarf. The following is an example for a global
variable with two btf_tag attributes:
0x0000002a: DW_TAG_variable
DW_AT_name ("g1")
DW_AT_type (0x00000052 "int")
DW_AT_external (true)
DW_AT_decl_file ("/tmp/home/yhs/work/tests/llvm/btf_tag/t.c")
DW_AT_decl_line (8)
DW_AT_location (DW_OP_addr 0x0)
0x0000003f: DW_TAG_LLVM_annotation
DW_AT_name ("btf_tag")
DW_AT_const_value ("tag1")
0x00000048: DW_TAG_LLVM_annotation
DW_AT_name ("btf_tag")
DW_AT_const_value ("tag2")
0x00000051: NULL
In the future, DW_TAG_LLVM_annotation may encode other type
of non-string const value.
[1] https://lists.llvm.org/pipermail/llvm-dev/2021-June/151250.html
Differential Revision: https://reviews.llvm.org/D106621
When a nodeduplicate COMDAT group contains a weak symbol, choose
a non-weak symbol (or one of the weak ones) rather than reporting
an error. This should address issue PR51394.
With the current IR representation, a generic comdat nodeduplicate
semantics is not representable for LTO. In the linker, sections and
symbols are separate concepts. A dropped weak symbol does not force the
defining input section to be dropped as well (though it can be collected
by GC). In the IR, when a weak linkage symbol is dropped, its associate
section content is dropped as well.
For InstrProfiling, which is where ran into this issue in PR51394, the
deduplication semantic is a sufficient workaround.
Differential Revision: https://reviews.llvm.org/D108689
Performance on GPU targets can be highly variable, sometimes inlining
everything hurts performance and sometimes it greatly improves it. Add
an option to toggle this behaviour to better investigate it.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D109014
When RA eliminated a dead def it can either immediately delete
the instruction itself or replace it with KILL to defer the
actual removal. If this instruction has a virtual register use
killing the register it will shrink the LI of the use. However,
if the LI covers the instruction and extends beyond it the
shrink will not happen. In fact that is impossible to shrink
such use because of the KILL still using it.
If later the LI of the use will be split at the KILL and the
KILL itself is eliminated after that point the new live segment
ends up at an invalid slot index.
This extremely rare condition was hit after D106408 which has
enabled rematerialization of such instructions. The replacement
with KILL is only done for rematerialized defs which became dead
and such rematerialization did not generally happen before.
The patch deletes an instruction immediately if it is a result
of rematerialization and has such use. An alternative would be
to prohibit a split at a KILL instruction, but it looks like it
is better to split a live range rather then keeping a killed
instruction just in case it can be rematerialized further.
Fixes PR51655.
Differential Revision: https://reviews.llvm.org/D108951
SLPVectorizer currently uses AA::isNoAlias() to determine whether
two locations alias. This does not work if one of the instructions
is a call. Instead, we should check getModRefInfo(), which
determines whether an arbitrary instruction modifies or references
a given location.
Among other things, this prevents @llvm.experimental.noalias.scope.decl()
and other inaccessiblmemonly intrinsics from interfering with SLP
vectorization.
Differential Revision: https://reviews.llvm.org/D109012
Similar to D108842, D108844, D108926, D108928, and D108936.
__has_builtin(builtin_mul_overflow) returns true for 32b RISCV targets,
but Clang is deferring to compiler RT when encountering long long types.
If the semantics of __has_builtin mean "the compiler resolves these,
always" then we shouldn't conditionally emit a libcall.
Link: https://bugs.llvm.org/show_bug.cgi?id=28629
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D108939
Similar to D108842, D108844, and D108926.
__has_builtin(builtin_mul_overflow) returns true for 32b PPC targets,
but Clang is deferring to compiler RT when encountering long long types.
This breaks ppc44x_defconfig + CONFIG_BLK_DEV_NBD=y builds of the Linux
kernel that are using builtin_mul_overflow with these types for these
targets.
If the semantics of __has_builtin mean "the compiler resolves these,
always" then we shouldn't conditionally emit a libcall.
This will still need to be worked around in the Linux kernel in order to
continue to support these builds of the Linux kernel for this
target with older releases of clang.
Link: https://bugs.llvm.org/show_bug.cgi?id=28629
Link: https://github.com/ClangBuiltLinux/linux/issues/1438
Reviewed By: nemanjai
Differential Revision: https://reviews.llvm.org/D108936
As noted in the comments in D108227, using G_FPTOSI produces wrong results for
G_ISNAN. Drop the G_FPTOSI and perform the operation on integer types.
Elsewhere in LLVM, a bitcast would be the appropriate choice (as it is in SDAG).
GlobalISel does not distinguish between integer and FP types, so a bitcast would
be meaningless here.
The runtime unroller will try to produce a non-loop if the unroll count is 2 and thus the prolog/epilog loop would only run at most one iteration. The old implementation did this by avoiding loop construction entirely. This patches instead constructs the trivial loop and then explicitly breaks the backedge and simplifies. This does result in some additional code churn when triggered, but a) results in better quality code and b) removes a codepath which didn't work properly for multiple exit epilogs.
One oddity that I want to draw to reviewer attention is that this somehow changes revisit order. The new order looks equivalent to me, but I don't understand how creating and erasing an extra loop here creates this effect.
Differential Revision: https://reviews.llvm.org/D108521
This is a bailout for pr51680. This pass appears to assume that the alignment operand to an align tag on an assume bundle is constant. This doesn't appear to be required anywhere, and clang happily generates non-constant alignments for cases such as this case taken from the bug report:
// clang -cc1 -triple powerpc64-- -S -O1 opal_pci-min.c
extern int a[];
long *b;
long c;
void *d(long, int *, int, long, long, long) __attribute__((__alloc_align__(6)));
void e() {
b = d(c, a, 0, 0, 5, c);
b[0] = 0;
}
This was exposed by a SCEV change which allowed a non-constant alignment to reach further into the pass' code. We could generalize the pass, but for now, let's fix the crash.
This patch is specifically the howManyLessThan case. There will be a couple of followon patches for other codepaths.
The subtle bit is explaining why the two codepaths have a difference while both are correct. The test case with modifications is a good example, so let's discuss in terms of it.
* The previous exact bounds for this example of (-126 + (126 smax %n))<nsw> can evaluate to either 0 or 1. Both are "correct" results, but only one of them results in a well defined loop. If %n were 127 (the only possible value producing a trip count of 1), then the loop must execute undefined behavior. As a result, we can ignore the TC computed when %n is 127. All other values produce 0.
* The max taken count computation uses the limit (i.e. the maximum value END can be without resulting in UB) to restrict the bound computation. As a result, it returns 0 which is also correct.
WARNING: The logic above only holds for a single exit loop. The current logic for max trip count would be incorrect for multiple exit loops, except that we never call computeMaxBECountForLT except when we can prove either a) no overflow occurs in this IV before exit, or b) this is the sole exit.
An alternate approach here would be to add the limit logic to the symbolic path. I haven't played with this extensively, but I'm hesitant because a) the term is optional and b) I'm not sure it'll reliably simplify away. As such, the resulting code quality from expansion might actually get worse.
This was noticed while trying to figure out why D108848 wasn't NFC, but is otherwise standalone.
Differential Revision: https://reviews.llvm.org/D108921
As seen in https://bugs.llvm.org/show_bug.cgi?id=48880 the current
implementation for parsing grouped short options can return unclear
error messages. This change fixes the example given in the ticket in
which a flag is incorrectly given an argument. Also when parsing a
group we now keep reading past the first incorrect option and output
errors for all incorrect options in the group.
Differential Revision: https://reviews.llvm.org/D108770
Followup to D99355: SDAG support for vector-predicated load/store/gather/scatter.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D105871
To support Virtual Function Elimination to Swift, this PR adds support for Swift
vtables which contain "relative pointers" instead of direct pointer references.
These are in the form of:
@symbol = ... {
i32 trunc (i64 sub (i64 ptrtoint (<type> @target to i64), i64 ptrtoint (... @symbol to i64)) to i32)
}
The PR extends GlobalDCE's way of looking up a vtable offset into a dependency
to be able to see through this expression and find the target symbol.
Differential Revision: https://reviews.llvm.org/D107645
The existing code was unquestionably wrong - it looked at one
fneg and ignored the other 2 instructions.
It was also untested, so it didn't make the list of bugs
flagged by Alive2.
This is an unusual propagation, but Alive2 agress that we
can intersect the fnegs and union that with the select,
then apply the results to both new instructions:
https://alive2.llvm.org/ce/z/SF8_dt
Icelake, Rocketlake and Tigerlake targets currently use the SkylakeServer scheduler model, despite being a later microarchitecture, leading to both reported bugs (PR48110) and discrepancies when comparing llvm-mca reports to other profiling tools (OSACA, uops, uica, etc.). And tbh I'm getting sick of llvm-mca getting blamed for what are backend scheduler model issues :-(
This patch doesn't attempt to fix any of these discrepancies - there should be no changes in codegen - its a setup patch that copies the skx model, renames all the resources, adds the additional ports (but doesn't reference them yet) and updates the llvm-exegesis pfm counter mappings (based off https://sourceforge.net/p/perfmon2/libpfm4/ci/master/tree/lib/events/intel_icl_events.h).
This should make it trivial for anyone with hardware access to use llvm-exegesis reports to iteratively improve the model (my attempts to get hold of a cheap tiger lake box haven't been fruitful yet....).
I will copy the SkylakeServer llvm-mca resource tests as follow up commits - the diff should entirely be the resource renames.
Differential Revision: https://reviews.llvm.org/D108914
Add assert to provoke failure in object file output, not just in disassembly output.
Reviewed By: yroux
Differential Revision: https://reviews.llvm.org/D107259
This is an improvement over D107852. We don't need to enumerate specific
function names; we can just check for `noreturn` attribute. This also
requires us to make sure `__resumeExeption` and `emscripten_longjmp`
have `noreturn` attribute too; one of them is a JS function and the
other calls a JS function so Clang does not have a way to deduce they
don't return.
This is effectively NFC, because I'm not sure if there is an additional
case this case covers; if we add a custom function call that has
`noreturn` attribute, it will be processed within the SjLj handling and
turned into `__invoke` call. So this really applies to some special
functions like `emscripten_longjmp`.
Reviewed By: dschuff
Differential Revision: https://reviews.llvm.org/D108955
There are three kinds of "rethrowing" BBs in this pass:
1. In Emscripten SjLj, after a possibly longjmping function call, we
check if the thrown longjmp corresponds to one of setjmps within the
current function. If not, we rethrow the longjmp by calling
`emscripten_longjmp`.
2. In Emscripten EH, after a possibly throwing function call, we check
if the thrown exception corresponds to the current `catch` clauses.
If not, we rethrow the exception by calling `__resumeException`.
3. When both Emscripten EH and SjLj are used, when we check for an
exception after a possibly throwing function call, it is possible
that we get not an exception but a longjmp. In this case, we
shouldn't swallow it; we should rethrow the longjmp by calling
`emscripten_longjmp`.
4. When both Emscripten EH and SjLj are used, when we check for a
longjmp after a possibly longjmping function call, it is possible
that we get not a longjmp but an exception. In this case, we
shouldn't swallot it; we should rethrow the exception by calling
`__resumeException`.
Case 1 is in Emscripten SjLj, 2 is in Emscripten EH, and 3 and 4 are
relevant when both Emscripten EH and SjLj are used. 3 and 4 were first
implemented in D106525.
We create BBs for 1, 3, and 4 in this pass. We create those BBs for
every throwing/longjmping function call, along with other BBs that
contain condition checks. What this CL does is to create a single BB
within a function for each of 1, 3, and 4 cases. These BBs are exiting
BBs in the function and thus don't have successors, so easy to be shared
between calls.
The names of BBs created are:
Case 1: `call.em.longjmp`
Case 3: `rethrow.exn`
Case 4: `rethrow.longjmp`
For the case 2 we don't currently create BBs; we only replace the
existing `resume` instruction with `call @__resumeException`. And Clang
already creates only a single `resume` BB per function and reuses it,
so we don't need to optimize this case.
Not sure what are good benchmarks for EH/SjLj, but this decreases the
size of the object file for `grfmt_jpeg.bc` (presumably from opencv) we
got from one of our users by 8.9%. Even after running `wasm-opt -O4` on
them, there is still 4.8% improvement.
Reviewed By: dschuff
Differential Revision: https://reviews.llvm.org/D108945
Currently context strings contain a lot of duplicated function names and that significantly increase the profile size. This change split the context into a series of {name, offset, discriminator} tuples so function names used in the context can be replaced by the index into the name table and that significantly reduce the size consumed by context.
A follow-up improvement made in the compiler and profiling tools is to avoid reconstructing full context strings which is time- and memory- consuming. Instead a context vector of `StringRef` is adopted to represent the full context in all scenarios. As a result, the previous prevalent profile map which was implemented as a `StringRef` is now engineered as an unordered map keyed by `SampleContext`. `SampleContext` is reshaped to using an `ArrayRef` to represent a full context for CS profile. For non-CS profile, it falls back to use `StringRef` to represent a contextless function name. Both the `ArrayRef` and `StringRef` objects are underpinned by real array and string objects that are stored in producer buffers. For compiler, they are maintained by the sample reader. For llvm-profgen, they are maintained in `ProfiledBinary` and `ProfileGenerator`. Full context strings can be generated only in those cases of debugging and printing.
When it comes to profile format, nothing has changed to the text format, though internally CS context is implemented as a vector. Extbinary format is only changed for CS profile, with an additional `SecCSNameTable` section which stores all full contexts logically in the form of `vector<int>`, which each element as an offset points to `SecNameTable`. All occurrences of contexts elsewhere are redirected to using the offset of `SecCSNameTable`.
Testing
This is no-diff change in terms of code quality and profile content (for text profile).
For our internal large service (aka ads), the profile generation is cut to half, with a 20x smaller string-based extbinary format generated.
The compile time of ads is dropped by 25%.
Differential Revision: https://reviews.llvm.org/D107299
In D87099, the mangler learned to quote export directives that contain
special characters. Only alhpanumerical characters as well as
'_', '$', '.' and '@' were exmpt from this quoting. However, at least
binutils considers an unquoted '.' to be syntax and object files
containing such symbols will cause errors during linking. Fix that
by removing '.' from the list of allowed exemptions.
Differential Revision: https://reviews.llvm.org/D100359
Currently, the IROutliner uses a simple metric to outline the largest amount
of IR possible to outline first if it fits the cost model. This is model
loses out on smaller blocks of code that have higher reductions in cost that
are contained within larger blocks of IR.
This reverses the order, where we calculate all of the costs first, and then
reorder and extract items based on the calculated results.
Reviewers: paquette
Differential Revision: https://reviews.llvm.org/D106440
Instead of splitting off the fp16 to float conversion and generating
a libcall, we should split the operation into fp16 to float and float
to integer operations. This will allow the float to integer conversion
to go through any custom handling the target has. If the target doesn't
have custom handling then we should come back to ExpandIntRes_FP_TO_SINT/
ExpandIntRes_FP_TO_UINT automatically to create the libcall.
This avoids generating libcalls on 32-bit X86. These library functions may
not exist in 32-bit libgcc. At least for LLVM, we never generate them when
hardware floating point instructions are available.
Differential Revision: https://reviews.llvm.org/D108933
When expanding a SMULFIXSAT ISD node (usually originating from
a smul.fix.sat intrinsic) we've applied some optimizations for
the special case when the scale is zero. The idea has been that
it would be cheaper to use an SMULO instruction (if legal) to
perform the multiplication and at the same time detect any overflow.
And in case of overflow we could use some SELECT:s to replace the
result with the saturated min/max value. The only tricky part
is to know if we overflowed on the min or max value, i.e. if the
product is positive or negative. Unfortunately the implementation
has been incorrect as it has looked at the product returned by the
SMULO to determine the sign of the product. In case of overflow that
product is truncated and won't give us the correct sign bit.
This patch is adding an extra XOR of the multiplication operands,
which is used to determine the sign of the non truncated product.
This patch fixes PR51677.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D108938
It looks like this array was missed in 4276d4a8d0
Fixed tests that expected `elements` to be empty or depeneded on the order of the empty DINode.
Reviewed By: aprantl
Differential Revision: https://reviews.llvm.org/D107024
The min/max intrinsics are not yet canonical, but when they are the tail
predications analysis will change from treating them like icmp to
treating them like intrinsics. Unfortunately, they can currently produce
better code by not being tail predicated thanks to the vectorizer picking
higher VF's and the backend folding to better instructions (especially
for saturate patterns). In the long run we will need to improve the
vectorizers cost modelling, recognizing the instruction directly, but in
the meantime this treats min/max as before to prevent performance
regressions.
The backend generally uses 64-bit immediates (e.g. what
MachineOperand::getImm() returns), so use that for analyzeCompare()
and optimizeCompareInst() as well. This avoids truncation for
targets that support immediates larger 32-bit. In particular, we
can avoid the bugprone value normalization hack in the AArch64
target.
This is a followup to D108076.
Differential Revision: https://reviews.llvm.org/D108875
Md5 hashing is expansive. Using a hash map to look up already computed GUID for dwarf names. Saw a 2% build time improvement on an internal large application.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D108722
Only enforce that ptr* is illegal if the base type is a simple type,
not when it is something like %ty, where %ty may resolve to an
opaque pointer in force-opaque-pointers mode.
Differential Revision: https://reviews.llvm.org/D108876
Occasionally instructions are between the last instruction in a region,
and the following instruction as identified by the Candidate. This
adds an extra check right before splitting a candidate that excludes the region from being split/checked for outlining to remove errors.
Tests Added:
Tranforms/IROuutliner/outlining-extra-bitcasts.ll
Reviewer: paquette, jroelofs
Differential Revision: https://reviews.llvm.org/D104142
The check for whether a rotate is possible occurs before the
memory legality checks for the integer type. So it's possible we
decide we can use a rotate, but then fail the legality checks. If
that happens we should not fall back to a vector type. This triggers
an assertion in the rotate handling when it finds a vector type
instead of an integer type.
In theory we could use a shufflevector in place of the rotate, but
right now I'd just like to fix the crash.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D108839
When building LLVM with Open XL and -Werror is specified, the -Waix-compat warning becomes an error. This patch updates the SmallVector class to suppress the -Waix-compat warning/error on AIX.
Reviewed By: daltenty
Differential Revision: https://reviews.llvm.org/D108577
If we encounter a new debug value, describing the same parameter,
we should stop tracking the parameter's Entry Value. At that point,
in some cases, the Transfer which uses the parameter's Entry Value,
is already emitted. Thanks to the RemoveRedundantDebugValues pass,
many problems with incorrect instruction order and number of DBG_VALUEs
are fixed. However, we still cannot rely on the rule that each new
debug value is set by the previous non-debug instruction in Machine
Basic Block.
When new parameter debug value triggers removal of Backup Entry Value
for the same parameter, do the cleanup of Transfers emitted from Backup
Entry Values. Get the Transfer Instruction which created the new debug
value and search for debug values already emitted from the to-be-deleted
Backup Entry Value and attached to the Transfer Instruction. If found,
delete the Transfer and remove "primary" Entry Value Var Loc from
OpenRanges.
This patch fixes PR47628.
Patch by Nikola Tesic.
Differential revision: https://reviews.llvm.org/D106856
The GLM/SLM special cases still get tested first but after the the MUL/DIV/REM pattern detection - this will be necessary for when we make the SLM vXi32 MUL canonicalization generic to improve PMULLW/PMULHW/PMADDDW cost support etc.
Previously, we'd expand *ALL* the SCEV's eagerly, because we needed to
check with `isValidRewrite()`, and discard bad rewrite candidates,
but now that we do not do that, we also don't need to always expand.
In particular, this avoids expanding potentially-huge SCEV's that we
would discard anyways because they are high-cost and we aren't
rewriting aggressively.
`isValidRewrite()` checks that the both the original SCEV,
and the rewrite SCEV have the same base pointer.
I //believe//, after all the recent SCEV improvements,
this invariant is already enforced by SCEV itself.
I originally tried changing it into an assert in D108043,
but that showed that it triggers on e.g. https://reviews.llvm.org/D108043#2946621,
where SCEV manages to forward the store to load,
test added.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D108655
This patch emits DW_TAG_namelist and DW_TAG_namelist_item for fortran
namelist variables. DICompositeType is extended to support this fortran
feature.
Reviewed By: aprantl
Differential Revision: https://reviews.llvm.org/D108553
After applying VPlan-to-VPlan transformations, using IR references to
query VPlan values may be incorrect, as the IR is not in sync with the
VPlan any longer.
To better detect such mis-matches, this patch introduces a new flag to
VPlans to indicate whether it is safe to query VPValues using IR values.
getVPValue is updated to assert if it is called when the flag indicates
it is not safe any longer.
There is an escape hatch via an extra argument, because there are 3
places that need to be fixed first. Those are
1. truncateToMinimalBitwidths
2. clearReductionWrapFlags
3. fixLCSSAPHIs
As a first step, this flag will help preventing new code from violating
this property.
Any suggestions with respect to naming very welcome!
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D108573
PowerPC can model these instructions, so we don't need this flag set.
Reviewed By: shchenz, jsji
Differential Revision: https://reviews.llvm.org/D71983
We cannot leak any equivalency information by comparing against null
since null never has virtual metadata associated with it (when null is
not a valid dereferenceable pointer).
Instcombine seems to make sure that a null will be on the RHS, so we
don't have to check both operands.
This fixes a missed optimization in llvm-test-suite's MultiSource lambda
benchmark under -fstrict-vtable-pointers.
Reviewed By: Prazek
Differential Revision: https://reviews.llvm.org/D108734
ExposePointerBase() in SCEVExpander implements basically the same
functionality as removePointerBase() in SCEV, so reuse it.
The SCEVExpander code assumes that the pointer operand on adds is
the last one -- I'm not sure that always holds. As such this might
not be strictly NFC.
There can only be one pointer operand in an add expression, and
we have sorted operands to guarantee that it is the first. As
such, the pointer check for other operands is dead code.
Remove method X86LowerAMXType::getRowFromCol since it's not used, and
it's causing a warning.
Reviewed By: LuoYuanke
Differential Revision: https://reviews.llvm.org/D108862
A new kind BTF_KIND_TAG is added to .BTF to encode
btf_tag attributes. The format looks like
CommonType.name : attribute string
CommonType.type : attached to a struct/union/func/var.
CommonType.info : encoding BTF_KIND_TAG
kflag == 1 to indicate the attribute is
for CommonType.type, or kflag == 0
for struct/union member or func argument.
one uint32_t : to encode which member/argument starting from 0.
If one particular type or member/argument has more than one attribute,
multiple BTF_KIND_TAG will be generated.
Differential Revision: https://reviews.llvm.org/D106622
This is different from symbol resolution based LinkFromSrc. Rename to be
clearer.
In the future we may support a new enum member 'Both' for nodeduplicate. This is
feasible (by renaming to a private linkage GlobalValue), but we need to be
careful not to break InstrProfiling.cpp's expectation of parallel profd/profc.
The challenge is that current LTO symbol resolution only allows to mark one
profc as prevailing: the other profc in another comdat nodeduplicate may be
discarded while its associated profd isn't.
If the icmp is in a different block, then the register for the icmp
operand may not be initialized, as it nominally does not have
cross-block uses. Add a check that the icmp is in the same block
as the branch, which should be the common case.
This matches what X86 FastISel does:
5b6b090cf2/llvm/lib/Target/X86/X86FastISel.cpp (L1648)
The "not" transform that could have a similar issue is dropped
entirely, because it is currently dead: The incoming value is
a branch or select condition of type i1, but this code requires
an i32 to trigger.
Fixes https://bugs.llvm.org/show_bug.cgi?id=51651.
Differential Revision: https://reviews.llvm.org/D108840
This patch add the R_RISCV_GOT_HI20 and R_RISCV_CALL_PLT relocation support. And the basic got/plt was implemented. Because of riscv32 and riscv64 has different pointer size, the got entry size and instructions of plt entry is different. This patch is the basic support, the optimization pass at preFixup stage has not been implemented.
Reviewed By: lhames
Differential Revision: https://reviews.llvm.org/D107688
__has_builtin(__builtin_mul_overflow) returns true for 32b MIPS targets,
but Clang is deferring to compiler RT when encountering `long long`
types. This breaks sanitizer builds of the Linux kernel that are using
__builtin_mul_overflow with these types for these targets.
If the semantics of __has_builtin mean "the compiler resolves these,
always" then we shouldn't conditionally emit a libcall.
This will still need to be worked around in the Linux kernel in order to
continue to support malta_defconfig builds of the Linux kernel for this
target with older releases of clang.
Link: https://bugs.llvm.org/show_bug.cgi?id=28629
Link: https://github.com/ClangBuiltLinux/linux/issues/1438
Reviewed By: rengolin
Differential Revision: https://reviews.llvm.org/D108844
__has_builtin(__builtin_mul_overflow) returns true for 32b ARM targets,
but Clang is deferring to compiler RT when encountering `long long`
types. This breaks sanitizer builds of the Linux kernel that are using
__builtin_mul_overflow with these types for these targets.
If the semantics of __has_builtin mean "the compiler resolves these,
always" then we shouldn't conditionally emit a libcall.
This will still need to be worked around in the Linux kernel in order to
continue to support allmodconfig builds of the Linux kernel for this
target with older releases of clang.
Link: https://bugs.llvm.org/show_bug.cgi?id=28629
Link: https://github.com/ClangBuiltLinux/linux/issues/1438
Reviewed By: rengolin
Differential Revision: https://reviews.llvm.org/D108842
When the initial relationship between two pairs of values between
similar sections is ambiguous to commutativity, arguments to the
outlined functions can be passed in such that the order is incorrect,
causing miscompilations. This adds a canonical mapping to each
similarity section, so that we can maintain the relationship of global
value numbering from one section to another.
Added Tests:
Transforms/IROutliner/outlining-commutative-operands-opposite-order.ll
unittests/Analysis/IRSimilarityIdentifierTest.cpp - IRSimilarityCandidate:CanonicalNumbering
Reviewers: jroelofs, jpaquette, yroux
Differential Revision: https://reviews.llvm.org/D104143
This is another followup to D106591. Even if there is an
instruction that clobbers one of the loads, this doesn't matter if
it happens before the loads. Those instructions aren't affected by
the transform at all.
The gep-references-bb.ll is modified to preserve the spirit of the
test, as the store to @g no longer impacts the transform.
Differential Revision: https://reviews.llvm.org/D108782
We'd added support a while back from breaking the backedge if SCEV can prove the trip count is zero. However, we used the exact trip count which requires *all* exits be analyzeable. I noticed while writing test cases for another patch that this disallows cases where one exit is provably taken paired with another which is unknown. This patch adds the upper bound case.
We could use a symbolic max trip count here instead, but we use an isKnownNonZero filter (presumably for compile time?) for the first-iteration reasoning. I decided this was a more obvious incremental step, and we could go back and untangle the schemes separately.
Differential Revision: https://reviews.llvm.org/D108833
IIUC we can't emit `memcmp` between pointers in addressspaces,
doing so will trigger an assertion since the signature of the memcmp
will not match it's arguments (https://bugs.llvm.org/show_bug.cgi?id=48661).
This PR disables the attempt to merge icmps,
when the pointer is in an addressspace.
Reviewed By: #julialang, vtjnash
Differential Revision: https://reviews.llvm.org/D94813
Recursion can happen when we see a PHI use the second time or when we
look at a store value operand use again. We already visited the
potential copies and doing so again will just cause endless looping.
Reviewed By: kuter
Differential Revision: https://reviews.llvm.org/D108190
For now we do should not treat byval arguments as local copies performed
on the call edge, though, in general we should. To make that happen we
need to teach various passes, e.g., DSE, about the copy effect of a
byval. That would also allow us to mark functions only accessing byval
arguments as readnone again, atguably their acceses have no effect
outside of the function, like accesses to allocas.
Reviewed By: kuter
Differential Revision: https://reviews.llvm.org/D108140
Changes since aec08e:
* Adjust placement of a closing brace so that the general case actually runs. Turns out we had *no* coverage of the switch case. I added one in eae90fd.
* Drop .llvm.loop.* metadata from the new branch as there is no longer a loop to annotate.
Original commit message:
This special cases an unconditional latch and a conditional branch latch exit to improve codegen and test readability. I am hoping to reuse this function in the runtime unroll code, but without this change, the test diffs are far too complex to assess.
This fixes another reproducer from https://bugs.llvm.org/show_bug.cgi?id=51615
And again, the fix lies not in the code added in D105390
In this case, we completely don't check that the "broadcast-from-mem" we create
can actually fold the load. In this case, it's operand was not a load at all:
```
Combining: t16: v8i32 = vector_shuffle<0,u,u,u,0,u,u,u> t14, undef:v8i32
Creating new node: t29: i32 = undef
RepeatLoad:
t8: i32 = truncate t7
t7: i64 = extract_vector_elt t5, Constant:i64<0>
t5: v2i64,ch = load<(load (s128) from %ir.arg)> t0, t2, undef:i64
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t1: i64 = Register %0
t4: i64 = undef
t3: i64 = Constant<0>
Combining: t15: v8i32 = undef
```
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D108821
This makes sure, that the text section will have a 2-byte alignment, if
the +c extension is enabled.
Reviewed By: MaskRay, luismarques
Differential Revision: https://reviews.llvm.org/D102052
This adds an ELEN limit for fixed length vectors. This will scalarize
any elements larger than this. It will also disable some fractional
LMULs. For example, if ELEN=32 then mf8 becomes illegal, i32/f32
vectors can't use any fractional LMULs, i16/f16 can only use mf2,
and i8 can use mf2 and mf4.
We may also need something for the scalable vectors, but that has
interactions with the intrinsics and we can't scalarize a scalable
vector.
Longer term this should come from one of the Zve* features
Similar to D97976.
On Linux, most GCC installations are configured with
`--enable-gnu-unique-object` and such GCC emits `@gnu_unique_object` assembly.
The feature is highly controversial and disliked by many folks.
(On glibc DF_1_NODELETE is implicitly enabled and makes dlclose a no-op).
In llvm-project STB_GNU_UNIQUE is assembly only. Clang does not use STB_GNU_UNIQUE.
Use ELFOSABI_GNU to match GNU as behavior and avoid collision with other
OSABI binding values.
Reviewed By: jrtc27
Differential Revision: https://reviews.llvm.org/D107861
We try to forward a stored-once-constant-value from one global access
to another, but that's not safe if the constant value is an expression
that can trap.
The tests are reduced from the miscompile examples in:
https://llvm.org/PR47578
Differential Revision: https://reviews.llvm.org/D108771
For vectors that are exactly equal to getMaxSVEVectorSizeInBits, just use
AArch64SVEPredPattern::all, which can enable the use of unpredicated ptrue when available.
TestPlan: check-llvm
Differential Revision: https://reviews.llvm.org/D108706
This patch solely moves convert operation between SVE predicate pattern
and element number into two small functions. It's pre-commit patch for optimize
pture with known sve register width.
Differential Revision: https://reviews.llvm.org/D108705
x86_fp80 format allows values that do not fit any of IEEE-754 category.
Previously they were recognized by intrinsic __builtin_isnan as NaNs.
Now this intrinsic is implemented using instruction FXAM, which
distinguish between NaNs and unsupported values. It can make some
programs behave differently.
As a solution, this fix changes lowering of the intrinsic. If floating
point exceptions are ignored, llvm.isnan is lowered into unordered
comparison, as __buildtin_isnan was implemented earlier. In strictfp
functions the intrinsic is lowered using FXAM, which does not raise
exceptions even for signaling NaN, as required by IEEE-754 and C
standards.
Differential Revision: https://reviews.llvm.org/D108037
There are no paths into LowerFormalParms that have already specified
the sret register. We always materialize a virtual and then assign it
to the physical reg at the point of the return.
Differential Revision: https://reviews.llvm.org/D108762
Without this change only the preferred fusion opcode is tested
when attempting to combine FMA operations.
If both FMA and FMAD are available then FMA ops formed prior to
legalization will not be merged post legalization as FMAD becomes
the preferred fusion opcode.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D108619
Fixes PR51626.
The M68k requires that all instruction, word and long word reads are
aligned to word boundaries. From the 68020 onwards, there is a
performance benefit from aligning long words to long word boundaries.
The M68k uses the same data layout for pointers and integers.
In line with this, this commit updates the pointer data layout to
match the layout already set for 32-bit integers: 32:16:32.
Differential Revision: https://reviews.llvm.org/D108792
Exegesis is faulty and sometimes when measuring throughput^-1
produces snippets that have loop-carried dependencies,
which must be what caused me to incorrectly measure it originally.
After looking much more carefully, the inverse throughput should match
that of the MULX w/ reg op.
As per llvm-exegesis measurements.
In a future patch, a new set of amdgpu-no-* attributes will be
introduced to indicate when a function does not need an implicitly
passed input. This pass introduces new instances of these intrinsic
calls, and should remove the attributes if they were present before.
Switch to using BitIntegerState for each of the inputs, and invert
their meanings.
This now diverges more from the old AMDGPUAnnotateKernelFeatures, but
this isn't used yet anyway.
We only really want this to add the custom attributes. Theoretically
the regular transforms were already run at this point. Touching
undefined behavior breaks a lot of tests when this is enabled by
default, many of which are expecting to test handling of undef
operations.
amdgpu-calls and amdgpu-stack-objects don't really belong as
attributes, and are currently a hacky way of passing an analysis into
the DAG. These don't really belong in the IR, and don't really fit in
with the other attributes. Remove these to facilitate inverting the
pass.
I don't exactly understand the indirect call test changes. These tests
are using calls which are trivially replacable with a direct call, so
I'm not sure what the point is.
When doing Emscritpen EH, if SjLj is also enabled and used and if the
thrown exception has a possiblity being a longjmp instead of an
exception, we shouldn't swallow it; we should rethrow, or relay it. It
was done in D106525 and the code is here:
8441a8eea8/llvm/lib/Target/WebAssembly/WebAssemblyLowerEmscriptenEHSjLj.cpp (L858-L898)
Here is the pseudocode of that part: (copied from comments)
```
if (%__THREW__.val == 0 || %__THREW__.val == 1)
goto %tail
else
goto %longjmp.rethrow
longjmp.rethrow: ;; This is longjmp. Rethrow it
%__threwValue.val = __threwValue
emscripten_longjmp(%__THREW__.val, %__threwValue.val);
tail: ;; Nothing happened or an exception is thrown
... Continue exception handling ...
```
If the current BB (where the `invoke` is created) has successors that
has the current BB as its PHI incoming node, now that has to change to
`tail` in the pseudocode, because `tail` is the latest BB that is
connected with the next BB, but this was missing.
Reviewed By: tlively
Differential Revision: https://reviews.llvm.org/D108785
Generate btf_tag annotations for function parameters.
A field "annotations" is introduced to DILocalVariable, and
annotations are represented as an DINodeArray, similar to
DIComposite elements. The following example illustrates how
annotations are encoded in IR:
distinct !DILocalVariable(name: "info",, arg: 1, ..., annotations: !10)
!10 = !{!11, !12}
!11 = !{!"btf_tag", !"a"}
!12 = !{!"btf_tag", !"b"}
Differential Revision: https://reviews.llvm.org/D106620
Looks like the NoRegister has some effect on the final code that is generated. My guess is that some optimization kicks in at the end?
When I use -S to dump the assembly I get the correct version with 'shrq $3, %r8':
movq %r9, %r8
shrq $3, %r8
movsbl 2147450880(%r8), %r8d
But, when I disassemble the final binary I get RAX in stead of R8:
mov %r9,%r8
shr $0x3,%rax
movsbl 0x7fff8000(%r8),%r8d
Reviewed By: vitalybuka
Differential Revision: https://reviews.llvm.org/D108745
Reworked reordering algorithm. Originally, the compiler just tried to
detect the most common order in the reordarable nodes (loads, stores,
extractelements,extractvalues) and then fully rebuilding the graph in
the best order. This was not effecient, since it required an extra
memory and time for building/rebuilding tree, double the use of the
scheduling budget, which could lead to missing vectorization due to
exausted scheduling resources.
Patch provide 2-way approach for graph reodering problem. At first, all
reordering is done in-place, it doe not required tree
deleting/rebuilding, it just rotates the scalars/orders/reuses masks in
the graph node.
The first step (top-to bottom) rotates the whole graph, similarly to the previous
implementation. Compiler counts the number of the most used orders of
the graph nodes with the same vectorization factor and then rotates the
subgraph with the given vectorization factor to the most used order, if
it is not empty. Then repeats the same procedure for the subgraphs with
the smaller vectorization factor. We can do this because we still need
to reshuffle smaller subgraph when buildiong operands for the graph
nodes with lasrger vectorization factor, we can rotate just subgraph,
not the whole graph.
The second step (bottom-to-top) scans through the leaves and tries to
detect the users of the leaves which can be reordered. If the leaves can
be reorder in the best fashion, they are reordered and their user too.
It allows to remove double shuffles to the same ordering of the operands in
many cases and just reorder the user operations instead. Plus, it moves
the final shuffles closer to the top of the graph and in many cases
allows to remove extra shuffle because the same procedure is repeated
again and we can again merge some reordering masks and reorder user nodes
instead of the operands.
Also, patch improves cost model for gathering of loads, which improves
x264 benchmark in some cases.
Gives about +2% on AVX512 + LTO (more expected for AVX/AVX2) for {625,525}x264,
+3% for 508.namd, improves most of other benchmarks.
The compile and link time are almost the same, though in some cases it
should be better (we're not doing an extra instruction scheduling
anymore) + we may vectorize more code for the large basic blocks again
because of saving scheduling budget.
Differential Revision: https://reviews.llvm.org/D105020
This fixes some redundant move in return statement [-Wredundant-move] gcc 9.3.0
warnings.
This also fixes a minor coverity issue reported agaist class MCAOperand about
the lack of proper initialization for field Index.
No functional change intended.
This pattern
```
%elt = ... something ...
%undef = G_IMPLICIT_DEF
%vec = G_BUILD_VECTOR %elt, %undef, %undef, ... %undef
```
Can be selected to a SUBREG_TO_REG, assuming `%elt` and `%vec` have the same
register bank. We don't care about any of the bits in `%vec` aside from those
in `%elt`, which just happens to be the 0th element.
This is preferable to emitting `mov` instructions for every index.
This gives minor code size improvements on the test suite at -Os.
Differential Revision: https://reviews.llvm.org/D108773
My last change to the RegisterFile (PR51495) has introduced a bug in the logic
that allocates physical registers in the PRF.
In some cases, this bug could have triggered a nasty unsigned wrap in the number
of allocated registers, thus resulting in mca being stuck forever in a loop of
PRF availability checks.
Similar to what we do for add/sub/mul.
This can help remove some sext.w. There are some regressions on
some bswap tests, but I have an idea how to fix that for a follow up.
A new PACKW pattern is added to handle the new sext_inreg placement.
Differential Revision: https://reviews.llvm.org/D108663
Generate btf_tag annotations for DIGlobalVariable.
A field "annotations" is introduced to DIGlobalVariable, and
annotations are represented as an DINodeArray, similar to
DIComposite elements. The following example illustrates how
annotations are encoded in IR:
distinct !DIGlobalVariable(..., annotations: !10)
!10 = !{!11, !12}
!11 = !{!"btf_tag", !"a"}
!12 = !{!"btf_tag", !"b"}
Differential Revision: https://reviews.llvm.org/D106619
The Code Extractor does not provide an easy mechanism for determining the
inputs and outputs after extraction has occurred, this patch gives the
ability to pass in empty SetVectors to be filled with the inputs and
outputs if they need to be analyzed.
Added Tests:
- InputOutputMonitoring in unittests/Transforms/Utils/CodeExtractorTests.cpp
Reviewers: paquette
Differential Revision: https://reviews.llvm.org/D106991
On targets requiring VGPR alignment we may end up spilling an
unaligned register if we were partially spilled odd number of
leading lanes. The reminder will start with an odd register.
This problem is solved by inverting the order of lanes to
be spillied so that we start from the end.
Differential Revision: https://reviews.llvm.org/D108732
We can halve the number of mask constants by masking before shl
and after srl.
This can reduce the number of mov immediate or constant
materializations. Or reduce the number of constant pool loads
for X86 vectors.
I think we might be able to do something similar for bswap. I'll
look at it next.
Differential Revision: https://reviews.llvm.org/D108738
Even though https://bugs.llvm.org/show_bug.cgi?id=51615
appears to be introduced by D105390, the fix lies here.
We can not replace a wide volatile load with a broadcast-from-memory,
because that would narrow the load, which isn't legal for volatiles.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D108757
Since LICM has now unconditionally moved to MemorySSA based form, all
passes that run in same LPM as LICM need to preserve MemorySSA (i.e. our
downstream pipeline).
Added loop-mssa to all tests and perform -verify-memoryssa within
LoopPredication itself.
Differential Revision: https://reviews.llvm.org/D108724
Generate btf_tag annotations for DISubprogram types.
A field "annotations" is introduced to DISubprogram, and
annotations are represented as an DINodeArray, similar to
DIComposite elements. The following example illustrates how
annotations are encoded in IR:
distinct !DISubprogram(..., annotations: !10)
!10 = !{!11, !12}
!11 = !{!"btf_tag", !"a"}
!12 = !{!"btf_tag", !"b"}
Differential Revision: https://reviews.llvm.org/D106618
In the combination of addressing modes, when replacing the matched phi nodes,
sometimes the phi node to be replaced has been modified. For example,
there’s matcher set [A, B] and [C, A], which will have cyclic dependency:
A is replaced by B and C will be replaced by A. Because we tried to match new phi node
to another new phi node, we should ignore new phi nodes when mapping new phi node to old one.
Reviewed By: skatkov
Differential Revision: https://reviews.llvm.org/D108635
Reworked reordering algorithm. Originally, the compiler just tried to
detect the most common order in the reordarable nodes (loads, stores,
extractelements,extractvalues) and then fully rebuilding the graph in
the best order. This was not effecient, since it required an extra
memory and time for building/rebuilding tree, double the use of the
scheduling budget, which could lead to missing vectorization due to
exausted scheduling resources.
Patch provide 2-way approach for graph reodering problem. At first, all
reordering is done in-place, it doe not required tree
deleting/rebuilding, it just rotates the scalars/orders/reuses masks in
the graph node.
The first step (top-to bottom) rotates the whole graph, similarly to the previous
implementation. Compiler counts the number of the most used orders of
the graph nodes with the same vectorization factor and then rotates the
subgraph with the given vectorization factor to the most used order, if
it is not empty. Then repeats the same procedure for the subgraphs with
the smaller vectorization factor. We can do this because we still need
to reshuffle smaller subgraph when buildiong operands for the graph
nodes with lasrger vectorization factor, we can rotate just subgraph,
not the whole graph.
The second step (bottom-to-top) scans through the leaves and tries to
detect the users of the leaves which can be reordered. If the leaves can
be reorder in the best fashion, they are reordered and their user too.
It allows to remove double shuffles to the same ordering of the operands in
many cases and just reorder the user operations instead. Plus, it moves
the final shuffles closer to the top of the graph and in many cases
allows to remove extra shuffle because the same procedure is repeated
again and we can again merge some reordering masks and reorder user nodes
instead of the operands.
Also, patch improves cost model for gathering of loads, which improves
x264 benchmark in some cases.
Gives about +2% on AVX512 + LTO (more expected for AVX/AVX2) for {625,525}x264,
+3% for 508.namd, improves most of other benchmarks.
The compile and link time are almost the same, though in some cases it
should be better (we're not doing an extra instruction scheduling
anymore) + we may vectorize more code for the large basic blocks again
because of saving scheduling budget.
Differential Revision: https://reviews.llvm.org/D105020
SelectADDRParam was discovered as being dead 5 years ago and removed in
7b4ef068c6 but the unused ComplexPattern definition was left behind.
SelectADDRDWord has never existed as far as I can tell, even back when
AMDGPU was R600-only and called that.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D108758
It turns out that SchedWrite WriteIMulH was always assigned to the low half of
the result of a MULX (rather than to the high half).
To avoid confusion, this patch swaps the two MULX writes in the tablegen
definition of MULX32/64. That way, write names better describe what they
actually refer to; this also avoids further complications if in future we decide
to reuse the same MulH writes to also model other scalar integer multiply
instructions. I also had to swap the latency values for the two MULX writes to
make sure that the change is effectively an NFC. In fact, none of the existing
x86 tests were affected by this small refactoring.
This patch also fixes a bug in MCA: a wrong latency value was propagated for
instructions that perform multiple writes to a same register. This last issue
was found by Roman while testing MULX on targets that define a different latency
for the Low/High part of the result.
Differential Revision: https://reviews.llvm.org/D108727
Tell the cost model to use the scalable calculation for non-neon fixed vector.
This results in a cheaper cost for fixed-length SVE masked gathers/scatters
allowing the vectorizor to emit them more frequently.
This patch adds initial support to use facts from @llvm.assume calls. It
intentionally does not handle all possible cases to keep things simple
initially.
For now, the condition from an assume is made available on entry to the
containing block, if the assume is guaranteed to execute. Otherwise it
is only made available in the successor blocks.
Summary: This patch is trying to add support for llvm-readobj
--needed-libs option under XCOFF.
For XCOFF, the needed libraries can be found from the Import
File ID Name Table of the Loader Section.
Currently, I am using binary inputs in the test since yaml2obj
does not yet support for writing the Loader Section and the
import file table.
Reviewed By: jhenderson
Differential Revision: https://reviews.llvm.org/D106643
The change adds a switch to allow sample loader to use global pre-inliner's decision instead. The pre-inliner in llvm-profgen makes inline decision globally based on whole program profile and function byte size as cost proxy.
Since pre-inliner also adjusts/merges context profile based on its inline decision, honoring its inline decision in sample loader would lead to better post-inline profile quality especially for thinlto where cross module profile merging isn't possible without pre-inliner.
Minor fix in profile reader is also included. When pre-inliner is use, we now also turn off the default merging and trimming logic unless it's explicitly asked.
Differential Revision: https://reviews.llvm.org/D108677
Emscripten SjLj transformation is done in four steps. This will be
mostly the same for the soon-to-be-added Wasm SjLj; the step 1, 3, and 4
will be shared and there will be separate way of doing step 2.
1. Initialize `setjmpTable` and `setjmpTableSize` in the entry BB
2. Handle `setjmp` callsites
3. Handle `longjmp` callsites
4. Cleanup and update SSA
We initialize `setjmpTable` and `setjmpTableSize` in the entry BB. But
if the entry BB contains a `setjmp` call, some `setjmp` handling
transformation will also happen in the entry BB, such as calling
`saveSetjmp`.
This is fine for Emscripten SjLj but not for Wasm SjLj, because in Wasm
SjLj we will add a dispatch BB that contains a `switch` right after the
entry BB, from which we jump to one of post-`setjmp` BBs. And this
dispatch BB should precede all `setjmp` calls.
Emscripten SjLj (current):
```
entry:
%setjmpTable = ...
%setjmpTableSize = ...
...
call @saveSetjmp(...)
```
Wasm SjLj (follow-up):
```
entry:
%setjmpTable = ...
%setjmpTableSize = ...
setjmp.dispatch:
...
; Jump to the right post-setjmp BB, if we are returning from a
; longjmp. If this is the first setjmp call, go to %entry.split.
switch i32 %no, label %entry.split [
i32 1, label %post.setjmp1
i32 2, label %post.setjmp2
...
i32 N, label %post.setjmpN
]
entry.split:
...
call @saveSetjmp(...)
```
So in Wasm SjLj we split the entry BB to make the entry block only for
`setjmpTable` and `setjmpTableSize` initialization and insert a
`setjmp.dispatch` BB. (This part is not in this CL. This will be a
follow-up.) But note that Emscripten SjLj and Wasm SjLj share all
steps except for the step 2. If we only split the entry BB only for Wasm
SjLj, there will be one more `if`-`else` and the code will be more
complicated.
So this CL splits the entry BB in Emscripten SjLj and put only
initialization stuff there as follows:
Emscripten SjLj (this CL):
```
entry:
%setjmpTable = ...
%setjmpTableSize = ...
br %entry.split
entry.split:
...
call @saveSetjmp(...)
```
This is just done to share code with Wasm SjLj. It adds an unnecessary
branch but this will be removed in later optimization passes anyway.
This is in effect NFC, meaning the program behavior will not change, but
existing ll tests files have changed because the entry block was split.
The reason I upload this in a separate CL is to make the Wasm SjLj diff
tidier, because this changes many existing Emscripten SjLj tests, which
can be confusing for the follow-up Wasm SjLj CL.
Reviewed By: tlively
Differential Revision: https://reviews.llvm.org/D108729
Emscripten SjLj and (soon-to-be-added) Wasm SjLj transformation share
many steps:
1. Initialize `setjmpTable` and `setjmpTableSize` in the entry BB
2. Handle `setjmp` callsites
3. Handle `longjmp` callsites
4. Cleanup and update SSA
1, 3, and 4 are identical for Emscripten SjLj and Wasm SjLj. Only the
step 2 is different. This CL extracts the current Emscripten SjLj's
longjmp callsites handling into a function. The reason to make this a
separate CL is, without this, the diff tool cannot compare things well
in the presence of moved code and added code in the followup Wasm SjLj
CL, and it ends up mixing them together, making the diff unreadable.
Also fixes some typos and variable names. So far we've been calling the
buffer argument to `setjmp` and `longjmp` `jmpbuf`, but the name used in
the man page for those functions is `env`, so updated them to be
consistent.
Reviewed By: tlively
Differential Revision: https://reviews.llvm.org/D108728
Rename the M68kOperand::Type enumeration to KindTy to avoid ambiguity
with the Kind field when referencing enumeration values e.g.
`Kind::Value`.
This works around a compilation error under GCC 5, where GCC won't
lookup enum class values if you have a similarly named field
(see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60994).
The error in question is:
`M68kAsmParser.cpp:857:8: error: 'Kind' is not a class, namespace, or enumeration`
Differential Revision: https://reviews.llvm.org/D108723
The plan was to use `wasm.catch.exn` intrinsic to catch exceptions and
add `wasm.catch.longjmp` intrinsic, that returns two values (setjmp
buffer and return value), later to catch longjmps. But because we
decided not to use multivalue support at the moment, we are going to use
one intrinsic that returns a single value for both exceptions and
longjmps. And even if it's not for that, I now think the naming of
`wasm.catch.exn` is a little weird, because the intrinsic can still take
a tag immediate, which means it can be used for anything, not only
exceptions, as long as that returns a single value.
This partially reverts D107405.
Reviewed By: tlively
Differential Revision: https://reviews.llvm.org/D108683
This is another bug exposed by https://llvm.org/PR51612
(and the one that triggered the initial assertion) in the report.
That example was suppressed with:
985b48f183
...but these would still crash because we created nodes
like UADDO without the expected 2 output values.
Moved View.h and View.cpp from /tools/llvm-mca/Views/ to /lib/MCA/ and
/include/llvm/MCA/. This is so that targets can define their own Views within
the /lib/Target/ directory (so that the View can use backend functionality).
To enable these Views within mca, targets will need to add them to the vector of
Views returned by their target's CustomBehaviour::getViews() methods.
Differential Revision: https://reviews.llvm.org/D108520
There are 2 bugs here:
1. We were not checking uses of operand 2 (the false value of the select).
2. We were not checking for multiple uses of nodes that produce >1 result.
Correcting those is enough to avoid the crash in the reduced test based on:
https://llvm.org/PR51612
The additional use check on operand 0 (the condition value of the select)
should not strictly be necessary because we are only replacing one use
with another (whether it makes performance sense to do the transform with
that pattern is not clear). But as noted in the TODO, changing that
uncovers another bug.
Note: there's at least one more bug here - we aren't propagating EVTs
correctly, but I plan to fix that in another patch.
Add support for the GNU C style __attribute__((error(""))) and
__attribute__((warning(""))). These attributes are meant to be put on
declarations of functions whom should not be called.
They are frequently used to provide compile time diagnostics similar to
_Static_assert, but which may rely on non-ICE conditions (ie. relying on
compiler optimizations). This is also similar to diagnose_if function
attribute, but can diagnose after optimizations have been run.
While users may instead simply call undefined functions in such cases to
get a linkage failure from the linker, these provide a much more
ergonomic and actionable diagnostic to users and do so at compile time
rather than at link time. Users instead may be able use inline asm .err
directives.
These are used throughout the Linux kernel in its implementation of
BUILD_BUG and BUILD_BUG_ON macros. These macros generally cannot be
converted to use _Static_assert because many of the parameters are not
ICEs. The Linux kernel still needs to be modified to make use of these
when building with Clang; I have a patch that does so I will send once
this feature is landed.
To do so, we create a new IR level Function attribute, "dontcall" (both
error and warning boil down to one IR Fn Attr). Then, similar to calls
to inline asm, we attach a !srcloc Metadata node to call sites of such
attributed callees.
The backend diagnoses these during instruction selection, while we still
know that a call is a call (vs say a JMP that's a tail call) in an arch
agnostic manner.
The frontend then reconstructs the SourceLocation from that Metadata,
and determines whether to emit an error or warning based on the callee's
attribute.
Link: https://bugs.llvm.org/show_bug.cgi?id=16428
Link: https://github.com/ClangBuiltLinux/linux/issues/1173
Reviewed By: aaron.ballman
Differential Revision: https://reviews.llvm.org/D106030
In-register structure returns are not special, and handled by lowering
to multiple-value tuples. We can tail-call from non-sret fns to
structure-returning functions, except on i686 where the sret pointer
is callee-pop.
Differential Revision: https://reviews.llvm.org/D105807
With spilling into AGPRs enabled we cannot reliably predict
if we need to save FP or not. We may finally spill everything
into AGPRs and never touch stack. In this case we still may
save FP. This is deficiency but not an error, so avoid the
assert.
Differential Revision: https://reviews.llvm.org/D107404
The instruction extractelement/extractvalue are not required to
be scheduled since they only depend on the source vector/aggregate (with
constant indices), smae applies to the parent basic block checks.
Improves compile time and saves scheduling budget.
Differential Revision: https://reviews.llvm.org/D108703
We have "-profile-isfs" internal option for text, binary, and
compactbinary format (mostly for debug and test purpose). We
need to set the related flag in FunctionSamples so that ProfileIsFS
is written to the header in extbinary format.
Differential Revision: https://reviews.llvm.org/D108707
This is a follow up diff for BinarySizeContextTracker to track zero size for fully optimized inlinee. When an inlinee is fully optimized away, we won't be able to get its size through symbolizing instructions, hence we will treat the corresponding context size as unknown. However by traversing the inlined probe forest, we know what're original inlinees regardless of optimization. If a context show up in inlined probes, but not during symbolization, we know that it's fully optimized away hence its size is zero instead of unknown. It should provide more accurate size cost estimation for pre-inliner to make better inline decisions in llvm-profgen.
Differential Revision: https://reviews.llvm.org/D108350
The implementation uses the int_asan_check_memaccess intrinsic to instrument the code. The intrinsic is replaced by a call to a function which performs the access check. The generated function names encode the input register name as a number using Reg - X86::NoRegister formula.
Reviewed By: vitalybuka
Differential Revision: https://reviews.llvm.org/D107850
Description: This change enables the compare operations to be selected to SALU/VALU form
dependent of the SDNode divergence flag.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D106079
This patch replaces the SpecialRegisters field with a unique_ptr instead of a raw pointer. This is better practice, and allows us to remove the definition of the dtor for the SystemZSubtarget class.
Reviewed By: uweigand, Kai
Differential Revision: https://reviews.llvm.org/D108639
Before this patch, WriteIMulH reported a latency value which is correct for the
RR variant of MULX, but not for the RM variant.
This patch fixes the issue by introducing a new WriteIMulHLd, which is meant to
be used only by the RM variant of MULX.
Differential Revision: https://reviews.llvm.org/D108701
A single smov instruction is capable of moving from a vector register while performing
the sign-extend during said move, rather than each step being performed by separate instructions.
Differential Revision: https://reviews.llvm.org/D108633
InstrRefBasedLDV is marginally slower than VarlocBasedLDV when analysing
optimised code -- however, it's much slower when analysing code compiled
-O0.
To avoid this: don't use instruction referencing for -O0 functions. In the
"pure" case of unoptimised code, this won't really harm the debugging
experience because most variables won't have been promoted off the stack,
so can't go missing. It becomes more complicated when optimised code is
inlined into functions marked optnone; however these are rare, and as -O0
doesn't run many optimisations there should be little damage to the debug
experience as a result.
I've taken the opportunity to refactor testing for instruction-referencing
into a MachineFunction method, which seems the most appropriate place to
put it.
Differential Revision: https://reviews.llvm.org/D108585
Makes patterns added for gfx90a usable with the gfx10 versions of the
insts.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D108654
Change-Id: I86167bf6b4823f975f74ccb619bd6190331ba16b
Support for peeling with multiple exit blocks was added in D63921/77bb3a486fa6.
So far it has only been enabled for loops where all non-latch exits are
'de-optimizing' exits (D63923). But peeling of multi-exit loops can be
highly beneficial in other cases too, like if all non-latch exiting
blocks are unreachable.
The motivating case are loops with runtime checks, like the C++ example
below. The main issue preventing vectorization is that the invariant
accesses to load the bounds of B is conditionally executed in the loop
and cannot be hoisted out. If we peel off the first iteration, they
become dereferenceable in the loop, because they must execute before the
loop is executed, as all non-latch exits are terminated with
unreachable. This subsequently allows hoisting the loads and runtime
checks out of the loop, allowing vectorization of the loop.
int sum(std::vector<int> *A, std::vector<int> *B, int N) {
int cost = 0;
for (int i = 0; i < N; ++i)
cost += A->at(i) + B->at(i);
return cost;
}
This gives a ~20-30% increase of score for Geekbench5/HDR on AArch64.
Note that this requires a follow-up improvement to the peeling cost
model to actually peel iterations off loops as above. I will share that
shortly.
Also, peeling of multi-exits might be beneficial for exit blocks with
other terminators, but I would like to keep the scope limited to known
high-reward cases for now.
I removed the option to disable peeling for multi-deopt exits because
the code is more general now. Alternatively, the option could also be
generalized, but I am not sure if there's much value in the option?
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D108108
This change fixes issue found by Markus: https://reviews.llvm.org/rG11338e998df1
Before this patch following code was transformed to memmove:
for (int i = 15; i >= 1; i--) {
p[i] = p[i-1];
sum += p[i-1];
}
However load from p[i-1] is used not only by store to p[i] but also by sum computation.
Therefore we cannot emit memmove in loop header.
Differential Revision: https://reviews.llvm.org/D107964
For ISD::EXTRACT_SUBVECTOR, its second operand must be a constant
multiple of the known-minimum vector length of the result type.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D107795
Stack slot colouring adds "weight" to slots if a non-dbg-value instruction
refers to it. This, unfortunately, means that DBG_PHI instructions can have
an effect on codegen. The fix is very simple, replace isDebugValue with
isDebugInstr.
The regression test contains a scenario that reproduces this problem; I've
represented both normal-debug mode and instr-ref debug mode instructions
in comment lines prefixed with AAAAAA and BBBBBB, and un-comment them with
sed to test that the two different modes produce the same behaviour.
Differential Revision: https://reviews.llvm.org/D108627
1. Splitted out some parts of R600 target to separate modules/headers.
2. Reduced some include lists in headers.
3. Found and fixed issue with override `GCNTargetMachine::getSubtargetImpl()`
and `R600TargetMachine::getSubtargetImpl()` had different return value type
than base class.
4. Minor forward declarations cleanup.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D108596
The sext_inreg_of_load combine did not have the isLegalOrBeforeLegalizer check,
leading to the generation of potentially illegal G_SEXTLOADs when run after legalization.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D108626
On some AMDGPU subtargets, copying to and from AGPR registers using another
AGPR register is not possible. A intermediate VGPR register is needed for AGPR
to AGPR copy. This is an issue when machine copy propagation forwards a
COPY $agpr, replacing a COPY $vgpr which results in $agpr = COPY $agpr. It is
removing a cross class copy that may have been optimized by previous passes and
potentially creating an unoptimized cross class copy later on.
To avoid this issue, check CrossCopyRegClass if a different register class will
be needed for the copy. If so then avoid forwarding the copy when the
destination does not match the desired register class and if the original copy
already matches the desired register class.
Issue seen while attempting to optimize another AGPR to AGPR issue:
Live-ins: $agpr0
$vgpr0 = COPY $agpr0
$agpr1 = V_ACCVGPR_WRITE_B32 $vgpr0
$agpr2 = COPY $vgpr0
$agpr3 = COPY $vgpr0
$agpr4 = COPY $vgpr0
After machine-cp:
$vgpr0 = COPY $agpr0
$agpr1 = V_ACCVGPR_WRITE_B32 $vgpr0
$agpr2 = COPY $agpr0
$agpr3 = COPY $agpr0
$agpr4 = COPY $agpr0
Machine-cp propagated COPY $agpr0 to replace $vgpr0 creating 3 AGPR to AGPR
copys. Later this creates a cross-register copy from AGPR->VGPR->AGPR for each
copy when the prior VGPR->AGPR copy was already optimal.
Reviewed By: lkail, rampitec
Differential Revision: https://reviews.llvm.org/D108011
The NS==0 condition used by D103717 missed a corner case: if the current copy
does not have a hash suffix (e.g. weak_odr), a copy with value profiling (with a
different CFG) may exist. This is super rare, but is possible with pre-inlining
PGO instrumentation (which can make a weak_odr function inlines its callees
differently, sometimes with value profiling while sometimes without).
If the current copy with private profd is prevailing, the non-prevailing copy
may get an undefined symbol if a caller inlining the non-prevailing function
references its profd. If the other copy with non-private profd is prevailing,
the current copy may cause a "relocation to discarded section" linker error.
The fix is straightforward: just keep non-private profd in such a `DataReferencedByCode` case.
With this change, a stage 2 (`-DLLVM_TARGETS_TO_BUILD=X86 -DLLVM_BUILD_INSTRUMENTED=IR`)
clang is 0.08% larger (172431496/172286720-1).
`stat -c %s **/*.o | awk '{s+=$1}END{print s}' is 0.026% larger.
The majority of D103717's benefits remains.
Reviewed By: xur
Differential Revision: https://reviews.llvm.org/D108432
This reverts commit f653beea88.
It broke Windows coverage-inline.cpp because link.exe has a limitation
that external symbols in IMAGE_COMDAT_SELECT_ASSOCIATIVE don't work.
It essentially dropped the previous size optimization for coverage
because coverage doesn't rename comdat by default.
Needs more investigation what we should do.
We update SSA in two steps in Emscripten SjLj:
1. Rewrite uses of `setjmpTable` and `setjmpTableSize` variables and
place `phi`s where necessary, which are updated where we call
`saveSetjmp`.
2. Do a whole function level SSA update for all variables, because we
split BBs where `setjmp` is called and there are possibly variable
uses that are not dominated by a def.
(See 955b91c19c/llvm/lib/Target/WebAssembly/WebAssemblyLowerEmscriptenEHSjLj.cpp (L1314-L1324))
We have been using `SSAUpdater` to do this, but `SSAUpdaterBulk` class
was added after this pass was first created, and for the step 2 it looks
like a better alternative with a possible performance benefit. Not sure
the author is aware of it, but `SSAUpdaterBulk` seems to have a
limitation: it cannot handle a use within the same BB as a def but
before it. For example:
```
... = %a + 1
%a = foo();
```
or
```
%a = %a + 1
```
The uses `%a` in RHS should be rewritten with another SSA variable of
`%a`, most likely one generated from a `phi`. But `SSAUpdaterBulk`
thinks all uses of `%a` are below the def of `%a` within the same BB.
(`SSAUpdater` has two different functions of rewriting because of this:
`RewriteUse` and `RewriteUseAfterInsertions`.) This doesn't affect our
usage in the step 2 because that deals with possibly non-dominated uses
by defs after block splitting. But it does in the step 1, which still
uses `SSAUpdater`.
But this CL also simplifies the step 1 by using `make_early_inc_range`,
removing the need to advance the iterator before rewriting a use.
This is NFC; the test changes are just the order of PHI nodes.
Reviewed By: dschuff
Differential Revision: https://reviews.llvm.org/D108583
This CL is small, but the description can be a little long because I'm
trying to sum up the status quo for Emscripten/Wasm EH/SjLj options.
First, this CL adds an option for Wasm SjLj (`-wasm-enable-sjlj`), which
handles SjLj using Wasm EH. The implementation for this will be added as
a followup CL, but this adds the option first to do error checking.
This also adds an option for Wasm EH (`-wasm-enable-eh`), which has been
already implemented. Before we used `-exception-model=wasm` as the same
meaning as enabling Wasm EH, but after we add Wasm SjLj, it will be
possible to use Wasm EH instructions for Wasm SjLj while not enabling
EH, so going forward, to use Wasm EH, `opt` and `llc` will need this
option. This only affects `opt` and `llc` command lines and does not
affect Emscripten user interface.
Now we have two modes of EH (Emscripten/Wasm) and also two modes of SjLj
(also Emscripten/Wasm). The options corresponding to each of are:
- Emscripten EH: `-enable-emscripten-cxx-exceptions`
- Emscripten SjLj: `-enable-emscripten-sjlj`
- Wasm EH: `-wasm-enable-eh -exception-model=wasm`
`-mattr=+exception-handling`
- Wasm SjLj: `-wasm-enable-sjlj -exception-model=wasm`
`-mattr=+exception-handling`
The reason Wasm EH/SjLj's options are a little complicated are
`-exception-model` and `-mattr` are common LLVM options ane not under
our control. (`-mattr` can be omitted if it is embedded within the
bitcode file.)
And we have the following rules of the option composition:
- Emscripten EH and Wasm EH cannot be turned on at the same itme
- Emscripten SjLj and Wasm SjLj cannot be turned on at the same time
- Wasm SjLj should be used with Wasm EH
Which means we now allow these combinations:
- Emscripten EH + Emscripten SjLj: the current default in `emcc`
- Wasm EH + Emscripten SjLj:
This is allowed, but only as an interim step in which we are testing
Wasm EH but not yet have a working implementation of Wasm SjLj. This
will error out (D107687) in compile time if `setjmp` is called in a
function in which Wasm exception is used.
- Wasm EH + Wasm SjLj:
This will be the default mode later when using Wasm EH. Currently Wasm
SjLj implementation doesn't exist, so it doesn't work.
- Emscripten EH + Wasm SjLj will not work.
This CL moves these error checking routines to
`WebAssemblyPassConfig::addIRPasses`. Not sure if this is an ideal place
to do this, but I couldn't find elsewhere. Currently some checking is
done within LowerEmscriptenEHSjLj, but these checks only run if
LowerEmscriptenEHSjLj runs so it may not run when Wasm EH is used. This
moves that to `addIRPasses` and adds some more checks.
Currently LowerEmscriptenEHSjLj pass is responsible for Emscripten EH
and Emscripten SjLj. Wasm EH transformations are done in multiple
places, including WasmEHPrepare, LateEHPrepare, and CFGStackify. But in
the followup CL, LowerEmscriptenEHSjLj pass will be also responsible for
a part of Wasm SjLj transformation, because WasmSjLj will also be using
several Emscripten library functions, and we will be sharing more than
half of the transformation to do that between Emscripten SjLj and Wasm
SjLj.
Currently we have `-enable-emscripten-cxx-exceptions` and
`-enable-emscripten-sjlj` but these only work for `llc`, because for
`llc` we feed these options to the pass but when we run the pass using
`opt` the pass will be created with no options and the default options
will be used, which turns both Emscripten EH and Emscripten SjLj on.
Now we have one more SjLj option to care for, LowerEmscriptenEHSjLj pass
needs a finer way to control these options. This CL removes those
default parameters and make LowerEmscriptenEHSjLj pass read directly
from command line options specified. So if we only run
`opt -wasm-lower-em-ehsjlj`, currently both Emscripten EH and Emscripten
SjLj will run, but with this CL, none will run unless we additionally
pass `-enable-emscripten-cxx-exceptions` or `-enable-emscripten-sjlj`,
or both. This does not affect users; this only affects our `opt` tests
because `emcc` will not call either `opt` or `llc`. As a result of this,
our existing Emscripten EH/SjLj tests gained one or both of those
options in their `RUN` lines.
Reviewed By: dschuff
Differential Revision: https://reviews.llvm.org/D107685
Fixes PR51605 in which a DAG combine and legalization sequence generated
out-of-range constants in BUILD_VECTOR lanes. In the v16i8 case, the constants
were 255, which would be in range if DAG ISel used unsigned constants, but it is
out of range because DAG ISel uses signed constants.
Differential Revision: https://reviews.llvm.org/D108669
This reverts commit 67bf3ac744.
The reason is that this change is now superseded by 04fb9b729a which fixes the
underlying problem in the selector. Now it's fine to generate truncating FP stores
since the selector code will just generate subreg copies to handle them.
When the tablegen patterns fail to select a truncating scalar FPR store,
our manual selection code also failed to handle it silently, trying to
generate an invalid copy. Fix this by adding support in the manual code
to generate a proper subreg copy before selecting a non-truncating store.
The NS==0 condition used by D103717 missed a corner case: if the current copy
does not have a hash suffix (e.g. weak_odr), a copy with value profiling (with a
different CFG) may exist. This is super rare, but is possible with pre-inlining
PGO instrumentation (which can make a weak_odr function inlines its callees
differently, sometimes with value profiling while sometimes without).
If the current copy with private profd is prevailing, the non-prevailing copy
may get an undefined symbol if a caller inlining the non-prevailing function
references its profd. If the other copy with non-private profd is prevailing,
the current copy may cause a "relocation to discarded section" linker error.
The fix is straightforward: just keep non-private profd in this case.
With this change, a stage 2 (`-DLLVM_TARGETS_TO_BUILD=X86 -DLLVM_BUILD_INSTRUMENTED=IR`)
clang is 0.08% larger (172431496/172286720-1).
`stat -c %s **/*.o | awk '{s+=$1}END{print s}' is 0.026% larger.
The majority of D103717's benefits remains.
Reviewed By: xur
Differential Revision: https://reviews.llvm.org/D108432
This was previously committed in 914836b, and reverted due to confusion on the status of the review.
Differential Revision: https://reviews.llvm.org/D108601
Patch 9588b685c6 introduced dependency on ASAN. But it didn't
explicitly put LLVMInstrumentation as one of the library dependencies
such that the build will fail if we're building LLVM as shared libraries
(i.e. -DBUILD_SHARED_LIBS=ON).
This patch explicitly links X86CodeGen against the Instrumentation
component.
Differential Revision: https://reviews.llvm.org/D108662
This implementation allows mca to model the desired behaviour of the s_waitcnt
instruction. This patch also adds the RetireOOO flag to the AMDGPU instructions
within the scheduling model. This flag is only used by mca and allows
instructions to finish out-of-order which helps mca's simulations more closely
model the actual device.
Differential Revision: https://reviews.llvm.org/D104730
These are similar to the rotate pattern added with:
dcf659e821
...but we don't have guard ops on the shift amount,
so we don't canonicalize to the intrinsic.
declare void @llvm.assume(i1)
define i32 @src(i32 %shamt, i32 %bitwidth) {
; subtract must be in range of bitwidth
%lt = icmp ule i32 %bitwidth, 32
call void @llvm.assume(i1 %lt)
%r = lshr i32 -1, %shamt
%s = sub i32 %bitwidth, %shamt
%l = shl i32 -1, %s
%o = or i32 %r, %l
ret i32 %o
}
define i32 @tgt(i32 %shamt, i32 %bitwidth) {
ret i32 -1
}
https://alive2.llvm.org/ce/z/aF7WHx
The implementation uses the int_asan_check_memaccess intrinsic to instrument the code. The intrinsic is replaced by a call to a function which performs the access check. The generated function names encode the input register name as a number using Reg - X86::NoRegister formula.
Reviewed By: vitalybuka
Differential Revision: https://reviews.llvm.org/D107850
Currently isReallyTriviallyReMaterializableGeneric() implementation
prevents rematerialization on any virtual register use on the grounds
that is not a trivial rematerialization and that we do not want to
extend liveranges.
It appears that LRE logic does not attempt to extend a liverange of
a source register for rematerialization so that is not an issue.
That is checked in the LiveRangeEdit::allUsesAvailableAt().
The only non-trivial aspect of it is accounting for tied-defs which
normally represent a read-modify-write operation and not rematerializable.
The test for a tied-def situation already exists in the
/CodeGen/AMDGPU/remat-vop.mir,
test_no_remat_v_cvt_f32_i32_sdwa_dst_unused_preserve.
The change has affected ARM/Thumb, Mips, RISCV, and x86. For the targets
where I more or less understand the asm it seems to reduce spilling
(as expected) or be neutral. However, it needs a review by all targets'
specialists.
Differential Revision: https://reviews.llvm.org/D106408
1) Just mark this case as legal because it can just be a copy.
2) Ensure the copy in the existing code actually gets selected. Without doing
this, we'll crash because the destination won't have a register class.
This fell back 35 times in a build of clang with GISel for AArch64.
Differential Revision: https://reviews.llvm.org/D108610
The IRPGOFlag symbol (__llvm_profile_raw_version) is dropped when
identified as non-prevailing for either regular or thin LTO during
the mixed-LTO mode compilation. This happens in the module where
IRPGOFlag is marked as non-prevailing. This variable
is emitted in the final object from the prevailing module.
This is still problematic because we currently query this symbol
to coordinate some actions between PGOInstrumentation pass
and InstrProfiling lowering pass, like whether to do value
profiling, whether to do comdat renaming.
This problem is bought up by YolandaCY in
https://reviews.llvm.org/D107034
YolandCY reported unresolved symbol linker errors in
CSPGO instrumentation build for chromium.
This patch let LTO retain IRPGOFlag decl by adding it to
CompilerUsed list and relax the check in isIRPGOFlagSet() when
doing the InstrProfiling lowering.
The test case in the patch is from D107034
<https://reviews.llvm.org/D107034>.
Differential Revision: https://reviews.llvm.org/D108581
Reuse the selection code from the ld2 case. This is similar to how SDAG handles
things in AArch64ISelDAGToDAG. (See SelectLoad)
This fell back ~100 times while building clang with GISel enabled for AArch64.
Factoring out the gross subreg copy part ought to make selecting the rest of
this family fairly easy.
Differential Revision: https://reviews.llvm.org/D108600
If we no an addrec doesn't self-wrap, the increment is strictly positive, and the start value is the smallest representable value, then we know that the corresponding wrap type can not occur.
Differential Revision: https://reviews.llvm.org/D108601
We don't have any vXi8 shift instructions (other than on XOP which is handled separately), so replace the shl(x,1) -> add(x,x) fold with shl(x,1) -> add(freeze(x),freeze(x)) to avoid the undef issues identified in PR50468.
Split off from D106675 as I'm still looking at whether we can fix the vXi16/i32/i64 issues with the D106679 alternative.
Differential Revision: https://reviews.llvm.org/D108139
One of the cases identified in PR45116 - we don't need to limit load combines to ABI alignment, we can use allowsMemoryAccess - which tests using getABITypeAlign, but also checks if a target permits (fast) misaligned memory loads by checking allowsMisalignedMemoryAccesses as a fallback.
One of the cases identified in PR45116 - we don't need to limit load combines (in this case for fp->int load/store copies) to ABI alignment, we can use allowsMemoryAccess - which tests using getABITypeAlign, but also checks if a target permits (fast) misaligned memory loads by checking allowsMisalignedMemoryAccesses as a fallback.
Differential Revision: https://reviews.llvm.org/D108318
One of the cases identified in PR45116 - we don't need to limit load combines (in this case for ISD::BUILD_PAIR) to ABI alignment, we can use allowsMemoryAccess - which tests using getABITypeAlign, but also checks if a target permits (fast) misaligned memory loads by checking allowsMisalignedMemoryAccesses as a fallback.
This helps in particular for 32-bit X86 cases loading 64-bit size data, reducing codegen diffs vs x86_64.
Differential Revision: https://reviews.llvm.org/D108307
Fixes PR51580.
Register masks will now be printed as 'movem.l (%sp), %a0-%a5/%d5'
for example and can now be parsed in the same format.
Previously the printed syntax was 'movem.l (%sp), %a0-%a5,%d', which
didn't match prior art and was too ambiguous to easily parse.
Differential Revision: https://reviews.llvm.org/D108597
Over in D105657, we started dropping instruction numbers (that become
variable locations) from call instructions, as we can't correctly represent
the x87 FP stack. Unfortunately, it turns out that the "special FP
instructions" that this pass transforms includes "every call instruction"
[0]. Thus, we've ended up dropping all return values from all calls. Ouch.
This patch adds a filter: only drop instruction numbers from calls if they
return something on the FP stack. Seeing how LLVM only allows a single
return value, this should drop instruction numbers on anything that returns
a float, and nothing else.
Rather than writing a new test, I've modified the original one to have a
positive and negative case: drop instruction number on a call with an
FP-stack modification, keep it on a plain call.
Differential Revision: https://reviews.llvm.org/D108580
This does the same as D96259, but for ARM, just like AArch64,
using the same comment char as for ELF and MinGW mode.
As the assembly input/output of LLVM is GAS style, trying to
match what MS armasm.exe does isn't needed (because the comment
char used is the least concern when it comes to that; all
directives differ too). If a separate armasm compatible mode
is implemented, it can use its own comment style (just like
llvm-ml implements MS ml.exe compatible assembly parsing).
This fixes building compiler-rt assembly files for ARM in MSVC
mode.
The updated testcase literals-comments.s was only intended to
make sure that '#' isn't interpreted as a comment char.
Differential Revision: https://reviews.llvm.org/D107251
Add `ashr` instruction to the DAG post-dominated by `trunc`, allowing
`TruncInstCombine` to reduce bitwidth of expressions containing
these instructions.
We should be shifting by less than the target bitwidth.
Also it is sufficient to require that all truncated bits
of the value-to-be-shifted are sign bits (all zeros or ones) and
one sign bit is left untruncated: https://alive2.llvm.org/ce/z/Ajo2__
Part of https://reviews.llvm.org/D107766
Differential Revision: https://reviews.llvm.org/D108355
This is pretty similar to the ST2 selection code in
`AArch64InstructionSelector::selectIntrinsicWithSideEffects`.
This is a GISel equivalent of the ld2 case in `AArch64DAGToDAGISel::Select`.
There's some weirdness there that appears here too (e.g. using ld1 for scalar
cases, which are 1-element vectors in SDAG.)
It's a little gross that we have to create the copy and then select it right
after, but I think we'd need to refactor the existing copy selection code
quite a bit to do better.
This was falling back while building llvm-project with GISel for AArch64.
Differential Revision: https://reviews.llvm.org/D108590
DWARFDie::getDeclFile(...) previously only supported getting the DW_AT_decl_file if the DIE itself contained the DW_AT_decl_file attribute, or if the DIE had a DW_AT_abstract_origin that pointed to another DIE that had a DW_AT_decl_file. This patch allows the function to get the right attribute value if there is a DW_AT_specification that points to another DIE. We also test that if a DW_AT_abtract_origin or DW_AT_specification points to a DIE in another CU with a DW_FORM_ref_addr, that the right line table is used to extract the file index.
Full tests were added for the following cases:
- DIE has a DW_AT_decl_file attribute
- DIE has a DW_AT_abtract_origin that points to another die in the same CU
- DIE has a DW_AT_abtract_origin that points to another die in another CU
- DIE has a DW_AT_specification that points to another die in the same CU
- DIE has a DW_AT_specification that points to another die in another CU
Differential Revision: https://reviews.llvm.org/D108480
When using final reward (which is now the default), we were skipping
logging decisions that were leading to callee deletion. This fixes that.
Differential Revision: https://reviews.llvm.org/D108587
This was already the case, but the recent change (957334382c) altered
the behavior on some of our bots where __unw_add_dynamic_fde is not
found. This restores the prior behavior on Darwin while also retaining
the new behavior from that change.
This is a re-try of 3aa009cc87 which was reverted at
9577fac0fd because it caused an infinite loop.
For the extra test case, either re-ordering the transforms
or adding the extra clause to avoid sub-of-sub is enough
to prevent the infinite compile, but I'm doing both to be
safer.
Original commit message:
The motivation was to get min/max intrinsics to parity
with cmp+select idioms, but this unlocks a few more
folds because isFreeToInvert recognizes add/sub with
constants too.
In the min/max example, we have too many extra uses
for smaller folds to improve things, but this fold
is able to eliminate uses even though we can't reduce
the number of instructions.
Intended to be NFC. ARM/AArch64 don't appear to need adjustment.
TargetMachine::shouldAssumeDSOLocal is expected to be very simple, ideally
matching isDSOLocal(). The IR producers are expected to set dso_local correctly.
(While some may think this function can make producers' work easier, the
function is really not in a good position to set dso_local. See the various
special cases we duplicate from clang CodeGenModule.cpp.)
Reviewed By: mstorsjo
Differential Revision: https://reviews.llvm.org/D108514
Reverted (manually due to merge conflicts) while regressions reported on PR51540 are investigated
As noticed on D106352, after we've folded "(select C, (gep Ptr, Idx), Ptr) -> (gep Ptr, (select C, Idx, 0))" if the inner Ptr was also a (now one use) gep we could then merge the geps, using the sum of the indices instead.
I've limited this to basic 2-op geps - a more general case further down InstCombinerImpl.visitGetElementPtrInst doesn't have the one-use limitation but only creates the add if it can be created via SimplifyAddInst.
https://alive2.llvm.org/ce/z/f8pLfD (Thanks Roman!)
Differential Revision: https://reviews.llvm.org/D106450
It appears that the Read operand for stores was being placed on the
first operand (the stored value) not the address base. This adds a
ReadST for the stored value operand, allowing the ReadAdrBase to
correctly act upon the address.
Differential Revision: https://reviews.llvm.org/D108287
This is a followup to D106591. MergeICmps currently only allows
sinking the loads past either instructions that don't write to
memory at all, or simple loads/stores that don't modify the memory
the loads access.
The "simple loads/stores" part of this check doesn't seem necessary
to me -- AA isModRef() already accurately models any operation
that may clobber the memory. For example, in the adjusted test case
the transform is still fine if the call to @foo() isn't readonly,
but inaccessiblememonly -- in both cases, the call cannot modify
the loaded memory.
Differential Revision: https://reviews.llvm.org/D108517
D106408 enables rematerialization of instructions with virtual
register uses. That has uncovered the bug in the allUsesAvailableAt
implementation: https://bugs.llvm.org/show_bug.cgi?id=51516.
In the majority of cases canRematerializeAt() called to check if
an instruction can be rematerialized before the given UseIdx.
However, SplitEditor::enterIntvAtEnd() calls it to rematerialize
an instruction at the end of a block passing LIS.getMBBEndIdx()
into the check. In the testcase from the bug it has attempted to
rematerialize ADDXri after STRXui in bb.17. The use operand %55
of the ADD is killed by the STRX but that is undetected by the check
because it adjusts passed UseIdx to the reg slot, before the kill.
The value is dead at the index passed to the check however.
This change uses a later of passed UseIdx and its reg slot. This
shall be correct because if are checking an availability of operands
before an instruction that instruction cannot be the one defining
these operands. If we are checking for late rematerialization we
are really interested if operands live past the instruction.
The bug is not exploitable without D106408 but needed to reland
reverted D106408.
Differential Revision: https://reviews.llvm.org/D108475
After c063946476 usage of R31 doesn't necessarily mean
that alloca is used. The `TracebackTable::IsAllocaUsedMask` flag should be set only
when R31 is used as a frame pointer.
On AIX the `function calls alloca' bit seems to be set whenever R31 is
set up as a frame pointer, even when there is no alloca call.
Reviewed By: lkail
Differential Revision: https://reviews.llvm.org/D108141
When a function has no line table, but does have debug info (DW_TAG_subprogram), we fall back to creating a line table with a single line entry that has the start address of the function and the source file and line of the function declaration. The bug in this code was that we might have a DW_TAG_subprogram that uses a DW_AT_specification or DW_AT_abstract_origin that points to another DIE, and that DIE might be in another compile unit. The bug was we were grabbing the file index value from the DIE, and that index could be from the other DIE in another compile unit that has its own and compleltely different file table, so we might be using a file index from one compile unit with the file table from another. This was causing a crash in llvm-gsymuil when run against dSYM files. dsymutil, the Apple DWARF linker, will often unique types and can end up with more absolute references across different compile units.
The fix is to use the DWARFDie::getDeclFile(...) accessor as it does fetch this information correctly.
Differential Revision: https://reviews.llvm.org/D108497
It would waste time to specialize a function which would inline finally.
This patch did two things:
- Don't specialize functions which are always-inline.
- Don't spescialize functions whose lines of code are less than threshold
(100 by default).
For spec2017int, this patch could reduce the number of specialized
functions by 33%. Then the compile time didn't increase for every
benchmark.
Reviewed By: SjoerdMeijer, xbolva00, snehasish
Differential Revision: https://reviews.llvm.org/D107897
The following scalar FP instructions are legal in streaming mode:
0101 1110 xx1x xxxx 11x1 11xx xxxx xxxx # FMULX/FRECPS/FRSQRTS (scalar)
0101 1110 x10x xxxx 00x1 11xx xxxx xxxx # FMULX/FRECPS/FRSQRTS (scalar, FP16)
01x1 1110 1x10 0001 11x1 10xx xxxx xxxx # FRECPE/FRSQRTE/FRECPX (scalar)
01x1 1110 1111 1001 11x1 10xx xxxx xxxx # FRECPE/FRSQRTE/FRECPX (scalar, FP16)
Predicate them on `HasNEONorStreamingSVE`. Full list of affected
instructions:
FMULX16, FMULX32, FMULX64, FRECPS16, FRECPS32, FRECPS64, FRSQRTS16,
FRSQRTS32, FRSQRTS64, FRECPEv1f16, FRECPEv1i32, FRECPEv1i64, FRECPXv1f16,
FRECPXv1i32, FRECPXv1i64, FRSQRTEv1f16, FRSQRTEv1i32, FRSQRTEv1i64
Depends on D107902.
The reference can be found here:
https://developer.arm.com/documentation/ddi0602/2021-06/SIMD-FP-Instructions
Execution of NEON instructions that are illegal in streaming mode will
cause a trap or exception. Using FMULX [1] as an example, this check is
at the top of the pseudocode:
if elements == 1 then
CheckFPEnabled64();
else
CheckFPAdvSIMDEnabled64();
For the legal scalar variants it calls `CheckFPEnabled64`, whereas for the
illegal vector variants it calls `CheckFPAdvSIMDEnabled64` which traps.
This is useful for observing which instructions are/aren't legal
in streaming mode.
[1] https://developer.arm.com/documentation/ddi0602/2021-06/SIMD-FP-Instructions/FMULX--Floating-point-Multiply-extended-
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D108039
Split out from D107903 to remove dependency for D108039 and D108279.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D108293
This is the first step to enable PPC64 support huge frame size(>2G). Also fix an assertion error for frame size, i.e.,`int x; !isInt<32>(x);` should be always evaluated false, so the guard code for frame size is impossible to hit.
Reviewed By: jsji
Differential Revision: https://reviews.llvm.org/D107435
4ad41902e8 changed this code to
propagate Changed if scalar GEP PRE is performed. However, as
implemented this would skip the load PRE entirely if GEP indices
were PREd. Make sure load PRE runs even if Changed is already
true.
This likely has no functional effect as load PRE would then
occur on a later GVN iteration.
This special cases an unconditional latch and a conditional branch latch exit to improve codegen and test readability. I am hoping to reuse this function in the runtime unroll code, but without this change, the test diffs are far too complex to assess.
verifyDieRanges function checks for the intersected address ranges.
It adds child DieRangeInfo into parent DieRangeInfo to check
whether children have overlapping address ranges. It is safe to not add
DieRangeInfo with empty address range into parent's children list.
This decreases the number of children which should be navigated and as a result
decreases execution time(parents having a lot of children with empty ranges
spend much time navigating them). For this command: "llvm-dwarfdump --verify clang-repl"
execution time decreased from 220 sec till 75 sec.
Differential Revision: https://reviews.llvm.org/D107554
Extend matchShuffleAsBlend to not only match against known in-place elements for BLEND shuffles, but use isElementEquivalent to determine if the shuffle mask's referenced element is the same as the in-place element.
This allows us to replace a number of insertps instructions with more general blendps instructions (better opportunities for commutation, concatenation etc.).
Before lowering shuffles, see if we can merge horizontal ops or canonicalize the shuffle mask to point to the same LHS/RHS of the HOps when an HOp's args are repeated.
Broadwell is mainly a die shrink of Haswell, but the model had many of the scheduling classes in different orders, making side-by-side comparisons very difficult.
The InstRW overrides are still quite different, but at least that part of the side-by-side diff is now in the same position.
This was noticed while I was trying to investigate diffs between llvm-mca and other perf analyzers in https://uica.uops.info/ - we used to be able to do diffs between most of the models very easily, but we seem to have lost that simplicity as classes have been altered, models have been refined and other models have rotted.
The motivation was to get min/max intrinsics to parity
with cmp+select idioms, but this unlocks a few more
folds because isFreeToInvert recognizes add/sub with
constants too.
In the min/max example, we have too many extra uses
for smaller folds to improve things, but this fold
is able to eliminate uses even though we can't reduce
the number of instructions.
Adjusting the reduction recipes still relies on references to the
original IR, which can become outdated by the first-order recurrence
handling. Until reduction recipe construction does not require IR
references, move it before first-order recurrence handling, to prevent a
crash as exposed by D106653.
This patch supported the R_X86_64_32S relocation and add the Pointer32Signed generic edge kind.
Reviewed By: lhames
Differential Revision: https://reviews.llvm.org/D108446
This may overlap partially with the reassociate pass,
but it seems simple enough that we should try it here
in InstCombine to enable other folds.
This shows up as an opportunity and potential regression
if we improve a subtract fold with 'not' ops to be more
general.
Add a variant of mve-vqdmulh tests that uses min/max intrinsics
directly, including a scalar test that shows it misbehaving for min
intrinsics and a fix for the combine to prevent it from misbehaving.
All ExecutorProcessControl subclasses must provide an
ExecutorProcessControl::MemoryAccess object that can be used to access executor
memory from the JIT process. The EPCGenericMemoryAccess class provides an
off-the-shelf MemoryAccess implementation for JITs that do not need (or cannot
provide) a specialized MemoryAccess implementation. This simplifies the process
of creating new ExecutorProcessControl implementations.
Letting it take SCEV allows further modification on the function to optimize
if the StoreSize / Stride is runtime determined.
The plan is to let memcpy / memmove deal with runtime-determined sizes, just
like what D107353 did to memset.
Reviewed By: bmahjour
Differential Revision: https://reviews.llvm.org/D108289
Truncating stores with GPR bank sources shouldn't be mutated into using FPR bank
sources, since those aren't supported.
Ideally this should be a selection failure in the tablegen patterns, but for now
avoid generating them.
SDAG lowers 32-bit and 64-bit G_SMULO + G_UMULO. We were missing the 32-bit
case.
For other sizes, make the 0th type a power of 2 and clamp it to either 32 bits
or 64 bits.
Right now, this will allow us to handle narrow types (e.g. s4, s24, etc.). The
LegalizerHelper doesn't support narrowing G_SMULO or G_UMULO right now. I think
we want clamping behaviour either way, so we might as well include it now to
be explicit.
Differential Revision: https://reviews.llvm.org/D108240
We had a rule for <n x s64> but not one for <n x p0>. As a result, we'd fall
back on like <5 x p0> or whatever.
Differential Revision: https://reviews.llvm.org/D108484
Destination is always a GPR, since the result is always an integer.
Source is always a FPR, since the source is always floating point.
Differential Revision: https://reviews.llvm.org/D108419
Basically the same as G_LROUND. Handles the llvm.llround family of intrinsics.
Also add a helper function to the MachineVerifier for checking if all of the
(virtual register) operands of an instruction are scalars. Seems like a useful
thing to have.
Differential Revision: https://reviews.llvm.org/D108429
Currently it's possible to silently use a loop pass that does not
preserve MemorySSA in a loop-mssa pass manager, as we don't
statically know which loop passes preserve MemorySSA (as was the
case with the legacy pass manager).
However, we can at least add a check after the fact that if
MemorySSA is used, then it should also have been preserved.
Hopefully this will reduce confusion as seen in
https://bugs.llvm.org/show_bug.cgi?id=51020.
Differential Revision: https://reviews.llvm.org/D108399
Because of an odd linking problem, we need to temporarily support
building with TF C API 1.15 + tensorflow 2.50 pip package in
'development' mode scenarios. Protobuf Message 'Swap' is partially
implemented in the header (2.50) and relies on a symbol not found in TF
C API 1.15. std::move avoids that, at no semantic cost.
This reverts commit f4122398e7 to
investigate a crash exposed by it.
The patch breaks building the code below with `clang -O2 --target=aarch64-linux`
int a;
double b, c;
void d() {
for (; a; a++) {
b += c;
c = a;
}
}
Issue Details:
The addresses for SEH tables for Windows are incorrect as 1 was unconditionally being added to all addresses. +1 is required for the SEH end address (as it is exclusive), but the SEH start addresses is inclusive and so should be used as-is.
In the IP2State tables, the addresses are +1 for AMD64 to handle the return address for a call being after the actual call instruction but are as-is for ARM and ARM64 as the `StateFromIp` function in the VC runtime automatically takes this into account and adjusts the address that is it looking up.
Fix Details:
* Split the `getLabel` function into two: `getLabel` (used for the SEH start address and ARM+ARM64 IP2State addresses) and `getLabelPlusOne` (for the SEH end address, and AMD64 IP2State addresses).
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D107784
Generate btf_tag annotations for DIDrived types. More specifically,
clang frontend generates the btf_tag annotations for record
fields. The annotations are represented as an DINodeArray
in DebugInfo. The following example illustrate how
annotations are encoded in IR:
distinct !DIDerivedType(tag: DW_TAG_member, ..., annotations: !10)
!10 = !{!11, !12}
!11 = !{!"btf_tag", !"a"}
!12 = !{!"btf_tag", !"b"}
Differential Revision: https://reviews.llvm.org/D106616
When the stack is not reset it keeps previously visited Basic Block
which results in bugs where an instruction is hoisted to a
predecessor where the instruction was not fully anticipable.
Differential Revision: https://reviews.llvm.org/D108425
Before this patch, instructions MULX32rm and MULX64rm were missing a ReadAdvance
for the implicit read of register EDX/RDX. This patch fixes the issue, and it
also introduces a new SchedWrite for the two variants of MULX. The general idea
behind this last change is to eventually decrease the number of InstRW in the
scheduling models.
This patch also adds a ReadAdvance for the implicit read of EFLAGS in ADCX/ADOX.
Differential Revision: https://reviews.llvm.org/D108372
Partially reverts 85157c0079, which had removed these builtins and intrinsics
in favor of normal codegen patterns. It turns out that it is possible for the
patterns to be split over multiple basic blocks, however, which means that DAG
ISel is not able to select them to the pmin/pmax instructions. To make sure the
SIMD intrinsics generate the correct instructions in these cases, reintroduce
the clang builtins and corresponding LLVM intrinsics, but also keep the normal
pattern matching as well.
Differential Revision: https://reviews.llvm.org/D108387
Optimize (add x, c) to (SH*ADD (c>>b), x) if c is not simm12
while (c>>b) is simm12 and c has b trailing zeros.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D108193
This patch makes InstrRefBasedLDV "safe" to work with DBG_VALUE_LISTs. It
doesn't actually interpret them, but it recognises that they specify
variable locations and avoids propagating false locations, which is better
than the current state. Observe the attached tes
* We avoid propagating DBG_VALUE_LISTs into successor blocks, as they're
not "currently" supported,
* We don't propagate other variable locations across DBG_VALUE_LISTs,
because we know that the variable location is terminated by the
DBG_VALUE_LIST.
Differential Revision: https://reviews.llvm.org/D108143
This patch removes an assertion, and adds a regression test showing why the
assertion is broken.
For context, LocIdx is a key/index number for machine locations, so that we
can describe locations as a single integer and ignore whether they're on
the stack, in registers or otherwise. Back when InstrRefBasedLDV was added,
I happened to bake in a "special" zero number for various reasons, which
Vedant identified as undesirable in this review comment:
https://reviews.llvm.org/D83047#inline-765495 . I subsequently removed that
special zero number, but it looks like I didn't delete this assertion at
the time, which assumes that a zero LocIdx is invalid.
The attached test shows that this assertion is reachable on valid code --
on x86 $rsp always gets the LocIdx number zero, and if you transfer a
variable value into it, InstrRefBasedLDV crashes on that assertion. The
code might be a bit wild to be storing variables to $rsp like that, however
we shouldn't crash on it.
Differential Revision: https://reviews.llvm.org/D108134
A couple of passes that are parameterized in new-PM used different
pass names (in cmd line interface) while using the same pass class
name. This patch updates the PassRegistry to model pass parameters
more properly using PASS_WITH_PARAMS.
Reason for the change is to ensure that we have a 1-1 mapping
between class name and pass name (when disregarding the params).
With a 1-1 mapping it is more obvious which pass name to use in
options such as -debug-only, -print-after etc.
The opt -passes syntax is changed for the following passes:
early-cse-memssa => early-cse<memssa>
post-inline-ee-instrument => ee-instrument<post-inline>
loop-extract-single => loop-extract<single>
lower-matrix-intrinsics-minimal => lower-matrix-intrinsics<minimal>
This patch is not updating pass names in docs/Passes.rst. Not quite
sure what the status is for that document (e.g. when it comes to
listing pass paramters). It is only loop-extract-single that is
mentioned in Passes.rst today, out of the passes mentioned above.
Differential Revision: https://reviews.llvm.org/D108362
The purpose of __attribute__((disable_sanitizer_instrumentation)) is to
prevent all kinds of sanitizer instrumentation applied to a certain
function, Objective-C method, or global variable.
The no_sanitize(...) attribute drops instrumentation checks, but may
still insert code preventing false positive reports. In some cases
though (e.g. when building Linux kernel with -fsanitize=kernel-memory
or -fsanitize=thread) the users may want to avoid any kind of
instrumentation.
Differential Revision: https://reviews.llvm.org/D108029
The only thing that function should do as per it's semantic,
is to ensure that the switch's default is a block consisting only of
an `unreachable` terminator.
So let's just create such a block and update switch's default
to point to it. There should be no need for all this weird dance
around predecessors/successors.
This patch fixes an issue where RISCV's `findCommutedOpIndices` would
incorrectly return the pseudo `CommuteAnyOperandIndex` as a commutable
operand index, rather than fixing a specific index.
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D108206
Prevent SIFoldOperands from creating SALU instructions with a constant
and a frame index. Previously, only one operand was checked to be a
frame index, leading to too many constants when flat scratch is enabled
and stack offsets are large.
Differential Revision: https://reviews.llvm.org/D108368
This prevents the async methods (which shoud be overridden by subclasses) from
hiding the blocking helper methods, avoiding a lot of 'using MemoryAccess::...'
boilerplate.
Accepts a vector of (SymbolStringPtr, ExecutorAddress*) pairs, looks up all the
symbols, then writes their address to each of the corresponding
ExecutorAddresses.
This idiom (looking up and recording addresses into a specific set of variables)
is used in MachOPlatform and the (temporarily reverted) ELFNixPlatform, and is
likely to be used in other places in the near future, so wrapping it in a
utility function should save us some boilerplate.
Produce remarks when atomic instructions are expanded into hardware instructions
in SIISelLowering.cpp. Currently, these remarks are only emitted for atomic fadd
instructions.
Differential Revision: https://reviews.llvm.org/D108150
Support XCOFFDumper relocation reading support
This patch is part of D103696 partition
Reviewed By: daltenty, Helflym
Differential Revision: https://reviews.llvm.org/D104646
Clang patch D106614 added attribute btf_tag support. This patch
generates btf_tag annotations for DIComposite types.
A field "annotations" is introduced to DIComposite, and the
annotations are represented as an DINodeArray, similar to
DIComposite elements. The following example illustrates
how annotations are encoded in IR:
distinct !DICompositeType(..., annotations: !10)
!10 = !{!11, !12}
!11 = !{!"btf_tag", !"a"}
!12 = !{!"btf_tag", !"b"}
Each btf_tag annotation is represented as a 2D array of
meta strings. Each record may have more than one
btf_tag annotations, as in the above example.
Reland with additional fixes for llvm/unittests/IR/DebugTypeODRUniquingTest.cpp.
Differential Revision: https://reviews.llvm.org/D106615
Translate the `@llvm.lround.*` family to G_LROUND via
`IRTranslator::translateSimpleIntrinsic`.
Differential Revision: https://reviews.llvm.org/D108418
For some reductions like G_VECREDUCE_OR on AArch64, we need to scalarize
completely if the source is <= 64b. This change adds support for that in
the legalizer. If the source has a pow-2 num elements, then we can do
a tree reduction using the scalar operation in the individual elements.
Otherwise, we just create a sequential chain of operations.
For AArch64, we only need to scalarize if the input is <64b. If it's great than
64b then we can first do a fewElements step to 64b, taking advantage of vector
instructions until we reach the point of scalarization.
I also had to relax the verifier checks for reductions because the intrinsics
support <1 x EltTy> types, which we lower to scalars for GlobalISel.
Differential Revision: https://reviews.llvm.org/D108276
The COFF specific `DataReferencedByCode` complexity (D103372 D103717) is due to
a link.exe limitation: an external symbol in IMAGE_COMDAT_SELECT_ASSOCIATIVE is
not really dropped, so it can cause duplicate definition error.
When support for copying vector s8 lanes was added recently, this also
had the side effect of fixing a fallback for <16 x s8> extracts since
both used the same helper. However, there was a bug in another helper
to get the regclass for a specific FPR-native type, which was assigning
FPR16 to s8 instead of FPR8.
Clang patch D106614 added attribute btf_tag support. This patch
generates btf_tag annotations for DIComposite types.
A field "annotations" is introduced to DIComposite, and the
annotations are represented as an DINodeArray, similar to
DIComposite elements. The following example illustrates
how annotations are encoded in IR:
distinct !DICompositeType(..., annotations: !10)
!10 = !{!11, !12}
!11 = !{!"btf_tag", !"a"}
!12 = !{!"btf_tag", !"b"}
Each btf_tag annotation is represented as a 2D array of
meta strings. Each record may have more than one
btf_tag annotations, as in the above example.
Differential Revision: https://reviews.llvm.org/D106615
The convert_low and promote_low instructions can widen the lower two lanes of a
four-lane vector, but we were previously scalarizing patterns that widened lanes
besides the low two lanes. The commit adds a shuffle to move the widened lanes
into the low lane positions so the convert_low and promote_low instructions can
be used instead of scalarizing.
Depends on D108266.
Differential Revision: https://reviews.llvm.org/D108341
Since the simplest DAG patterns for convert_low and promote_low instructions
involved v2i32, v2f32, v4i64, and v4f64 types, which are not legal in the
WebAssembly backend and would be eliminated by type legalization, we were
previously matching those patterns in a DAG combine before the type legalization
stage. However in cases where the vectors were wider than 128 bits, the patterns
we matched were not created until the type legalization stage when the wide
vectors were split up. Type legalization would continue to eliminate the illegal
types we were matching as well, so the code ended up scalarized.
To make the ISel for these instructions more robust, match the scalarized
patterns rather than the patterns containing illegal types. Add tests with
double-wide vectors to show that this works as intended.
Fixes PR51098.
Depends on D107502.
Differential Revision: https://reviews.llvm.org/D108266
The default legalization of unsupported vector types is to promote the integers
in each lane, which leads to extra sign or zero extending and masking when
moving data into and out of vectors. Switch our preferred type legalization from
the default to vector widening, which keeps the data in the low lanes of the
vector rather than in the low bits of each lane. The unused high lanes can be
ignored.
Half-wide vectors are now loaded from memory into the low 64 bits of the v128
rather than spread out among the lanes. As a result, v128.load64_splat is a much
more common operation, so add new patterns to support it.
Differential Revision: https://reviews.llvm.org/D107502
This patch extends the runtime unrolling infrastructure to support unrolling a loop with multiple exiting blocks branching to the same exit block used by the latch. It intentionally does not include a cost model change to enable this functionality unless appropriate force flags are used.
This is the prolog companion to D107381. Since this was LGTMed, a problem with DT updating was reported against that patch. I roled in the analogous fix here as it seemed obvious, and not worth re-review.
As an aside, our prolog form leaves a lot of potential value on the floor when there is an invariant load or invariant condition in the loop being runtime unrolled. We should probably consider a "required prolog" heuristic. (Alternatively, maybe we should be peeling these cases more aggressively?)
Differential Revision: https://reviews.llvm.org/D108262
Alias analysis is unable to disambiguate accesses to the structure
fields without it unlike distinct variables. As a result we cannot
combine ds_read and ds_write operations in a case of any store in
between which always considered clobbering.
Differential Revision: https://reviews.llvm.org/D108315
As reported on https://bugs.llvm.org/show_bug.cgi?id=51020, the
guard widening pass doesn't preserve MemorySSA, so it can no
longer be scheduled in the same loop pass manager as LICM. However,
the loop-schedule.ll test indicates that this is supposed to work.
Fix this by preserving MemorySSA if available, as this seems to be
trivial in this case (we only need to drop the memory access for
the removed guards).
Differential Revision: https://reviews.llvm.org/D108386
In 94d0914, I added support for unrolling of multiple exit loops which have multiple exits reaching the latch. Per reports on the review post commit, I'd missed updating the domtree for one case. This fix addresses that ommission.
There's no new test as this is covered by existing tests with expensive verification turned on.
Folding a GEP from outside to inside a loop will materialize an add where there wasn't an equivalent operation before. Check the containing loops before making this fold.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D107935
This allows the instruction selector to realize that it can directly
broadcast the low byte of the memset value, rather than replicating
it to a 64-bit GPR before broadcasting.
This fixes PR50985.
Differential Revision: https://reviews.llvm.org/D108354
This was probably bugging more than is reasonable, but it makes merging
changes in this file slightly less annoying to have the trailing comma
here. I only noticed this because Rust is currently carrying a patch to
this file and it kept making life a little difficult.
This changes the lowering of saddsat and ssubsat so that instead of
using:
r,o = saddo x, y
c = setcc r < 0
s = c ? INTMAX : INTMIN
ret o ? s : r
into using asr and xor to materialize the INTMAX/INTMIN constants:
r,o = saddo x, y
s = ashr r, BW-1
x = xor s, INTMIN
ret o ? x : r
https://alive2.llvm.org/ce/z/TYufgD
This seems to reduce the instruction count in most testcases across most
architectures. X86 has some custom lowering added to compensate for
cases where it can increase instruction count.
Differential Revision: https://reviews.llvm.org/D105853
Previously we pre-calculated this and cached it for every
instruction in the function. Most of the calculated results will
never be used. So instead calculate it only on the first use, and
then cache it.
The cache was originally added to fix a compile time issue which
caused r216066 to be reverted.
This change exposed that we weren't pre-computing the Value for
Arguments. I've explicitly disabled that for now as it seemed to
regress some tests on AArch64 which has sext built into its compare
instructions.
Spotted while investigating how to improve heuristics to work better
with RISCV preferring sign extend for unsigned compares for i32 on RV64.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D107976
This encapsulates the APInt creation and worklist management into
a helper function.
To keep one common interface I've use Log2_32 in places that
previously created a mask by subtracting 1 from a power of 2.
Differential Revision: https://reviews.llvm.org/D108324
This reverts commit 9934a5b2ed.
This patch may cause miscompiles because it missed a constraint
as shown in the examples from:
https://llvm.org/PR51531
This makes the intrinsic logic match the cmp+select idiom folds
just below. It's not clearly a win either way unless we think
that a 'not' op costs more than min/max.
The cmp+select folds on these patterns are more extensive than
the intrinsics currently and may have some complicated interactions,
so I'm trying to make those line up and bring the optimizations
for intrinsics up to parity.
There is an assertion failure in computeOverflowForUnsignedMul
(used in checkOverflow) due to the inner and outer trip counts
having different types. This occurs when the IV has been widened,
but the loop components are not successfully rediscovered.
This is fixed by some refactoring of the code in findLoopComponents
which identifies the trip count of the loop.
Differential Revision: https://reviews.llvm.org/D108107
This patch adds the beginnings of more thorough support in the
legalizers for vector-predicated (VP) operations.
The first step is the ability to widen illegal vectors. The more
complicated scenario in which the result/operands need widening but the
mask doesn't has not been handled here. That would require a lot of code
without an in-tree target on which to test it.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D107904
Refactored implementation of AddressSanitizerPass and
HWAddressSanitizerPass to use pass options similar to passes like
MemorySanitizerPass. This makes sure that there is a single mapping
from class name to pass name (needed by D108298), and options like
-debug-only and -print-after makes a bit more sense when (despite
that it is the unparameterized pass name that should be used in those
options).
A result of the above is that some pass names are removed in favor
of the parameterized versions:
- "khwasan" is now "hwasan<kernel;recover>"
- "kasan" is now "asan<kernel>"
- "kmsan" is now "msan<kernel>"
Differential Revision: https://reviews.llvm.org/D105007
Currently, `printHelp` behaves differently for options that:
* do not define `HelpText` (such options _are not printed_), and
* define its `HelpText` as `HelpText<"">` (such options _are printed_).
In practice, both approaches lead to no help text and `printHelp` should
treat them consistently. This patch addresses that by making
`printHelpt` check the length of the help text to be printed.
All affected tests have been updated accordingly. The option definitions
for llvm-cvtres have been updated with a short description or "Not
implemented" for options that are ignored by the tool.
Differential Revision: https://reviews.llvm.org/D107557