This patch adds the names of the Arm Architecture Reference Manual (ARM)
features to the corresponding Subtarget Features in the AArch64 backend
and target parser.
The aim of this is to make it clearer what architectural features a
subtarget feature might enable (so, which features a CPU must provide to
support that subtarget feature), and so make it easier to add new CPUs
in the future.
Differential Revision: https://reviews.llvm.org/D131257
This is a follow-up patch to D130999. In the test, the MIR contains an
unreachable MBB but the code attempts to look it up in MLocs. This
patch fixes this issue by checking for the default-constructed value.
rdar://97226240
Differential Revision: https://reviews.llvm.org/D131453
si-annotate-control-flow does depth first traversal of BB's of
a function to insert amdgcn if intrinsics for conditional
branches so that isel can generate correct instructions later.
si-annotate-control-flow checks whether the successor BB for the 'else'
branch of a conditional branch has been visited. If it has been
visited, si-annotate-control-flow assumes the conditional
branch has been handled and will not try to insert if intrinsic
for it.
This assumption is not correct when the IR contains multiple
unreachable BB's. Then 'if' intrinscs are not inserted and incorrect
ISA are generated.
This patch fixes the issue by let amdgpu-unify-divergent-exit-nodes
unify unreachables even if they are uniformly reached. In this way
the IR will not contain multiple exits, and structurizer is able to
structurize the IR containing one unified exit.
Reviewed by: Ruiling Song, Matt Arsenault
Differential Revision: https://reviews.llvm.org/D131181
Fixes: SWDEV-343244
This adds a +forced-atomics target feature with the same semantics
as +atomics-32 on ARM (D130480). For RISCV targets without the +a
extension, this forces LLVM to assume that lock-free atomics
(up to 32/64 bits for riscv32/64 respectively) are available.
This means that atomic load/store are lowered to a simple load/store
(and fence as necessary), as these are guaranteed to be atomic
(as long as they're aligned). Atomic RMW/CAS are lowered to __sync
(rather than __atomic) libcalls. Responsibility for providing the
__sync libcalls lies with the user (for privileged single-core code
they can be implemented by disabling interrupts). Code using
+forced-atomics and -forced-atomics are not ABI compatible if atomic
variables cross the ABI boundary.
For context, the difference between __sync and __atomic is that the
former are required to be lock-free, while the latter requires a
shared global lock provided by a shared object library. See
https://llvm.org/docs/Atomics.html#libcalls-atomic for a detailed
discussion on the topic.
This target feature will be used by Rust's riscv32i target family
to support the use of atomic load/store without atomic RMW/CAS.
Differential Revision: https://reviews.llvm.org/D130621
This allows relaxing some relocations to STT_SECTION symbol+offset
instead of emitting a relocation against a symbol.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D131433
ARMAsmPrinter::emitFunctionEntryLabel() was not calling the base class
function so the $local alias was not being emitted. This should not have
any function effect right now since ARM does not generate different code
for the $local symbols, but it could be improved in the future.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D131392
The RelLookupTableConverter pass currently only supports 64-bit
pointers. This is currently enforced using an isArch64Bit() check
on the target triple. However, we consider x32 to be a 64-bit target,
even though the pointers are 32-bit. (And independently of that
specific example, there may be address spaces with different pointer
sizes.)
As such, add an additional guard for the size of the pointers that
are actually part of the lookup table.
Differential Revision: https://reviews.llvm.org/D131399
This allows a number of optimisation passes to work.
E.g. BranchFolding and MachineBlockPlacement.
Differential Revision: https://reviews.llvm.org/D131316
The register operand of DBG_VALUE is not selected to a proper register
bank in both AArch64 and X86. This would cause getRegClass crash after
global ISel. After discussion, we think the MIR should assume all
vritual register should be set proper register class after global ISel,
so this patch is to fix the gap of DBG_VALUE for AArch64 and X86.
Differential Revision: https://reviews.llvm.org/D129037
The previous code overwrites VRMap for prologue stages during Phi
generation if a register spans many stages.
As a result, the wrong register is used as the one coming from
the prologue in Phis at later stages. (A process exists to correct
this, but it does not work in all cases.)
In addition, VRMap for prologue must be preserved until addBranches().
This patch fixes them by separating the map for Phis into a different
variable (VRMapPhi).
Reviewed By: bcahoon
Differential Revision: https://reviews.llvm.org/D127840
1. Missing instruction information (FTSSEL, FMSB, PFIRST and RDFFR)
is added and CompleteModel is set to one.
2. Information for pseudo SVE instructions is added. Those
instructions are present at the time of scheduling.
3. Resource and latency information for SVE instructions is modified
to be more accurate.
For example, the description for CMPEQ, which consumes one cycle
each of unit FLA and PPR, is as follows.
```
Previous:
def A64FXGI01 : ProcResGroup<[A64FXIPFLA, A64FXIPPR]>;
def A64FXWrite_4Cyc_GI01 : SchedWriteRes<[A64FXGI01]> {...
Modified:
def A64FXGI0 : ProcResGroup<[A64FXIPFLA]>;
def A64FXGI1 : ProcResGroup<[A64FXIPPR]>;
def A64FXWrite_CMP : SchedWriteRes<[A64FXGI0, A64FXGI1]> {...
```
Reference: A64FX Microarchitecture Manual (Table 16-3)
https://github.com/fujitsu/A64FX/blob/master/doc/A64FX_Microarchitecture_Manual_en_1.7.pdf
Reviewed By: dmgreen, kawashima-fj
Differential Revision: https://reviews.llvm.org/D131165
Map hardware loop intrinsics loop_decrement and set_loop_iteration
to the new PowerPC pseudo instructions, so that the hardware loop
intrinsics will be expanded to normal cmp+branch form or ctrloop
form based on the CTR register usage on MIR level.
Reviewed By: lkail
Differential Revision: https://reviews.llvm.org/D123366
Exclude the terminating end opcode from the epilog - it doesn't
correspond to an actual instruction that is included in the epilog
itself (within the .seh_startepilogue/.seh_endepilogue range).
In most (all?) cases, an epilog is followed by a matching terminating
instruction though (a ret or a branch to a tail call), but it's not
strictly within the .seh_startepilogue/.seh_endepilogue range.
This fixes a number of failed asserts in cases where the codegen
has incorrectly reoredered SEH opcodes so they don't match up
exactly with their instructions.
However this still just avoids failing the assertion; the root cause
of generating unexpected epilogs is still present (and fixing that is
a less obvious issue).
Differential Revision: https://reviews.llvm.org/D131393
With profile data, non-trivial LoopUnswitch will only apply on non-cold loops, as unswitching cold loops may not gain much benefit but significantly increase the code size.
Reviewed By: aeubanks, asbirlea
Differential Revision: https://reviews.llvm.org/D129599
Implements the pc element for the symbolizing filter, including it's
"ra" and "pc" modes. Return addresses ("ra") are adjusted by
decrementing one. By default, {{{pc}}} elements are assumed to point to
precise code ("pc") locations. Backtrace elements will adopt the
opposite convention.
Along the way, some minor refactors of value printing and colorization.
Reviewed By: peter.smith
Differential Revision: https://reviews.llvm.org/D131115
We are seeing significant performance loss when an alloca fails to get promoted
to register. I have observed that this is due to the common type found when
attempting to rewrite partition users being unviable for promotion. While if we
would have continue looking for a type, we would have found a subtype in the
original allocated type that would have enabled promotion. Thus first check if
the initial common type found is promotion viable and if not then continue
looking instead of stopping with the initial common type found.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D128073
The floating point stores use a different register class, it
probably makes sense to have a different SchedRead.
Reviewed By: monkchiang
Differential Revision: https://reviews.llvm.org/D131379
Previously we'd go off the end of the BI iterator because we expected
that the relative positions of common blocks before and after were
consistent. That's not always true though, for example with
jump-threading.
Reviewed By: jamieschmeiser
Differential Revision: https://reviews.llvm.org/D130596
The build breakages should be addressed by d4abdd2e3d:
[CMake] Check CMAKE_CXX_STANDARD and error if it's to old
Thanks to Tobias and Roy for addressing these issues.
This patch adds basic support for a DAG variant of the canCreateUndefOrPoison call and updates DAGCombiner::visitFREEZE to use it, further Opcodes (including target specific Opcodes) can be handled when we have test coverage.
So far, I've left visitFREEZE to just use this for unary nodes (which currently means the existing BITCAST/FREEZE cases) - later patches will add other unary opcodes (with test coverage) and we can also refactor visitFREEZE to support a general number of operands like we do in InstCombinerImpl::pushFreezeToPreventPoisonFromPropagating.
I'm not aware of any vector test freeze coverage so the DemandedElts (and the Depth) args are not being used yet - but they are in place. Similarly we will be able to handle poison generating SDNodeFlags as and when it becomes an issue.
Part of the work for D106675 / PR50468
Differential Revision: https://reviews.llvm.org/D130646
These are new debug types that ships with the latest
Windows SDK and would warn and finally fail lld-link.
The symbols seems to be related to Microsoft's XFG
which is their version of CFG. We can't handle any of
this yet, so for now we can just ignore these types
so that lld doesn't fail with a new version of Windows
SDK.
Fixes: #56285
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D129378
When EarlyCSE tries to common vector masked loads/stores, it first checks that
they have same base operand and then assumes that this is enough for mask types
to be equal. This is true for typed pointers but false for opaque ones -
two loads of different vector sizes from same base pointer '%b' are the same,
`ptr %b`. (For typed pointers, `%b` was cast to vector pointer type so bases
were different).
Change assert to return from lambda `isSubmask` so this transformation properly
works with opaque pointers.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D131251
This patch emits table lookup in expandCTTZ.
Context -
https://reviews.llvm.org/D113291 transforms set of IR instructions to
cttz intrinsic but there are some targets which does not support CTTZ or
CTLZ. Hence, I generate a table lookup in TargetLowering::expandCTTZ().
Differential Revision: https://reviews.llvm.org/D128911
FoldConstantArithmetic can fold constant vectors hidden behind bitcasts (e.g. vXi64 -> v2Xi32 on 32-bit platforms), but currently bails if either vector contains undef elements. These undefs can often occur due to SimplifyDemandedBits/VectorElts calls recognising that the upper bits are often unnecessary (e.g. funnel-shift/rotate implicit-modulo and AND masks).
This patch adds a basic 'FoldValueWithUndef' handler that will attempt to constant fold if one or both of the ops are undef - so far this just handles the AND and MUL cases where we always fold to zero.
The RISCV codegen increase is interesting - it looks like the BUILD_VECTOR lowering was loading a constant pool entry but now (with all elements defined constant) it can materialize the constant instead?
Differential Revision: https://reviews.llvm.org/D130839
The ABI for big-endian AArch32, as specified by AAELF32, is above-
averagely complicated. Relocatable object files are expected to store
instruction encodings in byte order matching the ELF file's endianness
(so, big-endian for a BE ELF file). But executable images can
//either// do that //or// store instructions little-endian regardless
of data and ELF endianness (to support BE32 and BE8 platforms
respectively). They signal the latter by setting the EF_ARM_BE8 flag
in the ELF header.
(In the case of the Thumb instruction set, this all means that each
16-bit halfword of a Thumb instruction is stored in one or other
endianness. The two halfwords of a 32-bit Thumb instruction must
appear in the same order no matter what, because the first halfword is
the one that must avoid overlapping the encoding of any 16-bit Thumb
instruction.)
llvm-objdump was unconditionally expecting Arm instructions to be
stored little-endian. So it would correctly disassemble a BE8 image,
but if you gave it a BE32 image or a BE object file, it would retrieve
every instruction in byte-swapped form and disassemble it to
nonsense. (Even an object file output by LLVM itself, because
ARMMCCodeEmitter outputs instructions big-endian in big-endian mode,
which is correct for writing an object file.)
This patch allows llvm-objdump to correctly disassemble all three of
those classes of Arm ELF file. It does it by introducing a new
SubtargetFeature for big-endian instructions, setting it from the ELF
image type and flags during llvm-objdump setup, and teaching both
ARMDisassembler and llvm-objdump itself to pay attention to it when
retrieving instruction data from a section being disassembled.
Differential Revision: https://reviews.llvm.org/D130902
Calling reserve() used to require an RPC call. This commit allows large
ranges of executor address space to be reserved. Subsequent calls to
reserve() will return subranges of already reserved address space while
there is still space available.
Differential Revision: https://reviews.llvm.org/D130392
D129150 added a combine from shuffles to And that creates a BUILD_VECTOR
of constant elements. We need to ensure that the elements are of a legal
type, to prevent asserts during lowering.
Fixes#56970.
Differential Revision: https://reviews.llvm.org/D131350
Add patterns to select predicated instructions when lowering:
fadd(a, select(mask, b, splat(0)))
fsub(a, select(mask, b, splat(0)))
'fadd' is unsafe unless no-signed zeros fast-math flag is set, since
-0.0 + 0.0 = 0.0
changes the sign. Alive2: https://alive2.llvm.org/ce/z/wbhJh_
Also adds FMA patterns for:
fadd(a, select(mask, mul(b, c), splat(0))) -> fmla(a, mask, b, c)
fsub(a, select(mask, mul(b, c), splat(0))) -> fmla(a, mask, b, c)
These patterns require the 'contract' fast-math flag to be set, and the
fadd 'nsz' as above.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D130564
Given a TypeIndex or CVType return its size in bytes. Basically it
is the inverse to 'CodeViewDebug::lowerTypeBasic', that returns a
TypeIndex based in a size.
Reviewed By: rnk, djtodoro
Differential Revision: https://reviews.llvm.org/D129846
This patch ensures the `$fp` always points to the bottom of the vararg
spill region.
Includes support for expand `ISD::DYNAMIC_STACKALLOC`.
Differential Revision: https://reviews.llvm.org/D130250
This allow optimization without LTO. Also remove some useless else-ifs.
Signed-off-by: Jun Zhang <jun@junz.org>
Differential Revision: https://reviews.llvm.org/D131313
This reverts commit 6ea5bf436a.
6ea5bf436a made use of new c++17 rules regarding
order of evaluation (specifically: in function calls the expression naming the
function should be sequenced before the evalution of any operands) to simplify
some continuation-passing calls. Unfortunately this appears to break at least
one MSVC bot: https://lab.llvm.org/buildbot/#/builders/123/builds/12149 .
Includes an update to the comments to note that the workaround is now based on
MSVC limitations, not on LLVM adopting c++17.
std::iterator has been deprecated in C++17 and some standard library implementations such as MS STL or libc++ emit deperecation messages when using the class.
Since LLVM has now switched to C++17 these will emit warnings on these implementations, or worse, errors in build configurations using -Werror.
This patch fixes these issues by replacing them with LLVMs own llvm::iterator_facade_base which offers a superset of functionality of std::iterator.
Differential Revision: https://reviews.llvm.org/D131320
This function heap-allocates a ThreadSafeModule (the current C bindings assume
that TSMs are always heap-allocated), but was failing to free it.
Should fix http://llvm.org/PR56953.
If we're adding a constant that can't use addi we try a few tricks,
one of which is using li+sh3add. We should not do this if lui+add
would work. For example adding 8192. Using sh3add prevents folding
a sext.w to form addw, thus increasing instruction count.
This fixes a bug reported privately by @craig.topper. Here's an example which illustrates the problem:
vsetivli a1, a0, e32, m1, ta, mu # both DefInfo and PrevInfo
vsetivli a2, a1, e32, m4, ta, mu
With the unsound result being:
vsetivli a1, a0, e32, m1, ta, mu
vsetivli a2, a0, e32, m4, ta, mu
Consider the case where this is running on a machine with VLEN=512,. For this case, the VLMAXs are 16 and 64 respectively.
Consider for a0 = 33. The correct result is: a1 = 16, and a2 = 16
After the unsound optimization: a1 = 16 and a2 = 33
This particular example used VLMAXs which differed by more than a power of two. With a difference of only one power of two, there's another form of this bug which involves the AVL < 2 x VLMAX special case, but that ones more complicated to construct as many examples turn out accidentally sound.
This patch takes the approach of simply removing the unsound optimization, but there are multiple sound sub-cases of it. I plan to return to at least a couple of them, but figured it was cleaner to remove the unsound optimization (for ease of backporting), and then review the new optimizations on their own.
Differential Revision: https://reviews.llvm.org/D131264
Create function segments and emit unwind info of them.
A segment must be less than 1MB and no prolog or epilog is splitted between two
segments.
This patch should generate correct, though not optimal, unwind info for large
functions. Currently it only generate pacted info (.pdata) only for functions
that are less than 1MB (single-segment functions). This is NFC from before this
patch.
The next step is to enable (.pdata) only unwind info for the first segment or
segments that have neither prolog or epilog in a multi-segment function.
Another future work item is to further split segments that require more than 255
code words or have more than 65535 epilogs.
Reference:
https://docs.microsoft.com/en-us/cpp/build/arm64-exception-handling#function-fragments
Differential Revision: https://reviews.llvm.org/D130049
NOTE: i8 vector splats are ignored because the immediate range of
DUP already has full coverage.
Differential Revision: https://reviews.llvm.org/D131078
We get a couple of improvements from recognizing swapped
operand patterns that were not handled by the replicated
code.
This should also enable simplifying larger patterns as
seen in issue #56653 and issue #56654, but that requires
enhancements to isImpliedCondition() itself.
There are no AMDGPUSampleVariant versions for _G16, it is treated more like a
modifier for derivatives (_D) (also for intrinsics where it is overloaded type
instead of part of instrinsic name) so we ended up making more variants for
these instruction then we actually needed.
32-bit derivatives need 6 dwords at most, while 16-bit need 4 at most. Using
same AMDGPUSampleVariant for both, we ended up creating 2 extra variants per
instruction than were necessary.
In total this deletes 260 unused tablegen records.
Differential Revision: https://reviews.llvm.org/D131252
Given a poison constant as input, the dyn_cast to a ConstantInt would
fail so we would fall through to the generic code that attempts to fold
each element of the input vectors. The inputs to these intrinsics are
not vectors though, leading to a compile time crash. Instead bail out
properly for poison values by returning nullptr. This doesn't try to
define what poison means for these intrinsics.
Fixes#56945
This fixes 69 llvm tests that failed when EXPENSIVE_CHECKS was enabled.
llvm/test/Transforms/IROutliner/outlining-commutative-operands-opposite-order.ll
is one example.
When we have EXPENSIVE_CHECKS, _GLIBCXX_DEBUG is defined. This means
that libstdc++ will call the compare function to check if it is
implemented correctly (that !(a < a) is true).
This happens even if there is only one item and here, we expect
to see one return void or multiple return constant integer.
Don't sort if we have 1 item, but do assert that it is the 1
ret void we expect. In the comparator, assert that neither
Value is a nullptr in case one ended up in a the list somehow.
Reviewed By: AndrewLitteken
Differential Revision: https://reviews.llvm.org/D130230
According to the description of the LoongArch abi documentation,
(https://loongson.github.io/LoongArch-Documentation/LoongArch-ELF-ABI-EN.html)
the calling convention of LoongArch is almost the same as the RISCV's
(except for the vector part), so we borrow the implementation of RISCV.
This patch only guarantees the correctness of lp64d, because only the
part of lp64d is described in detail in the documentation.
Differential Revision: https://reviews.llvm.org/D130249
Closing https://github.com/llvm/llvm-project/issues/56919
It is meaningless to preserve the lifetime markers for the spilled
allocas in the coroutine frames and it would block some optimizations
too.
This is a simple addition to emitConditionalComparison, to match CCMP
with immediates using getIConstantVRegValWithLookThrough, letting it
select the CCMPri variants of the instructions.
Differential Revision: https://reviews.llvm.org/D131073
`BMI` new instruction `tzcnt` has better performance than `bsf` on new
processors. Its encoding has a mandatory prefix '0xf3' compared to
`bsf`. If we force emit `rep` prefix for `bsf`, we will gain better
performance when the same code run on new processors.
GCC has already done this way: https://c.godbolt.org/z/6xere6fs1Fixes#34191
Reviewed By: craig.topper, skan
Differential Revision: https://reviews.llvm.org/D130956
A const reference is preferred over a non-null const pointer.
`Type *` is kept as is to match the other overload.
Reviewed By: davidxl
Differential Revision: https://reviews.llvm.org/D131197
When folding (sra (add (shl X, 32), C1), 32 - C) -> (shl (sext_inreg (add X, C1), i32), C)
it's possible that the add is used by multiple sras. We should
allow the combine if all the SRAs will eventually be updated.
After transforming all of the sras, the shls will share a single
(sext_inreg (add X, C1), i32).
This pattern occurs if an sra with 32 is used as index in multiple
GEPs with different scales. The shl from the GEPs will be combined
with the sra before we get a chance to match the sra pattern.
COFF has a verifier check that private global variables don't have a comdat of the same name.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D131043
1) Overloaded (instruction-based) method is a wrapper around the current (opcode-based) method.
2) This patch also changes a few callsites (VectorCombine.cpp,
SLPVectorizer.cpp, CodeGenPrepare.cpp) to call the overloaded method.
3) This is a split of D128302.
Differential Revision: https://reviews.llvm.org/D131114
In contrast to AAPotentialValues, the constant values version can
contain implicit `undef` in the set. We had an assertion that could
misfire before. Handle it properly now.
When folding (sra (add (shl X, 32), C1), 32 - C) -> (shl (sext_inreg (add X, C1), C)
ignore the use count on the (shl X, 32).
The sext_inreg after the transform is free. So we're only making
2 new instructions, the add and the shl. So we only need to be
concerned with replacing the original sra+add. The original shl
can have other uses. This helps if there are multiple different
constants being added to the same shl.
This connects the Symbolizer to the markup filter and enables the first
working end-to-end flow using the filter.
Reviewed By: peter.smith
Differential Revision: https://reviews.llvm.org/D130187
As discussed in [0], this diff adds the `skipprofile` attribute to
prevent the function from being profiled while allowing profiled
functions to be inlined into it. The `noprofile` attribute remains
unchanged.
The `noprofile` attribute is used for functions where it is
dangerous to add instrumentation to while the `skipprofile` attribute is
used to reduce code size or performance overhead.
[0] https://discourse.llvm.org/t/why-does-the-noprofile-attribute-restrict-inlining/64108
Reviewed By: phosek
Differential Revision: https://reviews.llvm.org/D130807
This allows the construct to be shared between different backends. However, it
still remains illegal to use TypedPointerType in LLVM IR--the type is intended
to remain an auxiliary type, not a real LLVM type. So no support is provided for
LLVM-C, nor bitcode, nor LLVM assembly (besides the bare minimum needed to make
Type->dump() work properly).
Reviewed By: beanz, nikic, aeubanks
Differential Revision: https://reviews.llvm.org/D130592
This fixes warnings like these:
../lib/ExecutionEngine/Orc/MemoryMapper.cpp:364:9: warning: ignoring return value of function declared with 'warn_unused_result' attribute [-Wunused-result]
joinErrors(std::move(Err),
^~~~~~~~~~ ~~~~~~~~~~~~~~~
Differential Revision: https://reviews.llvm.org/D131056
In RVV, we use vwredsum.vs and vwredsumu.vs for vecreduce.add(ext(Ty A)) if the result type's width is twice of the input vector's SEW-width. In this situation, the cost of extended add reduction should be same as single-width add reduction. So as the vector float widenning reduction.
Differential Revision: https://reviews.llvm.org/D129994
The isOnlyUserOf prevented the fold if the chain result had any
users. What we really care about is the the data result from the
AND is only used by the TEST, and the flags results from the ANDs
aren't used at all. It's ok if the chain has users, we just need
to replace those users with the chain from the TESTrm.
Reviewed By: LuoYuanke
Differential Revision: https://reviews.llvm.org/D131117
BoundsChecking uses ObjectSizeOffsetEvaluator to keep track of the
underlying size/offset of pointers in allocations. However,
ObjectSizeOffsetVisitor (something ObjectSizeOffsetEvaluator
uses to check for constant sizes/offsets)
doesn't quite treat sizes and offsets the same way as
BoundsChecking. BoundsChecking wants to know the size of the
underlying allocation and the current pointer's offset within
it, but ObjectSizeOffsetVisitor only cares about the size
from the pointer to the end of the underlying allocation.
This only comes up when merging two size/offset pairs. Add a new mode to
ObjectSizeOffsetVisitor which cares about the underlying size/offset
rather than the size from the current pointer to the end of the
allocation.
Fixes a false positive with -fsanitize=bounds.
Reviewed By: vitalybuka, asbirlea
Differential Revision: https://reviews.llvm.org/D131001
This is the 2nd patch of the two-patch series (D130188, D130189) that
fix PR56275 (https://github.com/llvm/llvm-project/issues/56275) which
is a missed opportunity for loop interchange.
As follow-up on the dependence analysis (DA) patch D130188, this patch
normalizes DA results in loop interchange, such that negative dependence
vectors queried by loop interchange are reversed to be non-negative.
Now all tests in PR56275 can get interchanged. Those tests are added
in lit test as `pr56275.ll`.
Reviewed By: kawashima-fj, bmahjour, Meinersbur, #loopoptwg
Differential Revision: https://reviews.llvm.org/D130189
This patch is the first of the two-patch series (D130188, D130179) that
resolve PR56275 (https://github.com/llvm/llvm-project/issues/56275)
which is a missed opportunity, where a perfrectly valid case for loop
interchange failed interchange legality.
If the distance/direction vector produced by dependence analysis (DA) is
negative, it needs to be normalized (reversed). This patch provides helper
functions `isDirectionNegative()` and `normalize()` in DA that does the
normalization, and clients can query DA to do normalization if needed.
A pass option `<normalized-results>` is added to DependenceAnalysisPrinterPass,
and we leverage it to update DA test cases to make sure of test coverage. The
test cases added in `Banerjee.ll` shows that negative vectors are normalized
with `print<da><normalized-results>`.
Reviewed By: bmahjour, Meinersbur, #loopoptwg
Differential Revision: https://reviews.llvm.org/D130188
During LTO a local promoted to a global gets a unique suffix based on
a hash of the module IR. This means that changes in the local's module
can affect the contents in another module that imported it (because the name
of the imported promoted local is changed, but that doesn't reflect a
real change in the importing module). So any tool that's
validating changes to the importing module will see a superficial change.
Instead of using the module hash, we can use the "source_filename" if it
exists to generate a unique identifier that doesn't change due to LTO
shenanigans.
Differential Revision: https://reviews.llvm.org/D128863
This just shuffles implementations and declarations around. Now the
logger and the TF C API-based model evaluator are separate.
Differential Revision: https://reviews.llvm.org/D131116
D129980 converts (seteq (i64 (and X, 0xffffffff)), C1) into
(seteq (i64 (sext_inreg X, i32)), C1). If bit 31 of X is 0, it
will be turned back into an 'and' by SimplifyDemandedBits which
can cause an infinite loop.
To prevent this, check if bit 31 is 0 with computeKnownBits before
doing the transformation.
Fixes PR56905.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D131113
If we're going to emit a rep prefix before bsf as proposed in
D130956, it makes sense to promote i16 operations to i32 to avoid
the false depedency of tzcntw.
Reviewed By: skan, pengfei
Differential Revision: https://reviews.llvm.org/D130995
The testcase was delta-reduced from an LTO build with sanitizer
coverage and the MIR tail duplication pass caused a machine basic
block to become unreachable in MIR. This caused the MBB to be invisible
to the reverse post-order traversal used to initialize the MBB <->
RPONumber lookup tables.
rdar://97226240
Differential Revision: https://reviews.llvm.org/D130999
The function `handleDebugValue` has custom logic to handle certain kinds
constants, namely integers, floats and null pointers. However, it does
not handle constant pointers created from IntToPtr ConstantExpressions.
This patch addresses the issue by replacing the Constant with its
integer operand.
A similar bug was addressed for GlobalISel in D130642.
Reviewed By: aprantl, #debug-info
Differential Revision: https://reviews.llvm.org/D130908
This is mostly a stylistic change to make the uniform memop widening cost
code fit more naturally with the sourounding code. Its not strictly
speaking NFC as I added in the store with invariant value case, and we
could in theory have a target where a gather/scatter is cheaper than a
single load/store... but it's probably NFC in practice. Note that the
scatter/gather result can still be overriden later if the result is
uniform-by-parts.
This patch ensures consistency in the construction of FP_ROUND nodes
such that they always use ISD::TargetConstant instead of ISD::Constant.
This additionally fixes a bug in the AArch64 SVE backend where patterns
were matching against TargetConstant nodes and sometimes failing when
passed a Constant node.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D130370
An unnecessary sext.w is generated when masking the result of the
riscv_masked_cmpxchg_i64 intrinsic. Implementing handling of the
intrinsic in ComputeNumSignBitsForTargetNode allows it to be removed.
Although this isn't a particularly important optimisation, removing the
sext.w simplifies implementation of an additional cmpxchg-related
optimisation in D130192.
Although I can't produce a test with different codegen for the other
atomics intrinsics, these are added as well for completeness.
Differential Revision: https://reviews.llvm.org/D130191
Enable SGPRs for the following operands of these opcodes:
- src operands of VOP3 variant.
- src2 operand of DPP variants.
Differential Revision: https://reviews.llvm.org/D130989
`BMI` new instruction `tzcnt` has better performance than `bsf` on new
processors. Its encoding has a mandatory prefix '0xf3' compared to
`bsf`. If we force emit `rep` prefix for `bsf`, we will gain better
performance when the same code run on new processors.
GCC has already done this way: https://c.godbolt.org/z/6xere6fs1Fixes#34191
Reviewed By: skan
Differential Revision: https://reviews.llvm.org/D130956
Unfortunately, this overflow is extremely hard to reproduce reliably (in fact, I was unable to do so). The issue is that:
- getOperandsToCreate sometimes skips creating an SCEV for the LHS
- then, createSCEV is called for the BinaryOp
- ... which calls getNoWrapFlagsFromUB
- ... which under certain circumstances calls isSCEVExprNeverPoison
- ... which under certain circumstances requires the SCEVs of all operands
For certain deep dependency trees, this causes a stack overflow.
Reviewed By: bkramer, fhahn
Differential Revision: https://reviews.llvm.org/D129745
Mark ModRefInfo as a bitmask enum, which allows using normal
& and | operators on it. This supersedes various functions like
unionModRef() and intersectModRef(). I think this makes the code
cleaner than going through helper functions...
Differential Revision: https://reviews.llvm.org/D130870
Sometimes SCEV cannot infer nuw/nsw from something as simple as
```
len in [0, MAX_INT]
...
iv = phi(0, iv.next)
guard(iv <s len)
guard(iv <u len)
iv.next = iv + 1
```
just because flag strenthening only relies on definition and does not use local facts.
This patch adds support for the simplest case: inference of flags of `add(x, constant)`
if we can contextually prove that `x <= max_int - constant`.
In case if it has negative CT impact, we can add an option to switch it off. I woudln't
expect that though.
Differential Revision: https://reviews.llvm.org/D129643
Reviewed By: apilipenko
In case that opaque pointers not enabled, there may be some constexpr
bitcast uses for thread local variables and the design of llvm allow
people to sink constant arbitrarily. This breaks the assumption of
IRBuilder::CreateThreadLocalAddress. This patch tries to handle the
case.
In this patch we replace common code patterns with the use of utility
functions for dealing with profiling metadata. There should be no change
in functionality, as the existing checks should be preserved in all
cases.
Reviewed By: bogner, davidxl
Differential Revision: https://reviews.llvm.org/D128860
When compiling for multiple targets the scheduler that is selected via the
-misched option is applied globally. This patch adds a target CL option instead.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D131022
This used to print from the ADDI where the operand number was
correct. It recently changed to print from the LUI or AUIPC which
needs to use operand 1 instead of 2.
This shows up as a crash with -debug.
Prefer using these accessors to access the special sub-commands
corresponding to the top-level (no subcommand) and all sub-commands.
This is a preparatory step towards removing the use of ManagedStatic:
with a subsequent change, these global instances will be moved to
be regular function-scope statics.
It is split up to give downstream projects a (albeit short) window in
which they can switch to using the accessors in a forward-compatible
way.
Differential Revision: https://reviews.llvm.org/D129118
Creates a new scheduling strategy that attempts to maximize ILP for a single
wave.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D130869
When emit-obj from clang directly, DirectX backend will hit assert caused by not initialize passes for AsmPrinter.
The fix will initialize the passes by calling createPassConfig.
Also ignore global variable which not has section in DXILAsmPrinter::emitGlobalVariable to avoid hit llvm_unreachable in DXILTargetObjectFile::SelectSectionForGlobal.
Reviewed By: beanz
Differential Revision: https://reviews.llvm.org/D130856
Fixes code in OrderedChangedData<T>::report which assumes that a string will only appear once in Before/After.
Reviewed By: jamieschmeiser
Differential Revision: https://reviews.llvm.org/D130587
The LegalizerHelper misses the code to lower G_MUL to a library call,
which this change adds.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D130987
rGcf97e0ec42b8 makes $x18 to be treated as callee-saved in functions with
Windows calling convention on non-Windows OSes.
Here we mark $x18 as callee-saved for functions with Windows calling
convention on Darwin, as well as on other non-Windows platforms, in
order to prevent some miscompilations (like miscompilation of
win64cc-darwin-backup-x18.ll).
Since getCalleeSavedRegs doesn't return x18 in list of callee-saved
registers, assignCalleeSavedSpillSlots and determineCalleeSaves
consider different sets of registers as callee-saved. It causes an
error:
```
Assertion failed: ((!HasCalleeSavedStackSize || getCalleeSavedStackSize() == Size) && "Invalid size calculated for callee saves"), function getCalleeSavedStackSize, file
AArch64MachineFunctionInfo.h, line 292.
```
Differential Revision: https://reviews.llvm.org/D130676
The ZERO register should be exposed as a constant physical register through the interface TargetRegisterInfo::isConstantPhysReg.
Differential Revision: https://reviews.llvm.org/D130932
I think these pseudos will exist when the post-RA scheduler runs
so they should have sched classes.
Reviewed By: monkchiang
Differential Revision: https://reviews.llvm.org/D130945
In the 2e29b0138c we introduce a specific solving algorithm
that analyzes the VGPR to SGPR copies use chains and either lowers
the copy to v_readfirstlane_b32 or converts the whole chain to VALU forms.
Same time we still have the code that blindly converts to VALU REG_SEQUENCE and PHIs
in case they produce SGPR but have VGPRs input operands. In case the REG_SEQUENCE and PHIs
are in the VGPR to SGPR copy use chain, and this chain was considered long enough to convert
copy to v_readfistlane_b32, further lowering them to VALU leads to several kinds of issues.
At first, we have v_readfistlane_b32 which is completely useless because most parts of its use chain
were moved to VALU forms. Second, we may encounter subtle bugs related to the EXEC-dependent CF
because of the weird mixing of SALU and VALU instructions.
This change removes the code that moves REG_SEQUENCE and PHIs to VALU. Instead, we use the fact
that both REG_SEQUENCE and PHIs have copy semantics. That is, if they define SGPR but have VGPR inputs,
we insert VGPR to SGPR copies to make them pure SGPR. Then, the new copies are processed by the common
VGPR to SGPR lowering algorithm.
This is Part 2 in the series of commits aiming at the massive refactoring of the SIFixSGPRCopies pass.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D130367
This pass seems to have very little effect because all it does is hoist
some instructions, but it is followed later in the codegen pipeline by
the IR CodeSinking pass which does the opposite.
Differential Revision: https://reviews.llvm.org/D130258
This improves a corner case where v_fmac can be converted to v_fma on
GFX10+ even if it has a literal operand.
Differential Revision: https://reviews.llvm.org/D130992
This extends the handling of uniform memory operations to handle the case where a store is storing a loop invariant value. Unlike the general case of a store to an invariant address where we must use the last active lane, in this case we can use any lane since all lanes must produce the same result.
For context, the basic structure of the existing code and how the change fits in:
* First, we select a widening strategy. (The result is irrelevant for this patch.)
* Then we determine if a computation is uniform within all lanes of VF. (Note this is the uniform-per-part definition, not LAI's uniform across all unrolled iterations definition.)
* If it is, we overrule the widening strategy, and unconditionally scalarize.
* VPReplicationRecipe - which is what actually does the scalarization - knows how to handle unform-per-part values including for scalable vectors. However, we do need to know that the expression is safe to execute without predication - e.g. the uniform mem op was unconditional in the original loop. (This part was split off and already landed.)
An obvious question is why not simply implement the generic case? The answer is that I'm going to, but doing so without a canonicalization towards uniform causes regressions due to bad interaction with scalarization/uniformity of values feeding the uniform mem-op. This patch is needed to avoid those regressions.
Differential Revision: https://reviews.llvm.org/D130364
The problem Alexander reported on D127982 was caused by an optimization
for AVX512-FP16 instruction. We must limit it to the feature enabled only.
During the investigation, I found we didn't expand for fp_round/fp_extend
without F16C. This may result runtime crash, so change them too.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D130817
Add a new IRBuilderBase::CreateIntrinsic which takes the return type and
argument values for the intrinsic call but does not take an explicit
list of types to mangle. Instead the builder works this out from the
intrinsic declaration and the types of the supplied arguments.
This means that the mangling is hidden from the client, which in turn
means that intrinsic definitions can change which arguments are mangled
without requiring any changes to the client code.
Differential Revision: https://reviews.llvm.org/D130776
This folds a v4i32 Mul(And(Srl(X, 15), 0x10001), 0xffff) into a v8i16
CMLTz instruction. The Srl and And extract the top bit (whether the
input is negative) and the Mul sets all values in the i16 half to all
1/0 depending on if that top bit was set. This is equivalent to a v8i16
CMLTz instruction. The same applies to other sizes with equivalent
constants.
Differential Revision: https://reviews.llvm.org/D130874
matchRotateSub is given shift amounts that will already have stripped any/zero-extend nodes from - so make sure those values are wide enough to take a mask.
If we have interleave groups in the loop we want to vectorise then
we should fall back on normal vectorisation with a scalar epilogue. In
such cases when tail-folding is enabled we'll almost certainly go on to
create vplans with very high costs for all vector VFs and fall back on
VF=1 anyway. This is likely to be worse than if we'd just used an
unpredicated vector loop in the first place.
Once the vectoriser has proper support for analysing all the costs
for each combination of VF and vectorisation style, then we should
be able to remove this.
Added an extra test here:
Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll
Differential Revision: https://reviews.llvm.org/D128342
* TargetFrameLowering has a TransientStackAlignment field that "returns
the number of bytes to which the stack pointer must be aligned at all
times, even between calls.
* As explained in the [RISC-V calling
convention](https://github.com/riscv-non-isa/riscv-elf-psabi-doc/blob/master/riscv-cc.adoc),
the stack pointer must remain fully aligned throughout execution for
compliant code. This is important for embedded targets that might avoid
realigning the stack pointer for interrupt service routines. Systems
running full OSes may always realign the stack anyway.
* TransientStackAlignment is used in estimateStackSize in
MachineFrameInfo and in PEI::calculateFrameObjectOffsets.
* estimateStackSize is only used in the RISC-V backend for scavenging
slots. It may be possible to craft a function where the difference
is observable, but it wouldn't be a meaningful test.
* calculateFrameObjectOffsets makes use of TransientStackAlignment,
but then sets the stack alignment to the max of that alignment and
MaxAlign, which is unconditionally set to 16 in
RISCVFrameLowering::processFunctionBeforeFrameFinalized
* I've changed this logic to only set MaxAlign if there are RVV frame
objects. There should be no functional change here for either RVV
targets (MaxAlign is set as before) or non-RVV targets
(TransientStackAlign is now 16 anyway).
Differential Revision: https://reviews.llvm.org/D130068
Now the API getExtendedAddReductionCost is used to determine the cost of extended Add reduction with optional Mul. For Arm, it could cover the cases. But for other target, for example: RISCV, they support other kinds of extended recution, such as FAdd.
This patch does the following changes:
1, Split getExtendedAddReductionCost into 2 new API: getExtendedReductionCost which handles the extended reduction with addtional input of Opcode; getMulAccReductionCost which handle the MLA cases the getExtendedAddReductionCost.
2, Refactor getReductionPatternCost, add some contraint condition to make sure the getMulAccReductionCost should only handle the reuction of Add + Mul.
Differential Revision: https://reviews.llvm.org/D130868
Reflect in the pointer's offset the length of the leading part
of the consumed string preceding the first converted digit.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D130912
It's possible we have:
lui a0, %hi(sym)
addi a0, %lo(sym)
addi a0, <offset1>
lw a0, <offset2>(a0)
We want to arrive at
lui a0, %hi(sym+offset1+offset2)
lw a0, %lo(sym+offset1+offset2)
We currently fail to do this because we only consider loads/stores
if we didn't find any arithmetic.
This patch splits arithmetic folding and load/store folding into
two separate phases. The load/store folding can no longer assume
the offset in hi/lo is 0 so we must combine the offsets. I've applied
the same simm32 limit that we applied in the arithmetic folding.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D130931
The patch replaces SPIRVBaseInfo.* previously created using macros by
the tablegen approach. There are many small changes in other files due to
differences in namespaces. Also, functions in SPIRVUtils are moved to
the llvm namespace.
Differential Revision: https://reviews.llvm.org/D130518
Co-authored-by: Aleksandr Bezzubikov <zuban32s@gmail.com>
Co-authored-by: Michal Paszkowski <michal.paszkowski@outlook.com>
Co-authored-by: Andrey Tretyakov <andrey1.tretyakov@intel.com>
Co-authored-by: Konrad Trifunovic <konrad.trifunovic@intel.com>
2xi64 is the legalized type for wide reductions (like 16xi64) and setting the
cost to 2 makes `load-reduce` and `load-zext-reduce` patterns profitable.
The few performance measurments that I did on an aarch64 machine confirm that
these patterns are actually faster when vectorized.
Differential Revision: https://reviews.llvm.org/D130740
Follow-up to D130434.
Move doSystemDiff to PrintPasses.cpp and call it in MachineFunctionPass.cpp.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D130833
For VALU write and memory (VM, L/DS, FLAT) instructions, SQ would insert
wait-states to avoid data hazard. However when there is a DGEMM instruction
in-between them, SQ incorrectly disables the wait-states thus the data hazard
needs to be handled with this workaround.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D130677
At least based on the lit tests, the coalescer sometimes fails to
propagate the copy from X0 into the branch instruction. This patch
does it manually during isel. The majority of the changes are from
the select patterns.
Some of the changes are just register allocation changes. Only
the Select change affects the whether a b*z instruction is generated
in the tests. I changed the branch pattern for consistency.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D130809
The only iterator we're holding points to HiLUI and we never
delete that so I think it is safe to delete everything else
immediately.
I want to split detectAndFoldOffset into two phases. First, combine
LUI+ADDI with any ADD/ADDI/SHXADD that comes after it. This may
open opportunities to fold the ADDI from the LUI+ADDI into a
load/store address. So the load/store folding should run as a
second phase even if the ADD/ADDI/SHXADD made changes.
In order to do this we need to eagerly delete instructions in the
first phase so that we don't have dead users of the LUI+ADDI
when we start the second phase.
Patches to split the phases will come later.
Reviewed By: asb, luismarques
Differential Revision: https://reviews.llvm.org/D130119
Extend hazard recognizer of ReadM0MovRelInterpHazard with
DS_READ_ADDTID and DS_WRITE_ADDTID, as they also
require a manually inserted S_NOP after SALU writing m0.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D130783
Fix all instances of:
*** Bad machine code: Kill missing from LiveVariables ***
in the X86 CodeGen tests with D129213 applied, which adds verification
of LiveIntervals after the TwoAddressInstruction pass runs.
Differential Revision: https://reviews.llvm.org/D129634
According to the ABI for the Arm Architecture, the value for the
Tag_also_compatible_with eabi attribute is represented by an NTBS entry.
This string value, in turn, is composed of a pair of tag+value encoded
in one of two formats:
- ULEB128: tag, ULEB128: value, 0.
- ULEB128: tag, NBTS: data.
(See [[ 60a8eb8c55/addenda32/addenda32.rst (3373secondary-compatibility-tag) | section 3.3.7.3 on the Addenda to, and Errata in, the ABI for the Arm Architecture ]].)
Currently the Arm assembly parser and streamer ignore the encoding of
the attribute's NTBS value, which can result in incorrect attributes
being emitted in both assembly and object file outputs.
This patch fixes these issues by properly handing the value's encoding.
An update to llvm-readobj to properly handle the attribute's value will be
covered by a separate patch.
Patch by Victor Campos and Lucas Prates.
Reviewed By: vhscampos
Differential Revision: https://reviews.llvm.org/D129500
Scope of changes:
1) Added new function to generate loop versioning
2) Added support for if clause to applySimd function
2) Added tests which confirm that lowering is successful
If ifCond is specified, then collapsed loop is duplicated and if branch
is added. Duplicated loop is executed if simd ifCond is evaluated to false.
Reviewed By: Meinersbur
Differential Revision: https://reviews.llvm.org/D129368
Signed-off-by: Dominik Adamski <dominik.adamski@amd.com>
Eliminate an AND by redefining an anyext|sext|zext.
(and (extract_subvector (anyext|sext|zext v) _) iN_mask)
=> (extract_subvector (zeroext_iN v))
Differential Revision: https://reviews.llvm.org/D130782
Builds upon D123264, adding support for merging the low part of the LLA
address into the load/store instruction offsets.
Differential Revision: https://reviews.llvm.org/D123265
Salvage debug info of instruction that is about to be deleted as dead in
Combiner pass. Currently supported instructions are COPY and G_TRUNC.
It allows to salvage debug info of some dead arguments of functions, by putting
DWARF expression corresponding to the instruction being deleted into related
DBG_VALUE instruction.
Here is an example of missing variables location https://godbolt.org/z/K48osb9dK.
We see that arguments x, y of function foo are not available in debugger, and
corresponding DBG_VALUE instructions have undefined register operand instead of
variables locaton after Aarch64PreLegalizerCombiner pass. The reason is that
registers where variables are located are removed as dead (with instruction
G_TRUNC). We can use salvageDebugInfo analogue for gMIR to preserve debug
locations of dead variables.
Statistics of llvm object files built with vs without this commit on -O2
optimization level (CMAKE_BUILD_TYPE=RelWithDebInfo, -fglobal-isel) on Aarch64 (macOS):
Number of variables with 100% of parent scope covered by DW_AT_location has been increased by 7,9%.
Number of variables with 0% coverage of parent scope has been decreased by 1,2%.
Number of variables processed by location statistics has been increased by 2,9%.
Average PC ranges coverage has been increased by 1,8 percentage points.
Coverage can be improved by supporting more instructions, or by calling
salvageDebugInfo for instructions that are deleted during Combiner rules exection.
Reviewed By: aprantl
Differential Revision: https://reviews.llvm.org/D129909
SimplifyCFG does some common code hoisting, which is limited to hoisting a
sequence of identical instruction in identical order and stops at the first
non-identical instruction.
This patch allows hoisting instruction pairs over same-length sequences of
non-matching instructions. The linear asymptotic complexity of the algorithm
stays the same, there's an extra parameter `simplifycfg-hoist-common-skip-limit`
serving to limit compilation time and/or the size of the hoisted live ranges.
The patch improves SPECv6/525.x264_r by about 10%.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D129370
getModRefInfo() queries currently track whether the result is a
MustAlias on a best-effort basis. The only user of this functionality
is the optimized memory access type in MemorySSA -- which in turn
has no users. Given that this functionality has not found a user
since it was introduced five years ago (in D38862), I think we
should drop it again.
The context is that I'm working to separate FunctionModRefBehavior
to track mod/ref for different location kinds (like argmem or
inaccessiblemem) separately, and the fact that ModRefInfo also has
an unrelated Must flag makes this quite awkward, especially as this
means that NoModRef is not a zero value. If we want to retain the
functionality, I would probably split getModRefInfo() results into
a part that just contains the ModRef information, and a separate
part containing a (best-effort) AliasResult.
Differential Revision: https://reviews.llvm.org/D130713
This belongs to a series of patches which try to solve the thread
identification problem in coroutines. See
https://discourse.llvm.org/t/address-thread-identification-problems-with-coroutine/62015
for a full background.
The problem consists of two concrete problems: TLS variable and readnone
functions. This patch tries to convert the TLS problem to readnone
problem by converting the access of TLS variable to an intrinsic which
is marked as readnone.
The readnone problem would be addressed in following patches.
Reviewed By: nikic, jyknight, nhaehnle, ychen
Differential Revision: https://reviews.llvm.org/D125291
Expand load address pseudo-instructions earlier (pre-ra) to allow follow-up
patches to fold the addi of PseudoLLA instructions into the immediate
operand of load/store instructions.
Differential Revision: https://reviews.llvm.org/D123264
issue #56775
I rearranged the Thumb2 codegen test to avoid simplifying the chain
of rounding instructions. I'm assuming the intent of the test is
to verify lowering of each of those intrinsics.
Only PACKSS/PACKUS faux shuffles make use of the demanded elts at the moment, but this at least improves the handling of a couple of truncation patterns.
Handles COMDAT symbol with an offset and refactor the code to only generated symbol if the second symbol was encountered. This happens very infrequently but happens in recursive_mutex implementation of MSVC STL library.
Reviewed By: lhames
Differential Revision: https://reviews.llvm.org/D130454
Implements remaining IMAGE_REL_AMD64_REL32_*. We only need IMAGE_REL_AMD64_REL32_4 for now but doing all remaining ones for completeness. (clang only uses IMAGE_REL_AMD64_REL32_1 and IMAGE_REL_AMD64_REL32)
Reviewed By: lhames
Differential Revision: https://reviews.llvm.org/D130452
Relax zero-fill edge assertions to only consider relocation edges. Keep-alive edges to zero-fill blocks can cause this assertion which is too strict.
Reviewed By: lhames
Differential Revision: https://reviews.llvm.org/D130450
Implements include/alternatename linker directive. Alternatename is used by static msvc runtime library. Alias symbol is technically incorrect (we have to search for external definition) but we don't have a way to represent this in jitlink/orc yet, this is solved in the following up patch.
Inlcude linker directive is used in ucrt to forcelly lookup the static initializer symbols so that they will be emitted. It's implemented as extenral symbols with live flag on that cause the lookup of these symbols.
Reviewed By: lhames
Differential Revision: https://reviews.llvm.org/D130276
This adds a merge operand to all of the binary _VL nodes. Including
integer and widening. They all share multiclasses in tablegen
so doing them all at once was easiest.
I plan to use FADD_VL in an upcoming patch. The rest are just for
consistency to keep tablegen working.
This does reduce the isel table size by about 25k so that's nice.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D130816
Noticed by inspection and I can't seem to make a test case, but SSE arithmetic bit shifts clamp to the max shift amount (i.e. create a sign splat) - combineVectorShiftImm already does something similar.
This patch fixes the error llvm/lib/CodeGen/MachineScheduler.cpp(755): error C2065: 'MISchedCutoff': undeclared identifier in case of NDEBUG and LLVM_ENABLE_ABI_BREAKING_CHECKS.
Note MISchedCutoff is declared under #ifndef NDEBUG.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D130425
This is a follow-up to 2ebfda2417
(replace "if" with "else if" since the cases nuw/nsw
were meant to be handled separately).
Test plan:
1/ ninja check-llvm check-clang check-lld
2/ Bootstrapped LLVM/Clang pass tests
The isa<Constant> check could misfire on an instruction with 2 constant
operands. This bug was introduced with bb789381fc (D36988).
See issue #56810 for a C source example that exposed the bug.
For constants in the range [-2047, 2048] we use addi. If the constant
is -2048 we can use xori. If we don't match this explicitly, we'll
emit an LI for the -2048 followed by an XOR.
It is not necessary to wait for all outstanding memory operations before
barriers on hardware that can back off of the barrier in the event of an
exception when traps are enabled. Add a new subtarget feature which
tracks which HW has this ability.
Reviewed By: #amdgpu, rampitec
Differential Revision: https://reviews.llvm.org/D130722
https://alive2.llvm.org/ce/z/3jYbEH
We should choose one of these forms, and the option that uses
the narrow type allows the motivating example from issue #56294
to reduce. In the best case (no 'not' needed and 'trunc' remains),
this does remove an instruction.
Note that there is what looks like a regression because there
is an existing canonicalization that turns trunc into and+icmp.
That is a long-standing transform, and I'm not sure what effect
reversing it would have.
If the LHS op has a single use then using the more general AND op is likely to allow commutation, load folding, generic folds etc.
Updated version - original version rG057db2002bb3 didn't correctly account for multiple uses of the mask that might be folding "OR(AND(X,C),AND(Y,~C)) -> OR(AND(X,C),ANDNP(C,Y))" in canonicalizeBitSelect