- Add the M68k-specific MC layer implementation
- Add ELF support for M68k
- Add M68k-specifc CC and reloc
TODO: Currently AsmParser and disassembler are not implemented yet.
Please use this bug to track the status:
https://bugs.llvm.org/show_bug.cgi?id=48976
Authors: myhsu, m4yers, glaubitz
Differential Revision: https://reviews.llvm.org/D88390
This `R_WASM_MEMORY_ADDR_SELFREL_I32` relocation represents an offset
between its relocating address and the symbol address. It's very similar
to `R_X86_64_PC32` but restricted to be used for only data segments.
```
S + A - P
```
A: Represents the addend used to compute the value of the relocatable
field.
P: Represents the place of the storage unit being relocated.
S: Represents the value of the symbol whose index resides in the
relocation entry.
Proposal: https://github.com/WebAssembly/tool-conventions/issues/162
Differential Revision: https://reviews.llvm.org/D96659
Adds support for the TLS general dynamic access model to
assembly files on AIX 32-bit.
To generate the correct code sequence when accessing a TLS variable
`v`, we first create two TOC entry nodes, one for the variable offset, one
for the region handle. These nodes are followed by a `PPCISD::TLSGD_AIX`
node (new node introduced by this patch).
The `PPCISD::TLSGD_AIX` node (`TLSGDAIX` pseudo instruction) is
expanded to 2 copies (to put the variable offset and region handle in
the right registers) and a call to `__tls_get_addr`.
This patch also changes the way TC entries are generated in asm files.
If the generated TC entry is for the region handle of a TLS variable,
we add the `@m` relocation and the `.` prefix to the entry name.
For example:
```
L..C0:
.tc .v[TC],v[TL]@m -> region handle
L..C1:
.tc v[TC],v[TL] -> variable offset
```
Reviewed By: nemanjai, sfertile
Differential Revision: https://reviews.llvm.org/D97948
- This patch adds in support to determine whether a particular label
is valid for the hlasm variant
- The label syntax being checked is that of an ordinary HLASM symbol
(Reference, Chapter 2 (Coding and Structure) - Terms, Literals and
Expressions - Terms - Symbols - Ordinary Symbol)
- To achieve this, the virtual function isLabel defined in
MCTargetAsmParser.h is made use of
- The isLabel function is overridden in SystemZAsmParser for the
hlasm variant, and the syntax is checked appropriately
- Things remain unchanged for the att variant
- Further patches will add in support to emit the label. These future
patches will make use of this isLabel function
Reviewed By: uweigand, Kai
Differential Revision: https://reviews.llvm.org/D97748
This patch updates DbgVariableIntrinsics to support use of a DIArgList for the
location operand, resulting in a significant change to its interface. This patch
does not update all IR passes to support multiple location operands in a
dbg.value; the only change is to update the DbgVariableIntrinsic interface and
its uses. All code outside of the intrinsic classes assumes that an intrinsic
will always have exactly one location operand; they will still support
DIArgLists, but only if they contain exactly one Value.
Among other changes, the setOperand and setArgOperand functions in
DbgVariableIntrinsic have been made private. This is to prevent code from
setting the operands of these intrinsics directly, which could easily result in
incorrect/invalid operands being set. This does not prevent these functions from
being called on a debug intrinsic at all, as they can still be called on any
CallInst pointer; it is assumed that any code directly setting the operands on a
generic call instruction is doing so safely. The intention for making these
functions private is to prevent DIArgLists from being overwritten by code that's
naively trying to replace one of the Values it points to, and also to fail fast
if a DbgVariableIntrinsic is updated to use a DIArgList without a valid
corresponding DIExpression.
This changes the target data layout to make stack align to 16 bytes
on Power10. Before this change, stack was being aligned to 32 bytes.
Reviewed By: #powerpc, nemanjai
Differential Revision: https://reviews.llvm.org/D96265
While working on adding fixed-length vectors to the calling convention,
it was necessary to be able to query for a fixed-length vector container
type without access to an instance of SelectionDAG.
This patch modifies the "main" getContainerForFixedLengthVector function
to use an instance of TargetLowering rather than SelectionDAG, and
preserves the SelectionDAG overload as a wrapper.
An additional non-static version of the function was also added to
simplify the common case in RISCVTargetLowering.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D97925
A setcc can be created during LegalizeDAG after select_cc has been
created. This combine will enable us to fold these late setccs.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D98132
This pattern occurs when lowering for overflow operations
introduce an xor after select_cc has already been formed.
I had to rework another combine that looked for select_cc of an xor
with 1. That xor will now get combined away so we just need to
look for the RHS of the select_cc being 1.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D98130
Patch adds support for passing vector call operands to variadic
functions. Arguments which are fixed shadow GPRs and stack space even
when they are passed in vector registers, while arguments passed through
ellipses are passed in properly aligned GPRs if available and on the
stack once all GPR arguments registers are consumed.
Differential Revision: https://reviews.llvm.org/D97956
That review is extracted from D69372.
It fixes https://bugs.llvm.org/show_bug.cgi?id=42219 bug.
For the noimplicitfloat mode, the compiler mustn't generate
floating-point code if it was not asked directly to do so.
This rule does not work with variable function arguments currently.
Though compiler correctly guards block of code, which copies xmm vararg
parameters with a check for %al, it does not protect spills for xmm registers.
Thus, such spills are generated in non-protected areas and could break code,
which does not expect floating-point data. The problem happens in -O0
optimization mode. With this optimization level there is used
FastRegisterAllocator, which spills virtual registers at basic block boundaries.
Register Allocator does not protect spills with additional control-flow modifications.
Thus to resolve that problem, it is suggested to not copy incoming physical
registers into virtual registers. Instead, store incoming physical xmm registers
into the memory from scratch.
Differential Revision: https://reviews.llvm.org/D80163
gfx1030 added a new way to implement readcyclecounter using the
SHADER_CYCLES hardware register, but the s_memtime instruction still
exists, so the MC layer should still accept it and the
llvm.amdgcn.s.memtime intrinsic should still work.
Differential Revision: https://reviews.llvm.org/D97928
This patch adds support for the default AltiVec ABI for AIX.
Vector registers 20 through 31 are marked as reserved and cannot
be used in the default ABI. This patch adds handling for this case
and also remove the default AltiVec ABI errors.
Reviewed By: sfertile
Differential Revision: https://reviews.llvm.org/D96351
Copy-paste P9 insns were added back in 2016,
however, looks like the opcodes has changed in ISA3.1.
Reviewed By: #powerpc, nemanjai
Differential Revision: https://reviews.llvm.org/D97416
Some BPF programs compiled on s390 fail to load, because s390
arch-specific linux headers contain float and double types. At the
moment there is no BTF_KIND for floats and doubles, so the release
version of LLVM ends up emitting type id 0 for them, which the
in-kernel verifier does not accept.
Introduce support for such types to libbpf by representing them using
the new BTF_KIND_FLOAT.
Reviewed By: yonghong-song
Differential Revision: https://reviews.llvm.org/D83289
Rewrites test to use correct architecture triple; fixes incorrect
reference in SourceLevelDebugging doc; simplifies `spillReg` behaviour
so as to not be dependent on changes elsewhere in the patch stack.
This reverts commit d2000b45d0.
Same as other memory instructions, ds instructions add latency even if
exec is zero. Jumping over them if exec=0 is cheaper than executing
them.
With this change, the branch instruction that skips over a basic block
if exec=0 is not removed when the block contains a ds instruction.
Differential Revision: https://reviews.llvm.org/D97922
Recommit bf5a582650. Depends on
4c8fb7ddd6 which was reverted.
RegBankSelect creates zext and trunc when it selects banks for uniform i1.
Add zext_trunc_fold from generic combiner to post RegBankSelect combiner.
Differential Revision: https://reviews.llvm.org/D95432
This pass runs in any situations but we skip it when it is not O0 and the
function doesn't have optnone attribute. With -O0, the def of shape to amx
intrinsics is near the amx intrinsics code. We are not able to find a
point which post-dominate all the shape and dominate all amx intrinsics.
To decouple the dependency of the shape, we transform amx intrinsics
to scalar operation, so that compiling doesn't fail. In long term, we
should improve fast register allocation to allocate amx register.
Reviewed By: pengfei
Differential Revision: https://reviews.llvm.org/D93594
By implementing the method "unsigned RISCVTTIImpl::getRegisterBitWidth(bool Vector)",
fixed-length vectorization is enabled when possible. Without this method, the
"#pragma clang loop" directive is needed to enable vectorization(or the cost model
may inform LLVM that "Vectorization is possible but not beneficial").
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D97549
Lorenz Bauer from Cloudflare tried to use "const struct <name>"
as the type for __builtin_btf_type_id(*(const struct <name>)0, 1)
relocation and hit a llvm BPF fatal error.
https://lore.kernel.org/bpf/a3782f71-3f6b-1e75-17a9-1827822c2030@fb.com/
...
fatal error: error in backend: Empty type name for BTF_TYPE_ID_REMOTE reloc
Currently, we require the debuginfo type itself must have a name.
In this case, the debuginfo type is "const" which points to "struct <name>".
The "const" type does not have a name, hence the above fatal error
will be triggered.
Let us permit "const" and "volatile" type modifiers. We skip modifiers
in some other cases as well like structure member type tracing.
This can aviod the above fatal error.
Differential Revision: https://reviews.llvm.org/D97986
This is included from IR files, and IR doesn't/can't depend on Analysis
(because Analysis depends on IR).
Also fix the implementation - don't use non-member static in headers, as
it leads to ODR violations, inaccurate "unused function" warnings, etc.
And fix the header protection macro name (we don't generally include
"LIB" in the names, so far as I can tell).
This is actually two changes. One is to avoid copies when fp values are fed into
a build_vector, without being able to tell from the opcode.
The other is that build_vectors are also marked as only defining FP, since they
produce vector results.
Differential Revision: https://reviews.llvm.org/D97968
This is a case D97677 missed. When taking out remaining BBs that are
reachable from already-taken-out exceptions (because they are not
subexcptions but unwind destinations), I assumed the remaining BBs are
not EH pads, but they can be. For example,
```
try {
try {
throw 0;
} catch (int) { // (a)
}
} catch (int) { // (b)
}
try {
foo();
} catch (int) { // (c)
}
```
In this code, (b) is the unwind destination of (a) so its exception is
taken out of (a)'s exception, But even though the next try-catch is not
inside the first two-level try-catches, because the first try always
throws, its continuation BB is unreachable and the whole rest of the
function is dominated by EH pad (a), including EH pad (c). So after we
take out of (b)'s exception out of (a)'s, we also need to take out (c)'s
exception out of (a)'s, because (c) is reachable from (b).
This adds one more step before what we did for remaining BBs in D97677;
it traverses EH pads first to take subexceptions out of their incorrect
parent exception. It's the same thing as D97677, but because we can do
this before we add BBs to exceptions' sets, we don't need to fix sets
and only need to fix parent exception pointers.
Other changes are variable name changes (I changed `WE` -> `SrcWE`,
`UnwindWE` -> `DstWE` for clarity), some comment changes, and a drive-by
fix in a bug in a `LLVM_DEBUG` print statement.
Fixes https://github.com/emscripten-core/emscripten/issues/13588.
Reviewed By: dschuff
Differential Revision: https://reviews.llvm.org/D97929
Background:
Wasm EH, while using Windows EH (catchpad/cleanuppad based) IR, uses
Itanium-based libraries and ABIs with some modifications.
`__clang_call_terminate` is a wrapper generated in Clang's Itanium C++
ABI implementation. It contains this code, in C-style pseudocode:
```
void __clang_call_terminate(void *exn) {
__cxa_begin_catch(exn);
std::terminate();
}
```
So this function is a wrapper to call `__cxa_begin_catch` on the
exception pointer before termination.
In Itanium ABI, this function is called when another exception is thrown
while processing an exception. The pointer for this second, violating
exception is passed as the argument of this `__clang_call_terminate`,
which calls `__cxa_begin_catch` with that pointer and calls
`std::terminate` to terminate the program.
The spec (https://libcxxabi.llvm.org/spec.html) for `__cxa_begin_catch`
says,
```
When the personality routine encounters a termination condition, it
will call __cxa_begin_catch() to mark the exception as handled and then
call terminate(), which shall not return to its caller.
```
In wasm EH's Clang implementation, this function is called from
cleanuppads that terminates the program, which we also call terminate
pads. Cleanuppads normally don't access the thrown exception and the
wasm backend converts them to `catch_all` blocks. But because we need
the exception pointer in this cleanuppad, we generate
`wasm.get.exception` intrinsic (which will eventually be lowered to
`catch` instruction) as we do in the catchpads. But because terminate
pads are cleanup pads and should run even when a foreign exception is
thrown, so what we have been doing is:
1. In `WebAssemblyLateEHPrepare::ensureSingleBBTermPads()`, we make sure
terminate pads are in this simple shape:
```
%exn = catch
call @__clang_call_terminate(%exn)
unreachable
```
2. In `WebAssemblyHandleEHTerminatePads` pass at the end of the
pipeline, we attach a `catch_all` to terminate pads, so they will be in
this form:
```
%exn = catch
call @__clang_call_terminate(%exn)
unreachable
catch_all
call @std::terminate()
unreachable
```
In `catch_all` part, we don't have the exception pointer, so we call
`std::terminate()` directly. The reason we ran HandleEHTerminatePads at
the end of the pipeline, separate from LateEHPrepare, was it was
convenient to assume there was only a single `catch` part per `try`
during CFGSort and CFGStackify.
---
Problem:
While it thinks terminate pads could have been possibly split or calls
to `__clang_call_terminate` could have been duplicated,
`WebAssemblyLateEHPrepare::ensureSingleBBTermPads()` assumes terminate
pads contain no more than calls to `__clang_call_terminate` and
`unreachable` instruction. I assumed that because in LLVM very limited
forms of transformations are done to catchpads and cleanuppads to
maintain the scoping structure. But it turned out to be incorrect;
passes can merge cleanuppads into one, including terminate pads, as long
as the new code has a correct scoping structure. One pass that does this
I observed was `SimplifyCFG`, but there can be more. After this
transformation, a single cleanuppad can contain any number of other
instructions with the call to `__clang_call_terminate` and can span many
BBs. It wouldn't be practical to duplicate all these BBs within the
cleanuppad to generate the equivalent `catch_all` blocks, only with
calls to `__clang_call_terminate` replaced by calls to `std::terminate`.
Unless we do more complicated transformation to split those calls to
`__clang_call_terminate` into a separate cleanuppad, it is tricky to
solve.
---
Solution (?):
This CL just disables the generation and use of `__clang_call_terminate`
and calls `std::terminate()` directly in its place.
The possible downside of this approach can be, because the Itanium ABI
intended to "mark" the violating exception handled, we don't do that
anymore. What `__cxa_begin_catch` actually does is increment the
exception's handler count and decrement the uncaught exception count,
which in my opinion do not matter much given that we are about to
terminate the program anyway. Also it does not affect info like stack
traces that can be possibly shown to developers.
And while we use a variant of Itanium EH ABI, we can make some
deviations if we choose to; we are already different in that in the
current version of the EH spec we don't support two-phase unwinding. We
can possibly consider a more complicated transformation later to
reenable this, but I don't think that has high priority.
Changes in this CL contains:
- In Clang, we don't generate a call to `wasm.get.exception()` intrinsic
and `__clang_call_terminate` function in terminate pads anymore; we
simply generate calls to `std::terminate()`, which is the default
implementation of `CGCXXABI::emitTerminateForUnexpectedException`.
- Remove `WebAssembly::ensureSingleBBTermPads() function and
`WebAssemblyHandleEHTerminatePads` pass, because terminate pads are
already `catch_all` now (because they don't need the exception
pointer) and we don't need these transformations anymore.
- Change tests to use `std::terminate` directly. Also removes tests that
tested `LateEHPrepare::ensureSingleBBTermPads` and
`HandleEHTerminatePads` pass.
- Drive-by fix: Add some function attributes to EH intrinsic
declarations
Fixes https://github.com/emscripten-core/emscripten/issues/13582.
Reviewed By: dschuff, tlively
Differential Revision: https://reviews.llvm.org/D97834
The hazard where a VMEM reads an SGPR written by a VALU counts as a data
dependency hazard, so no nops are required on GFX10. Tested with Vulkan
CTS on GFX10.1 and GFX10.3.
Differential Revision: https://reviews.llvm.org/D97926
explicitly emitting retainRV or claimRV calls in the IR
This reapplies ed4718eccb, which was reverted
because it was causing a miscompile. The bug that was causing the miscompile
has been fixed in 75805dce5f.
Original commit message:
Background:
This fixes a longstanding problem where llvm breaks ARC's autorelease
optimization (see the link below) by separating calls from the marker
instructions or retainRV/claimRV calls. The backend changes are in
https://reviews.llvm.org/D92569.
https://clang.llvm.org/docs/AutomaticReferenceCounting.html#arc-runtime-objc-autoreleasereturnvalue
What this patch does to fix the problem:
- The front-end adds operand bundle "clang.arc.attachedcall" to calls,
which indicates the call is implicitly followed by a marker
instruction and an implicit retainRV/claimRV call that consumes the
call result. In addition, it emits a call to
@llvm.objc.clang.arc.noop.use, which consumes the call result, to
prevent the middle-end passes from changing the return type of the
called function. This is currently done only when the target is arm64
and the optimization level is higher than -O0.
- ARC optimizer temporarily emits retainRV/claimRV calls after the calls
with the operand bundle in the IR and removes the inserted calls after
processing the function.
- ARC contract pass emits retainRV/claimRV calls after the call with the
operand bundle. It doesn't remove the operand bundle on the call since
the backend needs it to emit the marker instruction. The retainRV and
claimRV calls are emitted late in the pipeline to prevent optimization
passes from transforming the IR in a way that makes it harder for the
ARC middle-end passes to figure out the def-use relationship between
the call and the retainRV/claimRV calls (which is the cause of
PR31925).
- The function inliner removes an autoreleaseRV call in the callee if
nothing in the callee prevents it from being paired up with the
retainRV/claimRV call in the caller. It then inserts a release call if
claimRV is attached to the call since autoreleaseRV+claimRV is
equivalent to a release. If it cannot find an autoreleaseRV call, it
tries to transfer the operand bundle to a function call in the callee.
This is important since the ARC optimizer can remove the autoreleaseRV
returning the callee result, which makes it impossible to pair it up
with the retainRV/claimRV call in the caller. If that fails, it simply
emits a retain call in the IR if retainRV is attached to the call and
does nothing if claimRV is attached to it.
- SCCP refrains from replacing the return value of a call with a
constant value if the call has the operand bundle. This ensures the
call always has at least one user (the call to
@llvm.objc.clang.arc.noop.use).
- This patch also fixes a bug in replaceUsesOfNonProtoConstant where
multiple operand bundles of the same kind were being added to a call.
Future work:
- Use the operand bundle on x86-64.
- Fix the auto upgrader to convert call+retainRV/claimRV pairs into
calls with the operand bundles.
rdar://71443534
Differential Revision: https://reviews.llvm.org/D92808
This patch adds the cost model for experimental.vector.reverse
with scalable vector types: nxv16i1, nxv8i1, nxv4i1 and nxv2i1.
These types are missing from the previous cost model patch D95603.
The cost model for experimental.vector.reverse with 1 bit mask is used by
loop vectorization in the patch D95363
Differential Revision: https://reviews.llvm.org/D97758
Patch adds support for passing vector arguments to variadic functions.
Arguments which are fixed shadow GPRs and stack space even when they are
passed in vector registers, while arguments passed through ellipses are
passed in(properly aligned GPRs if available and on the stack once all
GPR arguments registers are consumed.
Differential Revision: https://reviews.llvm.org/D97485
RegBankSelect creates zext and trunc when it selects banks for uniform i1.
Add zext_trunc_fold from generic combiner to post RegBankSelect combiner.
Differential Revision: https://reviews.llvm.org/D95432
This patch is a large number of small changes that should hopefully not
affect the generated machine code but are still important to get right
so that the machine verifier won't complain about them.
The llvm/test/CodeGen/AVR/pseudo/*.mir changes are also necessary
because without the liveins the used registers are considered undefined
by the machine verifier and it will complain about them.
Differential Revision: https://reviews.llvm.org/D97172
This patch adds a new instruction that can represent variadic debug values,
DBG_VALUE_VAR. This patch alone covers the addition of the instruction and a set
of basic code changes in MachineInstr and a few adjacent areas, but does not
correctly handle variadic debug values outside of these areas, nor does it
generate them at any point.
The new instruction is similar to the existing DBG_VALUE instruction, with the
following differences: the operands are in a different order, any number of
values may be used in the instruction following the Variable and Expression
operands (these are referred to in code as “debug operands”) and are indexed
from 0 so that getDebugOperand(X) == getOperand(X+2), and the Expression in a
DBG_VALUE_VAR must use the DW_OP_LLVM_arg operator to pass arguments into the
expression.
The new DW_OP_LLVM_arg operator is only valid in expressions appearing in a
DBG_VALUE_VAR; it takes a single argument and pushes the debug operand at the
index given by the argument onto the Expression stack. For example the
sub-expression `DW_OP_LLVM_arg, 0` has the meaning “Push the debug operand at
index 0 onto the expression stack.”
Differential Revision: https://reviews.llvm.org/D82363
llvm-objdump only uses one MCInstrAnalysis object, so if ARM and Thumb
code is mixed in one object, or if an object is disassembled without
explicitly setting the triple to match the ISA used, then branch and
call targets will be printed incorrectly.
This could be fixed by creating two MCInstrAnalysis objects in
llvm-objdump, like we currently do for SubtargetInfo. However, I don't
think there's any reason we need two separate sub-classes of
MCInstrAnalysis, so instead these can be merged into one, and the ISA
determined by checking the opcode of the instruction.
Differential revision: https://reviews.llvm.org/D97766
This patch adds a pipeline to support in-order CPUs such as ARM
Cortex-A55.
In-order pipeline implements a simplified version of Dispatch,
Scheduler and Execute stages as a single stage. Entry and Retire
stages are common for both in-order and out-of-order pipelines.
Differential Revision: https://reviews.llvm.org/D94928
Generalize the shuffle(not(x)) -> not(shuffle(x)) fold to handle any binop with 0/-1.
Hopefully we can further generalize to help push target unary/binary shuffles through binops similar to what we do in DAGCombiner::visitVECTOR_SHUFFLE
This patch addresses a compiler crash resulting from passing a
fixed-length type to one that expects scalable vector types. An
assertion was added to prevent this regressing in the future.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D97868
This patch fixes up one case where the fixed-length-vector VL was
dropped (falling back to VLMAX) when inserting vector elements, as the
code would lower via ISD::INSERT_VECTOR_ELT (at index 0) which loses the
fixed-length vector information.
To this end, a custom node, VMV_S_XF_VL, was introduced to carry the VL
operand through to the final instruction. This node wraps the RVV
vmv.s.x and vmv.s.f instructions, which were being selected by
insert_vector_elt anyway.
There should be no observable difference in scalable-vector codegen.
There is still one outstanding drop from fixed-length VL to VLMAX, when
an i64 element is inserted into a vector on RV32; the splat (which is
custom legalized) has no notion of the original fixed-length vector
type.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D97842
This adds some simple known bits handling for the three CSINC/NEG/INV
instructions. From the operands known bits we can compute the common
bits of the first operand and incremented/negated/inverted second
operand. The first, especially CSINC ZR, ZR, comes up fair amount in the
tests. The others are more rare so a unit test for them is added.
Differential Revision: https://reviews.llvm.org/D97788
Make sure we preserve info about passed arguments as implicit uses, to
make sure later passes still have access to this information.
This fixes a mis-compile where the machine-combiner would pick an
incorrect free register.
Even though the implementation in emitAtomicCmpSwapW() was correct, it made
Valgrind report an error. Instead of using a RISBG on CmpVal, an LL[CH]R can
be made on the OldVal, and the problem is avoided.
Review: Ulrich Weigand
Differential Revision: https://reviews.llvm.org/D97604
This is a NFC with respect to the generated code. But it fixes a crash
when using -debug, because of the position in the enum CALL_RVMARKER
nodes were treated as memops. That caused a crash when printing
CALL_RVMARKER nodes.
Honor always_inline attribute when processing -amdgpu-inline-max-bb.
It was lost during the ports of the heuristic. There is no reason
to honor inline hint, but not always inline.
Differential Revision: https://reviews.llvm.org/D97790
This caused miscompiles of Chromium tests for iOS due clobbering of live
registers. See discussion on the code review for details.
> Background:
>
> This fixes a longstanding problem where llvm breaks ARC's autorelease
> optimization (see the link below) by separating calls from the marker
> instructions or retainRV/claimRV calls. The backend changes are in
> https://reviews.llvm.org/D92569.
>
> https://clang.llvm.org/docs/AutomaticReferenceCounting.html#arc-runtime-objc-autoreleasereturnvalue
>
> What this patch does to fix the problem:
>
> - The front-end adds operand bundle "clang.arc.attachedcall" to calls,
> which indicates the call is implicitly followed by a marker
> instruction and an implicit retainRV/claimRV call that consumes the
> call result. In addition, it emits a call to
> @llvm.objc.clang.arc.noop.use, which consumes the call result, to
> prevent the middle-end passes from changing the return type of the
> called function. This is currently done only when the target is arm64
> and the optimization level is higher than -O0.
>
> - ARC optimizer temporarily emits retainRV/claimRV calls after the calls
> with the operand bundle in the IR and removes the inserted calls after
> processing the function.
>
> - ARC contract pass emits retainRV/claimRV calls after the call with the
> operand bundle. It doesn't remove the operand bundle on the call since
> the backend needs it to emit the marker instruction. The retainRV and
> claimRV calls are emitted late in the pipeline to prevent optimization
> passes from transforming the IR in a way that makes it harder for the
> ARC middle-end passes to figure out the def-use relationship between
> the call and the retainRV/claimRV calls (which is the cause of
> PR31925).
>
> - The function inliner removes an autoreleaseRV call in the callee if
> nothing in the callee prevents it from being paired up with the
> retainRV/claimRV call in the caller. It then inserts a release call if
> claimRV is attached to the call since autoreleaseRV+claimRV is
> equivalent to a release. If it cannot find an autoreleaseRV call, it
> tries to transfer the operand bundle to a function call in the callee.
> This is important since the ARC optimizer can remove the autoreleaseRV
> returning the callee result, which makes it impossible to pair it up
> with the retainRV/claimRV call in the caller. If that fails, it simply
> emits a retain call in the IR if retainRV is attached to the call and
> does nothing if claimRV is attached to it.
>
> - SCCP refrains from replacing the return value of a call with a
> constant value if the call has the operand bundle. This ensures the
> call always has at least one user (the call to
> @llvm.objc.clang.arc.noop.use).
>
> - This patch also fixes a bug in replaceUsesOfNonProtoConstant where
> multiple operand bundles of the same kind were being added to a call.
>
> Future work:
>
> - Use the operand bundle on x86-64.
>
> - Fix the auto upgrader to convert call+retainRV/claimRV pairs into
> calls with the operand bundles.
>
> rdar://71443534
>
> Differential Revision: https://reviews.llvm.org/D92808
This reverts commit ed4718eccb.
Some instructions (especially mov+pop instructions) were setting the
wrong operands. For example, the pop instruction had the register set as
a source operand while it is a destination operand (the value is loaded
into the register).
I have found these issues using the machine verifier and using manual
code inspection.
Differential Revision: https://reviews.llvm.org/D97159
The previous expansion used SBCI, which is incorrect because the NEGW
pseudo instruction accepts a DREGS operand (2xGPR8) and SBCI only allows
LD8 registers. One solution could be to correct the NEGW pseudo
instruction, but another solution is to use a different instruction
(sbc) that does accept a GPR8 register and therefore allows more freedom
to the register allocator.
The output now matches avr-gcc for the following code:
int foo(int n) {
return -n;
}
I've found this issue using the machine instruction verifier: it was
complaining about the wrong register class in NEGWRd.mir.
Differential Revision: https://reviews.llvm.org/D97131
These aliases are sometimes used in assembly code and make the code more
readable. They are supported by avr-gcc too.
Differential Revision: https://reviews.llvm.org/D96492
Refactor insertion of the asserting ops. This enables using them for
AMDGPU.
This code should essentially be the same for every target. Mips, X86
and ARM all have different code there now, but this seems to be an
accident. The assignment functions are called with different types
than they would be in the DAG, so this is all likely an assortment of
hacks to get around that.
* Add amdgcn_strict_wqm intrinsic.
* Add a corresponding STRICT_WQM machine instruction.
* The semantic is similar to amdgcn_strict_wwm with a notable difference that not all threads will be forcibly enabled during the computations of the intrinsic's argument, but only all threads in quads that have at least one thread active.
* The difference between amdgc_wqm and amdgcn_strict_wqm, is that in the strict mode an inactive lane will always be enabled irrespective of control flow decisions.
Reviewed By: critson
Differential Revision: https://reviews.llvm.org/D96258
* Introduce the new intrinsic amdgcn_strict_wwm
* Deprecate the old intrinsic amdgcn_wwm
The change is done for consistency as the "strict"
prefix will become an important, distinguishing factor
between amdgcn_wqm and amdgcn_strictwqm in the future.
The "strict" prefix indicates that inactive lanes do not
take part in control flow, specifically an inactive lane
enabled by a strict mode will always be enabled irrespective
of control flow decisions.
The amdgcn_wwm will be removed, but doing so in two steps
gives users time to switch to the new name at their own pace.
Reviewed By: critson
Differential Revision: https://reviews.llvm.org/D96257
While the underlying instruction is called image_msaa_load,
the resource must be x component only.
Rename the intrinsic for clarity.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D97829
In some rare circumstances we can be using an undef register for a
compare. When folded into a CBZ/CBNZ the undef flags are lost, leading
to machine verifier problems. This propagates the existing flags to the
new instruction.
The WebAssembly text and binary formats have different operand orders
for the "type" and "table" fields of call_indirect (and
return_call_indirect). In LLVM we use the binary order for the MCInstr,
but when we produce or consume the text format we should use the text
order. For compilation units targetting WebAssembly 1.0 (without the
reference types feature), we omit the table operand entirely.
Differential Revision: https://reviews.llvm.org/D97761
This patch allows generating TLS variables in assembly files on AIX.
Initialized and external uninitialized variables are generated with the
.csect pseudo-op and local uninitialized variables are generated with
the .comm/.lcomm pseudo-ops. The patch also adds a check to
explicitly say that TLS is not yet supported on AIX.
Reviewed by: daltenty, jasonliu, lei, nemanjai, sfertile
Originally patched by: bsaleil
Commandeered by: NeHuang
Differential Revision: https://reviews.llvm.org/D96184
This merges more AMDGPU ABI lowering code into the generic call
lowering. Start cleaning up by factoring away more of the pack/unpack
logic into the buildCopy{To|From}Parts functions. These could use more
improvement, and the SelectionDAG versions are significantly more
complex, and we'll eventually have to emulate all of those cases too.
This is mostly NFC, but does result in some minor instruction
reordering. It also removes some of the limitations with mismatched
sizes the old code had. However, similarly to the merge on the input,
this is forcing gfx6/gfx7 to use the gfx8+ ABI (which is what we
actually want, but SelectionDAG is stuck using the weird emergent
ABI).
This also changes the load/store size for stack passed EVTs for
AArch64, which makes it consistent with the DAG behavior.
This fixes two bugs in `WebAssemblyExceptionInfo` grouping, created by
D97247. These two bugs are not easy to split into two different CLs,
because tests that fail for one also tend to fail for the other.
- In D97247, when fixing `ExceptionInfo` grouping by taking out
the unwind destination' exception from the unwind src's exception, we
just iterated the BBs in the function order, but this was incorrect;
this changes it to dominator tree preorder. Please refer to the
comments in the code for the reason and an example.
- After this subexception-taking-out fix, there still can be remaining
BBs we have to take out. When Exception B is taken out of Exception A
(because EHPad B is the unwind destination of EHPad A), there can
still be BBs within Exception A that are reachable from Exception B,
which also should be taken out. Please refer to the comments in the
code for more detailed explanation on why this can happen. To make
this possible, this splits `WebAssemblyException::addBlock` into two
parts: adding to a set and adding to a vector. We need to iterate on
BBs within a `WebAssemblyException` to fix this, so we add BBs to sets
first. But we add BBs to vectors later after we fix all incorrectness
because deleting BBs from vectors is expensive. I considered removing
the vector from `WebAssemblyException`, but it was not easy because
this class has to maintain a similar interface with `MachineLoop` to
be wrapped into a single interface `SortRegion`, which is used in
CFGSort.
Other misc. drive-by fixes:
- Make `WebAssemblyExceptionInfo` do not even run when wasm EH is not
used or the function doesn't have any EH pads, not to waste time
- Add `LLVM_DEBUG` lines for easy debugging
- Fix `preds` comments in cfg-stackify-eh.ll
- Fix `__cxa_throw`'s signature in cfg-stackify-eh.ll
Fixes https://github.com/emscripten-core/emscripten/issues/13554.
Reviewed By: dschuff, tlively
Differential Revision: https://reviews.llvm.org/D97677
Andrei Matei reported a llvm11 core dump for his bpf program
https://bugs.llvm.org/show_bug.cgi?id=48578
The core dump happens in LiveVariables analysis phase.
#4 0x00007fce54356bb0 __restore_rt
#5 0x00007fce4d51785e llvm::LiveVariables::HandleVirtRegUse(unsigned int,
llvm::MachineBasicBlock*, llvm::MachineInstr&)
#6 0x00007fce4d519abe llvm::LiveVariables::runOnInstr(llvm::MachineInstr&,
llvm::SmallVectorImpl<unsigned int>&)
#7 0x00007fce4d519ec6 llvm::LiveVariables::runOnBlock(llvm::MachineBasicBlock*, unsigned int)
#8 0x00007fce4d51a4bf llvm::LiveVariables::runOnMachineFunction(llvm::MachineFunction&)
The bug can be reproduced with llvm12 and latest trunk as well.
Futher analysis shows that there is a bug in BPF peephole
TRUNC elimination optimization, which tries to remove
unnecessary TRUNC operations (a <<= 32; a >>= 32).
Specifically, the compiler did wrong transformation for the
following patterns:
%1 = LDW ...
%2 = SLL_ri %1, 32
%3 = SRL_ri %2, 32
... %3 ...
%4 = SRA_ri %2, 32
... %4 ...
The current transformation did not check how many uses of %2
and did transformation like
%1 = LDW ...
... %1 ...
%4 = SRL_ri %2, 32
... %4 ...
and pseudo register %2 is used by not defined and
caused LiveVariables analysis core dump.
To fix the issue, when traversing back from SRL_ri to SLL_ri,
check to ensure SLL_ri has only one use. Otherwise, don't
do transformation.
Differential Revision: https://reviews.llvm.org/D97792
To do this while supporting the existing functionality in SelectionDAG of using
PGO info, we add the ProfileSummaryInfo and LazyBlockFrequencyInfo analysis
dependencies to the instruction selector pass.
Then, use the predicate to generate constant pool loads for f32 materialization,
if we're targeting optsize/minsize.
Differential Revision: https://reviews.llvm.org/D97732
Instead of converting the 0 into a ZR reg during lowering, do that with
tablegen by matching the zero immediate. This when combined with other
optimizations is more likely to use ZR and helps keep the DAG more
easily optimizable. It should not otherwise effect code generation.
When a large "irregular" (e.g. i96) integer call argument is converted to
indirect, 64-bit parts are stored to the stack. The full stack space
(e.g. i128) was not allocated prior to this patch, but rather just the exact
space of the original type. This caused neighboring values on the stack to be
overwritten.
Thanks to Josh Stone for reporting this.
Review: Ulrich Weigand
Fixes https://bugs.llvm.org/show_bug.cgi?id=49322
Differential Revision: https://reviews.llvm.org/D97514
Make OMod explicit instead of implied by HasModifiers in the
operand list. Requires explicitly setting HasOMod=1 for
irregular OMod usage in instruction V_CVT_{U,I}*
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D97587
Change-Id: I230e1476f529e816eec60e242531f23a99e3839f
This patch enables support for lowering INSERT_VECTOR_ELT on
fixed-length vector types. The strategy follows that for scalable vector
types.
This patch also includes a quick fix to prevent the compiler infinitely
looping between lowering BUILD_VECTOR as VECTOR_SHUFFLE and back again.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D97698
The default expansion of CONCAT_VECTORS goes through the stack. This
patch avoids that penalty by custom-lowering CONCAT_VECTORS to a series
of INSERT_SUBVECTOR nodes. Futher optimizations are possible, but this
is a good start.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D97692
-amdgpu-inline-max-bb option could lead to a suboptimal
codegen preventing inlining of really simple functions
including pure wrapper calls. Relax the cutoff by allowing
to call a function with a single block on the grounds
that it will not increase total number of blocks after
inlining.
Differential Revision: https://reviews.llvm.org/D97744
Currently ARM backend validates the range of branch targets before the
layout of fragments is finalized. This causes build failure if symbolic
expressions are used, with the exception of a single symbolic value.
For example, "b.w ." works but "b.w . + 2" currently fails to
assemble. This fixes the issue by delaying this check (in
ARMAsmParser::validateInstruction) of b.w instructions until the symbol
expressions are resolved (in ARMAsmBackend::adjustFixupValue).
Link:
https://github.com/ClangBuiltLinux/linux/issues/1286
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D97568
Remove a rule which allows larger scalar types than the destination vector
element type.
This appears to be irrelevant now that we have G_BUILD_VECTOR_TRUNC. Plus,
making a G_BUILD_VECTOR which satisfies this introduces a verifier failure
anyway.
Differential Revision: https://reviews.llvm.org/D97727
def of the adrp before the ldr.
Apparently this pass used to have liveness analysis but it was removed for
scompile time reasons. This workaround prevents the LOH from being emitted
unless the ADD and LDR are adjacent.
Fixes https://github.com/JuliaLang/julia/issues/39820
Differential Revision: https://reviews.llvm.org/D97571
- This patch adds in the distinction between jg[*] and jl[*] pc-relative
mnemonics based on the variant/dialect.
- Under the hlasm variant, we use the jl[*] family of mnemonics and under
the att (GNU as) variant, we use the jg[*] family of mnemonics.
- jgnop which was added in https://reviews.llvm.org/D92185, is now restricted
to att variant. jlnop is introduced and restricted to hlasm variant.
- The br[*]l additional mnemonics are mapped to either jl[*]/jg[*] based on
the variant.
Reviewed By: uweigand
Differential Revision: https://reviews.llvm.org/D97581
In `LateEHPrepare::addCatchAlls`, the current code tries to get the
iterator's debug info even when it is `MachineBasicBlock::end()`. This
fixes the bug by adding empty debug info instead in that case.
Reviewed By: tlively
Differential Revision: https://reviews.llvm.org/D97679
If the reference-types feature is enabled, call_indirect will explicitly
reference its corresponding function table via TABLE_NUMBER
relocations against a table symbol.
Also, as before, address-taken functions can also cause the function
table to be created, only with reference-types they additionally cause a
symbol table entry to be emitted.
Differential Revision: https://reviews.llvm.org/D90948
The expected use case is for frontends to insert this into
shaders that are to be run under a debugger. The shader can
then be resumed or single stepped from the point of the call
under debugger control.
Differential Revision: https://reviews.llvm.org/D97670
I copied the nearly identical function from AArch64 into AMDGPU, so
fix this duplication.
Mips and X86 have their own more exotic versions which should be
removed. However replacing those is better left for a separate patch
since it requires other changes to avoid regressions.
This was reusing the parent function calling convention instead of the
callee. I'm not sure if there's a case where there's an observable
difference.
I previously missed this in b72a23650f
Given a zero input for a udot, an add can be folded in to take the place
of the input, using thte addition that the instruction naturally
performs.
Differential Revision: https://reviews.llvm.org/D97188
Like with EXTRACT_SUBVECTOR, INSERT_SUBVECTOR poses a problem
for vector masks as RVV isn't able to slide mask types around. We choose
instead to bitcast to equivalently-sized i8 types where we can, else we
zero-extend, perform the operation, and truncate back down.
One test was left disabled due to a crash in the legalizer.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D97559
This patch fixes a bug where the lowering for INSERT_SUBVECTOR and
EXTRACT_SUBVECTOR would insist on first extracting a register-aligned
LMUL1 vector type before perfoming the slide up/down. This was even if
the vector was a fractional LMUL type, in which case the aligned
EXTRACT_SUBVECTOR was invalid.
This issue only occurred for scalable vector types, but a variety of
tests for both scalable and fixed-length vectors have been added to
ensure this does not regress in the future.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D97556
This patch unifies the two disparate paths for lowering INSERT_SUBVECTOR
operations under one roof. Consequently, with this patch it is possible to
support any fixed-length subvector insertion, not just "cast-like" ones.
As before, support for the insertion of mask vectors will come in a
separate patch.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D97543
This patch adds support for extracting subvectors from vector masks.
This can be either extracting a scalable vector from another, or a fixed-length
vector from a fixed-length or scalable vector.
Since RVV lacks a way to slide vector masks down on an element-wise
basis and we don't know the true length of the vector registers, in many
cases we must resort to using equivalently-sized i8 vectors to perform
the operation. When this is not possible we fall back and extend to a
suitable i8 vector.
Support was also added for fixed-length truncation to mask types.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D97475
This patch addresses issues arising from the fact that the index type
used for subvector insertion/extraction is inconsistent between the
intrinsics and SDNodes. The intrinsic forms require i64 whereas the
SDNodes use the type returned by SelectionDAG::getVectorIdxTy.
Rather than update the intrinsic definitions to use an overloaded index
type, this patch fixes the issue by transforming the index to the
correct type as required. Any loss of index bits going from i64 to a
smaller type is unexpected, and will be caught by an assertion in
SelectionDAG::getVectorIdxConstant.
The patch also updates the documentation for INSERT_SUBVECTOR and adds
an assertion to its creation to bring it in line with EXTRACT_SUBVECTOR.
This necessitated changes to AArch64 which was using i64 for
EXTRACT_SUBVECTOR but i32 for INSERT_SUBVECTOR. Only one test changed
its codegen after updating the backend accordingly.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D97459
If we insert undef using a VMOVN, we can just use the original value in
three out of the four possible combinations. Using VMOVT into a undef
vector will still require the lanes to be moved, but otherwise the
non-undef value can be used.
Per the discussion in D97453. We currently disable it due to it's not a
common scenario and has some problem in implementation.
Differential Revision: https://reviews.llvm.org/D97453
D97247 added the reverse mapping from unwind destination to their
source, but it had a critical bug; sources can be multiple, because
multiple BBs can have a single BB as their unwind destination.
This changes `WasmEHFuncInfo::getUnwindSrc` to `getUnwindSrcs` and makes
it return a vector rather than a single BB. It does not return the const
reference to the existing vector but creates a new vector because
`WasmEHFuncInfo` stores not `BasicBlock*` or `MachineBasicBlock*` but
`PointerUnion` of them. Also I hoped to unify those methods for
`BasicBlock` and `MachineBasicBlock` into one using templates to reduce
duplication, but failed because various usages require `BasicBlock*` to
be `const` but it's hard to make it `const` for `MachineBasicBlock`
usages.
Fixes https://github.com/emscripten-core/emscripten/issues/13514.
(More precisely, fixes
https://github.com/emscripten-core/emscripten/issues/13514#issuecomment-784708744)
Reviewed By: dschuff, tlively
Differential Revision: https://reviews.llvm.org/D97583
If a global object is listed in `@llvm.used`, place it in a unique section with
the `SHF_GNU_RETAIN` flag. The section is a GC root under `ld --gc-sections`
with LLD>=13 or GNU ld>=2.36.
For front ends which do not expect to see multiple sections of the same name,
consider emitting `@llvm.compiler.used` instead of `@llvm.used`.
SHF_GNU_RETAIN is restricted to ELFOSABI_GNU and ELFOSABI_FREEBSD in
binutils. We don't do the restriction - see the rationale in D95749.
The integrated assembler has supported SHF_GNU_RETAIN since D95730.
GNU as>=2.36 supports section flag 'R'.
We don't need to worry about GNU ld support because older GNU ld just ignores
the unknown SHF_GNU_RETAIN.
With this change, `__attribute__((retain))` functions/variables emitted
by clang will get the SHF_GNU_RETAIN flag.
Differential Revision: https://reviews.llvm.org/D97448
There are existing patterns for FMOVHi, FMOVSi, and FMOVDi in
AArch64InstrFormats.td.
Importing these allows us to remove the manual selection code for FMOV.
It also allows us to select FMOVHi for non-zero constants when we have full
fp-16 support.
Refactor some of the code in AArch64InstrFormats.td so that we can create
equivalent custom renderers in GlobalISel.
Differential Revision: https://reviews.llvm.org/D97511
Previously we would use a bundle to hint the register allocator to not
overwrite the pointers in a sequence of loads to avoid breaking soft
clauses. This bundling was based on a fuzzy register pressure
heuristic, so we could not guarantee using more registers than are
really available. This would result in register allocator failing on
unsatisfiable bundles. Use a kill to artificially extend the live
ranges, so we can always succeed at register allocation even if it
means extra spills in the worst case.
This seems to capture most of the benefit of the bundle while avoiding
most of the risk presented by the bundle. However the lit tests do
show a handful of regressions. In some cases with sequences of
volatile loads, unused load components end up getting reallocated to
the next load which forces a wait between. There are also a few small
scheduling regressions where a hazard used to be avoided, and one
spill torture test which for some reason nearly doubles the stack
usage. There is also a bit of noise from leftover kills (it may make
sense for post-RA pseudos to strip all of these out).
Use `APInt` to convert a 32-bit or 64-bit immediate to an `APFloat` rather than
`bit_cast` to a `float` or `double` to avoid going through host floating-point and
potentially changing the bit pattern of NaNs.
Differential Revision: https://reviews.llvm.org/D97490
This is a case D97178 tried to solve but missed. D97178 could not handle
the case when
multiple consecutive delegates are generated:
- Before:
```
block
br (a)
try
catch
end_try
end_block
<- (a)
```
- After
```
block
br (a)
try
...
try
try
catch
end_try
<- (a)
delegate
delegate
end_block
<- (b)
```
(The `br` should point to (b) now)
D97178 assumed `end_block` exists two BBs later than `end_try`, because
it assumed the order as `end_try` BB -> `delegate` BB -> `end_block` BB.
But it turned out there can be multiple `delegate`s in between. This
patch changes the logic so we just search from `end_try` BB until we
find `end_block`.
Fixes https://github.com/emscripten-core/emscripten/issues/13515.
(More precisely, fixes
https://github.com/emscripten-core/emscripten/issues/13515#issuecomment-784711318.)
Reviewed By: dschuff, tlively
Differential Revision: https://reviews.llvm.org/D97569
If a region was not constrained by a high register pressure
and was not rescheduled without clustering we can skip
rescheduling it ClusteredLowOccupancyReschedule stage.
This improves scheduling speed by 25% on some kernels.
Differential Revision: https://reviews.llvm.org/D97506
We are attempting rescheduling without load store clustering
if occupancy limits were not met with clustering. Skip this
for regions which do not have any loads or stores at all.
In a set of kernels I am experimenting with this improves
scheduling time by ~30%.
Differential Revision: https://reviews.llvm.org/D97342
- This patch introduces a different assembler dialect ("hlasm") for z/OS.
The default dialect has now been given the "att" dialect name. For this
appropriate changes have been added to SystemZ.td.
- This patch also makes a few changes to SystemZInstrFormats.td which
restrict a few condition code mnemonics to just the "att" dialect
variant (he, le, lh, nhe, nle, nlh). These extended condition code
mnemonics are not available in HLASM.
- A new private function has been introduced in SystemZAsmParser.cpp to
return the assembler dialect set in SystemZMCAsmInfo.cpp. The reason we
couldn't/haven't explicitly queried the overriden getAssemblerDialect
function from AsmParser is outlined in this thread here. This returned
dialect is directly passed onto the relevant matcher functions which taken
in a variantID, so that the matcher functions can appropriately choose an
instruction based on the variant.
Reviewed By: uweigand
Differential Revision: https://reviews.llvm.org/D94250
This will allow FrameIndex as the base address instead of
emitting a separate ADDI from isel. eliminateFrameIndex will likely turn
it back into an ADDI, but this makes things consistent with the
SDPatterns and VLPatterns.
I only tested one case for simplicity. I can test more if reviewers
want.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D97221
This allows GlobalISel to use this instruction where available. I assume
SelectionDAG always selects s_xnor_b32 so it isn't affected by this
change.
Differential Revision: https://reviews.llvm.org/D97560
Spilling and reloading AMX registers are expensive. We allow PTILEZEROV
and PTILELOADDV to be rematerializable to avoid the register spilling.
Reviewed By: LuoYuanke
Differential Revision: https://reviews.llvm.org/D97453
As discussed on D97478. The removal of the custom tag causes some changes in the add/sub-overflow expansion as it no longer expands to sat-arith codegen.
In 16-bit mode, some of the nop patterns used in 32-bit mode can end up
mangling other instructions. For instance, an aligned "movz" instruction
may have the 0x66 and 0x67 prefixes omitted, because the nop that's used
messes things up.
xorl %ebx, %ebx
.p2align 4, 0x90
movzbl (%esi,%ebx), %ecx
Use instead nop patterns we know 16-bit mode can handle.
Differential Revision: https://reviews.llvm.org/D97268
Commit 1959ead525 ("BPF: Implement TTI.getCmpSelInstrCost()
properly") introduced a dependency on LLVMTransformUtils
library. Let us encode this dependency explicitly in
CMakefile to avoid build error.
This patch extends the support for scalable-vector int->fp and fp->int
conversions by additionally handling fixed-length vectors.
The existing scalable-vector lowering re-expresses widening/narrowing by
x4+ conversions as standard nodes. The fixed-length vector support slots
in at "the end" of this process by lowering the now equally-sized and
widening/narrowing by x2 nodes to our custom VL versions.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D97374
This patch extends the support for vector FP_ROUND and FP_EXTEND by
including support for fixed-length vector types. Since fixed-length
vectors use "VL" nodes and scalable vectors can use the standard nodes,
there is slightly more to do in the fixed-length case. A helper function
was introduced to try and reduce the divergent paths. It is expected
that this function will similarly come in useful for lowering the
int-to-fp and fp-to-int operations for fixed-length vectors.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D97301
This patch extends support for our custom-lowering of scalable-vector
truncates to include those of fixed-length vectors. It does this by
co-opting the custom RISCVISD::TRUNCATE_VECTOR node and adding mask and
VL operands. This avoids unnecessary duplication of patterns and
inflation of the ISel table.
Some truncates go through CONCAT_VECTORS which currently isn't
efficiently handled, as it goes through the stack. This can be improved
upon in the future.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D97202
This patch adds support for the custom lowering sign- and zero-extension
of fixed-length vector types. It does so through custom nodes. Since the
source and destination types are (necessarily) of different sizes, it is
possible that the source type is legal whilst the larger destination
type isn't. In this case the legalization makes heavy use of
EXTRACT_SUBVECTOR.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D97194
This patch unifies the two disparate paths for lowering
EXTRACT_SUBVECTOR operations under one roof. Consequently, with this
patch it is possible to support any fixed-length subvector extraction,
not just "cast-like" ones.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D97192
We should be able to extend this "canonicalizeShuffleWithBinOps" to handle more generic binop cases where either/both operands can be cheaply shuffled.
Some people are using alternative address spaces to track GC data, but
otherwise they behave exactly the same. This is the only place in the backend
we even try to care about it so it's really not achieving anything.
Adding support for intrinsics of AMX-BF16.
This patch alse fix a bug that AMX-INT8 instructions will be selected with wrong
predicate.
Differential Revision: https://reviews.llvm.org/D97358
We always create the VL operand using a register, but if we can
determine that it came from an ADDI X0, imm with a sufficiently
small immediate, we can use VSETIVLI.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D97332
We just started using a ComplexPattern for sexti32. This updates
zexti32 to match.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D97231
Make sure to set the bottom bit of the symbol even when the type
attribute of a label is set after the label.
GNU as sets the thumb state according to the thumb state of the label.
If a .type directive is placed after the label, set the symbol's thumb
state according to the thumb state of the .type directive. This matches
GNU as in most cases.
From: Stefan Agner <stefan@agner.ch>
This fixes:
https://bugs.llvm.org/show_bug.cgi?id=44860https://github.com/ClangBuiltLinux/linux/issues/866
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D74927
gfx90a operations require even aligned registers, but this was
previously achieved by reserving registers inside the full class.
Ideally this would be captured in the static instruction definitions
for the operands, and we would have different instructions per
subtarget. The hackiest part of this is we need to manually reassign
AGPR register classes after instruction selection (we get away without
this for VGPRs since those types are actually registered for legal
types).
The manual G_DUP selection code would produce DUPv16i8 for v8s8s and DUPv8i16
for v4s16.
This adds the missing cases to the manual selection code, and makes it return
false when there is an unexpected size.
Update select-dup.mir to reflect the change.
Differential Revision: https://reviews.llvm.org/D97240
I've changed to use VL=1 for slidedown and shifts to avoid extra
element processing that we don't need.
The i64 fixed vector handling on i32 isn't great if the vector type
isn't legal due to an ordering issue in type legalization. If the
vector type isn't legal, we fall back to default legalization
which will bitcast the vector to vXi32 and use two independent extracts.
Doing better will require handling several different cases by
manually inserting insert_subvector/extract_subvector to adjust the type
to a legal vector before emitting custom nodes.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D97319
F1.2 Standard assembler syntax fields
describes .w and .n suffixes for wide and narrow encodings.
arch/arm/probes/kprobes/test-thumb.c tests installing kprobes for
certain instructions using inline asm. There's a few instructions we
fail to assemble due to missing .w t2InstAliases.
Adds .w suffixes for:
* bl (F5.1.25 BL, BLX (immediate) T1)
* dbg (F5.1.42 DBG T1)
Reviewed By: DavidSpickett
Differential Revision: https://reviews.llvm.org/D97236
Instead of outright disabling this completely with the noredzone attribute,
we only avoid doing the optimization if there are memory operations between
the adjustment and the load/store that the adjustment would be folded into.
This avoids the case of something like a stack cookie being corrupted if an
exception happens before the pre-increment to the SP occurs.
This also prevents the folding happening if we have a redzone, but the offset
being folded is above the redzone amount (128 bytes in this case).
rdar://73269336
Differential Revision: https://reviews.llvm.org/D95179
Update the list of s_sendmsg messages known to the assembler and
disassembler and validate the ones that were added or removed in gfx9
and gfx10.
Differential Revision: https://reviews.llvm.org/D97295
(CMTST A, A) will only set elements to 0 if the element is 0 in A. Use
it for != 0 compares, which currently use (vnot (CMEQz A)). This saves a
mvn instruction.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D97303
Currently the load/store optimizer will only fold in increments of the
same size as the load/store. This patch expands that to any legal
immediate for the post-inc instruction.
This is a recommit of 3b34b06fc5 with correctness fixes and extra
tests.
Differential Revision: https://reviews.llvm.org/D95885
The non-flag setting variants of instructions may have different regclass
requirements. If so, we need to constrain them.
Differential Revision: https://reviews.llvm.org/D97343
The order in which the nested calls to Builder.buildWhatever are
evaluated in differs between GCC and Clang.
This caused a bot failure because the MIR in the testcase was
coming out in a different order than expected.
Rather than using nested calls, pull them out in order to fix the
order of evaluation.
This CL is not big but contains changes that span multiple analyses and
passes. This description is very long because it tries to explain basics
on what each pass/analysis does and why we need this change on top of
that. Please feel free to skip parts that are not necessary for your
understanding.
---
`WasmEHFuncInfo` contains the mapping of <EH pad, the EH pad's next
unwind destination>. The value (unwind dest) here is where an exception
should end up when it is not caught by the key (EH pad). We record this
info in WasmEHPrepare to fix catch mismatches, because the CFG itself
does not have this info. A CFG only contains BBs and
predecessor-successor relationship between them, but in `WasmEHFuncInfo`
the unwind destination BB is not necessarily a successor or the key EH
pad BB. Their relationship can be intuitively explained by this C++ code
snippet:
```
try {
try {
foo();
} catch (int) { // EH pad
...
}
} catch (...) { // unwind destination
}
```
So when `foo()` throws, it goes to `catch (int)` first. But if it is not
caught by it, it ends up in the next unwind destination `catch (...)`.
This unwind destination is what you see in `catchswitch`'s
`unwind label %bb` part.
---
`WebAssemblyExceptionInfo` groups exceptions so that they can be sorted
continuously together in CFGSort, as we do for loops. What this analysis
does is very simple: it creates a single `WebAssemblyException` per EH
pad, and all BBs that are dominated by that EH pad are included in this
exception. We also identify subexception relationship in this way: if
EHPad A domiantes EHPad B, EHPad B's exception is a subexception of
EHPad A's exception.
This simple rule turns out to be incorrect in some cases. In
`WasmEHFuncInfo`, if EHPad A's unwind destination is EHPad B, it means
semantically EHPad B should not be included in EHPad A's exception,
because it does not make sense to rethrow/delegate to an inner scope.
This is what happened in CFGStackify as a result of this:
```
try
try
catch
... <- %dest_bb is among here!
end
delegate %dest_bb
```
So this patch adds a phase in `WebAssemblyExceptionInfo::recalculate` to
make sure excptions' unwind destinations are not subexceptions of
their unwind sources in `WasmEHFuncInfo`.
But this alone does not prevent `dest_bb` in the example above from
being sorted within the inner `catch`'s exception, even if its exception
is not a subexception of that `catch`'s exception anymore, because of
how CFGSort works, which will be explained below.
---
CFGSort places BBs within the same `SortRegion` (loop or exception)
continuously together so they can be demarcated with `loop`-`end_loop`
or `catch`-`end_try` in CFGStackify.
`SortRegion` is a wrapper for one of `MachineLoop` or
`WebAssemblyException`. `SortRegionInfo` already does some complicated
things because there discrepancies between those two data structures.
`WebAssemblyException` is what we control, and it is defined as an EH
pad as its header and BBs dominated by the header as its BBs (with a
newly added exception of unwind destinations explained in the previous
paragraph). But `MachineLoop` is an LLVM data structure and uses the
standard loop detection algorithm. So by the algorithm, BBs that are 1.
dominated by the loop header and 2. have a path back to its header.
Because of the second condition, many BBs that are dominated by the loop
header are not included in the loop. So BBs that contain `return` or
branches to outside of the loop are not technically included in
`MachineLoop`, but they can be sorted together with the loop with no
problem.
Maybe to relax the condition, in CFGSort, when we are in a `SortRegion`
we allow sorting of not only BBs that belong to the current innermost
region but also BBs that are by the current region header.
(This was written this way from the first version written by Dan, when
only loops existed.) But now, we have cases in exceptions when EHPad B
is the unwind destination for EHPad A, even if EHPad B is dominated by
EHPad A it should not be included in EHPad A's exception, and should not
be sorted within EHPad A.
One way to make things work, at least correctly, is change `dominates`
condition to `contains` condition for `SortRegion` when sorting BBs, but
this will change compilation results for existing non-EH code and I
can't be sure it will not degrade performance or code size. I think it
will degrade performance because it will force many BBs dominated by a
loop, which don't have the path back to the header, to be placed after
the loop and it will likely to create more branches and blocks.
So this does a little hacky check when adding BBs to `Preferred` list:
(`Preferred` list is a ready list. CFGSort maintains ready list in two
priority queues: `Preferred` and `Ready`. I'm not very sure why, but it
was written that way from the beginning. BBs are first added to
`Preferred` list and then some of them are pushed to `Ready` list, so
here we only need to guard condition for `Preferred` list.)
When adding a BB to `Preferred` list, we check if that BB is an unwind
destination of another BB. To do this, this adds the reverse mapping,
`UnwindDestToSrc`, and getter methods to `WasmEHFuncInfo`. And if the BB
is an unwind destination, it checks if the current stack of regions
(`Entries`) contains its source BB by traversing the stack backwards. If
we find its unwind source in there, we add the BB to its `Deferred`
list, to make sure that unwind destination BB is added to `Preferred`
list only after that region with the unwind source BB is sorted and
popped from the stack.
---
This does not contain a new test that crashes because of this bug, but
this fix changes the result for one of existing test case. This test
case didn't crash because it fortunately didn't contain `delegate` to
the incorrectly placed unwind destination BB.
Fixes https://github.com/emscripten-core/emscripten/issues/13514.
Reviewed By: dschuff, tlively
Differential Revision: https://reviews.llvm.org/D97247
This is used to lower UDOT/SDOT instructions, as opposed to relying on
the intrinsic. Subsequent optimizations will be able to optimize them
more cleanly based on these nodes.
This is to limit compile time. I did experiments with some
inputs and found that compile time keeps reasonable for this
pass if we have less than 100000 virtual registers and then
starts to explode somewhere between 100000 and 150000.
Differential Revision: https://reviews.llvm.org/D97218
The Linux kernel when built with CONFIG_THUMB2_KERNEL makes use of these
instructions with immediate operands and wide encodings.
These are the T4 variants of the follow sections from the Arm ARM.
F5.1.72 LDR (immediate)
F5.1.229 STR (immediate)
I wasn't able to represent these simple aliases using t2InstAlias due to
the Constraints on the non-suffixed existing instructions, which results
in some manual parsing logic needing to be added.
F1.2 Standard assembler syntax fields
describes the use of the .w (wide) vs .n (narrow) encoding suffix.
Link: https://bugs.llvm.org/show_bug.cgi?id=49118
Link: https://github.com/ClangBuiltLinux/linux/issues/1296
Reported-by: Stefan Agner <stefan@agner.ch>
Reported-by: Arnd Bergmann <arnd@kernel.org>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed By: DavidSpickett
Differential Revision: https://reviews.llvm.org/D96632
Prefer to keep uniform (non-divergent) multiplies on the scalar ALU when
possible. This significantly improves some game cases by eliminating
v_readfirstlane instructions when the result feeds into a scalar
operation, like the address calculation for a scalar load or store.
Since isDivergent is only an approximation of whether a value is in
SGPRs, it can potentially regress some situations where a uniform value
ends up in a VGPR. These should be rare in real code, although the test
changes do contain a number of examples.
Most of the test changes are just using s_mul instead of v_mul/mad which
is generally better for both register pressure and latency (at least on
GFX10 where sgpr pressure doesn't affect occupancy and vector ALU
instructions have significantly longer latency than scalar ALU). Some
R600 tests now use MULLO_INT instead of MUL_UINT24.
GlobalISel appears to handle more scenarios in the desirable way,
although it can also be thrown off and fails to select the 24-bit
multiplies in some cases.
Alternative solution considered and rejected was to allow selecting
MUL_[UI]24 to S_MUL_I32. I've rejected this because the definition of
those SD operations works is don't-care on the most significant 8 bits,
and this fact is used in some combines via SimplifyDemandedBits.
Based on a patch by Nicolai Hähnle.
Differential Revision: https://reviews.llvm.org/D97063
Early versions of the ARMv7 reference manuals considered the sp register
as a deprecated register for ldm/stm familiy of instructions. However,
later versions such as ARM DDI 0406C.d added a note to the Appendix:
D9.3 Use of the SP as a general-purpose register
Most ARM instructions, unlike Thumb instructions, provide exactly the
same access to the SP as to R0-R12. This means that it is possible to
use the SP as a general-purpose register. Earlier issues of this manual
deprecated the use of SP in an ARM instruction, in any way that is
deprecated, not permitted, or not possible in the corresponding
Thumb instruction. However, user feedback indicates a number of cases
where these instructions are useful. Therefore, ARM no longer deprecates
these instruction uses.
Also Armv8 manuals no longer consider SP as deprecated register for ldm/
stm A32 instructions.
Furthermore, GNU as also does not print a deprecated warning when using
SP with those instructions.
Drop deprecation warning for pop/ldm/push/stm instructions.
Patch by: Stefan Agner.
Differential Revision: https://reviews.llvm.org/D82692
As a followup to D95291, getOperandsScalarizationOverhead was still
using a VF as a vector factor if the arguments were scalar, and would
assert on certain matrix intrinsics with differently sized vector
arguments. This patch removes the VF arg, instead passing the Types
through directly. This should allow it to more accurately compute the
cost without having to guess at which operands will be vectorized,
something difficult with more complex intrinsics.
This adjusts one SVE test as it is now calling the wrong intrinsic vs
veccall. Without invalid InstructCosts the cost of the scalarized
intrinsic is too low. This should get fixed when the cost of
scalarization is accounted for with scalable types.
Differential Revision: https://reviews.llvm.org/D96287
getIntrinsicInstrCost takes a IntrinsicCostAttributes holding various
parameters of the intrinsic being costed. It can either be called with a
scalar intrinsic (RetTy==Scalar, VF==1), with a vector instruction
(RetTy==Vector, VF==1) or from the vectorizer with a scalar type and
vector width (RetTy==Scalar, VF>1). A RetTy==Vector, VF>1 is considered
an error. Both of the vector modes are expected to be treated the same,
but because this is confusing many backends end up getting it wrong.
Instead of trying work with those two values separately this removes the
VF parameter, widening the RetTy/ArgTys by VF used called from the
vectorizer. This keeps things simpler, but does require some other
modifications to keep things consistent.
Most backends look like this will be an improvement (or were not using
getIntrinsicInstrCost). AMDGPU needed the most changes to keep the code
from c230965ccf working. ARM removed the fix in
dfac521da1, webassembly happens to get a fixup for an SLP cost
issue and both X86 and AArch64 seem to now be using better costs from
the vectorizer.
Differential Revision: https://reviews.llvm.org/D95291
This patch extends the support for RVV INSERT_SUBVECTOR to cover those
which don't align to a vector register boundary. Like the support for
EXTRACT_SUBVECTOR in D96959, it accomplishes this by extracting the
nearest register-sized subvector (a subregister operation), then sliding
the vector down with VSLIDEDOWN, inserting the subvector to the first
position, and sliding the vector back up again afterwards.
Unlike subvector extraction, for vectors that occupy less than a full
vector register we must preserve the untouched elements. We do this by
lowering to an LMUL=1 INSERT_SUBVECTOR using the above method and
lowering that to a VSLIDEUP with a zero offset. This uses a
tail-undisturbed policy and so has the effect of "sliding in" the
subvector elements while preserving the surrounding ones.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D96972
Since there is no tile copy instruction, we need to store tile
register to stack and load from stack to another tile register.
We need extra GR to hold the stride, and we need stack slot to
hold the tile data register. We would run this pass after copy
propagation, so that we don't miss copy optimization. And we
would run this pass before prolog/epilog insertion, so that we
can allocate stack slot.
Differential Revision: https://reviews.llvm.org/D97112
An i64 AssertZExt from a type smaller than i32 has at least 33
leading zeros which mean it has at least 33 sign bits.
Since we have a couple patterns that use two sexti32, I've
switched to a ComplexPattern so tablegen didn't have to generate
9 different permutations.
As noted in the FIXME, maybe we should just call computeNumSignBits,
but we don't have tests that benefit from that yet.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D97130
Match a G_SHUFFLE_VECTOR with a mask that allows it to be represented as a
G_INSERT_VECTOR_ELT and a G_EXTRACT_VECTOR_ELT.
This ports `isINSMask` from AArch64ISelLowering and the portion of
`AArch64TargetLowering::LowerVECTOR_SHUFFLE` which handles the equivalent
transformation.
This provides more opportunities for matching DUP. We don't have all of the
necessary combines to actually make DUP out of these yet, but this is better for
size than the full TBL expansion for G_SHUFFLE_VECTOR.
This is a -0.1% code size improvement on CTMark/Bullet at -Os.
IR example: https://godbolt.org/z/sdcevT
Differential Revision: https://reviews.llvm.org/D97214
This adds support for serialization of `WasmEHFuncInfo`, in the form of
<Source BB Number, Unwind destination BB number>. To make YAML mapping
work, we needed to make a copy of the existing `SrcToUnwindDest` map
within `yaml::WebAssemblyMachineFunctionInfo`.
It was hard to add EH MIR tests for CFGStackify because `WasmEHFuncInfo`
could not be read from test MIR files. This adds the serialization
support for that to make EH MIR tests easier.
Reviewed By: dschuff
Differential Revision: https://reviews.llvm.org/D97174
This renames variable and method names in `WasmEHFuncInfo` class to be
simpler and clearer. For example, unwind destinations are EH pads by
definition so it doesn't necessarily need to be included in every method
name. Also I am planning to add the reverse mapping in a later CL,
something like `UnwindDestToSrc`, so this renaming will make meanings
clearer.
Reviewed By: dschuff
Differential Revision: https://reviews.llvm.org/D97173
This should fix the issue reported in D96972.
I don't have a good test case for this without those changes.
Differential Revision: https://reviews.llvm.org/D97082
A previous patch moved the index versions. This moves the rest.
I also removed the custom lowering for VLEFF since we can now
do everything directly in the isel handling.
I had to update getLMUL to handle mask registers to index the
pseudo table correctly for VLE1/VSE1.
This is good for another 15K reduction in llc size.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D97097
This patch adds the following SHA3 Intrinsics:
vsha512hq_u64,
vsha512h2q_u64,
vsha512su0q_u64,
vsha512su1q_u64
veor3q_u8
veor3q_u16
veor3q_u32
veor3q_u64
veor3q_s8
veor3q_s16
veor3q_s32
veor3q_s64
vrax1q_u64
vxarq_u64
vbcaxq_u8
vbcaxq_u16
vbcaxq_u32
vbcaxq_u64
vbcaxq_s8
vbcaxq_s16
vbcaxq_s32
vbcaxq_s64
Note need to include +sha3 and +crypto when building from the front-end
Reviewed By: DavidSpickett
Differential Revision: https://reviews.llvm.org/D96381
Enabled "bound_ctrl:1" and disabled "bound_ctrl:-1" syntax.
Corrected printer to output "bound_ctrl:1" instead of "bound_ctrl:0".
See bug 35397 for detailed issue description.
Differential Revision: https://reviews.llvm.org/D97048
This removes the existing patterns for inserting two lanes into an
f16/i16 vector register using VINS, instead using a DAG combine to
pattern match the same code sequences. The tablegen patterns were
already on the large side (foreach LANE = [0, 2, 4, 6]) and were not
handling all the cases they could. Moving that to a DAG combine, whilst
not less code, allows us to better control and expand the selection of
VINSs. Additionally this allows us to remove the AddedComplexity on
VCVTT.
The extra trick that this has learned in the process is to move two
adjacent lanes using a single f32 vmov, allowing some extra
inefficiencies to be removed.
Differenial Revision: https://reviews.llvm.org/D96876
If the reference-types feature is enabled, call_indirect will explicitly
reference its corresponding function table via `TABLE_NUMBER`
relocations against a table symbol.
Also, as before, address-taken functions can also cause the function
table to be created, only with reference-types they additionally cause a
symbol table entry to be emitted.
We abuse the used-in-reloc flag on symbols to indicate which tables
should end up in the symbol table. We do this because unfortunately
older wasm-ld will carp if it see a table symbol.
Differential Revision: https://reviews.llvm.org/D90948
This also removes a pattern from RISCV that is no longer needed
since the sexti32 on the LHS of the srem in the pattern implies
the result is sign extended so the sign_extend_inreg should be
removed in DAG combine now.
Reviewed By: luismarques, RKSimon
Differential Revision: https://reviews.llvm.org/D97133
In conjunction with the 'vperm2x128(bitcast(x),bitcast(y),c) -> bitcast(vperm2x128(x,y,c))' fold in combineTargetShuffle, this should remove any unnecessary bitcasts around vperm2x128 lane shuffles.
Extend the existing combine that handles bitcasting for fp-logic ops to also help remove logic ops across bitcasts to/from the same integer types.
This helps improve AVX512 predicate handling for D/Q logic ops and also allows DAGCombine's scalarizeExtractedBinop to remove some annoying gpr->simd->gpr transfers.
The concat_vectors regression in pr40891.ll will be addressed in a followup commit on this patch.
Differential Revision: https://reviews.llvm.org/D96206
This patch extends the support for RVV EXTRACT_SUBVECTOR to cover those
which don't align to a vector register boundary. It accomplishes this by
extracting the nearest register-sized subvector (a subregister
operation), then sliding the vector down with VSLIDEDOWN and extracting
the subvector from the first position (a COPY operation).
Since this procedure involves the use of VSCALE and multiplication, the
handling of such operations is done during lowering to simplify the
implementation and make use of DAG combining. This necessitated moving
some helper functions from RISCVISelDAGToDAG to RISCVTargetLowering.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D96959
With vector mask registers only allocatable to V0 (VMV0Regs) it is
relatively simple to generate code which uses multiple masks and naively
requires spilling.
This patch aims to improve codegen in such cases by telling LLVM it can
use VRRegs to hold masks. This will prevent spilling in many cases by
having LLVM copy to an available VR register.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D97055
This is a minor pattern-match update to BPFAdjustOpt.cpp to accept
not only 'or i1 a, b' but also 'select i1 a, i1 true, i1 b'.
This resolves regression after SimplifyCFG's creating select form
of and/or instead (https://reviews.llvm.org/D95026).
This is a small change, and currently such select form isn't created
or doesn't reach to the late pipeline (because InstCombine eagerly
folds it into and/or i1), so I chose to commit without a review process.
We don't currently create memory operands for these intrinsics,
but there was a suggestion of using the indexed load/store
intrinsics to implement isel for scalable vector gather/scatter.
That may propagate the memory operand from the gather/scatter
ISD nodes.
This commit adds the initial changes to the SystemZ target
description for the XPLINK 64-bit calling convention on z/OS.
Additions include:
- a new predicate IsTargetXPLINK64
- different register allocation order
- generaton of nopr after a call
Reviewed-by: uweigand
Differential Revision: https://reviews.llvm.org/D96887
We had more combinations of data and index lmuls than we needed.
Also add some asserts to verify that the IndexVT and data VT have
the same element count when we isel these pseudo instructions.
There are many legal combinations of index and data VTs supported
for these intrinsics. This results in a lot of isel patterns in
RISCVGenDAGISel.inc.
By adding a separate table similar to what we use for segment
load/stores, we can more efficiently manually select these
intrinsics. We should also be able to reuse this table scalable
vector gather/scatter.
This reduces the llc binary size by ~56K.
Reviewed By: khchen
Differential Revision: https://reviews.llvm.org/D97033
Just like we do for isel patterns, we need to call selectVLOp
to prevent 0 from being selected to X0 by the default isel.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D97021
We previously used isel patterns for this, but that used quite
a bit of space in the isel table due to OR being associative
and commutative. It also wouldn't handle shifts/ands being in
reversed order.
This generalizes the shift/and matching from GREVI to
take the expected mask table as input so we can reuse it for
SHFLI.
There is no SHFLIW instruction, but we can promote a 32-bit
SHFLI to i64 on RV64. As long as bit 4 of the control bit isn't
set, a 64-bit SHFLI will preserve 33 sign bits if the input had
at least 33 sign bits. ComputeNumSignBits has been updated to
account for that to avoid sext.w in the tests.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D96661
This is to ensure that we can eliminate G_ASSERT_SEXT.
In a follow-up patch, I'm going to make CallLowering emit G_ASSERT_SEXT for
signext parameters.
Differential Revision: https://reviews.llvm.org/D96913
fixed-abi uses pre-defined and predictable
SGPR/VGPRs for passing arguments. This patch makes
this scheme default when HSA OS is specified in triple.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D96340
Track lanes when processing definitions for marking WQM/WWM.
If all lanes have been defined then marking can stop.
This prevents marking unnecessary instructions as WQM/WWM.
In particular this fixes a bug where values passing through
V_SET_INACTIVE would me marked as requiring WWM.
Reviewed By: piotr
Differential Revision: https://reviews.llvm.org/D95503
This patch fixes some crashes coming from
X86ISelLowering::getSetCCResultType, which would occasionally return
an EVT constructed from an invalid MVT, which has a null Type pointer.
This patch refers to D95434.
Differential Revision: https://reviews.llvm.org/D97036
This enables AES fusion and the post RA scheduler for the Neoverse cores.
And while we are it also for the A55 that we had missed earlier.
Differential Revision: https://reviews.llvm.org/D96866
We were creating more combinations of value and index lmul than
we needed.
I've copied the loop structure used here from VPseudoAMOEI with
all data sew values instead of just 32/64.
Similar can be done for segment loads/store.
Reviewed By: khchen
Differential Revision: https://reviews.llvm.org/D97008
Intrinsic ID is a 32-bit value which made each row of the table 4
byte aligned. The remaining fields used 5 bytes. This meant 3 bytes
of padding per row.
This patch breaks the table into 4 separate tables and indexes them
by properties we know about the intrinsic. NF, masked,
strided, ordered, etc. The indexed load/store tables have no
padding in their rows now.
All together this reduces the size of llc binary by ~28K.
I'm considering adding similar tables for isel of non-segment
load/store as well to cut down the size of the isel table and
probably improve our isel performance. Those tables would need to
indexed from intrinsics, IR loads/stores, gathers/scatters, and
RISCVISD opcodes. So having a table that can be indexed without using
intrinsic ID is more flexible.
Reviewed By: HsiangKai
Differential Revision: https://reviews.llvm.org/D96894
This table is queried in RISCVMCInstLower without knowing
whether the instruction is a vector pseudo. Due to the way the
binary search works, we have to do log2(tablesize) checks just
to determine a non-vector instruction isn't in the table.
Conveniently, all the vector pseudos are pretty tightly
packed within the internal instruction enum. By enabling the
PrimaryKeyEarlyOut, tablegen will emit a check against the
beginning and end of the table before doing the binary search.
This gives a quick early out on the search for the majority
of non-vector instructions.
Differential Revision: https://reviews.llvm.org/D97016
This patch provides two major changes:
1. Add getRelocationInfo to check if a constant will have static, dynamic, or
no relocations. (Also rename the original needsRelocation to needsDynamicRelocation.)
2. Only allow a constant with no relocations (static or dynamic) to be placed
in a mergeable section.
This will allow unused symbols that contain static relocations and happen to
fit in mergeable constant sections (.rodata.cstN) to instead be placed in
unique-named sections if -fdata-sections is used and subsequently garbage collected
by --gc-sections.
See https://lists.llvm.org/pipermail/llvm-dev/2021-February/148281.html.
Differential Revision: https://reviews.llvm.org/D95960
AMDGPU currently has a lot of pre-processing code to pre-split
argument types into 32-bit pieces before passing it to the generic
code in handleAssignments. This is a bit sloppy and also requires some
overly fancy iterator work when building the calls. It's better if all
argument marshalling code is handled directly in
handleAssignments. This handles more situations like decomposing large
element vectors into sub-element sized pieces.
This should mostly be NFC, but does change the generated code by
shifting where the initial argument packing instructions are placed. I
think this is nicer looking, since it now emits the packing code
directly after the relevant copies, rather than after the copies for
the remaining arguments.
This doubles down on gfx6/gfx7 using the gfx8+ ABI for 16-bit
types. This is ultimately the better option, but incompatible with the
DAG. Fixing this requires more work, especially for f16.
We can always look through single-argument (LCSSA) phi nodes when
performing alias analysis. getUnderlyingObject() already does this,
but stripPointerCastsAndInvariantGroups() does not. We still look
through these phi nodes with the usual aliasPhi() logic, but
sometimes get sub-optimal results due to the restrictions on value
equivalence when looking through arbitrary phi nodes. I think it's
generally beneficial to keep the underlying object logic and the
pointer cast stripping logic in sync, insofar as it is possible.
With this patch we get marginally better results:
aa.NumMayAlias | 5010069 | 5009861
aa.NumMustAlias | 347518 | 347674
aa.NumNoAlias | 27201336 | 27201528
...
licm.NumPromoted | 1293 | 1296
I've renamed the relevant strip method to stripPointerCastsForAliasAnalysis(),
as we're past the point where we can explicitly spell out everything
that's getting stripped.
Differential Revision: https://reviews.llvm.org/D96668
This avoids tedious repetition and matches what we do for the
ValueTypeByHwMode uses.
Reviewed By: craig.topper, luismarques
Differential Revision: https://reviews.llvm.org/D96649
Usually `EH_LABEL`s are placed in
- Before an `invoke` (which becomes calls in the backend)
- After an `invoke`
- At the start of an EH pad
I don't know exactly why, but I noticed there are cases of multiple, not
a single, `EH_LABEL` instructions in the beginning of an EH pad. In that
case `global.set` instruction placed to restore `__stack_pointer` ended
up between two `EH_LABEL` instructions before `CATCH`. It should follow
after the `EH_LABEL`s and `CATCH`. This CL fixes that case.
Reviewed By: dschuff
Differential Revision: https://reviews.llvm.org/D96970
This uses to division by constant optimization to use MULHU/MULHS.
Reviewed By: frasercrmck, arcbbb
Differential Revision: https://reviews.llvm.org/D96934
Due to vXi64 on RV32, I've directly emitted this using _VL ISD
opcodes. If it wasn't for that we could just use fixed vector
BUILD_VECTOR and VSELECT and let those each be legalized.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D96910
These should be NOPs so we can just replace with the input. This
matches what SVE does with isel patterns for all permutations.
Custom isel saves us from having to list all permurations for
all LMULs.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D96921
Adjust generateFMAsInMachineCombiner to return false if SVE is present
in order to combine fmul+fadd into fma. Also add new pseudo instructions
so as to select the most appropriate of FMLA/FMAD depending on register
allocation.
Depends on D96599
Differential Revision: https://reviews.llvm.org/D96424
isFMAFasterThanFMulAndFAdd should return true for FP16 types when
HasFullFP16 is present, since we have the instructions to handle it for
both SVE and NEON. (SVE patterns and tests will follow).
Differential Revision: https://reviews.llvm.org/D96599
Currently the load/store optimizer will only fold in increments of the
same size as the load/store. This patch expands that to any legal
immediate for the post-inc instruction.
Differential Revision: https://reviews.llvm.org/D95885
This patch generates the vinsw, vinsd, vinsblx, vinshlx, vinswlx, vinsdlx,
vinsbrx, vinshrx, vinswrx and vinsdrx instructions for vector insertion on P10.
Differential Revision: https://reviews.llvm.org/D94454
For masked segment load, the destination register should not overlap
with mask register. It could not be V0.
In the original implementation, there is no segment load/store register
class without V0. In this patch, I added these register classes and
modify `GetVRegNoV0` to get the correct one.
Differential Revision: https://reviews.llvm.org/D96937
It appears that pointer types were causing issues for the min/max cost
code in getIntrinsicInstrCost. This makes sure that when matching
icmp/select to a min/max, we only do that for normal int or float types.
Added -mrop-protection for Power PC to turn on codegen that provides some
protection from ROP attacks.
The option is off by default and can be turned on for Power 8, Power 9 and
Power 10.
This patch is for the option only. The feature will be implemented by a later
patch.
Reviewed By: amyk
Differential Revision: https://reviews.llvm.org/D96512
A v8i32 compare will produce a v8i1 predicate, but during codegen the
v8i32 will be split into two v4i32, potentially requiring two v4i1
predicates to be merged into a single v8i1. Because this merging of two
v4i1's into a v8i1 is very expensive, we need to make the cost of the
compare equally high.
This patch adds the cost of that to ARMTTIImpl::getCmpSelInstrCost.
Because we don't know whether the user of the predicate can be split,
and the cost model is mostly pre-instruction, we may be pessimistic but
that should only be for larger and legal types. This also adds min/max
detection to the costmodel where it can be detected, to keep those in
line with the cost of simple min/max instructions. Otherwise for the
most part, costs that were already expensive have become more expensive.
Differential Revision: https://reviews.llvm.org/D96692
This patch adds support for INSERT_SUBVECTOR and EXTRACT_SUBVECTOR
(nominally where both operands are scalable vector types) where the
vector, subvector, and index align sufficiently to allow decomposition
to subregister manipulation:
* For extracts, the extracted subvector must correctly align with the
lower elements of a vector register.
* For inserts, the inserted subvector must be at least one full vector
register, and correctly align as above.
This approach should work for fixed-length vector insertion/extraction
too, but that will come later.
Reviewed By: craig.topper, khchen, arcbbb
Differential Revision: https://reviews.llvm.org/D96873
This patch fixes a codegen crash introduced in fde2466171, where the
DAGCombiner started generating optimized MULH[SU] or [SU]MUL_LOHI nodes
unless the target opted out. The AArch64 backend cannot currently select
any of these nodes, so ensure that they are not generated in the first
place.
This issue was raised by @huihuiz in D94501.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D96849
The type legalizer can call this code based on the scalar type so
we need to verify the vector type is a scalable vector.
I think due to how type legalization visits nodes, the vector type
will have already been legalized so we don't have an issue with
using MVT here like we did for EXTRACT_VECTOR_ELT.
I've added a test just in case.
The type legalizer is calling this code based on the scalar type so
we need to verify the input type is a scalable vector.
The vector type has also not been legalized yet when this is called
so we need to use EVT for it.
We are going to support debug sections for XCOFF. So the csect
properties are not necessary. This patch makes these properties
optional.
Reviewed By: hubert.reinterpretcast
Differential Revision: https://reviews.llvm.org/D95931
We did not have atomic flags on SMRD, did not copy TSFlags
to real instructions, and did not have ret/noret atomic map.
At the moment it is NFC, but needed for D96469.
Differential Revision: https://reviews.llvm.org/D96823
Separate the LoZ ELF calling convention in tablegen.
This will make it easier to add the z/OS ABI in future patches.
Reviewed By: uweigand
Differential Revision: https://reviews.llvm.org/D96867
Updating `EHPadStack` with respect to `TRY` and `CATCH` instructions
have to be done after checking all other conditions, not before. Because
we did this before checking other conditions, when we encounter `TRY`
and we want to record the current mismatching range, we already have
popped up the entry from `EHPadStack`, which we need to access to record
the range.
The `baz` call in the added test needs try-delegate because the previous
TRY marker placement for `quux` was placed before `baz`, because `baz`'s
return value was stackified in RegStackify. If this wasn't stackified
this try-delegate is not strictly necessary, but at the moment it is not
easy to identify cases like this. I plan to transfer `nounwind`
attributes from the LLVM IR to prevent cases like this. The call in the
test does not have `unwind` attribute in order to test this bug, but in
many cases of this pattern the previous call has `nounwind` attribute.
Reviewed By: tlively
Differential Revision: https://reviews.llvm.org/D96711
A lot of the code for the masked and unmasked is the same. This
patch adds a boolean to handle the differences so we can share
the code.
Differential Revision: https://reviews.llvm.org/D96841
Summary:
Currently Shrinkwrap is not enabled on AIX.
This patch enables shrink wrap on 32 and 64 bit AIX, and 64 bit ELF.
Reviewed By: sfertile, nemanjai
Differential Revision: https://reviews.llvm.org/D95094
Do not defer to the base class when the register constraint is a
physical fpr. The base class will select SPILLTOVSRRC as the register
class and register allocation will fail on subtargets without VSX
registers.
Differential Revision: https://reviews.llvm.org/D91629
* Update skip-if-dead.ll with tests for wave32.
* Fix the crash in verifier in one newly enabled test by adding
missing fixImplicitOperands in branch insertion code.
```
*** Bad machine code: Using an undefined physical register ***
- function: test_kill_divergent_loop
- basic block: %bb.2 bb (0xad96308)
- instruction: S_CBRANCH_VCCNZ %bb.1, implicit $vcc_lo
- operand 1: implicit $vcc_lo
LLVM ERROR: Found 1 machine code errors.
```
* Simplify "cbranch_kill" to not use interp instructions.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D96793
Fold shuffle(bop(shuffle(x,y),shuffle(z,w)),bop(shuffle(a,b),shuffle(c,d))) -> bop(shuffle(x,y),shuffle(z,w)),bop(shuffle(a,b),shuffle(c,d))
Attempt to fold from a shuffle of a pair of binops to a binop of shuffles, as long as one/both of the binop sources are also shuffles that can be merged with the outer shuffle. This should guarantee that we remove one binop without introducing any additional shuffles.
Technically there's potential for a merged shuffle's lowering to be poorer than the original shuffle, but it could also be better, and I'm not seeing any regressions as long as we keep the 'don't merge splats' rule already present in MergeInnerShuffle.
This expands and generalizes an existing X86 combine and attempts to merge either of each binop's sources (with an on-the-fly commutation of the shuffle mask) - we couldn't do that in the x86 version as it had to stay in a form that DAGCombine's MergeInnerShuffle would still recognise.
Fixes issue raised by @saugustine in rG5aa8f4c0843a where we were failing to replace null shuffle operands from MergeInnerShuffle to UNDEFs.
Differential Revision: https://reviews.llvm.org/D96345
The helper function isBoolSGPR is too aggressive when determining
when a v_cndmask can be skipped on a boolean value because the
function does not check the operands of and/or/xor.
This can be problematic for the Add/Sub combines that can leave
bits set even for inactive lanes leading to wrong results.
Fix this by inspecting the operands of and/or/xor recursively.
Differential Revision: https://reviews.llvm.org/D86878
This patch adds support for fixed-length vector vselect. It does so by
lowering them to a custom unmasked VSELECT_VL node with a vector length
operand.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D96768
This patch proposes how to deal with RISC-V vector frame objects. The
layout of RISC-V vector frame will look like
|---------------------------------|
| scalar callee-saved registers |
|---------------------------------|
| scalar local variables |
|---------------------------------|
| scalar outgoing arguments |
|---------------------------------|
| RVV local variables && |
| RVV outgoing arguments |
|---------------------------------| <- end of frame (sp)
If there is realignment or variable length array in the stack, we will use
frame pointer to access fixed objects and stack pointer to access
non-fixed objects.
|---------------------------------| <- frame pointer (fp)
| scalar callee-saved registers |
|---------------------------------|
| scalar local variables |
|---------------------------------|
| ///// realignment ///// |
|---------------------------------|
| scalar outgoing arguments |
|---------------------------------|
| RVV local variables && |
| RVV outgoing arguments |
|---------------------------------| <- end of frame (sp)
If there are both realignment and variable length array in the stack, we
will use frame pointer to access fixed objects and base pointer to access
non-fixed objects.
|---------------------------------| <- frame pointer (fp)
| scalar callee-saved registers |
|---------------------------------|
| scalar local variables |
|---------------------------------|
| ///// realignment ///// |
|---------------------------------| <- base pointer (bp)
| RVV local variables && |
| RVV outgoing arguments |
|---------------------------------|
| /////////////////////////////// |
| variable length array |
| /////////////////////////////// |
|---------------------------------| <- end of frame (sp)
| scalar outgoing arguments |
|---------------------------------|
In this version, we do not save the addresses of RVV objects in the
stack. We access them directly through the polynomial expression
(a x VLENB + b). We do not reserve frame pointer when there is any RVV
object in the stack. So, we also access the scalar frame objects through the
polynomial expression (a x VLENB + b) if the access across RVV stack
area.
Differential Revision: https://reviews.llvm.org/D94465
The AMD GPU SIMemoryLegalizer was using the ordering address space
rather than the instruction address space when determining the
s_waitcnt to generate to ensure that a read-modify-write atomic has
completed. This resulted in additional unnecessary counters being
waited on.
Differential Revision: https://reviews.llvm.org/D96743
Basic block sections enables function sections implicitly, this is not needed
and is inefficient with "=list" option.
We had basic block sections enable function sections implicitly in clang. This
is particularly inefficient with "=list" option as it places functions that do
not have any basic block sections in separate sections. This causes unnecessary
object file overhead for large applications.
This patch disables this implicit behavior. It only creates function sections
for those functions that require basic block sections.
Further, there was an inconistent behavior with llc as llc was not turning on
function sections by default. This patch makes llc and clang consistent and
tests are added to check the new behavior.
This is the first of two patches and this adds functionality in LLVM to
create a new section for the entry block if function sections is not
enabled.
Differential Revision: https://reviews.llvm.org/D93876
This change introduces support for zero flag ELF section groups to LLVM.
LLVM already supports COMDAT sections, which in ELF are a special type
of ELF section groups. These are generally useful to enable linker GC
where you want a group of sections to always travel together, that is to
be either retained or discarded as a whole, but without the COMDAT
semantics. Other ELF assemblers already support zero flag ELF section
groups and this change helps us reach feature parity.
Differential Revision: https://reviews.llvm.org/D95851
We currently represent TOC entries by an MCSymbol. This is not enough in some situations.
For example, when accessing an initialized TLS variable v on AIX using the general dynamic
model, we need to generate the two following entries for v:
.tc .v[TC],v@m
.tc v[TC],v
One is for the region handle (with the @m relocation), the other is for the variable offset.
This refactoring allows storing several entries for the same symbol with different VariantKind
in the TOC. If the VariantKind is not specified, we default to VK_None.
The AIX TLS implementation using this refactoring to generate the two entries will be posted
in a subsequent patch.
Patched By: bsaleil
Reviewed By: sfertile
Differential Revision: https://reviews.llvm.org/D96346
This reverts commit 5dfba562dd.
That commit causes an assertion failure with the following repro:
typedef long b __attribute__((__vector_size__(16)));
b *d;
b e;
b __attribute__((__always_inline__)) c(b h, b i) {
return (__attribute__((__vector_size__(8 * sizeof(short)))) short)h + i;
}
j() {
b k, l, m, n, o[6], p, q;
m = d[5];
b r = m;
b s = f(r, 8);
q = s;
l = d[1];
p = l;
t(q);
n = c(m, l);
o[1] = c(s, f(p, 8));
k = __builtin_shufflevector(n, o[1], 0, 2);
e = __builtin_ia32_psrlwi128(k, j);
}
./bin/clang -cc1 -triple x86_64-grtev4-linux-gnu -emit-obj -O1 -std=c99 test.c
ICMP & SELECT patterns extracting the sign of a value can be simplified
to OR & ASR (see https://alive2.llvm.org/ce/z/Xx4iZ0).
This does not save any instructions in IR, but it is profitable on
AArch64, because we need at least 2 extra instructions to materialize 1
and -1 for the SELECT.
The improvements result in ~5% speedups on loops of the form
static int sign_of(int x) {
if (x < 0) return -1;
return 1;
}
void foo(const int *x, int *res, int cnt) {
for (int i=0;i<cnt;i++)
res[i] = sign_of(x[i]);
}
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D96596
From what I can tell, a writeback is unpredictable with LR for both
loads and stores. This changes the operand from a gprnopc to a rGPR in
both cases (which I believe is essentially a NFC due to the tied-def
already being a rGPR.)
Differential Revision: https://reviews.llvm.org/D96723
In a future commit, soft clauses will be hinted with kill instructions
rather than forced together with bundles. Look for kills that look
like this, and erase them. I'm not sure if the check for specific uses
is worthwhile, or if it would be better to just unconditionally erase
kills.
This reduces test churn in a future patch.
Fold shuffle(bop(shuffle(x,y),shuffle(z,w)),bop(shuffle(a,b),shuffle(c,d))) -> bop(shuffle(x,y),shuffle(z,w)),bop(shuffle(a,b),shuffle(c,d))
Attempt to fold from a shuffle of a pair of binops to a binop of shuffles, as long as one/both of the binop sources are also shuffles that can be merged with the outer shuffle. This should guarantee that we remove one binop without introducing any additional shuffles.
Technically there's potential for a merged shuffle's lowering to be poorer than the original shuffle, but it could also be better, and I'm not seeing any regressions as long as we keep the 'don't merge splats' rule already present in MergeInnerShuffle.
This expands and generalizes an existing X86 combine and attempts to merge either of each binop's sources (with an on-the-fly commutation of the shuffle mask) - we couldn't do that in the x86 version as it had to stay in a form that DAGCombine's MergeInnerShuffle would still recognise.
Differential Revision: https://reviews.llvm.org/D96345
This was allowing debug instructions to break the bundling, which
would change scheduling behavior. Bundle debug info / kills inside
the bundle. This seems to work OK, although the asm printer doesn't
understand these in a bundle. This implicitly expects the memory
legalizer to unbundle. It would probably be slightly nicer to move
these after.
Rewrite the loop to be clearer and make sure we don't end a bundle on
a meta instruction, only allow them in between other valid bundle
instructions.
When a literal that cannot fit in the immediate form of the fmov instruction
is used to initialise an SVE vector, an extra unnecessary fmov is currently
generated. This patch adds an extra codegen pattern preventing the extra
instruction from being generated.
Differential Revision: https://reviews.llvm.org/D96700
Co-Authored-By: Paul Walker <paul.walker@arm.com>
This patch enables scalable vectorization of loops with integer/fast reductions, e.g:
```
unsigned sum = 0;
for (int i = 0; i < n; ++i) {
sum += a[i];
}
```
A new TTI interface, isLegalToVectorizeReduction, has been added to prevent
reductions which are not supported for scalable types from vectorizing.
If the reduction is not supported for a given scalable VF,
computeFeasibleMaxVF will fall back to using fixed-width vectorization.
Reviewed By: david-arm, fhahn, dmgreen
Differential Revision: https://reviews.llvm.org/D95245
Non-splatted non-integer build_vector nodes were mistakenly being
lowered as VID expressions, which should not happen. VID can only be
used to select integer build_vector nodes.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D96718
The patterns mostly follow the scalar counterparts, save for some extra
optimizations to match the vector/scalar forms.
The patch adds a DAGCombine for ISD::FCOPYSIGN to try and reorder
ISD::FNEG around any ISD::FP_EXTEND or ISD::FP_TRUNC of the second
operand. This helps us achieve better codegen to match vfsgnjn.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D96028
This stops tablegen from generating patterns with the opposite type
in the opposite HwMode. This just adds wasted bytes to the isel table.
This reduces the isel table by about 1800 bytes.
The API is a bit awkward since you need to index into an array in the
passed struct. I guess an alternative would be to pass all of the
individual fields.
This is annoying because the condition code legalization belongs
to LegalizeDAG, but our custom handler runs in Legalize vector ops
which occurs earlier.
This adds some of the mask binary operations so that we can combine
multiple compares that we need for expansion.
I've also fixed up RISCVISelDAGToDAG.cpp to handle copies of masks.
This patch contains a subset of the integer setcc patch as well.
That patch is dependent on the integer binary ops patch. I'll rebase
based on what order the patches go in.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D96567
This commit fixes how metadata is handled in CloneModule to be sound,
and improves how it's handled in CloneFunctionInto (although the latter
is still awkward when called within a module).
Ruiling Song pointed out in PR48841 that CloneModule was changed to
unsoundly use the RF_ReuseAndMutateDistinctMDs flag (renamed in
fa35c1f80f for clarity). This flag papered
over a crash caused by other various changes made to CloneFunctionInto
over the past few years that made it unsound to use cloning between
different modules.
(This commit partially addresses PR48841, fixing the repro from
preprocessed source but not textual IR. MDNodeMapper::mapDistinctNode
became unsound in df763188c9 and this
commit does not address that regression.)
RF_ReuseAndMutateDistinctMDs is designed for the IRMover to use,
avoiding unnecessary clones of all referenced metadata when linking
between modules (with IRMover, the source module is discarded after
linking). It never makes sense to use when you're not discarding the
source. This commit drops its incorrect use in CloneModule.
Sadly, the right thing to do with metadata when cloning a function is
complicated, and this patch doesn't totally fix it.
The first problem is that there are two different types of referenceable
metadata and it's not obvious what to with one of them when remapping.
- `!0 = !{!1}` is metadata's version of a constant. Programatically it's
called "uniqued" (probably a better term would be "constant") because,
like `ConstantArray`, it's stored in uniquing tables. Once it's
constructed, it's illegal to change its arguments.
- `!0 = distinct !{!1}` is a bit closer to a global variable. It's legal
to change the operands after construction.
What should be done with distinct metadata when cloning functions within
the same module?
- Should new, cloned nodes be created?
- Should all references point to the same, old nodes?
The answer depends on whether that metadata is effectively owned by a
function.
And that's the second problem. Referenceable metadata's ownership model
is not clear or explicit. Technically, it's all stored on an
LLVMContext. However, any metadata that is `distinct`, that transitively
references a `distinct` node, or that transitively references a
GlobalValue is specific to a Module and is effectively owned by it. More
specifically, some metadata is effectively owned by a specific Function
within a module.
Effectively function-local metadata was introduced somewhere around
c10d0e5ccd, which made it illegal for two
functions to share a DISubprogram attachment.
When cloning a function within a module, you need to clone the
function-local debug info and suppress cloning of global debug info (the
status quo suppresses cloning some global debug info but not all). When
cloning a function to a new/different module, you need to clone all of
the debug info.
Here's what I think we should do (eventually? soon? not this patch
though):
- Distinguish explicitly (somehow) between pure constant metadata owned
by the LLVMContext, global metadata owned by the Module, and local
metadata owned by a GlobalValue (such as a function).
- Update CloneFunctionInto to trigger cloning of all "local" metadata
(only), perhaps by adding a bit to RemapFlag. Alternatively, split
out a separate function CloneFunctionMetadataInto to prime the
metadata map that callers are updated to call ahead of time as
appropriate.
Here's the somewhat more isolated fix in this patch:
- Converted the `ModuleLevelChanges` parameter to `CloneFunctionInto` to
an enum called `CloneFunctionChangeType` that is one of
LocalChangesOnly, GlobalChanges, DifferentModule, and ClonedModule.
- The code maintaining the "functions uniquely own subprograms"
invariant is now only active in the first two cases, where a function
is being cloned within a single module. That's necessary because this
code inhibits cloning of (some) "global" metadata that's effectively
owned by the module.
- The code maintaining the "all compile units must be explicitly
referenced by !llvm.dbg.cu" invariant is now only active in the
DifferentModule case, where a function is being cloned into a new
module in isolation.
- CoroSplit.cpp's call to CloneFunctionInto in CoroCloner::create
uses LocalChangeOnly, since fa635d730f
only set `ModuleLevelChanges` to trigger cloning of local metadata.
- CloneModule drops its unsound use of RF_ReuseAndMutateDistinctMDs
and special handling of !llvm.dbg.cu.
- Fixed some outdated header docs and left a couple of FIXMEs.
Differential Revision: https://reviews.llvm.org/D96531
We are using AtomicNoRet map in multiple places to determine
if an instruction atomic, rtn or nortn atomic. This method
does not work always since we have some instructions which
only has rtn or nortn version.
One such instruction is ds_wrxchg_rtn_b32 which does not have
nortn version. This has caused changes in memory legalizer
tests.
Differential Revision: https://reviews.llvm.org/D96639
This patch adjusts the placement of the bundle unpacking to just before
code emission. In particular, this means bundle unpacking happens AFTER
the machine outliner. With the previous position, the machine outliner
may outline parts of a bundle, which breaks them up.
This is an issue for BLR_RVMARKER handling, as illustrated by the
rvmarker-pseudo-expansion-and-outlining.mir test case. The machine
outliner should not break up the bundles created during pseudo
expansion.
This should fix PR49082.
Reviewed By: SjoerdMeijer
Differential Revision: https://reviews.llvm.org/D96294
This adds basic MVE costs for SMIN/SMAX/UMIN/UMAX, as well as MINNUM and
MAXNUM representing fmin and fmax. It tightens up the costs, not using a
ICmp+Select cost.
Differential Revision: https://reviews.llvm.org/D96603
This patch uses the function getShuffleCost with SK_Reverse to compute the cost
for experimental.vector.reverse.
For scalable vector type, it adds a table will the legal types on
AArch64TTIImpl::getShuffleCost to not assert in BasicTTIImpl::getShuffleCost,
and for fixed vector, it relies on the existing cost model in BasicTTIImpl.
Depends on D94883
Differential Revision: https://reviews.llvm.org/D95603
This patch adds a new intrinsic experimental.vector.reduce that takes a single
vector and returns a vector of matching type but with the original lane order
reversed. For example:
```
vector.reverse(<A,B,C,D>) ==> <D,C,B,A>
```
The new intrinsic supports fixed and scalable vectors types.
The fixed-width vector relies on shufflevector to maintain existing behaviour.
Scalable vector uses the new ISD node - VECTOR_REVERSE.
This new intrinsic is one of the named shufflevector intrinsics proposed on the
mailing-list in the RFC at [1].
Patch by Paul Walker (@paulwalker-arm).
[1] https://lists.llvm.org/pipermail/llvm-dev/2020-November/146864.html
Differential Revision: https://reviews.llvm.org/D94883
Currently the findIncDecAfter will only look at the next instruction for
post-inc candidates in the load/store optimizer. This extends that to a
search through the current BB, until an instruction that modifies or
uses the increment reg is found. This allows more post-inc load/stores
and ldm/stm's to be created, especially in cases where a schedule might
move instructions further apart.
We make sure not to look any further for an SP, as that might invalidate
stack slots that are still in use.
Differential Revision: https://reviews.llvm.org/D95881
This refactors shouldFavorPostInc() and shouldFavorBackedgeIndex() into
getPreferredAddressingMode() so that we have one interface to steer LSR in
generating the preferred addressing mode.
Differential Revision: https://reviews.llvm.org/D96600
This patch prepares the RISCV VSLIDEUP and VSLIDEDOWN custom nodes to
ones carrying additional mask and vector-length operands. This is
primarily so they can be used by both systems.
This also takes the opportunity to create some helper functions to deal
with the common task of getting the default (unmasked) VL operands.
Reviewed By: craig.topper, arcbbb
Differential Revision: https://reviews.llvm.org/D96505
In the future Windows will enable Control-flow Enforcement Technology (CET aka shadow stacks). To protect the path where the context is updated during exception handling, the binary is required to enumerate valid unwind entrypoints in a dedicated section which is validated when the context is being set during exception handling.
This change allows llvm to generate the section that contains the appropriate symbol references in the form expected by the msvc linker.
This feature is enabled through a new module flag, ehcontguard, which was modelled on the cfguard flag.
The change includes a test that when the module flag is enabled the section is correctly generated.
The set of exception continuation information includes returns from exceptional control flow (catchret in llvm).
In order to collect catchret we:
1) Includes an additional flag on machine basic blocks to indicate that the given block is the target of a catchret operation,
2) Introduces a new machine function pass to insert and collect symbols at the start of each block, and
3) Combines these targets with the other EHCont targets that were already being collected.
Change originally authored by Daniel Frampton <dframpto@microsoft.com>
For more details, see MSVC documentation for `/guard:ehcont`
https://docs.microsoft.com/en-us/cpp/build/reference/guard-enable-eh-continuation-metadata
Reviewed By: pengfei
Differential Revision: https://reviews.llvm.org/D94835
Add intrinsic which demotes all active lanes to helper lanes.
This is used to implement demote to helper Vulkan extension.
In practice demoting a lane to helper simply means removing it
from the mask of live lanes used for WQM/WWM/Exact mode.
Where the shader does not use WQM, demotes just become kills.
Additionally add llvm.amdgcn.live.mask intrinsic to complement
demote operations. In theory llvm.amdgcn.ps.live can be used
to detect helper lanes; however, ps.live can be moved by LICM.
The movement of ps.live cannot be remedied without changing
its type signature and such a change would require ps.live
users to update as well.
Reviewed By: piotr
Differential Revision: https://reviews.llvm.org/D94747
Previously we assumed `rethrow`'s argument was always 0, but it turned
out `rethrow` follows the same rule with `br` or `delegate`:
https://github.com/WebAssembly/exception-handling/pull/137https://github.com/WebAssembly/exception-handling/issues/146#issuecomment-777349038
Currently `rethrow`s generated by our backend always rethrow the
exception caught by the innermost enclosing catch, so this adds a
function to compute that and replaces `rethrow`'s argument with its
computed result.
This also renames `EHPadStack` in `InstPrinter` to `TryStack`, because
in CFGStackify we use `EHPadStack` to mean the range between
`catch`~`end`, while in `InstPrinter` we used it to mean the range
between `try`~`catch`, so choosing different names would look clearer.
Doesn't contain any functional changes in `InstPrinter`.
Reviewed By: dschuff
Differential Revision: https://reviews.llvm.org/D96595
This is pretty much just ports `performGlobalAddressCombine` from
AArch64ISelLowering. (AArch64 doesn't use the generic DAG combine for this.)
This adds a pre-legalize combine which looks for this pattern:
```
%g = G_GLOBAL_VALUE @x
%ptr1 = G_PTR_ADD %g, cst1
%ptr2 = G_PTR_ADD %g, cst2
...
%ptrN = G_PTR_ADD %g, cstN
```
And then, if possible, transforms it like so:
```
%g = G_GLOBAL_VALUE @x
%offset_g = G_PTR_ADD %g, -min(cst)
%ptr1 = G_PTR_ADD %offset_g, cst1
%ptr2 = G_PTR_ADD %offset_g, cst2
...
%ptrN = G_PTR_ADD %offset_g, cstN
```
Where min(cst) is the smallest out of the G_PTR_ADD constants.
This means we should save at least one G_PTR_ADD.
This also updates code in the legalizer + selector which assumes that
G_GLOBAL_VALUE will never have an offset and adds/updates relevant tests.
Differential Revision: https://reviews.llvm.org/D96624
This combine tries to do inter-block hoisting of extends of G_PHIs, into the
originating blocks of the phi's incoming value. The idea is to expose further
optimization opportunities that are normally obscured by the PHI.
Some basic heuristics, and a target hook for AArch64 is added, to allow tuning.
E.g. if the extend is used by a G_PTR_ADD, it doesn't perform this combine
since it may be folded into the addressing mode during selection.
There are very minor code size improvements on AArch64 -Os, but the real benefit
is that it unlocks optimizations like AArch64 conditional compares on some
benchmarks.
Differential Revision: https://reviews.llvm.org/D95703
Given a floating point store from an extracted vector, with an integer
VGETLANE that already exists, storing the existing VGETLANEu directly
can be better for performance. As the value is known to already be in an
integer registers, this can help reduce fp register pressure, removed
the need for the fp extract and allows use of more integer post-inc
stores not available with vstr.
This can be a bit narrow in scope, but helps with certain biquad kernels
that store shuffled vector elements.
Differential Revision: https://reviews.llvm.org/D96159
Begin transitioning the X86 vector code to recognise sub(umax(a,b) ,b) or sub(a,umin(a,b)) USUBSAT patterns to make it more generic and available to all targets.
This initial patch just moves the basic umin/umax patterns to DAG, removing some vector-only checks on the way - these are some of the patterns that the legalizer will try to expand back to so we can be reasonably relaxed about matching these pre-legalization.
We can handle the trunc(sub(..))) variants as well, which helps with patterns where we were promoting to a wider type to detect overflow/saturation.
The remaining x86 code requires some cleanup first - some of it isn't actually tested etc. I also need to resurrect D25987.
Differential Revision: https://reviews.llvm.org/D96413
These two instructions are VOP3P and have op_sel_hi bits,
however do not use op_sel_hi. That is recommended to set
unused op_sel_hi bits to 1. However, we cannot decode
both representations with 1 and 0 if bits are set to
default value 1. If bits are set to be ignored with '?'
initializer then encoding defaults them to 0.
The patch is a hack to force ignored '?' bits to 1 on
encoding for these instructions.
There is still canonicalization happens on disasm print
if incoming values are non-default, so that disasm output
does not match binary input, but this is pre-existing
problem for all instructions with '?' bits.
Fixes: SWDEV-272540
Differential Revision: https://reviews.llvm.org/D96543
explicitly emitting retainRV or claimRV calls in the IR
Background:
This fixes a longstanding problem where llvm breaks ARC's autorelease
optimization (see the link below) by separating calls from the marker
instructions or retainRV/claimRV calls. The backend changes are in
https://reviews.llvm.org/D92569.
https://clang.llvm.org/docs/AutomaticReferenceCounting.html#arc-runtime-objc-autoreleasereturnvalue
What this patch does to fix the problem:
- The front-end adds operand bundle "clang.arc.attachedcall" to calls,
which indicates the call is implicitly followed by a marker
instruction and an implicit retainRV/claimRV call that consumes the
call result. In addition, it emits a call to
@llvm.objc.clang.arc.noop.use, which consumes the call result, to
prevent the middle-end passes from changing the return type of the
called function. This is currently done only when the target is arm64
and the optimization level is higher than -O0.
- ARC optimizer temporarily emits retainRV/claimRV calls after the calls
with the operand bundle in the IR and removes the inserted calls after
processing the function.
- ARC contract pass emits retainRV/claimRV calls after the call with the
operand bundle. It doesn't remove the operand bundle on the call since
the backend needs it to emit the marker instruction. The retainRV and
claimRV calls are emitted late in the pipeline to prevent optimization
passes from transforming the IR in a way that makes it harder for the
ARC middle-end passes to figure out the def-use relationship between
the call and the retainRV/claimRV calls (which is the cause of
PR31925).
- The function inliner removes an autoreleaseRV call in the callee if
nothing in the callee prevents it from being paired up with the
retainRV/claimRV call in the caller. It then inserts a release call if
claimRV is attached to the call since autoreleaseRV+claimRV is
equivalent to a release. If it cannot find an autoreleaseRV call, it
tries to transfer the operand bundle to a function call in the callee.
This is important since the ARC optimizer can remove the autoreleaseRV
returning the callee result, which makes it impossible to pair it up
with the retainRV/claimRV call in the caller. If that fails, it simply
emits a retain call in the IR if retainRV is attached to the call and
does nothing if claimRV is attached to it.
- SCCP refrains from replacing the return value of a call with a
constant value if the call has the operand bundle. This ensures the
call always has at least one user (the call to
@llvm.objc.clang.arc.noop.use).
- This patch also fixes a bug in replaceUsesOfNonProtoConstant where
multiple operand bundles of the same kind were being added to a call.
Future work:
- Use the operand bundle on x86-64.
- Fix the auto upgrader to convert call+retainRV/claimRV pairs into
calls with the operand bundles.
rdar://71443534
Differential Revision: https://reviews.llvm.org/D92808
I believe I've covered all orderings of splat operands here. Better
canonicalization in lowering might help reduce this. I did not handle
the immediate adjustments needed for set(u)gt/set(u)lt.
Testing here is limited to byte types because the scalable vector
type used for masks for the store is calculated assuming 8 byte
elements. But for the setcc its based on the element count of the
container type for the setcc input. So they don't agree. We'll need
to enhanced D96352 to handle this I think.
Differential Revision: https://reviews.llvm.org/D96443
Unlike scalable vectors, I'm only using a ComplexPattern for
the immediate itself. The vmv_v_x is matched explicitly. We igore
the VL argument when matching a binary operator, but we do check
it when matching splat directly.
I left out tests for vXi64 as they fail on rv32 right now.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D96365
Allow different GICustomOperandRenderers to use the same RendererFn.
This avoids the need for targets to define a bunch of identical C++
renderer functions with different names.
Without this fix TableGen would have emitted code that tried to define
the GICR enumeration with duplicate enumerators.
Differential Revision: https://reviews.llvm.org/D96587
Our current lowering of VMOVNT goes via a shuffle vector of the form
<0, N, 2, N+2, 4, N+4, ..>. That can of course also be a single input
shuffle of the form <0, 0, 2, 2, 4, 4, ..>, where we use a VMOVNT to
insert a vector into the top lanes of itself. This adds lowering of that
case, re-using the existing isVMOVNMask.
Differential Revision: https://reviews.llvm.org/D96065
The vector reduction intrinsics started life as experimental ops, so backend support
was lacking. As part of promoting them to 1st-class intrinsics, however, codegen
support was added/improved:
D58015
D90247
So I think it is safe to now remove this complication from IR.
Note that we still have an IR-level codegen expansion pass for these as discussed
in D95690. Removing that is another step in simplifying the logic. Also note that
x86 was already unconditionally forming reductions in IR, so there should be no
difference for x86.
I spot checked a couple of the tests here by running them through opt+llc and did
not see any asm diffs.
If we do find functional differences for other targets, it should be possible
to (at least temporarily) restore the shuffle IR with the ExpandReductions IR
pass.
Differential Revision: https://reviews.llvm.org/D96552
Change parseVTypeI function to Make the added vset instruction test cases report more concrete error message.
Differential Revision: https://reviews.llvm.org/D96218
This patch extends the initial fixed-length vector support to include
smin, smax, umin, and umax.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D96491
I previously assumed `delegate`'s immediate argument computation
followed a different rule than that of branches, but we agreed to make
it the same
(https://github.com/WebAssembly/exception-handling/issues/146). This
removes the need for a separate `DelegateStack` in both CFGStackify and
InstPrinter.
When computing the immediate argument, we use a different function for
`delegate` computation because in MIR `DELEGATE`'s instruction's
destination is the destination catch BB or delegate BB, and when it is a
catch BB, we need an additional step of getting its corresponding `end`
marker.
Reviewed By: tlively, dschuff
Differential Revision: https://reviews.llvm.org/D96525