HVX v60 only has splats that take a 32-bit word as input, while v62+
has splats that take an 8- or 16-bit value. This makes writing output
patterns that need to use a splat annoying, because the entire output
pattern needs to be replicated for various versions of HVX.
To avoid this, the patterns will always use the pseudos, and then the
pseudos will be handled using a post-ISel hook.
This change pulls some code from the DirectX backend into a new
LLVMFrontendHLSL library to share utility data structures between the
HLSL code generation in Clang and the backend in LLVM.
This is a small refactoring as a first step toward getting the code into
the right structure, getting the library built, and getting the dependencies correct.
Fixes #58000 (https://github.com/llvm/llvm-project/issues/58000)
Reviewed By: python3kgae
Differential Revision: https://reviews.llvm.org/D135110
getPrimitiveSizeInBits returns 0 for pointers, so we need to query
the size via DataLayout instead.
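A minimal sketch of the DataLayout-based query (illustrative only; the helper
name is an assumption, not code from the patch):

  #include "llvm/IR/DataLayout.h"
  #include "llvm/IR/Type.h"

  // Ty->getPrimitiveSizeInBits() is 0 for pointer types, while
  // DataLayout::getTypeSizeInBits() knows the target's pointer width.
  static llvm::TypeSize getValueSizeInBits(const llvm::DataLayout &DL,
                                           llvm::Type *Ty) {
    return DL.getTypeSizeInBits(Ty); // also correct for pointers
  }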
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D135976
When DXC prints IR output it adds a bunch of IR comments in a header
that describe the DXIL metadata in a more human-readable format. This
pass will serve that purpose for LLVM by printing that information ahead
of the IR printer.
Reviewed By: python3kgae
Differential Revision: https://reviews.llvm.org/D135802
This patch changes the preferred disassembly for SVE vector lists.
For instance, instead of printing this assembly:
ld4d { z1.d, z2.d, z3.d, z4.d }, p0/z, [x0]
it will print this:
ld4d { z1.d-z4.d }, p0/z, [x0]
Differential Revision: https://reviews.llvm.org/D135952
Add a compile-time flag for enabling streaming mode.
When streaming mode is enabled, lower basic loads and stores of fixed-width
vectors to generate code that is compatible with streaming mode.
Differential Revision: https://reviews.llvm.org/D133433
Correct v_cndmask_b32 to support abs/neg modifiers in dpp/sdwa/e64 variants.
Correct v_cndmask_b16 for proper disassembly of abs/neg modifiers in e64_dpp variants.
Differential Revision: https://reviews.llvm.org/D135900
Functions with `aarch64_sme_pstatesm_body` will emit an SMSTART at the start
of the function and an SMSTOP at the end of the function, such that all
operations use the right value for vscale.
Because the placement of these nodes is critically important (i.e. no
vscale-dependent operations should be done before SMSTART has been issued),
we require gluing the CopyFromReg to the Entry node such that we can
insert the SMSTART as part of that glued chain.
More details about the SME attributes and design can be found
in D131562.
Reviewed By: aemerson
Differential Revision: https://reviews.llvm.org/D131582
CCMP/CCMN's second operand supports constants from 0 to 31. When CCMP's second operand is in the range [-31, -1], we can replace it with CCMN to avoid an extra mov; for example, comparing against -5 can be encoded as a CCMN with immediate 5 instead of first materializing -5 in a register.
Fix: #57034
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D135939
This reverts commit 50e0aced45.
This could accidentally start producing invalid code in some
cases (in particular, if compiling with -mstack-alignment=16, which
one could expect to be a no-op for a target where the stack is always
aligned to 16 bytes anyway).
In the Linux PIC model, there are 4 cases of value/label addressing:
Case 1: Function call or Label jmp inside the module.
Case 2: Data access (such as global variable, static variable) inside the module.
Case 3: Function call or Label jmp outside the module.
Case 4: Data access (such as global variable) outside the module.
Because the current LLVM inline asm architecture is designed not to "recognize"
the asm code, it is quite troublesome to treat memory addressing differently for
the same value/address used in different instructions.
For example, in the PIC model, a function call may go through the PLT or be
directly PC-relative, while a lea/mov of a function address may use the GOT.
This patch fixes/refines case 1 and case 2 in inline asm.
Because inline asm currently does not support jumping to an outside label, this
patch mainly focuses on fixing the function call addressing bugs in inline asm.
Reviewed By: Pengfei, RKSimon
Differential Revision: https://reviews.llvm.org/D133914
Inputs to crnor can come from operands with chains so
if it is being used simply to negate such an operand,
the repeated input cannot be CSE'd. This patch just
adds a code-gen only instruction for this that takes
a single input and duplicates it in the encoding of
the underlying crnor.
Differential revision: https://reviews.llvm.org/D133577
Same with (select C, X, -1), (select C, 0, X), and (select C, X, 0).
There's a DAGCombine after we turn the select into select_cc, but
that may introduce a setcc that didn't previously exist. We could
add more DAGCombines to remove the extra setcc, but this seemed lower
effort.
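A sketch of the underlying branchless identities (assuming C is a 0/1 value;
this illustrates the math, not the exact instruction sequence emitted):

  #include <cstdint>

  // C is assumed to be 0 or 1.
  int64_t sel_x_m1(int64_t C, int64_t X) { return X | (C - 1); } // select C, X, -1
  int64_t sel_0_x (int64_t C, int64_t X) { return X & (C - 1); } // select C, 0,  X
  int64_t sel_x_0 (int64_t C, int64_t X) { return X & (0 - C); } // select C, X,  0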
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D135833
When removing frame indices on PowerPC, we need to scavenge
a GPR to materialize a large constant if the stack offset
for the spill/reload cannot be reached by a D-Form
instruction. However, in a perfect storm of conditions,
we may not have GPRs available to scavenge, thereby
requiring an emergency spill. If such an emergency
spill also needs to be spilled to a location with a
large offset, it would itself require register scavenging
thereby creating an infinite loop.
This patch detects when the scavenger cannot scavenge
a register and the spill/reload is to a location with
a large offset. It then stashes a GPR into a VSR so
that it can use the GPR to materialize the constant
(rather than scavenging a GPR).
Fixes: https://github.com/llvm/llvm-project/issues/52894
Differential revision: https://reviews.llvm.org/D124841
Allows us to use combineX86ShuffleChainWithExtract to combine targetshuffle(low_subvector(x), high_subvector(x)) -> low_subvector(targetshuffle(x)) style patterns.
This is currently very limited (it must have a v2i64/v2f64 result), but while triaging I noticed we might be able to extend this to allow more types for targets with suitable variable cross lane shuffle support.
Fixes #58339
Similar to D69390 for RISCV, use a guaranteed non-existing insn for
llvm.trap and the break insn for llvm.debugtrap.
Differential Revision: https://reviews.llvm.org/D134365
The e_flags of existing object files are all 0x3 which happens to be
compatible. From this commit on, all LoongArch objects produced with
upstream LLVM will be of object file ABI v1, which is already supported
by binutils' master branch (to be released as 2.40), and is allowed by
the same binutils version to interlink with v0 objects so the existing
distributions have time to migrate.
Differential Revision: https://reviews.llvm.org/D134601
To achieve this, we need this observation:
`uzp1` is just an `xtn` that operates on two registers.
For example, given the following register with type v2i64:
LSB_______MSB
x0 x1 x2 x3
Applying xtn on it we get:
x0 x2
This is equivalent to bitcasting it to v4i32 and then applying uzp1 on it:
x0 x1 x2 x3
|
uzp1
v
x0 x2 <value from other register>
We can transform xtn to uzp1 by this observation, and vice versa.
This observation only works on little-endian targets. Big-endian targets have
a problem: the uzp1 cannot be replaced by xtn since there is a discrepancy
in the behavior of uzp1 between little endian and big endian.
To illustrate, take the following for example:
LSB____________________MSB
x0 x1 x2 x3
On little endian, uzp1 grabs x0 and x2, which is right; on big endian, it
grabs x3 and x1, which doesn't match what I saw in the documentation. But,
since I'm new to AArch64, take my word with a pinch of salt. This behavior was
observed in gdb; maybe there's an issue in the order of the values it prints?
Whatever the reason is, the execution result given by qemu just doesn't match.
So I disable this on big-endian targets temporarily until we find the crux.
Fixes #57502
Reviewed By: dmgreen, mingmingl
Co-authored-by: Mingming Liu <mingmingl@google.com>
Differential Revision: https://reviews.llvm.org/D133850
First patch in a series adding MC layer support for Scalable Matrix
Extension 2 (SME2).
This patch adds the following feature:
sme2
The 2022 Architecture Extension release adds other feature flags (e.g. sme2.1)
that will be handled in follow-up patches.
The reference can be found here:
https://developer.arm.com/documentation/ddi0602/2022-09
Differential Revision: https://reviews.llvm.org/D135448
fp16 and bf16 values can be used in GCC's inline assembly using the "w"
constraint, which means "VFP floating-point registers d0-d31" - fp16 and
bf16 values are stored in S registers (which alias the D registers).
This change ensures that LLVM is compatible with GCC for programs that
use fp16 and the 'w' constraint.
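A hypothetical example of the kind of source this affects (the asm body is
intentionally empty; only the constraint handling matters here):

  // Build for an ARM target with fp16 support; previously LLVM rejected the
  // "w" constraint for __fp16 operands that GCC accepts.
  __fp16 keep_in_vfp_reg(__fp16 h) {
    asm volatile("" : "+w"(h)); // force h into a VFP register (S register)
    return h;
  }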
Differential Revision: https://reviews.llvm.org/D135662
This matches what was done for the ARM implementation (where getting
the instruction sizes right is even more tricky, and hence needed
tighter testing).
This will allow catching any future cases where prologs and epilogs
don't match the instructions within them.
Differential Revision: https://reviews.llvm.org/D131394
If the stack is realigned, we've emitted a frame pointer and
already terminated the SEH prologue, making this dead code since
a07787c9a5.
The immediate to this SEH opcode was entirely bogus - we don't
know how many bytes the AND operation adjusts the SP, and by
doing "NumBytes & andMaskEncoded" (where andMaskEncoded was the
immediate bitpattern for the AND instruction), the immediate to the
opcode was total gibberish.
This hasn't had any practical effect, since the original stack
pointer always was restored from the frame pointer afterwards anyway.
Differential Revision: https://reviews.llvm.org/D135815
If inline asm has a VGPR def, it must have come from a VGPR write
somewhere inside the asm. This should be further extended to all
read after write hazards.
V6_vzb and V6_vshuffeb can use any 2 resources in a packet, while
V6_vunpackub/V6_vpackeb both need a shift resource.
Also, add patterns for shifting vectors of i8.
Continuing the theme of adding branchless lowerings for simple selects, this time handle the 0 arm case. This is very common for various umin idioms, etc.
Differential Revision: https://reviews.llvm.org/D135600
Without this, unwinding through functions that do use PAC
would fail if PAC actually was active.
Differential Revision: https://reviews.llvm.org/D135103
This new opcode was initially documented as "pac_sign_return_address"
in https://github.com/MicrosoftDocs/cpp-docs/pull/4202, but was
soon afterwards renamed into "pac_sign_lr" in
https://github.com/MicrosoftDocs/cpp-docs/pull/4209, as the other
name was unwieldy, and there were no other external references to
that name anywhere.
Rename our external .seh assembler directive - it hasn't been merged
for very long yet, so there's probably no external use to account for.
Rename all other internal references to the opcode similarly.
Differential Revision: https://reviews.llvm.org/D135762
If an AM* atomic memory access instruction uses the same register for
rd and rj, execution will trigger an Instruction Non-defined Exception.
If an AM* atomic memory access instruction uses the same register for
rd and rk, the execution result is uncertain.
Reference: https://github.com/loongson/LoongArch-Documentation
Differential Revision: https://reviews.llvm.org/D135641
Fix crash issue of D129537 and reopen it.
Currently the X86 shuffle lowering would widen the element type for a
shuffle if adjacent mask elements are consecutive. For the example below
%t2 = add nsw <16 x i32> %t0, %t1
%t3 = sub nsw <16 x i32> %t0, %t1
%t4 = shufflevector <16 x i32> %t2, <16 x i32> %t3,
<16 x i32> <i32 16, i32 17, i32 2, i32 3, i32 4,
i32 5, i32 6, i32 7, i32 8, i32 9, i32 10,
i32 11, i32 12, i32 13, i32 14, i32 15>
ret <16 x i32> %t4
The compiler would transform the shuffle to
%t4 = shufflevector <8 x i64> %t2, <8 x i64> %t3,
<8 x i32> <i32 8, i32 1, i32 2, i32 3, i32 4,
i32 5, i32 6, i32 7>
This may lose the opportunity to let ISel select a mask instruction when
avx512 is enabled.
This patch prevents the transform when the avx512 feature is enabled.
Thanks to Simon for the idea.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D130830
After setting up the FP, the rest of the prologue doesn't need to
be replayed for unwinding the stack frame.
This allows reverting the functional parts of
2f7fbf8376 (but fixing inconsistent
duplicate setting of HasWinCFI).
Differential Revision: https://reviews.llvm.org/D135686
Given this is an OR reduction, the two are equivalent and later
optimizations (AArch64InstrInfo::optimizePTestInstr) may rewrite the
sequence to use the flag-setting variant of instruction X, to remove the
PTEST altogether.
Reviewed By: paulwalker-arm, bsmith
Differential Revision: https://reviews.llvm.org/D134946
The BRKNS instruction is unlike the other instructions that set flags
since it has an all active implicit predicate, so the existing
PTEST(PG, BRKN(PG, A, B)) -> BRKNS(PG, A, B)
in AArch64InstrInfo::optimizePTestInstr is incorrect, however
PTEST(PTRUE_B(31), BRKN(PG, A, B)) -> BRKNS(PG, A, B)
is correct.
Spotted by @paulwalker-arm in D134946.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D135655
The patch selects VSELECT/VP_MERGE_VL which uses fmadd/fnmsub as the true operand
and the addend of the fmadd/fnmsub as the false operand.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D135330
If the source is implicit_def, the register allocator won't have
any constraint on what register it picks for the destination. This
doesn't give the user much control of what register is being used.
So in my mind that means the only reason to honor the policy operand
is to control what policy is used in vsetvli to maybe avoid a vtype
change. Given the other optimizations we do on the policy field, I
don't think allowing the user this control is reliable.
Therefore, I think we should use agnostic policies if the source is
undef.
This should give better performance on some CPUs for VP intrinsics where
there is no merge operand and the backend adds IMPLICIT_DEF to the instruction.
Differential Revision: https://reviews.llvm.org/D135396
This adds infrastructural pieces for an analysis to compute the DXIL
shader flags. In this state the analysis can compute two fairly
straightforward feature flags for use of double-precision floating
point values and the DX 11.1 extended double support.
This patch conflicts with D135190; conflicts will be resolved prior
to merging.
Reviewed By: python3kgae
Differential Revision: https://reviews.llvm.org/D135393
Unfortunately, we have broken handling of this in the ROCm 5.3 runtime.
The runtime is expected to handle this correctly when v5 becomes
the default.
Differential Revision: https://reviews.llvm.org/D134714
Make support more generic to support future instructions.
Currently NFC.
Reviewed By: foad, arsenm
Differential Revision: https://reviews.llvm.org/D135678
`getFunctionParamOptimizedAlign` was being passed a null function
argument when getting the callee of a bitcasted function symbol. This is
because `CallBase::getCalledFunction` does not look through bitcasts.
There is already code to handle this case in
`NVPTXTargetLowering::getArgumentAlignment`, which is now hoisted into
an NVPTX util.
The alignment computation now gracefully handles computing alignment of
virtual functions with a check for null.
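A sketch of the look-through logic being hoisted (approximate; the actual
helper name and its location in the NVPTX utilities may differ):

  #include "llvm/IR/Function.h"
  #include "llvm/IR/InstrTypes.h"

  // CallBase::getCalledFunction() returns null when the callee is a bitcasted
  // function symbol, so strip pointer casts off the called operand instead.
  static const llvm::Function *getBitcastedCallee(const llvm::CallBase *CB) {
    return llvm::dyn_cast_or_null<llvm::Function>(
        CB->getCalledOperand()->stripPointerCasts());
  }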
Reference: https://gcc.gnu.org/onlinedocs/gccint/Machine-Constraints.html
k: A memory operand whose address is formed by a base register and
(optionally scaled) index register.
m: A memory operand whose address is formed by a base register and
offset that is suitable for use in instructions with the same
addressing mode as st.w and ld.w.
ZB: An address that is held in a general-purpose register. The offset
is zero.
ZC: A memory operand whose address is formed by a base register and
offset that is suitable for use in instructions with the same
addressing mode as ll.w and sc.w.
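A hypothetical C-level illustration of how one of these constraints is used
from inline asm (the asm body is left empty on purpose; only the constraint
handling is of interest):

  // "ZB": the address operand is a bare general-purpose register, zero offset.
  void touch(int *p) {
    asm volatile("" ::"ZB"(*p) : "memory");
  }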
Note:
The INLINEASM SDNode flags in the tests below are updated because the
newly introduced enum `Constraint_k` is added before `Constraint_m`.
llvm/test/CodeGen/AArch64/GlobalISel/irtranslator-inline-asm.ll
llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-inline-asm.ll
llvm/test/CodeGen/X86/callbr-asm-kill.mir
This patch passes `ninja check-all` on a X86 machine with all official
targets and the LoongArch target enabled.
Differential Revision: https://reviews.llvm.org/D134638
These are harmless for the unwinder - the unwinder doesn't need to
handle them to be able to unwind correctly.
Only add the opcodes when the branch target is in a SEH prologue;
for jumptables e.g. within a function, we shouldn't add any SEH
opcodes.
Differential Revision: https://reviews.llvm.org/D135277
Static and dynamic TLS addresses are lowered in the DAG stage according
to the different TLS models.
TLS address will be lowered to pseudo instruction and then expanded by
the `LoongArch Pre-RA pseudo instruction expansion` pass.
Differential Revision: https://reviews.llvm.org/D134713
As suggested on D135572, return Optional<> from getAllocSizeArgs()
rather than the peculiar pair(0, 0) sentinel.
The method on Attribute itself does not return Optional, because
the attribute must exist in that case.
The patch fixes lowering of anonymous functions, removes file/linkage
info for builtin call demangling, and adds a relevant test demonstrating
the fixed problem.
Differential Revision: https://reviews.llvm.org/D135390
These instructions already had errors for operands that could not share
the same register:
VCMUL, VMULL, VQDMULL.
This extends that to a few others:
VREV64, VQDMULLqr, VCADD and VHCADD.
Only the i32 types require the error.
Differential Revision: https://reviews.llvm.org/D135560
The return type is two u8 values packed into a 16-bit VGPR, so this instruction
should be True16.
Reviewed By: dp
Differential Revision: https://reviews.llvm.org/D135478
This commit fixes https://github.com/llvm/llvm-project/issues/57326.
Currently we would take a Mask out and directly use it by doing
auto Mask = SVI->getShuffleMask();
However, if the mask is undef, this Mask is not initialized. It might be
a vector of -1 or random integers.
This would cause an out-of-bound read later when trying to find a
StartMask.
This change checks that all indices in the Mask are in the allowed range,
and fixes the out-of-bound accesses.
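A rough sketch of the kind of guard added (names and exact placement are
approximations, not the literal code from the patch):

  #include "llvm/ADT/STLExtras.h"
  #include "llvm/IR/Instructions.h"

  // Validate the mask before using its elements as indices; undef elements
  // come back as -1 from getShuffleMask().
  static bool maskIsInRange(const llvm::ShuffleVectorInst *SVI,
                            unsigned NumElts) {
    return llvm::all_of(SVI->getShuffleMask(), [NumElts](int Idx) {
      return Idx >= 0 && Idx < (int)NumElts;
    });
  }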
Differential Revision: https://reviews.llvm.org/D132634
On many AArch64 processors (Cortex A78, Neoverse N1/N2/V1, etc.), ADD with LSL shift (shift amount <= 4) has smaller latency and higher
throughput than ADD with a larger shift (shift amount > 4). This is at least a no-op for the rest of the processors.
Differential Revision: https://reviews.llvm.org/D135208
This is the AIX part of the update after https://reviews.llvm.org/D117225.
It fixes the issue that AIX64 with vector pairs enabled saw redundant
spills/reloads of callee-saved vector registers.
Based on original patch by: Kai Luo
Reviewed By: lkail
Differential Revision: https://reviews.llvm.org/D133466
Otherwise eliminateFrameIndex cannot figure out how to fixup the stack
offset with its stateless logic, because there wouldn't be an immediate
slot for it to trivially write to, and it may not be easy to transform
the surrounding code to make it work.
This fixes a fairly common crash when compiling moderately complex code with
Clang.
Differential Revision: https://reviews.llvm.org/D135251
This patch fixes the failure of llvm/test/CodeGen/Generic/vector.ll and
CodeGen/PowerPC/2007-11-19-VectorSplitting.ll for a LoongArch native build.
Differential Revision: https://reviews.llvm.org/D134798
There are intrinsics for most scalar instructions and almost all HVX
instructions. What's somewhat painful is that there are two intrinsics
for each HVX instruction: one for 64- and one for 128-byte mode.
Instead of checking the current codegen settings every time, this
function would simply return the right intrinsic.
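A sketch of the idea, using one HVX intrinsic pair that does exist
(llvm.hexagon.V6.vaddw and its .128B twin); the helper itself and its
signature are illustrative assumptions:

  #include "llvm/IR/Intrinsics.h"
  #include "llvm/IR/IntrinsicsHexagon.h"

  // Pick the 64-byte or 128-byte flavour of an HVX intrinsic once,
  // instead of re-checking the codegen settings at every use.
  static llvm::Intrinsic::ID getVAddWIntrinsic(bool Use128B) {
    return Use128B ? llvm::Intrinsic::hexagon_V6_vaddw_128B
                   : llvm::Intrinsic::hexagon_V6_vaddw;
  }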
Typically when you match something, you want to check the result.
Fix a couple warnings in the AMDGPUPostLegalizerCombiner which appear as a
result of this.
Differential Revision: https://reviews.llvm.org/D135491
Apparently StackColoring depends on SlotIndexes, but not
LiveIntervals. If regalloc fast were manually requested, LiveIntervals
would be dropped before SILowerSGPRSpills but not SlotIndexes.
SILowerSGPRSpills preserved SlotIndexes, but only through
LiveIntervals. As a result, SILowerSGPRSpills was incorrectly
reporting it preserved SlotIndexes. Start updating these directly,
instead of depending on LiveIntervals also being available.
1. `length(value/type)`: return the number of elements in the vector
input,
2. `getHvxTy(elem_type)`: return the HVX vector type with the element
type provided.
These will help write things more succinctly.
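Approximate shapes of the two helpers (a sketch only; details such as passing
the HVX register width in bytes as a parameter are assumptions):

  #include "llvm/IR/DerivedTypes.h"

  // Number of elements in a fixed-width vector type.
  static unsigned length(llvm::Type *Ty) {
    return llvm::cast<llvm::FixedVectorType>(Ty)->getNumElements();
  }

  // HVX vector type with the given element type, for a given HVX register
  // width in bytes (64 or 128).
  static llvm::VectorType *getHvxTy(llvm::Type *ElemTy, unsigned HwLenBytes) {
    unsigned NumElems = (HwLenBytes * 8) / ElemTy->getScalarSizeInBits();
    return llvm::FixedVectorType::get(ElemTy, NumElems);
  }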
EVT can be created for any Type, and so this function can now be used to
check if a given Type, as-is, is an HVX type (as opposed to a type that may
be subject to legalization to an HVX type).
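A sketch of how such a check could look (the helper name and the exact
subtarget query are assumptions; HexagonSubtarget.h is a target-local header):

  #include "llvm/CodeGen/ValueTypes.h"

  // EVT::getEVT works for any Type, so we can ask whether the type, exactly
  // as written, is already an HVX vector type.
  static bool isHvxTypeAsIs(llvm::Type *Ty, const llvm::HexagonSubtarget &HST) {
    llvm::EVT VT = llvm::EVT::getEVT(Ty, /*HandleUnknown=*/true);
    return VT.isSimple() && HST.isHVXVectorType(VT.getSimpleVT());
  }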
The scalar counterpart of this is `llvm.trunc`. However, the name
ISD::VP_TRUNC is already taken by `trunc` of the LLVM IR. Naming this
`vp.ftrunc` would likely cause confusion with `vp.fptrunc`. So we add
`vp.roundtozero`, which looks similar to `vp.roundeven`.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D135233
Now only DXILTranslateMetadata uses DXILResources, so DXILResourceWrapper is only used by DXILTranslateMetadata.
Once we add lowering for createHandle, DXILResourceWrapper will be used in more passes.
We can also add resource index allocation in DXILResourceWrapper.
Reviewed By: beanz
Differential Revision: https://reviews.llvm.org/D135190
I tend to think we should ignore the policy bit in vsetvli insertion
if the tied operand is IMPLICIT_DEF. But that raises questions about
what the policy operand on RVV intrinsics means if you also pass
vundefined().
This change at least fixes some cases. I'll post a separate patch
for vsetvli insertion for discussion.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D135386