It was only handled for FLAT initially because we did not have
unaligned DS instructions lowering. Now it is implemented but
the bug is not handled.
Differential Revision: https://reviews.llvm.org/D123338
Use new NotAtomic expansion to turn these into the equivalent
non-atomic operations. Independent lanes cannot access the private
memory of other lanes, so there's no possibility for synchronization.
These don't really appear directly in user code, but
InferAddressSpaces can make these appear after optimizations.
Fixes issues 54693 and 54274.
Summary:
Specifically, for trap handling, for targets that do not support getDoorbellID,
we load the queue_ptr from the implicit kernarg, and move queue_ptr to s[0:1].
To get aperture bases when targets do not have aperture registers, we load
private_base or shared_base directly from the implicit kernarg. In clang, we use
implicitarg_ptr + offsets to implement __builtin_amdgcn_workgroup_size_{xyz}.
Reviewers: arsenm, sameerds, yaxunl
Differential Revision: https://reviews.llvm.org/D120265
NFCI. The motivation for this is avoid problems in future if we add new
classes containing only a subset of all VGPRs, or a subset of all SGPRs.
getMinimalPhysRegClass would favour these smaller classes, which is not
what we want here.
Differential Revision: https://reviews.llvm.org/D121914
Summary:
In general, we need queue_ptr for aperture bases and trap handling,
and user SGPRs have to be set up to hold queue_ptr. In current implementation,
user SGPRs are set up unnecessarily for some cases. If the target has aperture
registers, queue_ptr is not needed to reference aperture bases. For trap
handling, if target suppots getDoorbellID, queue_ptr is also not necessary.
Futher, code object version 5 introduces new kernel ABI which passes queue_ptr
as an implicit kernel argument, so user SGPRs are no longer necessary for
queue_ptr. Based on the trap handling document:
https://llvm.org/docs/AMDGPUUsage.html#amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table,
llvm.debugtrap does not need queue_ptr, we remove queue_ptr suport for llvm.debugtrap
in the backend.
Reviewers: sameerds, arsenm
Fixes: SWDEV-307189
Differential Revision: https://reviews.llvm.org/D119762
Currently the return address ABI registers s[30:31], which fall in the call
clobbered register range, are added as a live-in on the function entry to
preserve its value when we have calls so that it gets saved and restored
around the calls.
But the DWARF unwind information (CFI) needs to track where the return address
resides in a frame and the above approach makes it difficult to track the
return address when the CFI information is emitted during the frame lowering,
due to the involvment of understanding the control flow.
This patch moves the return address ABI registers s[30:31] into callee saved
registers range and stops adding live-in for return address registers, so that
the CFI machinery will know where the return address resides when CSR
save/restore happen during the frame lowering.
And doing the above poses an issue that now the return instruction uses undefined
register `sgpr30_sgpr31`. This is resolved by hiding the return address register
use by the return instruction through the `SI_RETURN` pseudo instruction, which
doesn't take any input operands, until the `SI_RETURN` pseudo gets lowered to the
`S_SETPC_B64_return` during the `expandPostRAPseudo()`.
As an added benefit, this patch simplifies overall return instruction handling.
Note: The AMDGPU CFI changes are there only in the downstream code and another
version of this patch will be posted for review for the downstream code.
Reviewed By: arsenm, ronlieb
Differential Revision: https://reviews.llvm.org/D114652
Add a new llvm.fptrunc.round intrinsic to precisely control
the rounding mode when converting from f32 to f16.
Differential Revision: https://reviews.llvm.org/D110579
This allows us to set the noclobber flag on (the MMO of) a load
instruction instead of on the pointer. This fixes a bug where noclobber
was being applied to all loads from the same pointer, even if some of
them were clobbered.
Differential Revision: https://reviews.llvm.org/D118775
Summary:
Add code object v5 support (deafult is still v4)
Generate metadata for implicit kernel args for the new ABI
Set the metadata version to be 1.2
Reviewers:
t-tye, b-sumner, arsenm, and bcahoon
Fixes:
SWDEV-307188, SWDEV-307189
Differential Revision:
https://reviews.llvm.org/D118272
Currently not (xor_one_use) pattern is always selected to S_XNOR irrelative od the node divergence.
This relies on further custom selection pass which converts to VALU if necessary and replaces with V_NOT_B32 ( V_XOR_B32)
on those targets which have no V_XNOR.
Current change enables the patterns which explicitly select the not (xor_one_use) to appropriate form.
We assume that xor (not) is already turned into the not (xor) by the combiner.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D116270
As the codegen fix in D111754, the LOD bias needs to be converted to 16
bits. Fix this in the combine.
Differential Revision: https://reviews.llvm.org/D116038
If we had one of the shader calling conventions calling a default
calling convention callee, this would crash when the caller did not
have anything to pass to the workitem ID.
This is illegal, but we still need to produce something
sensible. llvm-reduce likes to replace calls to intrinsics with calls
to null or undef, so this does appear and is helpful to avoid hard
erroring.
Pass undef in this case, as already happened for the other implicit
arguments. It might make sense to define the behavior here and pass
null for the pointers, and -1 for the workitem ID. We do have extra
bits in the workitem ID, so this wouldn't conflict with a valid value.
I believe this is unexploitable because in either case the result
will be 'couldn't allocate register for constraint' error message,
but error code checking is clearly wrong.
Differential Revision: https://reviews.llvm.org/D117189
Shockingly we weren't doing this already. We should probably have this
be done earlier in the IR too, but it's still helpful to have the
lowering guarantee it so that we can modify the ABI implicit inputs
based on it.
If we know we we aren't using a component from the kernel, we can save
a few bit packing instructions.
We're still enabling the VGPR input to the kernel though.
We are always failing parsing of the physreg constraint because
we do not drop trailing brace, thus getAsInteger() returns a
non-empty string and we delegate reparsing to the TargetLowering.
In addition it did not parse register tuples.
Fixed which has allowed to remove w/a in two places we call it.
Differential Revision: https://reviews.llvm.org/D117055
After the split register allocation changes in eebe841a47 it is no
longer necessary to reserve a VGPR before RA. This can also create bugs
when IPRA is enabled since we cannot predict that a called function may
not reserve any register if it does not have any SGPR spills. If that
happens those functions may override reserved registers that are
normally callee saved. Added a test to show this.
Fixes: SWDEV-309900
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D115551
This reverts commit 640beb38e7.
That commit caused performance degradtion in Quicksilver test QS:sGPU and a functional test failure in (rocPRIM rocprim.device_segmented_radix_sort).
Reverting until we have a better solution to s_cselect_b64 codegen cleanup
Change-Id: Ifc167b3c2dae7a65920676f22a97ba76485f3456
Reviewed By: kzhuravl
Differential Revision: https://reviews.llvm.org/D116686
Change-Id: I1abf49b74a7e2ba0e0205f747a4154a468b9d7f2
If we know the source is a valid object, we do not need to insert a
null check. This misses a lot of opportunities from
metadata/attributes not tracked in codegen.
This reverts commit 640beb38e7.
That commit caused performance degradtion in Quicksilver test QS:sGPU and a functional test failure in (rocPRIM rocprim.device_segmented_radix_sort).
Reverting until we have a better solution to s_cselect_b64 codegen cleanup
Change-Id: Ibf8e397df94001f248fba609f072088a46abae08
Reviewed By: kzhuravl
Differential Revision: https://reviews.llvm.org/D115960
Change-Id: Id169459ce4dfffa857d5645a0af50b0063ce1105
Currently the return address ABI registers s[30:31], which fall in the call
clobbered register range, are added as a live-in on the function entry to
preserve its value when we have calls so that it gets saved and restored
around the calls.
But the DWARF unwind information (CFI) needs to track where the return address
resides in a frame and the above approach makes it difficult to track the
return address when the CFI information is emitted during the frame lowering,
due to the involvment of understanding the control flow.
This patch moves the return address ABI registers s[30:31] into callee saved
registers range and stops adding live-in for return address registers, so that
the CFI machinery will know where the return address resides when CSR
save/restore happen during the frame lowering.
And doing the above poses an issue that now the return instruction uses undefined
register `sgpr30_sgpr31`. This is resolved by hiding the return address register
use by the return instruction through the `SI_RETURN` pseudo instruction, which
doesn't take any input operands, until the `SI_RETURN` pseudo gets lowered to the
`S_SETPC_B64_return` during the `expandPostRAPseudo()`.
As an added benefit, this patch simplifies overall return instruction handling.
Note: The AMDGPU CFI changes are there only in the downstream code and another
version of this patch will be posted for review for the downstream code.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D114652
The LOD bias operand is of type 'half' when the A16-bit is ON' for MIMG instructions.
'bias' is only 16-bit but occupies 32-bits with upper 16-bits containing junk.
The patch fixes both the paths(ISelDAG and GlobalISel) for proper encoding of LOD bias operand.
Differential Revision: https://reviews.llvm.org/D111754
Fixed ABI arguments are compute specific and should not be added to
graphics shaders or functions, so do not try to add them.
Differential Revision: https://reviews.llvm.org/D115344
The ray_origin, ray_dir and ray_inv_dir arguments should all be vec3 to
match how the hardware instruction works.
Don't change the API of the corresponding OpenCL builtins.
Differential Revision: https://reviews.llvm.org/D115032
Recognize constant splat padded with undef in isCanonicalized.
Fcanonicalize will be removed by RemoveFcanonicalize in post-legalizer
combiner. We will treat undef as value that will result in a splat
in clamp combine after regbankselect.
Differential Revision: https://reviews.llvm.org/D104408
1. Fixed vector instructions costs estimations incosistency - removed different
logic for "not simple types" since it biases costs for these types.
2. Fixed legalization penalty for vectors too big for the target: changed from
overwrite default legalization cost value estimation to added penalty.
3. Fixed few typos in tests.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D114893
The combined vector register classes with both
VGPRs and AGPRs are currently unallocatable.
This patch turns them into allocatable as a
prerequisite to enable copy between VGPR and
AGPR registers during regalloc.
Also, added the missing AV register classes from
192b to 1024b.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D109300
Select SelectionDAG ops smul_lohi/umul_lohi to
v_mad_i64_i32/v_mad_u64_u32 respectively, with an addend of 0.
v_mul_lo, v_mul_hi and v_mad_i64/u64 are all quarter-rate instructions
so it is better to use one instruction than two.
Further improvements are possible to make better use of the addend
operand, but this is already a strict improvement over what we have
now.
Differential Revision: https://reviews.llvm.org/D113986
This patch changes the AMDGPU_Gfx calling convention. It defines the SGPR registers s[4:29] as callee-save and leaves some SGPRs usable for callers. The intention is to avoid unneccessary s_mov instructions for arguments the caller would otherwise save and restore in these registers.
Reviewed By: sebastian-ne
Differential Revision: https://reviews.llvm.org/D111637
Change numBitsSigned to return the minimum size of a signed integer that
can hold the value. This is different by one from the previous result
but is more consistent with numBitsUnsigned. Update all callers. All
callers are now more consistent between the signed and unsigned cases,
and some callers get simpler, especially the ones that deal with
quantities like numBitsSigned(LHS) + numBitsSigned(RHS).
Differential Revision: https://reviews.llvm.org/D112813
Shift node is still needed to check if the shift is shr or shl to increment/decrement offset. Do not override the node.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D112733
The combine asserted if constants could not be represented as uint64_t.
Use APInts to fix this.
Differential Revision: https://reviews.llvm.org/D112416
This is used to fix wrong code generation of s_add_co_select_user in
test/CodeGen/AMDGPU/expand-scalar-carry-out-select-user.ll
s_addc_u32 s4, s6, 0
s_cselect_b64 vcc, 1, 0 <-- vcc set as 0x1 if SCC==1
v_mov_b32_e32 v1, s4
s_cmp_gt_u32 s6, 31
v_cndmask_b32_e32 v1, 0, v1, vcc
If the s_addc_u32 set SCC, then we will get value 0x1 in VCC.
The v_cndmask will do per thread selection with VCC as condition
register. As VCC only gets the first bit being set, only the first
thread/lane in destination register can get correct result if the
very first lane is active. In fact, we should broadcast the value to all
active lanes of the final register.
The idea here is doing this broadcast to vector boolean explicitly
instead of lowering it into a COPY from SCC which would be interpreted as
selecting between 0/1.
This is used to replace D109754.
Reviewed-by: foad, alex-t
Differential Revision: https://reviews.llvm.org/D109889
The attributor can determine that some indirect calls do not require
special inputs. The special inputs will still be present in the ABI,
so we need to allocate the registers and pass undefs.
Soft deprecrate isNullValue/isAllOnesValue and update in tree
callers. This matches the changes to the APInt interface from
D109483.
Reviewed By: lattner
Differential Revision: https://reviews.llvm.org/D109535
Produce remarks when atomic instructions are expanded into hardware instructions
in SIISelLowering.cpp. Currently, these remarks are only emitted for atomic fadd
instructions.
Differential Revision: https://reviews.llvm.org/D108150
Suffix opcodes with _gfx10.
Remove direct references to architecture specific opcodes.
Add a BVH flag and apply this to diassembly.
Fix a number of disassembly errors on gfx90a target caused by
previous incorrect BVH detection code.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D108117
These instructions have an implicit use of vcc which counts towards the
constant bus limit. Pre gfx10 this means that the explicit operands
cannot be sgprs. Use the custom inserter hook to call legalizeOperands
to enforce that restriction.
Fixes https://bugs.llvm.org/show_bug.cgi?id=51217
Differential Revision: https://reviews.llvm.org/D106868
Enable custom insert_subvector for larger vector types.
This is necessary now that SelectionDAG can attempt v3f64 insert
to v4f64, etc.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D105385
Codegen for the raw/struct buffer access intrinsics would update the
offset in the MMO to reflect the combined offset, if it was known to be
constant. If the combined offset was not known to be constant, or if
there was an index, it would set the offset in the MMO to 0. This is
unsafe because it makes it look like the access does not alias with
another access with a fixed non-zero offset.
Fix these cases by setting the pointer in the MMO to null, to reflect
the fact that we do not have any known IR value pointer + constant
offset for the access.
Differential Revision: https://reviews.llvm.org/D106284
Add maximum NSA size limit as an ISA feature.
Use this to reduce NSA usage on GFX10.1 to avoid stability issues
with 4 and 5 dwords NSA instructions.
Maintain use of longer NSA instructions on GFX10.3.
Note: this also contains some minor fixes for GlobalISel which
did not work correctly with non-NSA form instructions on GFX10.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D103348
Allow MIMG instructions to be selected with 6/7 VGPRs for vaddr.
Previously these were rounded up to VReg_256 this saves VGPRs.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D103800
Rename getBufferOffsetForMMO to updateBufferMMO and pass in the MMO to
be updated, in preparation for the bug fix in D106284.
Call updateBufferMMO consistently for all buffer intrinsics, even the
ones that use setBufferOffsets to decompose a combined offset
expression.
Add a getIdxEn helper function.
Differential Revision: https://reviews.llvm.org/D106354
Add SReg_224, VReg_224, AReg_224, etc.
Link 224-bit types with v7i32/v7f32.
Link existing 192-bit types to newly added v3i64/v3f64/v6i32/v6f32.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D104622
Avoid having to round up to v8f32/VReg_256 when only 5 VGPRs are
required for a MIMG address operand.
Maintain _V8 instruction variants of pseudo instructions allowing
assembly prior to GFX10 to work as-is. Currently the validator
can tell for GFX10 what the correct size is, so will disallow
oversize address registers.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D103672
When flat scratch is used, the stack pointer needs to be added when
writing arguments to the stack.
For buffer instructions, this is done in SelectMUBUFScratchOffen
and SelectMUBUFScratchOffset.
Move that to call argument lowering, like it is done in GlobalISel.
Differential Revision: https://reviews.llvm.org/D103166
For gfx10 gradient (g16) and address (a16) can be independent. Previous
implementation assumed that a16 implied g16.
There are some other changes that fix the verification (as well as asm/disasm)
that are required for the included test to pass - the XFAIL will be removed in
those changes.
This also includes required fixes for GlobalISel
Differential Revision: https://reviews.llvm.org/D102066
Change-Id: I7d171cc90994de05f41669b66a6d0ffa2ed05d09
Previously we were allowing to use FP atomics without
-amdgpu-unsafe-fp-atomics option if a scope is less then
system. This is not safe just as well if we have UC memory.
This change only allows global and flat FP atomics with
the unsafe option. Consequentially that makes a check for
denorm mode redundant since we skip it with the unsafe
option and do not have a way to produce these instructions
without it anyway.
Differential Revision: https://reviews.llvm.org/D102347
getVectorNumElements() returns a value for scalable vectors
without any warning so it is effectively getVectorMinNumElements().
By renaming it and making getVectorNumElements() forward to
it, we can insert a check for scalable vectors into getVectorNumElements()
similar to EVT. I didn't do that in this patch because there are still more
fixes needed, but I was able to temporarily do it and passed the RISCV
lit tests with these changes.
The changes to isPow2VectorType and getPow2VectorType are copied from EVT.
The change to TypeInfer::EnforceSameNumElts reduces the size of AArch64's isel table.
We're now considering SameNumElts to require the scalable property to match which
removes some unneeded type checks.
This was motivated by the bug I fixed yesterday in 80b9510806
Reviewed By: frasercrmck, sdesmalen
Differential Revision: https://reviews.llvm.org/D102262
Improve the code generation of build_vector.
Use the v_pack_b32_f16 instruction instead of
v_and_b32 + v_lshl_or_b32
Differential Revision: https://reviews.llvm.org/D98081
Patch by Julien Pagès!
Improve the code generation of fp_to_sint
and fp_to_uint for integer on 16-bits.
Differential Revision: https://reviews.llvm.org/D101481
Patch by Julien Pagès!
Add basic version of isCanonicalized for global-isel. Copied from sdag.
Add post legalizer combine that deletes G_FCANONICALIZE when its input
is already Canonicalized.
Differential Revision: https://reviews.llvm.org/D96605