Commit Graph

1075 Commits

Author SHA1 Message Date
Stanislav Mekhanoshin 16cf9e6dad [AMDGPU] Fix handling of gfx10 LDS misaligned access bug
It was only handled for FLAT initially because we did not have
unaligned DS instructions lowering. Now it is implemented but
the bug is not handled.

Differential Revision: https://reviews.llvm.org/D123338
2022-04-07 15:08:29 -07:00
Matt Arsenault e6012c8e0f AMDGPU: Handle private atomics
Use new NotAtomic expansion to turn these into the equivalent
non-atomic operations. Independent lanes cannot access the private
memory of other lanes, so there's no possibility for synchronization.

These don't really appear directly in user code, but
InferAddressSpaces can make these appear after optimizations.

Fixes issues 54693 and 54274.
2022-04-06 22:47:19 -04:00
Stanislav Mekhanoshin a41a676e8a [AMDGPU] Check SI LDS offset bug in the allowsMisalignedMemoryAccesses
Differential Revision: https://reviews.llvm.org/D123268
2022-04-06 18:05:02 -07:00
Shao-Ce SUN 662b9fa02c [NFC][CodeGen] Add a setTargetDAGCombine use ArrayRef
Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D122557
2022-03-29 09:53:24 +08:00
Stanislav Mekhanoshin 6e3e14f600 [AMDGPU] Support gfx940 smfmac instructions
Differential Revision: https://reviews.llvm.org/D122191
2022-03-24 12:40:42 -07:00
Jon Chesterfield bcbd4cf1f2 Revert "[amdgpu][nfc] Pass function instead of module to allocateModuleLDSGlobal"
Reconsidered, better to handle per-function state in the constructor as before.
This reverts commit 98e474c1b3.
2022-03-20 00:58:26 +00:00
Jon Chesterfield 98e474c1b3 [amdgpu][nfc] Pass function instead of module to allocateModuleLDSGlobal 2022-03-19 16:42:17 +00:00
Changpeng Fang dd5895cc39 AMDGPU: Use the implicit kernargs for code object version 5
Summary:
  Specifically, for trap handling, for targets that do not support getDoorbellID,
we load the queue_ptr from the implicit kernarg, and move queue_ptr to s[0:1].
To get aperture bases when targets do not have aperture registers, we load
private_base or shared_base directly from the implicit kernarg. In clang, we use
implicitarg_ptr + offsets to implement __builtin_amdgcn_workgroup_size_{xyz}.

Reviewers: arsenm, sameerds, yaxunl

Differential Revision: https://reviews.llvm.org/D120265
2022-03-17 14:12:36 -07:00
Jay Foad 313f306b26 [AMDGPU] Stop using getMinimalPhysRegClass in LowerFormalArguments
NFCI. The motivation for this is avoid problems in future if we add new
classes containing only a subset of all VGPRs, or a subset of all SGPRs.
getMinimalPhysRegClass would favour these smaller classes, which is not
what we want here.

Differential Revision: https://reviews.llvm.org/D121914
2022-03-17 15:19:17 +00:00
Shengchen Kan 37b378386e [NFC][CodeGen] Rename some functions in MachineInstr.h and remove duplicated comments 2022-03-16 20:25:42 +08:00
serge-sans-paille 989f1c72e0 Cleanup codegen includes
This is a (fixed) recommit of https://reviews.llvm.org/D121169

after:  1061034926
before: 1063332844

Discourse thread: https://discourse.llvm.org/t/include-what-you-use-include-cleanup
Differential Revision: https://reviews.llvm.org/D121681
2022-03-16 08:43:00 +01:00
Stanislav Mekhanoshin 1f53f20fc1 [AMDGPU] Support gfx940 v_lshl_add_u64 instruction
Differential Revision: https://reviews.llvm.org/D121401
2022-03-14 15:45:42 -07:00
Nico Weber a278250b0f Revert "Cleanup codegen includes"
This reverts commit 7f230feeea.
Breaks CodeGenCUDA/link-device-bitcode.cu in check-clang,
and many LLVM tests, see comments on https://reviews.llvm.org/D121169
2022-03-10 07:59:22 -05:00
serge-sans-paille 7f230feeea Cleanup codegen includes
after:  1061034926
before: 1063332844

Differential Revision: https://reviews.llvm.org/D121169
2022-03-10 10:00:30 +01:00
Changpeng Fang 0f20a35b9e AMDGPU: Set up User SGPRs for queue_ptr only when necessary
Summary:
  In general, we need queue_ptr for aperture bases and trap handling,
and user SGPRs have to be set up to hold queue_ptr. In current implementation,
user SGPRs are set up unnecessarily for some cases. If the target has aperture
registers, queue_ptr is not needed to reference aperture bases. For trap
handling, if target suppots getDoorbellID, queue_ptr is also not necessary.
Futher, code object version 5 introduces new kernel ABI which passes queue_ptr
as an implicit kernel argument, so user SGPRs are no longer necessary for
queue_ptr. Based on the trap handling document:
https://llvm.org/docs/AMDGPUUsage.html#amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table,
llvm.debugtrap does not need queue_ptr, we remove queue_ptr suport for llvm.debugtrap
in the backend.

Reviewers: sameerds, arsenm

Fixes: SWDEV-307189

Differential Revision: https://reviews.llvm.org/D119762
2022-03-09 10:14:05 -08:00
Venkata Ramanaiah Nalamothu 04fff547e2 [AMDGPU] Move call clobbered return address registers s[30:31] to callee saved range
Currently the return address ABI registers s[30:31], which fall in the call
clobbered register range, are added as a live-in on the function entry to
preserve its value when we have calls so that it gets saved and restored
around the calls.

But the DWARF unwind information (CFI) needs to track where the return address
resides in a frame and the above approach makes it difficult to track the
return address when the CFI information is emitted during the frame lowering,
due to the involvment of understanding the control flow.

This patch moves the return address ABI registers s[30:31] into callee saved
registers range and stops adding live-in for return address registers, so that
the CFI machinery will know where the return address resides when CSR
save/restore happen during the frame lowering.

And doing the above poses an issue that now the return instruction uses undefined
register `sgpr30_sgpr31`. This is resolved by hiding the return address register
use by the return instruction through the `SI_RETURN` pseudo instruction, which
doesn't take any input operands, until the `SI_RETURN` pseudo gets lowered to the
`S_SETPC_B64_return` during the `expandPostRAPseudo()`.

As an added benefit, this patch simplifies overall return instruction handling.

Note: The AMDGPU CFI changes are there only in the downstream code and another
version of this patch will be posted for review for the downstream code.

Reviewed By: arsenm, ronlieb

Differential Revision: https://reviews.llvm.org/D114652
2022-03-09 12:18:02 +05:30
Stanislav Mekhanoshin 932f628121 [AMDGPU] new gfx940 fp atomics
Differential Revision: https://reviews.llvm.org/D121028
2022-03-07 12:32:02 -08:00
Sebastian Neubauer 6527b2a4d5 [AMDGPU][NFC] Fix typos
Fix some typos in the amdgpu backend.

Differential Revision: https://reviews.llvm.org/D119235
2022-02-18 15:05:21 +01:00
Jay Foad f72d8897ac [AMDGPU] Honor !invariant.load metadata on load-like intrinsics
Differential Revision: https://reviews.llvm.org/D119739
2022-02-15 09:16:57 +00:00
Julien Pages dcb2da13f1 [AMDGPU] Add a new intrinsic to control fp_trunc rounding mode
Add a new llvm.fptrunc.round intrinsic to precisely control
the rounding mode when converting from f32 to f16.

Differential Revision: https://reviews.llvm.org/D110579
2022-02-11 12:08:23 -05:00
Jay Foad ddd3807e69 [AMDGPU] Use new target MMO flag MONoClobber
This allows us to set the noclobber flag on (the MMO of) a load
instruction instead of on the pointer. This fixes a bug where noclobber
was being applied to all loads from the same pointer, even if some of
them were clobbered.

Differential Revision: https://reviews.llvm.org/D118775
2022-02-02 17:12:36 +00:00
Changpeng Fang 1194b9cdda AMDGPU {NFC}: Add code object v5 support and generate metadata for implicit kernel args
Summary:
  Add code object v5 support (deafult is still v4)
  Generate metadata for implicit kernel args for the new ABI
  Set the metadata version to be 1.2

Reviewers:
  t-tye, b-sumner, arsenm, and bcahoon

Fixes:
  SWDEV-307188, SWDEV-307189

Differential Revision:
  https://reviews.llvm.org/D118272
2022-01-31 18:07:47 -08:00
Matt Arsenault 33b45ee44b AMDGPU: Handle addrspacecast of constant 32-bit to flat
I accidentally made this work on the GlobalISel path, and there's no
real reason not to handle this.
2022-01-27 11:01:44 -05:00
alex-t 5157f984ae [AMDGPU] Enable divergence-driven XNOR selection
Currently not (xor_one_use) pattern is always selected to S_XNOR irrelative od the node divergence.
This relies on further custom selection pass which converts to VALU if necessary and replaces with V_NOT_B32 ( V_XOR_B32)
on those targets which have no V_XNOR.
Current change enables the patterns which explicitly select the not (xor_one_use) to appropriate form.
We assume that xor (not) is already turned into the not (xor) by the combiner.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D116270
2022-01-26 15:33:10 +03:00
Stanislav Mekhanoshin bb1fe36977 [AMDGPU] Make v8i16/v8f16 legal
Differential Revision: https://reviews.llvm.org/D117721
2022-01-24 11:51:08 -08:00
Sebastian Neubauer ae2f9c8be8 [AMDGPU] Remove lz and nomip combine from codegen
These combines have been moved into the IR combiner in D116042.

Differential Revision: https://reviews.llvm.org/D116116
2022-01-21 12:09:08 +01:00
Sebastian Neubauer 0530fdbbbb [AMDGPU] Fix LOD bias in A16 combine
As the codegen fix in D111754, the LOD bias needs to be converted to 16
bits. Fix this in the combine.

Differential Revision: https://reviews.llvm.org/D116038
2022-01-21 12:09:06 +01:00
Matt Arsenault 82de129ab8 AMDGPU: Remove llvm.amdgcn.alignbit and handle bitcode upgrade to fshr 2022-01-18 14:08:36 -05:00
Matt Arsenault dc2457c8cf AMDGPU: Fix crashing on calls to C functions from graphics contexts
If we had one of the shader calling conventions calling a default
calling convention callee, this would crash when the caller did not
have anything to pass to the workitem ID.

This is illegal, but we still need to produce something
sensible. llvm-reduce likes to replace calls to intrinsics with calls
to null or undef, so this does appear and is helpful to avoid hard
erroring.

Pass undef in this case, as already happened for the other implicit
arguments. It might make sense to define the behavior here and pass
null for the pointers, and -1 for the workitem ID. We do have extra
bits in the workitem ID, so this wouldn't conflict with a valid value.
2022-01-17 10:13:25 -05:00
Stanislav Mekhanoshin fc6af7e188 [AMDGPU] Fix error handling in asm constraint syntax
I believe this is unexploitable because in either case the result
will be 'couldn't allocate register for constraint' error message,
but error code checking is clearly wrong.

Differential Revision: https://reviews.llvm.org/D117189
2022-01-13 09:33:50 -08:00
Matt Arsenault 59994c25f9 AMDGPU: Select workitem ID intrinsics to 0 with req_work_group_size
Shockingly we weren't doing this already. We should probably have this
be done earlier in the IR too, but it's still helpful to have the
lowering guarantee it so that we can modify the ABI implicit inputs
based on it.
2022-01-13 12:08:18 -05:00
Matt Arsenault a6f49423c1 AMDGPU: Optimize outgoing workitem ID based on reqd_work_group_size
If we know we we aren't using a component from the kernel, we can save
a few bit packing instructions.

We're still enabling the VGPR input to the kernel though.
2022-01-13 12:08:18 -05:00
Stanislav Mekhanoshin d043822daa [AMDGPU] Fixed physreg asm constraint parsing
We are always failing parsing of the physreg constraint because
we do not drop trailing brace, thus getAsInteger() returns a
non-empty string and we delegate reparsing to the TargetLowering.

In addition it did not parse register tuples.

Fixed which has allowed to remove w/a in two places we call it.

Differential Revision: https://reviews.llvm.org/D117055
2022-01-12 16:37:08 -08:00
Austin Kerbow 8470bf2b08 [AMDGPU] Do not reserve any VGPR for SGPR spills
After the split register allocation changes in eebe841a47 it is no
longer necessary to reserve a VGPR before RA. This can also create bugs
when IPRA is enabled since we cannot predict that a called function may
not reserve any register if it does not have any SGPR spills. If that
happens those functions may override reserved registers that are
normally callee saved. Added a test to show this.

Fixes: SWDEV-309900

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D115551
2022-01-11 22:14:59 -08:00
David Salinas c0581f7df6 Revert D109159 : Revert "[amdgpu] Enable selection of `s_cselect_b64`."
This reverts commit 640beb38e7.

That commit caused performance degradtion in Quicksilver test QS:sGPU and a functional test failure in (rocPRIM rocprim.device_segmented_radix_sort).
Reverting until we have a better solution to s_cselect_b64 codegen cleanup

Change-Id: Ifc167b3c2dae7a65920676f22a97ba76485f3456

Reviewed By: kzhuravl

Differential Revision: https://reviews.llvm.org/D116686

Change-Id: I1abf49b74a7e2ba0e0205f747a4154a468b9d7f2
2022-01-11 21:14:09 +00:00
Matt Arsenault 68468bbe15 AMDGPU: Avoid null check during addrspacecast lowering
If we know the source is a valid object, we do not need to insert a
null check. This misses a lot of opportunities from
metadata/attributes not tracked in codegen.
2022-01-10 13:27:39 -05:00
Kazu Hirata 2aed08131d [llvm] Use true/false instead of 1/0 (NFC)
Identified with modernize-use-bool-literals.
2022-01-07 00:39:14 -08:00
Nico Weber 085f078307 Revert "Revert D109159 "[amdgpu] Enable selection of `s_cselect_b64`.""
This reverts commit 859ebca744.
The change contained many unrelated changes and e.g. restored
unit test failes for the old lld port.
2022-01-05 13:10:25 -05:00
David Salinas 859ebca744 Revert D109159 "[amdgpu] Enable selection of `s_cselect_b64`."
This reverts commit 640beb38e7.

That commit caused performance degradtion in Quicksilver test QS:sGPU and a functional test failure in (rocPRIM rocprim.device_segmented_radix_sort).
Reverting until we have a better solution to s_cselect_b64 codegen cleanup

Change-Id: Ibf8e397df94001f248fba609f072088a46abae08

Reviewed By: kzhuravl

Differential Revision: https://reviews.llvm.org/D115960

Change-Id: Id169459ce4dfffa857d5645a0af50b0063ce1105
2022-01-05 17:57:32 +00:00
Ron Lieberman 09b53296cf Revert "[AMDGPU] Move call clobbered return address registers s[30:31] to callee saved range"
This reverts commit 9075009d1f.

 Failed amdgpu runtime buildbot # 3514
2021-12-22 11:39:28 -05:00
RamNalamothu 9075009d1f [AMDGPU] Move call clobbered return address registers s[30:31] to callee saved range
Currently the return address ABI registers s[30:31], which fall in the call
clobbered register range, are added as a live-in on the function entry to
preserve its value when we have calls so that it gets saved and restored
around the calls.

But the DWARF unwind information (CFI) needs to track where the return address
resides in a frame and the above approach makes it difficult to track the
return address when the CFI information is emitted during the frame lowering,
due to the involvment of understanding the control flow.

This patch moves the return address ABI registers s[30:31] into callee saved
registers range and stops adding live-in for return address registers, so that
the CFI machinery will know where the return address resides when CSR
save/restore happen during the frame lowering.

And doing the above poses an issue that now the return instruction uses undefined
register `sgpr30_sgpr31`. This is resolved by hiding the return address register
use by the return instruction through the `SI_RETURN` pseudo instruction, which
doesn't take any input operands, until the `SI_RETURN` pseudo gets lowered to the
`S_SETPC_B64_return` during the `expandPostRAPseudo()`.

As an added benefit, this patch simplifies overall return instruction handling.

Note: The AMDGPU CFI changes are there only in the downstream code and another
version of this patch will be posted for review for the downstream code.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D114652
2021-12-22 20:51:12 +05:30
rkorsa c680fb69d6 [AMDGPU] Fixes in ISelDAG path and GlobalISel path for 'bias' operand with A16 bit on
The LOD bias operand is of type 'half' when the A16-bit is ON' for MIMG instructions.
'bias' is only 16-bit but occupies 32-bits with upper 16-bits containing junk.
The patch fixes both the paths(ISelDAG and GlobalISel) for proper encoding of LOD bias operand.

Differential Revision: https://reviews.llvm.org/D111754
2021-12-17 16:11:51 +05:30
Neubauer, Sebastian 26924b57e8 [AMDGPU] Ignore special ABI registers for graphics
Fixed ABI arguments are compute specific and should not be added to
graphics shaders or functions, so do not try to add them.

Differential Revision: https://reviews.llvm.org/D115344
2021-12-13 16:44:37 +01:00
Matt Arsenault 06b90175e7 AMDGPU: Remove fixed function ABI option 2021-12-10 19:41:19 -05:00
Michael Liao 53fc971a4b Fix `-Wunused-variable` warning. NFC. 2021-12-04 23:34:55 -05:00
Jay Foad 2774bad112 [AMDGPU] Change llvm.amdgcn.image.bvh.intersect.ray to take vec3 args
The ray_origin, ray_dir and ray_inv_dir arguments should all be vec3 to
match how the hardware instruction works.

Don't change the API of the corresponding OpenCL builtins.

Differential Revision: https://reviews.llvm.org/D115032
2021-12-04 10:32:11 +00:00
Petar Avramovic ab01f4d264 AMDGPU/GlobalISel: Do not fcanonicalize const splat padded with undef
Recognize constant splat padded with undef in isCanonicalized.
Fcanonicalize will be removed by RemoveFcanonicalize in post-legalizer
combiner. We will treat undef as value that will result in a splat
in clamp combine after regbankselect.

Differential Revision: https://reviews.llvm.org/D104408
2021-12-03 12:49:38 +01:00
Daniil Fukalov ab05ab59a7 [CostModel][AMDGPU] Fix instructions costs estimation for vector types.
1. Fixed vector instructions costs estimations incosistency - removed different
   logic for "not simple types" since it biases costs for these types.
2. Fixed legalization penalty for vectors too big for the target: changed from
   overwrite default legalization cost value estimation to added penalty.
3. Fixed few typos in tests.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D114893
2021-12-03 03:08:08 +03:00
Mirko Brkusanin 8951136216 [AMDGPU][GlobalISel] Transform (fadd (fpext (fmul x, y)), z) -> (fma (fpext x), (fpext y), z)
Patch by: Mateja Marjanovic

Differential Revision: https://reviews.llvm.org/D97937
2021-11-29 16:27:21 +01:00
Mirko Brkusanin 881840fc26 [AMDGPU][GlobalISel] Transform (fadd (fmul x, y), z) -> (fma x, y, z)
Patch by: Mateja Marjanovic

Differential Revision: https://reviews.llvm.org/D93305
2021-11-29 16:27:21 +01:00
Kazu Hirata 387927bbaf [Target] Use range-based for loops (NFC) 2021-11-26 21:21:17 -08:00
Christudasan Devadasan 654c89d85a [AMDGPU] Make vector superclasses allocatable
The combined vector register classes with both
VGPRs and AGPRs are currently unallocatable.
This patch turns them into allocatable as a
prerequisite to enable copy between VGPR and
AGPR registers during regalloc.

Also, added the missing AV register classes from
192b to 1024b.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D109300
2021-11-26 00:42:12 -05:00
Jay Foad d7e03df719 [AMDGPU] Implement widening multiplies with v_mad_i64_i32/v_mad_u64_u32
Select SelectionDAG ops smul_lohi/umul_lohi to
v_mad_i64_i32/v_mad_u64_u32 respectively, with an addend of 0.
v_mul_lo, v_mul_hi and v_mad_i64/u64 are all quarter-rate instructions
so it is better to use one instruction than two.

Further improvements are possible to make better use of the addend
operand, but this is already a strict improvement over what we have
now.

Differential Revision: https://reviews.llvm.org/D113986
2021-11-24 11:25:02 +00:00
kpyzhov c9690092c8 [AMDGPU] Small correction in SITargetLowering::performOrCombine().
Differential Revision: https://reviews.llvm.org/D113203
2021-11-10 21:07:27 -05:00
Thomas Symalla 76cbe62262 [AMDGPU] Changes the AMDGPU_Gfx calling convention by making the SGPRs 4..29 callee-save. This is to avoid superfluous s_movs when executing amdgpu_gfx function calls as the callee is likely not going to change the argument values.
This patch changes the AMDGPU_Gfx calling convention. It defines the SGPR registers s[4:29] as callee-save and leaves some SGPRs usable for callers. The intention is to avoid unneccessary s_mov instructions for arguments the caller would otherwise save and restore in these registers.

Reviewed By: sebastian-ne

Differential Revision: https://reviews.llvm.org/D111637
2021-11-04 21:50:18 +01:00
Jay Foad 21a1d4cf71 [AMDGPU] Change numBitsSigned for simplicity and document it. NFC.
Change numBitsSigned to return the minimum size of a signed integer that
can hold the value. This is different by one from the previous result
but is more consistent with numBitsUnsigned. Update all callers. All
callers are now more consistent between the signed and unsigned cases,
and some callers get simpler, especially the ones that deal with
quantities like numBitsSigned(LHS) + numBitsSigned(RHS).

Differential Revision: https://reviews.llvm.org/D112813
2021-10-29 14:22:06 +01:00
Vang Thao 52b43d1549 [AMDGPU] Fix cvt_f32_ubyte combine with shl
Shift node is still needed to check if the shift is shr or shl to increment/decrement offset. Do not override the node.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D112733
2021-10-28 21:43:06 -07:00
Neubauer, Sebastian 487f15603e [AMDGPU] Fix setcc combine for i128
The combine asserted if constants could not be represented as uint64_t.
Use APInts to fix this.

Differential Revision: https://reviews.llvm.org/D112416
2021-10-26 13:39:50 +02:00
Ruiling Song 52785989e9 AMDGPU: Broadcast scalar boolean to vector boolean explicitly
This is used to fix wrong code generation of s_add_co_select_user in
test/CodeGen/AMDGPU/expand-scalar-carry-out-select-user.ll

  s_addc_u32 s4, s6, 0
  s_cselect_b64 vcc, 1, 0    <-- vcc set as 0x1 if SCC==1
  v_mov_b32_e32 v1, s4
  s_cmp_gt_u32 s6, 31
  v_cndmask_b32_e32 v1, 0, v1, vcc

If the s_addc_u32 set SCC, then we will get value 0x1 in VCC.
The v_cndmask will do per thread selection with VCC as condition
register. As VCC only gets the first bit being set, only the first
thread/lane in destination register can get correct result if the
very first lane is active. In fact, we should broadcast the value to all
active lanes of the final register.

The idea here is doing this broadcast to vector boolean explicitly
instead of lowering it into a COPY from SCC which would be interpreted as
selecting between 0/1.

This is used to replace D109754.

Reviewed-by: foad, alex-t

Differential Revision: https://reviews.llvm.org/D109889
2021-09-30 10:15:01 +08:00
Matt Arsenault c305513cc2 AMDGPU: Fix assert with indirect call with known required inputs
The attributor can determine that some indirect calls do not require
special inputs. The special inputs will still be present in the ABI,
so we need to allocate the registers and pass undefs.
2021-09-13 22:54:11 -04:00
Matt Arsenault 0197cd0bd4 AMDGPU: Optimize amdgpu-no-* attributes
This allows clobbering a few extra registers in the fixed ABI, and
avoids some workitem ID packing instructions.
2021-09-09 18:24:28 -04:00
Craig Topper 9af8f1b18e [SelectionDAG] Add isZero/isAllOnes methods to ConstantSDNode.
Soft deprecrate isNullValue/isAllOnesValue and update in tree
callers. This matches the changes to the APInt interface from
D109483.

Reviewed By: lattner

Differential Revision: https://reviews.llvm.org/D109535
2021-09-09 13:28:30 -07:00
Michael Liao 640beb38e7 [amdgpu] Enable selection of `s_cselect_b64`.
Differential Revision: https://reviews.llvm.org/D109159
2021-09-07 10:45:07 -04:00
Joe Nash c96839265a [AMDGPU] Enable ds_min/ds_max on more subtargets
Adds patterns for f64 ds_min/ds_max. Shrinks HasLDSFPAtomics
scope to enable f32.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D108994

Change-Id: Id890b677841ee588b20d42b1bb3f4cdbf6e9ba1a
2021-08-31 13:22:31 -04:00
Matt Arsenault 98d7aa435f AMDGPU: Stop inferring use of llvm.amdgcn.kernarg.segment.ptr
We no longer use this intrinsic outside of the backend and no longer
support using it outside of kernels.
2021-08-26 20:30:03 -04:00
Anshil Gandhi 508b06699a [Remarks] [AMDGPU] Emit optimization remarks for atomics generating hardware instructions
Produce remarks when atomic instructions are expanded into hardware instructions
in SIISelLowering.cpp. Currently, these remarks are only emitted for atomic fadd
instructions.

Differential Revision: https://reviews.llvm.org/D108150
2021-08-19 20:51:19 -06:00
David Stuttard ebdb0d09a4 AMDGPU: During img instruction ret value construction cater for non int values
Make sure return type is int type.

Differential Revision: https://reviews.llvm.org/D108131

Change-Id: Ic02f07d1234cd51b6ed78c3fecd2cb1d6acd5644
2021-08-17 09:08:24 +01:00
Carl Ritson 99c790dc21 [AMDGPU] Make BVH isel consistent with other MIMG opcodes
Suffix opcodes with _gfx10.
Remove direct references to architecture specific opcodes.
Add a BVH flag and apply this to diassembly.
Fix a number of disassembly errors on gfx90a target caused by
previous incorrect BVH detection code.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D108117
2021-08-17 10:42:22 +09:00
Arthur Eubanks 92ce6db9ee [NFC] Rename AttributeList::hasFnAttribute() -> hasFnAttr()
This is more consistent with similar methods.
2021-08-13 11:09:18 -07:00
Amara Emerson 2b067e3335 Change TargetLowering::canMergeStoresTo() to take a MF instead of DAG.
DAG is unnecessary and we need this hook to implement store merging on GlobalISel too.
2021-08-06 12:57:53 -07:00
Jay Foad 2b63933115 [AMDGPU][SDag] Better lowering for 32-bit ctlz/cttz
Differential Revision: https://reviews.llvm.org/D107566
2021-08-05 15:57:40 +01:00
Jay Foad 40202b13b2 [AMDGPU] Legalize operands of V_ADDC_U32_e32 and friends
These instructions have an implicit use of vcc which counts towards the
constant bus limit. Pre gfx10 this means that the explicit operands
cannot be sgprs. Use the custom inserter hook to call legalizeOperands
to enforce that restriction.

Fixes https://bugs.llvm.org/show_bug.cgi?id=51217

Differential Revision: https://reviews.llvm.org/D106868
2021-08-03 09:04:52 +01:00
Carl Ritson 675c942373 [AMDGPU] Disable NSA for BVH instructions when appropriate
Check maximum NSA size when selecting NSA or non-NSA BVH instructions.

Differential Revision: https://reviews.llvm.org/D103230
2021-08-02 20:09:26 +09:00
Carl Ritson fbaa35e169 [AMDGPU] Add SelectionDAG support for insert_subvector on v4f64
Enable custom insert_subvector for larger vector types.
This is necessary now that SelectionDAG can attempt v3f64 insert
to v4f64, etc.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D105385
2021-07-27 10:11:34 +09:00
Jay Foad 9ac10658ae [AMDGPU] Fix MMO for raw/struct buffer access with non-constant offset
Codegen for the raw/struct buffer access intrinsics would update the
offset in the MMO to reflect the combined offset, if it was known to be
constant. If the combined offset was not known to be constant, or if
there was an index, it would set the offset in the MMO to 0. This is
unsafe because it makes it look like the access does not alias with
another access with a fixed non-zero offset.

Fix these cases by setting the pointer in the MMO to null, to reflect
the fact that we do not have any known IR value pointer + constant
offset for the access.

Differential Revision: https://reviews.llvm.org/D106284
2021-07-26 14:27:30 +01:00
Carl Ritson 7d4baf25aa [AMDGPU] Add maximum NSA size limit ISA feature
Add maximum NSA size limit as an ISA feature.
Use this to reduce NSA usage on GFX10.1 to avoid stability issues
with 4 and 5 dwords NSA instructions.
Maintain use of longer NSA instructions on GFX10.3.

Note: this also contains some minor fixes for GlobalISel which
did not work correctly with non-NSA form instructions on GFX10.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D103348
2021-07-23 16:16:06 +09:00
Carl Ritson 6efb3220b4 [AMDGPU] Add VReg_192/VReg_224 support for MIMG instructions
Allow MIMG instructions to be selected with 6/7 VGPRs for vaddr.
Previously these were rounded up to VReg_256 this saves VGPRs.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D103800
2021-07-22 10:42:15 +09:00
Jay Foad 3ed29f960c [AMDGPU] NFC refactoring in isel for buffer access intrinsics
Rename getBufferOffsetForMMO to updateBufferMMO and pass in the MMO to
be updated, in preparation for the bug fix in D106284.

Call updateBufferMMO consistently for all buffer intrinsics, even the
ones that use setBufferOffsets to decompose a combined offset
expression.

Add a getIdxEn helper function.

Differential Revision: https://reviews.llvm.org/D106354
2021-07-21 11:12:49 +01:00
Jay Foad 96d8f2a1e0 [AMDGPU] Fix typo in comments idexen -> idxen 2021-07-19 13:39:30 +01:00
Carl Ritson 98f48723f2 [AMDGPU] Add 224-bit vector types and link 192-bit types to MVTs
Add SReg_224, VReg_224, AReg_224, etc.
Link 224-bit types with v7i32/v7f32.
Link existing 192-bit types to newly added v3i64/v3f64/v6i32/v6f32.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D104622
2021-06-24 12:41:22 +09:00
Matt Arsenault a7786badb7 AMDGPU: Move zeroed FP high bits optimization to patterns 2021-06-22 12:47:56 -04:00
Carl Ritson a10aeb3b32 [AMDGPU] Remove duplicate setOperationAction for v4i16/v4f16 (NFC) 2021-06-18 12:38:54 +09:00
Brendon Cahoon 294efbbd3e Reland "[AMDGPU] Add gfx1013 target"
This reverts commit 211e584fa2.

Fixed a use-after-free error that caused the sanitizers to fail.
2021-06-08 21:15:35 -04:00
Brendon Cahoon 211e584fa2 Revert "[AMDGPU] Add gfx1013 target"
This reverts commit ea10a86984.

A sanitizer buildbot reports an error.
2021-06-08 16:29:41 -04:00
Brendon Cahoon ea10a86984 [AMDGPU] Add gfx1013 target
Differential Revision: https://reviews.llvm.org/D103663
2021-06-08 12:49:49 -04:00
Carl Ritson f8816c7400 [AMDGPU] Add v5f32/VReg_160 support for MIMG instructions
Avoid having to round up to v8f32/VReg_256 when only 5 VGPRs are
required for a MIMG address operand.

Maintain _V8 instruction variants of pseudo instructions allowing
assembly prior to GFX10 to work as-is.  Currently the validator
can tell for GFX10 what the correct size is, so will disallow
oversize address registers.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D103672
2021-06-08 11:11:40 +09:00
Nikita Popov 9914200393 [CodeGen] Add missing includes (NFC)
These currently rely on the IRBuilder.h include in TargetLowering.h.
Make them explicit.
2021-06-06 15:48:27 +02:00
Stanislav Mekhanoshin 9e2e49328f [AMDGPU] All GWS instructions need aligned VGPR on gfx90a
Fixes: SWDEV-288006

Differential Revision: https://reviews.llvm.org/D103197
2021-06-01 17:08:03 -07:00
Sebastian Neubauer 690f5b7a01 [AMDGPU] Fix function calls with flat scratch
When flat scratch is used, the stack pointer needs to be added when
writing arguments to the stack.
For buffer instructions, this is done in SelectMUBUFScratchOffen
and SelectMUBUFScratchOffset.

Move that to call argument lowering, like it is done in GlobalISel.

Differential Revision: https://reviews.llvm.org/D103166
2021-05-28 11:22:13 +02:00
Stanislav Mekhanoshin 45764efb69 [AMDGPU] Do not check denorm for LDS FP atomic with unsafe flag
This is already how it is handled for global and flat atomics.

Differential Revision: https://reviews.llvm.org/D102366
2021-05-17 16:53:09 -07:00
David Stuttard 31b62aa162 [AMDGPU] Fix codegen of image intrinsics for g16 and a16
For gfx10 gradient (g16) and address (a16) can be independent. Previous
implementation assumed that a16 implied g16.

There are some other changes that fix the verification (as well as asm/disasm)
that are required for the included test to pass - the XFAIL will be removed in
those changes.

This also includes required fixes for GlobalISel

Differential Revision: https://reviews.llvm.org/D102066

Change-Id: I7d171cc90994de05f41669b66a6d0ffa2ed05d09
2021-05-14 09:28:15 +01:00
Stanislav Mekhanoshin 8f98356bb5 [AMDGPU] Only allow global fp atomics with unsafe option
Previously we were allowing to use FP atomics without
-amdgpu-unsafe-fp-atomics option if a scope is less then
system. This is not safe just as well if we have UC memory.

This change only allows global and flat FP atomics with
the unsafe option. Consequentially that makes a check for
denorm mode redundant since we skip it with the unsafe
option and do not have a way to produce these instructions
without it anyway.

Differential Revision: https://reviews.llvm.org/D102347
2021-05-13 08:52:20 -07:00
Stanislav Mekhanoshin bd00106d1e [AMDGPU] Refactor shouldExpandAtomicRMWInIR(). NFC.
This is logic simplification for better readability.

Differential Revision: https://reviews.llvm.org/D102371
2021-05-12 16:39:03 -07:00
Craig Topper 44e0e91db0 [ValueTypes] Rename MVT::getVectorNumElements() to MVT::getVectorMinNumElements(). Fix some misuses of getVectorNumElements()
getVectorNumElements() returns a value for scalable vectors
without any warning so it is effectively getVectorMinNumElements().
By renaming it and making getVectorNumElements() forward to
it, we can insert a check for scalable vectors into getVectorNumElements()
similar to EVT. I didn't do that in this patch because there are still more
fixes needed, but I was able to temporarily do it and passed the RISCV
lit tests with these changes.

The changes to isPow2VectorType and getPow2VectorType are copied from EVT.

The change to TypeInfer::EnforceSameNumElts reduces the size of AArch64's isel table.
We're now considering SameNumElts to require the scalable property to match which
removes some unneeded type checks.

This was motivated by the bug I fixed yesterday in 80b9510806

Reviewed By: frasercrmck, sdesmalen

Differential Revision: https://reviews.llvm.org/D102262
2021-05-12 07:46:45 -07:00
Julien Pagès 46adccc5cc [AMDGPU] Improve Codegen for build_vector
Improve the code generation of build_vector.
Use the v_pack_b32_f16 instruction instead of
v_and_b32 + v_lshl_or_b32

Differential Revision: https://reviews.llvm.org/D98081

Patch by Julien Pagès!
2021-05-12 14:17:44 +01:00
Stanislav Mekhanoshin c714d03785 [AMDGPU] Expose __builtin_amdgcn_perm for v_perm_b32
Differential Revision: https://reviews.llvm.org/D102022
2021-05-06 16:17:33 -07:00
Julien Pagès a1ed39df96 [AMDGPU] Select V_CVT_*16_F16 more often
Improve the code generation of fp_to_sint
and fp_to_uint for integer on 16-bits.

Differential Revision: https://reviews.llvm.org/D101481

Patch by Julien Pagès!
2021-05-05 08:57:51 +01:00
Daniil Fukalov 3489c2d7b1 [TTI] NFC: Change getTypeLegalizationCost to return InstructionCost.
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Reviewed By: sdesmalen, kparzysz

Differential Revision: https://reviews.llvm.org/D101533
2021-04-30 22:51:51 +03:00
Petar Avramovic fb7be0d912 AMDGPU/GlobalISel: Remove redundant G_FCANONICALIZE
Add basic version of isCanonicalized for global-isel. Copied from sdag.
Add post legalizer combine that deletes G_FCANONICALIZE when its input
is already Canonicalized.

Differential Revision: https://reviews.llvm.org/D96605
2021-04-27 12:26:37 +02:00
Matt Arsenault 70ab76a81b AMDGPU: Fix indirect tail calls
Fix a selection error on uniform callees, and use a regular call if
divergent.
2021-04-21 09:15:24 -04:00