Commit Graph

552 Commits

Author SHA1 Message Date
Pierre van Houtryve a88deb4b65 [AMDGPU] Use aperture registers instead of S_GETREG
Fixes a longstanding TODO in the codebase where we were using S_GETREG + shift to do something that could simply be done with an inline constant (register).

Patch based on D31874 by @kzhuravl
Depends on D137767

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D137542
2022-11-30 12:25:10 +00:00
Mateja Marjanovic 595a08847a [AMDGPU] Add support for new LLVM vector types
Add VReg, AReg and SReg on AMDGPU for bit widths: 288, 320, 352 and 384.

Differential Revision: https://reviews.llvm.org/D138205
2022-11-29 17:02:04 +01:00
Janek van Oirschot 322966f8f8 [AMDGPU] Add llvm.is.fpclass intrinsic to existing SelectionDAG fp
class support and introduce GlobalISel implementation for AMDGPU

Uses existing SelectionDAG lowering of the llvm.amdgcn.class intrinsic
for llvm.is.fpclass
2022-11-28 16:00:36 -05:00
Ivan Kosarev ec8ede8177 [AMDGPU][CodeGen] Support raw format TFE buffer loads other than byte, short and d16 ones.
Differential Revision: https://reviews.llvm.org/D138215
2022-11-24 10:50:26 +00:00
Matt Arsenault 1fe1299a93 GlobalISel: Legalize strict_fsub
In the future should probably have a more convenient
way to switch between building strict and non-strict ops.
2022-11-18 15:21:41 -08:00
Matt Arsenault fe5b9a6a11 AMDGPU/GlobalISel: Make strict fadd, fmul and fma legal 2022-11-17 20:50:04 -08:00
Stanislav Mekhanoshin bcaf31ec3f [AMDGPU] Allow finer grain control of an unaligned access speed
A target can return if a misaligned access is 'fast' as defined
by the target or not. In reality there can be different levels
of 'fast' and 'slow'. This patch changes the boolean 'Fast'
argument of the allowsMisalignedMemoryAccesses family of functions
to an unsigned representing its speed.

A target can still define it as it wants and the direct translation
of the current code uses 0 and 1 for current false and true. This
makes the change an NFC.

Subsequent patch will start using an actual value of speed in
the load/store vectorizer to compare if a vectorized access going
to be not just fast, but not slower than before.

Differential Revision: https://reviews.llvm.org/D124217
2022-11-17 09:23:53 -08:00
Yashwant Singh 1c9a93ae3a [GlobalIsel][AMDGPU] Changing legalize rule for G_{UADDO|UADDE|USUBO|USUBE|SADDE|SSUBE}
Generic add and sub with carry are now legalized in a way to explicitly calculate carry/borrow output. i.e
%6:_(s64), %7:_(s1) = G_UADDO %0, %1
becomes,
%13:_(s32), %14:_(s1) = G_UADDO %2, %4
%15:_(s32), %16:_(s1) = G_UADDE %3, %5, %14
%6:_(s64) = G_MERGE_VALUES %13(s32), %15(s32)
%7:_(s1) = G_ICMP intpred(ult), %6(s64), %1

Here G_MERGE and G_ICMP instructions are redundant for recalculating carry output. (Similar case for sub with borrow)
This change fix this.

Reviewed By: arsenm, #amdgpu

Differential Revision: https://reviews.llvm.org/D137932
2022-11-14 23:42:23 +05:30
Jay Foad 5073ae2a88 [AMDGPU] Fix duplicated words in comments 2022-11-03 15:33:30 +00:00
Pierre van Houtryve bb71079e30 [AMDGPU][GISel] Add missing V2S16 BUILD_VECTOR_TRUNC legalization
Previously we would be unable to legalize V2S16 BUILD_VECTOR_TRUNC on GFX8 & below as the custom legalization was missing.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D135149
2022-10-06 06:48:53 +00:00
Pierre van Houtryve c93104073c [AMDGPU] Always lower SHUFFLE_VECTOR
Make it illegal, remove InstructionSelector logic for it

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D134967
2022-10-04 14:23:17 +00:00
Pierre van Houtryve 9a67a6b72a [AMDGPU][GISel] Legalize V2S16 G_BUILD_VECTOR
Preparation patch for D134354 to make V2S16 G_BUILD_VECTOR legal.
Also removes RegBankInfo's scalarization of small BUILD_VECTORs,
replacing it with InstructionSelector logic instead.

This allows for V2S16 BUILD_VECTOR instructions to survive
all the way to ISel so we can select FMA/MAD_MIX instructions
in D134354.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D134433
2022-09-30 14:04:53 +00:00
Carl Ritson 266b5dbc5d [AMDGPU] Add MIMG NSA threshold configuration attribute
Make MIMG NSA minimum addresses threshold an attribute that can
be set on a function or configured via command line.
This enables frontend tuning which allows increased NSA usage
where beneficial.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D134780
2022-09-28 20:03:18 +09:00
Petar Avramovic 6db7921b65 AMDGPU: Use tablegen patterns for buffer global and flat atomic fadd
Remove manual selection for atomic fadd from global-isel.
Stop pre-isel translation to AtomicLoadFAdd/G_ATOMICRMW_FADD
which corresponds to llvm-ir's atomicrmw fadd instruction.

global and flat atomic fadd patterns changes:
Split rtn/no-rtn patterns
Add missing patterns or fix predicates
Remove atomicrmw patterns for v2f16 (atomic rmw doesn't support vectors).
Patterns now check addrspace of pointer, added patterns for flat intrinsic.
with global addrspace pointer that selects into global atomic instruction.

buffer atomic fadd patterns changes:
Rdit patterns to import into global-isel.
Remove gfx6/gfx7 _addr64 and _offset patterns.
Remove patterns that can't be reached (same pattern but different feature).

Differential Revision: https://reviews.llvm.org/D130579
2022-09-23 17:52:10 +02:00
Petar Avramovic 5cee9047d5 AMDGPU: Improve atomicrmw fadd selection
Use same atomicrmw fadd expansion rules for gfx908, gfx940 and gfx11
as for gfx90a. Add missing globalisel legalizer support for flat
atomicrmw fadd f32 on gfx940 and gfx11.
Isel support for gfx11 will be added in D130579.

Differential Revision: https://reviews.llvm.org/D131560
2022-09-23 17:52:10 +02:00
Jay Foad 8901f7cebc [AMDGPU] Fix crash legalizing G_EXTRACT_VECTOR_ELT with negative index
Fixes https://github.com/llvm/llvm-project/issues/57408

Differential Revision: https://reviews.llvm.org/D132938
2022-09-09 15:53:34 +01:00
Kazu Hirata 0387da6f4f Use value instead of getValue (NFC) 2022-07-19 21:18:26 -07:00
Kazu Hirata 41ae78ea3a Use has_value instead of hasValue (NFC) 2022-07-19 20:15:44 -07:00
Jon Chesterfield 3a20597776 [amdgpu] Implement lds kernel id intrinsic
Implement an intrinsic for use lowering LDS variables to different
addresses from different kernels. This will allow kernels that cannot
reach an LDS variable to avoid wasting space for it.

There are a number of implicit arguments accessed by intrinsic already
so this implementation closely follows the existing handling. It is slightly
novel in that this SGPR is written by the kernel prologue.

It is necessary in the general case to put variables at different addresses
such that they can be compactly allocated and thus necessary for an
indirect function call to have some means of determining where a
given variable was allocated. Claiming an arbitrary SGPR into which
an integer can be written by the kernel, in this implementation based
on metadata associated with that kernel, which is then passed on to
indirect call sites is sufficient to determine the variable address.

The intent is to emit a __const array of LDS addresses and index into it.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D125060
2022-07-19 17:46:19 +01:00
Joe Nash ae72fee74e [AMDGPU] gfx11 Select on Buffer Atomic FAdd Rtn type
Reviewed By: #amdgpu, foad, rampitec

Differential Revision: https://reviews.llvm.org/D128205
2022-06-23 11:05:32 -04:00
Mirko Brkusanin 6cae753bf4 [AMDGPU][GlobalISel] Legalize G_FSUB for s16
Differential Revision: https://reviews.llvm.org/D128066
2022-06-20 12:25:49 +02:00
Jay Foad c155a944fb [AMDGPU] GFX11 CodeGen support for MIMG instructions
This includes:
- New llvm.amdgcn.image.msaa.load.* intrinsics
- NSA changes, because MIMG-NSA is now limited to 3 dwords
- Split CD forms of IMAGE_SAMPLE instructions out into separate
  test files since they are no longer supported in GFX11

Differential Revision: https://reviews.llvm.org/D127837
2022-06-16 18:23:14 +01:00
Benjamin Kramer 0abb472fff AMDGPU/GISel: Remove unused variable. NFC. 2022-06-09 13:43:47 +02:00
Nicolai Hähnle 264d1136f9 AMDGPU/GISel: Introduce custom legalization of G_MUL
The generic legalizer framework is still used to reduce the problem
to scalar multiplication with the bit size a multiple of 32.

Generating optimal code sequences for big integer multiplication is
somewhat tricky and has a number of target-specific intricacies:

- The target has V_MAD_U64_U32 instructions that multiply two 32-bit
  factors and add a 64-bit accumulator. Most partial products should
  use this instruction.
- The accumulator is mapped to consecutive 32-bit GPRs, and partial-
  product multiply-adds can feed the accumulator into each other
  directly. (The register allocator's support for that is somewhat
  limited, but that only matters for 128-bit integers and larger.)
- OTOH, on some hardware, V_MAD_U64_U32 requires the accumulator
  to be stored in an even-aligned pair of GPRs. To avoid excessive
  register copies, it makes sense to compute odd partial products
  separately from even partial products (where a partial product
  src0[j0] * src1[j1] is "odd" if j0 + j1 is odd) and add both
  halves together as a final step.
- We can combine G_MUL+G_ADD into a single cascade of multiply-adds.
- The target can keep many carry-bits in flight simultaneously, so
  combining carries using G_UADDE is preferable over G_ZEXT + G_ADD.
- Not addressed by this patch: When the factors are sign-extended,
  the V_MAD_I64_I32 instruction (signed version!) can be used.

It is difficult to address these points generically:

1) Finding matching pairs of G_MUL and G_UMULH to find a wide
multiply is expensive. We could add a G_UMUL_LOHI generic instruction
and conditionally use that in the generic legalizer, but by itself
this wouldn't allow us to use the accumulation capability of
V_MAD_U64_U32. One could attempt to find matching G_ADD + G_UADDE
post-legalization, but this is also expensive.

2) Similarly, making sense of the legalization outcome of a wide
pre-legalization G_MUL+G_ADD pair is extremely expensive.

3) How could the generic legalizer possibly deal with the
particular idiosyncracy of "odd" vs. "even" partial products.

All this points in the direction of directly emitting an ideal code
sequence during legalization, but the generic legalizer should not
be burdened with such overly target-specific concerns. Hence, a
custom legalization.

Note that the implemented approach is different from that used by
SelectionDAG because narrowing of scalars works differently in
general. SelectionDAG iteratively cuts wide scalars into low and
high halves until a legal size is reached. By contrast, GlobalISel
does the narrowing in a single shot, which should be better for
compile-time and for the quality of the generated code.

This patch leaves three gaps open:

1. When the factors are uniform, we should execute the multiplication on
   the SALU. Register bank mapping already ensures this.

   However, the resulting code sequence is not optimal because it doesn't
   fully use the carry-in capabilities of S_ADDC_U32. (V_MAD_U64_U32
   doesn't have a carry-in.) It is very difficult to fix this after the
   fact, so we should really use a different legalization sequence in
   this case. Unfortunately, we don't have a divergence analysis and so
   cannot make that choice.

   (This only matters for 128-bit integers and larger.)

2. Avoid unnecessary multiplies when sources are known to be zero- or
   sign-extended. The challenge is that the legalizer does not currently
   have access to GISelKnownBits.

3. When the G_MUL is followed by a G_ADD, we should consider combining
   the two instructions into a single multiply-add sequence, to utilize
   the accumulator of V_MAD_U64_U32 fully. (Unless the multiply has
   multiple uses and the implied duplication of the multiply is an
   overall negative). However, this is also not true when the factors
   are uniform: in that case, it is generally better to *not* combine
   the two operations, so that the multiply can be done on the SALU.

   Again, we don't have a divergence analysis available and so cannot
   make an informed choice.

Differential Revision: https://reviews.llvm.org/D124844
2022-06-09 13:38:56 +02:00
Matt Arsenault 794a0bb547 AMDGPU: Directly implement computeKnownBits for workitem intrinsics
Currently metadata is inserted in a late pass which is lowered
to an AssertZext. The metadata would be more useful if it was
inserted earlier after inlining, but before codegen.

Probably shouldn't change anything now. Just replacing the
late metadata annotation needs more work, since we lose
out on optimizations after these are lowered to CopyFromReg.

Seems to be slightly better than relying on the AssertZext from the
metadata. The test change in cvt_f32_ubyte.ll is a quirk from it using
-start-before=amdgpu-isel instead of running the usual codegen
pipeline.
2022-04-22 10:49:50 -04:00
Jay Foad 538c77172a [AMDGPU] Fix unused variable warning after D117484 2022-04-06 14:45:38 +01:00
Matt Arsenault 54c525fc53 AMDGPU/GlobalISel: Handle legacy grid ID intrinsics
Handle the llvm.r600.* intrinsics which are still in use in libclc. I
thought it would be possible to switch it to using
llvm.amdgcn.implicitarg.ptr already, but it turns out the implicit
arguments are currently split into a piece before and after the
explicit kernel arguments.
2022-04-05 22:01:31 -04:00
Abinav Puthan Purayil 898d5776ec [AMDGPU][GlobalISel] Scalarize add/sub with overflow ops in the legalizer
Differential Revision: https://reviews.llvm.org/D122803
2022-03-31 21:46:34 +05:30
Abinav Puthan Purayil acf83abcbf [AMDGPU][GlobalISel] Remove unused variable. NFC. 2022-03-31 16:50:34 +05:30
Changpeng Fang dd5895cc39 AMDGPU: Use the implicit kernargs for code object version 5
Summary:
  Specifically, for trap handling, for targets that do not support getDoorbellID,
we load the queue_ptr from the implicit kernarg, and move queue_ptr to s[0:1].
To get aperture bases when targets do not have aperture registers, we load
private_base or shared_base directly from the implicit kernarg. In clang, we use
implicitarg_ptr + offsets to implement __builtin_amdgcn_workgroup_size_{xyz}.

Reviewers: arsenm, sameerds, yaxunl

Differential Revision: https://reviews.llvm.org/D120265
2022-03-17 14:12:36 -07:00
Shengchen Kan 37b378386e [NFC][CodeGen] Rename some functions in MachineInstr.h and remove duplicated comments 2022-03-16 20:25:42 +08:00
Stanislav Mekhanoshin 932f628121 [AMDGPU] new gfx940 fp atomics
Differential Revision: https://reviews.llvm.org/D121028
2022-03-07 12:32:02 -08:00
Sebastian Neubauer 6527b2a4d5 [AMDGPU][NFC] Fix typos
Fix some typos in the amdgpu backend.

Differential Revision: https://reviews.llvm.org/D119235
2022-02-18 15:05:21 +01:00
Julien Pages dcb2da13f1 [AMDGPU] Add a new intrinsic to control fp_trunc rounding mode
Add a new llvm.fptrunc.round intrinsic to precisely control
the rounding mode when converting from f32 to f16.

Differential Revision: https://reviews.llvm.org/D110579
2022-02-11 12:08:23 -05:00
Matt Arsenault 8b8b491379 AMDGPU/GlobalISel: Fix assertions on invalid addrspacecasts
Fixes some assert on invalid situations and starts directly emitting
the error.
2022-02-04 17:28:49 -05:00
Changpeng Fang 1194b9cdda AMDGPU {NFC}: Add code object v5 support and generate metadata for implicit kernel args
Summary:
  Add code object v5 support (deafult is still v4)
  Generate metadata for implicit kernel args for the new ABI
  Set the metadata version to be 1.2

Reviewers:
  t-tye, b-sumner, arsenm, and bcahoon

Fixes:
  SWDEV-307188, SWDEV-307189

Differential Revision:
  https://reviews.llvm.org/D118272
2022-01-31 18:07:47 -08:00
Sebastian Neubauer ae2f9c8be8 [AMDGPU] Remove lz and nomip combine from codegen
These combines have been moved into the IR combiner in D116042.

Differential Revision: https://reviews.llvm.org/D116116
2022-01-21 12:09:08 +01:00
Sebastian Neubauer 0530fdbbbb [AMDGPU] Fix LOD bias in A16 combine
As the codegen fix in D111754, the LOD bias needs to be converted to 16
bits. Fix this in the combine.

Differential Revision: https://reviews.llvm.org/D116038
2022-01-21 12:09:06 +01:00
Matt Arsenault 2e49e0cfde AMDGPU/GlobalISel: Directly diagnose return value use for FP atomics
Emit an error if the return value is used on subtargets that do not
support them. Previously we were falling back to the DAG on selection
failure, where it would emit this error and then fail again.
2022-01-20 12:46:45 -05:00
Matt Arsenault be7e938e27 AMDGPU/GlobalISel: Stop handling llvm.amdgcn.buffer.atomic.fadd
This code is not structured to handle the legacy buffer intrinsics and
was miscompiling them.
2022-01-20 12:12:06 -05:00
Matt Arsenault 8ff3c9e0be AMDGPU/GlobalISel: Fix selection of gfx90a FP atomics
The struct/raw forms for the buffer atomics now work as
expected. However, we're incorrectly handling the legacy form (which
we probably shouldn't handle at all). We also are not diagnosing the
use of the return value on gfx908. These will be addressed separately.
2022-01-20 12:12:06 -05:00
Matt Arsenault adab71711e AMDGPU/GlobalISel: Fix legalize failure on i65 ctpop 2022-01-19 10:26:28 -05:00
Matt Arsenault c3a74183a5 AMDGPU/GlobalISel: Fix legalization failure for s65 shifts
This was trying to clamp s65 down to s32, which wasn't handled so we
need to promote all the way to s128 first. Having to order the
legalization rules in just the right way is rather dissatisfying, but
I'm not sure how smart the legalizer should be in trying to interpret
the rules.
2022-01-17 10:04:41 -05:00
Matt Arsenault 59994c25f9 AMDGPU: Select workitem ID intrinsics to 0 with req_work_group_size
Shockingly we weren't doing this already. We should probably have this
be done earlier in the IR too, but it's still helpful to have the
lowering guarantee it so that we can modify the ABI implicit inputs
based on it.
2022-01-13 12:08:18 -05:00
Petar Avramovic 235886e174 AMDGPU/GlobalISel: Fix custom legalizatation for fceil 2022-01-13 14:29:30 +01:00
Matt Arsenault 4515c24bbc AMDGPU/GlobalISel: Fix assertions on legalize queries with huge align
For some reason we pass around the alignment in bits as uint64_t. Two
places were truncating it to unsigned, and losing bits in extreme
cases.
2022-01-12 18:21:44 -05:00
Matt Arsenault 8e682086a0 AMDGPU/GlobalISel: Explicitly track d16 for image legalization
We were trying to guess at the original IR type for image intrinsics
after legalization to figure out if they were d16, but this didn't
work. Explicitly track if this is a d16 operation or not in the
opcode, as is done for the buffer intrinsics.

The OpenCL library is using f32 image writes with a dmask of 15 for
some reason, and this was incorrectly switching them to use d16. Fixes
image failures in the OpenCL conformance test. The equivalent dmask
for loads doesn't even select in either selector.
2022-01-10 14:25:14 -05:00
Matt Arsenault 0ba4e4b500 GlobalISel: Pass DebugLoc to getFunctionLiveInPhysReg
Fixes crash in assertion about dropping debug info.
2022-01-10 13:50:52 -05:00
Matt Arsenault 68468bbe15 AMDGPU: Avoid null check during addrspacecast lowering
If we know the source is a valid object, we do not need to insert a
null check. This misses a lot of opportunities from
metadata/attributes not tracked in codegen.
2022-01-10 13:27:39 -05:00
Kazu Hirata 720c48b58e [AMDGPU] Fix an unused variable warning (NFC)
This patch fixes:

  llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp:2245:12: error:
  unused variable 'Ins' [-Werror,-Wunused-variable]
2022-01-10 08:49:46 -08:00