Commit Graph

1065 Commits

Author SHA1 Message Date
Stanislav Mekhanoshin 3bffb1cd0e [AMDGPU] Use single cache policy operand
Replace individual operands GLC, SLC, and DLC with a single cache_policy
bitmask operand. This will reduce the number of operands in MIR and I hope
the amount of code. These operands are mostly 0 anyway.

Additional advantage that parser will accept these flags in any order unlike
now.

Differential Revision: https://reviews.llvm.org/D96469
2021-03-15 13:00:59 -07:00
Jon Chesterfield 13e49dcee4 [amdgpu] Implement lower function LDS pass
[amdgpu] Implement lower function LDS pass

Local variables are allocated at kernel launch. This pass collects global
variables that are used from non-kernel functions, moves them into a new struct
type, and allocates an instance of that type in every kernel. Uses are then
replaced with a constantexpr offset.

Prior to this pass, accesses from a function are compiled to trap. With this
pass, most such accesses are removed before reaching codegen. The trap logic
is left unchanged by this pass. It is still reachable for the cases this pass
misses, notably the extern shared construct from hip and variables marked
constant which survive the optimizer.

This is of interest to the openmp project because the deviceRTL runtime library
uses cuda shared variables from functions that cannot be inlined. Trunk llvm
therefore cannot compile some openmp kernels for amdgpu. In addition to the
unit tests attached, this patch applied to ROCm llvm with fixed-abi enabled
and the function pointer hashing scheme deleted passes the openmp suite.

This lowering will use more LDS than strictly necessary. It is intended to be
a functionally correct fallback for cases that are difficult to target from
future optimisation passes.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D94648
2021-03-15 15:24:01 +00:00
Stanislav Mekhanoshin 315ebe0df3 [AMDGPU] Fix getAlignedAGPRClassID
Not all register classes were listed.

Differential Revision: https://reviews.llvm.org/D98550
2021-03-12 14:41:50 -08:00
Ruiling Song e8e6817d00 [AMDGPU] Don't check hasStackObjects() when reserving VGPR
We have amdgpu_gfx functions that have high register pressure. If
we do not reserve VGPR for SGPR spill, we will fall into the path
to spill the SGPR to memory, which does not only have correctness issue,
but also have really bad performance.

I don't know why there is the check for hasStackObjects(), in our case,
we don't have stack objects at the time of finalizeLowering(). So just
remove the check that we always reserve a VGPR for possible SGPR spill
in non-entry functions.

Reviewed by: arsenm

Differential Revision: https://reviews.llvm.org/D98345
2021-03-12 08:11:14 +08:00
Stanislav Mekhanoshin 574a9dabc6 [AMDGPU] Always expand system scope fp atomics on gfx90a
FP atomics in system scope cannot be used and shall always
be expanded in a CAS loop.

Differential Revision: https://reviews.llvm.org/D98085
2021-03-10 12:35:23 -08:00
Jay Foad 48ca5d3398 [AMDGPU] Simplify SITargetLowering::isSDNodeSourceOfDivergence. NFC.
Check for read-modify-write AtomicSDNodes instead of using an exhaustive
list of ISD opcodes.

Differential Revision: https://reviews.llvm.org/D97671
2021-03-01 14:22:08 +00:00
Jay Foad 3ad5216ed8 [AMDGPU] Better codegen for i64 bitreverse
Differential Revision: https://reviews.llvm.org/D97547
2021-02-26 15:51:36 +00:00
Michael Liao 0d4e12e3c1 [amdgpu] Atomic should be source of divergence.
Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D97392
2021-02-24 15:27:47 -05:00
Matt Arsenault 78b6d73a93 AMDGPU: Add even aligned VGPR/AGPR register classes
gfx90a operations require even aligned registers, but this was
previously achieved by reserving registers inside the full class.

Ideally this would be captured in the static instruction definitions
for the operands, and we would have different instructions per
subtarget. The hackiest part of this is we need to manually reassign
AGPR register classes after instruction selection (we get away without
this for VGPRs since those types are actually registered for legal
types).
2021-02-24 14:49:37 -05:00
Stanislav Mekhanoshin a8d9d50762 [AMDGPU] gfx90a support
Differential Revision: https://reviews.llvm.org/D96906
2021-02-17 16:01:32 -08:00
Piotr Sobczak 08131c7439 [AMDGPU] Fix a miscompile with S_ADD/S_SUB
The helper function isBoolSGPR is too aggressive when determining
when a v_cndmask can be skipped on a boolean value because the
function does not check the operands of and/or/xor.

This can be problematic for the Add/Sub combines that can leave
bits set even for inactive lanes leading to wrong results.

Fix this by inspecting the operands of and/or/xor recursively.

Differential Revision: https://reviews.llvm.org/D86878
2021-02-17 12:24:58 +01:00
Kazu Hirata 7725b81822 [AMDGPU] Drop unnecessary const from a return type (NFC)
Identified with const-return-type.
2021-02-05 21:02:04 -08:00
Craig Topper 11ef356d9e [TargetLowering] Use Align in allowsMisalignedMemoryAccesses.
Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D96097
2021-02-04 19:22:06 -08:00
Konstantin Zhuravlyov 6054a456da AMDGPU: Add support for amdgpu-unsafe-fp-atomics attribute
If amdgpu-unsafe-fp-atomics is specified, allow {flat|global}_atomic_add_f32 even if atomic modes don't match.

Differential Revision: https://reviews.llvm.org/D95391
2021-02-04 08:09:34 -05:00
Matt Arsenault 9719f17011 AMDGPU: Move handling of allocation of fixed ABI inputs
For the fixed ABI, set this in the initial argument constructor,
rather than relying on the allocation logic to set the values. Also
stop passing them for amdgpu_gfx, since the DAG path seems to skip
these. I'm unclear on what amdgpu_gfx's expectations are.  This will
allow moving the special input registers out of the normal argument
range.
2021-02-03 09:27:59 -05:00
Austin Kerbow 0397dca021 [AMDGPU] Fix crash with sgpr spills to vgpr disabled
This would assert with amdgpu-spill-sgpr-to-vgpr disabled when trying to
spill the FP.

Fixes: SWDEV-262704

Reviewed By: RamNalamothu

Differential Revision: https://reviews.llvm.org/D95768
2021-02-01 08:35:25 -08:00
Carl Ritson a80ebd0179 [AMDGPU] Fix llvm.amdgcn.init.exec and frame materialization
Frame-base materialization may insert vector instructions before EXEC is initialised.
Fix this by moving lowering of llvm.amdgcn.init.exec later in backend.
Also remove SI_INIT_EXEC_LO pseudo as this is not necessary.

Reviewed By: ruiling

Differential Revision: https://reviews.llvm.org/D94645
2021-01-25 08:31:17 +09:00
Kazu Hirata 054444177b [Target] Use llvm::append_range (NFC) 2021-01-24 12:18:56 -08:00
Kazu Hirata e4847a7fcf Revert "[Target] Use llvm::append_range (NFC)"
This reverts commit cc7a238286.

The X86WinEHState.cpp hunk seems to break certain builds.
2021-01-23 11:25:27 -08:00
Kazu Hirata cc7a238286 [Target] Use llvm::append_range (NFC) 2021-01-23 10:56:31 -08:00
Sebastian Neubauer 8214982b50 [AMDGPU] Implement mir parseCustomPseudoSourceValue
Allow parsing generated mir with custom pseudo source value tokens.
Also rename pseudo source values to have more meaningful names.

Relands ba7dcd8542, which had memory leaks.

Differential Revision: https://reviews.llvm.org/D95215
2021-01-22 11:24:08 +01:00
Matt Arsenault 2a0db8d70e AMDGPU: Use more accurate fast f64 fdiv
A raw v_rcp_f64 isn't accurate enough, so start applying correction.
2021-01-21 10:51:36 -05:00
dfukalov 560d7e0411 [NFC][AMDGPU] Split AMDGPUSubtarget.h to R600 and GCN subtargets
... to reduce headers dependency.

Reviewed By: rampitec, arsenm

Differential Revision: https://reviews.llvm.org/D95036
2021-01-20 22:22:45 +03:00
Joe Nash 314e29ed2b [AMDGPU] Add _e64 suffix to VOP3 Insts
Previously, instructions which could be
expressed as VOP3 in addition to another
encoding had a _e64 suffix on the tablegen
record name, while those
only available as VOP3 did not. With this
patch, all VOP3s will have the _e64 suffix.
The assembly does not change, only  the mir.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D94341

Change-Id: Ia8ec8890d47f8f94bbbdac43745b4e9dd2b03423
2021-01-12 18:33:18 -05:00
QingShan Zhang 7539c75bb4 [DAGCombine] Remove the check for unsafe-fp-math when we are checking the AFN
We are checking the unsafe-fp-math for sqrt but not for fpow, which behaves inconsistent.
As the direction is to remove this global option, we need to remove the unsafe-fp-math
check for sqrt and update the test with afn fast-math flags.

Reviewed By: Spatel

Differential Revision: https://reviews.llvm.org/D93891
2021-01-11 02:25:53 +00:00
dfukalov 6a87e9b08b [NFC][AMDGPU] Reduce include files dependency.
Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D93813
2021-01-07 22:22:05 +03:00
Kazu Hirata 0e219b6443 [Target] Construct SmallVector with iterator ranges (NFC) 2021-01-03 09:57:45 -08:00
Stanislav Mekhanoshin 747f67e034 [AMDGPU] Fix adjustWritemask subreg handling
If we happen to extract a non-dword subreg that breaks the
logic of the function and it may shrink the dmask because
it does not recognize the use of a lane(s).

This bug is next to impossible to trigger with the current
lowering in the BE, but it breaks in one of my future patches.

Differential Revision: https://reviews.llvm.org/D93782
2020-12-23 14:43:31 -08:00
Stanislav Mekhanoshin ca4bf58e4e [AMDGPU] Support unaligned flat scratch in TLI
Adjust SITargetLowering::allowsMisalignedMemoryAccessesImpl for
unaligned flat scratch support. Mostly needed for global isel.

Differential Revision: https://reviews.llvm.org/D93669
2020-12-22 16:12:31 -08:00
Jay Foad 4f25e53982 [AMDGPU] Make use of emitRemovedIntrinsicError. NFC.
Change-Id: I482bbf528255f2eacd3878ddfe7edb9a8f63d5c2
2020-12-11 14:02:14 +00:00
Austin Kerbow 4aa842a800 [AMDGPU] Add new pseudos for indirect addressing with VGPR Indexing
It is possible for copies or spills to be inserted in the middle of indirect
addressing sequences which use VGPR indexing. Spills to accvgprs could be
effected by the indexing mode.

Add new pseudo instructions that are expanded after register allocation to avoid
the problematic spill or copy placement.

Differential Revision: https://reviews.llvm.org/D91048
2020-12-08 12:24:12 -08:00
Jay Foad d28624a209 [AMDGPU] Stop adding an implicit def of vcc_hi for wave32
This doesn't seem to be needed for anything.

Differential Revision: https://reviews.llvm.org/D92400
2020-12-02 10:11:42 +00:00
Jay Foad 4f87d30a06 [AMDGPU] Introduce and use isGFX10Plus. NFC.
It's more future-proof to use isGFX10Plus from the start, on the
assumption that future architectures will be based on current
architectures.

Also make use of the existing isGFX9Plus in a few places.

Differential Revision: https://reviews.llvm.org/D92092
2020-11-26 09:02:36 +00:00
Sebastian Neubauer 7a18bdb350 [AMDGPU] Implement flat scratch init for pal
Extract the scratch offset from the scratch buffer descriptor that is
stored in the global table.

Differential Revision: https://reviews.llvm.org/D91701
2020-11-20 11:14:30 +01:00
Sebastian Neubauer 72ccec1bbc [AMDGPU] Fix v3f16 interaction with image store workaround
In some cases, the wrong amount of registers was reserved.

Also enable more v3f16 tests.

Differential Revision: https://reviews.llvm.org/D90847
2020-11-18 18:21:04 +01:00
Stanislav Mekhanoshin cf6565f6d0 [AMDGPU] Enable multi-dword flat scratch load/stores
Differential Revision: https://reviews.llvm.org/D91384
2020-11-12 13:38:56 -08:00
Jay Foad 830ed64ccd Revert "Revert "[AMDGPU] Reorganize GCN subtarget features for unaligned access""
This reverts commit 8b08fa0103.

The underlying problems were fixed by D90607.
2020-11-11 14:40:14 +00:00
Stanislav Mekhanoshin d5a465866e [AMDGPU] Omit buffer resource with flat scratch.
Differential Revision: https://reviews.llvm.org/D90979
2020-11-09 08:05:20 -08:00
Sebastian Neubauer a022b1ccd8 [AMDGPU] Add amdgpu_gfx calling convention
Add a calling convention called amdgpu_gfx for real function calls
within graphics shaders. For the moment, this uses the same calling
convention as other calls in amdgpu, with registers excluded for return
address, stack pointer and stack buffer descriptor.

Differential Revision: https://reviews.llvm.org/D88540
2020-11-09 16:51:44 +01:00
Stanislav Mekhanoshin f738aee0bb [AMDGPU] Add default 1 glc operand to rtn atomics
This change adds a real glc operand to the return atomic
instead of just string " glc" in the middle of the asm
string.

Improves asm parser diagnostics.

Differential Revision: https://reviews.llvm.org/D90730
2020-11-05 10:41:59 -08:00
Christudasan Devadasan d6aa4aa29a [AMDGPU] Some refactoring after D90404. NFC. 2020-11-01 13:18:53 +05:30
Christudasan Devadasan 9bb2b4f0aa [AMDGPU] Add alignment check for v3 to v4 load type promotion
It should be enabled only when the load alignment is at least 8-byte.

Fixes: SWDEV-256824

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D90404
2020-11-01 12:05:34 +05:30
Jay Foad 5b91a6a88b [AMDGPU] Allow some modifiers on VOP3B instructions
V_DIV_SCALE_F32/F64 are VOP3B encoded so they can't use the ABS src
modifier, but they can still use NEG and the usual output modifiers.

This partially reverts 3b99f12a4e "AMDGPU: Remove modifiers from v_div_scale_*".

Differential Revision: https://reviews.llvm.org/D90296
2020-10-28 21:54:14 +00:00
Stanislav Mekhanoshin 038d884a50 [AMDGPU] Use flat scratch instructions where available
The support is disabled by default. So far there is instruction
selection, spilling, and frame elimination. It also changes SP
from unswizzled to swizzled as used by flat scratch instructions,
so it cannot be mixed with MUBUF stack access.

At the very least missing:

- GlobalISel;
- Some optimizations in frame elimination in between vector
  and scalar ALU;
- It shall finally allow to always materialize frame index
  as an SGPR, but that is not implemented and frame elimination
  cannot handle it yet;
- Unaligned and/or multidword flat scratch shall work, but it
  is legalized now for MUBUF;
- Operand folding cannot optimize FI like with MUBUF yet;
- It will need scaling the value of the SP/FP in the DWARF
  expression to recover the unswizzled scratch address;

Differential Revision: https://reviews.llvm.org/D89170
2020-10-26 14:40:42 -07:00
Piotr Sobczak c872faf6e0 [AMDGPU] Do not generate S_CMP_LG_U64 on gfx7
S_CMP_LG_U64 was added in gfx8 and is guarded by hasScalarCompareEq64().

Rewrite S_CMP_LG_U64 to S_OR_B32 + S_CMP_LG_U32 for targets that
do not support 64-bit scalar compare.

Differential Revision: https://reviews.llvm.org/D89536
2020-10-19 14:44:31 +02:00
Jay Foad b59d8d7c72 [AMDGPU][GlobalISel] Compute known bits for zero-extending loads
Implement computeKnownBitsForTargetInstr for G_AMDGPU_BUFFER_LOAD_UBYTE
and G_AMDGPU_BUFFER_LOAD_USHORT. This allows generic combines to remove
some unnecessary G_ANDs.

Differential Revision: https://reviews.llvm.org/D89316
2020-10-13 16:22:00 +01:00
Sebastian Neubauer f53b43c00a [AMDGPU] Use isLegalMUBUFImmOffset more
Instead of hardcoding isUInt<12>.

Differential Revision: https://reviews.llvm.org/D88961
2020-10-08 14:31:44 +02:00
Mirko Brkusanin 7c88d13fd1 [AMDGPU] Prefer SplitVectorLoad/Store over expandUnalignedLoad/Store
ExpandUnalignedLoad/Store can sometimes produce unnecessary copies to
temporary stack slot. We should prefer splitting vectors if possible.

Differential Revision: https://reviews.llvm.org/D88882
2020-10-08 10:17:15 +02:00
Rodrigo Dominguez f71f5f39f6 [AMDGPU] Implement hardware bug workaround for image instructions
Summary:
This implements a workaround for a hardware bug in gfx8 and gfx9,
where register usage is not estimated correctly for image_store and
image_gather4 instructions when D16 is used.

Change-Id: I4e30744da6796acac53a9b5ad37ac1c2035c8899

Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D81172
2020-10-07 07:39:52 -04:00
Sebastian Neubauer 6a089ce0e4 [AMDGPU] Use tablegen for argument indices
Use tablegen generic tables to get the index of image intrinsic
arguments.
Before, the computation of which image intrinsic argument is at which
index was scattered in a few places, tablegen, the SDag instruction
selection and GlobalISel. This patch changes that, so only tablegen
contains code to compute indices and the ImageDimIntrinsicInfo table
provides these information.

Differential Revision: https://reviews.llvm.org/D86270
2020-10-05 11:50:52 +02:00
Mirko Brkusanin 8b08fa0103 Revert "[AMDGPU] Reorganize GCN subtarget features for unaligned access"
This reverts commit f5cd7ec9f3.

Certain rocPRIM/rocThrust/hipCUB tests were failing because of this change.
2020-09-29 15:33:34 +02:00
Jay Foad d3a8e333ec [AMDGPU] Reformat SITargetLowering::isSDNodeSourceOfDivergence. NFC. 2020-09-28 14:42:05 +01:00
Sebastian Neubauer 6f7cd16d29 [AMDGPU] Fix v3f16 handling for getresinfo
v3f32 should not be expanded to v4f32. getresinfo with a dmask of 7
created an image sample with a v3f32 return value, which was bitcasted
to a v4f32 in constructRetValue.

Differential Revision: https://reviews.llvm.org/D88206
2020-09-24 16:03:02 +02:00
Matt Arsenault af0207f2ba AMDGPU: Check global FP atomics match default FP mode
We would always select global FP atomics from atomicrmw fadd, although
they have a hardcoded FP mode.
2020-09-23 09:07:50 -04:00
Matt Arsenault 6daddc213f AMDGPU: Don't add frame register to frame pseudos
We no longer treat the frame register like a function argument, so the
problem this avoided is no longer relevant.
2020-09-21 16:18:47 -04:00
Matt Arsenault 3105d0f84b CodeGen: Move split block utility to MachineBasicBlock
AMDGPU needs this in several places, so consolidate them here.
2020-09-18 14:05:18 -04:00
Matt Arsenault 27df165270 Revert "[amdgpu] Lower SGPR-to-VGPR copy in the final phase of ISel."
This reverts commit c3492a1aa1.

I think this is the wrong strategy and wrong place to do this
transform anyway. Also reverts follow up commit
7d593d0d69.
2020-09-18 09:48:33 -04:00
Mirko Brkusanin ae36c02ad0 [AMDGPU] Set DS alignment requirements to be more strict
Alignment requirements for ds_read/write_b96/b128 for gfx9 and onward are
now the same as for other GCN subtargets. This way we can avoid any
unintentional use of these instructions on systems that do not support dword
alignment and instead require natural alignment.
This also makes 'SH_MEM_CONFIG.alignment_mode == STRICT' the default.

Differential Revision: https://reviews.llvm.org/D87821
2020-09-18 15:26:24 +02:00
Bogdan Graur 7d593d0d69 [amdgpu] Compilation fix for Release
Reviewed By: bkramer

Differential Revision: https://reviews.llvm.org/D87838
2020-09-17 18:04:53 +02:00
Michael Liao c3492a1aa1 [amdgpu] Lower SGPR-to-VGPR copy in the final phase of ISel.
- Need to lower COPY from SGPR to VGPR to a real instruction as the
  standard COPY is used where the source and destination are from the
  same register bank so that we potentially coalesc them together and
  save one COPY. Considering that, backend optimizations, such as CSE,
  won't handle them. However, the copy from SGPR to VGPR always needs
  materializing to a native instruction, it should be lowered into a
  real one before other backend optimizations.

Differential Revision: https://reviews.llvm.org/D87556
2020-09-17 11:04:17 -04:00
alex-t 0efbb70b71 [AMDGPU] should expand ROTL i16 to shifts.
Instruction combining pass turns library rotl implementation to llvm.fshl.i16.
In the selection dag the intrinsic is turned to ISD::ROTL node that cannot be selected.
Need to expand it to shifts again.

Reviewed By: rampitec, arsenm

Differential Revision: https://reviews.llvm.org/D87618
2020-09-17 17:34:33 +03:00
Stanislav Mekhanoshin 91f503c3af [AMDGPU] gfx1030 RT support
Differential Revision: https://reviews.llvm.org/D87782
2020-09-16 11:40:58 -07:00
Matt Arsenault 71131db689 AMDGPU: Improve <2 x i24> arguments and return value handling
This was asserting for GlobalISel. For SelectionDAG, this was
passing this on the stack. Instead, scalarize this as if it were a
32-bit vector.
2020-09-16 11:21:56 -04:00
Sebastian Neubauer 833b3b0d3a [AMDGPU] Add v3f16/v3i16 support to SDag
Fix lowering and instruction selection for v3x16 types
and enable InstCombine to emit them.

This patch only implements it for the selection dag.
GlobalISel tests in GlobalISel/llvm.amdgcn.image.load.1d.d16.ll and
GlobalISel/llvm.amdgcn.image.store.2d.d16.ll still don't work.

Differential Revision: https://reviews.llvm.org/D84420
2020-09-16 17:20:27 +02:00
Jay Foad 90777e2924 [AMDGPU] Enable scheduling around FP MODE-setting instructions
Pre-gfx10 all MODE-setting instructions were S_SETREG_B32 which is
marked as having unmodeled side effects, which makes the machine
scheduler treat it as a barrier. Now that we have proper implicit $mode
operands we can use a no-side-effects S_SETREG_B32_mode pseudo instead
for setregs that only touch the FP MODE bits, to give the scheduler more
freedom.

Differential Revision: https://reviews.llvm.org/D87446
2020-09-16 16:10:47 +01:00
Stanislav Mekhanoshin 277de43d88 [AMDGPU] Unify intrinsic ret/nortn interface
We have a single noret intrinsic an a lot of special handling
around it. Declare it just as any other but do not define rtn
instructions itself instead.

Differential Revision: https://reviews.llvm.org/D87719
2020-09-15 15:26:42 -07:00
Craig Topper c193a689b4 [SelectionDAG] Use Align/MaybeAlign in calls to getLoad/getStore/getExtLoad/getTruncStore.
The versions that take 'unsigned' will be removed in the future.

I tried to use getOriginalAlign instead of getAlign in some
places. getAlign factors in the minimum alignment implied by
the offset in the pointer info. Since we're also passing the
pointer info we can use the original alignment.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D87592
2020-09-14 13:54:50 -07:00
Jay Foad 649bde488c [AMDGPU] Simplify S_SETREG_B32 case in EmitInstrWithCustomInserter
NFC.
2020-09-09 15:18:31 +01:00
Mirko Brkusanin 43af2a6faa [AMDGPU] Workaround for LDS Misalignment bug on GFX10
Add subtarget feature check to avoid using ds_read/write_b96/128 with too
low alignment if a bug is present on that specific hardware.
Add this "feature" to GFX 10.1.1 as it is also affected.
Add global-isel test.
2020-09-09 11:46:09 +02:00
Simon Pilgrim 1673a08044 SelectionDAG.h - remove unnecessary FunctionLoweringInfo.h include. NFCI.
Use forward declarations and move the include down to dependent files that actually use it.

This also exposes a number of implicit dependencies on KnownBits.h
2020-09-03 18:33:25 +01:00
Jay Foad 4bdab2e86a [AMDGPU] Fix offset for REL32_HI relocs
The addend in a REL32 reloc needs to be adjusted to account for the
offset from the PC value returned by the s_getpc instruction to the
point where the reloc is applied. This was being done correctly for
(GOTPC)REL32_LO but not for (GOTPC)REL32_HI. This will only make a
difference if the target symbol happens to get loaded almost exactly
a multiple of 4G away from the relocated instructions.

Differential Revision: https://reviews.llvm.org/D86938
2020-09-02 10:55:55 +01:00
Matt Arsenault af1c1e20f4 AMDGPU/GlobalISel: Implement computeKnownBits for groupstaticsize 2020-08-27 19:39:44 -04:00
Matt Arsenault 9d3dc276a6 AMDGPU: Fix broken switch braces 2020-08-27 19:39:39 -04:00
Matt Arsenault 70cd9f5b77 AMDGPU/GlobalISel: Start implementing computeKnownBitsForTargetInstr
Handle workitem intrinsics. There isn't really away to adequately test
this right now, since none of the known bits users are fine grained
enough to test the edge conditions. This triggers a number of
instances of the new 64-bit to 32-bit shift combine in the existing
tests.
2020-08-24 09:53:27 -04:00
Matt Arsenault e1644a3779 GlobalISel: Reduce G_SHL width if source is extension
shl ([sza]ext x, y) => zext (shl x, y).

Turns expensive 64 bit shifts into 32 bit if it does not overflow the
source type:

This is a port of an AMDGPU DAG combine added in
5fa289f0d8. InstCombine does this
already, but we need to do it again here to apply it to shifts
introduced for lowered getelementptrs. This will help matching
addressing modes that use 32-bit offsets in a future patch.

TableGen annoyingly assumes only a single match data operand, so
introduce a reusable struct. However, this still requires defining a
separate GIMatchData for every combine which is still annoying.

Adds a morally equivalent function to the existing
getShiftAmountTy. Without this, we would have to do try to repeatedly
query the legalizer info and guess at what type to use for the shift.
2020-08-24 09:42:40 -04:00
Mirko Brkusanin 0654ff703d [AMDGPU] Use ds_read/write_b96/b128 when possible for SDag
Do not break down local loads and stores so ds_read/write_b96/b128 in
ISelLowering can be selected on subtargets that support them and if align
requirements allow them.

Differential Revision: https://reviews.llvm.org/D84403
2020-08-21 12:26:31 +02:00
Mirko Brkusanin f5cd7ec9f3 [AMDGPU] Reorganize GCN subtarget features for unaligned access
Features UnalignedBufferAccess and UnalignedDSAccess are now used to determine
whether hardware supports such access.
UnalignedAccessMode should be used to enable them.
hasUnalignedBufferAccessEnabled() and hasUnalignedDSAccessEnabled() can be
now used to quickly check both.

Differential Revision: https://reviews.llvm.org/D84522
2020-08-21 12:26:31 +02:00
Mirko Brkusanin 5bd1febe21 [AMDGPU] Fix alignment requirements for 96bit and 128bit local loads and stores
Adjust alignment requirements for ds_read/write_b96/b128.
GFX9 and onwards allow misaligned access for reads and writes but only if
SH_MEM_CONFIG.alignment_mode allows it.
UnalignedDSAccess is set on GCN subtargets from GFX9 onward to let us know if we
can relax alignment requirements.
UnalignedAccessMode acts similary to UnalignedBufferAccess for DS instructions
but only from GFX9 onward and is supposed to match alignment_mode. By default
alignment of 4 is required.

Differential Revision: https://reviews.llvm.org/D82788
2020-08-21 12:26:31 +02:00
Jay Foad 98de0d22f5 [AMDGPU] Apply llvm-prefer-register-over-unsigned from clang-tidy 2020-08-21 10:14:35 +01:00
Michael Liao 5257a60ee0 [amdgpu] Add codegen support for HIP dynamic shared memory.
Summary:
- HIP uses an unsized extern array `extern __shared__ T s[]` to declare
  the dynamic shared memory, which size is not known at the
  compile time.

Reviewers: arsenm, yaxunl, kpyzhov, b-sumner

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D82496
2020-08-20 21:29:18 -04:00
Jay Foad 3497860203 [AMDGPU] Remove uses of Register::isPhysicalRegister/isVirtualRegister
... in favour of the isPhysical/isVirtual methods.
2020-08-20 17:59:11 +01:00
Matt Arsenault e14474a39a AMDGPU/GlobalISel: Select llvm.amdgcn.global.atomic.fadd
Remove the intermediate transform in the DAG path. I believe this is
the last non-deprecated intrinsic that needs handling.
2020-08-12 10:04:53 -04:00
Matt Arsenault 701228c411 AMDGPU: Handle intrinsics in performMemSDNodeCombine
This avoids a possible regression in a future patch
2020-08-12 10:04:53 -04:00
Kerry McLaughlin 85c7e89f3b [CodeGen] Refactor getMemBasePlusOffset & getObjectPtrOffset to accept a TypeSize
Changes the Offset arguments to both functions from int64_t to TypeSize
& updates all uses of the functions to create the offset using TypeSize::Fixed()

Reviewed By: efriedma

Differential Revision: https://reviews.llvm.org/D85220
2020-08-11 12:17:10 +01:00
Matt Arsenault 6fe6b29c29 AMDGPU: Fix assertion in performSHLPtrCombine for 64-bit pointers 2020-08-10 13:46:52 -04:00
Matt Arsenault 3c0597a9e4 AMDGPU: Avoid explicitly listing all the memory nodes 2020-08-07 19:22:46 -04:00
Matt Arsenault 1a0c0944c6 AMDGPU: Define raw/struct variants of buffer atomic fadd
Somehow the new FP atomic buffer intrinsics ended up using the legacy
style for buffer intrinsics.
2020-08-06 13:36:19 -04:00
Matt Arsenault d188a608bd AMDGPU: Fix code duplication between the selectors
Not sure this is the right place for this helper.
2020-08-06 10:42:15 -04:00
Matt Arsenault 6c7f640bf7 AMDGPU/GlobalISel: Implement LLT version of allowsMisalignedMemoryAccesses 2020-08-06 09:50:36 -04:00
Matt Arsenault 0ee1eba581 AMDGPU: Remove ATOMIC_PK_FADD
The f32 and v2f16 cases should be handled the same way.
2020-08-05 22:00:52 -04:00
Matt Arsenault 83eaf5d55d AMDGPU: Eliminate BUFFER_ATOMIC_PK_ADD_F16 node
This is redundant with the other no return buffer atomic node, and we
don't really need a separate type profile for it.
2020-08-05 15:16:51 -04:00
Matt Arsenault 43c0c9252a AMDGPU: Refactor buffer atomic intrinsic lowering
Move raw/struct buffer atomic lowering to separate functions. This
avoids a long nested switch, and simplifies a future patch.
2020-08-05 14:44:55 -04:00
Matt Arsenault 57bd64ff84 Support addrspacecast initializers with isNoopAddrSpaceCast
Moves isNoopAddrSpaceCast to the TargetMachine. It logically belongs
with the DataLayout.
2020-07-31 10:42:43 -04:00
Stanislav Mekhanoshin 5b32518f96 [AMDGPU] Do not use undef on indirect source
We are using undef on the indirect move source subreg and then
using implicit super-reg. This creates a problem in RA when
Greedy decides to split the register. It reassigns the implicit
super-reg but does not bother to change undef source because
it is really does not matter. The fix is to stop lying to RA and
drop undef flag.

This has also hit a problem in SIFoldOperands as it can fold
immediate into an indirect move since there is no undef flag
anymore. That results in multiple test failures, so added the
check for this case.

Differential Revision: https://reviews.llvm.org/D84899
2020-07-30 10:41:59 -07:00
Matt Arsenault c230965ccf AMDGPU: Make saturating add/sub legal for DAG path 2020-07-29 08:27:31 -04:00
Matt Arsenault cdd45d5f9c AMDGPU/GlobalISel: Select llvm.amdgcn.global.atomic.csub
Remove the custom node boilerplate. Not sure why this tried to handle
the LDS atomic stuff.
2020-07-29 08:27:31 -04:00
Matt Arsenault 8860daf0ed AMDGPU: Handle a few missing cases in getAddrModeArguments 2020-07-28 20:22:38 -04:00
Matt Arsenault 8b81d0633f AMDGPU: global_atomic_csub is not always dereferenceable 2020-07-27 18:47:39 -04:00
Sebastian Neubauer 896679733d [AMDGPU] Fix typo. NFC 2020-07-23 17:01:12 +02:00
Matt Arsenault 1168119c2f AMDGPU: Start interpreting byref on kernel arguments
These are treated identically to value aggregates placed in the kernel
argument list. A %struct.foo or %struct.foo addrspace(4)*
byref(sizeof(%struct.foo)) align(alignof(%struct.foo)) argument should
produce the same offsets and argument metadata.

This handles all 3 kernel ABI implementations, and the two HSA
metadata emission paths.
2020-07-21 18:11:22 -04:00
Matt Arsenault 994fb86bc2 AMDGPU: Fix promoting f16 fpowi with legal f16 2020-07-17 11:29:05 -04:00
Matt Arsenault 79f67cae91 AMDGPU: Rename add/sub with carry out instructions
The hardware has created a real mess in the naming for add/sub, which
have been renamed basically every generation. Switch the carry out
pseudos to have the gfx9/gfx10 names. We were using the original SI/CI
v_add_i32/v_sub_i32 names. Later targets reintroduced these names as
carryless instructions with a saturating clamp bit, which we do not
define. Do this rename so we can unambiguously add these missing
instructions.

The carry-in versions should also be renamed, but at least those had a
consistent _u32 name to begin with. The 16-bit instructions were also
renamed, but aren't ambiguous.

This does regress assembler error message quality in some cases. In
mismatched wave32/wave64 situations, this will switch from
"unsupported instruction" to "invalid operand", with the error
pointing at the wrong position. I couldn't quite follow how the
assembler selects these, but the previous behavior seemed accidental
to me. It looked like there was a partial attempt to handle this which
was never completed (i.e. there is an AMDGPUOperand::isBoolReg but it
isn't used for anything).
2020-07-16 13:16:30 -04:00
dstuttar 69a89b54c6 [NFC] Change isFPPredicate comparison to ignore lower bound
Summary:
Since changing the Predicate to be an unsigned enum, the lower bound check for
isFPPredicate no longer needs to check the lower bound, since
it will always evaluate to true.

Also fixed a similar issue in SIISelLowering.cpp by removing the need for
comparing to FIRST and LAST predicates

Added an assert to the isFPPredicate comparison to flag if the
FIRST_FCMP_PREDICATE is ever changed to anything other than 0, in which case the
logic will break.

Without this change warnings are generated in VS.

Change-Id: I358f0daf28c0628c7bda8ad4cab4e1757b761bab

Subscribers: arsenm, jvesely, nhaehnle, hiraditya, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D83540
2020-07-10 11:57:20 +01:00
Carl Ritson 560292fa99 [AMDGPU] Update isFMAFasterThanFMulAndFAdd assumptions
MAD/MAC is no longer always available.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D83207
2020-07-07 15:40:44 +09:00
Stanislav Mekhanoshin f7a7efbf88 [AMDGPU] Tweak getTypeLegalizationCost()
Even though wide vectors are legal they still cost more as we
will have to eventually split them. Not all operations can
be uniformly done on vector types.

Conservatively add the cost of splitting at least to 8 dwords,
which is our widest possible load.

We are more or less lying to cost mode with this change but
this can prevent vectorizer from creation of wide vectors which
results in RA problems for us.

Differential Revision: https://reviews.llvm.org/D83078
2020-07-06 14:07:48 -07:00
Matt Arsenault f25d020c2e AMDGPU/GlobalISel: Add types to special inputs
When passing special ABI inputs, we have no existing context for the
type to use.
2020-07-06 17:00:55 -04:00
Matt Arsenault c19c153e74 AMDGPU: Don't ignore carry out user when expanding add_co_pseudo
This was resulting in a missing vreg def in the use select
instruction.

The output of the pseudo doesn't make sense, since it really shouldn't
have the vreg output in the first place, and instead an implicit scc
def to match the real scalar behavior.

We could have easier to understand tests if we selected scalar
versions of the [us]{add|sub}.with.overflow intrinsics.

This does still end up producing vector code in the end, since it gets
moved later.
2020-07-06 14:28:01 -04:00
Guillaume Chatelet 87e2751cf0 [Alignment][NFC] Use proper getter to retrieve alignment from ConstantInt and ConstantSDNode
This patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790

Differential Revision: https://reviews.llvm.org/D83082
2020-07-03 08:06:43 +00:00
Guillaume Chatelet 3587c9c427 [NFC] Use ADT/Bitfields in Instructions
This is an example patch for D81580.

Differential Revision: https://reviews.llvm.org/D81662
2020-07-03 07:20:22 +00:00
Dmitry Preobrazhensky 1c9d681092 [AMDGPU][CODEGEN] Added support of new inline assembler constraints
Added support for constraints 'I', 'J', 'B', 'C', 'DA', 'DB'.

See https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html#Machine-Constraints.

Reviewers: arsenm, rampitec

Differential Revision: https://reviews.llvm.org/D81651
2020-07-02 17:20:15 +03:00
Guillaume Chatelet 52911428ef [Alignment][NFC] Migrate AMDGPU backend to Align
This patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790

Differential Revision: https://reviews.llvm.org/D82743
2020-06-29 11:56:06 +00:00
Michael Liao 20a1700293 [amdgpu] Fix REL32 relocations with negative offsets.
Summary: - The offset should be treated as a signed one.

Reviewers: rampitec, arsenm

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D82234
2020-06-21 23:09:03 -04:00
Matt Arsenault 95605b784b AMDGPU/GlobalISel: Implement computeKnownAlignForTargetInstr
We probably need to move where intrinsics are lowered to copies to
make this useful.
2020-06-18 17:28:00 -04:00
Matt Arsenault c5c58fd6b5 AMDGPU: Remove intermediate DAG node for trig_preop intrinsic
We weren't doing anything with this, and keeping it would just add
more boilerplate for GlobalISel.
2020-06-16 21:06:25 -04:00
Stanislav Mekhanoshin 9ee272f13d [AMDGPU] Add gfx1030 target
Differential Revision: https://reviews.llvm.org/D81886
2020-06-15 16:18:05 -07:00
Michael Liao ec02635d10 [amdgpu] Skip OR combining on 64-bit integer before legalizing ops.
Reviewers: arsenm, rampitec

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D81710
2020-06-12 15:22:38 -04:00
Sebastian Neubauer 29a6ad94fd [AMDGPU] Add G16 support to image instructions
Add G16 feature for GFX10 and support A16 and G16 in GlobalISel.

Differential Revision: https://reviews.llvm.org/D76836
2020-06-12 11:26:31 +02:00
Stanislav Mekhanoshin 295d1fe733 [AMDGPU] Custom lowering of i64 umulo/smulo
Differential Revision: https://reviews.llvm.org/D81430
2020-06-08 23:14:19 -07:00
Guillaume Chatelet 1778564f91 [Alignment][NFC] Migrate the rest of backends
Summary: This is a followup on D81196

Reviewers: courbet

Subscribers: arsenm, dschuff, jyknight, dylanmckay, sdardis, nemanjai, jvesely, nhaehnle, sbc100, jgravelle-google, hiraditya, aheejin, kbarton, fedor.sergeev, asb, rbar, johnrusso, simoncook, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, atanasyan, rogfer01, MartinMosbeck, brucehoult, the_o, PkmX, jocewei, Jim, lenary, s.egerton, pzheng, sameer.abuasal, apazos, luismarques, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D81278
2020-06-08 07:17:20 +00:00
Stanislav Mekhanoshin 5d62606f90 AMDGPU/GlobalISel: cmp/select method for extract element
Differential Revision: https://reviews.llvm.org/D80749
2020-06-05 12:57:40 -07:00
Matt Arsenault af867b7850 DAG: Change computeKnownBitsForFrameIndex to be usable by GISel
This wasn't getting much value from the DAG or depth arguments, since
it's only called on the frame index root nodes. FrameIndexes can also
only return a scalar value, so it also didn't need DemandedElts.
2020-06-04 10:50:26 -04:00
Matt Arsenault 89d48ccabe AMDGPU: Fix not emitting nofpexcept on fdiv expansion
In this awkward case, we have to emit custom pseudo-constrained FP
wrappers. InstrEmitter concludes that since a mayRaiseFPException
instruction had a chain, it can't add nofpexcept.

Test deferred until mayRaiseFPException is really set on everything.
2020-06-01 14:10:26 -04:00
Matt Arsenault 7ad36491ca AMDGPU: Fix alignment for dynamic allocas
The alignment value also needs to be scaled by the wave size.
2020-06-01 13:06:37 -04:00
Jay Foad 2768edfff1 [AMDGPU] Propagate fast-math flags when lowering FSIN and FCOS
Differential Revision: https://reviews.llvm.org/D80813
2020-05-31 05:21:55 +01:00
Changpeng Fang 234eba90f4 AMDGPU: Add setTruncStoreAction for vector i64 types made legal recently
Reviewers:
  rampitec, arsenm

Differential Revision:
  https://reviews.llvm.org/D80853
2020-05-30 20:45:27 -07:00
Matt Arsenault 0892a96a05 AMDGPU: Optimize s_setreg_b32 to s_denorm_mode/s_round_mode
This is a custom inserter because it was less work than teaching
tablegen a way to indicate that it is sometimes OK to have a no side
effect instruction in the output of a side effecting pattern.

The asm is needed to look like a read of the mode register to prevent
it from being deleted. However, there seems to be a bug where the mode
register def instructions are moved across the asm sideeffect by the
post-RA scheduler.

Another oddity is the immediate is formatted differently between
s_denorm_mode and s_round_mode.
2020-05-29 21:11:36 -04:00
Matt Arsenault f012c58abd AMDGPU: Move MIMG MMO check to verifier 2020-05-29 20:58:23 -04:00
Jay Foad b28d038ff3 [AMDGPU] Better use of llvm::numbers
Tweak a few constant expressions involving numbers::pi etc to avoid
rounding errors. NFCI though it's possible some of these will now be
more accurate in the last bit.
2020-05-29 09:55:36 +01:00
Jay Foad 036d4b0dbf [AMDGPU] Use numbers::pi instead of M_PI. NFC. 2020-05-29 09:55:36 +01:00
Matt Arsenault 97f3f0bab0 AMDGPU: Add intrinsic for s_setreg
This will be more useful with fenv access implemented.
2020-05-28 14:26:38 -04:00
Matt Arsenault 5e007fe998 AMDGPU: Support non-entry block static sized allocas
OpenMP emits these for some reason, so handle them. Assume these use
4096 bytes by default, with a flag to override this. Also change the
related stack assumption for calls to have a flag.
2020-05-27 18:46:10 -04:00
Matt Arsenault d37ce53ad3 AMDGPU: Set StackPointerRegisterToSaveRestore
This will enable selecting non-entry block allocas. Skip the SP write
check in the base isSchedulingBoundary implementation to preserve the
previous scheduling behavior and avoid test churn. It's apparently for
compile time reasons, but if we were to use this more work would be
needed since in some of the failing tests, we seem to incorrectly get
hazard nops inserted.
2020-05-27 13:44:05 -04:00
Matt Arsenault e09064e97f AMDGPU: Update store node checks for atomics
Prepare to switch to using StoreSDNode for atomic stores.
2020-05-26 15:20:03 -04:00
Matt Arsenault 9786e7552d Revert "[AMDGPU] NFC target dependent requiresUniformRegister refactored out"
This reverts commit fb38b98338.

This will regress compile time.
2020-05-26 12:58:18 -04:00
alex-t fb38b98338 [AMDGPU] NFC target dependent requiresUniformRegister refactored out
Summary: Target specific method encapsulated into the Target Lowering Info.

Reviewers: rampitec, vpykhtin

Reviewed By: rampitec

Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D70085
2020-05-26 19:49:20 +03:00
Dmitry Preobrazhensky b087b91c91 [AMDGPU][CODEGEN] Added 'A' constraint for inline assembler
Summary: 'A' constraint requires an immediate int or fp constant that can be inlined in an instruction encoding.

Reviewers: arsenm, rampitec

Differential Revision: https://reviews.llvm.org/D78494
2020-05-25 14:23:34 +03:00
Matt Arsenault 66fe60220c AMDGPU/GlobalISel: Fix masked control flow with fallthrough blocks
Unlike SelectionDAGBuilder, IRTranslator omits the unconditional
branch in fallthrough cases. Confusingly, the control flow pseudos
function in the opposite way the intrinsics are used, and the branch
targets always need to be swapped. We're inverting the target blocks,
so we need to figure out the old fallthrough block and insert a branch
to the original unconditional branch target.
2020-05-22 10:31:44 -04:00
Stanislav Mekhanoshin 1dfd1b3e4b [AMDGPU] Tune threshold for cmp/select vector lowering
It was set in total vector size while the idea was to limit
a number of instructions. Now it started to work with doubles
and thresholds needs to be updated.

Differential Revision: https://reviews.llvm.org/D80322
2020-05-21 08:59:35 -07:00
Stanislav Mekhanoshin 4eecf17164 [AMDGPU] Always expand ext/insertelement with divergent idx
Even though series of cmd/cndmask can produce quite a lot of
code that is still better than a loop. In case of doubles we
would even produce two loops.

Differential Revision: https://reviews.llvm.org/D80032
2020-05-20 15:51:29 -07:00
Matt Arsenault 074b802654 AMDGPU: Fix DAG divergence for implicit function arguments
This should be directly implied from the register class, and there's
no need to special case live ins here. This was getting the wrong
answer for the queue ptr argument in callable functions, since it's
not an explicit IR argument and is always uniform.

Fixes not using scalar loads for the aperture in addrspacecast
lowering, and any other places that use implicit SGPR arguments.
2020-05-19 18:11:34 -04:00
Matt Arsenault 681a161ff5 AMDGPU: Remove outdated comment 2020-05-18 12:06:16 -04:00
Stanislav Mekhanoshin 9d4cf5bd42 [AMDGPU] Make v16f64/v16i64 legal
This allows indirect VGPR addressing to work.

Differential Revision: https://reviews.llvm.org/D79960
2020-05-14 14:46:55 -07:00
Christopher Tetreault 3254a001fc [SVE] Remove usages of VectorType::getNumElements() from AMDGPU
Reviewers: efriedma, arsenm, david-arm, fpetrogalli

Reviewed By: efriedma

Subscribers: dmgreen, arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, tschuett, hiraditya, rkruppe, psnobl, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D79807
2020-05-13 15:57:55 -07:00
Matt Arsenault 704b539f65 AMDGPU: Use Register 2020-05-13 15:31:54 -04:00
Stanislav Mekhanoshin 71ed66d97f [AMDGPU] Make v4i64/v4f64/v8i64/v8f64 legal
We can produce such vectors in the Promote Alloca pass,
but we are unable to use movrel to operate it and lower
via scratch. Making it legal makes SI_INDIRECT patterns
work.

There is more work to do in subsequent changes:

1. We initialize m0 twice to access each dword. It shall
be possible to only do it once and increment base register
number instead.
2. We also need v16i64/v16f64 but these first need to be
added to tablegen.

Differential Revision: https://reviews.llvm.org/D79808
2020-05-12 16:05:12 -07:00
Saiyedul Islam 117e5609e9 [AMDGPU] Reserving VGPR for future SGPR Spill
Summary: One VGPR register is allocated to handle a future spill of SGPR if "--amdgpu-reserve-vgpr-for-sgpr-spill" option is used

Reviewers: arsenm, rampitec, msearles, cdevadas

Reviewed By: arsenm

Subscribers: madhur13490, qcolombet, kerbowa, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits

Tags: #amdgpu, #llvm

Differential Revision: https://reviews.llvm.org/D70379
2020-05-12 00:33:00 +00:00
Matt Arsenault 78a43f10c7 AMDGPU: Don't assert on unknown address spaces
Assume unknown address spaces behave like some flavor of global
memory.
2020-05-08 12:57:27 -04:00
Matt Arsenault fda0c8df28 AMDGPU: Lower addrspacecast to 32-bit constant
Somehow this was missing from the DAG path, but not global isel.
2020-05-08 10:46:00 -04:00
alex-t 5b898bddff [AMDGPU] Enable carry out ADD/SUB operations divergence driven instruction selection.
Summary: This change enables all kind of carry out ISD opcodes to be selected according to the node divergence.

Reviewers: rampitec, arsenm, vpykhtin

Reviewed By: rampitec

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D78091
2020-05-04 16:42:25 +03:00
Craig Topper a58b62b4a2 [IR] Replace all uses of CallBase::getCalledValue() with getCalledOperand().
This method has been commented as deprecated for a while. Remove
it and replace all uses with the equivalent getCalledOperand().

I also made a few cleanups in here. For example, to removes use
of getElementType on a pointer when we could just use getFunctionType
from the call.

Differential Revision: https://reviews.llvm.org/D78882
2020-04-27 22:17:03 -07:00
Jay Foad cca6bc42d9 [AMDGPU] Use RegClass helper functions in getRegForInlineAsmConstraint.
This avoids more long lists of register classes that have to be updated
every time we add a new one. NFC.

Differential Revision: https://reviews.llvm.org/D78570
2020-04-23 12:26:52 +01:00
Jay Foad 0337017a9f [AMDGPU] Use SGPR instead of SReg classes
12994a70cf did this for 128-bit classes:

    SGPR_128 only includes the real allocatable SGPRs, and SReg_128 adds
    the additional non-allocatable TTMP registers. There's no point in
    allocating SReg_128 vregs. This shrinks the size of the classes
    regalloc needs to consider, which is usually good.

This patch extends it to all classes > 64 bits, for consistency.

Differential Revision: https://reviews.llvm.org/D78622
2020-04-23 11:45:22 +01:00
Jay Foad dbdffe3ee9 [AMDGPU] Add 192-bit register classes
Differential Revision: https://reviews.llvm.org/D78312
2020-04-22 13:10:37 +01:00
Jay Foad d625b4b081 [AMDGPU] Add missing AReg classes
Add 96-bit, 160-bit and 256-bit AReg classes to match VReg and SReg.
NFC as far as I know, but it may avoid weird legalization problems.

Differential Revision: https://reviews.llvm.org/D78348
2020-04-22 13:10:37 +01:00
Jay Foad 7318625674 [AMDGPU] Remove obsolete special case for 1024-bit vector types. NFC. 2020-04-22 09:05:24 +01:00
Matt Arsenault f463792506 AMDGPU: Remove custom node for RSQ_LEGACY
Directly select from the intrinsic. This wasn't getting much value
from the custom node.
2020-04-17 19:50:36 -04:00
Matt Arsenault 0ba40d4ccf AMDGPU/GlobalISel: Combines for V_CVT_F32_UBYTE[0-3]
Ports the existing DAG combines, minus the simplify demanded bits
which seems to have no equivalent now. Without these, this isn't
particularly helpful in most of the IR sample cases.
2020-04-13 19:18:19 -04:00
Craig Topper 113f37a1f9 [CallSite removal][TargetLowering] Replace ImmutableCallSite with CallBase
Differential Revision: https://reviews.llvm.org/D77995
2020-04-13 13:50:15 -07:00
Matt Arsenault e6605a209c DAG: Fix wrong legality check for ISD::FMAD
Since 1725f28841, this should check
isFMADLegalForFAddFSub rather than the the plain isOperationLegal.

This would assert in a subset of cases due to an oddity in how FMAD is
selected. We will allow FMA formation pre-legalize, but not FMAD even
in cases where it would be valid.

The current hook requires passing in the root fadd/fsub. However, in
this distributed case, this would be far more complicated to pass in
the relevant operand. AMDGPU doesn't get any value from the node, and
only needs the type and is the only implementor, so I'm not sure why
we have this complexity. Just rename and expand the assert to avoid
the more complicated checks spread through the distribution logic.
2020-04-13 10:25:39 -07:00
Craig Topper 95192f548d [CallSite removal][TargetLowering] Use CallBase instead of CallSite in TargetLowering::ParseConstraints interface.
Differential Revision: https://reviews.llvm.org/D77929
2020-04-12 11:26:25 -07:00
Stanislav Mekhanoshin f96810ff34 [AMDGPU] Expand vector trunc stores from i16 to i8
Differential Revision: https://reviews.llvm.org/D77693
2020-04-07 21:47:45 -07:00
Matt Arsenault 869f05c834 AMDGPU: Remove dead paths for requiresUniformRegister
The extracts from control flow intrinsics are already properly handled
by divergence analysis. The inline asm case isn't dead, but has also
never really worked correctly so leave it as-is for now.
2020-04-06 16:15:10 -04:00
Apelete Seketeli 8aadb442d1 [scan-build] fix dead store warnings emitted on LLVM AMDGPU code base
This fixes dead store warnings of the type "dead assignment" reported
by Clang Static Analyzer.
2020-04-05 11:19:03 -04:00
Matt Arsenault 178050c3ba AMDGPU: Use Register in more places 2020-04-03 14:52:54 -04:00
Guillaume Chatelet c9d5c19597 [Alignment][NFC] Transitionning more getMachineMemOperand call sites
Summary:
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790

Reviewers: courbet

Subscribers: arsenm, dylanmckay, sdardis, nemanjai, jvesely, nhaehnle, hiraditya, kbarton, jrtc27, atanasyan, Jim, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D77121
2020-03-31 08:36:18 +00:00
Sebastian Neubauer 5d3a69feca [AMDGPU] New llvm.amdgcn.ballot intrinsic
Add a new llvm.amdgcn.ballot intrinsic modeled on the ballot function
in GLSL and other shader languages. It returns a bitfield containing the
result of its boolean argument in all active lanes, and zero in all
inactive lanes.

This is intended to replace the existing llvm.amdgcn.icmp and
llvm.amdgcn.fcmp intrinsics after a suitable transition period.

Use the new intrinsic in the atomic optimizer pass.

Differential Revision: https://reviews.llvm.org/D65088
2020-03-31 10:35:39 +02:00
Matt Arsenault db9f0d1ce5 AMDGPU: Form v_cvt_ubyte* with f16 results
We get 2 conversion instructions anyway. Previously we would get a
conversion with SDWA reading from a byte source, which has a larger
encoding.
2020-03-30 17:59:49 -04:00
Matt Arsenault 570a578e46 AMDGPU: Account for dmask when computing image mem size
Only the number of elements in the dmask will really be accessed.
2020-03-30 17:30:58 -04:00
Jakub Kuderski 77ce2e21a8 [AMDGPU] Add Relocation Constant Support
Summary:
This change adds amdgcn.reloc.constant intrinsic to the amdgpu backend, which will compile into a relocation entry in the resulting elf.

The intrinsics takes a MetadataNode (String) as its only argument, which specifies the symbol name of the relocation entry.

`SelectionDAGBuilder::getValueImpl` is changed to allow metadata operands passed through to ISel.

Author: csyonghe <yonghe@google.com>

Reviewers: tpr, nhaehnle

Reviewed By: nhaehnle

Subscribers: arsenm, kzhuravl, jvesely, wdng, yaxunl, dstuttard, t-tye, hiraditya, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D76440
2020-03-30 13:49:20 -04:00
Matt Arsenault ab7a41069e AMDGPU: Fix using wrong instruction for FP conversion
This was was never actually hit, but FTRUNC was clearly not the intent
here.
2020-03-29 14:03:07 -04:00
Matt Arsenault 97bbe7ad2a AMDGPU: Fix typo 2020-03-29 14:03:06 -04:00
Matt Arsenault a950e3beef AMDGPU: Move towards deprecating alignbit intrinsic
This is equivalent to llvm.fshr, so legalize the intrinsic to the
generic node.
2020-03-20 11:03:04 -04:00
Scott Linder 0e9368cc8c [AMDGPU] Move frame pointer from s34 to s33
Remove the gap left between the stack pointer (s32) and frame pointer
(s34) now that the scratch wave offset is no longer a part of the
calling convention ABI.

Update llvm/docs/AMDGPUUsage.rst to reflect the change.

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D75657
2020-03-19 15:35:16 -04:00
Scott Linder 60b1967c39 [AMDGPU] Add Scratch Wave Offset to Scratch Buffer Descriptor in entry functions
Add the scratch wave offset to the scratch buffer descriptor (SRSrc) in
the entry function prologue. This allows us to removes the scratch wave
offset register from the calling convention ABI.

As part of this change, allow the use of an inline constant zero for the
SOffset of MUBUF instructions accessing the stack in entry functions
when a frame pointer is not requested/required. Entry functions with
calls still need to set up the calling convention ABI stack pointer
register, and reference it in order to address arguments of called
functions. The ABI stack pointer register remains unswizzled, but is now
wave-relative instead of queue-relative.

Non-entry functions also use an inline constant zero SOffset for
wave-relative scratch access, but continue to use the stack and frame
pointers as before. When the stack or frame pointer is converted to a
swizzled offset it is now scaled directly, as the scratch wave offset no
longer needs to be subtracted first.

Update llvm/docs/AMDGPUUsage.rst to reflect these changes to the calling
convention.

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D75138
2020-03-19 15:35:16 -04:00
Matt Arsenault 4ea1baf6a0 AMDGPU: Initial, crude support for indirect calls
This isn't really usable, and requires using the
-amdgpu-fixed-function-abi flag to work.

Assumes a uniform call target, and will hit a verifier error if the
call target ends up in a VGPR. Also doesn't attempt to do anything
sensible for the reported register/stack usage.
2020-03-18 12:03:48 -04:00
Matt Arsenault 015b640be4 AMDGPU: Add flag to used fixed function ABI
Pass all arguments to every function, rather than only passing the
minimum set of inputs needed for the call graph.
2020-03-13 13:27:05 -07:00
Matt Arsenault bb8622094d AMDGPU: Don't handle kernarg.segment.ptr in functions
Just lower this to null. Pass implicitarg.ptr in its place in the
argument list.
2020-03-13 12:51:12 -07:00
alex-t 39e1a90784 [AMDGPU] SI_INDIRECT_DST_V* pseudos expansion should place EXEC restore to separate basic block
Summary:
When SI_INDIRECT_DST_V* pseudos has indexes in VGPR, they get expanded into the self-looped basic block that modifies EXEC in a loop.

To keep EXEC consistent it is stored before and then re-stored after the pseudo expansion result.

%95:vreg_512 = SI_INDIRECT_DST_V16 %93:vreg_512(tied-def 0), %94:sreg_32, 0, killed %1500:vgpr_32

results to

    s_mov_b64 s[6:7], exec
BB0_16:
    v_readfirstlane_b32 s8, v28
    v_cmp_eq_u32_e32 vcc, s8, v28
    s_and_saveexec_b64 vcc, vcc
    s_set_gpr_idx_on s8, gpr_idx(DST)
    v_mov_b32_e32 v6, v25
    s_set_gpr_idx_off
    s_xor_b64 exec, exec, vcc
    s_cbranch_execnz BB0_16
; %bb.17:
    s_mov_b64 exec, s[6:7]

The bug appeared in case this expansion occurs in the ELSE block of the CF.

Originally

  %110:vreg_512 = SI_INDIRECT_DST_V16 %103:vreg_512(tied-def 0), %85:vgpr_32, 0, %107:vgpr_32,
   %112:sreg_64 = SI_ELSE %108:sreg_64, %bb.19, 0, implicit-def dead $exec, implicit-def dead $scc, implicit $exec

expanded to

         ******************   <== here exec has "THEN" context

    s_mov_b64 s[6:7], exec
BB0_16:
    v_readfirstlane_b32 s8, v28
    v_cmp_eq_u32_e32 vcc, s8, v28
    s_and_saveexec_b64 vcc, vcc
    s_set_gpr_idx_on s8, gpr_idx(DST)
    v_mov_b32_e32 v6, v25
    s_set_gpr_idx_off
    s_xor_b64 exec, exec, vcc
    s_cbranch_execnz BB0_16
; %bb.17:
    s_or_saveexec_b64 s[4:5], s[4:5]   <-- exec mask is restored for "ELSE" but immediately overwritten.
    s_mov_b64 exec, s[6:7]

The rest of the "ELSE" block is executed not by the workitems which constitute the "else mask" but by those which constitute "then mask"

SILowerControlFlow::emitElse always considers the basic block begin() as an insertion point for s_or_saveexec.

Proposed fix:  The SI_INDIRECT_DST_V* procedure should split the reminder block to create landing pad for the EXEC restoration.

Reviewers: rampitec, vpykhtin, nhaehnle

Reviewed By: vpykhtin

Subscribers: arsenm, kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D75472
2020-03-10 14:04:22 +03:00
Jay Foad 11d1573bb6 [APFloat] Make use of new overloaded comparison operators. NFC.
Reviewers: ekatz, spatel, jfb, tlively, craig.topper, RKSimon, nikic, scanon

Subscribers: arsenm, jvesely, nhaehnle, hiraditya, dexonsmith, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D75744
2020-03-06 16:42:53 +00:00
Simon Pilgrim e2f0093800 [AMDGPU] performCvtF32UByteNCombine - revisit node after src operand simplification.
If SimplifyDemandedBits succeeds in simplifying the byte src, add the CVT_F32_UBYTE node back to the worklist as we might be able to simplify further.

Yet another step towards removing SelectionDAG::GetDemandedBits.
2020-03-04 11:25:50 +00:00
Simon Pilgrim 4af8db317d [AMDGPU] performCvtF32UByteNCombine - add SHL and SimplifyMultipleUseDemandedBits support
This is part of the work to remove SelectionDAG::GetDemandedBits and just use SimplifyMultipleUseDemandedBits.

Recent experiments raised some v_cvt_f32_ubyte*_e32 regressions, so I've added some additional abilities to performCvtF32UByteNCombine to help unpack byte data more aggressively.

We still don't remove all OR(SHL,SRL) patterns as some of the regenerated nodes don't get combined again, but we are getting closer.

Differential Revision: https://reviews.llvm.org/D74786
2020-02-19 11:45:57 +00:00
Matt Arsenault e240b27d6d AMDGPU/GlobalISel: Allow arbitrary global values
Treat unknown address spaces as global
2020-02-17 11:32:28 -08:00
Matt Arsenault 8c2c0b3637 AMDGPU: Improve i16/v2i16 bswap 2020-02-14 09:53:22 -08:00
Matt Arsenault bfe3779459 AMDGPU: Use v_perm_b32 to implement bswap
Also greatly improve i64 lowering. LegalizeIntegerTypes does the
correct narrowing if i64 isn't legal. Just workaround this for
SelectionDAG by making i64 legal and splitting in the patterns.
2020-02-13 09:45:31 -08:00
Matt Arsenault 86f9117d47 AMDGPU: Don't report 2-byte alignment as fast
This is apparently worse than 1-byte alignment. This does not attempt
to decompose 2-byte aligned wide stores, but will stop trying to
produce them.

Also fix bug in LoadStoreVectorizer which was decreasing the alignment
and vectorizing stack accesses. It was assuming a stack object was an
alloca that could have its base alignment changed, which is not true
if the pointer is derived from a function argument.
2020-02-11 18:35:00 -05:00
Matt Arsenault 7af7b96a9b AMDGPU: Move R600 test compatability hack
Instead of handling the r600 intrinsics on amdgcn, handle the amdgcn
intrinsics on r600.
2020-02-10 10:02:06 -08:00
Stanislav Mekhanoshin ed3527c648 [AMDGPU] Split R600 and GCN subregs
These are generated and do not need to have the same values.
We are defining separate subregs for R600 and GCN but then
using AMDGPU subregs on R600.

Differential Revision: https://reviews.llvm.org/D74248
2020-02-10 08:29:56 -08:00
Sebastian Neubauer 8756869170 [AMDGPU] Add a16 feature to gfx10
Based on D72931

This adds a new feature called A16 which is enabled for gfx10.
gfx9 keeps the R128A16 feature so it can share all the instruction encodings
with gfx7/8.

Differential Revision: https://reviews.llvm.org/D73956
2020-02-10 09:04:23 +01:00
Changpeng Fang 884acbb9e1 AMDGPU: Enhancement on FDIV lowering in AMDGPUCodeGenPrepare
Summary:
  The accuracy limit to use rcp is adjusted to 1.0 ulp from 2.5 ulp.
Also, afn instead of arcp is used to allow inaccurate rcp to be used.

Reviewers:
  arsenm

Differential Revision: https://reviews.llvm.org/D73588
2020-02-07 11:46:23 -08:00
Guillaume Chatelet f85d3408e6 [NFC] Introduce an API for MemOp
Summary: This patch introduces an API for MemOp in order to simplify and tighten the client code.

Reviewers: courbet

Subscribers: arsenm, nemanjai, jvesely, nhaehnle, hiraditya, kbarton, jsji, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D73964
2020-02-07 11:32:27 +01:00
Stanislav Mekhanoshin cacc3b7a55 [AMDGPU] Cleanup assumptions about generated subregs
We are using countPopulation on a LaneBitmask to determine
a number of registers it covers. This is the assumption which
does not necessarily need to be true. It is not changed but
factored into a single call SIRegisterInfo::getNumCoveredRegs().

Some other places are cleaned up with respect to assumptions
about subreg indexes values and tablegen behavior.

Differential Revision: https://reviews.llvm.org/D74177
2020-02-06 17:39:24 -08:00
Matt Arsenault 03a2d0045d AMDGPU: Add compile time hack for hasCFUser
Assume the control flow intrinsic results are never casted, and early
exit based on the type.
2020-02-06 11:41:34 -08:00
Sebastian Neubauer 163e33b290 [AMDGPU] Fix lowering a16 image intrinsics
scalar_to_vector takes only one argument, not two.
The a16 tests now also check the packing of coordinates into registers

Differential Revision: https://reviews.llvm.org/D73482
2020-02-05 10:54:34 +01:00
Sebastian Neubauer 3bc7ffdaab [AMDGPU] Use v3f32 type in image instructions
This should lower the amount of used registers for gfx9.

I updated some of the changed tests with the update script because
changing them by hand is tedious.

Differential Revision: https://reviews.llvm.org/D73884
2020-02-05 10:35:41 +01:00
Matt Arsenault 9260d01faa AMDGPU: Correct memory size for image intrinsics
This was incorrectly rounding up to the next power of 2. v4f32 was
rounding up to v8f32, which was just wrong. There are also v3i16/v3f16
available in MVT, so we don't even need to round the f16 cases
anymore. Additionally, this field is really an EVT so we don't even
need to consider this.

Also switch some asserts to return invalid. We should have an IR
verifier for these intrinsic return types, but for now it's better to
not assert on IR that passes the verifier.

This should also probably be fixed to consider that dmask is really
eliminating some of the loaded components.
2020-02-04 22:29:23 -05:00
Matt Arsenault 1024b73ef5 AMDGPU: Split denormal mode tracking bits
Prepare to accurately track the future denormal-fp-math attribute
changes. The way to actually set these separately is not wired in yet.

This is just a mechanical change, and mostly still assumes the input
and output mode match. This should be refined for some cases. For
example, fcanonicalize lowering should use the flushing variant if
either input or output flushing is enabled
2020-02-04 10:44:21 -08:00
Guillaume Chatelet b8144c0536 [NFC] Encapsulate MemOp logic
Summary:
This patch simply introduces functions instead of directly accessing the fields.
This helps introducing additional check logic. A second patch will add simplifying functions.

Reviewers: courbet

Subscribers: arsenm, nemanjai, jvesely, nhaehnle, hiraditya, kbarton, jsji, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D73945
2020-02-04 10:36:26 +01:00
Matt Arsenault cb7b661d3d AMDGPU: Analyze divergence of inline asm 2020-02-03 12:42:16 -08:00
Matt Arsenault 726446a009 AMDGPU: Fix splitting wide f32 s.buffer.load intrinsics
This would witch f32 to i32, and produce an invald concat_vectors from
i32 pieces to an f32 vector.
2020-02-03 12:28:08 -08:00
Guillaume Chatelet 333f2ad8b8 [Alignment][NFC] Use Align for getMemcpy/Memmove/Memset
Summary:
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790

Reviewers: courbet

Subscribers: arsenm, dschuff, jyknight, sdardis, nemanjai, jvesely, nhaehnle, sbc100, jgravelle-google, hiraditya, aheejin, kbarton, fedor.sergeev, asb, rbar, johnrusso, simoncook, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, atanasyan, rogfer01, MartinMosbeck, brucehoult, the_o, PkmX, jocewei, jsji, Jim, lenary, s.egerton, pzheng, sameer.abuasal, apazos, luismarques, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D73885
2020-02-03 17:13:19 +01:00