Commit Graph

5948 Commits

Author SHA1 Message Date
Konstantin Zhuravlyov 844012940e AMDGPU: Add isBranch=1 to SOPP branch instructions
Differential Revision: https://reviews.llvm.org/D99955
2021-04-06 10:59:30 -04:00
Jay Foad efc7bf27f5 [AMDGPU] SIFoldOperands: use MachineRegisterInfo::hasOneNonDBGUser
NFC.
2021-04-06 15:23:58 +01:00
Jay Foad 005dcd196e [AMDGPU] SIFoldOperands: use range-based loops and make_early_inc_range
NFC.
2021-04-06 15:23:58 +01:00
Jay Foad ce9cca6c3a [AMDGPU] SIFoldOperands: rename tryFoldInst to tryFoldCndMask
This follows the pattern of the other tryFold* functions. NFC.
2021-04-06 15:23:58 +01:00
Jay Foad cf4f5292f6 [AMDGPU] SIFoldOperands: use getVRegDef instead of getUniqueVRegDef
We are in SSA so getVRegDef is equivalent but simpler. NFC.
2021-04-06 15:23:58 +01:00
Jay Foad e9608a84d8 [AMDGPU][SDag] Add IMG init also for image_gather4 instructions
This fixes an oversight in D99747 which moved the IMG init code from
SIAddIMGInit to AdjustInstrPostInstrSelection, but did not set the
hasPostISelHook flag on gather4 instructions.

Differential Revision: https://reviews.llvm.org/D99953
2021-04-06 14:47:20 +01:00
Dmitry Preobrazhensky 3eadcb86ab [AMDGPU][MC][GFX9] Corrected SMEM decoding
Corrected SMEM decoding when IMM=0 and OFFSET>127

Fixed bug 49819 (https://bugs.llvm.org/show_bug.cgi?id=49819)

Differential Revision: https://reviews.llvm.org/D99804
2021-04-06 14:10:46 +03:00
Jay Foad fdc4f19e2f [AMDGPU] Remove SIAddIMGInit pass which is now unused
Differential Revision: https://reviews.llvm.org/D99748
2021-04-01 18:13:17 +01:00
Jay Foad 3d07a6d891 [AMDGPU][GlobalISel] Add IMG init in selectImageIntrinsic
Doing this during instruction selection avoids the cost of running
SIAddIMGInit which is yet another pass over the MIR.

Differential Revision: https://reviews.llvm.org/D99670
2021-04-01 18:13:17 +01:00
Jay Foad 4af6251cea [AMDGPU][SDag] Add IMG init in AdjustInstrPostInstrSelection
Doing this in a post-isel hook avoids the cost of running SIAddIMGInit
which is yet another pass over the MIR.

Differential Revision: https://reviews.llvm.org/D99747
2021-04-01 18:13:17 +01:00
Jay Foad b1fbfd9e4c [AMDGPU] Small cleanup to constructRetValue and its caller. NFC. 2021-04-01 16:36:16 +01:00
Brendon Cahoon 65c8bfb509 [AMDGPU] Enable output modifiers for double precision instructions
Update SIFoldOperands pass to recognize v_add_f64 and v_mul_f64
instructions for folding output modifiers.

Differential Revision: https://reviews.llvm.org/D99505
2021-04-01 10:08:17 -04:00
Dmitry Preobrazhensky cd953434f2 [AMDGPU][MC][GFX10][GFX90A] Corrected _e32/_e64 suffices
Fixed bugs https://bugs.llvm.org//show_bug.cgi?id=49643, https://bugs.llvm.org//show_bug.cgi?id=49644, https://bugs.llvm.org//show_bug.cgi?id=49645.

Differential Revision: https://reviews.llvm.org/D99413
2021-04-01 14:21:00 +03:00
Dmitry Preobrazhensky 0f5ebbcc7f [AMDGPU][MC] Added flag to identify VOP instructions which have a single variant
By convention, VOP1/2/C instructions which can be promoted to VOP3 have _e32 suffix while promoted instructions have _e64 suffix. Instructions which have a single variant should have no _e32/_e64 suffix. Unfortunately there was no simple way to identify single variant instructions - it was implemented by a hack. See bug https://bugs.llvm.org/show_bug.cgi?id=39086.

This fix simplifies handling of single VOP instructions by adding a dedicated flag.

Differential Revision: https://reviews.llvm.org/D99408
2021-04-01 13:53:12 +03:00
Sander de Smalen 2f6f249a49 NFC: Change getIntrinsicInstrCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Depends on D97468

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D97469
2021-03-31 14:04:41 +01:00
Sander de Smalen 2f56e1c6b1 NFC: Change getTypeBasedIntrinsicCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Depends on D97466

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D97468
2021-03-31 14:04:41 +01:00
Jay Foad 5d0e9ddfa5 [AMDGPU][GlobalISel] Add support for global atomicrmw fadd
This includes gfx908 which only has a no-return version of the
global_atomic_add_f32 instruction, using the same hack that was
previously implemented for selecting from the
llvm.amdgcn.global.atomic.fadd intrinsic.

Differential Revision: https://reviews.llvm.org/D97767
2021-03-31 11:13:00 +01:00
Tomas Matheson a9968c0a33 [NFC][CodeGen] Tidy up TargetRegisterInfo stack realignment functions
Currently needsStackRealignment returns false if canRealignStack returns false.
This means that the behavior of needsStackRealignment does not correspond to
it's name and description; a function might need stack realignment, but if it
is not possible then this function returns false. Furthermore,
needsStackRealignment is not virtual and therefore some backends have made use
of canRealignStack to indicate whether a function needs stack realignment.

This patch attempts to clarify the situation by separating them and introducing
new names:

 - shouldRealignStack - true if there is any reason the stack should be
   realigned

 - canRealignStack - true if we are still able to realign the stack (e.g. we
   can still reserve/have reserved a frame pointer)

 - hasStackRealignment = shouldRealignStack && canRealignStack (not target
   customisable)

Targets can now override shouldRealignStack to indicate that stack realignment
is required.

This change will make it easier in a future change to handle the case where we
need to realign the stack but can't do so (for example when the register
allocator creates an aligned spill after the frame pointer has been
eliminated).

Differential Revision: https://reviews.llvm.org/D98716

Change-Id: Ib9a4d21728bf9d08a545b4365418d3ffe1af4d87
2021-03-30 17:31:39 +01:00
Sebastian Neubauer 1c3b74f0ab [AMDGPU] Remove outdated TODOs. NFC
spillSGPRToVGPR is already respected in these places since D95768.

Differential Revision: https://reviews.llvm.org/D99570
2021-03-30 15:18:49 +02:00
Stanislav Mekhanoshin 619b88849e [AMDGPU] Fix "Sequence" spelling. NFC. 2021-03-29 12:11:36 -07:00
Joe Nash 45fd7c02af Revert "[AMDGPU] Mark additional VOP3 as commutable"
This reverts commit d35d8da7d6.
2021-03-29 14:48:11 -04:00
Joe Nash d35d8da7d6 [AMDGPU] Mark additional VOP3 as commutable
Note, only src0 and src1 will be commuted if the isCommutable flag
is set. This patch does not change that, it just makes it possible
to commute src0 and src1 of more instructions.

Reviewed By: foad, rampitec

Differential Revision: https://reviews.llvm.org/D99376

Change-Id: I61e20490962d95ea429beb355c55f55c024dafdc
2021-03-29 14:22:20 -04:00
Jay Foad 9d08f276d7 [AMDGPU] Use reductions instead of scans in the atomic optimizer
If the result of an atomic operation is not used then it can be more
efficient to build a reduction across all lanes instead of a scan. Do
this for GFX10, where the permlanex16 instruction makes it viable. For
wave64 this saves a couple of dpp operations. For wave32 it saves one
readlane (which are generally bad for performance) and one dpp
operation.

Differential Revision: https://reviews.llvm.org/D98953
2021-03-26 15:38:14 +00:00
Jay Foad d92b4956d6 [AMDGPU] Inline FSHRPattern into its only use. NFC. 2021-03-26 09:32:02 +00:00
Konstantin Zhuravlyov f4ace63737 AMDGPU: Add target id and code object v4 support
- Add target id support (https://clang.llvm.org/docs/ClangOffloadBundler.html#target-id)
  - Add code object v4 support (https://llvm.org/docs/AMDGPUUsage.html#elf-code-object)
    - Add kernarg_size to kernel descriptor
    - Change trap handler ABI to no longer move queue pointer into s[0:1]
  - Cleanup ELF definitions
    - Add V2, V3, V4 suffixes to make a clear distinction for code object version
    - Consolidate note names

Differential Revision: https://reviews.llvm.org/D95638
2021-03-24 11:54:05 -04:00
Sander de Smalen 55d18b3cc2 [TTI] Return a TypeSize from getRegisterBitWidth.
This patch changes the interface to take a RegisterKind, to indicate
whether the register bitwidth of a scalar register, fixed-width vector
register, or scalable vector register must be returned.

Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D98874
2021-03-24 14:45:13 +00:00
alex-t dccf83acf9 [AMDGPU] SIOptimizeExecMaskingPreRA should check constant bus constraint when folds EXEC copy
Folding EXEC copy into it's single use may lead to constant bus constraint violation as it adds one more SGPR operand.
         This change makes it validate the user instruction with the new SGPR operand and only fold it if it is legal.

Reviewed By: rampitec, arsenm

Differential Revision: https://reviews.llvm.org/D98888
2021-03-24 14:14:13 +03:00
Jay Foad fd142e6c18 [AMDGPU] Simplify AMDGPUAnnotateUniformValues::visitBranchInst. NFC.
A BranchInst is always the terminator of its containing BasicBlock.
2021-03-23 16:54:43 +00:00
Joe Nash 538bda0b80 [AMDGPU] Refactor DPPCombine
NFC. Extract IsShrinkable into a helper function, and
make Subtarget a member variable.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D99099

Change-Id: If4bc97a88a9ae4eb1df47e717345d46a6ed515bf
2021-03-23 11:53:53 -04:00
Jay Foad fc7e3e7dd9 [AMDGPU] Set SchedRW on real instructions
Coyp SchedRW from pseudos to real instructions so that llvm-mca has
access to it. This is NFC for normal compiler codegen, which schedules
pseudos not real instructions.

Add an llvm-mca test for some high latency double-precision instructions
as a smoke test.

Differential Revision: https://reviews.llvm.org/D99187
2021-03-23 15:38:11 +00:00
Matt Arsenault b24436ac96 GlobalISel: Lower funnel shifts 2021-03-23 09:11:17 -04:00
Pushpinder Singh d0e5422eb8 [GlobalISel][AMDGPU] Lower G_UMULO/G_SMULO
Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D93963
2021-03-23 05:45:43 +00:00
Carl Ritson 64db6b8d37 [AMDGPU] Only unbundle memory accesses in SIMemoryLegalizer
This restores previous behaviour and is a step toward removing
unbundling entirely.

Reviewed By: foad, rampitec

Differential Revision: https://reviews.llvm.org/D99061
2021-03-23 11:30:36 +09:00
Matt Arsenault 1dd23c6d53 AMDGPU: Allow tail calls for amdgpu_gfx functions 2021-03-22 10:55:19 -04:00
Matt Arsenault a0f5aad6d7 AMDGPU: Fix allowing immediates for tail call pseudo.
The pseudo was using SSrc_b64, so it allowed folding immediates into
the destination operand for a tail call to null. However, this is not
a valid operand for the s_setpc_b64 this will be lowered to. Avoids
printing the operand as an invalid immediate.

Avoids a regression when tail calls are enabled in GlobalISel (somehow
tail calls to null get deleted in the DAG).
2021-03-21 13:14:04 -04:00
Matt Arsenault 6314a72730 AMDGPU/GlobalISel: Enable CSE in pre-legalizer combiner 2021-03-21 10:07:37 -04:00
Carl Ritson 6c9cac5da1 [AMDGPU] Add MDT update missing from D98915 2021-03-20 13:38:58 +09:00
Carl Ritson fe5f4c397f [AMDGPU] Rename SIInsertSkips Pass
Pass no longer handles skips.  Pass now removes unnecessary
unconditional branches and lowers early termination branches.
Hence rename to SILateBranchLowering.

Move code to handle returns to epilog from SIPreEmitPeephole
into SILateBranchLowering. This means SIPreEmitPeephole only
contains optional optimisations, and all required transforms
are in SILateBranchLowering.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D98915
2021-03-20 11:48:04 +09:00
Carl Ritson 5df2af8b0e [AMDGPU] Merge SIRemoveShortExecBranches into SIPreEmitPeephole
SIRemoveShortExecBranches is an optimisation so fits well in the
context of SIPreEmitPeephole.

Test changes relate to early termination from kills which have now
been lowered prior to considering branches for removal.
As these use s_cbranch the execz skips are now retained instead.
Currently either behaviour is valid as kill with EXEC=0 is a nop;
however, if early termination is used differently in future then
the new behaviour is the correct one.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D98917
2021-03-20 11:26:42 +09:00
Carl Ritson b76c09023d [AMDGPU] Allow index optimisation in SIPreEmitPeephole for bundles
Add code so duplication index register changes can be removed from
inside bundles.

Reviewed By: rampitec, foad

Differential Revision: https://reviews.llvm.org/D98940
2021-03-20 10:26:23 +09:00
Stanislav Mekhanoshin 57effe2205 [AMDGPU] Remove dead glc1 handing in asm parser. NFC. 2021-03-19 08:37:47 -07:00
Jay Foad 5a5a531214 [AMDGPU] Remove some redundant code. NFC.
This is redundant because we have already checked that we can't handle
divergent 64-bit atomic operands.
2021-03-19 11:36:15 +00:00
Jay Foad 5dd5ddcb41 [AMDGPU] Skip building some IR if it won't be used. NFC. 2021-03-19 11:36:14 +00:00
Jay Foad c96dfe0d8b [AMDGPU] Sink Intrinsic::getDeclaration calls to where they are used. NFC. 2021-03-19 11:36:14 +00:00
Stanislav Mekhanoshin edd6da10d2 [AMDGPU] Remove cpol, tfe, and swz from MUBUF patterns
These are always selected as 0 anyway.

Differential Revision: https://reviews.llvm.org/D98663
2021-03-18 14:36:04 -07:00
Stanislav Mekhanoshin 961e4384f4 [AMDGPU] Support SCC on buffer atomics
Differential Revision: https://reviews.llvm.org/D98731
2021-03-18 09:56:14 -07:00
Stanislav Mekhanoshin 3f37c28230 [AMDGPU] Remove unused template parameters of MUBUF_Real_AllAddr_vi
Differential Revision: https://reviews.llvm.org/D98804
2021-03-18 09:02:38 -07:00
Jon Chesterfield 253f804deb [amdgpu] Update med3 combine to skip i64
[amdgpu] Update med3 combine to skip i64

Fixes an assumption that a type which is not i32 will be i16. This asserts
when trying to sign/zero extend an i64 to i32.

Test case was cut down from an openmp application. Variations on it are hit by
other combines before reaching the problematic one, e.g. replacing the
immediate values with other function arguments changes the codegen path and
misses this combine.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D98872
2021-03-18 15:56:41 +00:00
Matt Arsenault b9a0384983 GlobalISel: Preserve source value information for outgoing byval args
Pass through the original argument IR value in order to preserve the
aliasing information in the memcpy memory operands.
2021-03-18 09:16:54 -04:00
Carl Ritson 1a4bc3aba3 [AMDGPU] Avoid unnecessary graph visits during WQM marking
Avoid revisiting nodes with the same set of defined lanes by
using a unified visited set which integrates lanes into the key.
This retains the intent of the original code by still revisiting
a subgraph if a different set of lanes is defined and hence
marking might progress differently.

Note: default size of the visited set has been confirmed to
cover >99% of invocations in large array of test shaders.

Reviewed By: piotr

Differential Revision: https://reviews.llvm.org/D98772
2021-03-18 10:00:41 +09:00
David Green e2935dcfc4 [TTI] Add a Mask to getShuffleCost
This adds an Mask ArrayRef to getShuffleCost, so that if an exact mask
can be provided a more accurate cost can be provided by the backend.
For example VREV costs could be returned by the ARM backend. This should
be an NFC until then, laying the groundwork for that to be added.

Differential Revision: https://reviews.llvm.org/D98206
2021-03-17 17:46:26 +00:00
Jay Foad 967b64beb4 [AMDGPU] Split dot2-insts feature
Split out some of the instructions predicated on the dot2-insts target
feature into a new dot7-insts, in preparation for subtargets that have
some but not all of these instructions. NFCI.

Differential Revision: https://reviews.llvm.org/D98717
2021-03-17 09:42:21 +00:00
RamNalamothu 43f2d269b3 [AMDGPU, NFC] Refactor FP/BP spill index code in emitPrologue/emitEpilogue
Reviewed By: scott.linder

Differential Revision: https://reviews.llvm.org/D98617
2021-03-16 19:19:45 +05:30
Dmitry Preobrazhensky 596db9934b [AMDGPU][MC] Disabled lds_direct for GFX90a
Fixed bug 49382.

Differential Revision: https://reviews.llvm.org/D98626
2021-03-16 13:52:36 +03:00
Stanislav Mekhanoshin bc27a31801 [AMDGPU] Fix copyPhysReg to not produce unalined vgpr access
RA can insert something like a sub1_sub2 COPY of a wide VGPR
tuple which results in the unaligned acces with v_pk_mov_b32
after the copy is expanded. This is regression after D97316.

Differential Revision: https://reviews.llvm.org/D98549
2021-03-15 14:14:30 -07:00
Stanislav Mekhanoshin c297709ee1 [AMDGPU] Fixed msan failure with uninitialized value 2021-03-15 13:58:19 -07:00
Stanislav Mekhanoshin 3bffb1cd0e [AMDGPU] Use single cache policy operand
Replace individual operands GLC, SLC, and DLC with a single cache_policy
bitmask operand. This will reduce the number of operands in MIR and I hope
the amount of code. These operands are mostly 0 anyway.

Additional advantage that parser will accept these flags in any order unlike
now.

Differential Revision: https://reviews.llvm.org/D96469
2021-03-15 13:00:59 -07:00
Jon Chesterfield 13e49dcee4 [amdgpu] Implement lower function LDS pass
[amdgpu] Implement lower function LDS pass

Local variables are allocated at kernel launch. This pass collects global
variables that are used from non-kernel functions, moves them into a new struct
type, and allocates an instance of that type in every kernel. Uses are then
replaced with a constantexpr offset.

Prior to this pass, accesses from a function are compiled to trap. With this
pass, most such accesses are removed before reaching codegen. The trap logic
is left unchanged by this pass. It is still reachable for the cases this pass
misses, notably the extern shared construct from hip and variables marked
constant which survive the optimizer.

This is of interest to the openmp project because the deviceRTL runtime library
uses cuda shared variables from functions that cannot be inlined. Trunk llvm
therefore cannot compile some openmp kernels for amdgpu. In addition to the
unit tests attached, this patch applied to ROCm llvm with fixed-abi enabled
and the function pointer hashing scheme deleted passes the openmp suite.

This lowering will use more LDS than strictly necessary. It is intended to be
a functionally correct fallback for cases that are difficult to target from
future optimisation passes.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D94648
2021-03-15 15:24:01 +00:00
Carl Ritson 13877db2fa [AMDGPU] Fix shortfalls in WQM marking
When tracking defined lanes through phi nodes in the live range
graph each branch of the phi must be handled independently.
Also rewrite the marking algorithm to reduce unnecessary
operations.

Previously a shared set of defined lanes was used which caused
marking to stop prematurely. This was observable in existing lit
tests, but test patterns did not cover this detail.

Reviewed By: piotr

Differential Revision: https://reviews.llvm.org/D98614
2021-03-15 21:44:15 +09:00
Jay Foad 5d48b45ce3 [AMDGPU] Use depth first iterator instead of recursive DFS. NFCI.
The reason for this is to avoid deep recursion in DFS() which can cause
stack overflow on large CFGs, especially on Windows.

Differential Revision: https://reviews.llvm.org/D98528
2021-03-15 10:32:55 +00:00
Stanislav Mekhanoshin 315ebe0df3 [AMDGPU] Fix getAlignedAGPRClassID
Not all register classes were listed.

Differential Revision: https://reviews.llvm.org/D98550
2021-03-12 14:41:50 -08:00
Matt Arsenault 6b76d82853 GlobalISel: Fix marking byval arguments as immutable
byval arguments need to be assumed writable. Only implicitly stack
passed arguments which aren't addressable in the IR can be assumed
immutable.

Mips is still broken since for some reason its doing its own thing
with the ValueHandlers (and x86 doesn't actually handle byval
arguments now, although some of the code is there).
2021-03-12 09:01:53 -05:00
Matt Arsenault 3231d2b581 AMDGPU/GlobalISel: Cleanup call lowering sequence
Now that handleAssignments is handling all of the argument splitting,
we don't have to move the insert point around.
2021-03-12 09:01:52 -05:00
Carl Ritson f08dadd242 [AMDGPU] Do not annotate an else branch if there is a kill
As llvm.amdgcn.kill is lowered to a terminator it can cause
else branch annotations to end up in the wrong block.
Do not annotate conditionals as else branches where there is
a kill to avoid this.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D97427
2021-03-12 11:52:08 +09:00
Carl Ritson c07f2025e4 [AMDGPU] Restrict image_msaa_load to MSAA dimension types
This instruction is only valid on 2D MSAA and 2D MSAA Array
surfaces.  Remove intrinsic support for other dimension types,
and block assembly for unsupported dimensions.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D98397
2021-03-12 09:47:24 +09:00
Ruiling Song e8e6817d00 [AMDGPU] Don't check hasStackObjects() when reserving VGPR
We have amdgpu_gfx functions that have high register pressure. If
we do not reserve VGPR for SGPR spill, we will fall into the path
to spill the SGPR to memory, which does not only have correctness issue,
but also have really bad performance.

I don't know why there is the check for hasStackObjects(), in our case,
we don't have stack objects at the time of finalizeLowering(). So just
remove the check that we always reserve a VGPR for possible SGPR spill
in non-entry functions.

Reviewed by: arsenm

Differential Revision: https://reviews.llvm.org/D98345
2021-03-12 08:11:14 +08:00
Ruiling Song 4cee5cad28 [AMDGPU] Free reserved VGPR if no SGPR spill
I met some code generation behavior change when I tried to remove
the hasStackObject() check when reserving VGPR for SGPR spill.
For example, the function `callee_no_stack_no_fp_elim_all` in the lit
test file `callee-frame-setup.ll`.
The generated code changed from:
```
  s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
  s_mov_b32 s4, s33
  s_mov_b32 s33, s32
  s_mov_b32 s33, s4
  s_setpc_b64 s[30:31]
```

into something like:
```
  s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
  v_writelane_b32 v63, s33, 0
  s_mov_b32 s33, s32
  v_readlane_b32 s33, v63, 0
  s_setpc_b64 s[30:31]
```

I think we still prefer the old version where only scalar instructions are needed.
The idea here is free the reserved VGPR if no SGPR spills. So we will very likely
to use a free SGPR for fp/sp spill.

Reviewed by: arsenm

Differential Revision: https://reviews.llvm.org/D98344
2021-03-12 08:11:14 +08:00
Stanislav Mekhanoshin 6e8a0213a3 [AMDGPU] Remove dead MTBUF patterns
These patterns are obviously dead, they are using format
operand which is not selected and we have no corresponding
SelectMUBUF() function.

Differential Revision: https://reviews.llvm.org/D98451
2021-03-11 14:13:00 -08:00
Matt Arsenault 70cb57d7da AMDGPU/GlobalISel: Improve private addressing mode matching
This enables the look-through-copy to hack around not correctly
regbankselecting constants to match the use bank.
2021-03-11 10:23:35 -05:00
Nikita Popov 46354bac76 [OpaquePtrs] Remove some uses of type-less CreateLoad APIs (NFC)
Explicitly pass loaded type when creating loads, in preparation
for the deprecation of these APIs.

There are still a couple of uses left.
2021-03-11 14:40:57 +01:00
Ruiling Song 66340846b3 [AMDGPU] Always create Stack Object for reserved VGPR
As we may overwrite inactive lanes of a caller-save-vgpr, we should
always save/restore the reserved vgpr for sgpr spill.

Reviewed by: arsenm

Differential Revision: https://reviews.llvm.org/D98319
2021-03-11 10:06:07 +08:00
Stanislav Mekhanoshin 9931b1f7a4 [AMDGPU] Disable SCC bit on fp atomics
Differential Revision: https://reviews.llvm.org/D98221
2021-03-10 12:36:09 -08:00
Stanislav Mekhanoshin 574a9dabc6 [AMDGPU] Always expand system scope fp atomics on gfx90a
FP atomics in system scope cannot be used and shall always
be expanded in a CAS loop.

Differential Revision: https://reviews.llvm.org/D98085
2021-03-10 12:35:23 -08:00
Jay Foad 70f013fd3b [AMDGPU] Fix isReallyTriviallyReMaterializable for V_MOV_*
D57708 changed SIInstrInfo::isReallyTriviallyReMaterializable to reject
V_MOVs with extra implicit operands, but it accidentally rejected all
V_MOVs because of their implicit use of exec. Fix it but avoid adding a
moderately expensive call to MI.getDesc().getNumImplicitUses().

In real graphics shaders this changes quite a few vgpr copies into move-
immediates, which is good for avoiding stalls on GFX10.

Differential Revision: https://reviews.llvm.org/D98347
2021-03-10 16:18:12 +00:00
Jay Foad 288ea820cf [AMDGPU] Refactor AMDGPUTargetStreamer::EmitCodeEnd
Refactor and add comments to explain where the magic numbers come from
in terms of the instruction cache line size. NFC.

Differential Revision: https://reviews.llvm.org/D98266
2021-03-09 19:02:18 +00:00
Christudasan Devadasan 24c0ad7143 [AMDGPU] Fix the dead frame indices during custom spill lowering.
AMDGPU target tries to handle the SGPR and VGPR spills in a
custom pass before the actual frame lowering pass. Once they
are handled and the respective frames are eliminated in the
custom pass, certain uses of them still remain. For instance,
the DBG_VALUE instructions inserted by the allocator alongside
the spill instruction will use the corresponding frame index.
They become dead later during PEI and causes a crash while trying to
replace the frame indices. We should possibly avoid this custom pass.
For now, replacing such dead references with null register value.

Reviewed By: arsenm, scott.linder

Differential Revision: https://reviews.llvm.org/D98038
2021-03-09 23:22:49 +05:30
Ruiling Song 67a05f4e09 [AMDGPU] Remove unused function opcodeEmitsNoInsts()
This was missed in the patch D97545, and cause buildbot failure.

Reviewed by: critson

Differential Revision: https://reviews.llvm.org/D98229
2021-03-09 10:48:30 +08:00
Ruiling Song f0ccdde3c9 [AMDGPU] Remove SI_MASK_BRANCH
This is already deprecated, so remove code working on this.
Also update the tests by using S_CBRANCH_EXECZ instead of SI_MASK_BRANCH.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D97545
2021-03-09 09:13:23 +08:00
Jay Foad 99682bc039 Revert "Revert "[AMDGPU] Restore the s_memtime instruction in gfx1030""
This reverts commit e58d68fcd0.

This reinstates commit fc28f600e5
with a fix to initialize HasShaderCyclesRegister. See
https://reviews.llvm.org/D97928.
2021-03-06 09:00:01 +00:00
Mitch Phillips e58d68fcd0 Revert "[AMDGPU] Restore the s_memtime instruction in gfx1030"
Broke the ASan/MSan buildbots. See more comments in the original patch,
https://reviews.llvm.org/D97928.

Build failure at http://lab.llvm.org:8011/#/builders/5/builds/5327

This reverts commit fc28f600e5.
2021-03-05 18:24:59 -08:00
Jay Foad fc28f600e5 [AMDGPU] Restore the s_memtime instruction in gfx1030
gfx1030 added a new way to implement readcyclecounter using the
SHADER_CYCLES hardware register, but the s_memtime instruction still
exists, so the MC layer should still accept it and the
llvm.amdgcn.s.memtime intrinsic should still work.

Differential Revision: https://reviews.llvm.org/D97928
2021-03-05 20:19:11 +00:00
RamNalamothu 3998a8e797 [AMDGPU] Do not attempt sgpr spills to vgpr, when it is disabled
This covers a path missed in https://reviews.llvm.org/D95768.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D98013
2021-03-05 22:47:21 +05:30
Sebastian Neubauer e0e73714fb [AMDGPU] Keep skip branch for ds instructions
Same as other memory instructions, ds instructions add latency even if
exec is zero. Jumping over them if exec=0 is cheaper than executing
them.
With this change, the branch instruction that skips over a basic block
if exec=0 is not removed when the block contains a ds instruction.

Differential Revision: https://reviews.llvm.org/D97922
2021-03-05 12:34:09 +01:00
Petar Avramovic 36beaa3ba3 Reland AMDGPU/GlobalISel: Combine zext(trunc x) to x after RegBankSelect
Recommit bf5a582650. Depends on
4c8fb7ddd6 which was reverted.

RegBankSelect creates zext and trunc when it selects banks for uniform i1.
Add zext_trunc_fold from generic combiner to post RegBankSelect combiner.

Differential Revision: https://reviews.llvm.org/D95432
2021-03-05 11:05:37 +01:00
Jay Foad ed7458398a [AMDGPU] Don't check for VMEM hazards on GFX10
The hazard where a VMEM reads an SGPR written by a VALU counts as a data
dependency hazard, so no nops are required on GFX10. Tested with Vulkan
CTS on GFX10.1 and GFX10.3.

Differential Revision: https://reviews.llvm.org/D97926
2021-03-04 21:44:56 +00:00
Nico Weber e68de60bc4 Revert "AMDGPU/GlobalISel: Combine zext(trunc x) to x after RegBankSelect"
This reverts commit bf5a582650.
Also depends on now-reverted 4c8fb7ddd6
2021-03-04 10:16:11 -05:00
Petar Avramovic bf5a582650 AMDGPU/GlobalISel: Combine zext(trunc x) to x after RegBankSelect
RegBankSelect creates zext and trunc when it selects banks for uniform i1.
Add zext_trunc_fold from generic combiner to post RegBankSelect combiner.

Differential Revision: https://reviews.llvm.org/D95432
2021-03-04 15:05:24 +01:00
Stanislav Mekhanoshin b70c483e04 [AMDGPU] Exclude always_inline from max bb threshold
Honor always_inline attribute when processing -amdgpu-inline-max-bb.
It was lost during the ports of the heuristic. There is no reason
to honor inline hint, but not always inline.

Differential Revision: https://reviews.llvm.org/D97790
2021-03-03 10:21:56 -08:00
Matt Arsenault 78dcff4841 GlobalISel: Add default implementation of assignValueToReg
Refactor insertion of the asserting ops. This enables using them for
AMDGPU.

This code should essentially be the same for every target. Mips, X86
and ARM all have different code there now, but this seems to be an
accident. The assignment functions are called with different types
than they would be in the DAG, so this is all likely an assortment of
hacks to get around that.
2021-03-03 09:29:53 -05:00
Piotr Sobczak 4672bac177 [AMDGPU] Introduce Strict WQM mode
* Add amdgcn_strict_wqm intrinsic.
* Add a corresponding STRICT_WQM machine instruction.
* The semantic is similar to amdgcn_strict_wwm with a notable difference that not all threads will be forcibly enabled during the computations of the intrinsic's argument, but only all threads in quads that have at least one thread active.
* The difference between amdgc_wqm and amdgcn_strict_wqm, is that in the strict mode an inactive lane will always be enabled irrespective of control flow decisions.

Reviewed By: critson

Differential Revision: https://reviews.llvm.org/D96258
2021-03-03 14:19:16 +01:00
Piotr Sobczak c3ce7bae80 [AMDGPU] Rename amdgcn_wwm to amdgcn_strict_wwm
* Introduce the new intrinsic amdgcn_strict_wwm
 * Deprecate the old intrinsic amdgcn_wwm

The change is done for consistency as the "strict"
prefix will become an important, distinguishing factor
between amdgcn_wqm and amdgcn_strictwqm in the future.

The "strict" prefix indicates that inactive lanes do not
take part in control flow, specifically an inactive lane
enabled by a strict mode will always be enabled irrespective
of control flow decisions.

The amdgcn_wwm will be removed, but doing so in two steps
gives users time to switch to the new name at their own pace.

Reviewed By: critson

Differential Revision: https://reviews.llvm.org/D96257
2021-03-03 09:33:57 +01:00
Carl Ritson 2ddac69f98 [AMDGPU] Rename llvm.amdgcn.msaa.load to llvm.amdgcn.msaa.load.x
While the underlying instruction is called image_msaa_load,
the resource must be x component only.
Rename the intrinsic for clarity.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D97829
2021-03-03 17:30:39 +09:00
Matt Arsenault fd82cbcf7d GlobalISel: Merge and cleanup more AMDGPU call lowering code
This merges more AMDGPU ABI lowering code into the generic call
lowering. Start cleaning up by factoring away more of the pack/unpack
logic into the buildCopy{To|From}Parts functions. These could use more
improvement, and the SelectionDAG versions are significantly more
complex, and we'll eventually have to emulate all of those cases too.

This is mostly NFC, but does result in some minor instruction
reordering. It also removes some of the limitations with mismatched
sizes the old code had. However, similarly to the merge on the input,
this is forcing gfx6/gfx7 to use the gfx8+ ABI (which is what we
actually want, but SelectionDAG is stuck using the weird emergent
ABI).

This also changes the load/store size for stack passed EVTs for
AArch64, which makes it consistent with the DAG behavior.
2021-03-02 17:31:13 -05:00
Amara Emerson 8a316045ed [AArch64][GlobalISel] Enable use of the optsize predicate in the selector.
To do this while supporting the existing functionality in SelectionDAG of using
PGO info, we add the ProfileSummaryInfo and LazyBlockFrequencyInfo analysis
dependencies to the instruction selector pass.

Then, use the predicate to generate constant pool loads for f32 materialization,
if we're targeting optsize/minsize.

Differential Revision: https://reviews.llvm.org/D97732
2021-03-02 12:55:51 -08:00
Joe Nash 5531f24cc2 [AMDGPU] Make OMod explicit for V_CVT_{U,I}*
Make OMod explicit instead of implied by HasModifiers in the
operand list. Requires explicitly setting HasOMod=1 for
irregular OMod usage in instruction V_CVT_{U,I}*

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D97587

Change-Id: I230e1476f529e816eec60e242531f23a99e3839f
2021-03-02 13:32:06 -05:00
Simon Pilgrim 25b788716b [AMDGPU] Fix "initialization is never read" clang-tidy warnings. NFCI. 2021-03-02 12:06:24 +00:00
Dmitry Preobrazhensky 28f164bca7 [AMDGPU][MC][GFX9+] Corrected encoding of op_sel_hi for unused operands in VOP3P
Corrected encoding of VOP3P op_sel_hi for unused operands. See bug 49363.

Differential Revision: https://reviews.llvm.org/D97689
2021-03-02 13:02:25 +03:00
Stanislav Mekhanoshin 7c724a896f [AMDGPU] Do not check max-bb for a single block callee
-amdgpu-inline-max-bb option could lead to a suboptimal
codegen preventing inlining of really simple functions
including pure wrapper calls. Relax the cutoff by allowing
to call a function with a single block on the grounds
that it will not increase total number of blocks after
inlining.

Differential Revision: https://reviews.llvm.org/D97744
2021-03-01 19:48:50 -08:00
Jay Foad 796a60d2ea [AMDGPU] New intrinsic void llvm.amdgcn.s.sethalt(i32)
The expected use case is for frontends to insert this into
shaders that are to be run under a debugger. The shader can
then be resumed or single stepped from the point of the call
under debugger control.

Differential Revision: https://reviews.llvm.org/D97670
2021-03-01 14:30:23 +00:00
Jay Foad 48ca5d3398 [AMDGPU] Simplify SITargetLowering::isSDNodeSourceOfDivergence. NFC.
Check for read-modify-write AtomicSDNodes instead of using an exhaustive
list of ISD opcodes.

Differential Revision: https://reviews.llvm.org/D97671
2021-03-01 14:22:08 +00:00
Matt Arsenault 6c260d3bc0 GlobalISel: Move splitToValueTypes to generic code
I copied the nearly identical function from AArch64 into AMDGPU, so
fix this duplication.

Mips and X86 have their own more exotic versions which should be
removed. However replacing those is better left for a separate patch
since it requires other changes to avoid regressions.
2021-03-01 08:58:18 -05:00
Matt Arsenault 81b2c23b77 AMDGPU: Use kill instruction to hint soft clause live ranges
Previously we would use a bundle to hint the register allocator to not
overwrite the pointers in a sequence of loads to avoid breaking soft
clauses. This bundling was based on a fuzzy register pressure
heuristic, so we could not guarantee using more registers than are
really available. This would result in register allocator failing on
unsatisfiable bundles. Use a kill to artificially extend the live
ranges, so we can always succeed at register allocation even if it
means extra spills in the worst case.

This seems to capture most of the benefit of the bundle while avoiding
most of the risk presented by the bundle. However the lit tests do
show a handful of regressions. In some cases with sequences of
volatile loads, unused load components end up getting reallocated to
the next load which forces a wait between. There are also a few small
scheduling regressions where a hazard used to be avoided, and one
spill torture test which for some reason nearly doubles the stack
usage. There is also a bit of noise from leftover kills (it may make
sense for post-RA pseudos to strip all of these out).
2021-02-26 18:26:40 -05:00
Stanislav Mekhanoshin 799c50fe93 [AMDGPU] Avoid second rescheduling for some regions
If a region was not constrained by a high register pressure
and was not rescheduled without clustering we can skip
rescheduling it ClusteredLowOccupancyReschedule stage.

This improves scheduling speed by 25% on some kernels.

Differential Revision: https://reviews.llvm.org/D97506
2021-02-26 12:29:37 -08:00
Stanislav Mekhanoshin 635993f07b [AMDGPU] Skip unclusterd rescheduling w/o ld/st
We are attempting rescheduling without load store clustering
if occupancy limits were not met with clustering. Skip this
for regions which do not have any loads or stores at all.

In a set of kernels I am experimenting with this improves
scheduling time by ~30%.

Differential Revision: https://reviews.llvm.org/D97342
2021-02-26 12:29:03 -08:00
Jay Foad dc2259537a [AMDGPU] Add selection pattern for v_xnor_b32
This allows GlobalISel to use this instruction where available. I assume
SelectionDAG always selects s_xnor_b32 so it isn't affected by this
change.

Differential Revision: https://reviews.llvm.org/D97560
2021-02-26 16:41:47 +00:00
Jay Foad 3ad5216ed8 [AMDGPU] Better codegen for i64 bitreverse
Differential Revision: https://reviews.llvm.org/D97547
2021-02-26 15:51:36 +00:00
Michael Liao 0d4e12e3c1 [amdgpu] Atomic should be source of divergence.
Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D97392
2021-02-24 15:27:47 -05:00
Matt Arsenault 589223e044 AMDGPU: Remove special case in shouldCoalesce
Unaligned registers are now constrained with classes, rather than
specially reserving a subset of the whole class.
2021-02-24 14:49:44 -05:00
Matt Arsenault 78b6d73a93 AMDGPU: Add even aligned VGPR/AGPR register classes
gfx90a operations require even aligned registers, but this was
previously achieved by reserving registers inside the full class.

Ideally this would be captured in the static instruction definitions
for the operands, and we would have different instructions per
subtarget. The hackiest part of this is we need to manually reassign
AGPR register classes after instruction selection (we get away without
this for VGPRs since those types are actually registered for legal
types).
2021-02-24 14:49:37 -05:00
Jay Foad aab709f090 [AMDGPU] Add more PAL metadata register names
Add all the registers that are currently used by
LLPC: https://github.com/GPUOpen-Drivers/llpc

This only affects disassembly of PAL metadata generated by LLPC and
similar frontends.

Differential Revision: https://reviews.llvm.org/D95619
2021-02-24 13:37:05 +00:00
Jay Foad 67f0620831 [AMDGPU] Update s_sendmsg messages
Update the list of s_sendmsg messages known to the assembler and
disassembler and validate the ones that were added or removed in gfx9
and gfx10.

Differential Revision: https://reviews.llvm.org/D97295
2021-02-24 13:07:00 +00:00
Stanislav Mekhanoshin d1b92c91af [AMDGPU] Set threshold for regbanks reassign pass
This is to limit compile time. I did experiments with some
inputs and found that compile time keeps reasonable for this
pass if we have less than 100000 virtual registers and then
starts to explode somewhere between 100000 and 150000.

Differential Revision: https://reviews.llvm.org/D97218
2021-02-23 10:22:31 -08:00
Nicolai Hähnle 52bc2e7577 [AMDGPU][SelectionDAG] Don't combine uniform multiplies to MUL_[UI]24
Prefer to keep uniform (non-divergent) multiplies on the scalar ALU when
possible. This significantly improves some game cases by eliminating
v_readfirstlane instructions when the result feeds into a scalar
operation, like the address calculation for a scalar load or store.

Since isDivergent is only an approximation of whether a value is in
SGPRs, it can potentially regress some situations where a uniform value
ends up in a VGPR. These should be rare in real code, although the test
changes do contain a number of examples.

Most of the test changes are just using s_mul instead of v_mul/mad which
is generally better for both register pressure and latency (at least on
GFX10 where sgpr pressure doesn't affect occupancy and vector ALU
instructions have significantly longer latency than scalar ALU). Some
R600 tests now use MULLO_INT instead of MUL_UINT24.

GlobalISel appears to handle more scenarios in the desirable way,
although it can also be thrown off and fails to select the 24-bit
multiplies in some cases.

Alternative solution considered and rejected was to allow selecting
MUL_[UI]24 to S_MUL_I32. I've rejected this because the definition of
those SD operations works is don't-care on the most significant 8 bits,
and this fact is used in some combines via SimplifyDemandedBits.

Based on a patch by Nicolai Hähnle.

Differential Revision: https://reviews.llvm.org/D97063
2021-02-23 15:39:19 +00:00
David Green dd2dbf7ee2 [TTI] Change getOperandsScalarizationOverhead to take Type args
As a followup to D95291, getOperandsScalarizationOverhead was still
using a VF as a vector factor if the arguments were scalar, and would
assert on certain matrix intrinsics with differently sized vector
arguments. This patch removes the VF arg, instead passing the Types
through directly. This should allow it to more accurately compute the
cost without having to guess at which operands will be vectorized,
something difficult with more complex intrinsics.

This adjusts one SVE test as it is now calling the wrong intrinsic vs
veccall. Without invalid InstructCosts the cost of the scalarized
intrinsic is too low. This should get fixed when the cost of
scalarization is accounted for with scalable types.

Differential Revision: https://reviews.llvm.org/D96287
2021-02-23 13:04:59 +00:00
David Green bd4b61efbd [CostModel] Remove VF from IntrinsicCostAttributes
getIntrinsicInstrCost takes a IntrinsicCostAttributes holding various
parameters of the intrinsic being costed. It can either be called with a
scalar intrinsic (RetTy==Scalar, VF==1), with a vector instruction
(RetTy==Vector, VF==1) or from the vectorizer with a scalar type and
vector width (RetTy==Scalar, VF>1). A RetTy==Vector, VF>1 is considered
an error. Both of the vector modes are expected to be treated the same,
but because this is confusing many backends end up getting it wrong.

Instead of trying work with those two values separately this removes the
VF parameter, widening the RetTy/ArgTys by VF used called from the
vectorizer. This keeps things simpler, but does require some other
modifications to keep things consistent.

Most backends look like this will be an improvement (or were not using
getIntrinsicInstrCost). AMDGPU needed the most changes to keep the code
from c230965ccf working. ARM removed the fix in
dfac521da1, webassembly happens to get a fixup for an SLP cost
issue and both X86 and AArch64 seem to now be using better costs from
the vectorizer.

Differential Revision: https://reviews.llvm.org/D95291
2021-02-23 13:03:26 +00:00
Stanislav Mekhanoshin bb16efe280 [AMDGPU] Move RPT::getLiveRegs() check under EXPENSIVE_CHECKS
This is too expensive even for debug builds. It doubles
scheduling time if enabled.

Differential Revision: https://reviews.llvm.org/D97232
2021-02-22 15:21:59 -08:00
Dmitry Preobrazhensky 4813518092 [AMDGPU][MC] Corrected bound_ctrl for compatibility with sp3
Enabled "bound_ctrl:1" and disabled "bound_ctrl:-1" syntax.
Corrected printer to output "bound_ctrl:1" instead of "bound_ctrl:0".
See bug 35397 for detailed issue description.

Differential Revision: https://reviews.llvm.org/D97048
2021-02-22 14:59:40 +03:00
madhur13490 3c297a2564 Make fixed-abi default for AMD HSA OS
fixed-abi uses pre-defined and predictable
SGPR/VGPRs for passing arguments. This patch makes
this scheme default when HSA OS is specified in triple.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D96340
2021-02-19 15:05:25 +00:00
Carl Ritson 8181dcd30f [AMDGPU] WQM/WWM: Fix marking of partial definitions
Track lanes when processing definitions for marking WQM/WWM.
If all lanes have been defined then marking can stop.
This prevents marking unnecessary instructions as WQM/WWM.

In particular this fixes a bug where values passing through
V_SET_INACTIVE would me marked as requiring WWM.

Reviewed By: piotr

Differential Revision: https://reviews.llvm.org/D95503
2021-02-19 20:45:24 +09:00
Matt Arsenault 62d946e133 GlobalISel: Merge some AMDGPU ABI lowering code to generic code
AMDGPU currently has a lot of pre-processing code to pre-split
argument types into 32-bit pieces before passing it to the generic
code in handleAssignments. This is a bit sloppy and also requires some
overly fancy iterator work when building the calls. It's better if all
argument marshalling code is handled directly in
handleAssignments. This handles more situations like decomposing large
element vectors into sub-element sized pieces.

This should mostly be NFC, but does change the generated code by
shifting where the initial argument packing instructions are placed. I
think this is nicer looking, since it now emits the packing code
directly after the relevant copies, rather than after the copies for
the remaining arguments.

This doubles down on gfx6/gfx7 using the gfx8+ ABI for 16-bit
types. This is ultimately the better option, but incompatible with the
DAG. Fixing this requires more work, especially for f16.
2021-02-18 17:26:55 -05:00
Nikita Popov 70e3c9a8b6 [BasicAA] Always strip single-argument phi nodes
We can always look through single-argument (LCSSA) phi nodes when
performing alias analysis. getUnderlyingObject() already does this,
but stripPointerCastsAndInvariantGroups() does not. We still look
through these phi nodes with the usual aliasPhi() logic, but
sometimes get sub-optimal results due to the restrictions on value
equivalence when looking through arbitrary phi nodes. I think it's
generally beneficial to keep the underlying object logic and the
pointer cast stripping logic in sync, insofar as it is possible.

With this patch we get marginally better results:

  aa.NumMayAlias | 5010069 | 5009861
  aa.NumMustAlias | 347518 | 347674
  aa.NumNoAlias | 27201336 | 27201528
  ...
  licm.NumPromoted | 1293 | 1296

I've renamed the relevant strip method to stripPointerCastsForAliasAnalysis(),
as we're past the point where we can explicitly spell out everything
that's getting stripped.

Differential Revision: https://reviews.llvm.org/D96668
2021-02-18 23:07:50 +01:00
Stanislav Mekhanoshin 5247a0d9e6 [AMDGPU] Correct gfx90c feature list
Looks like we have forced FeatureXNACK and forgot
FeatureMadMacF32Insts.

Differential Revision: https://reviews.llvm.org/D96989
2021-02-18 12:40:27 -08:00
Stanislav Mekhanoshin 75997e8407 [AMDGPU] Fixed msan build
LoadStoreOptimizer was using uninitialized SCC value for
instructions where it is unsupported.
2021-02-17 18:01:23 -08:00
Stanislav Mekhanoshin 48d2e04152 [AMDGPU] Mark SMRD atomics
We did not have atomic flags on SMRD, did not copy TSFlags
to real instructions, and did not have ret/noret atomic map.

At the moment it is NFC, but needed for D96469.

Differential Revision: https://reviews.llvm.org/D96823
2021-02-17 16:47:02 -08:00
Stanislav Mekhanoshin a8d9d50762 [AMDGPU] gfx90a support
Differential Revision: https://reviews.llvm.org/D96906
2021-02-17 16:01:32 -08:00
Piotr Sobczak c72a63b4b0 [AMDGPU] Add implicit vcc_lo on S_CBRANCH_VCCNZ in wave32
* Update skip-if-dead.ll with tests for wave32.
* Fix the crash in verifier in one newly enabled test by adding
  missing fixImplicitOperands in branch insertion code.

```
*** Bad machine code: Using an undefined physical register ***
- function:    test_kill_divergent_loop
- basic block: %bb.2 bb (0xad96308)
- instruction: S_CBRANCH_VCCNZ %bb.1, implicit $vcc_lo
- operand 1:   implicit $vcc_lo
LLVM ERROR: Found 1 machine code errors.
```

* Simplify "cbranch_kill" to not use interp instructions.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D96793
2021-02-17 15:14:57 +01:00
Jay Foad c8be7e96bb [AMDGPU] Rename simplifyI24 to simplifyMul24
Also simplify one of its call sites. NFC.
2021-02-17 11:33:49 +00:00
Piotr Sobczak 08131c7439 [AMDGPU] Fix a miscompile with S_ADD/S_SUB
The helper function isBoolSGPR is too aggressive when determining
when a v_cndmask can be skipped on a boolean value because the
function does not check the operands of and/or/xor.

This can be problematic for the Add/Sub combines that can leave
bits set even for inactive lanes leading to wrong results.

Fix this by inspecting the operands of and/or/xor recursively.

Differential Revision: https://reviews.llvm.org/D86878
2021-02-17 12:24:58 +01:00
Tony Tye c62b737ad6 [AMDGPU] Correct rmw atomics s_waitcnt generation
The AMD GPU SIMemoryLegalizer was using the ordering address space
rather than the instruction address space when determining the
s_waitcnt to generate to ensure that a read-modify-write atomic has
completed. This resulted in additional unnecessary counters being
waited on.

Differential Revision: https://reviews.llvm.org/D96743
2021-02-17 01:32:29 +00:00
Matt Arsenault a7455d7b7c AMDGPU: Remove kills following clusters of memory instruction
In a future commit, soft clauses will be hinted with kill instructions
rather than forced together with bundles. Look for kills that look
like this, and erase them. I'm not sure if the check for specific uses
is worthwhile, or if it would be better to just unconditionally erase
kills.

This reduces test churn in a future patch.
2021-02-16 10:49:28 -05:00
Matt Arsenault c320e8196a AMDGPU: Fix debug info handling in post-RA bundler
This was allowing debug instructions to break the bundling, which
would change scheduling behavior. Bundle debug info / kills inside
the bundle. This seems to work OK, although the asm printer doesn't
understand these in a bundle. This implicitly expects the memory
legalizer to unbundle. It would probably be slightly nicer to move
these after.

Rewrite the loop to be clearer and make sure we don't end a bundle on
a meta instruction, only allow them in between other valid bundle
instructions.
2021-02-16 10:42:06 -05:00
Matt Arsenault 392e0fcfd1 GlobalISel: Handle arguments partially passed on the stack
The API is a bit awkward since you need to index into an array in the
passed struct. I guess an alternative would be to pass all of the
individual fields.
2021-02-15 17:06:14 -05:00
Duncan P. N. Exon Smith 22a52dfddc TransformUtils: Fix metadata handling in CloneModule (and improve CloneFunctionInto)
This commit fixes how metadata is handled in CloneModule to be sound,
and improves how it's handled in CloneFunctionInto (although the latter
is still awkward when called within a module).

Ruiling Song pointed out in PR48841 that CloneModule was changed to
unsoundly use the RF_ReuseAndMutateDistinctMDs flag (renamed in
fa35c1f80f for clarity). This flag papered
over a crash caused by other various changes made to CloneFunctionInto
over the past few years that made it unsound to use cloning between
different modules.

(This commit partially addresses PR48841, fixing the repro from
preprocessed source but not textual IR. MDNodeMapper::mapDistinctNode
became unsound in df763188c9 and this
commit does not address that regression.)

RF_ReuseAndMutateDistinctMDs is designed for the IRMover to use,
avoiding unnecessary clones of all referenced metadata when linking
between modules (with IRMover, the source module is discarded after
linking). It never makes sense to use when you're not discarding the
source. This commit drops its incorrect use in CloneModule.

Sadly, the right thing to do with metadata when cloning a function is
complicated, and this patch doesn't totally fix it.

The first problem is that there are two different types of referenceable
metadata and it's not obvious what to with one of them when remapping.

- `!0 = !{!1}` is metadata's version of a constant. Programatically it's
  called "uniqued" (probably a better term would be "constant") because,
  like `ConstantArray`, it's stored in uniquing tables. Once it's
  constructed, it's illegal to change its arguments.
- `!0 = distinct !{!1}` is a bit closer to a global variable. It's legal
  to change the operands after construction.

What should be done with distinct metadata when cloning functions within
the same module?

- Should new, cloned nodes be created?
- Should all references point to the same, old nodes?

The answer depends on whether that metadata is effectively owned by a
function.

And that's the second problem. Referenceable metadata's ownership model
is not clear or explicit. Technically, it's all stored on an
LLVMContext. However, any metadata that is `distinct`, that transitively
references a `distinct` node, or that transitively references a
GlobalValue is specific to a Module and is effectively owned by it. More
specifically, some metadata is effectively owned by a specific Function
within a module.

Effectively function-local metadata was introduced somewhere around
c10d0e5ccd, which made it illegal for two
functions to share a DISubprogram attachment.

When cloning a function within a module, you need to clone the
function-local debug info and suppress cloning of global debug info (the
status quo suppresses cloning some global debug info but not all). When
cloning a function to a new/different module, you need to clone all of
the debug info.

Here's what I think we should do (eventually? soon? not this patch
though):
- Distinguish explicitly (somehow) between pure constant metadata owned
  by the LLVMContext, global metadata owned by the Module, and local
  metadata owned by a GlobalValue (such as a function).
- Update CloneFunctionInto to trigger cloning of all "local" metadata
  (only), perhaps by adding a bit to RemapFlag. Alternatively, split
  out a separate function CloneFunctionMetadataInto to prime the
  metadata map that callers are updated to call ahead of time as
  appropriate.

Here's the somewhat more isolated fix in this patch:
- Converted the `ModuleLevelChanges` parameter to `CloneFunctionInto` to
  an enum called `CloneFunctionChangeType` that is one of
  LocalChangesOnly, GlobalChanges, DifferentModule, and ClonedModule.
- The code maintaining the "functions uniquely own subprograms"
  invariant is now only active in the first two cases, where a function
  is being cloned within a single module. That's necessary because this
  code inhibits cloning of (some) "global" metadata that's effectively
  owned by the module.
- The code maintaining the "all compile units must be explicitly
  referenced by !llvm.dbg.cu" invariant is now only active in the
  DifferentModule case, where a function is being cloned into a new
  module in isolation.
- CoroSplit.cpp's call to CloneFunctionInto in CoroCloner::create
  uses LocalChangeOnly, since fa635d730f
  only set `ModuleLevelChanges` to trigger cloning of local metadata.
- CloneModule drops its unsound use of RF_ReuseAndMutateDistinctMDs
  and special handling of !llvm.dbg.cu.
- Fixed some outdated header docs and left a couple of FIXMEs.

Differential Revision: https://reviews.llvm.org/D96531
2021-02-15 11:56:00 -08:00
Stanislav Mekhanoshin 5cf9292ce3 [AMDGPU] Add two TSFlags: IsAtomicNoRtn and IsAtomicRtn
We are using AtomicNoRet map in multiple places to determine
if an instruction atomic, rtn or nortn atomic. This method
does not work always since we have some instructions which
only has rtn or nortn version.

One such instruction is ds_wrxchg_rtn_b32 which does not have
nortn version. This has caused changes in memory legalizer
tests.

Differential Revision: https://reviews.llvm.org/D96639
2021-02-15 11:27:59 -08:00
Carl Ritson aef781b47a [AMDGPU] Add llvm.amdgcn.wqm.demote intrinsic
Add intrinsic which demotes all active lanes to helper lanes.
This is used to implement demote to helper Vulkan extension.

In practice demoting a lane to helper simply means removing it
from the mask of live lanes used for WQM/WWM/Exact mode.
Where the shader does not use WQM, demotes just become kills.

Additionally add llvm.amdgcn.live.mask intrinsic to complement
demote operations. In theory llvm.amdgcn.ps.live can be used
to detect helper lanes; however, ps.live can be moved by LICM.
The movement of ps.live cannot be remedied without changing
its type signature and such a change would require ps.live
users to update as well.

Reviewed By: piotr

Differential Revision: https://reviews.llvm.org/D94747
2021-02-15 08:45:46 +09:00
Tony Tye 8a91b68b95 [AMDGPU] Limit memory scope for scratch, LDS and GDS
Changes for AMD GPU SIMemoryLegalizer:

- Limit the memory scope to maximum supported by the scratch, LDS and
  GDS address spaces.

- Improve assertion checking.

- Correct toSIAtomicScope argument name.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D96643
2021-02-14 17:34:12 +00:00
Kazu Hirata b4c0d610a6 [AMDGPU] Fix build breakage 2021-02-14 09:02:55 -08:00
Kazu Hirata 910e2d1e57 [llvm] Use llvm::is_contained (NFC) 2021-02-14 08:36:20 -08:00
Kazu Hirata 96c90a6d14 [AMDGPU] Drop unnecessary const from a return type (NFC)
Identified with readability-const-return-type.
2021-02-12 23:44:32 -08:00
Stanislav Mekhanoshin c96e214b9c [AMDGPU] Fix Windows build
A trivial fix, 64 bit constant is 1ull, not 1ul on Windows.
Fixed build broken by c0d7a8bc62.
2021-02-12 12:30:52 -08:00
Stanislav Mekhanoshin c0d7a8bc62 [AMDGPU] Allow accvgpr_read/write decode with opsel
These two instructions are VOP3P and have op_sel_hi bits,
however do not use op_sel_hi. That is recommended to set
unused op_sel_hi bits to 1. However, we cannot decode
both representations with 1 and 0 if bits are set to
default value 1. If bits are set to be ignored with '?'
initializer then encoding defaults them to 0.

The patch is a hack to force ignored '?' bits to 1 on
encoding for these instructions.

There is still canonicalization happens on disasm print
if incoming values are non-default, so that disasm output
does not match binary input, but this is pre-existing
problem for all instructions with '?' bits.

Fixes: SWDEV-272540

Differential Revision: https://reviews.llvm.org/D96543
2021-02-12 10:04:47 -08:00
Jay Foad 7e9ceed9a2 [TableGen][GlobalISel] Allow duplicate RendererFns
Allow different GICustomOperandRenderers to use the same RendererFn.
This avoids the need for targets to define a bunch of identical C++
renderer functions with different names.

Without this fix TableGen would have emitted code that tried to define
the GICR enumeration with duplicate enumerators.

Differential Revision: https://reviews.llvm.org/D96587
2021-02-12 15:05:32 +00:00
Stanislav Mekhanoshin cb41ee92da [AMDGPU] Fix promote alloca with double use in a same insn
If we have an instruction where more than one pointer operands
are derived from the same promoted alloca, we are fixing it for
one argument and do not fix a second use considering this user
done.

Fix this by deferring processing of memory intrinsics until all
potential operands are replaced.

Fixes: SWDEV-271358

Differential Revision: https://reviews.llvm.org/D96386
2021-02-11 11:42:25 -08:00
Matt Arsenault e3c6fa3611 AMDGPU: Restrict soft clause bundling at half of the available regs
Fixes a testcase that was overcommitting large register tuples to a
bundle, which the register allocator could not possibly satisfy.  This
was producing a bundle which used nearly all of the available SGPRs
with a series of 16-dword loads (not all of which are freely available
to use).

This is a quick hack for some deeper issues with how the clause
bundler tracks register pressure.

Overall the pressure tracking used here doesn't make sense and is too
imprecise for what it needs to avoid the allocator failing. The
pressure estimate does not account for the alignment requirements of
large SGPR tuples, so this was really underestimating the pressure
impact. This also ignores the impact of the extended live range of the
use registers after the bundle is introduced. Additionally, it didn't
account for some wide tuples not being available due to reserved
registers.

This regresses a few cases. These end up introducing more
spilling. This is also a function of the global pressure being used in
the decision to bundle, not the local pressure impact of the bundle
itself.
2021-02-11 14:08:59 -05:00
Jay Foad 23db2d363f [AMDGPU] Better selection of base offset when merging DS reads/writes
When merging a pair of DS reads or writes needs to materialize the base
offset in a vgpr, choose a value that is aligned to as high a power of
two as possible. This maximises the chance that different pairs can use
the same base offset, in which case the base offset registers can be
commoned up by MachineCSE.

Differential Revision: https://reviews.llvm.org/D96421
2021-02-11 17:46:09 +00:00
Carl Ritson c16f776028 [AMDGPU] Move kill lowering to WQM pass and add live mask tracking
Move implementation of kill intrinsics to WQM pass. Add live lane
tracking by updating a stored exec mask when lanes are killed.
Use live lane tracking to enable early termination of shader
at any point in control flow.

Reviewed By: piotr

Differential Revision: https://reviews.llvm.org/D94746
2021-02-11 20:31:29 +09:00
Carl Ritson e5b0b434f6 [AMDGPU] Refactor MIMG tables to better handle hardware variants
Add mimgopc object to represent the opcode allowing different
opcodes for different hardware variants.
This enables image_atomic_fcmpswap, image_atomic_fmin, and
image_atomic_fmax on GFX10

Reviewed By: foad, rampitec

Differential Revision: https://reviews.llvm.org/D96309
2021-02-11 13:22:41 +09:00
Jay Foad 2114b458b0 [AMDGPU] Fix comments in SILoadStoreOptimizer::offsetsCanBeCombined 2021-02-10 14:49:33 +00:00
Matt Arsenault f4ca6d8289 AMDGPU: Fix verifier error with argument passed in CSR SGPR
We need to avoid setting the kill flag on the CSR spill if there's an
additional use of the register after the spill.

This does rely on consistency between the entry block liveins and the
MRI's function live ins, which is not something the verifier checks
now.
2021-02-09 13:49:44 -05:00
Matt Arsenault b72a23650f GlobalISel: Fix using wrong calling convention for callees
This was taking the calling convention from the parent function,
instead of the callee. Avoids regressions in a future patch when the
caller and callee have different type breakdowns.

For some reason AArch64's lowerFormalArguments seems to intentionally
ignore the parent isVarArg.
2021-02-09 13:48:56 -05:00