Commit Graph

6124 Commits

Author SHA1 Message Date
dfukalov 8f4b7e94a2 [AMDGPU][CostModel] Refine cost model for control-flow instructions.
Added cost estimation for switch instruction, updated costs of branches, fixed
phi cost.
Had to increase `-amdgpu-unroll-threshold-if` default value since conditional
branch cost (size) was corrected to higher value.
Test renamed to "control-flow.ll".

Removed redundant code in `X86TTIImpl::getCFInstrCost()` and
`PPCTTIImpl::getCFInstrCost()`.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D96805
2021-04-10 09:20:24 +03:00
Mitch Phillips 092f288d36 Revert "[AMDGPU] Remove MachineDCE after SIFoldOperands"
This reverts commit 5a0117b2d0.

Reason: Dependent change d19a42eba9 broke
the ASan buildbots.
2021-04-09 15:47:44 -07:00
Mitch Phillips 3d4730a73f Revert "[AMDGPU] SIFoldOperands: eagerly erase dead REG_SEQUENCEs"
This reverts commit d19a42eba9.

Reason: Broke the ASan buildbots. See the original phabricator review
for more details: https://reviews.llvm.org/D100188
2021-04-09 15:47:44 -07:00
Jay Foad 5a0117b2d0 [AMDGPU] Remove MachineDCE after SIFoldOperands
Remove the MachineDCE pass after the first SIFoldOperands pass now
that SIFoldOperands deletes its own dead instructions.

Differential Revision: https://reviews.llvm.org/D100189
2021-04-09 20:41:09 +01:00
Jay Foad d19a42eba9 [AMDGPU] SIFoldOperands: eagerly erase dead REG_SEQUENCEs
This is fairly cheap to implement and means less work for future
passes like MachineDCE.

Differential Revision: https://reviews.llvm.org/D100188
2021-04-09 20:41:09 +01:00
Jay Foad a4ced03d34 [AMDGPU] SIFoldOperands: eagerly delete dead copies
This is cheap to implement, means less work for future passes like
MachineDCE, and slightly improves the folding in some cases.

Differential Revision: https://reviews.llvm.org/D100117
2021-04-09 13:52:54 +01:00
Sebastian Neubauer cc7add5298 [AMDGPU] Use SIInstrFlags for flat variants. NFC
Use SIInstrFlags to differentiate between the different
variants of flat instructions (flat, global and scratch).
This should make it easier to bundle the immediate offset logic in a
single place and implement restrictions and bug workarounds.

Fixed version of D99587, which does not rely on the address space.

Differential Revision: https://reviews.llvm.org/D99743
2021-04-09 12:28:36 +02:00
dfukalov d066079728 [NFC][AA] Prepare to convert AliasResult to class with PartialAlias offset.
Main reason is preparation to transform AliasResult to class that contains
offset for PartialAlias case.

Reviewed By: asbirlea

Differential Revision: https://reviews.llvm.org/D98027
2021-04-09 12:54:22 +03:00
Sebastian Neubauer 36138db116 [AMDGPU] IsFlatScratch/Global -> FlatScratch/Global
Remove 'Is' from IsFlatScratch/Global. NFC

Differential Revision: https://reviews.llvm.org/D100108
2021-04-09 11:20:31 +02:00
Konstantin Zhuravlyov 4fae63c612 AMDGPU: Add gfx90c support to code object v2 for backwards compatibility
Differential Revision: https://reviews.llvm.org/D100126
2021-04-08 16:42:43 -04:00
Stanislav Mekhanoshin 627dab3dbf [AMDGPU] Check for all meta instrs in GCNRegBankReassign
It used to work correctly even with a KILL, but there is
no reason to consider meta instructions since they do not
create real HW uses.

Differential Revision: https://reviews.llvm.org/D100135
2021-04-08 13:41:10 -07:00
Stanislav Mekhanoshin 189310a140 [AMDGPU] Allow -amdgpu-unsafe-fp-atomics to ignore denorm mode
Fixes: SWDEV-274276

Differential Revision: https://reviews.llvm.org/D100072
2021-04-08 12:46:36 -07:00
Jay Foad a1a372dfb5 [AMDGPU] SIFoldOperands: remove an unneeded isReg check. NFC. 2021-04-08 16:37:43 +01:00
Jay Foad a250e91d10 [AMDGPU] SIFoldOperands: make use of emplace_back. NFC. 2021-04-08 14:34:10 +01:00
Jay Foad 2724b57ecd [AMDGPU] SIFoldOperands: remove an unneeded make_early_inc_range. NFC. 2021-04-08 14:32:36 +01:00
Jay Foad c28f79a0e3 [AMDGPU] SIFoldOperands: try harder to fold cndmask instructions
Look through copies to find more cases where the two values being
selected are identical. The motivation for this is just to be able to
remove the weird special case where tryFoldCndMask was called from
foldInstOperand, part way through folding a move-immediate into its
users, without regressing any lit tests.
2021-04-08 14:26:12 +01:00
Jay Foad 3344cd3a14 [AMDGPU] SIFoldOperands: make tryFoldCndMask a member function. NFC. 2021-04-08 14:05:29 +01:00
Sebastian Neubauer c10cc4ea27 [AMDGPU] Fix computing live registers in prolog
ScratchExecCopy needs to be marked as live, we cannot use that register
while EXEC is stored in there.

Marking SGPRForFPSaveRestoreCopy and SGPRForBPSaveRestoreCopy as
available is unnecessary, they should not be live at that point anway.

Differential Revision: https://reviews.llvm.org/D100098
2021-04-08 14:52:50 +02:00
Jay Foad 94a6fe43de [AMDGPU] SIFoldOperands: refactor tryFoldCndMask with early-outs. NFC. 2021-04-08 13:16:07 +01:00
hsmahesha ac64995ceb [AMDGPU] Only use ds_read/write_b128 for alignment >= 16
PS: Submitting on behalf of Jay.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D100008
2021-04-08 08:12:05 +05:30
Stanislav Mekhanoshin 37878de503 Disable use of SCC bit from asm
Differential Revision: https://reviews.llvm.org/D100069
2021-04-07 15:32:17 -07:00
Tony Tye 4658cd4c18 [AMDGPU] Update gfx90a memory model support
Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D100070
2021-04-07 22:17:58 +00:00
Stanislav Mekhanoshin d5d412f2ae [AMDGPU] Split GCNRegBankReassign
Allow pass to work separately with SGPR, VGPR registers or both.
This is NFC now but will be needed to split RA for separate
SGPR and VGPR passes.

Differential Revision: https://reviews.llvm.org/D100063
2021-04-07 14:45:13 -07:00
Sebastian Neubauer 2dc6be5209 [AMDGPU] Update SGPRSpillVGPRCSR name. NFC
The struct is used for both, callee and caller-save registers now.
The frame index is not set for entrypoints, as we do not need to save
the registers then.
Update the struct name to reflect that.

Differential Revision: https://reviews.llvm.org/D99722
2021-04-07 16:30:40 +02:00
Jay Foad bf6cab6f07 [AMDGPU] SIFoldOperands: don't dump extra '\n' after MachineInstr. NFC. 2021-04-07 14:13:00 +01:00
Jay Foad 8f798566a3 [AMDGPU] SIFoldOperands: use isUseMIInFoldList. NFC. 2021-04-06 17:53:48 +01:00
Konstantin Zhuravlyov 844012940e AMDGPU: Add isBranch=1 to SOPP branch instructions
Differential Revision: https://reviews.llvm.org/D99955
2021-04-06 10:59:30 -04:00
Jay Foad efc7bf27f5 [AMDGPU] SIFoldOperands: use MachineRegisterInfo::hasOneNonDBGUser
NFC.
2021-04-06 15:23:58 +01:00
Jay Foad 005dcd196e [AMDGPU] SIFoldOperands: use range-based loops and make_early_inc_range
NFC.
2021-04-06 15:23:58 +01:00
Jay Foad ce9cca6c3a [AMDGPU] SIFoldOperands: rename tryFoldInst to tryFoldCndMask
This follows the pattern of the other tryFold* functions. NFC.
2021-04-06 15:23:58 +01:00
Jay Foad cf4f5292f6 [AMDGPU] SIFoldOperands: use getVRegDef instead of getUniqueVRegDef
We are in SSA so getVRegDef is equivalent but simpler. NFC.
2021-04-06 15:23:58 +01:00
Jay Foad e9608a84d8 [AMDGPU][SDag] Add IMG init also for image_gather4 instructions
This fixes an oversight in D99747 which moved the IMG init code from
SIAddIMGInit to AdjustInstrPostInstrSelection, but did not set the
hasPostISelHook flag on gather4 instructions.

Differential Revision: https://reviews.llvm.org/D99953
2021-04-06 14:47:20 +01:00
Dmitry Preobrazhensky 3eadcb86ab [AMDGPU][MC][GFX9] Corrected SMEM decoding
Corrected SMEM decoding when IMM=0 and OFFSET>127

Fixed bug 49819 (https://bugs.llvm.org/show_bug.cgi?id=49819)

Differential Revision: https://reviews.llvm.org/D99804
2021-04-06 14:10:46 +03:00
Jay Foad fdc4f19e2f [AMDGPU] Remove SIAddIMGInit pass which is now unused
Differential Revision: https://reviews.llvm.org/D99748
2021-04-01 18:13:17 +01:00
Jay Foad 3d07a6d891 [AMDGPU][GlobalISel] Add IMG init in selectImageIntrinsic
Doing this during instruction selection avoids the cost of running
SIAddIMGInit which is yet another pass over the MIR.

Differential Revision: https://reviews.llvm.org/D99670
2021-04-01 18:13:17 +01:00
Jay Foad 4af6251cea [AMDGPU][SDag] Add IMG init in AdjustInstrPostInstrSelection
Doing this in a post-isel hook avoids the cost of running SIAddIMGInit
which is yet another pass over the MIR.

Differential Revision: https://reviews.llvm.org/D99747
2021-04-01 18:13:17 +01:00
Jay Foad b1fbfd9e4c [AMDGPU] Small cleanup to constructRetValue and its caller. NFC. 2021-04-01 16:36:16 +01:00
Brendon Cahoon 65c8bfb509 [AMDGPU] Enable output modifiers for double precision instructions
Update SIFoldOperands pass to recognize v_add_f64 and v_mul_f64
instructions for folding output modifiers.

Differential Revision: https://reviews.llvm.org/D99505
2021-04-01 10:08:17 -04:00
Dmitry Preobrazhensky cd953434f2 [AMDGPU][MC][GFX10][GFX90A] Corrected _e32/_e64 suffices
Fixed bugs https://bugs.llvm.org//show_bug.cgi?id=49643, https://bugs.llvm.org//show_bug.cgi?id=49644, https://bugs.llvm.org//show_bug.cgi?id=49645.

Differential Revision: https://reviews.llvm.org/D99413
2021-04-01 14:21:00 +03:00
Dmitry Preobrazhensky 0f5ebbcc7f [AMDGPU][MC] Added flag to identify VOP instructions which have a single variant
By convention, VOP1/2/C instructions which can be promoted to VOP3 have _e32 suffix while promoted instructions have _e64 suffix. Instructions which have a single variant should have no _e32/_e64 suffix. Unfortunately there was no simple way to identify single variant instructions - it was implemented by a hack. See bug https://bugs.llvm.org/show_bug.cgi?id=39086.

This fix simplifies handling of single VOP instructions by adding a dedicated flag.

Differential Revision: https://reviews.llvm.org/D99408
2021-04-01 13:53:12 +03:00
Sander de Smalen 2f6f249a49 NFC: Change getIntrinsicInstrCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Depends on D97468

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D97469
2021-03-31 14:04:41 +01:00
Sander de Smalen 2f56e1c6b1 NFC: Change getTypeBasedIntrinsicCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Depends on D97466

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D97468
2021-03-31 14:04:41 +01:00
Jay Foad 5d0e9ddfa5 [AMDGPU][GlobalISel] Add support for global atomicrmw fadd
This includes gfx908 which only has a no-return version of the
global_atomic_add_f32 instruction, using the same hack that was
previously implemented for selecting from the
llvm.amdgcn.global.atomic.fadd intrinsic.

Differential Revision: https://reviews.llvm.org/D97767
2021-03-31 11:13:00 +01:00
Tomas Matheson a9968c0a33 [NFC][CodeGen] Tidy up TargetRegisterInfo stack realignment functions
Currently needsStackRealignment returns false if canRealignStack returns false.
This means that the behavior of needsStackRealignment does not correspond to
it's name and description; a function might need stack realignment, but if it
is not possible then this function returns false. Furthermore,
needsStackRealignment is not virtual and therefore some backends have made use
of canRealignStack to indicate whether a function needs stack realignment.

This patch attempts to clarify the situation by separating them and introducing
new names:

 - shouldRealignStack - true if there is any reason the stack should be
   realigned

 - canRealignStack - true if we are still able to realign the stack (e.g. we
   can still reserve/have reserved a frame pointer)

 - hasStackRealignment = shouldRealignStack && canRealignStack (not target
   customisable)

Targets can now override shouldRealignStack to indicate that stack realignment
is required.

This change will make it easier in a future change to handle the case where we
need to realign the stack but can't do so (for example when the register
allocator creates an aligned spill after the frame pointer has been
eliminated).

Differential Revision: https://reviews.llvm.org/D98716

Change-Id: Ib9a4d21728bf9d08a545b4365418d3ffe1af4d87
2021-03-30 17:31:39 +01:00
Sebastian Neubauer 1c3b74f0ab [AMDGPU] Remove outdated TODOs. NFC
spillSGPRToVGPR is already respected in these places since D95768.

Differential Revision: https://reviews.llvm.org/D99570
2021-03-30 15:18:49 +02:00
Stanislav Mekhanoshin 619b88849e [AMDGPU] Fix "Sequence" spelling. NFC. 2021-03-29 12:11:36 -07:00
Joe Nash 45fd7c02af Revert "[AMDGPU] Mark additional VOP3 as commutable"
This reverts commit d35d8da7d6.
2021-03-29 14:48:11 -04:00
Joe Nash d35d8da7d6 [AMDGPU] Mark additional VOP3 as commutable
Note, only src0 and src1 will be commuted if the isCommutable flag
is set. This patch does not change that, it just makes it possible
to commute src0 and src1 of more instructions.

Reviewed By: foad, rampitec

Differential Revision: https://reviews.llvm.org/D99376

Change-Id: I61e20490962d95ea429beb355c55f55c024dafdc
2021-03-29 14:22:20 -04:00
Jay Foad 9d08f276d7 [AMDGPU] Use reductions instead of scans in the atomic optimizer
If the result of an atomic operation is not used then it can be more
efficient to build a reduction across all lanes instead of a scan. Do
this for GFX10, where the permlanex16 instruction makes it viable. For
wave64 this saves a couple of dpp operations. For wave32 it saves one
readlane (which are generally bad for performance) and one dpp
operation.

Differential Revision: https://reviews.llvm.org/D98953
2021-03-26 15:38:14 +00:00
Jay Foad d92b4956d6 [AMDGPU] Inline FSHRPattern into its only use. NFC. 2021-03-26 09:32:02 +00:00
Konstantin Zhuravlyov f4ace63737 AMDGPU: Add target id and code object v4 support
- Add target id support (https://clang.llvm.org/docs/ClangOffloadBundler.html#target-id)
  - Add code object v4 support (https://llvm.org/docs/AMDGPUUsage.html#elf-code-object)
    - Add kernarg_size to kernel descriptor
    - Change trap handler ABI to no longer move queue pointer into s[0:1]
  - Cleanup ELF definitions
    - Add V2, V3, V4 suffixes to make a clear distinction for code object version
    - Consolidate note names

Differential Revision: https://reviews.llvm.org/D95638
2021-03-24 11:54:05 -04:00
Sander de Smalen 55d18b3cc2 [TTI] Return a TypeSize from getRegisterBitWidth.
This patch changes the interface to take a RegisterKind, to indicate
whether the register bitwidth of a scalar register, fixed-width vector
register, or scalable vector register must be returned.

Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D98874
2021-03-24 14:45:13 +00:00
alex-t dccf83acf9 [AMDGPU] SIOptimizeExecMaskingPreRA should check constant bus constraint when folds EXEC copy
Folding EXEC copy into it's single use may lead to constant bus constraint violation as it adds one more SGPR operand.
         This change makes it validate the user instruction with the new SGPR operand and only fold it if it is legal.

Reviewed By: rampitec, arsenm

Differential Revision: https://reviews.llvm.org/D98888
2021-03-24 14:14:13 +03:00
Jay Foad fd142e6c18 [AMDGPU] Simplify AMDGPUAnnotateUniformValues::visitBranchInst. NFC.
A BranchInst is always the terminator of its containing BasicBlock.
2021-03-23 16:54:43 +00:00
Joe Nash 538bda0b80 [AMDGPU] Refactor DPPCombine
NFC. Extract IsShrinkable into a helper function, and
make Subtarget a member variable.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D99099

Change-Id: If4bc97a88a9ae4eb1df47e717345d46a6ed515bf
2021-03-23 11:53:53 -04:00
Jay Foad fc7e3e7dd9 [AMDGPU] Set SchedRW on real instructions
Coyp SchedRW from pseudos to real instructions so that llvm-mca has
access to it. This is NFC for normal compiler codegen, which schedules
pseudos not real instructions.

Add an llvm-mca test for some high latency double-precision instructions
as a smoke test.

Differential Revision: https://reviews.llvm.org/D99187
2021-03-23 15:38:11 +00:00
Matt Arsenault b24436ac96 GlobalISel: Lower funnel shifts 2021-03-23 09:11:17 -04:00
Pushpinder Singh d0e5422eb8 [GlobalISel][AMDGPU] Lower G_UMULO/G_SMULO
Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D93963
2021-03-23 05:45:43 +00:00
Carl Ritson 64db6b8d37 [AMDGPU] Only unbundle memory accesses in SIMemoryLegalizer
This restores previous behaviour and is a step toward removing
unbundling entirely.

Reviewed By: foad, rampitec

Differential Revision: https://reviews.llvm.org/D99061
2021-03-23 11:30:36 +09:00
Matt Arsenault 1dd23c6d53 AMDGPU: Allow tail calls for amdgpu_gfx functions 2021-03-22 10:55:19 -04:00
Matt Arsenault a0f5aad6d7 AMDGPU: Fix allowing immediates for tail call pseudo.
The pseudo was using SSrc_b64, so it allowed folding immediates into
the destination operand for a tail call to null. However, this is not
a valid operand for the s_setpc_b64 this will be lowered to. Avoids
printing the operand as an invalid immediate.

Avoids a regression when tail calls are enabled in GlobalISel (somehow
tail calls to null get deleted in the DAG).
2021-03-21 13:14:04 -04:00
Matt Arsenault 6314a72730 AMDGPU/GlobalISel: Enable CSE in pre-legalizer combiner 2021-03-21 10:07:37 -04:00
Carl Ritson 6c9cac5da1 [AMDGPU] Add MDT update missing from D98915 2021-03-20 13:38:58 +09:00
Carl Ritson fe5f4c397f [AMDGPU] Rename SIInsertSkips Pass
Pass no longer handles skips.  Pass now removes unnecessary
unconditional branches and lowers early termination branches.
Hence rename to SILateBranchLowering.

Move code to handle returns to epilog from SIPreEmitPeephole
into SILateBranchLowering. This means SIPreEmitPeephole only
contains optional optimisations, and all required transforms
are in SILateBranchLowering.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D98915
2021-03-20 11:48:04 +09:00
Carl Ritson 5df2af8b0e [AMDGPU] Merge SIRemoveShortExecBranches into SIPreEmitPeephole
SIRemoveShortExecBranches is an optimisation so fits well in the
context of SIPreEmitPeephole.

Test changes relate to early termination from kills which have now
been lowered prior to considering branches for removal.
As these use s_cbranch the execz skips are now retained instead.
Currently either behaviour is valid as kill with EXEC=0 is a nop;
however, if early termination is used differently in future then
the new behaviour is the correct one.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D98917
2021-03-20 11:26:42 +09:00
Carl Ritson b76c09023d [AMDGPU] Allow index optimisation in SIPreEmitPeephole for bundles
Add code so duplication index register changes can be removed from
inside bundles.

Reviewed By: rampitec, foad

Differential Revision: https://reviews.llvm.org/D98940
2021-03-20 10:26:23 +09:00
Stanislav Mekhanoshin 57effe2205 [AMDGPU] Remove dead glc1 handing in asm parser. NFC. 2021-03-19 08:37:47 -07:00
Jay Foad 5a5a531214 [AMDGPU] Remove some redundant code. NFC.
This is redundant because we have already checked that we can't handle
divergent 64-bit atomic operands.
2021-03-19 11:36:15 +00:00
Jay Foad 5dd5ddcb41 [AMDGPU] Skip building some IR if it won't be used. NFC. 2021-03-19 11:36:14 +00:00
Jay Foad c96dfe0d8b [AMDGPU] Sink Intrinsic::getDeclaration calls to where they are used. NFC. 2021-03-19 11:36:14 +00:00
Stanislav Mekhanoshin edd6da10d2 [AMDGPU] Remove cpol, tfe, and swz from MUBUF patterns
These are always selected as 0 anyway.

Differential Revision: https://reviews.llvm.org/D98663
2021-03-18 14:36:04 -07:00
Stanislav Mekhanoshin 961e4384f4 [AMDGPU] Support SCC on buffer atomics
Differential Revision: https://reviews.llvm.org/D98731
2021-03-18 09:56:14 -07:00
Stanislav Mekhanoshin 3f37c28230 [AMDGPU] Remove unused template parameters of MUBUF_Real_AllAddr_vi
Differential Revision: https://reviews.llvm.org/D98804
2021-03-18 09:02:38 -07:00
Jon Chesterfield 253f804deb [amdgpu] Update med3 combine to skip i64
[amdgpu] Update med3 combine to skip i64

Fixes an assumption that a type which is not i32 will be i16. This asserts
when trying to sign/zero extend an i64 to i32.

Test case was cut down from an openmp application. Variations on it are hit by
other combines before reaching the problematic one, e.g. replacing the
immediate values with other function arguments changes the codegen path and
misses this combine.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D98872
2021-03-18 15:56:41 +00:00
Matt Arsenault b9a0384983 GlobalISel: Preserve source value information for outgoing byval args
Pass through the original argument IR value in order to preserve the
aliasing information in the memcpy memory operands.
2021-03-18 09:16:54 -04:00
Carl Ritson 1a4bc3aba3 [AMDGPU] Avoid unnecessary graph visits during WQM marking
Avoid revisiting nodes with the same set of defined lanes by
using a unified visited set which integrates lanes into the key.
This retains the intent of the original code by still revisiting
a subgraph if a different set of lanes is defined and hence
marking might progress differently.

Note: default size of the visited set has been confirmed to
cover >99% of invocations in large array of test shaders.

Reviewed By: piotr

Differential Revision: https://reviews.llvm.org/D98772
2021-03-18 10:00:41 +09:00
David Green e2935dcfc4 [TTI] Add a Mask to getShuffleCost
This adds an Mask ArrayRef to getShuffleCost, so that if an exact mask
can be provided a more accurate cost can be provided by the backend.
For example VREV costs could be returned by the ARM backend. This should
be an NFC until then, laying the groundwork for that to be added.

Differential Revision: https://reviews.llvm.org/D98206
2021-03-17 17:46:26 +00:00
Jay Foad 967b64beb4 [AMDGPU] Split dot2-insts feature
Split out some of the instructions predicated on the dot2-insts target
feature into a new dot7-insts, in preparation for subtargets that have
some but not all of these instructions. NFCI.

Differential Revision: https://reviews.llvm.org/D98717
2021-03-17 09:42:21 +00:00
RamNalamothu 43f2d269b3 [AMDGPU, NFC] Refactor FP/BP spill index code in emitPrologue/emitEpilogue
Reviewed By: scott.linder

Differential Revision: https://reviews.llvm.org/D98617
2021-03-16 19:19:45 +05:30
Dmitry Preobrazhensky 596db9934b [AMDGPU][MC] Disabled lds_direct for GFX90a
Fixed bug 49382.

Differential Revision: https://reviews.llvm.org/D98626
2021-03-16 13:52:36 +03:00
Stanislav Mekhanoshin bc27a31801 [AMDGPU] Fix copyPhysReg to not produce unalined vgpr access
RA can insert something like a sub1_sub2 COPY of a wide VGPR
tuple which results in the unaligned acces with v_pk_mov_b32
after the copy is expanded. This is regression after D97316.

Differential Revision: https://reviews.llvm.org/D98549
2021-03-15 14:14:30 -07:00
Stanislav Mekhanoshin c297709ee1 [AMDGPU] Fixed msan failure with uninitialized value 2021-03-15 13:58:19 -07:00
Stanislav Mekhanoshin 3bffb1cd0e [AMDGPU] Use single cache policy operand
Replace individual operands GLC, SLC, and DLC with a single cache_policy
bitmask operand. This will reduce the number of operands in MIR and I hope
the amount of code. These operands are mostly 0 anyway.

Additional advantage that parser will accept these flags in any order unlike
now.

Differential Revision: https://reviews.llvm.org/D96469
2021-03-15 13:00:59 -07:00
Jon Chesterfield 13e49dcee4 [amdgpu] Implement lower function LDS pass
[amdgpu] Implement lower function LDS pass

Local variables are allocated at kernel launch. This pass collects global
variables that are used from non-kernel functions, moves them into a new struct
type, and allocates an instance of that type in every kernel. Uses are then
replaced with a constantexpr offset.

Prior to this pass, accesses from a function are compiled to trap. With this
pass, most such accesses are removed before reaching codegen. The trap logic
is left unchanged by this pass. It is still reachable for the cases this pass
misses, notably the extern shared construct from hip and variables marked
constant which survive the optimizer.

This is of interest to the openmp project because the deviceRTL runtime library
uses cuda shared variables from functions that cannot be inlined. Trunk llvm
therefore cannot compile some openmp kernels for amdgpu. In addition to the
unit tests attached, this patch applied to ROCm llvm with fixed-abi enabled
and the function pointer hashing scheme deleted passes the openmp suite.

This lowering will use more LDS than strictly necessary. It is intended to be
a functionally correct fallback for cases that are difficult to target from
future optimisation passes.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D94648
2021-03-15 15:24:01 +00:00
Carl Ritson 13877db2fa [AMDGPU] Fix shortfalls in WQM marking
When tracking defined lanes through phi nodes in the live range
graph each branch of the phi must be handled independently.
Also rewrite the marking algorithm to reduce unnecessary
operations.

Previously a shared set of defined lanes was used which caused
marking to stop prematurely. This was observable in existing lit
tests, but test patterns did not cover this detail.

Reviewed By: piotr

Differential Revision: https://reviews.llvm.org/D98614
2021-03-15 21:44:15 +09:00
Jay Foad 5d48b45ce3 [AMDGPU] Use depth first iterator instead of recursive DFS. NFCI.
The reason for this is to avoid deep recursion in DFS() which can cause
stack overflow on large CFGs, especially on Windows.

Differential Revision: https://reviews.llvm.org/D98528
2021-03-15 10:32:55 +00:00
Stanislav Mekhanoshin 315ebe0df3 [AMDGPU] Fix getAlignedAGPRClassID
Not all register classes were listed.

Differential Revision: https://reviews.llvm.org/D98550
2021-03-12 14:41:50 -08:00
Matt Arsenault 6b76d82853 GlobalISel: Fix marking byval arguments as immutable
byval arguments need to be assumed writable. Only implicitly stack
passed arguments which aren't addressable in the IR can be assumed
immutable.

Mips is still broken since for some reason its doing its own thing
with the ValueHandlers (and x86 doesn't actually handle byval
arguments now, although some of the code is there).
2021-03-12 09:01:53 -05:00
Matt Arsenault 3231d2b581 AMDGPU/GlobalISel: Cleanup call lowering sequence
Now that handleAssignments is handling all of the argument splitting,
we don't have to move the insert point around.
2021-03-12 09:01:52 -05:00
Carl Ritson f08dadd242 [AMDGPU] Do not annotate an else branch if there is a kill
As llvm.amdgcn.kill is lowered to a terminator it can cause
else branch annotations to end up in the wrong block.
Do not annotate conditionals as else branches where there is
a kill to avoid this.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D97427
2021-03-12 11:52:08 +09:00
Carl Ritson c07f2025e4 [AMDGPU] Restrict image_msaa_load to MSAA dimension types
This instruction is only valid on 2D MSAA and 2D MSAA Array
surfaces.  Remove intrinsic support for other dimension types,
and block assembly for unsupported dimensions.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D98397
2021-03-12 09:47:24 +09:00
Ruiling Song e8e6817d00 [AMDGPU] Don't check hasStackObjects() when reserving VGPR
We have amdgpu_gfx functions that have high register pressure. If
we do not reserve VGPR for SGPR spill, we will fall into the path
to spill the SGPR to memory, which does not only have correctness issue,
but also have really bad performance.

I don't know why there is the check for hasStackObjects(), in our case,
we don't have stack objects at the time of finalizeLowering(). So just
remove the check that we always reserve a VGPR for possible SGPR spill
in non-entry functions.

Reviewed by: arsenm

Differential Revision: https://reviews.llvm.org/D98345
2021-03-12 08:11:14 +08:00
Ruiling Song 4cee5cad28 [AMDGPU] Free reserved VGPR if no SGPR spill
I met some code generation behavior change when I tried to remove
the hasStackObject() check when reserving VGPR for SGPR spill.
For example, the function `callee_no_stack_no_fp_elim_all` in the lit
test file `callee-frame-setup.ll`.
The generated code changed from:
```
  s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
  s_mov_b32 s4, s33
  s_mov_b32 s33, s32
  s_mov_b32 s33, s4
  s_setpc_b64 s[30:31]
```

into something like:
```
  s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
  v_writelane_b32 v63, s33, 0
  s_mov_b32 s33, s32
  v_readlane_b32 s33, v63, 0
  s_setpc_b64 s[30:31]
```

I think we still prefer the old version where only scalar instructions are needed.
The idea here is free the reserved VGPR if no SGPR spills. So we will very likely
to use a free SGPR for fp/sp spill.

Reviewed by: arsenm

Differential Revision: https://reviews.llvm.org/D98344
2021-03-12 08:11:14 +08:00
Stanislav Mekhanoshin 6e8a0213a3 [AMDGPU] Remove dead MTBUF patterns
These patterns are obviously dead, they are using format
operand which is not selected and we have no corresponding
SelectMUBUF() function.

Differential Revision: https://reviews.llvm.org/D98451
2021-03-11 14:13:00 -08:00
Matt Arsenault 70cb57d7da AMDGPU/GlobalISel: Improve private addressing mode matching
This enables the look-through-copy to hack around not correctly
regbankselecting constants to match the use bank.
2021-03-11 10:23:35 -05:00
Nikita Popov 46354bac76 [OpaquePtrs] Remove some uses of type-less CreateLoad APIs (NFC)
Explicitly pass loaded type when creating loads, in preparation
for the deprecation of these APIs.

There are still a couple of uses left.
2021-03-11 14:40:57 +01:00
Ruiling Song 66340846b3 [AMDGPU] Always create Stack Object for reserved VGPR
As we may overwrite inactive lanes of a caller-save-vgpr, we should
always save/restore the reserved vgpr for sgpr spill.

Reviewed by: arsenm

Differential Revision: https://reviews.llvm.org/D98319
2021-03-11 10:06:07 +08:00
Stanislav Mekhanoshin 9931b1f7a4 [AMDGPU] Disable SCC bit on fp atomics
Differential Revision: https://reviews.llvm.org/D98221
2021-03-10 12:36:09 -08:00
Stanislav Mekhanoshin 574a9dabc6 [AMDGPU] Always expand system scope fp atomics on gfx90a
FP atomics in system scope cannot be used and shall always
be expanded in a CAS loop.

Differential Revision: https://reviews.llvm.org/D98085
2021-03-10 12:35:23 -08:00
Jay Foad 70f013fd3b [AMDGPU] Fix isReallyTriviallyReMaterializable for V_MOV_*
D57708 changed SIInstrInfo::isReallyTriviallyReMaterializable to reject
V_MOVs with extra implicit operands, but it accidentally rejected all
V_MOVs because of their implicit use of exec. Fix it but avoid adding a
moderately expensive call to MI.getDesc().getNumImplicitUses().

In real graphics shaders this changes quite a few vgpr copies into move-
immediates, which is good for avoiding stalls on GFX10.

Differential Revision: https://reviews.llvm.org/D98347
2021-03-10 16:18:12 +00:00
Jay Foad 288ea820cf [AMDGPU] Refactor AMDGPUTargetStreamer::EmitCodeEnd
Refactor and add comments to explain where the magic numbers come from
in terms of the instruction cache line size. NFC.

Differential Revision: https://reviews.llvm.org/D98266
2021-03-09 19:02:18 +00:00
Christudasan Devadasan 24c0ad7143 [AMDGPU] Fix the dead frame indices during custom spill lowering.
AMDGPU target tries to handle the SGPR and VGPR spills in a
custom pass before the actual frame lowering pass. Once they
are handled and the respective frames are eliminated in the
custom pass, certain uses of them still remain. For instance,
the DBG_VALUE instructions inserted by the allocator alongside
the spill instruction will use the corresponding frame index.
They become dead later during PEI and causes a crash while trying to
replace the frame indices. We should possibly avoid this custom pass.
For now, replacing such dead references with null register value.

Reviewed By: arsenm, scott.linder

Differential Revision: https://reviews.llvm.org/D98038
2021-03-09 23:22:49 +05:30
Ruiling Song 67a05f4e09 [AMDGPU] Remove unused function opcodeEmitsNoInsts()
This was missed in the patch D97545, and cause buildbot failure.

Reviewed by: critson

Differential Revision: https://reviews.llvm.org/D98229
2021-03-09 10:48:30 +08:00
Ruiling Song f0ccdde3c9 [AMDGPU] Remove SI_MASK_BRANCH
This is already deprecated, so remove code working on this.
Also update the tests by using S_CBRANCH_EXECZ instead of SI_MASK_BRANCH.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D97545
2021-03-09 09:13:23 +08:00
Jay Foad 99682bc039 Revert "Revert "[AMDGPU] Restore the s_memtime instruction in gfx1030""
This reverts commit e58d68fcd0.

This reinstates commit fc28f600e5
with a fix to initialize HasShaderCyclesRegister. See
https://reviews.llvm.org/D97928.
2021-03-06 09:00:01 +00:00
Mitch Phillips e58d68fcd0 Revert "[AMDGPU] Restore the s_memtime instruction in gfx1030"
Broke the ASan/MSan buildbots. See more comments in the original patch,
https://reviews.llvm.org/D97928.

Build failure at http://lab.llvm.org:8011/#/builders/5/builds/5327

This reverts commit fc28f600e5.
2021-03-05 18:24:59 -08:00
Jay Foad fc28f600e5 [AMDGPU] Restore the s_memtime instruction in gfx1030
gfx1030 added a new way to implement readcyclecounter using the
SHADER_CYCLES hardware register, but the s_memtime instruction still
exists, so the MC layer should still accept it and the
llvm.amdgcn.s.memtime intrinsic should still work.

Differential Revision: https://reviews.llvm.org/D97928
2021-03-05 20:19:11 +00:00
RamNalamothu 3998a8e797 [AMDGPU] Do not attempt sgpr spills to vgpr, when it is disabled
This covers a path missed in https://reviews.llvm.org/D95768.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D98013
2021-03-05 22:47:21 +05:30
Sebastian Neubauer e0e73714fb [AMDGPU] Keep skip branch for ds instructions
Same as other memory instructions, ds instructions add latency even if
exec is zero. Jumping over them if exec=0 is cheaper than executing
them.
With this change, the branch instruction that skips over a basic block
if exec=0 is not removed when the block contains a ds instruction.

Differential Revision: https://reviews.llvm.org/D97922
2021-03-05 12:34:09 +01:00
Petar Avramovic 36beaa3ba3 Reland AMDGPU/GlobalISel: Combine zext(trunc x) to x after RegBankSelect
Recommit bf5a582650. Depends on
4c8fb7ddd6 which was reverted.

RegBankSelect creates zext and trunc when it selects banks for uniform i1.
Add zext_trunc_fold from generic combiner to post RegBankSelect combiner.

Differential Revision: https://reviews.llvm.org/D95432
2021-03-05 11:05:37 +01:00
Jay Foad ed7458398a [AMDGPU] Don't check for VMEM hazards on GFX10
The hazard where a VMEM reads an SGPR written by a VALU counts as a data
dependency hazard, so no nops are required on GFX10. Tested with Vulkan
CTS on GFX10.1 and GFX10.3.

Differential Revision: https://reviews.llvm.org/D97926
2021-03-04 21:44:56 +00:00
Nico Weber e68de60bc4 Revert "AMDGPU/GlobalISel: Combine zext(trunc x) to x after RegBankSelect"
This reverts commit bf5a582650.
Also depends on now-reverted 4c8fb7ddd6
2021-03-04 10:16:11 -05:00
Petar Avramovic bf5a582650 AMDGPU/GlobalISel: Combine zext(trunc x) to x after RegBankSelect
RegBankSelect creates zext and trunc when it selects banks for uniform i1.
Add zext_trunc_fold from generic combiner to post RegBankSelect combiner.

Differential Revision: https://reviews.llvm.org/D95432
2021-03-04 15:05:24 +01:00
Stanislav Mekhanoshin b70c483e04 [AMDGPU] Exclude always_inline from max bb threshold
Honor always_inline attribute when processing -amdgpu-inline-max-bb.
It was lost during the ports of the heuristic. There is no reason
to honor inline hint, but not always inline.

Differential Revision: https://reviews.llvm.org/D97790
2021-03-03 10:21:56 -08:00
Matt Arsenault 78dcff4841 GlobalISel: Add default implementation of assignValueToReg
Refactor insertion of the asserting ops. This enables using them for
AMDGPU.

This code should essentially be the same for every target. Mips, X86
and ARM all have different code there now, but this seems to be an
accident. The assignment functions are called with different types
than they would be in the DAG, so this is all likely an assortment of
hacks to get around that.
2021-03-03 09:29:53 -05:00
Piotr Sobczak 4672bac177 [AMDGPU] Introduce Strict WQM mode
* Add amdgcn_strict_wqm intrinsic.
* Add a corresponding STRICT_WQM machine instruction.
* The semantic is similar to amdgcn_strict_wwm with a notable difference that not all threads will be forcibly enabled during the computations of the intrinsic's argument, but only all threads in quads that have at least one thread active.
* The difference between amdgc_wqm and amdgcn_strict_wqm, is that in the strict mode an inactive lane will always be enabled irrespective of control flow decisions.

Reviewed By: critson

Differential Revision: https://reviews.llvm.org/D96258
2021-03-03 14:19:16 +01:00
Piotr Sobczak c3ce7bae80 [AMDGPU] Rename amdgcn_wwm to amdgcn_strict_wwm
* Introduce the new intrinsic amdgcn_strict_wwm
 * Deprecate the old intrinsic amdgcn_wwm

The change is done for consistency as the "strict"
prefix will become an important, distinguishing factor
between amdgcn_wqm and amdgcn_strictwqm in the future.

The "strict" prefix indicates that inactive lanes do not
take part in control flow, specifically an inactive lane
enabled by a strict mode will always be enabled irrespective
of control flow decisions.

The amdgcn_wwm will be removed, but doing so in two steps
gives users time to switch to the new name at their own pace.

Reviewed By: critson

Differential Revision: https://reviews.llvm.org/D96257
2021-03-03 09:33:57 +01:00
Carl Ritson 2ddac69f98 [AMDGPU] Rename llvm.amdgcn.msaa.load to llvm.amdgcn.msaa.load.x
While the underlying instruction is called image_msaa_load,
the resource must be x component only.
Rename the intrinsic for clarity.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D97829
2021-03-03 17:30:39 +09:00
Matt Arsenault fd82cbcf7d GlobalISel: Merge and cleanup more AMDGPU call lowering code
This merges more AMDGPU ABI lowering code into the generic call
lowering. Start cleaning up by factoring away more of the pack/unpack
logic into the buildCopy{To|From}Parts functions. These could use more
improvement, and the SelectionDAG versions are significantly more
complex, and we'll eventually have to emulate all of those cases too.

This is mostly NFC, but does result in some minor instruction
reordering. It also removes some of the limitations with mismatched
sizes the old code had. However, similarly to the merge on the input,
this is forcing gfx6/gfx7 to use the gfx8+ ABI (which is what we
actually want, but SelectionDAG is stuck using the weird emergent
ABI).

This also changes the load/store size for stack passed EVTs for
AArch64, which makes it consistent with the DAG behavior.
2021-03-02 17:31:13 -05:00
Amara Emerson 8a316045ed [AArch64][GlobalISel] Enable use of the optsize predicate in the selector.
To do this while supporting the existing functionality in SelectionDAG of using
PGO info, we add the ProfileSummaryInfo and LazyBlockFrequencyInfo analysis
dependencies to the instruction selector pass.

Then, use the predicate to generate constant pool loads for f32 materialization,
if we're targeting optsize/minsize.

Differential Revision: https://reviews.llvm.org/D97732
2021-03-02 12:55:51 -08:00
Joe Nash 5531f24cc2 [AMDGPU] Make OMod explicit for V_CVT_{U,I}*
Make OMod explicit instead of implied by HasModifiers in the
operand list. Requires explicitly setting HasOMod=1 for
irregular OMod usage in instruction V_CVT_{U,I}*

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D97587

Change-Id: I230e1476f529e816eec60e242531f23a99e3839f
2021-03-02 13:32:06 -05:00
Simon Pilgrim 25b788716b [AMDGPU] Fix "initialization is never read" clang-tidy warnings. NFCI. 2021-03-02 12:06:24 +00:00
Dmitry Preobrazhensky 28f164bca7 [AMDGPU][MC][GFX9+] Corrected encoding of op_sel_hi for unused operands in VOP3P
Corrected encoding of VOP3P op_sel_hi for unused operands. See bug 49363.

Differential Revision: https://reviews.llvm.org/D97689
2021-03-02 13:02:25 +03:00
Stanislav Mekhanoshin 7c724a896f [AMDGPU] Do not check max-bb for a single block callee
-amdgpu-inline-max-bb option could lead to a suboptimal
codegen preventing inlining of really simple functions
including pure wrapper calls. Relax the cutoff by allowing
to call a function with a single block on the grounds
that it will not increase total number of blocks after
inlining.

Differential Revision: https://reviews.llvm.org/D97744
2021-03-01 19:48:50 -08:00
Jay Foad 796a60d2ea [AMDGPU] New intrinsic void llvm.amdgcn.s.sethalt(i32)
The expected use case is for frontends to insert this into
shaders that are to be run under a debugger. The shader can
then be resumed or single stepped from the point of the call
under debugger control.

Differential Revision: https://reviews.llvm.org/D97670
2021-03-01 14:30:23 +00:00
Jay Foad 48ca5d3398 [AMDGPU] Simplify SITargetLowering::isSDNodeSourceOfDivergence. NFC.
Check for read-modify-write AtomicSDNodes instead of using an exhaustive
list of ISD opcodes.

Differential Revision: https://reviews.llvm.org/D97671
2021-03-01 14:22:08 +00:00
Matt Arsenault 6c260d3bc0 GlobalISel: Move splitToValueTypes to generic code
I copied the nearly identical function from AArch64 into AMDGPU, so
fix this duplication.

Mips and X86 have their own more exotic versions which should be
removed. However replacing those is better left for a separate patch
since it requires other changes to avoid regressions.
2021-03-01 08:58:18 -05:00
Matt Arsenault 81b2c23b77 AMDGPU: Use kill instruction to hint soft clause live ranges
Previously we would use a bundle to hint the register allocator to not
overwrite the pointers in a sequence of loads to avoid breaking soft
clauses. This bundling was based on a fuzzy register pressure
heuristic, so we could not guarantee using more registers than are
really available. This would result in register allocator failing on
unsatisfiable bundles. Use a kill to artificially extend the live
ranges, so we can always succeed at register allocation even if it
means extra spills in the worst case.

This seems to capture most of the benefit of the bundle while avoiding
most of the risk presented by the bundle. However the lit tests do
show a handful of regressions. In some cases with sequences of
volatile loads, unused load components end up getting reallocated to
the next load which forces a wait between. There are also a few small
scheduling regressions where a hazard used to be avoided, and one
spill torture test which for some reason nearly doubles the stack
usage. There is also a bit of noise from leftover kills (it may make
sense for post-RA pseudos to strip all of these out).
2021-02-26 18:26:40 -05:00
Stanislav Mekhanoshin 799c50fe93 [AMDGPU] Avoid second rescheduling for some regions
If a region was not constrained by a high register pressure
and was not rescheduled without clustering we can skip
rescheduling it ClusteredLowOccupancyReschedule stage.

This improves scheduling speed by 25% on some kernels.

Differential Revision: https://reviews.llvm.org/D97506
2021-02-26 12:29:37 -08:00
Stanislav Mekhanoshin 635993f07b [AMDGPU] Skip unclusterd rescheduling w/o ld/st
We are attempting rescheduling without load store clustering
if occupancy limits were not met with clustering. Skip this
for regions which do not have any loads or stores at all.

In a set of kernels I am experimenting with this improves
scheduling time by ~30%.

Differential Revision: https://reviews.llvm.org/D97342
2021-02-26 12:29:03 -08:00
Jay Foad dc2259537a [AMDGPU] Add selection pattern for v_xnor_b32
This allows GlobalISel to use this instruction where available. I assume
SelectionDAG always selects s_xnor_b32 so it isn't affected by this
change.

Differential Revision: https://reviews.llvm.org/D97560
2021-02-26 16:41:47 +00:00
Jay Foad 3ad5216ed8 [AMDGPU] Better codegen for i64 bitreverse
Differential Revision: https://reviews.llvm.org/D97547
2021-02-26 15:51:36 +00:00
Michael Liao 0d4e12e3c1 [amdgpu] Atomic should be source of divergence.
Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D97392
2021-02-24 15:27:47 -05:00
Matt Arsenault 589223e044 AMDGPU: Remove special case in shouldCoalesce
Unaligned registers are now constrained with classes, rather than
specially reserving a subset of the whole class.
2021-02-24 14:49:44 -05:00
Matt Arsenault 78b6d73a93 AMDGPU: Add even aligned VGPR/AGPR register classes
gfx90a operations require even aligned registers, but this was
previously achieved by reserving registers inside the full class.

Ideally this would be captured in the static instruction definitions
for the operands, and we would have different instructions per
subtarget. The hackiest part of this is we need to manually reassign
AGPR register classes after instruction selection (we get away without
this for VGPRs since those types are actually registered for legal
types).
2021-02-24 14:49:37 -05:00
Jay Foad aab709f090 [AMDGPU] Add more PAL metadata register names
Add all the registers that are currently used by
LLPC: https://github.com/GPUOpen-Drivers/llpc

This only affects disassembly of PAL metadata generated by LLPC and
similar frontends.

Differential Revision: https://reviews.llvm.org/D95619
2021-02-24 13:37:05 +00:00
Jay Foad 67f0620831 [AMDGPU] Update s_sendmsg messages
Update the list of s_sendmsg messages known to the assembler and
disassembler and validate the ones that were added or removed in gfx9
and gfx10.

Differential Revision: https://reviews.llvm.org/D97295
2021-02-24 13:07:00 +00:00
Stanislav Mekhanoshin d1b92c91af [AMDGPU] Set threshold for regbanks reassign pass
This is to limit compile time. I did experiments with some
inputs and found that compile time keeps reasonable for this
pass if we have less than 100000 virtual registers and then
starts to explode somewhere between 100000 and 150000.

Differential Revision: https://reviews.llvm.org/D97218
2021-02-23 10:22:31 -08:00
Nicolai Hähnle 52bc2e7577 [AMDGPU][SelectionDAG] Don't combine uniform multiplies to MUL_[UI]24
Prefer to keep uniform (non-divergent) multiplies on the scalar ALU when
possible. This significantly improves some game cases by eliminating
v_readfirstlane instructions when the result feeds into a scalar
operation, like the address calculation for a scalar load or store.

Since isDivergent is only an approximation of whether a value is in
SGPRs, it can potentially regress some situations where a uniform value
ends up in a VGPR. These should be rare in real code, although the test
changes do contain a number of examples.

Most of the test changes are just using s_mul instead of v_mul/mad which
is generally better for both register pressure and latency (at least on
GFX10 where sgpr pressure doesn't affect occupancy and vector ALU
instructions have significantly longer latency than scalar ALU). Some
R600 tests now use MULLO_INT instead of MUL_UINT24.

GlobalISel appears to handle more scenarios in the desirable way,
although it can also be thrown off and fails to select the 24-bit
multiplies in some cases.

Alternative solution considered and rejected was to allow selecting
MUL_[UI]24 to S_MUL_I32. I've rejected this because the definition of
those SD operations works is don't-care on the most significant 8 bits,
and this fact is used in some combines via SimplifyDemandedBits.

Based on a patch by Nicolai Hähnle.

Differential Revision: https://reviews.llvm.org/D97063
2021-02-23 15:39:19 +00:00
David Green dd2dbf7ee2 [TTI] Change getOperandsScalarizationOverhead to take Type args
As a followup to D95291, getOperandsScalarizationOverhead was still
using a VF as a vector factor if the arguments were scalar, and would
assert on certain matrix intrinsics with differently sized vector
arguments. This patch removes the VF arg, instead passing the Types
through directly. This should allow it to more accurately compute the
cost without having to guess at which operands will be vectorized,
something difficult with more complex intrinsics.

This adjusts one SVE test as it is now calling the wrong intrinsic vs
veccall. Without invalid InstructCosts the cost of the scalarized
intrinsic is too low. This should get fixed when the cost of
scalarization is accounted for with scalable types.

Differential Revision: https://reviews.llvm.org/D96287
2021-02-23 13:04:59 +00:00
David Green bd4b61efbd [CostModel] Remove VF from IntrinsicCostAttributes
getIntrinsicInstrCost takes a IntrinsicCostAttributes holding various
parameters of the intrinsic being costed. It can either be called with a
scalar intrinsic (RetTy==Scalar, VF==1), with a vector instruction
(RetTy==Vector, VF==1) or from the vectorizer with a scalar type and
vector width (RetTy==Scalar, VF>1). A RetTy==Vector, VF>1 is considered
an error. Both of the vector modes are expected to be treated the same,
but because this is confusing many backends end up getting it wrong.

Instead of trying work with those two values separately this removes the
VF parameter, widening the RetTy/ArgTys by VF used called from the
vectorizer. This keeps things simpler, but does require some other
modifications to keep things consistent.

Most backends look like this will be an improvement (or were not using
getIntrinsicInstrCost). AMDGPU needed the most changes to keep the code
from c230965ccf working. ARM removed the fix in
dfac521da1, webassembly happens to get a fixup for an SLP cost
issue and both X86 and AArch64 seem to now be using better costs from
the vectorizer.

Differential Revision: https://reviews.llvm.org/D95291
2021-02-23 13:03:26 +00:00
Stanislav Mekhanoshin bb16efe280 [AMDGPU] Move RPT::getLiveRegs() check under EXPENSIVE_CHECKS
This is too expensive even for debug builds. It doubles
scheduling time if enabled.

Differential Revision: https://reviews.llvm.org/D97232
2021-02-22 15:21:59 -08:00
Dmitry Preobrazhensky 4813518092 [AMDGPU][MC] Corrected bound_ctrl for compatibility with sp3
Enabled "bound_ctrl:1" and disabled "bound_ctrl:-1" syntax.
Corrected printer to output "bound_ctrl:1" instead of "bound_ctrl:0".
See bug 35397 for detailed issue description.

Differential Revision: https://reviews.llvm.org/D97048
2021-02-22 14:59:40 +03:00
madhur13490 3c297a2564 Make fixed-abi default for AMD HSA OS
fixed-abi uses pre-defined and predictable
SGPR/VGPRs for passing arguments. This patch makes
this scheme default when HSA OS is specified in triple.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D96340
2021-02-19 15:05:25 +00:00
Carl Ritson 8181dcd30f [AMDGPU] WQM/WWM: Fix marking of partial definitions
Track lanes when processing definitions for marking WQM/WWM.
If all lanes have been defined then marking can stop.
This prevents marking unnecessary instructions as WQM/WWM.

In particular this fixes a bug where values passing through
V_SET_INACTIVE would me marked as requiring WWM.

Reviewed By: piotr

Differential Revision: https://reviews.llvm.org/D95503
2021-02-19 20:45:24 +09:00
Matt Arsenault 62d946e133 GlobalISel: Merge some AMDGPU ABI lowering code to generic code
AMDGPU currently has a lot of pre-processing code to pre-split
argument types into 32-bit pieces before passing it to the generic
code in handleAssignments. This is a bit sloppy and also requires some
overly fancy iterator work when building the calls. It's better if all
argument marshalling code is handled directly in
handleAssignments. This handles more situations like decomposing large
element vectors into sub-element sized pieces.

This should mostly be NFC, but does change the generated code by
shifting where the initial argument packing instructions are placed. I
think this is nicer looking, since it now emits the packing code
directly after the relevant copies, rather than after the copies for
the remaining arguments.

This doubles down on gfx6/gfx7 using the gfx8+ ABI for 16-bit
types. This is ultimately the better option, but incompatible with the
DAG. Fixing this requires more work, especially for f16.
2021-02-18 17:26:55 -05:00
Nikita Popov 70e3c9a8b6 [BasicAA] Always strip single-argument phi nodes
We can always look through single-argument (LCSSA) phi nodes when
performing alias analysis. getUnderlyingObject() already does this,
but stripPointerCastsAndInvariantGroups() does not. We still look
through these phi nodes with the usual aliasPhi() logic, but
sometimes get sub-optimal results due to the restrictions on value
equivalence when looking through arbitrary phi nodes. I think it's
generally beneficial to keep the underlying object logic and the
pointer cast stripping logic in sync, insofar as it is possible.

With this patch we get marginally better results:

  aa.NumMayAlias | 5010069 | 5009861
  aa.NumMustAlias | 347518 | 347674
  aa.NumNoAlias | 27201336 | 27201528
  ...
  licm.NumPromoted | 1293 | 1296

I've renamed the relevant strip method to stripPointerCastsForAliasAnalysis(),
as we're past the point where we can explicitly spell out everything
that's getting stripped.

Differential Revision: https://reviews.llvm.org/D96668
2021-02-18 23:07:50 +01:00
Stanislav Mekhanoshin 5247a0d9e6 [AMDGPU] Correct gfx90c feature list
Looks like we have forced FeatureXNACK and forgot
FeatureMadMacF32Insts.

Differential Revision: https://reviews.llvm.org/D96989
2021-02-18 12:40:27 -08:00
Stanislav Mekhanoshin 75997e8407 [AMDGPU] Fixed msan build
LoadStoreOptimizer was using uninitialized SCC value for
instructions where it is unsupported.
2021-02-17 18:01:23 -08:00
Stanislav Mekhanoshin 48d2e04152 [AMDGPU] Mark SMRD atomics
We did not have atomic flags on SMRD, did not copy TSFlags
to real instructions, and did not have ret/noret atomic map.

At the moment it is NFC, but needed for D96469.

Differential Revision: https://reviews.llvm.org/D96823
2021-02-17 16:47:02 -08:00
Stanislav Mekhanoshin a8d9d50762 [AMDGPU] gfx90a support
Differential Revision: https://reviews.llvm.org/D96906
2021-02-17 16:01:32 -08:00
Piotr Sobczak c72a63b4b0 [AMDGPU] Add implicit vcc_lo on S_CBRANCH_VCCNZ in wave32
* Update skip-if-dead.ll with tests for wave32.
* Fix the crash in verifier in one newly enabled test by adding
  missing fixImplicitOperands in branch insertion code.

```
*** Bad machine code: Using an undefined physical register ***
- function:    test_kill_divergent_loop
- basic block: %bb.2 bb (0xad96308)
- instruction: S_CBRANCH_VCCNZ %bb.1, implicit $vcc_lo
- operand 1:   implicit $vcc_lo
LLVM ERROR: Found 1 machine code errors.
```

* Simplify "cbranch_kill" to not use interp instructions.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D96793
2021-02-17 15:14:57 +01:00
Jay Foad c8be7e96bb [AMDGPU] Rename simplifyI24 to simplifyMul24
Also simplify one of its call sites. NFC.
2021-02-17 11:33:49 +00:00
Piotr Sobczak 08131c7439 [AMDGPU] Fix a miscompile with S_ADD/S_SUB
The helper function isBoolSGPR is too aggressive when determining
when a v_cndmask can be skipped on a boolean value because the
function does not check the operands of and/or/xor.

This can be problematic for the Add/Sub combines that can leave
bits set even for inactive lanes leading to wrong results.

Fix this by inspecting the operands of and/or/xor recursively.

Differential Revision: https://reviews.llvm.org/D86878
2021-02-17 12:24:58 +01:00
Tony Tye c62b737ad6 [AMDGPU] Correct rmw atomics s_waitcnt generation
The AMD GPU SIMemoryLegalizer was using the ordering address space
rather than the instruction address space when determining the
s_waitcnt to generate to ensure that a read-modify-write atomic has
completed. This resulted in additional unnecessary counters being
waited on.

Differential Revision: https://reviews.llvm.org/D96743
2021-02-17 01:32:29 +00:00
Matt Arsenault a7455d7b7c AMDGPU: Remove kills following clusters of memory instruction
In a future commit, soft clauses will be hinted with kill instructions
rather than forced together with bundles. Look for kills that look
like this, and erase them. I'm not sure if the check for specific uses
is worthwhile, or if it would be better to just unconditionally erase
kills.

This reduces test churn in a future patch.
2021-02-16 10:49:28 -05:00
Matt Arsenault c320e8196a AMDGPU: Fix debug info handling in post-RA bundler
This was allowing debug instructions to break the bundling, which
would change scheduling behavior. Bundle debug info / kills inside
the bundle. This seems to work OK, although the asm printer doesn't
understand these in a bundle. This implicitly expects the memory
legalizer to unbundle. It would probably be slightly nicer to move
these after.

Rewrite the loop to be clearer and make sure we don't end a bundle on
a meta instruction, only allow them in between other valid bundle
instructions.
2021-02-16 10:42:06 -05:00
Matt Arsenault 392e0fcfd1 GlobalISel: Handle arguments partially passed on the stack
The API is a bit awkward since you need to index into an array in the
passed struct. I guess an alternative would be to pass all of the
individual fields.
2021-02-15 17:06:14 -05:00
Duncan P. N. Exon Smith 22a52dfddc TransformUtils: Fix metadata handling in CloneModule (and improve CloneFunctionInto)
This commit fixes how metadata is handled in CloneModule to be sound,
and improves how it's handled in CloneFunctionInto (although the latter
is still awkward when called within a module).

Ruiling Song pointed out in PR48841 that CloneModule was changed to
unsoundly use the RF_ReuseAndMutateDistinctMDs flag (renamed in
fa35c1f80f for clarity). This flag papered
over a crash caused by other various changes made to CloneFunctionInto
over the past few years that made it unsound to use cloning between
different modules.

(This commit partially addresses PR48841, fixing the repro from
preprocessed source but not textual IR. MDNodeMapper::mapDistinctNode
became unsound in df763188c9 and this
commit does not address that regression.)

RF_ReuseAndMutateDistinctMDs is designed for the IRMover to use,
avoiding unnecessary clones of all referenced metadata when linking
between modules (with IRMover, the source module is discarded after
linking). It never makes sense to use when you're not discarding the
source. This commit drops its incorrect use in CloneModule.

Sadly, the right thing to do with metadata when cloning a function is
complicated, and this patch doesn't totally fix it.

The first problem is that there are two different types of referenceable
metadata and it's not obvious what to with one of them when remapping.

- `!0 = !{!1}` is metadata's version of a constant. Programatically it's
  called "uniqued" (probably a better term would be "constant") because,
  like `ConstantArray`, it's stored in uniquing tables. Once it's
  constructed, it's illegal to change its arguments.
- `!0 = distinct !{!1}` is a bit closer to a global variable. It's legal
  to change the operands after construction.

What should be done with distinct metadata when cloning functions within
the same module?

- Should new, cloned nodes be created?
- Should all references point to the same, old nodes?

The answer depends on whether that metadata is effectively owned by a
function.

And that's the second problem. Referenceable metadata's ownership model
is not clear or explicit. Technically, it's all stored on an
LLVMContext. However, any metadata that is `distinct`, that transitively
references a `distinct` node, or that transitively references a
GlobalValue is specific to a Module and is effectively owned by it. More
specifically, some metadata is effectively owned by a specific Function
within a module.

Effectively function-local metadata was introduced somewhere around
c10d0e5ccd, which made it illegal for two
functions to share a DISubprogram attachment.

When cloning a function within a module, you need to clone the
function-local debug info and suppress cloning of global debug info (the
status quo suppresses cloning some global debug info but not all). When
cloning a function to a new/different module, you need to clone all of
the debug info.

Here's what I think we should do (eventually? soon? not this patch
though):
- Distinguish explicitly (somehow) between pure constant metadata owned
  by the LLVMContext, global metadata owned by the Module, and local
  metadata owned by a GlobalValue (such as a function).
- Update CloneFunctionInto to trigger cloning of all "local" metadata
  (only), perhaps by adding a bit to RemapFlag. Alternatively, split
  out a separate function CloneFunctionMetadataInto to prime the
  metadata map that callers are updated to call ahead of time as
  appropriate.

Here's the somewhat more isolated fix in this patch:
- Converted the `ModuleLevelChanges` parameter to `CloneFunctionInto` to
  an enum called `CloneFunctionChangeType` that is one of
  LocalChangesOnly, GlobalChanges, DifferentModule, and ClonedModule.
- The code maintaining the "functions uniquely own subprograms"
  invariant is now only active in the first two cases, where a function
  is being cloned within a single module. That's necessary because this
  code inhibits cloning of (some) "global" metadata that's effectively
  owned by the module.
- The code maintaining the "all compile units must be explicitly
  referenced by !llvm.dbg.cu" invariant is now only active in the
  DifferentModule case, where a function is being cloned into a new
  module in isolation.
- CoroSplit.cpp's call to CloneFunctionInto in CoroCloner::create
  uses LocalChangeOnly, since fa635d730f
  only set `ModuleLevelChanges` to trigger cloning of local metadata.
- CloneModule drops its unsound use of RF_ReuseAndMutateDistinctMDs
  and special handling of !llvm.dbg.cu.
- Fixed some outdated header docs and left a couple of FIXMEs.

Differential Revision: https://reviews.llvm.org/D96531
2021-02-15 11:56:00 -08:00
Stanislav Mekhanoshin 5cf9292ce3 [AMDGPU] Add two TSFlags: IsAtomicNoRtn and IsAtomicRtn
We are using AtomicNoRet map in multiple places to determine
if an instruction atomic, rtn or nortn atomic. This method
does not work always since we have some instructions which
only has rtn or nortn version.

One such instruction is ds_wrxchg_rtn_b32 which does not have
nortn version. This has caused changes in memory legalizer
tests.

Differential Revision: https://reviews.llvm.org/D96639
2021-02-15 11:27:59 -08:00
Carl Ritson aef781b47a [AMDGPU] Add llvm.amdgcn.wqm.demote intrinsic
Add intrinsic which demotes all active lanes to helper lanes.
This is used to implement demote to helper Vulkan extension.

In practice demoting a lane to helper simply means removing it
from the mask of live lanes used for WQM/WWM/Exact mode.
Where the shader does not use WQM, demotes just become kills.

Additionally add llvm.amdgcn.live.mask intrinsic to complement
demote operations. In theory llvm.amdgcn.ps.live can be used
to detect helper lanes; however, ps.live can be moved by LICM.
The movement of ps.live cannot be remedied without changing
its type signature and such a change would require ps.live
users to update as well.

Reviewed By: piotr

Differential Revision: https://reviews.llvm.org/D94747
2021-02-15 08:45:46 +09:00
Tony Tye 8a91b68b95 [AMDGPU] Limit memory scope for scratch, LDS and GDS
Changes for AMD GPU SIMemoryLegalizer:

- Limit the memory scope to maximum supported by the scratch, LDS and
  GDS address spaces.

- Improve assertion checking.

- Correct toSIAtomicScope argument name.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D96643
2021-02-14 17:34:12 +00:00
Kazu Hirata b4c0d610a6 [AMDGPU] Fix build breakage 2021-02-14 09:02:55 -08:00
Kazu Hirata 910e2d1e57 [llvm] Use llvm::is_contained (NFC) 2021-02-14 08:36:20 -08:00
Kazu Hirata 96c90a6d14 [AMDGPU] Drop unnecessary const from a return type (NFC)
Identified with readability-const-return-type.
2021-02-12 23:44:32 -08:00
Stanislav Mekhanoshin c96e214b9c [AMDGPU] Fix Windows build
A trivial fix, 64 bit constant is 1ull, not 1ul on Windows.
Fixed build broken by c0d7a8bc62.
2021-02-12 12:30:52 -08:00
Stanislav Mekhanoshin c0d7a8bc62 [AMDGPU] Allow accvgpr_read/write decode with opsel
These two instructions are VOP3P and have op_sel_hi bits,
however do not use op_sel_hi. That is recommended to set
unused op_sel_hi bits to 1. However, we cannot decode
both representations with 1 and 0 if bits are set to
default value 1. If bits are set to be ignored with '?'
initializer then encoding defaults them to 0.

The patch is a hack to force ignored '?' bits to 1 on
encoding for these instructions.

There is still canonicalization happens on disasm print
if incoming values are non-default, so that disasm output
does not match binary input, but this is pre-existing
problem for all instructions with '?' bits.

Fixes: SWDEV-272540

Differential Revision: https://reviews.llvm.org/D96543
2021-02-12 10:04:47 -08:00
Jay Foad 7e9ceed9a2 [TableGen][GlobalISel] Allow duplicate RendererFns
Allow different GICustomOperandRenderers to use the same RendererFn.
This avoids the need for targets to define a bunch of identical C++
renderer functions with different names.

Without this fix TableGen would have emitted code that tried to define
the GICR enumeration with duplicate enumerators.

Differential Revision: https://reviews.llvm.org/D96587
2021-02-12 15:05:32 +00:00
Stanislav Mekhanoshin cb41ee92da [AMDGPU] Fix promote alloca with double use in a same insn
If we have an instruction where more than one pointer operands
are derived from the same promoted alloca, we are fixing it for
one argument and do not fix a second use considering this user
done.

Fix this by deferring processing of memory intrinsics until all
potential operands are replaced.

Fixes: SWDEV-271358

Differential Revision: https://reviews.llvm.org/D96386
2021-02-11 11:42:25 -08:00
Matt Arsenault e3c6fa3611 AMDGPU: Restrict soft clause bundling at half of the available regs
Fixes a testcase that was overcommitting large register tuples to a
bundle, which the register allocator could not possibly satisfy.  This
was producing a bundle which used nearly all of the available SGPRs
with a series of 16-dword loads (not all of which are freely available
to use).

This is a quick hack for some deeper issues with how the clause
bundler tracks register pressure.

Overall the pressure tracking used here doesn't make sense and is too
imprecise for what it needs to avoid the allocator failing. The
pressure estimate does not account for the alignment requirements of
large SGPR tuples, so this was really underestimating the pressure
impact. This also ignores the impact of the extended live range of the
use registers after the bundle is introduced. Additionally, it didn't
account for some wide tuples not being available due to reserved
registers.

This regresses a few cases. These end up introducing more
spilling. This is also a function of the global pressure being used in
the decision to bundle, not the local pressure impact of the bundle
itself.
2021-02-11 14:08:59 -05:00
Jay Foad 23db2d363f [AMDGPU] Better selection of base offset when merging DS reads/writes
When merging a pair of DS reads or writes needs to materialize the base
offset in a vgpr, choose a value that is aligned to as high a power of
two as possible. This maximises the chance that different pairs can use
the same base offset, in which case the base offset registers can be
commoned up by MachineCSE.

Differential Revision: https://reviews.llvm.org/D96421
2021-02-11 17:46:09 +00:00
Carl Ritson c16f776028 [AMDGPU] Move kill lowering to WQM pass and add live mask tracking
Move implementation of kill intrinsics to WQM pass. Add live lane
tracking by updating a stored exec mask when lanes are killed.
Use live lane tracking to enable early termination of shader
at any point in control flow.

Reviewed By: piotr

Differential Revision: https://reviews.llvm.org/D94746
2021-02-11 20:31:29 +09:00
Carl Ritson e5b0b434f6 [AMDGPU] Refactor MIMG tables to better handle hardware variants
Add mimgopc object to represent the opcode allowing different
opcodes for different hardware variants.
This enables image_atomic_fcmpswap, image_atomic_fmin, and
image_atomic_fmax on GFX10

Reviewed By: foad, rampitec

Differential Revision: https://reviews.llvm.org/D96309
2021-02-11 13:22:41 +09:00
Jay Foad 2114b458b0 [AMDGPU] Fix comments in SILoadStoreOptimizer::offsetsCanBeCombined 2021-02-10 14:49:33 +00:00
Matt Arsenault f4ca6d8289 AMDGPU: Fix verifier error with argument passed in CSR SGPR
We need to avoid setting the kill flag on the CSR spill if there's an
additional use of the register after the spill.

This does rely on consistency between the entry block liveins and the
MRI's function live ins, which is not something the verifier checks
now.
2021-02-09 13:49:44 -05:00
Matt Arsenault b72a23650f GlobalISel: Fix using wrong calling convention for callees
This was taking the calling convention from the parent function,
instead of the callee. Avoids regressions in a future patch when the
caller and callee have different type breakdowns.

For some reason AArch64's lowerFormalArguments seems to intentionally
ignore the parent isVarArg.
2021-02-09 13:48:56 -05:00
Jinsong Ji 9202806241 Revert "[CostModel] Remove VF from IntrinsicCostAttributes"
This reverts commit 502a67dd7f.

This expose a failure in test-suite build on PowerPC,
revert to unblock buildbot first,
Dave will re-commit in https://reviews.llvm.org/D96287.

Thanks Dave.
2021-02-09 02:14:14 +00:00
Matt Arsenault bcf723b2fd AMDGPU: Stop adding stack passed wide arguments to call conv handler
The generated calling convention code shouldn't see these types since
we split large types into 32-bit chunks before the calling convention
code is triggered.

GlobalISel ends up directly calls the generated CC code before
checking for the register count breakdown. Arguably this difference is
a bug, but this was dead code for the DAG anyway.
2021-02-08 17:09:28 -05:00
Jay Foad a4b1df8af3 [AMDGPU] Use named unified buffer format constant. NFC. 2021-02-08 17:34:36 +00:00
Thomas Symalla f89f6d1e5d [AMDGPU]: Fixes an invalid clamp selection pattern.
When running the tests on PowerPC and x86, the lit test GlobalISel/trunc.ll fails at the memory sanitize step. This seems to be due to wrong invalid logic (which matches even if it shouldn't) and likely missing variable initialisation."

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D95878
2021-02-08 13:06:30 +01:00
Dmitry Preobrazhensky 05433a8d03 [AMDGPU][MC] Corrected error position for invalid dim modifiers
Fixed bug 49054.

Differential Revision: https://reviews.llvm.org/D96117
2021-02-08 14:32:28 +03:00
Dmitry Preobrazhensky 168ccc8ecb [AMDGPU][MC][GFX10] Improved errors reporting for invalid MIMG NSA operands
Differential Revision: https://reviews.llvm.org/D96118
2021-02-08 14:04:28 +03:00
Kazu Hirata 7725b81822 [AMDGPU] Drop unnecessary const from a return type (NFC)
Identified with const-return-type.
2021-02-05 21:02:04 -08:00
Guillaume Chatelet 79b3ab725d [NFC] Simplify expression 2021-02-05 10:17:02 +00:00
David Green 502a67dd7f [CostModel] Remove VF from IntrinsicCostAttributes
getIntrinsicInstrCost takes a IntrinsicCostAttributes holding various
parameters of the intrinsic being costed. It can either be called with a
scalar intrinsic (RetTy==Scalar, VF==1), with a vector instruction
(RetTy==Vector, VF==1) or from the vectorizer with a scalar type and
vector width (RetTy==Scalar, VF>1). A RetTy==Vector, VF>1 is considered
an error. Both of the vector modes are expected to be treated the same,
but because this is confusing many backends end up getting it wrong.

Instead of trying work with those two values separately this removes the
VF parameter, widening the RetTy/ArgTys by VF used called from the
vectorizer. This keeps things simpler, but does require some other
modifications to keep things consistent.

Most backends look like this will be an improvement (or were not using
getIntrinsicInstrCost). AMDGPU needed the most changes to keep the code
from c230965ccf working. ARM removed the fix in
dfac521da1, webassembly happens to get a fixup for an SLP cost
issue and both X86 and AArch64 seem to now be using better costs from
the vectorizer.

Differential Revision: https://reviews.llvm.org/D95291
2021-02-05 09:34:24 +00:00
Craig Topper 11ef356d9e [TargetLowering] Use Align in allowsMisalignedMemoryAccesses.
Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D96097
2021-02-04 19:22:06 -08:00
Dan Gohman 698c6b0a09 [WebAssembly] Support single-floating-point immediate value
As mentioned in TODO comment, casting double to float causes NaNs to change bits.
To avoid the change, this patch adds support for single-floating-point immediate value on MachineCode.

Patch by Yuta Saito.

Differential Revision: https://reviews.llvm.org/D77384
2021-02-04 18:05:06 -08:00
Wen-Heng (Jack) Chung 50578cf339 [AMDGPU] Add f16 to i1 CodeGen patterns.
Follow patterns used for f32 and f64 types.

Differential Revision: https://reviews.llvm.org/D95964
2021-02-04 11:44:18 -06:00
Jay Foad d84e5fdac1 [AMDGPU][GlobalISel] Fix v2s16 right shifts
When widening, each half of the v2s16 operands needs to be sign extended
for G_ASHR or zero extended for G_LSHR.

Differential Revision: https://reviews.llvm.org/D96048
2021-02-04 17:04:32 +00:00
Jay Foad b3bb5c3efc [AMDGPU][GlobalISel] Use scalar min/max instructions
SALU min/max s32 instructions exist so use them. This means that
regbankselect can handle min/max much like add/sub/mul/shifts.

Differential Revision: https://reviews.llvm.org/D96047
2021-02-04 17:04:32 +00:00
Konstantin Zhuravlyov 6054a456da AMDGPU: Add support for amdgpu-unsafe-fp-atomics attribute
If amdgpu-unsafe-fp-atomics is specified, allow {flat|global}_atomic_add_f32 even if atomic modes don't match.

Differential Revision: https://reviews.llvm.org/D95391
2021-02-04 08:09:34 -05:00
Sebastian Neubauer 6c59dc474d [AMDGPU] Save all lanes for reserved VGPRs
When SGPRs are spilled to VGPRs, they can overwrite any lane. We need
to preserve the value of inactive lanes in function calls, so we save
the register even if it is marked as caller saved.

Also, teach buildPrologSpill to work when no registers are free like in
CodeGen/AMDGPU/pei-scavenge-vgpr-spill.mir and update the comment on
findScratchNonCalleeSaveRegister as it is not used anymore to realign
the stack pointer since D95865.

Differential Revision: https://reviews.llvm.org/D95946
2021-02-04 09:56:36 +01:00
Matt Arsenault 477e3fe4f8 Revert "AMDGPU: Don't consider global pressure when bundling soft clauses"
This reverts commit 1e377a273f.

A regression was reported.
2021-02-03 13:25:05 -05:00
Jay Foad 26ec785386 [AMDGPU] Fix multiclass template parameter types. NFC.
This fixes TableGen parser errors that will be reported when D95874 is
applied.

Differential Revision: https://reviews.llvm.org/D95955
2021-02-03 16:21:51 +00:00
Matt Arsenault 9719f17011 AMDGPU: Move handling of allocation of fixed ABI inputs
For the fixed ABI, set this in the initial argument constructor,
rather than relying on the allocation logic to set the values. Also
stop passing them for amdgpu_gfx, since the DAG path seems to skip
these. I'm unclear on what amdgpu_gfx's expectations are.  This will
allow moving the special input registers out of the normal argument
range.
2021-02-03 09:27:59 -05:00
Sebastian Neubauer d49efdc969 Revert "[AMDGPU] Add a new Clamp Pattern to the GlobalISel Path."
This reverts commits 62af0305b7cc..677a3529d3e6 from D93708.
They cause failures in the sanitizer builds because of uninitialized
values.

A fix is in D95878, but it might take some time until this is pushed,
so reverting the changes for now.
2021-02-03 11:03:34 +01:00
Matt Arsenault af2cbe8eff AMDGPU: Fix adding extra operands for i128 asm constraints
We don't register i128 as a legal type with addRegisterClass, but it
appears in the list of legal register types. This inconsistency
resulted in the asm constraint lowering trying to use 2 128-bit
registers for these operands. This would leave behind a dead def that
would waste registers.

Regresses GlobalISel tests for i128 load/store, but these aren't very
important right now. Ideally these would not depend on the list of
register types.
2021-02-02 19:01:04 -05:00
Matt Arsenault 1e377a273f AMDGPU: Don't consider global pressure when bundling soft clauses
This should only consider whether the pressure impact of the bundle at
the given point in the program will decrease the occupancy. High VGPR
pressure was incorrectly blocking the formation of scalar bundles, and
vice versa. This was also blocking bundling from high pressure
situations at other points in the program.
2021-02-02 19:00:14 -05:00
Sebastian Neubauer 8b898b19a8 [AMDGPU] Remove unused tmp register
The temporary register is only used to compute the frame pointer.
The frame pointer is overwritten and not used in between, so we
can reuse the frame pointer for the computation, saving one register.

Differential Revision: https://reviews.llvm.org/D95865
2021-02-02 17:17:54 +01:00
Sebastian Neubauer 6b6ae583cf [AMDGPU] Save fp/bp after csr saves
Saving callee-save registers happens in whole wave mode. Exec is saved
to a free register, which can be reused to save the frame pointer.
Therefore, saving the fp needs to happen after saving csrs.

Differential Revision: https://reviews.llvm.org/D95861
2021-02-02 17:17:54 +01:00
Dmitry Preobrazhensky 586df38478 [AMDGPU][MC] Corrected parsing of optional modifiers
Fixed bugs in parsing of "no*" modifiers and improved errors handling.
See https://bugs.llvm.org/show_bug.cgi?id=41282.

Differential Revision: https://reviews.llvm.org/D95675
2021-02-02 14:52:29 +03:00
Benjamin Kramer 679ef22f2e Fold one-use variable into assert. NFCI.
Avoids a warning in Release builds.
2021-02-02 10:50:48 +01:00
Sebastian Neubauer b91afa474e [AMDGPU] Mark epilog restores as frame-destroy
I guess instructions were marked as frame-setup by accident, they are
restores as part of the epilog.

Differential Revision: https://reviews.llvm.org/D95783
2021-02-02 10:24:37 +01:00
Thomas Symalla faeed774d1 Fixed includes.
Differential Revision: https://reviews.llvm.org/D93708
2021-02-02 09:14:54 +01:00
Thomas Symalla 6c85e98f06 Fixed includes. 2021-02-02 09:14:54 +01:00
Thomas Symalla 09508d2849 Reverted whitespace changes.
Differential Revision: https://reviews.llvm.org/D90968
2021-02-02 09:14:54 +01:00
Thomas Symalla e630dd476c Added missing includes. 2021-02-02 09:14:54 +01:00
Thomas Symalla 602896b9d2 Renamed med3 opcode, removed superfluous copy. 2021-02-02 09:14:54 +01:00
Thomas Symalla fa3e840d3d Removed the generic virtual register creations. Reworked the tests. 2021-02-02 09:14:54 +01:00
Thomas Symalla c781c25412 Implemented a MED3_S32 GIR opcode. 2021-02-02 09:14:53 +01:00
Thomas Symalla 6604d81e1b Added and used new target pseudo for v_cvt_pk_i16_i32, changes due to code review. 2021-02-02 09:14:53 +01:00
Thomas Symalla 52bfb50145 Formatting changes 2021-02-02 09:14:53 +01:00
Thomas Symalla 7d24026ed2 Formatting changes. 2021-02-02 09:14:53 +01:00
Thomas Symalla bcd6c2d203 Updating formatting changes. 2021-02-02 09:14:53 +01:00
Thomas Symalla ecbed4e0ab Resolve formatting changes. 2021-02-02 09:14:53 +01:00
Thomas Symalla 7b2e701906 Code changes yielded from review. 2021-02-02 09:14:53 +01:00
Thomas Symalla 3a46502264 Move step to PreLegalizer 2021-02-02 09:14:53 +01:00
Thomas Symalla cdfd9b3bf5 Move Combiner to PreLegalize step 2021-02-02 09:14:53 +01:00
Thomas Symalla 9a8da909f1 Reverted unintended git-format change. 2021-02-02 09:14:52 +01:00
Thomas Symalla dae85e4671 Fixed the lit tests and a bug in the implementation. 2021-02-02 09:14:52 +01:00
Thomas Symalla 88a832aef1 Refactored the pattern matching. 2021-02-02 09:14:52 +01:00
Thomas Symalla fce3230be2 Added early exit. 2021-02-02 09:14:52 +01:00
Thomas Symalla d722924f20 Added comments. 2021-02-02 09:14:52 +01:00
Thomas Symalla ec043967ec clang-format 2021-02-02 09:14:52 +01:00
Thomas Symalla 62af0305b7 Added clamp i64 to i16 global isel pattern. 2021-02-02 09:14:52 +01:00
Matt Arsenault 41877b82f0 AMDGPU: Fix dbg_value handling when forming soft clause bundles
DBG_VALUES placed between memory instructions would change
codegen. Skip over these and re-insert them after the bundle instead
of giving up on bundling.
2021-02-01 22:16:35 -05:00
Austin Kerbow e068e236c3 [AMDGPU] Fix release build after 0397dca0. 2021-02-01 08:55:14 -08:00
Austin Kerbow 0397dca021 [AMDGPU] Fix crash with sgpr spills to vgpr disabled
This would assert with amdgpu-spill-sgpr-to-vgpr disabled when trying to
spill the FP.

Fixes: SWDEV-262704

Reviewed By: RamNalamothu

Differential Revision: https://reviews.llvm.org/D95768
2021-02-01 08:35:25 -08:00
Dmitry Preobrazhensky 99b5631649 [AMDGPU][MC] Corrected error position for invalid operands
Generic parser may report an incorrect error position when an offending operand is followed by a comma.
See bug 48884 for details: https://bugs.llvm.org/show_bug.cgi?id=48884.

Differential Revision: https://reviews.llvm.org/D95674
2021-02-01 14:31:08 +03:00
Matt Arsenault 8f14a08863 AMDGPU: Add missing consts 2021-01-31 10:47:57 -05:00
Kazu Hirata 627b5bda11 [llvm] Add missing header guards (NFC)
Identified with llvm-header-guard.
2021-01-30 09:53:42 -08:00
Kazu Hirata b4e780697d [AMDGPU] Forward-declare AMDGPUTargetMachine (NFC)
AMDGPUTargetTransformInfo.h needs AMDGPUTargetMachine but relies on a
forward declaration of AMDGPUTargetMachine in AMDGPU.h.  This patch
adds a forward declaration right in AMDGPUTargetTransformInfo.h.

While we are at it, this patch removes the one in
AMDGPU.h, where it is unnecessary.
2021-01-30 09:53:40 -08:00
Stanislav Mekhanoshin 9dbe736cbd [AMDGPU] Be more specific in needsFrameBaseReg
A condition "mayLoadOrStore" is too broad for that function.

Differential Revision: https://reviews.llvm.org/D95700
2021-01-29 14:40:25 -08:00
Kazu Hirata 7925aa091d [llvm] Populate SmallVector at construction time (NFC) 2021-01-28 22:21:14 -08:00
Kazu Hirata 046cfb8565 [llvm] Forward-declare formatted_raw_ostream (NFC)
Various *TargetStreamer.h need formatted_raw_ostream but rely on a
forward declaration of formatted_raw_ostream in MCStreamer.h.  This
patch adds forward declarations right in *TargetStreamer.h.

While we are at it, this patch removes the one in MCStreamer.h, where
it is unnecessary.
2021-01-28 22:21:13 -08:00
Carl Ritson 0824694d68 [AMDGPU] Fix WMM Entry SCC preservation
SCC was not correctly preserved when entering WWM.
Current lit test was unable to detect this as entry block is
handled differently.
Additionally fix an issue where SCC was unnecessarily preserved
when exiting from WWM to Exact mode.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D95500
2021-01-29 10:05:36 +09:00
Carl Ritson 0e8f50595e [AMDGPU] Mark V_SET_INACTIVE as defining SCC
V_SET_INACTIVE is implemented with S_NOT which clobbers SCC.
Mark sure it is marked appropriately.

Reviewed By: piotr

Differential Revision: https://reviews.llvm.org/D95509
2021-01-29 09:46:41 +09:00
Simon Pilgrim 0805e40a94 AMDGPUPrintfRuntimeBinding - don't dereference a dyn_cast<> pointer. NFCI.
We dereference the dyn_cast<> in all paths - use cast<> to silence the clang static analyzer warning.
2021-01-28 12:38:44 +00:00
Mirko Brkusanin 3c979ae9ec [AMDGPU][GlobalISel] Remove redundant cmp when copying constant to vcc
Differential Revision: https://reviews.llvm.org/D95540
2021-01-28 11:20:09 +01:00
Mirko Brkusanin 4b422708ba [AMDGPU][GlobalISel] Handle G_PTR_ADD when looking for constant offset
Look throught G_PTRTOINT and G_PTR_ADD nodes when looking for constant
offset for buffer stores. This also helps with merging of these instructions
later on.

Differential Revision: https://reviews.llvm.org/D95242
2021-01-28 11:20:09 +01:00
Piotr Sobczak fc8e741121 [AMDGPU] Avoid an illegal operand in si-shrink-instructions
Before the patch it was possible to trigger a constant bus
violation when folding immediates into a shrunk instruction.

The patch adds a check to enforce the legality of the new operand.

Differential Revision: https://reviews.llvm.org/D95527
2021-01-28 08:49:21 +01:00
Stanislav Mekhanoshin d91ee2f782 [AMDGPU] Do not reassign spilled registers
We cannot call LRM::unassign() if LRM::assign() was never called
before, these are symmetrical calls. There are two ways of
assigning a physical register to virtual, via LRM::assign() and
via VRM::assignVirt2Phys(). LRM::assign() will call the VRM to
assign the register and then update LiveIntervalUnion. Inline
spiller calls VRM directly and thus LiveIntervalUnion never gets
updated. A call to LRM::unassign() then asserts about inconsistent
liveness.

We have to note that not all callers of the InlineSpiller even
have LRM to pass, RegAllocPBQP does not have it, so we cannot
always pass LRM into the spiller.

The only way to get into that spiller LRE_DidCloneVirtReg() call
is from LiveRangeEdit::eliminateDeadDefs if we split an LI.

This patch refuses to reassign a LiveInterval created by a split
to workaround the problem. In fact we cannot reassign a spill
anyway as all registers of the needed class are occupied and we
are spilling.

Fixes: SWDEV-267996

Differential Revision: https://reviews.llvm.org/D95489
2021-01-27 16:29:05 -08:00
Kazu Hirata 6bde085366 [AMDGPU] Forward-declare TargetRegisterClass (NFC)
AMDGPUInstructionSelector.h needs TargetRegisterClass but relies on a
forward declaration of TargetRegisterClass in InstructionSelector.h.
This patch adds a forward declaration right in
AMDGPUInstructionSelector.h.

While we are at it, this patch removes the one in
InstructionSelector.h, where it is unnecessary.
2021-01-26 20:00:16 -08:00
Austin Kerbow 2291bd137d [AMDGPU] Update subtarget features for new target ID support
Support for XNACK and SRAMECC is not static on some GPUs. We must be able
to differentiate between different scenarios for these dynamic subtarget
features.

The possible settings are:

- Unsupported: The GPU has no support for XNACK/SRAMECC.
- Any: Preference is unspecified. Use conservative settings that can run anywhere.
- Off: Request support for XNACK/SRAMECC Off
- On: Request support for XNACK/SRAMECC On

GCNSubtarget will track the four options based on the following criteria. If
the subtarget does not support XNACK/SRAMECC we say the setting is
"Unsupported". If no subtarget features for XNACK/SRAMECC are requested we
must support "Any" mode. If the subtarget features XNACK/SRAMECC exist in the
feature string when initializing the subtarget, the settings are "On/Off".

The defaults are updated to be conservatively correct, meaning if no setting
for XNACK or SRAMECC is explicitly requested, defaults will be used which
generate code that can be run anywhere. This corresponds to the "Any" setting.

Differential Revision: https://reviews.llvm.org/D85882
2021-01-26 11:25:51 -08:00
Matt Arsenault 5f9707b796 AMDGPU: Fix redundant FP spilling/assert in some functions
If a function has stack objects, and a call, we require an FP. If we
did not initially have any stack objects, and only introduced them
during PrologEpilogInserter for CSR VGPR spills, SILowerSGPRSpills
would end up spilling the FP register as if it were a normal
register. This would result in an assert in a debug build, or
redundant handling of the FP register in a release build.

Try to predict that we will have an FP later, although this is ugly.
2021-01-26 13:01:45 -05:00
Matt Arsenault 92d1195b5f AMDGPU: Add assertion to determineCalleeSaves
Make sure this isn't getting called multiple times. I was surprised we
were modifying the function here, which I think is a bit questionable.
2021-01-26 13:01:45 -05:00
Simon Pilgrim f82cff31d3 [AMDGPU] HSAMD::fromString - replace std::string arg with StringRef. NFCI.
Removes an unnecessary chain of StringRef -> std::string -> StringRef conversions
2021-01-26 16:09:39 +00:00
Simon Pilgrim ee3da8958a [AMDGPU] Fix null-dereference static analysis warnings. NFCI.
Avoid repeated calls to isZeroValue() and check for a null pointer before dereferencing a dyn_cast<>.
2021-01-26 15:43:59 +00:00
Matt Arsenault 551a69e418 AMDGPU: Clear IsSSA property in SIFormMemoryClauses
Fixes verifier error when writing MIR testcases
2021-01-26 10:40:41 -05:00
Mirko Brkusanin 608ac62540 [AMDGPU] Fix use of HasModifiers in VopProfile
HasModifiers should be true if at least one modifier is used.
This should make the use of this field bit more consistent.

Differential Revision: https://reviews.llvm.org/D94795
2021-01-26 15:21:11 +01:00
Dmitry Preobrazhensky 745064e36b [AMDGPU][MC] Refactored exp tgt handling
Summary:
- Separated tgt encoding from parsing;
- Separated tgt decoding from printing;
- Improved errors handling;
- Disabled leading zeroes in index. The following code is no longer accepted: exp pos00 v3, v2, v1, v0

Reviewers: arsenm, rampitec, foad

Differential Revision: https://reviews.llvm.org/D95216
2021-01-26 14:54:15 +03:00
Kazu Hirata c85b6bf33c [AMDGPU] Forward-declare MachineIRBuilder (NFC)
AMDGPULegalizerInfo.h needs MachineIRBuilder but relies on a forward
declaration of MachineIRBuilder in LegalizerInfo.h.  This patch adds a
forward declaration right in AMDGPULegalizerInfo.h.

While we are at it, this patch removes the one in LegalizerInfo.h,
where it is unnecessary.
2021-01-25 19:24:01 -08:00
Changpeng Fang 5b648df1a8 AMDGPU: Reduce the number of expensive calls in SIFormMemoryClause
Summary:
  RPTracker::reset(MI) is a very expensive call when the number of virtual registers is huge.
We observed a long compilation time issue when RPT::reset() is called once for each cluster.

In this work, we call RPT.reset() only at the first seen cluster, and use advance() to get
the register pressure for the later clusters in the same basic block. This could effectively reduce the number
of the expensive calls and thus reduce the compile time.

Reviewers:
  rampitec

Fixes:
  SWDEV-239161

Differential Revision:
  https://reviews.llvm.org/D95273
2021-01-25 16:08:08 -08:00
Konstantin Zhuravlyov 2cdb34efda Revert "[IndirectFunctions] Skip propagating attributes to address taken functions"
This reverts commit dd8ae42674.

This commit causes infinite loop when compiling rocThrust and hipCUB.

Differential Revision: https://reviews.llvm.org/D95389
2021-01-25 15:58:06 -05:00
Dmitry Preobrazhensky 558b3bbb5b [AMDGPU][MC] Improved errors handling for SDWA operands
Reviewers: rampitec

Differential Revision: https://reviews.llvm.org/D95212
2021-01-25 19:02:53 +03:00
Carl Ritson a80ebd0179 [AMDGPU] Fix llvm.amdgcn.init.exec and frame materialization
Frame-base materialization may insert vector instructions before EXEC is initialised.
Fix this by moving lowering of llvm.amdgcn.init.exec later in backend.
Also remove SI_INIT_EXEC_LO pseudo as this is not necessary.

Reviewed By: ruiling

Differential Revision: https://reviews.llvm.org/D94645
2021-01-25 08:31:17 +09:00
Kazu Hirata 054444177b [Target] Use llvm::append_range (NFC) 2021-01-24 12:18:56 -08:00
Kazu Hirata e4847a7fcf Revert "[Target] Use llvm::append_range (NFC)"
This reverts commit cc7a238286.

The X86WinEHState.cpp hunk seems to break certain builds.
2021-01-23 11:25:27 -08:00
Kazu Hirata cc7a238286 [Target] Use llvm::append_range (NFC) 2021-01-23 10:56:31 -08:00
Kazu Hirata 49231c1f80 [llvm] Use static_assert instead of assert (NFC)
Identified with misc-static-assert.
2021-01-22 23:25:05 -08:00
Stanislav Mekhanoshin ca904b81e6 [AMDGPU] Fix FP materialization/resolve with flat scratch
Differential Revision: https://reviews.llvm.org/D95266
2021-01-22 16:06:47 -08:00
Stanislav Mekhanoshin 607bec0bb9 Change materializeFrameBaseRegister() to return register
The only caller of this function is in the LocalStackSlotAllocation
and it creates base register of class returned by the target's
getPointerRegClass(). AMDGPU wants to use a different reg class
here so let materializeFrameBaseRegister to just create and return
whatever it wants.

Differential Revision: https://reviews.llvm.org/D95268
2021-01-22 15:51:06 -08:00
Arthur Eubanks 42d682a217 [NewPM][AMDGPU] Skip adding CGSCCOptimizerLate callbacks at O0
The legacy PM's EP_CGSCCOptimizerLate was only used under not-O0.

Fixes clang/test/CodeGenCXX/cxx0x-initializer-stdinitializerlist.cpp under the new PM.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D95250
2021-01-22 12:29:39 -08:00
Sebastian Neubauer 8214982b50 [AMDGPU] Implement mir parseCustomPseudoSourceValue
Allow parsing generated mir with custom pseudo source value tokens.
Also rename pseudo source values to have more meaningful names.

Relands ba7dcd8542, which had memory leaks.

Differential Revision: https://reviews.llvm.org/D95215
2021-01-22 11:24:08 +01:00
Christudasan Devadasan ff8a1cae18 [AMDGPU] Fix the inconsistency in soffset for MUBUF stack accesses.
During instruction selection, there is an inconsistency in choosing
the initial soffset value. With certain early passes, this value is
getting modified and that brought additional fixup during
eliminateFrameIndex to work for all cases. This whole transformation
looks trivial and can be handled better.

This patch clearly defines the initial value for soffset and keeps it
unchanged before eliminateFrameIndex. The initial value must be zero
for MUBUF with a frame index. The non-frame index MUBUF forms that
use a raw offset from SP will have the stack register for soffset.
During frame elimination, the soffset remains zero for entry functions
with zero dynamic allocas and no callsites, or else is updated to the
appropriate frame/stack register.

Also, did some code clean up and made all asserts around soffset
stricter to match.

Reviewed By: scott.linder

Differential Revision: https://reviews.llvm.org/D95071
2021-01-22 14:20:59 +05:30
Arthur Eubanks a11bf9a7fb [AMDGPU][Inliner] Remove amdgpu-inline and add a new TTI inline hook
Having a custom inliner doesn't really fit in with the new PM's
pipeline. It's also extra technical debt.

amdgpu-inline only does a couple of custom things compared to the normal
inliner:
1) It disables inlining if the number of BBs in a function would exceed
   some limit
2) It increases the threshold if there are pointers to private arrays(?)

These can all be handled as TTI inliner hooks.
There already exists a hook for backends to multiply the inlining
threshold.

This way we can remove the custom amdgpu-inline pass.

This caused inline-hint.ll to fail, and after some investigation, it
looks like getInliningThresholdMultiplier() was previously getting
applied twice in amdgpu-inline (https://reviews.llvm.org/D62707 fixed it
not applying at all, so some later inliner change must have fixed
something), so I had to change the threshold in the test.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D94153
2021-01-21 20:29:17 -08:00
Sebastian Neubauer 4dbdff66fe Revert "[AMDGPU] Implement mir parseCustomPseudoSourceValue"
This reverts commit ba7dcd8542.

(caused memory leaks)
2021-01-21 18:11:48 +01:00
Jay Foad c0b3c5a064 [AMDGPU][GlobalISel] Run SIAddImgInit
This pass is required to get correct codegen for image instructions with
the tfe or lwe bits set.

Differential Revision: https://reviews.llvm.org/D95132
2021-01-21 15:54:54 +00:00
Matt Arsenault 94375d1083 AMDGPU: Remove v_rsq_f64 patterns
This isn't accurate enough without correction
2021-01-21 10:51:36 -05:00
Matt Arsenault 2a0db8d70e AMDGPU: Use more accurate fast f64 fdiv
A raw v_rcp_f64 isn't accurate enough, so start applying correction.
2021-01-21 10:51:36 -05:00
Sebastian Neubauer ba7dcd8542 [AMDGPU] Implement mir parseCustomPseudoSourceValue
Allow parsing generated mir with custom pseudo source value tokens.
Also rename pseudo source values to have more meaningful names.

Differential Revision: https://reviews.llvm.org/D94768
2021-01-21 16:32:17 +01:00
Matt Arsenault 20566a2ed8 AMDGPU: Add occupancy to serialized MachineFunctionInfo
Not sure about the default value handling, but also not sure
defaulting to a theoretically subtarget dependent value.
2021-01-21 09:21:00 -05:00
madhur13490 dd8ae42674 [IndirectFunctions] Skip propagating attributes to address taken functions
In case of indirect calls or address taken functions,
skip propagating any attributes to them. We just
propagate features to such functions.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D94585
2021-01-21 07:04:28 +00:00
dfukalov 560d7e0411 [NFC][AMDGPU] Split AMDGPUSubtarget.h to R600 and GCN subtargets
... to reduce headers dependency.

Reviewed By: rampitec, arsenm

Differential Revision: https://reviews.llvm.org/D95036
2021-01-20 22:22:45 +03:00
Mirko Brkusanin a6a72dfdf2 [AMDGPU][GlobalISel] Avoid selecting S_PACK with constants
If constants are hidden behind G_ANYEXT we can treat them same way as G_SEXT.
For that purpose we extend getConstantVRegValWithLookThrough with option
to handle G_ANYEXT same way as G_SEXT.

Differential Revision: https://reviews.llvm.org/D92219
2021-01-20 11:54:53 +01:00
Petar Avramovic 4ab704d628 [AMDGPU][MC] Add tfe disassembler support MIMG opcodes
With tfe on there can be a vgpr write to vdata+1.
Add tablegen support for 5 register vdata store.
This is required for 4 register vdata store with tfe.

Differential Revision: https://reviews.llvm.org/D94960
2021-01-20 10:37:09 +01:00
Jay Foad 18cb7441b6 [AMDGPU] Simpler names for arch-specific ttmp registers. NFC.
Rename the *_gfx9_gfx10 ttmp registers to *_gfx9plus for simplicity,
and use the corresponding isGFX9Plus predicate to decide when to use
them instead of the old *_vi versions.

Differential Revision: https://reviews.llvm.org/D94975
2021-01-19 18:47:14 +00:00
Jay Foad 49dce85584 [AMDGPU] Simplify AMDGPUInstPrinter::printExpSrcN. NFC.
Change-Id: Idd7f47647bc0faa3ad6f61f44728c0f20540ec00
2021-01-19 10:39:56 +00:00
Dmitry Preobrazhensky 30b8f55378 Fix for sanitizer issue in 55c557a 2021-01-18 18:39:55 +03:00
Dmitry Preobrazhensky 55c557a5d2 [AMDGPU][MC] Refactored parsing of dpp ctrl
Summary of changes:
- simplified code to improve maintainability;
- replaced lex() with higher level parser functions;
- improved errors handling.

Reviewers: rampitec

Differential Revision: https://reviews.llvm.org/D94777
2021-01-18 18:14:19 +03:00
Dmitry Preobrazhensky 911961c9c1 [AMDGPU][MC][GFX10] Improved dpp8 errors handling
Reviewers: rampitec

Differential Revision: https://reviews.llvm.org/D94756
2021-01-18 15:02:31 +03:00
Kazu Hirata 352fcfc697 [llvm] Use llvm::sort (NFC) 2021-01-17 10:39:45 -08:00
Kazu Hirata 2082b10d10 [llvm] Use *::empty (NFC) 2021-01-16 09:40:55 -08:00
Kazu Hirata 19aacdb715 [llvm] Construct SmallVector with iterator ranges (NFC) 2021-01-16 09:40:53 -08:00
Kazu Hirata 4707b21298 [AMDGPU] Use llvm::is_contained (NFC) 2021-01-15 21:00:54 -08:00
Kazu Hirata 7dc3575ef2 [llvm] Remove redundant return and continue statements (NFC)
Identified with readability-redundant-control-flow.
2021-01-14 20:30:34 -08:00
Matt Arsenault d55d592a92 GlobalISel: Do not set observer of MachineIRBuilder in LegalizerHelper
This fixes double printing of insertion debug messages in the
legalizer.

Try to cleanup usage of observers. Currently the use of observers is
pretty hard to follow and it's not clear what is responsible for
them. Observers are referenced in 3 places:

1. In the MachineFunction
2. In the MachineIRBuilder
3. In the LegalizerHelper

The observers in the MachineFunction and MachineIRBuilder are both
called only on insertions, and are redundant with each other. The
source of the double printing was the same observer was added to both
the MachineFunction, and the MachineIRBuilder. One of these references
needs to be removed. Arguably observers in general should be fully
removed from one or the other, but it may be useful to have a local
observer in the MachineIRBuilder that is not added to the function's
observers. Alternatively, the wrapper observer could manage a local
observer in one place.

The LegalizerHelper only ever calls the observer on changing/changed
instructions, and never insertions. Logically these are two different
types of observers, for changes and for insertions.

Additionally, some places used the GISelObserverWrapper when they only
needed a single observer they could use directly.

Setting the observer in the LegalizerHelper constructor is not
flexible enough if the LegalizerHelper is constructed anywhere outside
the one used by the legalizer. AMDGPU calls the LegalizerHelper in
RegBankSelect, and needs to use a local observer to apply the regbank
to newly created instructions. Currently it accomplishes this by
constructing a local MachineIRBuilder. I'm trying to move the
MachineIRBuilder to be owned/maintained by the RegBankSelect pass
itself, but the locally constructed LegalizerHelper would reset the
observer.

Mips also has a special case use of the LegalizationArtifactCombiner
in applyMappingImpl; I think we do need to run the artifact combiner
during RegBankSelect, but in a more consistent way outside of
applyMappingImpl.
2021-01-13 10:44:31 -05:00
Carl Ritson 790c75c163 [AMDGPU] Add SI_EARLY_TERMINATE_SCC0 for early terminating shader
Add pseudo instruction to allow early termination of pixel shader
anywhere based on the value of SCC.  The intention is to use this
when a mask of live lanes is updated, e.g. live lanes in WQM pass.
This facilitates early termination of shaders even when EXEC is
incomplete, e.g. in non-uniform control flow.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D88777
2021-01-13 13:29:05 +09:00
Hsiangkai Wang 914e2f5a02 [NFC] Use generic name for scalable vector stack ID.
Differential Revision: https://reviews.llvm.org/D94471
2021-01-13 10:57:43 +08:00
Joe Nash 314e29ed2b [AMDGPU] Add _e64 suffix to VOP3 Insts
Previously, instructions which could be
expressed as VOP3 in addition to another
encoding had a _e64 suffix on the tablegen
record name, while those
only available as VOP3 did not. With this
patch, all VOP3s will have the _e64 suffix.
The assembly does not change, only  the mir.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D94341

Change-Id: Ia8ec8890d47f8f94bbbdac43745b4e9dd2b03423
2021-01-12 18:33:18 -05:00
Matt Arsenault 3d39709159 AMDGPU: Remove wrapper only call limitation
This seems to only have overridden cold handling, which we probably
shouldn't do. As far as I can tell the wrapper library functions are
still inlined as appropriate.
2021-01-12 17:12:49 -05:00
Sebastian Neubauer 6a195491b6 [AMDGPU] Fix failing assert with scratch ST mode
In ST mode, flat scratch instructions have neither an sgpr nor a vgpr
for the address. This lead to an assertion when inserting hard clauses.

Differential Revision: https://reviews.llvm.org/D94406
2021-01-12 09:54:02 +01:00
Kazu Hirata 8590a3e3ad [llvm] Use *Set::contains (NFC) 2021-01-11 18:48:07 -08:00
Joe Nash bcec0f27a2 [AMDGPU] Deduplicate VOP tablegen asm & ins
VOP3 and VOP DPP subroutines to generate input
operands and asm strings were essentially copy
pasted several times. They are deduplicated to
reduce the maintenance burden and allow faster
development.

Reviewed By: dp

Differential Revision: https://reviews.llvm.org/D94102

Change-Id: I76225eed3c33239d9573351e0c8a0abfad0146ea
2021-01-11 13:49:26 -05:00
QingShan Zhang 7539c75bb4 [DAGCombine] Remove the check for unsafe-fp-math when we are checking the AFN
We are checking the unsafe-fp-math for sqrt but not for fpow, which behaves inconsistent.
As the direction is to remove this global option, we need to remove the unsafe-fp-math
check for sqrt and update the test with afn fast-math flags.

Reviewed By: Spatel

Differential Revision: https://reviews.llvm.org/D93891
2021-01-11 02:25:53 +00:00
Kazu Hirata b7c5e0b02c [Target, Transforms] Use *Set::contains (NFC) 2021-01-08 18:39:54 -08:00
Tony 2f499b9aff [AMDGPU] Add volatile support to SIMemoryLegalizer
Treat a non-atomic volatile load and store as a relaxed atomic at
system scope for the address spaces accessed. This will ensure all
relevant caches will be bypassed.

A volatile atomic is not changed and still only bypasses caches upto
the level specified by the SyncScope operand.

Differential Revision: https://reviews.llvm.org/D94214
2021-01-09 00:52:33 +00:00
Christudasan Devadasan ae25a397e9 AMDGPU/GlobalISel: Enable sret demotion 2021-01-08 10:56:35 +05:30
Kazu Hirata b934160aaa [Target] Use llvm::find_if (NFC) 2021-01-07 20:29:36 -08:00
Mehdi Amini 467e916d30 Fix gcc5 build failure (NFC)
The loop index was shadowing the container name.
It seems that we can just not use a for-range loop here since there is
an induction variable anyway.

Differential Revision: https://reviews.llvm.org/D94254
2021-01-07 20:11:57 +00:00