7293 Commits

Author SHA1 Message Date
Fangrui Song 15d82c62dc [MC] De-capitalize MCStreamer functions
Follow-up to c031378ce0 .
The class is mostly consistent now.
2022-06-07 00:31:02 -07:00
Joe Nash eaed07eb7e Revert "[AMDGPU] gfx11 vop3dpp instructions"
This reverts commit 99a83b1286.
2022-06-06 17:12:09 -04:00
Joe Nash f617f89e5b Revert "[AMDGPU] gfx11 VOP1+VOP2 Instruction MC support"
This reverts commit 6079804498.
2022-06-06 17:11:35 -04:00
Ivan Kosarev facbfb121a [AMDGPU][GFX9+] Support base+soffset+offset s_atc_probe's.
Resolves part of
https://github.com/llvm/llvm-project/issues/38652

Reviewed By: dp

Differential Revision: https://reviews.llvm.org/D126791
2022-06-06 16:46:22 +01:00
Ivan Kosarev 79ec1e8fd6 [AMDGPU][GFX9][GFX10] Support base+soffset+offset s_dcache_discard's.
Resolves part of
https://github.com/llvm/llvm-project/issues/38652

Reviewed By: dp

Differential Revision: https://reviews.llvm.org/D126766
2022-06-06 16:32:16 +01:00
Joe Nash 6079804498 [AMDGPU] gfx11 VOP1+VOP2 Instruction MC support
Includes dpp instructions and vop1/vop2 promoted to vop3

Patch 17/N for upstreaming of AMDGPU gfx11 architecture

Depends on D126483

Reviewed By: rampitec, #amdgpu

Differential Revision: https://reviews.llvm.org/D126917
2022-06-06 09:57:59 -04:00
Joe Nash 99a83b1286 [AMDGPU] gfx11 vop3dpp instructions
gfx11 adds the ability to use dpp modifiers on vop3 instructions.
This patch adds machine code layer support for that. The MCCodeEmitter
is changed to use APInt instead of uint64_t to support these wider
instructions.

Patch 16/N for upstreaming of AMDGPU gfx11 architecture

Depends on D126475

Reviewed By: rampitec, #amdgpu

Differential Revision: https://reviews.llvm.org/D126483
2022-06-06 09:34:59 -04:00
Fangrui Song 77e300ffdf [MC] Change EndOfStatement "unexpected tokens in .xxx directive " to "expected newline" 2022-06-05 15:11:01 -07:00
Fangrui Song 95a134254a Remove unneeded cl::ZeroOrMore for cl::opt/cl::list options 2022-06-05 01:07:51 -07:00
Kazu Hirata e0039b8d6a Use llvm::less_second (NFC) 2022-06-04 22:48:32 -07:00
Jacob Weightman 814a0abcce AMDGPU: allow reordering of functions in AMDGPUResourceUsageAnalysis
The AMDGPUResourceUsageAnalysis was previously a CGSCC pass, and assumed
that a function's callees were always analyzed prior to their callers.
When it was refactored into a module pass, this assumption no longer
always holds. This results in calls being erroneously identified as
indirect, and reserving private segment space for them. This results in
significantly slower kernel launch latency.

This patch changes the order in which the module's functions are analyzed
from the order in which they occur in the module to a post-order traversal
of the call graph. Perhaps Clang always generates the module's functions
in such an order, but this is not the case for the Cray Fortran compiler.

Reviewed By: #amdgpu, arsenm

Differential Revision: https://reviews.llvm.org/D126025
2022-06-03 15:55:54 -05:00
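The ordering change described in 814a0abcce can be illustrated with a minimal sketch. This is not the LLVM implementation: the call graph is modelled as a plain dictionary, and the names `post_order` and `call_graph` are assumptions made for illustration. It only shows why a post-order walk visits every callee before its callers, regardless of the order in which functions appear in the module.

```python
from typing import Dict, List


def post_order(call_graph: Dict[str, List[str]]) -> List[str]:
    """Return function names so that each callee precedes its callers.

    `call_graph` maps a function name to the names it calls (hypothetical
    representation). With recursion (cycles) the guarantee cannot hold for
    every edge; the `visited` set merely prevents infinite recursion.
    """
    visited = set()
    order: List[str] = []

    def visit(fn: str) -> None:
        if fn in visited:
            return
        visited.add(fn)
        for callee in call_graph.get(fn, []):
            visit(callee)
        # Appended only after all callees have been handled.
        order.append(fn)

    # Module order may list callers first; the walk fixes that.
    for fn in call_graph:
        visit(fn)
    return order


if __name__ == "__main__":
    module = {"kernel": ["helper"], "helper": ["leaf"], "leaf": []}
    print(post_order(module))
```

Analyzing functions in the returned order means a caller's resource usage can be computed from already-known callee results, which is the assumption the commit message says was lost when the pass became a module pass.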
Matt Arsenault dd7e407d81 AMDGPU: Move SpilledReg from MFI to SIRegisterInfo
This isn't the most natural place for it, but it avoids a circular
include dependency in an out of tree patch.
2022-06-02 17:11:24 -04:00
Julien Pages 2dfe419446 [AMDGPU] Improve codegen of extractelement/insertelement in some cases
This patch improves the codegen of extractelement and insertelement for vector
containing 8 elements. Before, a dag combine transformation was generating a
sequence of 8 select/cmp.
This patch changes the upper limit for this transformation and the movrel
instruction will eventually be used instead. Extractelement/insertelement for
vectors containing fewer than 8 elements are unchanged.

Differential Revision: https://reviews.llvm.org/D126389
2022-06-02 17:05:55 -04:00
Joe Nash 3732cd59be [AMDGPU] gfx11 vop3 and inherited vop instructions
This patch includes MC layer support for VOP3 encoded instructions and generic VOP support
classes.
Some VOP1 and VOP2 instructions which share an encoding with gfx10 and are using
the AssemblerPredicate = isGFX10Plus are also enabled. That predicate
will be changed to isGFX10Only in a later patch.

Patch 15/N for upstreaming of AMDGPU gfx11 architecture.

Depends on D126468

Reviewed By: dp

Differential Revision: https://reviews.llvm.org/D126475
2022-06-02 14:03:02 -04:00
Joe Nash e4870c8357 [AMDGPU] gfx11 ds instructions
MC layer support for ds instructions

Contributors:
Piotr Sobczak <Piotr.Sobczak@amd.com>

Patch 14/N for upstreaming of AMDGPU gfx11 architecture.

Depends on D126463

Reviewed By: arsenm, #amdgpu

Differential Revision: https://reviews.llvm.org/D126468
2022-06-02 13:36:56 -04:00
Matt Arsenault 89b1808a2f AMDGPU: Fix missing c++ mode comment 2022-06-01 21:14:48 -04:00
Stanislav Mekhanoshin c9e242f6dd [AMDGPU] Change GISel error handling for TFE on GFX90A
Differential Revision: https://reviews.llvm.org/D126797
2022-06-01 11:07:25 -07:00
Scott Linder 2d43955cec [AMDGPU][NFC] Refactor AMDGPUCallingConv.td
Rename CalleeSavedRegs defs to avoid being overly specific:

* CSR_AMDGPU_AGPRs_32_255 => CSR_AMDGPU_AGPRs
* CSR_AMDGPU_SGPRs_30_31 + CSR_AMDGPU_SGPRs_32_105 => CSR_AMDGPU_SGPRs
* CSR_AMDGPU_SI_Gfx_SGPRs_4_29 + CSR_AMDGPU_SI_Gfx_SGPRs_64_105 =>
  CSR_AMDGPU_SI_Gfx_SGPRs
* CSR_AMDGPU_HighRegs => CSR_AMDGPU
* CSR_AMDGPU_HighRegs_With_AGPRs => CSR_AMDGPU_GFX90AInsts
* CSR_AMDGPU_SI_Gfx_With_AGPRs => CSR_AMDGPU_SI_Gfx_GFX90AInsts

Introduce a class RegMask to mark the cases where we use the
CalleeSavedRegs class purely as an expedient way to produce a mask.
Update the names of these masks to not mention "CSR". Other targets also
seem to do this, so a reasonable alternative is to actually update
table-gen to include a new class to do this explicitly, but the current
approach seems harmless so I opted to just make it more explicit.

Reviewed By: arsenm, sebastian-ne

Differential Revision: https://reviews.llvm.org/D109008
2022-06-01 16:24:09 +00:00
Matt Arsenault 0e1c71e4a4 CodeGen: Move getAddressSpaceForPseudoSourceKind into TargetMachine
Avoid the dependency on TargetInstrInfo, which depends on the subtarget
and therefore the individual function.

Currently AMDGPU is constructing PseudoSourceValue instances in MachineFunctionInfo.
In order to facilitate copying MachineFunctionInfo, we need to stop allocating these
there. Alternatively we could allow targets to subclass PseudoSourceValueManager,
and allocate them similarly to MachineFunctionInfo.
2022-06-01 09:45:40 -04:00
Stanislav Mekhanoshin dec1283279 [AMDGPU] Fix image opcodes GlobalISel on gfx90a.
- Correct flavor of an instruction was not selected.
- GFX90A does not support TFE.

Differential Revision: https://reviews.llvm.org/D126312
2022-05-31 14:07:46 -07:00
jeff 2e61dfb124 [AMDGPU] Instruction Type Pipeline
This patch implements a DAG mutation which adds edges between different groups of instructions. The purpose is to try to generate code that conforms to a pipeline (groupA instructions occur before groupB, groupB -> groupC, and so on). Currently the pipeline order is hardcoded as VMEM->DSRead->MFMA->DSWrite, but the patch was designed to be easily extensible. Alias analysis is problematic for pipelining as memory instructions will usually not be able to be reordered w.r.t one another.

Differential Revision: https://reviews.llvm.org/D125997
2022-05-31 17:48:52 +00:00
Joe Nash e8860bee28 [AMDGPU] gfx11 Image instructions
MC layer support for instructions in the MIMG encoding (Image
instructions).

Contributors:
Carl Ritson <carl.ritson@amd.com>

Patch 13/N for upstreaming of AMDGPU gfx11 architecture.

Depends on D125992

Reviewed By: rampitec, #amdgpu

Differential Revision: https://reviews.llvm.org/D126463
2022-05-31 10:53:35 -04:00
Ivan Kosarev f199b2b00f [AMDGPU][NFC] Refine defining the offset field for GFX10+ SMEM instructions.
Reviewed By: dp

Differential Revision: https://reviews.llvm.org/D126662
2022-05-31 09:54:51 +01:00
Ivan Kosarev b4dbcba3b7 [AMDGPU][GFX9][NFC] Rename the base class for SMEM stores. 2022-05-30 10:31:59 +01:00
Ivan Kosarev 082822b381 [AMDGPU][GFX9] Support base+soffset+offset SMEM stores.
Reviewed By: dp

Differential Revision: https://reviews.llvm.org/D126388
2022-05-30 10:27:57 +01:00
Nicolai Hähnle 5df2893a9a AMDGPU: Add G_AMDGPU_MAD_64_32 instructions
These generic instructions are trivially selected to
V_MAD_[IU]64_[IU]32 instructions when run on the VALU.

When at least both factors are scalar, it is usually better to execute
some or all of the instruction on the SALU. To this end, we lower the
instruction to simpler instructions that are supported on the SALU
when applying the register bank mapping.

Differential Revision: https://reviews.llvm.org/D124843
2022-05-27 12:36:17 -05:00
Ivan Kosarev b0ccf38b01 [AMDGPU][GFX9] Support base+soffset+offset SMEM loads.
Resolves part of
https://github.com/llvm/llvm-project/issues/38652

Reviewed By: dp

Differential Revision: https://reviews.llvm.org/D125700
2022-05-26 12:42:33 +01:00
serge-sans-paille fb67d683db [iwyu] Handle regressions in libLLVM header include
Running iwyu-diff on LLVM codebase since 7030654296 detected a few
regressions, fixing them.

Differential Revision: https://reviews.llvm.org/D126417
2022-05-26 08:12:34 +02:00
Maksim Panchenko bed9efed71 [MCDisassembler] Disambiguate Size parameter in tryAddingSymbolicOperand()
MCSymbolizer::tryAddingSymbolicOperand() overloaded the Size parameter
to specify either the instruction size or the operand size depending on
the architecture. However, for proper symbolic disassembly on X86, we
need to know both sizes, as an instruction can have two operands, and
the instruction size cannot be reliably calculated based on the operand
offset and its size. Hence, split Size into OpSize and InstSize.

For X86, the new interface allows fixing a couple of issues:
  * Correctly adjust the value of PC-relative operands.
  * Set operand size to zero when the operand is specified implicitly.

Differential Revision: https://reviews.llvm.org/D126101
2022-05-25 13:44:32 -07:00
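The reason bed9efed71 needs both sizes can be shown with a small arithmetic sketch. It assumes the usual x86 convention that a PC-relative displacement is taken relative to the end of the instruction; the function names and the example layout (a 7-byte instruction with a 4-byte displacement at offset 2 followed by a 1-byte immediate) are illustrative assumptions, not code from the patch.

```python
def pc_relative_target(inst_addr: int, inst_size: int, disp: int) -> int:
    """Target of a PC-relative operand, measured from the instruction end."""
    return inst_addr + inst_size + disp


def naive_target(inst_addr: int, op_offset: int, op_size: int, disp: int) -> int:
    """Guess the instruction end from the operand alone.

    This is wrong whenever another operand follows the displacement.
    """
    return inst_addr + op_offset + op_size + disp


if __name__ == "__main__":
    # Hypothetical layout: displacement at offset 2, 4 bytes wide,
    # followed by a 1-byte immediate, so the instruction is 7 bytes long.
    print(hex(pc_relative_target(0x1000, 7, 0x10)))
    print(hex(naive_target(0x1000, 2, 4, 0x10)))
```

The two results differ by the size of the trailing operand, which is why the instruction size cannot be derived reliably from the operand offset and operand size.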
Joe Nash 835e09c4c3 [AMDGPU] gfx11 FLAT Instructions
MachineCode Support for FLAT type instructions

Contributors:
Sebastian Neubauer <sebastian.neubauer@amd.com>

Patch 12/N for upstreaming of AMDGPU gfx11 architecture.

Depends on D125989

Reviewed By: rampitec, #amdgpu

Differential Revision: https://reviews.llvm.org/D125992
2022-05-25 15:29:39 -04:00
Joe Nash ef1ea5ac01 [AMDGPU] gfx11 vinterp instructions MC support
A new instruction encoding. Some of these instructions were previously VOP3
encoded.

Contributors:
Carl Ritson <carl.ritson@amd.com>

Patch 11/N for upstreaming of AMDGPU gfx11 architecture.

Depends on D125824

Reviewed By: critson

Differential Revision: https://reviews.llvm.org/D125989
2022-05-25 14:59:16 -04:00
Joe Nash 1a51ab766f [AMDGPU] gfx11 export instructions
Contributors:
Jay Foad <jay.foad@amd.com>
Dmitry Preobrazhensky <d-pre@mail.ru>

Patch 10/N for upstreaming of AMDGPU gfx11 architecture.

Depends on D125822

Reviewed By: dp

Differential Revision: https://reviews.llvm.org/D125824
2022-05-25 14:44:09 -04:00
Nicolai Hähnle affa1b1cc5 AMDGPU/GISel: Factor out AMDGPURegisterBankInfo::buildReadFirstLane
A later change will add a 3rd user, so factoring out the common code
seems useful.

Reorganizing the executeInWaterfallLoop causes some more COPYs to be
generated, but those all fold away during instruction selection.
Generating the comparisons uses generic instructions over machine
instructions now which admittedly shouldn't make a difference
(though it should make it easier to move the waterfall loop generation
to another place).

(Resubmit with missing test added.)

Differential Revision: https://reviews.llvm.org/D125324
2022-05-25 12:14:01 -05:00
Nicolai Hähnle afc90101a5 Revert "AMDGPU/GISel: Factor out AMDGPURegisterBankInfo::buildReadFirstLane"
This reverts commit 2a28467e53.
2022-05-25 12:03:23 -05:00
Nicolai Hähnle 2a28467e53 AMDGPU/GISel: Factor out AMDGPURegisterBankInfo::buildReadFirstLane
A later change will add a 3rd user, so factoring out the common code
seems useful.

Reorganizing the executeInWaterfallLoop causes some more COPYs to be
generated, but those all fold away during instruction selection.
Generating the comparisons uses generic instructions over machine
instructions now which admittedly shouldn't make a difference
(though it should make it easier to move the waterfall loop generation
to another place).

Differential Revision: https://reviews.llvm.org/D125324
2022-05-25 11:35:02 -05:00
Stanislav Mekhanoshin 5df6669d45 [AMDGPU] Enforce alignment of image vaddr on gfx90a
Even though single-address image instructions only use a single VGPR,
the HW accesses 4 or 5, which creates an alignment requirement.

Fixes: SWDEV-316648

Differential Revision: https://reviews.llvm.org/D126009
2022-05-24 10:05:39 -07:00
Ivan Kosarev 1586e1dc95 [AMDGPU][MC][GFX11] Support base+soffset+offset SMEM loads.
Reviewed By: dp

Differential Revision: https://reviews.llvm.org/D126207
2022-05-24 15:13:14 +01:00
Dmitry Preobrazhensky 818cc9b285 [AMDGPU][MC][GFX940] Disable v_mac_f32_dpp
Differential Revision: https://reviews.llvm.org/D126070
2022-05-23 15:49:44 +03:00
Jay Foad 9af56c676e [AMDGPU] Mark SMEM cache invalidations as not reading memory
This brings the MachineInstrs in line with the corresponding intrinsics
which have side effects but do not access memory. It also matches how
BUF cache invalidation instructions are defined.

The lit test changes are just because the machine scheduler previously
treated them like loads, and added an artificial scheduling edge from
them to the exit SU, which caused them to be scheduled earlier.

Differential Revision: https://reviews.llvm.org/D126074
2022-05-20 17:18:03 +01:00
Jay Foad 78ec59e6ae [AMDGPU] Handle mandatory literals in isOperandLegal
Extend SIInstrInfo::isOperandLegal to enforce a limit on the number of
literal operands for all VALU instructions, not just VOP3. In particular
it now handles VOP2 instructions with a mandatory literal operand like
V_FMAAK_F32.

Differential Revision: https://reviews.llvm.org/D126064
2022-05-20 16:14:00 +01:00
Jay Foad 5b18ef7256 [AMDGPU] Add verification for mandatory literals
Extend the literal operand checking in SIInstrInfo::verifyInstruction to
check VOP2 instructions like V_FMAAK_F32 which have a mandatory literal
operand. The rule is that src0 can also be a literal, but only if it is
the same literal value.

AMDGPUAsmParser::validateConstantBusLimitations already handles this
correctly.

Differential Revision: https://reviews.llvm.org/D126063
2022-05-20 16:14:00 +01:00
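The rule stated in 5b18ef7256 can be written as a one-line predicate. This is a sketch of the stated rule only, not of SIInstrInfo::verifyInstruction; the function name and the use of `None` to mean "src0 is not a literal" are assumptions made for illustration.

```python
from typing import Optional


def src0_literal_is_legal(mandatory_literal: int,
                          src0_literal: Optional[int]) -> bool:
    """Check src0 against an instruction's mandatory literal operand.

    `src0_literal` is None when src0 is a register or inline constant.
    If src0 is a literal, it must carry the same value as the mandatory
    literal so that only one distinct literal is used.
    """
    return src0_literal is None or src0_literal == mandatory_literal


if __name__ == "__main__":
    print(src0_literal_is_legal(0x42280000, None))        # src0 is not a literal
    print(src0_literal_is_legal(0x42280000, 0x42280000))  # same literal value
    print(src0_literal_is_legal(0x42280000, 0x3F800001))  # different literal
```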
Dmitry Preobrazhensky f598dfb3bf [AMDGPU][MC][GFX8+] Correct SMEM offset parsing
Differential Revision: https://reviews.llvm.org/D125907
2022-05-20 14:00:34 +03:00
Jay Foad 9ece051847 [AMDGPU] Mark s_get_waveid_in_workgroup as not reading memory
It is already marked as having side effects, at least in MIR. It does
not interact with anything else that is modelled as a memory access
either in IR or MachineIR.

Differential Revision: https://reviews.llvm.org/D125985
2022-05-19 21:25:46 +01:00
Jay Foad 86b55edab6 [AMDGPU] Mark s_getreg as having side effects instead of reading memory
s_getreg does not interact with anything else that is modelled as a
memory access either in IR or MachineIR.

Differential Revision: https://reviews.llvm.org/D125968
2022-05-19 21:25:46 +01:00
Jay Foad d14f2a6359 [AMDGPU] Allow multiple uses of the same literal in SOP2/SOPC
AMDGPUAsmParser::validateSOPLiteral already knew about this but
SIInstrInfo::verifyInstruction did not.

Differential Revision: https://reviews.llvm.org/D125976
2022-05-19 16:42:20 +01:00
Joe Nash ac2ff258d6 [AMDGPU] gfx11 scalar memory instructions
Contributors:
Mirko Brkusanin <Mirko.Brkusanin@amd.com>

Patch 9/N for upstreaming of AMDGPU gfx11 architecture.

Depends on D125820

Reviewed By: kosarev, #amdgpu, arsenm

Differential Revision: https://reviews.llvm.org/D125822
2022-05-19 10:27:47 -04:00
Joe Nash 729467acef [AMDGPU] gfx11 LDSDIR instructions MC support
Contributors:
Carl Ritson <carl.ritson@amd.com>

Patch 8/N for upstreaming of AMDGPU gfx11 architecture.

Depends on D125498

Reviewed By: critson, rampitec, #amdgpu

Differential Revision: https://reviews.llvm.org/D125820
2022-05-19 10:08:47 -04:00
Dmitry Preobrazhensky 44673278e0 [AMDGPU][MC][GFX940] Add SMFMAC aliases
Differential Revision: https://reviews.llvm.org/D125888
2022-05-19 13:40:48 +03:00
Jay Foad 6bec3e9303 [APInt] Remove all uses of zextOrSelf, sextOrSelf and truncOrSelf
Most clients only used these methods because they wanted to be able to
extend or truncate to the same bit width (which is a no-op). Now that
the standard zext, sext and trunc allow this, there is no reason to use
the OrSelf versions.

The OrSelf versions additionally have the strange behaviour of allowing
extending to a *smaller* width, or truncating to a *larger* width, which
are also treated as no-ops. A small amount of client code relied on this
(ConstantRange::castOp and MicrosoftCXXNameMangler::mangleNumber) and
needed rewriting.

Differential Revision: https://reviews.llvm.org/D125557
2022-05-19 11:23:13 +01:00
Dmitry Preobrazhensky 32ca9bd7b5 [AMDGPU][MC][GFX940] Correct tied operand decoding for smfmac opcodes
Differential Revision: https://reviews.llvm.org/D125790
2022-05-18 15:39:30 +03:00
Dmitry Preobrazhensky 169416c64a [AMDGPU][MC][GFX7] Disable cache policy modifiers with SMRD
Differential Revision: https://reviews.llvm.org/D125799
2022-05-18 15:17:49 +03:00
Dmitry Preobrazhensky 95a8af2750 [AMDGPU][MC][NFC] MUBUF code cleanup
Removed code that is no longer used after https://reviews.llvm.org/D124485.

Differential Revision: https://reviews.llvm.org/D125811
2022-05-18 15:00:38 +03:00
Jay Foad e2926501d8 [AMDGPU] Aggressively fold immediates in SIShrinkInstructions
Fold immediates regardless of how many uses they have. This is expected
to increase overall code size, but decrease register usage.

Differential Revision: https://reviews.llvm.org/D114644
2022-05-18 11:04:33 +01:00
Jay Foad 3eb2281bc0 [AMDGPU] Aggressively fold immediates in SIFoldOperands
Previously SIFoldOperands::foldInstOperand would only fold a
non-inlinable immediate into a single user, so as not to increase code
size by adding the same 32-bit literal operand to many instructions.

This patch removes that restriction, so that a non-inlinable immediate
will be folded into any number of users. The rationale is:
- It reduces the number of registers used for holding constant values,
  which might increase occupancy. (On the other hand, many of these
  registers are SGPRs which no longer affect occupancy on GFX10+.)
- It reduces ALU stalls between the instruction that loads a constant
  into a register, and the instruction that uses it.
- The above benefits are expected to outweigh any increase in code size.

Differential Revision: https://reviews.llvm.org/D114643
2022-05-18 10:19:35 +01:00
Jay Foad dd12c3433e [AMDGPU] Shrink F16 MAD/FMA to MADAK/MADMK/FMAAK/FMAMK on GFX10
Differential Revision: https://reviews.llvm.org/D125803
2022-05-18 10:00:06 +01:00
Shao-Ce SUN 25af3afa67 [NFC][AMDGPU][CodeGen] Use ArrayRef in TargetLowering functions
Based on D123467.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D124508
2022-05-18 10:50:23 +08:00
Stanislav Mekhanoshin dee3190293 [AMDGPU] Add llvm.amdgcn.global.load.lds intrinsic
Differential Revision: https://reviews.llvm.org/D125279
2022-05-17 12:35:27 -07:00
Stanislav Mekhanoshin a09af86693 [AMDGPU] Enable FLAT LDS DMA on gfx9/10 before gfx940
We always had global and scratch loads to LDS in the gfx9,
but did not handle it. These were available via the 'lds'
encoding bit. In gfx940 this bit was reused as 'svs' which
resulted in new '_lds' opcodes effectively pushing this
bit into the opcode, but functionally it is the same. These
instructions are also available on gfx10.

Differential Revision: https://reviews.llvm.org/D125126
2022-05-17 12:16:37 -07:00
Joe Nash d21b9b4946 [AMDGPU] gfx11 scalar alu instructions
MC layer support for SOP (scalar alu operations) including encoding
support for s_delay_alu and s_sendmsg_rtn.

Contributors:
Jay Foad <jay.foad@amd.com>

Patch 7/N for upstreaming of AMDGPU gfx11 architecture.

Depends on D125319

Reviewed By: #amdgpu, arsenm

Differential Revision: https://reviews.llvm.org/D125498
2022-05-17 13:35:41 -04:00
Stanislav Mekhanoshin 791ec1c68e [AMDGPU] Add intrinsics llvm.amdgcn.{raw|struct}.buffer.load.lds
Differential Revision: https://reviews.llvm.org/D124884
2022-05-17 10:32:13 -07:00
Stanislav Mekhanoshin 332b73fe12 [AMDGPU] Revert wide LDS DMA support.
This reverts ffbee7acdc, see also bug 37653 which it was fixing.
The bug claims this is an undocumented feature which actually works.
In reality it is documented as not working, for a good reason.
It likely does something, but it is useless anyway. These instructions
write into the LDS. The LDS address is:

M0 + inst_offset + (TIDinWave * 4).

For a store wider than a DWORD neighboring lanes will overwrite each
other.

Differential Revision: https://reviews.llvm.org/D125409
2022-05-16 11:23:35 -07:00
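The overlap argument in 332b73fe12 follows directly from the address formula quoted there. The sketch below simply evaluates M0 + inst_offset + (TIDinWave * 4) for two neighbouring lanes; the function names are assumptions, and byte ranges are modelled with Python `range` objects.

```python
def lds_byte_range(m0: int, inst_offset: int, tid_in_wave: int,
                   store_bytes: int) -> range:
    """Bytes of LDS written by one lane, per the formula in the commit."""
    start = m0 + inst_offset + tid_in_wave * 4
    return range(start, start + store_bytes)


def neighbouring_lanes_overlap(store_bytes: int, m0: int = 0,
                               inst_offset: int = 0) -> bool:
    """True if lane 0 and lane 1 write at least one common byte."""
    lane0 = lds_byte_range(m0, inst_offset, 0, store_bytes)
    lane1 = lds_byte_range(m0, inst_offset, 1, store_bytes)
    return lane0.stop > lane1.start


if __name__ == "__main__":
    print(neighbouring_lanes_overlap(4))  # DWORD store
    print(neighbouring_lanes_overlap(8))  # store wider than a DWORD
```

Because the per-lane stride is fixed at 4 bytes, any store wider than a DWORD reaches into the next lane's slot.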
Joe Nash 6ef17f20d9 [AMDGPU] Mark sendmsg hasSideEffects. NFC
Address the FIXME by marking the sendmsg instructions with
hasSideEffects.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D125569
2022-05-16 09:59:27 -04:00
Jay Foad 27fa41583f [AMDGPU] Shrink MAD/FMA to MADAK/MADMK/FMAAK/FMAMK on GFX10
On GFX10 VOP3 instructions can have a literal operand, so the conversion
from VOP3 MAD/FMA to VOP2 MADAK/MADMK/FMAAK/FMAMK will not happen in
SIFoldOperands. The only benefit of the VOP2 form is code size, so do it
in SIShrinkInstructions instead.

Differential Revision: https://reviews.llvm.org/D125567
2022-05-16 15:15:23 +01:00
Joe Nash c70259405c [AMDGPU] gfx11 BUF Instructions
Includes MachineCode layer support and tests, and MIR tests not requiring
CodeGen pass changes.
Includes a small change in SMInstructions.td to correct encoded bits.

Contributors:
Petar Avramovic <Petar.Avramovic@amd.com>
Dmitry Preobrazhensky <dmitry.preobrazhensky@amd.com>

Depends on D125316

Patch 6/N for upstreaming of AMDGPU gfx11 architecture.

Reviewed By: dp, Petar.Avramovic

Differential Revision: https://reviews.llvm.org/D125319
2022-05-16 09:41:40 -04:00
Jay Foad c1af2d329f [AMDGPU] SIShrinkInstructions: change static functions to methods
This is a mechanical change to avoid passing MRI and TII around
explicitly. NFC.

Differential Revision: https://reviews.llvm.org/D125566
2022-05-16 09:43:41 +01:00
Jay Foad dfb006c0c9 [AMDGPU] Extract SIInstrInfo::removeModOperands. NFC.
Make this an externally callable function for use in a future patch.

Differential Revision: https://reviews.llvm.org/D125565
2022-05-16 09:43:41 +01:00
Sheng c644488a8b Rename `MCFixedLenDisassembler.h` as `MCDecoderOps.h`
The name `MCFixedLenDisassembler.h` is out of date after D120958.

Rename it as `MCDecoderOps.h` to reflect the change.

Reviewed By: myhsu

Differential Revision: https://reviews.llvm.org/D124987
2022-05-15 08:44:58 +08:00
Ivan Kosarev bf5fc0d603 [AMDGPU][NFC] Remove unused function.
Introduced in
https://reviews.llvm.org/rG229d5e669bbbe7ca38ad832627a9809405939f1b

and then became unused in
https://reviews.llvm.org/D19584

Reviewed By: foad, dp

Differential Revision: https://reviews.llvm.org/D125385
2022-05-12 08:52:06 +01:00
Ivan Kosarev cb67b2ccc4 [AMDGPU][GFX10] Support base+soffset+offset SMEM stores.
Also makes another step towards resolving
https://github.com/llvm/llvm-project/issues/38652

Reviewed By: foad, dp

Differential Revision: https://reviews.llvm.org/D125380
2022-05-12 08:48:05 +01:00
Austin Kerbow 2db700215a [AMDGPU] Add llvm.amdgcn.sched.barrier intrinsic
Adds an intrinsic/builtin that can be used to fine tune scheduler behavior. If
there is a need to have highly optimized codegen and kernel developers have
knowledge of inter-wave runtime behavior which is unknown to the compiler this
builtin can be used to tune scheduling.

This intrinsic creates a barrier between scheduling regions. The immediate
parameter is a mask to determine the types of instructions that should be
prevented from crossing the sched_barrier. In this initial patch, there are only
two variations. A mask of 0 means that no instructions may be scheduled across
the sched_barrier. A mask of 1 means that non-memory, non-side-effect inducing
instructions may cross the sched_barrier.

Note that this intrinsic is only meant to work with the scheduling passes. Any
other transformations that may move code will not be impacted in the ways
described above.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D124700
2022-05-11 13:22:51 -07:00
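The two mask values described in 2db700215a can be summarized as a small decision function. This is a sketch of the semantics as worded in the commit message, not of the scheduler code; the function name and the boolean flags describing an instruction are hypothetical.

```python
def may_cross_sched_barrier(mask: int, accesses_memory: bool,
                            has_side_effects: bool) -> bool:
    """Whether an instruction may be scheduled across a sched_barrier.

    Only the two mask values from the initial patch are modelled:
      0: no instruction may cross the barrier
      1: non-memory, non-side-effect instructions may cross
    """
    if mask == 0:
        return False
    if mask == 1:
        return not accesses_memory and not has_side_effects
    raise ValueError(f"mask {mask} is not described by the initial patch")


if __name__ == "__main__":
    print(may_cross_sched_barrier(0, False, False))
    print(may_cross_sched_barrier(1, False, False))
    print(may_cross_sched_barrier(1, True, False))
```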
Joe Nash a0a406b257 [AMDGPU] gfx11 Decode wider instructions. NFC
Refactor to pass a templatized size parameter to the decoder to allow wider than
64bit decodes in a later patch.

Contributors:
Jay Foad <jay.foad@amd.com>

Depends on D125261

Patch 5/N for upstreaming of AMDGPU gfx11 architecture.

Reviewed By: dp

Differential Revision: https://reviews.llvm.org/D125316
2022-05-11 11:05:58 -04:00
Joe Nash 18ed279a3a [AMDGPU] gfx11 subtarget features & early tests
Tablegen definitions for subtarget features and cpp predicate functions to
access the features.
New Sub-TargetProcessors and common latencies.
Simple changes to MIR codegen tests which pass on gfx11 because they have the
same output as previous subtargets or operate on pseudo instructions which
are reused from previous subtargets.

Contributors:
Jay Foad <jay.foad@amd.com>
Petar Avramovic <Petar.Avramovic@amd.com>

Patch 4/N for upstreaming of AMDGPU gfx11 architecture

Depends on D124538

Reviewed By: Petar.Avramovic, foad

Differential Revision: https://reviews.llvm.org/D125261
2022-05-11 10:31:49 -04:00
Mehdi Amini 3ffb08844c Remove unused variable (fix -Werror build on MSVC) 2022-05-10 21:04:52 +00:00
jeff f822db7670 [AMDGPU] Allow for MFMA Inst Clustering
This patch adds cluster edges between independent MFMA instructions. Additionally, it propagates all predecessors of cluster insts to the root of the cluster(s), and all successors to the leaf(ves) of the cluster(s) -- this is done to remove the possibility that those insts will be interspersed within the cluster.

Reviewed By: kerbowa

Differential Revision: https://reviews.llvm.org/D124678
2022-05-10 12:57:40 -07:00
jeff 3ff8ee2447 [NFC] Fix typo
Reviewed By: kerbowa

Differential Revision: https://reviews.llvm.org/D124647
2022-05-10 12:11:21 -07:00
Ivan Kosarev 88f04bdbd8 [AMDGPU][GFX10] Support base+soffset+offset SMEM loads.
Also makes a step towards resolving
https://github.com/llvm/llvm-project/issues/38652

Reviewed By: foad, dp

Differential Revision: https://reviews.llvm.org/D125117
2022-05-10 16:17:14 +01:00
Nicolai Hähnle 6c2a01ce3a AMDGPU/SDAG: Refine the fold to v_mad_[iu]64_[iu]32
Only fold for uniform values on pre-GFX9 chips. GFX9+ allow us
to keep the calculation entirely on the SALU.

For subtargets where integer multiplication isn't full-rate, avoid
folding if the multiply has too many uses.

Finally, we expand 64x32 and 64x64 multiplies here as well, if they
feed into an addition. This results in better code generation than
the generic expansion for such multiplies because we end up using
the accumulator of the MAD instructions.

Differential Revision: https://reviews.llvm.org/D123835
2022-05-10 09:15:51 -05:00
Simon Pilgrim 7e3ef7dcd2 [AMDGPU] lowerEXTRACT_VECTOR_ELT - fold from a SCALAR_TO_VECTOR source
As suggested by @foad on D124839

If we're extracting a vector element that originally came from a scalar_to_vector, then avoid the bitcasting of a vector type and perform the shift masking on the (any-extended) scalar source directly, making use of the fact that the upper elements of a scalar_to_vector are all undef.

Differential Revision: https://reviews.llvm.org/D125173
2022-05-07 20:23:31 +01:00
Joe Nash 7e71a03966 [AMDGPU] Split FeatureAtomicFaddInsts
FeatureAtomicFaddInsts is replaced with three more granular features.
Contributors:
Petar Avramovic <Petar.Avramovic@amd.com>

Patch 3/N for upstreaming of AMDGPU gfx11 architecture

Depends on D124537

Reviewed By: foad, #amdgpu, arsenm

Differential Revision: https://reviews.llvm.org/D124538
2022-05-05 13:27:45 -04:00
Jay Foad ba6c8d42d4 [AMDGPU] Combine DPP mov even if old reg def is in different BB
Given a DPP mov like this:

  %2:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
  ...
  %3:vgpr_32 = V_MOV_B32_dpp %2, %1, 1, 1, 1, 0, implicit $exec

this patch just removes a check that %2 (the "old reg") was defined in
the same BB as the DPP mov instruction. GCNDPPCombine requires that the
MIR is in SSA form so I don't understand why the BB matters.

This lets the optimization work in more real world cases when the
definition of %2 gets hoisted out of a loop.

Differential Revision: https://reviews.llvm.org/D124182
2022-05-05 11:30:31 +01:00
Mariusz Sikora 2417de2758 [AMDGPU] Use d16 flag for image.sample instructions
Image.sample instruction can be forced to return half type instead of
float when d16 flag is enabled.

This patch adds new pattern in InstCombine to detect if output of
image.sample is used later only by fptrunc which converts the type
from float to half. If pattern is detected then fptrunc and image.sample
are combined to single image.sample which is returning half type.
Later in Lowering part d16 flag is added to image sample intrinsic.

Differential Revision: https://reviews.llvm.org/D124232
2022-05-05 06:29:19 +02:00
Stanislav Mekhanoshin 63f21f4cc7 [AMDGPU] Handle LDS DMA and LDS_DIRECT hazards
There shall be 1 wait state between M0 write and LDS DMA/LDS_DIRECT use.

Differential Revision: https://reviews.llvm.org/D124550
2022-05-04 14:45:16 -07:00
Jon Chesterfield bc78c09952 [amdgpu] Elide module lds allocation in kernels with no callees
Introduces a string attribute, amdgpu-requires-module-lds, to allow
eliding the module.lds block from kernels. Will allocate the block as before
if the attribute is missing or has its default value of true.

Patch uses the new attribute to detect the simplest possible instance of this,
where a kernel makes no calls and thus cannot call any functions that use LDS.

Tests updated to match; coverage was already good. An interesting case is in
lower-module-lds-offsets where annotating the kernel allows the backend to pick
a different (in this case better) variable ordering than previously. A later
patch will avoid moving kernel variables into module.lds when the kernel can
have this attribute, allowing optimal ordering and locally unused variable
elimination.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D122091
2022-05-04 22:42:07 +01:00
serge-sans-paille 7030654296 [iwyu] Handle regressions in libLLVM header include
Running iwyu-diff on LLVM codebase since fa5a4e1b95 detected a few
regressions, fixing them.

Differential Revision: https://reviews.llvm.org/D124847
2022-05-04 08:32:38 +02:00
Nicolai Hähnle 8b42e6d057 AMDGPU: Remove redundant call to MachineInstrBuilder::setMBB
setInstrAndDebugLoc also sets the basic block automatically.

Differential Revision: https://reviews.llvm.org/D124809
2022-05-03 07:49:20 -05:00
hsmahesha 589b9df4e1 [AMDGPU] Fix scalar_to_vector for v8i16/v8f16
so that the stack access is avoided.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D124734
2022-05-03 07:28:15 +05:30
hsmahesha 3175323ce1 [AMDGPU][NFC] Make lowerINSERT_VECTOR_ELT() more readable
by moving around the code and by adding more comments, which would
later help during any required clean-up.

Differential Revision: https://reviews.llvm.org/D124733
2022-05-03 07:28:15 +05:30
Nicolai Hähnle deaa678137 AMDGPU/SDAG: Factor out the fold (add (mul x, y), y) --> mad_[iu]64_[iu]32
Refactor to simplify a follow-up change.

No functional change intended. However, there is a rather subtle logic
change: the subsequent combines (e.g. reassociation) are skipped *always*
when one of the operands of the add is a mul, instead of only when
additionally mad64_32 etc. are available. This change makes sense because
the subsequent combines should never apply when one of the operands is a
mul.
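
For reference, the arithmetic behind the fold can be sketched in plain C++. This is a minimal illustration of the unsigned case only, assuming a 32x32-bit multiply widened to 64 bits plus a 64-bit addend (called Z here); the function names are hypothetical and the sketch does not model the DAG combine itself:

```cpp
#include <cassert>
#include <cstdint>

// Unfused form: widen both 32-bit operands, multiply, then add (two operations).
uint64_t mulThenAdd(uint32_t X, uint32_t Y, uint64_t Z) {
  uint64_t Product = static_cast<uint64_t>(X) * static_cast<uint64_t>(Y);
  return Product + Z; // unsigned arithmetic wraps modulo 2^64
}

// Fused form: models what a mad_u64_u32-style instruction computes in one step.
uint64_t madU64U32(uint32_t X, uint32_t Y, uint64_t Z) {
  return static_cast<uint64_t>(X) * Y + Z;
}

// Hypothetical helper: check that both forms agree for one set of inputs.
void checkFold(uint32_t X, uint32_t Y, uint64_t Z) {
  assert(mulThenAdd(X, Y, Z) == madU64U32(X, Y, Z));
}
```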

Differential Revision: https://reviews.llvm.org/D123833
2022-05-02 17:40:03 -05:00
Stanislav Mekhanoshin 51e02409f0 [AMDGPU] Produce waitcounts for LDS DMA
MUBUF and FLAT LDS DMA operations need a wait on vmcnt before the LDS they
write can be accessed. A load from LDS to VMEM does not need a wait.

Differential Revision: https://reviews.llvm.org/D124626
2022-04-29 11:14:11 -07:00
Joe Nash 813e521e55 [AMDGPU] Add gfx11 subtarget ELF definition
This is the first patch of a series to upstream support for the new
subtarget.

Contributors:
Jay Foad <jay.foad@amd.com>
Konstantin Zhuravlyov <kzhuravl_dev@outlook.com>

Patch 1/N for upstreaming AMDGPU gfx11 architectures.

Reviewed By: foad, kzhuravl, #amdgpu

Differential Revision: https://reviews.llvm.org/D124536
2022-04-29 12:27:17 -04:00
Ivan Kosarev 6ddf2a824d [AMDGPU] Adjust wave priority based on VMEM instructions to avoid duty-cycling.
As older waves execute long sequences of VALU instructions, this may
prevent younger waves from address calculation and then issuing their
VMEM loads, which in turn leads the VALU unit to idle. This patch tries
to prevent this by temporarily raising the wave's priority.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D124246
2022-04-27 14:37:18 +01:00
Stanislav Mekhanoshin 6a24e37219 [AMDGPU] Remove now unused variable HasLdsModifier. NFC. 2022-04-26 17:49:30 -07:00
Stanislav Mekhanoshin 0274811b5a [AMDGPU] Add both mayLoad and mayStore to MUBUF LDS opcodes
Differential Revision: https://reviews.llvm.org/D124483
2022-04-26 17:30:24 -07:00
Stanislav Mekhanoshin 00d84a9f92 [AMDGPU] Remove vdata from buffer to lds load
Differential Revision: https://reviews.llvm.org/D124485
2022-04-26 17:16:26 -07:00
Stanislav Mekhanoshin a9ccc7bc54 [AMDGPU] Properly mark MUBUF and FLAT LDS DMA instructions. NFC.
Add these bits to the MUBUF and FLAT LDS DMA instructions:

- LGKM_CNT - these operate on LDS;
- VALU - SPG 3.9.8: This instruction acts as both a MUBUF and
VALU instruction;

Codegen currently does not produce any of this, so the change is NFC.

Differential Revision: https://reviews.llvm.org/D124472
2022-04-26 14:20:26 -07:00
Vasileios Porpodas fa8a9fea47 Recommit "[SLP][TTI] Refactoring of `getShuffleCost` `Args` to work like `getArithmeticInstrCost`"
This reverts commit 6a9bbd9f20.

Code review: https://reviews.llvm.org/D124202
2022-04-26 14:02:40 -07:00
Piotr Sobczak c6afbdb5d2 Revert "[AMDGPU] Use d16 flag for image.sample instructions"
This reverts commit d1762fc454.

Reverting D124232 as the buildbot reported some errors in sanitizers.
2022-04-25 17:18:49 +02:00
Mariusz Sikora d1762fc454 [AMDGPU] Use d16 flag for image.sample instructions
The image.sample instruction can be forced to return half type instead of
float when the d16 flag is enabled.

This patch adds a new pattern in InstCombine to detect if the output of
image.sample is used later only by an fptrunc which converts the type
from float to half. If the pattern is detected, the fptrunc and
image.sample are combined into a single image.sample which returns half
type. Later, during lowering, the d16 flag is added to the image sample
intrinsic.

Differential Revision: https://reviews.llvm.org/D124232
2022-04-25 13:05:52 +01:00
Matt Arsenault 0ecbb683a2 TableGen/GlobalISel: Make address space/align predicates consistent
The builtin predicate handling has a strange behavior where the code
assumes that a PatFrag is a stack of PatFrags, and each level adds at
most one predicate. I don't think this particularly makes sense,
especially without a diagnostic to ensure you aren't trying to set
multiple at once.

This wasn't followed for address spaces and alignment, which could
potentially fall through to report no builtin predicate was
added. Just switch these to follow the existing convention for now.
2022-04-22 15:48:07 -04:00
Matt Arsenault 794a0bb547 AMDGPU: Directly implement computeKnownBits for workitem intrinsics
Currently metadata is inserted in a late pass which is lowered
to an AssertZext. The metadata would be more useful if it was
inserted earlier after inlining, but before codegen.

Probably shouldn't change anything now. Just replacing the
late metadata annotation needs more work, since we lose
out on optimizations after these are lowered to CopyFromReg.

Seems to be slightly better than relying on the AssertZext from the
metadata. The test change in cvt_f32_ubyte.ll is a quirk from it using
-start-before=amdgpu-isel instead of running the usual codegen
pipeline.
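
As a rough illustration of the kind of fact such a known-bits query can report, the sketch below derives how many high bits of a 32-bit workitem ID must be zero from a maximum ID. This assumes the bound is a plain inclusive maximum; the function name is hypothetical and this is not the backend's code:

```cpp
#include <cassert>
#include <cstdint>

// Number of high bits of a 32-bit ID that are known to be zero when the ID
// can never exceed MaxID (inclusive).
unsigned knownLeadingZeros(uint32_t MaxID) {
  unsigned Zeros = 0;
  // Walk down from the top bit until the first bit that MaxID has set.
  for (uint32_t Bit = 0x80000000u; Bit != 0 && (MaxID & Bit) == 0; Bit >>= 1)
    ++Zeros;
  assert(Zeros <= 32);
  return Zeros;
}
```

For a maximum ID of 1023, only the low 10 bits can ever be set, so the upper 22 bits are known zero without any metadata or AssertZext node.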
2022-04-22 10:49:50 -04:00
Abinav Puthan Purayil 561af89fed [AMDGPU] Use a wrapper multiclass for buffer atomic intrinsic patterns. NFC 2022-04-22 13:59:34 +05:30
Abinav Puthan Purayil 272a876804 [AMDGPU] Rename the FlatSignedIntrPat multiclass to FlatSignedAtomicIntrPat. NFC 2022-04-22 11:47:23 +05:30
Abinav Puthan Purayil 2147b6c89d [AMDGPU] Remove no-ret atomic ops selection in the post-isel hook
No-ret atomic ops are now selected in tblgen.

Differential Revision: https://reviews.llvm.org/D124086
2022-04-22 09:37:41 +05:30
Abinav Puthan Purayil 165ae7276c [AMDGPU] Remove atomic pattern args in FLAT_[Global_]Atomic_Pseudo defs
We already have explicit patterns for these.

Differential Revision: https://reviews.llvm.org/D124084
2022-04-22 09:37:40 +05:30
Abinav Puthan Purayil f935908d7b [AMDGPU] Select no-return DS_PK_ADD_F16 in tblgen
Differential Revision: https://reviews.llvm.org/D123584
2022-04-22 09:37:40 +05:30
Abinav Puthan Purayil 45ca94334e [AMDGPU] Select no-return atomic intrinsics in tblgen
This is to avoid relying on the post-isel hook.

This change also enables the saddr pattern selection for atomic
intrinsics in GlobalISel.

Differential Revision: https://reviews.llvm.org/D123583
2022-04-22 09:37:40 +05:30
Stanislav Mekhanoshin ac94073daa [AMDGPU] Refine 64 bit misaligned LDS ops selection
Here is the performance data:
```
Using platform: AMD Accelerated Parallel Processing
Using device: gfx900:xnack-

ds_write_b64                       aligned by  8:  3.2 sec
ds_write2_b32                      aligned by  8:  3.2 sec
ds_write_b16 * 4                   aligned by  8:  7.0 sec
ds_write_b8 * 8                    aligned by  8: 13.2 sec
ds_write_b64                       aligned by  1:  7.3 sec
ds_write2_b32                      aligned by  1:  7.5 sec
ds_write_b16 * 4                   aligned by  1: 14.0 sec
ds_write_b8 * 8                    aligned by  1: 13.2 sec
ds_write_b64                       aligned by  2:  7.3 sec
ds_write2_b32                      aligned by  2:  7.5 sec
ds_write_b16 * 4                   aligned by  2:  7.1 sec
ds_write_b8 * 8                    aligned by  2: 13.3 sec
ds_write_b64                       aligned by  4:  4.6 sec
ds_write2_b32                      aligned by  4:  3.2 sec
ds_write_b16 * 4                   aligned by  4:  7.1 sec
ds_write_b8 * 8                    aligned by  4: 13.3 sec
ds_read_b64                        aligned by  8:  2.3 sec
ds_read2_b32                       aligned by  8:  2.2 sec
ds_read_u16 * 4                    aligned by  8:  4.8 sec
ds_read_u8 * 8                     aligned by  8:  8.6 sec
ds_read_b64                        aligned by  1:  4.4 sec
ds_read2_b32                       aligned by  1:  7.3 sec
ds_read_u16 * 4                    aligned by  1: 14.0 sec
ds_read_u8 * 8                     aligned by  1:  8.7 sec
ds_read_b64                        aligned by  2:  4.4 sec
ds_read2_b32                       aligned by  2:  7.3 sec
ds_read_u16 * 4                    aligned by  2:  4.8 sec
ds_read_u8 * 8                     aligned by  2:  8.7 sec
ds_read_b64                        aligned by  4:  4.4 sec
ds_read2_b32                       aligned by  4:  2.3 sec
ds_read_u16 * 4                    aligned by  4:  4.8 sec
ds_read_u8 * 8                     aligned by  4:  8.7 sec

Using platform: AMD Accelerated Parallel Processing
Using device: gfx1030

ds_write_b64                       aligned by  8:  4.4 sec
ds_write2_b32                      aligned by  8:  4.3 sec
ds_write_b16 * 4                   aligned by  8:  7.9 sec
ds_write_b8 * 8                    aligned by  8: 13.0 sec
ds_write_b64                       aligned by  1: 23.2 sec
ds_write2_b32                      aligned by  1: 23.1 sec
ds_write_b16 * 4                   aligned by  1: 44.0 sec
ds_write_b8 * 8                    aligned by  1: 13.0 sec
ds_write_b64                       aligned by  2: 23.2 sec
ds_write2_b32                      aligned by  2: 23.1 sec
ds_write_b16 * 4                   aligned by  2:  7.9 sec
ds_write_b8 * 8                    aligned by  2: 13.1 sec
ds_write_b64                       aligned by  4: 13.5 sec
ds_write2_b32                      aligned by  4:  4.3 sec
ds_write_b16 * 4                   aligned by  4:  7.9 sec
ds_write_b8 * 8                    aligned by  4: 13.1 sec
ds_read_b64                        aligned by  8:  3.5 sec
ds_read2_b32                       aligned by  8:  3.4 sec
ds_read_u16 * 4                    aligned by  8:  5.3 sec
ds_read_u8 * 8                     aligned by  8:  8.5 sec
ds_read_b64                        aligned by  1: 13.1 sec
ds_read2_b32                       aligned by  1: 22.7 sec
ds_read_u16 * 4                    aligned by  1: 43.9 sec
ds_read_u8 * 8                     aligned by  1:  7.9 sec
ds_read_b64                        aligned by  2: 13.1 sec
ds_read2_b32                       aligned by  2: 22.7 sec
ds_read_u16 * 4                    aligned by  2:  5.6 sec
ds_read_u8 * 8                     aligned by  2:  7.9 sec
ds_read_b64                        aligned by  4: 13.1 sec
ds_read2_b32                       aligned by  4:  3.4 sec
ds_read_u16 * 4                    aligned by  4:  5.6 sec
ds_read_u8 * 8                     aligned by  4:  7.9 sec
```

GFX10 exposes a different pattern for sub-DWORD load/store performance
than GFX9. On GFX9 it is faster to issue a single unaligned load or
store than a fully split b8 access, whereas on GFX10 even a full split
is faster. However, this gain is only theoretical, because splitting
an access down to the sub-DWORD level requires more registers and
packing/unpacking logic. Ignoring that option, it is better to use a
single 64-bit instruction on misaligned data, with the exception of
4-byte aligned data, where ds_read2_b32/ds_write2_b32 is better.
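
The selection rule stated in this paragraph can be sketched as a small decision function. This is a minimal sketch of the rule as described, not the backend's selection code; the enum and function names are hypothetical:

```cpp
#include <cassert>

// Two candidate forms for a 64-bit LDS access.
enum class Lds64Form {
  SingleB64, // ds_read_b64 / ds_write_b64
  PairedB32  // ds_read2_b32 / ds_write2_b32
};

// Pick a form from the known alignment in bytes (must be a power of two).
Lds64Form pick64BitLdsForm(unsigned AlignBytes) {
  assert(AlignBytes != 0 && (AlignBytes & (AlignBytes - 1)) == 0 &&
         "alignment must be a power of two");
  // Exactly DWORD aligned: the paired 32-bit form measured faster.
  if (AlignBytes == 4)
    return Lds64Form::PairedB32;
  // Fully aligned (>= 8) or aligned by less than 4: use one 64-bit access.
  return Lds64Form::SingleB64;
}
```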

Differential Revision: https://reviews.llvm.org/D123956
2022-04-21 09:37:16 -07:00
Petar Avramovic e06290e53f AMDGPU/GlobalISel: Fix isVCC for uniform s1 with reg class on wave32
Fix isVCC for register that was assigned register class during
inst-selection. This happens when register has multiple uses.
For wave32, uniform i1 to vcc copy was selected like vcc to vcc
copy when uniform i1 had assigned register class.
Uniform i1 register with assigned register class will have s1 LLT,
be defined using G_TRUNC and class will be SReg_32RegClass.
Vcc i1 register with assigned register class will have s1 LLT,
class will be SReg_32RegClass for wave32 and SReg_64RegClass for
wave64 and register will not be defined by G_TRUNC.

Differential Revision: https://reviews.llvm.org/D124163
2022-04-21 16:12:04 +02:00
Jannik Silvanus 607f8ced39 [AMDGPU]: Fix failing assertion in SIMachineScheduler
This fixes the assertion failure "Loop in the Block Graph!".

SIMachineScheduler groups instructions into blocks (also referred to
as coloring or groups) and then performs a two-level scheduling:
inter-block scheduling, and intra-block scheduling.

This approach requires that the dependency graph on the blocks which
is obtained by contracting the blocks in the original dependency graph
is acyclic. In other words: Whenever A and B end up in the same block,
all vertices on a path from A to B must be in the same block.

When compiling an example consisting of an export followed by
a buffer store, we see a dependency between these two. This dependency
may be false, but that is a different issue.
This dependency was not correctly accounted for by SIMachineScheduler.

A new test case si-scheduler-exports.ll demonstrating this is
also added in this commit.

The problematic part of SIMachineScheduler was a post-optimization of
the block assignment that tried to group all export instructions into
a separate export block for better execution performance. This routine
correctly checked that any paths from exports to exports did not
contain any non-exports, but not vice-versa: In case of an export with
a non-export successor dependency, that single export was moved
to a separate block, which could then be both a successor and a
predecessor block of a non-export block.

As a fix, we now skip export grouping if there are exports with direct
non-export successor dependencies. This fixes the issue at hand,
but is slightly pessimistic:
We *could* group all exports into a separate block that have neither
direct nor indirect export successor dependencies.
We will review the potential performance impact and potentially
revisit with a more sophisticated implementation.

Note that just grouping all exports without direct non-export successor
dependencies could still lead to illegal blocks, since non-export A
could depend on export B that depends on export C. In that case,
export C has no non-export successor, but still may not be grouped
into an export block.
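
The acyclicity requirement on the contracted block graph can be checked with a standard topological sort. The following is a minimal standalone sketch, assuming instructions are numbered from 0 and each is already assigned a block ID; the function name and data layout are hypothetical and unrelated to the scheduler's own data structures:

```cpp
#include <cassert>
#include <queue>
#include <set>
#include <utility>
#include <vector>

// Edges are (predecessor, successor) pairs over instruction indices.
// BlockOf[i] is the block ID of instruction i, in [0, NumBlocks).
// Returns true if the graph obtained by contracting each block is acyclic.
bool blockGraphIsAcyclic(
    const std::vector<std::pair<unsigned, unsigned>> &Edges,
    const std::vector<unsigned> &BlockOf, unsigned NumBlocks) {
  // Contract: keep only edges that cross blocks, without duplicates.
  std::set<std::pair<unsigned, unsigned>> BlockEdges;
  for (const auto &E : Edges) {
    assert(E.first < BlockOf.size() && E.second < BlockOf.size());
    unsigned From = BlockOf[E.first], To = BlockOf[E.second];
    assert(From < NumBlocks && To < NumBlocks);
    if (From != To)
      BlockEdges.insert({From, To});
  }

  // Kahn's algorithm: repeatedly remove blocks with no remaining predecessors.
  std::vector<std::vector<unsigned>> Succs(NumBlocks);
  std::vector<unsigned> InDegree(NumBlocks, 0);
  for (const auto &E : BlockEdges) {
    Succs[E.first].push_back(E.second);
    ++InDegree[E.second];
  }
  std::queue<unsigned> Ready;
  for (unsigned B = 0; B < NumBlocks; ++B)
    if (InDegree[B] == 0)
      Ready.push(B);

  unsigned Visited = 0;
  while (!Ready.empty()) {
    unsigned B = Ready.front();
    Ready.pop();
    ++Visited;
    for (unsigned S : Succs[B])
      if (--InDegree[S] == 0)
        Ready.push(S);
  }
  // Any block never visited lies on a cycle or downstream of one.
  return Visited == NumBlocks;
}
```

With a chain non-export 0, export 1, non-export 2, moving only the export into its own block while both non-exports stay together makes that export block both a successor and a predecessor of the other block, which is the failing shape described above.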
2022-04-21 14:52:29 +01:00
Dmitry Preobrazhensky 81af32b9a3 [AMDGPU][MC][NFC][GFX940] Corrected an error position
Differential Revision: https://reviews.llvm.org/D124099
2022-04-21 14:04:46 +03:00
Dmitry Preobrazhensky b4231ac4be [AMDGPU][GFX90A+] Disabled ds_ordered_count and exp
Differential Revision: https://reviews.llvm.org/D124087
2022-04-21 13:16:44 +03:00
hsmahesha 5bd87350a5 [AMDGPU] On gfx908, reserve VGPR for AGPR copy based on register budget.
Based on the available register budget, reserve the highest available VGPR
for AGPR copy before RA. After RA, shift it to the lowest unused VGPR if
one exists.

Fixes SWDEV-330006.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D123525
2022-04-21 07:57:26 +05:30
Stanislav Mekhanoshin aa14e2ef3e [AMDGPU] Remove obsolete hack from allowsMisalignedMemoryAccesses. NFCI.
Differential Revision: https://reviews.llvm.org/D124035
2022-04-20 11:52:56 -07:00
Jay Foad 879ac41089 [AMDGPU] Fix crash in SIOptimizeExecMaskingPreRA
When folding a COPY of exec into another COPY, the call to
TII->isOperandLegal would crash because COPYs don't have defined
register classes for their operands.

Differential Revision: https://reviews.llvm.org/D122737
2022-04-20 14:42:48 +01:00
Jay Foad 1f91512268 [AMDGPU] Simplify calls to getDefSrcRegIgnoringCopies. NFC.
getDefSrcRegIgnoringCopies never returns None on valid MIR.
2022-04-20 12:37:24 +01:00
Abinav Puthan Purayil b7df71524e [AMDGPU][GlobalISel] Force return atomic selection for now 2022-04-20 16:00:08 +05:30
Fangrui Song bec8dff33e [AMDGPU] Fix -Wunused-variable in -DLLVM_ENABLE_ASSERTIONS=off builds 2022-04-19 22:36:58 -07:00
Matt Arsenault 1900b6c77b AMDGPU: Add assert for GDS globals 2022-04-19 22:28:11 -04:00
Matt Arsenault 987df725ac AMDGPU: Serialize VGPRForAGPRCopy 2022-04-19 22:14:52 -04:00
Matt Arsenault b5ec131267 AMDGPU: Fix allocating GDS globals to LDS offsets
These don't seem to be very well used or tested, but try to make the
behavior a bit more consistent with LDS globals.

I'm not sure what the definition for amdgpu-gds-size is supposed to
mean. For now I assumed it's allocating a static size at the beginning
of the allocation, and any known globals are allocated after it.
2022-04-19 22:14:48 -04:00
Matt Arsenault 378bb8014d AMDGPU: Serialize a few more MachineFunctionInfo fields in MIR 2022-04-19 22:12:59 -04:00
Matt Arsenault f90f4884c8 AMDGPU: Serialize gds size in MIR 2022-04-19 22:12:59 -04:00
Matt Arsenault 5cd17f9d43 AMDGPU: Serialize WWM registers 2022-04-19 21:44:43 -04:00
Matt Arsenault e0d585d75a AMDGPU: Defer creation of WWM VGPR spill slots
There's no reason to create these immediately. They can be created in
the prolog/epilog code like CSR spills. There's probably a cleaner way
to do this by utilizing the CSR spill code.

This makes the frame index used transient state for
PrologEpilogInserter, and thus makes serialization easier. Really this
doesn't need to be saved here but there isn't really a better place
for it.
2022-04-19 21:07:13 -04:00
Matt Arsenault 4271ae22be AMDGPU: Remove some unreachable code in WWM pass
Defs must be registers and there's no point to code after
llvm_unreachable.
2022-04-19 21:04:33 -04:00
Matt Arsenault bc7902f148 AMDGPU: Remove unused MachineFunctionInfo fields
These were leftovers from a half-implemented spill to LDS attempt.
2022-04-19 21:04:33 -04:00
Dmitry Preobrazhensky e01dbabdd1 [AMDGPU][MC] Corrected error message "image data size does not match dmask and tfe"
Differential Revision: https://reviews.llvm.org/D123929
2022-04-19 13:52:58 +03:00
Jay Foad f707e1255e [AMDGPU] Select d16 stores even when sramecc is enabled
The sramecc feature changes the behaviour of d16 loads so they do not
preserve the unused 16 bits of the result register, but it has no impact
on d16 stores, so we should make use of them even when the feature is
enabled.

Differential Revision: https://reviews.llvm.org/D104912
2022-04-19 09:34:32 +01:00
Austin Kerbow 7f97ac94f7 Revert "[AMDGPU] Omit unnecessary waitcnt before barriers"
This reverts commit 8d0c34fd4f.
2022-04-18 21:24:08 -07:00
Stanislav Mekhanoshin c1c49a3561 [AMDGPU] Fix comment type in the DSInstructions.td. NFC. 2022-04-18 14:28:12 -07:00
Christudasan Devadasan 34a68037dd [AMDGPU][SIFrameLowering] Refactor custom SGPR spills (NFC).
Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D123666
2022-04-17 13:42:42 +05:30
Johannes Doerfert 3be3b40188 [Attributor][NFCI] Introduce AttributorConfig to bundle all options
Instead of lengthy constructors we can now set the members of a
read-only struct before the Attributor is created. Should make it
clearer what is configurable and also help introducing new options in
the future. This actually added IsModulePass and avoids deduction
through the Function set size. No functional change was intended.
2022-04-15 18:17:19 -05:00
Matt Arsenault df29ec2f54 AMDGPU: Select i8/i16 global and flat atomic load/store
As far as I know these should be atomic anyway, as long as the address
is aligned. Unaligned atomics hit an ugly error in AtomicExpand.
2022-04-14 20:52:05 -04:00
Matt Arsenault c528fbf882 AMDGPU: Fix assert if v_mov_b32_dpp is last instruction in the block
This can happen if the use instruction is a phi.

Fixes issue 49961
2022-04-14 20:21:22 -04:00
Stanislav Mekhanoshin 49b39c4f2e [AMDGPU] Remove redundant RequiredAlignment assignment. NFCI.
Differential Revision: https://reviews.llvm.org/D123699
2022-04-14 02:03:51 -07:00
hsmahesha ea47373af4 [AMDGPU][NFC] Organize code around reserving VGPR32 for AGPR copy.
This is an NFC patch in preparation to fix a bug related to always
reserving VGPR32 for AGPR copy.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D123651
2022-04-14 12:51:33 +05:30
Carl Ritson 35ea326047 [AMDGPU] Try to avoid inserting duplicate s_inst_prefetch
Check for existing s_inst_prefetch instructions when
configuring prefetches during loop alignment.

Reviewed By: rampitec, foad

Differential Revision: https://reviews.llvm.org/D123569
2022-04-14 16:06:24 +09:00
Stanislav Mekhanoshin d951d937a0 [AMDGPU] Increase hazard for store dwordx3/4 to 2 waitstates on gfx940
Fixes: SWDEV-327053

Differential Revision: https://reviews.llvm.org/D123687
2022-04-13 14:21:45 -07:00
Jay Foad ccaf6dabcc [AMDGPU] Initialize a couple more Subtarget fields
This is just for consistency. The fields are never actually used
so it is NFC.
2022-04-13 16:36:10 +01:00
Dmitry Preobrazhensky 5c0bf1303e [AMDGPU][MC][GFX10] Removed unsupported 64bit DPP opcodes
Removed 64bit DPP opcodes from asm matcher tables.

Differential Revision: https://reviews.llvm.org/D123611
2022-04-13 14:43:40 +03:00
Dmitry Preobrazhensky ab18e1a533 [AMDGPU][GFX10] Enabled op_sel for v_add_nc_u16 and v_sub_nc_u16
Differential Revision: https://reviews.llvm.org/D123594
2022-04-13 13:48:42 +03:00
Matt Arsenault c986d476cd AMDGPU: Update reqd-work-group-size optimization for umin intrinsic
This code was pattern matching the ID computation expression as it
appears in the library. This was a compare and select, but now that
umin is canonical, we were no longer matching. Update to match the
intrinsic instead.
2022-04-12 20:03:02 -04:00
Stanislav Mekhanoshin f6462a26f0 [AMDGPU] Split unaligned 4 DWORD DS operations
Similarly to 3 DWORD operations, it is better for performance
to split unaligned operations as long as these are at least
DWORD aligned. Performance data:

```
Using platform: AMD Accelerated Parallel Processing
Using device: gfx900:xnack-

ds_write_b128                      aligned by 16:  4.9 sec
ds_write2_b64                      aligned by 16:  5.1 sec
ds_write2_b32 * 2                  aligned by 16:  5.5 sec
ds_write_b128                      aligned by  1:  8.1 sec
ds_write2_b64                      aligned by  1:  8.7 sec
ds_write2_b32 * 2                  aligned by  1: 14.0 sec
ds_write_b128                      aligned by  2:  8.1 sec
ds_write2_b64                      aligned by  2:  8.7 sec
ds_write2_b32 * 2                  aligned by  2: 14.0 sec
ds_write_b128                      aligned by  4:  5.6 sec
ds_write2_b64                      aligned by  4:  8.7 sec
ds_write2_b32 * 2                  aligned by  4:  5.6 sec
ds_write_b128                      aligned by  8:  5.6 sec
ds_write2_b64                      aligned by  8:  5.1 sec
ds_write2_b32 * 2                  aligned by  8:  5.6 sec
ds_read_b128                       aligned by 16:  3.8 sec
ds_read2_b64                       aligned by 16:  3.8 sec
ds_read2_b32 * 2                   aligned by 16:  4.0 sec
ds_read_b128                       aligned by  1:  4.6 sec
ds_read2_b64                       aligned by  1:  8.1 sec
ds_read2_b32 * 2                   aligned by  1: 14.0 sec
ds_read_b128                       aligned by  2:  4.6 sec
ds_read2_b64                       aligned by  2:  8.1 sec
ds_read2_b32 * 2                   aligned by  2: 14.0 sec
ds_read_b128                       aligned by  4:  4.6 sec
ds_read2_b64                       aligned by  4:  8.1 sec
ds_read2_b32 * 2                   aligned by  4:  4.0 sec
ds_read_b128                       aligned by  8:  4.6 sec
ds_read2_b64                       aligned by  8:  3.8 sec
ds_read2_b32 * 2                   aligned by  8:  4.0 sec

Using platform: AMD Accelerated Parallel Processing
Using device: gfx1030

ds_write_b128                      aligned by 16:  6.2 sec
ds_write2_b64                      aligned by 16:  7.1 sec
ds_write2_b32 * 2                  aligned by 16:  7.6 sec
ds_write_b128                      aligned by  1: 24.1 sec
ds_write2_b64                      aligned by  1: 25.2 sec
ds_write2_b32 * 2                  aligned by  1: 43.7 sec
ds_write_b128                      aligned by  2: 24.1 sec
ds_write2_b64                      aligned by  2: 25.1 sec
ds_write2_b32 * 2                  aligned by  2: 43.7 sec
ds_write_b128                      aligned by  4: 14.4 sec
ds_write2_b64                      aligned by  4: 25.1 sec
ds_write2_b32 * 2                  aligned by  4:  7.6 sec
ds_write_b128                      aligned by  8: 14.4 sec
ds_write2_b64                      aligned by  8:  7.1 sec
ds_write2_b32 * 2                  aligned by  8:  7.6 sec
ds_read_b128                       aligned by 16:  6.2 sec
ds_read2_b64                       aligned by 16:  6.3 sec
ds_read2_b32 * 2                   aligned by 16:  7.5 sec
ds_read_b128                       aligned by  1: 12.5 sec
ds_read2_b64                       aligned by  1: 24.0 sec
ds_read2_b32 * 2                   aligned by  1: 43.6 sec
ds_read_b128                       aligned by  2: 12.5 sec
ds_read2_b64                       aligned by  2: 24.0 sec
ds_read2_b32 * 2                   aligned by  2: 43.6 sec
ds_read_b128                       aligned by  4: 12.5 sec
ds_read2_b64                       aligned by  4: 24.0 sec
ds_read2_b32 * 2                   aligned by  4:  7.5 sec
ds_read_b128                       aligned by  8: 12.5 sec
ds_read2_b64                       aligned by  8:  6.3 sec
ds_read2_b32 * 2                   aligned by  8:  7.5 sec
```
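
To read the table mechanically, the sketch below picks the fastest measured form per alignment from the gfx1030 ds_write rows above. The timings are copied from that listing; the struct and function names are hypothetical, and the result only reflects this one benchmark, not the backend's heuristic:

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Timing {
  unsigned AlignBytes; // alignment of the access in bytes
  std::string Form;    // instruction sequence used
  double Seconds;      // measured time from the listing above
};

// gfx1030 128-bit write timings, copied from the benchmark output.
const std::vector<Timing> Gfx1030Writes = {
    {16, "ds_write_b128", 6.2},  {16, "ds_write2_b64", 7.1},
    {16, "ds_write2_b32 * 2", 7.6},
    {1, "ds_write_b128", 24.1},  {1, "ds_write2_b64", 25.2},
    {1, "ds_write2_b32 * 2", 43.7},
    {2, "ds_write_b128", 24.1},  {2, "ds_write2_b64", 25.1},
    {2, "ds_write2_b32 * 2", 43.7},
    {4, "ds_write_b128", 14.4},  {4, "ds_write2_b64", 25.1},
    {4, "ds_write2_b32 * 2", 7.6},
    {8, "ds_write_b128", 14.4},  {8, "ds_write2_b64", 7.1},
    {8, "ds_write2_b32 * 2", 7.6},
};

// Return the fastest form measured for AlignBytes, or "" if none was measured.
std::string fastestForm(const std::vector<Timing> &Table, unsigned AlignBytes) {
  assert(!Table.empty() && "expected at least one measurement");
  const Timing *Best = nullptr;
  for (const Timing &T : Table)
    if (T.AlignBytes == AlignBytes && (!Best || T.Seconds < Best->Seconds))
      Best = &T;
  return Best ? Best->Form : std::string();
}
```

On these numbers the single b128 write wins at alignments 1, 2 and 16, while splitting wins at alignments 4 and 8, which matches the motivation for splitting accesses that are at least DWORD aligned.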

Differential Revision: https://reviews.llvm.org/D123634
2022-04-12 16:07:13 -07:00
Matt Arsenault 788f94f731 AMDGPU: Don't use unreachable on stores to unhandled address space
For stores to constant address space, this will now consistently hit a
selection error instead of hitting unreachable in an asserts build.

I'm not sure what we should really do here. We could either just
codegen as if it were global, delete the instruction, or declare the
IR invalid (we really should have a target IR verifier to enforce it).
2022-04-12 17:31:50 -04:00
Changpeng Fang 8edaf25986 AMDGPU: Emit metadata for the hidden_multigrid_sync_arg conditionally
Summary:
  Introduce a new function attribute, amdgpu-no-multigrid-sync-arg, which is set by default.
We use implicitarg_ptr + offset to check whether the multigrid synchronization
pointer is used. If yes, we remove this attribute and also remove
amdgpu-no-implicitarg-ptr. We generate metadata for the hidden_multigrid_sync_arg
only when the amdgpu-no-multigrid-sync-arg attribute is removed from the function.

Reviewers: arsenm, sameerds, b-sumner and foad

Differential Revision: https://reviews.llvm.org/D123548
2022-04-12 12:36:30 -07:00
Anshil Gandhi 528aa09010 [AMDGPU][Codegen] Unsupported image sample texture map instructions
Disables image_sample_*_g16 instructions on architectures lacking g16 support. This patch fixes issue 54672.

Differential Revision: https://reviews.llvm.org/D123461
2022-04-12 10:38:59 -06:00
Jay Foad 8a53b25ed5 [AMDGPU] Use default member initializers in Subtarget classes
Use default member initializers in AMDGPUSubtarget and subclasses. This
is to guard against adding a new feature boolean in AMDGPUSubtarget.h
but forgetting to initialize it to false in AMDGPUSubtarget.cpp.

This was mostly autogenerated by:
clang-tidy -checks=-*,cppcoreguidelines-prefer-member-initializer,modernize-use-default-member-init -header-filter=Subtarget -fix lib/Target/AMDGPU/*Subtarget.cpp
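
The pattern being adopted looks roughly like the following. This is a generic sketch with a hypothetical class and hypothetical field names, not the actual AMDGPUSubtarget members:

```cpp
#include <cassert>

// Hypothetical subtarget-like class; the fields are illustrative only.
struct ExampleSubtarget {
  // Default member initializers: every constructor starts from these values,
  // so a newly added flag cannot be left uninitialized by accident.
  bool HasFeatureA = false;
  bool HasFeatureB = false;
  unsigned MaxWidgets = 0;

  ExampleSubtarget() = default;
  // A constructor only needs to mention the members it overrides.
  explicit ExampleSubtarget(bool EnableA) : HasFeatureA(EnableA) {}
};

// Hypothetical check that a default-constructed object has all flags cleared.
void checkDefaults() {
  ExampleSubtarget S;
  assert(!S.HasFeatureA && !S.HasFeatureB && S.MaxWidgets == 0);
}
```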

Differential Revision: https://reviews.llvm.org/D123613
2022-04-12 16:42:30 +01:00
Stanislav Mekhanoshin 3870b36025 [AMDGPU] Split unaligned 3 DWORD DS operations
I have written a minitest to check the performance. Overall,
the benefit of aligned b96 operations on data which is not
known to be aligned but happens to be is small, while the
performance hit of using b96 operations on truly unaligned
memory is high.

The only exception is data that is not even aligned by 4; in
that case it is better to use b96.

Here is the test output on Vega and Navi:

```
Using platform: AMD Accelerated Parallel Processing
Using device: gfx900:xnack-

ds_write_b96                                  aligned: 3.4 sec
ds_write_b32 + ds_write_b64                   aligned: 4.5 sec
ds_write_b32 * 3                              aligned: 4.8 sec
ds_write_b96                          misaligned by 1: 4.8 sec
ds_write_b32 + ds_write_b64           misaligned by 1: 7.2 sec
ds_write_b32 * 3                      misaligned by 1: 10.0 sec
ds_write_b96                          misaligned by 2: 4.8 sec
ds_write_b32 + ds_write_b64           misaligned by 2: 7.2 sec
ds_write_b32 * 3                      misaligned by 2: 10.1 sec
ds_write_b96                          misaligned by 4: 4.8 sec
ds_write_b32 + ds_write_b64           misaligned by 4: 4.2 sec
ds_write_b32 * 3                      misaligned by 4: 4.9 sec
ds_write_b96                          misaligned by 8: 4.8 sec
ds_write_b32 + ds_write_b64           misaligned by 8: 4.6 sec
ds_write_b32 * 3                      misaligned by 8: 4.9 sec
ds_read_b96                                   aligned: 3.3 sec
ds_read_b32 + ds_read_b64                     aligned: 4.9 sec
ds_read_b32 * 3                               aligned: 2.6 sec
ds_read_b96                           misaligned by 1: 4.1 sec
ds_read_b32 + ds_read_b64             misaligned by 1: 7.2 sec
ds_read_b32 * 3                       misaligned by 1: 10.1 sec
ds_read_b96                           misaligned by 2: 4.1 sec
ds_read_b32 + ds_read_b64             misaligned by 2: 7.2 sec
ds_read_b32 * 3                       misaligned by 2: 10.1 sec
ds_read_b96                           misaligned by 4: 4.1 sec
ds_read_b32 + ds_read_b64             misaligned by 4: 2.6 sec
ds_read_b32 * 3                       misaligned by 4: 2.6 sec
ds_read_b96                           misaligned by 8: 4.1 sec
ds_read_b32 + ds_read_b64             misaligned by 8: 4.9 sec
ds_read_b32 * 3                       misaligned by 8: 2.6 sec

Using platform: AMD Accelerated Parallel Processing
Using device: gfx1030

ds_write_b96                                  aligned: 4.1 sec
ds_write_b32 + ds_write_b64                   aligned: 13.0 sec
ds_write_b32 * 3                              aligned: 4.5 sec
ds_write_b96                          misaligned by 1: 12.5 sec
ds_write_b32 + ds_write_b64           misaligned by 1: 22.0 sec
ds_write_b32 * 3                      misaligned by 1: 31.5 sec
ds_write_b96                          misaligned by 2: 12.4 sec
ds_write_b32 + ds_write_b64           misaligned by 2: 22.0 sec
ds_write_b32 * 3                      misaligned by 2: 31.5 sec
ds_write_b96                          misaligned by 4: 12.4 sec
ds_write_b32 + ds_write_b64           misaligned by 4: 4.0 sec
ds_write_b32 * 3                      misaligned by 4: 4.5 sec
ds_write_b96                          misaligned by 8: 12.4 sec
ds_write_b32 + ds_write_b64           misaligned by 8: 13.0 sec
ds_write_b32 * 3                      misaligned by 8: 4.5 sec
ds_read_b96                                   aligned: 3.8 sec
ds_read_b32 + ds_read_b64                     aligned: 12.8 sec
ds_read_b32 * 3                               aligned: 4.4 sec
ds_read_b96                           misaligned by 1: 10.9 sec
ds_read_b32 + ds_read_b64             misaligned by 1: 21.8 sec
ds_read_b32 * 3                       misaligned by 1: 31.5 sec
ds_read_b96                           misaligned by 2: 10.9 sec
ds_read_b32 + ds_read_b64             misaligned by 2: 21.9 sec
ds_read_b32 * 3                       misaligned by 2: 31.5 sec
ds_read_b96                           misaligned by 4: 10.9 sec
ds_read_b32 + ds_read_b64             misaligned by 4: 3.8 sec
ds_read_b32 * 3                       misaligned by 4: 4.5 sec
ds_read_b96                           misaligned by 8: 10.9 sec
ds_read_b32 + ds_read_b64             misaligned by 8: 12.8 sec
ds_read_b32 * 3                       misaligned by 8: 4.5 sec
```

Fixes: SWDEV-330802

Differential Revision: https://reviews.llvm.org/D123524
2022-04-12 07:52:39 -07:00
Stanislav Mekhanoshin b8e09f1553 [AMDGPU] Refactor LDS alignment checks.
Move features/bugs checks into the single place
allowsMisalignedMemoryAccessesImpl.

This is mostly NFCI except for the order of selection in a couple of places.
A separate change may be needed to stop lying about Fast.

Differential Revision: https://reviews.llvm.org/D123343
2022-04-12 07:49:40 -07:00
Carl Ritson 2bca7d859a [AMDGPU] Graceful abort for waterfalls in SIOptimizeVGPRLiveRange
If the CFG structure of a waterfall loop is not the expected shape
then gracefully abort traversing the IR for the given loop.
This applies to nested waterfall loops which are not supported by
the VGPR live range optimizer.

Reviewed By: ruiling

Differential Revision: https://reviews.llvm.org/D123480
2022-04-12 14:15:44 +09:00
Matt Arsenault 463bc93e5f AMDGPU/GlobalISel: Remove unused parameter 2022-04-11 19:43:37 -04:00
Matt Arsenault 203a1e36ed Reapply "AMDGPU: Remove AMDGPUFixFunctionBitcasts pass"
This reverts commit 8a85be807b.

The unrelated failure this exposed was fixed.
2022-04-11 19:43:37 -04:00
Changpeng Fang 7f9868f9b7 AMDGPU: Align the implicit kernel argument segment to 8 bytes for v5
Summary:
  In emitting metadata for implicit kernel arguments, we need to be in sync with the actual loads
to align the implicit kernel argument segment to an 8-byte boundary. In this work, we simply force
this alignment through the first implicit argument.
In addition, we don't emit metadata for any implicit kernel argument if none of them is actually used.

Reviewers: arsenm, b-sumner

Differential Revision: https://reviews.llvm.org/D123346
2022-04-11 16:12:39 -07:00
Nicolai Hähnle 4df4922da6 AMDGPU/SDAG: Custom SETCC (i.e. ballot) is always uniform
The AMDGPUISD::SETCC node is like ISD::SETCC, but returns a lane mask
instead of a per-lane boolean. The lane mask is uniform.

This improves instruction selection for code patterns like
ctpop(ballot(x)), which can now use an S_BCNT1_* instruction instead
of V_BCNT_*.

GlobalISel already selects scalar instructions (an earlier commit
added a test case).

Differential Revision: https://reviews.llvm.org/D123432
2022-04-11 14:04:21 -05:00
Stanislav Mekhanoshin fced87d457 [AMDGPU] Fix regression with vectorization limiting
D67148 has removed TTI::getNumberOfRegisters(bool Vector) and
started to call TTI::getNumberOfRegisters(unsigned ClassID) from
the LoopVectorize. This has resulted in an unrestricted vectorization
on AMDGPU blowing up register pressure.

Differential Revision: https://reviews.llvm.org/D122850
2022-04-08 17:46:49 -07:00
Vang Thao 311edc6b5b [AMDGPU] Enable PreRARematerialize scheduling pass with multiple high RP regions
Enable the PreRARematerialize pass when there are multiple high RP scheduling
regions present. Require the occupancy in all high RP regions be improved
before finalizing sinking. If any high RP region did not improve in occupancy
then un-do all sinking and restore the state to before the pass.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D122501
2022-04-08 13:08:32 -07:00
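
The all-or-nothing rule described in the commit above can be illustrated with a small sketch. This is only a toy model under stated assumptions: the function name and its callbacks (`sink_and_measure`, `undo`) are hypothetical and do not correspond to the real scheduler API, and occupancy is modeled as a plain integer per region.

```python
from typing import Callable, Dict


def rematerialize_if_all_improve(
    occupancy_before: Dict[str, int],
    sink_and_measure: Callable[[], Dict[str, int]],
    undo: Callable[[], None],
) -> bool:
    """Keep the sinking only if every high-RP region gains occupancy.

    All names here are hypothetical; this mirrors the rule stated in the
    commit message, not the actual pass implementation.
    """
    occupancy_after = sink_and_measure()
    improved = all(
        occupancy_after[region] > occupancy_before[region]
        for region in occupancy_before
    )
    if not improved:
        # At least one region did not improve: restore the previous state.
        undo()
    return improved


# Usage: region "B" does not improve, so the sinking is rolled back.
state = {"sunk": False}


def sink_and_measure() -> Dict[str, int]:
    state["sunk"] = True
    return {"A": 5, "B": 4}


def undo() -> None:
    state["sunk"] = False


kept = rematerialize_if_all_improve({"A": 4, "B": 4}, sink_and_measure, undo)
```

Here `kept` ends up false and `state["sunk"]` is reset, which matches the "un-do all sinking" behaviour the message describes.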
Vang Thao cd1071171c [AMDGPU] Fix inline asm causing assert during PreRARematerialize stage in scheduler pass
Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D123348
2022-04-08 09:22:32 -07:00
Christudasan Devadasan 2c46d067e1 [AMDGPU][SIMachineFunctionInfo] Code cleanup (NFC). 2022-04-08 19:42:48 +05:30
Abinav Puthan Purayil b536f24d22 [AMDGPU] Use GCNPat in the buffer atomic pattern multiclasses 2022-04-08 16:28:11 +05:30
Thomas Symalla 6d97ca690c [AMDGPU] Increase detection range for s_mov, v_cmpx transformation.
We found that it might be beneficial to have the SIOptimizeExecMasking
pass detect more cases where v_cmp, s_and_saveexec patterns can be
transformed to s_mov, v_cmpx patterns. Currently, the search range
for finding a fitting v_cmp instruction is 5; this patch doubles it
to 10.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D123367
2022-04-08 12:47:24 +02:00
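
The effect of the wider search range in the commit above can be sketched as a bounded backward scan. This is a simplified illustration, not the pass itself: instructions are modeled as plain strings, and the helper name `find_v_cmp` and its `max_lookback` parameter are assumptions made for this example.

```python
from typing import List, Optional


def find_v_cmp(instrs: List[str], saveexec_idx: int,
               max_lookback: int = 10) -> Optional[int]:
    """Scan backwards from s_and_saveexec for a v_cmp within the range.

    Hypothetical helper: returns the index of the nearest preceding
    v_cmp_* instruction, or None if none lies within max_lookback.
    """
    lowest = max(0, saveexec_idx - max_lookback)
    for i in range(saveexec_idx - 1, lowest - 1, -1):
        if instrs[i].startswith("v_cmp_"):
            return i
    return None


# A v_cmp that sits 8 instructions before the s_and_saveexec.
block = ["v_cmp_eq_u32"] + ["s_nop"] * 7 + ["s_and_saveexec_b32"]
with_old_range = find_v_cmp(block, 8, max_lookback=5)
with_new_range = find_v_cmp(block, 8, max_lookback=10)
```

With a range of 5 the scan gives up before reaching the comparison, while a range of 10 finds it at index 0.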
Evgeniy Brevnov da41214d65 Add support for atomic memory copy lowering
Currently, the utility supports lowering of non-atomic memory transfer routines only. This patch adds support for the atomic version of memcopy. This may be useful for targets not supporting atomic memcopy.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D118443
2022-04-08 10:41:31 +07:00
Stanislav Mekhanoshin 16cf9e6dad [AMDGPU] Fix handling of gfx10 LDS misaligned access bug
It was only handled for FLAT initially because we did not have
unaligned DS instructions lowering. Now it is implemented but
the bug is not handled.

Differential Revision: https://reviews.llvm.org/D123338
2022-04-07 15:08:29 -07:00
Stanislav Mekhanoshin e66f0edb40 [AMDGPU] Split unaligned LDS access instead of scalarizing
There is no need to fully scalarize an unaligned operation in
some cases; just split it to alignment.

Differential Revision: https://reviews.llvm.org/D123330
2022-04-07 14:27:37 -07:00
Changpeng Fang 6733590db2 AMDGPU: Set implicit kernarg size to be of 256 bytes for code object version 5
Summary:
  If implicitarg_ptr intrinsic is not used, set implicit kernarg size to 0, otherwise
set it to 256 bytes for code object version 5 (and beyond).

Reviewers: arsenm

Differential Revision: https://reviews.llvm.org/D123262
2022-04-07 08:35:23 -07:00
Dmitry Preobrazhensky 1f6aa90386 [AMDGPU][MC][GFX10] Added syntactic sugar for s_waitcnt_depctr operand
Added the following helpers:

    depctr_hold_cnt(...)
    depctr_sa_sdst(...)
    depctr_va_vdst(...)
    depctr_va_sdst(...)
    depctr_va_ssrc(...)
    depctr_va_vcc(...)
    depctr_vm_vsrc(...)

Differential Revision: https://reviews.llvm.org/D123022
2022-04-07 17:03:44 +03:00
Matt Arsenault e6012c8e0f AMDGPU: Handle private atomics
Use new NotAtomic expansion to turn these into the equivalent
non-atomic operations. Independent lanes cannot access the private
memory of other lanes, so there's no possibility for synchronization.

These don't really appear directly in user code, but
InferAddressSpaces can make these appear after optimizations.

Fixes issues 54693 and 54274.
2022-04-06 22:47:19 -04:00
Stanislav Mekhanoshin a41a676e8a [AMDGPU] Check SI LDS offset bug in the allowsMisalignedMemoryAccesses
Differential Revision: https://reviews.llvm.org/D123268
2022-04-06 18:05:02 -07:00
Craig Topper 1235aaefbd [AArch64][AMDGPU][WebAssembly] Use static_cast instead of a reinterpret_cast to downcast in parseMachineFunctionInfo. NFC
static_cast is a little safer here since the compiler will
ensure we're casting to a class derived from
yaml::MachineFunctionInfo.

I believe this first appeared on AMDGPU and was copied to the
other two targets.

Spotted when it was being copied to RISCV in D123178.

Differential Revision: https://reviews.llvm.org/D123260
2022-04-06 15:09:18 -07:00
Jay Foad 538c77172a [AMDGPU] Fix unused variable warning after D117484 2022-04-06 14:45:38 +01:00
Matt Arsenault 54c525fc53 AMDGPU/GlobalISel: Handle legacy grid ID intrinsics
Handle the llvm.r600.* intrinsics which are still in use in libclc. I
thought it would be possible to switch it to using
llvm.amdgcn.implicitarg.ptr already, but it turns out the implicit
arguments are currently split into a piece before and after the
explicit kernel arguments.
2022-04-05 22:01:31 -04:00
Matt Arsenault 6071c92768 AMDGPU: Fix LiveVariables error after lowering SI_END_CF
This wasn't accounting for the block change in updating LiveVariables.
2022-04-05 21:57:50 -04:00
Scott Linder 09f33a430b [AMDGPU][OpenCL] Remove "printf and hostcall" diagnostic
The diagnostic is unreliable, and triggers even for dead uses of
hostcall that may exist when linking the device-libs at lower
optimization levels.

Eliminate the diagnostic, and directly document the limitation for
OpenCL before code object V5.

Make some NFC changes to clarify the related code in the
MetadataStreamer.

Add a clang test to tie OCL sources containing printf to the backend IR
tests for this situation.

Reviewed By: sameerds, arsenm, yaxunl

Differential Revision: https://reviews.llvm.org/D121951
2022-04-05 19:10:23 +00:00
Vang Thao 45c2371c0d [AMDGPU] Ignore debug use during PreRARematerialize stage in scheduling pass
Ignore all debug uses when collecting trivially rematerializable defs. This fixes an issue with difference in codegen when enabling debug info.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D123048
2022-04-04 11:15:06 -07:00
Jay Foad c246b7bd4a [AMDGPU] Only count global-to-global as indirect accesses
Previously any load (global, local or constant) feeding into a
global load or store would be counted as an indirect access. This
patch only counts global loads feeding into a global load or store.
The rationale is that the latency for global loads is generally
much larger than the other kinds.

As a side effect this makes it easier to write small kernel test
cases that are not counted as having indirect accesses, despite
the fact that arguments to the kernel are accessed with an SMEM
load.

Differential Revision: https://reviews.llvm.org/D122804
2022-04-01 13:48:13 +01:00
Thomas Symalla 1a6aa8b195 [AMDGPU] Add missing use check in SIOptimizeExecMasking pass.
Whenever a v_cmp, s_and_saveexec instruction sequence shall be
transformed to an equivalent s_mov, v_cmpx sequence, it needs
to be detected if the v_cmp target register is used between
the two instructions as the v_cmp result gets omitted by
using the v_cmpx instruction, resulting in invalid code.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D122797
2022-03-31 19:25:35 +02:00
Abinav Puthan Purayil 898d5776ec [AMDGPU][GlobalISel] Scalarize add/sub with overflow ops in the legalizer
Differential Revision: https://reviews.llvm.org/D122803
2022-03-31 21:46:34 +05:30
Changpeng Fang 1711020c37 AMDGPU: Use isLiteralConstantLike to check whether the operand could ever be literal
Summary:
  To compute the size of a VALU/SALU instruction, we need to check whether an operand
could ever be literal. Previously isLiteralConstant was used, which missed cases
like global variables or external symbols. These misses lead to under-estimation of
the instruction size and branch offset, and thus incorrectly skip the necessary branch
relaxation when the branch offset is actually greater than what the branch bits can hold.
In this work, we use isLiteralConstantLike to check the operands. It may be conservative,
but it is safe.

Reviewers: arsenm

Differential Revision: https://reviews.llvm.org/D122778
2022-03-31 08:06:31 -07:00
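
The size under-estimation described in the commit above can be shown with a toy model. Everything here is an assumption made for illustration: operands are `(kind, value)` pairs, the 4-byte base encoding and 4-byte trailing literal are simplified sizes, and the two Python predicates only loosely mimic the strict and conservative checks named in the message.

```python
# Toy model: each operand is a (kind, value) pair. Sizes are illustrative.
BASE_BYTES = 4      # assumed size of the base encoding
LITERAL_BYTES = 4   # assumed extra size when a literal follows the encoding


def is_literal_constant(op):
    """Strict check: only known immediates outside the inline range."""
    kind, value = op
    return kind == "imm" and not (-16 <= value <= 64)


def is_literal_constant_like(op):
    """Conservative check: also counts operands that may become literals."""
    kind, _ = op
    return kind in ("global", "symbol") or is_literal_constant(op)


def estimated_size(operands, check):
    """Estimate the instruction size in bytes using the given predicate."""
    needs_literal = any(check(op) for op in operands)
    return BASE_BYTES + (LITERAL_BYTES if needs_literal else 0)


# An instruction that references a global variable.
ops = [("reg", "s0"), ("global", "@var")]
strict = estimated_size(ops, is_literal_constant)
conservative = estimated_size(ops, is_literal_constant_like)
```

In this model the strict check yields 4 bytes and the conservative check yields 8. Summed over many instructions, the smaller estimate could make a branch offset look as if it fits when it does not, which is the missed relaxation the message describes.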
Abinav Puthan Purayil acf83abcbf [AMDGPU][GlobalISel] Remove unused variable. NFC. 2022-03-31 16:50:34 +05:30
Stanislav Mekhanoshin f311f934e1 [AMDGPU] gfx940 VALU hazard recognizer
Differential Revision: https://reviews.llvm.org/D122339
2022-03-29 10:57:54 -07:00
Shao-Ce SUN 662b9fa02c [NFC][CodeGen] Add a setTargetDAGCombine use ArrayRef
Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D122557
2022-03-29 09:53:24 +08:00
Changpeng Fang 8384ced974 [AMDGPU][NFC]: Remove unnecessary MFI functions
Summary:
  hasHostcallPtr() and hasHeapPtr() are only used in metadata emit.
However, we can use the corresponding function attributes directly
instead of introducing the functions.

Reviewers: arsenm

Differential Revision: https://reviews.llvm.org/D122600
2022-03-28 12:13:33 -07:00
Thomas Symalla 3bd15c03c6 [AMDGPU] Fix adding modifiers when creating v_cmpx instructions.
Revision https://reviews.llvm.org/D122332 added a pattern transformation
where v_cmpx instructions are introduced. However, the modifiers are
not correctly inherited from the original operands. The patch
adds the source modifiers, if they exist, or sets them to 0.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D122489
2022-03-28 17:52:53 +02:00
Carl Ritson 1f52d02ceb [AMDGPU] Split waterfall loop exec manipulation
Split waterfall loops into multiple blocks so that exec mask
manipulation (s_and_saveexec) does not occur in the middle of
a block.

VGPR live range optimizer is updated to handle waterfall loops
spanning multiple blocks.

Reviewed By: ruiling

Differential Revision: https://reviews.llvm.org/D122200
2022-03-28 17:44:54 +09:00
Kazu Hirata 6212871968 [Target] Apply clang-tidy fixes for readability-redundant-member-init (NFC) 2022-03-27 22:22:37 -07:00
Maksim Panchenko 4ae9745af1 [Disassember][NFCI] Use strong type for instruction decoder
All LLVM backends use MCDisassembler as a base class for their
instruction decoders. Use "const MCDisassembler *" for the decoder
instead of "const void *". Remove unnecessary static casts.

Reviewed By: skan

Differential Revision: https://reviews.llvm.org/D122245
2022-03-25 18:53:59 -07:00
Jay Foad be9acee059 [AMDGPU] Move VOP3 classes into VOPInstructions.td. NFC.
These classes are also used by VOP1/2/C instructions.

Differential Revision: https://reviews.llvm.org/D122470
2022-03-25 13:56:43 +00:00
Thomas Symalla 718aec209c [AMDGPU] Improve v_cmpx usage on GFX10.3.
On GFX10.3 targets, the following instruction sequence

v_cmp_* SGPR, ...
s_and_saveexec ..., SGPR

leads to a fairly long stall caused by a VALU write to a SGPR and having the
following SALU wait for the SGPR.

An equivalent sequence is to save the exec mask manually instead of letting
s_and_saveexec do the work and use a v_cmpx instruction instead to do the
comparison.

This patch modifies the SIOptimizeExecMasking pass as this is the last position
where s_and_saveexec instructions are inserted. It does the transformation by
trying to find the pattern, extracting the operands and generating the new
instruction sequence.

It also changes some existing lit tests and introduces a few new tests to show
the changed behavior on GFX10.3 targets.

Same as D119696 including a buildbot and MIR test fix.

Reviewed By: critson

Differential Revision: https://reviews.llvm.org/D122332
2022-03-25 11:40:18 +01:00
Stanislav Mekhanoshin 64838ba365 [AMDGPU] Use GenericTable to classify DGEMM
Since a table was introduced for MAI instructions, extend it
to classify DGEMM as well.

Differential Revision: https://reviews.llvm.org/D122337
2022-03-24 13:00:37 -07:00
Stanislav Mekhanoshin cad9de71d7 [AMDGPU] gfx940 MAI hazard recognizer
Differential Revision: https://reviews.llvm.org/D122263
2022-03-24 12:59:52 -07:00
Stanislav Mekhanoshin 6e3e14f600 [AMDGPU] Support gfx940 smfmac instructions
Differential Revision: https://reviews.llvm.org/D122191
2022-03-24 12:40:42 -07:00
Stanislav Mekhanoshin 27439a7642 [AMDGPU] New gfx940 mfma instructions
Differential Revision: https://reviews.llvm.org/D122044
2022-03-24 12:12:52 -07:00
Vasileios Porpodas 39aa202aff Recommit "[SLP] Fix lookahead operand reordering for splat loads." attempt 3, fixed assertion crash.
Original review: https://reviews.llvm.org/D121354

This reverts commit e6ead19b77.
2022-03-23 18:32:17 -07:00
Austin Kerbow 1e15adba62 [AMDGPU] Add s_nop WaitStates between neighboring mfma
In some cases padding bubbles between sequential MFMA instructions may
lead to increased inter-wave performance. Add option to request to pad
some portion of these stall cycles with s_nops.

Fixes: SWDEV-326925

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D121437
2022-03-23 13:56:09 -07:00
Arthur Eubanks e6ead19b77 Revert "Recommit "[SLP] Fix lookahead operand reordering for splat loads." attempt 2, fixed assertion crash."
This reverts commit 27bd8f9492.

Causes crashes, see comments in D121973
2022-03-23 10:57:45 -07:00
hsmahesha f5b6866d7e [AMDGPU] Add missing testcase for SGPR to AGPR copy
and, also update the function indirectCopyToAGPR() to ensure that it is called only on GFX908 sub-target.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D122286
2022-03-23 21:38:04 +05:30
hsmahesha f014303e2c [AMDGPU] [NFC]: Organize the code around reserving registers.
First, add code to reserve all required special purpose registers,
followed by code to reserve SGPRs, followed by code to reserve
VGPRs/AGPRs.

This patch is prepared as a pre-requisite to fix an issue related to
GFX90A hardware.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D122219
2022-03-23 07:15:59 +05:30
Vasileios Porpodas 27bd8f9492 Recommit "[SLP] Fix lookahead operand reordering for splat loads." attempt 2, fixed assertion crash.
Original review: https://reviews.llvm.org/D121354

This reverts commit f7d7d2a08d.
2022-03-22 16:41:55 -07:00
Stanislav Mekhanoshin 72c1a0d9c2 [AMDGPU] Allow v_accvgpr_write to use SGPR on gfx90a
This is undocumented, but it should work.

Differential Revision: https://reviews.llvm.org/D122252
2022-03-22 13:52:29 -07:00
Arthur Eubanks f7d7d2a08d Revert "Recommit "[SLP] Fix lookahead operand reordering for splat loads.""
This reverts commit 79613185d3.

Causes crashes, see comments in https://reviews.llvm.org/D121973.
2022-03-22 13:33:49 -07:00
alex-t 7636c9a929 [AMDGPU] use scalar shift for SALU users in frame index elimination
In the frame index lowering we have to insert shift and add
instructions to adjust stack object access.  We need to take care of the stack
object user kind and use scalar shift/add for scalar users.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D121524
2022-03-22 13:16:24 +01:00
alex-t 0a488cba2c [AMDGPU] use scalar shift for SALU users in frame index elimination
In the frame index lowering we have to insert shift and add
instructions to adjust stack object access.  We need to take care of the stack
object user kind and use scalar shift/add for scalar users.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D121524
2022-03-22 11:43:23 +01:00
Vasileios Porpodas 79613185d3 Recommit "[SLP] Fix lookahead operand reordering for splat loads."
Original review: https://reviews.llvm.org/D121354

The original commit 9136145eb0 broke the build on several targets.

Differential Revision: https://reviews.llvm.org/D121973
2022-03-21 15:57:32 -07:00
Stanislav Mekhanoshin 9b1fa6f89f [AMDGPU] Fix AV classes VTs. NFCI.
NFC at this point, but will be used at a later patch.

Differential Revision: https://reviews.llvm.org/D122174
2022-03-21 13:38:05 -07:00
alex-t a0ea7ec90f [AMDGPU] divergence patterns for the BUILD_VECTOR i16, undef expansion.
BUILD_VECTOR of i16 and undef gets expanded to the COPY_TO_REGCLASS.
The latter is further lowered to the copy instructions.
We need to provide the correct register class for the uniform and divergent BUILD_VECTOR nodes
to avoid VGPR to SGPR copies.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D122068
2022-03-21 21:11:20 +01:00
Dmitry Preobrazhensky 1d817a1448 [AMDGPU][MC][NFC] Refactored sendmsg(...) handling
Differential Revision: https://reviews.llvm.org/D121995
2022-03-21 15:37:30 +03:00
Jay Foad 321d3aae7c [AMDGPU] SIInstrInfo::verifyInstruction tweaks. NFCI.
Simplify some for loops. Don't bother checking src2 operand for
writelane because it doesn't have one. Check all VALU instructions,
not just VOP1/2/3/C/SDWA.
2022-03-21 11:15:55 +00:00
Thomas Symalla 7de6107dce Revert "[AMDGPU] Improve v_cmpx usage on GFX10.3."
This reverts commit 011c64191e and
e725e2afe0.

Differential Revision: https://reviews.llvm.org/D122117
2022-03-21 09:50:44 +01:00
Thomas Symalla e725e2afe0 [AMDGPU] [NFC] Fix missing include. 2022-03-21 09:37:22 +01:00
Thomas Symalla 011c64191e [AMDGPU] Improve v_cmpx usage on GFX10.3.
On GFX10.3 targets, the following instruction sequence

v_cmp_* SGPR, ...
s_and_saveexec ..., SGPR

leads to a fairly long stall caused by a VALU write to a SGPR and having the
following SALU wait for the SGPR.

An equivalent sequence is to save the exec mask manually instead of letting
s_and_saveexec do the work and use a v_cmpx instruction instead to do the
comparison.

This patch modifies the SIOptimizeExecMasking pass as this is the last position
where s_and_saveexec instructions are inserted. It does the transformation by
trying to find the pattern, extracting the operands and generating the new
instruction sequence.

It also changes some existing lit tests and introduces a few new tests to show
the changed behavior on GFX10.3 targets.

Reviewed By: sebastian-ne, critson

Differential Revision: https://reviews.llvm.org/D119696
2022-03-21 09:31:59 +01:00
Jon Chesterfield bcbd4cf1f2 Revert "[amdgpu][nfc] Pass function instead of module to allocateModuleLDSGlobal"
Reconsidered, better to handle per-function state in the constructor as before.
This reverts commit 98e474c1b3.
2022-03-20 00:58:26 +00:00
Jon Chesterfield 98e474c1b3 [amdgpu][nfc] Pass function instead of module to allocateModuleLDSGlobal 2022-03-19 16:42:17 +00:00
Stanislav Mekhanoshin e9a49c6483 [AMDGPU] gfx940 basic speed model
This is incomplete and will handle more instructions as they are added.

Differential Revision: https://reviews.llvm.org/D121966
2022-03-18 13:19:47 -07:00
Stanislav Mekhanoshin 4570527e72 [AMDGPU] Disable some MFMA instructions on gfx940
Differential Revision: https://reviews.llvm.org/D121956
2022-03-18 13:19:12 -07:00
Stanislav Mekhanoshin 0a79e1f30a [AMDGPU] reuse blgp as neg in 2 mfma operations on gfx940
GFX940 repurposes BLGP as NEG only in DGEMM MFMA.

Differential Revision: https://reviews.llvm.org/D121745
2022-03-18 12:56:51 -07:00
Abinav Puthan Purayil aee3684995 [AMDGPU] Use COPY_TO_REGCLASS for buffer_atomic_cmpswap selection
GlobalISel was selecting the av_* regclass for some cases.

Differential Revision: https://reviews.llvm.org/D121933
2022-03-18 08:56:23 +05:30
Changpeng Fang dd5895cc39 AMDGPU: Use the implicit kernargs for code object version 5
Summary:
  Specifically, for trap handling, for targets that do not support getDoorbellID,
we load the queue_ptr from the implicit kernarg, and move queue_ptr to s[0:1].
To get aperture bases when targets do not have aperture registers, we load
private_base or shared_base directly from the implicit kernarg. In clang, we use
implicitarg_ptr + offsets to implement __builtin_amdgcn_workgroup_size_{xyz}.

Reviewers: arsenm, sameerds, yaxunl

Differential Revision: https://reviews.llvm.org/D120265
2022-03-17 14:12:36 -07:00
Stanislav Mekhanoshin d9ac55fab2 [AMDGPU] New MFMA names for existing instructions
Old names are supported as aliases.
_1k MFMA got new opcodes.

Differential Revision: https://reviews.llvm.org/D121741
2022-03-17 13:05:36 -07:00
Stanislav Mekhanoshin 522b259976 [AMDGPU] Allow v_accvgpr_write to use SGPR src on gfx940
Differential Revision: https://reviews.llvm.org/D121843
2022-03-17 12:12:06 -07:00
Vang Thao 27e1931508 [AMDGPU] Fix PreRARematerialize scheduler pass sinking subreg defs
When collecting trivially rematerializable defs, skip any subreg defs. We do not want to sink these.

Differential Revision: https://reviews.llvm.org/D121874
2022-03-17 11:38:53 -07:00
Jay Foad 313f306b26 [AMDGPU] Stop using getMinimalPhysRegClass in LowerFormalArguments
NFCI. The motivation for this is to avoid problems in future if we add new
classes containing only a subset of all VGPRs, or a subset of all SGPRs.
getMinimalPhysRegClass would favour these smaller classes, which is not
what we want here.

Differential Revision: https://reviews.llvm.org/D121914
2022-03-17 15:19:17 +00:00
Dmitry Preobrazhensky 9c632b61eb [AMDGPU][MC] A fix for commit 5977dfb
The code in commit 5977dfba64
failed to compile with GCC5. This patch addresses the issue.
For a related discussion, see https://reviews.llvm.org/D121696
2022-03-17 14:41:21 +03:00
Abinav Puthan Purayil f59cb41ba1 [AMDGPU] Select buffer_atomic_cmpswap* in tblgen
This change replaces the manual selection of buffer_atomic_cmpswap*
instructions in SelectionDAG and GlobalISel with a tblgen based
selection in BUFInstructions.td. This allows us to select the return and
no-return variants in tblgen.

Differential Revision: https://reviews.llvm.org/D121770
2022-03-17 10:12:32 +05:30
Christudasan Devadasan 6dd21d1db1 [AMDGPU][SIFoldOperands] Consider the alignment constraints
Enforced an alignment check while folding the operands.
2022-03-17 08:27:53 +05:30
Christudasan Devadasan af717d4aca [AMDGPU][MachineVerifier] Alignment check for fp32 packed math instructions
The fp32 packed math instructions are introduced in gfx90a.
If their vector register operands are not properly aligned, the
verifier should flag them. Currently, the verifier fails to
report it and the compiler ends up emitting broken assembly.
This patch fixes that missed case in TII::verifyInstruction.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D121794
2022-03-17 08:21:35 +05:30
Jay Foad fb8d23b8e7 [AMDGPU] Define new feature HasFlatScratchSVSMode. NFC.
This is by analogy with HasFlatScratchSTMode and is slightly more
informative than using isGFX940Plus.

Differential Revision: https://reviews.llvm.org/D121804
2022-03-16 19:54:02 +00:00
Joe Nash fb81f06f63 [AMDGPU] Calculate RegWidth in bits in AsmParser
NFC. Switch from calculations based on dwords to bits, to be more
flexible.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D121730
2022-03-16 10:52:14 -04:00
Dmitry Preobrazhensky 5977dfba64 [AMDGPU][MC][NFC] Refactored custom operands handling
The original design of custom operands support assumed that most GPUs
have the same or very similar operand names and encodings. This is
no longer the case. As a result the support code becomes over-complicated
and difficult to maintain.

This change implements a different design with the following benefits:

- support of aliases;
- support of operands with overlapped encodings;
- identification of defined but unsupported operands.

Differential Revision: https://reviews.llvm.org/D121696
2022-03-16 16:04:55 +03:00
Shengchen Kan 37b378386e [NFC][CodeGen] Rename some functions in MachineInstr.h and remove duplicated comments 2022-03-16 20:25:42 +08:00
serge-sans-paille 989f1c72e0 Cleanup codegen includes
This is a (fixed) recommit of https://reviews.llvm.org/D121169

after:  1061034926
before: 1063332844

Discourse thread: https://discourse.llvm.org/t/include-what-you-use-include-cleanup
Differential Revision: https://reviews.llvm.org/D121681
2022-03-16 08:43:00 +01:00
Joe Nash 5464fd36ba [AMDGPU] Fix typo consecutive in GCNNSAReassign 2022-03-15 12:42:52 -04:00
Pavel Labath 991dc4b4e0 Remove a top-level "using namespace" in TargetTransformInfoImpl.h
Avoids polluting the namespace of all files including the header.
2022-03-15 13:49:20 +01:00
Ruiling Song 98dd390573 AMDGPU: Use removeAllRegUnitsForPhysReg()
I met the issue here when working on something else.
Actually we have already reserved EXEC, but it looks
like the register coalescer is causing the sub-register
of EXEC to appear in LiveIntervals. I have not looked
deeper into why the register coalescer has such behavior, but
removeAllRegUnitsForPhysReg() is the right way.

Reviewed By: critson, foad, arsenm

Differential Revision: https://reviews.llvm.org/D117014
2022-03-15 10:28:27 +08:00
Stanislav Mekhanoshin c4500de255 [AMDGPU] gfx940: disable OP_SEL on V_DOT instructions
Differential Revision: https://reviews.llvm.org/D121634
2022-03-14 17:02:00 -07:00
Stanislav Mekhanoshin 8dd3d1cf1f [AMDGPU] Add symbolic names for gfx940 HWREGs
The namespace of HWREGs now overlaps with gfx10. Thus the
patch is longer than necessary to just support new names. It also
needs to handle proper error messages, i.e. to issue a "specified
hardware register is not supported on this GPU" message.

This may need a major refactoring in the future.

Differential Revision: https://reviews.llvm.org/D121418
2022-03-14 16:13:33 -07:00
Stanislav Mekhanoshin 23499103f7 [AMDGPU] Support for gfx940 flat lds opcodes
Differential Revision: https://reviews.llvm.org/D121414
2022-03-14 15:46:19 -07:00
Stanislav Mekhanoshin 1f53f20fc1 [AMDGPU] Support gfx940 v_lshl_add_u64 instruction
Differential Revision: https://reviews.llvm.org/D121401
2022-03-14 15:45:42 -07:00
Stanislav Mekhanoshin 36fe3f13a9 [AMDGPU] flat scratch SVS addressing mode for gfx940
Both VADDR and SADDR are used in SVS mode.

Differential Revision: https://reviews.llvm.org/D121254
2022-03-14 15:23:36 -07:00
Stanislav Mekhanoshin 47bac63d3f [AMDGPU] gfx940 memory model
Differential Revision: https://reviews.llvm.org/D121242
2022-03-14 15:01:46 -07:00
Stanislav Mekhanoshin 72a9e5f891 [AMDGPU] Restrict machine copy propagation from creating unaligned classes
Fixes: SWDEV-326366

Differential Revision: https://reviews.llvm.org/D121491
2022-03-14 14:09:40 -07:00
Austin Kerbow 62bcfcb5a5 [AMDGPU] Add llvm.amdgcn.s.setprio intrinsic
Reviewed By: rampitec, arsenm

Differential Revision: https://reviews.llvm.org/D120976
2022-03-12 22:15:42 -08:00
serge-sans-paille ed98c1b376 Cleanup includes: DebugInfo & CodeGen
Discourse thread: https://discourse.llvm.org/t/include-what-you-use-include-cleanup
Differential Revision: https://reviews.llvm.org/D121332
2022-03-12 17:26:40 +01:00
Stanislav Mekhanoshin 31f215ab0c [AMDGPU] Support v_mov_b64 in dpp combine
Differential Revision: https://reviews.llvm.org/D121411
2022-03-11 11:37:32 -08:00
Stanislav Mekhanoshin 6181458662 [AMDGPU] gfx940 MUBUF format changes
Differential Revision: https://reviews.llvm.org/D121234
2022-03-11 11:36:49 -08:00
Nikita Popov 3ed643ea76 [AMDGPUPromoteAlloca] Make compatible with opaque pointers
This mainly changes the handling of bitcasts to not check the types
being casted from/to -- we should only care about the actual
load/store types. The GEP handling is also changed to not care about
types, and just make sure that we get an offset corresponding to
a vector element.

This was a bit of a struggle for me, because this code seems to be
pretty sensitive to small changes. The end result seems to produce
strictly better results for the existing test coverage though,
because we can now deal with more situations involving bitcasts.

Differential Revision: https://reviews.llvm.org/D121371
2022-03-11 09:20:51 +01:00
Joe Nash c8e6d68a9f [AMDGPU] Use subreg encoding instead of reassign
The HWEncoding for these 64 bit registers should be the same as the
encoding for the previously defined low halves of the registers. So
reuse that value instead of repeating the assignment. NFC.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D121391
2022-03-10 12:50:29 -05:00
Nico Weber a278250b0f Revert "Cleanup codegen includes"
This reverts commit 7f230feeea.
Breaks CodeGenCUDA/link-device-bitcode.cu in check-clang,
and many LLVM tests, see comments on https://reviews.llvm.org/D121169
2022-03-10 07:59:22 -05:00
alex-t d159b4444c [AMDGPU] Enable divergence predicates for negative inline constant subtraction
We have a pattern that undoes the sub x, c -> add x, -c canonicalization since c is more likely
an inline immediate than -c. This patch enables it to select scalar or vector subtraction by the input node divergence.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D121360
2022-03-10 15:03:22 +03:00
serge-sans-paille 7f230feeea Cleanup codegen includes
after:  1061034926
before: 1063332844

Differential Revision: https://reviews.llvm.org/D121169
2022-03-10 10:00:30 +01:00
Changpeng Fang 0f20a35b9e AMDGPU: Set up User SGPRs for queue_ptr only when necessary
Summary:
  In general, we need queue_ptr for aperture bases and trap handling,
and user SGPRs have to be set up to hold queue_ptr. In current implementation,
user SGPRs are set up unnecessarily for some cases. If the target has aperture
registers, queue_ptr is not needed to reference aperture bases. For trap
handling, if the target supports getDoorbellID, queue_ptr is also not necessary.
Further, code object version 5 introduces new kernel ABI which passes queue_ptr
as an implicit kernel argument, so user SGPRs are no longer necessary for
queue_ptr. Based on the trap handling document:
https://llvm.org/docs/AMDGPUUsage.html#amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table,
llvm.debugtrap does not need queue_ptr, we remove queue_ptr support for llvm.debugtrap
in the backend.

Reviewers: sameerds, arsenm

Fixes: SWDEV-307189

Differential Revision: https://reviews.llvm.org/D119762
2022-03-09 10:14:05 -08:00
Stanislav Mekhanoshin 33fb23f728 [AMDGPU] Merge flat with global in the SILoadStoreOptimizer
Flat can be merged with flat global since address cast is a no-op.
A combined memory operation needs to be promoted to flat.

Differential Revision: https://reviews.llvm.org/D120431
2022-03-09 10:04:37 -08:00
Jay Foad c7218164c4 [AMDGPU] Remove HasAtomicFaddInstsGFX90X and HasAtomicFaddInstsGFX940
These compound predicates are not required, since we can use a
combination of setting the SubtargetPredicate (to a subtarget
predicate like isGFX940Plus) and OtherPredicates (to a list of feature
predicates like HasAtomicFaddInsts) instead. NFC.

Differential Revision: https://reviews.llvm.org/D121289
2022-03-09 18:02:21 +00:00
Vang Thao 28322c2514 [AMDGPU] Add scheduler pass to rematerialize trivial defs
Add a new pass in the pre-ra AMDGPU scheduler to check if sinking trivially rematerializable defs that only have one use outside of the defining block will increase occupancy. If we can determine that occupancy can be increased, then rematerialize only the minimum number of defs required to increase occupancy. Also re-schedule all regions that had occupancy matching the previous min occupancy using the new occupancy.

This is based off of the discussion in https://reviews.llvm.org/D117562.

The logic to determine the defs we should collect and to determine whether sinking would be beneficial is mostly the same. The main differences are that we no longer limit it to immediate defs and that the def and use do not have to be part of a loop.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D119475
2022-03-09 09:34:33 -08:00
Venkata Ramanaiah Nalamothu 04fff547e2 [AMDGPU] Move call clobbered return address registers s[30:31] to callee saved range
Currently the return address ABI registers s[30:31], which fall in the call
clobbered register range, are added as a live-in on the function entry to
preserve its value when we have calls so that it gets saved and restored
around the calls.

But the DWARF unwind information (CFI) needs to track where the return address
resides in a frame and the above approach makes it difficult to track the
return address when the CFI information is emitted during the frame lowering,
due to the need to understand the control flow.

This patch moves the return address ABI registers s[30:31] into callee saved
registers range and stops adding live-in for return address registers, so that
the CFI machinery will know where the return address resides when CSR
save/restore happen during the frame lowering.

And doing the above poses an issue that now the return instruction uses undefined
register `sgpr30_sgpr31`. This is resolved by hiding the return address register
use by the return instruction through the `SI_RETURN` pseudo instruction, which
doesn't take any input operands, until the `SI_RETURN` pseudo gets lowered to the
`S_SETPC_B64_return` during the `expandPostRAPseudo()`.

As an added benefit, this patch simplifies overall return instruction handling.

Note: The AMDGPU CFI changes are there only in the downstream code and another
version of this patch will be posted for review for the downstream code.

Reviewed By: arsenm, ronlieb

Differential Revision: https://reviews.llvm.org/D114652
2022-03-09 12:18:02 +05:30
Stanislav Mekhanoshin 9eabea3968 [AMDGPU] Set noclobber metadata on loads instead of cast to constant
A load via pointer cast to constant will return true from
pointsToConstantMemory which is not necessarily so.

Fixes: SWDEV-326463

Differential Revision: https://reviews.llvm.org/D121172
2022-03-07 23:13:02 -08:00
Christudasan Devadasan 0d849b8249 AMDGPU: Skip folding REG_SEQUENCE if found unknown regclasses for its users
Use TII::getRegClass to return a valid regclass or a nullptr
if the RC is unknown for a given OpIdx. This fixes a potential
crash that occurred while getting the RC from a variadic instruction.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D120813
2022-03-08 10:11:57 +05:30
Jacob Lambert 5160447f58 [AMDGPU] Add gfx10 assembler directive to specify shared VGPR count
Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D105507
2022-03-07 14:27:41 -08:00
Stanislav Mekhanoshin 932f628121 [AMDGPU] new gfx940 fp atomics
Differential Revision: https://reviews.llvm.org/D121028
2022-03-07 12:32:02 -08:00
Stanislav Mekhanoshin e7b362d75d [AMDGPU] Add v_mov_b64 gfx940 opcode
Differential Revision: https://reviews.llvm.org/D121023
2022-03-07 12:07:12 -08:00
Stanislav Mekhanoshin 8992b50e2f [AMDGPU] gfx940 uses new names for coherency bits
Differential Revision: https://reviews.llvm.org/D120855
2022-03-07 11:50:07 -08:00
Austin Kerbow 0c0636f782 [AMDGPU] Fix uninitialized value after 8d0c34fd4f 2022-03-07 11:32:01 -08:00
Stanislav Mekhanoshin 2c830c8fab [AMDGPU] gfx940: support V_FMAMK_F32 and V_FMAAK_F32
Differential Revision: https://reviews.llvm.org/D120769
2022-03-07 11:31:01 -08:00
Austin Kerbow 8d0c34fd4f [AMDGPU] Omit unnecessary waitcnt before barriers
It is not necessary to wait for all outstanding memory operations before
barriers on hardware that can back off of the barrier in the event of an
exception when traps are enabled. Add a new subtarget feature which
tracks which HW has this ability.

Reviewed By: #amdgpu, rampitec

Differential Revision: https://reviews.llvm.org/D120544
2022-03-07 08:23:53 -08:00
Jay Foad d7d4ed0847 [AMDGPU] Tweak predicates for image_bvh_intersect_ray instructions
Don't override SubtargetPredicate since that is already set in the
base classes for the appropriate subtarget like MIMG_gfx10. Use
OtherPredicates instead for consistency with the way we handle
features like HasImageInsts and HasExtendedImageInsts. NFC.

Differential Revision: https://reviews.llvm.org/D120909
2022-03-04 12:05:23 +00:00
Aakanksha 840695814a [AMDGPU] Add gfx1036 target
Differential Revision: https://reviews.llvm.org/D120846
2022-03-02 23:26:38 +00:00
Stanislav Mekhanoshin 35ec58d8c0 [AMDGPU] gfx940 removes all image instructions
Differential Revision: https://reviews.llvm.org/D120763
2022-03-02 13:55:26 -08:00
Stanislav Mekhanoshin 2e2e64df4a [AMDGPU] Add gfx940 target
This is target definition only.

Differential Revision: https://reviews.llvm.org/D120688
2022-03-02 13:54:48 -08:00
Jay Foad 5ddfedc956 [AMDGPU] Fix deleting of move-immediate instructions after folding
SIInstrInfo::FoldImmediate tried to delete move-immediate instructions
after folding them into their only use. This did not work because it was
checking hasOneNonDBGUse after doing the fold, at which point there
should be no uses. This seems to have no effect on codegen, it just
means less stuff for DCE to clean up later.

Differential Revision: https://reviews.llvm.org/D120815
2022-03-02 16:11:16 +00:00
Jay Foad 8bed52c9eb [AMDGPU] Make more use of madmk/fmamk instructions
In convertToThreeAddress handle VOP2 mac/fmac instructions with a
literal src0 operand, since these are prime candidates for
converting to madmk/fmamk.

Previously this would only happen if src0 (or src1) was a register
defined by a move-immediate instruction, but in many cases these
operands have already been folded because SIFoldOperands runs
before TwoAddressInstructionPass.

Differential Revision: https://reviews.llvm.org/D120736
2022-03-02 10:22:10 +00:00
Mircea Trofin cb2160760e [nfc][codegen] Move RegisterBank[Info].h under CodeGen
This wraps up from D119053. The 2 headers are moved as described,
fixed file headers and include guards, updated all files where the old
paths were detected (simple grep through the repo), and `clang-format`-ed it all.

Differential Revision: https://reviews.llvm.org/D119876
2022-03-01 21:53:25 -08:00
Abinav Puthan Purayil 8b4ab01c38 [AMDGPU] Select no-return atomic ops in BUFInstructions.td
This change adds the selection of no-return buffer_* instructions in
tblgen. The motivation for this is to get the no-return atomic isel
working without relying on post-isel hooks so that GlobalISel can start
selecting them (once GlobalISelEmitter allows no return atomic patterns
like how DAGISel does).

This change handles the selection of no-return mubuf_atomic_cmpswap in
tblgen without changing the extract_subreg generation for the return
variant. This handling was done by the post-isel hook.

Differential Revision: https://reviews.llvm.org/D120538
2022-03-02 08:25:28 +05:30
Jay Foad 289339140e [AMDGPU] Handle legacy multiply-accumulate opcodes in convertToThreeAddress
Handle V_MAC_LEGACY_F32 and V_FMAC_LEGACY_F32 in
convertToThreeAddress, to avoid the need for an extra mov
instruction in some cases.

Differential Revision: https://reviews.llvm.org/D120704
2022-03-01 16:58:00 +00:00
Jay Foad 9ac3a85047 [AMDGPU] Disentangle MFMA handling in convertToThreeAddress. NFC.
Move MFMA handling to the top of convertToThreeAddress and pull
IsF16 calculation out of the switch. I think this makes it clearer
exactly which mac/fmac opcodes are handled, since they are now
listed in the switch with minimal extra clutter.

Differential Revision: https://reviews.llvm.org/D120703
2022-03-01 16:56:56 +00:00
Jay Foad 68895098d1 [AMDGPU] Preserve src2_modifiers in convertToThreeAddress
Found by code inspection. I don't think it makes a difference with
current codegen, because if any source modifiers were present we
would have selected mad/fma instead of mac/fmac in the first place.

Differential Revision: https://reviews.llvm.org/D120709
2022-03-01 14:48:25 +00:00
Stanislav Mekhanoshin 517171ce20 [AMDGPU] Extend SILoadStoreOptimizer to handle flat load/stores
TODO: merge flat with global promoting to flat.

Differential Revision: https://reviews.llvm.org/D120351
2022-02-28 11:27:30 -08:00
Carl Ritson 2bbe6506d4 [AMDGPU] Remove redundant isVALU in SIPreEmitPeephole. NFC
Remove redundant isVALU call added in D120202.
2022-02-27 16:09:20 +09:00
Benjamin Kramer 1de11fe360 Use RegisterInfo::regsOverlaps instead of checking aliases
This is both less code and faster since it doesn't have to expand all
the sub & superreg sets. NFCI.
2022-02-26 20:32:12 +01:00
Jameson Nash c4b1a63a1b mark getTargetTransformInfo and getTargetIRAnalysis as const
Seems like this can be const, since Passes shouldn't modify it.

Reviewed By: wsmoses

Differential Revision: https://reviews.llvm.org/D120518
2022-02-25 14:30:44 -05:00
Changpeng Fang ca62b1db9f [AMDGPU][NFC]: Emit metadata for hidden_heap_v1 kernarg
Summary:
  Emit metadata for hidden_heap_v1 kernarg

Reviewers:
  sameerds, b-sumner

Fixes:
  SWDEV-307188

Differential Revision:
  https://reviews.llvm.org/D119027
2022-02-25 10:45:35 -08:00
Aakanksha bf60a1c546 Avoid comparisons between types of different widths in a loop condition to prevent the loop from behaving unexpectedly
This change fixes the code violations flagged in AMD compute CodeQL scan -
Query Description: "Comparisons between types of different widths in a loop condition can cause the loop to behave unexpectedly."

Differential Revision: https://reviews.llvm.org/D120355
2022-02-25 17:30:12 +00:00
Carl Ritson 565af157ef [AMDGPU] Extend pre-emit peephole to redundantly masked VCC
Extend pre-emit peephole for S_CBRANCH_VCC[N]Z to eliminate
redundant S_AND operations against EXEC for V_CMP results in VCC.
These occur after register allocation when VCC has been
selected as the comparison destination.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D120202
2022-02-25 10:18:31 +09:00
Jay Foad 05d79e3562 [AMDGPU] Divergence-driven instruction selection for bitreverse
Differential Revision: https://reviews.llvm.org/D119702
2022-02-24 20:21:59 +00:00
Stanislav Mekhanoshin 3279e44063 [AMDGPU] Extend SILoadStoreOptimizer to handle global stores
TODO: merge flat load/stores.
TODO: merge flat with global promoting to flat.

Differential Revision: https://reviews.llvm.org/D120346
2022-02-24 11:09:51 -08:00
Stanislav Mekhanoshin cefa1c5ca9 [AMDGPU] Fix combined MMO in load-store merge
Loads and stores can be out of order in the SILoadStoreOptimizer.
When combining MachineMemOperands of two instructions operands are
sent in the IR order into the combineKnownAdjacentMMOs. At the
moment it picks the first operand and just replaces its offset and
size. This essentially loses alignment information and may generally
result in an incorrect base pointer to be used.

Use a base pointer in memory addresses order instead and only adjust
size.

Differential Revision: https://reviews.llvm.org/D120370
2022-02-24 10:47:57 -08:00
Bill Wendling a5bbc6ef99 [NFC] Remove unnecessary "#include"s from header files 2022-02-23 01:20:48 -08:00
Stanislav Mekhanoshin 9e055c0fff [AMDGPU] Extend SILoadStoreOptimizer to handle global saddr loads
This adds handling of the _SADDR forms to the GLOBAL_LOAD combining.

TODO: merge global stores.
TODO: merge flat load/stores.
TODO: merge flat with global promoting to flat.

Differential Revision: https://reviews.llvm.org/D120285
2022-02-22 09:01:43 -08:00
Stanislav Mekhanoshin ba17bd2674 [AMDGPU] Extend SILoadStoreOptimizer to handle global loads
There can be situations where global and flat loads and stores are not
combined by the vectorizer, in particular if their address spaces
differ in the IR but they end up as the same class of instructions after
selection. For example a divergent load from constant address space
ends up being the same global_load as a load from global address space.

TODO: merge global stores.
TODO: handle SADDR forms.
TODO: merge flat load/stores.
TODO: merge flat with global promoting to flat.

Differential Revision: https://reviews.llvm.org/D120279
2022-02-22 08:42:36 -08:00
Thomas Symalla 380ff31d83 [AMDGPU] Fix typo in comment [NFC]
This replaces "V_MOB_B32" with "V_MOV_B32" in some comment.
2022-02-22 13:27:26 +01:00
Stanislav Mekhanoshin dc0981562e [AMDGPU] Remove redundant check in the SILoadStoreOptimizer
Differential Revision: https://reviews.llvm.org/D120268
2022-02-21 15:04:44 -08:00
Jay Foad 359a792f9b [AMDGPU] SILoadStoreOptimizer: avoid unbounded register pressure increases
Previously when combining two loads this pass would sink the
first one down to the second one, putting the combined load
where the second one was. It would also sink any intervening
instructions which depended on the first load down to just
after the combined load.

For example, if we started with this sequence of
instructions (code flowing from left to right):

  X A B C D E F Y

After combining loads X and Y into XY we might end up with:

  A B C D E F XY

But if B D and F depended on X, we would get:

  A C E XY B D F

Now if the original code had some short disjoint live ranges
from A to B, C to D and E to F, in the transformed code
these live ranges will be long and overlapping. In this way
a single merge of two loads could cause an unbounded
increase in register pressure.

To fix this, change the way that loads are moved in
order to merge them so that:
- The second load is moved up to the first one. (But when
  merging stores, we still move the first store down to the
  second one.)
- Intervening instructions are never moved.
- Instead, if we find an intervening instruction that would
  need to be moved, give up on the merge. But this case
  should now be pretty rare because normal stores have no
  outputs, and normal loads only have address register
  inputs, but these will be identical for any pair of loads
  that we try to merge.

As well as fixing the unbounded register pressure increase
problem, moving loads up and stores down seems like it
should usually be a win for memory latency reasons.
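
The reordering described above can be replayed on the example sequence. The following is a minimal Python sketch, not the pass itself: the function names, the `deps` set (intervening instructions that use the first load) and the `blockers` set (intervening instructions that would have to move) are hypothetical stand-ins for the real dependency checks.

```python
def merge_old(seq, first, second, deps):
    """Old strategy: sink the first load down to the second one.

    Intervening instructions in `deps` (users of the first load) are
    sunk as well, to just after the combined load.
    """
    i, j = seq.index(first), seq.index(second)
    between = seq[i + 1:j]
    stay = [op for op in between if op not in deps]
    sunk = [op for op in between if op in deps]
    return seq[:i] + stay + [first + second] + sunk + seq[j + 1:]


def merge_new(seq, first, second, blockers=frozenset()):
    """New strategy for loads: hoist the second load up to the first one.

    Intervening instructions are never moved. If one of them would need
    to move (modelled here by `blockers`), give up and return None.
    """
    i, j = seq.index(first), seq.index(second)
    between = seq[i + 1:j]
    if any(op in blockers for op in between):
        return None
    return seq[:i] + [first + second] + between + seq[j + 1:]


seq = list("XABCDEFY")
print(" ".join(merge_old(seq, "X", "Y", deps=set("BDF"))))
print(" ".join(merge_new(seq, "X", "Y")))
```

In the old result the pairs A/B, C/D and E/F are pulled apart, which is what stretches their live ranges; in the new result the intervening instructions keep their original order.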

Differential Revision: https://reviews.llvm.org/D119006
2022-02-21 10:51:14 +00:00
Jay Foad 57baa14d74 [AMDGPU] Rename AMDGPUCFGStructurizer to R600MachineCFGStructurizer
Previously the name of the class (AMDGPUCFGStructurizer) did not
match the name of the file (AMDILCFGStructurizer).

Standardize on the name R600MachineCFGStructurizer by analogy with
AMDGPUMachineCFGStructurizer.

Differential Revision: https://reviews.llvm.org/D120128
2022-02-18 15:08:25 +00:00
Sebastian Neubauer 6527b2a4d5 [AMDGPU][NFC] Fix typos
Fix some typos in the amdgpu backend.

Differential Revision: https://reviews.llvm.org/D119235
2022-02-18 15:05:21 +01:00
Sebastian Neubauer 1f0aadfa62 [AMDGPU] Fix kill flag on overlapping sgpr copy
Same as on vgpr copies, we cannot kill the source register if it
overlaps with the destination register. Otherwise, the kill of the
source register will also count as a kill for the destination register.

Differential Revision: https://reviews.llvm.org/D120042
2022-02-18 14:36:00 +01:00
Jay Foad 69ab233a15 [AMDGPU] Return better Changed status from SIFoldOperands
Differential Revision: https://reviews.llvm.org/D120023
2022-02-18 10:35:48 +00:00
Jay Foad 768e6faba8 [AMDGPU] Return better Changed status from SILowerControlFlow
Differential Revision: https://reviews.llvm.org/D120025
2022-02-18 10:09:22 +00:00
Jay Foad d86dcb7ea5 [AMDGPU] Return better Changed status from SIOptimizeExecMasking
Differential Revision: https://reviews.llvm.org/D120024
2022-02-18 10:09:21 +00:00
Stanislav Mekhanoshin b0aa1946df [AMDGPU] Promote recursive loads from kernel argument to constant
Pointer load chains that are not clobbered are now promoted to global. It
is possible to promote these loads themselves into the constant address
space. Loaded pointers still need to point to global because we
need to be able to store into that pointer and because an actual
load from it may occur after a clobber.

Differential Revision: https://reviews.llvm.org/D119886
2022-02-17 11:07:03 -08:00
Jay Foad c08896d292 [AMDGPU] Return better Changed status from SILowerI1Copies
Differential Revision: https://reviews.llvm.org/D119946
2022-02-17 09:38:57 +00:00
Jay Foad 78ebb1dd24 [AMDGPU] Return better Changed status from SIAnnotateControlFlow
Differential Revision: https://reviews.llvm.org/D119945
2022-02-17 09:38:57 +00:00
Jay Foad 1822a5ecdd [AMDGPU] Return better Changed status from AMDGPUPerfHintAnalysis
Differential Revision: https://reviews.llvm.org/D119944
2022-02-17 09:31:42 +00:00
Jay Foad 77e793d025 [AMDGPU] Return better Changed status from AMDGPUAnnotateUniformValues
Differential Revision: https://reviews.llvm.org/D119943
2022-02-17 09:31:42 +00:00
Matt Arsenault 3884cb9235 AMDGPU: Always reserve VGPR for AGPR copies on gfx908
Just because there aren't AGPRs in the original program doesn't mean
the register allocator can't choose to use them (unless we were to
forcibly reserve all AGPRs if there weren't any uses). This happens in
high pressure situations and introduces copies to avoid spills.

In this test, the allocator ends up introducing a copy from SGPR to
AGPR which requires an intermediate VGPR. I don't believe it would
introduce a copy from AGPR to AGPR in this situation, since it would
be trying to use an intermediate with a different class.

Theoretically this is also broken on gfx90a, but I have been unable to
come up with a testcase.
2022-02-16 18:48:18 -05:00
Jacob Lambert 7470244475 [AMDGPU] Add agpr_count to metadata and AsmParser
gfx90a allows the number of ACC registers (AGPRs) to be set
independently of the VGPR registers. For both HSA and PAL metadata, we
now include an "agpr_count" key to report the number of AGPRs set for
supported devices (gfx90a, gfx908, as determined by hasMAIInsts()).
This is collected from SIProgramInfo.NumAccVGPR for both HSA and PAL.
The AsmParser also now recognizes ".kernel.agpr_count" for supported
devices.

Differential Revision: https://reviews.llvm.org/D116140
2022-02-16 15:17:23 -08:00
Jacob Lambert 0bad7cb565 Hoist getTotalNumVGPRs into AMDGPUBaseInfo for use in both codegen and MC
Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D119912
2022-02-16 11:04:08 -08:00
Dmitry Preobrazhensky 6655c5a6bb [AMDGPU][MC][GFX10] Added an alias for HW_REG_HW_ID1
Enabled HW_REG_HW_ID as an alias for HW_REG_HW_ID1. This is required for compatibility with existing code.

Differential Revision: https://reviews.llvm.org/D119939
2022-02-16 19:45:44 +03:00
Shao-Ce SUN 2aed07e96c [NFC][MC] remove unused argument `MCRegisterInfo` in `MCCodeEmitter`
Reviewed By: skan

Differential Revision: https://reviews.llvm.org/D119846
2022-02-16 13:10:09 +08:00
Matt Arsenault 898dc8a4b1 AMDGPU: Use subtarget in class instead of querying function 2022-02-15 21:28:12 -05:00
Stanislav Mekhanoshin 29a0e0a9e5 [AMDGPU] Do not define GET_INSTRINFO_SCHED_ENUM
Autogenerated names are too long and break compilation on Windows,
while we do not need this enum at all.

Differential Revision: https://reviews.llvm.org/D119869
2022-02-15 13:00:54 -08:00
Jay Foad a65b9dd049 [AMDGPU] Divergence-driven instruction selection for bfm patterns
Differential Revision: https://reviews.llvm.org/D119706
2022-02-15 10:49:18 +00:00
Jay Foad f72d8897ac [AMDGPU] Honor !invariant.load metadata on load-like intrinsics
Differential Revision: https://reviews.llvm.org/D119739
2022-02-15 09:16:57 +00:00
Jay Foad cb199e0fca [MC] Define and use MCRegisterInfo::regsOverlap
Separate MCRegisterInfo::regsOverlap out from
TargetRegisterInfo::regsOverlap. This is useful in the AMDGPU AsmParser
where we only have access to MCRegisterInfo.

Differential Revision: https://reviews.llvm.org/D119533
2022-02-14 20:46:02 +00:00
Joe Nash c87c61c52c [AMDGPU] Fix AGPR offset for waitcnt
An enum value stores the offset between AGPR ranges and VGPR
ranges in the internal storage of SIInsertWaitcnts. It said 226 when
it should say 256, causing some portion of the ranges to overlap. That
in turn causes 'aliasing' between the registers, potentially inserting
waitcnts that are not required.
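
The overlap can be illustrated with a small Python sketch. It assumes a flat slot array in which VGPRs occupy slots 0..255 and AGPR n is stored at offset + n; the names and the count of 256 AGPRs are assumptions for illustration, not the actual SIInsertWaitcnts data structures.

```python
NUM_VGPRS = 256  # assumption: v0..v255 occupy the first 256 slots
NUM_AGPRS = 256  # assumption: a0..a255 follow at a fixed offset


def aliased_pairs(agpr_offset):
    """Return (vgpr, agpr) index pairs that land on the same slot."""
    pairs = []
    for agpr in range(NUM_AGPRS):
        slot = agpr_offset + agpr
        if slot < NUM_VGPRS:  # slot is also used by VGPR number `slot`
            pairs.append((slot, agpr))
    return pairs


print(len(aliased_pairs(226)), aliased_pairs(226)[:2])
print(len(aliased_pairs(256)))
```

Under these assumptions an offset of 226 makes a0..a29 share slots with v226..v255, while an offset of 256 leaves the two ranges disjoint.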

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D119749
2022-02-14 15:16:21 -05:00
alex-t c23198ec13 [AMDGPU] Divergence-driven abs instruction selection
This change enables "abs" SDNodes selection by the node divergence.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D119581
2022-02-14 21:36:32 +03:00
Stanislav Mekhanoshin c7eb846345 [AMDGPU] Merge AMDGPULDSUtils into AMDGPUMemoryUtils
Differential Revision: https://reviews.llvm.org/D119502
2022-02-11 10:32:24 -08:00
Jay Foad b59ad64ead [TableGen][AMDGPU] Allow empty register classes
Remove ARTIFICIAL_VGPR which only existed to make VReg_1 not empty.

Differential Revision: https://reviews.llvm.org/D119552
2022-02-11 17:30:04 +00:00
Sebastian Neubauer a5d4f82b73 [AMDGPU] Make enable-flat-scratch a subtarget feature
Use a subtarget feature instead of a command line argument to reduce
global state.
We want to enable flat scratch for graphics in some cases and this
doesn't work well with command line options.

Differential Revision: https://reviews.llvm.org/D119425
2022-02-11 18:23:07 +01:00
Sameer Sahasrabuddhe d8f99bb6e0 [AMDGPU] replace hostcall module flag with function attribute
The module flag to indicate use of hostcall is insufficient to catch
all cases where hostcall might be in use by a kernel. This is now
replaced by a function attribute that gets propagated to top-level
kernel functions via their respective call-graph.

If the attribute "amdgpu-no-hostcall-ptr" is absent on a kernel, the
default behaviour is to emit kernel metadata indicating that the
kernel uses the hostcall buffer pointer passed as an implicit
argument.

The attribute may be placed explicitly by the user, or inferred by the
AMDGPU attributor by examining the call-graph. The attribute is
inferred only if the function is not being sanitized, and the
implicitarg_ptr does not result in a load of any byte in the hostcall
pointer argument.

Reviewed By: jdoerfert, arsenm, kpyzhov

Differential Revision: https://reviews.llvm.org/D119216
2022-02-11 22:51:56 +05:30
Julien Pages dcb2da13f1 [AMDGPU] Add a new intrinsic to control fp_trunc rounding mode
Add a new llvm.fptrunc.round intrinsic to precisely control
the rounding mode when converting from f32 to f16.

Differential Revision: https://reviews.llvm.org/D110579
2022-02-11 12:08:23 -05:00
Mirko Brkusanin 5ff35ba8ae [AMDGPU][GlobalISel] Fix insert point in FoldableFneg combine
Newly created fneg was built after some of its uses in some cases.
Now it will be built immediately after instruction whose dst it negates.

Differential Revision: https://reviews.llvm.org/D119459
2022-02-11 12:09:40 +01:00
serge-sans-paille 06943537d9 Cleanup MCParser headers
As usual with that header cleanup series, some implicit dependencies now need to
be explicit:

llvm/MC/MCParser/MCAsmParser.h no longer includes llvm/MC/MCParser/MCAsmLexer.h

Preprocessed lines to build llvm on my setup:
after:  1068185081
before: 1068324320

So no compile time benefit to expect, but we still get the looser coupling
between files which is great.

Discourse thread: https://discourse.llvm.org/t/include-what-you-use-include-cleanup
Differential Revision: https://reviews.llvm.org/D119359
2022-02-11 10:39:29 +01:00
Stanislav Mekhanoshin 290e5722e8 [AMDGPU] Improve clobbering checks in the kernel argument promotion
Use same MSSA clobbering checks as in the AMDGPUAnnotateUniformValues.
Kernel argument promotion needs exactly the same information so factor
out utility function isClobberedInFunction.

Differential Revision: https://reviews.llvm.org/D119480
2022-02-10 14:51:47 -08:00
alex-t d88a146f2b [AMDGPU] Missed sign/zero extend patterns for divergence-driven instruction selection
This change includes tablegen patterns that were missed by https://reviews.llvm.org/D110950 and https://reviews.llvm.org/D76230

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D119302
2022-02-10 19:36:12 +03:00
Simon Pilgrim 8de7297374 [AMDGPU] Pull out repeated getVecSize() calls. NFC.
This is guaranteed to be evaluated so we can avoid repeated calls.

Helps the static analyzer as it couldn't recognise that each getVecSize() would return the same value.
2022-02-10 16:31:36 +00:00
Jay Foad e34623b165 [AMDGPU] Rename DSAtomicCmpXChg to DSAtomicCmpXChgSwapped. NFC.
This is just a reminder that the operands are swapped compared with all
the other CmpXChg instructions.

Differential Revision: https://reviews.llvm.org/D119421
2022-02-10 14:54:44 +00:00
Abinav Puthan Purayil 29bd3fadbc [AMDGPU] Select no-return atomic ops in FLATInstructions.td.
This change adds the selection for the no-return global_* and flat_*
instructions in tblgen.  The motivation for this is to get the no-return atomic
isel working without relying on post-isel hooks so that GlobalISel can start
selecting them (once GlobalISelEmitter allows no return atomic patterns like
how DAGISel does).

Differential Revision: https://reviews.llvm.org/D119227
2022-02-10 09:26:37 +05:30
Abinav Puthan Purayil f4e8cf25af [AMDGPU] Select no-return ds_* atomic ops in tblgen.
SelectionDAG relies on MachineInstr's HasPostISelHook for selecting the
no-return atomic ops. GlobalISel, at the moment, doesn't handle
HasPostISelHook.

This change adds the selection for no-return ds_* atomic ops in tblgen
so that it can work with both GlobalISel and SelectionDAG. I couldn't
add the predicates for GlobalISel in this change since there's a
restriction in GlobalISelEmitter that disallows selecting generic
atomic ops that return with instructions that don't return.

We can't remove the HasPostISelHook code that selects the no return
atomic ops in SelectionDAG yet since we still need to cover selections
in FLATInstructions.td, BUFInstructions.td.

Differential Revision: https://reviews.llvm.org/D115881
2022-02-10 09:26:37 +05:30
Jay Foad 476bb2d94e [AMDGPU] Remove dead code from shrinkScalarLogicOp
It looks like this code has been dead since shrinkScalarLogicOp
was introduced in svn r348601.
2022-02-09 17:07:12 +00:00
Jay Foad db28a45617 [AMDGPU] Remove irrelevant comments on V_BFE_I32 instructions
These comments explain the encoding of the immediate operand of S_BFE_*
which is not relevant for V_BFE_I32.
2022-02-09 12:06:29 +00:00
serge-sans-paille ef736a1c39 Cleanup LLVMMC headers
There's a few relevant forward declarations in there that may require downstream
adding explicit includes:

llvm/MC/MCContext.h no longer includes llvm/BinaryFormat/ELF.h, llvm/MC/MCSubtargetInfo.h, llvm/MC/MCTargetOptions.h
llvm/MC/MCObjectStreamer.h no longer includes llvm/MC/MCAssembler.h
llvm/MC/MCAssembler.h no longer includes llvm/MC/MCFixup.h, llvm/MC/MCFragment.h

Counting preprocessed lines required to rebuild llvm-project on my setup:
before: 1052436830
after:  1049293745

Which is significant and backs up the change in addition to the usual benefits of
decreasing coupling between headers and compilation units.

Discourse thread: https://discourse.llvm.org/t/include-what-you-use-include-cleanup
Differential Revision: https://reviews.llvm.org/D119244
2022-02-09 11:09:17 +01:00
Sameer Sahasrabuddhe c6a6b57902 [AMDGPU] [NFC] Fix incorrect use of bitwise operator.
Differential Revision: https://reviews.llvm.org/D119308
2022-02-08 22:12:54 -05:00
Stanislav Mekhanoshin aeaf85b9c2 [AMDGPU] Select VGPR versions of MFMA if possible
We can select _vgprcd versions of MAI instructions and have no
AGPRs with the whole budget left for VGPRs if:

1. This is a kernel;
2. It has no calls;
3. It runs at least on 2 waves thus having not more than 256 VGPRs.
4. There is no inline asm requesting AGPRs.
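
The four conditions above can be read as a single predicate. A minimal Python sketch follows; the function name and parameters are hypothetical and do not correspond to LLVM APIs.

```python
def can_select_vgpr_mfma(is_kernel, has_calls, min_waves_per_eu,
                         inline_asm_uses_agprs):
    """True if all four conditions from the list above hold."""
    return (is_kernel                      # 1. this is a kernel
            and not has_calls              # 2. it has no calls
            and min_waves_per_eu >= 2      # 3. at least 2 waves
            and not inline_asm_uses_agprs) # 4. no inline asm wants AGPRs


print(can_select_vgpr_mfma(True, False, 2, False))
print(can_select_vgpr_mfma(True, True, 2, False))
```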

Differential Revision: https://reviews.llvm.org/D117253
2022-02-08 10:19:41 -08:00
Matt Arsenault f2c99ea47d AMDGPU: Use reserved VGPR for AGPR spills to memory
Previously would reuse the VGPR used for large frame offsets with the
one needed for copying from the AGPR. Fix this by reusing the register
we already reserved for handling AGPR to AGPR copies.
2022-02-08 11:26:59 -05:00
Matt Arsenault 8b2ca766f0 AMDGPU: Reserve v32 if we may need to copy between AGPRs on gfx908
We need to guarantee cheap copies between AGPRs, and unfortunately
gfx908 cannot directly do this. Theoretically we could set the
scavenger up with an emergency spill slot, but it also feels
unreasonable to pay that cost for what was assumed to be a simple and
cheap copy. Pick a register that doesn't conflict with any ABI
registers.

This does not address the same issue when copying from SGPR to AGPR
for gfx90a (this coincidentally fixes it for gfx908), but that's less
interesting since the register allocator shouldn't be proactively
introducing such copies.

One edge case I'm worried about is respecting the VGPR budget implied
by amdgpu-waves-per-eu. If the theoretical upper bound of a function
is 32 VGPRs, this will force the actual count to be 33.

This is also broken if inline assembly uses/defs something in v32. The
coalescer will eliminate the intermediate vreg between the def and
use, and the introduced copy will clobber the user value.

(cherry picked from commit 3335784ac2d587ff4eac04586e189532ae8b2607)
2022-02-08 11:14:52 -05:00
Nikita Popov 997027347d [AMDGPURewriteOutArguments] Don't use pointer element type
Instead of using the pointer element type, look at how the pointer
is actually being used in store instructions, while looking through
bitcasts. This makes the transform compatible with opaque pointers
and a bit more general.

It's worth noting that I have dropped the 3-vector to 4-vector
shufflevector special case, because this is now handled in a
different way: If the value is actually used as a 4-vector, then
we're directly going to use that type, instead of shuffling to a
3-vector in between.

Differential Revision: https://reviews.llvm.org/D119237
2022-02-08 16:10:41 +01:00
Simon Pilgrim fd2bb51f1e [ADT] Add APInt/MathExtras isShiftedMask variant returning mask offset/length
In many cases, calls to isShiftedMask are immediately followed with checks to determine the size and position of the bitmask.

This patch adds variants of APInt::isShiftedMask, isShiftedMask_32 and isShiftedMask_64 that return these values as additional arguments.

I've updated a number of cases that were either performing separate size/position calculations or had created their own local wrapper versions of these.
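
The idea of returning the mask position and size together with the yes/no answer can be sketched in Python. This is an illustration of the concept only: the function name, the tuple return value and the `width` parameter are assumptions and do not mirror the C++ signatures added by the patch.

```python
def is_shifted_mask(value, width=32):
    """Check for one contiguous run of 1 bits within `width` bits.

    Returns (True, index, length) where index is the position of the
    lowest set bit and length is the number of set bits, or
    (False, 0, 0) if the value is zero or the ones are not contiguous.
    """
    value &= (1 << width) - 1
    if value == 0:
        return (False, 0, 0)
    index = (value & -value).bit_length() - 1  # lowest set bit position
    run = value >> index                       # shift the run down to bit 0
    if run & (run + 1):                        # not of the form 2**k - 1
        return (False, 0, 0)
    return (True, index, run.bit_length())


print(is_shifted_mask(0x0FF0))
print(is_shifted_mask(0b1010))
```

A caller gets the offset and length from the same call instead of recomputing them with separate trailing-zero and population-count steps.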

Differential Revision: https://reviews.llvm.org/D119019
2022-02-08 12:04:13 +00:00
Sameer Sahasrabuddhe 02a2e46ff0 [AMDGPU] [NFC] refactor the AMDGPU attributor
Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D119087
2022-02-07 21:45:32 -05:00
Carl Ritson a1fb307b4b [AMDGPU] Allow hoisting of some VALU compare instructions
Conservatively allow hoisting/sinking of VALU comparisons.
If the result of a comparison is masked with exec, narrowing the
set of active lanes, then it is safe to hoist it as the masking
instruction will never be hoisted.

Heuristically this is also true for sinking, as we do not expect
the result of a sunk comparison that is masked with exec to be
used outside of the loop.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D118975
2022-02-08 11:27:23 +09:00
Vang Thao 570471199b [AMDGPU] Fix debug values in scheduler not placed correctly when reverting
Debug position data is cleared after ScheduleDAGMILive::schedule() due to it also calling placeDebugValues(). Make it so the data is not cleared after initial call to placeDebugValues since we will call it again after reverting a schedule.

Secondly, since we skip debug instructions when reverting the schedule on AMDGPU, all debug instructions are now moved to the end of the scheduling region. RegionEnd points to the beginning of this chunk of debug instructions since it was not incremented when a debug instruction was skipped. RegionBegin may also point to the same debug instruction if Unsched.front() is a debug instruction thus shrinking the region to 1. Fix RegionBegin and RegionEnd so that they point to the current beginning and ending before calling placeDebugValues() since both vars will be used as reference points to move debug instructions back.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D119022
2022-02-07 11:01:13 -08:00
Matt Arsenault 31973062ec AMDGPU: Fix clobbering SCC when expanding large offset spill pseudos
If we had a large offset which required materializing in a register,
we would emit an s_add_i32, clobbering SCC. Start checking if SCC is
live, and instead use a VGPR offset. For MUBUF, we switch to using
offen. We would do this anyway in a normal load/store with a frame
index, but not for spills.

The same problem still exists in other contexts where we expand frame
indices.

The nasty edge case is when SGPRs are spilled to memory at a large
frame offset where SCC is also clobbered. This requires a second
scavenging index, and also required several patches in the scavenger
to correctly handle multiple recursive scavenge indexes.

An even nastier edge case we still don't support is if we don't have
any free SGPRs. If SCC is live and we don't have any free SGPRs to
save exec, we have no way of flipping exec back and forth without also
clobbering SCC.

Fixes: SWDEV-309419
2022-02-07 10:02:03 -05:00
Kazu Hirata 3a3cb929ab [llvm] Use = default (NFC) 2022-02-06 22:18:35 -08:00
Ruiling Song 0719c43735 AMDGPU: Don't clobber source register for V_SET_INACTIVE_*
The WWM register has unmodeled register liveness. For v_set_inactive_*,
clobbering the source register is dangerous because it will overwrite the
inactive lanes. When the source vgpr is dead at v_set_inactive_lane,
the inactive lanes may not really be dead. This may cause common
optimizations to produce wrong results.

For example in a simple if-then cfg in Machine IR:
bb.if:
  %src =

bb.then:
  %src1 = COPY %src
  %dst = V_SET_INACTIVE %src1(tied-def 0), %inactive

bb.end:
  ... = PHI [0, %bb.then] [%src, %bb.if]

The register coalescer will think it is safe to optimize "%src1 = COPY %src"
in bb.then. And at the same time, there is no interference for the PHI in
bb.end. The source and destination values of the PHI will be assigned
the same register. The single PHI register will be overwritten by the
v_set_inactive, and we would then get the wrong value in bb.end.

With this change, we will copy the content of the source register before
setting inactive lanes after register allocation. Yes, this will sacrifice
the WWM code generation a little, but I don't have any better idea to do things
correctly.

Differential Revision: https://reviews.llvm.org/D117482
2022-02-06 12:38:26 +08:00
Matt Arsenault 8b8b491379 AMDGPU/GlobalISel: Fix assertions on invalid addrspacecasts
Fixes some asserts in invalid situations and starts directly emitting
the error.
2022-02-04 17:28:49 -05:00
Matt Arsenault 4622afa94c AMDGPU: Convert AMDGPUResourceUsageAnalysis to a Module pass
This is more precise in the face of indirect calls and aliases, still
assuming the call target is defined somewhere in the current module.

This sometimes changes the order the functions are printed, and also
changes the point where context errors are printed relative to
stdout. This also likely has negative consequences for compile time
and memory usage.
2022-02-04 15:56:04 -05:00
Matt Arsenault 935abab65c AMDGPU: Use module level register maximums for unknown callees
Compute the theoretical register budget based on the IR function
signature/attributes, and use the global maximum register budgets for
unknown callees.

This should fix the kernel reported register usage in the presence of
indirect calls. The previous fix in
2b08f6af62 was incorrect because it was
only taking the maximum in the known call graph, and missing something
that was either outside of it or codegened later.

This fixes a second case I discovered where calls to aliases also did
not work as expected. CallGraphAnalysis misses these, so functions
called through aliases were not codegened ahead of callers as
expected. CallGraphAnalysis should probably be fixed to understand
this case, and there's likely a bug with IPRA here. This fixes
numerous failures in the conformance test at -O0.
2022-02-04 15:56:03 -05:00
Sebastian Neubauer 4a02562275 [AMDGPU] Lazily init pal metadata on first function
Delay reading global metadata until the first function or the end of
the file is emitted. That way, earlier module passes can set metadata
that is emitted in the ELF.

`emitStartOfAsmFile` gets called when the passes are initialized,
which prevented earlier passes from changing the metadata.

This fixes issues encountered after converting
AMDGPUResourceUsageAnalysis to a Module pass in D117504.

Differential Revision: https://reviews.llvm.org/D118492
2022-02-04 18:39:35 +01:00
Jay Foad a456ace9c1 [AMDGPU] SILoadStoreOptimizer: rewrite checkAndPrepareMerge. NFCI.
Separate the function clearly into:
- Checks that can be done on CI and Paired before the loop.
- The loop over all instructions between CI and Paired.
- Checks that must be done on InstsToMove after the loop.

Previously these were mostly done inside the loop in a very
confusing way.

Differential Revision: https://reviews.llvm.org/D118994
2022-02-04 17:17:29 +00:00
Jay Foad 001cb43159 [AMDGPU] SILoadStoreOptimizer: fewer calls to offsetsCanBeCombined
Only call offsetsCanBeCombined with Modify = true in cases
where it will really do something. NFC.
2022-02-04 14:52:11 +00:00
Jay Foad 00bbda07ae [AMDGPU] SILoadStoreOptimizer: simplify class/subclass checks
Also add a comment explaining the difference between class
and subclass. NFCI.
2022-02-04 14:08:04 +00:00
Jay Foad 33ef8bdf36 [AMDGPU] SILoadStoreOptimizer: simplify optimizeInstsWithSameBaseAddr
Common up all the calls to CI.setMI. NFCI.
2022-02-04 13:25:05 +00:00
Jay Foad ca05edd927 [AMDGPU] SILoadStoreOptimizer: simplify OptimizeListAgain test
At this point CI represents the combined access (original CI combined
with Paired) so it doesn't make any sense to add in Paired.width again.
NFCI.
2022-02-04 13:02:19 +00:00
serge-sans-paille ffe8720aa0 Reduce dependencies on llvm/BinaryFormat/Dwarf.h
This header is very large (3M lines once expanded) and was included in locations
where DWARF-specific information was not needed.

More specifically, this commit suppresses the dependencies on
llvm/BinaryFormat/Dwarf.h in two headers: llvm/IR/IRBuilder.h and
llvm/IR/DebugInfoMetadata.h. As these headers (esp. the former) are widely used,
this has a decent impact on number of preprocessed lines generated during
compilation of LLVM, as showcased below.

This is achieved by moving some definitions back to the .cpp file, no
performance impact implied[0].

As a consequence of that patch, downstream users may need to manually include
some extra files:

llvm/IR/IRBuilder.h no longer includes llvm/BinaryFormat/Dwarf.h
llvm/IR/DebugInfoMetadata.h no longer includes llvm/BinaryFormat/Dwarf.h

In some situations, code may be relying on the fact that
llvm/BinaryFormat/Dwarf.h was including llvm/ADT/Triple.h, this hidden
dependency now needs to be explicit.

$ clang++ -E  -Iinclude -I../llvm/include ../llvm/lib/Transforms/Scalar/*.cpp -std=c++14 -fno-rtti -fno-exceptions | wc -l
after:   10978519
before:  11245451

Related Discourse thread: https://llvm.discourse.group/t/include-what-you-use-include-cleanup
[0] https://llvm-compile-time-tracker.com/compare.php?from=fa7145dfbf94cb93b1c3e610582c495cb806569b&to=995d3e326ee1d9489145e20762c65465a9caeab4&stat=instructions

Differential Revision: https://reviews.llvm.org/D118781
2022-02-04 11:44:03 +01:00
Vang Thao 2ca194ff55 [AMDGPU] Fix scheduler live-ins with debug inst at start of block
GCNDownwardRPTracker RPTracker.reset() skips debug instructions for NextMI so RPTracker.getNext() will never give the beginning of a sched region if it is a debug value. In this case we will never set the live-ins for that block.

Add check to see if getNext also equals the MI after skipping debug instructions.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D118853
2022-02-03 12:41:32 -08:00
Stanislav Mekhanoshin 1d111090ad [AMDGPU] Fix windows build warning with IMMBitSelConst. NFC.
VS gives this warning for an integer constant:
AMDGPUGenDAGISel.inc(214687): warning C4334: '<<': result of 32-bit
shift implicitly converted to 64 bits (was 64-bit shift intended?)
2022-02-03 12:03:22 -08:00
Stanislav Mekhanoshin d3b87e4a1c [AMDGPU] HWRegs TMA and TBA also supported on gfx9
Differential Revision: https://reviews.llvm.org/D118860
2022-02-03 09:36:10 -08:00
Thomas Symalla 476babcc1d [AMDGPU] Introduce new ISel combine for trunc-slr patterns
In some cases, when selecting a (trunc (slr)) pattern, the slr gets translated
to a v_lshrrev_b32_e64 instruction whereas the truncation gets selected to
a sequence of v_and_b32_e64 and v_cmp_eq_u32_e64. In the final ISA, this appears
as selecting the nth-bit:

v_lshrrev_b32_e32 v0, 2, v1
v_and_b32_e32 v0, 1, v0
v_cmp_eq_u32_e32 vcc_lo, 1, v0

However, when the value used in the right shift is known at compilation time, the
whole sequence can be reduced to two VALUs when the constant operand in the v_and is adjusted to (1 << lshrrev_operand):

v_and_b32_e32 v0, (1 << 2), v1
v_cmp_ne_u32_e32 vcc_lo, 0, v0

In the example above, the following pseudo-code:

v0 = (v1 >> 2)
v0 = v0 & 1
vcc_lo = (v0 == 1)

would be translated to:

v0 = v1 & 0b100
vcc_lo = (v0 == 0b100)

which should yield an equivalent result.
This is a little bit hard to test as one needs to force the SelectionDAG to
contain the nodes before instruction selection, but the test sequence was
roughly derived from a production shader.
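
The equivalence claimed above can be checked on plain integers. The sketch below is not the ISel pattern itself; the function names are hypothetical, and it only models the two instruction sequences as 32-bit arithmetic, assuming the shift amount is a compile-time constant below 32.

```cpp
#include <cstdint>

// Models the original three-instruction sequence:
//   v0 = v1 >> N;  v0 = v0 & 1;  vcc = (v0 == 1)
constexpr bool bitSetViaShift(uint32_t V, unsigned N) {
  return ((V >> N) & 1u) == 1u;
}

// Models the reduced two-instruction sequence:
//   v0 = v1 & (1 << N);  vcc = (v0 != 0)
// Comparing against (1 << N) for equality, as in the pseudo-code above,
// gives the same answer because the AND can only produce 0 or (1 << N).
constexpr bool bitSetViaMask(uint32_t V, unsigned N) {
  return (V & (1u << N)) != 0u;
}

// Spot checks for the N = 2 case used in the example above.
static_assert(bitSetViaShift(0x4u, 2) && bitSetViaMask(0x4u, 2),
              "bit 2 set: both forms report true");
static_assert(!bitSetViaShift(0x3u, 2) && !bitSetViaMask(0x3u, 2),
              "bit 2 clear: both forms report false");
```

Both forms test the same single bit, so the shift can be folded into the AND mask whenever the shift amount is known.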

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D118461
2022-02-03 18:06:44 +01:00
Jay Foad b9cf52bc3d [AMDGPU] Simplify AMDGPUAnnotateUniformValues::visitLoadInst
Always set uniform metadata on the pointer if it is an instruction, but
otherwise do not bother to create a trivial getelementptr instruction,
because AMDGPUInstrInfo::isUniformMMO can already detect that various
non-instruction pointers are uniform.

Most of the test case churn is from tests that used undef as a pointer,
which AMDGPUInstrInfo::isUniformMMO treats as uniform.

Differential Revision: https://reviews.llvm.org/D118909
2022-02-03 16:27:48 +00:00
Matt Arsenault d6fdbbcace AMDGPU: Add second emergency slot for SGPR to vmem for large frames
In a future change, we will sometimes use a VGPR offset for doing
spills to memory, in which case we need 2 free VGPRs to do the SGPR
spill. In most cases we could spill the VGPR along with the SGPR being
spilled, but we don't have any free lanes for SGPR_1024 in wave32 so
we could still potentially need a second scavenging slot.
2022-02-02 19:05:05 -05:00
Matt Arsenault 245e25f9c3 AMDGPU: Implement isAsmClobberable
Warn on inline assembly clobbering reserved registers. It should also
warn on at least some reserved register defs, but that isn't happening
right now. If you have a def and re-use of a register we reserve, the
register coalescer will eliminate the intermediate virtual
register. When the reserved reg def is introduced later by the
backend, it will end up clobbering the value the register coalescer
assumed was live through the range.

There is also isInlineAsmReadOnlyReg, although I don't understand what
the distinction really is. It's called in SelectionDAGBuilder, long
before the set of reserved registers is frozen so I'm not sure how
that can possibly work reliably.

Unfortunately this is also using the ugly tablegenerated names for the
registers.
2022-02-02 14:20:12 -05:00
Jay Foad ddd3807e69 [AMDGPU] Use new target MMO flag MONoClobber
This allows us to set the noclobber flag on (the MMO of) a load
instruction instead of on the pointer. This fixes a bug where noclobber
was being applied to all loads from the same pointer, even if some of
them were clobbered.

Differential Revision: https://reviews.llvm.org/D118775
2022-02-02 17:12:36 +00:00
serge-sans-paille e188aae406 Cleanup header dependencies in LLVMCore
Based on the output of include-what-you-use.

This is a big chunk of changes. It is very likely to break downstream code
unless they took a lot of care in avoiding hidden header dependencies, something
the LLVM codebase doesn't do that well :-/

I've tried to summarize the biggest changes below:

- llvm/include/llvm-c/Core.h: no longer includes llvm-c/ErrorHandling.h
- llvm/IR/DIBuilder.h no longer includes llvm/IR/DebugInfo.h
- llvm/IR/IRBuilder.h no longer includes llvm/IR/IntrinsicInst.h
- llvm/IR/LLVMRemarkStreamer.h no longer includes llvm/Support/ToolOutputFile.h
- llvm/IR/LegacyPassManager.h no longer includes llvm/Pass.h
- llvm/IR/Type.h no longer includes llvm/ADT/SmallPtrSet.h
- llvm/IR/PassManager.h no longer includes llvm/Pass.h nor llvm/Support/Debug.h

And the usual count of preprocessed lines:
$ clang++ -E  -Iinclude -I../llvm/include ../llvm/lib/IR/*.cpp -std=c++14 -fno-rtti -fno-exceptions | wc -l
before: 6400831
after:  6189948

200k lines less to process is not that bad ;-)

Discourse thread on the topic: https://llvm.discourse.group/t/include-what-you-use-include-cleanup

Differential Revision: https://reviews.llvm.org/D118652
2022-02-02 06:54:20 +01:00
Jacob Lambert a24ff176a6 [AMDGPU][NFC] Fixing formatting
Differential Revision: https://reviews.llvm.org/D117801
2022-02-01 17:59:01 -08:00
Stanislav Mekhanoshin 79606ee85c [AMDGPU] Check atomics aliasing in the clobbering annotation
MemorySSA considers any atomic a def to any operation it dominates
just like a barrier or fence. That is correct from memory state
perspective, but not required for the no-clobber metadata since
we are not using it for reordering. Skip such atomics during the
scan just like a barrier if it does not alias with the load.

Differential Revision: https://reviews.llvm.org/D118661
2022-02-01 12:33:25 -08:00
Stanislav Mekhanoshin c2b18a3cc5 [AMDGPU] Allow scalar loads after barrier
Currently we cannot convert a vector load into scalar if there
is dominating barrier or fence. It is considered a clobbering
memory access to prevent memory operations reordering. While
reordering is not possible the actual memory is not being clobbered
by a barrier or fence and we can still use a scalar load for a
uniform pointer.

The solution is not to bail on a first clobbering access but
traverse MemorySSA to the root excluding barriers and fences.

Differential Revision: https://reviews.llvm.org/D118419
2022-02-01 11:43:17 -08:00
Changpeng Fang 1194b9cdda AMDGPU {NFC}: Add code object v5 support and generate metadata for implicit kernel args
Summary:
  Add code object v5 support (default is still v4)
  Generate metadata for implicit kernel args for the new ABI
  Set the metadata version to be 1.2

Reviewers:
  t-tye, b-sumner, arsenm, and bcahoon

Fixes:
  SWDEV-307188, SWDEV-307189

Differential Revision:
  https://reviews.llvm.org/D118272
2022-01-31 18:07:47 -08:00
Jay Foad 0dcc8b86ee [AMDGPU] AMDGPUAnnotateUniformValues: inline a single-use lambda. NFC. 2022-01-31 11:22:09 +00:00
Jay Foad 68e3946270 [AMDGPU] SILoadStoreOptimizer: break lists on instructions with side effects
This just helps to keep the lists shorter and faster to sort. NFCI.

Differential Revision: https://reviews.llvm.org/D118384
2022-01-28 18:03:42 +00:00
Jay Foad 4b133cee80 [AMDGPU] SILoadStoreOptimizer: reject AGPR DS_WRITE sooner
Rejecting AGPR DS_WRITE instructions before adding them to any mergeable
list seems cleaner than adding them to the list and rejecting them
later.

Differential Revision: https://reviews.llvm.org/D118368
2022-01-27 18:20:46 +00:00
Jay Foad 94a4594c54 [AMDGPU] SILoadStoreOptimizer: use separate lists for AGPR instructions
Using separate lists for AGPR and non-AGPR instructions seems like a
cleaner solution than putting them all in the same list and then later
refusing to merge instructions of different AGPR-ness.

Differential Revision: https://reviews.llvm.org/D118367
2022-01-27 18:20:46 +00:00
Jay Foad 8a52fef1e0 [AMDGPU] SILoadStoreOptimizer: tweak API of CombineInfo::setMI. NFC.
Change CombineInfo::setMI to take a reference to the
SILoadStoreOptimizer instance, for easy access to common fields like
TII and STM.

Differential Revision: https://reviews.llvm.org/D118366
2022-01-27 18:20:46 +00:00
Matt Arsenault 33b45ee44b AMDGPU: Handle addrspacecast of constant 32-bit to flat
I accidentally made this work on the GlobalISel path, and there's no
real reason not to handle this.
2022-01-27 11:01:44 -05:00
Matt Arsenault aa88b65392 AMDGPU/GlobalISel: Fix assert on invalid cond code for llvm.amdgcn.icmp 2022-01-27 10:34:06 -05:00
Matt Arsenault f482e86980 AMDGPU/GlobalISel: Fix flat_scratch_init handling for shaders
I don't think this is actually defined for mesa, but this is what we
were doing on the DAG path.
2022-01-27 10:20:52 -05:00
Jay Foad 185cb8e82c [AMDGPU] SILoadStoreOptimizer: Allow merging across a swizzled access
Swizzled accesses are not merged, but there is no particular reason not
to merge two instructions if any of the intervening instructions happens
to be a swizzled access.

This moves the check for swizzled accesses out of checkAndPrepareMerge
into collectMergeableInsts where I think it makes more sense.

Differential Revision: https://reviews.llvm.org/D118267
2022-01-27 14:40:58 +00:00
Jay Foad 95857a7058 [AMDGPU] SILoadStoreOptimizer: Remove redundant check for volatile
SILoadStoreOptimizer::collectMergeableInsts already ends the current
block if it sees a volatile (or ordered) memory access, so there is no
need to check for them again when scanning the instructions between two
pairing candidates in a block.

Differential Revision: https://reviews.llvm.org/D118266
2022-01-27 10:14:53 +00:00
Stanislav Mekhanoshin 409c4436f9 [AMDGPU] Validate dst and src2 non-overlapping restriction in asm
Differential Revision: https://reviews.llvm.org/D118089
2022-01-26 15:14:06 -08:00
Stanislav Mekhanoshin dbf278b984 [AMDGPU] Prevent aliasing of SrcC and Dst in MAI
From the MAI spec: It’s ok that Src_C and vDst are the exact same VGPRs
or Src_C and vDst are completely separated. The case that Src_C and vDst
are overlapping should be avoided as a new value could be written to the
accumulator input before it gets read.

Note that this inevitably increases register pressure to the point where
some programs will become uncompilable.

This patch separates MAC and FMA versions of MFMA instructions using either
tied dst and src2 or earlyclobber dst.
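
The restriction quoted from the spec can be stated as a small predicate. This is only an illustrative sketch, not code from the patch: the function names are hypothetical, and it assumes both operands are tuples of NumRegs consecutive VGPRs identified by their first register index.

```cpp
// True if the two register tuples share no register at all.
constexpr bool tuplesDisjoint(unsigned SrcCStart, unsigned DstStart,
                              unsigned NumRegs) {
  return SrcCStart + NumRegs <= DstStart || DstStart + NumRegs <= SrcCStart;
}

// Allowed placements: Src_C and vDst are exactly the same registers, or they
// are completely separated. A partial overlap is rejected.
constexpr bool isSrcCDstPlacementOk(unsigned SrcCStart, unsigned DstStart,
                                    unsigned NumRegs) {
  return SrcCStart == DstStart ||
         tuplesDisjoint(SrcCStart, DstStart, NumRegs);
}

static_assert(isSrcCDstPlacementOk(0, 0, 4), "identical tuples are fine");
static_assert(isSrcCDstPlacementOk(0, 4, 4), "disjoint tuples are fine");
static_assert(!isSrcCDstPlacementOk(0, 2, 4), "partial overlap is rejected");
```

In these terms, a tied dst/src2 corresponds to the identical case, and an earlyclobber dst keeps the allocator from producing the partial-overlap case.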

Fixes: SWDEV-318900

Differential Revision: https://reviews.llvm.org/D117844
2022-01-26 14:48:20 -08:00
Matt Arsenault 045be6ff36 AMDGPU/GlobalISel: Fold wave address into mubuf addressing modes 2022-01-26 15:25:26 -05:00
Matt Arsenault 09fc311af7 AMDGPU/GlobalISel: Mostly fix BFI patterns
Most importantly, fixes constant bus errors in the 64-bit cases. It's
surprising to me these were even passing the selection test using
SReg_* sources. Also fixes pattern matching in the 32-bit cases, with
simple operands.

These patterns aren't working in a few cases, like with mixed SGPR
inputs. The patterns aren't looking through the SGPR->VGPR copies like
they need to. The vector cases also have some unmerges of build_vector
which are obscuring the inputs.
2022-01-26 15:06:50 -05:00
Matt Arsenault e6564f39c7 AMDGPU: Emit user sgpr count directives in text asm
We were emitting these in the object file but not printing them.
2022-01-26 13:51:12 -05:00
Konstantina aa418b9133 [AMDGPU][SIWholeQuadMode] Use the right VCC register to activate the correct lanes.
Reviewed By: critson

Differential Revision: https://reviews.llvm.org/D118096
2022-01-26 08:54:39 -08:00
Stanislav Mekhanoshin 4e077c0a0b [AMDGPU] Remove feature register-banking
Since the RegBankReassign pass was removed, this feature is not
used for anything.

Differential Revision: https://reviews.llvm.org/D118195
2022-01-26 08:39:17 -08:00
Benjamin Kramer f15014ff54 Revert "Rename llvm::array_lengthof into llvm::size to match std::size from C++17"
This reverts commit ef82063207.

- It conflicts with the existing llvm::size in STLExtras, which will now
  never be called.
- Calling it without llvm:: breaks C++17 compat
2022-01-26 16:55:53 +01:00
serge-sans-paille ef82063207 Rename llvm::array_lengthof into llvm::size to match std::size from C++17
As a consequence, move llvm::array_lengthof from STLExtras.h to
STLForwardCompat.h (which is included by STLExtras.h so no build
breakage expected).
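
For context, a helper of this kind deduces the element count of a built-in array from its type. The sketch below uses a hypothetical name (`arrayLengthSketch`) and is written for illustration; it is not the LLVM definition. The rename itself was reverted shortly afterwards (see f15014ff54 above) because of the clash with the existing llvm::size.

```cpp
#include <cstddef>

// Hypothetical equivalent of an array-length helper: the array is taken by
// reference so that N can be deduced from its type at compile time.
template <typename T, std::size_t N>
constexpr std::size_t arrayLengthSketch(T (&)[N]) noexcept {
  return N;
}

static const int ExampleTable[] = {1, 2, 3, 5, 8};
static_assert(arrayLengthSketch(ExampleTable) == 5,
              "length is deduced from the array type");
```

With C++17 available, std::size from the standard library accepts built-in arrays in the same way.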
2022-01-26 16:17:45 +01:00
Nikita Popov a5e324e3e2 [AMDGPUHSAMetadataStreamer] Do not assume ABI alignment for pointers
AMDGPUHSAMetadataStreamer currently assumes that pointer arguments
without align attribute have ABI alignment of the pointee type.
This is incompatible with opaque pointers, but also plain incorrect:
Pointer arguments without explicit alignment have alignment 1. It is
the responsibility of the frontend to add correct align annotations.

Differential Revision: https://reviews.llvm.org/D118229
2022-01-26 15:45:14 +01:00
alex-t 5157f984ae [AMDGPU] Enable divergence-driven XNOR selection
Currently the not (xor_one_use) pattern is always selected to S_XNOR irrespective of the node divergence.
This relies on a further custom selection pass which converts to VALU if necessary and replaces with V_NOT_B32 ( V_XOR_B32)
on those targets which have no V_XNOR.
The current change enables the patterns which explicitly select the not (xor_one_use) to the appropriate form.
We assume that xor (not) is already turned into the not (xor) by the combiner.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D116270
2022-01-26 15:33:10 +03:00
Sebastian Neubauer 4ed7c6eec9 [AMDGPU] Only match correct type for a16
Addresses are floats when a sampler is present and unsigned integers
when no sampler is present.

Therefore, only zext instructions, not sext instructions should match.

Also match integer constants that can be truncated.

Differential Revision: https://reviews.llvm.org/D118043
2022-01-25 14:59:16 +01:00
Nikita Popov aa97bc116d [NFC] Remove uses of PointerType::getElementType()
Instead use either Type::getPointerElementType() or
Type::getNonOpaquePointerElementType().

This is part of D117885, in preparation for deprecating the API.
2022-01-25 09:44:52 +01:00
Stanislav Mekhanoshin bb1fe36977 [AMDGPU] Make v8i16/v8f16 legal
Differential Revision: https://reviews.llvm.org/D117721
2022-01-24 11:51:08 -08:00
Stanislav Mekhanoshin c27f8fb968 [AMDGPU] Remove cndmask from readsExecAsData
Differential Revision: https://reviews.llvm.org/D117909
2022-01-24 11:24:47 -08:00
Sebastian Neubauer 80532ebb50 [AMDGPU][InstCombine] Remove zero image offset
Remove the offset parameter if it is zero.

Differential Revision: https://reviews.llvm.org/D117876
2022-01-24 18:06:33 +01:00
Matt Arsenault 18aabae8e2 AMDGPU: Fix assertion on fixed stack objects with VGPR->AGPR spills
These have negative / out of bounds frame index values and would
assert when trying to set the BitVector. Fixed stack objects can't be
colored away so ignore them.
2022-01-24 09:45:41 -05:00
Matt Arsenault 99e8e17313 Reapply "Revert "GlobalISel: Add G_ASSERT_ALIGN hint instruction"
This reverts commit a97e20a3a8.
2022-01-24 09:26:52 -05:00
serge-sans-paille 5f290c090a Move STLFunctionalExtras out of STLExtras
Only using that change in StringRef already decreases the number of
preprocessed lines from 7837621 to 7776151 for LLVMSupport

Perhaps more interestingly, it shows that many files were relying on the
inclusion of StringRef.h to have the declarations from STLExtras.h. This
patch tries hard to patch the relevant parts of llvm-project impacted by this
hidden dependency removal.

Potential impact:
- "llvm/ADT/StringRef.h" no longer includes <memory>,
  "llvm/ADT/Optional.h" nor "llvm/ADT/STLExtras.h"

Related Discourse thread:
https://llvm.discourse.group/t/include-what-you-use-include-cleanup/5831
2022-01-24 14:13:21 +01:00
Sebastian Neubauer f1e36474b9 [AMDGPU][NFC] Fix debug prints
Print the instructions instead of pointers.
2022-01-24 13:55:00 +01:00
Sebastian Neubauer ae2f9c8be8 [AMDGPU] Remove lz and nomip combine from codegen
These combines have been moved into the IR combiner in D116042.

Differential Revision: https://reviews.llvm.org/D116116
2022-01-21 12:09:08 +01:00
Sebastian Neubauer 603d18033c [AMDGPU][InstCombine] Remove zero LOD bias
If the bias is zero, we can remove it from the image instruction.
Also copy other image optimizations (l->lz, mip->nomip) to IR combines.

Differential Revision: https://reviews.llvm.org/D116042
2022-01-21 12:09:07 +01:00
Sebastian Neubauer 0530fdbbbb [AMDGPU] Fix LOD bias in A16 combine
As in the codegen fix in D111754, the LOD bias needs to be converted to 16
bits. Fix this in the combine.

Differential Revision: https://reviews.llvm.org/D116038
2022-01-21 12:09:06 +01:00
Stanislav Mekhanoshin 41ebd19681 [AMDGPU] Do not ignore exec use where exec is read as data
Compares, v_cndmask_b32, and v_readfirstlane_b32 use EXEC
in a way which modifies the result. This implicit EXEC use
shall not be ignored for the purposes of instruction moves.

Differential Revision: https://reviews.llvm.org/D117814
2022-01-20 14:05:22 -08:00
Matt Arsenault 064cea9c9a AMDGPU/GlobalISel: Try to use s_and_b64 in ptrmask selection
Avoids a test diff with SDAG.
2022-01-20 12:56:53 -05:00
Matt Arsenault 2e49e0cfde AMDGPU/GlobalISel: Directly diagnose return value use for FP atomics
Emit an error if the return value is used on subtargets that do not
support it. Previously we were falling back to the DAG on selection
failure, where it would emit this error and then fail again.
2022-01-20 12:46:45 -05:00
Matt Arsenault be7e938e27 AMDGPU/GlobalISel: Stop handling llvm.amdgcn.buffer.atomic.fadd
This code is not structured to handle the legacy buffer intrinsics and
was miscompiling them.
2022-01-20 12:12:06 -05:00