=============================
User Guide for AMDGPU Backend
=============================

.. contents::
   :local:

Introduction
============

The AMDGPU backend provides ISA code generation for AMD GPUs, from the
R600 family through the current GCN families. It lives in the
``llvm/lib/Target/AMDGPU`` directory.

LLVM
====

.. _amdgpu-target-triples:

Target Triples
--------------

Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
specify the target triple:

  .. table:: AMDGPU Architectures
     :name: amdgpu-architecture-table

     ============ ==============================================================
     Architecture Description
     ============ ==============================================================
     ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
     ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
     ============ ==============================================================

  .. table:: AMDGPU Vendors
     :name: amdgpu-vendor-table

     ============ ==============================================================
     Vendor       Description
     ============ ==============================================================
     ``amd``      Can be used for all AMD GPU usage.
     ``mesa3d``   Can be used if the OS is ``mesa3d``.
     ============ ==============================================================

  .. table:: AMDGPU Operating Systems
     :name: amdgpu-os-table

     ============== ============================================================
     OS             Description
     ============== ============================================================
     *<empty>*      Defaults to the *unknown* OS.
     ``amdhsa``     Compute kernels executed on HSA [HSA]_ compatible runtimes
                    such as AMD's ROCm [AMD-ROCm]_.
     ``amdpal``     Graphics shaders and compute kernels executed on AMD PAL
                    runtime.
     ``mesa3d``     Graphics shaders and compute kernels executed on Mesa 3D
                    runtime.
     ============== ============================================================

  .. table:: AMDGPU Environments
     :name: amdgpu-environment-table

     ============ ==============================================================
     Environment  Description
     ============ ==============================================================
     *<empty>*    Default.
     ============ ==============================================================

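As an illustration of how the components in the tables above combine, the following is a minimal sketch that validates the parts and joins them into a triple string. The helper names ``make_triple`` and ``clang_target_args`` are hypothetical, and dropping trailing empty components (so that an empty environment yields ``amdgcn-amd-amdhsa``) is an assumption of this sketch rather than something stated above.

```python
# Sketch only: component sets are copied from the tables above.
ARCHITECTURES = {"r600", "amdgcn"}
VENDORS = {"amd", "mesa3d"}
OPERATING_SYSTEMS = {"", "amdhsa", "amdpal", "mesa3d"}  # "" is the *unknown* OS
ENVIRONMENTS = {""}  # only the default (empty) environment is listed


def make_triple(arch, vendor, os_name="", environment=""):
    """Join validated components as <Architecture>-<Vendor>-<OS>-<Environment>."""
    if arch not in ARCHITECTURES:
        raise ValueError(f"unknown architecture: {arch!r}")
    if vendor not in VENDORS:
        raise ValueError(f"unknown vendor: {vendor!r}")
    if os_name not in OPERATING_SYSTEMS:
        raise ValueError(f"unknown OS: {os_name!r}")
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    # The vendor table says ``mesa3d`` can be used if the OS is ``mesa3d``.
    if vendor == "mesa3d" and os_name != "mesa3d":
        raise ValueError("vendor 'mesa3d' requires OS 'mesa3d'")
    parts = [arch, vendor, os_name, environment]
    # Assumption: trailing empty components are omitted from the string.
    while parts and parts[-1] == "":
        parts.pop()
    return "-".join(parts)


def clang_target_args(triple):
    """Return the argument list for ``clang -target <triple>``."""
    return ["clang", "-target", triple]


if __name__ == "__main__":
    print(" ".join(clang_target_args(make_triple("amdgcn", "amd", "amdhsa"))))
```

Rejecting unknown components early keeps a misspelled triple from silently falling back to the *unknown* OS.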
.. _amdgpu-processors:

Processors
----------

Use the ``clang -mcpu <Processor>`` option to specify the AMDGPU processor. Names
from both the *Processor* and *Alternative Processor* columns can be used.

  .. table:: AMDGPU Processors
     :name: amdgpu-processor-table

     =========== =============== ============ ===== ================= ======= ======================
     Processor   Alternative     Target       dGPU/ Target            ROCm    Example
                 Processor       Triple       APU   Features          Support Products
                                 Architecture       Supported
                                                    [Default]
     =========== =============== ============ ===== ================= ======= ======================
     **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
     -----------------------------------------------------------------------------------------------
     ``r600``                    ``r600``     dGPU
     ``r630``                    ``r600``     dGPU
     ``rs880``                   ``r600``     dGPU
     ``rv670``                   ``r600``     dGPU
     **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
     -----------------------------------------------------------------------------------------------
     ``rv710``                   ``r600``     dGPU
     ``rv730``                   ``r600``     dGPU
     ``rv770``                   ``r600``     dGPU
     **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
     -----------------------------------------------------------------------------------------------
     ``cedar``                   ``r600``     dGPU
     ``cypress``                 ``r600``     dGPU
     ``juniper``                 ``r600``     dGPU
     ``redwood``                 ``r600``     dGPU
     ``sumo``                    ``r600``     dGPU
     **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
     -----------------------------------------------------------------------------------------------
     ``barts``                   ``r600``     dGPU
     ``caicos``                  ``r600``     dGPU
     ``cayman``                  ``r600``     dGPU
     ``turks``                   ``r600``     dGPU
     **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
     -----------------------------------------------------------------------------------------------
     ``gfx600``  - ``tahiti``    ``amdgcn``   dGPU
     ``gfx601``  - ``hainan``    ``amdgcn``   dGPU
                 - ``oland``
                 - ``pitcairn``
                 - ``verde``
     **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
     -----------------------------------------------------------------------------------------------
     ``gfx700``  - ``kaveri``    ``amdgcn``   APU                             - A6-7000
                                                                              - A6 Pro-7050B
                                                                              - A8-7100
                                                                              - A8 Pro-7150B
                                                                              - A10-7300
                                                                              - A10 Pro-7350B
                                                                              - FX-7500
                                                                              - A8-7200P
                                                                              - A10-7400P
                                                                              - FX-7600P
     ``gfx701``  - ``hawaii``    ``amdgcn``   dGPU                    ROCm    - FirePro W8100
                                                                              - FirePro W9100
                                                                              - FirePro S9150
                                                                              - FirePro S9170
     ``gfx702``                  ``amdgcn``   dGPU                    ROCm    - Radeon R9 290
                                                                              - Radeon R9 290x
                                                                              - Radeon R390
                                                                              - Radeon R390x
     ``gfx703``  - ``kabini``    ``amdgcn``   APU                             - E1-2100
                 - ``mullins``                                                - E1-2200
                                                                              - E1-2500
                                                                              - E2-3000
                                                                              - E2-3800
                                                                              - A4-5000
                                                                              - A4-5100
                                                                              - A6-5200
                                                                              - A4 Pro-3340B
     ``gfx704``  - ``bonaire``   ``amdgcn``   dGPU                            - Radeon HD 7790
                                                                              - Radeon HD 8770
                                                                              - R7 260
                                                                              - R7 260X
     **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
     -----------------------------------------------------------------------------------------------
     ``gfx801``  - ``carrizo``   ``amdgcn``   APU   - xnack                   - A6-8500P
                                                      [on]                    - Pro A6-8500B
                                                                              - A8-8600P
                                                                              - Pro A8-8600B
                                                                              - FX-8800P
                                                                              - Pro A12-8800B
     \                           ``amdgcn``   APU   - xnack           ROCm    - A10-8700P
                                                      [on]                    - Pro A10-8700B
                                                                              - A10-8780P
     \                           ``amdgcn``   APU   - xnack                   - A10-9600P
                                                      [on]                    - A10-9630P
                                                                              - A12-9700P
                                                                              - A12-9730P
                                                                              - FX-9800P
                                                                              - FX-9830P
     \                           ``amdgcn``   APU   - xnack                   - E2-9010
                                                      [on]                    - A6-9210
                                                                              - A9-9410
     ``gfx802``  - ``iceland``   ``amdgcn``   dGPU  - xnack           ROCm    - FirePro S7150
                 - ``tonga``                          [off]                   - FirePro S7100
                                                                              - FirePro W7100
                                                                              - Radeon R285
                                                                              - Radeon R9 380
                                                                              - Radeon R9 385
                                                                              - Mobile FirePro
                                                                                M7170
     ``gfx803``  - ``fiji``      ``amdgcn``   dGPU  - xnack           ROCm    - Radeon R9 Nano
                                                      [off]                   - Radeon R9 Fury
                                                                              - Radeon R9 FuryX
                                                                              - Radeon Pro Duo
                                                                              - FirePro S9300x2
                                                                              - Radeon Instinct MI8
     \           - ``polaris10`` ``amdgcn``   dGPU  - xnack           ROCm    - Radeon RX 470
                                                      [off]                   - Radeon RX 480
                                                                              - Radeon Instinct MI6
     \           - ``polaris11`` ``amdgcn``   dGPU  - xnack           ROCm    - Radeon RX 460
                                                      [off]
     ``gfx810``  - ``stoney``    ``amdgcn``   APU   - xnack
                                                      [on]
     **GCN GFX9** [AMD-GCN-GFX9]_
     -----------------------------------------------------------------------------------------------
     ``gfx900``                  ``amdgcn``   dGPU  - xnack           ROCm    - Radeon Vega
                                                      [off]                     Frontier Edition
                                                                              - Radeon RX Vega 56
                                                                              - Radeon RX Vega 64
                                                                              - Radeon RX Vega 64
                                                                                Liquid
                                                                              - Radeon Instinct MI25
     ``gfx902``                  ``amdgcn``   APU   - xnack                   - Ryzen 3 2200G
                                                      [on]                    - Ryzen 5 2400G
     ``gfx904``                  ``amdgcn``   dGPU  - xnack                   *TBA*
                                                      [off]
                                                                              .. TODO::
                                                                                 Add product
                                                                                 names.
     ``gfx906``                  ``amdgcn``   dGPU  - xnack                   - Radeon Instinct MI50
                                                      [off]                   - Radeon Instinct MI60
     ``gfx908``                  ``amdgcn``   dGPU  - xnack                   *TBA*
                                                      [off]
                                                    - sram-ecc
                                                      [on]
     ``gfx909``                  ``amdgcn``   APU   - xnack                   *TBA* (Raven Ridge 2)
                                                      [on]
                                                                              .. TODO::
                                                                                 Add product
                                                                                 names.
     **GCN GFX10** [AMD-GCN-GFX10]_
     -----------------------------------------------------------------------------------------------
     ``gfx1010``                 ``amdgcn``   dGPU  - xnack                   *TBA*
                                                      [off]
                                                    - wavefrontsize64
                                                      [off]
                                                    - cumode
                                                      [off]
                                                                              .. TODO::
                                                                                 Add product
                                                                                 names.
     ``gfx1011``                 ``amdgcn``   dGPU  - xnack                   *TBA*
                                                      [off]
                                                    - wavefrontsize64
                                                      [off]
                                                    - cumode
                                                      [off]
                                                                              .. TODO::
                                                                                 Add product
                                                                                 names.
     ``gfx1012``                 ``amdgcn``   dGPU  - xnack                   *TBA*
                                                      [off]
                                                    - wavefrontsize64
                                                      [off]
                                                    - cumode
                                                      [off]
                                                                              .. TODO::
                                                                                 Add product
                                                                                 names.
     =========== =============== ============ ===== ================= ======= ======================

.. _amdgpu-target-features:

Target Features
---------------

Target features control how code is generated to support certain
processor-specific features. Not all target features are supported by
all processors. The runtime must ensure that the features supported by
the device used to execute the code match the features enabled when
generating the code. A mismatch of features may result in incorrect
execution, or a reduction in performance.

The target features supported by each processor, and the default value
used if not specified explicitly, are listed in
:ref:`amdgpu-processor-table`.

Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMDGPU
target features.

For example:

``-mxnack``
  Enable the ``xnack`` feature.
``-mno-xnack``
  Disable the ``xnack`` feature.

  .. table:: AMDGPU Target Features
     :name: amdgpu-target-feature-table

     ====================== ==================================================
     Target Feature         Description
     ====================== ==================================================
     -m[no-]xnack           Enable/disable generating code that has
                            memory clauses that are compatible with
                            having XNACK replay enabled.

                            This is used for demand paging and page
                            migration. If XNACK replay is enabled in
                            the device, then if a page fault occurs
                            the code may execute incorrectly if the
                            ``xnack`` feature is not enabled. Executing
                            code that has the feature enabled on a
                            device that does not have XNACK replay
                            enabled will execute correctly, but may
                            be less performant than code with the
                            feature disabled.

     -m[no-]sram-ecc        Enable/disable generating code that assumes SRAM
                            ECC is enabled/disabled.

     -m[no-]wavefrontsize64 Control the default wavefront size used when
                            generating code for kernels. When disabled
                            native wavefront size 32 is used, when enabled
                            wavefront size 64 is used.

     -m[no-]cumode          Control the default wavefront execution mode used
                            when generating code for kernels. When disabled
                            native WGP wavefront execution mode is used,
                            when enabled CU wavefront execution mode is used
                            (see :ref:`amdgpu-amdhsa-memory-model`).
     ====================== ==================================================

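The relationship between per-processor defaults and explicit ``-m[no-]<TargetFeature>`` options can be sketched as follows. The defaults for the four processors shown are copied from the processor table; the helper names ``effective_features`` and ``feature_flags`` are hypothetical and exist only for this illustration.

```python
# Sketch only: a small excerpt of the per-processor defaults in the
# processor table (True means [on], False means [off]).
DEFAULT_FEATURES = {
    "gfx900": {"xnack": False},
    "gfx902": {"xnack": True},
    "gfx908": {"xnack": False, "sram-ecc": True},
    "gfx1010": {"xnack": False, "wavefrontsize64": False, "cumode": False},
}


def effective_features(processor, overrides=None):
    """Merge explicit settings over the processor's defaults.

    A feature that the processor does not list is rejected, since not all
    target features are supported by all processors.
    """
    defaults = DEFAULT_FEATURES[processor]
    merged = dict(defaults)
    for name, enabled in (overrides or {}).items():
        if name not in defaults:
            raise ValueError(f"{processor} does not support feature {name!r}")
        merged[name] = bool(enabled)
    return merged


def feature_flags(features):
    """Render settings in the -m[no-]<TargetFeature> form."""
    return [f"-m{name}" if on else f"-mno-{name}" for name, on in features.items()]


if __name__ == "__main__":
    settings = effective_features("gfx1010", {"wavefrontsize64": True})
    print(feature_flags(settings))
```

A runtime could apply the same merge to check that the features compiled into a code object match what the executing device has enabled.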
.. _amdgpu-address-spaces:

Address Spaces
--------------

The AMDGPU architecture supports a number of memory address spaces. The address
space names use the OpenCL standard names, with some additions.

The AMDGPU address spaces correspond to architecture-specific LLVM address
space numbers used in LLVM IR.

The AMDGPU address spaces are described in
:ref:`amdgpu-address-spaces-table`. Only 64-bit process address spaces are
supported for the ``amdgcn`` target.

  .. table:: AMDGPU Address Spaces
     :name: amdgpu-address-spaces-table

     ================================= =============== =========== ================ ======= ============================
     ..                                                                                     64-Bit Process Address Space
     --------------------------------- --------------- ----------- ---------------- ------------------------------------
     Address Space Name                LLVM IR Address HSA Segment Hardware         Address NULL Value
                                       Space Number    Name        Name             Size
     ================================= =============== =========== ================ ======= ============================
     Generic                           0               flat        flat             64      0x0000000000000000
     Global                            1               global      global           64      0x0000000000000000
     Region                            2               N/A         GDS              32      *not implemented for AMDHSA*
     Local                             3               group       LDS              32      0xFFFFFFFF
     Constant                          4               constant    *same as global* 64      0x0000000000000000
     Private                           5               private     scratch          32      0x00000000
     Constant 32-bit                   6               *TODO*
     Buffer Fat Pointer (experimental) 7               *TODO*
     ================================= =============== =========== ================ ======= ============================

**Generic**
  The generic address space uses the hardware flat address support available in
  GFX7-GFX10. This uses two fixed ranges of virtual addresses (the private and
  local apertures), which lie outside the range of addressable global memory, to
  map from a flat address to a private or local address.

  FLAT instructions can take a flat address and access global, private
  (scratch), and group (LDS) memory depending on whether the address is within
  one of the aperture ranges. Flat access to scratch requires hardware aperture
  setup and setup in the kernel prologue (see
  :ref:`amdgpu-amdhsa-flat-scratch`). Flat access to LDS requires hardware
  aperture setup and M0 (GFX7-GFX8) register setup (see
  :ref:`amdgpu-amdhsa-m0`).

  To convert between a private or group address space address (termed a segment
  address) and a flat address, the base address of the corresponding aperture
  can be used. For GFX7-GFX8 these are available in the
  :ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained with
  the Queue Ptr SGPR (see
  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For GFX9-GFX10 the
  aperture base addresses are directly available as inline constant registers
  ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64-bit address
  mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32, which
  makes it easier to convert from flat to segment or segment to flat.

  A global address space address has the same value when used as a flat address
  so no conversion is needed.

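Because each aperture is 2^32 bytes in size and its base is aligned to 2^32, the conversion described for the generic address space reduces to bit operations on the upper and lower 32 bits. The following is a minimal sketch under those assumptions. The function names and the example base value are hypothetical, and the sketch deliberately ignores NULL translation (the address space table lists different NULL values for the local and generic address spaces).

```python
APERTURE_SIZE = 1 << 32  # aperture size in 64-bit address mode


def segment_to_flat(aperture_base, segment_address):
    """Place a 32-bit private or local address inside its aperture."""
    if aperture_base % APERTURE_SIZE != 0:
        raise ValueError("aperture base must be aligned to 2^32")
    if not 0 <= segment_address < APERTURE_SIZE:
        raise ValueError("segment address must fit in 32 bits")
    # The low 32 bits of an aligned base are zero, so OR acts like addition.
    return aperture_base | segment_address


def in_aperture(aperture_base, flat_address):
    """A flat address lies in the aperture when the upper 32 bits match."""
    return (flat_address >> 32) == (aperture_base >> 32)


def flat_to_segment(aperture_base, flat_address):
    """Recover the 32-bit segment address from a flat address."""
    if not in_aperture(aperture_base, flat_address):
        raise ValueError("flat address is outside the aperture")
    return flat_address & (APERTURE_SIZE - 1)


if __name__ == "__main__":
    base = 0x0001_0000_0000_0000  # hypothetical aperture base, for illustration
    flat = segment_to_flat(base, 0x1234)
    print(hex(flat), hex(flat_to_segment(base, flat)))
```

A flat address that falls in neither aperture is a global address and is used unchanged.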
**Global and Constant**
  The global and constant address spaces both use global virtual addresses,
  which are the same virtual address space used by the CPU. However, some
  virtual addresses may only be accessible to the CPU, some only accessible
  by the GPU, and some by both.

  Using the constant address space indicates that the data will not change
  during the execution of the kernel. This allows scalar read instructions to
  be used. Volatile data in the vector and scalar L1 caches is invalidated
  before each kernel dispatch executes, which allows constant memory to change
  values between kernel dispatches.

**Region**
  The region address space uses the hardware Global Data Store (GDS). All
  wavefronts executing on the same device will access the same memory for any
  given region address. However, the same region address accessed by wavefronts
  executing on different devices will access different memory. It offers higher
  performance than global memory and is allocated by the runtime. The data
  store (DS) instructions can be used to access it.

**Local**
  The local address space uses the hardware Local Data Store (LDS) which is
  automatically allocated when the hardware creates the wavefronts of a
  work-group, and freed when all the wavefronts of a work-group have
  terminated. All wavefronts belonging to the same work-group will access the
  same memory for any given local address. However, the same local address
  accessed by wavefronts belonging to different work-groups will access
  different memory. It offers higher performance than global memory. The data
  store (DS) instructions can be used to access it.

**Private**
  The private address space uses the hardware scratch memory support which
  automatically allocates memory when it creates a wavefront, and frees it when
  a wavefront terminates. The memory accessed by a lane of a wavefront for any
  given private address is different from the memory accessed by another lane
  of the same or different wavefront for the same private address.

  If a kernel dispatch uses scratch, then the hardware allocates memory from a
  pool of backing memory allocated by the runtime for each wavefront. The lanes
  of the wavefront access this using dword (4 byte) interleaving. The mapping
  used from private address to backing memory address is:

    ``wavefront-scratch-base +
    ((private-address / 4) * wavefront-size * 4) +
    (wavefront-lane-id * 4) + (private-address % 4)``

  If each lane of a wavefront accesses the same private address, the
  interleaving results in adjacent dwords being accessed and hence requires
  fewer cache lines to be fetched.

  There are different ways that the wavefront scratch base address is
  determined by a wavefront (see
  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

  Scratch memory can be accessed in an interleaved manner using buffer
  instructions with the scratch buffer descriptor and per wavefront scratch
  offset, by the scratch instructions, or by flat instructions. Multi-dword
  access is not supported except by flat and scratch instructions in
  GFX9-GFX10.

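The private-address mapping given above can be written out directly. The following sketch transcribes the formula into a function; the name ``scratch_backing_address`` and its parameter names are hypothetical, and the division is taken to be integer division.

```python
def scratch_backing_address(wavefront_scratch_base, private_address,
                            wavefront_lane_id, wavefront_size):
    """Map a lane's private address to its backing memory address.

    Transcribes:
      wavefront-scratch-base +
      ((private-address / 4) * wavefront-size * 4) +
      (wavefront-lane-id * 4) + (private-address % 4)
    """
    if not 0 <= wavefront_lane_id < wavefront_size:
        raise ValueError("lane id must be within the wavefront")
    dword_index, byte_in_dword = divmod(private_address, 4)
    return (wavefront_scratch_base
            + dword_index * wavefront_size * 4
            + wavefront_lane_id * 4
            + byte_in_dword)


if __name__ == "__main__":
    # Lanes 0-3 of a 64-wide wavefront reading private address 8 touch
    # consecutive dwords of backing memory.
    print([scratch_backing_address(0, 8, lane, 64) for lane in range(4)])
    # prints [512, 516, 520, 524]
```

The output shows the interleaving: the same private address in neighbouring lanes lands 4 bytes apart, while consecutive dwords within one lane are ``wavefront-size * 4`` bytes apart.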
**Constant 32-bit**
  *TODO*

**Buffer Fat Pointer**
  The buffer fat pointer is an experimental address space that is currently
  unsupported in the backend. It exposes a non-integral pointer that is
  intended, in the future, to support the modelling of 128-bit buffer
  descriptors plus a 32-bit offset into the buffer (in total encapsulating a
  160-bit *pointer*), allowing normal LLVM load/store/atomic operations to be
  used to model the buffer descriptors used heavily in graphics workloads
  targeting the backend.

.. _amdgpu-memory-scopes:
 | 
						||
 | 
						||
Memory Scopes
 | 
						||
-------------
 | 
						||
 | 
						||
This section describes the LLVM memory synchronization scopes supported by the
AMDGPU backend memory model when the target triple OS is ``amdhsa`` (see
:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
 | 
						||
 | 
						||
The memory model supported is based on the HSA memory model [HSA]_ which is
 | 
						||
based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
 | 
						||
relation is transitive over the synchronizes-with relation independent of scope,
 | 
						||
and synchronizes-with allows the memory scope instances to be inclusive (see
 | 
						||
table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
 | 
						||
 | 
						||
This differs from the OpenCL [OpenCL]_ memory model, which does not have scope
inclusion and requires the memory scopes to match exactly. However, the memory
model described here is conservatively correct for OpenCL.
 | 
						||
 | 
						||
  .. table:: AMDHSA LLVM Sync Scopes
 | 
						||
     :name: amdgpu-amdhsa-llvm-sync-scopes-table
 | 
						||
 | 
						||
     ======================= ===================================================
 | 
						||
     LLVM Sync Scope         Description
 | 
						||
     ======================= ===================================================
 | 
						||
     *none*                  The default: ``system``.
 | 
						||
 | 
						||
                             Synchronizes with, and participates in modification
 | 
						||
                             and seq_cst total orderings with, other operations
 | 
						||
                             (except image operations) for all address spaces
 | 
						||
                             (except private, or generic that accesses private)
 | 
						||
                             provided the other operation's sync scope is:
 | 
						||
 | 
						||
                             - ``system``.
 | 
						||
                             - ``agent`` and executed by a thread on the same
 | 
						||
                               agent.
 | 
						||
                             - ``workgroup`` and executed by a thread in the
 | 
						||
                               same work-group.
 | 
						||
                             - ``wavefront`` and executed by a thread in the
 | 
						||
                               same wavefront.
 | 
						||
 | 
						||
     ``agent``               Synchronizes with, and participates in modification
 | 
						||
                             and seq_cst total orderings with, other operations
 | 
						||
                             (except image operations) for all address spaces
 | 
						||
                             (except private, or generic that accesses private)
 | 
						||
                             provided the other operation's sync scope is:
 | 
						||
 | 
						||
                             - ``system`` or ``agent`` and executed by a thread
 | 
						||
                               on the same agent.
 | 
						||
                             - ``workgroup`` and executed by a thread in the
 | 
						||
                               same work-group.
 | 
						||
                             - ``wavefront`` and executed by a thread in the
 | 
						||
                               same wavefront.
 | 
						||
 | 
						||
     ``workgroup``           Synchronizes with, and participates in modification
 | 
						||
                             and seq_cst total orderings with, other operations
 | 
						||
                             (except image operations) for all address spaces
 | 
						||
                             (except private, or generic that accesses private)
 | 
						||
                             provided the other operation's sync scope is:
 | 
						||
 | 
						||
                             - ``system``, ``agent`` or ``workgroup`` and
 | 
						||
                               executed by a thread in the same work-group.
 | 
						||
                             - ``wavefront`` and executed by a thread in the
 | 
						||
                               same wavefront.
 | 
						||
 | 
						||
     ``wavefront``           Synchronizes with, and participates in modification
 | 
						||
                             and seq_cst total orderings with, other operations
 | 
						||
                             (except image operations) for all address spaces
 | 
						||
                             (except private, or generic that accesses private)
 | 
						||
                             provided the other operation's sync scope is:
 | 
						||
 | 
						||
                             - ``system``, ``agent``, ``workgroup`` or
 | 
						||
                               ``wavefront`` and executed by a thread in the
 | 
						||
                               same wavefront.
 | 
						||
 | 
						||
     ``singlethread``        Only synchronizes with, and participates in
 | 
						||
                             modification and seq_cst total orderings with,
 | 
						||
                             other operations (except image operations) running
 | 
						||
                             in the same thread for all address spaces (for
 | 
						||
                             example, in signal handlers).
 | 
						||
 | 
						||
     ``one-as``              Same as ``system`` but only synchronizes with other
 | 
						||
                             operations within the same address space.
 | 
						||
 | 
						||
     ``agent-one-as``        Same as ``agent`` but only synchronizes with other
 | 
						||
                             operations within the same address space.
 | 
						||
 | 
						||
     ``workgroup-one-as``    Same as ``workgroup`` but only synchronizes with
 | 
						||
                             other operations within the same address space.
 | 
						||
 | 
						||
     ``wavefront-one-as``    Same as ``wavefront`` but only synchronizes with
 | 
						||
                             other operations within the same address space.
 | 
						||
 | 
						||
     ``singlethread-one-as`` Same as ``singlethread`` but only synchronizes with
 | 
						||
                             other operations within the same address space.
 | 
						||
     ======================= ===================================================
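
The scope-inclusion rule in the table can be summarized as: two operations may
synchronize only if both threads belong to the same instance of the narrower of
the two scopes. The Python sketch below is a simplified model of that rule, not
the backend's implementation. The names ``SCOPE_DEPTH``, ``split_scope`` and
``may_synchronize`` are hypothetical, threads are modelled as
``(agent, work-group, wavefront, lane)`` tuples, and the exclusions for image
operations and the private address space are ignored. Treating
``singlethread`` as the narrowest scope in mixed-scope comparisons is an
assumption beyond what the table lists.

```python
# How many leading components of a thread's (agent, work-group, wavefront,
# lane) identity must match for two threads to share an instance of a scope.
SCOPE_DEPTH = {
    "system": 0,
    "agent": 1,
    "workgroup": 2,
    "wavefront": 3,
    "singlethread": 4,
}


def split_scope(sync_scope):
    """Return (base scope, one_as flag); the empty string means ``system``."""
    if sync_scope == "":
        return "system", False
    if sync_scope == "one-as":
        return "system", True
    if sync_scope.endswith("-one-as"):
        return sync_scope[:-len("-one-as")], True
    return sync_scope, False


def may_synchronize(thread_a, scope_a, space_a, thread_b, scope_b, space_b):
    """Apply scope inclusion: both threads must lie in each other's scope."""
    base_a, one_as_a = split_scope(scope_a)
    base_b, one_as_b = split_scope(scope_b)
    # A one-as scope only synchronizes within the same address space.
    if (one_as_a or one_as_b) and space_a != space_b:
        return False
    # The narrower scope (larger depth) decides which components must match.
    depth = max(SCOPE_DEPTH[base_a], SCOPE_DEPTH[base_b])
    return thread_a[:depth] == thread_b[:depth]


t0 = (0, 0, 0, 0)   # agent 0, work-group 0, wavefront 0, lane 0
t1 = (0, 0, 1, 5)   # same work-group as t0, different wavefront
print(may_synchronize(t0, "agent", "global", t1, "workgroup", "global"))
print(may_synchronize(t0, "wavefront", "global", t1, "", "global"))
```

The first call reports that an ``agent`` operation and a ``workgroup``
operation in the same work-group may synchronize. The second reports that a
``wavefront`` operation cannot synchronize with an operation in another
wavefront, even when that operation uses the default ``system`` scope.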
 | 
						||
 | 
						||
AMDGPU Intrinsics
 | 
						||
-----------------
 | 
						||
 | 
						||
The AMDGPU backend implements the following LLVM IR intrinsics.
 | 
						||
 | 
						||
*This section is WIP.*
 | 
						||
 | 
						||
.. TODO::
 | 
						||
 | 
						||
   List AMDGPU intrinsics.
 | 
						||
 | 
						||
AMDGPU Attributes
 | 
						||
-----------------
 | 
						||
 | 
						||
The AMDGPU backend supports the following LLVM IR attributes.
 | 
						||
 | 
						||
  .. table:: AMDGPU LLVM IR Attributes
 | 
						||
     :name: amdgpu-llvm-ir-attributes-table
 | 
						||
 | 
						||
     ======================================= ==========================================================
 | 
						||
     LLVM Attribute                          Description
 | 
						||
     ======================================= ==========================================================
 | 
						||
     "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
 | 
						||
                                             will be specified when the kernel is dispatched. Generated
 | 
						||
                                             by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_.
 | 
						||
     "amdgpu-implicitarg-num-bytes"="n"      Number of kernel argument bytes to add to the kernel
 | 
						||
                                             argument block size for the implicit arguments. This
 | 
						||
                                             varies by OS and language (for OpenCL see
 | 
						||
                                             :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
 | 
						||
     "amdgpu-num-sgpr"="n"                   Specifies the number of SGPRs to use. Generated by
 | 
						||
                                             the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.
 | 
						||
     "amdgpu-num-vgpr"="n"                   Specifies the number of VGPRs to use. Generated by the
 | 
						||
                                             ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.
 | 
						||
     "amdgpu-waves-per-eu"="m,n"             Specify the minimum and maximum number of waves per
 | 
						||
                                             execution unit. Generated by the ``amdgpu_waves_per_eu``
 | 
						||
                                             CLANG attribute [CLANG-ATTR]_.
 | 
						||
     "amdgpu-ieee" true/false.               Specify whether the function expects the IEEE field of the
 | 
						||
                                             mode register to be set on entry. Overrides the default for
 | 
						||
                                             the calling convention.
 | 
						||
     "amdgpu-dx10-clamp" true/false.         Specify whether the function expects the DX10_CLAMP field of
 | 
						||
                                             the mode register to be set on entry. Overrides the default
 | 
						||
                                             for the calling convention.
 | 
						||
     ======================================= ==========================================================
 | 
						||
 | 
						||
Code Object
 | 
						||
===========
 | 
						||
 | 
						||
The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
 | 
						||
can be linked by ``lld`` to produce a standard ELF shared code object which can
 | 
						||
be loaded and executed on an AMDGPU target.
 | 
						||
 | 
						||
Header
 | 
						||
------
 | 
						||
 | 
						||
The AMDGPU backend uses the following ELF header:
 | 
						||
 | 
						||
  .. table:: AMDGPU ELF Header
 | 
						||
     :name: amdgpu-elf-header-table
 | 
						||
 | 
						||
     ========================== ===============================
 | 
						||
     Field                      Value
 | 
						||
     ========================== ===============================
 | 
						||
     ``e_ident[EI_CLASS]``      ``ELFCLASS64``
 | 
						||
     ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
 | 
						||
     ``e_ident[EI_OSABI]``      - ``ELFOSABI_NONE``
 | 
						||
                                - ``ELFOSABI_AMDGPU_HSA``
 | 
						||
                                - ``ELFOSABI_AMDGPU_PAL``
 | 
						||
                                - ``ELFOSABI_AMDGPU_MESA3D``
 | 
						||
     ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
 | 
						||
                                - ``ELFABIVERSION_AMDGPU_PAL``
 | 
						||
                                - ``ELFABIVERSION_AMDGPU_MESA3D``
 | 
						||
     ``e_type``                 - ``ET_REL``
 | 
						||
                                - ``ET_DYN``
 | 
						||
     ``e_machine``              ``EM_AMDGPU``
 | 
						||
     ``e_entry``                0
 | 
						||
     ``e_flags``                See :ref:`amdgpu-elf-header-e_flags-table`
 | 
						||
     ========================== ===============================
 | 
						||
 | 
						||
..
 | 
						||
 | 
						||
  .. table:: AMDGPU ELF Header Enumeration Values
 | 
						||
     :name: amdgpu-elf-header-enumeration-values-table
 | 
						||
 | 
						||
     =============================== =====
 | 
						||
     Name                            Value
 | 
						||
     =============================== =====
 | 
						||
     ``EM_AMDGPU``                   224
 | 
						||
     ``ELFOSABI_NONE``               0
 | 
						||
     ``ELFOSABI_AMDGPU_HSA``         64
 | 
						||
     ``ELFOSABI_AMDGPU_PAL``         65
 | 
						||
     ``ELFOSABI_AMDGPU_MESA3D``      66
 | 
						||
     ``ELFABIVERSION_AMDGPU_HSA``    1
 | 
						||
     ``ELFABIVERSION_AMDGPU_PAL``    0
 | 
						||
     ``ELFABIVERSION_AMDGPU_MESA3D`` 0
 | 
						||
     =============================== =====
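
To make the header tables concrete, the following Python sketch decodes the
identification fields of a 64-bit little-endian ELF header using only the
standard ``struct`` module. The function name
``describe_amdgpu_elf_header`` is hypothetical. The AMDGPU values come from
the enumeration table above, and the remaining constants are the standard ELF
values. It covers only the ``ELFCLASS64`` case used by the ``amdgcn``
architecture.

```python
import struct

# Values copied from the enumeration table above.
EM_AMDGPU = 224
OSABI_NAMES = {
    0: "ELFOSABI_NONE",
    64: "ELFOSABI_AMDGPU_HSA",
    65: "ELFOSABI_AMDGPU_PAL",
    66: "ELFOSABI_AMDGPU_MESA3D",
}
# Standard ELF constants (not specific to AMDGPU).
ELFCLASS64, ELFDATA2LSB = 2, 1
ET_NAMES = {1: "ET_REL", 3: "ET_DYN"}


def describe_amdgpu_elf_header(header):
    """Decode the identification fields of a 64-bit little-endian ELF header."""
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # e_ident[4..8]: EI_CLASS, EI_DATA, EI_VERSION, EI_OSABI, EI_ABIVERSION.
    ei_class, ei_data, _ei_version, ei_osabi, ei_abiversion = header[4:9]
    if (ei_class, ei_data) != (ELFCLASS64, ELFDATA2LSB):
        raise ValueError("expected ELFCLASS64 and ELFDATA2LSB")
    # e_type and e_machine are the two half-words following the 16-byte e_ident.
    e_type, e_machine = struct.unpack_from("<HH", header, 16)
    if e_machine != EM_AMDGPU:
        raise ValueError("e_machine is not EM_AMDGPU")
    return {
        "osabi": OSABI_NAMES.get(ei_osabi, "unknown"),
        "abiversion": ei_abiversion,
        "type": ET_NAMES.get(e_type, "unknown"),
    }


# Synthetic header prefix: HSA OS ABI, ABI version 1, ET_DYN, EM_AMDGPU.
sample = (b"\x7fELF" + bytes([2, 1, 1, 64, 1]) + bytes(7)
          + struct.pack("<HH", 3, 224))
print(describe_amdgpu_elf_header(sample))
```

The sample bytes are constructed by hand for the example. A real code object
would be read from a file, with the first 20 bytes passed to the function.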
 | 
						||
 | 
						||
``e_ident[EI_CLASS]``
 | 
						||
  The ELF class is:
 | 
						||
 | 
						||
  * ``ELFCLASS32`` for ``r600`` architecture.
 | 
						||
 | 
						||
  * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64-bit
 | 
						||
    process address space applications.
 | 
						||
 | 
						||
``e_ident[EI_DATA]``
 | 
						||
  All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
 | 
						||
 | 
						||
``e_ident[EI_OSABI]``
 | 
						||
  One of the following AMDGPU architecture specific OS ABIs
 | 
						||
  (see :ref:`amdgpu-os-table`):
 | 
						||
 | 
						||
  * ``ELFOSABI_NONE`` for *unknown* OS.
 | 
						||
 | 
						||
  * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
 | 
						||
 | 
						||
  * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
 | 
						||
 | 
						||
  * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3d`` OS.
 | 
						||
 | 
						||
``e_ident[EI_ABIVERSION]``
 | 
						||
  The ABI version of the AMDGPU architecture specific OS ABI to which the code
 | 
						||
  object conforms:
 | 
						||
 | 
						||
  * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
 | 
						||
    runtime ABI.
 | 
						||
 | 
						||
  * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
 | 
						||
    runtime ABI.
 | 
						||
 | 
						||
  * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
 | 
						||
    3D runtime ABI.
 | 
						||
 | 
						||
``e_type``
 | 
						||
  Can be one of the following values:
 | 
						||
 | 
						||
 | 
						||
  ``ET_REL``
 | 
						||
    The type produced by the AMDGPU backend compiler as it is a relocatable
    code object.
 | 
						||
 | 
						||
  ``ET_DYN``
 | 
						||
    The type produced by the linker as it is a shared code object.
 | 
						||
 | 
						||
  The AMD HSA runtime loader requires an ``ET_DYN`` code object.
 | 
						||
 | 
						||
``e_machine``
 | 
						||
  The value ``EM_AMDGPU`` is used for the machine for all processors supported
 | 
						||
  by the ``r600`` and ``amdgcn`` architectures (see
 | 
						||
  :ref:`amdgpu-processor-table`). The specific processor is specified in the
 | 
						||
  ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
 | 
						||
  :ref:`amdgpu-elf-header-e_flags-table`).
 | 
						||
 | 
						||
``e_entry``
 | 
						||
  The entry point is 0 as the entry points for individual kernels must be
 | 
						||
  selected in order to invoke them through AQL packets.
 | 
						||
 | 
						||
``e_flags``
 | 
						||
  The AMDGPU backend uses the following ELF header flags:
 | 
						||
 | 
						||
  .. table:: AMDGPU ELF Header ``e_flags``
 | 
						||
     :name: amdgpu-elf-header-e_flags-table
 | 
						||
 | 
						||
     ================================= ========== =============================
 | 
						||
     Name                              Value      Description
 | 
						||
     ================================= ========== =============================
 | 
						||
     **AMDGPU Processor Flag**                    See :ref:`amdgpu-processor-table`.
 | 
						||
     -------------------------------------------- -----------------------------
 | 
						||
     ``EF_AMDGPU_MACH``                0x000000ff AMDGPU processor selection
 | 
						||
                                                  mask for
 | 
						||
                                                  ``EF_AMDGPU_MACH_xxx`` values
 | 
						||
                                                  defined in
 | 
						||
                                                  :ref:`amdgpu-ef-amdgpu-mach-table`.
 | 
						||
     ``EF_AMDGPU_XNACK``               0x00000100 Indicates if the ``xnack``
 | 
						||
                                                  target feature is
 | 
						||
                                                  enabled for all code
 | 
						||
                                                  contained in the code object.
 | 
						||
                                                  If the processor
 | 
						||
                                                  does not support the
 | 
						||
                                                  ``xnack`` target
 | 
						||
                                                  feature then must
 | 
						||
                                                  be 0.
 | 
						||
                                                  See
 | 
						||
                                                  :ref:`amdgpu-target-features`.
 | 
						||
     ``EF_AMDGPU_SRAM_ECC``            0x00000200 Indicates if the ``sram-ecc``
 | 
						||
                                                  target feature is
 | 
						||
                                                  enabled for all code
 | 
						||
                                                  contained in the code object.
 | 
						||
                                                  If the processor
 | 
						||
                                                  does not support the
 | 
						||
                                                  ``sram-ecc`` target
 | 
						||
                                                  feature then must
 | 
						||
                                                  be 0.
 | 
						||
                                                  See
 | 
						||
                                                  :ref:`amdgpu-target-features`.
 | 
						||
     ================================= ========== =============================
 | 
						||
 | 
						||
  .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
 | 
						||
     :name: amdgpu-ef-amdgpu-mach-table
 | 
						||
 | 
						||
     ================================= ========== =============================
 | 
						||
     Name                              Value      Description (see
 | 
						||
                                                  :ref:`amdgpu-processor-table`)
 | 
						||
     ================================= ========== =============================
 | 
						||
     ``EF_AMDGPU_MACH_NONE``           0x000      *not specified*
 | 
						||
     ``EF_AMDGPU_MACH_R600_R600``      0x001      ``r600``
 | 
						||
     ``EF_AMDGPU_MACH_R600_R630``      0x002      ``r630``
 | 
						||
     ``EF_AMDGPU_MACH_R600_RS880``     0x003      ``rs880``
 | 
						||
     ``EF_AMDGPU_MACH_R600_RV670``     0x004      ``rv670``
 | 
						||
     ``EF_AMDGPU_MACH_R600_RV710``     0x005      ``rv710``
 | 
						||
     ``EF_AMDGPU_MACH_R600_RV730``     0x006      ``rv730``
 | 
						||
     ``EF_AMDGPU_MACH_R600_RV770``     0x007      ``rv770``
 | 
						||
     ``EF_AMDGPU_MACH_R600_CEDAR``     0x008      ``cedar``
 | 
						||
     ``EF_AMDGPU_MACH_R600_CYPRESS``   0x009      ``cypress``
 | 
						||
     ``EF_AMDGPU_MACH_R600_JUNIPER``   0x00a      ``juniper``
 | 
						||
     ``EF_AMDGPU_MACH_R600_REDWOOD``   0x00b      ``redwood``
 | 
						||
     ``EF_AMDGPU_MACH_R600_SUMO``      0x00c      ``sumo``
 | 
						||
     ``EF_AMDGPU_MACH_R600_BARTS``     0x00d      ``barts``
 | 
						||
     ``EF_AMDGPU_MACH_R600_CAICOS``    0x00e      ``caicos``
 | 
						||
     ``EF_AMDGPU_MACH_R600_CAYMAN``    0x00f      ``cayman``
 | 
						||
     ``EF_AMDGPU_MACH_R600_TURKS``     0x010      ``turks``
 | 
						||
     *reserved*                        0x011 -    Reserved for ``r600``
 | 
						||
                                       0x01f      architecture processors.
 | 
						||
     ``EF_AMDGPU_MACH_AMDGCN_GFX600``  0x020      ``gfx600``
 | 
						||
     ``EF_AMDGPU_MACH_AMDGCN_GFX601``  0x021      ``gfx601``
 | 
						||
     ``EF_AMDGPU_MACH_AMDGCN_GFX700``  0x022      ``gfx700``
 | 
						||
     ``EF_AMDGPU_MACH_AMDGCN_GFX701``  0x023      ``gfx701``
 | 
						||
     ``EF_AMDGPU_MACH_AMDGCN_GFX702``  0x024      ``gfx702``
 | 
						||
     ``EF_AMDGPU_MACH_AMDGCN_GFX703``  0x025      ``gfx703``
 | 
						||
     ``EF_AMDGPU_MACH_AMDGCN_GFX704``  0x026      ``gfx704``
 | 
						||
     *reserved*                        0x027      Reserved.
 | 
						||
     ``EF_AMDGPU_MACH_AMDGCN_GFX801``  0x028      ``gfx801``
 | 
						||
     ``EF_AMDGPU_MACH_AMDGCN_GFX802``  0x029      ``gfx802``
 | 
						||
     ``EF_AMDGPU_MACH_AMDGCN_GFX803``  0x02a      ``gfx803``
 | 
						||
     ``EF_AMDGPU_MACH_AMDGCN_GFX810``  0x02b      ``gfx810``
 | 
						||
     ``EF_AMDGPU_MACH_AMDGCN_GFX900``  0x02c      ``gfx900``
 | 
						||
     ``EF_AMDGPU_MACH_AMDGCN_GFX902``  0x02d      ``gfx902``
 | 
						||
     ``EF_AMDGPU_MACH_AMDGCN_GFX904``  0x02e      ``gfx904``
 | 
						||
     ``EF_AMDGPU_MACH_AMDGCN_GFX906``  0x02f      ``gfx906``
 | 
						||
     ``EF_AMDGPU_MACH_AMDGCN_GFX908``  0x030      ``gfx908``
 | 
						||
     ``EF_AMDGPU_MACH_AMDGCN_GFX909``  0x031      ``gfx909``
 | 
						||
     *reserved*                        0x032      Reserved.
 | 
						||
     ``EF_AMDGPU_MACH_AMDGCN_GFX1010`` 0x033      ``gfx1010``
 | 
						||
     ``EF_AMDGPU_MACH_AMDGCN_GFX1011`` 0x034      ``gfx1011``
 | 
						||
     ``EF_AMDGPU_MACH_AMDGCN_GFX1012`` 0x035      ``gfx1012``
 | 
						||
     ================================= ========== =============================
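
The ``e_flags`` layout described above can be decoded with a mask and two bit
tests. The Python sketch below is illustrative only. The function name
``decode_e_flags`` is hypothetical, and ``MACH_NAMES`` holds just a handful of
entries copied from the ``EF_AMDGPU_MACH`` table rather than the full list.

```python
# Masks and bits from the e_flags table.
EF_AMDGPU_MACH = 0x000000FF
EF_AMDGPU_XNACK = 0x00000100
EF_AMDGPU_SRAM_ECC = 0x00000200

# A few entries from the EF_AMDGPU_MACH table; extend as needed.
MACH_NAMES = {
    0x000: None,          # EF_AMDGPU_MACH_NONE: not specified
    0x02C: "gfx900",
    0x02F: "gfx906",
    0x030: "gfx908",
    0x033: "gfx1010",
}


def decode_e_flags(e_flags):
    """Split e_flags into the processor selection and the two feature bits."""
    mach = e_flags & EF_AMDGPU_MACH
    return {
        "processor": MACH_NAMES.get(mach, "unlisted (0x%03x)" % mach),
        "xnack": bool(e_flags & EF_AMDGPU_XNACK),
        "sram_ecc": bool(e_flags & EF_AMDGPU_SRAM_ECC),
    }


print(decode_e_flags(0x12F))
```

For the value ``0x12F`` the low byte selects ``gfx906`` and only the ``xnack``
bit is set.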
 | 
						||
 | 
						||
Sections
 | 
						||
--------
 | 
						||
 | 
						||
An AMDGPU target ELF code object has the standard ELF sections which include:
 | 
						||
 | 
						||
  .. table:: AMDGPU ELF Sections
 | 
						||
     :name: amdgpu-elf-sections-table
 | 
						||
 | 
						||
     ================== ================ =================================
 | 
						||
     Name               Type             Attributes
 | 
						||
     ================== ================ =================================
 | 
						||
     ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
 | 
						||
     ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
 | 
						||
     ``.debug_``\ *\**  ``SHT_PROGBITS`` *none*
 | 
						||
     ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
 | 
						||
     ``.dynstr``        ``SHT_STRTAB``   ``SHF_ALLOC``
     ``.dynsym``        ``SHT_DYNSYM``   ``SHF_ALLOC``
 | 
						||
     ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
 | 
						||
     ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
 | 
						||
     ``.note``          ``SHT_NOTE``     *none*
 | 
						||
     ``.rela``\ *name*  ``SHT_RELA``     *none*
 | 
						||
     ``.rela.dyn``      ``SHT_RELA``     *none*
 | 
						||
     ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
 | 
						||
     ``.shstrtab``      ``SHT_STRTAB``   *none*
 | 
						||
     ``.strtab``        ``SHT_STRTAB``   *none*
 | 
						||
     ``.symtab``        ``SHT_SYMTAB``   *none*
 | 
						||
     ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
 | 
						||
     ================== ================ =================================
 | 
						||
 | 
						||
These sections have their standard meanings (see [ELF]_) and are only generated
 | 
						||
if needed.
 | 
						||
 | 
						||
``.debug_``\ *\**
 | 
						||
  The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
 | 
						||
  DWARF produced by the AMDGPU backend.
 | 
						||
 | 
						||
``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
 | 
						||
  The standard sections used by a dynamic loader.
 | 
						||
 | 
						||
``.note``
 | 
						||
  See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
 | 
						||
  backend.
 | 
						||
 | 
						||
``.rela``\ *name*, ``.rela.dyn``
 | 
						||
  For relocatable code objects, *name* is the name of the section to which the
  relocation records apply. For example, ``.rela.text`` is the section name for
  relocation records associated with the ``.text`` section.
 | 
						||
 | 
						||
  For linked shared code objects, ``.rela.dyn`` contains all the relocation
  records from each of the relocatable code objects' ``.rela``\ *name* sections.
 | 
						||
 | 
						||
  See :ref:`amdgpu-relocation-records` for the relocation records supported by
 | 
						||
  the AMDGPU backend.
 | 
						||
 | 
						||
``.text``
 | 
						||
  The executable machine code for the kernels and functions they call. Generated
 | 
						||
  as position independent code. See :ref:`amdgpu-code-conventions` for
 | 
						||
  information on conventions used in the ISA generation.
 | 
						||
 | 
						||
.. _amdgpu-note-records:
 | 
						||
 | 
						||
Note Records
 | 
						||
------------
 | 
						||
 | 
						||
The AMDGPU backend code object contains ELF note records in the ``.note``
 | 
						||
section. The set of generated notes and their semantics depend on the code
 | 
						||
object version; see :ref:`amdgpu-note-records-v2` and
 | 
						||
:ref:`amdgpu-note-records-v3`.
 | 
						||
 | 
						||
As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding
 | 
						||
must be generated after the ``name`` field to ensure the ``desc`` field is 4
 | 
						||
byte aligned. In addition, minimal zero byte padding must be generated to
 | 
						||
ensure the ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign``
 | 
						||
field of the ``.note`` section must be at least 4 to indicate at least 4 byte
 | 
						||
alignment.
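
The padding rules above can be illustrated with a short Python sketch that
encodes a single note record. The helper names ``pad4`` and ``build_note`` are
hypothetical. The layout assumed is the standard ELF note layout: ``namesz``,
``descsz`` and ``type`` words, followed by the padded ``name`` and ``desc``
fields, with ``namesz`` counting the terminating null byte and both sizes
recording the unpadded lengths.

```python
import struct


def pad4(data):
    """Append the minimal number of zero bytes to reach a multiple of 4."""
    return data + b"\x00" * (-len(data) % 4)


def build_note(name, note_type, desc):
    """Encode one little-endian ELF note record."""
    name_bytes = name.encode("ascii") + b"\x00"   # namesz counts the NUL
    header = struct.pack("<III", len(name_bytes), len(desc), note_type)
    return header + pad4(name_bytes) + pad4(desc)


# Example payload: a single byte 0x80, which is an empty MessagePack map.
note = build_note("AMDGPU", 32, b"\x80")
print(len(note))
```

Here the 7-byte name (including its null terminator) is padded to 8 bytes and
the 1-byte descriptor to 4 bytes, so every field of the record starts on a
4 byte boundary.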
 | 
						||
 | 
						||
.. _amdgpu-note-records-v2:
 | 
						||
 | 
						||
Code Object V2 Note Records (-mattr=-code-object-v3)
 | 
						||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 | 
						||
 | 
						||
.. warning:: Code Object V2 is not the default code object version emitted by
 | 
						||
  this version of LLVM. For a description of the notes generated with the
 | 
						||
  default configuration (Code Object V3) see :ref:`amdgpu-note-records-v3`.
 | 
						||
 | 
						||
The AMDGPU backend code object uses the following ELF note record in the
 | 
						||
``.note`` section when compiling for Code Object V2 (-mattr=-code-object-v3).
 | 
						||
 | 
						||
Additional note records may be present, but any which are not documented here
 | 
						||
are deprecated and should not be used.
 | 
						||
 | 
						||
  .. table:: AMDGPU Code Object V2 ELF Note Records
 | 
						||
     :name: amdgpu-elf-note-records-table-v2
 | 
						||
 | 
						||
     ===== ============================== ======================================
 | 
						||
     Name  Type                           Description
 | 
						||
     ===== ============================== ======================================
 | 
						||
     "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
 | 
						||
     ===== ============================== ======================================
 | 
						||
 | 
						||
..
 | 
						||
 | 
						||
  .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values
 | 
						||
     :name: amdgpu-elf-note-record-enumeration-values-table-v2
 | 
						||
 | 
						||
     ============================== =====
 | 
						||
     Name                           Value
 | 
						||
     ============================== =====
 | 
						||
     *reserved*                       0-9
 | 
						||
     ``NT_AMD_AMDGPU_HSA_METADATA``    10
 | 
						||
     *reserved*                        11
 | 
						||
     ============================== =====
 | 
						||
 | 
						||
``NT_AMD_AMDGPU_HSA_METADATA``
 | 
						||
  Specifies extensible metadata associated with the code objects executed on HSA
 | 
						||
  [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
 | 
						||
  the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
 | 
						||
  :ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code
 | 
						||
  object metadata string.
 | 
						||
 | 
						||
.. _amdgpu-note-records-v3:
 | 
						||
 | 
						||
Code Object V3 Note Records (-mattr=+code-object-v3)
 | 
						||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 | 
						||
 | 
						||
The AMDGPU backend code object uses the following ELF note record in the
 | 
						||
``.note`` section when compiling for Code Object V3 (-mattr=+code-object-v3).
 | 
						||
 | 
						||
Additional note records may be present, but any which are not documented here
 | 
						||
are deprecated and should not be used.
 | 
						||
 | 
						||
  .. table:: AMDGPU Code Object V3 ELF Note Records
 | 
						||
     :name: amdgpu-elf-note-records-table-v3
 | 
						||
 | 
						||
     ======== ============================== ======================================
 | 
						||
     Name     Type                           Description
 | 
						||
     ======== ============================== ======================================
 | 
						||
     "AMDGPU" ``NT_AMDGPU_METADATA``         Metadata in Message Pack [MsgPack]_
 | 
						||
                                             binary format.
 | 
						||
     ======== ============================== ======================================
 | 
						||
 | 
						||
..
 | 
						||
 | 
						||
  .. table:: AMDGPU Code Object V3 ELF Note Record Enumeration Values
 | 
						||
     :name: amdgpu-elf-note-record-enumeration-values-table-v3
 | 
						||
 | 
						||
     ============================== =====
 | 
						||
     Name                           Value
 | 
						||
     ============================== =====
 | 
						||
     *reserved*                     0-31
 | 
						||
     ``NT_AMDGPU_METADATA``         32
 | 
						||
     ============================== =====
 | 
						||
 | 
						||
``NT_AMDGPU_METADATA``
 | 
						||
  Specifies extensible metadata associated with an AMDGPU code
 | 
						||
  object. It is encoded as a map in the Message Pack [MsgPack]_ binary
 | 
						||
  data format. See :ref:`amdgpu-amdhsa-code-object-metadata-v3` for the
 | 
						||
  map keys defined for the ``amdhsa`` OS.
 | 
						||
 | 
						||
.. _amdgpu-symbols:
 | 
						||
 | 
						||
Symbols
 | 
						||
-------
 | 
						||
 | 
						||
Symbols include the following:
 | 
						||
 | 
						||
  .. table:: AMDGPU ELF Symbols
 | 
						||
     :name: amdgpu-elf-symbols-table
 | 
						||
 | 
						||
     ===================== ================== ================ ==================
 | 
						||
     Name                  Type               Section          Description
 | 
						||
     ===================== ================== ================ ==================
 | 
						||
     *link-name*           ``STT_OBJECT``     - ``.data``      Global variable
 | 
						||
                                              - ``.rodata``
 | 
						||
                                              - ``.bss``
 | 
						||
     *link-name*\ ``.kd``  ``STT_OBJECT``     - ``.rodata``    Kernel descriptor
 | 
						||
     *link-name*           ``STT_FUNC``       - ``.text``      Kernel entry point
 | 
						||
     *link-name*           ``STT_OBJECT``     - SHN_AMDGPU_LDS Global variable in LDS
 | 
						||
     ===================== ================== ================ ==================
 | 
						||
 | 
						||
Global variable
 | 
						||
  Global variables both used and defined by the compilation unit.
 | 
						||
 | 
						||
  If the symbol is defined in the compilation unit then it is allocated in the
  appropriate section according to whether it has initialized data or is
  read-only.
 | 
						||
 | 
						||
  If the symbol is external then its section is ``STN_UNDEF`` and the loader
 | 
						||
  will resolve relocations using the definition provided by another code object
 | 
						||
  or explicitly defined by the runtime.
 | 
						||
 | 
						||
  If the symbol resides in local/group memory (LDS) then its section is the
 | 
						||
  special processor-specific section name ``SHN_AMDGPU_LDS``, and the
 | 
						||
  ``st_value`` field describes alignment requirements as it does for common
 | 
						||
  symbols.
 | 
						||
 | 
						||
  .. TODO::
 | 
						||
 | 
						||
     Add description of linked shared object symbols. Seems undefined symbols
 | 
						||
     are marked as STT_NOTYPE.
 | 
						||
 | 
						||
Kernel descriptor
  Every HSA kernel has an associated kernel descriptor. It is the address of the
  kernel descriptor that is used in the AQL dispatch packet used to invoke the
  kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
  defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.

Kernel entry point
  Every HSA kernel also has a symbol for its machine code entry point.
 
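As a rough illustration of the symbol table above, the following Python sketch
lists the two symbols associated with a kernel and picks the one whose address
belongs in the AQL dispatch packet. The helper names ``kernel_symbols`` and
``dispatch_symbol`` and the kernel name ``vector_add`` are hypothetical; only
the ``.kd`` suffix, the symbol types, and the sections come from the table.

.. code-block:: python

   def kernel_symbols(link_name):
       """Return the symbols the table lists for one kernel (illustrative)."""
       return [
           {"name": link_name, "type": "STT_FUNC",
            "section": ".text", "role": "Kernel entry point"},
           {"name": link_name + ".kd", "type": "STT_OBJECT",
            "section": ".rodata", "role": "Kernel descriptor"},
       ]


   def dispatch_symbol(link_name):
       """Name of the symbol whose address goes in the AQL dispatch packet.

       The dispatch packet uses the kernel descriptor, not the entry point.
       """
       return link_name + ".kd"


   for symbol in kernel_symbols("vector_add"):
       print(symbol["name"], symbol["type"], symbol["section"])
   print("dispatch via:", dispatch_symbol("vector_add"))

A runtime would look up the ``.kd`` symbol to build the dispatch packet, while
tools such as disassemblers would use the ``STT_FUNC`` symbol to locate the
machine code.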
 
.. _amdgpu-relocation-records:

Relocation Records
------------------

The AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
relocatable fields are:

``word32``
  This specifies a 32-bit field occupying 4 bytes with arbitrary byte
  alignment. These values use the same byte order as other word values in the
  AMDGPU architecture.

``word64``
  This specifies a 64-bit field occupying 8 bytes with arbitrary byte
  alignment. These values use the same byte order as other word values in the
  AMDGPU architecture.
 
The following notations are used for specifying relocation calculations:

**A**
  Represents the addend used to compute the value of the relocatable field.

**G**
  Represents the offset into the global offset table at which the relocation
  entry's symbol will reside during execution.

**GOT**
  Represents the address of the global offset table.

**P**
  Represents the place (section offset for ``ET_REL`` or address for ``ET_DYN``)
  of the storage unit being relocated (computed using ``r_offset``).

**S**
  Represents the value of the symbol whose index resides in the relocation
  entry. Relocations not using this must specify a symbol index of
  ``STN_UNDEF``.

**B**
  Represents the base address of a loaded executable or shared object which is
  the difference between the ELF address and the actual load address.
  Relocations using this are only valid in executable or shared objects.
 
The following relocation types are supported:

  .. table:: AMDGPU ELF Relocation Records
     :name: amdgpu-elf-relocation-records-table

     ========================== ======= =====  ==========  ==============================
     Relocation Type            Kind    Value  Field       Calculation
     ========================== ======= =====  ==========  ==============================
     ``R_AMDGPU_NONE``                  0      *none*      *none*
     ``R_AMDGPU_ABS32_LO``      Static, 1      ``word32``  (S + A) & 0xFFFFFFFF
                                Dynamic
     ``R_AMDGPU_ABS32_HI``      Static, 2      ``word32``  (S + A) >> 32
                                Dynamic
     ``R_AMDGPU_ABS64``         Static, 3      ``word64``  S + A
                                Dynamic
     ``R_AMDGPU_REL32``         Static  4      ``word32``  S + A - P
     ``R_AMDGPU_REL64``         Static  5      ``word64``  S + A - P
     ``R_AMDGPU_ABS32``         Static, 6      ``word32``  S + A
                                Dynamic
     ``R_AMDGPU_GOTPCREL``      Static  7      ``word32``  G + GOT + A - P
     ``R_AMDGPU_GOTPCREL32_LO`` Static  8      ``word32``  (G + GOT + A - P) & 0xFFFFFFFF
     ``R_AMDGPU_GOTPCREL32_HI`` Static  9      ``word32``  (G + GOT + A - P) >> 32
     ``R_AMDGPU_REL32_LO``      Static  10     ``word32``  (S + A - P) & 0xFFFFFFFF
     ``R_AMDGPU_REL32_HI``      Static  11     ``word32``  (S + A - P) >> 32
     *reserved*                         12
     ``R_AMDGPU_RELATIVE64``    Dynamic 13     ``word64``  B + A
     ========================== ======= =====  ==========  ==============================

``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.

There is no current OS loader support for 32-bit programs and so
``R_AMDGPU_ABS32`` is not used.
 
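The calculations in the relocation table can be sketched in Python as follows.
This is only an illustration, not the implementation used by any linker or
loader. The function name ``relocation_value`` is hypothetical, and the sketch
assumes that arithmetic wraps modulo 2**64 and that ``word32`` results are
truncated to 32 bits without an overflow check; neither assumption is stated by
the table itself.

.. code-block:: python

   MASK32 = (1 << 32) - 1
   MASK64 = (1 << 64) - 1


   def relocation_value(rtype, S=0, A=0, P=0, G=0, GOT=0, B=0):
       """Return the value stored in the relocated field for ``rtype``.

       S, A, P, G, GOT and B follow the notation of the relocation table.
       Assumption: sums wrap as 64-bit two's complement values.
       """
       abs64 = (S + A) & MASK64
       rel64 = (S + A - P) & MASK64
       got64 = (G + GOT + A - P) & MASK64
       values = {
           "R_AMDGPU_ABS32_LO": abs64 & MASK32,
           "R_AMDGPU_ABS32_HI": abs64 >> 32,
           "R_AMDGPU_ABS64": abs64,
           "R_AMDGPU_REL32": rel64 & MASK32,       # assumed truncation
           "R_AMDGPU_REL64": rel64,
           "R_AMDGPU_ABS32": abs64 & MASK32,       # assumed truncation
           "R_AMDGPU_GOTPCREL": got64 & MASK32,    # assumed truncation
           "R_AMDGPU_GOTPCREL32_LO": got64 & MASK32,
           "R_AMDGPU_GOTPCREL32_HI": got64 >> 32,
           "R_AMDGPU_REL32_LO": rel64 & MASK32,
           "R_AMDGPU_REL32_HI": rel64 >> 32,
           "R_AMDGPU_RELATIVE64": (B + A) & MASK64,
       }
       return values[rtype]


   # A symbol at 0x1_2345_6789 with addend 4, split into two 32-bit halves.
   lo = relocation_value("R_AMDGPU_ABS32_LO", S=0x123456789, A=4)
   hi = relocation_value("R_AMDGPU_ABS32_HI", S=0x123456789, A=4)
   print(hex(lo), hex(hi))

The ``_LO`` and ``_HI`` variants exist so that a 64-bit quantity can be
materialized by two instructions that each carry a 32-bit literal; combining
``hi << 32 | lo`` recovers the full value.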
 
.. _amdgpu-dwarf:

DWARF
-----

.. warning::
   This section describes a **provisional proposal** that is not currently
   fully implemented and is subject to change.

Standard DWARF [DWARF]_ sections can be generated. These contain information
that maps the code object executable code and data to the source language
constructs. It can be used by tools such as debuggers and profilers.

This section defines the AMDGPU target specific DWARF. It applies to DWARF
Versions 4 and 5.
 
.. _amdgpu-dwarf-overview:

Overview
~~~~~~~~

The AMDGPU has several features that require additional DWARF functionality in
order to support optimized code.

A single code object can contain code for kernels that have different wave
sizes. The vector registers and some scalar registers are based on the wave
size. AMDGPU defines distinct DWARF registers for each wave size. This
simplifies the consumer of the DWARF so that each register has a fixed size,
rather than being dynamic according to the wave mode. Similarly, distinct DWARF
registers are defined for those registers that vary in size according to the
process address size. This allows a consumer to treat a specific AMDGPU target
as a single architecture regardless of how it is configured. The compiler
explicitly specifies the registers that match the mode of the code it is
generating.

AMDGPU optimized code may spill vector registers to non-global address space
memory, and this spilling may be done only for lanes that are active on entry to
the subprogram. To support this, a location description that can be created as a
masked select is required.
 
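The idea of a masked select can be pictured with a toy Python model. It is a
conceptual sketch only and does not correspond to any specific DWARF operation.
The name ``masked_select``, the 4-lane wave, and the choice that a set mask bit
selects the spilled copy while a clear bit selects the register copy are all
assumptions made for illustration.

.. code-block:: python

   def masked_select(mask, wave_size, if_set, if_clear):
       """Pick one source per lane.

       Lane ``i`` takes ``if_set[i]`` when bit ``i`` of ``mask`` is set and
       ``if_clear[i]`` otherwise.
       """
       if mask >> wave_size:
           raise ValueError("mask has bits beyond the wave size")
       return [
           if_set[lane] if (mask >> lane) & 1 else if_clear[lane]
           for lane in range(wave_size)
       ]


   # Toy 4-lane wave: lanes 0 and 2 were active on entry, so only those
   # lanes were spilled to memory (an assumption of this model).
   spilled = ["mem0", "mem1", "mem2", "mem3"]
   in_register = ["reg0", "reg1", "reg2", "reg3"]
   print(masked_select(0b0101, 4, spilled, in_register))
   # prints ['mem0', 'reg1', 'mem2', 'reg3']

The result is a composite description in which each lane's storage is chosen
independently by the mask, which is what a per-lane spill requires.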
 
Since the active lane mask may be held in a register, a way to get the value of
a register on entry to a subprogram is required. To support this an operation
that returns the caller value of a register as specified by the Call Frame
Information (see :ref:`amdgpu-call-frame-information`) is required.

Current DWARF uses an empty expression to indicate an undefined location
description. Since the masked select composite location description operation
takes more than one location description, it is necessary to have an explicit
way to specify an undefined location description. Otherwise it is not possible
to specify that a particular one of the input location descriptions is
undefined.

CFI describes restoring callee saved registers that are spilled. Currently CFI
only allows a location description that is a register, memory address, or
implicit location description. AMDGPU optimized code may spill scalar registers
into portions of vector registers. This requires extending CFI to allow any
location description.

The vector registers of the AMDGPU are represented as their full wave size,
meaning the wave size times the dword size. This reflects the actual hardware,
and allows the compiler to generate DWARF for languages that map a thread to the
complete wave. It also allows more efficient DWARF to be generated to describe
the CFI as only a single expression is required for the whole vector register,
rather than a separate expression for each lane's dword of the vector register.
It also allows the compiler to produce DWARF that indexes the vector register if
it spills scalar registers into portions of a vector register.

Since DWARF stack value entries have a base type and AMDGPU registers are a
vector of dwords, the ability to specify that a base type is a vector is
required.
 
If the source language is mapped onto the AMDGPU wavefronts in a SIMT manner,
then the variable DWARF location expressions must compute the location for a
single lane of the wavefront. Therefore, a DWARF operator is required to denote
the current lane, much like ``DW_OP_push_object_address`` denotes the current
object. The ``DW_OP_*piece`` operators only allow literal indices. Therefore, a
composite location description is required that can take a computed index of a
location description (such as a vector register).

If the source language is mapped onto the AMDGPU wavefronts in a SIMT manner,
the compiler can use the AMDGPU execution mask register to control which lanes
are active. To describe the conceptual location of non-active lanes a DWARF
expression is needed that can compute a per lane PC. For efficiency, this is
done for the wave as a whole. This expression benefits by having a masked select
composite location description operation. This requires an attribute for source
location of each lane. The AMDGPU may update the execution mask for whole wave
operations and so needs an attribute that computes the current active lane mask.

AMDGPU needs to be able to describe addresses that are in different kinds of
memory. Optimized code may need to describe a variable that resides in pieces
that are in different kinds of storage which may include parts of registers,
memory that is in a mixture of memory kinds, implicit values, or be undefined.
DWARF has the concept of segment addresses. However, the segment cannot be
specified within a DWARF expression, which is only able to specify the offset
portion of a segment address. The segment index is only provided by the entity
that specifies the DWARF expression. Therefore, the segment index is a property
that can only be put on complete objects, such as a variable. That makes it only
suitable for describing an entity (such as variable or subprogram code) that is
in a single kind of memory. Therefore, AMDGPU uses the DWARF concept of address
spaces. For example, a variable may be allocated in a register that is partially
spilled to the call stack which is in the private address space, and partially
spilled to the local address space.
 
DWARF uses the concept of an address in many expression operators but does not
define how it relates to address spaces. For example,
``DW_OP_push_object_address`` pushes the address of an object. Other contexts
implicitly push an address on the stack before evaluating an expression. For
example, the ``DW_AT_use_location`` attribute of the
``DW_TAG_ptr_to_member_type``. The expression that uses the address needs to do
so in a general way and not need to be dependent on the address space of the
address. For example, a pointer to member value may want to be applied to an
object that may reside in any address space.

The number of registers and the cost of memory operations are much higher for
AMDGPU than a typical CPU. The compiler attempts to optimize whole variables and
arrays into registers. Currently DWARF only allows ``DW_OP_push_object_address``
and related operations to work with a global memory location. To support AMDGPU
optimized code it is required to generalize DWARF to allow any location
description to be used. This allows registers, or composite location
descriptions that may be a mixture of memory, registers, or even implicit
values.

Allowing a location description to be an entry on the DWARF stack allows them to
compose naturally. It allows objects to be located in any kind of memory address
space, in registers, be implicit values, be undefined, or a composite of any of
these.
 
By extending DWARF carefully, all existing DWARF expressions can retain their
current semantic meaning. DWARF has implicit conversions that convert from a
value that is treated as an address in the default address space to a memory
location description. This can be extended to allow a default address space
memory location description to be implicitly converted back to its address
value. To allow composition of composite location descriptions, an explicit
operator that indicates the end is required. This can be implied if the end of a
DWARF expression is reached, allowing current DWARF expressions to remain legal.

The ``DW_OP_plus`` and ``DW_OP_minus`` can be defined to operate on a memory
location description in the default target architecture address space and a
generic type, and produce a memory location description. This allows them to
continue to be used to offset an address. To generalize offsetting to any
location description, including location descriptions that describe when bytes
are in registers, are implicit, or a composite of these, the
``DW_OP_LLVM_offset`` and ``DW_OP_LLVM_bit_offset`` operations are added. These
do not perform wrapping which would be hard to define for location descriptions
of non-memory kinds. This allows ``DW_OP_push_object_address`` to push a
location description that may be in a register, or be an implicit value, and the
DWARF expression of ``DW_TAG_ptr_to_member_type`` can contain
``DW_OP_LLVM_offset`` to offset within it. ``DW_OP_LLVM_bit_offset`` generalizes
DWARF to work with bit fields.
 
The DWARF ``DW_OP_xderef*`` operation allows a value to be converted into an
address of a specified address space which is then read. However, it provides no
way to create a memory location description for an address in the non-default
address space. For example, AMDGPU variables can be allocated in the local
address space at a fixed address. It is required to have an operation to create
an address in a specific address space that can be used to define the location
description of the variable. Defining this operation to produce a location
description allows the size of addresses in an address space to be larger than
the generic type.

If an operation had to produce a value that can be implicitly converted to a
memory location description, then it would be limited to the size of the generic
type which matches the size of the default address space. Its value would be
unspecified and likely not match any value in the actual program. By making the
result a location description, it allows a consumer great freedom in how it
implements it. The implicit conversion back to a value can be limited only to
the default address space to maintain compatibility.

Similarly ``DW_OP_breg*`` treats the register as containing an address in the
default address space. It is required to be able to specify the address space of
the register value.

Almost all uses of addresses in DWARF are limited to defining location
descriptions, or to be dereferenced to read memory. The exception is
``DW_CFA_val_offset`` which uses the address to set the value of a register. By
defining the CFA DWARF expression as being a memory location description, it can
maintain what address space it is, and that can be used to convert the offset
address back to an address in that address space. (An alternative is to define
``DW_CFA_val_offset`` to implicitly use the default address space, and add
another operation that specifies the address space.)
 
This approach allows all existing DWARF to retain identical semantics. It
allows the compiler to explicitly specify the address space it is using. For
example, a compiler could choose to access private memory in a swizzled manner
when mapping a source language to a wave in a SIMT manner, or to access it in an
unswizzled manner if mapping the same language with the wave being the thread.
It also allows the compiler to mix the address space it uses to access private
memory. For example, for SIMT it can still spill entire vector registers in an
unswizzled manner, while using swizzled for SIMT variable access. This approach
allows memory location descriptions for different address spaces to be combined
using the regular ``DW_OP_*piece`` operators.

Location descriptions are an abstraction of storage; they give freedom to the
consumer on how to implement them. They allow the address space to encode lane
information so they can be used to read memory with only the memory description
and no extra arguments. The same set of operations can operate on locations
independent of their kind of storage. The ``DW_OP_deref*`` therefore can be used
on any storage kind. ``DW_OP_xderef*`` is unnecessary except to become a more
compact way to convert a segment address followed by dereferencing it.

Several approaches were considered, and the one proposed appears to be the
cleanest and offers the greatest improvement of DWARF's ability to support
optimized code. Examining the gdb debugger and LLVM compiler, it appears only to
require modest changes as they both already have to support general use of
location descriptions. It is anticipated that this will be the case for other
debuggers and compilers.

The following provides the definitions for the additional operators, as well as
clarifying how existing expression operators, CFI operators, and attributes
behave with respect to generalized location descriptions that support address
spaces. It has been defined such that it is backwards compatible with DWARF 5.
The definitions are intended to fully define well-formed DWARF in a consistent
style. Some sections are organized to mirror the DWARF 5 specification
structure, with non-normative text shown in *italics*.
 
.. _amdgpu-dwarf-language-names:

Language Names
~~~~~~~~~~~~~~

Language codes defined for use with the ``DW_AT_language`` attribute are
defined in :ref:`amdgpu-dwarf-language-names-table`.

.. table:: AMDGPU DWARF Language Names
   :name: amdgpu-dwarf-language-names-table

   ==================== ====== =================== =============================
   Language Name        Code   Default Lower Bound Description
   ==================== ====== =================== =============================
   ``DW_LANG_LLVM_HIP`` 0x8100 0                   AMD HIP Language. See [HIP]_.
   ==================== ====== =================== =============================

The ``DW_LANG_LLVM_HIP`` language can be supported by extending the C++
language.
 
.. _amdgpu-dwarf-register-mapping:

Register Mapping
~~~~~~~~~~~~~~~~

DWARF registers are encoded as numbers, which are mapped to architecture
registers. The mapping for AMDGPU is defined in
:ref:`amdgpu-dwarf-register-mapping-table`.

.. table:: AMDGPU DWARF Register Mapping
   :name: amdgpu-dwarf-register-mapping-table

   ============== ================= ======== ==================================
   DWARF Register AMDGPU Register   Bit Size Description
   ============== ================= ======== ==================================
   0              PC_32             32       Program Counter (PC) when
                                             executing in a 32-bit process
                                             address space. Used in the CFI to
                                             describe the PC of the calling
                                             frame.
   1              EXEC_MASK_32      32       Execution Mask Register when
                                             executing in wave 32 mode.
   2-15           *Reserved*
   16             PC_64             64       Program Counter (PC) when
                                             executing in a 64-bit process
                                             address space. Used in the CFI to
                                             describe the PC of the calling
                                             frame.
   17             EXEC_MASK_64      64       Execution Mask Register when
                                             executing in wave 64 mode.
   18-31          *Reserved*
   32-95          SGPR0-SGPR63      32       Scalar General Purpose
                                             Registers.
   96-127         *Reserved*
   128-511        *Reserved*
   512-1023       *Reserved*
   1024-1087      *Reserved*
   1088-1129      SGPR64-SGPR105    32       Scalar General Purpose Registers
   1130-1535      *Reserved*
   1536-1791      VGPR0-VGPR255     32*32    Vector General Purpose Registers
                                             when executing in wave 32 mode.
   1792-2047      *Reserved*
   2048-2303      AGPR0-AGPR255     32*32    Vector Accumulation Registers
                                             when executing in wave 32 mode.
   2304-2559      *Reserved*
   2560-2815      VGPR0-VGPR255     64*32    Vector General Purpose Registers
                                             when executing in wave 64 mode.
   2816-3071      *Reserved*
   3072-3327      AGPR0-AGPR255     64*32    Vector Accumulation Registers
                                             when executing in wave 64 mode.
   3328-3583      *Reserved*
   ============== ================= ======== ==================================
 
The vector registers are represented as the full size for the wavefront. They
are organized as consecutive dwords (32-bits), one per lane, with the dword at
the least significant bit position corresponding to lane 0 and so forth. DWARF
location expressions involving the ``DW_OP_LLVM_offset`` and
``DW_OP_LLVM_push_lane`` operations are used to select the part of the vector
register corresponding to the lane that is executing the current thread of
execution in languages that are implemented using a SIMD or SIMT execution
model.
 
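The lane layout can be illustrated with a short Python sketch that treats the
whole vector register as one large integer. The helper names ``lane_dword`` and
``lane_byte_offset`` are hypothetical. The byte offset of ``4 * lane`` assumes
that the register is addressed in little-endian byte order, so that lane 0
starts at byte offset 0; this is an assumption about how a consumer might model
the register, not something the text above specifies.

.. code-block:: python

   DWORD_BITS = 32
   DWORD_MASK = (1 << DWORD_BITS) - 1


   def lane_dword(register_value, lane, wave_size):
       """Extract the 32-bit value that one lane holds in a vector register.

       ``register_value`` is the full register as an integer of
       ``wave_size * 32`` bits, with lane 0 in the least significant dword.
       """
       if wave_size not in (32, 64):
           raise ValueError("wave size must be 32 or 64")
       if not 0 <= lane < wave_size:
           raise ValueError("lane out of range for this wave size")
       return (register_value >> (DWORD_BITS * lane)) & DWORD_MASK


   def lane_byte_offset(lane):
       """Byte offset of a lane's dword (assumes little-endian byte order)."""
       return 4 * lane


   # Build a wave 32 register in which lane n holds the value 100 + n.
   vgpr = sum((100 + lane) << (DWORD_BITS * lane) for lane in range(32))
   print(lane_dword(vgpr, 5, 32), lane_byte_offset(5))

Conceptually, an expression that pushes the current lane and scales it by the
dword size would arrive at the same offset that ``lane_byte_offset`` computes.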
 
If the wavefront size is 32 lanes then the wave 32 mode register definitions
are used. If the wavefront size is 64 lanes then the wave 64 mode register
definitions are used. Some AMDGPU targets support executing in both wave 32
and wave 64 mode. The register definitions corresponding to the wave mode
of the generated code will be used.

If code is generated to execute in a 32-bit process address space then the
32-bit process address space register definitions are used. If code is
generated to execute in a 64-bit process address space then the 64-bit process
address space register definitions are used. The ``amdgcn`` target only
supports the 64-bit process address space.
 
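The numbering in the register mapping table can be expressed as a small lookup.
The following Python sketch is illustrative only: the function name
``dwarf_register`` and the register name spellings it accepts (for example
``VGPR12``) are assumptions, while the number ranges are taken from the table.

.. code-block:: python

   import re

   # Registers with a single fixed DWARF number.
   _SPECIAL = {"PC_32": 0, "EXEC_MASK_32": 1, "PC_64": 16, "EXEC_MASK_64": 17}

   # (kind, wave size or None) -> list of (first index, last index, first number).
   _RANGES = {
       ("SGPR", None): [(0, 63, 32), (64, 105, 1088)],
       ("VGPR", 32): [(0, 255, 1536)],
       ("AGPR", 32): [(0, 255, 2048)],
       ("VGPR", 64): [(0, 255, 2560)],
       ("AGPR", 64): [(0, 255, 3072)],
   }


   def dwarf_register(name, wave_size=None):
       """Map an AMDGPU register name to its DWARF register number.

       ``wave_size`` (32 or 64) is required for VGPRs and AGPRs because the
       table defines distinct DWARF registers for each wave mode.
       """
       if name in _SPECIAL:
           return _SPECIAL[name]
       match = re.fullmatch(r"(SGPR|VGPR|AGPR)(\d+)", name)
       if match is None:
           raise ValueError("unknown register name: " + name)
       kind, index = match.group(1), int(match.group(2))
       key = (kind, None if kind == "SGPR" else wave_size)
       for first, last, base in _RANGES.get(key, []):
           if first <= index <= last:
               return base + (index - first)
       raise ValueError("no DWARF register for " + name)


   print(dwarf_register("SGPR64"), dwarf_register("VGPR0", wave_size=64))

Keeping the wave size as an explicit argument mirrors the design described
above: the producer picks the register definitions that match the mode of the
code it generates, and each DWARF register keeps a fixed size.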
 
Address Class Mapping
~~~~~~~~~~~~~~~~~~~~~

DWARF address classes are used for languages with the concept of memory address
spaces. They are used in the ``DW_AT_address_class`` attribute for pointer type,
reference type, subroutine, and subroutine type debugger information entries
(DIEs).

The address class mapping for AMDGPU is defined in
:ref:`amdgpu-dwarf-address-class-mapping-table`.

.. table:: AMDGPU DWARF Address Class Mapping
   :name: amdgpu-dwarf-address-class-mapping-table

   =========================== ===== =================
   DWARF                             AMDGPU
   --------------------------------- -----------------
   Address Class Name          Value Address Space
   =========================== ===== =================
   ``DW_ADDR_none``            0x00  Generic (Flat)
   ``DW_ADDR_AMDGPU_global``   0x01  Global
   ``DW_ADDR_AMDGPU_region``   0x02  Region (GDS)
   ``DW_ADDR_AMDGPU_local``    0x03  Local (group/LDS)
   ``DW_ADDR_AMDGPU_constant`` 0x04  Global
   ``DW_ADDR_AMDGPU_private``  0x05  Private (Scratch)
   =========================== ===== =================

See :ref:`amdgpu-address-spaces` for information on the AMDGPU address spaces
including address size and NULL value.

For AMDGPU the address class encodes the address class as declared in the
source language type.

For AMDGPU if no ``DW_AT_address_class`` attribute is present, then the
``DW_ADDR_none`` address class is used.
 
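A consumer could model the address class table and its default as in the
following Python sketch. The enumeration values and address space names come
from the table; the class name ``AddressClass``, the function
``address_space_for``, and the use of a plain dictionary to stand in for a
DIE's attributes are hypothetical.

.. code-block:: python

   from enum import IntEnum


   class AddressClass(IntEnum):
       DW_ADDR_none = 0x00
       DW_ADDR_AMDGPU_global = 0x01
       DW_ADDR_AMDGPU_region = 0x02
       DW_ADDR_AMDGPU_local = 0x03
       DW_ADDR_AMDGPU_constant = 0x04
       DW_ADDR_AMDGPU_private = 0x05


   AMDGPU_ADDRESS_SPACE = {
       AddressClass.DW_ADDR_none: "Generic (Flat)",
       AddressClass.DW_ADDR_AMDGPU_global: "Global",
       AddressClass.DW_ADDR_AMDGPU_region: "Region (GDS)",
       AddressClass.DW_ADDR_AMDGPU_local: "Local (group/LDS)",
       AddressClass.DW_ADDR_AMDGPU_constant: "Global",
       AddressClass.DW_ADDR_AMDGPU_private: "Private (Scratch)",
   }


   def address_space_for(die_attributes):
       """Return the AMDGPU address space for a DIE's attributes.

       A missing ``DW_AT_address_class`` falls back to ``DW_ADDR_none``.
       """
       value = die_attributes.get("DW_AT_address_class",
                                  AddressClass.DW_ADDR_none)
       return AMDGPU_ADDRESS_SPACE[AddressClass(value)]


   print(address_space_for({}))
   print(address_space_for({"DW_AT_address_class": 0x03}))

Note that both the global and constant address classes resolve to the same
AMDGPU address space, so the mapping is not invertible.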
 
.. note::

  The ``DW_ADDR_none`` default was defined as ``Generic`` and not ``Global``
  to match the LLVM address space ordering. This ordering was chosen to better
  support CUDA-like languages such as HIP that do not have address spaces in
  the language type system, but do allow variables to be allocated in
  different address spaces. So effectively all CUDA and HIP source language
  addresses are generic.
 
.. note::
 | 
						||
 | 
						||
  Currently DWARF defines address class values as architecture specific. It
 | 
						||
  is unclear how language specific address spaces are intended to be
 | 
						||
  represented in DWARF.
 | 
						||
 | 
						||
  For example, OpenCL defines address spaces for ``global``, ``local``,
 | 
						||
  ``constant``, and ``private``. These are part of the type system and are
 | 
						||
  modifies to pointer types. In addition, OpenCL defines ``generic`` pointers
 | 
						||
  that can reference either the ``global``, ``local``, or ``private`` address
 | 
						||
  spaces. To support the OpenCL language the debugger would want to support
 | 
						||
  casting pointers between the ``generic`` and other address spaces, and
 | 
						||
  possibly using pointer casting to form an address for a specific address
 | 
						||
  space out of an integral value.
 | 
						||
 | 
						||
  The method to use to dereference a pointer type or reference type value is
 | 
						||
  defined in DWARF expressions using ``DW_OP_xderef*`` which uses an
 | 
						||
  architecture specific address space.
 | 
						||
 | 
						||
  DWARF defines the ``DW_AT_address_class`` attribute on pointer types and
 | 
						||
  reference types. It specifies the method to use to dereference them. Why
 | 
						||
  is the value of this not the same as the address space value used in
 | 
						||
  ``DW_OP_xderef*`` since in both cases it is architecture specific and the
 | 
						||
  architecture presumably will use the same set of methods to dereference
 | 
						||
  pointers in both cases?
 | 
						||
 | 
						||
  Since ``DW_AT_address_class`` uses an architecture specific value it cannot
 | 
						||
  in general capture the source language address space type modifier concept.
 | 
						||
  On some architectures all source language address space modifies may
 | 
						||
  actually use the same method for dereferencing pointers.
 | 
						||
 | 
						||
  One possibility is for DWARF to add an ``DW_TAG_LLVM_address_class_type``
 | 
						||
  type modifier that can be applied to a pointer type and reference type. The
 | 
						||
  ``DW_AT_address_class`` attribute could be re-defined to not be architecture
 | 
						||
  specific and instead define generalized language values that will support
 | 
						||
  OpenCL and other languages using address spaces. The ``DW_AT_address_class``
 | 
						||
  could be defined to not be applied to pointer or reference types, but
 | 
						||
  instead only to the ``DW_TAG_LLVM_address_class_type`` type modifier entry.
 | 
						||
 | 
						||
  If a pointer type or reference type is not modified by
 | 
						||
  ``DW_TAG_LLVM_address_class_type`` or if ``DW_TAG_LLVM_address_class_type``
 | 
						||
  has no ``DW_AT_address_class`` attribute, then the pointer type or reference
 | 
						||
  type would be defined to use the ``DW_ADDR_none`` address class as
 | 
						||
  currently. Since modifiers can be chained, it would need to be defined whether
  multiple ``DW_TAG_LLVM_address_class_type`` modifiers are legal, and if so
  whether the outermost one takes precedence.
 | 
						||
 | 
						||
  A target implementation that supports multiple address spaces would need to
 | 
						||
  map ``DW_ADDR_none`` appropriately to support CUDA-like languages
 | 
						||
  that have no address classes in the type system, but do support variable
 | 
						||
  allocation in address spaces. See the above note that describes why AMDGPU
 | 
						||
  chose to make ``DW_ADDR_none`` map to the ``Generic`` AMDGPU address space
 | 
						||
  and not the ``Global`` address space.
 | 
						||
 | 
						||
  An alternative would be to define ``DW_ADDR_none`` as being the global
 | 
						||
  address class and then change ``DW_ADDR_global`` to ``DW_ADDR_generic``.
 | 
						||
  Compilers generating DWARF for CUDA-like languages would then have to define
 | 
						||
  every CUDA-like language pointer type or reference type using
 | 
						||
  ``DW_TAG_LLVM_address_class_type`` with a ``DW_AT_address_class`` attribute
 | 
						||
  of ``DW_ADDR_generic`` to match the language semantics. The AMDGPU
 | 
						||
  alternative avoids needing to do this and seems to fit better into how Clang
 | 
						||
  and LLVM have added support for the CUDA-like languages on top of existing
 | 
						||
  C++ language support.
 | 
						||
 | 
						||
  A new ``DW_AT_address_space`` attribute could be defined that can be applied
 | 
						||
  to pointer type, reference type, subroutine, and subroutine type to describe
 | 
						||
  how objects having the given type are dereferenced or called (the role that
 | 
						||
  ``DW_AT_address_class`` currently provides). The values of
 | 
						||
  ``DW_AT_address_space`` would be architecture specific and the same as used
 | 
						||
  in ``DW_OP_xderef*``.
 | 
						||
 | 
						||
.. _amdgpu-dwarf-address-space-mapping:
 | 
						||
 | 
						||
Address Space Mapping
 | 
						||
~~~~~~~~~~~~~~~~~~~~~
 | 
						||
 | 
						||
DWARF address spaces are used in location expressions to describe the memory
 | 
						||
space where data resides. Address spaces correspond to a target specific memory
 | 
						||
space and are not tied to any source language concept.
 | 
						||
 | 
						||
The AMDGPU address space mapping is defined in
 | 
						||
:ref:`amdgpu-dwarf-address-space-mapping-table`.
 | 
						||
 | 
						||
.. table:: AMDGPU DWARF Address Space Mapping
   :name: amdgpu-dwarf-address-space-mapping-table

   ======================================= ===== ======= ======== ================= =======================
   DWARF                                                          AMDGPU            Notes
   --------------------------------------- ----- ---------------- ----------------- -----------------------
   Address Space Name                      Value Address Bit Size Address Space
   --------------------------------------- ----- ------- -------- ----------------- -----------------------
   ..                                            64-bit  32-bit
                                                 process process
                                                 address address
                                                 space   space
   ======================================= ===== ======= ======== ================= =======================
   ``DW_ASPACE_none``                      0x00  8       4        Global            *default address space*
   ``DW_ASPACE_AMDGPU_generic``            0x01  8       4        Generic (Flat)
   ``DW_ASPACE_AMDGPU_region``             0x02  4       4        Region (GDS)
   ``DW_ASPACE_AMDGPU_local``              0x03  4       4        Local (group/LDS)
   *Reserved*                              0x04
   ``DW_ASPACE_AMDGPU_private_lane``       0x05  4       4        Private (Scratch) *focused lane*
   ``DW_ASPACE_AMDGPU_private_wave``       0x06  4       4        Private (Scratch) *unswizzled wave*
   *Reserved*                              0x07-
                                           0x1F
   ``DW_ASPACE_AMDGPU_private_lane<0-63>`` 0x20- 4       4        Private (Scratch) *specific lane*
                                           0x5F
   ======================================= ===== ======= ======== ================= =======================
 | 
						||
 | 
						||
See :ref:`amdgpu-address-spaces` for information on the AMDGPU address spaces
 | 
						||
including address size and NULL value.
 | 
						||
 | 
						||
The ``DW_ASPACE_none`` address space is the default address space used in DWARF
 | 
						||
operations that do not specify an address space. It therefore has to map to the
 | 
						||
global address space so that the ``DW_OP_addr*`` and related operations can
 | 
						||
refer to addresses in the program code.
 | 
						||
 | 
						||
The ``DW_ASPACE_AMDGPU_generic`` address space allows location expressions to
 | 
						||
specify the flat address space. If the address corresponds to an address in the
 | 
						||
local address space then it corresponds to the wave that is executing the
 | 
						||
focused thread of execution. If the address corresponds to an address in the
 | 
						||
private address space then it corresponds to the lane that is executing the
 | 
						||
focused thread of execution for languages that are implemented using a SIMD or
 | 
						||
SIMT execution model.
 | 
						||
 | 
						||
.. note::
 | 
						||
 | 
						||
  CUDA-like languages such as HIP that do not have address spaces in the
 | 
						||
  language type system, but do allow variables to be allocated in different
 | 
						||
  address spaces, will need to explicitly specify the
 | 
						||
  ``DW_ASPACE_AMDGPU_generic`` address space in the DWARF operations as the
 | 
						||
  default address space is the global address space.
 | 
						||
 | 
						||
The ``DW_ASPACE_AMDGPU_local`` address space allows location expressions to
 | 
						||
specify the local address space corresponding to the wave that is executing the
 | 
						||
focused thread of execution.
 | 
						||
 | 
						||
The ``DW_ASPACE_AMDGPU_private_lane`` address space allows location expressions
 | 
						||
to specify the private address space corresponding to the lane that is
 | 
						||
executing the focused thread of execution for languages that are implemented
 | 
						||
using a SIMD or SIMT execution model.
 | 
						||
 | 
						||
The ``DW_ASPACE_AMDGPU_private_wave`` address space allows location expressions
 | 
						||
to specify the unswizzled private address space corresponding to the wave that
 | 
						||
is executing the focused thread of execution. The wave view of private memory
 | 
						||
is the per wave unswizzled backing memory layout defined in
 | 
						||
:ref:`amdgpu-address-spaces`, such that address 0 corresponds to the first
 | 
						||
location for the backing memory of the wave (namely the address is not offset
 | 
						||
by ``wavefront-scratch-base``). So to convert from a
 | 
						||
``DW_ASPACE_AMDGPU_private_lane`` to a ``DW_ASPACE_AMDGPU_private_wave``
 | 
						||
segment address perform the following:
 | 
						||
 | 
						||
::

  private-address-wave =
    ((private-address-lane / 4) * wavefront-size * 4) +
    (wavefront-lane-id * 4) + (private-address-lane % 4)
 | 
						||
 | 
						||
If the ``DW_ASPACE_AMDGPU_private_lane`` segment address is dword aligned and
 | 
						||
the start of the dwords for each lane starting with lane 0 is required, then
 | 
						||
this simplifies to:
 | 
						||
 | 
						||
::

  private-address-wave =
    private-address-lane * wavefront-size
 | 
						||
 | 
						||
A compiler can use this address space to read a complete spilled vector
 | 
						||
register back into a complete vector register in the CFI. The frame pointer can
 | 
						||
be a private lane segment address which is dword aligned, which can be shifted
 | 
						||
to multiply by the wave size, and then used to form a private wave segment
 | 
						||
address that gives a location for a contiguous set of dwords, one per lane,
 | 
						||
where the vector register dwords are spilled. The compiler knows the wave size
 | 
						||
since it generates the code. Note that the type of the address may have to be
 | 
						||
converted as the size of a private lane segment address may be smaller than the
 | 
						||
size of a private wave segment address.
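
As an illustration only, the two conversions above can be sketched in Python.
The function and parameter names are hypothetical and simply mirror the terms
used in the formulas; they are not part of any AMDGPU tool or API. Integer
division is assumed for the ``/`` in the formula.

.. code-block:: python

   def private_lane_to_wave(lane_addr: int, lane_id: int, wavefront_size: int) -> int:
       """Convert a private lane segment address to a private wave segment address.

       Scratch memory is assumed to be swizzled in dwords (4 bytes) across lanes,
       as described by the formula in the text.
       """
       dword_index, byte_in_dword = divmod(lane_addr, 4)
       return (dword_index * wavefront_size * 4) + (lane_id * 4) + byte_in_dword


   def private_lane_to_wave_aligned(lane_addr: int, wavefront_size: int) -> int:
       """Simplified form: dword-aligned address, start of the dwords at lane 0."""
       if lane_addr % 4 != 0:
           raise ValueError("lane address must be dword aligned")
       return lane_addr * wavefront_size

For a dword-aligned address and lane 0 both functions give the same result,
which is why a compiler can form the wave address with a single shift when the
wavefront size is a power of two.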
 | 
						||
 | 
						||
The ``DW_ASPACE_AMDGPU_private_lane<n>`` address space allows location
 | 
						||
expressions to specify the private address space corresponding to a specific
 | 
						||
lane. For example, this can be used when the compiler spills scalar registers
 | 
						||
to scratch memory, with each scalar register being saved to a different lane's
 | 
						||
scratch memory.
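
The numbering in :ref:`amdgpu-dwarf-address-space-mapping-table` can be
captured as constants. The following Python sketch is illustrative only: the
constant and helper names are assumptions chosen to mirror the table, and the
values are copied from it.

.. code-block:: python

   # Values copied from the address space mapping table.
   DW_ASPACE_NONE = 0x00
   DW_ASPACE_AMDGPU_GENERIC = 0x01
   DW_ASPACE_AMDGPU_REGION = 0x02
   DW_ASPACE_AMDGPU_LOCAL = 0x03
   DW_ASPACE_AMDGPU_PRIVATE_LANE = 0x05
   DW_ASPACE_AMDGPU_PRIVATE_WAVE = 0x06
   PRIVATE_LANE_N_FIRST = 0x20   # DW_ASPACE_AMDGPU_private_lane<0>
   PRIVATE_LANE_N_LAST = 0x5F    # DW_ASPACE_AMDGPU_private_lane<63>


   def private_lane_aspace(lane: int) -> int:
       """Return the address space value for DW_ASPACE_AMDGPU_private_lane<lane>."""
       if not 0 <= lane <= 63:
           raise ValueError("lane must be in the range 0 to 63")
       return PRIVATE_LANE_N_FIRST + lane


   def is_reserved_aspace(value: int) -> bool:
       """True for the values the table marks as reserved (0x04 and 0x07-0x1F)."""
       return value == 0x04 or 0x07 <= value <= 0x1F

For example, a scalar register spilled to lane 3 of scratch memory would be
described using address space ``private_lane_aspace(3)``.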
 | 
						||
 | 
						||
.. _amdgpu-dwarf-expressions:
 | 
						||
 | 
						||
Expressions
 | 
						||
~~~~~~~~~~~
 | 
						||
 | 
						||
The following sections define the new DWARF expression operations used by
AMDGPU, and clarify the extensions to existing DWARF 5 operations.
 | 
						||
 | 
						||
DWARF expressions describe how to compute a value or specify a location
 | 
						||
description. An expression is encoded as a stream of operations, each consisting
 | 
						||
of an opcode followed by zero or more literal operands. The number of operands
 | 
						||
is implied by the opcode.
 | 
						||
 | 
						||
Operations represent a postfix operation on a simple stack machine. They can act
 | 
						||
on entries on the stack, including adding entries and removing entries. If the
 | 
						||
kind of a stack entry does not match the kind required by the operation, and is
 | 
						||
not implicitly convertible to the required kind, then the DWARF expression is
 | 
						||
ill-formed.
 | 
						||
 | 
						||
Each stack entry can be one of two kinds: a value or a location description.
 | 
						||
Value stack entries are described in :ref:`amdgpu-value-operations` and
 | 
						||
location description stack entries are described in
 | 
						||
:ref:`amdgpu-location-description-operations`.
 | 
						||
 | 
						||
*The evaluation of a DWARF expression can provide the location description of an
 | 
						||
object, the value of an array bound, the length of a dynamic string, the desired
 | 
						||
value itself, and so on.*
 | 
						||
 | 
						||
The result of the evaluation of a DWARF expression is defined as:
 | 
						||
 | 
						||
* If evaluation of the DWARF expression is on behalf of a ``DW_OP_call*``
 | 
						||
  operation for a ``DW_AT_location`` attribute that belongs to a
 | 
						||
  ``DW_TAG_dwarf_procedure`` debugging information entry, then all the entries
 | 
						||
  on the stack are left, and execution of the DWARF expression containing the
 | 
						||
  ``DW_OP_call*`` operation continues.
 | 
						||
 | 
						||
* If evaluation of the DWARF expression requires a location description, then:
 | 
						||
 | 
						||
  * If the stack is empty, an undefined location description is returned.
 | 
						||
 | 
						||
  * If the top stack entry is a location description, or can be converted to
 | 
						||
    one, then the, possibly converted, location description is returned. Any
 | 
						||
    other entries on the stack are discarded.
 | 
						||
 | 
						||
  * Otherwise the DWARF expression is ill-formed.
 | 
						||
 | 
						||
    .. note::
 | 
						||
 | 
						||
      This case could instead be defined as returning an implicit location
      description as if the ``DW_OP_implicit`` operation were performed.
 | 
						||
 | 
						||
* If evaluation of the DWARF expression requires a value, then:
 | 
						||
 | 
						||
  * If the top stack entry is a value, or can be converted to one, then the,
 | 
						||
    possibly converted, value is returned. Any other entries on the stack are
 | 
						||
    discarded.
 | 
						||
 | 
						||
  * Otherwise the DWARF expression is ill-formed.
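
The rules for the location description and value cases can be sketched as
follows. This is a minimal Python model under stated assumptions: the
``Value`` and ``Location`` classes, the ``IllFormedExpression`` exception, and
the ``convert`` callback are hypothetical names, the conversion rules
themselves are not modeled, and the ``DW_OP_call*`` case (which leaves the
stack untouched) is omitted.

.. code-block:: python

   from dataclasses import dataclass


   class IllFormedExpression(Exception):
       """Raised when the DWARF expression is ill-formed."""


   @dataclass(frozen=True)
   class Value:
       bits: int


   @dataclass(frozen=True)
   class Location:
       storage: str
       bit_offset: int = 0


   UNDEFINED_LOCATION = Location("undefined")


   def expression_result(stack, want, convert=lambda entry, kind: None):
       """Return the result of an evaluated expression.

       stack   -- list of entries, with the top of the stack last
       want    -- Location or Value, the kind the context requires
       convert -- hypothetical hook returning a converted entry, or None
       """
       if not stack:
           if want is Location:
               return UNDEFINED_LOCATION
           raise IllFormedExpression("a value is required but the stack is empty")
       top = stack[-1]  # any other entries are discarded
       if isinstance(top, want):
           return top
       converted = convert(top, want)
       if converted is None:
           raise IllFormedExpression("top entry has the wrong kind")
       return converted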
 | 
						||
 | 
						||
.. _amdgpu-stack-operations:
 | 
						||
 | 
						||
Stack Operations
 | 
						||
++++++++++++++++
 | 
						||
 | 
						||
The following operations manipulate the DWARF stack. Operations that index
 | 
						||
the stack assume that the top of the stack (most recently added entry) has index
 | 
						||
0. They allow the stack entries to be either a value or location description.
 | 
						||
 | 
						||
If any stack entry accessed by a stack operation is an incomplete composite
 | 
						||
location description, then the DWARF expression is ill-formed.
 | 
						||
 | 
						||
.. note::
 | 
						||
 | 
						||
  These operations now support stack entries that are values and location
 | 
						||
  descriptions.
 | 
						||
 | 
						||
.. note::
 | 
						||
 | 
						||
  If it is desired to also make them work with incomplete composite location
  descriptions, then it would be necessary to define that the composite
  location storage specified by the incomplete composite location description
  is also replicated when a copy is pushed. This ensures that each copy of the
  incomplete composite location description can update the composite location
  storage it specifies independently.
 | 
						||
 | 
						||
1.  ``DW_OP_dup``
 | 
						||
 | 
						||
    ``DW_OP_dup`` duplicates the stack entry at the top of the stack.
 | 
						||
 | 
						||
2.  ``DW_OP_drop``
 | 
						||
 | 
						||
    ``DW_OP_drop`` pops the stack entry at the top of the stack and discards it.
 | 
						||
 | 
						||
3.  ``DW_OP_pick``
 | 
						||
 | 
						||
    ``DW_OP_pick`` has a single unsigned 1-byte operand that is treated as an
 | 
						||
    index I. A copy of the stack entry with index I is pushed onto the stack.
 | 
						||
 | 
						||
4.  ``DW_OP_over``
 | 
						||
 | 
						||
    ``DW_OP_over`` pushes a copy of the stack entry with index 1.
 | 
						||
 | 
						||
    *This is equivalent to a* ``DW_OP_pick 1`` *operation.*
 | 
						||
 | 
						||
5.  ``DW_OP_swap``
 | 
						||
 | 
						||
    ``DW_OP_swap`` swaps the top two stack entries. The entry at the top of the
 | 
						||
    stack becomes the second stack entry, and the second stack entry becomes the
 | 
						||
    top of the stack.
 | 
						||
 | 
						||
6.  ``DW_OP_rot``
 | 
						||
 | 
						||
    ``DW_OP_rot`` rotates the first three stack entries. The entry at the top of
 | 
						||
    the stack becomes the third stack entry, the second entry becomes the top of
 | 
						||
    the stack, and the third entry becomes the second entry.
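
The six stack operations above can be modeled on a Python list whose last
element is the top of the stack (index 0 in the text). This is a sketch only:
the function names are hypothetical, entries are assumed to be immutable so
that a pushed copy is independent, and the ill-formed case for incomplete
composite location descriptions is not modeled.

.. code-block:: python

   def dw_op_dup(stack):
       """Duplicate the entry at the top of the stack."""
       stack.append(stack[-1])


   def dw_op_drop(stack):
       """Pop the entry at the top of the stack and discard it."""
       stack.pop()


   def dw_op_pick(stack, index):
       """Push a copy of the entry with the given index (0 is the top)."""
       stack.append(stack[-1 - index])


   def dw_op_over(stack):
       """Push a copy of the entry with index 1."""
       dw_op_pick(stack, 1)


   def dw_op_swap(stack):
       """Swap the top two entries."""
       stack[-1], stack[-2] = stack[-2], stack[-1]


   def dw_op_rot(stack):
       """Top becomes third, second becomes top, third becomes second."""
       top = stack.pop()
       stack.insert(len(stack) - 2, top)

With the list ``[1, 2, 3]`` (3 on top), ``dw_op_rot`` leaves ``[3, 1, 2]``: the
old top is now third, the old second entry is on top, and the old third entry
is second.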
 | 
						||
 | 
						||
.. _amdgpu-value-operations:
 | 
						||
 | 
						||
Value Operations
 | 
						||
++++++++++++++++
 | 
						||
 | 
						||
Each value stack entry has a type and a value, and can represent a value of
 | 
						||
any supported base type of the target machine. The base type specifies the size
 | 
						||
and encoding of the value.
 | 
						||
 | 
						||
.. note::
 | 
						||
 | 
						||
  It may be better to add an implicit pointer value kind that is produced when
 | 
						||
  ``DW_OP_deref*`` retrieves the full contents of an implicit pointer location
 | 
						||
  storage created by the ``DW_OP_implicit_pointer`` or
 | 
						||
  ``DW_OP_LLVM_aspace_implicit_pointer`` operations.
 | 
						||
 | 
						||
Instead of a base type, value stack entries can have a distinguished generic
 | 
						||
type, which is an integral type that has the size of an address in the target
 | 
						||
architecture default address space on the target machine and unspecified
 | 
						||
signedness.
 | 
						||
 | 
						||
*The generic type is the same as the unspecified type used for stack operations
 | 
						||
defined in DWARF Version 4 and before.*
 | 
						||
 | 
						||
An integral type is a base type that has an encoding of ``DW_ATE_signed``,
 | 
						||
``DW_ATE_signed_char``, ``DW_ATE_unsigned``, ``DW_ATE_unsigned_char``,
 | 
						||
``DW_ATE_boolean``, or any target architecture defined integral encoding in the
 | 
						||
inclusive range ``DW_ATE_lo_user`` to ``DW_ATE_hi_user``.
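
The definition above can be expressed as a small predicate. In this Python
sketch the encoding constants use the values assigned by the DWARF Version 5
standard, while the function name and the ``target_integral_encodings``
parameter are assumptions: which user-range encodings count as integral is up
to the target architecture, so the caller has to supply them.

.. code-block:: python

   # Base type encodings as assigned by DWARF Version 5.
   DW_ATE_boolean = 0x02
   DW_ATE_signed = 0x05
   DW_ATE_signed_char = 0x06
   DW_ATE_unsigned = 0x07
   DW_ATE_unsigned_char = 0x08
   DW_ATE_lo_user = 0x80
   DW_ATE_hi_user = 0xFF

   STANDARD_INTEGRAL_ENCODINGS = frozenset({
       DW_ATE_signed,
       DW_ATE_signed_char,
       DW_ATE_unsigned,
       DW_ATE_unsigned_char,
       DW_ATE_boolean,
   })


   def is_integral_encoding(encoding, target_integral_encodings=frozenset()):
       """Return True if a base type with this encoding is an integral type."""
       if encoding in STANDARD_INTEGRAL_ENCODINGS:
           return True
       # Only target-defined encodings inside the user range can qualify.
       in_user_range = DW_ATE_lo_user <= encoding <= DW_ATE_hi_user
       return in_user_range and encoding in target_integral_encodings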
 | 
						||
 | 
						||
.. note::
 | 
						||
 | 
						||
  It is unclear whether ``DW_ATE_address`` is an integral type. GDB does not
  seem to consider it integral.
 | 
						||
 | 
						||
1.  ``DW_OP_LLVM_push_lane`` *New*
 | 
						||
 | 
						||
    ``DW_OP_LLVM_push_lane`` pushes a value with the generic type that is the
 | 
						||
    target architecture lane identifier of the thread of execution for which a
 | 
						||
    user presented expression is currently being evaluated. For languages that
 | 
						||
    are implemented using a SIMD or SIMT execution model this is the lane number
 | 
						||
    that corresponds to the source language thread of execution upon which the
 | 
						||
    user is focused. Otherwise this is the value 0.
 | 
						||
 | 
						||
    For AMDGPU, the lane identifier returned by ``DW_OP_LLVM_push_lane``
 | 
						||
    corresponds to the hardware lane number which is numbered from 0 to the
 | 
						||
    wavefront size minus 1.
 | 
						||
 | 
						||
2.  ``DW_OP_entry_value``
 | 
						||
 | 
						||
    ``DW_OP_entry_value`` pushes the value that the described location held upon
 | 
						||
    entering the current subprogram.
 | 
						||
 | 
						||
    It has two operands. The first is an unsigned LEB128 integer. The second is
 | 
						||
    a block of bytes, with a length equal to the first operand, treated as a
 | 
						||
    DWARF expression E.
 | 
						||
 | 
						||
    E is evaluated as if it had been evaluated upon entering the current
 | 
						||
    subprogram. E assumes no values are present on the DWARF stack initially and
 | 
						||
    results in exactly one value being pushed on the DWARF stack when completed.
 | 
						||
 | 
						||
    ``DW_OP_push_object_address`` is not meaningful inside of this DWARF
 | 
						||
    operation.
 | 
						||
 | 
						||
    If the result of E is a register location description (see
 | 
						||
    :ref:`amdgpu-register-location-descriptions`), ``DW_OP_entry_value`` pushes
 | 
						||
    the value that register had upon entering the current subprogram. The value
 | 
						||
    entry type is the target machine register base type. If the register value
 | 
						||
    is undefined or the register location description bit offset is not 0, then
 | 
						||
    the DWARF expression is ill-formed.
 | 
						||
 | 
						||
    *The register location description provides a more compact form for the case
 | 
						||
    where the value was in a register on entry to the subprogram.*
 | 
						||
 | 
						||
    Otherwise, the expression result is required to be a value, and
 | 
						||
    ``DW_OP_entry_value`` pushes that value.
 | 
						||
 | 
						||
    *The values needed to evaluate* ``DW_OP_entry_value`` *could be obtained in
 | 
						||
    several ways. The consumer could suspend execution on entry to the
 | 
						||
    subprogram, record values needed by* ``DW_OP_entry_value`` *expressions
 | 
						||
    within the subprogram, and then continue; when evaluating*
 | 
						||
    ``DW_OP_entry_value``\ *, the consumer would use these recorded values
 | 
						||
    rather than the current values. Or, when evaluating* ``DW_OP_entry_value``\
 | 
						||
    *, the consumer could virtually unwind using the Call Frame Information
 | 
						||
    (see* :ref:`amdgpu-call-frame-information`\ *) to recover register values
 | 
						||
    that might have been clobbered since the subprogram entry point.*
 | 
						||
 | 
						||
    .. note::
 | 
						||
 | 
						||
      Unclear why this operation is defined this way. If the expression is
 | 
						||
      simply using existing variables then it is just a regular expression. It
 | 
						||
      is unclear how the compiler instructs the consumer how to create the saved
 | 
						||
      copies of the variables on entry. Seems only the compiler knows how to do
 | 
						||
      this. If the main purpose is only to read the entry value of a register
 | 
						||
      using CFI then it would be better to have an operation that explicitly does
 | 
						||
      just that such as ``DW_OP_LLVM_call_frame_entry_reg``.
 | 
						||
 | 
						||
.. _amdgpu-location-description-operations:
 | 
						||
 | 
						||
Location Description Operations
 | 
						||
+++++++++++++++++++++++++++++++
 | 
						||
 | 
						||
Information about the location of program objects is provided by location
 | 
						||
descriptions. Location descriptions specify the storage that holds the program
 | 
						||
objects, and a position within the storage.
 | 
						||
 | 
						||
A location storage is a linear stream of bits that can hold values. Each
 | 
						||
location storage has a size in bits and can be accessed using a zero-based bit
 | 
						||
offset. The ordering of bits within location storage uses the bit numbering and
 | 
						||
direction conventions that are appropriate to the current language on the target
 | 
						||
architecture.
 | 
						||
 | 
						||
.. note::
 | 
						||
 | 
						||
  For AMDGPU bytes are ordered with least significant bytes first, and bits are
 | 
						||
  ordered within bytes with least significant bits first.
 | 
						||
 | 
						||
There are five kinds of location storage: undefined, memory, register, implicit,
 | 
						||
and composite. Memory and register location storage corresponds to the target
 | 
						||
architecture memory address spaces and registers. Implicit location storage
 | 
						||
corresponds to fixed values that can only be read. Undefined location storage
 | 
						||
indicates no value is available and therefore cannot be read or written.
 | 
						||
Composite location storage allows a mixture of these where some bits come from
 | 
						||
one kind of location storage and some from another kind of location storage.
 | 
						||
 | 
						||
.. note::
 | 
						||
 | 
						||
  It may be better to add an implicit pointer location storage kind for
 | 
						||
  ``DW_OP_implicit_pointer`` or ``DW_OP_LLVM_aspace_implicit_pointer``.
 | 
						||
 | 
						||
Location description stack entries specify a location storage to which they
 | 
						||
refer, and a bit offset relative to the start of the location storage.
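
A location storage and a location description that refers to it can be
modeled as a pair of small data structures. The Python sketch below is
illustrative only: the class and function names are hypothetical, composite
storage is not modeled, and the bit ordering follows the AMDGPU note above
(least significant byte first, least significant bit first within a byte).

.. code-block:: python

   from dataclasses import dataclass


   @dataclass(frozen=True)
   class LocationStorage:
       kind: str    # "undefined", "memory", "register", "implicit" or "composite"
       data: bytes  # backing bits of the storage, as whole bytes

       @property
       def size_in_bits(self) -> int:
           return len(self.data) * 8


   @dataclass(frozen=True)
   class LocationDescription:
       storage: LocationStorage
       bit_offset: int = 0  # zero-based offset from the start of the storage


   def read_bits(location: LocationDescription, bit_count: int) -> int:
       """Read bit_count bits starting at the location's bit offset."""
       storage = location.storage
       if storage.kind == "undefined":
           raise ValueError("undefined location storage cannot be read")
       end = location.bit_offset + bit_count
       if location.bit_offset < 0 or end > storage.size_in_bits:
           raise ValueError("access exceeds the size of the location storage")
       # Little-endian conversion makes storage bit i equal to integer bit i.
       whole = int.from_bytes(storage.data, "little")
       return (whole >> location.bit_offset) & ((1 << bit_count) - 1)

Keeping the bit offset in the location description rather than in the storage
lets several descriptions refer to different positions in the same storage.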
 | 
						||
 | 
						||
General Operations
 | 
						||
##################
 | 
						||
 | 
						||
1.  ``DW_OP_LLVM_offset`` *New*
 | 
						||
 | 
						||
    ``DW_OP_LLVM_offset`` pops two stack entries. The first must be an integral
 | 
						||
    type value that is treated as a byte displacement D. The second must be a
 | 
						||
    location description L.
 | 
						||
 | 
						||
    It adds the value of D scaled by 8 (the byte size) to the bit offset of L,
 | 
						||
    and pushes the updated L.
 | 
						||
 | 
						||
    If the updated bit offset of L is less than 0 or greater than or equal to
 | 
						||
    the size of the location storage specified by L, then the DWARF expression
 | 
						||
    is ill-formed.
 | 
						||
 | 
						||
2.  ``DW_OP_LLVM_offset_uconst`` *New*
 | 
						||
 | 
						||
    ``DW_OP_LLVM_offset_uconst`` has a single unsigned LEB128 integer operand
 | 
						||
    that is treated as a byte displacement D.
 | 
						||
 | 
						||
    It pops one stack entry that must be a location description L. It adds the
 | 
						||
    value of D scaled by 8 (the byte size) to the bit offset of L, and pushes
 | 
						||
    the updated L.
 | 
						||
 | 
						||
    If the updated bit offset of L is less than 0 or greater than or equal to
 | 
						||
    the size of the location storage specified by L, then the DWARF expression
 | 
						||
    is ill-formed.
 | 
						||
 | 
						||
    *This operation is supplied specifically to be able to encode more field
 | 
						||
    displacements in two bytes than can be done with* ``DW_OP_lit<n>
 | 
						||
    DW_OP_LLVM_offset``\ *.*
 | 
						||
 | 
						||
3.  ``DW_OP_LLVM_bit_offset`` *New*
 | 
						||
 | 
						||
    ``DW_OP_LLVM_bit_offset`` pops two stack entries. The first must be an
 | 
						||
    integral type value that is treated as a bit displacement D. The second must
 | 
						||
    be a location description L.
 | 
						||
 | 
						||
    It adds the value of D to the bit offset of L, and pushes the updated L.
 | 
						||
 | 
						||
    If the updated bit offset of L is less than 0 or greater than or equal to
 | 
						||
    the size of the location storage specified by L, then the DWARF expression
 | 
						||
    is ill-formed.
 | 
						||
 | 
						||
4.  ``DW_OP_deref``
 | 
						||
 | 
						||
    The ``DW_OP_deref`` operation pops one stack entry that must be a location
 | 
						||
    description L.
 | 
						||
 | 
						||
    A value of the bit size of the generic type is retrieved from the location
 | 
						||
    storage specified by L starting at the bit offset specified by L. The
 | 
						||
    retrieved generic type value V is pushed on the stack.
 | 
						||
 | 
						||
    If any bit of the value is retrieved from the undefined location storage, or
 | 
						||
    the offset of any bit exceeds the size of the location storage specified by
 | 
						||
    L, then the DWARF expression is ill-formed.
 | 
						||
 | 
						||
    See :ref:`amdgpu-implicit-location-descriptions` for special rules
    concerning implicit location descriptions created by the
    ``DW_OP_implicit_pointer`` and ``DW_OP_LLVM_aspace_implicit_pointer``
    operations.
 | 
						||
 | 
						||
5.  ``DW_OP_deref_size``
 | 
						||
 | 
						||
    ``DW_OP_deref_size`` has a single 1-byte unsigned integral constant treated
 | 
						||
    as a byte result size S.
 | 
						||
 | 
						||
    It pops one stack entry that must be a location description L.
 | 
						||
 | 
						||
    A value of S scaled by 8 (the byte size) bits is retrieved from the location
 | 
						||
    storage specified by L starting at the bit offset specified by L. The value
 | 
						||
    V retrieved is zero-extended to the bit size of the generic type before
 | 
						||
    being pushed onto the stack with the generic type.
 | 
						||
 | 
						||
    If S is larger than the byte size of the generic type, if any bit of the
 | 
						||
    value is retrieved from the undefined location storage, or if the offset of
 | 
						||
    any bit exceeds the size of the location storage specified by L, then the
 | 
						||
    DWARF expression is ill-formed.
 | 
						||
 | 
						||
    See :ref:`amdgpu-implicit-location-descriptions` for special rules
    concerning implicit location descriptions created by the
    ``DW_OP_implicit_pointer`` and ``DW_OP_LLVM_aspace_implicit_pointer``
    operations.
 | 
						||
 | 
						||
6.  ``DW_OP_deref_type``
 | 
						||
 | 
						||
    ``DW_OP_deref_type`` has two operands. The first is a 1-byte unsigned
 | 
						||
    integral constant whose value S is the same as the size of the base type
 | 
						||
    referenced by the second operand. The second operand is an unsigned LEB128
 | 
						||
    integer that represents the offset of a debugging information entry E in the
 | 
						||
    current compilation unit, which must be a ``DW_TAG_base_type`` entry that
 | 
						||
    provides the type of the result value.
 | 
						||
 | 
						||
    It pops one stack entry that must be a location description L. A value of S
    scaled by 8 (the byte size) bits is retrieved from the location storage
    specified by L starting at the bit offset specified by L. The retrieved
    result type value V is pushed on the stack.
 | 
						||
 | 
						||
    If any bit of the value is retrieved from the undefined location storage, or
 | 
						||
    if the offset of any bit exceeds the size of the specified location storage,
 | 
						||
    then the DWARF expression is ill-formed.
 | 
						||
 | 
						||
    See :ref:`amdgpu-implicit-location-descriptions` for special rules
    concerning implicit location descriptions created by the
    ``DW_OP_implicit_pointer`` and ``DW_OP_LLVM_aspace_implicit_pointer``
    operations.
 | 
						||
 | 
						||
    *While the size of the pushed value could be inferred from the base type
 | 
						||
    definition, it is encoded explicitly into the operation so that the
 | 
						||
    operation can be parsed easily without reference to the* ``.debug_info``
 | 
						||
    *section.*
 | 
						||
 | 
						||
7.  ``DW_OP_xderef`` *Deprecated*
 | 
						||
 | 
						||
    ``DW_OP_xderef`` pops two stack entries. The first must be an integral type
 | 
						||
    value that is treated as an address A. The second must be an integral type
 | 
						||
    value that is treated as an address space identifier AS for those
 | 
						||
    architectures that support multiple address spaces.
 | 
						||
 | 
						||
    The operation is equivalent to performing ``DW_OP_swap;
 | 
						||
    DW_OP_LLVM_form_aspace_address; DW_OP_deref``. The retrieved generic type
 | 
						||
    value V is left on the stack.
 | 
						||
 | 
						||
8.  ``DW_OP_xderef_size`` *Deprecated*
 | 
						||
 | 
						||
    ``DW_OP_xderef_size`` has a single 1-byte unsigned integral constant treated
 | 
						||
    as a byte result size S.
 | 
						||
 | 
						||
    It pops two stack entries. The first must be an integral type value that is
 | 
						||
    treated as an address A. The second must be an integral type value that is
 | 
						||
    treated as an address space identifier AS for those architectures that
 | 
						||
    support multiple address spaces.
 | 
						||
 | 
						||
    The operation is equivalent to performing ``DW_OP_swap;
 | 
						||
    DW_OP_LLVM_form_aspace_address; DW_OP_deref_size S``. The zero-extended
 | 
						||
    retrieved generic type value V is left on the stack.
 | 
						||
 | 
						||
9.  ``DW_OP_xderef_type`` *Deprecated*
 | 
						||
 | 
						||
    ``DW_OP_xderef_type`` has two operands. The first is a 1-byte unsigned
 | 
						||
    integral constant S whose value is the same as the size of the base type
 | 
						||
    referenced by the second operand. The second operand is an unsigned LEB128
 | 
						||
    integer R that represents the offset of a debugging information entry E in
 | 
						||
    the current compilation unit, which must be a ``DW_TAG_base_type`` entry
 | 
						||
    that provides the type of the result value.
 | 
						||
 | 
						||
    It pops two stack entries. The first must be an integral type value that is
 | 
						||
    treated as an address A. The second must be an integral type value that is
 | 
						||
    treated as an address space identifier AS for those architectures that
 | 
						||
    support multiple address spaces.
 | 
						||
 | 
						||
    The operation is equivalent to performing ``DW_OP_swap;
 | 
						||
    DW_OP_LLVM_form_aspace_address; DW_OP_deref_type S R``. The retrieved result
 | 
						||
    type value V is left on the stack.
 | 
						||
 | 
						||
10. ``DW_OP_push_object_address``
 | 
						||
 | 
						||
    ``DW_OP_push_object_address`` pushes the location description L of the
 | 
						||
    object currently being evaluated as part of evaluation of a user presented
 | 
						||
    expression.
 | 
						||
 | 
						||
    This object may correspond to an independent variable described by its own
 | 
						||
    debugging information entry or it may be a component of an array, structure,
 | 
						||
    or class whose address has been dynamically determined by an earlier step
 | 
						||
    during user expression evaluation.
 | 
						||
 | 
						||
    *This operator provides explicit functionality (especially for arrays
 | 
						||
    involving descriptors) that is analogous to the implicit push of the base
 | 
						||
    address of a structure prior to evaluation of a
 | 
						||
    ``DW_AT_data_member_location`` to access a data member of a structure.*
 | 
						||
 | 
						||
11. ``DW_OP_call2, DW_OP_call4, DW_OP_call_ref``
 | 
						||
 | 
						||
    ``DW_OP_call2``, ``DW_OP_call4``, and ``DW_OP_call_ref`` perform DWARF
 | 
						||
    procedure calls during evaluation of a DWARF expression or location
 | 
						||
    description.
 | 
						||
 | 
						||
    ``DW_OP_call2`` and ``DW_OP_call4`` have one operand that is a 2- or 4-byte
 | 
						||
    unsigned offset, respectively, of a debugging information entry D in the
 | 
						||
    current compilation unit.
 | 
						||
 | 
						||
    ``DW_OP_call_ref`` has one operand that is a 4-byte unsigned value in
 | 
						||
    the 32-bit DWARF format, or an 8-byte unsigned value in the 64-bit DWARF
 | 
						||
    format, that is treated as an offset of a debugging information entry D in a
 | 
						||
    ``.debug_info`` section, which may be contained in an executable or shared
 | 
						||
    object file other than that containing the operator. For references from one
 | 
						||
    executable or shared object file to another, the relocation must be
 | 
						||
    performed by the consumer.
 | 
						||
 | 
						||
    *Operand interpretation of* ``DW_OP_call2``\ *,* ``DW_OP_call4``\ *, and*
 | 
						||
    ``DW_OP_call_ref`` *is exactly like that for* ``DW_FORM_ref2``\ *,*
 | 
						||
    ``DW_FORM_ref4``\ *, and* ``DW_FORM_ref_addr``\ *, respectively.*
 | 
						||
 | 
						||
    If D has a ``DW_AT_location`` attribute, then the DWARF expression E
 | 
						||
    corresponding to the current program location is selected.
 | 
						||
 | 
						||
    .. note::
 | 
						||
 | 
						||
      To allow ``DW_OP_call*`` to compute the location description for any
 | 
						||
      variable or formal parameter regardless of whether the producer has
 | 
						||
      optimized it to a constant, the following rule could be added:
 | 
						||
 | 
						||
      .. note::
 | 
						||
 | 
						||
        If D has a ``DW_AT_const_value`` attribute, then a DWARF expression E
 | 
						||
        consisting of a ``DW_OP_implicit_value`` operation with the value of the
 | 
						||
        ``DW_AT_const_value`` attribute is selected.
 | 
						||
 | 
						||
      This would be consistent with ``DW_OP_implicit_pointer``.
 | 
						||
 | 
						||
      Alternatively, if the use of ``DW_AT_const_value`` were deprecated for
      ``DW_TAG_variable`` and ``DW_TAG_formal_parameter`` debugging information
      entries that are constants, and ``DW_AT_location`` with an implicit
      location description were used instead, then this rule would not be
      required.
 | 
						||
 | 
						||
    Otherwise, an empty expression E is selected.
 | 
						||
 | 
						||
    If D is a ``DW_TAG_dwarf_procedure`` debugging information entry, then E is
 | 
						||
    evaluated using the same DWARF expression stack. Any existing stack entries
 | 
						||
    may be accessed and/or removed in the evaluation of E, and the evaluation of
 | 
						||
    E may add any new stack entries.
 | 
						||
 | 
						||
    *Values on the stack at the time of the call may be used as parameters by
 | 
						||
    the called expression and values left on the stack by the called expression
 | 
						||
    may be used as return values by prior agreement between the calling and
 | 
						||
    called expressions.*
 | 
						||
 | 
						||
    Otherwise, E is evaluated on a separate DWARF stack and the resulting
 | 
						||
    location description L is pushed on the ``DW_OP_call*`` operation's stack.
 | 
						||
 | 
						||
    .. note::
 | 
						||
 | 
						||
      In DWARF 5, if D does not have a ``DW_AT_location`` then ``DW_OP_call*``
 | 
						||
      is defined to have no effect. It is unclear that this is the right
 | 
						||
      definition as a producer should be able to rely on using ``DW_OP_call*``
 | 
						||
      to get a location description for any non-\ ``DW_TAG_dwarf_procedure``
 | 
						||
      debugging information entries, and should not be creating DWARF with
 | 
						||
      ``DW_OP_call*`` to a ``DW_TAG_dwarf_procedure`` that does not have a
 | 
						||
      ``DW_AT_location`` attribute.
 | 
						||
 | 
						||
12. ``DW_OP_LLVM_call_frame_entry_reg`` *New*
 | 
						||
 | 
						||
    ``DW_OP_LLVM_call_frame_entry_reg`` has a single unsigned LEB128 integer
 | 
						||
    operand that is treated as a target architecture register number R.
 | 
						||
 | 
						||
    It pushes a location description L that holds the value of register R on
 | 
						||
    entry to the current subprogram as defined by the Call Frame Information
 | 
						||
    (see :ref:`amdgpu-call-frame-information`).
 | 
						||
 | 
						||
    *If there is no Call Frame Information defined, then the default rules for
 | 
						||
    the target architecture are used. If the register rule is* undefined\ *,
 | 
						||
    then the undefined location description is pushed. If the register rule is*
 | 
						||
    same value\ *, then a register location description for R is pushed.*
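
The equivalence stated for ``DW_OP_xderef`` (``DW_OP_swap;
DW_OP_LLVM_form_aspace_address; DW_OP_deref``) can be illustrated with a minimal
Python sketch of an expression stack. The sketch is not taken from any DWARF
consumer. The ``MemLoc`` class, the ``ASPACE_BITS`` table, the 8-byte
little-endian generic type, and all function names are assumptions made for
illustration only.

```python
from dataclasses import dataclass

GENERIC_SIZE = 8  # assumed size of the generic type, in bytes

# Hypothetical table: address space identifier -> address size in bits.
ASPACE_BITS = {0: 64, 3: 32}


@dataclass(frozen=True)
class MemLoc:
    """A memory location description: address space plus bit offset."""
    aspace: int
    bit_offset: int


def op_swap(stack):
    """Exchange the top two stack entries."""
    stack[-1], stack[-2] = stack[-2], stack[-1]


def op_form_aspace_address(stack):
    """Pop AS (top) and A (second), push a memory location description."""
    aspace = stack.pop()
    addr = stack.pop()
    if aspace not in ASPACE_BITS:
        raise ValueError("ill-formed: unknown address space identifier")
    addr &= (1 << ASPACE_BITS[aspace]) - 1  # keep the least significant S bits
    stack.append(MemLoc(aspace, addr * 8))  # address scaled by 8 (byte size)


def op_deref(stack, memory):
    """Pop a memory location description, push the generic value stored there."""
    loc = stack.pop()
    if loc.bit_offset % 8 != 0:
        raise ValueError("this sketch only reads byte-aligned locations")
    start = loc.bit_offset // 8
    data = memory[loc.aspace][start:start + GENERIC_SIZE]
    stack.append(int.from_bytes(data, "little"))


def op_xderef(stack, memory):
    """DW_OP_xderef expressed through the three operations it is equivalent to."""
    op_swap(stack)
    op_form_aspace_address(stack)
    op_deref(stack, memory)
```

With the address space identifier pushed first and the address on top, for
example ``stack = [3, 16]``, calling ``op_xderef(stack, memory)`` leaves the
value read from address 16 of the hypothetical address space 3 on the stack.
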
 | 
						||
 | 
						||
Undefined Location Descriptions
 | 
						||
###############################
 | 
						||
 | 
						||
The undefined location storage represents a piece or all of an object that is
 | 
						||
present in the source but not in the object code (perhaps due to optimization).
 | 
						||
Neither reading nor writing to the undefined location storage is meaningful.
 | 
						||
 | 
						||
An undefined location description specifies the undefined location storage.
 | 
						||
There is no concept of the size of the undefined location storage, nor of a bit
 | 
						||
offset for an undefined location description. The ``DW_OP_LLVM_*offset``
 | 
						||
operations leave an undefined location description unchanged. The
 | 
						||
``DW_OP_*piece`` operations can explicitly or implicitly specify an undefined
 | 
						||
location description, allowing any size and offset to be specified, and result
 | 
						||
in a part with all undefined bits.
 | 
						||
 | 
						||
1.  ``DW_OP_LLVM_undefined`` *New*
 | 
						||
 | 
						||
    ``DW_OP_LLVM_undefined`` pushes an undefined location description L.
 | 
						||
 | 
						||
Memory Location Descriptions
 | 
						||
############################
 | 
						||
 | 
						||
There is a memory location storage that corresponds to each of the target
 | 
						||
architecture linear memory address spaces. The size of each memory location
 | 
						||
storage corresponds to the range of the addresses in the address space.
 | 
						||
 | 
						||
*It is target architecture defined how address space location storage maps to
 | 
						||
target architecture physical memory. For example, they may be independent memory
 | 
						||
or more than one location storage may alias the same physical memory possibly at
 | 
						||
different offsets and with different interleaving. The mapping may also be
 | 
						||
dictated by the source language address classes.*
 | 
						||
 | 
						||
A memory location description specifies a memory location storage. The bit
 | 
						||
offset corresponds to an address in the address space scaled by 8 (the byte
 | 
						||
size). Bits accessed using a memory location description access the
 | 
						||
corresponding target architecture memory starting at the bit offset.
 | 
						||
 | 
						||
``DW_ASPACE_none`` is defined as the target architecture default address space.
 | 
						||
 | 
						||
*The target architecture default address space for AMDGPU is the global address
 | 
						||
space.*
 | 
						||
 | 
						||
If a stack entry is required to be a location description, but it is a value
 | 
						||
with the generic type, then it is implicitly converted to a memory location
 | 
						||
description that specifies memory in the target architecture default address
 | 
						||
space with a bit offset equal to the value scaled by 8 (the byte size).
 | 
						||
 | 
						||
  .. note::
 | 
						||
 | 
						||
    To allow any integral type value to be implicitly converted to a
 | 
						||
    memory location description in the target architecture default address
 | 
						||
    space:
 | 
						||
 | 
						||
    .. note::
 | 
						||
 | 
						||
      If a stack entry is required to be a location description, but it is a
 | 
						||
      value with an integral type, then it is implicitly converted to a memory
 | 
						||
      location description. The stack entry value is zero extended to the size
 | 
						||
      of the generic type and the least significant generic type size bits are
 | 
						||
      treated as a twos-complement unsigned value to be used as an address. The
 | 
						||
      converted memory location description specifies memory location storage
 | 
						||
      corresponding to the target architecture default address space with a bit
 | 
						||
      offset equal to the address scaled by 8 (the byte size).
 | 
						||
 | 
						||
    The implicit conversion could also be defined as target specific. For
    example, gdb checks if the value is an integral type. If it is not, it
    gives an error. Otherwise, gdb zero-extends the value to 64 bits. If the
    gdb target defines a hook function, then it is called and can modify the
    64-bit value, possibly sign extending the original value. Finally, gdb
    treats the 64-bit value as a memory location address.
 | 
						||
 | 
						||
If a stack entry is required to be a location description, but it is an implicit
 | 
						||
pointer value IPV with the target architecture default address space, then it is
 | 
						||
implicitly converted to the location description specified by IPV. See
 | 
						||
:ref:`amdgpu-implicit-location-descriptions`.
 | 
						||
 | 
						||
If a stack entry is required to be a value with a generic type, but it is a
 | 
						||
memory location description in the target architecture default address space
 | 
						||
with a bit offset that is a multiple of 8, then it is implicitly converted to a
 | 
						||
value with a generic type that is equal to the bit offset divided by 8 (the byte
 | 
						||
size).
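
The two implicit conversions between a generic value and a memory location
description in the default address space can be sketched as follows. This is a
minimal illustration, not a consumer implementation. ``MemLoc``,
``DEFAULT_ASPACE`` (a stand-in for ``DW_ASPACE_none``), and the function names
are assumptions introduced for this example.

```python
from dataclasses import dataclass

DEFAULT_ASPACE = 0  # hypothetical stand-in for DW_ASPACE_none


@dataclass(frozen=True)
class MemLoc:
    """A memory location description: address space plus bit offset."""
    aspace: int
    bit_offset: int


def value_to_location(value: int) -> MemLoc:
    """Generic value -> memory location in the default address space."""
    return MemLoc(DEFAULT_ASPACE, value * 8)  # value scaled by 8 (byte size)


def location_to_value(loc: MemLoc) -> int:
    """Memory location -> generic value, when the conversion is permitted."""
    if loc.aspace != DEFAULT_ASPACE:
        raise ValueError("only default address space locations convert")
    if loc.bit_offset % 8 != 0:
        raise ValueError("bit offset must be a multiple of 8")
    return loc.bit_offset // 8  # bit offset divided by 8 (byte size)
```

In this sketch the two functions are inverses for byte-aligned locations in the
default address space. Any other location is rejected rather than converted.
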
 | 
						||
 | 
						||
1.  ``DW_OP_addr``
 | 
						||
 | 
						||
    ``DW_OP_addr`` has a single byte constant value operand, which has the size
 | 
						||
    of the generic type, treated as an address A.
 | 
						||
 | 
						||
    It pushes a memory location description L on the stack that specifies the
 | 
						||
    memory location storage for the target architecture default address space
 | 
						||
    with a bit offset equal to A scaled by 8 (the byte size).
 | 
						||
 | 
						||
    *If the DWARF is part of a code object, then A may need to be relocated. For
 | 
						||
    example, in the ELF code object format, A must be adjusted by the difference
 | 
						||
    between the ELF segment virtual address and the virtual address at which the
 | 
						||
    segment is loaded.*
 | 
						||
 | 
						||
2.  ``DW_OP_addrx``
 | 
						||
 | 
						||
    ``DW_OP_addrx`` has a single unsigned LEB128 integer operand that is treated
 | 
						||
    as a zero-based index into the ``.debug_addr`` section relative to the value
 | 
						||
    of the ``DW_AT_addr_base`` attribute of the associated compilation unit. The
 | 
						||
    address value A in the ``.debug_addr`` section has the size of the generic
    type.
 | 
						||
 | 
						||
    It pushes a memory location description L on the stack that specifies the
 | 
						||
    memory location storage for the target architecture default address space
 | 
						||
    with a bit offset equal to A scaled by 8 (the byte size).
 | 
						||
 | 
						||
    *If the DWARF is part of a code object, then A may need to be relocated. For
 | 
						||
    example, in the ELF code object format, A must be adjusted by the difference
 | 
						||
    between the ELF segment virtual address and the virtual address at which the
 | 
						||
    segment is loaded.*
 | 
						||
 | 
						||
3.  ``DW_OP_LLVM_form_aspace_address`` *New*
 | 
						||
 | 
						||
    ``DW_OP_LLVM_form_aspace_address`` pops the top two stack entries. The first
 | 
						||
    must be an integral type value that is treated as an address space
 | 
						||
    identifier AS for those architectures that support multiple address spaces.
 | 
						||
    The second must be an integral type value that is treated as an address A.
 | 
						||
 | 
						||
    The address size S is defined as the address bit size of the target
 | 
						||
    architecture's address space that corresponds to AS.
 | 
						||
 | 
						||
    A is adjusted by zero extending it to S bits and the least significant S
 | 
						||
    bits are treated as a twos-complement unsigned value.
 | 
						||
 | 
						||
    ``DW_OP_LLVM_form_aspace_address`` pushes a memory location description L
 | 
						||
    that specifies the memory location storage that corresponds to AS, with a
 | 
						||
    bit offset equal to the adjusted A scaled by 8 (the byte size).
 | 
						||
 | 
						||
    If AS is not one of the values defined by the target architecture's
 | 
						||
    ``DW_ASPACE_*`` values, then the DWARF expression is ill-formed.
 | 
						||
 | 
						||
    See :ref:`amdgpu-implicit-location-descriptions` for special rules
 | 
						||
    concerning implicit pointer values produced by dereferencing implicit
 | 
						||
    location descriptions created by the ``DW_OP_implicit_pointer`` and
 | 
						||
    ``DW_OP_LLVM_aspace_implicit_pointer`` operations.
 | 
						||
 | 
						||
    The AMDGPU address spaces are defined in
 | 
						||
    :ref:`amdgpu-dwarf-address-space-mapping-table`.
 | 
						||
 | 
						||
4.  ``DW_OP_form_tls_address``
 | 
						||
 | 
						||
    ``DW_OP_form_tls_address`` pops one stack entry that must be an integral
 | 
						||
    type value, and treats it as a thread-local storage address.
 | 
						||
 | 
						||
    ``DW_OP_form_tls_address`` pushes a memory location description L for the
 | 
						||
    target architecture default address space that corresponds to the
 | 
						||
    thread-local storage address.
 | 
						||
 | 
						||
    The meaning of the thread-local storage address is defined by the run-time
 | 
						||
    environment. If the run-time environment supports multiple thread-local
 | 
						||
    storage blocks for a single thread, then the block corresponding to the
 | 
						||
    executable or shared library containing this DWARF expression is used.
 | 
						||
 | 
						||
    *Some implementations of C, C++, Fortran, and other languages support a
 | 
						||
    thread-local storage class. Variables with this storage class have distinct
 | 
						||
    values and addresses in distinct threads, much as automatic variables have
 | 
						||
    distinct values and addresses in each function invocation. Typically, there
 | 
						||
    is a single block of storage containing all thread-local variables declared
 | 
						||
    in the main executable, and a separate block for the variables declared in
 | 
						||
    each shared library. Each thread-local variable can then be accessed in its
 | 
						||
    block using an identifier. This identifier is typically an offset into the
 | 
						||
    block and pushed onto the DWARF stack by one of the* ``DW_OP_const<n><x>``
 | 
						||
    *operations prior to the* ``DW_OP_form_tls_address`` *operation. Computing
 | 
						||
    the address of the appropriate block can be complex (in some cases, the
 | 
						||
    compiler emits a function call to do it), and difficult to describe using
 | 
						||
    ordinary DWARF location descriptions. Instead of forcing complex
 | 
						||
    thread-local storage calculations into the DWARF expressions, the*
 | 
						||
    ``DW_OP_form_tls_address`` *allows the consumer to perform the computation
 | 
						||
    based on the run-time environment.*
 | 
						||
 | 
						||
5.  ``DW_OP_call_frame_cfa``
 | 
						||
 | 
						||
    ``DW_OP_call_frame_cfa`` pushes the memory location description L of the
 | 
						||
    Canonical Frame Address (CFA) of the current function, obtained from the
 | 
						||
    Call Frame Information (see :ref:`amdgpu-call-frame-information`).
 | 
						||
 | 
						||
    *Although the value of* ``DW_AT_frame_base`` *can be computed using other
 | 
						||
    DWARF expression operators, in some cases this would require an extensive
 | 
						||
    location list because the values of the registers used in computing the CFA
 | 
						||
    change during a subroutine. If the Call Frame Information is present, then
 | 
						||
    it already encodes such changes, and it is space efficient to reference
 | 
						||
    that.*
 | 
						||
 | 
						||
6.  ``DW_OP_fbreg``
 | 
						||
 | 
						||
    ``DW_OP_fbreg`` has a single signed LEB128 integer operand that is treated
 | 
						||
    as a byte displacement D.
 | 
						||
 | 
						||
    The DWARF expression E corresponding to the current program location is
 | 
						||
    selected from the ``DW_AT_frame_base`` attribute of the current function and
 | 
						||
    evaluated. The resulting memory location description L's bit offset is
 | 
						||
    updated as if the ``DW_OP_LLVM_offset D`` operation were applied. The
 | 
						||
    updated L is pushed.
 | 
						||
 | 
						||
    *This is typically a stack pointer register plus or minus some offset.*
 | 
						||
 | 
						||
7.  ``DW_OP_breg0, DW_OP_breg1, ..., DW_OP_breg31``
 | 
						||
 | 
						||
    The ``DW_OP_breg<n>`` operations encode the numbers of up to 32 registers,
 | 
						||
    numbered from 0 through 31, inclusive. The register number R corresponds to
 | 
						||
    the ``n`` in the operation name.
 | 
						||
 | 
						||
    They have a single signed LEB128 integer operand that is treated as a byte
 | 
						||
    displacement D.
 | 
						||
 | 
						||
    The address space identifier AS is defined as the one corresponding to the
 | 
						||
    target architecture's default address space.
 | 
						||
 | 
						||
    The address size S is defined as the address bit size of the target
 | 
						||
    architecture's address space corresponding to AS.
 | 
						||
 | 
						||
    The contents of the register specified by R are retrieved as a
 | 
						||
    twos-complement unsigned value and zero extended to S bits. D is added and
 | 
						||
    the least significant S bits are treated as a twos-complement unsigned value
 | 
						||
    to be used as an address A.
 | 
						||
 | 
						||
    They push a memory location description L that specifies the memory location
 | 
						||
    storage that corresponds to AS, with a bit offset equal to A scaled by 8
 | 
						||
    (the byte size).
 | 
						||
 | 
						||
8.  ``DW_OP_bregx``
 | 
						||
 | 
						||
    ``DW_OP_bregx`` has two operands. The first is an unsigned LEB128 integer
 | 
						||
    that is treated as a register number R. The second is a signed LEB128
 | 
						||
    integer that is treated as a byte displacement D.
 | 
						||
 | 
						||
    The action is the same as for ``DW_OP_breg<n>`` except that R is used as the
 | 
						||
    register number and D is used as the byte displacement.
 | 
						||
 | 
						||
9.  ``DW_OP_LLVM_aspace_bregx`` *New*
 | 
						||
 | 
						||
    ``DW_OP_LLVM_aspace_bregx`` has two operands. The first is an unsigned
 | 
						||
    LEB128 integer that is treated as a register number R. The second is a
 | 
						||
    signed LEB128 integer that is treated as a byte displacement D. It pops one
 | 
						||
    stack entry that is required to be an integral type value that is treated as
 | 
						||
    an address space identifier AS for those architectures that support multiple
 | 
						||
    address spaces.
 | 
						||
 | 
						||
    The action is the same as for ``DW_OP_breg<n>`` except that R is used as the
 | 
						||
    register number, D is used as the byte displacement, and AS is used as the
 | 
						||
    address space identifier.
 | 
						||
 | 
						||
    If AS is not one of the values defined by the target architecture's
 | 
						||
    ``DW_ASPACE_*`` values, then the DWARF expression is ill-formed.
 | 
						||
 | 
						||
    .. note::
 | 
						||
 | 
						||
      Could also consider adding ``DW_OP_aspace_breg0, DW_OP_aspace_breg1, ...,
 | 
						||
      DW_OP_aspace_breg31`` which would save encoding size.
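
The address computation shared by ``DW_OP_breg<n>``, ``DW_OP_bregx``, and
``DW_OP_LLVM_aspace_bregx`` can be sketched as below. The sketch assumes a
hypothetical ``ASPACE_BITS`` table that maps address space identifiers to
address sizes. The identifiers and sizes shown are placeholders, not the AMDGPU
mapping, and ``MemLoc`` and ``breg_location`` are names invented for this
example.

```python
from dataclasses import dataclass

DEFAULT_ASPACE = 0  # hypothetical default address space identifier

# Hypothetical table: address space identifier -> address size S in bits.
ASPACE_BITS = {0: 64, 3: 32}


@dataclass(frozen=True)
class MemLoc:
    """A memory location description: address space plus bit offset."""
    aspace: int
    bit_offset: int


def breg_location(reg_value, reg_bits, displacement, aspace=DEFAULT_ASPACE):
    """Compute the location pushed by a register-based addressing operation.

    reg_value    -- raw contents of register R
    reg_bits     -- width of register R in bits
    displacement -- signed byte displacement D
    aspace       -- address space identifier AS
    """
    if aspace not in ASPACE_BITS:
        raise ValueError("ill-formed: unknown address space identifier")
    size = ASPACE_BITS[aspace]
    # Treat the register contents as an unsigned value. Python integers are
    # unbounded, so the zero extension to S bits needs no explicit step.
    base = reg_value & ((1 << reg_bits) - 1)
    # Add D and keep the least significant S bits as the address A.
    address = (base + displacement) & ((1 << size) - 1)
    return MemLoc(aspace, address * 8)  # A scaled by 8 (byte size)
```

``DW_OP_breg<n>`` and ``DW_OP_bregx`` correspond to calls that leave ``aspace``
at its default, while ``DW_OP_LLVM_aspace_bregx`` corresponds to passing the
popped address space identifier explicitly. A negative displacement that takes
the sum below zero wraps around within S bits.
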
 | 
						||
 | 
						||
.. _amdgpu-register-location-descriptions:
 | 
						||
 | 
						||
Register Location Descriptions
 | 
						||
##############################
 | 
						||
 | 
						||
There is a register location storage that corresponds to each of the target
 | 
						||
architecture registers. The size of each register location storage corresponds
 | 
						||
to the size of the corresponding target architecture register.
 | 
						||
 | 
						||
A register location description specifies a register location storage. The bit
 | 
						||
offset corresponds to a bit position within the register. Bits accessed using a
 | 
						||
register location description access the corresponding target architecture
 | 
						||
register starting at the bit offset.
 | 
						||
 | 
						||
1.  ``DW_OP_reg0, DW_OP_reg1, ..., DW_OP_reg31``
 | 
						||
 | 
						||
    ``DW_OP_reg<n>`` operations encode the numbers of up to 32 registers,
 | 
						||
    numbered from 0 through 31, inclusive. The target architecture register
 | 
						||
    number R corresponds to the ``n`` in the operation name.
 | 
						||
 | 
						||
    ``DW_OP_reg<n>`` pushes a register location description L that specifies the
 | 
						||
    register location storage that corresponds to R, with a bit offset of 0.
 | 
						||
 | 
						||
2.  ``DW_OP_regx``
 | 
						||
 | 
						||
    ``DW_OP_regx`` has a single unsigned LEB128 integer operand that is treated
 | 
						||
    as a target architecture register number R.
 | 
						||
 | 
						||
    ``DW_OP_regx`` pushes a register location description L that specifies the
 | 
						||
    register location storage that corresponds to R, with a bit offset of 0.
 | 
						||
 | 
						||
*These operations name a register location. To fetch the contents of a register,
 | 
						||
it is necessary to use* ``DW_OP_regval_type``\ *, or one of the register based
 | 
						||
addressing operations such as* ``DW_OP_bregx``\ *, or to use* ``DW_OP_deref*``
 | 
						||
*on a register location description.*
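
As a rough illustration of how a bit offset selects bits within a register
location storage, the following sketch extracts a bit field from a register
value. It assumes that bit offset 0 denotes the least significant bit of the
register, which is an assumption of this example rather than something stated
above. The function name is hypothetical.

```python
def read_register_bits(reg_value, reg_bits, bit_offset, bit_count):
    """Read bit_count bits of a register starting at bit_offset.

    Assumes bit offset 0 is the least significant bit of the register.
    """
    if bit_offset < 0 or bit_count < 0:
        raise ValueError("offset and count must be non-negative")
    if bit_offset + bit_count > reg_bits:
        raise ValueError("access extends past the end of the register")
    return (reg_value >> bit_offset) & ((1 << bit_count) - 1)
```

A location description produced by ``DW_OP_regx`` has a bit offset of 0, so
reading the full register width through it corresponds to
``read_register_bits(value, width, 0, width)`` in this sketch.
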
 | 
						||
 | 
						||
.. _amdgpu-implicit-location-descriptions:
 | 
						||
 | 
						||
Implicit Location Descriptions
 | 
						||
##############################
 | 
						||
 | 
						||
Implicit location storage represents a piece or all of an object which has no
actual location in the program but whose contents are nonetheless known, either
as a constant or because they can be computed from other locations and values in
the program.
 | 
						||
 | 
						||
An implicit location description specifies an implicit location storage. The bit
 | 
						||
offset corresponds to a bit position within the implicit location storage. Bits
 | 
						||
accessed using an implicit location description access the corresponding
 | 
						||
implicit storage value starting at the bit offset.
 | 
						||
 | 
						||
1.  ``DW_OP_implicit_value``
 | 
						||
 | 
						||
    ``DW_OP_implicit_value`` has two operands. The first is an unsigned LEB128
 | 
						||
    integer treated as a byte size S. The second is a block of bytes with a
 | 
						||
    length equal to S treated as a literal value V.
 | 
						||
 | 
						||
    An implicit location storage LS is created with the literal value V and a
 | 
						||
    size of S. An implicit location description L is pushed that specifies LS
 | 
						||
    with a bit offset of 0.
 | 
						||
 | 
						||
2.  ``DW_OP_stack_value``
 | 
						||
 | 
						||
    ``DW_OP_stack_value`` pops one stack entry that must be a value treated as a
 | 
						||
    literal value V.
 | 
						||
 | 
						||
    An implicit location storage LS is created with the literal value V and a
 | 
						||
    size equal to V's base type size. An implicit location description L is
 | 
						||
    pushed that specifies LS with a bit offset of 0.
 | 
						||
 | 
						||
    The ``DW_OP_stack_value`` operation specifies that the object does not exist
 | 
						||
    in memory but its value is nonetheless known and is at the top of the DWARF
 | 
						||
    expression stack. In this form of location description, the DWARF expression
 | 
						||
    represents the actual value of the object, rather than its location.
 | 
						||
 | 
						||
    See :ref:`amdgpu-implicit-location-descriptions` for special rules
 | 
						||
    concerning implicit pointer values produced by dereferencing implicit
 | 
						||
    location descriptions created by the ``DW_OP_implicit_pointer`` and
 | 
						||
    ``DW_OP_LLVM_aspace_implicit_pointer`` operations.
 | 
						||
 | 
						||
    .. note::
 | 
						||
 | 
						||
      Since location descriptions are allowed on the stack, the
 | 
						||
      ``DW_OP_stack_value`` operation no longer terminates the DWARF expression.
 | 
						||
 | 
						||
3.  ``DW_OP_implicit_pointer``
 | 
						||
 | 
						||
    *An optimizing compiler may eliminate a pointer, while still retaining the
 | 
						||
    value that the pointer addressed.* ``DW_OP_implicit_pointer`` *allows a
 | 
						||
    producer to describe this value.*
 | 
						||
 | 
						||
    ``DW_OP_implicit_pointer`` specifies that the object is a pointer to the
 | 
						||
    target architecture default address space that cannot be represented as a
 | 
						||
    real pointer, even though the value it would point to can be described. In
 | 
						||
    this form of location description, the DWARF expression refers to a
 | 
						||
    debugging information entry that represents the actual location description
 | 
						||
    of the object to which the pointer would point. Thus, a consumer of the
 | 
						||
    debug information would be able to access the dereferenced pointer, even
    when it cannot access the pointer itself.
 | 
						||
 | 
						||
    ``DW_OP_implicit_pointer`` has two operands. The first is a 4-byte unsigned
 | 
						||
    value in the 32-bit DWARF format, or an 8-byte unsigned value in the 64-bit
 | 
						||
    DWARF format, that is treated as a debugging information entry reference R.
 | 
						||
    The second is a signed LEB128 integer that is treated as a byte
 | 
						||
    displacement D.
 | 
						||
 | 
						||
    R is used as the offset of a debugging information entry E in a
 | 
						||
    ``.debug_info`` section, which may be contained in an executable or shared
 | 
						||
    object file other than that containing the operator. For references from one
 | 
						||
    executable or shared object file to another, the relocation must be
 | 
						||
    performed by the consumer.
 | 
						||
 | 
						||
    *The first operand interpretation is exactly like that for*
 | 
						||
    ``DW_FORM_ref_addr``\ *.*
 | 
						||
 | 
						||
    The address space identifier AS is defined as the one corresponding to the
 | 
						||
    target architecture's default address space.
 | 
						||
 | 
						||
    The address size S is defined as the address bit size of the target
 | 
						||
    architecture's address space corresponding to AS.
 | 
						||
 | 
						||
    An implicit location storage LS is created that has the bit size of S. An
 | 
						||
    implicit location description L is pushed that specifies LS and has a bit
 | 
						||
    offset of 0.
 | 
						||
 | 
						||
    If a ``DW_OP_deref*`` operation pops a location description L' and retrieves
 | 
						||
    S' bits where some retrieved bits come from LS such that either:
 | 
						||
 | 
						||
    1.  L' is an implicit location description that specifies LS with bit offset
 | 
						||
        0, and S' equals S.
 | 
						||
 | 
						||
    2.  L' is a complete composite location description that specifies a
 | 
						||
        canonical form composite location storage LS'. The bits retrieved all
 | 
						||
        come from a single part P' of LS'. P' has a bit size of S and has
 | 
						||
        an implicit location description PL'. PL' specifies LS with a bit offset
 | 
						||
        of 0.
 | 
						||
 | 
						||
    Then the value V pushed by the ``DW_OP_deref*`` operation is an implicit
 | 
						||
    pointer value IPV with an address space of AS, a debugging information entry
 | 
						||
    of E, and a base type of T. If AS is the target architecture default address
 | 
						||
    space, then T is the generic type. Otherwise, T is an architecture specific
 | 
						||
    integral type with a bit size equal to S.
 | 
						||
 | 
						||
    Otherwise, if a ``DW_OP_deref*`` operation is applied to a location
 | 
						||
    description such that some retrieved bits come from LS, then the DWARF
 | 
						||
    expression is ill-formed.
 | 
						||
 | 
						||
    If IPV is either implicitly converted to a location description (only done
 | 
						||
    if AS is the target architecture default address space) or used by
 | 
						||
    ``DW_OP_LLVM_form_aspace_address`` (only done if the address space specified
 | 
						||
    is AS), then the resulting location description is:
 | 
						||
 | 
						||
    * If E has a ``DW_AT_location`` attribute, the DWARF expression
 | 
						||
      corresponding to the current program location is selected and evaluated
 | 
						||
      from the ``DW_AT_location`` attribute. The expression result is the
 | 
						||
      resulting location description RL.
 | 
						||
 | 
						||
    * If E has a ``DW_AT_const_value`` attribute, then an implicit location
 | 
						||
      storage RLS is created from the ``DW_AT_const_value`` attribute's value,
 | 
						||
      with a size matching the size of the ``DW_AT_const_value`` attribute's
 | 
						||
      value. The resulting implicit location description RL specifies RLS with a
 | 
						||
      bit offset of 0.
 | 
						||
 | 
						||
      .. note::
 | 
						||
 | 
						||
        If the use of ``DW_AT_const_value`` for variables and formal parameters
        were deprecated, and ``DW_AT_location`` with an implicit location
        description were used instead, then this rule would not be required.

    * Otherwise the DWARF expression is ill-formed.

    The bit offset of RL is updated as if the ``DW_OP_LLVM_offset D`` operation
    were applied.

    If a ``DW_OP_stack_value`` operation pops a value that is the same as IPV,
    then it pushes a location description that is the same as L.

    The DWARF expression is ill-formed if it accesses LS or IPV in any other
    manner.

    *The restrictions on how an implicit pointer location description created
    by* ``DW_OP_implicit_pointer`` *and*
    ``DW_OP_LLVM_aspace_implicit_pointer``\ *, or an implicit pointer value
    created by* ``DW_OP_deref*``\ *, can be used are to simplify the DWARF
    consumer.*

4.  ``DW_OP_LLVM_aspace_implicit_pointer`` *New*

    ``DW_OP_LLVM_aspace_implicit_pointer`` has two operands that are the same as
    for ``DW_OP_implicit_pointer``.

    It pops one stack entry that must be an integral type value that is treated
    as an address space identifier AS for those architectures that support
    multiple address spaces.

    The implicit location description L that is pushed is the same as for
    ``DW_OP_implicit_pointer`` except that the address space identifier used is
    AS.

    If AS is not one of the values defined by the target architecture's
    ``DW_ASPACE_*`` values, then the DWARF expression is ill-formed.

*The debugging information entry referenced by a* ``DW_OP_implicit_pointer``
*or* ``DW_OP_LLVM_aspace_implicit_pointer`` *operation is typically a*
``DW_TAG_variable`` *or* ``DW_TAG_formal_parameter`` *entry whose*
``DW_AT_location`` *attribute gives a second DWARF expression or a location list
that describes the value of the object, but the referenced entry may be any
entry that contains a* ``DW_AT_location`` *or* ``DW_AT_const_value`` *attribute
(for example,* ``DW_TAG_dwarf_procedure``\ *). By using the second DWARF
expression, a consumer can reconstruct the value of the object when asked to
dereference the pointer described by the original DWARF expression containing
the* ``DW_OP_implicit_pointer`` *or* ``DW_OP_LLVM_aspace_implicit_pointer``
*operation.*

Composite Location Descriptions
###############################

A composite location storage represents an object or value which may be
contained in part of another location storage, or contained in parts of more
than one location storage.

Each part has a part location description L and a part bit size S. The bits of
the part comprise S contiguous bits from the location storage specified by L,
starting at the bit offset specified by L. All the bits must be within the size
of the location storage specified by L or the DWARF expression is ill-formed.

A composite location storage can have zero or more parts. The parts are
contiguous such that the zero-based location storage bit index will range over
each part with no gaps between them. Therefore, the size of a composite location
storage is the sum of the sizes of its parts. The DWARF expression is ill-formed
if the size of the contiguous location storage is larger than the size of the
memory location storage corresponding to the target architecture's largest
address space.

The canonical form of a composite location storage is computed by applying the
following steps:

1.  If any part P has a composite location description L, it is replaced by a
    copy of the parts of the composite location storage specified by L that are
    selected by the bit size of P starting at the bit offset of L. The location
    description of the first copied part has its bit offset updated as
    necessary, and the last copied part has its bit size updated as necessary,
    to reflect the bits selected by P. This rule is applied repeatedly until no
    part has a composite location description.

2.  If the size of any part is zero, it is removed.

3.  If any adjacent parts P\ :sup:`1` to P\ :sup:`n` have location descriptions
    that specify the same location storage LS such that the bits selected form a
    contiguous portion of LS, then they are replaced by a single new part P'. P'
    has a location description L that specifies LS with the same bit offset as
    P\ :sup:`1`\ 's location description, and a bit size equal to the sum of the
    bit sizes of P\ :sup:`1` to P\ :sup:`n` inclusive.
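
As an illustration only, the three canonicalization steps above can be sketched
in Python. The ``Storage``, ``Composite``, and ``Part`` classes and the
``select_bits`` and ``canonicalize`` functions are hypothetical names invented
for this sketch; they are not part of LLVM or of any DWARF consumer. Location
storages are modeled simply as named objects, which is an assumption made to
keep the example short.

.. code-block:: python

  from dataclasses import dataclass
  from typing import List, Tuple, Union


  @dataclass(frozen=True)
  class Storage:
      """A non-composite location storage, identified by name (hypothetical)."""
      name: str


  @dataclass(frozen=True)
  class Composite:
      """A composite location storage: an ordered tuple of parts."""
      parts: Tuple["Part", ...]


  @dataclass(frozen=True)
  class Part:
      """bit_size contiguous bits of `storage`, starting at `bit_offset`."""
      storage: Union[Storage, Composite]
      bit_offset: int
      bit_size: int


  def select_bits(parts: List[Part], bit_offset: int, bit_size: int) -> List[Part]:
      """Copy the pieces of `parts` covering [bit_offset, bit_offset + bit_size)."""
      selected, pos, end = [], 0, bit_offset + bit_size
      for p in parts:
          lo, hi = max(pos, bit_offset), min(pos + p.bit_size, end)
          if lo < hi:
              # Trim the first and last copied parts to the selected bits.
              selected.append(Part(p.storage, p.bit_offset + lo - pos, hi - lo))
          pos += p.bit_size
      if end > pos:
          raise ValueError("ill-formed: selected bits exceed the storage size")
      return selected


  def canonicalize(parts: List[Part]) -> List[Part]:
      # Step 1: replace parts that have a composite location description.
      flat: List[Part] = []
      for p in parts:
          if isinstance(p.storage, Composite):
              inner = canonicalize(list(p.storage.parts))
              flat.extend(select_bits(inner, p.bit_offset, p.bit_size))
          else:
              flat.append(p)
      # Step 2: remove zero-sized parts.
      flat = [p for p in flat if p.bit_size != 0]
      # Step 3: merge adjacent parts that select contiguous bits of one storage.
      merged: List[Part] = []
      for p in flat:
          prev = merged[-1] if merged else None
          if (prev is not None and prev.storage == p.storage
                  and prev.bit_offset + prev.bit_size == p.bit_offset):
              merged[-1] = Part(prev.storage, prev.bit_offset,
                                prev.bit_size + p.bit_size)
          else:
              merged.append(p)
      return merged


  v0, mem = Storage("v0"), Storage("mem")
  inner = Composite((Part(v0, 0, 32), Part(v0, 32, 32)))
  example = canonicalize([Part(inner, 16, 32), Part(mem, 0, 0),
                          Part(mem, 8, 8), Part(mem, 16, 8)])

In this example the nested composite collapses to 32 bits of ``v0`` starting at
bit 16, the zero-sized part disappears, and the two adjacent ``mem`` parts merge
into a single 16-bit part starting at bit 8.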

A composite location description specifies the canonical form of a composite
location storage and a bit offset.

There are operations that push a composite location description that specifies a
composite location storage that is created by the operation.

There are other operations that allow a composite location storage and a
composite location description that specifies it to be created incrementally.
Each part is described by a separate operation. There may be one or more
operations to create the final composite location storage and associated
description. A series of such operations describes the parts of the composite
location storage in the order in which the associated part operations are
executed.

To support incremental creation, a composite location description can be in an
incomplete state. When an incremental operation operates on an incomplete
composite location description, it adds a new part; otherwise, it creates a new
composite location description. The ``DW_OP_LLVM_piece_end`` operation
explicitly makes an incomplete composite location description complete.

If the top stack entry is an incomplete composite location description after the
execution of a DWARF expression has completed, it is converted to a complete
composite location description.

If a stack entry is required to be a location description, but it is an
incomplete composite location description, then the DWARF expression is
ill-formed.

*Note that a DWARF expression may arbitrarily compose composite location
descriptions from any other location description, including other composite
location descriptions.*

*The incremental composite location description operations are defined to be
compatible with the definitions in DWARF 5 and earlier.*

1.  ``DW_OP_piece``

    ``DW_OP_piece`` has a single unsigned LEB128 integer operand that is treated
    as a byte size S.

    The action is based on the context:

    * If the stack is empty, then an incomplete composite location description
      L is pushed that specifies a new composite location storage LS and has a
      bit offset of 0. LS has a single part P that specifies the undefined
      location description, and has a bit size of S scaled by 8 (the byte size).

    * If the top stack entry is an incomplete composite location description L,
      then the composite location storage LS that it specifies is updated to
      append a part that specifies an undefined location description, and has a
      bit size of S scaled by 8 (the byte size).

    * If the top stack entry is a location description or can be converted to
      one, then it is popped and treated as a part location description PL.
      Then:

      * If the stack is empty or the top stack entry is not an incomplete
        composite location description, then an incomplete composite location
        description L is pushed that specifies a new composite location storage
        LS. LS has a single part that specifies PL, and has a bit size of S
        scaled by 8 (the byte size).

      * Otherwise, the composite location storage LS specified by the top stack
        incomplete composite location description L is updated to append a part
        that specifies PL, and has a bit size of S scaled by 8 (the byte size).

    * Otherwise, the DWARF expression is ill-formed.

    If LS is not in canonical form it is updated to be in canonical form.

    *Many compilers store a single variable in sets of registers, or store a
    variable partially in memory and partially in registers.* ``DW_OP_piece``
    *provides a way of describing how large a part of a variable a particular
    DWARF location description refers to.*

    *If a computed byte displacement is required, the* ``DW_OP_LLVM_offset``
    *can be used to update the part location description.*

2.  ``DW_OP_bit_piece``

    ``DW_OP_bit_piece`` has two operands. The first is an unsigned LEB128
    integer that is treated as the part bit size S. The second is an unsigned
    LEB128 integer that is treated as a bit displacement D.

    The action is the same as for ``DW_OP_piece`` except that any part created
    has the bit size S, and the location description of any created part has its
    bit offset updated as if the ``DW_OP_LLVM_bit_offset D`` operation were
    applied.

    *If a computed bit displacement is required, the* ``DW_OP_LLVM_bit_offset``
    *can be used to update the part location description.*

    .. note::

      The bit offset operand is not needed as ``DW_OP_LLVM_bit_offset`` can be
      used on the part's location description.

3.  ``DW_OP_LLVM_piece_end`` *New*

    If the top stack entry is an incomplete composite location description L,
    then it is updated to be a complete composite location description with the
    same parts. Otherwise, the DWARF expression is ill-formed.

4.  ``DW_OP_LLVM_extend`` *New*

    ``DW_OP_LLVM_extend`` has two operands. The first is an unsigned LEB128
    integer that is treated as the element bit size S. The second is an unsigned
    LEB128 integer that is treated as a count C.

    It pops one stack entry that must be a location description and is treated
    as the part location description PL.

    A complete composite location description L is pushed that comprises C parts
    that each specify PL and have a bit size of S.

    The DWARF expression is ill-formed if the element bit size or count is 0.

5.  ``DW_OP_LLVM_select_bit_piece`` *New*

    ``DW_OP_LLVM_select_bit_piece`` has two operands. The first is an unsigned
    LEB128 integer that is treated as the element bit size S. The second is an
    unsigned LEB128 integer that is treated as a count C.

    It pops three stack entries. The first must be an integral type value that
    is treated as a bit mask value M. The second must be a location description
    that is treated as the one-location description L1. The third must be a
    location description that is treated as the zero-location description L0.

    A complete composite location description L is pushed that specifies a new
    composite location storage LS. LS comprises C parts that each specify a part
    location description PL and have a bit size of S. The PL for part N is
    defined as:

    1.  If the Nth least significant bit of M is a zero then the PL for part N
        is the same as L0, otherwise it is the same as L1.

    2.  The PL for part N is updated as if the ``DW_OP_LLVM_bit_offset N*S``
        operation were applied.

    If LS is not in canonical form it is updated to be in canonical form.

    The DWARF expression is ill-formed if S or C is 0, or if the bit size of M
    is less than C.
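
The part selection performed by ``DW_OP_LLVM_select_bit_piece`` can be sketched
in Python as follows. This is a minimal model, not an implementation from LLVM:
the ``Loc`` and ``Part`` types and the ``select_bit_piece`` function are
hypothetical, N is assumed to count from 0 (so that part 0 has a bit offset of
0), and the final conversion to canonical form is omitted.

.. code-block:: python

  from typing import List, NamedTuple


  class Loc(NamedTuple):
      """A location description: a storage name plus a bit offset."""
      storage: str
      bit_offset: int = 0


  class Part(NamedTuple):
      loc: Loc
      bit_size: int


  def select_bit_piece(mask: int, mask_bit_size: int, one_loc: Loc,
                       zero_loc: Loc, bit_size: int, count: int) -> List[Part]:
      """Build the C parts pushed by DW_OP_LLVM_select_bit_piece."""
      if bit_size == 0 or count == 0 or mask_bit_size < count:
          raise ValueError("ill-formed DWARF expression")
      parts = []
      for n in range(count):
          # Bit n of M selects L1 (bit set) or L0 (bit clear) for part n.
          base = one_loc if (mask >> n) & 1 else zero_loc
          # Equivalent to applying DW_OP_LLVM_bit_offset n*S to the chosen PL.
          parts.append(Part(Loc(base.storage, base.bit_offset + n * bit_size),
                            bit_size))
      return parts


  # Lanes 0 and 2 were spilled to memory; lanes 1 and 3 are still in the register.
  parts = select_bit_piece(mask=0b0101, mask_bit_size=64,
                           one_loc=Loc("spill_memory"), zero_loc=Loc("v0"),
                           bit_size=32, count=4)

Each resulting part keeps the same relative position in whichever storage it is
taken from, which is why the bit offset grows by S for every part regardless of
which location description the mask bit selects.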

``DW_OP_bit_piece`` *is used instead of* ``DW_OP_piece`` *when the piece to be
assembled into a value or assigned to is not byte-sized or is not at the start
of the part location description.*
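
The incremental behavior of ``DW_OP_piece``, ``DW_OP_bit_piece``, and
``DW_OP_LLVM_piece_end`` can be modeled with a small Python sketch. All names
here (``Loc``, ``Composite``, ``dw_op_piece``, and so on) are hypothetical and
exist only for illustration. The sketch assumes that the only location
descriptions on the stack are simple ``Loc`` entries; values that can be
converted to location descriptions, complete composites used as parts, and the
conversion to canonical form are not modeled.

.. code-block:: python

  from dataclasses import dataclass, field
  from typing import List, Tuple


  @dataclass(frozen=True)
  class Loc:
      """A non-composite location description: storage name and bit offset."""
      storage: str
      bit_offset: int = 0


  UNDEFINED = Loc("<undefined>")


  @dataclass
  class Composite:
      """A composite location description under construction."""
      parts: List[Tuple[Loc, int]] = field(default_factory=list)
      complete: bool = False


  def _is_incomplete(entry) -> bool:
      return isinstance(entry, Composite) and not entry.complete


  def _add_part(stack: list, bit_size: int, bit_displacement: int = 0) -> None:
      if not stack or _is_incomplete(stack[-1]):
          # Nothing to pop: the new part is an undefined location description.
          part_loc = UNDEFINED
      elif isinstance(stack[-1], Loc):
          top = stack.pop()
          part_loc = Loc(top.storage, top.bit_offset + bit_displacement)
      else:
          # Either ill-formed, or a kind of entry this sketch does not model.
          raise ValueError("ill-formed or unsupported stack entry")
      if not stack or not _is_incomplete(stack[-1]):
          stack.append(Composite())  # start a new incomplete composite
      stack[-1].parts.append((part_loc, bit_size))


  def dw_op_piece(stack: list, byte_size: int) -> None:
      _add_part(stack, byte_size * 8)


  def dw_op_bit_piece(stack: list, bit_size: int, bit_displacement: int) -> None:
      _add_part(stack, bit_size, bit_displacement)


  def dw_op_llvm_piece_end(stack: list) -> None:
      if not stack or not _is_incomplete(stack[-1]):
          raise ValueError("ill-formed DWARF expression")
      stack[-1].complete = True


  stack: list = []
  stack.append(Loc("v0"))
  dw_op_piece(stack, 4)            # 32 bits from v0
  stack.append(Loc("v1"))
  dw_op_bit_piece(stack, 16, 8)    # 16 bits from v1, starting at bit 8
  dw_op_piece(stack, 2)            # 16 bits with no known location
  dw_op_llvm_piece_end(stack)

After these operations the stack holds a single complete composite with three
parts: 32 bits of ``v0``, 16 bits of ``v1`` at bit offset 8, and 16 undefined
bits.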

.. note::

  For AMDGPU:

  * In CFI expressions ``DW_OP_LLVM_select_bit_piece`` is used to describe
    unwinding vector registers that are spilled under the execution mask to
    memory: the zero location description is the vector register, and the one
    location description is the spilled memory location. The
    ``DW_OP_LLVM_form_aspace_address`` is used to specify the address space of
    the memory location description.

  * ``DW_OP_LLVM_select_bit_piece`` is used by the ``lane_pc`` attribute
    expression where divergent control flow is controlled by the execution mask.
    An undefined location description together with ``DW_OP_LLVM_extend`` is
    used to indicate the lane was not active on entry to the subprogram.

Expression Operation Encodings
++++++++++++++++++++++++++++++

The following table gives the encoding of the DWARF expression operations added
for AMDGPU.

.. table:: AMDGPU DWARF Expression Operation Encodings
   :name: amdgpu-dwarf-expression-operation-encodings-table

   ================================== ===== ======== ===============================
   Operation                          Code  Number   Notes
                                            of
                                            Operands
   ================================== ===== ======== ===============================
   DW_OP_LLVM_form_aspace_address     0xe7     0
   DW_OP_LLVM_push_lane               0xea     0
   DW_OP_LLVM_offset                  0xe9     0
   DW_OP_LLVM_offset_uconst           *TBD*    1     ULEB128 byte displacement
   DW_OP_LLVM_bit_offset              *TBD*    0
   DW_OP_LLVM_call_frame_entry_reg    *TBD*    1     ULEB128 register number
   DW_OP_LLVM_undefined               *TBD*    0
   DW_OP_LLVM_aspace_bregx            *TBD*    2     ULEB128 register number,
                                                     ULEB128 byte displacement
   DW_OP_LLVM_aspace_implicit_pointer *TBD*    2     4- or 8-byte offset of DIE,
                                                     SLEB128 byte displacement
   DW_OP_LLVM_piece_end               *TBD*    0
   DW_OP_LLVM_extend                  *TBD*    2     ULEB128 bit size,
                                                     ULEB128 count
   DW_OP_LLVM_select_bit_piece        *TBD*    2     ULEB128 bit size,
                                                     ULEB128 count
   ================================== ===== ======== ===============================
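
To make the operand encodings in the table concrete, the following Python sketch
shows how an operation with ULEB128 operands could be serialized. The
``uleb128`` and ``encode_op`` helpers are hypothetical names written for this
example. Only the three opcodes that have a value in the table are used as real
codes; operations marked *TBD* have no assigned code, so the example passes a
clearly labeled placeholder for one of them. Operands that are not ULEB128 (such
as the DIE offset and SLEB128 displacement of
``DW_OP_LLVM_aspace_implicit_pointer``) are not covered.

.. code-block:: python

  def uleb128(value: int) -> bytes:
      """Encode a non-negative integer as unsigned LEB128."""
      if value < 0:
          raise ValueError("ULEB128 cannot encode negative values")
      out = bytearray()
      while True:
          byte = value & 0x7F
          value >>= 7
          if value:
              out.append(byte | 0x80)  # more bytes follow
          else:
              out.append(byte)
              return bytes(out)


  # Opcodes taken from the table above; the *TBD* entries have no code yet.
  DW_OP_LLVM_FORM_ASPACE_ADDRESS = 0xE7
  DW_OP_LLVM_OFFSET = 0xE9
  DW_OP_LLVM_PUSH_LANE = 0xEA


  def encode_op(opcode: int, *uleb_operands: int) -> bytes:
      """Encode one operation whose operands are all ULEB128 values."""
      return bytes([opcode]) + b"".join(uleb128(v) for v in uleb_operands)


  # DW_OP_LLVM_extend has no assigned code yet, so a placeholder is passed in.
  PLACEHOLDER_EXTEND_OPCODE = 0xFF  # hypothetical, NOT a real assignment
  encoded_extend = encode_op(PLACEHOLDER_EXTEND_OPCODE, 32, 64)
  encoded_push_lane = encode_op(DW_OP_LLVM_PUSH_LANE)

Operations with zero operands, such as ``DW_OP_LLVM_push_lane``, encode to the
single opcode byte.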

.. _amdgpu-dwarf-debugging-information-entry-attributes:

Debugging Information Entry Attributes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This section provides changes to existing debugging information entry attributes
and defines attributes added by the AMDGPU target.

1.  ``DW_AT_location``

    If the result of the ``DW_AT_location`` DWARF expression is required to be a
    location description, then it may have any kind of location description (see
    :ref:`amdgpu-location-description-operations`).

2.  ``DW_AT_const_value``

    .. note::

      Could deprecate using the ``DW_AT_const_value`` attribute for
      ``DW_TAG_variable`` or ``DW_TAG_formal_parameter`` debugging information
      entries that are constants. Instead, ``DW_AT_location`` could be used with
      a DWARF expression that produces an implicit location description now that
      any location description can be used within a DWARF expression. This
      allows the ``DW_OP_call*`` operations to be used to push the location
      description of any variable regardless of how it is optimized.

3.  ``DW_AT_frame_base``

    A ``DW_TAG_subprogram`` or ``DW_TAG_entry_point`` debugging information
    entry may have a ``DW_AT_frame_base`` attribute, whose value is a DWARF
    expression or location list that describes the *frame base* for the
    subroutine or entry point.

    If the result of the DWARF expression is a register location description,
    then the ``DW_OP_deref`` operation is applied to compute the frame base
    memory location description in the target architecture default address
    space.

    .. note::

      This rule could be removed and the producer required to create the
      required location description directly using ``DW_OP_call_frame_cfa``,
      ``DW_OP_fbreg``, ``DW_OP_breg*``, or ``DW_OP_LLVM_aspace_bregx``. This
      would also then allow a target to implement the call frames within a
      large register.

    Otherwise, the result of the DWARF expression is required to be a memory
    location description in any of the target architecture address spaces which
    is the frame base.

4.  ``DW_AT_data_member_location``

    For a ``DW_AT_data_member_location`` attribute there are two cases:

    1.  If the value is an integer constant, it is the offset in bytes from the
        beginning of the containing entity. If the beginning of the containing
        entity has a non-zero bit offset then the beginning of the member entry
        has that same bit offset as well.

    2.  Otherwise, the value must be a DWARF expression or location list. The
        DWARF expression E corresponding to the current program location is
        selected. The location description of the beginning of the containing
        entity is pushed on the DWARF stack before E is evaluated. The result of
        the evaluation is the location description of the base of the member
        entry.

        .. note::

          The beginning of the containing entity can now be any location
          description and can be bit aligned.

5.  ``DW_AT_use_location``

    The ``DW_TAG_ptr_to_member_type`` debugging information entry has a
    ``DW_AT_use_location`` attribute whose value is a DWARF expression or
    location list. The DWARF expression E corresponding to the current program
    location is selected. It is used to compute the location description of the
    member of the class to which the pointer to member entry points.

    *The method used to find the location description of a given member of a
    class or structure is common to any instance of that class or structure and
    to any instance of the pointer to member type. The method is thus associated
    with the type entry, rather than with each instance of the type.*

    The ``DW_AT_use_location`` description is used in conjunction with the
    location descriptions for a particular object of the given pointer to member
    type and for a particular structure or class instance.

    Two values are pushed onto the DWARF expression stack before E is evaluated.
    The first value pushed is the value of the pointer to member object itself.
    The second value pushed is the location description of the base of the
    entire structure or union instance containing the member whose address is
    being calculated.

6.  ``DW_AT_data_location``

    The ``DW_AT_data_location`` attribute may be used with any type that
    provides one or more levels of hidden indirection and/or run-time parameters
    in its representation. Its value is a DWARF expression E which computes the
    location description of the data for an object. When this attribute is
    omitted, the location description of the data is the same as the location
    description of the object.

    *E will typically begin with* ``DW_OP_push_object_address`` *which loads the
    location description of the object which can then serve as a descriptor in
    subsequent calculation.*

7.  ``DW_AT_vtable_elem_location``

    An entry for a virtual function also has a ``DW_AT_vtable_elem_location``
    attribute whose value is a DWARF expression or location list. The DWARF
    expression E corresponding to the current program location is selected. The
    location description of the object of the enclosing type is pushed onto the
    expression stack before E is evaluated. The resulting location description
    is the slot for the function within the virtual function table for the
    enclosing class.

8.  ``DW_AT_static_link``

    If a ``DW_TAG_subprogram`` or ``DW_TAG_entry_point`` debugging information
    entry is nested, it may have a ``DW_AT_static_link`` attribute, whose value
    is a DWARF expression or location list. The DWARF expression E corresponding
    to the current program location is selected. The result of evaluating E is
    the frame base memory location description of the relevant instance of the
    subroutine that immediately encloses the subroutine or entry point.

9.  ``DW_AT_return_addr``

    A ``DW_TAG_subprogram``, ``DW_TAG_inlined_subroutine``, or
    ``DW_TAG_entry_point`` debugging information entry may have a
    ``DW_AT_return_addr`` attribute, whose value is a DWARF expression or
    location list. The DWARF expression E corresponding to the current program
    location is selected. The result of evaluating E is the location description
    for the place where the return address for the subroutine or entry point is
    stored.

    .. note::

      It is unclear why ``DW_TAG_inlined_subroutine`` has a
      ``DW_AT_return_addr`` attribute but not a ``DW_AT_frame_base`` or
      ``DW_AT_static_link`` attribute. It seems it would either have all of them
      or none. Since inlined subprograms do not have a frame, it seems they
      would have none of these attributes.

10. ``DW_AT_LLVM_lanes`` *New*

    For languages that are implemented using a SIMD or SIMT execution model, a
    ``DW_TAG_subprogram``, ``DW_TAG_inlined_subroutine``, or
    ``DW_TAG_entry_point`` debugging information entry may have a
    ``DW_AT_LLVM_lanes`` attribute whose value is an integer constant that is
    the number of lanes per thread.

    If not present, the default value of 1 is used.

    The DWARF is ill-formed if the value is 0.

11. ``DW_AT_LLVM_lane_pc`` *New*

    For languages that are implemented using a SIMD or SIMT execution model, a
    ``DW_TAG_subprogram``, ``DW_TAG_inlined_subroutine``, or
    ``DW_TAG_entry_point`` debugging information entry may have a
    ``DW_AT_LLVM_lane_pc`` attribute whose value is a DWARF expression or
    location list. The DWARF expression E corresponding to the current program
    location is selected. The result of evaluating E is a location description
    that references a wave size vector of generic type elements. Each element
    holds the conceptual program location of the corresponding lane, where the
    least significant element corresponds to the first target architecture lane
    identifier and so forth. If the lane was not active when the subprogram was
    called, its element is an undefined location description.

    ``DW_AT_LLVM_lane_pc`` *allows the compiler to indicate conceptually where
    each lane of a SIMT thread is positioned even when it is in divergent
    control flow that is not active.*

    If not present, the thread is not being used in a SIMT manner, and the
    thread's program location is used.

    *See* :ref:`amdgpu-dwarf-amdgpu-dw-at-llvm-lane-pc` *for AMDGPU
    information.*

12. ``DW_AT_LLVM_active_lane`` *New*

    For languages that are implemented using a SIMD or SIMT execution model, a
    ``DW_TAG_subprogram``, ``DW_TAG_inlined_subroutine``, or
    ``DW_TAG_entry_point`` debugging information entry may have a
    ``DW_AT_LLVM_active_lane`` attribute whose value is a DWARF expression or
    location list. The DWARF expression E corresponding to the current program
    location is selected. The result of evaluating E is an integral value that
    is the mask of active lanes for the current program location. The Nth least
    significant bit of the mask corresponds to the Nth lane. If the bit is 1 the
    lane is active, otherwise it is inactive.

    *Some targets may update the target architecture execution mask for regions
    of code that must execute with different sets of lanes than the current
    active lanes. For example, some code must execute in whole wave mode.*
    ``DW_AT_LLVM_active_lane`` *allows the compiler to provide the means to
    determine the actual active lanes.*

    If not present and ``DW_AT_LLVM_lanes`` is greater than 1, then the target
    architecture execution mask is used.

    *See* :ref:`amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane` *for AMDGPU
    information.*

13. ``DW_AT_LLVM_vector_size`` *New*

    A base type V may have the ``DW_AT_LLVM_vector_size`` attribute whose value
    is an integer constant that is the vector size S.

    The representation of a vector base type is as S contiguous elements, each
    one having the representation of a base type E that is the same as V without
    the ``DW_AT_LLVM_vector_size`` attribute.

    If not present, the base type is not a vector.

    The DWARF is ill-formed if S is not greater than 0.

    .. note::

      LLVM has mention of a non-upstreamed debugging information entry that is
      intended to support vector types. However, that was not for a base type
      so would not be suitable as the type of a stack value entry. But perhaps
      that could be replaced by using this attribute.

14. ``DW_AT_LLVM_augmentation`` *New*

    A compilation unit may have a ``DW_AT_LLVM_augmentation`` attribute, whose
    value is an augmentation string.

    *The augmentation string allows users to indicate that there is additional
    target-specific information in the debugging information entries. For
    example, this might be information about the version of target-specific
    extensions that are being used.*

    If not present, or if the string is empty, then the compilation unit has no
    augmentation string.

    .. note::

      For AMDGPU, the augmentation string contains:

      ::

        [amd:v0.0]

      The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
      extensions used in the DWARF of the compilation unit. The version number
      conforms to [SEMVER]_.
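
As a small illustration of the AMDGPU augmentation string format shown in the
note above, a consumer could extract the version as follows. The function name
``amdgpu_extension_version`` is hypothetical, and the sketch assumes that the
``[amd:vX.Y]`` entry may appear anywhere within a longer augmentation string,
which the text above does not specify.

.. code-block:: python

  import re
  from typing import Optional, Tuple

  # Matches the "[amd:vX.Y]" entry, capturing major X and minor Y.
  _AMD_AUGMENTATION = re.compile(r"\[amd:v(\d+)\.(\d+)\]")


  def amdgpu_extension_version(augmentation: str) -> Optional[Tuple[int, int]]:
      """Return (major, minor) of the AMDGPU extensions, or None if absent."""
      match = _AMD_AUGMENTATION.search(augmentation)
      if match is None:
          return None
      return int(match.group(1)), int(match.group(2))


  version = amdgpu_extension_version("[amd:v0.0]")

An empty or missing string yields ``None``, matching the rule that such a
compilation unit has no augmentation string.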

Attribute Encodings
+++++++++++++++++++

The following table gives the encoding of the debugging information entry
attributes added for AMDGPU.

.. table:: AMDGPU DWARF Attribute Encodings
   :name: amdgpu-dwarf-attribute-encodings-table

   ================================== ===== ====================================
   Attribute Name                     Value Classes
   ================================== ===== ====================================
   DW_AT_LLVM_lanes                         constant
   DW_AT_LLVM_lane_pc                       exprloc, loclist
   DW_AT_LLVM_active_lane                   exprloc, loclist
   DW_AT_LLVM_vector_size                   constant
   DW_AT_LLVM_augmentation                  string
   ================================== ===== ====================================
 | 
						||
 | 
.. _amdgpu-call-frame-information:

Call Frame Information
~~~~~~~~~~~~~~~~~~~~~~

DWARF Call Frame Information describes how an agent can virtually *unwind*
call frames in a running process or core dump.

.. note::

  AMDGPU conforms to the DWARF standard with additional support added for
  address spaces. Register unwind DWARF expressions are generalized to allow any
  location description, including composite and implicit location descriptions.

Structure of Call Frame Information
+++++++++++++++++++++++++++++++++++

The register rules are:

*undefined*
  A register that has this rule has no recoverable value in the previous frame.
  (By convention, it is not preserved by a callee.)

*same value*
  This register has not been modified from the previous frame. (By convention,
  it is preserved by the callee, but the callee has not modified it.)

*offset(N)*
  The previous value of this register is saved at the location description
  computed as if the ``DW_OP_LLVM_offset N`` operation is applied to the current
  CFA memory location description where N is a signed byte offset.

*val_offset(N)*
  The previous value of this register is the address in the address space of the
  memory location description computed as if the ``DW_OP_LLVM_offset N``
  operation is applied to the current CFA memory location description where N is
  a signed byte displacement.

  If the register size does not match the size of an address in the address
  space of the current CFA memory location description, then the DWARF is
  ill-formed.

*register(R)*
  The previous value of this register is stored in another register numbered R.

  If the register sizes do not match, then the DWARF is ill-formed.

*expression(E)*
  The previous value of this register is located at the location description
  produced by executing the DWARF expression E (see
  :ref:`amdgpu-dwarf-expressions`).

*val_expression(E)*
  The previous value of this register is the value produced by executing the
  DWARF expression E (see :ref:`amdgpu-dwarf-expressions`).

  If the value type size does not match the register size, then the DWARF is
  ill-formed.

*architectural*
  The rule is defined externally to this specification by the augmenter.

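The simpler register rules above can be sketched in Python as follows. This is
a deliberately reduced model, not an unwinder: the CFA is treated as a plain
integer address in a single address space, memory is a dictionary from address
to saved value, and the *expression*, *val_expression* and *architectural*
rules are left out. The name ``recover_register`` and the tuple encoding of the
rules are assumptions made for this example.

.. code-block:: python

  def recover_register(rule, reg, cfa, regs, memory):
      """Return the value reg had in the previous frame, or None if undefined.

      rule   -- tuple such as ("offset", -8) or ("register", "s1")
      cfa    -- CFA of the current frame, modelled as an integer address
      regs   -- register values of the current frame
      memory -- mapping from address to the value saved at that address
      """
      kind = rule[0]
      if kind == "undefined":
          return None                   # No recoverable value.
      if kind == "same_value":
          return regs[reg]              # Not modified by the callee.
      if kind == "offset":
          return memory[cfa + rule[1]]  # Value saved at CFA + N.
      if kind == "val_offset":
          return cfa + rule[1]          # The value is the address CFA + N.
      if kind == "register":
          return regs[rule[1]]          # Value lives in another register.
      raise NotImplementedError(f"rule {kind!r} is not modelled in this sketch")

For instance, with a CFA of ``0x100`` the rule ``("offset", -8)`` reads the
saved value from address ``0xf8``, whereas ``("val_offset", -8)`` yields the
address ``0xf8`` itself.
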
A Common Information Entry holds information that is shared among many Frame
Description Entries. There is at least one CIE in every non-empty
``.debug_frame`` section. A CIE contains the following fields, in order:

1.  ``length`` (initial length)

    A constant that gives the number of bytes of the CIE structure, not
    including the length field itself. The size of the length field plus the
    value of length must be an integral multiple of the address size specified
    in the ``address_size`` field.

2.  ``CIE_id`` (4 or 8 bytes, see
    :ref:`amdgpu-dwarf-32-bit-and-64-bit-dwarf-formats`)

    A constant that is used to distinguish CIEs from FDEs.

    In the 32-bit DWARF format, the value of the CIE id in the CIE header is
    0xffffffff; in the 64-bit DWARF format, the value is 0xffffffffffffffff.

3.  ``version`` (ubyte)

    A version number. This number is specific to the call frame information and
    is independent of the DWARF version number.

    The value of the CIE version number is 4.

4.  ``augmentation`` (sequence of UTF-8 characters)

    A null-terminated UTF-8 string that identifies the augmentation to this CIE
    or to the FDEs that use it. If a reader encounters an augmentation string
    that is unexpected, then only the following fields can be read:

    * CIE: length, CIE_id, version, augmentation
    * FDE: length, CIE_pointer, initial_location, address_range

    If there is no augmentation, this value is a zero byte.

    *The augmentation string allows users to indicate that there is additional
    target-specific information in the CIE or FDE which is needed to virtually
    unwind a stack frame. For example, this might be information about
    dynamically allocated data which needs to be freed on exit from the
    routine.*

    *Because the* ``.debug_frame`` *section is useful independently of any*
    ``.debug_info`` *section, the augmentation string always uses UTF-8
    encoding.*

    .. note::

      For AMDGPU, the augmentation string contains:

      ::

        [amd:v0.0]

      The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
      extensions used in the DWARF of the compilation unit. The version number
      conforms to [SEMVER]_.

5.  ``address_size`` (ubyte)

    The size of a target address in this CIE and any FDEs that use it, in bytes.
    If a compilation unit exists for this frame, its address size must match the
    address size here.

    .. note::

      For AMDGPU:

      * The address size for the ``Global`` address space defined in
        :ref:`amdgpu-dwarf-address-space-mapping-table`.

6.  ``segment_selector_size`` (ubyte)

    The size of a segment selector in this CIE and any FDEs that use it, in
    bytes.

    .. note::

      For AMDGPU:

      * Does not use a segment selector so this is 0.

7.  ``code_alignment_factor`` (unsigned LEB128)

    A constant that is factored out of all advance location instructions (see
    :ref:`amdgpu-dwarf-row-creation-instructions`). The resulting value is
    ``(operand * code_alignment_factor)``.

    .. note::

      For AMDGPU:

      * 4 bytes.

    .. TODO::

       Add to :ref:`amdgpu-processor-table` table.

8.  ``data_alignment_factor`` (signed LEB128)

    A constant that is factored out of certain offset instructions (see
    :ref:`amdgpu-dwarf-cfa-definition-instructions` and
    :ref:`amdgpu-dwarf-register-rule-instructions`). The resulting value is
    ``(operand * data_alignment_factor)``.

    .. note::

      For AMDGPU:

      * 4 bytes.

    .. TODO::

       Add to :ref:`amdgpu-processor-table` table.

9.  ``return_address_register`` (unsigned LEB128)

    An unsigned LEB128 constant that indicates which column in the rule table
    represents the return address of the function. Note that this column might
    not correspond to an actual machine register.

    .. note::

      For AMDGPU:

      * ``PC_32`` for 32-bit processes and ``PC_64`` for
        64-bit processes defined in :ref:`amdgpu-dwarf-register-mapping`.

10. ``initial_instructions`` (array of ubyte)

    A sequence of rules that are interpreted to create the initial setting of
    each column in the table.

    The default rule for all columns before interpretation of the initial
    instructions is the undefined rule. However, an ABI authoring body or a
    compilation system authoring body may specify an alternate default value for
    any or all columns.

    .. note::

      For AMDGPU:

      * Since a subprogram A with fewer registers can be called from subprogram
        B that has more allocated, A will not change any of the extra registers
        as it cannot access them. Therefore, the default rule for all columns is
        ``same value``.

11. ``padding`` (array of ubyte)

    Enough ``DW_CFA_nop`` instructions to make the size of this entry match the
    length value above.

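To make the field order concrete, here is a minimal Python sketch that decodes
the CIE fields listed above from a byte string. It assumes little-endian data
and relies on the standard DWARF convention that an initial length of
0xffffffff introduces the 64-bit format. The names ``read_uleb128``,
``read_sleb128`` and ``parse_cie`` are made up for this example, and no
validation beyond the ``CIE_id`` check is attempted.

.. code-block:: python

  import struct


  def read_uleb128(data, pos):
      """Decode an unsigned LEB128 value; return (value, next position)."""
      result = shift = 0
      while True:
          byte = data[pos]
          pos += 1
          result |= (byte & 0x7F) << shift
          shift += 7
          if not byte & 0x80:
              return result, pos


  def read_sleb128(data, pos):
      """Decode a signed LEB128 value; return (value, next position)."""
      result = shift = 0
      while True:
          byte = data[pos]
          pos += 1
          result |= (byte & 0x7F) << shift
          shift += 7
          if not byte & 0x80:
              if byte & 0x40:  # Sign bit of the final byte is set.
                  result -= 1 << shift
              return result, pos


  def parse_cie(data):
      """Parse the fields of one little-endian CIE starting at data[0]."""
      (length,) = struct.unpack_from("<I", data, 0)
      pos = 4
      id_format, cie_marker = "<I", 0xFFFFFFFF
      if length == 0xFFFFFFFF:  # 64-bit DWARF format.
          (length,) = struct.unpack_from("<Q", data, pos)
          pos += 8
          id_format, cie_marker = "<Q", 0xFFFFFFFFFFFFFFFF
      end = pos + length
      (cie_id,) = struct.unpack_from(id_format, data, pos)
      pos += struct.calcsize(id_format)
      if cie_id != cie_marker:
          raise ValueError("entry is an FDE, not a CIE")
      version = data[pos]
      pos += 1
      nul = data.index(b"\0", pos)  # Augmentation is null-terminated UTF-8.
      augmentation = data[pos:nul].decode("utf-8")
      pos = nul + 1
      address_size, segment_selector_size = data[pos], data[pos + 1]
      pos += 2
      code_alignment_factor, pos = read_uleb128(data, pos)
      data_alignment_factor, pos = read_sleb128(data, pos)
      return_address_register, pos = read_uleb128(data, pos)
      return {
          "length": length,
          "version": version,
          "augmentation": augmentation,
          "address_size": address_size,
          "segment_selector_size": segment_selector_size,
          "code_alignment_factor": code_alignment_factor,
          "data_alignment_factor": data_alignment_factor,
          "return_address_register": return_address_register,
          # initial_instructions followed by any DW_CFA_nop padding.
          "initial_instructions": bytes(data[pos:end]),
      }

A real reader would stop after the ``augmentation`` field when the string is
not one it recognizes, since the remaining fields cannot then be interpreted
safely.
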
An FDE contains the following fields, in order:

1.  ``length`` (initial length)

    A constant that gives the number of bytes of the header and instruction
    stream for this function, not including the length field itself. The size of
    the length field plus the value of length must be an integral multiple of
    the address size.

2.  ``CIE_pointer`` (4 or 8 bytes, see
    :ref:`amdgpu-dwarf-32-bit-and-64-bit-dwarf-formats`)

    A constant offset into the ``.debug_frame`` section that denotes the CIE
    that is associated with this FDE.

3.  ``initial_location`` (segment selector and target address)

    The address of the first location associated with this table entry. If the
    ``segment_selector_size`` field of this FDE’s CIE is non-zero, the initial
    location is preceded by a segment selector of the given length.

4.  ``address_range`` (target address)

    The number of bytes of program instructions described by this entry.

5.  ``instructions`` (array of ubyte)

    A sequence of table defining instructions that are described in
    :ref:`amdgpu-dwarf-call-frame-instructions`.

6.  ``padding`` (array of ubyte)

    Enough ``DW_CFA_nop`` instructions to make the size of this entry match the
    length value above.

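The ``length`` and ``padding`` rules for both CIEs and FDEs amount to simple
modular arithmetic. The sketch below pads an entry body (everything after the
length field) with ``DW_CFA_nop`` bytes, whose opcode is 0 in DWARF. The
function name ``pad_entry`` and its parameters are assumptions made for
illustration.

.. code-block:: python

  DW_CFA_NOP = 0x00


  def pad_entry(body, address_size, length_field_size=4):
      """Append DW_CFA_nop bytes until the entry size is suitably aligned.

      body              -- bytes of the entry that follow the length field
      address_size      -- value of the CIE address_size field
      length_field_size -- 4 for 32-bit DWARF, 12 for 64-bit DWARF
      """
      remainder = (length_field_size + len(body)) % address_size
      padding = (address_size - remainder) % address_size
      return body + bytes([DW_CFA_NOP]) * padding

The value stored in ``length`` is then ``len(pad_entry(...))``, which makes the
size of the length field plus ``length`` a multiple of the address size.
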
.. _amdgpu-dwarf-call-frame-instructions:

Call Frame Instructions
+++++++++++++++++++++++

Some call frame instructions have operands that are encoded as DWARF expressions
E (see :ref:`amdgpu-dwarf-expressions`). The DWARF operators that can be used in
E have the following restrictions:

* ``DW_OP_addrx``, ``DW_OP_call2``, ``DW_OP_call4``, ``DW_OP_call_ref``,
  ``DW_OP_const_type``, ``DW_OP_constx``, ``DW_OP_convert``,
  ``DW_OP_deref_type``, ``DW_OP_regval_type``, and ``DW_OP_reinterpret``
  operators are not allowed because the call frame information must not depend
  on other debug sections.

* ``DW_OP_push_object_address`` is not allowed because there is no object
  context to provide a value to push.

* ``DW_OP_call_frame_cfa`` and ``DW_OP_entry_value`` are not allowed because
  their use would be circular.

* ``DW_OP_LLVM_call_frame_entry_reg`` is not allowed if evaluating E causes a
  circular dependency between ``DW_OP_LLVM_call_frame_entry_reg`` operators.

  *For example, if a register R1 has a* ``DW_CFA_def_cfa_expression``
  *instruction that evaluates a* ``DW_OP_LLVM_call_frame_entry_reg`` *operator
  that specifies register R2, and register R2 has a*
  ``DW_CFA_def_cfa_expression`` *instruction that evaluates a*
  ``DW_OP_LLVM_call_frame_entry_reg`` *operator that specifies register R1.*

*Call frame instructions to which these restrictions apply include*
``DW_CFA_def_cfa_expression``\ *,* ``DW_CFA_expression``\ *, and*
``DW_CFA_val_expression``\ *.*

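A producer or verifier could check these restrictions along the following
lines. The sketch assumes that each expression has already been decoded into a
list of ``(operator_name, operand, ...)`` tuples; that representation and the
function names ``forbidden_operators`` and ``has_entry_reg_cycle`` are
hypothetical and exist only for this example.

.. code-block:: python

  # Operators that are never allowed in a call frame expression.
  FORBIDDEN_OPS = frozenset({
      "DW_OP_addrx", "DW_OP_call2", "DW_OP_call4", "DW_OP_call_ref",
      "DW_OP_const_type", "DW_OP_constx", "DW_OP_convert",
      "DW_OP_deref_type", "DW_OP_regval_type", "DW_OP_reinterpret",
      "DW_OP_push_object_address",
      "DW_OP_call_frame_cfa", "DW_OP_entry_value",
  })

  ENTRY_REG = "DW_OP_LLVM_call_frame_entry_reg"


  def forbidden_operators(expression):
      """Return the disallowed operator names that occur in expression."""
      return [op for op, *_ in expression if op in FORBIDDEN_OPS]


  def has_entry_reg_cycle(expressions):
      """Detect a circular dependency between entry_reg operators.

      expressions maps a register to the decoded expression of its rule.
      """
      deps = {
          reg: {args[0] for op, *args in expr if op == ENTRY_REG}
          for reg, expr in expressions.items()
      }
      visiting, done = set(), set()

      def visit(reg):
          if reg in done:
              return False
          if reg in visiting:
              return True  # Reached a register that is still being expanded.
          visiting.add(reg)
          cyclic = any(visit(dep) for dep in deps.get(reg, ()))
          visiting.discard(reg)
          done.add(reg)
          return cyclic

      return any(visit(reg) for reg in deps)

The cycle check is a depth-first search over the register dependencies, so it
reports the R1/R2 situation described above as well as longer cycles.
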
.. _amdgpu-dwarf-row-creation-instructions:

Row Creation Instructions
#########################

These instructions are the same as in DWARF 5.

.. _amdgpu-dwarf-cfa-definition-instructions:

CFA Definition Instructions
###########################

1.  ``DW_CFA_def_cfa``

    The ``DW_CFA_def_cfa`` instruction takes two unsigned LEB128 operands
    representing a register number R and a (non-factored) byte displacement D.
    The required action is to define the current CFA rule to be the memory
    location description that is the result of evaluating the DWARF expression
    ``DW_OP_bregx R, D``.

    .. note::

      Could also consider adding ``DW_CFA_def_aspace_cfa`` and
      ``DW_CFA_def_aspace_cfa_sf`` which allow a register R, offset D, and
      address space AS to be specified. For example, that would save a byte of
      encoding over using ``DW_CFA_def_cfa R, D; DW_CFA_LLVM_def_cfa_aspace
      AS;``.

2.  ``DW_CFA_def_cfa_sf``

    The ``DW_CFA_def_cfa_sf`` instruction takes two operands: an unsigned LEB128
    value representing a register number R and a signed LEB128 factored byte
    displacement D. The required action is to define the current CFA rule to be
    the memory location description that is the result of evaluating the DWARF
    expression ``DW_OP_bregx R, D*data_alignment_factor``.

    *The action is the same as* ``DW_CFA_def_cfa`` *except that the second
    operand is signed and factored.*

3.  ``DW_CFA_def_cfa_register``

    The ``DW_CFA_def_cfa_register`` instruction takes a single unsigned LEB128
    operand representing a register number R. The required action is to define
    the current CFA rule to be the memory location description that is the
    result of evaluating the DWARF expression ``DW_OP_constu AS;
    DW_OP_aspace_bregx R, D`` where D and AS are the old CFA byte displacement
    and address space respectively.

    If the subprogram has no current CFA rule, or the rule was defined by a
    ``DW_CFA_def_cfa_expression`` instruction, then the DWARF is ill-formed.

4.  ``DW_CFA_def_cfa_offset``

    The ``DW_CFA_def_cfa_offset`` instruction takes a single unsigned LEB128
    operand representing a (non-factored) byte displacement D. The required
    action is to define the current CFA rule to be the memory location
    description that is the result of evaluating the DWARF expression
    ``DW_OP_constu AS; DW_OP_aspace_bregx R, D`` where R and AS are the old CFA
    register number and address space respectively.

    If the subprogram has no current CFA rule, or the rule was defined by a
    ``DW_CFA_def_cfa_expression`` instruction, then the DWARF is ill-formed.

5.  ``DW_CFA_def_cfa_offset_sf``

    The ``DW_CFA_def_cfa_offset_sf`` instruction takes a signed LEB128 operand
    representing a factored byte displacement D. The required action is to
    define the current CFA rule to be the memory location description that is
    the result of evaluating the DWARF expression ``DW_OP_constu AS;
    DW_OP_aspace_bregx R, D*data_alignment_factor`` where R and AS are the old
    CFA register number and address space respectively.

    If the subprogram has no current CFA rule, or the rule was defined by a
    ``DW_CFA_def_cfa_expression`` instruction, then the DWARF is ill-formed.

    *The action is the same as* ``DW_CFA_def_cfa_offset`` *except that the
    operand is signed and factored.*

6.  ``DW_CFA_LLVM_def_cfa_aspace`` *New*

    The ``DW_CFA_LLVM_def_cfa_aspace`` instruction takes a single unsigned
    LEB128 operand representing an address space identifier AS for those
    architectures that support multiple address spaces. The required action is
    to define the current CFA rule to be the memory location description L that
    is the result of evaluating the DWARF expression ``DW_OP_constu AS;
    DW_OP_aspace_bregx R, D`` where R and D are the old CFA register number and
    byte displacement respectively.

    If AS is not one of the values defined by the target architecture's
    ``DW_ASPACE_*`` values then the DWARF expression is ill-formed.

7.  ``DW_CFA_def_cfa_expression``

    The ``DW_CFA_def_cfa_expression`` instruction takes a single operand encoded
    as a ``DW_FORM_exprloc`` value representing a DWARF expression E. The
    required action is to define the current CFA rule to be the memory location
    description computed by evaluating E.

    *See* :ref:`amdgpu-dwarf-call-frame-instructions` *regarding restrictions on
    the DWARF expression operators that can be used in E.*

    If the result of evaluating E is not a memory location description with a
    bit offset that is a multiple of 8 (the byte size), then the DWARF is
    ill-formed.

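The register-based CFA definition instructions can be pictured as updates to a
small (register, displacement, address space) record. The Python sketch below
follows that reading. It makes two assumptions that go beyond the text: the
default address space is identified by 0, and ``DW_CFA_def_cfa`` and
``DW_CFA_def_cfa_sf`` reset the address space to that default because they
evaluate a plain ``DW_OP_bregx``. A rule defined by
``DW_CFA_def_cfa_expression`` is not modelled and is represented by ``None``,
the same as having no rule. All names are hypothetical.

.. code-block:: python

  from dataclasses import dataclass, replace

  DEFAULT_ASPACE = 0  # Assumed identifier of the default address space.


  @dataclass(frozen=True)
  class CfaRule:
      register: int
      offset: int
      aspace: int = DEFAULT_ASPACE


  def apply_cfa_instruction(rule, name, operands, data_alignment_factor):
      """Return the CFA rule that results from one CFA definition instruction."""
      if name == "DW_CFA_def_cfa":
          register, displacement = operands
          return CfaRule(register, displacement)
      if name == "DW_CFA_def_cfa_sf":
          register, displacement = operands
          return CfaRule(register, displacement * data_alignment_factor)
      if rule is None:
          # The remaining instructions reuse parts of the old rule.
          raise ValueError(f"{name} requires a current register-based CFA rule")
      (operand,) = operands
      if name == "DW_CFA_def_cfa_register":
          return replace(rule, register=operand)
      if name == "DW_CFA_def_cfa_offset":
          return replace(rule, offset=operand)
      if name == "DW_CFA_def_cfa_offset_sf":
          return replace(rule, offset=operand * data_alignment_factor)
      if name == "DW_CFA_LLVM_def_cfa_aspace":
          return replace(rule, aspace=operand)
      raise ValueError(f"unsupported instruction: {name}")

Keeping the record immutable makes it straightforward to save and restore rows
of the unwind table, since each row can simply hold a reference to its rule.
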
.. _amdgpu-dwarf-register-rule-instructions:

Register Rule Instructions
##########################

.. note::

  For AMDGPU:

  * The register number follows the numbering defined in
    :ref:`amdgpu-dwarf-register-mapping`.

1.  ``DW_CFA_undefined``

    The ``DW_CFA_undefined`` instruction takes a single unsigned LEB128 operand
    that represents a register number R. The required action is to set the rule
    for the register specified by R to ``undefined``.

2.  ``DW_CFA_same_value``

    The ``DW_CFA_same_value`` instruction takes a single unsigned LEB128 operand
    that represents a register number R. The required action is to set the rule
    for the register specified by R to ``same value``.

3.  ``DW_CFA_offset``

    The ``DW_CFA_offset`` instruction takes two operands: a register number R
    (encoded with the opcode) and an unsigned LEB128 constant representing a
    factored displacement D. The required action is to change the rule for the
    register specified by R to be an *offset(D*data_alignment_factor)* rule.

    .. note::

      Seems this should be named ``DW_CFA_offset_uf`` since the offset is
      unsigned factored.

4.  ``DW_CFA_offset_extended``

    The ``DW_CFA_offset_extended`` instruction takes two unsigned LEB128
    operands representing a register number R and a factored displacement D.
    This instruction is identical to ``DW_CFA_offset`` except for the encoding
    and size of the register operand.

    .. note::

      Seems this should be named ``DW_CFA_offset_extended_uf`` since the
      displacement is unsigned factored.

5.  ``DW_CFA_offset_extended_sf``

    The ``DW_CFA_offset_extended_sf`` instruction takes two operands: an
    unsigned LEB128 value representing a register number R and a signed LEB128
    factored displacement D. This instruction is identical to
    ``DW_CFA_offset_extended`` except that D is signed.

6.  ``DW_CFA_val_offset``

    The ``DW_CFA_val_offset`` instruction takes two unsigned LEB128 operands
    representing a register number R and a factored displacement D. The required
    action is to change the rule for the register indicated by R to be a
    *val_offset(D*data_alignment_factor)* rule.

    .. note::

      Seems this should be named ``DW_CFA_val_offset_uf`` since the displacement
      is unsigned factored.

7.  ``DW_CFA_val_offset_sf``

    The ``DW_CFA_val_offset_sf`` instruction takes two operands: an unsigned
    LEB128 value representing a register number R and a signed LEB128 factored
    displacement D. This instruction is identical to ``DW_CFA_val_offset``
    except that D is signed.

8.  ``DW_CFA_register``

    The ``DW_CFA_register`` instruction takes two unsigned LEB128 operands
    representing register numbers R1 and R2 respectively. The required action is
    to set the rule for the register specified by R1 to be *register(R)* where R
    is R2.

9.  ``DW_CFA_expression``

    The ``DW_CFA_expression`` instruction takes two operands: an unsigned LEB128
    value representing a register number R, and a ``DW_FORM_block`` value
    representing a DWARF expression E. The required action is to change the rule
    for the register specified by R to be an *expression(E)* rule. The memory
    location description of the current CFA is pushed on the DWARF stack prior
    to execution of E.

    *That is, the DWARF expression computes the location description where the
    register value can be retrieved.*

    *See* :ref:`amdgpu-dwarf-call-frame-instructions` *regarding restrictions on
    the DWARF expression operators that can be used in E.*

10. ``DW_CFA_val_expression``

    The ``DW_CFA_val_expression`` instruction takes two operands: an unsigned
    LEB128 value representing a register number R, and a ``DW_FORM_block`` value
    representing a DWARF expression E. The required action is to change the rule
    for the register specified by R to be a *val_expression(E)* rule. The memory
    location description of the current CFA is pushed on the DWARF evaluation
    stack prior to execution of E.

    *That is, E computes the value of register R.*

    *See* :ref:`amdgpu-dwarf-call-frame-instructions` *regarding restrictions on
    the DWARF expression operators that can be used in E.*

    If the result of evaluating E is not a value with a base type size that
    matches the register size, then the DWARF is ill-formed.

11. ``DW_CFA_restore``

    The ``DW_CFA_restore`` instruction takes a single operand (encoded with the
    opcode) that represents a register number R. The required action is to
    change the rule for the register specified by R to the rule assigned it by
    the ``initial_instructions`` in the CIE.

12. ``DW_CFA_restore_extended``

    The ``DW_CFA_restore_extended`` instruction takes a single unsigned LEB128
    operand that represents a register number R. This instruction is identical
    to ``DW_CFA_restore`` except for the encoding and size of the register
    operand.

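The difference between an operand "encoded with the opcode" and a separate
LEB128 operand, and the effect of factoring, can be seen in a small decoder.
The sketch below handles only ``DW_CFA_offset``, ``DW_CFA_restore`` and
``DW_CFA_offset_extended``, using their DWARF 5 encodings (high two bits 0x2
and 0x3, and extended opcode 0x05, respectively). The function names and the
tuple form of the result are assumptions made for this example.

.. code-block:: python

  def read_uleb128(data, pos):
      """Decode an unsigned LEB128 value; return (value, next position)."""
      result = shift = 0
      while True:
          byte = data[pos]
          pos += 1
          result |= (byte & 0x7F) << shift
          shift += 7
          if not byte & 0x80:
              return result, pos


  def decode_register_rule(data, pos, data_alignment_factor):
      """Decode one instruction; return ((register, rule), next position)."""
      opcode = data[pos]
      pos += 1
      high, low = opcode >> 6, opcode & 0x3F
      if high == 0x2:  # DW_CFA_offset: register is in the low 6 bits.
          factored, pos = read_uleb128(data, pos)
          return (low, ("offset", factored * data_alignment_factor)), pos
      if high == 0x3:  # DW_CFA_restore: register is in the low 6 bits.
          return (low, ("restore",)), pos
      if opcode == 0x05:  # DW_CFA_offset_extended: register is a ULEB128.
          register, pos = read_uleb128(data, pos)
          factored, pos = read_uleb128(data, pos)
          return (register, ("offset", factored * data_alignment_factor)), pos
      raise NotImplementedError(f"opcode {opcode:#04x} is not handled here")

With a ``data_alignment_factor`` of 4, the two bytes ``0x85 0x03`` therefore
describe an *offset(12)* rule for register 5. Registers numbered above 63 do
not fit in the low 6 bits, which is why the extended forms exist.
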
Row State Instructions
######################

These instructions are the same as in DWARF 5.

Call Frame Calling Address
++++++++++++++++++++++++++

*When virtually unwinding frames, consumers frequently wish to obtain the
address of the instruction which called a subroutine. This information is not
always provided. Typically, however, one of the registers in the virtual unwind
table is the Return Address.*

If a Return Address register is defined in the virtual unwind table, and its
rule is undefined (for example, by ``DW_CFA_undefined``), then there is no
return address and no call address, and the virtual unwind of stack activations
is complete.

*In most cases the return address is in the same context as the calling address,
but that need not be the case, especially if the producer knows in some way the
call never will return. The context of the ’return address’ might be on a
different line, in a different lexical block, or past the end of the calling
subroutine. If a consumer were to assume that it was in the same context as the
calling address, the virtual unwind might fail.*

*For architectures with constant-length instructions where the return address
immediately follows the call instruction, a simple solution is to subtract the
length of an instruction from the return address to obtain the calling
instruction. For architectures with variable-length instructions (for example,
x86), this is not possible. However, subtracting 1 from the return address,
although not guaranteed to provide the exact calling address, generally will
produce an address within the same context as the calling address, and that
usually is sufficient.*

.. note::

  For AMDGPU the instructions are variable size and a consumer can subtract 1
  from the return address to get the address of a byte within the call site
  instructions.

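The following sketch shows why a consumer would look up the return address
minus 1 rather than the return address itself. It models the FDEs as a sorted
list of ``(initial_location, address_range)`` pairs; that representation and
the names ``find_fde`` and ``caller_fde`` are assumptions made for this
example.

.. code-block:: python

  import bisect


  def find_fde(fdes, address):
      """Return the index of the FDE whose range covers address, else None.

      fdes is a list of (initial_location, address_range) pairs sorted by
      initial_location.
      """
      starts = [start for start, _ in fdes]
      index = bisect.bisect_right(starts, address) - 1
      if index >= 0:
          start, size = fdes[index]
          if start <= address < start + size:
              return index
      return None


  def caller_fde(fdes, return_address):
      """Find the FDE of the caller from a return address."""
      # Subtracting 1 yields a byte inside the call site instructions, which
      # stays in the caller even when the call is its last instruction.
      return find_fde(fdes, return_address - 1)

If a call that never returns is the last instruction of a subprogram, its
return address equals the start of the next subprogram. A lookup with the
unadjusted address would then select the wrong FDE.
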
Call Frame Information Instruction Encodings
++++++++++++++++++++++++++++++++++++++++++++

The following table gives the encoding of the DWARF call frame information
instructions added for AMDGPU.

.. table:: AMDGPU DWARF Call Frame Information Instruction Encodings
   :name: amdgpu-dwarf-call-frame-information-instruction-encodings-table

   =================================== ==== ==== ============== ================
   Instruction                         High Low  Operand 1      Operand 2
                                       2    6
                                       Bits Bits
   =================================== ==== ==== ============== ================
   DW_CFA_LLVM_def_cfa_aspace          0    0Xxx ULEB128
   =================================== ==== ==== ============== ================

Line Table
~~~~~~~~~~

.. note::

  AMDGPU does not use the ``isa`` state machine register and always sets it to
  0.

.. TODO::

  Should the ``isa`` state machine register be used to indicate if the code is
  in wave32 or wave64 mode? Or used to specify the architecture ISA?

Accelerated Access
~~~~~~~~~~~~~~~~~~

Lookup By Name
++++++++++++++

.. note::

  For AMDGPU:

  * The rule for debugging information entries included in the name
    index in the optional ``.debug_names`` section is extended to also include
    named ``DW_TAG_variable`` debugging information entries with a
    ``DW_AT_location`` attribute that includes a
    ``DW_OP_LLVM_form_aspace_address`` operation.

  * The lookup by name section header ``augmentation_string`` field contains:

    ::

      [amd:v0.0]

    The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
    extensions used in the DWARF of the compilation unit. The version number
    conforms to [SEMVER]_.

Lookup By Address
+++++++++++++++++

.. note::

  For AMDGPU:

  * The lookup by address section header table:

    ``address_size`` (ubyte)
      Match the address size for the ``Global`` address space defined in
      :ref:`amdgpu-dwarf-address-space-mapping-table`.

    ``segment_selector_size`` (ubyte)
      AMDGPU does not use a segment selector so this is 0. The entries in the
      ``.debug_aranges`` section do not have a segment selector.

						||
Data Representation
 | 
						||
~~~~~~~~~~~~~~~~~~~
 | 
						||
 | 
						||
.. _amdgpu-dwarf-32-bit-and-64-bit-dwarf-formats:
 | 
						||
 | 
						||
32-Bit and 64-Bit DWARF Formats
 | 
						||
+++++++++++++++++++++++++++++++
 | 
						||
 | 
						||
.. note::
 | 
						||
 | 
						||
  For AMDGPU:
 | 
						||
 | 
						||
  * For the ``amdgcn`` target only 64-bit process address space is supported
 | 
						||
  * The producer can generate either 32-bit or 64-bit DWARF format.
 | 
						||
 | 
						||
1.  Within the body of the ``.debug_info`` section, certain forms of attribute
 | 
						||
    value depend on the choice of DWARF format as follows. For the 32-bit DWARF
 | 
						||
    format, the value is a 4-byte unsigned integer; for the 64-bit DWARF format,
 | 
						||
    the value is an 8-byte unsigned integer.
 | 
						||
 | 
						||
    .. table:: AMDGPU DWARF ``.debug_info`` section attribute sizes
 | 
						||
      :name: amdgpu-dwarf-debug-info-section-attribute-sizes
 | 
						||
 | 
						||
      =================================== =====================================
 | 
						||
      Form                                Role
 | 
						||
      =================================== =====================================
 | 
						||
      DW_FORM_line_strp                   offset in ``.debug_line_str``
 | 
						||
      DW_FORM_ref_addr                    offset in ``.debug_info``
 | 
						||
      DW_FORM_sec_offset                  offset in a section other than
 | 
						||
                                          ``.debug_info`` or ``.debug_str``
 | 
						||
      DW_FORM_strp                        offset in ``.debug_str``
 | 
						||
      DW_FORM_strp_sup                    offset in ``.debug_str`` section of
 | 
						||
                                          supplementary object file
 | 
						||
      DW_OP_call_ref                      offset in ``.debug_info``
 | 
						||
      DW_OP_implicit_pointer              offset in ``.debug_info``
 | 
						||
      DW_OP_LLVM_aspace_implicit_pointer  offset in ``.debug_info``
 | 
						||
      =================================== =====================================
 | 
						||
 | 
						||
Unit Headers
 | 
						||
++++++++++++
 | 
						||
 | 
						||
.. note::
 | 
						||
 | 
						||
  For AMDGPU:
 | 
						||
 | 
						||
  * For AMDGPU the ``address_size`` field of the DWARF unit headers matches the
 | 
						||
    address size for the ``Global`` address space defined in
 | 
						||
    :ref:`amdgpu-dwarf-address-space-mapping-table`.
 | 
						||
 | 
						||
.. _amdgpu-dwarf-amdgpu-dw-at-llvm-lane-pc:

AMDGPU DW_AT_LLVM_lane_pc
~~~~~~~~~~~~~~~~~~~~~~~~~

The ``DW_AT_LLVM_lane_pc`` attribute can be used to specify the program location
of the separate lanes of a SIMT thread. See
:ref:`amdgpu-dwarf-debugging-information-entry-attributes`.

If the lane is an active lane then this will be the same as the current program
location.

If the lane is inactive, but was active on entry to the subprogram, then this is
the program location in the subprogram at which execution of the lane is
conceptually positioned.

If the lane was not active on entry to the subprogram, then this will be the
undefined location. A client debugger can check if the lane is part of a valid
work-group by checking that the lane is in the range of the associated
work-group within the grid, accounting for partial work-groups. If it is not
then the debugger can omit any information for the lane. Otherwise, the debugger
may repeatedly unwind the stack and inspect the ``DW_AT_LLVM_lane_pc`` of the
calling subprogram until it finds a non-undefined location. Conceptually the
lane only has the call frames for which it has a non-undefined
``DW_AT_LLVM_lane_pc``.
 | 
						||
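The unwinding rule described above can be sketched in Python as follows. This
is only an illustration under stated assumptions: ``resolve_lane_pc`` and the
``frames`` layout are hypothetical and not part of any debugger API, ``None``
stands in for the undefined location, and the work-group range check is left
out.

```python
UNDEFINED = None  # stands in for the DWARF undefined location


def resolve_lane_pc(frames, lane):
    """Return (frame_index, pc) for the innermost frame in which `lane` has a
    defined lane PC, or None if the lane has no such frame.

    `frames` is ordered innermost first. Each frame is a list holding one lane
    PC per lane, as a debugger might obtain by evaluating DW_AT_LLVM_lane_pc.
    """
    for index, lane_pcs in enumerate(frames):
        pc = lane_pcs[lane]
        if pc is not UNDEFINED:
            return index, pc
    return None


# Two frames of a 4-lane toy wave. Lane 3 was not active on entry to the
# innermost subprogram, so its location comes from the calling frame.
frames = [
    [0x120, 0x120, 0x100, UNDEFINED],  # innermost frame
    [0x40, 0x40, 0x40, 0x48],          # calling frame
]
print(resolve_lane_pc(frames, 0))  # (0, 288)
print(resolve_lane_pc(frames, 3))  # (1, 72)
```

The returned frame index reflects the idea that a lane conceptually only has
the call frames in which its lane PC is defined.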
 | 
						||
The following example illustrates how the AMDGPU backend can generate a location
list for the nested ``IF/THEN/ELSE`` structures of the following subprogram
pseudo code for a target with 64 lanes per wave.

.. code::
  :number-lines:

  SUBPROGRAM X
  BEGIN
    a;
    IF (c1) THEN
      b;
      IF (c2) THEN
        c;
      ELSE
        d;
      ENDIF
      e;
    ELSE
      f;
    ENDIF
    g;
  END

The AMDGPU backend may generate the following pseudo LLVM MIR to manipulate the
execution mask (``EXEC``) to linearize the control flow. The condition is
evaluated to make a mask of the lanes for which the condition evaluates to true.
First the ``THEN`` region is executed by setting the ``EXEC`` mask to the
logical ``AND`` of the current ``EXEC`` mask with the condition mask. Then the
``ELSE`` region is executed by setting the ``EXEC`` mask to the logical ``AND``
of the negated ``EXEC`` mask with the ``EXEC`` mask saved at the start of the
region. After the ``IF/THEN/ELSE`` region the ``EXEC`` mask is restored to the
value it had at the beginning of the region. This is shown below. Other
approaches are possible, but the basic concept is the same.

.. code::
  :number-lines:

  $lex_start:
    a;
    %1 = EXEC
    %2 = c1
  $lex_1_start:
    EXEC = %1 & %2
  $lex_1_then:
      b;
      %3 = EXEC
      %4 = c2
  $lex_1_1_start:
      EXEC = %3 & %4
  $lex_1_1_then:
        c;
      EXEC = ~EXEC & %3
  $lex_1_1_else:
        d;
      EXEC = %3
  $lex_1_1_end:
      e;
    EXEC = ~EXEC & %1
  $lex_1_else:
      f;
    EXEC = %1
  $lex_1_end:
    g;
  $lex_end:
 | 
						||
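To make the mask arithmetic concrete, the linearized control flow above can be
simulated with plain integers. This is a minimal sketch and not backend code:
``run_linearized`` is a hypothetical name, each mask is a 64-bit Python integer
with one bit per lane, and the statements ``a`` to ``g`` are reduced to
recording which lanes would execute them.

```python
WAVE_SIZE = 64
FULL = (1 << WAVE_SIZE) - 1


def run_linearized(exec_mask, c1, c2):
    """Simulate the linearized control flow; return {statement: lane mask}."""
    executed = {}
    executed["a"] = exec_mask
    saved_1 = exec_mask                  # %1 = EXEC
    exec_mask = saved_1 & c1             # EXEC = %1 & %2
    executed["b"] = exec_mask
    saved_1_1 = exec_mask                # %3 = EXEC
    exec_mask = saved_1_1 & c2           # EXEC = %3 & %4
    executed["c"] = exec_mask
    exec_mask = ~exec_mask & saved_1_1   # EXEC = ~EXEC & %3
    executed["d"] = exec_mask
    exec_mask = saved_1_1                # EXEC = %3
    executed["e"] = exec_mask
    exec_mask = ~exec_mask & saved_1     # EXEC = ~EXEC & %1
    executed["f"] = exec_mask
    exec_mask = saved_1                  # EXEC = %1
    executed["g"] = exec_mask
    return executed


# c1 holds for the low 32 lanes, c2 holds for the even-numbered lanes.
masks = run_linearized(FULL, c1=0x00000000FFFFFFFF, c2=0x5555555555555555)
print(f"{masks['d']:#018x}")  # 0x00000000aaaaaaaa
```

In this run the lanes executing ``c`` and ``d`` partition the lanes executing
``b``, and the lanes executing ``b`` and ``f`` partition the lanes executing
``a``, which is the property the mask updates are meant to preserve.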
 | 
						||
To create the location list that defines the location description of a vector of
lane program locations, the LLVM MIR ``DBG_VALUE`` pseudo instruction can be
used to annotate the linearized control flow. This can be done by defining an
artificial variable for the lane PC. The location list created for it is used to
define the value of the ``DW_AT_LLVM_lane_pc`` attribute.

A DWARF procedure is defined for each well-nested structured control flow region
which provides the conceptual lane program location for a lane if it is not
active (namely it is divergent). The expression for each region inherits the
value of the immediately enclosing region and modifies it according to the
semantics of the region.

For an ``IF/THEN/ELSE`` region the divergent program location is at the start of
the region for the ``THEN`` region since it is executed first. For the ``ELSE``
region the divergent program location is at the end of the ``IF/THEN/ELSE``
region since the ``THEN`` region has completed.

The lane PC artificial variable is assigned at each region transition. It uses
the immediately enclosing region's DWARF procedure to compute the program
location for each lane assuming they are divergent, and then modifies the result
by inserting the current program location for each lane that the ``EXEC`` mask
indicates is active.

By having separate DWARF procedures for each region, they can be reused to
define the value for any nested region. This reduces the amount of DWARF
required.

The following provides an example using pseudo LLVM MIR.

.. code::
  :number-lines:

  $lex_start:
    DEFINE_DWARF %__uint_64 = DW_TAG_base_type[
      DW_AT_name = "__uint64";
      DW_AT_byte_size = 8;
      DW_AT_encoding = DW_ATE_unsigned;
    ];
    DEFINE_DWARF %__active_lane_pc = DW_TAG_dwarf_procedure[
      DW_AT_name = "__active_lane_pc";
      DW_AT_location = [
        DW_OP_regx PC;
        DW_OP_LLVM_extend 64, 64;
        DW_OP_regval_type EXEC, %__uint_64;
        DW_OP_LLVM_select_bit_piece 64, 64;
      ];
    ];
    DEFINE_DWARF %__divergent_lane_pc = DW_TAG_dwarf_procedure[
      DW_AT_name = "__divergent_lane_pc";
      DW_AT_location = [
        DW_OP_LLVM_undefined;
        DW_OP_LLVM_extend 64, 64;
      ];
    ];
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc;
      DW_OP_call_ref %__active_lane_pc;
    ];
    a;
    %1 = EXEC;
    DBG_VALUE %1, $noreg, %__lex_1_save_exec;
    %2 = c1;
  $lex_1_start:
    EXEC = %1 & %2;
  $lex_1_then:
      DEFINE_DWARF %__divergent_lane_pc_1_then = DW_TAG_dwarf_procedure[
        DW_AT_name = "__divergent_lane_pc_1_then";
        DW_AT_location = DIExpression[
          DW_OP_call_ref %__divergent_lane_pc;
          DW_OP_addrx &lex_1_start;
          DW_OP_stack_value;
          DW_OP_LLVM_extend 64, 64;
          DW_OP_call_ref %__lex_1_save_exec;
          DW_OP_deref_type 64, %__uint_64;
          DW_OP_LLVM_select_bit_piece 64, 64;
        ];
      ];
      DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
        DW_OP_call_ref %__divergent_lane_pc_1_then;
        DW_OP_call_ref %__active_lane_pc;
      ];
      b;
      %3 = EXEC;
      DBG_VALUE %3, $noreg, %__lex_1_1_save_exec;
      %4 = c2;
  $lex_1_1_start:
      EXEC = %3 & %4;
  $lex_1_1_then:
        DEFINE_DWARF %__divergent_lane_pc_1_1_then = DW_TAG_dwarf_procedure[
          DW_AT_name = "__divergent_lane_pc_1_1_then";
          DW_AT_location = DIExpression[
            DW_OP_call_ref %__divergent_lane_pc_1_then;
            DW_OP_addrx &lex_1_1_start;
            DW_OP_stack_value;
            DW_OP_LLVM_extend 64, 64;
            DW_OP_call_ref %__lex_1_1_save_exec;
            DW_OP_deref_type 64, %__uint_64;
            DW_OP_LLVM_select_bit_piece 64, 64;
          ];
        ];
        DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
          DW_OP_call_ref %__divergent_lane_pc_1_1_then;
          DW_OP_call_ref %__active_lane_pc;
        ];
        c;
      EXEC = ~EXEC & %3;
  $lex_1_1_else:
        DEFINE_DWARF %__divergent_lane_pc_1_1_else = DW_TAG_dwarf_procedure[
          DW_AT_name = "__divergent_lane_pc_1_1_else";
          DW_AT_location = DIExpression[
            DW_OP_call_ref %__divergent_lane_pc_1_then;
            DW_OP_addrx &lex_1_1_end;
            DW_OP_stack_value;
            DW_OP_LLVM_extend 64, 64;
            DW_OP_call_ref %__lex_1_1_save_exec;
            DW_OP_deref_type 64, %__uint_64;
            DW_OP_LLVM_select_bit_piece 64, 64;
          ];
        ];
        DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
          DW_OP_call_ref %__divergent_lane_pc_1_1_else;
          DW_OP_call_ref %__active_lane_pc;
        ];
        d;
      EXEC = %3;
  $lex_1_1_end:
      DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
        DW_OP_call_ref %__divergent_lane_pc;
        DW_OP_call_ref %__active_lane_pc;
      ];
      e;
    EXEC = ~EXEC & %1;
  $lex_1_else:
      DEFINE_DWARF %__divergent_lane_pc_1_else = DW_TAG_dwarf_procedure[
        DW_AT_name = "__divergent_lane_pc_1_else";
        DW_AT_location = DIExpression[
          DW_OP_call_ref %__divergent_lane_pc;
          DW_OP_addrx &lex_1_end;
          DW_OP_stack_value;
          DW_OP_LLVM_extend 64, 64;
          DW_OP_call_ref %__lex_1_save_exec;
          DW_OP_deref_type 64, %__uint_64;
          DW_OP_LLVM_select_bit_piece 64, 64;
        ];
      ];
      DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
        DW_OP_call_ref %__divergent_lane_pc_1_else;
        DW_OP_call_ref %__active_lane_pc;
      ];
      f;
    EXEC = %1;
  $lex_1_end:
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc;
      DW_OP_call_ref %__active_lane_pc;
    ];
    g;
  $lex_end:
 | 
						||
 | 
						||
The DWARF procedure ``%__active_lane_pc`` is used to update the lane PC elements
that are active with the current program location.

Artificial variables ``%__lex_1_save_exec`` and ``%__lex_1_1_save_exec`` are
created for the execution masks saved on entry to a region. Using the
``DBG_VALUE`` pseudo instruction, location lists that describe where they are
allocated at any given program location will be created. The compiler may
allocate them to registers, or spill them to memory.

The DWARF procedures for each region use the saved execution mask value to only
update the lanes that are active on entry to the region. All other lanes retain
the value of the enclosing region where they were last active. If they were not
active on entry to the subprogram, then they will have the undefined location
description.
 | 
						||
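The way the region procedures compose can be modelled with a few lines of
Python. This is a simplified sketch rather than a description of how a DWARF
consumer is implemented: it assumes a 4-lane wave for brevity, represents a
vector of lane locations as a Python list with ``None`` for the undefined
location, and uses function names that merely mirror the pseudo MIR above.

```python
UNDEFINED = None  # stands in for the undefined location description


def select_by_mask(base, replacement, mask):
    """Per lane, take `replacement` where the mask bit is set, else keep the
    value from `base`. A simplified model of the per-lane selection that
    DW_OP_LLVM_select_bit_piece is used for in the example."""
    return [replacement if (mask >> lane) & 1 else old
            for lane, old in enumerate(base)]


def divergent_lane_pc(wave_size):
    # Outermost region: lanes not active on entry have an undefined location.
    return [UNDEFINED] * wave_size


def divergent_lane_pc_1_then(wave_size, lex_1_start, save_exec_1):
    # Lanes active on entry to region 1 are conceptually at its start.
    return select_by_mask(divergent_lane_pc(wave_size), lex_1_start, save_exec_1)


def active_lane_pc(base, pc, exec_mask):
    # Lanes active now are at the current program location.
    return select_by_mask(base, pc, exec_mask)


# Lanes 0-2 were active on entry to region 1 and lanes 0-1 took the THEN
# branch, which is currently executing at PC 0x120.
lane_pcs = active_lane_pc(
    divergent_lane_pc_1_then(4, lex_1_start=0x100, save_exec_1=0b0111),
    pc=0x120,
    exec_mask=0b0011,
)
print(lane_pcs)  # [288, 288, 256, None]
```

Lane 2 keeps the region start because it is divergent, and lane 3 stays
undefined because it was not active on entry to the subprogram.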
 | 
						||
Other structured control flow regions can be handled similarly. For example,
loops would set the divergent program location for the region at the end of the
loop. Any lanes active will be in the loop, and any lanes not active must have
exited the loop.

An ``IF/THEN/ELSEIF/ELSEIF/...`` region can be treated as a nest of
``IF/THEN/ELSE`` regions.

The DWARF procedures can use the active lane artificial variable described in
:ref:`amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane` rather than the actual
``EXEC`` mask in order to support whole or quad wave mode.

.. _amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane:

AMDGPU DW_AT_LLVM_active_lane
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``DW_AT_LLVM_active_lane`` attribute can be used to specify the lanes that
are conceptually active for a SIMT thread. See
:ref:`amdgpu-dwarf-debugging-information-entry-attributes`.

The execution mask may be modified to implement whole or quad wave mode
operations. For example, all lanes may need to temporarily be made active to
execute a whole wave operation. Such regions would save the ``EXEC`` mask,
update it to enable the necessary lanes, perform the operations, and then
restore the ``EXEC`` mask from the saved value. While executing the whole wave
region, the conceptual execution mask is the saved value, not the ``EXEC``
value.

This is handled by defining an artificial variable for the active lane mask. The
active lane mask artificial variable would be the actual ``EXEC`` mask for
normal regions, and the saved execution mask for regions where the mask is
temporarily updated. The location list created for this artificial variable is
used to define the value of the ``DW_AT_LLVM_active_lane`` attribute.
 | 
						||
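The distinction between the hardware ``EXEC`` mask and the conceptual active
lane mask can be illustrated with a small Python model. The ``Wave`` class and
its ``whole_wave_mode`` method are hypothetical names used only for this
sketch, and nested whole wave regions are deliberately not handled.

```python
from contextlib import contextmanager


class Wave:
    """Toy wave with an EXEC mask and the conceptual active lane mask."""

    def __init__(self, exec_mask, wave_size=64):
        self.full_mask = (1 << wave_size) - 1
        self.exec_mask = exec_mask
        self.saved_exec = None  # set only while inside a whole wave region

    @property
    def active_lane_mask(self):
        # The saved mask inside a whole wave region, otherwise the real EXEC.
        if self.saved_exec is not None:
            return self.saved_exec
        return self.exec_mask

    @contextmanager
    def whole_wave_mode(self):
        self.saved_exec = self.exec_mask      # save EXEC
        self.exec_mask = self.full_mask       # enable all lanes
        try:
            yield self
        finally:
            self.exec_mask = self.saved_exec  # restore EXEC
            self.saved_exec = None


wave = Wave(exec_mask=0x0F)
with wave.whole_wave_mode():
    inside = (wave.exec_mask, wave.active_lane_mask)
print(hex(inside[0]), hex(inside[1]))  # 0xffffffffffffffff 0xf
```

Inside the region the two masks differ, which is the case the artificial
variable exists to describe. Outside it they are the same value.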
 | 
						||
Source Text
~~~~~~~~~~~

Source text for online-compiled programs (e.g. those compiled by the OpenCL
runtime) may be embedded into the DWARF v5 line table using the ``clang
-gembed-source`` option, described in table :ref:`amdgpu-debug-options`.

For example:

``-gembed-source``
  Enable the embedded source DWARF v5 extension.
``-gno-embed-source``
  Disable the embedded source DWARF v5 extension.

  .. table:: AMDGPU Debug Options
     :name: amdgpu-debug-options

     ==================== ==================================================
     Debug Flag           Description
     ==================== ==================================================
     -g[no-]embed-source  Enable/disable embedding source text in DWARF
                          debug sections. Useful for environments where
                          source cannot be written to disk, such as
                          when performing online compilation.
     ==================== ==================================================

This option enables one extended content type in the DWARF v5 Line Number
Program Header, which is used to encode embedded source.

  .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types
     :name: amdgpu-dwarf-extended-content-types

     ============================  ======================
     Content Type                  Form
     ============================  ======================
     ``DW_LNCT_LLVM_source``       ``DW_FORM_line_strp``
     ============================  ======================

The source field will contain the UTF-8 encoded, null-terminated source text
with ``'\n'`` line endings. When the source field is present, consumers can use
the embedded source instead of attempting to discover the source on disk. When
the source field is absent, consumers can access the file to get the source
text.

The above content type appears in the ``file_name_entry_format`` field of the
line table prologue, and its corresponding value appears in the ``file_names``
field. The current encoding of the content type is documented in table
:ref:`amdgpu-dwarf-extended-content-types-encoding`.

  .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types Encoding
     :name: amdgpu-dwarf-extended-content-types-encoding

     ============================  ====================
     Content Type                  Value
     ============================  ====================
     ``DW_LNCT_LLVM_source``       0x2001
     ============================  ====================
 | 
						||
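A consumer's choice between embedded and on-disk source might look like the
following Python sketch. The ``file_entry`` dictionary and its ``"path"`` and
``"source"`` keys are assumptions made for illustration; they stand in for
whatever structure a DWARF reader produces from the ``file_names`` field and do
not correspond to a real library API.

```python
from pathlib import Path


def source_text(file_entry):
    """Return the source text for a line table file entry.

    `file_entry` is a hypothetical dict with a "path" key and, when the
    DW_LNCT_LLVM_source content type is present, a "source" key holding the
    already decoded text.
    """
    embedded = file_entry.get("source")
    if embedded is not None:
        # Prefer the embedded text over looking for the file on disk.
        return embedded
    return Path(file_entry["path"]).read_text(encoding="utf-8")


entry = {"path": "kernel.cl", "source": "kernel void k() {}\n"}
print(source_text(entry).splitlines()[0])  # kernel void k() {}
```

Testing for ``None`` rather than truthiness keeps an embedded but empty source
from being mistaken for a missing one.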
 | 
						||
.. _amdgpu-code-conventions:

Code Conventions
================

This section provides code conventions used for each supported target triple OS
(see :ref:`amdgpu-target-triples`).

AMDHSA
------

This section provides code conventions used when the target triple OS is
``amdhsa`` (see :ref:`amdgpu-target-triples`).

.. _amdgpu-amdhsa-code-object-target-identification:

Code Object Target Identification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The AMDHSA OS uses the following syntax to specify the code object
target as a single string:

  ``<Architecture>-<Vendor>-<OS>-<Environment>-<Processor><Target Features>``

Where:

  - ``<Architecture>``, ``<Vendor>``, ``<OS>`` and ``<Environment>``
    are the same as the *Target Triple* (see
    :ref:`amdgpu-target-triples`).

  - ``<Processor>`` is the same as the *Processor* (see
    :ref:`amdgpu-processors`).

  - ``<Target Features>`` is a list of the enabled *Target Features*
    (see :ref:`amdgpu-target-features`), each prefixed by a plus, that
    apply to *Processor*. The list must be in the same order as listed
    in the table :ref:`amdgpu-target-feature-table`. Note that *Target
    Features* must be included in the list if they are enabled even if
    that is the default for *Processor*.

For example:

  ``"amdgcn-amd-amdhsa--gfx902+xnack"``
 | 
						||
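The syntax above can be assembled mechanically. The following Python is a
minimal sketch with a hypothetical function name, ``code_object_target``. It
assumes the caller already supplies the enabled features in the order of the
target feature table and does not validate any of the components.

```python
def code_object_target(architecture, vendor, os_name, environment, processor,
                       enabled_features=()):
    """Join the components into the single-string form described above.

    `enabled_features` must already be in target feature table order; this
    sketch neither checks nor reorders it.
    """
    triple = "-".join([architecture, vendor, os_name, environment])
    features = "".join("+" + feature for feature in enabled_features)
    return f"{triple}-{processor}{features}"


print(code_object_target("amdgcn", "amd", "amdhsa", "", "gfx902", ["xnack"]))
# amdgcn-amd-amdhsa--gfx902+xnack
```

An empty environment component yields the double hyphen seen in the example
string.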
 | 
						||
.. _amdgpu-amdhsa-code-object-metadata:

Code Object Metadata
~~~~~~~~~~~~~~~~~~~~

The code object metadata specifies extensible metadata associated with the code
objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
[AMD-ROCm]_. The encoding and semantics of this metadata depend on the code
object version; see :ref:`amdgpu-amdhsa-code-object-metadata-v2` and
:ref:`amdgpu-amdhsa-code-object-metadata-v3`.

Code object metadata is specified in a note record (see
:ref:`amdgpu-note-records`) and is required when the target triple OS is
``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
information necessary to support the ROCm kernel queries, for example, the
segment sizes needed in a dispatch packet. In addition, a high level language
runtime may require other information to be included. For example, the AMD
OpenCL runtime records kernel argument information.

.. _amdgpu-amdhsa-code-object-metadata-v2:

Code Object V2 Metadata (-mattr=-code-object-v3)
++++++++++++++++++++++++++++++++++++++++++++++++

.. warning:: Code Object V2 is not the default code object version emitted by
  this version of LLVM. For a description of the metadata generated with the
  default configuration (Code Object V3) see
  :ref:`amdgpu-amdhsa-code-object-metadata-v3`.

Code object V2 metadata is specified by the ``NT_AMD_AMDGPU_METADATA`` note
record (see :ref:`amdgpu-note-records-v2`).

The metadata is specified as a YAML formatted string (see [YAML]_ and
:doc:`YamlIO`).

.. TODO::

  Is the string null terminated? It probably should not be if YAML allows it to
  contain null characters, otherwise it should be.

The metadata is represented as a single YAML document comprised of the mapping
defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v2` and
referenced tables.

For boolean values, the string values of ``false`` and ``true`` are used for
false and true respectively.

Additional information can be added to the mappings. To avoid conflicts, any
non-AMD key names should be prefixed by "*vendor-name*.".

  .. table:: AMDHSA Code Object V2 Metadata Map
     :name: amdgpu-amdhsa-code-object-metadata-map-table-v2

     ========== ============== ========= =======================================
     String Key Value Type     Required? Description
     ========== ============== ========= =======================================
     "Version"  sequence of    Required  - The first integer is the major
                2 integers                 version. Currently 1.
                                         - The second integer is the minor
                                           version. Currently 0.
     "Printf"   sequence of              Each string is encoded information
                strings                  about a printf function call. The
                                         encoded information is organized as
                                         fields separated by colon (':'):

                                         ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``

                                         where:

                                         ``ID``
                                           A 32-bit integer as a unique id for
                                           each printf function call

                                         ``N``
                                           A 32-bit integer equal to the number
                                           of arguments of the printf function
                                           call minus 1

                                         ``S[i]`` (where i = 0, 1, ... , N-1)
                                           32-bit integers for the size in bytes
                                           of the i-th FormatString argument of
                                           the printf function call

                                         ``FormatString``
                                           The format string passed to the
                                           printf function call.
     "Kernels"  sequence of    Required  Sequence of the mappings for each
                mapping                  kernel in the code object. See
                                         :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v2`
                                         for the definition of the mapping.
     ========== ============== ========= =======================================
 | 
						||
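The colon-separated ``"Printf"`` encoding in the table above can be decoded
along the following lines. This is a sketch under stated assumptions:
``parse_printf_entry`` is a hypothetical helper, not part of the ROCm or LLVM
code bases, and it assumes that only the format string (the final field) may
itself contain a colon.

```python
def parse_printf_entry(entry):
    """Split one "Printf" string into (id, arg_sizes, format_string)."""
    printf_id, count, rest = entry.split(":", 2)
    n = int(count)
    # The format string may itself contain ':', so split off exactly N sizes
    # and leave the remainder untouched.
    fields = rest.split(":", n)
    if len(fields) != n + 1:
        raise ValueError(f"expected {n} sizes and a format string: {entry!r}")
    sizes = [int(size) for size in fields[:n]]
    return int(printf_id), sizes, fields[n]


print(parse_printf_entry("1:2:4:8:x=%d: y=%ld"))
# (1, [4, 8], 'x=%d: y=%ld')
```

Bounding the number of splits by ``N`` is what keeps a colon inside the format
string from being treated as a field separator.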
 | 
						||
..

  .. table:: AMDHSA Code Object V2 Kernel Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v2

     ================= ============== ========= ================================
     String Key        Value Type     Required? Description
     ================= ============== ========= ================================
     "Name"            string         Required  Source name of the kernel.
     "SymbolName"      string         Required  Name of the kernel
                                                descriptor ELF symbol.
     "Language"        string                   Source language of the kernel.
                                                Values include:

                                                - "OpenCL C"
                                                - "OpenCL C++"
                                                - "HCC"
                                                - "OpenMP"

     "LanguageVersion" sequence of              - The first integer is the major
                       2 integers                 version.
                                                - The second integer is the
                                                  minor version.
     "Attrs"           mapping                  Mapping of kernel attributes.
                                                See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-table-v2`
                                                for the mapping definition.
     "Args"            sequence of              Sequence of mappings of the
                       mapping                  kernel arguments. See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v2`
                                                for the definition of the mapping.
     "CodeProps"       mapping                  Mapping of properties related to
                                                the kernel code. See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-table-v2`
                                                for the mapping definition.
     ================= ============== ========= ================================
 | 
						||
 | 
						||
..

  .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-table-v2

     =================== ============== ========= ==============================
     String Key          Value Type     Required? Description
     =================== ============== ========= ==============================
     "ReqdWorkGroupSize" sequence of              If not 0, 0, 0 then all values
                         3 integers               must be >=1 and the dispatch
                                                  work-group size X, Y, Z must
                                                  correspond to the specified
                                                  values. Defaults to 0, 0, 0.

                                                  Corresponds to the OpenCL
                                                  ``reqd_work_group_size``
                                                  attribute.
     "WorkGroupSizeHint" sequence of              The dispatch work-group size
                         3 integers               X, Y, Z is likely to be the
                                                  specified values.

                                                  Corresponds to the OpenCL
                                                  ``work_group_size_hint``
                                                  attribute.
     "VecTypeHint"       string                   The name of a scalar or vector
                                                  type.

                                                  Corresponds to the OpenCL
                                                  ``vec_type_hint`` attribute.

     "RuntimeHandle"     string                   The external symbol name
                                                  associated with a kernel.
                                                  OpenCL runtime allocates a
                                                  global buffer for the symbol
                                                  and saves the kernel's address
                                                  to it, which is used for
                                                  device side enqueueing. Only
                                                  available for device side
                                                  enqueued kernels.
     =================== ============== ========= ==============================

..

  .. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v2

     ================= ============== ========= ================================
     String Key        Value Type     Required? Description
     ================= ============== ========= ================================
     "Name"            string                   Kernel argument name.
     "TypeName"        string                   Kernel argument type name.
     "Size"            integer        Required  Kernel argument size in bytes.
     "Align"           integer        Required  Kernel argument alignment in
                                                bytes. Must be a power of two.
     "ValueKind"       string         Required  Kernel argument kind that
                                                specifies how to set up the
                                                corresponding argument.
                                                Values include:

                                                "ByValue"
                                                  The argument is copied
                                                  directly into the kernarg.

                                                "GlobalBuffer"
                                                  A global address space pointer
                                                  to the buffer data is passed
                                                  in the kernarg.

                                                "DynamicSharedPointer"
                                                  A group address space pointer
                                                  to dynamically allocated LDS
                                                  is passed in the kernarg.

                                                "Sampler"
                                                  A global address space
                                                  pointer to a S# is passed in
                                                  the kernarg.

                                                "Image"
                                                  A global address space
                                                  pointer to a T# is passed in
                                                  the kernarg.

                                                "Pipe"
                                                  A global address space pointer
                                                  to an OpenCL pipe is passed in
                                                  the kernarg.

                                                "Queue"
                                                  A global address space pointer
                                                  to an OpenCL device enqueue
                                                  queue is passed in the
                                                  kernarg.

                                                "HiddenGlobalOffsetX"
                                                  The OpenCL grid dispatch
                                                  global offset for the X
                                                  dimension is passed in the
                                                  kernarg.

                                                "HiddenGlobalOffsetY"
                                                  The OpenCL grid dispatch
                                                  global offset for the Y
                                                  dimension is passed in the
                                                  kernarg.

                                                "HiddenGlobalOffsetZ"
                                                  The OpenCL grid dispatch
                                                  global offset for the Z
                                                  dimension is passed in the
                                                  kernarg.

                                                "HiddenNone"
                                                  An argument that is not used
                                                  by the kernel. Space needs to
                                                  be left for it, but it does
                                                  not need to be set up.

                                                "HiddenPrintfBuffer"
                                                  A global address space pointer
                                                  to the runtime printf buffer
                                                  is passed in the kernarg.

                                                "HiddenHostcallBuffer"
                                                  A global address space pointer
                                                  to the runtime hostcall buffer
                                                  is passed in the kernarg.

                                                "HiddenDefaultQueue"
                                                  A global address space pointer
                                                  to the OpenCL device enqueue
                                                  queue that should be used by
                                                  the kernel by default is
                                                  passed in the kernarg.

                                                "HiddenCompletionAction"
                                                  A global address space pointer
                                                  to help link enqueued kernels into
                                                  the ancestor tree for determining
                                                  when the parent kernel has finished.

                                                "HiddenMultiGridSyncArg"
                                                  A global address space pointer for
                                                  multi-grid synchronization is
                                                  passed in the kernarg.

     "ValueType"       string         Required  Kernel argument value type. Only
                                                present if "ValueKind" is
                                                "ByValue". For vector data
                                                types, the value is for the
                                                element type. Values include:

                                                - "Struct"
                                                - "I8"
                                                - "U8"
                                                - "I16"
                                                - "U16"
                                                - "F16"
                                                - "I32"
                                                - "U32"
                                                - "F32"
                                                - "I64"
                                                - "U64"
                                                - "F64"

                                                .. TODO::
                                                   How can it be determined if a
                                                   vector type, and what size
                                                   vector?
     "PointeeAlign"    integer                  Alignment in bytes of pointee
                                                type for pointer type kernel
                                                argument. Must be a power
                                                of 2. Only present if
                                                "ValueKind" is
                                                "DynamicSharedPointer".
     "AddrSpaceQual"   string                   Kernel argument address space
                                                qualifier. Only present if
                                                "ValueKind" is "GlobalBuffer" or
                                                "DynamicSharedPointer". Values
                                                are:

                                                - "Private"
                                                - "Global"
                                                - "Constant"
                                                - "Local"
                                                - "Generic"
                                                - "Region"

                                                .. TODO::
                                                   Is GlobalBuffer only Global
                                                   or Constant? Is
                                                   DynamicSharedPointer always
                                                   Local? Can HCC allow Generic?
                                                   How can Private or Region
                                                   ever happen?
     "AccQual"         string                   Kernel argument access
                                                qualifier. Only present if
                                                "ValueKind" is "Image" or
                                                "Pipe". Values
                                                are:

                                                - "ReadOnly"
                                                - "WriteOnly"
                                                - "ReadWrite"

                                                .. TODO::
                                                   Does this apply to
                                                   GlobalBuffer?
     "ActualAccQual"   string                   The actual memory accesses
                                                performed by the kernel on the
                                                kernel argument. Only present if
                                                "ValueKind" is "GlobalBuffer",
                                                "Image", or "Pipe". This may be
                                                more restrictive than indicated
                                                by "AccQual" to reflect what the
                                                kernel actually does. If not
                                                present then the runtime must
                                                assume what is implied by
                                                "AccQual" and "IsConst". Values
                                                are:

                                                - "ReadOnly"
                                                - "WriteOnly"
                                                - "ReadWrite"

     "IsConst"         boolean                  Indicates if the kernel argument
                                                is const qualified. Only present
                                                if "ValueKind" is
                                                "GlobalBuffer".

     "IsRestrict"      boolean                  Indicates if the kernel argument
                                                is restrict qualified. Only
                                                present if "ValueKind" is
                                                "GlobalBuffer".

     "IsVolatile"      boolean                  Indicates if the kernel argument
                                                is volatile qualified. Only
                                                present if "ValueKind" is
                                                "GlobalBuffer".

     "IsPipe"          boolean                  Indicates if the kernel argument
                                                is pipe qualified. Only present
                                                if "ValueKind" is "Pipe".

                                                .. TODO::
                                                   Can GlobalBuffer be pipe
                                                   qualified?
     ================= ============== ========= ================================

..
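
The constraints in the kernel argument metadata map above can be checked
mechanically once the YAML has been loaded into a dictionary. The following is
only a minimal sketch: ``check_v2_kernel_arg`` and ``is_power_of_two`` are
hypothetical helper names that do not exist in LLVM, and the sketch covers just
the "Required" column and the "Only present if" rules stated in the table.

```python
# Hypothetical helpers for illustration only; they are not part of LLVM.
REQUIRED_KEYS = ("Size", "Align", "ValueKind")

# Optional keys and the "ValueKind" values for which the table allows them.
ONLY_PRESENT_IF = {
    "ValueType": ("ByValue",),
    "PointeeAlign": ("DynamicSharedPointer",),
    "AddrSpaceQual": ("GlobalBuffer", "DynamicSharedPointer"),
    "AccQual": ("Image", "Pipe"),
    "ActualAccQual": ("GlobalBuffer", "Image", "Pipe"),
    "IsConst": ("GlobalBuffer",),
    "IsRestrict": ("GlobalBuffer",),
    "IsVolatile": ("GlobalBuffer",),
    "IsPipe": ("Pipe",),
}


def is_power_of_two(n):
    """True for 1, 2, 4, 8, ..."""
    return isinstance(n, int) and n > 0 and (n & (n - 1)) == 0


def check_v2_kernel_arg(arg):
    """Return a list of problems found in one entry of the "Args" sequence."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in arg:
            problems.append(f'missing required key "{key}"')
    for key in ("Align", "PointeeAlign"):
        if key in arg and not is_power_of_two(arg[key]):
            problems.append(f'"{key}" must be a power of two')
    kind = arg.get("ValueKind")
    for key, kinds in ONLY_PRESENT_IF.items():
        if key in arg and kind not in kinds:
            problems.append(f'"{key}" is not expected for "ValueKind" {kind!r}')
    return problems
```

An argument such as ``{"Size": 8, "Align": 8, "ValueKind": "GlobalBuffer",
"AddrSpaceQual": "Global"}`` passes these checks, while an "Align" of 3 is
reported as a problem.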

  .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-table-v2

     ============================ ============== ========= =====================
     String Key                   Value Type     Required? Description
     ============================ ============== ========= =====================
     "KernargSegmentSize"         integer        Required  The size in bytes of
                                                           the kernarg segment
                                                           that holds the values
                                                           of the arguments to
                                                           the kernel.
     "GroupSegmentFixedSize"      integer        Required  The amount of group
                                                           segment memory
                                                           required by a
                                                           work-group in
                                                           bytes. This does not
                                                           include any
                                                           dynamically allocated
                                                           group segment memory
                                                           that may be added
                                                           when the kernel is
                                                           dispatched.
     "PrivateSegmentFixedSize"    integer        Required  The amount of fixed
                                                           private address space
                                                           memory required for a
                                                           work-item in
                                                           bytes. If the kernel
                                                           uses a dynamic call
                                                           stack then additional
                                                           space must be added
                                                           to this value for the
                                                           call stack.
     "KernargSegmentAlign"        integer        Required  The maximum byte
                                                           alignment of
                                                           arguments in the
                                                           kernarg segment. Must
                                                           be a power of 2.
     "WavefrontSize"              integer        Required  Wavefront size. Must
                                                           be a power of 2.
     "NumSGPRs"                   integer        Required  Number of scalar
                                                           registers used by a
                                                           wavefront for
                                                           GFX6-GFX10. This
                                                           includes the special
                                                           SGPRs for VCC, Flat
                                                           Scratch (GFX7-GFX10)
                                                           and XNACK (for
                                                           GFX8-GFX10). It does
                                                           not include the 16
                                                           SGPR added if a trap
                                                           handler is
                                                           enabled. It is not
                                                           rounded up to the
                                                           allocation
                                                           granularity.
     "NumVGPRs"                   integer        Required  Number of vector
                                                           registers used by
                                                           each work-item for
                                                           GFX6-GFX10.
     "MaxFlatWorkGroupSize"       integer        Required  Maximum flat
                                                           work-group size
                                                           supported by the
                                                           kernel in work-items.
                                                           Must be >=1 and
                                                           consistent with
                                                           ReqdWorkGroupSize if
                                                           not 0, 0, 0.
     "NumSpilledSGPRs"            integer                  Number of stores from
                                                           a scalar register to
                                                           a register allocator
                                                           created spill
                                                           location.
     "NumSpilledVGPRs"            integer                  Number of stores from
                                                           a vector register to
                                                           a register allocator
                                                           created spill
                                                           location.
     ============================ ============== ========= =====================
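
As an illustration of the code properties constraints above, the sketch below
checks a "CodeProps" mapping that has already been loaded into a dictionary.
The function name ``check_v2_code_props`` is hypothetical. The table does not
define what "consistent with ReqdWorkGroupSize" means, so the sketch assumes
the product of the three required sizes must not exceed
"MaxFlatWorkGroupSize".

```python
# Hypothetical checker for illustration only; it is not part of LLVM.
REQUIRED_CODE_PROPS = (
    "KernargSegmentSize",
    "GroupSegmentFixedSize",
    "PrivateSegmentFixedSize",
    "KernargSegmentAlign",
    "WavefrontSize",
    "NumSGPRs",
    "NumVGPRs",
    "MaxFlatWorkGroupSize",
)


def is_power_of_two(n):
    """True for 1, 2, 4, 8, ..."""
    return n > 0 and (n & (n - 1)) == 0


def check_v2_code_props(props, reqd_work_group_size=(0, 0, 0)):
    """Return a list of problems found in a "CodeProps" mapping."""
    problems = [
        f'missing required key "{key}"'
        for key in REQUIRED_CODE_PROPS
        if key not in props
    ]
    for key in ("KernargSegmentAlign", "WavefrontSize"):
        if key in props and not is_power_of_two(props[key]):
            problems.append(f'"{key}" must be a power of 2')
    max_flat = props.get("MaxFlatWorkGroupSize")
    if max_flat is not None:
        if max_flat < 1:
            problems.append('"MaxFlatWorkGroupSize" must be >= 1')
        x, y, z = reqd_work_group_size
        # Assumption: "consistent" means the required work-group fits
        # within the maximum flat work-group size.
        if (x, y, z) != (0, 0, 0) and x * y * z > max_flat:
            problems.append('"ReqdWorkGroupSize" exceeds "MaxFlatWorkGroupSize"')
    return problems
```

With "MaxFlatWorkGroupSize" set to 256, a required work-group size of
16, 16, 1 is accepted under this assumption and 16, 16, 2 is reported.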

.. _amdgpu-amdhsa-code-object-metadata-v3:

Code Object V3 Metadata (-mattr=+code-object-v3)
++++++++++++++++++++++++++++++++++++++++++++++++

Code object V3 metadata is specified by the ``NT_AMDGPU_METADATA`` note record
(see :ref:`amdgpu-note-records-v3`).

The metadata is represented as Message Pack formatted binary data (see
[MsgPack]_). The top level is a Message Pack map that includes the
keys defined in table
:ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced
tables.

Additional information can be added to the maps. To avoid conflicts,
any key names should be prefixed by "*vendor-name*." where
``vendor-name`` can be the name of the vendor and specific vendor
tool that generates the information. The prefix is abbreviated to
simply "." when it appears within a map that has been added by the
same *vendor-name*.
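
The prefix rule can be sketched as a small helper. ``metadata_key`` is a
hypothetical name used only for illustration; the two results match the key
spellings used by the V3 tables, where top-level keys carry the ``amdhsa.``
prefix and keys inside the maps it adds start with a bare ".".

```python
def metadata_key(vendor, name, inside_vendor_map=False):
    """Build a metadata key following the vendor prefix convention.

    Hypothetical helper, not part of LLVM. At the top level the key is
    "<vendor>.<name>". Inside a map that the same vendor added, the
    prefix is abbreviated to ".".
    """
    prefix = "." if inside_vendor_map else vendor + "."
    return prefix + name


top_level = metadata_key("amdhsa", "kernels")
nested = metadata_key("amdhsa", "name", inside_vendor_map=True)
print(top_level)  # amdhsa.kernels
print(nested)     # .name
```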

  .. table:: AMDHSA Code Object V3 Metadata Map
     :name: amdgpu-amdhsa-code-object-metadata-map-table-v3

     ================= ============== ========= =======================================
     String Key        Value Type     Required? Description
     ================= ============== ========= =======================================
     "amdhsa.version"  sequence of    Required  - The first integer is the major
                       2 integers                 version. Currently 1.
                                                - The second integer is the minor
                                                  version. Currently 0.
     "amdhsa.printf"   sequence of              Each string is encoded information
                       strings                  about a printf function call. The
                                                encoded information is organized as
                                                fields separated by colon (':'):

                                                ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``

                                                where:

                                                ``ID``
                                                  A 32-bit integer as a unique id for
                                                  each printf function call

                                                ``N``
                                                  A 32-bit integer equal to the number
                                                  of arguments of printf function call
                                                  minus 1

                                                ``S[i]`` (where i = 0, 1, ... , N-1)
                                                  32-bit integers for the size in bytes
                                                  of the i-th FormatString argument of
                                                  the printf function call

                                                FormatString
                                                  The format string passed to the
                                                  printf function call.
     "amdhsa.kernels"  sequence of    Required  Sequence of the maps for each
                       map                      kernel in the code object. See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3`
                                                for the definition of the keys included
                                                in that map.
     ================= ============== ========= =======================================

..
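
The ``amdhsa.printf`` encoding described above can be decoded with two bounded
splits. This is a minimal sketch: ``parse_printf_entry`` is a hypothetical
name, and the sketch assumes the format string is stored unescaped, so any
colon after the last size field belongs to the format string.

```python
def parse_printf_entry(entry):
    """Split "ID:N:S[0]:...:S[N-1]:FormatString" into its fields.

    Hypothetical helper, not part of LLVM. Returns a tuple
    (printf_id, sizes, format_string).
    """
    id_text, n_text, rest = entry.split(":", 2)
    n = int(n_text)
    # Split off exactly N size fields; the remainder is the format
    # string, which may itself contain colons.
    fields = rest.split(":", n)
    if len(fields) != n + 1:
        raise ValueError(f"expected {n} size fields in {entry!r}")
    sizes = [int(size) for size in fields[:n]]
    return int(id_text), sizes, fields[n]


printf_id, sizes, fmt = parse_printf_entry("1:2:4:8:x=%d y=%ld: done")
print(printf_id, sizes, fmt)
```

Bounding the second split by ``N`` is what keeps a colon inside the format
string from being mistaken for a field separator.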

  .. table:: AMDHSA Code Object V3 Kernel Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3

     =================================== ============== ========= ================================
     String Key                          Value Type     Required? Description
     =================================== ============== ========= ================================
     ".name"                             string         Required  Source name of the kernel.
     ".symbol"                           string         Required  Name of the kernel
                                                                  descriptor ELF symbol.
     ".language"                         string                   Source language of the kernel.
                                                                  Values include:

                                                                  - "OpenCL C"
                                                                  - "OpenCL C++"
                                                                  - "HCC"
                                                                  - "HIP"
                                                                  - "OpenMP"
                                                                  - "Assembler"

     ".language_version"                 sequence of              - The first integer is the major
                                         2 integers                 version.
                                                                  - The second integer is the
                                                                    minor version.
     ".args"                             sequence of              Sequence of maps of the
                                         map                      kernel arguments. See
                                                                  :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`
                                                                  for the definition of the keys
                                                                  included in that map.
     ".reqd_workgroup_size"              sequence of              If not 0, 0, 0 then all values
                                         3 integers               must be >=1 and the dispatch
                                                                  work-group size X, Y, Z must
                                                                  correspond to the specified
                                                                  values. Defaults to 0, 0, 0.

                                                                  Corresponds to the OpenCL
                                                                  ``reqd_work_group_size``
                                                                  attribute.
     ".workgroup_size_hint"              sequence of              The dispatch work-group size
                                         3 integers               X, Y, Z is likely to be the
                                                                  specified values.

                                                                  Corresponds to the OpenCL
                                                                  ``work_group_size_hint``
                                                                  attribute.
     ".vec_type_hint"                    string                   The name of a scalar or vector
                                                                  type.

                                                                  Corresponds to the OpenCL
                                                                  ``vec_type_hint`` attribute.

     ".device_enqueue_symbol"            string                   The external symbol name
                                                                  associated with a kernel.
                                                                  OpenCL runtime allocates a
                                                                  global buffer for the symbol
                                                                  and saves the kernel's address
                                                                  to it, which is used for
 | 
						||
                                                                  device side enqueueing. Only
 | 
						||
                                                                  available for device side
 | 
						||
                                                                  enqueued kernels.
 | 
						||
     ".kernarg_segment_size"             integer        Required  The size in bytes of
 | 
						||
                                                                  the kernarg segment
 | 
						||
                                                                  that holds the values
 | 
						||
                                                                  of the arguments to
 | 
						||
                                                                  the kernel.
 | 
						||
     ".group_segment_fixed_size"         integer        Required  The amount of group
 | 
						||
                                                                  segment memory
 | 
						||
                                                                  required by a
 | 
						||
                                                                  work-group in
 | 
						||
                                                                  bytes. This does not
 | 
						||
                                                                  include any
 | 
						||
                                                                  dynamically allocated
 | 
						||
                                                                  group segment memory
 | 
						||
                                                                  that may be added
 | 
						||
                                                                  when the kernel is
 | 
						||
                                                                  dispatched.
 | 
						||
     ".private_segment_fixed_size"       integer        Required  The amount of fixed
 | 
						||
                                                                  private address space
 | 
						||
                                                                  memory required for a
 | 
						||
                                                                  work-item in
 | 
						||
                                                                  bytes. If the kernel
 | 
						||
                                                                  uses a dynamic call
 | 
						||
                                                                  stack then additional
 | 
						||
                                                                  space must be added
 | 
						||
                                                                  to this value for the
 | 
						||
                                                                  call stack.
 | 
						||
     ".kernarg_segment_align"            integer        Required  The maximum byte
 | 
						||
                                                                  alignment of
 | 
						||
                                                                  arguments in the
 | 
						||
                                                                  kernarg segment. Must
 | 
						||
                                                                  be a power of 2.
 | 
						||
     ".wavefront_size"                   integer        Required  Wavefront size. Must
 | 
						||
                                                                  be a power of 2.
 | 
						||
     ".sgpr_count"                       integer        Required  Number of scalar
 | 
						||
                                                                  registers required by a
 | 
						||
                                                                  wavefront for
 | 
						||
                                                                  GFX6-GFX9. A register
 | 
						||
                                                                  is required if it is
 | 
						||
                                                                  used explicitly, or
 | 
						||
                                                                  if a higher numbered
 | 
						||
                                                                  register is used
 | 
						||
                                                                  explicitly. This
 | 
						||
                                                                  includes the special
 | 
						||
                                                                  SGPRs for VCC, Flat
 | 
						||
                                                                  Scratch (GFX7-GFX9)
 | 
						||
                                                                  and XNACK (for
 | 
						||
                                                                  GFX8-GFX9). It does
 | 
						||
                                                                  not include the 16
 | 
						||
                                                                  SGPR added if a trap
 | 
						||
                                                                  handler is
 | 
						||
                                                                  enabled. It is not
 | 
						||
                                                                  rounded up to the
 | 
						||
                                                                  allocation
 | 
						||
                                                                  granularity.
 | 
						||
     ".vgpr_count"                       integer        Required  Number of vector
 | 
						||
                                                                  registers required by
 | 
						||
                                                                  each work-item for
 | 
						||
                                                                  GFX6-GFX9. A register
 | 
						||
                                                                  is required if it is
 | 
						||
                                                                  used explicitly, or
 | 
						||
                                                                  if a higher numbered
 | 
						||
                                                                  register is used
 | 
						||
                                                                  explicitly.
 | 
						||
     ".max_flat_workgroup_size"          integer        Required  Maximum flat
 | 
						||
                                                                  work-group size
 | 
						||
                                                                  supported by the
 | 
						||
                                                                  kernel in work-items.
 | 
						||
                                                                  Must be >=1 and
 | 
						||
                                                                  consistent with
 | 
						||
                                                                  ReqdWorkGroupSize if
 | 
						||
                                                                  not 0, 0, 0.
 | 
						||
     ".sgpr_spill_count"                 integer                  Number of stores from
 | 
						||
                                                                  a scalar register to
 | 
						||
                                                                  a register allocator
 | 
						||
                                                                  created spill
 | 
						||
                                                                  location.
 | 
						||
     ".vgpr_spill_count"                 integer                  Number of stores from
 | 
						||
                                                                  a vector register to
 | 
						||
                                                                  a register allocator
 | 
						||
                                                                  created spill
 | 
						||
                                                                  location.
 | 
						||
     =================================== ============== ========= ================================
 | 
						||
 | 
						||
..
 | 
						||
 | 
						||
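
As a rough illustration of the constraints listed in the table above, the
following sketch checks a kernel metadata map that has already been decoded
into a Python dictionary. The helper ``check_kernel_metadata`` and the sample
values are hypothetical and not part of any LLVM or ROCm API. The list of
required keys covers only the rows marked Required in the part of the table
shown here, and treating "consistent with" as "the product of the required
work-group dimensions does not exceed the maximum flat work-group size" is an
assumption.

.. code-block:: python

   # Keys marked "Required" in the visible part of the table (assumed subset).
   REQUIRED_KEYS = (
       ".kernarg_segment_size",
       ".group_segment_fixed_size",
       ".private_segment_fixed_size",
       ".kernarg_segment_align",
       ".wavefront_size",
       ".sgpr_count",
       ".vgpr_count",
       ".max_flat_workgroup_size",
   )


   def is_power_of_two(value):
       return value > 0 and (value & (value - 1)) == 0


   def check_kernel_metadata(kernel):
       """Return a list of problems found in a decoded kernel metadata map."""
       problems = [f"missing required key {key}"
                   for key in REQUIRED_KEYS if key not in kernel]

       # Both of these values must be a power of 2 according to the table.
       for key in (".kernarg_segment_align", ".wavefront_size"):
           if key in kernel and not is_power_of_two(kernel[key]):
               problems.append(f"{key} must be a power of 2")

       max_flat = kernel.get(".max_flat_workgroup_size")
       if max_flat is not None and max_flat < 1:
           problems.append(".max_flat_workgroup_size must be >= 1")

       # 0, 0, 0 is the default and means "no required work-group size".
       reqd = list(kernel.get(".reqd_workgroup_size", [0, 0, 0]))
       if reqd != [0, 0, 0]:
           if any(dim < 1 for dim in reqd):
               problems.append(".reqd_workgroup_size values must be >= 1")
           elif max_flat is not None and reqd[0] * reqd[1] * reqd[2] > max_flat:
               problems.append(
                   ".reqd_workgroup_size exceeds .max_flat_workgroup_size")
       return problems

A map that satisfies every check yields an empty list, so a caller can treat
any non-empty result as a reason to reject or report the kernel.
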
  .. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3

     ====================== ============== ========= ================================
     String Key             Value Type     Required? Description
     ====================== ============== ========= ================================
     ".name"                string                   Kernel argument name.
     ".type_name"           string                   Kernel argument type name.
     ".size"                integer        Required  Kernel argument size in bytes.
     ".offset"              integer        Required  Kernel argument offset in
                                                     bytes. The offset must be a
                                                     multiple of the alignment
                                                     required by the argument.
     ".value_kind"          string         Required  Kernel argument kind that
                                                     specifies how to set up the
                                                     corresponding argument.
                                                     Values include:

                                                     "by_value"
                                                       The argument is copied
                                                       directly into the kernarg.

                                                     "global_buffer"
                                                       A global address space pointer
                                                       to the buffer data is passed
                                                       in the kernarg.

                                                     "dynamic_shared_pointer"
                                                       A group address space pointer
                                                       to dynamically allocated LDS
                                                       is passed in the kernarg.

                                                     "sampler"
                                                       A global address space
                                                       pointer to a S# is passed in
                                                       the kernarg.

                                                     "image"
                                                       A global address space
                                                       pointer to a T# is passed in
                                                       the kernarg.

                                                     "pipe"
                                                       A global address space pointer
                                                       to an OpenCL pipe is passed in
                                                       the kernarg.

                                                     "queue"
                                                       A global address space pointer
                                                       to an OpenCL device enqueue
                                                       queue is passed in the
                                                       kernarg.

                                                     "hidden_global_offset_x"
                                                       The OpenCL grid dispatch
                                                       global offset for the X
                                                       dimension is passed in the
                                                       kernarg.

                                                     "hidden_global_offset_y"
                                                       The OpenCL grid dispatch
                                                       global offset for the Y
                                                       dimension is passed in the
                                                       kernarg.

                                                     "hidden_global_offset_z"
                                                       The OpenCL grid dispatch
                                                       global offset for the Z
                                                       dimension is passed in the
                                                       kernarg.

                                                     "hidden_none"
                                                       An argument that is not used
                                                       by the kernel. Space needs to
                                                       be left for it, but it does
                                                       not need to be set up.

                                                     "hidden_printf_buffer"
                                                       A global address space pointer
                                                       to the runtime printf buffer
                                                       is passed in kernarg.

                                                     "hidden_hostcall_buffer"
                                                       A global address space pointer
                                                       to the runtime hostcall buffer
                                                       is passed in kernarg.

                                                     "hidden_default_queue"
                                                       A global address space pointer
                                                       to the OpenCL device enqueue
                                                       queue that should be used by
                                                       the kernel by default is
                                                       passed in the kernarg.

                                                     "hidden_completion_action"
                                                       A global address space pointer
                                                       to help link enqueued kernels into
                                                       the ancestor tree for determining
                                                       when the parent kernel has finished.

                                                     "hidden_multigrid_sync_arg"
                                                       A global address space pointer for
                                                       multi-grid synchronization is
                                                       passed in the kernarg.

     ".value_type"          string         Required  Kernel argument value type. Only
                                                     present if ".value_kind" is
                                                     "by_value". For vector data
                                                     types, the value is for the
                                                     element type. Values include:

                                                     - "struct"
                                                     - "i8"
                                                     - "u8"
                                                     - "i16"
                                                     - "u16"
                                                     - "f16"
                                                     - "i32"
                                                     - "u32"
                                                     - "f32"
                                                     - "i64"
                                                     - "u64"
                                                     - "f64"

                                                     .. TODO::
                                                        How can it be determined if a
                                                        vector type, and what size
                                                        vector?
     ".pointee_align"       integer                  Alignment in bytes of pointee
                                                     type for pointer type kernel
                                                     argument. Must be a power
                                                     of 2. Only present if
                                                     ".value_kind" is
                                                     "dynamic_shared_pointer".
     ".address_space"       string                   Kernel argument address space
                                                     qualifier. Only present if
                                                     ".value_kind" is "global_buffer" or
                                                     "dynamic_shared_pointer". Values
                                                     are:

                                                     - "private"
                                                     - "global"
                                                     - "constant"
                                                     - "local"
                                                     - "generic"
                                                     - "region"

                                                     .. TODO::
                                                        Is "global_buffer" only "global"
                                                        or "constant"? Is
                                                        "dynamic_shared_pointer" always
                                                        "local"? Can HCC allow "generic"?
                                                        How can "private" or "region"
                                                        ever happen?
     ".access"              string                   Kernel argument access
                                                     qualifier. Only present if
                                                     ".value_kind" is "image" or
                                                     "pipe". Values
                                                     are:

                                                     - "read_only"
                                                     - "write_only"
                                                     - "read_write"

                                                     .. TODO::
                                                        Does this apply to
                                                        "global_buffer"?
     ".actual_access"       string                   The actual memory accesses
                                                     performed by the kernel on the
                                                     kernel argument. Only present if
                                                     ".value_kind" is "global_buffer",
                                                     "image", or "pipe". This may be
                                                     more restrictive than indicated
                                                     by ".access" to reflect what the
                                                     kernel actually does. If not
                                                     present then the runtime must
                                                     assume what is implied by
                                                     ".access" and ".is_const".
                                                     Values are:

                                                     - "read_only"
                                                     - "write_only"
                                                     - "read_write"

     ".is_const"            boolean                  Indicates if the kernel argument
                                                     is const qualified. Only present
                                                     if ".value_kind" is
                                                     "global_buffer".

     ".is_restrict"         boolean                  Indicates if the kernel argument
                                                     is restrict qualified. Only
                                                     present if ".value_kind" is
                                                     "global_buffer".

     ".is_volatile"         boolean                  Indicates if the kernel argument
                                                     is volatile qualified. Only
                                                     present if ".value_kind" is
                                                     "global_buffer".

     ".is_pipe"             boolean                  Indicates if the kernel argument
                                                     is pipe qualified. Only present
                                                     if ".value_kind" is "pipe".

                                                     .. TODO::
                                                        Can "global_buffer" be pipe
                                                        qualified?
     ====================== ============== ========= ================================

..

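
To make the role of ".offset", ".size", ".value_kind" and ".value_type" more
concrete, the sketch below builds the bytes of a kernarg segment from a list of
decoded argument maps. It is only an illustration: ``pack_kernarg`` and
``VALUE_TYPE_FORMATS`` are hypothetical names, only scalar "by_value" and
"global_buffer" arguments are handled, and it assumes little-endian encoding
with 64-bit global pointers. All other value kinds are left as zero bytes.

.. code-block:: python

   import struct

   # struct format characters for the scalar ".value_type" names in the table.
   VALUE_TYPE_FORMATS = {
       "i8": "b", "u8": "B", "i16": "h", "u16": "H", "f16": "e",
       "i32": "i", "u32": "I", "f32": "f",
       "i64": "q", "u64": "Q", "f64": "d",
   }


   def pack_kernarg(args, values, kernarg_segment_size):
       """Return a kernarg segment image for the given argument maps.

       ``values`` maps each ".name" to a number ("by_value") or to a
       64-bit address ("global_buffer").
       """
       buffer = bytearray(kernarg_segment_size)
       for arg in args:
           kind = arg[".value_kind"]
           if kind == "by_value":
               fmt = "<" + VALUE_TYPE_FORMATS[arg[".value_type"]]
           elif kind == "global_buffer":
               fmt = "<Q"  # assumption: global pointers are 64 bits wide
           else:
               continue  # not handled in this sketch; bytes stay zero
           if struct.calcsize(fmt) != arg[".size"]:
               raise ValueError(f"unsupported size for {arg['.name']}")
           # ".offset" says where in the segment the argument lives.
           struct.pack_into(fmt, buffer, arg[".offset"], values[arg[".name"]])
       return bytes(buffer)

For example, an ``i32`` argument at offset 0 followed by a global buffer
pointer at offset 8 occupies a 16 byte segment, with bytes 4 to 7 left as
padding so that the pointer stays 8 byte aligned.
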
Kernel Dispatch
~~~~~~~~~~~~~~~

The HSA architected queuing language (AQL) defines a user-space memory
interface that can be used to control the dispatch of kernels in an
agent-independent way. An agent can have zero or more AQL queues created for it
using the ROCm runtime, in which AQL packets (all of which are 64 bytes) can be
placed. See the *HSA Platform System Architecture Specification* [HSA]_ for the
AQL queue mechanics and packet layouts.

The packet processor of a kernel agent is responsible for detecting and
dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
packet processor is implemented by the hardware command processor (CP),
asynchronous dispatch controller (ADC) and shader processor input controller
(SPI).

The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
mode driver to initialize and register the AQL queue with CP.

To dispatch a kernel the following actions are performed. This can occur in the
CPU host program, or from an HSA kernel executing on a GPU.

1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
   executed is obtained.
2. A pointer to the kernel descriptor (see
   :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is obtained.
   It must be for a kernel that is contained in a code object that was
   loaded by the ROCm runtime on the kernel agent with which the AQL queue is
   associated.
3. Space is allocated for the kernel arguments using the ROCm runtime allocator
   for a memory region with the kernarg property for the kernel agent that will
   execute the kernel. It must be at least 16 byte aligned.
4. Kernel argument values are assigned to the kernel argument memory
   allocation. The layout is defined in the *HSA Programmer's Language
   Reference* [HSA]_. For AMDGPU the kernel execution directly accesses the
   kernel argument memory in the same way constant memory is accessed. (Note
   that the HSA specification allows an implementation to copy the kernel
   argument contents to another location that is accessed by the kernel.)
5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
   API uses 64-bit atomic operations to reserve space in the AQL queue for the
   packet. The packet must be set up, and the final write must use an atomic
   store release to set the packet kind to ensure the packet contents are
   visible to the kernel agent. AQL defines a doorbell signal mechanism to
   notify the kernel agent that the AQL queue has been updated. These rules, and
   the layout of the AQL queue and kernel dispatch packet, are defined in the
   *HSA System Architecture Specification* [HSA]_.
6. A kernel dispatch packet includes information about the actual dispatch,
   such as grid and work-group size, together with information from the code
   object about the kernel, such as segment sizes. The ROCm runtime queries on
   the kernel symbol can be used to obtain the code object values which are
   recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
7. CP executes micro-code and is responsible for detecting and setting up the
   GPU to execute the wavefronts of a kernel dispatch.
8. CP ensures that when a wavefront starts executing the kernel machine
   code, the scalar general purpose registers (SGPR) and vector general purpose
   registers (VGPR) are set up as required by the machine code. The required
   setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
   register state is defined in
   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
9. The prolog of the kernel machine code (see
   :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
   before continuing to execute the machine code that corresponds to the kernel.
10. When the kernel dispatch has completed execution, CP signals the completion
    signal specified in the kernel dispatch packet if not 0.

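
The ordering rule in step 5 (fill in the packet body first, publish the packet
kind last, then ring the doorbell) can be sketched with a toy, single-threaded
model. Everything here is an illustration rather than ROCm runtime code:
``ToyAqlQueue`` and ``make_dispatch_packet`` are hypothetical names, the field
order follows the 64 byte kernel dispatch packet as recalled from the HSA
specification and should be checked against [HSA]_, the header carries only the
packet type (fence scope bits are omitted), and plain Python assignments stand
in for the 64-bit atomic and store release operations a real producer needs.

.. code-block:: python

   import struct

   # header, setup, workgroup x/y/z, reserved (6 x u16); grid x/y/z, private
   # and group segment sizes (5 x u32); kernel object, kernarg address,
   # reserved, completion signal (4 x u64). Assumed layout, 64 bytes total.
   PACKET_FORMAT = "<HHHHHHIIIIIQQQQ"
   PACKET_SIZE = struct.calcsize(PACKET_FORMAT)
   PACKET_TYPE_KERNEL_DISPATCH = 2  # assumed HSA packet type value


   def make_dispatch_packet(kernel_object, kernarg_address, grid, workgroup,
                            private_size, group_size, completion_signal=0):
       header = PACKET_TYPE_KERNEL_DISPATCH  # fence scope bits left as 0
       setup = 3                             # number of dimensions in use
       return struct.pack(PACKET_FORMAT, header, setup,
                          workgroup[0], workgroup[1], workgroup[2], 0,
                          grid[0], grid[1], grid[2],
                          private_size, group_size,
                          kernel_object, kernarg_address, 0, completion_signal)


   class ToyAqlQueue:
       """Single-threaded stand-in for an AQL queue ring buffer."""

       def __init__(self, num_packets):
           self.num_packets = num_packets
           self.ring = bytearray(num_packets * PACKET_SIZE)
           self.write_index = 0
           self.doorbell = None

       def reserve(self):
           # A real producer uses a 64-bit atomic add on the write index and
           # also checks the read index so it does not overrun the consumer.
           index = self.write_index
           self.write_index += 1
           return index

       def submit(self, index, packet):
           slot = (index % self.num_packets) * PACKET_SIZE
           # Write everything except the 16-bit header first ...
           self.ring[slot + 2:slot + PACKET_SIZE] = packet[2:]
           # ... then publish the header (an atomic store release in real code).
           self.ring[slot:slot + 2] = packet[:2]
           # Finally notify the packet processor through the doorbell.
           self.doorbell = index

Writing the header last matters because the packet processor treats a slot as
ready as soon as it sees a valid packet kind, so the rest of the packet has to
be visible before that write.
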
Image and Samplers
~~~~~~~~~~~~~~~~~~

Image and sampler handles created by the ROCm runtime are 64-bit addresses of a
hardware 32 byte V# and 48 byte S# object, respectively. In order to support the
HSA ``query_sampler`` operations, two extra dwords are used to store the HSA
BRIG enumeration values for the queries that are not trivially deducible from
the S# representation.

HSA Signals
~~~~~~~~~~~

HSA signal handles created by the ROCm runtime are 64-bit addresses of a
structure allocated in memory accessible from both the CPU and GPU. The
structure is defined by the ROCm runtime and subject to change between releases
(see [AMD-ROCm-github]_).

.. _amdgpu-amdhsa-hsa-aql-queue:

HSA AQL Queue
~~~~~~~~~~~~~

The HSA AQL queue structure is defined by the ROCm runtime and subject to change
between releases (see [AMD-ROCm-github]_). For some processors it contains
fields needed to implement certain language features such as the flat address
aperture bases. It also contains fields used by CP, such as those for managing
the allocation of scratch memory.

.. _amdgpu-amdhsa-kernel-descriptor:
 | 
						||
 | 
						||
Kernel Descriptor
 | 
						||
~~~~~~~~~~~~~~~~~
 | 
						||
 | 
						||
A kernel descriptor consists of the information needed by CP to initiate the
 | 
						||
execution of a kernel, including the entry point address of the machine code
 | 
						||
that implements the kernel.
 | 
						||
 | 
						||
Kernel Descriptor for GFX6-GFX10
 | 
						||
++++++++++++++++++++++++++++++++
 | 
						||
 | 
						||
CP microcode requires the Kernel descriptor to be allocated on 64 byte
 | 
						||
alignment.
 | 
						||
 | 
						||
  .. table:: Kernel Descriptor for GFX6-GFX10
 | 
						||
     :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table
 | 
						||
 | 
						||
     ======= ======= =============================== ============================
 | 
						||
     Bits    Size    Field Name                      Description
 | 
						||
     ======= ======= =============================== ============================
 | 
						||
     31:0    4 bytes GROUP_SEGMENT_FIXED_SIZE        The amount of fixed local
                                                     address space memory
                                                     required for a work-group
                                                     in bytes. This does not
                                                     include any dynamically
                                                     allocated local address
                                                     space memory that may be
                                                     added when the kernel is
                                                     dispatched.
     63:32   4 bytes PRIVATE_SEGMENT_FIXED_SIZE      The amount of fixed
                                                     private address space
                                                     memory required for a
                                                     work-item in bytes. If
                                                     is_dynamic_callstack is 1
                                                     then additional space must
                                                     be added to this value for
                                                     the call stack.
     127:64  8 bytes                                 Reserved, must be 0.
     191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET   Byte offset (possibly
                                                     negative) from base
                                                     address of kernel
                                                     descriptor to kernel's
                                                     entry point instruction
                                                     which must be 256 byte
                                                     aligned.
     351:192 20                                      Reserved, must be 0.
             bytes
     383:352 4 bytes COMPUTE_PGM_RSRC3               GFX6-9
                                                       Reserved, must be 0.
                                                     GFX10
                                                       Compute Shader (CS)
                                                       program settings used by
                                                       CP to set up
                                                       ``COMPUTE_PGM_RSRC3``
                                                       configuration
                                                       register. See
                                                       :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table`.
     415:384 4 bytes COMPUTE_PGM_RSRC1               Compute Shader (CS)
                                                     program settings used by
                                                     CP to set up
                                                     ``COMPUTE_PGM_RSRC1``
                                                     configuration
                                                     register. See
                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     447:416 4 bytes COMPUTE_PGM_RSRC2               Compute Shader (CS)
                                                     program settings used by
                                                     CP to set up
                                                     ``COMPUTE_PGM_RSRC2``
                                                     configuration
                                                     register. See
                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     448     1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     Enable the setup of the
                     _BUFFER                         SGPR user data registers
                                                     (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     The total number of SGPR
                                                     user data registers
                                                     requested must not exceed
                                                     16 and must match the
                                                     value in
                                                     ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
                                                     Any requests beyond 16
                                                     will be ignored.
     449     1 bit   ENABLE_SGPR_DISPATCH_PTR        *see above*
     450     1 bit   ENABLE_SGPR_QUEUE_PTR           *see above*
     451     1 bit   ENABLE_SGPR_KERNARG_SEGMENT_PTR *see above*
     452     1 bit   ENABLE_SGPR_DISPATCH_ID         *see above*
     453     1 bit   ENABLE_SGPR_FLAT_SCRATCH_INIT   *see above*
     454     1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     *see above*
                     _SIZE
     457:455 3 bits                                  Reserved, must be 0.
     458     1 bit   ENABLE_WAVEFRONT_SIZE32         GFX6-9
                                                       Reserved, must be 0.
                                                     GFX10
                                                       - If 0 execute in
                                                         wavefront size 64 mode.
                                                       - If 1 execute in
                                                         native wavefront size
                                                         32 mode.
     463:459 5 bits                                  Reserved, must be 0.
     511:464 6 bytes                                 Reserved, must be 0.
     512     **Total size 64 bytes.**
     ======= ====================================================================

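
As a rough illustration of the layout in the table above, the following Python
sketch packs and unpacks a 64-byte descriptor image with the standard
``struct`` module. The helper names (``pack_kernel_descriptor``,
``unpack_kernel_descriptor``, ``ENABLE_SGPR_BITS``) are hypothetical and are
not part of LLVM or the ROCm runtime, and little-endian byte order is assumed.
The sketch does not check that the enabled user SGPRs agree with
``USER_SGPR_COUNT`` in ``compute_pgm_rsrc2``.

.. code-block:: python

   import struct

   # Field order and widths follow the GFX6-GFX10 kernel descriptor table.
   # Little-endian byte order is an assumption of this sketch.
   _KD = struct.Struct("<IIQq20sIIIH6s")

   # Bit positions inside the 16-bit word covering descriptor bits 463:448.
   ENABLE_SGPR_BITS = {
       "private_segment_buffer": 0,   # descriptor bit 448
       "dispatch_ptr": 1,
       "queue_ptr": 2,
       "kernarg_segment_ptr": 3,
       "dispatch_id": 4,
       "flat_scratch_init": 5,
       "private_segment_size": 6,     # descriptor bit 454
   }
   ENABLE_WAVEFRONT_SIZE32_BIT = 10   # descriptor bit 458


   def pack_kernel_descriptor(group_segment_fixed_size,
                              private_segment_fixed_size,
                              kernel_code_entry_byte_offset,
                              compute_pgm_rsrc1, compute_pgm_rsrc2,
                              compute_pgm_rsrc3=0,
                              enabled_sgprs=(), wavefront_size32=False):
       """Return a 64-byte descriptor image with reserved fields zeroed."""
       flags = 0
       for name in enabled_sgprs:
           flags |= 1 << ENABLE_SGPR_BITS[name]
       if wavefront_size32:
           flags |= 1 << ENABLE_WAVEFRONT_SIZE32_BIT
       return _KD.pack(group_segment_fixed_size, private_segment_fixed_size,
                       0, kernel_code_entry_byte_offset, bytes(20),
                       compute_pgm_rsrc3, compute_pgm_rsrc1, compute_pgm_rsrc2,
                       flags, bytes(6))


   def unpack_kernel_descriptor(blob):
       """Decode a 64-byte descriptor image into a dict of named fields."""
       (group, private, _, entry, _, rsrc3, rsrc1, rsrc2,
        flags, _) = _KD.unpack(blob)
       return {
           "group_segment_fixed_size": group,
           "private_segment_fixed_size": private,
           "kernel_code_entry_byte_offset": entry,
           "compute_pgm_rsrc3": rsrc3,
           "compute_pgm_rsrc1": rsrc1,
           "compute_pgm_rsrc2": rsrc2,
           "enabled_sgprs": sorted(
               name for name, bit in ENABLE_SGPR_BITS.items()
               if flags & (1 << bit)),
           "wavefront_size32": bool(flags & (1 << ENABLE_WAVEFRONT_SIZE32_BIT)),
       }

The entry offset is packed as a signed 64-bit value because the table allows it
to be negative, and ``_KD.size`` evaluates to 64, matching the total size given
in the last row.
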
..

  .. table:: compute_pgm_rsrc1 for GFX6-GFX10
     :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table

     ======= ======= =============================== ===========================================================================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ===========================================================================
     5:0     6 bits  GRANULATED_WORKITEM_VGPR_COUNT  Number of vector register
                                                     blocks used by each work-item;
                                                     granularity is device
                                                     specific:

                                                     GFX6-GFX9
                                                       - vgprs_used 0..256
                                                       - max(0, ceil(vgprs_used / 4) - 1)
                                                     GFX10 (wavefront size 64)
                                                       - max_vgpr 1..256
                                                       - max(0, ceil(vgprs_used / 4) - 1)
                                                     GFX10 (wavefront size 32)
                                                       - max_vgpr 1..256
                                                       - max(0, ceil(vgprs_used / 8) - 1)

                                                     Where vgprs_used is defined
                                                     as the highest VGPR number
                                                     explicitly referenced plus
                                                     one.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.VGPRS``.

                                                     The
                                                     :ref:`amdgpu-assembler`
                                                     calculates this
                                                     automatically for the
                                                     selected processor from
                                                     values provided to the
                                                     `.amdhsa_kernel` directive
                                                     by the
                                                     `.amdhsa_next_free_vgpr`
                                                     nested directive (see
                                                     :ref:`amdhsa-kernel-directives-table`).
     9:6     4 bits  GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register
                                                     blocks used by a wavefront;
                                                     granularity is device
                                                     specific:

                                                     GFX6-GFX8
                                                       - sgprs_used 0..112
                                                       - max(0, ceil(sgprs_used / 8) - 1)
                                                     GFX9
                                                       - sgprs_used 0..112
                                                       - 2 * max(0, ceil(sgprs_used / 16) - 1)
                                                     GFX10
                                                       Reserved, must be 0.
                                                       (128 SGPRs always
                                                       allocated.)

                                                     Where sgprs_used is
                                                     defined as the highest
                                                     SGPR number explicitly
                                                     referenced plus one, plus
                                                     a target-specific number
                                                     of additional special
                                                     SGPRs for VCC,
                                                     FLAT_SCRATCH (GFX7+) and
                                                     XNACK_MASK (GFX8+), and
                                                     any additional
                                                     target-specific
                                                     limitations. It does not
                                                     include the 16 SGPRs added
                                                     if a trap handler is
                                                     enabled.

                                                     The target-specific
                                                     limitations and special
                                                     SGPR layout are defined in
                                                     the hardware
                                                     documentation, which can
                                                     be found in the
                                                     :ref:`amdgpu-processors`
                                                     table.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.SGPRS``.

                                                     The
                                                     :ref:`amdgpu-assembler`
                                                     calculates this
                                                     automatically for the
                                                     selected processor from
                                                     values provided to the
                                                     `.amdhsa_kernel` directive
                                                     by the
                                                     `.amdhsa_next_free_sgpr`
                                                     and `.amdhsa_reserve_*`
                                                     nested directives (see
                                                     :ref:`amdhsa-kernel-directives-table`).
     11:10   2 bits  PRIORITY                        Must be 0.

                                                     Start executing wavefront
                                                     at the specified priority.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.PRIORITY``.
     13:12   2 bits  FLOAT_ROUND_MODE_32             Wavefront starts execution
                                                     with specified rounding
                                                     mode for single (32-bit)
                                                     precision floating point
                                                     operations.

                                                     Floating point rounding
                                                     mode values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     15:14   2 bits  FLOAT_ROUND_MODE_16_64          Wavefront starts execution
                                                     with specified rounding
                                                     mode for half/double (16
                                                     and 64-bit) precision
                                                     floating point operations.

                                                     Floating point rounding
                                                     mode values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     17:16   2 bits  FLOAT_DENORM_MODE_32            Wavefront starts execution
                                                     with specified denorm mode
                                                     for single (32-bit)
                                                     precision floating point
                                                     operations.

                                                     Floating point denorm mode
                                                     values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     19:18   2 bits  FLOAT_DENORM_MODE_16_64         Wavefront starts execution
                                                     with specified denorm mode
                                                     for half/double (16
                                                     and 64-bit) precision
                                                     floating point operations.

                                                     Floating point denorm mode
                                                     values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     20      1 bit   PRIV                            Must be 0.

                                                     Start executing wavefront
                                                     in privilege trap handler
                                                     mode.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.PRIV``.
     21      1 bit   ENABLE_DX10_CLAMP               Wavefront starts execution
                                                     with DX10 clamp mode
                                                     enabled. Used by the vector
                                                     ALU to force DX10 style
                                                     treatment of NaNs (when
                                                     set, clamp NaN to zero,
                                                     otherwise pass NaN
                                                     through).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
     22      1 bit   DEBUG_MODE                      Must be 0.

                                                     Start executing wavefront
                                                     in single step mode.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
     23      1 bit   ENABLE_IEEE_MODE                Wavefront starts execution
                                                     with IEEE mode
                                                     enabled. Floating point
                                                     opcodes that support
                                                     exception flag gathering
                                                     will quiet and propagate
                                                     signaling-NaN inputs per
                                                     IEEE 754-2008. Min_dx10 and
                                                     max_dx10 become IEEE
                                                     754-2008 compliant due to
                                                     signaling-NaN propagation
                                                     and quieting.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
     24      1 bit   BULKY                           Must be 0.

                                                     Only one work-group allowed
                                                     to execute on a compute
                                                     unit.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.BULKY``.
     25      1 bit   CDBG_USER                       Must be 0.

                                                     Flag that can be used to
                                                     control debugging code.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.CDBG_USER``.
     26      1 bit   FP16_OVFL                       GFX6-GFX8
                                                       Reserved, must be 0.
                                                     GFX9-GFX10
                                                       Wavefront starts execution
                                                       with specified fp16 overflow
                                                       mode.

                                                       - If 0, fp16 overflow generates
                                                         +/-INF values.
                                                       - If 1, fp16 overflow that is the
                                                         result of a +/-INF input value
                                                         or divide by 0 produces a +/-INF,
                                                         otherwise clamps computed
                                                         overflow to +/-MAX_FP16 as
                                                         appropriate.

                                                       Used by CP to set up
                                                       ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
     28:27   2 bits                                  Reserved, must be 0.
     29      1 bit   WGP_MODE                        GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX10
                                                       - If 0 execute work-groups in
                                                         CU wavefront execution mode.
                                                       - If 1 execute work-groups in
                                                         WGP wavefront execution mode.

                                                       See :ref:`amdgpu-amdhsa-memory-model`.

                                                       Used by CP to set up
                                                       ``COMPUTE_PGM_RSRC1.WGP_MODE``.
     30      1 bit   MEM_ORDERED                     GFX6-9
                                                       Reserved, must be 0.
                                                     GFX10
                                                       Controls the behavior of the
                                                       waitcnt's vmcnt and vscnt
                                                       counters.

                                                       - If 0 vmcnt reports completion
                                                         of load and atomic with return
                                                         out of order with sample
                                                         instructions, and the vscnt
                                                         reports the completion of
                                                         store and atomic without
                                                         return in order.
                                                       - If 1 vmcnt reports completion
                                                         of load, atomic with return
                                                         and sample instructions in
                                                         order, and the vscnt reports
                                                         the completion of store and
                                                         atomic without return in order.

                                                       Used by CP to set up
                                                       ``COMPUTE_PGM_RSRC1.MEM_ORDERED``.
     31      1 bit   FWD_PROGRESS                    GFX6-9
                                                       Reserved, must be 0.
                                                     GFX10
                                                       - If 0 execute SIMD wavefronts
                                                         using oldest first policy.
                                                       - If 1 execute SIMD wavefronts to
                                                         ensure wavefronts will make some
                                                         forward progress.

                                                       Used by CP to set up
                                                       ``COMPUTE_PGM_RSRC1.FWD_PROGRESS``.
     32      **Total size 4 bytes**
     ======= ===================================================================================================================

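
The granulated register counts and bit positions in the table above can be
sketched in Python as follows. This is a minimal illustration, not the
assembler's implementation: the function names, the ``_RSRC1_FIELDS`` mapping
and the family strings (``"gfx6"`` through ``"gfx10"``) are hypothetical, and
only fields that the table does not mark as "Must be 0" are included. Rounding
and denorm mode values are passed through as raw integers because their
enumerations are defined in separate tables.

.. code-block:: python

   def ceil_div(a, b):
       """Integer ceiling division for non-negative a and positive b."""
       return -(-a // b)


   def granulated_vgpr_count(vgprs_used, gfx10_wave32=False):
       """GRANULATED_WORKITEM_VGPR_COUNT: blocks of 4, or 8 for GFX10 wave32."""
       granule = 8 if gfx10_wave32 else 4
       return max(0, ceil_div(vgprs_used, granule) - 1)


   def granulated_sgpr_count(sgprs_used, family):
       """GRANULATED_WAVEFRONT_SGPR_COUNT for a family label (assumed names)."""
       if family == "gfx10":
           return 0  # reserved, 128 SGPRs are always allocated
       if family == "gfx9":
           return 2 * max(0, ceil_div(sgprs_used, 16) - 1)
       if family in ("gfx6", "gfx7", "gfx8"):
           return max(0, ceil_div(sgprs_used, 8) - 1)
       raise ValueError(f"unknown family: {family}")


   # name: (lowest bit, width in bits), taken from the Bits and Size columns.
   _RSRC1_FIELDS = {
       "granulated_workitem_vgpr_count": (0, 6),
       "granulated_wavefront_sgpr_count": (6, 4),
       "float_round_mode_32": (12, 2),
       "float_round_mode_16_64": (14, 2),
       "float_denorm_mode_32": (16, 2),
       "float_denorm_mode_16_64": (18, 2),
       "enable_dx10_clamp": (21, 1),
       "enable_ieee_mode": (23, 1),
       "fp16_ovfl": (26, 1),
       "wgp_mode": (29, 1),
       "mem_ordered": (30, 1),
       "fwd_progress": (31, 1),
   }


   def pack_rsrc1(**fields):
       """Combine named field values into a 32-bit compute_pgm_rsrc1 word."""
       word = 0
       for name, value in fields.items():
           low, width = _RSRC1_FIELDS[name]
           if not 0 <= value < (1 << width):
               raise ValueError(f"{name}={value} does not fit in {width} bits")
           word |= value << low
       return word

For example, a kernel with ``vgprs_used`` of 24 and ``sgprs_used`` of 40
yields a VGPR count of 5 on GFX6-GFX9 and an SGPR count of 4 under both the
GFX6-GFX8 and the GFX9 formulas. Fields left out of the ``pack_rsrc1`` call
stay 0, which matches the "Must be 0" and reserved entries.
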
..

  .. table:: compute_pgm_rsrc2 for GFX6-GFX10
     :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table

     ======= ======= =============================== ===========================================================================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ===========================================================================
     0       1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     Enable the setup of the
                     _WAVEFRONT_OFFSET               SGPR wavefront scratch offset
                                                     system register (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
     5:1     5 bits  USER_SGPR_COUNT                 The total number of SGPR
                                                     user data registers
                                                     requested. This number must
                                                     match the number of user
                                                     data registers enabled.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.USER_SGPR``.
     6       1 bit   ENABLE_TRAP_HANDLER             Must be 0.

                                                     This bit represents
                                                     ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
                                                     which is set by the CP if
                                                     the runtime has installed a
                                                     trap handler.
     7       1 bit   ENABLE_SGPR_WORKGROUP_ID_X      Enable the setup of the
 | 
						||
                                                     system SGPR register for
 | 
						||
                                                     the work-group id in the X
 | 
						||
                                                     dimension (see
 | 
						||
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
 | 
						||
 | 
						||
                                                     Used by CP to set up
 | 
						||
                                                     ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
 | 
						||
     8       1 bit   ENABLE_SGPR_WORKGROUP_ID_Y      Enable the setup of the
 | 
						||
                                                     system SGPR register for
 | 
						||
                                                     the work-group id in the Y
 | 
						||
                                                     dimension (see
 | 
						||
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
 | 
						||
 | 
						||
                                                     Used by CP to set up
 | 
						||
                                                     ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
 | 
						||
     9       1 bit   ENABLE_SGPR_WORKGROUP_ID_Z      Enable the setup of the
 | 
						||
                                                     system SGPR register for
 | 
						||
                                                     the work-group id in the Z
 | 
						||
                                                     dimension (see
 | 
						||
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
 | 
						||
 | 
						||
                                                     Used by CP to set up
 | 
						||
                                                     ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
 | 
						||
     10      1 bit   ENABLE_SGPR_WORKGROUP_INFO      Enable the setup of the
 | 
						||
                                                     system SGPR register for
 | 
						||
                                                     work-group information (see
 | 
						||
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
 | 
						||
 | 
						||
                                                     Used by CP to set up
 | 
						||
                                                     ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
 | 
						||
     12:11   2 bits  ENABLE_VGPR_WORKITEM_ID         Enable the setup of the
 | 
						||
                                                     VGPR system registers used
 | 
						||
                                                     for the work-item ID.
 | 
						||
                                                     :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
 | 
						||
                                                     defines the values.
 | 
						||
 | 
						||
                                                     Used by CP to set up
 | 
						||
                                                     ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
 | 
						||
     13      1 bit   ENABLE_EXCEPTION_ADDRESS_WATCH  Must be 0.
 | 
						||
 | 
						||
                                                     Wavefront starts execution
 | 
						||
                                                     with address watch
 | 
						||
                                                     exceptions enabled which
 | 
						||
                                                     are generated when L1 has
 | 
						||
                                                     witnessed a thread access
 | 
						||
                                                     an *address of
 | 
						||
                                                     interest*.
 | 
						||
 | 
						||
                                                     CP is responsible for
 | 
						||
                                                     filling in the address
 | 
						||
                                                     watch bit in
 | 
						||
                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
 | 
						||
                                                     according to what the
 | 
						||
                                                     runtime requests.
 | 
						||
     14      1 bit   ENABLE_EXCEPTION_MEMORY         Must be 0.
 | 
						||
 | 
						||
                                                     Wavefront starts execution
 | 
						||
                                                     with memory violation
 | 
						||
                                                     exceptions
 | 
						||
                                                     enabled which are generated
 | 
						||
                                                     when a memory violation has
 | 
						||
                                                     occurred for this wavefront from
 | 
						||
                                                     L1 or LDS
 | 
						||
                                                     (write-to-read-only-memory,
 | 
						||
                                                     mis-aligned atomic, LDS
 | 
						||
                                                     address out of range,
 | 
						||
                                                     illegal address, etc.).
 | 
						||
 | 
						||
                                                     CP sets the memory
 | 
						||
                                                     violation bit in
 | 
						||
                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
 | 
						||
                                                     according to what the
 | 
						||
                                                     runtime requests.
 | 
						||
     23:15   9 bits  GRANULATED_LDS_SIZE             Must be 0.
 | 
						||
 | 
						||
                                                     CP uses the rounded value
 | 
						||
                                                     from the dispatch packet,
 | 
						||
                                                     not this value, as the
 | 
						||
                                                     dispatch may contain
 | 
						||
                                                     dynamically allocated group
 | 
						||
                                                     segment memory. CP writes
 | 
						||
                                                     directly to
 | 
						||
                                                     ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
 | 
						||
 | 
						||
                                                     Amount of group segment
 | 
						||
                                                     (LDS) to allocate for each
 | 
						||
                                                     work-group. Granularity is
 | 
						||
                                                     device specific:
 | 
						||
 | 
						||
                                                     GFX6:
 | 
						||
                                                       roundup(lds-size / (64 * 4))
 | 
						||
                                                     GFX7-GFX10:
 | 
						||
                                                       roundup(lds-size / (128 * 4))
 | 
						||
 | 
						||
     24      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    Wavefront starts execution
 | 
						||
                     _INVALID_OPERATION              with specified exceptions
 | 
						||
                                                     enabled.
 | 
						||
 | 
						||
                                                     Used by CP to set up
 | 
						||
                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN``
 | 
						||
                                                     (set from bits 0..6).
 | 
						||
 | 
						||
                                                     IEEE 754 FP Invalid
 | 
						||
                                                     Operation
 | 
						||
     25      1 bit   ENABLE_EXCEPTION_FP_DENORMAL    FP Denormal: one or more
 | 
						||
                     _SOURCE                         input operands is a
 | 
						||
                                                     denormal number
 | 
						||
     26      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Division by
 | 
						||
                     _DIVISION_BY_ZERO               Zero
 | 
						||
     27      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Overflow
 | 
						||
                     _OVERFLOW
 | 
						||
     28      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Underflow
 | 
						||
                     _UNDERFLOW
 | 
						||
     29      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Inexact
 | 
						||
                     _INEXACT
 | 
						||
     30      1 bit   ENABLE_EXCEPTION_INT_DIVIDE_BY  Integer Division by Zero
 | 
						||
                     _ZERO                           (rcp_iflag_f32 instruction
 | 
						||
                                                     only)
 | 
						||
     31      1 bit                                   Reserved, must be 0.
 | 
						||
     32      **Total size 4 bytes.**
 | 
						||
     ======= ===================================================================================================================
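
As an illustration of the bit layout in the table above, the following Python
sketch packs field values into a 32-bit ``compute_pgm_rsrc2`` word. The helper
``pack_compute_pgm_rsrc2`` and the ``COMPUTE_PGM_RSRC2_FIELDS`` dictionary are
hypothetical names used only for this example and are not part of LLVM; only
the field names, bit offsets, and widths are taken from the table.

.. code-block:: python

  # (bit offset, width in bits) of each field, copied from the table above.
  COMPUTE_PGM_RSRC2_FIELDS = {
      "ENABLE_SGPR_PRIVATE_SEGMENT_WAVEFRONT_OFFSET": (0, 1),
      "USER_SGPR_COUNT": (1, 5),
      "ENABLE_TRAP_HANDLER": (6, 1),
      "ENABLE_SGPR_WORKGROUP_ID_X": (7, 1),
      "ENABLE_SGPR_WORKGROUP_ID_Y": (8, 1),
      "ENABLE_SGPR_WORKGROUP_ID_Z": (9, 1),
      "ENABLE_SGPR_WORKGROUP_INFO": (10, 1),
      "ENABLE_VGPR_WORKITEM_ID": (11, 2),
      "ENABLE_EXCEPTION_ADDRESS_WATCH": (13, 1),
      "ENABLE_EXCEPTION_MEMORY": (14, 1),
      "GRANULATED_LDS_SIZE": (15, 9),
      "ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION": (24, 1),
      "ENABLE_EXCEPTION_FP_DENORMAL_SOURCE": (25, 1),
      "ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO": (26, 1),
      "ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW": (27, 1),
      "ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW": (28, 1),
      "ENABLE_EXCEPTION_IEEE_754_FP_INEXACT": (29, 1),
      "ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO": (30, 1),
  }


  def pack_compute_pgm_rsrc2(**fields):
      """Pack named field values into a 32-bit word (hypothetical helper)."""
      word = 0
      for name, value in fields.items():
          offset, width = COMPUTE_PGM_RSRC2_FIELDS[name]
          # Reject values that do not fit in the field's width.
          if not 0 <= value < (1 << width):
              raise ValueError(f"{name}={value} does not fit in {width} bit(s)")
          word |= value << offset
      return word


  # Example: scratch offset SGPR enabled, 6 user SGPRs, work-group id X
  # enabled, and work-item IDs for X, Y and Z (enumeration value 2).
  rsrc2 = pack_compute_pgm_rsrc2(
      ENABLE_SGPR_PRIVATE_SEGMENT_WAVEFRONT_OFFSET=1,
      USER_SGPR_COUNT=6,
      ENABLE_SGPR_WORKGROUP_ID_X=1,
      ENABLE_VGPR_WORKITEM_ID=2,
  )

Fields that are omitted stay 0, which matches the "must be 0" requirement on
the reserved and runtime-controlled bits in this table.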
 | 
						||
 | 
						||
..
 | 
						||
 | 
						||
  .. table:: compute_pgm_rsrc3 for GFX10
 | 
						||
     :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table
 | 
						||
 | 
						||
     ======= ======= =============================== ===========================================================================
 | 
						||
     Bits    Size    Field Name                      Description
 | 
						||
     ======= ======= =============================== ===========================================================================
 | 
						||
     3:0     4 bits  SHARED_VGPR_COUNT               Number of shared VGPRs for wavefront size 64. Granularity 8. Value 0-120.
 | 
						||
                                                     compute_pgm_rsrc1.vgprs + shared_vgpr_cnt cannot exceed 64.
 | 
						||
     31:4    28                                      Reserved, must be 0.
 | 
						||
             bits
 | 
						||
     32      **Total size 4 bytes.**
 | 
						||
     ======= ===================================================================================================================
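
The following Python sketch shows one reading of the ``SHARED_VGPR_COUNT``
encoding. It assumes that "Granularity 8" means the 4-bit field stores the
number of shared VGPRs divided by 8, which is consistent with a 4-bit field
covering the stated range 0-120. The function name is hypothetical, and the
limit that involves ``compute_pgm_rsrc1`` is not modeled.

.. code-block:: python

  def encode_shared_vgpr_count(num_shared_vgprs):
      """Return the 4-bit SHARED_VGPR_COUNT field (assumed: count / 8)."""
      if num_shared_vgprs % 8 != 0 or not 0 <= num_shared_vgprs <= 120:
          raise ValueError("shared VGPR count must be a multiple of 8 in 0..120")
      return num_shared_vgprs // 8


  # Example: 64 shared VGPRs are encoded as the field value 8.
  field = encode_shared_vgpr_count(64)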
 | 
						||
 | 
						||
..
 | 
						||
 | 
						||
  .. table:: Floating Point Rounding Mode Enumeration Values
 | 
						||
     :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
 | 
						||
 | 
						||
     ====================================== ===== ==============================
 | 
						||
     Enumeration Name                       Value Description
 | 
						||
     ====================================== ===== ==============================
 | 
						||
     FLOAT_ROUND_MODE_NEAR_EVEN             0     Round Ties To Even
 | 
						||
     FLOAT_ROUND_MODE_PLUS_INFINITY         1     Round Toward +infinity
 | 
						||
     FLOAT_ROUND_MODE_MINUS_INFINITY        2     Round Toward -infinity
 | 
						||
     FLOAT_ROUND_MODE_ZERO                  3     Round Toward 0
 | 
						||
     ====================================== ===== ==============================
 | 
						||
 | 
						||
..
 | 
						||
 | 
						||
  .. table:: Floating Point Denorm Mode Enumeration Values
 | 
						||
     :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
 | 
						||
 | 
						||
     ====================================== ===== ==============================
 | 
						||
     Enumeration Name                       Value Description
 | 
						||
     ====================================== ===== ==============================
 | 
						||
     FLOAT_DENORM_MODE_FLUSH_SRC_DST        0     Flush Source and Destination
 | 
						||
                                                  Denorms
 | 
						||
     FLOAT_DENORM_MODE_FLUSH_DST            1     Flush Output Denorms
 | 
						||
     FLOAT_DENORM_MODE_FLUSH_SRC            2     Flush Source Denorms
 | 
						||
     FLOAT_DENORM_MODE_FLUSH_NONE           3     No Flush
 | 
						||
     ====================================== ===== ==============================
 | 
						||
 | 
						||
..
 | 
						||
 | 
						||
  .. table:: System VGPR Work-Item ID Enumeration Values
 | 
						||
     :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
 | 
						||
 | 
						||
     ======================================== ===== ============================
 | 
						||
     Enumeration Name                         Value Description
 | 
						||
     ======================================== ===== ============================
 | 
						||
     SYSTEM_VGPR_WORKITEM_ID_X                0     Set work-item X dimension
 | 
						||
                                                    ID.
 | 
						||
     SYSTEM_VGPR_WORKITEM_ID_X_Y              1     Set work-item X and Y
 | 
						||
                                                    dimensions ID.
 | 
						||
     SYSTEM_VGPR_WORKITEM_ID_X_Y_Z            2     Set work-item X, Y and Z
 | 
						||
                                                    dimensions ID.
 | 
						||
     SYSTEM_VGPR_WORKITEM_ID_UNDEFINED        3     Undefined.
 | 
						||
     ======================================== ===== ============================
 | 
						||
 | 
						||
.. _amdgpu-amdhsa-initial-kernel-execution-state:
 | 
						||
 | 
						||
Initial Kernel Execution State
 | 
						||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 | 
						||
 | 
						||
This section defines the register state that will be set up by the packet
 | 
						||
processor prior to the start of execution of every wavefront. This is limited by
 | 
						||
the constraints of the hardware controllers of CP/ADC/SPI.
 | 
						||
 | 
						||
The order of the SGPR registers is defined, but the compiler can specify which
 | 
						||
ones are actually set up in the kernel descriptor using the ``enable_sgpr_*`` bit
 | 
						||
fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
 | 
						||
for enabled registers are dense starting at SGPR0: the first enabled register is
 | 
						||
SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
 | 
						||
an SGPR number.
 | 
						||
 | 
						||
The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to
 | 
						||
all wavefronts of the grid. It is possible to specify more than 16 User SGPRs
 | 
						||
using the ``enable_sgpr_*`` bit fields, in which case only the first 16 are
 | 
						||
actually initialized. These are then immediately followed by the System SGPRs
 | 
						||
that are set up by ADC/SPI and can have different values for each wavefront of
 | 
						||
the grid dispatch.
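
The dense numbering rule can be sketched in Python as follows. The register
order and SGPR counts are taken from
:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`; the function
``assign_sgprs`` and the list ``SGPR_SETUP_ORDER`` are hypothetical names used
only for illustration. The sketch does not model the limit of 16 initialized
User SGPRs.

.. code-block:: python

  # (kernel descriptor enable field, number of SGPRs), in set-up order.
  SGPR_SETUP_ORDER = [
      ("enable_sgpr_private_segment_buffer", 4),
      ("enable_sgpr_dispatch_ptr", 2),
      ("enable_sgpr_queue_ptr", 2),
      ("enable_sgpr_kernarg_segment_ptr", 2),
      ("enable_sgpr_dispatch_id", 2),
      ("enable_sgpr_flat_scratch_init", 2),
      ("enable_sgpr_private_segment_size", 1),
      ("enable_sgpr_grid_workgroup_count_X", 1),
      ("enable_sgpr_grid_workgroup_count_Y", 1),
      ("enable_sgpr_grid_workgroup_count_Z", 1),
      ("enable_sgpr_workgroup_id_X", 1),
      ("enable_sgpr_workgroup_id_Y", 1),
      ("enable_sgpr_workgroup_id_Z", 1),
      ("enable_sgpr_workgroup_info", 1),
      ("enable_sgpr_private_segment_wavefront_offset", 1),
  ]


  def assign_sgprs(enabled):
      """Map each enabled field to its (first, last) SGPR number."""
      assignment = {}
      next_sgpr = 0
      for name, count in SGPR_SETUP_ORDER:
          if name not in enabled:
              continue  # Disabled registers do not get an SGPR number.
          assignment[name] = (next_sgpr, next_sgpr + count - 1)
          next_sgpr += count
      return assignment


  layout = assign_sgprs({
      "enable_sgpr_private_segment_buffer",
      "enable_sgpr_kernarg_segment_ptr",
      "enable_sgpr_workgroup_id_X",
      "enable_sgpr_private_segment_wavefront_offset",
  })

With these four fields enabled, the private segment buffer occupies SGPR0-3,
the kernarg segment pointer SGPR4-5, the work-group id X SGPR6, and the scratch
wavefront offset SGPR7.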
 | 
						||
 | 
						||
SGPR register initial state is defined in
 | 
						||
:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
 | 
						||
 | 
						||
  .. table:: SGPR Register Set Up Order
 | 
						||
     :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
 | 
						||
 | 
						||
     ========== ========================== ====== ==============================
 | 
						||
     SGPR Order Name                       Number Description
 | 
						||
                (kernel descriptor enable  of
 | 
						||
                field)                     SGPRs
 | 
						||
     ========== ========================== ====== ==============================
 | 
						||
     First      Private Segment Buffer     4      V# that can be used, together
 | 
						||
                (enable_sgpr_private              with Scratch Wavefront Offset
 | 
						||
                _segment_buffer)                  as an offset, to access the
 | 
						||
                                                  private address space using a
 | 
						||
                                                  segment address.
 | 
						||
 | 
						||
                                                  CP uses the value provided by
 | 
						||
                                                  the runtime.
 | 
						||
     then       Dispatch Ptr               2      64-bit address of AQL dispatch
 | 
						||
                (enable_sgpr_dispatch_ptr)        packet for kernel dispatch
 | 
						||
                                                  actually executing.
 | 
						||
     then       Queue Ptr                  2      64-bit address of amd_queue_t
 | 
						||
                (enable_sgpr_queue_ptr)           object for AQL queue on which
 | 
						||
                                                  the dispatch packet was
 | 
						||
                                                  queued.
 | 
						||
     then       Kernarg Segment Ptr        2      64-bit address of Kernarg
 | 
						||
                (enable_sgpr_kernarg              segment. This is directly
 | 
						||
                _segment_ptr)                     copied from the
 | 
						||
                                                  kernarg_address in the kernel
 | 
						||
                                                  dispatch packet.
 | 
						||
 | 
						||
                                                  Having CP load it once avoids
 | 
						||
                                                  loading it at the beginning of
 | 
						||
                                                  every wavefront.
 | 
						||
     then       Dispatch Id                2      64-bit Dispatch ID of the
 | 
						||
                (enable_sgpr_dispatch_id)         dispatch packet being
 | 
						||
                                                  executed.
 | 
						||
     then       Flat Scratch Init          2      This is 2 SGPRs:
 | 
						||
                (enable_sgpr_flat_scratch
 | 
						||
                _init)                            GFX6
 | 
						||
                                                    Not supported.
 | 
						||
                                                  GFX7-GFX8
 | 
						||
                                                    The first SGPR is a 32-bit
 | 
						||
                                                    byte offset from
 | 
						||
                                                    ``SH_HIDDEN_PRIVATE_BASE_VIMID``
 | 
						||
                                                    to per SPI base of memory
 | 
						||
                                                    for scratch for the queue
 | 
						||
                                                    executing the kernel
 | 
						||
                                                    dispatch. CP obtains this
 | 
						||
                                                    from the runtime. (The
 | 
						||
                                                    Scratch Segment Buffer base
 | 
						||
                                                    address is
 | 
						||
                                                    ``SH_HIDDEN_PRIVATE_BASE_VIMID``
 | 
						||
                                                    plus this offset.) The value
 | 
						||
                                                    of Scratch Wavefront Offset must
 | 
						||
                                                    be added to this offset by
 | 
						||
                                                    the kernel machine code,
 | 
						||
                                                    right shifted by 8, and
 | 
						||
                                                    moved to the FLAT_SCRATCH_HI
 | 
						||
                                                    SGPR register.
 | 
						||
                                                    FLAT_SCRATCH_HI corresponds
 | 
						||
                                                    to SGPRn-4 on GFX7, and
 | 
						||
                                                    SGPRn-6 on GFX8 (where SGPRn
 | 
						||
                                                    is the highest numbered SGPR
 | 
						||
                                                    allocated to the wavefront).
 | 
						||
                                                    FLAT_SCRATCH_HI is
 | 
						||
                                                    multiplied by 256 (as it is
 | 
						||
                                                    in units of 256 bytes) and
 | 
						||
                                                    added to
 | 
						||
                                                    ``SH_HIDDEN_PRIVATE_BASE_VIMID``
 | 
						||
                                                    to calculate the per wavefront
 | 
						||
                                                    FLAT SCRATCH BASE in flat
 | 
						||
                                                    memory instructions that
 | 
						||
                                                    access the scratch
 | 
						||
                                                    aperture.
 | 
						||
 | 
						||
                                                    The second SGPR is 32-bit
 | 
						||
                                                    byte size of a single
 | 
						||
                                                    work-item's scratch memory
 | 
						||
                                                    usage. CP obtains this from
 | 
						||
                                                    the runtime, and it is
 | 
						||
                                                    always a multiple of DWORD.
 | 
						||
                                                    CP checks that the value in
 | 
						||
                                                    the kernel dispatch packet
 | 
						||
                                                    Private Segment Byte Size is
 | 
						||
                                                    not larger, and requests the
 | 
						||
                                                    runtime to increase the
 | 
						||
                                                    queue's scratch size if
 | 
						||
                                                    necessary. The kernel code
 | 
						||
                                                    must move it to
 | 
						||
                                                    FLAT_SCRATCH_LO which is
 | 
						||
                                                    SGPRn-3 on GFX7 and SGPRn-5
 | 
						||
                                                    on GFX8. FLAT_SCRATCH_LO is
 | 
						||
                                                    used as the FLAT SCRATCH
 | 
						||
                                                    SIZE in flat memory
 | 
						||
                                                    instructions. Having CP load
 | 
						||
                                                    it once avoids loading it at
 | 
						||
                                                    the beginning of every
 | 
						||
                                                    wavefront.
 | 
						||
                                                  GFX9-GFX10
 | 
						||
                                                    This is the
 | 
						||
                                                    64-bit base address of the
 | 
						||
                                                    per SPI scratch backing
 | 
						||
                                                    memory managed by SPI for
 | 
						||
                                                    the queue executing the
 | 
						||
                                                    kernel dispatch. CP obtains
 | 
						||
                                                    this from the runtime (and
 | 
						||
                                                    divides it if there are
 | 
						||
                                                    multiple Shader Arrays each
 | 
						||
                                                    with its own SPI). The value
 | 
						||
                                                    of Scratch Wavefront Offset must
 | 
						||
                                                    be added by the kernel
 | 
						||
                                                    machine code and the result
 | 
						||
                                                    moved to the FLAT_SCRATCH
 | 
						||
                                                    SGPR which is SGPRn-6 and
 | 
						||
                                                    SGPRn-5. It is used as the
 | 
						||
                                                    FLAT SCRATCH BASE in flat
 | 
						||
                                                    memory instructions.
 | 
						||
     then       Private Segment Size       1      The 32-bit byte size of a
                (enable_sgpr_private              single work-item's scratch
                _segment_size)                    memory allocation. This is the
 | 
						||
                                                  value from the kernel
 | 
						||
                                                  dispatch packet Private
 | 
						||
                                                  Segment Byte Size rounded up
 | 
						||
                                                  by CP to a multiple of
 | 
						||
                                                  DWORD.
 | 
						||
 | 
						||
                                                  Having CP load it once avoids
 | 
						||
                                                  loading it at the beginning of
 | 
						||
                                                  every wavefront.
 | 
						||
 | 
						||
                                                  This is not used for
 | 
						||
                                                  GFX7-GFX8 since it is the same
 | 
						||
                                                  value as the second SGPR of
 | 
						||
                                                  Flat Scratch Init. However, it
 | 
						||
                                                  may be needed for GFX9-GFX10 which
 | 
						||
                                                  changes the meaning of the
 | 
						||
                                                  Flat Scratch Init value.
 | 
						||
     then       Grid Work-Group Count X    1      32-bit count of the number of
 | 
						||
                (enable_sgpr_grid                 work-groups in the X dimension
 | 
						||
                _workgroup_count_X)               for the grid being
 | 
						||
                                                  executed. Computed from the
 | 
						||
                                                  fields in the kernel dispatch
 | 
						||
                                                  packet as ((grid_size.x +
 | 
						||
                                                  workgroup_size.x - 1) /
 | 
						||
                                                  workgroup_size.x).
 | 
						||
     then       Grid Work-Group Count Y    1      32-bit count of the number of
 | 
						||
                (enable_sgpr_grid                 work-groups in the Y dimension
 | 
						||
                _workgroup_count_Y &&             for the grid being
 | 
						||
                less than 16 previous             executed. Computed from the
 | 
						||
                SGPRs)                            fields in the kernel dispatch
 | 
						||
                                                  packet as ((grid_size.y +
 | 
						||
                                                  workgroup_size.y - 1) /
 | 
						||
                                                  workgroup_size.y).
 | 
						||
 | 
						||
                                                  Only initialized if <16
 | 
						||
                                                  previous SGPRs initialized.
 | 
						||
     then       Grid Work-Group Count Z    1      32-bit count of the number of
 | 
						||
                (enable_sgpr_grid                 work-groups in the Z dimension
 | 
						||
                _workgroup_count_Z &&             for the grid being
 | 
						||
                less than 16 previous             executed. Computed from the
 | 
						||
                SGPRs)                            fields in the kernel dispatch
 | 
						||
                                                  packet as ((grid_size.z +
 | 
						||
                                                  workgroup_size.z - 1) /
 | 
						||
                                                  workgroup_size.z).
 | 
						||
 | 
						||
                                                  Only initialized if <16
 | 
						||
                                                  previous SGPRs initialized.
 | 
						||
     then       Work-Group Id X            1      32-bit work-group id in X
 | 
						||
                (enable_sgpr_workgroup_id         dimension of grid for
 | 
						||
                _X)                               wavefront.
 | 
						||
     then       Work-Group Id Y            1      32-bit work-group id in Y
 | 
						||
                (enable_sgpr_workgroup_id         dimension of grid for
 | 
						||
                _Y)                               wavefront.
 | 
						||
     then       Work-Group Id Z            1      32-bit work-group id in Z
 | 
						||
                (enable_sgpr_workgroup_id         dimension of grid for
 | 
						||
                _Z)                               wavefront.
 | 
						||
     then       Work-Group Info            1      {first_wavefront, 14'b0000,
 | 
						||
                (enable_sgpr_workgroup            ordered_append_term[10:0],
 | 
						||
                _info)                            threadgroup_size_in_wavefronts[5:0]}
 | 
						||
     then       Scratch Wavefront Offset   1      32-bit byte offset from base
 | 
						||
                (enable_sgpr_private              of scratch base of queue
 | 
						||
                _segment_wavefront_offset)        executing the kernel
 | 
						||
                                                  dispatch. Must be used as an
 | 
						||
                                                  offset with Private
 | 
						||
                                                  segment address when using
 | 
						||
                                                  Scratch Segment Buffer. It
 | 
						||
                                                  must be used to set up FLAT
 | 
						||
                                                  SCRATCH for flat addressing
 | 
						||
                                                  (see
 | 
						||
                                                  :ref:`amdgpu-amdhsa-flat-scratch`).
 | 
						||
     ========== ========================== ====== ==============================
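
The Grid Work-Group Count entries in the table above are a ceiling division of
the grid size by the work-group size in each dimension. A minimal Python sketch
of that arithmetic follows; the function name and the sample sizes are
illustrative assumptions and are not part of the ABI:

.. code-block:: python

  def grid_workgroup_count(grid_size: int, workgroup_size: int) -> int:
      """Work-groups needed to cover grid_size work-items in one dimension."""
      if workgroup_size <= 0:
          raise ValueError("workgroup_size must be positive")
      # (grid_size + workgroup_size - 1) / workgroup_size with integer division.
      return (grid_size + workgroup_size - 1) // workgroup_size


  # A grid that is not a multiple of the work-group size still needs a final,
  # partially filled work-group.
  counts = [grid_workgroup_count(g, w) for g, w in [(1024, 256), (1000, 256)]]
  print(counts)
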
 | 
						||
 | 
						||
The order of the VGPR registers is defined, but the compiler can specify which
 | 
						||
ones are actually set up in the kernel descriptor using the ``enable_vgpr*`` bit
 | 
						||
fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
 | 
						||
for enabled registers are dense starting at VGPR0: the first enabled register is
 | 
						||
VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
 | 
						||
VGPR number.
 | 
						||
 | 
						||
VGPR register initial state is defined in
 | 
						||
:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
 | 
						||
 | 
						||
  .. table:: VGPR Register Set Up Order
 | 
						||
     :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
 | 
						||
 | 
						||
     ========== ========================== ====== ==============================
 | 
						||
     VGPR Order Name                       Number Description
 | 
						||
                (kernel descriptor enable  of
 | 
						||
                field)                     VGPRs
 | 
						||
     ========== ========================== ====== ==============================
 | 
						||
     First      Work-Item Id X             1      32-bit work item id in X
 | 
						||
                (Always initialized)              dimension of work-group for
 | 
						||
                                                  wavefront lane.
 | 
						||
     then       Work-Item Id Y             1      32-bit work item id in Y
 | 
						||
                (enable_vgpr_workitem_id          dimension of work-group for
 | 
						||
                > 0)                              wavefront lane.
 | 
						||
     then       Work-Item Id Z             1      32-bit work item id in Z
 | 
						||
                (enable_vgpr_workitem_id          dimension of work-group for
 | 
						||
                > 1)                              wavefront lane.
 | 
						||
     ========== ========================== ====== ==============================
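
The dense numbering rule can be illustrated with a short Python sketch. It
assumes the ``enable_vgpr_workitem_id`` field is available as a plain integer;
the function name is hypothetical:

.. code-block:: python

  def initial_vgprs(enable_vgpr_workitem_id: int) -> dict:
      """Map each initialized work-item id component to its VGPR."""
      enabled = ["Work-Item Id X"]  # X is always initialized
      if enable_vgpr_workitem_id > 0:
          enabled.append("Work-Item Id Y")
      if enable_vgpr_workitem_id > 1:
          enabled.append("Work-Item Id Z")
      # Enabled registers are numbered densely starting at VGPR0.
      return {name: "VGPR%d" % i for i, name in enumerate(enabled)}


  print(initial_vgprs(1))

Only the combinations (X), (X, Y) and (X, Y, Z) can be produced this way, which
matches what SPI supports.
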
 | 
						||
 | 
						||
The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
 | 
						||
 | 
						||
1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
 | 
						||
   registers.
 | 
						||
2. Work-group Id registers X, Y, Z are set by ADC which supports any
 | 
						||
   combination including none.
 | 
						||
3. Scratch Wavefront Offset is set by SPI on a per-wavefront basis, which is why
   its value cannot be included with the Flat Scratch Init value, which is per
   queue.
 | 
						||
4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
 | 
						||
   or (X, Y, Z).
 | 
						||
 | 
						||
The Flat Scratch register pair are adjacent SGPRs so they can be moved as a
64-bit value to the hardware required SGPRn-3 and SGPRn-4 respectively.
 | 
						||
 | 
						||
The global segment can be accessed either using buffer instructions (GFX6 which
 | 
						||
has V# 64-bit address support), flat instructions (GFX7-GFX10), or global
 | 
						||
instructions (GFX9-GFX10).
 | 
						||
 | 
						||
If buffer operations are used then the compiler can generate a V# with the
 | 
						||
following properties:
 | 
						||
 | 
						||
* base address of 0
 | 
						||
* no swizzle
 | 
						||
* ATC: 1 if IOMMU present (such as APU)
 | 
						||
* ptr64: 1
 | 
						||
* MTYPE set to support memory coherence that matches the runtime (such as CC for
 | 
						||
  APU and NC for dGPU).
 | 
						||
 | 
						||
.. _amdgpu-amdhsa-kernel-prolog:
 | 
						||
 | 
						||
Kernel Prolog
 | 
						||
~~~~~~~~~~~~~
 | 
						||
 | 
						||
.. _amdgpu-amdhsa-m0:
 | 
						||
 | 
						||
M0
 | 
						||
++
 | 
						||
 | 
						||
GFX6-GFX8
 | 
						||
  The M0 register must be initialized with a value of at least the total LDS
  size if the kernel may access LDS via DS or flat operations. The total LDS
  size is available in the dispatch packet. Alternatively, M0 can be set to the
  maximum possible LDS size for the given target (0x7FFF for GFX6 and 0xFFFF
  for GFX7-GFX8).
 | 
						||
GFX9-GFX10
 | 
						||
  The M0 register is not used for range checking LDS accesses and so does not
 | 
						||
  need to be initialized in the prolog.
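
As a rough illustration of the rule above, the following Python sketch picks the
value a prolog could place in M0. The function name and the integer ``gfx``
parameter are assumptions made for this example only:

.. code-block:: python

  from typing import Optional


  def m0_prolog_value(gfx: int, total_lds_size: Optional[int] = None) -> Optional[int]:
      """Return a value to initialize M0 with, or None if no setup is needed."""
      if gfx in (9, 10):
          # M0 is not used for range checking LDS accesses on GFX9-GFX10.
          return None
      if gfx not in (6, 7, 8):
          raise ValueError("this sketch only covers GFX6-GFX10")
      # Maximum LDS value for the target, usable when the size is not to hand.
      max_lds = 0x7FFF if gfx == 6 else 0xFFFF
      return max_lds if total_lds_size is None else total_lds_size


  print(m0_prolog_value(6), m0_prolog_value(8, 4096), m0_prolog_value(10))
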
 | 
						||
 | 
						||
.. _amdgpu-amdhsa-flat-scratch:
 | 
						||
 | 
						||
Flat Scratch
 | 
						||
++++++++++++
 | 
						||
 | 
						||
If the kernel may use flat operations to access scratch memory, the prolog code
 | 
						||
must set up the FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
 | 
						||
are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch
 | 
						||
Wavefront Offset SGPR registers (see
 | 
						||
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
 | 
						||
 | 
						||
GFX6
 | 
						||
  Flat scratch is not supported.
 | 
						||
 | 
						||
GFX7-GFX8
 | 
						||
 | 
						||
  1. The low word of Flat Scratch Init is the 32-bit byte offset from
 | 
						||
     ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
 | 
						||
     being managed by SPI for the queue executing the kernel dispatch. This is
 | 
						||
     the same value used in the Scratch Segment Buffer V# base address. The
 | 
						||
     prolog must add the value of Scratch Wavefront Offset to get the
 | 
						||
     wavefront's byte scratch backing memory offset from
 | 
						||
     ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since FLAT_SCRATCH_HI is in units of 256
     bytes, the offset must be right shifted by 8 before moving into
     FLAT_SCRATCH_HI.
 | 
						||
  2. The second word of Flat Scratch Init is the 32-bit byte size of a single
     work-item's scratch memory usage. This is directly loaded from the kernel
 | 
						||
     dispatch packet Private Segment Byte Size and rounded up to a multiple of
 | 
						||
     DWORD. Having CP load it once avoids loading it at the beginning of every
 | 
						||
     wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT
 | 
						||
     SCRATCH SIZE.
 | 
						||
 | 
						||
GFX9-GFX10
 | 
						||
  The Flat Scratch Init is the 64-bit address of the base of scratch backing
 | 
						||
  memory being managed by SPI for the queue executing the kernel dispatch. The
 | 
						||
  prolog must add the value of Scratch Wavefront Offset and move the result to
  the FLAT_SCRATCH pair for use as the flat scratch base in flat memory
  instructions.
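
The arithmetic performed by the prolog can be sketched in Python as follows.
This is an illustration under stated assumptions, not the code the backend
emits: Flat Scratch Init is modeled as one 64-bit integer with its low word in
bits 31:0, the function names are hypothetical, and the results are returned as
named values rather than being assigned to a particular half of the
FLAT_SCRATCH pair:

.. code-block:: python

  MASK32 = 0xFFFFFFFF
  MASK64 = (1 << 64) - 1


  def flat_scratch_gfx7_gfx8(flat_scratch_init: int, scratch_wavefront_offset: int) -> dict:
      """Values the GFX7-GFX8 prolog derives from the two input SGPR values."""
      queue_byte_offset = flat_scratch_init & MASK32            # low word
      workitem_byte_size = (flat_scratch_init >> 32) & MASK32   # second word
      # Add the per-wavefront offset to get this wavefront's byte offset.
      wave_byte_offset = (queue_byte_offset + scratch_wavefront_offset) & MASK32
      return {
          # The hardware holds the offset in units of 256 bytes.
          "offset_in_256_byte_units": wave_byte_offset >> 8,
          "size_in_bytes": workitem_byte_size,
      }


  def flat_scratch_gfx9_gfx10(flat_scratch_init: int, scratch_wavefront_offset: int) -> int:
      """64-bit flat scratch base for the wavefront on GFX9-GFX10."""
      return (flat_scratch_init + scratch_wavefront_offset) & MASK64


  print(flat_scratch_gfx7_gfx8((64 << 32) | 0x10000, 0x300))
  print(hex(flat_scratch_gfx9_gfx10(0x100000000, 0x400)))
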
 | 
						||
 | 
						||
.. _amdgpu-amdhsa-memory-model:
 | 
						||
 | 
						||
Memory Model
 | 
						||
~~~~~~~~~~~~
 | 
						||
 | 
						||
This section describes the mapping of the LLVM memory model onto AMDGPU machine
code (see :ref:`memmodel`).
 | 
						||
 | 
						||
The AMDGPU backend supports the memory synchronization scopes specified in
 | 
						||
:ref:`amdgpu-memory-scopes`.
 | 
						||
 | 
						||
The code sequences used to implement the memory model are defined in table
 | 
						||
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx10-table`.
 | 
						||
 | 
						||
The sequences specify the order of instructions that a single thread must
 | 
						||
execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
 | 
						||
to other memory instructions executed by the same thread. This allows them to be
 | 
						||
moved earlier or later which can allow them to be combined with other instances
 | 
						||
of the same instruction, or hoisted/sunk out of loops to improve
 | 
						||
performance. Only the instructions related to the memory model are given;
 | 
						||
additional ``s_waitcnt`` instructions are required to ensure registers are
 | 
						||
defined before being used. These may be able to be combined with the memory
 | 
						||
model ``s_waitcnt`` instructions as described above.
 | 
						||
 | 
						||
The AMDGPU backend supports the following memory models:
 | 
						||
 | 
						||
  HSA Memory Model [HSA]_
 | 
						||
    The HSA memory model uses a single happens-before relation for all address
 | 
						||
    spaces (see :ref:`amdgpu-address-spaces`).
 | 
						||
  OpenCL Memory Model [OpenCL]_
 | 
						||
    The OpenCL memory model has separate happens-before relations for the
 | 
						||
    global and local address spaces. Only a fence specifying both global and
 | 
						||
    local address space, and seq_cst instructions join the relationships. Since
 | 
						||
    the LLVM ``fence`` instruction does not allow an address space to be
    specified, the OpenCL fence has to conservatively assume both local and
    global address spaces were specified. However, optimizations can often be
 | 
						||
    done to eliminate the additional ``s_waitcnt`` instructions when there are
 | 
						||
    no intervening memory instructions which access the corresponding address
 | 
						||
    space. The code sequences in the table indicate what can be omitted for the
 | 
						||
    OpenCL memory model. The target triple environment is used to determine if
    the source language is OpenCL (see :ref:`amdgpu-opencl`).
 | 
						||
 | 
						||
``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
 | 
						||
operations.
 | 
						||
 | 
						||
``buffer/global/flat_load/store/atomic`` instructions to global memory are
 | 
						||
termed vector memory operations.
 | 
						||
 | 
						||
For GFX6-GFX9:
 | 
						||
 | 
						||
* Each agent has multiple shader arrays (SA).
 | 
						||
* Each SA has multiple compute units (CU).
 | 
						||
* Each CU has multiple SIMDs that execute wavefronts.
 | 
						||
* The wavefronts for a single work-group are executed in the same CU but may be
 | 
						||
  executed by different SIMDs.
 | 
						||
* Each CU has a single LDS memory shared by the wavefronts of the work-groups
 | 
						||
  executing on it.
 | 
						||
* All LDS operations of a CU are performed as wavefront wide operations in a
 | 
						||
  global order and involve no caching. Completion is reported to a wavefront in
 | 
						||
  execution order.
 | 
						||
* The LDS memory has multiple request queues shared by the SIMDs of a
 | 
						||
  CU. Therefore, the LDS operations performed by different wavefronts of a
 | 
						||
  work-group can be reordered relative to each other, which can result in
 | 
						||
  reordering the visibility of vector memory operations with respect to LDS
 | 
						||
  operations of other wavefronts in the same work-group. A ``s_waitcnt
 | 
						||
  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
 | 
						||
  vector memory operations between wavefronts of a work-group, but not between
 | 
						||
  operations performed by the same wavefront.
 | 
						||
* The vector memory operations are performed as wavefront wide operations and
 | 
						||
  completion is reported to a wavefront in execution order. The exception is
 | 
						||
  that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
 | 
						||
  vector memory order if they access LDS memory, and out of LDS operation order
 | 
						||
  if they access global memory.
 | 
						||
* The vector memory operations access a single vector L1 cache shared by all
 | 
						||
  SIMDs of a CU. Therefore, no special action is required for coherence between the
 | 
						||
  lanes of a single wavefront, or for coherence between wavefronts in the same
 | 
						||
  work-group. A ``buffer_wbinvl1_vol`` is required for coherence between
 | 
						||
  wavefronts executing in different work-groups as they may be executing on
 | 
						||
  different CUs.
 | 
						||
* The scalar memory operations access a scalar L1 cache shared by all wavefronts
 | 
						||
  on a group of CUs. The scalar and vector L1 caches are not coherent. However,
 | 
						||
  scalar operations are used in a restricted way so do not impact the memory
 | 
						||
  model. See :ref:`amdgpu-address-spaces`.
 | 
						||
* The vector and scalar memory operations use an L2 cache shared by all CUs on
 | 
						||
  the same agent.
 | 
						||
* The L2 cache has independent channels to service disjoint ranges of virtual
 | 
						||
  addresses.
 | 
						||
* Each CU has a separate request queue per channel. Therefore, the vector and
 | 
						||
  scalar memory operations performed by wavefronts executing in different
 | 
						||
  work-groups (which may be executing on different CUs) of an agent can be
 | 
						||
  reordered relative to each other. A ``s_waitcnt vmcnt(0)`` is required to
 | 
						||
  ensure synchronization between vector memory operations of different CUs. It
 | 
						||
  ensures a previous vector memory operation has completed before executing a
 | 
						||
  subsequent vector memory or LDS operation and so can be used to meet the
 | 
						||
  requirements of acquire and release.
 | 
						||
* The L2 cache can be kept coherent with other agents on some targets, or ranges
 | 
						||
  of virtual addresses can be set up to bypass it to ensure system coherence.
 | 
						||
 | 
						||
For GFX10:
 | 
						||
 | 
						||
* Each agent has multiple shader arrays (SA).
 | 
						||
* Each SA has multiple work-group processors (WGP).
 | 
						||
* Each WGP has multiple compute units (CU).
 | 
						||
* Each CU has multiple SIMDs that execute wavefronts.
 | 
						||
* The wavefronts for a single work-group are executed in the same
 | 
						||
  WGP. In CU wavefront execution mode the wavefronts may be executed by
 | 
						||
  different SIMDs in the same CU. In WGP wavefront execution mode the
 | 
						||
  wavefronts may be executed by different SIMDs in different CUs in the same
 | 
						||
  WGP.
 | 
						||
* Each WGP has a single LDS memory shared by the wavefronts of the work-groups
 | 
						||
  executing on it.
 | 
						||
* All LDS operations of a WGP are performed as wavefront wide operations in a
 | 
						||
  global order and involve no caching. Completion is reported to a wavefront in
 | 
						||
  execution order.
 | 
						||
* The LDS memory has multiple request queues shared by the SIMDs of a
 | 
						||
  WGP. Therefore, the LDS operations performed by different wavefronts of a
 | 
						||
  work-group can be reordered relative to each other, which can result in
 | 
						||
  reordering the visibility of vector memory operations with respect to LDS
 | 
						||
  operations of other wavefronts in the same work-group. A ``s_waitcnt
 | 
						||
  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
 | 
						||
  vector memory operations between wavefronts of a work-group, but not between
 | 
						||
  operations performed by the same wavefront.
 | 
						||
* The vector memory operations are performed as wavefront wide operations.
 | 
						||
  Completion of load/store/sample operations is reported to a wavefront in
 | 
						||
  execution order of other load/store/sample operations performed by that
 | 
						||
  wavefront.
 | 
						||
* The vector memory operations access a vector L0 cache. There is a single L0
 | 
						||
  cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no
 | 
						||
  special action is required for coherence between the lanes of a single
 | 
						||
  wavefront. However, a ``BUFFER_GL0_INV`` is required for coherence between
 | 
						||
  wavefronts executing in the same work-group as they may be executing on SIMDs
 | 
						||
  of different CUs that access different L0s. A ``BUFFER_GL0_INV`` is also
 | 
						||
  required for coherence between wavefronts executing in different work-groups
 | 
						||
  as they may be executing on different WGPs.
 | 
						||
* The scalar memory operations access a scalar L0 cache shared by all wavefronts
 | 
						||
  on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
 | 
						||
  operations are used in a restricted way so do not impact the memory model. See
 | 
						||
  :ref:`amdgpu-address-spaces`.
 | 
						||
* The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on
 | 
						||
  the same SA. Therefore, no special action is required for coherence between
 | 
						||
  the wavefronts of a single work-group. However, a ``BUFFER_GL1_INV`` is
 | 
						||
  required for coherence between wavefronts executing in different work-groups
 | 
						||
  as they may be executing on different SAs that access different L1s.
 | 
						||
* The L1 caches have independent quadrants to service disjoint ranges of virtual
 | 
						||
  addresses.
 | 
						||
* Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
 | 
						||
  vector and scalar memory operations performed by different wavefronts, whether
 | 
						||
  executing in the same or different work-groups (which may be executing on
 | 
						||
  different CUs accessing different L0s), can be reordered relative to each
 | 
						||
  other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure
 | 
						||
  synchronization between vector memory operations of different wavefronts. It
 | 
						||
  ensures a previous vector memory operation has completed before executing a
 | 
						||
  subsequent vector memory or LDS operation and so can be used to meet the
 | 
						||
  requirements of acquire, release and sequential consistency.
 | 
						||
* The L1 caches use an L2 cache shared by all SAs on the same agent.
 | 
						||
* The L2 cache has independent channels to service disjoint ranges of virtual
 | 
						||
  addresses.
 | 
						||
* Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
 | 
						||
  quadrant has a separate request queue per L2 channel. Therefore, the vector
 | 
						||
  and scalar memory operations performed by wavefronts executing in different
 | 
						||
  work-groups (which may be executing on different SAs) of an agent can be
 | 
						||
  reordered relative to each other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is
 | 
						||
  required to ensure synchronization between vector memory operations of
 | 
						||
  different SAs. It ensures a previous vector memory operation has completed
  before executing a subsequent vector memory operation and so can be used to
  meet the requirements of acquire, release and sequential consistency.
 | 
						||
* The L2 cache can be kept coherent with other agents on some targets, or ranges
 | 
						||
  of virtual addresses can be set up to bypass it to ensure system coherence.
 | 
						||
 | 
						||
Private address space uses ``buffer_load/store`` using the scratch V#
 | 
						||
(GFX6-GFX8), or ``scratch_load/store`` (GFX9-GFX10). Since only a single thread
 | 
						||
is accessing the memory, atomic memory orderings are not meaningful and all
 | 
						||
accesses are treated as non-atomic.
 | 
						||
 | 
						||
Constant address space uses ``buffer/global_load`` instructions (or equivalent
 | 
						||
scalar memory instructions). Since the constant address space contents do not
 | 
						||
change during the execution of a kernel dispatch it is not legal to perform
 | 
						||
stores, and atomic memory orderings are not meaningful and all accesses are
 | 
						||
treated as non-atomic.
 | 
						||
 | 
						||
A memory synchronization scope wider than work-group is not meaningful for the
 | 
						||
group (LDS) address space and is treated as work-group.
 | 
						||
 | 
						||
The memory model does not support the region address space which is treated as
 | 
						||
non-atomic.
 | 
						||
 | 
						||
Acquire memory ordering is not meaningful on store atomic instructions and is
 | 
						||
treated as non-atomic.
 | 
						||
 | 
						||
Release memory ordering is not meaningful on load atomic instructions and is
 | 
						||
treated as non-atomic.
 | 
						||
 | 
						||
Acquire-release memory ordering is not meaningful on load or store atomic
 | 
						||
instructions and is treated as acquire and release respectively.
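
The normalization rules in the preceding paragraphs can be summarized with a
small Python sketch. The string spellings of instructions, orderings, scopes
and address spaces, and both function names, are assumptions chosen for this
example:

.. code-block:: python

  def effective_ordering(instr: str, ordering: str) -> str:
      """Ordering actually applied to a load atomic or store atomic."""
      if instr == "load atomic":
          # Release is not meaningful on a load; acq_rel is treated as acquire.
          return {"release": "non-atomic", "acq_rel": "acquire"}.get(ordering, ordering)
      if instr == "store atomic":
          # Acquire is not meaningful on a store; acq_rel is treated as release.
          return {"acquire": "non-atomic", "acq_rel": "release"}.get(ordering, ordering)
      return ordering  # other instructions are left unchanged in this sketch


  def effective_scope(address_space: str, scope: str) -> str:
      """Scope actually applied for the given address space."""
      # Scopes wider than work-group are not meaningful for the LDS address space.
      if address_space == "local" and scope in ("agent", "system"):
          return "workgroup"
      return scope


  print(effective_ordering("load atomic", "acq_rel"), effective_scope("local", "system"))
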
 | 
						||
 | 
						||
The AMDGPU backend only uses scalar memory operations to access memory that is
 | 
						||
proven to not change during the execution of the kernel dispatch. This includes
 | 
						||
constant address space and global address space for program scope const
 | 
						||
variables. Therefore the kernel machine code does not have to maintain the
 | 
						||
scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
 | 
						||
and vector L1 caches are invalidated between kernel dispatches by CP since
 | 
						||
constant address space data may change between kernel dispatch executions. See
 | 
						||
:ref:`amdgpu-address-spaces`.
 | 
						||
 | 
						||
The one exception is if scalar writes are used to spill SGPR registers. In this
 | 
						||
case the AMDGPU backend ensures the memory location used to spill is never
 | 
						||
accessed by vector memory operations at the same time. If scalar writes are used
 | 
						||
then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
 | 
						||
return since the locations may be used for vector memory instructions by a
 | 
						||
future wavefront that uses the same scratch area, or a function call that
 | 
						||
creates a frame at the same address, respectively. There is no need for a
 | 
						||
``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
 | 
						||
 | 
						||
For GFX6-GFX9, scratch backing memory (which is used for the private address
 | 
						||
space) is accessed with MTYPE NC_NV (non-coherent non-volatile). Since the
 | 
						||
private address space is only accessed by a single thread, and is always
 | 
						||
write-before-read, there is never a need to invalidate these entries from the L1
 | 
						||
cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
 | 
						||
volatile cache lines.
 | 
						||
 | 
						||
For GFX10, scratch backing memory (which is used for the private address space)
 | 
						||
is accessed with MTYPE NC (non-coherent). Since the private address space is
 | 
						||
only accessed by a single thread, and is always write-before-read, there is
 | 
						||
never a need to invalidate these entries from the L0 or L1 caches.
 | 
						||
 | 
						||
For GFX10, wavefronts are executed in native mode with in-order reporting of
 | 
						||
loads and sample instructions. In this mode vmcnt reports completion of load,
 | 
						||
atomic with return and sample instructions in order, and the vscnt reports the
 | 
						||
completion of store and atomic without return in order. See ``MEM_ORDERED``
 | 
						||
field in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
 | 
						||
 | 
						||
In GFX10, wavefronts can be executed in WGP or CU wavefront execution mode:
 | 
						||
 | 
						||
* In WGP wavefront execution mode the wavefronts of a work-group are executed
 | 
						||
  on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per
 | 
						||
  CU L0 caches is required for work-group synchronization. Also accesses to L1
 | 
						||
  at work-group scope need to be explicitly ordered as the accesses from
 | 
						||
  different CUs are not ordered.
 | 
						||
* In CU wavefront execution mode the wavefronts of a work-group are executed on
 | 
						||
  the SIMDs of a single CU of the WGP. Therefore, all global memory accesses by
 | 
						||
  the work-group access the same L0 which in turn ensures L1 accesses are
 | 
						||
  ordered and so do not require explicit management of the caches for
 | 
						||
  work-group synchronization.
 | 
						||
 | 
						||
See ``WGP_MODE`` field in
 | 
						||
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table` and
 | 
						||
:ref:`amdgpu-target-features`.
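
The vector cache invalidates implied by the hardware descriptions above can be
condensed into a Python sketch. It is one reading of the prose, not a
substitute for the code sequence table: the function name is hypothetical,
system scope is assumed to need the same invalidates as agent scope, and the
order of the two GFX10 invalidates is an assumption:

.. code-block:: python

  def acquire_cache_invalidates(gfx: int, scope: str, wgp_mode: bool = False) -> list:
      """Vector cache invalidates needed to see other wavefronts' writes."""
      if not 6 <= gfx <= 10:
          raise ValueError("this sketch only covers GFX6-GFX10")
      if scope in ("singlethread", "wavefront"):
          return []
      if scope == "workgroup":
          # GFX6-GFX9: a work-group runs on one CU, which has one vector L1.
          # GFX10: only WGP mode spreads a work-group across CUs with separate L0s.
          return ["buffer_gl0_inv"] if gfx == 10 and wgp_mode else []
      if scope in ("agent", "system"):
          # Other work-groups may run on different CUs, WGPs or SAs.
          if gfx == 10:
              return ["buffer_gl0_inv", "buffer_gl1_inv"]
          return ["buffer_wbinvl1_vol"]
      raise ValueError("unknown scope: " + scope)


  print(acquire_cache_invalidates(10, "workgroup", wgp_mode=True))
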
 | 
						||
 | 
						||
On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
 | 
						||
to invalidate the L2 cache. For GFX6-GFX9, this also causes it to be treated as
 | 
						||
non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
 | 
						||
(cache coherent) and so the L2 cache will be coherent with the CPU and other
 | 
						||
agents.
 | 
						||
 | 
						||
  .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX10
 | 
						||
     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx10-table
 | 
						||
 | 
						||
     ============ ============ ============== ========== =============================== ==================================
 | 
						||
     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code             AMDGPU Machine Code
 | 
						||
                  Ordering     Sync Scope     Address    GFX6-9                          GFX10
 | 
						||
                                              Space
 | 
						||
     ============ ============ ============== ========== =============================== ==================================
 | 
						||
     **Non-Atomic**
 | 
						||
     ----------------------------------------------------------------------------------------------------------------------
 | 
						||
     load         *none*       *none*         - global   - !volatile & !nontemporal      - !volatile & !nontemporal
 | 
						||
                                              - generic
 | 
						||
                                              - private    1. buffer/global/flat_load      1. buffer/global/flat_load
 | 
						||
                                              - constant
 | 
						||
                                                         - volatile & !nontemporal       - volatile & !nontemporal
 | 
						||
 | 
						||
                                                           1. buffer/global/flat_load      1. buffer/global/flat_load
 | 
						||
                                                              glc=1                           glc=1 dlc=1
 | 
						||
 | 
						||
                                                         - nontemporal                   - nontemporal
 | 
						||
 | 
						||
                                                           1. buffer/global/flat_load      1. buffer/global/flat_load
 | 
						||
                                                              glc=1 slc=1                     slc=1
 | 
						||
 | 
						||
     load         *none*       *none*         - local    1. ds_load                      1. ds_load
 | 
						||
     store        *none*       *none*         - global   - !nontemporal                  - !nontemporal
 | 
						||
                                              - generic
 | 
						||
                                              - private    1. buffer/global/flat_store     1. buffer/global/flat_store
 | 
						||
                                              - constant
 | 
						||
                                                         - nontemporal                   - nontemporal
 | 
						||
 | 
						||
                                                           1. buffer/global/flat_store     1. buffer/global/flat_store
                                                              glc=1 slc=1                     slc=1
 | 
						||
 | 
						||
     store        *none*       *none*         - local    1. ds_store                     1. ds_store
 | 
						||
     **Unordered Atomic**
 | 
						||
     ----------------------------------------------------------------------------------------------------------------------
 | 
						||
     load atomic  unordered    *any*          *any*      *Same as non-atomic*.           *Same as non-atomic*.
 | 
						||
     store atomic unordered    *any*          *any*      *Same as non-atomic*.           *Same as non-atomic*.
 | 
						||
     atomicrmw    unordered    *any*          *any*      *Same as monotonic              *Same as monotonic
 | 
						||
                                                         atomic*.                        atomic*.
 | 
						||
     **Monotonic Atomic**
 | 
						||
     ----------------------------------------------------------------------------------------------------------------------
     load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load      1. buffer/global/flat_load
                               - wavefront    - generic
     load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load      1. buffer/global/flat_load
                                              - generic                                     glc=1

                                                                                           - If CU wavefront execution mode, omit glc=1.

     load atomic  monotonic    - singlethread - local    1. ds_load                      1. ds_load
                               - wavefront
                               - workgroup
     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load      1. buffer/global/flat_load
                               - system       - generic     glc=1                           glc=1 dlc=1
     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store     1. buffer/global/flat_store
                               - wavefront    - generic
                               - workgroup
                               - agent
                               - system
     store atomic monotonic    - singlethread - local    1. ds_store                     1. ds_store
                               - wavefront
                               - workgroup
     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic    1. buffer/global/flat_atomic
                               - wavefront    - generic
                               - workgroup
                               - agent
                               - system
     atomicrmw    monotonic    - singlethread - local    1. ds_atomic                    1. ds_atomic
                               - wavefront
                               - workgroup
     **Acquire Atomic**
     ----------------------------------------------------------------------------------------------------------------------
     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load   1. buffer/global/ds/flat_load
                               - wavefront    - local
                                              - generic
     load atomic  acquire      - workgroup    - global   1. buffer/global/flat_load      1. buffer/global_load glc=1

                                                                                           - If CU wavefront execution mode, omit glc=1.

                                                                                         2. s_waitcnt vmcnt(0)

                                                                                           - If CU wavefront execution mode, omit.
                                                                                           - Must happen before
                                                                                             the following buffer_gl0_inv
                                                                                             and before any following
                                                                                             global/generic
                                                                                             load/load
                                                                                             atomic/store/store
                                                                                             atomic/atomicrmw.

                                                                                         3. buffer_gl0_inv

                                                                                           - If CU wavefront execution mode, omit.
                                                                                           - Ensures that
                                                                                             following
                                                                                             loads will not see
                                                                                             stale data.

     load atomic  acquire      - workgroup    - local    1. ds_load                      1. ds_load
                                                         2. s_waitcnt lgkmcnt(0)         2. s_waitcnt lgkmcnt(0)

                                                           - If OpenCL, omit.              - If OpenCL, omit.
                                                           - Must happen before            - Must happen before
                                                             any following                   the following buffer_gl0_inv
                                                             global/generic                  and before any following
                                                             load/load                       global/generic load/load
                                                             atomic/store/store              atomic/store/store
                                                             atomic/atomicrmw.               atomic/atomicrmw.
                                                           - Ensures any                   - Ensures any
                                                             following global                following global
                                                             data read is no                 data read is no
                                                             older than the load             older than the load
                                                             atomic value being              atomic value being
                                                             acquired.                       acquired.

                                                                                         3. buffer_gl0_inv

                                                                                           - If CU wavefront execution mode, omit.
                                                                                           - If OpenCL, omit.
                                                                                           - Ensures that
                                                                                             following
                                                                                             loads will not see
                                                                                             stale data.

     load atomic  acquire      - workgroup    - generic  1. flat_load                    1. flat_load glc=1

                                                                                           - If CU wavefront execution mode, omit glc=1.

                                                         2. s_waitcnt lgkmcnt(0)         2. s_waitcnt lgkmcnt(0) &
                                                                                            vmcnt(0)

                                                                                           - If CU wavefront execution mode, omit vmcnt.
                                                           - If OpenCL, omit.              - If OpenCL, omit
                                                                                             lgkmcnt(0).
                                                           - Must happen before            - Must happen before
                                                             any following                   the following
                                                             global/generic                  buffer_gl0_inv and any
                                                             load/load                       following global/generic
                                                             atomic/store/store              load/load
                                                             atomic/atomicrmw.               atomic/store/store
                                                                                             atomic/atomicrmw.
                                                           - Ensures any                   - Ensures any
                                                             following global                following global
                                                             data read is no                 data read is no
                                                             older than the load             older than the load
                                                             atomic value being              atomic value being
                                                             acquired.                       acquired.

                                                                                         3. buffer_gl0_inv

                                                                                           - If CU wavefront execution mode, omit.
                                                                                           - Ensures that
                                                                                             following
                                                                                             loads will not see
                                                                                             stale data.

     load atomic  acquire      - agent        - global   1. buffer/global/flat_load      1. buffer/global_load
                               - system                     glc=1                           glc=1 dlc=1
                                                         2. s_waitcnt vmcnt(0)           2. s_waitcnt vmcnt(0)

                                                           - Must happen before            - Must happen before
                                                             following                       following
                                                             buffer_wbinvl1_vol.             buffer_gl*_inv.
                                                           - Ensures the load              - Ensures the load
                                                             has completed                   has completed
                                                             before invalidating             before invalidating
                                                             the cache.                      the caches.

                                                         3. buffer_wbinvl1_vol           3. buffer_gl0_inv;
                                                                                            buffer_gl1_inv

                                                           - Must happen before            - Must happen before
                                                             any following                   any following
                                                             global/generic                  global/generic
                                                             load/load                       load/load
                                                             atomic/atomicrmw.               atomic/atomicrmw.
                                                           - Ensures that                  - Ensures that
                                                             following                       following
                                                             loads will not see              loads will not see
                                                             stale global data.              stale global data.

     load atomic  acquire      - agent        - generic  1. flat_load glc=1              1. flat_load glc=1 dlc=1
                               - system                  2. s_waitcnt vmcnt(0) &         2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)                      lgkmcnt(0)

                                                           - If OpenCL, omit               - If OpenCL, omit
                                                             lgkmcnt(0).                     lgkmcnt(0).
                                                           - Must happen before            - Must happen before
                                                             following                       following
                                                             buffer_wbinvl1_vol.             buffer_gl*_inv.
                                                           - Ensures the flat_load         - Ensures the flat_load
                                                             has completed                   has completed
                                                             before invalidating             before invalidating
                                                             the cache.                      the caches.

                                                         3. buffer_wbinvl1_vol           3. buffer_gl0_inv;
                                                                                            buffer_gl1_inv

                                                           - Must happen before            - Must happen before
                                                             any following                   any following
                                                             global/generic                  global/generic
                                                             load/load                       load/load
                                                             atomic/atomicrmw.               atomic/atomicrmw.
                                                           - Ensures that                  - Ensures that
                                                             following loads                 following loads
                                                             will not see stale              will not see stale
                                                             global data.                    global data.

     atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic 1. buffer/global/ds/flat_atomic
                               - wavefront    - local
                                              - generic
     atomicrmw    acquire      - workgroup    - global   1. buffer/global/flat_atomic    1. buffer/global_atomic
                                                                                         2. s_waitcnt vm/vscnt(0)

                                                                                           - If CU wavefront execution mode, omit.
                                                                                           - Use vmcnt if atomic with
                                                                                             return and vscnt if atomic
                                                                                             with no-return.
                                                                                           - Must happen before
                                                                                             the following buffer_gl0_inv
                                                                                             and before any following
                                                                                             global/generic
                                                                                             load/load
                                                                                             atomic/store/store
                                                                                             atomic/atomicrmw.

                                                                                         3. buffer_gl0_inv

                                                                                           - If CU wavefront execution mode, omit.
                                                                                           - Ensures that
                                                                                             following
                                                                                             loads will not see
                                                                                             stale data.

     atomicrmw    acquire      - workgroup    - local    1. ds_atomic                    1. ds_atomic
                                                         2. s_waitcnt lgkmcnt(0)         2. s_waitcnt lgkmcnt(0)

                                                           - If OpenCL, omit.              - If OpenCL, omit.
                                                           - Must happen before            - Must happen before
                                                             any following                   the following
                                                             global/generic                  buffer_gl0_inv.
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures any                   - Ensures any
                                                             following global                following global
                                                             data read is no                 data read is no
                                                             older than the                  older than the
                                                             atomicrmw value                 atomicrmw value
                                                             being acquired.                 being acquired.

                                                                                         3. buffer_gl0_inv

                                                                                           - If OpenCL, omit.
                                                                                           - Ensures that
                                                                                             following
                                                                                             loads will not see
                                                                                             stale data.

     atomicrmw    acquire      - workgroup    - generic  1. flat_atomic                  1. flat_atomic
                                                         2. s_waitcnt lgkmcnt(0)         2. s_waitcnt lgkmcnt(0) &
                                                                                            vm/vscnt(0)

                                                                                           - If CU wavefront execution mode, omit vm/vscnt.
                                                           - If OpenCL, omit.              - If OpenCL, omit
                                                                                             lgkmcnt(0).
                                                                                           - Use vmcnt if atomic with
                                                                                             return and vscnt if atomic
                                                                                             with no-return.
                                                           - Must happen before            - Must happen before
                                                             any following                   the following
                                                             global/generic                  buffer_gl0_inv.
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures any                   - Ensures any
                                                             following global                following global
                                                             data read is no                 data read is no
                                                             older than the                  older than the
                                                             atomicrmw value                 atomicrmw value
                                                             being acquired.                 being acquired.

                                                                                         3. buffer_gl0_inv

                                                                                           - If CU wavefront execution mode, omit.
                                                                                           - Ensures that
                                                                                             following
                                                                                             loads will not see
                                                                                             stale data.

     atomicrmw    acquire      - agent        - global   1. buffer/global/flat_atomic    1. buffer/global_atomic
                               - system                  2. s_waitcnt vmcnt(0)           2. s_waitcnt vm/vscnt(0)

                                                                                           - Use vmcnt if atomic with
                                                                                             return and vscnt if atomic
                                                                                             with no-return.
                                                           - Must happen before            - Must happen before
                                                             following                       following
                                                             buffer_wbinvl1_vol.             buffer_gl*_inv.
                                                           - Ensures the                   - Ensures the
                                                             atomicrmw has                   atomicrmw has
                                                             completed before                completed before
                                                             invalidating the                invalidating the
                                                             cache.                          caches.

                                                         3. buffer_wbinvl1_vol           3. buffer_gl0_inv;
                                                                                            buffer_gl1_inv

                                                           - Must happen before            - Must happen before
                                                             any following                   any following
                                                             global/generic                  global/generic
                                                             load/load                       load/load
                                                             atomic/atomicrmw.               atomic/atomicrmw.
                                                           - Ensures that                  - Ensures that
                                                             following loads                 following loads
                                                             will not see stale              will not see stale
                                                             global data.                    global data.

     atomicrmw    acquire      - agent        - generic  1. flat_atomic                  1. flat_atomic
                               - system                  2. s_waitcnt vmcnt(0) &         2. s_waitcnt vm/vscnt(0) &
                                                            lgkmcnt(0)                      lgkmcnt(0)

                                                           - If OpenCL, omit               - If OpenCL, omit
                                                             lgkmcnt(0).                     lgkmcnt(0).
                                                                                           - Use vmcnt if atomic with
                                                                                             return and vscnt if atomic
                                                                                             with no-return.
                                                           - Must happen before            - Must happen before
                                                             following                       following
                                                             buffer_wbinvl1_vol.             buffer_gl*_inv.
                                                           - Ensures the                   - Ensures the
                                                             atomicrmw has                   atomicrmw has
                                                             completed before                completed before
                                                             invalidating the                invalidating the
                                                             cache.                          caches.

                                                         3. buffer_wbinvl1_vol           3. buffer_gl0_inv;
                                                                                            buffer_gl1_inv

                                                           - Must happen before            - Must happen before
                                                             any following                   any following
                                                             global/generic                  global/generic
                                                             load/load                       load/load
                                                             atomic/atomicrmw.               atomic/atomicrmw.
                                                           - Ensures that                  - Ensures that
                                                             following loads                 following loads
                                                             will not see stale              will not see stale
                                                             global data.                    global data.

     fence        acquire      - singlethread *none*     *none*                          *none*
                               - wavefront
     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
                                                                                            vmcnt(0) & vscnt(0)

                                                                                           - If CU wavefront execution mode, omit vmcnt and
                                                                                             vscnt.
                                                           - If OpenCL and                 - If OpenCL and
                                                             address space is                address space is
                                                             not generic, omit.              not generic, omit
                                                                                             lgkmcnt(0).
                                                                                           - If OpenCL and
                                                                                             address space is
                                                                                             local, omit
                                                                                             vmcnt(0) and vscnt(0).
                                                           - However, since LLVM           - However, since LLVM
 | 
						||
                                                             currently has no                currently has no
 | 
						||
                                                             address space on                address space on
 | 
						||
                                                             the fence need to               the fence need to
 | 
						||
                                                             conservatively                  conservatively
 | 
						||
                                                             always generate. If             always generate. If
                                                             fence had an                    fence had an
                                                             address space then              address space then
                                                             set to address                  set to address
                                                             space of OpenCL                 space of OpenCL
                                                             fence flag, or to               fence flag, or to
                                                             generic if both                 generic if both
                                                             local and global                local and global
                                                             flags are                       flags are
                                                             specified.                      specified.
                                                           - Must happen after
                                                             any preceding
                                                             local/generic load
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             fence-paired-atomic).
                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than the
                                                             value read by the
                                                             fence-paired-atomic.
                                                                                           - Could be split into
                                                                                             separate s_waitcnt
                                                                                             vmcnt(0), s_waitcnt
                                                                                             vscnt(0) and s_waitcnt
                                                                                             lgkmcnt(0) to allow
                                                                                             them to be
                                                                                             independently moved
                                                                                             according to the
                                                                                             following rules.
                                                                                           - s_waitcnt vmcnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             global/generic load
                                                                                             atomic/
                                                                                             atomicrmw-with-return-value
                                                                                             with an equal or
                                                                                             wider sync scope
                                                                                             and memory ordering
                                                                                             stronger than
                                                                                             unordered (this is
                                                                                             termed the
                                                                                             fence-paired-atomic).
                                                                                           - s_waitcnt vscnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             global/generic
                                                                                             atomicrmw-no-return-value
                                                                                             with an equal or
                                                                                             wider sync scope
                                                                                             and memory ordering
                                                                                             stronger than
                                                                                             unordered (this is
                                                                                             termed the
                                                                                             fence-paired-atomic).
                                                                                           - s_waitcnt lgkmcnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             local/generic load
                                                                                             atomic/atomicrmw
                                                                                             with an equal or
                                                                                             wider sync scope
                                                                                             and memory ordering
                                                                                             stronger than
                                                                                             unordered (this is
                                                                                             termed the
                                                                                             fence-paired-atomic).
                                                                                           - Must happen before
                                                                                             the following
                                                                                             buffer_gl0_inv.
                                                                                           - Ensures that the
                                                                                             fence-paired-atomic
                                                                                             has completed
                                                                                             before invalidating
                                                                                             the
                                                                                             cache. Therefore
                                                                                             any following
                                                                                             locations read must
                                                                                             be no older than
                                                                                             the value read by
                                                                                             the
                                                                                             fence-paired-atomic.
                                                                                         3. buffer_gl0_inv
                                                                                           - If CU wavefront execution mode, omit.
                                                                                           - Ensures that
                                                                                             following
                                                                                             loads will not see
                                                                                             stale data.
     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &       1. s_waitcnt lgkmcnt(0) &
                               - system                     vmcnt(0)                        vmcnt(0) & vscnt(0)
                                                           - If OpenCL and                 - If OpenCL and
                                                             address space is                address space is
                                                             not generic, omit               not generic, omit
                                                             lgkmcnt(0).                     lgkmcnt(0).
                                                                                           - If OpenCL and
                                                                                             address space is
                                                                                             local, omit
                                                                                             vmcnt(0) and vscnt(0).
                                                           - However, since LLVM           - However, since LLVM
                                                             currently has no                currently has no
                                                             address space on                address space on
                                                             the fence, need to              the fence, need to
                                                             conservatively                  conservatively
                                                             always generate                 always generate
                                                             (see comment for                (see comment for
                                                             previous fence).                previous fence).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0) and
                                                             s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic load
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             fence-paired-atomic).
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic load
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             fence-paired-atomic).
                                                           - Must happen before
                                                             the following
                                                             buffer_wbinvl1_vol.
                                                           - Ensures that the
                                                             fence-paired-atomic
                                                             has completed
                                                             before invalidating
                                                             the
                                                             cache. Therefore
                                                             any following
                                                             locations read must
                                                             be no older than
                                                             the value read by
                                                             the
                                                             fence-paired-atomic.
                                                                                           - Could be split into
                                                                                             separate s_waitcnt
                                                                                             vmcnt(0), s_waitcnt
                                                                                             vscnt(0) and s_waitcnt
                                                                                             lgkmcnt(0) to allow
                                                                                             them to be
                                                                                             independently moved
                                                                                             according to the
                                                                                             following rules.
                                                                                           - s_waitcnt vmcnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             global/generic load
                                                                                             atomic/
                                                                                             atomicrmw-with-return-value
                                                                                             with an equal or
                                                                                             wider sync scope
                                                                                             and memory ordering
                                                                                             stronger than
                                                                                             unordered (this is
                                                                                             termed the
                                                                                             fence-paired-atomic).
                                                                                           - s_waitcnt vscnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             global/generic
                                                                                             atomicrmw-no-return-value
                                                                                             with an equal or
                                                                                             wider sync scope
                                                                                             and memory ordering
                                                                                             stronger than
                                                                                             unordered (this is
                                                                                             termed the
                                                                                             fence-paired-atomic).
                                                                                           - s_waitcnt lgkmcnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             local/generic load
                                                                                             atomic/atomicrmw
                                                                                             with an equal or
                                                                                             wider sync scope
                                                                                             and memory ordering
                                                                                             stronger than
                                                                                             unordered (this is
                                                                                             termed the
                                                                                             fence-paired-atomic).
                                                                                           - Must happen before
                                                                                             the following
                                                                                             buffer_gl*_inv.
                                                                                           - Ensures that the
                                                                                             fence-paired-atomic
                                                                                             has completed
                                                                                             before invalidating
                                                                                             the
                                                                                             caches. Therefore
                                                                                             any following
                                                                                             locations read must
                                                                                             be no older than
                                                                                             the value read by
                                                                                             the
                                                                                             fence-paired-atomic.
                                                         2. buffer_wbinvl1_vol           2. buffer_gl0_inv;
                                                                                            buffer_gl1_inv
                                                           - Must happen before any        - Must happen before any
                                                             following global/generic        following global/generic
                                                             load/load                       load/load
                                                             atomic/store/store              atomic/store/store
                                                             atomic/atomicrmw.               atomic/atomicrmw.
                                                           - Ensures that                  - Ensures that
                                                             following loads                 following loads
                                                             will not see stale              will not see stale
                                                             global data.                    global data.
     **Release Atomic**
     ----------------------------------------------------------------------------------------------------------------------
     store atomic release      - singlethread - global   1. buffer/global/ds/flat_store  1. buffer/global/ds/flat_store
                               - wavefront    - local
                                              - generic
     store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
                                                                                            vmcnt(0) & vscnt(0)
                                                                                           - If CU wavefront execution mode, omit vmcnt and
                                                                                             vscnt.
                                                           - If OpenCL, omit.              - If OpenCL, omit
                                                                                             lgkmcnt(0).
                                                           - Must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                                                           - Could be split into
                                                                                             separate s_waitcnt
                                                                                             vmcnt(0), s_waitcnt
                                                                                             vscnt(0) and s_waitcnt
                                                                                             lgkmcnt(0) to allow
                                                                                             them to be
                                                                                             independently moved
                                                                                             according to the
                                                                                             following rules.
                                                                                           - s_waitcnt vmcnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             global/generic load/load
                                                                                             atomic/
                                                                                             atomicrmw-with-return-value.
                                                                                           - s_waitcnt vscnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             global/generic
                                                                                             store/store
                                                                                             atomic/
                                                                                             atomicrmw-no-return-value.
                                                                                           - s_waitcnt lgkmcnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             local/generic
                                                                                             load/store/load
                                                                                             atomic/store
                                                                                             atomic/atomicrmw.
                                                           - Must happen before            - Must happen before
                                                             the following                   the following
                                                             store.                          store.
                                                           - Ensures that all              - Ensures that all
                                                             memory operations               memory operations
                                                             to local have                   have
                                                             completed before                completed before
                                                             performing the                  performing the
                                                             store that is being             store that is being
                                                             released.                       released.

                                                         2. buffer/global/flat_store     2. buffer/global_store
     store atomic release      - workgroup    - local                                    1. s_waitcnt vmcnt(0) & vscnt(0)

                                                                                           - If CU wavefront execution mode, omit.
                                                                                           - If OpenCL, omit.
                                                                                           - Could be split into
                                                                                             separate s_waitcnt
                                                                                             vmcnt(0) and s_waitcnt
                                                                                             vscnt(0) to allow
                                                                                             them to be
                                                                                             independently moved
                                                                                             according to the
                                                                                             following rules.
                                                                                           - s_waitcnt vmcnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             global/generic load/load
                                                                                             atomic/
                                                                                             atomicrmw-with-return-value.
                                                                                           - s_waitcnt vscnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             global/generic
                                                                                             store/store atomic/
                                                                                             atomicrmw-no-return-value.
                                                                                           - Must happen before
                                                                                             the following
                                                                                             store.
                                                                                           - Ensures that all
                                                                                             global memory
                                                                                             operations have
                                                                                             completed before
                                                                                             performing the
                                                                                             store that is being
                                                                                             released.

                                                         1. ds_store                     2. ds_store
     store atomic release      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
                                                                                            vmcnt(0) & vscnt(0)

                                                                                           - If CU wavefront execution mode, omit vmcnt and
                                                                                             vscnt.
                                                           - If OpenCL, omit.              - If OpenCL, omit
                                                                                             lgkmcnt(0).
                                                           - Must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                                                           - Could be split into
                                                                                             separate s_waitcnt
                                                                                             vmcnt(0), s_waitcnt
                                                                                             vscnt(0) and s_waitcnt
                                                                                             lgkmcnt(0) to allow
                                                                                             them to be
                                                                                             independently moved
                                                                                             according to the
                                                                                             following rules.
                                                                                           - s_waitcnt vmcnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             global/generic load/load
                                                                                             atomic/
                                                                                             atomicrmw-with-return-value.
                                                                                           - s_waitcnt vscnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             global/generic
                                                                                             store/store
                                                                                             atomic/
                                                                                             atomicrmw-no-return-value.
                                                                                           - s_waitcnt lgkmcnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             local/generic load/store/load
                                                                                             atomic/store atomic/atomicrmw.
                                                           - Must happen before            - Must happen before
                                                             the following                   the following
                                                             store.                          store.
                                                           - Ensures that all              - Ensures that all
                                                             memory operations               memory operations
                                                             to local have                   have
                                                             completed before                completed before
                                                             performing the                  performing the
                                                             store that is being             store that is being
                                                             released.                       released.

                                                         2. flat_store                   2. flat_store
     store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &       1. s_waitcnt lgkmcnt(0) &
                               - system       - generic     vmcnt(0)                         vmcnt(0) & vscnt(0)

                                                           - If OpenCL, omit               - If OpenCL, omit
                                                             lgkmcnt(0).                     lgkmcnt(0).
                                                           - Could be split into           - Could be split into
                                                             separate s_waitcnt              separate s_waitcnt
                                                             vmcnt(0) and                    vmcnt(0), s_waitcnt vscnt(0)
                                                             s_waitcnt                       and s_waitcnt
                                                             lgkmcnt(0) to allow             lgkmcnt(0) to allow
                                                             them to be                      them to be
                                                             independently moved             independently moved
                                                             according to the                according to the
                                                             following rules.                following rules.
                                                           - s_waitcnt vmcnt(0)            - s_waitcnt vmcnt(0)
                                                             must happen after               must happen after
                                                             any preceding                   any preceding
                                                             global/generic                  global/generic
                                                             load/store/load                 load/load
                                                             atomic/store                    atomic/
                                                             atomic/atomicrmw.               atomicrmw-with-return-value.
                                                                                           - s_waitcnt vscnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             global/generic
                                                                                             store/store atomic/
                                                                                             atomicrmw-no-return-value.
                                                           - s_waitcnt lgkmcnt(0)          - s_waitcnt lgkmcnt(0)
                                                             must happen after               must happen after
                                                             any preceding                   any preceding
                                                             local/generic                   local/generic
                                                             load/store/load                 load/store/load
                                                             atomic/store                    atomic/store
                                                             atomic/atomicrmw.               atomic/atomicrmw.
                                                           - Must happen before            - Must happen before
                                                             the following                   the following
                                                             store.                          store.
                                                           - Ensures that all              - Ensures that all
                                                             memory operations               memory operations
                                                             to memory have                  to memory have
                                                             completed before                completed before
                                                             performing the                  performing the
                                                             store that is being             store that is being
                                                             released.                       released.

                                                         2. buffer/global/ds/flat_store  2. buffer/global/ds/flat_store
     atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic 1. buffer/global/ds/flat_atomic
                               - wavefront    - local
                                              - generic
     atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
                                                                                            vmcnt(0) & vscnt(0)

                                                                                           - If CU wavefront execution mode, omit vmcnt and
                                                                                             vscnt.
                                                           - If OpenCL, omit.

                                                           - Must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                                                           - Could be split into
                                                                                             separate s_waitcnt
                                                                                             vmcnt(0), s_waitcnt
                                                                                             vscnt(0) and s_waitcnt
                                                                                             lgkmcnt(0) to allow
                                                                                             them to be
                                                                                             independently moved
                                                                                             according to the
                                                                                             following rules.
                                                                                           - s_waitcnt vmcnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             global/generic load/load
                                                                                             atomic/
                                                                                             atomicrmw-with-return-value.
                                                                                           - s_waitcnt vscnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             global/generic
                                                                                             store/store
                                                                                             atomic/
                                                                                             atomicrmw-no-return-value.
                                                                                           - s_waitcnt lgkmcnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             local/generic
                                                                                             load/store/load
                                                                                             atomic/store
                                                                                             atomic/atomicrmw.
                                                           - Must happen before            - Must happen before
                                                             the following                   the following
                                                             atomicrmw.                      atomicrmw.
                                                           - Ensures that all              - Ensures that all
                                                             memory operations               memory operations
                                                             to local have                   have
                                                             completed before                completed before
                                                             performing the                  performing the
                                                             atomicrmw that is               atomicrmw that is
                                                             being released.                 being released.

                                                         2. buffer/global/flat_atomic    2. buffer/global_atomic
     atomicrmw    release      - workgroup    - local                                    1. s_waitcnt vmcnt(0) & vscnt(0)

                                                                                           - If CU wavefront execution mode, omit.
                                                                                           - If OpenCL, omit.
                                                                                           - Could be split into
                                                                                             separate s_waitcnt
                                                                                             vmcnt(0) and s_waitcnt
                                                                                             vscnt(0) to allow
                                                                                             them to be
                                                                                             independently moved
                                                                                             according to the
                                                                                             following rules.
                                                                                           - s_waitcnt vmcnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             global/generic load/load
                                                                                             atomic/
                                                                                             atomicrmw-with-return-value.
                                                                                           - s_waitcnt vscnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             global/generic
                                                                                             store/store atomic/
                                                                                             atomicrmw-no-return-value.
                                                                                           - Must happen before
                                                                                             the following
                                                                                             atomicrmw.
                                                                                           - Ensures that all
                                                                                             global memory
                                                                                             operations have
                                                                                             completed before
                                                                                             performing the
                                                                                             atomicrmw that is
                                                                                             being released.

                                                         1. ds_atomic                    2. ds_atomic
     atomicrmw    release      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
                                                                                            vmcnt(0) & vscnt(0)

                                                                                           - If CU wavefront execution mode, omit vmcnt and
                                                                                             vscnt.
                                                           - If OpenCL, omit.              - If OpenCL, omit
                                                                                             lgkmcnt(0).
                                                           - Must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                                                           - Could be split into
                                                                                             separate s_waitcnt
                                                                                             vmcnt(0), s_waitcnt
                                                                                             vscnt(0) and s_waitcnt
                                                                                             lgkmcnt(0) to allow
                                                                                             them to be
                                                                                             independently moved
                                                                                             according to the
                                                                                             following rules.
                                                                                           - s_waitcnt vmcnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             global/generic load/load
                                                                                             atomic/
                                                                                             atomicrmw-with-return-value.
                                                                                           - s_waitcnt vscnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             global/generic
                                                                                             store/store
                                                                                             atomic/
                                                                                             atomicrmw-no-return-value.
                                                                                           - s_waitcnt lgkmcnt(0)
 | 
						||
                                                                                             must happen after
 | 
						||
                                                                                             any preceding
 | 
						||
                                                                                             local/generic load/store/load
 | 
						||
                                                                                             atomic/store atomic/atomicrmw.
 | 
						||
                                                           - Must happen before            - Must happen before
                                                             the following                   the following
                                                             atomicrmw.                      atomicrmw.
                                                           - Ensures that all              - Ensures that all
                                                             memory operations               memory operations
                                                             to local have                   have
                                                             completed before                completed before
                                                             performing the                  performing the
                                                             atomicrmw that is               atomicrmw that is
                                                             being released.                 being released.

                                                         2. flat_atomic                  2. flat_atomic
     atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &       1. s_waitcnt lgkmcnt(0) &
                               - system       - generic     vmcnt(0)                         vmcnt(0) & vscnt(0)

                                                           - If OpenCL, omit               - If OpenCL, omit
                                                             lgkmcnt(0).                     lgkmcnt(0).
                                                           - Could be split into           - Could be split into
                                                             separate s_waitcnt              separate s_waitcnt
                                                             vmcnt(0) and                    vmcnt(0), s_waitcnt
                                                             s_waitcnt                       vscnt(0) and s_waitcnt
                                                             lgkmcnt(0) to allow             lgkmcnt(0) to allow
                                                             them to be                      them to be
                                                             independently moved             independently moved
                                                             according to the                according to the
                                                             following rules.                following rules.
                                                           - s_waitcnt vmcnt(0)            - s_waitcnt vmcnt(0)
                                                             must happen after               must happen after
                                                             any preceding                   any preceding
                                                             global/generic                  global/generic
                                                             load/store/load                 load/load atomic/
                                                             atomic/store                    atomicrmw-with-return-value.
                                                             atomic/atomicrmw.
                                                                                           - s_waitcnt vscnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             global/generic
                                                                                             store/store atomic/
                                                                                             atomicrmw-no-return-value.
                                                           - s_waitcnt lgkmcnt(0)          - s_waitcnt lgkmcnt(0)
                                                             must happen after               must happen after
                                                             any preceding                   any preceding
                                                             local/generic                   local/generic
                                                             load/store/load                 load/store/load
                                                             atomic/store                    atomic/store
                                                             atomic/atomicrmw.               atomic/atomicrmw.
                                                           - Must happen before            - Must happen before
                                                             the following                   the following
                                                             atomicrmw.                      atomicrmw.
                                                           - Ensures that all              - Ensures that all
                                                             memory operations               memory operations
                                                             to global and local             to global and local
                                                             have completed                  have completed
                                                             before performing               before performing
                                                             the atomicrmw that              the atomicrmw that
                                                             is being released.              is being released.

                                                         2. buffer/global/ds/flat_atomic 2. buffer/global/ds/flat_atomic
     fence        release      - singlethread *none*     *none*                          *none*
                               - wavefront
     fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
                                                                                            vmcnt(0) & vscnt(0)

                                                                                           - If CU wavefront execution mode, omit vmcnt and
                                                                                             vscnt.
                                                           - If OpenCL and                 - If OpenCL and
                                                             address space is                address space is
                                                             not generic, omit.              not generic, omit
                                                                                             lgkmcnt(0).
                                                                                           - If OpenCL and
                                                                                             address space is
                                                                                             local, omit
                                                                                             vmcnt(0) and vscnt(0).
                                                           - However, since LLVM           - However, since LLVM
                                                             currently has no                currently has no
                                                             address space on                address space on
                                                             the fence need to               the fence need to
                                                             conservatively                  conservatively
                                                             always generate. If             always generate. If
                                                             fence had an                    fence had an
                                                             address space then              address space then
                                                             set to address                  set to address
                                                             space of OpenCL                 space of OpenCL
                                                             fence flag, or to               fence flag, or to
                                                             generic if both                 generic if both
                                                             local and global                local and global
                                                             flags are                       flags are
                                                             specified.                      specified.
                                                           - Must happen after
                                                             any preceding
                                                             local/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                                                           - Could be split into
                                                                                             separate s_waitcnt
                                                                                             vmcnt(0), s_waitcnt
                                                                                             vscnt(0) and s_waitcnt
                                                                                             lgkmcnt(0) to allow
                                                                                             them to be
                                                                                             independently moved
                                                                                             according to the
                                                                                             following rules.
                                                                                           - s_waitcnt vmcnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             global/generic
                                                                                             load/load
                                                                                             atomic/
                                                                                             atomicrmw-with-return-value.
                                                                                           - s_waitcnt vscnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             global/generic
                                                                                             store/store atomic/
                                                                                             atomicrmw-no-return-value.
                                                                                           - s_waitcnt lgkmcnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             local/generic
                                                                                             load/store/load
                                                                                             atomic/store atomic/
                                                                                             atomicrmw.
                                                           - Must happen before            - Must happen before
                                                             any following store             any following store
                                                             atomic/atomicrmw                atomic/atomicrmw
                                                             with an equal or                with an equal or
                                                             wider sync scope                wider sync scope
                                                             and memory ordering             and memory ordering
                                                             stronger than                   stronger than
                                                             unordered (this is              unordered (this is
                                                             termed the                      termed the
                                                             fence-paired-atomic).           fence-paired-atomic).
                                                           - Ensures that all              - Ensures that all
                                                             memory operations               memory operations
                                                             to local have                   have
                                                             completed before                completed before
                                                             performing the                  performing the
                                                             following                       following
                                                             fence-paired-atomic.            fence-paired-atomic.

     fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &       1. s_waitcnt lgkmcnt(0) &
                               - system                     vmcnt(0)                        vmcnt(0) & vscnt(0)

                                                           - If OpenCL and                 - If OpenCL and
                                                             address space is                address space is
                                                             not generic, omit               not generic, omit
                                                             lgkmcnt(0).                     lgkmcnt(0).
                                                           - If OpenCL and                 - If OpenCL and
                                                             address space is                address space is
                                                             local, omit                     local, omit
                                                             vmcnt(0).                       vmcnt(0) and vscnt(0).
                                                           - However, since LLVM           - However, since LLVM
                                                             currently has no                currently has no
                                                             address space on                address space on
                                                             the fence need to               the fence need to
                                                             conservatively                  conservatively
                                                             always generate. If             always generate. If
                                                             fence had an                    fence had an
                                                             address space then              address space then
                                                             set to address                  set to address
                                                             space of OpenCL                 space of OpenCL
                                                             fence flag, or to               fence flag, or to
                                                             generic if both                 generic if both
                                                             local and global                local and global
                                                             flags are                       flags are
                                                             specified.                      specified.
                                                           - Could be split into           - Could be split into
                                                             separate s_waitcnt              separate s_waitcnt
                                                             vmcnt(0) and                    vmcnt(0), s_waitcnt
                                                             s_waitcnt                       vscnt(0) and s_waitcnt
                                                             lgkmcnt(0) to allow             lgkmcnt(0) to allow
                                                             them to be                      them to be
                                                             independently moved             independently moved
                                                             according to the                according to the
                                                             following rules.                following rules.
                                                           - s_waitcnt vmcnt(0)            - s_waitcnt vmcnt(0)
                                                             must happen after               must happen after
                                                             any preceding                   any preceding
                                                             global/generic                  global/generic
                                                             load/store/load                 load/load atomic/
                                                             atomic/store                    atomicrmw-with-return-value.
                                                             atomic/atomicrmw.
                                                                                           - s_waitcnt vscnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             global/generic
                                                                                             store/store atomic/
                                                                                             atomicrmw-no-return-value.
                                                           - s_waitcnt lgkmcnt(0)          - s_waitcnt lgkmcnt(0)
                                                             must happen after               must happen after
                                                             any preceding                   any preceding
                                                             local/generic                   local/generic
                                                             load/store/load                 load/store/load
                                                             atomic/store                    atomic/store
                                                             atomic/atomicrmw.               atomic/atomicrmw.
                                                           - Must happen before            - Must happen before
                                                             any following store             any following store
                                                             atomic/atomicrmw                atomic/atomicrmw
                                                             with an equal or                with an equal or
                                                             wider sync scope                wider sync scope
                                                             and memory ordering             and memory ordering
                                                             stronger than                   stronger than
                                                             unordered (this is              unordered (this is
                                                             termed the                      termed the
                                                             fence-paired-atomic).           fence-paired-atomic).
                                                           - Ensures that all              - Ensures that all
                                                             memory operations               memory operations
                                                             have                            have
                                                             completed before                completed before
                                                             performing the                  performing the
                                                             following                       following
                                                             fence-paired-atomic.            fence-paired-atomic.

     **Acquire-Release Atomic**
     ----------------------------------------------------------------------------------------------------------------------
     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic 1. buffer/global/ds/flat_atomic
                               - wavefront    - local
                                              - generic
     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
                                                                                            vmcnt(0) & vscnt(0)

                                                                                           - If CU wavefront execution mode, omit vmcnt and
                                                                                             vscnt.
                                                           - If OpenCL, omit.              - If OpenCL, omit
                                                                                             s_waitcnt lgkmcnt(0).
                                                           - Must happen after             - Must happen after
                                                             any preceding                   any preceding
                                                             local/generic                   local/generic
                                                             load/store/load                 load/store/load
                                                             atomic/store                    atomic/store
                                                             atomic/atomicrmw.               atomic/atomicrmw.
                                                                                           - Could be split into
                                                                                             separate s_waitcnt
                                                                                             vmcnt(0), s_waitcnt
                                                                                             vscnt(0) and s_waitcnt
                                                                                             lgkmcnt(0) to allow
                                                                                             them to be
                                                                                             independently moved
                                                                                             according to the
                                                                                             following rules.
                                                                                           - s_waitcnt vmcnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             global/generic load/load
                                                                                             atomic/
                                                                                             atomicrmw-with-return-value.
                                                                                           - s_waitcnt vscnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             global/generic
                                                                                             store/store
                                                                                             atomic/
                                                                                             atomicrmw-no-return-value.
                                                                                           - s_waitcnt lgkmcnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             local/generic load/store/load
                                                                                             atomic/store atomic/atomicrmw.
                                                           - Must happen before            - Must happen before
                                                             the following                   the following
                                                             atomicrmw.                      atomicrmw.
                                                           - Ensures that all              - Ensures that all
                                                             memory operations               memory operations
                                                             to local have                   have
                                                             completed before                completed before
                                                             performing the                  performing the
                                                             atomicrmw that is               atomicrmw that is
                                                             being released.                 being released.

                                                         2. buffer/global/flat_atomic    2. buffer/global_atomic
                                                                                         3. s_waitcnt vm/vscnt(0)

                                                                                           - If CU wavefront execution mode, omit vm/vscnt.
                                                                                           - Use vmcnt if atomic with
 | 
						||
                                                                                             return and vscnt if atomic
 | 
						||
                                                                                             with no-return.
 | 
						||
                                                                                           - Must happen before
                                                                                             the following
                                                                                             buffer_gl0_inv.
                                                                                           - Ensures any
                                                                                             following global
                                                                                             data read is no
                                                                                             older than the
                                                                                             atomicrmw value
                                                                                             being acquired.

                                                                                         4. buffer_gl0_inv

                                                                                           - If CU wavefront execution mode, omit.
                                                                                           - Ensures that
                                                                                             following
                                                                                             loads will not see
                                                                                             stale data.
     atomicrmw    acq_rel      - workgroup    - local                                    1. s_waitcnt vmcnt(0) & vscnt(0)

                                                                                           - If CU wavefront execution mode, omit.
                                                                                           - If OpenCL, omit.
                                                                                           - Could be split into
                                                                                             separate s_waitcnt
                                                                                             vmcnt(0) and s_waitcnt
                                                                                             vscnt(0) to allow
                                                                                             them to be
                                                                                             independently moved
                                                                                             according to the
                                                                                             following rules.
                                                                                           - s_waitcnt vmcnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             global/generic load/load
                                                                                             atomic/
                                                                                             atomicrmw-with-return-value.
                                                                                           - s_waitcnt vscnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             global/generic
                                                                                             store/store atomic/
                                                                                             atomicrmw-no-return-value.
                                                                                           - Must happen before
                                                                                             the following
                                                                                             atomicrmw.
                                                                                           - Ensures that all
                                                                                             global memory
                                                                                             operations have
                                                                                             completed before
                                                                                             performing the
                                                                                             atomicrmw that is being
                                                                                             released.

                                                         1. ds_atomic                    2. ds_atomic
                                                         2. s_waitcnt lgkmcnt(0)         3. s_waitcnt lgkmcnt(0)

                                                           - If OpenCL, omit.              - If OpenCL, omit.
                                                           - Must happen before            - Must happen before
                                                             any following                   the following
                                                             global/generic                  buffer_gl0_inv.
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures any                   - Ensures any
                                                             following global                following global
                                                             data read is no                 data read is no
                                                             older than the                  older than the
                                                             atomicrmw value being           atomicrmw value being
                                                             acquired.                       acquired.

                                                                                         4. buffer_gl0_inv

                                                                                           - If CU wavefront execution mode, omit.
                                                                                           - If OpenCL, omit.
                                                                                           - Ensures that
                                                                                             following
                                                                                             loads will not see
                                                                                             stale data.
     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
                                                                                            vmcnt(0) & vscnt(0)

                                                                                           - If CU wavefront execution mode, omit vmcnt and
                                                                                             vscnt.
                                                           - If OpenCL, omit.              - If OpenCL, omit
                                                                                             lgkmcnt(0).
                                                           - Must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                                                           - Could be split into
                                                                                             separate s_waitcnt
                                                                                             vmcnt(0), s_waitcnt
                                                                                             vscnt(0) and s_waitcnt
                                                                                             lgkmcnt(0) to allow
                                                                                             them to be
                                                                                             independently moved
                                                                                             according to the
                                                                                             following rules.
                                                                                           - s_waitcnt vmcnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             global/generic load/load
                                                                                             atomic/
                                                                                             atomicrmw-with-return-value.
                                                                                           - s_waitcnt vscnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             global/generic
                                                                                             store/store
                                                                                             atomic/
                                                                                             atomicrmw-no-return-value.
                                                                                           - s_waitcnt lgkmcnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             local/generic load/store/load
                                                                                             atomic/store atomic/atomicrmw.
                                                           - Must happen before            - Must happen before
                                                             the following                   the following
                                                             atomicrmw.                      atomicrmw.
                                                           - Ensures that all              - Ensures that all
                                                             memory operations               memory operations
                                                             to local have                   have
                                                             completed before                completed before
                                                             performing the                  performing the
                                                             atomicrmw that is               atomicrmw that is
                                                             being released.                 being released.

                                                         2. flat_atomic                  2. flat_atomic
                                                         3. s_waitcnt lgkmcnt(0)         3. s_waitcnt lgkmcnt(0) &
                                                                                            vm/vscnt(0)

                                                                                           - If CU wavefront execution mode, omit vm/vscnt.
                                                           - If OpenCL, omit.              - If OpenCL, omit
                                                                                             lgkmcnt(0).
                                                           - Must happen before            - Must happen before
                                                             any following                   the following
                                                             global/generic                  buffer_gl0_inv.
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures any                   - Ensures any
                                                             following global                following global
                                                             data read is no                 data read is no
                                                             older than the                  older than the
                                                             atomicrmw value being           atomicrmw value being
                                                             acquired.                       acquired.

                                                                                         4. buffer_gl0_inv

                                                                                           - If CU wavefront execution mode, omit.
                                                                                           - Ensures that
                                                                                             following
                                                                                             loads will not see
                                                                                             stale data.
     atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &       1. s_waitcnt lgkmcnt(0) &
                               - system                     vmcnt(0)                        vmcnt(0) & vscnt(0)

                                                           - If OpenCL, omit               - If OpenCL, omit
                                                             lgkmcnt(0).                     lgkmcnt(0).
                                                           - Could be split into           - Could be split into
                                                             separate s_waitcnt              separate s_waitcnt
                                                             vmcnt(0) and                    vmcnt(0), s_waitcnt
                                                             s_waitcnt                       vscnt(0) and s_waitcnt
                                                             lgkmcnt(0) to allow             lgkmcnt(0) to allow
                                                             them to be                      them to be
                                                             independently moved             independently moved
                                                             according to the                according to the
                                                             following rules.                following rules.
                                                           - s_waitcnt vmcnt(0)            - s_waitcnt vmcnt(0)
                                                             must happen after               must happen after
                                                             any preceding                   any preceding
                                                             global/generic                  global/generic
                                                             load/store/load                 load/load atomic/
                                                             atomic/store                    atomicrmw-with-return-value.
                                                             atomic/atomicrmw.
                                                                                           - s_waitcnt vscnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             global/generic
                                                                                             store/store atomic/
                                                                                             atomicrmw-no-return-value.
                                                           - s_waitcnt lgkmcnt(0)          - s_waitcnt lgkmcnt(0)
                                                             must happen after               must happen after
                                                             any preceding                   any preceding
                                                             local/generic                   local/generic
                                                             load/store/load                 load/store/load
                                                             atomic/store                    atomic/store
                                                             atomic/atomicrmw.               atomic/atomicrmw.
                                                           - Must happen before            - Must happen before
                                                             the following                   the following
                                                             atomicrmw.                      atomicrmw.
                                                           - Ensures that all              - Ensures that all
                                                             memory operations               memory operations
                                                             to global have                  to global have
                                                             completed before                completed before
                                                             performing the                  performing the
                                                             atomicrmw that is               atomicrmw that is
                                                             being released.                 being released.

                                                         2. buffer/global/flat_atomic    2. buffer/global_atomic
                                                         3. s_waitcnt vmcnt(0)           3. s_waitcnt vm/vscnt(0)

                                                                                           - Use vmcnt if atomic with
                                                                                             return and vscnt if atomic
                                                                                             with no-return.
                                                           - Must happen before            - Must happen before
                                                             following                       following
                                                             buffer_wbinvl1_vol.             buffer_gl*_inv.
                                                           - Ensures the                   - Ensures the
                                                             atomicrmw has                   atomicrmw has
                                                             completed before                completed before
                                                             invalidating the                invalidating the
                                                             cache.                          caches.

                                                         4. buffer_wbinvl1_vol           4. buffer_gl0_inv;
                                                                                            buffer_gl1_inv

                                                           - Must happen before            - Must happen before
                                                             any following                   any following
                                                             global/generic                  global/generic
                                                             load/load                       load/load
                                                             atomic/atomicrmw.               atomic/atomicrmw.
                                                           - Ensures that                  - Ensures that
                                                             following loads                 following loads
                                                             will not see stale              will not see stale
                                                             global data.                    global data.
     atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &       1. s_waitcnt lgkmcnt(0) &
                               - system                     vmcnt(0)                        vmcnt(0) & vscnt(0)

                                                           - If OpenCL, omit               - If OpenCL, omit
                                                             lgkmcnt(0).                     lgkmcnt(0).
                                                           - Could be split into           - Could be split into
                                                             separate s_waitcnt              separate s_waitcnt
                                                             vmcnt(0) and                    vmcnt(0), s_waitcnt
                                                             s_waitcnt                       vscnt(0) and s_waitcnt
                                                             lgkmcnt(0) to allow             lgkmcnt(0) to allow
                                                             them to be                      them to be
                                                             independently moved             independently moved
                                                             according to the                according to the
                                                             following rules.                following rules.
                                                           - s_waitcnt vmcnt(0)            - s_waitcnt vmcnt(0)
                                                             must happen after               must happen after
                                                             any preceding                   any preceding
                                                             global/generic                  global/generic
                                                             load/store/load                 load/load atomic/
                                                             atomic/store                    atomicrmw-with-return-value.
                                                             atomic/atomicrmw.
                                                                                           - s_waitcnt vscnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             global/generic
                                                                                             store/store atomic/
                                                                                             atomicrmw-no-return-value.
                                                           - s_waitcnt lgkmcnt(0)          - s_waitcnt lgkmcnt(0)
                                                             must happen after               must happen after
                                                             any preceding                   any preceding
                                                             local/generic                   local/generic
                                                             load/store/load                 load/store/load
                                                             atomic/store                    atomic/store
                                                             atomic/atomicrmw.               atomic/atomicrmw.
                                                           - Must happen before            - Must happen before
 | 
						||
                                                             the following                   the following
 | 
						||
                                                             atomicrmw.                      atomicrmw.
 | 
						||
                                                           - Ensures that all              - Ensures that all
 | 
						||
                                                             memory operations               memory operations
 | 
						||
                                                             to global have                  have
 | 
						||
                                                             completed before                completed before
                                                             performing the                  performing the
                                                             atomicrmw that is               atomicrmw that is
                                                             being released.                 being released.

                                                         2. flat_atomic                  2. flat_atomic
                                                         3. s_waitcnt vmcnt(0) &         3. s_waitcnt vm/vscnt(0) &
                                                            lgkmcnt(0)                      lgkmcnt(0)

                                                           - If OpenCL, omit               - If OpenCL, omit
                                                             lgkmcnt(0).                     lgkmcnt(0).
                                                                                           - Use vmcnt if atomic with
                                                                                             return and vscnt if atomic
                                                                                             with no-return.
                                                           - Must happen before            - Must happen before
                                                             following                       following
                                                             buffer_wbinvl1_vol.             buffer_gl*_inv.
                                                           - Ensures the                   - Ensures the
                                                             atomicrmw has                   atomicrmw has
                                                             completed before                completed before
                                                             invalidating the                invalidating the
                                                             cache.                          caches.

                                                         4. buffer_wbinvl1_vol           4. buffer_gl0_inv;
                                                                                            buffer_gl1_inv

                                                           - Must happen before            - Must happen before
                                                             any following                   any following
                                                             global/generic                  global/generic
                                                             load/load                       load/load
                                                             atomic/atomicrmw.               atomic/atomicrmw.
                                                           - Ensures that                  - Ensures that
                                                             following loads                 following loads
                                                             will not see stale              will not see stale
                                                             global data.                    global data.

     fence        acq_rel      - singlethread *none*     *none*                          *none*
                               - wavefront
     fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
                                                                                            vmcnt(0) & vscnt(0)

                                                                                           - If CU wavefront execution mode, omit vmcnt and
                                                                                             vscnt.
                                                           - If OpenCL and                 - If OpenCL and
                                                             address space is                address space is
                                                             not generic, omit.              not generic, omit
                                                                                             lgkmcnt(0).
                                                                                           - If OpenCL and
                                                                                             address space is
                                                                                             local, omit
                                                                                             vmcnt(0) and vscnt(0).
                                                           - However,                      - However,
                                                             since LLVM                      since LLVM
                                                             currently has no                currently has no
                                                             address space on                address space on
                                                             the fence need to               the fence need to
                                                             conservatively                  conservatively
                                                             always generate                 always generate
                                                             (see comment for                (see comment for
                                                             previous fence).                previous fence).
                                                           - Must happen after
                                                             any preceding
                                                             local/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                                                           - Could be split into
                                                                                             separate s_waitcnt
                                                                                             vmcnt(0), s_waitcnt
                                                                                             vscnt(0) and s_waitcnt
                                                                                             lgkmcnt(0) to allow
                                                                                             them to be
                                                                                             independently moved
                                                                                             according to the
                                                                                             following rules.
                                                                                           - s_waitcnt vmcnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             global/generic
                                                                                             load/load
                                                                                             atomic/
                                                                                             atomicrmw-with-return-value.
                                                                                           - s_waitcnt vscnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             global/generic
                                                                                             store/store atomic/
                                                                                             atomicrmw-no-return-value.
                                                                                           - s_waitcnt lgkmcnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             local/generic
                                                                                             load/store/load
                                                                                             atomic/store atomic/
                                                                                             atomicrmw.
                                                           - Must happen before            - Must happen before
                                                             any following                   any following
                                                             global/generic                  global/generic
                                                             load/load                       load/load
                                                             atomic/store/store              atomic/store/store
                                                             atomic/atomicrmw.               atomic/atomicrmw.
                                                           - Ensures that all              - Ensures that all
                                                             memory operations               memory operations
                                                             to local have                   have
                                                             completed before                completed before
                                                             performing any                  performing any
                                                             following global                following global
                                                             memory operations.              memory operations.
                                                           - Ensures that the              - Ensures that the
                                                             preceding                       preceding
                                                             local/generic load              local/generic load
                                                             atomic/atomicrmw                atomic/atomicrmw
                                                             with an equal or                with an equal or
                                                             wider sync scope                wider sync scope
                                                             and memory ordering             and memory ordering
                                                             stronger than                   stronger than
                                                             unordered (this is              unordered (this is
                                                             termed the                      termed the
                                                             acquire-fence-paired-atomic     acquire-fence-paired-atomic
                                                             ) has completed                 ) has completed
                                                             before following                before following
                                                             global memory                   global memory
                                                             operations. This                operations. This
                                                             satisfies the                   satisfies the
                                                             requirements of                 requirements of
                                                             acquire.                        acquire.
                                                           - Ensures that all              - Ensures that all
                                                             previous memory                 previous memory
                                                             operations have                 operations have
                                                             completed before a              completed before a
                                                             following                       following
                                                             local/generic store             local/generic store
                                                             atomic/atomicrmw                atomic/atomicrmw
                                                             with an equal or                with an equal or
                                                             wider sync scope                wider sync scope
                                                             and memory ordering             and memory ordering
                                                             stronger than                   stronger than
                                                             unordered (this is              unordered (this is
                                                             termed the                      termed the
                                                             release-fence-paired-atomic     release-fence-paired-atomic
                                                             ). This satisfies the           ). This satisfies the
                                                             requirements of                 requirements of
                                                             release.                        release.
                                                                                           - Must happen before
                                                                                             the following
                                                                                             buffer_gl0_inv.
                                                                                           - Ensures that the
                                                                                             acquire-fence-paired
                                                                                             atomic has completed
                                                                                             before invalidating
                                                                                             the
                                                                                             cache. Therefore
                                                                                             any following
                                                                                             locations read must
                                                                                             be no older than
                                                                                             the value read by
                                                                                             the
                                                                                             acquire-fence-paired-atomic.

                                                                                         2. buffer_gl0_inv

                                                                                           - If CU wavefront execution mode, omit.
                                                                                           - Ensures that
                                                                                             following
                                                                                             loads will not see
                                                                                             stale data.

     fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &       1. s_waitcnt lgkmcnt(0) &
                               - system                     vmcnt(0)                        vmcnt(0) & vscnt(0)

                                                           - If OpenCL and                 - If OpenCL and
                                                             address space is                address space is
                                                             not generic, omit               not generic, omit
                                                             lgkmcnt(0).                     lgkmcnt(0).
                                                                                           - If OpenCL and
                                                                                             address space is
                                                                                             local, omit
                                                                                             vmcnt(0) and vscnt(0).
                                                           - However, since LLVM           - However, since LLVM
                                                             currently has no                currently has no
                                                             address space on                address space on
                                                             the fence need to               the fence need to
                                                             conservatively                  conservatively
                                                             always generate                 always generate
                                                             (see comment for                (see comment for
                                                             previous fence).                previous fence).
                                                           - Could be split into           - Could be split into
                                                             separate s_waitcnt              separate s_waitcnt
                                                             vmcnt(0) and                    vmcnt(0), s_waitcnt
                                                             s_waitcnt                       vscnt(0) and s_waitcnt
                                                             lgkmcnt(0) to allow             lgkmcnt(0) to allow
                                                             them to be                      them to be
                                                             independently moved             independently moved
                                                             according to the                according to the
                                                             following rules.                following rules.
                                                           - s_waitcnt vmcnt(0)            - s_waitcnt vmcnt(0)
                                                             must happen after               must happen after
                                                             any preceding                   any preceding
                                                             global/generic                  global/generic
                                                             load/store/load                 load/load
                                                             atomic/store                    atomic/
                                                             atomic/atomicrmw.               atomicrmw-with-return-value.
                                                                                           - s_waitcnt vscnt(0)
                                                                                             must happen after
                                                                                             any preceding
                                                                                             global/generic
                                                                                             store/store atomic/
                                                                                             atomicrmw-no-return-value.
                                                           - s_waitcnt lgkmcnt(0)          - s_waitcnt lgkmcnt(0)
                                                             must happen after               must happen after
                                                             any preceding                   any preceding
                                                             local/generic                   local/generic
                                                             load/store/load                 load/store/load
                                                             atomic/store                    atomic/store
                                                             atomic/atomicrmw.               atomic/atomicrmw.
                                                           - Must happen before            - Must happen before
                                                             the following                   the following
                                                             buffer_wbinvl1_vol.             buffer_gl*_inv.
                                                           - Ensures that the              - Ensures that the
                                                             preceding                       preceding
                                                             global/local/generic            global/local/generic
                                                             load                            load
                                                             atomic/atomicrmw                atomic/atomicrmw
                                                             with an equal or                with an equal or
                                                             wider sync scope                wider sync scope
                                                             and memory ordering             and memory ordering
                                                             stronger than                   stronger than
                                                             unordered (this is              unordered (this is
                                                             termed the                      termed the
                                                             acquire-fence-paired-atomic     acquire-fence-paired-atomic
                                                             ) has completed                 ) has completed
                                                             before invalidating             before invalidating
                                                             the cache. This                 the caches. This
                                                             satisfies the                   satisfies the
                                                             requirements of                 requirements of
                                                             acquire.                        acquire.
                                                           - Ensures that all              - Ensures that all
                                                             previous memory                 previous memory
                                                             operations have                 operations have
                                                             completed before a              completed before a
                                                             following                       following
                                                             global/local/generic            global/local/generic
                                                             store                           store
                                                             atomic/atomicrmw                atomic/atomicrmw
                                                             with an equal or                with an equal or
                                                             wider sync scope                wider sync scope
                                                             and memory ordering             and memory ordering
                                                             stronger than                   stronger than
                                                             unordered (this is              unordered (this is
                                                             termed the                      termed the
                                                             release-fence-paired-atomic     release-fence-paired-atomic
                                                             ). This satisfies the           ). This satisfies the
                                                             requirements of                 requirements of
                                                             release.                        release.

                                                         2. buffer_wbinvl1_vol           2. buffer_gl0_inv;
                                                                                            buffer_gl1_inv

                                                           - Must happen before            - Must happen before
                                                             any following                   any following
                                                             global/generic                  global/generic
                                                             load/load                       load/load
                                                             atomic/store/store              atomic/store/store
                                                             atomic/atomicrmw.               atomic/atomicrmw.
                                                           - Ensures that                  - Ensures that
                                                             following loads                 following loads
                                                             will not see stale              will not see stale
                                                             global data. This               global data. This
                                                             satisfies the                   satisfies the
                                                             requirements of                 requirements of
                                                             acquire.                        acquire.

     **Sequential Consistent Atomic**
     ----------------------------------------------------------------------------------------------------------------------
     load atomic  seq_cst      - singlethread - global   *Same as corresponding          *Same as corresponding
                               - wavefront    - local    load atomic acquire,            load atomic acquire,
                                              - generic  except must generate            except must generate
                                                         all instructions even           all instructions even
 | 
						||
                                                         for OpenCL.*                    for OpenCL.*
 | 
						||
     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
 | 
						||
                                              - generic                                     vmcnt(0) & vscnt(0)
 | 
						||
 | 
						||
                                                                                           - If CU wavefront execution mode, omit vmcnt and
 | 
						||
                                                                                             vscnt.
 | 
						||
                                                                                           - Could be split into
 | 
						||
                                                                                             separate s_waitcnt
 | 
						||
                                                                                             vmcnt(0), s_waitcnt
 | 
						||
                                                                                             vscnt(0) and s_waitcnt
 | 
						||
                                                                                             lgkmcnt(0) to allow
 | 
						||
                                                                                             them to be
 | 
						||
                                                                                             independently moved
 | 
						||
                                                                                             according to the
 | 
						||
                                                                                             following rules.
 | 
						||
                                                           - Must                          - waitcnt lgkmcnt(0) must
 | 
						||
                                                             happen after                    happen after
 | 
						||
                                                             preceding                       preceding
 | 
						||
                                                             global/generic load             local load
 | 
						||
                                                             atomic/store                    atomic/store
 | 
						||
                                                             atomic/atomicrmw                atomic/atomicrmw
 | 
						||
                                                             with memory                     with memory
 | 
						||
                                                             ordering of seq_cst             ordering of seq_cst
 | 
						||
                                                             and with equal or               and with equal or
 | 
						||
                                                             wider sync scope.               wider sync scope.
 | 
						||
                                                             (Note that seq_cst              (Note that seq_cst
 | 
						||
                                                             fences have their               fences have their
 | 
						||
                                                             own s_waitcnt                   own s_waitcnt
 | 
						||
                                                             lgkmcnt(0) and so do            lgkmcnt(0) and so do
 | 
						||
                                                             not need to be                  not need to be
 | 
						||
                                                             considered.)                    considered.)
 | 
						||
                                                                                           - waitcnt vmcnt(0)
 | 
						||
                                                                                             must happen after
 | 
						||
                                                                                             preceding
 | 
						||
                                                                                             global/generic load
 | 
						||
                                                                                             atomic/
 | 
						||
                                                                                             atomicrmw-with-return-value
 | 
						||
                                                                                             with memory
 | 
						||
                                                                                             ordering of seq_cst
 | 
						||
                                                                                             and with equal or
 | 
						||
                                                                                             wider sync scope.
 | 
						||
                                                                                             (Note that seq_cst
 | 
						||
                                                                                             fences have their
 | 
						||
                                                                                             own s_waitcnt
 | 
						||
                                                                                             vmcnt(0) and so do
 | 
						||
                                                                                             not need to be
 | 
						||
                                                                                             considered.)
 | 
						||
                                                                                           - waitcnt vscnt(0)
 | 
						||
                                                                                             must happen after
 | 
						||
                                                                                             preceding
 | 
						||
                                                                                             global/generic store
 | 
						||
                                                                                             atomic/
 | 
						||
                                                                                             atomicrmw-no-return-value
 | 
						||
                                                                                             with memory
 | 
						||
                                                                                             ordering of seq_cst
 | 
						||
                                                                                             and with equal or
 | 
						||
                                                                                             wider sync scope.
 | 
						||
                                                                                             (Note that seq_cst
 | 
						||
                                                                                             fences have their
 | 
						||
                                                                                             own s_waitcnt
 | 
						||
                                                                                             vscnt(0) and so do
 | 
						||
                                                                                             not need to be
 | 
						||
                                                                                             considered.)
 | 
						||
                                                           - Ensures any                   - Ensures any
 | 
						||
                                                             preceding                       preceding
 | 
						||
                                                             sequentially                    sequentially
 | 
						||
                                                             consistent local                consistent global/local
 | 
						||
                                                             memory instructions             memory instructions
 | 
						||
                                                             have completed                  have completed
 | 
						||
                                                             before executing                before executing
 | 
						||
                                                             this sequentially               this sequentially
 | 
						||
                                                             consistent                      consistent
 | 
						||
                                                             instruction. This               instruction. This
 | 
						||
                                                             prevents reordering             prevents reordering
 | 
						||
                                                             a seq_cst store                 a seq_cst store
 | 
						||
                                                             followed by a                   followed by a
 | 
						||
                                                             seq_cst load. (Note             seq_cst load. (Note
 | 
						||
                                                             that seq_cst is                 that seq_cst is
 | 
						||
                                                             stronger than                   stronger than
 | 
						||
                                                             acquire/release as              acquire/release as
 | 
						||
                                                             the reordering of               the reordering of
 | 
						||
                                                             load acquire                    load acquire
 | 
						||
                                                             followed by a store             followed by a store
 | 
						||
                                                             release is                      release is
 | 
						||
                                                             prevented by the                prevented by the
 | 
						||
                                                             waitcnt of                      waitcnt of
 | 
						||
                                                             the release, but                the release, but
 | 
						||
                                                             there is nothing                there is nothing
 | 
						||
                                                             preventing a store              preventing a store
 | 
						||
                                                             release followed by             release followed by
 | 
						||
                                                             load acquire from               load acquire from
 | 
						||
                                                             completing out of               completing out of
 | 
						||
                                                             order.)                         order.)
 | 
						||
 | 
						||
                                                         2. *Following                   2. *Following
 | 
						||
                                                            instructions same as            instructions same as
 | 
						||
                                                            corresponding load              corresponding load
 | 
						||
                                                            atomic acquire,                 atomic acquire,
 | 
						||
                                                            except must generate            except must generate
 | 
						||
                                                            all instructions even           all instructions even
 | 
						||
                                                            for OpenCL.*                    for OpenCL.*
 | 
						||
     load atomic  seq_cst      - workgroup    - local    *Same as corresponding
 | 
						||
                                                         load atomic acquire,
 | 
						||
                                                         except must generate
 | 
						||
                                                         all instructions even
 | 
						||
                                                         for OpenCL.*
 | 
						||
 | 
						||
                                                                                         1. s_waitcnt vmcnt(0) & vscnt(0)
 | 
						||
 | 
						||
                                                                                           - If CU wavefront execution mode, omit.
 | 
						||
                                                                                           - Could be split into
 | 
						||
                                                                                             separate s_waitcnt
 | 
						||
                                                                                             vmcnt(0) and s_waitcnt
 | 
						||
                                                                                             vscnt(0) to allow
 | 
						||
                                                                                             them to be
 | 
						||
                                                                                             independently moved
 | 
						||
                                                                                             according to the
 | 
						||
                                                                                             following rules.
 | 
						||
                                                                                           - waitcnt vmcnt(0)
 | 
						||
                                                                                             must happen after
 | 
						||
                                                                                             preceding
 | 
						||
                                                                                             global/generic load
 | 
						||
                                                                                             atomic/
 | 
						||
                                                                                             atomicrmw-with-return-value
 | 
						||
                                                                                             with memory
 | 
						||
                                                                                             ordering of seq_cst
 | 
						||
                                                                                             and with equal or
 | 
						||
                                                                                             wider sync scope.
 | 
						||
                                                                                             (Note that seq_cst
 | 
						||
                                                                                             fences have their
 | 
						||
                                                                                             own s_waitcnt
 | 
						||
                                                                                             vmcnt(0) and so do
 | 
						||
                                                                                             not need to be
 | 
						||
                                                                                             considered.)
 | 
						||
                                                                                           - waitcnt vscnt(0)
 | 
						||
                                                                                             must happen after
 | 
						||
                                                                                             preceding
 | 
						||
                                                                                             global/generic store
 | 
						||
                                                                                             atomic/
 | 
						||
                                                                                             atomicrmw-no-return-value
 | 
						||
                                                                                             with memory
 | 
						||
                                                                                             ordering of seq_cst
 | 
						||
                                                                                             and with equal or
 | 
						||
                                                                                             wider sync scope.
 | 
						||
                                                                                             (Note that seq_cst
 | 
						||
                                                                                             fences have their
 | 
						||
                                                                                             own s_waitcnt
 | 
						||
                                                                                             vscnt(0) and so do
 | 
						||
                                                                                             not need to be
 | 
						||
                                                                                             considered.)
 | 
						||
                                                                                           - Ensures any
 | 
						||
                                                                                             preceding
 | 
						||
                                                                                             sequentially
 | 
						||
                                                                                             consistent global
 | 
						||
                                                                                             memory instructions
 | 
						||
                                                                                             have completed
 | 
						||
                                                                                             before executing
 | 
						||
                                                                                             this sequentially
 | 
						||
                                                                                             consistent
 | 
						||
                                                                                             instruction. This
 | 
						||
                                                                                             prevents reordering
 | 
						||
                                                                                             a seq_cst store
 | 
						||
                                                                                             followed by a
 | 
						||
                                                                                             seq_cst load. (Note
 | 
						||
                                                                                             that seq_cst is
 | 
						||
                                                                                             stronger than
 | 
						||
                                                                                             acquire/release as
 | 
						||
                                                                                             the reordering of
 | 
						||
                                                                                             load acquire
 | 
						||
                                                                                             followed by a store
 | 
						||
                                                                                             release is
 | 
						||
                                                                                             prevented by the
 | 
						||
                                                                                             waitcnt of
 | 
						||
                                                                                             the release, but
 | 
						||
                                                                                             there is nothing
 | 
						||
                                                                                             preventing a store
 | 
						||
                                                                                             release followed by
 | 
						||
                                                                                             load acquire from
 | 
						||
                                                                                             completing out of
 | 
						||
                                                                                             order.)
 | 
						||
 | 
						||
                                                                                         2. *Following
 | 
						||
                                                                                            instructions same as
 | 
						||
                                                                                            corresponding load
 | 
						||
                                                                                            atomic acquire,
 | 
						||
                                                                                            except must generate
 | 
						||
                                                                                            all instructions even
 | 
						||
                                                                                            for OpenCL.*
 | 
						||
 | 
						||
     load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &       1. s_waitcnt lgkmcnt(0) &
 | 
						||
                               - system       - generic     vmcnt(0)                        vmcnt(0) & vscnt(0)
 | 
						||
 | 
						||
                                                           - Could be split into           - Could be split into
 | 
						||
                                                             separate s_waitcnt              separate s_waitcnt
 | 
						||
                                                             vmcnt(0)                        vmcnt(0), s_waitcnt
 | 
						||
                                                             and s_waitcnt                   vscnt(0) and s_waitcnt
 | 
						||
                                                             lgkmcnt(0) to allow             lgkmcnt(0) to allow
 | 
						||
                                                             them to be                      them to be
 | 
						||
                                                             independently moved             independently moved
 | 
						||
                                                             according to the                according to the
 | 
						||
                                                             following rules.                following rules.
 | 
						||
                                                           - waitcnt lgkmcnt(0)            - waitcnt lgkmcnt(0)
 | 
						||
                                                             must happen after               must happen after
 | 
						||
                                                             preceding                       preceding
 | 
						||
                                                             global/generic load             local load
 | 
						||
                                                             atomic/store                    atomic/store
 | 
						||
                                                             atomic/atomicrmw                atomic/atomicrmw
 | 
						||
                                                             with memory                     with memory
 | 
						||
                                                             ordering of seq_cst             ordering of seq_cst
 | 
						||
                                                             and with equal or               and with equal or
 | 
						||
                                                             wider sync scope.               wider sync scope.
 | 
						||
                                                             (Note that seq_cst              (Note that seq_cst
 | 
						||
                                                             fences have their               fences have their
 | 
						||
                                                             own s_waitcnt                   own s_waitcnt
 | 
						||
                                                             lgkmcnt(0) and so do            lgkmcnt(0) and so do
 | 
						||
                                                             not need to be                  not need to be
 | 
						||
                                                             considered.)                    considered.)
 | 
						||
                                                           - waitcnt vmcnt(0)              - waitcnt vmcnt(0)
 | 
						||
                                                             must happen after               must happen after
 | 
						||
                                                             preceding                       preceding
 | 
						||
                                                             global/generic load             global/generic load
 | 
						||
                                                             atomic/store                    atomic/
 | 
						||
                                                             atomic/atomicrmw                atomicrmw-with-return-value
 | 
						||
                                                             with memory                     with memory
 | 
						||
                                                             ordering of seq_cst             ordering of seq_cst
 | 
						||
                                                             and with equal or               and with equal or
 | 
						||
                                                             wider sync scope.               wider sync scope.
 | 
						||
                                                             (Note that seq_cst              (Note that seq_cst
 | 
						||
                                                             fences have their               fences have their
 | 
						||
                                                             own s_waitcnt                   own s_waitcnt
 | 
						||
                                                             vmcnt(0) and so do              vmcnt(0) and so do
 | 
						||
                                                             not need to be                  not need to be
 | 
						||
                                                             considered.)                    considered.)
 | 
						||
                                                                                           - waitcnt vscnt(0)
 | 
						||
                                                                                             must happen after
 | 
						||
                                                                                             preceding
 | 
						||
                                                                                             global/generic store
 | 
						||
                                                                                             atomic/
 | 
						||
                                                                                             atomicrmw-no-return-value
 | 
						||
                                                                                             with memory
 | 
						||
                                                                                             ordering of seq_cst
 | 
						||
                                                                                             and with equal or
 | 
						||
                                                                                             wider sync scope.
 | 
						||
                                                                                             (Note that seq_cst
 | 
						||
                                                                                             fences have their
 | 
						||
                                                                                             own s_waitcnt
 | 
						||
                                                                                             vscnt(0) and so do
 | 
						||
                                                                                             not need to be
 | 
						||
                                                                                             considered.)
 | 
						||
                                                           - Ensures any                   - Ensures any
 | 
						||
                                                             preceding                       preceding
 | 
						||
                                                             sequentially                    sequentially
 | 
						||
                                                             consistent global               consistent global
 | 
						||
                                                             memory instructions             memory instructions
 | 
						||
                                                             have completed                  have completed
 | 
						||
                                                             before executing                before executing
 | 
						||
                                                             this sequentially               this sequentially
 | 
						||
                                                             consistent                      consistent
 | 
						||
                                                             instruction. This               instruction. This
 | 
						||
                                                             prevents reordering             prevents reordering
 | 
						||
                                                             a seq_cst store                 a seq_cst store
 | 
						||
                                                             followed by a                   followed by a
 | 
						||
                                                             seq_cst load. (Note             seq_cst load. (Note
 | 
						||
                                                             that seq_cst is                 that seq_cst is
 | 
						||
                                                             stronger than                   stronger than
 | 
						||
                                                             acquire/release as              acquire/release as
 | 
						||
                                                             the reordering of               the reordering of
 | 
						||
                                                             load acquire                    load acquire
 | 
						||
                                                             followed by a store             followed by a store
 | 
						||
                                                             release is                      release is
 | 
						||
                                                             prevented by the                prevented by the
 | 
						||
                                                             waitcnt of                      waitcnt of
 | 
						||
                                                             the release, but                the release, but
 | 
						||
                                                             there is nothing                there is nothing
 | 
						||
                                                             preventing a store              preventing a store
 | 
						||
                                                             release followed by             release followed by
 | 
						||
                                                             load acquire from               load acquire from
 | 
						||
                                                             completing out of               completing out of
 | 
						||
                                                             order.)                         order.)
 | 
						||
 | 
						||
                                                         2. *Following                   2. *Following
 | 
						||
                                                            instructions same as            instructions same as
 | 
						||
                                                            corresponding load              corresponding load
 | 
						||
                                                            atomic acquire,                 atomic acquire,
 | 
						||
                                                            except must generate            except must generate
 | 
						||
                                                            all instructions even           all instructions even
 | 
						||
                                                            for OpenCL.*                    for OpenCL.*
 | 
						||
     store atomic seq_cst      - singlethread - global   *Same as corresponding          *Same as corresponding
 | 
						||
                               - wavefront    - local    store atomic release,           store atomic release,
 | 
						||
                               - workgroup    - generic  except must generate            except must generate
 | 
						||
                                                         all instructions even           all instructions even
 | 
						||
                                                         for OpenCL.*                    for OpenCL.*
 | 
						||
     store atomic seq_cst      - agent        - global   *Same as corresponding          *Same as corresponding
 | 
						||
                               - system       - generic  store atomic release,           store atomic release,
 | 
						||
                                                         except must generate            except must generate
 | 
						||
                                                         all instructions even           all instructions even
 | 
						||
                                                         for OpenCL.*                    for OpenCL.*
 | 
						||
     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding          *Same as corresponding
 | 
						||
                               - wavefront    - local    atomicrmw acq_rel,              atomicrmw acq_rel,
 | 
						||
                               - workgroup    - generic  except must generate            except must generate
 | 
						||
                                                         all instructions even           all instructions even
 | 
						||
                                                         for OpenCL.*                    for OpenCL.*
 | 
						||
     atomicrmw    seq_cst      - agent        - global   *Same as corresponding          *Same as corresponding
 | 
						||
                               - system       - generic  atomicrmw acq_rel,              atomicrmw acq_rel,
 | 
						||
                                                         except must generate            except must generate
 | 
						||
                                                         all instructions even           all instructions even
 | 
						||
                                                         for OpenCL.*                    for OpenCL.*
 | 
						||
     fence        seq_cst      - singlethread *none*     *Same as corresponding          *Same as corresponding
 | 
						||
                               - wavefront               fence acq_rel,                  fence acq_rel,
 | 
						||
                               - workgroup               except must generate            except must generate
 | 
						||
                               - agent                   all instructions even           all instructions even
 | 
						||
                               - system                  for OpenCL.*                    for OpenCL.*
 | 
						||
     ============ ============ ============== ========== =============================== ==================================
 | 
						||
 | 
						||
The memory order also adds the single thread optimization constraints defined in
 | 
						||
table
 | 
						||
:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx10-table`.
 | 
						||
 | 
						||
  .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX10
 | 
						||
     :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx10-table
 | 
						||
 | 
						||
     ============ ==============================================================
 | 
						||
     LLVM Memory  Optimization Constraints
 | 
						||
     Ordering
 | 
						||
     ============ ==============================================================
 | 
						||
     unordered    *none*
 | 
						||
     monotonic    *none*
 | 
						||
     acquire      - If a load atomic/atomicrmw then no following load/load
 | 
						||
                    atomic/store/store atomic/atomicrmw/fence instruction can
 | 
						||
                    be moved before the acquire.
 | 
						||
                  - If a fence then same as load atomic, plus no preceding
 | 
						||
                    associated fence-paired-atomic can be moved after the fence.
 | 
						||
     release      - If a store atomic/atomicrmw then no preceding load/load
 | 
						||
                    atomic/store/store atomic/atomicrmw/fence instruction can
 | 
						||
                    be moved after the release.
 | 
						||
                  - If a fence then same as store atomic, plus no following
 | 
						||
                    associated fence-paired-atomic can be moved before the
 | 
						||
                    fence.
 | 
						||
     acq_rel      Same constraints as both acquire and release.
 | 
						||
     seq_cst      - If a load atomic then same constraints as acquire, plus no
 | 
						||
                    preceding sequentially consistent load atomic/store
 | 
						||
                    atomic/atomicrmw/fence instruction can be moved after the
 | 
						||
                    seq_cst.
 | 
						||
                  - If a store atomic then the same constraints as release, plus
 | 
						||
                    no following sequentially consistent load atomic/store
 | 
						||
                    atomic/atomicrmw/fence instruction can be moved before the
 | 
						||
                    seq_cst.
 | 
						||
                  - If an atomicrmw/fence then same constraints as acq_rel.
 | 
						||
     ============ ==============================================================
 | 
						||
 | 
						||
Trap Handler ABI
 | 
						||
~~~~~~~~~~~~~~~~
 | 
						||
 | 
						||
For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible runtimes
 | 
						||
(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
 | 
						||
the ``s_trap`` instruction with the following usage:
 | 
						||
 | 
						||
  .. table:: AMDGPU Trap Handler for AMDHSA OS
 | 
						||
     :name: amdgpu-trap-handler-for-amdhsa-os-table
 | 
						||
 | 
						||
     =================== =============== =============== =======================
 | 
						||
     Usage               Code Sequence   Trap Handler    Description
 | 
						||
                                         Inputs
 | 
						||
     =================== =============== =============== =======================
 | 
						||
     reserved            ``s_trap 0x00``                 Reserved by hardware.
 | 
						||
     ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for HSA
 | 
						||
                                           ``queue_ptr`` ``debugtrap``
 | 
						||
                                         ``VGPR0``:      intrinsic (not
 | 
						||
                                           ``arg``       implemented).
 | 
						||
     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes dispatch to be
 | 
						||
                                           ``queue_ptr`` terminated and its
 | 
						||
                                                         associated queue put
 | 
						||
                                                         into the error state.
 | 
						||
     ``llvm.debugtrap``  ``s_trap 0x03``                 - If debugger not
 | 
						||
                                                           installed then
 | 
						||
                                                           behaves as a
 | 
						||
                                                           no-operation. The
 | 
						||
                                                           trap handler is
 | 
						||
                                                           entered and
 | 
						||
                                                           immediately returns
 | 
						||
                                                           to continue
 | 
						||
                                                           execution of the
 | 
						||
                                                           wavefront.
 | 
						||
                                                         - If the debugger is
 | 
						||
                                                           installed, causes
 | 
						||
                                                           the debug trap to be
 | 
						||
                                                           reported by the
 | 
						||
                                                           debugger and the
 | 
						||
                                                           wavefront is put in
 | 
						||
                                                           the halt state until
 | 
						||
                                                           resumed by the
 | 
						||
                                                           debugger.
 | 
						||
     reserved            ``s_trap 0x04``                 Reserved.
 | 
						||
     reserved            ``s_trap 0x05``                 Reserved.
 | 
						||
     reserved            ``s_trap 0x06``                 Reserved.
 | 
						||
     debugger breakpoint ``s_trap 0x07``                 Reserved for debugger
 | 
						||
                                                         breakpoints.
 | 
						||
     reserved            ``s_trap 0x08``                 Reserved.
 | 
						||
     reserved            ``s_trap 0xfe``                 Reserved.
 | 
						||
     reserved            ``s_trap 0xff``                 Reserved.
 | 
						||
     =================== =============== =============== =======================
 | 
						||
 | 
						||
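The trap IDs in the table above can be captured as a small lookup, which may be
convenient when decoding an ``s_trap`` operand in a tool or test. The following
is a minimal sketch: ``TRAP_USAGE``, ``code_sequence`` and ``describe_trap`` are
hypothetical names introduced for illustration and are not part of LLVM or of
any runtime. Only the IDs listed in the table are included.

```python
# Illustrative only: TRAP_USAGE, code_sequence and describe_trap are
# hypothetical names, not part of LLVM or the ROCm runtime.
TRAP_USAGE = {
    0x00: "reserved (by hardware)",
    0x01: "debugtrap(arg)",
    0x02: "llvm.trap",
    0x03: "llvm.debugtrap",
    0x04: "reserved",
    0x05: "reserved",
    0x06: "reserved",
    0x07: "debugger breakpoint",
    0x08: "reserved",
    0xFE: "reserved",
    0xFF: "reserved",
}


def code_sequence(trap_id: int) -> str:
    """Format the s_trap code sequence the way the table writes it."""
    return f"s_trap {trap_id:#04x}"


def describe_trap(trap_id: int) -> str:
    """Return the usage column for a trap ID, or a marker if it is not listed."""
    return TRAP_USAGE.get(trap_id, "not listed in the table")
```

For example, ``describe_trap(0x02)`` returns ``"llvm.trap"``, and an ID that the
table does not mention falls back to the "not listed" marker rather than
guessing a meaning.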
.. _amdgpu-amdhsa-function-call-convention:
 | 
						||
 | 
						||
Call Convention
 | 
						||
~~~~~~~~~~~~~~~
 | 
						||
 | 
						||
.. note::
 | 
						||
 | 
						||
  This section is currently incomplete and has inaccuracies. It is WIP that will
 | 
						||
  be updated as information is determined.
 | 
						||
 | 
						||
See :ref:`amdgpu-dwarf-address-space-mapping` for information on swizzled
 | 
						||
addresses. Unswizzled addresses are normal linear addresses.
 | 
						||
 | 
						||
Kernel Functions
 | 
						||
++++++++++++++++
 | 
						||
 | 
						||
This section describes the call convention ABI for the outer kernel function.
 | 
						||
 | 
						||
See :ref:`amdgpu-amdhsa-initial-kernel-execution-state` for the kernel call
 | 
						||
convention.
 | 
						||
 | 
						||
The following is not part of the AMDGPU kernel calling convention but describes
 | 
						||
how the AMDGPU backend implements function calls:
 | 
						||
 | 
						||
1.  Clang decides the kernarg layout to match the *HSA Programmer's Language
 | 
						||
    Reference* [HSA]_.
 | 
						||
 | 
						||
    - All structs are passed directly.
 | 
						||
    - Lambda values are passed *TBA*.
 | 
						||
 | 
						||
    .. TODO::
 | 
						||
 | 
						||
      - Does this really follow HSA rules? Or are structs >16 bytes passed
 | 
						||
        by-value struct?
 | 
						||
      - What is ABI for lambda values?
 | 
						||
 | 
						||
2.  The CFI return address is undefined.
 | 
						||
3.  If the kernel contains no calls then:
 | 
						||
 | 
						||
    - If using the ``amdhsa`` OS ABI (see :ref:`amdgpu-os-table`), and it is
      known during ISel that there is stack usage, SGPR0-3 is reserved for use
      as the scratch SRD and SGPR33 is reserved for the wave scratch offset.
      Stack usage is assumed if ``-O0`` is used, if stack objects for locals,
      etc. are already known, or if there are any function calls.
 | 
						||
    - Otherwise, five high numbered SGPRs are reserved for the tentative scratch
      SRD and wave scratch offset. These will be used if it is determined that
      spilling is needed.
 | 
						||
 | 
						||
      - If no use is made of the tentative scratch SRD or wave scratch offset,
 | 
						||
        then they are unreserved and the register count is determined ignoring
 | 
						||
        them.
 | 
						||
      - If use is made of the tentative scratch SRD or wave scratch offset,
 | 
						||
        then the register numbers used are shifted to be after the highest one
 | 
						||
        allocated by the register allocator, and all uses updated. The register
 | 
						||
        count will include them in the shifted location. Since register
 | 
						||
        allocation may introduce spills, this shifting allows them to be
 | 
						||
        eliminated without having to perform register allocation again.
 | 
						||
      - In either case, if the processor has the SGPR allocation bug, the
        tentative allocation is not shifted or unreserved in order to ensure the
        register count is higher to work around the bug.
 | 
						||
 | 
						||
4.  If the kernel contains function calls:
 | 
						||
 | 
						||
    - SP is set to the wave scratch offset.
 | 
						||
 | 
						||
      - Since SP is an unswizzled address relative to the queue scratch base,
        and the wave scratch offset is an unswizzled offset, this means that if
        SP is used to access swizzled scratch memory, it will access the private
        segment address 0.
 | 
						||
 | 
						||
      .. note::
 | 
						||
 | 
						||
        This is planned to be changed to be the unswizzled base address of the
 | 
						||
        wavefront scratch backing memory.
 | 
						||
 | 
						||
Non-Kernel Functions
 | 
						||
++++++++++++++++++++
 | 
						||
 | 
						||
This section describes the call convention ABI for functions other than the
 | 
						||
outer kernel function.
 | 
						||
 | 
						||
If a kernel has function calls then scratch is always allocated and used for the
 | 
						||
call stack which grows from low address to high address using the swizzled
 | 
						||
scratch address space.
 | 
						||
 | 
						||
On entry to a function:
 | 
						||
 | 
						||
1.  SGPR0-3 contain a V# with the following properties:
 | 
						||
 | 
						||
    * Base address of the queue scratch backing memory.
 | 
						||
 | 
						||
      .. note::
 | 
						||
 | 
						||
        This is planned to be changed to be the unswizzled base address of the
 | 
						||
        wavefront scratch backing memory.
 | 
						||
 | 
						||
    * Swizzled with dword element size and stride of wavefront size elements.
 | 
						||
 | 
						||
2.  The FLAT_SCRATCH register pair is set up. See
 | 
						||
    :ref:`amdgpu-amdhsa-flat-scratch`.
 | 
						||
3.  GFX6-8: M0 register set to the size of LDS in bytes.
 | 
						||
4.  The EXEC register is set to the lanes active on entry to the function.
 | 
						||
5.  MODE register: *TBD*
 | 
						||
6.  VGPR0-31 and SGPR4-29 are used to pass function input arguments as described
 | 
						||
    below.
 | 
						||
7.  SGPR30-31 return address (RA). The code address that the function must
 | 
						||
    return to when it completes. The value is undefined if the function is *no
 | 
						||
    return*.
 | 
						||
8.  SGPR32 is used for the stack pointer (SP). It is an unswizzled
 | 
						||
    scratch offset relative to the beginning of the queue scratch backing
 | 
						||
    memory.
 | 
						||
 | 
						||
    The unswizzled SP can be used with buffer instructions as an unswizzled SGPR
 | 
						||
    offset with the scratch V# in SGPR0-3 to access the stack in a swizzled
 | 
						||
    manner.
 | 
						||
 | 
						||
    The swizzled SP value is always 4 byte aligned for the ``r600``
 | 
						||
    architecture and 16 byte aligned for the ``amdgcn`` architecture.
 | 
						||
 | 
						||
    .. note::
 | 
						||
 | 
						||
      The ``amdgcn`` value is selected to avoid dynamic stack alignment for the
 | 
						||
      OpenCL language which has the largest base type defined as 16 bytes.
 | 
						||
 | 
						||
    On entry, the swizzled SP value is the address of the first function
 | 
						||
    argument passed on the stack. Other stack passed arguments are positive
 | 
						||
    offsets from the entry swizzled SP value.
 | 
						||
 | 
						||
    The function may use positive offsets beyond the last stack passed argument
 | 
						||
    for stack allocated local variables and register spill slots. If necessary
 | 
						||
    the function may align these to greater alignment than 16 bytes. After these
 | 
						||
    the function may dynamically allocate space for such things as runtime sized
 | 
						||
    ``alloca`` local allocations.
 | 
						||
 | 
						||
    If the function calls another function, it will place any stack allocated
 | 
						||
    arguments after the last local allocation and adjust SGPR32 to the address
 | 
						||
    after the last local allocation.
 | 
						||
 | 
						||
    .. note::
 | 
						||
 | 
						||
      The SP value is planned to be changed to be the unswizzled offset relative
 | 
						||
      to the wavefront scratch backing memory.
 | 
						||
 | 
						||
9.  SGPR33 wavefront scratch base offset. The unswizzled offset from the queue
 | 
						||
    scratch backing memory base to the base of the wavefront scratch backing
 | 
						||
    memory.
 | 
						||
 | 
						||
    It is used to convert the unswizzled SP value to a swizzled address in the
 | 
						||
    private address space by:
 | 
						||
 | 
						||
      | private address = (unswizzled SP - wavefront scratch base offset) /
 | 
						||
        wavefront size
 | 
						||
 | 
						||
    This may be used to obtain the private address of stack objects and to
 | 
						||
    convert these addresses to a flat address by adding the flat scratch aperture
 | 
						||
    base address.
 | 
						||
 | 
						||
    .. note::
 | 
						||
 | 
						||
      This is planned to be eliminated when SP is changed to be the unswizzled
 | 
						||
      offset relative to the wavefront scratch backing memory. Then the
 | 
						||
      conversion simplifies to:
 | 
						||
 | 
						||
        | private address = unswizzled SP / wavefront size
 | 
						||
 | 
						||
10. All other registers are unspecified.
 | 
						||
11. Any necessary ``waitcnt`` has been performed to ensure memory is available
 | 
						||
    to the function.
 | 
						||
 | 
						||
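The address conversion given for SGPR33 in the list above can be worked through
numerically. The sketch below simply restates the formula
``(unswizzled SP - wavefront scratch base offset) / wavefront size`` and the
follow-on step of adding the flat scratch aperture base. The function names and
all numeric values are made up for illustration; integer division is an
assumption about how the quotient is taken.

```python
def private_address(unswizzled_sp: int, wave_scratch_base_offset: int,
                    wavefront_size: int) -> int:
    """Swizzled private address for an unswizzled SP, per the formula above.

    Integer division is assumed; the text only writes "/".
    """
    return (unswizzled_sp - wave_scratch_base_offset) // wavefront_size


def flat_address(private_addr: int, flat_scratch_aperture_base: int) -> int:
    """Flat address obtained by adding the flat scratch aperture base."""
    return flat_scratch_aperture_base + private_addr


# Made-up example values, chosen only to make the arithmetic easy to follow.
WAVEFRONT_SIZE = 64
WAVE_BASE = 0x10000                    # hypothetical SGPR33 value
SP = WAVE_BASE + 16 * WAVEFRONT_SIZE   # hypothetical SGPR32 value
APERTURE_BASE = 0x2000_0000            # hypothetical aperture base

priv = private_address(SP, WAVE_BASE, WAVEFRONT_SIZE)
flat = flat_address(priv, APERTURE_BASE)
```

With these values the unswizzled distance from the wavefront base is 1024
bytes, so ``priv`` is 16 and ``flat`` is ``0x20000010``.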
On exit from a function:
 | 
						||
 | 
						||
1.  VGPR0-31 and SGPR4-29 are used to pass function result arguments as
 | 
						||
    described below. Any registers used are considered clobbered registers.
 | 
						||
2.  The following registers are preserved and have the same value as on entry:
 | 
						||
 | 
						||
    * FLAT_SCRATCH
 | 
						||
    * EXEC
 | 
						||
    * GFX6-8: M0
 | 
						||
    * All SGPR and VGPR registers except the clobbered registers of SGPR4-31 and
 | 
						||
      VGPR0-31.
 | 
						||
 | 
						||
      For the AMDGPU backend, an inter-procedural register allocation (IPRA)
 | 
						||
      optimization may mark some of clobbered SGPR4-31 and VGPR0-31 registers as
 | 
						||
      preserved if it can be determined that the called function does not change
 | 
						||
      their value.
 | 
						||
 | 
						||
3.  The PC is set to the RA provided on entry.
4.  MODE register: *TBD*.
5.  All other registers are clobbered.
6.  Any necessary ``waitcnt`` has been performed to ensure memory accessed by
    the function is available to the caller.
 | 
						||
 | 
						||
.. TODO::
 | 
						||
 | 
						||
  - On gfx908 are all ACC registers clobbered?
 | 
						||
 | 
						||
  - How are function results returned? The address of structured types is passed
 | 
						||
    by reference, but what about other types?
 | 
						||
 | 
						||
The function input arguments are made up of the formal arguments explicitly
 | 
						||
declared by the source language function plus the implicit input arguments used
 | 
						||
by the implementation.
 | 
						||
 | 
						||
The source language input arguments are:
 | 
						||
 | 
						||
1. Any source language implicit ``this`` or ``self`` argument comes first as a
 | 
						||
   pointer type.
 | 
						||
2. Followed by the function formal arguments in left to right source order.
 | 
						||
 | 
						||
The source language result arguments are:
 | 
						||
 | 
						||
1. The function result argument.
 | 
						||
 | 
						||
The source language input or result struct type arguments that are less than or
 | 
						||
equal to 16 bytes are decomposed recursively into their base type fields, and
 | 
						||
each field is passed as if a separate argument. For input arguments, if the
 | 
						||
called function requires the struct to be in memory, for example because its
 | 
						||
address is taken, then the function body is responsible for allocating a stack
 | 
						||
location and copying the field arguments into it. Clang terms this *direct
 | 
						||
struct*.
 | 
						||
 | 
						||
The source language input struct type arguments that are greater than 16 bytes
 | 
						||
are passed by reference. The caller is responsible for allocating a stack
 | 
						||
location to make a copy of the struct value and pass the address as the input
 | 
						||
argument. The called function is responsible to perform the dereference when
 | 
						||
accessing the input argument. Clang terms this *by-value struct*.
 | 
						||
 | 
						||
A source language result struct type argument that is greater than 16 bytes is
 | 
						||
returned by reference. The caller is responsible for allocating a stack location
 | 
						||
to hold the result value and passes the address as the last input argument
 | 
						||
(before the implicit input arguments). In this case there are no result
 | 
						||
arguments. The called function is responsible to perform the dereference when
 | 
						||
storing the result value. Clang terms this *structured return (sret)*.
 | 
						||
 | 
						||
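The three struct passing cases described above depend only on the struct size
and on whether the struct is an input or a result. The sketch below restates
that decision as code. ``classify_struct`` is a hypothetical helper, not a Clang
or LLVM function, and it inherits any inaccuracy in the description above.

```python
DIRECT_LIMIT_BYTES = 16  # threshold stated in the text above


def classify_struct(size_in_bytes: int, is_result: bool) -> str:
    """Return the Clang term the text uses for how a struct is passed."""
    if size_in_bytes <= DIRECT_LIMIT_BYTES:
        # Decomposed recursively into base type fields, each passed separately.
        return "direct struct"
    if is_result:
        # Caller allocates stack space and passes its address as the last
        # source language input argument.
        return "structured return (sret)"
    # Caller copies the value to the stack and passes the address.
    return "by-value struct"
```

For example, a 16 byte input struct is classified as a *direct struct*, while a
24 byte result is classified as *structured return (sret)*.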
*TODO: correct the sret definition.*
 | 
						||
 | 
						||
.. TODO::
 | 
						||
 | 
						||
  Is this definition correct? Or is sret only used if passing in registers, and
 | 
						||
  pass as non-decomposed struct as stack argument? Or something else? Is the
 | 
						||
  memory location in the caller stack frame, or a stack memory argument and so
 | 
						||
  no address is passed as the caller can directly write to the argument stack
 | 
						||
  location. But then the stack location is still live after return. If an
 | 
						||
  argument stack location is it the first stack argument or the last one?
 | 
						||
 | 
						||
Lambda argument types are treated as struct types with an implementation defined
 | 
						||
set of fields.
 | 
						||
 | 
						||
.. TODO::
 | 
						||
 | 
						||
  Need to specify the ABI for lambda types for AMDGPU.
 | 
						||
 | 
						||
For the AMDGPU backend all source language arguments (including the decomposed
 | 
						||
struct type arguments) are passed in VGPRs unless marked ``inreg`` in which case
 | 
						||
they are passed in SGPRs.
 | 
						||
 | 
						||
The AMDGPU backend walks the function call graph from the leaves to determine
 | 
						||
which implicit input arguments are used, propagating to each caller of the
 | 
						||
function. The used implicit arguments are appended to the function arguments
 | 
						||
after the source language arguments in the following order:
 | 
						||
 | 
						||
.. TODO::
 | 
						||
 | 
						||
  Is recursion or external functions supported?
 | 
						||
 | 
						||
1.  Work-Item ID (1 VGPR)
 | 
						||
 | 
						||
    The X, Y and Z work-item ID are packed into a single VGPR with the following
 | 
						||
    layout. Only fields actually used by the function are set. The other bits
 | 
						||
    are undefined.
 | 
						||
 | 
						||
    The values come from the initial kernel execution state. See
 | 
						||
    :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
 | 
						||
 | 
						||
    .. table:: Work-item implicit argument layout
 | 
						||
      :name: amdgpu-amdhsa-workitem-implict-argument-layout-table
 | 
						||
 | 
						||
      ======= ======= ==============
 | 
						||
      Bits    Size    Field Name
 | 
						||
      ======= ======= ==============
 | 
						||
      9:0     10 bits X Work-Item ID
 | 
						||
      19:10   10 bits Y Work-Item ID
 | 
						||
      29:20   10 bits Z Work-Item ID
 | 
						||
      31:30   2 bits  Unused
 | 
						||
      ======= ======= ==============
 | 
						||
 | 
						||
2.  Dispatch Ptr (2 SGPRs)
 | 
						||
 | 
						||
    The value comes from the initial kernel execution state. See
 | 
						||
    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
 | 
						||
 | 
						||
3.  Queue Ptr (2 SGPRs)
 | 
						||
 | 
						||
    The value comes from the initial kernel execution state. See
 | 
						||
    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
 | 
						||
 | 
						||
4.  Kernarg Segment Ptr (2 SGPRs)
 | 
						||
 | 
						||
    The value comes from the initial kernel execution state. See
 | 
						||
    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
 | 
						||
 | 
						||
5.  Dispatch ID (2 SGPRs)
 | 
						||
 | 
						||
    The value comes from the initial kernel execution state. See
 | 
						||
    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
 | 
						||
 | 
						||
6.  Work-Group ID X (1 SGPR)
 | 
						||
 | 
						||
    The value comes from the initial kernel execution state. See
 | 
						||
    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
 | 
						||
 | 
						||
7.  Work-Group ID Y (1 SGPR)
 | 
						||
 | 
						||
    The value comes from the initial kernel execution state. See
 | 
						||
    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
 | 
						||
 | 
						||
8.  Work-Group ID Z (1 SGPR)
 | 
						||
 | 
						||
    The value comes from the initial kernel execution state. See
 | 
						||
    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
 | 
						||
 | 
						||
9.  Implicit Argument Ptr (2 SGPRs)
 | 
						||
 | 
						||
    The value is computed by adding an offset to Kernarg Segment Ptr to get the
 | 
						||
    global address space pointer to the first kernarg implicit argument.
 | 
						||
 | 
						||
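The packed work-item ID layout given in the table for the first implicit
argument can be decoded with shifts and masks. The following is a minimal
sketch of that bit layout only; ``pack_workitem_id`` and ``unpack_workitem_id``
are hypothetical helper names and do not correspond to any backend function.

```python
FIELD_MASK = 0x3FF  # each work-item ID field is 10 bits wide
Y_SHIFT = 10        # bits 19:10
Z_SHIFT = 20        # bits 29:20


def pack_workitem_id(x: int, y: int, z: int) -> int:
    """Pack X, Y and Z work-item IDs into one 32-bit value (bits 31:30 unused)."""
    for value in (x, y, z):
        if not 0 <= value <= FIELD_MASK:
            raise ValueError("work-item ID does not fit in 10 bits")
    return x | (y << Y_SHIFT) | (z << Z_SHIFT)


def unpack_workitem_id(vgpr: int) -> tuple:
    """Extract (x, y, z) from the packed value, ignoring the unused top bits."""
    x = vgpr & FIELD_MASK
    y = (vgpr >> Y_SHIFT) & FIELD_MASK
    z = (vgpr >> Z_SHIFT) & FIELD_MASK
    return (x, y, z)
```

Since only the fields actually used by the function are set, a consumer should
read only the fields it is known to need; the other bits of the result are
undefined.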
The input and result arguments are assigned in order in the following manner:
 | 
						||
 | 
						||
.. note::

  There are likely some errors and omissions in the following description that
  need correction.

  .. TODO::

    Check the clang source code to decipher how function arguments and return
    results are handled. Also see the AMDGPU specific values used.
 | 
						||
 | 
						||
* VGPR arguments are assigned to consecutive VGPRs starting at VGPR0 up to
 | 
						||
  VGPR31.
 | 
						||
 | 
						||
  If there are more arguments than will fit in these registers, the remaining
 | 
						||
  arguments are allocated on the stack in order on naturally aligned
 | 
						||
  addresses.
 | 
						||
 | 
						||
  .. TODO::
 | 
						||
 | 
						||
    How are overly aligned structures allocated on the stack?
 | 
						||
 | 
						||
* SGPR arguments are assigned to consecutive SGPRs starting at SGPR0 up to
 | 
						||
  SGPR29.
 | 
						||
 | 
						||
  If there are more arguments than will fit in these registers, the remaining
 | 
						||
  arguments are allocated on the stack in order on naturally aligned
 | 
						||
  addresses.
 | 
						||
 | 
						||
Note that decomposed struct type arguments may have some fields passed in
 | 
						||
registers and some in memory.
 | 
						||
 | 
						||
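The VGPR assignment rule above can be sketched as a simple allocator. This is
an illustration of the rule as described, not the backend's implementation, and
it rests on several assumptions that the text does not state: each argument
occupies whole 32-bit registers, an argument that does not fit entirely in the
remaining registers goes to the stack along with all later arguments, and
"naturally aligned" means aligned to the argument's own size. The name
``assign_vgpr_args`` is hypothetical.

```python
NUM_ARG_VGPRS = 32  # VGPR0 through VGPR31


def assign_vgpr_args(arg_sizes):
    """Assign each argument (size in bytes, > 0) to VGPRs or a stack offset.

    Returns ("vgpr", first, last) or ("stack", byte_offset) per argument.
    """
    locations = []
    next_vgpr = 0
    stack_offset = 0
    on_stack = False
    for size in arg_sizes:
        dwords = (size + 3) // 4  # assumption: whole 32-bit registers
        if not on_stack and next_vgpr + dwords <= NUM_ARG_VGPRS:
            locations.append(("vgpr", next_vgpr, next_vgpr + dwords - 1))
            next_vgpr += dwords
        else:
            # Assumption: once one argument spills, the rest follow on the stack.
            on_stack = True
            align = size  # assumption: natural alignment equals the size
            stack_offset = (stack_offset + align - 1) // align * align
            locations.append(("stack", stack_offset))
            stack_offset += size
    return locations
```

For example, with thirty-two 4 byte arguments followed by a 4 byte and an
8 byte argument, the first thirty-two land in VGPR0 to VGPR31, the next one at
stack offset 0, and the 8 byte argument at stack offset 8 after alignment.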
.. TODO::

  So a struct which can pass some fields as decomposed register arguments, will
  pass the rest as decomposed stack elements? But an argument that will not start
  in registers will not be decomposed and will be passed as a non-decomposed
  stack value?
 | 
						||
 | 
						||
The following is not part of the AMDGPU function calling convention but
 | 
						||
describes how the AMDGPU backend implements function calls:
 | 
						||
 | 
						||
1.  SGPR34 is used as a frame pointer (FP) if necessary. Like the SP it is an
 | 
						||
    unswizzled scratch address. It is only needed if runtime sized ``alloca``
 | 
						||
    are used, or for the reasons defined in ``SIFrameLowering``.
 | 
						||
2.  Runtime stack alignment is not currently supported.
 | 
						||
 | 
						||
    .. TODO::
 | 
						||
 | 
						||
      - If runtime stack alignment is supported then will an extra argument
 | 
						||
        pointer register be used?
 | 
						||
 | 
						||
2.  Allocating SGPR arguments on the stack are not supported.
 | 
						||
 | 
						||
3.  No CFI is currently generated. See :ref:`amdgpu-call-frame-information`.

    .. note::

      Before CFI is generated, the call convention will be changed so that SP is
      an unswizzled address relative to the wave scratch base.

      CFI will be generated that defines the CFA as the unswizzled address
      relative to the wave scratch base in the unswizzled private address space
      of the lowest address stack allocated local variable.

      ``DW_AT_frame_base`` will be defined as the swizzled address in the
      swizzled private address space by dividing the CFA by the wavefront size
      (since CFA is always at least dword aligned which matches the scratch
      swizzle element size).

      If no dynamic stack alignment was performed, the stack allocated arguments
      are accessed as negative offsets relative to ``DW_AT_frame_base``, and the
      local variables and register spill slots are accessed as positive offsets
      relative to ``DW_AT_frame_base``.
 | 
						||
 | 
						||
4.  Function argument passing is implemented by copying the input physical
    registers to virtual registers on entry. The register allocator can spill if
    necessary. These are copied back to physical registers at call sites. The
    net effect is that each function call can have these values in entirely
    distinct locations. The IPRA can help avoid shuffling argument registers.
5.  Call sites are implemented by setting up the arguments at positive offsets
    from SP. Then SP is incremented to account for the known frame size before
    the call and decremented after the call.

    .. note::

      The CFI will reflect the changed calculation needed to compute the CFA
      from SP.
 | 
						||
 | 
						||
6.  4 byte spill slots are used in the stack frame. One slot is allocated for an
    emergency spill slot. Buffer instructions are used for stack accesses and
    not the ``flat_scratch`` instruction.

    .. TODO::

      Explain when the emergency spill slot is used.
 | 
						||
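The planned ``DW_AT_frame_base`` computation mentioned in the note on CFI is a
plain division of the CFA by the wavefront size. The following Python sketch
illustrates only that arithmetic; it is not backend code. The function name and
the default wavefront size of 64 are assumptions made for the example.

.. code-block:: python

  def frame_base_from_cfa(cfa, wavefront_size=64):
      """Convert an unswizzled CFA offset into a swizzled frame base offset."""
      # The text states that the CFA is always at least dword aligned.
      if cfa % 4 != 0:
          raise ValueError("CFA is expected to be at least dword aligned")
      # Dividing by the wavefront size yields the swizzled (per-lane) address.
      return cfa // wavefront_size

With a wavefront size of 64, an unswizzled CFA of 1024 bytes corresponds to a
swizzled frame base of 16 bytes; with a wavefront size of 32 it corresponds to
32 bytes.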
 | 
						||
.. TODO::

  Possible broken issues:

  - Stack arguments must be aligned to required alignment.
  - Stack is aligned to max(16, max formal argument alignment)
  - Direct argument < 64 bits should check register budget.
  - Register budget calculation should respect ``inreg`` for SGPR.
  - SGPR overflow is not handled.
  - struct with 1 member unpeeling is not checking size of member.
  - ``sret`` is after ``this`` pointer.
  - Caller is not implementing stack realignment: need an extra pointer.
  - Should say AMDGPU passes FP rather than SP.
  - Should CFI define CFA as address of locals or arguments. Difference is
    apparent when have implemented dynamic alignment.
  - If ``SCRATCH`` instruction could allow negative offsets then can make FP be
    highest address of stack frame and use negative offset for locals. Would
    allow SP to be the same as FP and could support signal-handler-like as now
    have a real SP for the top of the stack.
  - How is ``sret`` passed on the stack? In argument stack area? Can it overlay
    arguments?
 | 
						||
 | 
						||
AMDPAL
 | 
						||
------
 | 
						||
 | 
						||
This section provides code conventions used when the target triple OS is
``amdpal`` (see :ref:`amdgpu-target-triples`) for passing runtime parameters
from the application/runtime to each invocation of a hardware shader. These
parameters include both generic, application-controlled parameters called
*user data* as well as system-generated parameters that are a product of the
draw or dispatch execution.
 | 
						||
 | 
						||
User Data
 | 
						||
~~~~~~~~~
 | 
						||
 | 
						||
Each hardware stage has a set of 32-bit *user data registers* which can be
written from a command buffer and then loaded into SGPRs when waves are launched
via a subsequent dispatch or draw operation. This is the way most arguments are
passed from the application/runtime to a hardware shader.
 | 
						||
 | 
						||
Compute User Data
 | 
						||
~~~~~~~~~~~~~~~~~
 | 
						||
 | 
						||
Compute shader user data mappings are simpler than those of graphics shaders,
and have a fixed mapping.

Note that there are always 10 available *user data entries* in registers.
Entries beyond that limit must be fetched from memory (via the spill table
pointer) by the shader.
 | 
						||
 | 
						||
  .. table:: PAL Compute Shader User Data Registers
     :name: pal-compute-user-data-registers

     ============= ================================
     User Register Description
     ============= ================================
     0             Global Internal Table (32-bit pointer)
     1             Per-Shader Internal Table (32-bit pointer)
     2 - 11        Application-Controlled User Data (10 32-bit values)
     12            Spill Table (32-bit pointer)
     13 - 14       Thread Group Count (64-bit pointer)
     15            GDS Range
     ============= ================================
 | 
						||
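The fixed compute mapping in the table above can be expressed as a small
lookup. The Python sketch below is illustrative only; the function and constant
names are hypothetical, and it does not model how the shader addresses an entry
inside the spill table, which this section does not specify.

.. code-block:: python

  FIRST_APP_REGISTER = 2   # user registers 2 - 11 hold application user data
  NUM_APP_REGISTERS = 10   # 10 user data entries are available in registers

  def locate_compute_user_data(entry):
      """Return where a compute shader finds a given user data entry."""
      if entry < 0:
          raise ValueError("user data entry index must be non-negative")
      if entry < NUM_APP_REGISTERS:
          return ("register", FIRST_APP_REGISTER + entry)
      # Entries beyond the register limit are fetched from memory through the
      # spill table pointer held in user register 12.
      return ("spill", entry)

Under these assumptions, entry 0 is found in user register 2, entry 9 in user
register 11, and entry 10 must be loaded through the spill table.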
 | 
						||
Graphics User Data
 | 
						||
~~~~~~~~~~~~~~~~~~
 | 
						||
 | 
						||
Graphics pipelines support a much more flexible user data mapping:
 | 
						||
 | 
						||
  .. table:: PAL Graphics Shader User Data Registers
     :name: pal-graphics-user-data-registers

     ============= ================================
     User Register Description
     ============= ================================
     0             Global Internal Table (32-bit pointer)
     +             Per-Shader Internal Table (32-bit pointer)
     + 1-15        Application Controlled User Data
                   (1-15 Contiguous 32-bit Values in Registers)
     +             Spill Table (32-bit pointer)
     +             Draw Index (First Stage Only)
     +             Vertex Offset (First Stage Only)
     +             Instance Offset (First Stage Only)
     ============= ================================
 | 
						||
 | 
						||
  The placement of the global internal table remains fixed in the first *user
  data SGPR register*. Otherwise all parameters are optional, and can be mapped
  to any desired *user data SGPR register*, with the following restrictions:

  * Draw Index, Vertex Offset, and Instance Offset can only be used by the first
    active hardware stage in a graphics pipeline (i.e. where the API vertex
    shader runs).

  * Application-controlled user data must be mapped into a contiguous range of
    user data registers.

  * The application-controlled user data range supports compaction remapping, so
    only *entries* that are actually consumed by the shader must be assigned to
    corresponding *registers*. Note that in order to support an efficient runtime
    implementation, the remapping must pack *registers* in the same order as
    *entries*, with unused *entries* removed.
 | 
						||
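The compaction remapping rule in the last restriction can be illustrated with a
short Python sketch. It is a hypothetical helper, not PAL or LLVM code: it
assumes the caller supplies the set of consumed entries and the first register
of the contiguous application-controlled range.

.. code-block:: python

  def compact_user_data(used_entries, first_register):
      """Map consumed user data entries onto a contiguous register range."""
      mapping = {}
      # Registers are packed in the same order as entries, with unused
      # entries removed, so sort the consumed entries first.
      for position, entry in enumerate(sorted(set(used_entries))):
          mapping[entry] = first_register + position
      return mapping

For a shader that consumes only entries 0, 3 and 7, with the range starting at
user register 2, this yields the mapping 0 to 2, 3 to 3 and 7 to 4.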
 | 
						||
.. _pal_global_internal_table:
 | 
						||
 | 
						||
Global Internal Table
 | 
						||
~~~~~~~~~~~~~~~~~~~~~
 | 
						||
 | 
						||
The global internal table is a table of *shader resource descriptors* (SRDs)
that define how certain engine-wide, runtime-managed resources should be
accessed from a shader. The majority of these resources have HW-defined formats,
and it is up to the compiler to write/read data as required by the target
hardware.

The following table illustrates the required format:
 | 
						||
 | 
						||
  .. table:: PAL Global Internal Table
     :name: pal-git-table

     ============= ================================
     Offset        Description
     ============= ================================
     0-3           Graphics Scratch SRD
     4-7           Compute Scratch SRD
     8-11          ES/GS Ring Output SRD
     12-15         ES/GS Ring Input SRD
     16-19         GS/VS Ring Output #0
     20-23         GS/VS Ring Output #1
     24-27         GS/VS Ring Output #2
     28-31         GS/VS Ring Output #3
     32-35         GS/VS Ring Input SRD
     36-39         Tessellation Factor Buffer SRD
     40-43         Off-Chip LDS Buffer SRD
     44-47         Off-Chip Param Cache Buffer SRD
     48-51         Sample Position Buffer SRD
     52            vaRange::ShadowDescriptorTable High Bits
     ============= ================================
 | 
						||
 | 
						||
  The pointer to the global internal table passed to the shader as user data
  is a 32-bit pointer. The top 32 bits should be assumed to be the same as
  the top 32 bits of the pipeline, so the shader may use the program
  counter's top 32 bits.
 | 
						||
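Forming the full 64-bit address of the global internal table from the 32-bit
user data value is a simple bit operation. The Python sketch below only
illustrates that arithmetic; the function name and the sample values are
hypothetical.

.. code-block:: python

  def extend_table_pointer(low32, program_counter):
      """Combine a 32-bit table pointer with the top 32 bits of the PC."""
      # Keep only the top 32 bits of the 64-bit program counter.
      high32 = program_counter & 0xFFFFFFFF00000000
      # The user data value supplies the low 32 bits of the address.
      return high32 | (low32 & 0xFFFFFFFF)

For instance, a user data value of ``0x1000`` combined with a program counter
of ``0x00007F1200400000`` gives the address ``0x00007F1200001000``.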
 | 
						||
Unspecified OS
 | 
						||
--------------
 | 
						||
 | 
						||
This section provides code conventions used when the target triple OS is
empty (see :ref:`amdgpu-target-triples`).
 | 
						||
 | 
						||
Trap Handler ABI
 | 
						||
~~~~~~~~~~~~~~~~
 | 
						||
 | 
						||
For code objects generated by the AMDGPU backend for a non-amdhsa OS, the
runtime does not install a trap handler. The ``llvm.trap`` and
``llvm.debugtrap`` intrinsics are handled as follows:
 | 
						||
 | 
						||
  .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
     :name: amdgpu-trap-handler-for-non-amdhsa-os-table

     =============== =============== ===========================================
     Usage           Code Sequence   Description
     =============== =============== ===========================================
     llvm.trap       s_endpgm        Causes wavefront to be terminated.
     llvm.debugtrap  *none*          Compiler warning given that there is no
                                     trap handler installed.
     =============== =============== ===========================================
 | 
						||
 | 
						||
Source Languages
 | 
						||
================
 | 
						||
 | 
						||
.. _amdgpu-opencl:
 | 
						||
 | 
						||
OpenCL
 | 
						||
------
 | 
						||
 | 
						||
When the language is OpenCL the following differences occur:

1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
2. The AMDGPU backend appends additional arguments to the kernel's explicit
   arguments for the AMDHSA OS (see
   :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
3. Additional metadata is generated
   (see :ref:`amdgpu-amdhsa-code-object-metadata`).
 | 
						||
 | 
						||
  .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
     :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table

     ======== ==== ========= ===========================================
     Position Byte Byte      Description
              Size Alignment
     ======== ==== ========= ===========================================
     1        8    8         OpenCL Global Offset X
     2        8    8         OpenCL Global Offset Y
     3        8    8         OpenCL Global Offset Z
     4        8    8         OpenCL address of printf buffer
     5        8    8         OpenCL address of virtual queue used by
                             enqueue_kernel.
     6        8    8         OpenCL address of AqlWrap struct used by
                             enqueue_kernel.
     7        8    8         Pointer argument used for Multi-grid
                             synchronization.
     ======== ==== ========= ===========================================
 | 
						||
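Because every implicit argument in the table above is 8 bytes in size with
8-byte alignment, their relative layout is easy to compute. The Python sketch
below is illustrative only: the argument names are hypothetical labels, and it
assumes that the implicit arguments start at the first 8-byte aligned offset
after the explicit arguments.

.. code-block:: python

  # Hypothetical labels for the seven implicit arguments, in table order.
  IMPLICIT_ARGS = [
      "global_offset_x",
      "global_offset_y",
      "global_offset_z",
      "printf_buffer",
      "default_queue",
      "aql_wrap",
      "multigrid_sync_arg",
  ]

  def implicit_argument_offsets(explicit_size):
      """Return the byte offset of each implicit argument."""
      # Assumption: round the explicit argument size up to 8-byte alignment.
      offset = (explicit_size + 7) // 8 * 8
      offsets = {}
      for name in IMPLICIT_ARGS:
          offsets[name] = offset
          offset += 8  # each implicit argument occupies 8 bytes
      return offsets

With 20 bytes of explicit arguments, this places the first implicit argument at
byte offset 24 and the last at byte offset 72.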
 | 
						||
.. _amdgpu-hcc:
 | 
						||
 | 
						||
HCC
 | 
						||
---
 | 
						||
 | 
						||
When the language is HCC the following differences occur:
 | 
						||
 | 
						||
1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
 | 
						||
 | 
						||
.. _amdgpu-assembler:
 | 
						||
 | 
						||
Assembler
 | 
						||
---------
 | 
						||
 | 
						||
The AMDGPU backend has an LLVM-MC based assembler which is currently in
development. It supports AMDGCN GFX6-GFX10.
 | 
						||
 | 
						||
This section describes general syntax for instructions and operands.
 | 
						||
 | 
						||
Instructions
 | 
						||
~~~~~~~~~~~~
 | 
						||
 | 
						||
.. toctree::
   :hidden:

   AMDGPU/AMDGPUAsmGFX7
   AMDGPU/AMDGPUAsmGFX8
   AMDGPU/AMDGPUAsmGFX9
   AMDGPU/AMDGPUAsmGFX900
   AMDGPU/AMDGPUAsmGFX904
   AMDGPU/AMDGPUAsmGFX906
   AMDGPU/AMDGPUAsmGFX908
   AMDGPU/AMDGPUAsmGFX10
   AMDGPU/AMDGPUAsmGFX1011
   AMDGPUModifierSyntax
   AMDGPUOperandSyntax
   AMDGPUInstructionSyntax
   AMDGPUInstructionNotation
 | 
						||
 | 
						||
An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`:

  | ``<``\ *opcode*\ ``> <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,...
    <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...``

:doc:`Operands<AMDGPUOperandSyntax>` are comma-separated while
:doc:`modifiers<AMDGPUModifierSyntax>` are space-separated.

The order of operands and modifiers is fixed.
Most modifiers are optional and may be omitted.

Links to detailed instruction syntax descriptions may be found in the following
table. Note that features under development are not included
in this description.
 | 
						||
 | 
						||
    =================================== =======================================
    Core ISA                            ISA Extensions
    =================================== =======================================
    :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>`   \-
    :doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>`   \-
    :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`   :doc:`gfx900<AMDGPU/AMDGPUAsmGFX900>`

                                        :doc:`gfx902<AMDGPU/AMDGPUAsmGFX900>`

                                        :doc:`gfx904<AMDGPU/AMDGPUAsmGFX904>`

                                        :doc:`gfx906<AMDGPU/AMDGPUAsmGFX906>`

                                        :doc:`gfx908<AMDGPU/AMDGPUAsmGFX908>`

                                        :doc:`gfx909<AMDGPU/AMDGPUAsmGFX900>`

    :doc:`GFX10<AMDGPU/AMDGPUAsmGFX10>` :doc:`gfx1011<AMDGPU/AMDGPUAsmGFX1011>`

                                        :doc:`gfx1012<AMDGPU/AMDGPUAsmGFX1011>`
    =================================== =======================================
 | 
						||
 | 
						||
For more information about instructions, their semantics and supported
combinations of operands, refer to one of the instruction set architecture
manuals [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_, [AMD-GCN-GFX9]_ and
[AMD-GCN-GFX10]_.
 | 
						||
 | 
						||
Operands
 | 
						||
~~~~~~~~
 | 
						||
 | 
						||
A detailed description of operands may be found
:doc:`here<AMDGPUOperandSyntax>`.
 | 
						||
 | 
						||
Modifiers
 | 
						||
~~~~~~~~~
 | 
						||
 | 
						||
A detailed description of modifiers may be found
:doc:`here<AMDGPUModifierSyntax>`.
 | 
						||
 | 
						||
Instruction Examples
 | 
						||
~~~~~~~~~~~~~~~~~~~~
 | 
						||
 | 
						||
DS
 | 
						||
++
 | 
						||
 | 
						||
.. code-block:: nasm

  ds_add_u32 v2, v4 offset:16
  ds_write_src2_b64 v2 offset0:4 offset1:8
  ds_cmpst_f32 v2, v4, v6
  ds_min_rtn_f64 v[8:9], v2, v[4:5]

For the full list of supported instructions, refer to "LDS/GDS instructions" in
the ISA Manual.
 | 
						||
 | 
						||
FLAT
 | 
						||
++++
 | 
						||
 | 
						||
.. code-block:: nasm

  flat_load_dword v1, v[3:4]
  flat_store_dwordx3 v[3:4], v[5:7]
  flat_atomic_swap v1, v[3:4], v5 glc
  flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
  flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc

For the full list of supported instructions, refer to "FLAT instructions" in
the ISA Manual.
 | 
						||
 | 
						||
MUBUF
 | 
						||
+++++
 | 
						||
 | 
						||
.. code-block:: nasm

  buffer_load_dword v1, off, s[4:7], s1
  buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
  buffer_store_format_xy v[1:2], off, s[4:7], s1
  buffer_wbinvl1
  buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc

For the full list of supported instructions, refer to "MUBUF Instructions" in
the ISA Manual.
 | 
						||
 | 
						||
SMRD/SMEM
 | 
						||
+++++++++
 | 
						||
 | 
						||
.. code-block:: nasm

  s_load_dword s1, s[2:3], 0xfc
  s_load_dwordx8 s[8:15], s[2:3], s4
  s_load_dwordx16 s[88:103], s[2:3], s4
  s_dcache_inv_vol
  s_memtime s[4:5]

For the full list of supported instructions, refer to "Scalar Memory
Operations" in the ISA Manual.
 | 
						||
 | 
						||
SOP1
 | 
						||
++++
 | 
						||
 | 
						||
.. code-block:: nasm

  s_mov_b32 s1, s2
  s_mov_b64 s[0:1], 0x80000000
  s_cmov_b32 s1, 200
  s_wqm_b64 s[2:3], s[4:5]
  s_bcnt0_i32_b64 s1, s[2:3]
  s_swappc_b64 s[2:3], s[4:5]
  s_cbranch_join s[4:5]

For the full list of supported instructions, refer to "SOP1 Instructions" in
the ISA Manual.
 | 
						||
 | 
						||
SOP2
 | 
						||
++++
 | 
						||
 | 
						||
.. code-block:: nasm

  s_add_u32 s1, s2, s3
  s_and_b64 s[2:3], s[4:5], s[6:7]
  s_cselect_b32 s1, s2, s3
  s_andn2_b32 s2, s4, s6
  s_lshr_b64 s[2:3], s[4:5], s6
  s_ashr_i32 s2, s4, s6
  s_bfm_b64 s[2:3], s4, s6
  s_bfe_i64 s[2:3], s[4:5], s6
  s_cbranch_g_fork s[4:5], s[6:7]

For the full list of supported instructions, refer to "SOP2 Instructions" in
the ISA Manual.
 | 
						||
 | 
						||
SOPC
 | 
						||
++++
 | 
						||
 | 
						||
.. code-block:: nasm

  s_cmp_eq_i32 s1, s2
  s_bitcmp1_b32 s1, s2
  s_bitcmp0_b64 s[2:3], s4
  s_setvskip s3, s5

For the full list of supported instructions, refer to "SOPC Instructions" in
the ISA Manual.
 | 
						||
 | 
						||
SOPP
 | 
						||
++++
 | 
						||
 | 
						||
.. code-block:: nasm

  s_barrier
  s_nop 2
  s_endpgm
  s_waitcnt 0 ; Wait for all counters to be 0
  s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
  s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
  s_sethalt 9
  s_sleep 10
  s_sendmsg 0x1
  s_sendmsg sendmsg(MSG_INTERRUPT)
  s_trap 1

For the full list of supported instructions, refer to "SOPP Instructions" in
the ISA Manual.

Unless otherwise mentioned, little verification is performed on the operands
of SOPP Instructions, so it is up to the programmer to be familiar with the
range of acceptable values.
 | 
						||
 | 
						||
VALU
 | 
						||
++++
 | 
						||
 | 
						||
For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
the assembler will automatically use the optimal encoding based on the
operands. To force a specific encoding, one can add a suffix to the opcode of
the instruction:

* ``_e32`` for 32-bit VOP1/VOP2/VOPC
* ``_e64`` for 64-bit VOP3
* ``_dpp`` for VOP_DPP
* ``_sdwa`` for VOP_SDWA
 | 
						||
 | 
						||
VOP1/VOP2/VOP3/VOPC examples:

.. code-block:: nasm

  v_mov_b32 v1, v2
  v_mov_b32_e32 v1, v2
  v_nop
  v_cvt_f64_i32_e32 v[1:2], v2
  v_floor_f32_e32 v1, v2
  v_bfrev_b32_e32 v1, v2
  v_add_f32_e32 v1, v2, v3
  v_mul_i32_i24_e64 v1, v2, 3
  v_mul_i32_i24_e32 v1, -3, v3
  v_mul_i32_i24_e32 v1, -100, v3
  v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
  v_max_f16_e32 v1, v2, v3
 | 
						||
 | 
						||
VOP_DPP examples:

.. code-block:: nasm

  v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
  v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_mov_b32 v0, v0 wave_shl:1
  v_mov_b32 v0, v0 row_mirror
  v_mov_b32 v0, v0 row_bcast:31
  v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
 | 
						||
 | 
						||
VOP_SDWA examples:

.. code-block:: nasm

  v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
  v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
  v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0

For the full list of supported instructions, refer to "Vector ALU instructions"
in the ISA Manual.
 | 
						||
 | 
						||
.. TODO::
 | 
						||
 | 
						||
  Remove once we switch to code object v3 by default.
 | 
						||
 | 
						||
.. _amdgpu-amdhsa-assembler-predefined-symbols-v2:
 | 
						||
 | 
						||
Code Object V2 Predefined Symbols (-mattr=-code-object-v3)
 | 
						||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 | 
						||
 | 
						||
.. warning:: Code Object V2 is not the default code object version emitted by
  this version of LLVM. For a description of the predefined symbols available
  with the default configuration (Code Object V3) see
  :ref:`amdgpu-amdhsa-assembler-predefined-symbols-v3`.

The AMDGPU assembler defines and updates some symbols automatically. These
symbols do not affect code generation.
 | 
						||
 | 
						||
.option.machine_version_major
 | 
						||
+++++++++++++++++++++++++++++
 | 
						||
 | 
						||
Set to the GFX major generation number of the target being assembled for. For
example, when assembling for a "GFX9" target this will be set to the integer
value "9". The possible GFX major generation numbers are presented in
:ref:`amdgpu-processors`.
 | 
						||
 | 
						||
.option.machine_version_minor
 | 
						||
+++++++++++++++++++++++++++++
 | 
						||
 | 
						||
Set to the GFX minor generation number of the target being assembled for. For
example, when assembling for a "GFX810" target this will be set to the integer
value "1". The possible GFX minor generation numbers are presented in
:ref:`amdgpu-processors`.
 | 
						||
 | 
						||
.option.machine_version_stepping
 | 
						||
++++++++++++++++++++++++++++++++
 | 
						||
 | 
						||
Set to the GFX stepping generation number of the target being assembled for.
For example, when assembling for a "GFX704" target this will be set to the
integer value "4". The possible GFX stepping generation numbers are presented
in :ref:`amdgpu-processors`.
 | 
						||
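The relationship between a full target name and the three version symbols can
be illustrated with a short Python sketch. This is not assembler code; the
function name is hypothetical, and it assumes that the last two characters of
the name are the decimal minor and stepping digits, which holds for the
examples used in this section.

.. code-block:: python

  def split_gfx_version(name):
      """Split a target name such as "gfx810" into (major, minor, stepping)."""
      lowered = name.lower()
      if not lowered.startswith("gfx") or len(lowered) < 6:
          raise ValueError("expected a name like gfx810 or gfx1011")
      digits = lowered[3:]
      # Assumption: the last digit is the stepping, the one before it is the
      # minor number, and everything else is the major generation.
      return int(digits[:-2]), int(digits[-2]), int(digits[-1])

Under these assumptions, "GFX810" splits into major 8, minor 1 and stepping 0,
and "GFX704" into major 7, minor 0 and stepping 4.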
 | 
						||
.kernel.vgpr_count
 | 
						||
++++++++++++++++++
 | 
						||
 | 
						||
Set to zero each time a
:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
encountered. At each instruction, if the current value of this symbol is less
than or equal to the maximum VGPR number explicitly referenced within that
instruction then the symbol value is updated to equal that VGPR number plus
one.
 | 
						||
 | 
						||
.kernel.sgpr_count
 | 
						||
++++++++++++++++++
 | 
						||
 | 
						||
Set to zero each time a
:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
encountered. At each instruction, if the current value of this symbol is less
than or equal to the maximum SGPR number explicitly referenced within that
instruction then the symbol value is updated to equal that SGPR number plus
one.
 | 
						||
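The update rule shared by ``.kernel.vgpr_count`` and ``.kernel.sgpr_count`` can
be sketched in Python as follows. The sketch is illustrative only; the function
names are hypothetical, and each instruction is modeled simply as the list of
register numbers it explicitly references.

.. code-block:: python

  def updated_register_count(count, referenced):
      """Apply the per-instruction update rule to a register count symbol."""
      # If the symbol is less than or equal to the highest register number
      # referenced, it becomes that register number plus one.
      if referenced and count <= max(referenced):
          return max(referenced) + 1
      return count

  def kernel_register_count(instructions):
      """Track the symbol value across all instructions of one kernel."""
      count = 0  # reset to zero at each .amdgpu_hsa_kernel directive
      for referenced in instructions:
          count = updated_register_count(count, referenced)
      return count

For a kernel whose instructions reference registers ``[1, 2]``, none,
``[8, 9]`` and ``[4, 5]`` in turn, the symbol ends at 10, that is, one more than
the highest register number seen.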
 | 
						||
.. _amdgpu-amdhsa-assembler-directives-v2:
 | 
						||
 | 
						||
Code Object V2 Directives (-mattr=-code-object-v3)
 | 
						||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 | 
						||
 | 
						||
.. warning:: Code Object V2 is not the default code object version emitted by
  this version of LLVM. For a description of the directives supported with
  the default configuration (Code Object V3) see
  :ref:`amdgpu-amdhsa-assembler-directives-v3`.

The AMDGPU ABI defines auxiliary data in the output code object. In assembly
source, one can specify this data with assembler directives.
 | 
						||
 | 
						||
.hsa_code_object_version major, minor
 | 
						||
+++++++++++++++++++++++++++++++++++++
 | 
						||
 | 
						||
*major* and *minor* are integers that specify the version of the HSA code
object that will be generated by the assembler.
 | 
						||
 | 
						||
.hsa_code_object_isa [major, minor, stepping, vendor, arch]
 | 
						||
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 | 
						||
 | 
						||
 | 
						||
*major*, *minor*, and *stepping* are all integers that describe the instruction
set architecture (ISA) version of the assembly program.

*vendor* and *arch* are quoted strings. *vendor* should always be equal to
"AMD" and *arch* should always be equal to "AMDGPU".

By default, the assembler will derive the ISA version, *vendor*, and *arch*
from the value of the -mcpu option that is passed to the assembler.
 | 
						||
 | 
						||
.. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel:
 | 
						||
 | 
						||
.amdgpu_hsa_kernel (name)
 | 
						||
+++++++++++++++++++++++++
 | 
						||
 | 
						||
This directive specifies that the symbol with the given name is a kernel entry
point (label) and that the object should contain a corresponding symbol of type
STT_AMDGPU_HSA_KERNEL.
 | 
						||
 | 
						||
.amd_kernel_code_t
 | 
						||
++++++++++++++++++
 | 
						||
 | 
						||
This directive marks the beginning of a list of key / value pairs that are used
to specify the amd_kernel_code_t object that will be emitted by the assembler.
The list must be terminated by the *.end_amd_kernel_code_t* directive. For any
amd_kernel_code_t values that are unspecified a default value will be used. The
default value for all keys is 0, with the following exceptions:

- *amd_code_version_major* defaults to 1.
- *amd_kernel_code_version_minor* defaults to 2.
- *amd_machine_kind* defaults to 1.
- *amd_machine_version_major*, *amd_machine_version_minor*, and
  *amd_machine_version_stepping* are derived from the value of the -mcpu option
  that is passed to the assembler.
- *kernel_code_entry_byte_offset* defaults to 256.
- *wavefront_size* defaults to 6 for all targets before GFX10. For GFX10
  onwards it defaults to 6 if target feature ``wavefrontsize64`` is enabled,
  otherwise 5. Note that wavefront size is specified as a power of two, so a
  value of **n** means a size of 2^ **n**.
- *call_convention* defaults to -1.
- *kernarg_segment_alignment*, *group_segment_alignment*, and
  *private_segment_alignment* default to 4. Note that alignments are specified
  as a power of 2, so a value of **n** means an alignment of 2^ **n**.
- *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled for
  GFX10 onwards.
- *enable_mem_ordered* defaults to 1 for GFX10 onwards.
 | 
						||
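Several of the defaults above are stored as exponents rather than as sizes. The
Python sketch below illustrates how such a value decodes and how the
*wavefront_size* default is selected. It is a sketch only; the function names
and parameters are hypothetical and not part of the assembler.

.. code-block:: python

  def decode_power_of_two(n):
      """Decode a value of n into the size or alignment 2^n."""
      return 1 << n

  def default_wavefront_size_log2(gfx_major, wavefrontsize64):
      """Return the default *wavefront_size* key value for a target."""
      # Targets before GFX10 always default to 6 (a wavefront size of 64).
      if gfx_major < 10:
          return 6
      # GFX10 onwards: 6 if wavefrontsize64 is enabled, otherwise 5.
      return 6 if wavefrontsize64 else 5

A *wavefront_size* of 6 therefore means 64 lanes, a value of 5 means 32 lanes,
and the default segment alignment value of 4 means a 16-byte alignment.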
 | 
						||
The *.amd_kernel_code_t* directive must be placed immediately after the
function label and before any instructions.

For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document,
comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
 | 
						||
 | 
						||
.. _amdgpu-amdhsa-assembler-example-v2:
 | 
						||
 | 
						||
Code Object V2 Example Source Code (-mattr=-code-object-v3)
 | 
						||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 | 
						||
 | 
						||
.. warning:: Code Object V2 is not the default code object version emitted by
  this version of LLVM. For an example that uses the default configuration
  (Code Object V3) see :ref:`amdgpu-amdhsa-assembler-example-v3`.

Here is an example of a minimal assembly source file, defining one HSA kernel:
 | 
						||
 | 
						||
.. code::
   :number-lines:

   .hsa_code_object_version 1,0
   .hsa_code_object_isa

   .hsatext
   .globl  hello_world
   .p2align 8
   .amdgpu_hsa_kernel hello_world

   hello_world:

     .amd_kernel_code_t
        enable_sgpr_kernarg_segment_ptr = 1
        is_ptr64 = 1
        compute_pgm_rsrc1_vgprs = 0
        compute_pgm_rsrc1_sgprs = 0
        compute_pgm_rsrc2_user_sgpr = 2
        compute_pgm_rsrc1_wgp_mode = 0
        compute_pgm_rsrc1_mem_ordered = 0
        compute_pgm_rsrc1_fwd_progress = 1
     .end_amd_kernel_code_t

     s_load_dwordx2 s[0:1], s[0:1] 0x0
     v_mov_b32 v0, 3.14159
     s_waitcnt lgkmcnt(0)
     v_mov_b32 v1, s0
     v_mov_b32 v2, s1
     flat_store_dword v[1:2], v0
     s_endpgm
   .Lfunc_end0:
     .size   hello_world, .Lfunc_end0-hello_world
 | 
						||
 | 
						||
.. _amdgpu-amdhsa-assembler-predefined-symbols-v3:

Code Object V3 Predefined Symbols (-mattr=+code-object-v3)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The AMDGPU assembler defines and updates some symbols automatically. These
symbols do not affect code generation.

.amdgcn.gfx_generation_number
+++++++++++++++++++++++++++++

Set to the GFX major generation number of the target being assembled for. For
example, when assembling for a "GFX9" target this will be set to the integer
value "9". The possible GFX major generation numbers are presented in
:ref:`amdgpu-processors`.

.amdgcn.gfx_generation_minor
++++++++++++++++++++++++++++

Set to the GFX minor generation number of the target being assembled for. For
example, when assembling for a "GFX810" target this will be set to the integer
value "1". The possible GFX minor generation numbers are presented in
:ref:`amdgpu-processors`.

.amdgcn.gfx_generation_stepping
+++++++++++++++++++++++++++++++

Set to the GFX stepping generation number of the target being assembled for.
For example, when assembling for a "GFX704" target this will be set to the
integer value "4". The possible GFX stepping generation numbers are presented
in :ref:`amdgpu-processors`.
 | 
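The three ``.amdgcn.gfx_generation_*`` symbols can be read as the components of
the processor name. The following Python sketch is an illustration only, not
the assembler's implementation: the function name ``split_gfx_version`` is
hypothetical, and it assumes that the last two characters of the name are the
minor and stepping numbers (each a single hexadecimal digit) and that
everything between ``gfx`` and those characters is the major number.

```python
from typing import Tuple


def split_gfx_version(processor: str) -> Tuple[int, int, int]:
    """Split a processor name such as "gfx810" into (major, minor, stepping)."""
    name = processor.lower()
    if not name.startswith("gfx") or len(name) < 6:
        raise ValueError(f"unrecognized processor name: {processor!r}")
    digits = name[3:]
    major = int(digits[:-2])        # everything before the last two characters
    minor = int(digits[-2], 16)     # assumed: a single hexadecimal digit
    stepping = int(digits[-1], 16)  # assumed: a single hexadecimal digit
    return major, minor, stepping


for proc in ("gfx704", "gfx810", "gfx900", "gfx1010"):
    print(proc, split_gfx_version(proc))
```

With this reading, ``gfx810`` has minor number 1 and ``gfx704`` has stepping
number 4, matching the examples given for the symbols above.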
						||
.. _amdgpu-amdhsa-assembler-symbol-next_free_vgpr:

.amdgcn.next_free_vgpr
++++++++++++++++++++++

Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum VGPR number explicitly
referenced within that instruction then the symbol value is updated to equal
that VGPR number plus one.

May be used to set the ``.amdhsa_next_free_vgpr`` directive in
:ref:`amdhsa-kernel-directives-table`.

May be set at any time, e.g. manually set to zero at the start of each kernel.

.. _amdgpu-amdhsa-assembler-symbol-next_free_sgpr:

.amdgcn.next_free_sgpr
++++++++++++++++++++++

Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum SGPR number explicitly
referenced within that instruction then the symbol value is updated to equal
that SGPR number plus one.

May be used to set the ``.amdhsa_next_free_sgpr`` directive in
:ref:`amdhsa-kernel-directives-table`.

May be set at any time, e.g. manually set to zero at the start of each kernel.
 | 
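The update rule for ``.amdgcn.next_free_vgpr`` and ``.amdgcn.next_free_sgpr``
can be sketched in a few lines of Python. This is an illustration under
simplifying assumptions, not the assembler's code: the names ``_REG`` and
``update_next_free`` are hypothetical, and only plain ``vN``/``sN`` registers
and ``v[a:b]``/``s[a:b]`` ranges are recognized.

```python
import re

# Matches v5, s3, v[1:2] and s[0:1]. Mnemonics such as v_mov_b32 and special
# registers such as vcc do not match and are therefore ignored.
_REG = re.compile(r"\b([vs])(?:(\d+)\b|\[(\d+):(\d+)\])")


def update_next_free(instruction: str, next_free: dict) -> None:
    """Apply the update rule for one instruction to the tracked values."""
    for kind, single, _lo, hi in _REG.findall(instruction):
        highest = int(single) if single else int(hi)
        if next_free[kind] <= highest:
            next_free[kind] = highest + 1


symbols = {"v": 0, "s": 0}  # both symbols are zero before assembly begins
for line in [
    "s_load_dwordx2 s[0:1], s[0:1] 0x0",
    "v_mov_b32 v0, 3.14159",
    "s_waitcnt lgkmcnt(0)",
    "v_mov_b32 v1, s0",
    "v_mov_b32 v2, s1",
    "flat_store_dword v[1:2], v0",
    "s_endpgm",
]:
    update_next_free(line, symbols)

print(symbols)
```

For this instruction sequence the highest VGPR referenced is ``v2`` and the
highest SGPR is ``s1``, so the tracked values end at 3 and 2 respectively.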
						||
.. _amdgpu-amdhsa-assembler-directives-v3:

Code Object V3 Directives (-mattr=+code-object-v3)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Directives which begin with ``.amdgcn`` are valid for all ``amdgcn``
architecture processors, and are not OS-specific. Directives which begin with
``.amdhsa`` are specific to ``amdgcn`` architecture processors when the
``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and
:ref:`amdgpu-processors`.

.amdgcn_target <target>
+++++++++++++++++++++++

Optional directive which declares the target supported by the containing
assembler source file. Valid values are described in
:ref:`amdgpu-amdhsa-code-object-target-identification`. Used by the assembler
to validate command-line options such as ``-triple``, ``-mcpu``, and those
which specify target features.
 | 
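To make the shape of a ``.amdgcn_target`` string concrete, the Python sketch
below splits one into its parts. It assumes the layout used by the example
string ``amdgcn-amd-amdhsa--gfx900+xnack`` in this guide (four ``-`` separated
triple fields, then the processor followed by ``+`` prefixed target features);
the function name ``parse_amdgcn_target`` is hypothetical and this is not how
the assembler itself performs the validation.

```python
def parse_amdgcn_target(target: str) -> dict:
    """Split a target string into triple components, processor and features."""
    # Split at most four times so that a feature name containing "-" stays
    # intact in the last field.
    parts = target.split("-", 4)
    if len(parts) != 5:
        raise ValueError(f"expected five '-' separated fields: {target!r}")
    arch, vendor, os_name, environment, rest = parts
    processor, *features = rest.split("+")
    return {
        "arch": arch,
        "vendor": vendor,
        "os": os_name,
        "environment": environment,
        "processor": processor,
        "features": features,
    }


info = parse_amdgcn_target("amdgcn-amd-amdhsa--gfx900+xnack")
print(info)
```

A checker built on this could compare ``info["processor"]`` with the value
passed to ``-mcpu`` and report a mismatch, which is the kind of validation the
directive exists to enable.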
						||
.amdhsa_kernel <name>
+++++++++++++++++++++

Creates a correctly aligned AMDHSA kernel descriptor and a symbol,
``<name>.kd``, in the current location of the current section. Only valid when
the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first
instruction to execute, and does not need to be previously defined.

Marks the beginning of a list of directives used to generate the bytes of a
kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`.
Directives which may appear in this list are described in
:ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must
be valid for the target being assembled for, and cannot be repeated. Directives
support the range of values specified by the field they reference in
:ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is
assumed to have its default value, unless it is marked as "Required", in which
case it is an error to omit the directive. This list of directives is
terminated by an ``.end_amdhsa_kernel`` directive.
 | 
						||
  .. table:: AMDHSA Kernel Assembler Directives
     :name: amdhsa-kernel-directives-table

     ======================================================== =================== ============ ===================
     Directive                                                Default             Supported On Description
     ======================================================== =================== ============ ===================
     ``.amdhsa_group_segment_fixed_size``                     0                   GFX6-GFX10   Controls GROUP_SEGMENT_FIXED_SIZE in
                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
     ``.amdhsa_private_segment_fixed_size``                   0                   GFX6-GFX10   Controls PRIVATE_SEGMENT_FIXED_SIZE in
                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
     ``.amdhsa_user_sgpr_private_segment_buffer``             0                   GFX6-GFX10   Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in
                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
     ``.amdhsa_user_sgpr_dispatch_ptr``                       0                   GFX6-GFX10   Controls ENABLE_SGPR_DISPATCH_PTR in
                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
     ``.amdhsa_user_sgpr_queue_ptr``                          0                   GFX6-GFX10   Controls ENABLE_SGPR_QUEUE_PTR in
                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
     ``.amdhsa_user_sgpr_kernarg_segment_ptr``                0                   GFX6-GFX10   Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in
                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
     ``.amdhsa_user_sgpr_dispatch_id``                        0                   GFX6-GFX10   Controls ENABLE_SGPR_DISPATCH_ID in
                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
     ``.amdhsa_user_sgpr_flat_scratch_init``                  0                   GFX6-GFX10   Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in
                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
     ``.amdhsa_user_sgpr_private_segment_size``               0                   GFX6-GFX10   Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in
                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
     ``.amdhsa_wavefront_size32``                             Target              GFX10        Controls ENABLE_WAVEFRONT_SIZE32 in
                                                              Feature                          :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
                                                              Specific
                                                              (-wavefrontsize64)
     ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0                   GFX6-GFX10   Controls ENABLE_SGPR_PRIVATE_SEGMENT_WAVEFRONT_OFFSET in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_system_sgpr_workgroup_id_x``                   1                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_X in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_system_sgpr_workgroup_id_y``                   0                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_Y in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_system_sgpr_workgroup_id_z``                   0                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_Z in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_system_sgpr_workgroup_info``                   0                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_INFO in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_system_vgpr_workitem_id``                      0                   GFX6-GFX10   Controls ENABLE_VGPR_WORKITEM_ID in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
                                                                                               Possible values are defined in
                                                                                               :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`.
     ``.amdhsa_next_free_vgpr``                               Required            GFX6-GFX10   Maximum VGPR number explicitly referenced, plus one.
                                                                                               Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_next_free_sgpr``                               Required            GFX6-GFX10   Maximum SGPR number explicitly referenced, plus one.
                                                                                               Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_reserve_vcc``                                  1                   GFX6-GFX10   Whether the kernel may use the special VCC SGPR.
                                                                                               Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_reserve_flat_scratch``                         1                   GFX7-GFX10   Whether the kernel may use flat instructions to access
                                                                                               scratch memory. Used to calculate
                                                                                               GRANULATED_WAVEFRONT_SGPR_COUNT in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_reserve_xnack_mask``                           Target              GFX8-GFX10   Whether the kernel may trigger XNACK replay.
                                                              Feature                          Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
                                                              Specific                         :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
                                                              (+xnack)
     ``.amdhsa_float_round_mode_32``                          0                   GFX6-GFX10   Controls FLOAT_ROUND_MODE_32 in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
                                                                                               Possible values are defined in
                                                                                               :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
     ``.amdhsa_float_round_mode_16_64``                       0                   GFX6-GFX10   Controls FLOAT_ROUND_MODE_16_64 in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
                                                                                               Possible values are defined in
                                                                                               :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
     ``.amdhsa_float_denorm_mode_32``                         0                   GFX6-GFX10   Controls FLOAT_DENORM_MODE_32 in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
                                                                                               Possible values are defined in
                                                                                               :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
     ``.amdhsa_float_denorm_mode_16_64``                      3                   GFX6-GFX10   Controls FLOAT_DENORM_MODE_16_64 in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
                                                                                               Possible values are defined in
                                                                                               :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
     ``.amdhsa_dx10_clamp``                                   1                   GFX6-GFX10   Controls ENABLE_DX10_CLAMP in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_ieee_mode``                                    1                   GFX6-GFX10   Controls ENABLE_IEEE_MODE in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_fp16_overflow``                                0                   GFX9-GFX10   Controls FP16_OVFL in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_workgroup_processor_mode``                     Target              GFX10        Controls ENABLE_WGP_MODE in
                                                              Feature                          :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
                                                              Specific
                                                              (-cumode)
     ``.amdhsa_memory_ordered``                               1                   GFX10        Controls MEM_ORDERED in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_forward_progress``                             0                   GFX10        Controls FWD_PROGRESS in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_exception_fp_ieee_invalid_op``                 0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_exception_fp_denorm_src``                      0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_exception_fp_ieee_div_zero``                   0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_exception_fp_ieee_overflow``                   0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_exception_fp_ieee_underflow``                  0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_exception_fp_ieee_inexact``                    0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_exception_int_div_zero``                       0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in
                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ======================================================== =================== ============ ===================
 | 
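The rules for the directive list between ``.amdhsa_kernel`` and
``.end_amdhsa_kernel`` (no repeats, "Required" directives must be present) can
be sketched as a small Python check. The function name
``check_kernel_directives`` and the message wording are hypothetical; the
assembler's real diagnostics differ, and this sketch does not check value
ranges or target support.

```python
# Directives marked "Required" in the table above.
REQUIRED = {".amdhsa_next_free_vgpr", ".amdhsa_next_free_sgpr"}


def check_kernel_directives(directives):
    """Return a list of problems found in a sequence of directive names."""
    problems = []
    seen = set()
    for name in directives:
        if name in seen:
            problems.append(f"{name} is repeated")
        seen.add(name)
    for name in sorted(REQUIRED - seen):
        problems.append(f"{name} is required but missing")
    return problems


ok = check_kernel_directives([
    ".amdhsa_user_sgpr_kernarg_segment_ptr",
    ".amdhsa_next_free_vgpr",
    ".amdhsa_next_free_sgpr",
])
bad = check_kernel_directives([
    ".amdhsa_next_free_vgpr",
    ".amdhsa_next_free_vgpr",
])
print(ok, bad)
```

Because directives may appear in any order, the check only tracks which names
have been seen and never looks at their position in the list.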
						||
.amdgpu_metadata
++++++++++++++++

Optional directive which declares the contents of the ``NT_AMDGPU_METADATA``
note record (see :ref:`amdgpu-elf-note-records-table-v3`).

The contents must be in the [YAML]_ markup format, with the same structure and
semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3`.

This directive is terminated by an ``.end_amdgpu_metadata`` directive.
 | 
						||
.. _amdgpu-amdhsa-assembler-example-v3:

Code Object V3 Example Source Code (-mattr=+code-object-v3)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here is an example of a minimal assembly source file, defining one HSA kernel:

.. code::
   :number-lines:

   .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional

   .text
   .globl hello_world
   .p2align 8
   .type hello_world,@function
   hello_world:
     s_load_dwordx2 s[0:1], s[0:1] 0x0
     v_mov_b32 v0, 3.14159
     s_waitcnt lgkmcnt(0)
     v_mov_b32 v1, s0
     v_mov_b32 v2, s1
     flat_store_dword v[1:2], v0
     s_endpgm
   .Lfunc_end0:
     .size   hello_world, .Lfunc_end0-hello_world

   .rodata
   .p2align 6
   .amdhsa_kernel hello_world
     .amdhsa_user_sgpr_kernarg_segment_ptr 1
     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
   .end_amdhsa_kernel

   .amdgpu_metadata
   ---
   amdhsa.version:
     - 1
     - 0
   amdhsa.kernels:
     - .name: hello_world
       .symbol: hello_world.kd
       .kernarg_segment_size: 48
       .group_segment_fixed_size: 0
       .private_segment_fixed_size: 0
       .kernarg_segment_align: 4
       .wavefront_size: 64
       .sgpr_count: 2
       .vgpr_count: 3
       .max_flat_workgroup_size: 256
   ...
   .end_amdgpu_metadata
 | 
						||
If an assembly source file contains multiple kernels and/or functions, the
:ref:`amdgpu-amdhsa-assembler-symbol-next_free_vgpr` and
:ref:`amdgpu-amdhsa-assembler-symbol-next_free_sgpr` symbols may be reset using
the ``.set <symbol>, <expression>`` directive. For example, in the case of two
kernels, where ``func1`` is only called from ``kern1``, it is sufficient
to group the function with the kernel that calls it and reset the symbols
between the two connected components:
 | 
						||
.. code::
   :number-lines:

   .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional

   // gpr tracking symbols are implicitly set to zero

   .text
   .globl kern0
   .p2align 8
   .type kern0,@function
   kern0:
     // ...
     s_endpgm
   .Lkern0_end:
     .size   kern0, .Lkern0_end-kern0

   .rodata
   .p2align 6
   .amdhsa_kernel kern0
     // ...
     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
   .end_amdhsa_kernel

   // reset symbols to begin tracking usage in func1 and kern1
   .set .amdgcn.next_free_vgpr, 0
   .set .amdgcn.next_free_sgpr, 0

   .text
   .hidden func1
   .global func1
   .p2align 2
   .type func1,@function
   func1:
     // ...
     s_setpc_b64 s[30:31]
   .Lfunc1_end:
     .size   func1, .Lfunc1_end-func1

   .globl kern1
   .p2align 8
   .type kern1,@function
   kern1:
     // ...
     s_getpc_b64 s[4:5]
     s_add_u32 s4, s4, func1@rel32@lo+4
     s_addc_u32 s5, s5, func1@rel32@hi+4
     s_swappc_b64 s[30:31], s[4:5]
     // ...
     s_endpgm
   .Lkern1_end:
     .size   kern1, .Lkern1_end-kern1

   .rodata
   .p2align 6
   .amdhsa_kernel kern1
     // ...
     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
   .end_amdhsa_kernel
 | 
						||
These symbols cannot identify connected components in order to automatically
track the usage for each kernel. However, in some cases careful organization of
the kernels and functions in the source file means there is minimal additional
effort required to accurately calculate GPR usage.
 | 
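To illustrate what identifying connected components would involve, the Python
sketch below computes a per-kernel value as the maximum over the kernel and
every function reachable from it. It is an illustration of the idea only and is
not something the assembler does: the function name ``gpr_usage_per_kernel``,
the call graph representation, and the numeric values are all hypothetical,
while ``kern0``, ``kern1`` and ``func1`` are borrowed from the example above.

```python
def gpr_usage_per_kernel(next_free, calls, kernels):
    """For each kernel, take the maximum value over all reachable code."""
    result = {}
    for kernel in kernels:
        stack, reachable = [kernel], set()
        while stack:  # depth-first walk of the call graph
            name = stack.pop()
            if name not in reachable:
                reachable.add(name)
                stack.extend(calls.get(name, ()))
        result[kernel] = max(next_free[name] for name in reachable)
    return result


# Made-up "next free VGPR" values measured separately for each symbol.
next_free_vgpr = {"kern0": 12, "kern1": 4, "func1": 9}
calls = {"kern1": ["func1"]}  # kern1 calls func1; kern0 calls nothing

usage = gpr_usage_per_kernel(next_free_vgpr, calls, ["kern0", "kern1"])
print(usage)
```

Grouping ``func1`` with ``kern1`` in the source file and resetting the symbols
once between the groups yields the same per-kernel maximum without any call
graph analysis, which is why the layout in the example is sufficient.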
						||
Additional Documentation
========================

.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
.. [AMD-GCN-GFX10] `AMD "RDNA 1.0" Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf>`__
.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
.. [HIP] `HIP Programming Guide <https://rocm-documentation.readthedocs.io/en/latest/Programming_Guides/Programming-Guides.html#hip-programing-guide>`__
.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
.. [MsgPack] `Message Pack <http://www.msgpack.org/>`__
.. [SEMVER] `Semantic Versioning <https://semver.org/>`__
.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
.. [CLANG-ATTR] `Attributes in Clang <http://clang.llvm.org/docs/AttributeReference.html>`__