=============================
User Guide for NVPTX Back-end
=============================

.. contents::
   :local:
   :depth: 3


Introduction
============

To support GPU programming, the NVPTX back-end supports a subset of LLVM IR
along with a defined set of conventions used to represent GPU programming
concepts. This document provides an overview of the general usage of the
back-end, including a description of the conventions used and the set of
accepted LLVM IR.

.. note::

   This document assumes a basic familiarity with CUDA and the PTX
   assembly language. Information about the CUDA Driver API and the PTX assembly
   language can be found in the `CUDA documentation
   <http://docs.nvidia.com/cuda/index.html>`_.


Conventions
===========

Marking Functions as Kernels
----------------------------

In PTX, there are two types of functions: *device functions*, which are only
callable by device code, and *kernel functions*, which are callable by host
code. By default, the back-end will emit device functions. Metadata is used to
declare a function as a kernel function. This metadata is attached to the
``nvvm.annotations`` named metadata object, and has the following format:

.. code-block:: llvm

    !0 = metadata !{<function-ref>, metadata !"kernel", i32 1}

The first parameter is a reference to the kernel function. The following
example shows a kernel function calling a device function in LLVM IR. The
function ``@my_kernel`` is callable from host code, but ``@my_fmad`` is not.

.. code-block:: llvm

    define float @my_fmad(float %x, float %y, float %z) {
      %mul = fmul float %x, %y
      %add = fadd float %mul, %z
      ret float %add
    }

    define void @my_kernel(float* %ptr) {
      %val = load float* %ptr
      %ret = call float @my_fmad(float %val, float %val, float %val)
      store float %ret, float* %ptr
      ret void
    }

    !nvvm.annotations = !{!1}
    !1 = metadata !{void (float*)* @my_kernel, metadata !"kernel", i32 1}

When compiled, the PTX kernel functions are callable by host-side code.


Address Spaces
--------------

The NVPTX back-end uses the following address space mapping:

   ============= ======================
   Address Space Memory Space
   ============= ======================
   0             Generic
   1             Global
   2             Internal Use
   3             Shared
   4             Constant
   5             Local
   ============= ======================

Every global variable and pointer type is assigned to one of these address
spaces, with 0 being the default address space. Intrinsics are provided which
can be used to convert pointers between the generic and non-generic address
spaces.

As an example, the following IR will define an array ``@g`` that resides in
global device memory.

.. code-block:: llvm

    @g = internal addrspace(1) global [4 x i32] [ i32 0, i32 1, i32 2, i32 3 ]

LLVM IR functions can read and write to this array, and host-side code can
copy data to it by name with the CUDA Driver API.

Note that since address space 0 is the generic space, it is illegal to have
global variables in address space 0.  Address space 0 is the default address
space in LLVM, so the ``addrspace(N)`` annotation is *required* for global
variables.

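
As a minimal sketch of how device code might read such an array, the following
function loads two elements of ``@g`` through pointers that stay in address
space 1. The function name ``@sum_first_two`` is hypothetical and chosen only
for illustration; the IR uses the same syntax as the other examples in this
document.

.. code-block:: llvm

    @g = internal addrspace(1) global [4 x i32] [ i32 0, i32 1, i32 2, i32 3 ]

    ; Hypothetical device function: returns @g[0] + @g[1].
    define i32 @sum_first_two() {
      ; Element pointers keep the addrspace(1) qualifier of @g
      %p0 = getelementptr inbounds [4 x i32] addrspace(1)* @g, i32 0, i32 0
      %p1 = getelementptr inbounds [4 x i32] addrspace(1)* @g, i32 0, i32 1
      ; Loads through addrspace(1) pointers access global memory
      %a = load i32 addrspace(1)* %p0
      %b = load i32 addrspace(1)* %p1
      %sum = add i32 %a, %b
      ret i32 %sum
    }
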

NVPTX Intrinsics
================

Address Space Conversion
------------------------

'``llvm.nvvm.ptr.*.to.gen``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

These are overloaded intrinsics.  You can use these on any pointer type.

.. code-block:: llvm

    declare i8* @llvm.nvvm.ptr.global.to.gen.p0i8.p1i8(i8 addrspace(1)*)
    declare i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)*)
    declare i8* @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(i8 addrspace(4)*)
    declare i8* @llvm.nvvm.ptr.local.to.gen.p0i8.p5i8(i8 addrspace(5)*)

Overview:
"""""""""

The '``llvm.nvvm.ptr.*.to.gen``' intrinsics convert a pointer in a non-generic
address space to a generic address space pointer.

Semantics:
""""""""""

These intrinsics modify the pointer value to be a valid generic address space
pointer.

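
The following sketch illustrates one possible use: a pointer to a buffer in
global memory is converted to a generic pointer and then read with an ordinary
load. The names ``@buf`` and ``@read_first`` are hypothetical and not part of
the back-end.

.. code-block:: llvm

    declare i8* @llvm.nvvm.ptr.global.to.gen.p0i8.p1i8(i8 addrspace(1)*)

    ; Hypothetical buffer in global memory (address space 1)
    @buf = internal addrspace(1) global [16 x i8] zeroinitializer

    define i8 @read_first() {
      ; Pointer to the first byte, still in address space 1
      %g = getelementptr inbounds [16 x i8] addrspace(1)* @buf, i32 0, i32 0
      ; Convert it to a generic (address space 0) pointer
      %p = call i8* @llvm.nvvm.ptr.global.to.gen.p0i8.p1i8(i8 addrspace(1)* %g)
      ; A generic pointer can be used with a plain load
      %v = load i8* %p
      ret i8 %v
    }
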

'``llvm.nvvm.ptr.gen.to.*``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

These are overloaded intrinsics.  You can use these on any pointer type.

.. code-block:: llvm

    declare i8 addrspace(1)* @llvm.nvvm.ptr.gen.to.global.p1i8.p0i8(i8*)
    declare i8 addrspace(3)* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8*)
    declare i8 addrspace(4)* @llvm.nvvm.ptr.gen.to.constant.p4i8.p0i8(i8*)
    declare i8 addrspace(5)* @llvm.nvvm.ptr.gen.to.local.p5i8.p0i8(i8*)

Overview:
"""""""""

The '``llvm.nvvm.ptr.gen.to.*``' intrinsics convert a pointer in the generic
address space to a pointer in the target address space.  Note that these
intrinsics are only useful if the target address space of the pointer is
known.  It is not legal to use address space conversion intrinsics to convert
a pointer from one non-generic address space to another non-generic address
space.

Semantics:
""""""""""

These intrinsics modify the pointer value to be a valid pointer in the target
non-generic address space.

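
As a hedged illustration, the sketch below converts a generic pointer back to
the shared address space before loading from it. It assumes the caller
guarantees that ``%p`` really refers to shared memory; the function name
``@load_shared`` is hypothetical.

.. code-block:: llvm

    declare i8 addrspace(3)* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8*)

    ; Assumption: %p is a generic pointer known to refer to shared memory.
    define i8 @load_shared(i8* %p) {
      ; Convert the generic pointer to a shared (address space 3) pointer
      %s = call i8 addrspace(3)* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8* %p)
      ; The load now targets the shared address space explicitly
      %v = load i8 addrspace(3)* %s
      ret i8 %v
    }
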

Reading PTX Special Registers
-----------------------------

'``llvm.nvvm.read.ptx.sreg.*``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

    declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.tid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.tid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.warpsize()

Overview:
"""""""""

The '``@llvm.nvvm.read.ptx.sreg.*``' intrinsics provide access to the PTX
special registers, in particular the thread and block indices and dimensions
of the kernel launch.  These registers map in the following way to CUDA
builtins:

   ============= =====================================
   CUDA Builtin  PTX Special Register Intrinsic
   ============= =====================================
   ``threadIdx`` ``@llvm.nvvm.read.ptx.sreg.tid.*``
   ``blockIdx``  ``@llvm.nvvm.read.ptx.sreg.ctaid.*``
   ``blockDim``  ``@llvm.nvvm.read.ptx.sreg.ntid.*``
   ``gridDim``   ``@llvm.nvvm.read.ptx.sreg.nctaid.*``
   ============= =====================================

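
A common use of these registers is computing a global thread index. The
sketch below mirrors the CUDA expression
``blockIdx.x * blockDim.x + threadIdx.x``; the function name
``@global_thread_id_x`` is hypothetical.

.. code-block:: llvm

    declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()

    ; Hypothetical helper: index of the calling thread along x within the grid.
    define i32 @global_thread_id_x() {
      %tid   = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()    ; threadIdx.x
      %ntid  = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()   ; blockDim.x
      %ctaid = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()  ; blockIdx.x
      %base  = mul i32 %ctaid, %ntid
      %id    = add i32 %base, %tid
      ret i32 %id
    }
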

Barriers
--------

'``llvm.nvvm.barrier0``'
^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

    declare void @llvm.nvvm.barrier0()

Overview:
"""""""""

The '``@llvm.nvvm.barrier0()``' intrinsic emits a PTX ``bar.sync 0``
instruction, equivalent to the ``__syncthreads()`` call in CUDA.

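
The sketch below shows where such a barrier might be placed. Each thread
copies one element into a shared staging buffer, waits at the barrier, and
then reads the element written by thread 0. The names ``@stage`` and
``@broadcast_first`` are hypothetical, and the kernel assumes it is launched
with at most 64 threads per block along x. The ``undef`` initializer on the
shared buffer is likewise an assumption of this sketch.

.. code-block:: llvm

    declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
    declare void @llvm.nvvm.barrier0()

    ; Hypothetical staging buffer in shared memory, one slot per thread
    @stage = internal addrspace(3) global [64 x float] undef

    define void @broadcast_first(float* %data) {
      %tid  = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
      %src  = getelementptr inbounds float* %data, i32 %tid
      %val  = load float* %src
      ; Publish this thread's element to shared memory
      %slot = getelementptr inbounds [64 x float] addrspace(3)* @stage, i32 0, i32 %tid
      store float %val, float addrspace(3)* %slot
      ; Wait until every thread in the block has written its slot
      call void @llvm.nvvm.barrier0()
      ; Read the element written by thread 0 and store it back
      %first = getelementptr inbounds [64 x float] addrspace(3)* @stage, i32 0, i32 0
      %v0    = load float addrspace(3)* %first
      store float %v0, float* %src
      ret void
    }

    !nvvm.annotations = !{!0}
    !0 = metadata !{void (float*)* @broadcast_first, metadata !"kernel", i32 1}

Without the barrier, a thread could read slot 0 before thread 0 has written
it.
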

Other Intrinsics
----------------

For the full set of NVPTX intrinsics, please see the
``include/llvm/IR/IntrinsicsNVVM.td`` file in the LLVM source tree.


Executing PTX
=============

The most common way to execute PTX assembly on a GPU device is to use the CUDA
Driver API. This API is a low-level interface to the GPU driver and allows for
JIT compilation of PTX code to native GPU machine code.

Initializing the Driver API:

.. code-block:: c++

    CUdevice device;
    CUcontext context;

    // Initialize the driver API
    cuInit(0);
    // Get a handle to the first compute device
    cuDeviceGet(&device, 0);
    // Create a compute device context
    cuCtxCreate(&context, 0, device);

JIT compiling a PTX string to a device binary:

.. code-block:: c++

    CUmodule module;
    CUfunction function;

    // JIT compile a null-terminated PTX string
    cuModuleLoadData(&module, (void*)PTXString);

    // Get a handle to the "myfunction" kernel function
    cuModuleGetFunction(&function, module, "myfunction");

For full examples of executing PTX assembly, please see the `CUDA Samples
<https://developer.nvidia.com/cuda-downloads>`_ distribution.