=========================
Compiling CUDA with clang
=========================

.. contents::
   :local:

Introduction
============

This document describes how to compile CUDA code with clang, and gives some
details about LLVM and clang's CUDA implementations.

This document assumes a basic familiarity with CUDA. Information about CUDA
programming can be found in the
`CUDA programming guide
<http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html>`_.

Compiling CUDA Code
===================

Prerequisites
-------------

CUDA is supported in llvm 3.9, but it's still in active development, so we
recommend you `compile clang/LLVM from HEAD
<http://llvm.org/docs/GettingStarted.html>`_.

Before you build CUDA code, you'll need to have installed the appropriate
driver for your nvidia GPU and the CUDA SDK.  See `NVIDIA's CUDA installation
guide <https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_
for details.  Note that clang `does not support
<https://llvm.org/bugs/show_bug.cgi?id=26966>`_ the CUDA toolkit as installed
by many Linux package managers; you probably need to install nvidia's package.

You will need CUDA 7.0, 7.5, or 8.0 to compile with clang.

CUDA compilation is supported on Linux, on MacOS as of 2016-11-18, and on
Windows as of 2017-01-05.

Invoking clang
--------------

Invoking clang for CUDA compilation works similarly to compiling regular C++.
You just need to be aware of a few additional flags.

You can use `this <https://gist.github.com/855e277884eb6b388cd2f00d956c2fd4>`_
program as a toy example.  Save it as ``axpy.cu``.  (Clang detects that you're
compiling CUDA code by noticing that your filename ends with ``.cu``.
Alternatively, you can pass ``-x cuda``.)

To build and run, run the following commands, filling in the parts in angle
brackets as described below:

.. code-block:: console

  $ clang++ axpy.cu -o axpy --cuda-gpu-arch=<GPU arch> \
      -L<CUDA install path>/<lib64 or lib>             \
      -lcudart_static -ldl -lrt -pthread
  $ ./axpy
  y[0] = 2
  y[1] = 4
  y[2] = 6
  y[3] = 8

On MacOS, replace `-lcudart_static` with `-lcudart`; otherwise, you may get
"CUDA driver version is insufficient for CUDA runtime version" errors when you
run your program.

* ``<CUDA install path>`` -- the directory where you installed the CUDA SDK.
  Typically, ``/usr/local/cuda``.

  Pass e.g. ``-L/usr/local/cuda/lib64`` if compiling in 64-bit mode; otherwise,
  pass e.g. ``-L/usr/local/cuda/lib``.  (In CUDA, the device code and host code
  always have the same pointer widths, so if you're compiling 64-bit code for
  the host, you're also compiling 64-bit code for the device.)

* ``<GPU arch>`` -- the `compute capability
  <https://developer.nvidia.com/cuda-gpus>`_ of your GPU. For example, if you
  want to run your program on a GPU with compute capability of 3.5, specify
  ``--cuda-gpu-arch=sm_35``.

  Note: You cannot pass ``compute_XX`` as an argument to ``--cuda-gpu-arch``;
  only ``sm_XX`` is currently supported.  However, clang always includes PTX in
  its binaries, so e.g. a binary compiled with ``--cuda-gpu-arch=sm_30`` would be
  forwards-compatible with e.g. ``sm_35`` GPUs.

  You can pass ``--cuda-gpu-arch`` multiple times to compile for multiple archs.

The `-L` and `-l` flags only need to be passed when linking.  When compiling,
you may also need to pass ``--cuda-path=/path/to/cuda`` if you didn't install
the CUDA SDK into ``/usr/local/cuda``, ``/usr/local/cuda-7.0``, or
``/usr/local/cuda-7.5``.
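
For instance, here's a sketch of compiling and linking as separate steps, with
the SDK in a non-default location and two target architectures (the paths and
architectures are placeholders for your own setup):

.. code-block:: console

  $ clang++ -c axpy.cu -o axpy.o --cuda-path=/opt/cuda-8.0 \
      --cuda-gpu-arch=sm_35 --cuda-gpu-arch=sm_50
  $ clang++ axpy.o -o axpy -L/opt/cuda-8.0/lib64 \
      -lcudart_static -ldl -lrt -pthread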

Flags that control numerical code
---------------------------------

If you're using GPUs, you probably care about making numerical code run fast.
GPU hardware allows for more control over numerical operations than most CPUs,
but this results in more compiler options for you to juggle.

Flags you may wish to tweak include the following (a combined example follows
the list):

* ``-ffp-contract={on,off,fast}`` (defaults to ``fast`` on host and device when
  compiling CUDA) Controls whether the compiler emits fused multiply-add
  operations.

  * ``off``: never emit fma operations, and prevent ptxas from fusing multiply
    and add instructions.
  * ``on``: fuse multiplies and adds within a single statement, but never
    across statements (C11 semantics).  Prevent ptxas from fusing other
    multiplies and adds.
  * ``fast``: fuse multiplies and adds wherever profitable, even across
    statements.  Doesn't prevent ptxas from fusing additional multiplies and
    adds.

  Fused multiply-add instructions can be much faster than the unfused
  equivalents, but because the intermediate result in an fma is not rounded,
  this flag can affect numerical code.

* ``-fcuda-flush-denormals-to-zero`` (default: off) When this is enabled,
  floating point operations may flush `denormal
  <https://en.wikipedia.org/wiki/Denormal_number>`_ inputs and/or outputs to 0.
  Operations on denormal numbers are often much slower than the same operations
  on normal numbers.

* ``-fcuda-approx-transcendentals`` (default: off) When this is enabled, the
  compiler may emit calls to faster, approximate versions of transcendental
  functions, instead of using the slower, fully IEEE-compliant versions.  For
  example, this flag allows clang to emit the ptx ``sin.approx.f32``
  instruction.

  This is implied by ``-ffast-math``.
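
Purely as an illustration (whether these trade-offs are acceptable depends on
your code), a build that favors speed over strict IEEE behavior might combine
the flags above like so:

.. code-block:: console

  $ clang++ axpy.cu -o axpy --cuda-gpu-arch=sm_35 \
      -ffp-contract=fast -fcuda-flush-denormals-to-zero \
      -fcuda-approx-transcendentals \
      -L/usr/local/cuda/lib64 -lcudart_static -ldl -lrt -pthread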

Standard library support
========================

In clang and nvcc, most of the C++ standard library is not supported on the
device side.

``<math.h>`` and ``<cmath>``
----------------------------

In clang, ``math.h`` and ``cmath`` are available and `pass
<https://github.com/llvm-mirror/test-suite/blob/master/External/CUDA/math_h.cu>`_
`tests
<https://github.com/llvm-mirror/test-suite/blob/master/External/CUDA/cmath.cu>`_
adapted from libc++'s test suite.

In nvcc, ``math.h`` and ``cmath`` are mostly available.  Versions of ``::foof``
in namespace std (e.g. ``std::sinf``) are not available, and where the standard
calls for overloads that take integral arguments, these are usually not
available.

.. code-block:: c++

  #include <math.h>
  #include <cmath>

  // clang is OK with everything in this function.
  __device__ void test() {
    std::sin(0.); // nvcc - ok
    std::sin(0);  // nvcc - error, because no std::sin(int) overload is available.
    sin(0);       // nvcc - same as above.

    sinf(0.);       // nvcc - ok
    std::sinf(0.);  // nvcc - no such function
  }

``<std::complex>``
------------------

nvcc does not officially support ``std::complex``.  It's an error to use
``std::complex`` in ``__device__`` code, but it often works in ``__host__
__device__`` code due to nvcc's interpretation of the "wrong-side rule" (see
below).  However, we have heard from implementers that it's possible to get
into situations where nvcc will omit a call to an ``std::complex`` function,
especially when compiling without optimizations.

As of 2016-11-16, clang supports ``std::complex`` without these caveats.  It is
tested with libstdc++ 4.8.5 and newer, but is known to work only with libc++
newer than 2016-11-16.
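
As a minimal sketch of what clang accepts (hypothetical helper; with nvcc,
using ``std::complex`` in ``__device__`` code is an error, as noted above):

.. code-block:: c++

  #include <complex>

  __device__ std::complex<float> csquare(std::complex<float> z) {
    return z * z;
  }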

``<algorithm>``
---------------

In C++14, many useful functions from ``<algorithm>`` (notably, ``std::min`` and
``std::max``) become constexpr.  You can therefore use these in device code,
when compiling with clang.
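
For example (a sketch; assumes you compile with ``-std=c++14`` or newer):

.. code-block:: c++

  #include <algorithm>

  // std::min is constexpr in C++14, so clang can call it from device code.
  __device__ int clamp_to_byte(int x) { return std::min(x, 255); }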

Detecting clang vs NVCC from code
=================================

Although clang's CUDA implementation is largely compatible with NVCC's, you may
still want to detect when you're compiling CUDA code specifically with clang.

This is tricky, because NVCC may invoke clang as part of its own compilation
process!  For example, NVCC uses the host compiler's preprocessor when
compiling for device code, and that host compiler may in fact be clang.

When clang is actually compiling CUDA code -- rather than being used as a
subtool of NVCC's -- it defines the ``__CUDA__`` macro.  ``__CUDA_ARCH__`` is
defined only in device mode (but will be defined if NVCC is using clang as a
preprocessor).  So you can use the following incantations to detect clang CUDA
compilation, in host and device modes:

.. code-block:: c++

  #if defined(__clang__) && defined(__CUDA__) && !defined(__CUDA_ARCH__)
  // clang compiling CUDA code, host mode.
  #endif

  #if defined(__clang__) && defined(__CUDA__) && defined(__CUDA_ARCH__)
  // clang compiling CUDA code, device mode.
  #endif

Both clang and nvcc define ``__CUDACC__`` during CUDA compilation.  You can
detect NVCC specifically by looking for ``__NVCC__``.
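
For example, a sketch that distinguishes the common cases:

.. code-block:: c++

  #if defined(__NVCC__)
  // Compiled with nvcc (which may itself be using clang as a subtool).
  #elif defined(__clang__) && defined(__CUDA__)
  // Compiled with clang's native CUDA support.
  #elif defined(__CUDACC__)
  // Some other CUDA compiler.
  #endif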

Dialect Differences Between clang and nvcc
==========================================

There is no formal CUDA spec, and clang and nvcc speak slightly different
dialects of the language.  Below, we describe some of the differences.

This section is painful; hopefully you can skip this section and live your life
blissfully unaware.

Compilation Models
------------------

Most of the differences between clang and nvcc stem from the different
compilation models used by clang and nvcc.  nvcc uses *split compilation*,
which works roughly as follows:

 * Run a preprocessor over the input ``.cu`` file to split it into two source
   files: ``H``, containing source code for the host, and ``D``, containing
   source code for the device.

 * For each GPU architecture ``arch`` that we're compiling for, do:

   * Compile ``D`` using nvcc proper.  The result of this is a ``ptx`` file,
     ``P_arch``.

   * Optionally, invoke ``ptxas``, the PTX assembler, to generate a file,
     ``S_arch``, containing GPU machine code (SASS) for ``arch``.

 * Invoke ``fatbin`` to combine all ``P_arch`` and ``S_arch`` files into a
   single "fat binary" file, ``F``.

 * Compile ``H`` using an external host compiler (gcc, clang, or whatever you
   like).  ``F`` is packaged up into a header file which is force-included into
   ``H``; nvcc generates code that calls into this header to e.g. launch
   kernels.

clang uses *merged parsing*.  This is similar to split compilation, except all
of the host and device code is present and must be semantically correct in both
compilation steps.

  * For each GPU architecture ``arch`` that we're compiling for, do:

    * Compile the input ``.cu`` file for device, using clang.  ``__host__`` code
      is parsed and must be semantically correct, even though we're not
      generating code for the host at this time.

      The output of this step is a ``ptx`` file ``P_arch``.

    * Invoke ``ptxas`` to generate a SASS file, ``S_arch``.  Note that, unlike
      nvcc, clang always generates SASS code.

  * Invoke ``fatbin`` to combine all ``P_arch`` and ``S_arch`` files into a
    single fat binary file, ``F``.

  * Compile the input ``.cu`` file for host, using clang.  ``__device__`` code
    is parsed and must be semantically correct, even though we're not
    generating code for the device at this time.

    ``F`` is passed to this compilation, and clang includes it in a special ELF
    section, where it can be found by tools like ``cuobjdump``.

(You may ask at this point, why does clang need to parse the input file
multiple times?  Why not parse it just once, and then use the AST to generate
code for the host and each device architecture?

Unfortunately this can't work because we have to define different macros during
host compilation and during device compilation for each GPU architecture.)

clang's approach allows it to be highly robust to C++ edge cases, as it doesn't
need to decide at an early stage which declarations to keep and which to throw
away.  But it has some consequences you should be aware of.
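
One such consequence is that *both* sides of the file must always parse and
type-check.  As an illustrative sketch (hypothetical names), the following is
rejected by clang during device compilation as well as host compilation, even
though the broken function is host-only:

.. code-block:: c++

  void host_helper(int);

  // Error in both compilation steps: no conversion from a string literal to int.
  __host__ void broken_host_fn() { host_helper("oops"); }

  __global__ void kernel() {}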

Overloading Based on ``__host__`` and ``__device__`` Attributes
---------------------------------------------------------------

Let "H", "D", and "HD" stand for "``__host__`` functions", "``__device__``
functions", and "``__host__ __device__`` functions", respectively.  Functions
with no attributes behave the same as H.

nvcc does not allow you to create H and D functions with the same signature:

.. code-block:: c++

  // nvcc: error - function "foo" has already been defined
  __host__ void foo() {}
  __device__ void foo() {}

However, nvcc allows you to "overload" H and D functions with different
signatures:

.. code-block:: c++

  // nvcc: no error
  __host__ void foo(int) {}
  __device__ void foo() {}

In clang, the ``__host__`` and ``__device__`` attributes are part of a
function's signature, and so it's legal to have H and D functions with
(otherwise) the same signature:

.. code-block:: c++

  // clang: no error
  __host__ void foo() {}
  __device__ void foo() {}

HD functions cannot be overloaded by H or D functions with the same signature:

.. code-block:: c++

  // nvcc: error - function "foo" has already been defined
  // clang: error - redefinition of 'foo'
  __host__ __device__ void foo() {}
  __device__ void foo() {}

  // nvcc: no error
  // clang: no error
  __host__ __device__ void bar(int) {}
  __device__ void bar() {}

When resolving an overloaded function, clang considers the host/device
attributes of the caller and callee.  These are used as a tiebreaker during
overload resolution.  See `IdentifyCUDAPreference
<http://clang.llvm.org/doxygen/SemaCUDA_8cpp.html>`_ for the full set of rules,
but at a high level they are:

 * D functions prefer to call other Ds.  HDs are given lower priority.

 * Similarly, H functions prefer to call other Hs, or ``__global__`` functions
   (with equal priority).  HDs are given lower priority.

 * HD functions prefer to call other HDs.

   When compiling for device, HDs will call Ds with lower priority than HD, and
   will call Hs with still lower priority.  If it's forced to call an H, the
   program is malformed if we emit code for this HD function.  We call this the
   "wrong-side rule", see example below.

   The rules are symmetrical when compiling for host.

Some examples:

.. code-block:: c++

   __host__ void foo();
   __device__ void foo();

   __host__ void bar();
   __host__ __device__ void bar();

   __host__ void test_host() {
     foo();  // calls H overload
     bar();  // calls H overload
   }

   __device__ void test_device() {
     foo();  // calls D overload
     bar();  // calls HD overload
   }

   __host__ __device__ void test_hd() {
     foo();  // calls H overload when compiling for host, otherwise D overload
     bar();  // always calls HD overload
   }

Wrong-side rule example:

.. code-block:: c++

  __host__ void host_only();

  // We don't codegen inline functions unless they're referenced by a
  // non-inline function.  inline_hd1() is called only from the host side, so
  // does not generate an error.  inline_hd2() is called from the device side,
  // so it generates an error.
  inline __host__ __device__ void inline_hd1() { host_only(); }  // no error
  inline __host__ __device__ void inline_hd2() { host_only(); }  // error

  __host__ void host_fn() { inline_hd1(); }
  __device__ void device_fn() { inline_hd2(); }

  // This function is not inline, so it's always codegen'ed on both the host
  // and the device.  Therefore, it generates an error.
  __host__ __device__ void not_inline_hd() { host_only(); }

For the purposes of the wrong-side rule, templated functions also behave like
``inline`` functions: They aren't codegen'ed unless they're instantiated
(usually as part of the process of invoking them).
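
For example, a sketch of the same rule applied to a function template
(hypothetical names):

.. code-block:: c++

  __host__ void host_only();

  // Not codegen'ed (and therefore not an error) until it's instantiated.
  template <typename T>
  __host__ __device__ void template_hd() { host_only(); }

  __host__ void host_caller() { template_hd<int>(); }        // no error
  __device__ void device_caller() { template_hd<float>(); }  // error (wrong-side rule)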

clang's behavior with respect to the wrong-side rule matches nvcc's, except
nvcc only emits a warning for ``not_inline_hd``; device code is allowed to call
``not_inline_hd``.  In its generated code, nvcc may omit ``not_inline_hd``'s
call to ``host_only`` entirely, or it may try to generate code for
``host_only`` on the device.  What you get seems to depend on whether or not
the compiler chooses to inline ``host_only``.

Member functions, including constructors, may be overloaded using H and D
attributes.  However, destructors cannot be overloaded.
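
For instance, a sketch of a (hypothetical) class that clang accepts under these
rules:

.. code-block:: c++

  struct Buffer {
    __host__ Buffer()   { /* host-side setup */ }
    __device__ Buffer() { /* device-side setup */ }
    ~Buffer() {}  // a single destructor; it cannot be overloaded on H/D
  };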

Using a Different Class on Host/Device
--------------------------------------

Occasionally you may want to have a class with different host/device versions.

If all of the class's members are the same on the host and device, you can just
provide overloads for the class's member functions.

However, if you want your class to have different members on host/device, you
won't be able to provide working H and D overloads in both classes. In this
case, clang is likely to be unhappy with you.

.. code-block:: c++

  #ifdef __CUDA_ARCH__
  struct S {
    __device__ void foo() { /* use device_only */ }
    int device_only;
  };
  #else
  struct S {
    __host__ void foo() { /* use host_only */ }
    double host_only;
  };

  __device__ void test() {
    S s;
    // clang generates an error here, because during host compilation, we
    // have ifdef'ed away the __device__ overload of S::foo().  The __device__
    // overload must be present *even during host compilation*.
    s.foo();
  }
  #endif

We posit that you don't really want to have classes with different members on H
and D.  For example, if you were to pass one of these as a parameter to a
kernel, it would have a different layout on H and D, so would not work
properly.

To make code like this compatible with clang, we recommend you separate it out
into two classes.  If you need to write code that works on both host and
device, consider writing an overloaded wrapper function that returns different
types on host and device.

.. code-block:: c++

  struct HostS { ... };
  struct DeviceS { ... };

  __host__ HostS MakeStruct() { return HostS(); }
  __device__ DeviceS MakeStruct() { return DeviceS(); }

  // Now host and device code can call MakeStruct().

Unfortunately, this idiom isn't compatible with nvcc, because it doesn't allow
you to overload based on the H/D attributes.  Here's an idiom that works with
both clang and nvcc:

.. code-block:: c++

  struct HostS { ... };
  struct DeviceS { ... };

  #ifdef __NVCC__
    #ifndef __CUDA_ARCH__
      __host__ HostS MakeStruct() { return HostS(); }
    #else
      __device__ DeviceS MakeStruct() { return DeviceS(); }
    #endif
  #else
    __host__ HostS MakeStruct() { return HostS(); }
    __device__ DeviceS MakeStruct() { return DeviceS(); }
  #endif

  // Now host and device code can call MakeStruct().

Hopefully you don't have to do this sort of thing often.

Optimizations
=============

Modern CPUs and GPUs are architecturally quite different, so code that's fast
on a CPU isn't necessarily fast on a GPU.  We've made a number of changes to
LLVM to make it generate good GPU code.  Among these changes are:

* `Straight-line scalar optimizations <https://goo.gl/4Rb9As>`_ -- These
  reduce redundancy within straight-line code.

* `Aggressive speculative execution
  <http://llvm.org/docs/doxygen/html/SpeculativeExecution_8cpp_source.html>`_
  -- This is mainly for promoting straight-line scalar optimizations, which are
  most effective on code along dominator paths.

* `Memory space inference
  <http://llvm.org/doxygen/NVPTXInferAddressSpaces_8cpp_source.html>`_ --
  In PTX, we can operate on pointers that are in a particular "address space"
  (global, shared, constant, or local), or we can operate on pointers in the
  "generic" address space, which can point to anything.  Operations in a
  non-generic address space are faster, but pointers in CUDA are not explicitly
  annotated with their address space, so it's up to LLVM to infer it where
  possible.

* `Bypassing 64-bit divides
  <http://llvm.org/docs/doxygen/html/BypassSlowDivision_8cpp_source.html>`_ --
  This was an existing optimization that we enabled for the PTX backend.

  64-bit integer divides are much slower than 32-bit ones on NVIDIA GPUs.
  Many of the 64-bit divides in our benchmarks have a divisor and dividend
  which fit in 32 bits at runtime. This optimization provides a fast path for
  this common case.

* Aggressive loop unrolling and function inlining -- Loop unrolling and
  function inlining need to be more aggressive for GPUs than for CPUs because
  control flow transfer on a GPU is more expensive. More aggressive unrolling
  and inlining also promote other optimizations, such as constant propagation
  and SROA, which sometimes speed up code by over 10x.

  (Programmers can force unrolling and inlining using clang's `loop unrolling pragmas
  <http://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll>`_
  and ``__attribute__((always_inline))``.)
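
For example, a sketch of forcing both on a small (hypothetical) kernel:

.. code-block:: c++

  __attribute__((always_inline)) inline __device__
  float madd(float a, float b, float c) {
    return a * b + c;
  }

  __global__ void saxpy4(float a, const float* x, float* y) {
    int base = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
    #pragma unroll
    for (int j = 0; j < 4; ++j)
      y[base + j] = madd(a, x[base + j], y[base + j]);
  }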

Publication
===========

The team at Google published a paper in CGO 2016 detailing the optimizations
they'd made to clang/LLVM.  Note that "gpucc" is no longer a meaningful name:
The relevant tools are now just vanilla clang/LLVM.

| `gpucc: An Open-Source GPGPU Compiler <http://dl.acm.org/citation.cfm?id=2854041>`_
| Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, Robert Hundt
| *Proceedings of the 2016 International Symposium on Code Generation and Optimization (CGO 2016)*
|
| `Slides from the CGO talk <http://wujingyue.com/docs/gpucc-talk.pdf>`_
|
| `Tutorial given at CGO <http://wujingyue.com/docs/gpucc-tutorial.pdf>`_

Obtaining Help
==============

To obtain help on LLVM in general and its CUDA support, see `the LLVM
community <http://llvm.org/docs/#mailing-lists>`_.