230 lines
		
	
	
		
			9.8 KiB
		
	
	
	
		
			HTML
		
	
	
	
			
		
		
	
	
			230 lines
		
	
	
		
			9.8 KiB
		
	
	
	
		
			HTML
		
	
	
	
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
 | 
						|
        "http://www.w3.org/TR/html4/strict.dtd">
 | 
						|
<!-- Material used from: HTML 4.01 specs: http://www.w3.org/TR/html401/ -->
 | 
						|
<html>
 | 
						|
<head>
 | 
						|
  <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
 | 
						|
  <title>Polly - GPGPU Code Generation</title>
 | 
						|
  <link type="text/css" rel="stylesheet" href="../menu.css">
 | 
						|
  <link type="text/css" rel="stylesheet" href="../content.css">
 | 
						|
</head>
 | 
						|
<body>
 | 
						|
<div id="box">
 | 
						|
<!--#include virtual="../menu.html.incl"-->
 | 
						|
<div id="content">
 | 
						|
  <!--*********************************************************************-->
 | 
						|
  <h1>Polly - GPGPU Code Generation</h1>
 | 
						|
  <!--*********************************************************************-->
 | 
						|
<p><em>WARNING: This project was part of the Google Summer of Code 2012.
 | 
						|
It is currently not finished, but it is in the design and implementation stage.
 | 
						|
The ideas/plans described here may not yet be implemented in Polly and may
 | 
						|
change later on.</em></p>
 | 
						|
 | 
						|
This project adds GPGPU code generation feature to Polly.
 | 
						|
 | 
						|
<h2>Objective</h2>
 | 
						|
<p>The overall objective of this GSoC project is to create a preliminary
 | 
						|
   implementation of GPGPU code generation for Polly. With this addition, users
 | 
						|
   can parallelize some perfectly nested loops with Polly to execute on a
 | 
						|
   heterogeneous platform, composed of CPU and GPU.</p>
 | 
						|
<p>There are several successful projects about automatic source-to-source gpu
 | 
						|
   code transformation. C-to-CUDA[1] uses the standard Pluto algorithms for
 | 
						|
   computing an affine schedule and then applies a wavefront transformation to
 | 
						|
   obtain one sequential and n-1 parallel loops. The parallel loops are then
 | 
						|
   mapped onto the blocks and threads of GPU. PPCG[2] introduces some advanced
 | 
						|
   algorithms which can expose much more parallelism than other methods . And It
 | 
						|
   also introduces affine partition heuristics and code generation algorithms
 | 
						|
   for locality enhancement in the registers and shared memory.</p>
 | 
						|
<p>Since automatic GPGPU code generation is quite a complex problem and what we
 | 
						|
   target is a low-level intermediate representation, LLVM IR, rather than a
 | 
						|
   high-level language source, it is important for us to set a proper objective
 | 
						|
   as a start step to give a complete solution to GPGPU code generation for LLVM
 | 
						|
   IR.</p>
 | 
						|
<p>Firstly, we plan to target two kinds of relatively simple test cases. One is
 | 
						|
   comprised of pure parallel and perfectly nested loops, like the following
 | 
						|
   code.</p>
 | 
						|
<pre>
 | 
						|
parfor(int i=0 to M)
 | 
						|
  parfor(int j=0 to N)
 | 
						|
    LoopBody(i, j);
 | 
						|
</pre>
 | 
						|
<p>Another one is that all the loops in it are parallel except the inner-most
 | 
						|
   one, just like this:</p>
 | 
						|
<pre>
 | 
						|
parfor(int i=0 to M)
 | 
						|
  parfor(int j=0 to N)
 | 
						|
    non-parfor(int k=0 to K)
 | 
						|
      LoopBody(i, j, k);
 | 
						|
</pre>
 | 
						|
<p>The LoopBody part should be limited to instructions or functions calls
 | 
						|
   (intrinsics) which can be handled by LLVM's NVPTX backend.</p>
 | 
						|
<p>On the other hand, we focus on building a preliminary and scalable framework
 | 
						|
   of GPGPU code generation for polly. Thus we plan to employ relatively simple
 | 
						|
   tiling and mapping algorithms and optimize them later.</p>
 | 
						|
<h2>Work Flow</h2>
 | 
						|
<h3>GPGPU Code Generation In General</h3>
 | 
						|
<p>C-to-CUDA[1] and PPCG[2] propose similar steps to solve the automatic GPGPU
 | 
						|
   code generation problem.</p>
 | 
						|
<li>Look for parallel loops.</li>
 | 
						|
<li>Create a polyhedral model from the loops.</li>
 | 
						|
<li>Tile and map the loops to GPU blocks and threads.</li>
 | 
						|
<li>Determine where to place the data.</li>
 | 
						|
<h3>What has been done in Polly</h3>
 | 
						|
<p>Polly has implemented the 1st, 2nd and part of the 3rd of the above steps and
 | 
						|
   many other analysis and transformation passes.</p>
 | 
						|
<h3>What to do in Polly</h3>
 | 
						|
<p>Unlike many source-to-source optimizers such as C-to-CUDA and PPCG, Polly is
 | 
						|
   a low-level optimizer, which means we can't use a source-level compiler
 | 
						|
   (e.g. NVCC) to generate the final assembly for the device. We need manually
 | 
						|
   insert device driver API calls to execute the generated kernel assembly
 | 
						|
   text.</p>
 | 
						|
<p>In this project, we assume that the device driver library has provided an
 | 
						|
   interface to launch kernels in the form of assembly text. Fortunately, most
 | 
						|
   of the mainstream GPU vendors provide such a feature in thier products (see
 | 
						|
   ptxjit of NVIDIA GPUs and CAL of AMD GPUs). Generally speaking, what we
 | 
						|
   are going to do in Polly is:</p>
 | 
						|
<li>Find a way to tile the parallel loops.</li>
 | 
						|
<li>Find a way to extract the loop body and transform it into thread-centric
 | 
						|
    parallel code.</li>
 | 
						|
<li>Find a way to store/load the thread-centric code into/from a device module.
 | 
						|
<li>Find a way to pass the target machine information and generate code of the
 | 
						|
    device module for the target.
 | 
						|
<li>Find a way to map the tiled loop to GPU blocks and threads.</li>
 | 
						|
<li>Find a way to insert CUDA synchronization operations on-demand.
 | 
						|
<li>Find a way to generate the memory copy operations between a host and a
 | 
						|
    device.</li>
 | 
						|
<li>Implement/Wrap a runtime library to serve as the execution engine for the
 | 
						|
    generated device code.</li>
 | 
						|
 | 
						|
<h3>The Work Flow</h3>
 | 
						|
<p>In this section, we assume that the host cpu is X86 and the device is NVIDIA
 | 
						|
   CUDA-compatible. we will use the following test case to describe our work
 | 
						|
   flow.</p>
 | 
						|
<pre>
 | 
						|
for(i = 0; i < 128; i++)
 | 
						|
      for(j = 0; j < 128; j++)
 | 
						|
              A[i][j] = i*128 + j;
 | 
						|
</pre>
 | 
						|
<p>The work flow of our code generator is as follows.</p>
 | 
						|
<p>1.We first use Polly's jscop file importer to get a wanted 4-level parallel
 | 
						|
   tiled code.</p>
 | 
						|
The "schedule" part of the pre-optimization jscop file is as the following:
 | 
						|
<pre>
 | 
						|
"schedule" : "{ Stmt_for_body3[i0, i1] -> schedule[0, i0, 0, i1, 0] }"
 | 
						|
</pre>
 | 
						|
The jscop file describing the tiling transformation is:
 | 
						|
<pre>
 | 
						|
"schedule" : "{ Stmt_for_body3[i0, i1] -> schedule[0, o0, o1, o2, o3]:
 | 
						|
              o0 >= 0 and o0 <= 7 and o1 >= 0 and o1 <= 15 and
 | 
						|
              o2 >= 0 and o2 <= 7 and o3 >= 0 and o3 <= 15 and
 | 
						|
              i0 = 16o0 + o1 and i1 = 16o2 + o3 }"
 | 
						|
</pre>
 | 
						|
We can test the schedule with the following command line.
 | 
						|
<pre>
 | 
						|
opt -load /path/to/polly/build/LLVMPolly.so -basicaa -polly-import-jscop
 | 
						|
    -polly-ast -analyze -q ./test.ll
 | 
						|
    -polly-import-jscop-postfix=transformed+gpu
 | 
						|
</pre>
 | 
						|
The output of this schedule is:
 | 
						|
<pre>
 | 
						|
for (c2=0;c2<=7;c2++) {
 | 
						|
  for (c3=0;c3<=15;c3++) {
 | 
						|
    for (c4=0;c4<=7;c4++) {
 | 
						|
      for (c5=0;c5<=15;c5++) {
 | 
						|
        Stmt_for_body3(16*c2+c3,16*c4+c5);
 | 
						|
      }
 | 
						|
    }
 | 
						|
  }
 | 
						|
}
 | 
						|
</pre>
 | 
						|
Now we get a 4-dimensional parallel loops with a single SCoP statement in it.
 | 
						|
<p>2.We then extract the loop body (or the inner-most non-parallel loop) into a
 | 
						|
   LLVM function, tagging it with PTX_Kernel call convention.</p>
 | 
						|
<p>3.We extract the PTX_kernel function into a temporary module, set the target
 | 
						|
   triple (e.g. nvptx64-unknown-linux) for the module, transform the temporary
 | 
						|
   module into a string, store it in the original module and erase the
 | 
						|
   PTX_kernel function.</p>
 | 
						|
<p>4.We replace the loops with their GPGPU counterpart. The GPGPU part of code
 | 
						|
   is composed of a call to the llvm.codegen intrinsic and function calls to our
 | 
						|
   GPU runtime library.</p>
 | 
						|
<p>5.Finally, we generate the executable program with <em>llc</em> or run the
 | 
						|
   optimized LLVM IRs with a JIT compiler like  <em>lli</em>.</p>
 | 
						|
<h2>Usage</h2>
 | 
						|
<p>1. Apply the llvm.codegen intrinsic patch to LLVM code base.</p>
 | 
						|
<pre>cd /path/to/llvm/source
 | 
						|
git am /path/to/polly/source/utils/0001-Add-llvm.codegen-intrinsic.patch</pre>
 | 
						|
<p>2. Build the test case.</p>
 | 
						|
<pre>/path/to/polly/source/test/create_ll.sh test.c</pre>
 | 
						|
<p>3. Get and edit the jscop file (take function "gpu_codegen" as an example).
 | 
						|
</p>
 | 
						|
<pre>opt -load /path/to/polly/build/lib/LLVMPolly.so -basicaa
 | 
						|
    -polly-export-jscop ./test.ll
 | 
						|
cp gpu_codegen___%for.cond---%for.end8.jscop
 | 
						|
   gpu_codegen___%for.cond---%for.end8.jscop.transformed+gpu
 | 
						|
vi gpu_codegen___%for.cond---%for.end8.jscop.transformed+gpu</pre>
 | 
						|
<p><em>(Please refer to section "The Work Flow" on how to edit the "schedule"
 | 
						|
       part of a statement)</em></p>
 | 
						|
<p>4. Optimize the code with GPGPU code generation.</p>
 | 
						|
<pre>opt -load /path/to/polly/build/lib/LLVMPolly.so -basicaa
 | 
						|
    -polly-import-jscop-postfix=transformed+gpu -enable-polly-gpgpu
 | 
						|
    -polly-gpgpu-triple=nvptx64-unknown-unknown -polly-codegen ./test.ll -S
 | 
						|
    -o test.gpued.ll</pre>
 | 
						|
<p>5. Build the final assembly and executable.</p>
 | 
						|
<pre>llc test.gpued.ll -o test.s
 | 
						|
gcc test.s -lGPURuntime -o test</pre>
 | 
						|
<p><em>(Please make sure that LD_LIBRARY_PATH is set properly so that
 | 
						|
        /path/to/polly/build/lib/libGPURuntime.so is visible to gcc.)</em></p>
 | 
						|
<h2>TODO List</h2>
 | 
						|
 | 
						|
<table class="wikitable" cellpadding="2">
 | 
						|
<tbody>
 | 
						|
<tr style="background: rgb(239, 239, 239)">
 | 
						|
  <th width="400px"> Tasks</th>
 | 
						|
  <th width="150px"> Status </th>
 | 
						|
  <th> Owner </th>
 | 
						|
</tr>
 | 
						|
<tr>
 | 
						|
<th align="left">Tiling the Parallel Loops with An External Jscop File</th>
 | 
						|
<td align="center" class='open'>Open, In Design</td>
 | 
						|
<td>Yabin Hu</td>
 | 
						|
</tr>
 | 
						|
<tr>
 | 
						|
<th align="left">GPU Runtime Library Implementation</th>
 | 
						|
<td align="center" class='inprogress'>Coding Finished, In Reviewing</td>
 | 
						|
<td></td>
 | 
						|
</tr>
 | 
						|
<tr>
 | 
						|
<th align="left">llvm.codegen Intrinsic Implementation</th>
 | 
						|
<td align="center" class='inprogress'>Codeing Finished, To Be Reviewed</td>
 | 
						|
<td></td>
 | 
						|
</tr>
 | 
						|
<tr>
 | 
						|
<th align="left">Code Generation For Host</th>
 | 
						|
<td align="center" class='inprogress'>50% Done</td>
 | 
						|
<td></td>
 | 
						|
</tr>
 | 
						|
 | 
						|
</tbody></table>
 | 
						|
 | 
						|
<h2>References</h2>
 | 
						|
<li type="1" value="1">
 | 
						|
<em>Automatic C-to-CUDA Code Generation for Affine Programs. </em><br />
 | 
						|
    Muthu Manikandan Baskaran, J. Ramanujam and P. Sadayappan.<br />
 | 
						|
    International Conference on Compiler Construction (CC) 2010.<br />
 | 
						|
</li>
 | 
						|
<li type="1"><em>PPCG Project</em><br />
 | 
						|
<a href="http://freecode.com/projects/ppcg">http://freecode.com/projects/ppcg
 | 
						|
</a></li>
 | 
						|
<li type="1">
 | 
						|
<em>Where is the Data? Why You Cannot Debate GPU vs. CPU Performance Without the
 | 
						|
    Answer. </em><br />
 | 
						|
  Chris Gregg and Kim Hazelwood<br />
 | 
						|
  International Symposium on Performance Analysis of Systems and Software
 | 
						|
  (ISPASS) 2011.
 | 
						|
</li>
 | 
						|
<p></p>
 | 
						|
</div>
 | 
						|
</div>
 | 
						|
</body>
 | 
						|
</html>
 |