We are going to look at some example Cuda programs together. Use these as examples as you work on the next lab assignment.
1. Copy over Examples
Copy over some files into your cs87 subdirectory
cd ~/cs87
cp -R ~newhall/public/cs87/cuda_examples .
cd cuda_examples
ls
Makefile main.cpp testgrid.cu
README.md saxpy.cu userBufferKernel.cu userBufferKernel.h
make
2. Cuda Examples
2.1. Simple Cuda Example
saxpy.cu is an example from Nvidia that computes y ← ax + y. In this
example, y and x are two 1D arrays of size N, and a is a scalar value.
The application running on the host follows a very typical control flow:
- initializes x and y in CPU (host) memory.
- allocates CUDA memory for both x and y (cudaMalloc).
- copies their values from CPU to GPU memory (cudaMemcpy).
- invokes the CUDA saxpy kernel on the GPU using a 1D blocks/grid and threads/block layout: `saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y)`. The blocks/grid dimension, (N+255)/256, is the number of blocks needed to process the N elements of x and y. Because this is integer division, the result is truncated, which is why 255 is added to N first: it rounds the quotient up so a partial final block is still launched. The threads/block dimension is 256. Because N may not be evenly divisible by the block size (256), the kernel code checks that threads do not access x and y beyond their bounds.
- copies the result value of y from GPU to CPU memory (cudaMemcpy).
- computes the max error on the CPU side.
- frees all CPU and GPU allocated memory (cudaFree).
The kernel function is written from the view of each individual thread
executing. Each thread computes its index into x and y based on its
position in its enclosing block (threadIdx.x) and its block's position
in the grid (blockIdx.x*blockDim.x):
__global__
void saxpy(int n, float a, float *x, float *y)
{
int i = blockIdx.x*blockDim.x + threadIdx.x;
if (i < n) y[i] = a*x[i] + y[i];
}
2.2. Example of 2D Block, Grid, Thread layout
testgrid.cu
is an example that illustrates a 2D block/grid and
2D thread/block layout. It is often useful to use a dimensional layout
that matches the dimensions of your data structures
(1D for 1D, 2D for 2D, 3D for 3D), but it is not necessary to do this.
When run, this application takes one command line argument that invokes a different kernel function, each set 2D array values to a different function that lets you "see" how a thread is mapped onto the 2D array by its position in its block and grid.
./testgrid 1
./testgrid 2
./testgrid 3
In this example, the kernel functions are invoked with dim3 variables
that specify the 2D blocks/grid and threads/block arrangement (this code
assumes DIM is evenly divisible by 8):
dim3 blocks(DIM/8, DIM/8, 1); // 2D array of blocks in grid: DIM/8 x DIM/8
dim3 threads_block(8, 8, 1); // 2D array of threads/block 8x8
When the kernel function is called, these are used to specify the blocks/grid and threads/block:
kernel_func<<<blocks, threads_block>>>(dev_grid);
Within each kernel, a thread figures out its position in the 2D threads/block/grid layout, and uses it to compute an index into the data (which is viewed as a 1D array in GPU memory; a single cudaMalloc allocated it):
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int offset = x + y*DIM;
The GPU view of data is an x-axis, y-axis view, with the origin block (0,0) in the upper left (2D thread blocks are similarly numbered):
block(0,0) block(1,0) block (2, 0)
block(0,1) block(1,1) block (2, 1)
block(0,2) block(1,2) block (2, 2)
block(0,3) block(1,3) block (2, 3)
When mapping onto a 2D C view of data, keep this in mind (sometimes it matters and sometimes it doesn’t for the computation you are doing).
2.3. Example using ParaVis library to animate CUDA application
main.cpp and userBufferKernel.[h,cu] make up a CUDA application that
uses the ParaVis library to animate the computation as it runs
on the GPU. This example has an application 2D array, and a 2D array
of pixel values that the application uses to "animate" its computation.
You will use this library in your Lab 3 solution.
A ParaVis application must be implemented in a class derived from the
Animator class. Its implementation of the update method drives the
application's execution. The bulk of this code is in
userBufferKernel.[h,cu]; main.cpp contains the program's main function
that sets up and runs the animation.
2.3.1. main.cpp is the main control flow for using the ParaVis library
main.cpp contains the main control flow for writing a program
using the ParaVis library. This is pretty generic code for any
CUDA application using ParaVis. The application-specific part
involves:
- determining how and where to parse any command line arguments (typically either parse them in main or pass them to the application-specific constructor to parse)
- calling the application-specific constructor for the class derived from the Animator class (perhaps passing it the command line options to parse, or values from parsed command line options)
In this application, the application-specific part of main.cpp is the
call to the UserBufferKernel constructor, and the width and height
values passed to it. The rest of the main control flow is common to
most ParaVis CUDA animations. The main steps are:
- create a new QTSafeViewer object (you can add an application-specific title by passing a string value as the 3rd argument)
- create a DataVisCUDA object and call its setAnimator method to hook up the Animator (UserBufferKernel) object's update method:
  - call the DataVisCUDA constructor, passing it the 2D dimensions matching those of the application 2D buffer (800x800 in this example)
  - call the UserBufferKernel constructor (derived from Animator), passing the args needed to init it (the dimensions of the 2D buffer in this example)
  - call the setAnimator method of the DataVisCUDA object, passing it the Animator (UserBufferKernel) object
- call setAnimation on the QTSafeViewer object, passing it the DataVisCUDA object
- call run on the QTSafeViewer to start the animation running (this will invoke the update method of the UserBufferKernel class)
- clean up any state before exit
The C++ parts of this code mostly support running the main animation
loop (in main.cpp) and define the C++ UserBufferKernel constructor
and update methods, which in turn invoke more C-like code that
implements the core of the CUDA application in userBufferKernel.[h,cu].
The CUDA kernels themselves should be written like C functions and not
as C++ method functions of the UserBufferKernel class.
2.3.2. userBufferKernel.[h,cu] is the application-specific code
Look at userBufferKernel.[h,cu] to see what this program does.
In userBufferKernel.h is the definition of the UserBufferKernel class
that is derived from Animator. It implements a constructor, destructor,
and update method that are application-specific. Its update method is
invoked by the main animation loop started by viewer.run() in main.
run can be called with no argument, in which case the application runs
until the user chooses to stop it, or with a number of iterations
(e.g., viewer.run(200)) for which to run the animation (the number of
times the update method is called). This class has private data members
corresponding to the 2D grid.
In userBufferKernel.cu:
- The constructor:
  - initializes CPU-side data (the data members of this object)
  - allocates CUDA memory space for the 2D application grid
  - copies the CPU-initialized 2D grid to GPU memory
- The destructor: frees CUDA allocated memory
- The update method is called repeatedly by the QTSafeViewer run method, and is where the CUDA kernels that perform the main computation on the data grid and the update to the image buffer are invoked. The run method passes it an ImageBuffer *img, through which the method can access the grid of color3 image pixel values (via img->buffer) to set to different colors. Because it is called repeatedly by the viewer's run method, it is implemented like one iteration (or step) of computation on the 2D grid.
void UserBufferKernel::update(ImageBuffer* img) {
  dim3 blocks(m_cols/BLOCKS, m_rows/BLOCKS, 1);
  dim3 threads_block(BLOCKS, BLOCKS, 1);
  for (int i = 0; i < 90; i++) {
    simplekernel<<<blocks, threads_block>>>(m_dev_grid, m_cols);
  }
  int_to_color<<<blocks, threads_block>>>(img->buffer, m_dev_grid, m_cols);
  usleep(100000);
}
In this example, update calls:
- the simplekernel CUDA kernel function 90 times (just to animate larger steps of computation), which performs some computation on the 2D data grid
- the int_to_color CUDA kernel, which updates the image buffer for the ParaVis animation based on the values of the application 2D grid (m_dev_grid)
- usleep, just to slow down the animation
The kernels themselves should be written in a C style (rather than a C++ style).
In the int_to_color kernel, you can see the code each CUDA thread runs
to set its individual pixel values in the color3 image buffer. Each
pixel has RGB components, each set to some function of its
corresponding value in the 2D array of ints (whose values are updated
by simplekernel):
optr[offset].r = (my_cuda_data[offset] + 10) % 255;   // R value
optr[offset].g = (my_cuda_data[offset] + 100) % 255;  // G value
optr[offset].b = (my_cuda_data[offset] + 200) % 255;  // B value
3. Cuda Programming References
- Nvidia's CUDA Developer Documentation (there is a CUDA C Programming Guide off this page)
- Chapter 11.1 from the Dive into Systems textbook
- saxpy.cu example from Nvidia
- ParaVis documentation is in .h files (view in vim or another editor):
  /usr/local/include/qtvis/animator.h
  /usr/local/include/qtvis/imageBuffer.h
  /usr/local/include/qtvis/dataVisCUDA.h
  /usr/local/include/qtvis/dataVis.h