For this lab, you'll extend, write, and test some CUDA kernels.
$ cd [~]$ ssh-add Enter passphrase for /home/ghopper1/.ssh/id_rsa: Identity added: /home/ghopper1/.ssh/id_rsa (/home/ghopper1/.ssh/id_rsa) [~]$ cd ~/cs40/ [labs]$ git clone git@github.swarthmore.edu:CS40-F18/lab07-YOURUSERNAME-YOURPARTNERNAME.git ./lab07
cmake
with the special path for CUDA, then run make
.
[~]$ cd ~/cs40/lab07 [~]$ mkdir build [~]$ cd build [build]$ cmake -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-6 \ -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc ../ [build]$ make -j8There are two executables for you to modify:
maxval
and circleArt
.
maxval.cu.
This program is supposed to compute the maximum of an array of
floats. Initially a CPU only version has been provided for you. I provided various timing code
to time the GPU and CPU versions of this max function. Your first step is to write a simple
CUDA kernel that works with only multiple threads and one block. Because a global CUDA kernel
can only have a void return type, the variable partial_c
can be used to hold a GPU buffer that
can store one or more results. Call your kernel max_gpu_thread
, and have it store the max value amongst all the threads in the first index of the results buffer. Call your kernel in main with one block and one thread, and note
the time. Check that your GPU result matches the CPU result before proceeding.
As noted in the code, you should use shared memory and a parallel reduction within each block. You can use the dot.cu
demo as a guide for the setting up shared memory. The parallel reduction we finished on Friday for the dot product is pasted below. You can use a similar pattern to compute your max value.
/* This reduction assumes blockDim is a power of 2 A reduction can work for any size, but the indexing becomes a little more tricky */ int i=blockDim.x/2; while(i != 0 ){ if(threadIdx.x < i){ /*reduce!*/ cache[threadIdx.x] += cache[threadIdx.x+i]; } __syncthreads(); /*outside branch but inside loop to prevent deadlock */ i/=2; }
Next, change the size of N in main from 32 to 32*1024*1024. Run your code and note the time for the GPU and CPU versions. If your GPU version is significantly slower, that is OK at this point. Next, make the following changes and run some experiments. Note, the GPU version may be so slow that it times out. If this happens, decrease the size of N until the kernel is able to finish.
max_gpu_kernel
kernel with the following number of threads and record the time: 32,
64, 256, 512. In this version, you should only have one block, so all threads in a block can communicate and determine a single answer for the max of the array. Using shared memory, compute the global answer and write one value to *c
. Since the number of threads will vary, you can overprovision the size of the shared memory array to be the maximum number of threads (512 or even 1024)
max_gpu
that can be called on an arbitrary number of
blocks, each with multiple threads. Try various block and thread counts when calling your
kernel, reporting at least three experiments and highlighting the parameters that result in
the shortest run time. At the thread level, you should use shared memory and a parallel reduction to compute the maximum value per block.
Card | MP | Cores/MP | Total |
P5000 | 20 | 128 | 2560 |
GTX 1080 | 20 | 128 | 2560 |
GTX 980Ti | 22 | 128 | 2816 |
GTX 750 | 4 | 128 | 512 |
M1000M | 4 | 128 | 512 |
M600 | 4 | 128 | 512 |
K2100M | 3 | 192 | 576 |
1000M | 2 | 64 | 128 |
In this application we assume our view is a static 2D world view bounded between 0 and 1 in the $x$ and $y$ directions. This world view is rendered onto an image containing $w$ columns and $h$ rows of pixels with $(r,c)=(0,0)$ in the upper left and wold coordinates $(x,y)=(0,0)$ in the lower left.
In this world, we have a set of Circles
described in the lightweight class circle.h
. Each circle has a center position, and radius, a color, along with a velocity (vx,vy,vz)
and transparency between 0 and 1, 0 being completely transparent, 1 being opaque.
The functions in scenes.h
can generate a set of circles in this world and return them as a CircScene
struct containing a pointer to an array of circle objects and a count of the total number of objects. Because CUDA cannot easily compile complex C++ code including Qt5 code, we don't have the luxury of using QVector3D
or similar QList
objects to store our information.
In viewer.cpp
we create a QTViewer
application like several examples in class. We additionally make a scene and pass the resulting circles to a new Animator
subclass you will create called CircleRenderer
. Your primary task is to complete several CUDA kernels that will draw the set of circles into a buffer, and also animate the scene.
update(ImageBuffer *img)
method that writes colors to an array of pixels stored in img->buffer
. You can see examples of how this is done in several of the in class examples as well as the clearImage
kernel which has been completed for you. The update method gets called repeatedly in an openGL event loop and, if you draw a new image each time update is called, you will see an animation.
For this example, the primary update loop will be
clear the entire screen move the circles according to their velocity draw the circles in the image bufferRight now, only clear is implemented. It takes a clear color as input and writes that color to every pixel in the image buffer. You should examine the implementation of clearImage in
clearImage.cu
to understand how it works and how to work with the ImageBuffer
.
drawCircsWrong
has an implementation that attempts to draw the circles using CUDA, but it isn't quite right. It works OK if you only have a single thread, but that isn't very parallel or scalable. drawCircsWrong
parallelizes the problem the distributing all the circles over all the threads. For each circle, a thread computes a bounding box around the circle, and finds the rows and columns of the image contained within that box. For each pixel in the box, it checks if the center of the pixel is inside the circle. If the pixel overlaps, the old color in the image buffer and the color of the circle are blended together according to the blending function provided. You should not modify the blending function. However, you should note that the order of blending the circles matters and can lead to different results. Your rendering must preserve the order of the circles as they are presented in the initial input array. drawCircsWrong
may violate this property and the Readme asks you to think about why.
The correct order for the three test circles is to draw the red circle first, followed by the green then blue.
Two possible incorrect orders are blue, then green, then red
or green, then red, then blue.
You need to find a way to properly implement drawCircs
. Implement your code in drawCircs
, keeping drawCircsWrong
unmodified. To test your solution, don't forget to change the call in update
in circRenderer.cu
to call your drawCircs
kernel.
Once you think you have the drawing correct, implement the updating of your scene by writing one more kernel moveCircs
. Note that this kernel does not draw anything, but it simply updates the position of each circle to be the previous position plus the current velocity. Once the positions have been modified, the next call to drawCircs
will draw the circles in their new location. If a circle moves completely off one edge of the screen, have it wrap around to the other side as demonstrated in lab.
Adjust numFrames
in viewer.cpp
to test your animation. Set numFrames=-1
to let it loop until you quit.