You may work with one partner on this assignment. In this lab, you will design and performance-test CUDA kernels that find the maximum element of a large array.
Get a clean copy of the code from the origin/master branch and create a local and remote project8 branch by following these steps.
git fetch origin
git checkout -b project8 origin/master
git push -u private project8
The above commands assume that the default class code is on the origin remote and that your personal remote is named private. Furthermore, your working directory must be clean (no uncommitted changes to files under version control) before checking out a new branch. If this is not the case, add and commit your changes locally before switching to the project8 branch.
Once you and your partner have pushed your project8 branches, you can each check out a shared8 branch to follow your partner's changes:
git fetch partner
git checkout -b shared8 partner/project8
Note that you should not merge into any of the master, shared, or shared8 branches, as you will later be unable to push your changes. Make sure you are on either the working or project8 branch before merging.
First, compile and run maxval.cu.
This program is supposed to compute the maximum of an array of
floats. Initially, only a CPU version is provided, along with timing code
for both the GPU and CPU versions of the max function. Your first step is to write a simple
CUDA kernel that works with only one block and one thread. Because a __global__ CUDA kernel
can only have a void return type, the variable result holds a GPU buffer that
can store one or more results. Call your kernel max_gpu_single, and have it store the max value
in the buffer result. The code in main will copy this buffer into a partial_results buffer
and do some post-processing. Launch your kernel in main with one block and one thread, and note
the time. Check that your GPU result matches the CPU result before proceeding.
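As a rough illustration of the single-thread step, here is a minimal sketch. The kernel's parameter list and the host helper run_max_single are assumptions made for this example; the variable names and signatures in the provided maxval.cu may differ, and error checking on the CUDA calls is omitted for brevity.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Assumed signature: the starter code may pass its arguments differently.
// Intended for a <<<1, 1>>> launch, so a single thread scans the whole array.
__global__ void max_gpu_single(const float *a, int n, float *result) {
    float best = a[0];
    for (int i = 1; i < n; ++i) {
        if (a[i] > best) best = a[i];
    }
    result[0] = best;  // a void kernel reports its answer through a buffer
}

// Hypothetical host helper (not part of the starter code): copies the input
// to the GPU, launches the kernel, and copies the one-element result back.
float run_max_single(const float *host_a, int n) {
    assert(n > 0);
    float *dev_a = nullptr;
    float *dev_result = nullptr;
    float out = 0.0f;
    cudaMalloc(&dev_a, n * sizeof(float));
    cudaMalloc(&dev_result, sizeof(float));
    cudaMemcpy(dev_a, host_a, n * sizeof(float), cudaMemcpyHostToDevice);
    max_gpu_single<<<1, 1>>>(dev_a, n, dev_result);
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&out, dev_result, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev_a);
    cudaFree(dev_result);
    return out;
}
```

Comparing the value returned by such a helper against the CPU maximum on the small N = 32 input is a quick correctness check before timing anything.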
Next, change the size of N near the top of the code from 32 to 32*1024*1024. Comment out the
line in main which prints out the values of a[i] using cout, so you do not see 32 million items
print. Run your code and note the time for the GPU and CPU versions. If your GPU version is
significantly slower, that is OK at this point. Next, make the following changes and run some
experiments. Note, the GPU version may be so slow that it times out. If this happens, decrease the size of N until the kernel is able to finish.
Write a kernel called max_gpu_block that can be called on multiple blocks, each containing
one thread. Call your kernel with the following numbers of blocks: 4, 8, 16, 32, 64, 256,
and record the time for each. Note that you will need to recompile after each change. Your kernel
only needs to have each block compute the max of the elements that block checks. Each block can
store its maximum in results[blockIdx.x]. A small amount of post-processing by the CPU in main can
then find the maximum over all the blocks.
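One possible way to divide the work among blocks is sketched below. The strided indexing, the parameter list, and the helper run_max_block are illustrative assumptions rather than the required design; splitting the array into contiguous chunks per block would work as well.

```cuda
#include <cassert>
#include <cfloat>
#include <vector>
#include <cuda_runtime.h>

// Assumed signature. Launched as <<<blocks, 1>>>: block b examines elements
// b, b + gridDim.x, b + 2 * gridDim.x, and so on.
__global__ void max_gpu_block(const float *a, int n, float *results) {
    float best = -FLT_MAX;  // stays at this value if the block sees no elements
    for (int i = blockIdx.x; i < n; i += gridDim.x) {
        if (a[i] > best) best = a[i];
    }
    results[blockIdx.x] = best;  // one partial maximum per block
}

// Hypothetical host helper: launches the kernel, then reduces the per-block
// partial maxima on the CPU.
float run_max_block(const float *host_a, int n, int blocks) {
    assert(n > 0 && blocks > 0);
    float *dev_a = nullptr;
    float *dev_results = nullptr;
    std::vector<float> partial(blocks);
    cudaMalloc(&dev_a, n * sizeof(float));
    cudaMalloc(&dev_results, blocks * sizeof(float));
    cudaMemcpy(dev_a, host_a, n * sizeof(float), cudaMemcpyHostToDevice);
    max_gpu_block<<<blocks, 1>>>(dev_a, n, dev_results);
    cudaMemcpy(partial.data(), dev_results, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dev_a);
    cudaFree(dev_results);
    float best = partial[0];
    for (int b = 1; b < blocks; ++b) {
        if (partial[b] > best) best = partial[b];
    }
    return best;
}
```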
Write a kernel called max_gpu_thread that can be called on a single block containing multiple
threads. Call your kernel with the following numbers of threads and record the time for each: 32,
64, 256, 512. You may need to change the variable partial_size in main to make sure the
results buffer is the appropriate size. Each thread will write to one slot in this buffer, which
is again post-processed by the CPU in main.
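The single-block, multi-thread variant can be sketched in the same spirit, with threadIdx.x and blockDim.x taking over the roles that the block index and grid size play in a one-thread-per-block kernel. The signature and the helper run_max_thread are again assumptions for illustration only.

```cuda
#include <cassert>
#include <cfloat>
#include <vector>
#include <cuda_runtime.h>

// Assumed signature. Launched as <<<1, threads>>>: thread t examines elements
// t, t + blockDim.x, ... so neighbouring threads read neighbouring addresses.
__global__ void max_gpu_thread(const float *a, int n, float *results) {
    float best = -FLT_MAX;
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        if (a[i] > best) best = a[i];
    }
    results[threadIdx.x] = best;  // one slot per thread
}

// Hypothetical host helper: the results buffer needs one float per thread,
// which is why the buffer size in main has to track the thread count.
float run_max_thread(const float *host_a, int n, int threads) {
    assert(n > 0 && threads > 0);
    float *dev_a = nullptr;
    float *dev_results = nullptr;
    std::vector<float> partial(threads);
    cudaMalloc(&dev_a, n * sizeof(float));
    cudaMalloc(&dev_results, threads * sizeof(float));
    cudaMemcpy(dev_a, host_a, n * sizeof(float), cudaMemcpyHostToDevice);
    max_gpu_thread<<<1, threads>>>(dev_a, n, dev_results);
    cudaMemcpy(partial.data(), dev_results, threads * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dev_a);
    cudaFree(dev_results);
    float best = partial[0];
    for (int t = 1; t < threads; ++t) {
        if (partial[t] > best) best = partial[t];
    }
    return best;
}
```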
Finally, write a kernel called max_gpu_combined that can be called on an arbitrary number of
blocks, each with multiple threads. Try various block and thread counts when calling your
kernel, reporting at least three experiments and highlighting the parameters that result in
the shortest run time. At the thread level, you should use shared memory and a parallel reduction to compute the maximum value per block.
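A minimal sketch of the shared-memory reduction follows. It assumes the number of threads per block is a power of two, uses dynamically sized shared memory, and introduces a hypothetical helper run_max_combined; none of these choices is dictated by the assignment, and the starter code's signatures may differ.

```cuda
#include <cassert>
#include <cfloat>
#include <vector>
#include <cuda_runtime.h>

// Assumed signature. Each thread first scans a strided slice of the array,
// then the block combines its per-thread maxima in shared memory.
__global__ void max_gpu_combined(const float *a, int n, float *results) {
    extern __shared__ float cache[];  // blockDim.x floats, sized at launch
    int tid = threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    float best = -FLT_MAX;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += stride) {
        best = fmaxf(best, a[i]);
    }
    cache[tid] = best;
    __syncthreads();
    // Tree reduction: halve the number of active threads at every step.
    // Assumes blockDim.x is a power of two.
    for (int half = blockDim.x / 2; half > 0; half /= 2) {
        if (tid < half) cache[tid] = fmaxf(cache[tid], cache[tid + half]);
        __syncthreads();  // reached by every thread, outside the if
    }
    if (tid == 0) results[blockIdx.x] = cache[0];
}

// Hypothetical host helper: one partial result per block, reduced on the CPU.
float run_max_combined(const float *host_a, int n, int blocks, int threads) {
    assert(n > 0 && blocks > 0 && threads > 0);
    assert((threads & (threads - 1)) == 0);  // power-of-two thread count
    float *dev_a = nullptr;
    float *dev_results = nullptr;
    std::vector<float> partial(blocks);
    cudaMalloc(&dev_a, n * sizeof(float));
    cudaMalloc(&dev_results, blocks * sizeof(float));
    cudaMemcpy(dev_a, host_a, n * sizeof(float), cudaMemcpyHostToDevice);
    max_gpu_combined<<<blocks, threads, threads * sizeof(float)>>>(
        dev_a, n, dev_results);
    cudaMemcpy(partial.data(), dev_results, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dev_a);
    cudaFree(dev_results);
    float best = partial[0];
    for (int b = 1; b < blocks; ++b) best = fmaxf(best, partial[b]);
    return best;
}
```

The third launch parameter sets the size of the dynamic shared-memory array, so the same kernel can be timed with different thread counts without editing a compile-time constant.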
Test on two Cards
Test your code on at least two different graphics cards. See the list of host graphics cards.
The specs, in terms of multiprocessors, cores per multiprocessor, and total number of cores, are summarized below.
Card   | Multiprocessors | Cores per multiprocessor | Total cores
FX 380 | 2               | 8                        | 16
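To confirm which card a given host actually has, the CUDA runtime can be queried directly. The sketch below uses the standard cudaGetDeviceCount and cudaGetDeviceProperties calls; the function name print_devices is made up for this example. The runtime reports the multiprocessor count but not the cores per multiprocessor, which depends on the card's compute capability.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints one line per CUDA device and returns how many
// devices were found (0 if the query itself fails).
int print_devices() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return 0;
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, d) != cudaSuccess) continue;
        printf("Device %d: %s, %d multiprocessors, compute capability %d.%d\n",
               d, prop.name, prop.multiProcessorCount, prop.major, prop.minor);
    }
    return count;
}
```

Recording this output alongside the timings in README.txt makes it clear which hardware each set of results came from.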
You should have four kernels (single, block, thread, combined) in your source code.
Run on at least two different graphics cards.
Put your results in a README.txt file. A sample format is shown below. If a kernel cannot run on the full 32 million elements, list the time and size of the largest input you can successfully run on.
Card 1: NVS450
Single kernel Time: ... Max size: ....
4 Blocks Time: ...
8 Blocks Time: ...
256 Blocks Time: ...
32 Threads Time: ...
64 Threads Time: ...
256 Threads Time: ...
B Blocks, T Threads Time: ...
Repeat above for different hardware.
Write your results in a README.txt file and add, commit, and push this file to your git repo. Note any test cases where you could not run on the full 32 million element array, and summarize any patterns or peculiarities you noticed in your testing.
You should regularly commit your changes to the project8 branch and occasionally push to your private remote. Note that you must push to your remote to share updates with your partner. Ideally, you should commit changes at the end of every session of working on the project. You will be graded on the work that appears in your remote project8 branch by the project deadline.