CS40 Lab 8: CUDA Performance Testing

Due 2:30pm Wednesday 24 April 2013

You may work with one partner on this assignment. In this lab, you will design and performance test CUDA kernels for finding the maximum element in a large array.

Git Setup
Get a clean copy of the code from the origin/master branch and create local and remote project8 branches by following these steps:
git fetch origin
git checkout -b project8 origin/master
git push -u private project8
The above commands assume that the default class code is on the origin remote and that your personal remote is named private. Furthermore, your working directory must be clean (no uncommitted changes to files under version control) before checking out a new branch. If this is not the case, add and commit your changes locally before switching to the project8 branch.

Once you and your partner have pushed your project8 branches, you can each check out a shared8 branch to follow your partner's changes:

git fetch partner
git checkout -b shared8 partner/project8
Note that you should not merge into the master, shared, or shared8 branches, as you will later be unable to push your changes. Make sure you are on either the working or project8 branch before merging.
Background
First, compile and run maxval.cu. This program is supposed to compute the maximum of an array of floats. Initially, only a CPU version has been provided for you, along with timing code to time the GPU and CPU versions of this max function. Your first step is to write a simple CUDA kernel that works with only one block and one thread. Because a global CUDA kernel can only have a void return type, the variable result can be used to hold a GPU buffer that stores one or more results. Call your kernel max_gpu_single, and have it store the max value in result[0]. The code in main will copy this buffer into a partial-results buffer and do some post-processing. Call your kernel in main with one block and one thread, and note the time. Check that your GPU result matches the CPU result before proceeding.
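A single-block, single-thread kernel can simply scan the array serially. The sketch below assumes the device array, result buffer, and element count are passed in as parameters (the exact signature in maxval.cu may differ):

```cuda
// Sketch: one thread scans the whole array and writes the max to result[0].
// Parameter names (a, result, n) are assumptions; match them to maxval.cu.
__global__ void max_gpu_single(float *a, float *result, int n) {
    float best = a[0];
    for (int i = 1; i < n; i++) {
        if (a[i] > best) { best = a[i]; }
    }
    result[0] = best;   // main copies this buffer back and post-processes it
}

// Launched from main with one block and one thread:
//   max_gpu_single<<<1, 1>>>(d_a, d_result, N);
```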

Next, change the size of N near the top of the code from 32 to 32*1024*1024. Comment out the line in main that prints the values of a[i] using cout, so you do not see 32 million items printed. Run your code and note the times for the GPU and CPU versions. If your GPU version is significantly slower, that is OK at this point. Next, make the following changes and run some experiments. Note that the GPU version may be so slow that it times out; if this happens, decrease the size of N until the kernel is able to finish.

Experiments

Block Kernel

Write a kernel called max_gpu_block that can be called on multiple blocks, each containing one thread. Call your kernel with the following numbers of blocks, recording the time for each: 4, 8, 16, 32, 64, and 256. Note that you will need to recompile between runs. Your kernel only needs to have each block compute the max of the elements that block checks; each block can store its maximum in results[blockIdx.x]. A small amount of post-processing by the CPU can then find the maximum over all the blocks.
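One way to divide the work is to give each block a contiguous chunk of the array. The sketch below assumes gridDim.x evenly divides n (true for the block counts listed above when N is a power of two):

```cuda
// Sketch: each block (one thread per block) scans its own chunk and
// records a partial max in results[blockIdx.x]; the CPU finishes the job.
// Assumes gridDim.x divides n evenly.
__global__ void max_gpu_block(float *a, float *results, int n) {
    int chunk = n / gridDim.x;
    int start = blockIdx.x * chunk;
    float best = a[start];
    for (int i = start + 1; i < start + chunk; i++) {
        if (a[i] > best) { best = a[i]; }
    }
    results[blockIdx.x] = best;
}
```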

Thread Kernel

Write a kernel called max_gpu_thread that can be called on a single block containing multiple threads. Call your kernel with the following numbers of threads and record the time: 32, 64, 256, 512. You may need to change the variable partial_size in main to make sure the results buffer is the appropriate size. Each thread will write to one slot in this buffer, which is again post-processed by the CPU in main.
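Here each thread can stride through the array by blockDim.x, which keeps memory accesses by neighboring threads adjacent. A sketch, assuming n is at least as large as the thread count:

```cuda
// Sketch: one block of blockDim.x threads. Each thread strides through
// the array and writes its partial max to results[threadIdx.x]; the CPU
// post-processes the blockDim.x partial results in main.
// Assumes n >= blockDim.x.
__global__ void max_gpu_thread(float *a, float *results, int n) {
    float best = a[threadIdx.x];
    for (int i = threadIdx.x + blockDim.x; i < n; i += blockDim.x) {
        if (a[i] > best) { best = a[i]; }
    }
    results[threadIdx.x] = best;
}
```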

Combination Kernel

Finally, write a kernel called max_gpu_combined that can be called on an arbitrary number of blocks, each with multiple threads. Try various block and thread counts when calling your kernel, reporting at least three experiments and highlighting the parameters that result in the shortest run time. At the thread level, you should use shared memory and a parallel reduction to compute the maximum value per block.
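The per-block reduction can be done with a shared-memory tree: each step halves the number of active threads until thread 0 holds the block's max. A sketch, assuming a power-of-two thread count that matches the launch configuration (THREADS_PER_BLOCK is a name introduced here, not one from maxval.cu):

```cuda
#include <cfloat>   // for FLT_MAX

#define THREADS_PER_BLOCK 256   // must match the launch; power of two

// Sketch: grid-stride loop, then a shared-memory tree reduction per block.
// results holds one partial max per block for the CPU to finish in main.
__global__ void max_gpu_combined(float *a, float *results, int n) {
    __shared__ float cache[THREADS_PER_BLOCK];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    float best = -FLT_MAX;
    for (int i = tid; i < n; i += stride) {
        if (a[i] > best) { best = a[i]; }
    }
    cache[threadIdx.x] = best;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s && cache[threadIdx.x + s] > cache[threadIdx.x]) {
            cache[threadIdx.x] = cache[threadIdx.x + s];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) { results[blockIdx.x] = cache[0]; }
}
```

Note that __syncthreads() must be reached by all threads in the block, which is why the comparison is guarded inside the loop rather than the loop itself being guarded.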
Test on two Cards
Test your code on at least two different graphics cards; see the list of host graphics cards. The specs in terms of multiprocessors (MP), cores per multiprocessor, and total number of cores are summarized below.
Card      MP   Cores/MP   Total
NVS 450    1      8          8
FX 380     2      8         16
FX 1800    8      8         64
600        2     48         96
2000       4     48        192
FX 3800   24      8        192
Requirements

You should have four kernels (single, block, thread, combo) in your source code.

Run on at least two different graphics cards.

Put your results in a README.txt file; a sample format is shown below. If a kernel cannot run on the full 32 million elements, list the time and the size of the largest input you can successfully run.

Card 1: NVS450

CPU Time:

Single kernel Time:  ...  Max size: ....

-------------
Block Kernel
-------------

4   Blocks Time: ... 
8   Blocks Time: ...
...
256 Blocks Time: ...


-------------
Thread Kernel
-------------

32  Threads Time: ...
64  Threads Time: ...
256 Threads Time: ...


-------------
Combo  Kernel
-------------

B Blocks, T Threads Time: ...


Repeat above for different hardware. 
Submit
Write your results in a README.txt file and add, commit, and push this file to your git repo. Note any test cases where you could not run on the full 32 million element array, and summarize any patterns or peculiarities you noticed in your testing.

You should regularly commit your changes to the project8 branch and occasionally push to your private remote. Note that you must push to your remote to share updates with your partner. Ideally, you should commit changes at the end of every session of working on the project. You will be graded on the work that appears in your remote project8 branch by the project deadline.