Lab 2 Partners

Sam White and Chloe Stevens          Niels Verosky and Jordan Singleton
Phil Koonce and Luis Ramirez         Steven Hwang and Ames Bielenberg
Nick Felt and Kyle Erf               Katherine Bertaut and Elliot Weiser
See the git howto for information about how you can set up a git repository for your joint lab 2 project. |
For this assignment you are going to implement parallel matrix multiply using Pthreads and evaluate the scalability of your implementation as you increase the problem size and the number of threads. Matrix multiply is an example of a parallel kernel: something that is not typically a complete stand-alone application, but is a common computation pattern that occurs in many numeric applications. Efficient parallel matrix multiply can be used to greatly improve the performance of a large set of real-world problems.
Your implementation will take command line arguments for the N and M dimensions of the first matrix, the number of threads, and the number of iterations of matrix multiply (how many times you will do AxB=C...and yes, it re-does the exact same computation each iteration). For example:
./matrixmult -n 1024 -m 512 -t 4 -i 10

This will multiply matrix A of size 1024x512 by matrix B of size 512x1024 to obtain the resulting matrix C of size 1024x1024, and will repeat this multiplication 10 times. The reason for multiple iterations is to get longer runs so that you can obtain more meaningful timing results, and also to give you some practice using Pthread synchronization primitives: no thread should start the next round of matrix multiply until all threads have finished with the current round.
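If you want to time from inside your program (in addition to, or instead of, timing with the shell's time command), gettimeofday is one option. Here is a minimal sketch, assuming you wrap the timing calls around the full set of iterations in main; the variable names are just illustrative:

#include <stdio.h>
#include <sys/time.h>

struct timeval start, end;
gettimeofday(&start, NULL);
// ... create threads, run all iterations, join threads ...
gettimeofday(&end, NULL);
double secs = (end.tv_sec - start.tv_sec) +
              (end.tv_usec - start.tv_usec) / 1000000.0;
printf("total time: %f seconds\n", secs);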
The sequential matrix-multiply algorithm is:
// given matrices A, B, and C of the following dimensions:
//   A[N][M]   // N rows and M columns
//   B[M][N]   // M rows and N columns
//   C[N][N]   // N rows and N columns
// compute C = AxB
for i from 0 to N          // for each row in C
  for j from 0 to N        // for each column in C
    // compute the value of C[i][j]
    val = 0.0
    for k from 0 to M      // num elms in row i of A and col j of B
      val += A[i][k]*B[k][j]
    C[i][j] = val

To parallelize matrix multiply, assign each thread some portion of the computation. For example, each thread could be assigned a subset of C's rows to calculate, or a subset of C's columns, or blocks of C, or ...
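For example, a rough sketch of a row-partitioned worker thread function might look like the following. The struct, its field names, and the globals (N, M, A, B, C, stored as contiguous row-major arrays) are illustrative assumptions, not a required design:

// each thread gets a struct like this telling it which rows of C to compute
// (the names here are illustrative, not required)
typedef struct {
    int id;          // this thread's logical id, 0 to num_threads-1
    int start_row;   // first row of C this thread computes
    int end_row;     // one past the last row of C this thread computes
} thread_args;

// assumes A (NxM), B (MxN), C (NxN) are global double * arrays in
// row-major order, and N, M are accessible globals
void *worker(void *arg) {
    thread_args *targs = (thread_args *)arg;
    int i, j, k;
    double val;
    for (i = targs->start_row; i < targs->end_row; i++) {
        for (j = 0; j < N; j++) {
            val = 0.0;
            for (k = 0; k < M; k++) {
                val += A[i*M + k] * B[k*N + j];
            }
            C[i*N + j] = val;
        }
    }
    return NULL;
}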
You can also copy over my simple pthreads example into your lab2 repository to use as a starting point for your lab2 solution if you'd like:
cp ~newhall/public/cs87/pthreads_example/* .
usage: ./matrixmult -n n_dim -m m_dim -t num_tids -i iters [-c]
  -n n_dim      number of rows in A
  -m m_dim      number of cols in A
  -t num_tids   number of threads
  -i iters      number of iterations
  -c            optional arg to parallelize by columns of C (default is by rows)
  -h            print out this help message
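One way to parse these options is with getopt; here is a minimal sketch (the variable names and the print_usage helper are hypothetical):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int n = 0, m = 0, num_threads = 1, iters = 1, by_cols = 0;
int c;
while ((c = getopt(argc, argv, "n:m:t:i:ch")) != -1) {
    switch (c) {
        case 'n': n = atoi(optarg); break;
        case 'm': m = atoi(optarg); break;
        case 't': num_threads = atoi(optarg); break;
        case 'i': iters = atoi(optarg); break;
        case 'c': by_cols = 1; break;        // partition by columns of C
        case 'h': print_usage(); exit(0);    // print_usage is hypothetical
        default:  print_usage(); exit(1);
    }
}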
row partitioning          column partitioning
----------------          -------------------
0 0 0 0 0 0 0 0           0 0 1 1 2 2 3 3
0 0 0 0 0 0 0 0           0 0 1 1 2 2 3 3
1 1 1 1 1 1 1 1           0 0 1 1 2 2 3 3
1 1 1 1 1 1 1 1           0 0 1 1 2 2 3 3
2 2 2 2 2 2 2 2           0 0 1 1 2 2 3 3
2 2 2 2 2 2 2 2           0 0 1 1 2 2 3 3
3 3 3 3 3 3 3 3           0 0 1 1 2 2 3 3
3 3 3 3 3 3 3 3           0 0 1 1 2 2 3 3

If the optional command line argument -c is given, run with column partitioning; otherwise, run with row partitioning.
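When N is not evenly divisible by the number of threads, one common approach is to give the first N % num_threads threads one extra row (or column) each. A small sketch, with illustrative names (id is this thread's logical id):

// split N rows (or columns) across num_threads threads
int chunk = N / num_threads;
int extra = N % num_threads;   // the first 'extra' threads get one more
int start, end;
if (id < extra) {
    start = id * (chunk + 1);
    end   = start + chunk + 1;
} else {
    start = extra * (chunk + 1) + (id - extra) * chunk;
    end   = start + chunk;
}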
You are welcome to implement and test other ways of partitioning of the problem, but the row-wise and column-wise partitioning are required.
Initialize the elements of A and B to random values, for example:

double val = (double)(rand()/(RAND_MAX*1.0));

I recommend picking random values in some small range (like (0.0, 2.0)). You can test your two parallelization schemes for correctness by commenting out the call that seeds the random number generator, so that every run of your program generates the same random sequence, and then printing out the resulting C matrix at the end to see if the values are the same. You can print just the same small chunk of C (like a 10x10 upper corner) and compare, to avoid massive amounts of hard-to-read output.
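Here is a sketch of one way to allocate and initialize a matrix as a single contiguous block; the function name init_matrix is just illustrative:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

double *init_matrix(int rows, int cols) {
    int i, j;
    double *mat = malloc(rows * cols * sizeof(double));
    if (!mat) {
        perror("malloc failed");
        exit(1);
    }
    for (i = 0; i < rows; i++) {
        for (j = 0; j < cols; j++) {
            // random value in (0.0, 2.0)
            mat[i * cols + j] = 2.0 * (rand() / (RAND_MAX * 1.0));
        }
    }
    return mat;
}

// in main: call srand(time(NULL)) for varied runs, or comment the srand
// call out (fixed seed) when checking that the row and column versions
// produce the same C matrix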
In addition to comparing the scalability of your solution, test whether or not the row-wise or column-wise assignment has any effect on performance.
As you run experiments, make sure you are doing so in a way that doesn't interfere with others (see the Useful Unix Utilities section below for some tips about how to ensure this). Also, remember to remove all output statements from the code you time.
You should do multiple runs of each experiment (don't just do a single timed run of 16 threads for 512x512 and 10 iterations, for example). The purpose of multiple runs is to determine how consistent your runs are (look at the standard deviation across runs), and also to see if you have some odd outliers (which may indicate that something else running on the computer interfered with that run, and that you should discard its result).
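If you collect one timing number per line in a file, a quick way to get the mean and standard deviation is a small awk one-liner (times.txt is a hypothetical file of run times in seconds, one per line):

awk '{ sum += $1; sumsq += $1*$1; n++ }
     END { mean = sum/n; printf "mean=%f stddev=%f\n", mean, sqrt(sumsq/n - mean*mean) }' times.txt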
You should be careful to control the experimental runs as much as possible. If other people are using the machine you are running experiments on, their programs can interfere with your results. You can see what is running on a machine with standard utilities, for example:
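w          # who is logged in, plus the machine's load averages
top        # running processes and their CPU use (q to quit)
uptime     # just the load averages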
Also, make sure that your matrix sizes are not too large to fit in RAM. I don't think this will really be a problem, but double check a run with your largest sizes before running your full set of experiments. To see if the system is swapping, run:
watch -n 1 cat /proc/swaps

Filename        Type            Size    Used    Priority
/dev/sda5       partition       2072344 0       -1

If you notice the Used value going above 0, your matrix sizes are too big to fit into RAM (or there are too many people running on this machine, and you need to find an idle machine).
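As a rough sanity check (assuming 8-byte doubles and the three matrices A[N][M], B[M][N], and C[N][N]), the total memory needed is about

  8 * (N*M + M*N + N*N) bytes

so for N = M = 2048 that is 8 * 3 * 2048 * 2048 bytes, roughly 100 MB, which should fit comfortably in RAM.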
As much as possible, it is good to run all your experiments on the same machine, or at least identical machines.
// declare a global pthread_barrier_t var
static pthread_barrier_t barrier;

// initialize it somewhere before using it:
if (pthread_barrier_init(&barrier, NULL, num_threads)) {
    perror("Error: with pthread barrier init\n");
    exit(1);
}

// threads can then call pthread_barrier_wait to synchronize:
ret = pthread_barrier_wait(&barrier);
if (ret != 0 && ret != PTHREAD_BARRIER_SERIAL_THREAD) {
    perror("Error: can't wait on pthread barrier\n");
    exit(1);
}
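To sketch how this fits into the lab, each thread's main loop might look something like the following (iters, targs, and do_my_chunk are illustrative names; the point is that pthread_barrier_wait keeps any thread from starting the next round before all threads finish the current one):

// inside each thread's worker function; barrier is the global
// initialized above
int iter, ret;
for (iter = 0; iter < iters; iter++) {
    do_my_chunk(targs);   // compute this thread's portion of C for this round
    ret = pthread_barrier_wait(&barrier);
    if (ret != 0 && ret != PTHREAD_BARRIER_SERIAL_THREAD) {
        perror("Error: can't wait on pthread barrier\n");
        exit(1);
    }
}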
#!/bin/bash

for ((n=256; n <= 2048; n=n*2))
do
  for ((t=1; t <= 32; t=t*2))
  do
    echo ""
    echo "matrixmult -n $n -m $n -t $t -i 10"
    time ./matrixmult -n $n -m $n -t $t -i 10
  done
done

See the bash shell programming links for more information.
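To keep a record of your results, you might capture the script's output to a file, for example (run_exps.sh is a hypothetical name for the script above; the 2>&1 is needed because time writes to stderr):

./run_exps.sh 2>&1 | tee results-$(hostname).txt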
Turn in a hard copy of your written report at the beginning of class on Thursday.