Here are the partnerships and here's a link to GitHub.
The Game of Life programs we wrote for lab 7 run quite slowly when we use them to simulate a large board over many iterations. We can speed the simulation up by making better use of our system's resources. Our computers have multiple processors (CPUs), but our lab 7 programs used just one processor. Your task on this lab is to speed up your Game of Life program by re-designing it to run over multiple processors.
The user will specify with a command line arugment how many threads they want the Game of Life simulation to use. If the number of these worker threads is less than or equal to the number of processors on your computer, then all the threads can run at the same time.
Your job is to divide the computation evenly over multiple threads that can run in parallel. Each thread will be assigned a section (partition) of the board that it's responsible for updating. You will need to synchronize the threads to avoid the situation where one thread is several iterations ahead of another.
With experimentation, you should be able to confirm that adding more threads makes the program run faster, until you exceed the number of processors available on your machine.
Use your lab 7 solution as a starting point, and then add in the additional features required for this lab. We'll use the pthread library to create and manage our threads, so you should add #include <pthread.h>
to the top of your gol.c
file.
On top of making the program multithreaded, your program should now take in four command line arguments:
When the fourth command line argument is 1, each thread should print out the range of cells that it's responsible for updating. This is mainly useful for debugging (and grading) purposes. Here we're referring to cells by their index in the dynamically allocated 1D array that we've been treating as a 2D array.
$./gol square100.txt 0 3 1
thread 0: 0-3332 (3333 cells)
thread 1: 3333-6665 (3333 cells)
thread 2: 6666-9999 (3334 cells)
50 iterations of a 100x100 board took 0.019325 seconds.
$ ./gol square100.txt 0 4 1
thread 0: 0-2499 (2500 cells)
thread 1: 2500-4999 (2500 cells)
thread 2: 5000-7499 (2500 cells)
thread 3: 7500-9999 (2500 cells)
50 iterations of a 100x100 board took 0.012225 seconds.
$ ./gol square100.txt 0 6 1
thread 0: 0-1665 (1666 cells)
thread 1: 1666-3331 (1666 cells)
thread 5: 8330-9999 (1670 cells)
thread 3: 4998-6663 (1666 cells)
thread 4: 6664-8329 (1666 cells)
thread 2: 3332-4997 (1666 cells)
50 iterations of a 100x100 board took 0.044254 seconds.
You may be able to deduce the partitioning scheme we're using from the example output above. We divide the number of cells by the number of threads to get the number of cells each thread is responsible for updating, n
. The thread with logical id 0
gets the first n
cells, thread 1
gets the next n
cells, and so on. If the number of threads doesn't evenly divide the number of cells, we assign all the extra cells to the last thread (the one with the biggest id).
Make sure that each cell is assigned to exactly one thread. If a cell is unassigned, then the simluation won't run correctly. If a cell is assigned to multiple threads, this will slow down the program.
Because the partitioning prints are coming from within concurrent threads, they won't necessarily show up in order of thread id, as in the third example run above. The above runs were done on a machine with 4 CPUs. This explains why the program got faster when we increased from 3 threads to 4 threads, but then slowed down when we increased to 6 threads.
The input file format is the same as it was on lab 7. We have provided a couple test files, which you can use in addition to the ones from lab 7. You are encouraged to create your own additional test files. When you test that your multithread Game of Life program is still correctly implementing the update rules, it's good to use a small board. When you test that your program actually becomes faster with more threads, it's good to use a large board that runs for many iterations.
pthread_create
to create the threads. You should create exactly the number of threads specified in the command line. The threads should all run for the entire duration of the simulation. Don't create new threads for every iteration.pthread_join
to wait for the threads to finish.0
and counting up.0
. This thread is also responsible for updating a partition of the board, like all the other threads.1
is on iteration 5
when thread 0
is still on iteration 3
, this would be a race condition. The board may change before thread 0
has the chance to print it. Because we're using this alternating two-board approach, it's okay if a thread is one iteration ahead of another thread. However, you never want a thread to be two iterations ahead.main()
thread or in the worker threads. The latter is more efficient, but also more complex.0
malloc
the two boards and initialize board #1, but before you do any of the work involved in setting up the threads. End the timer at the end of your program. You should always print out the elapsed time.malloc
an array of these structs, one for each thread. Initialize each struct before passing it into pthread_create
by reference.k
, into a 2D pair of indices, i
and j
, do the following:int i, j;
i = k / cols;
j = k % cols;
This reverses the i*cols + j
calculation.
typedef
:typedef struct {
// declare fields here
} ThreadInfo;
// declare a global barrier variable
static pthread_barrier_t barrier;
// initialize the barrier with the number of threads it should wait for
pthread_barrier_init(&barrier, NULL, numberOfThreads);
// Use the barrier
pthread_barrier_wait(&barrier);
$ cat /proc/cpuinfo
malloc
s:
pthread_t
)ThreadInfo
structsNote that when you run valgrind, it will report more than four malloc
s because fopen
and pthread_create
both use malloc
internally.
There are a number of things you could do to make this program more efficient in terms of time or memory. None of these is required.
$ ./gol square100.txt 0 6 1
thread 0: 0-1666 (1667 cells)
thread 1: 1667-3333 (1667 cells)
thread 2: 3334-5000 (1667 cells)
thread 3: 5001-6667 (1667 cells)
thread 4: 6668-8333 (1666 cells)
thread 5: 8334-9999 (1666 cells)
50 iterations of a 100x100 board took 0.037205 seconds.
Compute the partitions inside the worker threads instead of in the main()
thread, if you haven't already done so.
Certain values are needed by all the threads, like the number of iterations and the number of columns. Giving each thread its own copy of these values is a waste of memory. Instead, put all these values in a single struct, stored in the stack frame for main()
. Then pass each thread a pointer to this struct. Since we pass things to threads by putting them in structs, you'll have a struct inside of a struct.
posix_memalign
instead of malloc
to make sure your board starts on a 64-byte cache line.