CS31: Lab 9

This lab should be done with a partner of your choosing.

Lab 9 Goals:

Practice with parallel programming using pthreads
Implement synchronization
Gain more experience with 2D arrays and file I/O in C
Gain more experience with pointers and dynamic memory allocation
Practice debugging threaded programs

Setup Procedure

The setup procedure for this lab will be similar to previous labs. First, both you and your partner should run setup31 to grab the starting point code for this assignment. Suppose users molly and tejas which to work together. Molly (mdanner1) can start by running

  [~]$ setup31 labs/09 tdanner1

Once the script finishes, Tejas (tdanner1) should run

  [~]$ setup31 labs/09 mdanner1

For the next step only one partner should copy over the starting code

  [~]$ cd ~/cs31/labs/09
  [09]$ cp ~lammert/public/cs31/labs/09/* ./
  [09]$ ls
  Makefile  glide.txt  gol.c  oscillator.txt

Now push the changes to your partner

[09]$ git add *
[09]$ git commit -m "lab 9 start"
[09]$ git push

Your partner can now pull the changes. In this case if Tejas wishes to get files Molly pushed, he would run

[~]$ cd ~/cs31/labs/09
[09]$ git pull

Starting Point Code:

The Lab 09 starting point code includes a Makefile, a sequential implementation of the Game of Life (gol.txt), and some sample input (.txt) files. You should create more input files to test larger sized and longer running threaded version, but use smaller ones (and printing command line options) to test for correctness.

The reasons you are being given an implementation of the Game of Life are two. First, it is to ensure that you do not have to struggle with any lingering bugs from Lab 6. Second, it is to put everyone on equal footing. Even if you are happy with your implementation of the Game of Life from Lab 6, you should use the provided code as your starting point for this lab. You should feel very free, though, to make any adjustments to the provided code that you feel are necessary. However, you should not alter the methods for initializing a dynamically allocated board from an input file given as a command line argument, nor should you alter the methods for the print/not-print option after every iteration. Please limit any changes to the sections related to actually "playing the game" (e.g., updating the gameboard).

Project Details and Requirements

This lab is designed to give you some practice writing and debugging multithreaded Pthreads programs, using synchronization primitives, and experience designing and running scalability experiments.

You will implement a parallel version of Conway's Game of Life using the provided sequential solution as a starting point for this lab. See the lab 6 assignment to remind yourself about the sequential version of the program, and about the rules of the game. You will evaluate the scalability of your implementation as you increase the problem size and the number of threads.

Parallel Version

The parallel version should be structured as follows:

The main thread should begin by initializing the game board in the same way as the sequential version. The gettimeofday timers in your code should not include this step but should include all other parts of the execution.
After initializing the game board, the main thread spawns off worker threads that will play multiple rounds of game of life, each thread computing just its portion of cells of the new board each round. Grid cells are allocated to threads using either a row-wise or column-wise partitioning specified by a command line argument.
If printing after each round is enabled, you should designate one thread to be the printing thread, and only that thread should do the printing after each round (otherwise the output can be interleaved in crazy ways).
The main thread should print out the time of the gol computation part of the program before exiting (this time should NOT include the time to initialize the board from the input file, but should measure the rest of the computation).

You will need to think about where and how to synchronize thread actions to remove race-conditions. For example, you need to ensure that no thread starts the next round of play until all threads are done with the current round, otherwise threads will not read consistent values for the given round. If printing is enabled, you need to ensure that the printing thread prints a consistent version of the game board.

A run with no printing enabled should only printout the final timing result of the gettimeofday timer and should not include any calls to usleep in the run.

Command line arguments

Your program will use the two command line arguments from lab 6 and additionally add three new command line arguments to:

specify the number of threads (the degree of parallelism)
specify how to parallelize the GOL program (0: row-wise grid cell allocation, 1: column-wise grid cell allocation).
specify if the per-thread board allocation should be printed out or not (0:don't print configuration information, 1: print allocation information)

$ ./gol
usage: ./gol infile.txt print[0:1] ntids partition[0:1] print_alloc[0:1]

Here are some example command lines:

# run with config values read from file1.txt, do not print the board after each round
# create 8 threads, use row-wise partitioning, print per-thread partitioning details:
./gol file1.txt  0  8  0  1     

# run with config file file2.txt, print the board after each round, create
# 15 threads, use column-wise partitioning, don't print per-thread partitioning:
./gol file2.txt  1  15  1  0

The print per-thread board allocation command line option will help you debug the grid cell allocation scheme to ensure that you are correctly allocating rows or columns across the worker threads.

Your program should handle badly formed command lines and check for bad values entered (like a negative number of threads). It should print out an error message and exit for badly formed command lines and handle most bad input values similarly.

Partitioning

You will implement two different ways of parallelizing the computation: one partitions the board's rows across threads, the other partitions the board's columns across threads (i.e each thread computes the result for some number of rows or columns). As an example, suppose that the board is 8x8, and that there are 4 threads. Here is how the threads (with logical tids 0-3) would be assigned to computing elements of C using row and column partitioning:

row partitioning                        column partitioning
----------------                        ----------------
0 0 0 0 0 0 0 0                         0 0 1 1 2 2 3 3  
0 0 0 0 0 0 0 0                         0 0 1 1 2 2 3 3
1 1 1 1 1 1 1 1                         0 0 1 1 2 2 3 3
1 1 1 1 1 1 1 1                         0 0 1 1 2 2 3 3
2 2 2 2 2 2 2 2                         0 0 1 1 2 2 3 3
2 2 2 2 2 2 2 2                         0 0 1 1 2 2 3 3
3 3 3 3 3 3 3 3                         0 0 1 1 2 2 3 3
3 3 3 3 3 3 3 3                         0 0 1 1 2 2 3 3

When the number of threads does not evenly divide the dimension by which you are partitioning, divide the rows (or columns) up so that there is at most a difference of 1 assigned row (or column) between any two threads. For example, for an 8x7 board and 3 threads, you would partition rows and columns like this among the 3 threads (with tids 0-2):

row partitioning                        column partitioning
----------------                        ----------------
0 0 0 0 0 0 0                           0 0 0 1 1 2 2   
0 0 0 0 0 0 0                           0 0 0 1 1 2 2
0 0 0 0 0 0 0                           0 0 0 1 1 2 2 
1 1 1 1 1 1 1                           0 0 0 1 1 2 2 
1 1 1 1 1 1 1                           0 0 0 1 1 2 2 
1 1 1 1 1 1 1                           0 0 0 1 1 2 2 
2 2 2 2 2 2 2                           0 0 0 1 1 2 2 
2 2 2 2 2 2 2                           0 0 0 1 1 2 2

In this partitioning scheme, a single row (or column) is assigned to exactly one thread; a single row (or column) is never split between two or more threads. Thus, in the above example threads 0 and 1 have one more row each of work to do than thread 2 in the row-wise partitioning, and thread 0 has one more column of work to do than threads 1 and 2 in the column-wise partitioning.

Printing Partitioning Information

When the 5th command line option is non-zero, your program should print thread partitioning information. Each thread should print:

Its thread id (its logical one not its pthread_self value)
Its start and end row index values
The total number of rows it is allocated
Its start and end column index values
The total number of columns it is allocated

The threads will run in different orders, so you may see each thread's output in any order, which is fine. You can avoid some interleaving of individual threads output by calling fflush after printf:

printf("tid %d my values ...\n", mytid, ...);
fflush(stdout);   // force the printf output to be printed to the terminal

Here are some example runs of my program with different numbers of threads partitioning a 100x100 grid. The total number of rows and columns per thread are shown in parentheses (your output does not need to be identical to mine but must include all required parts):

# 9 threads, column-wise partitioning
% gol big 0 9 1 1
tid  0: rows:  0:99 (100) cols:  0:11 (12)
tid  1: rows:  0:99 (100) cols: 12:22 (11)
tid  2: rows:  0:99 (100) cols: 23:33 (11)
tid  4: rows:  0:99 (100) cols: 45:55 (11)
tid  3: rows:  0:99 (100) cols: 34:44 (11)
tid  5: rows:  0:99 (100) cols: 56:66 (11)
tid  6: rows:  0:99 (100) cols: 67:77 (11)
tid  7: rows:  0:99 (100) cols: 78:88 (11)
tid  8: rows:  0:99 (100) cols: 89:99 (11)

# 6 threads, row-wise partitioning
% gol big 0 6 0 1
tid  0: rows:  0:16 (17) cols:  0:99 (100)
tid  1: rows: 17:33 (17) cols:  0:99 (100)
tid  3: rows: 51:67 (17) cols:  0:99 (100)
tid  2: rows: 34:50 (17) cols:  0:99 (100)
tid  4: rows: 68:83 (16) cols:  0:99 (100)
tid  5: rows: 84:99 (16) cols:  0:99 (100)

# 17 threads, row-wise partitioning
% gol big 0 17 0 1
tid  0: rows:  0: 5 ( 6) cols:  0:99 (100)
tid  1: rows:  6:11 ( 6) cols:  0:99 (100)
tid  3: rows: 18:23 ( 6) cols:  0:99 (100)
tid  2: rows: 12:17 ( 6) cols:  0:99 (100)
tid  4: rows: 24:29 ( 6) cols:  0:99 (100)
tid  5: rows: 30:35 ( 6) cols:  0:99 (100)
tid  6: rows: 36:41 ( 6) cols:  0:99 (100)
tid  7: rows: 42:47 ( 6) cols:  0:99 (100)
tid 10: rows: 60:65 ( 6) cols:  0:99 (100)
tid 11: rows: 66:71 ( 6) cols:  0:99 (100)
tid 12: rows: 72:77 ( 6) cols:  0:99 (100)
tid 13: rows: 78:83 ( 6) cols:  0:99 (100)
tid  8: rows: 48:53 ( 6) cols:  0:99 (100)
tid 15: rows: 90:94 ( 5) cols:  0:99 (100)
tid 14: rows: 84:89 ( 6) cols:  0:99 (100)
tid  9: rows: 54:59 ( 6) cols:  0:99 (100)
tid 16: rows: 95:99 ( 5) cols:  0:99 (100)

Note that for the 17 thread run there are enough threads to see "out of order" execution due to the exact scheduling of threads on the CPU's.

Correctness

In addition to printing out per-thread board partitioning information to verify correct allocation, you should also ensure that your parallel version solves GOL correctly. To do this run your parallel version on some of the same input files as your sequential one with printing enabled. The results should be identical. For example, starting with the oscillator input file the start and end board should look like this (you would of course call system("clear") between each iteration):

$ ./gol oscillator.txt 1 4 0 0

start board:

- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - @ @ @ - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 

end board:

- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - @ - - - - - 
- - - - - @ - - - - - 
- - - - - @ - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 

total time for 21 iterations of 11x11 is 5.493663 secs

If you see odd output, either your parallelization of GOL is incorrect, or you are missing synchronization, or both. You can remove the call to system("clear") to see the history of each round to help you debug incorrect GOL computation (of course run for a small number of iterations).

Additional Requirements

In addition to the parallelization and correctness requirements described above, you solution must:

use a makefile
be free of valgrind errors
be well designed and well commented

Hints and Useful Utilities

For the pthread implementation:

The man pages for pthread functions are very helpful. In addition I have links to pthreads documentations and tutorials
use perror to print out error messages from failed system and pthread library calls. perror prints out detailed information about the error that occurred. You call it passing in a string with program-specific error message. For example:
```
big_buff = malloc( ...);
if(!big_buff) {
  perror("mallocing big buffer failed\n");
}
```
See the man page for perror for more information and for the include files you need to add to your program.
You will need to use some synchronization in your solution. You may use any of the pthread synchronization primitives: semaphores, condition variables, and/or barriers. Remember that before using any of these, you need to initialize them, and that all threads need to have access to the shared synchronization variables (either pass a reference to them to the main thread function or declare them as static global variables). Here are some code snippets for using the barrier synchronization primitive:
```
// declare a global pthread_barrier_t var
static pthread_barrier_t barrier;

// initialize it somewhere before using it:
if(pthread_barrier_init(&barrier, NULL, num_threads)){
   perror("Error: with pthread barrier init\n");
   exit(1); 
}

// threads than can call pthread_barrier_wait to synchronize: 
ret = pthread_barrier_wait(&barrier);
if(ret != 0 && ret != PTHREAD_BARRIER_SERIAL_THREAD) {
  perror("Error: can't wait on pthread barrier\n");
  exit(1);
}
```
With the exception of pthread synchronization primitives, you should not use any other global variables in your program. Any values that threads need should be passed to them via the (void *arg) parameter. Look at the parallel max program we did on Tuesday for an example of how to do this.
A guide for using gdb to debug pthread programs

Submit

To submit your code, simply commit your changes locally using git add and git commit. Then run git push while in the labs/09 directory. Only one partner needs to run the final push, but make sure both partners have pulled and merged each others changes. See the section on using a shared repo on the git help page.

CS31 Lab 9: Parallel Game of Life