CS31: Lab 10

This lab will be done with the same partner with whom you worked for the Game of Life lab.

Here is the partner list for Lab Partners

Lab 10 Goals:

shared memory parallel programming using pthreads
synchronization
2D arrays and file I/O in C
pointers, dynamic memory allocation, and pass by reference
debugging threaded programs

Both you and your partner should do:
1. Get your Lab10 ssh-URL from the GitHub server for our class: CS31-F17
2. On the CS system, cd into your cs31/labs subdirectory
```
cd ~/cs31/labs
```
3. Clone a local copy of your shared repo in your private cs31/labs subdirectory:
```
git clone [your_Lab10_URL]
```
  Then cd into your Lab10-you-partner subdirectory.
If all was successful, you should see the following files when you run ls:
```
Makefile  QUESTIONNAIRE  big.txt  square100.txt  testedges.txt
```
If this didn't work, or for more detailed instructions see the the Using Git page.
Next, you should copy over your Lab 7 sequential solution as the starting point for your parallel version of GOL (and you may want to copy over any .txt files you have for testing gol).
```
  /home/your_user_name/cs31/labs/Lab10-you-partner
  $ cp ../Lab07-you-partner/gol.c .
  $ cp ../Lab07-you-partner/*.txt .
  $ ls
  Makefile  QUESTIONNAIRE  gol.c  and some .txt files
```
With the lab 10 starting point code, I've given you a Makefile and some .txt input files. You should create more .txt input files to test large and long-running GOL sequences, with the computation split over multiple threads. The smaller inputs and the printing command line options are useful when you are testing for correctness.

In your gol.c file make sure you have the following #includes:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>

Lab Overview

This lab is designed to give you some practice writing and debugging multithreaded programs that use synchronization primitives.

You will implement a parallel version of Conway's Game of Life using your sequential solution as a starting point for this lab. See the lab 7 assignment to remind yourself about the sequential version of the program, and about the rules of the game.

Your lab 10 solution will use the same method as lab 7 for initializing a dynamically allocated board from an input file given as a command line argument, and will have the same print or not-print option after every iteration. It will distribute board cells among the parallel threads in either a row-wise or column-wise fashion.

You will evaluate the scalability of your implementation as you increase the problem size and the number of threads.

You should begin lab 10 by first fixing any errors in your lab 7 sequential solution that you copied over into your labs/10 subdirectory. Don't worry about getting the torus correct (the edge cells) if you did not get that correct in lab 7. As long as you have a correct GOL implementation for the middle (non-edge) cells, that is enough to move on to starting to parallelize your GOL implementation.

Lab Details and Requirements

Parallel Game of Life

The parallel version of GOL should be structured as follows:

The main thread should begin by initializing the game board in the same way as the sequential version. The gettimeofday timers in your code should not include this step but should include all other parts of the execution.
After initializing the game board, the main thread spawns off worker threads that will play multiple rounds of game of life, each thread computing just its portion of cells of the new board each round. Grid cells are allocated to threads using either a row-wise or column-wise partitioning, as specified by a command line argument.
If printing after each round is enabled, you should designate one thread to be the printing thread, and only that thread should do the printing after each round (otherwise the output can be interleaved in crazy ways).
The main thread should print out the time of the GOL computation part of the program before exiting.

You will need to think about where and how to synchronize thread actions to remove race conditions. For example, you need to ensure that no thread starts the next round of play until all threads are done with the current round, otherwise threads will not read consistent values for the given round. If printing is enabled, you need to ensure that the printing thread prints a consistent version of the game board.

A run with no printing enabled should only print out the final timing result of the gettimeofday timer and should not include any calls to usleep in the run.

Command line arguments

Your program will use the two command line arguments from lab 7 and add three new command line arguments to:

specify the number of threads (the degree of parallelism)
specify how to parallelize the GOL program (0: row-wise grid cell partioning, 1: column-wise grid cell partioning).
specify if the per-thread board partioning should be printed out or not (0:don't print partitioning information, 1: print partitioning information)

$ ./gol
usage: ./gol infile.txt print[0:1] ntids partition[0:1] print_alloc[0:1]

Here are some example command lines:

# run with config values read from file1.txt, do not print the board after each round
# create 8 threads, use row-wise partitioning, print per-thread partitioning details:
./gol file1.txt  0  8  0  1     

# run with config file file2.txt, print the board after each round, create
# 15 threads, use column-wise partitioning, don't print per-thread partitioning:
./gol file2.txt  1  15  1  0

The print per-thread board partioning command line option will help you debug the grid cell partioning scheme to ensure that you are correctly allocating rows or columns across the worker threads.

Your program should handle badly formed command lines and check for bad values entered (like zero or a negative integer for the number of threads). It should print out an error message and exit for badly formed command lines and handle most bad input values similarly.

Partitioning

You will implement two different ways of parallelizing the computation: one partitions the board's rows across threads, the other partitions the board's columns across threads (i.e each thread computes the result for some number of rows or columns). As an example, suppose that the board is 8x8, and that there are 4 threads. Here is how the threads (with logical tids 0-3) would be assigned to computing elements of C using row and column partitioning:

row partitioning                        column partitioning
----------------                        ----------------
0 0 0 0 0 0 0 0                         0 0 1 1 2 2 3 3  
0 0 0 0 0 0 0 0                         0 0 1 1 2 2 3 3
1 1 1 1 1 1 1 1                         0 0 1 1 2 2 3 3
1 1 1 1 1 1 1 1                         0 0 1 1 2 2 3 3
2 2 2 2 2 2 2 2                         0 0 1 1 2 2 3 3
2 2 2 2 2 2 2 2                         0 0 1 1 2 2 3 3
3 3 3 3 3 3 3 3                         0 0 1 1 2 2 3 3
3 3 3 3 3 3 3 3                         0 0 1 1 2 2 3 3

When the number of threads does not evenly divide the dimension by which you are partitioning, divide the rows (or columns) up so that there is at most a difference of 1 assigned row (or column) between any two threads. For example, for an 8x7 board and 3 threads, you would partition rows and columns like this among the 3 threads (with tids 0-2):

row partitioning                        column partitioning
----------------                        ----------------
0 0 0 0 0 0 0                           0 0 0 1 1 2 2   
0 0 0 0 0 0 0                           0 0 0 1 1 2 2
0 0 0 0 0 0 0                           0 0 0 1 1 2 2 
1 1 1 1 1 1 1                           0 0 0 1 1 2 2 
1 1 1 1 1 1 1                           0 0 0 1 1 2 2 
1 1 1 1 1 1 1                           0 0 0 1 1 2 2 
2 2 2 2 2 2 2                           0 0 0 1 1 2 2 
2 2 2 2 2 2 2                           0 0 0 1 1 2 2

In this partitioning scheme, a single row (or column) is assigned to exactly one thread; a single row (or column) is never split between two or more threads. Thus, in the above example threads 0 and 1 have one more row each of work to do than thread 2 in the row-wise partitioning, and thread 0 has one more column of work to do than threads 1 and 2 in the column-wise partitioning.

Printing Partitioning Information

When the 5th command line option is non-zero, your program should print thread partitioning information. Each thread should print:

Its thread id (its logical one not its pthread_self value)
Its start and end row index values
The total number of rows it is allocated
Its start and end column index values
The total number of columns it is allocated

The threads will run in different orders, so you may see each thread's output in any order, which is fine. You can avoid some interleaving of an individual thread's output by calling fflush after printf:

printf("tid %d my values ...\n", mytid, ...);
fflush(stdout);   // force the printf output to be printed to the terminal

Example Partitioning Output

Here are some example runs of my program with different numbers of threads partitioning a 100x100 grid. The total number of rows and columns per thread are shown in parentheses (your output does not need to be identical to mine but must include all required parts):

# 9 threads, column-wise partitioning
% gol square100.txt 0 9 1 1
tid  0: rows:  0:99 (100) cols:  0:11 (12)
tid  1: rows:  0:99 (100) cols: 12:22 (11)
tid  2: rows:  0:99 (100) cols: 23:33 (11)
tid  4: rows:  0:99 (100) cols: 45:55 (11)
tid  3: rows:  0:99 (100) cols: 34:44 (11)
tid  5: rows:  0:99 (100) cols: 56:66 (11)
tid  6: rows:  0:99 (100) cols: 67:77 (11)
tid  7: rows:  0:99 (100) cols: 78:88 (11)
tid  8: rows:  0:99 (100) cols: 89:99 (11)

# 6 threads, row-wise partitioning
% gol square100.txt 0 6 0 1
tid  0: rows:  0:16 (17) cols:  0:99 (100)
tid  1: rows: 17:33 (17) cols:  0:99 (100)
tid  3: rows: 51:67 (17) cols:  0:99 (100)
tid  2: rows: 34:50 (17) cols:  0:99 (100)
tid  4: rows: 68:83 (16) cols:  0:99 (100)
tid  5: rows: 84:99 (16) cols:  0:99 (100)

# 17 threads, row-wise partitioning
% gol square100.txt 0 17 0 1
tid  0: rows:  0: 5 ( 6) cols:  0:99 (100)
tid  1: rows:  6:11 ( 6) cols:  0:99 (100)
tid  3: rows: 18:23 ( 6) cols:  0:99 (100)
tid  2: rows: 12:17 ( 6) cols:  0:99 (100)
tid  4: rows: 24:29 ( 6) cols:  0:99 (100)
tid  5: rows: 30:35 ( 6) cols:  0:99 (100)
tid  6: rows: 36:41 ( 6) cols:  0:99 (100)
tid  7: rows: 42:47 ( 6) cols:  0:99 (100)
tid 10: rows: 60:65 ( 6) cols:  0:99 (100)
tid 11: rows: 66:71 ( 6) cols:  0:99 (100)
tid 12: rows: 72:77 ( 6) cols:  0:99 (100)
tid 13: rows: 78:83 ( 6) cols:  0:99 (100)
tid  8: rows: 48:53 ( 6) cols:  0:99 (100)
tid 15: rows: 90:94 ( 5) cols:  0:99 (100)
tid 14: rows: 84:89 ( 6) cols:  0:99 (100)
tid  9: rows: 54:59 ( 6) cols:  0:99 (100)
tid 16: rows: 95:99 ( 5) cols:  0:99 (100)

Note that for the 17 thread run there are enough threads to see "out of order" execution due to the exact scheduling of threads on the CPUs.

Example Correctness Output/Testing

In addition to printing out per-thread board partitioning information to verify correct partioning, you should also ensure that your parallel version solves GOL correctly. To do this run both your parallel version and your sequential version on some of the same input files with printing enabled. The results should be identical. For example, starting with the oscillator input file the start and end board should look like this (you would of course call system("clear") between each iteration):

$ ./gol test_oscillator.txt 1 4 0 0

start board:

- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - @ @ @ - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
number of live cells: 3

end board:

- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - @ - - - - - 
- - - - - @ - - - - - 
- - - - - @ - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
number of live cells: 3

total time for 21 iterations of 11x11 is 5.493663 secs

If you see odd output, either your parallelization of GOL is incorrect, or you are missing synchronization, or both. You can remove the call to system("clear") to see the history of each round to help you debug incorrect GOL computations.

Additional Requirements

In addition to the parallelization, partitioning, command line options, and correctness requirements for pthreads GOL described above, your solution must:

Satisfy all of the sequential game-of-life requirements: the command line options, the playing gol rounds, the animation option including printing out the board and the number of live cells after each round, prininting out timing information, etc. See the Lab 7 Requirements for more information.
The one exception is that if you did not correctly get the torus part of gol to work (your corner or edge cells are incorrect), you can parallelize a gol solution with just all non-edge cells correct.

The task of counting the number of live cells must be divided among the threads, with each thread counting the live cells in its assigned section of the grid. Make sure this portion of your code is free of race conditions. It's okay to accumulate the live cell count in a static global variable.

Be free of valgrind errors

Be well designed and well commented

Hints and Useful Utilities

If you are having trouble getting started, here is one way to incrementally approach a solution to this lab:
1. Without spawning any threads, write the partitioning code, i.e. figure out which section of the board each thread would be assigned.
2. Spawn the number of threads specified at the command line, and have each thread just print out its logical thread id.
3. Create a struct that stores all the information you will want to pass into the threads.
4. Have each thread print out its own partitioning information (when the final command line argument is nonzero).
5. Write the code to compute one round of the GOL simulation, with the work divided among the threads. This would involve printing the original board, updating the board once based on the local update rules, and printing out the new board.
6. Compute all rounds of the GOL simulation, avoiding race conditions.
7. Count the number of live cells in a multithreaded way, avoiding race conditions.
8. Make sure all other requirements have been met.

Descriptions of Conway's Game of Life, from Lab 7 and Wikipedia.

The man pages for pthread functions are very helpful. In addition I have links to pthreads documentation and tutorials
Here are a few on-line manual pages for pthread functions:

Look at example pthread code that we will do in lecuture (max and prodcons) and from Wednesday lab of week 12.
You can see your threads running on the system (for a long enough running parallel gol program). Run top -H in another terminal as your parallel gol is running to see the multiple threads running. If you only see one thread from the 'gol' program, then there is likely either something wrong with your thread spawning or with your thread synchronization.
Use perror to print out error messages from failed system and pthread library calls. perror prints out detailed information about the error that occurred. You call it passing in a string with program-specific error message. For example:
```
big_buff = malloc( ...);
if(!big_buff) {
  perror("mallocing big buffer failed\n");
  fflush(sterr);
  exit(1); 
}
```
See the man page for perror for more information and for the include files you need to add to your program.
You will need to use some synchronization in your solution. You may use any of the pthread synchronization primitives: semaphores, condition variables, and/or barriers. Remember that before using any of these, you need to initialize them, and that all threads need to have access to the shared synchronization variables (either pass a reference to them to the main thread function or declare them as static global variables).
Think about what shared state is updated by more than one thread. Think about guarantees you need for correctness and how concurrent thread execution may or may not meet these guarantees.
Look at the lecture notes and the week12 Weekly lab synch.c example for examples of using some pthread synchronization primitives. Here are some code snippets for using the barrier synchronization primitive:
```
// declare a global pthread_barrier_t var
static pthread_barrier_t barrier;

// initialize it somewhere before using it:
if(pthread_barrier_init(&barrier, NULL, num_threads)){
   perror("Error: with pthread barrier init\n");
   exit(1); 
}

// threads can then call pthread_barrier_wait to synchronize: 
ret = pthread_barrier_wait(&barrier);
if(ret != 0 && ret != PTHREAD_BARRIER_SERIAL_THREAD) {
  perror("Error: can't wait on pthread barrier\n");
  exit(1);
}
```
With the exception of the pthread synchroniztion primitives and the live cell count, you should not use any other global variables in your program.
Any values that threads need should be passed to them via the (void *arg) parameter. Look at the parallel max code from class for an example of how to do this.
If you want to pass a thread more than a single parameter, you need to declare a struct type with fields for each value you want to pass. Look at the synch.c code from week 12 Wednesday lab for an example of how to do this. And here is a reference about C structs.
Debugging pthreads programs with gdb. Some basic pthread debugging advice, and some basics for using gdb to debug pthread programs.
And a more detailed guide for debugging pthread programs in gdb.
C programming references including C style guide, links to C language refs, gdb and valgrind.

Lab Questionnaire

With every lab assignment is a file named QUESTIONNAIRE for you to fill out and submit with your lab solution. In this file you will answer some questions about the lab assignment. You should fill this out and submit it with your lab solution.

Submit

Before the Due Date

Only one of you or your partner needs to push your solution from your local repo to the GitHub remote repo. (It doesn't hurt if you both push, but the last pushed version before the due date is the one we will grade, so be careful that you are pushing the version you want to submit for grading.)

From one of your local repos (in your ~you/cs31/labs/Lab10-partner1-partner2 subdirectory):

git push

Troubleshooting

If git push fails, then there are likely local changes you haven't committed. Commit those first, then try pushing again:

git add gol.c
git add Makefile
git add QUESTIONNAIRE
git add *.txt   # add any test files you created too
git commit
git push

Another likely source of a failed push is that your partner pushed, and you have not pulled their changes. Do a git pull. Compile and test that your code still works. Then you can add, commit, and push.

If that doesn't work, take a look at the "Troubleshooting" section of the Using git page. You may need to pull and merge some changes from master into your local. If so, this indicates that your partner pushed changes that you have not yet merged into your local. Anytime you pull into your local, you need to check that the result is that your code still compiles and runs before submitting.

CS31 Lab 10: Pthreads Game of Life