Due Dates:

  • Checkpoint Due 11:59PM, Tuesday, April 28th 11.59PM EDT

  • Full Lab Due 11:59 PM, Friday, May 8th 11.59PM EDT (No Extensions beyond the full lab due date.)

1. Lab 9 Goals

  • Parallelize a shared-memory application using the pthreads library.

  • Protect threaded code with synchronization primitives.

  • Revisit 2D arrays and file I/O in C.

  • Gain more exposure to pointers and dynamic memory allocation.

  • Practice debugging threaded programs.

2. Handy References

3. Lab Description

This lab is designed to give you some practice writing and debugging multi-threaded pthreads programs, using synchronization primitives, and experience designing and running scalability experiments.

You’ll implement a parallel version of Conway’s Game of Life using your sequential solution as a starting point for this lab.

  • See the Lab 6 assignment to remind yourself about the sequential version of the program and the rules of the game.

  • You will evaluate the scalability of your implementation as you increase the problem size and the number of threads.

  • Your Lab 9 solution will use the same method as Lab 6 for initializing a dynamically allocated board from an input file given as a command line argument, and will have the same output mode options.

  • You should probably begin Lab 9 by first fixing any errors in your Lab 6 sequential solution.

Don’t worry about fixing the torus parts (the edges cells) if you didn’t have that working in Lab 6. As long as your Lab 6 gol implementation for all the middle (non-edge) cells works, that is enough to move on to starting to parallelize your gol implementation. And then changing it to get ready for parallelization.

3.1. Getting Your Lab 9 Repo

Both you and your partner should clone your Lab repo into your cs31/labs subdirectory:

  1. get your Lab ssh-URL from the CS31 git org. The repository to clone is named Lab9-user1-user2, where user1 and user2 are the user names of you and your Lab partner.

  2. cd into your cs31/labs subdirectory:

    $ cd ~/cs31/labs
    $ pwd
  3. clone your repo

    $ git clone [your Lab9-user1-user2 url]
    $ cd Lab9-user1-user2
    $ ls

4. Lab Overview

Your parallel version of the game of life should be structured as follows:

  1. Main Thread Functionality: The main thread should begin by initializing the game board by reading the input file in the same way as the sequential version. The input files format has not changed.

  2. Worker Threads: After initializing the game board, the main thread spawns off worker threads that will simulate multiple rounds of the game, with each thread computing just its portion of cells of the new board each round.

    1. Grid cells are allocated to threads using either a row-wise or column-wise partitioning specified by a command line argument.

    2. You may assume that you will not be asked to create more threads than there are rows/columns (whichever you’ve been asked to partition by). In other words, you don’t need to worry about a case in which there are threads that have no rows/columns assigned to them.

  3. Synchronization: Your program should use a global variable that will be shared among all threads to count the number of live cells that are processed during each round.

    1. That is, each thread must keep a count of the number of live cells it encountered during a round and use that value to update the global count when it has completed its execution of the round.

      You should NOT count the live cells in the print function like your serial version did! The rationale for this requirement is that it creates a critical section when the threads attempt to write the global values, so you’ll need to use synchronization to protect those writes.
  4. Code Performance: At the end of the simulation’s execution, the main thread will show the elapsed time. You will not be graded on absolute performance, but your code should run faster when run in parallel. As a sanity check, try running a large board instance (e.g., 1500x1500) for many rounds (e.g., 100+). You should see a substantial improvement between running with one thread and a handful of threads (2-4). It’s that sort of relative improvement that we’ll be looking for.

    For accurate timing, a run with no printing arguments enabled should not include any calls to usleep in the run.
  5. Your simulation should have no valgrind errors.

5. Weekly Deliverables

  • Since most of your GOL functionality is already implemented, this lab is focussed on parallelizing your sequential GOL implementation. Our main goals are to modify your sequential are: to partition the work to multiple threads, and add synchronization primitives to ensure there are no race conditions.

  • Week 1 Checkpoint: For the first week, we will focus on figuring out creating and reaping threads. And how to partition the board amongst the worker threads.

  • Week 2 Full Implementation For the second week, we will work on adding synchronization primitives.

6. Week 1: Getting Started

6.1. Modify the Lab8/gol.c provided in your starter code

The starting point code includes a gol.c file that has the required libraries and synchronization primitives initialized.

  1. You and your partner should choose one of your Lab 6 solutions as a starting point for this lab, and judiciously copy over only the functionality of your previous Lab6/gol.c.

    DO NOT overwrite main or change the function prototypes provided in the 'Lab8' folder. Carefully walk through the provided code in Lab8/gol.c and only copy those parts of your previous implementation that are necessary for correct functioning.
  2. After copying over the parts of your previous implementation, compile and run to make sure it works.

  3. Then fix any any parts of your gol.c that will make parallelizing it difficult, and that are memory inefficient. We will look at your solution in lab on Wednesday and help you figure out what you need to do.

6.2. Creating and Reaping threads

In some ways, creating threads resembles the way you created processes.

  • For example, with processes, we used a pid_t to represent the process ID of the child process. Here, we’ll use a pthread_t to represent a new thread.

  • The example code allocates an array of pthread_t types, one for every thread. The thread library will fill in the details of the pthread_t when you call pthread_create.

  • Thus, when you’re creating thread N, give pthread_create the address of the Nth element of the pthread_t array, and it will fill in the thread’s information so that you can reference that thread again later (e.g., join).

  • Also like processes, you may want to wait until a thread has terminated (and check it’s status) before moving forward. For processes, we used waitpid(). For threads, we will use pthread_join(), which takes the pthread_t of the thread you want to wait on. The pthread_join() function will also give you the thread’s return value, if you want it (we don’t care for this exercise).

  • Unlike fork(), with threads we don’t create a clone of the current process. Instead, we give pthread_create the name of the function that we want it to execute in a separate thread. The function you tell it to execute must be of the form:

    void *function(void *argument)

    That is, the function must return a pointer (NULL is fine, if you don’t care about returning anything), and it must accept exactly one pointer as an argument (NULL also fine here, if you don’t want to pass anything in). The type of both pointers is the generic void *`, since the thread library has no idea what you’re planning to do in the thread. This means we’ll have to do some casting when we pass arguments.

6.2.1. hello.c with threads

Before we dive into parallelizing our gol lab, let’s talk about more about threads!

Take a look at hello.c in your github folder and read through what this code does. Let’s try running it a few times with different numbers of threads.

It show an example of a simple pthreads program that:

  1. spawns some worker threads by calling pthread_create

  2. passes parameter values (thread logical identifier values) to each thread’s main function via its void *args parameter.

    This shows an example of how the thread main loop function can get is argument value(s) from it parameter

  3. calls to pthread_join to wait for all spawned threads to exit.

6.2.2. find_local_max.c with threads

For this exercise, we’ll write a C program that breaks up a (potentially very large) array into equal portions and finds the local maximum data value for each subset of the data.

Since we are learning about threads, we will divide up our data set into partitions and assign one thread to each partition. The steps we go through here will be similar to what you will need to do when parallelizing the game of life.

You can invoke the program with two arguments, the file name to use when reading data and the number of threads:

# Use the "data_small.txt" input file and divide it among four threads
$ ./find_local_max data_small.txt 4

The code to read in the input file, parse_file(), is provided for you. It returns an array of integers and sets the num_values variable so that you know how many data points you have. The provided code also sets num_threads to the value of the second parameter.

6.3. Parallelize GOL: Command line arguments

Now that we have some experience with using threads, let’s get started with parallelizing our game of life.

Your program will take the same two command line arguments from lab 6, plus three new command line arguments:

  1. Num Threads: An integer to specify the number of threads (the degree of parallelism).

  2. Row/Col Partition: A 0/1 flag to specify how to parallelize the GOL program (0: row-wise grid cell allocation, 1: column-wise grid cell allocation). More details below.

  3. Print Thread Allocation: A 0/1 flag to specify if the per-thread board allocation should be printed out or not at start up (0:don’t print configuration information, 1: print allocation information).

    1. The print per-thread board allocation command line option will help you debug the grid cell allocation scheme to ensure that you are correctly allocating rows or columns across the worker threads.

  4. Badly Formatted Input: Your program should handle badly formed command lines and check for bad values entered (e.g., a negative number of threads). It should print out an error message and exit for badly formed command lines and handle most bad input values similarly.

Example usage

$ ./gol
usage: ./gol infile.txt output_mode[0,1] num_threads partition[0,1] print_partition[0,1]

Here are some example command lines:

# run with config values read from file1.txt, do not print the board after each round
# create 8 threads, use row-wise partitioning, print per-thread partitioning details:
./gol file1.txt  0  8  0  1

# run with config file file2.txt, print the board after each round, create
# 15 threads, use column-wise partitioning, don't print per-thread partitioning:
./gol file2.txt  1  15  1  0

6.4. Partitioning

You will implement two different ways of partitioning the computation: one partitions the board’s rows across threads, the other partitions the board’s columns across threads. As an example, suppose that the board is 8x8, and that there are 4 threads. Here is how the threads (with Thread IDs or tids 0-3) would be assigned to the grid using row and column partitioning, respectively:

row partitioning                        column partitioning
----------------                        ----------------
0 0 0 0 0 0 0 0                         0 0 1 1 2 2 3 3
0 0 0 0 0 0 0 0                         0 0 1 1 2 2 3 3
1 1 1 1 1 1 1 1                         0 0 1 1 2 2 3 3
1 1 1 1 1 1 1 1                         0 0 1 1 2 2 3 3
2 2 2 2 2 2 2 2                         0 0 1 1 2 2 3 3
2 2 2 2 2 2 2 2                         0 0 1 1 2 2 3 3
3 3 3 3 3 3 3 3                         0 0 1 1 2 2 3 3
3 3 3 3 3 3 3 3                         0 0 1 1 2 2 3 3

When the number of threads does not evenly divide the dimension by which you are partitioning, divide the rows (or columns) up so that there is at most a difference of 1 assigned row (or column) between any two threads. For example, for an 8x7 board and 3 threads, you would partition rows and columns like this among the 3 threads (with tids 0-2):

row partitioning                        column partitioning
----------------                        ----------------
0 0 0 0 0 0 0                           0 0 0 1 1 2 2
0 0 0 0 0 0 0                           0 0 0 1 1 2 2
0 0 0 0 0 0 0                           0 0 0 1 1 2 2
1 1 1 1 1 1 1                           0 0 0 1 1 2 2
1 1 1 1 1 1 1                           0 0 0 1 1 2 2
1 1 1 1 1 1 1                           0 0 0 1 1 2 2
2 2 2 2 2 2 2                           0 0 0 1 1 2 2
2 2 2 2 2 2 2                           0 0 0 1 1 2 2

In this partitioning scheme, a single row (or column) is assigned to exactly one thread; a single row (or column) is never split between two or more threads. Thus, in the above example threads 0 and 1 have one more row each of work to do than thread 2 in the row-wise partitioning, and thread 0 has one more column of work to do than threads 1 and 2 in the column-wise partitioning.

Note how the imbalance of rows/columns is assigned to threads in this example. Your solution should assign them in the same way. Every thread should either be assigned the same number of rows or columns or differ by at most one.

6.5. Printing Partitioning Information

When the 5th command line option is non-zero, your program should print thread partitioning information. Each thread should print:

  1. Its thread id (the logical id number that you assign, not the pthread_self value).

  2. Its start and end row index values.

  3. The total number of rows it is allocated.

  4. Its start and end column index values.

  5. The total number of columns it is allocated.

The threads will run in different orders, so you may see each thread’s output in any order, which is fine. If you’d like, you can try to avoid some interleaving of individual threads output by calling fflush after printf:

printf("tid %d my values ...\n", mytid, ...);
fflush(stdout);   // force the printf output to be printed to the terminal

Here are some example runs with different numbers of threads partitioning a 100x100 grid. The total number of rows and columns per thread are shown in parentheses (your output does not need to be identical to this, but it must include all the required parts):

# 9 threads, column-wise partitioning
./gol big 0 9 1 1
tid  0: rows:  0:99 (100) cols:  0:11 (12)
tid  1: rows:  0:99 (100) cols: 12:22 (11)
tid  2: rows:  0:99 (100) cols: 23:33 (11)
tid  4: rows:  0:99 (100) cols: 45:55 (11)
tid  3: rows:  0:99 (100) cols: 34:44 (11)
tid  5: rows:  0:99 (100) cols: 56:66 (11)
tid  6: rows:  0:99 (100) cols: 67:77 (11)
tid  7: rows:  0:99 (100) cols: 78:88 (11)
tid  8: rows:  0:99 (100) cols: 89:99 (11)

# 6 threads, row-wise partitioning
./gol big 0 6 0 1
tid  0: rows:  0:16 (17) cols:  0:99 (100)
tid  1: rows: 17:33 (17) cols:  0:99 (100)
tid  3: rows: 51:67 (17) cols:  0:99 (100)
tid  2: rows: 34:50 (17) cols:  0:99 (100)
tid  4: rows: 68:83 (16) cols:  0:99 (100)
tid  5: rows: 84:99 (16) cols:  0:99 (100)

# 17 threads, row-wise partitioning
./gol big 0 17 0 1
tid  0: rows:  0: 5 (6) cols:  0:99 (100)
tid  1: rows:  6:11 (6) cols:  0:99 (100)
tid  3: rows: 18:23 (6) cols:  0:99 (100)
tid  2: rows: 12:17 (6) cols:  0:99 (100)
tid  4: rows: 24:29 (6) cols:  0:99 (100)
tid  5: rows: 30:35 (6) cols:  0:99 (100)
tid  6: rows: 36:41 (6) cols:  0:99 (100)
tid  7: rows: 42:47 (6) cols:  0:99 (100)
tid 10: rows: 60:65 (6) cols:  0:99 (100)
tid 11: rows: 66:71 (6) cols:  0:99 (100)
tid 12: rows: 72:77 (6) cols:  0:99 (100)
tid 13: rows: 78:83 (6) cols:  0:99 (100)
tid  8: rows: 48:53 (6) cols:  0:99 (100)
tid 15: rows: 90:94 (5) cols:  0:99 (100)
tid 14: rows: 84:89 (6) cols:  0:99 (100)
tid  9: rows: 54:59 (6) cols:  0:99 (100)
tid 16: rows: 95:99 (5) cols:  0:99 (100)

Note that for the run with 17 threads, there are enough threads to see "out of order" execution due to the exact scheduling of threads on the CPUs. That’s fine, we’re at the mercy of the OS scheduler!

6.6. Correctness

In addition to printing out per-thread board partitioning information to verify correct allocation, you should also ensure that your parallel version simulates GOL correctly. Run your parallel version on some of the same input files as your sequential one with printing enabled. The results should be identical.

6.7. Partitioning: Getting Started

  1. Without spawning any threads, write the partitioning function, i.e. figure out which section of the board would be assigned to a thread.

    1. Remember each thread will call this function to compute its partitioning info.

    2. You can, however, test the function’s implementation first without spawning threads by adding some calls in main to test different sizes and partitioning schemes (and remove these calls before moving on to the next step).

  2. Implement the main thread create-join control flow, and test thread’s row-column partitioning.

    1. Spawn the number of threads specified at the command line and have each thread to just print its partitioning information (including its tid value) in play_gol and just return.

    2. Remove this return before moving on to the next step.

  3. Add fields to gol_data struct needed for parallelization and add code to initialize these fields (some may be init’ed by the main thread before spawning and other by the thread themselves before they start playing rounds of gol).

    1. You can add some debug output to test this, and add a call to return to avoid execution the rest of the code in play_gol.

    2. Remove both before moving on to the next step.

  4. Add code so that each thread prints out its own partitioning information only when the final command line argument is nonzero.

  5. If you see odd output, either your parallelization of GOL is incorrect, or you are missing synchronization, or both.

    1. You can remove the call to system("clear")` to see the history of each round to help you debug incorrect GOL computation (of course run for a small number of iterations).

7. Week 2

7.1. Synchronization

You will need to think about where and how to synchronize thread actions to remove race-conditions.

  • For example, you need to ensure that no thread starts the next round of play until all threads are done with the current round, otherwise threads will not read consistent values for the given round.

  • If printing is enabled, you need to ensure that the printing thread prints a consistent version of the game board.

  • You will also need to ensure that threads do not race to update the global live cell counter.

  • Any added synchronization that is only necessary for a specific command line option should not be included in runs that do not specify that option.

7.1.1. Example pthread program that uses synchronization primitives

The synch.c contains some examples using pthreads mutex and barrier synchronization. It also shows an example of defining a struct type that can be used to pass several values to a threads main function via its single void *args parameter. Try running this to see understand what it is doing.

7.1.2. Modifying find_max.c

Let’s fix find_max.c program by adding in some synchronization to remove the race conditions.

8. debugging pthread programs

Debugging threaded programs can be tricky because there are multiple streams of execution. In general, try to debug with as few threads as possible, and if you use printfs, print out a thread id and call fflush after. You can also put printf’s in conditional statements to only have one of the threads print out information (or only some of the threads, or only some of the information, …​). For example, if each thread is passed a logical thread id value on start-up, and stores its value in a local variable named my_tid then you could have logical thread 1 be the debug output printing thread to do something like:

if(my_tid == 1) {
   printf("Tid:%d: value of count is now %d my i is %d\n", my_tid,count,i);
      fflush(stdout);
      }

8.1. gdb and pthreads

gdb has support for debugging multi-threaded processes. If you want to try using gdb to debug your pthread code, here is some general information about it and an example you can try out: Debugging pthreads programs with gdb. It contains an example run of debugging the racecond program in your example code.

More detailed information about gdb and pthreads can be found:

9. Tips

  • The man pages for the various pthread functions are very helpful. Additionally, Tia has some links to other pthreads documentations and tutorials

  • Use the perror() function to print out error messages from failed system and pthread library calls. perror prints out detailed, human-readable information about the error that occurred. You call it passing in a string with program-specific error message. For example:

big_buff = malloc(...);
if (big_buff == NULL) {
  perror("mallocing big buffer failed");
}

See the man page for perror for more information and for the include files you need to add to your program to use it.

  • You will need to use some synchronization in your solution. You may use any of the pthread synchronization primitives: mutexes, condition variables, and/or barriers. Remember that before using any of these, you need to initialize them and that all threads need to have access to the shared synchronization variables (either pass a reference to them at thread creation or declare them as static global variables). Here are some code snippets for using the barrier synchronization primitive:

// declare a global pthread_barrier_t
static pthread_barrier_t barrier;

// initialize it in a function (e.g., main) before using it:
if(pthread_barrier_init(&barrier, NULL, num_threads)){
   perror("pthread_barrier_init");
   exit(1);
}

// threads than can call pthread_barrier_wait to synchronize:
ret = pthread_barrier_wait(&barrier);
if(ret != 0 && ret != PTHREAD_BARRIER_SERIAL_THREAD) {
  perror("pthread_barrier_wait");
  exit(1);
}
  • With the exception of pthread synchronization primitives and cell counters, you should not use any other global variables in your program. Any values that your threads need should be passed to them via the (void *arg) parameter. Look at the parallel max program we did for an example of how to do this.

  • A guide for using gdb to debug pthread programs

10. Submitting

Please remove any debugging output prior to submitting.

To submit your code, simply commit your changes locally using git add and git commit. Then run git push while in your lab directory. Only one partner needs to run the final push, but make sure both partners have pulled and merged each others changes. See the section on Using a shared repo on the git help page.