Due Dates
- Code and RESULTS: due before 11:59pm, Friday Oct. 27; push to your git repo. (I recommend that you finish early and submit before the due date so you can focus more on your project.)
- Complexity Questions: hand in a printout at the start of class, Monday Oct. 30, and push the .tex to your git repo.
Lab4 Partners
This lab will be done with your Lab 4 partner.
See the documentation on the course webpage about working with partners. It is a brief guide to effective practices and my expectations for how CS students should work together on labs.
Overview
In this lab you will implement a parallel sorting algorithm in MPI and run and test it on our system. In a second part you will do some large runs on Swarthmore's Strelka cluster and on an ACCESS computer.
In this part of the lab you will implement the MPI parallel sorting algorithm and then run some experiments on our system. You should also set up your Strelka and ACCESS accounts this week.
After break, you will try out some large runs, first on Strelka and then on an ACCESS machine, as a short add-on to this lab assignment (assigned after break).
To get started with MPI programming, I suggest reading Running OpenMPI on our system and trying out some examples before you start on this lab:
cp -r ~newhall/public/openMPI_examples .
Lab Goals
- Learn MPI programming.
- Run an MPI application on our system using OpenMPI.
- Implement a parallel sorting algorithm.
- Answer questions about the complexity of your parallel implementation.
- Learn how to use an ACCESS resource.
- Test your MPI implementation on our system, on Strelka, and on a large ACCESS system.
Starting Point Code
- Clone your Lab 4 repo from the CS87 git org:
cd ~/cs87/Labs
git clone [your_Lab_ssh_URL]
If all was successful, you should see the following files when you run ls:
Makefile RESULTS hostfile oddeven.sb questions.tex README.md check_up.sh* hostfilebig oddevensort.c run.sh*
If this didn't work, or for more detailed instructions on git, see the Git Help Page.
Starting Point Files
With the starting point are several files, many of which have some starting point code written for you. These files include:
- Makefile: builds an OpenMPI parallel version of oddevensort
- README.md: see notes about #defines and sizes
- questions.tex: your answers to the complexity questions
- oddevensort.c: the starting point file for your implementation
- hostfile*: example hostfiles for running on our system
- check_up.sh: a script to check if all the machines in a hostfile are up
- run.sh: an example run script for running on our system
- RESULTS: for your timing results on our system
Details
For this lab you will implement parallel Odd-Even Sort in MPI. Odd-Even Sort is similar to Bubble Sort in that it compares and swaps adjacent elements, but it performs the compares and swaps in an order that is more easily parallelized than Bubble Sort's.
Sequential Algorithm
The sequential Odd-Even Sort algorithm works as follows:
- Swapping adjacent elements is done in N rounds.
- For each round, r, pairs of elements are compared and swapped if they are not in sorted order. The pairs compared and swapped depend on whether it is an odd or an even round:
  - If r % 2 == 0, compare even/odd pairs of elements.
  - If r % 2 == 1, compare odd/even pairs of elements.
- Here is an example sorting N=6 values: 3 8 9 5 6 2
round
0: 3 8 9 5 6 2 (even/odd)
^--^ ^--^ ^--^
1: 3 8 5 9 2 6 (odd/even)
^--^ ^--^
2: 3 5 8 2 9 6 (even/odd)
^--^ ^--^ ^--^
3: 3 5 2 8 6 9 (odd/even)
^--^ ^--^
4: 3 2 5 6 8 9 (even/odd)
^--^ ^--^ ^--^
5: 2 3 5 6 8 9 (odd/even)
^--^ ^--^
Note that the end elements are not involved in compares and swaps in every round.
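To make the round structure concrete, here is a minimal sketch of the sequential algorithm in C (not part of the starting point code; odd_even_sort is a hypothetical name):

/* Minimal sketch of sequential odd-even sort: sorts vals[0..n-1] in
 * n rounds. Even rounds compare even/odd pairs starting at index 0;
 * odd rounds compare odd/even pairs starting at index 1. */
void odd_even_sort(int *vals, int n) {
    for (int r = 0; r < n; r++) {
        int start = (r % 2 == 0) ? 0 : 1;
        for (int i = start; i + 1 < n; i += 2) {
            if (vals[i] > vals[i + 1]) {   /* swap if out of order */
                int tmp = vals[i];
                vals[i] = vals[i + 1];
                vals[i + 1] = tmp;
            }
        }
    }
}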
Parallel Algorithm
In the parallel version of Odd-Even sort, the assumption is that P (the number of processes) is much smaller than N. The parallelization is in terms of P.
Assume Pi is the ith process, where i = 0 to P-1 is the rank of each process participating in the parallelization.
- Each Pi is allocated a contiguous portion of the list.
- Some number of sorting rounds occur in which Pi exchanges its sorted list of values with one of its neighbors. At each round, r, process Pi does the following (see the sketch after this list for one way to compute the exchange partner):
  - Pi sorts its portion of the list locally, using any sorting algorithm.
  - If round r is such that r % 2 == 0, an even/odd exchange happens:
    - If Pi is even, it sends its sorted portion to Pi+1. Pi keeps the smallest half of the items for the next round, Pi+1 the largest half.
    - If Pi is odd, it sends its sorted portion to Pi-1. Pi keeps the largest half of the items for the next round, Pi-1 the smallest half.
  - If round r is such that r % 2 == 1, an odd/even exchange happens:
    - If Pi is even, it sends its sorted portion to Pi-1. Pi keeps the largest half of the items for the next round, Pi-1 the smallest half.
    - If Pi is odd, it sends its sorted portion to Pi+1. Pi keeps the smallest half of the items for the next round, Pi+1 the largest half.
- Note that the end-most Pis may not participate in every round.
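The even/odd and odd/even rules above reduce to a small partner computation. Here is a minimal sketch (a hypothetical helper, not in the starting point code):

/* Returns the rank of the process to exchange with in round r, or -1
 * if this process sits the round out (an end-most Pi). rank is this
 * process's rank; p is the total number of processes. */
int round_partner(int rank, int p, int r) {
    int partner;
    if (r % 2 == 0) {   /* even/odd exchange */
        partner = (rank % 2 == 0) ? rank + 1 : rank - 1;
    } else {            /* odd/even exchange */
        partner = (rank % 2 == 0) ? rank - 1 : rank + 1;
    }
    return (partner < 0 || partner >= p) ? -1 : partner;
}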
Implementing in MPI
You will implement parallel Odd-Even Sort in MPI. Your implementation will ignore two issues often associated with sorting:
- How to distribute the N elements over the P processes.
- How to merge the P sorted parts back into a single sorted array.
In your implementation:
- Each process will generate a set of size random int values as its portion of the N items to sort.
- Processes will compute all rounds of Odd-Even sort so that at the end, process 0 has the smallest size elements in sorted order, process 1 the next smallest size elements in sorted order, ..., and process P-1 has the largest size elements in sorted order. The sorted portions DO NOT need to be assembled into a single sorted array on one of the nodes. In other words, your MPI program can exit with the sorted list distributed in size chunks over the P processes, such that P0 has the smallest size values in sorted order, P1 the next smallest size values, and so on.
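One way to structure an exchange round without deadlock is MPI_Sendrecv, which pairs the send and receive in a single call. The following is only a sketch under stated assumptions: round_partner is the hypothetical helper sketched earlier, and keep_half is a hypothetical function that merges the two sorted arrays of length size and keeps either the smallest or largest size values in mine.

#include <mpi.h>

int round_partner(int rank, int p, int r);   /* sketched earlier */
void keep_half(int *mine, int *theirs, int size,
               int keep_smallest);           /* hypothetical merge helper */

/* Sketch of one exchange round: mine holds this process's sorted
 * portion; theirs is a scratch buffer for the partner's portion. */
void exchange_round(int *mine, int *theirs, int size,
                    int rank, int p, int r) {
    int partner = round_partner(rank, p, r);
    if (partner == -1) {
        return;   /* end-most process sits this round out */
    }
    MPI_Sendrecv(mine, size, MPI_INT, partner, 0,
                 theirs, size, MPI_INT, partner, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* the lower-ranked process of the pair keeps the smallest size
     * values of the 2*size merged values; the higher-ranked process
     * keeps the largest */
    keep_half(mine, theirs, size, rank < partner);
}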
Sizes and Running
The total number N of data elements sorted depends on both the size of the portion each process allocates and the number of processes. For example, in a run with a size of 1024 and 4 processes, N is 4096; in a run with a size of 1024 and 8 processes, N is 8192.
In the starting point oddevensort.c file is a definition for SIZE that you should use as the default size value for each process' portion of N to allocate and initialize. When you run your MPI program you will specify the number of processes to spawn (e.g., -np 8 spawns 8 MPI processes for a run). You can also run ./oddevensort with an optional command line argument that specifies a different value for size:
# sort 8*SIZE values such that each of the 8 processes gets SIZE elements
mpirun -np 8 --hostfile hostfile ./oddevensort
# run with optional command line argument:
# sort 8*100 (800) values such that each of the 8 processes gets 100 elements
mpirun -np 8 --hostfile hostfile ./oddevensort 100
When you run on our system, you may see the following warning:
A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: <some host name>
Another transport will be used instead, although this may result in
lower performance.
NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
You can just ignore this warning. MPI is looking for an InfiniBand network interface to use; we don't have InfiniBand, so it is telling us it is using something else (the Ethernet that we do have).
You can also set an MCA variable to remove the warning. See the Running OpenMPI on our system page for instructions on how to do this.
If mpirun gives this error:
There are not enough slots available in the system to satisfy the 512 slots
that were requested by the application:
./oddevensort
Either request fewer slots for your application, or make more slots available
for use.
This means that you are using a hostfile that does not have enough total slots for the number of processes you are trying to spawn (-np). The fix is to use a hostfile with more hosts, and to specify the number of slots per host in the hostfile too. With the starting point code is a large hostfile you can use. You can also create your own hostfile with machines on our system, using either autoMPIgen or by copying and pasting machine names from the lists of lab machines in files on our system. See Tips and Handy Resources for more information about both of these. A small example of the hostfile format with slots follows.
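For reference, a hostfile with slots looks like the following (hypothetical machine names; 4 hosts with 8 slots each gives 32 total slots):

carrot slots=8
turnip slots=8
radish slots=8
parsnip slots=8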
Command line arguments
You will add two command line arguments to oddevensort, both optional (meaning one can run oddevensort with or without these command line arguments):
- The first specifies the size of each Pi's portion of the set of values to sort. See Sizes and Running for more information about how to use this value if it is given on the command line.
- The second specifies the number of times to repeat a full odd-even sort in this run.
You do not need to use getopt for processing command line options. Instead, if one argument is listed, then it is the size value (in argv[1]); if two are listed, then the first is the size value (in argv[1]) and the second is the number of repetitions of odd-even sort to execute (in argv[2]).
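A minimal sketch of this argument handling (it assumes the SIZE #define from the starting point; iters is a hypothetical variable name):

#include <stdlib.h>   /* for atoi */

int size = SIZE;      /* default portion size from the starting point */
int iters = 1;        /* default: run the full sort once */
if (argc > 1) {
    size = atoi(argv[1]);    /* optional first arg: portion size */
}
if (argc > 2) {
    iters = atoi(argv[2]);   /* optional second arg: repetitions */
}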
The Second Command Line Argument
NOTE: add this 2nd command line option later, after you have the rest of oddevensort, with the first command line option, debugged, tested, and working.
In addition to the optional first command line argument for specifying size, oddevensort should take an optional second command line argument that specifies the number of times to repeat a full odd-even sort (this will be useful for really long runs on larger machines).
# run oddevensort with 8 processes, each getting 100 elements
# (sorts 800 total values using 8 processes)
mpirun -np 8 --hostfile hostfile ./oddevensort 100
# run oddevensort with 8 processes, each getting 100 elements
# repeat the full oddeven sort (allocation, initialization, and sort) 3 times
mpirun -np 8 --hostfile hostfile ./oddevensort 100 3
See Sizes and Running for information about the mpirun details.
The 2nd command line argument may be useful for getting some longer runs (in addition to large numbers of processes and large size values). If the number of iterations is greater than 1, then the parts that should be repeated are:
- each Pi's initialization of its portion of the array to random values
- performing odd-even sort on this new set of values
Multiple iterations should not re-spawn MPI processes or have each Pi reallocate the memory for its array of size values or any other space it needs to sort. Instead, just iterate over the number of times a full odd-even sort of N values is executed, as in the sketch below.
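A minimal sketch of that structure (init_portion and oddeven_sort are hypothetical names for functions you will write):

/* allocate once, outside the iteration loop */
int *vals = malloc(size * sizeof(int));   /* needs stdlib.h */
if (vals == NULL) {
    /* handle the error (print it and call MPI_Abort, for example) */
}
for (int i = 0; i < iters; i++) {
    init_portion(vals, size);            /* new random values each time */
    oddeven_sort(vals, size, rank, p);   /* one full odd-even sort of N */
}
free(vals);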
Debugging
You can call printf in MPI applications; each MPI process' output is sent to be displayed in the terminal from which you ran mpirun. As a result, printf calls cause more message passing in the system, so you should disable printing when not debugging. Additionally, processes may want to prefix their output with their MPI rank, as messages can arrive to be displayed in any order.
In the starting point oddevensort.c file are macro definitions for debug printf output. They are named PRINTX, where X is the number of arguments to the printf call:
// comment DEBUG to disable PRINTX output
#define DEBUG
#ifdef DEBUG
#define PRINT0(s) printf(s)
#define PRINT1(s,a) printf(s,a)
#define PRINT2(s,a,b) printf(s,a,b)
#else
#define PRINT0(s)
#define PRINT1(s,a)
#define PRINT2(s,a,b)
#endif
In your program code, make calls to the PRINTX macros instead of to the printf function. For example:
PRINT0("got here\n");
PRINT1("x: %d\n", x);
PRINT1("my name is %s\n", name);
PRINT2("x: %d, error: %g\n", x, err_val);
Depending on whether the DEBUG constant is defined or commented out, the PRINTX macros are either defined to be a call to a printf function or defined to be nothing (i.e., they don't print anything). To enable or disable printing, uncomment or comment DEBUG and recompile. This is one way to easily enable or disable debug printing in a program. You can add more PRINTX macro definitions following these.
As you first develop your solution, uncomment the DEBUG definition in the starting point file, and try running with small SIZE values (you can change this definition) and small numbers of processes to test correctness. With DEBUG defined, each MPI process will print out its portion of the array before performing odd-even sort and then after, where you should see that process 0 has the smallest SIZE values in sorted order, process 1 the next smallest, and so on.
Try different SIZE values and numbers of processes to make sure your sort is correct (correct number of iterations, correct Pis exchanging at each step).
You can also try odd-even sorting in the other direction after doing it one way first (i.e., try a descending sort after you do the ascending sort). Sorting an array into the opposite order is often a good way to catch a missed edge case. To do this, you may want to functionize all comparisons so that it is easy to change your code to sort in either direction, as in the sketch below.
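For example, a minimal sketch of a functionized comparison (a hypothetical helper):

/* Returns nonzero if a and b are out of order for the chosen
 * direction; flipping ascending reverses the sort direction. */
int out_of_order(int a, int b, int ascending) {
    return ascending ? (a > b) : (a < b);
}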
Also, try some large size runs (with DEBUG commented out) to make sure you do not have any deadlock in your solution.
Sample Output
Here is output from my program. It shows a few runs with different numbers of processes and sizes, run in debug mode, where each process prints out its portion of the unsorted array before sorting and its portion of the sorted result after.
It is good to add this debug printing functionality to your program to help you verify the correctness of your implementation.
Requirements
Code
In addition to a correct implementation, your solution should be well designed, well commented, with good error detection and handling, and free of memory access errors. Use constants, meaningful variable names, and good modular design. See my C Style guide and other C programming documentation.
Also, do not use global variables in your solution.
Questions
In the LaTeX source file questions.tex are some questions about the Odd-Even Sort algorithm. Edit this LaTeX file with your answers. There is a commented-out example of how to list verbatim code sequences if you want to do that. However, strive for concise answers to these questions; you should not have to write a lot to answer them.
Results
Before 11:59pm, Friday Oct. 27, you should submit results from some initial runs of your odd-even MPI sort on our system. Include some large-ish runs.
Due After Break: results from larger-sized runs on our system, on Strelka, and on ACCESS are due after break. More details will be assigned after break. For now, make sure you have your Strelka and ACCESS accounts set up.
Runs on our system
You should do some large-sized runs on our system; see the run.sh script included in the starting point for an example of how to start a bunch of runs of different sizes.
Try varying the problem size in a few ways (both large sizes of data to sort and large numbers of processes and hosts). NOTE: run.sh uses pretty small sizes, so all of its runs will complete pretty quickly. There is another script, runX.sh, that is an example of larger runs, and you can create your own using these two as references.
For larger numbers of processes, you need to use a larger hostfile with enough slots for the number of processes. You should create your own hostfiles with more CS lab machines to use for larger runs. Also create some that use slots=<num> to specify different numbers of slots to use on each machine (this can be a way to distribute processes over more machines in your hostfile, for example).
You can run the run.sh script with an optional command line argument specifying the hostfile:
./run.sh
./run.sh hostfile
./run.sh myhostfileHUGE6slots
With your lab submission you should submit a file RESULTS that lists run times for the different sizes you ran on our system. See the comment at the top of the file about what to list and how.
If you do some runs on a lot of nodes, you can make use of autoMPIgen to create a good hostfile for you, and run the check_up.sh script to remove unreachable nodes from this file.
Also, please be mindful that machines are heavily used for lab sessions, classes, and ninja sessions, and avoid running computationally intensive experiments during those times. Smaller runs should be okay.
To find good large-sized runs, first test out a large-sized run on one node and make sure that your oddevensort processes are not allocating more space than can fit into RAM (run htop to see). If they are, back off the size until the cumulative memory fits into RAM (with some to spare). You should see no swap space usage.
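As a rough estimate (assuming 4-byte ints): each process needs at least size * sizeof(int) bytes for its own portion, plus a receive buffer of the same size for exchanges, so a size of 100 million values is roughly 800 MB per process. Multiply that by the number of processes you run on a node to estimate total RAM use.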
More details about running some even larger runs on Strelka and ACCESS will be discussed and assigned after break.
For the Lab 4 due date, do the following:
- Try some timed runs of different sizes on our system. These should include some large-sized runs, but not huge ones. You will do some really large runs after break on Anvil and Strelka.
- Make sure to set up your ACCESS and Strelka accounts this week.
Tips and Handy Resources
MPI and running
- Running OpenMPI on our system. This includes information about setting up ssh keys and ssh-agent so that MPI can ssh into hosts without you having to give your password, and also information about my simple MPI example programs, which you can copy over and try out on our system:
cp -r ~newhall/public/openMPI_examples .
- Remember that MPI sends and receives have associated fixed-size buffers. On a regular send, the sending process typically does not block on the send call after the data are copied to the buffer (you can configure different send semantics, but this is the default). Sometimes, however, a process may block on send if the buffer is full and the receiving process has not yet received the buffered data.
- check_up.sh: a script, included with the starting point code, to check if the machines in a hostfile are up before you use it to run a bunch of experiments. To run it:
./check_up.sh hostfile
- hf_2_hfslots: a script, included with the starting point code, that takes a hostfile with just machine names and a number, and creates a new hostfile with slots=number on every machine. To run:
# create new hostfile named hostfile6 from hostfile with slots=6
./hf_2_hfslots hostfile 6 > hostfile6
# create new hostfile named hostfile4 from hostfile with slots=4
./hf_2_hfslots hostfile 4 > hostfile4
- runX.sh: example run scripts, included with the starting point code, for running a set of runs of different sizes and different numbers of processes:
./run.sh
- MPI data types. When you send and receive values you need to specify their type with an MPI type:
long int i;
MPI_Send(&i, 1, MPI_LONG_INT, target, tag, comm);
Here are a few examples of C types and their corresponding MPI types:
char                MPI_CHAR or MPI_SIGNED_CHAR
unsigned char       MPI_BYTE or MPI_UNSIGNED_CHAR
int                 MPI_INT
unsigned int        MPI_UNSIGNED
float               MPI_FLOAT
double              MPI_DOUBLE
long int            MPI_LONG
unsigned long int   MPI_UNSIGNED_LONG
long long int       MPI_LONG_LONG_INT
If you define structs that are passed in MPI messages you need to define MPI data types for these to ensure that byte-ordering is maintained for messages passed between nodes with different Endian-ness. For this lab, you should not use structs.
- autoMPIgen (and smarterSSH): tools that are useful for finding less-loaded machines on our system. autoMPIgen generates an MPI hostfile for you, picking good machines to run on based on command line criteria. smarterSSH picks a good machine to ssh into on our system, and can also be used to list load information about lab machines. If you use autoMPIgen to generate a hostfile, keep in mind that things change, and you don't want to just keep using the same hostfile over and over. Instead, run it periodically to regenerate a good set of hosts. Here are some examples of how to run smarterSSH to get statistics:
smarterSSH -h
# run smarterSSH in -i mode to list machine statistics
smarterSSH -i -v -n 50
smarterSSH -i -v -n 50 -c
smarterSSH -i -v -n 50 -m
# ssh's you into a good machine
smarterSSH
Here are some examples of how to run autoMPIgen to generate MPI hostfiles in different ways:
autoMPIgen -h
# generate a host file (hostfile) of 16 hosts selected
# based on a function of available cpu and memory
autoMPIgen -n 16 hostfile
# generate a host file of 16 hosts that includes the host's cpu count (as slot=)
autoMPIgen -n 16 -q hostfile
# generate a host file of 16 hosts that includes slot=4 with each entry
autoMPIgen -n 16 -s 4 hostfile
# generate a host file, choosing 8 hosts based on memory usage
autoMPIgen -n 8 -m hostfile
# generate a host file, choosing 32 hosts based on cpu usage
autoMPIgen -n 32 -c hostfile
# generate a host file, randomly choosing 64 hosts
autoMPIgen -n 64 -r hostfile
- You can create your own hostfiles from the lists of machines in our labs. Just open one or more of these files in vim (or another editor) and cut and paste machine names into a hostfile. The lists of CS lab machines are available here:
/usr/swat/db/hosts.256
/usr/swat/db/hosts.bookstore   # machines in Tarble lab
/usr/swat/db/hosts.mainlab     # machines in 240
/usr/swat/db/hosts.overflow
- You can add in some of the machines that are just for our class, too.
- man mpirun: the mpirun man page has information about the different command line options to mpirun. The --map-by option allows for specifying how processes are spawned on the hosts in the hostfile. For example, it can be used to spawn more processes per node than there are slots per node. For this lab assignment you don't want to do this, but for other MPI applications you might.
- Running Experiments info from Lab 1.
- Using ACCESS: my directions and examples for running MPI programs on Anvil using your ACCESS account.
- Lab Machine specs page: contains information about most of the lab machines, including number of cores.
- Chapt. 15.2 on MPI from Dive into Systems.
C programming
- srand, rand: C random number functions:
// seed the random number generator using the current time
// (only seed one time)
srand(time(NULL));
// then call rand to get the next value in the random sequence
// (returns a value from 0 to RAND_MAX; use % to get a smaller range)
next_val = rand() % MAXVAL;
- Chapter 2 of Dive into Systems covers C programming. Here are some other C programming references.
- valgrind and gdb guides. Also see Chapter 3 of Dive into Systems on debugging C programs.
misc help pages
- my help pages: include links to Unix, C, pthreads, and networking resources
- Dive into Systems (Chapt. 2, 3, 15.2, and Appendix 2 on Using Unix)
- man pages for the C library, pthreads library, and TCP/IP socket functions
- tools for examining system and runtime state: top and htop to see what is executing on the client and server side
- some useful tools: vim, ssh, scp, tmux, finding idle machines, latex, git, ...
Submitting
Repo
Before the due date, you or your partner should push your solution from one of your local repos to the GitHub remote repo. Be sure to do a make clean before you git add anything:
git add questions.tex
git add oddevensort.c
git add RESULTS
git add hostfile* # add any hostfiles and run scripts you wrote
git add *.sh
git commit
git push
See the git help page "Troubleshooting" section for git help.
Post Lab Eval
After submission and demo, fill out a short evaluation form for this lab assignment. A link will be sent to you in an email.