Due Dates
- Code and RESULTS: due before 11:59pm, Friday Oct. 27; push to your git repo. (I recommend that you finish early and submit before the due date so you can focus more on your project.)
- Complexity Questions: hand in a printout at the start of class, Monday Oct. 30, and push the .tex to your git repo.
Lab4 Partners
This lab will be done with your Lab 4 partner.
See the documentation on the course webpage about working with partners. It is a brief guide to effective practices and my expectations for how CS students should work together on labs.
Overview
In this lab you will implement a parallel sorting algorithm in MPI and run and test it on our system. In a second part you will do some large runs on Swarthmore's Strelka cluster and on an ACCESS computer.
In this part of the lab you will implement the MPI parallel sorting algorithm and then run some experiments on our system. You should also set up your Strelka and ACCESS accounts this week.
After break, you will try out some large runs, first on Strelka and then on an ACCESS machine, as a short add-on to this lab assignment (assigned after break).
To get started with MPI programming, I suggest reading Running OpenMPI on our system and trying out some examples before you start on this lab:
cp -r ~newhall/public/openMPI_examples .
Lab Goals
- Learn MPI programming.
- Run an MPI application on our system using OpenMPI.
- Implement a parallel sorting algorithm.
- Answer questions about the complexity of your parallel implementation.
- Learn how to use an ACCESS resource.
- Test your MPI implementation on our system, on Strelka, and on a large ACCESS system.
Starting Point Code
- Clone your Lab 4 repo from the CS87 git org:
cd ~/cs87/Labs
git clone [your_Lab_ssh_URL]
If all was successful, you should see the following files when you run ls:
Makefile RESULTS hostfile oddeven.sb questions.tex README.md check_up.sh* hostfilebig oddevensort.c run.sh*
If this didn't work, or for more detailed instructions on git, see the Git Help Page.
Starting Point Files
With the starting point are several files, many of which have some starting point code written for you. These files include:
- Makefile: builds an OpenMPI parallel version of oddevensort
- README.md: see notes about #defines and sizes
- questions.tex: your answers to the complexity questions
- oddevensort.c: the starting point file for your implementation
- hostfile*: example hostfiles for running on our system
- check_up.sh: a script to check if all the machines in a hostfile are up
- run.sh: an example run script for running on our system
- RESULTS: for your timing results on our system
Details
For this lab you will implement parallel Odd-Even Sort in MPI. Odd-Even Sort is similar to Bubble Sort in that it compares and swaps adjacent elements, but it performs the compares and swaps in an order that is more easily parallelized than Bubble Sort's.
Sequential Algorithm
The sequential Odd-Even Sort algorithm works as follows:
- Swapping adjacent elements is done in N rounds.
- For each round, r, pairs of elements are compared and swapped if they are not in sorted order. The pairs compared and swapped depend on whether it is an odd or an even round:
  - If r % 2 == 0, compare even/odd pairs of elements.
  - If r % 2 == 1, compare odd/even pairs of elements.
- Here is an example sorting N=6 values: 3 8 9 5 6 2
round
0: 3 8 9 5 6 2 (even/odd)
^--^ ^--^ ^--^
1: 3 8 5 9 2 6 (odd/even)
^--^ ^--^
2: 3 5 8 2 9 6 (even/odd)
^--^ ^--^ ^--^
3: 3 5 2 8 6 9 (odd/even)
^--^ ^--^
4: 3 2 5 6 8 9 (even/odd)
^--^ ^--^ ^--^
5: 2 3 5 6 8 9 (odd/even)
^--^ ^--^
Note that the end elements are not involved in compares and swaps in every round.
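To make the round structure concrete, here is a minimal sketch of the sequential algorithm in C (not part of the starting point code; odd_even_sort is a hypothetical name):

/* Minimal sketch of sequential odd-even sort: sorts vals[0..n-1] in
 * n rounds. Even rounds compare even/odd pairs starting at index 0;
 * odd rounds compare odd/even pairs starting at index 1. */
void odd_even_sort(int *vals, int n) {
    for (int r = 0; r < n; r++) {
        int start = (r % 2 == 0) ? 0 : 1;
        for (int i = start; i + 1 < n; i += 2) {
            if (vals[i] > vals[i + 1]) {   /* swap if out of order */
                int tmp = vals[i];
                vals[i] = vals[i + 1];
                vals[i + 1] = tmp;
            }
        }
    }
}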
Parallel Algorithm
In the parallel version of Odd-Even sort, the assumption is that P (the number of processes) is much smaller than N. The parallelization is in terms of P.
Assume Pi is the ith process, where i = 0 to P-1 is the rank of each process participating in the parallelization.
- Each Pi is allocated a contiguous portion of the list.
- Some number of sorting rounds occur in which Pi exchanges its sorted list of values with one of its neighbors. At each round, r, process Pi does the following (see the sketch after this list for one way to compute the exchange partner):
  - Pi sorts its portion of the list locally, using any sorting algorithm.
  - If round r is such that r % 2 == 0, an even/odd exchange happens:
    - If Pi is even, it sends its sorted portion to Pi+1. Pi keeps the smallest half of the items for the next round, Pi+1 the largest half.
    - If Pi is odd, it sends its sorted portion to Pi-1. Pi keeps the largest half of the items for the next round, Pi-1 the smallest half.
  - If round r is such that r % 2 == 1, an odd/even exchange happens:
    - If Pi is even, it sends its sorted portion to Pi-1. Pi keeps the largest half of the items for the next round, Pi-1 the smallest half.
    - If Pi is odd, it sends its sorted portion to Pi+1. Pi keeps the smallest half of the items for the next round, Pi+1 the largest half.
- Note that the end-most Pis may not participate in every round.
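The even/odd and odd/even rules above reduce to a small partner computation. Here is a minimal sketch (a hypothetical helper, not in the starting point code):

/* Returns the rank of the process to exchange with in round r, or -1
 * if this process sits the round out (an end-most Pi). rank is this
 * process's rank; p is the total number of processes. */
int round_partner(int rank, int p, int r) {
    int partner;
    if (r % 2 == 0) {   /* even/odd exchange */
        partner = (rank % 2 == 0) ? rank + 1 : rank - 1;
    } else {            /* odd/even exchange */
        partner = (rank % 2 == 0) ? rank - 1 : rank + 1;
    }
    return (partner < 0 || partner >= p) ? -1 : partner;
}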
Implementing in MPI
You will implement parallel Odd-Even Sort in MPI. Your implementation will ignore two issues often associated with sorting:
- How to distribute the N elements over the P processes.
- How to merge the P sorted parts back into a single sorted array.
In your implementation:
- Each process will generate a set of size random int values as its portion of the N items to sort.
- Processes will compute all rounds of Odd-Even sort so that at the end, process 0 has the smallest size elements in sorted order, process 1 the next smallest size elements in sorted order, ..., and process P-1 has the largest size elements in sorted order. The sorted portions DO NOT need to be assembled into a single sorted array on one of the nodes. In other words, your MPI program can exit with the sorted list distributed in size chunks over the P processes, such that P0 has the smallest size values in sorted order, P1 the next smallest size values, and so on.
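One way to structure an exchange round without deadlock is MPI_Sendrecv, which pairs the send and receive in a single call. The following is only a sketch under stated assumptions: round_partner is the hypothetical helper sketched earlier, and keep_half is a hypothetical function that merges the two sorted arrays of length size and keeps either the smallest or largest size values in mine.

#include <mpi.h>

int round_partner(int rank, int p, int r);   /* sketched earlier */
void keep_half(int *mine, int *theirs, int size,
               int keep_smallest);           /* hypothetical merge helper */

/* Sketch of one exchange round: mine holds this process's sorted
 * portion; theirs is a scratch buffer for the partner's portion. */
void exchange_round(int *mine, int *theirs, int size,
                    int rank, int p, int r) {
    int partner = round_partner(rank, p, r);
    if (partner == -1) {
        return;   /* end-most process sits this round out */
    }
    MPI_Sendrecv(mine, size, MPI_INT, partner, 0,
                 theirs, size, MPI_INT, partner, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* the lower-ranked process of the pair keeps the smallest size
     * values of the 2*size merged values; the higher-ranked process
     * keeps the largest */
    keep_half(mine, theirs, size, rank < partner);
}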
Sizes and Running
The total number N of data elements sorted depends on both the size of the portion each process allocates and the number of processes. For example, in a run with a size of 1024 and 4 processes, N is 4096; in a run with a size of 1024 and 8 processes, N is 8192.
In the starting point oddevensort.c file is a definition for SIZE that you should use as the default size value for each process' portion of N to allocate and initialize. When you run your MPI program you will specify the number of processes to spawn (e.g., -np 8 spawns 8 MPI processes for a run). You can also run ./oddevensort with an optional command line argument that specifies a different value for size:
# sort 8*SIZE values such that each of the 8 processes gets SIZE elements
mpirun -np 8 --hostfile hostfile ./oddevensort
# run with optional command line argument:
# sort 8*100 (800) values such that each of the 8 processes gets 100 elements
mpirun -np 8 --hostfile hostfile ./oddevensort 100
When you run on our system, you may see the following warning:
A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: <some host name>
Another transport will be used instead, although this may result in
lower performance.
NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
You can just ignore this warning. MPI is looking for an InfiniBand network interface to use; we don't have InfiniBand, so it is telling us it is using something else (the Ethernet that we do have).
You can also set an MCA variable to remove the warning. See the Running OpenMPI on our system page for instructions on how to do this.
If mpirun gives this error:
There are not enough slots available in the system to satisfy the 512 slots
that were requested by the application:
./oddevensort
Either request fewer slots for your application, or make more slots available
for use.
This means that you are using a hostfile that does not have enough total slots for the number of processes you are trying to spawn (-np). The fix is to use a hostfile with more hosts, and to specify the number of slots per host in the hostfile too. With the starting point code is a large hostfile you can use. You can also create your own hostfile with machines on our system, using either autoMPIgen or by copying and pasting machine names from the lists of lab machines in files on our system. See Tips and Handy Resources for more information about both of these. A small example of the hostfile format with slots follows.
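For reference, a hostfile with slots looks like the following (hypothetical machine names; 4 hosts with 8 slots each gives 32 total slots):

carrot slots=8
turnip slots=8
radish slots=8
parsnip slots=8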
Command line arguments
You will add two command line arguments to oddevensort, both optional (meaning one can run oddevensort with or without these command line arguments):
- The first specifies the size of each Pi's portion of the set of values to sort. See Sizes and Running for more information about how to use this value if it is given on the command line.
- The second specifies the number of times to repeat a full odd-even sort in this run.
You do not need to use getopt for processing command line options. Instead, if one argument is listed, then it is the size value (in argv[1]); if two are listed, then the first is the size value (in argv[1]) and the second is the number of repetitions of odd-even sort to execute (in argv[2]).
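A minimal sketch of this argument handling (it assumes the SIZE #define from the starting point; iters is a hypothetical variable name):

#include <stdlib.h>   /* for atoi */

int size = SIZE;      /* default portion size from the starting point */
int iters = 1;        /* default: run the full sort once */
if (argc > 1) {
    size = atoi(argv[1]);    /* optional first arg: portion size */
}
if (argc > 2) {
    iters = atoi(argv[2]);   /* optional second arg: repetitions */
}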
The Second Command Line Argument
NOTE: add this 2nd command line option later, after you have the rest of oddevensort, with the first command line option, debugged, tested, and working.
In addition to the optional first command line argument for specifying size, oddevensort should take an optional second command line argument that specifies the number of times to repeat a full odd-even sort (this will be useful for really long runs on larger machines).
# run oddevensort with 8 processes, each getting 100 elements
# (sorts 800 total values using 8 processes)
mpirun -np 8 --hostfile hostfile ./oddevensort 100
# run oddevensort with 8 processes, each getting 100 elements
# repeat the full oddeven sort (allocation, initialization, and sort) 3 times
mpirun -np 8 --hostfile hostfile ./oddevensort 100 3
See Sizes and Running for information about the mpirun details.
The 2nd command line argument may be useful for getting some longer runs (in addition to large numbers of processes and large size values). If the number of iterations is greater than 1, then the parts that should be repeated are:
- each Pi's initialization of its portion of the array to random values
- performing odd-even sort on this new set of values
Multiple iterations should not re-spawn MPI processes or have each Pi reallocate the memory for its array of size values or any other space it needs to sort. Instead, just iterate over the number of times a full odd-even sort of N values is executed, as in the sketch below.
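A minimal sketch of that structure (init_portion and oddeven_sort are hypothetical names for functions you will write):

/* allocate once, outside the iteration loop */
int *vals = malloc(size * sizeof(int));   /* needs stdlib.h */
if (vals == NULL) {
    /* handle the error (print it and call MPI_Abort, for example) */
}
for (int i = 0; i < iters; i++) {
    init_portion(vals, size);            /* new random values each time */
    oddeven_sort(vals, size, rank, p);   /* one full odd-even sort of N */
}
free(vals);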
Debugging
You can call printf in MPI applications; each MPI process' output is sent to be displayed in the terminal from which you ran mpirun. As a result, printf calls cause more message passing in the system, so you should disable printing when not debugging. Additionally, processes may want to prefix their output with their MPI rank, as messages can arrive to be displayed in any order.
In the starting point oddevensort.c file are macro definitions for debug printf output. They are named PRINTX, where X is the number of arguments to the printf call:
// comment DEBUG to disable PRINTX output
#define DEBUG
#ifdef DEBUG
#define PRINT0(s) printf(s)
#define PRINT1(s,a) printf(s,a)
#define PRINT2(s,a,b) printf(s,a,b)
#else
#define PRINT0(s)
#define PRINT1(s,a)
#define PRINT2(s,a,b)
#endif
In your program code, make calls to the PRINTX macros instead of to the printf function. For example:
PRINT0("got here\n");
PRINT1("x: %d\n", x);
PRINT1("my name is %s\n", name);
PRINT2("x: %d, error: %g\n", x, err_val);
Depending on whether the DEBUG constant is defined or commented out, the PRINTX macros are either defined to be a call to a printf function or defined to be nothing (i.e., they don't print anything). To enable or disable printing, uncomment or comment DEBUG and recompile. This is one way to easily enable or disable debug printing in a program. You can add more PRINTX macro definitions following these.
As you first develop your solution, uncomment the DEBUG definition in the starting point file, and try running with small SIZE values (you can change this definition) and small numbers of processes to test correctness. With DEBUG defined, each MPI process will print out its portion of the array before performing odd-even sort and then after, where you should see that process 0 has the smallest SIZE values in sorted order, process 1 the next smallest, and so on.
Try different SIZE values and numbers of processes to make sure your sort is correct (correct number of iterations, correct Pis exchanging at each step).
You can also try odd-even sorting in the other direction after doing it one way first (i.e., try a descending sort after you do the ascending sort). Sorting an array into the opposite order is often a good way to catch a missed edge case. To do this, you may want to functionize all comparisons so that it is easy to change your code to sort in either direction, as in the sketch below.
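For example, a minimal sketch of a functionized comparison (a hypothetical helper):

/* Returns nonzero if a and b are out of order for the chosen
 * direction; flipping ascending reverses the sort direction. */
int out_of_order(int a, int b, int ascending) {
    return ascending ? (a > b) : (a < b);
}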
Also, try some large size runs (with DEBUG commented out) to make sure you do not have any deadlock in your solution.
Sample Output
Here is output from my program. It shows a few runs with different numbers of processes and sizes, run in debug mode, where each process prints out its portion of the unsorted array before sorting and its portion of the sorted result after.
It is good to add this debug printing functionality to your program to help you verify the correctness of your implementation.
Requirements
Code
In addition to a correct implementation, your solution should be well designed, well commented, with good error detection and handling, and free of memory access errors. Use constants, meaningful variable names, and good modular design. See my C Style guide and other C programming documentation.
Also, do not use global variables in your solution.
Questions
In the LaTeX source file questions.tex are some questions about the Odd-Even Sort algorithm. Edit this LaTeX file with your answers. There is a commented-out example of how to list verbatim code sequences if you want to do that. However, strive for concise answers to these questions; you should not have to write a lot to answer them.
Results
Before 11:59pm, Friday Oct. 27, you should submit results from some initial runs of your odd-even MPI sort on our system. Include some large-ish runs.
Due After Break: results from larger-sized runs on our system, on Strelka, and on ACCESS are due after break. More details will be assigned after break. For now, make sure you have your Strelka and ACCESS accounts set up.
Runs on our system
You should do some large-sized runs on our system; see the run.sh script included in the starting point for an example of how to start a bunch of runs of different sizes.
Try varying the problem size in a few ways (both large sizes of data to sort and large numbers of processes and hosts). NOTE: run.sh uses pretty small sizes, so all of its runs will complete pretty quickly. There is another script, runX.sh, that is an example of larger runs, and you can create your own using these two as references.
For larger numbers of processes, you need to use a larger hostfile with enough slots for the number of processes. You should create your own hostfiles with more CS lab machines to use for larger runs. Also create some that use slots=<num> to specify different numbers of slots to use on each machine (this can be a way to distribute processes over more machines in your hostfile, for example).
You can run the run.sh script with an optional command line argument specifying the hostfile:
./run.sh
./run.sh hostfile
./run.sh myhostfileHUGE6slots
With your lab submission you should submit a file RESULTS that lists run times for the different sizes you ran on our system. See the comment at the top of the file about what to list and how.
If you do some runs on a lot of nodes, you can make use of autoMPIgen to create a good hostfile for you, and run the check_up.sh script to remove unreachable nodes from this file.
Also, please be mindful that machines are heavily used for lab sessions, classes, and ninja sessions, and avoid running computationally intensive experiments during those times. Smaller runs should be okay.
To find good large-sized runs, first test out a large-sized run on one node and make sure that your oddevensort processes are not allocating more space than can fit into RAM (run htop to see). If they are, back off the size until the cumulative memory fits into RAM (with some to spare). You should see no swap space usage.
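As a rough estimate (assuming 4-byte ints): each process needs at least size * sizeof(int) bytes for its own portion, plus a receive buffer of the same size for exchanges, so a size of 100 million values is roughly 800 MB per process. Multiply that by the number of processes you run on a node to estimate total RAM use.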
More details about running some even larger runs on Strelka and ACCESS will be discussed and assigned after break.
For the Lab 4 due date, do the following:
- Try some timed runs of different sizes on our system. These should include some large-sized runs, but not huge ones. You will do some really large runs after break on Anvil and Strelka.
- Make sure to set up your ACCESS and Strelka accounts this week.
Tips and Handy Resources
MPI and running
- Running OpenMPI on our system. This includes information about setting up ssh keys and ssh-agent so that MPI can ssh into hosts without you having to give your password, and also information about my simple MPI example programs, which you can copy over and try out on our system:
cp -r ~newhall/public/openMPI_examples .
- Remember that MPI sends and receives have associated fixed-size buffers. On a regular send, the sending process typically does not block on the send call after the data are copied to the buffer (you can configure different send semantics, but this is the default). Sometimes, however, a process may block on send if the buffer is full and the receiving process has not yet received the buffered data.
- check_up.sh: a script, included with the starting point code, to check if the machines in a hostfile are up before you use it to run a bunch of experiments. To run it:
./check_up.sh hostfile
- hf_2_hfslots: a script, included with the starting point code, that takes a hostfile with just machine names and a number, and creates a new hostfile with slots=number on every machine. To run:
# create new hostfile named hostfile6 from hostfile with slots=6
./hf_2_hfslots hostfile 6 > hostfile6
# create new hostfile named hostfile4 from hostfile with slots=4
./hf_2_hfslots hostfile 4 > hostfile4
- runX.sh: example run scripts, included with the starting point code, for running a set of runs of different sizes and different numbers of processes:
./run.sh
- MPI data types. When you send and receive values you need to specify their type with an MPI type:
long int i;
MPI_Send(&i, 1, MPI_LONG_INT, target, tag, comm);
Here are a few examples of C types and their corresponding MPI types:
char                MPI_CHAR or MPI_SIGNED_CHAR
unsigned char       MPI_BYTE or MPI_UNSIGNED_CHAR
int                 MPI_INT
unsigned int        MPI_UNSIGNED
float               MPI_FLOAT
double              MPI_DOUBLE
long int            MPI_LONG
unsigned long int   MPI_UNSIGNED_LONG
long long int       MPI_LONG_LONG_INT
If you define structs that are passed in MPI messages you need to define MPI data types for these to ensure that byte-ordering is maintained for messages passed between nodes with different Endian-ness. For this lab, you should not use structs.
- autoMPIgen (and smarterSSH): tools that are useful for finding less-loaded machines on our system. autoMPIgen generates an MPI hostfile for you, picking good machines to run on based on command line criteria. smarterSSH picks a good machine to ssh into on our system, and can also be used to list load information about lab machines. If you use autoMPIgen to generate a hostfile, keep in mind that things change, and you don't want to just keep using the same hostfile over and over. Instead, run it periodically to regenerate a good set of hosts. Here are some examples of how to run smarterSSH to get statistics:
smarterSSH -h
# run smarterSSH in -i mode to list machine statistics
smarterSSH -i -v -n 50
smarterSSH -i -v -n 50 -c
smarterSSH -i -v -n 50 -m
# ssh's you into a good machine
smarterSSH
Here are some examples of how to run autoMPIgen to generate MPI hostfiles in different ways:
autoMPIgen -h
# generate a host file (hostfile) of 16 hosts selected
# based on a function of available cpu and memory
autoMPIgen -n 16 hostfile
# generate a host file of 16 hosts that includes the host's cpu count (as slot=)
autoMPIgen -n 16 -q hostfile
# generate a host file of 16 hosts that includes slot=4 with each entry
autoMPIgen -n 16 -s 4 hostfile
# generate a host file, choosing 8 hosts based on memory usage
autoMPIgen -n 8 -m hostfile
# generate a host file, choosing 32 hosts based on cpu usage
autoMPIgen -n 32 -c hostfile
# generate a host file, randomly choosing 64 hosts
autoMPIgen -n 64 -r hostfile
- You can create your own hostfiles from the lists of machines in our labs. Just open one or more of these files in vim (or another editor) and cut and paste machine names into a hostfile. The lists of CS lab machines are available here:
/usr/swat/db/hosts.256
/usr/swat/db/hosts.bookstore   # machines in Tarble lab
/usr/swat/db/hosts.mainlab     # machines in 240
/usr/swat/db/hosts.overflow
- You can add in some of the machines that are just for our class, too.
- man mpirun: the mpirun man page has information about the different command line options to mpirun. The --map-by option allows for specifying how processes are spawned on the hosts in the hostfile. For example, it can be used to spawn more processes per node than there are slots per node. For this lab assignment you don't want to do this, but for other MPI applications you might.
- Running Experiments info from Lab 1.
- Using ACCESS: my directions and examples for running MPI programs on Anvil using your ACCESS account.
- Lab Machine specs page: contains information about most of the lab machines, including number of cores.
- Chapt. 15.2 on MPI from Dive into Systems.
C programming
- srand, rand: C random number functions:
// seed the random number generator using the current time
// (only seed one time)
srand(time(NULL));
// then call rand to get the next value in the random sequence
// (returns a value from 0 to RAND_MAX; use % to get a smaller range)
next_val = rand() % MAXVAL;
- Chapter 2 of Dive into Systems covers C programming. Here are some other C programming references.
- valgrind and gdb guides. Also see Chapter 3 of Dive into Systems on debugging C programs.
misc help pages
- my help pages: include links to Unix, C, pthreads, and networking resources
- Dive into Systems (Chapt. 2, 3, 15.2, and Appendix 2 on Using Unix)
- man pages for the C library, pthreads library, and TCP/IP socket functions
- tools for examining system and runtime state: top and htop to see what is executing on the client and server side
- some useful tools: vim, ssh, scp, tmux, finding idle machines, latex, git, ...
Submitting
Repo
Before the due date, you or your partner should push your solution from one of your local repos to the GitHub remote repo. Be sure to do a make clean before you git add anything:
git add questions.tex
git add oddevensort.c
git add RESULTS
git add hostfile* # add any hostfiles and run scripts you wrote
git add *.sh
git commit
git push
See the git help page "Troubleshooting" section for git help.
Post Lab Eval
After submission and demo, fill out a short evaluation form for this lab assignment. A link will be sent to you in an email.