Lab 0: C++ Warmup/Binary IO
Due by 11:59 p.m., Wednesday, Sept 12, 2018.
This is an individual lab. You are to complete this lab on your own, although you may discuss the lab concepts with your classmates. Remember the Academic Integrity Policy: do not show your code to anyone outside of the course staff and do not look at anyone else’s code for this lab. If you need help, please post on the Piazza forum or contact the instructor. If you have any doubts about what is okay and what is not, it’s much safer to ask than to risk violating the Academic Integrity Policy.
- Overview
- Linked List Implementation
- Main program
- Binary File Format
- Tips and Hints
- Submitting your lab
Overview
The task for this week’s lab: given employee data from two separate files, combine the contents into a sorted list and output the results to file. You will implement a linked-list data structure to maintain your sorted list of employee data. This will simulate, at a low-level, a common operation for storing raw data in a DBMS.
While this lab serves primarily as a warm-up exercise and reminder of C++ programming, we will also introduce new concepts. The learning objectives for this assignment:
- reacquaint students to programming in C/C++
- introduce binary file I/O
- utilize low-level memory management to manage data, including using tools such as valgrind to debug memory errors
- introduce the
const
keyword and C-style strings
You may want to read a bit about binary files before getting started.
Getting Started
First create a cs44
directory in your home directory, and add a labs subdirectory to it:
mkdir cs44
cd cs44
mkdir labs
cd labs
pwd
We will be using git repos hosted on the college’s GitHub server for labs in this class. If you have not used git or the college’s GitHub server before, here are some instructions on Using Git (follow the instructions for repos on Swarthmore’s GitHub Enterprise server).
Next find your git repo for this lab assignment off the GitHub server for our class: CPSC44-F18
Clone your git repo (Lab0-userID) containing starting point files into your labs directory:
cd ~/cs44/labs
git clone [the ssh url to your your repo]
cd Lab0-userID
If all was successful, you should see the following files (highlighted files require modification):
Makefile
- pre-defined. You may edit this file to add extra source files or execution commands.sortEmployees.cpp
- your main program for reading and sorting employee data.employee.[cpp/h]
- the implementation of anEmployeeList
class that will represent a list of sorted employees in a linked list.input/
- directory of sample input files in either text format (.txt
extension) or binary (.dat
).createbin
- an executable that will convert any ascii file to binary format. This should prove helpful when attempting to debug your file i/o.README.md
- a few wrap-up questions for you to answer about the lab assignment. Information about any late days you used on this assignment, and how many you have used so far. (TIP: Don’t use your late days on this lab).
Linked List Implementation
In employee.cpp/h
, you will implement an EmployeeList
and related EmployeeNode
class to manage a sorted linked list of employee data. The data will be sorted based on the name field of EmployeeNode
.
EmployeeNode class
The data members and public methods have been provided. You may add private methods, but shouldn’t need to. Begin by implementing the EmployeeNode
class in employee.cpp
. This class represents one node in a linked list:
name
is a C-style string containing the employees name.salary
is an integer.next
is a pointer to the nextEmployeeNode
in the linked list (orNULL
).
The constructor should initialize all data members (e.g., set pointers to NULL
) and the destructor should deallocate name
if it is not NULL
.
Each of the above data members has a corresponding setter and getter method. Note that c-style strings are arrays of characters, so you will need to allocate and deallocate memory. See the section on c-strings below for details.
EmployeeList class
EmployeeList
is a linked list with a limited interface – there is only one type of method for inserting and no methods for removing. While this may seem like a poor design, it is common to implement limited data structures to accomplish specific tasks efficiently and predictably. Do not modify the public interface, or add additional data members. You may add private methods, if needed.
There is only one data member for the linked list: the head
of the list. You
will need to implement the class methods:
- A constructor that initializes
head
toNULL
. - A destructor that deletes each
EmployeeNode
in the list. insert()
adds a new employee to the list such that the resulting list is in sorted order by the name field. In other words, you are performing insertion sort. Your method should create anEmployeeNode
with the given information and traverse the list to find the correct spot to insert the newEmployeeNode
.getNumEmployees()
andgetAverageSalary()
must traverse the linked list to accumulate the respective statistics.writeToFile()
writes the entireEmployeeList
, in sorted order, to the binary file specified by the parameterfilename
. See details on the structure of the binary files and writing to binary files. The file output should mirror the structure of the file that is read in byreadFromFile()
in the main program. Return 0 on success and a non-zero error value on failure (e.g., illegal file name). Be sure to specify the meaning of error values in the function comments.print()
is useful for debugging; it output the contents of the list in a formatted table.
Main program
In sortEmployees.cpp
, you will write a main program that, at a high level:
- reads in employee data from two unsorted binary files
- merges the data from two files together in a sorted list of employees
- write the resulting sorted list of employee data to a binary file.
- prints formatted information to standard output as it makes progress (see example runs).
General Requirements
- All of your code should demonstrate good design - your main should be simple, and written in a top-down, modular fashion.
- Each function should have a single, clear purpose.
- Each file must have a top-level comment describing the file’s purpose and authors.
- Each function should have a comment explaining its purpose as well as parameters, return values, and error conditions.
- Your program must be free of memory-access errors. This includes both memory leaks in out of bounds memory access. Recall that C++ does not report this to you; you need to test your code and use tools such as valgrind and gdb. See Compiling, Debugging and Linking Tips for a reminder on how to use these tools.
Usage and Sample Runs
$ ./sortEmployees input/infile1.dat input/infile2.dat result.dat
Your program takes in three arguments: two binary files to read and the name of the file to write the resulting list. Incorrect usage should result in an error message e.g.,
$ ./sortEmployees
usage: sortEmployees file1 file2 resultfile
See here for the expected output.
Reading the input files
In sortEmployees.cpp
, you must implement a function to read from binary files:
int readFromFile (char const *filename, EmployeeList *list)
This function reads in employee records from the binary file specified by the parameter filename
and adds each employee record list
. Return 0 on success and any non-zero error value on failure (there may be multiple reasons for failure; in the function comments, be sure to specify the error associated with each value you return). See below for tips on how to complete this function
Formatted Output
Your program should use formatted output (e.g., printf
) to display the employee data as it is read from the input files (i.e, in readFromFile()
). Your program should finish by displaying the sorted list of employee records. See examples below.
Binary File Format
You are used to reading files that are in text format - all data is stored as ASCII characters regardless of whether it is actually a number or character. Alternatively, we could store the data in a format more native to computing - as binary. You cannot open and read these files in a text editor (because they aren’t text). But they can be much more predicable in terms of how data is formatted.
Employee Data Files
Each employee file is a binary file that stores a sequence of variable length employee records. Each employee record has three fields: the length of the name (4 byte integer), the actual name (variable length), and a salary (4 byte integer). The format of an employee record in the binary file is as follows:
----------------------------------------------------------------------------
| 4 byte integer | N character string | 4 byte integer |
| Number of characters | Employee name | Employee salary |
| in employee name (N) | (not null-terminated) | |
----------------------------------------------------------------------------
Reading binary files
To read integers and character strings from a binary file you can use the following code as an example:
//open file specified in filename. Second parameter specifies that
// the file is for input and is in binary format
fstream infile(filename, ios::in | ios::binary);
int nameLen;
char *name;
// read 4 bytes into memory location of nameLen
// read assumes everything is char, so we will here we typecast
// the address of nameLen (the destination) as a char *
infile.read((char*)&nameLen, sizeof(int));
//Allocate space and read characters for string
name = new char[nameLen+1];
infile.read(name, nameLen);
// TODO: remember to null terminate strings with '\\0'
// TODO: add code to finish reading file
infile.close();
You should reference the fstream library for details on how to read files. Also, you will need to loop until you read all of the records in a file. You should use the eof()
(end of file) function to do so. Here is an illustrative example with tips on how to use it properly.
Writing binary files
To write integers and character strings to a binary file you can use:
fstream outfile(filename, ios::out | ios::binary);
//You need to figure out how to set these variables and loop through the list
int nameLen = ...;
char *name = ...;
...
outfile.write((char*)&nameLen, sizeof(int));//write the length of the name
outfile.write(name, nameLen); //write the name using the specific length
outfile.close(); //close the file when done writing
Remember to close the file when done writing, otherwise your results may not be saved to disk.
Tips and Hints
c-strings
See Prof. Newhall’s help page on Strings in C for a refresher on using c-strings. In particular, you will need to use strcpy
, strlen
, printf
, among others. Since your code is compiled in C++, you can use new
and delete
instead of malloc
and free
e.g.,
char *staticVar = "this is a string"; //a static c-string
char *dynamicVar = NULL; // we well allocate/deallocate space for this c-string
printf("%s\n", staticVar); //Prints out "this is a string"
len = strlen(staticVar);
dynamicVar = new char[len+1]; //add one for the null termination '\0'
strcpy(dynamicVar, staticVar,len); //copies "this is a string"
printf("%s\n", dynamicVar); //should print "this is a string"
delete [] dynamicVar //clean up when done
C-strings are arrays; allocating them dynamically requires that you also clean up memory when it is no longer being used. This also means that, c-strings variables are pointers - treat that as such.
const modifier
const
is a modifier that guarantees that an entity will not be modified while in scope. You will see it used in several contexts throughout the given code. There are two main uses:
- As a modifier to a variable. Primarily, I have done this in various function/method definitions to signify that a parameter cannot be modified in that function. For example,
readFromFile
declaresfilename
as of typechar const *
instead of justchar *
. This ensures that the function cannot modify the string.EmployeeList::insert()
can only read the values that have been sent it, it is not allowed to modify them. - As a modifier to a method header. This only occurs inside of class definitions. This ensures that the method will not modify the state of the object - that is, it has read-only access to the data members. For example,
EmployeeList::print()
only needs to access thehead
of the list and read its values. It should not (and cannot) attempt to modify the head. Conversely,insert
needs to modify the class so it is not declaredconst
.
Note that nothing changes on the caller’s end of function use. This simply provides a contract guarantee to a user that their data cannot be modified by a called function. It is enforced by the compiler (attempting to modify a constant will lead to a compile-time error message).
Error messages
Above, I mention that you will return error statuses via an int for your read and write functions. This is common practice, particularly in C programs where exceptions are not used. To make this aspect of your code more readable, consider using meaningful constant values. For example:
//Global static variable
static const int ERROR_NULL_FILENAME = -1;
/* Function purpose
* @params ...
* @return ...
* @error returns ERROR_NULL_FILENAME if the filename pointer is NULL
*/
int readFromFile(char const *filename, EmployeeList *list){
//...
//some code
if (filename == NULL){
return ERROR_NULL_FILENAME; //returns a -1, but someone reading your
//code cares more about the reason for error
}
//...rest of program
Instead of returning -1, the function returns a variable (this variable’s value happens to be -1 but this is abstracted). The key is that in main, you can check for this by comparing the return value to the error values you expect to find. For example, if you store the function return value in main using status, you can write:
if (status == ERROR_NULL_FILENAME){
//handle this error
}
This can be easier to understand (and less prone to bugs) than the equivalent statement:
if(status == -1){
//handle this error
}
Testing your code
It is difficult to verify the output since it is not text. Here are some strategies:
- To verify that your
writeToFile
function works, try using the output file of one run as one of the input files in a subsequent run. Read errors mean that the write phase was not correct. This will allow you to see the contents of the file as the employee records get printed to screen when it is read. - Run
wc
(word count) orls -l
to get the number of bytes in the output and input files. The output file should have exactly the same number of bytes as the sum of the size of the two inputs. - Generate a hex dump of a binary file using
xxd
.$ xxd infile1.dat 00000000: 0c00 0000 5475 7269 6e67 2c20 416c 616e ....Turing, Alan 00000010: c8af 0000 0d00 0000 486f 7070 6572 2c20 ........Hopper, 00000020: 4772 6163 6550 c300 000d 0000 004c 6f76 GraceP.......Lov 00000030: 656c 6163 652c 2041 6461 3aaa 0000 1000 elace, Ada:..... 00000040: 0000 4261 6262 6167 652c 2043 6861 726c ..Babbage, Charl 00000050: 6573 ed11 0100 es....
- I will also test your code with held-aside input files, so you should try create other input files for both debugging purposes early on (i.e., simple tests) as well as more stressful tests. To create new binary input files, use the
createbin
program provided with the starting point code. It will take in a text file that you create and output its binary version, which you can then feed into yoursortEmployees
program:
./createbin yourtextfile binaryoutfile
In input/
, I have given you infile1.dat
and infile2.dat
as sample inputs. I have also provided the corresponding text files so you can see how to format your input to createbin
.
Submitting your lab
Before the due date, push your solution to github from one of your local repos to the GitHub remote repo.
From your local repo (in your ~userID/cs44/labs/Lab0-userID
subdirectory)
git add *
git commit -m "my correct and well commented solution for grading"
git push
Verify that the results appear correct online. You will receive deductions for submitting code that does not run or repos with merge conflicts. Also, note that I take the time stamp of your final submission to record late days, so please do not update your repo until after the late period has ended.
If that doesn’t work, take a look at the “Troubleshooting” section of the Using git page. Also, be sure to complete the README.md
file.