Lab 0: C++ Warmup/Binary IO

Due by 11:59 p.m., Wednesday, Sept 12, 2018.

This is an individual lab. You are to complete this lab on your own, although you may discuss the lab concepts with your classmates. Remember the Academic Integrity Policy: do not show your code to anyone outside of the course staff and do not look at anyone else’s code for this lab. If you need help, please post on the Piazza forum or contact the instructor. If you have any doubts about what is okay and what is not, it’s much safer to ask than to risk violating the Academic Integrity Policy.

Overview
- Getting Started
Linked List Implementation
- EmployeeNode class
- EmployeeList class
Main program
Binary File Format
Tips and Hints
Submitting your lab

Overview

The task for this week’s lab: given employee data from two separate files, combine the contents into a sorted list and output the results to file. You will implement a linked-list data structure to maintain your sorted list of employee data. This will simulate, at a low-level, a common operation for storing raw data in a DBMS.

While this lab serves primarily as a warm-up exercise and reminder of C++ programming, we will also introduce new concepts. The learning objectives for this assignment:

reacquaint students to programming in C/C++
introduce binary file I/O
utilize low-level memory management to manage data, including using tools such as valgrind to debug memory errors
introduce the const keyword and C-style strings

You may want to read a bit about binary files before getting started.

Getting Started

First create a cs44 directory in your home directory, and add a labs subdirectory to it:

mkdir cs44
cd cs44
mkdir labs
cd labs
pwd

We will be using git repos hosted on the college’s GitHub server for labs in this class. If you have not used git or the college’s GitHub server before, here are some instructions on Using Git (follow the instructions for repos on Swarthmore’s GitHub Enterprise server).

Next find your git repo for this lab assignment off the GitHub server for our class: CPSC44-F18

Clone your git repo (Lab0-userID) containing starting point files into your labs directory:

cd ~/cs44/labs
git clone [the ssh url to your your repo]
cd Lab0-userID

If all was successful, you should see the following files (highlighted files require modification):

Makefile - pre-defined. You may edit this file to add extra source files or execution commands.
sortEmployees.cpp - your main program for reading and sorting employee data.
employee.[cpp/h] - the implementation of an EmployeeList class that will represent a list of sorted employees in a linked list.
input/ - directory of sample input files in either text format (.txt extension) or binary (.dat).
createbin - an executable that will convert any ascii file to binary format. This should prove helpful when attempting to debug your file i/o.
README.md - a few wrap-up questions for you to answer about the lab assignment. Information about any late days you used on this assignment, and how many you have used so far. (TIP: Don’t use your late days on this lab).

Linked List Implementation

In employee.cpp/h, you will implement an EmployeeList and related EmployeeNode class to manage a sorted linked list of employee data. The data will be sorted based on the name field of EmployeeNode.

EmployeeNode class

The data members and public methods have been provided. You may add private methods, but shouldn’t need to. Begin by implementing the EmployeeNode class in employee.cpp. This class represents one node in a linked list:

name is a C-style string containing the employees name.
salary is an integer.
next is a pointer to the next EmployeeNode in the linked list (or NULL).

The constructor should initialize all data members (e.g., set pointers to NULL) and the destructor should deallocate name if it is not NULL.

Each of the above data members has a corresponding setter and getter method. Note that c-style strings are arrays of characters, so you will need to allocate and deallocate memory. See the section on c-strings below for details.

EmployeeList class

EmployeeList is a linked list with a limited interface – there is only one type of method for inserting and no methods for removing. While this may seem like a poor design, it is common to implement limited data structures to accomplish specific tasks efficiently and predictably. Do not modify the public interface, or add additional data members. You may add private methods, if needed.

There is only one data member for the linked list: the head of the list. You will need to implement the class methods:

A constructor that initializes head to NULL.
A destructor that deletes each EmployeeNode in the list.
insert() adds a new employee to the list such that the resulting list is in sorted order by the name field. In other words, you are performing insertion sort. Your method should create an EmployeeNode with the given information and traverse the list to find the correct spot to insert the new EmployeeNode.
getNumEmployees() and getAverageSalary() must traverse the linked list to accumulate the respective statistics.
writeToFile() writes the entire EmployeeList, in sorted order, to the binary file specified by the parameter filename. See details on the structure of the binary files and writing to binary files. The file output should mirror the structure of the file that is read in by readFromFile() in the main program. Return 0 on success and a non-zero error value on failure (e.g., illegal file name). Be sure to specify the meaning of error values in the function comments.
print() is useful for debugging; it output the contents of the list in a formatted table.

Main program

In sortEmployees.cpp, you will write a main program that, at a high level:

reads in employee data from two unsorted binary files
merges the data from two files together in a sorted list of employees
write the resulting sorted list of employee data to a binary file.
prints formatted information to standard output as it makes progress (see example runs).

General Requirements

All of your code should demonstrate good design - your main should be simple, and written in a top-down, modular fashion.
Each function should have a single, clear purpose.
Each file must have a top-level comment describing the file’s purpose and authors.
Each function should have a comment explaining its purpose as well as parameters, return values, and error conditions.
Your program must be free of memory-access errors. This includes both memory leaks in out of bounds memory access. Recall that C++ does not report this to you; you need to test your code and use tools such as valgrind and gdb. See Compiling, Debugging and Linking Tips for a reminder on how to use these tools.

Usage and Sample Runs

$ ./sortEmployees input/infile1.dat input/infile2.dat result.dat

Your program takes in three arguments: two binary files to read and the name of the file to write the resulting list. Incorrect usage should result in an error message e.g.,

$ ./sortEmployees
usage: sortEmployees file1 file2 resultfile

See here for the expected output.

Reading the input files

In sortEmployees.cpp, you must implement a function to read from binary files:

int readFromFile (char const *filename, EmployeeList *list)

This function reads in employee records from the binary file specified by the parameter filename and adds each employee record list. Return 0 on success and any non-zero error value on failure (there may be multiple reasons for failure; in the function comments, be sure to specify the error associated with each value you return). See below for tips on how to complete this function

Formatted Output

Your program should use formatted output (e.g., printf) to display the employee data as it is read from the input files (i.e, in readFromFile()). Your program should finish by displaying the sorted list of employee records. See examples below.

Binary File Format

You are used to reading files that are in text format - all data is stored as ASCII characters regardless of whether it is actually a number or character. Alternatively, we could store the data in a format more native to computing - as binary. You cannot open and read these files in a text editor (because they aren’t text). But they can be much more predicable in terms of how data is formatted.

Employee Data Files

Each employee file is a binary file that stores a sequence of variable length employee records. Each employee record has three fields: the length of the name (4 byte integer), the actual name (variable length), and a salary (4 byte integer). The format of an employee record in the binary file is as follows:

  ----------------------------------------------------------------------------
 |    4 byte integer     |     N character string     |    4 byte integer     |
 | Number of characters  |        Employee name       |    Employee salary    |
 | in employee name (N)  |    (not null-terminated)   |                       |
  ----------------------------------------------------------------------------

Reading binary files

To read integers and character strings from a binary file you can use the following code as an example:

  //open file specified in filename. Second parameter specifies that
  // the file is for input and is in binary format
  fstream infile(filename, ios::in | ios::binary);
  int nameLen;
  char *name;

  // read 4 bytes into memory location of nameLen
  // read assumes everything is char, so we will here we typecast
  // the address of nameLen (the destination) as a char *
  infile.read((char*)&nameLen, sizeof(int));

  //Allocate space and read characters for string
  name = new char[nameLen+1];
  infile.read(name, nameLen);
  // TODO: remember to null terminate strings with '\\0'
  // TODO: add code to finish reading file
  infile.close();

You should reference the fstream library for details on how to read files. Also, you will need to loop until you read all of the records in a file. You should use the eof() (end of file) function to do so. Here is an illustrative example with tips on how to use it properly.

Writing binary files

To write integers and character strings to a binary file you can use:

  fstream outfile(filename, ios::out | ios::binary);
  //You need to figure out how to set these variables and loop through the list
  int nameLen = ...;
  char *name = ...;
  ...

  outfile.write((char*)&nameLen, sizeof(int));//write the length of the name
  outfile.write(name, nameLen); //write the name using the specific length
  outfile.close(); //close the file when done writing

Remember to close the file when done writing, otherwise your results may not be saved to disk.

Tips and Hints

c-strings

See Prof. Newhall’s help page on Strings in C for a refresher on using c-strings. In particular, you will need to use strcpy, strlen, printf, among others. Since your code is compiled in C++, you can use new and delete instead of malloc and free e.g.,

char *staticVar = "this is a string"; //a static c-string
char *dynamicVar = NULL; // we well allocate/deallocate space for this c-string

printf("%s\n", staticVar); //Prints out "this is a string"
len = strlen(staticVar);

dynamicVar = new char[len+1]; //add one for the null termination '\0'
strcpy(dynamicVar, staticVar,len); //copies "this is a string"

printf("%s\n", dynamicVar); //should print "this is a string"

delete [] dynamicVar //clean up when done

C-strings are arrays; allocating them dynamically requires that you also clean up memory when it is no longer being used. This also means that, c-strings variables are pointers - treat that as such.

const modifier

const is a modifier that guarantees that an entity will not be modified while in scope. You will see it used in several contexts throughout the given code. There are two main uses:

As a modifier to a variable. Primarily, I have done this in various function/method definitions to signify that a parameter cannot be modified in that function. For example, readFromFile declares filename as of type char const * instead of just char *. This ensures that the function cannot modify the string. EmployeeList::insert() can only read the values that have been sent it, it is not allowed to modify them.
As a modifier to a method header. This only occurs inside of class definitions. This ensures that the method will not modify the state of the object - that is, it has read-only access to the data members. For example, EmployeeList::print() only needs to access the head of the list and read its values. It should not (and cannot) attempt to modify the head. Conversely, insert needs to modify the class so it is not declared const.

Note that nothing changes on the caller’s end of function use. This simply provides a contract guarantee to a user that their data cannot be modified by a called function. It is enforced by the compiler (attempting to modify a constant will lead to a compile-time error message).

Error messages

Above, I mention that you will return error statuses via an int for your read and write functions. This is common practice, particularly in C programs where exceptions are not used. To make this aspect of your code more readable, consider using meaningful constant values. For example:

//Global static variable
static const int ERROR_NULL_FILENAME = -1;

/* Function purpose
 * @params ...
 * @return ...
 * @error returns ERROR_NULL_FILENAME if the filename pointer is NULL
 */
int readFromFile(char const *filename, EmployeeList *list){
  //...
  //some code
  if (filename == NULL){
    return ERROR_NULL_FILENAME; //returns a -1, but someone reading your
                                //code cares more about the reason for error
  }
  //...rest of program

Instead of returning -1, the function returns a variable (this variable’s value happens to be -1 but this is abstracted). The key is that in main, you can check for this by comparing the return value to the error values you expect to find. For example, if you store the function return value in main using status, you can write:

  if (status == ERROR_NULL_FILENAME){
    //handle this error
  }

This can be easier to understand (and less prone to bugs) than the equivalent statement:

  if(status == -1){
    //handle this error
  }

Testing your code

It is difficult to verify the output since it is not text. Here are some strategies:

To verify that your writeToFile function works, try using the output file of one run as one of the input files in a subsequent run. Read errors mean that the write phase was not correct. This will allow you to see the contents of the file as the employee records get printed to screen when it is read.
Run wc (word count) or ls -l to get the number of bytes in the output and input files. The output file should have exactly the same number of bytes as the sum of the size of the two inputs.

Generate a hex dump of a binary file using xxd.

$ xxd infile1.dat
00000000: 0c00 0000 5475 7269 6e67 2c20 416c 616e  ....Turing, Alan
00000010: c8af 0000 0d00 0000 486f 7070 6572 2c20  ........Hopper,
00000020: 4772 6163 6550 c300 000d 0000 004c 6f76  GraceP.......Lov
00000030: 656c 6163 652c 2041 6461 3aaa 0000 1000  elace, Ada:.....
00000040: 0000 4261 6262 6167 652c 2043 6861 726c  ..Babbage, Charl
00000050: 6573 ed11 0100                           es....

I will also test your code with held-aside input files, so you should try create other input files for both debugging purposes early on (i.e., simple tests) as well as more stressful tests. To create new binary input files, use the createbin program provided with the starting point code. It will take in a text file that you create and output its binary version, which you can then feed into your sortEmployees program:

./createbin yourtextfile binaryoutfile

In input/, I have given you infile1.dat and infile2.dat as sample inputs. I have also provided the corresponding text files so you can see how to format your input to createbin.

Submitting your lab

Before the due date, push your solution to github from one of your local repos to the GitHub remote repo.

From your local repo (in your ~userID/cs44/labs/Lab0-userID subdirectory)

git add *
git commit -m "my correct and well commented solution for grading"
git push

Verify that the results appear correct online. You will receive deductions for submitting code that does not run or repos with merge conflicts. Also, note that I take the time stamp of your final submission to record late days, so please do not update your repo until after the late period has ended.

If that doesn’t work, take a look at the “Troubleshooting” section of the Using git page. Also, be sure to complete the README.md file.