Introduction
This assignment is to be done with a partner. You may not work with
other groups, and the share of workload must be even between both partners.
Failing to do either is a violation of the department Academic
Integrity policy.
The goal of the WiscDB projects is to allow students to learn about the
internals of a data processing engine. In the first assignment, you
built a buffer manager, on top of an I/O Layer that I provide, to
understand the role of a system in managing main memory efficiently.
In this second part, we will drill into this I/O layer to understand
details of managing data in a file. You will complete the heap
file representation of the data - that is, one where records are
stored in an unsorted manner. This will allow you to directly
simulate the physical structure of data on disk and help prepare you
for future labs.
Recall that a file is a collection of pages which themselves maintain
a collection of records. I have provided you with the File
class, which provides a logical ordering of pages for you (e.g.,
a linked list representation of pages). You must complete the
I/O layer by implementing the Page class, which provides
the physical layout of records on a disk page.
The code base is quite extensive and, as with Lab 3, will require
much reading of
API documentation and thorough testing.
You are responsible for making
significant progress early on the in assignment as waiting until the last
few days will not be manageable. The goals of this assignment include:
- Understanding the structure of heap files in the context of a
physical disk layer
- Managing variable-length records
- Simulating the logical structure of a physical page by maintaining
a directory of slots
- Learning to manage and manipulate low-level data structures in C++
- Understand the role of abstraction in large systems
- Developing a thorough testing methodology for a large-integrated
system
To get started, run update44
to obtain the starting point files in ~/cs44/labs/4/. You should
obtain the following files (files highlighted in blue require modification):
- Makefile - pre-defined. You may edit this file to add
extra source files or execution commands. To test your implementation
alongside your buffer manager from Lab 3, you will need to uncomment
the second target and copy over the necessary files from Lab 3.
- page.cpp - the central implementation file for this lab. You must edit this
file to complete the I/O layer.
- pageTester.cpp - your main
testing file. Unlike the previous lab, it is up to you to implement
the proper tests for your solution.
- README - a few wrap-up questions
for you to answer about the lab assignment.
For your convenience, the subfolders
include/ lib/ exceptions/
have been placed in a common shared directory
/home/soni/public/cs44/heapPage. You do not need to do anything
It is probably easiest to refer to the
online documentation for exceptions.
You are
free to read and/or copy the header files if it proves helpful, but
nothing special is needed to source them in your coding.
When you are ready to submit your lab, use handin44. Recall that
only files in the ~/cs44/labs/4 subdirectory will be submitted. You
may submit as many times as you wish; only the most recent copy will be saved.
The WiscDB I/O Layer: Heap Files
The lowest layer of the WiscDB database systems is the I/O layer.
This layer consists of two classes: a file (class
File) and a page (class
Page) class. These classes use C++ exceptions to handle the occurrence of any unexpected event.
Implementation of the
File class, and the exception classes
are provided to you.
The code has been adequately commented to help you with understanding how it does what it does.
At a high-level, the File class abstracts the physical structure
of files from upper-levels of the database. It stores a linked-page structure
of heap-file pages that allow iteration of contents of a file and the
storage of records. The File class allows the upper level of the system to:
- create/destroy files
- allocate/deallocate pages within a file
- read and write pages of a file
Before reading further you should first read the documentation for the
File class while reviewing the concept of a heap file in Chapter 9
of the textbook. Again, this implementation has been completed and thus
the details of managing a collection of pages is an abstraction for
both higher layers (e.g., the buffer manager) as well as lower layers (i.e.,
the page layer).
Heap Page
The Page class describes the structure and set of operations that
can be performed on a slotted heap page. A page stores a set of records and
supports operations such as:
- inserting/deleting records
- retrieving and updating records
- iterating over all records on a page
Specifically, you will implement a
variable-length slot representation. Review your material from lecture
and the textbook. Our representation is very similar to the one depicted
on Figure 9.7 with the following alteration: the header and slot directory
is stored at the beginning of the page (the directory grows forward) and the
heap space is filled from back to front. The structure is illustrated
below:
Your job in this assignment is to complete the implementation of the
Page class. You will have to pay close attention to pointer
arithmetic and the conceptual mapping of data in a general array of data.
You are provided the following structures to represent a
page object:
- A header structure, which stores meta data such as pointers
(i.e., indices into the data array)
to the beginning and end of free space; the number of allocated slots;
the number of free slots; the page number of the current page as well as
the next page number in the linked structure.
- A data array containing all of the records. This array is
of fixed size (the size of a page minus the header) and of generic type.
You will map both the slot directory (at the beginning of the array) and
stored records slots (starting from the end of the array) into this array
using casting and pointer arithmetic.
- A PageSlot structure which represents the offset and length
for one slot as well as whether the slot contains a record or not.
Thus, a page is simply the concatenation of the header and data array.
Implementation of the Page class
The Page class consists of several public methods that are utilized
as the interface to a heap page as well as a set of private methods which are
intended as either helper methods or restricted access methods to be used
by friend classes (such as File and PageIterator).
The following is a list of methods in no particular order. You should not
simply implement each method one-by-one; you need to carefully plan your
implementation in tandem with a testing strategy. Failure to do so will
lead to significant problems as the data array structure is handled
directly by you.
Provided methods
These methods have been defined for you:
- Page() constructor which calls reset() to initialize a page
- begin(),end() methods utilized for scanning a page of records
- bool hasSpaceForRecord(const char* record_data) const
calculates the amount of space needed in the page for a record to fit.
If there is a slot in the directory available to use, this is simply the
length of the record. Otherwise we must consider the growth of the slot
directory in the calculation.
- page_number(), next_page_number(), setPageNumber(), setNextPageNumber() getter and setter methods for the current page number and
next page number. These are handled by the File object.
- isUsed() tells the File class is valid or not.
Basic accessor/mutator methods to implement
- PageSlot* getSlot(const SlotId slot_number),
const PageSlot& getSlot(const SlotId slot_number) const
These methods return a pointer to the
location of a slot in the data array. One returns a
pointer to the data for potential modification of the underlying data
while the other returns the actual PageSlot by disallows
modification. You should implement this first, otherwise you will
not understand the structure of the Page. You should email
me your implementations as soon as you think you have it correct
so that I can make sure you are starting off correctly.
- void reset()
This method initializes a new heap page (or resets an existing one) to
default values. It should set all header values to
reasonable defaults (follow the documentation for the
PageHeader class in page.h) and fill the
data array completely with
null characters (i.e., '\0'). Note that INVALID_NUMBER
is a pre-defined constant that may be useful and that the pointers
to be stored are simply indices into the data array.
- getFreeSpace()
returns the amount of bytes available for
storing records. You should simply consider the distance between
the free space pointers.
Inserting a record
These are the set of methods involved in adding a record to a page:
- RecordId insertRecord(const char* record_data)
This member function is the public interface for adding a record to the
page and relies on several helper methods (see below). If the method
is successful, it returns the RecordId which is simply
a tuple containing the page number and slot number
(i.e., {PageId,SlotId}) for the stored record.
The method should throw a InsufficientSpaceException if there
is not enough free space to fit the record. This method:
- Checks for sufficient space
- Obtains an available slot
- Inserts the record in the appropriate free space and
updates the given slot
- Returns the id for the record
Each of these steps should be completed using helper methods as much
as possible.
- bool hasSpaceForRecord() see above in pre-defined methods
- SlotId getAvailableSlot()
Obtains a SlotId for the directory. There are two
scenarios:
- If there are any
free (i.e., unused) slots in the directory we should reuse that slot.
This requires a scan across the slot directory, checking to see if the
slot is not used (note, the directory uses 1-based indexing so the slot
directory is over indices [1,header.num_slots]. Simply
return the index of the available slot.
- If there are no free slots, you should grow the
directory by updating the necessary header information
(i.e., number of slots, number of free slots, beginning of free space).
Note that at this point, the slot is unused and there really isn't
any useful data to store about it. You will need to return
its index (i.e., the last index in the directory).
- void insertRecordInSlot(const SlotId slot_number,
const char* record_data)
A private helper method for placing record_data in the data array
and record it's meta information at position slot_number in the
directory. Your method should:
- Check that the provided slot number is valid (i.e., within the
range of indices into the slot directory [1,num_slots].
You should throw an InvalidSlotException if not.
- Obtain the slot data and check that it is not in use already.
Attempting to insert in a used slot should result in a SlotInUseException.
- Given the slot is valid to use, update the slot's used, length,
and offset field to the correct values. HINT: strlen may
prove useful here. Recall that the offset is to the beginning of the
record and that records are inserted from back to front on the page (i.e., the record will be stored in the indices ending at the free_space_end pointer.
- Reduce the number of free slots
- Copy the record into the data array. This can be done as follows:
strncpy(pointer to start point in data, record_data, length of record);.
Getting a record and updating a record
To obtain a record from a page, you will implement the following:
Delete a Record
There are two methods for deleting a record:
- void Page::deleteRecord(const RecordId& record_id)
The public method for deleting a record. That has been defined for you
to utilize the private method below.
- void deleteRecord(const RecordId& record_id,
const bool allow_slot_compaction)
This method deletes an existing record and will be the most
detailed implementation for your lab. The method takes in two parameters:
a record_id which must be validated is in getRecord and
allow_slot_compaction which, when true, allows for the
slot directory to shrink if and only if the last record in the directory
is removed. Updates do not want this behaviour, but all other deletes do.
The major steps for this method are:
- Validate the record ID
- After finding where the current record is, overwrite all stored
bytes. This can be done by setting all bytes to '\0' in the
range of the record's indices in data.
- Compacting the utilized space in data to maximize the free
space. All records that are stored to the left of the current record
should be shifted, one by one, over to fill the gap in data.
Be sure to update the respective entries in the slot directory as well
as the free space markers in header.
- Mark the slot as unused and be sure
to update all relevant header information.
If this is not the last slot in the direcotry, or
compaction is not allowed, you are done.
- If compaction is allowed, and you deleted the record
stored in the last slot in the directory,
delete the actual slot as well (to maximize free space).
You should repeat this check to see if the new "last" slot is also
unused. Continue deleting slots until you hit a used one. Note that
in all circumstances, you should compact the utilized record space.
The allow_slot_compact parameter only refers to the optional
shrinking of the slot directory if there are unused slots at the end.
Coding and Testing
In Lab 3, you spent most of your time understanding the tests I provided;
only a few methods were not covered in those tests. For this lab, you
will be given much less guidance by me. As usual, you are more than welcome
to discuss testing strategies with your classmates (but not the testing
code). Think carefully about how your implementation could go wrong.
For example, are you properly calculating the free space available?
You can test this by inserting the exact amount of data a page should fit
and see if it succeeds. Then, see if you Page correctly fails
if you try to insert even a little more.
Acknowledgements
The code base was provided and developed by the University of
Wisconsin Database Group
Submitting your lab
Submit using handin44. Please run make clean before
submitting to keep file sizes down. Also, be sure to complete
the README file.