Lab 2: Heap Page Implementation
Due by 11:59 p.m., Saturday, October 27, 2018.
This is a partnered lab. You are to complete this lab with one other person, who must attend the same lab as you. You may discuss the lab concepts with other classmates. Please use Teammaker to set your lab partner preferences. You can choose “Random” if you do not have a partner. Remember the Academic Integrity Policy: do not show your code/solution to anyone outside of the course staff and your lab partner. Do not look at any other group’s code/solution for this lab (current or past). If you need help, please post on Piazza.
- Overview
- Getting Started
- The WiscDB I/O Layer: Heap Files
- Heap Page
- Implementation of the
Page
class - Coding and Testing
- Acknowledgements
- Submitting your lab
Overview
The goal of the WiscDB projects is to allow students to learn about the internals of a data processing engine. In the first assignment, you built a buffer manager on top of an I/O Layer that I provided. The aim was to understand the role of a system in managing main memory efficiently.
In this second component of the WiscDB labs, we will drill into the I/O layer to understand
details of managing data in a file. You will complete a heap file implementation
that stores data in variable-length records. This will allow you to directly simulate the physical
structure of data on disk and help prepare you for future labs. Recall
that a file is a collection of pages which themselves maintain a
collection of records. I have provided you with the File
class, which
provides a logical ordering of pages for you (e.g., a linked list
representation of pages). You must complete the I/O layer by
implementing the Page
class, which provides the physical layout of
records on a disk page. The code base is quite extensive and, as with
Buffer Manager lab, will require much reading of API
documentation and thorough testing. You are
responsible for making significant progress early on the in assignment
as waiting until the last few days will not be manageable. The goals of
this assignment include:
- Understanding the structure of heap files in the context of a physical disk layer
- Managing variable-length records
- Simulating the logical structure of a physical page by maintaining a directory of slots
- Learning to manage and manipulate low-level data structures in C++
- Understand the role of abstraction in large systems
- Developing a thorough testing methodology for a large-integrated system
Getting Started
Click to expand
Both you and your partner will share a repo named Lab2-userID1-userID2
. Note that the lab will not release until you have both marked your partner preferences on Teammaker. You should find your repo on the GitHub server for our class: CPSC44-F18
Clone your Lab 2 git repo containing starting point files into your labs directory that you created last week:
cd ~/cs44/labs
git clone [the ssh url to your your repo]
cd Lab2-userID1-userID2
If you need help, see the instructions on Using Git (follow the instructions for repos on Swarthmore’s GitHub Enterprise server).
If all was successful, you should see the following files (highlighted files require modification):
Makefile
- pre-defined. You may edit this file to add extra source files and execution commands if you add more test files.page.h
- definitions for thePage
class as well as structs forPageHeader
andPageSlot
. DO NOT MODIFY THE GIVEN METHOD DECLARATIONS OR DATA MEMBERS. You may add additional private methods and a public print method.page.cpp
- the central implementation file for this lab. You must edit this file to complete the I/O layer.pageTester.cpp
- your main testing file. As with the previous lab, it is up to you to implement the proper tests for your solution.README
- a few wrap-up questions for you to answer about the lab assignment.
In addition, you should run make setup
(just once) to create symlinks to shared libraries that are needed for your programs to run:
cd ~cs44/Lab2-userID1-userID2/
make setup
include/
- contains header files for thePage
,File
and related classes. While reading the header files may be helpful, you should start with the online WiscDB documents first.lib/
- necessary object files. The contents of this directory can be ignored.exceptions/
- defines the list of possible exceptions for WiscDB. You will need to reference these exceptions to both handle possible errors that can be thrown to you or that you must throw. Again, it is probably easier to refer to the online documentation.
The WiscDB I/O Layer: Heap Files
The lowest layer of the WiscDB database systems is the I/O layer. This
layer consists of two classes: a
File
and
Page
class. These classes use C++
exceptions to handle the occurrence of unexpected behavior.
Implementation of the File
class and the exception classes are
provided to you. The code has been adequately commented to help you
with understanding how it does what it does.
At a high-level, the File
class abstracts the physical structure of
files from upper-levels of the database. It stores a linked-page
structure of heap-file pages that allow iteration of contents of a file
and the storage of records. The File
class allows the upper level of
the system to:
- create/destroy files
- allocate/deallocate pages within a file
- read and write pages of a file
Before reading further you should first read the documentation for the
File
class while reviewing the concept of a heap file in Chapter 9 of
the textbook.
Heap Page
The Page
class describes the structure and set of operations that can
be performed on a slotted heap page with variable-length records. A page stores a set of records and
supports operations such as:
- inserting/deleting records
- retrieving and updating records
- iterating over all records on a page
Specifically, you will implement a variable-length slot representation. Review your material from lecture and the textbook. Our representation is very similar to the one depicted on Figure 9.7 in the text with the following alteration: the header and slot directory is stored at the beginning of the page (the directory grows forward) and the heap space is filled from back to front. The structure is illustrated below:
- The page begins with a header (defined as a
struct
PageHeader
). - After the header comes the slot directory, which grows forward (i.e., new slots get added to the right side of the directory).
- Records are added to the back end of the page, moving backward - meaning the free space end offset decreases with each insertion. Note that records themselves are not flipped in orientation (they still read back-to-front).
- Record IDs are a combination of slot and page
{page_number, slot_number}
. - Slot ID values start at 1 (not 0).
Your job in this assignment is to complete the implementation of the
Page
class (read the page.h
file, implement page.cpp
) and throughly test its mechanics. You will treat the page is a large, untyped array of bytes (e.g., a char
array of size 8192), and manually format the contents. You will have to pay close attention to pointer types, and make use of sizeof
to access different values within a page.
You are
provided the following structures to represent a page object defined in page.h
(DO NOT ADD or MODIFY DATA MEMBERS):
- A
header
structure (typePageHeader
), which stores meta data such as pointers (or offsets) to the beginning and end of free space; the number of allocated slots and the number of free slots; the page number of the current page as well as the next page number in the linked structure.Click to see the
PageHeader
structstruct PageHeader { std::uint16_t free_space_begin; std::uint16_t free_space_end; std::uint16_t num_slots; std::uint16_t num_free_slots; PageId current_page_number; PageId next_page_number; };
- A
data
array containing all of the records. This array is of fixed size (number of bytes in a page excluding the header) and of generic type. You will map both the slot directory (at the beginning of the array) and stored records slots (starting from the end of the array) into this array using casting and pointer arithmetic. - A
PageSlot
structure which represents the offset and length for one slot as well as whether the slot contains a record or not.Click to see the
PageSlot
structstruct PageSlot { // Whether the slot currently holds data or is unused bool used; // Offset of the record in the page. std::uint16_t offset; // Length of the record in this slot. std::uint16_t length; };
Implementation of the Page
class
The Page
class consists of several public methods that are utilized as
the interface to a heap page as well as a set of private methods which
are intended as either helper methods or restricted access methods to be
used by friend classes (such as File
and PageIterator
). The
following is a list of methods in no particular order. You should not
simply implement each method one-by-one; you need to carefully plan your
implementation in tandem with a testing strategy. Failure to do so will
lead to significant problems as the data
array structure is handled
directly by you.
Provided methods
These methods have been implemented for you:
Page()
constructor which callsreset()
to initialize a page.begin()
andend()
methods utilized for scanning a page of records.hasSpaceForRecord(const char* record_data)
which calculates the amount of space needed in the page for a record to fit, returningTrue
if the free space is adequate to fit the record.page_number(), next_page_number(), setPageNumber(), setNextPageNumber()
getter and setter methods for the current page number and next page number. These are set by theFile
class.isUsed()
tells theFile
class if the page is valid or not (i.e., it has been assigned a page number).
You should test these methods, but should not modify the implementations. I strongly suggest studying hasSpaceForRecord()
to better understand the arithmetic required to manage the unformatted data
array.
Basic accessor/mutator methods
const PageSlot& getSlot(const SlotId slot_number) const
PageSlot* getSlot(const SlotId slot_number)
These methods return a reference or pointer, respectively, to the location of a particular slot in thedata
array. The first returns aconst
reference to the slot to provide read-only access to the offset and length values. The second function returns a pointer to the slot for potential modification; this should be used when setting the offset and/or length values. You should implement these two methods first, otherwise you will not understand the structure of thePage
. You will need to use type casting to treat part of thedata
array as aPageSlot
:(PageSlot *) &data; //returns a PageSlot pointer that points to the first slot
CHECKPOINT: Implement these methods first and show me or email me your implementation as soon as you think it is correct so that I can make sure you are starting off correctly.
To use these methods, keep track of whether you are in a getter (
const
) function or setter. Getters will automatically call the first version:PageSlot slot = getSlot(id); if (slot.used == true){ //example usage //do something }
Setters will automatically call the pointer version:
PageSlot* slot_ptr = getSlot(id); slot_ptr->used = true; //mark slot as used
void reset()
This method initializes a new heap page (or resets an existing one) to default values. It should set allheader
values to reasonable defaults (follow the documentation for thePageHeader
class inpage.h
).- Carefully consider how you plan to use
free_space_end
- it can be set as the last unused byte before the record space (DATA_SIZE-1
) OR as the first used byte of the record space (DATA_SIZE
). Both options work as long as you are consistent throughout your code. - Fill the
data
array completely with null characters (i.e.,'\0'
). - For the page numbers, recall the use of
INVALID_NUMBER
from the previous lab as a designator for invalid pages.
- Carefully consider how you plan to use
getFreeSpace()
Returns the amount of bytes available for storing records. You should simply consider the distance between the free space pointers. Again, your answer depends on whatfree_space_end
means in your implementation. Be consistent.
Inserting a record
Records to be inserted into a page are passed as an C-style string (an array of char with a terminating '\0'
character). When you insert a new record in a Heap Page, you should NOT store the terminating '\0'
part of each record (i.e. the record "Tia"
should take up 3 bytes of space in the page and not 4). An inserted record obtains an identifying RecordId
that consists of a page number and slot id. NOTE: slot ids start at 1 not 0.
These are the set of methods involved in adding a record to a page:
RecordId insertRecord(const char* record_data)
This is the public interface for adding a record to the page and relies on several helper methods (see below). If the method is successful, it returns theRecordId
– a tuple containing the page number and slot number (i.e.,{PageId,SlotId}
) for the stored record. The method should throw anInsufficientSpaceException
if there is not enough free space to fit the record. This method:- Checks for sufficient space
- Obtains an available slot
- Inserts the record in the appropriate free space and updates the given slot
- Returns the id for the record
Each of these steps should be completed using the helper methods as much as possible.
bool hasSpaceForRecord()
see above in provided methodsSlotId getAvailableSlot()
Obtains aSlotId
for a slot that can store the offset/length for the new record. There are two scenarios:- If there are any free (i.e., unused) slots in the directory we
should reuse that slot (i.e., return the
SlotId
). Scan across the slot directory, checking to see if the slot is not used. Return the first available slot index as aSlotId
. Recall that theSlotId
starts at 1, whereas C++ arrays use 0-based indexing, so be sure to account for the shift in your return value. - If there are no free slots, you should grow the directory by
updating the necessary
header
information (i.e., number of slots, number of free slots, beginning of free space). Return the newly createdSlotId
.
- If there are any free (i.e., unused) slots in the directory we
should reuse that slot (i.e., return the
void insertRecordInSlot(const SlotId slot_number, const char* record_data)
A private helper method for inserting; this method should placerecord_data
in the data array and update the corresponding slot ( indexslot_number
in the slot directory) to note the offset and length. Your method should:- Check that the provided slot number is valid (i.e., within the
range of indices into the slot directory
[1,num_slots]
. Throw anInvalidSlotException
if it is not a valid slot number. - Check that the given slot it is not in use already.
Attempting to insert in a used slot should result in a
SlotInUseException
. (NOTE: neither of these exceptions should not happen if getAvailableSlot is done correctly; it is useful for debugging.) - Update the slot’s
used
,length
, andoffset
fields to the correct values. Recall that in our model that the new record will be placed at the end of free space (i.e., the last byte of the new record will end at the currentfree_space_end
). You will want to usestrlen
here. - Reduce the number of free slots
- Copy the record into the data array. Be sure not to copy a null-terminating character
'\0'
into the data array (which is side effect of usingstrcpy
). You should usestrncpy
or write a for-loop to copy the characters one at a time e.g.,
strncpy(pointer to start point in data, record_data, length of record);
- Check that the provided slot number is valid (i.e., within the
range of indices into the slot directory
Retrieving a record
To obtain a record from a page, you will implement the following:
-
string getRecord(const RecordId& record_id) const
Given aRecordId
to retrieve, this method returns the record in a C++ string object. Be sure to validate therecord_id
(seevalidateRecordId()
). Once you have obtained the location of the record within the page, you should construct a string that is a copy of the record on disk. There are many ways to do this, but one is to use theassign
method and then copy over byte-by-byte. For example, to copy bytes 250-254 fromdata
, you could do the following:string toReturn; toReturn.assign(5,char()); //initializes string to 5 default characters for(uint16_t i = 250; i < 255; i++){ toReturn[i-250] = data[i]; }
-
void validateRecordId(const RecordId& record_id) const
Throws anInvalidRecordException
if therecord_id
is not valid. This can happen if thepage_number
does not match the current page, or if record’sslot_number
is not valid (either because it is out of bounds of the directory or the given slot is marked as unused).
Delete a Record
There are two methods for deleting a record. We will actively compact the record space on all deletes; you will also implement compaction on the slot directory if a slot at the end of the directory is not in use and we are not doing an update.
void Page::deleteRecord(const RecordId& record_id)
The public method for deleting a record. That has been defined for you to utilize the private method below.void deleteRecord(const RecordId& record_id, const bool allow_slot_compaction)
Private helper method that does the work of deleting an existing record. WARNING: This is the most involved method in this lab, so carefully test every aspect of this implementation. The method takes in two parameters: arecord_id
andallow_slot_compaction
which, whentrue
, triggers the slot directory to shrink if and only if the last record in the directory is the one being removed. This is turned off inupdateRecord
(below), buttrue
for the publicdeleteRecord()
method. The major steps for this method are:- Validate the record ID
- After finding where the current record is, overwrite all stored
bytes in this record’s space in
data
. This can be done by setting all bytes to'\0'
. - Compact the utilized space in
data
to maximize the free space. All records that are stored to the left of the current record should be shifted over to fill the gap indata
. HINT: do not try to search through the directory for each item to be moved; instead think about what indices ofdata
need to be shifted. - Update the offsets of all slots with records that were moved in the previous step (this does require a loop through the slot directory).
- Mark the slot as unused and be sure to update all relevant
header
data. If this is not the last slot in the directory, or slot compaction is not allowed, you are done. - If slot compaction is on, and you deleted the record stored in
the last slot in the directory, shrink the slot directory (and give the space to the free space region).
You should repeat this check to see if
the new “last” slot is also unused, continuing to shrink the slot directory until a used slot is found or there are no remaining slots. Update the
header
values as you proceed.
Update Record
Updating a record can be done in multiple manners, but probably the simplest given our representation is to utilize a delete of the old record followed by a insert of the new version of the record.
void Page::updateRecord(const RecordId& record_id, const char* record_data)
This method replaces the data stored for an existing record. The new record data may not be the same size as the old record data. If it is larger, then there may not be enough space in the page to allow the update. If it is smaller or the same size, update should succeed. Updating a record should not change its RecordId
. Follow these steps:
- Validate the record
- Check that there is adequate space to complete the update.
Carefully think about how to calculate this. Do not delete the
old record if you cannot later find the space to insert the new
record! If there is not sufficient space, throw an
InsufficientSpaceException
. - Delete the existing record (do not let the slot be compacted since you will soon reuse it).
- Insert the new record using the same
SlotId
the old record used; this is required in order to preserve the record id.
Coding and Testing
In the Buffer Manger lab, you spent most of your time understanding the tests I
provided; only a few methods were not covered in those tests. For this
lab, you will be given much less guidance by me. As usual, you are more
than welcome to discuss testing strategies with your classmates (but not
the testing code). Think carefully about how your implementation could
go wrong. For example, are you properly calculating the free space
available? You can test this by inserting the exact amount of data a
page should fit and see if it succeeds. Then, see if you Page
correctly fails if you try to insert even a little more.
You will want to try different testing strategies. I recommend utilizing a print
method for debugging, where you print out all of the contents of the slot directory, header, and records. You should make your tests increasingly interconnected; that is, first try just doing inserts, than deletes, than a mixture of the two, then a mixture of updates/deletes/inserts, etc. Your tests should test corner cases (e.g., the page is 100% filled). When calculating the size of page, note that SIZE
is 8192, sizeof(PageHeader)
is 16 bytes (meaning data
is of length 8176) and sizeof(PageSlot)
is 6 bytes on our machines (2 bytes for each field). Use these values to determine exactly on how many records you can insert on a page.
Acknowledgements
The code base was provided and developed by the University of Wisconsin Database Group.
Submitting your lab
Before the due date, push your solution to github from one of your local repos to the GitHub remote repo. Only one of you or your partner needs to do this.
From your local repo (in your ~cs44/labs/Lab2-userID1-userID2
subdirectory)
make clean
git add *
git commit -m "our correct, robust, and well commented solution for grading"
git push
If that doesn’t work, take a look at the “Troubleshooting” section of the Using Git page.
Also, be sure to complete the README.md
file, add and push it.