Introduction
This assignment is to be done with your assigned Lab 3 partner.
You may not work with other groups, and the share of workload must be
even between both partners. Failing to do either is a violation of
the department Academic Integrity policy.
Please read over Expectations for Working with Partners on CS Lab work
Lab 3 Partner List for Lab Section A
Lab 3 Partner List for Lab Section B
Lab 3 Goals
The goal of the WiscDB projects is to allow students to learn about the
internals of a data processing engine. In the first assignment, you
built a buffer manager, on top of an I/O Layer that I provide, to
understand the role of a system in managing main memory efficiently.
In this second part, we will drill into this I/O layer to
understand details of managing data in a file. You will complete the
heap file representation of the data - that is, one where records
are stored in an unsorted manner. This will allow you to directly simulate
the physical structure of data on disk and help prepare you for future labs.
Recall that a file is a collection of pages which themselves maintain a
collection of records. I have provided you with the File class,
which provides a logical ordering of pages for you (e.g., a linked list
representation of pages). You must complete the I/O layer by implementing
the Page class, which provides the physical layout of records on a
disk page.
The code base is quite extensive and, as with the BufferManager lab, will
require much reading of API documentation and
thorough testing. You are responsible for making significant progress early on
the in assignment as waiting until the last few days will not be manageable.
The goals of this assignment include:
- Understanding the structure of heap files in the context of
a physical disk layer
- Managing variable-length records
- Simulating the logical structure of a physical page by maintaining
a directory of slots
- Learning to manage and manipulate low-level data structures in C++
- Understand the role of abstraction in large systems
- Developing a thorough testing methodology for a large-integrated system
Lab 3 Starting point
First, find your Lab3-partner1-partner2 git repo off the GitHub server for
our class:
CS44-f16
Next, clone your Lab 3 git repo into your cs44/labs subdirectory, cd
into your repo, and run make setup:
cd
cd cs44/labs
git clone [the ssh url to your repo]
cd Lab3-partner1-partner2
make setup
Here are some instructions on
Using Git page (follow the instructions for repos on Swarthmore's GitHub Enterprise server).
If all was successful, you should see the following files and symlinks
(files highlighted in blue require modification):
- Makefile - pre-defined. You may edit this file to add
extra source files or execution commands. To test your implementation
alongside your buffer manager from Lab 3, you will need to uncomment the second
target and copy over the necessary files from Lab 3.
- page.cpp - the central
implementation file for this lab. You must edit this file to complete the I/O
layer.
- pageTester.cpp - your main testing file.
Unlike the previous lab, it is up to you to implement the proper tests for your
solution.
- README - a few wrap-up questions for you
to answer about the lab assignment.
Running
make setup will add symlinks to execptions, include,
lib directories. You can open .h and .cpp files in exections and include
to see definitions. It is probably easiest to refer to the
online documentation for exceptions.
The WiscDB I/O Layer: Heap Files
The lowest layer of the WiscDB database systems is the I/O layer.
This layer consists of two classes: a file (class
File) and a page (class
Page) class.
These classes use C++ exceptions to handle the occurrence of any unexpected
event. Implementation of the
File class, and the exception classes
are provided to you. The code has been adequately commented to
help you with understanding how it does what it does.
At a high-level, the File class abstracts the physical structure
of files from upper-levels of the database. It stores a linked-page structure
of heap-file pages that allow iteration of contents of a file and the
storage of records. The File class allows the upper level of the system to:
- create/destroy files
- allocate/deallocate pages within a file
- read and write pages of a file
Before reading further you should first read the documentation for the
File class while reviewing the concept of a heap file in Chapter 9
of the textbook. Again, this implementation has been completed and thus
the details of managing a collection of pages is an abstraction for
both higher layers (e.g., the buffer manager) as well as lower layers (i.e.,
the page layer).
Heap Page
The Page class describes the structure and set of operations that
can be performed on a slotted heap page. A page stores a set of records and
supports operations such as:
- inserting/deleting records
- retrieving and updating records
- iterating over all records on a page
Specifically, you will implement a
variable-length slot representation. Review your material from lecture
and the textbook. Our representation is very similar to the one depicted
on Figure 9.7 with the following alteration: the header and slot directory
is stored at the beginning of the page (the directory grows forward) and the
heap space is filled from back to front. The structure is illustrated
below:
In your Heap Page implementation:
- Records should always be compacted starting at the end of the page
and growing forward
- Slot directory grows from the header towards the end of the page.
- Record IDs are (slot num, page num)
- slot num values start at 1 (not 0)
Your job in this assignment is to complete the implementation of the
Page class (read the page.h file, implement page.cpp). You
will have to pay close attention to pointer type and make use of sizeof
to access different types within a page (an array of bytes/char).
You are provided the following structures to represent a page object,
defined in page.h (DO NOT ADD or MODIFY DATA MEMBERS):
- A PageHeader structure, which stores meta data such as pointers
(i.e., indices into the data array) to the beginning and end of free
space; the number of allocated slots; the number of free slots; the page number
of the current page as well as the next page number in the linked structure.
- A data array containing all of the records. This array is of
fixed size (the size of a page minus the header) and of generic type. You will
map both the slot directory (at the beginning of the array) and
stored records slots (starting from the end of the array) into this array
using casting and pointer arithmetic.
- A PageSlot structure which represents the offset and length
for one slot as well as whether the slot contains a record or not.
Thus, a page is simply the concatenation of the header and data array.
So another view of the Heap File Page organization shown above, is in
terms of the wiscdb Page class definition. The Page class has
two fields (header and data). The data field
is used to store the slot directory and the data records. The slot
directory grows into freespace from one end of the data array, the
record data compacted from the other end:
Implementation of the Page class
The Page class consists of several public methods that are utilized
as the interface to a heap page as well as a set of private methods which are
intended as either helper methods or restricted access methods to be used
by friend classes (such as File and PageIterator).
The following is a list of methods in no particular order. You should not
simply implement each method one-by-one; you need to carefully plan your
implementation in tandem with a testing strategy. Failure to do so will
lead to significant problems as the data array structure is handled
directly by you.
Provided methods
These methods have been implemented for you:
- Page() constructor which calls reset() to initialize a page
- begin(),end() methods utilized for scanning a page of records
- bool hasSpaceForRecord(const char* record_data) const calculates
the amount of space needed in the page for a record to fit. If there is a slot
in the directory available to use, this is simply the length of the record.
Otherwise we must consider the growth of the slot directory in the calculation.
- page_number(), next_page_number(), setPageNumber(), setNextPageNumber() getter and setter methods for the current page number and
next page number. These are handled by the File object.
- isUsed() tells the File class is valid or not.
Basic accessor/mutator methods to implement
- PageSlot* getSlot(const SlotId slot_number)
const PageSlot& getSlot(const SlotId slot_number) const
These methods return a pointer or reference to the location of a particular
slot in the data array (in the portion of the data array currently used to
store slot directory entries). One returns a pointer to the data for potential
modification of the underlying data by the caller, while the other returns
a reference to the actual PageSlot by disallowing modification (const).
You should implement these first, otherwise you will not understand the
structure of the Page.
NOTE: Implement these methods first and show me
or email me your implementation of the pointer version or both of these as soon
as you think it is correct so that I can make sure you are
starting off correctly.
- void reset()
This method initializes a new heap page (or resets an
existing one) to default values. It should set all header
values to reasonable defaults (follow the documentation for the
PageHeader class in page.h) and fill
the data array completely with null characters
(i.e., '\0'). Note that
INVALID_NUMBER is a pre-defined constant that may be useful
and that the pointers to be stored are simply indices into the data array.
- getFreeSpace()
returns the amount of bytes available for
storing records. You should simply consider the distance between the free
space pointers.
Inserting a record
Records to be inserted into a page are passed as an C-style string
(an array of char with a terminating '\0' character). When you insert
a new record in a Heap Page, you
should NOT store the terminating
'\0' part of each record (i.e. the record "Tia" should take up
3 bytes of space in the page and not 4).
Record Id consists of the
(slot_number, page_number). NOTE:slot_numbers start at 1 not 0.
These are the set of methods involved in adding a record to a page:
- RecordId insertRecord(const char* record_data)
This member
function is the public interface for adding a record to the page and relies on
several helper methods (see below). If the method is successful, it returns
the RecordId which is simply a tuple containing the page number and
slot number (i.e., {PageId,SlotId}) for the stored record. The method
should throw a InsufficientSpaceException if there is not enough free
space to fit the record. This method:
- Checks for sufficient space
- Obtains an available slot
- Inserts the record in the appropriate free space and updates the given slot
- Returns the id for the record
Each of these steps should be completed using private helper methods
as much as possible.
- bool hasSpaceForRecord() see above in pre-defined methods
- SlotId getAvailableSlot()
Obtains a SlotId for the directory. There are two scenarios:
- If there are any free (i.e., unused) slots in the directory we should reuse
that slot. This requires a scan across the slot directory, checking to see if
the slot is not used (note, the directory uses 1-based indexing so the slot
directory is over indices [1,header.num_slots]. Simply return the
index of the available slot.
- If there are no free slots, you should grow the directory by updating the
necessary header information (i.e., number of slots, number of free
slots, beginning of free space). You will need to
return its index (i.e., the new last index in the directory).
In both cases, you should mark this slot as used before returning (this
function has allocated the slot in the slot directory for use). Because
this is a private method function, you can assume that it will never be
called if there isn't enough space to create a new slot in the slot
directory (and your code that calls this function should ensure that this is
the case).
- void insertRecordInSlot(const SlotId slot_number,
const char* record_data)
A private helper method for placing record_data in the data array
and record it's meta information at position slot_number in the
directory. Your method should:
- Check that the provided slot number is valid (i.e., within the range
of indices into the slot directory [1,num_slots]. It should be
because you should have allocated a slot for it prior to calling this
function, but this is a sanity check. Throw an InvalidSlotException
if it is not a valid slot number.
- Update the slot's used, length, and offset field to the correct values.
HINT: strlen may prove useful here. Recall that the offset is to
the beginning of the record and that records are inserted from back to
front on the page (i.e., the record will be stored in the indices ending
at the free_space_end pointer.
- Reduce the number of free slots
- Copy the record into the data array. Remember to not include
storing the terminating '\0' char in the copy of the record on the page.
Getting a record
To obtain a record from a page, you will implement the following:
Delete a Record
There are two methods for deleting a record:
- void Page::deleteRecord(const RecordId& record_id)
The public method for deleting a record. That has been defined for you
to utilize the private method below.
- void deleteRecord(const RecordId& record_id,
const bool allow_slot_compaction)
This method deletes an existing
record and will be the most detailed implementation for your lab. The method
takes in two parameters: a record_id which must be validated is in
getRecord and allow_slot_compaction which, when true, allows
for the slot directory to shrink if and only if the last record in the
directory is removed. Updates do not want this behaviour, but all other
deletes do. The major steps for this method are:
- Validate the record ID
- After finding where the current record is, overwrite all stored
bytes. This can be done by setting all bytes to '\0' in the
range of the record's indices in data.
- Compacting the utilized space in data to maximize the free space.
All records that are stored to the left of the current record should be
shifted, one by one, over to fill the gap in data. Be sure to update
the respective entries in the slot directory as well as the free space markers
in header.
- Mark the slot as unused and be sure to update all relevant header
information. If this is not the last slot in the direcotry, or compaction is
not allowed, you are done.
- If compaction is allowed, and you deleted the record stored in the last
slot in the directory, delete the actual slot as well (to maximize free space).
You should repeat this check to see if the new "last" slot is also unused.
Continue deleting slots until you hit a used one. Note that
in all
circumstances, you should compact the utilized record space.
The allow_slot_compact parameter only refers to the optional
shrinking of the slot directory if there are unused slots at the end.
The memmove function may be useful for compacting the record space after
remove:
void *memmove(void *dest, const void *src, size_t n);
See the man page for more information about this function:
man 3 memmove
Updating a record
- void Page::updateRecord(const RecordId& record_id,
const char* record_data)
This method replaces the data stored for an existing record.
The new record data may not be the same size as the old record data.
If it is larger, then there may not be enough space in the page to
allow the update. If it is smaller or the same size, update should
succeed. Updating a record does not change its RID.
To do so you should:
- Validate the record
- Check that there is adequate space to complete the
update. Carefully think about how to calculate this. Do not delete the old
record if you cannot later find the space to insert the new record! If there
is not sufficient space, throw an InsufficientSpaceException.
- Delete the existing record (do not let the slot be compacted since you will
soon reuse it).
- Insert the new record in the same slot the old record
existed to preserve the record id.
Coding and Testing
In the BufferPool lab, you spent most of your time understanding
the tests I provided; only a few methods were not covered in those tests.
For this lab, you will be given much less guidance by me. As usual,
you are more than welcome to discuss testing strategies with your classmates
(but not the testing code). Think carefully about how your implementation
could go wrong.
For example, are you properly calculating the free space available?
You can test this by inserting the exact amount of data a page should fit
and see if it succeeds. Then, see if you Page correctly fails
if you try to insert even a little more.
You can test your heap Page implementation for correctness without having
the buffer manager layer. Create a file, allocate a Page in the file, and
do your testing on that Page. You are not testing the full Heap File
implementation, and you are not testing your Heap Page as it would be
used in a full DBMS system implementation (where there would be a BufferPool
into which copies of file pages would be loaded to be accessed and modified).
Acknowledgements
The code base was provided and developed by the University of
Wisconsin Database Group.
Submitting your lab
Before the Due Date, push your solution to github from one of your
local repos to the GitHub remote repo.
From your local repo (in your ~you/cs44/labs/Lab03-partner1-partner2
subdirectory)
git add *
git commit -m "our correct, robust, and well commented solution for grading"
git push
If that doesn't work, take a look at the "Troubleshooting" section of the
Using git
page.
Also, be sure to complete the README.md file, add and push it.