Quick Links
This assignment is to be done with a partner. You may not work with
other groups, and the share of workload must be even between both partners.
Failing to do either is a violation of the department Academic
Integrity policy.
Introduction
A key distinction in data structures designed for database management systems is
the fact that data (i.e., relations) are stored in files which reside on disk.
In addition to discussing the effect of data structure design, we must also
discuss each file type's particular organization and how that organization
lends itself to efficient
evaluation of some (not all) of the following operations: scan, equality search,
range search, insertion, and deletion.
When it is important to access a relation
quickly in more than one way, a good solution is to use an index.
For this assignment, the index will store data entries in the form < key, rid>
(Alternative 2 for data entries in terms of index values). These data entries in the
index will be stored in a file that is separate from the data file. In other words,
entries in the index file "point to" entries in the data file where the actual
records are stored.
Two primary kinds of indexes are hash-based and tree-based, and the most
commonly implemented tree-based index is the B+ tree.
In this assignment, you will implement a B+ tree, including the full search and
insert algorithms as discussed in class. Your insert routine must be capable of
dealing with overflows (at any level of the tree) by splitting pages.
The goals of this assignment include:
- Understand and implement a B+ tree
- Constructing an index on an existing file (relation) to improve efficiency
- Organizing a variety of file/page types as containers for objects
- Developing a testing strategy for a large, intricate system
- Simulate central relation operations including scanning,
searching (equality and range), insertion, and deletion
There are number of design choices that you need to make, and you
probably need to reserve a big chunk of time for testing and debugging.
So, start working on this assignment early; you are unlikely to finish
this project if you start is just a week before the deadline.
WiscDB and Getting Started
The goal of the WiscDB projects is to allow students to learn about the
internals of a data processing engine. In this first assignment, you will
built a buffer manager, on top of an I/O Layer that I provide to understand
the management of main memory in a DBMS. In part two, you dove
into the I/O Layer to implement a heap page.
In this last part,
you will construct a B+ tree class to index relations on disk.
In the previous two labs, we simplified the topic of records by not
directly dealing with relations. In this lab, records will be more
complex. This will require you to navigate a wide-range of concepts
including efficiency, relations, indexes, and the use of low-level memory.
As such, the depth of design choices and testing strategy will
increase significantly. Furthermore, you will re-use many of the
constructs from previous labs in your implementation, In particular,
you will use (modified) page representations to map nodes on to disk and
you will use a buffer manager to manage the interface between your
B+ tree and the actual data on disk.
The code base is quite extensive and will require much reading of
API documentation and thorough testing.
Note: For this lab, I have attempted to directly link all class names
to their API documentation. This should make understanding the interface for a class more streamlined for you.
You are responsible for making
significant progress early on the in assignment as waiting until the last
few days will not be manageable.
To get started, run update44
to obtain the starting point files in ~/cs44/labs/5/. You should
obtain the following files (files highlighted in blue require modification):
- Makefile - pre-defined. You may edit this file to add
extra source files or execution commands.
- include/ - contains header files for the Page, File, Page and related classes.
These do not need to be modified, but each class
must be well understood to manage the interface to the I/O layer. While
reading the header files may be helpful, you should start with the online
WiscDB documents first.
- lib/ - necessary object files. This directory can be ignored
and should not be modified. It provides the object files for all classes
related to the buffer manager (i.e., Lab 1 solutions) and I/O layer (i.e.,
Lab 3 solutions).
- exceptions/ - defines the list of possible exceptions for WiscDB.
You will need to reference these exceptions to both handle possible
errors that can be thrown to you or that you must throw. Again,
it is probably easier to refer to the online documentation.
- btree.h/.cpp - You must edit these
files to implement the
B+ Tree Index. The header
file has been completed for you, and is quite extensive. You should
modify it to add necessary private methods but do not modify the
public interface.
- fileScanner.h/.cpp - Defines the
FileScanner class which iterates over instances of a relation.
- main.cpp - The main testing
grounds for this lab. Feel free to add any additional test files
as needed.
- README - a few wrap-up questions
for you to answer about the lab assignment.
When you are ready to submit your lab, use
handin44. Recall that
only files in the
~/cs44/labs/5 subdirectory will be submitted. You
may submit as many times as you wish; only the most recent copy will be saved.
I/O Layer: Modification to Files and Pages
To help get you started, we will provide you with an implementation of few new classes: PageFile, RawFile, and FileScanner.
The PageFile and RawFile classes are derived from the File class. These classes implement a file interface in two different ways.
The PageFile class implements the file interface for the File class as was done in the previous two WisdDB labs - a file
consists of pages formatted using the Page class (i.e., a
double-linked list of heap pages).
We use the PageFile class to store all the relations.
The RawFile class implements the file interface for a less restricted file organization in which the pages in the file are not linked by prevPage/nextPage links, as they are in the case of the PageFile class. When reading/writing pages, the RawFile class treats the pages as blobs of 8KB size and does not require a
certain structure (i.e., pages are not instances of the Page class). We will use the RawFile class to store the B+ index file, where every page in the file is a node from the B+ tree. Since no other class requires RawFile pages to be valid objects of the Page class, we can modify these pages as we wish without worrying that these pages will not be valid after their arbitrary modification. Inside the file btree.cpp you will treat the pages from a RawFile as your B+ tree index nodes, and the RawFile class will read/write pages for you from disk without modifying/using them in any way. The
Page class has also been changed so that it does not use page objects to find out their page numbers.
The FileScanner class is used to scan records in a file. We will use
this class for the base relation, and not for the index file.
The file main.cpp file contains code which shows how to use this class.
The public member functions of this class are described below.
- FileScanner(const std::string &relationName, BufMgr *bufMgr)
The constructor takes the relation name and buffer manager instance as
parameters. The methods described below are then used to scan the relation.
- ~FileScanner()
Shuts down the scan and unpins any pinned pages.
- void scanNext(RecordId& outRid)
Returns (via the outRid parameter) the RecordId of the next
record from the relation being scanned. It throws EndOfFileException
when the end of relation is reached.
- std::string getRecord()
Returns the record identified by rid. The rid is obtained by a preceding
scanNext() call.
- void markDirty()
Marks the current page being scanned as dirty, in case the page was being
modified. You probably won't need to use this in this assignment.
A B+ Tree Index
Your assignment is to implement a B+ tree index. To simplify the task, we
will make the following assumptions:
- All records in a file have the same length (so for a given attribute
its offset in the record is always the same).
- The B+-tree only needs to support single-attribute indexing (not composite attribute).
- The data type for the index attribute will be limited to one of three
possibilities: integer, double, or string.
- Furthermore, string keys can be truncated to the first 10 characters.
That is, assume that all string attributes are at least 10 characters and that
only the first 10 need to be used for indexing.
- All keys are unique. That is, you will not need to handle inserting
two keys with the same value.
The index will be built directly on top of the I/O Layer
(the RawFile and the Page classes). An index will need to
store its data in a file on disk, and the file will need a name (so that the
DB class can identify it). The convention for naming an index file is
specified below. To create a disk image of the index file, you simply use the
RawFile constructor with
the name of the index file. Since the file is unformatted, you will need
to implement a structure on top of the pages that you get from the I/O Layer
to represent the nodes of the B+ tree. Furthermore, where the File
class abstracted the creation a header page, you will need to directly
allocate a page and format it to be the header page (i.e., store meta-data)
for your index.
We'll start you off with an interface for a class, BTreeIndex. You will
need to implement the methods of this interface as described below. You may add new private methods to this class if needed (hint: it should be), but do
not modify the public interfaces that are described here:
- Constructor
The constructor first checks if the specified index file exists - if it
does, simply load the file and the header page which stores the
relevant meta-data. You should verify that the parameters match
the meta-data in the loaded index (e.g., file name, attribute offset,
attribute type, etc.). If they do not, throw an
BadIndexInfoException.
If the file does not exist, you will need to create it and
construct the index as defined below. To specify this, it is best
to examine the role of each parameter:
- Destructor
Performs any cleanup that may be necessary, including clearing up any
state variables, unpinning any B+-tree pages that are pinned, and
flushing the index file (by calling bufMgr->flushFile()).
Note that this method does not delete the index file itself! It is
useful to recall that deletion will call the destructor of File
class causing the index file to be closed.
- insertEntry
Inserts a new data entry (i.e., pair). The
parameter key is of type void* which means the
actual type has been stripped away. You will need to use your member
variables to identify what type it is; e.g., a pointer to an integer,
double, or c-string. You should follow the algorithm presented in class
e.g., split overflowing leaves and nodes.
- startScan
This method is used to begin a "filtered" scan of the index.
For example, if the method is called using arguments ("a",GT,"d",LTE),
the scan should seek all entries greater than "a" and less than or equal
to "d". Recall that the operators are an enumerated type to make it
easier to define which operator to use. lowValue and
highValue are pointers that need to be interpreted as mentioned
above. Recall that you should only use a fixed prefix of strings
(defined to be 10 in the code) for key comparisons.
This method does not return any values; you will need to use scanNext to obtain the results one-by-one. You should first check to see
if there is already a scan in progress - if so, first end that scan
before initializing a new one. You should
throw a BadScanrangeException if the user provides a low value
greater than the high value and throw a BadOpcodesException if
the lowOpParm or highOpParm are not legal (e.g.,
the lowOpParm must be either GT or GTE).
The function should point to the first matching record when it is
complete; if no such record exists (i.e., no keys match the search
criteria) throw an NoSuchKeyFoundException.
- scanNext
After initializing a scan using startScan, this method returns
the rid of the next matching record for the search
through the outRid parameter. You should be
maintaining global information about the current search using existing
data members for the BTreeIndex. If there is nothing to return,
you should throw an IndexScanCompletedException. You should
be sure to keep any leaf page that is currently being scanned pinned
between searches - logically, the page isn't free until the page is
fully scanned or the scan is complete. As discussed in class, each
leaf page should have a sibling pointer to help continue the scan
between leaves.
- endScan
Terminates the current scan, unpinning any pages related to the scan. You
should throw a ScanNotInitializedException if there is no
current scan in process for either this method or scanNext.
Tips and Additional Details
Here is some additional information to help complete the lab:
Testing and Design
The majority of your time on this lab will be spent designing and testing your
program. When designing your solution, be sure to consider the major hurdles:
- There are two types of nodes - leaf and non-leaf nodes
- There are three possible enumerated types - integers, doubles, and strings
You will need to design several short helper methods for common operations. In
many cases, you can take advantage of templating. In others, templates
help but you will need to condition your code on the
attributeType -
this arises due to the shortcomings of polymorphic code in real systems (e.g.,
the index needs to be aware of the actual type to be aware of the byte layout;
key storage varies by type; nodes are not polymorphic). Your overall design
may benefit from minimizing the amount of code that is dependent on the actual
type. If possible, push the need to check the type as late as possible, and
into specialized helper methods (see the example on Piazza for extracting keys).
You should develop a testing strategy that parallels your design. The provided
test is not useful for incremental development. You should devise smaller
tests. Ideas include:
- Write a print method that does a traversal of the tree, printing out
information stored in each node.
- Create smaller/simple relations that specifically invoke events like
splitting.
- Feel free to hard code the leafOccupancy and
nodeOccupancy to small values for early testing. Be sure to
fix this before handing in. For example, insert 10 items into a tree,
see if your leaf insertion works properly. Then insert 100 items to invoke
a leaf splitting. Etc.
- Write a short test that creates a B+ tree, closes it, and then
attempts to read it back from disk to verify you are saving header information
properly.
- Search for TODOs in main.cpp for tips on completing the
provided test as well as on ideas for creating new tests.
- Be sure to also test for expected errors (e.g., giving a bad search range).
- You can use the existing relation-creation code to test double and string
keys as well.
- The current test inserts keys from smallest to largest in order. That
is a poor assumption to make. Be sure to try more "random" orderings.
Submitting your lab
Submit using handin44. Please run make clean before
submitting to keep file sizes down. Also, be sure to complete
the README file.