Checkpoint 1: Due in Lab week 10
creating a B+Tree, printing the B+Tree, and inserting
records into a B+Tree without splitting.
Checkpoint 2: Due in Lab week 11
insert with support for splitting leaf nodes. Inserting
records in different orders into a B+Tree. B+Tree File
save and reload (constructor for an existing B+Tree file on disk).
This leaves splitting internal nodes, scans, and developing more test code and more rigorous tests for the final week.
This assignment is to be done with your assigned Lab partner. You may not work with other groups, and the share of workload must be even between both partners. Failing to do either is a violation of the department Academic Integrity policy. Please read over Expectations for Working with Partners on CS Lab work
Your lab5 and beyond CS44 lab partner
A key distinction in DBMS data structures design vs. data structures you have designed and used in CS35, is that DBMS data structures are designed to be stored in files on secondary storage (usually disk). In addition to discussing the effect of data structure design, we must also discuss each file type's particular organization and how that organization lends itself to efficient evaluation of typical operations: scan, equality search, range search, insertion, and deletion.
A DBMS index is a structure to access a relation quickly, and potentially to provide additional access methods to tuples in a relation. For this assignment, you will implement a B+ Tree in index that will store data entries in the form < key, rid > (Alternative 2 for data entries in terms of index values). The actual tuples are stored in a file separate from the index; entries in the index file "point to" the location of the tuple using a RecordId.
You will implement a B+ tree, including the full search and insert algorithms as discussed in class. Insert must be capable of dealing with overflows (at any level of the B+Tree) by splitting pages. Your impelementation will always split when full, and should not impelement re-distribution.
The goal of the WiscDB projects is to allow students to learn about the internals of a data processing engine. In this first assignment, you built a buffer manager, on top of an I/O Layer that I provide to understand the management of main memory in a DBMS. In part two, you dove into the I/O Layer to implement a heap page.
In this last part, you will construct a B+ tree class to index relations on disk. In the previous two labs, we simplified the topic of records by not directly dealing with relations. In this lab, records will be more complex. This will require you to navigate a wide-range of concepts including efficiency, relations, indexes, and the use of low-level memory. As such, the depth of design choices and testing strategy will increase significantly. Furthermore, you will re-use many of the constructs from previous labs in your implementation, In particular, you will use (modified) page representations to map nodes on to disk and you will use a buffer manager to manage the interface between your B+ tree and the actual data on disk.
The code base is quite extensive and will require much reading of API documentation and thorough testing. Note: For this lab, I have attempted to directly link all class names to their API documentation. This should make understanding the interface for a class more streamlined for you. You are responsible for making significant progress early on the in assignment as waiting until the last few days will not be manageable.
Next, clone your Lab 5 git repo into your cs44/labs subdirectory, cd into your repo, and run make setup:
cd cd cs44/labs git clone [the ssh url to your repo] cd Lab5-partner1-partner2 make setupHere are some instructions on Using Git page (follow the instructions for repos on Swarthmore's GitHub Enterprise server).
If all was successful, you should see the following files and symlinks (files highlighted in blue require modification):
As in previous labs, run make setup to create symlinks to lib, exceptions, and includes for files needed to build your solution.
To help get you started, I have provided you with an implementation of few new classes: PageFile, RawFile, and FileScanner. The PageFile and RawFile classes are derived from the File class. These classes implement a file interface in two different ways. The PageFile class implements the file interface for the File class as was done in the previous two WisdDB labs - a file consists of pages formatted using the Page class (i.e., a double-linked list of heap pages). We use the PageFile class to store all the relations.
The RawFile class provides an abstraction where there is minimal internal structuring of the data. This allows us to implement our own file organization on top of the RawFile without having our data corrupted in the process. Specifically, the RawFile treats pages as completely unformatted chunks of memory (e.g., 8KB). There is no slot directory or page pointers built in. This allows us to use the page as essentially one large chunk of memory (like the data array from the previous lab).
You will use the RawFile class to store the B+ index file, where every page in the file is a node from the B+ tree whose type is mapped onto a raw page of bytes (like the data field in the HeapPage).
The FileScanner class is used to scan records in a file. You will use this class for the base relation, and not for the index file. The file main.cpp file contains code which shows how to use this class. The public member functions of this class are described below.
Your assignment is to implement a B+ tree index. To simplify the task, we will make the following assumptions:
strncpy(dest, src, 10);
The index will be built directly on top of the I/O Layer (the RawFile and the Page classes). An index will need to store its data in a file on disk, and the file will need a name (so that the DB class can identify it). The convention for naming an index file is specified below. To create a disk image of the index file, you simply use the RawFile constructor with the name of the index file. Since a RawFile is an unstructured array of bytes, you will need to map a structure on top of the pages that you get from the I/O Layer to represent the nodes of the B+ tree. Furthermore, where the File class abstracted the creation a header page, you will need to directly allocate a page and format it to be the header page (i.e., store meta-data) for your index.
We'll start you off with an interface for a class, BTreeIndex. You will need to implement the methods of this interface as described below. You may add new private methods to this class if needed (like helper functions for good modular design), but do not modify the public interfaces that are described here:
std::stringstream ss; ss << relationName << '.' << attrByteOffset; std::string indexName = ss.str(); // indexName is the name of the index file
This method does not return any values; you will need to use scanNext to obtain the results one-by-one. You should first check to see if there is already a scan in progress - if so, first end that scan before initializing a new one. You should throw a BadScanrangeException if the user provides a low value greater than the high value and throw a BadOpcodesException if the lowOpParm or highOpParm are not legal (e.g., the lowOpParm must be either GT or GTE). The function should point to the first matching record when it is complete; if no such record exists (i.e., no keys match the search criteria) throw an NoSuchKeyFoundException.
Here is some additional information to help complete the lab:
struct RECORD { int i; double d; char s[64]; };And, we are building the index over the double d, then the argument for attrByteOffset is offsetof(RECORD, d). There are examples of this in main.cpp that you can use when constructing new tests.
A large part of your time on this lab will be spent designing and testing your program. When designing your solution, be sure to consider the major hurdles here are two types of nodes - leaf and non-leaf nodes - and each has its own structure. You will need to design several short helper methods for common operations.
You should develop a testing strategy that parallels your design. The provided test is not useful for incremental development. You should devise smaller tests. Ideas include:
From your local repo (in your ~you/cs44/labs/Lab05-partner1-partner2 subdirectory)
make clean git add * git commit -m "our correct, robust, and well commented solution for grading" git push
If that doesn't work, take a look at the "Troubleshooting" section of the Using git page. Also, be sure to complete the README.md file, add and push it.