Database Systems

1. Due Date

Complete lab: Due by 11:59 p.m., Thursday, April 24, 2025.. Pass all tests in smalltests, projecttests, selecttests results are similar to ours.

Checkpoint: Tuesday, April 15. Pass all tests in smalltests.

Your lab partner for Lab 7 is listed here: Lab 7 lab partners

Our guidelines for working with partners: working with partners, etiquette and expectations

2. Overview

In this lab you will implement Select and Project operators in the RelOps layer of the SwatDB database management system.

The RelOps layer interacts with the FileManager layer of SwatDB. Your implementations of Select and Project will make use of Index and Heap files to access individual records to perform the requested operation, and will store the result in a new Heap. For Project the schema of the result file will be a subset of the source file’s schema.

You will implement two versions of Select. One version uses a file scan to find records that meet the select condition, and the other uses an index to find records matching the select condition. You will implement a file scan algorithm for performing the Project operation, and you will not remove duplicate records in the result (note: this is unlike what a project operation should do, but is a simplification for this lab assignment).

For this assignment you will not write large amounts of code. However, you will need to spend significant amount of time reading the starting point code and reading SwatDB documentation in order to figure out how to implement your solution; the code you write for your solution uses many SwatDB classes, some of which you have used and/or written in past assignments, but many of which are new to this assignment.

The primary goal of the SwatDB lab assignments is to gain an understanding of the details of how a relational DBMS works by implementing and testing parts of a relational DBMS.

The SwatDB code base is quite extensive and will require significant reading of its documentation (see Section 3).

2.1. Lab Goals

The main goals of the SwatDB RelOps Lab are:

Understanding how the relational operators Select and Project can be implemented in a DBMS.
Understanding different ways of implementing the same operation using different access methods to the underlying relational data.
Gaining some practice evaluating different implementations of the same relational operation.
Developing a testing strategy for a large, complex system; making use of a provided unit testing framework, and debugging tools.
Practice working with part of a large code base, much of which you have access to only through its interface definition (i.e., .h files and generated documentation).

2.2. Starting Point Code

If you have not already done so, first create a course directory for this course, and add a lab subdirectory for your lab repos:

mkdir -p cs44
mkdir -p cs44/labs
cd cs44/labs

We will be using git repos hosted on the college’s GitHub server for labs in this class. If you have not used git or the college’s GitHub server before, here are some detailed instructions on using git for CS44 labs.

Next find your git repo for this lab assignment off the GitHub server for our class: CS44-s25

Clone your git repo (Lab7-userID1-userID2) containing starting point files into your labs directory:

cd ~/cs44/labs
git clone [the ssh url to your your repo]
cd Lab7-userID1-userID2

If all was successful, you should see the following files (highlighted files require modification):

2.2.1. Lab 7 Files

Makefile - pre-defined. You may edit this file to add extra source files or execution commands.
README.adoc - some directions about how to compile and run test programs for the RelOps layer. See information in this file about running test programs and about cleaning up relation files if need be.

The test programs create relation files and these can be be corrupted if your programs exit unexpectedly. There is a cleanup.sh file you can run to clean these up in this case. Running make clean will also run the clean up script.

2.2.2. RelOps Manager layer Files

Most of the RelOpsManager is implemented for you (most of these files are for your reference). However, you do need to implement one method function in the RelOpsManager class. You should not add any new data members to these classes or public methods. You may add private helper methods for good modular code design.

relopsmgr.h: interface to the relops manager layer. These are interface functions to initiate different operations (specific versions of select, project, and join) on specific relation operand(s). You do not need to modify this file.
relopsmgr.cpp: contains some of the main RelOpsManager methods, including the _createResultFile method that you will use to store SELECT operation results. You do not need to modify this file.
relopsmgr_projets.cpp: you will implement the _createProjectRes method that creates a file to store PROJECT operation results.

It also contains the project method that picks the project operation type. You do not need to modify the project method, but we suggest you read through it to understand what it is doing, and to understand the values it passes to the specific project operation.
relopsmgr_selects.cpp: contains the select method that picks a select operation type to perform. You do not need to modify this file, but we suggest you read through the select method to understand what it is doing. and to understand the values it passes to the specific select operation.

2.2.3. Select and Project Operation Files

Specific operations are derived from the Operations base class.

operation.[h,cpp]: the Operation base class. You do not need to modify this, but is here for your reference.
project.[h,cpp]: the Project class and implementation. You will implement the runOperation method that implements project.
select.[h,cpp]: the Select class. FileScan and IndexScan are derived from this class. You will implement the constructor.
filescan.[h,cpp]: the File Scan implementation of Select. This class is derived from the Select class. You will implement the runOperation method.
indexscan.[h,cpp]: the Index Scan implementation of Select. This class is derived from the Select class. You will implement the runOperation method.

2.2.4. Select and Project Test Files

We are giving you a lot of unit tests for this assignment.

You are not required to add more tests as part of this assignment. However, you still may want to add some more for further stress testing your solution. If you add new tests, please add verbose comments describing what your test is testing.

smalltests.cpp - unit testing code for the small DB. The checkpoint for this lab assignment.
projecttests.cpp - unit testing code for Project.
selecttests.cpp - unit testing code for the two Select operations.

2.2.5. Scripts (.sh files) and Test Databases

This lab creates and uses some swatDB instances with some relations files. The instances and files are created in /scratch/<your user name>/cs44swatDBfiles/. After you run make, you can cat out the .db files to see the relations and their schema. For example, user tnas would do this:

cat /scratch/tnas/cs44swatDBfiles/small.db

README.adoc: documentation about the scripts
Makefile: make command run these scripts for you (you can run them through make command vs. running a .sh file at the command line.

Make sure the .sh files are executable:

ls -l     # should list x permission with executable files
chmod 700 *.sh  # set permission to rwx for owner if not

./cleanup.sh: cleanup created DB and files from incomplete runs. (make clean will also do this)
./mktestconf.sh followed by ./getfiles.sh will build the examples SwatDB databases for testing. However, just run make to build the .db files (easier than running the scripts by hand).

2.3. Deliverables

The following will be evaluated for your lab grade:

A complete and robust implementation of the assigned select and project operations and the RelOps manager methods. This includes adding complete comments, and removing all my TODO comments to you.
Passes all unit tests in: smalltests.cpp, selecttests.cpp, and projecttests.cpp. You are not required to add additional tests for this lab, but you may want to add some to test your solution, and are welcome to add more to these files (please add descriptive comments to any that you add).
The class definitions in .h files. Only add private helper methods to class definitions to support good modular design of your solution. Do not add public methods or data members to any classes defined in these file. Also note where you can and cannot add private data members to classes (look for comments in .h files and in lab write-up).
Your TBA[Lab 7 Questionnaire] to be completed individually (This will open on the due date and close after 3 days)

2.4. Checkpoint

Before the checkpoint due date, you should complete the functionality to pass all the unit tests in the smalltests program. The checkpoint functionality includes:

Implementation of the file scan version of Select.
Implementation of Project (and _createProjectFile)

While we recommend dealing with exceptions as you implement the methods, we will not require that exceptions are implemented for the checkpoint.

Also note that other unit test programs stress test these methods, so it is possible that after passing the checkpoint, you may still need to go back to these implementations and do some debugging.

3. SwatDB

This assignment implements parts of the RelOps Manager part of SwatDB, including implementing specific relational operators.

For information about SwatDB, including a link to its on-line code documentation, see this page:

About SwatDB

In addition to the .h files distributed with this lab, the SwatDB documentation that will be particularly helpful for this lab includes:

schema.h: create a Schema object for relation file results
record.h: get Record data from file tuples, create result Record to add to result file.
key.h: create and use search keys.
searchkeyformat.h: need to create a SearchKeyFormat object that you use to initialize a Key used for selects.
filemgr.h: create HeapFile relation file for the operation result.
heapfilescanner.h: scan through records in a HeapFile.
hashindexcanner.h: use this to scan through index to get find Rid values that match the selection criteria.
Common SwatDB type definitions, defined in swatdb_types.h
The Exceptions classes are defined in swatdb_exceptions.h. You may need to catch some, and throw others.

4. Lab Details

All relational operations are invoked through the RelOpsManager object (defined in relopsmgr.h, relopsmgr.cpp, relopsmgr_selects.cpp, and relopsmgr_projects.cpp) that implements the interface to the relational operations manager layer of SwatDB. Its has interface methods for each of the main operations: select, project, and join. These methods take parameters that specify the relation(s) on which to perform the operation and some take a parameter that specifies the specific type of algorithm to use to perform the operation.

Specific relation operations are implemented as classes derived from the Operation base class (defined in operation.[h,cpp]). Start by investigating the class hierarchy for different versions of relations operators to understand what different project and select classes inherit. For example, filescan is derived from Select which is derived from Operation. The on-line documentation is useful for seeing the class hierarchy. You can view .h files there or we have included several of the related .h files for you with the lab starting point code.

4.1. What to implement

For this lab, you will not implement a large amount of code, but you will need to spend a fair amount of time reading starting point code and SwatDB class documentation to determine how and where to implement the select and project operations.

4.2. RelOpsManager

RelOpsManager class: this class implements the interface to the relational operators level of SwatDB; it has public methods for performing select, project, and join operations. It also has a method, checkFilesEqual, that can be used to test the results of operations.

In addition to public methods, it has several private methods that create the correct result file for the operation. For this assignment, you will implement one of these functions for creating the file for the project results.

The RelOpsManager methods are implemented in several .cpp files: realopsmgr.cpp, relopsmgr_selects.cpp, and relopsmgr_projects.cpp)

In relopsmgr.cpp, several methods are implemented for you, including:
- FileID FileId _createResultFile(Schema *schema): creates a new HeapFile to store the result of an operation. RelOpsManager::select (in relopsmgr_selects.cpp) calls this method directly to create the result file (the result and source relations have the same schema. You do not need to implement this method, but we recommend that you read it to understand what it does.

4.2.1. relopsmgr_projects.cpp

RelOpsManager::project: runs project operation. You do not need to implement this method, but we recommend that you read it to understand what it does.

FileID _createProjectRes(Schema *rel_schema, std::vector<FieldId> fields): creates a new HeapFile to store the result of a project operation. Note that the result file’s schema is a subset of the relation file on which the project operation is performed. You will implement this function.

This function should:

Create the Schema for the project result from the files’s rel_schema given the vector of FieldIds in the project result.

Note that the RelOpsManager class is a friend class of the Schema class, which means that it can directly access the field_list field of the passed rel_schema Schema object.

Also note that FieldId values are the positions (the index values) of each field in the field_list vector of the rel_schema of the relation.

The primary key for the result Schema can be empty (i.e., pass { }).

Call the _createResultFile method, passing it the Schema you create in (1).

4.2.2. relopsmgr_selects.cpp

RelOpsManager::select: runs a select operation of a specified type. You do not need to implement this method, but we recommend that you read it to understand what it does.

4.3. Operation

The Operation class is the base class of all relational operations. There is nothing for you to implement, but you should look through the operation.[h,cpp] files to see the class definition and what some of its methods do. In particular, note:

struct fileState (defined in operation.h). This structure stores state for the operand files and indices and for the result file of the particular operation. Some of the fields in this struct can be used to store and manipulate record data as part of the operation.

Since every operation has a result file, the fileState for the result relation file is a field of the Operation base class (result_state), and it is initialized by the constructor of the Operation class.

A fileState struct for the source relation (and index) files is initialized the Select and Project constructors. Look at the starting point code that contains calls to _initState that performs this operation.
The _initState method is called by the Operation constructor and by the constructors of the derived classes Select and Project. This method initializes the fileState structs associated with the result and the source files for the operation.
The _delState method cleans-up and fileState structs created for the operation.

4.4. Project

The Project class is derived from the Operation class.

Project class: this is the base class for specific project operations. You do not need to modify this class, but look at its declaration (in project.h) to see the data members that are part of the base class, and which you will use to implement project operations.

You do not need to remove duplicates from the result relation for project. The default in SQL is to not remove duplicates, so since this a relational operator to implement an SQL query, you should not remove duplicates from the result (strict relational algebra project does, SQL project does not).

4.5. Select

The Select class: this is the base class for specific select operations. It is derived from the Operation class.

Select::Select: complete the implementation of the constructor, by adding error detection and handling. Check that the field ids are within bounds and check that the values comparisons and fields passed in are all the same size. Throw appropriate exceptions if not.

Also, look at this class definition (in select.h) to see the data members that are part of the base class, and which you will use to implement select operations.

If an exception is thrown in the constructor, the destructor of its base class, Operation, is invoked, but the destructor of the Select class is not (the object has not yet been successfully created). As a result, any state you allocated before the exception, needs to be cleaned-up in the constructor before the exception is thrown.

4.5.1. FileScan

The FileScan class: derived from the Select class.

runOperation: implements the file scan select operation. Your algorithm should scan over all records in the file, and add those matching the select conditions to the result file. You will need to use the HeapFileScanner class to scan over all records in the file.

4.5.2. IndexScan

The IndexScan class: derived from the Select class

IndexScan: complete the implementation of the constructor. This should initialize the index_file data member by calling the getFile method of the catalog, and check that the index fields match the fields on which the select operation is being performed. This should check for errors and throw appropriate exceptions on different errors (see comments in .h file). Note that the Select base class parts are already initialized in the part of the constructor we give you with the starting point.

the index’s key fields and the select fields must be the same for the index scan version of select to proceed. Although an index on a subset of the select fields could be used if the select condition is in conjunctive normal form, you do not need to handle this case for this lab assignment.

Also, see the NOTE about the Select constructor behavior when an exception is thrown.

runOperation: implement the index scan select operation. Your algorithm will create a HashIndexScanner on the HashIndex to find RIDs of records in the file that match the selection criteria (remember that only equality select criteria can be used with a hash index). You can then call getRecord on the relation file to extract the full matching record data from the relation file to add to the result file.

You will make use of the HashIndexScanner::HashIndexScanner(HashIndexFile* index, Key *key_val) method that scans over all hash index entries with a matching key_val.

You will also need to make use of the Key and SearchKeyFormat classes to create objects that can be passed to and compared with index file entries.

When creating a Key for the index scan, you can just create a new Key object of MAX_RECORD_SIZE to allocate enough space for a key of any size.

Also note that this Key constructor allocates space that the caller must explicitly free when done. See its function comment for more information.

Be sure that any temporary objects created in this method are deleted before exit.

5. Lab Requirements

In addition to correctly implementing parts of the Project and Select operations, and adding code to test your implementation, you should also:

Declare and use variables of the types defined in swatdb_types.h as opposed to their underlying type definition. Also use constants and enum types defined in this file - they help make the code more readable. For example, if a method returns a FileId, declare a variable of type FileId rather than std:uint32_t or int to store its return value:
```
FileId result_fileid;
```
Write good C++ code design, and good modular design in your solution. This includes using defined constants and types.
Ensure you code is robust to errors, in particular, be sure to test for error handling for exceptions that should be thrown and caught by the buffer manager.
Ensure your code is free of valgrind errors.
Make sure your code is well-commented, and there is no line wrapping. (See our C++ Style guide link from the Handy Links section.
Your code should be free of all compiler warnings. The one exception is that there is a known deprecation warning with SHA1_ functions that SwatDB uses for hashing. If you see these, you may ignore them.
Your submitted code should have all of our TODO comments removed…as you implement a TODO, remove it. These (as well as NOTE comments) are also helpful to find parts of the given code that you need to implement.

6. Testing your code

There are several test files in the starting point code. They use the same unittests framework you used in CS35, and test various relational operator functionality and exceptions:

6.1. unit tests

smalletests.cpp: file scan select and project on a very small relation DB (one that you can print out all relations and examine results)
projecttests.cpp: extensive tests of project on a much larger DB
selecttests.cpp: extensive tests of select on a much larger DB

You can add additional tests to any of these files by following the examples in this the files (add them as a new test SUITE separate from the ones we give you.

6.2. Test DB relations

When you type make, along with building the unittest executables, the Makefile rule runs the getfiles.sh script which creates .db and relations files in /scratch/yourusername/cs44swatDBfiles/ directory. Two DBs are created, small.db and tables.db, and are used in the unit test programs.

When you type make clean the Makefile is set up to run the cleanup.sh script to remove these DB files.

The .db files created are ascii files and are readable in an editor program. The unit test code also has commented out calls for printing out the Catalog and relation files in the test code (note: for large relation files, only the first 50 records are printed).

You should not need to ever run either by hand, but you can. See the README.adoc files form more information.

6.3. To run unit test programs:

# run all of the unittest test suites
make runtests

# run individual tests
./smalltests
./projecttests
./selecttests

# or you can run individual test suites alone using -s testSuiteName
./projecttests -s ExceptionTests   # run just ExceptionTests test suite

# to list the test suites names run with -h, for example:
./selecttests -h

6.4. Cleaning up corrupted files

Run make clean or you can explicitly run the ./cleanup.sh to remove the DB files.

./cleanup.sh

7. Tips and Hints

7.1. General Tips

Spend some time reading the starting point source code, and looking at SwatDB docs to get an idea of how the methods you need to implement are called starting from the RelopsManager. There is a fair amount of inheritance here, particularly with the select operations. In addition to some of the File and Index interface functions, take time to understand the Record and Schema class interfaces as well.
Implement at test incrementally. Use the smalltests.cpp to help guide the order and testing that you do.
Make use of gdb and valgrind to help you as you go.
Look at past weekly lab page for help with C++, gdb, and valgrind.
Make use of the cleanup.sh script to clean up state from incomplete previous runs. You can also run make clean; make to clean-up and to regenerate the source test files.

Look at the information in Section 2.2.1 about the script files and how to use them (or implicitly use them with make command), and how to view the .db file contents for the test SwatDB database instances used by the test code.
Read the README.adoc file about some of the scripts. Also, look at the Makefile to see what is being built (and cleaned up) and where. See Week 8 for more information about Makefiles to help you read it.

7.2. Suggested Order

Here is a suggestion for an order in which to implement Relations Operators:

Start with the file scan version of Select (in filescan.[h,cpp], and get it to work with the smalltests.cpp unit tests. You will want to refer to the base class select.[h,cpp] and operation.[h,cpp] as you implement. The main steps are:
1. Create a new HeapFile for the result (this step is done for you in the starting point).
2. Create a new HeapFileScanner object on the HeapFile source.
3. Scan each record in the source relation file, and if it matches the selection criteria, insert the record in the result file.
4. Test on the smalltests.cpp unit tests first.
Next, implement project operation, and get it to work with the smalltests.cpp unit tests.

Like Select, project will do a file scan of the source relation, and for each record add its projected version to the result file. Unlike Select, the schema for the result relation is different from that of the source relation.
1. Start by implementing the _createProjectRes method of the RelOpsManager class in relopsmgr_projects.cpp. This create a new HeapFile result file for the projection operation. You will need to create a new Schema for this file that consists of a subset of the fields of the source file. Once you create the new Schema, you can call the RelOpsManager _createResultFile method to create the result file.
2. Next, perform a scan of the source heap file, and for each record create a new record with just the projected fields and add it to the result file.
3. Test on the smalltests.cpp unit tests first.
4. Test on the smalltests.cpp unit tests first.

8. Submitting your lab

Review the lab deliverables to ensure you have completed all of your work. Before the due date, push your solution to github from one of your local repos to the GitHub remote repo.

From your local repo (in your ~/cs44/labs/Lab7-userID1-userID2 subdirectory)

make clean
git add *.h *.cpp
git commit -m "my correct and well commented solution for grading"
git push

Be careful not to add binary files to your repo (executable or .o files that are compiled when you run make). These are very large files and not ones that you need git revision control over (they are created from source and header files that should be in your repo).

To avoid adding these files to your repo, NEVER run these commands: git add * or git add .!. These both add every file in the current directory to your repo, often mistakenly adding in binary files that are built with make, and should not be added to your repo.

Instead add only the files you want to add to your repo by explicitly listing them (git add filename.cpp filename.h) or use a better alternative git command for adding a set of changes to only the files you want to add to your repo.

Here are three alternative ways to submit a group of changes to lab files using some git command shorthand without the horror of doing a git add . or git add *:

add all my changes to files with a .cpp suffix (commit changes only to C++ source files). This command could add new .cpp files to my repo, but it won’t add binary files, temporary files, input files, etc. (it will only add files with a .cpp suffix).
```
git add *.cpp
```
add my changes to all files already under git control (won’t add any new files to the repo)
```
git add -u
```
git commit all changes to files already under git control (won’t add any new files to the repo)
```
git commit -am "my commit message"
```

Run git rm filename to remove a file from your repo.

Verify that the results appear (e.g., by viewing the the repository on CS44-s25). You will receive deductions for submitting code that does not run or repos with merge conflicts. Also note that the time stamp of your final submission is used to verify late days, so please do not update your repo until after the late period has ended.

If that doesn’t work, take a look at the "Troubleshooting" section of the Using git for CS44 labs and the Using git pages. At this point, you should submit the required TBA[Lab 7 Questionnaire] (each lab partner must do this).

9. Handy References

Information about SwatDB
SwatDB Code Documentation
Review in lab exercises from Week 9
Some C++ Programming Resources and Links including the C++ Style Guide
C++ programming tools compiling, linking, debugging C++
C references in Dive into Systems (some useful for C++ programming too) Chapter 2: C pointers, command line arguments; Chapter 3: debugging tools (valgrind, gdb for C)
gdb Guide also in Chapt. 3 of Dive into systems
Valgrind Guide also in Chapt. 3 of Dive into systems
my CS help pages (Unix tools, programming links)
Appendix 2: Using Unix from Dive into Systems textbook
CS Department help pages
Using git for CS44 labs
Using Git more complete Git guide

CS44 Lab 7: SwatDB Relational Operators: Select and Project