Homework 2

Due by 11:59 p.m., Tuesday, November 27, 2018.

This is a partnered lab. You are to complete this lab with one other person, who must attend the same lab as you. You may discuss the lab concepts with other classmates. Please use Teammaker to set your lab partner preferences. You can choose “Random” if you do not have a partner. Remember the Academic Integrity Policy: do not show your code/solution to anyone outside of the course staff and your lab partner. Do not look at any other group’s code/solution for this lab (current or past). If you need help, please post on Piazza.

Overview
- Getting Started
Hash Indexing
Sorting
Relational Algebra
SQL
- Submitting your lab

Overview

This assignment is a problem set that will engage the modules since the midterm exam. The goals of his lab are to:

Apply and analyze the extendible hashing data structure
Apply and analyze external merge sort
Design and interpret relational algebra queries
Extend the SQL syntax from what was covered in class.

While you will work with a partner, you are not allowed to divide and conquer this assignment with your lab partner. Each individual must work on each of the problems below - this is in your best interest as these problem sets closely mimic questions you will see on exams.

Getting Started

Both you and your partner will share a repo named Homework2-userID1-userID2. Note that the lab will not release until you have both marked your partner preferences on Teammaker. You should find your repo on the GitHub server for our class: CPSC44-F18

Clone your Homework 1 git repo containing starting point files into your labs directory that you created last week:

cd ~/cs44/labs
git clone [the ssh url to your your repo]
cd Homework2-userID1-userID2

If you need help, see the instructions on Using Git (follow the instructions for repos on Swarthmore’s GitHub Enterprise server).

If all was successful, you should see the following files (highlighted files require modification):

README - a few wrap-up questions for you to answer about the lab assignment.
hw2.tex - LaTeX file for you to type your solution

In addition, you should add the following file:

hw2.pdf - Solution for the problems below. Hard copies will not be accepted. LaTeX is strongly encouraged to ensure your solutions are legible.

You should submit your solution as a PDF and by using LaTeX. If you are not familiar with LaTeX, here is a primer. You may ask your classmates any questions about LaTeX without worrying about violating the academic integrity policy. To create a pdf, you can use pdflatex and open the pdf with your preferred viewer (e.g., acroread):

$$ pdflatex hw2.tex
$$ acroread hw2.pdf

Hash Indexing

Exercise 11.1, subproblems 4-7 only, Ramakrishnan and Gehrke.
After an insert that causes the directory to double, how many buckets have exactly one directory entry pointing to them (i.e., are not shared)? Briefly explain.
Exercise 11.7, Ramakrishnan and Gehrke.

Sorting

Assume you have a file with 2,000,000 pages and 17 available buffer pages. Answer the following questions. Your answers should be exact, not rough estimates. Read Section 13.3 for the exact formulas for the number of passes and the number of sets in a run.

A) How many runs does Pass 0 produce?

B) How many passes will it take to sort the entire file completely?

C) What is the total I/O cost of sorting the entire file?

D) How many buffer pages are needed to sort the entire file with only one merge pass (i.e., two passes in total)?
In class, we discussed an extension to external merge sort called double buffering. Consider the possibility of extending this idea to triple buffering (i.e., 3 buffer pages per input as opposed to 2). Describe one potential benefit to this approach as well as one key disadvantage.

Relational Algebra

Assume you have two relations $R$ and $S$ , where $R$ contains $N$ tuples and $S$ contains $M$ tuples, and $N > M$ (i.e., $R$ has more rows). For each expression below, give:

i) the minimum number of tuples possible in the resulting relation

ii) the maximum number of tuples possible in the resulting relation

iii) the restrictions on the schemas of $R$ and $S$ , if any. That is, will the expression only work on certain schemas for $R$ and $S$ ?

A. $R \cup S$

B. $R \cap S$

C. $R - S$

D. $R/S$

E. $\sigma_{x=10} R$

F. $R \times S$
You are given the following schema:
$Suppliers(\underline{sid:integer}, sname: string, address: string)$ $Parts(\underline{pid:integer}, pname: string, color: string)$ $Catalog(\underline{sid:integer, pid:integer}, cost: real)$
Underlined fields form the primary key for the relation. Write each of the following queries as a relational algebra expression:

A. Find the names of suppliers who supply some red part.

B. Find the sids of suppliers who supply some red part or are at 500 College Avenue.

C. Find the sids of suppliers who supply some red part and some green part.

D. Find the sids of suppliers who supply every part.

E. Find the sids of suppliers who supply every red part.

F. Find the pids of parts supplied by at least two different suppliers.

Note: one benefit to the renaming operator is that it makes a copy of a relation. So, you could use it to store an intermediate result if you want to break up an expression into pieces. For example, to simplify $\pi_{a,b,c}((R \bowtie S) \cup (X \bowtie Y))$ we could produce:
$\rho(Temp, R\bowtie S)$ $\rho(Temp2, X\bowtie Y)$ $\pi_{a,b,c}(Temp \cup Temp2)$
Using the same schema as above, state the query that the following expressions compute. If the query is illegal, please state why:

A. $\pi_{sname}( (\sigma_{color=`red'}Parts) \bowtie (\sigma_{cost<100}Catalog) \bowtie Suppliers)$

B. $\pi_{sname}(\pi_{sid}((\sigma_{color=`red'}Parts) \bowtie (\sigma_{cost<100}Catalog) \bowtie Suppliers))$

C. $(\pi_{sname}((\sigma_{color=`red'} Parts) \bowtie (\sigma_{cost<100}Catalog) \bowtie Suppliers)) \cap (\pi_{sname}((\sigma_{color=`green'} Parts) \bowtie (\sigma_{cost<100}Catalog) \bowtie Suppliers))$

SQL

What is the purpose of the CHECK command?
For this question, you will define a SQL query to translate the relational operator of division using the EXISTS/NOT EXISTS commands in lieu of an SQL division operator. Define a query for the following example where you have you have two relations for enrollments:
$Enrollee(studentid, courseid)$ $Course(coursid)$
And you want to find the students enrolled in all courses; i.e., $Enrollee/Course$

Submitting your lab

Before the due date, push your solution to github from one of your local repos to the GitHub remote repo. Only one of you or your partner needs to do this.

From your local repo (in your ~cs44/labs/Homework2-userID1-userID2 subdirectory)

git add hw2.tex hw2.pdf README.md
git commit -m "our solution"
git push

If that doesn’t work, take a look at the “Troubleshooting” section of the Using Git page. Also, be sure to complete the README.md file, add and push it.