Lab 4: Movie Database

Due by 11:59 p.m., Friday, December 7, 2018.

This is a partnered lab. You are to complete this lab with one other person, who must attend the same lab as you. You may discuss the lab concepts with other classmates. Please use Teammaker to set your lab partner preferences. You can choose “Random” if you do not have a partner. Remember the Academic Integrity Policy: do not show your code/solution to anyone outside of the course staff and your lab partner. Do not look at any other group’s code/solution for this lab (current or past). If you need help, please post on Piazza.

Introduction
- Getting Started
Creating the movie database
- Schema
- Required methods
Querying the database
Tips and additional details
Submitting your lab

Introduction

In this lab, you will create a relational movie database and pose a set of queries using SQLite. The raw data has been extracted from the Internet Movie Database (IMDb). You will structure this data in 6 tables to represent Movies, Actors, Directors, Genres, the relationship between director(s) for each movie (DirectsMovie), and the casts of each movie (Casts). Your schema is closely related to the following ER diagram:

In addition, you will use embedded SQL to write a Python program to interface with the sqlite3 engine. While there is a bit of a learning curve to picking up the Python library for sqlite3, the SQL commands are equivalent to those you would enter on the normal command-line interface.

The objectives for your lab are:

to gain experience with one popular instance of a relational database (SQLite)
to combine procedural language (Python) elements with declarative language (SQL) constructs
to practice data definition language (DDL) commands in SQL
to design queries to answer questions about the data
to utilize indices and query planning to make your queries efficient

Getting Started

Both you and your partner will share a repo named Lab4-userID1-userID2. Note that the lab will not release until you have both marked your partner preferences on Teammaker. You should find your repo on the GitHub server for our class: CPSC44-F18

Clone your Lab 4 git repo containing starting point files into your labs directory:

cd ~/cs44/labs
git clone [the ssh url to your your repo]
cd Lab4-userID1-userID2

If you need help, see the instructions on Using Git (follow the instructions for repos on Swarthmore’s GitHub Enterprise server).

If all was successful, you should see the following files (highlighted files require modification):

createDB.py - a main program for creating your initial database schema including the creation of tables and indices plus the insertion of records. You will only need to run this program one time once you get it working (but be sure to test it!)
queryDB.py - the main user program. This program will prompt the user with a menu of all possible queries to run and execute the user’s choice(s)
README.md - the usual data collection

In addition, you will utilize the following shared data files:

/home/soni/public/cs44/movieDB/ - this folder contains the records to insert into your tables in the form of relationName.txt (e.g., Casts.txt contains all records of actors cast in a movie). DO NOT COPY THIS FOLDER - it will eat up your quota.

You should utilize these references on using the sqlite library in Python:

SQLite API - a detailed description of the full library interface. This is a good reference for help on accomplishing specific tasks or inspecting data
Python tutorial - a quick-start guide for the basics of opening a database connection
Python Central tutorial guide - another quick-start guide
When in doubt, google the specific topic. Start your search with "sqlite python"; e.g., how do I retrieve query results? Google: sqlite python retrieve query results.

Creating the movie database

In createDB.py, place your code to create the movie database. You will create tables, insert values into the table, and create indices on the table as needed to solve your queries efficiently. The code has been partially provided to help you get started. NOTE: do not use your home directory to create the actual database - this will eat up your quota very quickly. See our Tips for Performance and Storage.

Schema

The schema is as follows:

Actor (id, fname, lname, gender)
Movie (id, title, year)
Director (id, fname, lname)
Casts (actorID, movieID, role)
DirectsMovie (directorID, movieID)
Genre (movieID, type)

All id fields are integers, as is year. All other fields are character strings. You can use either CHAR(N) or VARCHAR(N) for the character strings. Generally, stick to a maximum of 30 characters for any name or title field; 50 for role and type; 1 character for gender.

The keys are specified in italics above. To sum up: id is the key for Actor, Movie, and Director. For the remaining relations, the primary key is the combination of all attributes. This is because an actor can appear in a movie many times in different roles; each movie can fit multiple genres; and each movie can have multiple directors.

The foreign keys should be clear from the context. Casts.actorID references Actor.id. Casts.movieID and DirectsMovie.movieID reference Movie.id. DirectsMovie.directorID references Directors.id The IMDb dataset is not perfectly clean and some entries in Genre.movieID refer to non-existing movies. This is the reality of “messy” data in the real world. In this instance, we drop the constraint rather than cleaning up the data. DO NOT specify that Genre.movieID is a foreign key.

Required methods

You will define the following functions:

main()
Takes the file name for the new data base as a command-line argument. This function should be minimal and delegate work to the helper methods. It should check if the file already exists, establish a connection, create all tables, insert all values, and finish up by building indices.
checkDB(filename)
Determines if the file already exists (TIP: use the os.path library). If not, you may return to the main program proceed to construct the database. If it does already exist, the user should be prompted with the option of deleting the file (using os.remove()) or quitting (e.g, exit(1)). If they chose to delete, remove the file and proceed with creating the database. Here is an example run:
```
$ python3 createdb.py movieDB
File already exists; enter 0 to quit and 1 to remove: 1
...Creating new movie database
...Creating Tables...
...Inserting Records...
...Building Indexes...
```
createTables()
Issues SQL commands to create all 6 tables in the schema. The first line of code given turns referential integrity checks on.
insertValues()
You may not use .import to import all values. Instead, you must read each file, parsing the arguments on each line, and insert the results one-by-one using the INSERT INTO SQL command. You may want to take a look at executemany() function to see how all insertions can be executed at one time.

Each relation’s text file contains one record per line. Fields are separated by '|' delimiters, and are in the same order as the schemas above. You should not need to worry about type casting since SQL should handle that. Note that in Python3, you may be encoding errors as some of the symbols are not utf-8 compatible. When opening the file, specify that errors can be modified:
```
with open('/home/soni/public/cs44/movieDB/Actor.txt','r',errors="backslashreplace") as f:
  #process file
```
createIndex()
Place all SQL commands for creating indices here. You should choose your indices wisely. You will probably need to complete this function after solving some of the queries below. As an aside, most queries should run quickly (a few seconds at most). In the README.md file, you should provide a justification for each index constructed by citing which queries it is useful for. Each index must be justified, and not superfluous. See the tips below for additional help. Also, note that SQLite automatically builds some indices (e.g., on primary keys) so pay attention to your query plans to see if SQLite is actually using the ones you specify (and drop it if it is note used).

You may add additional methods as needed (for example, to help with inserting). Each method must be commented and clear to follow. Please read about using multi-line strings below to avoid unreadable code.

Querying the database

In queryDB.py, you will define your queries and implement a user interface for interacting with the database. Your main method should establish a connection in a similar fashion as createDB.py: read the name of the database from the command line, check to see if the file exists (exit cleanly if it does not), establish a connection and cursor. See the sample output to get an idea of how the program should run.

Next, your program should repeatedly print a menu of options until the user selects “Exit” as an option. The menu has been provided for you in printMenu(). Please do not change this method. After the user enters a choice, you should call the appropriate query.

Requirements

Each query should be defined in a function queryX where X is the number below. For example the first query listed in the Queries table below should be defined in query1().
Your queries should be easy to read. Your SQL commands are sent as strings through the execute method. Since these queries can be long, entering them as a very long line will make them very hard to read. Instead, you should use Python’s multi-line string format (i.e., """Long string""" to create strings with line breaks that make your queries easy to parse). For example, to query the number of Directors named "Steven":
```
  command = """
            SELECT COUNT(*)
            FROM Directors D
            WHERE D.name = 'Steven'
            """
  db.execute(command)
  results = db.fetchall()
```
Print out the query plan after each query. To get the query plan prefix EXPLAIN QUERY PLAN to the query:
```
  db.execute( "EXPLAIN QUERY PLAN\n" + command )
  results = db.fetchall()   # results contains the query plan not the query result
```
You only need to print out the details field; that is, for each tuple in in results, print out the item in index 3 (print(results[i][3])). If you build intermediate tables, only print the query plan of the final query (i.e., you don’t need the query plan for the CREATE TABLE commands).
Your SQL queries should be written so that they are run relatively efficiently. SQLite will make use of indexes if they exist. Try out queries in SQLite command line, look at the query plan, and think about re-structuring and adding indexes to improve the performance of slow queries (see Tips below).
You should print out the time it takes to perform each query. To time around some code in python:
```
  start = time.time()
  # execute query
  # call fetchall to get results
  end = time.time()
  print("\nCompleted in %.3f seconds" % (end-start))
  # print results
```
You only need to count the execution time and fetching results, not the time to print your results or explain the query.
You are allowed to create temporary tables to store intermediate queries. But you must drop temporary tables as soon as you complete the query. For example, if you create a temporary table called MyTable, clean up by running:
```
  db.execute("DROP TABLE MyTable")
```
You should format your results in an easy to read manner. Create a function that can print out the result to any query so that you are not tailoring your output to the specific query. First print out the column headers (take a look at the Cursor’s description field. The length of description is equal to the number of columns in the result. The first item in each row of description is the column name. So, to print out the ith column name using 20 spaces:
```
      print("%-20s" % db.description[i][0])
```
To print each tuple, you can also allocate 20 spaces for each attribute value. You can create a format string to accomplish this:
```
      formString = "%-20s " * len(db.description)
```
Then, fill in the templates using a tuple. For example, if you store a result into a tuple called result, you can print it out simply:
```
      print(formString % result)
```
You can also use formString for the column headers above (but you will have to extract all the column names into a list). Note that some fields can be longer than 20 characters and may not look at nice. This is okay, but you can try something more adaptive if you would like.

Queries

You will need to answer the following queries. Since there are many ways to write the same query, I ask that you sort your final results as specified to make comparisons easier. Queries 4-6 should be in descending order (largest values first) while others are ascending. “Additional attributes” are attributes that should appear in your results but are not relevant to the sort ordering.

Query	Description	Sort order	Additional attributes
1	List the names of all distinct actors in the movie "The Princess Bride"	Actor's first name, Actor's last name
2	Ask the user for the name of an actor, and print all the movies starring an actor with that name (only print each title once).	Movie title
3	Ask the user for the name of two actors. Print the names of all movies in which those two actors co-starred (i.e., the movie starred both actors).	Movie id	Movie title
4	List all directors who directed 500 movies or more, in descending order of the number of movies they directed. Return the directors’ names and the number of movies each of them directed.	Number of movies directed	Director's first name and last name
5	Find Kevin Bacon's favorite co-stars. Print all actors as well as the number of movies that actor has co-starred with Kevin Bacon (but only if they've acted together in 8 movies or more). Be sure that Kevin Bacon isn't in your results! Only count each movie once per actor (i.e., ignore multiple roles in the same film)	Number of distinct movies co-starred	Co-stars first name and last name
6	Find actors who played five or more roles in the same movie during the year 2010.	Number of roles, Movie title	Actor’s first name and last name
7	Programmer’s Choice: develop your own query. It should be both meaningful and non-trivial - show off your relational reasoning skills! (But keep the query under a minute of run time)

Example Run

It is up to you to evaluate your query results for correctness and efficiency. I have provided sample output for some of the queries to help understand what your program should do. The output includes an estimate for the number of tuples queries will return, and the time it takes for my queries to run (see if you can beat my time!).

Tips and additional details

See this page for tips on where to store your data: Tips for Performance and Storage.
You will want to divorce SQL errors from Python errors as much as possible. Since SQL queries are roughly the same in both SQLite and the Python interface to sSQLite, you should practice running your queries using the SQLite command-line interface. Note that a DB created using Python can be inspected by loading it up in SQLite. For example:
```
  $ python createDB.py /local/me_n_pal/movie.db
  $ sqlite3 movie.db
  SQLite version 3.7.9 2011-11-01 00:52:41
  Enter ".help" for instructions
  Enter SQL statements terminated with a ";"
  sqlite> .table
  .table
  Actor    Casts    DirectMovie     Director     Genre        Movie
  sqlite>
```
This data set is pretty large. createDB.py may take awhile to insert all of the values and construct the indices (~10 minutes). Your queries, however, should run fairly quickly.
When designing your indices, it will get tedious to do so in createDB.py since the insertion of records is takes several minutes while constructing indices is quick. I suggest doing all DROP INDEX and CREATE INDEX operations in sqlite3 or sqlitebrowser until you are satisfied. Keep track of which indices you want to keep and add them to your createDB.py script (and be sure to test that you kept the correct ones!).
Query taking a long time? Try breaking up the query into multiple pieces, saving intermediate results using the CREATE TABLE name AS ... syntax. Sqlite may not optimize nested queries well in certain cases, so if the nest is a static table, it may be faster to pre-calculate the table.
In some cases, you might get a query closer to a minute depending on how you write the query. If your query is under a minute, you are probably fine. There is a fast (less than a second) solution for all of the queries. You are not required to fully optimize all queries, but you should avoid very poor performance on any individual query.

Submitting your lab

Before the due date, push your solution to github from one of your local repos to the GitHub remote repo. Only one of you or your partner needs to do this.

From your local repo (in your ~cs44/labs/Lab4-userID1-userID2subdirectory)

make clean
git add *
git commit -m "our correct, robust, and well commented solution for grading"
git push

If that doesn’t work, take a look at the “Troubleshooting” section of the Using Git page. Also, be sure to complete the README.md file, add and push it.