This week and next we are going to look at some resources that will help with Lab 6: The Movie Database. This lab involves using SQL commands to create a database, define indicies, and run SQL queries on it, all using SQLite embedded in Python.
In lab week 9, we will focus on some resource to help with the checkpoint part of the assignment. The following week we will talk about queries and performance more.
Week 9: checkpoint
Introduce Lab 6
Today we are going talk about writing a large SQL program for Lab 6: The Movie Database.
It may be helpful to refer to past information about SQLite:
-
Week 4 in-lab page, specifically the SQLite tutorial.
In addition, you will want to examine these resources from the Lab 6 assignment page before starting:
The goal of this lab is to learn how to create a large database application using Embedded SQL.
Implementation Strategy
We will preview and suggest the following implementation order:
-
Begin by implementing
createDB.py
.-
Review the provided code. Refer to SQLite in Python tutorials to better understand the main steps (for example, how to connect and send instructions to an SQLite instance). The
main
method has been provided. -
Begin implementing
createTables()
. Look forTODO
orpass
statements.pass
is a useful placeholder for code that needs to be written, so be sure to remove these statements once you have written the methods. -
We will step through how to create the
Actor
table in lab which will follow this format:db.execute("""CREATE TABLE Actor( id INTEGER, # define more attributes and constraints )""")
db.execute()
takes a query as a string. Note that the string will match exactly with the query you would enter into SQLite. Above, we show an example of multi-line strings in Python. While this is not required, it allows you to type the query on multiple lines of code and thus makes your code easier to read. This is highly recommended. -
Implement the rest of the 5 schemas on your own.
-
To verify your work, periodically open your database in SQLite or by using the
sqlitebrowser
application to provide a visual GUI representation. Check that your key constraints are correct.# assumes you've created a directory w/your userID name in /local/userID python3 createDB.py /local/userID/movieDB # create the movieDB sqlite3 /local/userID/movieDB # open the DB you created in sqlite sqlitebrowser /local/userID/movieDB # open it in the sqlitebrowser
-
Implement
insertAll()
to read in the provided files and populate the tables. If you haven’t done so already, you will want to read the page on tips for speeding up performance. You will not want to use your home directory to create your database - it will be very slow and eat up your quota. -
Insert one relation at a time and check your results to make sure everything looks correct. This method will require file parsing (use the main writeup and this page from CS21 for tips).
-
We will look at an example of inserting many tuples at once.
tuples = [] tuples.append(["111", "Robert", "DeNiro", "M"]) tuples.append(["222", "Carl", "Weathers", "M"]) db.executemany("INSERT INTO Actor VALUES(?,?,?,?)", tuples)
Note that
tuples
above uses toy examples; in actuality you will create your array by parsing the provided data files. But this will allow you to insert all of the tuples for a relation at once, rather than one a time. -
Implement the rest of the inserts, one relation at a time. Once this is done you should not have to run
createDB.py
again except to (re)build indexes.
-
-
You can now shift to working on
queryDB.py
.-
Read the provided code - you will need to utilize all of the provided methods at some point.
-
Follow the same advice as before to start implementing (look for
todo
andpass
statements to find methods you need to implement). -
We recommend first writing and running your queries in SQLite directly and then copying the query to your Python code when you are satisfied with the results.
-
Use
EXPLAIN QUERY
to print out the query plan for your queries. If you built an index but it is not being uses here, that means the query optimizer did not find your index to be helpful. You should only keep indexes that are being used by at least one query. An example where an index namedMovieIndex
onMovie(id)
is used would be:SEARCH TABLE Movie AS M USING COVERING INDEX MovieIndex (id=?)
-
-
Next, you will need to return to
createDB.py
to build (B+ Tree) indexes (note: you do not need to create indicies for the checkpoint, but you may want to revisit the checkpoint queries later to see if their performance may benefit from adding indicies). To reconstruct just the indexes, first add code to create an index in thebuildIndexes()
method and then runcreateDB.py
while selection option 2 to rebuild indexes. This will keep all existing tables and tuples intact.python3 createDB.py /local/userID/movieDB File already exists. Would you like to: 0) Exit the program 1) Remove the file and rebuild the entire DB 2) Keep the file and rebuild the indexes only Enter choice: 2 ...Removing Indexes.. ...Building Indexes...
Week 10: queries and performance
Today we are going to continue working on Lab 6: The Movie Database. Be sure to examine the "Introduction" Section for tips on speeding up your program and look other helpful tips. In addition, it may be helpful to refer to past information about SQLite:
as well as these resources:
Saving Query Output as a Relational Table
For each of the required queries, there exists (at least one) single query solution. However, as computer scientists, we learn that it is good to break a hard problem into multiple parts. For some of the more difficult queries, it may be useful to break the problem into multiple parts and solve the smaller subproblems. One useful strategy is to write multiple queries and use the output of one query as an input to the next (i.e., compose your queries). Below, we show an example of this can be done.
Let’s say we want to satisfy the following query: "List all actors who have been cast in at more than 4 Quentin Tarantino films." Take a minute and think about how you might solve this problem before moving on.
Okay, here is one potential solution where we break the query into two parts. First, we want to identify all of the movies that Quentin Tarantino has directed:
queryPart1 = """
SELECT DM.movieID
FROM DirectsMovie DM, Director D
WHERE DM.directorID = D.id AND D.lname = 'Tarantino' and D.fname = 'Quentin'
"""
We join the DirectsMovie
and Director
tables and only keep films where the
director is named "Quentin Tarantino". Now we’d like to use this as the input
to the second part of our solution: "find the actors that were in those movies".
First, we need to save the results from part 1. We can do this using the
CREATE TABLE X AS
command which defines a table based on a query result:
queryPart1 = """
CREATE TABLE TarantinoFilms AS
SELECT DM.movieID
FROM DirectsMovie DM, Director D
WHERE DM.directorID = D.id AND D.lname = 'Tarantino' and D.fname = 'Quentin'
"""
db.execute(queryPart1) #at this point, you should check the table in sqlite3
Using the Saved Query Output
You should load up sqlite3
and verify that the table exists and the results
look reasonable (here is his IMDB page. Note that our dataset is a bit outdated so newer movies won’t show up
in your query result).
Now, we can use TarantinoFilms
just like any other relation in our database.
Here is how we can find all actors who have been in one of the
Tarantino films from Part 1:
queryPart2 = """
SELECT A.id, A.fname, A.lname, COUNT(*)
FROM TarantinoFilms T, Casts C, Actor A
WHERE T.movieID = C.movieID and A.id = C.actorID
GROUP BY A.id
HAVING COUNT(*)>3
"""
We join Casts
with Actors
and TarantinoFilms
which gives us information
about all Actors who have starred in a Tarantino movie. Then, we add a
GROUP BY
and HAVING
command to group all of the instances for each actor
and only keep actors with more than 3 instances. We can now execute, fetch
results, and print results for this query.
id fname lname COUNT(*) -------------------------------------------------------------------------------- 51688 Michael Bacall 4 82772 Lawrence Bender 5 496647 Samuel L. Jackson 4 647382 Michael Madsen 4 806512 Michael Parks 4 840200 Stevo Polyi 4 1038507 Quentin Tarantino 8 1141844 Bruce Willis 4 1343554 Julie Dreyfus 4 1794091 Uma Thurman 4 1810514 Venessia Valentino 4
Cleaning Up Temporary Tables
At this point, you’ve done the hard work. But you will need to drop the temporary table so it doesn’t linger in the database (the DB should be in the same state as it was before you ran the query).
After your query is done, you can run a DROP TABLE
command
db.execute("DROP TABLE TarantinoFilm")
Also, since your query may crash during development, you will want to
preemptively drop any tables so that you don’t have to recreate the whole
database. Add IF EXISTS
to
drop any tables that need to be cleaned up before starting your query:
db.execute("DROP TABLE IF EXISTS TarantinoFilm")
Result
Bringing it all together, this is the complete solution:
db.execute("DROP TABLE IF EXISTS TarantinoFilm") # in case your program crashed before cleaning this up last time
queryPart1 = """CREATE TABLE TarantinoFilms AS
SELECT DM.movieID
FROM DirectsMovie DM, Director D
WHERE DM.directorID = D.id AND D.lname = 'Tarantino' and D.fname = 'Quentin'
"""
queryPart2 = """
SELECT A.id, A.fname, A.lname, COUNT(*)
FROM TarantinoFilms T, Casts C, Actor A
WHERE T.movieID = C.movieID and A.id = C.actorID
GROUP BY A.id
HAVING COUNT(*)>3
"""
db.execute(queryPart1) #create the table
db.execute(queryPart2) #process the query
# todo, fetch and print results. This will only output results from the most
# recent call to db.execute()
db.execute("DROP TABLE TarantinoFilm") # delete the temporary table
Handy Resources
-
Week 4 in-lab page, specifically the SQLite tutorial.
-
SQLite Python Tutorial from zetcode.com
-
SQL Tutorial from sqlzoo.net