This week and next we are going to look at some resources that will help with Lab 6: The Movie Database. This lab involves using SQL commands to create a database, define indicies, and run SQL queries on it, all using SQLite embedded in Python.

In lab week 9, we will focus on some resource to help with the checkpoint part of the assignment. The following week we will talk about queries and performance more.

Week 9: checkpoint

Introduce Lab 6

Today we are going talk about writing a large SQL program for Lab 6: The Movie Database.

It may be helpful to refer to past information about SQLite:

In addition, you will want to examine these resources from the Lab 6 assignment page before starting:

The goal of this lab is to learn how to create a large database application using Embedded SQL.

Implementation Strategy

We will preview and suggest the following implementation order:

  1. Begin by implementing createDB.py.

    1. Review the provided code. Refer to SQLite in Python tutorials to better understand the main steps (for example, how to connect and send instructions to an SQLite instance). The main method has been provided.

    2. Begin implementing createTables(). Look for TODO or pass statements. pass is a useful placeholder for code that needs to be written, so be sure to remove these statements once you have written the methods.

    3. We will step through how to create the Actor table in lab which will follow this format:

      db.execute("""CREATE TABLE Actor(
                    id INTEGER,
                    # define more attributes and constraints
                    )""")

      db.execute() takes a query as a string. Note that the string will match exactly with the query you would enter into SQLite. Above, we show an example of multi-line strings in Python. While this is not required, it allows you to type the query on multiple lines of code and thus makes your code easier to read. This is highly recommended.

    4. Implement the rest of the 5 schemas on your own.

    5. To verify your work, periodically open your database in SQLite or by using the sqlitebrowser application to provide a visual GUI representation. Check that your key constraints are correct.

      # assumes you've created a directory w/your userID name in /local/userID
      
      python3 createDB.py /local/userID/movieDB   # create the movieDB
      
      sqlite3 /local/userID/movieDB         # open the DB you created in sqlite
      
      sqlitebrowser /local/userID/movieDB   # open it in the sqlitebrowser
    6. Implement insertAll() to read in the provided files and populate the tables. If you haven’t done so already, you will want to read the page on tips for speeding up performance. You will not want to use your home directory to create your database - it will be very slow and eat up your quota.

    7. Insert one relation at a time and check your results to make sure everything looks correct. This method will require file parsing (use the main writeup and this page from CS21 for tips).

    8. We will look at an example of inserting many tuples at once.

      tuples = []
      tuples.append(["111", "Robert", "DeNiro", "M"])
      tuples.append(["222", "Carl", "Weathers", "M"])
      db.executemany("INSERT INTO Actor VALUES(?,?,?,?)", tuples)

      Note that tuples above uses toy examples; in actuality you will create your array by parsing the provided data files. But this will allow you to insert all of the tuples for a relation at once, rather than one a time.

    9. Implement the rest of the inserts, one relation at a time. Once this is done you should not have to run createDB.py again except to (re)build indexes.

  2. You can now shift to working on queryDB.py.

    1. Read the provided code - you will need to utilize all of the provided methods at some point.

    2. Follow the same advice as before to start implementing (look for todo and pass statements to find methods you need to implement).

    3. We recommend first writing and running your queries in SQLite directly and then copying the query to your Python code when you are satisfied with the results.

    4. Use EXPLAIN QUERY to print out the query plan for your queries. If you built an index but it is not being uses here, that means the query optimizer did not find your index to be helpful. You should only keep indexes that are being used by at least one query. An example where an index named MovieIndex on Movie(id) is used would be:

      SEARCH TABLE Movie AS M USING COVERING INDEX MovieIndex (id=?)
  3. Next, you will need to return to createDB.py to build (B+ Tree) indexes (note: you do not need to create indicies for the checkpoint, but you may want to revisit the checkpoint queries later to see if their performance may benefit from adding indicies). To reconstruct just the indexes, first add code to create an index in the buildIndexes() method and then run createDB.py while selection option 2 to rebuild indexes. This will keep all existing tables and tuples intact.

    python3 createDB.py /local/userID/movieDB
    File already exists.  Would you like to:
      0) Exit the program
      1) Remove the file and rebuild the entire DB
      2) Keep the file and rebuild the indexes only
    Enter choice: 2
    ...Removing Indexes..
    ...Building Indexes...

Week 10: queries and performance

Today we are going to continue working on Lab 6: The Movie Database. Be sure to examine the "Introduction" Section for tips on speeding up your program and look other helpful tips. In addition, it may be helpful to refer to past information about SQLite:

as well as these resources:

Saving Query Output as a Relational Table

For each of the required queries, there exists (at least one) single query solution. However, as computer scientists, we learn that it is good to break a hard problem into multiple parts. For some of the more difficult queries, it may be useful to break the problem into multiple parts and solve the smaller subproblems. One useful strategy is to write multiple queries and use the output of one query as an input to the next (i.e., compose your queries). Below, we show an example of this can be done.

Let’s say we want to satisfy the following query: "List all actors who have been cast in at more than 4 Quentin Tarantino films." Take a minute and think about how you might solve this problem before moving on.

Okay, here is one potential solution where we break the query into two parts. First, we want to identify all of the movies that Quentin Tarantino has directed:

queryPart1 = """
              SELECT DM.movieID
              FROM DirectsMovie DM, Director D
              WHERE DM.directorID = D.id AND D.lname = 'Tarantino' and D.fname = 'Quentin'
             """

We join the DirectsMovie and Director tables and only keep films where the director is named "Quentin Tarantino". Now we’d like to use this as the input to the second part of our solution: "find the actors that were in those movies".

First, we need to save the results from part 1. We can do this using the CREATE TABLE X AS command which defines a table based on a query result:

queryPart1 = """
              CREATE TABLE TarantinoFilms AS
                SELECT DM.movieID
                FROM DirectsMovie DM, Director D
                WHERE DM.directorID = D.id AND D.lname = 'Tarantino' and D.fname = 'Quentin'
             """
db.execute(queryPart1) #at this point, you should check the table in sqlite3

Using the Saved Query Output

You should load up sqlite3 and verify that the table exists and the results look reasonable (here is his IMDB page. Note that our dataset is a bit outdated so newer movies won’t show up in your query result). Now, we can use TarantinoFilms just like any other relation in our database. Here is how we can find all actors who have been in one of the Tarantino films from Part 1:

queryPart2 = """
              SELECT A.id, A.fname, A.lname, COUNT(*)
              FROM TarantinoFilms T, Casts C, Actor A
              WHERE T.movieID = C.movieID and A.id = C.actorID
              GROUP BY A.id
              HAVING COUNT(*)>3
             """

We join Casts with Actors and TarantinoFilms which gives us information about all Actors who have starred in a Tarantino movie. Then, we add a GROUP BY and HAVING command to group all of the instances for each actor and only keep actors with more than 3 instances. We can now execute, fetch results, and print results for this query.

id                   fname                lname                COUNT(*)
--------------------------------------------------------------------------------
51688                Michael              Bacall               4
82772                Lawrence             Bender               5
496647               Samuel L.            Jackson              4
647382               Michael              Madsen               4
806512               Michael              Parks                4
840200               Stevo                Polyi                4
1038507              Quentin              Tarantino            8
1141844              Bruce                Willis               4
1343554              Julie                Dreyfus              4
1794091              Uma                  Thurman              4
1810514              Venessia             Valentino            4

Cleaning Up Temporary Tables

At this point, you’ve done the hard work. But you will need to drop the temporary table so it doesn’t linger in the database (the DB should be in the same state as it was before you ran the query).

After your query is done, you can run a DROP TABLE command

db.execute("DROP TABLE TarantinoFilm")

Also, since your query may crash during development, you will want to preemptively drop any tables so that you don’t have to recreate the whole database. Add IF EXISTS to drop any tables that need to be cleaned up before starting your query:

db.execute("DROP TABLE IF EXISTS TarantinoFilm")

Result

Bringing it all together, this is the complete solution:

db.execute("DROP TABLE IF EXISTS TarantinoFilm") # in case your program crashed before cleaning this up last time


queryPart1 = """CREATE TABLE TarantinoFilms AS
                 SELECT DM.movieID
                 FROM DirectsMovie DM, Director D
                 WHERE DM.directorID = D.id AND D.lname = 'Tarantino' and D.fname = 'Quentin'
              """


queryPart2 = """
              SELECT A.id, A.fname, A.lname, COUNT(*)
              FROM TarantinoFilms T, Casts C, Actor A
              WHERE T.movieID = C.movieID and A.id = C.actorID
              GROUP BY A.id
              HAVING COUNT(*)>3
            """

db.execute(queryPart1) #create the table
db.execute(queryPart2) #process the query
# todo, fetch and print results.  This will only output results from the most
#  recent call to db.execute()

db.execute("DROP TABLE TarantinoFilm") # delete the temporary table

Handy Resources