Make sure all programs are saved to your cs21/labs/09
directory. Files outside this directory will not be graded.
$ update21
$ cd ~/cs21/labs/09
This is a two-part assignment:
1. In sentiment.py, you will write a program that reads in movie reviews from Rotten Tomatoes and conducts sentiment analysis by sorting words based on how positive or negative they are.
2. In visualize_sort.py, you will write a sorting algorithm with the goal of visualizing how the algorithm sorts a list of numbers.

A popular area of research in natural language processing is sentiment analysis. Typically, the goal is to develop algorithms that read natural text and classify its sentiment (e.g., negative vs. positive; 5-star vs. 3-star product). For this lab, you will analyze data extracted from Rotten Tomatoes, a website that aggregates movie reviews from film critics. Your goal is to read a file containing movie reviews (made available by Stanford University's NLP group and Kaggle) and use sorting to determine which words are most associated with positive reviews and which words are most associated with negative reviews.
The file of reviews is located in /usr/local/doc/movieReviews.txt:
$ head /usr/local/doc/movieReviews.txt
1 A series of escapades demonstrating the adage that what is good for the goose
is also good for the gander , some of which occasionally amuses but none of
which amounts to much of a story .
4 This quiet , introspective and entertaining independent is worth seeking .
1 Even fans of Ismail Merchant 's work , I suspect , would have a hard time
sitting through this one .
3 A positively thrilling combination of ethnography and all the intrigue ,
betrayal , deceit and murder of a Shakespearean tragedy or a juicy soap opera .
1 Aggressive self-glorification and a manipulative whitewash .
4 A comedy-drama of nearly epic proportions rooted in a sincere performance by
the title character undergoing midlife crisis .
1 Narratively , Trouble Every Day is a plodding mess .
3 The Importance of Being Earnest , so thick with wit it plays like a reading
from Bartlett 's Familiar Quotations
1 But it does not leave you with much .
1 You could hate it for the same reason .
Each line is one review; the first element is an integer star rating from 0 through 4, with 0 being very negative and 4 being very positive. Each element after that is a word in the review.
To keep track of each unique word and its sentiment score, you can use either a list of lists
[ [word1, score1], [word2, score2], ... ]
or parallel lists
[word1, word2, ...]
[score1, score2, ...]
Your program must read the file and process the data in order to obtain the list of lists described above (word/rating pairs). The next two sections provide some details that will help you design your solution correctly. As you process each word in each line, keep track of the following:
Convert each word to lowercase. You can use the lower() method for str objects:
>>> word = 'Film'
>>> newWord = word.lower()
>>> newWord
'film'
Ignore common stop words. Import the words library (as you did in Lab 5) and use the getStopWords() function to get this list. In the example below, your program should ignore 'and' since it is a stop word and keep 'film':
>>> from words import getStopWords
>>> ignoreWords = getStopWords()
>>> 'and' in ignoreWords
True
>>> 'film' in ignoreWords
False
Ignore tokens that are not actually words (e.g., punctuation). The str class has an isalpha() method that returns True only if the string contains only alphabetic characters:
>>> word = '.'
>>> word.isalpha()
False
>>> word = 'film'
>>> word.isalpha()
True
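Putting these checks together, here is a sketch of one possible helper that decides whether a word should be counted (the name keepWord is only illustrative, not required, and it assumes ignoreWords came from getStopWords()):

def keepWord(word, ignoreWords):
    # keep a word only if it is purely alphabetic and not a stop word
    word = word.lower()
    return word.isalpha() and (word not in ignoreWords)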
As mentioned above, each word gets paired with the rating of the movie review it appears in. This means that each word in the sentence shares the rating (which occurs at the beginning of the line).
There are more details below, but you will need to adjust the rating scale so that it runs from -2 to +2. This is as simple as subtracting 2 from each rating (so a rating of 0 becomes -2 and a rating of 4 becomes +2).
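For instance, here is a minimal sketch of handling one line (the variable names are only illustrative):

line = "1 A terrible waste of talent on a bloated plot"
tokens = line.split()        # split the review line on whitespace
rating = int(tokens[0]) - 2  # first token is the 0-4 rating; shift it to -2..+2
reviewWords = tokens[1:]     # every remaining word shares this adjusted rating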
A word's sentiment score is the cumulative adjusted rating over every occurrence of that word in every review; if a word appears twice in the same review, that review's rating is added twice.
For example, assume we have the following three reviews (available in /usr/local/doc/smallReviews.txt):
$ cat /usr/local/doc/smallReviews.txt
1 A terrible waste of talent on a bloated plot
4 The cast is full of talent and memorable performances
3 While the run time is bloated , this independent film is anything but run of
the mill
You should obtain the following scores:
2 cast
2 full
2 memorable
2 performances
2 run
1 talent
1 time
1 independent
1 film
1 anything
1 mill
0 bloated
-1 terrible
-1 plot
-1 waste
The three reviews have adjusted ratings of -1, 2, and 1. 'run'
appears twice in a review with an adjusted 1 score (third review) and thus has a cumulative score of 2. 'talent'
appears in two different reviews and has a cumulative score of 1 (-1 from review 1, +2 from review 2). We suggest that you use this smaller file (or make your own) for development and testing.
Here is some pseudocode that can help guide you:
for each (word, rating) pair read from the file
    if word has not been seen before
        append word to the unique-word list
        append rating to the unique-word-rating list
    else
        locate the index of word in the unique-word list
        add rating to unique-word-rating[index]
There are many ways to implement this. As mentioned above, you can do this with parallel lists or a list of lists. The pseudocode above matches a parallel-list implementation. The second option is a list of lists, where each item is a list containing a string (the unique word) and a number (the current sentiment score). With a list of lists, finding whether a new word is unique isn't as easy as using the in operator or the index() method; you'll have to do a linear search over the list of lists, as you did last week.
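As a rough sketch of the pseudocode above using parallel lists (the function and variable names are placeholders, not lab requirements):

def addWordRating(unique_words, sentiments, word, rating):
    # accumulate one (word, adjusted rating) pair into the parallel lists
    if word not in unique_words:
        # first time seeing this word
        unique_words.append(word)
        sentiments.append(rating)
    else:
        # word already seen: add this review's rating to its running total
        index = unique_words.index(word)
        sentiments[index] = sentiments[index] + rating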
Here is an example of what your list(s) might look like on the smallReviews.txt
example.
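For example, with parallel lists (and assuming words are stored in the order they are first encountered), the finished lists for smallReviews.txt would look roughly like this:

unique_words = ['terrible', 'waste', 'talent', 'bloated', 'plot', 'cast', 'full',
                'memorable', 'performances', 'run', 'time', 'independent', 'film',
                'anything', 'mill']
sentiments = [-1, -1, 1, 0, -1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1]

With a list of lists, the same data would be stored as [['terrible', -1], ['waste', -1], ['talent', 1], ...].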
Your sorting algorithm will sort words based on their sentiment value. This is different from what we did in class, where items were sorted based on their own value. You have two options, depending on whether you use parallel lists or a list of lists:
# Option 1 (parallel lists): swap positions i and j in both lists so each word stays with its score
tempSent = sentiments[i]
sentiments[i] = sentiments[j]
sentiments[j] = tempSent
tempWord = unique_words[i]
unique_words[i] = unique_words[j]
unique_words[j] = tempWord
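If you instead use a list of lists, each word and its score already travel together, so a single swap is enough. A minimal sketch (assuming word_scores is your list of [word, score] pairs; the name is only illustrative):

# Option 2 (list of lists): swap the [word, score] pairs at positions i and j
temp = word_scores[i]
word_scores[i] = word_scores[j]
word_scores[j] = temp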
We have provided a total of three files to evaluate on:
/usr/local/doc/smallReviews.txt - the example file from above that contains 3 fake reviews
/usr/local/doc/1000Reviews.txt - the first 1000 reviews available
/usr/local/doc/movieReviews.txt - all available reviews
We highly recommend testing and debugging in the order of the files listed above. Your code may take anywhere from 30 seconds to 2 minutes on the full data set, which is not practical for developing and testing; the full file is provided so you can see what sentiment is found using all of the available data. Here are sample outputs for the two larger files (the output for smallReviews.txt is provided above):
Your final submission must use movieReviews.txt. It is okay if your output differs from ours in the case of ties (e.g., there are multiple words with a sentiment of 19.00).
You'll notice that the final output has some words that don't really make sense, particularly amongst the negative reviews (e.g., feels, movie, like). This is because we used a fairly naive scoring metric to keep the implementation simpler. As an extension (not required, and to be attempted only if the rest of the lab is complete), try other scoring metrics and see if you can come up with something better. Prof. Wicentowski, our local NLP (natural language processing) expert, suggests the following formula for each unique word, w
:

score(w) = (sum of ratings for w / number of occurrences of w) * log(number of occurrences of w)
That is, for each unique word, keep track of the sum of ratings for that word (as above) and the total number of occurrences of that word. The first part of the formula is the average rating for the word (the write-up above uses only the sum), and the second part is the log of the number of occurrences of that word. Together, the average and the log adjust for how frequently each word appears. Here are the (more sensible) results:
Part 2 of this assignment is more open-ended than our typical lab programs. Your task is to write a program, visualize_sort.py
, that displays a visualization of how one particular sorting algorithm works. This visualization can work by printing strings to the terminal, or — for the extra challenge — by creating an animation using Zelle graphics. Your program should allow its users to see how a particular sorting algorithm changes a list from unsorted to sorted, one swap at a time.
As with your sentiment.py program, your code should be organized into functions, including a main() function. Here are two examples of visualizing selection sort:
We would like to see what you can come up with. Note that the graphics example is not the expectation; it is more of a challenge/extension if you are interested. The goal is to help your users understand the inner workings of your chosen algorithm, however you see fit. Along the way, hopefully you will also deepen your own understanding.
TIP: it is a good idea to pause your program after each major step. In the graphics example above, you can use getMouse()
or getKey()
to make the program wait until the user is ready to proceed. For the command-line versions, you can use input()
e.g.,
#output some result
input("Hit Enter to Continue...")
#resume execution of program
It doesn’t matter what the user inputs, so I have not bothered to save the value in a variable.
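For example, one possible (purely illustrative) text visualization prints each value in the list as a row of asterisks and then pauses:

def showList(values):
    # draw each number as a bar of asterisks, then wait for the user
    for value in values:
        print("*" * value)
    print()
    input("Hit Enter to Continue...")

Calling a function like this after every swap lets the user watch the bars gradually move into sorted order.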
Once you are satisfied with your programs, fill out the questionnaire in QUESTIONS-09.txt
. Then run handin21
a final time to make sure we have access to the most recent versions of your files.
Thank you to Prof. Eric D. Manley and Timothy M. Urness at Drake University for the data and inspiration for the sentiment analysis portion of the lab.