Make sure all programs are saved to your cs21/labs/09
directory. Files outside this directory will not be graded.
$ update21
$ cd ~/cs21/labs/09
This is a two-part assignment:
1. In sentiment.py, you will write a program that reads in movie reviews from Rotten Tomatoes and conducts sentiment analysis by sorting words based on how positive or negative they are.
2. In visualize_sort.py, you will write a sorting algorithm with the goal of visualizing how the algorithm sorts a list of numbers.

A popular area of research in natural language processing is sentiment analysis. Typically, the goal is to develop algorithms that read natural text and classify its sentiment (e.g., negative vs. positive; 5-star vs. 3-star product). For this lab, you will analyze data extracted from Rotten Tomatoes, a website that aggregates movie reviews from film critics. Your goal is to read a file containing movie reviews (made available by Stanford University's NLP group and Kaggle) and use sorting to determine which words are most associated with positive reviews and which words are most associated with negative reviews.
The file of reviews is located in /usr/local/doc/movieReviews.txt:
$ head /usr/local/doc/movieReviews.txt
1 A series of escapades demonstrating the adage that what is good for the goose
is also good for the gander , some of which occasionally amuses but none of
which amounts to much of a story .
4 This quiet , introspective and entertaining independent is worth seeking .
1 Even fans of Ismail Merchant 's work , I suspect , would have a hard time
sitting through this one .
3 A positively thrilling combination of ethnography and all the intrigue ,
betrayal , deceit and murder of a Shakespearean tragedy or a juicy soap opera .
1 Aggressive self-glorification and a manipulative whitewash .
4 A comedy-drama of nearly epic proportions rooted in a sincere performance by
the title character undergoing midlife crisis .
1 Narratively , Trouble Every Day is a plodding mess .
3 The Importance of Being Earnest , so thick with wit it plays like a reading
from Bartlett 's Familiar Quotations
1 But it does not leave you with much .
1 You could hate it for the same reason .
Each line is one review; the first element is an integer star rating from 0 through 4, with 0 being very negative and 4 being very positive. Each element after that is a word in the review.
To keep track of each unique word and its sentiment score, you can use either a list of lists
[ [word1, score1], [word2, score2], ... ]
or parallel lists
[word1, word2, ...]
[score1, score2, ...]
Your program must read the file and process the data in order to obtain the list of lists described above (word/rating pairs). The next two sections provide some details that will help you design your solution correctly. As you process each word in each line, keep track of the following:
Convert each word to lowercase. You can use the lower() method for str objects:
>>> word = 'Film'
>>> newWord = word.lower()
>>> newWord
'film'
Ignore common stop words. Import the words library (as you did in Lab 5) and use the getStopWords() function to get this list. In the example below, your program should ignore 'and' since it is a stop word and keep 'film':
>>> from words import getStopWords
>>> ignoreWords = getStopWords()
>>> 'and' in ignoreWords
True
>>> 'film' in ignoreWords
False
Ignore tokens that are not actually words (e.g., punctuation). The str class has an isalpha() method that returns True only if the string contains only alphabetic characters:
>>> word = '.'
>>> word.isalpha()
False
>>> word = 'film'
>>> word.isalpha()
True
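Putting these checks together, here is a sketch of one possible helper that decides whether a word should be counted (the name keepWord is only illustrative, not required, and it assumes ignoreWords came from getStopWords()):

def keepWord(word, ignoreWords):
    # keep a word only if it is purely alphabetic and not a stop word
    word = word.lower()
    return word.isalpha() and (word not in ignoreWords)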
As mentioned above, each word gets paired with the rating of the movie review it appears in. This means that each word in the sentence shares the rating (which occurs at the beginning of the line).
There are more details below, but you will need to adjust the rating scale so that it runs from -2 to +2. This is as simple as subtracting 2 from each rating (so a rating of 0 becomes -2 and a rating of 4 becomes +2).
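For instance, here is a minimal sketch of handling one line (the variable names are only illustrative):

line = "1 A terrible waste of talent on a bloated plot"
tokens = line.split()        # split the review line on whitespace
rating = int(tokens[0]) - 2  # first token is the 0-4 rating; shift it to -2..+2
reviewWords = tokens[1:]     # every remaining word shares this adjusted rating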
A word's sentiment score is the cumulative adjusted rating over every occurrence of that word in every review; if a word appears twice in the same review, that review's rating is added twice.
For example, assume we have the following three reviews (available in /usr/local/doc/smallReviews.txt):
$ cat /usr/local/doc/smallReviews.txt
1 A terrible waste of talent on a bloated plot
4 The cast is full of talent and memorable performances
3 While the run time is bloated , this independent film is anything but run of
the mill
You should obtain the following scores:
2 cast
2 full
2 memorable
2 performances
2 run
1 talent
1 time
1 independent
1 film
1 anything
1 mill
0 bloated
-1 terrible
-1 plot
-1 waste
The three reviews have adjusted ratings of -1, 2, and 1. 'run'
appears twice in a review with an adjusted 1 score (third review) and thus has a cumulative score of 2. 'talent'
appears in two different reviews and has a cumulative score of 1 (-1 from review 1, +2 from review 2). We suggest that you use this smaller file (or make your own) for development and testing.
Here is some pseudocode that can help guide you:
for each (word, rating) pair read from the file
    if word has not been seen before
        append word to the unique-word list
        append rating to the unique-word-rating list
    else
        locate the index of word in the unique-word list
        add rating to unique-word-rating[index]
There are many ways to implement this. As mentioned above, you can do this with parallel lists or a list of lists. The pseudocode above matches a parallel-list implementation. The second option is a list of lists, where each item is a list containing a string (the unique word) and a number (the current sentiment score). With a list of lists, finding whether a new word is unique isn't as easy as using the in operator or the index() method; you'll have to do a linear search over the list of lists, as you did last week.
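As a rough sketch of the pseudocode above using parallel lists (the function and variable names are placeholders, not lab requirements):

def addWordRating(unique_words, sentiments, word, rating):
    # accumulate one (word, adjusted rating) pair into the parallel lists
    if word not in unique_words:
        # first time seeing this word
        unique_words.append(word)
        sentiments.append(rating)
    else:
        # word already seen: add this review's rating to its running total
        index = unique_words.index(word)
        sentiments[index] = sentiments[index] + rating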
Here is an example of what your list(s) might look like on the smallReviews.txt
example.
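For example, with parallel lists (and assuming words are stored in the order they are first encountered), the finished lists for smallReviews.txt would look roughly like this:

unique_words = ['terrible', 'waste', 'talent', 'bloated', 'plot', 'cast', 'full',
                'memorable', 'performances', 'run', 'time', 'independent', 'film',
                'anything', 'mill']
sentiments = [-1, -1, 1, 0, -1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1]

With a list of lists, the same data would be stored as [['terrible', -1], ['waste', -1], ['talent', 1], ...].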
Your sorting algorithm will sort words based on their sentiment value. This is different from what we did in class, where items were sorted based on their own value. You have two options, depending on whether you use parallel lists or a list of lists:
# Option 1 (parallel lists): swap positions i and j in both lists so each word stays with its score
tempSent = sentiments[i]
sentiments[i] = sentiments[j]
sentiments[j] = tempSent
tempWord = unique_words[i]
unique_words[i] = unique_words[j]
unique_words[j] = tempWord
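If you instead use a list of lists, each word and its score already travel together, so a single swap is enough. A minimal sketch (assuming word_scores is your list of [word, score] pairs; the name is only illustrative):

# Option 2 (list of lists): swap the [word, score] pairs at positions i and j
temp = word_scores[i]
word_scores[i] = word_scores[j]
word_scores[j] = temp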
We have provided a total of three files to evaluate on:
/usr/local/doc/smallReviews.txt - the example file from above that contains 3 fake reviews
/usr/local/doc/1000Reviews.txt - the first 1000 reviews available
/usr/local/doc/movieReviews.txt - all available reviews
We highly recommend testing and debugging in the order of the files listed above. Your code may take anywhere from 30 seconds to 2 minutes on the full data set, which is not practical for developing and testing; the full file is provided so you can see what sentiment is found using all of the available data. Here are sample outputs for the two larger files (the output for smallReviews.txt is provided above):
Your final submission must use movieReviews.txt. It is okay if your output differs from ours in the case of ties (e.g., there are multiple words with a sentiment of 19.00).
You'll notice that the final output has some words that don't really make sense, particularly amongst the negative reviews (e.g., feels, movie, like). This is because we used a fairly naive scoring metric to keep the implementation simpler. As an extension (not required, and to be attempted only if the rest of the lab is complete), try other scoring metrics and see if you can come up with something better. Prof. Wicentowski, our local NLP (natural language processing) expert, suggests the following formula for each unique word, w
:

score(w) = (sum of ratings for w / number of occurrences of w) * log(number of occurrences of w)
That is, for each unique word, keep track of the sum of ratings for that word (as above) and the total number of occurrences of that word. The first part of the formula is the average rating for the word (the write-up above uses only the sum), and the second part is the log of the number of occurrences of that word. Together, the average and the log adjust for how frequently each word appears. Here are the (more sensible) results:
Part 2 of this assignment is more open-ended than our typical lab programs. Your task is to write a program, visualize_sort.py
, that displays a visualization of how one particular sorting algorithm works. This visualization can work by printing strings to the terminal, or — for the extra challenge — by creating an animation using Zelle graphics. Your program should allow its users to see how a particular sorting algorithm changes a list from unsorted to sorted, one swap at a time.
As with your sentiment.py program, your code should be organized into functions, including a main() function. Here are two examples of visualizing selection sort:
We would like to see what you can come up with. Note that the graphics example is not the expectation; it is more of a challenge/extension if you are interested. The goal is to help your users understand the inner workings of your chosen algorithm, however you see fit. Along the way, hopefully you will also deepen your own understanding.
TIP: it is a good idea to pause your program after each major step. In the graphics example above, you can use getMouse()
or getKey()
to make the program wait until the user is ready to proceed. For the command-line versions, you can use input()
e.g.,
#output some result
input("Hit Enter to Continue...")
#resume execution of program
It doesn’t matter what the user inputs, so I have not bothered to save the value in a variable.
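For example, one possible (purely illustrative) text visualization prints each value in the list as a row of asterisks and then pauses:

def showList(values):
    # draw each number as a bar of asterisks, then wait for the user
    for value in values:
        print("*" * value)
    print()
    input("Hit Enter to Continue...")

Calling a function like this after every swap lets the user watch the bars gradually move into sorted order.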
Once you are satisfied with your programs, fill out the questionnaire in QUESTIONS-09.txt
. Then run handin21
a final time to make sure we have access to the most recent versions of your files.
Thank you to Prof. Eric D. Manley and Timothy M. Urness at Drake University for the data and inspiration for the sentiment analysis portion of the lab.