Introduction
This week, you will explore the training and validation data for the Semeval 2019 Hyperpartisan News task that you will be working on for your final project.
Compared to the data sample you worked with earlier in the semester, the full training set is “big” in a couple of ways:
- There are 800,000 training articles (plus 200,000 validation articles) that you’ll need to process – way more than the 300 in the sample data set you used in Lab 2!
- Some of the articles are much longer than those in the sample data set, which was filtered to include only articles of 600 words or fewer.
Consequently, part of your task this week will be to implement code that efficiently processes the files, minimizing unnecessary memory usage or computation.
The data files you’ll work with this week are processed versions of the actual files released as part of the Semeval task. In particular, the text of every article has been pre-tokenized with spacy, so you can just split the tokens by whitespace, and all of the hyperlinks have been separated from the main text, so you don’t have to worry about filtering HTML out of the middle of the articles.
There are 5 data files you will need to work with for this lab:
- /data/semeval/training/articles-training-20180831.spacy_links.xml (3.7GB)
- /data/semeval/training/ground-truth-training-20180831.xml (133MB)
- /data/semeval/training/vocab.txt (19MB)
- /data/semeval/validation/articles-validation-20180831.spacy_links.xml (1.5GB)
- /data/semeval/validation/ground-truth-validation-20180831.xml (33MB)
Look through these files to familiarize yourself with them.
Examining the Base Classes
Read through the code in the provided HyperpartisanNewsReader.py
file. Add comments to the file, including docstrings for every function and block/line comments as necessary to demonstrate full understanding of how the code works.
There may be some Python code in this file that’s new to you, and that’s ok! Take some time now to read about any functionality you haven’t seen before so that you understand what it’s doing.
Your comments should, specifically, demonstrate understanding of the roles of the following:
- islice
- yield
- .clear()
- ABC
- @abstractmethod
- In Writeup.md, compare the function do_xml_parse to the function dumb_xml_parse. In what sense do they do the same thing? In what ways do they differ? Which is more scalable, and why? Your answer here should be precise in terms of resource (e.g., memory, processing, etc.) usage. (A sketch contrasting the two approaches follows this list.)
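As a point of reference while you read, here is a minimal sketch, not the provided HyperpartisanNewsReader.py code, of the general contrast between parsing the whole tree at once and streaming it with iterparse, islice, yield, and .clear() (the "article" tag name is an assumption about the data files):

import xml.etree.ElementTree as ET
from itertools import islice

def full_tree_parse(xml_file):
    """Parse the whole file at once: simple, but the entire tree sits in memory."""
    tree = ET.parse(xml_file)
    return tree.getroot().findall("article")   # a list of every <article> element

def streaming_parse(xml_file, max_elements=None):
    """Yield <article> elements one at a time, discarding each after use."""
    events = ET.iterparse(xml_file, events=("end",))
    articles = (elem for _, elem in events if elem.tag == "article")
    for elem in islice(articles, max_elements):   # islice caps how many we consume
        yield elem       # hand this element to the caller now...
        elem.clear()     # ...then drop its contents before moving on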
Sparse Matrices
In this lab, you will write code that extracts features from articles and stores the features in a feature matrix $X$, with one row per article and one column per feature.
- We know that the training set has 800,000 articles in it. If for every article we store the counts for 10,000 features (perhaps the most common 10,000 unigrams) and each feature is stored as an 8-bit unsigned integer, how much space would be needed to store the feature matrix in memory?
If a matrix has a lot of zeros in it, it can instead be represented as a sparse matrix. There are several implementations of sparse matrices available in scipy, but the one that will be most useful to us is the lil_matrix. The lil_matrix will turn out to be a good choice because we can efficiently create an empty (all zeros) matrix and then assign values to specific elements of the matrix as we go along.
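For instance, the following toy-scale illustration creates an empty sparse matrix and fills in a few entries (the dimensions are tiny just for demonstration; the lab’s matrix would be roughly 800,000 by 10,000):

import numpy as np
from scipy.sparse import lil_matrix

X = lil_matrix((5, 10), dtype=np.uint8)   # empty (all-zero) matrix: rows = articles, cols = features
X[0, 3] = 2                               # article 0 contains feature 3 twice
X[4, 7] = 1
print(X.nnz)          # 2: only the non-zero entries are actually stored
print(X.toarray())    # converts to a dense numpy array (only sensible when small)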
- How is the data in a lil_matrix stored? The scipy documentation may be helpful here. Assume you’re working with the same matrix as you were in Question 2: 800,000 articles and 10,000 features. If only 1% of the elements in our $X$ matrix are non-zero, what can we say about the size of the resulting lil_matrix? What if 10% of the elements are non-zero? Would it ever make sense to stop using the lil_matrix and instead use a “normal” numpy array?
Limiting the number of articles you read in
Notice that the process methods in HNFeatures and HNLabels take an optional max_instances parameter. In both cases, this argument helps determine the size of the matrices they create (the $X$ and $y$ matrices, respectively), and it is also passed as an argument to the do_xml_parse function. When you are working on this task, whether it’s for this lab, a future lab, or your final project, you should pass in a small value for max_instances to help you debug. For example, when you are first starting out, you might want to set max_instances to something very small, like 5 or 10. Once you’re a bit more confident, you can set max_instances to a value that is small enough for it to run quickly but large enough that you’re confident things are working, for example 500 or 1000. Not until you’re pretty confident that everything is working should you set max_instances to 800,000 (or set it to None, which will read through the XML file and determine the largest possible value for max_instances, which in this case is 800,000).
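In practice, that workflow might look something like the snippet below; the process calls are assumptions about the provided interface, so check HyperpartisanNewsReader.py for the real signatures:

max_instances = 10        # first runs: a handful of articles while debugging
# max_instances = 1000    # once the pipeline seems to behave
# max_instances = None    # full run: the reader scans the XML to find the count (800,000)

# Assumed calls; adapt them to the actual process() signatures in the provided code.
X = features.process(training_file, max_instances=max_instances)
y = labels.process(labels_file, max_instances=max_instances)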
Code you will need to implement
The following three subsections outline all of the code you’ll need to write this week. You should read through these three subsections before beginning any coding so that you have a big picture understanding of what you’re trying to build before you get started.
Sample output for this code is linked to at the end of this section.
Implementing your own Labeler
Define your own derived class that inherits from HNLabels. Your class should be called BinaryLabels. In your subclass, you will need to define the _extract_label function. In this function, you should extract the hyperpartisan attribute stored in an article taken from the ground-truth XML file. The hyperpartisan attribute is either true or false, hence the name of your subclass.
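One possible shape for this class is sketched below; the exact base-class interface (what _extract_label receives and returns) is defined in the provided HNLabels, so treat the details as assumptions to verify against that code:

from HyperpartisanNewsReader import HNLabels   # provided base class

class BinaryLabels(HNLabels):
    """Labels each article as "true" or "false"."""

    def _extract_label(self, article):
        # Assumes `article` is an ElementTree element from the ground-truth file
        # whose hyperpartisan attribute holds the string "true" or "false".
        return article.get("hyperpartisan")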
For your final project, you might think about whether there are other ways to separate the data for classification. Note that the task organizers want your final prediction to be either “true” or “false”, regardless of how you decide to separate the data. For this week, we’ll make that binary prediction directly.
Implementing your own Feature Extractor
Define your own derived class that inherits from HNFeatures. Your class should be called BagOfWordsFeatures. It should implement a Bag of Words feature set – that is, the features returned by your _extract_features method should be the counts of words in the article. Only include counts for words that are already stored in the vocab; words that are not in the vocab should be ignored. Be sure to read the code where _extract_features is called so you know what you should be returning from the _extract_features method.
The vocabulary you should use to initialize your vocab is stored in /data/semeval/training/vocab.txt. You may decide later to make your own vocabulary file, but it is not necessary.
Note: You are only required to implement functions that match the interface of the HNFeatures class, but you’re encouraged to add extra helper functions to modularize your code. By convention, the names of helper functions that you won’t call directly from outside of the class definition should start with a single underscore (e.g., _my_helper_function(self)).
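Putting the pieces above together, one illustrative sketch of the class follows; the helper name _get_tokens, the vocab membership and index lookups, and the return format are all assumptions, so confirm them against HNFeatures, HNVocab, and the code that calls _extract_features:

from collections import Counter
from HyperpartisanNewsReader import HNFeatures   # provided base class

class BagOfWordsFeatures(HNFeatures):
    """Counts of in-vocabulary words, one feature per vocabulary entry."""

    def _extract_features(self, article):
        # Articles are pre-tokenized, so splitting on whitespace recovers the tokens.
        tokens = self._get_tokens(article)    # hypothetical helper returning the article's tokens
        counts = Counter(tok for tok in tokens if tok in self.vocab)   # assumes vocab supports membership tests
        # Shown here as (vocab index, count) pairs; read where _extract_features is
        # called to confirm the format it is actually expected to return.
        return [(self.vocab.index(tok), n) for tok, n in counts.items()]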
Experiment Interface
To run prediction on the Hyperpartisan News task, you’ll use the hyperpartisan_main.py program. To get usage information, pass it the -h flag:
$ python3 hyperpartisan_main.py -h
usage: hyperpartisan_main.py [-h] [-o FILE] [-v N] [-s N] [--train_size N]
[--test_size N] (-t FILE | -x XVALIDATE)
training labels vocabulary
positional arguments:
training Training articles
labels Training article labels
vocabulary Vocabulary
optional arguments:
-h, --help show this help message and exit
-o FILE, --output_file FILE
Write predictions to FILE
-v N, --vocab_size N Only count the top N words from the vocab file
-s N, --stop_words N Exclude the top N words as stop words
--train_size N Only train on the first N instances. N=0 means use all
training instances.
--test_size N Only test on the first N instances. N=0 means use all
test instances.
-t FILE, --test_data FILE
-x XVALIDATE, --xvalidate XVALIDATE
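For example, a small cross-validation run over a subset of the training data might look like the following; this assumes the -x argument is the number of folds and that predictions.txt is an arbitrary output path:

$ python3 hyperpartisan_main.py \
    /data/semeval/training/articles-training-20180831.spacy_links.xml \
    /data/semeval/training/ground-truth-training-20180831.xml \
    /data/semeval/training/vocab.txt \
    -v 30000 -s 100 --train_size 500 -x 10 -o predictions.txt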
Once it has parsed the command-line arguments, hyperpartisan_main calls the function do_experiment, which has not been implemented. You will use many of these arguments in the do_experiment function, which will do the following (a rough sketch follows the list):
- Creates an instance of HNVocab.
- Creates an instance of (a derived class of) HNFeatures. For this lab, the derived class will be BagOfWordsFeatures.
- Creates an instance of (a derived class of) HNLabels. For this lab, the derived class will be BinaryLabels.
- Creates an instance of a classifier. You can use something from sklearn, such as MultinomialNB. (You can use your Decision List classifier from Lab 6, but it might require some reworking as it isn’t expecting a sparse matrix.)
- Creates feature and target ($X$ and $y$) matrices from the training data.
- Depending on the values of args.xvalidate and args.test_data, either:
  - creates a feature matrix for the test data, fits your model to the training data, and gets predictions (and probabilities) for each article in the test set; or
  - performs $k$-fold cross-validation on the training data, getting predictions (and probabilities) for each article in the training set.
- Regardless of which method was used to generate predictions, writes out one line to args.output_file for each article with three values, separated by spaces:
  - the article id
  - the predicted class (“true” or “false” – do not include the quotes)
  - your model’s confidence, which we’ll consider to be the probability of the predicted class
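A rough sketch of how these steps might fit together is shown below. The constructor and process() signatures, how article ids are obtained, whether args.output_file is an open file handle, and the meaning of the -x value (taken here to be the number of folds) are all assumptions to check against the provided code:

import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_predict

from HyperpartisanNewsReader import HNVocab
# BinaryLabels and BagOfWordsFeatures are the classes you defined earlier in this lab.

def do_experiment(args):
    vocab = HNVocab(args.vocabulary, args.vocab_size, args.stop_words)   # assumed signature
    features = BagOfWordsFeatures(vocab)
    labels = BinaryLabels()
    clf = MultinomialNB()

    # Feature and target matrices from the training data.
    X = features.process(args.training, max_instances=args.train_size)   # assumed signature
    y = labels.process(args.labels, max_instances=args.train_size)

    if args.xvalidate:
        # k-fold cross-validation on the training data.
        probs = cross_val_predict(clf, X, y, cv=int(args.xvalidate), method="predict_proba")
        ids = ...   # training-article ids, however your reader exposes them
    else:
        # Fit on the training data, then predict on the held-out test file.
        clf.fit(X, y)
        X_test = features.process(args.test_data, max_instances=args.test_size)
        probs = clf.predict_proba(X_test)
        ids = ...   # test-article ids

    classes = np.unique(y)                  # column order used by predict_proba
    for article_id, p in zip(ids, probs):
        # article id, predicted class, and the probability of that class
        print(article_id, classes[p.argmax()], p.max(), file=args.output_file)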
Sample output
Sample output is provided. Note that it is a challenge to provide samples for everything you will try, especially since the dataset is so large. If there are particular samples you would like to see, it’s possible that they can be added.
Analysis
For each of the questions below, perform the analysis using a vocabulary size of 30,000 after excluding 100 stop words, performing 10-fold cross-validation on the full training data file. Be sure to write out the labels and probabilities for the Multinomial Naive Bayes classifier: you will need those results to answer the 4 questions that follow.
Warning: You should only continue with these questions if you are 100% certain that your code is working up to this point. Running each classifier in Question 4 will take about 30 minutes! You can continue with Q5-Q8 after running just the Multinomial Naive Bayes classifier, which will allow you to run the DummyClassifier portion of Q4 while you are working on Q5-Q8.
- Use the Multinomial Naive Bayes classifier, along with at least two different Dummy Classifiers. Comment on their relative performance, and on what your results tell you about the data set. Briefly describe how the Dummy Classifiers compare to the baseline classifiers you wrote in Lab 6.
- From the Multinomial Naive Bayes classifier output, identify (by id) three articles that your model is confident are hyperpartisan. Comment on the contents of the articles: What do you think makes your classifier so confident that they are hyperpartisan? Is your classifier right?
- From the Multinomial Naive Bayes classifier output, identify (by id) three articles that your model is confident are not hyperpartisan. Comment on the contents of the articles: what do you think makes your classifier so confident that they are not hyperpartisan? Is your classifier right?
- From the Multinomial Naive Bayes classifier output, identify (by id) three articles that your model is not confident about – that is, articles for which your classifier’s prediction is very close to $0.50$. Comment on the contents of the articles: what do you think makes these articles hard for your classifier? Do you find them hard to classify as a human? If not, what aspects of the articles do you take into account that are not captured by the features available to your classifier?
- Based on your answers to the above, give a list of 3-5 additional features you could extract that might be helpful to your classifier.
Optional
- We talked in class about the need to train hyperparameters on a development set or by using cross-validation. Use 5-fold cross-validation to compare some hyperparameters and see how tuning these hyperparameters changes your performance. You can use any classifier you’d like, but if you’re not sure which to use, use the Multinomial Naive Bayes classifier. Some hyperparameters include (a small tuning sketch appears at the end of this section):
  - The value of the alpha parameter used for +alpha smoothing.
  - The number of stopwords to exclude.
  - The total size of the vocabulary to include.

  Each full run of the classifier can take 20-30 minutes, so don’t work on this unless you’ve completed everything else above.
- Run the Multinomial Naive Bayes classifier trained on the full training data and tested on the full validation data using the same parameters as you did for Question 4. How do your results compare? Is this surprising or not? You may also want to look through the results and think about how you would answer Questions 5-8 based on the output.
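If you try the hyperparameter comparison, a minimal sketch of the alpha comparison, assuming the $X$ and $y$ matrices have already been built as in do_experiment, might look like:

from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Compare a few smoothing values with 5-fold cross-validation.
for alpha in (0.1, 0.5, 1.0, 2.0):
    scores = cross_val_score(MultinomialNB(alpha=alpha), X, y, cv=5)
    print(f"alpha={alpha}: mean accuracy {scores.mean():.3f}")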