Overview
The goals of this week's lab:
- Implement the Naive Bayes algorithm
- Implement Logistic Regression
- Handle multi-class prediction
You may work with one lab partner for this lab. You may discuss high-level ideas with any other group, but
examining or sharing code is a violation of the department academic integrity policy. Note that there is a checkpoint requirement for this lab; I will check that you have completed the Naive Bayes portion by lab on Thursday, October 26.
Getting Started
Projects in this course will be distributed and submitted through Swarthmore
GitHub Enterprise. If you have never used the Swarthmore GitHub before, you should follow our GitHub Setup Guide. If you have taken CS31 or CS35 recently, you will be familiar
with the process. Find your git repo for this lab assignment in the CPSC66-F17 organization; it is named in the format Project2-id1_id2 (with id1 and id2 replaced by the user IDs of you and your partner).
Clone your git repo with the starting point files into
your labs directory:
$ cd cs66/labs
$ git clone [the ssh url to your repo]
Then cd into your
Project2-id1_id2 subdirectory. You will have the following files (those in blue require your modification):
- input - directory containing train/test data sets (more forthcoming)
- outputNB, outputLR - all required output from your algorithms should be saved in these folders
- run_NB.py - your main program executable for Naive Bayes
- NaiveBayes.py - a class file to define your Naive Bayes algorithm
- run_LR.py - your main program executable for Logistic Regression
- LogisticRegression.py - a class file to define your Logistic Regression algorithm
- README.md - required response for data collection and grading
Common Implementation Details
Usage
Your programs should take in 1 or 2 command-line arguments, as in the previous lab. The first argument for both programs will
be the data set. For logistic regression, you will also take in the learning
rate.
For example, to run naive Bayes on the tennis example:
python run_NB.py tennis
If the user specifies an incorrect number of arguments, or invalid values for the parameters, print a usage
statement and exit the program. You should error check all parameters:
ensure that the files exist, and that the learning rate is greater than 0. E.g.,
$ python run_LR.py
Error, incorrect number of arguments
Usage: python run_LR.py dataset [eta]
dataset - base name of input arff files in the input/ folder
eta - learning rate during gradient descent
Note that the dataset name (e.g.,
tennis) will determine 3 file names:
- The training set file input/[dataset]_train.arff e.g., input/tennis_train.arff
- The test set file input/[dataset]_test.arff e.g., input/tennis_test.arff
- The predicted labels for the test set outputLR/[dataset].labels e.g., outputLR/tennis.labels. Save
Naive Bayes output to the outputNB folder.
You must ensure that the two input files actually exist. Please follow this format exactly;
these are easy points for your grade and they significantly reduce the number of headaches
I will face during grading.
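One possible shape for this error checking (a sketch only; the function name, variable names, and exact messages are illustrative, not required):

import os
import sys

def parseArgs(expectEta):
    """Validate command-line arguments; return (dataset, eta) or exit with a usage message."""
    if (not expectEta and len(sys.argv) != 2) or (expectEta and len(sys.argv) != 3):
        print("Error, incorrect number of arguments")
        print("Usage: python run_LR.py dataset [eta]")
        sys.exit(1)
    dataset = sys.argv[1]
    # the dataset name determines both input file names
    for fname in ["input/%s_train.arff" % dataset, "input/%s_test.arff" % dataset]:
        if not os.path.exists(fname):
            print("Error, file does not exist: %s" % fname)
            sys.exit(1)
    eta = None
    if expectEta:
        eta = float(sys.argv[2])
        if eta <= 0:
            print("Error, eta must be greater than 0")
            sys.exit(1)
    return dataset, eta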
Program Inputs
To simplify preprocessing, you may assume the following:
- The data set will be in separate train and test files, as with Project 1.
- Data will be in ARFF format. You may either reuse your work from Project 1 for parsing ARFF files or use any available library (e.g., scipy).
- You may assume that all features are discrete (binary or multi-valued) and unordered. You do not need to handle continuous features, although you are welcome to reuse your thresholding code from the previous project.
- Unlike Project 1, prediction tasks can be multi-class for Naive Bayes.
- You can and should reuse as much as you can from Project 1 (e.g.,
checking command-line arguments, parsing input files, outputting predictions).
For Both Classifiers
- Your program should print a confusion matrix and accuracy of the result. Your program should also save the predicted labels (as with the previous project) in your output{NB,LR}/ folder.
- The design of your solution is largely up to you. At a minimum, each classifier should have a consistent public interface including a train(X,Y) function, which takes a set of training data (X and Y), and a predict(Xnew) function, which returns a prediction for a test example (and, optionally, a predictAll() method to loop over the entire test set). A sketch of one such interface appears after this list.
- All preprocessing and postprocessing should be done outside of the class definitions. It is a good idea to create
an additional library file for functions that you'll use in both programs. This would include a savePredictions function that works regardless of which method made the predictions, as well as any parsers (e.g., a readARFF function) that return the result of parsing the arff file. Your programs should minimize redundancy.
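As a sketch of one possible interface (only train and predict are required; the other names here are illustrative):

class NaiveBayes:
    """Multinomial Naive Bayes trained from Laplace-smoothed counts."""

    def train(self, X, Y):
        """Count event frequencies in the training data (X: examples, Y: labels)."""
        pass  # accumulate integer counts here; no probabilities needed yet

    def predict(self, xnew):
        """Return the most probable label for a single test example."""
        pass

    def predictAll(self, Xtest):
        """Optional convenience wrapper: predict every test example."""
        return [self.predict(x) for x in Xtest]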
Naive Bayes Implementation
You will implement a multinomial (multi-value features) Naive Bayes as discussed in class. Your model will be capable of making
multi-class predictions or binary predictions.
Model
All probabilities should be modeled using
log probabilities to avoid underflow. Note that when taking the log of a formula, all multiplications become additions and divisions become subtractions. That is, our
Naive Bayes model for a class label $y_j$:

$$\hat{y} = \operatorname{argmax}_{y_j} \; P(Y = y_j) \prod_i P(X_i \mid Y = y_j)$$

becomes:

$$\hat{y} = \operatorname{argmax}_{y_j} \Big[ \log P(Y = y_j) + \sum_i \log P(X_i \mid Y = y_j) \Big]$$

Note that we estimate the probabilities using Laplace counts (thetas), so our final model for making predictions is:

$$\hat{y} = \operatorname{argmax}_{y_j} \Big[ \log \theta_{y_j} + \sum_i \log \theta_{X_i \mid y_j} \Big]$$
Training
You will need to represent model parameters (thetas) for each class probability
P(Y = y_j) and each class conditional probability
P(X_i = v | Y = y_j). This will require you to count the frequency of each event in your training data set. Use
Laplace estimates to avoid 0 counts. If you think about the formula carefully, you will note that we can implement our training phase using only integers
to keep track of counts (and no division!). Each theta is just a normalized frequency: we count how many examples have
X_i = v and label y_j, and divide by the total number of examples with label y_j (plus the Laplace correction):

$$\theta_{X_i = v \mid y_j} = \frac{N(X_i = v, Y = y_j) + 1}{N(Y = y_j) + |X_i|}$$

where $|X_i|$ is the number of values feature $X_i$ can take. Taking the log, each theta becomes the difference between the logs of the two counts:

$$\log \theta_{X_i = v \mid y_j} = \log\big(N(X_i = v, Y = y_j) + 1\big) - \log\big(N(Y = y_j) + |X_i|\big)$$
The important point is that it is completely possible to train your Naive Bayes with only integers (the counts,
N). You don't need to calculate log scores until prediction time. This will make debugging (and training) much simpler.
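A minimal sketch of this counting phase (the dictionary layout and names are illustrative; it assumes X is a list of feature-value lists and Y a parallel list of labels):

from collections import defaultdict

def countEvents(X, Y):
    """Tally the integer counts needed for Laplace-smoothed thetas."""
    labelCounts = defaultdict(int)  # N(Y = yj)
    jointCounts = defaultdict(int)  # N(Xi = v, Y = yj)
    for xi, yj in zip(X, Y):
        labelCounts[yj] += 1
        for i, v in enumerate(xi):
            jointCounts[(i, v, yj)] += 1
    return labelCounts, jointCounts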
Inference/Prediction
For this phase, simply calculate the probability of each class (using the model equations above) and choose the maximum probability (recall that the max probability is also the max log odds):

$$\hat{y} = \operatorname{argmax}_{y_j} \Big[ \log \theta_{y_j} + \sum_i \log \theta_{X_i = v \mid y_j} \Big]$$
To convert your log odds to a probability value, exponentiate the value using
math.exp(score) where score is your log odds value.
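Continuing the illustrative sketch above, prediction can then compute each label's log score from the counts and take the argmax (numValues[i] here is the assumed number of possible values of feature i, and numTrain the number of training examples; none of these names are required):

import math

def predictNB(xnew, labelCounts, jointCounts, numValues, numTrain):
    """Return the label with the highest log score for one test example."""
    bestLabel, bestScore = None, float("-inf")
    for yj in labelCounts:
        # log of the Laplace-smoothed class prior
        score = math.log(labelCounts[yj] + 1) - math.log(numTrain + len(labelCounts))
        for i, v in enumerate(xnew):
            # log theta as a difference of two log counts
            score += math.log(jointCounts.get((i, v, yj), 0) + 1)
            score -= math.log(labelCounts[yj] + numValues[i])
        if score > bestScore:
            bestLabel, bestScore = yj, score
    return bestLabel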
Example Output
Note that I have included Debugging Output below to help diagnose potential problems. It includes frequency counts as well as predictions. The predictions include the predicted label (y^), the raw score of that label, and then the normalized probability (divide by the sum of the scores of all possible labels). You
do not need to calculate all of these values; they are for reference. You should not have this level of verbosity in your final program.
- Tennis example:
$python run_NB.py tennis
Accuracy: 0.928571 (13 out of 14 correct)
Prediction
No Yes
No 4 1
Yes 0 9
Labels and Debugging Output
- Zoo example:
$python run_NB.py zoo
Accuracy: 0.857143 (30 out of 35 correct)
Prediction
1 2 3 4 5 6 7
1 13 0 0 1 0 0 0
2 0 7 0 0 0 0 0
3 0 1 0 0 1 0 0
4 0 0 0 4 0 0 0
5 0 0 0 0 1 0 0
6 0 0 0 0 0 3 0
7 0 0 0 0 0 2 2
Labels and Debugging Output
Logistic Regression Implementation
You will implement the logistic regression task as discussed in class, with the extension to multi-valued features and multi-class prediction.
Model
In all situations, you should use binary variables for Y and X: "negative" should be represented as a 0 and "positive" as a 1.
In the case of binary labels, the probability of a positive prediction is:

$$P(Y = 1 \mid x) = \frac{1}{1 + e^{-w^T x}}$$
If the prediction task is binary, you should treat the second entry in the arff file as the positive class (e.g.,
{No, Yes} would mean
Yes is the positive label). During prediction, return positive (e.g.,
Yes) if the probability from the above equation is greater than 0.5 (else, return the negative class, e.g.,
No).
For multinomial regression, you will build a 1-vs-all classifier for each of the K classes. For example, in the zoo dataset, where the possible labels are 1,2,3,...,7, you will build a classifier where 1 is positive and all other classes are negative; a separate classifier where 2 is positive and all other classes are negative (e.g., 0); and so on, for 7 classifiers in total.
In addition, you will need to process the input files to handle non-binary feature inputs. If a feature is binary, use the same rule as above, where the first possible value is treated as a 0 and the second value as a 1. For multinomial features, we will need a weight for each possible value of a feature, X_i=v. You should binarize these features by removing the original feature and adding a new column for each feature/value pair. For example, if you have a feature color={blue,red,green}, you will add features {w_color=blue, w_color=red, w_color=green}. For an example where color=green, the corresponding feature values for that example would be x_color=blue=0, x_color=green=1, x_color=red=0.
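For instance, a small sketch of this conversion (the helper name is illustrative; it assumes feature values arrive as strings):

def binarizeColumn(column, values):
    """Expand one multi-valued feature into one 0/1 column per possible value."""
    # e.g., values = ["blue", "red", "green"]: "green" -> [0, 0, 1]
    return [[1 if v == x else 0 for v in values] for x in column]

With the color example, binarizeColumn(["green", "blue"], ["blue", "red", "green"]) returns [[0, 0, 1], [1, 0, 0]].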
To model an intercept term (if data is not centered), we will include a bias term. In practice, this is equivalent to adding a
feature that is always "on". For each instance in training and testing, append a 1 to the feature vector. Ensure that your
logistic regression model has a weight for this feature that it can now learn in the same way it learns all other weights.
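With numpy, appending the bias feature can be a one-liner (a sketch; the toy array is illustrative):

import numpy as np

X = np.array([[0, 1], [1, 0]])                # toy binarized examples
X = np.hstack([X, np.ones((X.shape[0], 1))])  # append a column of 1s (the bias feature)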
Training
To learn the weights, you will apply gradient ascent, as discussed in class, until the weights stop changing (to within a small tolerance) from one iteration to the next:
while not converged:
    initialize gradient to 0 for each feature: g = [0,...,0]
    for each training example xi:
        p = P(Y=1|xi)            // predicted probability of positive under current model
        error = yi - p           // error in the prediction
        for each feature j:
            g[j] += error * xi[j]
    w += eta*g                   // update weights once, after the pass over all examples
The hyperparameter
eta (learning rate) should be sent as a parameter to the constructor and used in training. You should only update weights once per iteration (i.e., updating the weights is the last step of the iteration). A few notes for the above:
- The stopping criteria should be a) a maximum of 100 iterations OR b) the weights have changed by less than 0.0001 between two iterations. To calculate this, subtract the previous weight vector w_prev from the current weight vector w, take the absolute value of that subtraction, and sum all elements of the resulting vector.
Stop if delta < 0.0001. Note that this is the same as the L1 norm of the change in weights.
- Many of the operations above are on vectors. I strongly recommend using numpy features such as the dot product to keep the code simple.
- Above, p is the probability of being positive, not the predicted label. Use the equation given in the Model section
for this value.
You will not need to tune the eta parameter; take it as a command-line argument. A complete sketch of the training loop, combining the notes above, follows.
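Here is a minimal numpy sketch of that loop, assuming X is already binarized with the bias column appended and Y holds 0/1 labels (the function and variable names are illustrative, not required):

import numpy as np

def train(X, Y, eta, maxIters=100, tol=0.0001):
    """Learn logistic regression weights by gradient ascent.
    X: n x d numpy array of 0/1 features (bias column included); Y: length-n 0/1 labels."""
    w = np.zeros(X.shape[1])
    for _ in range(maxIters):
        wPrev = w.copy()
        g = np.zeros(X.shape[1])                      # gradient accumulator
        for xi, yi in zip(X, Y):
            p = 1.0 / (1.0 + np.exp(-np.dot(w, xi)))  # P(Y=1 | xi) under current weights
            g += (yi - p) * xi                        # error * features
        w = w + eta * g                               # one weight update per iteration
        if np.sum(np.abs(w - wPrev)) < tol:           # L1 norm of the change
            break
    return w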
Prediction
For binary prediction, apply the equation:

$$P(Y = 1 \mid x) = \frac{1}{1 + e^{-w^T x}}$$

and output the more likely binary outcome. Use the Python
math.exp function to make the calculation in the denominator.
Here, $w^T x$ is the dot product of the weight vector $w$ and example $x$.
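A small sketch of this computation in plain Python (the helper names are illustrative):

import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def predictBinary(w, x):
    """Return 1 if P(Y=1|x) > 0.5 under weights w, else 0."""
    score = sum(wj * xj for wj, xj in zip(w, x))  # the dot product w^T x
    return 1 if sigmoid(score) > 0.5 else 0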
For multi-class prediction, you will build K 1-vs-all models. For
a test example, you will run it through each of the K models. The predicted label is the one whose model gives the highest P(Y=1|x_test) for
the given test example.
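For example, a sketch of the multi-class decision, assuming models maps each label to its learned 1-vs-all weight vector and reusing the illustrative sigmoid helper above:

def predictMulticlass(models, x):
    """Return the label whose 1-vs-all classifier assigns x the highest P(Y=1|x)."""
    return max(models, key=lambda label: sigmoid(
        sum(wj * xj for wj, xj in zip(models[label], x))))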
Example Output
I have included Debugging Output below to help diagnose potential problems. It includes the change in gradients (delta) per round, the learned weights for the first two iterations as well as the final learned weights, and the L1-norm of the weight vector at each iteration. You
do not need to calculate all of these values; they are for reference. You should not have this level of verbosity in your final program.
- Tennis example with eta=.1:
$python run_LR.py tennis .1
Accuracy: 1.0 (14 out of 14 correct)
Prediction
No Yes
No 5 0
Yes 0 9
Labels and Debugging Output
- Zoo example:
$ python run_LR.py zoo .1
Accuracy: 0.885714 (31 out of 35 correct)
Prediction
1 2 3 4 5 6 7
1 14 0 0 0 0 0 0
2 0 7 0 0 0 0 0
3 1 0 0 0 1 0 0
4 0 0 0 4 0 0 0
5 0 0 0 0 1 0 0
6 0 0 0 0 0 3 0
7 0 0 0 0 0 2 2
Labels and Debugging Output
Tips and Hints
- You may use preprocessing modules from Python libraries. For example, you could utilize the LabelBinarizer() from sklearn.preprocessing (see the sketch after this list).
- For handling the bias term in logistic regression, I recommend adding a dummy variable to your training/testing sets. All of the examples will have a value of 1 for this feature. Then you don't need to explicitly model this term (our derivation in class assumed this to simplify notation). While this is straightforward to do, you could also use add_dummy_feature.
- For creating a train/test split, you can do it manually in Python to allow flexibility (e.g., using shuffle). You could also try to use the train_test_split function in sklearn. While you are there, take a look at the functions for creating multiple folds.
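A quick sketch of those two helpers (assuming scikit-learn is installed; the toy data is illustrative):

import numpy as np
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split

# one-hot encode a multi-valued feature column
onehot = LabelBinarizer().fit_transform(["blue", "red", "green", "green"])

# random train/test split of toy examples X and labels Y
X = np.arange(8).reshape(4, 2)
Y = [0, 1, 0, 1]
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.25)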
Extensions (Optional)
These extensions are optional, if you are interested in going beyond the requirements. Doing the extensions does not make up for missed requirements above, and they are a lot of work relative to the amount of gain. But they can be rewarding and can aid understanding.
Be sure your extensions are
a net plus on the requirements above - I should be able to match expected behavior on the basic version of the lab above.
Continuous Features
Extend both models to allow continuous features. For Naive Bayes, a common option is to estimate a Gaussian distribution for each continuous feature. That is,
P(Xi|Y=yj) will require an estimate of the mean and standard deviation.
Both can be estimated in a similar fashion to the counts above: for all of the
examples where
Y=yj, find the mean and standard deviation of
Xi.
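Concretely, at prediction time you would evaluate the standard Gaussian density with the per-class estimates (writing $\mu_{ij}$, $\sigma_{ij}$ for the mean and standard deviation of $X_i$ among training examples with label $y_j$):

$$P(X_i = x \mid Y = y_j) = \frac{1}{\sqrt{2\pi}\,\sigma_{ij}} \exp\left(-\frac{(x - \mu_{ij})^2}{2\sigma_{ij}^2}\right)$$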
For Logistic Regression, you do not need to make many modifications - you only
need to allow non-binary features into the model. That is, in gradient
ascent, multiplying by
Xi will no longer be limited to 0 and 1.
I would recommend standardizing or normalizing continuous features.
Use the titanic data set from HW1, or any data set from the UCI repository that contains
continuous features.
ROC Curves
This only applies to binary classification tasks. Use the probabilistic
predictions of both models to generate ROC Curves.
Submitting your work
For the programming portion, be sure to commit your work often to prevent lost data. Only your final
pushed solution will be graded. Only files in the main directory will be graded. Please double check all requirements; common errors include:
- Did you fill out the README?
- Did you remove all debugging/developer comments including TODOs and print statements?
- Are all functions and non-trivial portions of code commented and easy to follow?
- Recheck your code on a lab machine to make sure it runs. Students often introduce whitespace errors while doing a last pass of adding/removing comments.
Also, be sure to meet the style requirements:
Program Style Requirements
- Your program should follow good design practices - effective top-down
design, modularity, commenting for functions and non-trivial code, etc.
- You should break your program up into multiple files and classes.
You can add additional files for common library methods shared between the two programs.
- Practice defensive programming; e.g., did the user provide enough
arguments on the command line? Do the files exist? You do
not need to check if the contents are correct.
- Do not interact with the user for your program - if an incorrect filename
is given, simply exit the program with a message; do not prompt for a new
file.
- Clean up any debugging print statements before final submission. They may be useful during development, but they need to be removed before we grade.
- All functions should include a top-level comment describing purpose, parameters, and return values.
- Unlike the previous lab, some data processing libraries are permitted. See the Tips and Hints section; feel free to email me if you are not sure if your library is permitted. In general, it should not implement a core
algorithmic feature of Naive Bayes or Logistic Regression.