Overview
The goals of this week's lab:
- Implement the Naive Bayes algorithm
- Implement Logistic Regression
- Handle multi-class prediction
You may work with one lab partner for this lab. You may discuss high-level ideas with any other group, but
examining or sharing code is a violation of the department academic integrity policy. Note that there is a checkpoint requirement for this lab; I will check that you have completed the Naive Bayes portion by lab on Thursday, October 26.
Getting Started
Projects in this course will be distributed and submitted through Swarthmore
GitHub Enterprise. If you have never used the Swarthmore GitHub before, you should follow our GitHub Setup Guide. If you have taken CS31 or CS35 recently, you will be familiar
with the process. Find your git repo for this lab assignment in the CPSC66-F17 organization; it is named in the format Project2-id1_id2 (with id1 and id2 replaced by the user IDs of you and your partner).
Clone your git repo with the starting point files into
your labs directory:
$ cd cs66/labs
$ git clone [the ssh url to your repo]
Then cd into your
Project2-id1_id2 subdirectory. You will have the following files (those in blue require your modification):
- input - directory containing train/test data sets (more forthcoming)
- outputNB, outputLR - all required output from your algorithms should be saved in these folders
- run_NB.py - your main program executable for Naive Bayes
- NaiveBayes.py - a class file to define your Naive Bayes algorithm
- run_LR.py - your main program executable for Logistic Regression
- LogisticRegression.py - a class file to define your Logistic Regression algorithm
- README.md - required response for data collection and grading
Common Implementation Details
Usage
Your programs should take in 1 or 2 command-line arguments, as in the previous lab. The first argument for both programs will
be the data set. For logistic regression, you will also take in the learning
rate.
For example, to run naive Bayes on the tennis example:
python run_NB.py tennis
If the user specifies an incorrect number of arguments, or invalid values for the parameters, print a usage
statement and exit the program. You should error check all parameters:
ensure that the files exist, and that the learning rate is greater than 0. E.g.,
$ python run_LR.py
Error, incorrect number of arguments
Usage: python run_LR.py dataset [eta]
dataset - base name of input arff files in the input/ folder
eta - learning rate during gradient descent
Note that the dataset name (e.g.,
tennis) will determine 3 file names:
- The training set file input/[dataset]_train.arff e.g., input/tennis_train.arff
- The test set file input/[dataset]_test.arff e.g., input/tennis_test.arff
- The predicted labels for the test set outputLR/[dataset].labels e.g., outputLR/tennis.labels. Save
Naive Bayes output to the outputNB folder.
You must ensure that the two input files actually exist. Please follow this format exactly;
these are easy points for your grade and they significantly reduce the number of headaches
I will face during grading.
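One possible shape for this error checking (a sketch only; the function name, variable names, and exact messages are illustrative, not required):

import os
import sys

def parseArgs(expectEta):
    """Validate command-line arguments; return (dataset, eta) or exit with a usage message."""
    if (not expectEta and len(sys.argv) != 2) or (expectEta and len(sys.argv) != 3):
        print("Error, incorrect number of arguments")
        print("Usage: python run_LR.py dataset [eta]")
        sys.exit(1)
    dataset = sys.argv[1]
    # the dataset name determines both input file names
    for fname in ["input/%s_train.arff" % dataset, "input/%s_test.arff" % dataset]:
        if not os.path.exists(fname):
            print("Error, file does not exist: %s" % fname)
            sys.exit(1)
    eta = None
    if expectEta:
        eta = float(sys.argv[2])
        if eta <= 0:
            print("Error, eta must be greater than 0")
            sys.exit(1)
    return dataset, eta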
Program Inputs
To simplify preprocessing, you may assume the following:
- The data set will be in separate train and test files, as with Project 1.
- Data will be in ARFF format. You may either reuse your work from Project 1 for parsing ARFF files or use any available library (e.g., scipy).
- You may assume that all features are discrete (binary or multi-valued) and unordered. You do not need to handle continuous features, although you are welcome to reuse your thresholding code from the previous project.
- Unlike Project 1, prediction tasks can be multi-class for Naive Bayes.
- You can and should reuse as much as you can from Project 1 (e.g.,
checking command-line arguments, parsing input files, outputting predictions).
For Both Classifiers
- Your program should print a confusion matrix and accuracy of the result. Your program should also save the predicted labels (as with the previous project) in your output{NB,LR}/ folder.
- The design of your solution is largely up to you. At a minimum, each classifier should have a consistent public interface including a train(X,Y) function, which takes a set of training data (X and Y), and a predict(Xnew) function, which returns a prediction for a test example (and, optionally, a predictAll() method to loop over the entire test set). A sketch of one such interface appears after this list.
- All preprocessing and postprocessing should be done outside of the class definitions. It is a good idea to create
an additional library file for functions that you'll use in both programs. This would include a savePredictions function that works regardless of which method made the predictions, as well as any parsers (e.g., a readARFF function) that return the result of parsing the arff file. Your programs should minimize redundancy.
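As a sketch of one possible interface (only train and predict are required; the other names here are illustrative):

class NaiveBayes:
    """Multinomial Naive Bayes trained from Laplace-smoothed counts."""

    def train(self, X, Y):
        """Count event frequencies in the training data (X: examples, Y: labels)."""
        pass  # accumulate integer counts here; no probabilities needed yet

    def predict(self, xnew):
        """Return the most probable label for a single test example."""
        pass

    def predictAll(self, Xtest):
        """Optional convenience wrapper: predict every test example."""
        return [self.predict(x) for x in Xtest]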
Naive Bayes Implementation
You will implement a multinomial (multi-value features) Naive Bayes as discussed in class. Your model will be capable of making
multi-class predictions or binary predictions.
Model
All probabilities should be modeled using
log probabilities to avoid underflow. Note that when taking the log of a formula, all multiplications become additions and divisions become subtractions. That is, our
Naive Bayes model for a class label $y_j$:

$$\hat{y} = \operatorname{argmax}_{y_j} \; P(Y = y_j) \prod_i P(X_i \mid Y = y_j)$$

becomes:

$$\hat{y} = \operatorname{argmax}_{y_j} \Big[ \log P(Y = y_j) + \sum_i \log P(X_i \mid Y = y_j) \Big]$$

Note that we estimate the probabilities using Laplace counts (thetas), so our final model for making predictions is:

$$\hat{y} = \operatorname{argmax}_{y_j} \Big[ \log \theta_{y_j} + \sum_i \log \theta_{X_i \mid y_j} \Big]$$
Training
You will need to represent model parameters (thetas) for each class probability
P(Y = y_j) and each class conditional probability
P(X_i = v | Y = y_j). This will require you to count the frequency of each event in your training data set. Use
Laplace estimates to avoid 0 counts. If you think about the formula carefully, you will note that we can implement our training phase using only integers
to keep track of counts (and no division!). Each theta is just a normalized frequency: we count how many examples have
X_i = v and label y_j, and divide by the total number of examples with label y_j (plus the Laplace correction):

$$\theta_{X_i = v \mid y_j} = \frac{N(X_i = v, Y = y_j) + 1}{N(Y = y_j) + |X_i|}$$

where $|X_i|$ is the number of values feature $X_i$ can take. Taking the log, each theta becomes the difference between the logs of the two counts:

$$\log \theta_{X_i = v \mid y_j} = \log\big(N(X_i = v, Y = y_j) + 1\big) - \log\big(N(Y = y_j) + |X_i|\big)$$
The important point is that it is completely possible to train your Naive Bayes with only integers (the counts,
N). You don't need to calculate log scores until prediction time. This will make debugging (and training) much simpler.
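A minimal sketch of this counting phase (the dictionary layout and names are illustrative; it assumes X is a list of feature-value lists and Y a parallel list of labels):

from collections import defaultdict

def countEvents(X, Y):
    """Tally the integer counts needed for Laplace-smoothed thetas."""
    labelCounts = defaultdict(int)  # N(Y = yj)
    jointCounts = defaultdict(int)  # N(Xi = v, Y = yj)
    for xi, yj in zip(X, Y):
        labelCounts[yj] += 1
        for i, v in enumerate(xi):
            jointCounts[(i, v, yj)] += 1
    return labelCounts, jointCounts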
Inference/Prediction
For this phase, simply calculate the probability of each class (using the model equations above) and choose the maximum probability (recall that the max probability is also the max log odds):

$$\hat{y} = \operatorname{argmax}_{y_j} \Big[ \log \theta_{y_j} + \sum_i \log \theta_{X_i = v \mid y_j} \Big]$$
To convert your log odds to a probability value, exponentiate the value using
math.exp(score) where score is your log odds value.
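Continuing the illustrative sketch above, prediction can then compute each label's log score from the counts and take the argmax (numValues[i] here is the assumed number of possible values of feature i, and numTrain the number of training examples; none of these names are required):

import math

def predictNB(xnew, labelCounts, jointCounts, numValues, numTrain):
    """Return the label with the highest log score for one test example."""
    bestLabel, bestScore = None, float("-inf")
    for yj in labelCounts:
        # log of the Laplace-smoothed class prior
        score = math.log(labelCounts[yj] + 1) - math.log(numTrain + len(labelCounts))
        for i, v in enumerate(xnew):
            # log theta as a difference of two log counts
            score += math.log(jointCounts.get((i, v, yj), 0) + 1)
            score -= math.log(labelCounts[yj] + numValues[i])
        if score > bestScore:
            bestLabel, bestScore = yj, score
    return bestLabel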
Example Output
Note that I have included Debugging Output below to help diagnose potential problems. It includes frequency counts as well as predictions. The predictions include the predicted label (y^), the raw score of that label, and then the normalized probability (divide by the sum of the scores of all possible labels). You
do not need to calculate all of these values; they are for reference. You should not have this level of verbosity in your final program.
- Tennis example:
$python run_NB.py tennis
Accuracy: 0.928571 (13 out of 14 correct)
Prediction
No Yes
No 4 1
Yes 0 9
Labels and Debugging Output
- Zoo example:
$python run_NB.py zoo
Accuracy: 0.857143 (30 out of 35 correct)
Prediction
1 2 3 4 5 6 7
1 13 0 0 1 0 0 0
2 0 7 0 0 0 0 0
3 0 1 0 0 1 0 0
4 0 0 0 4 0 0 0
5 0 0 0 0 1 0 0
6 0 0 0 0 0 3 0
7 0 0 0 0 0 2 2
Labels and Debugging Output
Logistic Regression Implementation
You will implement the logistic regression task as discussed in class, with the extension to multi-valued features and multi-class prediction.
Model
In all situations, you should use binary variables for Y and X: "negative" should be represented as a 0 and "positive" as a 1.
In the case of binary labels, the probability of a positive prediction is:

$$P(Y = 1 \mid x) = \frac{1}{1 + e^{-w^T x}}$$
If the prediction task is binary, you should treat the second entry in the arff file as the positive class (e.g.,
{No, Yes} would mean
Yes is the positive label). During prediction, return positive (e.g.,
Yes) if the probability from the above equation is greater than 0.5 (else, return the negative class, e.g.,
No).
For multinomial regression, you will build a 1-vs-all classifier for each of the K classes. For example, in the zoo dataset, where the possible labels are 1,2,3,...,7, you will build a classifier where 1 is positive and all other classes are negative; a separate classifier where 2 is positive and all other classes are negative (e.g., 0); and so on, for 7 classifiers in total.
In addition, you will need to process the input files to handle non-binary feature inputs. If a feature is binary, use the same rule as above, where the first possible value is treated as a 0 and the second value as a 1. For multinomial features, we will need a weight for each possible value of a feature, X_i=v. You should binarize these features by removing the original feature and adding a new column for each feature/value pair. For example, if you have a feature color={blue,red,green}, you will add features {w_color=blue, w_color=red, w_color=green}. For an example where color=green, the corresponding feature values for that example would be x_color=blue=0, x_color=green=1, x_color=red=0.
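For instance, a small sketch of this conversion (the helper name is illustrative; it assumes feature values arrive as strings):

def binarizeColumn(column, values):
    """Expand one multi-valued feature into one 0/1 column per possible value."""
    # e.g., values = ["blue", "red", "green"]: "green" -> [0, 0, 1]
    return [[1 if v == x else 0 for v in values] for x in column]

With the color example, binarizeColumn(["green", "blue"], ["blue", "red", "green"]) returns [[0, 0, 1], [1, 0, 0]].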
To model an intercept term (if data is not centered), we will include a bias term. In practice, this is equivalent to adding a
feature that is always "on". For each instance in training and testing, append a 1 to the feature vector. Ensure that your
logistic regression model has a weight for this feature that it can now learn in the same way it learns all other weights.
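With numpy, appending the bias feature can be a one-liner (a sketch; the toy array is illustrative):

import numpy as np

X = np.array([[0, 1], [1, 0]])                # toy binarized examples
X = np.hstack([X, np.ones((X.shape[0], 1))])  # append a column of 1s (the bias feature)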
Training
To learn the weights, you will apply gradient ascent, as discussed in class, until the weights stop changing (to within a small tolerance) from one iteration to the next:
while not converged:
    initialize gradient to 0 for each feature: g = [0,...,0]
    for each training example xi:
        p = P(Y=1|xi)            // predicted probability of positive under current model
        error = yi - p           // error in the prediction
        for each feature j:
            g[j] += error * xi[j]
    w += eta*g                   // update weights once, after the pass over all examples
The hyperparameter
eta (learning rate) should be sent as a parameter to the constructor and used in training. You should only update weights once per iteration (i.e., updating the weights is the last step of the iteration). A few notes for the above:
- The stopping criteria should be a) a maximum of 100 iterations OR b) the weights have changed by less than 0.0001 between two iterations. To calculate this, subtract the previous weight vector w_prev from the current weight vector w, take the absolute value of that subtraction, and sum all elements of the resulting vector.
Stop if delta < 0.0001. Note that this is the same as the L1 norm of the change in weights.
- Many of the operations above are on vectors. I strongly recommend using numpy features such as the dot product to keep the code simple.
- Above, p is the probability of being positive, not the predicted label. Use the equation given in the Model section
for this value.
You will not need to tune the eta parameter; take it as a command-line argument. A complete sketch of the training loop, combining the notes above, follows.
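Here is a minimal numpy sketch of that loop, assuming X is already binarized with the bias column appended and Y holds 0/1 labels (the function and variable names are illustrative, not required):

import numpy as np

def train(X, Y, eta, maxIters=100, tol=0.0001):
    """Learn logistic regression weights by gradient ascent.
    X: n x d numpy array of 0/1 features (bias column included); Y: length-n 0/1 labels."""
    w = np.zeros(X.shape[1])
    for _ in range(maxIters):
        wPrev = w.copy()
        g = np.zeros(X.shape[1])                      # gradient accumulator
        for xi, yi in zip(X, Y):
            p = 1.0 / (1.0 + np.exp(-np.dot(w, xi)))  # P(Y=1 | xi) under current weights
            g += (yi - p) * xi                        # error * features
        w = w + eta * g                               # one weight update per iteration
        if np.sum(np.abs(w - wPrev)) < tol:           # L1 norm of the change
            break
    return w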
Prediction
For binary prediction, apply the equation:

$$P(Y = 1 \mid x) = \frac{1}{1 + e^{-w^T x}}$$

and output the more likely binary outcome. Use the Python
math.exp function to make the calculation in the denominator.
Here, $w^T x$ is the dot product of the weight vector $w$ and example $x$.
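A small sketch of this computation in plain Python (the helper names are illustrative):

import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def predictBinary(w, x):
    """Return 1 if P(Y=1|x) > 0.5 under weights w, else 0."""
    score = sum(wj * xj for wj, xj in zip(w, x))  # the dot product w^T x
    return 1 if sigmoid(score) > 0.5 else 0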
For multi-class prediction, you will build K 1-vs-all models. For
a test example, you will run it through each of the K models. The predicted label is the one whose model gives the highest P(Y=1|x_test) for
the given test example.
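For example, a sketch of the multi-class decision, assuming models maps each label to its learned 1-vs-all weight vector and reusing the illustrative sigmoid helper above:

def predictMulticlass(models, x):
    """Return the label whose 1-vs-all classifier assigns x the highest P(Y=1|x)."""
    return max(models, key=lambda label: sigmoid(
        sum(wj * xj for wj, xj in zip(models[label], x))))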
Example Output
I have included Debugging Output below to help diagnose potential problems. It includes the change in gradients (delta) per round, the learned weights for the first two iterations as well as the final learned weights, and the L1-norm of the weight vector at each iteration. You
do not need to calculate all of these values; they are for reference. You should not have this level of verbosity in your final program.
- Tennis example with eta=.1:
$python run_LR.py tennis .1
Accuracy: 1.0 (14 out of 14 correct)
Prediction
No Yes
No 5 0
Yes 0 9
Labels and Debugging Output
- Zoo example:
$ python run_LR.py zoo .1
Accuracy: 0.885714 (31 out of 35 correct)
Prediction
1 2 3 4 5 6 7
1 14 0 0 0 0 0 0
2 0 7 0 0 0 0 0
3 1 0 0 0 1 0 0
4 0 0 0 4 0 0 0
5 0 0 0 0 1 0 0
6 0 0 0 0 0 3 0
7 0 0 0 0 0 2 2
Labels and Debugging Output
Tips and Hints
- You may use preprocessing modules from Python libraries. For example, you could utilize the LabelBinarizer() from sklearn.preprocessing (see the sketch after this list).
- For handling the bias term in logistic regression, I recommend adding a dummy variable to your training/testing sets. All of the examples will have a value of 1 for this feature. Then you don't need to explicitly model this term (our derivation in class assumed this to simplify notation). While this is straightforward to do, you could also use add_dummy_feature.
- For creating a train/test split, you can do it manually in Python to allow flexibility (e.g., using shuffle). You could also try to use the train_test_split function in sklearn. While you are there, take a look at the functions for creating multiple folds.
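A quick sketch of those two helpers (assuming scikit-learn is installed; the toy data is illustrative):

import numpy as np
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split

# one-hot encode a multi-valued feature column
onehot = LabelBinarizer().fit_transform(["blue", "red", "green", "green"])

# random train/test split of toy examples X and labels Y
X = np.arange(8).reshape(4, 2)
Y = [0, 1, 0, 1]
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.25)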
Extensions (Optional)
These extensions are optional, if you are interested in going beyond the requirements. Doing the extensions does not make up for missed requirements above, and they are a lot of work relative to the amount of gain. But they can be rewarding and can aid understanding.
Be sure your extensions are
a net plus on the requirements above - I should be able to match expected behavior on the basic version of the lab above.
Continuous Features
Extend both models to allow continuous features. For Naive Bayes, a common option is to estimate a Gaussian distribution for each continuous feature. That is,
P(Xi|Y=yj) will require an estimate of the mean and standard deviation.
Both can be estimated in a similar fashion to the counts above: for all of the
examples where
Y=yj, find the mean and standard deviation of
Xi.
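Concretely, at prediction time you would evaluate the standard Gaussian density with the per-class estimates (writing $\mu_{ij}$, $\sigma_{ij}$ for the mean and standard deviation of $X_i$ among training examples with label $y_j$):

$$P(X_i = x \mid Y = y_j) = \frac{1}{\sqrt{2\pi}\,\sigma_{ij}} \exp\left(-\frac{(x - \mu_{ij})^2}{2\sigma_{ij}^2}\right)$$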
For Logistic Regression, you do not need to make many modifications - you only
need to allow non-binary features into the model. That is, in gradient
ascent, multiplying by
Xi will no longer be limited to 0 and 1.
I would recommend standardizing or normalizing continuous features.
Use the titanic data set from HW1, or any data set from the UCI repository that contains
continuous features.
ROC Curves
This only applies to binary classification tasks. Use the probabilistic
predictions of both models to generate ROC Curves.
Submitting your work
For the programming portion, be sure to commit your work often to prevent lost data. Only your final
pushed solution will be graded. Only files in the main directory will be graded. Please double check all requirements; common errors include:
- Did you fill out the README?
- Did you remove all debugging/developer comments including TODOs and print statements?
- Are all functions and non-trivial portions of code commented and easy to follow?
- Recheck your code on a lab machine to make sure it runs. Students often introduce whitespace errors while doing a last pass of adding/removing comments.
Also, be sure to meet the style requirements:
Program Style Requirements
- Your program should follow good design practices - effective top-down
design, modularity, commenting for functions and non-trivial code, etc.
- You should break your program up into multiple files and classes.
You can add additional files for common library methods shared between the two programs.
- Practice defensive programming; e.g., did the user provide enough
arguments on the command line? Do the files exist? You do
not need to check if the contents are correct.
- Do not interact with the user for your program - if an incorrect filename
is given, simply exit the program with a message; do not prompt for a new
file.
- Clean up any debugging print statements before final submission. They may be useful during development, but they need to be removed before we grade.
- All functions should include a top-level comment describing purpose, parameters, and return values.
- Unlike the previous lab, some data processing libraries are permitted. See the Tips and Hints section; feel free to email me if you are not sure if your library is permitted. In general, it should not implement a core
algorithmic feature of Naive Bayes or Logistic Regression.