Overview
The goals of this week's lab:
- Implement support vector machines and random forests
- Utilize held-aside tuning sets to choose hyperparameters and balance bias/variance
- Evaluate methods on several data sets using N-fold cross validation
- Analyze experiments using several methodologies, including paired t-tests
- Learn to read and use the scikit-learn API
You may work with one lab partner for this lab. You may discuss high level ideas with any other group, but
examining or sharing code (or results/analysis) is a violation of the department academic integrity policy.
Getting Started
Projects in this course will be distributed and submitted through Swarthmore
GitHub Enterprise. Find your git repo for this lab assignment, named Project3-id1_id2 (with id1 and id2 replaced by
the user IDs of you and your partner), in the course organization: CPSC66-F17
Clone your git repo with the starting point files into
your labs directory:
$ cd cs66/labs
$ git clone [the ssh URL to your repo]
Then cd into your
Project3-id1_id2 subdirectory. You will have the following files (those in blue require your modification):
- run_pipeline.py - your main program executable for tuning and testing your algorithms and outputting error rates
- generateCurves.py - put code here to generate your learning curves
- README.md - required response for data collection and grading
- analysis.pdf - a report on the results of your experiments. This
is the most important deliverable for this lab.
Datasets
Rather than parse and process data sets, you will utilize scikit-learn's
pre-defined data sets. Details can be found here. At a minimum, your experiments will require using the MNIST and
20 Newsgroup datasets. Both are multi-class tasks (10 and 20 classes, respectively). Note that both of these are large and take time to run, so I recommend developing using the Wisconsin Breast Cancer dataset:
import sys
from sklearn.datasets import load_breast_cancer

if sys.argv[1] == "cancer":
    data = load_breast_cancer()
    X = data['data']
    y = data['target']
    print(X.shape)
    print(y.shape)
which outputs 569 examples with 30 features each:
(569, 30)
(569,)
The MNIST dataset is very large and takes a lot of time to run, so you can
randomly subselect 1000 examples; you should also normalize the pixel values to lie between 0 and 1 (instead of 0 to 255):
from sklearn.datasets import fetch_mldata
from sklearn import utils

data = fetch_mldata('MNIST Original', data_home="~soni/public/cs66/sklearn-data/")
X = data['data']
y = data['target']
X, y = utils.shuffle(X, y)  # Shuffle the rows
X = X[:1000]                # Only keep 1000 training examples
y = y[:1000]
X = X/255.0                 # Normalize the feature values
The newsgroup dataset in vector form (i.e., bag of words) is obtained using:
from sklearn.datasets import fetch_20newsgroups_vectorized

data = fetch_20newsgroups_vectorized(subset='all', data_home="~soni/public/cs66/sklearn-data/")
No normalization is required; I suggest randomly sampling 1000 examples for this dataset as well. The
data object also contains headers
and target information, which you should examine (e.g., in a Jupyter notebook) to understand the data. For your analysis, it may be helpful to know the number of features, their types, and which classes are being predicted.
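For a quick look, something like the following works (a sketch that simply inspects the data object loaded above; run it in a notebook or a throwaway script):

X, y = data['data'], data['target']
print(X.shape)                 # (num examples, num features); X is a sparse matrix
print(len(data.target_names))  # number of classes (20)
print(data.target_names[:5])   # names of the first few newsgroups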
Coding requirements
The coding portion is flexible - the goal is to be able to execute the experiments below. However, you should keep these requirements in mind:
- Make your code reusable and modular. You shouldn't hardcode the datasets/algorithms you want. Make the user define
the dataset on the command line, and abstract away the algorithm of choice.
- Use command-line arguments to help identify the dataset you want to load.
E.g., for my cancer example above, you would run:
$ python run_pipeline.py cancer
If your code is designed well, it shouldn't matter what the source of the data is (a sketch of this dispatch appears after this list).
- Functions should also be modular. I recommend having a runTuneTest(learner, parameters, X,y) method that takes in
the base learner (e.g., SVC, RandomForestClassifier), the hyperparameters
to tune as a dictionary and all of the data. This
method will then handle creating train/tune/test sets and running the pipeline.
For example, I could create a K-Nearest Neighbor classifier:
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier()
parameters = {"weights": ["uniform", "distance"], "n_neighbors": [1, 5, 11]}
results = runTuneTest(clf, parameters, X, y)
Note that the hyperparameters match the API for KNeighborsClassifier. In the dictionary, the key is the name of the hyperparameter and the value is a list of values to try.
- generateCurves.py should follow similar constraints, though the names and styles of methods will be different.
- Keep your code simple and easy to read. Let scikit-learn do most of the
heavy lifting. Most of your time will be spent reading the API and finding
the appropriate methods to call. Your solution
will be about 60 lines of code including methods and comments.
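To make the command-line requirement concrete, here is a minimal sketch of how run_pipeline.py might dispatch on the dataset name. The load_dataset helper is hypothetical (not a required name); the point is that the rest of the pipeline only ever sees X and y:

import sys
from sklearn.datasets import load_breast_cancer

def load_dataset(name):
    """Return (X, y) for the dataset named on the command line."""
    if name == "cancer":
        data = load_breast_cancer()
        return data['data'], data['target']
    # elif name == "mnist": ...  (see the snippets in the Datasets section)
    # elif name == "news":  ...
    else:
        sys.exit("Unknown dataset: %s" % name)

def main():
    if len(sys.argv) != 2:
        sys.exit("Usage: python run_pipeline.py <dataset>")
    X, y = load_dataset(sys.argv[1])
    # ... define classifiers/parameter grids and call runTuneTest(clf, parameters, X, y)

if __name__ == "__main__":
    main()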
Experiment 1: SVM vs RandomForest Generalization Error
Using run_pipeline.py, you will run
both SVMs and Random Forests and compare which does better in terms of estimated generalization error.
Coding Details
Your program should read in the dataset (MNIST or 20 Newsgroup, at a minimum) specified on the command line, as discussed above, e.g.:
$ python run_pipeline.py mnist
$ python run_pipeline.py news
You should specify your parameters and classifier and call
runTuneTest (see the above example), which follows this sequence of steps (a sketch of runTuneTest appears after the list):
- Divide the data into training/test splits using StratifiedKFold. Follow this example to create a for-loop for each fold. Set the parameters to shuffle the data and use 5 folds (cv). Set the random_state to a fixed integer value (e.g., 42) so the folds are consistent for both algorithms.
- For each fold, tune the hyperparameters using GridSearchCV, which is a wrapper for your base learning algorithms; it automates the search over multiple hyperparameters. Use the default value of 3-fold tuning.
- After creating a GridSearchCV classifier, fit it using your training data
- Get the test-set accuracy by calling the GridSearchCV score method with the fold's test data.
- Return a list of accuracy scores for each fold.
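Here is one way runTuneTest could look. This is a sketch, not the required implementation, and it assumes a recent scikit-learn where StratifiedKFold and GridSearchCV live in sklearn.model_selection (in 0.17.1 they are in sklearn.cross_validation and sklearn.grid_search, and StratifiedKFold is constructed from y directly):

from sklearn.model_selection import StratifiedKFold, GridSearchCV

def runTuneTest(learner, parameters, X, y):
    """Tune with grid search inside each of 5 stratified folds; return the
    test accuracy of the best model for every fold."""
    scores = []
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    for train_idx, test_idx in skf.split(X, y):
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]
        clf = GridSearchCV(learner, parameters, cv=3)  # 3-fold tuning
        clf.fit(X_train, y_train)
        print("Best parameters:", clf.best_params_)
        print("Tuning Set Score: %.3f" % clf.best_score_)
        scores.append(clf.score(X_test, y_test))  # accuracy on the held-out fold
    return scores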
In
main(), you should print the paired test accuracies for all 5 folds for both classifiers.
Your classifiers/hyperparameters are defined as follows (see the sketch after this list):
- Your RandomForest classifier should fix the number of trees at 200, but tune the number of features
considered among 1%, 10%, 50%, 100%, and the square root of the number of features in the dataset.
- Your Support Vector Machine should use the Gaussian kernel ('rbf') and tune both the complexity (C) parameter (1, 10, 100, 1000) and gamma parameter (10^-4, 10^-3, 10^-2, 10^-1, 1).
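Expressed as GridSearchCV-style dictionaries, those grids might look like the sketch below (one reasonable encoding, not the only one; note that max_features accepts fractions of the feature count as floats as well as 'sqrt'):

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

rf = RandomForestClassifier(n_estimators=200)   # 200 trees, fixed
rf_parameters = {"max_features": [0.01, 0.1, 0.5, 1.0, "sqrt"]}

svm = SVC(kernel='rbf')                          # Gaussian kernel
svm_parameters = {"C": [1, 10, 100, 1000],
                  "gamma": [1e-4, 1e-3, 1e-2, 1e-1, 1]}

rf_results = runTuneTest(rf, rf_parameters, X, y)
svm_results = runTuneTest(svm, svm_parameters, X, y)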
Code incrementally, and be sure to examine the results of your tuning (what were the best hyperparameter settings? what were the scores across each parameter?) to ensure you have the pipeline correct. Since the analysis below is dependent on your results, I cannot provide sample output for this task. However, this is what is generated if I change my classifier to K-Nearest Neighbors using the parameters listed in the previous section (you can try to replicate this using a
random_state of 42):
$ python run_pipeline.py mnist
RUNNING 5-Fold CV on KNN
------------------------
Fold 1:
('Best parameters:', "{'n_neighbors': 5, 'weights': 'distance'}")
Tuning Set Score: 0.827
Fold 2:
('Best parameters:', "{'n_neighbors': 5, 'weights': 'distance'}")
Tuning Set Score: 0.817
Fold 3:
('Best parameters:', "{'n_neighbors': 1, 'weights': 'uniform'}")
Tuning Set Score: 0.837
Fold 4:
('Best parameters:', "{'n_neighbors': 5, 'weights': 'distance'}")
Tuning Set Score: 0.846
Fold 5:
('Best parameters:', "{'n_neighbors': 5, 'weights': 'distance'}")
Tuning Set Score: 0.840
Fold, Test Accuracy
0, 0.892
1, 0.916
2, 0.861
3, 0.812
4, 0.862
Note that StratifiedKFold changed between version 0.17.1 (the one on our systems) and the current version, so you'll get different results if you are developing on your own computer. They should be in the same ballpark in terms of accuracy.
Analysis
In part 1 of your writeup, you will analyze your results. At a minimum, your submission should include
the following type of analysis:
- Provide quantitative results. Present the results
visually, both in summary and in detail (e.g., a table). What is the average accuracy of each method?
Report the p-value of a paired t-test for both data sets (TIP: you can use the scipy library to do this; see the sketch after this list). Can we reject the null hypothesis at p < 0.05?
- Qualitatively assess the results. What can we conclude/infer about both methods
and how did the methods compare to each other?
- Align the results with class discussion, e.g., did one method dominate, or did they split across the data sets?
Can you explain this using the properties of each algorithm discussed in class? Which hyperparameters, if any, were commonly chosen for each dataset?
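For the t-test, scipy's paired test is essentially a one-liner. A minimal sketch follows; the accuracy lists here are placeholders only (in your code they come from runTuneTest, run on the same folds for both classifiers), not real results:

from scipy import stats

svm_scores = [0.90, 0.88, 0.91, 0.87, 0.89]  # placeholder per-fold accuracies
rf_scores  = [0.88, 0.87, 0.90, 0.86, 0.90]  # placeholder per-fold accuracies

t_stat, p_value = stats.ttest_rel(svm_scores, rf_scores)
print("paired t-test: t = %.3f, p = %.3f" % (t_stat, p_value))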
Your analysis should be written as if it were the results section of a scientific report/paper.
Experiment 2: Learning Curves
Using generateCurves.py, you will generate the data for learning curves for the above two classifiers. Since we are not interested in generalization accuracy here, we will generate the curves using one round of train/tune splits.
Coding Requirements
- Follow the same guidelines for loading the data as in Experiment 1 (you will use both MNIST and 20 Newsgroup).
- For Random Forests, you will generate a learning curve for the number of trees (i.e., n_estimators). The parameter
will take on all values from 1 to 201 spaced by 10 (i.e., 1, 11, 21, ..., 201). Keep all other parameters at their default values.
- For Support Vector Machines, again use an RBF kernel with a fixed complexity parameter of 1.0 (the default). You will
range over gamma values 10^-5, 10^-4, 10^-3, 10^-2, 10^-1, 1, 10 (HINT: use np.logspace() to easily generate this range).
- To generate the data for the curve, you only need the validation_curve method, which returns the training and test set accuracies for each parameter and each fold. You will need to average the folds together; use 3-fold cv (the default).
- Print (and, optionally, save to a CSV file) the following for each parameter value: the parameter value, the average train accuracy across the 3 folds, and the average test accuracy across the 3 folds. A sketch using validation_curve follows this list.
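Here is a sketch of the SVM half of this, assuming X and y are already loaded and a recent scikit-learn where validation_curve lives in sklearn.model_selection (in 0.17.1 it is in sklearn.learning_curve):

import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

gamma_range = np.logspace(-5, 1, 7)  # 10^-5, 10^-4, ..., 1, 10
train_scores, test_scores = validation_curve(
    SVC(kernel='rbf', C=1.0), X, y,
    param_name='gamma', param_range=gamma_range, cv=3)

# validation_curve returns one score per (parameter value, fold); average the folds
print("Gamma, Train Accuracy, Test Accuracy")
for g, tr, te in zip(gamma_range, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print("%g, %.3f, %.3f" % (g, tr, te))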
Here is the result if I run
KNeighborsClassifier with all odd k values from 1 to 21:
$ python generateCurves.py mnist
Neighbors, Train Accuracy, Test Accuracy
1, 1.000, 0.837
3, 0.930, 0.833
5, 0.910, 0.838
7, 0.896, 0.834
9, 0.880, 0.817
11, 0.866, 0.823
13, 0.853, 0.819
15, 0.842, 0.811
17, 0.833, 0.809
19, 0.824, 0.800
21, 0.816, 0.795
Analysis
Analyze your results for experiment 2. At a minimum, you should have:
- A learning curve for each dataset and each method. Each learning curve (4 in total) should have both the training and test accuracy, clearly labeled axes, and a legend. For the SVM curves, plot your x-axis in log space to evenly space the points (see the plotting sketch after this list).
- Analysis of each of your 4 learning curves. Your discussion should describe the results and relate them to relevant course topics such as bias/variance and overfitting/underfitting.
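For the plots themselves, matplotlib's semilogx handles the log-scale x-axis. The sketch below reads back a saved CSV; the filename and column layout are hypothetical, so adapt them to whatever generateCurves.py actually writes:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical file: gamma, train accuracy, test accuracy; one row per parameter value
gamma, train_acc, test_acc = np.loadtxt("svm_mnist.csv", delimiter=",",
                                        skiprows=1, unpack=True)

plt.semilogx(gamma, train_acc, marker='o', label="Train Accuracy")
plt.semilogx(gamma, test_acc, marker='o', label="Test Accuracy")
plt.xlabel("gamma (log scale)")
plt.ylabel("Accuracy")
plt.legend()
plt.savefig("svm_mnist_curve.png")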
Submitting your work
For the programming portion, be sure to commit your work often to prevent lost data. Only your final
pushed solution will be graded. Only files in the main directory will be graded. Please double check all requirements; common errors include:
- Did you fill out the README?
- Did you remove all debugging/developer comments including TODOs and debugging print statements?
- Are all functions and non-trivial portions of code commented and easy to follow?
- Recheck your code on a lab machine to make sure it runs. Students often have whitespace errors after doing a last pass of adding/removing comments.
Program Style Requirements
- Your program should follow good design practices - effective top-down
design, modularity, commenting for functions and non-trivial code, etc.
- You should break your program up into multiple files and classes.
You can use additional files for library methods shared between the two programs.
- Practice defensive programming; e.g., did the user provide enough
arguments on the command line? Do the files exist? You do
not need to check if the contents are correct.
- Do not interact with the user for your program - if an incorrect filename
is given, simply exit the program with a message; do not prompt for a new
file.
- Clean up any debugging print statements before your final submission. They may be useful during development, but they need to be removed before we grade your work.
- All functions should include a top-level comment describing purpose, parameters, and return values.