Overview
The goals of this week's lab:
- Implement the decision-tree algorithm
- Handle common data-processing tasks including file parsing and feature conversion
- Implement schemes to prevent overfitting
- Perform experimental analysis on your decision tree
You may work with one lab partner for this lab. You may discuss high-level ideas with any other group, but
examining or sharing code is a violation of the department academic integrity policy.
Getting Started
Projects in this course will be distributed and submitted through Swarthmore
GitHub Enterprise. If you have never used the Swarthmore GitHub before, you should follow our GitHub Setup Guide. If you have taken CS31 or CS35 recently, you will be familiar
with the process. Find your git repo for this lab assignment in the CPSC66-F17 organization; its name has the format Project1-id1_id2 (with id1 and id2 replaced by the user ids of you and your partner).
Clone your git repo with the starting point files into
your labs directory:
$ cd cs66/labs
$ git clone [the ssh url to your repo]
Then cd into your
Project1-id1_id2 subdirectory. You will have the following files (those in blue require your modification):
- input - directory containing train/test data sets including the heart and diabetes data sets. You may
add additional data sets here.
- output - all required output from your algorithm should be saved here
- run_dtree.py - your main program executable
- DecisionTree.py - a class file to define your decision tree algorithm
- analysis.pdf - not provided. This is the name you should give to the file
containing your solutions to the analysis questions in Part 2.
- README.md - required response for data collection and grading
Part 1: Implement a Decision Tree
You will implement an ID3-like decision tree in this part of the lab. The main features of your implementation include:
- Reading in ARFF formatted train and test set files
- Converting continuous features into thresholded binary features
- Building trees to a maximum depth as determined by user input
- Output of learned tree for interpretation of induced model
- Evaluation of the constructed tree on a held-aside test set
Usage
Your program should take in 1 or 2 command-line arguments in this specific order:
- The name of the data set
- A parameter setting for the maximum depth of the learned tree (optional)
For example:
python run_dtree.py toy 3
If the user specifies an incorrect number of arguments or invalid parameter values, print a usage
statement and exit the program. You should error check all parameters
to ensure that the files exist and that the depth, if given, is an integer greater than 0. E.g.,
$ python run_dtree.py
Error, incorrect number of arguments
Usage: python run_dtree.py dataset [depth]
dataset - base name of input arff files in the input/ folder
depth (optional) - maximum depth of tree, set as a positive integer value
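A minimal sketch of this argument checking for run_dtree.py (the function and variable names here are illustrative, not required):

import sys

def parse_args():
    # Validate command-line arguments; print a usage message and exit on any error.
    if len(sys.argv) not in (2, 3):
        print("Error, incorrect number of arguments")
        print("Usage: python run_dtree.py dataset [depth]")
        sys.exit(1)
    dataset = sys.argv[1]
    depth = None  # no depth limit by default (plain ID3 behavior)
    if len(sys.argv) == 3:
        if not sys.argv[2].isdigit() or int(sys.argv[2]) <= 0:
            print("Error, depth must be a positive integer")
            print("Usage: python run_dtree.py dataset [depth]")
            sys.exit(1)
        depth = int(sys.argv[2])
    return dataset, depth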
Note that the dataset name (e.g.,
toy) will determine 4 file names:
- The training set file input/[dataset]_train.arff e.g., input/toy_train.arff
- The test set file input/[dataset]_test.arff e.g., input/toy_test.arff
- The tree output file output/[dataset].tree e.g., output/toy.tree
- The predicted labels for the test set output/[dataset].labels e.g., output/toy.labels
You must ensure that the two input files actually exist. Please follow this format exactly;
these are easy points for your grade and they significantly reduce the number of headaches
I will face during grading.
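One way to derive the four paths and verify that the input files exist (a sketch; the helper name is illustrative):

import os
import sys

def build_paths(dataset):
    # Derive the four file names from the dataset base name.
    train_path = "input/" + dataset + "_train.arff"
    test_path = "input/" + dataset + "_test.arff"
    tree_path = "output/" + dataset + ".tree"
    labels_path = "output/" + dataset + ".labels"
    # Defend against non-existent input files before doing any work.
    for path in (train_path, test_path):
        if not os.path.exists(path):
            print("Error, file " + path + " does not exist")
            sys.exit(1)
    return train_path, test_path, tree_path, labels_path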
Program Style Requirements
- Your program should follow good design practices - effective top-down
design, modularity, commenting for functions and non-trivial code, etc.
- You should break your program up into multiple files with run_dtree.py as the main executable.
You should build classes in separate files (I have provided one, you may add others).
At a minimum, you should represent decision nodes and the decision tree
as objects (one possible skeleton appears after this list). Be sure to design these aspects of the program before starting your main implementation.
- All decision tree algorithm details should be encapsulated in your decision tree class. The main file should only
directly handle data parsing and the main control flow for calling decision tree steps (e.g., train, test, output).
- Practice defensive programming; e.g., did the user provide enough
arguments on the command line? Do the files exist? You do
not need to check if the contents are correct.
- Do not interact with the user for your program - if an incorrect filename
is given, simply exit the program with a message; do not prompt for a new
file.
- Clean up any debugging print statements for your final submission. They may be useful during development, but they need to be removed before we grade your code.
- All functions should include a top-level comment describing purpose, parameters, and return values.
- Avoid using existing libraries to solve major components of the program. For example, I want you to manually parse the ARFF files instead of using an existing ARFF library. You are more than welcome to use numpy for array management. Please ask me if there are other libraries you are interested in using.
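One possible object design for the required node and tree classes (a sketch only; the names, fields, and methods are suggestions, not a required interface):

class Node:
    # A single decision node, or a leaf if feature is None.
    def __init__(self, feature=None, label=None):
        self.feature = feature   # feature this node splits on
        self.children = {}       # maps a feature value to a child Node
        self.label = label       # plurality class label (used at leaves)

class DecisionTree:
    # Encapsulates all algorithm details: training, prediction, output.
    def __init__(self, max_depth=None):
        self.max_depth = max_depth
        self.root = None

    def train(self, examples, features):
        # Recursively induce the tree (ID3 with an optional depth limit).
        pass

    def predict(self, example):
        # Walk from the root to a leaf and return that leaf's label.
        pass

    def output(self, filename):
        # Write a readable rendering of the tree for output/[dataset].tree.
        pass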
Program Inputs
- Your program is expected to read files that are in the ARFF format. You are required to defend against non-existent files, but if you are given a valid file name, you can assume that it is in ARFF format. (A minimal parsing sketch follows this list.)
- The header of the file describes the features and class label. You can assume that the class label
is the last line in the feature list and will be named class to easily demarcate the end of the header. You can
also assume the headers in the train and test files are the same.
- Subsequent lines after the header contain one example per line represented as commas separated feature values. The feature values occur in the same ordering as feature descriptions in the header (with the class label being at the end of the line).
- Your program should handle both nominal and numeric features (e.g., by thresholding). You can assume the class label is nominal.
- Lines starting with % are comments and should be ignored.
- I have provided data sets for testing purposes in the input directory including a heart disease prediction task (heart_{train,test}.arff) and a diabetes prediction task (diabetes_{train,test}.arff). You are encouraged to add your
own toy examples for the purposes of debugging. Be sure to follow the format required in the usage.
- The other type of input is the optional maximum depth of the tree. This should be a positive integer - all other values (including 0) should be rejected and lead to a clean exit with error message. E.g., a depth of 1 means there is a single decision node in the tree. If no value is given, trees should be built using the same criteria as ID3.
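A minimal sketch of header and data parsing, assuming well-formed ARFF input as described above (the function name and return structure are illustrative):

def parse_arff(path):
    # Parse an ARFF file into (features, examples). features is a list of
    # (name, values) pairs, where values is a list of nominal values or the
    # string "numeric"; each example is a list of feature values with the
    # class label last.
    features = []
    examples = []
    in_data = False
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("%"):
                continue  # skip blank lines and comments
            lower = line.lower()
            if lower.startswith("@attribute"):
                # e.g., "@attribute sex {female, male}" or "@attribute age numeric"
                name = line.split()[1]
                if "{" in line:
                    inner = line[line.index("{") + 1:line.index("}")]
                    features.append((name, [v.strip() for v in inner.split(",")]))
                else:
                    features.append((name, "numeric"))
            elif lower.startswith("@data"):
                in_data = True
            elif in_data:
                examples.append([v.strip() for v in line.split(",")])
    return features, examples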
Program Main Behavior
Your program should process inputs, parse and index training and test examples, and induce a decision tree. There
are very few restrictions on how you do this, but be sure your tree fits the behavior from class:
- Never use information from the test set in any part of learning the tree, including when you decide thresholds for
continuous features.
- You may assume that the prediction task is binary (use the values given in the file header). You can implement multi-valued prediction as an extension (see below).
- Continuous features should be converted into multiple binary features by thresholding at class label changes. Each such feature
should use the less-than-or-equal-to relation (i.e., the feature is true if a feature value is <= the threshold). Choose the
mid-point between the two adjacent feature values. E.g., if there is a label change between X=20 and X=30, you should
add a feature X<=25.
- There should be a branch for each nominal value of a variable. These values are described in the header of the arff file, and are
limited to true and false for the converted continuous features. It is probably simplest to store these values as strings.
- Use information gain as your selection criterion. You can break ties arbitrarily. (A sketch of the information-gain and threshold computations follows this list.)
- The stopping criteria for a leaf are:
- All examples are the same label.
- There are no features remaining to split on.
- The maximum depth of the branch has been reached.
- No features produce information gain.
The labels at leaves should be the plurality label of the examples at the leaf. Ties can be broken arbitrarily.
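A sketch of the entropy, information-gain, and threshold-candidate computations referenced above (the function names are illustrative, and this assumes examples are lists of feature values with class labels kept in a parallel list):

import math
from collections import Counter

def entropy(labels):
    # Entropy (in bits) of a non-empty list of class labels.
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(examples, labels, feature_index, feature_values):
    # Information gain of splitting on the nominal feature at feature_index.
    remainder = 0.0
    for value in feature_values:
        subset = [lab for ex, lab in zip(examples, labels)
                  if ex[feature_index] == value]
        if subset:
            remainder += (len(subset) / len(labels)) * entropy(subset)
    return entropy(labels) - remainder

def candidate_thresholds(values, labels):
    # Midpoints between adjacent sorted feature values where the class
    # label changes, e.g., X=20 (yes) and X=30 (no) yields 25.0.
    pairs = sorted(zip(values, labels))
    thresholds = []
    for (v1, l1), (v2, l2) in zip(pairs, pairs[1:]):
        if l1 != l2 and v1 != v2:
            thresholds.append((v1 + v2) / 2.0)
    return thresholds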
Program output
When run, your program should write the learned tree to output/[dataset].tree and the predicted labels for the test set to output/[dataset].labels, following the file-naming conventions described in the usage section above, and evaluate the tree on the held-aside test set.
Example runs
Sample runs (output omitted here) cover the heart disease data set, the diabetes data set, and trees built with discrete features also converted to binary (no depth limit).
Part 2: Analysis
Once your lab is complete, generate two learning curve graphs (i.e., line graphs) with an accompanying analysis of the graphs. On the x-axis, show the depth of a tree. On the y-axis, show the accuracy of the tree on both the training set and test set. Run experiments at various depth levels including, at a minimum, depths of 1, 2, 4, 7, and 10. You should have one graph for the diabetes data set and one graph for the heart disease data set.
Describe what you can conclude from each graph. What happens to the training set accuracy and test set accuracy as the depth varies? Connect these results to class discussion. Do they agree or disagree with our discussion? Why or why not? Place your
graphs and analysis in a PDF file called analysis.pdf. Be sure that your graphs are of scientific quality - including captions, axis labels, distinguishable curves, and legible font sizes.
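One way to generate such a graph, using matplotlib (the accuracy lists below are placeholders to be filled with your measured results, not real numbers):

import matplotlib.pyplot as plt

depths = [1, 2, 4, 7, 10]
train_acc = [0.0] * len(depths)  # placeholder: fill with your training accuracies
test_acc = [0.0] * len(depths)   # placeholder: fill with your test accuracies

plt.plot(depths, train_acc, marker="o", label="Training accuracy")
plt.plot(depths, test_acc, marker="s", label="Test accuracy")
plt.xlabel("Maximum tree depth")
plt.ylabel("Accuracy")
plt.title("Decision tree learning curve (diabetes)")
plt.legend()
plt.savefig("diabetes_curve.png")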
Extensions (Optional)
These extensions are optional if you are interested in going beyond the requirements. Meeting the minimum requirements typically earns an A on labs - 9.5 out of 10. I reserve perfect scores for labs that are exceptional and go beyond the minimum. Doing the extensions does not make up for missed requirements above, and they are a lot of work for the relative amount of gain. But they can be rewarding and can aid understanding.
Be sure your extensions are a net plus on top of the requirements above - I should still be able to match the expected behavior of the basic version of the lab.
Multi-class prediction
Both data sets I have provided have two class labels. Decision trees easily extend to multi-class prediction. Find a
relevant data set (e.g., at Kaggle or the UC Irvine Machine Learning Repository) and evaluate your algorithm with multiple
nominal values allowed for labels.
Min Leaf Size
Another approach to preventing overfitting is setting a minimum number of examples in a leaf. If a split results in children below the threshold, the recursion is stopped at the current node and a leaf is created with the plurality label. This is often more flexible than a maximum depth - it allows variable-depth branches while still trying to prevent overly detailed hypotheses that fit only a few examples. Add minimum leaf size as an optional command-line argument. (A sketch of this check appears below.)
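A sketch of where this check might fit into the recursive build (min_leaf_size would be the new optional argument; split_children is a hypothetical mapping from a feature value to the examples routed to that child):

def split_too_small(split_children, min_leaf_size):
    # True if any child produced by the candidate split would hold fewer
    # examples than the minimum leaf size, in which case the recursion
    # should stop and a leaf with the plurality label should be created.
    if min_leaf_size is None:
        return False
    return any(len(child_examples) < min_leaf_size
               for child_examples in split_children.values())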
Learning curves for training set size
Machine learning algorithms are often sensitive to the amount of training data - some algorithms thrive when there are large amounts of data but struggle when data is sparse (e.g., deep neural networks) while others plateau in performance even if more data is available. Evaluate your decision tree on the given training data by randomly subsampling the data to create smaller training sets, e.g., use 10% of the training data. Choose at least 5 training set sizes and plot them on the x-axis with the y-axis describing the accuracy. You'll probably want to run each training set size 3-5 times and average the results. Describe what you see when you compare training and test accuracy as the training set grows.
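A sketch of the subsampling loop (train_and_evaluate is a hypothetical helper that builds a tree on the given examples and returns its test-set accuracy):

import random

def subsample_accuracy(train_examples, test_examples, fraction, trials=5):
    # Average accuracy over several random subsamples of the training data.
    n = max(1, int(fraction * len(train_examples)))
    accuracies = []
    for _ in range(trials):
        subset = random.sample(train_examples, n)
        accuracies.append(train_and_evaluate(subset, test_examples))
    return sum(accuracies) / len(accuracies)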
Submitting your work
For the programming portion, be sure to commit your work often to prevent lost data. Only your final
pushed solution will be graded. Only files in the main directory will be graded. Please double check all requirements; common errors include:
- Did you fill out the README?
- Did you remove all debugging/developer comments including TODOs and print statements?
- Are all functions and non-trivial portions of code commented and easy to follow?
- Recheck your code on a lab machine to make sure it runs. Students often introduce whitespace errors while doing a last pass of adding/removing comments.
Acknowledgements
Both data sets were made available by the UC Irvine Machine Learning Repository. The heart disease data set is the result of a study by the
Cleveland Clinic to study coronary heart disease. The patients were identified
as having or not having heart disease and the resulting feature set is a subset
of the original 76 factors that were studied. The diabetes
data set is the result of a study of females at least 21 years old of Pima Indian heritage to understand risk factors of diabetes. Both data sets were converted
to ARFF format by Mark Craven at the University of Wisconsin.