This exercise is inspired by Aurélien Géron's sample code.
We will use a Jupyter notebook to progressively implement this exercise and view the code running, all within your browser window. If this does not work, you can instead use the Python interpreter:
python -i
and type (or copy-paste) the code below line by line. You can also download the stand-alone Python file and analyze the code line by line (this is not recommended).
From Prof. Meeden's Jupyter explanation:
A Jupyter notebook is a document that can contain both executable code and explanatory text. It is a great tool for documenting and illustrating scientific results. In fact, this lab writeup is a notebook that was saved as an html file. A notebook is made up of a sequence of cells. Each cell can be of a different type. The most common types are Code and Markdown. We will be using Code cells that are running Python 3, though there are many other possibilities. Markdown cells allow you to format rich text. In your terminal window, where you are already in your cs66 lab directory, type the following to start a notebook:
jupyter notebook
A web browser window will appear with the Jupyter home screen. On the right-hand side, go to the New drop-down menu and select Python 3. A blank notebook will appear with a single cell.
Let's try writing and executing a simple hello program. One of the main differences between Python 2 and Python 3 is that print is a function in Python 3. When you are ready to execute the cell, press Shift-Enter.
def hello(name):
    print("Hi", name)

hello("Chris")
hello("Samantha")
You should see the output of the code after you execute it:
Hi Chris
Hi Samantha
To name your notebook, double click on the default name, Untitled, at the top. Let's call it FirstNotebook. To save a notebook, click on the disk symbol on the left-hand side of the toolbar. You can also use the File menu and choose Save and Checkpoint. Explore the other menu options in the notebook. Figure out how to insert and delete cells, which are common commands you'll need to know.
To exit a notebook, save it first, then from the File menu choose Close and Halt. In the terminal window where you started the notebook, you'll also need to type CTRL+C to shut down the kernel. Do an ls to list the contents of your directory. You'll see that you now have a file called FirstNotebook.ipynb.
If you haven't already done so, follow the instructions above to start your Jupyter Notebook server. Create a new Python 3 notebook and name it as you see fit, e.g., DTreeTutorial.
Enter the following code into your first cell, paying close attention to the comments to understand what each line is doing. This will set up compatibility and the needed libraries. When you are done entering it in your notebook, hit Shift+Enter to execute.
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals
# Common imports
import numpy as np
import os
# to make this notebook's output stable across runs
np.random.seed(42)
# To plot pretty figures
# If using the Python interpreter, omit the following line. It only applies to the Jupyter environment
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
def save_fig(fig_id, tight_layout=True):
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(fig_id + ".png", format='png', dpi=300)
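If you want to confirm that save_fig works before moving on, you can try it on a throwaway plot. A minimal sketch (test_plot is a hypothetical file name chosen here; the PNG lands in your current working directory):
plt.plot([0, 1, 2], [0, 1, 4])  # a quick throwaway line plot
save_fig("test_plot")           # writes test_plot.png next to your notebook
plt.show()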
Let's get to it. We'll use the pre-loaded Iris data set that we referred to in class and build a Decision Tree Classifier.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
iris = load_iris()
X = iris.data[:, 2:]  # keep only petal length and petal width as features
y = iris.target
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X, y)
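Before reading on, it can be useful to inspect the fitted tree programmatically. Here is a small sketch using standard scikit-learn accessors (get_depth and get_n_leaves assume a reasonably recent version of scikit-learn, 0.21 or newer):
print(tree_clf.get_depth())           # depth of the fitted tree; should be 2 here
print(tree_clf.get_n_leaves())        # number of leaf nodes
print(tree_clf.feature_importances_)  # relative importance of petal length vs. petal width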
Notice that max_depth is set to 2. How does this relate to our discussion about stopping criteria and overfitting? How could we limit overfitting without relying on max_depth? Can you find other parameters that help prevent overfitting? I recommend coming back to this later and playing around with different options in the code.
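As a starting point for that exploration, here is a minimal sketch of a few other pre-pruning hyperparameters that DecisionTreeClassifier accepts; the specific values below are arbitrary choices for illustration, not tuned recommendations:
tree_clf_reg = DecisionTreeClassifier(
    min_samples_split=10,  # a node needs at least 10 samples before it can be split
    min_samples_leaf=5,    # every leaf must contain at least 5 samples
    max_leaf_nodes=8,      # cap the total number of leaves
    random_state=42)
tree_clf_reg.fit(X, y)
Now, visualize your tree. Execute the following code: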
from sklearn.tree import export_graphviz

export_graphviz(
    tree_clf,
    out_file="iris_tree.dot",
    feature_names=iris.feature_names[2:],
    class_names=iris.target_names,
    rounded=True,
    filled=True
)
To view your tree, open a command line and navigate to the directory containing iris_tree.dot. Convert the dot file to a PDF or PNG:
$ dot -Tpdf iris_tree.dot -o iris_tree.pdf
$ xpdf iris_tree.pdf
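A PNG works the same way: $ dot -Tpng iris_tree.dot -o iris_tree.png. If the dot command is not installed on your machine, one alternative sketch, assuming scikit-learn 0.21 or newer, is to draw the tree directly with matplotlib:
from sklearn.tree import plot_tree
plt.figure(figsize=(8, 6))
plot_tree(tree_clf, feature_names=iris.feature_names[2:],
          class_names=list(iris.target_names), filled=True, rounded=True)
plt.show()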
What features are in your decision tree? How many nodes? What are the class distributions for the leaves?
Next, we'll visualize the model's decision boundaries. Enter the following helper function (the indentation matters):
from matplotlib.colors import ListedColormap

def plot_decision_boundary(clf, X, y, axes=[0, 7.5, 0, 3], iris=True, legend=False, plot_training=True):
    x1s = np.linspace(axes[0], axes[1], 100)
    x2s = np.linspace(axes[2], axes[3], 100)
    x1, x2 = np.meshgrid(x1s, x2s)
    X_new = np.c_[x1.ravel(), x2.ravel()]  # one row per point on the 100x100 grid
    y_pred = clf.predict(X_new).reshape(x1.shape)
    custom_cmap = ListedColormap(['#fafab0', '#9898ff', '#a0faa0'])
    plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap)
    if not iris:
        custom_cmap2 = ListedColormap(['#7d7d58', '#4c4c7f', '#507d50'])
        plt.contour(x1, x2, y_pred, cmap=custom_cmap2, alpha=0.8)
    if plot_training:
        plt.plot(X[:, 0][y==0], X[:, 1][y==0], "yo", label="Iris-Setosa")
        plt.plot(X[:, 0][y==1], X[:, 1][y==1], "bs", label="Iris-Versicolor")
        plt.plot(X[:, 0][y==2], X[:, 1][y==2], "g^", label="Iris-Virginica")
        plt.axis(axes)
    if iris:
        plt.xlabel("Petal length", fontsize=14)
        plt.ylabel("Petal width", fontsize=14)
    else:
        plt.xlabel(r"$x_1$", fontsize=18)
        plt.ylabel(r"$x_2$", fontsize=18, rotation=0)
    if legend:
        plt.legend(loc="lower right", fontsize=14)
plt.figure(figsize=(8, 4))
plot_decision_boundary(tree_clf, X, y)
plt.plot([2.45, 2.45], [0, 3], "k-", linewidth=2)
plt.plot([2.45, 7.5], [1.75, 1.75], "k--", linewidth=2)
plt.plot([4.95, 4.95], [0, 1.75], "k:", linewidth=2)
plt.plot([4.85, 4.85], [1.75, 3], "k:", linewidth=2)
plt.text(1.40, 1.0, "Depth=0", fontsize=15)
plt.text(3.2, 1.80, "Depth=1", fontsize=13)
plt.text(4.05, 0.5, "(Depth=2)", fontsize=11)
save_fig("decision_tree_decision_boundaries_plot")
plt.show()
Once you have a model (tree), you'll want to use it to make predictions. The following two lines of code showcase two different kinds of prediction on a test example with petal length 5 cm and petal width 1.5 cm: the first returns class probabilities, the second the predicted class:
tree_clf.predict_proba([[5, 1.5]])
tree_clf.predict([[5, 1.5]])
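If your tree matches the one trained above, the output should be approximately:
array([[0.        , 0.90740741, 0.09259259]])
array([1])
That is, the model estimates roughly a 91% probability of Iris-Versicolor (class 1) for this point, and predict returns that class. If you changed any hyperparameters, your numbers will differ.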
Decision trees tend to be sensitive to small changes in the data set. We'll run an experiment to understand this. First, let's remove the widest Iris-Versicolor example from the training set (petal length 4.8 cm, petal width 1.8 cm):
X[(X[:, 1]==X[:, 1][y==1].max()) & (y==1)] # view the example that is the widest Iris-Versicolor flower
not_widest_versicolor = (X[:, 1]!=1.8) | (y==2) # boolean mask: keep every example that is not a width-1.8 versicolor
X_tweaked = X[not_widest_versicolor] # create a training set with that example removed
y_tweaked = y[not_widest_versicolor]
tree_clf_tweaked = DecisionTreeClassifier(max_depth=2, random_state=40) #Retrain
tree_clf_tweaked.fit(X_tweaked, y_tweaked)
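As an optional sanity check, you can verify that exactly one example was removed by comparing array shapes; with the standard Iris data you should see (150, 2) versus (149, 2):
print(X.shape, X_tweaked.shape)  # expect one fewer row in the tweaked set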
Now let us plot the new result:
plt.figure(figsize=(8, 4))
plot_decision_boundary(tree_clf_tweaked, X_tweaked, y_tweaked, legend=False)
plt.plot([0, 7.5], [0.8, 0.8], "k-", linewidth=2)
plt.plot([0, 7.5], [1.75, 1.75], "k--", linewidth=2)
plt.text(1.0, 0.9, "Depth=0", fontsize=15)
plt.text(1.0, 1.80, "Depth=1", fontsize=13)
save_fig("decision_tree_instability_plot")
plt.show()
How does this differ from the previous decision boundary? Is there a way to mitigate this effect?