If you want to work alone, run:
    setup63-bryce labs/10 none
If you want to work with a partner, one of you needs to run the following command while the other waits until it finishes:
    setup63-bryce labs/10 partnerUsername
Once the script finishes, the other partner should run it on their account.
    cd ~/cs63/labs/10
    cp -r /home/bryce/public/cs63/labs/10/* .

    git add *
    git commit -m "lab10 start"
    git push

    cd ~/cs63/labs/10
    git pull
This week, the starting point directory includes just two files: project.tex and README.
In the LaTeX file, project.tex, you will describe your project. We have provided a basic structure that you should follow. Feel free to change the section headings or to add sections. Recall that you use pdflatex to convert the LaTeX into a PDF file. Here is a template for your paper.
In the README file you will give instructions about how to test the code that you create.
For this project you will be taking on a machine learning challenge of your own choosing. If you want to work on the Netflix prize data set, it is available in my public directory: ~bryce/cs63/labs/netflix/. Other sources of ML contests include Kaggle and ChaLearn. Some of these contests are currently active, with prizes available. However, you are welcome to work on contest problems that have expired or on other machine learning problems of your choosing. On Kaggle, you can see old contests by selecting All Competitions and checking the completed box.
Many of these challenges involve large data sets that could quickly blow through your disk quota. To avoid this, you can save them to /scratch (instructions), which has no quota but isn't backed up. Also, take a look at the department's suggestions for long-running jobs.
You are welcome to make use of your own implementations from previous labs, or of any additional machine learning algorithms. However, for all of the algorithms we have studied and many more, there are excellent publicly available Python libraries.
Scikit-learn is a collection of Python libraries that implement a large number of machine learning algorithms. You have already encountered it once, in lab 6, when the SVM code was imported from sklearn. The following links give documentation on using scikit-learn for many of the machine learning algorithms we have studied this semester.
Scikit-learn also has modules for preprocessing data and for evaluating models with cross validation. Feel free to poke around in the scikit-learn documentation for other tools and algorithms.
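As a rough illustration of how preprocessing and cross validation fit together, here is a minimal sketch. The iris data set, the choice of an SVM, and the hyperparameter values are all assumptions made for the example; substitute your own data and model.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data set (an assumption); replace with your contest data.
X, y = load_iris(return_X_y=True)

# Chain scaling and the classifier so the scaler is re-fit on each
# training fold and never sees the held-out fold.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))

# 5-fold cross validation; returns one accuracy score per fold.
scores = cross_val_score(model, X, y, cv=5)
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```

Putting the scaler inside the pipeline, rather than scaling the whole data set up front, is one way to keep information from the test folds out of training.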
Ensemble learning is based on the idea that combining the output of many weak classifiers can make a strong classifier that outperforms all of its component parts. Ensemble learning is feasible as long as all of the component classifiers are useful (they perform better than random guessing) and not entirely redundant (they sometimes give different answers).
Two common methods for ensemble learning are boosting and bagging. Boosting, in its typical form (for example AdaBoost), works by training many weak classifiers (such as shallow decision trees) one after another on the same data, reweighting the training points each round so that the next classifier concentrates on the points its predecessors got wrong, and then taking a weighted vote or weighted average over their outputs in order to classify a new point. Bagging works by training many highly specific classifiers (such as deep decision trees) on random subsets of the data, usually drawn with replacement, and again taking a vote or average over their output labels.
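To make the bagging idea concrete, here is a small hand-rolled sketch using only the Python standard library. It is a toy, not a recommended implementation: the helper names (train_stump, bagging, predict), the one-dimensional data, and the use of threshold "stumps" as component classifiers are all assumptions chosen to keep the example short. The random subsets are drawn with replacement (bootstrap samples), which is one common choice.

```python
import random
from collections import Counter

def train_stump(points):
    """Fit a 1-D threshold rule with the fewest training errors.

    points is a list of (x, label) pairs with labels 0 or 1.
    """
    best = None
    for threshold, _ in points:
        for low_label, high_label in ((0, 1), (1, 0)):
            errors = sum(
                (low_label if x < threshold else high_label) != label
                for x, label in points
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, low_label, high_label)
    _, threshold, low_label, high_label = best
    return lambda x: low_label if x < threshold else high_label

def bagging(points, n_models, rng):
    """Train n_models stumps, each on its own bootstrap sample."""
    models = []
    for _ in range(n_models):
        # Bootstrap: draw len(points) examples with replacement.
        sample = [rng.choice(points) for _ in points]
        models.append(train_stump(sample))
    return models

def predict(models, x):
    """Classify x by plurality vote over the component classifiers."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Toy data (an assumption): label is 1 exactly when x >= 5.
data = [(x, int(x >= 5)) for x in range(10)]
models = bagging(data, n_models=25, rng=random.Random(0))

for x in (0, 9):
    print(x, "->", predict(models, x))
```

Each stump sees a slightly different sample, so the stumps can disagree near the class boundary; the vote smooths over those disagreements.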
If it is appropriate to your selected machine learning task, you should try using an ensemble learning method. Your writeup should report the results of testing the ensemble against its component algorithms. Several variations on boosting and bagging are implemented by scikit-learn. Documentation can be found in the ensemble methods section.
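One possible way to set up the comparison between an ensemble and its component algorithm is sketched below. The data set (scikit-learn's bundled breast cancer data), the tree depths, the number of estimators, and the 5-fold setup are all assumptions for illustration; your own task will likely call for different choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in data set (an assumption); replace with your contest data.
X, y = load_breast_cancer(return_X_y=True)

# Component classifiers: a weak one for boosting, a deep one for bagging.
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
deep_tree = DecisionTreeClassifier(max_depth=None, random_state=0)

# The base classifier is passed positionally because the keyword name
# differs between scikit-learn versions.
models = {
    "single stump": stump,
    "boosted stumps": AdaBoostClassifier(stump, n_estimators=50, random_state=0),
    "single deep tree": deep_tree,
    "bagged deep trees": BaggingClassifier(deep_tree, n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each model.
results = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}

for name, accuracy in results.items():
    print(f"{name:18s} {accuracy:.3f}")
```

Reporting each ensemble next to its own component classifier, evaluated on the same folds, makes it easier to see whether the ensemble is actually helping on your task.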
This project is meant to be open-ended and allow you to choose what machine learning topics you would like to explore further. If none of the contests sound appealing, you may propose alternative projects. Extending any of the machine-learning-related labs could be the basis for a project, and I am also open to other suggestions. If you want to pursue a non-contest project, be sure to talk to me about your ideas before you start significant coding.
Please turn in a hard copy of your writeup PDF outside my office before the deadline.