For this project you will take on a machine learning challenge of your choice from Kaggle. Some of these contests are currently active, with prizes available. However, you are welcome to work on contest problems that have expired or (with approval) on another machine learning problem of your choosing. On Kaggle, you can see old contests under all competitions.
Many of these challenges involve large data sets that could quickly blow through your disk quota. To avoid this, you can save them to /scratch (instructions), which is unlimited but isn't backed up. Also, take a look at the department's suggestions for long-running jobs.
In order to download data, you will need to sign up for a free account. Kaggle also has a discussion forum, which may have useful suggestions, especially if you are working on an active contest.
Scikit-learn (imported as sklearn) is a Python library that implements a large number of machine learning algorithms, including a huge collection of classification, clustering, and regression techniques. We previously used the sklearn implementation of support vector machines in lab 8. Here is the documentation for sklearn implementations of many of the algorithms we have studied:
Scikit-learn also has modules for preprocessing data and for evaluating models with cross-validation. Feel free to poke around in the scikit-learn documentation for other tools and algorithms.
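As a rough starting point, preprocessing and cross-validation can be chained together as in the sketch below. The iris data set, the RBF-kernel SVM, and the choice of 5 folds are illustrative assumptions, not requirements of the project; substitute your own Kaggle data and model.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy data set bundled with scikit-learn (assumption: stands in for your Kaggle data).
X, y = load_iris(return_X_y=True)

# Putting the scaler inside the pipeline means it is re-fit on the training
# portion of each fold, so no information leaks from the held-out fold.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))

# 5-fold cross-validation; returns one accuracy score per fold.
scores = cross_val_score(model, X, y, cv=5)
print("fold accuracies:", scores)
print("mean accuracy: %.3f" % scores.mean())
```

Keeping the preprocessing step inside the pipeline, rather than scaling the whole data set up front, is what keeps the cross-validation estimate honest.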
Ensemble learning is based on the idea that combining the output of many weak classifiers can make a strong classifier that outperforms all of its component parts. Ensemble learning is feasible as long as all of the component classifiers are useful (they perform better than random guessing) and not entirely redundant (they sometimes give different answers).
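To see why both conditions matter, here is a small back-of-the-envelope calculation. It makes a strong simplifying assumption that real classifiers rarely satisfy: the task is binary, every component classifier is correct with the same probability p, and their errors are completely independent. The function name majority_vote_accuracy is made up for this illustration.

```python
from math import comb

def majority_vote_accuracy(n, p):
    """Probability that a strict majority of n independent voters is correct,
    when each voter is correct with probability p (use odd n to avoid ties)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# Each classifier is only slightly better than random guessing (p = 0.6),
# yet the vote improves as more independent classifiers are added.
for n in (1, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

If p is exactly 0.5 the vote stays at 0.5 no matter how large n is, and if the classifiers always agree (fully redundant) the vote is no better than any single one of them.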
Two common methods for ensemble learning are boosting and bagging. Boosting works by training many weak classifiers (such as shallow decision trees) one after another, with each new classifier focusing on the points that the earlier ones got wrong (in AdaBoost, for example, by increasing the weights of misclassified points), and then taking a weighted vote or weighted average over their outputs in order to classify a new point. Bagging works by training many highly specific classifiers (such as deep decision trees) on random samples of the data drawn with replacement (bootstrap samples), and again taking a vote or average over their output labels.
You are expected to use at least one ensemble method in your project. The ensemble method may or may not be the best learning algorithm for your task, but your writeup should report the results of testing the ensemble against its component algorithms. Several variations on boosting and bagging are implemented by scikit-learn. Documentation can be found in the ensemble methods section.
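One way such a comparison might look is sketched below. It assumes scikit-learn's AdaBoostClassifier and BaggingClassifier with decision trees as the component classifiers, and uses the bundled breast cancer data set purely as a stand-in; the tree depths, 50 estimators, and 5 folds are arbitrary illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): replace with your own features and labels.
X, y = load_breast_cancer(return_X_y=True)

# Component classifiers: a depth-1 "stump" (weak) and an unpruned tree (highly specific).
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
deep_tree = DecisionTreeClassifier(max_depth=None, random_state=0)

# The component estimator is passed positionally because its keyword name
# differs between scikit-learn versions.
models = {
    "stump": stump,
    "boosted stumps": AdaBoostClassifier(stump, n_estimators=50, random_state=0),
    "deep tree": deep_tree,
    "bagged deep trees": BaggingClassifier(deep_tree, n_estimators=50, random_state=0),
}

# Evaluate each ensemble and its component with the same cross-validation setup.
results = {}
for name, model in models.items():
    results[name] = cross_val_score(model, X, y, cv=5).mean()
    print("%-18s mean accuracy: %.3f" % (name, results[name]))
```

Reporting the component and the ensemble side by side, under the same folds and metric, is the kind of evidence the writeup should contain, whichever result comes out ahead on your data.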
Before the deadline, you need to submit the following things through git:
In addition, you must turn in a hard copy of the writeup pdf outside my office.
In the LaTeX file, project.tex, you will describe your project. This file already contains a basic structure that you should follow. Feel free to change the section headings, or to add additional sections. Recall that you use pdflatex to convert the LaTeX into a pdf file. Here is a template for your paper.
As your project develops and you create more files, be sure to use git to add, commit, and push them. Run: git status to check that all of the necessary files are being tracked in your git repo. Don't forget to update the README so that I can test your code!