Your project must be novel work related to the field of machine learning -- it should go
beyond what we have covered in the course in terms of assignments or core lecture materials. It must also be
considerably different from previous work you have done in other courses or for research. If you want to build
on an existing framework (yours or someone else's), you must be explicit about your unique contributions in your
proposal. You can and should work with Professor Soni throughout the
process of brainstorming and designing a proposal.
Within these bounds, there is a great deal of flexibility in the style of project you can consider. All
papers are expected to have the following components:
- Algorithm and Methods - describe the algorithms (e.g., in pseudocode), provide background
on existing work, and analyze your approach and competing approaches (e.g., what is the inductive bias? What are
empirical or theoretical properties of the algorithm that are relevant?).
- Experiments and Analysis - run your chosen algorithm(s) on data and analyze the results using tools from
the class (e.g., learning curves, PR or ROC curves, paired t-tests); a short sketch of two of these tools follows this list.
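For concreteness, here is a minimal sketch of two of the analysis tools above (a learning curve and a paired t-test), assuming scikit-learn and SciPy are available; the data set and the two learners are stand-ins for your own choices:

    import numpy as np
    from scipy import stats
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score, learning_curve
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # Learning curve: validation accuracy as a function of training-set size.
    sizes, train_scores, val_scores = learning_curve(
        LogisticRegression(max_iter=5000), X, y, cv=5,
        train_sizes=np.linspace(0.1, 1.0, 5))
    print("validation accuracy by size:", val_scores.mean(axis=1))

    # Paired t-test: compare two learners on the same cross-validation folds.
    a = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=10)
    b = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
    t, p = stats.ttest_rel(a, b)
    print("t = %.2f, p = %.3f" % (t, p))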
Your project could go deep on algorithmic analysis (e.g., considering learning theory or developing a novel framework)
or could focus on experimentation (e.g., doing a meta-review of a class of algorithms, or attacking a real-world
problem that requires a good deal of effort in gathering, processing, and analyzing the data). There is a continuum of projects
along these two dimensions, but here are some tangible examples:
- Implement one or two algorithms. Analyze issues that arise in implementation and the choices you have to make
(including the effect of these choices, e.g., on the hypothesis space). Pick a couple of data sets to evaluate your approach(es).
- Use an existing implementation to study a broader area (see below). Pick 3-5 algorithms and test them on ~5 data sets.
Analyze the results and what they tell you about the approaches.
- Attack a real-world dataset (not in a standard ML repository). It could be a challenge you find online or from your own interests. Develop an entire ML pipeline and discuss the series of algorithms needed for, e.g., preprocessing, standardizing, training, and visualizing results; a pipeline sketch follows this list.
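As a sketch of what such a pipeline might look like (assuming scikit-learn and pandas; the file name and column names are hypothetical placeholders for your own data):

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    df = pd.read_csv("my_data.csv")               # hypothetical data file
    X, y = df.drop(columns="label"), df["label"]  # hypothetical label column

    # Standardize numeric columns and one-hot encode categorical ones,
    # then train a classifier, all inside one pipeline object.
    pre = ColumnTransformer([
        ("num", StandardScaler(), X.select_dtypes("number").columns),
        ("cat", OneHotEncoder(handle_unknown="ignore"),
         X.select_dtypes("object").columns),
    ])
    pipe = Pipeline([("pre", pre), ("clf", RandomForestClassifier(random_state=0))])

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    pipe.fit(X_tr, y_tr)
    print("held-out accuracy:", pipe.score(X_te, y_te))

Keeping preprocessing inside the pipeline means it is refit on each training fold during cross-validation, which avoids leaking test-set statistics into training.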
Some topic suggestions include:
- Transfer learning or domain adaptation - algorithms that take what is learned on one task and apply it to another.
- Semi-supervised learning - utilize large, unlabeled data sets to improve supervised learning (a self-training sketch appears after this list).
- Multiple-instance learning - the training examples are sets of items with a collective label.
- Multi-label learning - there are many properties we'd like to predict at once.
- Multi-view learning - constructing models from different modalities (or types of data, e.g., images and text) and
then synthesizing them into one model. For example, use captions and images to identify objects in an image.
- Active learning - human-in-the-loop interaction with learners.
- Knowledge-based learning - incorporating human advice/expertise into a model to improve learning. This has been applied to
neural networks and support vector machines, among others.
- Privileged learning - some features are only available at training time. Instead of throwing them out, can we use
them to help train a model on the remaining features?
- If you took the course due to an interest in deep learning, you now have the tools to attack that topic in interesting ways.
There are data sets (especially speech, text, and images) that can be realistically mined with neural networks.
Additionally, the core topics in deep learning require an analysis of new algorithms
(neural network advances are almost exclusively related to the need for regularization, since deep networks are prone to
overfitting). You could consider the topic of pretraining networks with unsupervised data, or transfer learning,
where weights from one task are reused for a different task (e.g., classify cats vs. dogs and then apply the network to brain images); a transfer-learning sketch appears after this list.
- Dimensionality reduction or feature selection - how to eliminate features, either for visualization or to prevent overfitting.
- Regression - predicting real-valued outputs.
- Rank prediction - predict the relative ordering of items rather than a fixed category.
- Time-series or relational prediction - remove the i.i.d. assumption and predict items that are related to one another.
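To make the semi-supervised suggestion concrete, here is a minimal self-training sketch, assuming scikit-learn >= 0.24 (unlabeled points are marked with -1, per the library's convention):

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.semi_supervised import SelfTrainingClassifier
    from sklearn.svm import SVC

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Hide 90% of the training labels to simulate a mostly unlabeled data set.
    rng = np.random.RandomState(0)
    y_partial = y_train.copy()
    y_partial[rng.rand(len(y_partial)) < 0.9] = -1

    base = SVC(probability=True, gamma="scale")  # base learner must emit probabilities
    clf = SelfTrainingClassifier(base, threshold=0.9).fit(X_train, y_partial)
    print("test accuracy:", clf.score(X_test, y_test))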
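And here is a minimal transfer-learning sketch in Keras, assuming TensorFlow is installed; the input size, backbone, and (omitted) data sets are stand-ins for your own task:

    from tensorflow import keras

    # Load an ImageNet-pretrained backbone without its classification head.
    base = keras.applications.MobileNetV2(
        input_shape=(160, 160, 3), include_top=False, weights="imagenet")
    base.trainable = False  # freeze pretrained weights; train only the new head

    model = keras.Sequential([
        base,
        keras.layers.GlobalAveragePooling2D(),
        keras.layers.Dropout(0.2),                    # light regularization for the new head
        keras.layers.Dense(1, activation="sigmoid"),  # e.g., cats vs. dogs
    ])
    model.compile(optimizer=keras.optimizers.Adam(1e-4),
                  loss="binary_crossentropy", metrics=["accuracy"])
    # model.fit(train_ds, validation_data=val_ds, epochs=5)  # hypothetical data sets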
The key piece of advice I have is to make sure the goal is tangible. For example, many previous projects on deep learning have failed because it is notoriously difficult to debug and to manage large data sets (e.g., 3D medical images). Look for papers related to your idea
to ensure it is feasible. If it is difficult to find related work, treat it as a warning sign: usually it means the idea isn't a suitable machine learning task, but it could also mean that you need to learn a bit more about the idea first.
You may use existing libraries and software if you want to spend less time on implementation and more on experiments. I recommend creating virtual environments for TensorFlow and Keras (you may be able to reuse the CS81Envs as well). scikit-learn and Weka are already on our systems.
Resources
Look for research papers in top machine learning venues (NIPS, Journal of ML Research (JMLR), Int. Conference on ML (ICML), ICMLA (Applications), Euro. Conf. on ML (ECML), Int. Conf. on Data Mining (ICDM), AAAI). NIPS has all papers available online, as does JMLR. The library should be able to get you access to the other materials, although most authors have PDFs of their papers on their websites.
For data sets: