Your project must include work related to the field of machine learning — it should relate to what we have covered in the course in terms of assignments and core lecture materials, but it should go beyond the basic course content in some way. It should also be considerably different from previous work you have done in other courses or for research. If you want to build on an existing framework (yours or someone else's), you must be explicit about your unique contributions in your proposal. You can and should work with the instructors throughout the process of brainstorming and designing a proposal.

Project Scope

The first key piece of advice is to make sure the goal is feasible and manageable in the time you have for the semester. For example, projects usually fail because there is no good data set, because the approach is a poor match for the problem, or because the debugging/implementation process requires more than 3 weeks of work. Look for papers related to your idea to ensure it is tractable. If it is difficult to find related work, treat it as a warning sign: usually it means the idea isn't a suitable machine learning task, but it could also mean that you need to learn a bit more about the idea first.

The second key piece of advice is that it’s extremely difficult to guess how long things will take when you don’t have much practice doing them, so you’re likely to make mistakes when predicting what’s 'feasible in the time available.' Therefore, you should ensure that your plan has many different 'endpoints' that are staged in sequence, with the easiest ones being things you are confident you can complete within a few days of starting your project, and the hardest being 'stretch-goals' that you’d like to complete, but think you probably won’t have time for. This way, however badly you over- (or under-) estimate the time it will take to do each part of your project, when the end of the semester rolls around you’ll have something to write a report and create a presentation about.

You are welcome (and encouraged) to use existing libraries and software if you want to spend less time on algorithm implementation and more on experiments and analysis. We can help with this if the libraries are not already on our systems.

If you are using libraries that make use of GPU resources, let us know and we can guide you to the CS machines that have good GPUs; for GPU-enabled libraries, this can lead to several orders of magnitude of speed-up. Deep learning is especially likely to require a strong GPU, but a number of other algorithms can also leverage this resource if you use the right libraries.
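Before queueing long experiments, it is worth confirming that your library actually sees a GPU. A minimal sketch, assuming PyTorch as the GPU-enabled library (other frameworks have analogous checks):

```python
# Check whether a CUDA GPU is visible before launching long experiments
# (assumes PyTorch; other frameworks expose similar queries).
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Training will run on: {device}")

# Tensors (and models) must be moved to the selected device explicitly.
x = torch.randn(8, 3)
x = x.to(device)
```

If this prints `cpu` on a machine that should have a GPU, ask us — the library may need a CUDA-enabled build.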

Guidance

Within these bounds, there is a great deal of flexibility in the style of project you can consider. All papers are expected to have the following components:

  • Introduction - describe the problem, why it’s interesting, and who might be impacted by it. Provide background on existing work as appropriate.

  • Algorithm and Methods - describe the algorithms (e.g., in pseudocode), and analyze your approach and competing approaches (e.g., what is the inductive bias? What are empirical or theoretical properties of the algorithm that are relevant?).

  • Experiments and Analysis - run your chosen algorithm(s) on data and analyze the results using tools from the class (e.g., cross-validation, learning curves, PR or ROC curves, paired t-tests).

  • Ethics and Implications - describe the ethical implications and real-world impacts of your research (e.g., ways your algorithm(s) might impact fairness). Fairness in machine learning is a subfield where mathematical frameworks incorporate some (mathematical) notion of fairness to prevent real-world bias from appearing in a learned model.
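To make the evaluation tools mentioned above concrete, here is a minimal sketch of cross-validation plus a paired t-test, using scikit-learn and SciPy on a built-in data set; the classifiers and data set are illustrative choices, not requirements:

```python
# Compare two classifiers with 10-fold cross-validation, then use a
# paired t-test over the per-fold scores to ask whether the difference
# in mean accuracy is statistically meaningful.
from scipy.stats import ttest_rel
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

scores_lr = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=10)
scores_dt = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)

# ttest_rel pairs the scores fold-by-fold, which is appropriate because
# both models were evaluated on the same splits.
t_stat, p_value = ttest_rel(scores_lr, scores_dt)
print(f"LR mean={scores_lr.mean():.3f}, "
      f"DT mean={scores_dt.mean():.3f}, p={p_value:.3f}")
```

Learning curves and PR/ROC curves follow the same pattern with `sklearn.model_selection.learning_curve` and `sklearn.metrics`.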

Note that while all projects must address each of these components to some extent, your work does not have to be perfectly balanced among them. For instance, your project could go very deep on algorithmic analysis (e.g., considering learning theory, or a novel framework) or could focus on experimentation (e.g., doing a meta-review of a class of algorithms, or addressing a real-world problem that requires a good deal of effort in gathering, processing, and analyzing the data). There is a continuum of acceptable projects along these dimensions, but here are some examples:

  • Use an existing implementation to study a broader area (see below). Pick 3-5 algorithms and test them on ~5 data sets. Analyze the results and what they tell you about the approaches.

  • Attack a real-world dataset (not one in a standard ML repository). It could be a challenge you find online or one drawn from your own interests. Develop an entire ML pipeline and discuss the series of algorithms needed for, e.g., preprocessing, standardizing, training, and visualizing results.

  • Implement one or two algorithms. Analyze issues that are present in implementation and choices you have to make (including the effect of these choices e.g., on the hypothesis space). Pick a couple of data sets to evaluate your approach(es).

  • Evaluate the algorithmic bias in a system; pick a combination of algorithms and data sets and analyze the different types of bias that are present in the results. Suggest ways these biases might be minimized.
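As a sketch of the first style above (several existing algorithms on several data sets), the scaffolding can be quite small; the specific estimators and built-in data sets here are illustrative stand-ins for the 3-5 algorithms and ~5 data sets you would actually pick:

```python
# Skeleton for a comparative study: every algorithm is evaluated on
# every data set with the same 5-fold cross-validation protocol, so the
# resulting table of mean accuracies is directly comparable.
from sklearn.datasets import load_iris, load_wine
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

datasets = {"iris": load_iris(return_X_y=True),
            "wine": load_wine(return_X_y=True)}
algorithms = {"kNN": KNeighborsClassifier(),
              "SVM": SVC(),
              "NB": GaussianNB()}

results = {}
for dname, (X, y) in datasets.items():
    for aname, clf in algorithms.items():
        results[(dname, aname)] = cross_val_score(clf, X, y, cv=5).mean()
        print(f"{dname:5s} {aname:4s} {results[(dname, aname)]:.3f}")
```

The analysis — why one approach wins on one data set and loses on another — is where the project's substance lies, not in this loop.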

Topic Suggestions

Here are some possible project ideas to get you started; note that this list is in no way exhaustive, and you are more than welcome to come up with your own ideas that are not on this list.

  • Sampling bias in real-world data - examine data sets to determine how they fail to accurately represent the world and the impacts this has on models learned from the data.

  • Anomaly detection - work with highly imbalanced data set(s) to find ways to classify novel examples as 'normal' or 'abnormal' based on a training sample that consists almost exclusively of 'normal' samples.

  • Transfer learning or domain adaptation - algorithms that take what is learned on one task and apply it to another.

  • Semi-supervised learning - utilize large, unlabeled data sets to improve supervised learning.

  • Multiple-instance learning - the training examples are bags of items, with one collective label for the entire bag but unknown labels for the instances inside it (typically, a negative bag means all of its instances are negative, while a positive bag means at least one instance is positive, though most may still be negative).

  • Multi-label learning - there are many properties we’d like to predict at once (i.e. an example can be a member of more than one class at a time).

  • Multi-view learning - constructing models from different modalities (or types of data, e.g., images and text) and then synthesizing them into one model. For example, use captions and images to identify objects in an image.

  • Active learning - human-in-the-loop interaction with learners.

  • Knowledge-based learning - incorporating human advice/expertise into a model to improve learning. This has been applied to neural networks and support vector machines, among others.

  • Privileged learning - some features are only available at training time. Instead of throwing them out, can we use them during training to improve a model that, at test time, relies only on the remaining features?

  • If you took the course due to interest in deep learning, you now have the tools to attack that topic in interesting ways. There are data sets (especially speech, text, and images) that can realistically be mined with neural networks. Additionally, the core topics in deep learning require an analysis of new algorithms (neural network advances are almost exclusively related to the need for regularization, since deep networks are prone to overfitting). You could also consider pretraining networks on unsupervised data, or transfer learning, where weights from one task are reused for a different task (e.g., classify cats vs. dogs and then apply the network to brain images).

  • Dimensionality reduction or feature selection - how to eliminate features, either for visualization or to prevent overfitting.

  • Regression - predicting real-valued outputs.

  • Rank prediction - predict the relative ordering of items rather than a fixed category.

  • Time-series or relational prediction - remove the i.i.d. assumption and predict items that are related to one another.
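To make one of the topics above concrete, here is a minimal anomaly-detection sketch: a one-class model is fit on "normal" samples only, then asked to flag held-out points as normal or abnormal. The data are synthetic and the choice of `IsolationForest` is just one of several reasonable off-the-shelf options:

```python
# Anomaly detection with training data that is exclusively "normal":
# fit on normal samples, then predict +1 (normal) / -1 (anomaly).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal_train = rng.normal(0.0, 1.0, size=(500, 2))  # normal samples only

test_points = np.vstack([
    rng.normal(0.0, 1.0, size=(5, 2)),        # normal-looking points
    np.array([[8.0, 8.0], [-9.0, 7.0]]),      # clear outliers
])

model = IsolationForest(random_state=0).fit(normal_train)
labels = model.predict(test_points)           # +1 = normal, -1 = anomaly
print(labels)
```

With highly imbalanced real data, evaluation also changes: accuracy is nearly meaningless, which is why the PR curves mentioned earlier matter here.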

Domains to be Cautious About

These are some data domains that you may want to avoid because they're likely to be difficult to get satisfying results with. If you really want to work on a topic like this, make sure you have a plan for a successful project even if prediction turns out to be impractical (e.g., all the available learning algorithms produce very low performance). This can mean shifting your focus to conclusions about the nature of the data (i.e., concentrating on what you can discover that's interesting and worthwhile, rather than on the fact that prediction is difficult).

  • Outcomes of sporting events (football/soccer, basketball, etc.)

  • Outcomes of video games (e-sports, pokemon tournaments, etc.)

  • Values of markets (stocks, indexes, GDP, etc.)

  • Other domains in which humans are directly competing against other humans

Resources

Look for research papers in top machine learning venues (NeurIPS, the Journal of Machine Learning Research (JMLR), the International Conference on Machine Learning (ICML), ICMLA (Applications), the European Conference on Machine Learning (ECML), the IEEE International Conference on Data Mining (ICDM), AAAI). Most of these papers are freely available online; the library should be able to get you access to the rest, although most authors post PDFs of their papers on their websites. For data sets: