CS68: Final Project Proposals

Introduction

A major component of the course is a final project, which will span the last four weeks of the semester. Your project will apply algorithmic approaches to biological (or biomedical) data, and is fairly open-ended in nature. You can choose to build off a module that we have covered in the course, that we will cover in the course, or that is closely related to the course (e.g., protein structure prediction). Below, you will find a list of ideas to help guide you in your search.

The final product will be a 6-page paper in conference format (examples and rubrics will be provided). Furthermore, a portion of your grade will involve presenting intermediate findings at weekly checkpoints. Your project must be novel work. If you choose to build on projects you are (or have been) involved with outside of class, please specify in your proposal what will be done uniquely for this project.

The project (and proposal) will require you to pose a scientific question (or central hypothesis) which is related to the course in some way, but goes beyond material covered in the classroom. You will then develop a robust methodology to properly address your question and analyze the results of experiments to arrive at a conclusion to your original question. You have seen a few models of how this works in bioinformatics through reading research papers on algorithmic approaches that built on class concepts. Since your grade largely depends on your paper, keep in mind that the paper is judged mostly on its analysis and on how well it communicates your methodology, not on the amount of code you write. In fact, I will only spot-check your implementations. This is why every lab assignment to this point included analysis as one of the objectives.

By Friday, March 31 before midnight, you must hand in a project proposal of at least one page detailing your aims. The specific questions I want you to address are listed below. Your proposal will be graded on how thorough it is; the formatting is not as important as the contents. If you are not sure how to address one of the points, please make time to meet with me before Friday night. In lab on Monday, April 3, I will talk with each group about their proposal and provide guidance.

Required Elements in Proposal

At a minimum, you should answer the following questions in your proposal submission:

Who are the participating members? Groups should be of size 2.
What is the central hypothesis? Or: what is the goal of your project?
Methods: What algorithm(s) will you be utilizing to answer the question? How do you plan on implementing these algorithms (e.g., using Python packages or downloading publicly available software)? Be specific about your sources. (Note: I advise that you avoid implementing from scratch. Use packages/libraries if available.)
What data sets do you plan to use? Be sure that you can access the data and it is applicable to the problem you state. Cite related work if available.
Evaluation: Describe how you will validate your hypothesis. What experiments will you run? What will these experiments measure? What statistical tools will you use to make comparisons?
References: Cite at least two references relevant to your project.
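
As one illustration of the kind of statistical comparison the Evaluation item asks about, here is a minimal sketch of a paired permutation (sign-flip) test using only the Python standard library. It assumes you have scored two methods on the same cross-validation folds. The function name `paired_permutation_test` and the per-fold accuracies are hypothetical placeholders, not course-provided code or real results, and other tests may suit your data better.

```python
import random
from statistics import mean


def paired_permutation_test(scores_a, scores_b, n_permutations=10000, seed=0):
    """Two-sided paired permutation (sign-flip) test on per-fold differences.

    Returns the observed mean difference and an estimated p-value.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must be paired (same length)")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = mean(diffs)
    rng = random.Random(seed)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis, the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        # Small tolerance guards against floating-point ties.
        if abs(mean(flipped)) >= abs(observed) - 1e-12:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate from ever being exactly zero.
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical per-fold accuracies for two methods on the same 10 folds.
method_a = [0.91, 0.89, 0.93, 0.90, 0.92, 0.88, 0.94, 0.90, 0.91, 0.92]
method_b = [0.87, 0.88, 0.90, 0.86, 0.89, 0.87, 0.90, 0.88, 0.86, 0.89]

observed_diff, p_value = paired_permutation_test(method_a, method_b)
print(f"mean difference = {observed_diff:.3f}, p = {p_value:.4f}")
```

The test is paired because both methods are scored on identical folds, so fold-to-fold difficulty cancels out. Fold scores from cross-validation are not fully independent, so treat the p-value as a rough guide and say so in your proposal.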

Project Requirements

You should consider the following general expectations when exploring your options:

There must be a relevant algorithmic component in your work. It does not have to be an algorithm we covered in class (and it probably shouldn't be; do something new!).
There must be some form of experimental validation of your work.
It should be an interesting biological/medical application. Exceptions could be made if the algorithmic component is very strong/demanding. Please see me first.
Your project should balance the depth of experimental analysis and depth of algorithmic development. A lighter algorithmic approach (e.g., take an off-the-shelf algorithm and tweak it) should be paired with strong experimental analysis and thorough explanation of the methods evaluated. We can discuss this balance as the project progresses. I strongly discourage implementing algorithms from scratch. This was the most common point of failure in previous semesters.
You may not re-use work from a related class, research experience, or project. You may, however, use any prior work as a building block. The amount of work you do for this project cannot double count for some other academic purpose.
You will submit a paper. The deadline is the last day of classes. I will require that you write it in LaTeX (I will provide a starting file), and it will be about 6 conference paper pages (this is not short; it's similar to 12 regular pages in Word).
You must work with a partner. I suggest finding a partner with interests similar to yours, rather than just someone you regularly work with. The key to these final projects is having the motivation to start early rather than procrastinating and turning in an incomplete assignment.

Project ideas

Be sure to calibrate your goals to account for the limited time period. Every project must involve:

A central hypothesis (the aims of your project should be clear)
One or more methods (algorithms) to address your question
Experiments that utilize correct methodology as cited in related work and/or lecture
Analysis

The methods and experiments/analysis are the two major time components. Your project proposal should address where on the spectrum your proposal lies (i.e., will it focus more on methods or on analysis and experiments). There are a few ways of accomplishing this task:

Use a library to implement methods, e.g., statistical or machine learning algorithms. I strongly recommend scikit-learn for machine learning. It provides most common clustering and classification algorithms, as well as model selection tools and evaluation metrics. You can choose to focus on one method and go very far in depth to try out many variations (e.g., SVMs). Or you could choose to compare several types of algorithms (e.g., 3 clustering algorithms) in terms of their inductive bias and experimental results.
Download software that is open access and apply it to real data. Your proposal should cite relevant work to motivate this. For example, if you are interested in multiple sequence alignment, you could download MUSCLE and compare it to an HMM algorithm on real data. Your paper will explain both methods and then apply validation measures (from class and relevant literature) to analyze their strengths and weaknesses.
You can do an implementation-heavy project, but it needs to be very specific and thoroughly defined. I have an old lab that has you implement profile Hidden Markov Models to do multiple sequence alignment using probabilistic graphical models. I can provide data and starting files.
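
To make the library-based option concrete, here is a minimal sketch of comparing three clustering algorithms with scikit-learn. The synthetic blobs from `make_blobs`, the choice of three clusters, and the two metrics are assumptions for illustration only. A real project would substitute its own feature matrix and justify its own choice of methods and measures.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.mixture import GaussianMixture

# Synthetic stand-in data (assumption): replace with your own feature matrix.
X, true_labels = make_blobs(
    n_samples=300,
    centers=[[0, 0], [6, 6], [-6, 6]],
    cluster_std=1.0,
    random_state=0,
)

# Three clustering methods with different inductive biases.
models = {
    "k-means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=3),
    "gaussian mixture": GaussianMixture(n_components=3, random_state=0),
}

results = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    results[name] = {
        # Internal measure: needs no ground-truth labels.
        "silhouette": silhouette_score(X, labels),
        # External measure: only usable when reference labels exist.
        "ari": adjusted_rand_score(true_labels, labels),
    }
    print(f"{name}: silhouette={results[name]['silhouette']:.3f}, "
          f"ARI={results[name]['ari']:.3f}")
```

The silhouette score applies even when no reference labels exist, while the adjusted Rand index requires them. Biological data sets often lack ground-truth clusters, so plan your evaluation around whichever measures your data can support.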

Some problems include:

Genome assembly (taking many small, overlapping fragments and reconstructing the original genome). You can write a review of popular approaches, challenges, and state of the art. Then, pick 2 or 3 data sets and ~3 methods and test them out.
Phylogenetic tree inference: explore other types of methods, including the maximum likelihood and Bayesian approaches that have recently become popular. Try several methods out on some available data.
Machine learning methodology: focus on a specific problem in machine learning (e.g., overfitting, cost-sensitive prediction, regression, model selection) and describe several approaches with experimental analysis.
Machine learning algorithms: similar to the above, but focus on a category of algorithmic approaches (e.g., ensembles of trees) and describe/evaluate several methods (e.g., random forests, AdaBoost, etc.).
Secondary structure from a protein sequence. For example, use a conditional random field library and/or HMM to predict secondary structure. This is a well-studied problem with related work and plenty of available data.
Multiple sequence alignment: implement a profile HMM to develop a probabilistic MSA (see link above). Or compare several types of algorithms (e.g., iterative algorithms like PSI-BLAST vs. HMM methods vs. tree-based methods).
Find regulatory regions, genes, promoter sites, binding sites, etc., from a DNA sequence. GENSCAN uses HMMs to do a lot of this at once. ChromHMM finds chromatin states using HMMs.
RNA structure prediction overview. Utilize a stochastic context-free grammar to solve RNA structure prediction.
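
For the machine learning algorithms idea in the list above, a first experiment might look like the following sketch, which compares two tree ensembles under the same cross-validation folds with scikit-learn. The bundled breast cancer data set, the 5-fold split, and the hyperparameters are assumptions chosen only so that the example runs without downloads. They are not a recommended experimental design.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Small biomedical data set bundled with scikit-learn (stand-in for your data).
X, y = load_breast_cancer(return_X_y=True)

# A fixed, shuffled split so that both models see identical folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

ensembles = {
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=0),
}

fold_scores = {}
for name, model in ensembles.items():
    # One accuracy value per fold; keep them all for later paired comparisons.
    fold_scores[name] = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: mean accuracy = {fold_scores[name].mean():.3f} "
          f"(std = {fold_scores[name].std():.3f})")
```

Keeping the per-fold scores, rather than only the means, lets you run a paired statistical comparison between the methods afterwards.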

If you are more interested in exploring general algorithms, you can use the following list to explore various categories of techniques and then search for biology problems that utilize them:

Probabilistic graphical models and sequence models (e.g., Bayesian networks, HMMs, conditional random fields, Markov random fields)
Supervised learning algorithms (e.g., neural networks, support vector machines, deep learning)
Unsupervised learning algorithms
Vision/image analysis
Memory efficient models (e.g., memory efficient decision trees and k-nearest neighbors)
Tree-learning algorithms
Semi-supervised learning
Search

Some tips:

For relevant journals, consider the journal Bioinformatics. The top conferences in the field are ISMB, RECOMB and ACM-BCB. If you Google the conference name, you will find a list of all the papers presented in each year of the conference. If you are interested in machine learning algorithms, look into AAAI, ICML, NIPS, JMLR and many more. In some years, there are special tracks for computational biology (e.g., NIPS often has a workshop on bioinformatics and machine learning). If you need help finding the relevant articles, use Google Scholar or look at the researcher's webpage. Most CS journals are open access. Others are behind a paywall that the library can help you with.
Take a shot at one of the DREAM Challenges or one of the many other biomedical challenges, especially those run at MICCAI.
The most common reasons for stress-filled projects last time were procrastination, inability to get the data (do this within the first week!), and trying to do too much coding and getting stuck with bugs.
I find all of these problems interesting. However, I will be approaching some of these problems with the same level of experience as you. If you want to pick my brain about probabilistic models, protein structure prediction, or supervised machine learning, I will have more specific knowledge of directions. If you pick something else, I would be more than happy to help look at the literature with you.
