Introduction
A major component of the course is a final project, which will span
the last four weeks of the semester. Your project will apply algorithmic approaches
to biological (or biomedical) data and is fairly open-ended in nature. You can choose to build on a module we have covered in
the course, one we will cover, or a topic closely related to the course (e.g., protein structure prediction). Below, you will find a list of
ideas to help guide you in your search.
The final product will be a 6-page paper in conference format (examples and rubrics will be provided). Furthermore, a portion of your grade will be based on presenting intermediate findings at weekly checkpoints. Your project must be novel work: if you choose
to build off of projects you are (or have been) involved with outside of class,
please specify in your proposal what will be done uniquely for this project.
The project (and proposal) will require you to pose
a scientific question (or central hypothesis) which is related to the course
in some way, but goes beyond material covered in the classroom. You will then
develop a robust methodology to properly address your question and analyze the results of experiments to arrive at a conclusion to your original question.
You have seen
a few models of how this works in bioinformatics through reading research
papers on algorithmic approaches that built on class concepts. Since your
grade largely depends on your paper, you should keep in mind that your final
paper should consist mostly of analysis and communication of a methodology; it is not judged on the amount of code you write. In fact, I will only spot-check your implementations.
This is why every lab assignment to this point included analysis
as one of the objectives.
By
Friday, March 31 before midnight, you must hand in a project
proposal of at least one page detailing your aims. There are specific questions I want you to
address below. Your proposal will be graded on how thorough it is - the formatting is not as important as the contents. If you are not sure how to address one of the points,
please make time to meet with me before Friday night.
In lab on Monday, April 3, I will
talk with each group about their proposals and provide guidance.
Required Elements in Proposal
At a minimum, you should answer the following questions in your proposal
submission:
- Who are the participating members? Groups should be of size 2.
- What is the central hypothesis? Or: what is the goal of your project?
- Methods: What algorithm(s) will you be using to
answer the question? How do you plan on implementing these algorithms (e.g.,
using Python packages; downloading publicly available software)? Be specific about your sources. (Note: I advise that you avoid implementing from scratch. Use packages/libraries if available.)
- What data sets do you plan to use? Be sure that you can access the data and that it is applicable to the problem you state. Cite related work
if available.
- Evaluation: Describe how you will validate your hypothesis. What
experiments will you run? What will these experiments
measure? What statistical tools
will you use to make comparisons?
- Cite at least 2 references relevant to your project.
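For the evaluation question above, one simple statistical tool for comparing two methods that were run on the same data sets or folds is a paired permutation (sign-flip) test. The following is only a minimal sketch: the function name, its parameters, and the accuracy values are hypothetical and chosen for illustration, not taken from any real experiment.

```python
import random


def paired_permutation_test(scores_a, scores_b, n_permutations=10000, seed=0):
    """Two-sided paired permutation test on the mean difference.

    scores_a and scores_b are paired measurements (e.g., accuracy of
    method A and method B on the same cross-validation folds).
    Returns a Monte Carlo estimate of the p-value.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired (same length)")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical per-fold accuracies, made up for illustration only.
method_a = [0.81, 0.79, 0.84, 0.80, 0.83]
method_b = [0.78, 0.77, 0.80, 0.79, 0.80]
p_value = paired_permutation_test(method_a, method_b)
print(f"estimated p-value: {p_value:.3f}")
```

With only five pairs the smallest achievable two-sided p-value is 2/32, so a proposal that plans this kind of test should also plan for enough folds or data sets to make the comparison meaningful.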
Project Requirements
You should
consider the following general expectations when exploring your options:
- There must be a relevant algorithmic component in your work. It does not
have to be an algorithm we covered in class (and probably shouldn't be -
do something new)!
- There must be some form of experimental validation of your work.
- It should be an interesting biological/medical application.
Exceptions could be made if the algorithmic component is very strong/demanding. Please see me first.
- Your project should balance the depth of experimental analysis and
depth of algorithmic development. A lighter algorithmic approach (e.g., take an off-the-shelf algorithm
and tweak it) should be paired with strong experimental analysis and thorough
explanation of the methods evaluated. We can
discuss this balance as the project progresses. I strongly discourage
implementing algorithms from scratch. This was the most common
point of failure in previous semesters.
- You may not re-use work from a related class, research experience, or
project. You may, however, use any prior work as a building block. The work you do for this project
cannot be double-counted for some other academic purpose.
- You will submit a paper. The deadline is the last day of classes.
I will require that you write it in LaTeX (I will provide a starting file),
and it will be about 6 conference paper pages (this is not short;
it is similar to 12 regular pages in Word).
- You must work with a partner. I suggest finding a partner with
interests similar to yours, rather than just someone you regularly work with.
The key to these final projects is having the motivation to start early
rather than procrastinating and turning in an incomplete assignment.
Project Ideas
Be sure to calibrate your goals to account for the limited time period. Every project must involve:
- A central hypothesis (the aims of your project should be clear)
- One or more methods (algorithms) to address your question
- Experiments that utilize correct methodology as cited in related work
and/or lecture
- Analysis
The methods and experiments/analysis are the two major time components. Your
project proposal should address where on the spectrum your proposal lies (i.e., will it focus more on methods or on analysis and experiments). There are a few ways of accomplishing this task:
- Use a library to implement methods (e.g., statistical or machine learning algorithms).
I strongly recommend scikit-learn for
machine learning. It provides most common clustering and classification algorithms, as well as model selection
tools and evaluation metrics. You can choose to focus on one method
and go very far in depth to try out many variations (e.g., SVMs). Or you could choose
to compare several types of algorithms (e.g., 3 clustering algorithms) in
terms of their inductive bias and experimental results.
- Download software that is open access and apply it to
real data. Your proposal should cite relevant work to motivate this.
For example, if you are interested in multiple sequence alignment, you could
download MUSCLE and compare it to an HMM algorithm on real data. Your paper will explain both methods and then apply
validation measures (from class and relevant literature) to analyze their
plusses and minuses.
- You can do an implementation-heavy project, but it needs to be
very specific and thoroughly defined.
I have an old lab that has you implement profile hidden
Markov models to do multiple sequence alignment using probabilistic
graphical models. I can provide data and starting files.
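As a rough illustration of the library-based option above (comparing several clustering algorithms with scikit-learn), here is a minimal sketch. The synthetic data, the three algorithms chosen, and every parameter value are assumptions made for illustration; a real project would substitute biological data and justify each choice from related work.

```python
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in data (an assumption): three well-separated groups in 2D.
X, y_true = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Three clustering algorithms with different inductive biases.
models = {
    "k-means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "agglomerative (Ward)": AgglomerativeClustering(n_clusters=3),
    "DBSCAN": DBSCAN(eps=1.5, min_samples=5),
}

# Score each clustering against the known group labels with the
# adjusted Rand index (1.0 means perfect agreement).
scores = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    scores[name] = adjusted_rand_score(y_true, labels)

for name, ari in scores.items():
    print(f"{name}: ARI = {ari:.3f}")
```

The adjusted Rand index needs ground-truth labels; when none exist, an internal measure such as the silhouette score is one alternative to consider.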
Some problems include:
- Genome assembly (taking many, small overlapping fragments and reconstructing
the original genome). You can write a review of popular approaches, challenges,
and state of the art. Then, pick 2 or 3 data sets and ~3 methods and test them out.
- Phylogenetic tree inference: explore other types of methods, including
maximum likelihood and Bayesian approaches that have recently become popular. Try
several methods out on some available data.
- Machine learning methodology: focus on a specific problem in machine learning
(e.g., overfitting, cost-sensitive prediction, regression, model selection) and
describe several approaches with experimental analysis.
- Machine learning algorithms: similar to the above, but focus on a category
of algorithmic approaches (e.g., ensembles of trees) and describe/evaluate
several methods (e.g., random forests, AdaBoost, etc.).
- Secondary structure prediction from a protein sequence. For example,
use a conditional random field library and/or HMM to predict secondary structure.
This is a well-studied problem with related work and plenty of available data.
- Multiple sequence alignment: implement a profile HMM to develop a
probabilistic MSA (see link above). Or compare several types of algorithms
(e.g., iterative algorithms like PSI-BLAST vs. HMM methods vs. tree-based methods).
- Find regulatory regions, or genes, or promoter sites, or binding sites,
etc., from a DNA sequence. GENSCAN uses HMMs to do a lot of this at
once. ChromHMM finds chromatin states using HMMs.
- RNA structure prediction: use a stochastic context-free grammar to predict
RNA structure.
If you are more interested in exploring general algorithms, you can use the following list to
explore various categories of techniques and then search for biology problems
that utilize them:
- Probabilistic graphical models and sequence models (e.g., Bayesian networks, HMMs, conditional random fields, Markov random fields)
- Supervised learning algorithms (e.g., neural networks, support vector machines, deep learning)
- Unsupervised learning algorithms
- Vision/image analysis
- Memory efficient models (e.g., memory efficient decision trees and
k-nearest neighbors)
- Tree-learning algorithms
- Semi-supervised learning
- Search
Some tips:
- For relevant journals, consider the journal
Bioinformatics. The top conferences in the field are ISMB, RECOMB
and ACM-BCB. If you Google the conference name, you will find
a list of all the papers presented at each year of the conference. If
you are interested in machine learning algorithms, look into AAAI, ICML,
NIPS, JMLR and many more. In some years, there are special tracks
for computational biology (e.g., NIPS often has a workshop on bioinformatics
and machine learning). If you need help finding the relevant articles,
use Google Scholar or look at the researcher's webpage. Most CS journals
are open access. Others are behind a paywall that the library can
help you with.
- Take a shot at one of the DREAM Challenges or one of the many other biomedical challenges,
especially those from MICCAI.
- The most common reasons for stress-filled projects last time were procrastination, inability to get the data (do this within the first week!), and
trying to do too much coding and getting stuck with bugs.
- I find all of these problems interesting. However, I will
be approaching some of these problems with the same level of experience as you. If you want to pick my brain about
probabilistic models, protein structure prediction, or supervised machine
learning, I will have more specific knowledge of possible directions. If you pick
something else, I would be more than happy to help look at the literature with
you.