Introduction
A major component of the course is a final project, which will span
approximately half of the semester. This project is open-ended
in scope, with certain requirements and parameters to guide you. In general,
you will conduct scientific research by identifying a computational problem,
developing a solution,
running experiments to validate your solution, and analyzing results. You have seen
a few models of how this works in bioinformatics through reading research
papers on algorithmic approaches that built on class concepts.
Your project should generally involve applying algorithmic approaches
to biomedical data. You can choose to build on a module we have covered in
the course, one that we will cover, or one of the dozens of other
research areas omitted from the syllabus. Below, you will find a list of
ideas to help guide you in your search.
By Friday, October 31 at noon, you must hand in a one-page project
proposal detailing your aims. Below are specific questions I want you to
address. You do not need to feel compelled to turn in an essay; rather,
you can choose to describe the high-level aims/motivation of the project
in an opening paragraph and then answer the detailed questions
in list form. If you are not sure how to address one of the points,
please make a special note of this in your proposal.
In lab on Monday, November 3, I will
talk with each group about their proposals and provide guidance.
Required Elements in Proposal
At a minimum, you should answer the following questions in your 1-page
submission (you can answer in any order):
- Who are the participating members? Groups should be of size 2.
- What is the central hypothesis? Or: what questions are you trying to answer?
- What is the motivation for your project? That is, what is your
goal in pursuing this problem?
- What algorithm(s) will you be implementing/utilizing to
answer the question?
- What related work exists already? Identify which existing methods, if any,
you will compare against your own.
- What data sets do you plan to use? Be specific - try to do an
initial search of available data sets to make sure you understand what is
available.
- Describe how you will validate your hypothesis: what
experiments will you run? What will these experiments
measure? What statistical tools
will you use to make comparisons?
- Cite at least 2 references relevant to your project.
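As one illustration of the "statistical tools" question above, here is a minimal sketch of a paired permutation (sign-flip) test for comparing two methods that were scored on the same data sets or cross-validation folds. It is only one possible choice, not a required one. The function name `paired_permutation_test` and its parameters are hypothetical, and the sketch assumes that higher scores are better and that the scores are paired one-to-one.

```python
import random
from statistics import mean


def paired_permutation_test(scores_a, scores_b, n_resamples=10000, seed=0):
    """Estimate a two-sided p-value for the mean paired difference.

    Under the null hypothesis that the two methods perform equally well,
    the sign of each paired difference is arbitrary, so we flip signs at
    random and count how often the resampled mean difference is at least
    as extreme as the observed one.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores_a and scores_b must be paired")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    hits = 0
    for _ in range(n_resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value above zero.
    return (hits + 1) / (n_resamples + 1)
```

For example, passing the per-fold accuracies of your method and of a baseline returns an estimated p-value; a small value suggests the difference is unlikely to be due to chance alone.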
Project Requirements
I am flexible in how you approach this task. In general, you should
keep the following expectations in mind:
- There must be a relevant algorithmic component in your work. It does not
have to be an algorithm we covered in class (and probably shouldn't be -
do something new)!
- There must be some form of experimental validation of your work.
- It should be an interesting biological/medical application.
Exceptions could be made if the algorithmic component is very strong/demanding. Please see me first.
- Your project should balance the depth of experimental analysis and
depth of algorithmic development. That is, I do not expect a very difficult
implementation or novel algorithm to also have 10 experiments. On the other
hand, a lighter algorithmic approach (e.g., take an off-the-shelf algorithm
and tweak it) should be paired with strong experimental analysis. We can
discuss this balance as the project progresses.
- You may not re-use work from a related class, research experience, or
project. You may, however, use any prior work as a building block. If you
have a concurrent project that you want to add on to, meet with me to
discuss. However, in general, the amount of work you do for this project
cannot double count for some other academic purpose.
- You will submit a paper. The deadline is TBA, but will either be
the last day of classes or at the exam. The paper length is also TBA, but
I will require that you write it in LaTeX (I will provide a starting file),
and it will probably be about 6-8 conference paper pages (this is not short;
it is similar to 20 regular pages in Word).
- I do not think we have time
to do full presentations, but the last lab period will be used for status
reports. These may be one-on-one or presented to the whole lab section.
- You must work with a partner. I suggest finding a partner whose
interests are similar to yours, rather than just someone you regularly work with.
The key to these final projects is having the motivation to start early
rather than procrastinating and turning in an incomplete assignment.
Project Ideas
Common biological problems include:
- Genome assembly (taking many, small overlapping fragments and reconstructing
the original genome)
- Single nucleotide polymorphisms (SNPs) and their uses (e.g., genome wide association studies (GWAS))
- Systems biology
- Network inference
- Genomics and evolution
- Protein structure; protein function; secondary structure; protein disorder
- Mass-spec and proteomics
- RNA structure
- Protein-protein interactions; protein-DNA interactions
- Next generation sequencing data
- Medical informatics
- Translational bioinformatics
- Computational Immunology
- Image Analysis
- Databases and ontologies
- Biomedical text mining
- Disease models/ontologies
- Drug discovery
- Metabolic networks
If you are more interested in exploring general algorithms, you can use the following list to
explore various techniques:
- Probabilistic graphical models (e.g., Bayesian networks, HMMs, conditional random fields, Markov random fields)
- Supervised learning algorithms (e.g., neural networks, support vector machines, deep learning)
- Unsupervised learning algorithms
- Sequential algorithms
- Vision/image analysis
- Memory efficient models
- Tree-learning algorithms
- Semi-supervised learning
- Search
Here are some more concrete ideas:
- Pick a paper you find interesting, implement the algorithm and validate
your implementation. For a relevant journal, consider Bioinformatics.
The top conferences in the field are ISMB, RECOMB
and ACM-BCB. If you Google the conference name, you will find
a list of all the papers presented at each year of the conference. If
you are interested in machine learning algorithms, look into AAAI, ICML,
NIPS, JMLR and many more. In some years, there are special tracks
for computational biology (e.g., NIPS often has a workshop on bioinformatics
and machine learning). If you need help finding the relevant articles,
use Google Scholar or look at the researcher's webpage. Most CS journals
are freely available. Others are behind a paywall that the library can
help you with.
- Take a shot at the DREAM Challenge to
predict drug sensitivity.
- Implement a solution to genome assembly: given a set of fragments of
DNA, reassemble the entire genome. Next-gen sequencing techniques work
by obtaining many, short reads of DNA. However, the shortness of the reads
makes finding overlaps difficult. There are also errors in the sequencing
process.
- Classification: develop a supervised machine learning algorithm to
make predictions on some data set. Standard classification involves two classes
(true or false). How can we handle multiple classes? Continuous values? Structured outputs?
- Predict secondary structure from a protein sequence. For example,
implement a conditional random field.
- Look into maximum likelihood methods for phylogenetic trees, or some
other new direction in the subfield.
- Multiple sequence alignment: implement a profile HMM to develop a
probabilistic MSA. Or implement an iterative algorithm such as PSI-BLAST.
- Contact map prediction - predict whether two amino acids in a protein
are likely to be in contact in the final 3D structure (this subproblem is
deemed to be an essential precursor to solving protein structure prediction).
For example, implement a deep-learning algorithm to predict contact maps. Or use a graphical model such as a Markov random field
or conditional random field.
- Find regulatory regions, genes, promoter sites, binding sites,
etc., from a DNA sequence. GENSCAN uses HMMs to do a lot of this at
once. ChromHMM finds chromatin states using HMMs.
- Predict protein hot spots - which amino acids can be changed without
dramatically altering the structure. This can be key in designing
new proteins. For example, use support vector machines to predict hot spots.
- Predict surface accessibility or other structural features from
a protein sequence. These properties can aid in predicting the 3D structure.
- Develop an algorithm for predicting human disease from SNP data.
Humans vary in only about 0.1% of the genome. Instead of trying to sequence
the whole genome, we have many data sets of single-nucleotide polymorphisms,
sites of DNA that show considerable variation (this is what 23andMe
does). There are also variations such as miRPs for longer regions.
- Look into RNA-seq tasks. For example, given RNA-seq reads, reconstruct
the full transcripts. Alternatively, cluster the expression of genes.
- Some ideas from Sushmita Roy at Wisconsin
- RNA Structure Prediction overview. Implement a stochastic context-free grammar to solve RNA structure
prediction.
- Learn Module Networks from gene expression data. This is a pretty cool paper.
- Learn cis-regulatory modules using probabilistic models
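To make the genome assembly idea above a little more concrete, here is a minimal sketch of the overlap step followed by a greedy merge. It is only a toy starting point: it assumes error-free reads that all come from the same strand, it does not handle reads contained entirely inside other reads, and greedy merging can give the wrong answer when the genome contains repeats. Practical assemblers typically build overlap or de Bruijn graphs instead. The function names `overlap` and `greedy_assemble` and the `min_length` parameter are hypothetical.

```python
from itertools import permutations


def overlap(a, b, min_length=3):
    """Return the length of the longest suffix of a that is a prefix of b.

    Overlaps shorter than min_length are ignored and reported as 0.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_length - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_length=3):
    """Repeatedly merge the pair of reads with the longest overlap.

    Returns a list of contigs; a single contig means every read was merged.
    """
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_pair = 0, None
        for a, b in permutations(reads, 2):
            k = overlap(a, b, min_length)
            if k > best_len:
                best_len, best_pair = k, (a, b)
        if best_pair is None:
            break  # no overlaps left, so the remaining reads stay separate
        a, b = best_pair
        reads.remove(a)
        reads.remove(b)
        reads.append(a + b[best_len:])  # keep the overlapping part only once
    return reads
```

A project in this area would go well beyond this, for example by tolerating sequencing errors in the overlaps and by scaling past the all-pairs comparison used here.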
As a side note, I find all of these problems interesting. However, I will
be approaching many of them with the same level of experience that you will. That is,
I have not worked in most of these areas. If you want to pick my brain about
probabilistic models, protein structure prediction, or supervised machine
learning, I will have more specific knowledge of directions. If you pick
something else, I would be more than happy to help look at the literature with
you.