CS68: Final Project Proposals

Introduction

A major component of the course is a final project, which will span approximately half of the semester. This project is open-ended in scope, with certain requirements and parameters to guide you. In general, you will conduct scientific research by identifying a computational problem, developing a solution, running experiments to validate your solution, and analyzing the results. You have seen a few models of how this works in bioinformatics by reading research papers on algorithmic approaches that build on class concepts.

Your project should generally involve applying algorithmic approaches to biomedical data. You can choose to build on a module we have covered in the course, one that we will cover, or one of the dozens of other research areas omitted from the syllabus. Below, you will find a list of ideas to help guide your search.

By Friday, October 31 at noon, you must hand in a one-page project proposal detailing your aims. There are specific questions below that I want you to address. You do not need to turn in an essay; rather, you can describe the high-level aims and motivation of the project in an opening paragraph and then answer the detailed questions in list form. If you are not sure how to address one of the points, please make a special note of this in your proposal. In lab on Monday, November 3, I will talk with each group about their proposal and provide guidance.

Required Elements in Proposal

At a minimum, you should answer the following questions in your 1-page submission (you can answer in any order):

Who are the participating members? Groups should be of size 2.
What is the central hypothesis? Or: what questions are you trying to answer?
What is the motivation for your project? That is, what is your goal in pursuing this problem?
What algorithm(s) will you be implementing/utilizing to answer the question?
What related work exists already? Identify which, if any, you will be comparing against your method.
What data sets do you plan to use? Be specific - try to do an initial search of available data sets to make sure you understand what is available.
Describe how you will validate your hypothesis: what experiments will you run? What will these experiments measure? What statistical tools will you use to make comparisons?
Cite at least 2 references relevant to your project.

Project Requirements

I am flexible in how you approach this task. In general, you should keep the following expectations in mind:

There must be a relevant algorithmic component in your work. It does not have to be an algorithm we covered in class (and probably shouldn't be - do something new)!
There must be some form of experimental validation of your work.
It should be an interesting biological/medical application. Exceptions could be made if the algorithmic component is very strong/demanding. Please see me first.
Your project should balance the depth of experimental analysis and depth of algorithmic development. That is, I do not expect a very difficult implementation or novel algorithm to also have 10 experiments. On the other hand, a lighter algorithmic approach (e.g., take an off-the-shelf algorithm and tweak it) should be paired with strong experimental analysis. We can discuss this balance as the project progresses.
You may not re-use work from a related class, research experience, or project. You may, however, use any prior work as a building block. If you have a concurrent project that you want to add on to, meet with me to discuss. However, in general, the amount of work you do for this project cannot double count for some other academic purpose.
You will submit a paper. The deadline is TBA, but will be either the last day of classes or at the exam. The paper length is TBA, but I will require that you write it in LaTeX (I will provide a starting file), and it will probably be about 6-8 conference paper pages (this is not short; it's similar to 20 regular pages in Word).
I do not think we have time to do full presentations, but the last lab period will be used to do status reports. This may just be one-on-one, or to the whole lab section.
You must work with a partner. I suggest finding a partner whose interests are similar to yours, rather than just someone you regularly work with. The key to these final projects is having the motivation to start early rather than procrastinating and turning in an incomplete assignment.

Project ideas

Common biological problems include:

Genome assembly (taking many, small overlapping fragments and reconstructing the original genome)
Single nucleotide polymorphisms (SNPs) and their uses (e.g., genome wide association studies (GWAS))
Systems biology
Network inference
Genomics and evolution
Protein structure; protein function; secondary structure; protein disorder
Mass-spec and proteomics
RNA structure
Protein-protein interactions; protein-DNA interactions
Next generation sequencing data
Medical informatics
Translational bioinformatics
Computational Immunology
Image Analysis
Databases and ontologies
Biomedical text mining
Disease models/ontologies
Drug discovery
Metabolic networks

If you are more interested in exploring general algorithms, you can use the following list to explore various techniques:

Probabilistic graphical models (e.g., Bayesian networks, HMMs, conditional random fields, Markov random fields)
Supervised learning algorithms (e.g., neural networks, support vector machines, deep learning)
Unsupervised learning algorithms
Sequential algorithms
Vision/image analysis
Memory efficient models
Tree-learning algorithms
Semi-supervised learning
Search

Here are some more concrete ideas:

Pick a paper you find interesting, implement the algorithm, and validate your implementation. For relevant journals, consider the journal Bioinformatics. The top conferences in the field are ISMB, RECOMB and ACM-BCB. If you Google the conference name, you will find a list of all the papers presented in each year of the conference. If you are interested in machine learning algorithms, look into AAAI, ICML, NIPS, JMLR and many more. In some years, there are special tracks for computational biology (e.g., NIPS often has a workshop on bioinformatics and machine learning). If you need help finding the relevant articles, use Google Scholar or look at the researcher's webpage. Most CS journals are freely available. Others are behind a paywall that the library can help you with.
Take a shot at the DREAM Challenge to predict drug sensitivity
Implement a solution to genome assembly: given a set of fragments of DNA, reassemble the entire genome. Next-gen sequencing techniques work by obtaining many short reads of DNA. However, the shortness of the reads makes finding overlaps difficult. There are also errors in the sequencing process.
Classification: develop a supervised machine learning algorithm to make predictions on some data set. Standard classification has two classes (true or false). How can we handle multiple classes? Continuous values? Structured outputs?
Predict secondary structure from a protein sequence. For example, implement a conditional random field.
Look into maximum likelihood methods for phylogenetic trees. Or some other new direction in the subfield.
Multiple sequence alignment: implement a profile HMM to develop a probabilistic MSA. Or implement an iterative algorithm such as PSI-BLAST.
Contact map prediction - predict whether two amino acids in a protein are likely to be in contact in the final 3D structure (this subproblem is deemed to be an essential precursor to solving protein structure prediction). For example, implement a deep-learning algorithm to predict contact maps. Or use a graphical model such as a Markov random field or conditional random field.
Find regulatory regions, or genes, or promoter sites, or binding sites, etc., from a DNA sequence. GENSCAN uses HMMs to do a lot of this at once. ChromHMM finds chromatin states using HMMs.
Predict protein hot spots - which amino acids can be changed without dramatically altering the structure. This can be key in designing new proteins. For example, use support vector machines to predict hot spots.
Predict surface accessibility or other structural features from a protein sequence. These properties can aid in predicting the 3D structure.
Develop an algorithm for predicting human disease from SNP data. Humans vary in only about 0.1% of the genome. Instead of trying to sequence the whole genome, we have many data sets of Single-Nucleotide Polymorphisms - sites of DNA that show considerable variation (this is what 23andMe does). There are also variations such as miRPs for longer regions.
Look into RNA-seq tasks. For example: given RNA-seq reads, reconstruct the full transcripts. Or: cluster genes by their expression.
Some ideas from Sushmita Roy at Wisconsin
RNA Structure Prediction overview. Implement a stochastic context-free grammar to solve RNA structure prediction.
Learn Module Networks from gene expression data. This is a pretty cool paper.
Learn cis-regulatory modules using probabilistic models
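
To give a feel for the genome assembly idea in the list above, here is a minimal sketch of a greedy overlap-merge approach. It is only an illustration under strong assumptions: the reads are error-free, all come from the same strand, and contain no repeats. The function names (overlap, greedy_assemble) and the min_len parameter are made up for this example; a real project would more likely use overlap graphs or de Bruijn graphs and would have to handle sequencing errors.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len characters exists.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of contigs with the largest overlap.

    Stops when no pair overlaps by at least min_len characters and
    returns the remaining contigs (ideally a single sequence).
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs
        # Append the non-overlapping tail of the second read to the first.
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    # Three toy reads taken from the made-up sequence ATGGCGTACCGGA.
    print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCGGA"]))
```

The all-pairs comparison is quadratic in the number of reads at every merge step, which is fine for a toy example but is exactly the kind of cost a project on this topic would need to address.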

As a side note, I find all of these problems interesting. However, I will be approaching many of them with the same level of experience you will have. That is, I have not worked in most of these areas. If you want to pick my brain about probabilistic models, protein structure prediction, or supervised machine learning, I will have more specific knowledge of directions. If you pick something else, I would be more than happy to help look at the literature with you.
