Overview
The goal of this week's lab is to reinforce basic concepts of biology and bioinformatics.
Specifically, Part 1 will have you explore existing bioinformatic databases, discovering
what information is available for a particular gene of interest and answering
questions along the way. Part 2 will have you
implement a short program that will simulate many aspects of the central dogma and
explore the mechanisms in more detail.
This lab is to be done individually. You can obtain your starting directory
structure and any starting files by running handin68. You may discuss concepts
with a fellow classmate, especially if you are having difficulty with the details of
transcription or translation. You may not share code, however.
You will be responsible for handing in your solutions to both parts, due
Thursday, January 31 before midnight.
- Part 1: Working with Databases should be handed in as a text file,
questions.txt.
- Part 2: Central Dogma requires two files. The first,
sequences.py, will contain class definitions for DNA, RNA, and
Protein objects. The second, dogma.py, will have you implement
a main program that allows a user to read in a DNA sequence and convert it
to a protein using transcription and translation functionality. The user
will be able to search a large genome for potential proteins of interest.
The programming for this week's lab may seem basic upon first read, but that is
partially because we haven't covered any algorithms in class yet!
It is designed to get you back in the practice of using Python and to get you to
see how transcription and translation work.
Part 1: Working with Databases
One of the most well studied proteins in molecular biology is the green fluorescent
protein (GFP).
Its discovery was awarded the Nobel Prize in Chemistry in 2008
for redefining how fluorescence microscopy is used in biology. It's also being
used to create a breed of glow-in-the-dark pets that is guaranteed to give you nightmares.
In this portion of the lab, you will learn about GFP using three well known databases
for genomics: GenBank (for nucleotide sequences), UniProt (for protein sequences), and
the Protein Data Bank (PDB; for protein structures). Along the way, you will
answer questions that you will submit in your questions.txt file.
GenBank:
- GenBank is a database of nucleotide sequences. It can be accessed at the
NCBI website (National Center for Biotechnology Information) at
http://www.ncbi.nlm.nih.gov/.
In the search pull down menu at
the top, make sure "nucleotide" is selected. In the text
box at the top of the screen where it
solicits input for searching, type "GFP" and hit the Go
button.
- This search will bring up over 1000 results. To narrow the search,
click on "Limits" just
below the box where you typed "GFP".
Limit the search to "gene name" (in the dropdown box) and
click the "Search" button again. Go to results 58 and 59 on the
third page.
- These two entries, M62653 and M62654, are from a seminal 1992 paper. Click on M62653 (Item 59),
look over the GenBank record, and answer the questions in questions.txt.
UPDATE: For Question 1, the field CDS details the encoding, or protein sequence. Either
do the math or click on the protein ID for the number of amino acids.
For question 2, look about 12 lines down for an entry labeled PUBMED. There
is a link next to it that points to the original paper describing the gene. Read the
abstract for the answer.
UniProt:
- UniProt is a database of amino acid sequences that can
be accessed at UniProt. At the UniProt
homepage, type
GFP and click the Search button. The first link
should be GFP_AEQVI P42212. Click on the link.
- Find where the sequence data (click the "Sequences" link on the menu bar)
is available. If you click on the check box
next to the access ID, you will see a green bar pop up at the bottom of the screen.
Click "Retrieve" and note the wide variety of standardized formats for sequences.
FASTA is the most common, simple format. Click Open on all of the formats and
see what information is found.
- Examine the web page for this protein, and answer the questions.
UPDATES: For Question 1: the Pfam information is near the bottom of the
"Cross-References" section. For Question 2: Look for the "Clan" link on the left side.
To describe the function, just mention where the protein is generally found.
For Question 3: go back to the UniProt page to find this information. You can click
"Ontologies" at the top.
Protein Data Bank:
- The PDB (Protein Data Bank) is a database of protein
structures at http://www.rcsb.org/pdb.
Type GFP into the search text box and
click the Search button.
- Note that GFP was once the molecule of the day! Click on the story and read
it; it contains a nice history of the protein.
- Back to the search results. Sort by release date (increasing) and click
on the result 1EMA (should be first or second).
- Notice that a lot of the sequence and annotations in the other databases are also
accessible here. This is a recent modification to the PDB, making it a great
resource for known structures (genes with no known structures will not be here).
- If you have Java applets enabled, you can view the molecule. For now, you can
at least see a static image of the molecule. This is known as a ribbon
representation where instead of atoms, the shape of the protein indicates the
type of secondary structure.
- On the far right at the top, click on Display File,
then click on the link to display the structure file
in PDB file format.
- In this file the majority of lines are ATOM lines.
Scroll down until you see those
lines and note how the atoms are numbered (in this case, 1 to 1771). Answer
the questions for this section.
UPDATES: I updated the instructions to reflect that 1EMA is actually the second link
on the results. For Question 1: the letter is the abbreviation for the atom. C, CA,
and CB are all carbon (A is alpha and B is beta, noting a specific location on the structure); N
is nitrogen, O oxygen. For Question 2: you have the three letter abbreviation for the
amino acid. Look up the full name on the wiki page for amino acids. For Question 3: the
x,y,z coordinates are columns 7, 8, and 9 respectively.
Dino Hunting
- Go to
the following web page: http://nh-brin.unh.edu/Bioinformatics/Tutorials/DinoDNA/
UPDATE: This webpage seems to be down. The same info is available here. Ignore
the exercises.
- Copy
the DNA sequence marked JurassicPark DinoDNA from the book Jurassic
Park. (Read the text to
learn the story behind this particular DNA).
- Go to the
NCBI Blast home page at http://www.ncbi.nlm.nih.gov/BLAST/. Go to the link that says
Nucleotide-nucleotide BLAST [blastn]
- Paste
the DinoDNA DNA sequence into the text box and hit the Blast! button.
UPDATES: For Question 1, give the Sequence ID number instead. For the length
of the match, either just Range 1 or the sum of ranges is acceptable. For Question 2:
the e-value is also the Expected value and describes the number of matches this good
that one would expect to see by random chance in the database.
Central Dogma
In this portion of the lab, you will create a Python library and main program
to simulate operations described in the central dogma in order to better understand
the link between a DNA sequence and resulting protein sequence(s).
First, you will construct three class definitions, one each for DNA, RNA, and Protein. I will describe the main functionality that
is expected; feel free to add additional information/methods. All
three should be defined in a file sequences.py.
Sequence classes
First, define a DNA class. Your class should have, at a minimum,
the following functionality:
- A constructor that takes in a string, strand. It should create
an instance variable to store the strand and initialize any other data members
you want to maintain.
- A __str__ method for converting the object to a string. It
should return a strand summary, including directionality. That is, the
start of the string should be "5' " and the end should be " 3'".
If the strand is longer than 30 bases, show the first 15 bases, a series of dots,
and then the last 15 bases. E.g., "5' TTTGAGCAAGTCAAA...TTTTATTCGTGTGTA 3'"
- A __len__ method to get the length of the sequence
- An invert() method to replace the current strand with
its reverse complement. That is, the other half of the DNA double strand.
You should always think of sequences as 5' to 3', so you will not only need
to find the complement of each base, but also reverse the sequence.
For example, AAGG should become CCTT.
- A getStrand() method to return the raw sequence
- A getSubStrand() method to retrieve a portion of the sequence. This should
take in a start and stop index and return a string containing all bases from start up
to the stop index.
- A transcription() method that returns a list of RNA
objects. Each RNA object will represent the sequence between one pair of
start/stop codons in the same
reading frame. That is, the distance between them is evenly divisible by three.
A naive way to implement this method is to search for all possible start codons (ATG). For each start codon, search the rest of the strand incrementing by three
for a stop codon (TAG, TGA, or TAA). If there is no stop codon,
do not add the encoding to the list. There may be overlaps in encodings (ATG can code
for a regular methionine or a start one). Be sure to substitute U's for T's when
constructing the RNA object. You should pass the index of the first nucleotide of
the start codon (the sample runs report ranges that begin at the A of ATG) and the last index before the stop codon to each RNA object's constructor.
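As a starting point, here is a minimal sketch of the string-level logic behind __str__, invert(), and transcription(), written as plain Python 3 functions. The function names (summarize, reverse_complement, find_encodings) are made up for illustration and are not part of the required interface. The sketch assumes an uppercase strand containing only A, C, G, and T, and it reports the start index at the A of ATG, which is what the sample runs show. You would still need to wrap this logic in your DNA class and build RNA objects from the results.

```python
# Illustrative helpers only; the names are hypothetical, not required by the lab.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
STOP_CODONS = {"TAG", "TGA", "TAA"}


def summarize(strand):
    """Return a 5'-to-3' summary, abbreviating strands longer than 30 bases."""
    if len(strand) > 30:
        strand = strand[:15] + "..." + strand[-15:]
    return "5' " + strand + " 3'"


def reverse_complement(strand):
    """Complement every base, then reverse so the result reads 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def find_encodings(strand):
    """Return (start, stop, mrna) for each ATG that has an in-frame stop codon.

    start is the index of the A in ATG (assumed from the sample runs);
    stop is the last index before the stop codon.
    """
    encodings = []
    start = strand.find("ATG")
    while start != -1:
        # Walk forward one codon at a time so we stay in the same reading frame.
        for i in range(start + 3, len(strand) - 2, 3):
            if strand[i:i + 3] in STOP_CODONS:
                mrna = strand[start:i].replace("T", "U")
                encodings.append((start, i - 1, mrna))
                break
        # Start codons with no in-frame stop codon are simply skipped.
        start = strand.find("ATG", start + 1)
    return encodings
```

For instance, reverse_complement("AAGG") gives "CCTT", matching the example in the invert() description. This naive search rescans the strand for every start codon, so expect it to be slow on a full genome.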
Next, define an RNA class. Your class should have the following methods:
- A constructor that takes in a strand, as well as start and stop indices for
where the encoding can be found in the original DNA sequence. You should store
these three items as well as any other data members you see fit.
- A __str__ method similar to above, but it should also include the
indices, e.g., "16-21: 5' AUGCCA 3'"
- A __len__ method to return the length of the sequence
- A getStrand() as above
- A translate() method that returns a Protein object containing
the translation of the mRNA sequence. This method should take in a codon table as
input and use this to produce the translation. You should pass the start/stop index
in to the Protein constructor.
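The core of translate() can be sketched as a loop over the mRNA three bases at a time. This is a minimal Python 3 sketch written as a standalone function rather than a method; the name demo_table and its four entries are a small hand-made stand-in for the real codon table, and the sketch assumes the mRNA length is a multiple of three and that every codon appears in the table.

```python
def translate(mrna, codon_table):
    """Translate an mRNA string codon by codon using a codon -> amino acid dict."""
    amino_acids = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        amino_acids.append(codon_table[codon])
    return "".join(amino_acids)


# Partial table for illustration only; the full table comes from the codon file.
demo_table = {"AUG": "M", "AUU": "I", "AAC": "N", "ACU": "T"}
print(translate("AUGAUUAACACU", demo_table))  # MINT
```

The mRNA used here is the one that appears in the sample run after inverting test.fasta.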
Lastly, you should create a
Protein class. This class will look exactly
the same as the
RNA class minus the
translate method. The constructor
will take in an amino acid sequence and a start and stop index for finding the original
encoding region in the DNA sequence.
Main program
You will define your main program in
dogma.py. At a high level, your program
should:
- Greet the user
- Prompt the user for a sequence file; load the sequence as a DNA object and
print the sequence's summary
- Prompt the user for the codon table file; load the table into a dictionary
- Go into the program's main loop for allowing a user to interact with the
sequence. The loop should exit when the user selects option "0"
The main loop can be as creative as you like. At a
minimum, you should define
behavior for the following options:
- Print the raw DNA sequence (entire sequence, no 5' or 3' labels using
getStrand())
- Display a subsequence of the DNA strand. This should prompt the user for a
start and stop location and display just the nucleotides between these two indices.
- Allow the user to invert the DNA sequence, and then print the
sequence summary (i.e., use its str() method). Any RNA or Protein sequences
that have been stored should be cleared as they no longer apply.
- Transcribe the DNA sequence. As described above, this should produce a list
of all possible proteins that could be produced from the sequence. You should
print the number of mRNA molecules produced and their summary.
- Print all of the raw mRNA sequences (entire sequence, no directionality)
- Translate each mRNA sequence to a protein (make sure you clear out any
previous proteins from your list); print a summary of each protein
- Print raw protein sequences to an output file, one protein per line
Hints and Tips
Reading FASTA file
FASTA is a standardized format used across the field to represent DNA and/or
protein sequences. You can read in detail about the format at the
NCBI manual page.
For this lab, you only need to know that there are two types of lines in the file:
description lines and sequence lines. For example:
>gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP
The first line describes the gene and can be ignored for this lab. The next
four lines are the gene's protein sequence.
When loading your file, you can ignore
description lines. The first character on a description line will be the greater
than symbol
">". Each line below the description line is part of the
sequence, with 80 characters per line. Simply finish reading the file line-by-line
concatenating the lines together to create one large string for the sequence.
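The loading procedure described above might look like the following Python 3 sketch. The function names parse_fasta and read_fasta are illustrative, and the sketch assumes the file holds a single sequence (one description line followed by sequence lines).

```python
def parse_fasta(lines):
    """Join all sequence lines into one string, skipping description lines."""
    pieces = []
    for line in lines:
        line = line.strip()
        # Description lines start with ">"; blank lines carry no sequence.
        if line and not line.startswith(">"):
            pieces.append(line)
    return "".join(pieces)


def read_fasta(filename):
    """Open a FASTA file and return its sequence as a single string."""
    with open(filename) as infile:
        return parse_fasta(infile)
```

Separating the parsing from the file handling is a design choice, not a requirement; it lets you try parse_fasta on a short list of strings before pointing it at a real file.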
Reading Codon Table
A codon table maps three-letter RNA codons to a single-letter amino acid that
it produces. Look at the
codon.txt file and note that each line
contains the amino-acid abbreviation first, and then a list of all codons
that map to that amino acid. You should load this file into a dictionary data
structure (go
here to read up on using the built-in dictionary class in Python).
You should map codons to their amino acid equivalent.
E.g.,
codonTable["AUG"] = 'M'
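Here is one way the codon table could be loaded, as a minimal Python 3 sketch. It assumes each line holds the amino-acid letter first, followed by its codons, with the items separated by whitespace or commas; check codon.txt and adjust the splitting if the actual layout differs. The function name parse_codon_table is made up for illustration.

```python
def parse_codon_table(lines):
    """Build a codon -> amino acid dictionary from lines of a codon table file."""
    table = {}
    for line in lines:
        # Assumed layout: amino acid first, then its codons (spaces or commas).
        fields = line.replace(",", " ").split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        amino_acid = fields[0]
        for codon in fields[1:]:
            table[codon] = amino_acid
    return table
```

To use it with a file, pass the open file object, e.g. parse_codon_table(open_file), since iterating over a file yields its lines.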
Program Requirements
In addition to the requirements listed above, you should ensure your code satisfies these
general guidelines
- Use good design principles. In fact, your solution should be very short (about 125-150
lines in dogma.py) if you design your solution well.
- Make sure to practice defensive programming. Make sure the user enters in valid
file names and numeric choices for the menu.
- Be sure to comment non-trivial sections of your code
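The defensive programming guideline can be illustrated with two small input helpers. This is a sketch assuming Python 3 (where input returns a string); the names prompt_for_choice and prompt_for_file and the ask parameter are hypothetical, the latter added only so the functions can be exercised without a keyboard.

```python
import os


def prompt_for_choice(low, high, ask=input):
    """Keep asking until the user enters an integer between low and high."""
    while True:
        reply = ask("Enter choice: ").strip()
        try:
            choice = int(reply)
        except ValueError:
            choice = None  # not a number at all
        if choice is not None and low <= choice <= high:
            return choice
        print("Please enter a number from %d to %d." % (low, high))


def prompt_for_file(message, ask=input):
    """Keep asking until the user names a file that exists."""
    while True:
        name = ask(message).strip()
        if os.path.isfile(name):
            return name
        print("Could not find '%s'; please try again." % name)
```

With the menu from the sample run, the main loop could call prompt_for_choice(0, 6) and then branch on the returned number.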
Sample Runs
In your labs directory, I have placed two sample sequence files,
test.fasta
and
gfp.fasta. The latter is the sequence for the green fluorescent protein,
while the former is a toy example for which I have results below. Try your code
on the test file first, and then see what happens with your GFP gene (can you recover
the protein sequence you find in Part 1?). If you want to try a
large example, try running your code on the E. coli UTI89 genome in
ecoli_uti89.fasta.
It is located at
/home/soni/public/cs68/ecoli_uti89.fasta. DO NOT COPY this file,
it is quite large. Note that your program will take a while to run for certain
operations since it is a large sequence.
Welcome to the gene translator
Enter FASTA file name: test.fasta
Enter Codon Table file name: codon.txt
DNA sequence of length 126 successfully loaded:
5' TTAATAGCGTGGAAT...CATTTTATTTTAAAA 3'
Options:
0) Exit
1) Print raw DNA sequence
2) Invert DNA sequence
3) Transcribe DNA sequence and print summary
4) Print raw RNA sequences
5) Translate RNA sequences and print summary
6) Print raw sequences to file
Enter choice: 1
Entire DNA sequence:
TTAATAGCGTGGAATGATCCTTATTAAAGAGTGTCACGAAGAGTCGGAATAGAATATGGAGGCGACAGTCGAGGGTGGGATAGAGTCCTAAAGATAACATTAAGTGTTAATCATTTTATTTTAAAA
Options:
0) Exit
1) Print raw DNA sequence
2) Invert DNA sequence
3) Transcribe DNA sequence and print summary
4) Print raw RNA sequences
5) Translate RNA sequences and print summary
6) Print raw sequences to file
Enter choice: 3
2 Resulting mRNA sequences:
13-48: 5' AUGAUCCUUAUUAAA...CACGAAGAGUCGGAA 3'
55-87: 5' AUGGAGGCGACAGUC...GGUGGGAUAGAGUCC 3'
Options:
0) Exit
1) Print raw DNA sequence
2) Invert DNA sequence
3) Transcribe DNA sequence and print summary
4) Print raw RNA sequences
5) Translate RNA sequences and print summary
6) Print raw sequences to file
Enter choice: 4
mRNA Sequence 0
AUGAUCCUUAUUAAAGAGUGUCACGAAGAGUCGGAA
mRNA Sequence 1
AUGGAGGCGACAGUCGAGGGUGGGAUAGAGUCC
Options:
0) Exit
1) Print raw DNA sequence
2) Invert DNA sequence
3) Transcribe DNA sequence and print summary
4) Print raw RNA sequences
5) Translate RNA sequences and print summary
6) Print raw sequences to file
Enter choice: 5
2 Resulting protein sequences:
13-48: MILIKECHEESE
55-87: MEATVEGGIES
Options:
0) Exit
1) Print raw DNA sequence
2) Invert DNA sequence
3) Transcribe DNA sequence and print summary
4) Print raw RNA sequences
5) Translate RNA sequences and print summary
6) Print raw sequences to file
Enter choice: 6
Enter output filename: test.pro
File output complete
Options:
0) Exit
1) Print raw DNA sequence
2) Invert DNA sequence
3) Transcribe DNA sequence and print summary
4) Print raw RNA sequences
5) Translate RNA sequences and print summary
6) Print raw sequences to file
Enter choice: 2
DNA sequence successfully inverted:
5' TTTTAAAATAAAATG...ATTCCACGCTATTAA 3'
Options:
0) Exit
1) Print raw DNA sequence
2) Invert DNA sequence
3) Transcribe DNA sequence and print summary
4) Print raw RNA sequences
5) Translate RNA sequences and print summary
6) Print raw sequences to file
Enter choice: 3
1 Resulting mRNA sequences:
12-23: 5' AUGAUUAACACU 3'
Options:
0) Exit
1) Print raw DNA sequence
2) Invert DNA sequence
3) Transcribe DNA sequence and print summary
4) Print raw RNA sequences
5) Translate RNA sequences and print summary
6) Print raw sequences to file
Enter choice: 5
1 Resulting protein sequences:
12-23: MINT
Options:
0) Exit
1) Print raw DNA sequence
2) Invert DNA sequence
3) Transcribe DNA sequence and print summary
4) Print raw RNA sequences
5) Translate RNA sequences and print summary
6) Print raw sequences to file
Enter choice: 0
Submitting your work
Once you are satisfied with your program, hand it in by typing
handin68 at the unix prompt.
You may run handin68 as many times as you like, and only the
most recent submission will be recorded. This is useful if you realize
after handing in some programs that you'd like to make a few more
changes to them.
About the Data
The guide in Part 1 is based on a lab developed by Neil C. Jones and Pavel A. Pevzner, available
here.
Thanks to Mark Goadrich for sharing his test example sequence for part 2.