Run update21 to get the starting point file for this
week's lab, which will appear in cs21b/labs/03/. The program
handin21 will only submit files from this directory.
Biologists often get a piece of DNA sequence and want to know what's in it. One of the most obvious questions to ask is, does it contain a gene? Because the genomes of organisms contain many non-coding regions, a random piece of DNA is not guaranteed to contain a gene. And if there is a gene, where does it begin and end? A simple strategy for finding genes is to look for open reading frames. An open reading frame is the section of a sequence between a start codon and a stop codon.
Gene expression involves the processes of transcription and translation. Last week we focused on implementing translation. This week we will implement transcription, and then search for an open reading frame in the resulting mRNA. To simplify the program we will only search for an open reading frame at offset 0. If an open reading frame is found, we will translate it into the appropriate amino acid sequence.
You will write a single program called transcribeAndTranslate.py to perform all of the necessary steps. You should create your solution incrementally by following the instructions given below. Test each step of your partial solution. Do not go on to the next step until the previous step is working correctly.
Below is a sample run of the program in which an open reading frame was found. The start and stop codons have been highlighted in red for clarity, but your program need not do this.
This program simulates both transcription and translation.

Enter length of random DNA string as a multiple of 3: 300
antisense strand of DNA:
CGTCCGAGGCTGTGGCCAATTACTGTCAACTGCAGAGTGTACGTGATATGAATGATTTAGATG
GGGCCTCCTTGGACGTCGTGCGGTGAGAGAGGCGAGCACAGATAGGTACTGGAAATGTACCGT
TCTAGTTCGTGATTTACGCACGTGGAGATTGCCGCGTCGTCCACAATAGTGGCACGAGAATGT
TCGGAGTTAAGACTTAATATCAAGAACAAGATGTCGCGGAGGACAGCGGGTCTGAAAGATGCG
CTTTATCAGACACCTGCATGACCCTATGATACCTGAAACCTACTGGGA
mRNA:
GCAGGCUCCGACACCGGUUAAUGACAGUUGACGUCUCACAUGCACUAUACUUACUAAAUCUAC
CCCGGAGGAACCUGCAGCACGCCACUCUCUCCGCUCGUGUCUAUCCAUGACCUUUACAUGGCA
AGAUCAAGCACUAAAUGCGUGCACCUCUAACGGCGCAGCAGGUGUUAUCACCGUGCUCUUACA
AGCCUCAAUUCUGAAUUAUAGUUCUUGUUCUACAGCGCCUCCUGUCGCCCAGACUUUCUACGC
GAAAUAGUCUGUGGACGUACUGGGAUACUAUGGACUUUGGAUGACCCU
Found an open reading frame starting at 39 and ending at 54
MetHisTyrThrTyrSTOP
Here is a sample run of the program in which an open reading frame was not found:
This program simulates both transcription and translation.

Enter length of random DNA string as a multiple of 3: 75
antisense strand of DNA:
GACAAGCCTCCGCTTAGTCTTTTTCCGTGTTGCGTGGAGTTACTTGACTATTATAAAAGGCGT
TATCCGTTACAG
mRNA:
CUGUUCGGAGGCGAAUCAGAAAAAGGCACAACGCACCUCAAUGAACUGAUAAUAUUUUCCGCA
AUAGGCAAUGUC
No open reading frame found
DNA is composed of two strands, termed sense and antisense. The antisense strand is a complement of the sense strand as shown in the small example here:
AGAATGGCCTGGTAAGGC  sense strand of DNA
TCTTACCGGACCATTCCG  antisense strand of DNA

Generate a random antisense strand of DNA represented as a string. Use the choice function from the random library to randomly select from the bases T, A, G, and C. [Please note: The sense strand of DNA is provided above for illustration only. You do not need to generate the sense strand. Your program begins with generating the antisense strand of DNA.]
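As a rough illustration of this step, the sketch below builds a random strand one base at a time with random.choice. The function name random_dna and the fixed length of 12 are assumptions made for the example; your program would use the length entered by the user.

```python
from random import choice

def random_dna(length):
    """Return a random DNA string with the given number of bases.

    random_dna is a hypothetical helper name, not part of the lab's starting file.
    """
    bases = "TAGC"
    strand = ""
    for _ in range(length):
        # choice picks one character of the string uniformly at random
        strand += choice(bases)
    return strand

antisense = random_dna(12)
print("antisense strand of DNA:", antisense)
```

Because the strand is random, the printed sequence differs from run to run.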
When transcribing the antisense strand of DNA, T becomes A, A becomes U, G becomes C, and C becomes G. The transcription of the antisense strand is almost identical to the original sense strand except that the T's are replaced with U's. For instance, if we continue with the previous example:
TCTTACCGGACCATTCCG  antisense strand of DNA
AGAAUGGCCUGGUAAGGC  transcription into mRNA

Create an mRNA string representing the transcription of the antisense strand from the previous step.
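One possible way to express the transcription rule is a dictionary that maps each DNA base to its mRNA partner. This is only a sketch; the function name transcribe is an assumption, and a chain of if/elif tests would work just as well.

```python
def transcribe(antisense):
    """Return the mRNA transcribed from an antisense DNA string.

    transcribe is a hypothetical helper name chosen for this example.
    """
    # Pairing rule from the text: T->A, A->U, G->C, C->G
    pairing = {"T": "A", "A": "U", "G": "C", "C": "G"}
    mrna = ""
    for base in antisense:
        mrna += pairing[base]
    return mrna

print(transcribe("TCTTACCGGACCATTCCG"))  # AGAAUGGCCUGGUAAGGC
```

The printed string matches the mRNA shown in the small example above.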
Only particular sections of the mRNA can code for proteins and these are termed open reading frames. An open reading frame begins with a particular start codon: AUG. The open reading frame can end with several different stop codons: UAA, UGA, or UAG. For example, the following mRNA contains a short open reading frame that begins at position 3 (starting from 0) and ends at position 12 (the first letter of the stop codon).
AGAAUGGCCUGGUAAGGC open reading frame found in mRNA
Write a for loop to find the position of the first start codon in the mRNA string, if one exists. Use a break statement to exit from the loop as soon as a start codon is found.
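A minimal sketch of the start-codon search follows. It assumes that searching "at offset 0" means examining codons that begin at positions 0, 3, 6, and so on, which is consistent with the sample run, where the reported start (39) is a multiple of 3. The name find_start and the use of -1 to mean "not found" are choices made for this example.

```python
def find_start(mrna):
    """Return the position of the first AUG codon at offset 0, or -1 if none.

    find_start is a hypothetical helper name; -1 is an assumed "not found" marker.
    """
    start = -1
    # Step by 3 so that only whole codons aligned to offset 0 are examined
    for i in range(0, len(mrna) - 2, 3):
        if mrna[i:i + 3] == "AUG":
            start = i
            break  # stop at the first start codon
    return start

print(find_start("AGAAUGGCCUGGUAAGGC"))  # 3
```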
Write another for loop to look for the first stop codon beginning from the location of a found start codon, if one exists. Use a break statement to exit from the loop as soon as a stop codon is found.
If both a start and stop codon were found, report the locations of
the open reading frame.
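The stop-codon search and the report can be sketched in the same style. The example below assumes the start position is already known (3 for the small example) and begins looking at the codon after the start codon. The names find_stop and STOP_CODONS are hypothetical, and the message wording simply mirrors the sample runs.

```python
STOP_CODONS = ["UAA", "UGA", "UAG"]

def find_stop(mrna, start):
    """Return the position of the first stop codon after start, or -1 if none.

    find_stop is a hypothetical helper name; -1 is an assumed "not found" marker.
    """
    stop = -1
    # Begin at the codon following the start codon and stay in the same frame
    for i in range(start + 3, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            stop = i
            break  # stop at the first stop codon
    return stop

mrna = "AGAAUGGCCUGGUAAGGC"
start = 3  # position of AUG in the example mRNA
stop = find_stop(mrna, start)
if stop != -1:
    print("Found an open reading frame starting at", start, "and ending at", stop)
else:
    print("No open reading frame found")
```

For the example mRNA this reports a start of 3 and an end of 12, the first letter of the stop codon.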
If an open reading frame was found, use the translateName function from the genetics library to translate the open reading frame into amino acids. For example the open reading frame found in the previous step translates to:
MetAlaTrpSTOP

If no open reading frame was found, report this.
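The sketch below shows one way to walk through an open reading frame codon by codon. The signature of translateName is not given here, so the example uses a hypothetical stand-in, translate_codon, backed by a tiny table that covers only the codons in the small example. In your program you would call the genetics library instead.

```python
# Hypothetical stand-in table: covers only the codons of the small example.
# The real lab uses translateName from the genetics library instead.
EXAMPLE_CODONS = {
    "AUG": "Met", "GCC": "Ala", "UGG": "Trp",
    "UAA": "STOP", "UGA": "STOP", "UAG": "STOP",
}

def translate_codon(codon):
    """Return the amino acid name for a codon, or '???' if it is not in the table."""
    return EXAMPLE_CODONS.get(codon, "???")

def translate_orf(mrna, start, stop):
    """Translate the codons from start through the stop codon at position stop."""
    protein = ""
    # stop is the first letter of the stop codon, so stop + 3 includes it
    for i in range(start, stop + 3, 3):
        protein += translate_codon(mrna[i:i + 3])
    return protein

print(translate_orf("AGAAUGGCCUGGUAAGGC", 3, 12))  # MetAlaTrpSTOP
```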
To test your program, you can temporarily set the antisense strand to the example sequence used above:

testDNA = "TCTTACCGGACCATTCCG"

and then check that you get the same output as in the example. Once that is working, you can remove testDNA and use the randomly generated sequence.
These suggestions are not required.
If you're interested in making your program more complete, you can enhance it so that it searches for open reading frames at offset 0, 1, and 2 and reports the first one found.
A further enhancement is to report the longest open reading frame
found at any of the possible offsets.
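Both enhancements can be sketched with a helper that searches a single offset. The code below is one possible approach under stated assumptions: the names find_orf, first_orf, and longest_orf are hypothetical, only the first open reading frame at each offset is considered, and length is measured as the distance from the start codon to the stop codon.

```python
STOP_CODONS = ["UAA", "UGA", "UAG"]

def find_orf(mrna, offset):
    """Return (start, stop) for the first ORF at the given offset, or None."""
    start = -1
    for i in range(offset, len(mrna) - 2, 3):
        if mrna[i:i + 3] == "AUG":
            start = i
            break
    if start == -1:
        return None
    for i in range(start + 3, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return (start, i)
    return None

def first_orf(mrna):
    """Return the first ORF found when trying offsets 0, 1, and 2 in order."""
    for offset in range(3):
        orf = find_orf(mrna, offset)
        if orf is not None:
            return orf
    return None

def longest_orf(mrna):
    """Return the longest of the ORFs found at offsets 0, 1, and 2, or None."""
    best = None
    for offset in range(3):
        orf = find_orf(mrna, offset)
        if orf is not None:
            if best is None or orf[1] - orf[0] > best[1] - best[0]:
                best = orf
    return best

test_mrna = "AUGUAACAUGGCCGCCUGA"
print(first_orf(test_mrna))    # (0, 3)
print(longest_orf(test_mrna))  # (7, 16)
```

In the test string, offset 0 holds a very short frame (AUG followed immediately by UAA), while offset 1 holds a longer one, so the two functions return different answers.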
Once you are satisfied with your program, hand it in by typing handin21 at the Unix prompt. Recall that you may run handin21 as many times as you like, and only the most recent submission will be recorded.