In this lab you will implement a C++ linked list and then use
that linked list to simulate the behavior of DNA molecules. Your program
will represent DNA as a linked list and perform common molecular operations
using the linked list interface. At the end, you will discuss the performance
of your implementation and suggest improvements.
To get started, run
update35 from a computer science lab
computer. The program
update35 will create a
cs35/labs/04
directory in your CS home directory, and the program
handin35
will subsequently submit files from that directory.
This lab will require much more time than the earlier
labs in the course. Because you will use many pointers in this lab, you
will probably have more trouble debugging your program than in labs
01 or 02.
You may work with one partner on this lab. Only one of you should submit
your solution,
but it is important that both of your names appear at the top of your program.
As always, you must follow the academic integrity rules for the course. In
particular, sharing/dicussing code with other groups is strictly prohibited.
In addition, it is also a violation of academic integrity if both students are
not equal contributors in this lab. You should be working together and both
partners are responsible for understanding/verifying all code.
You will earn reasonable partial credit for this lab if you
complete and test your linked list implementation. So be sure to work in
a modular fashion, testing and completing your linked list first before
implementing the main program
Introduction
DNA (Deoxyribonucleic acid) contains the blueprint, or genetic instructions, for
the functioning of living organisms. DNA is a polymer chain consisting of
four building blocks called nucleobases (or bases): adenine, cytosine,
guanine, thymine. Typically, they are abbreviated A, C,
G, and T respectively. Thus, we can represent the chain
of bases as a sequence of characters:
ATTGTCATCCGGGAATC...
There are many behaviors/reactions that DNA undergo that we may want to
simulate using a computer to study their behavior. While this lab won't win
you a nobel prize, it will provide a simple computer simulation of a few
interesting properties of DNA and see the linked-chain of bases matches fairly
well with the linked nature of the linked list implementation of lists. (Say
that 5 times fast).
Your program will simulate the following molecular reactions. Feel free to add
more!
- Mutations - DNA, overtime, changes the value of bases in
its sequence with some randomness. For example, a base may randomly switch
from the value A to the value G. This mechanism can be
devasting to the organism - shutting down machinery or altering the normal
behavior of a gene to a diseased state. It also has the benefit of creating
variety in a population and generating beneficial alterations, such as
hulk-like strength or the ability to climb walls. Your program should
be able to mutate a random location in the sequence to some new value.
- Telomere Degradation - Chromosomes (the packaging of a large DNA
sequence in eukaryotes) contain a built-in buffer called telomores at the
end of their sequence that
does not code for genes, but prevents important information from being
degraded. During replication (or the copying of DNA), the telomeres get
shortened, resulting in the fact that cells can only duplicate their DNA
so many times before the buffer has disappeared. This mechanism is part
of organisms mortality, but also helps slow down cancer cells which
proliferate too quickly. You will need to simulate the removal of bases
from either end of the DNA sequence.
- Cell Death - your program should also monitor the ends of the
sequence to detect if a cell has eaten away the telomere buffer. For this
lab, you can assume that the telomere sequence is a sequence of repeating
T's. Therefore, the sequence can no longer replicate if the leading
or trailing thymine's have disappeared.
- Reverse - A DNA sequence is normally read in a 5' to 3'
direction. Replication, to be efficient, can simultaneously read a sequence
3' to 5' (right to left). To help scientists understand what this reverse
sequence looks like, your program should be able to reverse the direction of
the sequence. For example, reversing ATTAGC results in a
sequence that now reads CGATTA.
- Print Sequence - Your program should be able to spit out the
sequence in its current form if a scientist wants to see the result of a
few rounds of operations (such as mutations or telemore degradation)
Implementation details
I have provided a DNAList interface, very similar to the StringList
interface given in lecture. The difference is that you will represent
a DNA base as a char instead of a string.
You must define and implement a DNALinkedList class that implements
the DNAList interface, without altering the DNAList interface.
I've also given the declaration and implementation of the BaseNode
class to help you get started. The BaseNode represents one node
in your linked list and stores the value of a base. You should not alter this
definition, but feel free to analyze how pointers are handled in this class.
You should first begin your lab by testing and working with the
BaseNode class. Can you create a few nodes? Link them together?
Print all the items? I have provided the beginning of a program to do this,
testNode.cpp. Run make testNode to get started.
This file will not be graded, but should be your starting point to understand
how pointers and nodes work.
After testing BaseNode, you should complete and test your
DNALinkedList class. HINT: you may want to start by copying and
pasting the DNAList interface declaration into the
DNALinkedList header file. The interface is
very similar to the one in class with the addtion of a replace()
method. Some requirements to note:
- You should not add any extra public methods to the DNALinkedList
interface than what is already specified in the DNAList interface.
- You may add as many private methods as you desire.
- The class should only contain variables necessary for maintaining the
linked list. Do not declare local variables in the class declaration.
- Your methods should be safe. That means that it should not rely on
a user of the class to avoid making mistakes, such as removing a non-existent
item. In the case that the user attempts something illegal, you should
throw a runtime_error with a descriptive message.
- Your implementation must manage dynamic memory allocation properly. That
is, your linked list should only allocate memory essential to the linked
list implementation and must de-allocate memory when it is no longer being
used. Be sure to refer to your class notes on linked lists.
- You should modify your testNode.cpp file to make sure your
DNALinkedList implementation is working. Again, this file will
not be graded but the ninjas will ask you to show your test file to help
you with errors.
- A few of your methods will require traversing the linked list.
Think about using a loop to follow the linkes. When do you know you've
reached your destination. Also, how do you keep track of the previous
node in the linked list?
Once you have thoroughly tested your DNALinkedList implementation,
you should implement the dnaSimulate main program. The program
has been started for you. Use make to compile the program. When you
run it, you will see the basic flow of the program has been implemented, it is
your job to add functionality. Specifically, you should implement functions
to simulate the features described in the introduction. Some tips and
requirements
- Begin by finishing getStrandInput, which reads the sequence
from a file. You do not need to add code for the file input portion, but you
should add a few lines of code to create your list. While not required,
you should attempt to handle exception errors here in case the source
file contains an invalid character (such as 'D'). To
catch an exception, see the example in "Tips and Hints" below.
- A small sample sequence
has been placed in sequence_small.txt. You should create more
significant examples to throughly test your program and modify the file name
used to read in the sequence to test out your other sequences
- Work on printSequence next to print out all the characters in the
sequence as one large string
- One-by-one implement mutate, telomereDegrade,
cellDeath, and reverseSequence whose declarations have
been commented out at the top of the file.
These functions should simulate the behavior described in the Introduction above
- mutate should randomly pick one base in the sequence and change the
value to one of the four DNA bases.
- telomereDegrade should randomly pick one end of the sequence and
remove one telomere's worth of data (3 bases). It should remove the values
even if the telomere has run out of 'T's.
- reverseSequence should generate and return a new a list
with the sequence in reverse order. Note that your main program only needs
to keep the new version of the sequence and should properly clean up the old
list.
- cellDeath should be run after each round of asking the user
to perform some operation. If the telomere has been degraded (how do we know?),
the program should stop running and report the cell can no longer exist.
- Your program should only exit if the user chooses to quit or the DNA
sequence is no longer viable. In both cases, you should output the final
sequence.
Here are a few example executions (this will be up soon):
- Run 1 sequence_small.txt with choices of
print, reverse, print, degrade telomere
- Run 2 sequence_small.txt with a series of
mutations (none fatal) with prints and finally a quit. Notice that some
mutations lead to no change.
An implementation strategy
You should always program incrementally such that you can test your
work as you go. Here is one suggested strategy:
- Start by testing the DNABaseNode class -- which
consists of a constructor, destructor and simple access and update
functions
- Implement the DNALinkedList constructor, then either
implement a print() method in your test file
or use gdb to verify that
your list is correctly constructed.
- Implement one of insertAtHead or insertAtTail.
Then implement getSize, isEmpty,
peekHead and peekTail.
Use print or gdb to verify correctness.
- Implement getItem and verify.
- Implement removeHead. Then implement removeAtTail
and replace and verify.
- Implement the destructor. Carefully consider every possible way that
your class might allocate memory on the heap, and be sure that there
is a corresponding call to delete for each call to new.
At this point you should have a complete working
DNALinkedList class
whose behavior matches the
ArrayList from lecture (except
that the example
ArrayList has a capacity of just 10 items, whereas
your
LinkedList should have no artificial capacity limits).
Write a small test program to verify that your
DNALinkedList implementation.
You should not under any circumstances begin the DNA
simulation until you have thoroughly verified the correctness of your
DNALinkedList class. Any bugs in your DNALinkedList class will
hinder progress on the DNA simulation.
Written Portion
Individually, each of you are responsible for also handing in the answer to
the following questions. Your submissions can be hand or type-written;
please bring a hard copy to class on Thursday, February 16. Your answers
should be brief, explanations are not necessary except for #4.
- In class, we discussed the advantages and disadvantages of array lists.
After having implemented linked lists, please list which methods of the
DNALinkedList class are "fast" (i.e., O(1)) and which are not
efficient (i.e., O(N) for a linked list of size N).
- Draw a memory diagram for the dnaLL variable immediately after
returning from getStrandInput in your program. You can ignore
all other variables in your program, but your diagram should clearly show
what parts of dnaLL are in the heap and which are in the i
main() stack frame. Assume you've read in the sequence TGC
- For the DNA simulation methods mutate, telomereDegrade,
reverseSequence, cellDeath, and printSequence,
identify which are more efficiently implemented using ArrayLists, which are
more efficient with Linked Lists, and which would be the same.
- You have implemented a specific linked list called a singly-linked list.
An alternative implementation, called a doubly-linked list, uses a node
implementation that has a pointer to the previous node in addition to a
pointer to the next node. Hypothesize which, if any, methods in the dnaList.h
interface would be more efficient with this alternate representation.
Tips and Hints
Throwing and Handling Exceptions
Your linked list should handle invalid usage by throwing a
runtime_error. There is an example in the BaseNode
constructor. For example, if you may want to handle at attempt to get an
invalid item:
throw std::runtime_error("Attempt to getItem out of bounds");
Here is an example of handling a runtime_error (not required
in this lab, but suggested):
try{
//do something dangerous
} catch (runtime_error& exc){
cerr << "Warning: " << exc.what() << endl;
cerr << "Ignoring illegal behavior." << endl;
}
This is similar to the example in class, but this catches a
runtime_error
pointer and stores the exception as a variable,
exc. The
what() method returns the actual string message that was thrown.
Lastly,
cerr is standard error output and works very similarly to
cout except that a programmer usually uses it specifically to
print warnings and errors. Don't forget to include the
stdexcept
library when handling or throwing exceptions.
Obtaining random values in C++
Technically, there is no such thing as a random value in programming. You can,
at best, obtain pseudo-random numbers that seem random over a long period. C++
has a few libraries that try to be as random as possible, and we saw one of them
in Lab 2 with the default Player. To use random, I suggest using the
drand48() method. First, add the following libraries to your main
source file:
#include <stdlib.h>
#include <time.h>
Second, seed the random method at the top of your
main() function.
If you use the same seed (e.g.,
srand48(0)), your random number
generator will actually give you the same sequence of values every time you
run your program. This can be helpful if you want to get predictable behavior,
but if you want something closer to "random", put the following line at the
top of
main() instead:
srand48(time(0)); //bases the seed value off of the current time
To generate a random number in the range [0.0, 1.0) (i.e., upto but not
including one), call
drand48(). Let's say you want to get an
integer value between 0 upto but not including 5, you can try:
int index = (int)(drand48()*5);
This multiplies the random value, so the range is [0.0, 5.0) and then truncates
to an int, giving either 0, 1, 2, 3, or 4. You will need to use this for
a few of your methods, including mutations (pick a base to mutate, and also
pick a new random value for the base) as well as selecting which telomere end
to degrade.
Extensions
NOTE: you are not required to implement any extensions, and no extra credit
will be awarded if you do. You should only attempt this if you want to get
extra experience with templates.
Implementing a specific list interface just to handle DNA molecules is not
efficient. We could easily want to represent integers, strings, or more
complex objects such as an MP3 library. The underlying data structues and
algorithms are the same across each of these problems, only the type of the
data changes. In the labs/04/template/ subdirectory, I have placed
the equivalent starting point files for this lab but using templates
to create a generic interface. Instead of a BaseNode class, there is
a Node close that can take any type for a value.
Submit
Once you are satisfied with your program, hand it in by typing
handin35 at the unix prompt.
You may run handin35 as many times as you like, and only the
most recent submission will be recorded. This is useful if you realize
after handing in some programs that you'd like to make a few more
changes to them. Only one partner should hand in the code for the lab.
The written portion is due in class on Thursday and should be done individually