CS35: Lab 4

In this lab you will implement a C++ linked list and then use that linked list to simulate the behavior of DNA molecules. Your program will represent DNA as a linked list and perform common molecular operations using the linked list interface. At the end, you will discuss the performance of your implementation and suggest improvements. To get started, run update35 from a computer science lab computer. The program update35 will create a cs35/labs/04 directory in your CS home directory, and the program handin35 will subsequently submit files from that directory.

This lab will require much more time than the earlier labs in the course. Because you will use many pointers in this lab, you will probably have more trouble debugging your program than in labs 01 or 02.

You may work with one partner on this lab. Only one of you should submit your solution, but it is important that both of your names appear at the top of your program. As always, you must follow the academic integrity rules for the course. In particular, sharing/dicussing code with other groups is strictly prohibited. In addition, it is also a violation of academic integrity if both students are not equal contributors in this lab. You should be working together and both partners are responsible for understanding/verifying all code.

You will earn reasonable partial credit for this lab if you complete and test your linked list implementation. So be sure to work in a modular fashion, testing and completing your linked list first before implementing the main program

Introduction

DNA (Deoxyribonucleic acid) contains the blueprint, or genetic instructions, for the functioning of living organisms. DNA is a polymer chain consisting of four building blocks called nucleobases (or bases): adenine, cytosine, guanine, thymine. Typically, they are abbreviated A, C, G, and T respectively. Thus, we can represent the chain of bases as a sequence of characters:

ATTGTCATCCGGGAATC...

There are many behaviors/reactions that DNA undergo that we may want to simulate using a computer to study their behavior. While this lab won't win you a nobel prize, it will provide a simple computer simulation of a few interesting properties of DNA and see the linked-chain of bases matches fairly well with the linked nature of the linked list implementation of lists. (Say that 5 times fast).

Your program will simulate the following molecular reactions. Feel free to add more!

Mutations - DNA, overtime, changes the value of bases in its sequence with some randomness. For example, a base may randomly switch from the value A to the value G. This mechanism can be devasting to the organism - shutting down machinery or altering the normal behavior of a gene to a diseased state. It also has the benefit of creating variety in a population and generating beneficial alterations, such as hulk-like strength or the ability to climb walls. Your program should be able to mutate a random location in the sequence to some new value.
Telomere Degradation - Chromosomes (the packaging of a large DNA sequence in eukaryotes) contain a built-in buffer called telomores at the end of their sequence that does not code for genes, but prevents important information from being degraded. During replication (or the copying of DNA), the telomeres get shortened, resulting in the fact that cells can only duplicate their DNA so many times before the buffer has disappeared. This mechanism is part of organisms mortality, but also helps slow down cancer cells which proliferate too quickly. You will need to simulate the removal of bases from either end of the DNA sequence.
Cell Death - your program should also monitor the ends of the sequence to detect if a cell has eaten away the telomere buffer. For this lab, you can assume that the telomere sequence is a sequence of repeating T's. Therefore, the sequence can no longer replicate if the leading or trailing thymine's have disappeared.
Reverse - A DNA sequence is normally read in a 5' to 3' direction. Replication, to be efficient, can simultaneously read a sequence 3' to 5' (right to left). To help scientists understand what this reverse sequence looks like, your program should be able to reverse the direction of the sequence. For example, reversing ATTAGC results in a sequence that now reads CGATTA.
Print Sequence - Your program should be able to spit out the sequence in its current form if a scientist wants to see the result of a few rounds of operations (such as mutations or telemore degradation)

Implementation details

I have provided a DNAList interface, very similar to the StringList interface given in lecture. The difference is that you will represent a DNA base as a char instead of a string. You must define and implement a DNALinkedList class that implements the DNAList interface, without altering the DNAList interface.

I've also given the declaration and implementation of the BaseNode class to help you get started. The BaseNode represents one node in your linked list and stores the value of a base. You should not alter this definition, but feel free to analyze how pointers are handled in this class. You should first begin your lab by testing and working with the BaseNode class. Can you create a few nodes? Link them together? Print all the items? I have provided the beginning of a program to do this, testNode.cpp. Run make testNode to get started. This file will not be graded, but should be your starting point to understand how pointers and nodes work.

After testing BaseNode, you should complete and test your DNALinkedList class. HINT: you may want to start by copying and pasting the DNAList interface declaration into the DNALinkedList header file. The interface is very similar to the one in class with the addtion of a replace() method. Some requirements to note:

You should not add any extra public methods to the DNALinkedList interface than what is already specified in the DNAList interface.
You may add as many private methods as you desire.
The class should only contain variables necessary for maintaining the linked list. Do not declare local variables in the class declaration.
Your methods should be safe. That means that it should not rely on a user of the class to avoid making mistakes, such as removing a non-existent item. In the case that the user attempts something illegal, you should throw a runtime_error with a descriptive message.
Your implementation must manage dynamic memory allocation properly. That is, your linked list should only allocate memory essential to the linked list implementation and must de-allocate memory when it is no longer being used. Be sure to refer to your class notes on linked lists.
You should modify your testNode.cpp file to make sure your DNALinkedList implementation is working. Again, this file will not be graded but the ninjas will ask you to show your test file to help you with errors.
A few of your methods will require traversing the linked list. Think about using a loop to follow the linkes. When do you know you've reached your destination. Also, how do you keep track of the previous node in the linked list?

Once you have thoroughly tested your DNALinkedList implementation, you should implement the dnaSimulate main program. The program has been started for you. Use make to compile the program. When you run it, you will see the basic flow of the program has been implemented, it is your job to add functionality. Specifically, you should implement functions to simulate the features described in the introduction. Some tips and requirements

Begin by finishing getStrandInput, which reads the sequence from a file. You do not need to add code for the file input portion, but you should add a few lines of code to create your list. While not required, you should attempt to handle exception errors here in case the source file contains an invalid character (such as 'D'). To catch an exception, see the example in "Tips and Hints" below.
A small sample sequence has been placed in sequence_small.txt. You should create more significant examples to throughly test your program and modify the file name used to read in the sequence to test out your other sequences
Work on printSequence next to print out all the characters in the sequence as one large string
One-by-one implement mutate, telomereDegrade, cellDeath, and reverseSequence whose declarations have been commented out at the top of the file. These functions should simulate the behavior described in the Introduction above
mutate should randomly pick one base in the sequence and change the value to one of the four DNA bases.
telomereDegrade should randomly pick one end of the sequence and remove one telomere's worth of data (3 bases). It should remove the values even if the telomere has run out of 'T's.
reverseSequence should generate and return a new a list with the sequence in reverse order. Note that your main program only needs to keep the new version of the sequence and should properly clean up the old list.
cellDeath should be run after each round of asking the user to perform some operation. If the telomere has been degraded (how do we know?), the program should stop running and report the cell can no longer exist.
Your program should only exit if the user chooses to quit or the DNA sequence is no longer viable. In both cases, you should output the final sequence.

Here are a few example executions (this will be up soon):

Run 1 sequence_small.txt with choices of print, reverse, print, degrade telomere
Run 2 sequence_small.txt with a series of mutations (none fatal) with prints and finally a quit. Notice that some mutations lead to no change.

An implementation strategy

You should always program incrementally such that you can test your work as you go. Here is one suggested strategy:

Start by testing the DNABaseNode class -- which consists of a constructor, destructor and simple access and update functions
Implement the DNALinkedList constructor, then either implement a print() method in your test file or use gdb to verify that your list is correctly constructed.
Implement one of insertAtHead or insertAtTail. Then implement getSize, isEmpty, peekHead and peekTail. Use print or gdb to verify correctness.
Implement getItem and verify.
Implement removeHead. Then implement removeAtTail and replace and verify.
Implement the destructor. Carefully consider every possible way that your class might allocate memory on the heap, and be sure that there is a corresponding call to delete for each call to new.

At this point you should have a complete working DNALinkedList class whose behavior matches the ArrayList from lecture (except that the example ArrayList has a capacity of just 10 items, whereas your LinkedList should have no artificial capacity limits). Write a small test program to verify that your DNALinkedList implementation.

You should not under any circumstances begin the DNA simulation until you have thoroughly verified the correctness of your DNALinkedList class. Any bugs in your DNALinkedList class will hinder progress on the DNA simulation.

Written Portion

Individually, each of you are responsible for also handing in the answer to the following questions. Your submissions can be hand or type-written; please bring a hard copy to class on Thursday, February 16. Your answers should be brief, explanations are not necessary except for #4.

In class, we discussed the advantages and disadvantages of array lists. After having implemented linked lists, please list which methods of the DNALinkedList class are "fast" (i.e., O(1)) and which are not efficient (i.e., O(N) for a linked list of size N).

Draw a memory diagram for the dnaLL variable immediately after returning from getStrandInput in your program. You can ignore all other variables in your program, but your diagram should clearly show what parts of dnaLL are in the heap and which are in the i main() stack frame. Assume you've read in the sequence TGC

For the DNA simulation methods mutate, telomereDegrade, reverseSequence, cellDeath, and printSequence, identify which are more efficiently implemented using ArrayLists, which are more efficient with Linked Lists, and which would be the same.

You have implemented a specific linked list called a singly-linked list. An alternative implementation, called a doubly-linked list, uses a node implementation that has a pointer to the previous node in addition to a pointer to the next node. Hypothesize which, if any, methods in the dnaList.h interface would be more efficient with this alternate representation.

Tips and Hints

Throwing and Handling Exceptions

Your linked list should handle invalid usage by throwing a runtime_error. There is an example in the BaseNode constructor. For example, if you may want to handle at attempt to get an invalid item:

throw std::runtime_error("Attempt to getItem out of bounds");

Here is an example of handling a runtime_error (not required in this lab, but suggested):

try{
  //do something dangerous
} catch (runtime_error& exc){
  cerr << "Warning: " << exc.what() << endl;
  cerr << "Ignoring illegal behavior." << endl;
}

This is similar to the example in class, but this catches a runtime_error pointer and stores the exception as a variable, exc. The what() method returns the actual string message that was thrown. Lastly, cerr is standard error output and works very similarly to cout except that a programmer usually uses it specifically to print warnings and errors. Don't forget to include the stdexcept library when handling or throwing exceptions.

Obtaining random values in C++

Technically, there is no such thing as a random value in programming. You can, at best, obtain pseudo-random numbers that seem random over a long period. C++ has a few libraries that try to be as random as possible, and we saw one of them in Lab 2 with the default Player. To use random, I suggest using the drand48() method. First, add the following libraries to your main source file:

#include <stdlib.h>
#include <time.h>

Second, seed the random method at the top of your main() function. If you use the same seed (e.g., srand48(0)), your random number generator will actually give you the same sequence of values every time you run your program. This can be helpful if you want to get predictable behavior, but if you want something closer to "random", put the following line at the top of main() instead:

srand48(time(0)); //bases the seed value off of the current time

To generate a random number in the range [0.0, 1.0) (i.e., upto but not including one), call drand48(). Let's say you want to get an integer value between 0 upto but not including 5, you can try:

int index = (int)(drand48()*5);

This multiplies the random value, so the range is [0.0, 5.0) and then truncates to an int, giving either 0, 1, 2, 3, or 4. You will need to use this for a few of your methods, including mutations (pick a base to mutate, and also pick a new random value for the base) as well as selecting which telomere end to degrade.

Extensions

NOTE: you are not required to implement any extensions, and no extra credit will be awarded if you do. You should only attempt this if you want to get extra experience with templates.

Implementing a specific list interface just to handle DNA molecules is not efficient. We could easily want to represent integers, strings, or more complex objects such as an MP3 library. The underlying data structues and algorithms are the same across each of these problems, only the type of the data changes. In the labs/04/template/ subdirectory, I have placed the equivalent starting point files for this lab but using templates to create a generic interface. Instead of a BaseNode class, there is a Node close that can take any type for a value.

Submit

Once you are satisfied with your program, hand it in by typing handin35 at the unix prompt.

You may run handin35 as many times as you like, and only the most recent submission will be recorded. This is useful if you realize after handing in some programs that you'd like to make a few more changes to them. Only one partner should hand in the code for the lab.

The written portion is due in class on Thursday and should be done individually

CS35 Lab 4: The linked lists of life

Throwing and Handling Exceptions

Obtaining random values in C++