The final exam for this course is comprehensive: it will contain material from throughout the semester, albeit with a slight emphasis on more recent material. This study guide contains only material from after Test 3. To prepare for the final exam, please consult the following problems as well as the older study guides (listed below).
What is a hash function? What is a hash collision?
We discussed two techniques for resolving hash collisions in class: linear probing and chaining. The technique used by a HashTable affects its structure.
Briefly explain the difference between these hash collision resolution techniques.
Give a C++ data type which could be used to store the contents of each type of hash table. (You may use pair and other data types we discussed in class. You may also use a pair within a pair or assume the existence of a type triple for three elements.)
What is load factor? How is it used in a hash table?
What is the worst case time for inserting an element into a hash table? How does this worst case time relate to the hashing function?
Suppose you have an empty chaining hash table with three buckets and a maximum load factor of 0.75. Assume that keys and values are integers and that the hash function is hash(x, capacity) = x % capacity. We will insert the mappings \(4 \mapsto 8\), \(1 \mapsto 5\), \(0 \mapsto 6\), and \(6 \mapsto 2\). Draw the hash table after each addition. Assume that ensure_capacity doubles the number of buckets before rehashing.
A hash function transforms data (e.g. a dictionary key) into integers used to index into the buckets of a hash table.
Ideally, a hash function will distribute keys uniformly across buckets.
A collision occurs when the hash function maps two different inputs to the same hash value (bucket).
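For example, with the hash function from the chaining exercise below, hash(x, capacity) = x % capacity, and three buckets, the distinct keys 4 and 1 collide: \(4 \bmod 3 = 1\) and \(1 \bmod 3 = 1\), so both keys map to bucket 1.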
In linear probing, each hash value is assigned a bucket which holds at most one value and collisions are resolved by moving on to the next bucket. In chaining, each bucket may hold many values. A linear probing hash table might store all of its buckets as a pair<pair<K,V>,bool>*, where the boolean indicates whether a bucket is in use or not. A chaining hash table might store its buckets as a vector<pair<K,V>>* where each bucket is a vector containing all keys of the same hash.
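As a minimal C++ sketch of those two layouts (assuming integer keys and values as in the exercise; the type aliases are only illustrative):

\begin{verbatim}
#include <utility>
#include <vector>

using K = int;   // key type assumed for this sketch
using V = int;   // value type assumed for this sketch

// Linear probing: each bucket holds at most one mapping,
// plus a flag recording whether the bucket is in use.
using ProbingBucket = std::pair<std::pair<K, V>, bool>;
ProbingBucket* probing_buckets;    // array with one entry per bucket

// Chaining: each bucket holds every mapping whose key hashes to it.
using ChainingBucket = std::vector<std::pair<K, V>>;
ChainingBucket* chaining_buckets;  // array with one entry per bucket
\end{verbatim}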
The load factor of a hash table is its size (the number of stored mappings) divided by its capacity (the number of buckets). The load factor determines when more buckets are allocated, which keeps mappings spread out across the buckets.
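As a sketch of how that check might look (the names size, capacity, and needs_growth are assumptions for illustration, not the actual members of the class):

\begin{verbatim}
// Returns true when the table should allocate more buckets and rehash.
// size = number of stored mappings, capacity = number of buckets.
bool needs_growth(int size, int capacity, double max_load_factor) {
    double load_factor = static_cast<double>(size) / capacity;
    return load_factor > max_load_factor;
}
// Example: 3 mappings in 3 buckets gives a load factor of 1.0,
// which exceeds a maximum of 0.75, so the table would grow.
\end{verbatim}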
In the worst case, an element insertion takes \(O(n)\) time; this occurs if all of the mappings have the same hash.
The likelihood of this occurring depends on the quality of the hash function used with the table and on the distribution of keys given as input.
Empty hash table:
After \(4 \mapsto 8\):
After \(1 \mapsto 5\):
After \(0 \mapsto 6\), which causes a growth:
After \(6 \mapsto 2\):
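Since the diagrams are not reproduced in this text version, a rough text trace of the chaining buckets is sketched below (the order of mappings within a bucket may differ from the original drawings):

Empty, 3 buckets: all buckets empty.
After \(4 \mapsto 8\): bucket 1 holds \(4 \mapsto 8\).
After \(1 \mapsto 5\): bucket 1 holds \(4 \mapsto 8\) and \(1 \mapsto 5\).
After \(0 \mapsto 6\), which causes a growth to 6 buckets and a rehash: bucket 0 holds \(0 \mapsto 6\), bucket 1 holds \(1 \mapsto 5\), bucket 4 holds \(4 \mapsto 8\).
After \(6 \mapsto 2\): bucket 0 holds \(0 \mapsto 6\) and \(6 \mapsto 2\), bucket 1 holds \(1 \mapsto 5\), bucket 4 holds \(4 \mapsto 8\) (load factor \(4/6 \approx 0.67\), so no further growth).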
Graphs
Graph for Question 3
What is a graph?
Define the following terms in relation to graphs:
Weighted
Directed
Connected
Consider the graph on the right.
Which path would a breadth first search find from vertex A to vertex G?
Which path would Dijkstra’s algorithm find from vertex A to vertex G? Why is this path different from the BFS path above?
Ignoring the weights on this graph, perform a topological sort of it. Give the resulting topological sort as a list of vertices.
Define minimum spanning tree. Treating this graph as having undirected edges, find the minimum spanning tree.
The graph data structure you used in class uses a dictionary to store the edges of the graph. Edges may be stored in a variety of different ways, so let us consider the adjacency matrix representation. Instead of a dictionary, we will store a two-dimensional array of weight values of size \(|V|^2\): that is, there is a row and a column for each vertex. If an edge exists from vertex A to vertex B, then the cell in row A and column B will contain its weight; otherwise, that cell contains a zero (for simplicity). Answer the following questions, using \(|V|\) for the number of vertices in the graph and \(|E|\) for the number of edges.
What is the worst-case time complexity of adding an edge to this graph?
What is the worst-case time complexity of producing a list of edges in this graph?
What is the worst-case time complexity of determining the out-degree of a vertex?
Suppose that we copy our edge data into a new matrix of size \((|V|+1)^2\) each time a new vertex is added. What is the time complexity of addVertex in this implementation?
Consider the strategy that ArrayList uses to amortize the cost of copying data; a similar approach can be used here. Describe how the addVertex algorithm can be improved by allocating more space than is immediately necessary. The amortized worst-case time complexity of adding a new vertex should be \(O(n)\). Explain why.
A graph is a pair of sets. The first set, \(V\), is a set of vertices of any type. The second set, \(E\), is a set of edges. An edge is a grouping of a source vertex and a target vertex, along with data such as a weight and a label.
A graph is weighted if its edges have weights / have meaningful weights / have non-identical weights. A social network, for instance, is unlikely to be weighted, while a map probably is (where weight is distance, difficulty of traveling, etc.).
A graph is undirected if, for each edge heading in one direction, there is an identical edge heading in the opposite direction. A graph is directed if it is not undirected.
A graph is connected if, for each vertex in the graph, there is a path to every other vertex in the graph. In a directed graph, this is sometimes called strongly connected. A directed graph is weakly connected if its undirected equivalent is connected.
A breadth-first search would find the path [A, D, G].
Dijkstra’s algorithm would find the path [A, C, D, E, G]. This is different from breadth-first search because the sum of the weights of this path is less than the sum of the weights of the path with the fewest edges.
There are many valid sequences that may result from the topological sort. Here are two possible outcomes:
[A, B, F, C, D, E, G]
[A, C, B, F, D, E, G]
A minimum spanning tree is a set of edges from a connected, weighted, undirected graph that connects all of the vertices together, without any cycles, and with the minimum possible total edge weight. In this particular graph, the minimum spanning tree would include the edges {(D,E), (D,C), (E,G), (D,F), (F,B), (B,A)} for a total cost of 14.
The worst-case time complexity for adding an edge to the graph is \(O(1)\); the only action necessary is indexing into an array.
The worst-case time complexity to find all edges in the graph is \(O(|V|^2)\), since we have to look at every cell in the matrix to find all of the edges.
The worst-case time complexity of determining a vertex’s out-degree is \(O(|V|)\), since we must examine each vertex individually to determine whether an edge exists to it.
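A minimal C++ sketch of this matrix representation, illustrating why those operations have these costs (MatrixGraph, addEdge, and outDegree are names assumed for this sketch, and vertices are assumed to already be mapped to row/column indices):

\begin{verbatim}
#include <vector>

// weights[a][b] holds the weight of the edge from vertex a to vertex b,
// or 0 when no such edge exists (as in the question above).
struct MatrixGraph {
    std::vector<std::vector<double>> weights;  // |V| x |V| matrix

    explicit MatrixGraph(int vertex_count)
        : weights(vertex_count, std::vector<double>(vertex_count, 0.0)) {}

    // O(1): adding an edge is a single indexed assignment.
    void addEdge(int from, int to, double weight) {
        weights[from][to] = weight;
    }

    // O(|V|): scan the vertex's entire row, counting nonzero cells.
    int outDegree(int from) const {
        int degree = 0;
        for (double w : weights[from])
            if (w != 0.0) ++degree;
        return degree;
    }
};
\end{verbatim}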
The complexity of this copy is \(O(|V|^2)\), proportional to the number of elements copied from the old matrix.
The addVertex routine can instead copy into a matrix of size \((2|V|)^2\), similar to how an ArrayList doubles its own capacity. This operation still copies \(O(|V|^2)\) elements, but it happens far less often. In particular, assuming a starting capacity of 2 (and thus a 2x2 matrix), this doubling strategy would yield the following numbers of copies for each call to addVertex: \(0, 0, 2^2, 0, 4^2, 0, 0, 0, 8^2, 0, 0, 0, 0, 0, 0, 0, 16^2, \ldots\). This is less than \(\sum\limits_{i=1}^{\log_2 n} (2^i)^2 = \dfrac{4}{3}(n^2-1) \in O(n^2)\) copies for \(n\) operations, or amortized \(O(n)\) worst-case time per call. Intuitively, each matrix copy takes more time than all of the previous copies combined, so \(i\) calls to addVertex result in fewer than \(2 \cdot (2^{\lceil\log_2 i\rceil - 1})^2 \leq 2 \cdot i^2\) copies.
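As a sketch of that doubling strategy (building on the illustrative MatrixGraph layout above; size and capacity are assumed member names):

\begin{verbatim}
#include <vector>

// Over-allocate the matrix and double its side length only when full,
// so most addVertex calls do no copying at all.
struct GrowableMatrixGraph {
    int size = 0;       // vertices actually in use
    int capacity = 2;   // allocated rows/columns
    std::vector<std::vector<double>> weights;

    GrowableMatrixGraph()
        : weights(capacity, std::vector<double>(capacity, 0.0)) {}

    void addVertex() {
        if (size == capacity) {
            // An O(|V|^2) copy into a matrix of twice the side length,
            // performed only when the current matrix is full.
            int new_capacity = capacity * 2;
            std::vector<std::vector<double>> bigger(
                new_capacity, std::vector<double>(new_capacity, 0.0));
            for (int r = 0; r < size; ++r)
                for (int c = 0; c < size; ++c)
                    bigger[r][c] = weights[r][c];
            weights = std::move(bigger);
            capacity = new_capacity;
        }
        ++size;  // the new vertex's row and column start out all zero
    }
};
\end{verbatim}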