===========================================================================
CSC B63             Lecture Summary for Week 6              Summer 2008
===========================================================================

[[Q: denotes a question that you should think about and that will be
answered during lecture. ]]

============= Disjoint Sets (aka Union-Find) [ Chapter 21 ] =============

Problem: Given a set of n _distinct_ elements, group them into disjoint
groups.

The Disjoint Set ADT:

  Objects: A collection of nonempty disjoint sets S = {S_1, S_2, ..., S_k},
    i.e., each S_i is a nonempty set that has _no_ element in common with
    any other S_j (S_i intersect S_j = {} for all i <> j). Each set is
    identified by a unique element called its "representative".

  Operations:
    . MAKE-SET(x): Given an element x that does not already belong to one
      of the sets, create a new set {x} that contains only x (and assign x
      as the representative of that new set).
    . FIND-SET(x): Given an element x, return the representative of the
      set that contains x (or some special value, like NIL, if x does not
      belong to any set).
    . UNION(x,y): Given two distinct elements x and y, let S_x be the set
      that contains x and S_y be the set that contains y. This operation
      forms a new set consisting of S_x U S_y and it removes S_x and S_y
      from the collection (since all the sets must be disjoint). It also
      picks a representative for the new set.
      Note: if both x and y belong to the same set already (i.e.,
      S_x = S_y), nothing is done by this operation.

  Applications:
    . Maintaining the set of connected components of a graph:
          for each vertex v in V, MAKE-SET(v)
          for each edge (u,v) in E, UNION(u,v)
      To see if vertices u and v are in the same component, simply test
      FIND-SET(u) ?= FIND-SET(v). This idea is used by some important
      graph algorithms.
    . Maintain lists of duplicate copies of webpages.

---------------------------------
Data Structures for Disjoint Sets:
---------------------------------

0. Array.
    Use an array with one position for each element: information about
    element x is stored at position index[x] (the index of each element is
    maintained separately as part of the outside information). Each
    location in the array stores two pieces of information: a reference to
    the element associated with this location, and an index value. Sets are
    represented implicitly: elements are in the same set iff the
    corresponding array locations store the same index value.

    For example, the collection of sets { {A}, {B,E}, {C,F,G}, {D} } could
    be represented using the following array (the ... indicate that the
    array may have more locations available but currently unused; a counter
    would be used to keep track of the last used position):

         _0___1___2___3___4___5___6__...
        | A | B | C | D | E | F | G |...
        |_0_|_1_|_5_|_3_|_1_|_5_|_5_|...

    Operations:

      . MAKE-SET(x) takes time O(1): store x in the next available
        location, with index value equal to its location (guaranteed to be
        different from all other values currently used).
        For example, MAKE-SET(H) would result in the following array:

         _0___1___2___3___4___5___6___7__...
        | A | B | C | D | E | F | G | H |...
        |_0_|_1_|_5_|_3_|_1_|_5_|_5_|_7_|...

      . FIND-SET(x) takes time O(1): the index value stored at location
        index[x] indicates the representative of the set containing x.
        For example, FIND-SET(C) looks at location index[C] = 2 and finds
        index value 5 stored there, so it returns F, the element stored at
        location 5.

      . UNION(x,y) takes time Theta(n), where n is the number of elements:
        let X be the index value stored at location index[x] and Y be the
        index value stored at index[y]; if X != Y, go through every element
        in the array and replace every index value equal to Y with an X (it
        would also be fine to replace every X with Y instead).
        For example, UNION(A,E) will replace every index value equal to 1
        (the index value stored at location 4 = index[E]) with a 0 (the
        index value stored at location 0 = index[A]), with result:

         _0___1___2___3___4___5___6__...
        | A | B | C | D | E | F | G |...
        |_0_|_0_|_5_|_3_|_0_|_5_|_5_|...

    Worst-case sequence complexity for m operations:

      Upper bound:
        . each operation takes time O(n), where n is the number of elements
        . n is O(m)
        . hence, any sequence of m operations takes time O(m^2).

      Lower bound:
        . perform m/2 MAKE-SETs followed by m/2-1 UNIONs
        . each UNION will take time Omega(m/2) (the number of elements)
        . so the sequence will take time Omega((m/2)*(m/2-1)) = Omega(m^2).

      Therefore the worst-case sequence complexity is Theta(m^2).

1. Circularly-linked list.

    Represent each set by a circularly-linked list, with the first element
    in the list being its representative. (We'll need a "flag" for each
    element to indicate whether it is the representative or not.)

    For example, the collection of sets { {A}, {B,E}, {C,F,G}, {D} } could
    be represented by the following lists, where first field =
    representative flag (1 = true, 0 = false), second field = element,
    third field = link:

        -------------
        | 1 | A | *-+-
        ------------- |
               ^-------

        -------------    -------------
        | 1 | B | *-+--> | 0 | E | *-+-
        -------------    ------------- |
               ^------------------------

        -------------    -------------    -------------
        | 1 | F | *-+--> | 0 | G | *-+--> | 0 | C | *-+-
        -------------    -------------    ------------- |
               ^-----------------------------------------

        -------------
        | 1 | D | *-+-
        ------------- |
               ^-------

    Operations:

      . MAKE-SET(x) takes time O(1): simply create a new linked list with
        one element x.

      . FIND-SET(x) takes time Omega(length of list): in the worst-case, we
        need to traverse every link in a list before we find the first
        element.

      . UNION(x,y) takes time Theta(1 + time for FIND-SET): first, we must
        run FIND-SET(x) and FIND-SET(y) to make sure that x and y don't
        already belong to the same list; if they don't, then we have the
        representative node of each list so we can reset the representative
        flag of the old representative of y's list, and merge the two lists
        by swapping the links of the two representatives. Pictorially:

    BEFORE:
               v-----------------------------------------------------
        -------------    -------------        -------------         |
        | 1 |   | *-+--> | 0 |   | *-+--> ... | 0 | x | *-+--> ... -
        -------------    -------------        -------------

        -------------    -------------        -------------
        | 1 |   | *-+--> | 0 |   | *-+--> ... | 0 | y | *-+--> ... -
        -------------    -------------        -------------         |
               ^-----------------------------------------------------

    AFTER:
               v-----------------------------------------------------
        -------------    -------------        -------------         |
        | 1 |   | *-\  />| 0 |   | *-+--> ... | 0 | x | *-+--> ... -
        -------------\/  -------------        -------------
        -------------/\  -------------        -------------
        | 0 |   | *-/  \>| 0 |   | *-+--> ... | 0 | y | *-+--> ... -
        -------------    -------------        -------------         |
               ^-----------------------------------------------------

    Worst-case sequence complexity for m operations:

      Upper bound: since the number of elements in the structure at any
        point in any sequence of m operations is <= m, the complexity of
        each operation in a sequence is O(m) so the total time is O(m^2).

      Lower bound: perform m/4 MAKE-SETs with different elements, then
        m/4-1 UNIONs to get one list with m/4 elements, then m/2 FIND-SETs
        on the second element in the list, so that each one requires time
        Omega(m/4), for a total time of Omega(m^2).

2. Linked list with extra pointer to front.

    Represent each set by a linked list where each element stores a pointer
    to the next element and also a pointer "back" to the first element in
    the list (the representative).

      . MAKE-SET(x) takes time O(1), as before.

      . FIND-SET(x) now takes time O(1): simply follow the pointer back.

      . UNION(x,y) takes time Omega(length of appended list): append one
        list to the end of the other, and modify all the back pointers of
        the second list.

    Worst-case sequence complexity for m operations: perform m/2+1
    MAKE-SETs with different elements, then perform m/2-1 UNIONs, creating
    one longer and longer list and always appending it to the end of a
    single-element list. Total time is Omega(m^2).

3. Linked list with extra pointer to front and "union-by-weight".

    As before, except that we also keep track of the number of elements in
    each list. MAKE-SET and FIND-SET are not affected (still take time
    O(1)), and when we perform UNION, we always append the smaller set to
    the longer one (so we have fewer pointers to change). This is called
    "union-by-weight" (the "weight" of a set is simply its size).

    Worst-case sequence complexity for m operations: let n be the number of
    MAKE-SET operations in the sequence (so there are never more than n
    elements in total). For some arbitrary element x, we want to prove an
    upper bound on the number of times that x's back pointer can be
    updated. Note that this happens only when the set that contains x is
    UNIONed with a set that is no smaller (because we only update back
    pointers for the smaller set). This means that each time x's back
    pointer is updated, the resulting set must have at least doubled in
    size. Since there are no more than n elements in total in all the sets,
    this means that x's back pointer cannot be updated more than lg(n)
    times. And since this is true for every element x, the total number of
    pointer updates during the entire sequence of operations is
    O(n log n). The time for other operations is still O(1), and there are
    m operations in total, so the total time for the entire sequence is
    O(m + n log n).

4. Trees.

    Represent each set by a tree, where each element points to its parent
    only and the root points back to itself. The representative of a set is
    the root.
    Note that the trees are _not_ necessarily binary trees: the number of
    children of a node can be arbitrarily large (or small).

      . MAKE-SET(x) takes time O(1): just create a new tree with root x.

      . UNION(x,y) takes time O(1) once the roots of the two trees are
        known: just make the root of one of the trees point to the root of
        the second one.

      . FIND-SET(x) takes time O(depth of x): simply follow "parent"
        pointers back to the root of x's tree.

    Worst-case sequence complexity for m operations: just like for the
    linked list with back pointers but no size, we can create a tree that
    is just one long chain with m/4 elements, so that FIND-SET on the
    deepest element takes time Omega(m); if we perform m/2 such FIND-SET
    operations, we get a sequence whose total time is Omega(m^2).

5. Trees with "union-by-weight".

    As before, except we also keep track of the weight (i.e., size) of each
    tree and always append the smaller tree to the larger one when
    performing UNION. The complexity of MAKE-SET is still O(1), and so is
    the complexity of UNION (when one tree is appended to another, we just
    add the two weights as the weight of the new tree).

    What about the complexity of FIND-SET? It is possible to show that
    during any sequence of m operations, n of which are MAKE-SET, the
    maximum height of any tree is O(log n). (The proof is by induction on
    the height h of the trees.) This means that the running time of any
    individual FIND-SET is O(log n), which gives total time of O(m log n)
    for the entire sequence.

6. Trees with path compression.

    When performing FIND-SET(x), keep track of the nodes visited on the
    path from x to the root of the tree (in a stack or queue), and once the
    root is found, update the parent pointers of each node to point
    directly to the root. This at most doubles the running time of the
    FIND-SET operation, but it can speed up future operations considerably.
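
The two-pass FIND-SET described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the dictionary `parent` and the function names `make_set` and `find_set` are hypothetical choices, not part of the lecture, and the example chain is built by hand rather than through UNION.

```python
def make_set(parent, x):
    """MAKE-SET(x): x becomes the root of its own one-node tree."""
    parent[x] = x  # a root points back to itself


def find_set(parent, x):
    """FIND-SET(x) with path compression; returns the root of x's tree."""
    visited = []               # nodes seen on the way up (used as a stack)
    while parent[x] != x:      # first pass: climb until the root is reached
        visited.append(x)
        x = parent[x]
    for node in visited:       # second pass: repoint every visited node
        parent[node] = x
    return x


# Assumed example data: a chain E -> D -> C -> B -> A with root A.
parent = {}
for v in "ABCDE":
    make_set(parent, v)
for child, par in zip("BCDE", "ABCD"):
    parent[child] = par

root = find_set(parent, "E")   # walks the whole chain once
```

After this single call every node on the chain points directly at A, so a later FIND-SET on any of them follows at most one parent pointer.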
    In fact, it is possible to prove (but we won't do it) that the
    worst-case running time of a sequence of operations, if there are n
    MAKE-SET operations (so at most n-1 UNIONs) and `f' FIND-SET
    operations, is

        Theta( f log n / log(1+f/n) )    if f >= n
        Theta( n + f log n )             if f < n

    But we can do even better!

7. Trees with "union-by-rank" and path compression.

    With trees, the measure that matters the most for the running time is
    the height of each tree, not its size. So, instead of using weight to
    decide how to carry out UNION, we use "rank". The rank is an upper
    bound on the height of the tree, but it is not always exactly equal to
    the height (path compression can shorten a tree, and it is not
    efficient to try to keep rank exactly equal to height).

      . MAKE-SET(x): as before, and set the rank of x to 0

      . UNION(x,y): the root with higher rank is the new root and its rank
        is unchanged; if the two roots have the same rank, pick any one as
        the new root and increase its rank by 1

      . FIND-SET(x): use path compression and leave ranks unchanged

    It is possible to prove that the worst-case time for a sequence of m
    operations, where there are n MAKE-SETs, is O(m log* n).

    log* n is an extremely slowly-growing function, equal to the number of
    times you must iteratively apply log to get the value to be 1 or less.

                   { 0,                  if n <= 1
        log* n  =  {
                   { 1 + log*(log n),    if n > 1

    For all practical input values n, log* n has value <= 5. In practice,
    we can consider log* n as bounded by a constant. To see how slowly this
    grows, look at base 2 logarithm, denoted as "lg",

        lg* 2         = 1
        lg* 4         = 2
        lg* 16        = 3
        lg* 65536     = 4
        lg* 2^{65536} = 5

    For context, # atoms in universe is approximately 10^{80} (1 followed
    by 80 zeros), which is much smaller than 2^{65536}.

    [ Actually, it's possible to prove a bound even better than
      O(m log* n). The log* n term can be replaced by alpha(n), where
      alpha(n) denotes the inverse of the Ackermann function A(n,n).
      The Ackermann function A(m,n) grows extremely quickly (one of the
      absolute fastest growth rates we know!), so the inverse grows
      extremely slowly (and more slowly than log* n). It turns out we can
      prove an Omega(m alpha(n)) lower bound too. ]
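
To make structure 7 concrete, here is a minimal Python sketch of a forest with union-by-rank and path compression, applied to the connected-components use mentioned at the start of these notes. The class name `DisjointSets`, the helpers `connected_components` and `lg_star`, and the example graph are assumptions made for illustration; they do not come from the lecture. UNION here is assumed to be called only on elements that already belong to some set.

```python
import math


class DisjointSets:
    """Forest of trees with union-by-rank and path compression (a sketch)."""

    def __init__(self):
        self.parent = {}   # parent[x] == x exactly when x is a root
        self.rank = {}     # upper bound on the height of the tree rooted at x

    def make_set(self, x):
        if x not in self.parent:
            self.parent[x] = x
            self.rank[x] = 0

    def find_set(self, x):
        if x not in self.parent:
            return None                    # plays the role of NIL
        root = x
        while self.parent[root] != root:   # first pass: locate the root
            root = self.parent[root]
        while x != root:                   # second pass: compress the path
            next_node = self.parent[x]
            self.parent[x] = root
            x = next_node
        return root                        # ranks are left unchanged

    def union(self, x, y):
        # Assumes x and y were both added with make_set beforehand.
        root_x, root_y = self.find_set(x), self.find_set(y)
        if root_x == root_y:
            return                         # already in the same set
        if self.rank[root_x] < self.rank[root_y]:
            root_x, root_y = root_y, root_x
        self.parent[root_y] = root_x       # higher-rank root stays the root
        if self.rank[root_x] == self.rank[root_y]:
            self.rank[root_x] += 1         # tie: the new root's rank grows by 1


def connected_components(vertices, edges):
    """MAKE-SET every vertex, then UNION the endpoints of every edge."""
    ds = DisjointSets()
    for v in vertices:
        ds.make_set(v)
    for u, v in edges:
        ds.union(u, v)
    return ds


def lg_star(n):
    """Number of times lg must be applied to n to reach a value <= 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count


# Assumed example graph whose components are {A}, {B,E}, {C,F,G}, {D}.
ds = connected_components("ABCDEFG", [("B", "E"), ("C", "F"), ("F", "G")])
same_component = ds.find_set("C") == ds.find_set("G")
```

Which element ends up as the representative of a merged set depends on the order of the UNION calls, so the sketch compares FIND-SET results with each other instead of relying on a particular representative.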