===========================================================================
CSC B63                 Lecture Summary for Week  6             Summer 2008
===========================================================================

[[Q:  denotes a question that you should think about and
      that will be answered during lecture.  ]]

=============
Disjoint Sets (aka Union-Find) [ Chapter 21 ]
=============

Problem: Given a set of n _distinct_ elements, group them into disjoint groups.

The Disjoint Set ADT:

Objects: A collection of nonempty disjoint sets S = {S_1, S_2, ..., S_k},
  i.e., each S_i is a nonempty set that has _no_ element in common with any
  other S_j (S_i intersect S_j = {} for all i <> j).
Each set is identified by a unique element called its "representative".

Operations:

  . MAKE-SET(x): Given an element x that does not already belong to one of
                 the sets, create a new set {x} that contains only x (and
                 assign x as the representative of that new set).

  . FIND-SET(x): Given an element x, return the representative of the set
                 that contains x (or some special value, like NIL, if x
                 does not belong to any set).

  . UNION(x,y): Given two distinct elements x and y, let S_x be the set
                that contains x and S_y be the set that contains y.  This
                operation forms a new set consisting of S_x U S_y and it
                removes S_x and S_y from the collection (since all the sets
                must be disjoint).  It also picks a representative for the
                new set.  Note: if both x and y belong to the same set
                already (i.e., S_x = S_y), nothing is done by this
                operation.

Applications:

  . Maintaining the set of connected components of a graph:

       for each vertex v in V, MAKE-SET(v)
       for each edge (u,v) in E, UNION(u,v)

    To see if vertices u and v are in the same component, simply test
    FIND-SET(u) ?= FIND-SET(v).

    This idea is used by some important graph algorithms.

  . Maintaining lists of duplicate copies of webpages.
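
    The connected-components application above can be sketched in Python
    as follows.  This is only an illustration: the dictionary `parent`,
    the helper `same_component` and the example graph are assumptions
    made for this sketch, and the disjoint-set code is a deliberately
    simple stand-in (any of the data structures described below could be
    used instead).

```python
# Minimal disjoint-set stand-in (plain parent-pointer forest) so that the
# example is self-contained; it is not tuned for speed.
parent = {}

def make_set(x):
    parent[x] = x              # x is the representative of {x}

def find_set(x):
    while parent[x] != x:      # follow parent pointers up to the root
        x = parent[x]
    return x

def union(x, y):
    rx, ry = find_set(x), find_set(y)
    if rx != ry:               # nothing to do if already in the same set
        parent[ry] = rx

def connected_components(vertices, edges):
    for v in vertices:
        make_set(v)
    for u, v in edges:
        union(u, v)

def same_component(u, v):
    return find_set(u) == find_set(v)

# Hypothetical example graph with components {a, b, c} and {d, e}.
V = ["a", "b", "c", "d", "e"]
E = [("a", "b"), ("b", "c"), ("d", "e")]
connected_components(V, E)
```

    After the call, same_component("a", "c") holds and
    same_component("a", "d") does not.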

---------------------------------
Data Structures for Disjoint Sets:
---------------------------------

 0. Array.

    Use an array with one position for each element: information about
    element x is stored at position index[x] (the index of each element is
    maintained separately as part of the outside information).

    Each location in the array stores two pieces of information: a
    reference to the element associated with this location, and an index
    value.  Sets are represented implicitly: elements are in the same set
    iff the corresponding array locations store the same index value.

    For example, the collection of sets

        { {A}, {B,E}, {C,F,G}, {D} }

    could be represented using the following array (the ... indicate that
    the array may have more locations available but currently unused; a
    counter would be used to keep track of the last used position):

             _0___1___2___3___4___5___6__...
            | A | B | C | D | E | F | G |...
            |_0_|_1_|_5_|_3_|_1_|_5_|_5_|...

    Operations:

      . MAKE-SET(x) takes time O(1): store x in the next available
        location, with index value equal to its location (guaranteed to be
        different from all other values currently used).
        For example, MAKE-SET(H) would result in the following array:

                 _0___1___2___3___4___5___6___7__...
                | A | B | C | D | E | F | G | H |...
                |_0_|_1_|_5_|_3_|_1_|_5_|_5_|_7_|...

      . FIND-SET(x) takes time O(1): the index value stored at location
        index[x] indicates the representative of the set containing x.
        For example, FIND-SET(C) looks at location index[C] = 2 and finds
        index value 5 stored there, so it returns F, the element stored at
        location 5.

      . UNION(x,y) takes time Theta(n), where n is the number of elements:
        let X be the index value stored at location index[x] and Y be the
        index value stored at index[y]; if X != Y, go through every element
        in the array and replace every index value equal to Y with an X
        (it would also be fine to replace every X with Y instead).
        For example, UNION(A,E) will replace every index value equal to 1
        (the index value stored at location 4 = index[E]) with a 0 (the
        index value stored at location 0 = index[A]), with result:

                 _0___1___2___3___4___5___6__...
                | A | B | C | D | E | F | G |...
                |_0_|_0_|_5_|_3_|_0_|_5_|_5_|...

    Worst-case sequence complexity for m operations:
    Upper bound:
      . each operation takes time O(n), where n is the number of elements
      . n is O(m)
      . hence, any sequence of m operations takes time O(m^2).
    Lower bound:
      . perform m/2 MAKE-SETs followed by m/2-1 UNIONs
      . each UNION will take time Omega(m/2) (the number of elements)
      . so the sequence will take time Omega((m/2)*(m/2-1)) = Omega(m^2).
    Therefore the worst-case sequence complexity is Theta(m^2).
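
    A minimal Python sketch of the array representation, assuming a
    dictionary plays the role of the "outside information" index[x].
    The class name and the field names (elems, label, index) are
    hypothetical; the sequence of operations at the end rebuilds the
    example array shown above.

```python
class ArrayDisjointSets:
    def __init__(self):
        self.elems = []   # elems[i] = element stored at location i
        self.label = []   # label[i] = index value stored at location i
        self.index = {}   # index[x] = location of element x

    def make_set(self, x):
        i = len(self.elems)        # next available location
        self.elems.append(x)
        self.label.append(i)       # differs from every value in use
        self.index[x] = i

    def find_set(self, x):
        if x not in self.index:
            return None            # x does not belong to any set
        return self.elems[self.label[self.index[x]]]

    def union(self, x, y):
        X = self.label[self.index[x]]
        Y = self.label[self.index[y]]
        if X != Y:
            # Theta(n): scan the whole array, relabelling Y as X.
            for i in range(len(self.label)):
                if self.label[i] == Y:
                    self.label[i] = X

s = ArrayDisjointSets()
for e in "ABCDEFG":
    s.make_set(e)
s.union("B", "E")
s.union("F", "C")
s.union("F", "G")
```

    At this point s.label is [0, 1, 5, 3, 1, 5, 5], as in the first
    picture, and s.union("A", "E") turns it into [0, 0, 5, 3, 0, 5, 5].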

 1. Circularly-linked list.

    Represent each set by a circularly-linked list, with the first element
    in the list being its representative. (We'll need a "flag" for
    each element to indicate whether it is the representative or not.)

    For example, the collection of sets

        { {A}, {B,E}, {C,F,G}, {D} }

    could be represented by the following lists, where first field =
    representative flag (1 = true, 0 = false), second field = element,
    third field = link:

            -------------
            | 1 | A | *-+-
            ------------- |
                  ^-------

            -------------    -------------
            | 1 | B | *-+--> | 0 | E | *-+-
            -------------    ------------- |
                  ^------------------------

            -------------    -------------    -------------
            | 1 | F | *-+--> | 0 | G | *-+--> | 0 | C | *-+-
            -------------    -------------    ------------- |
                  ^-----------------------------------------

            -------------
            | 1 | D | *-+-
            ------------- |
                  ^-------

    Operations:

      . MAKE-SET(x) takes time O(1): simply create a new linked list with
        one element x.

      . FIND-SET(x) takes time Theta(length of list) in the worst case: we
        may need to traverse every link in a list before we find the first
        element.

      . UNION(x,y) takes time Theta(1 + time for FIND-SET): first, we must
        run FIND-SET(x) and FIND-SET(y) to make sure that x and y don't
        already belong to the same list; if they don't, then we have the
        representative node of each list so we can reset the representative
        flag of the old representative of y's list, and merge the two lists
        by swapping the links of the two representatives.  Pictorially:

        BEFORE:

                  v-----------------------------------------------------
            -------------    -------------        -------------         |
            | 1 |   | *-+--> | 0 |   | *-+--> ... | 0 | x | *-+--> ... -
            -------------    -------------        -------------
            -------------    -------------        -------------
            | 1 |   | *-+--> | 0 |   | *-+--> ... | 0 | y | *-+--> ... -
            -------------    -------------        -------------         |
                  ^-----------------------------------------------------

        AFTER:

                  v-----------------------------------------------------
            -------------    -------------        -------------         |
            | 1 |   | *-\  />| 0 |   | *-+--> ... | 0 | x | *-+--> ... -
            -------------\/  -------------        -------------
            -------------/\  -------------        -------------
            | 0 |   | *-/  \>| 0 |   | *-+--> ... | 0 | y | *-+--> ... -
            -------------    -------------        -------------         |
                  ^-----------------------------------------------------

    Worst-case sequence complexity for m operations: 
    Upper bound: since the number of elements in the structure at any
    point in any sequence of m operations is <= m, the complexity of
    each operation in a sequence is O(m) so the total time is O(m^2).
    Lower bound: perform m/4 MAKE-SETs with different elements, then m/4-1
    UNIONs to get one list with m/4 elements, then m/2 FIND-SETs on the
    second element in the list, so that each one requires time Omega(m/4),
    for a total time of Omega(m^2).
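
    The link swap used by UNION can be sketched in Python as follows.
    The names Node, is_rep and members are assumptions made for this
    illustration; the operations at the end rebuild the example lists
    shown above.

```python
class Node:
    def __init__(self, elem):
        self.is_rep = True   # representative flag
        self.elem = elem
        self.next = self     # a single element linked to itself

def make_set(x):
    return Node(x)

def find_set(node):
    # Walk around the circle until the representative is reached.
    while not node.is_rep:
        node = node.next
    return node

def union(a, b):
    ra, rb = find_set(a), find_set(b)
    if ra is rb:
        return               # already in the same list
    rb.is_rep = False        # old representative of b's list
    # Swapping the two links splices the two circles into one.
    ra.next, rb.next = rb.next, ra.next

def members(node):
    # Elements of node's list, starting from the representative.
    rep = find_set(node)
    out, cur = [rep.elem], rep.next
    while cur is not rep:
        out.append(cur.elem)
        cur = cur.next
    return out

n = {e: make_set(e) for e in "ABCDEFG"}
union(n["B"], n["E"])
union(n["F"], n["C"])
union(n["F"], n["G"])
```

    With this order of UNIONs, members(n["C"]) lists F, G, C, matching
    the third list in the picture.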

 2. Linked list with extra pointer to front.

    Represent each set by a linked list where each element stores a pointer
    to the next element and also a pointer "back" to the first element in
    the list (the representative).

      . MAKE-SET(x) takes time O(1), as before.

      . FIND-SET(x) now takes time O(1): simply follow the pointer back.

      . UNION(x,y) takes time Omega(length of appended list): append one
        list to the end of the other, and modify all the back pointers of
        the second list.

    Worst-case sequence complexity for m operations: perform m/2+1
    MAKE-SETs with different elements, then perform m/2-1 UNIONs, creating
    one longer and longer list and always appending it to the end of a
    single-element list.  Total time is Omega(m^2).

 3. Linked list with extra pointer to front and "union-by-weight".

    As before, except that we also keep track of the number of elements in
    each list.  MAKE-SET and FIND-SET are not affected (still take time
    O(1)), and when we perform UNION, we always append the shorter list to
    the longer one (so we have fewer pointers to change).  This is called
    "union-by-weight" (the "weight" of a set is simply its size).

    Worst-case sequence complexity for m operations: let n be the number of
    MAKE-SET operations in the sequence (so there are never more than n
    elements in total).  For some arbitrary element x, we want to prove an
    upper bound on the number of times that x's back pointer can be
    updated.  Note that this happens only when the set that contains x is
    UNIONed with a set that is no smaller (because we only update back
    pointers for the smaller set).  This means that each time x's back
    pointer is updated, the resulting set must have at least doubled in
    size.  Since there are no more than n elements in total in all the
    sets, this means that x's back pointer cannot be updated more than
    lg(n) times.  And since this is true for every element x, the total
    number of pointer updates during the entire sequence of operations is
    O(n log n).  The time for other operations is still O(1), and there are
    m operations in total, so the total time for the entire sequence is
    O(m + n log n).
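
    A hedged Python sketch of union-by-weight on linked lists, which also
    counts back-pointer updates so that the bound above can be checked on
    a small case.  The field names (rep, tail, size) and the choice to
    keep tail and size only on the representative are assumptions made
    for this illustration.

```python
class Node:
    def __init__(self, elem):
        self.elem = elem
        self.next = None   # next element in the list
        self.rep = self    # "back" pointer to the first element
        self.tail = self   # last element (maintained on representatives only)
        self.size = 1      # weight (maintained on representatives only)

def find_set(node):
    return node.rep        # O(1): follow the back pointer

def union(a, b):
    """Merge the two lists; return the number of back pointers changed."""
    big, small = find_set(a), find_set(b)
    if big is small:
        return 0
    if big.size < small.size:
        big, small = small, big
    big.tail.next = small          # append the smaller list
    big.tail = small.tail
    big.size += small.size
    updates = 0
    node = small
    while node is not None:        # fix back pointers of the smaller list
        node.rep = big
        updates += 1
        node = node.next
    return updates

# Merge n = 8 singletons in rounds of equal-sized pairs.
n = 8
nodes = [Node(i) for i in range(n)]
total_updates = 0
step = 1
while step < n:
    for i in range(0, n, 2 * step):
        total_updates += union(nodes[i], nodes[i + step])
    step *= 2
```

    Here each of the lg(8) = 3 rounds changes 4 back pointers, so
    total_updates is 12, within the n lg(n) = 24 bound.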

 4. Trees.

    Represent each set by a tree, where each element points to its parent
    only and the root points back to itself.  The representative of a set
    is the root.  Note that the trees are _not_ necessarily binary trees:
    the number of children of a node can be arbitrarily large (or small).

      . MAKE-SET(x) takes time O(1): just create a new tree with root x.

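      . UNION(x,y) takes time O(1) once the roots of the two trees are
        known: just make the root of one of the trees point to the root
        of the second one.  (Given arbitrary elements x and y, the roots
        are first located with FIND-SET(x) and FIND-SET(y).)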

      . FIND-SET(x) takes time O(depth of x): simply follow "parent"
        pointers back to the root of x's tree.

    Worst-case sequence complexity for m operations: just like for the
    circularly-linked list, we can perform m/4 MAKE-SETs and m/4-1 UNIONs
    to create a tree that is just one long chain with m/4 elements, so
    that FIND-SET on the deepest element takes time Omega(m); if we then
    perform m/2 such FIND-SET operations, we get a sequence whose total
    time is Omega(m^2).

 5. Trees with "union-by-weight".

    As before, except we also keep track of the weight (i.e., size) of each
    tree and always append the smaller tree to the larger one when
    performing UNION.  The complexity of MAKE-SET is still O(1), and so is
    the complexity of UNION (when one tree is appended to another, we just
    add the two weights as the weight of the new tree).

    What about the complexity of FIND-SET?  It is possible to show that
    during any sequence of m operations, n of which are MAKE-SET, the
    maximum height of any tree is O(log n).  (The proof is by induction on
    the height h of the trees: a tree of height h contains at least 2^h
    elements.)  This means that the running time of any individual
    FIND-SET is O(log n), which gives total time of O(m log n) for the
    entire sequence.
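
    A small Python sketch of trees with union-by-weight, assuming
    dictionaries `parent` and `weight` hold the per-element fields (both
    names, and the helper `depth`, are hypothetical).  The example merges
    16 singletons in rounds of equal-weight pairs, which is the pattern
    that makes the trees as tall as this rule allows.

```python
parent, weight = {}, {}

def make_set(x):
    parent[x] = x
    weight[x] = 1

def find_set(x):
    while parent[x] != x:
        x = parent[x]
    return x

def union(x, y):
    rx, ry = find_set(x), find_set(y)
    if rx == ry:
        return
    if weight[rx] < weight[ry]:
        rx, ry = ry, rx            # rx is now the heavier root
    parent[ry] = rx                # hang the lighter tree under it
    weight[rx] += weight[ry]

def depth(x):
    # Number of parent pointers between x and its root.
    d = 0
    while parent[x] != x:
        x = parent[x]
        d += 1
    return d

n = 16
for i in range(n):
    make_set(i)
step = 1
while step < n:
    for i in range(0, n, 2 * step):
        union(i, i + step)
    step *= 2

max_depth = max(depth(i) for i in range(n))
```

    In this run max_depth is 4 = lg(16), so the O(log n) height bound is
    met with equality.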

 6. Trees with path compression.

    When performing FIND-SET(x), keep track of the nodes visited on the
    path from x to the root of the tree (in a stack or queue), and once the
    root is found, update the parent pointers of each node to point
    directly to the root.  This at most doubles the running time of the
    FIND-SET operation, but it can speed up future operations considerably.
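
    A minimal Python sketch of FIND-SET with path compression, assuming
    parent pointers are kept in a dictionary `parent` (a hypothetical
    name) and the visited nodes are remembered in a list.  The example
    chain is built by hand purely for illustration.

```python
# Hand-built chain 4 -> 3 -> 2 -> 1 -> 0, where 0 is the root.
parent = {0: 0, 1: 0, 2: 1, 3: 2, 4: 3}

def find_set(x):
    path = []                  # nodes visited on the way to the root
    while parent[x] != x:
        path.append(x)
        x = parent[x]
    for node in path:          # second pass: point each one at the root
        parent[node] = x
    return x

root = find_set(4)
```

    After this single call every node on the path points directly to
    the root, so later FIND-SETs on 1, 2, 3 or 4 follow one pointer.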

    In fact, it is possible to prove (but we won't do it) that the
    worst-case running time of a sequence of operations, if there
    are n MAKE-SET operations (so at most n-1 UNIONs) and `f' FIND-SET
    operations, is

        Theta( f log n / log(1+f/n) )        if f >= n

        Theta( n + f log n)                  if f < n

    But we can do even better!

 7. Trees with "union-by-rank" and path compression.

    With trees, the measure that matters the most for the running time is
    the height of each tree, not its size.  So, instead of using weight to
    decide how to carry out UNION, we use "rank".  The rank is an upper
    bound on the height of the tree, but it is not always exactly equal to
    the height (i.e., it is not efficient to try to keep rank exactly equal
    to height).

      . MAKE-SET(x): as before, and set the rank of x to 0

      . UNION(x,y): of the two roots, the one with higher rank becomes the
        new root and its rank is unchanged; if the two roots have the same
        rank, pick any one as the new root and increase its rank by 1

      . FIND-SET(x): use path compression and leave ranks unchanged

    It is possible to prove that the worst-case time for a sequence of m
    operations, where there are n MAKE-SETs, is O(m log* n).
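
    The three operations can be sketched in Python as follows.  This is
    an illustration under assumptions: the class name DisjointSets and
    the dictionaries `parent` and `rank` are hypothetical, and path
    compression is done with a second pass rather than a stack.

```python
class DisjointSets:
    def __init__(self):
        self.parent = {}
        self.rank = {}

    def make_set(self, x):
        self.parent[x] = x
        self.rank[x] = 0

    def find_set(self, x):
        root = x
        while self.parent[root] != root:   # first pass: locate the root
            root = self.parent[root]
        while x != root:                   # second pass: compress the path
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root                        # ranks are left unchanged

    def union(self, x, y):
        rx, ry = self.find_set(x), self.find_set(y)
        if rx == ry:
            return
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx                # rx has the higher (or equal) rank
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1             # tie: new root's rank goes up by 1

ds = DisjointSets()
for i in range(8):
    ds.make_set(i)
step = 1
while step < 8:
    for i in range(0, 8, 2 * step):
        ds.union(i, i + step)
    step *= 2
```

    After these UNIONs the root 0 has rank 3.  A later ds.find_set(7)
    shortens the path from 7 but leaves the rank at 3, which shows why
    rank is only an upper bound on the height.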

    log* n is an extremely slowly-growing function, equal to the number of
    times you must iteratively apply log to get the value to be 1 or less.
    
                { 0,  if n <= 1
      log* n =  {
                { 1 + log*(log n),  if n > 1

    For all practical input values n, log* n has value <= 5. 
    In practice, we can consider log* n as bounded by a constant.
    
    To see how slowly this grows, look at base 2 logarithm, denoted as "lg",
        lg* 2 = 1
        lg* 4 = 2
        lg* 16 = 3
        lg* 65536 = 4
        lg* 2^{65536} = 5

    For context, the number of atoms in the observable universe is
    approximately 10^{80} (1 followed by 80 zeros), which is much smaller
    than 2^{65536}.
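
    The values in the table above can be checked with a short Python
    function that follows the definition of log* with base 2.  The name
    lg_star is hypothetical, and the sketch relies on math.log2, which
    is exact for the powers of two used here but may round for other
    inputs.

```python
import math

def lg_star(n):
    # Count how many times lg must be applied before the value is <= 1.
    count = 0
    while n > 1:
        n = math.log2(n)   # accepts very large Python integers as well
        count += 1
    return count

values = [lg_star(x) for x in (2, 4, 16, 65536, 2 ** 65536)]
```

    For these five inputs the function returns 1, 2, 3, 4 and 5.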

    [ Actually, it's possible to prove a bound even better than O(m log* n).
      The log* n term can be replaced by alpha(n), where alpha(n)
      denotes the inverse of the Ackermann function A(n,n).
      The Ackermann function A(m,n) grows extremely quickly (one of the
      absolute fastest growth rates we know!), so its inverse grows
      extremely slowly (and more slowly than log* n).
      It turns out we can prove an Omega(m alpha(n)) lower bound too. ]