===========================================================================
        CSC 263        Lecture Summary for Week 4             Fall 2007
===========================================================================

--------------------------
Augmenting Data Structures      [chapter 14]
--------------------------

Sometimes a standard data structure is all you need. Sometimes it is
not quite enough for your needs.
- for example, you might need to be able to do some extra operations

Often you can take a standard data structure and augment it slightly
to fit your needs!

How to augment?
- not automatic!
- often requires some creativity

Often we want to:
- store additional information, and/or
- perform additional operations (efficiently)

An "augmented" data structure is simply an existing data structure
modified to store additional information and/or perform additional
operations. To do it:
1. determine which additional information to store
2. check that this information can be "cheaply" maintained during
   each of the original operations
3. implement the new operations

Example: Dynamic Order Statistics

- We want to maintain a dynamic set S (i.e., the set changes over
  time) of elements (data + key) that allows us to answer two types
  of "rank" queries efficiently, as well as supporting the standard
  operations for maintaining the set (INSERT, DELETE, SEARCH):

  . SEARCH(k): Given a key k, return the data associated with k
  . INSERT(x): Given a key+data x, add it to S
  . DELETE(x): Given a key+data x, remove it from S
  . RANK(k):   Given a key k, what is its "rank", i.e., its position
               among the elements in the data structure?
  . SELECT(r): Given a rank r, what is the key with that rank?

  The first three operations are simply the DICTIONARY ADT
  operations; the last two operations are new.

  For example, if our set of values is {5,15,27,30,56}, then
  RANK(15) = 2 and SELECT(4) = 30.

Let's look at 3 different ways we could do this.

1.
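Before turning to trees, the intended behaviour of RANK and SELECT can be pinned down with a naive sorted-list sketch (my own illustration, not part of the course notes; class and method names are hypothetical). INSERT/DELETE cost O(n) here because of element shifting, which is exactly what the tree-based approaches improve on.

```python
import bisect

# Naive dynamic order statistics over a sorted Python list.
# RANK uses binary search; SELECT is plain indexing. Ranks are 1-based.
class NaiveOrderStatistics:
    def __init__(self):
        self.keys = []                     # kept in sorted order

    def insert(self, k):
        bisect.insort(self.keys, k)        # O(n): shifts elements

    def delete(self, k):
        i = bisect.bisect_left(self.keys, k)
        if i < len(self.keys) and self.keys[i] == k:
            self.keys.pop(i)               # O(n): shifts elements

    def rank(self, k):
        # 1-based position of k among all stored keys
        return bisect.bisect_left(self.keys, k) + 1

    def select(self, r):
        # key whose rank is r
        return self.keys[r - 1]

s = NaiveOrderStatistics()
for k in (5, 15, 27, 30, 56):
    s.insert(k)
print(s.rank(15))   # → 2
print(s.select(4))  # → 30
```

This reproduces the example above: RANK(15) = 2 and SELECT(4) = 30 on the set {5,15,27,30,56}.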
   Use AVL trees (or any balanced BST) without modification.

   . Can do SEARCH/INSERT/DELETE efficiently, but doing SELECT/RANK
     queries is costly!
   . Queries: Simply carry out an inorder traversal of the tree,
     keeping track of the number of nodes visited, until the desired
     rank or key is reached.

   [[Q: What will be the time for a query?]]
   [[Q: Will the other operations (SEARCH/INSERT/DELETE) take any longer?]]
   [[Q: What is the problem? Could we do better?]]

2. Augment AVL trees so that each node x has an additional field
   'rank[x]' that stores its rank in the tree.

   [[Q: What will be the time for a query?]]
   [[Q: Will the other operations (SEARCH/INSERT/DELETE) take any longer?]]
   [[Q: What is the problem? Could we do better?]]

   Good: RANK and SELECT are easy to do.
   Bad:  maintaining the rank info is expensive -- e.g., try adding a
         new element 2: all nodes must be updated!

3. Augment the tree in a more sophisticated way. Consider augmenting
   AVL trees so that each node x has an additional field 'size[x]'
   that stores the number of keys in the subtree rooted at x
   (including x itself). This may seem unrelated to 'rank' at first,
   but we'll see that it is enough to allow us to do what we want.

   - Queries:

     . Consider a node x in an AVL tree. We know that
       rank[x] = 1 + number of keys that come before x in the tree.
       In particular, if we consider only the keys in the subtree
       rooted at x, then rank[x] = size[left child's subtree] + 1
       (this is NOT necessarily the true rank of x in the whole tree,
       only its "relative" rank among the keys in the subtree rooted
       at x).

     . Note: size[x] = size[left(x)] + size[right(x)] + 1 for any
       node x in the AVL tree, where left(x) and right(x) are the
       left and right children of x.

     . RANK(k): Given key k, perform SEARCH on k, keeping track of a
       "current rank" r: each time you go down a level to the right,
       you must add the size of the left subtree that you skipped,
       plus 1 for the key itself that you skipped.
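The RANK(k) walk just described can be sketched in Python on a size-augmented BST. For brevity this sketch is an ordinary (unbalanced) BST rather than an AVL tree, and the node layout and helper names are my own; the size[x] bookkeeping is the same either way.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.size = 1                # number of keys in this subtree

def size(x):
    return x.size if x is not None else 0

def insert(root, key):
    # Plain BST insert (no rebalancing); every node on the search
    # path gains one key in its subtree, so bump size on the way down.
    if root is None:
        return Node(key)
    root.size += 1
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def rank(root, k):
    # Search for k, accumulating the size of each left subtree we
    # skip (plus 1 for the skipped key) whenever we branch right.
    r, x = 0, root
    while x is not None:
        if k < x.key:
            x = x.left
        elif k > x.key:
            r += size(x.left) + 1    # skip left subtree and x itself
            x = x.right
        else:
            return r + size(x.left) + 1
    return None                      # k not in the tree

root = None
for k in (30, 15, 56, 5, 27):
    root = insert(root, k)
print(rank(root, 15))  # → 2
print(rank(root, 56))  # → 5
```

On the set {5,15,27,30,56} this agrees with the earlier example: RANK(15) = 2.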
       [[Q: When we recursively call SEARCH(v_i,k), what do we add to r?]]
       [[Q: When we find x, how do we compute its true rank?]]

     . SELECT(r): Given rank r, start at x = root[T] and work down:

           SELECT(x, r):
               rrx = size[left(x)] + 1    # relative rank of x in its subtree
               if r = rrx then return x
               if r < rrx then return SELECT(left(x), r)
               if r > rrx then return SELECT(right(x), r - rrx)

       Each call goes one level down in the tree. The tree height is
       Theta(log n), so the algorithm is O(log n) in the worst case.

   - Updates:

     INSERT and DELETE operations consist of two phases for AVL
     trees: the operation itself, followed by the fix-up
     (rebalancing) process.

     . INSERT(x):
       Phase 1: Simply increment the size of the subtree rooted at
                every node that is examined while finding the
                position of x (since x will be added in that
                subtree).
       Phase 2: Update the balance factors of the ancestor nodes. If
                a rotation is required, recompute the size values of
                the rotated nodes from their new children.

     . DELETE(x): Consider the leaf y that is actually removed by the
       operation (so y = x or y = successor(x)). We know the size of
       the subtree rooted at every node on the path from the root
       down to y decreases by 1, so we simply traverse that path and
       decrement the size of each node on it. If a rotation is
       necessary, we recompute the size of the rotated nodes from
       their children, since we know those values are correct.

     Update time? We have only added a constant amount of extra work
     at each level of the tree, so the total time in the worst case
     is still Theta(log n).

   - Now we have finally achieved what we wanted: each operation (old
     or new) takes Theta(log n) time in the worst case.

-------
Hashing
-------

Problem 1: Read a text file, keeping track of the number of
occurrences of each character (ASCII codes 0 - 127).

Solution? A Direct-Address Table: simply keep track of the number of
occurrences of each character in an array with 128 positions (one
position for each character). All operations are therefore Theta(1).

Memory usage?
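The SELECT pseudocode above translates almost line for line into Python. This sketch assumes nodes carry key/left/right/size fields (names of my own choosing) and builds a small tree bottom-up so the size fields satisfy size[x] = size[left(x)] + size[right(x)] + 1; AVL rebalancing is omitted.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right
        # size[x] = size[left(x)] + size[right(x)] + 1
        self.size = 1 + (left.size if left else 0) \
                      + (right.size if right else 0)

def select(x, r):
    # Return the node with rank r in the subtree rooted at x.
    rrx = (x.left.size if x.left else 0) + 1   # relative rank of x
    if r == rrx:
        return x
    elif r < rrx:
        return select(x.left, r)
    else:
        return select(x.right, r - rrx)

# A valid BST over {5,15,27,30,56}:
#            27
#           /  \
#         15    56
#        /     /
#       5    30
t = Node(27,
         Node(15, Node(5)),
         Node(56, Node(30)))
print(select(t, 4).key)  # → 30
```

As in the earlier example, SELECT(4) returns 30: the search discards the 3 keys in 27's subtree-relative prefix and continues in the right subtree with rank 4 - 3 = 1.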
128 x size of an integer = 128 x 4 bytes = 512 bytes (for 32-bit
ints) -- small!

Problem 2: Read a data file, keeping track of the number of
occurrences of each integer value (from 0 to 2^{32}-1).

Solution? It would be extremely wasteful (maybe even impossible) to
keep an array with 2^{32} positions, especially when the data files
may contain no more than 10^5 different values (out of all the 2^{32}
possibilities). (To store 2^{32} 32-bit integers would require 16 GB
of storage!)

So instead, we will allocate an array with 10,000 positions (for
example), and figure out a way to map each integer we encounter to
one of those positions. This is called "hashing".

[[Q: Define the ADT that we're using here.]]

Reading assignment:
- 11.1 - 11.3 (except 11.3.3) -- Hash Tables
- Chapter 5 -- Probabilistic Analysis and Randomized Algorithms
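A minimal sketch of the idea, under my own assumptions (the notes do not fix a hash function or collision strategy): map each 32-bit value to one of m = 10,000 positions with a simple modular hash, and resolve collisions by chaining, since distinct values can land in the same slot.

```python
M = 10000                          # table size, far smaller than 2^32

def h(x):
    return x % M                   # a simple (not necessarily good) hash

# Each slot holds a chain of (value, count) pairs.
table = [[] for _ in range(M)]

def record(x):
    # Increment the count of x, scanning only x's chain.
    chain = table[h(x)]
    for i, (v, c) in enumerate(chain):
        if v == x:
            chain[i] = (v, c + 1)
            return
    chain.append((x, 1))

def occurrences(x):
    for v, c in table[h(x)]:
        if v == x:
            return c
    return 0

data = [7, 4000000000, 7, 123456789, 4000000000, 7]
for x in data:
    record(x)
print(occurrences(7))            # → 3
print(occurrences(4000000000))   # → 2
```

With at most ~10^5 distinct values spread over 10^4 slots, each chain stays short on average, so lookups remain cheap even though the table is a tiny fraction of the 2^{32}-entry direct-address table.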