===========================================================================
         CSC B63      Lecture Summary for Week 4           Summer 2008
===========================================================================

-------
B-Trees
-------

AVL trees are a type of balanced BINARY search tree, which gives good
performance of dictionary operations, proportional to log_2 n, where n is
the number of elements in the data structure (the base 2 here comes from
being a binary tree).

What if we use trees with more children? Like ternary trees, etc.?
- expect running time to be proportional to log_k n, if k children
- operations still Theta(log n) though -- no change!
- only affects the constants, but might be important

For example, if we must do a slow, expensive disk read operation each time
we look at a level of the tree, we'd want to make the tree bushier so that
there are fewer levels to look at.

Generalize this idea: B-trees

A B-tree T is a rooted tree (whose root is root[T]) having the following
properties:

1. Every node x has the following fields:
   - n[x], the number of keys currently stored in node x,
   - the n[x] keys themselves, stored in nondecreasing order, so that
     key_1[x] <= key_2[x] <= ... <= key_{n[x]}[x],
   - leaf[x], a boolean value that is TRUE if x is a leaf and FALSE if x
     is an internal node.

2. Each internal node x also contains n[x]+1 pointers
   c_1[x], c_2[x], ..., c_{n[x]+1}[x] to its children. Leaf nodes have no
   children, so their c_i fields are undefined.

3. The keys key_i[x] separate the ranges of keys stored in each subtree:
   if k_i is any key stored in the subtree with root c_i[x], then
   k_1 <= key_1[x] <= k_2 <= key_2[x] <= ... <= key_{n[x]}[x] <= k_{n[x]+1}.

4. All leaves have the same depth, which is the tree's height h.

5. There are lower and upper bounds on the number of keys a node can
   contain. These bounds are expressed in terms of a fixed integer t >= 2
   called the minimum degree of the B-tree:
   - Every node other than the root must have at least t - 1 keys, so
     every internal node other than the root has at least t children. If
     the tree is nonempty, the root must have at least one key.
   - Every node can contain at most 2t - 1 keys. Therefore, an internal
     node can have at most 2t children. We say that a node is full if it
     contains exactly 2t - 1 keys.

The simplest B-tree occurs when t = 2. Every internal node then has either
2, 3, or 4 children, and we have a 2-3-4 tree. In practice, however, much
larger values of t are used, typically matched to the block size of hard
disks for efficient I/O.

[[Q: Consider keeping a search tree on disk. How many disk accesses would
     it take to search a tree of height h? Why is this so important?]]

When the OS reads from disk, it reads a minimum of 1 block of data from
the disk. That means that if all you want is one 2-node from a 2-3-4 tree
(possibly 12 bytes), the OS must still read a full block. It also has to
spin the disk to the correct sector, move the head to the correct track,
etc.

[[Q: So how could we design the tree better to take advantage of the fact
     that disk access is expensive but once we do the access we read at
     least one full block of data?]]

We can compute the worst-case height of a B-tree with n elements (in terms
of t): the height is at most log_t((n+1)/2), where log_t means logarithm
base t (recall t is fixed). For example, with t = 1001, a B-tree holding
one billion keys has height at most log_1001((10^9 + 1)/2) < 3, i.e.,
height at most 2, so any search reads at most 3 nodes from disk.
[proof is in textbook, and useful to read]
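To make the node fields and the one-block-read-per-level search concrete,
here is a minimal Python sketch (the names are ours, not the textbook's;
the attributes mirror n[x], key_i[x], c_i[x] and leaf[x]):

    class BTreeNode:
        """One node of a B-tree with minimum degree t."""
        def __init__(self, t, leaf=True):
            self.t = t            # minimum degree (same for all nodes)
            self.keys = []        # the n[x] keys, in nondecreasing order
            self.children = []    # n[x]+1 child pointers (empty if leaf)
            self.leaf = leaf      # leaf[x]

        @property
        def n(self):
            return len(self.keys)     # n[x]

        def is_full(self):
            # At most 2t - 1 keys: insertion must split a full node
            # before adding to it (see below).
            return self.n == 2 * self.t - 1

        def is_minimal(self):
            # Non-root nodes keep at least t - 1 keys: deletion must
            # merge (or borrow) before descending into such a node.
            return self.n == self.t - 1

    def btree_search(x, k):
        """Search for key k starting at node x; return (node, index) or
        None.  Each call touches one node -- one disk read per level."""
        i = 0
        while i < x.n and k > x.keys[i]:
            i += 1
        if i < x.n and k == x.keys[i]:
            return (x, i)
        if x.leaf:
            return None
        # The textbook's version would do a DISK-READ of the child here.
        return btree_search(x.children[i], k)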
Note that when inserting a key in a B-tree, nodes might get too full
(overflow), and when deleting a key, nodes might get too empty
(underflow). Thus the insertion and deletion algorithms must SPLIT and
MERGE nodes as necessary to maintain the B-tree properties.

Example: In a B-tree with t=2 (a 2-3-4 tree), an internal node can have
2, 3 or 4 children. Here is an example:

                    ______17______
                   /              \
             __4____9__       _20__30__41__
            /     |     \     /   |   |   \
         1_2_3   7_8    12   18  24  33  56_80

Consider the node (4,9): it contains two values and has three children
(the nodes (1,2,3), (7,8), and (12)).

[[Q: What are the possible numbers of values a 2-3-4 tree internal node
     may contain?]]

Notice the property relating the values in a subtree to the values in the
parent node of its root. This is formally defined in section 3.3.1
(Multi-way Search Trees) in the text but is quite simple to see
informally.
- Consider the subtree rooted at node (20,30,41). All values in this
  subtree must be greater than 17. Now consider the subtree rooted at
  node (7,8). All values in this subtree must be greater than 4 and less
  than 9.

[[Q: What range of values would be allowed in any subtree rooted at node
     (33)?]]

[[Q: If we were to insert the value 15 into the tree above, where would
     it need to go to preserve the order property?]]

Another property which must hold in 2-3-4 trees is that all external
nodes must be at the same depth. (This is the "depth property".)

[[Q: Why do we want this property?]]

[[Q: Given these two properties, what can we say about the height of a
     2-3-4 tree which stores n items?
     Hint: First consider how to relate the number of external nodes in
     a 2-3-4 tree to the number of values stored in the internal nodes.
     Next let h be the height of a 2-3-4 tree and then determine the
     maximum and minimum number of external nodes as a function of h.]]

[[Now, think about the three non-trivial operations (SEARCH, INSERT,
  DELETE) and how to perform them on a 2-3-4 tree.]]

Details of these operations will be covered in tutorial.

Reading assignment: Chapter 18 - B-Trees

--------------------------
Augmenting Data Structures      [chapter 14]
--------------------------

Sometimes you only need a standard data structure. Sometimes they are not
quite enough for your needs.
- for example, might need to be able to do some extra operations

Often you can take a standard data structure and augment it slightly to
fit your needs!

How to augment?
- not automatic!
- often requires some creativity

Often want to:
- store additional information, and/or
- perform additional operations (efficiently)

An "augmented" data structure is simply an existing data structure
modified to store additional information and/or perform additional
operations. To do it:
1. determine which additional information to store
2. check that this information can be "cheaply" maintained during each
   of the original operations
3. implement the new operations

Example: Dynamic Order Statistics

- We want to maintain a dynamic set S (i.e., the set changes over time)
  of elements (data + key) that will allow us to perform the following
  operations efficiently: answer two types of "rank" queries on sets of
  values, as well as having standard operations for maintaining the set
  (INSERT, DELETE, SEARCH). A sketch of the interface follows this list.
  . SEARCH(k): Given a key k, return the data associated with k
  . INSERT(x): Given a key+data element x, add it to S
  . DELETE(x): Given a key+data element x, remove it from S
  . RANK(k): Given a key k, what is its "rank", i.e., its position among
    the elements in the data structure?
  . SELECT(r): Given a rank r, what is the key with that rank?
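As a concrete target, here is the interface sketched in Python (the class
and method names are hypothetical, just to pin down the contract before
we look at implementations):

    class OrderStatisticSet:
        """Dynamic set S supporting dictionary operations plus rank
        queries.  Interface sketch only; implementations are discussed
        below."""

        def search(self, k):
            """SEARCH(k): return the data associated with key k."""

        def insert(self, x):
            """INSERT(x): add the key+data element x to S."""

        def delete(self, x):
            """DELETE(x): remove the key+data element x from S."""

        def rank(self, k):
            """RANK(k): return the position of key k among all keys."""

        def select(self, r):
            """SELECT(r): return the key whose rank is r."""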
The first three operations are simply the DICTIONARY ADT operations; the
last two operations are new. For example, if our set of values is
{5,15,27,30,56}, then RANK(15) = 2 and SELECT(4) = 30.

Let's look at 3 different ways we could do this.

1. Use AVL trees (or any balanced BST) without modification.
   . can do SEARCH/INSERT/DELETE efficiently, but doing SELECT/RANK
     queries is costly!
   . Queries: Simply carry out an inorder traversal of the tree, keeping
     track of the number of nodes visited, until the desired rank or key
     is reached.

   [[Q: What will be the time for a query?]]
   [[Q: Will the other operations (SEARCH/INSERT/DELETE) take any
        longer?]]
   [[Q: What is the problem? Could we do better?]]

2. Augment AVL trees so that each node has an additional field 'rank[x]'
   that stores its rank in the tree.

   [[Q: What will be the time for a query?]]
   [[Q: Will the other operations (SEARCH/INSERT/DELETE) take any
        longer?]]
   [[Q: What is the problem? Could we do better?]]

   Good: RANK and SELECT easy to do
   Bad:  maintaining rank info is expensive
         - e.g., try adding a new element 2: it is smaller than every
           existing key, so the rank stored in every node must be
           updated!

3. Augment the tree in a more sophisticated way. Consider augmenting AVL
   trees so that each node x has an additional field 'size[x]' that
   stores the number of keys in the subtree rooted at x (including x
   itself). This may seem unrelated to 'rank' at first, but we'll see
   that it is enough to allow us to do what we want.

   - Queries:
     . Consider a node x in an AVL tree. We know that
       rank[x] = 1 + number of keys that come before x in the tree.
       In particular, if we consider only the keys in the subtree rooted
       at x, then rank[x] = size[left(x)] + 1 (this is NOT necessarily
       the true rank of x in the whole tree, only its "relative" rank
       among the keys in the subtree rooted at x).
     . Note: size[x] = size[left(x)] + size[right(x)] + 1 for any node x
       in the AVL tree, where left(x) and right(x) are the left and
       right children of x.
     . RANK(k): Given key k, perform SEARCH on k keeping track of the
       "current rank" r: each time you go down a level, you must add to
       r the sizes of the subtrees to the left that you skipped, plus 1
       for each key itself that you skipped.

       [[Q: When we recursively call SEARCH(v_i,k) what do we add to r?]]
       [[Q: When we find x how do we compute its true rank?]]

     . SELECT(r): Given rank r, start at x = root[T] and work down,
       where rrx denotes the relative rank of x within its subtree:

           SELECT(x, r):
               rrx = size[left(x)] + 1
               if r = rrx then return x
               if r < rrx then return SELECT(left(x), r)
               if r > rrx then return SELECT(right(x), r - rrx)

       Each call goes one level down in the tree. Tree height is
       Theta(log n), so the algorithm is O(log n) in the worst case.

   - Updates: INSERT and DELETE operations consist of two phases for AVL
     trees: the operation itself, followed by the fix-up (rebalancing)
     process.
     . INSERT(x):
       Phase 1: Simply increment the size of the subtree rooted at every
         node that is examined when finding the position of x (since x
         will be added in that subtree).
       Phase 2: Update balance factors of ancestor nodes. If a rotation
         is required, recompute the size values of the rotated nodes
         from their new children.
     . DELETE(x): Consider the node y that is actually removed by the
       operation (so y = x or y = successor(x)). We know the size of the
       subtree rooted at every node on the path from the root down to y
       decreases by 1, so we simply traverse that path to decrement the
       size of each node on it. If a rotation is necessary, we recompute
       the sizes of the rotated nodes from their children, since we know
       those values are correct.

     Update time?
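The only step whose cost is not obvious is the rotation fix-up. Here is a
minimal Python sketch of a left rotation that repairs the size fields
(assuming nodes with 'left', 'right' and 'size' attributes -- our own
illustration, not code from the text); the right rotation is symmetric:

    def left_rotate(x):
        """Rotate left around x (x.right must exist) and fix up the
        size fields; returns the new root of this subtree."""
        def sz(node):                 # size of a possibly-empty subtree
            return node.size if node is not None else 0

        y = x.right                   # y moves up; x becomes y's left child
        x.right = y.left
        y.left = x
        # Only x and y changed subtree membership; recompute bottom-up
        # using size[x] = size[left(x)] + size[right(x)] + 1.
        x.size = sz(x.left) + sz(x.right) + 1
        y.size = sz(y.left) + sz(y.right) + 1
        return y

This is a constant amount of work per rotation.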
     We have only added a constant amount of extra work at each level of
     the tree, so the total time in the worst case is still Theta(log n).

   - Now we have finally achieved what we wanted: each operation (old or
     new) takes time Theta(log n) in the worst case.
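Finally, to put the query side together, here is a sketch of RANK on the
size-augmented tree, following the description above (same hypothetical
node attributes: 'key', 'left', 'right', 'size'):

    def rank(root, k):
        """Return the rank of key k in a size-augmented BST, or None if
        k is not present: walk down as in SEARCH, adding to r the size
        of every left subtree we skip plus 1 for each key we skip."""
        def sz(node):
            return node.size if node is not None else 0

        r = 0
        x = root
        while x is not None:
            if k < x.key:
                x = x.left                 # nothing is skipped
            elif k > x.key:
                r += sz(x.left) + 1        # skip x's left subtree and x
                x = x.right
            else:
                return r + sz(x.left) + 1  # x's relative rank, offset by r
        return None

On the earlier example set {5,15,27,30,56}, rank(root, 15) returns 2,
matching RANK(15) = 2.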