===========================================================================
        CSC 263        Lecture Summary for Week 4             Fall 2007
===========================================================================

--------------------------
Augmenting Data Structures      [chapter 14]
--------------------------

Sometimes a standard data structure is all you need. Sometimes it is
not quite enough for your needs.
- for example, you might need to be able to do some extra operations

Often you can take a standard data structure and augment it slightly
to fit your needs!

How to augment?
- not automatic!
- often requires some creativity

Often we want to:
- store additional information, and/or
- perform additional operations (efficiently)

An "augmented" data structure is simply an existing data structure
modified to store additional information and/or perform additional
operations. To do it:
1. determine which additional information to store
2. check that this information can be "cheaply" maintained during
   each of the original operations
3. implement the new operations

Example: Dynamic Order Statistics

- We want to maintain a dynamic set S (i.e., the set changes over
  time) of elements (data + key) that allows us to answer two types
  of "rank" queries efficiently, as well as supporting the standard
  operations for maintaining the set (INSERT, DELETE, SEARCH):

  . SEARCH(k): Given a key k, return the data associated with k
  . INSERT(x): Given a key+data x, add it to S
  . DELETE(x): Given a key+data x, remove it from S
  . RANK(k):   Given a key k, what is its "rank", i.e., its position
               among the elements in the data structure?
  . SELECT(r): Given a rank r, what is the key with that rank?

  The first three operations are simply the DICTIONARY ADT
  operations; the last two operations are new.

  For example, if our set of values is {5,15,27,30,56}, then
  RANK(15) = 2 and SELECT(4) = 30.

Let's look at 3 different ways we could do this.

1.
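Before turning to trees, the intended behaviour of RANK and SELECT can be pinned down with a naive sorted-list sketch (my own illustration, not part of the course notes; class and method names are hypothetical). INSERT/DELETE cost O(n) here because of element shifting, which is exactly what the tree-based approaches improve on.

```python
import bisect

# Naive dynamic order statistics over a sorted Python list.
# RANK uses binary search; SELECT is plain indexing. Ranks are 1-based.
class NaiveOrderStatistics:
    def __init__(self):
        self.keys = []                     # kept in sorted order

    def insert(self, k):
        bisect.insort(self.keys, k)        # O(n): shifts elements

    def delete(self, k):
        i = bisect.bisect_left(self.keys, k)
        if i < len(self.keys) and self.keys[i] == k:
            self.keys.pop(i)               # O(n): shifts elements

    def rank(self, k):
        # 1-based position of k among all stored keys
        return bisect.bisect_left(self.keys, k) + 1

    def select(self, r):
        # key whose rank is r
        return self.keys[r - 1]

s = NaiveOrderStatistics()
for k in (5, 15, 27, 30, 56):
    s.insert(k)
print(s.rank(15))   # → 2
print(s.select(4))  # → 30
```

This reproduces the example above: RANK(15) = 2 and SELECT(4) = 30 on the set {5,15,27,30,56}.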
   Use AVL trees (or any balanced BST) without modification.

   . Can do SEARCH/INSERT/DELETE efficiently, but doing SELECT/RANK
     queries is costly!
   . Queries: Simply carry out an inorder traversal of the tree,
     keeping track of the number of nodes visited, until the desired
     rank or key is reached.

   [[Q: What will be the time for a query?]]
   [[Q: Will the other operations (SEARCH/INSERT/DELETE) take any longer?]]
   [[Q: What is the problem? Could we do better?]]

2. Augment AVL trees so that each node x has an additional field
   'rank[x]' that stores its rank in the tree.

   [[Q: What will be the time for a query?]]
   [[Q: Will the other operations (SEARCH/INSERT/DELETE) take any longer?]]
   [[Q: What is the problem? Could we do better?]]

   Good: RANK and SELECT are easy to do.
   Bad:  maintaining the rank info is expensive -- e.g., try adding a
         new element 2: all nodes must be updated!

3. Augment the tree in a more sophisticated way. Consider augmenting
   AVL trees so that each node x has an additional field 'size[x]'
   that stores the number of keys in the subtree rooted at x
   (including x itself). This may seem unrelated to 'rank' at first,
   but we'll see that it is enough to allow us to do what we want.

   - Queries:

     . Consider a node x in an AVL tree. We know that
       rank[x] = 1 + number of keys that come before x in the tree.
       In particular, if we consider only the keys in the subtree
       rooted at x, then rank[x] = size[left child's subtree] + 1
       (this is NOT necessarily the true rank of x in the whole tree,
       only its "relative" rank among the keys in the subtree rooted
       at x).

     . Note: size[x] = size[left(x)] + size[right(x)] + 1 for any
       node x in the AVL tree, where left(x) and right(x) are the
       left and right children of x.

     . RANK(k): Given key k, perform SEARCH on k, keeping track of a
       "current rank" r: each time you go down a level to the right,
       you must add the size of the left subtree that you skipped,
       plus 1 for the key itself that you skipped.
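The RANK(k) walk just described can be sketched in Python on a size-augmented BST. For brevity this sketch is an ordinary (unbalanced) BST rather than an AVL tree, and the node layout and helper names are my own; the size[x] bookkeeping is the same either way.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.size = 1                # number of keys in this subtree

def size(x):
    return x.size if x is not None else 0

def insert(root, key):
    # Plain BST insert (no rebalancing); every node on the search
    # path gains one key in its subtree, so bump size on the way down.
    if root is None:
        return Node(key)
    root.size += 1
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def rank(root, k):
    # Search for k, accumulating the size of each left subtree we
    # skip (plus 1 for the skipped key) whenever we branch right.
    r, x = 0, root
    while x is not None:
        if k < x.key:
            x = x.left
        elif k > x.key:
            r += size(x.left) + 1    # skip left subtree and x itself
            x = x.right
        else:
            return r + size(x.left) + 1
    return None                      # k not in the tree

root = None
for k in (30, 15, 56, 5, 27):
    root = insert(root, k)
print(rank(root, 15))  # → 2
print(rank(root, 56))  # → 5
```

On the set {5,15,27,30,56} this agrees with the earlier example: RANK(15) = 2.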
       [[Q: When we recursively call SEARCH(v_i,k), what do we add to r?]]
       [[Q: When we find x, how do we compute its true rank?]]

     . SELECT(r): Given rank r, start at x = root[T] and work down:

           SELECT(x, r):
               rrx = size[left(x)] + 1    # relative rank of x in its subtree
               if r = rrx then return x
               if r < rrx then return SELECT(left(x), r)
               if r > rrx then return SELECT(right(x), r - rrx)

       Each call goes one level down in the tree. The tree height is
       Theta(log n), so the algorithm is O(log n) in the worst case.

   - Updates:

     INSERT and DELETE operations consist of two phases for AVL
     trees: the operation itself, followed by the fix-up
     (rebalancing) process.

     . INSERT(x):
       Phase 1: Simply increment the size of the subtree rooted at
                every node that is examined while finding the
                position of x (since x will be added in that
                subtree).
       Phase 2: Update the balance factors of the ancestor nodes. If
                a rotation is required, recompute the size values of
                the rotated nodes from their new children.

     . DELETE(x): Consider the leaf y that is actually removed by the
       operation (so y = x or y = successor(x)). We know the size of
       the subtree rooted at every node on the path from the root
       down to y decreases by 1, so we simply traverse that path and
       decrement the size of each node on it. If a rotation is
       necessary, we recompute the size of the rotated nodes from
       their children, since we know those values are correct.

     Update time? We have only added a constant amount of extra work
     at each level of the tree, so the total time in the worst case
     is still Theta(log n).

   - Now we have finally achieved what we wanted: each operation (old
     or new) takes Theta(log n) time in the worst case.

-------
Hashing
-------

Problem 1: Read a text file, keeping track of the number of
occurrences of each character (ASCII codes 0 - 127).

Solution? A Direct-Address Table: simply keep track of the number of
occurrences of each character in an array with 128 positions (one
position for each character). All operations are therefore Theta(1).

Memory usage?
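The SELECT pseudocode above translates almost line for line into Python. This sketch assumes nodes carry key/left/right/size fields (names of my own choosing) and builds a small tree bottom-up so the size fields satisfy size[x] = size[left(x)] + size[right(x)] + 1; AVL rebalancing is omitted.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right
        # size[x] = size[left(x)] + size[right(x)] + 1
        self.size = 1 + (left.size if left else 0) \
                      + (right.size if right else 0)

def select(x, r):
    # Return the node with rank r in the subtree rooted at x.
    rrx = (x.left.size if x.left else 0) + 1   # relative rank of x
    if r == rrx:
        return x
    elif r < rrx:
        return select(x.left, r)
    else:
        return select(x.right, r - rrx)

# A valid BST over {5,15,27,30,56}:
#            27
#           /  \
#         15    56
#        /     /
#       5    30
t = Node(27,
         Node(15, Node(5)),
         Node(56, Node(30)))
print(select(t, 4).key)  # → 30
```

As in the earlier example, SELECT(4) returns 30: the search discards the 3 keys in 27's subtree-relative prefix and continues in the right subtree with rank 4 - 3 = 1.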
128 x size of an integer = 128 x 4 bytes = 512 bytes (for 32-bit
ints) -- small!

Problem 2: Read a data file, keeping track of the number of
occurrences of each integer value (from 0 to 2^{32}-1).

Solution? It would be extremely wasteful (maybe even impossible) to
keep an array with 2^{32} positions, especially when the data files
may contain no more than 10^5 different values (out of all the 2^{32}
possibilities). (To store 2^{32} 32-bit integers would require 16 GB
of storage!)

So instead, we will allocate an array with 10,000 positions (for
example), and figure out a way to map each integer we encounter to
one of those positions. This is called "hashing".

[[Q: Define the ADT that we're using here.]]

Reading assignment:
- 11.1 - 11.3 (except 11.3.3) -- Hash Tables
- Chapter 5 -- Probabilistic Analysis and Randomized Algorithms
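A minimal sketch of the idea, under my own assumptions (the notes do not fix a hash function or collision strategy): map each 32-bit value to one of m = 10,000 positions with a simple modular hash, and resolve collisions by chaining, since distinct values can land in the same slot.

```python
M = 10000                          # table size, far smaller than 2^32

def h(x):
    return x % M                   # a simple (not necessarily good) hash

# Each slot holds a chain of (value, count) pairs.
table = [[] for _ in range(M)]

def record(x):
    # Increment the count of x, scanning only x's chain.
    chain = table[h(x)]
    for i, (v, c) in enumerate(chain):
        if v == x:
            chain[i] = (v, c + 1)
            return
    chain.append((x, 1))

def occurrences(x):
    for v, c in table[h(x)]:
        if v == x:
            return c
    return 0

data = [7, 4000000000, 7, 123456789, 4000000000, 7]
for x in data:
    record(x)
print(occurrences(7))            # → 3
print(occurrences(4000000000))   # → 2
```

With at most ~10^5 distinct values spread over 10^4 slots, each chain stays short on average, so lookups remain cheap even though the table is a tiny fraction of the 2^{32}-entry direct-address table.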