Lecture Notes 16: 7/17/00			Michael Brudno

Hash Functions
---- ---------

A hash function is a function that maps an object to an integer. We
want different objects to produce different hash codes. We usually
want objects to be uniformly distributed by the hash function, but
that is not a requirement. In some cases you may want a hash function
which maps a number to itself, as will be shown shortly.

Requirement of the hash-function: if o.equals(a) then H(o) = H(a) where
H is the hash function.

A decent hashfunction for integers is (integer) % (e/pi). This produces
a very small number, so you multiply it by a constant, say 2^31 to
produce the full range of integers.

The sum of char values of letters in a string is not a very good hash 
function: "abc" and "cba" will produce the same result, as will all 
strings with letters re-arranged. The hash function for strings of length 
15 or less used in java is

SIGMA   C_i*37^i.
0<=i<n

Although I am not sure how well this works in practice, I am tempted to 
trust SUN.


Hash Table 
---- ----- 

A hash table is a data structure that supports probabilistic constant
time insertions, removals, and searches. In order to insert an element 
o into a hashtable T with t cells, we calculate H(o), and then 
put o into the cell H(o) % t. If more than one element gets the same
cell number we have a collision. There are several options of what to
do in cases of collisions, with one of the most common being chaining:
each cell contains a linked list of the items assigned to that cell.

In order to find out if we have a certain element in our hashtable,
we'd calculate the hash code of it, and check the appropriate cell.
Note that because of collisions, in the worst case we may have
Theta(n) lookup time. Because of this it is necessary to choose your
hash function very carefully. In some cases people create a family of
hash functions, but for each new hash table pick a random hash
function from the family. It is also common to have the hash table of
size 2n, where n is the number of elements you plan to store in it.

Another good idea is to use a prime length of table, as that allows
for more mistakes by the hash function. Taking a number % a prime is
by itself in a way a hash function. Note that taking it % 2^k would
only leave the last k bits, and hence a power of 2 is not a very good
size for a hash table.

Example: Given an initially empty Hash table of 13 elements and 
a hash function H(x) = x^2, insert 5, 6, and 7.

5^2 = 25, 25 % 13 = 12
6^2 = 36, 36 % 13 = 10
7^2 = 49  49 % 13 = 10

Note that the 6 & 7 cause a collision.

 __ __ __ __ __ __ __ __ __ __ __ __ __ 
|__|__|__|__|__|__|__|__|__|__|__|__|__|
  0  1  2  3  4  5  6  7  8  9 10 11 12 
                                |     |
           +--------------------+     V_ __       
           |  |6 | /|  |7 | /|       |5 | /|  
	   +->|__|/_|->|__|/_|       |__|/_|


Here in cell 12 we have just a single element, but in cell 10 there are 2, 
because of a collision. Note that the order of 6 & 7 could be either way,
depending on your implementation. If we wanted to know if 9 was in our 
table it is easy: 81%13 = 3, there are no elments in cell 3. However if 
we wanted to find out if 18 was present, 324%13 = 12, we would have to 
iterate down the list until we found either the element 18 or the end of 
the list.

Because collisions (should) happen rarely, it is uncommon to create
more complicated data structures for the chain of the hashtable. The
only noteable approach is using another hashtable in each cell, with a
different hash function. This may be something to do if you have very
little confidence in your hash function.

Linear & Other Probing
------ - ----- -------

It is possible, instead of using linked lists to resolve collisions to
put the overflow elments from one cell into another one. In linear
probing we just try to put it in the next one. If it is free we do so,
if it is not, we try the one after that, etc. This, however, suffers
from the problem that nearby entries, if they overflow, will cause
"clusters" and in order to find the specific element it will be
necessary to search the whole cluster.

In order to combat this more complicated probing programs were
developed: some used quadratic probing: if cell is used, add 1 to the
index. If that is used add 4, than 9, and so on. Another possibility
is to just re-hash the object using another hash function. This is
known as double hashing.

All of these methods, however, suffer from the disadvantage that
removal is not trivial: removing an element may brake a chain of
colliding entries, and cause an element to get lost. Often it
is not necessary to remove from hash tables, but if it is we point
you to The Bible of Computer Science: The Art of Computer Programming
by Donald Knuth. In this case volume 3.

Bucket Sort
------ ----

Hashtables are usually really bad for sorting, however, if you know
you have a limited range of elements, such as bytes which range from
0-255, you could create a hashtable of 256 elements and have the
hashfunction return the value of the byte itself! turns out this works
great for small ranges, it is O(n) sort because each insert only takes
O(1) to insert each one into the hashtable, and then O(n) more to
re-adjust the pointers to create a single long list. (You could do the
second part faster, but the ruling part will still be O(n).

 __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __
|__|__|__|__|__|__|__|__|__|__|__|__|__|__|__|__|__|__|__|__|__|__|__|__| etc. 
  0  1  2  3  4  5  etc. (storing byte objects here) ^ ^ ^ ^   etc.   255 
        |            |             |         | 
        v_ __        v_ __        _v __      v_ __ 
       |__|__| ---> |__|__| ---> |__|__| -> |__|/_| 

This is a simple diagram of how the resulting linked list will look
like. In cases of collision we link the last element in the first
bucket with the first of the next (non-empty) bucket.

Java Hashtable class
---- --------- -----

All java objects have a hashCode() method. It is defined in java.lang.Object
and all other classes override it as necessary. The definition in Object 
just returns the memory address of the Object (recall that .equals() is 
defined similarly, so the requirement above is satisfied). You should 
feel free to override it for other objects you put in a hashtable. Remember,
however, to also override the .equals() method, as forgetting this will
cause weird results.

Java has a built in hashtable class: java.util.Hashtable. Instead of just 
checking whether something is just present (as in the hashtables we saw 
above) it allows one to store KEY-VALUE pairs. The KEY is the object 
getting hashed, the VALUE can be any other object which is associated 
with the KEY. 

The two most useful methods are 

Object put(Object KEY, Object VALUE) {...}

Object get(Object KEY) {...}

put() inserts the pair into the hashtable and returns the value previously
associated with KEY.
get() return the VALUE currently mapped to KEY in the hashtable.

There are other methods, which you may, or may not find convinient. The
only one which I want to mention are contains() and containsValue(). These
methods are identical and check for the presence of some VALUE. These 
methods just do a linear search, as only KEYs are hashed in the Java
Hashtable class, not values.