Lecture Notes 16: 7/17/00 Michael Brudno Hash Functions ---- --------- A hash function is a function that maps an object to an integer. We want different objects to produce different hash codes. We usually want objects to be uniformly distributed by the hash function, but that is not a requirement. In some cases you may want a hash function which maps a number to itself, as will be shown shortly. Requirement of the hash-function: if o.equals(a) then H(o) = H(a) where H is the hash function. A decent hashfunction for integers is (integer) % (e/pi). This produces a very small number, so you multiply it by a constant, say 2^31 to produce the full range of integers. The sum of char values of letters in a string is not a very good hash function: "abc" and "cba" will produce the same result, as will all strings with letters re-arranged. The hash function for strings of length 15 or less used in java is SIGMA C_i*37^i. 0<=i|__|/_|->|__|/_| |__|/_| Here in cell 12 we have just a single element, but in cell 10 there are 2, because of a collision. Note that the order of 6 & 7 could be either way, depending on your implementation. If we wanted to know if 9 was in our table it is easy: 81%13 = 3, there are no elments in cell 3. However if we wanted to find out if 18 was present, 324%13 = 12, we would have to iterate down the list until we found either the element 18 or the end of the list. Because collisions (should) happen rarely, it is uncommon to create more complicated data structures for the chain of the hashtable. The only noteable approach is using another hashtable in each cell, with a different hash function. This may be something to do if you have very little confidence in your hash function. Linear & Other Probing ------ - ----- ------- It is possible, instead of using linked lists to resolve collisions to put the overflow elments from one cell into another one. In linear probing we just try to put it in the next one. If it is free we do so, if it is not, we try the one after that, etc. This, however, suffers from the problem that nearby entries, if they overflow, will cause "clusters" and in order to find the specific element it will be necessary to search the whole cluster. In order to combat this more complicated probing programs were developed: some used quadratic probing: if cell is used, add 1 to the index. If that is used add 4, than 9, and so on. Another possibility is to just re-hash the object using another hash function. This is known as double hashing. All of these methods, however, suffer from the disadvantage that removal is not trivial: removing an element may brake a chain of colliding entries, and cause an element to get lost. Often it is not necessary to remove from hash tables, but if it is we point you to The Bible of Computer Science: The Art of Computer Programming by Donald Knuth. In this case volume 3. Bucket Sort ------ ---- Hashtables are usually really bad for sorting, however, if you know you have a limited range of elements, such as bytes which range from 0-255, you could create a hashtable of 256 elements and have the hashfunction return the value of the byte itself! turns out this works great for small ranges, it is O(n) sort because each insert only takes O(1) to insert each one into the hashtable, and then O(n) more to re-adjust the pointers to create a single long list. (You could do the second part faster, but the ruling part will still be O(n). __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ |__|__|__|__|__|__|__|__|__|__|__|__|__|__|__|__|__|__|__|__|__|__|__|__| etc. 0 1 2 3 4 5 etc. (storing byte objects here) ^ ^ ^ ^ etc. 255 | | | | v_ __ v_ __ _v __ v_ __ |__|__| ---> |__|__| ---> |__|__| -> |__|/_| This is a simple diagram of how the resulting linked list will look like. In cases of collision we link the last element in the first bucket with the first of the next (non-empty) bucket. Java Hashtable class ---- --------- ----- All java objects have a hashCode() method. It is defined in java.lang.Object and all other classes override it as necessary. The definition in Object just returns the memory address of the Object (recall that .equals() is defined similarly, so the requirement above is satisfied). You should feel free to override it for other objects you put in a hashtable. Remember, however, to also override the .equals() method, as forgetting this will cause weird results. Java has a built in hashtable class: java.util.Hashtable. Instead of just checking whether something is just present (as in the hashtables we saw above) it allows one to store KEY-VALUE pairs. The KEY is the object getting hashed, the VALUE can be any other object which is associated with the KEY. The two most useful methods are Object put(Object KEY, Object VALUE) {...} Object get(Object KEY) {...} put() inserts the pair into the hashtable and returns the value previously associated with KEY. get() return the VALUE currently mapped to KEY in the hashtable. There are other methods, which you may, or may not find convinient. The only one which I want to mention are contains() and containsValue(). These methods are identical and check for the presence of some VALUE. These methods just do a linear search, as only KEYs are hashed in the Java Hashtable class, not values.