Algorithms and Data Structures for Programming Contests
Hash Tables
Discussion
Direct-address table
- Problem: count frequencies of characters in a text file?
- Easy solution: use an array with 256 positions
(one for each possible character, assuming 8-bit ASCII)
and increment counts as each character is read.
- Feature: each character provides a direct address into the array.
Problem
- Consider analogous situation:
count frequencies of integer values in a data file?
- Direct-address table not practical anymore:
232 = 4,294,967,296 possible integer values
would require an array of that size — even if data file only contains a few hundred values!
Solution
- Use smaller array (hash table)
and hash function that maps each integer value
to a valid array index.
Hash functions
- In general, two-step:
arbitrary data ⇒ number ⇒ table index.
- Last step usually a simple scaling of the number
into the range [0..M−1]
(for a hash table with M slots).
- Desired properties of hash functions:
- consistency: equal keys produce equal hash values;
- efficiency: hash function can be computed quickly;
- uniformity: keys are distributed uniformly.
- It's very difficult to always achieve all three!
- Term comes from cooking, where "hash" is all cut up and mixed up — a good hash function tries to "mix"
the bits in the representation of its input so that (ideally)
changing even just one bit of the input makes roughly half
the bits in the hash value different
(this property is called "avalanching").
Collisions
- No matter how good the hash function,
collisions (different inputs hashing to same slot)
are unavoidable.
- Closed addressing uses a "probing sequence" to look for
another slot that can store the value.
- Open hashing stores a simple linked list of values
at each slot.
Complexity
- Theoretically, can always degenerate
if every input happens to hash to the same slot.
- In practice, this never happens.
As long as load factor (fraction of table filled)
is kept low and hash function is reasonably good,
no more than a constant number of collisions is ever encountered
while searching for a value.
To learn more...
© Copyright 2012–2015 François Pitt
<francois.pitt@utoronto.ca>
last updated at 08:24 (EST) on Thu 5 Mar 2015
Algorithms and Data Structures by
François Pitt is licensed under a
Creative
Commons Attribution-ShareAlike 4.0 International License.