We're constructing a hash table. Suppose the probability
distribution of searches,
is fixed but unknown. We might inadvertently choose a bad hash
function that gives
performance, since
is unknown.
For example, U has 1,000,000 keys, 1000 of which are
high-probability. All other keys have zero probability. If the hash
function maps all 1000 to the same slot, we'll get terrible
performance.
Idea: Make a new, random hash function, h, each time the
program runs. (We'll have to rehash our data with the new hash
function each time the program runs.)
We can choose h from a special class, H, such that the expected
number of collisions is low, no matter the distribution, D.
Unfortunately, we will still occasionally pick a bad hash function.
But it'll only last for one run of the program, and it'll occur only
infrequently.
Let
be a set of hash functions.
H is universal if, for each distinct pair,
,
the number of functions,
, for which
is
exactly
.
Suppose
:
So if we pick a random function, h, from H, there is a
chance that
. That means that, for such a
randomly chosen h,
This is true for any pair,
:
.
But this is the same chance of collision as if
and
are chosen randomly from
. So this is the best hash
function that we could choose.
Choose m to be prime.
For a key,
, break x into
pieces,
.
We'll choose the division into bit strings such that each
has
the same number of bits, and each
.
For example,
To pick a random hash function from a universal set, randomly choose
where each
is randomly chosen from
.
Define a hash function based on this choice of a:
Define
. The union is taken
over all possible as (there are
of them).
Then
is a universal set of hash functions, and any
randomly
chosen from
has the desirable property that
.
Proof that H is universal
We will prove that H is universal by showing that,
, if
is randomly chosen from
then
.
From number theory, the equation
(for known integers a, b, and m (with
) and unknown x)
has a unique solution, x, if m is prime.
Let
with
.
Write them as
Since
, they differ in at least one position. Without
loss of generality, assume that
.
If
, then
In other words, if
, then
But for how many value of
is this true if
are fixed?
Using the number theory equation above, substitute
Then the equation has a unique solution for
!
This means that, if we choose
randomly, there's
only one choice of the remaining
that results in a
collision.
Since we're choosing
randomly from
,
there's a
chance of choosing the
that is a
solution for the equation.
In other words,
.
Looking at this another way: For a given
, there are
ways to choose the
such that
.
But since there are
total ways to choose
, the probability that we choose one for which
is
exactly
Thus H is a universal set of hash functions.
Picking a hash function randomly from a universal set gives us good
expected performance.