CSC378: Red-Black Insertion

We're constructing a hash table. Suppose the probability distribution of searches,

is fixed but unknown. We might inadvertently choose a bad hash function that gives performance, since is unknown.

For example, U has 1,000,000 keys, 1000 of which are high-probability. All other keys have zero probability. If the hash function maps all 1000 to the same slot, we'll get terrible performance.

Idea: Make a new, random hash function, h, each time the program runs. (We'll have to rehash our data with the new hash function each time the program runs.)

We can choose h from a special class, H, such that the expected number of collisions is low, no matter the distribution, D.

Unfortunately, we will still occasionally pick a bad hash function. But it'll only last for one run of the program, and it'll occur only infrequently.

Let be a set of hash functions.

H is universal if, for each distinct pair, , the number of functions, , for which is exactly .

Suppose :

So if we pick a random function, h, from H, there is a chance that . That means that, for such a randomly chosen h,

This is true for any pair, : .

But this is the same chance of collision as if and are chosen randomly from . So this is the best hash function that we could choose.

Constructing a universal set of hash functions

Choose m to be prime.

For a key, , break x into pieces, .

We'll choose the division into bit strings such that each has the same number of bits, and each .

For example,

To pick a random hash function from a universal set, randomly choose

where each is randomly chosen from .

Define a hash function based on this choice of a:

Define . The union is taken over all possible as (there are of them).

Then is a universal set of hash functions, and any randomly chosen from has the desirable property that .

Proof that H is universal

We will prove that H is universal by showing that, , if is randomly chosen from then .

From number theory, the equation

(for known integers a, b, and m (with ) and unknown x) has a unique solution, x, if m is prime.

Let with .

Write them as

Since , they differ in at least one position. Without loss of generality, assume that .

If , then

In other words, if , then

But for how many value of is this true if are fixed?

Using the number theory equation above, substitute

Then the equation has a unique solution for !

This means that, if we choose randomly, there's only one choice of the remaining that results in a collision.

Since we're choosing randomly from , there's a chance of choosing the that is a solution for the equation.

In other words, .

Looking at this another way: For a given , there are ways to choose the such that . But since there are total ways to choose , the probability that we choose one for which is exactly

Thus H is a universal set of hash functions.

Summary

Picking a hash function randomly from a universal set gives us good expected performance.

One way to pick such a function is to generate random

as described above. Then

is the hash function.

Each time the program is run, we randomly pick a new hash function. This protects us from being stuck with a consistently bad hash function. But it means that we have to rehash all of our data at the start of each execution.

Motivation

Constructing a universal set of hash functions

Proof that H is universal

Summary