Today's topic addresses the following two problems:
DYNAMIC: A set S of n elements is not fixed in advance. We only know
that elements are (key, data) pairs with key in a universe U.
Want to implement O(1) operations:
insert(key, data), delete(key), search(key).
E.g., a compiler symbol table.
STATIC: Given a set S of n (key, data) pairs.
Want to store S so that we can find the elements indexed by their
key.
E.g., a fixed dictionary (e.g., key = word, data = definition).
Notice: using sorting we can achieve O(log n) time to
insert/search/delete
into/in/from a sorted list.
In contrast, HASHING is an algorithmic method that will
let us solve the DYNAMIC version of the problem
in expected O(1) time, and the STATIC version in
O(1) time.
Today's Lecture: solution for the dynamic version.
Friday's Section: use dynamic solution to solve static version.
Idea 1: Suppose all keys are in the range {1,...,M}. Keep a
table T of size M, and on arrival of (key, data), store it
in T[key]. Search, insert, delete are O(1), but if M >> n we waste too
much space. Ex: if your set S consists of
50 100-bit numbers, then the range is 2^100 although there are only
50 elements to work with.
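Idea 1 can be sketched in a few lines; this is a minimal illustration, assuming integer keys and a hypothetical universe size M = 1000 (the waste is exactly the M - n unused slots):

```python
# Direct addressing (Idea 1): one table slot per possible key.
# M = 1000 is a hypothetical universe size for illustration.
M = 1000
T = [None] * M  # wasteful when M >> n

def insert(key, data):
    T[key] = (key, data)  # the key is its own index

def search(key):
    return T[key]  # O(1)

def delete(key):
    T[key] = None

insert(42, "answer")
assert search(42) == (42, "answer")
delete(42)
assert search(42) is None
```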
Idea 2: HASHING.
Keep around a HASH TABLE T of size m and a
HASH FUNCTION h: U -> {0,...,m-1} (to be specified as the lecture
goes on).
--To insert element (key, data), compute location = h(key),
and set T[location] = (key, data).
--To look up a key, look in T[h(key)].
--To delete a key, set T[h(key)] = empty.
What is m?
To gain over Idea 1 we need m << |U|, but in this case COLLISIONS can occur.
COLLISION: there exist k1 != k2 in universe U such that h(k1) = h(k2).
There are 2 questions to address now:
1. What to do if collisions occur?
2. How to choose h s.t. collisions are minimized.
What to do if collisions occur?
Many ideas in the literature and in CLR.
Simplest one: CHAINING.
CHAINING:
Each entry T[j] of the hash table points to a linked list
of elements elt_i=(key_i,data_i) for all of which
h(key_i) = j.
Insert Time: O(1)
Search and Delete time: O(cost of computing h + size of list).
We shall, by and large, ignore the cost of computing h (but you should
not in practice, or when evaluating your solution to a concrete problem).
Worst-case list SIZE = O(n). TERRIBLE!
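Chaining can be sketched as follows; this is a minimal illustration, assuming Python's built-in hash() stands in for the hash function h (the universal construction comes later in these notes):

```python
# Hashing with chaining: each table entry T[j] holds a list (chain)
# of all (key, data) pairs whose keys hash to j.
class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.table = [[] for _ in range(m)]  # T[j] is a chain

    def _h(self, key):
        # Stand-in for h; in practice you would use a universal family.
        return hash(key) % self.m

    def insert(self, key, data):
        self.table[self._h(key)].append((key, data))  # O(1)

    def search(self, key):
        # O(cost of h + length of the chain at T[h(key)])
        for k, d in self.table[self._h(key)]:
            if k == key:
                return d
        return None

    def delete(self, key):
        j = self._h(key)
        self.table[j] = [(k, d) for k, d in self.table[j] if k != key]

t = ChainedHashTable(8)
t.insert("apple", 1)
t.insert("banana", 2)
assert t.search("apple") == 1
t.delete("apple")
assert t.search("apple") is None
```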
Claim: As long as |U| >= (n-1)m + 1, then for every choice of
hash function h, there exist n keys in U that all collide under h
(forcing worst-case behavior).
Proof: A pigeonhole argument. Think of the keys of U as PIGEONS and
the m table entries as HOLES: h sends each key (pigeon) to the entry
(hole) h(key). If every entry received at most n-1 keys, then the
table could account for at most (n-1)m < |U| keys in total, a
contradiction. So some entry receives at least n keys of U, and
those n keys all collide under h. QED
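The claim can be checked numerically for small parameters; a sketch assuming m = 10 and n = 3 (so (n-1)m + 1 = 21), where any assignment of 21 keys to 10 entries must put at least 3 keys in some entry:

```python
import random

# Pigeonhole check: with |U| = (n-1)*m + 1 = 21 keys and m = 10 table
# entries, EVERY function h forces some entry to hold >= n = 3 keys.
# We test arbitrary random functions h as representatives.
m, n = 10, 3
U = range((n - 1) * m + 1)  # |U| = 21

for trial in range(100):
    h = {x: random.randrange(m) for x in U}  # an arbitrary function h
    counts = [0] * m
    for x in U:
        counts[h[x]] += 1
    assert max(counts) >= n  # some entry always holds >= n colliding keys
```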
SO, why was hashing such a great idea???
Answer:
RANDOMIZATION will come to the rescue.
We will never fix h, but choose it at random
[not from the entire space of all possible hash functions,
but] from a special family of hash functions which we call a
family of UNIVERSAL HASH FUNCTIONS.
Our strategy then will be to
define a family of hash functions H (the so-called
Universal Hash Functions) and during runtime
choose h in H.
Then, we shall argue that
for all set of possibly keys S to be hashed, the expected
number of collisions under a randomly chosen h is small.
ANALOGY to RANDOMIZED PARTITION in QUICKSORT:
we picked the pivot element to partition around at random and
showed that for any input ordering, the expected partition
over the choices of the pivot is good.
Definition:
A set H = {h : U --> {0,...,m-1}}
is a UNIVERSAL FAMILY of hash functions if for all x != y in U,
Pr [ h(x) = h(y) ] <= 1/m,
where the probability is over h chosen at random in H.
THEOREM: If H is universal, then for any set S in U
of size n and any x in S,
E[number of collisions with x] <= (n-1)/m,
where the expectation is taken over the choice of h in H.
PROOF:
-- For each y != x in S, let C_xy = 1 if x and y collide and 0
otherwise. Then
E[C_xy] = Pr(x and y collide) <= 1/m.
-- Let C_x = total # of collisions for x = sum_{y != x} C_xy.
-- By linearity of expectation, E[C_x] = sum_{y != x} E[C_xy] <= (n-1)/m. QED
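One way to see the theorem in action: a truly random function h is universal (Pr[h(x) = h(y)] is exactly 1/m), so E[C_x] can be estimated by simulation. A sketch with arbitrarily chosen parameters n = 30, m = 50:

```python
import random

# Estimate E[C_x] over random h and compare with the (n-1)/m bound.
# A fresh uniformly random function h: S -> {0,...,m-1} is drawn each
# trial; the parameters n, m, trials are arbitrary choices.
random.seed(0)
m, n, trials = 50, 30, 20000
S = list(range(n))
x = S[0]

total = 0
for _ in range(trials):
    h = [random.randrange(m) for _ in range(n)]  # random h on S
    total += sum(1 for y in S if y != x and h[y] == h[x])  # C_x

avg = total / trials
# The bound is (n-1)/m = 0.58; allow 10% slack for sampling noise.
assert avg <= (n - 1) / m * 1.1
```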
PUNCH LINE:
E[search time] = O(n/m + cost of computing the hash function).
If the cost of h is taken to be O(1), then this gives
O(n/m). We usually call n/m = (number of items)/(table size) the LOAD FACTOR.
If m = Theta(n), then the load factor is O(1) and E[search time] = O(1).
BIG QUESTION:
All very nice, but
can we actually construct a universal hash family?
If not, this is all pretty vacuous.
ANSWER: YES!!!
In lecture we did one method. Here is another.
HOW TO CONSTRUCT:
-----------------
Pick prime p >= |U|. Define
h_{a,b}(x) = ( (a*x + b) mod p ) mod m.
H = {h_ab | a in {1,...,p-1}, b in {0,...,p-1}}
Claim: H is a universal family.
Proof: Fix x != y in U, and consider two locations r and s.
For any r, s in {0,...,p-1} such that
r != s, what is Pr[ (ax + b) = r (mod p) and (ay + b) = s (mod p) ]?
-> Subtracting, these equations mean ax - ay = r - s (mod p), so
a = (r-s)/(x-y) mod p, which
has exactly one solution (mod p) since Z_p is a field; this solution
is nonzero because r != s.
-> Thus, there is a 1/(p-1) chance that a has the right
value. Given this value of a, we need b = r - ax (mod p), and there is
a 1/p chance that b gets this value, so the overall probability is
1/[p(p-1)].
Now, the probability that x and y collide equals 1/[p(p-1)] times the
number of pairs r != s in {0,...,p-1} such that r = s (mod m).
We have p choices for r, and then at most ceil(p/m) - 1 <= (p-1)/m
choices left for s != r with s = r (mod m). The product is at most p(p-1)/m.
Thus, Pr[ (ax + b mod p) mod m = (ay + b mod p) mod m ]
<= [p(p-1)/m] * [1/(p(p-1))] = 1/m. QED
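The family h_{a,b} is easy to implement and to check empirically; a sketch with hypothetical parameters p = 101 (a prime >= |U|) and m = 10, estimating the collision probability for one fixed pair x != y:

```python
import random

# The universal family h_{a,b}(x) = ((a*x + b) mod p) mod m.
# p = 101 and m = 10 are hypothetical parameters; keys live in {0,...,p-1}.
p, m = 101, 10

def make_hash():
    a = random.randrange(1, p)  # a in {1,...,p-1}
    b = random.randrange(0, p)  # b in {0,...,p-1}
    return lambda x: ((a * x + b) % p) % m

# Estimate Pr[h(x) = h(y)] over random (a, b) for one fixed pair.
random.seed(1)
x, y = 17, 64  # arbitrary distinct keys
trials = 50000
collisions = 0
for _ in range(trials):
    h = make_hash()
    if h(x) == h(y):
        collisions += 1

# The theorem guarantees Pr <= 1/m = 0.1; allow slack for sampling noise.
assert collisions / trials <= 1 / m * 1.2
```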