Today's topic addresses the following two problems:

DYNAMIC: A set S of n elements is not fixed in advance. We only know that the elements are (key, data) pairs with key drawn from a universe U. We want to implement the operations insert(key, data), delete(key), search(key) quickly, ideally in O(1) time. E.g., a compiler symbol table.

STATIC: Given a set S of n (key, data) pairs, we want to store S so that we can find elements indexed by key. E.g., a fixed dictionary (key = word, data = definition).

Notice: using sorting we can achieve O(log n) time to insert/search/delete into/in/from a sorted list. In contrast, HASHING is an algorithmic method that lets us solve the DYNAMIC version of the problem in expected O(1) time, and the STATIC version in worst-case O(1) time.

Today's lecture: a solution for the dynamic version. Friday's section: use the dynamic solution to solve the static version.

Idea 1: Suppose all keys are in the range {1,...,M}. Keep a table T of size M, and on arrival of (key, data), store it in T[key]. Search, insert, and delete are O(1), but if M >> n we waste too much space. Ex: if your set S consists of 50 100-digit numbers, then the range has size 10^100 although there are only 50 elements to work with.

Idea 2: HASHING. Keep around a HASH TABLE T of size m and a HASH FUNCTION h: U -> {0,...,m-1} (to be specified as the lecture goes on).
-- To insert element (key, data), compute location = h(key), and set T[location] = (key, data).
-- To look up a key, look in T[h(key)].
-- To delete, set T[h(key)] = 0.

What is m? To gain over Idea 1 we need m << |U|, but in this case COLLISIONS can occur.

COLLISION: there exist k1 != k2 in the universe U such that h(k1) = h(k2).

There are 2 questions to address now:
1. What to do if collisions occur?
2. How to choose h so that collisions are minimized.

What to do if collisions occur? There are many ideas in the literature and in CLR. The simplest one: CHAINING.

CHAINING: Each entry T[j] of the hash table points to a linked list of elements elt_i = (key_i, data_i), for all of which h(key_i) = j.
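A minimal sketch of a chained hash table in Python. The class and method names are illustrative, not from the lecture, and Python's built-in hash() reduced mod m stands in for the hash function h (which the lecture specifies later); chains are plain Python lists rather than linked lists.

```python
class ChainedHashTable:
    """Hash table with chaining: T[j] holds all (key, data) pairs with h(key) = j."""

    def __init__(self, m):
        self.m = m
        self.table = [[] for _ in range(m)]   # each entry is a chain

    def _h(self, key):
        # Placeholder for h: U -> {0,...,m-1}; Python's hash() is NOT
        # the universal family discussed later.
        return hash(key) % self.m

    def insert(self, key, data):
        # O(1): append to the chain at T[h(key)]
        self.table[self._h(key)].append((key, data))

    def search(self, key):
        # O(size of chain): scan the chain at T[h(key)]
        for k, d in self.table[self._h(key)]:
            if k == key:
                return d
        return None

    def delete(self, key):
        # O(size of chain): rebuild the chain without the key
        j = self._h(key)
        self.table[j] = [(k, d) for (k, d) in self.table[j] if k != key]


t = ChainedHashTable(8)
t.insert("word", "definition")
print(t.search("word"))   # -> definition
```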
Insert time: O(1). Search and delete time: O(cost of computing h + size of list). We shall by and large ignore the cost of computing h (but you should not in practice, nor when evaluating your solution to a concrete problem). Worst-case list SIZE = O(n). TERRIBLE!

Claim: As long as |U| >= (n-1)m + 1, then for every choice of hash function h there exist n keys in U that all collide under h (forcing the worst-case behavior).

Proof: Think of the m table entries as HOLES and the keys of U as PIGEONS. The function h distributes the |U| >= (n-1)m + 1 pigeons among the m holes, so by the pigeonhole principle some hole receives at least ceil(((n-1)m + 1)/m) = n pigeons; that is, some table entry has at least n keys mapped to it. QED

SO, why was hashing such a great idea??? Answer: RANDOMIZATION will come to the rescue. We will never fix h, but choose it at random [not from the entire space of all possible hash functions, but] from a special family of hash functions which we call a family of UNIVERSAL HASH FUNCTIONS.

Our strategy will be to define a family of hash functions H (the so-called universal hash functions) and at runtime choose h in H at random. Then we shall argue that for every set of keys S to be hashed, the expected number of collisions under a randomly chosen h is small.

ANALOGY to RANDOMIZED PARTITION in QUICKSORT: we picked the pivot element to partition around at random and showed that for any input ordering, the expected quality of the partition, over the choices of the pivot, is good.

Definition: A set H = {h: U -> {0,...,m-1}} is a UNIVERSAL FAMILY of hash functions if for all x != y in U,
   Pr_{h chosen at random in H} [ h(x) = h(y) ] <= 1/m.

THEOREM: If H is universal, then for any set S in U of size n and any x in S, E[number of collisions with x] <= (n-1)/m, where the expectation is taken over the choice of h in H.

PROOF:
-- For each y != x in S, let C_xy = 1 if x and y collide and 0 otherwise.
   E[C_xy] = Pr(x and y collide) <= 1/m.
-- C_x = total # of collisions for x = sum_{y != x} C_xy.
-- E[C_x] = sum_{y != x} E[C_xy] <= (n-1)/m. QED

PUNCH LINE: E[search time] = O(n/m + cost of computing the hash function). If the cost of h is taken to be O(1), then this gives O(n/m). We usually call n/m = number of items / table size the LOAD FACTOR. If m = O(n), then the load factor is O(1) and E[search time] = O(1).

BIG QUESTION: All very nice, but can we actually construct a universal hash family? Otherwise this is all pretty vacuous. ANSWER: YES!!! In lecture we did one method. Here is another.

HOW TO CONSTRUCT:
-----------------
Pick a prime p >= |U|. Define h_{a,b}(x) = ((a*x + b) mod p) mod m, and let
   H = {h_{a,b} | a in {1,...,p-1}, b in {0,...,p-1}}.

Claim: H is a universal family.

Proof: Fix x != y. First, for any pair of locations r != s in {0,...,p-1}, what is Pr[(ax + b) = r (mod p) and (ay + b) = s (mod p)]?
-> These two equations imply a(x - y) = r - s (mod p), so a = (r-s)/(x-y) mod p, which has exactly one solution mod p since Z_p is a field; moreover a != 0 because r != s, so it is a legal value in {1,...,p-1}. Thus there is a 1/(p-1) chance that a has the right value.
-> Given this value of a, we need b = r - ax (mod p), and there is a 1/p chance that b gets this value, so the overall probability is 1/[p(p-1)].

Now, the probability that x and y collide equals 1/[p(p-1)] times the number of pairs r != s in {0,...,p-1} such that r = s (mod m). There are p choices for r, and then at most ceil(p/m) - 1 <= (p-1)/m choices left for s != r with s = r (mod m). So the number of such pairs is at most p(p-1)/m. Thus,
   Pr[(a*x + b mod p) mod m = (a*y + b mod p) mod m] <= [p(p-1)/m] * [1/(p(p-1))] = 1/m. QED