perfect hash families
Last time, a 2-universal hash family the is
- small (constant words to describe)
- fast to evaluate
Improve
- perfect hash function: no collisions
- perfect family $h$: for any $S$ of $s \le n$, perfect $h$ in family
- eg, set of all functions
Alternative try: use more space:
- How big can $s$ be for random $s$ to $n$ without collisions?
- Expected number of collisions is $E[\sum C_{ij}] = \binom{s}{2}(1/n) \approx s^2/2n$
- So $s=\sqrt{n}$ works with prob. 1/2 (markov)
- Is this best possible?
- Birthday problem: $(1-1/n)\cdots(1-s/n) \approx e^{-(1/n+2/n+\cdots+s/n)} \approx e^{-s^2/2n}$
- So, when $s=\sqrt{n}$ has $\Omega(1)$ chance of collision
- 23 for birthdays
Two level hashing solves problem
- Hash $s$ items into $O(s)$ space 2-universally
- Build quadratic size hash table on contents of each bucket
- bound $\sum b_k^2=\sum_k (\sum_i [i \in b_k])^2 =\sum C_i + C_{ij}$
- expected value $O(s)$.
- So try till get (markov)
- Then build collision-free quadratic tables inside
- Try till get
- Polynomial time in $s$, Las-vegas algorithm
- Easy: $6s$ cells
- Hard: $s+o(s)$ cells (bit fiddling)
Derandomization
- Probability $1/2$ top-level function works
- Only $m^2$ top-level functions
- Try them all!
- Polynomial in $m$ (not $n$), deterministic algorithm
Cuckoo Hashing
Uniform Hashing Assumption:
- Many hash analysis assume the hash function is a random function
- makes analysis simple
- valid if e.g. use a cryptographic hash function (indistinguishable from random by any polynomial time algorithm)
- also, many results hold even if replace with a limited-independence pseudorandom function---e.g., $O(\log)$-wise independent family.
- We'll make this assumption for cuckoo and consistent hashing.
Suppose your $n$ items all make their two random choices and then you place them carefully?
- think of buckets as $n$ vertices
- think of items as $n$ random edges representing choices for each item
- this is a random graph
- very rich literature
- e.g. Erdos-Renyi connectivity result
Analysis
- If no cycle, no problem placing items
- why?
- cycle if $m$ edges into $n$ vertices?
- prob. given length-$k$ cycle is $\le (2m/n^2)^k$
- prob. some edge takes first spot on cycle is $2m/n^2$
- conditioned on that, prob. some edge takes next spot is $2(m-1)/n^2$
- etc.
- number of length $k$ cycles is $\le n^k$
- So overall expect at most $(2m/n)^k$
- So prob. cycle of length 2 or more is $O((2m/n)^2) < 1$ if $m \ll n/2$
- Reasonably tight since prob. no length-2 cycle (two identical edges) is birthday paradox: $(1-1/n^2)\cdots(1-m/n^2)\approx e^{-m^2/2n^2}$
- Actually, only have problem if there is a component with 2 cycles (bicycle) or cycle with a chord.
- Later work improves; show can achieve max load 2 with $83%$ load
What about insertion?
- insertion evicts along a path of edges
- when reaches a leaf, done with success
- if returns to a previous node, may have a problem
- consider path case first
- pr. exists length $k$ path $\le n^k(2m/n^2)^k = (2m/n)^k$.
- so expected constant and w.h.p $O(\log n)$ insert
- what about cycle?
- when $m \ll n$, Prob length-$k$ cycle involving me (given my ends) is $n^{k-2}(2m/n^2)^{k-1} = O(1/n)$
- so whp no long path or cycle
- if get cycle, rebalance (cost $n$)
- amortized cost $O(1)$
Consistent Hashing
Motivation
- $m$ Distributed web caches
- Assign $n$ items such that no cache overloaded
- Hashing fine
- Problem: machines come and go
- Change one machine, whole hash function changes
- Better if when one machine leaves, only $O(1/m)$ of data leaves.
Outline
- Use standard hash function to map each machine to $[0,1]$ interval
- Also map each item to same interval
- Assign each item to immediate successor on ring
Intuition:
- Machines “random”, evenly spread out; each “owns” $1/m$ of ring.
- Items “random”, equally likely to land on any homestead
- When one machine leaves, its successor takes over
- Only requires moving data from departing machine
- When machine arrives, splits a homestead
- Only steals data from one machine
Implementation
- Binary tree
- $O(\log m)$ to insert new machine or hash an item
Analysis
- For simplicity, assume base hash produces “random” values in $[0,1]$
- review later
Max homestead size $O((\log m)/m)$
- Consider position of machine $M$
- Consider interval of length $2(\log m)/m)$
- Probability no other machines there $(1-2(\log m)/m)^m=1/m^2$
Conclude load
- Homestead size $p$ means expected load $p$
- Once expected load exceeds $\log m$, within constant w.h.p. in $m$
- So, max load $(\log m)(1+n/m)$
Can we do better?
- Not a balls in bins issue, since probably $n\gg m$
- Instead, problem is mismatched homesteads
- Create $\log m$ “virtual nodes” for each real
- Claim: sum of homestead sizes is $O(1/m)$
- Break ring into $m(\log m)$ “slots” (same as number of virtual nodes)
- Evaluate number of slots owned by one node
- Start at virtual node---seek forward on slots until see another node
- When see successor, jump to next virtual node
- given slot is empty with prob. $(1-(\log m)/m)^{m/(\log m)}\approx 1/e$
- So, number of slots till see next node is geometric
- Do it $\log m$ times.
- Chernoff bound says time for $\log m$ hits is $O(\log m)$ w.h.p.
- With balanced homesteads, get balanced load
We assumed random
- Turns out $O(\log m)$-way independence enough
- In practice, use MD5 or SHA1
What about flash crowds for one page?
- Random trees