perfect hash families
Last time: a 2-universal hash family that is
- small (constant words to describe)
- fast to evaluate
Improve
- perfect hash function: no collisions
- perfect family $H$: for any $S$ of size $s \le n$, some $h \in H$ is perfect for $S$
- eg, set of all functions
Alternative try: use more space:
- How big can $s$ be for a random function from $s$ items to $n$ buckets to have no collisions?
- Expected number of collisions is $E[\sum C_{ij}] = \binom{s}{2}(1/n) \approx s^2/2n$
- So $s=\sqrt{n}$ works with prob. 1/2 (Markov)
- Is this best possible?
- Birthday problem: $\Pr[\text{no collision}] = (1-1/n)(1-2/n)\cdots(1-(s-1)/n) \approx e^{-(1+2+\cdots+(s-1))/n} \approx e^{-s^2/2n}$
- So $s=\sqrt{n}$ already has an $\Omega(1)$ chance of collision
- 23 people suffice for birthdays ($n=365$)
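A quick numeric check of the approximation (illustrative only): the exact birthday product versus the $e^{-s^2/2n}$ estimate, at $s=23$, $n=365$.

```python
import math

def no_collision_prob(s, n):
    """Exact probability that s uniform choices from n slots are all distinct."""
    p = 1.0
    for i in range(s):
        p *= (n - i) / n
    return p

# s = 23, n = 365: collision probability already exceeds 1/2,
# and the e^{-s^2/2n} estimate is close to the exact value.
exact = 1 - no_collision_prob(23, 365)
approx = 1 - math.exp(-23**2 / (2 * 365))
print(exact, approx)
```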
Two level hashing solves problem
- Hash $s$ items into $O(s)$ space 2-universally
- Build quadratic size hash table on contents of each bucket
- bound $\sum_k b_k^2=\sum_k (\sum_i [i \in b_k])^2 = s + 2\sum_{i<j} C_{ij}$
- expected value $O(s)$
- So retry until the bound holds (Markov: constant expected number of tries)
- Then build collision-free quadratic tables inside
- Retry each until collision-free
- Polynomial time in $s$; a Las Vegas algorithm
- Easy: $6s$ cells
- Hard: $s+o(s)$ cells (bit fiddling)
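A sketch of the two-level (FKS-style) construction above, using $h(x)=((ax+b)\bmod p)\bmod \text{size}$ as the 2-universal family; the retry threshold $4s$ is one valid Markov constant, not the notes' tuned constant.

```python
import random

P = (1 << 61) - 1  # a prime larger than any key we hash

def make_hash(size):
    """Draw h(x) = ((a*x + b) mod P) mod size from a 2-universal family."""
    a, b = random.randrange(1, P), random.randrange(P)
    return lambda x: ((a * x + b) % P) % size

def build_perfect(keys):
    """Two-level (FKS-style) perfect hashing for a static set of integer keys."""
    s = len(keys)
    # Top level: retry until sum of squared bucket sizes is O(s).
    # E[sum b_k^2] = s + 2*C(s,2)/s < 2s, so Markov gives <= 4s with prob >= 1/2.
    while True:
        h = make_hash(s)
        buckets = [[] for _ in range(s)]
        for x in keys:
            buckets[h(x)].append(x)
        if sum(len(b) ** 2 for b in buckets) <= 4 * s:
            break
    # Bottom level: a quadratic-size table per bucket, retried until collision-free.
    tables = []
    for b in buckets:
        size = max(1, len(b) ** 2)
        while True:
            g = make_hash(size)
            slots = {}
            for x in b:
                i = g(x)
                if i in slots:
                    break  # collision: redraw this bucket's hash
                slots[i] = x
            else:
                tables.append((g, slots))
                break
    return h, tables

def lookup(h, tables, x):
    g, slots = tables[h(x)]
    return slots.get(g(x)) == x
```

Both retry loops succeed with constant probability per attempt, so the whole build is Las Vegas with expected polynomial time in $s$.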
Derandomization
- Probability $1/2$ top-level function works
- Only $m^2$ top-level functions
- Try them all!
- Polynomial in $m$ (not $n$), deterministic algorithm
Cuckoo Hashing
Uniform Hashing Assumption:
- Many hashing analyses assume the hash function is a random function
- makes analysis simple
- valid if e.g. use a cryptographic hash function (indistinguishable from random by any polynomial time algorithm)
- also, many results hold even if replace with a limited-independence pseudorandom function---e.g., an $O(\log n)$-wise independent family.
- We'll make this assumption for cuckoo and consistent hashing.
Suppose your $n$ items all make their two random choices and then you place them carefully?
- think of buckets as $n$ vertices
- think of items as $n$ random edges representing choices for each item
- this is a random graph
- very rich literature
- e.g. Erdos-Renyi connectivity result
- Not quite correct.
- Two random hashes might choose same vertex and create a loop.
- Two edges might be identical, i.e. parallel, forming a 2-cycle
Analysis
- placing items equivalent to directing edges and placing item at head
- need indegree 1 everywhere
- If no cycle, no problem placing items
- components are trees
- pick a root
- direct all edges away from root
- actually, a cycle is fine too
- direct all edges in same direction around cycle
- and their combination, a pseudotree (a tree plus one edge, containing exactly one cycle)
- make the cycle the root
- direct edges around the cycle
- direct other edges away from the cycle
- So what is the failure case? A bicycle component: one containing 2 distinct cycles.
- edges $\ge$ vertices $+ 1$, but each vertex can hold only one item, so no room
- fundamental question: how likely is a bicycle in a random graph?
What about insertion?
- I choose one of my two buckets (pick an edge direction)
- if occupied, evict current to make room for me
- evicted occupant moves to its second location
- corresponds to reversing its edge
- which may force another edge to reverse, cascading along a path
- if the path eventually reaches an unoccupied vertex, we're done!
- if not, has to eventually hit a vertex we've seen before: cycle!
- if it is tail of inserted item edge, done!
- otherwise, it is bucket that got occupied by a new item
- we evict that item back to its original location
- and cascade all the way back to the just inserted item
- which gets evicted to its second location and proceeds to cascade evictions there
- but if there's no bicycle, there's no cycle in this direction because we already saw one in the other!
- so in this direction we will reach an empty vertex and stop
- Total time spent is proportional to the number of edges visited, which is at most the size of the connected component
- fundamental question: how big is a component in a random graph?
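The eviction process above as code: a single array with two hash choices per key. The bound of $2n$ on the eviction chain and the rebuild-on-failure policy are standard practical choices, not derived in these notes.

```python
import random

P = (1 << 61) - 1  # a prime larger than any key

class Cuckoo:
    def __init__(self, n):
        self.n = n
        self.slots = [None] * n
        # two independent hash functions from a 2-universal family
        self.coef = [(random.randrange(1, P), random.randrange(P)) for _ in range(2)]

    def _hash(self, i, x):
        a, b = self.coef[i]
        return ((a * x + b) % P) % self.n

    def contains(self, x):
        return self.slots[self._hash(0, x)] == x or self.slots[self._hash(1, x)] == x

    def insert(self, x):
        pos = self._hash(0, x)
        for _ in range(2 * self.n):  # bound the cascade of evictions
            if self.slots[pos] is None:
                self.slots[pos] = x
                return True
            self.slots[pos], x = x, self.slots[pos]  # evict the occupant
            h0, h1 = self._hash(0, x), self._hash(1, x)
            pos = h1 if pos == h0 else h0  # evicted item tries its other bucket
        return False  # chain too long: a full implementation would rehash everything

def build(keys, n):
    """Las Vegas retry loop: rebuild with fresh hashes until all inserts succeed."""
    while True:
        t = Cuckoo(n)
        if all(t.insert(x) for x in keys):
            return t
```

At load below 1/2 (here 30 items in 100 slots), a rebuild is rarely needed.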
Theory of Random Graphs
Two mostly-equivalent models:
- $G(n,p)$ graph with $n$ vertices with each edge present independently with probability $p$
- $G(n,m)$ graph with $n$ vertices and $m$ distinct random edges
- Mostly same results for $G(n,p)$ and $G(n,m=p\binom{n}{2})$
- since $G(n,p)$ has good chance of having exactly expected number of edges
Random graph theory sledgehammer
- At $m=(1-\epsilon)n/2$, no bicycle whp (exponent grows with $\epsilon$)
- so cuckoo hashing works.
- Also, connected component is size $O(1)$ in expectation and $O(\log n)$ whp
- So this is insertion time for cuckoo hashing.
- (Conversely, at $m=(1+\epsilon)n/2$, whp a bicycle and a huge component; again, the exponent grows with $\epsilon$.)
- “sharp threshold” phenomenon. Everything changes at $n/2$ edges i.e. $p=1/n$.
Basic analysis. Probability of a cycle?
- Suppose $m=cn/2$ for $c< 1$
- A $k$-cycle consists of $k$ vertices, so there are $< n^k/2k$ “potential” $k$-cycles (the denominator accounts for rotation and reflection)
- each edge is chosen with probability $m/\binom{n}{2}=c/(n-1)\approx c/n$
- and negatively correlated so probability all are chosen $< (c/n)^k$
- so expected number of $k$-cycles is $< c^k/2k$
- the sum over $k \ge 3$ is $< 1$ if $c < 2/3$ or so
- so constant probability of no cycle (so certainly no bicycle) and cuckoo hashing works
- but there truly is a constant probability of getting a cycle, so this is too weak to show cuckoo hashing works reliably
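One explicit way to bound the sum of expected $k$-cycle counts, using $1/2k \le 1/6$ for $k \ge 3$:
$$\sum_{k\ge 3}\frac{c^k}{2k} \;\le\; \frac{1}{6}\sum_{k\ge 3}c^k \;=\; \frac{c^3}{6(1-c)},$$
which at $c=2/3$ evaluates to $(8/27)/2 = 4/27 < 1$.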
Fancier analysis: Probability of a bicycle
- show every bicycle contains one of 3 canonical shapes: a figure 8, a dumbbell, or a “theta”
- then show no canonical shape exists whp by counting like we did cycles
Canonical shapes
- consider a minimal bicycle on $k$ vertices
- it contains a spanning tree
- each additional edge forms a cycle with the spanning tree
- so if minimal there are only 2 additional edges, totalling $k+1$
- if it has a leaf, can discard and still have bicycle
- now every degree $\ge 2$
- $k+1$ edges means sum of degrees is $2k+2$
- but every degree $\ge 2$ means only 2 extra units of degree remain
- either a single vertex of degree 4 or two vertices of degree 3
- rest of vertices can only form paths with endpoints at those special vertices
- two paths with both ends at degree 4 vertex is a figure 8
- two degree-3 vertices can either have each of their three paths go to the other: this creates a theta
- or else one path returns to the degree-3 vertex it left (forming a cycle) and another path goes to the other vertex (which also has a cycle): this is a dumbbell
Let's count expected figure-eights.
- a cycle on $a< n$ vertices and a cycle on $b< n$ vertices, sharing one vertex
- so in total, $a+b-1$ vertices
- but $a+b$ edges
- each vertex can be one of $n$ choices (overcounting, since repeats aren't allowed), so at most $n^{a+b-1}$ potential figure eights
- each cycle edge is occupied with probability $m/\binom{n}{2}\approx c/n$.
- occupancies are negatively correlated so probability all are occupied is $(c/n)^{a+b}$
- so expected number of $(a,b)$ eights is at most $n^{a+b-1}(c/n)^{a+b} = c^{a+b}/n$
- now sum over $a,b< n$: $$ \begin{align*} \sum_a \sum_b c^{a+b}/n &= (1/n) \sum c^a \sum c^b\\ &=O(1/n) \end{align*} $$
- So expected number of figure eights is negligible
- note how we get one more critical factor of $c/n$ because the minimal shapes have one more edge than a cycle on the same number of vertices
Analysis for dumbbells and thetas is the same: $O(1/n)$
- both have one more edge than vertex
- so probability a single one on $k$ vertices is created is $c^k/n$
- now sum over sizes of different parts (3 paths or 1 path and 2 cycles) converges in the same way
Note: not our usual “with high probability” result
- because you can't fiddle constants to make the $O(1/n)$ into $O(1/n^2)$ etc.
- but $O(1/n)$ is small enough for our use case!
Size of components via “exposure process”
- consider a BFS from a given vertex
- keep a queue, initially the starting vertex
- extract a vertex from the queue, add its neighbors to the queue
- BFS done when queue is empty
- how many neighbors get added in a step?
- each edge from neighbor present with probability $c/n$
- so (stochastically dominated by) $B(n, c/n)$ neighbors
- so after $k$ steps total number added is $B(kn, c/n)$
- expected number added is $kn\cdot c/n=kc < k$
- but in order to still be running, at least $k$ vertices must have been added
- This is $1/c=1/(1-\epsilon)\approx 1+\epsilon$ times expectation,
- once $k=\Omega(\log n)$, Chernoff says whp you won't have this many
- in other words, queue will empty out before this
- conclude maximum component size is $O(\log n)$
Random graph theory is broad
- when is the graph connected? ($p=\Theta((\log n)/n)$, i.e. $m=\Theta(n\log n)$ edges)
- when is there a Hamiltonian path?
- when is there a perfect matching?
- what is the diameter?
- what is the chromatic number?
- sharp thresholds abound.
- whole textbooks
Self collisions
- we discussed possibility of 2 hashes for item being equal
- Solution 1 (common in practice): partition the hash space
- split the array in half, put one hash in each half
- so now it is a random bipartite graph and thus loopless
- load factor analysis basically the same
- Solution 2: constrain to not have identical hashes
- we want two distinct buckets from $1\ldots n$
- i.e. choosing from $\binom{n}{2}$ possibilities
- so do that directly
- e.g., choose one number $a$ from $1\ldots n$, then an offset $b$ from $1\ldots n-1$, and return $a$ and $(a+b) \bmod n$.
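Solution 2 as code (0-indexed buckets for convenience). Each unordered pair $\{x,y\}$ arises from exactly two $(a,b)$ choices out of $n(n-1)$, so pairs are uniform over the $\binom{n}{2}$ possibilities.

```python
import random

def two_distinct_buckets(n, rng=random):
    """Uniform unordered pair of distinct buckets from 0..n-1:
    pick a bucket a, then a nonzero offset b, return a and (a+b) mod n."""
    a = rng.randrange(n)
    b = rng.randrange(1, n)
    return a, (a + b) % n
```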
Duplicate edges?
- probability 2 specific items choose same pair of hashes is $1/\binom{n}{2}\approx 2/n^2$
- there are $\binom{m}{2}=\binom{cn/2}{2} \approx c^2n^2/8$ pairs of items
- so the expected number of duplicate edges is $\approx c^2/4 < 1$
- and $O(\log n)$ whp
- which edges are duplicated is random
- we showed there's only a constant number of cyclic components
- each of size $O(\log n)$
- so $O(\log n)$ vertices are in a component with a cycle
- so probability a duplicate edge lands both ends in one to create a bicycle is $\le (\log^3 n)/n^2$
Consistent Hashing
Motivation
- $m$ distributed web caches
- Assign $n$ items such that no cache overloaded
- Hashing fine
- Problem: machines come and go
- Change one machine, whole hash function changes
- Better if when one machine leaves, only $O(1/m)$ of data leaves.
Outline
- Use standard hash function to map each machine to $[0,1]$ interval
- Also map each item to same interval
- Assign each item to immediate successor on ring
Intuition:
- Machines “random”, evenly spread out; each “owns” $1/m$ of ring.
- Items “random”: each lands uniformly on the ring, so it falls in a given homestead with probability equal to its length
- When one machine leaves, its successor takes over
- Only requires moving data from departing machine
- When machine arrives, splits a homestead
- Only steals data from one machine
Implementation
- Binary tree
- $O(\log m)$ to insert new machine or hash an item
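A sketch of the ring. A sorted list with `bisect` stands in for the balanced tree (list insertion is $O(m)$, so a real implementation would use a BST or skip list for the claimed $O(\log m)$ updates); SHA-1 plays the role of the standard hash function mapping names to $[0,1)$.

```python
import bisect
import hashlib

def _point(key):
    """Map a string to a point in [0, 1) via a standard hash (here SHA-1)."""
    h = int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")
    return h / 2**64

class Ring:
    """Consistent hashing ring: each item goes to its successor machine."""
    def __init__(self):
        self.points = []  # sorted machine positions on the ring
        self.names = []   # machine name at each position

    def add_machine(self, name):
        p = _point(name)
        i = bisect.bisect(self.points, p)
        self.points.insert(i, p)
        self.names.insert(i, name)

    def remove_machine(self, name):
        i = self.names.index(name)
        del self.points[i]
        del self.names[i]

    def lookup(self, item):
        # first machine strictly after the item's point, wrapping around
        i = bisect.bisect(self.points, _point(item)) % len(self.points)
        return self.names[i]
```

When a machine leaves, only the items it owned move (to its successor); everything else keeps its assignment.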
Analysis
- For simplicity, assume base hash produces “random” values in $[0,1]$
- review later
Max homestead size $O((\log m)/m)$
- Consider position of machine $M$
- Consider the interval of length $2(\log m)/m$ preceding $M$
- Probability no other machine lands there: $(1-2(\log m)/m)^{m-1}\approx e^{-2\log m}=1/m^2$
Conclude load
- Homestead size $p$ means expected load $pn$
- Once expected load exceeds $\log m$, it is within a constant factor of its expectation w.h.p. in $m$
- So, max load $O((\log m)(1+n/m))$
Can we do better?
- Not a balls in bins issue, since probably $n\gg m$
- Instead, problem is mismatched homesteads
- Create $\log m$ “virtual nodes” for each real
- Claim: for each real machine, the sum of its virtual nodes' homestead sizes is $O(1/m)$
- Break the ring into $m\log m$ equal-size “slots” (same as the number of virtual nodes)
- Evaluate number of slots owned by one node
- Start at virtual node---seek forward on slots until see another node
- When see successor, jump to next virtual node
- a given slot contains no node with prob. $(1-1/(m\log m))^{m\log m}\approx 1/e$
- So, number of slots till see next node is geometric
- Do it $\log m$ times.
- Chernoff bound for sum of geometric r.v. says time for $\log m$ hits is $O(\log m)$ w.h.p.
- With balanced homesteads, get balanced load
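A simulation of the effect of virtual nodes: compare the maximum total arc owned by one real machine with one point each versus several points each (a sketch; positions are drawn uniformly rather than from a hash, and "10 virtual nodes" stands in for $\log m$).

```python
import random

def max_homestead(m, virtual, rng):
    """Largest total arc owned by one of m machines, each placing
    `virtual` copies uniformly on the unit ring; a point owns the arc
    between it and its predecessor (items go to their successor)."""
    pts = sorted((rng.random(), i) for i in range(m) for _ in range(virtual))
    arcs = [0.0] * m
    prev = pts[-1][0] - 1.0  # predecessor of the first point, wrapped around
    for p, owner in pts:
        arcs[owner] += p - prev
        prev = p
    return max(arcs)

rng = random.Random(0)
single = max_homestead(1000, 1, rng)   # on the order of (log m)/m
multi = max_homestead(1000, 10, rng)   # much closer to the ideal 1/m
print(single, multi)
```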
We assumed random
- Turns out $O(\log m)$-way independence enough
- In practice, use MD5 or SHA1
What about flash crowds for one page?
- Random trees