perfect hash families
Last time: a 2-universal hash family that is
- small (constant words to describe)
- fast to evaluate
Improve
- perfect hash function: no collisions
- perfect family $H$: for any $S$ of size $s \le n$, some $h \in H$ is perfect for $S$
- eg, set of all functions
Alternative try: use more space:
- How big can $s$ be for a random function from $s$ items to $n$ buckets to have no collisions?
- Expected number of collisions is $E[\sum C_{ij}] = \binom{s}{2}(1/n) \approx s^2/2n$
- So $s=\sqrt{n}$ works with prob. 1/2 (Markov)
- Is this best possible?
- Birthday problem: $\Pr[\text{no collision}] = (1-1/n)(1-2/n)\cdots(1-(s-1)/n) \approx e^{-(1+2+\cdots+(s-1))/n} \approx e^{-s^2/2n}$
- So $s=\sqrt{n}$ already has an $\Omega(1)$ chance of collision
- 23 people suffice for birthdays ($n=365$)
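A quick numeric check of the approximation (illustrative only): the exact birthday product versus the $e^{-s^2/2n}$ estimate, at $s=23$, $n=365$.

```python
import math

def no_collision_prob(s, n):
    """Exact probability that s uniform choices from n slots are all distinct."""
    p = 1.0
    for i in range(s):
        p *= (n - i) / n
    return p

# s = 23, n = 365: collision probability already exceeds 1/2,
# and the e^{-s^2/2n} estimate is close to the exact value.
exact = 1 - no_collision_prob(23, 365)
approx = 1 - math.exp(-23**2 / (2 * 365))
print(exact, approx)
```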
Two level hashing solves problem
- Hash $s$ items into $O(s)$ space 2-universally
- Build quadratic size hash table on contents of each bucket
- bound $\sum_k b_k^2=\sum_k (\sum_i [i \in b_k])^2 = s + 2\sum_{i<j} C_{ij}$
- expected value $O(s)$
- So retry until the bound holds (Markov: constant expected number of tries)
- Then build collision-free quadratic tables inside
- Retry each until collision-free
- Polynomial time in $s$; a Las Vegas algorithm
- Easy: $6s$ cells
- Hard: $s+o(s)$ cells (bit fiddling)
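A sketch of the two-level (FKS-style) construction above, using $h(x)=((ax+b)\bmod p)\bmod \text{size}$ as the 2-universal family; the retry threshold $4s$ is one valid Markov constant, not the notes' tuned constant.

```python
import random

P = (1 << 61) - 1  # a prime larger than any key we hash

def make_hash(size):
    """Draw h(x) = ((a*x + b) mod P) mod size from a 2-universal family."""
    a, b = random.randrange(1, P), random.randrange(P)
    return lambda x: ((a * x + b) % P) % size

def build_perfect(keys):
    """Two-level (FKS-style) perfect hashing for a static set of integer keys."""
    s = len(keys)
    # Top level: retry until sum of squared bucket sizes is O(s).
    # E[sum b_k^2] = s + 2*C(s,2)/s < 2s, so Markov gives <= 4s with prob >= 1/2.
    while True:
        h = make_hash(s)
        buckets = [[] for _ in range(s)]
        for x in keys:
            buckets[h(x)].append(x)
        if sum(len(b) ** 2 for b in buckets) <= 4 * s:
            break
    # Bottom level: a quadratic-size table per bucket, retried until collision-free.
    tables = []
    for b in buckets:
        size = max(1, len(b) ** 2)
        while True:
            g = make_hash(size)
            slots = {}
            for x in b:
                i = g(x)
                if i in slots:
                    break  # collision: redraw this bucket's hash
                slots[i] = x
            else:
                tables.append((g, slots))
                break
    return h, tables

def lookup(h, tables, x):
    g, slots = tables[h(x)]
    return slots.get(g(x)) == x
```

Both retry loops succeed with constant probability per attempt, so the whole build is Las Vegas with expected polynomial time in $s$.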
Derandomization
- Probability $1/2$ top-level function works
- Only $m^2$ top-level functions
- Try them all!
- Polynomial in $m$ (not $n$), deterministic algorithm
Cuckoo Hashing
Uniform Hashing Assumption:
- Many hashing analyses assume the hash function is a random function
- makes analysis simple
- valid if e.g. use a cryptographic hash function (indistinguishable from random by any polynomial time algorithm)
- also, many results hold even if replace with a limited-independence pseudorandom function---e.g., an $O(\log n)$-wise independent family.
- We'll make this assumption for cuckoo and consistent hashing.
Suppose your $n$ items all make their two random choices and then you place them carefully?
- think of buckets as $n$ vertices
- think of items as $n$ random edges representing choices for each item
- this is a random graph
- very rich literature
- e.g. Erdos-Renyi connectivity result
- Not quite correct.
- Two random hashes might choose same vertex and create a loop.
- Two edges might be identical, i.e. parallel, forming a 2-cycle
Analysis
- placing items equivalent to directing edges and placing item at head
- need indegree 1 everywhere
- If no cycle, no problem placing items
- components are trees
- pick a root
- direct all edges away from root
- actually, a cycle is fine too
- direct all edges in same direction around cycle
- and their combination, a pseudotree (a tree plus one edge, containing exactly one cycle)
- make the cycle the root
- direct edges around the cycle
- direct other edges away from the cycle
- So what is the failure case? A bicycle component: one containing 2 distinct cycles.
- edges $\ge$ vertices $+ 1$, but each vertex can hold only one item, so no room
- fundamental question: how likely is a bicycle in a random graph?
What about insertion?
- I choose one of my two buckets (pick an edge direction)
- if occupied, evict current to make room for me
- evicted occupant moves to its second location
- corresponds to reversing its edge
- which may force another edge to reverse, cascading along a path
- if the path eventually reaches an unoccupied vertex, we're done!
- if not, has to eventually hit a vertex we've seen before: cycle!
- if it is tail of inserted item edge, done!
- otherwise, it is bucket that got occupied by a new item
- we evict that item back to its original location
- and cascade all the way back to the just inserted item
- which gets evicted to its second location and proceeds to cascade evictions there
- but if there's no bicycle, there's no cycle in this direction because we already saw one in the other!
- so in this direction we will reach an empty vertex and stop
- Total time spent is proportional to the number of edges visited, which is at most the size of the connected component
- fundamental question: how big is a component in a random graph?
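The eviction process above as code: a single array with two hash choices per key. The bound of $2n$ on the eviction chain and the rebuild-on-failure policy are standard practical choices, not derived in these notes.

```python
import random

P = (1 << 61) - 1  # a prime larger than any key

class Cuckoo:
    def __init__(self, n):
        self.n = n
        self.slots = [None] * n
        # two independent hash functions from a 2-universal family
        self.coef = [(random.randrange(1, P), random.randrange(P)) for _ in range(2)]

    def _hash(self, i, x):
        a, b = self.coef[i]
        return ((a * x + b) % P) % self.n

    def contains(self, x):
        return self.slots[self._hash(0, x)] == x or self.slots[self._hash(1, x)] == x

    def insert(self, x):
        pos = self._hash(0, x)
        for _ in range(2 * self.n):  # bound the cascade of evictions
            if self.slots[pos] is None:
                self.slots[pos] = x
                return True
            self.slots[pos], x = x, self.slots[pos]  # evict the occupant
            h0, h1 = self._hash(0, x), self._hash(1, x)
            pos = h1 if pos == h0 else h0  # evicted item tries its other bucket
        return False  # chain too long: a full implementation would rehash everything

def build(keys, n):
    """Las Vegas retry loop: rebuild with fresh hashes until all inserts succeed."""
    while True:
        t = Cuckoo(n)
        if all(t.insert(x) for x in keys):
            return t
```

At load below 1/2 (here 30 items in 100 slots), a rebuild is rarely needed.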
Theory of Random Graphs
Two mostly-equivalent models:
- $G(n,p)$ graph with $n$ vertices with each edge present independently with probability $p$
- $G(n,m)$ graph with $n$ vertices and $m$ distinct random edges
- Mostly same results for $G(n,p)$ and $G(n,m=p\binom{n}{2})$
- since $G(n,p)$ has good chance of having exactly expected number of edges
Random graph theory sledgehammer
- At $m=(1-\epsilon)n/2$, no bicycle whp (exponent grows with $\epsilon$)
- so cuckoo hashing works.
- Also, connected component is size $O(1)$ in expectation and $O(\log n)$ whp
- So this is insertion time for cuckoo hashing.
- (Conversely, at $m=(1+\epsilon)n/2$, whp a bicycle and a huge component; again, the exponent grows with $\epsilon$.)
- “sharp threshold” phenomenon. Everything changes at $n/2$ edges i.e. $p=1/n$.
Basic analysis. Probability of a cycle?
- Suppose $m=cn/2$ for $c< 1$
- A $k$-cycle consists of $k$ vertices, so there are $< n^k/2k$ “potential” $k$-cycles (the denominator accounts for rotation and reflection)
- each edge is chosen with probability $m/\binom{n}{2}=c/(n-1)\approx c/n$
- and negatively correlated so probability all are chosen $< (c/n)^k$
- so expected number of $k$-cycles is $< c^k/2k$
- the sum over $k \ge 3$ is $< 1$ if $c < 2/3$ or so
- so constant probability of no cycle (so certainly no bicycle) and cuckoo hashing works
- but there truly is a constant probability of getting a cycle, so this is too weak to show cuckoo hashing works reliably
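One explicit way to bound the sum of expected $k$-cycle counts, using $1/2k \le 1/6$ for $k \ge 3$:
$$\sum_{k\ge 3}\frac{c^k}{2k} \;\le\; \frac{1}{6}\sum_{k\ge 3}c^k \;=\; \frac{c^3}{6(1-c)},$$
which at $c=2/3$ evaluates to $(8/27)/2 = 4/27 < 1$.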
Fancier analysis: Probability of a bicycle
- show every bicycle contains one of 3 canonical shapes: a figure 8, a dumbbell, or a “theta”
- then show no canonical shape exists whp by counting like we did cycles
Canonical shapes
- consider a minimal bicycle on $k$ vertices
- it contains a spanning tree
- each additional edge forms a cycle with the spanning tree
- so if minimal there are only 2 additional edges, totalling $k+1$
- if it has a leaf, can discard and still have bicycle
- now every degree $\ge 2$
- $k+1$ edges means sum of degrees is $2k+2$
- but every degree $\ge 2$ means only 2 extra units of degree remain
- either a single vertex of degree 4 or two vertices of degree 3
- rest of vertices can only form paths with endpoints at those special vertices
- two paths with both ends at degree 4 vertex is a figure 8
- two degree-3 vertices can either have each of their three paths go to the other: this creates a theta
- or else one path returns to the degree-3 vertex it left (forming a cycle) and another path goes to the other vertex (which also has a cycle): this is a dumbbell
Let's count expected figure-eights.
- a cycle on $a< n$ vertices and a cycle on $b< n$ vertices, sharing one vertex
- so in total, $a+b-1$ vertices
- but $a+b$ edges
- each vertex can be one of $n$ choices (overcounting, since repeats aren't allowed), so at most $n^{a+b-1}$ potential figure eights
- each cycle edge is occupied with probability $m/\binom{n}{2}\approx c/n$.
- occupancies are negatively correlated so probability all are occupied is $(c/n)^{a+b}$
- so expected number of $(a,b)$ eights is at most $n^{a+b-1}(c/n)^{a+b} = c^{a+b}/n$
- now sum over $a,b< n$: $$ \begin{align*} \sum_a \sum_b c^{a+b}/n &= (1/n) \sum c^a \sum c^b\\ &=O(1/n) \end{align*} $$
- So expected number of figure eights is negligible
- note how we get one more critical factor of $c/n$ because the minimal shapes have one more edge than a cycle on the same number of vertices
Analysis for dumbbells and thetas is the same: $O(1/n)$
- both have one more edge than vertex
- so probability a single one on $k$ vertices is created is $c^k/n$
- now sum over sizes of different parts (3 paths or 1 path and 2 cycles) converges in the same way
Note: not our usual “with high probability” result
- because you can't fiddle constants to make the $O(1/n)$ into $O(1/n^2)$ etc.
- but $O(1/n)$ is small enough for our use case!
Size of components via “exposure process”
- consider a BFS from a given vertex
- keep a queue, initially the starting vertex
- extract a vertex from the queue, add its neighbors to the queue
- BFS done when queue is empty
- how many neighbors get added in a step?
- each edge from neighbor present with probability $c/n$
- so (stochastically dominated by) $B(n, c/n)$ neighbors
- so after $k$ steps total number added is $B(kn, c/n)$
- expected number added is $kn\cdot c/n=kc < k$
- but in order to still be running, at least $k$ vertices must have been added
- This is $1/c=1/(1-\epsilon)\approx 1+\epsilon$ times expectation,
- once $k=\Omega(\log n)$, Chernoff says whp you won't have this many
- in other words, queue will empty out before this
- conclude maximum component size is $O(\log n)$
Random graph theory is broad
- when is the graph connected? ($p=\Theta((\log n)/n)$, i.e. $m=\Theta(n\log n)$ edges)
- when is there a Hamiltonian path?
- when is there a perfect matching?
- what is the diameter?
- what is the chromatic number?
- sharp thresholds abound.
- whole textbooks
Self collisions
- we discussed possibility of 2 hashes for item being equal
- Solution 1 (common in practice): partition the hash space
- split the array in half, put one hash in each half
- so now it is a random bipartite graph and thus loopless
- load factor analysis basically the same
- Solution 2: constrain to not have identical hashes
- we want two distinct buckets from $1\ldots n$
- i.e. choosing from $\binom{n}{2}$ possibilities
- so do that directly
- e.g., choose one number $a$ from $1\ldots n$, then an offset $b$ from $1\ldots n-1$, and return $a$ and $(a+b) \bmod n$.
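Solution 2 as code (0-indexed buckets for convenience). Each unordered pair $\{x,y\}$ arises from exactly two $(a,b)$ choices out of $n(n-1)$, so pairs are uniform over the $\binom{n}{2}$ possibilities.

```python
import random

def two_distinct_buckets(n, rng=random):
    """Uniform unordered pair of distinct buckets from 0..n-1:
    pick a bucket a, then a nonzero offset b, return a and (a+b) mod n."""
    a = rng.randrange(n)
    b = rng.randrange(1, n)
    return a, (a + b) % n
```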
Duplicate edges?
- probability 2 specific items choose same pair of hashes is $1/\binom{n}{2}\approx 2/n^2$
- there are $\binom{m}{2}=\binom{cn/2}{2} \approx c^2n^2/8$ pairs of items
- so the expected number of duplicate edges is $\approx c^2/4 < 1$
- and $O(\log n)$ whp
- which edges are duplicated is random
- we showed there's only a constant number of cyclic components
- each of size $O(\log n)$
- so $O(\log n)$ vertices are in a component with a cycle
- so probability a duplicate edge lands both ends in one to create a bicycle is $\le (\log^3 n)/n^2$
Consistent Hashing
Motivation
- $m$ distributed web caches
- Assign $n$ items such that no cache overloaded
- Hashing fine
- Problem: machines come and go
- Change one machine, whole hash function changes
- Better if when one machine leaves, only $O(1/m)$ of data leaves.
Outline
- Use standard hash function to map each machine to $[0,1]$ interval
- Also map each item to same interval
- Assign each item to immediate successor on ring
Intuition:
- Machines “random”, evenly spread out; each “owns” $1/m$ of ring.
- Items “random”: each lands uniformly on the ring, so it falls in a given homestead with probability equal to its length
- When one machine leaves, its successor takes over
- Only requires moving data from departing machine
- When machine arrives, splits a homestead
- Only steals data from one machine
Implementation
- Binary tree
- $O(\log m)$ to insert new machine or hash an item
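A sketch of the ring. A sorted list with `bisect` stands in for the balanced tree (list insertion is $O(m)$, so a real implementation would use a BST or skip list for the claimed $O(\log m)$ updates); SHA-1 plays the role of the standard hash function mapping names to $[0,1)$.

```python
import bisect
import hashlib

def _point(key):
    """Map a string to a point in [0, 1) via a standard hash (here SHA-1)."""
    h = int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")
    return h / 2**64

class Ring:
    """Consistent hashing ring: each item goes to its successor machine."""
    def __init__(self):
        self.points = []  # sorted machine positions on the ring
        self.names = []   # machine name at each position

    def add_machine(self, name):
        p = _point(name)
        i = bisect.bisect(self.points, p)
        self.points.insert(i, p)
        self.names.insert(i, name)

    def remove_machine(self, name):
        i = self.names.index(name)
        del self.points[i]
        del self.names[i]

    def lookup(self, item):
        # first machine strictly after the item's point, wrapping around
        i = bisect.bisect(self.points, _point(item)) % len(self.points)
        return self.names[i]
```

When a machine leaves, only the items it owned move (to its successor); everything else keeps its assignment.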
Analysis
- For simplicity, assume base hash produces “random” values in $[0,1]$
- review later
Max homestead size $O((\log m)/m)$
- Consider position of machine $M$
- Consider the interval of length $2(\log m)/m$ preceding $M$
- Probability no other machine lands there: $(1-2(\log m)/m)^{m-1}\approx e^{-2\log m}=1/m^2$
Conclude load
- Homestead size $p$ means expected load $pn$
- Once expected load exceeds $\log m$, it is within a constant factor of its expectation w.h.p. in $m$
- So, max load $O((\log m)(1+n/m))$
Can we do better?
- Not a balls in bins issue, since probably $n\gg m$
- Instead, problem is mismatched homesteads
- Create $\log m$ “virtual nodes” for each real
- Claim: for each real machine, the sum of its virtual nodes' homestead sizes is $O(1/m)$
- Break the ring into $m\log m$ equal-size “slots” (same as the number of virtual nodes)
- Evaluate number of slots owned by one node
- Start at virtual node---seek forward on slots until see another node
- When see successor, jump to next virtual node
- a given slot contains no node with prob. $(1-1/(m\log m))^{m\log m}\approx 1/e$
- So, number of slots till see next node is geometric
- Do it $\log m$ times.
- Chernoff bound for sum of geometric r.v. says time for $\log m$ hits is $O(\log m)$ w.h.p.
- With balanced homesteads, get balanced load
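A simulation of the effect of virtual nodes: compare the maximum total arc owned by one real machine with one point each versus several points each (a sketch; positions are drawn uniformly rather than from a hash, and "10 virtual nodes" stands in for $\log m$).

```python
import random

def max_homestead(m, virtual, rng):
    """Largest total arc owned by one of m machines, each placing
    `virtual` copies uniformly on the unit ring; a point owns the arc
    between it and its predecessor (items go to their successor)."""
    pts = sorted((rng.random(), i) for i in range(m) for _ in range(virtual))
    arcs = [0.0] * m
    prev = pts[-1][0] - 1.0  # predecessor of the first point, wrapped around
    for p, owner in pts:
        arcs[owner] += p - prev
        prev = p
    return max(arcs)

rng = random.Random(0)
single = max_homestead(1000, 1, rng)   # on the order of (log m)/m
multi = max_homestead(1000, 10, rng)   # much closer to the ideal 1/m
print(single, multi)
```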
We assumed random
- Turns out $O(\log m)$-way independence enough
- In practice, use MD5 or SHA1
What about flash crowds for one page?
- Random trees