
## Perfect Hash Families

Last time: a 2-universal hash family that is

• small (constant words to describe)
• fast to evaluate

Improve:

• perfect hash function: no collisions
• perfect hash family $H$: for any set $S$ of size $s \le n$, some $h$ in the family is perfect for $S$
• e.g., the set of all functions

Alternative try: use more space:

• How big can $s$ be for random $s$ to $n$ without collisions?
• Expected number of collisions is $E[\sum C_{ij}] = \binom{s}{2}(1/n) \approx s^2/2n$
• So $s=\sqrt{n}$ works with prob. $1/2$ (Markov)
• Is this best possible?
• Birthday problem: prob. of no collision is $(1-1/n)(1-2/n)\cdots(1-(s-1)/n) \approx e^{-(1+2+\cdots+(s-1))/n} \approx e^{-s^2/2n}$
• So, when $s=\Omega(\sqrt{n})$ there is an $\Omega(1)$ chance of collision
• e.g., 23 people give an even chance of a shared birthday ($n=365$)
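The exact product and the $e^{-s^2/2n}$ estimate are easy to compare numerically; a small sketch (function names are mine, not from the notes):

```python
import math

def no_collision_prob(s: int, n: int) -> float:
    # Exact probability that s random items in n buckets avoid collisions:
    # (1 - 1/n)(1 - 2/n)...(1 - (s-1)/n).
    p = 1.0
    for i in range(1, s):
        p *= 1 - i / n
    return p

def approx(s: int, n: int) -> float:
    # The e^{-s^2/2n} approximation from the notes.
    return math.exp(-s * s / (2 * n))

# Birthday numbers: 23 people, 365 days.
print(1 - no_collision_prob(23, 365))  # collision chance ≈ 0.507
print(1 - approx(23, 365))
```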

Two-level hashing solves the problem

• Hash $s$ items into $O(s)$ space 2-universally
• Build quadratic size hash table on contents of each bucket
• bound $\sum_k b_k^2=\sum_k \big(\sum_i [i \in b_k]\big)^2 =\sum_i C_{ii} + \sum_{i \ne j} C_{ij}$
• expected value $s + 2\binom{s}{2}\cdot O(1/s) = O(s)$
• So try hash functions till one achieves this bound (Markov: constant expected tries)
• Then build collision-free quadratic-size tables inside each bucket
• Again, try till one works
• Polynomial time in $s$, Las Vegas algorithm
• Easy: $6s$ cells
• Hard: $s+o(s)$ cells (bit fiddling)
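A minimal sketch of the two-level scheme, assuming integer keys below a fixed prime and the usual multiplicative 2-universal family; the $4s$ threshold in the Markov test and all names are my choices, not from the notes:

```python
import random

P = (1 << 31) - 1  # prime modulus; assumes all keys are below P

def make_hash(m):
    # Draw h(x) = ((a*x + b) mod P) mod m from a 2-universal family.
    a = random.randrange(1, P)
    b = random.randrange(P)
    return lambda x: ((a * x + b) % P) % m

def build_perfect(keys):
    s = len(keys)
    # Top level: retry until the sum of squared bucket sizes is O(s);
    # with threshold 4*s, Markov gives success probability >= 1/2 per try.
    while True:
        h = make_hash(s)
        buckets = [[] for _ in range(s)]
        for x in keys:
            buckets[h(x)].append(x)
        if sum(len(b) ** 2 for b in buckets) <= 4 * s:
            break
    # Second level: a collision-free table of quadratic size per bucket.
    tables = []
    for b in buckets:
        m2 = len(b) ** 2
        while True:
            g = make_hash(m2) if m2 else (lambda x: 0)
            slots = [None] * m2
            ok = True
            for x in b:
                i = g(x)
                if slots[i] is not None:
                    ok = False  # collision: redraw this bucket's function
                    break
                slots[i] = x
            if ok:
                break
        tables.append((g, slots))
    return h, tables

def lookup(h, tables, x):
    g, slots = tables[h(x)]
    return bool(slots) and slots[g(x)] == x
```

Total space is $\sum_k b_k^2 = O(s)$ cells, and every lookup probes exactly one second-level slot.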

Derandomization

• Probability $1/2$ that a random top-level function works
• Only $m^2$ top-level functions
• Try them all!
• Polynomial in $m$ (not $n$), deterministic algorithm
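The derandomization is just exhaustive search over the family; a toy sketch with a hypothetical small prime $p$ covering the key universe (keys and $p$ are invented for illustration):

```python
# Exhaustive search over a 2-universal family of ~p^2 functions
# h_{a,b}(x) = ((a*x + b) % p) % s.  Since a random member keeps the sum
# of squared bucket sizes O(s) with probability >= 1/2, trying every
# (a, b) pair deterministically finds a good one.
p = 101                         # toy prime covering the key universe
keys = [3, 17, 42, 58, 77, 90]  # hypothetical example keys
s = len(keys)

def cost(a, b):
    # Sum of squared bucket sizes under h_{a,b}.
    buckets = [0] * s
    for x in keys:
        buckets[((a * x + b) % p) % s] += 1
    return sum(c * c for c in buckets)

best = min(((a, b) for a in range(1, p) for b in range(p)),
           key=lambda ab: cost(*ab))
print(best, cost(*best))
```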

### Cuckoo Hashing

Uniform Hashing Assumption:

• Many hashing analyses assume the hash function is a truly random function
• makes the analysis simple
• valid if, e.g., we use a cryptographic hash function (indistinguishable from random by any polynomial-time algorithm)
• also, many results hold even if we replace it with a limited-independence pseudorandom function---e.g., an $O(\log n)$-wise independent family.
• We'll make this assumption for cuckoo and consistent hashing.

Suppose each of your $m$ items makes its two random bucket choices up front, and then you place the items carefully?

• think of the $n$ buckets as vertices
• think of the $m$ items as random edges, one joining each item's two choices
• this is a random graph
• very rich literature
• e.g. Erdos-Renyi connectivity result

Analysis

• If no cycle, no problem placing items
• why?
• what is the chance of a cycle when we throw $m$ edges into $n$ vertices?
• prob. given length-$k$ cycle is $\le (2m/n^2)^k$
• prob. some edge takes first spot on cycle is $2m/n^2$
• conditioned on that, prob. some edge takes next spot is $2(m-1)/n^2$
• etc.
• number of length $k$ cycles is $\le n^k$
• So the expected number of length-$k$ cycles is at most $(2m/n)^k$
• So prob. cycle of length 2 or more is $O((2m/n)^2) < 1$ if $m \ll n/2$
• Reasonably tight since prob. no length-2 cycle (two identical edges) is birthday paradox: $(1-1/n^2)\cdots(1-m/n^2)\approx e^{-m^2/2n^2}$
• Actually, only have problem if there is a component with 2 cycles (bicycle) or cycle with a chord.
• Later work improves this: one can achieve max load 2 with 83% load factor

• insertion evicts items along a path of edges
• when the path reaches a leaf (a free bucket), insertion succeeds
• if it returns to a previously visited node, we may have a problem
• consider the path case first
• prob. there exists a length-$k$ path is $\le n^k(2m/n^2)^k = (2m/n)^k$
• so insertion takes expected constant time, and $O(\log n)$ time w.h.p.
• what about cycle?
• when $m \ll n$, Prob length-$k$ cycle involving me (given my ends) is $n^{k-2}(2m/n^2)^{k-1} = O(1/n)$
• so whp no long path or cycle
• if we do get a cycle, rehash everything (cost $O(n)$)
• amortized cost $O(1)$ per insert
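A sketch of cuckoo insertion with eviction and rebuild-on-long-path; the $O(\log n)$ eviction budget mirrors the w.h.p. path bound above, and salting Python's built-in `hash` with random seeds merely stands in for the truly random functions the analysis assumes:

```python
import random

class Cuckoo:
    """Cuckoo hashing sketch: each key has two candidate buckets; insertion
    evicts along a path and rebuilds the table if the path gets too long."""

    def __init__(self, n=64):
        self.n = n
        self.table = [None] * n
        self._new_seeds()

    def _new_seeds(self):
        # Two random salts give two "random" bucket choices per key --
        # a stand-in for the uniform hashing assumption.
        self.seeds = (random.random(), random.random())

    def _slots(self, x):
        return [hash((s, x)) % self.n for s in self.seeds]

    def lookup(self, x):
        return any(self.table[i] == x for i in self._slots(x))

    def insert(self, x):
        if self.lookup(x):
            return
        i = self._slots(x)[0]
        for _ in range(6 * self.n.bit_length()):  # O(log n) eviction budget
            if self.table[i] is None:
                self.table[i] = x
                return
            x, self.table[i] = self.table[i], x   # evict the occupant
            a, b = self._slots(x)
            i = b if i == a else a                # occupant tries its other slot
        # Path too long -- likely a cycle: new functions, rehash all (O(n)).
        items = [y for y in self.table if y is not None] + [x]
        self.table = [None] * self.n
        self._new_seeds()
        for y in items:
            self.insert(y)
```

Lookups probe exactly two buckets, which is the whole appeal: worst-case $O(1)$ reads.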

## Consistent Hashing

Motivation

• $m$ distributed web caches
• Assign $n$ items so that no cache is overloaded
• Hashing works fine
• Problem: machines come and go
• Change one machine and the whole hash function changes, so nearly every item moves
• Better if, when one machine leaves, only an $O(1/m)$ fraction of the data moves.

Outline

• Use standard hash function to map each machine to $[0,1]$ interval
• Also map each item to same interval
• Assign each item to immediate successor on ring

Intuition:

• Machines “random”, evenly spread out; each “owns” a “homestead” of about $1/m$ of the ring.
• Items “random”, equally likely to land anywhere on the ring
• When one machine leaves, its successor takes over its homestead
• Only requires moving data from the departing machine
• When a machine arrives, it splits an existing homestead
• Only steals data from one machine

Implementation

• Binary tree
• $O(\log m)$ to insert new machine or hash an item
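A sketch using binary search over a sorted array in place of the binary tree; lookups are $O(\log m)$, though Python's `list.insert` makes machine insertion $O(m)$, so a real implementation would use a balanced tree. The MD5-to-$[0,1)$ mapping and all names are my choices:

```python
import bisect
import hashlib

def point(key: str) -> float:
    # Map a string to a "random" point in [0,1) via a standard hash.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class ConsistentHash:
    def __init__(self):
        self.points = []    # sorted machine positions on the ring
        self.machines = []  # machine name at each position

    def add_machine(self, name: str):
        p = point(name)
        i = bisect.bisect(self.points, p)
        self.points.insert(i, p)
        self.machines.insert(i, name)

    def remove_machine(self, name: str):
        i = self.points.index(point(name))
        del self.points[i]
        del self.machines[i]

    def assign(self, item: str) -> str:
        # Item goes to its immediate successor on the ring (wrap past 1.0).
        i = bisect.bisect(self.points, point(item)) % len(self.points)
        return self.machines[i]
```

Removing a machine moves only the items that were in its homestead; everything else keeps its successor.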

Analysis

• For simplicity, assume base hash produces “random” values in $[0,1]$
• we revisit this assumption below

Max homestead size $O((\log m)/m)$

• Consider the position of machine $M$
• Consider the interval of length $2(\ln m)/m$ following it
• Probability no other machine lands there is $(1-2(\ln m)/m)^{m-1} \approx e^{-2\ln m} = 1/m^2$
• Union bound over all $m$ machines: w.h.p. every homestead has size $O((\log m)/m)$

• A homestead of fractional size $p$ means expected load $pn$
• Once the expected load exceeds $\log m$, the actual load is within a constant factor of it w.h.p. in $m$
• So, max load $O((\log m)(1+n/m))$

Can we do better?

• Not a balls in bins issue, since probably $n\gg m$
• Create $\log m$ “virtual nodes” for each real
• Claim: for each machine, its virtual homesteads have total size $O(1/m)$
• Break the ring into $m\log m$ equal “slots” (same as the number of virtual nodes)
• Evaluate the number of slots owned by one machine
• Start at one of its virtual nodes---scan forward over slots until some other virtual node appears
• When we see that successor, jump to this machine's next virtual node
• a given slot contains no virtual node with prob. $(1-1/(m\log m))^{m\log m}\approx 1/e$
• So the number of slots scanned until the next virtual node is geometric with constant mean
• Do it $\log m$ times.
• Chernoff bound: the $\log m$ geometric waits total $O(\log m)$ slots w.h.p., i.e., homestead size $O(\log m)\cdot\frac{1}{m\log m}=O(1/m)$
• With balanced homesteads, get balanced load
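The virtual-node fix amounts to hashing each machine under $\log m$ distinct labels; a sketch under the same assumptions as before (the `name#j` labeling convention is mine):

```python
import bisect
import hashlib
import math

def point(key: str) -> float:
    # Map a string to a "random" point in [0,1) via a standard hash.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class VirtualRing:
    """Consistent hashing where each machine owns ~log m virtual nodes,
    so each machine's total homestead concentrates around 1/m."""

    def __init__(self, machines):
        k = max(1, round(math.log2(len(machines))))  # copies per machine
        ring = []
        for name in machines:
            for j in range(k):
                # "name#j" virtual-node labels are my convention.
                ring.append((point(f"{name}#{j}"), name))
        ring.sort()
        self.points = [p for p, _ in ring]
        self.owners = [name for _, name in ring]

    def assign(self, item: str) -> str:
        # Successor of the item's point owns it (wrap past 1.0).
        i = bisect.bisect(self.points, point(item)) % len(self.points)
        return self.owners[i]
```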

We assumed random

• Turns out $O(\log m)$-wise independence is enough
• In practice, use MD5 or SHA1

What about flash crowds for one page?

• Random trees