This material takes 1:05.
Everything you need to know about probability
- Linearity of expectation
- Indicator variables
- Independent events
- Product rule
- Markov inequality
Hashing
Dictionaries
- Operations.
- makeset, insert, delete, find
Model
- keys are integers in $M=\set{1,\ldots,m}$
- (so assume machine word size, or “unit time,” is $\log m$)
- can store in array of size $M$
- using power: arithmetic, indirect addressing
- (more than bucket based heaps that use indirection without arithmetic)
- compare to comparison and pointer based sorting, binary trees
- problem: space.
Hashing:
- find function $h$ mapping $M$ into table of size $s \ll m$
- use it to put $n$ items from $M$ in the table
- so presumably $s \ge n$
- Still, some items can get mapped to same place: “collision”
- use linked list etc.
- search, insert cost equals size of linked list
- goal: keep linked lists small: few collisions
Hash families:
- problem: for any hash function, some bad input (if space $s$, then $m/s$ items to same bucket)
- This true even if hash is e.g. SHA1
- Solution: build family of functions, choose one that works well
2013 lecture 7 start
Set of all functions?
- Idea: choose “function” that stores items in sorted order without collisions
- problem: to evaluate function, must examine data
- evaluation time $\Omega(\log n)$.
- “description size” $\Omega(n \log m)$,
- Better goal: choose function without looking at data (except query key)
How about a random function?
- set $N$ of $n$ items
- If $s=n$, balls in bins
- $O((\log n)/(\log\log n))$ collisions w.h.p.
- And matches that somewhere
- but we may care more about average collisions over many operations
- $C_{ij}=1$ if $i,j$ collide
- Time to find $i$ is $1+\sum_j C_{ij}$
- expected value $1+(n-1)/n \le 1$
- more generally expected search time for item (present or not): $O(n/s)=O(1)$ if $s=n$
Problem:
- $n^m$ functions (specify one of $n$ places for each of $n$
items)
- too much space to specify ($m\log n$),
- hard to evaluate
- for $O(1)$ search time, want to evaluate function in
$O(1)$ time.
- so function description must fit in $O(1)$ machine words
- Assuming $\log m$ bit words
- So, fixed number of cells can only distinguish poly$(m)$ functions
- This bounds size of hash family we can choose from
2-universal family: [Carter-Wegman]
- Key insight: don't need entirely random function
- All we care about is which pairs of items collide
- so: OK if items land pairwise independent
- pick prime (or prime power) $p$ in range $m,\ldots,2m$ (not random)
- pick random $a,b$
- map $x$ to $(ax + b \bmod p)\bmod m$
- fix $x$, $y$
- mapping pairwise independent, uniform before $\mod m$ $$ \begin{align*} ax+1\cdot b &= s\\ ay+1\cdot b &= t\\ \end{align*} $$ matrix $\binom{x\quad 1}{y \quad 1}$ determinant $x-y \ne 0$, so unique solution
- So pairwise independent, near-uniform after $\mod m$
- at most 2 “uniform buckets” to same place
- argument above holds: $O(1)$ expected search time.
- represent with two $O(\log m)$-bit integers: hash family of poly size.
- max load may be large as $\sqrt{n}$, but who cares?
- expected load in a bin is 1
- so $O(\sqrt{n})$ with prob. 1-1/n (chebyshev).
- this bounds expected max-load
- some item may have bad load, but unlikely to be the requested one
- can show the max load is probably achieved for some 2-universal families
2012 lecture start
perfect hash families
Fredman Komlos Szemeredi.Ideally, would hash with no collisions
- Get worst case $O(1)$ lookups
- Explore case of fixed set of $n$ items (read only)
- perfect hash function: no collisions
- Even fully random function of $n$ to $n$ has collisions
Alternative try: use more space:
- How small can $s$ be for random $n$ to $s$ without collisions?
- Expected number of collisions is $E[\sum C_{ij}] = \binom{n}{2}(1/s) \approx n^2/2s$
- Markov Inequality: $n=\sqrt{s}$ works with prob. 1/2
- Nonzero probability, so, 2-universal hashes can work in quadratic space.
- Is this best possible?
- Birthday problem: $(1-1/s)\cdots(1-n/s) \approx e^{-(1/s+2/s+\cdots+n/s)} \approx e^{-n^2/2s}$
- So, when $n=\sqrt{s}$ has $\Omega(1)$ chance of collision
- 23 for birthdays
- even for fully independent
Finding one
- We know one exists---how find it?
- Try till succeed
- Each time, succeed with probability 1/2
- Expected number of tries to succeed is 2
- so expect $O(n)$ work
- Probability need $k$ tries is $2^{-k}$
Two level hashing for linear space
- Hash $n$ items into $O(n)$ space 2-universally
- Build quadratic size hash table on contents of each bucket
- bound $\sum b_k^2=\sum_k (\sum_i [i \in b_k])^2 =\sum C_i + C_{ij}$
- expected value $O(n)$.
- So try till get (markov)
- Then build collision-free quadratic tables inside
- Try till get
- Polynomial time in $n$, Las-vegas algorithm
- Easy: $6n$ cells
- Hard: $n+o(n)$ cells (bit fiddling)
2013 Lecture 7 End
Define las vegas, compare to monte carlo.
Derandomization
- Probability $1/2$ top-level function works
- Only $m^2$ top-level functions
- Try them all!
- Polynomial in $m$ (not $n$), deterministic algorithm
Followups
- Dynamic Perfect Hashing