Hashing
Dictionaries
- Operations.
- makeset, insert, delete, find
Model
- keys are integers in $M=\{1,\ldots,m\}$
- (so assume machine word size, or “unit time,” is $\log m$)
- can store in array of size $m$
- using power: arithmetic, indirect addressing
- compare to comparison and pointer based sorting, binary trees
- problem: space.
Hashing:
- find function $h$ mapping $M$ into table of size $n \ll m$
- Note some items get mapped to same place: “collision”
- use linked list etc.
- search, insert cost equals size of linked list
- goal: keep linked lists small: few collisions
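A minimal sketch of chaining in Python (class and method names are illustrative; the hash function here is a plain `key % n` placeholder for the families discussed below):

```python
class ChainedHashTable:
    """Hash table resolving collisions by chaining (lists play the linked lists)."""
    def __init__(self, n):
        self.n = n
        self.buckets = [[] for _ in range(n)]  # one chain per bucket

    def _h(self, key):
        # placeholder hash; choosing this well is the subject of these notes
        return key % self.n

    def insert(self, key):
        chain = self.buckets[self._h(key)]
        if key not in chain:          # search cost = length of the chain
            chain.append(key)

    def find(self, key):
        return key in self.buckets[self._h(key)]

    def delete(self, key):
        chain = self.buckets[self._h(key)]
        if key in chain:
            chain.remove(key)
```

Note that `insert`, `find`, and `delete` all cost time proportional to one chain's length, which is why few collisions is the goal.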
Hash families:
- problem: for any fixed hash function, some bad input: by pigeonhole some bucket receives $\ge m/n$ of the $m$ possible keys, so an adversary can pick $n$ items that all collide
- This is true even if the hash is, e.g., SHA-1
- Solution: build family of functions, choose one that works well
Set of all functions?
- Idea: choose “function” that stores items in sorted order without collisions
- problem: to evaluate function, must examine all data
- evaluation time $\Omega(\log n)$.
- “description size” $\Omega(n \log m)$,
- Better goal: choose function that can be evaluated in constant time without looking at data (except query key)
How about a random function?
- set $S$ of $s$ items
- If $s=n$, this is balls in bins
- max load is $O((\log n)/(\log\log n))$ w.h.p.
- and some bin attains that load w.h.p.
- but we care more about average collisions over many operations
- $C_{ij}=1$ if $i,j$ collide
- Time to find $i$ is $\sum_j C_{ij}$
- expected value $(n-1)/n < 1$
- more generally expected search time for item (present or not): $O(s/n)=O(1)$ if $s=n$
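The expected-search-time claim can be sanity-checked by sampling a truly random function (a sketch; the bin count and trial count are arbitrary choices):

```python
import random

def avg_chain_length(n, trials=1000):
    """Average number of items j != 0 sharing item 0's bucket, when n items
    are hashed into n bins by a uniformly random function.
    Estimates E[sum_j C_{0j}] = (n-1)/n."""
    total = 0.0
    for _ in range(trials):
        h = [random.randrange(n) for _ in range(n)]  # fresh random function
        total += sum(1 for j in range(1, n) if h[j] == h[0])
    return total / trials
```

For large `n` and enough trials the estimate concentrates near $(n-1)/n$, matching the expectation computed above.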
Problem:
- $n^m$ functions (specify one of $n$ places for each of $m$ keys)
- too much space to specify ($m\log n$ bits)
- hard to evaluate
- for $O(1)$ search time, need to identify the function in $O(1)$ time.
- so function description must fit in $O(1)$ machine words
- Assuming $\log m$ bit words
- So, fixed number of cells can only distinguish poly$(m)$ functions
- This bounds size of hash family we can choose from
Our analysis:
- sloppier constants
- but more intuitive than book
Universal Hashing
Recall our pairwise independent “pseudorandom-generator”:
- Motivation.
- Idea of randomness as (complexity theoretic) resource like space or time.
- sometime full independence unnecessary
- pairwise independent vars.
- generating over $Z_p$.
- Want random numbers in range $\{0,\ldots,p-1\}$ for prime $p$
- pick random $a,b \in Z_p$
- $i^{th}$ random number is $ai+b \bmod p$
- Works because the map is invertible over the field: for $i \ne j$, $(a,b)$ is determined by the pair of outputs
- If the desired range size is not prime, use a “slightly larger” prime $p$
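A sketch of this generator (function name is made up), assuming $p$ is prime: it consumes only two random words $a,b$, yet the outputs are pairwise independent and uniform over $Z_p$.

```python
import random

def pairwise_independent_stream(p, count):
    """Generate `count` pairwise independent values in {0,...,p-1}, p prime.
    Only two truly random words (a, b) are consumed; output i is a*i + b mod p.
    For i != j, (a,b) -> (X_i, X_j) is a bijection on Z_p x Z_p,
    so each pair of outputs is uniform, i.e. pairwise independent."""
    assert count <= p  # indices must be distinct residues mod p
    a = random.randrange(p)
    b = random.randrange(p)
    return [(a * i + b) % p for i in range(count)]
```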
2-universal family: [Carter-Wegman]
- Key insight: don't need entirely random function
- All we care about is which pairs of items collide
- how much independence was used above? pairwise (search item versus each other item)
- so: OK if items land pairwise independent
- pick $p$ in range $m,\ldots,2m$ (not random)
- pick random $a,b$
- map $x$ to $(ax + b \bmod p)\bmod n$
- values $ax+b \bmod p$ are pairwise independent, uniform before the $\bmod n$
- So pairwise independent, near-uniform after the $\bmod n$
- each final bucket receives at most $\lceil p/n \rceil \le 2p/n$ residues mod $p$, so a fixed pair collides with probability at most $2/n$
- argument above holds: $O(1)$ expected search time.
- represent with two $O(\log m)$-bit integers: hash family of poly size.
- max load?
- expected load in a bin is 1
- so load is $O(\sqrt{n})$ with prob. $1-1/n$ (Chebyshev).
- this bounds expected max-load
- some item may have bad load, but unlikely to be the requested one
- can show some 2-universal families actually attain $\Theta(\sqrt{n})$ max load, so the bound is tight for the family
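The Carter-Wegman construction above can be sketched as follows (a hypothetical class; in practice one would also search for the prime $p \in [m,2m]$ programmatically rather than hard-code it):

```python
import random

class UniversalHash:
    """2-universal (Carter-Wegman) hash: x -> ((a*x + b) mod p) mod n.
    The function is described by just two O(log m)-bit integers a, b."""
    def __init__(self, p, n):
        self.p = p                       # prime, assumed to lie in [m, 2m]
        self.n = n                       # hash table size
        self.a = random.randrange(1, p)  # random nonzero multiplier
        self.b = random.randrange(p)     # random shift

    def __call__(self, x):
        return ((self.a * x + self.b) % self.p) % self.n
```

Redrawing $a,b$ gives a fresh function from the family, which is all "choose one that works well" requires.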