Hashing
6.854 Notes #5

David Karger

Everything you need to know about probability

Hashing

Dictionaries

Model

Hashing:

Hash families:

problem: for any hash function, some bad input (if space \(s\), then \(m/s\) items to same bucket)
This true even if hash is e.g. SHA1
Solution: build family of functions, choose one that works well

Set of all functions?

How about a random function?

set \(N\) of \(n\) items
If \(s=n\), balls in bins
- \(O((\log n)/(\log\log n))\) collisions w.h.p.
- And matches that somewhere
- but we may care more about average collisions over many operations
- \(C_{ij}=1\) if \(i,j\) collide
- Time to find \(i\) is \(1+\sum_j C_{ij}\)
- expected value \(1+(n-1)/n \le 1\)
more generally expected search time for item (present or not): \(O(n/s)=O(1)\) if \(s=n\)

Problem:

\(n^m\) functions (specify one of \(n\) places for each of \(n\) items)
- too much space to specify (\(m\log n\)),
- hard to evaluate
for \(O(1)\) search time, want to evaluate function in \(O(1)\) time.
- so function description must fit in \(O(1)\) machine words
- Assuming \(\log m\) bit words
- So, fixed number of cells can only distinguish poly\((m)\) functions
This bounds size of hash family we can choose from

2-universal family: [Carter-Wegman]

Key insight: don’t need entirely random function
All we care about is which pairs of items collide
so: OK if items land pairwise independent
pick prime (or prime power) \(p\) in range \(m,\ldots,2m\) (not random)
pick random \(a,b\)
map \(x\) to \((ax + b \bmod p)\bmod m\)
- fix \(x\), \(y\)
- mapping pairwise independent, uniform before \(\mod m\) \[\begin{aligned} ax+1\cdot b &= s\\ ay+1\cdot b &= t\\ \end{aligned}\] matrix \(\binom{x\quad 1}{y \quad 1}\) determinant \(x-y \ne 0\), so unique solution
- So pairwise independent, near-uniform after \(\mod m\)
- at most 2 “uniform buckets” to same place
argument above holds: \(O(1)\) expected search time.
represent with two \(O(\log m)\)-bit integers: hash family of poly size.
max load may be large as \(\sqrt{n}\), but who cares?
- expected load in a bin is 1
- so \(O(\sqrt{n})\) with prob. 1-1/n (chebyshev).
- this bounds expected max-load
- some item may have bad load, but unlikely to be the requested one
- can show the max load is probably achieved for some 2-universal families

Fredman Komlos Szemeredi.

Ideally, would hash with no collisions

Alternative try: use more space:

How small can \(s\) be for random \(n\) to \(s\) without collisions?
- Expected number of collisions is \(E[\sum C_{ij}] = \binom{n}{2}(1/s) \approx n^2/2s\)
- Markov Inequality: \(n=\sqrt{s}\) works with prob. 1/2
- Nonzero probability, so, 2-universal hashes can work in quadratic space.
Is this best possible?
- Birthday problem: \((1-1/s)\cdots(1-n/s) \approx e^{-(1/s+2/s+\cdots+n/s)} \approx e^{-n^2/2s}\)
- So, when \(n=\sqrt{s}\) has \(\Omega(1)\) chance of collision
- 23 for birthdays
- even for fully independent

Finding one

Two level hashing for linear space

Define las vegas, compare to monte carlo.

Derandomization

Followups