\documentclass[12pt]{article} \usepackage{wide,me} \parindent0pt
This material takes 1:05.

Everything you need to know about probability

• Linearity of expectation
• Indicator variables
• Independent events
• Product rule
• Markov inequality

### Hashing

Dictionaries

• Operations.
• makeset, insert, delete, find

Model

• keys are integers in $M=\set{1,\ldots,m}$
• (so assume machine word size, or “unit time,” is $\log m$)
• can store in array of size $M$
• using power: arithmetic, indirect addressing
• (more than bucket based heaps that use indirection without arithmetic)
• compare to comparison and pointer based sorting, binary trees
• problem: space.

Hashing:

• find function $h$ mapping $M$ into table of size $s \ll m$
• use it to put $n$ items from $M$ in the table
• so presumably $s \ge n$
• Still, some items can get mapped to same place: “collision”
• use linked list etc.
• search, insert cost equals size of linked list
• goal: keep linked lists small: few collisions

Hash families:

• problem: for any hash function, some bad input (if space $s$, then $m/s$ items to same bucket)
• This true even if hash is e.g. SHA1
• Solution: build family of functions, choose one that works well
2013 lecture 7 start

Set of all functions?

• Idea: choose “function” that stores items in sorted order without collisions
• problem: to evaluate function, must examine data
• evaluation time $\Omega(\log n)$.
• “description size” $\Omega(n \log m)$,
• Better goal: choose function without looking at data (except query key)

How about a random function?

• set $N$ of $n$ items
• If $s=n$, balls in bins
• $O((\log n)/(\log\log n))$ collisions w.h.p.
• And matches that somewhere
• but we may care more about average collisions over many operations
• $C_{ij}=1$ if $i,j$ collide
• Time to find $i$ is $1+\sum_j C_{ij}$
• expected value $1+(n-1)/n \le 1$
• more generally expected search time for item (present or not): $O(n/s)=O(1)$ if $s=n$

Problem:

• $n^m$ functions (specify one of $n$ places for each of $n$ items)
• too much space to specify ($m\log n$),
• hard to evaluate
• for $O(1)$ search time, want to evaluate function in $O(1)$ time.
• so function description must fit in $O(1)$ machine words
• Assuming $\log m$ bit words
• So, fixed number of cells can only distinguish poly$(m)$ functions
• This bounds size of hash family we can choose from

2-universal family: [Carter-Wegman]

• Key insight: don't need entirely random function
• All we care about is which pairs of items collide
• so: OK if items land pairwise independent
• pick prime (or prime power) $p$ in range $m,\ldots,2m$ (not random)
• pick random $a,b$
• map $x$ to $(ax + b \bmod p)\bmod m$
• fix $x$, $y$
• mapping pairwise independent, uniform before $\mod m$ \begin{align*} ax+1\cdot b &= s\\ ay+1\cdot b &= t\\ \end{align*} matrix $\binom{x\quad 1}{y \quad 1}$ determinant $x-y \ne 0$, so unique solution
• So pairwise independent, near-uniform after $\mod m$
• at most 2 “uniform buckets” to same place
• argument above holds: $O(1)$ expected search time.
• represent with two $O(\log m)$-bit integers: hash family of poly size.
• max load may be large as $\sqrt{n}$, but who cares?
• expected load in a bin is 1
• so $O(\sqrt{n})$ with prob. 1-1/n (chebyshev).
• this bounds expected max-load
• some item may have bad load, but unlikely to be the requested one
• can show the max load is probably achieved for some 2-universal families
2012 lecture start

### perfect hash families

Fredman Komlos Szemeredi.

Ideally, would hash with no collisions

• Get worst case $O(1)$ lookups
• Explore case of fixed set of $n$ items (read only)
• perfect hash function: no collisions
• Even fully random function of $n$ to $n$ has collisions

Alternative try: use more space:

• How small can $s$ be for random $n$ to $s$ without collisions?
• Expected number of collisions is $E[\sum C_{ij}] = \binom{n}{2}(1/s) \approx n^2/2s$
• Markov Inequality: $n=\sqrt{s}$ works with prob. 1/2
• Nonzero probability, so, 2-universal hashes can work in quadratic space.
• Is this best possible?
• Birthday problem: $(1-1/s)\cdots(1-n/s) \approx e^{-(1/s+2/s+\cdots+n/s)} \approx e^{-n^2/2s}$
• So, when $n=\sqrt{s}$ has $\Omega(1)$ chance of collision
• 23 for birthdays
• even for fully independent

Finding one

• We know one exists---how find it?
• Try till succeed
• Each time, succeed with probability 1/2
• Expected number of tries to succeed is 2
• so expect $O(n)$ work
• Probability need $k$ tries is $2^{-k}$

Two level hashing for linear space

• Hash $n$ items into $O(n)$ space 2-universally
• Build quadratic size hash table on contents of each bucket
• bound $\sum b_k^2=\sum_k (\sum_i [i \in b_k])^2 =\sum C_i + C_{ij}$
• expected value $O(n)$.
• So try till get (markov)
• Then build collision-free quadratic tables inside
• Try till get
• Polynomial time in $n$, Las-vegas algorithm
• Easy: $6n$ cells
• Hard: $n+o(n)$ cells (bit fiddling)
2013 Lecture 7 End

Define las vegas, compare to monte carlo.

Derandomization

• Probability $1/2$ top-level function works
• Only $m^2$ top-level functions
• Try them all!
• Polynomial in $m$ (not $n$), deterministic algorithm

Followups

• Dynamic Perfect Hashing