# Hashing 6.854 Notes #5

Everything you need to know about probability

• Linearity of expectation

• Indicator variables

• Independent events

• Product rule

• Markov inequality

## Hashing

Dictionaries

• Operations.

• makeset, insert, delete, find

Model

• keys are integers in $$M=\set{1,\ldots,m}$$

• (so assume machine word size, or “unit time,” is $$\log m$$)

• can store in array of size $$M$$

• using power: arithmetic, indirect addressing

• (more than bucket based heaps that use indirection without arithmetic)

• compare to comparison and pointer based sorting, binary trees

• problem: space.

Hashing:

• find function $$h$$ mapping $$M$$ into table of size $$s \ll m$$

• use it to put $$n$$ items from $$M$$ in the table

• so presumably $$s \ge n$$

• Still, some items can get mapped to same place: “collision”

• search, insert cost equals size of linked list

• goal: keep linked lists small: few collisions

Hash families:

• problem: for any hash function, some bad input (if space $$s$$, then $$m/s$$ items to same bucket)

• This true even if hash is e.g. SHA1

• Solution: build family of functions, choose one that works well

Set of all functions?

• Idea: choose “function” that stores items in sorted order without collisions

• problem: to evaluate function, must examine data

• evaluation time $$\Omega(\log n)$$.

• “description size” $$\Omega(n \log m)$$,

• Better goal: choose function without looking at data (except query key)

• set $$N$$ of $$n$$ items

• If $$s=n$$, balls in bins

• $$O((\log n)/(\log\log n))$$ collisions w.h.p.

• And matches that somewhere

• but we may care more about average collisions over many operations

• $$C_{ij}=1$$ if $$i,j$$ collide

• Time to find $$i$$ is $$1+\sum_j C_{ij}$$

• expected value $$1+(n-1)/n \le 1$$

• more generally expected search time for item (present or not): $$O(n/s)=O(1)$$ if $$s=n$$

Problem:

• $$n^m$$ functions (specify one of $$n$$ places for each of $$n$$ items)

• too much space to specify ($$m\log n$$),

• hard to evaluate

• for $$O(1)$$ search time, want to evaluate function in $$O(1)$$ time.

• so function description must fit in $$O(1)$$ machine words

• Assuming $$\log m$$ bit words

• So, fixed number of cells can only distinguish poly$$(m)$$ functions

• This bounds size of hash family we can choose from

2-universal family: [Carter-Wegman]

• Key insight: don’t need entirely random function

• All we care about is which pairs of items collide

• so: OK if items land pairwise independent

• pick prime (or prime power) $$p$$ in range $$m,\ldots,2m$$ (not random)

• pick random $$a,b$$

• map $$x$$ to $$(ax + b \bmod p)\bmod m$$

• fix $$x$$, $$y$$

• mapping pairwise independent, uniform before $$\mod m$$ \begin{aligned} ax+1\cdot b &= s\\ ay+1\cdot b &= t\\ \end{aligned} matrix $$\binom{x\quad 1}{y \quad 1}$$ determinant $$x-y \ne 0$$, so unique solution

• So pairwise independent, near-uniform after $$\mod m$$

• at most 2 “uniform buckets” to same place

• argument above holds: $$O(1)$$ expected search time.

• represent with two $$O(\log m)$$-bit integers: hash family of poly size.

• max load may be large as $$\sqrt{n}$$, but who cares?

• expected load in a bin is 1

• so $$O(\sqrt{n})$$ with prob. 1-1/n (chebyshev).

• some item may have bad load, but unlikely to be the requested one

• can show the max load is probably achieved for some 2-universal families

## perfect hash families

Fredman Komlos Szemeredi.

Ideally, would hash with no collisions

• Get worst case $$O(1)$$ lookups

• Explore case of fixed set of $$n$$ items (read only)

• perfect hash function: no collisions

• Even fully random function of $$n$$ to $$n$$ has collisions

Alternative try: use more space:

• How small can $$s$$ be for random $$n$$ to $$s$$ without collisions?

• Expected number of collisions is $$E[\sum C_{ij}] = \binom{n}{2}(1/s) \approx n^2/2s$$

• Markov Inequality: $$n=\sqrt{s}$$ works with prob. 1/2

• Nonzero probability, so, 2-universal hashes can work in quadratic space.

• Is this best possible?

• Birthday problem: $$(1-1/s)\cdots(1-n/s) \approx e^{-(1/s+2/s+\cdots+n/s)} \approx e^{-n^2/2s}$$

• So, when $$n=\sqrt{s}$$ has $$\Omega(1)$$ chance of collision

• 23 for birthdays

• even for fully independent

Finding one

• We know one exists—how find it?

• Try till succeed

• Each time, succeed with probability 1/2

• Expected number of tries to succeed is 2

• so expect $$O(n)$$ work

• Probability need $$k$$ tries is $$2^{-k}$$

Two level hashing for linear space

• Hash $$n$$ items into $$O(n)$$ space 2-universally

• Build quadratic size hash table on contents of each bucket

• bound $$\sum b_k^2=\sum_k (\sum_i [i \in b_k])^2 =\sum C_i + C_{ij}$$

• expected value $$O(n)$$.

• So try till get (markov)

• Then build collision-free quadratic tables inside

• Try till get

• Polynomial time in $$n$$, Las-vegas algorithm

• Easy: $$6n$$ cells

• Hard: $$n+o(n)$$ cells (bit fiddling)

Define las vegas, compare to monte carlo.

Derandomization

• Probability $$1/2$$ top-level function works

• Only $$m^2$$ top-level functions

• Try them all!

• Polynomial in $$m$$ (not $$n$$), deterministic algorithm

Followups

• Dynamic Perfect Hashing