
### Hashing

Dictionaries

• Operations.
• makeset, insert, delete, find
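
The four operations above can be sketched as a minimal class; this is an illustrative placeholder (a naive list-backed implementation, not the hashing scheme developed below):

```python
# Hypothetical sketch of the dictionary interface; the list-backed
# storage is a stand-in, just to pin down the four operations.
class Dictionary:
    def __init__(self):              # makeset
        self.items = []

    def insert(self, key):
        if key not in self.items:
            self.items.append(key)

    def delete(self, key):
        if key in self.items:
            self.items.remove(key)

    def find(self, key):
        return key in self.items

d = Dictionary()
d.insert(7)
assert d.find(7)
d.delete(7)
assert not d.find(7)
```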

Model

• keys are integers in $M=\{1,\ldots,m\}$
• (so assume machine word size, or “unit time,” is $\log m$)
• can store in array of size $m$ (one slot per possible key)
• using power: arithmetic, indirect addressing
• compare to comparison and pointer based sorting, binary trees
• problem: space.
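
A direct-address table makes the space problem concrete: every operation is $O(1)$ via indirect addressing, but the table uses $\Theta(m)$ space no matter how few keys are stored. A minimal sketch (the universe size is illustrative):

```python
# Direct-address table: one slot per possible key in {1, ..., m}.
# insert/delete/find are O(1), but space is Theta(m) even when the
# table holds only a handful of keys -- the "problem: space" above.
m = 1_000_000                  # universe size (illustrative)
table = [False] * (m + 1)      # index 0 unused; slot per key

def insert(key):
    table[key] = True

def delete(key):
    table[key] = False

def find(key):
    return table[key]

insert(42)
assert find(42)
assert not find(43)
```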

Hashing:

• find function $h$ mapping $M$ into table of size $n \ll m$
• Note some items get mapped to same place: “collision”
• search, insert cost equals size of linked list
• goal: keep linked lists small: few collisions
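
Hashing with chaining can be sketched as follows; the table size and the stand-in hash function are illustrative:

```python
# Hash table with chaining: a table of n << m buckets, each a list of
# the keys mapped there. Search/insert cost equals the bucket's length.
n = 8                                  # table size (illustrative)
buckets = [[] for _ in range(n)]

def h(key):
    return key % n                     # stand-in hash function

def insert(key):
    if key not in buckets[h(key)]:
        buckets[h(key)].append(key)

def find(key):
    return key in buckets[h(key)]      # cost = len(buckets[h(key)])

insert(3)
insert(11)                             # 3 and 11 collide: both map to bucket 3
assert find(3) and find(11)
assert len(buckets[3]) == 2
```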

Hash families:

• problem: for any fixed hash function, some bad input (by pigeonhole, if the table has $n$ buckets then some set of $m/n$ keys maps to the same bucket)
• this is true even if the hash is, e.g., SHA-1
• Solution: build family of functions, choose one that works well

Set of all functions?

• Idea: choose “function” that stores items in sorted order without collisions
• problem: to evaluate function, must examine all data
• evaluation time $\Omega(\log n)$.
• “description size” $\Omega(n \log m)$
• Better goal: choose function that can be evaluated in constant time without looking at data (except query key)

• set $S$ of $s$ items
• If $s=n$, balls in bins
• max bucket size $O((\log n)/(\log\log n))$ w.h.p.
• and some bucket receives that many w.h.p., so the bound is tight
• but we care more about average collisions over many operations
• let $C_{ij}=1$ if items $i,j$ collide, else $0$
• time to find $i$ is $O(1 + \sum_{j \ne i} C_{ij})$
• each $C_{ij}$ has expectation $1/n$, so the expected sum is $(n-1)/n \le 1$
• more generally expected search time for item (present or not): $O(s/n)=O(1)$ if $s=n$
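
A quick simulation backs up the $O(1)$ expected-search-time claim for a truly random function; parameters and seed are illustrative:

```python
import random

# Throw s = n keys into n buckets with a truly random function and
# measure the average search cost (the load of each item's bucket).
# The analysis above predicts an average near 1 + (n-1)/n < 2.
random.seed(0)
n = 1000
placement = [random.randrange(n) for _ in range(n)]   # random function

loads = [0] * n
for b in placement:
    loads[b] += 1

# Searching for item i scans its whole bucket.
avg_cost = sum(loads[b] for b in placement) / n
assert 1 <= avg_cost < 4
```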

Problem:

• $n^m$ functions (specify one of $n$ places for each of $m$ keys)
• too much space to specify ($m\log n$ bits)
• hard to evaluate
• for $O(1)$ search time, need to identify function in $O(1)$ time.
• so function description must fit in $O(1)$ machine words
• assuming $\log m$-bit words
• So, fixed number of cells can only distinguish poly$(m)$ functions
• This bounds size of hash family we can choose from

Our analysis:

• sloppier constants
• but more intuitive than book

### Universal Hashing

Recall our pairwise independent “pseudorandom generator”:

• Motivation.
• Idea of randomness as (complexity theoretic) resource like space or time.
• sometime full independence unnecessary
• pairwise independent vars.
• generating over $Z_p$.
• want random numbers in range $\{0,\ldots,p-1\}$, $p$ prime
• pick random $a,b$
• $i^{th}$ random number is $(ai+b) \bmod p$
• pairwise independent because the map $(a,b) \mapsto (ai+b, aj+b)$ is invertible over the field $Z_p$ for $i \ne j$
• if want a non-prime range, use a “slightly larger” prime $p$
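
The invertibility argument can be checked directly: from any two outputs of the generator one can recover $(a,b)$, so each pair of outputs is in bijection with the random seed. A minimal sketch (the prime and seed are illustrative; requires Python 3.8+ for the modular inverse via `pow`):

```python
import random

# Pairwise-independent generator over Z_p: pick random a, b once,
# then the i-th "random" number is (a*i + b) mod p.
p = 101                        # a prime (illustrative)
random.seed(1)
a = random.randrange(p)
b = random.randrange(p)

def r(i):
    return (a * i + b) % p     # i-th pairwise-independent value

# Invertibility over the field: recover (a, b) from the pair (r(i), r(j)).
i, j = 3, 7
a_rec = (r(i) - r(j)) * pow(i - j, -1, p) % p
b_rec = (r(i) - a_rec * i) % p
assert (a_rec, b_rec) == (a, b)
```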

2-universal family: [Carter-Wegman]

• Key insight: don't need entirely random function
• All we care about is which pairs of items collide
• how much independence was used above? pairwise (search item versus each other item)
• so: OK if items land pairwise independent
• pick $p$ in range $m,\ldots,2m$ (not random)
• pick random $a,b$
• map $x$ to $((ax + b) \bmod p)\bmod n$
• pairwise independent and uniform before the final $\bmod n$
• so pairwise independent and near-uniform after the $\bmod n$
• at most $\lceil p/n \rceil \le 2p/n$ of the $p$ “uniform” residues map to any one bucket, so each bucket is hit with probability at most $2/n$
• argument above holds: $O(1)$ expected search time.
• represent with two $O(\log m)$-bit integers: hash family of poly size.
• max list length $O(\sqrt{n})$ with prob. $1-1/n$ (Chebyshev, using pairwise independence)
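
A sketch of the Carter-Wegman family and an empirical check of its collision guarantee; $m$, $n$, $p$, the fixed key pair, and the trial count are all illustrative:

```python
import random

# 2-universal (Carter-Wegman) family: pick a prime p in [m, 2m] once,
# then a random a, b per function, and map x to ((a*x + b) mod p) mod n.
# For any fixed pair x != y, a random member of the family makes them
# collide with probability at most about 2/n.
m, n, p = 1000, 50, 1009          # p = 1009 is a prime in [m, 2m]
random.seed(2)

def make_hash():
    a = random.randrange(1, p)
    b = random.randrange(p)
    return lambda x: ((a * x + b) % p) % n

# Estimate the collision probability of one fixed pair over many draws.
x, y = 17, 423
trials = 20000
collisions = sum(1 for _ in range(trials)
                 if (h := make_hash())(x) == h(y))
assert collisions / trials < 4 / n    # well within the ~2/n bound
```

Note that the randomness is over the choice of $(a,b)$, not the keys: the adversary may fix any pair of keys in advance, and the family still keeps their collision probability $O(1/n)$.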