Fingerprinting
Basic idea: compare two things from a big universe $U$
- comparing directly generally takes $\log |U|$ bits, which could be huge.
- Better: randomly map $U$ into a smaller set $V$, compare elements of $V$.
- Probability two distinct elements map to the same value is $1/|V|$ (for a truly random map).
- intuition: $\log |V|$ bits to compare, error prob. $1/|V|$ (toy sketch below).
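A toy sketch of the idea in Python. The "random map" here is just Python's built-in hash salted with shared randomness, which is purely illustrative (an assumption, not a real fingerprint scheme):

```python
import random

def fingerprints_match(x, y, v_size=2**32):
    """Compare two (possibly huge) hashable objects via a shared random map
    into a set V of size v_size: only log|V| bits are compared, and distinct
    inputs collide with probability roughly 1/|V| (heuristically, for this
    toy map)."""
    seed = random.getrandbits(64)      # shared randomness defining the map
    fx = hash((seed, x)) % v_size
    fy = hash((seed, y)) % v_size
    return fx == fy                    # equal inputs always match
```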
We work with fields:
- add, subtract, multiply, divide
- 0 and 1 elements
- e.g. the reals and rationals (not the integers: no division)
- we'll mostly use $Z_p$, the integers mod a prime $p$ (sketch below)
- which field we use often won't matter.
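For concreteness, a minimal sketch of $Z_p$ arithmetic in Python; division here uses Fermat's little theorem ($b^{p-2} \equiv b^{-1} \pmod p$), one standard way to invert mod a prime:

```python
p = 101  # any prime makes Z_p a field

def zp_div(a, b):
    """Divide in Z_p: multiply by b^(p-2) mod p, the inverse of b (b != 0)."""
    return a * pow(b, p - 2, p) % p

assert zp_div(3 * 7 % p, 7) == 3  # (3*7)/7 = 3 in Z_p
```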
Verifying matrix multiplication:
- Claim: $AB=C$
- check by multiplying: $n^3$ time, or $n^{2.376}$ with deep math
- Freivalds' check: $O(n^2)$.
- Good to apply at the end of a complex algorithm (to check the answer)
Freivalds' technique (code sketch after this list):
- choose random $r \in \{0,1\}^n$
- check $ABr=Cr$
- time $O(n^2)$
- if $AB=C$, fine.
- What if $AB \ne C$?
- trouble if $(AB-C)r=0$ but $D=AB-C \ne 0$
- find some nonzero row $(d_1,\ldots,d_n)$ of $D$
- wlog $d_1 \ne 0$
- trouble if $\sum d_i r_i = 0$
- i.e. $r_1 = -(\sum_{i > 1} d_i r_i)/d_1$
- principle of deferred decisions: choose all $i \ge 2$ first
- then there is exactly one bad value for $r_1$
- prob. we pick it is at most $1/2$ (the bad value need not even lie in $\{0,1\}$)
- $k$ independent trials make the failure prob. $1/2^k$.
- Or choosing each $r_i \in \{1,\ldots,s\}$ makes it at most $1/s$.
- Doesn't just verify matrix multiplication:
- check any matrix identity claim
- useful when matrices are “implicit” (e.g. $AB$)
- We are mapping matrices ($n^2$ entries) to vectors ($n$ entries).
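A minimal sketch of Freivalds' check in Python, using exact integer arithmetic (over $Z_p$ one would reduce every sum mod $p$):

```python
import random

def freivalds(A, B, C, trials=20):
    """Test the claim A*B == C for n x n matrices in O(trials * n^2) time.
    Always accepts when AB == C; accepts a false claim with prob. <= 2^-trials."""
    n = len(A)
    for _ in range(trials):
        r = [random.randint(0, 1) for _ in range(n)]
        # Three O(n^2) matrix-vector products: compare A(Br) against Cr.
        Br = [sum(B[i][j] * r[j] for j in range(n)) for i in range(n)]
        ABr = [sum(A[i][j] * Br[j] for j in range(n)) for i in range(n)]
        Cr = [sum(C[i][j] * r[j] for j in range(n)) for i in range(n)]
        if ABr != Cr:
            return False   # witness found: AB != C for certain
    return True            # all trials passed: AB == C with high probability
```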
In general, many ways to fingerprint explicitly represented objects. But for implicit objects, different methods have different strengths and weaknesses.
We'll fingerprint 3 ways:
- vector multiply
- number mod a random prime
- polynomial evaluation at a random point
String matching
Checksums:
- Alice and Bob have bit strings of length $n$
- Think of them as $n$-bit integers $a$ and $b$
- take a prime number $p$, and compare $a \bmod p$ and $b \bmod p$, which are only $\log p$ bits each.
Trouble if $a \equiv b \pmod p$ but $a \ne b$. How do we avoid this? How likely is it?
- $c=a-b$ is an $n$-bit integer.
- so it has at most $n$ prime factors.
- How many primes are there below $k$? $\Theta(k/\ln k)$, by the prime number theorem.
- so pick a random prime below $2n^2\log n$
- number of primes in that range is about $n^2$
- So a random one divides $c$ with probability at most $n/n^2 = 1/n$.
- $O(\log n)$ bits to send.
- implement with shifts and add/sub, no mul or div! (sketch below)
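A sketch of the streaming reduction, one pass over the bits (most significant first); since $r < p$ before each step, a single compare-and-subtract restores the invariant:

```python
def mod_of_bits(bits, p):
    """Compute (the integer whose bits are given, MSB first) mod p using only
    shifts, adds, and subtracts: no multiplication or division."""
    r = 0
    for b in bits:
        r = (r << 1) | b   # shift in the next bit: now r < 2p
        if r >= p:
            r -= p         # one subtraction restores r < p
    return r

assert mod_of_bits([1, 0, 1, 1], 5) == 11 % 5
```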
Do we need a prime?
- Some numbers have many divisors
- So a random number may be a divisor
- asymptotically, the number of divisors is at most $n^{2/\log\log n} \le \sqrt{n}$
- so a random number fails with probability $\le 1/\sqrt{n}$
- but this is a lot of work to prove (and only holds asymptotically)
- primes are easier/clearer
How do we find a prime?
- a random number below $N$ is prime with prob. about $1/\ln N$ (here $N \approx n^2$, so about $1/(2\ln n)$),
- so just try a few.
- How do we know it's prime? Simple randomized test (later)
- Or, just run Eratosthenes' sieve (sketch below)
- Sieving up to $n^2$ takes time $O(n^2)$ (times a $\log\log$ factor)
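A sketch of the sieve approach, assuming the range is small enough to tabulate:

```python
import random

def random_prime(limit):
    """Return a uniformly random prime <= limit via Eratosthenes' sieve."""
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False
    for i in range(2, int(limit ** 0.5) + 1):
        if is_prime[i]:
            for j in range(i * i, limit + 1, i):   # cross off multiples of i
                is_prime[j] = False
    return random.choice([i for i, ok in enumerate(is_prime) if ok])
```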
Pattern matching in strings
Rabin-Karp algorithm
- $m$-bit pattern
- $n$-bit string
- work mod a random prime $p$ of size at most $t$
- prob. of error at a particular alignment is at most $m/(t/\log t)$, by the argument above (at most $m$ prime factors, about $t/\log t$ primes)
- so pick $t$ big and union-bound over all $\sim n$ alignments
- implement with add/sub: sliding the window one bit is a shift, a subtract, and an add mod $p$ (sketch below)
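A sketch of Rabin-Karp over bit strings (lists of 0/1) with the rolling fingerprint update; a real implementation might verify each candidate match explicitly to rule out false positives:

```python
def rabin_karp(text, pattern, p):
    """Report positions where the m-bit pattern's fingerprint mod the prime p
    matches the text window's; each slide of the window is O(1) arithmetic."""
    n, m = len(text), len(pattern)
    if m > n:
        return []
    top = pow(2, m - 1, p)              # weight of the bit leaving the window
    pat_fp = win_fp = 0
    for i in range(m):                  # initial fingerprints, MSB first
        pat_fp = (2 * pat_fp + pattern[i]) % p
        win_fp = (2 * win_fp + text[i]) % p
    hits = []
    for i in range(n - m + 1):
        if win_fp == pat_fp:
            hits.append(i)              # probable match at position i
        if i + m < n:                   # slide: drop text[i], append text[i+m]
            win_fp = ((win_fp - text[i] * top) * 2 + text[i + m]) % p
    return hits
```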
Bloom Filters
Hash tables are good for key to value mappings,
- but sometimes that is overkill
- Sufficient to know if a key is “present”
- For arrays, this is the difference between an “array” (of values) and a “bit vector”
- Analogue for hash tables?
- $n$-item perfect hash uses $n\log m$ bits
- can we do better?
- Idea: hash each item to a bit location, set the bit
- Problem: false positive if hash to same place
- Solution: hash to multiple bits for redundant checking
Algorithm
- Set of $n$ items
- Array of $m$ bits
- Insert:
    - Hash each item to $k$ “random” bit positions
    - Set those bits
- Lookup (code sketch below):
    - Answer yes iff all $k$ bits are set
    - Always says yes for present items
    - Can say yes for absent items (false positive)
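A minimal sketch in Python; deriving the $k$ “random” positions by salting one standard hash is an implementation assumption, not part of the abstract algorithm:

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [False] * m          # the m-bit array

    def _positions(self, item):
        # k pseudo-random bit positions from salted SHA-256 digests.
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.m

    def insert(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True        # set all k bits for this item

    def lookup(self, item):
        # Never a false negative; false positive iff all k bits happen to be set.
        return all(self.bits[pos] for pos in self._positions(item))
```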
Analysis
- false positive if all $k$ bits set
- $n$ items, so $kn$ random tries to set bits
- How many bits are one?
- probability bit clear is $(1-1/m)^{kn} \approx e^{-kn/m}=p$
- expect $pm$ clear bits
- Claim: the probability that all $k$ probed bits are set is $f=(1-p)^k$
- there's a conditioning here that makes this non-obvious
- having a bit set makes others being set less likely
- but this negative correlation only helps us
- How pick $k$ to optimize?
- More hashes give more chances to see a 0
- But also increase number of set bits
- pick $k$ to minimize $f=(1-p)^k$ (minimizing the approximation also minimizes the exact expression; derivation below)
- take derivative, find zero at $k=(\ln 2)(m/n)$
- in this case $p=1/2$, and $f=(1/2)^k=(0.6185)^{m/n}$
- practical numbers: with $m=8n$ get $f=.02$
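Filling in the derivative step (a standard calculation): substituting $k=-(m/n)\ln p$, which is just $p=e^{-kn/m}$ rearranged, into $\ln f = k\ln(1-p)$ gives
$$\ln f = -\frac{m}{n}\,\ln p\,\ln(1-p),$$
which is symmetric under $p\leftrightarrow 1-p$ and minimized when $\ln p\,\ln(1-p)$ is largest, namely at $p=1/2$; then $kn/m=\ln 2$, i.e. $k=(\ln 2)(m/n)$.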
Now let's formally validate the assumption that $pm$ bits are clear.
Poisson distribution
- fix one bin (of $n$ bins receiving $n$ balls)
- prob. it gets $k$ balls is $\binom{n}{k}n^{-k}(1-1/n)^{n-k}\rightarrow \frac{1}{e\,k!}$ as $n\rightarrow\infty$
- more generally, Poisson distribution is limit of binomial $B(m,\lambda/m)$ as $m\rightarrow\infty$ $$ \begin{align*} \Pr_m[k]&=\binom{m}{k}(\lambda/m)^k(1-\lambda/m)^{m-k}\\ &\approx (m^k/k!)(\lambda/m)^ke^{-\lambda}\\ &=e^{-\lambda}\lambda^k/k! \intertext{alternatively,} &=\frac{m}{k}(\lambda/m)\binom{m-1}{k-1}(\lambda/m)^{k-1}(1-\lambda/m)^{(m-1)-(k-1)}\\ &=(\lambda/k)\Pr_{m-1}[k-1]\\ &=(\lambda/k)\Pr_m[k-1] \end{align*} $$
- So $\Pr[k] \propto \lambda^k/k!$ by the recurrence
- Must normalize to 1 so $\Pr[k]=e^{-\lambda}\lambda^k/k!$
The mode:
- consider giving each bin an independent $B(m,1/m)$ number of balls
- now take $m\rightarrow\infty$
- $\Pr[0]=e^{-1}$
- total over $n$ bins: $B(nm,1/m)$ balls, so in the limit Poisson with $\lambda=n$
- alternatively, can prove convolution of Poisson variables is Poisson with summed rate
- in total, $\Pr[n]=e^{-n}n^n/n!\approx 1/\sqrt{2\pi n}$ by Stirling's approx.
- that is, about a $1/\sqrt{n}$ chance of exactly $n$ balls (numeric check below)
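A quick numeric check of the Stirling estimate (sample values chosen arbitrarily, just to see the convergence):

```python
import math

def poisson_mode(n):
    """Pr[exactly n balls] under Poisson(n): e^-n * n^n / n!, via log-space."""
    return math.exp(-n + n * math.log(n) - math.lgamma(n + 1))

for n in (10, 100, 1000):
    print(n, poisson_mode(n), 1 / math.sqrt(2 * math.pi * n))  # nearly equal
```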
Application
- Suppose we prove $\Pr[A]< p$ in the independent (Poisson) case
- Let $N$ be event of “exactly $n$ balls in independent case”
- Then $p>\Pr[A]\ge\Pr[A \cap N]=\Pr[A \mid N]\Pr[N]\approx\Pr[A \mid N]/\sqrt{2\pi n}$
- So $\Pr[A \mid N] \lesssim p\sqrt{2\pi n}$
- But $\Pr[A \mid N]$ is precisely the probability of $A$ when we throw exactly $n$ balls.
- so if we showed $\neg A$ w.h.p. in the independent case, we can conclude $\neg A \mid N$ w.h.p. (with a change of constants)
- i.e. it also holds in the exactly-$n$-balls case
- so a Chernoff bound in the independent case shows the number of clear bits is close to $pm$ w.h.p., and this transfers to the exact case.