Fingerprinting
Basic idea: compare two things from a big universe $U$
- comparing directly takes $\log |U|$ bits, which could be huge.
- Better: randomly map $U$ to smaller $V$, compare elements of $V$.
- Probability two distinct elements get the same fingerprint is $1/|V|$
- intuition: $\log |V|$ bits to compare, error prob. $1/|V|$
We work with fields
- add, subtract, mult, divide
- 0 and 1 elements
- e.g. reals, rationals (not the integers, which lack division)
- we'll mostly use $Z_p$, the integers mod a prime $p$
- which field often won't matter.
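A quick illustration of $Z_p$ arithmetic in Python (the prime $p=101$ and the operands are arbitrary); division uses Fermat's little theorem, $b^{-1} = b^{p-2} \bmod p$:

```python
# Field arithmetic in Z_p for a prime p; all four operations stay in {0, ..., p-1}.
p = 101
a, b = 37, 64
print((a + b) % p, (a - b) % p, (a * b) % p)
print(a * pow(b, p - 2, p) % p)  # a / b in Z_p: b^(p-2) is b's inverse mod p (Fermat)
```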
Verifying matrix multiplications:
- Claim $AB=C$
- check by mul: $n^3$, or $n^{2.376}$ with deep math
- Freivalds' check: $O(n^2)$.
- Good to apply at end of complex algorithm (check answer)
Freivalds' technique:
- choose random $r \in \{0,1\}^n$
- check $ABr=Cr$
- time $O(n^2)$
- if $AB=C$, fine.
- What if $AB \ne C$?
- trouble if $(AB-C)r=0$ but $D=AB-C \ne 0$
- find some nonzero row $(d_1,\ldots,d_n)$
- wlog $d_1 \ne 0$
- trouble if $\sum d_i r_i = 0$
- i.e. $r_1 = -(\sum_{i > 1} d_i r_i)/d_1$
- principle of deferred decisions: choose all $i \ge 2$ first
- then have exactly one error value for $r_1$
- prob. pick it is at most $1/2$
- $k$ trials makes $1/2^k$ failure.
- Or choosing each $r_i \in \{1,\ldots,s\}$ makes the failure probability at most $1/s$.
- Doesn't just do matrix mul.
- check any matrix identity claim
- useful when matrices are “implicit” (e.g. $AB$)
- We are mapping matrices ($n^2$ entries) to vectors ($n$ entries).
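A minimal Python sketch of Freivalds' check (the function name and the default trial count are illustrative, not from the notes):

```python
import random

def freivalds_check(A, B, C, k=20):
    """Test whether A @ B == C for n x n matrices given as lists of lists.

    Each trial picks a random 0/1 vector r and compares A(Br) with Cr in
    O(n^2) time.  If AB != C, a single trial detects it with probability
    at least 1/2, so k trials miss with probability at most 2^-k.
    """
    n = len(A)

    def matvec(M, v):
        return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

    for _ in range(k):
        r = [random.randint(0, 1) for _ in range(n)]
        if matvec(A, matvec(B, r)) != matvec(C, r):
            return False          # definitely AB != C
    return True                   # AB == C with probability >= 1 - 2^-k
```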
In general, many ways to fingerprint explicitly represented objects. But for implicit objects, different methods have different strengths and weaknesses.
We'll fingerprint 3 ways:
- vector multiply
- number mod a random prime
- polynomial evaluation at a random point
String matching
Checksums:
- Alice and Bob have bit strings of length $n$
- Think of $n$ bit integers $a$, $b$
- take a random prime $p$, compare $a \bmod p$ and $b \bmod p$ using only $O(\log p)$ bits.
Trouble if $a \ne b$ but $a \equiv b \pmod p$. How avoid? How likely?
- $c=a-b$ is $n$-bit integer.
- so at most $n$ prime factors.
- How many primes less than $k$? $\Theta(k/\ln k)$ (prime number theorem)
- so consider range $1,\ldots,2n^2\log n$
- number of primes about $n^2$
- So a random prime from this range errs with prob. at most $n/n^2 = 1/n$.
- $O(\log n)$ bits to send.
- implement by add/sub, no mul or div!
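A sketch of the checksum protocol in Python. For simplicity it uses the built-in `%` rather than the add/subtract-only trick just mentioned, and `random_prime` is a hypothetical helper (one sieve-based implementation is sketched further below):

```python
import math

def fingerprint(x: int, p: int) -> int:
    # the only thing Alice or Bob needs to send: O(log p) bits
    return x % p

def probably_equal(a: int, b: int, n: int) -> bool:
    """Compare two n-bit integers via residues mod a random prime.

    If a != b, then a - b has at most n prime factors, while the range
    [2, 2 n^2 log n] contains about n^2 primes, so a random prime from
    that range gives a false "equal" with probability about 1/n.
    """
    upper = max(4, int(2 * n * n * math.log2(n + 2)))
    p = random_prime(upper)   # hypothetical helper; see sieve sketch below
    return fingerprint(a, p) == fingerprint(b, p)
```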
Do we need a prime?
- Some numbers have many divisors
- So a random number may be a divisor
- asymptotically, the number of divisors is at most $n^{2/\log\log n} \le \sqrt{n}$
- so a random number will fail with probability $\le 1/\sqrt{n}$
- but this is a lot of work to prove (and requires asymptotics)
- primes are easier/clearer
How find prime?
- Well, a randomly chosen number in our range is prime with prob. $\Theta(1/\log n)$,
- so just try a few.
- How do we know it's prime? Simple randomized test (later)
- Or, just run Eratosthenes' sieve
- Going to range $n^2\log n$ takes time $\tilde{O}(n^2)$
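One way to implement the `random_prime` helper used above, via Eratosthenes' sieve (a sketch; sieving up to the full range $2n^2\log n$ is what the time bound refers to):

```python
import random

def random_prime(upper: int) -> int:
    """Return a uniformly random prime in [2, upper] using Eratosthenes' sieve."""
    is_prime = [True] * (upper + 1)
    is_prime[0] = is_prime[1] = False
    for i in range(2, int(upper ** 0.5) + 1):
        if is_prime[i]:
            for j in range(i * i, upper + 1, i):
                is_prime[j] = False
    return random.choice([q for q, flag in enumerate(is_prime) if flag])
```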
Pattern matching in strings
Rabin-Karp algorithm
- $m$-bit pattern
- $n$-bit string
- work mod prime $p$ of size at most $t$
- prob. of error at a particular position is at most $m/(t/\log t)$, by the argument above
- so pick a big $t$ and union bound over the $n$ positions
- implement with adds/subtracts and shifts as the window slides one bit at a time
- How to make this Las Vegas? Just check matches!
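A Python sketch of the Rabin-Karp idea on bit strings (characters '0'/'1'); `p` is assumed to be a random prime chosen as above, and the exact comparison on each fingerprint match is what makes it Las Vegas:

```python
def rabin_karp_bits(text, pattern, p):
    """Find all occurrences of an m-bit pattern in an n-bit text, comparing
    fingerprints mod p and verifying every candidate match exactly
    (Las Vegas: no false positives in the output)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    pat_fp = win_fp = 0
    for i in range(m):
        pat_fp = (2 * pat_fp + int(pattern[i])) % p
        win_fp = (2 * win_fp + int(text[i])) % p
    high = pow(2, m - 1, p)                     # weight of the bit leaving the window
    matches = []
    for i in range(n - m + 1):
        if win_fp == pat_fp and text[i:i + m] == pattern:
            matches.append(i)
        if i + m < n:                           # slide the window one bit to the right
            win_fp = (2 * (win_fp - int(text[i]) * high) + int(text[i + m])) % p
    return matches

print(rabin_karp_bits("0110101", "101", 101))   # -> [2, 4]
```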
Bloom Filters
Hash tables are good for key to value mappings,
- but sometimes that is overkill
- Sufficient to know if a key is “present”
- For arrays, this is the difference between an “array” and a “bit vector”
- Analogue for hash tables?
- $n$-item perfect hash uses $n\log m$ bits
- can we do better?
- Idea: hash each item to a bit location, set the bit
- Problem: false positive if hash to same place
- Solution: hash to multiple bits for redundant checking
Algorithm
- Set of $n$ items
- Array of $m$ bits
- Insert
- Hash each item to $k$ “random” bits
- Set those bits
- two hashes can yield same bit; this is ok in the analysis
- Lookup
- Answer yes if all $k$ chosen bits set
- Always say yes for present items
- Can say yes for not present items (false positive)
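A small Python sketch of the insert/lookup operations; the double-hashing scheme for deriving $k$ bit positions from one SHA-256 digest is a choice made here for the example, not something specified in the notes:

```python
import hashlib

class BloomFilter:
    """m-bit Bloom filter with k hash positions per item (sketch)."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)

    def _indices(self, item: str):
        # derive k "random" bit positions via double hashing of one digest
        h = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(h[:8], "big")
        h2 = int.from_bytes(h[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def insert(self, item: str) -> None:
        for idx in self._indices(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def lookup(self, item: str) -> bool:
        # True for every inserted item; may also be True for others
        # (false positive), never False for an inserted item.
        return all(self.bits[idx // 8] >> (idx % 8) & 1
                   for idx in self._indices(item))
```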
Analysis
- false positive if all $k$ bits set
- $n$ items, so $kn$ random tries to set bits
- How many bits are one?
- probability bit clear is $(1-1/m)^{kn} \approx e^{-kn/m}=p$
- expect $pm$ clear bits
- Claim: probability all $k$ bits checked for a non-inserted item are set is $f=(1-p)^k$
- there's a conditioning here that makes this non-obvious
- having a bit set makes others being set less likely
- but this negative correlation only helps us
- How pick $k$ to optimize?
- More hashes give more chances to see a 0
- But also increase number of set bits
- pick $k$ to minimize $(1-p)^k$ (same as minimizing original)
- take derivative, find zero at $k=(\ln 2)(m/n)$
- so number of bit-set attempts is $kn=m\ln 2$.
- in this case $p=(1-1/m)^{m\ln 2} \rightarrow 1/2$, and $f=(1/2)^k=(0.6185)^{m/n}$
- half the bits are clear---which maximizes information in filter
- practical numbers: with $m=8n$ get $f=.02$
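A quick numeric check of these formulas (using the optimal real-valued $k$; in practice $k$ is rounded to an integer):

```python
import math

m_over_n = 8                          # m = 8n bits per stored item
k = math.log(2) * m_over_n            # optimal k = (ln 2)(m/n), about 5.5
p = math.exp(-k / m_over_n)           # fraction of clear bits -> 1/2
f = (1 - p) ** k                      # false-positive rate = (1/2)^k
print(round(k, 2), round(p, 2), round(f, 4))   # -> 5.55 0.5 0.0214, i.e. f ~ .02
```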
Formally validate our assumption that we have about $pm$ zeros.
- This is a ($kn$) balls in ($m$) bins problem
- we want to lower bound empty bins
- Problem: dependence of different bins being empty
- (correlation is negative, opposite of what we want)
- solution: define a different distribution where bins are independent
- then relate that distribution to our original one
General Plan:
- we are sampling from a distribution $D$ on balls in bins where empty bins are not independent
- define a different balls-in-bins distribution $I$ where
- empty bins are independent.
- expected number of empty bins is same as for $D$
- but since independent, Chernoff bound holds
- meaning the number of empty bins is $pm \pm O(\sqrt{m\log m})$ w.h.p.
- i.e. the probability of the bad event $X$ (falling outside this range) is at most (say) $m^{-3}$
- $I$ will put a random number $N$ of balls in the bins
- show that if we condition on $N=\alpha m$, we are sampling from $D$
- show that $\Pr_{I}[N=\alpha m] = \Omega(1/\sqrt{m})$
- conclude $$ \begin{align*} m^{-3} \ge \Pr_I[X] &\ge \Pr_I[X \cap \{N=\alpha m\}] \\ &= \Pr_I[X \mid N=\alpha m]\,\Pr_I[N=\alpha m]\\ &= \Omega(\Pr_D[X] / \sqrt{m})\\ \Rightarrow\ \Pr_D[X] &= O(m^{-2.5}) \end{align*} $$
- in other words, the number of empty bins is tightly concentrated in our original distribution w.h.p.
The new distribution
- our original dependent distribution $D$ throws $kn=\alpha m$ balls into $m$ bins
- so an expectation of $\alpha$ balls per bin
- and probability of empty bin $(1-1/m)^{\alpha m} \approx e^{-\alpha}$.
- instead, associate each bin with $\alpha r$ possiballs that each independently “arrive” in their bin with probability $1/r$
- so also $\alpha$ balls per bin in expectation
- now let $r \rightarrow \infty$
- probability a bin is empty is $(1-1/r)^{\alpha r} \rightarrow e^{-\alpha}$.
- so our new distribution $I$ has same expected items and roughly same empty probability as $D$
- but each bin has its own possiballs, so independently empty
- so, Chernoff bound applies to counting empty bins
- we expect $m/2$ empty bins
- Chernoff says we get at least $(1-o(1))m/2$ w.h.p
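A small simulation (a sketch, with arbitrary parameters) comparing the original distribution $D$ with the independent "possiball" distribution $I$; both counts should land close to $e^{-\alpha}m = m/2$ empty bins:

```python
import math
import random

def empty_bins_D(m, alpha):
    """D: throw alpha*m balls into m bins uniformly; count empty bins."""
    occupied = {random.randrange(m) for _ in range(int(alpha * m))}
    return m - len(occupied)

def empty_bins_I(m, alpha, r=1000):
    """I: each bin gets alpha*r possiballs, each arriving independently w.p. 1/r
    (large r approximates the Poisson limit); count empty bins."""
    return sum(all(random.random() >= 1 / r for _ in range(int(alpha * r)))
               for _ in range(m))

m, alpha = 10_000, math.log(2)
print(empty_bins_D(m, alpha), empty_bins_I(m, alpha), round(math.exp(-alpha) * m))
# all three numbers should agree up to fluctuations of order sqrt(m)
```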
Tangent: Poisson process
- the limiting arrival pattern in one bin is a Poisson process, a kind of continuous-time limit of the binomial distribution
- instead of a $B(n,p)$ sequence of $n$ discrete-time coin flips with success probability $p$, we have continuous time where the success “rate” in interval $dt$ is $\alpha dt$.
- so we expect $\alpha$ successes per unit time
- $\alpha$ is the rate of the Poisson process
- if you compute the limits, you find probability of $k$ successes in one time unit is $e^{-\alpha} \alpha^k/k!$.
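A quick numeric check (with illustrative values of $\alpha$ and $k$) that the binomial $B(\alpha r, 1/r)$ approaches the Poisson pmf $e^{-\alpha}\alpha^k/k!$ as $r \rightarrow \infty$:

```python
import math

def binom_pmf(n, p, k):
    # exact binomial probability C(n,k) p^k (1-p)^(n-k)
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

alpha, k = 2.0, 3
for r in (10, 100, 10_000):
    print(r, round(binom_pmf(int(alpha * r), 1 / r, k), 5))
print("Poisson limit:", round(math.exp(-alpha) * alpha**k / math.factorial(k), 5))
```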
Relate the distributions:
- one way to sample from $I$ is to flip a coin for each of the $\alpha r m$ possiballs to see which arrive
- alternatively, first choose the number of arriving possiballs (distributed as $B(\alpha r m, 1/r)$), then choose which possiballs arrive
- all possiballs are equally likely, so we can just choose them uniformly at random
- in other words, each chosen possiball lands in a uniformly random bin
- you might worry about choosing the same possiball twice, but this has probability tending to 0 as $r \rightarrow \infty$.
- in particular, if we sample from $I$ but condition on exactly $\alpha m$ possiballs arriving, then we are actually sampling from $D$!
Bound $\Pr[N=\alpha m]$
- this is the event that the number of arriving possiballs equals its expectation $\alpha m$
- or in original notation, $\Pr[B(n,p)=np]$
- turns out this is $\Omega(1/\sqrt{np})$
- which is nice because it doesn't depend directly on $n$, only on the mean
- in a sense we already know this.
- Chernoff tells us the outcome is within $\pm O(\sqrt{np})$ of the mean with constant probability
- that is only $O(\sqrt{np})$ possible values
- so the most likely value (which is the mean) must have probability $\Omega(1/\sqrt{np})$
- can also prove directly by looking at the proper binomial term $\binom{n}{np} p^{np}(1-p)^{n(1-p)}$
- and just applying Stirling's approximation $k! = \Theta(\sqrt{k}\,(k/e)^k)$.
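A numeric sanity check (working in logs to avoid underflow) that $\Pr[B(n,p)=np]\cdot\sqrt{np}$ stays bounded away from 0, consistent with the Stirling argument:

```python
import math

def log_binom_pmf(n, p, k):
    # log of C(n,k) p^k (1-p)^(n-k), via lgamma to avoid huge intermediates
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

p = 0.5
for n in (100, 10_000, 1_000_000):
    pmf_at_mean = math.exp(log_binom_pmf(n, p, round(n * p)))
    print(n, round(pmf_at_mean * math.sqrt(n * p), 4))
# the product tends to a constant (1/sqrt(pi) here), so Pr[B(n,p)=np] = Omega(1/sqrt(np))
```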
Wrapup
- let $X$ be event of too few empty bins
- $\Pr_D[X] \le \Pr_I[X]/\Pr_I[N=\alpha m] = O(\sqrt{\alpha m}\,\Pr_I[X])$
- $X$ is false with high probability in $m$ under $I$
- therefore $X$ is false with high probability in $m$ under $D$