\documentclass[12pt]{article} \usepackage{wide,me} \parindent0pt

## Fingerprinting

Basic idea: compare two things from a big universe $U$

• generally takes $\log U$, could be huge.
• Better: randomly map $U$ to smaller $V$, compare elements of $V$.
• Probability(same)$=1/|V|$
• intuition: $\log V$ bits to compare, error prob. $1/|V|$

We work with fields

• add, subtract, mult, divide
• 0 and 1 elements
• eg reals, rats, (not ints)
• talk about $Z_p$
• which field often won't matter.

Verifying matrix multiplications:

• Claim $AB=C$
• check by mul: $n^3$, or $n^{2.376}$ with deep math
• Freivald's $O(n^2)$.
• Good to apply at end of complex algorithm (check answer)

Freivald's technique:

• choose random $r \in \{0,1\}^n$
• check $ABr=Cr$
• time $O(n^2)$
• if $AB=C$, fine.
• What if $AB \ne C$?
• trouble if $(AB-C)r=0$ but $D=AB-C \ne 0$
• find some nonzero row $(d_1,\ldots,d_n)$
• wlog $d_1 \ne 0$
• trouble if $\sum d_i r_i = 0$
• ie $r_1 = (\sum_{i > 1} d_i r_i)/d_1$
• principle of deferred decisions: choose all $i \ge 2$ first
• then have exactly one error value for $r_1$
• prob. pick it is at most $1/2$
How improve detection prob?
• $k$ trials makes $1/2^k$ failure.
• Or choosing $r \in [1,s]$ makes $1/s$.
• Doesn't just do matrix mul.
• check any matrix identity claim
• useful when matrices are “implicit” (e.g. $AB$)
• We are mapping matrices ($n^2$ entries) to vectors ($n$ entries).

In general, many ways to fingerprint explicitly represented objects. But for implicit objects, different methods have different strengths and weaknesses.

We'll fingerprint 3 ways:

• vector multiply
• number mod a random prime
• polynomial evaluation at a random point
All are functions mapping large input space to small output space.

### String matching

Checksums:

• Alice and Bob have bit strings of length $n$
• Think of $n$ bit integers $a$, $b$
• take a prime number $p$, compare $a \bmod p$ and $b \bmod p$ with $\log p$ bits.

Trouble if $a=b \pmod p$. How avoid? How likely?

• $c=a-b$ is $n$-bit integer.
• so at most $n$ prime factors.
• How many prime factors less than $k$? $\Theta(k/\ln k)$
• so take $2n^2\log n$ limit
• number of primes about $n^2$
• So on random one, $1/n$ error prob.
• $O(\log n)$ bits to send.
• implement by add/sub, no mul or div!

Do we need a prime?

• Some numbers have many divisors
• So a random number may be a divisor
• asymptotically, number of divisors $\rightarrow \le n^{2/\log\log n} \le \sqrt{n}$
• so a random number will fail with probability $\le 1/\sqrt{n}$
• but this is a lot of work to prove (and requires asymptotics)
• primes are easier/clearer

How find prime?

• Well, a randomly chosen number is prime with prob. $1/\ln n$,
• so just try a few.
• How know its prime? Simple randomized test (later)
• Or, just run Eratosthenes' sieve
• Going to range $n^2$ takes time $O(n^2)$

#### Pattern matching in strings

Rabin-Karp algorithm

• $m$-bit pattern
• $n$-bit string
• work mod prime $p$ of size at most $t$
• prob. error at particular point most $m/(t/\log t)$ from above
• so pick big $t$, union bound
• implement by add/sub as shift in bits

## Bloom Filters

Hash tables are good for key to value mappings,

• but sometimes that is overkill
• Sufficient to know if a key is “present”
• For array, difference between “array” and “bit vector”
• Analogue for hash tables?
• $n$-item perfect hash uses $n\log m$ bits
• can we do better?
• Idea: hash each item to a bit location, set the bit
• Problem: false positive if hash to same place
• Solution: hash to multiple bits for redundant checking

Algorithm

• Set of $n$ items
• Array of $m$ bits
• Insert
• Hash each item to $k$ “random” bits
• Set those bits
• Lookup
• Answer yes if all $k$ bits set
• Always say yes for present items
• Can say yes for not present items (false positive)

Analysis

• false positive if all $k$ bits set
• $n$ items, so $kn$ random tries to set bits
• How many bits are one?
• probability bit clear is $(1-1/m)^{kn} \approx e^{-kn/m}=p$
• expect $pm$ clear bits
• Claim probability $k$ bits set is $f=(1-p)^k$
• there's a conditioning here that makes this non-obvious
• having a bit set makes others being set less likely
• but this negative correlation only helps us
• How pick $k$ to optimize?
• More hashes give more chances to see a 0
• But also increase number of set bits
• pick $k$ to minimize $(1-p)^k$ (same as minimizing original)
• take derivative, find zero at $k=(\ln 2)(m/n)$
• in this case $p=1/2$, and $f=(1/2)^k=(0.6185)^{m/n}$
• practical numbers: with $m=8n$ get $f=.02$

Formally validate our assumption that have $pm$ zeros.

Poisson distribution

• fix one bin
• prob. $k$ balls $\approx \binom{n}{k}n^{-k}\rightarrow 1/ek!$ as $n\rightarrow\infty$
• more generally, Poisson distribution is limit of binomial $B(m,\lambda/m)$ as $m\rightarrow\infty$ \begin{align*} \Pr_m[k]&=\binom{m}{k}(\lambda/m)^k(1-\lambda/m)^{m-k}\\ &\approx (m^k/k!)(\lambda/m)^ke^{-\lambda}\\ &=e^{-\lambda}\lambda^k/k! \intertext{alternatively,} &=\frac{m}{k}(\lambda/m)\binom{m-1}{k-1}(\lambda/m)^{k-1}(1-\lambda/m)^{(m-1)-(k-1)}\\ &=(\lambda/k)\Pr_{m-1}[k-1]\\ &=(\lambda/k)\Pr_m[k-1] \end{align*}
• So $\Pr[k] \sim \lambda^k/k!$
• Must normalize to 1 so $\Pr[k]=e^{-\lambda}\lambda^k/k!$

The mode:

• consider throwing $B(m,1/m)$ balls into each bin independently
• now take $m\rightarrow\infty$
• $\Pr[0]=e^{-1}$
• total over $n$ bins, $B(nm,1/m)$ balls so $\lambda=n$
• alternatively, can prove convolution of Poisson variables is Poisson with summed rate
• in total, $\Pr[n]=e^{-n}n^n/n!\approx 1/\sqrt{2\pi n}$ by Stirling's approx.
• that is, about $1/\sqrt{n}$ chance of exactly $n$ balls

Application

• Suppose prove $\Pr[A]< p$ in independent variables case
• Let $N$ be event of “exactly $n$ balls in independent case”
• Then $p>\Pr[A]>\Pr[A \cap N]=\Pr[A \mid N]\Pr[N]>\Pr[A \mid N]/\sqrt{2\pi n}$
• So $\Pr[A \mid N] < p\sqrt{2\pi n}$
• But $\Pr[A \mid N]$ is case when throw in exactly $n$ balls.
• so if showed $\neg A$ w.h.p, can concude $\neg A \mid N$ w.h.p (and change of constants)
• i.e. also in exactly $n$ balls case
• so Chernoff says close to $p$ empty.
• “Poisson Approximation”
• Imagine putting $B(n,1/n)$ balls in each bin
• Then each bin is empty with probability $(1-1/n)^n=1/e$
• And since bin sizes are independent, Chernoff applies
• Total number of balls is $B(n^2,1/n)$
• Expectation $n$
• What is probability of exactly $n$?
• Easy to prove $n$ is most likely, so at least $1/n^2$
• In fact, can prove $1/n$ (because no probability is outside $\pm n$ of mean)
• Suppose prove $\Pr[A]< p$ in independent variables case
• Let $N$ be event of “exactly $n$ balls in independent case
• Then $p>\Pr[A]>\Pr[A \cap N]=\Pr[A \mid N]\Pr[N]>\Pr[A \mid N]/n$
• So $\Pr[A \mid N] < np$ (or $n^2 p$)
• But $\Pr[A \mid N]$ is case when throw in exactly $n$ balls.
• so if showed $\neg A$ w.h.p, can concude $\neg A \mid N$ w.h.p (and change of constants)
• i.e. also in exactly $n$ balls case
• so Chernoff says close to $p$ empty.