## Fingerprinting

Basic idea: compare two things from a big universe $U$

- comparing directly takes $\log |U|$ bits, which could be huge.
- Better: randomly map $U$ to a smaller set $V$, compare elements of $V$.
- Pr[two distinct elements collide] $=1/|V|$
- intuition: $\log |V|$ bits to compare, error prob. $1/|V|$
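
As a toy illustration (the function and parameter names are mine, not from the notes), we can model a random map from a huge universe into a small $V$ by seeding a fresh PRNG on each input; distinct inputs then collide with probability about $1/|V|$:

```python
import random

def fingerprint(x, num_values, seed):
    """Model a random map from a huge universe into V = {0, ..., num_values-1}.

    Seeding a fresh PRNG with (seed, x) stands in for picking one random
    function; distinct inputs then collide with probability ~1/num_values.
    """
    return random.Random(f"{seed}:{x}").randrange(num_values)
```

Fixing the seed fixes one function from the family, so fingerprints are reproducible across the two parties doing the comparison.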

We work with *fields*

- add, subtract, mult, divide
- 0 and 1 elements
- e.g. reals, rationals (not integers)
- we'll mostly use $Z_p$, the integers mod a prime $p$
- which field we use often won't matter.
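
A quick sketch of field arithmetic in $Z_p$ (the prime 101 and the helper names are arbitrary choices for the example): all four operations work, with division via Fermat's little theorem.

```python
p = 101  # any prime makes Z_p a field

def add(x, y): return (x + y) % p
def sub(x, y): return (x - y) % p
def mul(x, y): return (x * y) % p

def div(x, y):
    # divide via Fermat's little theorem: y^(p-2) is the inverse of y mod p
    return (x * pow(y, p - 2, p)) % p
```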

Verifying matrix multiplications:

- Claim $AB=C$
- check by mul: $n^3$, or $n^{2.376}$ with deep math
- Freivalds': $O(n^2)$.
- Good to apply at end of complex algorithm (check answer)

Freivalds' technique:

- choose random $r \in \{0,1\}^n$
- check $ABr=Cr$
- time $O(n^2)$
- if $AB=C$, fine.
- What if $AB \ne C$?
- trouble if $(AB-C)r=0$ but $D=AB-C \ne 0$
- find some nonzero row $(d_1,\ldots,d_n)$
- wlog $d_1 \ne 0$
- trouble if $\sum d_i r_i = 0$
- i.e. $r_1 = -(\sum_{i > 1} d_i r_i)/d_1$
- principle of deferred decisions: choose $r_i$ for all $i \ge 2$ first
- then have exactly one error value for $r_1$
- prob. pick it is at most $1/2$

- $k$ trials make the failure prob. $1/2^k$.
- Or choosing each $r_i$ from a set of $s$ field elements makes it $1/s$.
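
A sketch of the check in Python (function and variable names are mine; pure-Python matrix-vector products keep it self-contained):

```python
import random

def freivalds(A, B, C, trials=20):
    """One-sided Monte Carlo check of the claim A*B == C in O(trials * n^2) time."""
    n = len(A)

    def matvec(M, v):
        # plain O(n^2) matrix-vector product
        return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

    for _ in range(trials):
        r = [random.randint(0, 1) for _ in range(n)]
        # compute A(Br) and Cr: three matrix-vector products, never the full A*B
        if matvec(A, matvec(B, r)) != matvec(C, r):
            return False  # certainly A*B != C
    return True  # A*B == C with probability >= 1 - 2^(-trials)
```

Each trial errs with probability at most $1/2$, so 20 trials push the failure probability below $2^{-20}$.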

- The technique isn't limited to matrix multiplication.
- can check any claimed matrix identity
- especially useful when the matrices are “implicit” (e.g. $AB$, never written out)

- We are mapping matrices ($n^2$ entries) to vectors ($n$ entries).

In general, many ways to fingerprint explicitly represented objects. But for implicit objects, different methods have different strengths and weaknesses.

We'll fingerprint 3 ways:

- vector multiply
- number mod a random prime
- polynomial evaluation at a random point

### String matching

Checksums:

- Alice and Bob have bit strings of length $n$
- Think of $n$ bit integers $a$, $b$
- take a prime number $p$, compare $a \bmod p$ and $b \bmod p$ with $\log p$ bits.

Trouble if $a \equiv b \pmod p$ but $a \ne b$. How to avoid this? How likely is it?

- $c=a-b$ is $n$-bit integer.
- so at most $n$ prime factors.
- How many primes less than $k$? $\Theta(k/\ln k)$ (prime number theorem)
- so consider primes up to $2n^2\log n$
- that range contains about $n^2$ primes
- at most $n$ of them divide $c$, so a random one errs with prob. $1/n$
- $O(\log n)$ bits to send.
- implement by add/sub, no mul or div!
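
A minimal sketch of the comparison, assuming we already hold a pool of candidate primes (the function name and pool are made up for the example):

```python
import random

def residues_match(a, b, prime_pool, rng):
    """Compare two big integers via their residues mod one random prime.

    Equal inputs always match; unequal inputs match only if the chosen
    prime divides a - b, which few primes in a large pool can do.
    """
    p = rng.choice(prime_pool)
    return a % p == b % p
```

In the real protocol only the residue (a $O(\log n)$-bit number) and the prime are transmitted, not $a$ or $b$.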

Do we need a prime?

- Some numbers have many divisors
- So a random number may be a divisor
- asymptotically, number of divisors $\rightarrow \le n^{2/\log\log n} \le \sqrt{n}$
- so a random number will fail with probability $\le 1/\sqrt{n}$
- but this is a lot of work to prove (and requires asymptotics)
- primes are easier/clearer

How find prime?

- a randomly chosen number in the range is prime with prob. $\Theta(1/\log n)$,
- so just try a few.
- How do we know it's prime? Simple randomized test (later)
- Or, just run Eratosthenes' sieve
- Going to range $n^2$ takes time $O(n^2)$
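
The sieve itself is short; a sketch (function name mine):

```python
def primes_up_to(N):
    """Sieve of Eratosthenes: all primes <= N, in O(N log log N) time."""
    is_prime = [True] * (N + 1)
    is_prime[0:2] = [False, False]
    for i in range(2, int(N ** 0.5) + 1):
        if is_prime[i]:
            # i is prime; cross off its multiples starting at i*i
            for j in range(i * i, N + 1, i):
                is_prime[j] = False
    return [i for i in range(N + 1) if is_prime[i]]
```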

#### Pattern matching in strings

Rabin-Karp algorithm

- $m$-bit pattern
- $n$-bit string
- work mod prime $p$ of size at most $t$
- prob. of error at any particular position is at most $m/(t/\log t)$ by the analysis above
- so pick big $t$, union bound
- implement incrementally: slide the window with a shift plus add/sub mod $p$ (rolling hash)
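
A character-based sketch of the rolling-hash idea (names are mine; the prime is fixed here for simplicity, whereas the analysis above picks it at random, and a direct comparison on fingerprint matches makes the output exact):

```python
def rabin_karp(text, pattern, p=1000003):
    """Return all positions where pattern occurs in text.

    Maintains the fingerprint of an m-character window mod p; sliding the
    window costs O(1). Verifying each fingerprint match directly rules
    out false positives.
    """
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    base = 256
    pf = wf = 0
    for i in range(m):
        pf = (pf * base + ord(pattern[i])) % p  # pattern fingerprint
        wf = (wf * base + ord(text[i])) % p     # first window fingerprint
    shift = pow(base, m - 1, p)                 # weight of the outgoing char
    hits = []
    for i in range(n - m + 1):
        if wf == pf and text[i:i + m] == pattern:
            hits.append(i)
        if i + m < n:
            # drop text[i], shift the window, append text[i+m]
            wf = ((wf - ord(text[i]) * shift) * base + ord(text[i + m])) % p
    return hits
```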

## Bloom Filters

Hash tables are good for key to value mappings,

- but sometimes that is overkill
- Sufficient to know if a key is “present”
- For array, difference between “array” and “bit vector”
- Analogue for hash tables?
- $n$-item perfect hash uses $n\log m$ bits
- can we do better?
- Idea: hash each item to a bit location, set the bit
- Problem: false positive if hash to same place
- Solution: hash to multiple bits for redundant checking

Algorithm

- Set of $n$ items
- Array of $m$ bits
- Insert:
    - Hash each item to $k$ “random” bit positions
    - Set those bits
- Lookup:
    - Answer yes iff all $k$ bits are set

- Always say yes for present items
- Can say yes for not present items (false positive)
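
A minimal sketch, assuming we derive the $k$ “random” positions from a salted SHA-256 (any good hash family would do; the class and method names are mine):

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, item):
        # derive k pseudo-random bit positions by salting a SHA-256 hash
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.m

    def insert(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def lookup(self, item):
        # yes iff all k bits set; can be a false positive, never a false negative
        return all(self.bits[pos] for pos in self._positions(item))
```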

Analysis

- false positive if all $k$ bits set
- $n$ items, so $kn$ random tries to set bits
- How many bits are one?
- probability bit clear is $(1-1/m)^{kn} \approx e^{-kn/m}=p$
- expect $pm$ clear bits

- Claim probability all $k$ bits are set is $f=(1-p)^k$
- there's a conditioning here that makes this non-obvious
- having a bit set makes others being set less likely
- but this *negative correlation* only helps us

- How pick $k$ to optimize?
- More hashes give more chances to see a 0
- But also increase number of set bits
- pick $k$ to minimize $(1-p)^k$ (same as minimizing original)
- take derivative, find zero at $k=(\ln 2)(m/n)$
- in this case $p=1/2$, and $f=(1/2)^k=(0.6185)^{m/n}$

- practical numbers: with $m=8n$ get $f=.02$
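
These numbers are easy to check (function name mine):

```python
import math

def false_positive_rate(bits_per_item):
    """f at the optimal k = (ln 2) * (m/n); equals (1/2)^k = 0.6185^(m/n)."""
    k = math.log(2) * bits_per_item
    return 0.5 ** k
```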

Let's formally validate our assumption that we have about $pm$ zeros.

Poisson distribution

- fix one bin
- prob. $k$ balls $=\binom{n}{k}n^{-k}(1-1/n)^{n-k}\rightarrow \frac{1}{e\,k!}$ as $n\rightarrow\infty$
- more generally, the **Poisson distribution** is the limit of the binomial $B(m,\lambda/m)$ as $m\rightarrow\infty$:
$$
\begin{align*}
\Pr_m[k]&=\binom{m}{k}(\lambda/m)^k(1-\lambda/m)^{m-k}\\
&\approx (m^k/k!)(\lambda/m)^ke^{-\lambda}\\
&=e^{-\lambda}\lambda^k/k!
\intertext{alternatively,}
\Pr_m[k]&=\frac{m}{k}(\lambda/m)\binom{m-1}{k-1}(\lambda/m)^{k-1}(1-\lambda/m)^{(m-1)-(k-1)}\\
&=(\lambda/k)\Pr_{m-1}[k-1]\\
&\approx(\lambda/k)\Pr_m[k-1]
\end{align*}
$$
- So $\Pr[k] \sim \lambda^k/k!$
- must normalize to sum to 1, so $\Pr[k]=e^{-\lambda}\lambda^k/k!$
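
A quick numerical sanity check of this limit (the function names are mine):

```python
import math

def binom_pmf(m, q, k):
    """Exact binomial B(m, q) point mass at k."""
    return math.comb(m, k) * q ** k * (1 - q) ** (m - k)

def poisson_pmf(lam, k):
    """Poisson point mass e^(-lam) * lam^k / k!."""
    return math.exp(-lam) * lam ** k / math.factorial(k)
```

For large $m$ the binomial $B(m, \lambda/m)$ is numerically indistinguishable from the Poisson with rate $\lambda$, and the recurrence $\Pr[k]=(\lambda/k)\Pr[k-1]$ holds exactly.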

The mode:

- consider throwing $B(m,1/m)$ balls into each bin independently
- now take $m\rightarrow\infty$
- $\Pr[0]=e^{-1}$
- total over $n$ bins, $B(nm,1/m)$ balls so $\lambda=n$
- alternatively, can prove convolution of Poisson variables is Poisson with summed rate
- in total, $\Pr[n]=e^{-n}n^n/n!\approx 1/\sqrt{2\pi n}$ by Stirling's approx.
- that is, about $1/\sqrt{n}$ chance of exactly $n$ balls

Application

- Suppose prove $\Pr[A]< p$ in independent variables case
- Let $N$ be event of “exactly $n$ balls in independent case”
- Then $p>\Pr[A]>\Pr[A \cap N]=\Pr[A \mid N]\Pr[N]>\Pr[A \mid N]/\sqrt{2\pi n}$
- So $\Pr[A \mid N] < p\sqrt{2\pi n}$
- But $\Pr[A \mid N]$ is the case where we throw exactly $n$ balls.
- so if we showed $\neg A$ w.h.p. in the independent case, we can conclude $\neg A \mid N$ w.h.p. (with a change of constants)
- i.e. it also holds in the exactly-$n$-balls case
- so Chernoff says the fraction of clear bits is close to $p$ w.h.p.