## Fingerprinting

Basic idea: compare two things from a big universe $U$

- comparing directly takes $\log |U|$ bits, which could be huge.
- Better: randomly map $U$ to a smaller set $V$, compare elements of $V$.
- Pr[two distinct elements get the same fingerprint] $=1/|V|$ for a random map
- intuition: $\log |V|$ bits to compare, error prob. $1/|V|$

We work with *fields*

- add, subtract, mult, divide
- 0 and 1 elements
- e.g. reals, rationals (not integers)
- we'll talk about $Z_p$, the integers mod a prime $p$
- which field often won't matter.

Verifying matrix multiplications:

- Claim $AB=C$
- check by mul: $n^3$, or $n^{2.376}$ with deep math
- Freivalds' algorithm: $O(n^2)$.
- Good to apply at end of complex algorithm (check answer)

Freivalds' technique:

- choose random $r \in \{0,1\}^n$
- check $ABr=Cr$
- time $O(n^2)$
- if $AB=C$, fine.
- What if $AB \ne C$?
- trouble if $(AB-C)r=0$ but $D=AB-C \ne 0$
- find some nonzero row $(d_1,\ldots,d_n)$
- wlog $d_1 \ne 0$
- trouble if $\sum d_i r_i = 0$
- i.e. $r_1 = -(\sum_{i > 1} d_i r_i)/d_1$
- principle of deferred decisions: choose all $i \ge 2$ first
- then have exactly one error value for $r_1$
- prob. pick it is at most $1/2$

- $k$ trials reduce the failure probability to $1/2^k$ (see the sketch below).
- Or choose each $r_i \in \{1,\ldots,s\}$: failure probability at most $1/s$.
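
For concreteness, a minimal sketch of the check in Python, assuming exact integer matrices as numpy arrays (the function name and default trial count are illustrative):

```python
import numpy as np

def freivalds_check(A, B, C, trials=10):
    """Return False if AB != C is detected; True means "probably equal".

    Each trial errs with probability at most 1/2 when AB != C, so
    `trials` repetitions drive the failure probability below 2**-trials.
    """
    n = C.shape[0]
    for _ in range(trials):
        r = np.random.randint(0, 2, size=n)  # random r in {0,1}^n
        # A(Br) and Cr are three matrix-vector products: O(n^2) total.
        if not np.array_equal(A @ (B @ r), C @ r):
            return False  # a certificate that AB != C
    return True
```

Note that we never form $AB$; only matrix-vector products are computed.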

- Doesn't just do matrix mul.
- check any matrix identity claim
- useful when matrices are “implicit” (e.g. $AB$)

- We are mapping matrices ($n^2$ entries) to vectors ($n$ entries).

In general, many ways to fingerprint explicitly represented objects. But for implicit objects, different methods have different strengths and weaknesses.

We'll fingerprint 3 ways:

- vector multiply
- number mod a random prime
- polynomial evaluation at a random point

### String matching

Checksums:

- Alice and Bob have bit strings of length $n$
- Think of $n$ bit integers $a$, $b$
- take a prime number $p$, compare $a \bmod p$ and $b \bmod p$ with $\log p$ bits.

Trouble if $a \equiv b \pmod p$ even though $a \ne b$. How likely is that? How do we avoid it?

- $c=a-b$ is $n$-bit integer.
- so at most $n$ prime factors.
- How many primes are less than $k$? $\Theta(k/\ln k)$
- so draw the prime at random from $[1, 2n^2\log n]$
- that range contains about $n^2$ primes
- at most $n$ of them divide $c$, so error prob. $1/n$
- $O(\log n)$ bits to send.
- implement by add/sub, no mul or div! (see the sketch below)
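
A sketch of that streaming implementation, assuming the string arrives most-significant bit first (the function name is illustrative):

```python
def fingerprint(bits, p):
    """(value of the bit string) mod p, using shifts, adds, subtracts only.

    The residue stays below p, so after the shift it is below 2p and a
    single conditional subtraction reduces it: no multiply or divide.
    """
    residue = 0
    for b in bits:
        residue = (residue << 1) | b  # shift in the next bit
        if residue >= p:
            residue -= p              # one subtraction, never a division
    return residue
```

Alice sends $p$ and `fingerprint(a_bits, p)`: $O(\log n)$ bits total.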

Do we need a prime?

- Some numbers have many divisors
- So a random number may be a divisor
- asymptotically, the number of divisors is at most $n^{2/\log\log n} \le \sqrt{n}$
- so a random number will fail with probability $\le 1/\sqrt{n}$
- but this is a lot of work to prove (and requires asymptotics)
- primes are easier/clearer

How find prime?

- Well, a randomly chosen number in our range is prime with prob. $\Theta(1/\log n)$,
- so just try a few.
- How do we know it's prime? Simple randomized test (later)
- Or, just run Eratosthenes' sieve
- going to range $n^2$ takes time $O(n^2)$ (see the sketch below)
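
A sketch of both steps, assuming the $2n^2\log n$ range from above (helper names are illustrative):

```python
import math
import random

def primes_up_to(t):
    # Eratosthenes' sieve: O(t log log t) time, O(t) space.
    is_prime = [True] * (t + 1)
    is_prime[0] = is_prime[1] = False
    for i in range(2, math.isqrt(t) + 1):
        if is_prime[i]:
            for j in range(i * i, t + 1, i):
                is_prime[j] = False
    return [i for i, flag in enumerate(is_prime) if flag]

def random_checksum_prime(n):
    # Draw a uniformly random prime below ~2 n^2 log n.
    t = max(int(2 * n * n * math.log(n)), 5)
    return random.choice(primes_up_to(t))
```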

#### Pattern matching in strings

Rabin-Karp algorithm

- $m$-bit pattern
- $n$-bit string
- work mod prime $p$ of size at most $t$
- prob. of error at any one position is at most $m/(t/\log t)$, by the argument above
- so pick $t$ big enough to union bound over all $n$ positions
- slide the fingerprint window with shifts and adds/subs, no multiplication (see the sketch below)
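
A sketch of the scan, assuming bit sequences and a prime $p$ chosen as above (names are illustrative, and reported offsets are only probable matches):

```python
def rabin_karp_matches(text_bits, pattern_bits, p):
    """Yield every offset whose window fingerprint matches the pattern's.

    Choose p large enough to union bound over all n positions,
    or verify each reported match directly.
    """
    m, n = len(pattern_bits), len(text_bits)
    if m == 0 or m > n:
        return
    top = pow(2, m - 1, p)  # weight of the bit leaving the window
    pat = win = 0
    for i in range(m):      # fingerprints of pattern and first window
        pat = (2 * pat + pattern_bits[i]) % p
        win = (2 * win + text_bits[i]) % p
    for start in range(n - m + 1):
        if win == pat:
            yield start
        if start + m < n:   # slide: drop text[start], append text[start+m]
            win = (2 * (win - text_bits[start] * top) + text_bits[start + m]) % p
```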

## Bloom Filters

Hash tables are good for key to value mappings,

- but sometimes that is overkill
- Sufficient to know if a key is “present”
- for an array, it's the difference between an “array” and a “bit vector”
- Analogue for hash tables?
- an $n$-item perfect hash table uses $n\log u$ bits (universe size $u$) just to store the keys
- can we do better?
- Idea: hash each item to a bit location, set the bit
- Problem: false positive if hash to same place
- Solution: hash to multiple bits for redundant checking

Algorithm

- Set of $n$ items
- Array of $m$ bits
- Insert:
    - hash each item to $k$ “random” bits
    - set those bits
- Lookup:
    - answer yes iff all $k$ bits are set
- Always says yes for present items
- May say yes for absent items (a false positive); see the sketch below
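
A minimal sketch, using Python's built-in salted `hash` on `(item, i)` pairs as a stand-in for $k$ independent hash functions:

```python
class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k          # m bits, k hash functions
        self.bits = [False] * m

    def _positions(self, item):
        # k "random" bit positions per item (illustrative hashing).
        return [hash((item, i)) % self.m for i in range(self.k)]

    def insert(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def lookup(self, item):
        # Never a false negative; a false positive occurs when all k
        # positions happen to have been set by other items.
        return all(self.bits[pos] for pos in self._positions(item))
```

With $m = 8n$ bits and $k = 6$ hashes this errs on roughly 2% of absent items, matching the analysis below.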

Analysis

- false positive if all $k$ bits set
- $n$ items, so $kn$ random tries to set bits
- How many bits are one?
- probability bit clear is $(1-1/m)^{kn} \approx e^{-kn/m}=p$
- expect $pm$ clear bits

- Claim probability $k$ bits set is $f=(1-p)^k$
- there's a conditioning here that makes this non-obvious
- having one bit set makes it less likely that others are set
- but this *negative correlation* only helps us

- How pick $k$ to optimize?
- More hashes give more chances to see a 0
- But also increase number of set bits
- pick $k$ to minimize $(1-p)^k$ (same as minimizing original)
- take derivative, find zero at $k=(\ln 2)(m/n)$
- in this case $p=1/2$, and $f=(1/2)^k=(0.6185)^{m/n}$

- practical numbers: with $m=8n$ get $f \approx .02$ (checked below)
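
These numbers are easy to check (a quick sanity check, assuming $m/n = 8$):

```python
import math

m_over_n = 8                      # bits of filter per stored item
k = math.log(2) * m_over_n        # optimal k = (ln 2)(m/n), about 5.5 hashes
f = 0.5 ** k                      # false-positive rate (0.6185)^(m/n)
print(round(k, 2), round(f, 3))   # prints: 5.55 0.021
```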

Formally validate our assumption that we have $pm$ zeros.

- “Poisson Approximation”
- Imagine putting $B(n,1/n)$ balls in each bin
- Then each bin is empty with probability $(1-1/n)^n \approx 1/e$
- And since bin sizes are independent, Chernoff applies
- Total number of balls is $B(n^2,1/n)$
- Expectation $n$
- What is probability of exactly $n$?
- Easy to prove $n$ is the most likely value, so its probability is at least $1/n^2$
- In fact, can prove $\Omega(1/n)$ (because almost all the probability mass lies within $\pm n$ of the mean)

- Suppose we prove $\Pr[A] < p$ in the independent-variables case
- Let $N$ be the event “exactly $n$ balls” in the independent case
- Then $p > \Pr[A] \ge \Pr[A \cap N] = \Pr[A \mid N]\Pr[N] \ge \Pr[A \mid N]/n$
- So $\Pr[A \mid N] < np$ (or $n^2 p$)
- But conditioned on $N$, we are exactly in the “throw exactly $n$ balls” case.
- so if we showed $\neg A$ w.h.p. in the independent case, we can conclude $\neg A$ w.h.p. given $N$
- i.e. also in the exactly-$n$-balls case
- so Chernoff says that close to a $p$ fraction of the bins are empty, w.h.p.