Fingerprinting
Basic idea: compare two things from a big universe $U$
- comparing directly takes $\log |U|$ bits, which could be huge.
- Better: randomly map $U$ to smaller $V$, compare elements of $V$.
- Probability two distinct elements get the same fingerprint is $1/|V|$
- intuition: $\log |V|$ bits to compare, error prob. $1/|V|$
We work with fields
- add, subtract, mult, divide
- 0 and 1 elements
- e.g. reals, rationals (not the integers, which lack division)
- we'll mostly use $Z_p$, the integers mod a prime $p$
- which field often won't matter.
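A quick illustration of $Z_p$ arithmetic in Python (the prime $p=101$ and the operands are arbitrary); division uses Fermat's little theorem, $b^{-1} = b^{p-2} \bmod p$:

```python
# Field arithmetic in Z_p for a prime p; all four operations stay in {0, ..., p-1}.
p = 101
a, b = 37, 64
print((a + b) % p, (a - b) % p, (a * b) % p)
print(a * pow(b, p - 2, p) % p)  # a / b in Z_p: b^(p-2) is b's inverse mod p (Fermat)
```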
Verifying matrix multiplications:
- Claim $AB=C$
- check by mul: $n^3$, or $n^{2.376}$ with deep math
- Freivalds' check: $O(n^2)$.
- Good to apply at end of complex algorithm (check answer)
Freivalds' technique:
- choose random $r \in \{0,1\}^n$
- check $ABr=Cr$
- time $O(n^2)$
- if $AB=C$, fine.
- What if $AB \ne C$?
- trouble if $(AB-C)r=0$ but $D=AB-C \ne 0$
- find some nonzero row $(d_1,\ldots,d_n)$
- wlog $d_1 \ne 0$
- trouble if $\sum d_i r_i = 0$
- i.e. $r_1 = -(\sum_{i > 1} d_i r_i)/d_1$
- principle of deferred decisions: choose all $i \ge 2$ first
- then have exactly one error value for $r_1$
- prob. pick it is at most $1/2$
- $k$ trials makes $1/2^k$ failure.
- Or choosing each $r_i \in \{1,\ldots,s\}$ makes the failure probability at most $1/s$.
- Doesn't just do matrix mul.
- check any matrix identity claim
- useful when matrices are “implicit” (e.g. $AB$)
- We are mapping matrices ($n^2$ entries) to vectors ($n$ entries).
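A minimal Python sketch of Freivalds' check (the function name and the default trial count are illustrative, not from the notes):

```python
import random

def freivalds_check(A, B, C, k=20):
    """Test whether A @ B == C for n x n matrices given as lists of lists.

    Each trial picks a random 0/1 vector r and compares A(Br) with Cr in
    O(n^2) time.  If AB != C, a single trial detects it with probability
    at least 1/2, so k trials miss with probability at most 2^-k.
    """
    n = len(A)

    def matvec(M, v):
        return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

    for _ in range(k):
        r = [random.randint(0, 1) for _ in range(n)]
        if matvec(A, matvec(B, r)) != matvec(C, r):
            return False          # definitely AB != C
    return True                   # AB == C with probability >= 1 - 2^-k
```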
In general, many ways to fingerprint explicitly represented objects. But for implicit objects, different methods have different strengths and weaknesses.
We'll fingerprint 3 ways:
- vector multiply
- number mod a random prime
- polynomial evaluation at a random point
String matching
Checksums:
- Alice and Bob have bit strings of length $n$
- Think of $n$ bit integers $a$, $b$
- take a random prime $p$, compare $a \bmod p$ and $b \bmod p$ using only $O(\log p)$ bits.
Trouble if $a \ne b$ but $a \equiv b \pmod p$. How avoid? How likely?
- $c=a-b$ is $n$-bit integer.
- so at most $n$ prime factors.
- How many primes less than $k$? $\Theta(k/\ln k)$ (prime number theorem)
- so consider range $1,\ldots,2n^2\log n$
- number of primes about $n^2$
- So a random prime from this range errs with prob. at most $n/n^2 = 1/n$.
- $O(\log n)$ bits to send.
- implement by add/sub, no mul or div!
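A sketch of the checksum protocol in Python. For simplicity it uses the built-in `%` rather than the add/subtract-only trick just mentioned, and `random_prime` is a hypothetical helper (one sieve-based implementation is sketched further below):

```python
import math

def fingerprint(x: int, p: int) -> int:
    # the only thing Alice or Bob needs to send: O(log p) bits
    return x % p

def probably_equal(a: int, b: int, n: int) -> bool:
    """Compare two n-bit integers via residues mod a random prime.

    If a != b, then a - b has at most n prime factors, while the range
    [2, 2 n^2 log n] contains about n^2 primes, so a random prime from
    that range gives a false "equal" with probability about 1/n.
    """
    upper = max(4, int(2 * n * n * math.log2(n + 2)))
    p = random_prime(upper)   # hypothetical helper; see sieve sketch below
    return fingerprint(a, p) == fingerprint(b, p)
```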
Do we need a prime?
- Some numbers have many divisors
- So a random number may be a divisor
- asymptotically, the number of divisors is at most $n^{2/\log\log n} \le \sqrt{n}$
- so a random number will fail with probability $\le 1/\sqrt{n}$
- but this is a lot of work to prove (and requires asymptotics)
- primes are easier/clearer
How find prime?
- Well, a randomly chosen number in our range is prime with prob. $\Theta(1/\log n)$,
- so just try a few.
- How do we know it's prime? Simple randomized test (later)
- Or, just run Eratosthenes' sieve
- Going to range $n^2\log n$ takes time $\tilde{O}(n^2)$
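One way to implement the `random_prime` helper used above, via Eratosthenes' sieve (a sketch; sieving up to the full range $2n^2\log n$ is what the time bound refers to):

```python
import random

def random_prime(upper: int) -> int:
    """Return a uniformly random prime in [2, upper] using Eratosthenes' sieve."""
    is_prime = [True] * (upper + 1)
    is_prime[0] = is_prime[1] = False
    for i in range(2, int(upper ** 0.5) + 1):
        if is_prime[i]:
            for j in range(i * i, upper + 1, i):
                is_prime[j] = False
    return random.choice([q for q, flag in enumerate(is_prime) if flag])
```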
Pattern matching in strings
Rabin-Karp algorithm
- $m$-bit pattern
- $n$-bit string
- work mod prime $p$ of size at most $t$
- prob. of error at a particular position is at most $m/(t/\log t)$, by the argument above
- so pick a big $t$ and union bound over the $n$ positions
- implement with adds/subtracts and shifts as the window slides one bit at a time
- How to make this Las Vegas? Just check matches!
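A Python sketch of the Rabin-Karp idea on bit strings (characters '0'/'1'); `p` is assumed to be a random prime chosen as above, and the exact comparison on each fingerprint match is what makes it Las Vegas:

```python
def rabin_karp_bits(text, pattern, p):
    """Find all occurrences of an m-bit pattern in an n-bit text, comparing
    fingerprints mod p and verifying every candidate match exactly
    (Las Vegas: no false positives in the output)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    pat_fp = win_fp = 0
    for i in range(m):
        pat_fp = (2 * pat_fp + int(pattern[i])) % p
        win_fp = (2 * win_fp + int(text[i])) % p
    high = pow(2, m - 1, p)                     # weight of the bit leaving the window
    matches = []
    for i in range(n - m + 1):
        if win_fp == pat_fp and text[i:i + m] == pattern:
            matches.append(i)
        if i + m < n:                           # slide the window one bit to the right
            win_fp = (2 * (win_fp - int(text[i]) * high) + int(text[i + m])) % p
    return matches

print(rabin_karp_bits("0110101", "101", 101))   # -> [2, 4]
```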
Bloom Filters
Hash tables are good for key to value mappings,
- but sometimes that is overkill
- Sufficient to know if a key is “present”
- For arrays, this is the difference between an “array” and a “bit vector”
- Analogue for hash tables?
- $n$-item perfect hash uses $n\log m$ bits
- can we do better?
- Idea: hash each item to a bit location, set the bit
- Problem: false positive if hash to same place
- Solution: hash to multiple bits for redundant checking
Algorithm
- Set of $n$ items
- Array of $m$ bits
- Insert
- Hash each item to $k$ “random” bits
- Set those bits
- two hashes can yield same bit; this is ok in the analysis
- Lookup
- Answer yes if all $k$ chosen bits set
- Always say yes for present items
- Can say yes for not present items (false positive)
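A small Python sketch of the insert/lookup operations; the double-hashing scheme for deriving $k$ bit positions from one SHA-256 digest is a choice made here for the example, not something specified in the notes:

```python
import hashlib

class BloomFilter:
    """m-bit Bloom filter with k hash positions per item (sketch)."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)

    def _indices(self, item: str):
        # derive k "random" bit positions via double hashing of one digest
        h = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(h[:8], "big")
        h2 = int.from_bytes(h[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def insert(self, item: str) -> None:
        for idx in self._indices(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def lookup(self, item: str) -> bool:
        # True for every inserted item; may also be True for others
        # (false positive), never False for an inserted item.
        return all(self.bits[idx // 8] >> (idx % 8) & 1
                   for idx in self._indices(item))
```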
Analysis
- false positive if all $k$ bits set
- $n$ items, so $kn$ random tries to set bits
- How many bits are one?
- probability bit clear is $(1-1/m)^{kn} \approx e^{-kn/m}=p$
- expect $pm$ clear bits
- Claim: probability all $k$ bits checked for a non-inserted item are set is $f=(1-p)^k$
- there's a conditioning here that makes this non-obvious
- having a bit set makes others being set less likely
- but this negative correlation only helps us
- How pick $k$ to optimize?
- More hashes give more chances to see a 0
- But also increase number of set bits
- pick $k$ to minimize $(1-p)^k$ (same as minimizing original)
- take derivative, find zero at $k=(\ln 2)(m/n)$
- so number of bit-set attempts is $kn=m\ln 2$.
- in this case $p=(1-1/m)^{m\ln 2} \rightarrow 1/2$, and $f=(1/2)^k=(0.6185)^{m/n}$
- half the bits are clear---which maximizes information in filter
- practical numbers: with $m=8n$ get $f=.02$
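A quick numeric check of these formulas (using the optimal real-valued $k$; in practice $k$ is rounded to an integer):

```python
import math

m_over_n = 8                          # m = 8n bits per stored item
k = math.log(2) * m_over_n            # optimal k = (ln 2)(m/n), about 5.5
p = math.exp(-k / m_over_n)           # fraction of clear bits -> 1/2
f = (1 - p) ** k                      # false-positive rate = (1/2)^k
print(round(k, 2), round(p, 2), round(f, 4))   # -> 5.55 0.5 0.0214, i.e. f ~ .02
```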
Formally validate our assumption that we have about $pm$ zeros.
- This is a ($kn$) balls in ($m$) bins problem
- we want to lower bound empty bins
- Problem: dependence of different bins being empty
- (correlation is negative, opposite of what we want)
- solution: define a different distribution where bins are independent
- then relate that distribution to our original one
General Plan:
- we are sampling from a distribution $D$ on balls in bins where empty bins are not independent
- define a different balls-in-bins distribution $I$ where
- empty bins are independent.
- expected number of empty bins is same as for $D$
- but since independent, Chernoff bound holds
- meaning the number of empty bins is $pm \pm O(\sqrt{m\log m})$ w.h.p.
- i.e. the probability of the bad event $X$ (falling outside this range) is at most (say) $m^{-3}$
- $I$ will put a random number $N$ of balls in the bins
- show that if we condition on $N=\alpha m$, we are sampling from $D$
- show that $\Pr_{I}[N=\alpha m] = \Omega(1/\sqrt{m})$
- conclude $$ \begin{align*} m^{-3} \ge \Pr_I[X] &\ge \Pr_I[X \cap \{N=\alpha m\}] \\ &= \Pr_I[X \mid N=\alpha m]\,\Pr_I[N=\alpha m]\\ &= \Omega(\Pr_D[X] / \sqrt{m})\\ \Rightarrow\ \Pr_D[X] &= O(m^{-2.5}) \end{align*} $$
- in other words, the number of empty bins is tightly concentrated in our original distribution w.h.p.
The new distribution
- our original dependent distribution $D$ throws $kn=\alpha m$ balls into $m$ bins
- so an expectation of $\alpha$ balls per bin
- and probability of empty bin $(1-1/m)^{\alpha m} \approx e^{-\alpha}$.
- instead, associate each bin with $\alpha r$ possiballs that each independently “arrive” in their bin with probability $1/r$
- so also $\alpha$ balls per bin in expectation
- now let $r \rightarrow \infty$
- probability a bin is empty is $(1-1/r)^{\alpha r} \rightarrow e^{-\alpha}$.
- so our new distribution $I$ has same expected items and roughly same empty probability as $D$
- but each bin has its own possiballs, so independently empty
- so, Chernoff bound applies to counting empty bins
- we expect $m/2$ empty bins
- Chernoff says we get at least $(1-o(1))m/2$ w.h.p
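A small simulation (a sketch, with arbitrary parameters) comparing the original distribution $D$ with the independent "possiball" distribution $I$; both counts should land close to $e^{-\alpha}m = m/2$ empty bins:

```python
import math
import random

def empty_bins_D(m, alpha):
    """D: throw alpha*m balls into m bins uniformly; count empty bins."""
    occupied = {random.randrange(m) for _ in range(int(alpha * m))}
    return m - len(occupied)

def empty_bins_I(m, alpha, r=1000):
    """I: each bin gets alpha*r possiballs, each arriving independently w.p. 1/r
    (large r approximates the Poisson limit); count empty bins."""
    return sum(all(random.random() >= 1 / r for _ in range(int(alpha * r)))
               for _ in range(m))

m, alpha = 10_000, math.log(2)
print(empty_bins_D(m, alpha), empty_bins_I(m, alpha), round(math.exp(-alpha) * m))
# all three numbers should agree up to fluctuations of order sqrt(m)
```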
Tangent: Poisson process
- the limiting arrival pattern in one bin is a Poisson process, a kind of continuous-time limit of the binomial distribution
- instead of a $B(n,p)$ sequence of $n$ discrete-time coin flips with success probability $p$, we have continuous time where the success “rate” in interval $dt$ is $\alpha dt$.
- so we expect $\alpha$ successes per unit time
- $\alpha$ is the rate of the Poisson process
- if you compute the limits, you find probability of $k$ successes in one time unit is $e^{-\alpha} \alpha^k/k!$.
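A quick numeric check (with illustrative values of $\alpha$ and $k$) that the binomial $B(\alpha r, 1/r)$ approaches the Poisson pmf $e^{-\alpha}\alpha^k/k!$ as $r \rightarrow \infty$:

```python
import math

def binom_pmf(n, p, k):
    # exact binomial probability C(n,k) p^k (1-p)^(n-k)
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

alpha, k = 2.0, 3
for r in (10, 100, 10_000):
    print(r, round(binom_pmf(int(alpha * r), 1 / r, k), 5))
print("Poisson limit:", round(math.exp(-alpha) * alpha**k / math.factorial(k), 5))
```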
Relate the distributions:
- one way to sample from $I$ is to flip a coin for each of the $\alpha r m$ possiballs to see which arrive
- alternatively, first choose the number of arriving possiballs (distributed as $B(\alpha r m, 1/r)$), then choose which possiballs arrive
- all possiballs are equally likely, so we can just choose them uniformly at random
- in other words, each chosen possiball lands in a uniformly random bin
- you might worry about choosing the same possiball twice, but this has probability tending to 0 as $r \rightarrow \infty$.
- in particular, if we sample from $I$ but condition on exactly $\alpha m$ possiballs arriving, then we are actually sampling from $D$!
Bound $\Pr[N=\alpha m]$
- this is the event that the number of arriving possiballs equals its expectation $\alpha m$
- or in original notation, $\Pr[B(n,p)=np]$
- turns out this is $\Omega(1/\sqrt{np})$
- which is nice because it doesn't depend directly on $n$, only on the mean
- in a sense we already know this.
- Chernoff tells us the outcome is within $\pm O(\sqrt{np})$ of the mean with constant probability
- that is only $O(\sqrt{np})$ possible values
- so the most likely value (which is the mean) must have probability $\Omega(1/\sqrt{np})$
- can also prove directly by looking at the proper binomial term $\binom{n}{np} p^{np}(1-p)^{n(1-p)}$
- and just applying Stirling's approximation $k! = \Theta(\sqrt{k}\,(k/e)^k)$.
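A numeric sanity check (working in logs to avoid underflow) that $\Pr[B(n,p)=np]\cdot\sqrt{np}$ stays bounded away from 0, consistent with the Stirling argument:

```python
import math

def log_binom_pmf(n, p, k):
    # log of C(n,k) p^k (1-p)^(n-k), via lgamma to avoid huge intermediates
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

p = 0.5
for n in (100, 10_000, 1_000_000):
    pmf_at_mean = math.exp(log_binom_pmf(n, p, round(n * p)))
    print(n, round(pmf_at_mean * math.sqrt(n * p), 4))
# the product tends to a constant (1/sqrt(pi) here), so Pr[B(n,p)=np] = Omega(1/sqrt(np))
```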
Wrapup
- let $X$ be event of too few empty bins
- $\Pr_D[X] \le \Pr_I[X]/\Pr_I[N=\alpha m] = O(\sqrt{\alpha m}\,\Pr_I[X])$
- $X$ is false with high probability in $m$ under $I$
- therefore $X$ is false with high probability in $m$ under $D$