Polling
Outline
- Set has size $u$, contains $s$ “special” elements
- goal: estimate the number $s$ of special elements
- sample each element independently with probability $p=c\log n/(\epsilon^2 s)$
- with high probability, sample contains $(1\pm\epsilon)sp$ special elements
- if we observe $k$ special elements, deduce $s \in (1\pm\epsilon)k/p$ (sketch below)
- Problem: choosing $p$ requires knowing $s$, the very quantity we want to estimate!
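A minimal sketch of this polling estimator, assuming $p$ was somehow chosen suitably; the names `universe` and `is_special` are illustrative stand-ins, not from the notes:

```python
import random

def poll_estimate(universe, is_special, p):
    # Keep each element independently with probability p, count the
    # special elements that land in the sample, and scale up by 1/p.
    hits = sum(1 for x in universe if random.random() < p and is_special(x))
    return hits / p  # w.h.p. within (1 +/- eps) of s, for suitable p
```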
Related idea: Monte Carlo simulation
- Probability space, event $A$
- easy to test for $A$
- goal: estimate $p=\Pr[A]$.
- Perform $k$ trials (sampling with replacement).
- expected outcome $pk$.
- estimator: $\hat{p}=\frac{1}{k}\sum I_i$, where $I_i$ indicates whether $A$ occurred on trial $i$
- $\Pr[\hat{p} \notin (1\pm\epsilon)p] < 2\exp(-\epsilon^2 kp/3)$ (Chernoff, $\epsilon < 1$)
- for failure probability $\delta$, need \[ k=O\left(\frac{\log(1/\delta)}{\epsilon^2 p}\right) \]
- Define $(\epsilon,\delta)$-approximation scheme: output within $(1\pm\epsilon)$ of the true value with probability at least $1-\delta$
- what if $p$ unknown? For now, assume we are only confirming a proposed value of $p$ (sketch below).
- What if $p$ is small?
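A sketch of the resulting $(\epsilon,\delta)$ Monte Carlo estimator for the "confirming a proposed $p$" case, assuming a lower bound `p_guess` $\le p$; `trial` is a hypothetical function reporting whether $A$ occurred:

```python
import math

def monte_carlo_estimate(trial, p_guess, eps, delta):
    # Chernoff: Pr[outside (1 +/- eps)p] < 2*exp(-eps^2 * k * p / 3),
    # so k = 3*ln(2/delta) / (eps^2 * p_guess) trials suffice.
    k = math.ceil(3 * math.log(2 / delta) / (eps**2 * p_guess))
    hits = sum(1 for _ in range(k) if trial())
    return hits / k  # estimate of p = Pr[A]
```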
More general estimation
- Have random variable $X$
- Want to estimate $E[X]$
- turns out can do this with few samples if $\sigma < \mu$; samples needed scale with $(\sigma/\mu)^2$ (a median-of-means sketch follows this list).
- Proof in HW.
- What if want to estimate $\sigma$?
- Just need to estimate the mean of the random variable $X^2$, since $\sigma^2 = E[X^2]-E[X]^2$.
- There are schemes that are “competitive” versus unknown $\mu,\sigma$
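One standard scheme with this flavor (not necessarily the one intended for the HW) is median-of-means; a sketch, where `draw` is a hypothetical function returning one sample of $X$ and `sigma_over_mu` an assumed known bound on $\sigma/\mu$:

```python
import math, statistics

def median_of_means(draw, eps, delta, sigma_over_mu):
    # Chebyshev: a group average of ~ 4*(sigma/mu)^2/eps^2 samples lands
    # in (1 +/- eps)*E[X] with probability > 3/4; the median of
    # O(log 1/delta) group averages then fails with probability < delta.
    group_size = math.ceil(4 * sigma_over_mu**2 / eps**2)
    groups = math.ceil(8 * math.log(1 / delta))
    means = [sum(draw() for _ in range(group_size)) / group_size
             for _ in range(groups)]
    return statistics.median(means)
```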
Another generalization: conditional sampling
- choose from some distribution, conditioned on some event
- if the event is common, just draw from the distribution until the event happens (rejection sampling; sketch after this list)
- rare events are hard to condition on.
- Choosing a random graph coloring is easy
- Choosing a random legal graph coloring is hard (and useful in physics)
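A sketch of the common-event case via rejection sampling; `draw` and `event` are illustrative stand-ins:

```python
def sample_conditioned(draw, event, max_tries=1_000_000):
    # Expected number of draws is 1/Pr[event], so this is only
    # practical when the conditioning event is reasonably common.
    for _ in range(max_tries):
        x = draw()
        if event(x):
            return x
    raise RuntimeError("event appears too rare for rejection sampling")
```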
For now we'll study “sampling for (better) algorithms.” Later, “algorithms for (better) sampling”
Handling unknown $p$
- Sample until we get $\mu_{\epsilon,\delta}=O(\epsilon^{-2}\log \delta^{-1})$ hits; let $n$ be the (unknown in advance) number of samples taken
- w.h.p, $p \in (1\pm\epsilon)\mu_{\epsilon,\delta}/n$
- let $k = \mu_{\epsilon,\delta}/p$
- so when take $k$ samples expect $\mu_{\epsilon,\delta}$ hits
- consider first $k/(1+\epsilon)$ samples
- expect $\mu_{\epsilon,\delta}/(1+\epsilon)$ hits
- Chernoff says w.h.p $< (1+\epsilon)\mu_{\epsilon,\delta}/(1+\epsilon)=\mu_{\epsilon,\delta}$ hits
- so won't have enough hits to stop before this point
- similar argument shows we won't stop too late
- instead we stop at some point in the proper interval
- and thus get a good estimate (sketch below)
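A sketch of this stopping rule, using the constant from the discussion below; `trial` is again a hypothetical indicator for the event:

```python
import math

def estimate_unknown_p(trial, eps, delta):
    mu = math.ceil((4 / eps**2) * math.log(2 / delta))  # mu_{eps,delta}
    n = hits = 0
    while hits < mu:                 # sample until enough hits
        n += 1
        hits += bool(trial())
    return mu / n                    # w.h.p. in (1 +/- eps) * p
```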
Discussion
- For error probability $\delta$, need $kp=\mu_{\epsilon,\delta}=(4/\epsilon^2)\ln(2/\delta)$
- So, can get tiny probability of error cheaply (logarithmic in $1/\delta$), but cost is sensitive to $\epsilon$ (quadratic in $1/\epsilon$); worked example after this list.
- Note for many apps, want "with high probability": $\delta=1/n^2$, so $kp \approx \log n$.
- Note also, to get a good estimate, need $k \gg 1/p$.
- i.e., the rarer the event we want to detect, the larger the sample needed.
- Sampling without replacement gives lower probability of large deviation
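As a concrete illustration of the formula above (numbers chosen arbitrarily): with $\epsilon=0.1$ and $\delta=0.01$,
\[ kp = \frac{4}{(0.1)^2}\ln\frac{2}{0.01} = 400\ln 200 \approx 2100, \]
so about $2100/p$ samples suffice.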
Transitive closure
Problem outline
- databases want to know the size of the transitive closure (e.g., for query optimization)
- exact computation: matrix-multiplication time,
- or compute the reachability set of each vertex and add up the sizes (baseline sketch below)
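For contrast, the exact per-vertex baseline as a sketch; `adj` maps each vertex to its list of out-neighbors:

```python
from collections import deque

def transitive_closure_size(adj):
    # BFS from every vertex and sum reachability-set sizes: O(nm) total.
    total = 0
    for s in adj:
        seen, queue = {s}, deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        total += len(seen)
    return total
```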
Sampling algorithm
- generate random vertex samples until $\mu_{\epsilon,\delta}$ of them are reachable from $v$
- deduce size of $v$'s reachability set.
- reachability test: $O(m)$ graph search per sample.
- number of samples: $\approx n/\text{size}$ per hit.
- $O(mn)$ per vertex---ouch! (sketch below)
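A sketch of this per-vertex estimator; `reachable(v, u)` stands in for the $O(m)$ graph search:

```python
import math, random

def estimate_reach_size(adj, v, eps, delta, reachable):
    mu = math.ceil((4 / eps**2) * math.log(2 / delta))  # mu_{eps,delta}
    vertices = list(adj)
    samples = hits = 0
    while hits < mu:                       # sample until mu hits
        samples += 1
        hits += bool(reachable(v, random.choice(vertices)))
    return len(vertices) * mu / samples    # n * (hit rate) ~ size
```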
Can test for all vertices simultaneously
- increase the mean to $\mu_{\epsilon,\delta/n}=O(\epsilon^{-2}\log (n/\delta))$,
- so $\delta/n$ failure probability per vertex
- so $\delta$ overall (union bound)
- $O(mn)$ for all vertices (still ouch).
Avoid wasting work
- stop propagating to a vertex once it has seen enough
- if a vertex has seen $k$ samples, so have all its predecessors
- so no need to forward observations
- so send at most $\mu_{\epsilon,\delta/n}=O(\epsilon^{-2}\log(n/\delta))$ samples over an edge, for $O(m\mu_{\epsilon,\delta/n})$ work overall
- also, after $O(n\mu_{\epsilon,\delta/n})$ samples, every vertex has $\mu_{\epsilon,\delta/n}$ hits w.h.p. No more are needed (sketch below).
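Putting the pieces together, a sketch of the shared-sample algorithm with early stopping; the bookkeeping choices (reverse adjacency dict, explicit stack) are mine, not prescribed by the notes:

```python
import math, random

def estimate_all_reach_sizes(adj, eps, delta):
    # Stream random vertex samples, propagating each backward along
    # reverse edges; stop forwarding through a vertex once it already
    # has mu hits (its predecessors then already have mu hits too).
    # Vertex v's size is estimated from the stream position of its
    # mu-th hit: Pr[hit] = size_v/n, so size_v ~ n*mu/done_at[v].
    n = len(adj)
    mu = math.ceil((4 / eps**2) * math.log(2 * n / delta))  # delta/n each
    radj = {v: [] for v in adj}             # reverse adjacency lists
    for u in adj:
        for w in adj[u]:
            radj[w].append(u)
    hits = {v: 0 for v in adj}
    done_at = {}                            # stream index of the mu-th hit
    vertices, t = list(adj), 0
    while len(done_at) < n:
        t += 1
        x = random.choice(vertices)         # t-th sample
        stack, seen = [x], {x}
        while stack:                        # reverse search from x
            u = stack.pop()
            if u in done_at:                # saturated: prune here
                continue
            hits[u] += 1
            if hits[u] == mu:
                done_at[u] = t
            for w in radj[u]:               # forward to predecessors
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
    return {v: n * mu / done_at[v] for v in adj}
```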