Polling
Outline
- Set has size $n$, contains $s$ “special” elements
- goal: count number of special elements
- sample each element independently with probability $p=c\log n/(\epsilon^2 s)$
- with high probability, the sample contains $(1\pm\epsilon)sp$ special elements
- if we observe $k$ special elements, deduce $s \in (1\pm\epsilon)k/p$.
- Problem: what is $p$?
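The polling scheme above can be sketched as a small simulation. This is a minimal sketch (function name and the constant $c$ are my choices), and it dodges the "what is $p$" problem by using the true $s$ to set $p$ — exactly the circularity the last bullet points out:

```python
import math
import random

def poll_estimate(n, special, eps, c=3.0, rng=random):
    """Estimate s = len(special) by sampling each element of an
    n-element universe independently with probability p."""
    s = len(special)                      # used only to set p (the catch!)
    p = min(1.0, c * math.log(n) / (eps ** 2 * s))
    # Only special elements affect the count, so it suffices to
    # flip the sampling coin for just those s elements.
    k = sum(1 for _ in range(s) if rng.random() < p)
    return k / p                          # ~ (1 +/- eps) s w.h.p.

est = poll_estimate(10000, range(1000), eps=0.2, rng=random.Random(0))
```

With $n=10^4$ and $s=1000$ the estimate lands near $1000$ with high probability.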
Related idea: Monte Carlo simulation
- Probability space, event $A$
- easy to test for $A$
- goal: estimate $p=\Pr[A]$.
- Perform $k$ trials (sampling with replacement).
- expected number of successes: $pk$.
- estimator: $\frac{1}{k}\sum I_i$, where $I_i$ indicates whether trial $i$ landed in $A$
- probability of relative error more than $\epsilon$ is $< 2\exp(-\epsilon^2 kp/3)$ (Chernoff, $\epsilon < 1$)
- for prob. $\delta$, need \[ k=O\left(\frac{\log 1/\delta}{\epsilon^2 p}\right) \]
- Define $(\epsilon,\delta)$-approximation scheme
- what if $p$ unknown? For now, assume we are just confirming a proposed value of $p$.
- What if $p$ is small?
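A sketch of the $(\epsilon,\delta)$-approximation scheme (function names are mine; the constant 3 comes from the Chernoff exponent, and a proposed $p$ is used to size the experiment, as in the "confirming proposed $p$" assumption):

```python
import math
import random

def monte_carlo_estimate(trial, p_guess, eps, delta, rng):
    """(eps, delta)-approximation of p = Pr[A]: run k independent
    trials and return the fraction that succeed."""
    # Chernoff: failure probability < 2 exp(-eps^2 k p / 3), so
    # k = 3 ln(2/delta) / (eps^2 p) trials suffice.
    k = math.ceil(3 * math.log(2 / delta) / (eps ** 2 * p_guess))
    hits = sum(trial(rng) for _ in range(k))
    return hits / k

# Example event A: two fair dice sum to at least 10 (true p = 6/36).
est = monte_carlo_estimate(
    lambda r: r.randint(1, 6) + r.randint(1, 6) >= 10,
    p_guess=1/6, eps=0.1, delta=0.01, rng=random.Random(1))
```

Note how the trial count blows up as the guessed $p$ shrinks — the "what if $p$ is small" issue.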
More general estimation
- Have random variable $X$
- Want to estimate $E[X]$
- turns out can do this with few samples if $\sigma < \mu$ (roughly $(\sigma/\mu)^2/\epsilon^2$ samples, by Chebyshev)
- Proof in HW.
- What if want to estimate $\sigma$?
- Just need to estimate the mean of the random variable $X^2$, since $\sigma^2=E[X^2]-E[X]^2$.
- There are schemes that are “competitive” versus unknown $\mu,\sigma$
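One simple instantiation of mean estimation (a Chebyshev-based sketch assuming a known bound on $\sigma/\mu$; the HW proof and the "competitive" schemes may be organized differently):

```python
import math
import random

def estimate_mean(sample, ratio, eps, delta, rng):
    """Estimate E[X] given a bound ratio >= sigma/mu.  Chebyshev:
    k = (sigma/mu)^2 / (eps^2 delta) samples give
    Pr[|estimate - mu| > eps * mu] <= delta."""
    k = math.ceil(ratio ** 2 / (eps ** 2 * delta))
    return sum(sample(rng) for _ in range(k)) / k

# Example: Exp(1) has mu = sigma = 1, so ratio = 1 is a valid bound.
est = estimate_mean(lambda r: r.expovariate(1.0),
                    ratio=1.0, eps=0.1, delta=0.05, rng=random.Random(2))
```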
Another generalization: conditional sampling
- choose from some distribution, conditioned on some event
- if event common, just choose from distribution till event happens
- rare events hard to condition on.
- Choosing a random graph coloring is easy
- Choosing a random legal graph coloring is hard (and useful in physics)
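The "choose from the distribution till the event happens" idea is rejection sampling; a minimal sketch (names are mine):

```python
import random

def conditional_sample(draw, event, rng):
    """Sample from a distribution conditioned on an event by redrawing
    until the event holds.  Expected number of draws is 1/Pr[event],
    so this is hopeless for rare events (e.g. legal colorings with
    few colors)."""
    while True:
        x = draw(rng)
        if event(x):
            return x

# Example: a uniform point in the unit square, conditioned on lying in
# the quarter-disk (event probability pi/4, so about 1.27 draws expected).
rng = random.Random(3)
pt = conditional_sample(lambda r: (r.random(), r.random()),
                        lambda p: p[0] ** 2 + p[1] ** 2 <= 1, rng)
```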
For now we'll study “sampling for (better) algorithms.” Later, “algorithms for (better) sampling”
Handling unknown $p$
- Sample repeatedly; let $n$ (unknown in advance) be the number of samples needed to get $\mu_{\epsilon,\delta}=O(\log \delta^{-1}/\epsilon^2)$ hits
- w.h.p, $p \in (1\pm\epsilon)\mu_{\epsilon,\delta}/n$
- let $k = \mu_{\epsilon,\delta}/p$
- so when take $k$ samples expect $\mu_{\epsilon,\delta}$ hits
- consider first $k/(1+\epsilon)$ samples
- expect $\mu_{\epsilon,\delta}/(1+\epsilon)$ hits
- Chernoff says w.h.p $< (1+\epsilon)\mu_{\epsilon,\delta}/(1+\epsilon)=\mu_{\epsilon,\delta}$ hits
- so won't have enough hits to stop before this point
- a similar argument shows we won't stop too late
- instead we stop at some point in the proper interval
- and thus get a good estimate
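The stopping rule above is easy to code. A sketch (function name and the constant 4, taken from the $\mu_{\epsilon,\delta}$ formula in the Discussion, are my choices):

```python
import math
import random

def estimate_unknown_p(trial, eps, delta, rng):
    """Handle unknown p: sample until mu_{eps,delta} hits are seen;
    if that took n samples, w.h.p. p is in (1 +/- eps) mu / n."""
    mu = math.ceil(4 * math.log(2 / delta) / eps ** 2)
    hits = n = 0
    while hits < mu:
        n += 1
        hits += trial(rng)          # True counts as 1
    return mu / n

# Example: a coin with unknown bias (truth: 0.3).
est = estimate_unknown_p(lambda r: r.random() < 0.3,
                         eps=0.1, delta=0.05, rng=random.Random(4))
```

Unlike the fixed-$k$ scheme, no prior guess for $p$ is needed; the sample size adapts to the rarity of the event.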
Discussion
- For error probability $\delta$, need $kp=\mu_{\epsilon,\delta}=(4/\epsilon^2)\ln(2/\delta)$
- So, can get tiny probability of error cheaply, but sensitive to $\epsilon$.
- Note for many apps, want “with high probability:” $\delta=1/n^2$, so $kp \approx \log n$.
- Note also, to get a good sample, need $k \gg 1/p$.
- i.e., the sample needed grows as the event we want to detect becomes rarer.
- Sampling without replacement gives lower probability of large deviation
Transitive closure
Problem outline
- databases want the size of the transitive closure
- exact computation takes matrix-multiplication time
- i.e., compute the reachability set of each vertex, then add up the sizes
Sampling algorithm
- generate random vertex samples until $\mu_{\epsilon,\delta}$ of them are reachable from $v$
- deduce the size of $v$'s reachability set.
- reachability test: $O(m)$ per sample.
- expected number of samples: $n\mu_{\epsilon,\delta}/\text{size}$.
- $O(mn)$ per vertex---ouch!
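The per-vertex sampling algorithm in code (a sketch; names and the constant in $\mu$ are mine):

```python
import math
import random
from collections import deque

def reachable(adj, v, w):
    """O(m) reachability test: is w reachable from v?  (BFS from v.)"""
    seen, queue = {v}, deque([v])
    while queue:
        u = queue.popleft()
        if u == w:
            return True
        for x in adj[u]:
            if x not in seen:
                seen.add(x)
                queue.append(x)
    return False

def estimate_reach_size(adj, v, eps, delta, rng):
    """Sample random vertices until mu of them are reachable from v;
    each sample hits with probability |reach(v)| / n."""
    n = len(adj)
    mu = math.ceil(4 * math.log(2 / delta) / eps ** 2)
    hits = samples = 0
    while hits < mu:
        samples += 1
        hits += reachable(adj, v, rng.randrange(n))
    return n * mu / samples

# Example: path 0 -> 1 -> 2 -> 3 plus six isolated vertices,
# so |reach(0)| = 4 out of n = 10.
adj = [[1], [2], [3]] + [[]] * 7
est = estimate_reach_size(adj, 0, eps=0.3, delta=0.1, rng=random.Random(5))
```

Each sample pays a full $O(m)$ search, which is exactly the waste the next refinements attack.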
Can test for all vertices simultaneously
- increase mean to $\mu_{\epsilon,\delta/n}=O((1/\epsilon^2)\log (n/\delta))$,
- so $\delta/n$ failure probability per vertex
- so $\delta$ overall (union bound)
- $O(mn)$ for all vertices (still ouch).
Avoid wasting work
- stop propagating to a vertex once it has seen enough
- if a vertex has seen $k$ samples, so have all its predecessors
- so no need to forward observations
- so send at most $O(\log (n/\delta))$ samples over an edge
- also, after $O(n\mu_{\epsilon,\delta/n})$ samples, every vertex has its $O(\log (n/\delta))$ hits; no more are needed.