
## Polling

Outline

• Set has size $u$, contains $s$ “special” elements
• goal: count number of special elements
• sample each element independently with probability $p=c\log n/(\epsilon^2 s)$
• with high probability, the sample contains $(1\pm\epsilon)sp$ special elements
• if we observe $k$ special elements, deduce $s \in (1\pm\epsilon)k/p$
• Problem: choosing $p$ requires knowing $s$, the very quantity we want to estimate
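A minimal sketch of the polling estimator, on a hypothetical instance and with $p$ fixed by hand (sidestepping, for the moment, the circularity noted above):

```python
import random

def estimate_special_count(population, is_special, p, seed=0):
    """Poll: keep each element independently with probability p, count the
    special elements that land in the sample, and scale up: s_hat = hits / p."""
    rng = random.Random(seed)
    hits = sum(1 for x in population if rng.random() < p and is_special(x))
    return hits / p

# Hypothetical instance: u = 100_000 elements, s = 5_000 special ones.
u, s = 100_000, 5_000
is_special = lambda x: x < s  # pretend the first s elements are special
estimate = estimate_special_count(range(u), is_special, p=0.1)
```

With these numbers the sample contains about $sp = 500$ special elements, so the estimate concentrates around $s = 5000$.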

Related idea: Monte Carlo simulation

• Probability space, event $A$
• easy to test for $A$
• goal: estimate $p=\Pr[A]$.
• Perform $k$ trials (sampling with replacement).
• expected number of successes: $pk$
• estimator: $\hat p = \frac{1}{k}\sum_i I_i$, where $I_i$ indicates success in trial $i$
• Chernoff: probability the outcome falls outside $(1\pm\epsilon)pk$ is $< 2\exp(-\epsilon^2 kp/3)$ (for $\epsilon < 1$)
• for prob. $\delta$, need $k=O\left(\frac{\log 1/\delta}{\epsilon^2 p}\right)$
• Define $(\epsilon,\delta)$-approximation scheme: output within $(1\pm\epsilon)$ of the true value with probability at least $1-\delta$
• what if $p$ is unknown? For now, assume we are only confirming a proposed value of $p$.
• What if $p$ is small?
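A sketch of the Monte Carlo scheme on a hypothetical event (a uniform point in the unit square landing inside the quarter-disk, so $p = \pi/4$), with $k$ set from the bound above using a proposed value of $p$:

```python
import math
import random

def monte_carlo_estimate(trial, k, seed=0):
    """Run k independent trials and return the fraction of successes,
    an unbiased estimator of p = Pr[A]."""
    rng = random.Random(seed)
    return sum(trial(rng) for _ in range(k)) / k

# Hypothetical event A: a uniform point in [0,1]^2 lies in the quarter-disk.
trial = lambda rng: rng.random() ** 2 + rng.random() ** 2 <= 1.0

eps, delta = 0.05, 0.01
p_proposed = math.pi / 4  # the value we are "confirming"
k = math.ceil(3 * math.log(2 / delta) / (eps**2 * p_proposed))
estimate = monte_carlo_estimate(trial, k)
```

Note $k$ scales as $1/p$: confirming a rare event already needs many trials.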

More general estimation

• Have random variable $X$
• Want to estimate $E[X]$
• turns out we can do this with few samples provided $\sigma < \mu$ (bounded relative variance).
• Proof in HW.
• What if want to estimate $\sigma$?
• Just need to estimate the mean of the random variable $X^2$, since $\sigma^2 = E[X^2]-\mu^2$.
• There are schemes that are “competitive” versus unknown $\mu,\sigma$
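A small sketch of both points, on a hypothetical $X \sim \mathrm{Exponential}(1)$ (so $\mu = \sigma = 1$): estimate the mean by averaging, then reuse the same machinery on $X^2$ to get the variance.

```python
import random

def estimate_mean(draw, k, rng):
    """Average of k i.i.d. samples of X. By Chebyshev, relative error eps
    with probability 1 - delta needs about (sigma/mu)^2 / (eps^2 * delta)
    samples -- cheap whenever sigma < mu."""
    return sum(draw(rng) for _ in range(k)) / k

# Hypothetical X ~ Exponential(1): mu = 1, sigma = 1.
rng = random.Random(0)
draw = lambda r: r.expovariate(1.0)

mu_hat = estimate_mean(draw, 10_000, rng)
# sigma^2 = E[X^2] - mu^2, so estimating the mean of X^2 handles variance too.
ex2_hat = estimate_mean(lambda r: draw(r) ** 2, 10_000, rng)
var_hat = ex2_hat - mu_hat ** 2
```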

Another generalization: conditional sampling

• choose from some distribution, conditioned on some event
• if event common, just choose from distribution till event happens
• rare events hard to condition on.
• Choosing a random graph coloring is easy
• Choosing a random legal graph coloring is hard (and useful in physics)
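The "retry until the event happens" idea is just rejection sampling. A sketch on a hypothetical tiny instance where the event is common (a uniform 3-coloring of a 4-cycle, conditioned on being legal, so the event has probability $2/9$):

```python
import random

def sample_conditioned(draw, event, rng, max_tries=100_000):
    """Rejection sampling: draw from the base distribution until the
    conditioning event occurs. Practical only when Pr[event] isn't tiny."""
    for _ in range(max_tries):
        x = draw(rng)
        if event(x):
            return x
    raise RuntimeError("event looks too rare for rejection sampling")

# Hypothetical instance: a uniform 3-coloring of a 4-cycle, conditioned
# on being legal (endpoints of every edge get different colors).
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
draw = lambda rng: [rng.randrange(3) for _ in range(4)]
legal = lambda c: all(c[u] != c[v] for u, v in edges)

coloring = sample_conditioned(draw, legal, random.Random(0))
```

On larger graphs the fraction of legal colorings shrinks rapidly, which is exactly why the conditional version becomes hard.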

For now we'll study “sampling for (better) algorithms.” Later, “algorithms for (better) sampling”

Handling unknown $p$

• Sample until we have seen $\mu_{\epsilon,\delta}=O(\log(1/\delta)/\epsilon^2)$ hits; let $n$ be the (random) number of samples this took
• w.h.p., $p \in (1\pm\epsilon)\mu_{\epsilon,\delta}/n$
• let $k = \mu_{\epsilon,\delta}/p$
• so $k$ samples yield $\mu_{\epsilon,\delta}$ hits in expectation
• consider the first $k/(1+\epsilon)$ samples
• they yield $\mu_{\epsilon,\delta}/(1+\epsilon)$ hits in expectation
• Chernoff says w.h.p. fewer than $(1+\epsilon)\cdot\mu_{\epsilon,\delta}/(1+\epsilon)=\mu_{\epsilon,\delta}$ hits
• so we won't have enough hits to stop before this point
• a similar argument shows we won't stop too late
• instead we stop at some point in the proper interval
• and thus get a good estimate
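The stopping rule can be sketched directly (hypothetical event with true $p = 0.02$; the constant $4$ in $\mu_{\epsilon,\delta}$ matches the discussion below):

```python
import math
import random

def estimate_p_sequential(trial, eps, delta, rng):
    """Sample until mu = (4/eps^2) ln(2/delta) hits are seen; if that took
    n samples in total, estimate p as mu / n."""
    mu = math.ceil(4 * math.log(2 / delta) / eps**2)
    hits = n = 0
    while hits < mu:
        n += 1
        hits += trial(rng)
    return mu / n

# Hypothetical event with true probability p = 0.02.
p_true = 0.02
trial = lambda rng: rng.random() < p_true

estimate = estimate_p_sequential(trial, eps=0.1, delta=0.01, rng=random.Random(0))
```

No prior guess of $p$ is needed; the cost is that the number of samples is itself roughly $\mu_{\epsilon,\delta}/p$, so rarer events take proportionally longer.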

Discussion

• For error probability $\delta$, need $kp=\mu_{\epsilon,\delta}=(4/\epsilon^2)\ln(2/\delta)$
• So, can get tiny probability of error cheaply, but sensitive to $\epsilon$.
• Note that for many applications we want “with high probability”: $\delta=1/n^2$, so $kp \approx \log n$.
• Note also that a good estimate needs $k \gg 1/p$.
• i.e., the rarer the event we want to detect, the larger the sample needed.
• Sampling without replacement gives lower probability of large deviation

### Transitive closure

Problem outline

• databases want the size of the transitive closure (e.g., for query-cost estimates)
• exact computation takes matrix-multiplication time
• alternatively, compute the reachability set of each vertex and add up the sizes

Sampling algorithm

• generate uniform random vertex samples until $\mu_{\epsilon,\delta}$ of them are reachable from $v$
• deduce the size of $v$'s reachability set
• each reachability test: $O(m)$ (one graph search)
• expected number of samples: $\mu_{\epsilon,\delta}\cdot n/\text{size}$
• up to $\tilde O(mn)$ work per vertex---ouch!
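A sketch of the per-vertex estimator, assuming the graph is given as an adjacency dict (graph and parameters below are hypothetical):

```python
import random
from collections import deque

def reachable(adj, v, t):
    """O(m) reachability test: BFS from v, checking for t."""
    seen, queue = {v}, deque([v])
    while queue:
        u = queue.popleft()
        if u == t:
            return True
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return False

def estimate_reach_size(adj, v, mu, rng):
    """Draw uniform random vertices until mu of them are reachable from v;
    if that took n_samples draws, estimate |reach(v)| as n * mu / n_samples."""
    n = len(adj)
    hits = samples = 0
    while hits < mu:
        samples += 1
        hits += reachable(adj, v, rng.randrange(n))
    return n * mu / samples

# Hypothetical graph: directed path 0 -> 1 -> ... -> 99, so vertex 50
# reaches exactly the 50 vertices 50..99.
n = 100
adj = {i: ([i + 1] if i + 1 < n else []) for i in range(n)}
size_est = estimate_reach_size(adj, 50, mu=400, rng=random.Random(0))
```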

Can test for all vertices simultaneously

• increase the mean to $\mu_{\epsilon,\delta/n}=O(\epsilon^{-2}\log(n/\delta))$,
• so $\delta/n$ failure probability per vertex
• so $\delta$ overall (union bound)
• $O(mn)$ for all vertices (still ouch).

Avoid wasting work

• stop propagating a sample to a vertex once that vertex has seen enough hits
• if a vertex has seen $k$ samples, so have all its predecessors (they reach everything it reaches)
• so no need to forward observations past a finished vertex
• so at most $\mu_{\epsilon,\delta/n}=O(\epsilon^{-2}\log(n/\delta))$ samples cross any edge
• also, after $O(n\mu_{\epsilon,\delta/n})$ samples every vertex has $\mu_{\epsilon,\delta/n}$ hits w.h.p. (even a vertex that reaches only itself is hit with probability $1/n$). No more samples are needed.
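A sketch of the shared-sample scheme with early stopping, again on a hypothetical adjacency-dict graph: each round draws one target and walks the reversed graph, giving a hit to every unfinished vertex that can reach the target.

```python
import random
from collections import deque

def estimate_all_reach_sizes(adj, mu, rng):
    """Each round draws a target t and walks the reversed graph from t,
    giving a hit to every vertex that can reach t. A vertex finished in an
    earlier round is not expanded: its predecessors reach everything it
    reaches, so they finished no later. When vertex v collects its mu-th
    hit at round n_v, estimate |reach(v)| = n * mu / n_v."""
    n = len(adj)
    radj = {v: [] for v in adj}
    for u, nbrs in adj.items():
        for w in nbrs:          # edge u -> w in the original graph
            radj[w].append(u)
    hits = {v: 0 for v in adj}
    done = {}                   # vertex -> size estimate
    rounds = 0
    while len(done) < n:
        rounds += 1
        t = rng.randrange(n)
        seen, queue = {t}, deque([t])
        while queue:
            u = queue.popleft()
            if u in done:       # finished in an earlier round: stop here
                continue
            hits[u] += 1
            if hits[u] == mu:
                done[u] = n * mu / rounds
            for w in radj[u]:   # predecessors of u also reach t
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
    return done

# Hypothetical graph: directed path 0 -> 1 -> ... -> 99.
n = 100
adj = {i: ([i + 1] if i + 1 < n else []) for i in range(n)}
est = estimate_all_reach_sizes(adj, mu=400, rng=random.Random(0))
```

Each edge forwards at most $\mu$ hits, which is the source of the savings over rerunning the per-vertex scheme $n$ times.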