Streaming
Model:
- Huge amount of data
- Can't afford random access
- Instead, read in a stream (sequential access).
- E.g., tapes
- Even for data on disk, sequential much faster
- On disk, can do multiple passes
- Sometimes, even too much to store
- Stream generated online, must analyze as it goes past
- Stock market
- Network routers
- Sensor network data
- Telephone billing records
- Model posits stream length $n$ (unknown)
- which makes $O(\log n)$ bits a natural space benchmark (e.g., just counting the stream length takes that much).
Key idea: sampling
- What fraction of $n$ stream entries have property $P$?
- if know $P$ in advance, compute as you go
- what if want to ask about unknown $P$ later?
- Choose random sample of size $k$
- Let $x$ be number in sample with property $P$
- Estimate as $nx/k$
- Analysis
- Suppose $pn$ have property $P$
- Assume choosing with replacement
- Then each choice has $P$ with probability $p$
- So expected number having $P$ is $pk$
- We see $x$. If $x\in(1\pm\epsilon)kp$, then estimate is $\epsilon$-accurate.
- Chernoff bound on $\epsilon$ error: $2e^{-\epsilon^2 kp/3}$
- So, if $kp$ large, probability of error is small
- in particular, $kp=\mu_{\epsilon,\delta}$ gives $(\epsilon,\delta)$-approximation
- e.g. $O(\log n)$ space for constant error w.h.p. in $n$.
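A minimal worked example of this estimator (a Python sketch; the names are mine, and materializing the items into a list is for illustration only, since a real stream would be sampled in one pass as described next):

```python
import random

def estimate_with_property(items, has_property, k):
    """Estimate how many of the n items satisfy property P, from a
    size-k sample drawn with replacement (matching the analysis above)."""
    n = len(items)
    sample = [random.choice(items) for _ in range(k)]   # sample with replacement
    x = sum(1 for item in sample if has_property(item))
    return n * x / k                                    # the estimate n*x/k
```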
Wait: how do we sample?
- Reservoir method
- Keep pool of size $k$
- Incorporate first $k$ items
- When the $(k+1)$st item arrives, it should be in the sample with prob $k/(k+1)$
- Put it in with that probability
- And evict someone random
- Generalize:
- When $n^{th}$ item arrives, put it in with prob. $k/n$
- Evict someone random if so
- By induction, the reservoir is a uniform random $k$-subset of the items seen so far
- The latest item is included with the right marginal probability $k/n$
- And whether or not it is included, the rest of the reservoir is a uniform random subset of the prior items (one-pass sketch below)
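A one-pass Python sketch of the reservoir method (the function name is mine):

```python
import random

def reservoir_sample(stream, k):
    """Maintain a uniform random sample of k items over a stream in one pass."""
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)                    # keep the first k items
        elif random.random() < k / n:                 # include the n-th item with prob k/n
            reservoir[random.randrange(k)] = item     # ...evicting a uniformly random resident
    return reservoir
```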
Frequent Items
Simple example: frequent items
- Suppose want to find all items that occur more than $t$ times
- From above, the property of being a given frequent item holds for a fraction $p \ge t/n$ of the stream
- Intuitively, need enough samples to expect to see item: i.e. $n/t$
- To avoid errors, need more.
- Choose $k=(8n/(t\epsilon^2))\log n$
- Then $kp=(8/\epsilon^2)\log n$
- So probability of an $\epsilon$ deviation is at most $1/n^{2}$ for a particular item
- At most $n/t$ “frequent” items, so by a union bound the probability any of them deviates is at most $1/n$
- So can collect any item that shows up more than $(1-\epsilon)kp$ times in the sample
- This will also have some false positives.
- But the same Chernoff bound shows any item with few occurrences is unlikely to be sampled often enough
- In particular, with high probability no item with fewer than $(1-\epsilon)^2 t$ occurrences will be included (sketch below).
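Putting these pieces together, a small Python sketch (names are mine; the sample could come from `reservoir_sample` above, and the threshold is the $(1-\epsilon)kp$ rule with $p=t/n$):

```python
from collections import Counter

def heavy_from_sample(sample, n, t, eps):
    """Given a uniform sample of a length-n stream, report items that plausibly
    occur more than t times: those with more than (1-eps)*k*t/n sample occurrences."""
    k = len(sample)
    threshold = (1 - eps) * k * t / n
    return [item for item, c in Counter(sample).items() if c > threshold]
```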
Discussion:
- For this particular problem (most frequent) there's a better algorithm
- It just needs $n/t$ space for exact answer
- Nice thing about sample, though, is that you can decide what question to ask after you collect the sample
- Can we beat $n/t$? No: lower bound says that if most frequent item is rare, you need tons of space to find it.
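The better algorithm alluded to is presumably a deterministic counter-based scheme in the style of Misra-Gries; a hedged Python sketch (my naming), which keeps about $n/t$ counters and returns a superset of the items occurring more than $t$ times (a second pass would be needed to weed out false positives exactly):

```python
import math

def frequent_candidates(stream, n, t):
    """Counter-based pass (Misra-Gries style): returns a superset of the items
    occurring more than t times in a stream of length n, using ~n/t counters."""
    m = math.ceil(n / t)                      # number of counters kept at once
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < m:
            counters[item] = 1
        else:
            # table full: decrement every counter, dropping those that reach zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return set(counters)
```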
Sketches
Count-min sketches
- Already saw how to count frequent items
- What if can also subtract in stream (but stay positive)? (turnstile model)
- Sketch: hash items into $k$ buckets; add item $x$'s count to bucket $h(x)$
- If no collisions, get good count.
- If $S$ is total weight, in expectation only $S/k$ of it collides with any fixed item's bucket
- So, get an (over)estimate within additive $2S/k$ with probability at least $1/2$ (Markov)
- What if want better odds? Use more hash tables!
- Take $c$ independent hash tables (rows); answer with the minimum of the $c$ estimates
- Probability all of them err is $2^{-c}$
- General: need $O((1/\epsilon)\ln 1/\delta)$ counters (each taking $O(\log n)$ space) for $(\epsilon,\delta)$ approx.
- Can also subtract items from sketch in obvious way.
- Sketch will accurately estimate each item's count as long as its net count stays nonnegative.
- another advantage: sketches can be combined (just add them).
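A hedged Python sketch of a count-min sketch along these lines; the width and depth just instantiate the counts above, and Python's built-in `hash` (salted per row) is only a stand-in for a genuine pairwise-independent hash family:

```python
import math

class CountMinSketch:
    def __init__(self, eps, delta):
        self.width = math.ceil(2 / eps)               # buckets per row -> additive error ~ eps*S
        self.depth = math.ceil(math.log2(1 / delta))  # rows -> failure prob ~ 2^(-depth)
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _bucket(self, row, x):
        # stand-in hash, salted by the row index (not truly pairwise independent)
        return hash((row, x)) % self.width

    def update(self, x, count=1):
        """Add (or, with a negative count, subtract) occurrences of x."""
        for row in range(self.depth):
            self.table[row][self._bucket(row, x)] += count

    def estimate(self, x):
        """Overestimate of x's net count (accurate while net counts stay nonnegative)."""
        return min(self.table[row][self._bucket(row, x)] for row in range(self.depth))

    def merge(self, other):
        """Combine two sketches with identical parameters by adding counters."""
        for row in range(self.depth):
            for j in range(self.width):
                self.table[row][j] += other.table[row][j]
```

Merging only makes sense when both sketches use the same hash functions (here, within one Python process).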
More generalizations:
- Maintaining stats over a window (expiring items)
- Higher moments (e.g. sum of squares) to estimate db joins
Distinct Items
Natural question: how many distinct items in stream?
- To answer the naive way, need space proportional to the number of distinct items
- Sampling doesn't obviously apply: what property do we test?
- But a sampling insight still helps
Suppose items are distinct random reals in $[0,1]$
- Show them spread out
- What changes as more arrive? spacing
- Consider interval $[0,p]$
- If $n$ distinct items, expect $np$ in interval
- If $np =\mu_{\epsilon,\delta}$ get $(1\pm\epsilon)\mu$ with probability $1-\delta$.
- So how about choosing $p$ so that got $\mu$ items?
- Translation: keep smallest $\mu$ items, use to set $p$
- Can we conclude this $p$ is $\epsilon$-accurate?
- Not quite: if expected number for this $p$ is $\mu/(1+\epsilon)$, then its probability of deviating to $\mu$ is a bit more than $\delta$.
- So tweak: wait till see $\mu/(1-\epsilon) \ge (1+\epsilon)\mu$ items.
- This won't happen before $p$
- Also won't happen after $(1+\epsilon)p/(1-\epsilon)$
- So we'll get a $p$ in this interval.
- Multiply by $(1-\epsilon)$
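A Python sketch of this estimator under the random-reals assumption (naive bookkeeping, and ignoring the $(1\pm\epsilon)$ corrections above):

```python
def distinct_estimate(stream, mu):
    """Estimate the number of distinct items, assuming items are distinct
    uniform-random reals in [0, 1]: keep the mu smallest and return mu / p."""
    smallest = []                            # the mu smallest values seen so far
    for x in stream:
        if x not in smallest:                # don't count a repeated value twice
            smallest.append(x)
            smallest.sort()
            smallest = smallest[:mu]         # O(mu) per item; fine for small mu
    if len(smallest) < mu:
        return len(smallest)                 # fewer than mu distinct items: exact
    return mu / smallest[-1]                 # smallest[-1] is the mu-th smallest, i.e. p
```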
What if items not random? Randomize!
- suppose items are integers up to $m$
- simpler goal: for given $t$, with prob. $1-\delta$, say yes if above $(1+\epsilon)t$ and no if below $(1-\epsilon)t$
- run in parallel for powers of $(1+\epsilon)$ up to $n$.
- Space multiplies by $(\log n)/\epsilon$.
- For given $t$
- hash each item into a space of size $t$
- watch items hashing to $0$
- answer yes if anything does, no otherwise.
- Analysis
- if $k$ elements, prob. no hits is $(1-1/t)^k$
- for large enough $t$ this is approximately $e^{-k/t}$
- for small $\epsilon$, if $k>(1+\epsilon)t$ this is $< 1/e-\epsilon/3$
- while $k< (1-\epsilon)t$ implies $>1/e+\epsilon/3$
- replicate with independent hash functions to amplify; say “above” iff fewer than a $1/e$ fraction of the trials see no hit (sketch below).
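A hedged Python sketch of this threshold test (one pass; each trial uses its own salted hash into $\{0,\dots,t-1\}$, standing in for a proper random hash function):

```python
import math
import random

def above_threshold(stream, t, trials=100):
    """Guess whether the number of distinct items exceeds t: each trial records
    whether any item hashes to 0; few "no hit" trials suggests we're above t."""
    salts = [random.getrandbits(64) for _ in range(trials)]   # one "hash function" per trial
    hit = [False] * trials
    for x in stream:
        for i, salt in enumerate(salts):
            if not hit[i] and hash((salt, x)) % t == 0:
                hit[i] = True
    no_hit_fraction = hit.count(False) / trials
    return no_hit_fraction < 1 / math.e      # no-hit prob ~ e^{-k/t}, so compare to 1/e
```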
Min-wise independence
- alternative pseudorandom approach
- go back to idea of keeping $k$ smallest-hashing “random” values
- so for that, want a hash distribution where, for any subset, each item is equally likely to hash to the smallest value
- “min-wise independence”
- people have created (approximately) min-wise independent hash families
- so can use these to simulate true randomness.
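Combining the two ideas, a sketch of the hashed version of the earlier distinct-items estimator (Python's `hash`, normalized into $[0,1)$, is only a stand-in for an approximately min-wise independent family):

```python
def distinct_estimate_hashed(stream, mu, salt=0):
    """Same mu/p estimator as before, but item 'values' come from a hash,
    so repeated items collapse to the same value automatically."""
    M = 2**61 - 1                             # normalize hash values into [0, 1)
    smallest = set()
    for x in stream:
        v = (hash((salt, x)) % M) / M         # pseudorandom value assigned to item x
        smallest.add(v)
        if len(smallest) > mu:
            smallest.discard(max(smallest))   # keep only the mu smallest values
    if len(smallest) < mu:
        return len(smallest)                  # fewer than mu distinct items seen
    return mu / max(smallest)                 # the mu-th smallest hash value plays the role of p
```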