
## Streaming

Model:

• Huge amount of data
• Can't afford random access
• Instead, read in a stream (sequential access).
• E.g., tapes
• Even for data on disk, sequential much faster
• On disk, can do multiple passes
• Sometimes, even too much to store
• Stream generated online, must analyze as it goes past
• Stock market
• Network routers
• Sensor network data
• Telephone billing records
• Model posits stream length $n$ (not known in advance)
• which makes $O(\log n)$ a natural space bound (e.g., just counting the length of the stream takes $\log n$ bits).

Key idea: sampling

• What fraction of $n$ stream entries have property $P$?
• if we know $P$ in advance, we can just count matches as we go
• but what if we want to ask about an unknown $P$ later?
• Choose random sample of size $k$
• Let $x$ be number in sample with property $P$
• Estimate as $nx/k$
• Analysis
• Suppose $pn$ have property $P$
• Assume choosing with replacement
• Then each choice has $P$ with probability $p$
• So expected number having $P$ is $pk$
• We see $x$. If $x\in(1\pm\epsilon)kp$, then estimate is $\epsilon$-accurate.
• Chernoff bound on $\epsilon$ error: $2e^{-\epsilon^2 kp/3}$
• So, if $kp$ large, probability of error is small
• in particular, $kp=\mu_{\epsilon,\delta}$ gives $(\epsilon,\delta)$-approximation
• e.g. $O(\log n)$ space for constant error w.h.p. in $n$.
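
A minimal sketch of this estimator in Python; the helper names are mine, and the sample is assumed to have been drawn uniformly (e.g., by the reservoir method described next).

```python
import math

def estimate_with_property(sample, has_property, n):
    """Estimate how many of the n stream items satisfy a property,
    given a uniform random sample drawn from the stream."""
    x = sum(1 for item in sample if has_property(item))
    return n * x / len(sample)  # the n*x/k estimate

def sample_size(p, eps, delta):
    """Smallest k with k*p >= mu_{eps,delta} = 3*ln(2/delta)/eps^2,
    so the Chernoff error bound 2*exp(-eps^2*k*p/3) is at most delta."""
    return math.ceil(3 * math.log(2 / delta) / (eps ** 2 * p))
```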

Wait: how do we sample?

• Reservoir method
• Keep pool of size $k$
• Incorporate first $k$ items
• When the $(k+1)$st item arrives, it should be in the sample with prob. $k/(k+1)$
• Put it in with that probability
• And evict someone random
• Generalize:
• When $n^{th}$ item arrives, put it in with prob. $k/n$
• Evict someone random if so
• By induction, reservoir contains random sample of items seen so far.
• Then we include last item with right probability
• Whether or not included, remainder of reservoir is random subset of prior items
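
A minimal sketch of the reservoir method (this matches the classic Algorithm R); `stream` can be any Python iterable:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of size k from a stream of
    unknown length, using O(k) space."""
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)  # incorporate the first k items
        elif random.random() < k / n:
            # nth item enters with probability k/n, evicting a
            # uniformly random current occupant
            reservoir[random.randrange(k)] = item
    return reservoir
```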

### Frequent Items

Simple example: frequent items

• Suppose want to find all items that occur more than $t$ times
• From above, the property of being one particular item that occurs $t$ times holds for a fraction $p=t/n$ of the stream
• Intuitively, need enough samples to expect to see item: i.e. $n/t$
• To avoid errors, need more.
• Choose $k=(8n/(t\epsilon^2))\log n$
• Then $kp=(8/\epsilon^2)\log n$
• So odds of $\epsilon$ deviation are $(1/n)^{2}$ for particular item
• At most $n/t$ “frequent” items, so odds any deviate are $1/n$
• So can collect any item that shows up more than $(1-\epsilon)kp$ times
• This will also have some false positives.
• But same Chernoff bound shows any item with few representatives unlikely to be sampled enough
• In particular, with high probability no item with less than $(1-\epsilon)^2 t$ occurrences will be included.
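
Putting this together, a hedged sketch of the sampling-based frequent-items test; the function name and parameters are mine, and $n$ is the stream length, known once the stream has been read:

```python
from collections import Counter

def frequent_items(sample, n, t, eps):
    """Report items appearing more than (1-eps)*k*p times in the
    sample, where p = t/n; w.h.p. this catches every item with >= t
    occurrences and nothing with fewer than (1-eps)^2 * t."""
    k = len(sample)
    threshold = (1 - eps) * k * t / n
    return [item for item, c in Counter(sample).items() if c > threshold]
```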

Discussion:

• For this particular problem (finding the most frequent items) there's a better algorithm, sketched after this list
• It needs just $n/t$ space for an exact answer
• Nice thing about sample, though, is that you can decide what question to ask after you collect the sample
• Can we beat $n/t$? No: lower bound says that if most frequent item is rare, you need tons of space to find it.
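
The notes don't name the better algorithm, but the standard one matching the $n/t$ bound is Misra–Gries; a minimal sketch (one pass yields a superset of the frequent items with undercounted tallies, and a second pass can make the answer exact):

```python
def misra_gries(stream, num_counters):
    """One-pass frequent-items summary: any item occurring more than
    (stream length)/num_counters times is guaranteed to end up as a
    key. Counts are underestimates (off by at most the number of
    decrement rounds), so survivors are candidates, not certainties."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < num_counters:
            counters[item] = 1
        else:
            # no room: decrement every counter, dropping zeros
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```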

### Sketches

Count-min sketches

• Already saw how to count frequent items
• What if can also subtract in stream (but stay positive)? (turnstile model)
• Sketch: hash items to $k$ buckets, accumulating item $x$'s updates in counter $h(x)$
• If no collisions, get good count.
• If $S$ is the total weight, in expectation only $S/k$ of it collides with any fixed item's counter
• So, get (over) estimate to within additive $2S/k$ with probability $1/2$
• What if want better odds? Use more hash tables!
• Take $c$ independent hash tables, and report the minimum of the $c$ counts
• Probability of error drops to $2^{-c}$
• General: need $O((1/\epsilon)\ln 1/\delta)$ counters (each taking $O(\log n)$ space) for $(\epsilon,\delta)$ approx.
• Can also subtract items from sketch in obvious way.
• Sketch will accurately estimate each item's count as long as its net count stays positive.
• another advantage: sketches can be combined (just add them).
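
A minimal sketch of the count-min sketch, using Python's built-in `hash` salted with per-row seeds as a stand-in for independent hash functions:

```python
import math
import random

class CountMinSketch:
    """Width-O(1/eps) x depth-O(ln(1/delta)) array of counters:
    estimates overcount by at most eps * (total weight) with
    probability 1 - delta."""

    def __init__(self, eps, delta):
        self.width = math.ceil(math.e / eps)
        self.depth = math.ceil(math.log(1 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]
        self.seeds = [random.randrange(2 ** 31) for _ in range(self.depth)]

    def _bucket(self, row, item):
        return hash((self.seeds[row], item)) % self.width

    def update(self, item, weight=1):
        """Add (or, with negative weight, subtract) turnstile-style."""
        for row in range(self.depth):
            self.table[row][self._bucket(row, item)] += weight

    def estimate(self, item):
        """Collisions only inflate counters, so the minimum row wins."""
        return min(self.table[row][self._bucket(row, item)]
                   for row in range(self.depth))

    def merge(self, other):
        """Combine sketches over two streams by adding the tables
        (assumes `other` shares this sketch's seeds and dimensions)."""
        for row in range(self.depth):
            for col in range(self.width):
                self.table[row][col] += other.table[row][col]
```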

More generalizations:

• Maintaining stats over a window (expiring items)
• Higher moments (e.g. sum of squares) to estimate db joins

### Distinct Items

Natural question: how many distinct items in stream?

• To answer the naive way, need space equal to the number of distinct items
• Sampling doesn't obviously apply: what property do we test?
• But a sampling insight still helps

Suppose items are distinct random reals in $[0,1]$

• Picture them spread out across the interval
• What changes as more arrive? The spacing shrinks
• Consider interval $[0,p]$
• If $n$ distinct items, expect $np$ in interval
• If $np =\mu_{\epsilon,\delta}$ get $(1\pm\epsilon)\mu$ with probability $1-\delta$.
• So how about choosing $p$ so that we got exactly $\mu$ items in $[0,p]$?
• Translation: keep smallest $\mu$ items, use to set $p$
• Can we conclude this $p$ is $\epsilon$-accurate?
• Not quite: if expected number for this $p$ is $\mu/(1+\epsilon)$, then its probability of deviating to $\mu$ is a bit more than $\delta$.
• So tweak: keep the smallest $\mu/(1-\epsilon) \ge (1+\epsilon)\mu$ items instead, and let $p'$ be the largest value kept
• W.h.p. that many items don't fall below $p$ itself
• But w.h.p. at least that many do fall below $(1+\epsilon)p/(1-\epsilon)$
• So we'll get $p' \in [p, (1+\epsilon)p/(1-\epsilon)]$
• Estimate $(\mu/(1-\epsilon))/p'$ and multiply by $(1-\epsilon)$: the result $\mu/p'$ lies in $[(1-2\epsilon)n, n]$, since $(1-\epsilon)/(1+\epsilon) \ge 1-2\epsilon$
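
A minimal sketch of this estimator, under the section's idealized assumption that items are distinct uniform reals in $[0,1]$ (the next step removes that assumption by hashing):

```python
import heapq
import math

def distinct_estimate(stream, mu, eps):
    """Keep the mu/(1-eps) smallest values seen; the largest kept
    value p' then satisfies p <= p' <= (1+eps)p/(1-eps) w.h.p., and
    (1-eps) * (kept count) / p' estimates the number of items."""
    keep = math.ceil(mu / (1 - eps))
    heap = []  # max-heap via negation, holding the `keep` smallest values
    for x in stream:
        if len(heap) < keep:
            heapq.heappush(heap, -x)
        elif x < -heap[0]:
            heapq.heapreplace(heap, -x)
    p = -heap[0]  # largest of the kept values
    return (1 - eps) * len(heap) / p
```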

What if items not random? Randomize!

• suppose items are numbers up to $m$
• simpler goal: for a given $t$, with prob. $1-\delta$, say yes if the count is above $(1+\epsilon)t$ and no if it's below $(1-\epsilon)t$
• run in parallel for powers of $(1+\epsilon)$ up to $n$.
• Space multiplies by $(\log n)/\epsilon$.
• For given $t$
• hash to $t$ space
• watch items hashing to $0$
• answer yes if anything does, no otherwise.
• Analysis
• if $k$ elements, prob. no hits is $(1-1/t)^k$
• for large enough $t$ this is approximately $e^{-k/t}$
• for small $\epsilon$, if $k>(1+\epsilon)t$ this is $< 1/e-\epsilon/3$
• while $k< (1-\epsilon)t$ implies $>1/e+\epsilon/3$
• replicate to amplify: answer yes iff fewer than a $1/e$ fraction of the independent trials see no hit.
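
A hedged sketch of one such threshold test, again with seeded built-in `hash` standing in for a proper hash family; `num_trials` controls the amplification:

```python
import math
import random

def above_threshold(stream, t, num_trials=100):
    """Guess whether the number of distinct items exceeds t: in each
    trial, hash items into t buckets and record whether anything lands
    in bucket 0. With k distinct items, a trial sees no hit with
    probability ~exp(-k/t), so comparing the no-hit fraction to 1/e
    separates k > (1+eps)t from k < (1-eps)t."""
    seeds = [random.randrange(2 ** 31) for _ in range(num_trials)]
    hit = [False] * num_trials
    for item in stream:
        for i, seed in enumerate(seeds):
            if hash((seed, item)) % t == 0:
                hit[i] = True
    no_hit_fraction = 1 - sum(hit) / num_trials
    return no_hit_fraction < 1 / math.e  # yes: probably above t
```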

Min-wise independence

• alternative pseudorandom approach
• go back to the idea of keeping the $k$ items with smallest "random" hash values
• for that, want a hash family where, for any subset of items, each is equally likely to hash to the smallest value
• “min-wise independence”
• people have created (approximately) min-wise independent hash families
• so can use these to simulate true randomness.
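
A sketch of the resulting estimator: hash each item to a pseudo-uniform value and keep the $k$ smallest distinct hash values, so duplicates don't inflate the pool. The salted built-in `hash` is only a stand-in for an (approximately) min-wise independent family, and $(k-1)/v_k$, with $v_k$ the $k$th smallest hash value, is the standard estimate in this setup.

```python
import heapq

def kmv_distinct(stream, k, seed=0):
    """k-minimum-values estimate of the number of distinct items."""
    M = 2 ** 31
    heap = []     # max-heap via negation of the k smallest hash values
    kept = set()  # the same values, for duplicate detection
    for item in stream:
        v = (hash((seed, item)) % M) / M  # stand-in for a min-wise hash
        if v in kept:
            continue  # a repeat of an item already accounted for
        if len(heap) < k:
            heapq.heappush(heap, -v)
            kept.add(v)
        elif v < -heap[0]:
            kept.discard(-heap[0])
            heapq.heapreplace(heap, -v)
            kept.add(v)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct items: exact answer
    return round((k - 1) / (-heap[0]))  # kth smallest value ~ k/n
```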