Streaming
Model:
- Huge amount of data
- Can't afford random access
- Instead, read in a stream (sequential access).
- E.g., tapes
- Even for data on disk, sequential much faster
- On disk, can do multiple passes
- Sometimes, even too much to store
- Stream generated online, must analyze as it goes past
- Stock market
- Network routers
- Sensor network data
- Telephone billing records
- Model posits stream length $n$ (unknown)
- which makes $O(\log n)$ bits a natural space benchmark (e.g., just counting the stream length takes that much).
Key idea: sampling
- What fraction of $n$ stream entries have property $P$?
- if know $P$ in advance, compute as you go
- what if want to ask about unknown $P$ later?
- Choose random sample of size $k$
- Let $x$ be number in sample with property $P$
- Estimate as $nx/k$
- Analysis
- Suppose $pn$ have property $P$
- Assume choosing with replacement
- Then each choice has $P$ with probability $p$
- So expected number having $P$ is $pk$
- We see $x$. If $x\in(1\pm\epsilon)kp$, then estimate is $\epsilon$-accurate.
- Chernoff bound on $\epsilon$ error: $2e^{-\epsilon^2 kp/3}$
- So, if $kp$ large, probability of error is small
- in particular, $kp=\mu_{\epsilon,\delta}$ gives $(\epsilon,\delta)$-approximation
- e.g. $O(\log n)$ space for constant error w.h.p. in $n$.
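A minimal worked example of this estimator (a Python sketch; the names are mine, and materializing the items into a list is for illustration only, since a real stream would be sampled in one pass as described next):

```python
import random

def estimate_with_property(items, has_property, k):
    """Estimate how many of the n items satisfy property P, from a
    size-k sample drawn with replacement (matching the analysis above)."""
    n = len(items)
    sample = [random.choice(items) for _ in range(k)]   # sample with replacement
    x = sum(1 for item in sample if has_property(item))
    return n * x / k                                    # the estimate n*x/k
```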
Wait: how do we sample?
- Reservoir method
- Keep pool of size $k$
- Incorporate first $k$ items
- When the $(k+1)$st item arrives, it should be in the sample with prob $k/(k+1)$
- Put it in with that probability
- And evict someone random
- Generalize:
- When $n^{th}$ item arrives, put it in with prob. $k/n$
- Evict someone random if so
- By induction, the reservoir is a uniform random $k$-subset of the items seen so far
- The latest item is included with the right marginal probability $k/n$
- And whether or not it is included, the rest of the reservoir is a uniform random subset of the prior items (one-pass sketch below)
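A one-pass Python sketch of the reservoir method (the function name is mine):

```python
import random

def reservoir_sample(stream, k):
    """Maintain a uniform random sample of k items over a stream in one pass."""
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)                    # keep the first k items
        elif random.random() < k / n:                 # include the n-th item with prob k/n
            reservoir[random.randrange(k)] = item     # ...evicting a uniformly random resident
    return reservoir
```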
Frequent Items
Simple example: frequent items
- Suppose want to find all items that occur more than $t$ times
- From above, the property of being a given frequent item holds for a fraction $p \ge t/n$ of the stream
- Intuitively, need enough samples to expect to see item: i.e. $n/t$
- To avoid errors, need more.
- Choose $k=(8n/(t\epsilon^2))\log n$
- Then $kp=(8/\epsilon^2)\log n$
- So probability of an $\epsilon$ deviation is at most $1/n^{2}$ for a particular item
- At most $n/t$ “frequent” items, so by a union bound the probability any of them deviates is at most $1/n$
- So can collect any item that shows up more than $(1-\epsilon)kp$ times in the sample
- This will also have some false positives.
- But the same Chernoff bound shows any item with few occurrences is unlikely to be sampled often enough
- In particular, with high probability no item with fewer than $(1-\epsilon)^2 t$ occurrences will be included (sketch below).
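Putting these pieces together, a small Python sketch (names are mine; the sample could come from `reservoir_sample` above, and the threshold is the $(1-\epsilon)kp$ rule with $p=t/n$):

```python
from collections import Counter

def heavy_from_sample(sample, n, t, eps):
    """Given a uniform sample of a length-n stream, report items that plausibly
    occur more than t times: those with more than (1-eps)*k*t/n sample occurrences."""
    k = len(sample)
    threshold = (1 - eps) * k * t / n
    return [item for item, c in Counter(sample).items() if c > threshold]
```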
Discussion:
- For this particular problem (most frequent) there's a better algorithm
- It just needs $n/t$ space for exact answer
- Nice thing about sample, though, is that you can decide what question to ask after you collect the sample
- Can we beat $n/t$? No: lower bound says that if most frequent item is rare, you need tons of space to find it.
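The better algorithm alluded to is presumably a deterministic counter-based scheme in the style of Misra-Gries; a hedged Python sketch (my naming), which keeps about $n/t$ counters and returns a superset of the items occurring more than $t$ times (a second pass would be needed to weed out false positives exactly):

```python
import math

def frequent_candidates(stream, n, t):
    """Counter-based pass (Misra-Gries style): returns a superset of the items
    occurring more than t times in a stream of length n, using ~n/t counters."""
    m = math.ceil(n / t)                      # number of counters kept at once
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < m:
            counters[item] = 1
        else:
            # table full: decrement every counter, dropping those that reach zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return set(counters)
```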
Sketches
Count-min sketches
- Already saw how to count frequent items
- What if can also subtract in stream (but stay positive)? (turnstile model)
- Sketch: hash items into $k$ buckets; add item $x$'s count to bucket $h(x)$
- If no collisions, get good count.
- If $S$ is total weight, in expectation only $S/k$ of it collides with any fixed item's bucket
- So, get an (over)estimate within additive $2S/k$ with probability at least $1/2$ (Markov)
- What if want better odds? Use more hash tables!
- Take $c$ independent hash tables (rows); answer with the minimum of the $c$ estimates
- Probability all of them err is $2^{-c}$
- General: need $O((1/\epsilon)\ln 1/\delta)$ counters (each taking $O(\log n)$ space) for $(\epsilon,\delta)$ approx.
- Can also subtract items from sketch in obvious way.
- Sketch will accurately estimate each item's count as long as its net count stays nonnegative.
- another advantage: sketches can be combined (just add them).
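A hedged Python sketch of a count-min sketch along these lines; the width and depth just instantiate the counts above, and Python's built-in `hash` (salted per row) is only a stand-in for a genuine pairwise-independent hash family:

```python
import math

class CountMinSketch:
    def __init__(self, eps, delta):
        self.width = math.ceil(2 / eps)               # buckets per row -> additive error ~ eps*S
        self.depth = math.ceil(math.log2(1 / delta))  # rows -> failure prob ~ 2^(-depth)
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _bucket(self, row, x):
        # stand-in hash, salted by the row index (not truly pairwise independent)
        return hash((row, x)) % self.width

    def update(self, x, count=1):
        """Add (or, with a negative count, subtract) occurrences of x."""
        for row in range(self.depth):
            self.table[row][self._bucket(row, x)] += count

    def estimate(self, x):
        """Overestimate of x's net count (accurate while net counts stay nonnegative)."""
        return min(self.table[row][self._bucket(row, x)] for row in range(self.depth))

    def merge(self, other):
        """Combine two sketches with identical parameters by adding counters."""
        for row in range(self.depth):
            for j in range(self.width):
                self.table[row][j] += other.table[row][j]
```

Merging only makes sense when both sketches use the same hash functions (here, within one Python process).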
More generalizations:
- Maintaining stats over a window (expiring items)
- Higher moments (e.g. sum of squares) to estimate db joins
Distinct Items
Natural question: how many distinct items in stream?
- To answer the naive way, need space proportional to the number of distinct items
- Sampling doesn't obviously apply: what property do we test?
- But a sampling insight still helps
Suppose items are distinct random reals in $[0,1]$
- Show them spread out
- What changes as more arrive? spacing
- Consider interval $[0,p]$
- If $n$ distinct items, expect $np$ in interval
- If $np =\mu_{\epsilon,\delta}$ get $(1\pm\epsilon)\mu$ with probability $1-\delta$.
- So how about choosing $p$ so that got $\mu$ items?
- Translation: keep smallest $\mu$ items, use to set $p$
- Can we conclude this $p$ is $\epsilon$-accurate?
- Not quite: if expected number for this $p$ is $\mu/(1+\epsilon)$, then its probability of deviating to $\mu$ is a bit more than $\delta$.
- So tweak: wait till see $\mu/(1-\epsilon) \ge (1+\epsilon)\mu$ items.
- This won't happen before $p$
- Also won't happen after $(1+\epsilon)p/(1-\epsilon)$
- So we'll get a $p$ in this interval.
- Multiply by $(1-\epsilon)$
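A Python sketch of this estimator under the random-reals assumption (naive bookkeeping, and ignoring the $(1\pm\epsilon)$ corrections above):

```python
def distinct_estimate(stream, mu):
    """Estimate the number of distinct items, assuming items are distinct
    uniform-random reals in [0, 1]: keep the mu smallest and return mu / p."""
    smallest = []                            # the mu smallest values seen so far
    for x in stream:
        if x not in smallest:                # don't count a repeated value twice
            smallest.append(x)
            smallest.sort()
            smallest = smallest[:mu]         # O(mu) per item; fine for small mu
    if len(smallest) < mu:
        return len(smallest)                 # fewer than mu distinct items: exact
    return mu / smallest[-1]                 # smallest[-1] is the mu-th smallest, i.e. p
```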
What if items not random? Randomize!
- suppose items are integers up to $m$
- simpler goal: for given $t$, with prob. $1-\delta$, say yes if above $(1+\epsilon)t$ and no if below $(1-\epsilon)t$
- run in parallel for powers of $(1+\epsilon)$ up to $n$.
- Space multiplies by $(\log n)/\epsilon$.
- For given $t$
- hash each item into a space of size $t$
- watch items hashing to $0$
- answer yes if anything does, no otherwise.
- Analysis
- if $k$ elements, prob. no hits is $(1-1/t)^k$
- for large enough $t$ this is approximately $e^{-k/t}$
- for small $\epsilon$, if $k>(1+\epsilon)t$ this is $< 1/e-\epsilon/3$
- while $k< (1-\epsilon)t$ implies $>1/e+\epsilon/3$
- replicate with independent hash functions to amplify; say “above” iff fewer than a $1/e$ fraction of the trials see no hit (sketch below).
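A hedged Python sketch of this threshold test (one pass; each trial uses its own salted hash into $\{0,\dots,t-1\}$, standing in for a proper random hash function):

```python
import math
import random

def above_threshold(stream, t, trials=100):
    """Guess whether the number of distinct items exceeds t: each trial records
    whether any item hashes to 0; few "no hit" trials suggests we're above t."""
    salts = [random.getrandbits(64) for _ in range(trials)]   # one "hash function" per trial
    hit = [False] * trials
    for x in stream:
        for i, salt in enumerate(salts):
            if not hit[i] and hash((salt, x)) % t == 0:
                hit[i] = True
    no_hit_fraction = hit.count(False) / trials
    return no_hit_fraction < 1 / math.e      # no-hit prob ~ e^{-k/t}, so compare to 1/e
```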
Min-wise independence
- alternative pseudorandom approach
- go back to idea of keeping $k$ smallest-hashing “random” values
- so for that, want a hash distribution where, for any subset, each item is equally likely to hash to the smallest value
- “min-wise independence”
- people have created (approximately) min-wise independent hash families
- so can use these to simulate true randomness.
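Combining the two ideas, a sketch of the hashed version of the earlier distinct-items estimator (Python's `hash`, normalized into $[0,1)$, is only a stand-in for an approximately min-wise independent family):

```python
def distinct_estimate_hashed(stream, mu, salt=0):
    """Same mu/p estimator as before, but item 'values' come from a hash,
    so repeated items collapse to the same value automatically."""
    M = 2**61 - 1                             # normalize hash values into [0, 1)
    smallest = set()
    for x in stream:
        v = (hash((salt, x)) % M) / M         # pseudorandom value assigned to item x
        smallest.add(v)
        if len(smallest) > mu:
            smallest.discard(max(smallest))   # keep only the mu smallest values
    if len(smallest) < mu:
        return len(smallest)                  # fewer than mu distinct items seen
    return mu / max(smallest)                 # the mu-th smallest hash value plays the role of p
```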