\begin{center} Lecture 5: Streaming Algorithms \end{center}

Today: “Too much input, too little space -- how to deal with big data.”

## Streaming Computation Model

Goal: Model scenarios where we want to process large amounts of data while having only a very limited storage at our disposal. (Think: Modeling a network router that connects MIT to the rest of the world.)

• Data: A long vector (stream) $x_1, \ldots, x_n$ of $n$ elements from a universe $U$ of size $m$. (Assume $m\leq n$.)
• We can access the data only in a very restricted way -- by making a single pass through the stream. In particular, we see each element only once (and according to its order in the stream).
• We have limited memory, in which we must accumulate enough information about the stream to compute some useful statistics about it at the end.
• Our gold standard is to keep the memory only poly-logarithmic, i.e., $\log^c n$, for some constant $c$. (So, there is no hope of storing the whole stream, or even part of it!)
• Key question: Can we even do anything meaningful in such a restricted model? If so, what are the tasks we can tackle here?

## Distinct Elements Problem

Flajolet and Martin, “Probabilistic counting algorithms for data base applications”, JCSS 31(2), 1985.

Problem: Count the number of distinct elements in the stream.

Examples: How many unique words did Shakespeare use in his works? How many distinct IP addresses accessed MIT's websites?

• Observe: If we want to solve this problem exactly, we need $\Omega(m)\gg \log^c n$ space. (Why?)
• But: If we aim for answers that are only approximate and probabilistic, our prospects improve dramatically.
• (Randomization and approximation are prevalent -- and often simply necessary -- in streaming algorithms.)
Key question: How to use randomness to (approximately) estimate the number of distinct elements?

Simple algorithm:

• Choose a (fully) random hash function $h: U \rightarrow [0,1]$.
• Compute $Y:=\min_i h(x_i)$, i.e., the minimum-value hash of the elements of the stream.
• Need only one register, i.e., $O(\log m)=O(\log n)$ space to compute $Y$.
• Intuition: The larger the number of distinct elements, the smaller the minimum-value hash will be (in expectation) -- we have more “shots” at getting a really small value.
• Specifically: $E\left[Y\right]=\frac{1}{N+1}$, where $N$ is the number of distinct elements in the stream.
• Intuition: If you space out the $N$ items evenly in $[0,1]$, then $1/(N+1)$ is the exact value of the minimum.
• Proof: $E\left[Y\right]=\int_0^1 P[Y\geq z] dz = \int_0^1 (1-z)^N dz = \left(-\frac{(1-z)^{N+1}}{N+1}\right)\Big|_0^1=\frac{1}{N+1}$.
• Alternative proof: $E\left[Y\right]$ equals the probability that an additional, $(N+1)$-st element would have the minimum hash. By symmetry, this probability has to be $\frac{1}{N+1}$.
• So, $Y$ is an unbiased estimator of $\frac{1}{N+1}$, which suggests estimating $N$ by $\frac{1}{Y}-1$.
• But what is the probability that this estimate is actually close to the correct value?
• Being correct in expectation is not enough! Think: $Y=1$ with probability $\frac{1}{N+1}$, and $Y=0$ otherwise. The expectation would still be correct, but we do not learn much from seeing a sample of $Y$. (And we would sometimes need to divide by $0$!)
• Need some tail bound. How about Chebyshev's inequality? For any $t>0$, $Pr[|X-E\left[X\right]|\geq t] \leq \frac{Var\left[X\right]}{t^2},$ where $Var\left[X\right]:=E\left[(X-E\left[X\right])^2\right]=E\left[X^2\right]-E\left[X\right]^2$ is the variance of the random variable $X$.
• What is the variance of $Y$?
• $Var\left[Y\right]\leq \frac{1}{(N+1)^2}$.
• Proof: $$\begin{eqnarray*} Var\left[Y\right] & = & E\left[Y^2\right]-E\left[Y\right]^2\\ &=& \int_0^1 \left(1-\sqrt{z}\right)^N dz - \frac{1}{(N+1)^2}= \int_0^1 2 w \left(1-w\right)^N dw- \frac{1}{(N+1)^2}\\ &=& \frac{2}{(N+1)(N+2)}- \frac{1}{(N+1)^2} \leq \frac{1}{(N+1)^2}. \end{eqnarray*}$$
• Problem: The standard deviation $\sqrt{Var\left[Y\right]}$ is comparable to the expectation $E\left[Y\right]$.

$\Rightarrow$ Chebyshev's inequality cannot even tell us that $Pr[Y=0]< 1$.

• What to do now?

Key mantra of statistics: Aggregate multiple independent measurements to boost accuracy and confidence of your prediction.

• Use $k$ random hash functions $h_1, \ldots, h_k: U \rightarrow [0,1]$.
• Compute $Y_j:=\min_i h_j(x_i)$, for all $j=1, \ldots, k$.
• Let $\oY:=\frac{1}{k}\sum_{j=1}^k Y_j$. Use $\frac{1}{\oY}-1$ as an estimate of $N$.
• Note: $E\left[\oY\right] = E\left[Y\right]=\frac{1}{N+1}$.
• How about variance $Var\left[\oY\right]$?
• Recall: If $X_1$ and $X_2$ are independent then $Var\left[a\cdot X_1+b\cdot X_2\right]=a^2 Var\left[X_1\right]+b^2 Var\left[X_2\right]$ $\Rightarrow$ $Var\left[\oY\right]=\sum_{i=1}^k \frac{1}{k^2} Var\left[Y_i\right]=\frac{1}{k} Var\left[Y\right] \leq \frac{1}{k(N+1)^2}.$
• Chebyshev's inequality tells us that $P\left[\left|\oY - \frac{1}{N+1}\right|\geq \frac{\varepsilon}{N+1}\right] \leq \frac{Var[\oY]}{\left(\frac{\varepsilon}{N+1}\right)^2} \leq \frac{1}{k\varepsilon^2}.$ $\Rightarrow$ $\frac{1}{\oY}$ satisfies $\frac{N+1}{1+\varepsilon}\leq \frac{1}{\oY} \leq \frac{N+1}{1-\varepsilon}$ with probability at least $1-\frac{1}{k\varepsilon^2}$.

$\Rightarrow$ For small enough $\varepsilon$, this gives us $(1-O(\varepsilon))N\leq \frac{1}{\oY}-1 \leq (1+O(\varepsilon))N.$ $\Rightarrow$ Setting $k=O(\varepsilon^{-2})$ gives us an $\varepsilon$-accurate estimate with probability, say, $9/10$.
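The averaged estimator above can be sketched in Python as follows. The fully random hash functions $h_j$ are simulated here with memoized uniform draws -- a teaching device that ignores the space cost of storing the hash functions (addressed below) -- and all names are illustrative.

```python
import random

def estimate_distinct(stream, k=200, seed=0):
    """Estimate the number of distinct elements: average k independent
    minimum hash values, then invert.  E[Y_j] = 1/(N+1), so the average
    concentrates around 1/(N+1) and 1/average - 1 estimates N."""
    rng = random.Random(seed)
    hashes = [dict() for _ in range(k)]   # h_j, built lazily as elements appear
    mins = [1.0] * k                      # Y_j = min_i h_j(x_i)
    for x in stream:
        for j in range(k):
            h = hashes[j].setdefault(x, rng.random())
            if h < mins[j]:
                mins[j] = h
    y_bar = sum(mins) / k                 # the averaged minimum from the notes
    return 1.0 / y_bar - 1.0

# A stream with N = 500 distinct elements, each appearing 10 times.
est = estimate_distinct([i % 500 for i in range(5000)])
```

With $k=200$ hash functions, the Chebyshev bound above guarantees the estimate is within $10\%$ of $N$ with probability at least $1/2$; in practice the concentration is considerably better.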

• Space used: Need to store $k$ minimums and $k$ hash functions.
• Problem 1: We are working with real numbers -- this can be taken care of by appropriate discretization (working with random binary strings of length $O(\log n)$), making each “minimum” encodable using $O(\log n)$ bits.
• Problem 2: Random hash functions are expensive to store (and evaluate). Ideally, one would want to use universal/pairwise independent hash functions instead. Important (and often used) fact: If $X_1, \ldots, X_s$ are only pairwise independent (and not necessarily fully independent), we still have that $Var[X_1 + \ldots + X_s]=Var[X_1] + \ldots + Var[X_s]$.

So, if our analysis of the algorithm relied only on the first and second moments of the random variables $h_j(x_i)$, everything we did above would automatically go through without a change. This in turn would allow us to store each hash function using only $O(\log n)$ bits.

Unfortunately (contrary to what I said in class): This is NOT true for the algorithm presented above -- our calculation of the expectation $E[Y]$ of the minimum of the $h(x_i)$ depends crucially on the hash values $h(x_i)$ being fully independent. So, we cannot simply use pairwise independent hash functions here.

Fortunately: There is a different algorithm whose analysis indeed relies only on second moments and thus can use pairwise independent hash functions freely. One example of such an algorithm can be found at \url{https://people.csail.mit.edu/madry/gems/notes/lecture16.pdf} (although it has worse space complexity).

• How to get a high-probability bound, i.e., probability of success at least $1-1/n$? Standard technique: Just run $O(\log n)$ independent instances of this algorithm in parallel and take the median (not the average!) of the resulting estimates. This increases our space complexity by a factor of $O(\log n)$.
• Fun empirical result: One can estimate the number of unique words in the works of Shakespeare up to 10% accuracy using only 128 bits of space and a single pass over the data.
• Optimal algorithm (Kane-Nelson-Woodruff '10): $O(\varepsilon^{-2}+\log n)$ space usage for success probability of at least $2/3$. (Can do independent runs to amplify this probability, paying an extra $O(\log n)$ factor.)
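The median trick from the first bullet can be sketched as follows; `run_once` stands for any estimator (such as the one above) that is accurate with probability at least $9/10$, and the toy `noisy` estimator is purely illustrative:

```python
import statistics

def amplified_estimate(run_once, trials=15, base_seed=0):
    # Run independent copies and take the median (not the average!).
    # A majority of accurate runs forces the median into the accuracy
    # window, so by a Chernoff bound O(log n) trials drive the failure
    # probability below 1/n.
    return statistics.median(run_once(base_seed + t) for t in range(trials))

# Toy demo: an estimator that is wildly wrong on 2 of the 15 seeds used.
# The median ignores the outliers, while the average would be ruined.
def noisy(seed):
    return 10**6 if seed % 10 == 0 else 100
```

Here `amplified_estimate(noisy)` returns 100 even though two of the fifteen runs are off by four orders of magnitude.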

## Heavy Hitters Problem

Meta-problem: Identify elements that appear in the stream with significant frequency.

For an element $x\in U$, define $f_x$ to be the number of occurrences of that element in the stream.

One variant of the question: For a given $k\geq 1$, report a set $S$ of at most $k$ elements that contains all the elements $x$ with $f_x>\frac{n}{k+1}$.

• (Note: There are at most $k$ such elements. Also, the returned set $S$ may contain elements that are not that frequent too -- as long as all the very frequent ones are returned.)
• This problem can be solved in $O(k\log m)=O(k \log n)$ space -- see the homework.

A more ambitious variant of the question ($\ell_1$-point Query Problem): For a given $\varepsilon>0$, and each element $x\in U$, provide an additive estimate $\hf_x$ of the frequency of that element. Specifically, want $\hf_x \approx f_x \pm \varepsilon n,$ for all $x\in U$.

• If $f_x\leq \varepsilon n$, outputting $\hf_x=0$ satisfies the above guarantee.

$\Rightarrow$ There are at most $\varepsilon^{-1}$ “interesting” elements. (Hence hope for a non-trivial algorithm.)

• Setting $\varepsilon=\frac{1}{k+1}$ corresponds in spirit to the other variant we considered above. (Although this correspondence is not really precise.)

### Count-Min Sketch Algorithm

Cormode and Muthukrishnan, “An improved data stream summary: the count-min sketch and its applications”, JALG 55(1), 2005

#### Warm-up

• Let $h: U\rightarrow \{1, \ldots, b\}$, for some integer parameter $b>1$, be a random hash function.
• Initialize a $b$-dimensional vector $CMS$ with all zeros.
• Read the stream; for each element $x_i$ read, increment the entry $CMS[h(x_i)]$ by one.
• At the end, make the estimate of the frequency of an element $x$ be $\hf_x' := CMS[h(x)].$
• How to analyze the performance of this algorithm?
• Observe: For each $x$, we always have that $f_x\leq \hf_x'$, since each occurrence of the element $x$ in the stream increments $CMS[h(x)]$ by one.
• Problem: $\hf_x'$ might be (much) larger than $f_x$ due to collisions.
• We have that $$\begin{eqnarray*} \hf_x' &=& CMS[h(x)] = |\{i \ | \ h(x_i)=h(x)\}| = f_x + Y_x, \end{eqnarray*}$$ where $Y_x := |\{i \ | \ h(x_i)=h(x) \text{ and } x_i\neq x\}|$ is the number of other elements of the stream that collide with $x$.
• Note that \begin{equation*} \label{eq:collisions} E[Y_x] = \sum_{i: x_i\neq x} Pr[h(x)=h(x_i)] = \sum_{i: x_i\neq x} \frac{1}{b} \leq \frac{n}{b}. \end{equation*}
• By Markov's inequality, setting $b=\frac{2}{\varepsilon}$ gives us that for a fixed $x$, $Pr[Y_x > \varepsilon n] \leq \frac{E[Y_x]}{\varepsilon n}\leq \frac{n/b}{\varepsilon n} = \frac{1}{2}.$ $\Rightarrow$ the estimate $\hf_x'$ is sufficiently accurate with probability at least $\frac{1}{2}$.
• Problem: Would like to get better probability of success and, more importantly, get it for all $x\in U$ simultaneously.
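A Python sketch of this warm-up; the random hash into $b$ buckets is again simulated with memoized draws, and all names are illustrative:

```python
import random

def warmup_estimator(stream, b, seed=0):
    """One random hash h: U -> {0, ..., b-1} and a vector of b counters.
    The estimate CMS[h(x)] counts every occurrence of x plus all
    colliding occurrences, so it never underestimates f_x."""
    rng = random.Random(seed)
    h = {}                       # memoized random hash, built lazily
    cms = [0] * b
    for x in stream:
        cms[h.setdefault(x, rng.randrange(b))] += 1
    return lambda x: cms[h.setdefault(x, rng.randrange(b))]

# Element 7 appears 51 times in a stream of length 70.
est = warmup_estimator([7] * 50 + list(range(20)), b=16)
```

The one-sided error is visible directly: `est(7)` is at least 51, and at most 51 plus whatever collides with 7 (never more than the stream length, 70).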

#### Final Algorithm

• How to amplify the probability of success?
• Standard answer: Run multiple (independent) copies of the above algorithm and aggregate the results.
• Key question: How to aggregate different estimates?
• Observe: The estimates produced by the above algorithm have only one-sided error, i.e., they can only overestimate the true frequency.

$\Rightarrow$ Taking the minimum of all the estimates always results in the tightest answer.

• Resulting algorithm (Count-Min Sketch):
• Choose $l$ independent random hash functions $h_1, \ldots, h_l: U\rightarrow \{1, \ldots, b\}$.
• Initialize $l$ different $b$-dimensional vectors $CMS_1,\ldots, CMS_l$.
• Read the stream; for each element $x_i$ read, increment the entries $CMS_j[h_j(x_i)]$ by one, for each $j=1, \ldots, l$.
• At the end, return for each element $x\in U$ $\hf_x := \min_{j=1, \ldots, l} CMS_j[h_j(x)].$
• Observe: We still have that $f_x \leq \hf_x$ for each element $x$.
• Also, by the analysis we did before, we have that $$\begin{eqnarray*} Pr[\hf_x > f_x + \varepsilon n ] &=& Pr[Y_x^j > \varepsilon n \text{ for each } j=1, \ldots, l]\\ &\leq & \frac{1}{2^l}, \end{eqnarray*}$$ where $Y_x^j := |\{i \ | \ h_j(x_i)=h_j(x) \text{ and } x_i\neq x\}|$.
• Setting $l=\log \frac{m}{\delta}=O(\log \frac{n}{\delta})$ ensures that, by union bounding over all $m$ different elements, we have that, with probability at least $1-\delta$, $\hf_x \approx f_x \pm \varepsilon n,$ for all elements $x\in U$.
• Space used: Maintaining the vectors $CMS_1, \ldots, CMS_l$ takes $O(b\cdot l \cdot \log n)=O(\varepsilon^{-1} \log n \log \frac{n}{\delta})$ space. Also, note that if we make the hash functions $h_1, \ldots, h_l$ only pairwise independent, our analysis still goes through. (All we needed there is that two distinct elements collide with probability at most $\frac{1}{b}$.)

$\Rightarrow$ Space needed to store the hash functions is $O(l \log m)=O(\log n \log \frac{n}{\delta})$.

$\Rightarrow$ Total space complexity is $O(\varepsilon^{-1} \log n \log \frac{n}{\delta})$, which becomes $O(\varepsilon^{-1} \log^2 n)$ if we aim for probability of the success to be at least $1-\frac{1}{n}$.
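Putting it together, here is a sketch of the full algorithm in Python. To keep it self-contained, a pairwise independent hash family of the standard form $h(x) = ((a x + c) \bmod p) \bmod b$ is used, which assumes integer stream elements; the class and parameter names (`width` for $b$, `depth` for $l$) are illustrative:

```python
import random

class CountMinSketch:
    _P = (1 << 61) - 1   # a Mersenne prime; fine for universes below 2^61

    def __init__(self, width, depth, seed=0):
        # width = b counters per row, depth = l independent rows.
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]
        self.coeffs = [(rng.randrange(1, self._P), rng.randrange(self._P))
                       for _ in range(depth)]

    def _bucket(self, j, x):
        a, c = self.coeffs[j]
        return ((a * x + c) % self._P) % self.width

    def add(self, x):
        # A single pass over the stream calls add() once per element.
        for j in range(self.depth):
            self.rows[j][self._bucket(j, x)] += 1

    def estimate(self, x):
        # One-sided error: the minimum over rows never drops below f_x.
        return min(self.rows[j][self._bucket(j, x)]
                   for j in range(self.depth))

cms = CountMinSketch(width=100, depth=8)
for x in [5] * 300 + list(range(1700)):   # n = 2000, f_5 = 301
    cms.add(x)
```

Here `cms.estimate(5)` is guaranteed to be at least 301, and with $b = 100$ buckets the expected collision mass in each row is only about $n/b = 20$, so the minimum over the 8 rows is very unlikely to overshoot by much.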