Today: “Too much input, too little space -- how to deal with big data.”
Streaming Computation Model
Goal: Model scenarios where we want to process large amounts of data while having only a very limited storage at our disposal. (Think: Modeling a network router that connects MIT to the rest of the world.)
- Data: A long vector (stream) $x_1, \ldots, x_n$ of $n$ elements from a universe $U$ of size $m$. (Assume $m\leq n$.)
- We can access the data only in a very restricted way -- by making a single pass through the stream. In particular, we see each element only once (and according to its order in the stream).
- We have limited memory to accumulate enough information about the stream to compute some useful statistics of it at the end.
- Our gold standard is to keep the memory to only poly-logarithmic size, i.e., $\log^c n$, for some constant $c$. (So, no hope to store the whole stream, or even part of it!)
- Key question: Can we even do anything meaningful in such a restricted model? If so, then what are the tasks we can tackle here?
Distinct Elements Problem
Flajolet and Martin, “Probabilistic counting algorithms for data base applications”, JCSS 31(2), 1985.
Problem: Count the number of distinct elements in the stream.
Examples: How many unique words did Shakespeare use in his works? How many distinct IP addresses accessed MIT's websites?
- Observe: If we want to solve this problem exactly, we need $\Omega(m)\gg \log^c n$ space. (Why?)
- But: If we aim for answers that are only approximate and probabilistic, our prospects improve dramatically.
- (Randomization and approximation are prevalent -- and often simply necessary -- in streaming algorithms.)
Simple algorithm:
- Choose a (fully) random hash function $h: U \rightarrow [0,1]$.
- Compute $Y:=\min_i h(x_i)$, i.e., the minimum-value hash of the elements of the stream.
- Need only one register, i.e., $O(\log m)=O(\log n)$ space to compute $Y$.
- Intuition: The larger the number of distinct elements, the smaller the minimum-value hash will be (in expectation) -- we have more “shots” at getting a really small value.
- Specifically: $E\left[Y\right]=\frac{1}{N+1}$, where $N$ is the number of distinct elements in the stream.
- Intuition: If you space out the $N$ items evenly in $[0,1]$, then $1/(N+1)$ is the exact value of the minimum.
- Proof: $E\left[Y\right]=\int_0^1 P[Y\geq z] dz = \int_0^1 (1-z)^N dz = \left(-\frac{(1-z)^{N+1}}{N+1}\right)\Big|_0^1=\frac{1}{N+1}$.
- Alternative proof: $E\left[Y\right]$ is equal to the probability that the additional, $N+1$-th element has the minimum hash. By symmetry, this probability has to be $\frac{1}{N+1}$.
- So, estimating $N$ by $\frac{1}{Y}-1$ is correct in expectation.
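The simple algorithm above can be sketched in Python. A truly random hash $h: U \rightarrow [0,1]$ cannot actually be stored in small space; the salted SHA-256 mapping below is an illustrative stand-in for it (the `salt` parameter and function names are assumptions made for this sketch, not part of the algorithm as stated):

```python
import hashlib

def uniform_hash(x, salt=""):
    # Stand-in for a fully random hash h: U -> [0, 1]:
    # hash (salt, x) with SHA-256 and rescale the digest to [0, 1).
    digest = hashlib.sha256((salt + repr(x)).encode()).hexdigest()
    return int(digest, 16) / 16 ** 64

def distinct_estimate(stream, salt=""):
    # Y = min_i h(x_i); estimate the number of distinct elements as 1/Y - 1.
    y = min(uniform_hash(x, salt) for x in stream)
    return 1.0 / y - 1.0
```

Note that only the running minimum is kept while scanning the stream, matching the single-register space bound.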
- But what is the probability that this estimate is actually close to the correct value?
- Being correct in expectation is not enough! Think: $Y=1$ with probability $\frac{1}{N+1}$, and $Y=0$ otherwise. The expectation would still be correct, but we do not learn much from seeing a sample of $Y$. (And we would need to divide by $0$ at some point!)
- Need some tail bound. How about Chebyshev's inequality? For any $t>0$, \[ Pr[|X-E\left[X\right]|\geq t] \leq \frac{Var\left[X\right]}{t^2}, \] where $Var\left[X\right]:=E\left[(X-E\left[X\right])^2\right]=E\left[X^2\right]-E\left[X\right]^2$ is the variance of the random variable $X$.
- What is the variance of $Y$?
- $Var\left[Y\right]\leq \frac{1}{(N+1)^2}$.
- Proof: $$ \begin{eqnarray*} Var\left[Y\right] & = & E\left[Y^2\right]-E\left[Y\right]^2\\ &=& \int_0^1 \left(1-\sqrt{z}\right)^N dz - \frac{1}{(N+1)^2}= \int_0^1 2 w \left(1-w\right)^N dw- \frac{1}{(N+1)^2}\\ &=& \frac{2}{(N+1)(N+2)}- \frac{1}{(N+1)^2} \leq \frac{1}{(N+1)^2}. \end{eqnarray*} $$
- Problem: The standard deviation $\sqrt{Var\left[Y\right]}$ is comparable to the expectation $E\left[Y\right]$.
$\Rightarrow$ Chebyshev's inequality cannot even tell us that $Pr[Y=0]< 1$.
- What to do now?
Key mantra of statistics: Aggregate multiple independent measurements to boost accuracy and confidence of your prediction.
- Use $k$ random hash functions $h_1, \ldots, h_k: U \rightarrow [0,1]$.
- Compute $Y_j:=\min_i h_j(x_i)$, for all $j=1, \ldots, k$.
- Let $\oY:=\frac{1}{k}\sum_{j=1}^k Y_j$. Use $\frac{1}{\oY}-1$ as an estimate of $N$.
- Note: $E\left[\oY\right] = E\left[Y\right]=\frac{1}{N+1}$.
- How about variance $Var\left[\oY\right]$?
- Recall: If $X_1$ and $X_2$ are independent then \[ Var\left[a\cdot X_1+b\cdot X_2\right]=a^2 Var\left[X_1\right]+b^2 Var\left[X_2\right] \] $\Rightarrow$ $Var\left[\oY\right]=\sum_{j=1}^k \frac{1}{k^2} Var\left[Y_j\right]=\frac{1}{k} Var\left[Y\right] \leq \frac{1}{k(N+1)^2}.$
- Chebyshev's inequality tells us that
\[
P\left[\left|\oY - \frac{1}{N+1}\right|\geq \frac{\varepsilon}{N+1}\right] \leq \frac{Var[\oY]}{\left(\frac{\varepsilon}{N+1}\right)^2} \leq \frac{1}{k\varepsilon^2}.
\]
$\Rightarrow$ $\frac{1}{\oY}$ satisfies $\frac{N+1}{1+\varepsilon}\leq \frac{1}{\oY} \leq \frac{N+1}{1-\varepsilon}$ with probability at least $1-\frac{1}{k\varepsilon^2}$.
$\Rightarrow$ For small enough $\varepsilon$, this gives us \[ (1-O(\varepsilon))N\leq \frac{1}{\oY}-1 \leq (1+O(\varepsilon))N. \] $\Rightarrow$ Setting $k=O(\varepsilon^{-2})$ gives us an $\varepsilon$-accurate estimate with probability, say, $9/10$.
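The averaged estimator can be sketched as follows, again simulating the $k$ independent random hash functions with salted SHA-256 (the salting scheme is an assumption of this sketch):

```python
import hashlib

def uniform_hash(x, salt):
    # Stand-in for a fully random hash h_j: U -> [0, 1].
    digest = hashlib.sha256((salt + ":" + repr(x)).encode()).hexdigest()
    return int(digest, 16) / 16 ** 64

def avg_distinct_estimate(stream, k):
    # Ybar = (1/k) sum_j min_i h_j(x_i); return 1/Ybar - 1.
    stream = list(stream)  # for brevity; a real one-pass implementation
                           # keeps k running minima while scanning instead
    ybar = sum(min(uniform_hash(x, str(j)) for x in stream)
               for j in range(k)) / k
    return 1.0 / ybar - 1.0
```

By the Chebyshev bound above, $k=O(\varepsilon^{-2})$ repetitions suffice for an $\varepsilon$-accurate answer with constant probability.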
- Space used: Need to store $k$ minimums and $k$ hash functions.
- Problem 1: We are working with real numbers -- this can be taken care of by appropriate discretization (working with random binary strings of length $O(\log n)$), making each “minimum” encodable using $O(\log n)$ bits.
- Problem 2: Fully random hash functions are expensive to store (and evaluate). Ideally, one would want to use universal/pairwise independent hash functions instead.
Important (and often used) fact: If $X_1, \ldots, X_s$ are only pairwise independent (and not necessarily fully independent) we still have that $Var[X_1 + \ldots + X_s]=Var[X_1] + \ldots + Var[X_s]$.
So, if our analysis of the algorithm relied only on the first and second order moments of the random variables $h_j(x_i)$, everything we did above would carry over without change. This, in turn, would allow us to store each hash function using only $O(\log n)$ bits.
Unfortunately, this is NOT true for the algorithm we presented above---our calculation of the expectation $E[Y]$ of the minimum of the $h(x_i)$ depends crucially on the hash values $h(x_i)$ being fully independent. So, we can't simply use pairwise independent hash functions here.
Fortunately, there is a different algorithm that relies only on first and second order moments and thus can use pairwise independent hash functions freely. This algorithm keeps track of the $t$-th smallest hash $Y^t$ (instead of the smallest one used above), for $t=\Theta(\varepsilon^{-2})$, and then employs $\frac{t}{Y^t}$ as the (biased) estimator of $N$.
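The pairwise-independence fact above can be checked exactly on a toy family: with $h_{a,b}(i)=(a\cdot i+b) \bmod p$ and $(a,b)$ uniform, the values $X_i=h_{a,b}(i)$ are pairwise (but not fully) independent, yet the variance of their sum still splits into the sum of variances. (The choice $p=5$ and indices $1,2,3$ are purely illustrative.)

```python
# Enumerate all (a, b) to compute the distribution of S = X_1 + X_2 + X_3
# exactly, where X_i = (a*i + b) mod p is uniform on {0, ..., p-1}.
p = 5
sums = [sum((a * i + b) % p for i in (1, 2, 3))
        for a in range(p) for b in range(p)]
mean = sum(sums) / len(sums)  # E[S] = 3 * E[X_i] = 3 * 2 = 6
var = sum((s - mean) ** 2 for s in sums) / len(sums)
# Var[S] = Var[X_1] + Var[X_2] + Var[X_3] = 3 * 2 = 6, even though the
# three variables are driven by only two random parameters (so they
# cannot be fully independent).
```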
- How to get a high probability bound, i.e., probability of success at least $1-1/n$? Standard technique: Just run $O(\log n)$ independent instances of this algorithm in parallel and take the median (not the average!) of the resulting estimates. This increases our space complexity by a factor of $O(\log n)$.
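The median trick can be layered on top of any estimator that succeeds with constant probability; a sketch (salted SHA-256 again stands in for independent random hash functions, an assumption of this sketch):

```python
import hashlib
import statistics

def uniform_hash(x, salt):
    # Stand-in for a fully random hash h: U -> [0, 1].
    digest = hashlib.sha256((salt + ":" + repr(x)).encode()).hexdigest()
    return int(digest, 16) / 16 ** 64

def one_estimate(stream, k, run):
    # One averaged-minimum estimate, accurate with constant probability.
    ybar = sum(min(uniform_hash(x, f"{run}:{j}") for x in stream)
               for j in range(k)) / k
    return 1.0 / ybar - 1.0

def median_estimate(stream, k, runs):
    # The median is off only if at least half the runs are off, which
    # happens with probability exponentially small in `runs`.
    stream = list(stream)
    return statistics.median(one_estimate(stream, k, r) for r in range(runs))
```

The median is the right aggregator here because a single wildly wrong run can drag an average arbitrarily far, but cannot move the median past the other runs' estimates.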
- Fun empirical result: Estimate the number of unique words in the works of Shakespeare up to 10% accuracy using only 128 bits of space and a single pass over the data.
- Optimal algorithm (Kane-Nelson-Woodruff '10): $O(\varepsilon^{-2}+\log n)$ space usage for success probability of at least $2/3$. (Can do independent runs to amplify this probability, paying an extra $O(\log n)$ factor.)
Heavy Hitters Problem
Meta-problem: Identify elements that appear in the stream with significant frequency.
For an element $x\in U$, define $f_x$ to be the number of occurrences of that element in the stream.
One variant of the question: For a given $k\geq 1$, report a set $S$ of at most $k$ elements that contains all the elements $x$ such that $f_x>\frac{n}{k+1}$.
- (Note: There are at most $k$ such elements. Also, the returned set $S$ might contain some not-so-frequent elements too -- as long as all the very frequent ones are returned.)
- This problem can be solved in $O(k\log m)=O(k \log n)$ space -- see the homework.
A more ambitious variant of the question ($\ell_1$-point Query Problem): For a given $\varepsilon>0$, and each element $x\in U$, provide an additive estimate $\hf_x$ of the frequency of that element. Specifically, want \[ \hf_x \approx f_x \pm \varepsilon n, \] for all $x\in U$.
- If $f_x\leq \varepsilon n$, outputting $\hf_x=0$ satisfies the above guarantee.
$\Rightarrow$ There are at most $\varepsilon^{-1}$ “interesting” elements. (Hence hope for a non-trivial algorithm.)
- Setting $\varepsilon=\frac{1}{k+1}$ corresponds in spirit to the other variant we considered above. (Although this correspondence is not really precise.)
Count-Min Sketch Algorithm
Cormode and Muthukrishnan, “An improved data stream summary: the count-min sketch and its applications”, JALG 55(1), 2005
Warm-up
- Let $h: U\rightarrow \{1, \ldots, b\}$, for some integer parameter $b>1$, be a random hash function.
- Initialize a $b$-dimensional vector $CMS$ with all zeros.
- Read the stream; for each element $x_i$ read, increment the entry $CMS[h(x_i)]$ by one.
- At the end, make the estimate of the frequency of an element $x$ be \[ \hf_x := CMS[h(x)]. \]
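The warm-up can be sketched directly; the bucket hash below is an illustrative stand-in for the random $h: U\rightarrow \{1, \ldots, b\}$ (0-indexed for convenience):

```python
import hashlib

def bucket(x, b, salt=""):
    # Stand-in for a random hash h: U -> {0, ..., b-1}.
    return int(hashlib.sha256((salt + repr(x)).encode()).hexdigest(), 16) % b

def warmup_sketch(stream, b):
    # One pass: b counters, bump the bucket of each element read.
    cms = [0] * b
    for x in stream:
        cms[bucket(x, b)] += 1
    return cms

def warmup_estimate(cms, x):
    # Never underestimates: every occurrence of x lands in x's bucket.
    return cms[bucket(x, len(cms))]
```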
- How to analyze the performance of this algorithm?
- Observe: For each $x$, we always have that $f_x\leq \hf_x$, since each occurrence of the element $x$ in the stream increments $CMS[h(x)]$ by one.
- Problem: $\hf_x$ might be (much) larger than $f_x$ due to collisions.
- We have that $$ \begin{eqnarray*} \hf_x &=& CMS[h(x)] = |\{i \ | \ h(x_i)=h(x)\}| = f_x + Y_x, \end{eqnarray*} $$ where $Y_x := |\{i \ | \ h(x_i)=h(x) \text{ and } x_i\neq x\}|$ is the number of other elements of the stream that collide with $x$.
- Note that \[ E[Y_x] = \sum_{i:\, x_i\neq x} Pr[h(x)=h(x_i)] = \sum_{i:\, x_i\neq x} \frac{1}{b} \leq \frac{n}{b}. \]
- By Markov's inequality, setting $b=\frac{2}{\varepsilon}$ gives us that for a fixed $x$, \[ Pr[Y_x > \varepsilon n] \leq \frac{E[Y_x]}{\varepsilon n}\leq \frac{n/b}{\varepsilon n} = \frac{1}{2}. \] $\Rightarrow$ the estimate $\hf_x$ is sufficiently accurate with probability at least $\frac{1}{2}$.
- Problem: Would like to get better probability of success and, more importantly, get it for all $x\in U$ simultaneously.
Final Algorithm
- How to amplify the probability of success?
- Standard answer: Run multiple (independent) copies of the above algorithm and aggregate the results.
- Key question: How to aggregate different estimates?
- Observe: The estimates produced by the above algorithm have only one-sided error, i.e., they can only overestimate the true frequency.
$\Rightarrow$ Taking the minimum of all the estimates always results in the tightest answer.
- Resulting algorithm (Count-Min Sketch):
- Choose $l$ independent random hash functions $h_1, \ldots, h_l: U\rightarrow \{1, \ldots, b\}$.
- Initialize $l$ different $b$-dimensional vectors $CMS_1,\ldots, CMS_l$ with all zeros.
- Read the stream; for each element $x_i$ read, increment the entry $CMS_j[h_j(x_i)]$, for each $j=1, \ldots, l$.
- At the end, return for each element $x\in U$ \[ \hf_x := \min_{j=1, \ldots, l} CMS_j[h_j(x)]. \]
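The full sketch can be written out as follows, with salted SHA-256 standing in for the $l$ independent hash functions; the class and method names are illustrative choices, not part of the original description:

```python
import hashlib

class CountMinSketch:
    def __init__(self, b, l):
        # l rows of b counters each: O(b * l) words of space.
        self.b, self.l = b, l
        self.rows = [[0] * b for _ in range(l)]

    def _h(self, j, x):
        # Stand-in for hash function h_j: U -> {0, ..., b-1}.
        digest = hashlib.sha256(f"{j}:{x!r}".encode()).hexdigest()
        return int(digest, 16) % self.b

    def update(self, x):
        # One pass over the stream: bump x's bucket in every row.
        for j in range(self.l):
            self.rows[j][self._h(j, x)] += 1

    def query(self, x):
        # Each row only overestimates f_x, so the minimum is the
        # tightest estimate.
        return min(self.rows[j][self._h(j, x)] for j in range(self.l))
```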
- Observe: We still have that $f_x \leq \hf_x$ for each element $x$.
- Also, by the analysis we did before, we have that $$ \begin{eqnarray*} Pr[\hf_x > f_x + \varepsilon n ] &=&Pr[Y_x^j > \varepsilon n \text{ for all $j=1, \ldots, l$}]\\ &\leq & \frac{1}{2^l}, \end{eqnarray*} $$ where $Y_x^j := |\{i \ | \ h_j(x_i)=h_j(x) \text{ and } x_i\neq x\}|$.
- Setting $l=\log \frac{m}{\delta}=O(\log \frac{n}{\delta})$ ensures that, by union bounding over all $m$ different elements, we have that, with probability at least $1-\delta$, \[ \hf_x \approx f_x \pm \varepsilon n, \] for all elements $x\in U$.
- Space used: Maintaining the vectors $CMS_1 \ldots CMS_l$ takes $O(b\cdot l \cdot \log n)=O(\varepsilon^{-1} \log n \log \frac{n}{\delta})$ space. Also, note that our analysis still goes through if we make the hash functions $h_1, \ldots, h_l$ only pairwise independent. (All we needed there is that the probability that two distinct elements collide is $\frac{1}{b}$.)
$\Rightarrow$ Space needed to store the hash functions is $O(l \log m)=O(\log n \log \frac{n}{\delta})$.
$\Rightarrow$ Total space complexity is $O(\varepsilon^{-1} \log n \log \frac{n}{\delta})$, which becomes $O(\varepsilon^{-1} \log^2 n)$ if we aim for probability of the success to be at least $1-\frac{1}{n}$.