\documentclass{article} \usepackage{me} \usepackage{url} \setlength{\parindent}{0pt} \def\lefteqn#1{\rlap{\displaystyle{#1}}} $$\newcommand{\oY}{\overline{Y}}$$ $$\newcommand{\hf}{\widehat{f}}$$
\begin{center} Lecture 5: Streaming Algorithms \end{center}

Today: “Too much input, too little space -- how to deal with big data.”

## Streaming Computation Model

Goal: Model scenarios where we want to process large amounts of data while having only a very limited storage at our disposal. (Think: Modeling a network router that connects MIT to the rest of the world.)

• Data: A long vector (stream) $x_1, \ldots, x_n$ of $n$ elements from a universe $U$ of size $m$. (Assume $m\leq n$.)
• We can access the data only in a very restricted way -- by making a single pass through the stream. In particular, we see each element only once (and according to its order in the stream).
• We have limited memory to accumulate enough of information about the stream to compute some useful statistics of it at the end.
• Our golden standard is to keep the memory to by only poly-logarithmic, i.e., $\log^c n$, for some constant $c$. (So, no hope to store the whole, or even part of the, stream!)
• Key question: Can we even do anything meaningful in such a restricted model? If so, then what are the task we can tackle here?

## Distinct Elements Problem

Flajolet and Martin, “Probabilistic counting algorithms for data base applications”, JCSS 31(2), 1985.

Problem: Count the number of distinct elements in the stream.

Examples: How many unique words did Shakespeare use in his works? How many distinct IP addresses accessed MIT's websites?

• Observe: If we want to solve this problem exactly, we need $\Omega(m)\gg log^c n$ space. (Why?)
• But: If we aim for answers that are only approximate and probabilistic, our prospects improve dramatically.
• (Randomization and approximation is prevalent -- and often simply necessary -- in streaming algorithms.)
Key question: How to use randomness to (approximately) estimate the number of distinct elements?

Simple algorithm:

• Choose a (fully) random hash function $h: U \rightarrow [0,1]$.
• Compute $Y:=\min_i h(x_i)$, i.e., the minimum-value hash of the elements of the stream.
• Need only one register, i.e., $O(\log m)=O(\log n)$ space to compute $Y$.
• Intuition: The larger number of distinct elements the smaller the minimum-value hash will be (in expectation) -- we have more “shots” at getting a really small value.
• Specifically: $E\left[Y\right]=\frac{1}{N+1}$, where $N$ is the number of distinct elements in the stream.
• Intuition: If you space out the $N$ items evenly in $[0,1]$, then $1/(N+1)$ is the exact value of the minimum.
• Proof: $E\left[Y\right]=\int_0^1 P[Y\geq z] dz = \int_0^1 (1-z)^N dz = \left(-\frac{(1-z)^{N+1}}{N+1}\right)\Big|_0^1=\frac{1}{N+1}$.
• Alternative proof: $E\left[Y\right]$ is equal to the probability that the additional, $N+1$-th element has the minimum hash. By symmetry, this probability has to be $\frac{1}{N+1}$.
• So, estimating $N$ by $\frac{1}{Y}-1$ is correct in expectation.
• But what is the probability that this estimate is actually (close) to the correct value?
• Being correct in expectation is not enough! Think: $Y=1$ with probability $\frac{1}{N+1}$, and $Y=0$, otherwise. Expectation would still be correct, but we do not learn much from seeing a sample from $Y$. (And would need to divide by $0$ sometime!)
• Need some tail bound. How about Chebyshev's inequality? For any $t>0$, $Pr[|X-E\left[X\right]|\geq t] \leq \frac{Var\left[X\right]}{t^2},$ where $Var\left[X\right]:=E\left[(X-E\left[X\right])^2\right]=E\left[X^2\right]-E\left[X\right]^2$ is the variance of the random variable $X$.
• What is the variance of $Y$?
• $Var\left[Y\right]\leq \frac{1}{(N+1)^2}$.
• Proof: $$\begin{eqnarray*} Var\left[Y\right] & = & E\left[Y^2\right]-E\left[Y\right]^2\\ &=& \int_0^1 \left(1-\sqrt{z}\right)^N dz - \frac{1}{(N+1)^2}= \int_0^1 2 w \left(1-w\right)^N dw- \frac{1}{(N+1)^2}\\ &=& \frac{2}{(N+1)(N+2)}- \frac{1}{(N+1)^2} \leq \frac{1}{(N+1)^2}. \end{eqnarray*}$$
• Problem: The standard deviation $\sqrt{Var\left[Y\right]}$ comparable to the expectation $E\left[Y\right]$.

$\Rightarrow$ Chebyshev's inequality cannot even tell us that $Pr[Y=0]< 1$.

• What to do now?

Key mantra of statistics: Aggregate multiple independent measurements to boost accuracy and confidence of your prediction.

• Use $k$ random hash functions $h_1, \ldots, h_k: U \rightarrow [0,1]$.
• Compute $Y_j:=\min_i h_j(x_i)$, for all $j=1, \ldots, k$.
• Let $\oY:=\frac{1}{k}\sum_{j=1}^k Y_k$. Use $\frac{1}{\oY}-1$ as an estimate of $N$.
• Note: $E\left[\oY\right] = E\left[Y\right]=\frac{1}{N+1}$.
• How about variance $Var\left[\oY\right]$?
• Recall: If $X_1$ and $X_2$ are independent then $Var\left[a\cdot X_1+b\cdot X_2\right]=a^2 Var\left[X_1\right]+b^2 Var\left[X_2\right]$ $\Rightarrow$ $Var\left[\oY\right]=\sum_{i=1}^k \frac{1}{k^2} Var\left[Y_i\right]=\frac{1}{k} Var\left[Y\right] \leq \frac{1}{k(N+1)^2}.$
• Chebyshev's inequality tells us that $P\left[\left|\oY - \frac{1}{N+1}\right|\geq \frac{\varepsilon}{N+1}\right] \leq \frac{Var[\oY]}{\left(\frac{\varepsilon}{N+1}\right)^2} \leq \frac{1}{k\varepsilon^2}.$ $\Rightarrow$ $\frac{1}{\oY}$ satisfies $\frac{N+1}{1+\varepsilon}\leq \frac{1}{\oY} \leq \frac{N+1}{1-\varepsilon}$ with probability at least $1-\frac{1}{k\varepsilon^2}$.

$\Rightarrow$ For small enough $\varepsilon$, this gives us $(1-O(\varepsilon))N\leq \frac{1}{\oY}-1 \leq (1+O(\varepsilon))N.$ $\Rightarrow$ Setting $k=O(\varepsilon^{-2})$ gives us an $\varepsilon$-accurate estimate with probability, say, $9/10$.

• Space used: Need to store $k$ minimums and $k$ hash functions.
• Problem 1: We are working with real numbers -- can be taken care of by appropriate discretization (working with $O(\log n)$ length random binary strings) and make each “minimum” be encoded using $O(\log n)$ bits.
• Problem 2: Random hash functions are expensive to store (and evaluate). Ideally, one would want to use universal hashing/two-wise independent hashing functions instead. Important (and often used) fact: If $X_1, \ldots, X_s$ only pairwise independent (and not necessarily fully independent) we still have that $Var[X_1 + \ldots + X_s]=Var[X_1] + \ldots + Var[X_s]$.

So, if our analysis of the algorithm relied only on the first and second order moments of our random variables $h_j(x_i)$, everything what we did above will automatically follow through without a change. This in turn would allow us to store each function using only $O(\log n)$ bits.

Unfortunately (contrary to what I said in class): This is NOT true for the algorithm we presented above -- our calculation of the expectation $E[Y]$ of the minimum of $h(x_i)$ depends crucially on the hash function $h(x_i)$ be fully independent. So, we can't simply use pair-wise independent hash function here.

Fortunately: There is a different algorithm that indeed fully depends on the second order moments and thus can use pairwise hash functions freely. One example of such algorithm can be found here \url{https://people.csail.mit.edu/madry/gems/notes/lecture16.pdf} (although it gets worse space complexity).

• How to get a high probability bound, i.e., probability of success be at least $1-1/n$? Standard technique: Just run $O(\log n)$ independent instances of this algorithm in parallel and take the median (not the average!) of the resulting estimates. Makes our space complexity increase by a factor of $O(\log n)$.
• Fun empirical result: Estimate the number of unique words in the works of Shakespear up to 10% accuracy using only 128 bits of space and a single pass over data.
• Optimal algorithm (Kane-Nelson-Woodruff '10): $O(\varepsilon^{-2}+\log n)$ space usage for success probability of at least $2/3$. (Can do independent runs to amplify this probability, paying an extra $O(\log n)$ factor.)

## Heavy Hitters Problem

Meta-problem: Identify elements that appear in the stream with significant frequency.

For an element $x\in U$, define $f_x$ to be the number of occurrences of that element in the stream.

One variant of the question: For a given $k\geq 1$, report a set $S$ of at most $k$ elements such that contains all the elements $x$ such that $f_x>\frac{n}{k+1}$.

• (Note: There is at most $k$ of such elements. Also, the returned set $S$ might contain elements that are not so frequent too -- as long as all the very frequent ones are returned.)
• This problem can be solved in $O(k\log m)=O(k \log n)$ space -- see the homework.

A more ambitious variant of the question ($\ell_1$-point Query Problem): For a given $\varepsilon>0$, and each element $x\in U$, provide an additive estimate $\hf_x$ of the frequency of that element. Specifically, want $\hf_x \approx f_x \pm \varepsilon n,$ for all $x\in U$.

• If $f_x\leq \varepsilon n$, outputting $\hf_x=0$ satisfies the above guarantee.

$\Rightarrow$ There is at most $\varepsilon^{-1}$ “interesting” elements. (Hence hope for a non-trivial algorithm.)

• Setting $\varepsilon=\frac{1}{k+1}$ corresponds in spirit to the other variant we considered above. (Although this correspondence is not really precise.)

### Count-Min Sketch Algorithm

Cormode and Muthukrishnan, “An improved data stream summary: the count-min sketch and its applications”, JALG 55(1), 2005

#### Warm-up

• Let $h: U\rightarrow \{1, \ldots, b\}$, for some integer parameter $b>1$, be a random hash function.
• Initialize an $b$-dimensional vector $CMS$ with all zeros.
• Read the stream, for each element $x_i$ read, increment the entry $CMS[h(x_i)]$ by one.
• At the end, make the estimate of the frequency of an element $x$ be $\hf_x' := CMS[h(x)].$
• How to analyze the performance of this algorithm?
• Observe: For each $x$, we always have that $f_x\leq \hf_x'$, since each occurrence of the element $x$ in the stream increments $CMS[h(x)]$ by one.
• Problem: $\hf_x$ might be (much) larger than $f_x$ due to collisions.
• We have that $$\begin{eqnarray*} \hf_x' &=& CMS[h(x)] = |\{i \ | \ h(x_i)=h(x)\}| = f_x + Y_x, \end{eqnarray*}$$ where $Y_x := |\{i \ | \ h(x_i)=h(x) \text{ and } x_i\neq x\}|$ is the number of other elements of the stream that collide with $x$.
• Note that \begin{equation*} \label{eq:collisions} E[Y_x] = \sum_{i: x_i\neq x} Pr[h(x)=h(x_i)] = \sum_{i: x_i\neq x} \frac{1}{b} \leq \frac{n}{b}. \end{equation*}
• By Markov's inequality, setting $b=\frac{2}{\varepsilon}$ gives us that for a fixed $x$, $Pr[Y_x > \varepsilon n] \leq \frac{E[Y_x]}{\varepsilon n}\leq \frac{n/b}{\varepsilon n} = \frac{1}{2}.$ $\Rightarrow$ the estimate $\hf_x'$ is sufficiently accurate with probability at least $\frac{1}{2}$.
• Problem: Would like to get better probability of success and, more importantly, get it for all $x\in U$ simultaneously.

#### Final Algorithm

• How to amplify the probability of success?
• Standard answer: Run multiple (independent) copies of the above algorithm and aggregate the results.
• Key question: How to aggregate different estimates?
• Observe: The estimates produced by the above algorithm have only one-sided error, i.e., they can only overestimate the true frequency.

$\Rightarrow$ Taking the minimum of all the estimates always results in the tightest answer.

• Resulting algorithm (Count-Min Sketch):
• Choose $l$ independent random hash functions $h_1, \ldots, h_l: U\rightarrow \{1, \ldots, b\}$.
• Initialize $l$ different $b$-dimensional vectors $CMS_1,\ldots, CMS_l$.
• Read the stream, for each element $x_i$ read, increment the entry $CMS_j[h_j(x_i)]$, for each $j=1, \ldots, l$.
• At the end, return for each element $x\in U$ $\hf_x := \min_{j=1, \ldots, l} CMS_j[h_j(x)].$
• Observe: We still have that $f_x \leq \hf_x$ for each element $x$.
• Also, by analysis we did before, we have that $$\begin{eqnarray*} Pr[\hf_x > f_x + \varepsilon n ] &=&Pr[Y_x^j > \varepsilon n \text{ for each j=1, \ldots, l}]\\ &\leq & \frac{1}{2^l}. \end{eqnarray*}$$ where $Y_x^j := |\{i \ | \ h_j(x_i)=h_j(x) \text{ and } x_i\neq x\}|$.
• Setting $l=\log \frac{m}{\delta}=O(\log \frac{n}{\delta})$ ensures that, by union bounding over all $m$ different elements, we have that, with probability at least $1-\delta$, $\hf_x \approx f_x \pm \varepsilon n,$ for all elements $x\in U$.
• Space used: Maintaining the vectors $CMS_1 \ldots CMS_l$ takes $O(b\cdot l \cdot \log n)=O(\varepsilon^{-1} \log n \log \frac{n}{\delta})$ space. Also, note that if we make the hash functions $h_1, \ldots, h_l$ be only pair-wise independent then still our analysis will go through. (All we needed there is that the probability of two distinct elements collide is $\frac{1}{b}$.)

$\Rightarrow$ Space needed to store the hash functions is $O(l \log m)=O(\log n \log \frac{n}{\delta})$.

$\Rightarrow$ Total space complexity is $O(\varepsilon^{-1} \log n \log \frac{n}{\delta})$, which becomes $O(\varepsilon^{-1} \log^2 n)$ if we aim for probability of the success to be at least $1-\frac{1}{n}$.