\documentclass{article} \usepackage{me} \usepackage{url} \setlength{\parindent}{0pt} \def\lefteqn#1{\rlap{\displaystyle{#1}}} $\newcommand{\oY}{\overline{Y}}$ $\newcommand{\hf}{\widehat{f}}$
\begin{center} Lecture 5: Streaming Algorithms \end{center}

Today: “Too much input, too little space -- how to deal with big data.”

Streaming Computation Model

Goal: Model scenarios where we want to process large amounts of data while having only a very limited storage at our disposal. (Think: Modeling a network router that connects MIT to the rest of the world.)

Distinct Elements Problem

Flajolet and Martin, “Probabilistic counting algorithms for data base applications”, JCSS 31(2), 1985.

Problem: Count the number of distinct elements in the stream.

Examples: How many unique words did Shakespeare use in his works? How many distinct IP addresses accessed MIT's websites?

Key question: How to use randomness to (approximately) estimate the number of distinct elements?

Simple algorithm:

Key mantra of statistics: Aggregate multiple independent measurements to boost accuracy and confidence of your prediction.

Heavy Hitters Problem

Meta-problem: Identify elements that appear in the stream with significant frequency.

For an element $x\in U$, define $f_x$ to be the number of occurrences of that element in the stream.

One variant of the question: For a given $k\geq 1$, report a set $S$ of at most $k$ elements such that contains all the elements $x$ such that $f_x>\frac{n}{k+1}$.

A more ambitious variant of the question ($\ell_1$-point Query Problem): For a given $\varepsilon>0$, and each element $x\in U$, provide an additive estimate $\hf_x$ of the frequency of that element. Specifically, want \[ \hf_x \approx f_x \pm \varepsilon n, \] for all $x\in U$.

Count-Min Sketch Algorithm

Cormode and Muthukrishnan, “An improved data stream summary: the count-min sketch and its applications”, JALG 55(1), 2005

Warm-up

Final Algorithm