\begin{center} Lecture 6: Dimensionality Reduction \end{center}

Today: How to represent high-dimensional data in low dimensions.

• Much of the data we work with today comes in the form of high-dimensional vectors. That is, each data point is a vector $x_i\in \mathbb{R}^d$, for a very large $d$.

Usually, each coordinate corresponds to one of the “features” of the data. For instance, the bag-of-words representation of documents, which treats every document as a vector of word counts, can have tens of thousands of coordinates. The dimension tends to be even larger when the data consists of images or videos.

• Motivating question: Can we map these high-dimensional vectors to a much lower dimensional space while still (approximately) preserving some characteristics of the data that we care about?
• We will now study this question in the case when we are interested in preserving the $\ell_2$-norm (Euclidean) distance between the data points.

Johnson-Lindenstrauss Lemma

Lemma: For any set of $n$ $d$-dimensional vectors $x^1, \ldots, x^n\in \mathbb{R}^d$, and any $\frac{1}{2}\geq \varepsilon>0$, there exists a $k$-by-$d$ matrix $A$, with $k=O(\varepsilon^{-2}\log n)$, such that $(1-\varepsilon) \|x^i-x^j\|_2^2 \leq \|Ax^i - Ax^j\|_2^2 \leq (1+\varepsilon) \|x^i-x^j\|_2^2,$ for all $i$ and $j$.
• Intuitively, the above statement tells us that there is a mapping of $d$-dimensional vectors to $k$-dimensional vectors, with $k=O(\varepsilon^{-2}\log n)$, that keeps the $\ell_2$-norm distances (approximately) the same.
• Note that this mapping is linear in the vectors we are mapping. This is a very useful property in many of the applications of the Johnson-Lindenstrauss lemma.
• Observe that $k$ does not depend on $d$.
• Additionally, even though, in principle, in the statement above the matrix $A$ could depend on the vectors $x^1, \ldots, x^n$, we will construct it in a way that is completely oblivious to what these vectors are.
• However, since one can show that there is no single $A$ that works for all collections of vectors, we will make the construction of $A$ probabilistic and prove that, for any fixed set of vectors, it works with high probability.
• Specifically, we will make each entry of $A$ be sampled independently from a Gaussian distribution $\cN{0,\frac{1}{k}}$, where the density of a Gaussian/normal random variable $Z\sim\cN{\mu, \sigma^2}$ is given by $f(x) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),$ with $\mu$ being the mean and $\sigma^2$ being the variance.
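To make the construction concrete, here is a small numpy sketch of sampling such a matrix $A$ with i.i.d. $\cN{0,\frac{1}{k}}$ entries and applying it to a vector. The dimensions and the random seed below are illustrative choices, not prescribed by the lemma:

```python
import numpy as np

rng = np.random.default_rng(0)

d, k = 1000, 200  # original and target dimensions (illustrative values)

# Each entry of A is drawn i.i.d. from N(0, 1/k).
A = rng.normal(loc=0.0, scale=np.sqrt(1.0 / k), size=(k, d))

x = rng.normal(size=d)
y = A @ x  # the k-dimensional image of x

# In expectation, ||Ax||_2^2 equals ||x||_2^2 (proved below in the notes),
# so the ratio of squared norms should be close to 1.
ratio = np.linalg.norm(y) ** 2 / np.linalg.norm(x) ** 2
print(ratio)
```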

Proof

• Key observation: It suffices to establish a certain norm preservation property of $A$. Specifically, it suffices to prove that, for any fixed vector $x\in \mathbb{R}^d$, we have that $\Prob{(1-\varepsilon)\|x\|_2^2 \leq \|Ax\|_2^2 \leq (1+\varepsilon)\|x\|_2^2} \geq 1 - \frac{1}{n^3}.$ Then, by linearity of $A$, considering vectors $x=x^i-x^j$ for all ${n \choose 2}$ pairs of vectors and union bounding over them gives us the statement of the Johnson-Lindenstrauss lemma with probability at least $1-\frac{1}{n}$. (And by increasing the underlying parameters we can boost this probability further.)
• To this end, let us fix $x\in \mathbb{R}^d$, and consider the distribution of the $i$-th coordinate $Y_i$ of the mapping $Ax$ of $x$ $Y_i:=(Ax)_i=\sum_{j=1}^d A_{i,j} x_j.$
• Each random variable $Y_i$ is a sum of independent Gaussian random variables. Specifically, each term satisfies $A_{i,j} x_j \sim \cN{0, \frac{x_j^2}{k}}$, and hence $Y_i \sim \cN{0, \frac{\|x\|_2^2}{k}},$ where we used the crucial property of Gaussian distributions that they are closed under taking a sum. More precisely, for any two independent Gaussian random variables $Z_1\sim \cN{\mu_1, \sigma_1^2}$ and $Z_2\sim \cN{\mu_2, \sigma_2^2}$, we have that $Z_1 + Z_2 \sim \cN{\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2}.$
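One can verify this distributional claim empirically: sampling many independent rows of $A$ and recording the resulting coordinate $(Ax)_i$ should produce samples with mean $0$ and standard deviation $\|x\|_2/\sqrt{k}$. The dimensions and sample count below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 50, 10
x = rng.normal(size=d)

# Sample many independent rows of A and record the coordinate (Ax)_i.
samples = []
for _ in range(20000):
    row = rng.normal(scale=np.sqrt(1.0 / k), size=d)  # one row of A
    samples.append(row @ x)
samples = np.array(samples)

# Y_i should be N(0, ||x||_2^2 / k): mean ~ 0, std ~ ||x||_2 / sqrt(k).
print(samples.mean(), samples.std(), np.linalg.norm(x) / np.sqrt(k))
```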
• Consequently, we can conclude that $\Ex{\|Ax\|_2^2}=\sum_i \Ex{Y_i^2}=k\cdot \Var{Y_i} = k\cdot \frac{\|x\|_2^2}{k} = \|x\|_2^2,$ where we used that $\Ex{Y_i^2}=\Var{Y_i}$ since $\Ex{Y_i}=0$.
• So, the expectation is correct, but how about the concentration around this expectation? We need a tail bound.
• Recall that $\|Ax\|_2^2$ is a sum of squares of Gaussian variables, i.e., it follows a (scaled) $\chi$-squared distribution with $k$ degrees of freedom.
• Gaussian variables are very concentrated around their expectation -- the tail vanishes at an exponential rate. For instance, $\cN{\mu, \sigma^2}$ is within $2\sigma$ of its expectation with probability at least $0.95$.
• This strong concentration is quantified via Chernoff bounds (for $\chi$-squared distributions). Namely, if $Z_1, \ldots, Z_t\sim \cN{0,1}$ are $t$ independent Gaussian variables with zero mean and variance of one, then, for any $\frac{1}{2}\geq \delta >0$, $\Prob{\left|\sum_{i=1}^t Z_i^2-t\right| > \delta t}\leq 2\cdot \exp\left(-\frac{t\delta^2}{8}\right).$ (Note that $t$ is the expectation of $\sum_{i=1}^t Z_i^2$.)
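A quick simulation illustrates this tail bound (and how conservative it is in practice); the parameters $t$, $\delta$, and the number of trials below are arbitrary demonstration choices:

```python
import numpy as np

rng = np.random.default_rng(2)
t, delta = 400, 0.25
trials = 20000

# Each trial: the sum of t squared standard Gaussians (a chi-squared_t variable).
Z = rng.normal(size=(trials, t))
sums = (Z ** 2).sum(axis=1)

# Empirical probability of deviating from the mean t by more than delta*t,
# versus the Chernoff bound 2*exp(-t*delta^2/8).
empirical = np.mean(np.abs(sums - t) > delta * t)
bound = 2 * np.exp(-t * delta ** 2 / 8)
print(empirical, bound)
```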
• For future reference: Another popular form of Chernoff bound corresponds to sums of independent random $0$-$1$ variables. If $Z_1, \ldots, Z_t\in \{0,1\}$ are independent random variables with $\Ex{Z_i}=\mu$ then $\Prob{\left|\sum_{i=1}^t Z_i-t\mu\right| > \delta t\mu } \leq 2\cdot \exp\left(-\frac{t\mu \delta^2}{3}\right).$ One could see this Chernoff bound as a quantitative version of the -- asymptotic in nature -- Central Limit Theorem, which states that for any sequence of $t$ independent random variables $Z_1, \ldots, Z_t$ with $\Ex{Z_i}=\mu$ and $\Var{Z_i}=\sigma^2$, for each $i$, $\frac{\sum_{i=1}^t Z_i}{t} \underset{t\rightarrow \infty}{\sim} \cN{\mu, \frac{\sigma^2}{t}}.$ (Note that the variance of a $0$-$1$ variable is always at most $\frac{1}{4}$.)
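The $0$-$1$ version can be checked the same way, e.g., with Bernoulli variables of mean $\mu = 0.3$ (all parameters below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
t, mu, delta = 1000, 0.3, 0.2
trials = 20000

# Each trial: t independent 0-1 variables, each 1 with probability mu.
Z = rng.random((trials, t)) < mu
sums = Z.sum(axis=1)

# Empirical deviation probability versus the bound 2*exp(-t*mu*delta^2/3).
empirical = np.mean(np.abs(sums - t * mu) > delta * t * mu)
bound = 2 * np.exp(-t * mu * delta ** 2 / 3)
print(empirical, bound)
```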
• {\bf Back to the proof:} observe now that if we take $k=C\cdot \varepsilon^{-2} \log n$, for sufficiently large $C>0$, we have \begin{eqnarray*} \Prob{\left|\|Ax\|_2^2-\|x\|_2^2\right| > \varepsilon \|x\|_2^2} &= & \Prob{\left|\sum_i Y_i^2-\|x\|_2^2\right| > \varepsilon \|x\|_2^2} \\ &=& \Prob{\left|\sum_i \left(\frac{\sqrt{k}}{\|x\|_2}\cdot Y_i\right)^2-k\right| > \varepsilon k}\\ &\leq & 2\cdot \exp\left(-\frac{k\varepsilon^2}{8}\right)\leq \frac{1}{n^3}, \end{eqnarray*} where we used the Chernoff bound for $\chi$-squared distributions and the fact that $\frac{\sqrt{k}}{\|x\|_2}\cdot Y_i \sim \cN{0,1}.$ The norm preservation property, and thus the whole lemma, follows.
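Putting everything together, one can simulate the lemma end to end: sample $n$ points, project them with a Gaussian $A$, and check that all pairwise squared distances are preserved up to a $(1\pm\varepsilon)$ factor. The sketch below deliberately uses a generous $k$ and $\varepsilon$ so the check passes comfortably; these are illustrative choices, not the tight $O(\varepsilon^{-2}\log n)$ bound:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 30, 2000
k = 500       # generous target dimension for the demo
eps = 0.5

X = rng.normal(size=(n, d))                           # n data points, as rows
A = rng.normal(scale=np.sqrt(1.0 / k), size=(k, d))   # the JL matrix
Y = X @ A.T                                           # rows of Y are A x^i

# Check that every pairwise squared distance is preserved up to 1 +/- eps.
ok = True
for i in range(n):
    for j in range(i + 1, n):
        orig = np.sum((X[i] - X[j]) ** 2)
        proj = np.sum((Y[i] - Y[j]) ** 2)
        ok &= (1 - eps) * orig <= proj <= (1 + eps) * orig
print(ok)
```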

Extensions and Different Variants

• Two issues: the matrix $A$ we sample above has real-valued entries, and it is very dense, i.e., each entry is non-zero (with probability $1$).
• To deal with real numbers: Achlioptas '03 has shown that it suffices to sample each entry independently at random to be: $+1$ with probability $\frac{1}{6}$, $-1$ with probability $\frac{1}{6}$, and $0$ otherwise (up to an appropriate scaling of the whole matrix).
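A sketch of this sampling scheme follows. The $\sqrt{3/k}$ scaling factor is an addition here (not stated in the bullet above): it makes each entry have variance $\frac{1}{k}$, matching the Gaussian construction, so that $\Ex{\|Ax\|_2^2}=\|x\|_2^2$:

```python
import numpy as np

rng = np.random.default_rng(5)
d, k = 1000, 300

# Entries: +1 w.p. 1/6, -1 w.p. 1/6, 0 w.p. 2/3, then scaled by sqrt(3/k)
# (the scaling is added here so each entry has variance 1/k).
vals = rng.choice([1.0, -1.0, 0.0], size=(k, d), p=[1 / 6, 1 / 6, 2 / 3])
A = np.sqrt(3.0 / k) * vals

x = rng.normal(size=d)
ratio = np.sum((A @ x) ** 2) / np.sum(x ** 2)
print(ratio)  # close to 1
```

Note that about two thirds of the entries of $A$ are zero, which already gives some sparsity for free.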
• To make $A$ sparse: Kane and Nelson '14 have shown that there exists a way of sampling $A$ that ensures there are only at most $O(\frac{\log n}{\varepsilon})$ non-zero entries per column. (The distribution here can no longer correspond to simple entry-wise independent sampling.)
• Also, hot off the press: Kasper Green Larsen and Jelani Nelson have shown very recently, i.e., two weeks ago, that the bound on $k$ is tight. (Previously, it was known that this bound is tight only up to a $\log \varepsilon^{-1}$ factor.)
• Natural question: Can we have such a strong dimensionality reduction phenomenon also for other distances, e.g., $\ell_1$ or $\ell_\infty$ norms?
• Unfortunately, no -- see the homework. $\ell_2$-distances turn out to be very special.