Today: How to represent high-dimensional data in low dimensions.
- Much of the data we work with today comes in the form of high-dimensional vectors. That is, each data point is a vector $x_i\in \mathbb{R}^d$, for a very large $d$.
Usually, each coordinate corresponds to one of the “features” of the data. For instance, the bag-of-words representation of documents, which treats every document as a vector of word counts, can have tens of thousands of coordinates. The dimension tends to be even larger when the data consists of images or videos.
- Motivating question: Can we map these high-dimensional vectors to a much lower dimensional space while still (approximately) preserving some characteristics of the data that we care about?
- We will now study this question in the case when we are interested in preserving the $\ell_2$-norm (Euclidean) distance between the data points.
Johnson-Lindenstrauss Lemma
Lemma: For any set of $n$ $d$-dimensional vectors $x^1, \ldots, x^n\in \mathbb{R}^d$, and any $\frac{1}{2}\geq \varepsilon>0$, there exists a $k$-by-$d$ matrix $A$, with $k=O(\varepsilon^{-2}\log n)$, such that \[ (1-\varepsilon) \|x^i-x^j\|_2^2 \leq \|Ax^i - Ax^j\|_2^2 \leq (1+\varepsilon) \|x^i-x^j\|_2^2, \] for all $i$ and $j$.
- Intuitively, the above statement tells us that there is a mapping of $d$-dimensional vectors to $k$-dimensional vectors, with $k=O(\varepsilon^{-2}\log n)$, that keeps the $\ell_2$-norm distances (approximately) the same.
- Note that this mapping is linear in the vectors we are mapping. This is a very useful property in many of the applications of the Johnson-Lindenstrauss lemma.
- Observe that $k$ does not depend on $d$.
- Additionally, even though in principle the matrix $A$ in the statement above could depend on the vectors $x^1, \ldots, x^n$, we will construct it in a way that is completely oblivious to what these vectors are.
- However, as one can show that there is no single $A$ that works for all collections of vectors, we will make the construction of $A$ probabilistic and prove that for any fixed set of vectors it works with high probability.
- Specifically, we will make each entry of $A$ be sampled independently from a Gaussian distribution $\cN{0,\frac{1}{k}}$, where the density of a Gaussian/normal random variable is given by \[ \cN{\mu, \sigma^2, x} = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \] with $\mu$ being the mean, and $\sigma^2$ being its variance.
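- As a quick illustration of this construction (not part of the lemma or its proof), here is a short numpy sketch that samples such an $A$ and measures the worst distortion of pairwise distances; the data, the dimensions, and the constant $10$ in the choice of $k$ are arbitrary choices made only for this demo.
```python
import numpy as np

# Numerical illustration of the construction above: entries of A are i.i.d. N(0, 1/k).
# The data, the dimensions, and the constant 10 in the choice of k are demo choices.
rng = np.random.default_rng(0)

n, d, eps = 60, 5_000, 0.25
k = int(np.ceil(10 * np.log(n) / eps**2))          # k = O(eps^{-2} log n)

X = rng.normal(size=(n, d))                        # n data points in R^d
A = rng.normal(scale=1 / np.sqrt(k), size=(k, d))  # N(0, 1/k) entries (std = 1/sqrt(k))
Y = X @ A.T                                        # mapped points, one row per point

# Worst relative distortion of squared pairwise distances.
worst = 0.0
for i in range(n):
    for j in range(i + 1, n):
        ratio = np.linalg.norm(Y[i] - Y[j]) ** 2 / np.linalg.norm(X[i] - X[j]) ** 2
        worst = max(worst, abs(ratio - 1))

print(f"k = {k}, worst relative distortion = {worst:.3f}")  # typically around or below eps
```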
Proof
- Key observation: It suffices to establish a certain norm preservation property of $A$. Specifically, it suffices to prove that, for any fixed vector $x\in \mathbb{R}^d$, we have that \[ \Prob{(1-\varepsilon)\|x\|_2^2 \leq \|Ax\|_2^2 \leq (1+\varepsilon)\|x\|_2^2} \geq 1 - \frac{1}{n^3}. \] Then, considering the vectors $x=x^i-x^j$ for all $\binom{n}{2}$ pairs of vectors and union bounding over them gives us the statement of the Johnson-Lindenstrauss lemma with probability at least $1-\frac{1}{n}$, since the total failure probability is at most $\binom{n}{2}\cdot \frac{1}{n^3} \leq \frac{1}{2n}$. (And by increasing the underlying parameters we can boost this probability further.)
- To this end, let us fix $x\in \mathbb{R}^d$, and consider the distribution of the $i$-th coordinate $Y_i$ of the mapping $Ax$ of $x$ \[ Y_i:=(Ax)_i=\sum_{j=1}^d A_{i,j} x_j. \]
- Each random variable $Y_i$ is a sum of independent Gaussian random variables. Specifically, we have that \[ Y_i \sim \sum_{j=1}^d\cN{0, \frac{1}{k}} x_j \sim \cN{0, \frac{\|x\|_2^2}{k}}, \] where we used the crucial property of Gaussian distributions that they are closed under addition. More precisely, for any two independent Gaussian random variables $Z_1\sim \cN{\mu_1, \sigma_1^2}$ and $Z_2\sim \cN{\mu_2, \sigma_2^2}$, we have that \[ Z_1 + Z_2 \sim \cN{\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2}. \]
- Consequently, since $\Ex{Y_i}=0$ and thus $\Ex{Y_i^2}=\Var{Y_i}$, we can conclude that \[ \Ex{\|Ax\|_2^2}=\sum_{i=1}^k \Ex{Y_i^2}=k\cdot \Var{Y_i} = k\cdot \frac{\|x\|_2^2}{k} = \|x\|_2^2. \]
- So, the expectation is correct, but how about the concentration around this expectation? We need a tail bound.
- Recall that each coordinate $Y_i$ of $Ax$ is a Gaussian variable, so $\|Ax\|_2^2=\sum_i Y_i^2$ is a sum of squares of independent Gaussians, i.e., a scaled $\chi^2$ random variable with $k$ degrees of freedom.
- Gaussian variables are very concentrated around expectation -- the tail vanishes at an exponential rate. For instance, $\cN{\mu, \sigma^2}$ is within $2\sigma$ of the expectation with probability at least $0.95$.
- This strong concentration is quantified via Chernoff bounds (here, for $\chi^2$ distributions). Namely, if $Z_1, \ldots, Z_t\sim \cN{0,1}$ are $t$ independent Gaussian variables with zero mean and variance one, then, for any $\frac{1}{2}\geq \delta >0$, \[ \Prob{\left|\sum_{i=1}^t Z_i^2-t\right| > \delta t}\leq 2\cdot \exp\left(-\frac{t\delta^2}{8}\right). \] (Note that $t$ is the expectation of $\sum_{i=1}^t Z_i^2$.)
- For future reference: Another popular form of the Chernoff bound corresponds to sums of independent random $0$-$1$ variables. If $Z_1, \ldots, Z_t\in \{0,1\}$ are independent random variables with $\Ex{Z_i}=\mu$ then \[ \Prob{\left|\sum_{i=1}^t Z_i-t\mu\right| > \delta t\mu } \leq 2\cdot \exp\left(-\frac{t\mu \delta^2}{3}\right). \] One can see this Chernoff bound as a quantitative version of the -- asymptotic in nature -- Central Limit Theorem, which states that for any sequence of $t$ independent random variables $Z_1, \ldots, Z_t$ with $\Ex{Z_i}=\mu$ and $\Var{Z_i}=\sigma^2$, for each $i$, \[ \frac{\sum_{i=1}^t Z_i}{t} \underset{t\rightarrow \infty}{\sim} \cN{\mu, \frac{\sigma^2}{t}}. \] (Note that $0$-$1$ variables always have variance at most $1$.) A small numerical sanity check of both Chernoff bounds is sketched right after the proof.
- {\bf Back to the proof:} Observe now that if we take $k=C\cdot \varepsilon^{-2} \log n$, for a sufficiently large $C>0$, we have $$ \begin{eqnarray*} \Prob{\left|\|Ax\|_2^2-\|x\|_2^2\right| > \varepsilon \|x\|_2^2} &=& \Prob{\left|\sum_i Y_i^2-\|x\|_2^2\right| > \varepsilon \|x\|_2^2} \\ &=& \Prob{\left|\sum_i \left(\frac{\sqrt{k}}{\|x\|_2}\cdot Y_i\right)^2-k\right| > \varepsilon k}\\ &\leq& 2\cdot \exp\left(-\frac{k\varepsilon^2}{8}\right)\leq \frac{1}{n^3}, \end{eqnarray*} $$ where we used the Chernoff bound for $\chi^2$ distributions and the fact that \[ \frac{\sqrt{k}}{\|x\|_2}\cdot Y_i \sim \cN{0,1}. \] The norm preservation property, and thus the whole lemma, follows.
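- The numerical sanity check of the two Chernoff bounds promised above: a short simulation comparing the empirical tail probabilities to the stated bounds. The parameters $t$, $\delta$, $\mu$, and the number of trials are arbitrary demo choices; the experiment only illustrates the bounds, it does not prove them.
```python
import numpy as np

# Quick sanity check of the two Chernoff bounds stated above (demo parameters only).
rng = np.random.default_rng(1)
trials = 20_000

# (1) chi-squared version: P(|sum_{i<=t} Z_i^2 - t| > delta*t) <= 2*exp(-t*delta^2/8).
t, delta = 200, 0.3
Z = rng.normal(size=(trials, t))
tail = np.mean(np.abs((Z ** 2).sum(axis=1) - t) > delta * t)
print(f"chi-squared: empirical tail = {tail:.5f}, bound = {2 * np.exp(-t * delta**2 / 8):.5f}")

# (2) 0-1 version: P(|sum_{i<=t} Z_i - t*mu| > delta*t*mu) <= 2*exp(-t*mu*delta^2/3).
t, mu, delta = 1_000, 0.2, 0.15
B = rng.random((trials, t)) < mu              # independent 0-1 variables with mean mu
tail = np.mean(np.abs(B.sum(axis=1) - t * mu) > delta * t * mu)
print(f"0-1 version: empirical tail = {tail:.5f}, bound = {2 * np.exp(-t * mu * delta**2 / 3):.5f}")
```
In both cases the empirical tail comes out well below the bound, as expected for these (non-extreme) parameter choices.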
Extensions and Different Variants
- Two issues: the matrix $A$ we sample above has real-valued entries, and it is very dense, i.e., each entry is non-zero (with probability $1$).
- To deal with real numbers: Achlioptas '03 has shown that it suffices to sample each entry independently at random to be $+1$ with probability $\frac{1}{6}$, $-1$ with probability $\frac{1}{6}$, and $0$ otherwise. (A short sketch of this sampling appears at the end of this section.)
- To make $A$ sparse: Kane and Nelson '14 have shown that there exists a way of sampling $A$ that ensures there are at most $O(\frac{\log n}{\varepsilon})$ non-zero entries per column. (The distribution here can no longer correspond to simple entry-wise independent sampling.)
- Also, hot off the press: Green Larsen and Nelson have shown very recently, i.e., two weeks ago, that the bound on $k$ is tight. (Previously, it was known that this bound is tight only up to a $\log \varepsilon^{-1}$ factor.)
- Natural question: Can we have such a strong dimensionality reduction phenomenon also for other distances, e.g., the $\ell_1$ or $\ell_\infty$ norms?
- Unfortunately, no -- see the homework. $\ell_2$-distances turn out to be very special.
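- Finally, the sketch of the discrete (Achlioptas-style) sampling promised above. The $\sqrt{3/k}$ scaling is not specified in the discussion above and is an assumption of this illustration, chosen so that each entry has variance $\frac{1}{k}$, matching the Gaussian construction; the dimensions are arbitrary demo choices.
```python
import numpy as np

# Sketch of the discrete construction discussed above: each entry is +1 w.p. 1/6,
# -1 w.p. 1/6, and 0 otherwise.  The sqrt(3/k) scaling is an added assumption of
# this illustration, chosen so that each entry has variance 1/k, as in the
# Gaussian construction.
rng = np.random.default_rng(2)

d, k = 10_000, 500
A = np.sqrt(3 / k) * rng.choice([1.0, -1.0, 0.0], size=(k, d), p=[1/6, 1/6, 2/3])

x = rng.normal(size=d)
print(np.linalg.norm(A @ x) ** 2 / np.linalg.norm(x) ** 2)   # should be close to 1
```
Note that, up to the fixed scaling factor, only $\{+1,-1,0\}$-valued randomness is needed, which is what makes this variant attractive in practice.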