Markov Chains for Sampling
Sampling:
- Given complex state space
- Want to sample from it
- Use some Markov Chain
- Run for a long time
- end up “near” stationary distribution
- Reduces sampling to local moves (easier)
- no need for global description of state space
- Allows sample from exponential state space
Formalize: what is “near” and “long time”?
- Stationary distribution $\pi$
- arbitrary distribution $q$
- relative pointwise distance (r.p.d.) $\max_j |q_j-\pi_j|/\pi_j$
- Intuitively close.
- Formally, suppose r.p.d. $\delta$.
- Then $(1-\delta)\pi \le q$
- So can express distribution $q$ as “with probability $1-\delta$, sample from $\pi$; else, do something weird.”
- So if $\delta$ small, “as if” sampling from $\pi$ each time.
- If $\delta$ polynomially small, can take polynomially many samples before the deviation from $\pi$ becomes noticeable
- Gives “almost stationary” sample from Markov Chain
- Mixing Time: time to reduce r.p.d. to some $\epsilon$
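A minimal numerical sketch (the 3-state chain is an invented example, not from the notes): compute $\pi$ as the left eigenvector of $P$ for eigenvalue 1, run the chain from a point mass, and watch the r.p.d. fall.

```python
import numpy as np

# A small row-stochastic transition matrix (arbitrary example chain).
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# Stationary distribution: left eigenvector of P for eigenvalue 1, normalized.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi /= pi.sum()

q = np.array([1.0, 0.0, 0.0])  # start concentrated on one state
for k in range(1, 11):
    q = q @ P                  # one local move of the chain
    rpd = np.max(np.abs(q - pi) / pi)
    print(f"step {k}: r.p.d. = {rpd:.2e}")
```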
Eigenvalues
Method 1 for mixing time: Eigenvalues.
- Consider transition matrix $P$.
- Eigenvalues $\lambda_1\ge \cdots \ge\lambda_n$
- Corresponding Eigenvectors $e_1,\ldots,e_n$.
- Any vector $q$ can be written as $\sum a_i e_i$
- Then $qP = \sum a_i\lambda_i e_i$
- and $qP^k = \sum a_i \lambda_{i}^k e_i$
- so sufficient to understand eigenvalues and vectors.
- What about complex eigenvalues?
- Perron-Frobenius Theorem says for an aperiodic irreducible nonnegative square matrix (our case) the largest eigenvalue is real and simple
- And its eigenvector has all nonnegative entries
- (If periodic, then the number of eigenvalues of maximum modulus equals the period, but they are all rotations of one another by roots of unity, with correspondingly rotated (possibly complex) eigenvectors.)
- Is any $|\lambda_i| > 1$?
- If so, take the corresponding right eigenvector $x_i$: $P x_i = \lambda_i x_i$ (eigenvalues of $P$ as a left and right operator coincide)
- let $M$ be max entry of $x_i$ (in absolute value)
- if $|\lambda_i| > 1$, then some entry of $P x_i$ has absolute value $|\lambda_i| M > M$
- but any entry of $P x_i$ is a convex combo of values of absolute value at most $M$, so has absolute value at most $M$, contradiction.
- Deduce: all eigenvalues of a stochastic matrix are at most 1 in absolute value.
- How many $\lambda_i=1$?
- Stationary distribution ($e_1=\pi$)
- if there were another, adding a small multiple of it to $e_1$ would give a second stationary distribution, contradicting uniqueness for irreducible chains
- What about $-1$? Only if periodic.
- so all other coordinates of eigenvalue decomposition decay as $\lambda_i^k$.
- So if can show other $\lambda_i$ small, converge to stationary distribution fast.
- In particular, if $\lambda_2 < 1-1/poly$, get polynomial mixing time
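To see the role of $\lambda_2$ concretely, a sketch (same invented chain as above) checking that the distance to $\pi$ decays like $\lambda_2^k$:

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# lambda_1 = 1 always; the second-largest eigenvalue modulus drives the decay.
lam2 = np.sort(np.abs(np.linalg.eigvals(P)))[-2]

vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi /= pi.sum()

q = np.array([1.0, 0.0, 0.0])
for k in range(1, 9):
    q = q @ P
    err = np.linalg.norm(q - pi)
    # err / lam2^k staying roughly constant confirms error ~ C * lambda_2^k
    print(f"k={k}: error = {err:.3e}, error/lam2^k = {err / lam2**k:.3f}")
```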
Expanders:
Definition
- bipartite
- $n$ vertices, regular degree $d$
- $|\Gamma(S)| \ge (1+c(1-2|S|/n))|S|$
Take random walk on $(n,d,c)$ expander with constant $c$
- add self-loops (with probability $1/2$) to deal with periodicity.
- uniform stationary distribution
- lemma: second eigenvalue $1-O(1/d)$ \[ \lambda_2 \le 1-\frac{c^2}{d(2048+4c^2)} \]
- Intuition on convergence: because neighborhoods grow, position becomes unpredictable very fast.
- proof: messy math
- Converse theorem: if $\lambda_2 \le 1-\epsilon$, get expander with \[ c \ge 4(\epsilon-\epsilon^2) \]
- Walks that mix fast are on expanders.
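A brute-force sketch of the expansion definition (the small random bipartite $d$-regular graph is an assumption for illustration; it need not be a good expander): find the best constant $c$ for which $|\Gamma(S)| \ge (1+c(1-2|S|/n))|S|$ holds over all small left sets.

```python
import itertools
import random

n, d = 10, 3
random.seed(0)
# Left vertex -> d distinct right neighbors (a random d-regular-left graph).
nbrs = [random.sample(range(n), d) for _ in range(n)]

c_best = float("inf")
for size in range(1, n // 2):          # the constraint only bites when |S| < n/2
    for S in itertools.combinations(range(n), size):
        gamma = len(set().union(*(nbrs[v] for v in S)))
        # solve |Gamma(S)| = (1 + c(1 - 2|S|/n))|S| for c
        c = (gamma / size - 1) / (1 - 2 * size / n)
        c_best = min(c_best, c)
print(f"empirical expansion constant c = {c_best:.3f}")
```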
Do expanders exist? Yes! proof: probabilistic method.
Definition: $(n,d,\alpha,c)$ OR-concentrator
- bipartite with $2n$ vertices ($n$ per side)
- degree at most $d$ in $L$
- expansion $c$ on sets $< \alpha n$.
claim: $(n,18,1/3,2)$-concentrator exists
- Construct by sampling $d$ random neighbors with replacement
- Bad event $E_s$: some set of size $s$ has $< cs$ neighbors.
- fix $S$ of size $s$ and $T$ of size $cs$; bound the chance all of $S$'s edges land inside $T$
- each of the $ds$ sampled neighbors lands in $T$ independently with probability $cs/n$, so prob. at most $(cs/n)^{ds}$
- $\binom{n}{cs}$ sets $T$
- $\binom{n}{s}$ sets $S$
- $$ \begin{eqnarray*} \Pr[E_s] &\le &\binom{n}{s}\binom{n}{cs}(cs/n)^{ds}\\ &\le &(en/s)^s(en/cs)^{cs}(cs/n)^{ds}\\ &= &[(s/n)^{d-c-1}e^{c+1}c^{d-c}]^s\\ &\le &[(1/3)^{d-c-1}e^{c+1}c^{d-c}]^s\\ &\le &[(c/3)^d(3e)^{c+1}]^s \end{eqnarray*} $$
- Take $c=2, d=18$: since $(2/3)^{18}(3e)^3 < 1/2$, get $\Pr[E_s] \le [(2/3)^{18}(3e)^3]^{s} < 2^{-s}$
- sum over $s$: total failure probability $< 1$, so such a graph exists
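A toy check of this construction (parameters shrunk to $n=12$, far below what the claim itself needs, purely so brute force is feasible): sample the random graph and verify the concentrator property directly.

```python
import itertools
import random

n, d, alpha, c = 12, 18, 1 / 3, 2
random.seed(1)
# Each left vertex picks d neighbors with replacement (so degree at most d).
nbrs = [{random.randrange(n) for _ in range(d)} for _ in range(n)]

ok = True
for size in range(1, int(alpha * n) + 1):
    for S in itertools.combinations(range(n), size):
        # Every S with |S| <= alpha*n must have at least c|S| neighbors.
        if len(set().union(*(nbrs[v] for v in S))) < c * size:
            ok = False
print("sampled graph is an (n, 18, 1/3, 2) concentrator:", ok)
```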
Gabber-Galil expanders:
- But in this case, can do better deterministically.
- Gabber Galil expanders.
- Let $n=2m^2$. Vertices are $(x,y)$ where $x,y \in Z_m$ (one set per side)
- 5 neighbors: $(x,y), (x, x+y), (x,x+y+1),(x+y,y),(x+y+1,y)$ (add mod $m$)
- or 7 neighbors of similar form.
- Theorem: this $d=5$ graph has $c=(2-\sqrt{3})/4$, degree 7 has twice the expansion.
- in other words, $c$ and $d$ are constant.
- meaning $\lambda_2=1-\epsilon$ for some constant $\epsilon$
- So random walks on this expander mix very fast: for polynomially small r.p.d., $O(\log n)$ steps of random walk suffice.
- Note also that $n$ can be huge, since only need to store one vertex ($O(\log n)$ bits).
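The degree-5 neighbor rule, transcribed directly (the function name is ours); note the walk only ever stores the current vertex, even when $n$ is huge.

```python
# Neighbors of left vertex (x, y), x, y in Z_m, all arithmetic mod m.
def gg_neighbors(x, y, m):
    return [(x, y),
            (x, (x + y) % m),
            (x, (x + y + 1) % m),
            ((x + y) % m, y),
            ((x + y + 1) % m, y)]

print(gg_neighbors(2, 3, 5))  # neighbors of (2, 3) when m = 5, i.e. n = 2 * 25
```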
Application: conserving randomness.
- Consider a BPP algorithm that gives the right answer with probability $99/100$ (constant irrelevant) using $n$ random bits.
- $t$ independent trials with majority rule reduce failure probability to $2^{-O(t)}$ (Chernoff), but need $tn$ bits
- in case of $RP$, used $2$-point sampling to get error $O(1/t)$ with $2n$ bits and $t$ trials.
- Use walk instead.
- vertices are the $N=2^n$ possible ($n$-bit) random seeds for the algorithm
- edges as in the degree-7 expander
- only $1/100$ of vertices are bad.
- what is probability majority of time spent there?
- in limit, spend $1/100$ of time there
- how fast converge to limit? How long must we run?
- Power the Markov chain so $\lambda_2^{\beta}\le 1/10$ ($\beta$ a constant number of steps)
- use the vertices encountered every $\beta$ steps as the random seeds
- number of bits needed:
- $O(n)$ for stationary starting point
- $3\beta$ more per trial ($3$ random bits per walk step; sketch below)
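A sketch of the procedure just described; `bpp_algorithm` and `expander_step` are hypothetical stand-ins for the algorithm being amplified and for one (randomness-cheap) step of the expander walk.

```python
import random

def amplified(bpp_algorithm, expander_step, n, trials, beta):
    """Walk-based amplification.

    bpp_algorithm: hypothetical stand-in, maps an n-bit seed to a bool.
    expander_step: hypothetical stand-in, maps a vertex to a random neighbor
                   (a few fresh random bits per step on a constant-degree graph).
    """
    v = random.getrandbits(n)        # O(n) bits: a stationary (uniform) start
    votes = 0
    for _ in range(trials):
        for _ in range(beta):        # beta walk steps per trial
            v = expander_step(v)
        votes += bpp_algorithm(v)    # current vertex is this trial's seed
    return votes > trials / 2        # majority vote over all trials
```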
- Theorem: after $7k$ samples, probability the majority is wrong is at most $2^{-k}$. So error $2^{-n}$ with $O(n)$ bits! (compare to $tn = O(n^2)$ bits for naive independent trials)
- Let $B$ be powered transition matrix
- let $p^{(i)}$ be the distribution of sample $i$, namely $p^{(0)}B^i$
- Let $W$ be the indicator matrix for good witnesses, namely $1$ at diagonal entry $i$ if seed $i$ is a witness; $\bar{W}=I-W$ indicates the complementary set.
- $\|p^{(i)}W\|_1$ is the probability that sample $i$ is a witness; similarly for nonwitnesses.
- Consider a sequence of $7k$ results “witness or not”
- represent as matrices $S=(S_1,\ldots,S_{7k}) \in \{W,\bar{W}\}^{7k}$
- claim \[ \Pr[S] = \|p^{(0)}(BS_1)(BS_2)\cdots(BS_{7k})\|_1. \] (draw layered graph. sums prob. of paths through correct sequence of witness/nonwitness)
- Use $2$-norm since easier to work with.
- Note $\|A\|_1 \le \sqrt{N}\|A\|_2$
- For fixed sum of values, minimize sum of square by setting all equal
- ie, for sum $\alpha$, set all equal to $\alpha/N$
- 2-norm $\alpha/\sqrt{N}$
- defer: $\|pBW\|_2 \le \|p\|_2$ and $\|pB\bar{W}\|_2 \le \frac15\|p\|_2$
- deduce if more than $7k/2$ of the $S_i$ equal $\bar{W}$, $$ \begin{eqnarray*} \|p^{(0)}\prod BS_i\|_1 &\le & \sqrt{N}\|p^{(0)}\prod BS_i\|\\ &\le & \sqrt{N}(\frac15)^{7k/2}\|p^{(0)}\|\\ &= &(\frac15)^{7k/2} \end{eqnarray*} $$ (since $\|p^{(0)}\|=1/\sqrt{N}$ for the uniform start)
- At the same time, only $2^{7k}$ possible bad sequences, so error prob. at most $2^{7k}5^{-7k/2} = (2/\sqrt{5})^{7k} \le 2^{-k}$
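A sanity check of the layered-graph claim above, under toy assumptions (a small doubly stochastic $B$ built from random permutations, 2 bad states out of 6): summing $\|p^{(0)}(BS_1)\cdots(BS_t)\|_1$ over all $2^t$ witness patterns must give exactly 1.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
N, t = 6, 4
# Average of random permutation matrices: doubly stochastic stand-in for B.
B = sum(np.eye(N)[rng.permutation(N)] for _ in range(4)) / 4

bad = np.arange(N) < 2
W, Wbar = np.diag((~bad).astype(float)), np.diag(bad.astype(float))

p0 = np.full(N, 1 / N)                  # uniform start
total = 0.0
for pattern in itertools.product([W, Wbar], repeat=t):
    v = p0
    for S in pattern:
        v = v @ B @ S                   # walk one step, then observe W/Wbar
    total += v.sum()                    # ||.||_1 of a nonnegative vector
print(f"sum over all sequences: {total:.6f}")  # -> 1.000000
```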
- proof of lemma:
- write $p=\sum a_ie_i$ with $e_1=\pi$
- obviously $\|pBW\| \le \|pB\|$ since $W$ just zeros some stuff out.
- But $\|pB\|=\sqrt{\sum a_i^2\lambda_i^2}\le\sqrt{\sum a_i^2} = \|p\|$ (eigenvectors orthonormal since $B$ is symmetric)
- Now write $p=a_1\pi+y$ where $y \cdot \pi=0$
- argue that $\|\pi B\bar{W}\|\le \|\pi \|/10\le \|p\|/10$ and $\|yB\bar{W}\|\le\|y\|/10\le\|p\|/10$ (here $\|\pi\|,\|y\|\le\|p\|$ by Pythagoras since $y\perp\pi$); the triangle inequality then gives $\|pB\bar{W}\|\le\|p\|/5$
- First $\pi$:
- recall $\pi B=\pi$ is uniform vector, all coords $1/N$, so norm $1/\sqrt{N}$
- $\bar{W}$ keeps only the $N/100$ bad coordinates, so
- $\|\pi \bar{W}\| = \sqrt{(N/100)(1/N)^2}=\frac{1}{10\sqrt{N}}=\|\pi\|/10$
- Now $y$: just note $\|yB\| \le \|y\|/10$ since $y\perp\pi$ and $\lambda_2(B) \le 1/10$; then $\bar{W}$ only zeros out entries
- summary: $\pi$ part likely to be in good witness set, $y$ part unlikely to be relevant.
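A numerical check of the two deferred norm bounds, under assumptions: a random symmetric doubly stochastic matrix stands in for the expander walk (uniform stationary distribution), powered until its second eigenvalue is at most $1/10$, with $1/100$ of the states marked bad.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200

# Symmetrized average of random permutation matrices: doubly stochastic.
P = sum(np.eye(N)[rng.permutation(N)] for _ in range(8)) / 8
P = (P + P.T) / 2

lam2 = np.sort(np.abs(np.linalg.eigvals(P)))[-2]
beta = int(np.ceil(np.log(1 / 10) / np.log(lam2)))   # lam2^beta <= 1/10
B = np.linalg.matrix_power(P, beta)

bad = np.arange(N) < N // 100                        # 1/100 of states are bad
W, Wbar = np.diag((~bad).astype(float)), np.diag(bad.astype(float))

p = rng.random(N)
p /= p.sum()                                         # an arbitrary distribution
print(np.linalg.norm(p @ B @ W) <= np.linalg.norm(p))         # ||pBW|| <= ||p||
print(np.linalg.norm(p @ B @ Wbar) <= np.linalg.norm(p) / 5)  # ||pBWbar|| <= ||p||/5
```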