Markov Chains for Sampling
Sampling:
- Given complex state space
- Want to sample from it
- Use some Markov Chain
- Run for a long time
- end up “near” stationary distribution
- Reduces sampling to local moves (easier)
- no need for global description of state space
- Allows sample from exponential state space
Formalize: what is “near” and “long time”?
- Stationary distribution $\pi$
- arbitrary distribution $q$
- relative pointwise distance (r.p.d.) $\max_j |q_j-\pi_j|/\pi_j$
- Intuitively close.
- Formally, suppose r.p.d. $\delta$.
- Then $(1-\delta)\pi \le q$
- So can express distribution $q$ as “with probability $1-\delta$, sample from $\pi$; else, do something weird.”
- So if $\delta$ small, “as if” sampling from $\pi$ each time.
- If $\delta$ polynomially small, can take polynomially many samples before the deviation from $\pi$ becomes noticeable
- Gives “almost stationary” sample from Markov Chain
- Mixing Time: time to reduce r.p.d. to some $\epsilon$
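A minimal numerical sketch (the 3-state chain is an invented example, not from the notes): compute $\pi$ as the left eigenvector of $P$ for eigenvalue 1, run the chain from a point mass, and watch the r.p.d. fall.

```python
import numpy as np

# A small row-stochastic transition matrix (arbitrary example chain).
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# Stationary distribution: left eigenvector of P for eigenvalue 1, normalized.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi /= pi.sum()

q = np.array([1.0, 0.0, 0.0])  # start concentrated on one state
for k in range(1, 11):
    q = q @ P                  # one local move of the chain
    rpd = np.max(np.abs(q - pi) / pi)
    print(f"step {k}: r.p.d. = {rpd:.2e}")
```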
Eigenvalues
Method 1 for mixing time: Eigenvalues.
- Consider transition matrix $P$.
- Eigenvalues $\lambda_1\ge \cdots \ge\lambda_n$
- Corresponding Eigenvectors $e_1,\ldots,e_n$.
- Any vector $q$ can be written as $\sum a_i e_i$
- Then $qP = \sum a_i\lambda_i e_i$
- and $qP^k = \sum a_i \lambda_{i}^k e_i$
- so sufficient to understand eigenvalues and vectors.
- What about complex eigenvalues?
- Perron-Frobenius Theorem says for an aperiodic irreducible nonnegative square matrix (our case) the largest eigenvalue is real and simple
- And its eigenvector has all nonnegative entries
- (If periodic, then the number of eigenvalues of maximum modulus equals the period, but they are all rotations of one another by roots of unity, with correspondingly rotated (possibly complex) eigenvectors.)
- Is any $|\lambda_i| > 1$?
- If so, take the corresponding right eigenvector $x_i$: $P x_i = \lambda_i x_i$ (eigenvalues of $P$ as a left and right operator coincide)
- let $M$ be max entry of $x_i$ (in absolute value)
- if $|\lambda_i| > 1$, then some entry of $P x_i$ has absolute value $|\lambda_i| M > M$
- but any entry of $P x_i$ is a convex combo of values of absolute value at most $M$, so has absolute value at most $M$, contradiction.
- Deduce: all eigenvalues of a stochastic matrix are at most 1 in absolute value.
- How many $\lambda_i=1$?
- Stationary distribution ($e_1=\pi$)
- if there were another, adding a small multiple of it to $e_1$ would give a second stationary distribution, contradicting uniqueness for irreducible chains
- What about $-1$? Only if periodic.
- so all other coordinates of eigenvalue decomposition decay as $\lambda_i^k$.
- So if can show other $\lambda_i$ small, converge to stationary distribution fast.
- In particular, if $\lambda_2 < 1-1/poly$, get polynomial mixing time
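To see the role of $\lambda_2$ concretely, a sketch (same invented chain as above) checking that the distance to $\pi$ decays like $\lambda_2^k$:

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# lambda_1 = 1 always; the second-largest eigenvalue modulus drives the decay.
lam2 = np.sort(np.abs(np.linalg.eigvals(P)))[-2]

vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi /= pi.sum()

q = np.array([1.0, 0.0, 0.0])
for k in range(1, 9):
    q = q @ P
    err = np.linalg.norm(q - pi)
    # err / lam2^k staying roughly constant confirms error ~ C * lambda_2^k
    print(f"k={k}: error = {err:.3e}, error/lam2^k = {err / lam2**k:.3f}")
```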
Expanders:
Definition
- bipartite
- $n$ vertices, regular degree $d$
- $|\Gamma(S)| \ge (1+c(1-2|S|/n))|S|$
Take random walk on $(n,d,c)$ expander with constant $c$
- add self-loops (with probability $1/2$) to deal with periodicity.
- uniform stationary distribution
- lemma: second eigenvalue $1-O(1/d)$ \[ \lambda_2 \le 1-\frac{c^2}{d(2048+4c^2)} \]
- Intuition on convergence: because neighborhoods grow, position becomes unpredictable very fast.
- proof: messy math
- Converse theorem: if $\lambda_2 \le 1-\epsilon$, get expander with \[ c \ge 4(\epsilon-\epsilon^2) \]
- Walks that mix fast are on expanders.
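A brute-force sketch of the expansion definition (the small random bipartite $d$-regular graph is an assumption for illustration; it need not be a good expander): find the best constant $c$ for which $|\Gamma(S)| \ge (1+c(1-2|S|/n))|S|$ holds over all small left sets.

```python
import itertools
import random

n, d = 10, 3
random.seed(0)
# Left vertex -> d distinct right neighbors (a random d-regular-left graph).
nbrs = [random.sample(range(n), d) for _ in range(n)]

c_best = float("inf")
for size in range(1, n // 2):          # the constraint only bites when |S| < n/2
    for S in itertools.combinations(range(n), size):
        gamma = len(set().union(*(nbrs[v] for v in S)))
        # solve |Gamma(S)| = (1 + c(1 - 2|S|/n))|S| for c
        c = (gamma / size - 1) / (1 - 2 * size / n)
        c_best = min(c_best, c)
print(f"empirical expansion constant c = {c_best:.3f}")
```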
Do expanders exist? Yes! proof: probabilistic method.
Definition: $(n,d,\alpha,c)$ OR-concentrator
- bipartite with $2n$ vertices ($n$ per side)
- degree at most $d$ in $L$
- expansion $c$ on sets $< \alpha n$.
claim: $(n,18,1/3,2)$-concentrator exists
- Construct by sampling $d$ random neighbors with replacement
- Bad event $E_s$: some set of size $s$ has $< cs$ neighbors.
- fix $S$ of size $s$ and $T$ of size $cs$; bound the chance all of $S$'s edges land inside $T$
- each of the $ds$ sampled neighbors lands in $T$ independently with probability $cs/n$, so prob. at most $(cs/n)^{ds}$
- $\binom{n}{cs}$ sets $T$
- $\binom{n}{s}$ sets $S$
- $$ \begin{eqnarray*} \Pr[E_s] &\le &\binom{n}{s}\binom{n}{cs}(cs/n)^{ds}\\ &\le &(en/s)^s(en/cs)^{cs}(cs/n)^{ds}\\ &= &[(s/n)^{d-c-1}e^{c+1}c^{d-c}]^s\\ &\le &[(1/3)^{d-c-1}e^{c+1}c^{d-c}]^s\\ &\le &[(c/3)^d(3e)^{c+1}]^s \end{eqnarray*} $$
- Take $c=2, d=18$: since $(2/3)^{18}(3e)^3 < 1/2$, get $\Pr[E_s] \le [(2/3)^{18}(3e)^3]^{s} < 2^{-s}$
- sum over $s$: total failure probability $< 1$, so such a graph exists
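A toy check of this construction (parameters shrunk to $n=12$, far below what the claim itself needs, purely so brute force is feasible): sample the random graph and verify the concentrator property directly.

```python
import itertools
import random

n, d, alpha, c = 12, 18, 1 / 3, 2
random.seed(1)
# Each left vertex picks d neighbors with replacement (so degree at most d).
nbrs = [{random.randrange(n) for _ in range(d)} for _ in range(n)]

ok = True
for size in range(1, int(alpha * n) + 1):
    for S in itertools.combinations(range(n), size):
        # Every S with |S| <= alpha*n must have at least c|S| neighbors.
        if len(set().union(*(nbrs[v] for v in S))) < c * size:
            ok = False
print("sampled graph is an (n, 18, 1/3, 2) concentrator:", ok)
```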
Gabber-Galil expanders:
- But in this case, can do better deterministically.
- Gabber Galil expanders.
- Let $n=2m^2$. Vertices are $(x,y)$ where $x,y \in Z_m$ (one set per side)
- 5 neighbors: $(x,y), (x, x+y), (x,x+y+1),(x+y,y),(x+y+1,y)$ (add mod $m$)
- or 7 neighbors of similar form.
- Theorem: this $d=5$ graph has $c=(2-\sqrt{3})/4$, degree 7 has twice the expansion.
- in other words, $c$ and $d$ are constant.
- meaning $\lambda_2=1-\epsilon$ for some constant $\epsilon$
- So random walks on this expander mix very fast: for polynomially small r.p.d., $O(\log n)$ steps of random walk suffice.
- Note also that $n$ can be huge, since only need to store one vertex ($O(\log n)$ bits).
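The degree-5 neighbor rule, transcribed directly (the function name is ours); note the walk only ever stores the current vertex, even when $n$ is huge.

```python
# Neighbors of left vertex (x, y), x, y in Z_m, all arithmetic mod m.
def gg_neighbors(x, y, m):
    return [(x, y),
            (x, (x + y) % m),
            (x, (x + y + 1) % m),
            ((x + y) % m, y),
            ((x + y + 1) % m, y)]

print(gg_neighbors(2, 3, 5))  # neighbors of (2, 3) when m = 5, i.e. n = 2 * 25
```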
Application: conserving randomness.
- Consider a BPP algorithm that gives the right answer with probability $99/100$ (constant irrelevant) using $n$ random bits.
- $t$ independent trials with majority rule reduce failure probability to $2^{-O(t)}$ (Chernoff), but need $tn$ bits
- in case of $RP$, used $2$-point sampling to get error $O(1/t)$ with $2n$ bits and $t$ trials.
- Use walk instead.
- vertices are the $N=2^n$ possible ($n$-bit) random seeds for the algorithm
- edges as in the degree-7 expander
- only $1/100$ of vertices are bad.
- what is probability majority of time spent there?
- in limit, spend $1/100$ of time there
- how fast converge to limit? How long must we run?
- Power the Markov chain so $\lambda_2^{\beta}\le 1/10$ ($\beta$ a constant number of steps)
- use the vertices encountered every $\beta$ steps as the random seeds
- number of bits needed:
- $O(n)$ for stationary starting point
- $3\beta$ more per trial ($3$ random bits per walk step; sketch below)
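A sketch of the procedure just described; `bpp_algorithm` and `expander_step` are hypothetical stand-ins for the algorithm being amplified and for one (randomness-cheap) step of the expander walk.

```python
import random

def amplified(bpp_algorithm, expander_step, n, trials, beta):
    """Walk-based amplification.

    bpp_algorithm: hypothetical stand-in, maps an n-bit seed to a bool.
    expander_step: hypothetical stand-in, maps a vertex to a random neighbor
                   (a few fresh random bits per step on a constant-degree graph).
    """
    v = random.getrandbits(n)        # O(n) bits: a stationary (uniform) start
    votes = 0
    for _ in range(trials):
        for _ in range(beta):        # beta walk steps per trial
            v = expander_step(v)
        votes += bpp_algorithm(v)    # current vertex is this trial's seed
    return votes > trials / 2        # majority vote over all trials
```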
- Theorem: after $7k$ samples, probability the majority is wrong is at most $2^{-k}$. So error $2^{-n}$ with $O(n)$ bits! (compare to $tn = O(n^2)$ bits for naive independent trials)
- Let $B$ be powered transition matrix
- let $p^{(i)}$ be the distribution of sample $i$, namely $p^{(0)}B^i$
- Let $W$ be the indicator matrix for good witnesses, namely $1$ at diagonal entry $i$ if seed $i$ is a witness; $\bar{W}=I-W$ indicates the complementary set.
- $\|p^{(i)}W\|_1$ is the probability that sample $i$ is a witness; similarly for nonwitnesses.
- Consider a sequence of $7k$ results “witness or not”
- represent as matrices $S=(S_1,\ldots,S_{7k}) \in \{W,\bar{W}\}^{7k}$
- claim \[ \Pr[S] = \|p^{(0)}(BS_1)(BS_2)\cdots(BS_{7k})\|_1. \] (draw layered graph. sums prob. of paths through correct sequence of witness/nonwitness)
- Use $2$-norm since easier to work with.
- Note $\|A\|_1 \le \sqrt{N}\|A\|_2$
- For fixed sum of values, minimize sum of square by setting all equal
- ie, for sum $\alpha$, set all equal to $\alpha/N$
- 2-norm $\alpha/\sqrt{N}$
- defer: $\|pBW\|_2 \le \|p\|_2$ and $\|pB\bar{W}\|_2 \le \frac15\|p\|_2$
- deduce if more than $7k/2$ of the $S_i$ equal $\bar{W}$, $$ \begin{eqnarray*} \|p^{(0)}\prod BS_i\|_1 &\le & \sqrt{N}\|p^{(0)}\prod BS_i\|\\ &\le & \sqrt{N}(\frac15)^{7k/2}\|p^{(0)}\|\\ &= &(\frac15)^{7k/2} \end{eqnarray*} $$ (since $\|p^{(0)}\|=1/\sqrt{N}$ for the uniform start)
- At the same time, only $2^{7k}$ possible bad sequences, so error prob. at most $2^{7k}5^{-7k/2} = (2/\sqrt{5})^{7k} \le 2^{-k}$
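A sanity check of the layered-graph claim above, under toy assumptions (a small doubly stochastic $B$ built from random permutations, 2 bad states out of 6): summing $\|p^{(0)}(BS_1)\cdots(BS_t)\|_1$ over all $2^t$ witness patterns must give exactly 1.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
N, t = 6, 4
# Average of random permutation matrices: doubly stochastic stand-in for B.
B = sum(np.eye(N)[rng.permutation(N)] for _ in range(4)) / 4

bad = np.arange(N) < 2
W, Wbar = np.diag((~bad).astype(float)), np.diag(bad.astype(float))

p0 = np.full(N, 1 / N)                  # uniform start
total = 0.0
for pattern in itertools.product([W, Wbar], repeat=t):
    v = p0
    for S in pattern:
        v = v @ B @ S                   # walk one step, then observe W/Wbar
    total += v.sum()                    # ||.||_1 of a nonnegative vector
print(f"sum over all sequences: {total:.6f}")  # -> 1.000000
```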
- proof of lemma:
- write $p=\sum a_ie_i$ with $e_1=\pi$
- obviously $\|pBW\| \le \|pB\|$ since $W$ just zeros some stuff out.
- But $\|pB\|=\sqrt{\sum a_i^2\lambda_i^2}\le\sqrt{\sum a_i^2} = \|p\|$ (eigenvectors orthonormal since $B$ is symmetric)
- Now write $p=a_1\pi+y$ where $y \cdot \pi=0$
- argue that $\|\pi B\bar{W}\|\le \|\pi \|/10\le \|p\|/10$ and $\|yB\bar{W}\|\le\|y\|/10\le\|p\|/10$ (here $\|\pi\|,\|y\|\le\|p\|$ by Pythagoras since $y\perp\pi$); the triangle inequality then gives $\|pB\bar{W}\|\le\|p\|/5$
- First $\pi$:
- recall $\pi B=\pi$ is uniform vector, all coords $1/N$, so norm $1/\sqrt{N}$
- $\bar{W}$ keeps only the $N/100$ bad coordinates, so
- $\|\pi \bar{W}\| = \sqrt{(N/100)(1/N)^2}=\frac{1}{10\sqrt{N}}=\|\pi\|/10$
- Now $y$: just note $\|yB\| \le \|y\|/10$ since $y\perp\pi$ and $\lambda_2(B) \le 1/10$; then $\bar{W}$ only zeros out entries
- summary: $\pi$ part likely to be in good witness set, $y$ part unlikely to be relevant.
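A numerical check of the two deferred norm bounds, under assumptions: a random symmetric doubly stochastic matrix stands in for the expander walk (uniform stationary distribution), powered until its second eigenvalue is at most $1/10$, with $1/100$ of the states marked bad.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200

# Symmetrized average of random permutation matrices: doubly stochastic.
P = sum(np.eye(N)[rng.permutation(N)] for _ in range(8)) / 8
P = (P + P.T) / 2

lam2 = np.sort(np.abs(np.linalg.eigvals(P)))[-2]
beta = int(np.ceil(np.log(1 / 10) / np.log(lam2)))   # lam2^beta <= 1/10
B = np.linalg.matrix_power(P, beta)

bad = np.arange(N) < N // 100                        # 1/100 of states are bad
W, Wbar = np.diag((~bad).astype(float)), np.diag(bad.astype(float))

p = rng.random(N)
p /= p.sum()                                         # an arbitrary distribution
print(np.linalg.norm(p @ B @ W) <= np.linalg.norm(p))         # ||pBW|| <= ||p||
print(np.linalg.norm(p @ B @ Wbar) <= np.linalg.norm(p) / 5)  # ||pBWbar|| <= ||p||/5
```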