Parallel Algorithms
Two closely related models of parallel computation.
Circuits
- Logic gates (AND/OR/not) connected by wires
- important measures
- number of gates
- depth (clock cycles in synchronous circuit)
PRAM
- $P$ processors, each with a RAM, local registers
- global memory of $M$ locations
- each processor can in one step do a RAM op or read/write to one global memory location
- synchronous parallel steps
- not realistic, but explores “degree of parallelism”
Essentially the same models, but let us focus on different things.
Circuits
- Logic gates (AND/OR/not) connected by wires
- important measures
- number of gates
- depth (clock cycles in synchronous circuit)
- bounded vs unbounded fan-in/out
- $AC(k)$ and $NC(k)$: polynomial-size circuits of unbounded and bounded fan-in respectively, with depth $O(\log^k n)$ for problems of size $n$
- $NC(k) \subseteq AC(k) \subseteq NC(k+1)$: simulate each unbounded fan-in gate by a full binary tree of bounded fan-in gates
- $NC=\cup NC(k)=\cup AC(k)$
Addition
- consider adding $a_i$ and $b_i$ with carry $c_{i-1}$ to produce output $s_i$ and next carry $c_i$
- Ripple carry: $O(n)$ gates, $O(n)$ time
- Carry lookahead: $O(n)$ gates, $O(\log n)$ time
- preplan for late arrival of $c_i$.
- given $a_i$ and $b_i$, three possible cases for $c_i$
- if $a_i=b_i$, then $c_i=a_i$ is determined without $c_{i-1}$: generate ($c_i=1$) or kill ($c_i=0$)
- otherwise, propagate $c_i=c_{i-1}$
- write $x_i=k,g,p$ accordingly
- consider $3 \times 3$ “multiplication table” for effect of two adders in a row.
- pair propagates previous carry only if both propagate.
$$\begin{array}{c|ccc}
 & x_i=k & x_i=p & x_i=g \\ \hline
x_{i-1}=k & k & k & g \\
x_{i-1}=p & k & p & g \\
x_{i-1}=g & k & g & g
\end{array}$$
- Now let $y_0=k$, $y_i = y_{i-1} \times x_i$
- the “multiplication table” is constrained: by induction, $y_{i-1}$ is never $p$
$$\begin{array}{c|ccc}
 & x_i=k & x_i=p & x_i=g \\ \hline
y_{i-1}=k & k & k & g \\
y_{i-1}=p & k & \text{never} & g \\
y_{i-1}=g & k & g & g
\end{array}$$
- conclude: $y_i=k$ means $c_i=0$, $y_i=g$ means $c_i=1$, and $y_i=p$ never happens
- so problem reduced to computing all $y_i$ in parallel
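A minimal sequential sketch of the carry computation via the $k/p/g$ reduction (function names are mine; the parallel circuit would compute the running composition with the prefix tree of the next section rather than a sequential loop):

```python
def classify(a, b):
    """Per-bit carry status from input bits a, b: kill, generate, or propagate."""
    return 'k' if (a, b) == (0, 0) else 'g' if (a, b) == (1, 1) else 'p'

def compose(x_prev, x_cur):
    """Effect of two adjacent positions in a row (the 3x3 table above)."""
    return x_prev if x_cur == 'p' else x_cur

def add_via_lookahead(a_bits, b_bits):
    """a_bits, b_bits: little-endian bit lists of equal length."""
    x = [classify(a, b) for a, b in zip(a_bits, b_bits)]
    carries, y = [], 'k'               # y_0 = k: no carry into position 0
    for xi in x:                       # sequential here; parallel prefix in the circuit
        y = compose(y, xi)
        carries.append(y)              # carries[i] is c_i (k means 0, g means 1)
    c_in = ['k'] + carries[:-1]        # carry into each position
    s = [a ^ b ^ (1 if c == 'g' else 0) for a, b, c in zip(a_bits, b_bits, c_in)]
    s.append(1 if carries[-1] == 'g' else 0)     # final carry-out
    return s

# 11 + 6 = 17 on 5-bit little-endian inputs.
assert add_via_lookahead([1, 1, 0, 1, 0], [0, 1, 1, 0, 0]) == [1, 0, 0, 0, 1, 0]
```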
Parallel prefix
- Build full binary tree
- two gates at each node
- pass up product of all children
- pass down product of all $x_i$ preceding leftmost child
- works for any associative function
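A sequential sketch of the tree-based prefix itself (the function name and recursive structure are mine); each level of the recursion stands for one layer of tree nodes working in parallel, and any associative operator works, e.g. ordinary addition or the carry composition above (whose identity element is $p$).

```python
from operator import add

def parallel_prefix(xs, op=add, identity=0):
    """Return ys with ys[i] = xs[0] op xs[1] op ... op xs[i] (tree style)."""
    n = len(xs)
    if n <= 1:
        return list(xs)
    # Up-sweep: combine adjacent pairs and recurse on the pair products.
    pairs = [op(xs[i], xs[i + 1]) for i in range(0, n - 1, 2)]
    if n % 2:
        pairs.append(xs[-1])           # odd leftover element
    pair_prefix = parallel_prefix(pairs, op, identity)
    # Down-sweep: combine each leaf with the product of everything before it.
    ys = []
    for i in range(n):
        if i % 2 == 0:
            before = identity if i == 0 else pair_prefix[i // 2 - 1]
            ys.append(op(before, xs[i]))
        else:
            ys.append(pair_prefix[i // 2])
    return ys

assert parallel_prefix([1, 2, 3, 4, 5, 6, 7, 8]) == [1, 3, 6, 10, 15, 21, 28, 36]
```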
PRAM
various conflict-resolution rules: EREW, CREW, CRCW (exclusive vs. concurrent reads and writes)
- $CRCW(k) \subset EREW(k+1)$
- $NC = \cup CRCW(k)$
PRAMs simulate circuits, and vice versa
- So $NC$ well-defined
differences in practice
- CRCW can OR in $O(1)$ time
- EREW requires $\log n$ time (info theory lower bound)
- EREW needs $\log n$ to find max
- CRCW finds max in constant time with $n^2$ processors
- Compare every pair
- If an item loses, write “not max” in its entry
- Check all entries
- If item is max (not overwritten), write its value in answer
- in $O(\log\log n)$ time with $n$ processors
- Suppose $k$ items remain
- Make $k^2/n$ blocks of $n/k$ items
- constant-time (quadratic-processor) max for each: $(k^2/n)(n/k)^2=n$ processors total
- recurrence: $T(k)=1+T(k^2/n)$
- $T(n/2^i)=1+T(n/2^{2i})$
- so $\log\log n$ iters.
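A sequential simulation of the constant-time CRCW max above (names are mine): the nested loops stand in for the $n^2$ pairwise comparisons done at once, and writing the same flag concurrently into an entry is exactly what CRCW permits.

```python
def crcw_max(a):
    """O(1)-time CRCW max: conceptually one processor per pair (i, j)."""
    n = len(a)
    is_max = [True] * n
    for i in range(n):                 # all n^2 comparisons "in parallel"
        for j in range(n):
            if a[i] < a[j] or (a[i] == a[j] and i < j):   # ties broken by index
                is_max[i] = False      # concurrent writes of the same value
    answer = None
    for i in range(n):                 # n parallel checks; the winner writes
        if is_max[i]:
            answer = a[i]
    return answer

assert crcw_max([3, 1, 4, 1, 5, 9, 2, 6]) == 9
```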
parallel prefix
- using $n$ processors
list ranking EREW
- next pointers $n(x)$
- $d(x)+=d(n(x))$; $n(x)=n(n(x))$.
- by induction, sum of values on path to end doesn't change
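A sequential sketch of this pointer-jumping list ranking (Wyllie-style; array names are mine). Each round snapshots the arrays so every element updates from the same old values, mimicking a synchronous EREW step.

```python
def list_rank(nxt):
    """Rank of each element = number of hops to the end of the list.

    nxt[x] is x's successor, or None at the tail.
    Each round: d(x) += d(n(x)); n(x) = n(n(x)).
    """
    n = len(nxt)
    d = [0 if nxt[x] is None else 1 for x in range(n)]
    nxt = list(nxt)
    for _ in range(max(1, n.bit_length())):      # O(log n) rounds suffice
        new_d, new_nxt = list(d), list(nxt)      # snapshot = synchronous update
        for x in range(n):
            if nxt[x] is not None:
                new_d[x] = d[x] + d[nxt[x]]
                new_nxt[x] = nxt[nxt[x]]
        d, nxt = new_d, new_nxt
    return d

# List 0 -> 1 -> 2 -> 3.
assert list_rank([1, 2, 3, None]) == [3, 2, 1, 0]
```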
Work-Efficient Algorithms
Idea:
- We've seen parallel algorithms that are somewhat “inefficient”
- do more total work (processors times time) than sequential
- Ideal solution: arrange for total work to be proportional to best sequential work
- Work-Efficient Algorithm
- Then a small number of processors (or even 1) can “simulate” many processors in a fully efficient way
- Parallel analogue of “cache oblivious algorithm”---you write algorithm once for many processors; lose nothing when it gets simulated on fewer.
Brent's theorem
- Different perspective on work: count number of processors actually working in each time step.
- If algorithm does $x$ total work and critical path $t$
- Then $p$ processors take $x/p+t$ time
- So, if use $p=x/t$ processors, finish in time $t$ with efficient algorithm
- detail: assigning processors to tasks can be tricky if e.g. work emerges dynamically during execution
Work-efficient parallel prefix
- linear sequential work
- going to need $\log n$ time
- so, aim to get by with $n/\log n$ processors
- give each processor a block of $\log n$ items to add up
- reduces problem to $n/\log n$ values
- use old algorithm
- each processor fixes up prefixes for its block
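A sketch of this blocking scheme (names are mine). The middle step, a prefix over the roughly $n/\log n$ block sums, is written here with `itertools.accumulate` as a stand-in for running the earlier tree algorithm with $n/\log n$ processors.

```python
from functools import reduce
from itertools import accumulate
from math import ceil, log2
from operator import add

def work_efficient_prefix(xs, op=add):
    """Blocks of ~log n items: local sums, prefix over block sums, local fix-up."""
    n = len(xs)
    b = max(1, ceil(log2(n + 1)))                      # block size ~ log n
    blocks = [xs[i:i + b] for i in range(0, n, b)]
    # Step 1: each processor adds up its own block sequentially.
    block_sums = [reduce(op, blk) for blk in blocks]
    # Step 2: prefix over the n/log n block sums (tree algorithm in the PRAM).
    block_prefix = list(accumulate(block_sums, op))
    # Step 3: each processor fixes up the prefixes inside its own block.
    out = []
    for k, blk in enumerate(blocks):
        if k == 0:
            out.extend(accumulate(blk, op))
        else:
            out.extend(list(accumulate(blk, op, initial=block_prefix[k - 1]))[1:])
    return out

assert work_efficient_prefix(list(range(1, 11))) == [1, 3, 6, 10, 15, 21, 28, 36, 45, 55]
```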
Work-efficient list ranking
- harder: can't easily give contiguous “blocks” of $\log n$ to each processor (requires list ranking)
- However, assume items are in arbitrary order in an array of $n$ structs, so we can give $\log n$ distinct items to each processor.
- use random coin flips to knock out “alternate” items
- shortcut any item that is heads and has tails successor
- no two adjacent items are shortcut, so each survivor shortcuts over at most one item
- each item is shortcut out with constant probability ($1/4$), and these events are independent for alternating items
- so by Chernoff, at least a $1/16$ fraction of the items is shortcut out, with high probability
- “compact” remaining items into smaller array using parallel prefix on array of pointers (ignoring list structure) to collect only “marked” nodes and update their pointers
- let each processor handle $\log n$ (arbitrary) items
- $O(n/\log n)$ processors, $O(\log n)$ time
- After $O(\log\log n)$ rounds, number of items reduced to $n/\log n$
- use old algorithm
- result: $O(\log n \log\log n)$ time, $n/\log n$ processors
- to improve, use faster “compaction” algorithm to collect marked nodes: $O(\log\log n)$ time randomized, or $O(\log n/\log\log n)$ deterministic. get optimal alg.
- How about deterministic algorithm? Use “deterministic coin tossing”
- take all local maxima as part of ruling set.
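A sketch of one contraction round under the coin-flip rule above (names are mine). Only the splice itself is shown; the compaction into a smaller array via parallel prefix and the final pass that restores the ranks of shortcut-out items are omitted.

```python
import random

def contract_round(nxt, d):
    """One randomized round: an item is shortcut out when it flips heads
    and its successor flips tails, so no two adjacent items are removed
    and each survivor absorbs at most one removed successor."""
    n = len(nxt)
    heads = [random.random() < 0.5 for _ in range(n)]
    removed = [heads[x] and nxt[x] is not None and not heads[nxt[x]]
               for x in range(n)]
    new_nxt, new_d = list(nxt), list(d)
    for x in range(n):
        y = nxt[x]
        if y is not None and removed[y]:    # bypass the removed successor
            new_d[x] = d[x] + d[y]          # d(x) += d(n(x))
            new_nxt[x] = nxt[y]             # n(x) = n(n(x))
    return new_nxt, new_d, removed
```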
Euler tour to reduce to parallel prefix for
- inorder traversal of tree
- computing depth in tree
- numbering leaves
- work efficient
Expression Tree Evaluation
plus and times nodes.
Idea
- merge leaves into parents (good for bushy tree)
- pointer jump degree-2 nodes (good for tall trees)
- combine?
- how “pointer jump” an operator?
Generalize problem:
- “product lookahead”
- Each tree edge has a label $(a,b)$
- meaning that if subtree below evaluates to $y$ then value $(ay+b)$ should be passed up edge
Merging a single leaf
- method for eliminating all left-child leaves
- root $q$ with right child $p$ (product node) on edge labelled $(a_3,b_3)$
- $p$ has left child edge $(a_1,b_1)$ leaf $\ell$ with value $v$
- right child edge to $s$ with label $(a_2,b_2)$
- fold out $p$ and $\ell$, make $s$ a child of $q$
- what label of new edge?
- prepare for $s$ subtree to eval on $y$.
- choose $a,b$ such that $ay+b=a_3\cdot[(a_1v+b_1)\cdot(a_2y+b_2)]+b_3$
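Expanding this constraint (my algebra; the notes leave it implicit): once the leaf value $v$ is known, $c=a_1v+b_1$ is a constant, so
$$a_3[c\,(a_2y+b_2)]+b_3 = (a_3a_2c)\,y+(a_3b_2c+b_3),$$
i.e. the new edge gets label $a=a_3a_2(a_1v+b_1)$ and $b=a_3b_2(a_1v+b_1)+b_3$. If $p$ were a plus node instead, the same matching would give $a=a_3a_2$ and $b=a_3(a_1v+b_1+b_2)+b_3$.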
Parallelize
- keep tree full, so killing all leaves kills whole tree
- collapse a leaf and pointer jump parent
- problem: can't shunt both children of the same parent at once
- solution: number leaves in-order (Euler Tour)
- three step process:
- shunt odd-numbered left-child leaves
- shunt odd-numbered right-child leaves
- divide leaf-numbers by 2
- guarantees never simultaneously shunt siblings
Sorting
Parallelizing binary search
- placing one item into a sorted array takes $\log n$ time with 1 processor
- but with $k$ processors can improve to $\log_k n$
- compare to $k$ evenly spaced items
- use concurrent OR to find out which pair I'm between in $O(1)$ time
- recurse
- shows how to gain parallel speed by wasting work
- in particular, can find in $O(1)$ time with $n$ processors
- or even $\sqrt{n}$ processors
CREW Merge sort:
- merge two length-$k$ sequences using $k$ processors
- each element of first seq. uses binary search to find place in second
- so knows how many items smaller
- so knows rank in merged sequence: go there
- then do same for second list
- $O(\log k)$ time with $k$ processors
- handle all length-$k$ lists in parallel with $n$ processors
- total time $O(\sum_{i \le \lg n} \log 2^i)=O(\log^2 n)$ with $n$ processors
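A sequential sketch of the merge step (function name is mine): each element's conceptual processor binary-searches the other list and adds the count of smaller items there to its own index; ties are broken so the two searches never collide on an output slot.

```python
from bisect import bisect_left, bisect_right

def crew_merge(a, b):
    """Merge sorted lists a and b; every element computes its own output rank."""
    out = [None] * (len(a) + len(b))
    for i, x in enumerate(a):              # all searches run "in parallel"
        out[i + bisect_left(b, x)] = x     # a-elements go before equal b-elements
    for j, y in enumerate(b):
        out[j + bisect_right(a, y)] = y    # b-elements go after equal a-elements
    return out

assert crew_merge([1, 3, 5, 7], [2, 3, 6]) == [1, 2, 3, 3, 5, 6, 7]
```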
Better with more processors?
- use $n^2$ processors to do all pairwise comparisons in parallel
- For each item, count number of wins in parallel in $O(\log n)$ time
- This determines item's rank in output list
Use same ideas with fewer processors?
- Note can sort $\sqrt{n}$ items in $O(\log n)$ time
- Use insights to improve merge?
- Idea: nearby items in one list go to nearby places in merged list
- So don't do whole computation for every item
- Break $A$ into blocks and figure out roughly where each block goes in $B$
- Then spread each block to exact right places in $B$ (recursively)
Details: merge $n$ items of $A$ into $m$ items of $B$ in $O(\log\log n)$ time using $m+n$ processors
- choose $\sqrt{n}$ evenly spaced fenceposts $\alpha_i$ among $A$
- use $\sqrt{m}$ processors to find exact location of each in $B$
- total processors $\sqrt{nm} \le \max(n,m) \le n+m$
- split $B$ at locations of $\alpha_i$
- now know $\sqrt{n}$ items (between $\alpha_i$ and $\alpha_{i+1}$) go in each split piece of $B$
- So recurse: $T(n)=2+T(\sqrt{n})=O(\log\log n)$
Use in parallel merge sort: $O(\log n \log\log n)$ with $n$ processors.
- Cole shows how to “pipeline” merges, get optimal $O(\log n)$ time.
Qsort idea
- $\sqrt{n}$ pivots
- sort all in $O(\log n)$ time using $n$ processors
- binary search each item to locate position in pivots
- recurse
- $T(n) = O(\log n) + T(\sqrt{n}) = O(\log n)$
- rigorous analysis messy
- need to deal with variation in sizes of subproblems
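A sequential sketch of the $\sqrt{n}$-pivot recursion (essentially a sample sort; the handling of items equal to a pivot is my addition so the recursion always makes progress). In the PRAM version the pivot sort, the binary searches, and the recursive calls would all run in parallel.

```python
import random
from bisect import bisect_left
from math import isqrt

def sqrt_pivot_sort(items):
    """Recurse on ~sqrt(n) random pivots: T(n) = O(log n) + T(sqrt(n)) in parallel."""
    n = len(items)
    if n <= 4:
        return sorted(items)
    pivots = sorted(set(random.sample(items, isqrt(n))))
    between = [[] for _ in range(len(pivots) + 1)]     # open intervals between pivots
    equal = [[] for _ in range(len(pivots))]           # items equal to some pivot
    for x in items:                                    # binary search per item
        i = bisect_left(pivots, x)
        if i < len(pivots) and pivots[i] == x:
            equal[i].append(x)                         # set aside: guarantees progress
        else:
            between[i].append(x)
    out = []
    for i, bucket in enumerate(between):
        out.extend(sqrt_pivot_sort(bucket))            # recurse on each interval
        if i < len(equal):
            out.extend(equal[i])
    return out

data = [random.randrange(1000) for _ in range(300)]
assert sqrt_pivot_sort(data) == sorted(data)
```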
Connectivity and connected components
Linear time sequential trivial.
Directed
Squaring adjacency matrix
- $\log n$ time to reduce diameter to 1
- $mn$ processors for first iter, but adds edges
- so, $n^3$ processors
- and space grows to $n^2$ even if initial graph is sparse
- improvements to $mn$ processors
- But “transitive closure bottleneck” still bedevils parallel algs.
- Even if just want to find $s$-$t$ path we don't know any better
- Problem is that output has size $n^2$ and we don't know which part of it matters.
Undirected
Basic approach:
- Sets of connected vertices grouped as stars
- One root, all others parent-point to it
- Initially all vertices alone
- Edge “live” if connects two distinct stars
- Find live edges in constant time by checking roots
- For live edge with roots $u< v$, connect $u$ as child of $v$
- May be conflicts, but CRCW resolves
- Now get stars again
- Use pointer jumping
- Note: may have chains of links, so need $\log n$ jumps
- Every live star attached to another
- So number of stars decreases by a factor of 2
- $m+n$ processors, $\log^2 n$ time.
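A sequential simulation of this basic approach (names are mine): the edge loop stands in for the parallel hooking step, with later writes winning in place of CRCW arbitration, and the inner loop repeats pointer jumping until every vertex again points directly at a root.

```python
def connected_components(n, edges):
    """Label vertices by repeated hooking of stars plus pointer jumping."""
    parent = list(range(n))
    while True:
        hooked = False
        for u, v in edges:                    # conceptually all edges at once
            ru, rv = parent[u], parent[v]
            if ru != rv:                      # live edge
                lo, hi = min(ru, rv), max(ru, rv)
                parent[lo] = hi               # hook smaller root under larger
                hooked = True
        if not hooked:
            return parent
        changed = True
        while changed:                        # O(log n) rounds of jumping
            changed = False
            for x in range(n):
                if parent[parent[x]] != parent[x]:
                    parent[x] = parent[parent[x]]
                    changed = True

# Two components: {0,1,2,3} and {4,5}.
labels = connected_components(6, [(0, 1), (1, 2), (2, 3), (4, 5)])
assert len(set(labels)) == 2 and labels[0] == labels[3] and labels[4] == labels[5]
```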
Smarter: interleave hooking and jumping:
- Maintain set of rooted trees
- Each node points to parent
- Hook some trees together to make fewer trees
- Pointer jump (once) to make shallower trees
- Eventually, each connected component is one star
More details:
- “top” vertex: root or its children
- detect top vertex in constant time
- each vertex has label
- find root label of each top vertex
- Can detect if am star in constant time:
- no pointer double reaches root
- for each edge:
- If ends both on top, different nonstar components, then hook smaller index component to larger
- may be conflicting hooks; assume CRCW resolves
- indices prevent cycles
- If star points to non-star, hook it
- do one pointer jump
Potential function: height of live stars and tall trees
- Live stars get hooked to something (star or internal)
- But never hooked to a leaf, so hooking adds at most 1 to the target's height
- So sum of heights doesn't go up
- But now, every unit of height is in a tall tree
- Pointer doubling removes at least $1/3$ of the height of each tall tree
- So total height decreases by a constant factor each round
- So $\log n$ iterations
Improvements:
- $O((m+n)\alpha(m,n)/\log n)$ processors, $\log n$ time, CRCW
- Randomized $O(\log n)$, $O(m/\log n)$ processors, EREW
Randomization
Randomization in parallel:
- load balancing
- symmetry breaking
- isolating solutions
Classes:
- NC: poly processor, polylog steps
- RNC: with randomization; polylog runtime, Monte Carlo
- ZNC: Las Vegas NC
- immune to choice of R/W conflict resolution
Sorting
Quicksort in parallel:
- $n$ processors
- each takes one item, compares to splitter
- count number of predecessors less than splitter
- determines location of item in split
- total time $O(\log n)$
- combine: $O(\log n)$ per layer with $n$ processors
- problem: $\Omega(\log^2 n)$ time bound
- problem: $n\log^2 n$ work
Using many processors:
- do all $n^2$ comparisons
- use parallel prefix to count number of items less than each item
- $O(\log n)$ time
- or $O(n)$ time with $n$ processors
Combine with quicksort:
- Note: single pivot step inefficient: uses $n$ processors and $\log n$ time.
- Better: use $\sqrt{n}$ simultaneous pivots
- Choose $\sqrt{n}$ random items and sort fully to get $\sqrt{n}$ intervals
- For all $n$ items, use binary search to find right interval
- recurse
- $T(n)=O(\log n)+T(\sqrt{n})=O(\log n + \frac12\log n+\frac14\log n + \cdots)=O(\log n)$
Formal analysis:
- consider root-leaf path to any item $x$
- argue total number of parallel steps on path is $O(\log n)$
- consider item $x$
- claim splitter within $\alpha\sqrt{n}$ on each side
- since the probability that no splitter lands among the $\alpha\sqrt{n}$ items on a given side is at most $(1-\alpha\sqrt{n}/n)^{\sqrt{n}} \le e^{-\alpha}$
- fix $\gamma, d< 1/\gamma$
- define $\tau_k = d^k$
- define $\rho_k = n^{(2/3)^k}$ $(\rho_{k+1}=\rho_k^{2/3})$
- note size $\rho_k$ problem takes $\gamma^k\log n$ time
- note size $\rho_k$ problem odds of having child of size $>\rho_{k+1}$ is less than $e^{-\rho_k^{1/6}}$
- argue at most $d^k$ size-$\rho_k$ problems whp
- follows because probability of $d^k$ size-$\rho_k$ problems in a row is at most
- deduce runtime $\sum_k d^k\gamma^k\log n = \sum_k (d\gamma)^{k}\log n = O(\log n)$ since $d\gamma<1$
- note: as problem shrinks, allowing more divergence in quantity for whp result
- minor detail: “whp” dies for small problems
- OK: if problem size $\log n$, finish in $\log n$ time with $\log n$ processors
Maximal independent set
trivial sequential algorithm
- inherently sequential
- from a node's point of view: each thinks it can join the MIS if the others stay out
- randomization breaks this symmetry
Randomized idea
- each node joins with some probability
- all neighbors excluded
- many nodes join
- few phases needed
Algorithm:
- all degree 0 nodes join
- node $v$ marks itself with probability $1/(2d(v))$
- if edge $(u,v)$ has both ends marked, unmark lower degree vertex
- put all marked nodes in IS
- delete all neighbors
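A sketch of one phase as just described (the dictionary representation and the tie-break between equal-degree endpoints are mine), plus a driver that repeats phases until the graph is empty.

```python
import random

def mis_phase(adj):
    """One phase; adj maps each live vertex to the set of its live neighbors."""
    joined = {v for v, nbrs in adj.items() if not nbrs}        # degree-0 vertices join
    marked = {v for v, nbrs in adj.items()
              if nbrs and random.random() < 1.0 / (2 * len(nbrs))}
    losers = set()                          # doubly marked edge: unmark lower degree
    for v in marked:
        for u in adj[v] & marked:
            losers.add(min(u, v, key=lambda w: (len(adj[w]), w)))
    joined |= marked - losers
    dead = set(joined)                      # delete IS vertices and their neighbors
    for v in joined:
        dead |= adj[v]
    remaining = {v: adj[v] - dead for v in adj if v not in dead}
    return joined, remaining

def randomized_mis(adj):
    """Repeat phases until the graph is empty; O(log n) phases w.h.p."""
    mis, adj = set(), {v: set(nbrs) for v, nbrs in adj.items()}
    while adj:
        joined, adj = mis_phase(adj)
        mis |= joined
    return mis

# 5-cycle: every maximal independent set has size 2.
cycle = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
assert len(randomized_mis(cycle)) == 2
```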
Intuition: $d$-regular graph
- vertex vanishes if it or neighbor gets chosen
- mark with probability $1/2d$
- prob (no neighbor marked) is $(1-1/2d)^d$, constant
- so const prob. of neighbor of $v$ marked---destroys $v$
- what about unmarking of $v$'s neighbor?
- prob(unmarking forced) is at most a constant ($\le 1/2$, as argued below)
- So just changes constants
- const fraction of nodes vanish: $O(\log n)$ phases
- Implementing a phase trivial in $O(\log n)$.
Prob chosen for IS, given marked, exceeds $1/2$
- suppose $w$ marked. only unmarked if higher degree neighbor marked
- higher degree neighbor marked with prob. $\le 1/2d(w)$
- only $d(w)$ neighbors
- prob. any superior neighbor marked at most $1/2$.
For general case, define good vertices
- good: at least $1/3$ neighbors have lower degree
- prob. no neighbor of good marked $\le (1-1/2d(v))^{d(v)/3} \le e^{-1/6}$.
- So some neighbor marked with prob. $1-e^{-1/6}$
- Stays marked with prob. 1/2
- deduce prob. good vertex killed exceeds $(1-e^{-1/6})/2$
- Problem: perhaps only one good vertex?
Good edges
- any edge with a good neighbor
- has const prob. to vanish
- show half edges good
- deduce $O(\log n)$ iterations.
Proof
- Let $V_B$ be bad vertices; we count edges with both ends in $V_B$.
- direct edges from lower- to higher-degree endpoint; $d_i$ is indegree, $d_o$ outdegree
- if $v$ bad, then $d_i(v) \le d(v)/3$
- deduce \[\sum_{V_B} d_i(v) \le \frac13 \sum_{V_B} d(v)=\frac13\sum_{V_B}( d_i(v)+d_o(v)) \]
- so $\sum_{V_B} d_i(v) \le \frac12 \sum_{V_B} d_o(v)$
- which means indegree can only “catch” half of outdegree; other half must go to good vertices.
more carefully,
- $d_o(v)-d_i(v) \ge \frac13(d(v))=\frac13(d_o(v)+d_i(v))$.
- Let $V_G,V_B$ be good, bad vertices
- degree of bad vertices is $$ \begin{eqnarray*} 2e(V_B,V_B)+e(V_B,V_G)+e(V_G,V_B) &= & \sum_{v\in V_B} d_o(v)+d_i(v)\\ &\le &3\sum_{V_B}(d_o(v)-d_i(v))\\ &= &3(e(V_B,V_G)-e(V_G,V_B))\\ &\le &3(e(V_B,V_G)+e(V_G,V_B)) \end{eqnarray*} $$ Deduce $e(V_B,V_B) \le e(V_B,V_G)+e(V_G,V_B)$. Result follows.
Derandomization:
- Analysis focuses on edges,
- so unsurprisingly, pairwise independence sufficient
- not immediately obvious, but again consider the $d$-regular case
- prob vertex marked $1/2d$
- neighbors $1,\ldots,d$ in increasing degree order
- Let $E_i$ be event that $i$ is marked.
- Let $E'_i$ be $E_i$ but no $E_j$ for $j< i$
- $A_i$: event that no neighbor of $i$ is marked
- Then prob eliminate $v$ at least $$ \begin{eqnarray*} \sum \Pr[E'_i \cap A_i] &= &\sum\Pr[E'_i]\Pr[A_i \mid E'_i]\\ &\ge &\sum \Pr[E'_i]\Pr[A_i] \end{eqnarray*} $$
- Wait: show $\Pr[A_i \mid E'_i] \ge \Pr[A_i]$
- true if independent
- measure $\Pr[\neg A_i \mid E'_i] \le \sum \Pr[E_w \mid E'_i]$ (sum over neighbors $w$ of $i$)
- measure $$ \begin{eqnarray*} \Pr[E_w \mid E'_i] &= &\frac{\Pr[E_w \cap E'_i]}{\Pr[E'_i]}\\ &= &\frac{\Pr[(E_w \cap \neg E_1 \cap \cdots) \cap E_i]}{\Pr[(\neg E_1 \cap \cdots) \cap E_i]} \\ &= &\frac{\Pr[E_w \cap \neg E_1 \cap \cdots \mid E_i]}{\Pr[\neg E_1 \cap \cdots \mid E_i]} \\ &\le &\frac{\Pr[E_w \mid E_i]}{1-\sum_{j< i}\Pr[E_j \mid E_i]}\\ &= &\Theta(\Pr[E_w]) \end{eqnarray*} $$ (last step assumes $d$-regular, so only $d$ neighbors, each marked with probability $1/2d$)
- But expected marked neighbors $1/2$, so by Markov $\Pr[A_i]>1/2$
- so prob eliminate $v$ exceeds $\sum\Pr[E'_i]=\Pr[\cup E_i]$
- lower bound as $\sum\Pr[E_i]-\sum\Pr[E_i \cap E_j] = 1/2-d(d-1)/8d^2 > 1/4$
- so $1/2d$ prob. $v$ marked but no neighbor marked, so $v$ chosen
- Generate pairwise independent with $O(\log n)$ bits
- try all polynomial seeds in parallel
- one works
- gives deterministic $NC$ algorithm
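One standard way to get such bits (an assumption on my part; the notes do not fix a construction): a random affine map over a prime field with $p>n$ gives pairwise independent values from a two-element seed, i.e. $O(\log n)$ random bits, and the polynomially many seeds can all be tried in parallel.

```python
import random

def pairwise_marks(n, p, a, b, threshold):
    """Marks n items from the short seed (a, b), p prime, p > n.

    X_v = (a*v + b) mod p is uniform, and (X_u, X_v) is uniform over pairs
    for u != v, so the events {X_v < threshold} are pairwise independent,
    each with probability threshold/p."""
    return [((a * v + b) % p) < threshold for v in range(n)]

# Mark each of n vertices with probability ~1/(2d) for d = 4.
n, p, d = 10, 101, 4
a, b = random.randrange(p), random.randrange(p)
marks = pairwise_marks(n, p, a, b, threshold=p // (2 * d))
# For derandomization, try all p^2 seeds in parallel and keep one that
# eliminates enough vertices.
```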
with care, $O(m)$ processors and $O(\log n)$ time (randomized)
LFMIS (lexicographically first MIS) is P-complete.
Perfect Matching
We focus on bipartite; book does general case.
Last time, saw detection algorithm in $RNC$:
- Tutte matrix
- Symbolic determinant is nonzero iff a PM exists
- assign random values in $1,\ldots,2m$
- Matrix Mul, Determinant in $NC$
How about finding one?
- If unique, no problem
- Since only one nonzero term, OK to replace each entry by a 1.
- Remove each edge, see if still PM in parallel
- multiplies processors by $m$
- but still $NC$
Idea:
- make unique minimum weight perfect matching
- find it
Isolating lemma: [MVV]
- Family of distinct sets over $x_1,\ldots,x_m$
- assign random weights in $1,\ldots,2m$
- Pr(unique min-weight set)$\ge 1/2$
- Odd: no dependence on number of sets!
- (of course $< 2^m$)
Proof:
- Fix item $x_i$
- $Y$ is min-value sets containing $x_i$
- $N$ is min-value sets not containing $x_i$
- true min-sets are either those in $Y$ or in $N$
- how decide? Value of $x_i$
- For $x_i=-\infty$, min-sets are $Y$
- For $x_i=+\infty$, min-sets are $N$
- As increase from $-\infty$ to $\infty$, single transition value when both $Y$ and $N$ are min-weight
- If only $Y$ min-weight, then $x_i$ in every min-set
- If only $N$ min-weight, then $x_i$ in no min-set
- If both min-weight, $x_i$ is ambiguous
- Suppose no $x_i$ ambiguous. Then min-weight set unique!
- Exactly one value for $x_i$ makes it ambiguous given remainder
- So Pr(ambiguous)$=1/2m$
- So Pr(any ambiguous)$< m/2m=1/2$
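A small empirical check of the lemma (illustrative only; the family of all 3-element subsets is an arbitrary choice of mine):

```python
import random
from itertools import combinations

def unique_min_fraction(m=8, trials=2000):
    """Fraction of trials in which the minimum-weight set in the family
    is unique, with weights drawn uniformly from {1,...,2m}.  The lemma
    guarantees at least 1/2, independent of the family's size."""
    family = [frozenset(s) for s in combinations(range(m), 3)]
    unique = 0
    for _ in range(trials):
        w = [random.randint(1, 2 * m) for _ in range(m)]
        totals = sorted(sum(w[i] for i in s) for s in family)
        unique += totals[0] < totals[1]
    return unique / trials

print(unique_min_fraction())   # typically well above the 1/2 guarantee
```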
Usage:
- Consider Tutte matrix $A$
- Assign random value $2^{w_i}$ to $x_i$, with $w_i \in 1,\ldots,2m$
- Weight of matching is $2^{\sum w_i}$
- Let $W$ be minimum sum
- Unique w/pr $1/2$
- If so, determinant is odd multiple of $2^W$
- Try removing edges one at a time
- Edge in PM iff new determinant$/2^W$ is even.
- Big numbers? No problem: values have poly number of bits
$NC$ algorithm open.
For exact matching, $P$ algorithm open.