Parallel Algorithms
Two closely related models of parallel computation.
Circuits
 Logic gates (AND/OR/not) connected by wires
 important measures
 number of gates
 depth (clock cycles in synchronous circuit)
PRAM
 $P$ processors, each with a RAM, local registers
 global memory of $M$ locations
 each processor can in one step do a RAM op or read/write to one global memory location
 synchronous parallel steps
 not realistic, but explores “degree of parallelism”
Essentially the same models, but let us focus on different things.
Circuits
 Logic gates (AND/OR/not) connected by wires
 important measures
 number of gates
 depth (clock cycles in synchronous circuit)
 bounded vs unbounded fanin/out
 $AC(k)$ and $NC(k)$: unbounded and bounded fanin with depth $O(\log^k n)$ for problems of size $n$
 $NC(k) \subseteq AC(k) \subseteq NC(k+1)$, the latter by replacing each unbounded-fanin gate with a full binary tree
 $NC=\cup NC(k)=\cup AC(k)$
Addition
 consider adding $a_i$ and $b_i$ with carry $c_{i-1}$ to produce output $s_i$ and next carry $c_i$
 Ripple carry: $O(n)$ gates, $O(n)$ time
 Carry lookahead: $O(n)$ gates, $O(\log n)$ time
 preplan for late arrival of $c_i$.
 given $a_i$ and $b_i$, three possible cases for $c_i$
 if $a_i=b_i$, then $c_i=a_i$ is determined without $c_{i-1}$: generate ($c_i=1$) or kill ($c_i=0$)
 otherwise, propagate: $c_i=c_{i-1}$
 write $x_i=k,g,p$ accordingly
 consider $3 \times 3$ “multiplication table” for effect of two adders in a row.
 pair propagates previous carry only if both propagate.

               $x_i=k$  $x_i=p$  $x_i=g$
 $x_{i-1}=k$:    $k$      $k$      $g$
 $x_{i-1}=p$:    $k$      $p$      $g$
 $x_{i-1}=g$:    $k$      $g$      $g$

 Now let $y_0=k$, $y_i = y_{i-1} \times x_i$
 constrain the “multiplication table” by induction: $y_{i-1}$ is only ever $k$ or $g$

               $x_i=k$  $x_i=p$  $x_i=g$
 $y_{i-1}=k$:    $k$      $k$      $g$
 $y_{i-1}=p$:    $k$     never     $g$
 $y_{i-1}=g$:    $k$      $g$      $g$

 conclude: $y_i=k$ means $c_i=0$, $y_i=g$ means $c_i=1$, and $y_i=p$ never happens
 so problem reduced to computing all $y_i$ in parallel
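The reduction can be sketched in plain Python (a sequential sketch only: the $y_i$ are computed left to right here, where the real algorithm gets them from parallel prefix since composition is associative):

```python
# Sequential sketch of the kill/propagate/generate carry reduction.

def status(a, b):
    """Carry status of one bit position: generate, kill, or propagate."""
    if a == b:
        return 'g' if a == 1 else 'k'
    return 'p'

def compose(prev, cur):
    """Status of two positions in a row: cur wins unless it propagates."""
    return prev if cur == 'p' else cur

def add_via_statuses(a_bits, b_bits):
    """Add little-endian bit lists; returns little-endian sum bits."""
    y, carries = 'k', []              # y_0 = k: no carry enters position 0
    for a, b in zip(a_bits, b_bits):
        carries.append(1 if y == 'g' else 0)   # carry INTO this position
        y = compose(y, status(a, b))           # y_i = y_{i-1} x x_i
    s = [(a + b + c) % 2 for a, b, c in zip(a_bits, b_bits, carries)]
    s.append(1 if y == 'g' else 0)             # final carry out
    return s
```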
Parallel prefix
 Build full binary tree
 two gates at each node
 pass up product of all children
 pass down product of all $x_i$ preceding leftmost child
 works for any associative function
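A minimal sketch of the tree computation, written recursively: the two halves are the independent subtrees that would run in parallel, and the combine step folds the left half's total into every right-half prefix in one parallel step.

```python
def prefix(xs, op):
    """All prefixes xs[0], op(xs[0], xs[1]), ... for any associative op."""
    n = len(xs)
    if n == 1:
        return list(xs)
    left = prefix(xs[:n // 2], op)      # these two recursive calls are
    right = prefix(xs[n // 2:], op)     # independent: O(log n) parallel depth
    return left + [op(left[-1], r) for r in right]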
PRAM
various conflict resolutions (CREW, EREW, CRCW)
 $CRCW(k) \subset EREW(k+1)$
 $NC = \cup CRCW(k)$
PRAMs simulate circuits, and vice versa
 So $NC$ well-defined
differences in practice
 CRCW can OR in $O(1)$ time
 EREW requires $\log n$ time (info theory lower bound)
 EREW needs $\log n$ to find max
 CRCW finds max in constant time with $n^2$ processors
 Compare every pair
 If an item loses, write “not max” in its entry
 Check all entries
 If item is max (not overwritten), write its value in answer
 in $O(\log\log n)$ time with $n$ processors
 Suppose $k$ items remain
 Make $k^2/n$ blocks of $n/k$ items
 constant-time quadratic-processor max for each: $(k^2/n)(n/k)^2=n$ processors total
 recurrence: $T(k)=1+T(k^2/n)$
 $T(n/2^i)=1+T(n/2^{2i})$
 so $\log\log n$ iters.
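The $n^2$-processor constant-time step can be simulated sequentially (a sketch, not a real PRAM; the two loops each stand for one synchronous parallel step):

```python
def crcw_max(items):
    """Simulate the constant-time CRCW max: in one parallel step every
    pair is compared and each loser's flag is overwritten; concurrent
    writers all write the same value, so any CRCW rule is satisfied."""
    n = len(items)
    not_max = [False] * n
    for i in range(n):         # conceptually one parallel step: n^2 procs
        for j in range(n):
            if items[j] > items[i]:
                not_max[i] = True
    for i in range(n):         # second parallel step: a survivor writes
        if not not_max[i]:
            return items[i]
```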
parallel prefix
 using $n$ processors
list ranking EREW
 next pointers $n(x)$
 $d(x)+=d(n(x))$; $n(x)=n(n(x))$.
 by induction, sum of values on path to end doesn't change
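A sequential simulation of the pointer-jumping rounds (each list comprehension reads the old arrays before either is overwritten, mimicking a synchronous EREW step):

```python
def list_rank(nxt):
    """EREW list ranking by pointer jumping. nxt[i] is i's successor
    index (the tail points to itself). Returns each node's distance to
    the tail; ceil(log2 n) jumping rounds suffice."""
    n = len(nxt)
    d = [0 if nxt[i] == i else 1 for i in range(n)]
    nxt = list(nxt)
    for _ in range(n.bit_length()):
        d = [d[i] + d[nxt[i]] for i in range(n)]   # d(x) += d(n(x))
        nxt = [nxt[nxt[i]] for i in range(n)]      # n(x) = n(n(x))
    return d
```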
Work-Efficient Algorithms
Idea:
 We've seen parallel algorithms that are somewhat “inefficient”
 do more total work (processors times time) than sequential
 Ideal solution: arrange for total work to be proportional to best sequential work
 Work-Efficient Algorithm
 Then a small number of processors (or even 1) can “simulate” many processors in a fully efficient way
 Parallel analogue of “cache oblivious algorithm”: you write the algorithm once for many processors and lose nothing when it gets simulated on fewer.
Brent's theorem
 Different perspective on work: count number of processors actually working in each time step.
 If algorithm does $x$ total work and critical path $t$
 Then $p$ processors take $x/p+t$ time
 So, if use $p=x/t$ processors, finish in time $t$ with efficient algorithm
 detail: assigning processors to tasks can be tricky if e.g. work emerges dynamically during execution
Work-efficient parallel prefix
 linear sequential work
 going to need $\log n$ time
 so, aim to get by with $n/\log n$ processors
 give each processor a block of $\log n$ items to add up
 reduces problem to $n/\log n$ values
 use old algorithm
 each processor fixes up prefixes for its block
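The three steps, simulated sequentially (each `for` over blocks stands for the per-processor parallel work; step 2 stands in for the old tree algorithm):

```python
import math

def we_prefix(xs, op):
    """Work-efficient prefix: blocks of ~log n items are each reduced by
    'one processor', a prefix runs over the n/log n block sums, then
    each processor sweeps its own block. Total work stays O(n)."""
    n = len(xs)
    blk = max(1, int(math.log2(n))) if n > 1 else 1
    blocks = [xs[i:i + blk] for i in range(0, n, blk)]
    sums = []
    for b in blocks:                  # step 1: per-processor block totals
        s = b[0]
        for x in b[1:]:
            s = op(s, x)
        sums.append(s)
    pre = [sums[0]]                   # step 2: the "old algorithm" on sums
    for s in sums[1:]:
        pre.append(op(pre[-1], s))
    out = []
    for k, b in enumerate(blocks):    # step 3: per-processor local fix-up
        acc = None if k == 0 else pre[k - 1]
        for x in b:
            acc = x if acc is None else op(acc, x)
            out.append(acc)
    return out
```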
Work-efficient list ranking
 harder: can't easily give contiguous “blocks” of $\log n$ to each processor (requires list ranking)
 However, assume items in arbitrary order in array of $n$ structs, so can give $\log n$ distinct items to each processor.
 use random coin flips to knock out “alternate” items
 shortcut any item that is heads and has tails successor
 each item shortcuts past at most one other
 each item is shortcut with constant probability $1/4$ (independently for alternate items)
 so by Chernoff, whp at least $1/16$ of items are shortcut out
 “compact” remaining items into smaller array using parallel prefix on array of pointers (ignoring list structure) to collect only “marked” nodes and update their pointers
 let each processor handle $\log n$ (arbitrary) items
 $O(n/\log n)$ processors, $O(\log n)$ time
 After $O(\log\log n)$ rounds, number of items reduced to $n/\log n$
 use old algorithm
 result: $O(\log n \log\log n)$ time, $n/\log n$ processors
 to improve, use faster “compaction” algorithm to collect marked nodes: $O(\log\log n)$ time randomized, or $O(\log n/\log\log n)$ deterministic. get optimal alg.
 How about deterministic algorithm? Use “deterministic coin tossing”
 take all local maxima as part of ruling set.
Euler tour to reduce to parallel prefix for
 inorder traversal of tree
 computing depth in tree
 numbering leaves
 work efficient
Expression Tree Evaluation
plus and times nodes.
Idea
 merge leaves into parents (good for bushy tree)
 pointer jump degree-2 nodes (good for tall trees)
 combine?
 how “pointer jump” an operator?
Generalize problem:
 “product lookahead”
 Each tree edge has a label $(a,b)$
 meaning that if subtree below evaluates to $y$ then value $(ay+b)$ should be passed up edge
Merging a single leaf
 method for eliminating all left-child leaves
 root $q$ with right child $p$ (product node) on edge labelled $(a_3,b_3)$
 $p$ has left child edge $(a_1,b_1)$ to leaf $\ell$ with value $v$
 right child edge to $s$ with label $(a_2,b_2)$
 fold out $p$ and $\ell$, make $s$ a child of $q$
 what label of new edge?
 prepare for $s$ subtree to eval on $y$.
 choose $a,b$ such that $ay+b=a_3\cdot[(a_1v+b_1)\cdot(a_2y+b_2)]+b_3$
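Expanding the right-hand side in $y$ gives the new label directly; a small sketch (variable names follow the notes):

```python
def shunt_label(a1, b1, v, a2, b2, a3, b3):
    """Label (a, b) for the new q-to-s edge after folding out product
    node p and its leaf of value v: expand
        a3*((a1*v + b1)*(a2*y + b2)) + b3
    as a linear form in y."""
    c = a1 * v + b1                    # the leaf's side, a known constant
    return (a3 * c * a2, a3 * c * b2 + b3)
```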
Parallelize
 keep tree full, so killing all leaves kills whole tree
 collapse a leaf and pointer jump parent
 problem: can't do to both children at once
 solution: number leaves inorder (Euler Tour)
 three step process:
 shunt odd-numbered left-child leaves
 shunt odd-numbered right-child leaves
 divide leaf-numbers by 2
 guarantees never simultaneously shunt siblings
Sorting
Parallelizing binary search
 placing one item into a sorted array takes $\log n$ time with 1 processor
 but with $k$ processors can improve to $\log_k n$
 compare to $k$ evenly spaced items
 use concurrent OR to find out which pair I'm between in $O(1)$ time
 recurse
 shows how to gain parallel speed by wasting work
 in particular, can find in $O(1)$ time with $n$ processors
 or even $\sqrt{n}$ processors
CREW Merge sort:
 merge two length-$k$ sequences using $k$ processors
 each element of first seq. uses binary search to find place in second
 so knows how many items smaller
 so knows rank in merged sequence: go there
 then do same for second list
 $O(\log k)$ time with $k$ processors
 handle all length-$k$ lists in parallel with $n$ processors
 total time $O(\sum_{i \le \lg n} \log 2^i)=O(\log^2 n)$ with $n$ processors
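The merge step, simulated sequentially (the left/right bisect asymmetry is an added detail so that equal items never collide; every search is an independent read, hence CREW):

```python
import bisect

def crew_merge(a, b):
    """CREW merge sketch: every element binary-searches the other sorted
    list; its output slot is its own index plus the count of smaller
    items in the other list. O(log k) parallel time on a CREW PRAM."""
    out = [None] * (len(a) + len(b))
    for i, x in enumerate(a):          # 'parallel' over the first list
        out[i + bisect.bisect_left(b, x)] = x
    for j, y in enumerate(b):          # 'parallel' over the second list
        out[j + bisect.bisect_right(a, y)] = y
    return out
```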
Better with more processors?
 use $n^2$ processors to do all pairwise comparisons in parallel
 For each item, count number of wins in parallel in $O(\log n)$ time
 This determines item's rank in output list
Use same ideas with fewer processors?
 Note can sort $\sqrt{n}$ items in $O(\log n)$ time
 Use insights to improve merge?
 Idea: nearby items in one list go to nearby places in merged list
 So don't do whole computation for every item
 Break $A$ into blocks and figure out roughly where each block goes in $B$
 Then spread each block to exact right places in $B$ (recursively)
Details: merge $n$ items of $A$ into $m$ items of $B$ in $O(\log\log n)$ time using $m+n$ processors
 choose $\sqrt{n}$ evenly spaced fenceposts $\alpha_i$ among $A$
 use $\sqrt{m}$ processors to find exact location of each in $B$
 total processors $\sqrt{nm} \le \max(n,m) \le n+m$
 split $B$ at locations of $\alpha_i$
 now know $\sqrt{n}$ items (between $\alpha_i$ and $\alpha_{i+1}$) go in each split piece of $B$
 So recurse: $T(n)=2+T(\sqrt{n})=O(\log\log n)$
Use in parallel merge sort: $O(\log n \log\log n)$ with $n$ processors.
 Cole shows how to “pipeline” merges, get optimal $O(\log n)$ time.
Qsort idea
 $\sqrt{n}$ pivots
 sort all in $O(\log n)$ time using $n$ processors
 binary search each item to locate position in pivots
 recurse
 $T(n) = O(\log n) + T(\sqrt{n}) = O(\log n)$
 rigorous analysis messy
 need to deal with variation in sizes of subproblems
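The idea can be sketched sequentially (the termination guard for duplicate-heavy buckets is an added detail, not part of the notes' analysis):

```python
import bisect, random

def sample_sort(xs):
    """Sketch of the sqrt(n)-pivot quicksort: sample ~sqrt(n) pivots,
    sort them recursively, bucket every item by binary search among the
    pivots, recurse on buckets. On a PRAM each item's search and each
    bucket would proceed in parallel."""
    n = len(xs)
    if n <= 4:
        return sorted(xs)
    pivots = sample_sort(random.sample(xs, int(n ** 0.5)))
    buckets = [[] for _ in range(len(pivots) + 1)]
    for x in xs:
        buckets[bisect.bisect_left(pivots, x)].append(x)
    out = []
    for b in buckets:
        # guard: with heavy duplicates a bucket can fail to shrink;
        # fall back to sorted() there so recursion always terminates
        out.extend(sorted(b) if len(b) == n else sample_sort(b))
    return out
```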
Connectivity and connected components
Linear time sequential trivial.
Directed
Squaring adjacency matrix
 $\log n$ time to reduce diameter to 1
 $mn$ processors for first iter, but adds edges
 so, $n^3$ processors
 and space to $n^2$ even if initial graph is sparse
 improvements to $mn$ processors
 But “transitive closure bottleneck” still bedevils parallel algs.
 Even if we just want to find an $s$-$t$ path we don't know any better
 Problem is that output has size $n^2$ and we don't know which part of it matters.
Undirected
Basic approach:
 Sets of connected vertices grouped as stars
 One root, all others parentpoint to it
 Initially all vertices alone
 Edge “live” if connects two distinct stars
 Find live edges in constant time by checking roots
 For live edge with roots $u< v$, connect $u$ as child of $v$
 May be conflicts, but CRCW resolves
 Now get stars again
 Use pointer jumping
 Note: may have chains of links, so need $\log n$ jumps
 Every live star attached to another
 So number of stars decreases by factor 2
 $m+n$ processors, $\log^2 n$ time.
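A sequential sketch of hook-and-jump (last-write-wins over the edge loop stands in for CRCW arbitration; hooking smaller root under larger keeps the pointer forest acyclic):

```python
def components(n, edges):
    """Hook-and-jump connectivity sketch. parent[] holds rooted stars;
    each pass hooks the smaller root under the larger across live edges,
    then pointer-jumps log n times so stars re-form. Returns each
    vertex's component root."""
    parent = list(range(n))
    changed = True
    while changed:
        changed = False
        for u, v in edges:                 # hook live edges
            ru, rv = parent[u], parent[v]  # roots, since we keep stars
            if ru != rv:
                parent[min(ru, rv)] = max(ru, rv)
                changed = True
        for _ in range(max(1, n).bit_length()):  # jumps collapse chains
            parent = [parent[parent[i]] for i in range(n)]
    return parent
```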
Smarter: interleave hooking and jumping:
 Maintain set of rooted trees
 Each node points to parent
 Hook some trees together to make fewer trees
 Pointer jump (once) to make shallower trees
 Eventually, each connected component is one star
More details:
 “top” vertex: root or its children
 detect top vertex in constant time
 each vertex has label
 find root label of each top vertex
 Can detect if am star in constant time:
 no pointer double reaches root
 for each edge:
 If both ends on top, in different nonstar components, then hook smaller index component to larger
 may be conflicting hooks; assume CRCW resolves
 indices prevent cycles
 If star points to nonstar, hook it
 do one pointer jump
Potential function: height of live stars and tall trees
 Live stars get hooked to something (star or internal)
 But never hooked to leaf. So add 1 to height of target
 So sum of heights doesn't go up
 But now, every unit of height is in a tall tree
 Pointer doubling decreases height by at least $1/3$
 Total height decreases by a constant factor each time
 So $\log n$ iterations
Improvements:
 $O((m+n)\alpha(m,n)/\log n)$ processors, $\log n$ time, CRCW
 Randomized $O(\log n)$, $O(m/\log n)$ processors, EREW
Randomization
Randomization in parallel:
 load balancing
 symmetry breaking
 isolating solutions
Classes:
 NC: poly processor, polylog steps
 RNC: NC with randomization; polylog runtime, Monte Carlo
 ZNC: Las Vegas NC
 immune to choice of R/W conflict resolution
Sorting
Quicksort in parallel:
 $n$ processors
 each takes one item, compares to splitter
 count number of predecessors less than splitter
 determines location of item in split
 total time $O(\log n)$
 combine: $O(\log n)$ per layer with $n$ processors
 problem: $\Omega(\log^2 n)$ time bound
 problem: $n\log^2 n$ work
Using many processors:
 do all $n^2$ comparisons
 use parallel prefix to count number of items less than each item
 $O(\log n)$ time
 or $O(n)$ time with $n$ processors
Combine with quicksort:
 Note: single pivot step inefficient: uses $n$ processors and $\log n$ time.
 Better: use $\sqrt{n}$ simultaneous pivots
 Choose $\sqrt{n}$ random items and sort fully to get $\sqrt{n}$ intervals
 For all $n$ items, use binary search to find right interval
 recurse
 $T(n)=O(\log n)+T(\sqrt{n})=O(\log n + \frac12\log n+\frac14\log n + \cdots)=O(\log n)$
Formal analysis:
 consider root-leaf path to any item $x$
 argue total number of parallel steps on path is $O(\log n)$
 consider item $x$
 claim splitter within $\alpha\sqrt{n}$ on each side
 since prob. of no splitter that close is at most $(1-\alpha\sqrt{n}/n)^{\sqrt{n}} \le e^{-\alpha}$
 fix $\gamma, d< 1/\gamma$
 define $\tau_k = d^k$
 define $\rho_k = n^{(2/3)^k}$ $(\rho_{k+1}=\rho_k^{2/3})$
 note size $\rho_k$ problem takes $\gamma^k\log n$ time
 note size $\rho_k$ problem odds of having child of size $>\rho_{k+1}$ is less than $e^{-\rho_k^{1/6}}$
 argue at most $d^k$ size$\rho_k$ problems whp
 follows because probability of $d^k$ size$\rho_k$ problems in a row is at most
 deduce runtime $\sum d^k\gamma^k\log n = \sum (d\gamma)^{k}\log n = O(\log n)$
 note: as problem shrinks, allowing more divergence in quantity for whp result
 minor detail: “whp” dies for small problems
 OK: once problem size is below $\log n$, finish in $\log n$ time with $\log n$ processors
Maximal independent set
trivial sequential algorithm
 inherently sequential
 from node point of view: each thinks can join MIS if others stay out
 randomization breaks this symmetry
Randomized idea
 each node joins with some probability
 all neighbors excluded
 many nodes join
 few phases needed
Algorithm:
 all degree 0 nodes join
 node $v$ marks itself with probability $1/2d(v)$
 if edge $(u,v)$ has both ends marked, unmark lower degree vertex
 put all marked nodes in IS
 delete all neighbors
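The phase loop, simulated sequentially (a sketch: the two-pass conflict resolution and the integer tie-break on equal degrees are added details so the simulation is well defined):

```python
import random

def luby_mis(adj):
    """Randomized MIS phases. adj maps vertex -> set of neighbors
    (mutated as vertices die). Each phase: isolated vertices join;
    others mark with prob 1/2d(v); on a doubly marked edge the
    lower-degree end unmarks; marked vertices join the IS and their
    neighborhoods are deleted."""
    live = set(adj)
    mis = set()
    while live:
        deg = {v: len(adj[v]) for v in live}
        marked = {v for v in live
                  if deg[v] == 0 or random.random() < 1 / (2 * deg[v])}
        conflicts = [(u, v) for u in marked for v in adj[u]
                     if v in marked and u < v]
        for u, v in conflicts:                 # unmark lower-degree end
            if u in marked and v in marked:
                marked.discard(u if (deg[u], u) < (deg[v], v) else v)
        mis |= marked
        dead = marked | {w for v in marked for w in adj[v]}
        live -= dead
        for v in live:
            adj[v] -= dead
    return mis
```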
Intuition: $d$-regular graph
 vertex vanishes if it or neighbor gets chosen
 mark with probability $1/2d$
 prob (no neighbor marked) is $(1-1/2d)^d$, constant
 so const prob. some neighbor of $v$ marked, which destroys $v$
 what about unmarking of $v$'s neighbor?
 prob(unmarking forced) only constant as argued above.
 So just changes constants
 const fraction of nodes vanish: $O(\log n)$ phases
 Implementing a phase trivial in $O(\log n)$.
Prob chosen for IS, given marked, exceeds $1/2$
 suppose $w$ marked. only unmarked if higher degree neighbor marked
 higher degree neighbor marked with prob. $\le 1/2d(w)$
 only $d(w)$ neighbors
 prob. any superior neighbor marked at most $1/2$.
For general case, define good vertices
 good: at least $1/3$ neighbors have lower degree
 prob. no neighbor of good vertex marked $\le (1-1/2d(v))^{d(v)/3} \le e^{-1/6}$.
 So some neighbor marked with prob. $\ge 1-e^{-1/6}$
 Stays marked with prob. 1/2
 deduce prob. good vertex killed exceeds $(1-e^{-1/6})/2$
 Problem: perhaps only one good vertex?
Good edges
 any edge with a good neighbor
 has const prob. to vanish
 show half edges good
 deduce $O(\log n)$ iterations.
Proof
 Let $V_B$ be bad vertices; we count edges with both ends in $V_B$.
 direct edges from lower to higher degree; $d_i$ is indegree, $d_o$ outdegree
 if $v$ bad, then $d_i(v) \le d(v)/3$
 deduce \[\sum_{V_B} d_i(v) \le \frac13 \sum_{V_B} d(v)=\frac13\sum_{V_B}( d_i(v)+d_o(v)) \]
 so $\sum_{V_B} d_i(v) \le \frac12 \sum_{V_B} d_o(v)$
 which means indegree can only “catch” half of outdegree; other half must go to good vertices.

more carefully,
 $d_o(v)-d_i(v) \ge \frac13 d(v)=\frac13(d_o(v)+d_i(v))$.
 Let $V_G,V_B$ be good, bad vertices
 degree of bad vertices is $$ \begin{eqnarray*} 2e(V_B,V_B)+e(V_B,V_G)+e(V_G,V_B) &= & \sum_{v\in V_B} d_o(v)+d_i(v)\\ &\le &3\sum_{v\in V_B}(d_o(v)-d_i(v))\\ &= &3(e(V_B,V_G)-e(V_G,V_B))\\ &\le &3(e(V_B,V_G)+e(V_G,V_B)) \end{eqnarray*} $$ Deduce $e(V_B,V_B) \le e(V_B,V_G)+e(V_G,V_B)$. result follows.
Derandomization:
 Analysis focuses on edges,
 so unsurprisingly, pairwise independence sufficient
 not immediately obvious, but again consider the $d$-regular case
 prob vertex marked $1/2d$
 neighbors $1,\ldots,d$ in increasing degree order
 Let $E_i$ be event that $i$ is marked.
 Let $E'_i$ be $E_i$ but no $E_j$ for $j< i$
 $A_i$ event no neighbor of $i$ chosen
 Then prob eliminate $v$ at least $$ \begin{eqnarray*} \sum \Pr[E'_i \cap A_i] &= &\sum\Pr[E'_i]\Pr[A_i \mid E'_i]\\ &\ge &\sum \Pr[E'_i]\Pr[A_i] \end{eqnarray*} $$
 Wait: show $\Pr[A_i \mid E'_i] \ge \Pr[A_i]$
 true if independent
 measure $\Pr[\neg A_i \mid E'_i] \le \sum \Pr[E_w \mid E'_i]$ (sum over neighbors $w$ of $i$)
 measure $$ \begin{eqnarray*} \Pr[E_w \mid E'_i] &= &\frac{\Pr[E_w \cap E'_i]}{\Pr[E'_i]}\\ &= &\frac{\Pr[(E_w \cap \neg E_1 \cap \cdots) \cap E_i]}{\Pr[(\neg E_1 \cap \cdots) \cap E_i]} \\ &= &\frac{\Pr[E_w \cap \neg E_1 \cap \cdots \mid E_i]}{\Pr[\neg E_1 \cap \cdots \mid E_i]} \\ &\le &\frac{\Pr[E_w \mid E_i]}{1-\sum_{j< i}\Pr[E_j \mid E_i]}\\ &= &\Theta(\Pr[E_w]) \end{eqnarray*} $$ (last step assumes $d$-regular so only $d$ neighbors with odds $1/2d$)
 But expected marked neighbors $1/2$, so by Markov $\Pr[A_i]>1/2$
 so prob eliminate $v$ exceeds $\sum\Pr[E'_i]=\Pr[\cup E_i]$
 lower bound as $\sum\Pr[E_i]-\sum_{i<j}\Pr[E_i \cap E_j] = 1/2-d(d-1)/8d^2 > 1/4$
 so $1/2d$ prob. $v$ marked but no neighbor marked, so $v$ chosen
 Generate pairwise independent with $O(\log n)$ bits
 try all polynomial seeds in parallel
 one works
 gives deterministic $NC$ algorithm
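The standard seed construction can be checked exhaustively in a few lines (a sketch: bit $i$ is the GF(2) inner product of the seed with $i$, so $n$ pairwise independent bits need only about $\log_2 n$ seed bits):

```python
def pi_bits(seed, n):
    """n pairwise independent unbiased bits from a ~log2(n)-bit seed:
    bit i is the parity of seed & i, i.e. the inner product over GF(2)
    of the seed with the index i (i = 1..n; i = 0 would be constant)."""
    return [bin(seed & i).count('1') % 2 for i in range(1, n + 1)]
```

Over all $2^3$ seeds and any two of the $n=7$ bit positions, each of the four value pairs occurs equally often, which is exactly pairwise independence.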
with care, $O(m)$ processors and $O(\log n)$ time (randomized)
LFMIS P-complete.
Perfect Matching
We focus on bipartite; book does general case.
Last time, saw detection algorithm in $RNC$:
 Tutte matrix
 Symbolic determinant nonzero iff PM
 assign random values in $1,\ldots,2m$
 Matrix Mul, Determinant in $NC$
How about finding one?
 If unique, no problem
 Since only one nonzero term, ok to replace each entry by a 1.
 Remove each edge, see if still PM in parallel
 multiplies processors by $m$
 but still $NC$
Idea:
 make unique minimum weight perfect matching
 find it
Isolating lemma: [MVV]
 Family of distinct sets over $x_1,\ldots,x_m$
 assign random weights in $1,\ldots,2m$
 Pr(unique minweight set)$\ge 1/2$
 Odd: no dependence on number of sets!
 (of course $< 2^m$)
Proof:
 Fix item $x_i$
 $Y$ is minvalue sets containing $x_i$
 $N$ is minvalue sets not containing $x_i$
 true minsets are either those in $Y$ or in $N$
 how decide? Value of $x_i$
 For $x_i=-\infty$, minsets are $Y$
 For $x_i=+\infty$, minsets are $N$
 As increase from $-\infty$ to $+\infty$, single transition value when both $Y$ and $N$ are minweight
 If only $Y$ minweight, then $x_i$ in every minset
 If only $N$ minweight, then $x_i$ in no minset
 If both minweight, $x_i$ is ambiguous
 Suppose no $x_i$ ambiguous. Then minweight set unique!
 Exactly one value for $x_i$ makes it ambiguous given remainder
 So Pr(ambiguous)$\le 1/2m$
 So Pr(any ambiguous)$\le m \cdot 1/2m=1/2$
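The lemma is easy to check empirically; a sketch (the family of all 3-subsets is an arbitrary illustrative choice, and the lemma's $1/2$ bound is what the estimate should clear):

```python
import random
from itertools import combinations

def unique_min_fraction(sets, m, trials=2000):
    """Empirical check of the isolating lemma: with weights uniform in
    1..2m, the min-weight set in the family should be unique with
    probability at least 1/2, regardless of how many sets there are."""
    hits = 0
    for _ in range(trials):
        w = [random.randint(1, 2 * m) for _ in range(m)]
        vals = sorted(sum(w[i] for i in s) for s in sets)
        hits += len(vals) == 1 or vals[0] < vals[1]
    return hits / trials
```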
Usage:
 Consider Tutte matrix $A$
 Assign random value $2^{w_i}$ to $x_i$, with $w_i \in 1,\ldots,2m$
 Weight of matching is $2^{\sum w_i}$
 Let $W$ be minimum sum
 Unique w/pr $1/2$
 If so, determinant is odd multiple of $2^W$
 Try removing edges one at a time
 Edge in PM iff new determinant$/2^W$ is even.
 Big numbers? No problem: values have poly number of bits
$NC$ algorithm open.
For exact matching, $P$ algorithm open.