2013 Lecture 29

### Parallel Algorithms

Two closely related models of parallel computation.

Circuits

• Logic gates (AND/OR/not) connected by wires
• important measures
• number of gates
• depth (clock cycles in synchronous circuit)

PRAM

• $P$ processors, each with a RAM, local registers
• global memory of $M$ locations
• each processor can in one step do a RAM op or read/write to one global memory location
• synchronous parallel steps
• not realistic, but explores “degree of parallelism”

Essentially the same models, but let us focus on different things.

#### Circuits

• Logic gates (AND/OR/not) connected by wires
• important measures
• number of gates
• depth (clock cycles in synchronous circuit)
• bounded vs unbounded fan-in/out
• $AC(k)$ and $NC(k)$: polynomial-size circuits of unbounded and bounded fan-in respectively, with depth $O(\log^k n)$ for problems of size $n$
• $NC(k) \subseteq AC(k) \subseteq NC(k+1)$: simulate each unbounded fan-in gate by a full binary tree of bounded fan-in gates
• $NC=\cup NC(k)=\cup AC(k)$

• consider adding $a_i$ and $b_i$ with carry $c_{i-1}$ to produce output $s_i$ and next carry $c_i$
• Ripple carry: $O(n)$ gates, $O(n)$ time
• Carry lookahead: $O(n)$ gates, $O(\log n)$ time
• preplan for the late arrival of the incoming carry $c_{i-1}$.
• given $a_i$ and $b_i$, three possible cases for $c_i$
• if $a_i=b_i$, then $c_i=a_i$ is determined without $c_{i-1}$: generate ($c_i=1$) or kill ($c_i=0$)
• otherwise, propagate $c_i=c_{i-1}$
• write $x_i=k,g,p$ accordingly
• consider the $3 \times 3$ “multiplication table” for the combined effect of two adjacent positions: the pair propagates the previous carry only if both propagate.

| $x_{i-1} \backslash x_i$ | $k$ | $p$ | $g$ |
|---|---|---|---|
| $k$ | $k$ | $k$ | $g$ |
| $p$ | $k$ | $p$ | $g$ |
| $g$ | $k$ | $g$ | $g$ |
• Now let $y_0=k$, $y_i = y_{i-1} \times x_i$
• by induction the “multiplication table” for $y_i=y_{i-1}\times x_i$ is constrained: the only entry that could produce $p$ never arises, since $y_0=k$

| $y_{i-1} \backslash x_i$ | $k$ | $p$ | $g$ |
|---|---|---|---|
| $k$ | $k$ | $k$ | $g$ |
| $p$ | $k$ | never | $g$ |
| $g$ | $k$ | $g$ | $g$ |
• conclude: $y_i=k$ means $c_i=0$, $y_i=g$ means $c_i=1$, and $y_i=p$ never happens
• so problem reduced to computing all $y_i$ in parallel
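As a concreteness check, here is a minimal sequential sketch of this reduction (function names are mine): compute each $x_i$ from the input bits, fold with the composition rule to get the $y_i$, and read off the carries and sum bits. In the circuit, the fold is done by the parallel-prefix tree described next.

```python
# Carry-lookahead via kill/propagate/generate symbols (sequential sketch).
def add_kpg(a_bits, b_bits):
    """a_bits, b_bits: least-significant-first lists of 0/1. Returns sum bits."""
    # x_i = 'k' (kill), 'g' (generate), or 'p' (propagate)
    x = ['k' if a + b == 0 else 'g' if a + b == 2 else 'p'
         for a, b in zip(a_bits, b_bits)]

    def compose(y_prev, x_i):
        # effect of two positions in a row: x_i wins unless it propagates
        return y_prev if x_i == 'p' else x_i

    # y_i = y_{i-1} * x_i with y_0 = 'k'; y_i = 'g' means carry-out c_i = 1
    y, carries = 'k', []
    for xi in x:
        y = compose(y, xi)
        carries.append(1 if y == 'g' else 0)   # 'p' never happens

    carry_in = [0] + carries[:-1]
    s = [(a + b + c) % 2 for a, b, c in zip(a_bits, b_bits, carry_in)]
    return s + [carries[-1]]                   # final carry becomes the top bit

# quick check against ordinary integer addition
bits = lambda v, n: [(v >> i) & 1 for i in range(n)]
val = lambda bs: sum(b << i for i, b in enumerate(bs))
assert val(add_kpg(bits(23, 6), bits(45, 6))) == 68
```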

Parallel prefix

• Build full binary tree
• two gates at each node
• pass up product of all children
• pass down product of all $x_i$ preceding leftmost child
• works for any associative function
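A sequential sketch of the same idea, written recursively (each recursion level corresponds to one pass-up and one pass-down level of the tree); `op` may be any associative function, and the code is illustrative rather than the circuit itself.

```python
# Parallel prefix over any associative operation (sequential sketch of the tree).
def prefix(xs, op):
    """Return [x0, x0*x1, ..., x0*...*x_{n-1}] via the up/down-sweep pattern."""
    n = len(xs)
    if n == 1:
        return xs[:]
    # pair up neighbors ("pass up product of children")
    pairs = [op(xs[i], xs[i + 1]) for i in range(0, n - 1, 2)]
    if n % 2:
        pairs.append(xs[-1])
    sums = prefix(pairs, op)              # prefixes of the pairs, recursively
    # "pass down product of everything preceding the leftmost child"
    out = []
    for i in range(n):
        if i % 2 == 0:
            out.append(xs[i] if i == 0 else op(sums[i // 2 - 1], xs[i]))
        else:
            out.append(sums[i // 2])
    return out

assert prefix([3, 1, 4, 1, 5], lambda a, b: a + b) == [3, 4, 8, 9, 14]
```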

#### PRAM

various conflict resolutions (CREW, EREW, CRCW)

• $CRCW(k) \subset EREW(k+1)$
• $NC = \cup CRCW(k)$

PRAMs simulate circuits, and vice versa

• So $NC$ well-defined

differences in practice

• CRCW can OR in $O(1)$ time
• EREW requires $\log n$ time (info theory lower bound)
• EREW needs $\log n$ to find max
• CRCW finds max in constant time with $n^2$ processors (sketch after this list)
• Compare every pair
• If an item loses, write “not max” in its entry
• Check all entries
• If item is max (not overwritten), write its value in answer
• in $O(\log\log n)$ time with $n$ processors
• Suppose $k$ items remain
• Make $k^2/n$ blocks of $n/k$ items
• constant-time max (quadratic processors) for each block: $(k^2/n)(n/k)^2=n$ processors total
• recurrence: $T(k)=1+T(k^2/n)$
• $T(n/2^i)=1+T(n/2^{2i})$
• so $\log\log n$ iters.
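A sequential simulation of the constant-time CRCW maximum mentioned above (each loop iteration stands in for a separate processor; all concurrent writes store the same value, so any write-resolution rule works).

```python
# CRCW "max in O(1)": all pairwise comparisons in parallel, losers marked.
def crcw_max(xs):
    n = len(xs)
    is_max = [True] * n                 # one cell per item in global memory
    # step 1: n^2 processors, one per pair (i, j); losers write "not max"
    for i in range(n):
        for j in range(n):
            if xs[i] < xs[j]:
                is_max[i] = False       # concurrent writes all write False: OK
    # step 2: the surviving item writes itself into the answer cell
    answer = None
    for i in range(n):
        if is_max[i]:
            answer = xs[i]
    return answer

assert crcw_max([7, 2, 9, 4]) == 9
```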

parallel prefix

• using $n$ processors

list ranking EREW

• next pointers $n(x)$; distance $d(x)$ initialized to 1 (0 for the last item)
• repeat $\log n$ times: $d(x)+=d(n(x))$; $n(x)=n(n(x))$.
• by induction, sum of values on path to end doesn't change
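A minimal sketch of pointer jumping for list ranking; the synchronous parallel step is written as a sequential pass that reads all old values before writing, and the convention `nxt[i] == i` for the tail is my own.

```python
import math

# EREW list ranking by pointer jumping: d(x) += d(n(x)); n(x) = n(n(x)).
def list_rank(nxt):
    """nxt[i] is the successor of node i; the last node points to itself.
    Returns d[i] = number of hops from i to the end of the list."""
    n = len(nxt)
    d = [0 if nxt[i] == i else 1 for i in range(n)]
    nxt = nxt[:]
    for _ in range(math.ceil(math.log2(max(n, 2)))):
        # one synchronous step: everyone reads old values, then writes
        d_new = [d[i] + d[nxt[i]] for i in range(n)]
        nxt_new = [nxt[nxt[i]] for i in range(n)]
        d, nxt = d_new, nxt_new
    return d

# list stored in arbitrary array order: 2 -> 0 -> 3 -> 1 (1 is the tail)
assert list_rank([3, 1, 0, 1]) == [2, 0, 3, 1]
```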

### Work-Efficient Algorithms

Idea:

• We've seen parallel algorithms that are somewhat “inefficient”
• do more total work (processors times time) than sequential
• Ideal solution: arrange for total work to be proportional to best sequential work
• Work-Efficient Algorithm
• Then a small number of processors (or even 1) can “simulate” many processors in a fully efficient way
• Parallel analogue of “cache oblivious algorithm”---you write algorithm once for many processors; lose nothing when it gets simulated on fewer.

Brent's theorem

• Different perspective on work: count number of processors actually working in each time step.
• If algorithm does $x$ total work and critical path $t$
• Then $p$ processors take $x/p+t$ time
• So, if use $p=x/t$ processors, finish in time $t$ with efficient algorithm
• detail: assigning processors to tasks can be tricky if e.g. work emerges dynamically during execution

Work-efficient parallel prefix

• linear sequential work
• going to need $\log n$ time
• so, aim to get by with $n/\log n$ processors
• give each processor a block of $\log n$ items to add up
• reduces problem to $n/\log n$ values
• use old algorithm
• each processor fixes up prefixes for its block
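A sketch of the blocking scheme, with each phase written as a plain loop that would in fact run in parallel, one (virtual) processor per block of roughly $\log n$ items; names are mine.

```python
from itertools import accumulate
from math import ceil, log2
from operator import add

# Work-efficient parallel prefix: n/log n blocks, each handled by one processor.
def block_prefix(xs, op=add):
    n = len(xs)
    b = max(1, ceil(log2(max(n, 2))))            # block size ~ log n
    blocks = [xs[i:i + b] for i in range(0, n, b)]
    # phase 1 (parallel over blocks): each processor sums its block
    totals = [list(accumulate(blk, op))[-1] for blk in blocks]
    # phase 2: prefix over the n/log n block totals (the "old algorithm")
    offsets = [None] + list(accumulate(totals, op))[:-1]
    # phase 3 (parallel over blocks): fix up prefixes inside each block
    out = []
    for off, blk in zip(offsets, blocks):
        local = accumulate(blk, op)
        out.extend(local if off is None else (op(off, v) for v in local))
    return out

assert block_prefix(list(range(1, 11))) == [1, 3, 6, 10, 15, 21, 28, 36, 45, 55]
```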

Work-efficient list ranking

• harder: can't easily give contiguous “blocks” of $\log n$ to each processor (requires list ranking)
• However, assume items in arbitrary order in array of $n$ structs, so can give $\log n$ distinct items to each processor.
• use random coin flips to knock out “alternate” items
• shortcut any item that is heads and has tails successor
• each item makes at most one shortcut
• each item is spliced out with constant probability, and these events are sufficiently independent
• so by Chernoff, at least a $1/16$ fraction of items are spliced out with high probability
• “compact” remaining items into smaller array using parallel prefix on array of pointers (ignoring list structure) to collect only “marked” nodes and update their pointers
• let each processor handle $\log n$ (arbitrary) items
• $O(n/\log n)$ processors, $O(\log n)$ time
• After $O(\log\log n)$ rounds, number of items reduced to $n/\log n$
• use old algorithm
• result: $O(\log n \log\log n)$ time, $n/\log n$ processors
• to improve, use faster “compaction” algorithm to collect marked nodes: $O(\log\log n)$ time randomized, or $O(\log n/\log\log n)$ deterministic. get optimal alg.
• How about deterministic algorithm? Use “deterministic coin tossing”
• take all local maxima as part of ruling set.
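A sketch of one randomized knockout round in plain sequential Python, assuming $d$ is initialized as in the basic algorithm (1 everywhere, 0 at the tail) and `nxt[i] == i` marks the tail; the compaction step is only indicated in a comment.

```python
import random

# One knockout round of work-efficient list ranking (sequential sketch).
# Nodes live in an array in arbitrary order; nxt[i] == i marks the tail.
def knockout_round(nxt, d, alive):
    n = len(nxt)
    heads = [random.random() < 0.5 for _ in range(n)]
    # parallel over live nodes: a heads node splices out a tails successor
    for x in range(n):
        if alive[x] and heads[x]:
            y = nxt[x]
            if y != x and alive[y] and not heads[y] and nxt[y] != y:
                d[x] += d[y]          # absorb the spliced node's distance
                nxt[x] = nxt[y]
                alive[y] = False      # y has a unique predecessor: no conflict
    # (the real algorithm now compacts the survivors into a smaller array with
    #  a parallel prefix sum, so processors can again grab ~log n items each)
    return sum(alive)

# usage: repeat rounds until ~n/log n nodes remain, then run the basic algorithm
```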

Euler tour to reduce to parallel prefix for

• inorder traversal of tree
• computing depth in tree
• numbering leaves
• work efficient

### Expression Tree Evaluation

plus and times nodes.

Idea

• merge leaves into parents (good for bushy tree)
• pointer jump degree-2 nodes (good for tall trees)
• combine?
• how “pointer jump” an operator?

Generalize problem:

• Each tree edge has a label $(a,b)$
• meaning that if subtree below evaluates to $y$ then value $(ay+b)$ should be passed up edge

Merging a single leaf

• method for eliminating all left-child leaves
• root $q$ with right child $p$ (product node) on edge labelled $(a_3,b_3)$
• $p$ has left child edge $(a_1,b_1)$ leaf $\ell$ with value $v$
• right child edge to $s$ with label $(a_2,b_2)$
• fold out $p$ and $\ell$, make $s$ a child of $q$
• what label of new edge?
• prepare for $s$ subtree to eval on $y$.
• choose $a,b$ such that $ay+b=a_3\cdot[(a_1v+b_1)\cdot(a_2y+b_2)]+b_3$
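Expanding the right-hand side shows the new label explicitly, so it is computable in constant time from known quantities:
$$a = a_3 a_2 (a_1 v + b_1), \qquad b = a_3 b_2 (a_1 v + b_1) + b_3.$$
If $p$ is instead a plus node, the same shunt works with $a = a_3 a_2$ and $b = a_3(a_1 v + b_1 + b_2) + b_3$.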

Parallelize

• keep tree full, so killing all leaves kills whole tree
• collapse a leaf and pointer jump parent
• problem: can't do to both children at once
• solution: number leaves in-order (Euler Tour)
• three step process:
• shunt odd-numbered left-child leaves
• shunt odd-number right-child leaves
• divide leaf-numbers by 2
• guarantees never simultaneously shunt siblings

### Sorting

Parallelizing binary search

• placing one item into a sorted array takes $\log n$ time with 1 processor
• but with $k$ processors can improve to $\log_k n$
• compare to $k$ evenly spaced items
• use concurrent OR to find out which pair I'm between in $O(1)$ time
• recurse
• shows how to gain parallel speed by wasting work
• in particular, can find in $O(1)$ time with $n$ processors
• or even $\sqrt{n}$ processors

CREW Merge sort:

• merge two length-$k$ sequences using $k$ processors
• each element of first seq. uses binary search to find place in second
• so knows how many items smaller
• so knows rank in merged sequence: go there
• then do same for second list
• $O(\log k)$ time with $k$ processors
• handle all length-$k$ lists in parallel with $n$ processors
• total time $O(\sum_{i \le \lg n} \log 2^i)=O(\log^2 n)$ with $n$ processors
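A sequential sketch of the rank computation (in the CREW algorithm every element performs its binary search simultaneously); ties are broken by letting equal elements of the first list precede those of the second, which keeps the output positions distinct.

```python
from bisect import bisect_left, bisect_right

# Rank-based merge: every element computes its output position independently.
def crew_merge(A, B):
    out = [None] * (len(A) + len(B))
    for i, a in enumerate(A):               # in parallel: one processor per item
        out[i + bisect_left(B, a)] = a      # items of B strictly smaller than a
    for j, b in enumerate(B):
        out[j + bisect_right(A, b)] = b     # items of A smaller or equal (ties)
    return out

assert crew_merge([1, 3, 5, 7], [2, 3, 6]) == [1, 2, 3, 3, 5, 6, 7]
```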

Better with more processors?

• use $n^2$ processors to do all pairwise comparisons in parallel
• For each item, count number of wins in parallel in $O(\log n)$ time
• This determines item's rank in output list
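The same idea as a tiny sequential sketch: with $n^2$ virtual processors, an item's output position is just its number of wins (ties broken by original index).

```python
# Rank sort: each item's rank is the number of comparisons it wins.
def rank_sort(xs):
    n = len(xs)
    out = [None] * n
    for i in range(n):                      # in parallel: n^2 comparisons
        rank = sum(1 for j in range(n)
                   if xs[j] < xs[i] or (xs[j] == xs[i] and j < i))
        out[rank] = xs[i]
    return out

assert rank_sort([3, 1, 4, 1, 5]) == [1, 1, 3, 4, 5]
```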

Use same ideas with fewer processors?

• Note can sort $\sqrt{n}$ items in $O(\log n)$ time
• Use insights to improve merge?
• Idea: nearby items in one list go to nearby places in merged list
• So don't do whole computation for every item
• Break $A$ into blocks and figure out roughly where each block goes in $B$
• Then spread each block to exact right places in $B$ (recursively)

Details: merge $n$ items of $A$ into $m$ items of $B$ in $O(\log\log n)$ time using $m+n$ processors

• choose $\sqrt{n}$ evenly spaced fenceposts $\alpha_i$ among $A$
• use $\sqrt{m}$ processors to find exact location of each in $B$
• total processors $\sqrt{nm}\le \max(n,m) \le n+m$
• split $B$ at locations of $\alpha_i$
• now know $\sqrt{n}$ items (between $\alpha_i$ and $\alpha_{i+1}$) go in each split piece of $B$
• So recurse: $T(n)=2+T(\sqrt{n})=O(\log\log n)$

Use in parallel merge sort: $O(\log n \log\log n)$ with $n$ processors.

• Cole shows how to “pipeline” merges, get optimal $O(\log n)$ time.

Qsort idea

• $\sqrt{n}$ pivots
• sort all in $O(\log n)$ time using $n$ processors
• binary search each item to locate position in pivots
• recurse
• $T(n) = O(\log n) + T(\sqrt{n}) = O(\log n)$
• rigorous analysis messy
• need to deal with variation in sizes of subproblems

### Connectivity and connected components

Linear time sequential trivial.

#### Directed

• repeated squaring of the adjacency matrix: $\log n$ squarings reduce the diameter to 1
• $mn$ processors for first iter, but adds edges
• so, $n^3$ processors
• and space to $n^2$ even if initial graph is sparse
• improvements to $mn$ processors
• But “transitive closure bottleneck” still bedevils parallel algs.
• Even if just want to find $s$-$t$ path we don't know any better
• Problem is that output has size $n^2$ and we don't know which part of it matters.

#### Undirected

Basic approach:

• Sets of connected vertices grouped as stars
• One root, all others parent-point to it
• Initially all vertices alone
• Edge “live” if connects two distinct stars
• Find live edges in constant time by checking roots
• For live edge with roots $u< v$, connect $u$ as child of $v$
• May be conflicts, but CRCW resolves
• Now get stars again
• Use pointer jumping
• Note: may have chains of links, so need $\log n$ jumps
• Every live star attached to another
• So number of stars decreases by a factor of 2
• $m+n$ processors, $\log^2 n$ time.
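A sequential sketch of this basic hook-and-contract loop (write conflicts that CRCW would resolve arbitrarily are resolved here by whichever write happens last; function and variable names are mine).

```python
# Basic connected components: repeatedly hook stars together, then re-star.
def connected_components(n, edges):
    parent = list(range(n))                     # every vertex its own star root
    def root(v):
        return parent[v]                        # parent of a star member is its root
    changed = True
    while changed:
        changed = False
        # hook: for each live edge, attach the smaller root under the larger
        for u, v in edges:                      # in parallel; CRCW resolves conflicts
            ru, rv = root(u), root(v)
            if ru != rv:
                lo, hi = min(ru, rv), max(ru, rv)
                parent[lo] = hi
                changed = True
        # pointer jumping until every tree is a star again (O(log n) jumps)
        while any(parent[parent[v]] != parent[v] for v in range(n)):
            parent = [parent[parent[v]] for v in range(n)]
    return parent                               # label = root of each component

assert connected_components(6, [(0, 1), (1, 2), (3, 4)]) == [2, 2, 2, 4, 4, 5]
```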

Smarter: interleave hooking and jumping:

• Maintain set of rooted trees
• Each node points to parent
• Hook some trees together to make fewer trees
• Pointer jump (once) to make shallower trees
• Eventually, each connected component is one star

More details:

• “top” vertex: root or its children
• detect top vertex in constant time
• each vertex has label
• find root label of each top vertex
• can detect whether a vertex is in a star in constant time:
• a tree is a star iff no vertex's grandparent differs from its parent; any vertex that sees a difference flags the root as non-star
• for each edge:
• if both ends are top vertices of different non-star components, hook the smaller-index component to the larger
• may be conflicting hooks; assume CRCW resolves
• indices prevent cycles
• If star points to non-star, hook it
• do one pointer jump

Potential function: height of live stars and tall trees

• Live stars get hooked to something (star or internal)
• but never hooked to a leaf, so hooking adds at most 1 to the height of the target
• So sum of heights doesn't go up
• But now, every unit of height is in a tall tree
• pointer doubling decreases the height of each tall tree by at least a third
• Total height divided each time
• So $\log n$ iterations

Summary: $O(m+n)$ processors, $O(\log n)$ time.

Improvements:

• $O((m+n)\alpha(m,n)/\log n)$ processors, $\log n$ time, CRCW
• Randomized $O(\log n)$, $O(m/\log n)$ processors, EREW

### Randomization

Randomization in parallel:

• symmetry breaking
• isolating solutions

Classes:

• NC: poly processor, polylog steps
• RNC: NC with randomization; polylog runtime, Monte Carlo
• ZNC: Las Vegas NC
• immune to choice of R/W conflict resolution

### Sorting

Quicksort in parallel:

• $n$ processors
• each takes one item and compares it to the splitter
• a parallel prefix counts, for each item, how many of its predecessors are less than the splitter
• which determines the item's location in the split
• total time $O(\log n)$
• combine: $O(\log n)$ per layer with $n$ processors
• problem: $\Omega(\log^2 n)$ time bound
• problem: $n\log^2 n$ work

Using many processors:

• do all $n^2$ comparisons
• use parallel prefix to count number of items less than each item
• $O(\log n)$ time
• or $O(n)$ time with $n$ processors

Combine with quicksort:

• Note: single pivot step inefficient: uses $n$ processors and $\log n$ time.
• Better: use $\sqrt{n}$ simultaneous pivots
• Choose $\sqrt{n}$ random items and sort fully to get $\sqrt{n}$ intervals
• For all $n$ items, use binary search to find right interval
• recurse
• $T(n)=O(\log n)+T(\sqrt{n})=O(\log n + \frac12\log n+\frac14\log n + \cdots)=O(\log n)$
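A sequential sketch of the $\sqrt{n}$-pivot recursion (the sort of the sample and the per-item binary searches are the steps that would run in parallel; the small-case and all-equal guards are mine).

```python
import random
from bisect import bisect_right
from math import isqrt

# Multi-pivot "parallel" quicksort: sqrt(n) random pivots per level.
def sqrt_quicksort(xs):
    n = len(xs)
    if n <= 16 or min(xs) == max(xs):
        return sorted(xs)                          # small problems: trivial
    pivots = sorted(random.sample(xs, isqrt(n)))   # the n-processor sort
    buckets = [[] for _ in range(len(pivots) + 1)]
    for x in xs:                                   # in parallel: binary search each item
        buckets[bisect_right(pivots, x)].append(x)
    # recurse on each interval; all recursions run side by side in parallel
    return [y for b in buckets for y in sqrt_quicksort(b)]

data = [random.randrange(10**6) for _ in range(2000)]
assert sqrt_quicksort(data) == sorted(data)
```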

Formal analysis:

• consider root-leaf path to any item $x$
• argue total number of parallel steps on path is $O(\log n)$
• consider item $x$
• claim splitter within $\alpha\sqrt{n}$ on each side
• since the probability that none of the $\sqrt{n}$ splitters lands among a given $\alpha\sqrt{n}$ consecutive items is at most $(1-\alpha\sqrt{n}/n)^{\sqrt{n}} \le e^{-\alpha}$
• fix $\gamma, d< 1/\gamma$
• define $\tau_k = d^k$
• define $\rho_k = n^{(2/3)^k}$ $(\rho_{k+1}=\rho_k^{2/3})$
• note a size-$\rho_k$ problem takes $O(\log \rho_k)=\gamma^k\log n$ time, where $\gamma=2/3$
• note size $\rho_k$ problem odds of having child of size $>\rho_{k+1}$ is less than $e^{-\rho_k^{1/6}}$
• argue at most $d^k$ size-$\rho_k$ problems whp
• follows because each problem of size more than $\rho_{k+1}$ fails to shrink below $\rho_{k+1}$ with probability at most $e^{-\rho_{k+1}^{1/6}}$, so the chance of $d^k$ such failures in a row is at most $(e^{-\rho_{k+1}^{1/6}})^{d^k}$, which is negligible
• deduce runtime $\sum_k d^k\gamma^k\log n = \sum_k (d\gamma)^{k}\log n = O(\log n)$ since $d\gamma<1$
• note: as problem shrinks, allowing more divergence in quantity for whp result
• minor detail: “whp” dies for small problems
• OK: if problem size $\log n$, finish in $\log n$ time with $\log n$ processors

### Maximal independent set

trivial sequential algorithm

• inherently sequential
• from node point of view: each thinks can join MIS if others stay out
• randomization breaks this symmetry

Randomized idea

• each node joins with some probability
• all neighbors excluded
• many nodes join
• few phases needed

Algorithm:

• all degree 0 nodes join
• node $v$ joins with probability $1/2d(v)$
• if edge $(u,v)$ has both ends marked, unmark lower degree vertex
• put all marked nodes in IS
• delete all neighbors
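One phase of this algorithm as a sequential sketch (the graph is kept as adjacency sets over the remaining vertices; ties between equal-degree marked neighbors are broken by vertex id, one legal choice).

```python
import random

# One phase of the randomized MIS algorithm.  adj maps each remaining vertex
# to the set of its remaining neighbors; independent_set accumulates the MIS.
def mis_phase(adj, independent_set):
    # degree-0 vertices join outright
    joined = {v for v, nbrs in adj.items() if not nbrs}
    # mark each other vertex with probability 1/(2 d(v))
    marked = {v for v, nbrs in adj.items()
              if nbrs and random.random() < 1 / (2 * len(nbrs))}
    # if both ends of an edge are marked, the lower-degree end is unmarked
    # (ties by vertex id); the surviving marked vertices are independent
    joined |= {v for v in marked
               if all((len(adj[v]), v) > (len(adj[u]), u)
                      for u in adj[v] if u in marked)}
    independent_set |= joined
    # delete the chosen vertices and all of their neighbors
    dead = set(joined)
    for v in joined:
        dead |= adj[v]
    for v in dead:
        del adj[v]
    for v in adj:
        adj[v] -= dead
    return bool(adj)                    # True while vertices remain

# usage: repeat phases until the graph is empty
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}, 4: set()}
mis = set()
while mis_phase(adj, mis):
    pass
# mis is now a maximal independent set of the original graph
```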

Intuition: $d$-regular graph

• vertex vanishes if it or neighbor gets chosen
• mark with probability $1/2d$
• prob (no neighbor marked) is $(1-1/2d)^d$, constant
• so const prob. of neighbor of $v$ marked---destroys $v$
• what about unmarking of $v$'s neighbor?
• prob(unmarking forced) is only a constant ($\le 1/2$, as argued below)
• So just changes constants
• const fraction of nodes vanish: $O(\log n)$ phases
• Implementing a phase trivial in $O(\log n)$.

Prob chosen for IS, given marked, is at least $1/2$

• suppose $w$ is marked; it is unmarked only if some higher-degree neighbor is also marked
• each higher-degree neighbor is marked with prob. $\le 1/2d(w)$
• only $d(w)$ neighbors
• prob. any superior neighbor marked at most $1/2$.

For general case, define good vertices

• good: at least $1/3$ neighbors have lower degree
• prob. no neighbor of good marked $\le (1-1/2d(v))^{d(v)/3} \le e^{-1/6}$.
• So some neighbor marked with prob. $1-e^{-1/6}$
• Stays marked with prob. 1/2
• deduce prob. good vertex killed exceeds $(1-e^{-1/6})/2$
• Problem: perhaps only one good vertex?

Good edges

• any edge with a good neighbor
• has const prob. to vanish
• show half edges good
• deduce $O(\log n)$ iterations.

Proof

• Let $V_B$ be bad vertices; we count edges with both ends in $V_B$.
• direct edges from lower- to higher-degree endpoints; $d_i$ is indegree, $d_o$ outdegree
• if $v$ bad, then $d_i(v) \le d(v)/3$
• deduce $\sum_{V_B} d_i(v) \le \frac13 \sum_{V_B} d(v)=\frac13\sum_{V_B}( d_i(v)+d_o(v))$
• so $\sum_{V_B} d_i(v) \le \frac12 \sum_{V_B} d_o(v)$
• which means indegree can only “catch” half of outdegree; other half must go to good vertices.
• more carefully,
• $d_o(v)-d_i(v) \ge \frac13(d(v))=\frac13(d_o(v)+d_i(v))$.
• Let $V_G,V_B$ be good, bad vertices
• degree of bad vertices is $$\begin{eqnarray*} 2e(V_B,V_B)+e(V_B,V_G)+e(V_G,V_B) &= & \sum_{v\in V_B} d_o(v)+d_i(v)\\ &\le &3\sum(d_o(v)-d_i(v))\\ &= &3(e(V_B,V_G)-e(V_G,V_B))\\ &\le &3(e(V_B,V_G)+e(V_G,V_B)) \end{eqnarray*}$$ Deduce $e(V_B,V_B) \le e(V_B,V_G)+e(V_G,V_B)$. Result follows.

Derandomization:

• Analysis focuses on edges,
• so unsurprisingly, pairwise independence sufficient
• not immediately obvious, but again consider the $d$-regular case
• prob vertex marked $1/2d$
• neighbors $1,\ldots,d$ in increasing degree order
• Let $E_i$ be event that $i$ is marked.
• Let $E'_i$ be $E_i$ but no $E_j$ for $j< i$
• $A_i$ event no neighbor of $i$ chosen
• Then prob eliminate $v$ at least $$\begin{eqnarray*} \sum \Pr[E'_i \cap A_i] &= &\sum\Pr[E'_i]\Pr[A_i \mid E'_i]\\ &\ge &\sum \Pr[E'_i]\Pr[A_i] \end{eqnarray*}$$
• Wait: show $\Pr[A_i \mid E'_i] \ge \Pr[A_i]$
• true if independent
• measure $\Pr[\neg A_i \mid E'_i] \le \sum \Pr[E_w \mid E'_i]$ (sum over neighbors $w$ of $i$)
• measure $$\begin{eqnarray*} \Pr[E_w \mid E'_i] &= &\frac{\Pr[E_w \cap E']}{\Pr[E'_i]}\\ &= &\frac{\Pr[(E_w \cap \neg E_1 \cap \cdots) \cap E_i]}{\Pr[(\neg E_1 \cap \cdots) \cap E_i]} \\ &= &\frac{\Pr[E_w \cap \neg E_1 \cap \cdots \mid E_i]}{\Pr[\neg E_1 \cap \cdots \mid E_i]} \\ &\le &\frac{\Pr[E_w \mid E_i]}{1-\sum_{j\le i}\Pr[E_j \mid E_i]}\\ &= &\Theta(\Pr[E_w]) \end{eqnarray*}$$ (last step assumes $d$-regular so only $d$ neighbors with odds $1/2d$)
• But expected marked neighbors $1/2$, so by Markov $\Pr[A_i]>1/2$
• so prob eliminate $v$ exceeds $\sum\Pr[E'_i]=\Pr[\cup E_i]$
• lower bound as $\sum\Pr[E_i]-\sum\Pr[E_i \cap E_j] = 1/2-d(d-1)/8d^2 > 1/4$
• so $1/2d$ prob. $v$ marked but no neighbor marked, so $v$ chosen
• Generate pairwise independent with $O(\log n)$ bits
• try all polynomial seeds in parallel
• one works
• gives deterministic $NC$ algorithm
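A sketch of one standard construction consistent with the argument above (the particular marking test shown is my own illustration): $y_i = a\,i + b \bmod p$ for a prime $p \ge n$ gives pairwise-independent uniform values from an $O(\log n)$-bit seed, and there are only $p^2$ seeds to try.

```python
from collections import Counter

# Pairwise-independent values from an O(log n)-bit seed: y_i = a*i + b mod p,
# p a prime >= n.  Only p^2 = poly(n) seeds exist, so a deterministic algorithm
# can try all of them in parallel; "mark vertex i with prob ~1/(2 d(i))" can be
# replaced by the test y_i < p/(2 d(i)).
def pairwise_independent(n, p, seed):
    a, b = seed
    return [(a * i + b) % p for i in range(n)]

# sanity check of pairwise independence: over all p^2 seeds, a fixed pair of
# outputs (here y_3 and y_7) takes each of the p^2 value pairs exactly once
p = 11
counts = Counter()
for a in range(p):
    for b in range(p):
        y = pairwise_independent(10, p, (a, b))
        counts[(y[3], y[7])] += 1
assert len(counts) == p * p and set(counts.values()) == {1}
```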

with care, $O(m)$ processors and $O(\log n)$ time (randomized)

LFMIS P-complete.

### Perfect Matching

We focus on bipartite; book does general case.

Last time, saw detection algorithm in $RNC$:

• Tutte matrix
• Symbolic determinant nonzero iff PM exists
• assign random values in $1,\ldots,2m$
• Matrix multiplication, determinant in $NC$

• If the PM is unique, no problem
• Since only one nonzero term, OK to replace each entry by a 1.
• Remove each edge, see if still PM in parallel
• multiplies processors by $m$
• but still $NC$

Idea:

• make unique minimum weight perfect matching
• find it

Isolating lemma: [MVV]

• Family of distinct sets over $x_1,\ldots,x_m$
• assign random weights in $1,\ldots,2m$
• Pr(unique min-weight set)$\ge 1/2$
• Odd: no dependence on number of sets!
• (of course $< 2^m$)
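A quick empirical check of the lemma (the choice of set family, all $k$-subsets of $[m]$, is mine): with weights drawn uniformly from $1,\ldots,2m$, the minimum-weight set is unique at least half the time, regardless of how many sets the family contains.

```python
import random
from itertools import combinations

# Empirical check of the isolating lemma on the family of all k-subsets of [m].
def unique_min_fraction(m, k, trials=2000):
    family = list(combinations(range(m), k))
    unique = 0
    for _ in range(trials):
        w = [random.randint(1, 2 * m) for _ in range(m)]   # weights on elements
        weights = sorted(sum(w[i] for i in s) for s in family)
        unique += weights[0] < weights[1]                   # min achieved once?
    return unique / trials

print(unique_min_fraction(10, 4))   # typically well above the guaranteed 0.5
```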

Proof:

• Fix item $x_i$
• $Y$ is min-value sets containing $x_i$
• $N$ is min-value sets not containing $x_i$
• true min-sets are either those in $Y$ or in $N$
• how decide? Value of $x_i$
• For $x_i=-\infty$, min-sets are $Y$
• For $x_i=+\infty$, min-sets are $N$
• As $x_i$ increases from $-\infty$ to $\infty$, there is a single transition value at which both $Y$ and $N$ are min-weight
• If only $Y$ is min-weight, then $x_i$ is in every min-set
• If only $N$ is min-weight, then $x_i$ is in no min-set
• If both are min-weight, $x_i$ is ambiguous
• Suppose no $x_i$ ambiguous. Then min-weight set unique!
• Exactly one value for $x_i$ makes it ambiguous, given the remaining weights
• So Pr(ambiguous) $\le 1/2m$
• So Pr(any ambiguous)$< m/2m=1/2$

Usage:

• Consider the Tutte matrix $A$
• Assign random value $2^{w_i}$ to $x_i$, with $w_i \in 1,\ldots,2m$
• Weight of matching is $2^{\sum w_i}$
• Let $W$ be minimum sum
• Unique w/pr $1/2$
• If so, determinant is odd multiple of $2^W$
• Try removing edges one at a time
• Edge in PM iff new determinant$/2^W$ is even.
• Big numbers? No problem: values have poly number of bits
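A sequential sketch of the whole pipeline for the bipartite case, under stated assumptions: the determinant is computed exactly over the rationals rather than by a parallel $NC$ routine, the edge tests that would run in parallel are a plain loop, and all names are mine.

```python
import random
from fractions import Fraction

def det(M):
    """Exact determinant of an integer matrix (Gaussian elimination over Q)."""
    n = len(M)
    A = [[Fraction(x) for x in row] for row in M]
    sign, prod = 1, Fraction(1)
    for c in range(n):
        pivot = next((r for r in range(c, n) if A[r][c]), None)
        if pivot is None:
            return 0
        if pivot != c:
            A[c], A[pivot] = A[pivot], A[c]
            sign = -sign
        prod *= A[c][c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            for k in range(c, n):
                A[r][k] -= f * A[c][k]
    return int(sign * prod)

def two_adic(x):
    """Largest power of 2 dividing x (x != 0)."""
    v = 0
    while x % 2 == 0:
        x //= 2
        v += 1
    return v

def find_pm(n, edges):
    """Try to extract a perfect matching of an n x n bipartite graph."""
    m = len(edges)
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 2 ** random.randint(1, 2 * m)    # edge weight w_e, value 2^w_e
    d = det(A)
    if d == 0:
        return None                                # no PM, or unlucky weights
    W = two_adic(d)                                # weight of the min-weight PM
    matching = []
    for i, j in edges:                             # in parallel: one test per edge
        saved, A[i][j] = A[i][j], 0
        d2 = det(A)
        A[i][j] = saved
        if d2 == 0 or two_adic(d2) > W:            # the 2^W term vanished
            matching.append((i, j))
    rows = {i for i, _ in matching}
    cols = {j for _, j in matching}
    if len(matching) == n and len(rows) == n and len(cols) == n:
        return matching
    return None                                    # isolation failed: retry

edges = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 2), (2, 0)]
pm = None
while pm is None:                                  # success prob >= 1/2 per attempt
    pm = find_pm(3, edges)
print(pm)
```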

$NC$ algorithm open.

For exact matching, $P$ algorithm open.