# Parallel Algorithms 6.854 Notes #32

## Parallel Algorithms

Two closely related models of parallel computation.

Circuits

• Logic gates (AND/OR/not) connected by wires

• important measures

• number of gates

• depth (clock cycles in synchronous circuit)

PRAM

• $$P$$ processors, each with a RAM, local registers

• global memory of $$M$$ locations

• each processor can in one step do a RAM op or read/write to one global memory location

• synchronous parallel steps

• not realistic, but explores “degree of parallelism”

Essentially the same models, but they let us focus on different things.

### Circuits


• bounded vs unbounded fan-in/out

• $$AC(k)$$ and $$NC(k)$$: unbounded and bounded fan-in with depth $$O(\log^k n)$$ for problems of size $$n$$

• $$NC(k) \subseteq AC(k) \subseteq NC(k+1)$$: an unbounded fan-in gate can be simulated by a full binary tree of bounded fan-in gates, multiplying depth by $$O(\log n)$$

• $$NC=\cup NC(k)=\cup AC(k)$$

• consider adding $$a_i$$ and $$b_i$$ with carry $$c_{i-1}$$ to produce output $$s_i$$ and next carry $$c_i$$

• Ripple carry: $$O(n)$$ gates, $$O(n)$$ time

• Carry lookahead: $$O(n)$$ gates, $$O(\log n)$$ time

• preplan for late arrival of $$c_{i-1}$$.

• given $$a_i$$ and $$b_i$$, three possible cases for $$c_i$$

• if $$a_i=b_i$$, then $$c_i=a_i$$ is determined without $$c_{i-1}$$: generate ($$c_i=1$$) or kill ($$c_i=0$$)

• otherwise, propagate $$c_i=c_{i-1}$$

• write $$x_i=k,g,p$$ accordingly

• consider $$3 \times 3$$ “multiplication table” for effect of two adders in a row.

• pair propagates previous carry only if both propagate.

| $$x_{i-1} \backslash x_i$$ | $$k$$ | $$p$$ | $$g$$ |
| --- | --- | --- | --- |
| $$k$$ | $$k$$ | $$k$$ | $$g$$ |
| $$p$$ | $$k$$ | $$p$$ | $$g$$ |
| $$g$$ | $$k$$ | $$g$$ | $$g$$ |
• Now let $$y_0=k$$, $$y_i = y_{i-1} \times x_i$$

• same “multiplication table”, except that by induction $$y_{i-1}=p$$ never arises

| $$y_{i-1} \backslash x_i$$ | $$k$$ | $$p$$ | $$g$$ |
| --- | --- | --- | --- |
| $$k$$ | $$k$$ | $$k$$ | $$g$$ |
| $$p$$ | $$k$$ | never | $$g$$ |
| $$g$$ | $$k$$ | $$g$$ | $$g$$ |
• conclude: $$y_i=k$$ means $$c_i=0$$, $$y_i=g$$ means $$c_i=1$$, and $$y_i=p$$ never happens

• so problem reduced to computing all $$y_i$$ in parallel
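
A minimal sequential sketch of this reduction (the helper names `status` and `compose` and the bit encoding are mine, not from the notes). The left-to-right fold over statuses below is exactly the prefix product the circuit evaluates in $$O(\log n)$$ depth:

```python
# Carry lookahead via kill/propagate/generate statuses (sketch).
def status(a, b):
    """Carry status of a single 1-bit adder position."""
    if a == b:
        return 'g' if a == 1 else 'k'   # generate or kill
    return 'p'                          # propagate

def compose(y_prev, x):
    """The multiplication table: x decides unless it propagates."""
    return y_prev if x == 'p' else x

def add(a, b, n):
    """Add two n-bit numbers (mod 2^n) using carry statuses."""
    A = [(a >> i) & 1 for i in range(n)]
    B = [(b >> i) & 1 for i in range(n)]
    xs = [status(A[i], B[i]) for i in range(n)]
    # y_i = y_{i-1} * x_i with y_0 = k; a parallel prefix in the
    # circuit, done here as a plain left-to-right fold
    ys, y = [], 'k'
    for x in xs:
        y = compose(y, x)
        ys.append(y)
    carry_in = [0] + [1 if y == 'g' else 0 for y in ys[:-1]]
    bits = [A[i] ^ B[i] ^ carry_in[i] for i in range(n)]
    return sum(bit << i for i, bit in enumerate(bits))
```

`y_i = g` means carry out of position $$i$$, so the sum bit at position $$i$$ is $$a_i \oplus b_i \oplus c_{i-1}$$.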

Parallel prefix

• Build full binary tree

• two gates at each node

• pass up product of all children

• pass down product of all $$x_i$$ preceding leftmost child

• works for any associative function
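
For illustration, here is the simpler pointer-doubling variant of prefix computation (not the two-gates-per-node tree the notes describe, which uses only $$O(n)$$ work; this one uses $$O(n\log n)$$ work but the same $$O(\log n)$$ depth):

```python
def parallel_prefix(xs, op):
    """Inclusive prefix products of an associative op.
    Doubling variant: O(log n) rounds; each inner loop is one
    synchronous parallel step with one processor per element."""
    xs = list(xs)
    d = 1
    while d < len(xs):
        nxt = list(xs)                    # read old values, write new
        for i in range(d, len(xs)):       # all i at once on a PRAM
            nxt[i] = op(xs[i - d], xs[i])
        xs = nxt
        d *= 2
    return xs
```

For example, `parallel_prefix([1, 2, 3, 4], lambda a, b: a + b)` gives the running sums `[1, 3, 6, 10]`; with `compose` from the adder it gives all the $$y_i$$.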

### PRAM

various conflict resolutions (CREW, EREW, CRCW)

• $$CRCW(k) \subset EREW(k+1)$$

• $$NC = \cup CRCW(k)$$

PRAMs simulate circuits, and vice versa

• So $$NC$$ well-defined

differences in practice

• CRCW can OR in $$O(1)$$ time

• EREW requires $$\log n$$ time (info theory lower bound)

• EREW needs $$\log n$$ to find max

• CRCW finds max in constant time with $$n^2$$ processors

• Compare every pair

• If an item loses, write “not max” in its entry

• Check all entries

• If item is max (not overwritten), write its value in answer
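
A sequential simulation of the constant-time CRCW max (loop nests stand in for parallel steps; names are mine):

```python
def crcw_max(a):
    """Constant-time CRCW max with n^2 simulated processors.
    Each for-nest below is a single parallel step; all concurrent
    writes store the same value, so any resolution rule works."""
    n = len(a)
    loser = [False] * n
    for i in range(n):            # processor (i, j) for every pair
        for j in range(n):
            if a[i] < a[j]:
                loser[i] = True   # concurrent write: "not max"
    for i in range(n):            # one processor per entry
        if not loser[i]:
            return a[i]           # winner writes into answer cell
```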

• in $$O(\log\log n)$$ time with $$n$$ processors

• Suppose $$k$$ items remain

• Make $$k^2/n$$ blocks of $$n/k$$ items

• constant-time max for each block using the quadratic-processor algorithm: $$(k^2/n)(n/k)^2=n$$ processors total

• recurrence: $$T(k)=1+T(k^2/n)$$

• $$T(n/2^i)=1+T(n/2^{2i})$$

• so $$\log\log n$$ iters.

parallel prefix

• using $$n$$ processors

list ranking EREW

• next pointers $$n(x)$$

• $$d(x)+=d(n(x))$$; $$n(x)=n(n(x))$$.

• by induction, sum of values on path to end doesn’t change
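
A sketch of the pointer-jumping loop (sequential simulation; each round reads the old `d` and `succ` arrays and writes new ones, as an EREW PRAM would):

```python
def list_rank(succ):
    """EREW list ranking by pointer jumping: succ[i] is the next
    index, or None at the tail; returns each node's distance to
    the end. O(log n) rounds, one processor per node."""
    n = len(succ)
    d = [0 if succ[i] is None else 1 for i in range(n)]
    while any(s is not None for s in succ):
        d2, s2 = list(d), list(succ)
        for i in range(n):                 # all i in parallel
            if succ[i] is not None:
                d2[i] = d[i] + d[succ[i]]  # path sum is invariant
                s2[i] = succ[succ[i]]
        d, succ = d2, s2
    return d
```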

## 0.1 Work-Efficient Algorithms

Idea:

• We’ve seen parallel algorithms that are somewhat “inefficient”

• do more total work (processors times time) than sequential

• Ideal solution: arrange for total work to be proportional to best sequential work

• Work-Efficient Algorithm

• Then a small number of processors (or even 1) can “simulate” many processors in a fully efficient way

• Parallel analogue of “cache oblivious algorithm”—you write algorithm once for many processors; lose nothing when it gets simulated on fewer.

Brent’s theorem

• Different perspective on work: count number of processors actually working in each time step.

• If algorithm does $$x$$ total work and critical path $$t$$

• Then $$p$$ processors take $$x/p+t$$ time

• So, if use $$p=x/t$$ processors, finish in time $$t$$ with efficient algorithm

• detail: assigning processors to tasks can be tricky if e.g. work emerges dynamically during execution

Work-efficient parallel prefix

• linear sequential work

• going to need $$\log n$$ time

• so, aim to get by with $$n/\log n$$ processors

• give each processor a block of $$\log n$$ items to add up

• reduces problem to $$n/\log n$$ values

• use old algorithm

• each processor fixes up prefixes for its block
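
The three phases can be sketched as follows (the middle scan is written as a sequential fold here, standing in for the $$O(\log n)$$-time tree algorithm; block size $$b$$ plays the role of $$\log n$$):

```python
from functools import reduce

def work_efficient_prefix(xs, op, b):
    """Prefix products with block size b (one processor per block).
    Phase 1: each processor folds its block, O(b) time.
    Phase 2: exclusive scan of the n/b block sums ("old algorithm").
    Phase 3: each processor fixes up prefixes within its block."""
    blocks = [xs[i:i + b] for i in range(0, len(xs), b)]
    sums = [reduce(op, blk) for blk in blocks]      # phase 1
    offsets, acc = [], None                         # phase 2
    for s in sums:
        offsets.append(acc)                         # prefix of earlier blocks
        acc = s if acc is None else op(acc, s)
    out = []
    for off, blk in zip(offsets, blocks):           # phase 3
        acc = off
        for x in blk:
            acc = x if acc is None else op(acc, x)
            out.append(acc)
    return out
```

Total work is $$O(n)$$ plus the scan of $$n/b$$ sums, matching the sequential bound when $$b=\log n$$.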

Work-efficient list ranking

• harder: can’t easily give contiguous “blocks” of $$\log n$$ to each processor (requires list ranking)

• However, assume items in arbitrary order in array of $$n$$ structs, so can give $$\log n$$ distinct items to each processor.

• use random coin flips to knock out “alternate” items

• shortcut any item that is heads and has tails successor

• requires at most one shortcut

• and each item is shortcut out with constant probability, essentially independently

• so by Chernoff, at least $$1/16$$ of the items are shortcut out, whp

• “compact” remaining items into smaller array using parallel prefix on array of pointers (ignoring list structure) to collect only “marked” nodes and update their pointers

• let each processor handle $$\log n$$ (arbitrary) items

• $$O(n/\log n)$$ processors, $$O(\log n)$$ time

• After $$O(\log\log n)$$ rounds, number of items reduced to $$n/\log n$$

• use old algorithm

• result: $$O(\log n \log\log n)$$ time, $$n/\log n$$ processors

• to improve, use faster “compaction” algorithm to collect marked nodes: $$O(\log\log n)$$ time randomized, or $$O(\log n/\log\log n)$$ deterministic. get optimal alg.

• How about deterministic algorithm? Use “deterministic coin tossing”

• take all local maxima as part of ruling set.

Euler tour to reduce to parallel prefix for

• inorder traversal of tree

• computing depth in tree

• numbering leaves

• work efficient

## Expression Tree Evaluation

plus and times nodes.

Idea

• merge leaves into parents (good for bushy tree)

• pointer jump degree-2 nodes (good for tall trees)

• combine?

• how “pointer jump” an operator?

Generalize problem:

• Each tree edge has a label $$(a,b)$$

• meaning that if subtree below evaluates to $$y$$ then value $$(ay+b)$$ should be passed up edge

Merging a single leaf

• method for eliminating all left-child leaves

• root $$q$$ with right child $$p$$ (product node) on edge labelled $$(a_3,b_3)$$

• $$p$$ has left child edge $$(a_1,b_1)$$ leaf $$\ell$$ with value $$v$$

• right child edge to $$s$$ with label $$(a_2,b_2)$$

• fold out $$p$$ and $$\ell$$, make $$s$$ a child of $$q$$

• what label of new edge?

• prepare for $$s$$ subtree to eval on $$y$$.

• choose $$a,b$$ such that $$ay+b=a_3\cdot[(a_1v+b_1)\cdot(a_2y+b_2)]+b_3$$
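
Expanding with $$c=a_1v+b_1$$ gives $$a=a_3ca_2$$ and $$b=a_3cb_2+b_3$$; a two-line check (function name is mine):

```python
def shunt_label(a1, b1, v, a2, b2, a3, b3):
    """New edge label when folding out product node p and leaf v:
    expand a3*((a1*v + b1)*(a2*y + b2)) + b3 as a*y + b."""
    c = a1 * v + b1
    return a3 * c * a2, a3 * c * b2 + b3
```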

Parallelize

• keep tree full, so killing all leaves kills whole tree

• collapse a leaf and pointer jump parent

• problem: can’t do to both children at once

• solution: number leaves in-order (Euler Tour)

• three step process:

• shunt odd-numbered left-child leaves

• shunt odd-number right-child leaves

• divide leaf-numbers by 2

• guarantees never simultaneously shunt siblings

## Sorting

Parallelizing binary search

• placing one item into a sorted array takes $$\log n$$ time with 1 processor

• but with $$k$$ processors can improve to $$\log_k n$$

• compare to $$k$$ evenly spaced items

• use concurrent OR to find out which pair I’m between in $$O(1)$$ time

• recurse

• shows how to gain parallel speed by wasting work

• in particular, can find in $$O(1)$$ time with $$n$$ processors

• or even $$\sqrt{n}$$ processors
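
A sequential sketch of the $$k$$-processor search (each round's probe loop is a single parallel comparison step; locating the surviving gap is the concurrent OR):

```python
def parallel_rank(a, x, k):
    """Rank of x in sorted list a (number of entries < x) in
    O(log_k n) rounds using k evenly spaced probes per round."""
    lo, hi = 0, len(a)                 # invariant: rank is in [lo, hi]
    while hi > lo:
        step = -(-(hi - lo) // k)      # ceil: k probes cover interval
        nlo, nhi = lo, hi
        for p in range(lo, hi, step):  # k comparisons, in parallel
            if a[p] < x:
                nlo = max(nlo, p + 1)
            else:
                nhi = min(nhi, p)
        lo, hi = nlo, nhi
    return lo
```

Each round shrinks the interval by a factor of about $$k$$, giving $$\log_k n$$ rounds.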

CREW Merge sort:

• merge two length-$$k$$ sequences using $$k$$ processors

• each element of first seq. uses binary search to find place in second

• so knows how many items smaller

• so knows rank in merged sequence: go there

• then do same for second list

• $$O(\log k)$$ time with $$k$$ processors

• handle all length-$$k$$ lists in parallel with $$n$$ processors

• total time $$O(\sum_{i \le \lg n} \log 2^i)=O(\log^2 n)$$ with $$n$$ processors
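
The ranking merge, simulated sequentially (each loop body is one processor's work, all running in parallel; the `bisect_left`/`bisect_right` asymmetry breaks ties so equal elements get distinct slots):

```python
import bisect

def crew_merge(A, B):
    """Merge sorted lists: every element binary-searches the other
    list for its rank in the output, then writes itself there.
    All searches concurrent: O(log k) time with k processors."""
    out = [None] * (len(A) + len(B))
    for i, x in enumerate(A):                    # in parallel
        out[i + bisect.bisect_left(B, x)] = x
    for j, y in enumerate(B):                    # in parallel
        out[j + bisect.bisect_right(A, y)] = y
    return out
```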

Better with more processors?

• use $$n^2$$ processors to do all pairwise comparisons in parallel

• For each item, count number of wins in parallel in $$O(\log n)$$ time

• This determines item’s rank in output list

Use same ideas with fewer processors?

• Note can sort $$\sqrt{n}$$ items in $$O(\log n)$$ time

• Use insights to improve merge?

• Idea: nearby items in one list go to nearby places in merged list

• So don’t do whole computation for every item

• Break $$A$$ into blocks and figure out roughly where each block goes in $$B$$

• Then spread each block to exact right places in $$B$$ (recursively)

Details: merge $$n$$ items of $$A$$ into $$m$$ items of $$B$$ in $$O(\log\log n)$$ time using $$m+n$$ processors

• choose $$\sqrt{n}$$ evenly spaced fenceposts $$\alpha_i$$ among $$A$$

• use $$\sqrt{m}$$ processors to find exact location of each in $$B$$

• total processors $$\sqrt{nm} \le \max(n,m) \le n+m$$

• split $$B$$ at locations of $$\alpha_i$$

• now know $$\sqrt{n}$$ items (between $$\alpha_i$$ and $$\alpha_{i+1}$$) go in each split piece of $$B$$

• So recurse: $$T(n)=2+T(\sqrt{n})=O(\log\log n)$$

Use in parallel merge sort: $$O(\log n \log\log n)$$ with $$n$$ processors.

• Cole shows how to “pipeline” merges, get optimal $$O(\log n)$$ time.

Qsort idea

• $$\sqrt{n}$$ pivots

• sort all in $$O(\log n)$$ time using $$n$$ processors

• binary search each item to locate position in pivots

• recurse

• $$T(n) = O(\log n) + T(\sqrt{n}) = O(\log n)$$

• rigorous analysis messy

• need to deal with variation in sizes of subproblems

## Connectivity and connected components

Linear time sequential trivial.

### Directed

• $$\log n$$ time to reduce diameter to 1

• $$mn$$ processors for first iter, but adds edges

• so, $$n^3$$ processors

• and space to $$n^2$$ even if initial graph is sparse

• improvements to $$mn$$ processors

• But “transitive closure bottleneck” still bedevils parallel algs.

• Even if just want to find $$s$$-$$t$$ path we don’t know any better

• Problem is that output has size $$n^2$$ and we don’t know which part of it matters.

### Undirected

Basic approach:

• Sets of connected vertices grouped as stars

• One root, all others parent-point to it

• Initially all vertices alone

• Edge “live” if connects two distinct stars

• Find live edges in constant time by checking roots

• For live edge with roots $$u<v$$, connect $$u$$ as child of $$v$$

• May be conflicts, but CRCW resolves

• Now get stars again

• Use pointer jumping

• Note: may have chains of links, so need $$\log n$$ jumps

• Every live star attached to another

• So the number of live stars at least halves

• $$m+n$$ processors, $$\log^2 n$$ time.
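
A sequential simulation of this hook-and-collapse loop (on a CRCW PRAM each `for` loop is one parallel step and one of the conflicting hook writes wins; the simulation's last-write-wins is equally valid, since overwritten hooks get retried while their edge stays live):

```python
def connected_components(n, edges):
    """Components kept as stars: p[v] is v's root. Live edges hook
    the smaller root under the larger; pointer jumping restores
    stars. Roots only grow (p[i] >= i), so no cycles form."""
    p = list(range(n))
    while True:
        live = False
        for u, v in edges:                 # hook step, in parallel
            ru, rv = p[u], p[v]
            if ru != rv:
                live = True
                p[min(ru, rv)] = max(ru, rv)
        if not live:
            return p                       # p[v] labels v's component
        while any(p[p[i]] != p[i] for i in range(n)):
            for i in range(n):             # pointer jumping
                p[i] = p[p[i]]
```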

Smarter: interleave hooking and jumping:

• Maintain set of rooted trees

• Each node points to parent

• Hook some trees together to make fewer trees

• Pointer jump (once) to make shallower trees

• Eventually, each connected component is one star

More details:

• “top” vertex: root or its children

• detect top vertex in constant time

• each vertex has label

• find root label of each top vertex

• Can detect if am star in constant time:

• star iff $$p(p(v))=p(v)$$ for every vertex $$v$$, i.e. no pointer double changes anything; one parallel step per vertex checks this

• for each edge:

• If ends both on top, different nonstar components, then hook smaller index component to larger

• may be conflicting hooks; assume CRCW resolves

• indices prevent cycles

• If star points to non-star, hook it

• do one pointer jump

Potential function: height of live stars and tall trees

• Live stars get hooked to something (star or internal)

• But never hooked to leaf. So add 1 to height of target

• So sum of heights doesn’t go up

• But now, every unit of height is in a tall tree

• Pointer jumping decreases the height of each tall tree by at least $$1/3$$

• Total height divided each time

• So $$\log n$$ iterations

Summary: $$O(m+n)$$ processors, $$O(\log n)$$ time.

Improvements:

• $$O((m+n)\alpha(m,n)/\log n)$$ processors, $$\log n$$ time, CRCW

• Randomized $$O(\log n)$$, $$O(m/\log n)$$ processors, EREW

## 0.2 Randomization

Randomization in parallel:

• symmetry breaking

• isolating solutions

Classes:

• NC: poly processor, polylog steps

• RNC: NC with randomization; polylog runtime, Monte Carlo

• ZNC: las vegas NC

• immune to choice of R/W conflict resolution

## Sorting

Quicksort in parallel:

• $$n$$ processors

• each takes one item, compares to splitter

• count number of predecessors less than splitter

• determines location of item in split

• total time $$O(\log n)$$

• combine: $$O(\log n)$$ per layer with $$n$$ processors

• problem: $$\Omega(\log^2 n)$$ time bound

• problem: $$n\log^2 n$$ work

Using many processors:

• do all $$n^2$$ comparisons

• use parallel prefix to count number of items less than each item

• $$O(\log n)$$ time

• or $$O(n)$$ time with $$n$$ processors

Combine with quicksort:

• Note: single pivot step inefficient: uses $$n$$ processors and $$\log n$$ time.

• Better: use $$\sqrt{n}$$ simultaneous pivots

• Choose $$\sqrt{n}$$ random items and sort fully to get $$\sqrt{n}$$ intervals

• For all $$n$$ items, use binary search to find right interval

• recurse

• $$T(n)=O(\log n)+T(\sqrt{n})=O(\log n + \frac12\log n+\frac14\log n + \cdots)=O(\log n)$$

Formal analysis:

• consider root-leaf path to any item $$x$$

• argue total number of parallel steps on path is $$O(\log n)$$

• consider item $$x$$

• claim splitter within $$\alpha\sqrt{n}$$ on each side

• since the probability that none of the $$\sqrt{n}$$ sampled pivots lands among those $$\alpha\sqrt{n}$$ items is at most $$(1-\alpha\sqrt{n}/n)^{\sqrt{n}} \le e^{-\alpha}$$

• fix $$\gamma, d<1/\gamma$$

• define $$\tau_k = d^k$$

• define $$\rho_k = n^{(2/3)^k}$$ $$(\rho_{k+1}=\rho_k^{2/3})$$

• note size $$\rho_k$$ problem takes $$\gamma^k\log n$$ time

• note size $$\rho_k$$ problem odds of having child of size $$>\rho_{k+1}$$ is less than $$e^{-\rho_k^{1/6}}$$

• argue at most $$d^k$$ size-$$\rho_k$$ problems whp

• follows because the probability of $$d^k$$ size-$$\rho_k$$ problems in a row is at most $$(e^{-\rho_k^{1/6}})^{d^k}$$, which is negligible while $$\rho_k$$ is large

• deduce runtime $$\sum d^k\gamma^k\log n = \sum (d\gamma)^{k}\log n = O(\log n)$$ since $$d\gamma<1$$

• note: as problem shrinks, allowing more divergence in quantity for whp result

• minor detail: “whp” dies for small problems

• OK: if problem size $$\log n$$, finish in $$\log n$$ time with $$\log n$$ processors

## Maximal independent set

trivial sequential algorithm

• inherently sequential

• from node point of view: each thinks can join MIS if others stay out

• randomization breaks this symmetry

Randomized idea

• each node joins with some probability

• all neighbors excluded

• many nodes join

• few phases needed

Algorithm:

• all degree 0 nodes join

• node $$v$$ joins with probability $$1/2d(v)$$

• if edge $$(u,v)$$ has both ends marked, unmark lower degree vertex

• put all marked nodes in IS

• delete all neighbors
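
One possible reading of a phase, simulated sequentially (each set comprehension or loop is a constant number of parallel steps; the adjacency representation and tie-breaking by index are my choices):

```python
import random

def luby_mis(adj, rng=random.Random(0)):
    """Randomized MIS; adj maps vertex -> set of neighbors.
    Each while-iteration is one phase; O(log n) phases expected."""
    alive = set(adj)
    mis = set()
    while alive:
        deg = {v: len(adj[v] & alive) for v in alive}
        isolated = {v for v in alive if deg[v] == 0}
        mis |= isolated                      # degree-0 nodes join
        alive -= isolated
        marked = {v for v in alive if rng.random() < 1 / (2 * deg[v])}
        unmark = set()
        for v in marked:                     # edge with both ends
            for u in adj[v] & marked:        # marked: lower degree
                if (deg[u], u) < (deg[v], v):  # loses (ties by index)
                    unmark.add(u)
        marked -= unmark
        mis |= marked                        # marked nodes join IS
        dead = set(marked)
        for v in marked:
            dead |= adj[v] & alive           # delete their neighbors
        alive -= dead
    return mis
```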

Intuition: $$d$$-regular graph

• vertex vanishes if it or neighbor gets chosen

• mark with probability $$1/2d$$

• prob (no neighbor marked) is $$(1-1/2d)^d$$, constant

• so constant prob. that some neighbor of $$v$$ is marked, which destroys $$v$$

• what about unmarking of $$v$$’s neighbor?

• prob(unmarking forced) only constant as argued above.

• So just changes constants

• const fraction of nodes vanish: $$O(\log n)$$ phases

• Implementing a phase trivial in $$O(\log n)$$.

Prob chosen for IS, given marked, exceeds $$1/2$$

• suppose $$w$$ marked. only unmarked if higher degree neighbor marked

• higher degree neighbor marked with prob. $$\le 1/2d(w)$$

• only $$d(w)$$ neighbors

• prob. any superior neighbor marked at most $$1/2$$.

For general case, define good vertices

• good: at least $$1/3$$ neighbors have lower degree

• prob. no neighbor of good marked $$\le (1-1/2d(v))^{d(v)/3} \le e^{-1/6}$$.

• So some neighbor marked with prob. $$1-e^{-1/6}$$

• Stays marked with prob. 1/2

• deduce prob. good vertex killed exceeds $$(1-e^{-1/6})/2$$

• Problem: perhaps only one good vertex?

Good edges

• any edge with a good neighbor

• has const prob. to vanish

• show half edges good

• deduce $$O(\log n)$$ iterations.

Proof

• Let $$V_B$$ be bad vertices; we count edges with both ends in $$V_B$$.

• direct edges from lower- to higher-degree endpoint; $$d_i$$ is indegree, $$d_o$$ outdegree

• if $$v$$ bad, then $$d_i(v) \le d(v)/3$$

• deduce $$\sum_{V_B} d_i(v) \le \frac13 \sum_{V_B} d(v)=\frac13\sum_{V_B}( d_i(v)+d_o(v))$$

• so $$\sum_{V_B} d_i(v) \le \frac12 \sum_{V_B} d_o(v)$$

• which means indegree can only “catch” half of outdegree; other half must go to good vertices.

• more carefully,

• $$d_o(v)-d_i(v) \ge \frac13(d(v))=\frac13(d_o(v)+d_i(v))$$.

• Let $$V_G,V_B$$ be good, bad vertices

• degree of bad vertices is $$\begin{aligned} 2e(V_B,V_B)+e(V_B,V_G)+e(V_G,V_B) &= \sum_{v\in V_B} d_o(v)+d_i(v)\\ &\le 3\sum_{v\in V_B}(d_o(v)-d_i(v))\\ &= 3(e(V_B,V_G)-e(V_G,V_B))\\ &\le 3(e(V_B,V_G)+e(V_G,V_B))\end{aligned}$$ Deduce $$e(V_B,V_B) \le e(V_B,V_G)+e(V_G,V_B)$$. Result follows.

Derandomization:

• Analysis focuses on edges,

• so unsurprisingly, pairwise independence sufficient

• not immediately obvious, but again consider $$d$$-uniform case

• prob vertex marked $$1/2d$$

• neighbors $$1,\ldots,d$$ in increasing degree order

• Let $$E_i$$ be event that $$i$$ is marked.

• Let $$E'_i$$ be $$E_i$$ but no $$E_j$$ for $$j<i$$

• $$A_i$$ event no neighbor of $$i$$ chosen

• Then prob eliminate $$v$$ at least $$\begin{aligned} \sum \Pr[E'_i \cap A_i] &= \sum\Pr[E'_i]\Pr[A_i \mid E'_i]\\ &\ge \sum \Pr[E'_i]\Pr[A_i]\end{aligned}$$

• Wait: show $$\Pr[A_i \mid E'_i] \ge \Pr[A_i]$$

• true if independent

• measure $$\Pr[\neg A_i \mid E'_i] \le \sum \Pr[E_w \mid E'_i]$$ (sum over neighbors $$w$$ of $$i$$)

• measure $$\begin{aligned} \Pr[E_w \mid E'_i] &= \frac{\Pr[E_w \cap E'_i]}{\Pr[E'_i]}\\ &= \frac{\Pr[(E_w \cap \neg E_1 \cap \cdots) \cap E_i]}{\Pr[(\neg E_1 \cap \cdots) \cap E_i]} \\ &= \frac{\Pr[E_w \cap \neg E_1 \cap \cdots \mid E_i]}{\Pr[\neg E_1 \cap \cdots \mid E_i]} \\ &\le \frac{\Pr[E_w \mid E_i]}{1-\sum_{j<i}\Pr[E_j \mid E_i]}\\ &= \Theta(\Pr[E_w])\end{aligned}$$ (last step assumes $$d$$-regular so only $$d$$ neighbors with odds $$1/2d$$)

• But expected marked neighbors $$1/2$$, so by Markov $$\Pr[A_i]>1/2$$

• so prob eliminate $$v$$ at least $$\frac12\sum\Pr[E'_i]=\frac12\Pr[\cup E_i]$$

• lower bound as $$\sum\Pr[E_i]-\sum\Pr[E_i \cap E_j] = 1/2-d(d-1)/8d^2 > 1/4$$

• so $$v$$ is eliminated with constant probability, even under pairwise independence

• Generate pairwise independent with $$O(\log n)$$ bits

• try all polynomial seeds in parallel

• one works

• gives deterministic $$NC$$ algorithm

with care, $$O(m)$$ processors and $$O(\log n)$$ time (randomized)

Lexicographically-first MIS (LFMIS) is P-complete.

## Perfect Matching

We focus on bipartite; book does general case.

Last time, saw detection algorithm in $$RNC$$:

• Tutte matrix

• Symbolic determinant nonzero iff PM

• assign random values in $$1,\ldots,2m$$

• Matrix multiplication, determinant in $$NC$$

• If unique, no problem

• Since only one nonzero term, ok to replace each entry by a 1.

• Remove each edge, see if still PM in parallel

• multiplies processors by $$m$$

• but still $$NC$$

Idea:

• make unique minimum weight perfect matching

• find it

Isolating lemma: [MVV]

• Family of distinct sets over $$x_1,\ldots,x_m$$

• assign random weights in $$1,\ldots,2m$$

• Pr(unique min-weight set)$$\ge 1/2$$

• Odd: no dependence on number of sets!

• (of course $$<2^m$$)

Proof:

• Fix item $$x_i$$

• $$Y$$ is min-value sets containing $$x_i$$

• $$N$$ is min-value sets not containing $$x_i$$

• true min-sets are either those in $$Y$$ or in $$N$$

• how decide? Value of $$x_i$$

• For $$x_i=-\infty$$, min-sets are $$Y$$

• For $$x_i=+\infty$$, min-sets are $$N$$

• As $$x_i$$ increases from $$-\infty$$ to $$\infty$$, there is a single transition value at which both $$Y$$ and $$N$$ are min-weight

• If only $$Y$$ min-weight, then $$x_i$$ in every min-set

• If only $$N$$ min-weight, then $$x_i$$ in no min-set

• If both min-weight, $$x_i$$ is ambiguous

• Suppose no $$x_i$$ ambiguous. Then min-weight set unique!

• At most one value for $$x_i$$ makes it ambiguous, given the remaining weights

• So Pr(ambiguous)$$\le 1/2m$$

• So Pr(any ambiguous)$$< m/2m=1/2$$

Usage:

• Consider the Tutte matrix $$A$$

• Assign random value $$2^{w_i}$$ to $$x_i$$, with $$w_i \in 1,\ldots,2m$$

• Weight of matching is $$2^{\sum w_i}$$

• Let $$W$$ be minimum sum

• Unique with prob. $$\ge 1/2$$

• If so, determinant is odd multiple of $$2^W$$

• Try removing edges one at a time

• Edge in PM iff new determinant$$/2^W$$ is even.

• Big numbers? No problem: values have poly number of bits
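
A toy illustration of the extraction step, assuming the min-weight PM is already unique (which the random weights buy with probability $$\ge 1/2$$). The exponential-time permutation-expansion determinant stands in for the $$NC$$ determinant algorithm; all names are mine:

```python
from itertools import permutations

def det(M):
    """Exact integer determinant by permutation expansion
    (fine for a tiny example; the real algorithm is in NC)."""
    n = len(M)
    total = 0
    for perm in permutations(range(n)):
        sign, prod = 1, 1
        for i in range(n):
            for j in range(i + 1, n):
                if perm[i] > perm[j]:
                    sign = -sign            # count inversions
            prod *= M[i][perm[i]]
        total += sign * prod
    return total

def v2(x):
    """Largest power of 2 dividing x (infinite for 0)."""
    if x == 0:
        return float('inf')
    k = 0
    while x % 2 == 0:
        x //= 2
        k += 1
    return k

def min_matching(weights):
    """weights[i][j]: weight of bipartite edge (i,j), or None.
    Recover the (assumed unique) min-weight PM from determinants."""
    M = [[0 if w is None else 2 ** w for w in row] for row in weights]
    W = v2(det(M))                   # det is an odd multiple of 2^W
    matching = []
    for i in range(len(M)):          # all removals in parallel
        for j in range(len(M)):
            if M[i][j]:
                save, M[i][j] = M[i][j], 0
                if v2(det(M)) > W:   # 2^W term gone: edge in the PM
                    matching.append((i, j))
                M[i][j] = save
    return matching
```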

$$NC$$ algorithm open.

For exact matching, $$P$$ algorithm open.