Word-Level Parallelism
Sequential CPU hides parallel processor
- word operations done in “constant” time
- so $w$-bit words means “$w$ processors”
- but model is SIMD
- how to use?
- some operations built in
- zero all
- negate all
- bitwise and/or/xor
- shift
- add two words
- multiply two words
- What can we build with these?
Today: sorting
- $\Omega(n\log n)$ lower bound for comparison sorting
- radix sort in $O(n\log u)$ time, faster for small $u$
- van Emde Boas (VEB) leveraged (parallel) bit ops for $O(\log\log u)$ lookup
- gives $O(n\log\log u)$ sorting
- Andersson et al 1995: $O(n \log\log n)$ sorting of machine words
- important/surprising improvement on VEB
- runtime doesn't depend on word size, only number of items!
- only assumes constant time word operations
- “strongly polynomial”
Main ideas
- if integers are “small,” many fit in one machine word
- so devise a parallel sorting algorithm that can be run on a machine word
- results in $O(n)$-time sorting of small integers
- problem: integers aren't small
- solution: range reduction
- Transform the problem of sorting large integers into one of sorting smaller integers
- $O(n)$ to halve bits
- so $O(\log\log n)$ rounds let you fit $\log n$ numbers in a machine word
- at which point you sort in linear time
- assuming a SIMD sorting algorithm
Basics
General idea:
- Think of word as array of “digits”
- assume number of digits is power of 2 (for recursion)
- Operate on all digits in parallel
- So must be SIMD
- No conditional branching
Masking trick
- create $2^b$ with one shift
- subtract 1
- get “mask” with $b$ low bits set
- negate to set high bits
- constant time
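A minimal Python sketch of the trick (helper names are mine; Python's unbounded ints stand in for $w$-bit words, so "negate" must be simulated by XOR against an all-ones word):

```python
def low_mask(b):
    """The masking trick: mask with the b low bits set -- one shift, one subtract."""
    return (1 << b) - 1          # 2^b, minus 1

def high_mask(b, w):
    """Mask with the high w-b bits set: negate the low mask within a w-bit word."""
    return low_mask(w) ^ low_mask(b)
```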
Warmup: reversing an “array” of digits
- reverse left and right halves
- using shifts, masks, or
- then recurse in parallel on left and right half
- i.e. use smaller shifts and masks
- $\log w$ time
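A sketch of the reversal, assuming $w$ and the digit size are powers of 2. The inner mask-building loop is for illustration only; in the word-RAM model those masks are precomputed constants.

```python
def reverse_digits(x, w, d):
    """Reverse the w//d digits (d bits each) of the w-bit word x.
    Each pass swaps the two halves of every block in parallel, so after
    log2(w/d) passes the digits are reversed: O(log w) word operations."""
    s = w // 2
    while s >= d:
        # mask selecting the low s bits of each 2s-bit block
        m = 0
        for pos in range(0, w, 2 * s):
            m |= ((1 << s) - 1) << pos
        x = ((x & m) << s) | ((x >> s) & m)   # swap halves of every block
        s //= 2
    return x

# e.g. reverse_digits(0x1234, 16, 4) == 0x4321
```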
Warmup: Counting ones
- old MS interview question
- think of word as two half words
- count ones in each half word
- mask the two halves
- shift the high half down
- add
- $T(w)=T(w/2)+1$
- $O(\log w)$ time to count $w$-bit word
- just like you'd expect in a $w$-processor PRAM
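A runnable sketch of the $O(\log w)$ count (my helper names; $w$ assumed a power of 2):

```python
def field_mask(b, w):
    """Mask with the low b bits of each 2b-bit field set; in the word-RAM
    model these are precomputed constants."""
    m = 0
    for pos in range(0, w, 2 * b):
        m |= ((1 << b) - 1) << pos
    return m

def popcount(x, w):
    """Count the ones of a w-bit word in O(log w) word operations: at each
    level, add the high half of every 2b-bit field into its low half,
    doubling the field width until one field spans the word."""
    b = 1
    while b < w:
        m = field_mask(b, w)
        x = (x & m) + ((x >> b) & m)
        b *= 2
    return x
```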
Warmup: Parallel Prefix
- Treat word as array of $k$ letters
- How compute word containing all prefix sums?
- Idea: sum adjacent pairs, recurse, then add back the “odd” members
- mask odd positions and even positions into two different words
- shift to align
- add
- consider result as array of half as many ints of twice the bits
- recurse to get prefix sums
- copy, shift, and OR to get sums in both odd and even positions
- now add in original odd positions to correct their prefix sums
- For $k$ integers $T(k) = T(k/2)+O(1) = O(\log k)$
- note integers must be small enough that sums don't overflow their letters
- i.e., need the $O(\log k)$ highest bits of each letter initially zero
- wait, how compute masks?
- they're constants, can compute/store in advance
- or algorithmically
- start with mask that is first half zeros second half ones
- then given mask of $b$ bits alternating
- shift by $b/2$ bits and XOR to get $b/2$ bits alternating
- this is much faster than the linear-time summation used by the sorting warmup below
- but it doesn't help there, because the bottleneck is the $O(n)$ setup of the letters to be sorted.
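A sketch combining the recursion with the shift-and-XOR mask construction just described (assuming $k$ is a power of 2 and each letter has enough spare high bits that no prefix sum overflows its $f$ bits):

```python
def alt_mask(b, w):
    """Mask with the low b bits of each 2b-bit field set, built as in the
    notes: start with the low half ones, then one shift and XOR per
    halving of the period -- O(log w) operations total."""
    m = (1 << (w // 2)) - 1
    s = w // 2
    while s > b:
        s //= 2
        m ^= (m << s) & ((1 << w) - 1)   # truncate to w bits
    return m

def prefix_sums(x, k, f):
    """Prefix sums of the k f-bit letters packed in x (letter 0 lowest).
    Assumes enough spare high bits that every sum still fits in f bits.
    T(k) = T(k/2) + O(1) = O(log k) word operations."""
    if k == 1:
        return x
    em = alt_mask(f, k * f)                   # selects letters 0, 2, 4, ...
    evens = x & em                            # a_0, a_2, ... in place
    odds = (x >> f) & em                      # a_1, a_3, ... aligned down
    pairs = evens + odds                      # s_j = a_{2j} + a_{2j+1}
    p = prefix_sums(pairs, k // 2, 2 * f)     # P_j = s_0 + ... + s_j
    odd_out = (p << f) & (em << f)            # prefix(2j+1) = P_j
    even_out = ((p << (2 * f)) & em) + evens  # prefix(2j) = P_{j-1} + a_{2j}
    return odd_out | even_out
```

For example, `prefix_sums(0x4321, 4, 4)` returns `0xA631`: letters $(1,2,3,4)$ become prefix sums $(1,3,6,10)$.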
Warmup: sorting with huge words
- for $n$ integers of $b$ bits and a $2^{O(b)}$-bit word
- assume all distinct
- treat word as array of “letters” indexed up to max key
- Initially, zero all letters
- For each $i$, set letter $x_i$ to 1 (shift and add)
- now for each $x_i$, count the items $\geq x_i$
- that count tells me where $x_i$ is in order
- how?
- could parallel prefix, but this takes $\log w$ which may be too slow
- instead, for each $x_i$
- create $y_i$ by masking out all higher letters
- using masking trick
- now compute $\sum y_i$
- analysis
- letter $x_i$ gets masked out of $y_j$ for every smaller $x_j$
- so if $x_i$ is the $k$-th largest, we sum $k$ ones at letter $x_i$
- so we see the value $k$ in letter $x_i$
- i.e. can read the position of $x_i$ in sorted order out of its letter
- runtime $O(n)$
- independent of word size, so long as word size large enough
- problem: $x_i$ maybe not distinct
- solution: count the number of occurrences of each $x_i$
- during init, just add 1 to letter $x_i$ whenever you see $x_i$
- problem: counts may overflow a letter, spilling into the letter of a nearby $x_j$
- solution: ensure some spare bits in each letter
- $n$ keys, so need $\log n$ bits to avoid overflow
- so make letters that much wider
- i.e. for $b$-bit keys, use $(b+\log n)$-bit letters
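A simulation of the whole warmup, with a Python big int standing in for the $2^{O(b)}$-bit word. Distinct keys are assumed for clarity; duplicates would be handled by counting, as described above.

```python
def huge_word_sort(keys, b):
    """Sort n distinct b-bit keys using one 'huge word' of 2^b letters,
    simulated by a Python big int.  O(n) word operations, independent of
    the word's size (as long as the word is large enough)."""
    n = len(keys)
    assert all(x < (1 << b) for x in keys)
    f = n.bit_length() + 1                    # letter width: ranks can't overflow
    # set letter x to 1 for each key x (O(n) shifts and adds)
    A = 0
    for x in keys:
        A |= 1 << (x * f)
    # for each key, mask out all higher letters and sum the results;
    # letter x_i of the sum then holds |{j : x_j >= x_i}|
    S = 0
    for x in keys:
        S += A & ((1 << ((x + 1) * f)) - 1)   # masking trick: keep letters <= x
    # read each key's position in sorted order out of its letter
    out = [None] * n
    for x in keys:
        greater_eq = (S >> (x * f)) & ((1 << f) - 1)
        out[n - greater_eq] = x
    return out

# e.g. huge_word_sort([3, 1, 2], 2) == [1, 2, 3]
```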
Better Sorting
Two insights
- (range reduction) Need keys small so many fit in word
- (packed sorting) Sort keys of word in linear time
- we'll enhance both
- range reduction to make big keys small
- packed sorting with words that aren't so huge
- result: $O(n\log\log n)$ sorting
Range Reduction
Idea
- we've seen a good algorithm when many keys fit in a word
- which requires them to be small
- range reduction is a linear-time transformation that halves the number of bits in the keys
- related to VEB structure
Method
- given $n$ keys of $b$ bits
- divide numbers into high $h(x)$ and low $\ell(x)$ bits
- bucket by $h(x)$
- also find max $\ell(x)$ in each bucket (same as VEB trick)
- sort $n$ half-width keys consisting of (i) one $h(x)$ per bucket and (ii) every $\ell(x)$ except the max $\ell(x)$ in its bucket
- that way we still sort $n$ keys
- but now each is $b/2$ bits
- ensure sort returns a permutation (rank of each key)
- which lets you look up which $h(x)$ is associated with each $\ell(x)$
- pass through result to extract $h(x)$ in sorted order
- then pass through result list to order $\ell(x)$ in each bucket
- combine with left out max $\ell(x)$, get full sort
- return as permutation, to support caller
Result:
- $O(n)$ processing transforms a $b$-bit sorting problem to a $b/2$-bit sorting problem
- Note: like VEB, requires $\sqrt{u}$ space or hashing (randomization)
- If we iterate the procedure $O(\log \log n)$ times, we shrink the number of bits by a $\log n$ factor
- so can fit $\log n$ keys in one word
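A sketch of one way to code the round (the bookkeeping is mine, not the notes'): a dict plays the role of hashing for the buckets, and Python's `sorted()` stands in for the packed-sort base case that takes over once keys are small.

```python
SMALL = 8   # pretend keys of <= SMALL bits are "small enough" to pack

def rsort(keys, b):
    """Range-reduction sort of b-bit keys; returns indices of `keys` in
    sorted order (a permutation, as the caller needs).  Each round is
    O(n) work and halves b."""
    n = len(keys)
    if n <= 1 or b <= SMALL:
        return sorted(range(n), key=keys.__getitem__)  # packed-sort stand-in
    half = (b + 1) // 2
    lo = lambda x: x & ((1 << half) - 1)
    buckets, maxidx = {}, {}
    for i, x in enumerate(keys):                # bucket by high half (hashing)
        h = x >> half
        buckets.setdefault(h, []).append(i)
        if h not in maxidx or lo(x) > lo(keys[maxidx[h]]):
            maxidx[h] = i
    sub, info = [], []                          # exactly n half-width keys
    for h, idxs in buckets.items():
        sub.append(h)                           # (i) one h(x) per bucket
        info.append((None, h))
        for i in idxs:                          # (ii) lows, except each bucket's max
            if i != maxidx[h]:
                sub.append(lo(keys[i]))
                info.append((i, h))
    perm = rsort(sub, half)                     # one recursive b/2-bit problem
    order_h, lows = [], {h: [] for h in buckets}
    for p in perm:                              # read the permutation back
        i, h = info[p]
        if i is None:
            order_h.append(h)                   # buckets in increasing h
        else:
            lows[h].append(i)                   # each bucket's lows, in order
    out = []
    for h in order_h:
        out.extend(lows[h])
        out.append(maxidx[h])                   # rejoin the left-out max
    return out

# e.g. rsort([13, 7, 9, 12], 16) == [1, 2, 3, 0]
```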
Packed Sorting
A method to quickly sort keys when many fit in one word
Auxiliary data
- the range-reduction caller needs the sort to return a permutation
- but packed sorting doesn't by default let you carry values with keys
- solution: multiply each key by $n$, use low $\log n < b$ bits to store index of key
- now sorting carries the index with each item
- makes keys slightly larger, but an extra range-reduction step addresses this
- after sort, read permutation by looking at indexes in sorted keys
- return to range reduction caller
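A sketch of the tagging trick (hypothetical helpers; shifting left by $\lceil\log n\rceil$ bits is the same idea as multiplying by $n$, just rounded up to a power of two):

```python
def tag(keys):
    """Shift each key left and store its index in the freed low bits,
    so whatever sorts the tagged keys also carries the permutation."""
    n = len(keys)
    s = (n - 1).bit_length()           # log n low bits for the index
    return [(x << s) | i for i, x in enumerate(keys)]

def permutation(sorted_tagged, n):
    """Read the ranks back out of the low bits after sorting."""
    s = (n - 1).bit_length()
    return [t & ((1 << s) - 1) for t in sorted_tagged]

# e.g. permutation(sorted(tag([30, 10, 20])), 3) == [1, 2, 0]
```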
Sorting Networks
- digression to parallel sorting
- challenge: want SIMD algorithms
- which means the algorithm cannot take different code paths for different inputs
- motivates sorting networks
- a sorting network is a fixed sequence of compare/swap operations, independent of the input values
Bitonic sort:
- a bitonic sequence is a cyclic shift of a nondecreasing sequence followed by a nonincreasing one
- Sorting network for bitonic sequences:
- Bitonic Split
- Compare/swap opposite pairs (element $i$ with element $i + n/2$)
- Recurse on halves
- claim after swap, each half bitonic
- and bottom half smaller than top half
- bitonic sequence is valley plus mountain
- “sea level” where half above and half below
- if lucky, valley in first half
- then nothing swaps
- if shift whole sequence 1 left, effect is to shift the valley
- ditto for mountain
- conclude LHS is shift of valley, RHS is shift of mountain
- Recurse bitonic split on each half in parallel
- when done, have totally ordered singleton bitonic sequences
- $O(\log n)$ time with $n$ processors.
- Bitonic Merge
- Bitonic sort can be used to merge two sorted sequences
- flip second sequence and concatenate
- gives bitonic sequence
- apply sort of bitonic sequences
- So run merges in parallel, building sorted groups of size 1, 2, 4, 8, etc.
- $O(\log^2 n)$ time with $n$ processors
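A sequential Python sketch of the whole scheme ($n$ assumed a power of 2). Every compare/swap in one pass of the `for` loop is independent of the others, which is exactly what a SIMD/parallel machine exploits.

```python
def bitonic_merge(a, lo, n, up):
    """Sort the bitonic subsequence a[lo:lo+n] into `up` order by bitonic
    splits: compare/swap opposite pairs, then recurse on both halves.
    O(log n) parallel rounds."""
    if n == 1:
        return
    h = n // 2
    for i in range(lo, lo + h):            # compare/swap opposite pairs
        if (a[i] > a[i + h]) == up:
            a[i], a[i + h] = a[i + h], a[i]
    bitonic_merge(a, lo, h, up)            # each half is again bitonic
    bitonic_merge(a, lo + h, h, up)

def bitonic_sort(a, lo=0, n=None, up=True):
    """Merge sort built from bitonic merges: sort the halves in opposite
    directions (their concatenation is then bitonic) and merge.
    O(log^2 n) parallel time with n processors."""
    if n is None:
        n = len(a)
    if n == 1:
        return
    h = n // 2
    bitonic_sort(a, lo, h, True)           # ascending half ...
    bitonic_sort(a, lo + h, h, False)      # ... then descending: bitonic
    bitonic_merge(a, lo, n, up)

# e.g. a = [5, 3, 8, 1]; bitonic_sort(a); a == [1, 3, 5, 8]
```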
Wrap Up
- Han and Thorup 2002: $O(n\sqrt{\log\log n})$
- actually, $O(n\sqrt{\log(w/\log n)})$.
- range reduction plus packed sorting plus “signature sort” can sort in $O(n)$ time if $w > \log^{2+\epsilon} n$
- Thorup: if you can sort in $n\cdot f(n)$ time, you can build a priority queue with $O(f(n))$-time operations (the reverse direction is obvious)