Word-Level Parallelism
Sequential CPU hides parallel processor
- word operations done in “constant” time
- so $w$-bit words means “$w$ processors”
- but model is SIMD
- how to use?
- some operations built in
- zero all
- negate all
- bitwise and/or/xor
- shift
- add two words
- multiply two words
- What can we build with these?
Today: sorting
- $\Omega(n\log n)$ lower bound for comparison sorting
- radix sort in $O(n\log u)$ time, faster for small $u$
- van Emde Boas (VEB) leveraged (parallel) bit ops for $O(\log\log u)$ lookup
- gives $O(n\log\log u)$ sorting
- Andersson et al 1995: $O(n \log\log n)$ sorting of machine words
- important/surprising improvement on VEB
- runtime doesn't depend on word size, only number of items!
- only assumes constant time word operations
- “strongly polynomial”
Main ideas
- if integers are “small,” many fit in one machine word
- so devise a parallel sorting algorithm that can be run on a machine word
- results in $O(n)$-time sorting of small integers
- problem: integers aren't small
- solution: range reduction
- Transform the problem of sorting large integers into one of sorting smaller integers
- $O(n)$ to halve bits
- so $O(\log\log n)$ rounds let you fit $\log n$ numbers in a machine word
- at which point you sort in linear time
- assuming a SIMD sorting algorithm
Basics
General idea:
- Think of word as array of “digits”
- assume number of digits is power of 2 (for recursion)
- Operate on all digits in parallel
- So must be SIMD
- No conditional branching
Masking trick
- create $2^b$ with one shift
- subtract 1
- get “mask” with $b$ low bits set
- negate to set high bits
- constant time
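A minimal Python sketch of the trick (helper names are mine; Python's unbounded ints stand in for $w$-bit words, so "negate" must be simulated by XOR against an all-ones word):

```python
def low_mask(b):
    """The masking trick: mask with the b low bits set -- one shift, one subtract."""
    return (1 << b) - 1          # 2^b, minus 1

def high_mask(b, w):
    """Mask with the high w-b bits set: negate the low mask within a w-bit word."""
    return low_mask(w) ^ low_mask(b)
```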
Warmup: reversing an “array” of digits
- reverse left and right halves
- using shifts, masks, or
- then recurse in parallel on left and right half
- i.e. use smaller shifts and masks
- $\log w$ time
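A sketch of the reversal, assuming $w$ and the digit size are powers of 2. The inner mask-building loop is for illustration only; in the word-RAM model those masks are precomputed constants.

```python
def reverse_digits(x, w, d):
    """Reverse the w//d digits (d bits each) of the w-bit word x.
    Each pass swaps the two halves of every block in parallel, so after
    log2(w/d) passes the digits are reversed: O(log w) word operations."""
    s = w // 2
    while s >= d:
        # mask selecting the low s bits of each 2s-bit block
        m = 0
        for pos in range(0, w, 2 * s):
            m |= ((1 << s) - 1) << pos
        x = ((x & m) << s) | ((x >> s) & m)   # swap halves of every block
        s //= 2
    return x

# e.g. reverse_digits(0x1234, 16, 4) == 0x4321
```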
Warmup: Counting ones
- old MS interview question
- think of word as two half words
- count ones in each half word
- mask the two halves
- shift the high half down
- add
- $T(w)=T(w/2)+1$
- $O(\log w)$ time to count $w$-bit word
- just like you'd expect in a $w$-processor PRAM
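A runnable sketch of the $O(\log w)$ count (my helper names; $w$ assumed a power of 2):

```python
def field_mask(b, w):
    """Mask with the low b bits of each 2b-bit field set; in the word-RAM
    model these are precomputed constants."""
    m = 0
    for pos in range(0, w, 2 * b):
        m |= ((1 << b) - 1) << pos
    return m

def popcount(x, w):
    """Count the ones of a w-bit word in O(log w) word operations: at each
    level, add the high half of every 2b-bit field into its low half,
    doubling the field width until one field spans the word."""
    b = 1
    while b < w:
        m = field_mask(b, w)
        x = (x & m) + ((x >> b) & m)
        b *= 2
    return x
```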
Warmup: Parallel Prefix
- Treat word as array of $k$ letters
- How compute word containing all prefix sums?
- Idea: sum adjacent pairs, recurse, then add back the “odd” members
- mask odd positions and even positions into two different words
- shift to align
- add
- consider result as array of half as many ints of twice the bits
- recurse to get prefix sums
- copy, shift, and OR to get sums in both odd and even positions
- now add in original odd positions to correct their prefix sums
- For $k$ integers $T(k) = T(k/2)+O(1) = O(\log k)$
- note integers must be small enough that sums don't overflow their letters
- i.e., need the $O(\log k)$ highest bits of each letter initially zero
- wait, how compute masks?
- they're constants, can compute/store in advance
- or algorithmically
- start with mask that is first half zeros second half ones
- then given mask of $b$ bits alternating
- shift by $b/2$ bits and XOR to get $b/2$ bits alternating
- this is much faster than the linear-time summation used by the sorting warmup below
- but it doesn't help there, because the bottleneck is the $O(n)$ setup of the letters to be sorted.
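A sketch combining the recursion with the shift-and-XOR mask construction just described (assuming $k$ is a power of 2 and each letter has enough spare high bits that no prefix sum overflows its $f$ bits):

```python
def alt_mask(b, w):
    """Mask with the low b bits of each 2b-bit field set, built as in the
    notes: start with the low half ones, then one shift and XOR per
    halving of the period -- O(log w) operations total."""
    m = (1 << (w // 2)) - 1
    s = w // 2
    while s > b:
        s //= 2
        m ^= (m << s) & ((1 << w) - 1)   # truncate to w bits
    return m

def prefix_sums(x, k, f):
    """Prefix sums of the k f-bit letters packed in x (letter 0 lowest).
    Assumes enough spare high bits that every sum still fits in f bits.
    T(k) = T(k/2) + O(1) = O(log k) word operations."""
    if k == 1:
        return x
    em = alt_mask(f, k * f)                   # selects letters 0, 2, 4, ...
    evens = x & em                            # a_0, a_2, ... in place
    odds = (x >> f) & em                      # a_1, a_3, ... aligned down
    pairs = evens + odds                      # s_j = a_{2j} + a_{2j+1}
    p = prefix_sums(pairs, k // 2, 2 * f)     # P_j = s_0 + ... + s_j
    odd_out = (p << f) & (em << f)            # prefix(2j+1) = P_j
    even_out = ((p << (2 * f)) & em) + evens  # prefix(2j) = P_{j-1} + a_{2j}
    return odd_out | even_out
```

For example, `prefix_sums(0x4321, 4, 4)` returns `0xA631`: letters $(1,2,3,4)$ become prefix sums $(1,3,6,10)$.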
Warmup: sorting with huge words
- for $n$ integers of $b$ bits and a $2^{O(b)}$-bit word
- assume all distinct
- treat word as array of “letters” indexed up to max key
- Initially, zero all letters
- For each $i$, set letter $x_i$ to 1 (shift and add)
- now for each $x_i$, count the items $\geq x_i$
- that count tells me where $x_i$ is in order
- how?
- could parallel prefix, but this takes $\log w$ which may be too slow
- instead, for each $x_i$
- create $y_i$ by masking out all higher letters
- using masking trick
- now compute $\sum y_i$
- analysis
- letter $x_i$ gets masked out of $y_j$ for every smaller $x_j$
- so if $x_i$ is the $k$-th largest, we sum $k$ ones at letter $x_i$
- so we see the value $k$ in letter $x_i$
- i.e. can read the position of $x_i$ in sorted order out of its letter
- runtime $O(n)$
- independent of word size, so long as word size large enough
- problem: $x_i$ maybe not distinct
- solution: count the number of occurrences of each $x_i$
- during init, just add 1 to letter $x_i$ whenever you see $x_i$
- problem: counts may overflow a letter, spilling into the letter of a nearby $x_j$
- solution: ensure some spare bits in each letter
- $n$ keys, so need $\log n$ bits to avoid overflow
- so make letters that much wider
- i.e. for $b$-bit keys, use $(b+\log n)$-bit letters
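A simulation of the whole warmup, with a Python big int standing in for the $2^{O(b)}$-bit word. Distinct keys are assumed for clarity; duplicates would be handled by counting, as described above.

```python
def huge_word_sort(keys, b):
    """Sort n distinct b-bit keys using one 'huge word' of 2^b letters,
    simulated by a Python big int.  O(n) word operations, independent of
    the word's size (as long as the word is large enough)."""
    n = len(keys)
    assert all(x < (1 << b) for x in keys)
    f = n.bit_length() + 1                    # letter width: ranks can't overflow
    # set letter x to 1 for each key x (O(n) shifts and adds)
    A = 0
    for x in keys:
        A |= 1 << (x * f)
    # for each key, mask out all higher letters and sum the results;
    # letter x_i of the sum then holds |{j : x_j >= x_i}|
    S = 0
    for x in keys:
        S += A & ((1 << ((x + 1) * f)) - 1)   # masking trick: keep letters <= x
    # read each key's position in sorted order out of its letter
    out = [None] * n
    for x in keys:
        greater_eq = (S >> (x * f)) & ((1 << f) - 1)
        out[n - greater_eq] = x
    return out

# e.g. huge_word_sort([3, 1, 2], 2) == [1, 2, 3]
```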
Better Sorting
Two insights
- (range reduction) Need keys small so many fit in word
- (packed sorting) Sort keys of word in linear time
- we'll enhance both
- range reduction to make big keys small
- packed sorting with words that aren't so huge
- result: $O(n\log\log n)$ sorting
Range Reduction
Idea
- we've seen a good algorithm when many keys fit in a word
- which requires them to be small
- range reduction is a linear-time transformation that halves the number of bits in the keys
- related to VEB structure
Method
- given $n$ keys of $b$ bits
- divide numbers into high $h(x)$ and low $\ell(x)$ bits
- bucket by $h(x)$
- also find max $\ell(x)$ in each bucket (same as VEB trick)
- sort $n$ half-width keys consisting of (i) one $h(x)$ per bucket and (ii) every $\ell(x)$ except the max $\ell(x)$ in its bucket
- that way we still sort $n$ keys
- but now each is $b/2$ bits
- ensure sort returns a permutation (rank of each key)
- which lets you look up which $h(x)$ is associated with each $\ell(x)$
- pass through result to extract $h(x)$ in sorted order
- then pass through result list to order $\ell(x)$ in each bucket
- combine with left out max $\ell(x)$, get full sort
- return as permutation, to support caller
Result:
- $O(n)$ processing transforms a $b$-bit sorting problem to a $b/2$-bit sorting problem
- Note: like VEB, requires $\sqrt{u}$ space or hashing (randomization)
- If we iterate the procedure $O(\log \log n)$ times, we shrink the number of bits by a $\log n$ factor
- so can fit $\log n$ keys in one word
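A sketch of one way to code the round (the bookkeeping is mine, not the notes'): a dict plays the role of hashing for the buckets, and Python's `sorted()` stands in for the packed-sort base case that takes over once keys are small.

```python
SMALL = 8   # pretend keys of <= SMALL bits are "small enough" to pack

def rsort(keys, b):
    """Range-reduction sort of b-bit keys; returns indices of `keys` in
    sorted order (a permutation, as the caller needs).  Each round is
    O(n) work and halves b."""
    n = len(keys)
    if n <= 1 or b <= SMALL:
        return sorted(range(n), key=keys.__getitem__)  # packed-sort stand-in
    half = (b + 1) // 2
    lo = lambda x: x & ((1 << half) - 1)
    buckets, maxidx = {}, {}
    for i, x in enumerate(keys):                # bucket by high half (hashing)
        h = x >> half
        buckets.setdefault(h, []).append(i)
        if h not in maxidx or lo(x) > lo(keys[maxidx[h]]):
            maxidx[h] = i
    sub, info = [], []                          # exactly n half-width keys
    for h, idxs in buckets.items():
        sub.append(h)                           # (i) one h(x) per bucket
        info.append((None, h))
        for i in idxs:                          # (ii) lows, except each bucket's max
            if i != maxidx[h]:
                sub.append(lo(keys[i]))
                info.append((i, h))
    perm = rsort(sub, half)                     # one recursive b/2-bit problem
    order_h, lows = [], {h: [] for h in buckets}
    for p in perm:                              # read the permutation back
        i, h = info[p]
        if i is None:
            order_h.append(h)                   # buckets in increasing h
        else:
            lows[h].append(i)                   # each bucket's lows, in order
    out = []
    for h in order_h:
        out.extend(lows[h])
        out.append(maxidx[h])                   # rejoin the left-out max
    return out

# e.g. rsort([13, 7, 9, 12], 16) == [1, 2, 3, 0]
```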
Packed Sorting
A method to quickly sort keys when many fit in one word
Auxiliary data
- the range-reduction caller needs the sort to return a permutation
- but packed sorting doesn't by default let you carry values with keys
- solution: multiply each key by $n$, use low $\log n < b$ bits to store index of key
- now sorting carries the index with each item
- makes keys slightly larger, but an extra range-reduction step addresses this
- after sort, read permutation by looking at indexes in sorted keys
- return to range reduction caller
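A sketch of the tagging trick (hypothetical helpers; shifting left by $\lceil\log n\rceil$ bits is the same idea as multiplying by $n$, just rounded up to a power of two):

```python
def tag(keys):
    """Shift each key left and store its index in the freed low bits,
    so whatever sorts the tagged keys also carries the permutation."""
    n = len(keys)
    s = (n - 1).bit_length()           # log n low bits for the index
    return [(x << s) | i for i, x in enumerate(keys)]

def permutation(sorted_tagged, n):
    """Read the ranks back out of the low bits after sorting."""
    s = (n - 1).bit_length()
    return [t & ((1 << s) - 1) for t in sorted_tagged]

# e.g. permutation(sorted(tag([30, 10, 20])), 3) == [1, 2, 0]
```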
Sorting Networks
- digression to parallel sorting
- challenge: want SIMD algorithms
- which means the algorithm cannot take different code paths for different inputs
- motivates sorting networks
- a sorting network is a fixed sequence of compare/swap operations, independent of the input values
Bitonic sort:
- a bitonic sequence is a cyclic shift of a nondecreasing sequence followed by a nonincreasing one
- Sorting network for bitonic sequences:
- Bitonic Split
- Compare/swap opposite pairs (element $i$ with element $i + n/2$)
- Recurse on halves
- claim after swap, each half bitonic
- and bottom half smaller than top half
- bitonic sequence is valley plus mountain
- “sea level” where half above and half below
- if lucky, valley in first half
- then nothing swaps
- if shift whole sequence 1 left, effect is to shift the valley
- ditto for mountain
- conclude LHS is shift of valley, RHS is shift of mountain
- Recurse bitonic split on each half in parallel
- when done, have totally ordered singleton bitonic sequences
- $O(\log n)$ time with $n$ processors.
- Bitonic Merge
- Bitonic sort can be used to merge two sorted sequences
- flip second sequence and concatenate
- gives bitonic sequence
- apply sort of bitonic sequences
- So run merges in parallel, building sorted groups of size 1, 2, 4, 8, etc.
- $O(\log^2 n)$ time with $n$ processors
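A sequential Python sketch of the whole scheme ($n$ assumed a power of 2). Every compare/swap in one pass of the `for` loop is independent of the others, which is exactly what a SIMD/parallel machine exploits.

```python
def bitonic_merge(a, lo, n, up):
    """Sort the bitonic subsequence a[lo:lo+n] into `up` order by bitonic
    splits: compare/swap opposite pairs, then recurse on both halves.
    O(log n) parallel rounds."""
    if n == 1:
        return
    h = n // 2
    for i in range(lo, lo + h):            # compare/swap opposite pairs
        if (a[i] > a[i + h]) == up:
            a[i], a[i + h] = a[i + h], a[i]
    bitonic_merge(a, lo, h, up)            # each half is again bitonic
    bitonic_merge(a, lo + h, h, up)

def bitonic_sort(a, lo=0, n=None, up=True):
    """Merge sort built from bitonic merges: sort the halves in opposite
    directions (their concatenation is then bitonic) and merge.
    O(log^2 n) parallel time with n processors."""
    if n is None:
        n = len(a)
    if n == 1:
        return
    h = n // 2
    bitonic_sort(a, lo, h, True)           # ascending half ...
    bitonic_sort(a, lo + h, h, False)      # ... then descending: bitonic
    bitonic_merge(a, lo, n, up)

# e.g. a = [5, 3, 8, 1]; bitonic_sort(a); a == [1, 3, 5, 8]
```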
Wrap Up
- Han and Thorup 2002: $O(n\sqrt{\log\log n})$
- actually, $O(n\sqrt{\log(w/\log n)})$.
- range reduction plus packed sorting plus “signature sort” can sort in $O(n)$ time if $w > \log^{2+\epsilon} n$
- Thorup: if you can sort in $n\cdot f(n)$ time, you can build a priority queue with $O(f(n))$-time operations (the reverse direction is obvious)