Memory \(M\), block size \(B\), problem size \(N\)
basic operations:
Scanning: \(O(N/B)\)
reversing an array: \(O(N/B)\)
matrix operations
\(N\times N\) matrix
Add?
need to decide on representation
use row-major dense representation
add is linear scan
Multiply?
How externalize standard algorithm?
Row major
scan rows of first and columns of second
so each output entry needs \(N\) block reads to scan a column of the second matrix
and there are \(N^2\) output entries
so \(N^3\) memory transfers overall
Column major
What if could somehow transpose second matrix?
Now only need \(N/B\) reads per output value
So \(N^3/B\)
Still wasteful. How?
Many output values rely on same column
So we read the same rows repeatedly
How avoid?
Block layout
Think of matrix as array of smaller matrices
Multiply main matrix as sum of products of smaller matrices
We’ve seen addition is linear
Bigger blocks will help
But the limit: need block to fit in memory
so size \(\sqrt{M} \times \sqrt{M}\)
so need to read \(N/\sqrt{M}\) blocks to produce an output block
time per block is \(M/B\) since block size \(M\)
\(N^2/M\) output blocks
so total time is the product \((N^2/M)\cdot(N/\sqrt{M})\cdot(M/B)=N^3/(B\sqrt{M})\)
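The block layout can be sketched in code. This is a minimal in-memory sketch, with a hypothetical tile size `s` standing in for \(\sqrt{M}\) and matrices stored as plain row-major lists of lists:

```python
# Blocked (tiled) matrix multiply. Each s-by-s tile is meant to fit in
# internal memory; the three outer loops walk over (N/s)^3 tile products,
# each performed entirely in memory.

def blocked_matmul(A, B, s):
    """Multiply N x N matrices A and B using s x s tiles."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, s):          # tile row of C
        for j0 in range(0, n, s):      # tile column of C
            for k0 in range(0, n, s):  # sum over N/s tile products
                # multiply the A tile at (i0, k0) by the B tile at (k0, j0)
                for i in range(i0, min(i0 + s, n)):
                    for k in range(k0, min(k0 + s, n)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + s, n)):
                            C[i][j] += a * B[k][j]
    return C
```

With `s` chosen so three tiles fit in memory, each tile product costs \(O(M/B)\) transfers, matching the analysis above.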
Can improve this by starting with a better sequential matrix multiply, e.g. Strassen's \(O(N^{\log_2 7})\approx O(N^{2.81})\)
fancier algorithms (Coppersmith–Winograd) get to \(N^{2.376}\) sequential
can extend to external memory
Linked list
operations insert, delete, traverse
insert and delete cost \(O(1)\)
must traversing \(k\) items cost \(O(k)\)?
keep segments of list in blocks
keep each block half full
now can traverse \(B/2\) items on one read
so \(O(k/B)\) to traverse \(k\) items
on insert: if block overflows, split into two half-full blocks
on delete:
if block less than half full, check next block
if it is more than half full, redistribute items
now both blocks are at least half full
otherwise, merge items from both blocks, drop empty block
Note: can also insert and delete while traversing at cost \(1/B\) per operation
so, e.g., can hold \(O(1)\) “fingers” in list and insert/delete at fingers.
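A minimal sketch of the blocked list, with a small illustrative capacity `CAP` standing in for \(B\); only insertion with splitting and traversal are shown, and the delete-side redistribution/merging is omitted:

```python
# Linked list stored in blocks of capacity CAP (standing in for B).
# Each block stays between CAP // 2 and CAP items, so traversing k items
# touches O(k / CAP) blocks.

CAP = 4  # illustrative block size

class Block:
    def __init__(self, items):
        self.items = items      # items of this segment, in list order
        self.next = None        # next block in the list

def insert_after(block, index, value):
    """Insert value after position index within block; split on overflow."""
    block.items.insert(index + 1, value)
    if len(block.items) > CAP:
        # split into two half-full blocks
        mid = len(block.items) // 2
        tail = Block(block.items[mid:])
        tail.next = block.next
        block.items = block.items[:mid]
        block.next = tail

def traverse(head):
    """Yield all items; each 'block read' yields at least CAP/2 items."""
    while head is not None:
        yield from head.items
        head = head.next
```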
Search trees:
binary tree costs \(O(\log N)\) memory transfers
wastes effort because only getting one useful item per block read
Instead use \((B+1)\)-ary tree; block has \(B\) splitters
Require all leaves at same depth
ensures balanced
So depth \(O(\log_B N)\), much better
Need to keep balanced while insert/delete:
require every block but root to be at least half full (so degree \(\ge B/2\))
also ensures depth \(\log_B N\)
on insert, if block is full, split into two blocks and pass up a splitter to insert in parent
may overflow parent, forcing recursive splits
on delete, if block half empty, merge as for linked lists
may empty parent, force recursive merges
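The search path can be sketched as follows, assuming each `Node` packs its sorted splitters (and child pointers) into one block, so each loop iteration is one block read:

```python
# Search in a (B+1)-ary tree. Fetching a node is one block read; finding
# the right child among its splitters is then free in-memory work.
from bisect import bisect_right

class Node:
    def __init__(self, splitters, children=None):
        self.splitters = splitters   # sorted keys in this block
        self.children = children     # None for a leaf

def search(node, key):
    """Descend O(log_B N) nodes, one block read per level."""
    while node.children is not None:
        node = node.children[bisect_right(node.splitters, key)]
    return key in node.splitters
```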
optimal in comparison model:
Reading a block reveals position of query among \(B\) splitters
i.e. \(\log B\) bits of information
Need \(\log N\) bits.
so \(\log N/\log B = \log_B N\) queries needed.
Sorting:
Tree sort?
\(N\) inserts into \(B\)-tree, then scan
time \(O(N\log_B N)\).
Standard merge sort based on linear scans: \(T(N)=2T(N/2) + O(N/B)=O((N/B)\log N)\)
(Better than tree sort: the savings factor over \(N\log N\) improves from \(\log B\) to \(B\).)
Can do better by using more memory
\(M/B\)-way merge
keep head block of each of \(M/B\) lists in memory
keep emitting smallest item
when emit \(B\) items, write a block
when block empties, read next block of that list
\(T(N) = O(N/B)+(M/B)\,T(N/(M/B)) = O((N/B)\log_{M/B}(N/B))\)
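An in-memory sketch of the scheme, with `k` standing in for \(M/B\) and `run_len` for \(M\); `heapq.merge` plays the role of keeping one head item per list and emitting the smallest (assumes `k >= 2`):

```python
# Multiway merge sort: form sorted runs of length run_len (think M),
# then repeatedly k-way merge groups of runs (think M/B-way merges).
import heapq

def multiway_merge_sort(items, k, run_len):
    """Sort items via runs of length run_len merged k at a time."""
    runs = [sorted(items[i:i + run_len])
            for i in range(0, len(items), run_len)]
    while len(runs) > 1:
        # each pass k-way merges groups of runs, shrinking their count by k
        runs = [list(heapq.merge(*runs[i:i + k]))
                for i in range(0, len(runs), k)]
    return runs[0] if runs else []
```

Each pass reads and writes every item once (\(O(N/B)\) transfers), and there are \(\log_{M/B}(N/B)\) passes.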
Optimal for comparison sort:
Assume each block starts and is kept sorted (only strengthens lower bound)
loading a block reveals placement of \(B\) sorted items among \(M\) in memory
\(\binom{M+B}{B} \approx (eM/B)^B\) possibilities
so \(\log\binom{M+B}{B} \approx B\log(M/B)\) bits of info
need \(N\log N\) bits of info about sort,
except in each of \(N/B\) blocks \(B\log B\) bits are redundant (sorted blocks)
total \(N\log B\) bits redundant, i.e. \(N\log(N/B)\) bits needed.
now divide by bits per block load
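Carrying out that division recovers the merge-sort bound:

```latex
\frac{N\log(N/B)}{B\log(M/B)}
  \;=\; \frac{N}{B}\cdot\frac{\log(N/B)}{\log(M/B)}
  \;=\; \frac{N}{B}\,\log_{M/B}\frac{N}{B}
```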
Mismatch:
An in-memory binary search tree can sort in \(O(N\log N)\) (optimal) using inserts and deletes
But sorting with \(B\)-tree costs \(N\log_B N\)
So inserts are individually optimal, but not “batch optimal”
Basic problem: writing one item costs 1, but writing \(B\) items together only costs 1, i.e. \(1/B\) per item
Is there a data structure that gives “batch optimality”?
Yes, but if inserts/queries are to happen in batches, sometimes you will have to wait for an answer until your batch is big enough
Can achieve optimum throughput, but must sacrifice latency.
Basic idea:
Focus on supporting \(N\) inserts and then doing inorder traversal
Idea: keep buffer in memory, push into \(B\)-tree when we have a block’s worth
Problem: different items push into different children, no longer a block’s worth going down
Solution: keep a buffer at each internal node
Still a problem: writing one item into the child buffer costs 1 instead of \(1/B\)
Solution: make buffers huge, so most children get whole blocks written
Details:
make buffer “as big as possible”: size \(M\)
increase tree degree to \(M/B\)
but still B items per leaf node—tree over individual blocks
basic operation: pushing overflowing buffer down to children
invariant: arriving overflow items are sorted
buffer held up to \(M\) items
sort those \(M\) items
merge with the sorted incoming \(K\) items in time \(O(M/B+K/B)\) (writing to external memory)
write sorted contiguous elements to proper children
check for child overflow, recurse
cost is at most 1 partial block per child (\(M/B\) children) plus 1 block per \(B\) items (\(K/B\)), so \(M/B+K/B \le 2K/B\)
since at least \(M\) items move (\(K \ge M\)), account as \(O(1/B)\) per item
on insert, put item in root buffer (in memory, so free)
when a buffer fills, flush
may fill child buffers. flush recursively
when flush reaches leaves, store items using standard \(B\)-tree ops (split leaf nodes, possibly recursing splits)
cost of splits dominated by buffer flushing
how handle buffers when split leaves?
no problem: they are empty on root-leaf path because we have just flushed to leaves.
cost of flushes:
buffer flush costs \(1/B\) per item
but each flushed item descends a level
total levels \(\log_{M/B}(N/B)\)
So cost per item is \((1/B)\log_{M/B}(N/B)\)
So cost to insert \(N\) is optimal sort cost
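A toy sketch of the insert path, with a small illustrative constant `BUF` standing in for \(M\); real buffers below the root live in external memory, and leaf splitting on overflow is omitted:

```python
# Buffer-tree insertion sketch. When a node's buffer overflows, its items
# are sorted and distributed to the children's buffers in bulk, so each
# item pays ~1/B transfers per level it descends.
from bisect import bisect_right

BUF = 8  # buffer capacity, standing in for M

class BTNode:
    def __init__(self, splitters=None, children=None):
        self.buffer = []
        self.splitters = splitters or []   # sorted keys separating children
        self.children = children           # None for a leaf

def insert(root, key):
    root.buffer.append(key)                # root buffer is in memory: free
    if len(root.buffer) >= BUF:
        flush(root)

def flush(node):
    node.buffer.sort()
    if node.children is None:              # leaf: items stay here
        return
    for key in node.buffer:                # distribute sorted run to children
        child = node.children[bisect_right(node.splitters, key)]
        child.buffer.append(key)
    node.buffer = []
    for child in node.children:
        if len(child.buffer) >= BUF:       # recurse on overflowing children
            flush(child)
```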
“Flush” operation
empties all buffers so can directly query structure
full buffers already accounted
partially full buffers cost \(M/B\) per internal node
but number of internal nodes is \((N/B)/(M/B)\) (\(N/B\) leaf blocks divided among \(M/B\) leaf-blocks per parent)
so total cost \(N/B\)
so can sort in \(O((N/B)\log_{M/B}(N/B))\)
Extensions
search: “insert” query, look for closest match as flushes down
delete: insert “hole” that triggers when it catches up to target item
delete-min: extra buffer of \(M\) minimum elements in memory. when empties, flush left path and extract leftmost items to refill
range search: insert range, push to all matching children on flush
all ops take \((1/B)\log_{M/B}(N/B)\) (plus range output)
In practice, don’t want to consider \(B\) and \(M\)
hard-coding them makes your algorithm non-portable
and, with multi-level memory hierarchies, unclear where bottleneck is, i.e. which \(B\) and \(M\) you use
without knowing \(B\) and \(M\), how can we tune for them?
surprisingly often, you can
key: divide and conquer algorithms
these tend to be sequentially optimal
and rapidly shrink subproblems to where they fit on one page
regardless of the size of that page
Assumption: optimal memory management
so we can assume whatever page eviction policy works for us
because we know e.g. LRU with double memory is 2-competitive against optimal management
and we are ignoring constant factors
Example: scanning contiguous array
\(N/B\)
without knowing \(B\)!
Adding matrices is easy
\(N \times N\) arrays
treat as vectors and scan to add
time \(N^2/B\)
Standard sequential algorithm
\(N \times N\) arrays
\(N^3\) time
row major order
so scanning a row is very cheap
but scanning a column is super expensive: each entry is in a different block
unless the whole matrix fits (\(N^2 < M\)), those blocks are evicted before the next column can reuse them
so \(N^3\) external memory accesses!
can we do better?
store second matrix column major?
One output now requires \(N/B\) reads
So \(N^3/B\), an improvement
problem: what if next have to multiply in other order?
need fast transpose operation!
and turns out still not optimal—not leveraging large \(M\)
Divide and conquer
mentally partition \(A\) and \(B\) into four blocks
solve 8 matrix-multiply subproblems
add up the pieces
sequentially, \(T(N)=8T(N/2) + N^2 = O(N^3)\) (same as standard)
external memory: \(T(N)=8T(N/2) + N^2/B = O(N^3/B)\) (same as column major)
because at level \(i\) of recurrence do \(2^iN^2/B\) fetches
wait, at size \(N=\sqrt{M}\), whole matrix fits in memory
no more recursions. \(T(c\sqrt{M}) = O(M/B)\)
this is at level \(i\) such that \(N/2^i = \sqrt{M}\), i.e. \(i=\log(N/\sqrt{M})\)
so runtime \(2^iN^2/B = N^3/(B\sqrt{M})\)
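A sketch of the divide-and-conquer multiply; note that no \(B\) or \(M\) appears in the code, since subproblems fall into cache automatically once small enough. Assumes square matrices of power-of-two size, and recurses all the way to 1x1 for simplicity:

```python
# Cache-oblivious matrix multiply: each quadrant of C is the sum of two
# half-size products, giving 8 recursive multiplies plus O(N^2) additions.

def co_matmul(A, B):
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    def quad(X, r, c):   # the h x h quadrant of X at row r, column c
        return [row[c:c + h] for row in X[r:r + h]]
    def add(X, Y):       # entrywise sum of two h x h matrices
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    def mul2(r, c):      # C quadrant (r, c) = A[r,0]B[0,c] + A[r,h]B[h,c]
        return add(co_matmul(quad(A, r, 0), quad(B, 0, c)),
                   co_matmul(quad(A, r, h), quad(B, h, c)))
    top = [a + b for a, b in zip(mul2(0, 0), mul2(0, h))]
    bot = [a + b for a, b in zip(mul2(h, 0), mul2(h, h))]
    return top + bot
```

In practice one would stop the recursion at a modest base size for constant-factor reasons; the asymptotics are unchanged.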
\(B\)-trees are optimal, but you need to know \(B\)
Divide and conquer
We generally search an array by binary search
We query one value and cut problem in half
bad in external memory, because only halve on one fetch
\(B\)-tree is better, divides by \(B\) on one fetch
but we don’t know \(B\)
Van Emde Boas insight
use new degree parameter \(d\) to avoid confusion
problem of finding right child among \(d\) in page is same problem
before, we set \(d=B\) and ignored it because “in memory”
now we don’t know \(B\) but can still set a \(d\)
recurrence: \(T(N) = T(d) + T(N/d)\)
Balance: set \(d=\sqrt{N}\)
\(T(N) = 2T(\sqrt{N})\), so \(T(N) = 2^{\log\log N} = \log N\) (still sequentially optimal)
Wait, when \(N=B\) we get the whole array in memory and finish in \(O(1)\)
so, stop at \(i\) such that \(N^{(1/2)^i}=B\), ie \(B^{2^i}=N\) so \(2^i = (\log N)/\log B =\log_B N\)
so runtime \(\log_B N\)
draw a picture
see how you get a chain of \(\log B\) items all on same page.
yields speedup
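The layout itself can be sketched as a recursive ordering. Here nodes are numbered heap-style (root 1, children \(2v\) and \(2v+1\)) and the tree is assumed complete with \(2^h-1\) nodes:

```python
# van Emde Boas layout of a complete binary search tree: lay out the
# top half-height subtree first, then each bottom subtree contiguously.
# Root-to-leaf paths then cross page boundaries only O(log_B N) times.

def veb_order(nodes):
    """Return heap-numbered node indices in vEB layout order."""
    def rec(root, h):
        if h == 1:
            return [root]
        top_h = h // 2
        bottom_h = h - top_h
        order = rec(root, top_h)           # top subtree of height h/2
        # roots of the bottom subtrees are the nodes just below the top
        frontier = [root]
        for _ in range(top_h):
            frontier = [c for v in frontier for c in (2 * v, 2 * v + 1)]
        for r in frontier:                 # each bottom subtree contiguous
            order += rec(r, bottom_h)
        return order
    h = nodes.bit_length()                 # assumes nodes == 2^h - 1
    return rec(1, h)
```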
Generalizations:
Dynamic
Buffer Trees
Want \(O(1)\) insert/delete but \(O(1/B)\) per item scanned
In the cache-aware setting, we used dynamic splitting/merging of blocks to limit fragmentation
How without \(B\)?
Need to combat fragmentation without knowing whether it exists
idea: defragment while scanning
all scanned items become unfragmented
use potential function (amount of fragmentation) to pay for defrag
Solution:
on delete, just leave a hole
on insert, append to end of external memory
on scan, append all scanned items in order to end of memory
Analysis
potential function: number of contiguous “runs” of elements in sequence on block
equals number of “jumps” of pointers from one block to another
suppose traverse \(K\) items in \(r\) runs
the cost is at most \(K/B+r\)
scanned items written to end
eliminates all but first and last run:
reduces runs by \(r-2\)
so amortized cost \(O(K/B)\)
Controlling size
all operations including traversal increase size of data structure
solution: after every \(N/B\) block operations, scan and rewrite the whole list
the rewrite costs \(N/B\), so it is \(O(1)\) amortized per operation
but now list is sorted/contiguous
all previous blocks become free/irrelevant
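A toy sketch of the defragment-while-scanning idea; `disk` plays the role of append-only external memory, deletion would leave a hole, and a traversal rewrites everything it reads contiguously at the end:

```python
# Self-defragmenting list. Cells are appended, never moved in place;
# a traversal copies the items it reads to the end of the array, so the
# scanned run becomes one contiguous (unfragmented) run.

class SelfDefragList:
    def __init__(self, items):
        self.disk = list(items)          # initially one contiguous run
        self.head = 0                    # cell index of the first item
        # next[i]: cell index of the successor of cell i
        self.next = {i: i + 1 for i in range(len(items) - 1)}

    def traverse(self, k):
        """Read k items from the head, rewriting them contiguously."""
        out, pos = [], self.head
        for _ in range(k):
            out.append(self.disk[pos])
            pos = self.next.get(pos)     # may jump across blocks (a 'run')
        start = len(self.disk)
        self.disk.extend(out)            # scanned items now contiguous
        for i in range(start, start + k - 1):
            self.next[i] = i + 1
        if pos is not None:
            self.next[start + k - 1] = pos   # link to unscanned suffix
        self.head = start
        return out
```

Each traversal pays \(O(k/B+r)\) for \(r\) run boundaries but reduces the run count by \(r-2\), which is the potential argument above.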