Memory \(M\), block size \(B\), problem size \(N\)
basic operations:
Scanning: \(O(N/B)\)
reversing an array: \(O(N/B)\)
matrix operations
\(N\times N\) matrix
Add?
need to decide on representation
use row-major dense representation
add is linear scan
Multiply?
How externalize standard algorithm?
Row major
scan rows of first and columns of second
so each output entry needs \(N\) block reads to scan a column of the second matrix
and there are \(N^2\) output entries
so \(N^3\) memory transfers overall
Column major
What if could somehow transpose second matrix?
Now only need \(N/B\) reads per output value
So \(N^3/B\)
Still wasteful. How?
Many output values rely on same column
So we read the same rows repeatedly
How avoid?
Block layout
Think of matrix as array of smaller matrices
Multiply main matrix as sum of products of smaller matrices
We’ve seen addition is linear
Bigger blocks will help
But the limit: need block to fit in memory
so size \(\sqrt{M} \times \sqrt{M}\)
so need to read \(N/\sqrt{M}\) blocks to produce an output block
time per block is \(M/B\) since block size \(M\)
\(N^2/M\) output blocks
so total time is the product \((N^2/M)\cdot(N/\sqrt{M})\cdot(M/B)=N^3/(B\sqrt{M})\)
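The block layout can be sketched in code. This is a minimal in-memory sketch, with a hypothetical tile size `s` standing in for \(\sqrt{M}\) and matrices stored as plain row-major lists of lists:

```python
# Blocked (tiled) matrix multiply. Each s-by-s tile is meant to fit in
# internal memory; the three outer loops walk over (N/s)^3 tile products,
# each performed entirely in memory.

def blocked_matmul(A, B, s):
    """Multiply N x N matrices A and B using s x s tiles."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, s):          # tile row of C
        for j0 in range(0, n, s):      # tile column of C
            for k0 in range(0, n, s):  # sum over N/s tile products
                # multiply the A tile at (i0, k0) by the B tile at (k0, j0)
                for i in range(i0, min(i0 + s, n)):
                    for k in range(k0, min(k0 + s, n)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + s, n)):
                            C[i][j] += a * B[k][j]
    return C
```

With `s` chosen so three tiles fit in memory, each tile product costs \(O(M/B)\) transfers, matching the analysis above.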
Can improve this by starting with a better sequential matrix multiply, e.g. Strassen's \(O(N^{\log_2 7})\approx O(N^{2.81})\)
fancier algorithms (Coppersmith–Winograd) get to \(N^{2.376}\) sequential
can extend to external memory
Linked list
operations insert, delete, traverse
insert and delete cost \(O(1)\)
must traversing \(k\) items cost \(O(k)\)?
keep segments of list in blocks
keep each block half full
now can traverse \(B/2\) items on one read
so \(O(k/B)\) to traverse \(k\) items
on insert: if block overflows, split into two half-full blocks
on delete:
if block less than half full, check next block
if it is more than half full, redistribute items
now both blocks are at least half full
otherwise, merge items from both blocks, drop empty block
Note: can also insert and delete while traversing at cost \(1/B\) per operation
so, e.g., can hold \(O(1)\) “fingers” in list and insert/delete at fingers.
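A minimal sketch of the blocked list, with a small illustrative capacity `CAP` standing in for \(B\); only insertion with splitting and traversal are shown, and the delete-side redistribution/merging is omitted:

```python
# Linked list stored in blocks of capacity CAP (standing in for B).
# Each block stays between CAP // 2 and CAP items, so traversing k items
# touches O(k / CAP) blocks.

CAP = 4  # illustrative block size

class Block:
    def __init__(self, items):
        self.items = items      # items of this segment, in list order
        self.next = None        # next block in the list

def insert_after(block, index, value):
    """Insert value after position index within block; split on overflow."""
    block.items.insert(index + 1, value)
    if len(block.items) > CAP:
        # split into two half-full blocks
        mid = len(block.items) // 2
        tail = Block(block.items[mid:])
        tail.next = block.next
        block.items = block.items[:mid]
        block.next = tail

def traverse(head):
    """Yield all items; each 'block read' yields at least CAP/2 items."""
    while head is not None:
        yield from head.items
        head = head.next
```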
Search trees:
binary tree costs \(O(\log N)\) memory transfers
wastes effort because only getting one useful item per block read
Instead use \((B+1)\)-ary tree; block has \(B\) splitters
Require all leaves at same depth
ensures balanced
So depth \(O(\log_B N)\), much better
Need to keep balanced while insert/delete:
require every block but root to be at least half full (so degree \(\ge B/2\))
also ensures depth \(\log_B N\)
on insert, if block is full, split into two blocks and pass up a splitter to insert in parent
may overflow parent, forcing recursive splits
on delete, if block half empty, merge as for linked lists
may empty parent, force recursive merges
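The search path can be sketched as follows, assuming each `Node` packs its sorted splitters (and child pointers) into one block, so each loop iteration is one block read:

```python
# Search in a (B+1)-ary tree. Fetching a node is one block read; finding
# the right child among its splitters is then free in-memory work.
from bisect import bisect_right

class Node:
    def __init__(self, splitters, children=None):
        self.splitters = splitters   # sorted keys in this block
        self.children = children     # None for a leaf

def search(node, key):
    """Descend O(log_B N) nodes, one block read per level."""
    while node.children is not None:
        node = node.children[bisect_right(node.splitters, key)]
    return key in node.splitters
```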
optimal in comparison model:
Reading a block reveals position of query among \(B\) splitters
i.e. \(\log B\) bits of information
Need \(\log N\) bits.
so \(\log N/\log B = \log_B N\) queries needed.
Sorting:
Tree sort?
\(N\) inserts into \(B\)-tree, then scan
time \(O(N\log_B N)\).
Standard merge sort based on linear scans: \(T(N)=2T(N/2) + O(N/B)=O((N/B)\log N)\)
(Better than tree sort: the savings factor over \(N\log N\) improves from \(\log B\) to \(B\).)
Can do better by using more memory
\(M/B\)-way merge
keep head block of each of \(M/B\) lists in memory
keep emitting smallest item
when emit \(B\) items, write a block
when block empties, read next block of that list
\(T(N) = O(N/B)+(M/B)\,T(N/(M/B)) = O((N/B)\log_{M/B}(N/B))\)
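An in-memory sketch of the scheme, with `k` standing in for \(M/B\) and `run_len` for \(M\); `heapq.merge` plays the role of keeping one head item per list and emitting the smallest (assumes `k >= 2`):

```python
# Multiway merge sort: form sorted runs of length run_len (think M),
# then repeatedly k-way merge groups of runs (think M/B-way merges).
import heapq

def multiway_merge_sort(items, k, run_len):
    """Sort items via runs of length run_len merged k at a time."""
    runs = [sorted(items[i:i + run_len])
            for i in range(0, len(items), run_len)]
    while len(runs) > 1:
        # each pass k-way merges groups of runs, shrinking their count by k
        runs = [list(heapq.merge(*runs[i:i + k]))
                for i in range(0, len(runs), k)]
    return runs[0] if runs else []
```

Each pass reads and writes every item once (\(O(N/B)\) transfers), and there are \(\log_{M/B}(N/B)\) passes.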
Optimal for comparison sort:
Assume each block starts and is kept sorted (only strengthens lower bound)
loading a block reveals placement of \(B\) sorted items among \(M\) in memory
\(\binom{M+B}{B} \approx (eM/B)^B\) possibilities
so \(\log\binom{M+B}{B} \approx B\log(M/B)\) bits of info
need \(N\log N\) bits of info about sort,
except in each of \(N/B\) blocks \(B\log B\) bits are redundant (sorted blocks)
total \(N\log B\) bits redundant, i.e. \(N\log(N/B)\) bits needed.
now divide by bits per block load
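Carrying out that division recovers the merge-sort bound:

```latex
\frac{N\log(N/B)}{B\log(M/B)}
  \;=\; \frac{N}{B}\cdot\frac{\log(N/B)}{\log(M/B)}
  \;=\; \frac{N}{B}\,\log_{M/B}\frac{N}{B}
```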
Mismatch:
An in-memory binary search tree can sort in \(O(N\log N)\) (optimal) using inserts and deletes
But sorting with \(B\)-tree costs \(N\log_B N\)
So inserts are individually optimal, but not “batch optimal”
Basic problem: writing one item costs 1, but writing \(B\) items together only costs 1, i.e. \(1/B\) per item
Is there a data structure that gives “batch optimality”?
Yes, but if inserts/queries are to happen in batches, sometimes you will have to wait for an answer until your batch is big enough
Can achieve optimum throughput, but must sacrifice latency.
Basic idea:
Focus on supporting \(N\) inserts and then doing inorder traversal
Idea: keep buffer in memory, push into \(B\)-tree when we have a block’s worth
Problem: different items push into different children, no longer a block’s worth going down
Solution: keep a buffer at each internal node
Still a problem: writing one item into the child buffer costs 1 instead of \(1/B\)
Solution: make buffers huge, so most children get whole blocks written
Details:
make buffer “as big as possible”: size \(M\)
increase tree degree to \(M/B\)
but still B items per leaf node—tree over individual blocks
basic operation: pushing overflowing buffer down to children
invariant: arriving overflow items are sorted
buffer held up to \(M\) items
sort those \(M\) items
merge with the sorted incoming \(K\) items in time \(O(M/B+K/B)\) (writing to external memory)
write sorted contiguous elements to proper children
check for child overflow, recurse
cost is at most 1 partial block per child (\(M/B\) children) plus 1 block per \(B\) items (\(K/B\)), so \(M/B+K/B \le 2K/B\)
since at least \(M\) items move (\(K \ge M\)), account as \(O(1/B)\) per item
on insert, put item in root buffer (in memory, so free)
when a buffer fills, flush
may fill child buffers. flush recursively
when flush reaches leaves, store items using standard \(B\)-tree ops (split leaf nodes, possibly recursing splits)
cost of splits dominated by buffer flushing
how handle buffers when split leaves?
no problem: they are empty on root-leaf path because we have just flushed to leaves.
cost of flushes:
buffer flush costs \(1/B\) per item
but each flushed item descends a level
total levels \(\log_{M/B}(N/B)\)
So cost per item is \((1/B)\log_{M/B}(N/B)\)
So cost to insert \(N\) is optimal sort cost
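A toy sketch of the insert path, with a small illustrative constant `BUF` standing in for \(M\); real buffers below the root live in external memory, and leaf splitting on overflow is omitted:

```python
# Buffer-tree insertion sketch. When a node's buffer overflows, its items
# are sorted and distributed to the children's buffers in bulk, so each
# item pays ~1/B transfers per level it descends.
from bisect import bisect_right

BUF = 8  # buffer capacity, standing in for M

class BTNode:
    def __init__(self, splitters=None, children=None):
        self.buffer = []
        self.splitters = splitters or []   # sorted keys separating children
        self.children = children           # None for a leaf

def insert(root, key):
    root.buffer.append(key)                # root buffer is in memory: free
    if len(root.buffer) >= BUF:
        flush(root)

def flush(node):
    node.buffer.sort()
    if node.children is None:              # leaf: items stay here
        return
    for key in node.buffer:                # distribute sorted run to children
        child = node.children[bisect_right(node.splitters, key)]
        child.buffer.append(key)
    node.buffer = []
    for child in node.children:
        if len(child.buffer) >= BUF:       # recurse on overflowing children
            flush(child)
```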
“Flush” operation
empties all buffers so can directly query structure
full buffers already accounted
partially full buffers cost \(M/B\) per internal node
but number of internal nodes is \((N/B)/(M/B)\) (\(N/B\) leaf blocks divided among \(M/B\) leaf-blocks per parent)
so total cost \(N/B\)
so can sort in \(O((N/B)\log_{M/B}(N/B))\)
Extensions
search: “insert” query, look for closest match as flushes down
delete: insert “hole” that triggers when it catches up to target item
delete-min: extra buffer of \(M\) minimum elements in memory. when empties, flush left path and extract leftmost items to refill
range search: insert range, push to all matching children on flush
all ops take \((1/B)\log_{M/B}(N/B)\) (plus range output)
In practice, don’t want to consider \(B\) and \(M\)
hard-coding them makes your algorithm non-portable
and, with multi-level memory hierarchies, unclear where bottleneck is, i.e. which \(B\) and \(M\) you use
without knowing \(B\) and \(M\), how can we tune for them?
surprisingly often, you can
key: divide and conquer algorithms
these tend to be sequentially optimal
and rapidly shrink subproblems to where they fit on one page
regardless of the size of that page
Assumption: optimal memory management
so we can assume whatever page eviction policy works for us
because we know e.g. LRU with double memory is 2-competitive against optimal management
and we are ignoring constant factors
Example: scanning contiguous array
\(N/B\)
without knowing \(B\)!
Adding matrices is easy
\(N \times N\) arrays
treat as vectors and scan to add
time \(N^2/B\)
Standard sequential algorithm
\(N \times N\) arrays
\(N^3\) time
row major order
so scanning a row is very cheap
but scanning a column is super expensive: each entry is in a different block
unless the whole matrix fits (\(N^2 < M\)), those blocks are evicted before the next column can reuse them
so \(N^3\) external memory accesses!
can we do better?
store second matrix column major?
One output now requires \(N/B\) reads
So \(N^3/B\), an improvement
problem: what if next have to multiply in other order?
need fast transpose operation!
and turns out still not optimal—not leveraging large \(M\)
Divide and conquer
mentally partition \(A\) and \(B\) into four blocks
solve 8 matrix-multiply subproblems
add up the pieces
sequentially, \(T(N)=8T(N/2) + N^2 = O(N^3)\) (same as standard)
external memory: \(T(N)=8T(N/2) + N^2/B = O(N^3/B)\) (same as column major)
because at level \(i\) of recurrence do \(2^iN^2/B\) fetches
wait, at size \(N=\sqrt{M}\), whole matrix fits in memory
no more recursions. \(T(c\sqrt{M}) = O(M/B)\)
this is at level \(i\) such that \(N/2^i = \sqrt{M}\), i.e. \(i=\log(N/\sqrt{M})\)
so runtime \(2^iN^2/B = N^3/(B\sqrt{M})\)
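A sketch of the divide-and-conquer multiply; note that no \(B\) or \(M\) appears in the code, since subproblems fall into cache automatically once small enough. Assumes square matrices of power-of-two size, and recurses all the way to 1x1 for simplicity:

```python
# Cache-oblivious matrix multiply: each quadrant of C is the sum of two
# half-size products, giving 8 recursive multiplies plus O(N^2) additions.

def co_matmul(A, B):
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    def quad(X, r, c):   # the h x h quadrant of X at row r, column c
        return [row[c:c + h] for row in X[r:r + h]]
    def add(X, Y):       # entrywise sum of two h x h matrices
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    def mul2(r, c):      # C quadrant (r, c) = A[r,0]B[0,c] + A[r,h]B[h,c]
        return add(co_matmul(quad(A, r, 0), quad(B, 0, c)),
                   co_matmul(quad(A, r, h), quad(B, h, c)))
    top = [a + b for a, b in zip(mul2(0, 0), mul2(0, h))]
    bot = [a + b for a, b in zip(mul2(h, 0), mul2(h, h))]
    return top + bot
```

In practice one would stop the recursion at a modest base size for constant-factor reasons; the asymptotics are unchanged.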
\(B\)-trees are optimal, but you need to know \(B\)
Divide and conquer
We generally search an array by binary search
We query one value and cut problem in half
bad in external memory, because only halve on one fetch
\(B\)-tree is better, divides by \(B\) on one fetch
but we don’t know \(B\)
Van Emde Boas insight
use new degree parameter \(d\) to avoid confusion
problem of finding right child among \(d\) in page is same problem
before, we set \(d=B\) and ignored it because “in memory”
now we don’t know \(B\) but can still set a \(d\)
recurrence: \(T(N) = T(d) + T(N/d)\)
Balance: set \(d=\sqrt{N}\)
\(T(N) = 2T(\sqrt{N})\), so \(T(N) = 2^{\log\log N} = \log N\) (still sequentially optimal)
Wait, when \(N=B\) we get the whole array in memory and finish in \(O(1)\)
so, stop at \(i\) such that \(N^{(1/2)^i}=B\), ie \(B^{2^i}=N\) so \(2^i = (\log N)/\log B =\log_B N\)
so runtime \(\log_B N\)
draw a picture
see how you get a chain of \(\log B\) items all on same page.
yields speedup
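The layout itself can be sketched as a recursive ordering. Here nodes are numbered heap-style (root 1, children \(2v\) and \(2v+1\)) and the tree is assumed complete with \(2^h-1\) nodes:

```python
# van Emde Boas layout of a complete binary search tree: lay out the
# top half-height subtree first, then each bottom subtree contiguously.
# Root-to-leaf paths then cross page boundaries only O(log_B N) times.

def veb_order(nodes):
    """Return heap-numbered node indices in vEB layout order."""
    def rec(root, h):
        if h == 1:
            return [root]
        top_h = h // 2
        bottom_h = h - top_h
        order = rec(root, top_h)           # top subtree of height h/2
        # roots of the bottom subtrees are the nodes just below the top
        frontier = [root]
        for _ in range(top_h):
            frontier = [c for v in frontier for c in (2 * v, 2 * v + 1)]
        for r in frontier:                 # each bottom subtree contiguous
            order += rec(r, bottom_h)
        return order
    h = nodes.bit_length()                 # assumes nodes == 2^h - 1
    return rec(1, h)
```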
Generalizations:
Dynamic
Buffer Trees
Want \(O(1)\) insert/delete but \(O(1/B)\) per item scanned
In the cache-aware setting, we used dynamic splitting/merging of blocks to limit fragmentation
How without \(B\)?
Need to combat fragmentation without knowing whether it exists
idea: defragment while scanning
all scanned items become unfragmented
use potential function (amount of fragmentation) to pay for defrag
Solution:
on delete, just leave a hole
on insert, append to end of external memory
on scan, append all scanned items in order to end of memory
Analysis
potential function: number of contiguous “runs” of elements in sequence on block
equals number of “jumps” of pointers from one block to another
suppose traverse \(K\) items in \(r\) runs
the cost is at most \(K/B+r\)
scanned items written to end
eliminates all but first and last run:
reduces runs by \(r-2\)
so amortized cost \(O(K/B)\)
Controlling size
all operations including traversal increase size of data structure
solution: after every \(N/B\) block operations, scan and rewrite the whole list
the rewrite costs \(N/B\), so it is \(O(1)\) amortized per operation
but now list is sorted/contiguous
all previous blocks become free/irrelevant
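A toy sketch of the defragment-while-scanning idea; `disk` plays the role of append-only external memory, deletion would leave a hole, and a traversal rewrites everything it reads contiguously at the end:

```python
# Self-defragmenting list. Cells are appended, never moved in place;
# a traversal copies the items it reads to the end of the array, so the
# scanned run becomes one contiguous (unfragmented) run.

class SelfDefragList:
    def __init__(self, items):
        self.disk = list(items)          # initially one contiguous run
        self.head = 0                    # cell index of the first item
        # next[i]: cell index of the successor of cell i
        self.next = {i: i + 1 for i in range(len(items) - 1)}

    def traverse(self, k):
        """Read k items from the head, rewriting them contiguously."""
        out, pos = [], self.head
        for _ in range(k):
            out.append(self.disk[pos])
            pos = self.next.get(pos)     # may jump across blocks (a 'run')
        start = len(self.disk)
        self.disk.extend(out)            # scanned items now contiguous
        for i in range(start, start + k - 1):
            self.next[i] = i + 1
        if pos is not None:
            self.next[start + k - 1] = pos   # link to unscanned suffix
        self.head = start
        return out
```

Each traversal pays \(O(k/B+r)\) for \(r\) run boundaries but reduces the run count by \(r-2\), which is the potential argument above.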