## External Memory

Memory $M$, block size $B$, problem size $N$

basic operations:

- Scanning: $O(N/B)$
- reversing an array: $O(N/B)$

matrix operations

- $N\times N$ matrix
- Add?
- need to decide on representation
- use row-major dense representation
- add is linear scan

- Multiply?
- How externalize standard algorithm?
- Row major
- scan rows of first and columns of second
- so need $N$ reads to get each column entry from second matrix
- and $N^2$ items to read
- so $N^3$ over all

- Column major
- What if could somehow transpose second matrix?
- Now only need $N/B$ reads per output value
- So $N^3/B$
- Still wasteful. How?
- Many output values rely on same column
- So we read the same rows repeatedly
- How avoid?

- Block layout
- Think of matrix as array of smaller matrices
- Multiply main matrix as sum of products of smaller matrices
- We've seen addition is linear
- Bigger blocks will help
- But the limit: need block to fit in memory
- so size $\sqrt{M} \times \sqrt{M}$
- so need to read $N/\sqrt{M}$ blocks to produce an output block
- time per block is $M/B$ since block size $M$
- $N^2/M$ output blocks
- so total time is product $N^3/B\sqrt{M}$

- Row major
- Can improve this by starting with better sequential matrix
multiply---e.g Strassen.
- get to $N^{2.376}$ sequential
- can extend to external memory

Linked list

- operations insert, delete, traverse
- insert and delete cost $O(1)$
- must traversing $k$ items cost $O(k)$?
- keep segments of list in blocks
- keep each block half full
- now can traverse $B/2$ items on one read
- so $O(K/B)$ to traverse $K$ items
- on insert: if block overflows, split into two half-full blocks
- on delete:
- if block less than half full, check next block
- if it is more than half full, redistribute items
- now both blocks are at least half full
- otherwise, merge items from both blocks, drop empty block

- Note: can also insert and delete while traversing at cost $1/B$ per operation
- so, e.g., can hold $O(1)$ “fingers” in list and insert/delete at fingers.

2013 Lecture 27 End

Search trees:

- binary tree cost $O(\log n)$ memory transfers
- wastes effort because only getting one useful item per block read
- Instead use $(B+1)$-ary tree; block has $B$ splitters
- Require all leaves at same depth
- ensures balanced
- So depth $O(\log_B N)$, much better
- Need to keep balanced while insert/delete:
- require every block but root to be at least half full (so degree $\ge B/2$)
- also ensures depth $\log_B N$
- on insert, if block is full, split into two blocks and pass up a splitter to insert in parent
- may overflow parent, forcing recursive splits
- on delete, if block half empty, merge as for linked lists
- may empty parent, force recursive merges

- optimal in comparison model:
- Reading a block reveals position of query among $B$ splitters
- i.e. $\log B$ bits of information
- Need $\log N$ bits.
- so $\log N/\log B$ queries needed.

2011 Lecture 24 end. 2012 Lecture 26 end.

Sorting:

- Standard merge sort based on linear scans: $T(N)=2T(N/2) + O(N/B)=O((N/B)\log N)$
- Can do better by using more memory
- $M/B$-way merge
- keep head block of each of $M/B$ lists in memory
- keep emitting smallest item
- when emit $B$ items, write a block
- when block empties, read next block of that list
- $T(M) = O(N/B)+(M/B)T(N/(M/B)) = O((N/B)\log_{M/B}N/B)$

Optimal for comparison sort:

- Assume each block starts and is kept sorted (only strengthens lower bound)
- loading a block reveals placement of $B$ sorted items among $M$ in memory
- $\binom{M+B}{B} \approx (eM/B)^B$ possibilities
- so $\log() \approx B\log M/B$ bits of info
- need $N\log N$ bits of info about sort,
- except in each of $N/B$ blocks $B\log B$ bits are redundant (sorted blocks)
- total $N\log B$ redundant i.e. $N\log N/B$ needed.
- now divide by bits per block load

### Buffer Trees

Mismatch:

- In memory binary search tree can sort in $O(n\log n)$ (optimal) using inserts and deletes
- But sorting with $B$-tree costs $N\log_B N$
- So inserts are individually optimal, but not “batch optimal”
- Basic problem: writing one item costs 1, but writing $B$ items together only costs 1, i.e. $1/B$ per item
- Is there a data structure that gives “batch optimality”?
- Yes, but if inserts/queries are to happen in batches, sometimes you will have to wait for an answer until your batch is big enough
- Can achieve optimum throughput, but must sacrifice latency.

Basic idea:

- Focus on supporting $N$ inserts and then doing inorder traversal
- Idea: keep buffer in memory, push into $B$-tree when we have a block's worth
- Problem: different items push into different children, no longer a block's worth going down
- Solution: keep a buffer at each internal node
- Still a problem: writing one item into the child buffer costs 1 instead of $1/B$
- Solution: make buffers
**huge**, so most children get whole blocks written

Details:

- make buffer “as big as possible”: size $M$
- increase tree degree to $M/B$
- basic operation: pushing overflowing buffer down to children
- invariant: arriving overflow items are sorted
- buffer had $M$
- sort $M$ items
- merge with sorted incoming $X$ in time $O(M/B+X/B)$ (write to external)
- write sorted contiguous elements to proper children
- check for child overflow, recurse
- cost is at most 1 partial block per child ($M/B$ children) plus 1 block per $B$ items ($K/B$) so $M/B+K/B < 2K/B$.
- since at least $M$ items, account as $1/B$ per item

- on insert, put item in root buffer (in memory, so free)
- when a buffer fills, flush
- may fill child buffers. flush recursively
- when flush reaches leaves, store items using standard $B$-tree
ops (split leaf nodes, possibly recursing splits)
- cost of splits dominated by buffer flushing
- how handle buffers when split leaves?
- no problem: they are empty on root-leaf path because we have just flushed to leaves.

- cost of flushes:
- buffer flush costs $1/B$ per item
- but each flushed item descends a level
- total levels $\log_{M/B} N/B$
- So cost per item is $(1/B)\log_{M/B} N/B$
- So cost to insert $N$ is optimal sort cost

“Flush” operation

- empties
*all*buffers so can directly query structure - full buffers already accounted
- unfull buffers cost $M/B$ per internal node
- but number of internal nodes is $(N/B)/(M/B)$ ($N/B$ leaf blocks divided among $M/B$ leaf-blocks per parent)
- so total cost $N/B$
- so can sort in $N/B\log_{M/B} N/B$

Extensions

- search: “insert” query, look for closest match as flushes down
- delete: insert “hole” that triggers when it catches up to target item
- delete-min: extra buffer of $M$ minimum elements in memory. when empties, flush left path and extract leftmost items to refill
- range search: insert range, push to
*all*matching children on flush - all ops take $1/B \log_{M/B}N/B$ (plus range output)

## Cache Oblivious Algorithms

In practice, don't want to consider $B$ and $M$

- hard-coding them makes your algorithm non-portable
- and, with multi-level memory hierarchies, unclear where bottleneck is, i.e. which $B$ and $M$ you use
- without knowing $B$ and $M$, how can we tune for them?
- surprisingly often, you can
- key: divide and conquer algorithms
- these tend to be sequentially optimal
- and rapidly shrink subproblems to where they fit on one page
- regardless of the size of that page

Assumption: optimal memory management

- so we can assume whatever page eviction policy works for us
- because we know e.g. LRU with double memory is 2-competitive against optimal management
- and we are ignoring constant factors

Example: scanning contiguous array

- $N/B$
- without knowing $B$!

### Matrix Multiply

Adding matrices is easy

- $N \times N$ arrays
- treat as vectors and scan to add
- time $N^2/B$

Standard sequential algorithm

- $N \times N$ arrays
- $N^3$ time
- row major order
- so scanning a row is very cheap
- but scanning a column is super expensive
- unless $N \times N < M$, evict rows on every column
- so $N^3$ external memory accesses!
- can we do better?

store second matrix column major?

- One output now requires $N/B$ reads
- So $N^3/B$, an improvement
- problem: what if next have to multiply in other order?
- need fast transpose operation!
- and turns out still not optimal---not leveraging large $M$

Divide and conquer

- mentally partition $A$ and $B$ into four blocks
- solve 8 matrix-multiply subproblems
- add up the pieces
- sequentially, $T(N)=8T(N/2) + N^2 = O(N^3)$ (same as standard)
- external memory: $T(N)=8T(N/2) + N^2/B = O(N^3/B)$ (same as column major)
- because at level $i$ of recurrence do $2^iN^2/B$ fetches
- wait, at size $N=\sqrt{M}$, whole matrix fits in memory
- no more recursions. $T(c\sqrt{M}) = O(M/B)$
- this is at level $i$ such that $N/2^i = \sqrt{M}$, ie $i=\log N/\sqrt{M}$
- so runtime $2^iN^2/B= N^3/B\sqrt{M}$

### Binary Search Trees

$B$-trees are optimal, but you need to know $B$

Divide and conquer

- We generally search an array by binary search
- We query one value and cut problem in half
- bad in external memory, because only halve on one fetch
- $B$-tree is better, divides by $B$ on one fetch
- but we don't know $B$

Van Emde Boas insight

- use new degree parameter $d$ to avoid confusion
- problem of finding right child among $d$ in page is same problem
- before, we set $d=B$ and ignored it because “in memory”
- now we don't know $B$ but can still set a $d$
- recurrence: $T(N) = T(d) + T(n/d)$
- Balance: set $d=\sqrt{N}$
- $T(N) = 2T(\sqrt{N}) = 2^{\log\log N} = \log N$ (still sequentially optimal)
- Wait, when $N=B$ we get the whole array in memory and finish in $O(1)$
- so, stop at $i$ such that $N^{(1/2)^i}=B$, ie $B^{2^i}=N$ so $2^i = (\log N)/\log B =\log_B N$
- so runtime $\log_B N$
- draw a picture
- see how you get a chain of $\log B$ items all on same page.
- yields speedup

Generalizations:

- Dynamic
- Buffer Trees

### Linked Lists

Want $O(1)$ insert/delete but $O(1/B$ scan

- Cache aware, we used dynamic splitting/merging of contiguous “runs” to limit fragmentation
- How without $B$?
- Need to combat fragmentation without knowing whether it exists
- idea: defragment while scanning
- all scanned items become unfragmented
- use potential function (amount of fragmentation) to pay for defrag

Solution:

- on delete, just leave a hole
- on insert, append to end of external memory
- on scan, append all scanned items in order to end of memory

Analysis

- potential function: number of contiguous “runs” of elements in sequence on block
- equals number of “jumps” of pointers from one block to another
- suppose traverse $K$ items in $r$ runs
- the cost is at most $K/B+r$
- scanned items written to end
- eliminates all but first and last run:
- reduces runs by $r-2$
- so amortized cost $O(K/B)$

Controlling size

- all operations including traversal increase size of data structure
- solution: after $N/B$ work, scan whole list
- amortized cost $N/B$
- but now list is sorted/contiguous
- all previous blocks become free/irrelevant