6.854 Lecture Notes

This material takes 1 hour.

Persistent Data Structures

Sarnak and Tarjan, "Planar Point Location using persistent trees", Communications of the ACM 29 (1986) 669--679

"Making Data Structures Persistent" by Driscoll, Sarnak, Sleator and Tarjan Journal of Computer and System Sciences 38(1) 1989

Idea: be able to query and/or modify past versions of data structure.

ephemeral: changes to struct destroy all past info
partial persistence: changes to most recent version, query to all past versions
full persistence: queries and changes to all past versions (creates “multiple worlds” situtation)

Goal: general technique that can be applied to any data structure.

Persistent Trees

Many data structures

consist of fixed-size nodes conencted by pointers
accessed from some root entry point
data structure ops traverse pointers
and modify node fields/pointers
runtime determined by number of traverse/modify steps

Our goal: use simulation to transform any ephemeral data structure into a persistent one

measure of success: time and space cost for each primitive traversal and primitive modification step.
An exposed data structure operation will do some number of primitive steps
we will add persistence to primitive steps, so any data structure can be made persistent
we'll measure cost of simulation in slowdown and storage
We'll limit to trees.
But ideas generalize

Key principles:

Lazy: don't work till you must
If you must work, use your work to “simplify” data structure too
force user (adversary) to spend lots of time to make you work
analysis via potential function measuring “complexity” of structure.
- user has to do lots of changes to raise potential,
- so you can spread cost of complex ops over many insertions.
- potential says how much work adversary has done that you haven't had to deal with yet.
- You can charge your work against that work.
another perspective: procrastinate. if you don't do the work, might never need to.

Lazy approach:

During change, do minimum, i.e. nothing.
Indeed, for only change, don't even need to remember change!
but we also want lookups, so record changes
Then apply relevant changes when a lookup happens
So, amortized cost 1.

Problem?

issue with second and further lookups
so, do need to incorporate

Full copy bad.

Fat nodes method:

replace each (single-valued) field of data structure by list of all values taken, sorted by time.
requires $O(1)$ space per data change (unavoidable if keep old data)
to lookup data field, need to search based on time.
store values in binary tree
checking/changing a data field takes $O(\log m)$ time after $m$ updates
multiplicative slowdown of $O(\log m)$ in structure access.
big number from data structures perspective

Path copying:

can only reach node by traversing pointers starting from root
changes to a node only visible to ancestors in pointer structure
when change a node, copy it and ancestors (back to root of data structure
keep list of roots sorted by update time
$O(\log m)$ time to find right root (or const, if time is integers) (additive slowdown)
same access time as original structure
additive instead of multiplicative $O(\log m)$.
modification time and space usage equals number of ancestors: possibly huge!
especially if structure not balanced!

Combined Solution (trees only):

Path copying is too aggressive
how can we be lazy about path copying?
fat nodes bad, but “pleasingly plump” ok.
use fat nodes, but when get too fat, make new path copy
in each node, store 1 extra time-stamped field
if full, overrides one of standard fields for any accesses later than stamped time.
access rule
- standard access, just check for overrides while following pointers
- constant factor increase in access time.
update rule:
- when need to change/copy pointer, use extra field if available.
- otherwise, make new copy of node with new info, and recursively modify parent.
- Idea: lazy path copying
- copy only as much as you must

Analysis idea: adversary has to do many changes to make copying expensive.

analysis: potential function $\phi$ measuring amount of “deferred work”.
- amortized$_i=$ real$_i+\phi_i-\phi_{i-1}$
- then $\sum a_i=\sum r_i+\phi_n-\phi_0$
- upper bounds real cost if $\phi_n \ge \phi_0$.
- sufficient that $\phi_n \ge 0$ and $\phi_0$ fixed

Analysis

live node: pointed at by current root.
potential function: number of full live nodes.
copying a node is free (new copy not full, pays for copy space/time)
pay for filling an extra pointer (do only once, since can stop at that point).
amortized space per update: $O(1)$.

Intro

remind about hash table survey

Talk about splay tree inserts and deletes

by splaying target to root
split and join
need to define/account for potential as create nodes and merge trees

Make point about all at once

Remind about $O(1)$ persistent pointer changes

Application: planar point location.

a computational geometry foreshadow
- Different model
- algebraic operations are $O(1)$
- e.g. computing line intersections or lengths
- even when these return irrationals!
- sweep precision under the rug.
planar subdivision
- division of plane into polygons
- query: “what polygon contains this point”
- define complexity of input?
- edges of polygons form $n$ segments/edges
- segments meet only at ends/vertices
numerous special-purpose solutions
One solution: dimension reduction
- vertical line through each vertex
- divides into slabs
- in slab, segments maintain one vertical ordering
- find query point slab by binary search
- build binary search tree for slab with “above-below” queries
- $n$ binary search trees, size $O(n^2)$, time $O(n^2\log n)$
observation: trees all very similar
think of $x$ axis as time, slabs as “epochs”
at end of epoch, “delete” segments that end, “insert” those that start.
over all time, only $n$ inserts, $n$ deletes.
must be able to query over all times

Persistent sorted sets:

find$(x,s,t)$ find (largest key below) $x$ in set $s$ at time $t$
insert$(i,s,t)$ insert $i$ in $s$ at time $t$
delete$(i,s,t)$.

Method: persistent red-black trees

Red black trees

aggressive rebalancers
amortized cost $O(1)$ to change a field.
store red/black bit in each node
use red/black bit to rebalance.
depth $O(\log n)$

Add persistence

updates only in “present”
search: standard binary tree search; $O(1)$ slowdown for persistence
update: causes changes in red/black fields on path to item, $O(1)$ rotations.
result: $(\log n)$ space per insert/delete
geometry does $O(n)$ changes, so $O(n\log n)$ space.
$O(\log n)$ query time.

Improvement:

red-black bits used only for updates
only need current version of red-black bits
don't persist old bit versions: just overwrite
only persistent updates needed are for $O(1)$ rotations
so $O(1)$ space per update
so $O(n)$ space overall.

Result: $O(n)$ space, $O(\log n)$ query time for planar point location.

Extensions:

method extends to arbitrary pointer-based structures.
$O(1)$ cost per update for any pointer-based structure with any constant indegree and constant number of fields.
use one extra pointer per field in node
and one pointer per indegree
full persistence with same bounds.