Persistent Data Structures
6.854 Notes #2

David Karger

1 Persistent Data Structures

Sarnak and Tarjan, "Planar Point Location using persistent trees", Communications of the ACM 29 (1986) 669–679

"Making Data Structures Persistent" by Driscoll, Sarnak, Sleator and Tarjan Journal of Computer and System Sciences 38(1) 1989

Idea: be able to query and/or modify past versions of data structure.

ephemeral: changes to struct destroy all past info
partial persistence: changes to most recent version, query to all past versions
full persistence: queries and changes to all past versions (creates “multiple worlds” situtation)

Goal: general technique that can be applied to any data structure.

Many data structures

Our goal: use simulation to transform any ephemeral data structure into a persistent one

measure of success: time and space cost for each primitive traversal and primitive modification step.
An exposed data structure operation will do some number of primitive steps
we will add persistence to primitive steps, so any data structure can be made persistent
we’ll measure cost of simulation in slowdown and storage
We’ll limit to trees.
But ideas generalize

Key principles:

Lazy: don’t work till you must
If you must work, use your work to “simplify” data structure too
force user (adversary) to spend lots of time to make you work
analysis via potential function measuring “complexity” of structure.
- user has to do lots of changes to raise potential,
- so you can spread cost of complex ops over many insertions.
- potential says how much work adversary has done that you haven’t had to deal with yet.
- You can charge your work against that work.
another perspective: procrastinate. if you don’t do the work, might never need to.

Lazy approach:

Problem?

Full copy bad.

Fat nodes method:

replace each (single-valued) field of data structure by list of all values taken, sorted by time.
requires \(O(1)\) space per data change (unavoidable if keep old data)
to lookup data field, need to search based on time.
store values in binary tree
checking/changing a data field takes \(O(\log m)\) time after \(m\) updates
multiplicative slowdown of \(O(\log m)\) in structure access.
big number from data structures perspective

Path copying:

can only reach node by traversing pointers starting from root
changes to a node only visible to ancestors in pointer structure
when change a node, copy it and ancestors (back to root of data structure
keep list of roots sorted by update time
\(O(\log m)\) time to find right root (or const, if time is integers) (additive slowdown)
same access time as original structure
additive instead of multiplicative \(O(\log m)\).
modification time and space usage equals number of ancestors: possibly huge!
especially if structure not balanced!

Combined Solution (trees only):

Path copying is too aggressive
how can we be lazy about path copying?
fat nodes bad, but “pleasingly plump” ok.
use fat nodes, but when get too fat, make new path copy
in each node, store 1 extra time-stamped field
if full, overrides one of standard fields for any accesses later than stamped time.
access rule
- standard access, just check for overrides while following pointers
- constant factor increase in access time.
update rule:
- when need to change/copy pointer, use extra field if available.
- otherwise, make new copy of node with new info, and recursively modify parent.
- Idea: lazy path copying
- copy only as much as you must

Analysis idea: adversary has to do many changes to make copying expensive.

analysis: potential function \(\phi\) measuring amount of “deferred work”.
- amortized\(_i=\) real\(_i+\phi_i-\phi_{i-1}\)
- then \(\sum a_i=\sum r_i+\phi_n-\phi_0\)
- upper bounds real cost if \(\phi_n \ge \phi_0\).
- sufficient that \(\phi_n \ge 0\) and \(\phi_0\) fixed

Analysis

What represents work you need to do later?
- full nodes (that may need copying)
- that may still be modified (current values)
live node: pointed at by current root.
potential function: number of full live nodes.
copying a node is free (new copy not full, pays for copy space/time)
pay for filling an extra pointer (do only once, since can stop at that point).
amortized space per update: \(O(1)\).

Persistent sorted sets:

Red black trees

Add persistence

Improvement:

Result: \(O(n)\) space, \(O(\log n)\) query time for planar point location.

Extensions:

method extends to arbitrary pointer-based structures.
\(O(1)\) cost per update for any pointer-based structure with any constant indegree and constant number of fields.
use one extra pointer per field in node
and one pointer per indegree
full persistence with same bounds.