Persistent Data Structures
Sarnak and Tarjan, "Planar Point Location Using Persistent Search Trees", Communications of the ACM 29(7) (1986), 669--679
"Making Data Structures Persistent" by Driscoll, Sarnak, Sleator and Tarjan Journal of Computer and System Sciences 38(1) 1989
Idea: be able to query and/or modify past versions of data structure.
- ephemeral: changes to the structure destroy all past information
- partial persistence: changes only to the most recent version, queries to all past versions
- full persistence: queries and changes to all past versions (creates “multiple worlds” situation)
Goal: general technique that can be applied to any data structure.
Persistent Trees
Many data structures
- consist of fixed-size nodes connected by pointers
- accessed from some root entry point
- data structure ops traverse pointers
- and modify node fields/pointers
- runtime determined by number of traverse/modify steps
Our goal: use simulation to transform any ephemeral data structure into a persistent one
- measure of success: time and space cost for each primitive traversal and primitive modification step.
- An exposed data structure operation will do some number of primitive steps
- we will add persistence to primitive steps, so any data structure can be made persistent
- we'll measure cost of simulation in slowdown and storage
- We'll limit to trees.
- But ideas generalize
Key principles:
- Lazy: don't work till you must
- If you must work, use your work to “simplify” data structure too
- force user (adversary) to spend lots of time to make you work
- analysis via potential function measuring “complexity” of structure.
- user has to do lots of changes to raise potential,
- so you can spread cost of complex ops over many insertions.
- potential says how much work adversary has done that you haven't had to deal with yet.
- You can charge your work against that work.
- another perspective: procrastinate. if you don't do the work, might never need to.
Lazy approach:
- During change, do minimum, i.e. nothing.
- Indeed, if changes are the only operations, we don't even need to remember them!
- but we also want lookups, so record changes
- Then apply relevant changes when a lookup happens
- So, amortized cost 1.
Problem?
- a second or later lookup must re-apply the entire change log again
- so, do need to incorporate changes into the structure
Full copy (duplicating the whole structure per change) bad: space per update proportional to structure size.
Fat nodes method:
- replace each (single-valued) field of data structure by list of all values taken, sorted by time.
- requires $O(1)$ space per data change (unavoidable if keep old data)
- to lookup data field, need to search based on time.
- store values in binary tree
- checking/changing a data field takes $O(\log m)$ time after $m$ updates
- multiplicative slowdown of $O(\log m)$ in structure access.
- big number from data structures perspective
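A minimal Python sketch of one fat field (names are mine, not from the papers). The notes store values in a binary tree; since updates arrive in increasing time order, an appended list searched with `bisect` gives the same $O(\log m)$ lookup bound:

```python
import bisect

class FatField:
    """One field of a fat node: every value it ever held, keyed by time."""
    def __init__(self, t0, v0):
        self.times = [t0]     # update times, sorted (updates arrive in time order)
        self.values = [v0]    # values[i] took effect at times[i]

    def set(self, t, v):
        self.times.append(t)  # O(1) space per change
        self.values.append(v)

    def get(self, t):
        # last update at or before time t: O(log m) after m updates
        i = bisect.bisect_right(self.times, t) - 1
        return self.values[i] if i >= 0 else None
```

Every read of the field pays this $O(\log m)$ search, which is where the multiplicative slowdown comes from.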
Path copying:
- can only reach node by traversing pointers starting from root
- changes to a node only visible to ancestors in pointer structure
- when change a node, copy it and all its ancestors (back to the root of the data structure)
- keep list of roots sorted by update time
- $O(\log m)$ time to find right root (or constant, if times are consecutive integers) (additive slowdown)
- same access time as original structure
- additive instead of multiplicative $O(\log m)$.
- modification time and space usage equals number of ancestors: possibly huge!
- especially if structure not balanced!
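The path-copying idea in a minimal Python sketch (plain unbalanced BST, insert only; names are mine): an insert allocates one new node per node on the search path and shares everything off the path with the old version.

```python
class Node:
    __slots__ = ("key", "left", "right")
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def insert(root, key):
    """Return a NEW root; only nodes on the search path are copied."""
    if root is None:
        return Node(key)
    if key < root.key:
        return Node(root.key, insert(root.left, key), root.right)
    return Node(root.key, root.left, insert(root.right, key))

def contains(root, key):
    while root is not None:
        if key == root.key:
            return True
        root = root.left if key < root.key else root.right
    return False

# One root per version; every old version stays fully queryable.
roots = [None]
for k in [5, 3, 8]:
    roots.append(insert(roots[-1], k))
```

The cost per update is the path length, which is exactly the "possibly huge" number of ancestors the notes warn about.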
Combined Solution (trees only):
- Path copying is too aggressive
- how can we be lazy about path copying?
- fat nodes bad, but “pleasingly plump” ok.
- use fat nodes, but when get too fat, make new path copy
- in each node, store 1 extra time-stamped field
- once set, it overrides one of the standard fields for all accesses later than its timestamp.
- access rule
- standard access, just check for overrides while following pointers
- constant factor increase in access time.
- update rule:
- when need to change/copy pointer, use extra field if available.
- otherwise, make new copy of node with new info, and recursively modify parent.
- Idea: lazy path copying
- copy only as much as you must
Analysis idea: adversary has to do many changes to make copying expensive.
- analysis: potential function $\phi$ measuring amount of “deferred work”.
- amortized$_i=$ real$_i+\phi_i-\phi_{i-1}$
- then $\sum a_i=\sum r_i+\phi_n-\phi_0$
- upper bounds real cost if $\phi_n \ge \phi_0$.
- sufficient that $\phi_n \ge 0$ and $\phi_0$ fixed
Analysis
- live node: reachable from the current (newest) root.
- potential function: number of full live nodes.
- copying a node is free: the old full live node dies, and the released potential pays for the copy's space/time (the new copy is not full)
- pay for filling an extra pointer (happens at most once per node; after that the node is full and gets copied instead).
- amortized space per update: $O(1)$.
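Spelled out against $\phi =$ number of full live nodes, the two primitive events charge as (a sketch of the accounting, using $a_i = r_i + \phi_i - \phi_{i-1}$ from above):

```latex
\begin{align*}
\text{fill a free extra slot:} && a &= \underbrace{1}_{r} + \underbrace{1}_{\Delta\phi} = 2\\
\text{copy a full live node:}  && a &= \underbrace{1}_{r} \underbrace{{}-1}_{\Delta\phi} = 0
\end{align*}
```

An update that cascades through $k$ full nodes has real cost $k+1$ but amortized cost $2$: each copy is prepaid by the earlier fill that made that node full.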
Intro
remind about hash table survey
Talk about splay tree inserts and deletes
- by splaying target to root
- split and join
- need to define/account for potential as create nodes and merge trees
Make point about all at once
Remind about $O(1)$ persistent pointer changes
Application: planar point location.
- a computational geometry foreshadow
- Different model
- algebraic operations are $O(1)$
- e.g. computing line intersections or lengths
- even when these return irrationals!
- sweep precision under the rug.
- planar subdivision
- division of plane into polygons
- query: “what polygon contains this point”
- define complexity of input?
- edges of polygons form $n$ segments/edges
- segments meet only at ends/vertices
- numerous special-purpose solutions
- One solution: dimension reduction
- vertical line through each vertex
- divides into slabs
- in slab, segments maintain one vertical ordering
- find query point slab by binary search
- build binary search tree for slab with “above-below” queries
- $n$ binary search trees, total size $O(n^2)$, build time $O(n^2\log n)$
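A sketch of the slab method under simplifying assumptions (no vertical segments, segments given as endpoint pairs with increasing $x$; helper names are mine). Materializing every slab is exactly the $O(n^2)$-size solution; `locate` returns the segment directly below the query point, which identifies its region:

```python
from bisect import bisect_right

# A segment is ((x1, y1), (x2, y2)) with x1 < x2; no vertical segments.
def y_at(seg, x):
    (x1, y1), (x2, y2) = seg
    return y1 + (y2 - y1) * (x - x1) / (x2 - x1)

def build_slabs(segments):
    """Dimension reduction: one vertically sorted segment list per slab."""
    xs = sorted({p[0] for s in segments for p in s})
    slabs = []
    for xl, xr in zip(xs, xs[1:]):
        xm = (xl + xr) / 2
        # segments meet only at vertices, so a segment either spans a slab or misses it
        crossing = [s for s in segments if s[0][0] <= xl and s[1][0] >= xr]
        crossing.sort(key=lambda s: y_at(s, xm))   # one vertical order per slab
        slabs.append((xl, crossing))
    return slabs

def locate(slabs, x, y):
    """Binary search for the slab, then find the segment just below (x, y)."""
    i = bisect_right([xl for xl, _ in slabs], x) - 1
    if i < 0:
        return None
    below = None
    for s in slabs[i][1]:        # a binary search here gives the O(log n) bound
        if y_at(s, x) <= y:
            below = s
    return below
```

The per-slab scan is written linearly for brevity; replacing it with an above-below binary search gives the stated query time.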
- observation: trees all very similar
- think of $x$ axis as time, slabs as “epochs”
- at end of epoch, “delete” segments that end, “insert” those that start.
- over all time, only $n$ inserts, $n$ deletes.
- must be able to query over all times
Persistent sorted sets:
- find$(x,s,t)$: find the largest key $\le x$ in set $s$ at time $t$
- insert$(i,s,t)$ insert $i$ in $s$ at time $t$
- delete$(i,s,t)$.
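The interface can be sketched as follows (to stay short, this version stores a full snapshot per update, so it has path-copying-like space cost rather than the $O(1)$ per update developed below; the set $s$ is the object itself here):

```python
from bisect import bisect_right, insort

class PersistentSortedSet:
    """Interface sketch: find/insert/delete with a time parameter.
    Updates are allowed only in the present, queries at any time."""
    def __init__(self):
        self.times = [float("-inf")]
        self.versions = [()]           # one immutable sorted tuple per version

    def _at(self, t):
        return self.versions[bisect_right(self.times, t) - 1]

    def insert(self, i, t):
        assert t >= self.times[-1]     # updates only in the "present"
        cur = list(self._at(t))
        insort(cur, i)
        self.times.append(t)
        self.versions.append(tuple(cur))

    def delete(self, i, t):
        assert t >= self.times[-1]
        cur = [x for x in self._at(t) if x != i]
        self.times.append(t)
        self.versions.append(tuple(cur))

    def find(self, x, t):
        """Largest key <= x in the version current at time t."""
        v = self._at(t)
        j = bisect_right(v, x) - 1
        return v[j] if j >= 0 else None
```

For point location, `find` at time $t$ answers "which segment is just below this point in slab $t$".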
Method: persistent red-black trees
Red black trees
- aggressive rebalancers
- amortized cost $O(1)$ to change a field.
- store red/black bit in each node
- use red/black bit to rebalance.
- depth $O(\log n)$
Add persistence
- updates only in “present”
- search: standard binary tree search; $O(1)$ slowdown for persistence
- update: causes changes in red/black fields on path to item, $O(1)$ rotations.
- result: $O(\log n)$ space per insert/delete
- geometry does $O(n)$ changes, so $O(n\log n)$ space.
- $O(\log n)$ query time.
Improvement:
- red-black bits used only for updates
- only need current version of red-black bits
- don't persist old bit versions: just overwrite
- only persistent updates needed are for $O(1)$ rotations
- so $O(1)$ space per update
- so $O(n)$ space overall.
Result: $O(n)$ space, $O(\log n)$ query time for planar point location.
Extensions:
- method extends to arbitrary pointer-based structures.
- $O(1)$ cost per update for any pointer-based structure with any constant indegree and constant number of fields.
- use one extra pointer per field in node
- and one extra pointer per unit of indegree
- full persistence with same bounds.