6.854 Lecture Notes

This material takes about 1.5 hours.

Suffix Trees

Gusfield: Algorithms on Strings, Trees, and Sequences.

Weiner 73 “Linear Pattern-matching algorithms” IEEE conference on automata and switching theory

McCreight 76 “A space-economical suffix tree construction algorithm” JACM 23(2) 1976

Chen and Seifras 85 “Efficient and Elegegant Suffix tree construction” in Apostolico/Galil Combninatorial Algorithms on Words

Another “search” structure, dedicated to strings.

Basic problem: match a “pattern” (of length $m$) to “text” (of length $n$)

goal: decide if a given string (“pattern”) is a substring of the text
possibly created by concatenating short ones, eg newspaper
application in IR, also computational bio (DNA seqs)
if pattern avilable first, can build DFA, run in time linear in text
if text available first, can build suffix tree, run in time linear in pattern.
applications in computational bio.

First idea: binary tree on strings. Inefficient because run over pattern many times.

fractional cascading?
realize only need one character at each node!

Tries:

used to store dictionary of strings
trees with children indexed by “alphabet”
time to search equal length of query string
insertion ditto.
optimal, since even hashing requires this time to hash.
but better, because no “hash function” computed.
space an issue:
- using array increases stroage cost by $|\Sigma|$
- using binary tree on alphabet increases search time by $\log|\Sigma|$
- ok for “const alphabet”
- if really fussy, could use hash-table at each node.
size in worst case: sum of word lengths (so pretty much solves “dictionary” problem.

But what about substrings?

Relevance to DNA searches
idea: trie of all $n^2$ substrings
equivalent to trie of all $n$ suffixes.
put “marker” at end, so no suffix contained in other (otherwise, some suffix can be an internal node, “hidden” by piece of other suffix)
means one leaf per suffix
Naive construction: insert each suffix
basic alg:
- text $a_1\cdots a_n$
- define $s_i=a_i \cdots a_n$
- for $i=1$ to $n$
- insert $s_i$
time, space $O(n^2)$

Better construction:

note trie size may be much smaller: $aaaaaaa$.
algorithm with time $O(|T|)$
idea: avoid repeated work by “memoizing”
also shades of finger search tree idea---use locality of reference
suppose just inserted $aw$
next insert is $w$
big prefix of $w$ might already be in trie
avoid traversing: skip to end of prefix.

Suffix links:

any node in trie corresponds to string
arrange for node corresp to $ax$ to point at node corresp to $x$
suppose just inserted $aw$.
walk up tree till find suffix link
follow link (puts you on path corresp to $w$)
walk down tree (adding nodes) to insert rest of $w$

Memoizing: (save your work)

can add suffix link to every node we walked up
(since walked up end of $aw$, and are putting in $w$ now).
charging scheme: charge traversal up a node to creation of suffix link
traversal up also covers (same length) traversal down
once node has suffix link, never passed up again
thus, total time spent going up/down equals number of suffix links
one suffix link per node, so time $O(|T|)$

half hour up to here.

Amortization key principles:

Lazy: don't work till you must
If you must work, use your work to “simplify” data structure too
force user to spend lots of time to make you work
use charges to keep track of work---earn money from user activity, spend it to pay for excess work at certain times.

Linear-size structure:

problem: maybe $|T|$ is large $(n^2)$
compress paths in suffix trie
path on letters $a_i\cdots a_j$ corresp to substring of text
replace by edge labelled by $(i,j)$ (implicit nodes)
Example: tree on $abab\$$
gives tree where every node has indegree at least 2
in such a tree, size is order number of leaves $=O(n)$
terminating $\$$ char now very useful, since means each suffix is a node
Wait: didn't save space; still need to store characters on edge!
see if someone with prompting can figure out: characters on edge are substring of pattern, so just store start and end indices. Look in text to see characters.

Search still works:

preserves invariant: at most one edge starting with given character leaves a node
so can store edges in array indexed by first character of edge.
walk down same as trie
called “slowfind” for later

Construction:

obvious: build suffix trie, compress
drawback: may take $n^2$ time and intermediate space
better: use original construction idea, work in compressed domain.
as before, insert suffixes in order $s_1,\ldots,s_n$
compressed tree of what inserted so far
to insert $s_i$, walk down tree
at some point, path diverges from what's in tree
may force us to “break” an edge (show)
tack on one new edge for rest of string (cheap!)

MacReight 1976

use suffix link idea of up-link-down
problem: can't suffix link every character, only explict nodes
want to work proportional to real nodes traversed
need to skip characters inside edges (since can't pay for them)
introduced “fastfind”
- idea: fast alg for descending tree if know string present in tree
- just check first char on edge, then skip number of chars equal to edge “length”
- may land you in middle of edge (specified offset)
- cost of search: number of explicit nodes in path
- amortize: pay for with explicit-node suffix links

Amortized Analysis:

suppose just inserted string $aw$
sitting on its leaf, which has parent
Parent is only node that was (possibly) created by insertion:
- As soon as walk down preexisting tree falls off tree, create parent node and stop
invariant: every internal node except for parent of current leaf has suffix link to another explicit node
plausible?
- i.e., is there an explicit node for that suffix link to point at?
- suppose $v$ was created as parent of $s_j$ leaf when it diverged from $s_k$
- (note this is only way nodes get created)
- claim $s_{j+1}$ and $s_{k+1}$ diverge at suffix$(v)$, creating another explicit node.
- only problem if $s_{k+1}$ not yet present
- happens only if $k$ is current suffix
- only blocks parent of current leaf.
insertion step:
- suppose just inserted $s_i$
- consider parent $p_i$ and grandparent (parent of parent) $g_i$ of current node
- $g_i$ to $p_i$ link has string $w_1$
- $p_i$ to $s_i$ link $w_2$
- go up to grandparent
- follow suffix link (exists by invariant)
- fastfind $w_1$
- claim: know $w_1$ is present in tree!
  - $p_i$ was created by $s_i$ split from a previous edge (or preexisted)
  - so $aww_1$ was in tree before $s_i$ inserted (prefix of earlier suffix)
  - so $ww_1$ is in tree after $s_i$ inserted
- create suffix link from $p_i$ (preserves invariant)
- slowfind $w_2$ (stopping when leave current tree)
- break current edge if necessary (may land on preexisting node)
- add new edge for rest of $w_2$

Analysis:

First, consider work to reach $g_{i+1}$ (not suf$(g_i)$)
- Mix of fastfind and slowfind, but no worse then cost of doing pure slowfind
- This is it most $|g_{i+1}|-|g_i|+1$ (explain length notation)
- So total is $O(\sum |g_{i+1}|-|g_i|+1) = O(n)$
- Wait: maybe $g_{i+1}-g_i+1 < 0$, and I am cheating on sum?
  - Consider after inserting $s_{i+1}=$suf$(s_i)$
  - then $p_i$ can't point at $s_{i+1}$
  - so must point at ancestor
  - so $g_{i}$ must point at ancestor of ancestor
  - i.e., $g_{i}$ points at $g_{i+1}$ or something higher
  - so $|g_{i+1}| \ge |g_i|-1$
Remaining cost: to reach $p_{i+1}$ (possibly implicit) from $g_{i+1}$.
- If get there during fastfind, costs at most one additional step from $g_{i+1}$
- If get there during slowfind, means fastfind stopped at or before $g_{i+1}$.
- So suf$(p_i)$ is not below $g_{i+1}$.
- So remaining cost (from $g_{i+1}$, not suf$(p_i)$) is \[ |g_{i+1}|-|p_{i+1}| \le |\mbox{suf}(p_i)|-|p_{i+1}| \le |p_i|-|p_{i+1}|+1 \]
- telescopes as before to $O(n)$
- we mostly analyzed as if used slowfind. when was fastfind important?
  - in case when $p_{i+1}$ was reached on fastfind step from $g_{i+1}$
  - ie, $p_{i+1}=$suf$(p_i)$
  - in that case, could not have afforded to do slowfind from $g_{i+1}$ to $p_{i+1}$
  - since slowfind analysis only covered fslowfind to $g_{i+1}$ and from $p_{i+1}$
  - however, don't know that the case occurred until after the fact.

Weiner algorithm: insert strings “backwards”, use prefix links.

Ukonnen online version.

Suffix arrays: many of same benefits as suffix trees, but save pointers:

lexicographic ordering of suffixes
represent as list of integers: $b_1$ is (index in text of) lexicographically first suffix, $b_2$ is (index of) lexicographically second, etc.
search for pattern via binary search on this sequence
some clever tricks (and some more space) let you avoid re-checking characters of pattern.
So linear search (with additive $\log m$ for binary search.
space usage: $3m$ integers (as opposed to numerous pointers and integers of suffix tree).

Applications:

preprocess bottom up, storing first, last, num. of suffixes in subtree
allows to answer queries: what first, last, count of $w$ in text in time $O(|w|)$.
enumerate $k$ occurrences in time $O(w+|k|)$ (traverse subtree, binary so size order of number of occurences (compare to rabin-karp).
longest common subsequence is probably on homework.