\documentclass[11pt]{article}
\usepackage{latexsym}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage{epsfig}
\usepackage{psfig}
\newcommand{\handout}[5]{
\noindent
\begin{center}
\framebox{
\vbox{
\hbox to 5.78in { {\bf 6.851: Advanced Data Structures } \hfill #2 }
\vspace{4mm}
\hbox to 5.78in { {\Large \hfill #5 \hfill} }
\vspace{2mm}
\hbox to 5.78in { {\em #3 \hfill #4} }
}
}
\end{center}
\vspace*{4mm}
}
\newcommand{\lecture}[4]{\handout{#1}{#2}{#3}{Scribe: #4}{Lecture #1}}
\newcommand{\op}[1]{\ensuremath{#1}}
\newcommand{\leaf}{\op{())}}
\newcommand{\lrank}[1]{\op{rank_{\leaf}(#1)}}
\newcommand{\lselect}[1]{\op{select_{\leaf}(#1)}}
\newcommand{\sig}{\ensuremath{|\Sigma|}}
\newcommand{\out}{\ensuremath{|\textnormal{output}|}}
\newcommand{\ith}{\ensuremath{i^\textnormal{th}}}
\newcommand{\erank}[2]{\op{evenrank_#1(#2)}}
\newcommand{\esucc}[2]{\op{evensucc_#1(#2)}}
\newcommand{\sak}{\ensuremath{SA_k}}
\newcommand{\sakk}{\ensuremath{SA_{k+1}}}
\newcommand{\iseven}[2]{\op{iseven_#1(#2)}}
\newcommand{\orank}[2]{\op{oddrank_#1(#2)}}
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{observation}[theorem]{Observation}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{claim}[theorem]{Claim}
\newtheorem{fact}[theorem]{Fact}
\newtheorem{assumption}[theorem]{Assumption}
% 1-inch margins, from fullpage.sty by H.Partl, Version 2, Dec. 15, 1988.
\topmargin 0pt
\advance \topmargin by -\headheight
\advance \topmargin by -\headsep
\textheight 8.9in
\oddsidemargin 0pt
\evensidemargin \oddsidemargin
\marginparwidth 0.5in
\textwidth 6.5in
\parindent 0in
\parskip 1.5ex
%\renewcommand{\baselinestretch}{1.25}
\begin{document}
\lecture{22 --- 7 May}{Spring 2007}{Prof.\ Erik Demaine}{Mart\'i Bol\'ivar}
\section{Overview}
In the last lecture we introduced the concepts of implicit, succinct,
and compact data structures, gave examples of succinct binary
tries, and proved a bijection between binary tries, rooted
ordered trees, and balanced-parenthesis expressions. We also introduced
succinct data structures that solve the \op{rank} and \op{select}
problems.
In this lecture we expand slightly on our previous discussion of
succinct binary tries, and introduce compact data structures for
suffix arrays and suffix trees.
\section{More on Binary Tries}
Note that a leaf in a balanced-parenthesis representation of a binary
trie is the string \leaf. As shown in a paper by Munro, Raman, and Rao
\cite{mrr}, the following queries on leaves can be implemented in
constant time using a succinct data structure.
\begin{description}
\item[leaf-rank(n)] The number of leaves to the left of node \op{n};
denoted \lrank{n}.
\item[leaf-select(i)] The $i^{\textnormal{th}}$ leaf; denoted \lselect{i}.
\item[leaf-count(n)] Number of leaves in the subtree of node \op{n};
equal to $\lrank{\textnormal{matching ) of parent}} - \lrank{n}$.
\item[leftmost leaf in subtree of n] Equal to \lselect{\lrank{n}}.
\item[rightmost leaf in subtree of n] Similar to above.
\end{description}
%% it's visual formatting, but i'm in a hurry
\pagebreak
\section{Minisurvey}
Following is a small survey of results on compact suffix
arrays. Recall that a compact data structure uses $O(\textnormal{OPT})$ bits,
where OPT is the information-theoretic optimum.
\begin{description}
\item[Grossi and Vitter 2000 \cite{gv00}] Suffix array in
\[ \left(\frac{1}{\varepsilon} + O(1)\right)|T|\lg{\sig} \]
bits, with query time
\[ O\left(\frac{|P|}{\log^\varepsilon_{\sig}{|T|}} +
\out\log_{\sig}^\varepsilon{|T|}\right). \]
We will follow this paper fairly closely in our discussion today.
\item[Ferragina and Manzini 2000 \cite{fm00}] This technique is known as
the FM index. Space is
\[ 5 H_k(T)|T| +
O\left(\frac{|T|}{\lg{|T|}}(\sig + \lg\lg{|T|})
+ |T|^\varepsilon \sig 2^{\sig\lg\sig}\right) \]
bits, for all $k$, where $H_k(T)$ is the $k^{\textnormal{th}}$-order
empirical entropy, or the regular entropy conditioned on knowing the
previous $k$ characters. Query time is
\[ O(|P| + \out \lg^\varepsilon{|T|}). \]
The analysis of the FM index is tricky; in particular, the paper
does not claim the above bounds.
\item[Sadakane 2003 \cite{s03}] Space in bits is
\[ \frac{1}{\varepsilon}H_0(T)|T| + O( |T|\lg\lg\sig + \sig\lg\sig ), \]
and query time is
\[ O( |P|\lg{|T|} + \out\lg^\varepsilon{|T|}). \]
Note that this bound is more like a suffix array, due to the
multiplicative log factor.
\item[Grossi, Gupta, Vitter 2003 \cite{ggv03}] This is the only known
succinct result. Space in bits is
\[ H_k(T)|T| + O( |T| \lg\sig\frac{\lg\lg{|T|}}{\lg{|T|}} ), \]
and query time is
\[ O(|P|\lg\sig + \lg^{o(1)}{|T|}). \]
\end{description}
\section{Compressed suffix arrays}
As a warm-up problem, we'll consider compressed suffix arrays. In the
next section, we will build on this discussion to construct a compact
suffix array. Henceforth, we shall assume that $\sig =
2$, i.e., a binary alphabet. The words ``even'' and ``odd'' refer to
the index of a character in $T$ (recall that suffix array elements
contain indices into $T$ denoting the beginning of the suffix).
\subsection{Intuition}
We will follow the three-way divide and conquer suffix array
construction discussed in lecture 9, but modify it to be a two-way
division. The recursion is as follows.
\begin{description}
\item[start] The initial text, $T_0$, is $T$; the initial size, $n_0$,
is $n$, and the initial suffix array, $SA_0$, is $SA$, the suffix
array of $T$. We'll define $SA[i]$ as the index in $T$ where the
\ith\ suffix begins.
\item[step] $T_{k+1} = \langle (T_k[2i], T_k[2i+1]) \rangle$ for $i = 0, 1, \ldots, n_k/2 - 1$;
$n_{k+1} = n_k/2$ (and hence after $k$ recursions, we'll have a text
of length $n/2^k$ whose letters are $2^k$ bits each, i.e.\ an alphabet of
size $2^{2^k}$); $SA_{k+1} =
\frac{1}{2}\cdot(\textnormal{even entries of } SA_k)$.
\end{description}
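
\paragraph{Example:}
As a small illustration (the text below is our own choice, not from the
lecture), take the binary text $T_0 = \texttt{01101001}$ with $n_0 = 8$,
indexed from 0, and assume the usual convention that a suffix sorts before
any longer suffix it is a prefix of. Then
\[ SA_0 = [5, 6, 3, 0, 7, 4, 2, 1]. \]
Pairing letters gives $T_1 = \langle \texttt{01}, \texttt{10}, \texttt{10},
\texttt{01} \rangle$ with $n_1 = 4$, whose suffix array is
\[ SA_1 = [3, 0, 2, 1]. \]
The even entries of $SA_0$, in order, are $6, 0, 4, 2$; halving them gives
$3, 0, 2, 1$, which is exactly $SA_1$.
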
Now, we obviously can't actually start where we say we do, since that
would require knowing $SA_0$, which is exactly what we're trying to
build. So conceptually, we'll walk \emph{up} the recursion tree from
the bottom, which consists of the trivial case in which the entire
text $T$'s suffix array is constructed as if $T$ were written as a
single letter.
\subsection{Crawling Up the Recursion Tree}
Since we're recursing ``backwards'', we need a way to represent $SA_k$
using $SA_{k+1}$. We'll use the following queries to accomplish this
efficiently:
\begin{description}
\item[\esucc{k}{i}] The ``even successor'' of $i$, defined as
$i$ if $SA_k[i]$ is even, and as the index $j$ with $SA_k[i] = SA_k[j] -
1$ if $SA_k[i]$ is odd. One of these will be the case, since the text
position following an odd position is even; recall that $T_{k+1}$ is $T_k$
with every pair of letters joined together as a single letter on an
alphabet whose size is the square of the original.
\item[\erank{k}{i}] The ``even rank'' of $i$, or the number of
even values in $SA_k[:i]$ (using slice notation as in Python
\cite{vr06}). This equals the number of even suffixes preceding the
\ith\ suffix.
\end{description}
With these two queries we have $SA_k[i] =
2\,SA_{k+1}[\erank{k}{\esucc{k}{i}}]$, minus 1 if $SA_k[i]$ is odd.
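
\paragraph{Example:}
A sketch of this recovery on a small instance of our own choosing: for
$T_0 = \texttt{01101001}$ (indexed from 0, with a suffix sorting before any
longer suffix it is a prefix of) we have $SA_0 = [5, 6, 3, 0, 7, 4, 2, 1]$ and
$SA_1 = [3, 0, 2, 1]$. For $i = 2$, $SA_0[2] = 3$ is odd, so
$\esucc{0}{2} = 5$ because $SA_0[5] = 4$; the even values in $SA_0[:5]$ are
$6$ and $0$, so $\erank{0}{5} = 2$, and
\[ 2\,SA_1[2] - 1 = 2 \cdot 2 - 1 = 3 = SA_0[2]. \]
For $i = 6$, $SA_0[6] = 2$ is even, so $\esucc{0}{6} = 6$,
$\erank{0}{6} = 3$, and $2\,SA_1[3] = 2 \cdot 1 = 2 = SA_0[6]$. The entry
$SA_0[4] = 7$ points at the last text position, which has no successor; this
sketch leaves that boundary case aside.
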
So if each of the above queries takes constant time, a query on
\sak\ reduces to a query on \sakk\ in constant time. Hence, a query on $SA_0$,
the array we're trying to build, will take $O(l)$ time if we recurse
$l$ times. Only $l = \lg\lg{n}$ recursions are necessary: since we halve
the size of the text on each recursion, this reduces the text to size
$n_l = n/\lg{n}$. At that level we use a normal suffix
array, which uses $O(n_l \lg{n_l}) = O(n)$ bits of space and is thus
compressed.
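
\paragraph{Example:}
The arithmetic behind this choice of $l$, spelled out (assuming for
simplicity that $\lg\lg{n}$ is an integer):
\begin{align*}
n_l &= \frac{n}{2^{l}} = \frac{n}{2^{\lg\lg{n}}} = \frac{n}{\lg{n}}, \\
n_l \lg{n_l} &= \frac{n}{\lg{n}}\,(\lg{n} - \lg\lg{n}) \le n.
\end{align*}
For instance, $n = 2^{16}$ gives $l = 4$ and $n_l = 2^{16}/16 = 4096$, so the
base suffix array takes $4096 \cdot 12 = 49152 \le 65536$ bits.
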
\subsection{Construction}
We will now construct a data structure that answers even-successor and
even-rank queries in constant time. We'll begin with an auxiliary
query, is-even:
\[ \iseven{k}{i} = \left\{
\begin{array}{llll}
1 & {\rm if} & \sak[i] & {\rm even,} \\
0 & {\rm else} & & \\
\end{array}
\right.
\]
We can imagine storing this as an $n_k$-bit vector. Since $n_{k+1} =
n_k/2$, the total number of bits is geometric in $n_0 = n$, so we'd
need $O(n)$ bits to do it this way, which won't work. But for now,
let's pretend that this is how we implement this.
Then we can implement \erank{k}{i} with a rank structure on our
imaginary bit vector in $o(n)$ space.
Doing \esucc{k}{i} is trivial in the case that $\sak[i]$ is even;
there are $n_k/2$ such values. Intuitively, we could just store the
values of $j$ for odd values of $\sak[i]$. Note that we can't actually
write them down, because that would require $n_k \lg{n_k}$ bits.
Whatever data structure we use, let's order the values of $j$ by $i$;
that is, if we're pretending to store the values of $j$ in an array
called $odds$, we'd like $\esucc{k}{i} = odds[\orank{k}{i}]$, where
$\orank{k}{i} = i - \erank{k}{i}$.

Why is it useful to order the $j$'s by $i$? Well, that's just ordering
by suffix in the suffix array, which is ordering by the odd suffix
$T_k[\sak[i]:]$, or ordering by the pair $(T_k[\sak[i]], T_k[\sak[i]+1:])$,
where $T_k[\sak[i]+1:] = T_k[\sak[\esucc{k}{i}]:]$. This in turn is
equivalent to ordering by $(T_k[\sak[i]], \esucc{k}{i})$.

Now recall that we're trying to create a data structure for answering
\esucc{k}{i} queries. So ordering the $j$'s by $i$ is equivalent to
sorting by the pair formed by the leading letter $T_k[\sak[i]]$ and the
value $j$ itself! That is to say, the values of
$j$ are mostly in sorted order. So we will be storing pairs
$(T_k[\sak[i]], j)$ in lexical order.

This means we need a clever way of storing a sorted array of $n_k/2$
values $v_i$, each of which is $2^k + \lg{n_k}$ bits long. The $2^k$
follows since at level $k$, each letter requires $2^k$ bits;
similarly, each value $j$ requires $\lg{n_k}$ bits.
The desired clever trick is to store the leading $\lg{n_k}$ bits of
each $v_i$ in unary differential encoding:
\[ 0^{{\rm lead}(v_1)}\,1\,0^{{\rm lead}(v_2) - {\rm lead}(v_1)}\,1 \cdots \]
where ${\rm lead}(v_i)$ is the value of the leading $\lg{n_k}$ bits of $v_i$
as an unsigned integer. That is to say, we write down the difference
between the leading bits of $v_i$ and those of $v_{i-1}$ in unary, then
write a 1, and repeat.
There will then be $n_k/2$ ones and at most $2^{\lg{n_k}} = n_k$
zeros, and hence at most $(3/2)n_k$ bits total used for this
encoding. Again by the geometric nature of successive values of $n_k$,
this will require $O(n)$ bits total, so the overall data structure is
still compressed.
Note that this also gives random access: the leading bits of
$v_i$ have value equal to ${\rm rank}_0({\rm select}_1(i))$.
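
\paragraph{Example:}
A small instance with made-up values: let $n_k = 8$, so there are
$n_k/2 = 4$ values with $\lg{n_k} = 3$ leading bits each, and suppose
${\rm lead}(v_1), \ldots, {\rm lead}(v_4) = 1, 1, 3, 4$. The encoding is
\[ 0^{1}\,1\;0^{0}\,1\;0^{2}\,1\;0^{1}\,1 = \texttt{01100101}, \]
which has 4 ones and 4 zeros. To read back ${\rm lead}(v_3)$, note that the
third 1 is the sixth bit and three zeros precede it, so
${\rm rank}_0({\rm select}_1(3)) = 3$.
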
The remaining $2^k$ bits can be stored in the obvious way, in an
array, as that will use $2^k \frac{n_k}{2} = 2^k\frac{n/2^k}{2} =
\frac{n}{2}$ bits, for a total of $n/2 + 3n_k/2 + o(n_k)$ bits at level
$k$. This completes the construction of a compressed suffix array.
\section{Compact suffix arrays}
The problem with our compressed suffix array construction is that its
$\lg\lg{n}$ levels require $O(n\lg\lg{n})$ bits of space. As in the last
lecture on binary tries, we would prefer to reduce the number of recursion
levels to a constant. This would give us linear space, at the cost of
query time.
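
\paragraph{Example:}
To see where the $n\lg\lg{n}$ comes from, we can sum the per-level cost of
$n/2 + 3n_k/2 + o(n_k)$ bits from the compressed construction over the
$l = \lg\lg{n}$ levels, using $n_k = n/2^k$:
\[ \sum_{k=0}^{l-1} \left( \frac{n}{2} + \frac{3n}{2^{k+1}} + o(n_k) \right)
\le \frac{n}{2}\lg\lg{n} + 3n + o(n). \]
The unary-encoded leading bits sum geometrically to $O(n)$, but the
explicitly stored $n/2$ bits recur at every level.
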
To accomplish this, we will store only $1/\varepsilon + 1$ levels of
recursion, namely those values of $k$ equal to
\[ 0, \varepsilon l, 2\varepsilon l, ..., l = \lg\lg{n}. \]
In essence, we are clustering $2^{\varepsilon l}$ letters in a single
stroke. We now need to be able to \emph{jump} $\varepsilon l$ levels
at once. We are not able to do this in constant time.
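
\paragraph{Example:}
For concreteness, with parameters of our own choosing: let $n = 2^{16}$, so
$l = \lg\lg{n} = 4$, and take $\varepsilon = 1/2$. We then store the
$1/\varepsilon + 1 = 3$ levels $k = 0, 2, 4$, with text lengths
\[ n_0 = 65536, \qquad n_2 = 16384, \qquad n_4 = 4096 = n/\lg{n}, \]
and each jump between stored levels clusters $2^{\varepsilon l} = 4$ letters
into one.
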
\subsection{Level jumping}
In order to present a method of representing $SA_{k\varepsilon l}$
with $SA_{(k+1)\varepsilon l}$, the concepts of ``even'' and
``successor'' need to be generalized from the compressed
construction.
In particular, the previous notion of an index $i$ being ``even'' in
the text $T$ will henceforth mean ``even for $\varepsilon l$
recursions''. This buys us a generalized notion of \erank{k}{i} as
well, in the obvious way. We'll define successor similarly, with
$\esucc{k}{i} = j$, where $SA_{k\varepsilon l}[i] = SA_{k\varepsilon
l}[j] - 1$.
Using these definitions, computing $SA_{k\varepsilon l}[i]$ is as
follows:
\begin{itemize}
\item Follow the successor pointer repeatedly until index $j$ is at
the next level down, namely $(k+1)\varepsilon l$.
\item Recurse: $SA_{(k+1)\varepsilon l}[\erank{{k+1}}{j}]$
\item Multiply by $2^{\varepsilon l}$, as this, modulo rounding
errors, is how many letters we clustered per recursion level. We
then correct the round-off error by subtracting the number of calls
to successor in the first step.
\end{itemize}
The runtime is then clearly linear in the number of times we call
successor in the first step. This is at most $2^{\varepsilon l}$, because
the successor walking is done in text space, not suffix space: each call
advances one position in the text, and the positions that are ``even for
$\varepsilon l$ recursions'' are $2^{\varepsilon l}$ apart. That
is, each recursion level in effect halves the number of letters in $T$
$\varepsilon l$ times.
\subsection{Analysis}
From the arguments outlined in the previous subsection, search time is
$2^{\varepsilon l}\lg\lg{n} = \lg^\varepsilon{n}\lg\lg{n} =
O(\lg^{\varepsilon'}{n})$. (We have introduced a new variable
$\varepsilon'$, which can be made arbitrarily small by appropriate
choice of $\varepsilon$).
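
\paragraph{Example:}
The simplification above, step by step, with $l = \lg\lg{n}$ (the symbol
$\delta$ below is ours, introduced only for this sketch):
\[ 2^{\varepsilon l} = 2^{\varepsilon\lg\lg{n}}
= \left(2^{\lg\lg{n}}\right)^{\varepsilon} = \lg^{\varepsilon}{n}. \]
Since $\lg\lg{n} = O(\lg^{\delta}{n})$ for every constant $\delta > 0$, the
product is $O(\lg^{\varepsilon + \delta}{n})$, so we may take
$\varepsilon' = \varepsilon + \delta$.
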
Space is $O(n)$: we use the same unary differential encoding for
successor as in the compressed construction. This is a linear number
of bits per level, but we have a (large) multiplicative constant
factor due to level jumping. Nevertheless, with a constant number of
levels, space is linear overall.
This gets us a compact suffix array.
\paragraph{Open problem:}
Is it possible to achieve constant query time in linear space?
\section{Suffix trees}
Suffix arrays are somewhat troublesome due to the log factor paid to
search. Suffix trees eliminate this problem. An algorithm for creating
a compact suffix tree given a compact suffix array was also given in
\cite{mrr}. This converts a suffix array which uses $m$ bits into a
suffix tree which uses $o(m)$ bits.
\subsection{Construction}
We already know how to store succinct binary tries. So the first step
is to construct the suffix tree from the suffix array as a binary trie
on $2n+1$ nodes, stored in $4n+o(n)$ bits using a balanced-parenthesis
representation.
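
\paragraph{Example:}
The bit count can be checked directly, assuming the standard bound that a
succinct binary trie on $N$ nodes takes $2N + o(N)$ bits:
\[ 2(2n+1) + o(2n+1) = 4n + 2 + o(n) = 4n + o(n). \]
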
This is sufficient for giving us the nodes, but it doesn't tell us
letter depths. So to search for a pattern $P$, we maintain letter
depths as we go.
To descend to a child of a node, we need to compute the ``skip'', or
the difference in letter depth between the node and its parent, or the
length in letters of the edge between them.
We compute the length of this edge by finding the longest match
between the leftmost and rightmost leaves of the child, which we locate
using the methods of \cite{mrr} discussed earlier. We then compute their
leaf ranks, and dereference the suffix array at these two positions. This
gives us two text positions as suffixes. We compute the longest match
between them, first jumping by the letter-depth of the parent node,
which we have maintained. We then match character by character between
the pattern and the two text positions. When the left and right stop
matching, we know the letter-depth of the child we want to descend
to. If we fall off a branch or the pattern stops matching the two
positions, we know there is no result.
\subsection{Analysis}
Searching takes $O((|P| + \out)\cdot c)$ time, where $c$ is the cost of a
suffix array lookup. This doesn't give us quite the bound of \cite{gv00},
but it's close enough for our purposes.
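
\paragraph{Example:}
As a rough instantiation, if we assume the lookup cost
$c = O(\lg^{\varepsilon'}{n})$ of the compact suffix array above, the search
time becomes
\[ O\left( (|P| + \out) \lg^{\varepsilon'}{n} \right), \]
which charges the $\lg^{\varepsilon'}{n}$ factor to $|P|$ as well as to the
output, whereas in the bound quoted from \cite{gv00} only the output term
carries such a factor.
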
\subsection{Improvement}
It is also possible to improve this, creating a succinct suffix tree
given a suffix array. This algorithm is complicated; the reader is
referred to \cite{mrr} for the details.
%\bibliography{mybib}
\bibliographystyle{alpha}
\begin{thebibliography}{77}
\bibitem{mrr}
J. I. Munro, V. Raman, and S. S. Rao,
\emph{Space Efficient Suffix Trees},
Journal of Algorithms, 39(2):205-222.
\bibitem{gv00}
R. Grossi and J. S. Vitter,
\emph{Compressed suffix arrays and suffix trees with applications to text indexing and string matching},
Thirty-Second Annual ACM Symposium on Theory of Computing (STOC), pp.~397-, 2000.
\bibitem{fm00}
P. Ferragina and G. Manzini,
\emph{Indexing Compressed Text},
Journal of the ACM, Vol. 52 (2005), 552-581.
\bibitem{s03}
K. Sadakane,
\emph{New text indexing functionalities of the compressed suffix arrays}.
Journal of Algorithms, 48(2): 294-313 (2003).
\bibitem{ggv03}
R. Grossi, A. Gupta, J. S. Vitter,
\emph{High-order entropy-compressed text indexes},
SODA 2003: 841-850.
\bibitem{vr06}
G. van Rossum,
\emph{Python Tutorial},
http://docs.python.org/tut/node5.html\#SECTION005140000000000000000.
\end{thebibliography}
\end{document}