\documentclass[11pt]{article}
\usepackage{latexsym}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage{epsfig}
\usepackage{psfig}
\usepackage{stmaryrd}
\usepackage{xypic}
\def\polylg{\operatorname{polylg}}
\def\adj{\operatorname{adj}}
\def\symb{\operatorname{symb}}
\def\rand{\operatorname{rand}}
\newcommand{\handout}[5]{
\noindent
\begin{center}
\framebox{
\vbox{
\hbox to 5.78in { {\bf 6.897: Advanced Data Structures } \hfill #2 }
\vspace{4mm}
\hbox to 5.78in { {\Large \hfill #5 \hfill} }
\vspace{2mm}
\hbox to 5.78in { {\em #3 \hfill #4} }
}
}
\end{center}
\vspace*{4mm}
}
\newenvironment{itemize*}%
{\vspace{-2ex} \begin{itemize} %
\setlength{\itemsep}{-1ex} \setlength{\parsep}{0pt}}%
{\end{itemize}}
\newcommand{\lecture}[4]{\handout{#1}{#2}{#3}{Scribe: #4}{Lecture #1}}
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{observation}[theorem]{Observation}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{claim}[theorem]{Claim}
\newtheorem{fact}[theorem]{Fact}
\newtheorem{assumption}[theorem]{Assumption}
% 1-inch margins, from fullpage.sty by H.Partl, Version 2, Dec. 15, 1988.
\topmargin 0pt
\advance \topmargin by -\headheight
\advance \topmargin by -\headsep
\textheight 8.9in
\oddsidemargin 0pt
\evensidemargin \oddsidemargin
\marginparwidth 0.5in
\textwidth 6.5in
\parindent 0in
\parskip 1.5ex
%\renewcommand{\baselinestretch}{1.25}
\newcommand{\dont}{\textbf{\sf ?}}
\renewcommand{\th}{\ifmmode{^{\textrm{th}}}\else{\textsuperscript{th}\ }\fi}
\newcommand{\func}[1]{\textnormal{\scshape#1}}
\begin{document}
\lecture{19 --- April 14, 2005}{Spring 2005}
{Prof.\ Erik Demaine}{Vincent Yeung}
\section{Overview}
In this lecture, we discuss the problem of approximate string
matching. In particular, we outline a solution to the exact matching
problem for patterns containing ``don't care'' symbols, denoted by
\dont. Along the way, we will use solutions to the level ancestor
problem, which we also discuss.
\section{Approximate String Matching}
The approximate string matching problem is defined as follows. Given
an error tolerance $k$ and a text $T$, construct a data structure
which can answer the following query: find occurrences of a pattern
$P$ in $T$ within {\em error} $k$. There are different ways to
measure error, such as:
\begin{enumerate}
\item Hamming distance: the number of character mismatches
\item edit distance: the number of edits (insertions, deletions,
substitutions) needed to produce an exact match.
\end{enumerate}
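As a concrete illustration (plain Python, not part of any data structure in this lecture), both error measures can be computed directly; the edit distance code is the standard dynamic program:

```python
def hamming(s, t):
    # Defined only for equal-length strings: count mismatched positions.
    assert len(s) == len(t)
    return sum(a != b for a, b in zip(s, t))

def edit_distance(s, t):
    # Classic dynamic program: d[i][j] = edits turning s[:i] into t[:j].
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + (s[i - 1] != t[j - 1]))  # substitution
    return d[m][n]
```

Note that Hamming distance is only defined for equal-length strings, so approximate matching under it never changes the length of a match.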
The best currently-known bounds, given by \cite{cole1}, are:
\begin{itemize}
\item space and preprocessing: $O(|T| \frac{(c \lg {|T|})^k}{k!})$
\item query: $O(|P|+\frac{(c \lg {|T|})^k}{k!} \lg \lg{|T|}) + 3^k
\cdot \textrm{(\# occurrences)}$
\end{itemize}
We will not actually cover this data structure; instead, we concentrate on
a simpler problem that uses the same techniques. These bounds are
only interesting for small $k$ (such as a constant). For larger $k$,
there are relaxations of the problem which can be solved more
efficiently. This will be the topic of next lecture.
\section{Searching with Wildcards}
We will focus on a subproblem of the above. Again we are given $T$
and $k$ for preprocessing. But now, the query consists of a pattern
$P$ that contains at most $k$ ``don't care'' characters (the \dont{}
wildcards). We are to find ``exact'' matches of $P$, where wildcards
match any character. The best known solution \cite{cole1} solves the
problem in $O(|T| \lg^k {|T|})$ space and $O(2^k \lg \lg {|T|} + |P| +
\textrm{\# occurrences})$ query time.
All the solutions we will discuss involve the use of suffix trees. An
obvious simple solution is to walk down the suffix tree while matching
$P$ and simply branch $|\Sigma|$ ways every time a \dont{} is
encountered in $P$. Thus, queries take $O(|\Sigma|^k \cdot |P|)$
time. Compared to the best solution we mentioned above, the simple
solution has two drawbacks: a dependence on the alphabet size
(which may be very large), and a dependence on the pattern length
that is multiplicative instead of additive.
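For intuition, here is an even more naive baseline (illustrative Python; the name and the use of \texttt{?} as the \dont{} symbol are our conventions): it checks $P$ against every alignment of $T$ in $O(|T| \cdot |P|)$ time with no preprocessing at all. The suffix-tree solutions trade preprocessing space for much faster queries.

```python
def wildcard_occurrences(T, P, wild="?"):
    # Report every position i where P matches T[i:i+|P|],
    # with the wildcard character matching any single character.
    return [i for i in range(len(T) - len(P) + 1)
            if all(p == wild or p == t
                   for p, t in zip(P, T[i:i + len(P)]))]
```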
We now describe how to improve the $|\Sigma|^k$ factor to $2^k$. To do
so, we perform a heavy-light decomposition of the suffix tree. Recall
that an edge to a child is {\em light} if the subtree rooted at that
child contains at most half the nodes of the parent's subtree. The
intuition is that, whenever we encounter a \dont, we only need to
differentiate between the light edges (taken together) and the
possibly unique heavy edge. Because light subtrees are small, we group
them together into one big chunk.
Specifically, for each node of the {\em primary} suffix tree, we store
a {\em secondary} suffix tree built on the union of the light subtrees
of that node, excluding the first character of each subtree.
If $k>1$, we recurse $k$ times so that there are $k+1$ ``levels'' of
secondary trees. Since the light depth is $O(\lg{|T|})$ in a
heavy-light decomposition, each leaf appears in $O(\lg^k{|T|})$ trees.
Thus, the solution takes $O(|T|\lg^k{|T|})$ space and preprocessing,
and $O(2^k \cdot |P|)$ query time.
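The heavy-light classification itself is easy to compute in linear time. The following sketch (illustrative Python on a generic rooted tree given as child lists; the names are ours) marks each edge using the definition above:

```python
def classify_edges(children, root):
    # Compute subtree sizes, then mark the edge to child c as "heavy"
    # when c's subtree holds more than half of the parent's subtree,
    # and "light" otherwise.
    size = {}
    def compute(u):
        size[u] = 1 + sum(compute(c) for c in children.get(u, []))
        return size[u]
    compute(root)
    label = {}
    for u in size:
        for c in children.get(u, []):
            label[(u, c)] = "heavy" if 2 * size[c] > size[u] else "light"
    return label
```

Since a heavy child contains more than half of its parent's subtree, each node has at most one heavy child, and every root-to-leaf walk crosses only $O(\lg n)$ light edges.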
As mentioned above, there is a way to make the $|P|$ factor additive
in the query time. The idea is to find a way to quickly (in $\lg \lg
|T|$ time) determine whether we should take the light/heavy branch.
We will not delve into the specifics of this solution, but we mention
that, using the suffix tree from above together with least common
ancestor queries and {\em level ancestor} queries, we can detect
whether one of the $2^k$ branches is ``good''.
\section{The Level Ancestor Problem}
For the rest of this lecture, we shift our attention to the level
ancestor problem. We are given a static rooted tree, which can be
preprocessed. Then, a level-ancestor query is given a node $v$ and a
number $l$, and must find the $l\th$ ancestor of $v$. This is
equivalent to finding the depth-$d$ ancestor of $v$, where $d+l =
\textrm{depth}(v)$.
Various solutions to this problem have been proposed \cite{berkman1,
dietz1, alstrup1, bender1}. We will discuss the solution in
\cite{bender1}, by Bender and Farach-Colton. We present gradual steps
leading to a solution that encompasses the different improvements, and
ends up taking linear space and preprocessing time, with constant
query time. First observe that an immediate solution is to store a
lookup table for each node. This gives total space $O(n^2)$ and
constant query time.
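The lookup-table solution is little more than copying each parent's ancestor list (a sketch in Python; the names are ours, and \texttt{parent} maps every node to its parent, with the root mapped to \texttt{None}):

```python
def build_ancestor_table(parent, root):
    # anc[v] lists v's ancestors by distance: anc[v][l] is the l-th
    # ancestor of v, with anc[v][0] = v.  Total space is O(n^2).
    anc = {root: [root]}
    pending = [v for v in parent if v != root]
    while pending:                      # fill in parents before children
        rest = []
        for v in pending:
            if parent[v] in anc:
                anc[v] = [v] + anc[parent[v]]
            else:
                rest.append(v)
        pending = rest
    return anc

def table_query(anc, v, l):
    return anc[v][l]                    # constant-time lookup
```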
\vspace{-.3cm}
\paragraph{Jump pointers.}
Think of skip lists. With {\em jump pointers}, each node stores its
1st, 2nd, 4th, $\ldots$, $2^i$-th ancestors. This takes $O(n \lg n)$
space. To perform queries, repeatedly jump up by the hyperfloor
$\lfloor \lfloor l \rfloor \rfloor = 2^{\lfloor \lg l \rfloor}$
levels. Since $l/2 < \lfloor \lfloor l \rfloor \rfloor \leq l$, the
remaining distance at least halves with each jump, so queries take
$O(\lg n)$ time.
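A minimal sketch of jump pointers (illustrative Python; the names are ours, and the root's pointers are clamped to the root itself):

```python
def build_jump_pointers(parent, depth):
    # up[v][i] is the 2^i-th ancestor of v (clamped at the root).
    # Each node stores O(lg n) pointers, for O(n lg n) total space.
    n_levels = max(1, max(depth.values()).bit_length())
    up = {}
    for v in parent:
        p = parent[v]
        up[v] = [p if p is not None else v]
    for i in range(1, n_levels):
        for v in parent:
            up[v].append(up[up[v][i - 1]][i - 1])
    return up

def jump_query(up, v, l):
    # Repeatedly take the hyperfloor jump 2^{floor(lg l)}; the
    # remaining distance at least halves, giving O(lg l) jumps.
    while l > 0:
        i = l.bit_length() - 1          # floor(lg l)
        v = up[v][i]
        l -= 1 << i
    return v
```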
\vspace{-.3cm}
\paragraph{Long path decomposition.}
We preprocess the tree as follows:
%
\begin{enumerate}
\item take a longest root-to-leaf path and recurse on the remaining
connected components.
\item store each path as an array ordered by depth (so that nodes in
the path may be randomly accessed), and store a pointer to its
parent path.
\item for each node, store the path to which it belongs and its index
in the array for the path.
\end{enumerate}
Let $\func{height}(v)$ be the height of node $v$ within its path,
i.e.~the number of nodes beneath it on that path.
The space usage is clearly $O(n)$. To answer queries, get the path
for the node and check if it is long enough for the queried ancestor
height; if not, recurse. The query time is therefore linear in the
number of paths traversed. Unfortunately, Figure \ref{fig:longpath}
shows that the number of paths can be as high as $\Omega(\sqrt{n})$.
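The decomposition itself can be built in linear time by computing heights first and then greedily following a tallest child (a Python sketch with our own naming; the parent-path pointers from step 2 are omitted for brevity):

```python
def long_path_decomposition(children, root):
    # Repeatedly peel off a longest downward path; every node records
    # its path and its index within that path's array.
    order, stack = [], [root]           # preorder: parents first
    while stack:
        u = stack.pop()
        order.append(u)
        stack.extend(children.get(u, []))
    height = {}
    for u in reversed(order):           # children before parents
        height[u] = 1 + max((height[c] for c in children.get(u, [])),
                            default=0)
    paths, path_of, index_of = [], {}, {}
    for u in order:
        if u in path_of:
            continue                    # u already lies on a path
        path, v = [], u                 # u starts a new path
        while v is not None:
            path_of[v] = len(paths)
            index_of[v] = len(path)
            path.append(v)
            kids = children.get(v, [])
            # continue into a tallest child
            v = max(kids, key=lambda c: height[c]) if kids else None
        paths.append(path)
    return paths, path_of, index_of
```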
\begin{figure*}
\centering
\leavevmode
\scalebox{.6}{\includegraphics{fig1.jpg}}
\caption{The number of paths traversed in the long path decomposition
can be as high as $\Omega(\sqrt{n})$.}
\label{fig:longpath}
\end{figure*}
\paragraph{Ladder decomposition.}
This extends the long path decomposition in a simple but effective
fashion. We extend the length of each path upwards by a factor of 2
(i.e.~extend a path of length $l$ up by $l$ levels), unless, of
course, we hit the root while going up. The extension is called the
{\em ladder}. See Figure \ref{fig:ladder}.
\begin{figure*}
\centering
\leavevmode
\scalebox{.6}{\includegraphics{fig2.jpg}}
\caption{The ladder decomposition.}
\label{fig:ladder}
\end{figure*}
Instead of storing an array with a path, we store an array with the
path plus the ladder. The space is still linear, because the ladder
can be amortized against the path. However, queries can now be
answered in $O(\lg n)$ time. Indeed, assume the current path has
length $l$, and let $w$ be the node at the top of its ladder. Then
$\func{height}(w) \ge 2l$: otherwise, the longest path containing $w$
would have continued down through the ladder and the current path.
Therefore, each ladder step at least doubles the height reached, so
the query finishes after $O(\lg n)$ steps.
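Given the long paths and parent pointers, building the ladders is a single pass (illustrative Python; the names are ours). Each path of length $l$ gains at most $l$ extra array entries, which is exactly the amortization argument for linear space:

```python
def build_ladders(paths, parent):
    # Extend each long path upward by its own length (or until the
    # root): the resulting array holds the extension nodes followed by
    # the path itself, ordered from shallowest to deepest.
    ladders = []
    for path in paths:
        ext, top = [], parent.get(path[0])
        while top is not None and len(ext) < len(path):
            ext.append(top)
            top = parent.get(top)
        ladders.append(ext[::-1] + path)
    return ladders
```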
\paragraph{Combine ladder decomposition \& jump pointers.}
The idea is that jump pointers start with large jumps (that become
exponentially smaller), and the ladder decomposition starts with small
jumps (that become exponentially larger). Then, we can combine these
and have one big jump with jump pointers and another big jump with
ladders. A query proceeds as follows:
%
\begin{itemize}
\item take one jump pointer to go up $\lfloor \lfloor l \rfloor \rfloor
  > l/2$ levels. Call the intermediate node that is reached $w$. We
  have $\func{height}(w) > l/2$, because we know there is a descending
  path of length $\lfloor\lfloor l \rfloor\rfloor$ below $w$.
\item take one ladder step. Because $\func{height}(w) > l/2$, we
  know that the ladder of $w$'s path extends at least $l/2$ levels
  above $w$ (or it reaches the root), so we can get to the correct
  ancestor right away.
\end{itemize}
This strategy gives constant-time queries, but $O(n \lg n)$ space for
the jump pointers.
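Putting the two structures together gives the following self-contained sketch (illustrative Python; all names are ours, and jump pointers are stored at every node here, so the space is still $O(n \lg n)$ before the tuning that follows):

```python
def preprocess(parent, children, root):
    # Depths, with parents listed before children.
    depth, order, stack = {root: 0}, [], [root]
    while stack:
        u = stack.pop()
        order.append(u)
        for c in children.get(u, []):
            depth[c] = depth[u] + 1
            stack.append(c)
    # Jump pointers: up[v][i] is the 2^i-th ancestor (clamped at root).
    levels = max(1, max(depth.values()).bit_length())
    up = {}
    for v in depth:
        p = parent.get(v)
        up[v] = [p if p is not None else v]
    for i in range(1, levels):
        for v in depth:
            up[v].append(up[up[v][i - 1]][i - 1])
    # Long paths with ladders; each node records its array and position.
    height = {}
    for u in reversed(order):
        height[u] = 1 + max((height[c] for c in children.get(u, [])),
                            default=0)
    arr_of, pos_of, arrays = {}, {}, []
    for u in order:
        if u in arr_of:
            continue
        path, v = [], u                 # peel a longest path from u
        while v is not None:
            path.append(v)
            kids = children.get(v, [])
            v = max(kids, key=lambda c: height[c]) if kids else None
        ext, t = [], parent.get(u)      # ladder: extend up by |path|
        while t is not None and len(ext) < len(path):
            ext.append(t)
            t = parent.get(t)
        arr = ext[::-1] + path          # shallowest node first
        for j, x in enumerate(path):
            arr_of[x] = len(arrays)
            pos_of[x] = len(ext) + j
        arrays.append(arr)
    return depth, up, arr_of, pos_of, arrays

def la_query(ds, v, l):
    depth, up, arr_of, pos_of, arrays = ds
    assert 0 <= l <= depth[v]
    if l == 0:
        return v
    i = l.bit_length() - 1              # one jump of 2^i > l/2 levels
    w = up[v][i]
    l -= 1 << i                         # remaining distance < 2^i
    arr = arrays[arr_of[w]]             # one ladder step finishes
    return arr[pos_of[w] - l]
```

One jump plus one ladder step suffices: after jumping $2^{\lfloor \lg l \rfloor} > l/2$ levels, the remaining distance is less than the reached node's height, so its extended array contains the answer.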
\paragraph{Tune jump pointers.}
We want to store jump pointers only at the leaves to reduce space. To
accommodate that, we store with every non-leaf node an arbitrary {\em
leaf descendant}. The depth-$d$ ancestor of $v$ is the same as the
depth-$d$ ancestor of its leaf descendant, so we can start queries at
leaves. The space usage is $O(n + L \lg n)$, where $L$ is the number
of leaves.
\paragraph{Microtree/macrotree decomposition.}
We want to limit the number of leaves, so we perform a
microtree/macrotree decomposition. The macrotree has $O(n / \lg n)$
leaves, so it takes $O(n)$ space using the solution from above.
Microtrees have $O(\lg n)$ branching nodes, so we can use a more
brute-force approach. We first number nodes by the Euler tour of the
microtree. Note that for each possible depth within the microtree,
there are at most $O(\lg n)$ nodes at that depth (because there are at
most $O(\lg n)$ paths from the restriction on branching nodes). So we
can store a fusion structure (atomic heap) for each possible depth,
holding the Euler tour indices of the nodes at that depth. The total
space is linear in the number of nodes. To find the depth-$d$ ancestor
of node $v$, we simply query for the predecessor of $v$'s Euler tour
index in the structure for depth $d$.
\paragraph{Weighted level ancestor.}
A more general version of the problem involves edge weights. This is
useful for compacted tries, where an edge may represent more than
one letter. The solution for microtrees from above doesn't quite work,
but similar ideas can be applied to obtain $O(\lg \lg n)$ query time
and $O(n)$ space. The $O(\lg\lg n)$ bound comes from predecessor search.
%\bibliography{mybib}
\bibliographystyle{alpha}
\begin{thebibliography}{77}
\bibitem{alstrup1}
Stephen Alstrup, Jacob Holm.
\emph{Improved Algorithms for Finding Level Ancestors in Dynamic Trees}.
ICALP 2000: 73--84.
\bibitem{bender1}
Michael A. Bender, Martin Farach-Colton.
\emph{The Level Ancestor Problem Simplified}.
Theor. Comput. Sci. 321(1): 5--12, 2004.
\bibitem{berkman1}
Omer Berkman, Uzi Vishkin.
\emph{Finding Level-Ancestors in Trees}.
J. Comput. Syst. Sci. 48(2): 214--230, 1994.
\bibitem{cole1}
Richard Cole, Lee-Ad Gottlieb, Moshe Lewenstein.
\emph{Dictionary Matching and Indexing with Errors and Don't Cares}.
STOC 2004: 91--100.
\bibitem{dietz1}
Paul F. Dietz.
\emph{Finding Level-Ancestors in Dynamic Trees}.
WADS 1991: 32--40.
\end{thebibliography}
\end{document}