\documentclass[11pt]{article}
\usepackage{latexsym}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage{epsfig}
\usepackage{psfig}
\newcommand{\handout}[5]{
\noindent
\begin{center}
\framebox{
\vbox{
\hbox to 5.78in { {\bf 6.851: Advanced Data Structures } \hfill #2 }
\vspace{4mm}
\hbox to 5.78in { {\Large \hfill #5 \hfill} }
\vspace{2mm}
\hbox to 5.78in { {\em #3 \hfill #4} }
}
}
\end{center}
\vspace*{4mm}
}
\newcommand{\lecture}[4]{\handout{#1}{#2}{#3}{Scribe: #4}{Lecture #1}}
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{observation}[theorem]{Observation}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{claim}[theorem]{Claim}
\newtheorem{fact}[theorem]{Fact}
\newtheorem{assumption}[theorem]{Assumption}
% 1-inch margins, from fullpage.sty by H.Partl, Version 2, Dec. 15, 1988.
\topmargin 1in
\advance \topmargin by -\headheight
\advance \topmargin by -\headsep
\textheight 8.9in
\oddsidemargin 0pt
\evensidemargin \oddsidemargin
\marginparwidth 0.5in
\textwidth 6.5in
\parindent 0in
\parskip 1.5ex
%\renewcommand{\baselinestretch}{1.25}
\begin{document}
\lecture{22 --- April 29}{Spring 2010}{Prof.\ Erik Demaine}{Aaron Bernstein, revised by: Morteza Zadimoghaddam}
\section{Overview}
Up until now, we have mainly studied how to decrease the query time and the preprocessing time of our data structures. In this lecture, we will focus on storing the data as compactly as possible. Our goal will be to get as close to the information-theoretic optimum as possible. We will refer to this optimum as OPT. Note that most ``linear space'' data structures we have seen are still far from the information-theoretic optimum because they typically use $O(n)$ {\it words} of space, whereas OPT is usually $O(n)$ {\it bits}. This strict space limitation makes it very hard to build dynamic data structures, so most space-efficient data structures are static.
Here are some possible goals we can strive for:
\begin{itemize}
\item {\em Implicit Data Structures} -- Space = information-theoretic OPT + $O(1)$. The $O(1)$ is there so that we can round up if OPT is fractional. Most implicit data structures just store some permutation of the data: that is all we can really do. As simple examples, a heap is an implicit dynamic data structure, and a sorted array is an implicit static one.
\item {\em Succinct Data Structures} -- Space = OPT + $o(\mathrm{OPT})$. In other words, the leading constant is 1. This is the most common type of space-efficient data structure.
\item {\em Compact Data Structures} -- Space = $O(\mathrm{OPT})$. Note that some ``linear space'' data structures are not actually compact because they use $O(w \cdot \mathrm{OPT})$ bits. For example, a suffix tree uses $O(n)$ words of space, but its information-theoretic lower bound is $n$ bits. On the other hand, a BST can be seen as a compact data structure.
\end{itemize}
\subsection{Mini-survey}
\begin{itemize}
\item {\em Implicit Dynamic Search Tree} -- In 2003, Franceschini and Grossi~\cite{fg} developed an implicit dynamic search tree which supports insert, delete, and predecessor in $O(\log n)$ worst-case time.
\item {\em Succinct Dictionary} -- Uses $\lg \binom{u}{n} + O\big(\frac{\lg \binom{u}{n}}{\lg\lg\lg u}\big)$ bits~\cite{bm} or
$\lg \binom{u}{n} + O\big(\frac{n (\lg \lg n)^2}{\lg n}\big)$ bits~\cite{pag01}, and supports
$O(1)$-time membership queries; $u$ is the size of the universe from
which the $n$ elements are drawn.
\item {\em Succinct Binary Tries} -- The number of possible binary tries with $n$ nodes is the $n$th Catalan number, $C_n = \binom{2n}{n}/(n+1) \approx 4^{n}$. Thus, $\mathrm{OPT} \approx \log(4^{n}) = 2n$. We note that $C_n$ can be derived from a recursion formula based on the sizes of the left and right subtrees of the root. In this lecture, we will show how to use $2n + o(n)$ bits of space. We will be able to find the left child, the right child, and the parent in $O(1)$ time. We will also give some intuition for how to answer subtree-size queries in $O(1)$ time. Subtree size is important because it allows us to keep track of the rank of the node we are at.
\item {\em Almost Succinct $k$-ary Trie} -- The number of such tries is $C^{k}_n = \binom{kn+1}{n} / (kn + 1) \approx 2^{(\log k + \log e)n}$. Thus, $\mathrm{OPT} \approx (\log k + \log e)n$. The best known data structure was developed by Benoit {\it et al.}~\cite{bdmrrr}. It uses $(\lceil \log k \rceil + \lceil \log e \rceil)n + o(n) + O(\log\log k)$ bits. This representation still supports the following queries in $O(1)$ time: find the child with label $i$, find the parent, and find the subtree size.
\item {\em Succinct Rooted Ordered Trees} -- These are different from tries because there can be no absent children. The number of possible trees is $C_n$, so $\mathrm{OPT} \approx 2n$. A query can ask us to find the $i$th child of a node, the parent of a node, or the subtree size of a node. Clark and Munro~\cite{cm} gave a succinct data structure which uses $2n + o(n)$ bits of space and answers queries in constant time.
\item {\em Permutation} -- In this data structure, we are given a permutation $\pi$ of $n$ items, and the queries are of the form $\pi^{k}(i)$. Munro {\it et al.}~\cite{Munro03} present a data structure with constant query time and space $(1+\epsilon)n\log n+O(1)$ bits. They also obtain a succinct data structure with $\lceil \log{n!}\rceil + o(n)$ bits and query time $O(\log{n} / \log{\log{n}})$.
\end{itemize}
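
The estimate $C_n \approx 4^n$ used for binary tries can be made slightly more precise with Stirling's approximation. The following short derivation (logarithms base 2) is included only as a sanity check on the claim $\mathrm{OPT} \approx 2n$:
\begin{align*}
\binom{2n}{n} &\sim \frac{4^n}{\sqrt{\pi n}}, &
C_n = \frac{1}{n+1}\binom{2n}{n} &\sim \frac{4^n}{\sqrt{\pi}\, n^{3/2}}, &
\log C_n &= 2n - \tfrac{3}{2}\log n - O(1).
\end{align*}
So $\mathrm{OPT} = \lceil \log C_n \rceil = 2n - \Theta(\log n)$, and a representation using $2n + o(n)$ bits is succinct. For example, for $n = 7$ we have $C_7 = 429$, so $\lceil \log 429 \rceil = 9$ bits suffice in principle, compared with $2n = 14$.
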
\section{Level Order Representation of Binary Tries}
One of the central techniques for succinctly representing tries is called the {\it level order representation}. We go through the nodes in level order, and for each one we write down 2 bits. The first bit represents whether that node has a left child (1 if it does, 0 if it doesn't), and the second represents whether it has a right child. In the example in Figure 1, we would go through the nodes in the order A, B, C, D, E, F, G, and we would end up with the bit-string $B = 11011101000000$.
\begin{figure}
\begin{center}
\includegraphics[scale=0.7, bb= 20 20 320 320]{6851ScribingBinaryTries.jpg}
\caption{A Binary Trie}
\end{center}
\end{figure}
\newpage
{\bf External nodes:} Another way of thinking of the level order representation is to add an {\it external} node wherever a child is missing. Now we go through the nodes (including the external ones) in level order, and write 1 if the node is internal and 0 if it is external. It turns out that this gives us the same bit-string as the representation above, except with an extra 1 at the front (this 1 represents the root). So in Figure 1, we would have $B = 111011101000000$.
\subsection{Navigating}
It may seem as though this representation will be very hard to navigate, but the following theorem makes it much easier.

\begin{theorem}
In the external-node version of the array $B$ (with positions indexed from 1), the left and right children of the $i$th internal node are at positions $2i$ and $2i+1$.
\end{theorem}

\begin{proof}
Let $D$ be the $i$th internal node, that is, the $i$th 1 in our array. Suppose that it is at position $i+j$; in other words, there are $j$ 0's before it. Every node other than the root is the child of exactly one internal node, and in level order the children of an earlier internal node appear before the children of a later one. The $i-1$ internal nodes before $D$ have a total of $2(i-1)$ children (counting external nodes), and all of them appear before left($D$). Of these children, $i+j-1$ occupy positions $2, \ldots, i+j$ (every node up to and including $D$, except the root), so $2(i-1) - (i+j-1) = i-j-1$ of them lie strictly between $D$ and left($D$). Thus, left($D$) is at position $(i+j) + (i-j-1) + 1 = 2i$, and right($D$) immediately follows at position $2i+1$.

We can also count directly. The nodes before left($D$) are exactly the root together with the $2(i-1)$ children of the $i-1$ internal nodes preceding $D$; each non-root node is counted exactly once because it has exactly one parent. So there are $2(i-1)+1 = 2i-1$ nodes before the left child of $D$, which places it at position $2i$.
\end{proof}
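
As a sanity check, the positions can be tabulated for the external-node string $B = 111011101000000$ of the running example. The table assumes positions indexed from 1, uses the node names from Figure 1, and marks external nodes with $\cdot$:
\begin{center}
\begin{tabular}{l|ccccccccccccccc}
position & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 & 13 & 14 & 15 \\ \hline
bit & 1 & 1 & 1 & 0 & 1 & 1 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
node & A & B & C & $\cdot$ & D & E & F & $\cdot$ & G & $\cdot$ & $\cdot$ & $\cdot$ & $\cdot$ & $\cdot$ & $\cdot$
\end{tabular}
\end{center}
For instance, D is the 4th internal node ($i = 4$) and sits at position 5 (so $j = 1$). Its children occupy positions $2i = 8$ and $2i + 1 = 9$: an external node (D has no left child) and G.
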
\section{Rank and Select}
Suppose that we could support the following operations on an $n$-bit string in $O(1)$ time, with $o(n)$ bits of extra space:
\begin{itemize}
\item rank($i$) = number of 1's at or before position $i$
\item select($j$) = position of the $j$th 1
\end{itemize}
This would give us the desired representation of the binary trie. The space requirement would be $2n+1$ bits for the level-order representation (with external nodes), plus $o(n)$ bits for rank/select. Here is how we would support queries on the internal node at position $i$:
\begin{itemize}
\item left-child($i$) = 2\,rank($i$)
\item right-child($i$) = 2\,rank($i$) + 1
\item parent($i$) = select($\left\lfloor i/2 \right\rfloor$)
\end{itemize}
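
As a worked example, take the external-node string $B = 111011101000000$ of the running example, with positions indexed from 1. Node D sits at position 5, and the 1's at or before position 5 are at positions 1, 2, 3, 5:
\begin{align*}
\text{rank}(5) &= 4, \\
\text{left-child}(5) &= 2 \cdot 4 = 8, & B[8] &= 0 \quad \text{(external: D has no left child)}, \\
\text{right-child}(5) &= 2 \cdot 4 + 1 = 9, & B[9] &= 1 \quad \text{(internal: this is G)}, \\
\text{parent}(9) &= \text{select}(\lfloor 9/2 \rfloor) = \text{select}(4) = 5 & & \text{(back at D)}.
\end{align*}
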
\subsection{Rank}
This algorithm was developed by Jacobson in 1989~\cite{jacob}. It uses many of the same ideas as RMQ. The basic idea is to use a constant number of levels of recursion until we get down to subproblems of size $k = \log(n)/2$. Note that there are only $2^{k} = \sqrt{n}$ possible strings of size $k$, so we can simply store a lookup table for {\it all possible bit strings of size $k$}. For each such string there are $k = O(\log n)$ possible queries, and it takes $O(\log k)$ bits to store the answer to each query (the rank of that position within the string). In total, this is only $O(2^k \cdot k \log{k}) = O(\sqrt{n}\log n \log{\log n}) = o(n)$ bits.

{\bf First Attempt:} We split the bit string into $n/\log^2 n$ chunks of size $\log^2 n$. To find rank($i$), we add (the rank of $i$ within its chunk) and (the number of 1's in all preceding chunks).
For the second term, we store, for each chunk, the total number of 1's among all of the preceding chunks.
There are $n/\log^2 n$ chunks, and for each of them we store one number of $\log n$ bits, so this takes $O(n/\log{n})$ bits, which we can afford.
It remains to find rank($i$) within a chunk.

{\bf Second Attempt:} We now need the rank within a chunk of size $\log^2{n}$. The solution is to use one more level of recursion: we split the string into $2n/\log n$ subchunks of size $\log(n)/2$.
The rank within a subchunk can be found using the lookup table. What remains is the number of 1's in the preceding subchunks of the same chunk. This number is at most $\log^2{n}$ because we are within a chunk of size $\log^2{n}$, so each of the $2n/\log{n}$ numbers can be stored in $O(\log\log n)$ bits. The total space for this part is therefore $O(n \log\log n / \log n) = o(n)$ bits as well.
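
Putting the two levels together, a query can be sketched as a sum of three table lookups. The names $R_1$, $R_2$, and $T$ are labels introduced here for illustration only. For simplicity the sketch assumes that positions are indexed from 0, that rank($i$) counts the 1's in positions $0, \ldots, i$, and that $\log n$ is an even integer. Write $c = \log^2 n$ for the chunk size and $s = \log(n)/2$ for the subchunk size:
\[
\text{rank}(i) \;=\; R_1\big[\lfloor i/c \rfloor\big] \;+\; R_2\big[\lfloor i/s \rfloor\big] \;+\; T[x][\,i \bmod s\,],
\]
where $R_1[a]$ is the number of 1's in all chunks before chunk $a$, $R_2[b]$ is the number of 1's in the subchunks that precede subchunk $b$ within the same chunk, $x$ is the $s$-bit subchunk containing position $i$, and $T[x][q]$ is the number of 1's among the first $q+1$ bits of $x$. The space used by the three parts is
\begin{align*}
R_1 &: \frac{n}{\log^2 n} \cdot O(\log n) = O\Big(\frac{n}{\log n}\Big), &
R_2 &: \frac{2n}{\log n} \cdot O(\log\log n) = O\Big(\frac{n \log\log n}{\log n}\Big), &
T &: O\big(\sqrt{n} \log n \log\log n\big),
\end{align*}
all of which are $o(n)$ bits.
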
\subsection{Select}
This algorithm was developed by Clark and Munro in 1996~\cite{cm}. Select is similar to rank, although more complicated. This time, since we are trying to find the position of the $i$th one, we break our array up into chunks containing equal numbers of ones, as opposed to chunks of equal size.

{\bf Step 1:} First, we pick every $(\log n \log\log n)$th 1 to be a {\it special one}, and we store the index of every special one. Storing an index takes $\log n$ bits, so this takes $O\big(\frac{n}{\log n \log\log n} \cdot \log n\big) = O(n/\log\log n) = o(n)$ bits. Now we restrict our attention to a single chunk: the sequence of bits between two consecutive special ones (note that there are $O(n/(\log n \log\log n))$ chunks). Let $r$ be the {\it total} number of bits in a chunk. If $r > (\log n \log\log n)^{2}$, we go to Step 2. If $r \leq (\log n \log\log n)^{2}$, we go to Step 3.

{\bf Step 2:} Since the chunks are disjoint, there are at most $n/(\log n \log\log n)^{2}$ chunks of size greater than $(\log n \log\log n)^{2}$. Thus, we can afford to brute force the problem by storing the index (in our bit-string) of every 1 in the chunk. Each chunk contains $\log n \log\log n$ ones, and storing each index takes $O(\log n)$ bits, so the total space, over all of these large chunks, is $O\big(\frac{n}{(\log n \log\log n)^{2}} \cdot \log n \log\log n \cdot \log n\big) = O(n/\log\log n) = o(n)$ bits.

{\bf Step 3:} In this case the chunk has $r \leq (\log n \log\log n)^{2}$ bits, and we recurse again. This time, within the chunk, we pick every $(\log\log n)^{2}$th one to be a {\it mini-special one} and store its position relative to the start of the chunk; this takes $O(\log\log n)$ bits per mini-special one, for $O(n/\log\log n)$ bits overall. We then split the chunk into mini-chunks. If a mini-chunk has size greater than $(\log\log n)^{4}$, then we brute force as in Step 2, storing positions relative to the start of the chunk. There are at most $n/(\log\log n)^{4}$ such mini-chunks, each one contains $O((\log\log n)^{2})$ ones, and storing each relative index takes $O(\log((\log n \log\log n)^2)) = O(\log\log n)$ bits. Thus, the overall number of bits is $O(n(\log\log n)^{3}/(\log\log n)^{4}) = O(n/\log\log n) = o(n)$. If a mini-chunk has size at most $(\log\log n)^{4}$, then it is small enough that we can afford to store a lookup table for all possible bit strings of this size (just as we did for rank).

Note that the lookup table fits in $o(n)$ bits because $(\log\log n)^4$ is at most $\log(n)/2$ for sufficiently large $n$.
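
The resulting query can be summarized as the following case analysis. This is only a sketch: it assumes that each chunk and mini-chunk also stores a flag (or pointer) saying which case applies to it, a bookkeeping detail that is not spelled out above.
\begin{enumerate}
\item Dividing $j$ by $\log n \log\log n$ identifies the chunk containing the $j$th one and the rank $j'$ of that one within the chunk. The stored index of the preceding special one gives the position where the chunk starts.
\item If the chunk is large (Step 2), return the stored index of its $j'$th one.
\item Otherwise, dividing $j'$ by $(\log\log n)^2$ identifies the mini-chunk and the rank $j''$ of the one within the mini-chunk.
\item If the mini-chunk is large, return the start of the chunk plus the stored relative index of its $j''$th one.
\item Otherwise, read the bits of the mini-chunk, look up the position of its $j''$th one in the table, and add the offsets of the mini-chunk and the chunk.
\end{enumerate}
Each step is a constant number of arithmetic operations and table lookups, so select($j$) takes $O(1)$ time.
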
\section{Subtree Sizes}
We have shown a succinct binary trie which allows us to find left children, right children, and parents. But we would still like to find subtree sizes in $O(1)$ time. The level order representation does not work for this, because level order gives no information about depth. Thus, we will instead try to encode our nodes in depth-first order.

In order to do this, notice that there are $C_n$ (the $n$th Catalan number) binary tries on $n$ nodes. But there are also $C_n$ rooted ordered trees on $n+1$ nodes, and there are $C_n$ balanced parentheses strings with $n$ pairs of parentheses. Moreover, we will describe a bijection: binary tries $\Leftrightarrow$ rooted ordered trees $\Leftrightarrow$ balanced parentheses. This makes the problem much easier because we can work with balanced parentheses, which have a natural bit encoding: 1 for an open parenthesis, 0 for a closed one.
\subsection{The Bijections}
We will use the binary trie in Figure 1. To make this into a rooted ordered tree, we can think of rotating the trie 45 degrees counter-clockwise. Thus, the top three nodes of the tree will be the right spine of the trie (A, C, F). To make the tree rooted, we add an extra root $*$, whose children are A, C, and F. Now we recurse into the left subtrees of A, C, and F. For A, the right spine of its left subtree is B, D, G, and these become the children of A. For C, the right spine of its left subtree is just E, the left child of C. F has no left child. Figure 2 shows the resulting rooted ordered tree.
%\newpage
%\hspace{4in}
\begin{figure}
\begin{center}
\includegraphics[scale=0.5 , bb= 220 20 250 100]{6_851_scribing_ordered_trees.jpg}
%\includegraphics[width = 4.7 in]{6_851_scribing_ordered_trees.jpg}
\caption{A Rooted Ordered Tree That Represents the Trie in Figure 1}
\end{center}
\end{figure}
To go from rooted ordered trees to balanced parentheses strings, we do a DFS of the ordered tree. We write an open parenthesis when we first touch a node, and a closed parenthesis the second time we touch it. Figure 3 contains a parentheses representation of the ordered tree in Figure 2.
\begin{figure}
\begin{center}
%\includegraphics[scale=0.6, bb= 20 20 250 100]{6_851_parentheses.jpg}
\includegraphics[scale=0.6, bb= 220 20 250 100]{6_851_parentheses.jpg}
\caption{A Balanced Parentheses String That Represents the Ordered Tree in Figure 2}
\end{center}
\end{figure}
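
For concreteness, the string can also be derived by hand. The derivation assumes the trie is the one encoded by the level-order string given earlier (so B, D, E, F, and G have no left children) and that the extra root $*$ is included. The ordered tree then has root $*$ with children A, C, F; A has children B, D, G; and C has the single child E. A DFS produces the following string, with positions indexed from 1:
\begin{center}
\begin{tabular}{l|cccccccccccccccc}
position & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 & 13 & 14 & 15 & 16 \\ \hline
symbol & ( & ( & ( & ) & ( & ) & ( & ) & ) & ( & ( & ) & ) & ( & ) & ) \\
node & $*$ & A & B & B & D & D & G & G & A & C & E & E & C & F & F & $*$ \\
bit & 1 & 1 & 1 & 0 & 1 & 0 & 1 & 0 & 0 & 1 & 1 & 0 & 0 & 1 & 0 & 0
\end{tabular}
\end{center}
The 8 nodes of the ordered tree (the 7 trie nodes plus $*$) give $2 \cdot 8 = 16$ parentheses.
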
Now we will show how the queries are transformed by these bijections. For example, if we want to find the parent in our binary trie, what does this correspond to in the parentheses string? Each bold-face heading below is a query on the binary trie, and under it we describe the corresponding query in the rooted ordered tree and in the parentheses string.
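
{\bf Node $v$:}
\begin{itemize}
\item {\it Rooted Ordered Tree:} Also just a node.
\item {\it Parentheses String:} The open parenthesis written when the DFS first visits $v$.
\end{itemize}

{\bf Left-Child($v$):}
\begin{itemize}
\item {\it Rooted Ordered Tree:} First-Child($v$).
\item {\it Parentheses String:} The next parenthesis. If it is closed, then $v$ has no left child.
\end{itemize}

{\bf Right-Child($v$):}
\begin{itemize}
\item {\it Rooted Ordered Tree:} The next sibling of $v$.
\item {\it Parentheses String:} The parenthesis after the matching closed parenthesis of $v$. If it is closed, then $v$ has no right child.
\end{itemize}

{\bf Subtree-Size($v$):}
\begin{itemize}
\item {\it Rooted Ordered Tree:} subtree-size($v$) + $\sum_{w \in RS}$ subtree-size($w$), where $RS$ is the set of right siblings of $v$.
\item {\it Parentheses String:} Half of the distance from the open parenthesis of $v$ to the closed parenthesis of the nearest enclosing pair. (Half of the distance to the matching closed parenthesis of $v$ itself gives only the first term, the size of $v$'s subtree in the ordered tree.)
\end{itemize}

{\bf Parent($v$):}
\begin{itemize}
\item {\it Rooted Ordered Tree:} $v$'s left sibling, if it exists. Otherwise, $v$'s parent.
\item {\it Parentheses String:} Look at the parenthesis immediately before the open parenthesis of $v$. If it is closed, its matching open parenthesis is $v$'s left sibling. If it is open, it belongs to the nearest enclosing pair, which is $v$'s parent in the ordered tree; if that node is the extra root $*$, then $v$ is the root of the trie and has no parent.
\end{itemize}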
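
As a worked example, assume the string \texttt{((()()())(())())} obtained by a DFS of the ordered tree with root $*$, children A, C, F of $*$, children B, D, G of A, and child E of C. With positions indexed from 1, A opens at position 2 and closes at 9, B opens at 3, C opens at 10 and closes at 13, F opens at 14 and closes at 15, and $*$ closes at 16. Then:
\begin{itemize}
\item Left-Child(A): position 3 is an open parenthesis, namely B, which is the left child of A in the trie.
\item Right-Child(A): A closes at position 9, and position 10 is an open parenthesis, namely C.
\item Right-Child(F): F closes at position 15, and position 16 is a closed parenthesis, so F has no right child.
\item Subtree-Size(A): in the ordered tree, A's subtree has 4 nodes and its right siblings C and F have subtrees of sizes 2 and 1, giving $4 + 2 + 1 = 7$, the whole trie. In the string, this is half the distance from the open parenthesis of A (position 2) to the closed parenthesis of the enclosing pair (position 16): $(16 - 2)/2 = 7$.
\item Parent(C): in the ordered tree, C has the left sibling A, and A is indeed the parent of C in the trie. In the string, the parenthesis just before position 10 is the closed parenthesis at position 9, whose matching open parenthesis is A.
\end{itemize}
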
%\bibliography{mybib}
\bibliographystyle{alpha}
\begin{thebibliography}{77}
\bibitem{fg}
G. Franceschini, R. Grossi,
\emph{Optimal Worst-case Operations for Implicit Cache-Oblivious Search Trees},
Proceedings of the 8th International Workshop on Algorithms and Data Structures (WADS), 114--126, 2003.
\bibitem{bm}
A. Brodnik, I. Munro,
\emph{Membership in Constant Time and Almost Minimum Space},
SIAM J. Computing, 28(5): 1627--1640, 1999.
\bibitem{bdmrrr}
D. Benoit, E. Demaine, I. Munro, R. Raman, V. Raman, S. Rao,
\emph{Representing Trees of Higher Degree},
Algorithmica, 43(4): 275--292, 2005.
\bibitem{cm}
D. Clark, I. Munro,
\emph{Efficient Suffix Trees on Secondary Storage},
SODA, 383--391, 1996.
\bibitem{jacob}
G. Jacobson,
\emph{Succinct Static Data Structures},
PhD thesis, Carnegie Mellon University, 1989.
\bibitem{pag01}
R. Pagh,
\emph{Low Redundancy in Static Dictionaries with Constant Query Time},
SIAM Journal on Computing, 31(2): 353--363, 2001.
\bibitem{Munro03}
J. I. Munro, R. Raman, V. Raman, S. S. Rao,
\emph{Succinct Representations of Permutations},
ICALP, LNCS 2719, 345--356, 2003.
\end{thebibliography}
\end{document}