\documentclass[11pt]{article}
\usepackage{latexsym}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage{epsfig}
%\usepackage{psfig}
\newcommand{\handout}[5]{
\noindent
\begin{center}
\framebox{
\vbox{
\hbox to 5.78in { {\bf 6.897: Advanced Data Structures } \hfill #2 }
\vspace{4mm}
\hbox to 5.78in { {\Large \hfill #5 \hfill} }
\vspace{2mm}
\hbox to 5.78in { {\em #3 \hfill #4} }
}
}
\end{center}
\vspace*{4mm}
}
\newcommand{\lecture}[4]{\handout{#1}{#2}{#3}{Scribe: #4}{Lecture #1}}
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{observation}[theorem]{Observation}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{claim}[theorem]{Claim}
\newtheorem{fact}[theorem]{Fact}
\newtheorem{assumption}[theorem]{Assumption}
% 1-inch margins, from fullpage.sty by H.Partl, Version 2, Dec. 15, 1988.
\topmargin 0pt
\advance \topmargin by -\headheight
\advance \topmargin by -\headsep
\textheight 8.9in
\oddsidemargin 0pt
\evensidemargin \oddsidemargin
\marginparwidth 0.5in
\textwidth 6.5in
\parindent 0in
\parskip 1.5ex
%\renewcommand{\baselinestretch}{1.25}
\begin{document}
\lecture{17 --- April 7, 2005}{Spring 2005}
{Lecturer: Mihai P\u{a}tra\c{s}cu}{Tim Abbott}
\section{Overview}
In the last lecture we saw how to solve marked ancestor using
$O(\frac{\log n}{\log \log n})$ time for queries, and $O(\log \log n)$
time for updates. Today we give an essentially tight lower bound for
the existential marked ancestor problem in the cell probe model, due
to Alstrup, Husfeldt and Rauhe~\cite{AHR}. Existential marked
ancestor queries require an answer as to whether or not the node has a
marked ancestor; the updates of marking or unmarking a node are the
same as before. Thus, existential queries are easier than the ones we
considered before. Recall that in the cell probe model, the cost being
bounded is the number of cell probes to $\Theta(\log n)$-bit cells.
The argument we use is an interesting refinement of the Chronogram
Technique, introduced by Fredman and Saks in~\cite{FS}.
\section{A Lower Bound for Counting Ancestors}
Recall that a data structure for the marked ancestor problem supports
two types of operations on a static tree. One is an update, where we
either set or unset the mark bit of a given node. The query on the
data structure varies with the version, but in general we are given a
leaf and asked to compute some property of the marked ancestors of
that leaf (whether they exist, how many of them there are, the lowest
one etc). We first consider the problem of counting the number of
marked ancestors, modulo 2. It is much easier to obtain a lower bound
for the counting problem than for the existential problem, and we use
this opportunity to introduce the general technique.
\begin{theorem}
Let $t_u$ be the update time, and $t_q$ be the query time. For a
perfect tree of branching factor $B \ge t_u \log^2n$, we have the
tradeoff
\[
t_q = \Omega(\log_B n) =
\Omega \left(\frac{\log n}{\log t_u + \log\log n}\right)
\]
\end{theorem}
Note that for any $t_u = \lg^{O(1)} n$, we have that $t_q =
\Omega(\frac{\log n}{\log\log n})$. Thus, our bound from last lecture
was tight. Note also that the lower bound is linear in the height of
the tree considered, so we are proving that simply scanning the
root-to-leaf path is optimal for the complete tree. We will show the
bound in the worst case; it also holds in the amortized case. Finally,
observe that the theorem extends easily to binary trees, because we
can embed our tree in a binary tree (ignore all levels whose depth is
not divisible by $\lg B$).
\subsection{A Hard Sequence of Operations}
We first describe the hard sequence of operations on the data
structure. The first thing we do is iterate through all the nodes of
the tree, and at each node with probability $\frac{1}{2}$ set the mark
bit to one (otherwise set it to 0). We do this from the bottom of the
tree upwards (so that all the nodes on a given level are visited
before any nodes on higher levels). The result is that the list of $n$
values of the mark bits is a uniformly random vector in $\{0,1\}^n$.
Then, after all of those updates, query a random leaf.
Intuitively, an update wants to inform the leaves in its subtree about
the node's bit (because queries come at the leaves). At the very
least, it could try to inform its children about the bit, so that a
query could ignore this node. However, we set $B$ to be larger than
$t_u$, so intuitively, whatever propagation an update does, it is only
useful with negligible probability. To solidify this intuition, note
that it is easy to handle the case when we scan and update nodes
beginning with higher levels. When updating a node, it can just obtain
a partial count from the root to its parent (because updates from
higher levels happened before), and calculate a partial count for
itself. Then all operations take constant time.
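To make this easy case concrete, here is a minimal scheme (the counter
notation $c(v)$ is ours, not from the lecture): when scanning
top-down, each node $v$ can store the parity of the marks on its
root-to-$v$ path,
\[
c(v) \;=\; c(\mathrm{parent}(v)) + \mathrm{mark}(v) \pmod 2,
\]
with $c(\mathrm{root}) = \mathrm{mark}(\mathrm{root})$. When $v$ is
updated, all of its ancestors have already received their final marks,
so $c(\mathrm{parent}(v))$ is available in $O(1)$ probes; a query at a
leaf $\ell$ just returns $c(\ell)$. The bottom-up update order
destroys exactly this shortcut.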
\subsection{Proof of the Lower Bound}
Define \emph{epoch} $j$ to be the time during which the updates on
level $j$ were being executed. Thus, in epoch 0 the root was updated,
and in epoch 1 the $B$ elements at the second level were updated; and
in general in epoch $j$, $B^j$ updates were executed. Note that the
epochs occurred temporally in order of decreasing $j$.
Our strategy is to show that the query algorithm must read in
expectation $\Omega(1)$ cells written during each epoch. In total over
the $\log_B n$ epochs, we achieve the desired lower bound in
expectation. This implies that it must hold in the worst case as well.
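In symbols, let $p_j$ denote the probability that the query reads at
least one cell last written during epoch $j$ (the notation $p_j$ is
ours). Since cells last written in distinct epochs are distinct,
linearity of expectation gives
\[
\mathbb{E}[t_q] \;\ge\; \sum_{j=1}^{\log_B n} p_j
  \;\ge\; \Omega(1) \cdot \log_B n \;=\; \Omega(\log_B n),
\]
once we establish $p_j = \Omega(1)$ for every epoch $j$.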
The only variability that occurs here is in the state of each node,
and the identity of the random leaf. Thus, the problem is determined
by $u \in \{0,1\}^n$, the state of all the nodes in the tree, along
with the value of the random query element $q$. We then wish to show
that for the random $u$ we've constructed, and random $q$, the problem
is hard.
Fix a level $j$. During epoch $j$, there was a chunk of $B^j$
updates. After those updates were completed, there was a series of
smaller epochs, exponentially decreasing in size. Suppose that our
query algorithm reads none of the cells written in epoch $j$. It can
read any cells from other epochs; we will give it all cells from all
other epochs for free. We argue that with constant probability, the
algorithm does not have enough information to determine the correct
answer. The cells that were written in an epoch $k > j$ cannot be
useful, since they know nothing about the state of level $j$ (which
was updated in the future). The total number of cells written in
epochs $k < j$ is at most
\[
t_u(1 + B + B^2 + \cdots + B^{j-1}) = \frac{B^{j} - 1}{B-1}t_u
= t_u \cdot O(B^{j-1})
\]
Since each cell has $\Theta(\log n)$ bits, these epochs contain a
total of $O(t_u B^{j-1} \log n)$ bits of information. Now, the total
amount of information revealed by the updates in epoch $j$ is $B^j$
bits: the state of the mark bits on level $j$. Thus, by our choice of
$B$, the information the query algorithm possesses about these bits is
only $O(B^j / \log n) = o(B^j)$; it follows that at least a constant
fraction of the mark bits at nodes in level $j$ cannot be known by the
query algorithm.
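Spelling out the choice of $B$ from the theorem statement,
$B \ge t_u \log^2 n$ gives
\[
t_u B^{j-1} \log n \;\le\; \frac{B}{\log^2 n} \cdot B^{j-1} \log n
  \;=\; \frac{B^j}{\log n} \;=\; o(B^j),
\]
so the cells from epochs $k < j$ indeed carry asymptotically fewer
bits than the $B^j$ random mark bits set during epoch $j$.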
Now, we need to be able to compute the number of marked nodes on the
path from our random query element to the root. Thus, we must know
the value of a random node at level $j$ (since the level-$j$ ancestor
of a random node is random). Thus, with at least a constant
probability, the query algorithm must read some cell from epoch $j$.
We now sketch how to formalize this proof idea, though we do not delve
into the real combinatorial analysis.

\begin{proof}[Sketch]
Let $R_j$ be the set of cells written during epoch $j$, and let
$R_{<j}$ be the set of cells written during the later epochs $k < j$.
For an instance $u$, we construct a \emph{fooling set} $F_j(u)$: a set
of instances that agree with $u$ on all levels except level $j$, but
for which the answer differs on a constant fraction of the query
leaves. Let $\overline{R_j}(u)$ denote the set of cells written during
epoch $j$ in the execution of $u$ or of some alternative
$w \in F_j(u)$.

Consider distinguishing $u$ from some $w \in F_j(u)$. The query cannot
learn anything from cells written during epochs $k > j$, because these
epochs occurred in the past, and the first difference between $u$ and
$w$ happened on level $j$. Thus, $u$ and $w$ can only be distinguished
by looking at $R_j$.
Now the proof proceeds as follows:
\begin{enumerate}
\item As shown above, $|R_{<j}| = t_u \cdot O(B^{j-1})$, so with
  constant probability the cells written after epoch $j$ do not
  contain enough information to distinguish $u$ from every
  $w \in F_j(u)$.
\item By assumption, the query algorithm reads no cell from
  $\overline{R_j}(u)$, so it sees nothing that was last written during
  epoch $j$.
\item Finally, consider a cell read by the query that was last written
  during some epoch $m$ in the execution of $u$. Suppose that this
  cell is in $\overline{R_m}(w)$. We have just shown that $m \neq j$,
  and further that $m$ is not less than $j$ (the first two cases),
  hence $m>j$. Thus, the cell must have been written before epoch $j$
  for both $u$ and $w$. However, since $u$ and $w$ only differ on the
  $j$-th level, the values in this cell for the two instances must be
  equal.
\end{enumerate}
Thus, if the query needs to distinguish between $u$ and some $w \in
F_j(u)$, it must read some element of $\overline{R_j}(u)$. Since
picking a random leaf corresponds on the $j$-th level to picking a
random node, and there exists a fooling $w$ for a constant fraction of
the nodes on the $j$-th level, this implies that we must read a cell
in $\overline{R_j}(u)$ with probability $\Omega(1)$. The lower bound
follows.
The only loose end is why $\overline{R_j}(u)$ really is small, as we
assumed at the beginning. The idea is that $\overline{R_j}(u)$
includes cells written during only a few alternative computation
histories (namely, the ones in $F_j(u)$ and $u$ itself). Because we
could find very small fooling sets, this is just $O(\lg n)$ times more
cells than $R_j$.
\end{proof}
Now we can explain why we needed to introduce the $\overline{R_j}$'s.
Consider some $w$, an alternative to $u$ on the $j$-th level. It is
possible that $u$ might write some cell in epoch $j+3$, say, that $w$
also writes. But $w$ might rewrite the same cell in epoch $j$. The
problem here is that $R_j$ does not contain this overwritten cell ($u$
doesn't rewrite the cell, only $w$ does). The query algorithm could
detect that the write occurred by attempting to read the value written
in epoch $j+3$. Using the larger sets $\overline{R_j}(u)$ avoids this
complication: we are including cells written not only by $u$, but also
by some alternatives to $u$. Since the fooling sets are small, using
these larger sets $\overline{R_j}$ is not expensive, so we can still
obtain the tight bound.
\begin{thebibliography}{77}
\bibitem{AHR}
Stephen Alstrup, Thore Husfeldt, Theis Rauhe:
\emph{Marked Ancestor Problems},
Proc. 39th Annual Symposium on Foundations of Computer Science (FOCS),
p. 534, 1998.
\bibitem{FS}
Michael L. Fredman and Michael E. Saks.
\emph{The cell probe complexity of dynamic data structures},
Proc. 21st ACM Symposium on Theory of Computing (STOC), pp. 345--354,
1989.
\end{thebibliography}
\end{document}