\documentclass[11pt]{article}
\usepackage{latexsym}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage{epsfig}
\usepackage{color}
%\usepackage{psfig}
\newcommand{\handout}[5]{
\noindent
\begin{center}
\framebox{
\vbox{
\hbox to 5.78in { {\bf 6.897: Advanced Data Structures } \hfill #2 }
\vspace{4mm}
\hbox to 5.78in { {\Large \hfill #5 \hfill} }
\vspace{2mm}
\hbox to 5.78in { {\em #3 \hfill #4} }
}
}
\end{center}
\vspace*{4mm}
}
\let\epsilon=\varepsilon
\newcommand{\lecture}[4]{\handout{#1}{#2}{#3}{Scribe: #4}{Lecture #1}}
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{observation}[theorem]{Observation}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{claim}[theorem]{Claim}
\newtheorem{fact}[theorem]{Fact}
\newtheorem{assumption}[theorem]{Assumption}
% 1-inch margins, from fullpage.sty by H.Partl, Version 2, Dec. 15, 1988.
\topmargin 0pt
\advance \topmargin by -\headheight
\advance \topmargin by -\headsep
\textheight 8.9in
\oddsidemargin 0pt
\evensidemargin \oddsidemargin
\marginparwidth 0.5in
\textwidth 6.5in
\parindent 0in
\parskip 1.5ex
%\renewcommand{\baselinestretch}{1.25}
\newcommand{\floor}[1]{\lfloor #1\rfloor}
\begin{document}
\lecture{2 --- February 3, 2005}{Spring 2005}
{Lecturer: Mihai P\v{a}tra\c{s}cu}{Catherine Miller}
\section{Overview}
In the last lecture we saw how to achieve static dictionaries with
$O(1)$ worst-case query time, $O(n)$ worst-case space, and $O(n)$
expected construction time (the FKS scheme). We also discussed Cuckoo
Hashing, which in addition achieves $O(1)$ expected updates.
In this lecture we discuss how various parts of this result can be
improved: the query time, the space, and the use of randomness during
construction.
\section{Improvements}
\subsection{Query Time}
We begin with a brief discussion of query time. The optimal time for
a query is 3 cell probes. Two of these probes must be adaptive, but
they may be independent, i.e.~two locations may be probed at the same
time without either needing the other's return value.
\subsection{Space}
We first consider the membership problem, and how to improve our
$O(n)$ space constraint. In a membership problem we are only
concerned with deciding whether or not a given element $x$ is a member
of a set $S$. The optimal space needed to represent a subset of size
$n$ of $U$ is $\lg {U \choose n}$. It is known how to achieve space
$\lg {U \choose n}$ plus a lower-order term, with $O(1)$ worst-case
query time -- a succinct membership structure. We will talk about
succinct data structures later in the term.
\paragraph{Bloom Filters.}
Here we will attempt to solve the membership problem, but allow a
small number of false positives. That is to say that we allow a small
probability of mistaking an element which is not in the set for one in
the set, but allow no margin of error for the inverse mistake.
The optimal space is $n\lg \frac{1}{\epsilon}$. This bound, plus a
lower-order term, was achieved with a worst-case query time of $O(1)$
\cite{pagh05}. We will instead discuss the classic Bloom filter, due
to Bloom, which uses space $O(n \lg \frac{1}{\epsilon})$ and query
time $O(\lg \frac{1}{\epsilon})$.
We start with a table of $2n$ bits, all initialized to 0, and a hash
function chosen from a universal family. Each $x \in S$ is hashed to
some location in this table, and a 1 is written at that location. Any
time we query this table with an $x$ which is in $S$, we will
definitely find a 1, and thus identify $x$ as a member of $S$. When
we query some $x$ which is not in $S$, a collision will result in a
false positive with some probability. The expected number of such
collisions is $n \cdot \frac{1}{2n} = \frac{1}{2}$, so using a
standard Markov bound the probability of a false positive is $\le
1/2$.
\begin{center}
\scalebox{0.5}{\input{scribe1.pstex_t}}
\end{center}
If one such filter gives a false positive with probability $\le 1/2$,
all we need do to get a better bound is chain many such filters
together (with independent hash functions), and query each of them on
the element $x$. If we chain $\lg 1/\epsilon$ such filters, a false
positive requires a collision in every filter, so it occurs with
probability $(1/2)^{\lg 1/\epsilon} = \epsilon$, using $2n \lg
1/\epsilon$ bits of space and with a query time $O(\lg 1/\epsilon)$.
\begin{center}
\scalebox{0.5}{\input{scribe2.pstex_t}}
\end{center}
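As a concrete illustration (not part of the construction above), the
chained filter can be sketched in a few lines of Python. The
multiplicative hash below is an assumption standing in for a hash
function drawn from a universal family:

```python
import math
import random

class ChainedBloomFilter:
    """Chain k = ceil(lg 1/eps) filters of 2n bits each; an element
    passes only if every filter reports a 1, so a false positive must
    collide in every table: probability <= (1/2)^k <= eps."""

    def __init__(self, items, eps):
        n = len(items)
        self.m = 2 * n                       # bits per filter
        self.k = max(1, math.ceil(math.log2(1 / eps)))
        # one random odd multiplier per filter (illustrative stand-in
        # for a function drawn from a universal family)
        self.seeds = [random.getrandbits(64) | 1 for _ in range(self.k)]
        self.tables = [bytearray(self.m) for _ in range(self.k)]
        for x in items:
            for t, s in zip(self.tables, self.seeds):
                t[self._h(x, s)] = 1         # write a 1 for each member

    def _h(self, x, seed):
        return (x * seed >> 13) % self.m

    def query(self, x):
        # members always hit a 1 in every filter: no false negatives
        return all(t[self._h(x, s)] == 1
                   for t, s in zip(self.tables, self.seeds))
```

A query touches one bit per filter, matching the $O(\lg
\frac{1}{\epsilon})$ query time, and members are never rejected
because they set a 1 in every table.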
\paragraph{Lossy Dictionaries.}
Of course, there is no particular reason to discriminate against false
negatives while allowing such liberty to false positives. If we now
allow false negatives with probability $\gamma$ and false positives
with probability $\epsilon$, we have a lossy dictionary. By
discarding a random $\gamma$-fraction of the elements before hashing,
the space can be improved to $(1 - \gamma)\, n \lg
\frac{1}{\epsilon}$. Interestingly, this is the best one can do
\cite{pagh01}.
\paragraph{Bloomier Filters.}
We now switch to improving the space for the dictionary problem. A
search using a Bloomier filter has the following behavior:
\begin{itemize}
\item if $x \in S \rightarrow$ return the data associated with $x$
\item if $x \notin S \rightarrow$ don't care
\end{itemize}
If the element is in our dictionary we want to return the data
associated with it, but if it's not, we don't much care what the
behavior is. This is useful in many interesting applications, because
we are guaranteed that the queried elements are in the set.
In creating our data structures to support dictionary queries we will
use a new type of data structure: a perfect hash function.
\textbf{Perfect Hashing:} Given a set $S$, construct a small data
structure that can later be queried to assign an individual and unique
ID to any element of $S$. The IDs should come from a set of size
$O(n)$.
One example of this might be a business with one hundred employees who
each need an ID number. One scheme might be to assign each employee
an ID number based on his name. The trouble is that we need to store
the whole list of names, and look up the name in that list each time.
Instead, we might discover that assigning the first digit of the
number based on eye color, the next on height, and the next on weight
gives a unique number for each person. A perfect hash function would
then be just a description of this ID-assigning algorithm.
Perfect hashing must not be confused with universal hashing: universal
hashing attempts to emulate randomness (it is a study of randomness,
independent of any set $S$), whereas in perfect hashing we are given a
set and then asked to assign IDs; this is a data structure problem.
One can also think of a perfect hash function as the minimum-length
program (algorithm) that assigns unique IDs to elements of $S$, which
makes it a complexity-theoretic notion.
Optimal bounds are known both for static and dynamic perfect
hashing. In the static case, the optimal space is $\Theta(n + \lg\lg
U)$ bits. Note that this is much smaller than the space needed to
represent $S$. This shows that small formulas for assigning unique IDs
can always be found. In the dynamic case, an optimal bound of
$\Theta(n \lg\lg \frac{U}{n})$ was recently obtained
\cite{mpp05}. This may be considered an even more surprising result:
the data structure can assign IDs to a changing set, even if it never
knows what the set actually is (because it cannot remember it).
Using these results, it's easy to obtain a Bloomier filter. To support
a dictionary search in which each piece of our data has $r$ bits, we
make one table of size $O(nr)$ bits for our data, and use a perfect
hash function that tells us where in the array to forward queries for
elements which are in our set $S$. Note that there is no guarantee for
an element which is not in $S$. We obtain $\Theta(nr + \lg\lg U)$ bits
of space in the static case, and $\Theta(n(r + \lg\lg \frac{U}{n}))$
in the dynamic case.
\begin{center}
\scalebox{0.5}{\input{scribe3.pstex_t}}
\end{center}
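As a sanity check, this scheme can be sketched in Python. The sketch
is emphatically not the optimal construction cited above: it finds a
perfect hash function by retrying a random multiplicative hash until
it is injective on $S$, and it uses an FKS-style quadratic range
(injective with probability $> 1/2$ per try), whereas the real
constructions achieve range $O(n)$ and optimal description size:

```python
import random

def build_bloomier(data, trials=1000):
    """data: dict mapping integer keys (the set S) to values.
    Retry a random multiplicative hash until it is injective on S;
    the range |S|^2 is an FKS-style simplification."""
    n = len(data)
    m = max(1, n * n)
    for _ in range(trials):
        seed = random.getrandbits(64) | 1
        h = lambda k: (k * seed >> 17) % m
        if len({h(k) for k in data}) == n:   # perfect (injective) on S
            table = [0] * m
            for k, v in data.items():
                table[h(k)] = v              # forward data to its slot
            return seed, m, table
    raise RuntimeError("no injective seed found")

def bloomier_query(struct, x):
    seed, m, table = struct
    # correct for x in S; arbitrary answer ("don't care") otherwise
    return table[(x * seed >> 17) % m]
```

For $x \notin S$ the query simply returns whatever value sits in the
probed cell -- exactly the ``don't care'' behavior of a Bloomier
filter.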
\subsection{Determinism}
As theorists, we want to be rid of anything that smacks of randomness
-- like an $O(n)$ expected construction time. Hagerup, Miltersen, and
Pagh \cite{hmp01} showed a completely deterministic dictionary with
$O(n \lg n)$ construction, $O(n)$ space and $O(1)$ queries. A
remaining open problem is to get $O(n)$ deterministic construction
while maintaining the same bounds on space and queries.
Dynamic bounds were obtained in \cite{pagh00}: poly$(\lg n)$ updates
and poly$(\lg\lg n)$ queries. An unpublished manuscript of Sundar from
1993 claims a lower bound of $\Omega (\frac {\lg\lg U}{\lg\lg\lg
U})$, though the bound does not appear explicitly in the paper. This
result should be viewed with some skepticism, pending a better
understanding of the proofs.
In the rest of the lecture, we concentrate on the static result of
\cite{hmp01}.
\section{Static Deterministic Dictionaries}
We will focus on the $O(n \lg n)$ construction. This is achieved by
taking an arbitrary universe $U$ and reducing it to a polynomial
universe $(|U| = n^{O(1)})$ using error-correcting codes and bit
tricks. Furthermore, this part of the algorithm is nonuniform (in the
complexity theoretic sense). Then this universe is reduced to
$U=[n^2]$, and a solution for such quadratic universes is given. We
will omit the first part, and instead discuss the details of the last
two.
For each $x \in U$, write $x$ in binary, and split the bits into
letters of $\lg n$ bits each.
\begin{center}
\scalebox{0.5}{\input{scribe5.pstex_t}}
\end{center}
We use this to create a trie. This is a tree with a branching factor
of $n$, into which we insert all of our $x$'s, as root-to-leaf paths.
We will consider a node to be ``active'' if it is the ancestor of an
element $x \in S$. Thus there are $O(n)$ active nodes. To index into
this trie, we start at the root, which has an ID of 0, and read the
first letter of the query, perhaps ``a''. We then query a hash table
for $(0,a)$, which retrieves the ID of the corresponding child, if the
child is active. If the child corresponding to ``a'' is not active we
know our query is not in the set and can stop. If the node is active,
then we retrieve its ID, proceed to query the hash table with this ID
and the second letter, and so on. The total query time is constant,
because the height of the trie is constant: since $|U| = n^{O(1)}$,
each element consists of $O(1)$ letters.
\begin{center}
\scalebox{0.5}{\input{scribe4.pstex_t}}
\end{center}
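The traversal can be sketched as follows; a Python dict keyed by
(node ID, letter) pairs stands in for the static hash table of the
construction, and IDs are handed out in insertion order:

```python
def build_trie_index(S, letter_bits, num_letters):
    """Insert each x as a root-to-leaf path of letters.  Active nodes
    get consecutive small IDs; edges live in a dict keyed by
    (parent_id, letter), a stand-in for the static hash table."""
    edges = {}                 # (node_id, letter) -> child node_id
    next_id = 1                # the root has ID 0
    leaf_id = {}               # element -> ID of its leaf
    mask = (1 << letter_bits) - 1
    for x in S:
        node = 0
        for i in reversed(range(num_letters)):   # most significant first
            letter = (x >> (i * letter_bits)) & mask
            child = edges.get((node, letter))
            if child is None:
                child = next_id
                next_id += 1
                edges[(node, letter)] = child
            node = child
        leaf_id[x] = node
    return edges, leaf_id

def trie_lookup(edges, x, letter_bits, num_letters):
    """Probe (node_id, letter) pairs; None means x cannot be in S."""
    mask = (1 << letter_bits) - 1
    node = 0
    for i in reversed(range(num_letters)):
        node = edges.get((node, (x >> (i * letter_bits)) & mask))
        if node is None:       # inactive child: stop early
            return None
    return node
```

Each lookup makes one probe per letter, so a polynomial universe
gives a constant number of probes.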
Now that we've reduced our universe, we want to find a perfect hash
function $h: U = [n]^2 \rightarrow [2^r]$, $r = \lg n + O(1)$. We can
just expand the universe by a constant factor to $U = [2^r]^2$. We
interpret the universe as a $2^r \times 2^r$ grid of points, in which
we place each of our $n$ elements of $S$. These points need not form
any useful pattern; our goal is to get at most one point in each
column, so that each point can be uniquely identified by its column
number, and the row information becomes extraneous. To achieve this,
we will construct an array of
``rotation'' factors $\delta [i] \in [2^r]$. The intuitive idea is
that we displace each row $i$ by $\delta[i]$, in order to reduce the
number of elements which are on the same column. For reasons that will
become clear later, we choose {\bf xor} for computing displacements.
After a displacement, a point is transformed by $(x,y) \mapsto (y
\oplus \delta[x], x)$.
\begin{center}
\scalebox{0.5}{\input{scribe6.pstex_t}}
\end{center}
A fundamental result by Tarjan and Yao \cite{ty79} which makes this
scheme possible is that a double displacement suffices to cause each
element to have its own column.
\[
(x,y) \rightarrow (y \oplus \delta [x], x) \rightarrow (x \oplus
\delta'[y \oplus \delta [x]], y \oplus \delta [x])
\]
\begin{theorem}
Let $q$ be the number of collisions, i.e.~$q = \#\{\, \{(x_1, y_1),
(x_2, y_2)\} \subseteq S \mid x_1 = x_2 \,\}$. Then there exist
displacements $\delta[i]$ such that the number of collisions after
the displacement is $q' \le \min \{n, \floor{\frac{q}{2^{r-3}}}\,
n\}$.
\end{theorem}
Note that one displacement already reduces the number of collisions
to $q' \le n$. Now pick $r = \lg n + 4$. The second displacement then
gives $q'' \le \floor{\frac{n}{2^{\lg n + 1}}}\, n = 0$, so two
displacements suffice.
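The randomized construction from the proof below translates into code
almost line for line. In this Python sketch, a point is a pair whose
first coordinate indexes the displacement array; a round
XOR-displaces the second coordinate and swaps the pair, so that one
round's output collisions (pairs sharing a first coordinate) become
the next round's input collisions. The retry loop is exactly the
Markov-bound argument; the \texttt{Counter} is an illustrative
stand-in for a hash structure:

```python
import random
from collections import Counter

def displace_once(points, r, tries=10000):
    """One round of random xor-displacements: process the rows (groups
    sharing a first coordinate) in order of decreasing size, and for
    each row retry a random delta until the number of new collisions
    meets the Markov bound (success probability >= 1/2 per try).
    Returns the transformed points (y ^ delta[x], x)."""
    rows = {}
    for x, y in points:
        rows.setdefault(x, []).append(y)
    order = sorted(rows, key=lambda g: len(rows[g]), reverse=True)
    delta = {}
    placed = Counter()   # displaced values used by earlier rows
    prev = 0             # total size of the rows placed so far
    for x in order:
        ys = rows[x]
        bound = (2 * len(ys) * prev) >> r   # floor(2|S_i| prev / 2^r)
        for _ in range(tries):
            d = random.getrandbits(r)
            if sum(placed[y ^ d] for y in ys) <= bound:
                delta[x] = d
                break
        else:
            raise RuntimeError("retry budget exceeded")
        for y in ys:
            placed[y ^ delta[x]] += 1
        prev += len(ys)
    return [(y ^ delta[x], x) for x, y in points]

def collisions(points):
    """Number of pairs of points sharing a first coordinate."""
    c = Counter(x for x, _ in points)
    return sum(v * (v - 1) // 2 for v in c.values())
```

With $r = \lg n + 4$, one round brings the number of collisions down
to at most $n$, and a second round eliminates them entirely.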
\textbf{Proof:} Let $S_i$ be the set of elements in row $i$, and
reorder the rows so that $|S_1| \ge |S_2| \ge |S_3| \ge \cdots$. We
know that $q = \sum_i {|S_i| \choose 2}$. Let $\delta[i]$ be chosen
uniformly at random from $[2^r]$. The new collisions contributed by
row $i$ are the pairs $(u,v)$ with $u \in S_i$, $v \in S_j$ for some
$j < i$, such that $u \oplus \delta[i] = v \oplus \delta[j]$. Each
such pair collides with probability $2^{-r}$, so by linearity of
expectation the expected number of new collisions is $|S_i| (\sum_{j
< i} |S_j|) / 2^r$. By a Markov bound, with probability $\ge 1/2$ the
number of new collisions is $\le \floor{2 |S_i| (\sum_{j < i} |S_j|)
/ 2^r}$; we may add the floor because the number of collisions is
always an integer. Retrying each $\delta[i]$ until this bound holds,
we get $q' \le \sum_i \floor{2 |S_i| \sum_{j < i} |S_j| / 2^r}$. We
bound this sum in two ways:
\begin{itemize}
\item first, $q' \le \sum_i \floor{|S_i|\, n\, 2^{-r+1}} \le \sum_i
|S_i|\, n\, 2^{-r+1} = n^2\, 2^{-r+1} \le n$, since $r \ge \lg n + 1$;
\item second, $|S_i| \le |S_j|$ for $j \le i$ (by our ordering), so
$q' \le \sum_i \floor{2^{-r+1} \sum_{j \le i} |S_j|^2}$. If $|S_i| =
1$, the $i$-th term is at most $\floor{2 \cdot 1 \cdot n / 2^r} = 0$,
so such terms can be ignored. Otherwise $|S_j| \ge |S_i| > 1$ for all
$j \le i$, and then $|S_j|^2 \le 4 {|S_j| \choose 2}$, so $q' \le
\sum_i \floor{2^{-r+3} \sum_{j \le i} {|S_j| \choose 2}} \le \sum_i
\floor{2^{-r+3}\, q} = \floor{\frac{q}{2^{r-3}}}\, n$.
\end{itemize}
Together, the two bounds give $q' \le \min \{n,
\floor{\frac{q}{2^{r-3}}}\, n\}$, completing the proof. \qed