6.897: Advanced Data Structures (Spring 2003)
[Home]
[General motivation]
[Requirements]
[Projects]
[Scribe notes]
[Erik's notes]
[Email archive]
[Accessibility]
Projects
See requirements for details on the goal,
scope of topics, allowed collaboration, and format of projects.
You do not have to be taking the class for credit to work on a project
or the open problems posed here; just let me know that you are.
Step 1: Choosing a problem.
The first step in your project is picking the topic and the specific problem to
work on. This is both difficult and important; you want to choose a topic
that's interesting (both to you and some community), not too easy,
not too difficult, fun, not yet solved, etc.
Expect to spend some time finding a good problem to work on.
Of course, I'm here to help think up ideas,
discuss whether an idea is appropriate, etc.
This page is mainly to help with this first step. Specifically, it contains
- specific project ideas,
- broader sources for project ideas, and
- what other students are working on.
Step 2: Solving your problem.
Once you've picked a problem, tell it to me.
Then the second step is to solve it. This may involve
- thinking (defining the model, proving theorems, etc.),
- writing (making sure what you're thinking works out),
- reading related papers (to try to cull some useful techniques),
- coding (for an implementation project),
- experimenting (to learn interesting properties about your
implementation), and
- talking to people (collaboration).
I want to stress this last
point--collaboration--because it is often essential to research in
theoretical computer science. You are allowed to collaborate with
anyone: students taking the class for credit, students listening to the class,
students not in the class, students not at MIT, faculty (including me), etc.
The only constraint for the class is that your own contribution should be
substantial enough, both in terms of solving problems and writing it up.
To evaluate "substantial enough", you should talk to me.
Project Ideas
Here are some random ideas for data structures open problems to work on.
I'll be adding to this list as I think up more open problems.
- Higher-dimensional advanced data structures.
There is a wealth of problems that come up at the intersection of the
advanced data structures we are covering in class
(e.g., fusion trees, van Emde Boas) and computational geometry.
2D is already very interesting.
Beware, some of these problems have been solved already;
before pursuing a problem, check out the references below.
Particular open problems:
References:
- Dan E. Willard,
"Examining
Computational Geometry, Van Emde Boas Trees, and Hashing from the
Perspective of the Fusion Tree," SIAM Journal on Computing,
29(3):1030-1049, 2000.
- Mark H. Overmars,
"Computational geometry on a grid: an overview,"
Technical Report RUU-CS-87-4, Utrecht University, February 1987.
- R. G. Karlsson and J. I. Munro, "Proximity on a grid,"
in Proceedings of the 2nd Symposium on Theoretical Aspects of
Computer Science, LNCS 182, 1985, pages 187-196.
- Rolf G. Karlsson, "Algorithms in a restricted universe,"
Ph.D. thesis, Technical Report CS-84-50, Department of Computer
Science, University of Waterloo, 1984.
- John Iacono
has various papers on point location with integer coordinates.
- RAMBO data structures.
[Suggested by Ian Eslick.]
There are currently only two data structures designed for the RAMBO model
that I know of: the constant-time van Emde Boas / fixed-universe successor
structure we covered in class [1] and the original paper that introduced
the RAMBO model [2]. Take your favorite pointer machine or RAM data
structure and try to make it faster on a RAMBO. Such work could lead to
a nice repertoire of RAMBO data structures that could make the case for
including RAMBO hardware in standard machines.
Interested students: Christopher Collier, Loizos Michael and Christos Kapoutsis, Mihai Patrascu, Sonia Garg and Jonathan Brunsman
References:
- Andrej Brodnik, Svante Carlsson, Johan Karlsson, and J. Ian Munro,
"Worst case constant time priority queue,"
in Proceedings of the 12th Annual ACM-SIAM Symposium on Discrete
Algorithms, pages 523-528, Washington, DC, January 2001.
- M. L. Fredman and M. E. Saks, "The cell probe complexity of dynamic
data structures," in Proceedings of the 21st ACM Symposium on
Theory of Computing, pages 345-354, 1989.
- Sorting faster.
Develop a sorting algorithm in the transdichotomous RAM
that runs in o(n sqrt(lg lg n)) time,
or equivalently [1], develop a priority queue in the transdichitomous RAM
that uses o(sqrt(lg lg n)) amortized
per operation. For an O(sqrt(lg lg n)) solution
(the current best), see [2].
References:
- Mikkel Thorup, "Equivalence between Priority Queues and Sorting,"
in Proceedings of the 43rd Symposium on Foundations of Computer
Science, November 2002, Vancouver, Canada, pages 125-134.
- Yijie Han and Mikkel Thorup, "Integer Sorting in O(n sqrt (log log n)) Expected Time and Linear Space,"
in Proceedings of the 43rd Symposium on Foundations of Computer
Science, November 2002, Vancouver, Canada, pages 135-144.
- Randomized self-organizing linear search.
What is the best competitive randomized strategy for maintaining a list
of elements subject to an online sequence of (linear, comparison-based)
searches? (Here we consider the Sleator and Tarjan cost model.)
The best algorithm so far is 1.6-competitive [1], while the
best lower bound says that no algorithm is less than 1.5-competitive [2].
Good surveys on this problem are [3, 4].
It's also open whether the simple "Bit" algorithm is better than
1.75-competitive.
Interested students: David Liben-Nowell
References:
- S. Albers, B. von Stengel and R. Werchner,
"A combined BIT and TIMESTAMP algorithm for the list update problem,"
Information Processing Letters, 56:135-139, 1995.
- B. Teia, "A lower bound for randomized list update algorithms,"
Information Processing Letters, 47:5-9, 1993.
- S. Albers and J. Westbrook,
"Self-organizing data structures,"
in Online Algorithms: The State of the Art,
LNCS 1442, pages 31-51, 1998.
- Allan Borodin and Ran El-Yaniv,
"Online Computation and Competitive Analysis,"
Cambridge University Press, 1998,
particularly chapter 2.
- Offline self-organizing linear search.
Can the competitive ratio of Munro's offline Order-By-Next-Request
algorithm [1] be improved?
The version we studied in class, in which blocks were sized at
powers of 2, has competitive ratio at most 4.
This ratio can be improved to at most 2.6641... by using blocks
sized at powers of 4.24429..., as mentioned in [1].
Can you improve it further by other techniques, perhaps increasing the
actual running time?
In particular, can you obtain a polynomial-time optimal offline
algorithm, with a competitive ratio of 1?
For the Sleator and Tarjan problem, this last problem--finding an optimal
offline solution--was posed in [2] but then proved NP-hard in [3].
However, it remains open whether better competitive ratios can be
determined by polynomial-time offline algorithms
(better than 1.6 from the randomized strategies in the previous problem).
Interested students: Ilya Baran, Patrick Lam, David Liben-Nowell
References:
- J. Ian Munro, "On the Competitiveness of Linear Search,"
in Proceedings of the 8th Annual European Symposium on Algorithms,
LNCS 1879, Saarbrücken, Germany, September 2000, pages 338-345.
- Allan Borodin and Ran El-Yaniv,
"Online Computation and Competitive Analysis,"
Cambridge University Press, 1998,
particularly chapter 2.
- Christoph Ambuehl, "Offline List Update is NP-hard,"
in Proceedings of the 8th Annual European Symposium,
LNCS 1879, Saarbrücken, Germany, September 2000, pages 42-51.
- Optimal binary search trees.
Compute the optimal binary search tree for a given stochastic distribution
of ordered elements in o(n2) time,
or polynomial time and o(n2) space.
(Or perhaps prove that the problem is 3SUM-hard.)
The best algorithm for this problem runs in
Theta(n2) time and space [1].
I am not sure what is known when we allow multiplicative or additive error
in optimality.
References:
- Donald E. Knuth, "Optimum binary search trees,"
Acta Informatica, 1:14-25, 1971.
- Splay trees.
There are lots of problems about or related to splay trees [1],
the first of which is old and hard,
and the others of which are much newer.
- Prove that splay trees have the dynamic optimality property
(or make progress in this direction) [1].
- Prove that Munro's offline tree strategy is the dynamic offline
optimal [2]. (This may serve as a useful tool for evaluating the
dynamic optimality of splay trees.)
- Make Iacono's unified structure dynamic [3].
Interested students: Mihai Badoiu
- Prove that splay trees have Iacono's unified property [3].
- Is there a dynamic search tree (even offline, like Munro's [2])
that satisfies the unified property [3]?
- Is Wilber's lower bound [4] within a constant factor of the
offline dynamic optimal?
For fun, here is a cool
splay tree applet.
References:
- Daniel Dominic Sleator and Robert Endre Tarjan,
"Self-adjusting binary search trees,"
Journal of the ACM, 32(3):652-686, July 1985.
- J. Ian Munro, "On the Competitiveness of Linear Search,"
in Proceedings of the 8th Annual European Symposium on Algorithms,
LNCS 1879, Saarbrücken, Germany, September 2000, pages 338-345.
- John Iacono, "Alternatives to splay trees with O(log n) worst-case access times,"
in Proceedings of the 12th Annual Symposium on Discrete Algorithms, pages 516-522, 2001.
- R. Wilber, "Lower bounds for accessing binary search trees with rotation,"
in Proceedings of the 27th Symposium on Foundations of Computer
Science, pages 61-69, 1986.
- Implicit search trees.
What is the fastest possible implicit search tree structure
supporting Insert, Delete, Successor, and Predecessor?
An implicit data structure uses exactly n space
where n is the number of elements; in other words, the data
structure stores the elements and nothing else. (In this sense,
the extra information in the data structure really is implicit.)
The only flexibility in an implicit data structure is the order in
which the elements are stored, which you can use to encode pointers, etc.
The 20-year-old implicit search tree with running time
O(lg2 n) per operation [1]
has been improved twice in the past couple of years:
first to
O(lg2 n / lg lg n)
time per operation [2],
then to
O(lg n lg lg n)
time per operation [3].
Is O(lg n) time per operation possible?
What other interesting data structures can be made implicit?
For example, what about something based on van Emde Boas?
References:
- J. Ian Munro, "An implicit data structure supporting insertion,
deletion, and search in O(log2 n)
time," Journal of Computer and System Sciences,
33(1):66-74, 1986.
- Gianni Franceschini, Roberto Grossi, J. Ian Munro, and Linda Pagli,
"Implicit B-Trees: New Results for the Dictionary Problem,"
Proceedings of the 43rd Annual IEEE Symposium on Foundations of
Computer Science, Vancouver, Canada, November 2002.
A longer version is on the web.
- Gianni Franceschini and Roberto Grossi,
"Implicit Dictionaries Supporting Searches and Amortized Updates in O(log n log log n) Time,"
in Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms,
Baltimore, Maryland, January 2003.
- Substring search data structures (text indexing).
Preprocess a text string T
using O(|T|) bits of space,
to support searching for a substring P in O(|P|) time
(or better). Here a "search" must return a location of the substring.
Such a bound would match a known lower bound [1].
As far as I know, the problem is interesting even in the special case that
every query pattern has length |P| = lg |T|,
occurs, and occurs uniquely, and the set of all occurrences of all
patterns form disjoint words. (This is the situation to which the
lower bound [1] applies.) Close upper bounds that don't quite qualify
are referenced within [1].
References:
- Erik D. Demaine and Alejandro López-Ortiz,
"A Linear Lower Bound on Index Size for Text Retrieval,"
Journal of Algorithms, to appear.
- Ordered file maintenance.
Prove that ordered file maintainence requires
Omega(lg2 n) amortized time per operation,
or find a better algorithm.
See [1] and [2] for the O(lg2 n) amortized
algorithm, and [3] for a lower bound in a weak model.
References:
- A. Itai, A. G. Konheim, and M. Rodeh,
"A sparse table implementation of priority queues,"
in Proceedings of the 8th Colloquium on Automata, Languages, and
Programming, LNCS 115, pages 417-431, Acre (Akko), Israel, July
1981.
- Section 2.3 on "packed-memory structures" in
Cache-Oblivious B-Trees
(full version of FOCS 2000 paper).
- Paul F. Dietz, Joel I. Seiferas, and Ju Zhang,
"A Tight Lower Bound for On-line Monotonic List Labeling,"
in Proceedings of the 4th Scandinavian Workshop on Algorithm
Theory, LNCS 824, Aarhus, Denmark, July 1994, pages 131-142.
Sources for Project Ideas
One good general source is to look at recent papers about data structures,
and look at the conclusion (or elsewhere) for the open problems.
Another possibility is to look at an algorithms paper with a high running time
and think about whether you could make it go faster using some more
sophisticated data structures.
To find such papers in algorithms and data structures, here are some general
methods:
Students' Projects
As students choose projects, their titles will be listed here,
to avoid overlap, and facilitate the possibility of collaboration.
Please let me know if there are any inaccuracies (unlisted projects,
mislisted projects).
- Amorphous data structures -- Jake Beal
- Faster planar point location, more RAMBO data structures, graph machines
-- Ian Eslick
- Post office (nearest neighbor) data structures -- Ray Jones
- Minimum spanning hypertrees -- Percy Liang
- Offline linear search -- Ilya Baran, Denis Cebikins, Lev Teytelman
- Offline linear search in the Munro model -- Patrick Lam
- Distributed data structures -- Ben Leong
- Distributed data structures -- Rui Fan and Sayan Mitra
- RAMBO data structures -- Loizos Michael and Christos Kapoutsis
- Proximate point location -- Jeff Lindy
- Smart exhaustive search to test conjectures -- Ilya Shlyakhter
- Partial sums, offline predecessor, and/or RAMBO data structures
-- Mihai Patrascu
- Financial data structures -- William Kwong
- Hashing faster in practice -- Vladimir Kirianski
- Text searching data structures -- Hans Robertson, Michael Walfish
- Randomized linear search and string matching
-- Alexander Andoni and Cristian Cadar
- Implementing cache-oblivious or implicit data structures -- Daniel Aguayo
- RAMBO data structures and planar point location
-- Sonia Garg and Jonathan Brunsman
- 2-dimensional successor queries -- Sid Sen and Shantonu Sen
- Higher-dimensional RAMBO data structures -- Chris Collier
- Faster planar point location -- Glenn Eguchi and Hana Kim
- XML data structures -- David Malan
- Substring searching in linear space and query time
-- Lance Anderson and Wayland Ni