6.897: Advanced Data Structures (Spring 2003)

Prof. Erik Demaine

[Home] [General motivation] [Requirements] [Projects] [Scribe notes] [Erik's notes] [Email archive] [Accessibility]

Projects

See requirements for details on the goal, scope of topics, allowed collaboration, and format of projects. You do not have to be taking the class for credit to work on a project or the open problems posed here; just let me know that you are.

Step 1: Choosing a problem. The first step in your project is picking the topic and the specific problem to work on. This is both difficult and important; you want to choose a topic that's interesting (both to you and some community), not too easy, not too difficult, fun, not yet solved, etc. Expect to spend some time finding a good problem to work on. Of course, I'm here to help think up ideas, discuss whether an idea is appropriate, etc.

This page is mainly to help with this first step. Specifically, it contains

specific project ideas,
broader sources for project ideas, and
what other students are working on.

Step 2: Solving your problem. Once you've picked a problem, tell it to me. Then the second step is to solve it. This may involve

thinking (defining the model, proving theorems, etc.),
writing (making sure what you're thinking works out),
reading related papers (to try to cull some useful techniques),
coding (for an implementation project),
experimenting (to learn interesting properties about your implementation), and
talking to people (collaboration).

I want to stress this last point--collaboration--because it is often essential to research in theoretical computer science. You are allowed to collaborate with anyone: students taking the class for credit, students listening to the class, students not in the class, students not at MIT, faculty (including me), etc. The only constraint for the class is that your own contribution should be substantial enough, both in terms of solving problems and writing it up. To evaluate "substantial enough", you should talk to me.

Project Ideas

Here are some random ideas for data structures open problems to work on. I'll be adding to this list as I think up more open problems.

Higher-dimensional advanced data structures. There is a wealth of problems that come up at the intersection of the advanced data structures we are covering in class (e.g., fusion trees, van Emde Boas) and computational geometry. 2D is already very interesting. Beware, some of these problems have been solved already; before pursuing a problem, check out the references below.
Particular open problems:
- Planar point location. [Posed by Erik Demaine and John Iacono, January 2003.] In general, you are given a planar map (embedded planar graph), which is static, and you want to preprocess it to support queries of the form "what region contains point (x, y)?" A simpler case that is equivalent (ignoring space for the moment) is that you are confined within a vertical strip, and the planar map simply divides the strip into trapezoids via noncrossing chords. Typical data structures for this problem need O(lg n) time. Can we do better, e.g., with fusion trees?
  Interested students: Ian Eslick, Jeff Lindy, Ray Jones, Alan Leung
References:
1. Dan E. Willard, "Examining Computational Geometry, Van Emde Boas Trees, and Hashing from the Perspective of the Fusion Tree," SIAM Journal on Computing, 29(3):1030-1049, 2000.
2. Mark H. Overmars, "Computational geometry on a grid: an overview," Technical Report RUU-CS-87-4, Utrecht University, February 1987.
3. R. G. Karlsson and J. I. Munro, "Proximity on a grid," in Proceedings of the 2nd Symposium on Theoretical Aspects of Computer Science, LNCS 182, 1985, pages 187-196.
4. Rolf G. Karlsson, "Algorithms in a restricted universe," Ph.D. thesis, Technical Report CS-84-50, Department of Computer Science, University of Waterloo, 1984.
5. John Iacono has various papers on point location with integer coordinates.
RAMBO data structures. [Suggested by Ian Eslick.] There are currently only two data structures designed for the RAMBO model that I know of: the constant-time van Emde Boas / fixed-universe successor structure we covered in class [1] and the original paper that introduced the RAMBO model [2]. Take your favorite pointer machine or RAM data structure and try to make it faster on a RAMBO. Such work could lead to a nice repertoire of RAMBO data structures that could make the case for including RAMBO hardware in standard machines.
Interested students: Christopher Collier, Loizos Michael and Christos Kapoutsis, Mihai Patrascu, Sonia Garg and Jonathan Brunsman
References:
1. Andrej Brodnik, Svante Carlsson, Johan Karlsson, and J. Ian Munro, "Worst case constant time priority queue," in Proceedings of the 12th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 523-528, Washington, DC, January 2001.
2. M. L. Fredman and M. E. Saks, "The cell probe complexity of dynamic data structures," in Proceedings of the 21st ACM Symposium on Theory of Computing, pages 345-354, 1989.
Sorting faster. Develop a sorting algorithm in the transdichotomous RAM that runs in o(n sqrt(lg lg n)) time, or equivalently [1], develop a priority queue in the transdichitomous RAM that uses o(sqrt(lg lg n)) amortized per operation. For an O(sqrt(lg lg n)) solution (the current best), see [2].
References:
1. Mikkel Thorup, "Equivalence between Priority Queues and Sorting," in Proceedings of the 43rd Symposium on Foundations of Computer Science, November 2002, Vancouver, Canada, pages 125-134.
2. Yijie Han and Mikkel Thorup, "Integer Sorting in O(n sqrt (log log n)) Expected Time and Linear Space," in Proceedings of the 43rd Symposium on Foundations of Computer Science, November 2002, Vancouver, Canada, pages 135-144.
Randomized self-organizing linear search. What is the best competitive randomized strategy for maintaining a list of elements subject to an online sequence of (linear, comparison-based) searches? (Here we consider the Sleator and Tarjan cost model.) The best algorithm so far is 1.6-competitive [1], while the best lower bound says that no algorithm is less than 1.5-competitive [2]. Good surveys on this problem are [3, 4]. It's also open whether the simple "Bit" algorithm is better than 1.75-competitive.
Interested students: David Liben-Nowell
References:
1. S. Albers, B. von Stengel and R. Werchner, "A combined BIT and TIMESTAMP algorithm for the list update problem," Information Processing Letters, 56:135-139, 1995.
2. B. Teia, "A lower bound for randomized list update algorithms," Information Processing Letters, 47:5-9, 1993.
3. S. Albers and J. Westbrook, "Self-organizing data structures," in Online Algorithms: The State of the Art, LNCS 1442, pages 31-51, 1998.
4. Allan Borodin and Ran El-Yaniv, "Online Computation and Competitive Analysis," Cambridge University Press, 1998, particularly chapter 2.
Offline self-organizing linear search. Can the competitive ratio of Munro's offline Order-By-Next-Request algorithm [1] be improved? The version we studied in class, in which blocks were sized at powers of 2, has competitive ratio at most 4. This ratio can be improved to at most 2.6641... by using blocks sized at powers of 4.24429..., as mentioned in [1]. Can you improve it further by other techniques, perhaps increasing the actual running time? In particular, can you obtain a polynomial-time optimal offline algorithm, with a competitive ratio of 1?
For the Sleator and Tarjan problem, this last problem--finding an optimal offline solution--was posed in [2] but then proved NP-hard in [3]. However, it remains open whether better competitive ratios can be determined by polynomial-time offline algorithms (better than 1.6 from the randomized strategies in the previous problem).
Interested students: Ilya Baran, Patrick Lam, David Liben-Nowell
References:
1. J. Ian Munro, "On the Competitiveness of Linear Search," in Proceedings of the 8th Annual European Symposium on Algorithms, LNCS 1879, Saarbrücken, Germany, September 2000, pages 338-345.
2. Allan Borodin and Ran El-Yaniv, "Online Computation and Competitive Analysis," Cambridge University Press, 1998, particularly chapter 2.
3. Christoph Ambuehl, "Offline List Update is NP-hard," in Proceedings of the 8th Annual European Symposium, LNCS 1879, Saarbrücken, Germany, September 2000, pages 42-51.
Optimal binary search trees. Compute the optimal binary search tree for a given stochastic distribution of ordered elements in o(n²) time, or polynomial time and o(n²) space. (Or perhaps prove that the problem is 3SUM-hard.) The best algorithm for this problem runs in Theta(n²) time and space [1]. I am not sure what is known when we allow multiplicative or additive error in optimality.
References:
1. Donald E. Knuth, "Optimum binary search trees," Acta Informatica, 1:14-25, 1971.
Splay trees. There are lots of problems about or related to splay trees [1], the first of which is old and hard, and the others of which are much newer.
- Prove that splay trees have the dynamic optimality property (or make progress in this direction) [1].
- Prove that Munro's offline tree strategy is the dynamic offline optimal [2]. (This may serve as a useful tool for evaluating the dynamic optimality of splay trees.)
- Make Iacono's unified structure dynamic [3].
  Interested students: Mihai Badoiu
- Prove that splay trees have Iacono's unified property [3].
- Is there a dynamic search tree (even offline, like Munro's [2]) that satisfies the unified property [3]?
- Is Wilber's lower bound [4] within a constant factor of the offline dynamic optimal?
For fun, here is a cool splay tree applet.
References:
1. Daniel Dominic Sleator and Robert Endre Tarjan, "Self-adjusting binary search trees," Journal of the ACM, 32(3):652-686, July 1985.
2. J. Ian Munro, "On the Competitiveness of Linear Search," in Proceedings of the 8th Annual European Symposium on Algorithms, LNCS 1879, Saarbrücken, Germany, September 2000, pages 338-345.
3. John Iacono, "Alternatives to splay trees with O(log n) worst-case access times," in Proceedings of the 12th Annual Symposium on Discrete Algorithms, pages 516-522, 2001.
4. R. Wilber, "Lower bounds for accessing binary search trees with rotation," in Proceedings of the 27th Symposium on Foundations of Computer Science, pages 61-69, 1986.
Implicit search trees. What is the fastest possible implicit search tree structure supporting Insert, Delete, Successor, and Predecessor? An implicit data structure uses exactly n space where n is the number of elements; in other words, the data structure stores the elements and nothing else. (In this sense, the extra information in the data structure really is implicit.) The only flexibility in an implicit data structure is the order in which the elements are stored, which you can use to encode pointers, etc.
The 20-year-old implicit search tree with running time O(lg² n) per operation [1] has been improved twice in the past couple of years: first to O(lg² n / lg lg n) time per operation [2], then to O(lg n lg lg n) time per operation [3]. Is O(lg n) time per operation possible?
What other interesting data structures can be made implicit? For example, what about something based on van Emde Boas?
References:
1. J. Ian Munro, "An implicit data structure supporting insertion, deletion, and search in O(log² n) time," Journal of Computer and System Sciences, 33(1):66-74, 1986.
2. Gianni Franceschini, Roberto Grossi, J. Ian Munro, and Linda Pagli, "Implicit B-Trees: New Results for the Dictionary Problem," Proceedings of the 43rd Annual IEEE Symposium on Foundations of Computer Science, Vancouver, Canada, November 2002. A longer version is on the web.
3. Gianni Franceschini and Roberto Grossi, "Implicit Dictionaries Supporting Searches and Amortized Updates in O(log n log log n) Time," in Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms, Baltimore, Maryland, January 2003.
Substring search data structures (text indexing). Preprocess a text string T using O(|T|) bits of space, to support searching for a substring P in O(|P|) time (or better). Here a "search" must return a location of the substring.
Such a bound would match a known lower bound [1]. As far as I know, the problem is interesting even in the special case that every query pattern has length |P| = lg |T|, occurs, and occurs uniquely, and the set of all occurrences of all patterns form disjoint words. (This is the situation to which the lower bound [1] applies.) Close upper bounds that don't quite qualify are referenced within [1].
References:
1. Erik D. Demaine and Alejandro López-Ortiz, "A Linear Lower Bound on Index Size for Text Retrieval," Journal of Algorithms, to appear.
Ordered file maintenance. Prove that ordered file maintainence requires Omega(lg² n) amortized time per operation, or find a better algorithm. See [1] and [2] for the O(lg² n) amortized algorithm, and [3] for a lower bound in a weak model.
References:
1. A. Itai, A. G. Konheim, and M. Rodeh, "A sparse table implementation of priority queues," in Proceedings of the 8th Colloquium on Automata, Languages, and Programming, LNCS 115, pages 417-431, Acre (Akko), Israel, July 1981.
2. Section 2.3 on "packed-memory structures" in Cache-Oblivious B-Trees (full version of FOCS 2000 paper).
3. Paul F. Dietz, Joel I. Seiferas, and Ju Zhang, "A Tight Lower Bound for On-line Monotonic List Labeling," in Proceedings of the 4th Scandinavian Workshop on Algorithm Theory, LNCS 824, Aarhus, Denmark, July 1994, pages 131-142.

Sources for Project Ideas

One good general source is to look at recent papers about data structures, and look at the conclusion (or elsewhere) for the open problems. Another possibility is to look at an algorithms paper with a high running time and think about whether you could make it go faster using some more sophisticated data structures. To find such papers in algorithms and data structures, here are some general methods:

Web. Google, CiteSeer, arXiv, and DBLP are good resources for finding papers on the web.
Conferences. A lot of papers in data structures appear in the following conferences. However, be warned that many papers in these conferences will not be about data structures.
- ACM-SIAM Symposium on Discrete Algorithms (SODA)
- European Symposium on Algorithms (ESA), in Springer LNCS series
- Workshop on Algorithms and Data Structures (WADS), in Springer LNCS series
- Scandinavian Workshop on Algorithms and Complexity (SWAT), in Springer LNCS series
- International Symposium on Algorithms and Computation (ISAAC), in Springer LNCS series
- ACM Symposium on Theory of Computing (STOC)
- IEEE Symposium on Foundations of Computer Science (FOCS)
In terms of quality, SODA is strong and often highly related. FOCS and STOC are very strong, but have a broad scope. Other conferences listed above can have more mixed quality, but papers are also often quite good.
A lot of the proceedings of these conferences are available on the web, via the ACM Digital Library or more generally MIT's VERA which links to various electronic forms of conferences and journals.
Papers by Erik. Here are some papers by me related to data structures. If one of them interests you, let's talk about possible extensions and open problems along similar lines.

Students' Projects

As students choose projects, their titles will be listed here, to avoid overlap, and facilitate the possibility of collaboration. Please let me know if there are any inaccuracies (unlisted projects, mislisted projects).

Amorphous data structures -- Jake Beal
Faster planar point location, more RAMBO data structures, graph machines -- Ian Eslick
Post office (nearest neighbor) data structures -- Ray Jones
Minimum spanning hypertrees -- Percy Liang
Offline linear search -- Ilya Baran, Denis Cebikins, Lev Teytelman
Offline linear search in the Munro model -- Patrick Lam
Distributed data structures -- Ben Leong
Distributed data structures -- Rui Fan and Sayan Mitra
RAMBO data structures -- Loizos Michael and Christos Kapoutsis
Proximate point location -- Jeff Lindy
Smart exhaustive search to test conjectures -- Ilya Shlyakhter
Partial sums, offline predecessor, and/or RAMBO data structures -- Mihai Patrascu
Financial data structures -- William Kwong
Hashing faster in practice -- Vladimir Kirianski
Text searching data structures -- Hans Robertson, Michael Walfish
Randomized linear search and string matching -- Alexander Andoni and Cristian Cadar
Implementing cache-oblivious or implicit data structures -- Daniel Aguayo
RAMBO data structures and planar point location -- Sonia Garg and Jonathan Brunsman
2-dimensional successor queries -- Sid Sen and Shantonu Sen
Higher-dimensional RAMBO data structures -- Chris Collier
Faster planar point location -- Glenn Eguchi and Hana Kim
XML data structures -- David Malan
Substring searching in linear space and query time -- Lance Anderson and Wayland Ni