6.897 Principles of Fault-Tolerant Distributed Computing/ Professor Nancy Lynch and Dr. Gregory Chockler/ MIT

Massachusetts Institute of Technology
6.897 Principles of Fault-Tolerant Distributed Computing, Fall Term 2004
Professor Nancy Lynch
Dr. Gregory Chockler

Announcements

October 7, 2004: Homework assignment 4 is available at .ps. Due October 13, 2004, at class.
September 23, 2004: Homework assignment 3 is available at .ps. Due September 29, 2004, at class.
September 23, 2004: The lecture 3 slides in the PDF format (2 per page): .pdf
September 15, 2004: The lecture 2 slides in the PDF format (2 per page): .pdf
September 15, 2004: The original paper on the layering technique used for the number of rounds proofs is on the web.
September 15, 2004: Homework assignment 2 is available at .ps. Due September 22, 2004, at class.
September 8, 2004: Please subscribe to the course mailing list ASAP. To subscribe, send an email to grishac@csail.mit.edu.
September 8, 2004: Homework assignment 1: .ps. Due September 15, 2004, at class.

Course Homepage

Fall, 2004
Time: Wednesday, 1:00-2:30pm
Room: 26-168
Credits: 6 (2-0-4)
Level: Grad H

Enrollment and prerequisites: The course is intended for graduate students interested in the area of distributed fault-tolerance. It will provide a quick introduction to the fundamental concepts as well as the current state-of-the-art and open questions. Although the course is intended to be self-contained, some background in distributed systems and algorithms (e.g., 6.852) will be helpful. Students should have a strong background in mathematics, and should be familiar with basic algorithms (e.g., 6.046) and computer systems (e.g., 6.033).

Taught by:

Prof. Nancy Lynch, 32-G688, lynch AT csail.mit.edu
Dr. Gregory Chockler, 32-G696, 3-9302, grishac AT csail.mit.edu

Course assistant: Joanne Talbot Hanley, 32-G672A, 3-6054

Students mailing lists: 6897-students@theory, 6.897-students@theory

Schedule, handouts, papers, and notes

Course Goals and Summary:

Fault-tolerance is one of the most established yet still actively researched subject areas of distributed computing. Interest in the subject is motivated by the ever-growing popularity of distributed systems, where robustness is a major concern due to the inherent vulnerability to component failures and malicious attacks. This course aims to introduce students to the principles of fault-tolerance in distributed systems, covering the current state of the art and providing a glimpse into the research frontiers.

To make the exposition self-contained, we will start by reviewing fundamental concepts and classical results in distributed fault-tolerance. We will cover computation and failure models as well as basic algorithms and impossibility results. Wherever possible, the material will be presented from a modern perspective, giving an up-to-date outlook on the classical problems. The second part will deal with recent advances, current research, and open problems in the field. The topics to be discussed will include (but will not necessarily be limited to) approaches to circumventing impossibility results, computing with unreliable storage, and fault-tolerance in dynamic systems.

The course will combine lectures (the first part) with student presentations (the second part). The first part might be accompanied by a few theoretical exercises whose objective will be to reinforce the material studied in class. There will also be an opportunity to do a practical project or work on an open problem.