6

6.338 Project Proposal:

Parallelizing Incremental Bayesian Segmentation (IBS)

Joseph Hastings <jrh@mit.edu>

(in collaboration with Siddhartha Sen <sidsen@mit.edu>)

Background

Incremental Bayesian Segmentation [1] is an on-line machine learning algorithm designed to segment time-series data into a set of distinct clusters. It models the time-series as the concatenation of processes, each generated by a distinct Markov probability distribution, and attempts to find the most-likely break points between the processes. The algorithm was designed to be a computationally cheaper means of segmenting data than the popular Hidden Markov Model algorithm [2]. During the training phase of the algorithm, it builds a set of Markov matrices that it believes are most likely to describe the set of processes responsible for generating the time series. It was developed by Dr. Marco Ramoni of Harvard in 1998 for the purpose of evaluating medical time-series data such as EKG readings, and detecting various abnormal patterns that might indicate a medical emergency. Currently, Joseph is attempting to use the same basic algorithm to detect computer networking abnormalities for his M. Eng thesis.

Proposal

Underlying most of the computations of the IBS algorithm are matrix calculations that we believe can be re-written to work in parallel. In particular, the algorithm does many entropy and relative entropy calculations as well as generating the marginal likelihood that a particular sequence of transitions would be observed given a Markov probability distribution. All of these operations, in principle, could be performed in parallel, except that they depend upon knowledge of the time-series itself. For this 6.338 final project we would like to investigate various methods of parallelizing this process. First, we will port the algorithm from it’s current state (java, LISP, and Perl) to C++ and use MPI to parallelize the relevant matrix operations. As an alternative method of parallelization, we will also rewrite the code using Cilk, a language originally developed by the Supercomputing Technologies Group at the MIT Laboratory for Computer Science (which Sid is currently working on). Cilk is a language for multithreaded parallel programming based on ANSI C that is very effective for exploiting highly asynchronous parallelism [3], which can be difficult to write using message-passing interfaces like MPI. The current release of Cilk (version 5.3.2) includes the Cilk runtime system and compiler, which together provide an effective platform for programming both dense and sparse numerical algorithms.

The Cilk language itself is very simple and easy to use. In fact, after we have parallelized the appropriate sections of code, converting the resulting C++ program to Cilk will require no more than the addition of several keywords. The real intelligence is found in the Cilk runtime system, which handles issues like load balancing, paging, and communication protocols between running threads. Currently, the runtime system has to be told how many processors to run a Cilk program on; we plan to make the runtime system adaptively parallel by intelligently determining how many threads to create and how many processors to distribute them over (as well as how to perform the actual distribution). Then, we will be able to run our code using both the released version of Cilk and our modified version, and compare these results to those obtained from the MPI method of parallelization.

The work that will be done for this class project will only benefit Joseph’s thesis in the sense that if it is successful, he will be able to run the training process in less time than in currently takes; for instance, in porting the algorithm from its current state to C++ / MPI, it will probably be made faster simply by using C++ instead of the slower interpretted languages. The work done on Cilk will augment UROP work currently being done by Sid, but should not interfere with his research otherwise.

References

[1] Paola Sebastiani and Marco Ramoni. Incremental Bayesian Segmentation of Categorical Temporal Data. 2000.

[2] Wenke Lee and Salvatore J. Stolfo. Data Mining Approaches for Intrusion Detection. 1998.

[3] Cilk 5.3.2 Reference Manual. Supercomputing Technologies Group, MIT Lab for Computer Science. November 9, 2001. Available online: http://supertech.lcs.mit.edu/manual-5.3.2.pdf.