6.338 Project Proposal:
Parallelizing Incremental Bayesian
Segmentation (IBS)
Joseph
Hastings <jrh@mit.edu>
(in collaboration with Siddhartha Sen <sidsen@mit.edu>)
Background
Incremental Bayesian
Segmentation [1] is an on-line machine learning algorithm designed to segment
time-series data into a set of distinct clusters. It models the time-series as the
concatenation of processes, each generated by a distinct Markov probability distribution,
and attempts to find the most-likely break points between the processes. The algorithm was designed to be a
computationally cheaper means of segmenting data than the popular Hidden Markov
Model algorithm [2]. During the training
phase of the algorithm, it builds a set of Markov matrices that it believes are
most likely to describe the set of processes responsible for generating the
time series. It was developed by Dr.
Marco Ramoni of Harvard in 1998 for the purpose of evaluating medical time-series
data such as EKG readings, and detecting various abnormal patterns that might
indicate a medical emergency. Currently,
Joseph is attempting to use the same basic algorithm to detect computer
networking abnormalities for his M. Eng thesis.
Proposal
Underlying most of the
computations of the IBS algorithm are matrix calculations that we believe can be
re-written to work in parallel. In
particular, the algorithm does many entropy and relative entropy calculations as
well as generating the marginal likelihood that a particular sequence of
transitions would be observed given a Markov probability distribution. All of these operations, in principle, could
be performed in parallel, except that they depend upon knowledge of the
time-series itself. For this 6.338 final
project we would like to investigate various methods of parallelizing this
process. First, we will port the
algorithm from it’s current state (java, LISP, and
Perl) to C++ and use MPI to parallelize the relevant matrix operations. As an alternative method of parallelization,
we will also rewrite the code using Cilk, a language originally developed by
the Supercomputing Technologies Group at the MIT Laboratory for Computer
Science (which Sid is currently working on). Cilk is a language for multithreaded parallel
programming based on ANSI C that is very effective for exploiting highly
asynchronous parallelism [3], which can be difficult to write using
message-passing interfaces like MPI. The
current release of Cilk (version 5.3.2) includes the Cilk runtime system and
compiler, which together provide an effective platform for programming both dense
and sparse numerical algorithms.
The Cilk language itself is
very simple and easy to use. In fact, after we have parallelized the appropriate
sections of code, converting the resulting C++ program to Cilk will require no
more than the addition of several keywords. The real intelligence is found in
the Cilk runtime system, which handles issues like load balancing, paging, and
communication protocols between running threads. Currently, the runtime system
has to be told how many processors to run a Cilk program on; we plan to make the
runtime system adaptively parallel by intelligently determining how many
threads to create and how many processors to distribute them over (as well as
how to perform the actual distribution). Then, we will be able to run our code
using both the released version of Cilk and our modified version, and compare
these results to those obtained from the MPI method of parallelization.
The work that will be done for
this class project will only benefit Joseph’s thesis in the sense that if it is
successful, he will be able to run the training process in less time than in
currently takes; for instance, in porting the algorithm from its current state to
C++ / MPI, it will probably be made faster simply by using C++ instead of the
slower interpretted languages. The work done on Cilk will augment UROP work
currently being done by Sid, but should not interfere with his research otherwise.
References
[1] Paola
Sebastiani and Marco Ramoni. Incremental Bayesian Segmentation of Categorical Temporal Data.
2000.
[2] Wenke Lee and Salvatore
J. Stolfo. Data Mining Approaches for Intrusion Detection. 1998.
[3] Cilk 5.3.2 Reference
Manual. Supercomputing Technologies Group, MIT Lab for Computer Science.