Progress – updated 16 April 2002

By Guo Liang, Huey Ting

Goal

The goal of our project is to use the distributed architecture built by Kevin and Kunal to experiment with ensembles of classifiers.

DATASET

We have downloaded the Reuters-21578 corpus from

http://www.research.att.com/~lewis/reuters21578.html

It contains over 7000 training texts and 3299 test texts spread across more than 100 categories. It is the test corpus used in the ACM SIGIR 1999 paper “A re-examination of text categorization methods” - http://www-2.cs.cmu.edu/~yiming/publications.html

We have preprocessed the dataset in the following steps:

1) Remove the SGML tags from the files.

2) Remove stopwords such as is, are, this, that.

3) Reduce the words to their morphological root form, using Porter’s stemming algorithm.

4) Feature extraction: as a start, we have decided to use all features.
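Steps 1 and 2 above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not our actual preprocessing code: the class name PreprocessSketch, the four-word stopword list, and the regular expressions are all made up for the example. Stemming (step 3) is left out because a Porter stemmer is too long to show here; it would be applied to each token that survives the stopword filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, for illustration only.
class PreprocessSketch {
    // Tiny illustrative stopword list; a real list would be much longer.
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("is", "are", "this", "that"));

    // Step 1: replace anything that looks like an SGML tag, e.g. <BODY>, with a space.
    // Assumption: character entities such as &lt; are not handled here.
    static String stripTags(String text) {
        return text.replaceAll("<[^>]*>", " ");
    }

    // Steps 1 and 2: strip tags, lowercase, split on non-letters, drop stopwords.
    static List<String> tokens(String rawText) {
        List<String> result = new ArrayList<>();
        for (String token : stripTags(rawText).toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty() && !STOPWORDS.contains(token)) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String raw = "<BODY>This is a takeover bid that shareholders are reviewing.</BODY>";
        // Prints [a, takeover, bid, shareholders, reviewing]
        System.out.println(tokens(raw));
    }
}
```

With "all features" (step 4), every distinct token that comes out of this pipeline would become one feature.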

TOOLS:

We need a set of machine learning tools implemented in Java for our experimentation.

After much searching, we have decided to use the Weka package, as it implements a wide range of machine learning tools. We are particularly interested in the following:

1) Neural Net

2) Naïve Bayes

3) SMO (sequential minimal optimization, used to train support vector machines)

PRELIMINARY RESULTS

1) We have implemented a bagging algorithm to generate different training subsets. The algorithm randomly selects examples from the available training set with replacement.

2) We have also implemented a voting mechanism to combine the various classifiers.
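The two pieces above can be sketched as follows. This is a minimal sketch and not our implementation: the class and method names are hypothetical, the training set is assumed to be non-empty, and the voting rule shown (a strict majority of yes/no votes for a single category, with ties counted as "no") is only one possible way to combine classifiers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper class, for illustration only.
class BaggingSketch {
    // Bagging: draw sampleSize examples uniformly at random, with replacement,
    // so the same example may appear in a bag more than once.
    // Assumption: trainingSet is non-empty.
    static <T> List<T> bootstrapSample(List<T> trainingSet, int sampleSize, Random rng) {
        List<T> bag = new ArrayList<>(sampleSize);
        for (int i = 0; i < sampleSize; i++) {
            bag.add(trainingSet.get(rng.nextInt(trainingSet.size())));
        }
        return bag;
    }

    // Voting for one binary category: returns true only if strictly more
    // than half of the classifiers voted true (assumed rule; ties give false).
    static boolean majorityVote(boolean[] votes) {
        int yes = 0;
        for (boolean vote : votes) {
            if (vote) {
                yes++;
            }
        }
        return 2 * yes > votes.length;
    }

    public static void main(String[] args) {
        // Stand-in training set: document ids 0..99.
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            ids.add(i);
        }
        // A bag of around 10% of the training data.
        List<Integer> bag = bootstrapSample(ids, 10, new Random(42));
        System.out.println("bag = " + bag);
        System.out.println("vote = " + majorityVote(new boolean[] {true, false, true}));
    }
}
```

Under this assumed rule, two classifiers must both say yes for the ensemble to say yes, while three classifiers need only two yes votes.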

Bags

Preliminary results on the category acq, in percent. Each system is trained on a different bag of around 10% of the total training data.

Systems used in combination	Precision	Recall	F-measure
System 1	86.7	87.5	87.1
System 2	86.9	89.3	88.1
System 3	87.7	90.7	89.1
System 1, 2	94.6	80.7	87.1
System 2, 3	95.5	84.4	89.6
System 1, 2, 3	91.2	93.0	92.1
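For reference, the F-measure column can be recomputed from the other two columns, assuming it is the balanced F-measure (F1), the harmonic mean of precision and recall. The rows appear consistent with that assumption up to rounding. The class name below is hypothetical.

```java
import java.util.Locale;

// Hypothetical helper class, for illustration only.
class FMeasureSketch {
    // Balanced F-measure (F1): harmonic mean of precision and recall.
    // Works on percentages or fractions, as long as both use the same scale.
    static double f1(double precision, double recall) {
        if (precision + recall == 0.0) {
            return 0.0; // avoid division by zero when both are zero
        }
        return 2.0 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // System 1: prints 87.1
        System.out.println(String.format(Locale.ROOT, "%.1f", f1(86.7, 87.5)));
        // System 1, 2, 3: prints 92.1
        System.out.println(String.format(Locale.ROOT, "%.1f", f1(91.2, 93.0)));
    }
}
```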

References:

Popular Ensemble Methods: An Empirical Study

http://www-2.cs.cmu.edu/afs/cs/project/jair/pub/volume11/opitz99a-html/paper.html

This paper gives a good overview of the history of ensemble classifiers, with extensive references to other work in the field. It also links to the papers for further reading.