By Guo Liang, Huey Ting
The goal of our project is to use the distributed architecture built by Kevin and Kunal to experiment with ensembles of classifiers.
We have downloaded the Reuters-21578 corpus from
http://www.research.att.com/~lewis/reuters21578.html
It contains over 7000 training texts and 3299 test texts spread across more than 100 categories. It is the test corpus used in the ACM SIGIR 1999 paper “A re-examination of text categorization methods” - http://www-2.cs.cmu.edu/~yiming/publications.html
We have done the necessary preprocessing of the dataset, which consists of the following steps:
1) Remove SGML tagging from the files.
2) Remove stopwords such as is, are, this, that.
3) Transform the words into their morphological root form. We use Porter’s stemming algorithm to stem the words.
4) Feature extraction: we have decided to use all features as a start.
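The four steps above can be sketched roughly as follows. This is an illustrative Python sketch, not our actual code: the names `strip_sgml`, `toy_stem`, `preprocess` and `bag_of_words` are hypothetical, the stopword list is a tiny sample, and `toy_stem` is a crude suffix stripper standing in for Porter’s algorithm, which is considerably more involved. "Using all features" is read here as keeping every distinct stem as a feature, which is an assumption.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); a real run would use a fuller one.
STOPWORDS = {"is", "are", "this", "that", "the", "a", "an", "of", "to", "and", "in"}


def strip_sgml(text):
    """Step 1: replace SGML tags such as <BODY> or </REUTERS> with spaces."""
    return re.sub(r"<[^>]+>", " ", text)


def toy_stem(word):
    """Step 3 stand-in: crude suffix stripping, NOT Porter's algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, stem=toy_stem):
    """Steps 1-3: strip tags, lowercase, drop stopwords, stem what is left."""
    tokens = re.findall(r"[a-z]+", strip_sgml(text).lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]


def bag_of_words(tokens):
    """Step 4: keep every stem as a feature, with its count in the document."""
    return Counter(tokens)
```

A real stemmer can be swapped in by passing it as the `stem` argument of `preprocess`.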
TOOLS:
We need a set of machine learning tools implemented in Java for our experimentation.
After much searching, we have decided to use the Weka package, as it implements a wide range of machine learning tools. We are particularly interested in the following:
1) Neural Net
2) Naïve Bayes.
3) SMO
1) We have implemented a bagging algorithm to generate different training subsets. Basically, the algorithm randomly selects examples from the available training set with replacement.
2) We have also implemented a voting mechanism to combine the various classifiers.
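The bagging and voting steps can be sketched as below. This is a minimal illustration in Python, not our implementation: the function names and the `seed` parameter are hypothetical, and the 10% bag size is taken from the preliminary experiments. This page does not spell out the voting rule, so the strict majority rule used here, with ties counted as a negative decision, is an assumption.

```python
import random


def bootstrap_bag(examples, bag_size, rng):
    """Draw bag_size examples uniformly at random, with replacement."""
    return [rng.choice(examples) for _ in range(bag_size)]


def make_bags(examples, n_bags, fraction=0.1, seed=0):
    """Build n_bags training subsets, each about `fraction` of the full set."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    bag_size = max(1, int(len(examples) * fraction))
    return [bootstrap_bag(examples, bag_size, rng) for _ in range(n_bags)]


def majority_vote(predictions):
    """Combine per-classifier yes/no decisions for one document and category.

    Assumed rule: strict majority, so a tie counts as a negative decision.
    """
    return 2 * sum(predictions) > len(predictions)
```

Because sampling is with replacement, a bag may contain the same example more than once and each bag leaves out most of the training set, which is what makes the trained classifiers differ. Under the assumed rule, two voters must both say yes for a positive decision, which favors precision over recall.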
Preliminary results on the category acq, where each system is trained on a different bag of around 10% of the total training data:
Systems used in combination | Precision (%) | Recall (%) | F-measure (%)
System 1 | 86.7 | 87.5 | 87.1
System 2 | 86.9 | 89.3 | 88.1
System 3 | 87.7 | 90.7 | 89.1
System 1, 2 | 94.6 | 80.7 | 87.1
System 2, 3 | 95.5 | 84.4 | 89.6
System 1, 2, 3 | 91.2 | 93.0 | 92.1
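The F-measure is not defined on this page. Assuming it is the balanced F1 score, i.e. the harmonic mean of precision and recall, it can be recomputed from the other two columns as in the sketch below (the function name `f_measure` is ours, for illustration). Recomputing it this way matches the reported F-measure values to within 0.1, the precision to which the figures are given.

```python
def f_measure(precision, recall):
    """Balanced F1: harmonic mean of precision and recall (assumed definition)."""
    if precision + recall == 0:
        return 0.0  # avoid dividing by zero when both scores are zero
    return 2 * precision * recall / (precision + recall)


# (systems, precision, recall, reported F-measure), copied from the results above.
RESULTS = [
    ("System 1", 86.7, 87.5, 87.1),
    ("System 2", 86.9, 89.3, 88.1),
    ("System 3", 87.7, 90.7, 89.1),
    ("System 1, 2", 94.6, 80.7, 87.1),
    ("System 2, 3", 95.5, 84.4, 89.6),
    ("System 1, 2, 3", 91.2, 93.0, 92.1),
]

for name, p, r, reported in RESULTS:
    # Print the recomputed value next to the reported one for comparison.
    print(f"{name}: recomputed {f_measure(p, r):.2f}, reported {reported}")
```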
References:
http://www-2.cs.cmu.edu/afs/cs/project/jair/pub/volume11/opitz99a-html/paper.html
This paper gives a good overview of the history of ensemble classifiers, with extensive references to work in this field. It also links to the papers for further reading.