Information Theoretic Text Classification Methods Evaluation

David Pereira Coutinho, Mário A. T. Figueiredo

2008

Abstract

Most approaches to text classification rely on some measure of (dis)similarity between sequences of symbols. Information theoretic measures have the advantage of making very few assumptions about the models considered to have generated the sequences, and have been the focus of recent interest. This paper compares the Ziv-Merhav method (ZMM) and the Cai-Kulkarni-Verdú method (CKVM) for estimating relative entropy (or Kullback-Leibler divergence) from sequences of symbols, when used as a tool for text classification. We briefly describe our implementation of the ZMM, based on a modified version of the Lempel-Ziv algorithm (LZ77), and the CKVM implementation, which is based on the Burrows-Wheeler block sorting transform (BWT). Assessing the accuracy of both the ZMM and the CKVM on synthetic Markov sequences shows that the CKVM yields better estimates of the Kullback-Leibler divergence. Finally, we apply both methods to a text classification problem (more specifically, authorship attribution); surprisingly, the CKVM performs poorly, while the ZMM outperforms a previously proposed (also information theoretic) method.
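
To make the idea of estimating relative entropy by cross-parsing concrete, the following is a minimal sketch, not the authors' implementation. It assumes the commonly cited form of the Ziv-Merhav estimator, (1/n)[c(z|x) log2 n - c(z) log2 c(z)], with an LZ78-style self-parsing for c(z); the abstract states that the paper instead uses a modified LZ77 for its ZMM implementation. The function names are hypothetical, and the naive substring search is only suitable for short sequences.

```python
import math


def cross_parse_count(z: str, x: str) -> int:
    """Count phrases when z is split into the longest prefixes that occur in x."""
    count, i = 0, 0
    while i < len(z):
        length = 0
        # Grow the phrase while the extended prefix still occurs somewhere in x.
        while i + length < len(z) and z[i:i + length + 1] in x:
            length += 1
        # A symbol that never occurs in x still forms a one-symbol phrase.
        i += max(length, 1)
        count += 1
    return count


def lz78_phrase_count(z: str) -> int:
    """Count phrases of an LZ78-style incremental self-parsing of z."""
    phrases = set()
    current = ""
    count = 0
    for symbol in z:
        current += symbol
        if current not in phrases:
            phrases.add(current)
            count += 1
            current = ""
    if current:
        # The trailing, already-seen phrase is counted as well.
        count += 1
    return count


def zm_divergence(z: str, x: str) -> float:
    """Sketch of a Ziv-Merhav style relative entropy estimate, in bits per symbol.

    Assumed form: (1/n) * [c(z|x) * log2(n) - c(z) * log2(c(z))], for non-empty z.
    """
    n = len(z)
    c_cross = cross_parse_count(z, x)
    c_self = lz78_phrase_count(z)
    return (c_cross * math.log2(n) - c_self * math.log2(c_self)) / n


if __name__ == "__main__":
    z = "abab" * 50
    print(zm_divergence(z, "ab" * 200))  # reference text similar to z
    print(zm_divergence(z, "cd" * 200))  # reference text unlike z
```

In a classification setting such as authorship attribution, a test text z would be assigned to the candidate whose reference text x gives the smallest estimate. On short sequences the estimate can be negative, since the formula is only justified asymptotically.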


Paper Citation


in Harvard Style

Pereira Coutinho, D. and Figueiredo, M. A. T. (2008). Information Theoretic Text Classification Methods Evaluation. In Proceedings of the 8th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2008), ISBN 978-989-8111-42-5, pages 77-85. DOI: 10.5220/0001740200770085


in Bibtex Style

@conference{pris08,
author={David Pereira Coutinho and Mário A. T. Figueiredo},
title={Information Theoretic Text Classification Methods Evaluation},
booktitle={Proceedings of the 8th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2008)},
year={2008},
pages={77-85},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001740200770085},
isbn={978-989-8111-42-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 8th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2008)
TI - Information Theoretic Text Classification Methods Evaluation
SN - 978-989-8111-42-5
AU - Pereira Coutinho D.
AU - Figueiredo M. A. T.
PY - 2008
SP - 77
EP - 85
DO - 10.5220/0001740200770085