A Comparison of Methods for Web Document Classification

Julia Hodges, Yong Wang, Bo Tang

Abstract

WebDoc is an automated classification system that assigns Web documents to appropriate Library of Congress subject headings based on the text in the documents. We have used a different classification method in each of three versions of WebDoc. The first method is a statistical approach that counts the number of occurrences of a given noun phrase in documents assigned to a particular subject heading and uses those counts to determine the weights assigned to the candidate indexes. The second method uses a naïve Bayes approach. In this case, we experimented with smoothing to dampen the effect of having a large number of zeros in our feature vectors. The third method is a k-nearest neighbors approach, for which we tested two different ways of determining the similarity of feature vectors. In this paper, we report the performance of each version of WebDoc in terms of recall, precision, and F-measures.
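The evaluation measures named in the abstract can be illustrated with a minimal sketch. It assumes each document has a set of subject headings assigned by the classifier and a reference set of correct headings; the function name `precision_recall_f`, the `beta` parameter, and the example headings are hypothetical and are not taken from WebDoc itself.

```python
def precision_recall_f(assigned, relevant, beta=1.0):
    """Return (precision, recall, F_beta) for one document.

    assigned: subject headings produced by a classifier (assumed input)
    relevant: subject headings in the reference labeling (assumed input)
    beta:     weight of recall relative to precision; beta=1 gives F1
    """
    assigned, relevant = set(assigned), set(relevant)
    correct = len(assigned & relevant)

    # Precision: fraction of assigned headings that are correct.
    precision = correct / len(assigned) if assigned else 0.0
    # Recall: fraction of correct headings that were assigned.
    recall = correct / len(relevant) if relevant else 0.0

    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0

    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Hypothetical example: the headings are made-up strings, not WebDoc output.
p, r, f1 = precision_recall_f(
    assigned={"Computer science", "Machine learning", "Libraries"},
    relevant={"Machine learning", "Libraries",
              "Information retrieval", "Indexing"},
)
print(p, r, f1)
```

Varying `beta` yields a family of F-measures: values above 1 weight recall more heavily, and values below 1 weight precision more heavily.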

Paper Citation


in Harvard Style

Hodges J., Wang Y. and Tang B. (2005). A Comparison of Methods for Web Document Classification. In Proceedings of the 5th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2005) ISBN 972-8865-28-7, pages 154-163. DOI: 10.5220/0002557601540163


in Bibtex Style

@conference{pris05,
author={Julia Hodges and Yong Wang and Bo Tang},
title={A Comparison of Methods for Web Document Classification},
booktitle={Proceedings of the 5th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2005)},
year={2005},
pages={154-163},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002557601540163},
isbn={972-8865-28-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 5th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2005)
TI - A Comparison of Methods for Web Document Classification
SN - 972-8865-28-7
AU - Hodges J.
AU - Wang Y.
AU - Tang B.
PY - 2005
SP - 154
EP - 163
DO - 10.5220/0002557601540163