CHI SQUARE FEATURE EXTRACTION BASED SVMS ARABIC TEXT CATEGORIZATION SYSTEM

Abdelwadood Moh’d A Mesleh

doi:10.5220/0001329402350240

2007

Abstract

This paper aims to implement a Support Vector Machines (SVMs) based text classification system for Arabic-language articles. The classifier uses the CHI square method for feature selection in the pre-processing step of the text classification system design procedure. Compared with other classification methods, our classification system shows high classification effectiveness for Arabic articles in terms of macroaveraged F1 = 88.11 and microaveraged F1 = 90.57.
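The abstract names the CHI square statistic and the macro- and microaveraged F1 measures without giving formulas. The sketch below is one way to compute them, assuming the common 2x2 contingency-table form of the chi-square score for a term and a category, and the usual definitions of the two F1 averages. The function names and the example counts are hypothetical and are not taken from the paper; the paper's own implementation may differ.

```python
from typing import List, Tuple


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square score of a term for a category (assumed 2x2 form).

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A degenerate table carries no evidence of association.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(counts: List[Tuple[int, int, int]]) -> Tuple[float, float]:
    """Return (macroaveraged F1, microaveraged F1).

    counts holds one (tp, fp, fn) triple per category.
    Macro: average the per-category F1 values.
    Micro: pool the counts over all categories, then compute one F1.
    """
    macro = sum(f1(tp, fp, fn) for tp, fp, fn in counts) / len(counts)
    tp_sum = sum(tp for tp, _, _ in counts)
    fp_sum = sum(fp for _, fp, _ in counts)
    fn_sum = sum(fn for _, _, fn in counts)
    micro = f1(tp_sum, fp_sum, fn_sum)
    return macro, micro


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(chi_square(10, 0, 0, 10))
    print(macro_micro_f1([(8, 2, 2), (1, 1, 3)]))
```

Terms with the highest chi-square scores per category would be kept as features before training the SVM classifier. The functions return F1 as a fraction between 0 and 1, whereas the abstract appears to report the values as percentages.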

Paper Citation

in Harvard Style

Moh’d A Mesleh A. (2007). CHI SQUARE FEATURE EXTRACTION BASED SVMS ARABIC TEXT CATEGORIZATION SYSTEM. In Proceedings of the Second International Conference on Software and Data Technologies - Volume 1: ICSOFT, ISBN 978-989-8111-05-0, pages 235-240. DOI: 10.5220/0001329402350240

in Bibtex Style

@conference{icsoft07,
author={Abdelwadood Moh’d A Mesleh},
title={CHI SQUARE FEATURE EXTRACTION BASED SVMS ARABIC TEXT CATEGORIZATION SYSTEM},
booktitle={Proceedings of the Second International Conference on Software and Data Technologies - Volume 1: ICSOFT},
year={2007},
pages={235-240},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001329402350240},
isbn={978-989-8111-05-0},
}

in EndNote Style

TY - CONF
JO - Proceedings of the Second International Conference on Software and Data Technologies - Volume 1: ICSOFT
TI - CHI SQUARE FEATURE EXTRACTION BASED SVMS ARABIC TEXT CATEGORIZATION SYSTEM
SN - 978-989-8111-05-0
AU - Moh’d A Mesleh A.
PY - 2007
SP - 235
EP - 240
DO - 10.5220/0001329402350240