ONTOLOGY-DRIVEN CONCEPTUAL DOCUMENT CLASSIFICATION

Gordana Pavlović-Lažetić, Jelena Graovac

2010

Abstract

This paper presents document classification based on the lexical-semantic network WordNet. Two types of classification of Serbian documents were tested: classification based on chosen concepts from the Serbian WordNet (SWN), and classification based on proper names. Conceptual classification criteria are constructed from hierarchies rooted in a set of chosen concepts (first case) or in some of the proper names' hypernyms (second case). A classifier of the first type is trained and then tested on the indexed, pre-classified Ebart corpus of Serbian newspapers (476,917 articles). Precision, recall, and F-measure show that this type of classification is promising but incomplete, mainly because SWN itself is incomplete. For proper names-based classification, the paper presents a proper names ontology based on SWN and defines a distance-based similarity measure built on Euclidean and Manhattan distances. Classification of a subset of the Contemporary Serbian Language Corpus is presented.
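The abstract names the evaluation measures (precision, recall, F-measure) and the two distances underlying the similarity measure, but it does not give their exact formulation. The following Python sketch shows the standard definitions only. The mapping from distance to similarity, 1 / (1 + d), and all function names are assumptions made for illustration, not the authors' definitions.

```python
import math


def euclidean(u, v):
    """Euclidean (L2) distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Manhattan (L1) distance between two equal-length numeric vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def similarity(u, v, distance):
    """Turn a distance into a similarity in (0, 1].

    Assumption: 1 / (1 + d) is one common choice; the paper's own
    similarity measure is not specified in the abstract.
    """
    return 1.0 / (1.0 + distance(u, v))


def precision_recall_f1(predicted, actual):
    """Standard precision, recall and F-measure for one class.

    `predicted` holds the documents assigned to the class by a classifier,
    `actual` holds the documents that truly belong to it.
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Here the vectors `u` and `v` could, for instance, hold per-concept frequencies for a document and for a class profile; that representation is likewise a hypothetical choice for this sketch.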



Paper Citation


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010)
TI - ONTOLOGY-DRIVEN CONCEPTUAL DOCUMENT CLASSIFICATION
SN - 978-989-8425-28-7
AU - Pavlović-Lažetić G.
AU - Graovac J.
PY - 2010
SP - 383
EP - 386
DO - 10.5220/0003063903830386


in Harvard Style

Pavlović-Lažetić G. and Graovac J. (2010). ONTOLOGY-DRIVEN CONCEPTUAL DOCUMENT CLASSIFICATION. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010), ISBN 978-989-8425-28-7, pages 383-386. DOI: 10.5220/0003063903830386


in Bibtex Style

@conference{kdir10,
author={Gordana Pavlović-Lažetić and Jelena Graovac},
title={ONTOLOGY-DRIVEN CONCEPTUAL DOCUMENT CLASSIFICATION},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010)},
year={2010},
pages={383-386},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003063903830386},
isbn={978-989-8425-28-7},
}