PurePos: An Open Source Morphological Disambiguator

György Orosz, Attila Novák

2012

Abstract

This paper presents PurePos, a new open-source morphological tagger based on a Hidden Markov model. It has an interface to an integrated morphological analyzer and thus performs full disambiguated morphological analysis, including lemmatization, of words both known and unknown to the analyzer. The tagger is implemented in Java and released under the permissive LGPL license, so it is easy to integrate and modify. It is fast to train and use, while its accuracy is on par with slow-to-train Maximum Entropy or Conditional Random Field based taggers. Full integration with morphology and an incremental training feature make it well suited for integration into web-based applications. We show that integration with morphology boosts the tool's accuracy in every respect, especially in full morphological disambiguation, when used for morphologically complex agglutinating languages. We evaluate PurePos on Hungarian data, demonstrating state-of-the-art performance in terms of tagging precision and accuracy of full morphological analysis.
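The idea of combining an HMM tagger with a morphological analyzer can be illustrated with a minimal sketch. This is not the PurePos implementation: it is a toy bigram HMM with Viterbi decoding, written in Python for brevity, in which a hypothetical `analyzer` callable restricts the candidate tags of each word to those the morphology allows. The tag set, the probability tables, and all function names are assumptions made up for the example.

```python
import math

FLOOR = 1e-6  # assumed fallback probability for unseen events


def viterbi(words, tags, trans, emit, analyzer=None):
    """Most probable tag sequence under a toy bigram HMM.

    trans[(prev_tag, tag)] and emit[(tag, word)] hold probabilities.
    `analyzer` is a hypothetical callable returning the set of tags the
    morphology permits for a word; an empty set means "unknown word",
    in which case every tag stays a candidate.
    """
    if not words:
        return []

    def logp(table, key):
        return math.log(table.get(key, FLOOR))

    def candidates(word):
        allowed = analyzer(word) if analyzer else None
        return [t for t in tags if not allowed or t in allowed]

    # best[i][tag] = (log-probability of the best path ending in tag, previous tag)
    best = [{t: (logp(trans, ("<s>", t)) + logp(emit, (t, words[0])), None)
             for t in candidates(words[0])}]
    for i in range(1, len(words)):
        column = {}
        for t in candidates(words[i]):
            score, prev = max((best[i - 1][p][0] + logp(trans, (p, t)), p)
                              for p in best[i - 1])
            column[t] = (score + logp(emit, (t, words[i])), prev)
        best.append(column)

    # Follow the back-pointers from the best final tag.
    tag = max(best[-1], key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


# Toy model, invented for illustration only.
TAGS = ["DET", "NOUN", "VERB"]
TRANS = {("<s>", "DET"): 0.8, ("<s>", "NOUN"): 0.1, ("<s>", "VERB"): 0.1,
         ("DET", "NOUN"): 0.9, ("DET", "VERB"): 0.05, ("DET", "DET"): 0.05,
         ("NOUN", "VERB"): 0.7, ("NOUN", "NOUN"): 0.2,
         ("VERB", "NOUN"): 0.4, ("VERB", "VERB"): 0.1}
EMIT = {("DET", "a"): 0.9,
        ("NOUN", "vár"): 0.3, ("VERB", "vár"): 0.5,
        ("NOUN", "áll"): 0.2, ("VERB", "áll"): 0.6}


def toy_analyzer(word):
    # Hypothetical analyzer: knows only that "a" is a determiner.
    return {"DET"} if word == "a" else set()


print(viterbi(["a", "vár", "áll"], TAGS, TRANS, EMIT, analyzer=toy_analyzer))
```

Restricting the candidate set prunes tags the morphology rules out before the statistical model scores them, which is one plausible reading of how analyzer integration helps with words unseen in training.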


Paper Citation


in Harvard Style

Orosz G. and Novák A. (2012). PurePos: An Open Source Morphological Disambiguator. In Proceedings of the 9th International Workshop on Natural Language Processing and Cognitive Science - Volume 1: NLPCS, (ICEIS 2012) ISBN 978-989-8565-16-7, pages 53-63. DOI: 10.5220/0004090300530063


in Bibtex Style

@conference{nlpcs12,
author={György Orosz and Attila Novák},
title={PurePos: An Open Source Morphological Disambiguator},
booktitle={Proceedings of the 9th International Workshop on Natural Language Processing and Cognitive Science - Volume 1: NLPCS, (ICEIS 2012)},
year={2012},
pages={53-63},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004090300530063},
isbn={978-989-8565-16-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 9th International Workshop on Natural Language Processing and Cognitive Science - Volume 1: NLPCS, (ICEIS 2012)
TI - PurePos: An Open Source Morphological Disambiguator
SN - 978-989-8565-16-7
AU - Orosz G.
AU - Novák A.
PY - 2012
SP - 53
EP - 63
DO - 10.5220/0004090300530063