Robust Morphologic Analyzer for Highly Inflected Languages

Andrés Tomás Hohendahl, José Francisco Zelasco, Judith Donayo

2010

Abstract

We present a robust multilingual morphologic tagger and tokenizer for highly inflected languages such as Spanish, with efficient spelling correction and ‘sound-like’ word inference, which extracts some semantic information even from parasynthetic and unknown words. The algorithm combines rules and statistical best-affix fitting with a language estimator. A rich set of flags controls its internal behaviour. The system is designed for efficiency and a low memory footprint, using data structures based on simple, readily available affixing rules. Packed with a Spanish dictionary of 83k lemmas and 5k rules, the system recognizes 2.2M exact words; the word space it can reach by guessing is many times larger.
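
The abstract does not give the rule format, so the following is only a minimal sketch of how a small lemma dictionary plus affixing rules can cover a much larger set of surface forms, in the general spirit of affix-compressed spell-checker dictionaries. The `SuffixRule` type, the tag strings, the toy rules and lemmas, and the `expand` and `analyze` functions are all assumptions made for illustration; they are not the authors' data structures or algorithm, and the statistical best-affix fitting and language estimator are not modelled.

```python
from typing import NamedTuple


class SuffixRule(NamedTuple):
    strip: str  # ending removed from the lemma (also acts as the rule's condition)
    add: str    # ending appended in its place
    tag: str    # hypothetical morphological tag, invented for this sketch


# Toy data, invented for illustration only.
RULES = [
    SuffixRule("ar", "o", "V.1sg.pres"),
    SuffixRule("ar", "amos", "V.1pl.pres"),
    SuffixRule("a", "as", "N.pl"),
]
LEMMAS = {"cantar", "casa"}


def expand(lemma, rules):
    """Generate the surface forms that the rules allow for one lemma."""
    forms = {}
    for rule in rules:
        if lemma.endswith(rule.strip):
            stem = lemma[: len(lemma) - len(rule.strip)]
            forms[stem + rule.add] = rule.tag
    return forms


def analyze(word, lemmas, rules):
    """Undo each rule and keep the candidates whose lemma is in the dictionary."""
    results = []
    for rule in rules:
        if word.endswith(rule.add):
            candidate = word[: len(word) - len(rule.add)] + rule.strip
            if candidate in lemmas:
                results.append((candidate, rule.tag))
    return results


if __name__ == "__main__":
    print(expand("cantar", RULES))
    print(analyze("cantamos", LEMMAS, RULES))
    print(analyze("casas", LEMMAS, RULES))
```

In this toy setting only the lemmas and rules are stored, while recognition works by reversing the rules at lookup time. A figure such as 2.2M recognized words from 83k lemmas and 5k rules would arise from the same kind of multiplication, though the real system's rules and conditions are certainly richer than this.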


Paper Citation


in Harvard Style

Tomás Hohendahl A., Zelasco J. and Donayo J. (2010). Robust Morphologic Analyzer for Highly Inflected Languages. In Proceedings of the 7th International Workshop on Natural Language Processing and Cognitive Science - Volume 1: NLPCS, (ICEIS 2010) ISBN 978-989-8425-13-3, pages 112-118. DOI: 10.5220/0003015301120118


in Bibtex Style

@conference{nlpcs10,
author={Andrés Tomás Hohendahl and José Francisco Zelasco and Judith Donayo},
title={Robust Morphologic Analyzer for Highly Inflected Languages},
booktitle={Proceedings of the 7th International Workshop on Natural Language Processing and Cognitive Science - Volume 1: NLPCS, (ICEIS 2010)},
year={2010},
pages={112-118},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003015301120118},
isbn={978-989-8425-13-3},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 7th International Workshop on Natural Language Processing and Cognitive Science - Volume 1: NLPCS, (ICEIS 2010)
TI - Robust Morphologic Analyzer for Highly Inflected Languages
SN - 978-989-8425-13-3
AU - Tomás Hohendahl A.
AU - Zelasco J.
AU - Donayo J.
PY - 2010
SP - 112
EP - 118
DO - 10.5220/0003015301120118