KOSHIK: A Large-scale Distributed Computing Framework for NLP

Peter Exner, Pierre Nugues

2014

Abstract

In this paper, we describe KOSHIK, an end-to-end framework to process the unstructured natural language content of multilingual documents. We used the Hadoop distributed computing infrastructure to build this framework as it enables KOSHIK to easily scale by adding inexpensive commodity hardware. We designed an annotation model that allows the processing algorithms to incrementally add layers of annotation without modifying the original document. We used the Avro binary format to serialize the documents. Avro is designed for Hadoop and allows other data warehousing tools to directly query the documents. This paper reports the implementation choices and details of the framework, the annotation model, the options for querying processed data, and the parsing results on the English and Swedish editions of Wikipedia.
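
The layered annotation idea described above can be illustrated with a minimal sketch. It is not KOSHIK's actual schema: the names `Document`, `Annotation`, `add_layer`, and `whitespace_tokens` are hypothetical, the tokenizer is a toy, and serialization to Avro is left out. The sketch only assumes that each annotation refers to the original text by character offsets, so that new layers can be added while the text itself stays unchanged.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Annotation:
    """A stand-off annotation: it points into the text instead of copying it."""
    begin: int   # inclusive character offset into the document text
    end: int     # exclusive character offset
    label: str   # e.g. "token" or a part-of-speech tag


@dataclass
class Document:
    """Hypothetical document record: immutable-by-convention text plus named layers."""
    text: str
    layers: Dict[str, List[Annotation]] = field(default_factory=dict)

    def add_layer(self, name: str, annotations: List[Annotation]) -> None:
        # Layers are only ever added, never rewritten, and the text is not touched.
        if name in self.layers:
            raise ValueError(f"layer {name!r} already exists")
        self.layers[name] = list(annotations)

    def covered_text(self, ann: Annotation) -> str:
        # Recover the surface string of an annotation from its offsets.
        return self.text[ann.begin:ann.end]


def whitespace_tokens(text: str) -> List[Annotation]:
    """Toy tokenizer (an assumption for this sketch): one span per whitespace-separated piece."""
    spans = []
    pos = 0
    for piece in text.split():
        begin = text.index(piece, pos)
        end = begin + len(piece)
        spans.append(Annotation(begin, end, "token"))
        pos = end
    return spans


doc = Document("KOSHIK parses Wikipedia")

# First processing step: add a token layer.
doc.add_layer("token", whitespace_tokens(doc.text))

# A later step adds a second layer over the same offsets (tags chosen by hand here).
doc.add_layer("pos", [
    Annotation(0, 6, "NNP"),
    Annotation(7, 13, "VBZ"),
    Annotation(14, 23, "NNP"),
])

for ann in doc.layers["pos"]:
    print(doc.covered_text(ann), ann.label)
```

Because every layer addresses the text by offsets, a later processing step can read earlier layers and append its own without rewriting the document, which is the property the abstract attributes to the annotation model.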



Paper Citation


in Harvard Style

Exner P. and Nugues P. (2014). KOSHIK: A Large-scale Distributed Computing Framework for NLP. In Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-758-018-5, pages 463-470. DOI: 10.5220/0004707704630470


in Bibtex Style

@conference{icpram14,
author={Peter Exner and Pierre Nugues},
title={KOSHIK: A Large-scale Distributed Computing Framework for NLP},
booktitle={Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2014},
pages={463-470},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004707704630470},
isbn={978-989-758-018-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - KOSHIK: A Large-scale Distributed Computing Framework for NLP
SN - 978-989-758-018-5
AU - Exner P.
AU - Nugues P.
PY - 2014
SP - 463
EP - 470
DO - 10.5220/0004707704630470