English-Turkish Parallel Treebank with Morphological Annotations and its Use in Tree-based SMT

Onur Görgün, Olcay Taner Yıldız, Ercan Solak, Razieh Ehsani

2016

Abstract

In this paper, we report on our study of tree-based statistical machine translation from English to Turkish. We describe our data generation process and report initial results of tree-based translation under a simple model. For corpus construction, we used the Penn Treebank on the English side. We manually translated about 5K trees from English to Turkish under grammar constraints, with adaptations to accommodate the agglutinative nature of Turkish morphology. We used a permutation model for subtrees together with a word-to-word mapping. We report BLEU scores under simple choices of inference algorithms.
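As a rough illustration of what a subtree permutation model combined with a word-to-word mapping can look like, the following minimal sketch reorders the children of a parse-tree node and then replaces each leaf word. The abstract does not specify the model at this level of detail, so everything here is an assumption made for illustration: the tuple-based tree encoding, the `PERMUTATIONS` and `LEXICON` tables, the function names, and the toy Turkish glosses. It is not the authors' implementation, and a statistical model would learn probabilities over permutations and word pairs from the parallel treebank rather than use fixed tables.

```python
from typing import Dict, List, Tuple, Union

# A tree is (label, [children]) for internal nodes and (label, word) for leaves.
Tree = Tuple[str, Union[str, List["Tree"]]]

# Hypothetical permutation table: (parent label, child labels) -> new child order.
# A real model would estimate a distribution over orders from aligned trees.
PERMUTATIONS: Dict[Tuple[str, Tuple[str, ...]], Tuple[int, ...]] = {
    ("VP", ("VBD", "NP")): (1, 0),  # verb-object becomes object-verb
}

# Hypothetical word-to-word mapping with toy glosses (not taken from the paper).
LEXICON: Dict[str, str] = {"Ali": "Ali", "read": "okudu", "books": "kitaplar"}


def permute(tree: Tree) -> Tree:
    """Recursively reorder the children of each node using PERMUTATIONS."""
    label, body = tree
    if isinstance(body, str):  # leaf: nothing to reorder
        return tree
    children = [permute(child) for child in body]
    key = (label, tuple(child[0] for child in children))
    # Keep the original order when no rule is known for this configuration.
    order = PERMUTATIONS.get(key, tuple(range(len(children))))
    return (label, [children[i] for i in order])


def leaves(tree: Tree) -> List[str]:
    """Collect the leaf words of a tree from left to right."""
    _, body = tree
    if isinstance(body, str):
        return [body]
    words: List[str] = []
    for child in body:
        words.extend(leaves(child))
    return words


def translate(tree: Tree) -> List[str]:
    """Permute subtrees, then map each source word to a target word.

    Words missing from LEXICON are passed through unchanged.
    """
    return [LEXICON.get(word, word) for word in leaves(permute(tree))]


EXAMPLE: Tree = (
    "S",
    [
        ("NP", [("NNP", "Ali")]),
        ("VP", [("VBD", "read"), ("NP", [("NNS", "books")])]),
    ],
)

if __name__ == "__main__":
    print(" ".join(translate(EXAMPLE)))
```

In this toy setup, the single rule under `VP` moves the object in front of the verb, which mirrors the verb-final order typical of Turkish. A word-to-word mapping alone cannot capture agglutinative morphology, which is presumably why the treebank carries morphological annotations.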


Paper Citation


in Harvard Style

Görgün O., Yıldız O., Solak E. and Ehsani R. (2016). English-Turkish Parallel Treebank with Morphological Annotations and its Use in Tree-based SMT. In Proceedings of the 5th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-758-173-1, pages 510-516. DOI: 10.5220/0005653905100516


in Bibtex Style

@conference{icpram16,
author={Onur Görgün and Olcay Taner Yıldız and Ercan Solak and Razieh Ehsani},
title={English-Turkish Parallel Treebank with Morphological Annotations and its Use in Tree-based SMT},
booktitle={Proceedings of the 5th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2016},
pages={510-516},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005653905100516},
isbn={978-989-758-173-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 5th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - English-Turkish Parallel Treebank with Morphological Annotations and its Use in Tree-based SMT
SN - 978-989-758-173-1
AU - Görgün O.
AU - Yıldız O.
AU - Solak E.
AU - Ehsani R.
PY - 2016
SP - 510
EP - 516
DO - 10.5220/0005653905100516