Construct-Extract: An Effective Model for Building Bilingual Corpus to Improve English-Myanmar Machine Translation

May Zin, Teeradaj Racharak, Nguyen Le

Abstract

When dealing with low-resource languages such as Myanmar, using additional pseudo-parallel data to train machine translation systems is often an effective approach. Because a pseudo-parallel corpus is generated by back-translating target-side monolingual text into the source language, it can contain considerable noise, including translation errors and weakly paired sentences, and therefore requires cleaning. In this paper, we propose Construct-Extract, a filtering system for noisy parallel sentences based on cosine similarity and cross-lingual sentence embeddings from Siamese BERT-Networks. The proposed system filters out noisy sentences by extracting high-scoring sentence pairs from the constructed pseudo-parallel data, yielding better synthetic parallel data. As part of the proposed system, we also introduce an unsupervised Myanmar sub-word segmenter to improve the quality of current English-Myanmar translation models, which are candidates for use as backward systems in back-translation and often suffer from Myanmar word segmentation errors. Experiments show that the proposed Myanmar word segmentation helps the backward system construct more accurate back-translated pseudo-parallel data, and that using our extracted pseudo-parallel corpus improves the performance of English-Myanmar translation systems in both directions.
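
The extraction step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy three-dimensional vectors, and the 0.8 threshold are all assumptions made for illustration. In practice the vectors would come from a cross-lingual sentence encoder, and the cutoff would be tuned on held-out data.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero embedding as dissimilar to everything
    return dot / (norm_u * norm_v)


def extract_pairs(pairs, src_vecs, tgt_vecs, threshold=0.8):
    """Keep sentence pairs whose embedding similarity meets the threshold.

    `threshold` is a hypothetical cutoff, not a value from the paper.
    """
    kept = []
    for pair, u, v in zip(pairs, src_vecs, tgt_vecs):
        score = cosine_similarity(u, v)
        if score >= threshold:
            kept.append((pair, score))
    return kept


# Toy stand-ins for (back-translated English, Myanmar) sentence pairs and
# their embeddings; real embeddings would have hundreds of dimensions.
pairs = [("en sentence A", "my sentence A"), ("en sentence B", "my sentence B")]
src_vecs = [[1.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
tgt_vecs = [[0.9, 0.1, 1.1], [0.0, 1.0, 0.0]]

kept = extract_pairs(pairs, src_vecs, tgt_vecs, threshold=0.8)
for pair, score in kept:
    print(pair, round(score, 3))
```

In this toy input the first pair has nearly parallel vectors and is kept, while the second pair has orthogonal vectors and is filtered out as noise.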

Paper Citation


in Harvard Style

Zin M., Racharak T. and Le N. (2021). Construct-Extract: An Effective Model for Building Bilingual Corpus to Improve English-Myanmar Machine Translation. In Proceedings of the 13th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART, ISBN 978-989-758-484-8, pages 333-342. DOI: 10.5220/0010318903330342


in Bibtex Style

@conference{icaart21,
author={May Zin and Teeradaj Racharak and Nguyen Le},
title={Construct-Extract: An Effective Model for Building Bilingual Corpus to Improve English-Myanmar Machine Translation},
booktitle={Proceedings of the 13th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART},
year={2021},
pages={333-342},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0010318903330342},
isbn={978-989-758-484-8},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 13th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART
TI - Construct-Extract: An Effective Model for Building Bilingual Corpus to Improve English-Myanmar Machine Translation
SN - 978-989-758-484-8
AU - Zin M.
AU - Racharak T.
AU - Le N.
PY - 2021
SP - 333
EP - 342
DO - 10.5220/0010318903330342