Construct-Extract: An Effective Model for Building Bilingual Corpus
to Improve English-Myanmar Machine Translation
May Myo Zin, Teeradaj Racharak and Nguyen Minh Le
School of Information Science, Japan Advanced Institute of Science and Technology, Ishikawa, Japan
Keywords:
Neural Machine Translation, Myanmar Word Segmentation, Parallel Corpus Creation, Back-translation,
Siamese-BERT Network.
Abstract:
When dealing with low resource languages such as Myanmar, using additional pseudo parallel data for train-
ing machine translation systems is often an effective approach. As a pseudo parallel corpus is generated by
back-translating target monolingual texts into the source language, it potentially contains a lot of noise, including translation errors and weakly paired sentences, and thus requires cleaning. In this paper, we propose a noisy-parallel-sentence filtering system called Construct-Extract, based on cosine similarity over cross-lingual sentence embeddings from Siamese BERT-Networks. The proposed system filters out noisy sentences by extracting high-scoring sentence pairs from the constructed pseudo parallel data to finally obtain better syn-
thetic parallel data. As part of the proposed system, we also introduce an unsupervised Myanmar sub-word
segmenter to improve the quality of current English-Myanmar translation models, which can potentially be used as backward systems for back-translation and often suffer from Myanmar word segmentation errors. Experiments show that the proposed Myanmar word segmentation helps the backward system construct more accurate back-translated pseudo parallel data, and that using our extracted pseudo parallel corpus improves the performance of English-Myanmar translation systems in both directions.
1 INTRODUCTION
Sentence-aligned parallel corpus is a prerequisite re-
source for building statistical and neural machine
translation (SMT and NMT) systems. Generally, training a machine translation (MT) system on a large quantity of parallel sentences enables it to produce better translation results. However, for low-resource
languages such as Myanmar, parallel corpora remain
scarce due mainly to the cost of their creation. Find-
ings in the literature show that there are two methods
that support the construction of comprehensive paral-
lel corpora. The first one is to extract nearly paral-
lel sentence pairs from available topic-aligned paral-
lel documents, called comparable corpora (Grégoire
and Langlais, 2018; Hangya and Fraser, 2019). The
second one is to use an automatic back-translation
model trained on existing parallel data for creating
new pseudo parallel corpus from the available target
monolingual text (Xu et al., 2019).
It is worth noting that Myanmar is a resource-poor
language and only a small number of English-Myanmar
parallel sentence pairs are currently available to build
baseline MT systems. Moreover, topic-aligned doc-
uments (i.e., comparable corpora) that contain semantically similar sentence pairs are not
yet available. However, there are plenty of English
monolingual data, which span various domains and are easily accessible. Available monolingual English
language data can be automatically backward trans-
lated into the Myanmar language for creating addi-
tional parallel corpus for training MT models.
This paper proposes the use of the target-side
data (monolingual English sentences) through the back-translation approach for improving both the source-
to-target MT model (a.k.a. a forward model) and
target-to-source MT model (a.k.a. a backward
model). Indeed, we construct a pseudo parallel corpus
and further extract only high quality sentence pairs
from the constructed corpus. We apply the back-
translation approach to construct English-Myanmar
synthetic parallel data as a pseudo parallel corpus
from collected in-domain English monolingual texts,
in which the collected English monolingual sentences
and existing training data are in the same domain.
Then, the English-to-Myanmar MT model is used as
a backward model. Apart from increasing the size of
a corpus, word segmentation has been shown to be
helpful for improving translation tasks (Zhao et al.,
2013). If the word segmentation step has many er-
rors, the backward model may not reach the expected translation accuracy. Existing Myanmar
word segmentation tools produce a massive number of rare words
in MT tasks. In order to improve the performance of a
backward model, we specifically propose an unsuper-
vised Myanmar word segmentation approach based
on the NFKC normalization (https://en.wikipedia.org/wiki/Unicode_equivalence) and byte pair encoding
(BPE) (Sennrich et al., 2016b) mechanisms. The pro-
posed segmentation approach can learn by itself to adapt to the current MT domain and significantly reduce the
out-of-vocabulary (OOV) rate.
Although a back-translation approach can gener-
ate a large amount of synthetic parallel data, there
is no guarantee of its quality. Data generated with
back-translation might have noisy target transla-
tions (from monolingual data). Data quality plays an
essential role in training both statistical and neural
machine translation models. NMT models, in particular, are very sensitive to noise in their inputs. Therefore,
using a constructed pseudo parallel corpus without
filtering low-quality noisy sentence pairs may lead NMT systems to performance degradation. Re-
garding synthetic data filtering, we propose a simple
but effective approach called Construct-Extract that
extracts only high-quality parallel sentence pairs from
our constructed corpus. Our approach is based on
the sentence-level cosine similarity of any two sen-
tence vectors, i.e., vector representations of the back-
translated synthetic source (Myanmar) sentence and
the monolingual target (English) sentence. We cal-
culate the sentence vectors on each sentence using
Siamese BERT-Networks with an additional MEAN
pooling layer.
The contribution of this paper is that we demon-
strate the feasibility of improving performance on
the Myanmar-English machine translation task by
developing a neural-based bilingual corpus creation
framework called Construct-Extract. There are three
important outcomes. First, we introduce a sim-
ple but effective unsupervised Myanmar word seg-
mentation approach for improving the generated re-
sults of MT models that can potentially be used
as back-translation models in pseudo parallel cor-
pus construction. Second, we construct English-
Myanmar pseudo parallel data from English monolin-
gual texts by applying the back-translation approach using the improved English-to-Myanmar backward model.
Third, we propose a Siamese BERT-Networks based
approach to high-quality parallel sentence extrac-
tion (from our constructed corpus). Experiments on
English-Myanmar translations demonstrate the effi-
cacy of the proposed Myanmar word segmentation on
improving current MT models that can potentially be
used as backward systems in back-translation tasks,
and of our constructed-extracted pseudo parallel corpus in enhancing the performance of the final MT models for
bidirectional translation tasks.
2 CONSTRUCT-EXTRACT: A
NEURAL-BASED
FRAMEWORK FOR BUILDING
BILINGUAL CORPUS
Our neural-based framework for building Myanmar-
English bilingual corpus comprises two main mod-
ules for (1) pseudo parallel corpus construction and
(2) high-quality parallel sentence pair extraction.
Figure 1 shows an overview of the system contain-
ing these components. Briefly, Figure 1 (a) depicts
the first module containing the following two steps:
improving the backward NMT system with the pro-
posed Myanmar word segmentation and generating a
pseudo parallel corpus through back-translation. Fig-
ure 1 (b) depicts the second module which uses the
Siamese BERT-Network architecture for extracting
high-quality sentence pairs from the corpus generated
by the first module.
For Figure 1 (a), we construct more parallel trans-
lated texts through back-translation (Sennrich et al.,
2016a) using a volunteer translator, i.e., by automatically back-translating the 150k in-domain target monolingual English texts into the source Myanmar language with a pre-trained English-to-Myanmar backward MT model. To select the volunteer backward
translator, we conduct experiments on the choice of
SMT and NMT with the available parallel datasets.
As a result, NMT generates more accurate and flu-
ent translation outputs than SMT in both direc-
tions; we therefore choose NMT for the pipeline. However, NMT still suffers from the out-of-
vocabulary (OOV) issue due to the weakness of the
current Myanmar word segmentation model. Hence,
we also propose and apply a Myanmar word seg-
mentation model for improving the performance of
the backward NMT system. Our proposed Myanmar
word segmentation model learns only from the current training data, so its segmentation fits the current MT domain. We train our segmentation model as
follows:
First, the model treats Myanmar sentences as raw
streams of Unicode characters and normalizes
them into canonical forms;
Then, we apply the idea of byte pair encoding
(BPE) on the normalized corpus to construct an appropriate vocabulary.
We explain each step of our proposed segmentation
model in detail in Section 2.1.
Our constructed (back-translated) pseudo parallel
corpus might have noisy target translations in the Myanmar language. NMT is more sensitive to noisy data than SMT. In some works, using additional back-translated data to train NMT caused translation performance to deteriorate (Du and Way, 2017) or fall short of expectations. To investigate and overcome this problem, we
present an extraction model that incorporates Siamese
BERT networks with cosine similarity to filter only
high quality sentence pairs. The whole extraction
process is shown in Figure 1 (b), in which a Siamese BERT network produces sentence embeddings u and v, whose cosine similarity is thresholded to keep only good-quality sentence pairs. If the similarity score is greater than or equal to
a decision threshold p, we add that pair into the train-
ing data as a good quality sentence pair.
Figure 1: The proposed Construct-Extract framework for
English-Myanmar parallel corpus creation.
2.1 Myanmar Word Segmentation
Our Myanmar word segmenter consists of three com-
ponents: a normalizer, a trainer, and a tokenizer. In-
put sentences are treated as raw Unicode character streams, with the space treated as an ordinary character. Fig-
ure 2 presents an overall architecture of the proposed
Myanmar word segmenter.
Firstly, the normalizer (indicated by the first blue
box) employs the Unicode NFKC normalization to
normalize semantically equivalent Unicode charac-
ters into canonical forms. NFKC, which is the Uni-
code standard normalization form, has been widely
used in many NLP applications recently because of
its better reproducibility and its strong support in the
Unicode standard. Secondly, the trainer (indicated by
the second blue box) trains the segmentation model
using the byte-pair-encoding (BPE) algorithm (Sen-
nrich et al., 2016b) from the normalized corpus to
build up a word vocabulary based on sub-word com-
ponents. The trained segmentation model learns a vocabulary that provides a good compression
rate of the text. Lastly, the tokenizer module (in-
dicated by the dashed box) internally executes the
normalizer to normalize the input text and tokenizes
it into a sub-word sequence with the segmentation
model trained by the trainer.
Figure 2: The proposed Myanmar word segmenter.
2.2 Back-translation
The back-translation approach (Sennrich et al., 2016a)
is an effective data augmentation method leveraging
target-side monolingual data. We apply back-translation to construct parallel translated sentence
pairs from collected monolingual English texts. To
perform back-translation, we first train English-to-
Myanmar NMT (the backward system) with our pro-
posed Myanmar word segmentation on the parallel
data shown in Table 1, and use it to translate the collected
150k English monolingual sentences to construct syn-
thetic source-side Myanmar texts. After the back-
translation process, we constructed 150k English-
Myanmar pseudo parallel sentence pairs. The noisy
sentence pairs from the constructed corpus are then
removed with our proposed extraction module.
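As a minimal sketch of this construction step (our illustration, not the authors' exact code), the loop below pairs each monolingual English sentence with its back-translated output; the translate() helper and the file names are hypothetical, with translate() standing in for the trained English-to-Myanmar backward model.

```python
# A minimal sketch of pseudo parallel corpus construction. translate() is a
# hypothetical wrapper around the trained English-to-Myanmar backward NMT
# model; it is not part of the paper's code.
def construct_pseudo_parallel(mono_en_path, out_my_path, translate):
    with open(mono_en_path, encoding="utf-8") as src, \
         open(out_my_path, "w", encoding="utf-8") as tgt:
        for line in src:
            en = line.strip()
            my = translate(en)  # backward NMT: English -> synthetic Myanmar
            tgt.write(my + "\n")
    # The synthetic Myanmar file (source side) and the original English file
    # (target side) together form the pseudo parallel corpus.
```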
2.3 Sentence Embeddings
Sentence-BERT (SBERT) (Reimers and Gurevych,
2019), a modification of the pretrained BERT net-
work that uses Siamese and triplet network structures
(Schroff et al., 2015), has set a new state-of-the-art
performance on various sentence classification, clus-
tering and sentence-pair regression tasks such as se-
mantic textual similarity. Currently, there is an in-
creasing number of state-of-the-art pretrained mod-
els that support more than 100 languages including
Myanmar and English. These models were trained
based on the idea that a translated sentence should
be mapped to the same location in the vector space
as the original sentence. Therefore, they can gener-
ate aligned vector spaces, i.e., similar inputs in dif-
ferent languages are mapped closely in a vector space
(Reimers and Gurevych, 2020).
In our experiments, we used the pre-trained
model distilbert-multilingual-nli-stsb-quora-ranking to derive semantically meaningful sentence
embeddings between English sentences and back-
translated Myanmar sentences. Then, we applied the
cosine similarity to indicate how semantically similar the sentences in each input pair are. In our
model, we threshold the similarity between each sen-
tence pair at 0.77. If the similarity score between each
sentence pair is greater than or equal to the thresh-
old, the model decides to add that pair into the ex-
isting training data as a good quality parallel sen-
tence pair. From the 150 thousand parallel translated
sentences, our model extracted only 92,111 sentence
pairs. We examined the performance of SMT and
NMT on existing datasets, with an additional 150k
constructed dataset, and with an additional 92,111-pair extracted dataset to judge whether our proposed model
can be effective or not. We elaborate on this in more detail in the next section.
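A minimal sketch of this extraction step with the sentence-transformers library is shown below; the model name and the 0.77 threshold are taken from the text, while the function itself and its inputs are our own illustration.

```python
from sentence_transformers import SentenceTransformer, util

# Multilingual SBERT model named above; it maps English and Myanmar
# sentences into a shared (aligned) vector space.
model = SentenceTransformer("distilbert-multilingual-nli-stsb-quora-ranking")

def extract_pairs(en_sents, my_sents, threshold=0.77):
    # Encode both sides, then score each aligned pair with cosine similarity.
    u = model.encode(en_sents, convert_to_tensor=True)
    v = model.encode(my_sents, convert_to_tensor=True)
    scores = util.cos_sim(u, v).diagonal()  # similarity of the i-th pair
    return [(en, my) for en, my, s in zip(en_sents, my_sents, scores)
            if s >= threshold]
```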
3 EXPERIMENTS
This section describes the datasets and baseline MT
systems that we have used in this work.
3.1 Datasets
We collected around 224 thousand manually created
English-Myanmar parallel sentence pairs including
bilingual sentences from textbooks, Myanmar local
news, and the ALT Corpus (Riza et al., 2016) for
training. The development and test sets are only from
the ALT corpus. Data statistics are shown in Table 1.
For the task of creating additional machine-translated
pseudo parallel data, we additionally gather 150
thousand monolingual English sentences from the
internet. These sentences nearly match the domain
of the ALT corpus, which primarily contains news
originating from English sources.
Table 1: Statistics of parallel datasets.

Type    Data Source                  Total Sentences
Train   Local News and Textbooks     204,535
Train   ALT                          18,082
Dev     ALT                          1,000
Test    ALT                          1,017
3.2 Baseline MT Systems
We evaluated the effectiveness of the proposed word
segmentation model and the proposed Construct-
Extract framework for construction and extraction of
English-Myanmar parallel corpus by performing ma-
chine translation experiments.
3.2.1 Statistical Machine Translation
We trained a phrase-based SMT (PBSMT) system using the Moses toolkit (Koehn et al., 2007). GIZA++
(Och and Ney, 2003) is used to implement the word
alignment process. For phrase extraction and lexicalized reordering, we applied the grow-diag-final and msd-bidirectional-fe heuristics. For tuning PB-
SMT, we applied the default parameters of Moses.
Moreover, the 5-gram language models were trained
on Myanmar and English monolingual sentences with
Kneser-Ney smoothing using KenLM (Heafield et al.,
2013).
3.2.2 Neural Machine Translation
We trained the Transformer-based NMT models with
the PyTorch version of the OpenNMT project, an open-
source (MIT) neural machine translation framework
(Klein et al., 2018). The Transformer experiments
were run on an NVIDIA Tesla P100 GPU with the parameters listed in Table 2.
ICAART 2021 - 13th International Conference on Agents and Artificial Intelligence
336
Table 2: Parameters for training Transformer models.

-layers 6                  -rnn_size 512
-word_vec_size 512         -transformer_ff 2048
-heads 8                   -encoder_type transformer
-decoder_type transformer  -position_encoding true
-train_steps 200000        -max_generator_batches 2
-dropout 0.1               -batch_size 4096
-batch_type tokens         -normalization tokens
-accum_count 2             -optim adam
-adam_beta2 0.998          -decay_method noam
-warmup_steps 8000         -learning_rate 2
-max_grad_norm 0           -param_init 0
-param_init_glorot true    -label_smoothing 0.1
-valid_steps 1000          -save_checkpoint_steps 1000
-world_size 1              -gpu_rank 0
4 EXPERIMENTAL RESULTS
AND ANALYSIS
In this paper, both of our proposed models are eval-
uated with statistical and neural MT systems: PBSMT and Transformer. Bilingual Evaluation Understudy (BLEU) score is used as the evaluation
metric. The scores are computed using the multi-bleu
script from the Moses toolkit.
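The reported scores come from Moses' multi-bleu script; as a rough Python equivalent (our substitution, not the tooling used in the paper), a corpus-level BLEU can be computed with sacrebleu:

```python
import sacrebleu

# Hypothetical file names: one sentence per line, hypotheses plus a single
# reference stream (corpus_bleu accepts a list of reference streams).
hyps = open("test.hyp", encoding="utf-8").read().splitlines()
refs = [open("test.ref", encoding="utf-8").read().splitlines()]

print(sacrebleu.corpus_bleu(hyps, refs).score)
```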
4.1 Effect of Word Segmentation
A lot of work has been done on the problem of Myan-
mar word segmentation and many word segmentation
methods have been proposed. These segmentation
methods can be roughly classified into dictionary-
based or rule-based, statistical, machine learning and
hybrid approaches (Pa and Thein, 2008; Ding et al.,
2016; Phyu and Hashimoto, 2017; Oo and Soe, 2019).
In the dictionary-based methods, only words that are
stored in a pre-defined dictionary can be identified
and the performance of the segmentation depends to
a large degree upon the coverage of the dictionary.
Increasing the size of the dictionary is not a good so-
lution to the out-of-vocabulary (OOV) word problem
because new words appear constantly. On the other
hand, although the statistical approaches can some-
how solve the problem of unknown words by utilizing
probabilistic or cost-based scoring mechanisms, these
methods also suffer from some drawbacks. The main
issues are that they require large amounts of training data and considerable processing time, and that linguistic knowledge is difficult to incorporate effectively into the segmentation process (Tea-
han et al., 2000). For low-resource languages such
as Myanmar, there are only corpus-based, dictionary-
based, rule-based, and statistical word segmentation
methods freely available for being used as a tempo-
rary solution. Current Myanmar word segmentation
tools can support to obtain better results for some
Myanmar language processing tasks, such as part of
speech (POS) tagging, word sense disambiguation,
text categorization, information retrieval, text summa-
rization, etc. However, they tend to produce a massive number of rare words in both SMT and NMT.
Segmentation errors directly cause translation mistakes, especially in English-to-Myanmar transla-
tion. Although it is not a serious issue in Myanmar-to-
English translation in general, weak Myanmar word
segmentation tools can lead SMT to generate un-
known source words as target translated words be-
cause they cannot find the corresponding target trans-
lation in the phrase table. The same problem also oc-
curs in NMT.
Figure 3 illustrates some translation mistakes gen-
erated by the current English-to-Myanmar MT sys-
tems with currently available Myanmar word seg-
mentation tools. These mistakes include: (i) miss-
ing words or phrases in the target Myanmar transla-
tion, (ii) translating English words into wrong Myan-
mar words, and (iii) generating both English words
and their translated Myanmar words together in the
translation results. Even in short sentence transla-
tion in a Myanmar-to-English direction as in Figure
4, SMT simply copies an unknown Myanmar source word (in red) into the target English output, while NMT omits this word from the translation entirely. Note that this Myanmar
source word (in red) should be translated as “fifteen”
in the target English. This Myanmar word is formed
by combining the two words (one word in green and
another word in blue). The word (in green) is “fif-
teen” in English and the other word (in blue) is nu-
merical classifier; it has no special meaning in English
and is used only in Myanmar language that follows a
number to show what type of thing that the number
is referred to. This is because the current segmenter
can only segment words based on their trained na-
ture, corpus and dictionary that may not be fit with the
available training corpus intended to use in MT tasks.
In this case, Myanmar sentences are segmented using
the UCSYNLP word segmenter (http://www.nlpresearch-ucsy.edu.mm/NLP_UCSY/wsandpos.html), which implements a com-
bined model of bigram with word juncture and works
by longest matching and bigram methods trained on
a pre-segmented corpus of 50,000 words collected
manually from Myanmar text books, newspapers, and
journals (Pa and Thein, 2008).
Figure 3: Translation errors of both statistical and neural English-to-Myanmar MT systems due to the Myanmar word seg-
mentation weakness.
Figure 4: Translation errors of both statistical and neural Myanmar-to-English MT systems due to the Myanmar word seg-
mentation weakness.
Our Myanmar word segmentation model does not require any linguistic resources or manual work. The only requirement is to convert the text to be segmented into Unicode encoding. Currently, Myanmar Unicode converters are freely available both online and offline.
The proposed model is able to learn from the current MT corpus and can thus produce well-suited seg-
mentation results. We analyse the effectiveness of
our segmentation model in the MT experiments by inspecting the translated outputs and by comparing BLEU scores.
In our experiments, we used the Moses tokenizer and truecaser for English texts. For Myanmar, the UCSYNLP word segmenter is used as the baseline model. As ex-
plained in Subsection 2.1, our segmentation model
consists of three components: a normalizer, a trainer,
and a tokenizer. For our normalizer and trainer, we applied the Unicode NFKC normalizer and BPE trainer modules provided by SentencePiece
(Kudo and Richardson, 2018). In the trainer process,
we use a vocabulary size of 32,000 BPE sub-words.
Our tokenizer module internally executes the normal-
izer to normalize the input Unicode character streams
and tokenizes them into sub-word sequences with the
segmentation model trained by the trainer.
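As a sketch, these training and tokenization steps map onto the SentencePiece Python API roughly as follows; the file names and the exact flag set are our assumptions, while the BPE model type, NFKC normalization, and 32,000-sub-word vocabulary follow the text.

```python
import sentencepiece as spm

# Trainer: learn a 32,000-sub-word BPE vocabulary on the Myanmar training
# side; "nfkc" selects SentencePiece's NFKC normalization rule.
spm.SentencePieceTrainer.train(
    input="train.my",            # hypothetical path to raw Myanmar sentences
    model_prefix="my_bpe",
    model_type="bpe",
    vocab_size=32000,
    normalization_rule_name="nfkc",
    character_coverage=1.0,      # keep the full Myanmar script
)

# Tokenizer: normalization runs internally before sub-word segmentation.
sp = spm.SentencePieceProcessor(model_file="my_bpe.model")
subwords = sp.encode("မြန်မာ", out_type=str)  # list of BPE sub-word strings
```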
With the proposed segmentation approach, the performance of the MT systems differs considerably. The results reported in Table 3 and Table 4
Figure 5: Example of translations in English-to-Myanmar direction using SMT and NMT. The translation performance of
both MT systems improved with our segmentation model compared to the baseline UCSYNLP segmenter. NMT with our
segmentation approach generates more accurate and fluent translation outputs.
Table 3: BLEU scores of English-to-Myanmar translation systems on two segmentation models (baseline and ours).

       UCSYNLP Segmenter   Our Segmenter
SMT    4.15                7.63
NMT    5.25                8.11

Table 4: BLEU scores of Myanmar-to-English translation systems on two segmentation models (baseline and ours).

       UCSYNLP Segmenter   Our Segmenter
SMT    9.41                9.19
NMT    10.24               11.59
show that our unsupervised segmentation model can
help the SMT and NMT systems to largely outper-
form the previous baselines. Our results have large
gains on both MT systems in both directions. For
the English-to-Myanmar task, SMT and NMT ob-
tained a BLEU score of 7.63 and 8.11, respectively,
with our proposed Myanmar word segmenter, which
outperforms the previous best result by +3.48 and
+2.86 points. For the Myanmar-to-English direction,
NMT still surpasses the baseline score by 1.35 BLEU
points. In this direction, the score of SMT decreases slightly from 9.41 to 9.19. This is because we did
not specifically care about names and numbers dur-
ing the word segmentation process. Some of the rare
names and numbers in the text are separated into two
or three sub-words, which led to a slight weakness, confined to the word alignment procedure of Myanmar-to-English PBSMT.
To investigate the OOV word issue, we used a copy mechanism in all experiments. The copy mechanism first tries to substitute each OOV word with the target word that has the maximum attention weight with respect to its source word (Luong et al., 2015). When no such target word is found, it copies the source word into the position of the missing target word (Gu et al., 2016).
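As an illustration of the copying step described above (our sketch, not the authors' exact implementation), each <unk> token emitted by the decoder is replaced with the source token that received the highest attention weight at that decoding step:

```python
import numpy as np

def replace_unk(tgt_tokens, src_tokens, attn, unk="<unk>"):
    # attn is a [tgt_len, src_len] matrix of attention weights; for every
    # <unk> in the target, copy the most-attended source token instead.
    out = []
    for t, token in enumerate(tgt_tokens):
        if token == unk:
            out.append(src_tokens[int(np.argmax(attn[t]))])
        else:
            out.append(token)
    return out
```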
Figure 6: A sample of constructed parallel sentences (monolingual English sentences and their corresponding back-translated
Myanmar sentences). The Google translation of the back-translated Myanmar sentence in English is also provided. The score
of how semantically similar the sentences are is calculated with cosine similarity. Only the first two sentence pairs, which have a similarity score of at least 0.77, are extracted as good-quality sentence pairs.
A detailed study of our results in English-
Myanmar bi-directional translation tasks showed that
the number of OOV words decreased considerably
with our proposed Myanmar word segmentation. Fig-
ure 5 shows some example sentences generated by the
English-to-Myanmar MT systems with the baseline
(UCSYNLP segmenter) and with ours. Both SMT
and NMT systems with our segmentation could han-
dle OOV words better than with the baseline segmentation. The blue colored parts of the
sentences in the figure are the correct translation parts
in Myanmar language. It demonstrates that the NMT
system leads to better translation accuracy and flu-
ency than SMT. This part of the experiments is done only on the existing parallel corpus. Based on these results, we concluded that NMT with our segmenter performs best, generating the most accurate and fluent outputs. Therefore, for the next step of
our back-translation in the parallel corpus construc-
tion task, we choose English-to-Myanmar NMT that
was trained on the existing corpus with our segmentation
model as the volunteer pre-trained backward MT sys-
tem.
4.2 Constructed-Extracted Data and
Translation Results
The construct module of our proposed Construct-
Extract model created 150k English-Myanmar pseudo parallel sentence pairs by back-translating 150k monolingual English sentences into the Myanmar language using
English-to-Myanmar NMT (the backward system).
After mixing this constructed corpus with the existing dataset, we have more training data to train all MT systems from scratch. Generally, more
training data helps MT systems improve their per-
formance. However, there is one known challenge of
NMT with low-quality noisy sentences. Some sen-
tences in the back-translated corpus are low in quality.
To investigate and overcome this challenge, we pro-
posed the high-quality sentence pair extraction module. Here, the extract module of our proposed model, which is based on Siamese BERT-networks, identified only 92k high-quality parallel sentence pairs in the constructed corpus.
To evaluate the performance of our proposed ap-
proach, we manually inspected the generated sentences and performed a qualitative analysis. In Figure 6,
we can see the qualitative accuracy for some English-
Myanmar parallel sentences constructed by using
the back-translation approach. The back-translated
Myanmar sentences have been translated into English
using Google Translate, so as to facilitate a compari-
son with the original monolingual English sentences.
Only the first two sentence pairs, which have a similarity score of at least 0.77, are extracted as high-quality
sentence pairs.
Table 5 and Table 6 illustrate the quality of the
corpus created by the proposed Construct-Extract on
Table 5: BLEU scores for English-to-Myanmar MT systems.
Training Data Total Sentences SMT NMT
Existing Parallel Corpus 204,535 7.63 8.11
+Constructed Corpus +150,000 8.92 8.37
+Extracted Corpus (p ≥ 0.77) +92,111 8.61 8.51
Table 6: BLEU scores for Myanmar-to-English MT systems.
Training Data Total Sentences SMT NMT
Existing Parallel Corpus 204,535 9.19 11.59
+Constructed Corpus +150,000 9.43 12.21
+Extracted Corpus (p ≥ 0.77) +92,111 9.38 12.41
the machine translation experiments. In these tables,
we report the BLEU scores of SMT and NMT systems
on three different data-size settings: the existing corpus only, the existing corpus plus constructed data (all back-translated sentence pairs), and the existing corpus plus extracted data (high-quality sentence pairs
from constructed data). In both directions, SMT sys-
tems gain an increase in performance with more
additional data. On the other hand, NMT systems
trained using the extracted pseudo-parallel corpus as
additional data returned the best translation perfor-
mance. These findings suggest that translation accu-
racy of the NMT systems depends on both the size and
quality of the training data. In this scenario, the pro-
posed Construct-Extract mechanism can be the most
useful for obtaining an improved pseudo-parallel cor-
pus.
5 CONCLUSION
The motivation of this work is to improve the translation performance of current English-Myanmar MT systems with the available limited resources. To meet this goal, we present
our two main contributions. The first is a Myanmar word segmentation model built on the ideas of Unicode NFKC normalization and the byte-pair-encoding mechanism. Our segmentation model aims to improve the performance of the backward system in the pseudo parallel corpus construction task using a back-translation mechanism. The second is a parallel corpus extraction methodology based on Siamese-BERT-Networks sentence embeddings and cosine similarity. We validated the
performance of these proposed models by performing
SMT and NMT experiments.
Unlike traditional Myanmar segmenters that make
use of manually prepared resources such as large-
scale training data, dictionaries, etc., our proposed segmentation model needs neither manual work nor any knowledge of the Myanmar language. The
model only requires converting Myanmar text written
in other fonts into Unicode fonts with the use of freely
available tools. Using our proposed segmenter in the preprocessing step of NMT systems improved their translation performance considerably. On the other
hand, the constructed and extracted parallel dataset is
demonstrated to facilitate a significant improvement
in MT quality when compared to a generic system as
shown in our experimental results.
Overall, both our Myanmar word segmenter and the parallel corpus extraction model are beneficial for all MT systems, yielding remarkable increases in BLEU scores on the low-resource Myanmar-English problem, although the constructed corpus alone is less effective at yielding a significant BLEU gain. We hypothesize that this is due to the limited coverage of the sen-
tence categories in the training and test datasets. More
specifically, the training and test sets used by MT
models in our experiments contain sentences from 13
different categories: crime and law, culture and enter-
tainment, disasters and accidents, economy and busi-
ness, education, the environment, health, obituaries,
politics and conflicts, science and technology, sports,
Wackynews, and weather. However, our constructed
pseudo parallel corpus covers only 40 percent of
these categories. In the future, we plan to collect more
monolingual corpora in different categories and extend
the proposed framework with a generative adversarial
network for synthesizing high quality sentence candi-
dates.
ACKNOWLEDGEMENTS
The authors would like to thank the Ministry of Ed-
ucation, Culture, Sports, Science and Technology
(MEXT) of Japan for providing the Japanese Gov-
ernment (Monbukagakusho) Scholarship under which
this work was carried out. This work was also sup-
ported in part by the Asian Office of Aerospace Re-
search and Development (AOARD), Air Force Of-
fice of Scientific Research (Grant no. FA2386-19-1-
4041).
REFERENCES
Ding, C., Thu, Y. K., Utiyama, M., and Sumita, E. (2016).
Word segmentation for Burmese (Myanmar). ACM
Transactions on Asian and Low-Resource Language
Information Processing (TALLIP), 15(4):1–10.
Du, J. and Way, A. (2017). Neural pre-translation for hybrid
machine translation.
Grégoire, F. and Langlais, P. (2018). Extracting parallel sen-
tences with bidirectional recurrent neural networks to
improve machine translation. In Proceedings of the
27th International Conference on Computational Lin-
guistics, pages 1442–1453.
Gu, J., Lu, Z., Li, H., and Li, V. O. (2016). Incorporating
copying mechanism in sequence-to-sequence learn-
ing. arXiv preprint arXiv:1603.06393.
Hangya, V. and Fraser, A. (2019). Unsupervised paral-
lel sentence extraction with parallel segment detection
helps machine translation. In Proceedings of the 57th
Annual Meeting of the Association for Computational
Linguistics, pages 1224–1234.
Heafield, K., Pouzyrevsky, I., Clark, J. H., and Koehn,
P. (2013). Scalable modified Kneser-Ney language
model estimation. In Proceedings of the 51st Annual
Meeting of the Association for Computational Lin-
guistics (Volume 2: Short Papers), pages 690–696.
Klein, G., Kim, Y., Deng, Y., Nguyen, V., Senellart, J., and
Rush, A. M. (2018). OpenNMT: Neural machine trans-
lation toolkit. arXiv preprint arXiv:1805.11462.
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Fed-
erico, M., Bertoldi, N., Cowan, B., Shen, W., Moran,
C., Zens, R., et al. (2007). Moses: Open source toolkit
for statistical machine translation. In Proceedings of
the 45th Annual Meeting of the ACL on Interactive
Poster and Demonstration Sessions, pages 177–180.
Association for Computational Linguistics.
Kudo, T. and Richardson, J. (2018). SentencePiece: A sim-
ple and language independent subword tokenizer and
detokenizer for neural text processing. arXiv preprint
arXiv:1808.06226.
Luong, M.-T., Pham, H., and Manning, C. D. (2015). Ef-
fective approaches to attention-based neural machine
translation. arXiv preprint arXiv:1508.04025.
Och, F. J. and Ney, H. (2003). A systematic comparison of
various statistical alignment models. Computational
Linguistics, 29(1):19–51.
Oo, Y. and Soe, K. M. (2019). Applying RNNs architec-
ture by jointly learning segmentation and stemming
for Myanmar language. In 2019 IEEE 8th Global
Conference on Consumer Electronics (GCCE), pages
391–393. IEEE.
Pa, W. P. and Thein, N. L. (2008). Myanmar word segmen-
tation using hybrid approach. In Proceedings of 6th
International Conference on Computer Applications,
Yangon, Myanmar, pages 166–170.
Phyu, M. L. and Hashimoto, K. (2017). Burmese word
segmentation with character clustering and CRFs. In
2017 14th International Joint Conference on Com-
puter Science and Software Engineering (JCSSE),
pages 1–6. IEEE.
Reimers, N. and Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv
preprint arXiv:1908.10084.
Reimers, N. and Gurevych, I. (2020). Making monolin-
gual sentence embeddings multilingual using knowl-
edge distillation. arXiv preprint arXiv:2004.09813.
Riza, H., Purwoadi, M., Uliniansyah, T., Ti, A. A., Alju-
nied, S. M., Mai, L. C., Thang, V. T., Thai, N. P.,
Chea, V., Sam, S., et al. (2016). Introduction of the
Asian language treebank. In 2016 Conference of The
Oriental Chapter of International Committee for Co-
ordination and Standardization of Speech Databases
and Assessment Techniques (O-COCOSDA), pages 1–
6. IEEE.
Schroff, F., Kalenichenko, D., and Philbin, J. (2015).
FaceNet: A unified embedding for face recognition
and clustering. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pages
815–823.
Sennrich, R., Haddow, B., and Birch, A. (2016a). Improv-
ing neural machine translation models with monolin-
gual data. In Proceedings of the 54th Annual Meeting
of the Association for Computational Linguistics (Vol-
ume 1: Long Papers), pages 86–96, Berlin, Germany.
Association for Computational Linguistics.
Sennrich, R., Haddow, B., and Birch, A. (2016b). Neu-
ral machine translation of rare words with subword
units. In Proceedings of the 54th Annual Meeting of
the Association for Computational Linguistics (Vol-
ume 1: Long Papers), pages 1715–1725, Berlin, Ger-
many. Association for Computational Linguistics.
Teahan, W. J., Wen, Y., McNab, R., and Witten, I. H.
(2000). A compression-based algorithm for Chi-
nese word segmentation. Computational Linguistics,
26(3):375–393.
Xu, G., Ko, Y., and Seo, J. (2019). Improving neural ma-
chine translation by filtering synthetic parallel data.
Entropy, 21(12):1213.
Zhao, H., Utiyama, M., Sumita, E., and Lu, B.-L. (2013).
An empirical study on word segmentation for Chi-
nese machine translation. In International Conference
on Intelligent Text Processing and Computational Lin-
guistics, pages 248–263. Springer.