Constructing High Quality Bilingual Corpus
using Parallel Data from the Web
Sai Man Cheok 1,2, Lap Man Hoi 1,2, Su-Kit Tang 1,2 (https://orcid.org/0000-0001-8104-7887) and Rita Tse 1,2
1 School of Applied Sciences, Macao Polytechnic Institute, Macao SAR, China
2 Engineering Research Centre of Applied Technology on Machine Translation and Artificial Intelligence of Ministry of Education, Macao Polytechnic Institute, Macao SAR, China
Keywords: Machine Translation, CNN Modelling, Bilingual Corpus, Parallel Data.
Abstract: A natural language machine translation system requires a high-quality bilingual corpus to support efficient translation at a high accuracy rate. In this paper, we propose a bilingual corpus construction method using parallel data from the Web, which significantly speeds up the construction. The proposal consists of four phases. Parallel data is first pre-processed and refined into three sets of data for training a CNN model. Using the well-trained model, future parallel data can be selected, classified and added to the bilingual corpus. The training result shows that the test accuracy reaches 98.46%. Furthermore, precision, recall and F1-score are all greater than 0.9, outperforming RNN and LSTM models.
1 INTRODUCTION
Machine learning technology has been applied to many different areas, solving many difficult problems (Lin, 2021) (Chan, 2021) (Chan, 2021). Natural language processing (NLP) is one of these areas, and machine translation in particular requires a high-quality bilingual corpus for efficient and accurate automatic translation (Tse, 2020) (Zin, 2021) (Cheong, 2018). The quality of a bilingual corpus relies on the quality of the datasets used in its construction. In corpus construction, data is generally sourced from paper articles, electronic documents and the Web. As these sources are not standardized in an easily readable or pre-defined format, processing the source data becomes complicated and time-consuming. The digitalization and proofreading of paper materials require a significant post-processing workload. If data is collected manually, significant editing effort is needed. Even when it is collected electronically, the data may contain bias or errors, so proofreading is unavoidable. Web crawling therefore becomes an efficient and effective method for collecting data for a bilingual corpus. To ensure data quality, crawled data must be processed appropriately before being stored in the corpus.
In this paper, we propose a method to construct a high-quality bilingual corpus for machine translation systems using parallel data (articles in at least two different languages) from the Web. Assuming Chinese and Portuguese are the languages to be used, the construction consists of four phases: 1) data collection, 2) data pre-processing (cleaning, segmentation and alignment), 3) model training and 4) classification. Figure 1 depicts the four phases in the construction of the bilingual corpus.
In phase 1, data collection, a web crawler is commonly used to crawl parallel data automatically. A number of web crawling architectures are available, such as hybrid crawlers, focused crawlers and parallel crawlers (Cheok, 2021) (Sharma, 2015) (Cho, 2002) (Chakrabarti, 1999) (Pappas, 2012). They crawl webpages automatically from tree-structured websites for particular information by following embedded hypertext links in pages; the pages are then stored in a repository for further querying.
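As an illustration of this idea, a minimal sketch of a link-following crawler is given below. It assumes the requests and beautifulsoup4 packages; the seed URL, page limit and same-host rule are illustrative choices rather than the hybrid architecture of (Cheok, 2021).

```python
# Minimal sketch of a link-following crawler; not the hybrid crawler of (Cheok, 2021).
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_pages=100):
    """Breadth-first crawl that stays on the seed's host and stores raw HTML."""
    seen, queue, repository = set(), deque([seed_url]), {}
    host = urlparse(seed_url).netloc
    while queue and len(repository) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            page = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        repository[url] = page.text                      # store for later querying
        for link in BeautifulSoup(page.text, "html.parser").find_all("a", href=True):
            target = urljoin(url, link["href"])
            if urlparse(target).netloc == host:          # follow embedded links on the same site
                queue.append(target)
    return repository
```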
In phase 2, three pre-processing steps are included: cleaning, segmentation and alignment. Cleaning is usually done by removing unnecessary or unexpected characters or text and by matching regular expressions between bilingual sentences. Regular expressions are logical formulas used as filtering rules.
Figure 1: Four Phases in the Bilingual Corpus Construction.
To improve cleaning efficiency and accuracy, regular expressions are normally employed to remove unnecessary characters or text (such as tags, comments, errors and duplicates) from web pages. Moreover, to ensure that data crawling can be automated effectively, bilingual alignment is needed. There are two common methods for bilingual alignment: length-based and vocabulary-based (Li, 2010). The length-based method relies only on simple length information; no vocabulary information is needed, so it runs fast and requires minimal storage. The vocabulary-based method uses the vocabulary of the text to achieve a higher accuracy rate, although it is more complex and slower.
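As an illustration of the length-based idea, the short sketch below keeps a candidate sentence pair only when the character-length ratio of the two sides falls inside an assumed band; the band of 0.5 to 2.0 is a placeholder, not a value taken from (Li, 2010).

```python
# Illustrative length-based alignment check: accept a candidate pair when the
# character-length ratio stays inside an assumed band.
def length_ratio_ok(zh_sentence, pt_sentence, low=0.5, high=2.0):
    if not zh_sentence or not pt_sentence:
        return False
    ratio = len(pt_sentence) / len(zh_sentence)
    return low <= ratio <= high
```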
In phase 3, feature engineering creates a segmentation model that represents words and sentences as computer-recognizable patterns in vector format for processing. Existing representation models such as Bag-of-Words (the TF-IDF algorithm) (Zhao, 2018) and word vectors (the one-hot algorithm, the word2vec algorithm, etc.) (Uchida, 2018) are commonly used. After the corpus features are extracted, a CNN model is selected and developed for training because the input data has a two-dimensional structure.
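For instance, the Bag-of-Words (TF-IDF) representation mentioned above can be produced with scikit-learn as sketched below; the two pre-segmented toy sentences are placeholders.

```python
# Small sketch of the TF-IDF representation using scikit-learn (>= 1.0 assumed).
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ["澳門 是 一個 城市", "葡萄牙語 是 官方 語言 之一"]   # pre-segmented, space-separated
vectorizer = TfidfVectorizer(token_pattern=r"[^ ]+")            # one token per space-separated word
tfidf_matrix = vectorizer.fit_transform(sentences)              # rows: sentences, columns: vocabulary
print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray())
```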
Finally, in phase 4, the trained CNN model is used for classification, selecting high-quality parallel data to build the bilingual corpus. The training result shows that the test accuracy reaches 98.46%. Precision, recall and F1-score are all greater than 0.9, outperforming RNN and LSTM models.
The remainder of the paper is structured as follows. Section 2 introduces the construction of the bilingual corpus, starting with data acquisition (Phase 1 and Phase 2). Section 3 describes the training of the CNN model using three sets of data (Phase 3) and the resulting Classification model for selecting high-quality parallel data (Phase 4). Section 4 presents and discusses the training performance and evaluates the CNN model against other models. Finally, Section 5 gives the concluding remarks.
2 DATA ACQUISITION
Parallel data is essential in the construction of a bilingual corpus for machine translation systems. To ensure that the data quality is high, three sets of parallel data are required for model training.
2.1 Data Collection
In phase 1, collecting a large amount of parallel data is complicated and time-consuming, even when it is crawled automatically from the Web. To collect data efficiently, a web crawler for parallel data that can ensure the consistency and accuracy of the bilingual data is highly recommended (Cheok, 2021). As this is
out of the scope of this paper, the crawling and processing of parallel data are not described here. In particular, as the quality of parallel data is crucial to the translation quality, bilingual official websites are highly recommended as data sources.
2.2 Data Pre-processing
In phase 2, crawled parallel data goes through three pre-processing steps: Cleaning, Segmentation and Alignment. Cleaning removes unnecessary characters from the data. Segmentation divides the data into individual segments, such as sentences. Alignment cross-checks the quality of the corresponding data in the other language by comparing it against the output of a translation engine.
2.2.1 Cleaning
Crawled parallel data from websites is first refined by filtering out unnecessary elements, such as HTML tags and entities, punctuation and extra spaces, URLs and image links. This is done because such elements may not appear in both languages of a sentence pair, and they do not contribute to the meaning of the text content. For instance, the HTML markup of a Chinese page generally does not appear in the corresponding Portuguese text. If such noisy sentence pairs are placed in the training set, the accuracy of the model training will be reduced, lowering the accuracy of the final corpus. Thus, in the cleaning process, all unnecessary elements are deleted, ensuring the quality of the parallel data in the training set.
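A minimal sketch of such a cleaning pass is shown below, assuming Python's built-in re module; the patterns cover only tags, URLs and extra whitespace, and real pages may need additional site-specific rules.

```python
# Minimal cleaning pass: strip HTML tags, URLs and image links, collapse whitespace.
import re

TAG_RE = re.compile(r"<[^>]+>")        # HTML tags and comments
URL_RE = re.compile(r"https?://\S+")   # URLs and image links
SPACE_RE = re.compile(r"\s+")

def clean(text):
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()
```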
2.2.2 Segmentation
The Portuguese text content of an article can usually be divided into sentences by punctuation such as the period, exclamation mark, question mark and semicolon. The divided sentences are then stored in pairs as a training set in the bilingual corpus. For Latin languages such as Portuguese, some exceptional cases are expected. A common example is the period ("."), which does not always end a sentence; it may also mark an abbreviation or a numbering symbol. Therefore, a period cannot always be treated as the end of a sentence during segmentation. In Chinese, on the other hand, punctuation marks such as the period, exclamation mark, question mark and semicolon can be used directly for segmentation, dividing the text content into sentences. It is noteworthy that, depending on the language, a language-specific segmentation tool may be needed.
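The sketch below illustrates this difference, assuming a small sample list of Portuguese abbreviations; it is not an exhaustive segmenter for either language.

```python
# Illustrative sentence segmentation; the abbreviation list is a small assumed sample.
import re

PT_ABBREVIATIONS = {"sr.", "sra.", "dr.", "dra.", "art.", "etc."}

def split_portuguese(text):
    # Split after ., !, ? or ; followed by whitespace, then re-join pieces whose
    # last token is a known abbreviation (so "Dr." does not end a sentence).
    pieces = re.split(r"(?<=[.!?;])\s+", text)
    sentences, buffer = [], ""
    for piece in pieces:
        buffer = f"{buffer} {piece}".strip() if buffer else piece
        last_token = buffer.split()[-1].lower() if buffer.split() else ""
        if last_token not in PT_ABBREVIATIONS:
            sentences.append(buffer)
            buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences

def split_chinese(text):
    # Chinese full-width punctuation ends a sentence directly.
    return [s for s in re.split(r"(?<=[。！？；])", text) if s.strip()]
```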
2.2.3 Alignment
After segmentation, data alignment is conducted to ensure that the quality of the parallel data reaches a certain acceptance level. The original Chinese sentences are translated into Portuguese by a third-party translation engine and compared with the original Portuguese sentences. Translating the Portuguese sentences with a third-party engine and comparing them with the Chinese sentences is also acceptable. The similarity test (Ristad, 1998) measures the distance between one string (the source) and another (the target) as the minimal number of deletions, insertions or substitutions required to transform one into the other. If the similarity is greater than or equal to 60%, the pair is stored in the high-quality bilingual corpus. This threshold provides a buffer against differences between translation methods (literal translation and sense-for-sense translation) in translation engines (Baker, 2001). Otherwise, the pair is stored in the pending corpus for subsequent processing. Figure 2 outlines the workflow of data alignment on segmented sentences.
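A sketch of the similarity test is given below. It assumes that similarity is computed as one minus the edit distance divided by the longer string length, and that translate() stands for any third-party translation engine; both are illustrative assumptions, as the paper does not fix the exact normalization.

```python
# Sketch of the alignment similarity test with an assumed normalization and threshold.
def edit_distance(source, target):
    """Minimal number of deletions, insertions or substitutions (Levenshtein)."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def is_high_quality(zh_sentence, pt_sentence, translate, threshold=0.6):
    machine_pt = translate(zh_sentence)                  # third-party engine, zh -> pt
    distance = edit_distance(machine_pt, pt_sentence)
    similarity = 1 - distance / max(len(machine_pt), len(pt_sentence), 1)
    return similarity >= threshold                       # otherwise goes to the pending corpus
```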
After phase 2, three sets of parallel data are created: the training set, the validation set and the test set. The training set is designed for training the Classification model so that future parallel data can be categorized accurately in the bilingual corpus. The training set goes through feature extraction and classification in the CNN model for configuration. Once the model is configured, it is sent to training. During training, the validation set is used to validate the training result, ensuring that the model can accurately and correctly categorize parallel data.
Figure 2: The Workflow of Data Alignment.
If the training result is accepted, the test set is used in the Classification model for categorization. In this model, Chinese is the key language in model training and in the construction of the highly reliable bilingual corpus. Once the Classification model is ready, future crawled parallel data (the fourth set of data) can be categorized.
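One possible way to derive the three sets from the accepted sentence pairs is sketched below, assuming an 80/10/10 split with scikit-learn; the ratio is an illustrative choice, not one specified in this paper.

```python
# Illustrative three-way split of aligned sentence pairs (assumed 80/10/10 ratio).
from sklearn.model_selection import train_test_split

def three_way_split(pairs, seed=42):
    train, rest = train_test_split(pairs, test_size=0.2, random_state=seed)
    validation, test = train_test_split(rest, test_size=0.5, random_state=seed)
    return train, validation, test
```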
3 TRAINING FOR CLASSIFICATION
In Phase 3, one of the parallel data sets, the training set, is further processed by extracting features from the text content. Together with the other two sets of data, the CNN model is developed and trained. Once the training is completed with a satisfactory result, the Classification model for Phase 4 is ready.
3.1 Feature Engineering
Feature engineering is a process that further manipulates the training set to improve the accuracy and efficiency of learning and recognition in the Classification model. In this paper, Chinese word segmentation is employed (Zhang, 2002), which provides three particular functions: new word discovery, batch segmentation and intelligent filtering.
New word discovery. New words are extracted from the Chinese text for the compilation of professional dictionaries. Editing and labelling are introduced into the word segmentation dictionary to improve the accuracy of the word segmentation system and to adapt to new language changes.
Batch segmentation. Automatic recognition of new words, such as personal names, place names and organization names, together with new word tagging and part-of-speech tagging, can be achieved efficiently.
Intelligent filtering. The semantics of the text content are filtered and reviewed intelligently using a comprehensive built-in Chinese word database, identifying multiple variants as well as traditional and simplified characters, and achieving precise semantic disambiguation.
As this segmentation method only supports the Chinese language, other languages may need other dedicated segmentation methods.
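Since the segmenter of (Zhang, 2002) is not shown here, the sketch below uses the open-source jieba library as a stand-in to illustrate the same three ideas: registering discovered new words, batch segmentation and part-of-speech tagging. The registered word is a placeholder example.

```python
# Stand-in sketch using jieba, not the segmenter of (Zhang, 2002).
import jieba
import jieba.posseg as pseg

jieba.add_word("澳門理工學院")   # register a discovered new word / proper name

def segment_batch(sentences):
    # Batch segmentation: space-joined tokens per sentence.
    return [" ".join(jieba.cut(sentence)) for sentence in sentences]

def tag(sentence):
    # Part-of-speech tagging for one sentence.
    return [(word, flag) for word, flag in pseg.cut(sentence)]
```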
3.2 CNN Model
The CNN model is crucial for the accuracy of the Classification result. To achieve this, feature extraction is first conducted, which processes the training set with the following steps (a minimal sketch follows the list):
1. Split each sentence into multiple words;
2. Map each word into a low-dimensional space through the word2vec embedding method;
3. Represent the text expressed by the word vectors in one dimension;
4. Extract the maximum value of each feature vector to represent the feature after convolution with kernels of different heights.
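A compact sketch of these steps as a TextCNN in tf.keras is given below. The vocabulary size, sequence length, embedding dimension and the extra kernel heights (3 and 4) are assumed values; the kernel size 5 and 256 kernels match the configuration reported in Section 4.

```python
# TextCNN sketch: embedding, convolutions of different heights, global max pooling.
import tensorflow as tf

def build_text_cnn(vocab_size=20000, seq_len=100, embed_dim=128,
                   kernel_heights=(3, 4, 5), filters=256, num_classes=2):
    inputs = tf.keras.Input(shape=(seq_len,), dtype="int32")      # word indices per sentence
    x = tf.keras.layers.Embedding(vocab_size, embed_dim)(inputs)  # word2vec-style embedding
    pooled = []
    for height in kernel_heights:                                 # convolution kernels of different heights
        conv = tf.keras.layers.Conv1D(filters, height, activation="relu")(x)
        pooled.append(tf.keras.layers.GlobalMaxPooling1D()(conv)) # keep the maximum of each feature map
    x = tf.keras.layers.Concatenate()(pooled)
    x = tf.keras.layers.Dropout(0.5)(x)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```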
After feature extraction, several parameters of the training engine have to be considered in order to configure the CNN model for the best result. After every round of training against the validation set, the CNN model is sent to a verification process, called Model quality assessment, which tests the classification quality of the model using the test set. If the result is not accepted (below 90%), the model parameters are revised for another round of training. If the result reaches above 90%, the model is accepted. The training results of the model are shown in the next section.
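This acceptance loop can be sketched as below, reusing build_text_cnn from the previous sketch; candidate_params and the fixed epoch and batch-size values are placeholders for the parameter revisions described above.

```python
# Sketch of the model quality assessment loop with an assumed 90% acceptance level.
def train_until_accepted(candidate_params, x_train, y_train, x_val, y_val,
                         x_test, y_test, threshold=0.90):
    for params in candidate_params:                    # one parameter setting per round
        model = build_text_cnn(**params)
        model.fit(x_train, y_train, validation_data=(x_val, y_val),
                  epochs=10, batch_size=64, verbose=0)
        _, test_accuracy = model.evaluate(x_test, y_test, verbose=0)
        if test_accuracy >= threshold:                 # model quality assessment passed
            return model, test_accuracy
    return None, 0.0                                   # no configuration accepted
```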
4 PERFORMANCE EVALUATION
The trained CNN model is brought to Phase 4 as the Classification model if the training result is accepted. For the training, three hardware configurations were used: one server-grade computer with four GPUs, one high-end computer with one GPU and one low-end personal computer with one GPU. Table 1 summarizes the configurations used in the model training.
Table 1: Hardware Configuration for the Model Training.

        | NVIDIA DGX                | Dell XPS                    | Normal PC
CPU     | 64-core AMD EPYC CPU      | Intel Core i7-9750H         | Intel Core i7-6700
GPU     | 4x NVIDIA A100 80 GB GPUs | NVIDIA GeForce GTX 1650 4GB | NVIDIA GeForce GTX 960 2GB
RAM     | 512 GB DDR4               | 16 GB DDR4                  | 16 GB DDR4
Storage | 1.92 TB NVMe drive        | 512 GB M.2 PCIe NVMe SSD    | 256 GB SSD + 2 TB HDD
For the software configuration, the same environment was set up on each machine: TensorFlow version 1.14.0 running on Windows 10 version 2004 (OS Build 19041.867), with the same data sets. For each configuration, the CNN model was trained until the result was accepted. After several rounds of training
with parameter adjustment, a high-accuracy result was obtained, as shown in Table 2.
Table 2: Summary of Training on CNN Model.

CNN model           | NVIDIA DGX | Dell XPS | Normal PC
Training Loss       | 0.038      | 0.044    | 0.062
Training Accuracy   | 98.44%     | 96.74%   | 96.88%
Validation Loss     | 0.039      | 0.04     | 0.046
Validation Accuracy | 99.20%     | 98.86%   | 99.00%
Training Time       | 0:03:51    | 0:06:36  | 0:08:22
As can be seen in Table 2, the CNN model works efficiently on all three machines. The NVIDIA DGX machine requires the least training time, as it has the most computing power. It is also noteworthy that the losses and accuracies of the CNN model differ across the three machines, due to the randomness of weight initialization in the neural network algorithm.
Moreover, an RNN model and an LSTM model were configured for comparison, trained with similar parameters on the same sets of data. In particular, the convolution kernel size and the number of convolution kernels in the CNN model are set to 5 and 256 respectively, and the number of hidden layers of the RNN and LSTM models is set to 2.
The training time required by the NVIDIA DGX, the Dell XPS and the PC for the RNN model was about 18, 30 and 48 hours respectively, while the training time for the LSTM model was about 42, 61 and 98 hours respectively. Table 3 summarizes the training with the RNN model for each configuration, and Table 4 summarizes the training with the LSTM model.
Table 3: Summary of Training on RNN Model.

RNN model           | NVIDIA DGX | Dell XPS | Normal PC
Training Loss       | 0.0033     | 0.0068   | 0.098
Training Accuracy   | 100.00%    | 100.00%  | 100.00%
Validation Loss     | 0.12       | 0.098    | 0.046
Validation Accuracy | 98.13%     | 97.86%   | 97.57%
Training Time       | 18:14:23   | 30:06:48 | 48:36:19
Table 4: Summary of Training on LSTM Model.

LSTM model          | NVIDIA DGX | Dell XPS | Normal PC
Training Loss       | 0.057      | 0.072    | 0.082
Training Accuracy   | 96.75%     | 96.38%   | 95.81%
Validation Loss     | 0.08       | 0.092    | 0.072
Validation Accuracy | 96.70%     | 96.23%   | 97.31%
Training Time       | 42:28:11   | 61:33:15 | 98:29:50
The training results for the three models show that they all achieve high training and validation accuracy at low training and validation losses. However, the training time for CNN is much lower than for the other models, and the CNN model outperforms both the RNN and LSTM models in classification for all configurations.
Given the high accuracy achieved in training, the three models were further tested using the test set. The results show that the CNN model still outperforms the RNN and LSTM models in terms of accuracy, precision, recall and F1-score. Table 5 summarizes the test results for the three models.
Table 5: Testing Result on CNN, RNN and LSTM Models.

              | CNN Model | RNN Model | LSTM Model
Test Accuracy | 98.46%    | 97.90%    | 96.22%
Precision     | 0.98      | 0.98      | 0.96
Recall        | 0.98      | 0.98      | 0.97
F1-score      | 0.98      | 0.98      | 0.97
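The reported test metrics can be reproduced from the test-set predictions with scikit-learn as sketched below; y_true and y_pred are placeholders for the test labels and the predicted classes, and weighted averaging is an assumed choice.

```python
# Sketch of computing the reported test metrics from predictions.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def summarize(y_true, y_pred):
    accuracy = accuracy_score(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted")
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```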
5 REMARKS
In this paper, a method for constructing a bilingual corpus for machine translation systems using parallel data collected from the Web has been proposed. After pre-processing, high-quality data sets can be prepared for training the CNN model. Using the well-trained model, future parallel data can be selected, classified and added to the bilingual corpus. Training has been conducted to define and evaluate the CNN model. The results show that CNN outperforms RNN and LSTM in terms of accuracy.
ACKNOWLEDGEMENTS
This work was supported in part by the research grant
(No.: RP/ESCA-04/2020) offered by Macao
Polytechnic Institute.
REFERENCES
Lin, H., Tse, R., Tang, S.-K., Chen, Y., Ke, W., Pau, G.
(2021). Near-realtime face mask wearing recognition
based on deep learning. In 18th IEEE Annual Consumer
Communications and Networking Conference (CCNC
2021). doi: 10.1109/CCNC49032.2021.9369493
Chan, K. I., Chan, N. S., Tang, S.-K., Tse, R. (2021).
Applying Gamification in Portuguese Learning. In 9th
International Conference on Information and
Education Technology (ICIET 2021), pp.178 – 185.
doi: 10.1109/ICIET51873.2021.9419612
Chan, N. S., Chan, K. I., Tse, R., Tang, S.-K., Pau, G. (2021). ReSPEcT: privacy respecting thermal-based specific person recognition. In Proc. SPIE 11878, Thirteenth International Conference on Digital Image Processing (ICDIP 2021). doi: 10.1117/12.2599271
Tse, R., Mirri, S., Tang, S.-K., Pau, G., Salomoni, P. (2020). Building an Italian-Chinese Parallel Corpus for Machine Translation from the Web. In 6th EAI International Conference on Smart Objects and Technologies for Social Good (GOODTECHS), pp. 265-268. doi: 10.1145/3411170.3411258
Zin, M., Racharak, T., Le, N. (2021). Construct-Extract: An Effective Model for Building Bilingual Corpus to Improve English-Myanmar Machine Translation. In Proceedings of the 13th International Conference on Agents and Artificial Intelligence (ICAART), Volume 2, pp. 333-342. ISBN 978-989-758-484-8; ISSN 2184-433X. doi: 10.5220/0010318903330342
Cheong, S. T., Xu, J., Liu, Y. (2018). On the design of web crawlers for constructing an efficient Chinese-Portuguese bilingual corpus system. In 2018 International Conference on Electronics, Information, and Communication (ICEIC), pp. 1-4. doi: 10.23919/ELINFOCOM.2018.8330698
Cheok, S. M., Hoi, L. M., Tang, S.-K., Tse, R. (2021).
Crawling Parallel Data for Bilingual Corpus Using
Hybrid Crawling Architecture. In 12th International
Conference on Emerging Ubiquitous Systems and
Pervasive Networks (EUSPN-2021), 198, 122-127.
https://doi.org/10.1016/j.procs.2021.12.218.
Sharma, S., Gupta, P. (2015). The anatomy of web crawlers. In International Conference on Computing, Communication & Automation. doi: 10.1109/ccaa.2015.7148493
Cho, J., Garcia-Molina, H. (2002). Parallel crawlers. In
Eleventh International Conference on World Wide Web
(WWW). doi:10.1145/511446.511464
Chakrabarti, S., Berg, M. V., Dom, B. (1999). Focused
crawling: A new approach to topic-specific Web
resource discovery. Computer Networks, 31(11-16),
1623-1640. doi:10.1016/s1389-1286(99)00052-3
Pappas, N., Katsimpras, G., Stamatatos, E. (2012). An
Agent-Based Focused Crawling Framework for Topic-
and Genre-Related Web Document Discovery. In IEEE
24th International Conference on Tools with Artificial
Intelligence. doi:10.1109/ictai.2012.75.
Li, Y. (2010). Study and implementation on key techniques for an example-based machine translation system. In Second IITA International Conference on Geoscience and Remote Sensing. doi: 10.1109/iita-grs.2010.5604108
Zhao, R., Mao, K. (2018). Fuzzy Bag-of-Words Model for Document Representation. IEEE Transactions on Fuzzy Systems, 26(2), 794-804. doi: 10.1109/tfuzz.2017.2690222
Uchida, S., Yoshikawa, T., Furuhashi, T. (2018).
Application of Output Embedding on Word2Vec. In
2018 Joint 10th International Conference on Soft
Computing and Intelligent Systems (SCIS) and 19th
International Symposium on Advanced Intelligent
Systems (ISIS). doi:10.1109/scis-isis.2018.00224.
Ristad, E., Yianilos, P. (1998). Learning string-edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5), 522-532. doi: 10.1109/34.682181
Baker, M., Malmkjær, K. (2001). Routledge Encyclopedia
of Translation Studies. Psychology Press.
Zhang, H., Liu, Q. (2002). Model of Chinese Words Rough Segmentation Based on N-Shortest-Paths Method. Journal of Chinese Information Processing, vol. 5, pp. 1-7.