Paper

Authors: Sai Man Cheok 1,2; Lap Man Hoi 1,2; Su-Kit Tang 1,2 and Rita Tse 1,2

Affiliations: 1 Engineering Research Centre of Applied Technology on Machine Translation and Artificial Intelligence of Ministry of Education, Macao Polytechnic Institute, Macao SAR, China; 2 School of Applied Sciences, Macao Polytechnic Institute, Macao SAR, China

Keyword(s): Machine Translation, CNN Modelling, Bilingual Corpus, Parallel Data.

Abstract: A natural language machine translation system requires a high-quality bilingual corpus to translate efficiently and with high accuracy. In this paper, we propose a method for constructing a bilingual corpus from parallel data on the Web, which significantly speeds up corpus construction. The proposal has four phases. Parallel data is first pre-processed and refined into three sets of data for training a CNN model. Using the trained model, future parallel data can be selected, classified and added to the bilingual corpus. In training, the test accuracy reached 98.46%. Furthermore, precision, recall and F1-score are all greater than 0.9, outperforming RNN and LSTM models.
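For readers unfamiliar with the metrics quoted in the abstract, the following is a minimal sketch of how precision, recall and F1-score are computed for a binary classifier. It is not the authors' code: the function name `precision_recall_f1`, the label convention (1 = sentence pair accepted into the corpus, 0 = rejected) and the toy labels are all assumptions made for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    # Count true positives, false positives and false negatives.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted/present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical labels: 1 = accept the sentence pair, 0 = reject it.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

p, r, f = precision_recall_f1(y_true, y_pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

F1 is the harmonic mean of precision and recall, so it is high only when both are high, which is why it is commonly reported alongside accuracy.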

CC BY-NC-ND 4.0


Paper citation in several formats:
Cheok, S.; Hoi, L.; Tang, S. and Tse, R. (2022). Constructing High Quality Bilingual Corpus using Parallel Data from the Web. In Proceedings of the 7th International Conference on Internet of Things, Big Data and Security - IoTBDS; ISBN 978-989-758-564-7; ISSN 2184-4976, SciTePress, pages 127-132. DOI: 10.5220/0010997000003194

@conference{iotbds22,
author={Sai Man Cheok and Lap Man Hoi and Su{-}Kit Tang and Rita Tse},
title={Constructing High Quality Bilingual Corpus using Parallel Data from the Web},
booktitle={Proceedings of the 7th International Conference on Internet of Things, Big Data and Security - IoTBDS},
year={2022},
pages={127--132},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0010997000003194},
isbn={978-989-758-564-7},
issn={2184-4976},
}

TY - CONF

JO - Proceedings of the 7th International Conference on Internet of Things, Big Data and Security - IoTBDS
TI - Constructing High Quality Bilingual Corpus using Parallel Data from the Web
SN - 978-989-758-564-7
IS - 2184-4976
AU - Cheok, S.
AU - Hoi, L.
AU - Tang, S.
AU - Tse, R.
PY - 2022
SP - 127
EP - 132
DO - 10.5220/0010997000003194
PB - SciTePress