Transfer Learning for Digital Heritage Collections: Comparing Neural Machine Translation at the Subword-level and Character-level

Nikolay Banar, Karine Lasaracina, Walter Daelemans, Mike Kestemont

Abstract

Transfer learning via pre-training has become an important strategy for the efficient application of NLP methods in domains where only limited training data is available. This paper reports on a focused case study in which we apply transfer learning in the context of neural machine translation (French–Dutch) for cultural heritage metadata (i.e. titles of artistic works). Nowadays, neural machine translation (NMT) is commonly applied at the subword level using byte-pair encoding (BPE), because word-level models struggle with rare and out-of-vocabulary words. Because unseen vocabulary is a significant issue in domain adaptation, BPE seems a better fit for transfer learning across text varieties. We discuss an experiment in which we compare a subword-level to a character-level NMT approach. We pre-trained models on a large, generic corpus and fine-tuned them in a two-stage process: first, on a domain-specific dataset extracted from Wikipedia, and then on our metadata. While our experiments show comparable performance for character-level and BPE-based models on the general dataset, we demonstrate that the character-level approach nevertheless yields major downstream performance gains during the subsequent stages of fine-tuning. We therefore conclude that character-level translation can be beneficial compared to the popular subword-level approach in the cultural heritage domain.
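The subword segmentation contrasted with character-level input above is byte-pair encoding: starting from characters, the most frequent adjacent symbol pair is repeatedly merged into a new vocabulary symbol, so frequent words become single tokens while rare words decompose into smaller, reusable pieces. The following is a toy sketch of BPE merge learning for illustration only; it is not the paper's pipeline (the word-frequency input and `num_merges` parameter are illustrative assumptions).

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merges from a {word: frequency} dict.

    Each word is a tuple of symbols (initially characters); at each
    step the most frequent adjacent symbol pair is merged into one
    new symbol, growing the subword vocabulary.
    """
    vocab = {tuple(word): freq for word, freq in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge to every word in the vocabulary.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = learn_bpe_merges({"lower": 5, "lowest": 2}, 4)
# After four merges, the frequent word "lower" has collapsed into a
# single token, while the rarer "lowest" is still split into pieces.
print(merges)
print(vocab)
```

A character-level model, by contrast, simply skips the merge step and feeds each word as its raw character sequence, which is why it has no out-of-vocabulary problem when the target domain (here, artwork titles) uses vocabulary unseen during pre-training.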



Paper Citation


in Harvard Style

Banar N., Lasaracina K., Daelemans W. and Kestemont M. (2020). Transfer Learning for Digital Heritage Collections: Comparing Neural Machine Translation at the Subword-level and Character-level. In Proceedings of the 12th International Conference on Agents and Artificial Intelligence - Volume 1: ARTIDIGH, ISBN 978-989-758-395-7, pages 522-529. DOI: 10.5220/0009167205220529


in Bibtex Style

@conference{artidigh20,
author={Nikolay Banar and Karine Lasaracina and Walter Daelemans and Mike Kestemont},
title={Transfer Learning for Digital Heritage Collections: Comparing Neural Machine Translation at the Subword-level and Character-level},
booktitle={Proceedings of the 12th International Conference on Agents and Artificial Intelligence - Volume 1: ARTIDIGH},
year={2020},
pages={522-529},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0009167205220529},
isbn={978-989-758-395-7},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 12th International Conference on Agents and Artificial Intelligence - Volume 1: ARTIDIGH
TI - Transfer Learning for Digital Heritage Collections: Comparing Neural Machine Translation at the Subword-level and Character-level
SN - 978-989-758-395-7
AU - Banar N.
AU - Lasaracina K.
AU - Daelemans W.
AU - Kestemont M.
PY - 2020
SP - 522
EP - 529
DO - 10.5220/0009167205220529