Lifting Sequence Length Limitations of NLP Models using Autoencoders

Reza Marzban, Christopher Crick

Abstract

Natural Language Processing (NLP) is an important subfield within Machine Learning, and various deep learning architectures and preprocessing techniques have led to many improvements. Long short-term memory (LSTM) is the most well-known architecture for time series and textual data. Recently, models like Bidirectional Encoder Representations from Transformers (BERT), which rely on unsupervised pre-training and transfer learning, have made a huge impact on NLP. All of these models work well on short to average-length texts, but they are all limited in the sequence lengths they can accept. In this paper, we propose inserting an encoder in front of each model to overcome this limitation. If the data contains long texts, doing so substantially improves classification accuracy (by around 15% in our experiments). If instead the corpus consists of short texts that existing models can already handle, the presence of the encoder does not hurt performance. Our encoder can be applied to any type of model that deals with textual data, allowing the model to overcome its length limitation.
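The abstract does not describe the encoder's internals, so the following is only a rough sketch of the general idea of placing an encoder in front of a length-limited model, not the authors' implementation. It assumes a hypothetical linear encoder that maps fixed-size chunks of token embeddings to one code vector each, so that a sequence of 1000 embeddings becomes a sequence of 100 codes. The names `make_encoder` and `compress`, the chunk length, and the dimensions are all illustrative assumptions, and the weights are random rather than learned. In an actual autoencoder they would be trained together with a decoder that reconstructs each chunk from its code.

```python
import random


def make_encoder(chunk_len, emb_dim, code_dim, seed=0):
    """Build a hypothetical linear encoder as a code_dim x (chunk_len * emb_dim) matrix.

    Assumption: weights are random placeholders. A real autoencoder would
    learn them by minimizing reconstruction error with a paired decoder.
    """
    rng = random.Random(seed)
    in_dim = chunk_len * emb_dim
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(code_dim)]


def compress(sequence, weights, chunk_len):
    """Compress a list of embedding vectors into a shorter list of code vectors."""
    emb_dim = len(sequence[0])
    # Pad with zero vectors so the length is a multiple of chunk_len.
    pad = (-len(sequence)) % chunk_len
    padded = sequence + [[0.0] * emb_dim for _ in range(pad)]
    codes = []
    for start in range(0, len(padded), chunk_len):
        # Flatten one chunk of embeddings into a single input vector.
        flat = [x for vec in padded[start:start + chunk_len] for x in vec]
        # One code vector per chunk: a plain matrix-vector product.
        codes.append([sum(w * x for w, x in zip(row, flat)) for row in weights])
    return codes


# Illustrative input: 1000 "tokens" with made-up 8-dimensional embeddings.
rng = random.Random(1)
long_text = [[rng.random() for _ in range(8)] for _ in range(1000)]

W = make_encoder(chunk_len=10, emb_dim=8, code_dim=8)
short = compress(long_text, W, chunk_len=10)
print(len(long_text), "->", len(short))  # 1000 -> 100
```

The shortened sequence keeps the same vector width as the input embeddings, so under these assumptions it could be passed to a downstream model in place of the original token embeddings.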

Paper Citation


in Harvard Style

Marzban R. and Crick C. (2021). Lifting Sequence Length Limitations of NLP Models using Autoencoders. In Proceedings of the 10th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-758-486-2, pages 228-235. DOI: 10.5220/0010239502280235


in Bibtex Style

@conference{icpram21,
author={Reza Marzban and Christopher Crick},
title={Lifting Sequence Length Limitations of NLP Models using Autoencoders},
booktitle={Proceedings of the 10th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2021},
pages={228-235},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0010239502280235},
isbn={978-989-758-486-2},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 10th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - Lifting Sequence Length Limitations of NLP Models using Autoencoders
SN - 978-989-758-486-2
AU - Marzban R.
AU - Crick C.
PY - 2021
SP - 228
EP - 235
DO - 10.5220/0010239502280235