Using Transfer Learning To Classify Long Unstructured Texts with Small Amounts of Labeled Data

Carlos Rocha, Marcos Dib, Li Weigang, Andrea Nunes, Allan Faria, Daniel Cajueiro, Maísa Kely de Melo, Victor Celestino

2022

Abstract

Text classification is a traditional problem in Natural Language Processing (NLP). Most state-of-the-art implementations require large volumes of high-quality labeled data. Models pre-trained on large corpora have proven beneficial for text classification and other NLP tasks, but they can only take a limited number of symbols as input. This real case study explores different machine learning strategies for classifying a small amount of long, unstructured, and uneven data, in order to find a suitable method with good performance. The collected data consists of texts describing financing opportunities that international R&D funding organizations published on their websites. The main goal is to find international R&D funding for which Brazilian researchers are eligible, in work sponsored by the Ministry of Science, Technology and Innovation (MCTI). We use pre-training and word embedding solutions to learn the relationships between words from other, larger datasets with considerable similarity to ours. Then, using the acquired features and the available dataset from MCTI, we apply transfer learning together with deep learning models to improve the comprehension of each sentence. Compared with the baseline accuracy of 81% on the available datasets and the 85% accuracy achieved through a Transformer-based approach, the Word2Vec-based approach improved accuracy to 88%. The research results serve as a successful case of artificial intelligence in a federal government application.
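The general idea of reusing word embeddings learned elsewhere to classify long texts with few labels can be illustrated with a minimal sketch. This is not the authors' implementation: the two-dimensional vectors in PRETRAINED are hypothetical stand-ins for embeddings trained on a larger corpus, the labels and example texts are invented, and a nearest-centroid rule replaces the deep learning models mentioned in the abstract. The sketch only shows that averaging word vectors yields a fixed-size representation regardless of text length.

```python
import math

# Hypothetical stand-in for embeddings pre-trained on a larger, related corpus.
# Real Word2Vec vectors would have hundreds of dimensions.
PRETRAINED = {
    "grant": [0.9, 0.1],
    "funding": [0.8, 0.2],
    "research": [0.7, 0.3],
    "cookie": [0.1, 0.9],
    "privacy": [0.2, 0.8],
    "policy": [0.3, 0.7],
}


def embed(text):
    """Average the vectors of known words; the result has a fixed size
    no matter how long the text is."""
    vectors = [PRETRAINED[w] for w in text.lower().split() if w in PRETRAINED]
    if not vectors:
        return [0.0, 0.0]
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def fit_centroids(labeled):
    """Compute one mean document vector per label from a small labeled set."""
    groups = {}
    for text, label in labeled:
        groups.setdefault(label, []).append(embed(text))
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in groups.items()
    }


def predict(text, centroids):
    """Return the label whose centroid is closest to the document vector."""
    vector = embed(text)
    return min(centroids, key=lambda label: math.dist(vector, centroids[label]))


# Invented toy data: two labeled examples and one very long unlabeled text.
labeled = [
    ("research grant funding", "opportunity"),
    ("cookie privacy policy", "other"),
]
centroids = fit_centroids(labeled)
long_text = " ".join(["funding for research"] * 500)
print(predict(long_text, centroids))
```

Because the document vector is an average, the 1,500-word input above is handled the same way as a three-word one, which is one reason embedding-based approaches are not bound by a fixed input length.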


Paper Citation


in Harvard Style

Rocha C., Dib M., Weigang L., Nunes A., Faria A., Cajueiro D., Kely de Melo M. and Celestino V. (2022). Using Transfer Learning To Classify Long Unstructured Texts with Small Amounts of Labeled Data. In Proceedings of the 18th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 978-989-758-613-2, pages 201-213. DOI: 10.5220/0011527700003318


in Bibtex Style

@conference{webist22,
author={Carlos Rocha and Marcos Dib and Li Weigang and Andrea Nunes and Allan Faria and Daniel Cajueiro and Maísa Kely de Melo and Victor Celestino},
title={Using Transfer Learning To Classify Long Unstructured Texts with Small Amounts of Labeled Data},
booktitle={Proceedings of the 18th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2022},
pages={201-213},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0011527700003318},
isbn={978-989-758-613-2},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 18th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - Using Transfer Learning To Classify Long Unstructured Texts with Small Amounts of Labeled Data
SN - 978-989-758-613-2
AU - Rocha C.
AU - Dib M.
AU - Weigang L.
AU - Nunes A.
AU - Faria A.
AU - Cajueiro D.
AU - Kely de Melo M.
AU - Celestino V.
PY - 2022
SP - 201
EP - 213
DO - 10.5220/0011527700003318