ELEVEN Data-Set: A Labeled Set of Descriptions of Goods Captured from Brazilian Electronic Invoices

Vinícius Di Oliveira, Li Weigang, Geraldo Filho

2022

Abstract

Classifying short texts with machine learning (ML) models is both promising and challenging for economy-related sectors such as electronic invoice processing and auditing. Labeled short-text data sets are scarce, and building new ones for supervised learning is costly, especially when experts must label them manually. To address this, this research proposes the ELEVEN (ELEctronic inVoicEs in portuguese laNguage) Data-Set in an open data format. This labeled short-text database is composed of product descriptions extracted from electronic invoices. The descriptions are short, unstructured Portuguese texts limited to 120 characters. First, we construct BERT and other models to demonstrate short-text classification using ELEVEN. Then, we show three successful cases, also using the data set we developed, that identify the correct product codes from the short text descriptions of goods captured from electronic invoices and other sources. ELEVEN consists of 1.1 million merchandise descriptions recorded as labeled short texts, annotated by specialist tax auditors, and detailed according to the Mercosur Common Nomenclature. For easy public use, ELEVEN is shared on GitHub at https://github.com/vinidiol/descmerc.
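As a rough illustration of the kind of record the abstract describes (a short description paired with a product code), the following Python sketch checks one labeled example. It is not taken from the paper or its repository. The class and function names are hypothetical, the sample description and code are invented, and the sketch assumes the Mercosur Common Nomenclature (NCM) label is written as an 8-digit code, optionally dotted. Only the 120-character limit comes from the abstract.

```python
import re
from dataclasses import dataclass

MAX_DESCRIPTION_LENGTH = 120        # limit stated in the abstract
NCM_PATTERN = re.compile(r"\d{8}")  # assumption: labels are 8-digit NCM codes


@dataclass(frozen=True)
class LabeledDescription:
    """Hypothetical container for one labeled short text."""
    text: str
    ncm_code: str


def normalize_ncm(raw: str) -> str:
    """Remove dots and whitespace, then require exactly 8 digits."""
    code = re.sub(r"[.\s]", "", raw)
    if not NCM_PATTERN.fullmatch(code):
        raise ValueError(f"not an 8-digit NCM code: {raw!r}")
    return code


def make_record(text: str, raw_ncm: str) -> LabeledDescription:
    """Collapse whitespace in the description and validate both fields."""
    cleaned = " ".join(text.split())
    if not cleaned:
        raise ValueError("empty description")
    if len(cleaned) > MAX_DESCRIPTION_LENGTH:
        raise ValueError(
            f"description has {len(cleaned)} characters, "
            f"limit is {MAX_DESCRIPTION_LENGTH}"
        )
    return LabeledDescription(cleaned, normalize_ncm(raw_ncm))


def ncm_levels(code: str) -> dict:
    """Split an 8-digit code into nested prefixes (2, 4, 6, and 8 digits)."""
    return {
        "chapter": code[:2],
        "heading": code[:4],
        "subheading": code[:6],
        "item": code,
    }


if __name__ == "__main__":
    # Invented sample values, for illustration only.
    record = make_record("  NOTEBOOK  15 POL  ", "8471.30.12")
    print(record)
    print(ncm_levels(record.ncm_code))
```

Splitting the code into prefixes is one way a classifier could be evaluated at coarser levels of the nomenclature. Whether the paper does so is not stated in the abstract.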


Paper Citation


in Harvard Style

Di Oliveira V., Weigang L. and Filho G. (2022). ELEVEN Data-Set: A Labeled Set of Descriptions of Goods Captured from Brazilian Electronic Invoices. In Proceedings of the 18th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 978-989-758-613-2, pages 257-264. DOI: 10.5220/0011524800003318


in Bibtex Style

@conference{webist22,
author={Vinícius Di Oliveira and Li Weigang and Geraldo Filho},
title={ELEVEN Data-Set: A Labeled Set of Descriptions of Goods Captured from Brazilian Electronic Invoices},
booktitle={Proceedings of the 18th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2022},
pages={257-264},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0011524800003318},
isbn={978-989-758-613-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 18th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - ELEVEN Data-Set: A Labeled Set of Descriptions of Goods Captured from Brazilian Electronic Invoices
SN - 978-989-758-613-2
AU - Di Oliveira V.
AU - Weigang L.
AU - Filho G.
PY - 2022
SP - 257
EP - 264
DO - 10.5220/0011524800003318