Assessing the Effectiveness of Multilingual Transformer-based Text Embeddings for Named Entity Recognition in Portuguese

Diego Bernardes de Lima Santos, Frederico Giffoni de Carvalho Dutra (2, a), Fernando Silva Parreiras (3, b) and Wladmir Cardoso Brandão (1, c)

(1) Department of Computer Science, Pontifical Catholic University of Minas Gerais (PUC Minas), Belo Horizonte, Brazil

(2) Companhia Energética de Minas Gerais (CEMIG), Belo Horizonte, Brazil

(3) Laboratory for Advanced Information Systems, FUMEC University, Belo Horizonte, Brazil

Keywords:

Named Entity Recognition, Text Embedding, Neural Network, Transformer, Multilingual, Portuguese.

Abstract:

Recent state-of-the-art named entity recognition approaches are based on deep neural networks that use an attention mechanism to learn how to extract named entities from relevant fragments of text. Usually, training models in a specific language leads to effective recognition, but it requires a lot of time and computational resources. Fine-tuning a pre-trained multilingual model can be simpler and faster, but it is an open question how effective the resulting recognition model can be. This article exploits multilingual models for named entity recognition by adapting and training transformer-based architectures for Portuguese, a challenging and complex language. Experimental results show that multilingual transformer-based text embedding approaches fine-tuned with a large dataset outperform state-of-the-art transformer-based models trained specifically for Portuguese. In particular, we build a comprehensive dataset from different versions of HAREM to train our multilingual transformer-based text embedding approach, which achieves 88.0% precision and 87.8% F1 in named entity recognition for Portuguese, with gains of up to 9.89% in precision and 11.60% in F1 compared to the state-of-the-art single-lingual approach trained specifically for Portuguese.

1 INTRODUCTION

Natural Language Processing (NLP) is a computer science research field with several practical applications, such as automatic text reading and question answering, audio content interpretation, document classification, and predictive text analysis. Usually, NLP systems perform a set of basic preprocessing tasks on input text, such as parsing, tokenization, stop-word removal, stemming and tagging. In particular, Named Entity Recognition (NER) is an NLP tagging task that extracts important information by marking it up in text, such as names of people, places and currency values (Borthwick, 1999). The extracted elements are relevant entities in the textual content that make sense within a context. For instance, the recognition of the entity “New York” as a location in a sentence can be important to detect where a particular event occurred or even to relate that location to other locations, dealing with similar entities or with entities with the same semantic value.

a https://orcid.org/0000-0002-8666-0354

b https://orcid.org/0000-0002-9832-1501

c https://orcid.org/0000-0002-1523-1616

NER is strongly dependent on the context, i.e., words or expressions can be recognized as different types of entity in different contexts. For instance, in the sentence “Mary prays to Saint Paul for health”, the expression “Saint Paul” refers to a person (religious entity), but in the sentence “We will move to Saint Paul next year”, the expression “Saint Paul” refers to a place (location entity). Even if the spelling of a word or expression cited in different sentences is identical, the meaning can be distinct given different contexts. Additionally, sentences are formulated in distinct ways in different languages, and languages differ from each other in structure, form and complexity, which makes NER even more challenging.

Traditional NER approaches use hand-crafted linguistic grammar-based strategies or statistical models that require a large amount of manually annotated training data to recognize entities in text (Marsh and Perzanowski, 1998). For years, Conditional Random Fields (CRF) have been the state-of-the-art strategy for NER, taking context into account in a learning model that supports sequential dependencies between predictions (Lafferty et al., 2001). Recently, approaches based on deep neural networks have achieved even more effective results than CRF for NER (Goldberg, 2016). They learn distributed text representations (text embeddings) from a huge amount of text to build a language model that can be effectively used in several NLP tasks, including NER.

Santos, D., Dutra, F., Parreiras, F. and Brandão, W.

Assessing the Effectiveness of Multilingual Transformer-based Text Embeddings for Named Entity Recognition in Portuguese.

DOI: 10.5220/0010443204730483

In Proceedings of the 23rd International Conference on Enterprise Information Systems (ICEIS 2021) - Volume 1, pages 473-483

ISBN: 978-989-758-509-8

Training deep neural single-lingual models (NLP models trained in a specific language) usually leads to effective entity recognition, but requires a lot of time and computational resources. In addition, such single-lingual approaches require a large amount of training data in each specific language, which is sometimes unavailable or hard to obtain for certain languages. In contrast, fine-tuning a pre-trained multilingual model can be cheaper, simpler and faster, requiring no specific single-language training dataset and less time and computational resources for training. But how effective can multilingual NER models be compared to single-lingual models, particularly for complex languages such as Portuguese?

In this article, we exploit multilingual models for NER by adapting and training transformer-based text embeddings for named entity recognition in Portuguese. In particular, we propose an NER approach by training and fine-tuning a multilingual transformer-based NLP model using a comprehensive dataset we created by combining different versions of HAREM. Additionally, we evaluate our proposed approach by contrasting it with the state-of-the-art (SOTA) single-lingual approach for NER in Portuguese.

Experimental results show that our multilingual approach for NER in Portuguese outperforms the SOTA single-lingual approach with gains of up to 9.89% in precision and 11.60% in F1, achieving 88.00% precision and 87.80% F1 in named entity recognition. The main contributions of this article are:

• We propose a comprehensive dataset to improve the training of NER models for Portuguese by combining different versions of the HAREM dataset.

• We propose a multilingual NER approach for Portuguese by adapting and training different transformer-based neural networks for multilingual NER in English.

• We provide a thorough evaluation of our proposed approach by contrasting it with the SOTA single-lingual approach for NER in Portuguese reported in literature.

The present article is organized as follows: Section 2 presents the theoretical background in named entity recognition, word embeddings and transformer-based architectures of neural networks. Section 3 presents related work reported in literature for NER, including the state-of-the-art approach for NER in Portuguese. Section 4 presents our multilingual NER approach for Portuguese, as well as the comprehensive dataset we create to improve the training of our approach. Section 5 presents the experimental setup and the results of the experiments we carry out to evaluate our proposed approach. Finally, Section 6 concludes this article, suggesting directions for future work.

2 BACKGROUND

Named Entity Recognition (NER) is an NLP task that identifies people, locations, currency, and other relevant information within a text (Borthwick, 1999). While traditional NER approaches use hand-crafted linguistic grammar-based strategies or statistical models that require a large amount of manually annotated training data to recognize entities in text (Marsh and Perzanowski, 1998), recent NER approaches use deep neural networks to learn an effective recognition model (Goldberg, 2016). In particular, they learn text embeddings from a huge amount of text to build a language model that can be effectively used for NER.

2.1 Word Embeddings

Recently, different ways to represent text have emerged, allowing more accurate analyses of textual information, e.g., the analysis of similarity between two words. A distributed text representation, or text embedding, can be generated by deep neural network (NN) approaches that learn language models from a huge natural language corpus. In particular, word embeddings take the form of a continuous vector representation describing the meaning of terms (Levy and Goldberg, 2014). Usually, this distributed representation is a continuous real-valued vector of fixed length, with features that are not mutually exclusive, learned by a NN and typically much smaller than the size of the vocabulary (Bengio et al., 2003).

The continuous vector representations are capable of representing words syntactically, but also allow the learning of the semantic values of terms. That is, word embeddings can capture the similarity between words with similar meaning, even if their spelling is quite different (Mikolov et al., 2013b). Figure 1 presents groups of words with similar context measured by cosine similarity between word embeddings.

Figure 1: Correlation among words represented as continuous vectors measured by cosine similarity. Source: (Xun et al., 2017).

In recent years, different frameworks and algorithms for word embeddings generation have been proposed, particularly WORD2VEC, GLOVE, and FASTTEXT. WORD2VEC (Mikolov et al., 2013a; Mikolov et al., 2013b) is a framework composed of the first efficient word embeddings models, particularly the continuous bag-of-words (CBOW) and the continuous skip-gram (SKIP-GRAM), to learn distributed representations of words from large amounts of unstructured text with billions of words. Training such models does not require dense matrix multiplications and can be done in one day on a dataset of a hundred billion words with a single machine. In particular, the CBOW model is a simplification of the first practical neural language model approach proposed in literature (Bengio et al., 2003), which uses a fully connected feedforward neural network to learn simultaneously a distributed representation for words and the joint probability distribution function for these word representations from a huge corpus of natural language text with millions of words.

In the CBOW model the non-linear hidden layer is removed, the projection layer is shared for all words, and word context is captured by a log-linear classifier trained to predict a target word given its two previous and two next neighboring words. The SKIP-GRAM model is similar to CBOW, but the log-linear classifier is trained to predict the two previous and two next neighboring words given a target word (Mikolov et al., 2013a). Additionally, the models perform subsampling of frequent words, resulting in faster training and improved representations of uncommon words. Moreover, they use two replacement training methods for full softmax, resulting in a speedup and accurate distributed representations, especially for frequent words (Mikolov et al., 2013b). The replacement training methods are hierarchical softmax (Morin and Bengio, 2005) and negative sampling, a simplified NCE (Gutmann and Hyvärinen, 2012).
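To make the difference between the two training objectives concrete, the following minimal sketch builds the training examples each model would see for a window of two words on each side. The function names and the toy sentence are illustrative assumptions, not part of WORD2VEC itself, and subsampling, hierarchical softmax and negative sampling are left out.

```python
def context_window(tokens, i, size=2):
    """Return up to `size` words before and after position i."""
    left = tokens[max(0, i - size):i]
    right = tokens[i + 1:i + 1 + size]
    return left + right


def cbow_examples(tokens, size=2):
    """CBOW: predict the target word from its surrounding context."""
    return [(context_window(tokens, i, size), tokens[i])
            for i in range(len(tokens))]


def skipgram_examples(tokens, size=2):
    """Skip-gram: predict each context word from the target word."""
    return [(tokens[i], c)
            for i in range(len(tokens))
            for c in context_window(tokens, i, size)]


sentence = "we will move to saint paul next year".split()
print(cbow_examples(sentence)[3])       # (['will', 'move', 'saint', 'paul'], 'to')
print(skipgram_examples(sentence)[:2])  # [('we', 'will'), ('we', 'move')]
```

Both models read the same windows; only the direction of the prediction changes, which is why a single helper can serve both.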

An interesting property of the word embeddings learned by WORD2VEC models is that simple vector operations can often produce meaningful results. For instance, the sum of the vector(usa) and the vector(capital) results in a vector close to the vector(washington). Additionally, word embeddings can be combined using simple operations to represent longer pieces of text, such as sentences, paragraphs and documents. For instance, the vector(boston) and the vector(globe) can be combined to get the vector(boston globe). However, the resulting word embeddings are often unable to represent idiomatic phrases that are not compositions of the individual words, such as boston globe. Moreover, the word embeddings learned by WORD2VEC models exhibit a linear structure that makes precise analogical reasoning possible. For instance, the vector(queen) being the nearest representation of the vector(king) minus the vector(man) plus the vector(woman) provides a way to test the analogy pair man:king::woman:queen.
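The analogy test can be sketched in a few lines. The three-dimensional vectors below are hand-made toy values chosen for illustration (an assumption, not real WORD2VEC embeddings), and `analogy` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy embeddings (assumption): axes roughly mean royalty, male, female.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def analogy(a, b, c):
    """Return the word closest to vector(b) - vector(a) + vector(c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("man", "king", "woman"))  # queen
```

As is usual in this kind of test, the three query words are excluded from the candidates, since the target vector tends to stay close to them.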

GLOVE incorporates global statistics of word occurrences, typically captured by count-based language models, in a log-bilinear model for unsupervised learning of word embeddings (Pennington et al., 2014). The intuition is that shallow window models, such as SKIP-GRAM, poorly utilize the statistics of the corpus, since they train on local context windows instead of on global co-occurrence counts. Therefore, training a NN model simultaneously on local context and on global word-word co-occurrence counts, making efficient use of statistics, produces word embeddings with meaningful substructure.

FASTTEXT is another simple and unsupervised approach that learns distributed representations by considering subword units and representing words by a sum of their character n-grams (Bojanowski et al., 2017). It is an extension of the continuous skip-gram model (Kiros et al., 2015) that incorporates n-grams, taking into account the internal structure of words, which is important for morphologically rich languages where many word formations follow rules. For instance, in Latin languages most verbs have dozens of different inflected forms. These languages contain many word forms that occur rarely (or not at all) in the training corpus, making it difficult to learn good word representations. Additionally, FASTTEXT is capable of building word vectors for words that do not appear in the training set. Experimental results on word similarity and word analogy tasks show that FASTTEXT outperforms WORD2VEC models that do not take into account subword information, as well as methods relying on morphological analysis in different languages (Joulin et al., 2016).
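The subword idea can be illustrated with a short sketch that extracts the character n-grams of a word. It assumes the boundary symbols "<" and ">" and the n-gram lengths from 3 to 6 described by Bojanowski et al. (2017); `char_ngrams` is a hypothetical helper, not the FASTTEXT API, and the hashing of n-grams into buckets is omitted.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in boundary symbols '<' and '>'."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(wrapped[i:i + n] for i in range(len(wrapped) - n + 1))
    return grams


print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']

# Two inflected forms of the Portuguese verb "cantar" share several trigrams,
# so a form unseen in training can still borrow information from a seen one.
shared = set(char_ngrams("cantamos", 3, 3)) & set(char_ngrams("cantavam", 3, 3))
print(sorted(shared))  # ['<ca', 'ant', 'can', 'nta']
```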

2.2 Transformers

Transformers are sequence transduction models based exclusively on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention, consequently allowing more parallelization (Vaswani et al., 2017). In particular, the Transformer follows an encoder-decoder structure using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, where the encoder maps an input sequence of symbol representations to a sequence of continuous representations that feeds the decoder, which generates an output sequence of symbols one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next. Experimental results on machine translation and English constituency parsing show that Transformers outperform baseline discriminative models at a fraction of the training cost.

Figure 2: The attention mechanism’s mapping. Source:

(Vaswani et al., 2017).

The attention mechanism is a key differentiator between Transformers and other NN architectures, allowing the estimation of the correlation between elements in a bidirectional way. Typically, there are two attention mechanisms:

• Self-attention: intra-analysis of the embedding vectors of a sentence, performing the similarity calculation between different words within the same sentence. In this analysis the mechanism extracts the correlation between words in the sentence. The direction of the vectors indicates whether the words have similar or distinct semantic values.

• Multi-head attention: projects the embedding vectors into several smaller sub-spaces (heads) and performs the similarity calculation in each of them in parallel. It is similar to the self-attention mechanism, but applied to different portions of the representation, identifying a different relationship between words in each sub-space.
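The similarity calculation behind self-attention can be sketched as scaled dot-product attention (Vaswani et al., 2017). The version below is a simplification under stated assumptions: queries, keys and values are all taken to be the input embeddings themselves, so the learned projection matrices and the split into multiple heads are omitted, and the two-dimensional embeddings are toy values.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X."""
    d = len(X[0])
    # Similarity of every word with every other word, scaled by sqrt(d).
    scores = [[sum(q * k for q, k in zip(xi, xj)) / math.sqrt(d) for xj in X]
              for xi in X]
    weights = [softmax(r) for r in scores]
    # Each output vector is a weighted average of all input vectors.
    output = [[sum(w * xj[c] for w, xj in zip(wr, X)) for c in range(d)]
              for wr in weights]
    return weights, output


# Toy embeddings (assumption): the first two words point in similar directions.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
weights, output = self_attention(X)
print([round(w, 3) for w in weights[0]])
```

Each row of `weights` sums to one, and the first word attends more strongly to the second word than to the third, since their vectors are closer in direction.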

Figure 2 presents the attention mechanism that estimates the correlation between words with similar semantic values, in a bidirectional way. From Figure 2 we can observe that the word “making” has a close relationship with the words “2009” and “laws”, for instance, i.e., the word “making” appears in the same expressions as “2009” and “laws”. This relationship allows the prediction of the next terms in sentences with words with similar meanings.

2.2.1 BERT

BERT (Bidirectional Encoder Representations from Transformers) is a language representation approach designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers (Devlin et al., 2018). In particular, a deep bidirectional TRANSFORMER is pre-trained with masked language model and next sentence prediction objectives, enabling the representation to fuse the left and the right context, thus reducing the need for many heavily-engineered task-specific architectures. BERT is the first fine-tuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many task-specific architectures.

Figure 3 presents the NN layers of the BERT architecture. In particular, we can observe the pre-training and the fine-tuning steps. During pre-training the input dataset is used without labels, so training is unsupervised. There are two main tasks during this stage:

• Token masking: randomly selecting about 15% of the input tokens and applying a mask to them, so that the model is trained to predict these tokens.

• Next sentence prediction (NSP): predicting whether one sentence follows another in the original text, which helps downstream tasks such as question answering.

Figure 3: The BERT layers. Source: (Devlin et al., 2018).
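The token masking task can be sketched as follows. This is a simplified illustration under stated assumptions: every selected token is replaced by a `[MASK]` symbol, whereas BERT as described by Devlin et al. (2018) also leaves some selected tokens unchanged or swaps them for random tokens. The function name, seed and example sentence are hypothetical.

```python
import random


def mask_tokens(tokens, mask_rate=0.15, seed=0, mask_token="[MASK]"):
    """Mask about `mask_rate` of the tokens and remember the originals."""
    rng = random.Random(seed)
    n_masked = max(1, round(mask_rate * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n_masked))
    masked = [mask_token if i in positions else tok
              for i, tok in enumerate(tokens)]
    # The labels are what the model must predict at the masked positions.
    labels = {i: tokens[i] for i in positions}
    return masked, labels


tokens = ("Mary prays to Saint Paul for health and we will move "
          "to Saint Paul next year with the family").split()
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

No labels are needed beyond the text itself, which is why this stage can run on unlabeled corpora.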

After the pre-training step, the output data can be used as input to other NLP tasks. From Figure 3 we can observe that the pre-training output data is used for natural language inference (MNLI), named entity recognition (NER) and question answering (SQuAD).

2.2.2 ROBERTA

ROBERTA (Robustly Optimized BERT Approach) is a BERT-based framework for language model pre-training that extends BERT by training the model with bigger batches, over more data, and on longer sequences, also removing the next sentence prediction objective and dynamically changing the masking pattern applied to the training data (Liu et al., 2019). Experimental results on downstream tasks using the GLUE, RACE and SQuAD benchmarks show that ROBERTA achieves state-of-the-art results, outperforming BERT and XLNET, an autoregressive learning approach (Yang et al., 2019). Table 1 presents the experimental parameters and results (performance measured by precision) comparing ROBERTA and BERT in three different tasks: question answering (SQuAD), natural language inference (MNLI), and sentence classification (SST).

Table 1: BERT/ROBERTA parameters and performance. Source: (Liu et al., 2019).

              BERT-LARGE   ROBERTA
Data          13GB         16GB
Batch size    256          8K
Steps         1M           100K
SQuAD v1.1    90.9         93.6
SQuAD v2.0    81.8         87.3
MNLI-m        86.6         89.0
SST-2         93.7         95.3

From Table 1 we can observe that there are significant differences in training, with changes in the batch size and in the number of training steps. ROBERTA uses a larger dataset than BERT and considerably larger batches, but its training takes a smaller number of steps. Even so, as can be seen in the SQuAD, MNLI and SST tasks, ROBERTA presents better results than BERT.

2.2.3 DISTILBERT

DISTILBERT (Distilled BERT) is a general-purpose, smaller and faster pre-trained version of BERT that retains almost the same language understanding capabilities (Sanh et al., 2019). In particular, it uses language models pre-trained with knowledge distillation, a compression technique in which a compact model is trained to reproduce the behaviour of a larger model or an ensemble of models, resulting in models that are lighter and faster at inference time, while also requiring less computation for training. It keeps 97% of the language understanding capabilities of BERT with a model that is approximately 40% smaller and runs 60% faster. DISTILBERT can be fine-tuned on several downstream tasks, keeping the flexibility of larger models while being small enough to run on the edge, e.g. on mobile devices.

The distillation technique (Hinton et al., 2015) consists of training a distilled (student) model to reproduce the behavior of a larger (teacher) model. Thus, DISTILBERT is a leaner model based on the behavior of the original BERT model. Table 2 presents a comparison of precision performance in different NLP tasks among BERT, DISTILBERT, and ELMO, a deep contextualized word representation approach that models complex syntactic and semantic characteristics of word uses and how these uses vary across different linguistic contexts (polysemy) (Peters et al., 2018).
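The student-teacher idea can be sketched as a loss that compares the softened output distributions of the two models. This is a minimal illustration under stated assumptions: it uses the Kullback-Leibler divergence with a temperature of 2, the logits are toy values, and the additional training objectives that DISTILBERT combines with the distillation term are left out.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperatures give softer distributions."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions."""
    p = softmax(teacher_logits, temperature)  # teacher (target)
    q = softmax(student_logits, temperature)  # student (prediction)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


teacher = [4.0, 1.0, 0.5]        # toy logits (assumption)
good_student = [3.8, 1.1, 0.4]
poor_student = [0.5, 4.0, 1.0]
print(distillation_loss(good_student, teacher))
print(distillation_loss(poor_student, teacher))
```

The loss is zero when the student reproduces the teacher exactly and grows as the two distributions diverge, which is the signal used to train the smaller model.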

Table 2: BERT, DISTILBERT and ELMO performance. Source: (Sanh et al., 2019).

Model         Score   CoLA   MNLI   QNLI
ELMO          68.7    44.1   68.6   76.6
BERT-BASE     79.5    56.3   86.7   88.6
DISTILBERT    77.0    51.3   82.2   87.5

From Table 2 we observe that the performance of DISTILBERT is close to that of BERT, even though it is a smaller model.


2.2.4 ALBERT

ALBERT is another efficient BERT-based architecture with significantly fewer parameters than a traditional BERT architecture (Lan et al., 2019). In particular, ALBERT incorporates parameter reduction techniques that lift the major obstacles in scaling pre-trained models, also acting as a form of regularization that stabilizes the training and helps with generalization. First, it incorporates factorized embedding parametrization, i.e., it decomposes the large vocabulary embedding matrix into two small matrices, thus separating the size of the hidden layers from the size of the vocabulary embedding, making it easier to grow the hidden size without significantly increasing the parameter size of the vocabulary embeddings. Second, it incorporates cross-layer parameter sharing, preventing the number of parameters from growing with the depth of the network. Additionally, ALBERT replaces the next sentence prediction proposed in the original BERT by a self-supervised loss for sentence-order prediction. Experiments with GLUE, RACE and SQuAD benchmarks show that ALBERT achieves state-of-the-art performance on natural language understanding tasks, outperforming BERT, XLNET and ROBERTA.
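The effect of factorized embedding parametrization can be checked with simple arithmetic. A single embedding matrix needs V x H parameters, while the factorized version needs V x E + E x H. In the sketch below the vocabulary size V = 30,000 is an assumption made for illustration; H = 768 and E = 128 are the hidden and embedding sizes listed for ALBERT-BASE in Table 3.

```python
def embedding_parameters(vocab_size, hidden_size, embedding_size=None):
    """Number of embedding parameters, with or without factorization."""
    if embedding_size is None:
        # One V x H matrix, as in BERT.
        return vocab_size * hidden_size
    # A V x E matrix followed by an E x H projection, as in ALBERT.
    return vocab_size * embedding_size + embedding_size * hidden_size


V, H, E = 30_000, 768, 128  # V is an assumed vocabulary size
print(embedding_parameters(V, H))     # 23040000
print(embedding_parameters(V, H, E))  # 3938304
```

Under these assumptions the embedding block shrinks by roughly a factor of six, and growing H adds only E parameters per extra hidden unit instead of V.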

In particular, ALBERT addresses the scalability problem of BERT derived from memory consumption issues. The growth in the number of parameters of BERT has become an important challenge due to the high memory consumption. A few works reported in literature address this problem, by using parallelism (Shazeer et al., 2018) or by effectively managing memory consumption through a cleaning mechanism to minimize performance impact (Gomez et al., 2017). However, the obstacle created by the communication overhead of the BERT architecture is not addressed by these works.

Thus, ALBERT extends BERT in order to reduce the number of parameters by around 89%, improving performance in NLP tasks. Table 3 shows a comparison between the hyperparameters of BERT and ALBERT. Even using fewer parameters than BERT, ALBERT provides improved results in different NLP tasks, such as SQuAD v1.1 (+1.9%), SQuAD v2.0 (+3.1%), MNLI (+1.4%), SST-2 (+2.2%), and RACE (+8.4%), using relatively fewer resources and with a faster training phase (Lan et al., 2019).

3 RELATED WORK

The number of approaches that use Transformers to improve performance in NLP tasks has grown in recent years. Particularly for NER in complex languages, a recent work reported in literature (Arkhipov et al., 2019) uses Transformers for named entity recognition in Slavic languages, achieving an F1 of up to 93% when applied to the Czech language.

Recently, different NN architectures were proposed to perform NER in Portuguese (Souza et al., 2019). In addition to the comparative analysis between the architectures, the authors proposed an effective approach for both word embeddings generation and named entity recognition in Portuguese. The proposed approach uses BERT to first generate the word embeddings for Portuguese and then uses these word embeddings for NER. The authors also evaluate different NN architectures, such as LSTM (Long Short-Term Memory) and BiLSTM (Bidirectional LSTM), for named entity recognition in Portuguese. They also combine these different architectures with CRF (Conditional Random Fields) (Lafferty et al., 2001) to improve performance. Table 4 summarizes the experimental results of the proposed architectures for multilingual (ML) and Portuguese (PT) models in two scenarios: a full scenario using all the HAREM dataset with 10 classes, and a selective scenario using a subset of 5 classes of HAREM where the proposed approach performs better.

From Table 4 we observe that the BERT-LARGE approach outperforms BERT-BASE. Additionally, the LSTM architecture does not provide any gain, whereas adding CRF brings outstanding performance. Moreover, single-lingual models outperform multilingual models for NER in Portuguese. Thus, the best results were obtained with a single-lingual model trained specifically for Portuguese. Although the single-lingual approach performs better, the computational cost of training the model in Portuguese is much higher than that of using a pre-trained multilingual model.

Although the authors provide a single-lingual

SOTA approach for NER in Portuguese, a question

remains: is it possible that multilingual NER models

can outperform single-lingual models, particularly for

complex languages, such as Portuguese?

4 PROPOSED APPROACH

In this section we present our multilingual

transformer-based text embeddings approach for

NER in Portuguese. First, we present a comprehen-

sive dataset we propose to improve the training of

NER models for Portuguese. Second, we present the

architecture of our proposed approach.


Table 3: BERT and ALBERT hyperparameters. Source: (Lan et al., 2019).

Model             Parameters   Sharing   Layers   Embedding size   Hidden size
BERT-BASE         108M         No        12       768              768
BERT-LARGE        334M         No        24       1024             1024
ALBERT-BASE       12M          Yes       12       128              768
ALBERT-LARGE      18M          Yes       24       128              1024
ALBERT-XLARGE     60M          Yes       24       128              2048
ALBERT-XXLARGE    235M         Yes       12       128              4096

Table 4: Performance in precision, recall and F1 of the SOTA single-lingual approach for NER trained specifically for Portuguese in two experimental scenarios. Source: (Souza et al., 2019).

                            Full Scenario                Selective Scenario
Approach                    Precision  Recall  F1        Precision  Recall  F1
CharWNN                     67.16      63.74   65.41     73.98      68.68   71.23
LSTM-CRF                    72.78      68.03   70.33     78.26      74.39   76.27
BiLSTM-CRF+FlairBBP         74.91      74.37   74.64     83.38      81.17   82.26
ML-BERT-BASE                72.97      73.78   73.37     77.35      79.16   78.25
ML-BERT-BASE-CRF            74.82      73.49   74.15     80.10      78.78   79.44
ML-BERT-BASE-LSTM           69.68      69.51   69.59     75.59      77.13   76.35
ML-BERT-BASE-LSTM-CRF       74.70      69.74   72.14     80.66      75.06   77.76
PT-BERT-BASE                78.36      77.62   77.98     83.22      82.85   83.03
PT-BERT-BASE-CRF            78.60      76.89   77.73     83.89      81.50   82.68
PT-BERT-BASE-LSTM           75.00      73.61   74.30     79.88      80.29   80.09
PT-BERT-BASE-LSTM-CRF       78.33      73.23   75.69     84.58      78.72   81.66
PT-BERT-LARGE               78.45      77.40   77.92     83.45      83.15   83.30
PT-BERT-LARGE-CRF           80.08      77.31   78.67     84.82      81.72   83.24
PT-BERT-LARGE-LSTM          72.96      72.05   72.50     78.13      78.93   78.53
PT-BERT-LARGE-LSTM-CRF      77.45      72.43   74.86     83.08      77.83   80.37
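As a reading aid for Table 4, F1 is the harmonic mean of precision and recall, so each F1 value can be recomputed from the two columns next to it. The short sketch below does this for two rows of the full scenario; the function name is hypothetical, and small differences are possible for other rows because the published values are rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in the same unit)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# CharWNN, full scenario: precision 67.16, recall 63.74
print(round(f1_score(67.16, 63.74), 2))  # 65.41
# PT-BERT-LARGE-CRF, full scenario: precision 80.08, recall 77.31
print(round(f1_score(80.08, 77.31), 2))  # 78.67
```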

4.1 Training Dataset

To improve the training of multilingual NER models for Portuguese, we build a comprehensive dataset from HAREM (Santos and Cardoso, 2007). HAREM is a manually annotated dataset used to assess the performance of information systems for named entity recognition in Portuguese. HAREM is widely used by several NLP approaches reported in literature (Souza et al., 2019; de Castro et al., 2018; Gonçalo Oliveira and Cardoso, 2009; Fernandes et al., 2018; Consoli and Vieira, 2019; Pires, 2017). In particular, the HAREM dataset has the following divisions:

• “CD Primeiro HAREM”: 129 documents and

80,060 words.

• “CD Segundo HAREM”: 129 documents and

147,991 words.

• “Mini-HAREM CD”: 128 documents and 54,074

words.

Available at http://www.linguateca.pt

All HAREM divisions were joined into a single unified training dataset. Originally, some expressions in HAREM are ambiguous, i.e., some of them have two entity labels with different meanings. To build the unified training dataset we choose the first classification described in the HAREM dataset, discarding the second one. Thus, every expression is classified with a single entity label. Additionally, the paragraph structure was converted into smaller sentences so that the BERT-based algorithm can receive input data in an appropriate format. Paragraphs of up to 256 tokens were automatically converted into sentences, and the divided paragraphs, together with their entity labels, were also incorporated into the unified training dataset.
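The two preprocessing decisions described above, keeping only the first label of an ambiguous expression and limiting the input to 256 tokens, can be sketched as follows. The "|" separator between alternative labels, the function names and the fixed-size cutting that ignores sentence boundaries are all assumptions made for illustration; the actual procedure used to build the dataset is not detailed in the text.

```python
def first_label(annotation, separator="|"):
    """Keep only the first category of an ambiguous annotation."""
    return annotation.split(separator)[0]


def split_tokens(tokens, labels, max_tokens=256):
    """Cut a labelled paragraph into consecutive pieces of at most max_tokens."""
    return [
        (tokens[i:i + max_tokens], labels[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


print(first_label("PESSOA|LOCAL"))  # PESSOA

# A toy paragraph of 600 tokens, all labelled "O" (outside any entity).
tokens = [f"tok{i}" for i in range(600)]
labels = ["O"] * 600
pieces = split_tokens(tokens, labels)
print([len(piece_tokens) for piece_tokens, _ in pieces])  # [256, 256, 88]
```

Tokens and labels are always sliced together, so each piece stays aligned with its entity annotations.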

4.2 Architecture

The proposed multilingual approach for NER in Portuguese can use multiple transformer-based text embeddings. In particular, we implement and evaluate BERT (Devlin et al., 2018), ROBERTA (Liu et al., 2019) and DISTILBERT (Sanh et al., 2019). Figure 4 presents the architecture of our proposed approach. There are four processing steps: i) Dataset preprocessing; ii) Multilingual transformer-based NER; iii) Fine-tuning; iv) NER prediction.

Figure 4: The architecture of the proposed multilingual transformer-based text embeddings approach for NER in Portuguese.

In the dataset preprocessing step our approach builds the training dataset as described in Section 4.1, removing the original ambiguities in HAREM, standardizing the data into sentences in the format expected by BERT and consolidating them in a single data file. In the second step our approach selects the multilingual transformer-based models for NER, instantiating them in the processing engine and loading the pre-trained multilingual models for the generation of the text embeddings. In the fine-tuning step our approach sets the model hyperparameters for the NER task, generating the final NER model by training the model with Portuguese training data. Finally, in the prediction step our approach loads the trained model, receives all the sentences to be evaluated and generates a final output with the named entities recognized from the input sentences.

The pipeline is flexible: if a new version of the HAREM dataset is published, it can be incorporated into the training dataset, preserving the original content and expanding the volume of data available for training and testing models. Similarly, although three transformer-based approaches were initially used in our experiments, new ones can be plugged in with no impact on the processing workflow.
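One way to obtain this plug-in behaviour is a small model registry consulted by the model selection step. The sketch below is hypothetical: `ModelSpec`, `register` and `select` are names introduced for illustration, and the checkpoint identifiers are assumptions, since the exact pre-trained checkpoints are not stated here.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ModelSpec:
    name: str        # label used in the experiments, e.g. "BERT-BASE"
    checkpoint: str  # pre-trained multilingual checkpoint (assumed identifier)


# Hypothetical registry; checkpoint names are assumptions, not taken from the paper.
REGISTRY: Dict[str, ModelSpec] = {}


def register(name: str, checkpoint: str) -> None:
    """Add a transformer-based model without touching the rest of the pipeline."""
    REGISTRY[name] = ModelSpec(name, checkpoint)


def select(name: str) -> ModelSpec:
    """Model selection step: pick the model to be fine-tuned."""
    if name not in REGISTRY:
        raise KeyError(f"unknown model {name!r}; registered: {sorted(REGISTRY)}")
    return REGISTRY[name]


register("BERT-BASE", "bert-base-multilingual-cased")
register("DISTILBERT", "distilbert-base-multilingual-cased")
register("XLM-ROBERTA", "xlm-roberta-base")

# Plugging in a new approach is a single extra registration (placeholder values).
register("NEW-MODEL", "some-org/new-multilingual-model")

print(select("XLM-ROBERTA"))
```

With this design the preprocessing, fine-tuning and prediction steps only receive a `ModelSpec`, so they do not change when a model is added.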

5 EXPERIMENTS

In this section we present the experiments we carried out to evaluate our proposed approach, including the experimental setup, procedures and results. In particular, the experimental evaluation answers the following research questions:

1. How effective is each of the multilingual BERT-based algorithms for NER in Portuguese?

2. How does our multilingual approach perform compared to the SOTA single-lingual approach for NER in Portuguese?

In our evaluation we consider four distinct training scenarios: i) 70% of data for training and 30% of data for testing; ii) 80% of data for training and 20% of data for testing; iii) 90% of data for training and 10% of data for testing; iv) 95% of data for training and 5% of data for testing. In each scenario, we evaluate BERT (Devlin et al., 2018), XLM-ROBERTA (Lample and Conneau, 2019) and DISTILBERT (Sanh et al., 2019), fine-tuning each for the NER task. For a fair comparison, the same training dataset and setup parameters were used for each BERT-based algorithm.
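The four training scenarios can be reproduced with a simple seeded split, sketched below. Whether the data were shuffled before splitting, and with which seed, is not stated in the text, so the shuffling, the seed value and the placeholder dataset are assumptions of this sketch.

```python
import random
from typing import List, Sequence, Tuple

TRAIN_FRACTIONS = (0.70, 0.80, 0.90, 0.95)  # the four scenarios described above


def split_dataset(sentences: Sequence, train_fraction: float, seed: int = 42) -> Tuple[List, List]:
    """Shuffle with a fixed seed and cut the data into train and test parts."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)  # same seed gives the same order for every model
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


dataset = [f"sentence-{i}" for i in range(1000)]  # placeholder for the unified dataset
scenarios = {frac: split_dataset(dataset, frac) for frac in TRAIN_FRACTIONS}
for frac, (train, test) in scenarios.items():
    print(f"{frac:.0%} train: {len(train)} sentences, test: {len(test)} sentences")
```

Fixing the seed keeps the train and test partitions identical across the compared algorithms, which is what a fair comparison requires.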

ICEIS 2021 - 23rd International Conference on Enterprise Information Systems

Table 5: Performance of our multilingual approach using different BERT-based algorithms with multiple training set variations and 3 epochs of training.

Approach      Train (%)  SEQ SIZE  Precision (%)  Recall (%)  F1 (%)
BERT-BASE     95         128       85.00          86.80       85.90
BERT-BASE     95         256       85.70          86.30       86.00
DISTILBERT    95         128       77.10          82.90       79.90
DISTILBERT    95         256       78.50          83.00       80.70
XLM-ROBERTA   95         128       88.00          87.60       87.80
XLM-ROBERTA   95         256       86.30          88.40       87.30
BERT-BASE     90         128       67.00          74.20       70.40
BERT-BASE     90         256       68.60          75.40       71.80
DISTILBERT    90         128       62.30          68.20       65.10
DISTILBERT    90         256       62.60          69.30       65.80
XLM-ROBERTA   90         128       73.00          78.60       75.70
XLM-ROBERTA   90         256       74.60          79.80       77.10
BERT-BASE     80         128       66.40          69.90       68.10
BERT-BASE     80         256       68.30          71.20       69.70
DISTILBERT    80         128       59.10          64.60       61.70
DISTILBERT    80         256       60.80          64.70       62.70
XLM-ROBERTA   80         128       67.90          70.90       69.40
XLM-ROBERTA   80         256       67.90          71.50       69.70
BERT-BASE     70         128       61.40          62.30       61.80
BERT-BASE     70         256       62.50          64.40       63.40
DISTILBERT    70         128       58.00          59.50       58.80
DISTILBERT    70         256       59.30          61.10       60.20
XLM-ROBERTA   70         128       64.40          64.80       64.60
XLM-ROBERTA   70         256       64.10          64.80       64.40

A large batch size reduces the number of examples sent to the input of the BERT-based algorithm, which negatively impacts performance. Thus, batches of 128 and 256 proved more suitable for our experiments. Batches smaller than 128 could cause truncation issues, that is, sentences would be truncated, generating more data loss.
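The data loss caused by truncation can be estimated before training. The sketch below assumes that the sizes of 128 and 256 refer to the maximum sequence length (the SEQ SIZE column of Table 5); the `truncation_loss` helper and the sentence lengths are made up for illustration and are not measurements from the HAREM data.

```python
from typing import List


def truncation_loss(sentence_lengths: List[int], max_seq_len: int) -> float:
    """Fraction of tokens dropped when every sentence is cut at max_seq_len tokens."""
    total = sum(sentence_lengths)
    kept = sum(min(length, max_seq_len) for length in sentence_lengths)
    return (total - kept) / total if total else 0.0


# Made-up sentence lengths (in tokens), for illustration only.
lengths = [20, 45, 80, 130, 200, 256]
for max_seq_len in (64, 128, 256):
    print(max_seq_len, f"{truncation_loss(lengths, max_seq_len):.1%}")
```

Running the same function over the real sentence lengths would show how much of the training signal each size discards.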

Transformer-based approaches, particularly BERT, usually require few iterations to converge to a model able to provide efficient results (Wolf et al., 2019). We tested different numbers of epochs and set this parameter to 3, for a better balance between training time and model performance. Table 5 presents the performance of our multilingual approach using different BERT-based algorithms with multiple training set variations and 3 epochs of training.

From Table 5 we observe that XLM-ROBERTA outperforms BERT and DISTILBERT in different scenarios. Particularly, the volume of training data impacts the performance of all BERT-based algorithms. Considering the 95% training scenario, XLM-ROBERTA outperforms BERT-BASE by 2.68% in precision, 1.84% in recall and 2.09% in F1, and DISTILBERT by 12.10% in precision, 6.50% in recall and 8.79% in F1. Additionally, we observe that the differences in XLM-ROBERTA performance with batches of 128 and 256 are negligible (0.57% in F1). Recalling our first research question, these experimental results attest to the effectiveness of our multilingual ROBERTA approach for NER in Portuguese.
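The reported gains appear to be relative improvements computed over the best value of each metric in the 95% training scenario of Table 5. The sketch below recomputes them under that assumption; `relative_gain` is a helper introduced here, and the results agree with the figures in the text to within 0.01 percentage points.

```python
def relative_gain(ours: float, baseline: float) -> float:
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


# Best value per metric in the 95% training scenario of Table 5.
best = {
    "XLM-ROBERTA": {"precision": 88.00, "recall": 88.40, "f1": 87.80},
    "BERT-BASE": {"precision": 85.70, "recall": 86.80, "f1": 86.00},
    "DISTILBERT": {"precision": 78.50, "recall": 83.00, "f1": 80.70},
}

gains = {}
for baseline in ("BERT-BASE", "DISTILBERT"):
    gains[baseline] = {
        metric: relative_gain(best["XLM-ROBERTA"][metric], best[baseline][metric])
        for metric in ("precision", "recall", "f1")
    }
    print(baseline, {m: f"{g:.2f}%" for m, g in gains[baseline].items()})
```

For example, the precision gain over BERT-BASE is (88.00 - 85.70) / 85.70, or about 2.68%.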

Table 6 presents the performance of the SOTA single-lingual transformer-based text embeddings approach reported in the literature (Souza et al., 2019) in comparison to our proposed multilingual transformer-based text embeddings approach for NER in Portuguese. From Table 6 we observe that our multilingual approach (XLM-ROBERTA) outperforms the best single-lingual approach (PT-BERT-LARGE-CRF) in the full scenario with gains of 9.89% in precision, 13.31% in recall, and 11.60% in F1. Even considering the selective (best) scenario for the single-lingual approach, the gains are still significant: 3.74% in precision, 7.19% in recall, and 5.47% in F1. Recalling our second research question, these experimental results show that multilingual transformer-based text embeddings approaches fine-tuned with a large dataset outperform SOTA transformer-based models trained specifically for Portuguese.

Multilingual transformer-based approaches for NER become particularly interesting in scenarios where the computational resources are too limited to train single-lingual approaches but the training data for fine-tuning is abundant. In addition, the fine-tuning step can be generalized to any multilingual approach based on BERT. Therefore, ALBERT (Lan et al., 2019) and BART (Lewis et al., 2019), for instance, can be easily implemented in our proposed transformer-based approach, just as we implemented DISTILBERT (Sanh et al., 2019) and ROBERTA (Liu et al., 2019).

Table 6: Performance of the SOTA single-lingual and the proposed multilingual transformer-based text embeddings approaches for NER in Portuguese.

Approach               Prec.   Rec.    F1

Single-lingual (Full Scenario)
LSTM-CRF               72.78   68.03   70.33
BiLSTM-CRF+FlairBBP    74.91   74.37   74.64
ML-BERT-BASE-CRF       74.82   73.49   74.15
PT-BERT-BASE-CRF       78.60   76.89   77.73
PT-BERT-LARGE-CRF      80.08   77.31   78.67

Single-lingual (Selective Scenario)
LSTM-CRF               78.26   74.39   76.27
BiLSTM-CRF+FlairBBP    83.38   81.17   82.26
ML-BERT-BASE-CRF       80.10   78.78   79.44
PT-BERT-BASE-CRF       83.89   81.50   82.68
PT-BERT-LARGE-CRF      84.82   81.72   83.24

Multilingual (Full Scenario)
DISTILBERT             78.50   83.00   80.70
BERT-BASE              85.70   86.30   85.90
XLM-ROBERTA            88.00   87.60   87.80
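The F1 column is the harmonic mean of precision and recall. As a quick consistency check, the sketch below recomputes F1 for three rows of Table 6; `f1_score` is a helper defined here for illustration, and small differences from the reported values can arise because the reported precision and recall are already rounded.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) taken from Table 6.
rows = {
    "PT-BERT-LARGE-CRF (full)": (80.08, 77.31, 78.67),
    "PT-BERT-LARGE-CRF (selective)": (84.82, 81.72, 83.24),
    "XLM-ROBERTA": (88.00, 87.60, 87.80),
}

for name, (p, r, reported) in rows.items():
    print(f"{name}: computed {f1_score(p, r):.2f}, reported {reported:.2f}")
```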

6 CONCLUSIONS

In this article, we assessed the effectiveness of multi-

lingual transformer-based text embeddings for named

entity recognition in Portuguese. Particularly, we

ﬁne-tuned our approach using a large Portuguese

dataset, and we carried out experiments comparing

our approach with the state of the art single-lingual

approach trained speciﬁcally for Portuguese.

Experimental results showed that our multilingual transformer-based approach outperformed the state of the art approach, achieving 88.0% precision and 87.8% F1 in named entity recognition for Portuguese, with gains of up to 9.89% in precision and 11.60% in F1. Additionally, even considering a selective scenario, where the state of the art approach performed better, our approach outperformed it by 3.74% in precision and 5.47% in F1. Thus, our experiments showed that pre-trained multilingual generic language models based on BERT and fine-tuned with a larger dataset can outperform single-lingual specific language models that require a lot of time and computational resources to be trained.

In future work, we intend to evaluate the impact of the size of the unified dataset on the effectiveness of the NER model, as well as to improve the transformer-based algorithms so that batches can be adjusted to smaller sizes, such as 64, allowing us to increase the number of sentences analyzed and possibly obtain better results. In addition, similarly to the state of the art single-lingual approach, we intend to add a CRF layer to our multilingual approach, which can further improve the precision.

ACKNOWLEDGEMENTS

The present work was carried out with the support of the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brazil (CAPES) - Financing Code 001. The authors thank the partial support of

the CNPq (Brazilian National Council for Scientiﬁc

and Technological Development), FAPEMIG (Foun-

dation for Research and Scientiﬁc and Technological

Development of Minas Gerais), CEMIG, FUMEC,

LIAISE and PUC Minas.

REFERENCES

Arkhipov, M., Troﬁmova, M., Kuratov, Y., and Sorokin,

A. (2019). Tuning multilingual transformers for

language-speciﬁc named entity recognition. In Pro-

ceedings of the 7th Workshop on Balto-Slavic Natural

Language Processing, pages 89–93.

Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C.

(2003). A neural probabilistic language model. The

Journal of Machine Learning Research, 3:1137–1155.

Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T.

(2017). Enriching word vectors with subword infor-

mation. Transactions of the Association for Computa-

tional Linguistics, 5:135–146.

Borthwick, A. E. (1999). A maximum entropy approach

to named entity recognition. PhD thesis, New York

University, USA.

Consoli, B. and Vieira, R. (2019). Multidomain contex-

tual embeddings for named entity recognition. Pro-

ceedings of the Iberian Languages Evaluation Forum,

2421:434–441.

de Castro, P. V. Q., da Silva, N. F. F., and da Silva Soares,

A. (2018). Portuguese named entity recognition us-

ing LSTM-CRF. In Proceedings of the International

Conference on Computational Processing of the Por-

tuguese Language, pages 83–92.

Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2018).

BERT: pre-training of deep bidirectional transformers

for language understanding. CoRR, abs/1810.04805.

Fernandes, I., Cardoso, H. L., and Oliveira, E. (2018). Applying deep neural networks to named entity recognition in Portuguese texts. In Proceedings of the 5th International Conference on Social Networks Analysis, Management and Security, pages 284–289.

Goldberg, Y. (2016). A primer on neural network models

for natural language processing. Journal of Artiﬁcial

Intelligence Research, 57(1):345–420.

Gomez, A. N., Ren, M., Urtasun, R., and Grosse, R. B.

(2017). The reversible residual network: Backpropa-

gation without storing activations. In Proceedings of

the 31st International Conference on Neural Information Processing Systems, pages 2211–2221.

Gonçalo Oliveira, H. and Cardoso, N. (2009). Sahara: An online service for HAREM named entity recognition evaluation. In Proceedings of the 7th Brazilian Symposium in Information and Human Language Technology, pages 171–174.

Gutmann, M. U. and Hyvärinen, A. (2012). Noise-contrastive estimation of unnormalized statistical

models, with applications to natural image statis-

tics. The Journal of Machine Learning Research,

13(1):307–361.

Hinton, G., Vinyals, O., and Dean, J. (2015). Distill-

ing the knowledge in a neural network. CoRR,

abs/1503.02531.

Joulin, A., Grave, E., Bojanowski, P., and Mikolov, T.

(2016). Bag of tricks for efﬁcient text classiﬁcation.

CoRR, abs/1607.01759.

Kiros, R., Zhu, Y., Salakhutdinov, R., Zemel, R. S., Tor-

ralba, A., Urtasun, R., and Fidler, S. (2015). Skip-

Thought vectors. In Proceedings of the 29th Con-

ference on Neural Information Processing Systems,

pages 1532–1543.

Lafferty, J. D., McCallum, A., and Pereira, F. C. N. (2001).

Conditional random ﬁelds: Probabilistic models for

segmenting and labeling sequence data. In Proceed-

ings of the 18th International Conference on Machine

Learning, pages 282–289.

Lample, G. and Conneau, A. (2019). Cross-lingual lan-

guage model pretraining. CoRR, abs/1901.07291.

Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P.,

and Soricut, R. (2019). ALBERT: A lite BERT for

self-supervised learning of language representations.

CoRR, abs/1909.11942.

Levy, O. and Goldberg, Y. (2014). Dependency-based word

embeddings. In Proceedings of the 52nd Annual Meet-

ing of the Association for Computational Linguistics,

pages 302–308.

Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mo-

hamed, A., Levy, O., Stoyanov, V., and Zettlemoyer,

L. (2019). BART: denoising sequence-to-sequence

pre-training for natural language generation, transla-

tion, and comprehension. CoRR, abs/1910.13461.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D.,

Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov,

V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Marsh, E. and Perzanowski, D. (1998). MUC-7 evaluation

of IE technology: Overview of results. In Proceedings

of the 7th Message Understanding Conference.

Mikolov, T., Chen, K., Corrado, G. S., and Dean, J. (2013a).

Efﬁcient estimation of word representations in vector

space. CoRR, abs/1301.3781.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and

Dean, J. (2013b). Distributed representations of words

and phrases and their compositionality. In Proceed-

ings of the Advances in Neural Information Process-

ing Systems, pages 3111–3119.

Morin, F. and Bengio, Y. (2005). Hierarchical probabilistic

neural network language model. In Proceedings of the

10th International Workshop on Artiﬁcial Intelligence

and Statistics, pages 246–252.

Pennington, J., Socher, R., and Manning, C. D. (2014).

GloVe: Global vectors for word representation. In

Proceedings of the 2014 Conference on Empirical

Methods in Natural Language Processing, pages

1532–1543.

Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark,

C., Lee, K., and Zettlemoyer, L. (2018). Deep con-

textualized word representations. In Proceedings of

the 2018 Conference of the North American Chapter

of the Association for Computational Linguistics: Hu-

man Language Technologies, pages 2227–2237.

Pires, A. R. O. (2017). Named entity extraction from Portuguese web text. Master’s thesis, Porto University.

Sanh, V., Debut, L., Chaumond, J., and Wolf, T. (2019).

DistilBERT, a distilled version of BERT: smaller,

faster, cheaper and lighter. CoRR, abs/1910.01108.

Santos, D. and Cardoso, N. (2007). Reconhecimento de entidades mencionadas em português: Documentação e actas do HAREM, a primeira avaliação conjunta na área. Linguateca.

Shazeer, N., Cheng, Y., Parmar, N., Tran, D., Vaswani, A.,

Koanantakool, P., Hawkins, P., Lee, H., Hong, M.,

Young, C., et al. (2018). Mesh-TensorFlow: Deep

learning for supercomputers. In Proceedings of the

Advances in Neural Information Processing Systems,

pages 10414–10423.

Souza, F., Nogueira, R. F., and de Alencar Lotufo, R.

(2019). Portuguese named entity recognition using

BERT-CRF. CoRR, abs/1909.10649.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J.,

Jones, L., Gomez, A. N., Kaiser, L., and Polo-

sukhin, I. (2017). Attention is all you need. CoRR,

abs/1706.03762.

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue,

C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtow-

icz, M., et al. (2019). Huggingface’s transformers:

State-of-the-art natural language processing. CoRR,

abs/1910.03771.

Xun, G., Li, Y., Zhao, W. X., Gao, J., and Zhang, A. (2017).

A correlated topic model using word embeddings. In

Proceedings of the 26th International Joint Confer-

ence on Artiﬁcial Intelligence, pages 4207–4213.

Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.,

and Le, Q. V. (2019). XLNet: Generalized autoregres-

sive pretraining for language understanding. CoRR,

abs/1906.08237.
