Multiple Relations Classification Using Imbalanced Predictions
Adaptation
Sakher Khalil Alqaaidi, Elika Bozorgi and Krzysztof J. Kochut
School of Computing, University of Georgia, U.S.A.
Keywords:
Text Mining, Knowledge Graphs, Natural Language Processing, Relation Classification.
Abstract:
The relation classification task assigns the proper semantic relation to a pair of subject and object entities; the
task plays a crucial role in various text mining applications, such as knowledge graph construction and entities
interaction discovery in biomedical text. Current relation classification models employ additional procedures
to identify multiple relations in a single sentence. Furthermore, they overlook the imbalanced predictions
pattern. The pattern arises from the presence of a few valid relations that need positive labeling in a relatively
large predefined relations set. We propose a multiple relations classification model that tackles these issues
through a customized output architecture and by exploiting additional input features. Our findings suggest that
handling the imbalanced predictions leads to significant improvements, even on a modest training design. The
results demonstrate superior performance on benchmark datasets commonly used in relation classification.
To the best of our knowledge, this work is the first that recognizes the imbalanced predictions within the
relation classification task.
1 INTRODUCTION
The relation classification (RC) task aims to iden-
tify relations that capture the dependency in every
pair of entities within unstructured text. The task
is employed in several applications, such as knowl-
edge graph construction and completion (Chen et al.,
2020) and entities interaction detection in biomed-
ical text (Bundschus et al., 2008). In knowledge
graphs, it is common to employ relational triples as
the base structure. A triple consists of a subject entity,
an object entity, and a semantic relation connecting
them. For instance, Wikipedia articles rely on the Wikidata knowledge base to provide their content (Vrandečić and Krötzsch, 2014); users can query Wikidata in a
structured format using SPARQL and retrieve the in-
formation as RDF triples. In biomedical text, the RC
task helps in discovering the interactions between en-
tities such as proteins, drugs, chemicals and diseases
in medical corpora.
In the supervised RC task, the objective is to learn
a function that takes a sentence and its tagged entities
as input, then assigns a binary class to each relation
from a predefined set. A positive label indicates that
the relation is valid for an entity pair. Thus, the output
consists of the positive relations. We use this formal
notation for the task:
$$f(W, E, P) = \begin{cases} R, & \text{multiple relations} \\ r, & \text{single relation} \\ \emptyset, & \text{otherwise} \end{cases} \qquad (1)$$

where W is a sequence of words $[w_1, w_2, \ldots, w_n]$ and E is the set of one or more entity pairs. Each entity pair consists of a subject entity and an object entity, where an entity is a sub-sequence of W. P is the predefined relations set. R is a set of multiple relations found for E, r is a single relation, and $\emptyset$ indicates that no relation exists connecting any of the entities. In an
example from the NYT dataset (Riedel et al., 2010)
for the sentence “Johnnie Bryan Hunt was born on
Feb. 28 , 1927 , in rural Heber Springs , in north-
central Arkansas.”, the valid relations are “contains”
and “place lived”. These relations connect the enti-
ties in the pairs (“Arkansas”, “Heber Springs”) and
(“Johnnie Bryan Hunt”, “Heber Springs”), respec-
tively.
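To make the interface of equation 1 concrete, the following sketch (ours, not the paper's released code; the function name, types, and NYT_RELATIONS are illustrative) shows the task signature applied to the NYT example:

```python
from typing import List, Set, Tuple

# Illustrative sketch of the task interface in equation (1).
def classify_relations(
    words: List[str],                     # W: the sentence tokens
    entity_pairs: List[Tuple[str, str]],  # E: (subject, object) pairs
    predefined: Set[str],                 # P: the predefined relations set
) -> Set[str]:
    """Return R, the set of valid relations (possibly empty)."""
    raise NotImplementedError  # learned by the model described in this paper

# Expected behavior on the example sentence:
#   classify_relations(sentence_tokens,
#                      [("Arkansas", "Heber Springs"),
#                       ("Johnnie Bryan Hunt", "Heber Springs")],
#                      NYT_RELATIONS)
#   == {"contains", "place_lived"}
```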
Table 1 shows the average number of relations
in two well known benchmarks, NYT (Riedel et al.,
2010) and WEBNLG (Zeng et al., 2018). Commonly, a sentence incorporates multiple relations, so a single-relation RC approach is only valid in limited cases. However, the majority of works in the literature follow the single-relation approach. Single RC models require an additional
Table 1: The number of predefined relations in the NYT
and WEBNLG datasets, the average number of positive re-
lations in each sentence, the standard deviation, and the per-
centage of sentences with 3 or more positive relations.
Dataset Relations Avg. Stdev. 3+ Rels.
NYT 24 2.00 2.88 18.48%
WEBNLG 216 2.74 2.23 41.72%
preprocessing procedure to identify multiple relations (Wang et al., 2019): the sentence W in equation 1 is replicated, and an entity pair with a single relation r is assigned to each copy. Such an approach not only incurs additional steps but also an added training load. A further downside is the loss of contextual information caused by splitting the entity data across inputs (Qu et al., 2014; Yin et al., 2006), which results in missed accuracy gains.
Besides that, several single RC models evaluate their work on highly class-imbalanced benchmarks, such as Tacred (Zhang et al., 2017), or on datasets with few predefined relations; for instance, SemEval (Hendrickx et al., 2010) has only six relations. Such performance measurements are hard to generalize to real-world scenarios. Additionally, these models employ complicated approaches, such as attention mechanisms, that demand additional training and tuning effort (Wang et al., 2016; Zhou et al., 2016). Furthermore, most approaches neglect the imbalanced prediction pattern that arises when the objective is to predict only one relation out of a large predefined relations set.
The multiple RC approach tackles the previously mentioned problems. However, regular methods are still unable to achieve competitive results, mainly because they fail to adapt to the imbalanced predictions. Despite their ability to predict several relations in one sentence, the number of predicted relations is typically much smaller than the size of the predefined relations set. This gap is shown in Table 1 when comparing the average number of relations with the predefined set size, which indicates a highly imbalanced distribution of positive and negative labels in each sentence. Furthermore, the table shows the percentage of sentences with three or more positive relations, reflecting the importance of the multiple RC task.
In this paper, we propose a Multiple Relations
Classification model using Imbalanced Predictions
Adaptation (MRCA). Our approach adapts to the imbalanced predictions issue by adjusting both the output activation function and the loss function. The loss function we build on has proven effective in several imbalanced tasks; our customization yields additional improvements within the RC task. Furthermore, we utilize the entity features by concatenating additional vectors to the word embeddings at the text encoder level.
The evaluation shows that our approach outperforms other models that reported their multiple RC performance in the relation extraction task on two popular benchmarks. To the best of our knowledge, this is the first work that addresses the imbalanced predictions within the RC task. The ablation study demonstrates the efficacy of our approach's components in adapting to the imbalanced predictions and in utilizing the text and entity features. Furthermore, our model's architecture has a light design that nonetheless yields strong performance. We make our code available online at https://github.com/sa5r/MRCA.
2 RELATED WORK
2.1 Single Relation Classification
Generally, RC models have pursued the approach of generating efficient text representations to identify relations.
Early supervised approaches (Wang, 2008; Fundel
et al., 2007) employed natural language processing
(NLP) tools to extract text features, such as word lex-
ical features, using dependency tree parsers (Klein
and Manning, 2002), part-of-speech (POS) taggers
and named entity recognition. Relex (Fundel et al.,
2007) generated dependency parse trees and trans-
formed them into features for a rule-based method.
With the achievements of neural network meth-
ods, deep learning models utilized a combination of
text lexical features and word embeddings for the in-
put (Gormley et al., 2015; Zhang et al., 2018) while
other approaches (Zhou et al., 2016; Zeng et al., 2014;
Lee et al., 2019; Ding and Xu, 2022) depended on
those embeddings solely to avoid NLP tools error
propagation to later stages (Zeng et al., 2014). Neural network-based models employed word embeddings in different ways: some used embeddings generated by algorithms such as Word2Vec (Mikolov et al., 2013) on custom training data, as in (Gormley et al., 2015; Zeng et al., 2014), while others used embeddings from pre-trained language models (PLMs), such as Glove (Pennington et al., 2014), as in (Zhou et al., 2016; Zhang et al., 2018; Lee et al., 2019; Ding and Xu, 2022). In
(Zhou et al., 2016), authors presented a neural atten-
tion mechanism with bidirectional LSTM layers with-
out any external NLP tools. In C-GCN (Zhang et al.,
2018), the dependency parser features were embed-
ded into a graph convolution neural network for RC.
TANL (Paolini et al., 2021) is a framework that solves several structure prediction tasks, including RC, in a unified way. The authors showed that classifiers cannot benefit from extra latent knowledge in PLMs, and ran their experiments on the T5 language model.
Bert (Devlin et al., 2018) is a contextualized PLM that has presented significant results in various NLP tasks, and several RC models employed it (Wu and He, 2019; Baldini Soares et al., 2019; Cohen et al., 2020; Karaevli and Güngör, 2022). The earliest was
R-Bert (Wu and He, 2019), where authors customized
Bert for the RC task by adding special tokens for
the entity pairs. Later, Bert’s output was used as
an input for a multi-layer neural network. In (Co-
hen et al., 2020), the traditional classification was
replaced with a span prediction approach, adopted
from the question-answering task. In (Karaevli and Güngör, 2022), the model combined short dependency path representations generated by dependency parsers with R-Bert-generated embeddings.
2.2 Multiple Relations Classification
Methods that classify multiple relations in a single input pass vary in their usage of NLP tools, neural networks, and PLMs. Senti-LSSVM (Qu et al., 2014) is an SVM-based model whose authors explained the performance consequences of handling multi-relational sentences with a single-relation approach.
CopyRE (Zeng et al., 2018) is an end-to-end entity tagging and RC model that leveraged the copy mechanism (Gu et al., 2016) and did not use a PLM; instead, the model used the training platform's embedding layer to generate word embeddings. In the RC part of the model, the authors used a single layer to make predictions through the softmax function. Inspired by CopyRE, CopyMTL (Zeng et al., 2020) is a joint entity and relation extraction model with a seq2seq architecture that followed CopyRE's approach to representing text.
Several models employed Bert in the RC task
(Wang et al., 2019; Li and Tian, 2020). The work
in (Wang et al., 2019) elaborated on the flaws of the
single relation prediction in multi-relational sentences
and presented a model that is based on customizing
Bert. Specifically, the model employed an additional
prediction layer and considered the positions of the
entities in the input. In (Li and Tian, 2020), authors
showed that RC is not one of the training objectives
in the popular PLMs. Therefore, they leveraged Bert
and used a product matrix to relate the identified rela-
tions to the sentence entities.
The GAME model (Cheng et al., 2022) used the NLP tool Spacy (Honnibal and Montani, 2017) to generate word embeddings; it is based on graph convolution networks that capture global sentence dependencies and entity interaction features. ZSLRC (Gong and Eldardiry, 2021) is a zero-shot learning model that used the Glove PLM; we mention it because it reports supervised learning performance on the RC task.
3 METHODOLOGY
Our model incorporates two main components: an output adaptation module and an input utilization module. Between the input and output modules, we employ a lightweight neural network to keep the number of trainable parameters low while retaining strong performance. We use an average pooling layer to reduce the dimensionality of the network before the output layer, and a dropout layer to counter overfitting during training. Finally, in the output layer, each unit represents a relation. Figure 1 shows the main architecture of our model.
3.1 Text Encoder
We utilize Glove (Pennington et al., 2014) pre-computed word embeddings to encode the input sentences. Glove embeddings are retrieved from a key-value store where lowercase words are the keys into a float vector matrix $R^{s \times d}$, where s is the vocabulary size and d is the embedding dimension. We find Glove more convenient for the task because it tackles the out-of-vocabulary (OOV) problem (Woodland et al., 2000); specifically, Glove's most used variant (https://nlp.stanford.edu/projects/glove/) has a relatively large dictionary of 400,000 words. However, the embeddings are context-free and the keys are case-insensitive. Other popular PLMs have much smaller vocabularies but support the features Glove lacks. For instance, Bert (Devlin et al., 2018) generates contextual embeddings and supports character case. Nevertheless, the commonly used Bert variant (https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4) has only 28,997 vocabulary entries; OOV words therefore get their representations from the latent training parameters (Nayak et al., 2020). At the same time, several studies showed that RC is not one of the training objectives in Bert (Li and Tian, 2020; Liu et al., 2019). Thus, we adjust Glove to provide the missing features as follows.
First, having case-sensitive embeddings is essential to denote entity words in the sentence, since recognizing entities in the RC task is crucial to detecting the proper relation.
[Figure 1 diagram: word embeddings concatenated with a case vector and an entity vector feed a bidirectional LSTM, followed by average pooling, a dropout layer, and a k-unit relations output with tanh and linear activations, trained with the RC Dice loss.]
Figure 1: The main architecture of our model. The adaptation approach uses a linear activation function in the output and the
Dice loss extension. Furthermore, we enhance the embeddings by adding two vectors, a character case vector and an entity
type vector denoted by the orange and blue squares.
Generally, a word with an uppercase first character is an entity word. Thus, we add an additional value to the word embedding to denote the case of the first character. For words with an uppercase first character, we use the ceiling of the largest value in the Glove vector matrix. Formally, this value is computed as follows:

$$v = \left\lceil \max_{1 \le i \le s} \left( \max_{1 \le j \le d} R[i][j] \right) \right\rceil \qquad (2)$$

where R is the vector matrix in Glove, s is the vocabulary size, and d is the embedding dimension. For words with a lowercase first character, we use the negative value of v. We employ these maximum-magnitude values to boost the distinction between entity words and non-entity words. The orange square in Figure 1 denotes the first-character case vector.
Second, to provide contextual sentence representation, we make use of a bidirectional long short-term memory (LSTM) network as the first layer in the model architecture.
Although we employ a large vocabulary for encoding the sentence, a few words remain unmatched; we generate their embeddings by combining character-level embeddings.
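As a minimal sketch of these encoder adjustments (ours; it assumes the Glove matrix is loaded as a NumPy array, and mean-combining the character embeddings is our assumption, since the combination method is not specified further):

```python
import math
from typing import Dict
import numpy as np

def case_value(glove_matrix: np.ndarray) -> float:
    """Equation (2): ceiling of the largest entry in the s x d Glove matrix."""
    return float(math.ceil(glove_matrix.max()))

def case_feature(word: str, v: float) -> float:
    """+v for an uppercase first character, -v for a lowercase one."""
    return v if word[:1].isupper() else -v

def oov_embedding(word: str, glove: Dict[str, np.ndarray],
                  dim: int) -> np.ndarray:
    """Fallback for out-of-vocabulary words: combine the character-level
    embeddings (the mean is our assumption)."""
    chars = [glove[c] for c in word.lower() if c in glove]
    return np.mean(chars, axis=0) if chars else np.zeros(dim)

# Each word embedding is widened by one dimension with the case feature:
#   augmented = np.append(word_vector, case_feature(word, v))
```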
Entity Features. As shown in equation 1, the task input consists of subject and object entities in addition to the sentence. We enrich the input with these details by following an approach similar to the previous step, appending one more value to each word representation. Specifically, we append the value v from equation 2 when the input indicates that the word is a subject entity or part of one, the negative value of v for object-entity words, and 0 for non-entity words. The dense blue square in Figure 1 denotes this additional vector. Formally, the value is given by the function $f_{entVec}$ as follows:

$$f_{entVec}(w) = \begin{cases} v, & w \in E_{sub} \\ -1 \times v, & w \in E_{obj} \\ 0, & w \notin E_{sub} \cup E_{obj} \end{cases} \qquad (3)$$

where w is a word in the sentence, $E_{sub}$ is the set of subject entities, and $E_{obj}$ is the set of object entities. We use the negative value for object entities to emphasize the difference between entity types and to make the relation direction between entity pairs recognizable during training.
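A direct transcription of equation 3 (a sketch in plain Python; treating entity membership as a per-token set lookup is our simplification for multi-word entities):

```python
from typing import Set

def ent_vec(word: str, e_sub: Set[str], e_obj: Set[str], v: float) -> float:
    """Equation (3): +v for subject-entity words, -v for object-entity
    words, and 0 for words outside both entity sets."""
    if word in e_sub:
        return v
    if word in e_obj:
        return -1.0 * v
    return 0.0
```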
3.2 Imbalanced Predictions Adaptation
In real-world scenarios, the number of relations in a sentence is much smaller than the number of predefined relations in the RC task; consider the gap in Table 1 between the WEBNLG relations count and the average number of valid relations per sentence. We argue that it is impractical to employ traditional probability activation functions in neural networks (NNs) for this case. For instance, sigmoid and softmax are commonly used functions in NNs (Chollet, 2021), yet these functions treat positive and negative predictions equally: all probability predictions of 0.5 or greater are considered positive label predictions. Thus, we improve the model's ability to predict negative labels by devoting 75% of the prediction range to them. We implement this by restricting the model's layer outputs to values between -1 and 1, applying the tanh activation function to the first layer and a linear activation function to the output layer. As a result, three quarters of the prediction range is allocated to negative labels, i.e., all predictions between -1 and 0.5 indicate a negative label. Figure 2 compares the prediction ranges of a probability activation function (sigmoid) and of our tanh-based output.
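Under these choices, the network of Figure 1 can be sketched in Keras as follows (our reconstruction from the values in Table 4; the exact wiring is an assumption, not the released implementation):

```python
import tensorflow as tf

def build_model(seq_len: int = 100, emb_dim: int = 302,
                k_relations: int = 216) -> tf.keras.Model:
    # Per-token input: Glove embedding plus the case and entity features
    # (emb_dim = 300 + 2 under our assumptions).
    tokens = tf.keras.Input(shape=(seq_len, emb_dim))
    # tanh-activated bidirectional LSTM (2 x 500 units) keeps its outputs
    # within [-1, 1].
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(500, activation="tanh",
                             return_sequences=True))(tokens)
    # Average pooling (pool size 80, strides 2) reduces dimensionality
    # before the output layer.
    x = tf.keras.layers.AveragePooling1D(pool_size=80, strides=2)(x)
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dropout(0.15)(x)
    # Linear output layer: one unit per relation; predictions below 0.5
    # are read as negative labels, reserving ~75% of the [-1, 1] range.
    logits = tf.keras.layers.Dense(k_relations, activation="linear")(x)
    return tf.keras.Model(tokens, logits)
```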
Figure 2: Comparison between prediction ranges in the sig-
moid function and our implementation.
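At inference time, reading labels off the linear outputs then reduces to a simple threshold; a sketch of this decoding rule (the helper name is ours):

```python
import numpy as np

def decode_relations(scores: np.ndarray, relation_names: list) -> set:
    """A relation is positive iff its output score is >= 0.5; everything
    in [-1, 0.5) is read as a negative label."""
    return {name for name, s in zip(relation_names, scores) if s >= 0.5}
```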
Dice Loss Extension. Traditionally, straightforward classification models employ cross-entropy loss functions (Chollet, 2021), which optimize accuracy, whereas the RC task objective is to reduce false positive and false negative predictions. Thus, we seek to improve precision and recall, i.e., to enhance the model's F1 score. Dice loss has shown significant results in several domains, such as computer vision (Huang et al., 2018) and other NLP tasks with imbalanced data (Li et al., 2020). The function was designed with inspiration from the F1 metric, as follows:
$$DiceLoss(y_i, p_i) = 1 - \frac{2 p_i y_i + \gamma}{p_i^2 + y_i^2 + \gamma} \qquad (4)$$
where $y_i$ is the ground-truth label for relation i, $p_i$ is the prediction value, and γ is a smoothing term added to the numerator and the denominator, set to the small value 1e-6 in our implementation.
Utilizing Dice loss on our adapted predictions may incur unconventional behaviour when a negative ground-truth label coincides with a negative-valued prediction: Dice loss then yields a high loss, whereas a low loss is the expected result. Our analysis in Table 2 shows the invalid loss values alongside the expected ones. To address this issue, we expand our adaptation by implementing an extension of Dice loss.
Table 2: Loss calculations for ground truth y and the prediction value p in Dice loss and in our implementation. The Dice loss values for y = 0 with p < 0.5 are the unconventional ones.
y p Expected loss Dice loss RC Dice loss
0 1 1 0.9 0.9
0 0.1 0 0.9 9e-13
0 -0.1 0 0.9 9e-13
0 -1 0 0.9 9e-13
1 1 0 0 0
1 0 1 0.9 0.9
1 -1 >1 1.9 1.9
Specifically, we address the negative-prediction case by computing the loss as a division whose numerator is the squared smoothing value and whose denominator is the regular Dice loss denominator. Raising the smoothing value to the second power is necessary to produce a small loss value. Our corrected loss values can be observed in Table 2. We call this extension RC DiceLoss, formally defined as follows:
$$RC\,DiceLoss(y_i, p_i) = \begin{cases} \dfrac{\gamma^2}{p_i^2 + y_i^2 + \gamma}, & y_i = 0 \text{ and } p_i < 0.5 \\[2ex] 1 - \dfrac{2 p_i y_i + \gamma}{p_i^2 + y_i^2 + \gamma}, & \text{otherwise} \end{cases} \qquad (5)$$
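A sketch of equation 5 in TensorFlow (ours; the element-wise masking via tf.where and the mean aggregation over the k outputs are our implementation choices):

```python
import tensorflow as tf

GAMMA = 1e-6  # smoothing value, as in the paper

def rc_dice_loss(y_true: tf.Tensor, y_pred: tf.Tensor) -> tf.Tensor:
    """Equation (5): Dice loss with a corrected branch for negative
    ground-truth labels met by negative-range predictions."""
    denom = tf.square(y_pred) + tf.square(y_true) + GAMMA
    dice = 1.0 - (2.0 * y_pred * y_true + GAMMA) / denom
    # Negative label correctly predicted in the negative range: return the
    # near-zero value gamma^2 / denom instead of the high Dice value.
    corrected = tf.square(GAMMA) / denom
    negative_case = tf.logical_and(tf.equal(y_true, 0.0), y_pred < 0.5)
    return tf.reduce_mean(tf.where(negative_case, corrected, dice))
```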
4 EXPERIMENTS
4.1 Datasets and Experimental Setup
To demonstrate the generalization and applicability of our model, we evaluated it on diverse and widely used datasets. The NYT dataset (Riedel et al., 2010) was generated from a large corpus of New York Times articles; each input item consists of a sentence and a set of triples, where each triple is composed of subject and object entities and a relation. The WEBNLG dataset was originally created for the Natural Language Generation (NLG) task; CopyRE (Zeng et al., 2018) customized it for the triple and relation extraction tasks. Table 3 shows the statistics and splits of the datasets.
Our model achieved its best results using the Glove PLM, which was trained on 6 billion tokens with a 400,000-word vocabulary and 300-dimensional word embeddings. Nevertheless, the experiments demonstrated that our model can adopt other PLMs and still provide competitive results. We performed the experiments using TensorFlow.
Table 3: Statistics of the evaluation datasets.
Dataset NYT WEBNLG
Relations 24 216
Samples
Training 56,196 5,019
Validation 5,000 500
Testing 5,000 703
Total 66,196 6,222
Our model's hyper-parameters and training settings are unified across both experimental datasets, which supports the applicability of our approach to real-world data. Table 4 shows the training settings and the model hyper-parameters. We used the Adam optimizer for stochastic gradient descent, performed training five times on each dataset with different random seeds, and report the mean performance and standard deviation. Although we ran training for 50 epochs, the mean convergence epoch for the NYT dataset was 21.4. The hyper-parameters were chosen by tuning the model for best performance. We ran the experiments on a server with an NVIDIA A100-SXM-80GB GPU and an AMD EPYC MILAN (3rd gen) processor, using only 8 cores. We used 20GB of main memory for the WEBNLG experiments and 100GB for the NYT dataset due to its size. We conducted an ablation study to test our model's components using different variants, as shown in Section 4.4.
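As a sketch, the settings of Table 4 translate into the following Keras training configuration (building on the model and loss sketched above; how the 3e-5 decay is applied, and the data-loading names train_x/train_y and val_x/val_y, are assumptions):

```python
import tensorflow as tf

model = build_model(seq_len=100, emb_dim=302, k_relations=216)  # WEBNLG
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.0015),  # decay 3e-5
    loss=rc_dice_loss,
)
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)
# train_x/train_y and val_x/val_y are placeholders for the encoded datasets.
model.fit(train_x, train_y,
          validation_data=(val_x, val_y),
          epochs=50, batch_size=32,
          callbacks=[early_stop])
```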
4.2 Comparison Baselines
We compare our results with the following supervised
models. We refer to the main characteristics of each
one in section 2. CopyRE (Zeng et al., 2018) and
CopyMTL (Zeng et al., 2020) are based on the copy
mechanism and used the same approach to generate
word embeddings. Both evaluated their work on the
NYT and WEBNLG datasets.
Table 4: Model hyperparameters and training settings.
Parameter Value
Average pooling: pool size 80
Average pooling: strides 2
Learning rate 0.0015
Learning rate decay 3e-5
Bi-LSTM units 2 × 500
Dropout rate 0.15
Sequence padding 100
Epochs 50
Early stopping patience 5
Batch size 32
Generated parameters 13M
Average epoch time 2355 ms
Table 5: Our model's F1 scores and standard deviations (±) on the NYT and WEBNLG datasets compared with the baseline models.
Model GAME CopyRE CopyMTL MRCA
NYT 77.1 87.0 86.9 96.65 ±0.17
WEBNLG - 75.1 79.7 93.35 ±0.29
The GAME model (Cheng et al., 2022) used Spacy to generate word embeddings and reported results on the NYT dataset.
Other multiple relations classification models, such as (Li and Tian, 2020) and ZSLRC (Gong and Eldardiry, 2021), were not considered in the comparison because they used a different release of the NYT dataset, one that is not commonly used in the literature.
4.3 Main Results and Analysis
We report our average F1 scores on the NYT and WEBNLG datasets in Table 5. Additionally, we visualize the training performance in Figure 3. The results show that our model outperforms the baseline models. We report the precision and recall scores in Table 6. We highlight our results on the WEBNLG dataset, where relation predictions are highly imbalanced due to the large number of predefined relations; furthermore, the dataset has less training data. Nevertheless, the WEBNLG F1 score is close to the NYT score, even though the NYT dataset has a much smaller predefined relations set and more training data, which indicates that our adaptation method supports better predictions despite the imbalanced distribution of the binary labels.
4.4 Ablation Study
To examine the effectiveness of our model's components, we evaluate the imbalanced predictions adaptation approach and the text encoder adjustments. We design different variants of our model and perform training using the same main evaluation settings in Table 4. Moreover, we report the average score of five runs and the standard deviation. We use the WEBNLG dataset for the ablation study experiments, report the performances in Table 6, and present the following analysis.
Imbalanced Predictions Adaptation Effectiveness. To evaluate the contribution of our imbalanced predictions adaptation approach, we assess our model using different activation and loss functions; specifically, the traditional sigmoid activation function and the binary cross-entropy loss function. We report this variant's performance in Table 6 under the name MRCA-Sigmoid-BCE.
Table 6: The performance of our model's variants on the WEBNLG dataset, with standard deviations (±).
Model Precision Recall F1
MRCA 95.4 ±0.25 91.3 ±0.48 93.35 ±0.29
MRCA-Sigmoid-BCE 93.35 ±0.31 88.73 ±0.55 90.88 ±0.3
MRCA-Bert 94.5 ±0.2 89.9 ±0.49 92.15 ±0.26
MRCA-Bert-noLSTM 55.18 ±2.21 53.7 ±1.1 54.4 ±1.16
The variant's F1 score is approximately 3 points lower than our model's score, averaging the differences in precision and recall. Note that the recall gap is larger, a first indication that the adaptation approach improves the prediction of negative labels.
Encoder Effectiveness. To evaluate our text encoder adjustments, we consider two sub-components in the assessment: the use of the Glove language model and the addition of the entity type vector to the embeddings. Thus, we test the following variants of our model: MRCA-Bert, which uses the Bert PLM instead of Glove, and MRCA-Bert-noLSTM, which uses Bert but no LSTM layers. We use Bert's cased release (https://tfhub.dev/tensorflow/bert_en_cased_L-12_H-768_A-12/4), since we added the same case feature in our implementation. For the former variant, there is only a slight difference between the reported F1 score and our model's score, which suggests that the choice of Glove contributes modestly to our overall performance; still, thanks to its better OOV support, our Glove-based model outperforms the Bert variant. Note that Bert is known as a language model with contextual text representation support, so one would assume that removing the LSTM layers should not affect Bert's performance. Nonetheless, the second variant, MRCA-Bert-noLSTM, performs far worse. This result supports our claim in Section 3.1 that RC is not one of Bert's training objectives, given the off-the-shelf usage of Bert here. Furthermore, with a weak contextual representation in Bert, OOV words are split into non-meaningful tokens by Bert's tokenization algorithm (Song et al., 2021). This confirms the importance of using a language model with a larger vocabulary.
5 CONCLUSION
We propose MRCA, a multiple relations classification model that adapts to imbalanced predictions. Our light-design implementation devotes a wider prediction range to negative labels and customizes an established loss function for the same purpose. Furthermore, text and entity features are utilized efficiently to improve relation prediction. The experiments showed superior performance over state-of-the-art models that report relation classification results. Assessing our model's components showed that addressing the imbalanced predictions yields significant improvements in the relation classification task, and that representing sentences with language models that have rich vocabularies provides further performance gains.
Figure 3: The validation F1 score during training for the
evaluation datasets. (a) indicates the NYT training perfor-
mance. (b) indicates the WEBNLG training performance.
6 FUTURE WORK AND
LIMITATIONS
Although the relation classification task has limited applications as a standalone module, it has wider use within the relation extraction task. We therefore expect that our approach can be adopted to improve results in the many applications that rely on relation classification. Further improvements may be achieved by using NLP tools for lexical and syntactic text features. Additionally, a natural extension would be to assign each predicted relation to its corresponding entity pair in the input. However, this cannot be considered an ideal route to relation or triple extraction, because errors in the entity tagging step would propagate to the relation classification step. Finally, our imbalanced predictions adaptation promises improvements if applied to similar tasks with imbalanced classes.
Our evaluation was limited by the small number of models that report relation classification performance. However, the results demonstrate our model's superiority, as indicated by the gap between our F1 score and that of the closest model.
REFERENCES
Baldini Soares, L., FitzGerald, N., Ling, J., and
Kwiatkowski, T. (2019). Matching the blanks: Distri-
butional similarity for relation learning. In ACL, pages
2895–2905. Association for Computational Linguis-
tics.
Bundschus, M., Dejori, M., Stetter, M., Tresp, V., and
Kriegel, H.-P. (2008). Extraction of semantic biomed-
ical relations from text using conditional random
fields. BMC bioinformatics, 9(1):1–14.
Chen, Z., Wang, Y., Zhao, B., Cheng, J., Zhao, X., and
Duan, Z. (2020). Knowledge graph completion: A
review. Ieee Access, 8:192435–192456.
Cheng, H., Liao, L., Hu, L., and Nie, L. (2022). Multi-
relation extraction via a global-local graph convo-
lutional network. IEEE Transactions on Big Data,
8(6):1716–1728.
Chollet, F. (2021). Deep learning with Python. Simon and
Schuster.
Cohen, A. D., Rosenman, S., and Goldberg, Y. (2020).
Relation classification as two-way span-prediction.
arXiv preprint arXiv:2010.04829.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2018). Bert: Pre-training of deep bidirectional trans-
formers for language understanding. arXiv preprint
arXiv:1810.04805.
Ding, H. and Xu, X. (2022). Relation classification based
on selective entity-aware attention. In CSCWD, pages
177–182. IEEE.
Fundel, K., Küffner, R., and Zimmer, R. (2007). RelEx—relation extraction using dependency parse trees. Bioinformatics, 23(3):365–371.
Gong, J. and Eldardiry, H. (2021). Zero-shot relation classi-
fication from side information. In CIKM, pages 576–
585.
Gormley, M. R., Yu, M., and Dredze, M. (2015). Improved
relation extraction with feature-rich compositional
embedding models. arXiv preprint arXiv:1505.02419.
Gu, J., Lu, Z., Li, H., and Li, V. O. (2016). Incorporating
copying mechanism in sequence-to-sequence learn-
ing. arXiv preprint arXiv:1603.06393.
Hendrickx, I., Kim, S. N., Kozareva, Z., Nakov, P., Ó Séaghdha, D., Padó, S., Pennacchiotti, M., Romano, L., and Szpakowicz, S. (2010). SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 33–38. Association for Computational Linguistics.
Honnibal, M. and Montani, I. (2017). spacy 2: Natural lan-
guage understanding with bloom embeddings, convo-
lutional neural networks and incremental parsing. To
appear, 7(1):411–420.
Huang, Q., Sun, J., Ding, H., Wang, X., and Wang, G.
(2018). Robust liver vessel extraction using 3d u-net
with variant dice loss function. Computers in biology
and medicine, 101:153–162.
Karaevli, H. A. and Güngör, T. (2022). Enhancing relation extraction by using shortest dependency paths between entities with pre-trained language models. In INISTA, pages 1–7.
Klein, D. and Manning, C. D. (2002). Fast exact inference
with a factored model for natural language parsing.
Advances in neural information processing systems,
15.
Lee, J., Seo, S., and Choi, Y. S. (2019). Semantic rela-
tion classification via bidirectional lstm networks with
entity-aware attention using latent entity typing. Sym-
metry, 11(6):785.
Li, C. and Tian, Y. (2020). Downstream model design
of pre-trained language model for relation extraction
task. arXiv preprint arXiv:2004.03786.
Li, X., Sun, X., Meng, Y., Liang, J., Wu, F., and Li, J.
(2020). Dice loss for data-imbalanced NLP tasks. In
ACL, pages 465–476. Association for Computational
Linguistics.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D.,
Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov,
V. (2019). Roberta: A robustly optimized bert pre-
training approach. arXiv preprint arXiv:1907.11692.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013).
Efficient estimation of word representations in vector
space. arXiv preprint arXiv:1301.3781.
Nayak, A., Timmapathini, H., Ponnalagu, K., and
Venkoparao, V. G. (2020). Domain adaptation chal-
lenges of bert in tokenization and sub-word represen-
tations of out-of-vocabulary words. In Proceedings of
the First Workshop on Insights from Negative Results
in NLP, pages 1–5.
Paolini, G., Athiwaratkun, B., Krone, J., Ma, J., Achille,
A., Anubhai, R., Santos, C. N. d., Xiang, B., and
Soatto, S. (2021). Structured prediction as translation
between augmented natural languages. arXiv preprint
arXiv:2101.05779.
Pennington, J., Socher, R., and Manning, C. D. (2014).
Glove: Global vectors for word representation. In
EMNLP, pages 1532–1543.
Qu, L., Zhang, Y., Wang, R., Jiang, L., Gemulla, R., and
Weikum, G. (2014). Senti-lssvm: Sentiment-oriented
multi-relation extraction with latent structural svm.
Transactions of the Association for Computational
Linguistics, 2:155–168.
Riedel, S., Yao, L., and McCallum, A. (2010). Modeling
relations and their mentions without labeled text. In
ECML PKDD, pages 148–163. Springer.
Song, X., Salcianu, A., Song, Y., Dopson, D., and Zhou,
D. (2021). Fast WordPiece tokenization. In EMNLP,
pages 2089–2103.
Vrandečić, D. and Krötzsch, M. (2014). Wikidata: a free collaborative knowledgebase. Communications of the ACM, 57(10):78–85.
Wang, H., Tan, M., Yu, M., Chang, S., Wang, D., Xu, K.,
Guo, X., and Potdar, S. (2019). Extracting multiple-
relations in one-pass with pre-trained transformers.
In ACL, pages 1371–1377. Association for Computa-
tional Linguistics.
Wang, L., Cao, Z., De Melo, G., and Liu, Z. (2016). Re-
lation classification via multi-level attention cnns. In
Proceedings of the 54th Annual Meeting of the Associ-
ation for Computational Linguistics (Volume 1: Long
Papers), pages 1298–1307.
Wang, M. (2008). A re-examination of dependency path
kernels for relation extraction. In Proceedings of the
Third International Joint Conference on Natural Lan-
guage Processing: Volume-II.
Woodland, P. C., Johnson, S. E., Jourlin, P., and Jones, K. S.
(2000). Effects of out of vocabulary words in spoken
document retrieval. In Proceedings of the 23rd annual
international ACM SIGIR conference on Research and
development in information retrieval, pages 372–374.
Wu, S. and He, Y. (2019). Enriching pre-trained language
model with entity information for relation classifica-
tion. In CIKM, pages 2361–2364.
Yin, X., Han, J., Yang, J., and Yu, P. S. (2006). Effi-
cient classification across multiple database relations:
A crossmine approach. IEEE Transactions on Knowl-
edge and Data Engineering, 18(6):770–783.
Zeng, D., Liu, K., Lai, S., Zhou, G., and Zhao, J. (2014).
Relation classification via convolutional deep neural
network. In COLING, pages 2335–2344.
Zeng, D., Zhang, H., and Liu, Q. (2020). Copymtl: Copy
mechanism for joint extraction of entities and relations
with multi-task learning. In AAAI, volume 34, pages
9507–9514.
Zeng, X., Zeng, D., He, S., Liu, K., and Zhao, J. (2018). Ex-
tracting relational facts by an end-to-end neural model
with copy mechanism. In ACL, pages 506–514.
Zhang, Y., Qi, P., and Manning, C. D. (2018). Graph convo-
lution over pruned dependency trees improves relation
extraction. arXiv preprint arXiv:1809.10185.
Zhang, Y., Zhong, V., Chen, D., Angeli, G., and Manning,
C. D. (2017). Position-aware attention and supervised
data improve slot filling. In EMNLP, pages 35–45.
Zhou, P., Shi, W., Tian, J., Qi, Z., Li, B., Hao, H., and Xu,
B. (2016). Attention-based bidirectional long short-
term memory networks for relation classification. In
Proceedings of the 54th annual meeting of the associ-
ation for computational linguistics (volume 2: Short
papers), pages 207–212.