Evaluating and Improving End-to-End Systems for Knowledge Base Population

Maxime Prieur¹, Cédric Du Mouza¹, Guillaume Gadek² and Bruno Grilheres²
¹Cédric Laboratory, Conservatoire National des Arts et Métiers, Paris, France
²Airbus Defence and Space, Élancourt, France
Keywords:
Knowledge Base Population, Entity Linking, Supervised Learning, Data Mining, Method and Evaluation.
Abstract:
Knowledge Bases (KB) are used in many fields, such as business intelligence or user assistance. They aggregate knowledge that can be exploited by computers to support decision making, by providing better visualization or by predicting new relations. However, building them remains complex for an expert, who has to extract and link each new piece of information. In this paper, we describe an entity-centric method for evaluating an end-to-end Knowledge Base Population system. This evaluation is applied to ELROND, a complete system designed as a workflow composed of 4 modules (Named Entity Recognition, Coreference Resolution, Relation Extraction and Entity Linking), and to MERIT, a dynamic entity linking model made of a textual encoder to retrieve similar entities and a classifier.
1 INTRODUCTION
Knowledge bases (KB) are data structures, generally
relying on a predefined ontology, which are very use-
ful for aggregating information in order to simplify its
visualization and analysis. Knowledge bases are used
in many fields such as business intelligence (Shue
et al., 2009) to improve decision making or to ex-
tract elements linking scientific publications (Ammar
et al., 2018). However, manual construction and up-
dating of Knowledge Bases are extremely costly since
the domains in which they are deployed usually ex-
ploit constantly evolving information. An alternative
solution would be an automatic population that would
extract and add the desired elements from sources of
interest into the knowledge base.
Nevertheless, the existing solutions are not yet sufficient, and this subject is still the focus of many workshops (Ghosal et al., 2022), which aim at proposing new approaches either on the end-to-end process workflow or on one of its subtasks. Most proposals in the literature focus on the optimization of one or two particular modules, but not of an end-to-end system. However, there is no guarantee of obtaining the best performance for an end-to-end system when optimizing each module separately. Unfortunately, to optimize an end-to-end Knowledge Base Population (KBP) system as a whole, there exists, to the best of our knowledge, no evaluation protocol that is both automatic and exhaustive (Min et al., 2018; Mesquita et al., 2019).
In this paper, we attempt to address this issue with the following contributions:
- we formalize a method for evaluating end-to-end Knowledge Base Population systems from texts as a whole;
- we present ELROND, an end-to-end system implemented as a 4-step processing workflow which could be considered as a baseline for comparison with future solutions;
- as an improvement over the ELROND baseline, we introduce MERIT, a textual encoder-based entity linking solution for entity resolution when building a dynamic knowledge base;
- we measure and compare the performances of the proposed models by following the presented evaluation method and applying it to the DWIE dataset (Zaporojets et al., 2021).
The remainder of this article is structured as follows: after a presentation of recent work on knowledge base population (KBP) evaluation approaches, on end-to-end systems, as well as on models for the entity linking task (Section 2), we formalize and detail our evaluation protocol (Section 3). Then we describe our end-to-end implementation proposal, ELROND, and our linking solution, MERIT (Section 4).
We present the experiments conducted and the results
obtained in Section 5 before discussing the possible
perspectives to explore in future work.
2 RELATED WORK
2.1 Knowledge Base Population
Knowledge Base Population from texts consists in extracting the elements of interest and their relations in order to add them to the already known and structured information. This task usually involves several steps: named entity recognition (NER), coreference resolution, relation extraction, and entity linking. TinkerBell (Al-Badrashiny et al., 2017), one of the first end-to-end systems, includes a NER module combining two Bi-LSTMs, which take respectively the text and linguistic features to tag the words in the document. Meanwhile, entity linking is solved through the sum of popularity, similarity and consistency scores. KnowledgeNet (Mesquita et al., 2019) also addresses relation extraction with a Bi-LSTM (Long Short-Term Memory) model (Huang et al., 2015), which receives linguistic features and embeddings generated by a BERT (Bidirectional Encoder Representations from Transformers) model (Kenton and Toutanova, 2019). This model is however applied at the sentence level and discards cross-sentence relations (Yao et al., 2019). Moreover, these approaches are based on Wikipedia and are not adequate when the sources contain a large proportion of entities not listed in the encyclopedia. KBPearl (Lin et al., 2020) suggests using open information extraction (OIE) frameworks. The system extracts and links knowledge from the text using a graph densification method applied to the semantic graph built from the text. Besides a profusion of potentially uninteresting information for the user, caused by the lack of predicates in OIE frameworks, the selection of candidates is performed by alias matching, which is not very suitable when new mentions appear.
2.2 Entity Linking
Entity linking is the process of determining whether
the mentions in a text refer to entities in the database.
This process generally establishes a list of similar
entities before selecting, or not, one of the candi-
dates. In addition to dictionary-based approaches
(Al-Badrashiny et al., 2017), encoder-based methods
have been proposed. For instance BLINK (Wu et al., 2020) uses two BERT models to compare entity mentions and Wikipedia descriptions projected into a single representation space. This bi-encoder returns descriptions similar to the input mention, which are then re-ranked by a more fine-grained encoder. On the other hand, the candidate list cannot be rejected entirely when the input corresponds to no entity in the database (NIL prediction). To overcome this issue, Zhang et al. (2021) apply a question-answering approach with EntQA, returning the entities likely to be in the document before classifying whether a textual segment mentions a candidate. The NIL classification is only partially solved, since they cannot add the description of a new entity in order to predict it later. In addition, BLINK and EntQA use three BERT encoders with separate weights, which makes the systems quite cumbersome.
2.3 KBP Evaluation
While there are numerous benchmarks and metrics for the evaluation of subtasks (“F1-score” for NER and relation extraction, “Hit@k” for linking, etc.), few solutions exist for end-to-end systems that build knowledge bases from text. The TAC KBP workshops evaluate a system by computing the accuracy on 1-hop queries, “What is Frodo carrying?”, and 2-hop queries, “Who created what Frodo is carrying?”, the number of hops representing the number of relations which separate the subject entity from the object entity. The cost of manually evaluating systems that return a large number of responses (Ellis et al., 2015) compels evaluators to focus on a small number of queries, and thus they do not evaluate the entire database. Min et al. (2018) make automatic evaluation feasible by measuring the alignment of triplets (subject, relation, object) between the reference and the output. An output entity is linked to a reference one if the produced entity shares more than 50% of its mentions with the reference entity. This alignment raises questions, both about the situation where the system extracts only a small number of mentions and about the arbitrary choice of the 50% threshold. KnowledgeNet (Mesquita et al., 2019) measures the F1 score on the extraction of annotated triples in sentences and the linking of the subject and object entity pair to their Wikidata page. Since each sentence in the dataset is annotated with only one entity pair and one relation, precision cannot be properly evaluated: correct results could be mistakenly counted as false positives. As in Min et al. (2018), the evaluation is done at the textual level and discards the construction of a base. These incomplete evaluation methods highlight the need for a protocol that evaluates the performance of an entire KBP system.
3 MODELING AND EVALUATION
3.1 Modeling a Knowledge Base
A KB is composed of elements (entities, attributes
and relations between them) relying on a defined on-
tology. It can thus be modeled by a graph in which
the nodes are the various elements and the edges ex-
press the existence of a relation between these ele-
ments. We define a KB as follows:
Definition 3.1 (Knowledge Base). A Knowledge Base is a data structure that can be modeled by a graph $G = (V, E, \Phi, \Psi)$ where $V$ is the set of vertices in the graph, $E$ the set of edges between two nodes of $V$, $\Phi : V \to \mathcal{A}$ a function which assigns to any vertex $v_i$ of $V$ a set of attributes $A_i \subseteq \mathcal{A}$ denoted by tuples (type, value), and $\Psi : E \to \mathcal{E}$ a function that associates to each edge $e_i \in E$ an edge type $E_i \in \mathcal{E}$, with $\mathcal{A}$ and $\mathcal{E}$ referring to the set of attributes and the set of edge types respectively.
Populating a KB with textual content therefore consists in adding elements extracted from texts according to an ontology. To link information related to the same entity found in several texts, the entities must have a unique identifier (URI). This allows to obtain, for a set of $k$ texts, a reference KB $G_k = (V_k, E_k, \Phi_k, \Psi_k)$, and to measure the proportion of information correctly extracted by a system building a base $G'_k = (V'_k, E'_k, \Phi'_k, \Psi'_k)$.
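To make Definition 3.1 concrete, here is a minimal sketch of such a graph structure in Python; the class and field names (Entity, KnowledgeBase, add_relation) are illustrative choices of ours, not the paper's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    """A vertex v_i of V with its attribute set A_i, stored as (type, value) tuples."""
    uri: str                                      # unique identifier used for linking
    attributes: set = field(default_factory=set)  # e.g. {("type", "Person"), ("alias", "Trump")}

@dataclass
class KnowledgeBase:
    """G = (V, E, Phi, Psi): vertices, typed edges, and attribute assignment."""
    entities: dict = field(default_factory=dict)  # uri -> Entity (encodes V and Phi)
    edges: set = field(default_factory=set)       # (subject_uri, edge_type, object_uri) (encodes E and Psi)

    def add_relation(self, subj: str, rel_type: str, obj: str) -> None:
        # Psi assigns the type rel_type to the edge (subj, obj)
        self.edges.add((subj, rel_type, obj))
```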
Example of Workflow for the KBP Task. A KBP system is built around components or modules that form the processing workflow to solve the KBP task. The first component is in charge of recognizing named entities (NER) and other elements of interest in the text (attributes, unnamed entities), while assigning them a type using the document. For instance, with the sentence “Joe Biden, the U.S. President, went viral on Trump”, this step yields [(Joe Biden, Per), (the U.S. President, Per), (President, Role), (U.S., Nationality), (Trump, Per)]. The second processing block groups the textual elements that co-refer. The mentions from the NER belonging to a cluster composed of mentions of the same type are kept in co-reference, and the remaining mentions are considered as different entities. Using the previous example, we get two clusters [(1, PER, Joe Biden, the U.S. President), (2, PER, Trump)]. It is then possible to identify, with a third module, the relations linking the elements: [(1, Is against, 2), (1, Role, President), (1, Nationality, U.S.)]. These relations, in addition to being part of the information to be extracted, constitute a support for the last step, entity resolution. Each entity in the text, when possible, is associated with an entity in the database: “Trump” must be linked to the entity “Donald Trump” and not to “Fred Trump”. Finally, all the information extracted from the text enriches the database, by completing the information of the entities already known or by adding new entities.
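The four-step workflow can be summarized by the following orchestration sketch; the module interfaces (ner, coref, rel_ex, linker) are hypothetical and shown only to make the data flow of the example above explicit.

```python
def populate(text, kb, ner, coref, rel_ex, linker):
    """Hypothetical orchestration of the four modules described above."""
    mentions = ner(text)                # [("Joe Biden", "Per"), ("Trump", "Per"), ...]
    clusters = coref(text, mentions)    # [(1, "PER", ["Joe Biden", "the U.S. President"]),
                                        #  (2, "PER", ["Trump"])]
    relations = rel_ex(text, clusters)  # [(1, "Is against", 2), (1, "Role", "President"), ...]
    # Entity resolution: map each textual cluster to an existing or new KB entity.
    uris = {cid: linker(kb, cluster, relations) for cid, *cluster in clusters}
    for subj, rel_type, obj in relations:
        # obj is either another cluster id or a literal attribute value
        kb.add_relation(uris[subj], rel_type, uris.get(obj, obj))
    return kb
```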
3.2 Evaluating KBP Systems
In order to evaluate the systems designed for the KBP
task, we present below a process to measure the per-
formance of the methods which extract and aggregate
information either to an existing base (warm start) or
to an initially empty base (cold start).
Entities are defined by attributes and by relations which link them to other entities in the database. When comparing a reference entity to a built entity, it is necessary to verify that both the attributes and the relations match.
Definition 3.2 (Similarity of attributes and relations). We use the following similarity definitions for attributes and relations:
- Attribute similarity: we consider that two attributes match if they have the same type, value and inference text (in which the attribute appears). Although including the inference text creates a multiplication of information, it verifies that the system correctly extracts the information each time it is mentioned.
- Relation similarity: we consider that two relations are similar if they involve the same predicate (type of relation) and the same inference text, and if all the mentions of the object entity of the constructed relation are included in the mentions of the object entity of the reference base.
To check if an entity has been correctly extracted, we compare the extracted attributes and relations with those possessed by the reference entity. To measure the proximity, we adapt the precision, recall and F1-score based on Definition 3.2 as follows:
$$P_{v_i,v_j,k} = \frac{\alpha\,|TP_{rel}| + \beta\,|TP_{att}|}{\alpha\,(|TP_{rel}| + |FP_{rel}|) + \beta\,(|TP_{att}| + |FP_{att}|)}$$

$$R_{v_i,v_j,k} = \frac{\alpha\,|TP_{rel}| + \beta\,|TP_{att}|}{\alpha\,(|TP_{rel}| + |FN_{rel}|) + \beta\,(|TP_{att}| + |FN_{att}|)}$$

$$F1_{v_i,v_j,k} = 2\,\frac{P_{v_i,v_j,k} \times R_{v_i,v_j,k}}{P_{v_i,v_j,k} + R_{v_i,v_j,k}} \qquad (1)$$
with $TP$, $FP$ and $FN$ denoting true positives, false positives and false negatives respectively, $rel$ relations and $att$ attributes, $v_i$ and $v_j$ vertices belonging to the reference base $G_k$ and the built base $G'_k$, and $0 \leq \alpha, \beta \leq 1$ weights such that $\alpha + \beta = 1$, which allow giving a different importance to attributes and relations. Other weights could be added to differentiate along attribute or relation types.
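As an illustration, Equation 1 can be computed per entity pair as follows, assuming the attributes and relations have been put into comparable sets (matching as in Definition 3.2); this is a sketch, not the released evaluation script.

```python
def pair_f1(ref_rels, out_rels, ref_atts, out_atts, alpha=0.5, beta=0.5):
    """Weighted F1 between a reference entity and a built entity (Eq. 1).
    ref_rels/out_rels and ref_atts/out_atts are sets whose elements compare
    equal exactly when they match in the sense of Definition 3.2."""
    tp_rel, tp_att = len(ref_rels & out_rels), len(ref_atts & out_atts)
    fp_rel, fp_att = len(out_rels - ref_rels), len(out_atts - ref_atts)
    fn_rel, fn_att = len(ref_rels - out_rels), len(ref_atts - out_atts)
    num = alpha * tp_rel + beta * tp_att
    p_den = alpha * (tp_rel + fp_rel) + beta * (tp_att + fp_att)
    r_den = alpha * (tp_rel + fn_rel) + beta * (tp_att + fn_att)
    p = num / p_den if p_den else 0.0
    r = num / r_den if r_den else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```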
Entity Alignment. We align each entity of the reference KB with an entity of the output base using the F1-score defined above and the Hungarian algorithm (Kuhn, 1955). The alignment is possible only for pairs with a non-zero similarity score. $G_k$ and $G'_k$ entities without a match are respectively considered as false negatives and false positives. In the warm-start scenario, entities that are initially present remain aligned between $G_k$ and $G'_k$; their F1-score only takes into account the new information. This matching phase leads to the construction of a set $\mathcal{M}_k$ of pairs $(v_{i,G_k}, v_{j,G'_k})$.
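A possible implementation of this alignment, sketched below, relies on SciPy's implementation of the Hungarian algorithm (linear_sum_assignment) applied to the pairwise F1 matrix; the score function is assumed to be Equation 1 (e.g. pair_f1 above).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_entities(ref_entities, out_entities, score):
    """Align reference and output entities by maximizing the pairwise F1
    (Hungarian algorithm); pairs with zero similarity are discarded."""
    s = np.array([[score(r, o) for o in out_entities] for r in ref_entities])
    rows, cols = linear_sum_assignment(s, maximize=True)
    pairs = [(i, j) for i, j in zip(rows, cols) if s[i, j] > 0]
    fn = len(ref_entities) - len(pairs)  # unmatched reference entities
    fp = len(out_entities) - len(pairs)  # unmatched output entities
    return pairs, fn, fp
```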
Global Quality Scores. The comparison between the constructed and the reference bases after processing $k$ texts is done by aggregating the similarity scores of the previously formed pairs of entities. Two F1-scores, $F1_{micro}$ and $F1_{macro}$, measuring the proportion of correctly extracted information, are computed:

$$P_{micro,k} = \frac{\sum_{(v_i,v_j) \in \mathcal{M}_k} \left( \alpha\,|e_{v_j} = e_{v_i}| + \beta\,|a_{v_j} = a_{v_i}| \right)}{\alpha\,|E_{G'_k}| + \beta\,|A_{G'_k}|}$$

$$R_{micro,k} = \frac{\sum_{(v_i,v_j) \in \mathcal{M}_k} \left( \alpha\,|e_{v_j} = e_{v_i}| + \beta\,|a_{v_j} = a_{v_i}| \right)}{\alpha\,|E_{G_k}| + \beta\,|A_{G_k}|}$$

$$F1_{micro,k} = 2\,\frac{P_{micro,k} \times R_{micro,k}}{P_{micro,k} + R_{micro,k}}$$

$$F1_{macro,k} = \frac{\sum_{(v_i,v_j) \in \mathcal{M}_k} F1_{v_i,v_j,k}}{|\mathcal{M}_k| + |FN| + |FP|} \qquad (2)$$
with $e_{v_i}$, $e_{v_j}$ the edges of nodes $v_i$ and $v_j$, and $E_{G_k}$, $A_{G_k}$ respectively the set of edges and the set of attributes in the reference base. The $F1_{macro}$ is an average of the similarity scores of the aligned entities and, unlike the $F1_{micro}$, does not take into account the difference of distribution (number and type of relations or attributes) that could exist between the entities. The latter is a weighted F1 calculated according to the identical elements between the aligned entities.
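As a sketch, assuming the alignment step above has produced the pairwise scores and the FN/FP counts, the two global scores of Equation 2 can be computed from aggregated counts:

```python
def micro_f1(n_shared_edges, n_shared_atts,
             n_out_edges, n_out_atts, n_ref_edges, n_ref_atts,
             alpha=0.5, beta=0.5):
    """F1_micro of Eq. 2: edges/attributes shared by the aligned pairs,
    compared to the totals of the built base (precision) and of the
    reference base (recall)."""
    num = alpha * n_shared_edges + beta * n_shared_atts
    p = num / (alpha * n_out_edges + beta * n_out_atts)
    r = num / (alpha * n_ref_edges + beta * n_ref_atts)
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_f1(pair_scores, n_fn, n_fp):
    """F1_macro of Eq. 2: average of the pairwise F1 scores, with unmatched
    reference (FN) and output (FP) entities counted in the denominator."""
    return sum(pair_scores) / (len(pair_scores) + n_fn + n_fp)
```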
Benefits of the Evaluation Process. An issue when considering an end-to-end system composed of several modules is that an error caused by one module can be re-used by another and added to the database. For example, a person mentioned in a text can be linked to the wrong person in the database, which can lead to an erroneous assignment of a relation or attribute. The proposed protocol measures, at different text intervals, the distance to the reference base, showing the resilience of a system to the errors that can be made. The evaluation can be conducted both in a warm-start and in a cold-start scenario. The impact of a single module on the whole processing chain is measured by using the ground-truth results for the rest of the workflow. The choice to make the F1-score measurement more flexible, by replacing the exact matching of entities with the proportion of identical information in a pair, brings a better representativeness of the systems' performances.
4 MODELS FOR KBP
This section presents ELROND, a baseline system for the KBP task, and MERIT, an entity linking module which improves this baseline.
4.1 ELROND, an End-to-End System for
KBP
We introduce ELROND (Entity Linking and Relation extraction On New textual Documents), an implementation that follows the KB enrichment process explained in Section 3.1. The main components, with their interactions, are illustrated in Figure 1. This system is used as a baseline and shows the interest of the evaluation method detailed in Section 3.2. For each module we implement a recent proposal from the literature which exhibits good results.

[Figure 1: Diagram of the proposed processing chain for ELROND: a text to process goes through NER, coreference resolution (grouping mentions into clusters), relation extraction and entity linking, before the extracted entities are added to the KB.]
Named Entity Recognition. The NER block consists of a pre-trained and fine-tuned RoBERTa model (Liu et al., 2019). The choice of RoBERTa was motivated by its current performance on various benchmarks (SWAG, https://paperswithcode.com/sota/common-sense-reasoning-on-swag; GLUE, https://gluebenchmark.com/) for named entity recognition.
Resolving Co-References. For the co-reference task, we use, in parallel with the NER model, the pre-trained Word-level Coreference Resolution model (Dobrovolskii, 2021). This model creates, with the help of RoBERTa, groups of words in co-reference (named and unnamed entities, pronouns, etc.). The representation of a token is given by weighting the vectors of its sub-tokens produced by RoBERTa. The weights are obtained by applying a softmax function to the projection of the vectors through an attention matrix. Finally, the model predicts a co-reference when the sum of a bilinear projection between two tokens and the output of a neural network taking the two tokens as input is positive. We choose to integrate Word-level Coreference Resolution into ELROND because of its performance on this task.
Relation Extraction. Relations are extracted using the ATLOP model (Zhou et al., 2021), which represents each entity by applying a pooling function to the mention vectors obtained by a PLM (Pre-trained Language Model). For each pair of entities, an attention coefficient is obtained and then used in a bilinear function to compute the plausibility of each relation type. If the score is greater than the score of the Null type, the relation is considered as existing. Relations that are impossible given the types of the entities are filtered out in post-processing.
Entity Linking. The last step applies a search by mention and a selection by popularity. For each entity in the text, the solution returns the entities of the database that have the same type and share at least one mention with the entity in the text. In case the mentions do not return any result, an extended search is performed with the acronyms of these mentions. If no element is returned, the textual entity is added to the database. If several entities of the database correspond to mentions of the textual cluster, a selection by popularity, similar to Al-Badrashiny et al. (2017), is applied: the entity with the most occurrences, considering all mentions, is selected.
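A minimal sketch of this baseline linking heuristic is given below, reusing the illustrative KnowledgeBase structure from Section 3.1; the occurrence_count popularity measure and the acronym construction are assumptions of ours.

```python
def link_by_popularity(kb, cluster_type, cluster_mentions, occurrence_count):
    """Baseline linking sketch: candidates share the type and at least one
    mention (falling back to acronyms); ties are broken by popularity.
    `occurrence_count` maps a KB entity to its number of observed mentions."""
    def acronym(mention):
        return "".join(word[0] for word in mention.split()).upper()

    def candidates_for(aliases):
        return [e for e in kb.entities.values()
                if ("type", cluster_type) in e.attributes
                and any(("alias", a) in e.attributes for a in aliases)]

    candidates = candidates_for(cluster_mentions)
    if not candidates:  # extended search with the acronyms of the mentions
        candidates = candidates_for({acronym(m) for m in cluster_mentions})
    if not candidates:
        return None  # NIL: the caller adds the textual entity to the base
    return max(candidates, key=occurrence_count)  # popularity selection
```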
4.2 MERIT: Model for Entity Resolution
in Text
Candidate Retriever. The popularity-based linking approach described previously, which serves as baseline, cannot deal with new mentions and is prone to errors during ambiguity resolution, since it favors popular entities. The proposed method, MERIT, addresses these shortcomings by relying on the context of the documents, drawing inspiration from previous approaches such as BLINK (Wu et al., 2020) or EntQA (Zhang et al., 2021). Like these models, and as illustrated in Figure 2, we propose as a first component an encoder that projects a portion of text targeting an entity into a representation space allowing similarity comparisons. The retriever takes as input a query text in which the target entity mentions are enclosed in tags. This retriever differs from EntQA and BLINK on several points. First, the text samples are expanded to a size of 256 tokens to include a broader context for better discrimination. Secondly, as illustrated by Soares et al. (2019), all mentions of the entity of interest in the text sample are wrapped with special tags ([Ent] and [/Ent]) to improve the quality of the representations. The distance between the vector representations of the text samples is measured using cosine similarity instead of a scalar product: in addition to yielding better performance in our case, Luo et al. (2018) show that this provides more stable learning. Lastly, a single encoder replaces the dual-encoder, since the search base is no longer composed of descriptions but of textual portions of the same style as those given in the query. The BERT encoder is replaced by the ALBERT architecture, which is lighter and more efficient for semantic similarity tasks (according to the STS benchmark, https://paperswithcode.com/sota/semantic-textual-similarity-on-sts-benchmark).
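The following sketch illustrates the query construction and the similarity measure; the tokenization, the span format and the window centering are simplified assumptions of ours, shown only to make the tagging scheme concrete.

```python
import numpy as np

def build_query(tokens, mention_spans, window=256):
    """Wrap every mention of the target entity in [Ent]...[/Ent] tags and keep
    roughly `window` tokens of context around the first mention.
    `mention_spans` holds inclusive (start, end) token indices."""
    tagged = []
    for i, tok in enumerate(tokens):
        if any(start == i for start, _ in mention_spans):
            tagged.append("[Ent]")
        tagged.append(tok)
        if any(end == i for _, end in mention_spans):
            tagged.append("[/Ent]")
    center = mention_spans[0][0]  # approximate: tag insertion shifts indices
    start = max(0, center - window // 2)
    return tagged[start:start + window]

def cosine(u, v):
    """Similarity used by the retriever instead of a scalar product."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```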
Classifier. The most similar text samples retrieved are then concatenated with the query one. The resulting set of 512 tokens is given to an ALBERT model with a linear layer for classification at the end. If none of the samples is classified as similar, a new entity is created in the base with the query sample as its first support text. In the case where several samples are classified as similar, the one with the highest classification score is selected. The model then returns as output the database entity that corresponds to the query entity (if a match has been established).
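A sketch of this decision step is given below; the classifier interface, the candidate objects (with .text and .entity fields) and kb.create_entity are hypothetical names of ours.

```python
def resolve(query_sample, candidate_samples, classify, kb, threshold=0.5):
    """MERIT decision step (sketch): score each [query || candidate]
    concatenation; create a new entity when no candidate is similar (NIL)."""
    scored = [(classify(query_sample, cand.text), cand)
              for cand in candidate_samples]
    positives = [(score, cand) for score, cand in scored if score >= threshold]
    if not positives:
        # NIL: add a new entity with the query sample as its first support text
        return kb.create_entity(support_text=query_sample)
    # several samples classified as similar: keep the highest-scoring one
    return max(positives, key=lambda sc: sc[0])[1].entity
```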
5 EXPERIMENTS
5.1 Implementation Details
All the proposed approaches are implemented in Python and use the PyTorch library (https://pytorch.org/). The NER model is trained using the Flair framework (Akbik et al., 2019); the encoders used by MERIT are initialized from the Hugging Face library (https://huggingface.co/). The MERIT retriever module is trained with the contrastive loss (Khosla et al., 2020), which leads to a better clustering of identical elements in the representation space. We use the hard-negative technique (Gillick et al., 2019), which consists in submitting negative samples considered as ambiguous by the model during training. This process brings a better selection of parameters and makes the model more robust in its predictions. The classification module is trained with the binary cross-entropy loss. We use the Annoy index structure (https://github.com/spotify/annoy) to return elements according to their cosine similarity.
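As an illustration of the retrieval index, a minimal Annoy usage sketch follows; the 768-dimension embedding size and the number of trees are illustrative values, not the paper's reported settings.

```python
from annoy import AnnoyIndex

def build_index(sample_vectors, dim=768):
    """Index the support-text embeddings. Annoy's angular distance is
    monotone in cosine similarity, so nearest neighbours match the
    retriever's similarity ranking."""
    index = AnnoyIndex(dim, "angular")
    for item_id, vec in enumerate(sample_vectors):
        index.add_item(item_id, vec)
    index.build(10)  # number of trees: more trees, better recall, bigger index
    return index

def retrieve(index, query_vector, k=10):
    """Return the ids of the k most similar support texts for a query."""
    return index.get_nns_by_vector(query_vector, k)
```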
5.2 Datasets
For a thorough measurement of systems, the dataset annotations must cover all dimensions of the information to be extracted. We therefore use DWIE (Zaporojets et al., 2021), the only free dataset complying with this constraint. The dataset is composed of 800 press articles in English. In the 700 training texts and 100 test texts that constitute DWIE, entities are annotated according to a multi-level ontology, for a total of about 170 classes and relations. For the KBP task, types and aliases are used as attributes. The 700 training texts constitute the KB to be completed in the warm-start scenario.
We also used AIDA (Hoffart et al., 2011) to compare the linking approach using mention popularity to MERIT. AIDA consists of 1,393 news articles for which entities and their Wikipedia page names are listed when available. In DWIE, entities that are not linked to Wikipedia only appear in a single text, so we can consider them as unique entities. This is not the case for AIDA: entities in AIDA that do not have an identified link are thus discarded from the linking task.
To estimate the complexity of the linking task on a dataset, we introduce an ambiguity score $\chi$, computed as the average, over the textual mentions, of the inverse of the number of entities named by each mention:

$$\chi(\text{dataset}) = \frac{1}{|\text{mentions}|} \sum_{\text{mention}} \frac{1}{|E(\text{mention})|} \qquad (3)$$

with $E(\text{mention})$ the set of entities sharing this mention. We obtain scores of $\chi(DWIE) = 0.889$ and $\chi(AIDA) = 0.794$. The entity linking task is therefore more complex on the AIDA dataset.
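A sketch of this score follows, under the reading that the average is taken over distinct textual mentions; the annotation format (mention, entity) pairs is an assumption of ours.

```python
from collections import defaultdict

def ambiguity_score(annotations):
    """Chi of Eq. 3: average, over textual mentions, of 1/|E(mention)|,
    where E(mention) is the set of entities named by that mention.
    `annotations` is an iterable of (mention, entity_id) pairs."""
    entities_per_mention = defaultdict(set)
    for mention, entity in annotations:
        entities_per_mention[mention].add(entity)
    scores = [1 / len(ents) for ents in entities_per_mention.values()]
    return sum(scores) / len(scores)
```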
Since the existence of certain items in the database, and thus the order in which the texts allowing the extraction of these items are submitted, may or may not benefit the operation of the systems, the performance is measured and averaged over 10 different orderings of the test set. The data and scripts to compare these results are available in a Git repository (https://github.com/Todaime/KBP).
5.3 Results and Discussion
ELROND's Performance. The graphs in Figure 3 show the scores of ELROND for the KBP task on the 100 test texts (averaged over 10 runs). For the warm-start scenario, the original KB contains the information of the 700 training texts. We observe that the initial distance between the KBs is greater in warm-start, due to the difficulty of linking the information of the texts with the information already possessed. In both cases (although more contrasted in warm-start), the performance declines over the texts, which attests to an accumulation of errors during the process. The micro F1-score is higher than the macro F1-score in both cases and seems more stable towards the end of the test set. This is explained by the fact that popular entities (countries and cities for example), which are mentioned more often in the texts, have on the one hand more weight in the final database and on the other hand are easier to recognize. This characteristic therefore tends to increase the $F1_{micro}$ compared to the $F1_{macro}$, which smoothes the difference in distribution between entities.
[Figure 2: Diagram of MERIT: the query sample and the candidate samples are encoded with ALBERT and compared by cosine similarity; the concatenations [Sample || Sample_candidate] are then classified, with an optional re-ranking with relations.]

[Figure 3: ELROND performance for the KBP task on DWIE texts: (a) cold-start, (b) warm-start.]

Comparison of Linking Approaches. For entity resolution, we use two types of metrics. The first is the Hit@k, which computes the frequency with which the queried entity is found among the first k returned entities; this metric is only measured on query entities that are in the database. The second metric is accuracy: a result is valid if the entity is in the KB and is returned by the model, or if this entity is NIL and the model does not return a result. We add for comparison a linking solution that randomly selects an entity in cases of ambiguity. For DWIE, we measure the performance with and without filtering on the type of entities, as well as taking into account the relations that already exist in the KB. Due to the lack of such annotation in AIDA, the approaches do not use those elements on that dataset. To reduce the prediction of false positives, MERIT retains only the first 10 candidates obtained by similarity search.
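For reference, these two metrics can be sketched as follows, encoding NIL gold entities as None; the dictionary-based interfaces are illustrative.

```python
def hit_at_k(ranked_candidates, gold, k):
    """Hit@k, measured only on queries whose gold entity is in the KB.
    `ranked_candidates` maps a query to its ranked candidate list."""
    hits = [gold[q] in cands[:k] for q, cands in ranked_candidates.items()
            if gold[q] is not None]
    return sum(hits) / len(hits)

def linking_accuracy(predictions, gold):
    """A prediction is valid if it returns the in-KB gold entity, or
    returns nothing (None) when the gold entity is NIL."""
    correct = [pred == gold[q] for q, pred in predictions.items()]
    return sum(correct) / len(correct)
```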
All the results are recorded in Tables 1 and 2.

Table 1: Performance of the linking approaches on the DWIE dataset.

                            DWIE
  Model               Hit@1  Hit@10  Hit@50  Accuracy
  Unfiltered
    Random Linking     84.7    92.5    92.5      91.9
    Mentions           89.8    93.6    93.6      92.2
    MERIT              91.8    95.4    95.8      93.0
    Fusion             92.1    95.2    95.8      92.6
  Filtered
    Random Linking     88.5    93.1    93.1      93.7
    Mentions           91.5    93.6    93.6      94.6
    MERIT              94.2    96.5    96.7      93.9
    MERIT+Mentions     94.3    96.6    96.7      94.6
  Filter + relations
    MERIT              95.0    96.5    96.7      94.1
    MERIT+Mentions     94.4    96.5    96.7      94.8

Table 2: Performance of the linking approaches on the AIDA dataset.

                            AIDA
  Model               Hit@1  Hit@10  Hit@50  Accuracy
  Random Linking       77.8    92.0    92.5      82.8
  Mentions             77.5    92.6    92.6      82.8
  MERIT                94.6    98.2    98.6      92.1
  Fusion               94.2    97.6    97.9      90.1

For similarity retrieval and for both datasets, MERIT gives better results than the mention popularity approach, and shows a benefit when used to propose results during a semi-automatic process. The trend is more nuanced on DWIE when dealing with NIL entities. This is explained by the fact that MERIT applies a classification to more candidates and is therefore more prone to predicting false positives, while the mention popularity approach filters candidates on identical mentions and is limited to 1 or 2 candidates. AIDA contains more ambiguity in its texts, but with distinct writing styles and contexts between entities sharing mentions, which may explain why MERIT obtains better results than the mention popularity module. We observe that re-ranking the returned entities according to relationship consistency slightly improves the results. We also studied a combination of the two approaches: this fusion uses MERIT with a classification on the most similar candidate, and falls back on the search by mention popularity in case of negative classification. The results show in this case a slight interest of the fusion when applied to DWIE, but it is less interesting on AIDA; the drop on AIDA is explained by the influence of the search by mention popularity.
Finally, we compared on the KBP task ELROND, ELROND with the Fusion entity linking solution, and the model proposed in the DWIE paper; since no linking method is given for the latter, we used the mention popularity method. The results presented in Table 3 show the interest of the proposed approaches, which improve the results by up to 2.2% in the warm-start scenario, and the complementarity of MERIT with the mention approach.

Table 3: Final model scores for the KBP task on the DWIE dataset.

                      Cold-start            Warm-start
  Model           F1-Micro  F1-Macro    F1-Micro  F1-Macro
  ELROND+FUSION       83.5      76.3        82.0      72.1
  ELROND              83.4      76.1        81.4      72.1
  DWIE                82.8      75.6        80.3      69.9
6 CONCLUSIONS
In this paper, we have formalized and presented an automatic, complete and scalable evaluation method for the KBP task from texts. It allows comparing and selecting methods in warm-start and cold-start scenarios. This protocol has been used to measure the performance of ELROND, a system that serves as a first basis for future improvements. We were able to improve the entity linking module of the ELROND baseline by proposing MERIT, an entity resolution model based on textual encoders and capable of integrating NIL entities during inference. Future work may study the contribution of a larger dataset for training a supervised linking model, extend the elements of interest to unnamed entities, study linking approaches that would only have access to a structured KB, and investigate the role that the ontology can play in KBP systems. We are also working on the production of a French dataset to train and evaluate KBP systems, and we plan to extend the presented models to the French language.
REFERENCES
Akbik, A., Bergmann, T., Blythe, D., Rasul, K., Schweter,
S., and Vollgraf, R. (2019). Flair: An easy-to-use frame-
work for state-of-the-art nlp. In Proceedings of the 2019
conference of the North American chapter of the asso-
ciation for computational linguistics (demonstrations),
pages 54–59.
Al-Badrashiny, M., Bolton, J., Chaganty, A. T., Clark, K.,
Harman, C., Huang, L., Lamm, M., Lei, J., Lu, D., Pan,
X., et al. (2017). Tinkerbell: Cross-lingual cold-start
knowledge base construction. In TAC.
Ammar, W., Groeneveld, D., Bhagavatula, C., Beltagy, I.,
Crawford, M., Downey, D., Dunkelberger, J., Elgohary,
A., Feldman, S., Ha, V., et al. (2018). Construction of
the literature graph in semantic scholar. In Proceedings
of NAACL-HLT, pages 84–91.
Dobrovolskii, V. (2021). Word-level coreference resolu-
tion. In Proceedings of the 2021 Conference on Em-
pirical Methods in Natural Language Processing, pages
7670–7675.
Ellis, J., Getman, J., Fore, D., Kuster, N., Song, Z., Bies,
A., and Strassel, S. M. (2015). Overview of linguistic re-
sources for the tac kbp 2015 evaluations: Methodologies
and results. In TAC.
Ghosal, T., Al-Khatib, K., Hou, Y., de Waard, A., and Fre-
itag, D. (2022). Report on the 1st workshop on argumen-
tation knowledge graphs (argkg 2021) at akbc 2021. In
ACM SIGIR Forum, volume 55, pages 1–12. ACM New
York, NY, USA.
Gillick, D., Kulkarni, S., Lansing, L., Presta, A., Baldridge,
J., Ie, E., and Garcia-Olano, D. (2019). Learning dense
representations for entity retrieval. In Proceedings of the
23rd Conference on Computational Natural Language
Learning (CoNLL), pages 528–537.
Hoffart, J., Yosef, M. A., Bordino, I., Fürstenau, H., Pinkal, M., Spaniol, M., Taneva, B., Thater, S., and Weikum, G. (2011). Robust disambiguation of named entities in text. In Proceedings of the 2011 conference on empirical methods in natural language processing, pages 782–792.
Huang, Z., Xu, W., and Yu, K. (2015). Bidirectional
lstm-crf models for sequence tagging. arXiv preprint
arXiv:1508.01991.
Kenton, J. D. M.-W. C. and Toutanova, L. K. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.
Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y.,
Isola, P., Maschinot, A., Liu, C., and Krishnan, D.
(2020). Supervised contrastive learning. Advances
in Neural Information Processing Systems, 33:18661–
18673.
Kuhn, H. W. (1955). The hungarian method for the assign-
ment problem. Naval research logistics quarterly, 2(1-
2):83–97.
Lin, X., Li, H., Xin, H., Li, Z., and Chen, L. (2020). Kb-
pearl: a knowledge base population system supported
by joint entity and relation linking. Proceedings of the
VLDB Endowment, 13(7):1035–1049.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D.,
Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V.
(2019). Roberta: A robustly optimized BERT pretraining
approach. CoRR.
Luo, C., Zhan, J., Xue, X., Wang, L., Ren, R., and Yang, Q.
(2018). Cosine normalization: Using cosine similarity
instead of dot product in neural networks. In Interna-
tional Conference on Artificial Neural Networks, pages
382–391. Springer.
Mesquita, F., Cannaviccio, M., Schmidek, J., Mirza, P.,
and Barbosa, D. (2019). Knowledgenet: A benchmark
dataset for knowledge base population. In Proceedings
of the 2019 Conference on Empirical Methods in Natu-
ral Language Processing and the 9th International Joint
Conference on Natural Language Processing (EMNLP-
IJCNLP), pages 749–758.
Min, B., Freedman, M., Bock, R., and Weischedel, R.
(2018). When ace met kbp: End-to-end evaluation of
knowledge base population with component-level anno-
tation. In Proceedings of the Eleventh International Con-
ference on Language Resources and Evaluation (LREC
2018).
Shue, L.-Y., Chen, C.-W., and Shiue, W. (2009). The devel-
opment of an ontology-based expert system for corpo-
rate financial rating. Expert Systems with Applications,
36(2):2130–2142.
Soares, L. B., Fitzgerald, N., Ling, J., and Kwiatkowski, T.
(2019). Matching the blanks: Distributional similarity
for relation learning. In Proceedings of the 57th Annual
Meeting of the Association for Computational Linguis-
tics, pages 2895–2905.
Wu, L., Petroni, F., Josifoski, M., Riedel, S., and Zettle-
moyer, L. (2020). Scalable zero-shot entity linking with
dense entity retrieval. In Proceedings of the 2020 Con-
ference on Empirical Methods in Natural Language Pro-
cessing (EMNLP), pages 6397–6407.
Yao, Y., Ye, D., Li, P., Han, X., Lin, Y., Liu, Z., Liu, Z.,
Huang, L., Zhou, J., and Sun, M. (2019). Docred: A
large-scale document-level relation extraction dataset. In
ACL (1).
Zaporojets, K., Deleu, J., Develder, C., and Demeester, T.
(2021). Dwie: An entity-centric dataset for multi-task
document-level information extraction. Information Pro-
cessing & Management, 58(4):102563.
Zhang, W., Hua, W., and Stratos, K. (2021). Entqa: Entity
linking as question answering. In International Confer-
ence on Learning Representations.
Zhou, W., Huang, K., Ma, T., and Huang, J. (2021).
Document-level relation extraction with adaptive thresh-
olding and localized context pooling. In Proceedings
of the AAAI conference on artificial intelligence, vol-
ume 35, pages 14612–14620.