The Person Index Challenge: Extraction of Persons from Messy, Short Texts

Markus Schröder 1,2, Christian Jilek 1,2, Michael Schulze 1,2 and Andreas Dengel 1,2

1 Smart Data & Knowledge Services Dept., DFKI GmbH, Kaiserslautern, Germany
2 Computer Science Dept., TU Kaiserslautern, Germany
Keywords: Person Index, Extraction, Short Text.
Abstract: When persons are mentioned in texts by their first name, last name and/or middle names, there can be high variation in which of their names are used, how the names are ordered and whether they are abbreviated. If multiple persons are mentioned consecutively in very different ways, short texts in particular can be perceived as “messy”. Once ambiguous names occur, associations to persons may not be inferred correctly. Despite these eventualities, in this paper we ask how well an unsupervised algorithm can build a person index from short texts. We define a person index as a structured table that distinctly catalogs individuals by their names. First, we give a formal definition of the problem and describe a procedure to generate ground truth data for future evaluations. To give a first solution to this challenge, a baseline approach is implemented. Using our proposed evaluation strategy, we test the performance of the baseline and suggest further improvements. For future research, the source code is publicly available.
1 INTRODUCTION
In the Western world, it is common that persons have a first name (forename, given name), a last name (surname, family name) and optionally additional names like middle names. When individuals are mentioned in texts, there can be high variation in which of their names are used, how the names are ordered and whether they are abbreviated. For example, “John Fitzgerald Kennedy”, “John”, “Kennedy, J F” and “J. Kennedy” are variations that refer to the same person.
Once people share equal names, references can easily become ambiguous even in smaller groups. In these cases, readers try to disambiguate them with additional context information given in the texts. However, short texts (or text snippets) in particular often lack a regular grammar, have only a few statistical signals and are rather ambiguous (Hua et al., 2015). Thus, in a worst-case scenario, correct associations to individuals are impossible to infer. Nevertheless, eligible persons can still be suggested.
Often, we encounter unstructured (short) texts in which several persons are mentioned but for which we do not have an index that lists them clearly. We define such a person index as a structured table that distinctly catalogs individuals by their names. Such a structured database can be utilized in various knowledge services, for example, in an ontology-based recognition of named entities (Jilek et al., 2019). However, arranging this index becomes a challenging task for humans and especially for machines when we consider the previously mentioned eventualities: optional (middle) names, name variations, ambiguities and short texts. In particular, the messiness of texts makes this challenge more difficult. We classify texts as “messy” if mentioned person names do not follow a particular pattern (or structure), i.e. the data quality is rather low. Still, an initial suggestion for an index could be calculated by an unsupervised algorithm to reduce the manual effort considerably. Because such a method's performance is still unclear in our described scenario, we ask the following research question: How well can an unsupervised algorithm build a person index from a set of messy, short texts?
In order to answer this question, we design a procedure that generates ground truth data to conduct evaluations of proposed solutions. The generator is able to produce a list of short texts referring to persons in various forms and generates the index of persons that has to be discovered. To acquire first performance results, we propose a baseline algorithm. This work does not intend to provide a novel procedure that solves the person index challenge in the best way possible. Instead, the contributions of this paper are the
following:
- a formal definition of the problem,
- a procedure that generates ground truth data for this specific challenge, and
- an evaluation strategy to assess the quality of algorithms that try to solve it.
For future research, the source code of the generator, the baseline algorithm and the evaluator is publicly available at GitHub (https://github.com/mschroeder-github/person-index). This paper is structured as follows: The next section (1.1) formally defines the problem of building a person index from texts. After the discussion of related work (Section 2), we suggest a data generator that produces ground truth for future evaluations (Section 3). Section 4 describes a baseline approach for the given problem. Our evaluation strategy in Section 5 shows first performance results. Section 6 concludes the paper and describes future work.
1.1 Problem Definition
The problem is a specific form of named entity recognition (NER) and disambiguation (NED), also known as entity linking, but without a knowledge base containing the persons in advance. At first, persons and their names have to be recognized in text snippets as usual. However, it is also important to decide which name is the first name, last name and middle name in order to fill the person index correctly. Although we are aware that there are persons having more than one middle name, for example “J. R. R. Tolkien”, we simplified our problem to correspond more closely to familiar industrial scenarios. This is also the reason why we currently focus on Western names only.
The disambiguation of persons cannot be done
with a preexisting person knowledge base in our sce-
nario since there is no such source in advance. In-
stead, disambiguation has to be done by examining
collected entries in the person index. Because of a
particular uncertainty in this process, there can be am-
biguous person references. What follows is a formal
specification of the described problem:
A person is a tuple $p_j := (fn, mn, ln)$ containing a first name ($fn$), a last name ($ln$) and an optional middle name ($mn$). Given a set of short texts $t_i \in T$, the challenge is to extract all distinct persons $p_j \in P$ from their text mentions such that (as far as possible) their names are at full length. In order to know which person was mentioned in which text, a relation $(t_i, p_j) \in R \subseteq T \times P$ has to be provided. If references are ambiguous and there is no way to disambiguate them, a relation $(t_i, r, P_A) \in A$ with $P_A \subseteq P$ shall capture that in a short text $t_i$, due to a substring $r$ (the reason), a set of persons $P_A$ is possibly mentioned. We distinguish between $R$ and $A$ to ease the later evaluation of correctly found unambiguous persons versus ambiguous ones.
Example. Given the texts $t_1$ = “Baker Thompson LS-Z-U”, $t_2$ = “mail to Chief Morgan (Wilson), [remove Baker, Robert]” and $t_3$ = “Wilson, M.; Susan Lea Baker”, the following persons can be discovered: $p_1 = (Robert, \bot, Baker)$, $p_2 = (Wilson, \bot, Morgan)$, $p_3 = (\bot, \bot, Thompson)$ and $p_4 = (Susan, Lea, Baker)$, where $\bot$ denotes a missing name. This leads to the relation $R = \{(t_1, p_3), (t_2, p_2), (t_2, p_1), (t_3, p_2), (t_3, p_4)\}$. Because “Baker” is ambiguous in $t_1$, we state $A = \{(t_1, \text{“Baker”}, \{p_1, p_4\})\}$.
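To make these definitions concrete, the following minimal Java sketch models the person tuple and the two relations; the class and field names are our own illustration, not taken from the published implementation. Java records are convenient here because their value-based equality matches the set semantics of $P$, $R$ and $A$.

```java
import java.util.Set;

// A person tuple p = (fn, mn, ln); the middle name may be null since it is optional.
record Person(String firstName, String middleName, String lastName) {}

// A mention relation entry (t_i, p_j) in R ⊆ T × P.
record Mention(String text, Person person) {}

// An ambiguity entry (t_i, r, P_A) in A: substring r is the reason why
// the set of persons P_A is possibly meant in text t_i.
record Ambiguity(String text, String reason, Set<Person> candidates) {}

class PersonIndexExample {
    public static void main(String[] args) {
        String t1 = "Baker Thompson LS-Z-U";
        Person p1 = new Person("Robert", null, "Baker");
        Person p4 = new Person("Susan", "Lea", "Baker");
        // "Baker" in t1 could refer to p1 or p4, so an ambiguity entry is stated.
        Ambiguity a = new Ambiguity(t1, "Baker", Set.of(p1, p4));
        System.out.println(a);
    }
}
```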
2 RELATED WORK
In the area of natural language processing (NLP), information extraction (IE) (Martínez-Rodríguez et al., 2020) and its subfield named-entity recognition (NER) (Nadeau and Sekine, 2007) are well-known disciplines. Usually, NER models are trained to recognize certain entity types in texts such as locations, organizations or persons. They can be divided into two categories depending on whether they know all possible entities upfront, such as in (Song et al., 2011), or have to detect them blindly. Our scenario belongs to the second category, with a focus on the recognition of person entities in texts. Besides the recognition, we also link found entities to a person index, which is a task known as entity linking, investigated for example in (Rizzo et al., 2017) for tweets. Similar to tweets, we limit the corpus to short texts only (Hua et al., 2015).
SMS messages (Ek et al., 2011) can also be considered short texts since they do not necessarily follow a regular grammar. The authors' supervised NER algorithm is pre-trained on an annotated SMS corpus and additionally supported by gazetteer lists.
There are many similar works which try to rec-
ognize entities in short texts. However, they usually
do not form an index of canonical forms as stated in
the problem definition. More similar to our scenario
is named-entity normalization (NEN). Besides usual
recognition, it integrates a normalization process to
assign unique identifiers to entities.
A person normalization problem is solved in (Jijkoun et al., 2008) by using within-document reference resolution in user-generated content. If a person cannot be disambiguated with a Wikipedia lookup, which is the typical case in our scenario, the person's surface form is used instead. In their appendix, a person name matching algorithm is described which uses heuristics and fuzzy matching. Another normalization strategy for tweets (Liu et al., 2012) finds overlapping tokens in the entities' names; the canonical form of an entity is the one with the maximum number of words.
Instead of persons, gene mentions have also been recognized and normalized (Cohen, 2005). For normalization, a dictionary is generated that contains many orthographic variants of how genes can be mentioned. In a similar way, sCooL (Jacob et al., 2014) normalizes academic institution names. Normalization has also been utilized to improve question answering (Khalid et al., 2008).
To the best of our knowledge, no previous work has proposed the problem of building a person index (or a similar index) as we do.
3 GROUND TRUTH GENERATOR
In order to be able to evaluate our approach and future ones, we propose a generator that produces ground truth data. Considering the problem definition in Section 1.1, the generator produces four comma-separated values (CSV) files: a set of short texts mentioning persons in various ways ($T$), a person index that lists all mentioned persons ($P$), a list of relations that relate short texts to persons ($R$) and a disambiguation list ($A$). As input, our procedure expects a catalog of first names and last names. Additionally, several parameters can be passed to customize the generation behavior, such as a random seed to control randomness and quantities to adjust how many persons and short texts should be generated.
In order to control the degree of ambiguity, the user can decide how many groups of persons share either a first name or a last name. The size of these groups can also be specified. For example, if the degree of ambiguity is set to two, two groups of two people each share a last name while another two groups of two persons each share a first name. Our generator starts with the creation of persons having ambiguous names. To avoid producing more ambiguity later, the selected ambiguous first names and last names are not used again. As an example, Robert Baker and Susan Lea Baker were generated in the ambiguous last name group “Baker”. The remaining persons are generated straightforwardly without using any name twice. Persons with middle names are generated as well; their number can be adjusted in the generator's settings. In these cases, another first name is randomly picked as a middle name. This way, “Susan Lea Baker” was produced in our example.
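A small parameter object summarizes the settings described above; the field names below are assumptions for illustration and may differ from the published generator's actual options.

```java
import java.util.Random;

// Hypothetical settings object mirroring the generator parameters described
// in the text; the real generator's option names may differ.
class GeneratorSettings {
    long randomSeed = 42;            // controls all randomness reproducibly
    int numPersons = 20;             // size of the person index P
    int numShortTexts = 200;         // size of the text collection T
    int maxPersonsPerText = 10;      // upper bound of mentions per snippet
    int personsWithMiddleName = 4;   // how many persons get a middle name
    int ambiguityDegree = 2;         // n groups of n sharing a last name,
                                     // plus n groups of n sharing a first name
}

class SeedDemo {
    public static void main(String[] args) {
        GeneratorSettings s = new GeneratorSettings();
        Random rnd = new Random(s.randomSeed); // same seed -> same ground truth
        System.out.println(rnd.nextInt(s.numPersons));
    }
}
```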
After the person index is completed, the short texts
are generated. The generation procedure is heavily in-
spired by concrete data observed in an industrial sce-
nario where spreadsheets were completed by individ-
uals over years. Since copy&paste was often used
to transfer names from several information systems
and files, various name variations can be found in the
data. That is why each generated short text (repre-
senting a spreadsheet cell) mentions a single person
or a group of people at random. To ensure that the
data is messy in a similar way to the observed data,
every person is mentioned using a different variation.
Table 1 lists fourteen patterns the generator uses to refer to a person. The patterns use the variables for first name ($fn$), middle name ($mn$) and last name ($ln$). The function letter(name) returns the first letter of a name, while lc(name) converts a name to lower case. The following procedures generate random strings: department() returns a string that looks like the description of a department, while rnd(n) generates a random alphabetic string of length n. note() randomly selects a short note from a list like “old”, “TODO” and “remember”. In the same way, role() randomly picks a role description from a list like “Executive”, “CEO” and “Chief”. The ‘\n’ symbol indicates a new line. To demonstrate how the patterns look, the full name “John Fitzgerald Kennedy” is used as an example. Note that not all patterns mention all names completely, which is captured by the FN (first name mentioned), MN (middle name mentioned) and LN (last name mentioned) columns. This information is vital to foresee ambiguity later. Moreover, some patterns contain additional information like department names, mails or notes to make texts more realistic and to potentially distract detection algorithms. If a person has a middle name, the generator makes sure to use patterns 11 to 14. The 11th pattern is always used first to ensure that the person's middle name is mentioned at least once.
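To illustrate, the helper functions and two of the patterns could be rendered as in the following sketch; this is our own reading of Table 1, not the generator's actual code.

```java
import java.util.Random;

class PatternDemo {
    static final Random RND = new Random(1);

    // letter(name): first letter of a name, e.g. "John" -> "J"
    static String letter(String name) { return name.substring(0, 1); }

    // lc(name): lower-cased name, e.g. "Kennedy" -> "kennedy"
    static String lc(String name) { return name.toLowerCase(); }

    // rnd(n): random alphabetic string of length n
    static String rnd(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) sb.append((char) ('a' + RND.nextInt(26)));
        return sb.toString();
    }

    public static void main(String[] args) {
        String fn = "John", mn = "Fitzgerald", ln = "Kennedy";
        // pattern 14: ln, letter(fn). letter(mn).  ->  "Kennedy, J. F."
        System.out.println(ln + ", " + letter(fn) + ". " + letter(mn) + ".");
        // pattern 9: ln fn lc(ln)@rnd(5).rnd(2)  ->  e.g. "Kennedy John kennedy@xraok.nc"
        System.out.println(ln + " " + fn + " " + lc(ln) + "@" + rnd(5) + "." + rnd(2));
    }
}
```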
If multiple persons are mentioned in a short text, they are separated by a random delimiter. Additionally, their names can be surrounded by all kinds of brackets and quotes. By this, we prevent algorithms from simply splitting texts in a trivial way. To give a short example, some short texts produced by our generator are listed below (separated by empty lines).
[Sullivan, Arthur <sullivan@wnpql.to>];
HR-X-C-N-G Brooks Alonso
[Watson, L.];Campbell, Mikaela;Cooper VG-Z
Isabella Adams - [Lee Zoey]
{Chloe}; Martin, M.; Raquel Amanda Garcia;
Myers Elijah
Alice
Table 1: Patterns to generate mentions of persons in various ways, demonstrated by an example. An ‘x’ in FN, MN and LN indicates that the first name, middle name or last name is fully mentioned in a pattern.

Nr.  Pattern                        Example                        FN  MN  LN
1    fn                             John                           x
2    ln                             Kennedy                                x
3    fn ln                          John Kennedy                   x       x
4    ln fn                          Kennedy John                   x       x
5    ln, fn                         Kennedy, John                  x       x
6    ln, letter(fn).                Kennedy, J.                            x
7    ln department()                Kennedy US-Z-G                         x
8    department() ln fn             US-Z-G Kennedy John            x       x
9    ln fn lc(ln)@rnd(5).rnd(2)     Kennedy John kennedy@xraok.nc  x       x
10   note() role() ln fn            new Admin Kennedy John         x       x
11   fn mn ln                       John Fitzgerald Kennedy        x   x   x
12   fn letter(mn). ln              John F. Kennedy                x       x
13   letter(fn). letter(mn). ln     J. F. Kennedy                          x
14   ln, letter(fn). letter(mn).    Kennedy, J. F.                         x
Sum  -                              -                              9   1   13
During the text generation, our procedure records which person was unambiguously mentioned in a certain text in order to capture the relation R. If, in contrast, a person has an ambiguous name, the relation A is filled instead.
4 BASELINE APPROACH
In this section, we describe our first approach to solving the person index challenge. This work does not intend to provide a novel procedure that solves the challenge in the best way possible; it is meant to present initial results and act as a comparative measure for future approaches.
The data's messiness is handled by our approach with some assumptions. A guess about name capitalization and ordering allows us to initially populate the person index. Additionally, a first name gazetteer lookup is performed to potentially correct the name order. There is no special handling for disambiguation.
First, our proposed procedure discovers persons in texts. Commonly, named entity recognition (NER) is applied to identify entities such as organizations, locations and also persons in texts. We therefore use OpenNLP (http://opennlp.apache.org/), a library that provides various natural language processing (NLP) algorithms. Its Name Finder module (http://opennlp.apache.org/docs/1.9.2/manual/opennlp.html#tools.namefind) allows detecting text entities of various types. To do so, it requires a model which is pre-trained on a corpus in a specific language for a certain entity type. In our algorithm, we utilize the person name finder model en-ner-person.bin (http://opennlp.sourceforge.net/models-1.5/), which was trained on an English corpus to detect names. Although this model was trained in a supervised manner, our baseline algorithm is still unsupervised since no ground truth (see previous section) was used to train it. Given a list of tokens, the NER model predicts with a certain probability whether a sequence of tokens refers to an individual. Unfortunately, the model cannot decide which of the tokens are first name, middle name or last name. That is why we make, for now, the following three provisional assumptions: if one token occurs, it is assumed to be a last name; if two tokens are found, the first token is the last name and the second one the first name; if three tokens are discovered, the token in the middle is presumed to be the middle name. Doing this for the given short texts yields a list of persons in which some occur several times.
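The following sketch shows how this detection step could look with OpenNLP's NameFinderME, including the token-count heuristic described above; the actual baseline's tokenization and error handling may differ, and the fn/ln order for the three-token case is our assumption.

```java
import java.io.FileInputStream;
import java.io.IOException;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.util.Span;

class BaselineNerSketch {
    public static void main(String[] args) throws IOException {
        // load the pre-trained English person model (en-ner-person.bin)
        TokenNameFinderModel model;
        try (FileInputStream in = new FileInputStream("en-ner-person.bin")) {
            model = new TokenNameFinderModel(in);
        }
        NameFinderME finder = new NameFinderME(model);

        String[] tokens = SimpleTokenizer.INSTANCE.tokenize("mail to Chief Morgan Wilson");
        Span[] spans = finder.find(tokens);

        for (Span span : spans) {
            String[] name = new String[span.length()];
            System.arraycopy(tokens, span.getStart(), name, 0, span.length());
            // provisional assumptions: 1 token -> last name;
            // 2 tokens -> last name then first name; 3 tokens -> middle name in between
            // (the fn/ln order for three tokens mirrors pattern 11 and is our guess)
            switch (name.length) {
                case 1 -> System.out.println("ln=" + name[0]);
                case 2 -> System.out.println("ln=" + name[0] + " fn=" + name[1]);
                case 3 -> System.out.println("fn=" + name[0] + " mn=" + name[1] + " ln=" + name[2]);
                default -> System.out.println(String.join(" ", name));
            }
        }
    }
}
```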
Because of errors made during detection, some persons get improper names consisting of, for example, symbols or single letters. We define a proper name as a name that starts with an upper case letter and ends with lower case letters. To filter persons, we match their names against a regular expression. In this process, duplicates are also removed. Although this makes it easier to construct a complete person index, we may lose valuable information (like initials) for a possible disambiguation later. Still, it is unclear whether first name, middle name and last name were correctly assigned. Our solution is the usage of a first name gazetteer list: we swap names accordingly if falsely assumed last names turn out to be first names. However, this feature requires a previously compiled list.
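A filter matching this definition of a proper name could, for instance, use the following regular expression; the baseline's exact expression may differ.

```java
import java.util.regex.Pattern;

class ProperNameFilter {
    // a proper name starts with one upper case letter and ends with lower case letters
    static final Pattern PROPER = Pattern.compile("\\p{Lu}\\p{Ll}+");

    static boolean isProper(String name) {
        return PROPER.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isProper("Kennedy")); // true
        System.out.println(isProper("J."));      // false -> filtered out
        System.out.println(isProper("US-Z-G"));  // false -> filtered out
    }
}
```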
Next, the relations (R) which relate short texts to
persons are discovered. We iterate again over all short
texts and match tokens with (first and last) names of
all persons in our index. In case a single person was
found, we state a relationship between the text and the
person. If multiple persons are matched, we make an
entry in the ambiguity list (A).
Our approach has three features that can be activated independently. First, OpenNLP's Name Finder allows clearing adaptive data (see https://opennlp.apache.org/docs/1.9.2/apidocs/opennlp-tools/opennlp/tools/namefind/NameFinderME.html) that is collected during text processing. This can be done each time a new short text is processed, which may improve the detection result. Second, the probability measures of the Name Finder model can be used to filter uncertain detections: if the probability is below 0.5, tokens will not be regarded as names of a person. Third, as already described, a first name gazetteer list can be used to swap names accordingly.
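A sketch of the three toggles is given below, reusing the finder from the earlier sketch and assuming a simple set-based gazetteer; only the 0.5 threshold is taken from the text.

```java
import java.util.Set;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.util.Span;

class FeatureToggles {
    // (a) clear adaptive data before each new short text
    static void beforeText(NameFinderME finder) {
        finder.clearAdaptiveData();
    }

    // (b) drop detections whose probability is below 0.5
    static boolean keep(NameFinderME finder, Span[] spans, int i) {
        return finder.probs(spans)[i] >= 0.5;
    }

    // (c) swap names if the assumed last name is actually a known first name
    static String[] maybeSwap(String fn, String ln, Set<String> firstNames) {
        if (firstNames.contains(ln) && !firstNames.contains(fn)) {
            return new String[] { ln, fn }; // corrected (fn, ln)
        }
        return new String[] { fn, ln };
    }
}
```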
5 EVALUATION
In this section, we present our evaluation strategy that assesses algorithms which try to solve the proposed person index challenge. The ground truth $GT := (T, P, R, A)$ consists of short texts $T$, a person index $P$, a relation $R$ and an ambiguity list $A$. Potential algorithms consume the short texts in $T$ and output a person index $P_a$, a relation $R_a$ and an ambiguity list $A_a$.
First, we are interested in the algorithm's performance in building the person index. Therefore, the ground truth's index $P$ is compared with the algorithm's index $P_a$. If all names of two given persons are identical, a correct match is assumed. Formally, an intersection of both sets can be calculated to obtain the matches $P_m := P \cap P_a$. With this, we can calculate the algorithm's precision and recall for assembling the person index:

$$\mathit{prec}_P := \frac{|P_m|}{|P_a|} \qquad \mathit{recall}_P := \frac{|P_m|}{|P|}$$
Second, we examine how often a correct mapping between a short text and a person is suggested. This only makes sense for persons which are correctly found by the algorithm, namely $P_m$. Thus, for each person $p_k \in P_m$, the relations $\hat{R} := \{(t_i, p_k) \mid (t_i, p_k) \in R,\ p_k \in P_m\}$ and, similarly for the algorithm's output, $\hat{R}_a := \{(t_i, p_k) \mid (t_i, p_k) \in R_a,\ p_k \in P_m\}$ can be defined. Again, identical mappings are calculated with the intersection $\hat{R}_m := \hat{R} \cap \hat{R}_a$. Doing this for all persons $p_k \in P_m$, we can calculate the average (avg) precision and recall for finding the relationships:

$$\mathit{prec}_R := \mathrm{avg}\left(\frac{|\hat{R}_m|}{|\hat{R}_a|}\right) \qquad \mathit{recall}_R := \mathrm{avg}\left(\frac{|\hat{R}_m|}{|\hat{R}|}\right)$$
Third, our goal is to find out how well the algorithm detects ambiguity. Similar to the person index comparison, the ground truth's list $A$ is compared with the algorithm's list $A_a$. However, this time we individually consider every person that was correctly suggested in the group of ambiguous people. Formally, since an element $(t_i, r, P_A) \in A$ contains a set of ambiguous people $P_A$, we define an auxiliary set $\hat{A} := \{(t_i, r, p_A) \mid (t_i, r, P_A) \in A,\ p_A \in P_A\}$ to ease further comparison. Again, an intersection can be calculated as $\hat{A}_m := \hat{A} \cap \hat{A}_a$. With this, the precision and recall for ambiguity detection can be stated as follows:

$$\mathit{prec}_A := \frac{|\hat{A}_m|}{|\hat{A}_a|} \qquad \mathit{recall}_A := \frac{|\hat{A}_m|}{|\hat{A}|}$$
For each precision and recall pair, we can calculate the harmonic mean, commonly known as the F-score:

$$\mathit{fscore} := 2 \cdot \frac{\mathit{prec} \cdot \mathit{recall}}{\mathit{prec} + \mathit{recall}}$$
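Since every measure above is a set intersection divided by a set size, an evaluator can be sketched generically as follows; value-based equality of the compared elements (as with the records sketched earlier) makes the intersections behave like the formal definitions.

```java
import java.util.HashSet;
import java.util.Set;

class Metrics {
    // |truth ∩ predicted| / |predicted|
    static <E> double precision(Set<E> truth, Set<E> predicted) {
        return predicted.isEmpty() ? 0.0
            : intersect(truth, predicted).size() / (double) predicted.size();
    }

    // |truth ∩ predicted| / |truth|
    static <E> double recall(Set<E> truth, Set<E> predicted) {
        return truth.isEmpty() ? 0.0
            : intersect(truth, predicted).size() / (double) truth.size();
    }

    // harmonic mean of precision and recall
    static double fScore(double prec, double recall) {
        return (prec + recall == 0.0) ? 0.0 : 2 * prec * recall / (prec + recall);
    }

    static <E> Set<E> intersect(Set<E> a, Set<E> b) {
        Set<E> m = new HashSet<>(a);
        m.retainAll(b);
        return m;
    }
}
```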
Our actual evaluation can be divided into two parts. In the first part, we examine how our method's features affect its detection performance. By testing on various generated data, we discover that only the first name gazetteer improves results on average. In the second part, we investigate eight experiments with specifically generated datasets. They give more detailed insights into our method's detection performance. Whenever we generate ground truth, we took as input the 100 most common last names in the USA (https://en.wikipedia.org/wiki/List_of_most_common_surnames_in_North_America#United_States_.28American.29, accessed 2020-04-05) and 197 popular first names (https://en.wikipedia.org/wiki/List_of_most_popular_given_names#Americas, accessed 2020-04-05), without middle names. As a list of roles we use “Admin”, “CEO”, “Chief”, “Executive”, “Developer”, “Contact” and the empty string. Notes can have the following forms: “new”, “old”, “TODO”, “remember”, “remove”, “send PDF to”, “send mail to” and “write message to”. For delimiters we use \n (new line), \n\n (two new lines), “/”, “ / ”, “ - ”, “; ” and “;”, and for brackets we consider “< >”, “( )”, “[ ]”, “{ }”, double quotes and single quotes.
Because our approach has three usable features, we first examine the effect they have on its detection performance. The described features are (a) clearing adaptive data, (b) usage of probability measures and (c) lookup in a first name gazetteer list.
Since these features influence the detection of persons, we investigate how accurately our algorithm compiles the person index using the $f_P$ measure. For this, several tests with differently generated data are performed in which no persons with middle names and no ambiguity are involved. 60 ground truth datasets of various sizes were generated at random. In particular, we vary the number of persons $|P|$ (10 to 100), the number of short texts $|T|$ (150 to 1500), the maximum number of persons mentioned in a text snippet (1 to 10) and the random seed. Without any feature activated, the algorithm reaches an average $f_P$ of 0.369 ± 0.012. The usage of probability measures to filter improbable detections does not show any effect. If the adaptive data is cleared for every new short text, the value slightly drops to 0.358 ± 0.026. Once a first name gazetteer is used, the correction of name assignments increases $f_P$ on average to 0.462 ± 0.024. Because of these insights, only the first name gazetteer feature is used in the further evaluation.
Next, we examine in more detail how our algorithm performs on eight generated datasets. Table 2 summarizes the evaluation results of the experiments. Regarding the ground truth, P denotes the length of the person index list, T the number of generated short texts, Max the maximum number of persons mentioned in a text, MN the number of persons with a middle name and Amb the degree of ambiguity. With an ambiguity degree of n, the dataset has n groups of n people each sharing a last name, while another n groups of n persons each share a first name. As stated above, the precision and recall measures are calculated accordingly. In the following, the eight experiments are examined.
In the first experiment, our algorithm correctly found the one person in ten different variations because the generation patterns always contain either the first name or the last name. However, if the number of generated short texts is increased (experiment number two), more persons are incorrectly detected. This is mainly due to role names and department names that look like real names (e.g. “Chief”, “Admin”, etc.). Since so many falsely extracted persons share equal names, the algorithm assumes that they are all ambiguous and does not state any correct relation between short texts and persons in $R_a$.
In the third test, we increase the number of persons to 20; still, only one person is mentioned per short text. Because different names are used, the algorithm has a better chance to find a correct name pair in the patterns. This is also the reason why more correct relations are found, since there is less potential for ambiguity.
For the fourth run, the maximum number of mentioned persons per text is increased to 10, so multiple persons can now be mentioned in one text. This situation seems to distract our algorithm, which results in lower $f_P$ and $f_R$ values. Regarding line 5, the performance declines slightly when persons with middle names are introduced. In fact, only one individual with a middle name was identified correctly. We assume that the OpenNLP model is not trained to identify persons with middle names.
In the sixth experiment, ambiguity is introduced: two groups of two people each share a last name while another two groups of two persons each share a first name. All eight ambiguous persons were discovered correctly in $P_a$ and their ambiguity was detected completely in $A_a$ (since $\mathit{recall}_A$ reaches 1.00). However, many other persons which share first names and last names by accident are incorrectly listed in the index. This results in a very low $\mathit{prec}_A$. If the degree of ambiguity is increased (line 7) and nearly every person is ambiguous (18 of 20), the recall declines a little.
In the last experiment, we increase the number of persons to 40 and the number of short texts to 300. With more persons, the algorithm falsely assumes more ambiguity, which lowers $\mathit{recall}_A$ slightly.
Our experiments show that our baseline does not reach an $f_P$ value above 0.6 in the more realistic use cases. $f_R$ reveals similarly poor results. Since many wrong persons with the same names are discovered, $\mathit{prec}_A$ shows the worst results. As improvements, we suggest training detection models that are able to distinguish first names and last names. In addition, more context should be considered when relations between texts and persons are discovered. For example, a letter of an abbreviated name can reduce ambiguity drastically. We assume that if more persons are correctly discovered, the precision of ambiguity discovery will increase.
6 CONCLUSION AND OUTLOOK
In this paper, we asked how well an unsupervised algorithm is able to build a person index from a set of short texts. We defined a person index as a structured table that distinctly catalogs individuals by their names. After giving a formal definition of this problem, we proposed a generator that is able to produce ground truth datasets for this challenge, inspired by concrete data. Additionally, a first baseline approach was described to examine first results.
Table 2: Performance of the baseline algorithm on generated ground truth data. P – length of the person index list, T – number of generated short texts, Max – maximum number of mentioned persons per text, MN – number of persons with middle names and Amb – degree of ambiguity. Precision (prec), recall and f-score (f) are calculated based on the relations P, R and A.

Nr.  P   T    Max  MN  Amb  prec_P  recall_P  f_P   prec_R  recall_R  f_R   prec_A  recall_A  f_A
1    1   10   0    0   0    1.00    1.00      1.00  1.00    1.00      1.00  -       -         -
2    1   200  0    0   0    0.14    1.00      0.25  0.00    0.00      -     -       -         -
3    20  200  0    0   0    0.63    0.85      0.72  0.82    0.72      0.77  -       -         -
4    20  200  10   0   0    0.38    0.90      0.54  0.61    0.09      0.16  -       -         -
5    20  200  10   4   0    0.31    0.80      0.45  0.63    0.09      0.16  -       -         -
6    20  200  10   4   2    0.39    0.75      0.52  0.47    0.41      0.44  0.03    1.00      0.06
7    20  200  10   4   3    0.45    0.85      0.59  0.59    0.54      0.56  0.05    0.95      0.09
8    40  300  10   4   3    0.39    0.75      0.52  0.53    0.49      0.51  0.03    0.87      0.06
In the evaluation, several measurements were defined to examine the performance of potential solutions. With this, we analyzed our approach and suggested potential improvements for future approaches.
For future work, we plan to examine the performance in real use cases using data from our industrial scenarios. In case of ambiguities, our goal is to efficiently integrate human experts who are able to contribute their knowledge. The challenge itself can be made more difficult by generating names with different cases (i.e. lower case, upper case, mixed case, camel case, etc.). Regarding the domain, we aim to generalize the problem statement to other entity types which have multiple names or IDs in different forms.
ACKNOWLEDGEMENTS
This work was funded by the BMBF project SensAI
(grant no. 01IW20007).
REFERENCES
Cohen, A. (2005). Unsupervised gene/protein named en-
tity normalization using automatically extracted dic-
tionaries. In Proceedings of the ACL-ISMB Work-
shop on Linking Biological Literature, Ontologies
and Databases: Mining Biological Semantics@ISMB
2005, Detroit, MI, USA June 24, 2005, pages 17–24.
Association for Computational Linguistics.
Ek, T., Kirkegaard, C., Jonsson, H., and Nugues, P. (2011).
Named entity recognition for short text messages.
Procedia-Social and Behavioral Sciences, 27:178–
187.
Hua, W., Wang, Z., Wang, H., Zheng, K., and Zhou, X.
(2015). Short text understanding through lexical-
semantic analysis. In 2015 IEEE 31st Int’l Conf. on
Data Engineering, pages 495–506.
Jacob, F., Javed, F., Zhao, M., and McNair, M. (2014). sCooL: A system for academic institution name normalization. In 2014 Int. Conf. on Collaboration Technologies and Systems, CTS 2014, Minneapolis, MN, USA, May 19-23, 2014, pages 86–93. IEEE.
Jijkoun, V., Khalid, M. A., Marx, M., and de Rijke, M.
(2008). Named entity normalization in user gener-
ated content. In Proceedings of the Second Workshop
on Analytics for Noisy Unstructured Text Data, AND
2008, Singapore, July 24, 2008, volume 303 of ACM
Int. Conf. Proceeding Series, pages 23–30. ACM.
Jilek, C., Schröder, M., Novik, R., Schwarz, S., Maus, H., and Dengel, A. (2019). Inflection-tolerant ontology-based named entity recognition for real-time applications. In 2nd Conference on Language, Data and Knowledge, LDK 2019, May 20-23, 2019, Leipzig, Germany, volume 70 of OASICS, pages 11:1–11:14. Schloss Dagstuhl - Leibniz-Zentrum für Informatik.
Khalid, M. A., Jijkoun, V., and de Rijke, M. (2008). The impact of named entity normalization on information retrieval for question answering. In Advances in Information Retrieval, 30th European Conference on IR Research, ECIR 2008, Glasgow, UK, March 30-April 3, 2008. Proceedings, volume 4956 of Lecture Notes in Computer Science, pages 705–710. Springer.
Liu, X., Zhou, M., Zhou, X., Fu, Z., and Wei, F. (2012).
Joint inference of named entity recognition and nor-
malization for tweets. In The 50th Annual Meeting of
the Association for Computational Linguistics, Pro-
ceedings of the Conference, July 8-14, 2012, Jeju Is-
land, Korea - Volume 1: Long Papers, pages 526–535.
The Association for Computer Linguistics.
Martínez-Rodríguez, J., Hogan, A., and López-Arévalo, I. (2020). Information extraction meets the semantic web: A survey. Semantic Web, 11(2):255–335.
Nadeau, D. and Sekine, S. (2007). A survey of named entity
recognition and classification. Lingvisticae Investiga-
tiones, 30.
Rizzo, G., Pereira, B., Varga, A., van Erp, M., and Basave,
A. E. C. (2017). Lessons learnt from the named en-
tity recognition and linking (NEEL) challenge series.
Semantic Web, 8(5):667–700.
Song, Y., Wang, H., Wang, Z., Li, H., and Chen, W.
(2011). Short text conceptualization using a proba-
bilistic knowledgebase. In IJCAI 2011, Proceedings
of the 22nd Int. Joint Conference on Artificial Intelli-
gence, Barcelona, Catalonia, Spain, July 16-22, 2011,
pages 2330–2336. IJCAI/AAAI.