Terminology Expansion with Prototype Embeddings:
Extracting Symptoms of Urinary Tract Infection from Clinical Text
Mahbub Ul Alam¹, Aron Henriksson¹, Hideyuki Tanushi², Emil Thiman²,³, Pontus Naucler²,³ and Hercules Dalianis¹
¹Department of Computer and Systems Sciences, Stockholm University, Stockholm, Sweden
²Division of Infectious Disease, Department of Medicine, Karolinska Institutet, Stockholm, Sweden
³Department of Infectious Diseases, Karolinska University Hospital, Stockholm, Sweden
Keywords:
Natural Language Processing, Terminologies, Synonym Extraction, Word Embeddings, Clinical Text.
Abstract:
Many natural language processing applications rely on the availability of domain-specific terminologies con-
taining synonyms. To that end, semi-automatic methods for extracting additional synonyms of a given concept
from corpora are useful, especially in low-resource domains and noisy genres such as clinical text, where non-
standard language use and misspellings are prevalent. In this study, prototype embeddings based on seed
words were used to create representations for (i) specific urinary tract infection (UTI) symptoms and (ii) UTI
symptoms in general. Four word embedding methods and two phrase detection methods were evaluated using
clinical data from Karolinska University Hospital. It is shown that prototype embeddings can effectively cap-
ture semantic information related to UTI symptoms. Using prototype embeddings for specific UTI symptoms
led to the extraction of more symptom terms compared to using prototype embeddings for UTI symptoms in
general. Overall, 142 additional UTI symptom terms were identified, yielding a more than 100% increment
compared to the initial seed set. The mean average precision across all UTI symptoms was 0.51, and as high as
0.86 for one specific UTI symptom. This study provides an effective and cost-effective solution to terminology
expansion with small amounts of labeled data.
1 INTRODUCTION
In many applications of natural language processing
(NLP), there is a need for ready access to domain-
specific terminologies. However, for low-resource
languages and domains, wide-coverage terminologi-
cal resources tend to be scarce, and are often pro-
hibitively expensive to create manually. In the con-
text of noisy genres such as clinical text, where non-
standard language use, creative shorthand and mis-
spellings are prevalent (Dalianis, 2018), it is espe-
cially important to have access to domain-specific
knowledge about the meaning of terms and their se-
mantic relationships. To that end, semi-automatic and
data-driven methods for extracting additional syn-
onyms of a given concept from corpora are useful for
expanding an existing but limited terminology. Such
efforts are not only cost-efficient, but are also com-
pelling due to their ability to capture real, domain-
specific language use, including common spelling
variants. This allows for the recall (sensitivity) of in-
formation extraction systems to be vastly improved.
Several different approaches to terminology ex-
pansion and synonym extraction have been proposed,
including the use of lexico-syntactic patterns and
graph-based models. More recent efforts have tended
to leverage models of distributional semantics, espe-
cially in the form of word embeddings. These mod-
els are based on the distributional hypothesis (Harris,
1954), which states that words with similar distribu-
tions in a corpus, i.e. words that appear in similar
contexts and co-occur with similar sets of words,
often have similar meanings. They have become
popular also in clinical NLP as data from electronic
health records (EHRs) has become more readily ac-
cessible for research (Khattak et al., 2019). By cre-
ating vector-based representations of word meaning
in semantic space, estimates of semantic similarity to
other words in a corpus can be computed, forming the
basis for many synonym extraction efforts in the clin-
ical domain (Henriksson et al., 2014b; Zhang et al.,
2017; Fan et al., 2019).
In this study, we explore and further investigate
the notion of prototype embeddings for terminology
expansion. Prototype embeddings can be derived us-
ing any model of distributional semantics and are vec-
tor representations that aim to capture the meaning of
higher-level concepts based on lexical instantiations
of (some of) their members (Henriksson et al., 2014a).
Prototype embeddings have been shown to be effec-
tive in generating semantic features that improve clin-
ical named entity recognition systems; here, we build
on the idea of prototype embeddings for expanding a
terminology for urinary tract infection (UTI) symp-
toms by extracting candidate terms from clinical text
corpora.
A UTI is an infection in any part of the urinary
system, including kidneys, ureters, bladder and ure-
thra. It is primarily caused by bacteria and is among
the most common bacterial infections in the human
body (Foxman, 2010). UTIs result in suffering and
can also be lethal when they lead to sepsis (Herzog
et al., 2014). Diagnosis of UTI is based on a com-
bination of urinary symptoms and urine culture in-
formation (Rubin et al., 1992). There are a number
of UTI symptoms, which can be categorized as follows:
painful urination (dysuria), frequent urination (fre-
quency), constant urge to urinate (urgency), tender-
ness in the lower abdomen (suprapubic tenderness),
tenderness or pain elicited by percussion (tapping on
the body surface to assess the underlying structures)
over the area of the back overlying the kidneys
(costovertebral angle pain or tenderness, the costovertebral
angle being the angle formed by the vertebral column
and the lower end of the thorax), as well as some other,
less specific symptoms (non-specific) (ECDC, 2016;
NHSN, 2017). Using only urine culture information
for the diagnosis of UTI will lead to the overestima-
tion of the incidence of UTI (Landers et al., 2010). As
a result, the detection of UTI symptoms is critical for
accurately identifying cases of UTI in EHRs.
While data-driven techniques that can be used to
support terminology development are important for
many domains, the specific motivation behind this
study – from an application perspective – is to extract
an extensive set of UTI symptom terms as they ap-
pear in real clinical text. The developed terminology
of UTI symptoms is intended to be used for develop-
ing a system for automatically detecting UTIs based
on structured and unstructured data from EHRs. The
main contributions of this study are as follows:
Two statistical phrase detection methods, with dif-
ferent thresholds, are explored to study the impact
of the trade-off between the number and quality of
the identified phrases on the downstream task of
terminology expansion. Phrase detection is a necessary
component in the data processing pipeline
as many symptoms are multi-word expressions.
Four word embedding methods are used for
deriving prototype embeddings: Word2Vec,
Phrase2Vec, GloVe and FastText. More impor-
tantly, we evaluate the use of prototype embed-
dings for terminology expansion and explore pro-
totype embeddings at two levels of abstraction: (i)
for specific UTI symptoms and (ii) for UTI symp-
toms in general.
Two different corpora are used for training the
prototype embeddings and we explore the trade-
off between data volume and quality: one corpus
is smaller but contains only positive UTI cases,
whereas the other is larger but somewhat less rel-
evant to the target domain.
Using a small set of seed terms in the form of
UTI symptoms, we are able to extract another 142
new terms for inclusion in the terminology using a
data-driven and semi-automatic method based on
prototype embeddings.
2 METHODS & MATERIALS
In this study, we investigate the use of prototype em-
beddings for the extraction of UTI symptom terms
from clinical text. In order to create prototype em-
beddings for this task, we need: (i) to detect phrases in
the unannotated corpus (in order to be able to capture
symptoms that are not only expressed as single words
but also as multiword expressions), and (ii) base em-
beddings from which to derive the prototype embed-
dings. We conduct a number of experiments with dif-
ferent underlying corpora, different methods for auto-
matic detection of phrases, different methods for cre-
ating the base word embeddings, as well as prototype
embeddings constructed at different levels of abstrac-
tion. Real-world clinical data is extracted from a major
university hospital in Sweden. A domain expert anno-
tates a portion of the data to create seed terms and also
evaluates the candidate UTI symptom terms identified
by the various models.
2.1 Methods
Below follows a description of the methods used
in the study: (i) methods for phrase detection, (ii)
methods for creating base word embeddings, and (iii)
methods for creating prototype embeddings.
2.1.1 Phrase Detection
For this task, phrase detection is necessary as symp-
tom expressions can either be in the form of unigram
words (e.g. headache) or multiword expressions (e.g.
sore throat). Embeddings for terms of varying length
therefore need to be created, and this is achieved by
automatically identifying (and, in some cases, con-
catenating) phrases in the underlying corpus, which
is later used for constructing the word embeddings.
Two common and simple data-driven phrase de-
tection methods are used to this end. Both are based
on the notion of identifying words that often co-occur
together, but rarely in other contexts. The first one
is presented in (Mikolov et al., 2013) and identifies
phrases based on unigram and bigram counts accord-
ing to the following scoring function:
score(w_i, w_j) = (count(w_i w_j) − δ) / (count(w_i) × count(w_j)),
where δ is a discounting coefficient that helps
to avoid identifying too many phrases made up of
very rare words. Bigrams that score above a cer-
tain set threshold are treated as phrases; this process
is repeated in several passes over the data such that
longer phrases than bigrams can also be identified.
Here, this method is referred to as IM (for iterative
merging). The second phrase detection method is
based on the normalized (pointwise) mutual informa-
tion among collocated words (Bouma, 2009). Here,
we refer to this method as nPMI.
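To make the two scoring functions concrete, the sketch below shows how they could be computed over a tokenized corpus in Python; the δ value, the threshold and the toy tokens are illustrative assumptions, not the settings used in the study.

```python
import math
from collections import Counter

def bigram_scores(tokens, delta=5.0):
    """Score adjacent word pairs with the IM statistic (Mikolov et al., 2013)
    and normalized PMI (Bouma, 2009). Returns {(w_i, w_j): (im, npmi)}."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_bigrams = sum(bigrams.values())
    scores = {}
    for (wi, wj), c in bigrams.items():
        # IM: (count(wi wj) - delta) / (count(wi) * count(wj))
        im = (c - delta) / (unigrams[wi] * unigrams[wj])
        # nPMI: log(p(wi,wj) / (p(wi) p(wj))) / -log p(wi,wj), bounded in [-1, 1]
        p_ij = c / n_bigrams
        p_i = unigrams[wi] / len(tokens)
        p_j = unigrams[wj] / len(tokens)
        npmi = math.log(p_ij / (p_i * p_j)) / -math.log(p_ij)
        scores[(wi, wj)] = (im, npmi)
    return scores

def merge_phrases(tokens, threshold=1.0, delta=5.0, passes=2):
    """Iteratively join bigrams whose IM score exceeds the threshold,
    using '_' as the glue character, over several passes (the IM method)."""
    for _ in range(passes):
        scores = bigram_scores(tokens, delta)
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and scores[(tokens[i], tokens[i + 1])][0] > threshold:
                merged.append(tokens[i] + "_" + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# Illustrative usage on a toy token sequence.
print(bigram_scores("pat kissar ofta och kissar ofta".split(), delta=0.5))
```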
2.1.2 Word Embeddings
The distributional hypothesis (Harris, 1954) states
that words that frequently co-occur in similar con-
texts tend to be semantically similar. Many meth-
ods have exploited this observation to automatically
derive, from large text corpora, vector representa-
tions of word meaning. Word embeddings are lexical
semantic representations in the form of dense, low-
dimensional vectors in a continuous vector space, in
which embeddings of semantically similar words are
also expected to be in relatively close proximity in se-
mantic (vector) space. Word embeddings can be de-
rived using a number of different methods; as it has
been shown that there is no single method that consis-
tently outperforms others for all types of biomedical
NLP tasks (Wang et al., 2018), we investigate the use
of four common word embedding methods:
Word2Vec (Mikolov et al., 2013) derives, in an
efficient manner, word embeddings using a shallow
neural network that is trained to carry out a prediction
task without the need for manually labeled data. There
are two variants of the learning task: continuous bag
of words (CBOW) and skip-gram. In CBOW, the task
is to learn to predict the target word based on its con-
text (i.e. the adjacent words in a fixed-size window),
while, in the skip-gram model, the task is instead to
predict the context based on the target word.
Phrase2Vec (Artetxe et al., 2018) is an exten-
sion of the former, designed to derive embeddings
for phrases. This method requires one to provide a
list of phrases separately, for which it learns phrase
embeddings, along with regular unigram-based word
embeddings.
GloVe (Pennington et al., 2014) combines global
matrix factorization and local context window meth-
ods to derive word embeddings. The idea is to take
into account the frequency of word co-occurrences in
the entire corpus when deriving the word embeddings.
FastText (Bojanowski et al., 2017) treats words as
a combination of n-gram characters. These n-gram
characters can be mapped to dense vectors, and the
overall aggregation of these lower-level embeddings
can be used to represent a word or a phrase. This
allows for deriving embeddings for unknown words
and also requires less training data in comparison to
the aforementioned methods.
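As an illustration of how such base embeddings might be trained, the sketch below uses the gensim library (an assumption on our part; the paper does not state which implementations were used), with hyperparameter values in the spirit of the search space in Table 10. GloVe and Phrase2Vec would require their respective reference implementations.

```python
from gensim.models import Word2Vec, FastText

# corpus: list of tokenized clinical notes, with multiword symptom
# expressions already joined by underscores (e.g. "kissar_ofta").
corpus = [["pat", "uppger", "sveda", "vid", "miktion"],
          ["kissar_ofta", "och", "har", "trängningar"]]

# Word2Vec (skip-gram with negative sampling); values are illustrative,
# not the settings selected in the study.
w2v = Word2Vec(sentences=corpus, vector_size=100, window=5,
               sg=1, hs=0, negative=10, min_count=1, epochs=5)

# FastText additionally learns character n-gram (subword) vectors,
# which helps with misspellings and unseen word forms.
ft = FastText(sentences=corpus, vector_size=100, window=5,
              sg=1, min_count=1, min_n=2, max_n=10, epochs=5)

print(w2v.wv.most_similar("sveda", topn=5))
print(ft.wv["trängningar"])  # subword vectors cover rare or unseen forms
```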
2.1.3 Prototype Embeddings
Prototype embeddings (Henriksson et al., 2014a) are
intended to capture the semantics of a (higher-level)
concept or a group based on the embeddings of the
members. A prototype embedding for a group can, for
instance, simply be created through mean or median
pooling of the members’ embeddings. Here, we take
the column-wise mean value of a set of embeddings
to derive a prototype embedding. It has been shown
that prototype embeddings can be used for creating
semantic features that help to improve named en-
tity recognition systems, while further improvements
can be obtained by creating ensembles of prototype
embeddings, where each member is derived from a
model built with different underlying data and/or hy-
perparameters (Henriksson, 2015).
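A minimal sketch of mean-pooled prototype embeddings and nearest-neighbor retrieval is shown below, assuming the base embeddings behave like a gensim KeyedVectors object (term to numpy vector); the seed terms are illustrative.

```python
import numpy as np

def prototype_embedding(seed_terms, wv):
    """Column-wise mean of the seed terms' embeddings."""
    vectors = np.stack([wv[t] for t in seed_terms if t in wv])
    return vectors.mean(axis=0)

def nearest_neighbors(prototype, wv, vocab, exclude, topn=100):
    """Rank vocabulary terms by cosine similarity to the prototype."""
    proto = prototype / np.linalg.norm(prototype)
    scored = []
    for term in vocab:
        if term in exclude:
            continue
        vec = wv[term]
        sim = float(np.dot(proto, vec / np.linalg.norm(vec)))
        scored.append((sim, term))
    return [t for _, t in sorted(scored, reverse=True)[:topn]]

# Illustrative seed terms for the 'urgency' symptom (symptom-specific level).
seeds = ["trängningar", "urinträngningar", "täta_trängningar"]
# proto = prototype_embedding(seeds, ft.wv)
# candidates = nearest_neighbors(proto, ft.wv, ft.wv.index_to_key, set(seeds))
```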
In this study, we investigate the use of prototype
embeddings for a different task, namely terminology
expansion, and, in particular, to extract UTI symptom
terms as they appear in real clinical text.
We also investigate and compare the use of prototype
embeddings constructed at different levels of abstrac-
tion: (i) one prototype embedding for each specific
UTI symptom, and (ii) another, high-level prototype
embedding for UTI symptoms in general. With the
former, we aim to extract terms that express a specific
UTI symptom and only use seed terms within that
group to derive the prototype embedding. In the latter
case, we aim to extract any form of UTI symptom and
use all seed terms to derive a single prototype embed-
ding. We specifically investigate which abstraction
level is the most productive for terminology expan-
sion. Here, we refer to the former as symptom-specific
and the latter as symptom-general.
2.2 Data
In this study, data in the form of text corpora are
needed for constructing the word embeddings. In ad-
dition, the proposed method relies on access to seed
terms for constructing prototype embeddings, both
symptom-specific and symptom-general.
2.2.1 Corpora
The underlying corpora are extracted from a database
of electronic health records from Karolinska Univer-
sity Hospital in Stockholm, Sweden. The data used in
this study can be obtained (upon request) from the re-
search infrastructure The Swedish Health Record Re-
search Bank (Health Bank, http://dsv.su.se/healthbank)
at Stockholm University
(Dalianis et al., 2015). The infrastructure contains
more than two million patient records from the years
2007-2014 obtained from Karolinska University Hos-
pital.
We extracted clinical notes with the following in-
clusion criteria: (i) patients who are 18 years or older,
(ii) admitted to the hospital between July, 2010 and
March, 2013, and (iii) one urine culture taken during
the hospitalization period. In total, there were 10,335
urine cultures found in 7,256 hospitalizations of 5,659
patients. A urine culture was considered positive if
there was a significant growth (at least 10^5
colony-forming units per milliliter of
urine) of no more than two pathogens. In total, there
were 7,972 positive urine cultures found in 6,943 hos-
pitalizations of 5,653 patients.
Table 1: Number of types and tokens in the two corpora.
Corpus | Types | Tokens
Case Group | 156,695 | 13,475,706
Control Group | 181,331 | 19,357,294
Two corpora are extracted for the experiments de-
scribed later in section 2.3. One contains only clin-
ical notes for hospitalizations that contain a positive
urine culture, i.e. the Case Group. Another corpus is
created with clinical notes for hospitalizations with-
out a positive urine culture, i.e. the Control Group.
The total number of types and tokens in the respective
corpora are shown in Table 1. The corpora are preprocessed
by removing punctuation marks and lowercasing all characters.
2.2.2 Seed Terms
In order to create a prototype embedding for a higher-
level concept or group, access is needed to a sample of
terms that represent members of that group. A physi-
cian and expert in infectious diseases, with exten-
sive experience of treating patients with UTI, there-
fore manually annotated one month’s (April, 2012)
worth of data according to the aforementioned inclu-
sion criteria. In total, 120 UTI symptom terms were
annotated according to the six UTI symptoms men-
tioned in the introduction: dysuria, frequency, ur-
gency, suprapubic tenderness, costovertebral angle
pain or tenderness, and non-specific. In this anno-
tation set, a total of 240 positive urine cultures were
identified in 201 hospitalizations of 195 patients. The
annotator marked the symptom terms with the exact
form and spelling as found in the clinical text. Table 2
provides some examples of the annotated symptom terms.
As can be seen, some symptom terms are misspelt
(trägnningar should be trängningar); these need to be
captured in order for the terminology to be effective
for information extraction purposes. It is worth mentioning
that the sixth UTI symptom (non-specific) was used to
group symptom terms that are not included in the ECDC
(European Centre for Disease Prevention and Control) or
CDC (Centers for Disease Control and Prevention) definitions
but that could still be relevant for detecting UTI; for
example, miktionsbesvär (micturition problems) could
indicate some form of disturbance related to micturition.
The seed terms are also used for initial evaluation and
hyperparameter tuning, see section 2.3.4. Table 3 provides
the number of manually annotated seed terms for each UTI
symptom and their frequency in the two corpora.
2.3 Experimental Setup
In this paper, we investigate several research ques-
tions in the following sets of experiments:
2.3.1 Experiment 1: Underlying Data
One of the most fundamental aspects that affects the
makeup of a word embedding space is the data which
is used for training the model. Here, we investigate
two aspects of the underlying data: (1) phrase detec-
tion and (2) data volume vs. quality.
Phrase detection is a necessary step in order to
be able to identify UTI symptoms in the form of
Table 2: Examples of annotated UTI symptom terms.
UTI Symptom | Example Term | Translation
Dysuria | sveda | burning sensation
Frequency | kissar ofta | urinating often
Urgency | trägnningar | urgency (misspelt)
Suprapubic tenderness | ont i blåsa | bladder pain
Costovertebral angle pain or tenderness | flanksmärta | flank pain
Non-specific | miktionsbesvär | micturition problems
Table 3: Frequency of seed terms in the two corpora.
UTI Symptom | Case Group Types | Case Group Tokens | Control Group Types | Control Group Tokens
Dysuria | 26 | 3,902 | 26 | 4,674
Frequency | 9 | 337 | 9 | 395
Urgency | 8 | 4,838 | 8 | 5,913
Suprapubic tenderness | 14 | 49 | 14 | 55
Costovertebral angle pain / tenderness | 35 | 1,254 | 35 | 1,495
Non-specific | 28 | 1,701 | 28 | 2,067
multiword expressions. In data-driven approaches to
phrase detection, there is a trade-off between the num-
ber and quality of identified phrases. In addition to
comparing the two data-driven phrase detection meth-
ods described in section 2.1.1, we explore the down-
stream impact of using a small, medium, or large list
of automatically identified phrases. The phrase lists
are generated using three different thresholds for each
of the two phrase detection methods: 100 (small),
5 (medium), 1 (large) for IM and 0.57 (small), 0.34
(medium) and 0.23 (large) for nPMI. In order to en-
sure that the manually annotated UTI symptom terms
are treated as phrases, they are concatenated using
the underscore character ("_"). For example, all in-
stances of "kissar ofta" (urinating often) are replaced
by "kissar_ofta".
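One simple way to perform this concatenation is a plain string replacement over the preprocessed text, as in the sketch below; the seed terms are illustrative.

```python
# Replace annotated multiword seed terms with underscore-joined tokens,
# longest terms first so that longer expressions are not broken up.
multiword_seeds = ["kissar ofta", "ont i blåsa", "sveda vid miktion"]  # illustrative

def concatenate_seed_terms(text, seed_terms):
    for term in sorted(seed_terms, key=len, reverse=True):
        text = text.replace(term, term.replace(" ", "_"))
    return text

print(concatenate_seed_terms("pat kissar ofta och har sveda vid miktion",
                             multiword_seeds))
# -> "pat kissar_ofta och har sveda_vid_miktion"
```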
It is well-known that large corpora lead to higher-
quality word embeddings, as it is necessary to have a
large number of observations of language use, i.e.
the contexts in which terms are used, in order to
capture the variety and nuances of word meaning.
However, the “quality” of word embeddings – here
defined according to their performance in the down-
stream task of terminology expansion – is also deter-
mined by the “quality” of the underlying data. In this
context, we define data quality according to how spe-
cific the corpus is to the application domain of UTI.
We investigate the use of two different underlying cor-
pora: the Case Group corpus is relatively smaller but
assumed to be of higher quality compared to the Con-
trol Group corpus, which is relatively larger but less
specific to the application domain and hence assumed
to be of lower quality. See Table 4 for the number
of phrases identified with each setting and underlying
corpus.
Table 4: The number of identified phrases in each corpus using different phrase lists (Small, Medium, Large) generated using two different phrase detection methods (IM, nPMI) and three different thresholds.
Phrase List | Case Group IM | Case Group nPMI | Control Group IM | Control Group nPMI
Small | 7,780 | 7,145 | 11,149 | 10,233
Medium | 29,918 | 28,626 | 41,896 | 40,728
Large | 47,406 | 46,866 | 67,859 | 67,972
2.3.2 Experiment 2: Underlying Embedding Method
Another important aspect that affects the makeup of a
word embedding space is the method used for train-
ing the model. Different methods perform well in
different domains and on different downstream tasks:
here, we evaluate the following four word embedding
methods to generate base models from which to de-
rive prototype embeddings: Word2Vec, Phrase2Vec,
GloVe and FastText (see section 2.1.2 for details).
The word embedding methods have many hyper-
parameters that need to be tuned. Instead of doing a
grid search in some restricted hyperparameter space,
points are chosen at random in order to more effec-
tively search the space. For each word embedding
method, 50 points are randomly selected, thus yield-
ing 50 different models for each method. See Table 10
in the appendix section, which provides details con-
cerning the hyperparameter space within which points
are randomly sampled.
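The random search could be implemented roughly as below, sampling each of the 50 configurations uniformly from the value sets in Table 10 (only a subset of the hyperparameters is shown; names and values are illustrative).

```python
import random

# Subset of the search space from Table 10 (appendix).
search_space = {
    "corpus": ["case", "control"],
    "phrase_detection": ["IM", "nPMI"],
    "phrase_list": ["small", "medium", "large"],
    "window": [5, 10, 15],
    "vector_size": [50, 100],
    "epochs": [2, 5, 10],
    "sg": [0, 1],
    "hs": [0, 1],
}

def sample_configurations(space, n=50, seed=42):
    """Draw n random points from the hyperparameter space."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in space.items()} for _ in range(n)]

configs = sample_configurations(search_space, n=50)
print(configs[0])
```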
2.3.3 Experiment 3: Prototype Abstraction
Level
One of the key research questions that we investigate
in this study, building on previous work on the use
of prototype embeddings, is at what level of abstraction
prototype embeddings are best used for terminology
expansion. We compare two
prototype abstraction levels: (1) at the specific UTI
symptom level (symptom-specific), and (2) at the gen-
eral UTI symptom level (symptom-general). All base
word embedding models are used for deriving the best
prototype embeddings within each abstraction level.
The two levels are finally compared and evaluated for
their ability to identify new UTI terms. The candidate
terms produced by the prototype embedding models
at each level are manually assessed by a domain ex-
pert, see section 2.3.4 for further details.
2.3.4 Evaluation
In this study, mean average precision (MAP) is used
as the primary evaluation metric (Schütze et al., 2008).
MAP is the simple average of average precision (AP)
scores over all examples in a validation set. AP is a
metric that describes to what extent relevant items are
concentrated in the highest-ranked predictions. AP can
be calculated by summing, over each threshold level k
in the ranked predictions, the difference between the
recall at level k and the recall at the previous level
(k − 1), multiplied by the precision at level k; that is,
AP = Σ_k (R_k − R_{k−1}) × P_k. Precision is the
fraction of predictions that are relevant and correct,
and recall is the fraction of all relevant items that are
retrieved.
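The following sketch shows how AP and MAP could be computed from a ranked candidate list according to the definition above; the example ranking and relevance labels are illustrative.

```python
def average_precision(ranked_terms, relevant_terms):
    """AP = sum over ranks k of (R_k - R_{k-1}) * P_k."""
    relevant = set(relevant_terms)
    hits, ap, prev_recall = 0, 0.0, 0.0
    for k, term in enumerate(ranked_terms, start=1):
        if term in relevant:
            hits += 1
        precision_k = hits / k
        recall_k = hits / len(relevant)
        ap += (recall_k - prev_recall) * precision_k
        prev_recall = recall_k
    return ap

def mean_average_precision(ap_scores):
    return sum(ap_scores) / len(ap_scores)

# Illustrative example: one relevant term ranked second out of four candidates.
print(average_precision(["a", "b", "c", "d"], ["b"]))  # 0.5
```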
For model selection, leave-one-out cross-
validation is carried out. In this context, this entails
that, in each iteration, all but one of the seed terms
are used for deriving the prototype embedding;
the ranking of the left-out seed term in the list of
nearest neighbors based on cosine similarity is
used for calculating the AP score. This process is
repeated for all seed terms in order to estimate a MAP
score for a given model. For symptom-specific, this
process is carried out using seed terms for a specific
UTI symptom, whereas for symptom-general, it is
done using all seed terms. For symptom-specific,
MAP scores are macro-averaged across the six UTI
symptoms. For each abstraction level, the model with
the highest macro-averaged MAP score is selected as
the best model.
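Putting the pieces together, the leave-one-out procedure for scoring one embedding model could look roughly as follows; it reuses the illustrative prototype_embedding, nearest_neighbors and average_precision functions sketched earlier.

```python
def loo_map(seed_terms, wv, vocab):
    """Leave-one-out MAP for one UTI symptom and one embedding model.
    Relies on prototype_embedding, nearest_neighbors and average_precision
    as defined in the earlier illustrative sketches."""
    ap_scores = []
    for held_out in seed_terms:
        remaining = [t for t in seed_terms if t != held_out]
        proto = prototype_embedding(remaining, wv)
        ranked = nearest_neighbors(proto, wv, vocab,
                                   exclude=set(remaining), topn=len(vocab))
        ap_scores.append(average_precision(ranked, [held_out]))
    return sum(ap_scores) / len(ap_scores)
```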
The best models within each level of abstraction
and corpus are then compared and evaluated in the
following manner. For both abstraction levels, all
seed terms – for a specific UTI symptom or for all
UTI symptoms, respectively – are used for construct-
ing the prototype embeddings, i.e. there is no longer
a need to leave out an instance. In total, 14 lists
of candidate terms for inclusion in the terminology
are generated. For each symptom-specific prototype
embedding, the candidate list contains the terms
corresponding to the 100 nearest neighbors. For
each symptom-general, the candidate list contains
the terms corresponding to the 600 nearest neighbors
(6 × 100). A domain expert reviewed the union of the
sets of candidate terms for relevance with respect to
a certain UTI symptom. This allowed for counting
the number of relevant UTI symptom terms that were
extracted for each UTI symptom and abstraction
level, as well as for calculating AP scores.
3 RESULTS
The first set of experiments were conducted using the
initial set of seed terms for carrying out leave-one-out
cross-validation. This allowed us to efficiently eval-
uate a number of potentially important factors in the
creation of prototype embeddings for terminology ex-
pansion: (i) four different base embedding methods,
(ii) two different phrase detection methods, each with
three different thresholds controlling the number of
phrases generated, and (iii) two different underlying
corpora – one smaller but more relevant in scope, the
other larger but less precise in terms of relevant scope.
In Table 5 and 6, we present the results for symptom-
specific prototype embeddings and symptom-general
prototype embeddings, respectively. For each base
embedding method and phrase detection method, we
present results with the phrase list and corpus that
yielded the best results.
For the symptom-specific prototype embeddings,
as can be seen in Table 5, better results were obtained
with FastText compared to the other base embedding
methods, regardless of the phrase detection method
used. The overall best result – a MAP score of 0.15 –
was obtained with a medium phrase list obtained using
the IM phrase detection method and the Case Group
corpus. In these experiments, no clear difference was
observed between the Case Group and Control Group
corpora. With respect to the number of phrases identi-
fied, the results seem to speak in slight favor of using
a small- or medium-sized phrase list.
For the symptom-general prototype embeddings,
as can be seen in Table 6, the best results were
again obtained using FastText as the base embed-
ding method. As in the case of symptom-specific,
the overall best result, a MAP score of 0.14, was
obtained with a medium phrase list obtained using
the IM phrase detection method and the Case Group
corpus. Observations with respect to the choice of
underlying corpus and phrase list are similar to the
ones observed for symptom-specific prototype embeddings.
The best-performing prototype embedding models at
two different levels of abstraction
(symptom-specific and symptom-general) and for
Table 5: Symptom-Specific prototype embeddings: macro-averaged MAP scores for different base embedding methods, phrase detection methods, the best phrase list and the best corpus.
Base Embedding | Phrase Detection | Phrase List | Corpus | MAP
Word2Vec | IM | Medium | Control | 0.11
Phrase2Vec | IM | Large | Control | 0.10
GloVe | IM | Large | Case | 0.04
FastText | IM | Medium | Case | 0.15
Word2Vec | nPMI | Medium | Case | 0.10
Phrase2Vec | nPMI | Large | Control | 0.11
GloVe | nPMI | Small | Case | 0.12
FastText | nPMI | Small | Control | 0.12
Table 6: Symptom-General prototype embeddings: macro-averaged MAP scores for different base embedding methods, phrase detection methods, the best phrase list and the best corpus.
Base Embedding | Phrase Detection | Phrase List | Corpus | MAP
Word2Vec | IM | Medium | Control | 0.12
Phrase2Vec | IM | Large | Control | 0.10
GloVe | IM | Medium | Case | 0.07
FastText | IM | Medium | Case | 0.14
Word2Vec | nPMI | Medium | Case | 0.12
Phrase2Vec | nPMI | Large | Control | 0.11
GloVe | nPMI | Small | Case | 0.13
FastText | nPMI | Small | Control | 0.13
two underlying corpora (Case Group and Control
Group) were selected to participate in the final eval-
uation, wherein 100 candidate terms were extracted
from each symptom-specific prototype embedding and
600 terms were extracted from each symptom-general
prototype embedding. The candidate terms were re-
viewed by a domain expert for relevance and the
results, in terms of AP scores, are shown in Table
7. All symptom-specific prototype embeddings per-
form well, with the exception of suprapubic tender-
ness. The symptom-general prototype embedding
also performed well, but slightly worse compared
to the macro-averaged MAP score for the symptom-
specific prototype embeddings. Notably, using the
Case Group corpus generally yielded better results
with symptom-specific prototype embeddings (MAP:
0.51 vs. 0.48), whereas the Control Group corpus
yielded better results with symptom-general prototype
embeddings (MAP: 0.48 vs. 0.30).
Table 8 shows the number of extracted UTI terms
(types) that were deemed relevant by the domain ex-
pert for each of the symptom-specific and symptom-
general prototype embeddings, as well as the sum of
their frequencies (tokens) in the two corpora. First, it
Table 7: Final evaluation: AP scores in the case and control corpora for each symptom-specific prototype embedding and the symptom-general prototype embedding.
Prototype Embedding | Case Group | Control Group
Dysuria | 0.61 | 0.56
Frequency | 0.64 | 0.47
Urgency | 0.82 | 0.76
Suprapubic tenderness | 0.00 | 0.06
Costovertebral angle pain or tenderness | 0.86 | 0.83
Non-specific | 0.13 | 0.24
Macro-averaged MAP | 0.51 | 0.48
UTI Symptoms | 0.30 | 0.48
should be noted that there were some terms that ap-
peared in several of the candidate lists; the total num-
ber of unique candidate terms was 1,504. Of these,
142 terms were deemed relevant by the domain ex-
pert. The observant reader will notice that the sum of
the types for the symptom-specific prototype embed-
dings is larger than 142 and this is because, in some
cases, the domain expert classified a term as relevant
for more than one specific UTI symptom. Neverthe-
less, more UTI symptom terms were extracted with
the symptom-specific prototype embeddings than with
the symptom-general counterparts (167 vs. 121). As
expected, the terms are more frequent in the Control
Group corpus, owing to its larger size.
Table 8: Frequency of the extracted and relevant UTI symptom terms in the two corpora.
Prototype Embedding | Case Group Types | Case Group Tokens | Control Group Types | Control Group Tokens
Dysuria | 31 | 415 | 31 | 755
Frequency | 43 | 367 | 43 | 527
Urgency | 21 | 506 | 21 | 709
Suprapubic tenderness | 27 | 98 | 27 | 131
Costovertebral angle pain / tenderness | 9 | 510 | 9 | 759
Non-specific | 36 | 765 | 36 | 1,081
UTI Symptoms | 121 | 1,857 | 121 | 2,838
Table 9 provides an example of terms automati-
cally extracted from a corpus of clinical text using
a prototype embedding (and calculating its nearest
neighbors), in this case for the UTI symptom urgency.
As can be seen, many relevant terms are among the
nearest neighbors of the prototype embedding. It is
also notable that phrases of varying length are iden-
tified. There are also several misspellings, and the
frequencies show that these are relatively common.
4 DISCUSSION
In this study, experiments were conducted concerning
(i) the data and (ii) embedding methods used for con-
structing the semantic spaces, as well as (iii) the level
of abstraction for the prototype embeddings. The re-
sults of these will be discussed below, in relation to
the target application, namely terminology expansion
and extracting UTI symptoms from clinical text.
The underlying data and embedding method used
are naturally the two most important aspects that im-
pact the structure of the resulting semantic space. Al-
though these perhaps do not represent the primary fo-
cus of the paper, they were too important to ignore
and we therefore studied their impact on the prototype
embeddings that were, in turn, used for the down-
stream task of terminology expansion. Concerning
the underlying data, this can be broken down into two
parts: (i) phrase detection and (ii) corpus construc-
tion, in particular how the data is sampled and the
trade-off between data volume vs. quality. In terms of
the performance of the two phrase detection methods,
there was little difference between them, with IM used
in the best-performing models. When using statistical
phrase detection methods, there is a clear trade-off be-
tween the number and quality of identified phrases; in
this case, we could observe that using a large phrase
list resulted in worse performance. As can be seen
in Table 9, some of the identified phrases (e.g.
urinträngningar urinsticka) are not phrases in a linguis-
tic sense and, while deemed relevant by the domain
expert, probably should not be included as terms in a
terminology. While using linguistic information from
a syntactic parser to generate phrases would likely
yield better results, good syntactic parsers for low-
resource languages and domains can be difficult to ob-
tain, and the simpler methods used in this study gen-
erally produced satisfactory results. Moreover, using
a vocabulary of standard phrases would be limiting
since it would fail with the misspellings and the type
of creative language use found in clinical text.
Regarding corpus construction, the results were
mixed, making it difficult to draw any clear conclu-
sions. However, in the final evaluation, it was ob-
served that the Control Group corpus gave better re-
sults for symptom-general prototype embeddings and
the non-specific symptom-specific prototype embed-
ding, while the Case Group corpus gave better results
for the other symptom-specific prototype embeddings
(with the exception of suprapubic tenderness, which
performed badly with both corpora). One possible
explanation is that more data, even at the expense
of being slightly less specific to the target domain,
is helpful when the prototype embeddings are meant
to capture concepts that are wider in scope, such as
UTI symptoms in general or other, non-specific UTI
symptoms. This would, however, need to be inves-
tigated further. A limitation with this experiment is
also that the corpora are not all that different; in fu-
ture work, it would be interesting to study this aspect
in more detail and with greater differences between
corpora, both in terms of volume and domain speci-
ficity.
Prototype embeddings are based on a notion that
works with any vector-based model of distributional
semantics. Our experiments showed that the choice
Table 9: Extracted symptom terms, along with English translations, for the prototype embedding for urgency. The ranks and the frequency in the Case Group corpus of relevant terms are shown. Misspelled terms are marked with an asterisk (*).
Rank | Extracted Term | English Translation | Freq
1 | trängningar vid miktion | urgency during micturition | 15
2 | besväras av täta trängningar | bothered by frequent urges | 13
3 | urinträngning | urinary incontinence | 16
4 | trängningarna | the urges | 18
5 | täta trängningar och sveda vid miktion | frequent urges and burning during micturition | 11
6 | täta urinträngningar | frequent urination | 64
8 | sveda och trängningar | burning and urges | 30
9 | täta trängningar till miktion | frequent urges for micturition | 26
10 | miktionsträngningar | micturition efforts | 29
11 | sveda vid miktion täta trängningar | burning during micturition frequent urges | 16
12 | miktionssveda och täta trängningar | micturition burns and frequent urges | 13
13 | upplever trängningar | experiencing urges | 31
15 | trängningar till vattenkastning | urge to urinate | 11
16 | trängningar till miktion | urges for micturition | 46
18 | täta miktionsträngningar | frequent micturition efforts | 16
19 | urinträngningar urinsticka | urinary incontinence urine stick | 11
25 | sveda eller trängningar | burning or urges | 13
27 | trägningar * | urges | 27
28 | besvär med trängningar | discomfort with urges | 11
37 | form av trängningar | form of urges | 12
38 | trängningsbesvär | urgency | 21
42 | täta trägningar * | frequent urges | 15
63 | täta trängingar * | frequent urges | 17
of base embedding method does have an impact on
the downstream performance of the prototype em-
beddings. Among the ones included in this study,
FastText consistently outperformed the others. There
could be several explanations for this: one such ex-
planation is that using subword embeddings allows it
to generalize faster, and the corpora used in these ex-
periments are both relatively small.
One of the key aspects we set out to investigate
in this study, in addition to applying the notion of
prototype embeddings to the task of terminology ex-
pansion, was to study if prototype embeddings could
capture, not only synonymy, but something as wide
in scope as UTI symptoms in general. While the per-
formance was good with both symptom-specific and
symptom-general prototype embeddings, with many
new and relevant terms successfully identified, the
former outperformed the latter in our experiments. In
future work, it would be interesting to study this in
more detail using a variety of concepts at different
levels of abstraction, as well as to investigate the im-
pact of the size and nature of the seed set used for
deriving a prototype embedding. For example, us-
ing only one UTI symptom as seed terms, would it
be possible to extract other types of UTI symptoms?
One can also imagine more sophisticated ways of de-
riving prototype embeddings than mean pooling, even
if simpler methods have certain advantages.
As can be seen in Table 7, the prototype em-
beddings indeed produced good results. Except for
suprapubic tenderness, the performance was good
in all cases, especially considering the frequency of
these terms in the corpora. We looked for explana-
tions for the poor performance of the suprapubic ten-
derness prototype embeddings and discovered that it
was largely due to the low frequency of the associated
symptom terms. The minimum frequency when creat-
ing word embeddings was set to ten and only two seed
terms for this UTI symptom exceeded this threshold
in the Control Group corpus, yielding an AP score of
0.06. In the Case Group corpus, only one seed term
was present, resulting in an AP score of zero. In this
case, it hence functioned like a regular word embed-
ding to generate the candidate list, which, in turn, il-
lustrates the advantage of prototype embeddings.
In future work, transfer learning will be explored,
which involves fine-tuning a pre-trained model
(trained with a large amount of data, not necessarily
in-domain) to perform another task. In BERT (Devlin
et al., 2018), multi-head attention is used to generate
word embeddings. Due to its complexity and the
amount of data required, BERT-based models are typ-
ically used in transfer learning approaches, and we
plan to explore this for terminology expansion. In fu-
ture work, the terminology will be matched with the
standard medical terminology available in Swedish,
such as ICD-10 (international statistical classification
of diseases and related health problems-10), Snomed
CT (systematized nomenclature of medicine clinical
terms), and MeSH (medical subject headings).
5 CONCLUSIONS
In this study, we investigated the use of prototype em-
beddings for terminology expansion, specifically for
extracting symptoms of urinary tract infections from
clinical text corpora. Four word embedding methods
were used for deriving the higher-level prototype em-
beddings; it was observed that FastText yielded the
best results. We also explored two statistical phrase
detection methods and, while there was little differ-
ence between them, we also studied the trade-off be-
tween the number and quality of identified phrases
and its impact on the downstream terminology expan-
sion task. We also observed that using a somewhat
smaller but high-quality, relevant corpus generally
gave better results than using a larger yet less precise
corpus; however, this seems to depend on the target
concept’s abstraction level. Indeed, two levels of ab-
straction were compared and contrasted: both yielded
good results, but using prototype embeddings for spe-
cific symptoms overall outperformed the use of pro-
totype embeddings for urinary tract infection symp-
toms in general. Ultimately, we were able to identify
an additional 142 symptoms for inclusion in the ter-
minology with very little manual effort required.
ACKNOWLEDGEMENTS
This research has been approved by the Regional Eth-
ical Review Board in Stockholm under permission no.
2016/2309-32.
REFERENCES
Artetxe, M., Labaka, G., and Agirre, E. (2018). Unsuper-
vised statistical machine translation. arXiv preprint
arXiv:1809.01272.
Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T.
(2017). Enriching word vectors with subword infor-
mation. Transactions of the Association for Computa-
tional Linguistics, 5:135–146.
Bouma, G. (2009). Normalized (pointwise) mutual in-
formation in collocation extraction. Proceedings of
GSCL, pages 31–40.
Dalianis, H. (2018). Clinical text mining: Secondary use of
electronic patient records. Springer, Open Access.
Dalianis, H., Henriksson, A., Kvist, M., Velupillai, S., and
Weegar, R. (2015). Health bank-a workbench for data
science applications in healthcare. In CAiSE Industry
Track, pages 1–18.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2018). Bert: Pre-training of deep bidirectional trans-
formers for language understanding. arXiv preprint
arXiv:1810.04805.
ECDC (2016). Point prevalence survey of healthcare-
associated infections and antimicrobial use in Eu-
ropean acute care hospitals protocol version 5.3 :
ECDC PPS 2016–2017. ECDC, Stockholm.
Fan, Y., Pakhomov, S., McEwan, R., Zhao, W., Lindemann,
E., and Zhang, R. (2019). Using word embeddings to
expand terminology of dietary supplements on clinical
notes. JAMIA open, 2(2):246–253.
Foxman, B. (2010). The epidemiology of urinary tract in-
fection. Nature Reviews Urology, 7(12):653.
Harris, Z. S. (1954). Distributional structure. Word.
Henriksson, A. (2015). Learning multiple distributed proto-
types of semantic categories for named entity recogni-
tion. International journal of data mining and bioin-
formatics, 13(4):395–411.
Henriksson, A., Dalianis, H., and Kowalski, S. (2014a).
Generating features for named entity recognition by
learning prototypes in semantic space: The case of
de-identifying health records. In 2014 IEEE Interna-
tional Conference on Bioinformatics and Biomedicine
(BIBM), pages 450–457. IEEE.
Henriksson, A., Moen, H., Skeppstedt, M., Daudaravicius,
V., and Duneld, M. (2014b). Synonym extraction and
abbreviation expansion with ensembles of semantic
spaces. Journal of Biomedical Semantics, 5(6).
Herzog, K., Dusel, J. E., Hugentobler, M., Beutin, L.,
Sägesser, G., Stephan, R., Hächler, H., and Nüesch-
Inderbinen, M. (2014). Diarrheagenic enteroaggregative
Escherichia coli causing urinary tract infection and
bacteremia leading to sepsis. Infection, 42(2):441–444.
Khattak, F. K., Jeblee, S., Pou-Prom, C., Abdalla, M.,
Meaney, C., and Rudzicz, F. (2019). A survey of word
embeddings for clinical text. Journal of Biomedical
Informatics: X, 4:100057.
Landers, T., Apte, M., Hyman, S., Furuya, Y., Glied, S.,
and Larson, E. (2010). A comparison of methods to
detect urinary tract infections using electronic data.
The Joint Commission Journal on Quality and Patient
Safety, 36(9):411–417.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and
Dean, J. (2013). Distributed representations of words
and phrases and their compositionality. In Advances in
neural information processing systems, pages 3111–
3119.
NHSN (2017). National Healthcare Safety Network
(NHSN) Patient Safety Component Manual, Centers
for Disease Control and Prevention; 2017. NHSN,
U.S. Department of Health & Human Services.
Pennington, J., Socher, R., and Manning, C. (2014). Glove:
Global vectors for word representation. In Proceed-
ings of the 2014 conference on empirical methods in
natural language processing (EMNLP), pages 1532–
1543.
Rubin, R. H., Shapiro, E. D., Andriole, V. T., Davis,
R. J., and Stamm, W. E. (1992). Evaluation of new
anti-infective drugs for the treatment of urinary tract
infection. Clinical Infectious Diseases, 15(Supple-
ment 1):S216–S227.
Schütze, H., Manning, C. D., and Raghavan, P. (2008). In-
troduction to information retrieval. In Proceedings
of the international communication of association for
computing machinery conference, page 260.
Wang, Y., Liu, S., Afzal, N., Rastegar-Mojarad, M., Wang,
L., Shen, F., Kingsbury, P., and Liu, H. (2018). A
comparison of word embeddings for the biomedical
natural language processing. Journal of biomedical
informatics, 87:12–20.
Zhang, L., Li, J., and Wang, C. (2017). Automatic synonym
extraction using word2vec and spectral clustering. In
2017 36th Chinese Control Conference (CCC), pages
5629–5632. IEEE.
APPENDIX
Table 10: Hyperparameter values for different word embedding methods.
Hyperparameter | Values
Corpus | Case, Control
Phrase detection method | IM, nPMI
Phrase list | Small, Medium, Large
Context window size | 5, 10, 15
Vector dimension | 50, 100
Iterations, GloVe | 15, 20, 25, 30
Iterations, other methods | 2, 5, 10
Hierarchical softmax value | 1, 0
Skipgram value | 1, 0
Negative value, Phrase2Vec | 3, 5, 10
Negative value, other methods | 5, 10, 15, 20
cbow_mean value, FastText | 1, 0
Minimum term frequency | 10
x_max, GloVe | 10
CBOW value, Phrase2Vec | 0
min_n, FastText | 2
max_n, FastText | 10
Word ngrams, FastText | 1