What do You Mean, Doctor? A Knowledge-based Approach for Word
Sense Disambiguation of Medical Terminology
Erick Velazquez Godinez, Zoltán Szlávik, Edeline Contempré and Robert-Jan Sips
myTomorrows, Anthony Fokkerweg 61, 1059CP Amsterdam, The Netherlands
Keywords:
Medical Word Sense Disambiguation, Knowledge-based, Semantic Similarity, Word Embeddings, Data
Understanding.
Abstract:
Word Sense Disambiguation (WSD) is an essential step for any NLP system; it can improve the performance of more complex tasks such as information extraction and named entity linking. Consequently, any error made while disambiguating a term spreads to later stages with a snowball effect. Knowledge-based strategies for WSD offer the advantage of wider coverage of medical terminology than supervised algorithms. In this research, we present a knowledge-based approach for word sense disambiguation that can use different semantic similarity measures to determine the correct sense of a term in a given context. Our experiments show that when our approach used WordNet-based similarity measures, it performed very closely to configurations using semantic measures based on word embeddings. We also constructed a small dataset from real-world data, where the feedback received from the annotators led us to distinguish between truly ambiguous terms and vague terms. This distinction needs to be considered in future research on WSD algorithms and dataset construction. Finally, we analyzed a state-of-the-art dataset with linguistic variables that helped to explain our approach's performance. Our analysis revealed that texts with high lexical richness and a high ratio of nouns and adjectives lead to better WSD performance.
1 INTRODUCTION
One of the challenges that a BioNLP system still faces is deciding the correct sense of an ambiguous medical term. E.g., cold can have at least two meanings: one refers to the absence of heat, and a second refers to the common cold. The task of determining the sense of a given word, in its context, is called Word Sense Disambiguation (WSD) (Navigli, 2009).
WSD in medical language faces different challenges than in layperson language, which stem from the frequent use of specialized terminology, acronyms, and abbreviations. Although these challenges have been addressed before (Zhang et al., 2019; Antunes and Matos, 2017a), no proposals have incorporated knowledge-type data to keep the system's output understandable. We are interested in a solution with good performance that is also reasonably transparent to interpretation. We believe this can be achieved if we limit the use of word embeddings to specific sub-steps within the WSD pipeline.
Compared to supervised algorithms, knowledge-based strategies cover a wider range of terminology (Navigli, 2009) for WSD; this is an advantage in real-world scenarios. Knowledge-based strategies can rely on similarity measures that exploit the concept network of lexicons. The basic idea is to compute the semantic similarity between the context of the target word and its definitions. With this similarity value, the system determines which sense to select (Navigli, 2009).
Our contributions are:
- proposing a new knowledge-based WSD approach that uses a semantic similarity measure based on the concept of information coverage; we compare term definitions and the segments of text in which target terms appear;
- creating a small dataset from a real-world use case (available at https://research.mytomorrows.com/datasets), in addition to evaluating our approach on a standard dataset (i.e., the MeSH corpus (Jimeno-Yepes et al., 2011));
- analysing the (re)source data and results by utilising various linguistic features commonly used in corpus linguistics (e.g., the type-token ratio (TTR)),
  which provides a profile of the characteristics of texts leading to accurate disambiguation;
- distinguishing vague vs. ambiguous terms, which may offer new opportunities to improve WSD algorithms.
2 BACKGROUND
While WordNet (Miller, 1995) is a knowledge-
based resource used to assist WSD in layperson lan-
guage (Navigli, 2009), the Unified Medical Language
System, UMLS, plays a similar role in the medical
domain. UMLS integrates taxonomies and ontolo-
gies of the medical domain (Bodenreider, 2004). Re-
sources like WordNet and UMLS encode semantic re-
lationships (synonymy, hypernymy, hyponymy, etc.)
that give a graph-like structure. Several semantic
measures exploit these semantic relations and graph
structure to compute semantic similarity among con-
cepts (Jiang and Conrath, 1997; Lin et al., 1998; Lesk,
1986).
However, semantic measures have a drawback as well: they depend on how complete the knowledge source is. As an alternative, word embeddings are able to capture relational meaning, which makes them suitable for computing semantic similarity between words. Since word embeddings are vector space representations, the cosine similarity measure is commonly used to express how similar two word embeddings are (Jurafsky and Martin, 2020, ch. 6). Word embeddings are created from unlabeled data (Jurafsky and Martin, 2020, ch. 6), which makes it possible to cover a greater vocabulary and reduce human intervention. In particular, definitions of concepts in UMLS play an essential role for WSD and for the construction of word embeddings in the medical domain. Pesaranghader et al. (2019) used UMLS definitions to create word embeddings before initializing a neural network for supervised WSD.
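For illustration, here is a minimal sketch of cosine similarity between two word vectors; the function name and the toy four-dimensional vectors are invented for the example, standing in for real embeddings:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two word vectors (1.0 = identical direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors standing in for real embeddings of two terms.
v_cold = np.array([0.9, 0.1, 0.3, 0.0])
v_flu = np.array([0.8, 0.2, 0.4, 0.1])
print(cosine_similarity(v_cold, v_flu))  # high value -> semantically close
```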
Several authors have worked on WSD for the
medical domain (Zhang et al., 2019; Pesaranghader
et al., 2019; Wang et al., 2018). While these studies
have made significant contributions to the WSD task,
there is a need to evaluate such systems with real-
world scenarios and human experts, which we also
address in this paper.
For our research, we focused on previous works
that used knowledge-based approaches to address
WSD in the medical domain. For instance, Jimeno-
Yepes and Aronson (2010) compared three methods
on the NLM WSD data set (Weeber et al., 2001). The
first method, presented in (McInnes, 2008), is very
similar to the Lesk algorithm (Lesk, 1986); it compares the overlap of the ambiguous term's context with representations built from the definitions of the candidate senses. The second method is an adaptation of
the PageRank algorithm. Presented by Agirre and
Soroa (2009), this adapted version treats UMLS as
a directed graph where the PageRank value is com-
puted after the ambiguous terms and their contexts
are integrated into the graph. The third algorithm is
the Journal Descriptor Indexing (JDI), originally pre-
sented by Humphrey et al. (2006). It is based on sta-
tistical associations between ambiguous concepts and
their semantic types, which are mapped to a set of journal descriptors. In their comparison, Jimeno-Yepes
and Aronson (2010) found that the JDI algorithm per-
forms the best among the three methods compared.
However, these methods rely entirely on UMLS, and they do not integrate any other source, e.g., WordNet or word embeddings, that could improve their performance.
In more recent research, Antunes and Matos (2017b,a) presented a knowledge-based approach that they applied to resolve ambiguities in the MeSH corpus (Jimeno-Yepes et al., 2011). Antunes and Matos used the cosine similarity measure to assess the semantic similarity of two terms, together with the pointwise mutual information value of these terms. For the similarity computation, the two terms were represented by word embeddings. For the pointwise mutual information calculation, they used the MEDLINE Co-Occurrences (MRCOC) files (see https://ii.nlm.nih.gov/MRCOC.shtml). The final score was then used to determine the sense of the ambiguous term: the sense with the highest score was selected. In the same way, we select the sense with the highest score for the ambiguous term.
As mentioned in the previous paragraphs, text similarity is used to compare senses and their contexts in order to select the right sense. It is important to note that the two elements being compared (definitions and contexts) are fundamentally different kinds of text. Until now, this difference has not been considered in WSD. We believe the difference between definition texts and context texts needs to be addressed.
In this regard, Velazquez et al. (2016) posed the comparison of two segments of text as an information coverage task on students' texts. Velazquez et al. (2016) compared two segments of text, R and S, to determine to which extent S covers the information of R. The two segments of text play different roles, the referent R and the subject of comparison S, as Tversky (1977) stated in his model of comparison. R is the object holding the most prominent features, and S is the object with less salient features. Extrapolating this definition, Velazquez
et al. see syllabus documents as the referent R, which
contains essential concepts that students should dis-
cuss in the final dissertation. The final dissertation
is considered as S, since it contains a discussion and
paraphrases of the concepts in R.
In our case, we could define R as the set of definitions of an ambiguous term. Each of them is considered to hold prominent features/words that can help disambiguate the meaning of the ambiguous term. We could then see S as the segment of text where the ambiguous term appears. Its features/words can differ from the actual definition of the ambiguous term, since they reflect its usage and context words. Still, these contextual words share some semantic information with the definition.
3 METHODOLOGY
3.1 Methods
We tackle the problem of word sense disambiguation with a strategy based on the principle of information coverage (Velazquez et al., 2016). To disambiguate an ambiguous term, we compute the coverage between its definitions and the segment of text where the term appears. The definition with the highest coverage value is considered the final sense. The coverage of the information is computed using the following formula:
\[
\mathrm{coverage}(R, S) = \frac{\sum_{w \in R} \mathrm{maxSim}(w, S)\, \mathrm{idf}(w)}{\sum_{w \in R} \mathrm{idf}(w)} \quad (1)
\]
where R is the referent and S is the subject of comparison; both are segments of text. Here, the referent R is a definition of the ambiguous term, and S is the segment of text where the ambiguous term appears. The function maxSim(w, S) compares the word w, which belongs to R, with each word in S using a semantic similarity measure, and returns the highest similarity value found.
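A minimal sketch of formula (1) and the resulting sense selection follows, assuming tokenized texts, a pluggable word-level similarity function, and a precomputed idf table; all names are illustrative, not our released code:

```python
from typing import Callable, Dict, List

def coverage(referent: List[str], subject: List[str],
             sim: Callable[[str, str], float],
             idf: Dict[str, float]) -> float:
    """Formula (1): idf-weighted mean of each referent word's best match in S."""
    num = sum(max(sim(w, s) for s in subject) * idf.get(w, 1.0) for w in referent)
    den = sum(idf.get(w, 1.0) for w in referent)
    return num / den if den else 0.0

def disambiguate(context: List[str],
                 definitions: Dict[str, List[str]],
                 sim: Callable[[str, str], float],
                 idf: Dict[str, float]) -> str:
    """Pick the sense (e.g. a CUI) whose definition best covers the context."""
    return max(definitions,
               key=lambda cui: coverage(definitions[cui], context, sim, idf))
```

Note that each definition plays the role of the referent R, and the context of the ambiguous term plays the role of the subject S, matching the asymmetry described above.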
As a baseline, we used the First Sense Baseline (FSB); it is solely based on the frequency of occurrence of a given sense, where the frequency is derived from manually annotated senses of ambiguous terms in a corpus. In a real-life application, this approach tends to leave the long tail of senses forgotten, which in the medical domain may lead to further isolation of people with rare diseases.
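The FSB reduces to picking the most frequently annotated sense while ignoring context entirely; a minimal sketch, with the sense counts invented for the example (they mirror the skewed ACS distribution discussed in Section 4.1):

```python
from collections import Counter

def first_sense_baseline(annotated_senses: list) -> str:
    """Predict the corpus-wise most frequent sense, ignoring the context."""
    return Counter(annotated_senses).most_common(1)[0][0]

# A skewed term: 34 annotations of one sense, one of another.
print(first_sense_baseline(["C0948089"] * 34 + ["C0742343"]))  # -> C0948089
```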
3.2 Data-sets and Data Preparation
3.2.1 Data for Evaluation
We used the MeSH WSD corpus (Jimeno-Yepes et al., 2011), consisting of 203 ambiguous terms, of which 106 are abbreviations, 88 are word-terms, and nine can be a combination of both. For each term, there are 100 instances per sense, obtained from MEDLINE. The ambiguous terms come from the Medical Subject Headings (MeSH) of UMLS.
In addition to this dataset, we manually annotated the sense of three ambiguous terms from UMLS, i.e., ACS, albumin, and basal cell carcinoma, in 129 clinical trials that we collected from https://clinicaltrials.gov. We conducted an annotation task with a group of five experts with a medical background. The term ACS has six definitions or senses, and we collected 36 clinical trials for it. The term albumin has only two definitions, and we retrieved 47 clinical trials. Finally, the term basal cell carcinoma has three senses, and we collected 47 clinical trials. This dataset is available at https://research.mytomorrows.com/datasets.
For the annotation process, we presented a docu-
ment with the definitions of the ambiguous terms and
the clinical trials that contained the ambiguous terms.
We asked the annotators to select, from a list of defini-
tions, the sense that corresponds to the actual clinical
trial context.
Regarding the definitions of the ambiguous terms,
we first extracted all of them. Since UMLS incor-
porates multiple data sources, there may be duplicate
concepts – and consequently, definitions – present for
the same term. With medical experts’ help, we dedu-
plicated concept definitions that were in the scope of
our experiments. This is a starting point of a project
that aims to incorporate more ambiguous text and
enrich the MeSH corpus. We computed the inter-
annotator agreement for the annotations, resulting in
a value of 0.484, which indicates a moderate agree-
ment (Pustejovsky and Stubbs, 2012).
3.2.2 Data Sources
First, for the semantic similarity, we used two
different word embedding representations, a)
from (Pyysalo et al., 2013), that was trained on
biomedical data and are publicly available
3
, and
b) word embeddings corresponds to the model
en core sci lg in sci-spacy (Neumann et al., 2019).
3
http://bio.nlplab.org.
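As an illustration, a minimal sketch of loading both embedding sources, assuming gensim and scispaCy are installed; the word2vec file name is an assumption (substitute whichever binary from bio.nlplab.org you downloaded):

```python
import spacy
from gensim.models import KeyedVectors

# a) Biomedical word2vec vectors from bio.nlplab.org; the file name below is
#    an assumption -- use the binary you actually downloaded.
bio_vectors = KeyedVectors.load_word2vec_format("PubMed-w2v.bin", binary=True)

# b) The scispaCy pipeline, whose tokens carry their own vectors.
sci_nlp = spacy.load("en_core_sci_lg")

# Both expose a word-level similarity usable inside the coverage formula.
print(bio_vectors.similarity("carcinoma", "tumor"))
print(sci_nlp("carcinoma").similarity(sci_nlp("tumor")))
```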
3.3 Data Analysis
The purpose of this analysis is twofold: on the one hand, it helps us understand the nature of the input data, i.e., definitions and context texts; on the other hand, it can give insights into our method's performance. For that reason, we decided to use several linguistic features that are commonly used to describe the variation of texts in corpus linguistics studies, see (Biber, 2006, p. 221). The selected linguistic features are meant to capture how informative texts are, their vocabulary concentration, and their vocabulary distribution; each variable reflects a different dimension of the characteristics of a text. We used the type-token ratio (TTR) to measure the lexical diversity of texts in a corpus; its value goes from 0 to 1, where 1 means that the vocabulary is fully varied. It has been used to assess the difference between written and oral language (Biber, 2006). We also used the number of tokens per document to see the impact of the size of texts. Besides, we evaluated the distribution of nouns, verbs, adjectives, and adverbs using a normalized frequency per 100 token-words. For instance, adverbs and adjectives tend to expand and elaborate on the information presented in a text, while a high concentration of nouns may indicate a high informational focus (Biber, 2006). We computed these linguistic features for the texts of the definitions and the text instances of the MeSH corpus; we will refer to each kind of text as definitions and MeSH texts, respectively.
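These features are straightforward to compute with any POS-tagged pipeline; a minimal sketch, assuming a spaCy English model is installed (the function name is ours, and the profile assumes a non-empty text):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # any spaCy pipeline with a POS tagger works

def linguistic_profile(text: str) -> dict:
    """TTR, token count, and per-100-token rates of the open POS classes."""
    tokens = [t for t in nlp(text) if t.is_alpha]  # assumes non-empty text
    n = len(tokens)
    per100 = lambda pos: 100.0 * sum(t.pos_ == pos for t in tokens) / n
    return {
        "n_tokens": n,
        "ttr": len({t.lower_ for t in tokens}) / n,  # unique types / tokens
        "nouns": per100("NOUN"), "verbs": per100("VERB"),
        "adjectives": per100("ADJ"), "adverbs": per100("ADV"),
    }

print(linguistic_profile("A film that attaches to teeth, often causing dental caries."))
```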
Finally, we built two multivariate models to analyze the correlation between these features and our approach's performance:
- The vocabulary variation model, which verifies the relationship between the number of tokens, the lexical diversity (TTR), and the accuracy of our approach.
- The informativeness model, which verifies the relationship between the number of nouns, verbs, adverbs, and adjectives and the accuracy of our approach.
We took accuracy as the dependent variable in our models, since it is the standard measure used to discuss results in NLP. The models were built using the Ordinary Least Squares (OLS) method in the statsmodels package of Python.
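A minimal sketch of fitting such a model with statsmodels follows; the data frame rows are illustrative numbers only (one row per ambiguous term), not our actual measurements:

```python
import pandas as pd
import statsmodels.api as sm

# Illustrative rows only; in the paper each row would be an ambiguous term.
df = pd.DataFrame({
    "accuracy": [95.6, 48.2, 70.0, 61.5, 74.7],
    "ttr":      [99.0, 63.6, 82.0, 75.0, 88.0],
    "n_tokens": [23, 22, 31, 27, 40],
})

# Vocabulary variation model: accuracy ~ TTR + number of tokens.
X = sm.add_constant(df[["ttr", "n_tokens"]])
model = sm.OLS(df["accuracy"], X).fit()
print(model.summary())  # reports R-squared and per-predictor t and p values
```

The informativeness model is fit the same way, swapping in the noun, verb, adjective, and adverb ratios as predictors.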
4 RESULTS AND DISCUSSION
4.1 WSD Results
Table 1 shows the results for the MeSH dataset in terms of accuracy, precision, recall, and F1-measure. Since the dataset contains different terms, the results correspond to weighted average values. In the results, we can see that all configurations outperformed the baseline. Regarding the WordNet-based semantic measures, the best performance is achieved by JCN's measure with an F1 score of 70.01, representing a difference of 36.38 with the baseline. Then follow Res' measure with an F1-measure of 61.38 and Lin's measure with 59.58. Previous research reported lower performance for these measures: according to Navigli (2009), the performance of WordNet-based measures for WSD tasks in layperson language is 29.5 for Res' measure, 39.0 for JCN's measure, and 33.1 for Lin's measure. Our proposition thus seems to enhance the performance of knowledge-based measures. To confirm this claim, we need to conduct more experiments with layperson language datasets. Nevertheless, our results show that WordNet-based measures perform satisfactorily even for medical language.
Regarding the word embeddings' performance, we see that the use of idf did not yield an improvement. We observe a slight decline in performance of 0.77, but we did not find this to be statistically significant. This difference in performance is probably because word embeddings are initially trained on a frequency-based matrix (Jurafsky and Martin, 2020, ch. 6), so any lexical information and word distribution is already captured. Originally, Velazquez et al. (2016)'s work (see formula 1) mixes a WordNet semantic measure with the lexical information of the term, idf, to calculate the similarity value. Thus, when we adapted Velazquez et al. (2016)'s formula to use a word-embedding similarity measure, we expected that the idf might not be necessary. We then ran experiments with both options to see what the impact is in practice.
Considering our case study data, the results are slightly different. The baseline performed best with an F1-score of 77.89. It is followed by Res' measure with 76.59 and the three word-embedding strategies, with 72.65 for Embeddings-idf, 74.01 for Embeddings-no idf, and 73.6 for Embeddings-spacy. At the bottom, we find Lin's measure with 71.13 and JCN's measure with 64.04. Although this dataset is too small for statistically reliable conclusions, we examined the performance and found the following: we attribute the baseline's performance to a disparity in the distribution
Table 1: Results for the MeSH dataset.
Semantic measure Accuracy Precision Recall F1-score
Baseline 48.06 73.07 48.06 33.62
Embeddings-idf 74.69 74.98 74.69 74.65
Embeddings-no idf 75.46 75.75 75.46 75.45
Embeddings-spacy 73.42 73.66 73.42 73.42
Lin 59.65 59.79 59.65 59.58
Res 61.53 61.54 61.53 61.38
JCN 70.00 70.15 70.00 70.01
of instances across senses in the dataset we annotated. For instance, the term ACS has six different senses in UMLS; in our dataset, 34 instances correspond to the sense acute coronary syndrome (CUI C0948089, where CUI stands for Concept Unique Identifier in UMLS) and only one instance corresponds to the sense acute chest syndrome (C0742343). Thus, when building a dataset, special effort must be put into keeping an equal distribution among each ambiguous term's instances. This will lead to a more robust baseline for the evaluation and a better representation of the senses that the dataset intends to cover.
After observing our results, we decided to analyse the characteristics of the dataset to explain the performance of our approach.
4.2 Understanding Our Data
Regarding the definitions, we found that a high lexical diversity (TTR) has a positive impact on disambiguating a term (p < 0.05); TTR and accuracy have a Pearson coefficient of 0.15, see Figure 1-A. With the number of tokens and the TTR as independent variables, the vocabulary variation model explains 91.4 percent of the data (R-squared value of 0.914). TTR has a t value of 50.07 vs. 7.31 for the number of tokens; TTR is thus the more significant of the two variables. In practice, a lexically diverse definition allows for higher WSD accuracy.
For example, in Table 2, we see that the definition of the term plaque has a TTR value of 99.0, which means that almost every token-word in the segment of text is unique. Its counterpart is the definition of the term sodium, which has a TTR value of 63.63 and an accuracy of 48.19. In this definition, the word sodium is repeated four times, and the words used, compounds, and food are each repeated twice. Repeated words, i.e., low lexical diversity in a definition, reduce the context that an algorithm can use to disambiguate terms. Indeed, WSD on sodium shows a lower accuracy than on plaque in the MeSH dataset. Furthermore, this bears on explainability (XAI): a high-TTR definition serves as a clearer explanation of the term.
Regarding the informativeness model, we found that the number of nouns also plays a determining role in resolving ambiguity more accurately (p < 0.05). In this model, the number of nouns, adjectives, verbs, and adverbs explains 91.4% of the dataset. The number of nouns has a t value of 17.28; thus, a higher ratio of nouns leads to higher accuracy in WSD. Second in importance, we found the number of adjectives with a t value of 8.86. This could be explained by the fact that adjectives modify nouns; thus, next to a high number of nouns, a high ratio of adjectives ensures higher accuracy. This also has an explanation from a linguistic perspective: a high ratio of nouns indicates a focus on information, and a high ratio of adjectives expands and elaborates the information of texts (Biber, 2006). In Table 2, we observe that plaque has a noun ratio of 30.43 vs. 45.45 for sodium, but plaque presents a ratio of 8.69 for adjectives while sodium has no adjectives at all. In the case of plaque, the nouns and the adjectives ensure a higher degree of informativeness and, consequently, a higher accuracy. Thus, more nouns are associated with higher WSD accuracy.
Regarding the texts where the ambiguous terms appear in the MeSH dataset, the vocabulary variation model found only a slight difference in importance between the number of tokens and TTR; their t values are 7.08 and 6.59, respectively. Thus, when it comes to WSD accuracy, both variables seem to contribute. In practical terms, a text with a high TTR and a high number of tokens leads to more accurate disambiguation.
For the informativeness model, we found that the number of verbs is determinant for high accuracy (p < 0.05). The model is able to explain 92.5% of the dataset. When testing the correlation between the accuracy and the number of verbs, we found a Pearson coefficient of 0.22 (p < 0.05), see Figure 1-B. Thus, for the MeSH dataset, a high ratio of verbs leads to higher accuracy.
The difference between the two models (the vocabulary variation and the informativeness models) for UMLS definitions and for the texts in the MeSH dataset confirms that each kind of text has different linguistic characteristics.
Table 2: Example of high and low TTR for the terms plaque and sodium.

Term   | CUI      | Accuracy | TTR   | Nouns | Adjectives | Definition
Plaque | C0011389 | 95.65    | 99.0  | 30.43 | 8.69       | A film that attaches to teeth, often causing DENTAL CARIES and GINGIVITIS. It is composed of MUCINS, secreted from salivary glands, and microorganisms.
Sodium | C0037570 | 48.19    | 63.63 | 45.45 | 0.00       | Sodium or sodium compounds used in foods or as a food. The most frequently used compounds are sodium chloride or sodium glutamate.
Figure 1: Correlation matrix of the linguistic features and the accuracy.
Knowing the linguistic characteristics of the texts or sections of a document has a direct application to real-world scenarios. For example, clinical trial documents are composed of an official title, a summary, and inclusion and exclusion criteria sections; ambiguous terms can appear in any of these sections. The official title section may not contain sufficient information to disambiguate a term. Thus, picking the section that matches the linguistic profile our models describe will ensure accurate disambiguation.
In formula 1, the function maxSim(w, S) selects, for each comparison, the word w from the referent that has the maximum similarity value with the subject of comparison. We collected these words; together, they closely resemble a semantic field. A semantic field is a set of semantically related lexical items whose meanings are mutually interdependent and which together provide a conceptual structure for a certain domain of reality (Geeraerts, 2010, p. 52). E.g., a semantic field for "school" would be composed of teacher, student, blackboard, book, and notebook.
Figure 2: Boxplots for the semantic cohesion of the semantic fields for the correctly classified instances (left) and the incorrectly classified instances (right).

To evaluate the quality of the automatically created semantic fields, we measured the semantic similarity among each group of words. This strategy has been used to measure the quality of topic modeling (Korenčić et al., 2018), a task which is somewhat similar to building a semantic field. In Table 3, we can see an example of the ambiguous term "coffee", its definitions, and the semantic field created by all the correctly classified instances of each sense. In previous research (Gui et al., 2019), the evaluation of topic modeling has been used to reinforce the learning process of a deep neural network model. Similarly, WSD with supervised and unsupervised methods could benefit from this kind of feedback to increase performance.
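One simple way to score such a field, sketched here under the assumption that average pairwise embedding similarity is an acceptable stand-in for the coherence measure borrowed from Korenčić et al. (2018):

```python
from itertools import combinations
import spacy

nlp = spacy.load("en_core_sci_lg")  # the scispaCy model also used above

def semantic_cohesion(words: list) -> float:
    """Average pairwise similarity of a word group, as a rough coherence
    score for an automatically collected semantic field."""
    docs = [nlp(w) for w in words]
    pairs = list(combinations(docs, 2))
    return sum(a.similarity(b) for a, b in pairs) / len(pairs)

# The semantic field of the 'beverage' sense of coffee (see Table 3).
print(semantic_cohesion(["consume", "beverage", "coffee", "caffeine", "roast"]))
```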
Table 3: Meanings and lexical fields for the ambiguous term "coffee".

CUI      | Coherence | Definition                                                                                                                                         | Semantic field
C0085952 | 0.3016    | A plant genus of the family RUBIACEAE. It is best known for the COFFEE beverage prepared from the beans (SEEDS).                                   | 'coffee', 'genus', 'family', 'tree', 'plant'
C0009237 | 0.4859    | A beverage made from ground COFFEA beans (SEEDS) infused in hot water. It generally contains CAFFEINE and THEOPHYLLINE unless it is decaffeinated. | 'consume', 'beverage', 'coffee', 'caffeine', 'roast'
4.3 Feedback from the Annotation Process
At the end of the annotation process, we received feedback from the annotators. One of them remarked that for the term basal cell carcinoma with CUIs C3540686, C2984322, and C0007117, the definitions are vague and difficult to distinguish. Considering these comments, we decided to investigate some terms in the MeSH dataset to see if there were similar cases. Our assumption was the following: in theoretical semantics, we deal with vague terms and ambiguous terms. We speak of a vague term when the contexts where it appears give information not specified in the definition. In the sentences he is our publicist and she is our publicist, the term publicist is vague for gender; the contexts only give details that do not appear in the definition (these examples were extracted from (Saeed, 2008, p. 62)). For ambiguous terms, the context causes one of the senses to be selected (Saeed, 2008, p. 61). In dictionaries, lexicographers separate the senses of ambiguous terms by domain. In UMLS, the semantic types could play a similar role. Thus, for an ambiguous term, if its definitions are associated with different semantic types, we are most probably dealing with a truly ambiguous term. On the other hand, if the definitions of an ambiguous term share the same semantic type, we are more probably dealing with a vague term.
We found 25 out of 203 terms (12.31%) matching this assumption, i.e., where the definitions have the same semantic type. For example, for the term B-Cell Leukemia, associated with both CUIs C2004493 and C0023434, the two CUIs have the same semantic type, T191 (Neoplastic Process). However, the definition of C2004493 gives a general description of the disease, while the definition of C0023434 gives more details and offers a classification of the disease. Inspecting the MRREL table from UMLS, we see that CUI C2004493 has a parent-child relationship with CUI C0023434. Thus, these two definitions are not ambiguous but vague. Another example is the term milk with CUIs C0026131 and C0026140, where the latter has a parent-child relationship similar to that of the B-Cell Leukemia term. This observation has two implications for future research. First, researchers seeking to improve WSD systems should consider the difference between ambiguous and vague terms; the two need to be tackled differently. Second, those seeking to build datasets for WSD need to be aware that ambiguity is more potential than real (Saeed, 2008, p. 61). Candidate ambiguous terms need to undergo ambiguity tests to determine whether they are vague or ambiguous. Such a test could be automated by checking the semantic relationships between the candidates in the MRREL table, as sketched below. This practice will ensure that the dataset helps to answer the question of WSD.
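A minimal sketch of such an automated check, assuming a local copy of the UMLS MRREL.RRF file; the column positions follow the standard pipe-delimited Rich Release Format (CUI1, AUI1, STYPE1, REL, CUI2, ...), and the function name is ours:

```python
import csv

def parent_child_in_mrrel(path: str, cui_a: str, cui_b: str) -> bool:
    """Scan MRREL.RRF for a PAR/CHD (parent/child) relation between two CUIs;
    such a link suggests vagueness rather than true ambiguity."""
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="|"):
            cui1, rel, cui2 = row[0], row[3], row[4]  # CUI1, REL, CUI2 fields
            if rel in ("PAR", "CHD") and {cui1, cui2} == {cui_a, cui_b}:
                return True
    return False

# e.g. the B-Cell Leukemia example from the text:
# parent_child_in_mrrel("MRREL.RRF", "C2004493", "C0023434")  # expected: True
```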
The presence of vague terms in the MeSH dataset could mislead WSD research, since vagueness and ambiguity are two different problems. Solving vagueness could be a different NLP problem, where the aim is to retrieve as much information as possible to make a term less vague. Thus, we recommend that terms in the MeSH dataset be enriched with labels for vagueness and ambiguity.
5 CONCLUSIONS
In this paper, we presented a knowledge-based approach for word sense disambiguation of medical terminology that uses an asymmetrical comparison strategy. Our approach can be configured to use any semantic measure on WordNet or a semantic measure based on word embeddings. In our experiments, we found that the WordNet-based measures performed very closely to those based on word embeddings. Such performance gives our strategy an advantage over others when there is no specialized domain resource to tackle ambiguity. We conducted a statistical analysis of the texts in the MeSH corpus and of a small clinical-trial-based dataset we constructed, using linguistic variables commonly used in corpus linguistics studies. This analysis helped us understand the characteristics of the input texts and their impact on our models' performance. Our results suggest that definitions need to be lexically diverse and informative to ensure better accuracy.
During our data analysis, we also identified the need to differentiate between vague and ambiguous terms, which we believe has implications for the use of test corpora such as MeSH for WSD research, even beyond the medical domain.
REFERENCES
Agirre, E. and Soroa, A. (2009). Personalizing pagerank
for word sense disambiguation. In Proceedings of the
EACL 2009, pages 33–41.
Antunes, R. and Matos, S. (2017a). Biomedical word sense
disambiguation with word embeddings. In Interna-
tional Conference on Practical Applications of Com-
putational Biology & Bioinformatics, pages 273–279.
Springer.
Antunes, R. and Matos, S. (2017b). Supervised learning
and knowledge-based approaches applied to biomedi-
cal word sense disambiguation. Journal of integrative
bioinformatics, 14(4).
Biber, D. (2006). University language: A corpus-based
study of spoken and written discourse. Amsterdam:
John Benjamin.
Bodenreider, O. (2004). The unified medical language sys-
tem (umls): integrating biomedical terminology. Nu-
cleic acids research, 32(suppl 1):D267–D270.
Geeraerts, D. (2010). Theories of lexical semantics. Oxford
University Press.
Gui, L., Leng, J., Pergola, G., Xu, R., He, Y., et al. (2019).
Neural topic model with reinforcement learning. In
Proceedings of the 2019 Conference on EMNLP-
IJCNLP, pages 3469–3474.
Humphrey, S. M., Rogers, W. J., Kilicoglu, H., Demner-
Fushman, D., and Rindflesch, T. C. (2006). Word
sense disambiguation by selecting the best semantic
type based on journal descriptor indexing: Prelimi-
nary experiment. JASIST, 57(1):96–113.
Jiang, J. J. and Conrath, D. W. (1997). Semantic similarity
based on corpus statistics and lexical taxonomy. arXiv
preprint cmp-lg/9709008.
Jimeno-Yepes, A. J. and Aronson, A. R. (2010).
Knowledge-based biomedical word sense disam-
biguation: comparison of approaches. BMC bioinfor-
matics, 11(1):569.
Jimeno-Yepes, A. J., McInnes, B. T., and Aronson, A. R.
(2011). Exploiting mesh indexing in medline to gen-
erate a data set for word sense disambiguation. BMC
bioinformatics, 12(1):223.
Jurafsky, D. and Martin, J. H. (2020). Speech & Language Processing [Book in preparation]. https://web.stanford.edu/~jurafsky/slp3/.
Korenčić, D., Ristov, S., and Šnajder, J. (2018). Document-based topic coherence measures for news media text. Expert Systems with Applications, 114:357–373.
Lesk, M. (1986). Automatic sense disambiguation using
machine readable dictionaries: how to tell a pine cone
from an ice cream cone. In Proceedings of the 5th an-
nual international conference on Systems documenta-
tion, pages 24–26.
Lin, D. et al. (1998). An information-theoretic definition of
similarity. In Icml, volume 98, pages 296–304.
McInnes, B. (2008). An unsupervised vector approach to
biomedical term disambiguation: integrating umls and
medline. In Proceedings of the ACL-08: HLT Student
Research Workshop, pages 49–54.
Miller, G. A. (1995). Wordnet: a lexical database for en-
glish. Communications of the ACM, 38(11):39–41.
Navigli, R. (2009). Word sense disambiguation: A survey.
ACM computing surveys (CSUR), 41(2):1–69.
Neumann, M., King, D., Beltagy, I., and Ammar, W.
(2019). Scispacy: Fast and robust models for biomed-
ical natural language processing. arXiv preprint
arXiv:1902.07669.
Pesaranghader, A., Matwin, S., Sokolova, M., and Pe-
saranghader, A. (2019). deepbiowsd: effective deep
neural word sense disambiguation of biomedical text
data. JAMIA, 26(5):438–446.
Pustejovsky, J. and Stubbs, A. (2012). Natural Language Annotation for Machine Learning: A guide to corpus-building for applications. O'Reilly Media, Inc.
Pyysalo, S., Ginter, F., Tapio, S., and Sophia, A. (2013).
Distributional semantics resources for biomedical text
processing. Proceedings of LBM, pages 39–44.
Saeed, J. I. (2008). Semantics. Wiley-Blackwell.
Tversky, A. (1977). Features of similarity. Psychological
review, 84(4):327.
Velazquez, E., Ratté, S., and de Jong, F. (2016). Analyzing students' knowledge building skills by comparing their written production to syllabus. In International Conference on Interactive Collaborative Learning, pages 345–352. Springer.
Wang, Y., Zheng, K., Xu, H., and Mei, Q. (2018). Inter-
active medical word sense disambiguation through in-
formed learning. JAMIA, 25(7):800–808.
Weeber, M., Mork, J. G., and Aronson, A. R. (2001). Devel-
oping a test collection for biomedical word sense dis-
ambiguation. In Proceedings of the AMIA Symposium,
page 746. American Medical Informatics Association.
Zhang, C., Biś, D., Liu, X., and He, Z. (2019). Biomedical word sense disambiguation with bidirectional long short-term memory and attention-based neural networks. BMC bioinformatics, 20(16):502.