Semantic Textual Similarity Assessment in Chest X-ray Reports Using a
Domain-Specific Cosine-Based Metric
Sayeh Gholipour Picha, Dawood Al Chanti and Alice Caplier
Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, 38000 Grenoble, France
ORCID: https://orcid.org/0000-0003-2675-5463 (S. Gholipour Picha), https://orcid.org/0000-0002-6258-6970 (D. Al Chanti), https://orcid.org/0000-0002-5937-4627 (A. Caplier)
Keywords:
Semantic Similarity, Medical Language Processing, Biomedical Metric.
Abstract:
Medical language processing and deep learning techniques have emerged as critical tools for improving health-
care, particularly in the analysis of medical imaging and medical text data. These multimodal data fusion
techniques help to improve the interpretation of medical imaging and lead to increased diagnostic accuracy,
informed clinical decisions, and improved patient outcomes. The success of these models relies on the ability
to extract and consolidate semantic information from clinical text. This paper addresses the need for more
robust methods to evaluate the semantic content of medical reports. Conventional natural language processing approaches and metrics were originally designed for general-domain text and machine translation, and they often fail to capture the complex semantic meaning inherent in medical content. In this study, we introduce a novel approach designed specifically for assessing the semantic simi-
larity between generated medical reports and the ground truth. Our approach is validated, demonstrating its
efficiency in assessing domain-specific semantic similarity within medical contexts. By applying our metric to
state-of-the-art Chest X-ray report generation models, we obtain results that not only align with conventional
metrics but also provide more contextually meaningful scores in the considered medical domain.
1 INTRODUCTION
Advancements in deep learning for medical language
processing have significantly improved healthcare
clinical analysis, particularly in the domain of med-
ical imaging applications. Notably, there has been
substantial progress in generating chest X-ray reports
comparable to those written by radiologists. How-
ever, a critical challenge persists in the chest X-ray ap-
plication—assessing the semantic similarity between
generated reports and the ground truth.
Identifying semantic similarities in medical texts
is a difficult task within the language processing do-
main (Alam et al., 2020). This task necessitates a
comprehensive grasp of the entire medical text cor-
pus, the ability to recognize key content, and a pro-
found understanding of the semantic relationships
between these critical keywords at an expert level.
While existing metrics and approaches for capturing
semantic similarity in natural language are effective,
they are not designed for the complexities of medical
content. The need for a robust metric to assess seman-
tic similarity in medical texts has become increasingly
evident, particularly in applications like chest X-ray
report generation, and continues to be an active area
of research (Endo et al., 2021), (Miura et al., 2021),
(Yu et al., 2022).
State-of-the-art chest X-ray report generation
models (Chen et al., 2020), (Miura et al., 2021),
(Endo et al., 2021) still rely on conventional Natural
Language Processing (NLP) methods like BLEU (Pa-
pineni et al., 2002), METEOR (Banerjee and Lavie,
2005), and ROUGE (Lin, 2004) to evaluate the gen-
erated reports against ground truth references. How-
ever, these metrics produce unreliable results due to
their inability to comprehend and compare the seman-
tic similarity of key medical terms. A medical seman-
tic similarity metric would not only provide more sig-
nificant evaluation scores but could also be incorpo-
rated into the training process to improve model per-
formance, potentially leading to enhanced diagnostic
accuracy and decision-making. Additionally, as part
of our ongoing research, our goal is to focus on pro-
viding visual interpretations of chest X-ray reports us-
ing text-to-image localization. As a consequence, a
robust semantic similarity evaluation metric suitable
for medical content will ensure the reliability of gen-
erated reports and will enable us to achieve more ac-
curate localization and interpretation of image con-
tent.
In this context, we propose a new metric designed
to assess the semantic similarity of medical texts and assign corresponding scores. Our metric consists of two se-
quential steps: first, we identify the primary clinical
entities, and subsequently, we evaluate the similarity
between these entities using the domain-specific Co-
sine similarity score. Notably, our approach considers
the presence of negations and detailed descriptions as-
sociated with medical entities during the evaluation
process. To this end, our contributions include:
• Introduction of a novel system for clinical entity extraction from medical texts.
• Proposition of a new scoring system for the evaluation of semantic similarity that suits medical and natural texts.
• Presentation of a validation method for scoring verification.
This paper is structured as follows: Section 2 dis-
cusses related works; Section 3 presents the theoreti-
cal and mathematical part of the novel metric; Section
4 validates the metric; Section 5 discusses the results;
Finally, Section 6 concludes the paper.
2 RELATED WORKS
Recent studies have addressed the challenge of sim-
ilarity evaluation between generated medical reports
and the ground truth through various approaches other
than conventional NLP metrics. Researchers have of-
ten introduced innovative metrics in the process.
In the CXR-RePaiR model, Endo et al. (Endo et al., 2021) propose an approach for automatically evaluating chest X-ray report generation by introducing the CheXbert vector similarity metric, which relies on the CheXbert labeler (Smit et al., 2020), a specialized tool for chest X-ray report labeling.
The process involves extracting labels from gener-
ated reports, comparing them with ground truth la-
bels, and presenting the final score using cosine sim-
ilarity. While this approach outperforms the BLEU
metric, its applicability is limited to the specific con-
text of chest X-ray reports and does not readily extend
to other medical applications. The limitations arise
from Chexbert being exclusively trained for chest X-
ray reports. Moreover, the Chexpert labels (Irvin
et al., 2019) (Atelectasis, Cardiomegaly, Consolida-
tion, Edema, Enlarged Cardiomediastinum, Fracture,
Lung Lesion, Lung Opacity, No Finding, Pleural Ef-
fusion, Pleural Other, Pneumonia, Pneumothorax) are
specific to the chest X-ray dataset, further limiting the
generalizability of the approach to other medical con-
texts.
In a separate study, Yu et al. (Yu et al., 2022) in-
troduced a novel metric targeting the quantification of
overlap of clinical entities between ground truth and
generated reports in chest X-ray report generation.
They use the RadGraph model (Jain et al., 2021), a
language model trained on a limited subset of reports
from the MIMIC-CXR dataset (Johnson et al., 2019).
The MIMIC-CXR dataset consists of chest X-ray im-
ages with corresponding reports, and the RadGraph
dataset includes medical entities from chest X-ray re-
ports annotated by radiologists. The approach by Yu
et al. is similar to the BLEU score, exclusively con-
sidering the exact matches among the primary entities
in generated and ground truth reports, overlooking
the semantic similarity of these entities. Furthermore,
the generalizability of this approach to other medical
applications is constrained by the RadGraph model’s
specialization in extracting only chest X-ray related
entities. Moreover, although the RadGraph model acknowledges negations in the texts, they are treated merely as labels attached to the entities, and the details of entity descriptions are not factored into the evaluation process.
In a recent study, Patricoski et al. (Patricoski et al.,
2022) conducted an evaluation of seven BERT mod-
els to assess semantic similarity in clinical trial texts.
Notably, the pre-trained BERT model known as SciB-
ERT (Beltagy et al., 2019) demonstrated better per-
formance compared to the other BERT models, even
outperforming the standard BERT model, which se-
cured the second position in this evaluation. This
study underlines the promising potential of BERT
models in semantic similarity evaluation. However,
this approach has a drawback: the BERT models are used without preprocessing. Operating at the token level, they evaluate semantic similarity by comparing every token with every other, a computationally intensive process that tends to yield relatively low scores. Despite this computational challenge, SciBERT retains significant potential, particularly owing to its large clinical vocabulary.
This finding underscores the need for careful consid-
eration of preprocessing strategies to maximize the
effectiveness of BERT models in semantic similarity
evaluations.
Notably, the absence of a comprehensive, general
semantic similarity evaluation metric for medical con-
tent persists. Consequently, we introduce a novel met-
ric for Medical Corpus Similarity Evaluation (MCSE)
to comprehensively address and resolve these chal-
lenges.
3 METHODOLOGY
We developed a novel metric for Medical Corpus
Similarity Evaluation (MCSE) by exclusively ex-
tracting key medical entities and employing a pre-
trained BERT model to assess the semantic similarity
of these entities within chest X-ray reports. This tar-
geted approach allows BERT to concentrate solely on
important information and reduces the computational
load during comparison. Importantly, our methodology goes beyond extracting main entities: we also
consider the negations and detailed descriptions asso-
ciated with the primary medical entities in chest X-ray
reports. Our MCSE metric consists of two essential
steps:
1. Clinical Entity Extraction.
2. Domain Similarity Evaluation.
3.1 Clinical Entity Extraction
Semantic similarity evaluation in text relies first on identifying
the key elements, often referred to as clinical entities,
within medical texts. These entities typically fall into
categories related to anatomical body parts, symp-
toms, laboratory equipment, and diagnoses. Each cat-
egory is typically signaled by certain words within a
sentence. However, there are additional words that
precede or follow these main entities, offering de-
scriptions.
To address these complexities, we employ the
Scispacy model (Neumann et al., 2019) for extract-
ing primary clinical entities from medical text us-
ing the embedded clinical dictionary in this model
(BC5CDR: a corpus comprising 1500 PubMed arti-
cles with 4409 annotated chemicals, 5818 diseases,
and 3116 chemical-disease interactions (Li et al.,
2016)). Subsequently, we automatically process the
entire text to identify associated negations and adjec-
tives related to these key entities. These elements
are then integrated to provide a comprehensive rep-
resentation of the considered text. In the context of
this research, the category of laboratory equipment is
deliberately excluded, aligning with the specific fo-
cus of our application. Table 1 presents an example
of medical text and the extracted entities using our
method and the Scispacy method without any clean-
ing process. While we employ the Scispacy model for
initial entity extraction, it is evident that this model
alone may not suffice. An additional automated post-
processing step is needed to refine and integrate re-
lated entities. The post-processing steps involve eliminating entities that consist of a single adjective or are non-medical, excluding entities categorized as laboratory equipment, identifying and attaching the relevant adjectives to the remaining medical entities, merging any associated negations into these primary entities, and screening out terms referring to diagnostic procedures. These processes
are essential to ensure that the final output is presented
as a cohesive set of primary medical entities, ready for
practical use.
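To make the extraction step concrete, the sketch below illustrates how such a pipeline can be assembled with ScispaCy. The model name (`en_ner_bc5cdr_md`), the negation cue list, and the adjective-attachment rule are illustrative assumptions for the sketch, not the exact rules of our released implementation.

```python
# Illustrative sketch of clinical entity extraction with ScispaCy.
# Assumptions: the "en_ner_bc5cdr_md" model (BC5CDR-trained NER) is installed
# and exposes part-of-speech tags; the negation cues and the single-adjective
# attachment rule below are simplified stand-ins for the full post-processing.
import spacy

nlp = spacy.load("en_ner_bc5cdr_md")
NEGATION_CUES = {"no", "not", "without", "absence"}  # illustrative list only

def extract_entities(report: str) -> list[str]:
    doc = nlp(report)
    entities = []
    for ent in doc.ents:
        phrase = ent.text.lower()
        # Attach an adjective immediately preceding the entity (e.g. "minor atelectasis").
        if ent.start > 0 and doc[ent.start - 1].pos_ == "ADJ":
            phrase = f"{doc[ent.start - 1].text.lower()} {phrase}"
        # Prepend a negation marker if a cue appears shortly before the entity.
        window = doc[max(0, ent.start - 4):ent.start]
        if any(tok.lower_ in NEGATION_CUES for tok in window):
            phrase = f"no {phrase}"
        entities.append(phrase)
    return entities

print(extract_entities("No focal consolidation. Minor atelectasis at the right lung base."))
```

The full post-processing additionally removes single-adjective or non-medical entities, laboratory-equipment terms, and diagnostic-procedure terms, as described above.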
3.2 Domain Similarity Evaluation
Having successfully extracted and shifted our focus
to the primary entities within the medical corpus, the
next step involves assessing their semantic similarities
by assigning corresponding scores.
After processing entity extraction, we calculate
a similarity score for the sequences of entities. Let $T = (t_1, \dots, t_N)$ represent the reference text entities and $\hat{T} = (\hat{t}_1, \dots, \hat{t}_M)$ represent the generated (candidate) text entities. Initially, we identify the exactly matching medical entities in both sequences and determine their total count ($|C^{(i)}|$). For the remaining entities, we construct a similarity matrix, where each element represents the similarity score between entities, as illustrated in Table 2.
$$S_i = \frac{\max_j \, y_{i,j}}{\max_j \, y_{i,j} + \bar{y}_i}, \qquad i = 1, \dots, M, \; j = 1, \dots, N \tag{1}$$

with $\bar{y}_i = \frac{1}{N}\sum_{j=1}^{N} y_{i,j}$ the average of column $i$, and

$$y_{i,j} = \mathrm{Similarity}(r_i, \hat{r}_j) \tag{2}$$
$$\begin{cases} C^{(i)} = t_i, & \text{if } t_i = \hat{t}_j \\ r_i = t_i \;\&\; \hat{r}_j = \hat{t}_j, & \text{if } t_i \neq \hat{t}_j \end{cases} \tag{3}$$
where $M$ is the total number of candidate entities, $r_i$ and $\hat{r}_j$ are the sequences of unmatched entities as defined in Equation (3), and $S_i$ is a normalized similarity score between $r_i$ and $\hat{r}_j$. The similarity score $\mathrm{Similarity}(r_i, \hat{r}_j)$ in Equation (2) is computed with spaCy (Honnibal et al., 2020), using pretrained word embeddings to evaluate domain-specific cosine similarity.
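As a concrete illustration of Equation (2), the pairwise score can be obtained directly from the cosine similarity of spaCy vector representations. The model name below (`en_core_sci_lg`) is an assumption for the sketch; any spaCy pipeline with domain word vectors exposes the same `similarity` call.

```python
# Minimal sketch of Equation (2): cosine similarity between entity phrases,
# using spaCy word vectors. "en_core_sci_lg" is an assumed ScispaCy model with
# vectors; it is not necessarily the exact pipeline used in our experiments.
import spacy

nlp = spacy.load("en_core_sci_lg")

def similarity(ref_entity: str, cand_entity: str) -> float:
    """Cosine similarity of the two phrases, standing in for Similarity(r_i, r_j)."""
    return nlp(ref_entity).similarity(nlp(cand_entity))

print(similarity("mild to moderate pulmonary edema", "pulmonary masses"))
```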
To evaluate the similarity of the candidate entities with respect to the reference entities, we take the maximum score of each column and normalize it by the column's average, yielding $S_i$ as in Equation (1). We then sum these scores over the columns and add the count of exact matches $|C^{(i)}|$. To obtain the final similarity score between the two corpora, we divide this sum by the total number of candidate entities. This process is summarized in Equation (4).
Table 1: An example of a medical text together with the clinical entities extracted using our method and those extracted using the ScispaCy model (Neumann et al., 2019) without any cleaning process.

Medical Text: "1. Interval clearance of left basilar consolidation. 2. Patchy right basilar opacities, which could be seen with minor atelectasis, but given the context clinical correlation is suggested regarding any possibility for recurrent or new aspiration pneumonitis at the right lung base. 3. Increased new interstitial abnormality, suggesting recurrence of fluid overload or mild-to-moderate pulmonary edema; aspiration could also be considered. Inflammation associated with atypical infectious process is probably less likely given the waxing and waning presentation."

Extracted entities using our method: fluid overload, inflammation, aspiration pneumonitis, minor atelectasis, mild to moderate pulmonary edema, left basilar consolidation, patchy right basilar opacities, interstitial abnormality

Extracted entities using ScispaCy (Neumann et al., 2019): Interval, clearance, left basilar, consolidation, Patchy, right basilar, opacities, minor, atelectasis, clinical, recurrent, aspiration, pneumonitis, right lung base, Increased, interstitial abnormality, recurrence, fluid, overload, mild-to-moderate pulmonary edema, aspiration, Inflammation, associated with, atypical, infectious process, waxing, waning, presentation
$$\mathrm{MCSE} := \frac{|C^{(i)}| + \sum_{i=1}^{M} S_i}{M} \tag{4}$$

where $|C^{(i)}|$ is the number of exactly matched entities between the two corpora $T$ and $\hat{T}$.
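A compact sketch of how Equations (1)-(4) combine is given below; `similarity` stands for the pairwise score of Equation (2) (for example, the spaCy-based function sketched earlier), and the handling of exact matches is simplified to set membership.

```python
# Sketch of the MCSE aggregation (Equations (1)-(4)). `similarity(ref, cand)`
# is assumed to return the pairwise entity similarity of Equation (2).
from statistics import mean

def mcse(reference_entities, candidate_entities, similarity):
    exact = [c for c in candidate_entities if c in reference_entities]      # |C|
    rest_cand = [c for c in candidate_entities if c not in reference_entities]
    rest_ref = [r for r in reference_entities if r not in candidate_entities]

    column_scores = []
    for cand in rest_cand:                        # one column per unmatched candidate
        col = [similarity(ref, cand) for ref in rest_ref]
        if col:                                   # Equation (1): max / (max + mean)
            column_scores.append(max(col) / (max(col) + mean(col)))

    m = len(candidate_entities)                   # total number of candidate entities
    return (len(exact) + sum(column_scores)) / m if m else 0.0
```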
For instance, Table 2 provides an example of the
probable similarity score that two sets of entities can
receive. These entities have been extracted using our
medical entity extraction procedure.
In the table, the two corpora received a score of
0.55 according to our MCSE metric. However, the
calculated BLEU score for them is approximately
zero. Upon analyzing the two medical texts, it be-
comes evident that, although the candidate text refers to the same side of the chest as the reference text and both texts indicate the presence of pulmonary edema and pulmonary masses, their overall similarity is relatively limited. The score of 0.55 car-
ries a more meaningful value in this context compared
to the nearly zero score generated by BLEU.
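The 0.55 value can be traced by hand (or with the sketch above) from the similarity matrix of Table 2: with no exact matches, each column score follows Equation (1), and the mean over the three candidate entities gives the final result.

```python
# Reproducing the Table 2 example with the pairwise scores reported there.
from statistics import mean

columns = {  # candidate entity -> similarity scores against the 8 reference entities
    "pulmonary masses":  [0.61, 0.64, 0.65, 0.62, 0.78, 0.52, 0.64, 0.69],
    "right middle lobe": [0.49, 0.48, 0.39, 0.47, 0.31, 0.66, 0.66, 0.63],
    "hilar adenopathy":  [0.45, 0.55, 0.50, 0.53, 0.51, 0.32, 0.49, 0.59],
}
s = [max(col) / (max(col) + mean(col)) for col in columns.values()]  # Equation (1)
print(round(sum(s) / len(columns), 2))  # no exact matches, so MCSE is about 0.55
```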
4 VALIDATION
While the underlying logic of this metric is reason-
able, it is imperative that we validate the results ro-
bustly. Given the use of chest X-ray reports for this
particular application, we have conducted an exten-
sive search within existing datasets to identify an ap-
propriate validation method. After a comprehensive
review of various datasets, we concluded that it would
be more effective to conduct separate validations for
the different steps of the proposed metric.
4.1 Clinical Entity Extraction Process
In order to rigorously validate our clinical entity ex-
traction process, we employ the RadGraph dataset
(Jain et al., 2021). This dataset is a valuable resource
in which radiologists thoroughly annotated the pri-
mary clinical entities in chest X-ray reports as either
”definitely present” within the report or ”definitely
absent”. Importantly, in cases where a negation is
associated with a particular entity, it is annotated as
”definitely absent.
To achieve our validation objectives, we executed
our entity extraction process on the reports within this
dataset. Subsequently, we compare the number of
similar entities extracted through our method with the
annotations provided by radiologists, particularly fo-
cusing on the two categories of ”definitely present”
and ”definitely absent”. This systematic comparison
allows us to assess the accuracy and effectiveness of
our clinical entity extraction methodology in the con-
text of chest X-ray reports, aligning with radiological
standards. Throughout the validation process, cover-
ing all reports in our study, our method consistently
achieves a high level of accuracy. On average, it ac-
curately recognizes 75% of entities marked as ”defi-
nitely present” and successfully identifies 76% of en-
tities labeled as ”definitely absent”. In our entity ex-
traction process, we deliberately omit anatomical en-
tities like "chest" or "lung", as they are redundant to
Table 2: An example of the medical similarity scores between entities. Each score is calculated from Equation (2), and the final row $S_i$ is computed using Equation (1). The scores highlighted in blue indicate the maximum value within each column.

Reference: 1. Interval clearance of left basilar consolidation. 2. Patchy right basilar opacities, which could be seen with minor atelectasis, but given the context clinical correlation is suggested regarding any possibility for recurrent or new aspiration pneumonitis at the right lung base. 3. Increased new interstitial abnormality, suggesting recurrence of fluid overload or mild-to-moderate pulmonary edema; aspiration could also be considered. Inflammation associated with atypical infectious process is probably less likely given the waxing and waning presentation.

Candidate: Stable multiple bilateral pulmonary masses and right middle lobe collapse due to hilar adenopathy.

Reference Medical Entities \ Candidate Medical Entities    pulmonary masses   right middle lobe   hilar adenopathy
fluid overload                                             0.61               0.49                0.45
inflammation                                               0.64               0.48                0.55
aspiration pneumonitis                                     0.65               0.39                0.50
minor atelectasis                                          0.62               0.47                0.53
mild to moderate pulmonary edema                           0.78               0.31                0.51
left basilar consolidation                                 0.52               0.66                0.32
patchy right basilar opacities                             0.64               0.66                0.49
interstitial abnormality                                   0.69               0.63                0.59
$S_i$                                                      0.548              0.563               0.545
the chest X-ray application and do not contribute sig-
nificantly to the process. This selective exclusion is
one of the factors contributing to the approximately
75% accuracy in our results. Nevertheless, these
results affirm the reliability and consistency of our
methodology.
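The comparison against the RadGraph annotations can be reduced to a recall-style check; the sketch below assumes the annotated entities have already been loaded from the dataset (the RadGraph file format is not reproduced here) and counts how many of them are recovered by our extraction.

```python
# Sketch of the Section 4.1 check: the share of radiologist-annotated entities
# ("definitely present" or "definitely absent") recovered by our extraction.
# Loading the RadGraph annotations into `annotated` is left abstract.
def entity_recall(annotated: list[str], extracted: list[str]) -> float:
    extracted_text = " | ".join(e.lower() for e in extracted)
    hits = sum(1 for a in annotated if a.lower() in extracted_text)
    return hits / len(annotated) if annotated else 1.0

# Example: two of the three annotated entities appear among the extracted ones.
print(entity_recall(["pleural effusion", "pneumothorax", "edema"],
                    ["small pleural effusion", "no pneumothorax"]))
```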
4.2 Domain Similarity Score
In contrast to the initial phase of clinical entity extrac-
tion, validating the domain similarity score is more
challenging. The scoring system itself is more con-
troversial and subject to debate, and creating an au-
tomated validation method, free from reliance on ra-
diologists, necessitates a creative and innovative ap-
proach. Nevertheless, through the available tools and
databases, we establish a dedicated system for the val-
idation of this scoring method for the application of
chest X-rays.
In the chest X-ray application, the MIMIC-CXR
dataset (Johnson et al., 2019), is one of the biggest
available databases for chest X-ray images and their
corresponding reports. Notably, this dataset pro-
vides us with Chexpert labels (Medical Observation),
including Atelectasis, Cardiomegaly, Consolidation,
Edema, Enlarged Cardiomediastinum, Fracture, Lung
Lesion, Lung Opacity, No Finding, Pleural Effusion,
Pleural Other, Pneumonia, Pneumothorax, and Sup-
port Devices labels (Irvin et al., 2019). The values
of each label are 1 (definitely present), 0 (definitely
absent), -1 (ambiguous), or it carries no value at all.
Table 3 presents a sample of Chexpert labels extracted
from chest X-ray reports of five patients from the
MIMIC-CXR database. The reports corresponding to
these subjects are presented in Table 4.
Our approach involves two distinct strategies.
Firstly, we seek to identify reports sharing the same
sequence of labels and values. For instance, we
search for reports from subjects with Chexpert label
sequences similar to that of Subject 01 in Table 3. For
these reports with matching label sequences, we compute similarity scores for each pair of reports. Simultaneously, we identify reports featuring only one or two labels, each with a value of "definitely present", resembling Subject 02 in Table 3, and assess the similarity of these reports
with the reports with different label sequences. As
an example, we calculate the similarity between the
reports of Subject 02 and Subject 05 from Table 3,
given their entirely distinct label sequences. This two-
fold method allows us to analyze the semantic simi-
larity scores for both similar and contrasting reports
in terms of their labels.
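A simplified version of this two-fold protocol is sketched below; `reports` and `labels` are assumed to map subject identifiers to report texts and CheXpert label vectors respectively, and `mcse_score` is the metric applied to a pair of reports. Unlike the paper, the sketch contrasts all pairs with differing label sequences rather than restricting the contrast set to single-finding reports.

```python
# Sketch of the two-fold validation of Section 4.2 (simplified).
from collections import defaultdict
from itertools import combinations
from statistics import mean

def twofold_validation(reports, labels, mcse_score):
    groups = defaultdict(list)                 # label sequence -> subject ids
    for subject, vector in labels.items():
        groups[tuple(vector)].append(subject)

    same, diff = [], []
    group_items = list(groups.items())
    for idx, (vector, subjects) in enumerate(group_items):
        # Pairs of reports sharing exactly the same label sequence.
        same += [mcse_score(reports[a], reports[b]) for a, b in combinations(subjects, 2)]
        # Pairs of reports with different label sequences.
        for _, others in group_items[idx + 1:]:
            diff += [mcse_score(reports[a], reports[b]) for a in subjects for b in others]

    return (mean(same) if same else None, mean(diff) if diff else None)
```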
Figure 1 presents the results of the two-fold vali-
dation for our scoring method. Within the figure, blue
dots represent the average scores for semantic evalu-
ation of reports with similar label sequences, while
orange dots show the mean scores for reports with
Table 3: A sample table featuring Chexpert labels (1. Atelectasis, 2. Cardiomegaly, 3. Consolidation, 4. Edema, 5. Enlarged
Cardiomediastinum, 6. Fracture, 7. Lung Lesion, 8. Lung Opacity, 9. No Finding, 10. Pleural Effusion, 11. Pleural Other,
12. Pneumonia, 13. Pneumothorax, 14. Support Devices) extracted from chest X-ray reports of five patients (Subject ##)
from the MIMIC-CXR database (Johnson et al., 2019).
Subject ## Atelectasis Cardiomegaly Consolidation Edema Enlarged Cardiomediastinum Fracture Lung Lesion Lung Opacity No Finding Pleural Effusion Pleural Other Pneumonia Pneumothorax Support Devices
01 0 1 1 -1
02 1 1
03 1 0
04 1 0 -1 0 1
05 1 1
Table 4: Reports corresponding to the subjects listed in Ta-
ble 3 from the MIMIC-CXR dataset (Johnson et al., 2019).
Subject ## Report
01 Lung volumes remain low. There are innumerable bi-
lateral scattered small pulmonary nodules which are bet-
ter demonstrated on recent CT. Mild pulmonary vascular
congestion is stable. The cardio mediastinal silhouette
and hilar contours are unchanged. Small pleural effusion
in the right middle fissure is new. There is no new focal
opacity to suggest pneumonia. There is no pneumotho-
rax.
02 A triangular opacity in the right lung apex is new from
prior examination. There is also fullness of the right
hilum which is new. The remainder of the lungs are clear.
Blunting of bilateral costophrenic angles, right greater
than left, may be secondary to small effusions. The heart
size is top normal.
03 Mild to moderate enlargement of the cardiac silhouette
is unchanged. The aorta is calcified and diffusely tor-
tuous. The mediastinal and hilar contours are otherwise
similar in appearance. There is minimal upper zone vas-
cular redistribution without overt pulmonary edema. No
focal consolidation, pleural effusion or pneumothorax is
present. The osseous structures are diffusely demineral-
ized.
04 The endotracheal tube tip is 6 cm above the carina. Na-
sogastric tube tip is beyond the GE junction and off the
edge of the film. A left central line is present in the tip
is in the mid SVC. A pacemaker is noted on the right in
the lead projects over the right ventricle. There is prob-
able scarring in both lung apices. There are no new ar-
eas of consolidation. There is upper zone redistribution
and cardiomegaly suggesting pulmonary venous hyper-
tension. There is no pneumothorax.
05 A moderate left pleural effusion is new. Associated
left basilar opacity likely reflect compressive atelectasis.
There is no pneumothorax. There are no new abnormal
cardiac or mediastinal contour. Median sternotomy wires
and mediastinal clips are in expected positions.
contrasting labels. The red horizontal line within the
figure serves as the dividing line distinguishing be-
tween similar and opposite evaluations. Upon review-
ing these results, it becomes evident that a distinct
boundary exists between reports sharing the same
clinical diagnoses and those with entirely dissimilar
diagnoses. Notably, there are no blue dots below
a 70% similarity threshold, whereas only six orange dots out of 70 label sequences have scores above 70%.
Figure 1: Semantic evaluation of chest X-ray reports. Each blue dot represents the mean score of semantic evaluation for reports with similar label sequences, while each orange dot signifies the mean score of semantic evaluation for reports with opposing labels. The red horizontal line represents the classification boundary.
Nevertheless, despite this differentiation between similar and opposite
evaluations, some level of similarity, exceeding 50%,
persists within the opposing category. This can be at-
tributed to the implemented cosine similarity within
the medical domain, which introduces a certain bias
towards tokens in the same medical domain. Unfor-
tunately, this bias cannot be entirely eliminated, as
it plays a substantial role in the evaluation process.
However, a clear boundary remains between similar
and contrasting reports.
5 RESULTS AND DISCUSSION
In our original application of chest X-ray report gen-
eration, we incorporate our metric to assess the out-
puts of various models. We compare our results with
the BLEU scores evaluated by these models, specifi-
cally, the CXR-RePaiR (Endo et al., 2021) and R2Gen
(Chen et al., 2020) models, both being state-of-the-art
models for generating chest X-ray reports. Our eval-
uation focuses on measuring the semantic similarity
between the generated reports and the ground truth.
Table 5 presents the BLEU scores obtained from these
models and our metric’s semantic evaluation. As an-
ticipated, the BLEU scores are relatively low, signi-
fying a substantial dissimilarity between the gener-
ated results and the ground truth for both the CXR-
RePaiR and R2Gen models, despite their status as state-of-the-art models for chest X-ray report generation. These models still employ the BLEU metric
for evaluation, primarily due to the scarcity of more
suitable metrics and the need for a standardized eval-
uation process for comparative purposes. Conversely,
our metric produces more promising results for both
of these models. While our metric’s scores align with
the BLEU scores, indicating higher scores for both
BLEU and our MCSE metric in the case of the R2Gen
model compared to the CXR-RePaiR, our metric pro-
vides a deeper evaluation. It suggests a degree of sim-
ilarity to the ground truth rather than the outright dissimilarity suggested by BLEU, thus making the generated reports
more reliable and trustworthy, which is a crucial ad-
vancement in the field.
Table 5: BLEU-2 scores reported for the state-of-the-art models and the scores of our novel MCSE metric on the same model outputs.

Models                           BLEU    Our MCSE
R2Gen (Chen et al., 2020)        0.212   0.71
CXR-RePaiR (Endo et al., 2021)   0.069   0.64
Table 6 provides an example of medical text gen-
erated and evaluated using both a BLEU score and
our MCSE metric. It’s evident that, according to the
BLEU score, these two texts appear vastly different,
even though they share the same primary medical en-
tities. However, when we delve into the context, we
can notice that ”moderately severe” serves as a de-
scription for the main entity, ”pulmonary edema”, in
the generated text. Similarly, in the second part of the
text, the main medical entity is ”pleural effusions”,
and terms like ”likely” and ”no large” are used to de-
scribe this entity, which may not be identical but share
semantic similarities. This subtle context evaluation
is precisely what our metric considers, yielding a sim-
ilarity score of 0.64 for these texts, which we argue
is a more accurate reflection compared to the BLEU
score.
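For reference, the sketch below shows how the BLEU-2 side of such a comparison can be computed with NLTK for the two sentences of Table 6 (shown next); the paper's exact tokenization and smoothing settings are not specified, so the value is indicative only, and the MCSE value would come from our metric's implementation.

```python
# BLEU-2 for the Table 6 sentence pair, computed with NLTK (indicative only;
# the paper's exact tokenization and smoothing settings are not specified).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "pulmonary edema , cardiomegaly , likely pleural effusions .".split()
candidate = ("moderately severe bilateral pulmonary edema "
             "with no large pleural effusion .").split()

bleu2 = sentence_bleu([reference], candidate, weights=(0.5, 0.5),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-2: {bleu2:.3f}")
```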
Table 6: A comparative example of using the BLEU score and our adapted metric with a medical reference and a generated sentence.

Reference sentence: "Pulmonary edema, cardiomegaly, likely pleural effusions."
Generated sentence: "Moderately severe bilateral pulmonary edema with no large pleural effusion."
BLEU: 0.047    MCSE: 0.64

Lastly, the significant benefit of employing this metric lies in its capacity for comparative analysis alongside other evaluation measures. For instance, when examining the outcomes of the BLEU score, with its word-by-word analysis, situations may arise where the results are totally inaccurate, casting doubt on their reliability, despite the models performing well overall. Integrating the results of our novel
MCSE metric into the evaluation process allows us
to semantically analyze and ascertain the dependabil-
ity of the models’ textual outputs within the context
of medical content.
6 CONCLUSION
In our research, we tackle the challenge of semantic
similarity scoring in medical corpora, driven by the
inadequacy of existing metrics that, while suitable for
machine translation evaluation, fall short in the field
of medical semantic assessment. Our innovative met-
ric draws inspiration from how humans comprehend
text, centering on the extraction of key terms and their
relational context. It introduces a novel approach for
extracting clinical entities from medical text, consid-
ering not only the entities themselves but also the as-
sociated descriptions and negations. Additionally, we
created a new method for scoring the semantic rela-
tionships between these entities by using the domain
cosine similarity. The validation process allowed us to
analyze and validate each of these steps individually,
revealing a clear distinction between reports sharing
the same diagnosis and those diverging in this regard.
For our research, we focused on the application of
chest X-rays, a critical domain where a robust seman-
tic evaluation metric is highly valuable. We applied
our metric to some of the latest state-of-the-art mod-
els, and the results harmonized with other evaluation
metrics, affirming their reliability.
While our validation process and implementa-
tion yielded successful outcomes, we encountered the
challenge of an inherent bias in domain cosine simi-
larity. This challenge has illuminated a promising di-
rection for our future research, as we explore ways
to mitigate this bias and advance the field of medical
semantic evaluation.
Material, Codes, and Acknowledgement: Results
can be reproduced using the code available in the
GitHub repository https://github.com/sayeh1994/Medical-Corpus-Semantic-Similarity-Evaluation.git. All the computations presented in this paper were performed using the (Gricad, ) infrastructure (https://gricad.univ-grenoble-alpes.fr), which is supported by Grenoble research communities.
REFERENCES
Alam, F., Afzal, M., and Malik, K. M. (2020). Comparative
analysis of semantic similarity techniques for medical
text. In 2020 International Conference on Information
Networking (ICOIN), pages 106–109.
Banerjee, S. and Lavie, A. (2005). METEOR: An automatic
metric for MT evaluation with improved correlation
with human judgments. In Proceedings of the ACL
Workshop on Intrinsic and Extrinsic Evaluation Mea-
sures for Machine Translation and/or Summarization,
pages 65–72, Ann Arbor, Michigan. Association for
Computational Linguistics.
Beltagy, I., Lo, K., and Cohan, A. (2019). SciBERT: A
pretrained language model for scientific text. In Inui,
K., Jiang, J., Ng, V., and Wan, X., editors, Proceed-
ings of the 2019 Conference on Empirical Methods
in Natural Language Processing and the 9th Inter-
national Joint Conference on Natural Language Pro-
cessing (EMNLP-IJCNLP), pages 3615–3620, Hong
Kong, China. Association for Computational Linguis-
tics.
Chen, Z., Song, Y., Chang, T.-H., and Wan, X. (2020). Gen-
erating radiology reports via memory-driven trans-
former. In Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Processing
(EMNLP), pages 1439–1449, Online. Association for
Computational Linguistics.
Endo, M., Krishnan, R., Krishna, V., Ng, A. Y., and Ra-
jpurkar, P. (2021). Retrieval-based chest x-ray report
generation using a pre-trained contrastive language-
image model. In Proceedings of Machine Learning
for Health, volume 158 of Proceedings of Machine
Learning Research, pages 209–219.
Gricad. Infrastructure supported by Grenoble research communities.
Honnibal, M., Montani, I., Van Landeghem, S., and Boyd,
A. (2020). spaCy: Industrial-strength Natural Lan-
guage Processing in Python.
Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S.,
Chute, C., Marklund, H., Haghgoo, B., Ball, R., Sh-
panskaya, K., Seekins, J., Mong, D. A., Halabi, S. S.,
Sandberg, J. K., Jones, R., Larson, D. B., Langlotz,
C. P., Patel, B. N., Lungren, M. P., and Ng, A. Y.
(2019). Chexpert: A large chest radiograph dataset
with uncertainty labels and expert comparison. Pro-
ceedings of the AAAI Conference on Artificial Intelli-
gence, 33(01):590–597.
Jain, S., Agrawal, A., Saporta, A., Truong, S., Duong,
D. N., Bui, T., Chambon, P., Zhang, Y., Lungren,
M. P., Ng, A. Y., Langlotz, C., and Rajpurkar, P.
(2021). Radgraph: Extracting clinical entities and re-
lations from radiology reports. In Thirty-fifth Con-
ference on Neural Information Processing Systems
Datasets and Benchmarks Track (Round 1).
Johnson, A. E. W., Pollard, T. J., Berkowitz, S. J., Green-
baum, N. R., Lungren, M. P., Deng, C.-y., Mark, R. G.,
and Horng, S. (2019). Mimic-cxr, a de-identified pub-
licly available database of chest radiographs with free-
text reports. Scientific Data, 6(1):317.
Li, J., Sun, Y., Johnson, R. J., Sciaky, D., Wei, C.-H., Lea-
man, R., Davis, A. P., Mattingly, C. J., Wiegers, T. C.,
and Lu, Z. (2016). BioCreative V CDR task cor-
pus: a resource for chemical disease relation extrac-
tion. Database, 2016:baw068.
Lin, C.-Y. (2004). ROUGE: A package for automatic evalu-
ation of summaries. In Text Summarization Branches
Out, pages 74–81, Barcelona, Spain. Association for
Computational Linguistics.
Miura, Y., Zhang, Y., Tsai, E., Langlotz, C., and Jurafsky,
D. (2021). Improving factual completeness and con-
sistency of image-to-text radiology report generation.
In Toutanova, K., Rumshisky, A., Zettlemoyer, L.,
Hakkani-Tur, D., Beltagy, I., Bethard, S., Cotterell, R.,
Chakraborty, T., and Zhou, Y., editors, Proceedings of
the 2021 Conference of the North American Chapter
of the Association for Computational Linguistics: Hu-
man Language Technologies, pages 5288–5304, On-
line. Association for Computational Linguistics.
Neumann, M., King, D., Beltagy, I., and Ammar, W. (2019).
ScispaCy: Fast and Robust Models for Biomedical
Natural Language Processing. In Proceedings of the
18th BioNLP Workshop and Shared Task, pages 319–
327, Florence, Italy. Association for Computational
Linguistics.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002).
Bleu: a method for automatic evaluation of machine
translation. In Proceedings of the 40th Annual Meet-
ing of the Association for Computational Linguistics,
pages 311–318, Philadelphia, Pennsylvania, USA.
Association for Computational Linguistics.
Patricoski, J., Kreimeyer, K., Balan, A., Hardart, K., Tao,
J., Anagnostou, V., Botsis, T., Investigators, J. H. M.
T. B., et al. (2022). An evaluation of pretrained bert
models for comparing semantic similarity across un-
structured clinical trial texts. Stud Health Technol In-
form, 289:18–21.
Smit, A., Jain, S., Rajpurkar, P., Pareek, A., Ng, A., and
Lungren, M. P. (2020). Chexbert: Combining auto-
matic labelers and expert annotations for accurate ra-
diology report labeling using bert. In Conference on
Empirical Methods in Natural Language Processing.
Yu, F., Endo, M., Krishnan, R., Pan, I., Tsai, A., Reis, E. P.,
Fonseca, E. K. U. N., Ho Lee, H. M., Abad, Z. S. H.,
Ng, A. Y., Langlotz, C. P., Venugopal, V. K., and Ra-
jpurkar, P. (2022). Evaluating Progress in Automatic
Chest X-Ray Radiology Report Generation. preprint,
Radiology and Imaging.