Knowledge Graph Based Trustworthy Medical Code Recommendations

Mutahira Khalid (1, a), Asim Abbas (2, b), Hassan Sajjad (3, c), Hassan Ali Khattak (1, d), Tahir Hameed (4, e) and Syed Ahmad Chan Bukhari (2, f)

(1) School of Electrical Engineering and Computer Science, NUST, H-12, Islamabad, Pakistan

(2) Division of Computer Science, Mathematics and Science, St. John’s University, Queens, NY 11439, U.S.A.

(3) Faculty of Computer Science, Dalhousie University, Halifax, Canada

(4) Girard School of Business, Merrimack College, North Andover, Massachusetts, U.S.A.

Keywords: Medical Coding, Computer Assisted Coding (CAC), Deep Learning, Attention Mechanism, Symbolic AI, Knowledge Graphs, Ontologies, Explainability.

Abstract: Medical coding assigns standardized alphanumeric codes to diagnoses, procedures, and interventions recorded in patients’ clinical notes. These codes are essential for correct medical claims and billing processes, which are critical in maintaining efficient revenue cycles. Computer-Assisted Coding (CAC) employs AI models to automate medical coding, cutting down human effort and errors. Despite their unrivalled performance, these models lack ‘explainability’, that is, insight into the inner workings and results of black-box deep learning models. Attention mechanisms are the most common approach to ‘explainability’, but they leave some questions unanswered, for instance, the relationship between highlighted words and predictions. Where black-box models fail to answer such questions, ‘Symbolic AI’ such as ‘Knowledge Graphs’ provides a superior alternative. We consolidated the attention mechanism with Symbolic AI to help users understand the results of a deep-learning model for CAC. We evaluated its performance on the basis of strong and weak relationships at the word-to-word and word-to-code levels by employing a semantically enriched Knowledge Graph. We achieved 64% word-to-word and 53% word-to-code level accuracy. This paper is among the earliest on knowledge graphs for explainability in medical coding. It is also the deepest in applying attention-based mechanisms and knowledge graphs to any medical domain.

1 INTRODUCTION

Medical coding assigns standardized alphanumeric medical codes to patients’ diagnoses, procedures, and other healthcare information (Aalseth, 2014). The standardized medical billing codes include the International Classification of Diseases (ICD), Current Procedural Terminology (CPT), and the Healthcare Common Procedure Coding System (HCPCS) (Johnson and Linker, 2015). The diagnosis, procedure, and intervention codes are used for claims and billing management with payers including insurance companies, government agencies like Medicare, and patients. Medical coders manually assign codes to unstructured text in EHRs and clinical notes. Maintaining steady cash flows and revenue cycle management is an ongoing major challenge for healthcare providers such as hospitals, hospices, nursing facilities, and small clinics. Errors and slow turnaround in medical coding are a major cause of lost revenues or delays in accounts receivable for healthcare providers (Alonso et al., 2020).

(a) https://orcid.org/0000-0001-8482-4004

(b) https://orcid.org/0000-0001-6374-0397

(c) https://orcid.org/0000-0002-8584-6595

(d) https://orcid.org/0000-0002-8198-9265

(e) https://orcid.org/0000-0002-6824-6803

(f) https://orcid.org/0000-0002-6517-5261

In recent years, there have been notable advance-

ments to reduce efforts and errors in medical coding.

Computer Assisted Coding (CAC) assists medical

coders by translating clinical notes to medical billing

codes with the help of machine learning and deep

learning models (Campbell and Giadresco, 2020).

The models scan the unstructured textual notes and

predict applicable medical billing codes, which saves

medical coders the time and effort required to review

Khalid, M., Abbas, A., Sajjad, H., Khattak, H., Hameed, T. and Bukhari, S.

Knowledge Graph Based Trustworthy Medical Code Recommendations.

DOI: 10.5220/0011925700003414

In Proceedings of the 16th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2023) - Volume 5: HEALTHINF, pages 627-637

ISBN: 978-989-758-631-6; ISSN: 2184-4305

© 2023 by SCITEPRESS – Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)


long summaries and complex code databases (Catling

et al., 2018). Medical coders can assign from the pre-

dicted codes or still delve deeper to assign their own

codes. Deep Learning models were especially effec-

tive in automating this burdensome task as CAC sys-

tems keep learning from each prediction and assign-

ment of the codes (Moons et al., 2020).

Some of the common models used for CAC

are Recurrent Neural Networks (RNNs), Convolu-

tional Neural Networks (CNNs), Hierarchical At-

tention Networks (HAN), and Long-Short-Term-

Memory (LSTM), etc (Gu et al., 2021). Despite

their accurate and precise results, these deep learning models have some limitations, mainly the absence of ‘explainability’ or ‘transparency’ about their inner workings, which is why deep learning models are called black-box models. The black-box nature of these models makes them less trustworthy and limits their acceptance in real-world applications such as health, finance and law (von Eschenbach, 2021). Explainability has therefore become an urgent need for healthcare information systems. AI-based systems will not be adopted if there is a significant chance of incorrect predictions and no way to interpret and comprehend the basis of such life-critical and business-critical decisions.

No wonder improving transparency of deep learn-

ing models has emerged as important research for the

scientiﬁc community as well as industry (Xu et al.,

2019). Some recent advances include the incorpo-

ration of an attention mechanism (Niu et al., 2021),

in which the important words are assigned atten-

tion weights in the encoding process and later these

weights are used in the decoding process to help the

model focus on a speciﬁc part of the input. Still,

the internal working of the deep learning models re-

mains unclear. In medical coding, the attention mech-

anism incorporated into deep learning models high-

lights some particular terms. It implies that they have

some impact on the prediction of a particular medical

code. However, some issues are brought up, such as:

How are these terms related to the other highlighted

words? What link does this vocabulary have to the

predicted codes? The answer is, regrettably, none.

This limitation can be overcome by adopting sym-

bolic AI, which has given rise to a new era of neuro-

symbolic AI, also known as the third wave of artiﬁcial

intelligence (Garcez and Lamb, 2020).

Symbolic Artiﬁcial Intelligence is also known as

Classical AI or Good Old Fashioned AI (GOFAI)

which was a prominent research area from the 1950s

to the 1990s (Confalonieri et al., 2021). Symbolic AI works by teaching machines the way humans learn, using symbols and symbolic representations of the world. Knowledge Graphs (KG), or Knowledge Bases, are called the brain behind symbolic AI; they are heterogeneous, labelled, and structured multi-graphs (Hogan et al., 2021). Knowledge Graphs contain a

huge network of entities and their relations that could

be used as a reasoning system for causal inference.

KG could be used to open the black box of deep learn-

ing models as they are self-explainable. The consol-

idation of deep learning and KG can lead to accurate

and explainable applications (Hitzler et al., 2020).

We proposed a novel approach of ‘Explainable

Knowledge Graph Creation’ to evaluate the attention

results and provide visualization for the sake of ex-

plainability. (Dong et al., 2021) used a Hierarchi-

cal label-wise attention network (H-LAN) deep learn-

ing model for predicting Medical codes. In this pa-

per, we have customized H-LAN with KG for generic

medical code predictions with higher explainability

and transparency. H-LAN alone predicts multiple la-

bels with attention to particular words and sentences

per label. However, it does not explicate the choice

of speciﬁc words, their combinations, or their rela-

tionships with labels. Our approach predicts ICD-10

medical billing codes, labels, and words with spe-

ciﬁc attention weights. In addition, KGs evaluated

the model performance in predicted labels and high-

lighted words while providing visual connections be-

tween labels. The word-to-word and word-to-label

level explainability exactly follows human cognition

and learning patterns. As a result, medical coders see

the knowledge graphs and are more conﬁdent in mak-

ing their choices of the billing codes from the pre-

dicted labels.

This paper makes several theoretical and empirical contributions. On the theoretical and methodological side, we have extended the use of the H-LAN model in combination with much deeper KGs for enhanced explainability of predicted labels. To that end, we have successfully

demonstrated a method to visualize connections on

word-to-word and word-to-code levels via a knowl-

edge graph at a scale not witnessed before in the med-

ical domain. On the practical side, this is the ﬁrst

paper that has trained and predicted ICD-10 medi-

cal billing codes annotated using pre-trained Clinical

BERT (Surolia, 2022).

To summarise the contributions described above:

1. We ﬁne-tuned a deep learning model from ICD-9

to ICD-10, with an enhanced problem domain.

2. An approach called “Explainable Knowledge Graph Creation” is proposed to make explainable systems more understandable and to overcome the drawbacks of the attention mechanism.

3. A visualization application was made to display word-to-word and word-to-code level links for a reliable and trustworthy medical coding application.

CCH 2023 - Special Session on Machine Learning and Deep Learning for Preventive Healthcare and Clinical Decision Support

Table 1: Current Neuro-Symbolic research with the level of explainability and underlying approaches.

| Related Work | Neuro-Symbolic Approach | Graph Type | Deep Learning Model | Explainability Level |
|---|---|---|---|---|
| (Chai, 2020) | KG embedding as training data | KG | LSTM | None |
| (Gaur et al., 2022) | Shallow infusion | KG | Neural Network | None |
| (Malik et al., 2020) | Added ensemble learning predictions as graph nodes | KG | Ensemble Learning | Low |
| (Drancé, 2022) | KG embedding | KG | GNN | Low |
| (Sheth et al., 2022) | Knowledge infusion | KG | Neural Network | Low |
| (Lu et al., 2022) | Graphs as input of neural network | Bipartite | GNN | Low |
| (Gaur et al., 2021) | Shallow and deep infusion of KG with deep learning | KG | BERT | Moderate |
| (Wang et al., 2019) | KG embedding with bidirectional LSTM | KG | LSTM | High |
| (Teng et al., 2020) | KG + data infused to model | KG | Multi-Layer CNN | Attention Mechanism |
| (Ahmed et al., 2022) | KG embedding with bidirectional LSTM | HyperGraph | LSTM | Attention Mechanism |

To the best of our knowledge, no one has consoli-

dated symbolic AI with an ‘attention mechanism’ for

explainable medical code predictions. The rest of the

paper is organized as follows. We reviewed Neuro-

Symbolic approaches with their explainability level in

section 2. Section 3 describes the materials and pro-

posed methodology. Section 4 contains the results and

analysis. Section 5 concludes the paper along with

some limitations and pointers for future research.

2 BACKGROUND

Explainability is not just a desired characteristic; it is a necessity in fields where human lives are involved, e.g., healthcare, finance, and law. Incorporating transparency in deep learning requires the manipulation of mathematical models by experts (Futia and Vetrò, 2020). Neuro-symbolic AI is an expanding field of study that has attracted a lot of attention recently. To achieve both accuracy and explainability, it integrates symbolic AI and deep learning (Sarker et al., 2021). Interpretability and explainability are

sometimes used interchangeably, but they are funda-

mentally different concepts. Explainability is the abil-

ity of an AI model to defend its predictions, whereas

interpretability is primarily the AI model’s ability to

be transparent about its internal workings (Gaur et al.,

2021). In this section, we’ll go over the research on

the subject of neuro-symbolic AI, the methods for in-

tegrating KG with deep learning models, the level of

explainability they offer, and the methods for KG pro-

duction, their types, strengths, shortcomings, and lim-

itations.

Knowledge Graphs are, by their nature, considered a clean data source. Each fact is stored as a triple that holds a subject, a predicate, and an object. If these graphs are used in conjunction with deep learning models, results can be predicted more accurately.

To diagnose thyroid disease, (Chai, 2020) combined

KGs with a long-short-term memory (LSTM) model

(Hochreiter and Schmidhuber, 1997). (Gaur et al.,

2022) did a shallow infusion of knowledge graphs

with neural networks on mental health data. A

graph based on the use case of subarachnoid haem-

orrhage was constructed using an automated knowl-

edge graph-generating method (Malik et al., 2020).

A dataset of 1000 summaries was procured for graph

generation along with ensemble learning to add rup-

ture probability as nodes.

An emerging concept called “graph embedding” produces the vector representation of graph facts in a manner similar to word embedding. These embeddings may help ensure accurate model outputs. (Lu et al., 2022) combined graphs and neural networks (GNN) for predicting the risk of mental illness. For

the beneﬁt of computer specialists, node embedding

was utilised, along with visualisation, to view the

model in action prior to applying it to the prediction


layer. Similar to this, link prediction algorithms were

used to design a medicine repurposing strategy. The

study established the connection(s) between a chemical and a certain target while maintaining their transparency and interpretability (Drancé, 2022).

The training data is what deep learning models

rely on the most. Lack of domain-speciﬁc informa-

tion results in either no forecasts or false positive re-

sults. Knowledge graphs are enriched data sources

that can assist in ﬁnding a solution to this problem.

In numerous use cases, knowledge was infused both

superﬁcially and deeply (Gaur et al., 2021). On the

one hand, a self-supervised BERT model was given

a shallow infusion of domain knowledge on a drug-

abuse ontology (Devlin et al., 2018), mostly to help

the model comprehend the context. On the other

hand, shallow and deep knowledge graph infusions

were carried out in educational settings in an effort to

comprehend a student’s performance and identify his

or her poor domain knowledge regions. For clarity

in this situation, certain low-level visuals were cre-

ated. (Khalid et al., 2022) enriched medical sum-

maries with knowledge graphs to improve model pre-

dictions. Another comparable method was devised to

improve the accuracy of the classiﬁers for which they

used process-knowledge infusion. It utilised psycho-

metric questionnaires (PHQ-9) and process knowl-

edge (Sheth et al., 2022).

Some recent research has been done on knowl-

edge consolidation with inputs. The objective is to

show enriched knowledge via attention mechanisms

to enhance the level of explainability. (Teng et al.,

2020) proposed an approach named “G-coder” in

which a multi-layer CNN was employed with an at-

tention mechanism. The results included a knowl-

edge graph mapping the ICD-9 description with Free-

base ontology data that had 1560 nodes and more than

20,000 relations. The enriched knowledge graph was

combined with the attention mechanism to make the

terminologies and coding results interpretable. The

model performed well for the prediction of the top

50 codes, but the explainability remained restricted

to the attention mechanism. A graph attention em-

bedding method was employed in research on the

identiﬁcation of depression symptoms. A hypergraph

was made using a psychometric questionnaire (PHQ-9) and patient-written text, which then allowed embeddings to be created. An Internet-Delivered Psychological Treatment (IDPT) was developed employing a bidirectional LSTM (Graves et al., 2005) with an attention mechanism to help people deal with depression

while using fewer resources (Ahmed et al., 2022). At-

tention was applied both on the node level and on the

edge level.

Knowledge graphs were only occasionally used

in techniques to increase explainability. (Wang

et al., 2019) created a “Knowledge-aware Path Recurrent Network (KPRN)” that made suggestions using

knowledge graphs. The networks and connections

that exist between various things in the graphs can be

used to comprehend not only user preferences but also

the semantics of entities and relationships. Addition-

ally, it provided explainable predictions.

In conclusion, the publications examined in this

section combined knowledge graphs with either ma-

chine learning models or deep learning techniques.

They share a few characteristics. Prior to model train-

ing, KGs were used in the majority of these stud-

ies with the aim of either incorporating domain in-

formation to enhance model performance or display-

ing the enriched knowledge as an output in the atten-

tion layer. The limited explainability is not the true

essence or intention of knowledge graphs.

Some researchers (Wang et al., 2019; Xian et al., 2020; Spillo et al., 2022) used graphs to explain the outputs, but they worked outside the medical field, e.g., on movies and music, where ‘named entity recognition’ (NER) tools, datasets, and trained corpora are available. None

of the aforementioned studies attempted to offer ex-

plainability at the level of prediction. Neuro-symbolic

AI was not used to explain either the suggestions or

the attention mechanisms that were predicted by the

model. Table 1 summarizes the ﬁndings from our re-

view.

3 DATA AND METHODS

3.1 Data Preparation and Acquisition

‘Medical Information Mart for Intensive Care’ (MIMIC-III) is a large and freely accessible database (Johnson et al., 2016). It contains records of more than 40k patients admitted to critical care units of ‘Beth Israel Deaconess Medical Center’ between 2001 and 2012. Each admission record contains data on demographics, diagnosis, vitals, lab measurements, and survival along with discharge summaries. The summaries contain history, primary diagnosis, and much more.

MIMIC-III data is annotated with ICD-9 codes. We employed a Hugging Face model and annotated nearly 5k discharge summaries with ICD-10 codes since the latter is more prevalent now. Hugging Face is an AI community containing thousands of freely accessible datasets and models (Delangue, 2016). We acquired a pre-trained Clinical BERT (Surolia, 2022) for annotation. The model output was tested and verified by a medical coder.

Table 2: Comparison between pretrained and fine-tuned HLAN model.

| Dataset | Train | Validation | Test | Label Count | Score calculation | F-1 Score |
|---|---|---|---|---|---|---|
| MIMIC-III (pretrained) | 8066 | 1573 | 1729 | 50 | Top-50 | 64.1% |
| MIMIC-III (pretrained) | 4574 | 153 | 322 | 20 | Top-20 | 74.6% |
| MIMIC-III (fine-tuned) | 3266 | 800 | 800 | 550 | Top-5 | 67.2% |

Knowledge Graphs are attractive to computer sci-

entists in all domains, especially to researchers in the

medical domain to address the need for explainability.

Ontologies are expert-created rules that have abun-

dant information which can be leveraged to create

domain-specific knowledge graphs. Freebase, BioPortal, UMLS, and many other open-source ontologies are used for graph creation. We employed BioPortal ontologies as a rich data source (Noy et al., 2009). BioPortal is the world’s most comprehensive repository of biomedical ontologies.

The BioPortal ontologies support recommenda-

tions based on dataset description, term search across

multiple ontologies, annotation of medical concepts

with ontology terms, and much more. BioPortal

is a hub of 1000+ ontologies, having 14,427,459

classes and 79,636,946 mappings. The ontologies

are present in different formats such as Resource

Description Framework (RDF), Web Ontology Lan-

guage (OWL), Extensible Markup Language (XML),

and Comma-Separated Values (CSV). We mapped the

model-predicted weighted medical concepts and the

description of medical codes with the ontologies to

build a connected ﬁnal medical knowledge graph.

The medical terms are matched with the ontology

classes, their definitions, synonyms, and hierarchy up to

level 5. Neo4j, a graph database management system,

was used for the construction of Knowledge graphs.

BioPortal REST API was employed for information

retrieval.
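The retrieval step can be pictured with a minimal sketch. It assumes the public BioPortal search endpoint `https://data.bioontology.org/search` with its `q` and `apikey` parameters, and a response whose `collection` entries carry `prefLabel`, `synonym`, and `definition` fields. The helper names, the placeholder key, and the sample response are hypothetical and written by hand for illustration; the paper does not show its own retrieval code.

```python
import json
from urllib.parse import urlencode

# Assumption: BioPortal's public REST endpoint for term search.
BIOPORTAL_SEARCH = "https://data.bioontology.org/search"


def build_search_url(term, api_key):
    """Build a term-search URL; `q` is the query and `apikey` authenticates."""
    return BIOPORTAL_SEARCH + "?" + urlencode({"q": term, "apikey": api_key})


def extract_matches(response_text):
    """Pull label, synonyms and definitions out of a search response."""
    payload = json.loads(response_text)
    matches = []
    for item in payload.get("collection", []):
        matches.append({
            "label": item.get("prefLabel"),
            "synonyms": item.get("synonym", []),
            "definitions": item.get("definition", []),
        })
    return matches


# Hand-written sample standing in for a live response (no network call here).
SAMPLE_RESPONSE = json.dumps({
    "collection": [
        {"prefLabel": "Anastomosis",
         "synonym": ["Surgical anastomosis"],
         "definition": ["A surgical connection between two structures."]}
    ]
})

url = build_search_url("anastomosis", "YOUR_API_KEY")
matches = extract_matches(SAMPLE_RESPONSE)
```

In a live setting the URL would be fetched with an HTTP client and the response body passed to `extract_matches`; the parsing is kept separate so it can be checked offline.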

3.2 Explainable Knowledge Graph for

Medical Coding

3.2.1 Problem Formulation

Medical coding is a multi-label text classiﬁcation

problem in which text information is translated into

medical codes, an extremely laborious task. Accord-

ing to one estimate, four out of ﬁve generated medical

billing codes are erroneous (Tate, 2017), which has

revenue implications for both payers and providers.

AI-enabled CAC models predict medical codes with

higher precision. However, they are confronted with

acceptance challenges due to the lack of transparency

and explainability.

We propose a novel approach for an explainable

knowledge graph. Figure 1 depicts the workﬂow of

the proposed approach which consists of four mod-

ules; ICD-10 prediction model, semantic enrichment,

semantic knowledge consolidation, and explainable

knowledge graph creation. The last module is further

divided into two parts i.e. word-to-word and word-to-

code level connections. The following sub-sections

elaborate on these modules at some length.

3.2.2 ICD-10 Code Prediction

After testing and reproducing results from multiple

CAC models (Mullenbach et al., 2018; Desai, 2020;

Biswas et al., 2021), we selected a baseline model

titled “Hierarchical Label-Wise Attention Network”

(Dong et al., 2021) for ICD-10 billing codes predic-

tions. The architecture of HLAN and its use of the

attention mechanism for explainable code prediction

led to its selection. The model was trained for multi-

label classiﬁcation on ICD-9 (9th version) on top-50

ICD-9 and top-20 ICD-10 codes. Apart from the pre-

diction of medical codes, an attention mechanism was

also applied at both word and sentence levels. We

mainly reconﬁgured the model for ICD-10 predic-

tions from the original ICD-9 predictions. We used

11x more labels for predictions mainly to enhance the

problem domain. We also annotated the MIMIC-III

dataset with ICD-10 codes by employing the Hug-

ging Face model, later tested and veriﬁed by an ex-

perienced professional medical coder. The reason for

not using the pre-trained model for our ‘Explainable

Knowledge Graph Creation Approach’ is a lack of at-

tention mechanism and the issue of scalability.

Around 4.8k summaries were annotated with

nearly 22k ICD-10 labels where the unique label

count was 550. An average annotation count per sum-

mary was 4 codes. We also developed the hierarchi-

cal structure for even better results. Only diagnos-

tic codes were procured for model training. We di-

vided the dataset for training, testing and validation.

3266 summaries were used for training, 800 for test-

ing, and 800 for validation. The model trained for

550 labels was somewhat less accurate but that was

expected due to an increase in the label count. Table 2 shows the comparison between the pretrained HLAN model and the fine-tuned HLAN model with respect to the dataset size and F1 score.

Figure 1: Explainable Knowledge Graph Creation Approach.
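To make the word-level, label-wise attention used in Section 3.2.2 more concrete, the following is a toy sketch. It is not the HLAN implementation: plain dot-product scoring stands in for the model's scoring function, which the text does not spell out, and the word vectors, label context vector, and function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_attention(word_vectors, label_context):
    """Attention weight of each word for one label (dot-product scoring)."""
    scores = [sum(w * c for w, c in zip(vec, label_context))
              for vec in word_vectors]
    return softmax(scores)


def top_words(words, weights, k=2):
    """Return the k words with the largest attention weights."""
    ranked = sorted(zip(words, weights), key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:k]]


# Toy inputs: hypothetical 2-d word vectors and one label context vector.
words = ["patient", "flap", "graft", "stable"]
vectors = [[0.1, 0.0], [0.9, 0.8], [0.7, 0.9], [0.0, 0.2]]
context = [1.0, 1.0]

weights = label_attention(vectors, context)
highlighted = top_words(words, weights)
```

Each label has its own context vector, so the same summary can yield a different set of highlighted words per predicted code. Those highlighted words are what the knowledge graph later tries to connect.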

3.2.3 Semantic Enrichment

Semantics plays an important role in understanding

the context and boundary of particular words (Abbas

et al., 2019; Abbas et al., 2021). In our case, we need

to comprehend the meaning of particular attention

words and their connections with each other and with

the predicted labels. The highlighted words from the

patient discharge summary and the description of the

medical billing codes were the inputs for semantic en-

richment. We leveraged the ICD10-cm library to ex-

tract predicted code descriptions. Our pre-processing

involved stop-word removal (the, is, am, what, etc.),

stemming and lemmatization, and removal of duplicate words. N-grams with n=4 were used to find all possible combinations of words, which were mapped onto the 1000+ BioPortal ontologies through the BioPortal REST API to get as much information as possible. Words were enriched with synonyms, definitions, and parent (hierarchy) levels 1 through 5. Such rich information can provide a deeper level of explainability for the end user.
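The pre-processing and n-gram steps can be sketched as follows. This is a simplified illustration under stated assumptions: the stop-word list is a tiny hypothetical sample, stemming and lemmatization are left out, and "n=4" is read here as "all contiguous word sequences of length 1 to 4", which is one plausible interpretation of the text.

```python
# Assumption: a tiny illustrative stop-word list; a real one would be longer.
STOP_WORDS = {"the", "is", "am", "what", "of", "a", "an", "and", "with"}


def preprocess(text):
    """Lowercase, drop stop words, and remove duplicates (order preserved)."""
    seen = set()
    tokens = []
    for token in text.lower().split():
        token = token.strip(".,;:()")
        if token and token not in STOP_WORDS and token not in seen:
            seen.add(token)
            tokens.append(token)
    return tokens


def ngrams_up_to(tokens, max_n=4):
    """All contiguous word sequences of length 1..max_n."""
    grams = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


tokens = preprocess("Malignant neoplasm of the upper lobe, the lung")
candidates = ngrams_up_to(tokens)
```

Each candidate string would then be looked up in the ontologies, so multi-word concepts such as "malignant neoplasm" can match as a unit rather than word by word.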

3.2.4 Semantic Knowledge Consolidation

Different types of knowledge relevant to attention weights and description words were consolidated by removing the repetitive synonyms, definitions, and hierarchy descriptions. We also separated the nodes by their types, for instance, patient summary, medical code, medical code desp words, model attention words, synonyms, definitions, parent 1, parent 2, parent 3, parent 4, and parent 5. Similarly, the relationships and their types were separated and duplications were removed. The types of relationships were synonyms, definitions, highlighted words, description words, parent level 1, parent level 2, parent level 3, parent level 4, parent level 5, and connected.

3.2.5 Explainable Knowledge Graph Creation

An automated approach was used to create the knowl-

edge graph which requires a Cypher query to be cre-

ated in the Neo4j platform. See (Khalid et al., 2022)

for similar earlier work. Knowledge graph generation

is a computation-heavy task with run-time creation

even more complex due to the diversity and complex-

ity of data. A speciﬁc graph is created for each sum-

mary as it is entered for ICD-10 code recommenda-

tions in the deep learning model. The relationships

between the graph nodes help us understand the con-

text and semantics in general, but the ‘connected re-

lationship’ speciﬁcally focuses on working of the at-

tention mechanism and model prediction. Attention

weights are assigned to words in the medical billing

codes prediction based on similarity. We have sim-

pliﬁed the graph visualization for the users by pro-

viding two different types of explainability, word-to-

word connections, and word-to-code connections.
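As a rough illustration of how graph creation could be automated, the sketch below generates Cypher `MERGE` statements from typed nodes and relationships. The node labels, the relationship type, and the helper names are hypothetical, and the authors' actual query is not given in the text.

```python
def escape(value):
    """Escape backslashes and single quotes for a Cypher string literal."""
    return value.replace("\\", "\\\\").replace("'", "\\'")


def node_statement(label, name):
    """MERGE creates the node only if it does not already exist."""
    return f"MERGE (:{label} {{name: '{escape(name)}'}})"


def relationship_statement(src, rel_type, dst):
    """Link two existing nodes; each of src/dst is a (label, name) pair."""
    (src_label, src_name), (dst_label, dst_name) = src, dst
    return (
        f"MATCH (a:{src_label} {{name: '{escape(src_name)}'}}), "
        f"(b:{dst_label} {{name: '{escape(dst_name)}'}}) "
        f"MERGE (a)-[:{rel_type}]->(b)"
    )


# Hypothetical labels and relationship type, chosen for illustration only.
code = ("MedicalCode", "I10")
word = ("AttentionWord", "hypertension")

statements = [
    node_statement(*code),
    node_statement(*word),
    relationship_statement(word, "CONNECTED", code),
]
```

Using `MERGE` rather than `CREATE` keeps the graph free of duplicate nodes and edges when the same concept is reached through several enrichment paths. In practice, query parameters are generally preferable to building string literals by hand.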

3.2.6 Visualisation Application

Graphs are extremely useful for comprehending the predictions of deep-learning models. But explainability is more complex and computationally intensive than most tasks, even for deep-learning models. The actual knowledge graphs created for patient discharge summaries contain thousands of nodes and relationships, making them difficult to analyze just by looking at them. That much visual information overloads a medical coder instead of supporting them in the task. We have generated Knowledge Graphs at the word-to-word and word-to-code levels taking human cognitive behaviour into account, mainly visual cues in terms of object size, bolded shapes, colour differences, and colour attribution to events. The graphs can be searched for user-entered words or codes, and they can also be restricted or filtered to visualize direct and strong connections only. The actual graph was queried using the Cypher language with the help of the Trinity library. Neo4j is an extremely good platform for visualization, but it does not offer automatic manipulation of the size and the colour of the nodes. The simplified visualization was done using the kglab library. Path detection algorithms such as depth-first search and breadth-first search in Neo4j were applied to find the shortest path. However, these algorithms performed poorly due to the diversity of the created graphs, so we used simple Cypher functions to get a path between chosen words or labels.

Figure 2: Semantic Enrichment Results.
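The idea of finding a path between two chosen words can be illustrated with a small breadth-first search over an adjacency list. This is a generic sketch of the technique named in Section 3.2.6, not the Cypher functions the authors ended up using; the graph contents and intermediate node names are hypothetical.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the node list of a shortest path or None."""
    if start == goal:
        return [start]
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour in previous:
                continue
            previous[neighbour] = node
            if neighbour == goal:
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


# Toy undirected graph; the intermediate node names are hypothetical.
graph = {
    "flap": ["graft", "tissue transfer"],
    "graft": ["flap"],
    "tissue transfer": ["flap", "surgical connection"],
    "surgical connection": ["tissue transfer", "anastomosis"],
    "anastomosis": ["surgical connection"],
}

path = shortest_path(graph, "flap", "anastomosis")
```

A path of length one corresponds to a direct connection, while longer paths correspond to indirect connections that pass through enrichment nodes such as synonyms or parents.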

4 RESULTS AND ANALYSIS

We achieved a 67.2% F1 score on the top 5 codes in

model training (800 summaries for testing and 800

for validation). Nearly 100 discharge summaries were

passed to the model and the predicted labels and high-

lighted words were procured. Each summary consists

of an average of 25 medical concepts comprising at-

tention words and in-code descriptions. The model

predicted an average of three medical codes (ICD-

10) for each summary. The ‘Explainable Knowledge

Graph Creation Approach’ was applied to the model

results.

The semantic enrichment module extracted the biomedical concepts and their synonyms, definitions, and hierarchies up to level 5 from BioPortal ontologies. The semantic information was then consolidated. We obtained 736 synonyms per summary on average. There were 159 definitions, 562 nodes for Hierarchy 1, 473 for Hierarchy 2, 379 for Hierarchy 3, 327 for Hierarchy 4, and 284 for Hierarchy 5, as shown in Figure 2.

The explainable knowledge graph created for each

summary was a collection of 2900 nodes and 3340

relationships on average. A graph of this size can

be effectively used for reasoning and explainabil-

ity. To that end, the model’s performance was eval-

uated based on strength of connections. The pro-

posed semantic enrichment (synonyms, deﬁnitions,

hierarchy 1 to hierarchy 5) process plays a vital role

in identifying word-to-word and word-to-code level

connections or relations. They were made either

through string matching or based on the semantic rel-

evance of the medical concepts.

During the processing of 100 summaries, the

model highlights the words that contributed to the pre-

diction of ICD-10 codes. After scrutinizing these re-

sults for word-to-word level connections, we found

an average number of 176 connections based on syn-

onyms and 75 connections owing to the deﬁnition per

summary, as shown in Figure 3. Similarly, we observed that the hierarchical levels of semantic information also play a crucial role in word-to-word level connections. Hierarchy 1 produces an average of 39 connections or relations, which is much more than hierarchy 2 to hierarchy 5. The absence or scarcity of connections on some hierarchy levels is due to the nature of BioPortal ontologies, which is not critical at the word-to-word level.


Figure 3: Word-to-Word and Word-to-Code Level Connections.

While analyzing the word-to-code level relations,

we found an average of 52 connections based on

synonyms and 27 connections based on deﬁnitions.

Moreover, we found hierarchy levels 1 and 2 add

more to the word-to-code level relations in comparison to hierarchy levels 3, 4, and 5, as shown in Figure 3. The connections are fewer than at the word-to-word level, which is expected because the average number of words per discharge summary is 21 but only 4 for code descriptions. Hierarchy 5 showed zero connections at both levels but is included to show the differences and the reduction in the number of connections.

Subsequently, we evaluated the performance of the deep learning model in terms of its predictions and attention mechanism results. The relationships between labels and words and between words and words were assessed based on their strength. Experts marked each relationship as weak or strong. More than 30 connections were taken as a strong connection and fewer than 10 as a weak connection.
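The thresholds above can be written as a small labelling rule. This is a minimal sketch: the function name `connection_strength` is hypothetical, and the handling of zero connections ("none") and of counts from 10 to 30 ("unspecified") is an assumption, since the text does not state how those cases were labelled.

```python
def connection_strength(n_connections: int,
                        strong_above: int = 30,
                        weak_below: int = 10) -> str:
    """Label a relationship by its number of connections.

    More than 30 is 'strong' and fewer than 10 is 'weak', as stated in
    the text. 'none' and 'unspecified' are assumed labels for cases the
    text leaves open.
    """
    if n_connections < 0:
        raise ValueError("n_connections must be non-negative")
    if n_connections == 0:
        return "none"          # assumption: no connection at all
    if n_connections > strong_above:
        return "strong"
    if n_connections < weak_below:
        return "weak"
    return "unspecified"       # assumption: 10 to 30 is not defined in the text


for n in (0, 9, 10, 30, 31):
    print(n, connection_strength(n))
```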

The procured 100 summaries contain nearly 2500 medical concepts. Manually assessing all of them for strong and weak connections is not feasible. We therefore used a small sample of 40 randomly chosen medical codes and nearly 100 words (from nearly 10 summaries) to retrieve results. Out of the 100 words, 64 had a connection status (strong, weak, or none) with their respective words that agreed with the experts’ marking. Similarly, 21 out of the 40 labels were correct. Eq. 1 has been used to measure the connection accuracy at the different levels, where accuracy is equal to the number of correct relation instances divided by the total number of relations. Hence, we measure accuracy in terms of strong and weak connections. The model achieved 64% accuracy for word-to-word level connections and 53% accuracy for word-to-code level connections.

Accuracy = Correct Relations / Total Relations    (1)
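Eq. 1 can be checked against the counts reported above (64 of 100 words and 21 of 40 labels). The snippet below is only a worked example of that arithmetic, and the function name `accuracy` is a hypothetical helper.

```python
def accuracy(correct_relations: int, total_relations: int) -> float:
    """Eq. 1: correct relations divided by total relations."""
    if total_relations <= 0:
        raise ValueError("total_relations must be positive")
    return correct_relations / total_relations


word_to_word = accuracy(64, 100)   # counts reported in the text
word_to_code = accuracy(21, 40)

print(f"word-to-word: {word_to_word:.1%}")  # 64.0%
print(f"word-to-code: {word_to_code:.1%}")  # 52.5%
```

The word-to-code value of 52.5% corresponds to the 53% reported in the text after rounding.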

We found that relationships (or connections) between words can be direct or indirect. For instance, the biomedical concept ‘flap’ has a direct association with the word ‘graft’ but only an indirect relationship with the biomedical concept ‘anastomosis’. We assigned a half count to indirect relationships and a full count to direct ones. For instance, if ‘flap’ and ‘anastomosis’ are connected with 4 nodes in between, we would count it as 2 (in terms of equivalence to direct relationships). The results of our approach depend entirely on the model output. As we trained our model on generic ICD-10 codes, and not on the specific top-50 labels, somewhat low accuracy was expected. The average number of words in the accuracy calculation was around 200.
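One reading of the half-count rule that is consistent with the worked example (4 nodes in between counted as 2) is to count a direct link as 1 and an indirect link as 0.5 per intermediate node. The sketch below follows that reading. The graph, the intermediate node names `n1` to `n4`, and both helper functions are hypothetical and do not come from the paper's knowledge graph.

```python
from collections import deque


def nodes_in_between(graph: dict, source: str, target: str):
    """Intermediate nodes on a shortest path (BFS), or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])  # (node, number of edges from source)
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return dist       # dist edges to node means dist nodes in between
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def equivalent_count(between) -> float:
    """Full count for a direct link, half a count per intermediate node."""
    if between is None:
        return 0.0                # assumption: unreachable pairs contribute nothing
    return 1.0 if between == 0 else 0.5 * between


# Toy undirected graph; n1..n4 stand in for unnamed intermediate concepts.
toy_graph = {
    "flap": ["graft", "n1"],
    "graft": ["flap"],
    "n1": ["flap", "n2"],
    "n2": ["n1", "n3"],
    "n3": ["n2", "n4"],
    "n4": ["n3", "anastomosis"],
    "anastomosis": ["n4"],
}

print(equivalent_count(nodes_in_between(toy_graph, "flap", "graft")))        # 1.0
print(equivalent_count(nodes_in_between(toy_graph, "flap", "anastomosis")))  # 2.0
```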

CCH 2023 - Special Session on Machine Learning and Deep Learning for Preventive Healthcare and Clinical Decision Support

5 CONCLUSIONS, DISCUSSION, AND FUTURE WORK

The black-box nature of deep learning models hinders end-users from trusting the predictions and decision support offered by AI systems. This is especially true for medical and other critical fields. The lack of trust makes sense given the risks to human lives, health and costs. Recent advancements in ‘attention mechanism’ based explainability are helping alleviate such trust issues by elucidating the inner workings of black-box deep learning models. However, the user is still not made aware of how a highlighted term is related to other terms and to the predicted labels.

This paper has demonstrated a novel deep learning approach titled ‘Explainable Knowledge Graph Creation’ to introduce explainability in computer-assisted medical coding (CAC). It has not only predicted applicable medical codes for inpatient discharge summaries, but it has also generated corresponding knowledge graphs (KGs) that help users review the basis of the predictions. The generated KGs are broad and deep, yet they are configurable so that users can view the relationships between different concepts found in patient summaries based on their strengths. Strong and weak word-to-word and word-to-code level connections make the KGs valuable to users in understanding and verifying the predictions. Visualization brings the approach closer to the process of knowledge creation and understanding. The proposed approach draws on reliable medical ontologies and medical coding databases. While AI-based automation finds the most applicable medical codes, the attention mechanisms and knowledge graphs build user trust in the automatically predicted codes. Finally, the proposed system learns from previous predictions, gradually improving its performance. To the best of our knowledge, our approach is among the earliest to use knowledge graphs for explainability in medical coding. It also goes the deepest so far in incorporating explainability in any medical domain.

Multiple practical use cases exist for this approach, mainly in professional services that rely on unstructured knowledge bases and ontologies, such as medical coding, accounting, auditing and legal services. For example, as shown in this paper, medical coders and medical claims auditors can be provided with automatically predicted codes, which they can accept or reject with higher confidence thanks to the incorporated explainability. Similarly, the accounting, tax and legal professions rely on extensive textual knowledge bases as well as client-side text documents that should be coded against specific clauses in accounting manuals, tax codes or legal clauses. This approach can help build trustworthy recommenders in these areas.

Despite its promising results, our research has certain limitations. First, accuracies of 64% and 53% were achieved at the word-to-word and word-to-code levels, respectively. These are low, but they can be considered encouraging given that this is an early paper in this direction. Even though the model will learn and improve its accuracy in predicting codes and identifying relationships over time, the performance of the Explainable Knowledge Graph Creation approach depends on the accuracy of the outputs of the deep learning model. If the model’s performance is poor, it will directly reduce the accuracy of our approach. Secondly, the enriched knowledge graph is limited to certain nodes and relationships and excludes some entities and detailed domain knowledge that could have led to even better results and explainability. A third limitation comes

from the training and testing datasets of patient summaries. MIMIC-III is limited to critical care clinical notes and patient discharge summaries, where the focus is on stabilizing the patient rather than on long-term prognosis, so comorbidities and other issues might not be deeply addressed by the attending physicians.

We plan to address the aforementioned limitations in our future research. To improve model performance, a deep infusion of knowledge graphs into deep learning could increase the overall accuracy on the multi-label classification problem. To enhance the reliability and accuracy of predictions, future research should employ other medical ontologies with deeper knowledge of the domain, further improving the understanding and visualization of explainability. Using broader all-cause hospital admission datasets is also recommended. All of the above steps would go a long way toward opening the black box of deep learning CAC models.
