Enhancing Data Quality and Semantic Annotation by Combining
Medical Ontology and Machine Learning Techniques
Zina Nakhla¹ and Manel Sliti²
¹Université de Tunis, Institut Supérieur de Gestion, BESTMOD Laboratory, Tunis, Tunisia
²Université de Manouba, Institut Supérieur des Arts et Multimedia de Manouba, Manouba, Tunisia
Keywords: Interoperability, EHR, Ontology, Machine Learning, NLP.
Abstract: Effective management of electronic health records (EHR) is a major challenge in the modern healthcare sector. Despite technological advances, the interoperability of medical data remains a crucial challenge, manifested in the diversity of data formats, the presence of multiple standards, and the heterogeneity of Information Technology (IT) systems used in healthcare establishments. This diversity of IT systems and the complexity of medical terminologies often make data interoperability and semantic annotation in the healthcare domain difficult. To address this challenge, our study proposes an innovative approach to standardize the representation of medical concepts, automate the detection of medical abbreviations, and improve the contextual understanding of medical terms. We developed an ontological model to harmonize the representation of medical data, thus facilitating their exchange and integration between different health systems. In parallel, we used advanced machine learning techniques for the automatic detection of medical abbreviations in medical texts and applied Natural Language Processing (NLP) to improve the contextual understanding of medical terms. The results of our study demonstrate the effectiveness of our approach in solving challenges related to medical data management. By combining these advanced techniques, our approach helps overcome barriers to medical data interoperability and paves the way for better healthcare system integration and improved patient care.
1 INTRODUCTION
An Electronic Health Record (EHR) is a digital
version of a patient’s paper chart (Sachdeva and
Bhalla, 2022). EHR systems facilitate the collection,
storage, and sharing of patient information in a
structured manner to enhance healthcare delivery and
clinical decision-making. They are designed to
provide real-time, centralized, and secure medical
information accessible to authorized users. While an
EHR contains a patient’s medical history, diagnoses,
prescriptions, and other clinical data, its role goes
beyond mere data archiving. It allows access to
evidence-based tools that help healthcare
professionals optimize their medical decisions
(Fennelly and Moroney, 2024).
One of the key features of an EHR is that health
information can be created and managed by
authorized providers in a digital format capable of
being shared with other providers across more than one healthcare organization. EHRs are built to share information with other healthcare providers and organizations, such as laboratories, specialists, medical imaging facilities, pharmacies, emergency facilities, and workplace clinics, so they contain information from all clinicians involved in a patient's care. However, sharing the needed information is a very complicated problem.
Doctors and specialists struggle to collect distributed data spread over different locations and suffer from the lack of interoperability among healthcare information systems. Moreover, the healthcare domain produces a huge quantity of data from various disparate sources: a single patient's data can be dispersed over diverse EHRs with various representations (Begoyan, 2007).
Data in EHRs can be presented in many different formats: structured data such as
databases (Schloeffel et al., 2006), unstructured data such as documents and images (Kiourtis et al., 2017), and semi-structured data such as XML files (Mylka et al., 2012).
The EHR interoperability problem refers to the
ability of different systems to seamlessly exchange,
interpret, and utilize patient data across various
healthcare providers and settings. This challenge
arises from differences in data formats, adopted
standards, and proprietary systems, leading to
inefficiencies, medical errors, and fragmented care.
Interoperability can be categorized into three levels. First, technical interoperability ensures the physical connection between systems and data transfer. Second, syntactic interoperability enables data exchange through standardized formats such as HL7 and XML (Sartipi and Dehmoobad, 2008). Third, semantic interoperability ensures a uniform understanding of exchanged data by using standardized medical terminologies. Semantic interoperability is essential for improving clinical decision-making, enhancing care coordination, and ensuring that all healthcare professionals have access to consistent and reliable medical information. To
address these challenges, several standards and
terminologies have been developed. Among the standards enabling structured clinical content exchange are Health Level Seven (HL7), Digital Imaging and Communications in Medicine Structured Reporting (DICOM SR) (Begoyan, 2007), ISO EN 13606 (Costa et al., 2011), openEHR (Kalra, 2006; Schloeffel et al., 2006; Da Costa, 2019; Roehrs et al., 2018; Begoyan, 2007), and GEHR (Celesti et al., 2016). Nonetheless, semantic interoperability cannot be achieved without the adoption of standardized medical terminologies, such as the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT), the most comprehensive medical terminology system used worldwide. SNOMED CT enables precise encoding of clinical information, facilitating medical document annotation, clinical decision support, and EHR interoperability.
Despite continuous efforts, limitations persist,
particularly concerning the variety of medical data
formats, variability of collection protocols,
confidentiality concerns, and the absence of uniform
standards for semantic tagging. These challenges
hinder the automatic understanding of medical data,
slowing down interoperability advancements. To
overcome these limitations, this study proposes an
innovative approach that integrates ontologies,
machine learning, and Natural Language Processing
(NLP). This combination allows for standardizing
medical concept representation, automatically
detecting medical abbreviations, and improving the
contextual understanding of medical terms.
Unlike previous approaches, our method
introduces a novel integration of structured ontology-
based representations with advanced machine
learning and NLP models, enhancing data
standardization, medical entity recognition, and
abbreviation expansion. This paper details each phase
of our proposed approach, demonstrating how it helps
overcome barriers to medical data interoperability
and facilitates seamless integration within healthcare
systems, ultimately improving patient care.
This paper presents an integrated approach to solving challenges related to medical data interoperability by combining ontology, machine learning, and NLP. We discuss in detail the different phases of our approach; by combining these advanced techniques, it helps overcome barriers to medical data interoperability and paves the way for better healthcare system integration and improved patient care. The rest of this paper is structured as follows. Section 2 presents the related work and previous studies. Section 3 describes the dataset used. The proposed architecture with its internal phases is described in Section 4. Section 5 contains the results of the experiments and discusses our evaluation of the proposed solution. Section 6 presents a comparison study. Finally, Section 7 presents the conclusion and future work.
2 RELATED WORK
One of the most consistent themes across the
literature is the potential of EHR to revolutionize
multiple aspects of healthcare. Many of these papers describe how an EHR system can improve how patients are diagnosed and treated, and thereby improve healthcare (Gunter and Terry, 2005). McClanahan
describes how quick access to patient information
through a universal EHR system can save the lives of
thousands of emergency room patients each year by
reducing medical errors (McClanahan, 2008).
Santos et al. explained the importance of EHR due
to its ability to integrate various user interfaces and
programs. While their findings are promising, they do
not fully address the significant infrastructural and
financial challenges of integrating diverse systems at
a national or international level. Many of the
proposed solutions are theoretical and lack real-world
validation, which raises questions about their
scalability and long-term feasibility (Santos et al,
2010).
Berges et al. explore the use of ontologies in
improving interoperability in heterogeneous EHR
systems. First, they used a canonical ontology in which EHR-related concepts are described in terms of their meanings and perspectives. Second, they developed modules that enable the acquisition of rich ontological descriptions of EHR data, guided by constraint models from medical knowledge frameworks. Third, they considered the mapping axioms needed between the concepts being related. While their model demonstrates the potential
update. While their model demonstrates the potential
for more structured and accurate data representation,
it overlooks the practical challenges of implementing
such models across diverse healthcare environments.
The authors' approach to aligning heterogeneous
EHR descriptions is promising but requires further
exploration into how these mappings can be
dynamically updated in real-time across various
clinical settings (Berges et al, 2011).
Ferrer et al. represented and integrated patient
data from many different heterogeneous data sources
and encouraged the integration of patient data into a
rule-based Clinical Decision Support System
(CDSS). Their work represents a significant step in bridging standards such as ISO/CEN 13606 and HL7 through the openEHR approach. They proposed a Personal Health Record (PHR) combining openEHR and HL7, supported by a service-oriented information exchange system (González-Ferrer et al., 2012).
Liyanage et al. proposed a method for improving
semantic interoperability in healthcare by using
ontologies. They present an ontology toolkit that
facilitates the development of ontologies for chronic
disease management using heterogeneous health data.
Nevertheless, this approach is useful but limited by its focus on specific diseases, and it may not be easily generalized to other domains (Liyanage et al., 2015).
Also, Kiourtis et al. proposed a generic ontology-based semantic architecture to solve the EHR interoperability problem. This architecture uses an ontology language to transform heterogeneous medical data into a generic schema, called CHL, that can be used to represent health data in a way that is consistent and understandable to machines. The CHL
also makes it possible to merge different medical
ontologies with similar relationships. The authors
also pointed out that this method is not lossless, as
data transformation can lead to loss of information.
This is a significant limitation, as the loss of data
could compromise patient care and treatment
decisions (Kiourtis et al, 2017).
Furthermore, Hajjamy et al. developed a semi-automatic approach to integrate classical data
sources into an ontological database. Their method
involves transforming data sources into ontologies
using measures of syntactic, semantic, and structural
similarity, to generate a global ontology. The authors
suggested that their approach could be enhanced by
incorporating other information retrieval techniques
and big data methods to handle larger ontologies.
Although their method is promising, its reliance on
traditional data fusion methods may hinder its ability
to handle the large and complex datasets common in
modern healthcare (El Hajjamy et al,2018).
In 2019, Li Chen et al. developed an ontology to
represent knowledge and relationships in the field of
diabetes and used this ontology to build a reasoning
model for medical decision making. Their method
involved collecting data from various sources, specifying Semantic Web Rule Language (SWRL) rules to encode the reasoning rules of the ontology, and implementing these rules in an inference engine to create a decision-support system called the Ontology-based Medical Diagnostic and Treatment Platform (OMDP). OMDP uses an OWL ontology and SWRL rules to analyse a patient's symptoms, make a diagnosis, and recommend treatment by integrating different types of diabetes knowledge. Despite this, the limited scope of their model, focused only on diabetes, raises concerns about its broader applicability across other medical conditions. Furthermore, while the integration of different knowledge sources is valuable, the lack of real-time data integration limits its practical utility (Chen et al., 2019).
In 2020, Sreenivasan and Chacko proposed an approach to map heterogeneous EHR data to ontologies in order to create a semantically annotated knowledge base that can be queried. The approach involved semantically annotating the data using an ontology, building a new knowledge base, and using a semantic inference module to infer knowledge from the data. The proof of concept showed that health data
mapped into a relational database using ontologies
can be used for inference and for rapid decision
making by healthcare professionals. The proof of
concept is promising but remains at a pilot stage, and
further validation is required before it can be widely
adopted in clinical practice (Sreenivasan and Chacko,
2020).
In 2022, Adel et al. proposed an ontological model
to integrate patient health data from heterogeneous
data sources at a centralized point to improve the
quality of care. The authors unified five different
healthcare data formats into a unified ontology. The
results showed that the proposed model made it
possible to integrate and collect all patient data from
heterogeneous data sources, improving the quality of
care and reducing medical errors. However, there are
limits to the model, including its specificity to a
particular area, the use of limited test data, the need
for close collaboration between health professionals
and computer scientists, and the need for more in-
depth validation on a larger population in real clinical
environments (Adel et al., 2022).
Overall, most of the existing approaches faced
limitations, especially the variety of medical data
formats, the variability of collection protocols, the
problem of confidentiality of medical data, and the
lack of uniform standards for semantic tagging, which
have hindered the automatic understanding of data.
These challenges have hindered the implementation
of comprehensive solutions and highlighted the need
for innovative approaches to overcome these
limitations.
3 DATASET DESCRIPTION
This section details the data sources and the
construction of the patient file, outlining the process
used to gather and structure the datasets necessary for
our study. The Web offers access to an enormous
quantity of documents, in several forms (texts,
images, videos, sounds) and in different languages.
Most of these documents are freely available, easy to
access, and in electronic format. To identify relevant
resources, we used the Google search engine with
targeted queries such as “medical dataset Excel”,
“EHR data sample xls”, and “clinical records
spreadsheet". From the search results, we first
considered the most popular web pages (the top-
ranked pages in the results list). These pages were
then filtered according to qualitative criteria:
- representativeness of the medical domain,
- audience targeted by the site (general public or healthcare professional),
- author of the page (health professional or not),
- language used (easily understandable or specialized terminology).
Based on these criteria, we selected web pages that
hosted structured datasets and, more specifically,
Excel files. While Excel is not a standard format for
EHR representation, its tabular structure makes it an
accessible, modifiable, and easily shareable medium
for organizing medical information. These files
provided a practical way to compile representative
terms used by healthcare mediators while maintaining
data consistency.
Thus, the dataset consists of Excel files retrieved
from public health websites and open research
portals, each containing structured medical records in
English and French.
The Excel model is organized into five main classes:
1. Patient ID: an anonymized identifier assigned
to each patient.
2. Healthcare Professional: the physician, nurse,
or medical specialist associated with the
patient record.
3. Disease: the main diagnosis for the patient.
4. Medicine: the prescribed drug or treatment.
5. Symptom: the clinical signs reported by the
patient or observed by the healthcare
professional.
In addition to these core classes, the dataset also
contains supplementary attributes such as medical
exams, therapeutic procedures, and allergies, which
are linked to the five main classes. This structure
ensures both flexibility and consistency in
representing patient-related information.
We started with data cleansing, which is a crucial step in the process of managing healthcare data, including EHRs. This phase aims to guarantee the quality, consistency, and reliability of medical information. During data cleansing, several aspects are taken into account, including the detection and removal of duplicates, errors, and missing data.
general, data cleansing plays a major role in creating
high-quality EHRs, ensuring that medical
information is accurate, complete and consistent,
which contributes to better quality healthcare and
efficient management of health data.
4 THE PROPOSED
ARCHITECTURE
Our proposed approach combines the structuring
power of a medical ontology with the flexibility of
machine learning and NLP. The motivation for this approach stems from identifying the greatest challenges in managing medical knowledge. Such knowledge is the cornerstone of clinical decision-making, biomedical research, and health management.
However, medical data is often unstructured,
heterogeneous, and subject to semantic ambiguity,
creating significant obstacles for healthcare
professionals and researchers who need to extract
relevant insights. To address these challenges, our
method integrates ontology-based structuring with
advanced machine learning techniques and Natural
Language Processing (NLP) to enhance semantic
annotation and interoperability.
The architecture of this approach is shown in
Figure 1. It consists of two main layers: the
construction of local ontologies and the unification
into a global ontology.
1. In the first layer: Local Ontology Construction
A local ontology is generated for each data
source, and then converted into an OWL
ontology representation.
The input EHR data sources used in this
process are Excel files.
The output of this layer is a structured local
ontology in OWL format.
2. In the second layer: Global Ontology
Construction
The objective of this layer is to generate a
unified global ontology that consolidates
heterogeneous data sources into a single
structured representation.
This step provides semantic alignment and
ensures consistency across different
healthcare information systems.
Figure 1: Architecture of the proposed approach.
4.1 Local Ontologies Construction
Converting an Excel file into an ontology is a
fundamental process for transforming tabular data
into a rich and interpretable semantic structure. This
process is essential when we want to create a formal
and structured representation of the concepts and
relationships present in the data. Excel files are
commonly used to store and manage data in tabular
form, making them easy to use but less suitable for in-depth semantic analysis. In contrast, an ontology is a
formal representation of concepts and relationships
providing rich semantics for data. The conversion
process begins with defining an appropriate ontology
structure for the data domain. This involves identifying the classes, properties, and relationships that characterize the data. For example, in an Excel file containing patient information, an ontology could define classes such as "Patient", "Disease", "Treatment", etc., and specify their properties. Then, each row in the Excel file is transformed into a class instance in the ontology, and the values of the table cells are associated with the corresponding properties.
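To make this conversion concrete, the following minimal sketch uses the pandas and owlready2 libraries to turn Excel rows into OWL class instances. The ontology IRI, file names, and column names ('Patient ID', 'Disease') are illustrative assumptions, not the exact artifacts of our implementation.

```python
import pandas as pd
from owlready2 import get_ontology, Thing, DataProperty

# Illustrative IRI; the real one depends on the deployment.
onto = get_ontology("http://example.org/local_ehr.owl")

with onto:
    class Patient(Thing): pass
    class Disease(Thing): pass
    class has_disease(Patient >> Disease): pass  # object property linking the two classes
    class has_patient_id(DataProperty):          # data property carrying the identifier
        domain = [Patient]
        range = [str]

# Hypothetical source file; column names follow the model of Section 3.
df = pd.read_excel("ehr_source.xlsx")

# Each row becomes a class instance; cell values become property assertions.
for _, row in df.iterrows():
    patient = Patient(f"patient_{row['Patient ID']}")
    patient.has_patient_id = [str(row["Patient ID"])]
    disease = Disease(str(row["Disease"]).replace(" ", "_"))
    patient.has_disease.append(disease)

onto.save(file="local_ontology.owl", format="rdfxml")
```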
4.2 Global Ontologies Construction
Ontology integration is a key step for ensuring
semantic interoperability across heterogeneous
medical datasets. One of the main challenges is that
ontologies may be developed under different
conceptual frameworks, which complicates their
integration. In our approach, we focused on domain-
specific ontologies that provide complementary
coverage of structural, administrative, and semantic
aspects of medical data. We selected three ontologies
for integration: EHR EXTRACT RM.owl, which
provides a formal representation of electronic health
record structures; RIMV3OWL.owl, which models
the HL7 Reference Information Model (RIM) and its
use in clinical systems; and OGMS.owl (Ontology for
General Medical Science), which defines core
concepts of medical science such as disease,
symptom, and diagnosis. The choice of these
ontologies was motivated by their complementary
scope: EHR EXTRACT RM and RIMV3OWL
address the structural and administrative aspects of
patient records, while OGMS provides the semantic
layer for core medical concepts.
Although BFO (Basic Formal Ontology) is a well-
known top-level ontology, we did not explicitly adopt
it in this work. Integrating BFO would have required
substantial re-engineering of the selected ontologies,
which was beyond the scope of our study. Instead, we
aimed to construct a unified mid-level ontology that
bridges EHR structures and medical semantics in a
pragmatic way, directly addressing interoperability
challenges. It is also worth noting that OGMS is itself
grounded in BFO, which indirectly provides our
integration with a degree of top-level alignment.
Furthermore, during the integration process we
identified a lack of sufficient information on medical
abbreviations within the existing ontologies. To
address this gap, we introduced a dedicated Medical
Abbreviations class, in which each abbreviation
instance is systematically organized in alphabetical
order and explicitly linked to its full meaning. This
addition enhances accessibility and improves the
semantic clarity of medical records for healthcare
professionals, researchers, and practitioners.
Since abbreviations are frequently used, users of the global ontology can thus quickly access detailed information on medical abbreviations, thereby strengthening the quality and accuracy of medical research and data analysis.
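As a sketch of how this extension can be realized with owlready2 (the file paths, ontology IRI, and the two sample entries are assumptions for illustration):

```python
from owlready2 import get_ontology, Thing, DataProperty

# Load the three source ontologies; local file paths are assumed here.
global_onto = get_ontology("http://example.org/global_ehr.owl")
for path in ["EHR_EXTRACT_RM.owl", "RIMV3OWL.owl", "OGMS.owl"]:
    global_onto.imported_ontologies.append(get_ontology("file://" + path).load())

with global_onto:
    class MedicalAbbreviation(Thing): pass
    class has_full_meaning(DataProperty):  # links each abbreviation to its expansion
        domain = [MedicalAbbreviation]
        range = [str]

# Instances are created in alphabetical order and linked to their full meaning
# (two sample entries; the real dictionary is far larger).
for abbr, meaning in sorted({"BP": "blood pressure", "EGD": "endoscopy"}.items()):
    instance = MedicalAbbreviation(abbr)
    instance.has_full_meaning = [meaning]

global_onto.save(file="global_ontology.owl", format="rdfxml")
```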
4.3 Machine Learning to Ensure
Correspondence
Faced with the complexity of medical terminologies
and the diversity of standards, semantic annotation is
essential for a precise understanding of medical
information. Nevertheless, applying this annotation at
scale requires a scalable approach, which is where
machine learning comes in. By leveraging machine
learning models, our solution aims to automate the
semantic annotation process, thus allowing precise
identification of medical terms and their contexts
within the data. At the same time, machine learning
helps overcome interoperability barriers by enabling
systems to dynamically adapt to different standards
and data formats. The steps shown in Figure 2 illustrate our comprehensive approach to solving the automatic abbreviation detection problem.
Figure 2: Abbreviation detection model using ML.
4.4 Natural Language Processing to
Ensure Correspondence
The use of NLP and the SpaCy library represents an essential dimension of our approach to semantic
annotation in the context of EHR. While machine
learning offers powerful solutions, NLP stands out as
a complementary method, exploiting advanced
understanding of human language to extract rich
semantic information. By integrating SpaCy, a
modern NLP library, our research aims to go beyond
the traditional limits of semantic annotation. The NLP
techniques and SpaCy significantly contribute to our
goal of semantically enriching medical data, thereby
improving the understanding of complex medical
concepts and supporting informed decision-making
in healthcare. Unlike standard implementations, we
enhance SpaCy capabilities by integrating domain-
specific medical ontologies, custom rule-based
phrase matchers, and pre-trained transformer models
to improve entity recognition. This enhancement
allows for high precision annotation and a better
contextual understanding of medical texts, making it
more adaptable to the complexities of EHR systems.
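As an illustration of the rule-based phrase matching, the following minimal SpaCy sketch matches an assumed, ontology-derived term list against a sample sentence; the model name and terms are examples, not our full configuration:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

# Hypothetical vocabulary; in our system the terms come from the global ontology.
medical_terms = ["blood pressure", "diabetes mellitus", "endoscopy"]

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # case-insensitive matching
matcher.add("MEDICAL_TERM", [nlp.make_doc(term) for term in medical_terms])

doc = nlp("The patient with diabetes mellitus had stable blood pressure.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # -> "diabetes mellitus", then "blood pressure"
```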
To validate our approach, we conduct quantitative performance evaluations, including:
- precision, recall, and F1-score for entity recognition and abbreviation expansion;
- comparative analysis with alternative NLP libraries to justify model selection;
- error analysis to refine NLP-based entity resolution.
This method performs several steps to detect and process abbreviations in text using SpaCy; these steps are illustrated in Figure 3.
Figure 3: Model using NLP (SpaCy).
5 EXPERIMENTATION AND
RESULTS
This section discusses the gathered results for each
phase of the proposed system.
5.1 Result of Data Cleansing
Figure 4 shows the missing values in all columns. The first column, 'class', has the highest number of NaN values (34); the 'properties' column has no missing values; and the other two columns, 'subclass' and 'instances', have almost
the same number of missing values, with fewer than 10 NaNs each. Figure 4 thus illustrates the distribution of missing values across all columns before data cleansing. We then replaced the NaNs with zeros, and subsequently with a specific string, 'Inconnu' (French for "unknown").
Figure 4: Screenshot of the distribution of missing NaN
values in all columns before the cleansing data.
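For reference, a minimal pandas sketch of this cleansing step, under an assumed file name, mirroring the two-step replacement described above:

```python
import pandas as pd

# Hypothetical file name; columns follow the structure described in Section 3.
df = pd.read_excel("ehr_records.xlsx")

# Distribution of missing (NaN) values per column, as visualized in Figure 4.
print(df.isna().sum())

# Remove exact duplicate records.
df = df.drop_duplicates()

# Two-step replacement: NaN -> 0, then 0 -> the placeholder string 'Inconnu'.
df = df.fillna(0)
df = df.replace(0, "Inconnu")
```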
5.2 Results of Ontologies Construction
In this section we discuss the results of converting
data from Excel to our local ontology. This phase
constitutes the foundation of our approach, giving our
medical data a rich and precise semantic
representation. Figures 5 and 6 show extracts of the constructed ontologies.
Figure 5: Ontology generated from Excel file.
Figure 5 illustrates the ontology generated from an
Excel dataset, showing how tabular attributes are
transformed into semantic classes and organized
hierarchically. Some classes are instantiated with
terms in French, while others appear in English,
reflecting the multilingual nature of the source data.
This bilingual alignment allows the ontology to
support annotation in both languages, enabling health
records to be processed without ambiguity and
ensuring interoperability across multilingual datasets.
Figure 6: OntoGraf of the global ontology constructed.
Figure 6 presents the OntoGraf view of the global ontology constructed from the integrated sources (EHR EXTRACT RM.owl, RIMV3OWL.owl, OGMS.owl), i.e., the graph of the unified ontology produced by merging the local ontologies with the domain ontologies. It highlights
aligned classes, merged properties, and the extensions
introduced (notably the Medical Abbreviations class
and its links to full forms). This mid-level ontology
acts as a semantic bridge between EHR structural
schemas and medical domain concepts.
5.3 Abbreviation Detection Using
Machine Learning
The following steps illustrate a comprehensive
approach to solving the automatic abbreviation
detection problem:
- Data preparation: We begin by loading data containing phrases, abbreviations, meanings, and binary labels. We then create an abbreviation-to-id dictionary to map the abbreviations to unique numeric indexes, and add an encoded-abbreviation column so that the abbreviations can be used by the subsequent machine learning model.
- Data processing: We use scikit-learn to create word vectors with CountVectorizer. These word vectors are then combined, using hstack, with other features such as uppercase letters, alphanumeric characters, and dot endings. The dataset is divided into a training set and a test set to evaluate the performance of the models.
- Evaluation of the models: Several classification models, such as Random Forest, KNN, and Decision Tree, are trained on the data. The performance of each model is evaluated using various metrics, such as accuracy, mean absolute error, mean squared error, and the confusion matrix. These estimates are then visualized using Plotly Express to facilitate model comparison. A sketch of this workflow is given below.
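The following sketch illustrates this workflow, assuming a hypothetical CSV with a 'token' column and a binary 'is_abbreviation' label (0/1); it is an illustration of the technique rather than our exact implementation:

```python
import pandas as pd
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error

df = pd.read_csv("abbreviations.csv")  # hypothetical file with 'token', 'is_abbreviation'

# Bag-of-words features over the tokens.
vectorizer = CountVectorizer()
X_words = vectorizer.fit_transform(df["token"])

# Hand-crafted orthographic features: all-uppercase, contains a digit, trailing dot.
extra = csr_matrix(df["token"].apply(
    lambda t: [t.isupper(), any(c.isdigit() for c in t), t.endswith(".")]
).tolist(), dtype=float)

X = hstack([X_words, extra])
y = df["is_abbreviation"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train and evaluate the three classifiers compared in the text.
for name, model in [("DecisionTree", DecisionTreeClassifier()),
                    ("RandomForest", RandomForestClassifier()),
                    ("KNN", KNeighborsClassifier())]:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(name,
          accuracy_score(y_test, pred),
          mean_absolute_error(y_test, pred),
          mean_squared_error(y_test, pred))
```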
The comparison shows that the Decision Tree
classifier achieved the best performance, with an
accuracy of 1.0 and average absolute and squared
errors of 0.00. In second place, the Random Forest
model obtained an accuracy of 0.93, an average
absolute error of 0.06, and an average squared error
of 0.26. Finally, the K-Nearest Neighbors (KNN)
model achieved an accuracy of 0.83, an average
absolute error of 0.17, and an average squared error
of 0.41.
Although Decision Tree reached perfect accuracy
on the dataset, accuracy alone is not always a
sufficient metric for evaluating classifiers in the
medical domain, where robustness and generalization
are critical. For this reason, we also considered
complementary metrics such as precision, recall, and
F1-score, which provide a more balanced view of
performance. Based on these metrics, Decision Tree
remained highly effective, but Random Forest offered
more stable generalization results.
The final choice of Random Forest as the selected
model was motivated by several factors. First, unlike
a single Decision Tree, Random Forest aggregates
multiple trees, which reduces overfitting and
improves robustness on unseen data. Second, in
practice, the implementation of a Decision Tree with
high complexity required more computational
resources than were available in our environment, as
well as additional expertise in fine-tuning and
optimization. In contrast, Random Forest required
less fine-tuning, was easier to implement, and offered
strong performance while maintaining computational
feasibility.
5.3.1 Training and Using the Model
The implementation of the Random Forest model for
abbreviation detection involves a structured approach
comprising several key steps. Initially, a pipeline is
established, integrating a transformer
(CountVectorizer) and a RandomForestClassifier
model, facilitating efficient data preprocessing by
converting text into word vectors. Subsequently, the
dataset is partitioned into training and test sets, and
the RandomForest model is trained on the training set.
Following training, the model generates predictions
on the test set, and its performance is evaluated based
on accuracy. Additionally, the model is applied to
new sentences to detect abbreviations, assigning
labels to indicate their presence. This systematic
approach showcases the integration of Random
Forest into a natural language processing pipeline for
abbreviation detection, emphasizing its role in data
preparation, training, and prediction.
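A minimal sketch of such a pipeline, with toy training sentences standing in for our dataset:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

# Toy training data: sentences labeled 1 if they contain an abbreviation.
sentences = ["Patient EGD scheduled for Monday.", "Patient reports a mild headache."]
labels = [1, 0]

pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),         # text -> word-count vectors
    ("classifier", RandomForestClassifier()),  # ensemble of decision trees
])
pipeline.fit(sentences, labels)

# Apply the trained pipeline to a new sentence.
print(pipeline.predict(["BP measured at admission."]))
```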
5.3.2 Displaying Results
We indicated whether an abbreviation was detected in each new sentence and also displayed information about the detected abbreviation by replacing the abbreviation with its meaning using a replace_abbreviations function.
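A sketch of what such a function may look like; the dictionary shown is a hypothetical stand-in for the ontology-derived mapping used in our system:

```python
import re

# Hypothetical abbreviation dictionary (in our system it is derived from the ontology).
ABBREVIATIONS = {"EGD": "endoscopy", "BP": "blood pressure"}

def replace_abbreviations(sentence: str) -> str:
    """Replace every known abbreviation with its full meaning."""
    for abbr, meaning in ABBREVIATIONS.items():
        sentence = re.sub(rf"\b{re.escape(abbr)}\b", meaning, sentence)
    return sentence

print(replace_abbreviations("EGD showed normal mucosa."))
# -> "endoscopy showed normal mucosa."
```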
Figure 9: The results given by the Random Forest model.
Figure 9 shows the result given by the Random Forest model trained to predict whether a sentence contains an abbreviation. If one is detected, the result displays the detected abbreviation and its meaning. For example, the first screenshot shows that the abbreviation 'EGD' is detected and replaced with its meaning, endoscopy. This section thus shows the resolution of the automatic abbreviation detection problem, integrating text processing, feature creation, model training, and performance evaluation. It offers a complete and extensible solution for detecting abbreviations in similar contexts.
5.4 Method of Abbreviation Detection Using SpaCy, an NLP Library
The method of automatic abbreviation detection was
implemented using SpaCy, a natural language
processing (NLP) library, combined with rule-based
heuristics and semantic enrichment.
SpaCy was applied for text preprocessing and
linguistic analysis, including tokenization, part-of-
speech tagging, and dependency parsing.
Abbreviation candidates were then identified through
regular expression–based rules, focusing on patterns
such as:
- uppercase sequences of up to five characters,
- contextual forms like "long form (short form)" (e.g., Blood Pressure (BP)),
- co-occurrence of potential abbreviations with their expansions within the same text window.
To improve accuracy, the system performed
frequency analysis of abbreviation–expansion pairs
and validated them against ontology concepts. When
multiple expansions were possible, the one
corresponding to an existing ontology class was
selected.
Furthermore, we used RDF data to enrich the
abbreviation dictionary. Each abbreviation was
annotated with semantic information such as domain,
synonyms, and related ontology class.
In the final step, abbreviations were automatically
replaced by their expanded forms, improving both
readability and semantic clarity. This process is
particularly beneficial for medical record analysis,
clinical reporting, and advanced biomedical research.
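To make the detection step concrete, here is a minimal sketch combining SpaCy preprocessing with two of the regular-expression rules above; the model name and rule details are illustrative assumptions:

```python
import re
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

# Rule: uppercase sequences of 2-5 characters, e.g. "BP", "EGD".
SHORT_FORM = re.compile(r"\b[A-Z]{2,5}\b")
# Rule: "long form (short form)" pattern, e.g. "Blood Pressure (BP)".
LONG_SHORT = re.compile(r"([A-Z][\w ]+?)\s*\((\b[A-Z]{2,5}\b)\)")

def detect_abbreviations(text: str) -> dict:
    """Return a mapping {abbreviation: expansion or None} found in the text."""
    doc = nlp(text)  # tokenization, POS tagging, and parsing, as in our preprocessing
    pairs = {short: long for long, short in LONG_SHORT.findall(doc.text)}
    for candidate in SHORT_FORM.findall(doc.text):
        pairs.setdefault(candidate, None)  # candidate without an in-text expansion
    return pairs

print(detect_abbreviations("Blood Pressure (BP) was stable; EGD was scheduled."))
# -> {'BP': 'Blood Pressure', 'EGD': None}
```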
Figure 10 illustrates the overall workflow,
showing how SpaCy, regular expressions, and RDF-
based enrichment were combined to detect, expand,
annotate, and replace abbreviations in the corpus.
Figure 10: Screenshot of some abbreviations with their text
replaced given by SpaCy.
6 COMPARISON STUDY
We have compared the proposed framework with other previous frameworks. The machine learning-based approach is more suited to complex tasks requiring generalization from labeled data, while the NLP approach with SpaCy is more transparent, easily adaptable, and well suited to specific tasks like abbreviation detection. Table 1 summarizes this comparison. It highlights the
different approaches, their strengths and weaknesses,
allowing a comparative assessment of the
methodologies adopted for medical data
interoperability and semantic annotation. Our
approach stands out significantly from others on
several fronts, attesting to its reliability and
effectiveness. In terms of semantic annotation, our
methodology relies on advanced NLP and machine
learning techniques, thus surpassing the competing
approach which does not offer similar semantic
functionalities. The flexibility of our system is
demonstrated by its ability to adapt to the nuances
inherent in medical data, thus surpassing the average
flexibility offered by other approaches. Regarding
model training, our approach excels by integrating a
robust process, unlike the other method which
neglects this crucial dimension. In terms of
ontological interoperability, our approach confirms
its advantage by guaranteeing semantic consistency
between different data sources. And in terms of
complexity, our approach maintains a careful balance
between sophistication and practicality, thus
positioning itself favorably compared to other
approaches, characterized by high complexity.
Finally, from the performance point of view, our
approach excels with a high evaluation, while the
other approaches show moderate performance. These
substantial differences illustrate the increased
reliability and overall superiority of our approach.
Although the proposed approach achieves syntactic interoperability in distributed EHRs, several perspectives remain. First, regarding the integration of new data sources, we plan to expand our approach to include a variety of medical data sources, such as medical images, imaging reports, and genomic data. This would contribute to a more complete representation of medical information. Second, much of the knowledge in the field of medicine is uncertain, so we have to deal with incomplete and ambiguous information in the domain; the involvement of a domain specialist would therefore be a plus.
7 CONCLUSIONS
The complex terminology used by specialists in the
medical field often creates a barrier for anyone
seeking to understand health-related information.
Health users need to decipher this "jargon" to better
manage their situation. In the context of EHR,
hospitals and physicians face major challenges in
effectively sharing the information needed for
quality, timely, and cost-effective care. In this
approach, we set out to solve the complex challenge
of EHR interoperability by combining approaches
based on machine learning and NLP. Our main
objective was to improve the semantic annotation of
medical data, thus facilitating interoperability
between different healthcare systems. To achieve this
goal, we followed an integrated approach, which built
a local and global ontology that helped improve
interoperability by standardizing the representation of
medical concepts. Our approach demonstrated
notable
efficiency in semantic annotation, providing
Table 1: A Comparison Study of the Proposed Framework and Other Previous Ones.
increased flexibility and better adaptation to
variations in medical data. Machine learning models
helped with accurate abbreviation detection, while the
use of SpaCy enhanced contextual understanding of
medical terms. The results show that our proposed
ontological model offers a significant contribution to
understanding and resolving the challenges
surrounding semantic annotation of electronic health
records. It also showed significant benefits in data
cleansing and medical data interoperability. By
combining robust ontological elements, machine
learning and NLP approaches, our approach aspires
to improve the efficiency and accuracy of health
systems through smarter management of medical
data, paving the way for significant advances in the understanding and management of medical data, reducing medical errors, and guaranteeing interoperability. For future work, we will create a graphical user interface to make the implemented framework easy to use, and further development of the framework will concentrate on the limitations of this work, as discussed previously.
REFERENCES
Adel, E., El-Sappagh, S., Barakat, S., Kwak, K., Elmogy,
M. (2022). Semantic architecture for interoperability in
distributed healthcare systems. IEEE Access, 10,
126161–126179.
Begoyan, A. (2007). An overview of interoperability
standards for electronic health records. Society for
Design and Process Science, USA.
Berges, I., Bermúdez, J., Illarramendi, A. (2011). Toward
semantic interoperability of electronic health records.
IEEE Transactions on Information Technology in
Biomedicine, 16, 424–431.
Celesti, A., Fazio, M., Romano, A., Villari, M. (2016). A
hospital cloud-based archival information system for
the efficient management of HL7 big data. In 2016 39th
International Convention on Information and
Communication Technology, Electronics and
Microelectronics (MIPRO), pp. 406–411.
Chen, L., Lu, D., Zhu, M., Muzammal, M., Samuel, O.,
Huang, G., Li, W., Wu, H. (2019). OMDP: An
ontology-based model for diagnosis and treatment of
diabetes patients in remote healthcare systems.
International Journal of Distributed Sensor Networks,
15, 1550147719847112.
Costa, C., Menárguez-Tortosa, M., Fernández-Breis, J.
(2011). Clinical data interoperability based on
archetype transformation. Journal of Biomedical
Informatics, 44, 869–880.
Da Costa, C., Wichman, M., Rosa Righi, R., Yamin, A.
(2019). Ontology-based model for interoperability
between openEHR and HL7 health applications. In
Proceedings of the International Conference in Health.
Fennelly, O., Moroney, D., Doyle, M., Eustace-Cook, J.,
Hughes, M. (2024). Key interoperability factors for
patient portals and electronic health records: A scoping
review. International Journal of Medical Informatics,
105335.
González-Ferrer, A., Peleg, M., Verhees, B., Verlinden, J.,
Marcos, C. (2012). Data integration for clinical
decision support based on openEHR archetypes and
HL7 virtual medical record. In International Workshop
on Process-oriented Information Systems in
Healthcare, pp. 71–84.
Gunter, T., Terry, N. (2005). The emergence of national
electronic health record architectures in the United
States and Australia: models, costs, and questions.
Journal of Medical Internet Research, 7, e383.
El Hajjamy, O., Alaoui, L., Bahaj, M. (2018). Integration of
heterogeneous classical data sources in an ontological
database. In Big Data, Cloud and Applications: Third
International Conference, BDCA 2018, Kenitra,
Morocco, pp. 417–432.
Kalra, D. (2006). Electronic health record standards.
Schattauer GMBH-Verlag.
Kiourtis, A., Mavrogiorgou, A., Kyriazis, D. (2017).
Aggregating heterogeneous health data through an
ontological common health language. In 2017 10th
International Conference on Developments in eSystems
Engineering (DeSE), pp. 175–181.
Liyanage, H., Krause, P., De Lusignan, S. (2015). Using
ontologies to improve semantic interoperability in
health data. BMJ Health & Care Informatics, 22.
McClanahan, K. (2008). Balancing good intentions:
protecting the privacy of electronic health information.
Bulletin of Science, Technology & Society, 28, 69–79.
Mylka, A., Kryza, B., Kitowski, J. (2012). Integration of
heterogeneous data sources in an ontological
knowledge base. Computing & Informatics, 31.
Roehrs, A., Costa, C., Rosa Righi, R., Rigo, S., Wichman,
M. (2018). Toward a model for personal health record
interoperability. IEEE Journal of Biomedical and
Health Informatics, 23, 867–873.
Santos, M., Bax, M., Kalra, D. (2010). Building a logical
EHR architecture based on ISO 13606 standard and
semantic web technologies. In MEDINFO 2010, pp.
161–165.
Sartipi, K., Dehmoobad, A. (2008). Cross-domain
information and service interoperability. In
Proceedings of the 10th International Conference on
Information Integration and Web-based Applications &
Services, pp. 25–32.
Sachdeva, S., Bhalla, S. (2022). Using knowledge graph
structures for semantic interoperability in electronic
health records data exchanges. Information, 13, 52.
Schloeffel, P., Beale, T., Hayworth, G., Heard, S., Leslie,
H., et al. (2006). The relationship between CEN 13606,
HL7, and openEHR. In HIC 2006 and HINZ 2006:
Proceedings, p. 24.
Sreenivasan, M., Chacko, A. (2020). A case for semantic
annotation of EHR. In 2020 IEEE 44th Annual
Computers, Software, and Applications Conference
(COMPSAC), pp. 1363–1367.
Plastiras, P., O’Sullivan, D., Weller, P. (2014). An
ontology-driven information model for interoperability
of personal and electronic health records.