Synergizing Data Imputation and Electronic Health Records for Advancing Prostate Cancer Research: Challenges, and Practical Applications

Abderrahim O. Batouche[1,2,3,∗,a], Eugen Czeizler[2,3,∗,b], Miika Koskinen[4,c], Tuomas Mirtti[2,5,d] and Antti S. Rannikko[2,6,e]

[1] Doctoral Programme in Computer Science, University of Helsinki, Helsinki, Finland

[2] Research Program in Systems Oncology, University of Helsinki, Helsinki, Finland

[3] ICAN-Digital Precision Cancer Medicine Flagship, Helsinki, Finland

[4] HUS Helsinki University Hospital, Helsinki, Finland

[5] Department of Pathology, University of Helsinki and Helsinki University Hospital, Helsinki, Finland

[6] Department of Urology, University of Helsinki and Helsinki University Hospital, Helsinki, Finland


Keywords:

Data Mining, Electronic Health Records, Missing Data, Prostate Cancer.

Abstract:

The presence of detailed clinical information in electronic health record (EHR) systems presents promising prospects for enhancing patient care through automated retrieval techniques. Nevertheless, it is widely acknowledged that accessing data within EHRs is hindered by various methodological challenges. Specifically, the clinical notes stored in EHRs are composed in a narrative form, making them prone to ambiguous formulations and highly unstructured data presentations, while structured reports commonly suffer from missing and/or erroneous data entries. This inherent complexity poses significant challenges when attempting automated large-scale medical knowledge extraction tasks, necessitating the application of advanced tools, such as natural language processing (NLP), as well as data audit techniques. This work aims to address these obstacles by creating and validating a novel pipeline designed to extract relevant data pertaining to prostate cancer patients. The objective is to exploit the inherent redundancies available within the integrated structured and unstructured data entries within EHRs in order to generate comprehensive and reliable medical databases, ready to be used in advanced research studies. Additionally, the study explores potential opportunities arising from these data, offering valuable prospects for advancing research in prostate cancer.

1 INTRODUCTION

Prostate cancer (PCa) is a prevalent disease known for its indolent nature, often characterised by slow development and protracted progression over time (National Cancer Institute, 2023; The American Cancer Society medical and editorial content team, 2019). As such, one specific challenge in performing medical research on PCa is dealing with incomplete medical records and missing data, e.g., as a result of city relocation or disease follow-up across different health providers. This, in turn, can hinder the results of ongoing research studies analysing the effectiveness of diagnoses and various treatment planning approaches (Holmes et al., 2021). Ultimately, this can affect clinical decision-making and the patient’s well-being.

a https://orcid.org/0000-0003-4181-9891
b https://orcid.org/0000-0002-1607-1554
c https://orcid.org/0000-0002-7267-5811
d https://orcid.org/0000-0003-0455-9891
e https://orcid.org/0000-0002-4261-3484
∗ These authors contributed equally to this work.

To overcome the limitations of incomplete and/or erroneous data, Electronic Health Record (EHR) mining has emerged as a crucial approach in medical research as well as within clinical practice (Yadav et al., 2018). EHR mining leverages advanced data analytics and artificial intelligence (AI) approaches to extract valuable insights from vast amounts of patient data (Ajmal et al., 2023; Javaid et al., 2022). By identifying patterns, trends, and risk factors associated with prostate cancer, EHR mining facilitates the early detection of advanced disease and the personalisation of treatment strategies (Knighton et al., 2016; Seneviratne et al., 2018; Henkel et al., 2022). However, challenges such as missing data and data security must be addressed to ensure patient information remains complete, confidential, and secure. Additionally, the lack of interoperability between different EHR systems poses hurdles in data sharing and aggregation, limiting the full potential of mining both for prostate cancer research and for the general improvement of patient care (De La Torre-Díez et al., 2013). Overcoming these issues and promoting standardised data collection practices and protocols will be pivotal in advancing PCa treatment through EHR mining (Herp et al., 2023), as well as medical research in general.

Batouche, A., Czeizler, E., Koskinen, M., Mirtti, T. and Rannikko, A.
Synergizing Data Imputation and Electronic Health Records for Advancing Prostate Cancer Research: Challenges, and Practical Applications.
DOI: 10.5220/0012350300003657
Paper published under CC license (CC BY-NC-ND 4.0)
In Proceedings of the 17th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2024) - Volume 2, pages 77-86
ISBN: 978-989-758-688-0; ISSN: 2184-4305

In our work, we have designed and developed a data preprocessing pipeline that can leverage routinely collected information from our EHRs (HUS Datalake (Bruck, 2023; Pylkäs, 2023; Misukka, 2022)) to efficiently and accurately retrieve and consolidate clinicians’ work on PCa treatment analysis. Using Microsoft Azure Machine Learning studio and batches from the HUS Datalake that are available in the HUS Acamedic environment (a secure, scalable data analytics platform developed for medical research (kuorttinen, 2023)), we developed an EHR mining pipeline using Python libraries to read, process, and provide curated data for further research applications.

One of the key clinical inputs exhibiting missing entries within the EHR of a significant number of PCa patients is the occurrence of curative treatment, i.e., radical prostatectomy (RP) or radiation treatment (RT). Since imputing such missing data is unavoidable, we had to use a different approach to uncover these lost data entries. Using routinely collected values of the prostate-specific antigen (PSA) lab measurements, we were able to successfully identify and even classify curative PCa treatments. To our knowledge, this is the first attempt to infer missing EHR treatment records from PSA time series data.

Our approach enabled us to enhance our EHR by incorporating approximately 2.8 thousand new curative treatment events, marking a notable 27% growth compared to the treatment events available beforehand. The explanation for this relatively large increase is manifold. Some patients might have been treated outside the hospital district (Helsinki and Uusimaa) whose database our study is based upon. Others might have been treated in private practice units, which again are not covered by our database. Finally, we can assume that a proportion of these missing treatment events are due to human error in recording them within the EHR.

Another key piece of clinical information (as well as a key surrogate measurement within medical research analysis) which is most of the time not directly recorded within EHRs, either in structured or non-structured format, is the time at which PCa patients are classified as having a biochemical recurrence (BCR). After primary cancer treatment, BCR is declared when the PSA level in the blood surpasses a certain threshold, indicating that the disease may be returning or progressing. BCR status is thus an important indicator both clinically, as it signals that further monitoring or treatment may be needed to manage the condition (Stephenson et al., 2006; Artibani et al., 2018), and from a (medical) data analysis perspective, as it is a surrogate for PCa mortality (Zhao et al., 2022; Artibani et al., 2018). By following the PSA measurements as well as all EHR-available PCa treatment records, we were able to effectively determine (and report) the status and timing of BCR for all PCa patients.

2 METHODS

2.1 Data Source

Our pipeline starts by identifying patients of interest

within a large academic EHR system (Figure 1). We

used the Finnprostate dataset, which is a large patient

registry study combining Finnish national healthcare

data with local hospital data (n=700,000) of men

suspected of having PCa (PSA measured) or diag-

nosed with PCa. From Finnprostate, we gathered a

HUS (Hospital District of Helsinki and Uusimaa) sub-

cohort of men (n=326,796) having comprehensive pa-

tient information regarding out-patient clinic and hos-

pital visits as well as data regarding laboratory tests,

medication prescriptions, radiological, pathological,

and surgical reports, as well as comorbidities covering

the years 1993 to 2019. The above data is embedded

within the regional HUS Acamedic datalake.

Medical research commonly encounters missing data. Despite this prevalence, it is nowadays generally accepted to perform various data analysis tasks on partially incomplete records, as long as the missing values are not substantial and the analysis methods themselves can cope with the specific uncertainties. Moreover, advanced imputation techniques such as maximum likelihood (Wald, 1949), multiple imputation (Schafer, 1999), or Bayesian methods (Kong et al., 1994) have a good track record in addressing many of the missing data entries. However, certain complex missing data records, such as the moment and type of a deployed treatment, or the first diagnostic biopsy of a tumour and its aggressiveness, are very hard to address with any of the available computational imputation methods.

HEALTHINF 2024 - 17th International Conference on Health Informatics

Figure 1: Data preprocessing pipeline for Prostate Cancer research data.

In our data processing work (yellow box, Figure

1), imputation was reinforced with customised algo-

rithms that rely on clinical guidelines, experts’ inter-

pretations, as well as the intrinsic information redun-

dancy available within EHR, in order to retrieve the

missing data. All created algorithms are described in

Table 1.

2.2 Missing Curative Treatment Detection

The Treatment Detection Algorithm (DTX) plays a pivotal role in enhancing our data quality by identifying and incorporating missing curative treatment records (Algorithm 4). The algorithm takes all patients’ data as input and returns a list of missing curative treatments.

The Significant PSA Drop Algorithm (SIGDROP) constitutes the initial phase of DTX, tracking a patient’s PSA values subsequent to their diagnostic biopsy (Algorithm 1). The algorithm takes the PSA measurements of patient i and returns, if any:

• drop_date: The date of the PSA drop, i.e., the date of the highest (maximum) point from which a significant PSA drop starts; it is subsequently considered as the treatment date.

• nadir_date: The date of the PSA nadir, i.e., the date of the lowest (minimum) point reached by the significant drop.

• PSA_min: The minimum value (at nadir_date); this value is used to classify the drop as radical prostatectomy or radiation therapy.

The algorithm begins by searching for the maximum PSA value (PSA_max, lines 3-4), and then attempts to identify the minimum value within the ensuing period of δ ≤ 12 months (lines 5-32). Upon identifying a decreasing value, at lines 15-16, the algorithm calculates α, the drop percentage, which is tested against predetermined significance conditions (line 17). This step is indispensable in establishing that the observed drop is genuine and clinically significant.

Having validated the drop as significant, and (line 6) with no EHR-recorded curative treatment between the date of the drop (d_max) and the date of the nadir (d_min), DTX proceeds to collate all such identified drops, systematically categorising them into two distinct treatment modalities: radiation therapy (RT) and radical prostatectomy (RP) (Algorithm 4, lines 7-10). This classification not only facilitates comprehensive treatment record augmentation but also provides valuable missing insights into the patient’s therapeutic journey.
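The SIGDROP logic described above can be sketched in Python. This is a simplified illustration rather than a line-by-line transcription of Algorithm 1: the names `sigdrop`, `is_significant_drop` and `months_between`, the 30.44-day month, and the rule of following the series down to its lowest value inside the 12-month window are all assumptions made for the example. Only the significance test (α ≥ 0.75 and β ≥ 3, or α ≥ 0.5 and β ≥ 4) is taken directly from line 17.

```python
from datetime import date
from typing import List, Optional, Tuple

Measurement = Tuple[date, float]  # (sampling date, PSA value)


def months_between(later: date, earlier: date) -> float:
    """Approximate month difference; 30.44 days per month is an assumption."""
    return (later - earlier).days / 30.44


def is_significant_drop(psa_max: float, psa_next: float) -> bool:
    """Significance conditions of Algorithm 1, line 17."""
    if psa_max <= 0:
        return False
    beta = psa_max - psa_next   # absolute drop
    alpha = beta / psa_max      # relative drop, as a fraction of PSA_max
    return (alpha >= 0.75 and beta >= 3) or (alpha >= 0.5 and beta >= 4)


def sigdrop(psa: List[Measurement],
            window_months: float = 12.0) -> Optional[Tuple[date, date, float]]:
    """Return (drop_date, nadir_date, psa_min) or None if no significant drop."""
    series = sorted(psa)
    if not series:
        return None
    peak_date, peak = series[0]
    nadir: Optional[Measurement] = None
    for when, value in series[1:]:
        within = months_between(when, peak_date) <= window_months
        if nadir is not None:
            # A drop was already found: follow it to its lowest point
            # inside the window, then stop (simplifying assumption).
            if within and value < nadir[1]:
                nadir = (when, value)
                continue
            break
        if within and is_significant_drop(peak, value):
            nadir = (when, value)
        elif value > peak or not within:
            # Higher value, or window expired: restart from this measurement.
            peak_date, peak = when, value
    if nadir is None:
        return None
    return peak_date, nadir[0], nadir[1]


example = [
    (date(2020, 1, 1), 10.0),
    (date(2020, 3, 1), 12.0),
    (date(2020, 6, 1), 1.0),
    (date(2020, 9, 1), 0.1),
    (date(2021, 1, 1), 0.2),
]
result = sigdrop(example)
```

In this toy series the peak moves to the 12.0 reading in March 2020, the drop to 1.0 passes the significance test, and the nadir is then followed down to 0.1 in September 2020.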

2.3 Biochemical Recurrence Detection

Biochemical recurrence (BCR) serves as a crucial indicator for PCa mortality. However, its availability in EHRs is not always guaranteed. In such cases, various methods can be employed to retrieve and impute this information. Our Detect Biochemical Recurrence (DBCR) algorithm is specifically designed to analyse data from treated patients, identifying potential relapses and categorising patients as either having experienced a BCR or not (Algorithm 7). To achieve this outcome, DBCR utilises four distinct functions, each tailored to a specific task.

Clinical guidelines governing PSA relapse are stringent and clearly defined (Van Den Broeck et al., 2020), and these guidelines are integrated into the PRP and PRT functions (Algorithms 5-6).

Table 1: Summary of Algorithms.

Algorithm Name   Input              Output                           Complexity   Short Description
SIGDROP          PSA_i              drop_date, nadir_date, PSA_min   O(M)         Detects significant PSA drop and related dates.
DTX              PATIENTS_LIST      L                                O(M∗N)       Detects missing treatments based on PSA data.
CRT              p                                                   O(1)         Detects Clinical Relapse after RT.
CRP              p                                                   O(1)         Detects Clinical Relapse after RP.
PRP              p                                                   O(N)         Detects PSA Relapse after RP.
PRT              p                                                   O(N)         Detects PSA Relapse after RT.
DBCR             TREATED_PATIENTS   L_bcr                            O(M∗N)       Main algorithm for BCR detection.

- RP=Radical prostatectomy, RT=Radiation therapy, BCR=Biochemical recurrence.
- In M∗N, M is the number of PSA measurements and N is the number of patients.

Algorithm 1: SIGDROP - Significant PSA drop detection.

Input: PSA_i
Output: drop_date, nadir_date, PSA_min
1: M ← size(PSA_i)
2: if M ≥ 0 then
3:   PSA_max ← PSA_i[1]
4:   date_PSA_max ← getDate(PSA_max)
5:   for j = 1 to M − 1 do
6:     e ← PSA_i[j] − PSA_i[j + 1]
7:     δ ← date_PSA − date_PSA_max
8:     if e ≤ 0 then
9:       date_PSA ← getDate(PSA_i[j + 1])
10:      if (PSA_max < PSA_i[j + 1]) or δ > 12m then
11:        PSA_max ← PSA_i[j + 1]
12:        date_PSA_max ← getDate(PSA_max)
13:      end if
14:    else
15:      β ← PSA_max − PSA_i[j + 1]
16:      α ← β / PSA_max
17:      if (α ≥ 0.75 and β ≥ 3) or (α ≥ 0.5 and β ≥ 4) then
18:        PSA_min ← PSA_i[j + 1]
19:      else
20:        if δ > 12m then
21:          PSA_max ← PSA_i[j + 1]
22:          date_PSA_max ← getDate(PSA_max)
23:        else
24:          γ ← date_PSA[j + 2] − date_PSA_max
25:          if j + 2 ≤ M and γ > 12m then
26:            PSA_max ← PSA_i[j + 1]
27:            date_PSA_max ← getDate(PSA_max)
28:          end if
29:        end if
30:      end if
31:    end if
32:  end for
33: end if
34: if PSA_min exists then
35:   drop_date ← getDate(PSA_max)
36:   nadir_date ← getDate(PSA_min)
37:   return drop_date, nadir_date, PSA_min
38: end if
39: return NULL

• PSA-based relapse after radical prostatectomy, PRP(p): this function uses the European Association of Urology (EAU) guidelines (Van Den Broeck et al., 2020) to detect whether a PSA-based relapse occurred after radical prostatectomy. If an ultrasensitive PSA (Shen et al., 2005) measurement was taken for patient p, then we take this into consideration to define the maximum threshold (lines 3-7).

• PSA-based relapse after radiation therapy, PRT(p): this function also uses the EAU guidelines (Van Den Broeck et al., 2020) to detect whether a PSA-based relapse occurred after radiation therapy. The algorithm searches for the first increase of 2 PSA units from a nadir value.
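The PRT rule stated above (first increase of 2 PSA units from a nadir value) can be sketched as follows. This is a minimal illustration, not the authors' Algorithm 6: the function name `psa_relapse_after_rt` is hypothetical, the nadir is taken to be the running minimum of the post-treatment measurements (an assumption), and PSA is assumed to be expressed in ng/mL.

```python
from datetime import date
from typing import List, Optional, Tuple


def psa_relapse_after_rt(psa_after_rt: List[Tuple[date, float]],
                         rise: float = 2.0) -> Optional[date]:
    """Date of the first PSA value at least `rise` above the running nadir.

    `psa_after_rt` holds (date, PSA) pairs measured after radiation therapy.
    Returns None when no such rise is observed.
    """
    nadir: Optional[float] = None
    for when, value in sorted(psa_after_rt):
        if nadir is not None and value >= nadir + rise:
            return when
        if nadir is None or value < nadir:
            nadir = value  # new lowest value seen so far
    return None


follow_up = [
    (date(2015, 1, 1), 1.5),
    (date(2015, 7, 1), 0.4),
    (date(2016, 1, 1), 1.0),
    (date(2017, 1, 1), 2.5),
    (date(2018, 1, 1), 5.0),
]
relapse_date = psa_relapse_after_rt(follow_up)
```

Here the nadir settles at 0.4, so the reading of 2.5 in January 2017 is the first value at least 2 units above it.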

Going beyond this, our novel BCR detection method is not solely reliant on PSA relapse; instead, it incorporates expert knowledge and translates it into a new tool for detecting BCR based on secondary treatments (Figures 2 and 3). The CRP and CRT functions (Algorithms 2-3) have been developed to identify possible relapses that may have been missed (after an RP or an RT primary treatment, respectively) either due to the absence of PSA tests or because the treating doctor decided on a secondary treatment before the PSA value had crossed the EAU-guideline threshold. The exact approaches used to define clinical relapse after RP and RT primary treatments are described in Figure 2 and Figure 3, respectively.

Algorithm 2: CRP - Clinical Relapse after RP.

Input: p
Output: d
1: L ← [ ]
2: if lastRTDate(p) > firstRPDate(p) then
3:   if lastRTDate(p) − firstRPDate(p) > 1yr then
4:     L ← L + firstRTDateAfterOneYear(p)
5:   end if
6:   if hasHTCT(p) and lastHTCTDate(p) > firstRPDate(p) then
7:     if lastHTCTDate(p) − firstRPDate(p) ≥ 2yr then
8:       L ← L + firstHTCTDateAfterOneYear(p)
9:     end if
10:  end if
11: else
12:   if hasHTCT(p) and lastHTCTDate(p) > firstRPDate(p) then
13:     L ← L + firstHTCTDateAfterRp(p)
14:   end if
15: end if
16: d ← getMin(L)
17: return d


Algorithm 3: CRT - Clinical Relapse after RT.

Input: p
Output: d
1: L ← [ ]
2: if hasRP(p) and lastRPDate(p) > firstRTDate(p) then
3:   L ← L + firstRpDateAfterRt(p)
4: end if
5: if hasSecondRT(p) and secondRTDate(p) − firstRTDate(p) > 1yr then
6:   L ← L + secondRTDate(p)
7: end if
8: if hasHTCT(p) and firstHTCTDate(p) − firstRTDate(p) ≥ 6m then
9:   L ← L + firstHTCTDate(p)
10: end if
11: if hasHTCT(p) and firstHTCTDate(p) − firstRTDate(p) > 3yr then
12:   L ← L + firstHTCTDate(p)
13: end if
14: d ← getMin(L)
15: return d
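Algorithm 3 can be read as collecting candidate relapse dates and returning the earliest one. The sketch below illustrates that reading in Python. The `Patient` container, its field names, and the day-based approximations of "1yr" and "6m" are assumptions introduced for the example; treating `rt_dates` as the start dates of separate RT courses is likewise an assumption.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Optional

ONE_YEAR = timedelta(days=365)    # approximation of "1yr"
SIX_MONTHS = timedelta(days=183)  # approximation of "6m"


@dataclass
class Patient:
    rt_dates: List[date]                                   # RT course start dates (at least one)
    rp_dates: List[date] = field(default_factory=list)     # radical prostatectomy dates
    htct_dates: List[date] = field(default_factory=list)   # hormonal or chemotherapy start dates


def clinical_relapse_after_rt(p: Patient) -> Optional[date]:
    """Earliest secondary-treatment date that suggests relapse after primary RT."""
    courses = sorted(p.rt_dates)
    first_rt = courses[0]
    candidates: List[date] = []

    # Lines 2-4: an RP performed after the primary RT.
    later_rp = [d for d in p.rp_dates if d > first_rt]
    if later_rp:
        candidates.append(min(later_rp))

    # Lines 5-7: a second RT course more than one year after the first.
    if len(courses) > 1 and courses[1] - first_rt > ONE_YEAR:
        candidates.append(courses[1])

    # Lines 8-10: first HT/CT at least six months after the primary RT.
    # (The > 3yr test on lines 11-13 adds the same date, so it is subsumed here.)
    if p.htct_dates:
        first_htct = min(p.htct_dates)
        if first_htct - first_rt >= SIX_MONTHS:
            candidates.append(first_htct)

    return min(candidates) if candidates else None


adjuvant_only = Patient(rt_dates=[date(2010, 1, 1)], htct_dates=[date(2010, 2, 1)])
late_hormonal = Patient(rt_dates=[date(2010, 1, 1)],
                        htct_dates=[date(2012, 5, 1)],
                        rp_dates=[date(2013, 1, 1)])
second_course = Patient(rt_dates=[date(2010, 1, 1), date(2011, 6, 1)])
```

Hormonal therapy started one month after RT is not flagged, whereas hormonal therapy started more than two years later is, and it takes precedence over the later salvage RP because the earliest candidate date is returned.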

The DBCR Algorithm then uses all the outputs of the above functions, namely the dates of possible BCR occurrences, and selects the earliest date (if it exists) as the date of biochemical recurrence for patient p (Algorithm 7 lines 7-10).
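The selection step can be illustrated with a few lines of Python. The function name `earliest_bcr_date` is hypothetical, and the convention that a function returns `None` when it finds no relapse is an assumption of this sketch.

```python
from datetime import date
from typing import Iterable, Optional


def earliest_bcr_date(candidates: Iterable[Optional[date]]) -> Optional[date]:
    """Pick the earliest candidate relapse date, ignoring functions that found none."""
    found = [d for d in candidates if d is not None]
    return min(found) if found else None


# One PSA-based and one clinical candidate date for the same patient.
bcr = earliest_bcr_date([date(2015, 3, 1), None, date(2014, 7, 1)])
```

A patient for whom every function returns `None` is simply classified as not having experienced a BCR.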

2.4 Evaluation

Retrieving missing data is of utmost importance in the pre-processing of EHR data for critical and sensitive applications. Additionally, assessing the quality of imputed data holds significant value as it provides insights into the effectiveness of the methods and algorithms employed. In our study, data evaluation involves a two-tier validation process.

The first level (a.k.a. ‘step-1’ evaluation) employs automated tests, where we verify the accuracy of our algorithms by taking records without missing treatment data, applying the imputation algorithm, and subsequently comparing the outcomes with the recorded treatments.
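One way to picture the ‘step-1’ evaluation is as a set comparison between recorded and detected treatments. The sketch below assumes each source can be reduced to a mapping from patient id to treatment class ("RP" or "RT"); the function name `step1_evaluation`, the dictionary layout and the toy patient ids are illustrative assumptions, not the authors' implementation.

```python
from typing import Dict


def step1_evaluation(db: Dict[str, str], dtx: Dict[str, str]) -> Dict[str, float]:
    """Compare detected treatments (dtx) with EHR-recorded ones (db).

    Both arguments map a patient id to a treatment class; db must be non-empty.
    """
    matched = db.keys() & dtx.keys()   # detected and recorded (DTX ∩ DB)
    new = dtx.keys() - db.keys()       # detected but not recorded (DTX \ DB)
    true_class = {pid for pid in matched if db[pid] == dtx[pid]}
    return {
        "matched": len(matched),
        "detection_rate": len(matched) / len(db),
        "true_class": len(true_class),
        "false_class": len(matched) - len(true_class),
        "new": len(new),
    }


# Toy data for illustration only.
recorded = {"p1": "RP", "p2": "RT", "p3": "RT"}
detected = {"p1": "RP", "p2": "RP", "p4": "RT"}
report = step1_evaluation(recorded, detected)
```

In the toy data, two of the three recorded treatments are detected, one of them with the wrong class, and one detected treatment is new.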

The second level (a.k.a. the ’step-2’ evaluation)

entails expert validation, wherein a random selection

of imputed data is manually inspected by domain ex-

perts, ensuring its correctness.

Due to the absence of biochemical recurrence data

in our EHR, we assessed the effectiveness of the

DBCR algorithms by manually evaluating patient out-

comes and employing descriptive statistics.

Figure 2: BCR definition after radical prostatectomy.

Figure 3: BCR definition after radiation therapy.

3 RESULTS

3.1 Curated Database

The initial phase of this work was to explore the HUS datalake (Bruck, 2023; Pylkäs, 2023) and extract the most accurate and comprehensive prostate cancer data suitable for subsequent medical research applications. As a result, we successfully created a structured and curated database that contains crucial patient information, as defined in Table 3.


Table 2: Evaluation of DTX algorithm performance.

         Available CTx   Estimated CTx   Correct estimated CTx                    New estimated CTx
         DB              DTX             DTX ∩ DB     True-Class   False-Class    DTX \ DB
PID      7563            9725            6962 (92%)   6294 (90%)   668 (10%)      2763 (+27%)
PID-RP   2495            2722            2233 (90%)   1929 (86%)   304 (14%)      489 (+16%)
PID-RT   5068            7003            4729 (93%)   4365 (92%)   364 (8%)       2274 (+31%)

Table 3: The curated data tables.

Data              Number of rows (%)   Number of Patients (%)
T1: Pathology     23,393               12,277
  GG1             6618 (28)            3652 (30)
  GG2             6383 (27)            3313 (27)
  GG3             4747 (20)            2503 (20)
  GG4             2310 (10)            1195 (10)
  GG5             3335 (14)            1614 (13)
T2: Treatment     40,369               9800
  RP              2743 (7)             2742
  RT              18,254 (45)          7248
  HT              15,804 (39)          4088
  CT              3568 (9)             514
T3: PSA           1,424,440            238,399
T4: MRI           20,103               15,807
T5: Medications   13,837,600           290,055

GG1–GG5 = Gleason grade group 1–5 (associated to each pathological entry)
RP=Radical prostatectomy, RT=Radiation therapy, HT=Hormonal therapy, CT=Chemotherapy.

3.2 Treatments Data

Following the data curation and structuring, we have

implemented the DTX algorithm in order to detect

and impute the missing curative treatment data. As

a result, our database now incorporates n=2763 new

PCa-related treatment records, representing a 27% in-

crease compared to the original data found in the HUS

datalake. The number of patients with RP has in-

creased by 16% (n=489), while the number of those

with RT has increased by 31% (n=2274).

In Table 2 we present the results of the ‘step-1’ DTX performance evaluation, i.e., estimated vs. known (EHR-available) treatment records. We record an imputation performance of 92% (n=6962) correctly estimated curative treatments, i.e., treatments estimated using the DTX algorithm that are also found in the existing database. Out of these, 90% (n=6294) are correctly classified as RP or RT, whereas 10% (n=668) are wrongly classified. RP classification was 86% correct, whereas RT classification reached 92%.
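The percentages in Table 2 can be recomputed from its counts with a short calculation. The denominators used below are our reading of the column headers rather than something the text states explicitly: matched treatments are taken relative to the available ones, class accuracy relative to the matched ones, and the "+" percentages relative to the augmented total (available plus new). Under these assumptions the counts agree with the reported figures to within about one percentage point.

```python
from typing import Dict


def table2_rates(available: int, matched: int, true_class: int, new: int) -> Dict[str, float]:
    """Percentages derived from one row of Table 2 (denominators are assumptions)."""
    return {
        "matched_pct": 100 * matched / available,
        "true_class_pct": 100 * true_class / matched,
        "false_class_pct": 100 * (matched - true_class) / matched,
        "new_share_pct": 100 * new / (available + new),
    }


rows = {
    "PID": table2_rates(7563, 6962, 6294, 2763),
    "PID-RP": table2_rates(2495, 2233, 1929, 489),
    "PID-RT": table2_rates(5068, 4729, 4365, 2274),
}

for name, rates in rows.items():
    print(name, {key: round(value, 1) for key, value in rates.items()})
```

Note that the number of wrongly classified treatments follows from the other two counts as 6962 − 6294 = 668.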

The ‘step-2’ evaluation of the DTX algorithm was performed against manual validation by domain experts, who used the entire collection of unstructured reports associated with the test subjects in order to uncover their treatment history. The ‘step-2’ evaluation started by sampling 40 random patients, i.e., 20 random RP + 20 random RT, who were detected by the algorithm as having curative treatments (CTx) that did not appear within the EHR (DTX \ DB in Table 2). The results of this manual validation are summarised in Table 4. Only one patient from the RP group was unverifiable (no data, i.e., the treatment could not be confirmed), while five RT patients were in the same situation. Among the verifiable patients, 95% (18 of 19) of RP patients and 60% (9 of 15) of RT patients were confirmed to have a curative PCa treatment. In total, 79% (27 of 34) of the verifiable sampled patients (whose treatments were not recorded within the EHR) were confirmed to have a PCa curative treatment.

Table 4: Manual validation for DTX algorithm performance.

         Sample   Unverifiable   Verifiable   True CTx   All True CTx
PID-RP   20       1              19 (95%)     18 (95%)   27 (79%)
PID-RT   20       5              15 (75%)     9 (60%)

3.3 BCR Data

Our DBCR algorithm successfully identified 2851 patients (Figures 4 and 5) who developed a biochemical recurrence after a PCa curative treatment. These patients represent 27% of the treated patients.

Among the identified BCR patients, 70% (n=2007) were detected using the PRP and PRT algorithms, which are in accordance with the EAU guidelines (Van Den Broeck et al., 2020).

However, around 30% (n=844) were identified using our new algorithms, CRP and CRT, formulated based on the expertise of our team of clinicians and other contributors to this work. Notably, without these new algorithms, these cases might have gone unnoticed.
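As a quick check, the proportions quoted for the BCR cohort follow directly from the counts n = 2851, n = 2007 and n = 844; the last ratio is the relative gain over PSA-only detection:

```latex
\frac{2007}{2851} \approx 0.70, \qquad
\frac{844}{2851} \approx 0.30, \qquad
\frac{844}{2007} \approx 0.42
```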

Figure 4: BCR detected data: Bar plots representing the

distribution of the time from curative treatment to relapse.


Figure 5: BCR detected data: Box plots of the time from

curative treatment to relapse distribution, by Gleason grade

groups, and by type of curative treatments.

3.4 Applications

After cleaning the data and improving its quality, we

have successfully carried out multiple applications.

The primary application involved developing a data

visualisation tool, enabling clinicians and researchers

to visualise the trajectory of PCa patients, includ-

ing their PSA values, treatments, pathological results,

medical prescriptions, and others (Figure 6).

Figure 6: Helsinki BCR system enables access to prostate

cancer patients’ trajectory and different BCR prediction

models.

Additionally, we have investigated the potential grade inflation in PCa Gleason grade groups during the MRI era; the research focused on patients in Gleason grade groups 1 and 2. This work hypothesised that some patients who would have been assigned to grade group 1 in the pre-MRI era are nowadays, in the MRI era, classified and treated as grade group 2 patients. With sufficient evidence, the work confirmed the hypothesis, which opens serious discussion on reassessing current risk stratification tools and clinical decision-making. Updating guidelines on cancer grading and treatment is crucial to keep them aligned with the precision of modern MRI technology.

Furthermore, we are utilising the curated EHR data to train machine learning models to predict biochemical recurrence within 3 to 10 years of the initial curative treatment. Because prostate cancer is a slow-developing cancer, BCR is one of the most important and accurate surrogates for prostate cancer mortality. Therefore, predicting BCR would have a significant impact on treatment decisions and treatment planning. Our (preliminary) trained models achieved good performance (Accuracy=0.93, AUC=0.93, Precision=0.88) on an internal validation. The models were trained on n=5262 patients who have had PCa curative treatment.

4 DISCUSSION

Ensuring high data quality is essential when building effective AI models and conducting significant statistical analyses (Gudivada et al., 2017). This importance is particularly heightened in clinical research and applications where decisions may directly impact patients’ lives. Electronic Health Records (EHRs), such as those available at HUS Acamedic, play a critical role in this process, making it imperative to develop robust exploration methods to harness the available data.

In our work, we explored, curated, and augmented biomedical data from within Finnish healthcare records, with a specific focus on prostate cancer patients. By establishing a new mining framework and developing novel analysis algorithms, we successfully consolidated our data, enabling us to conduct meaningful and impactful medical research.

One of our approaches was to use the time series data on patients’ PSA levels, a subset of medical data which is typically well collected and curated within EHRs, in order to infer the existence, and the type, of EHR-missing curative treatment events. To our knowledge, this is the first time PSA time series data were used in this way, although the authors of (Bettencourt-Silva et al., 2015) employed a similar approach to generate a completeness score for the overall data quality of the cohort. Based on this approach, we were able to consolidate our EHR by adding approx. 2.8k new curative treatment events, representing a 27% increase from the EHR-available treatment events.

Another important outcome of our mining framework was documenting the status and timing of our PCa patients’ BCR. Unlike previous EHR mining frameworks for PCa medical data, see e.g. (Park et al., 2021b; Park et al., 2021a), we define BCR status based both on PSA-level measurements (after primary curative treatment, i.e., radiation therapy, RT, or radical prostatectomy, RP) and on secondary curative and adjuvant therapies, i.e., PCa-related hormonal therapy and/or chemotherapy. This approach takes into consideration the clinical reality that treating doctors sometimes decide on secondary therapies before the PSA level crosses the threshold established by current EAU guidelines as the BCR level. Using this approach, we accurately captured an additional 844 BCR events (representing a 42% increase over PSA-only detected BCR events), which otherwise would either not have been found at all or would have been given a significantly later timestamp.

One important observation from our EHR data curation and analysis work is that there exists a large amount of redundancy in these data sources. This is particularly observable within the free text written by doctors during medical checkups and/or in lab, pathological, or imaging reports. On the other hand, due to a multitude of factors, including human error, one encounters a significant number of missing data entries when focusing on only one particular type of data source at a time, such as lab results, pathological reports, or even surgery records.

Therefore, leveraging the data redundancy in EHRs not only makes it possible to recover these missing data entries but also helps to validate and assess the outcomes of our algorithms. This is why a “data investigation” approach, such as the one described in this manuscript, is more relevant than classical “data imputation” methods. Indeed, the latter approaches provide only average-like behaviours and are completely ineffective in detecting missing events, such as a radiation treatment event altogether missing from the EHR.

Closely connected to the above reasoning, one cannot overlook the potential impact that Large Language Models (LLMs) could have in detecting and augmenting the existing EHR data (Thirunavukarasu et al., 2023). Such models could be employed to extract (from the free text provided by doctors) relevant information such as missing events, e.g., treatments performed in different clinics, cities, or even countries, or information that is usually not structurally recorded within EHRs, e.g., family history, use of alcohol and tobacco products, general health status of the patient, etc. No LLM was employed in the current EHR data analysis; however, the approach is being actively investigated for future use within our models.

5 CONCLUSION

This work demonstrates the challenges of mining Finnish electronic health records for prostate cancer (PCa) research, as well as the opportunities it offers in gaining valuable insights. Our methodology, when applied to the HUS datalake, enabled the detection of missing treatments and biochemical recurrences (BCR), which led to a range of clinically relevant findings, including patients’ timeline histories, the Gleason grade group inflation finding, and the BCR classification models. The results of our framework highlight the potential of EHR data mining to advance PCa research and guide personalised patient care.

ACKNOWLEDGEMENTS

This work was supported by grants from the Can-

cer Society Finland, the Academy of Finland, Jane

and Aatos Erkko Foundation, and State funding for

university-level health research. It is a joint effort of

doctoral students, senior researchers, and clinicians at

the University of Helsinki and the University Hospital

of Helsinki.

REFERENCES

Ajmal, S., Ahmed, A. A. I., and Jalota, C. (2023). Natural Language Processing in Improving Information Retrieval and Knowledge Discovery in Healthcare Conversational Agents. Journal of Artificial Intelligence and Machine Learning in Management, 7(1):34–47.

Artibani, W., Porcaro, A. B., De Marco, V., Cerruto, M. A.,

and Siracusano, S. (2018). Management of Biochem-

ical Recurrence after Primary Curative Treatment for

Prostate Cancer: A Review. Urologia Internationalis,

100(3):251–262.

Bettencourt-Silva, J. H., Clark, J., Cooper, C. S., Mills, R.,

Rayward-Smith, V. J., and de la Iglesia, B. (2015).

Building data-driven pathways from routinely col-

lected hospital data: A case study on prostate cancer.

JMIR Med Inform, 3(3):e26.

Bruck, O. (2023). Welcome to the hospital dis-

trict of helsinki and uusimaa (hus) hematological

subdatalake data catalogue. Available online at:

https://www.oscarbruck.ﬁ/datalake/. Accessed 2023-

08-22.

De La Torre-D

ıez, I., Gonz

alez, S., and L

opez-Coronado,

M. (2013). EHR Systems in the Spanish Public Health

National System: The Lack of Interoperability be-

tween Primary and Specialty Care. Journal of Medical

Systems, 37(1):9914.

HEALTHINF 2024 - 17th International Conference on Health Informatics

Gudivada, V., Apon, A., and Ding, J. (2017). Data quality considerations for big data and machine learning: Going beyond data cleaning and transformations. International Journal on Advances in Software, 10(1):1–20.

Henkel, M., Horn, T., Leboutte, F., Trotsenko, P., Dugas, S. G., Sutter, S. U., Ficht, G., Engesser, C., Matthias, M., Stalder, A., Ebbing, J., Cornford, P., Seifert, H., Stieltjes, B., and Wetterauer, C. (2022). Initial experience with AI Pathway Companion: Evaluation of dashboard-enhanced clinical decision making in prostate cancer screening. PLOS ONE, 17(7):e0271183.

Herp, J., Braun, J.-M., Cantuaria, M. L., Tashk, A., Pedersen, T. B., Poulsen, M. H. A., Krogh, M., Nadimi, E. S., and Sheikh, S. P. (2023). Modeling of electronic health records for time-variant event learning beyond

bio-markers—a case study in prostate cancer. IEEE

Access, 11:50295–50309.

Holmes, J. H., Beinlich, J., and Boland, M. R. (2021). Why Is the Electronic Health Record So Challenging for Research and Clinical Care? Methods of Information in Medicine, 60(1-02):32–48.

Javaid, M., Haleem, A., Singh, R. P., Suman, R., and Rab, S. (2022). Significance of machine learning in healthcare: Features, pillars and applications. International Journal of Intelligent Networks, 3:58–73.

Knighton, A. J., Belnap, T., Brunisholz, K., Huynh, K., and Bishoff, J. T. (2016). Using Electronic Health Record Data to Identify Prostate Cancer Patients That May Qualify for Active Surveillance. EGEMS (Washington, DC), 4(3):1220.

Kong, A., Liu, J. S., and Wong, W. H. (1994). Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association, 89(425):278–288.

Kuorttinen, E. (2023). HUS Acamedic - secure operating environment. Available online at: https://www.hus.fi/en/research-and-education/hus-acamedic-secure-operating-environment. Accessed 2023-08-22.

Misukka, M. (2022). Standardizing electronic health records in order to advance secondary use of hospital data lakes - A case study on HUS data lake. Master’s thesis, Aalto University. School of Science.

National Cancer Institute (2023). The natural history of prostate cancer. Available online at: https://www.cancer.gov/types/prostate. Accessed 2023-08-03.

Park, J., Rho, M. J., Moon, H. W., Kim, J., Lee, C., Kim, D., Kim, C.-S., Jeon, S. S., Kang, M., and Lee, J. Y. (2021a). Dr. Answer AI for prostate cancer: Predicting biochemical recurrence following radical prostatectomy. Technology in Cancer Research & Treatment, 20.

Park, J., Rho, M. J., Moon, H. W., Park, Y. H., Kim, C.-S., Jeon, S. S., Kang, M., and Lee, J. Y. (2021b). Prostate cancer trajectory-map: clinical decision support system for prognosis management of radical prostatectomy. Prostate International, 9(1):25–30.

Pylkäs, J. (2023). HUS facilitates clinical data exploitation through data lake. Available online at: https://www.tietoevry.com/en/success-stories/2019/hus-facilitates-clinical-data-exploitation-through-an-integrated-hus-datalake-solution/. Accessed 2023-08-22.

Schafer, J. L. (1999). Multiple imputation: a primer. Statistical Methods in Medical Research, 8(1):3–15. PMID: 10347857.

Seneviratne, M. G., Banda, J. M., Brooks, J. D., Shah, N. H., and Hernandez-Boussard, T. M. (2018). Identifying Cases of Metastatic Prostate Cancer Using Machine Learning on Electronic Health Records. AMIA ... Annual Symposium proceedings. AMIA Symposium, 2018:1498–1504.

Shen, S., Lepor, H., Yaffee, R., and Taneja, S. S. (2005). Ultrasensitive serum prostate specific antigen nadir accurately predicts the risk of early relapse after radical prostatectomy. Journal of Urology, 173(3):777–780.

Stephenson, A. J., Kattan, M. W., Eastham, J. A., Dotan, Z. A., Bianco, F. J., Lilja, H., and Scardino, P. T. (2006). Defining biochemical recurrence of prostate cancer after radical prostatectomy: A proposal for a standardized definition. Journal of Clinical Oncology, 24(24):3973–3978. PMID: 16921049.

The American Cancer Society medical and editorial content team (2019). Prostate cancer. Available online at: https://www.cancer.org/cancer/prostate-cancer/about/what-is-prostate-cancer.html. Accessed 2023-08-03.

Thirunavukarasu, A. J., Ting, D. S. J., Elangovan, K., Gutierrez, L., Tan, T. F., and Ting, D. S. W. (2023). Large language models in medicine. Nature Medicine, 29(8):1930–1940.

Van Den Broeck, T., Van Den Bergh, R. C., Briers, E., Cornford, P., Cumberbatch, M., Tilki, D., De Santis, M., Fanti, S., Fossati, N., Gillessen, S., Grummet, J. P., Henry, A. M., Lardas, M., Liew, M., Mason, M., Moris, L., Schoots, I. G., Van Der Kwast, T., Van Der Poel, H., Wiegel, T., Willemse, P.-P. M., Rouvière, O., Lam, T. B., and Mottet, N. (2020). Biochemical Recurrence in Prostate Cancer: The European Association of Urology Prostate Cancer Guidelines Panel Recommendations. European Urology Focus, 6(2):231–234.

Wald, A. (1949). Note on the consistency of the maximum likelihood estimate. The Annals of Mathematical Statistics, 20(4):595–601.

Yadav, P., Steinbach, M., Kumar, V., and Simon, G. (2018). Mining Electronic Health Records (EHRs): A Survey. ACM Computing Surveys, 50(6):1–40.

Zhao, Y., Tao, Z., Li, L., Zheng, J., and Chen, X. (2022). Predicting biochemical-recurrence-free survival using a three-metabolic-gene risk score model in prostate cancer patients. BMC Cancer, 22(1):239.


APPENDIX

Algorithm 4: DTX - Missing treatments detection.

Input: PATIENTS_LIST

Output: L

1: L ← [ ]
2: for all p in PATIENTS_LIST do
3:     PSA ← getPsa(p)
4:     Tx ← getTreatments(p)
5:     (d_max, d_min, PSA_min) ← SIGDROP(PSA)
6:     if TxExists(d_max, d_min, PSA_min, Tx) = False then
7:         if PSA_min < 0.1 then
8:             tx_type ← 'RP'
9:         else
10:            tx_type ← 'RT'
11:        end if
12:        L ← L + (p, tx_type, drop_date)
13:    end if
14: end for
15: return L
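The control flow of Algorithm 4 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in this work: the record layout, the name detect_missing_treatments, and the choice to report d_min as the drop date are illustrative assumptions, and SIGDROP and TxExists are passed in as callables because their definitions are not restated in this appendix. The sketch further assumes that the drop detector may return None when no significant drop is found.

```python
from datetime import date
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# Hypothetical record layouts (assumptions, not taken from the paper).
PsaSeries = Sequence[Tuple[date, float]]   # (sample date, PSA value)
Treatments = Sequence[Tuple[date, str]]    # (treatment date, treatment type)
Drop = Tuple[date, date, float]            # (d_max, d_min, psa_min)


def detect_missing_treatments(
    patients: Dict[str, Tuple[PsaSeries, Treatments]],
    sig_drop: Callable[[PsaSeries], Optional[Drop]],
    tx_exists: Callable[[date, date, float, Treatments], bool],
    rp_threshold: float = 0.1,
) -> List[Tuple[str, str, date]]:
    """Flag PSA drops that no recorded treatment explains (Algorithm 4 sketch)."""
    found: List[Tuple[str, str, date]] = []
    for patient_id, (psa, treatments) in patients.items():
        drop = sig_drop(psa)  # stands in for SIGDROP
        if drop is None:      # assumption: no significant drop detected
            continue
        d_max, d_min, psa_min = drop
        if not tx_exists(d_max, d_min, psa_min, treatments):
            # A minimum below the threshold is labelled RP, otherwise RT.
            tx_type = "RP" if psa_min < rp_threshold else "RT"
            # Assumption: the date of the PSA minimum is used as the drop date.
            found.append((patient_id, tx_type, d_min))
    return found
```

Passing the two helpers as parameters keeps the sketch independent of how a significant drop or a matching treatment is defined, so either definition can be swapped without touching the loop.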

Algorithm 5: PRP - PSA Relapse after RP.

Input: p

Output: d

1: PSA ← getPsaAfterRp(p)
2: for psa in PSA do
3:     if usp(psa) = TRUE then
4:         th ← 0.2
5:     else
6:         th ← 0.4
7:     end if
8:     if psa > th then
9:         d ← getDate(psa)
10:        return d
11:    end if
12: end for
13: return NULL
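A minimal Python sketch of Algorithm 5 is given below. The PsaMeasurement record and the function name are hypothetical, the ultrasensitive flag is an assumed stand-in for the usp() check, and the measurements are assumed to be restricted to post-RP samples and processed in chronological order, which the pseudocode does not state explicitly.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class PsaMeasurement:
    """One post-RP PSA result (hypothetical layout, not from the paper)."""
    taken_on: date
    value: float          # PSA value, in the same unit as the thresholds
    ultrasensitive: bool  # assumed stand-in for the usp() check


def psa_relapse_after_rp(measurements: Iterable[PsaMeasurement]) -> Optional[date]:
    """Return the date of the first PSA value above its threshold, else None."""
    # Assumption: measurements are evaluated in chronological order.
    for m in sorted(measurements, key=lambda m: m.taken_on):
        threshold = 0.2 if m.ultrasensitive else 0.4
        if m.value > threshold:
            return m.taken_on
    return None
```

Returning None mirrors the NULL result of the pseudocode, so a caller can tell "no relapse detected" apart from any valid date.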

Algorithm 6: PRT - PSA Relapse after RT.

Input: p

Output: d

1: PSA ← getPsaAfterRt(p)
2: nadir ← getMax(PSA)
3: for psa in PSA do
4:     if nadir > psa then
5:         nadir ← psa
6:     end if
7:     inc ← psa − nadir
8:     if inc > 2 then
9:         d ← getDate(psa)
10:        return d
11:    end if
12: end for
13: return NULL
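The running-nadir rule of Algorithm 6 can be sketched in Python as follows. The tuple layout and the function name are assumptions made for illustration, and the series is assumed to contain only post-RT samples, sorted chronologically before the scan. The sketch initialises the nadir with the first value instead of the series maximum; after the first iteration both choices yield the same nadir.

```python
from datetime import date
from typing import Optional, Sequence, Tuple

# Hypothetical layout: (sample date, PSA value) pairs taken after RT.
PsaSeries = Sequence[Tuple[date, float]]


def psa_relapse_after_rt(series: PsaSeries, rise: float = 2.0) -> Optional[date]:
    """Return the first date on which PSA exceeds the running nadir by more than `rise`."""
    if not series:
        return None
    # Assumption: measurements are evaluated in chronological order.
    ordered = sorted(series, key=lambda item: item[0])
    nadir = ordered[0][1]
    for taken_on, value in ordered:
        nadir = min(nadir, value)   # lowest PSA value seen so far
        if value - nadir > rise:    # strict inequality, as in the pseudocode
            return taken_on
    return None
```

Because the nadir is updated before the comparison, a value can only trigger a relapse relative to a strictly earlier, lower measurement.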

Algorithm 7: DBCR.

Input: TREATED_PATIENTS

Output: L_bcr

1: L_bcr ← [ ]
2: for all p in TREATED_PATIENTS do
3:     d_1 ← PRP(p)
4:     d_2 ← CRP(p)
5:     d_3 ← PRT(p)
6:     d_4 ← CRT(p)
7:     if allAreNULL(d_1, d_2, d_3, d_4) = FALSE then
8:         bcr_date ← getMin(d_1, d_2, d_3, d_4)
9:         new_bcr ← (p, bcr_date)
10:        L_bcr ← L_bcr + new_bcr
11:    end if
12: end for
13: return L_bcr
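The aggregation step of Algorithm 7 can be sketched in Python as shown below. This is an illustrative sketch only: patients are represented by hypothetical string identifiers, and the four detectors (PRP, CRP, PRT, CRT) are passed in as callables because CRP and CRT are not defined in this appendix. Each detector is assumed to return a date, or None when it finds no recurrence.

```python
from datetime import date
from typing import Callable, Iterable, List, Optional, Sequence, Tuple

# Assumption: a detector maps a patient identifier to a relapse date or None.
Detector = Callable[[str], Optional[date]]


def detect_bcr(
    patients: Iterable[str],
    detectors: Sequence[Detector],
) -> List[Tuple[str, date]]:
    """Keep, for each patient, the earliest date reported by any detector."""
    results: List[Tuple[str, date]] = []
    for patient in patients:
        dates = [d for d in (detect(patient) for detect in detectors) if d is not None]
        if dates:  # at least one detector reported a recurrence
            results.append((patient, min(dates)))
    return results
```

Filtering out None before taking the minimum plays the role of the allAreNULL check: a patient is added only when at least one detector returns a date.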
