Evaluation of the Synthetic Electronic Health Records
Emily Muller¹, Xu Zheng² and Jer Hayes²
¹Department of Epidemiology and Biostatistics, Imperial College London, U.K.
²Accenture Labs, Dublin, Ireland
Keywords:
Generative Model, Medical Data Synthesis, Synthetic Data Evaluation.
Abstract:
Generative models have been found effective for data synthesis due to their ability to capture complex underlying data distributions. The quality of the data generated by these models is commonly evaluated by visual inspection for image datasets or by downstream analytical tasks for tabular datasets. These evaluation methods neither measure the implicit data distribution nor consider data privacy issues, and it remains an open question how to compare and rank different generative models. Medical data can be sensitive, so it is of great importance to address the privacy concerns of patients while maintaining the utility of the synthetic dataset. Beyond utility evaluation, this work outlines two metrics, Similarity and Uniqueness, for sample-wise assessment of synthetic datasets. We demonstrate the proposed notions with several state-of-the-art generative models used to synthesise Cystic Fibrosis (CF) patients' electronic health records (EHRs), observing that the proposed metrics are suitable for synthetic data evaluation and generative model comparison.
1 INTRODUCTION
Data is constantly generated in the field of medicine
from sources such as biosensors, physiological mea-
surements, genome sequencing and electronic health
records (EHRs). Despite this, data on specific sub-populations may be scarce: aggregating data is expensive if the data is proprietary; data may be rendered inaccessible by interoperability standards in the sharing of health data; or sharing may be illegal because it has the potential to violate privacy. To overcome these
issues, synthetic data are increasingly being used in
the healthcare setting [Chen et al., 2021]. A de-
sirable data synthesis algorithm in the medical do-
main should be able to generate new samples that pre-
serve the original data distribution while adhering to
properties of privacy. Preserving the data distribution maintains good utility of the synthetic data in downstream analytical tasks, while maintaining the privacy level ensures that the synthetic data do not leak information about any single individual.
Deep generative models, such as generative ad-
versarial networks (GANs) and variational autoen-
coders (VAEs) [Xu and Veeramachaneni, 2018, Choi
et al., 2017, Xu et al., 2019, Xie et al., 2018, Yoon
et al., 2020], have been found effective for synthe-
sising medical datasets. These models implicitly pa-
rameterise the multivariate distribution of the origi-
nal data using deep neural networks. In [Fiore et al.,
2019] a GAN is used to synthesise the underrepre-
sented class of fraudulent credit card cases, observing
a maximum increase in classification sensitivity of 3.4
percentage points when augmenting the small class 3
times its size (0.55% of the training set). A conditional GAN has also been shown to achieve optimal performance on a set of 22 tabular datasets [Douzas and Bacao, 2018]. In an application augmenting thermal comfort classes, the authors find two experiments in which synthetic data alone attains a higher F1 score than the original training data; similar to [Fiore et al., 2019], the under-represented class is synthesised. Other
work has shown the efficacy of generative networks
above traditional methods, such as the Synthetic Mi-
nority Over-sampling Technique (SMOTE) and its
variants [Liu et al., 2019, Ngwenduna and Mbuvha,
2021, Engelmann and Lessmann, 2020].
Although the high capacity of generative models
makes them good candidates for capturing complex
non-linear distributions in the data, their intractable
likelihood functions make evaluation difficult. In
this work, we propose Similarity and Uniqueness metrics for sample-wise evaluation of synthesised Cystic Fibrosis EHRs. In our experiments, we observe increased predictive performance when augmenting the given datasets with synthetic data, and the proposed metrics provide a faithful evaluation of this synthetic data.
2 METHOD
We propose to compare generative models from mul-
tiple perspectives, including Uniqueness, Similarity
and Utility. In this section, we detail the feature ex-
traction procedure of the data, definitions of these
evaluation metrics and generative model selection.
2.1 Feature Extraction
Cystic Fibrosis (CF) is a rare disease that gives rise to
different forms of lung dysfunction, eventually lead-
ing to progressive respiratory failure. It is a complex
disease, and the types and severity of symptoms can
differ widely from person to person. In our work, we extract 10,074 CF patients from the IBM Explorys database, representing about one third of the 31,199 CF patients in the US [USCFF, 2020]. Patients belong to two subgroups:
having died or having received a lung transplant, la-
beled by value 0; or having survived, labeled by value
1. We remove all samples with no diagnosis codes
and duplicates to enhance synthetic diversity. Our
final dataset has 3184 patients, with 80% belong-
ing to the survived subgroup. The EHRs of these pa-
tients are then aggregated over time. For each patient,
we assign value 1 to features that have appeared in the medical history and value 0 to those that have never appeared. The medical data is thus represented as a binary matrix where each row corresponds to a patient and each column to a medical feature. The survival outcome is predicted from these medical features, which include comorbidity, lung infection and therapy variables.
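As a concrete illustration of this aggregation step, the sketch below (our own illustration, with hypothetical column names rather than the actual Explorys schema) converts event-level records into the binary patient-by-feature matrix described above.

```python
import pandas as pd

# events: one row per recorded medical code, with columns
# "patient_id" and "code" (illustrative names, not the Explorys schema).
def to_binary_matrix(events: pd.DataFrame) -> pd.DataFrame:
    # Count occurrences of each code per patient, then binarise:
    # 1 if the code ever appears in the patient's history, 0 otherwise.
    counts = pd.crosstab(events["patient_id"], events["code"])
    return (counts > 0).astype(int)
```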
2.2 Uniqueness
Privacy assurances are essential to prevent the leakage
of personal information. However, a synthetic data
generator can achieve perfect evaluation scores by
simply copying the original training data, thus break-
ing the privacy guarantee. Differential privacy (DP)
is one well-known and commonly researched assur-
ance [Dwork et al., 2014]. DP algorithms limit how much the output can differ depending on whether a single input is included, how much one can learn about a person because their data was included, and how confidently one can determine whether someone's data was included. In practice, various distance-based metrics are used to provide such assurances. In [Alaa et al., 2021], the authors deem a generated sample authentic if the real sample closest to it is closer to another real sample than to the generated sample. Extending this to the case of binary variables, we could use the Hamming distance.
However, since our data is de-personalised and non-
identifiable, we assess our generators based on how
many exact copies are made. We consider the require-
ment of privacy as Uniqueness: to not simply copy
the input data. We calculate the Uniqueness of each
model by generating a large finite number of samples
and reporting the percentage of overlap with the orig-
inal training data.
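For exact-match binary data such as ours, this reduces to set-membership tests over rows. A minimal sketch (our own, assuming the binary matrices of Section 2.1 as NumPy arrays) of the copy rate and the number of unique novel samples is:

```python
import numpy as np

def uniqueness_report(train: np.ndarray, synth: np.ndarray) -> dict:
    # Hashable row representation for exact-match lookups on binary data.
    train_rows = {tuple(row) for row in train}
    synth_rows = [tuple(row) for row in synth]

    copies = sum(row in train_rows for row in synth_rows)       # exact copies of training samples
    novel = [row for row in synth_rows if row not in train_rows]
    unique_novel = len(set(novel))                               # novel samples after deduplication

    return {
        "copy_rate": copies / len(synth_rows),                   # proportion overlapping the training data
        "novel_duplicates": len(novel) - unique_novel,
        "unique_novel": unique_novel,
    }
```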
2.3 Similarity
It is hard to measure the Similarity between the syn-
thetic and original datasets with one score because of
the multiple features and data types within the data.
In our work, Similarity is measured with four sub-metrics: precision, recall, density and coverage.
In [Sajjadi et al., 2018], precision and recall metrics are proposed to measure the output of generative models. Precision measures fidelity: the degree to which generated samples resemble the original data. Recall, in turn, measures diversity: whether generated samples cover the full variability of the original data. The latter is particularly useful
for evaluating generative models prone to mode col-
lapse. Precision is defined as the proportion of the
synthetic probability distribution that can be gener-
ated from the original probability distribution, thus
measuring fidelity, and recall symmetrically defines
diversity. Precision and recall are effectively calculated as the proportion of samples that lie within the support of the comparative distribution, which assumes uniform density across the data. Alternative metrics, such as density and coverage, have therefore been proposed to ameliorate this issue [Kynkäänniemi et al., 2019, Naeem et al., 2020, Alaa et al., 2021]. We employ the definition and implementation of the density and coverage metrics from [Naeem et al., 2020] for a more accurate Similarity evaluation. Density and coverage address shortcomings of precision and recall: their lack of robustness to outliers, failure to detect matching distributions and inability to diagnose different types of distribution failure.
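For reference, the following is a minimal from-scratch sketch of the density and coverage definitions in [Naeem et al., 2020], not the reference implementation used in our experiments; the Euclidean metric could be replaced with the Hamming distance for binary data.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.neighbors import NearestNeighbors

def density_coverage(real: np.ndarray, fake: np.ndarray, k: int = 5):
    # Radius of the k-NN ball around each real sample, computed among the real
    # samples themselves (k + 1 neighbours because the nearest is the point itself).
    nn = NearestNeighbors(n_neighbors=k + 1).fit(real)
    radii = nn.kneighbors(real)[0][:, -1]

    # inside[i, j] is True when fake sample j falls inside the ball around real sample i.
    inside = cdist(real, fake) < radii[:, None]

    density = inside.sum() / (k * fake.shape[0])   # how densely fakes populate real neighbourhoods
    coverage = inside.any(axis=1).mean()           # fraction of real samples whose ball contains a fake
    return density, coverage
```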
2.4 Utility
To empirically validate the Utility of the generated dataset, we introduce two training/testing settings. Setting A: train the predictive models on the synthetic training set and test their performance on the testing set. Setting B: since the original dataset is imbalanced, train on the training set augmented with synthetic data to balance the classes, then test on
the testing set. We perform a 5-fold cross-validation,
sampling each fold with a proportional representation
of each class. After training the synthesisers (de-
Figure 1: (a) Authenticity of 100k samples from each generator. Grey is the proportion of samples that appear in the original training data; light purple is samples that do not appear in the original training data; and darker purple represents those that are unique. (b) Similarity metrics for each model. For each fold, a dataset matching the size of the original fold with the equivalent proportion of classes is sampled from the unique synthetic dataset (dark purple only). This is repeated 10 times, and similarity metrics show the mean and standard deviation over folds and repetitions. (c) AUC-ROC of classifiers trained on synthetic data only.
scribed below), we obtain unique synthetic datasets
for each fold and model. Setting A was implemented
as follows: for each fold, we sampled a synthetic
dataset the size of the original dataset with equiv-
alent class proportions to calculate Utility. We re-
peated this ten times resulting in 50 measurements for
each model for which we report the mean and stan-
dard deviation. For each synthetic dataset, we trained
the following classification models: Support Vector Machine, Logistic Regression, Naive Bayes, K-Nearest Neighbours and Random Forest. We chose the classifier that maximised the area under the receiver operating characteristic curve (AUC-ROC) for the survived subgroup (label 1) on the holdout test set and compared this to training with the original data. In addition to testing utility on the synthetic dataset alone, we also augmented the original dataset to obtain class balance (Setting B). We performed the same utility experiment outlined above, reporting both AUC-ROC and accuracy for the balanced, augmented training set.
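A sketch of the Setting A protocol is given below. It is our own illustration rather than the exact experimental code; `sample_synthetic` is a hypothetical helper that draws a synthetic dataset of the requested size and class ratio from a trained generator.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def setting_a_utility(X, y, sample_synthetic, n_repeats=10, seed=0):
    """Setting A: train on synthetic data, evaluate AUC-ROC on real held-out folds.

    sample_synthetic(n, positive_ratio) is a hypothetical helper returning
    (X_syn, y_syn) of size n with the requested proportion of the positive class.
    """
    classifiers = [SVC(probability=True), LogisticRegression(max_iter=1000),
                   BernoulliNB(), KNeighborsClassifier(), RandomForestClassifier()]
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        positive_ratio = y[train_idx].mean()
        for _ in range(n_repeats):
            X_syn, y_syn = sample_synthetic(len(train_idx), positive_ratio)
            fold_scores = []
            for clf in classifiers:
                clf.fit(X_syn, y_syn)
                prob = clf.predict_proba(X[test_idx])[:, 1]
                fold_scores.append(roc_auc_score(y[test_idx], prob))
            scores.append(max(fold_scores))  # keep the best classifier, as in the text
    return np.mean(scores), np.std(scores)
```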
2.5 Synthetic Data Generators
We implemented several SOTA generative models, in-
cluding variational autoencoders (VAE), Conditional
GAN (CTGAN) [Xu et al., 2019] and Differentially
Private GAN (DPGAN) [Xie et al., 2018] with two
privacy levels. DPGAN001 has sigma 0.01 and gra-
dient clip 0.1, while DPGAN050 has sigma 0.5 and
gradient clip 0.05. We mainly pick these two ex-
treme privacy settings for the DPGAN model to ex-
plore how noise levels impact synthetic datasets. For
these models, we attempted to preserve their original
architecture as published and adjusted the hyperpa-
rameters using grid search, with the optimisation ob-
jective to maximise the sum of Similarity metrics, in-
cluding precision, recall, density and coverage. VAE
is trained with two hidden layers in the encoder and
decoder and ELBO loss minimisation. DPGAN is
trained as a typical minimax game between the dis-
criminator and generator with noise injected into the
gradient during training with the cross-entropy de-
rived loss. CTGAN addresses sparsity and imbalanced categorical columns via training-by-sampling: a conditional vector is sampled during training and the generator learns the corresponding conditional distribution.
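To make the DPGAN training procedure concrete, the following sketch (our simplification, not the original implementation) shows the noisy discriminator update: gradients are clipped and perturbed with Gaussian noise before the optimiser step. A full DP-SGD treatment clips per-example gradients, whereas this per-batch version only illustrates the idea; the clip and sigma defaults match the DPGAN001 setting described above.

```python
import torch
import torch.nn as nn

def noisy_discriminator_step(discriminator: nn.Module,
                             optimiser: torch.optim.Optimizer,
                             loss: torch.Tensor,
                             clip: float = 0.1,
                             sigma: float = 0.01) -> None:
    """One DPGAN-style discriminator update: clip gradients, add Gaussian noise, step."""
    optimiser.zero_grad()
    loss.backward()
    # Bound the overall gradient norm (per-batch simplification of per-example clipping).
    torch.nn.utils.clip_grad_norm_(discriminator.parameters(), max_norm=clip)
    # Inject noise calibrated to the clipping bound into each gradient.
    for p in discriminator.parameters():
        if p.grad is not None:
            p.grad.add_(torch.randn_like(p.grad) * sigma * clip)
    optimiser.step()
```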
3 RESULTS
After hyperparameter tuning and model optimisation
for each of the 5 separate training folds, we gener-
ated 100K synthetic samples from each model. We
post-processed each synthetic dataset to obtain only
unique samples. For each fold, we sampled a syn-
thetic dataset the size of the original dataset with
equivalent class proportions to calculate Similarity.
We repeated this 10 times resulting in 50 measure-
ments for each metric and model for which we report
the mean and standard deviation in the plots.
3.1 Uniqueness
Figure 1a shows the average per-fold proportion of duplicated data from the generation of 100,000 samples. Of the 100,000 generated samples, VAE has the highest duplication rate of the original training data (67%), followed by CTGAN (34%), DPGAN001 (32%) and DPGAN050 (18%), as shown in grey. Although DPGAN050 has the lowest duplication rate of the original training data, its novel synthesised samples are largely duplicated (approximately 71k duplicates, shown in light purple), compared to 26k, 55k and 31k for VAE, DPGAN001 and CTGAN, respectively. CTGAN generates the largest number of unique samples, at an average of 33,689, and VAE the fewest, with 6,647 (dark purple). Both DPGAN models generate
Figure 2: Performance of original data augmented with synthetic samples: (a) AUC-ROC, (b) accuracy, (c) precision and (d) recall. For each fold, synthetic samples are randomly drawn to augment the small class; where there are not enough samples of the minor class, the large class is randomly downsampled to achieve class balance. Gaussian Naive Bayes, Random Forest, Neural Network, SVM and Logistic Regression classifiers are each trained to predict the hold-out test set outcome, and the results that maximise the performance metrics are reported, averaged over each fold and repeated 10 times. The average number of samples per class is 2128, 419, 1029, 1908, 1974 and 2128 for Upsampling, Downsampling, VAE, DPGAN050, DPGAN001 and CTGAN, respectively.
a similar number of unique samples (12,481 for DPGAN001 and 11,288 for DPGAN050).
3.2 Similarity
Random subsets from each unique synthetic dataset
are sampled, and Similarity with the original data is
computed (Figure 1b). Batches are sampled to match the
size of the training data while preserving the outcome
class ratio. Synthesiser VAE has the highest precision
(0.95 ± 0.01) and density (0.63 ± 0.03). Given that
this model has the largest copying rate of the original
data, its unique data also lies closest to the original
data, preserving fidelity. Synthesiser CTGAN has the
lowest fidelity (both precision 0.37±0.04 and density
0.13 ± 0.02) with the original training data; however,
it attains the highest recall (0.67 ± 0.02). CTGAN
displays the largest degradation (0.31 ± 0.02) when measuring diversity using coverage, placing it joint last with DPGAN050 (0.31 ± 0.02), which suggests that many
of the diverse samples generated by CTGAN are con-
sidered outliers (they do not match the density of the
original distribution). VAE also displays a marked re-
duction in diversity compared to both DPGAN mod-
els when measured with recall (0.58 ± 0.03) versus
coverage (0.43 ± 0.01). DPGAN001 scores consistently higher than DPGAN050 across all Similarity metrics (0.64 ± 0.09 vs 0.54 ± 0.05 for precision, 0.46 ± 0.12 vs 0.31 ± 0.02 for recall, 0.28 ± 0.05 vs 0.22 ± 0.02 for density, and 0.44 ± 0.10 vs 0.31 ± 0.12 for coverage).
3.3 Utility
None of the synthetic datasets reaches an AUC-ROC
that is higher than the original training data in the ex-
periments with Setting A. VAE has the highest AUC
of all synthetic datasets, which is in line with its rank-
ing as the highest across similarity metrics. The model ranking by the sum of Similarity metrics is VAE, DPGAN001, CTGAN and DPGAN050, which does not follow the AUC-ROC scores. We also compared the performance measures of a dataset balanced by upsampling and downsampling to those of a dataset balanced by augmentation with synthetic samples (Setting B). The
ordering of AUC-ROC scores is preserved from the
synthetic data only classification task (Figure 2a ver-
sus Figure 1c) with a smaller difference from the orig-
inal baseline performance. In contrast, synthetically
augmented data from VAE and CTGAN outperform
the baseline models when measured using accuracy
(Figure 2b). This increased performance is largely
driven by recall (true positive rate) for the survived
subgroup (Figure 2d).
Figure 3: The top row, panels (a)-(e), shows heatmaps of the binary matrices of sampled examples for Oversampled, VAE, DPGAN050, DPGAN001 and CTGAN, where columns represent features and rows represent samples. Class 0, the top half of each matrix, represents examples sampled from the dataset augmented with synthetic data, and Class 1, the bottom half, represents examples sampled from the original dataset. The balanced datasets serve as input for the classification algorithms as per the RESULTS section. The bottom row, panels (f)-(j), shows 2D PCA embeddings of the same data.
4 DISCUSSION
The increased predictive performance with synthet-
ically augmented data from VAE and CTGAN is
driven by recall, indicating that these models can better identify the positive class with little degradation of precision. CTGAN produces the largest number of unique samples and shows the greatest diversity when measured using recall. This diversity lends itself to extracting generalisable features that predict each class. While VAE produces far fewer unique samples, it retains di-
versity in its samples. To better understand the na-
ture of the signal identified by these models, we have
plotted the heatmap of the binary feature matrix for
one fold (see Figure 3). In the top row of figures,
the lower half of the heatmap comes from the origi-
nal distribution (marked as class 1). In contrast, the
upper half is original data augmented with synthetic
data for class balance (marked as class 0). Here we
observe that synthesiser VAE has largely exaggerated
the signal for features in augmented data (class 0),
which has resulted in greater separability in the in-
put space, as shown by the two-dimensional PCA reduction. Both DPGAN models display reduced sepa-
rability, with DPGAN001 reporting a lower accuracy.
The heatmaps go some way in explaining this since
there does not appear to be a marked difference in the
upper and lower halves of the DPGAN001 heatmap.
DPGAN050, however, does show some exaggerated
features for class 0. Visually, CTGAN's heatmap appears most similar to that of the upsampled original data while obtaining greater separability. This visual heuristic of similarity appears to contrast with the Similarity metrics reported in the previous section. Since the Similarity metrics measure both the positive and negative outcome classes in the proportions with which they appear in the original training set, the similarity of the larger positive class could have far outweighed that of the smaller negative class.
5 CONCLUSIONS
This work has examined the Utility, Uniqueness and
Similarity of the synthetic Cystic Fibrosis patients’
EHRs with four SOTA generative models. We ob-
served increased accuracy in predictive performance
with both VAE and CTGAN when we augmented the
EHRs with synthetic data. Similarity metrics appear
to explain the Utility performance of synthetic data
generators. While amplifying a signal in the synthetic dataset may poorly preserve the faithfulness of the original data, it can provide greater separability and hence better predictive performance. Considering the Uniqueness of each synthetic data generator, CTGAN offers both high predictive performance and a high proportion of unique samples, which is beneficial under stricter privacy requirements. These trade-offs are problem-specific, and conclusions should be drawn based on clinical relevance. For example,
do exaggerated signals corroborate clinical evidence
for co-occurrence of diseases? Given the caveats of
our dataset, we have not shown the clinical relevance
of individual features, which remains future work.
REFERENCES
Alaa, A. M., van Breugel, B., Saveliev, E., and van der
Schaar, M. (2021). How faithful is your synthetic
data? sample-level metrics for evaluating and auditing
generative models. arXiv preprint arXiv:2102.08921.
Chen, R. J., Lu, M. Y., Chen, T. Y., Williamson, D. F., and
Mahmood, F. (2021). Synthetic data in machine learn-
ing for medicine and healthcare. Nature Biomedical
Engineering, pages 1–5.
Choi, E., Biswal, S., Malin, B., Duke, J., Stewart, W. F., and
Sun, J. (2017). Generating multi-label discrete patient records using generative adversarial networks. In Machine Learning for Healthcare Conference, pages 286–305. PMLR.
Douzas, G. and Bacao, F. (2018). Effective data generation
for imbalanced learning using conditional generative
adversarial networks. Expert Systems with applica-
tions, 91:464–471.
Dwork, C., Roth, A., et al. (2014). The algorithmic foun-
dations of differential privacy. Found. Trends Theor.
Comput. Sci., 9(3-4):211–407.
Engelmann, J. and Lessmann, S. (2020). Condi-
tional wasserstein gan-based oversampling of tabu-
lar data for imbalanced learning. arXiv preprint
arXiv:2008.09202.
Fiore, U., De Santis, A., Perla, F., Zanetti, P., and Palmieri,
F. (2019). Using generative adversarial networks for
improving classification effectiveness in credit card
fraud detection. Information Sciences, 479:448–455.
Kynkäänniemi, T., Karras, T., Laine, S., Lehtinen, J., and Aila, T. (2019). Improved precision/recall metric for assessing generative models. arXiv preprint arXiv:1904.06991.
Liu, Y., Zhou, Y., Liu, X., Dong, F., Wang, C., and Wang, Z.
(2019). Wasserstein gan-based small-sample augmen-
tation for new-generation artificial intelligence: a case
study of cancer-staging data in biology. Engineering,
5(1):156–163.
Naeem, M. F., Oh, S. J., Uh, Y., Choi, Y., and Yoo, J.
(2020). Reliable fidelity and diversity metrics for gen-
erative models. In International Conference on Ma-
chine Learning, pages 7176–7185. PMLR.
Ngwenduna, K. S. and Mbuvha, R. (2021). Alleviating class
imbalance in actuarial applications using generative
adversarial networks. Risks, 9(3):49.
Sajjadi, M. S., Bachem, O., Lucic, M., Bousquet, O., and
Gelly, S. (2018). Assessing generative models via pre-
cision and recall. arXiv preprint arXiv:1806.00035.
USCFF (2020). Patient registry annual data report. https://www.cff.org/media/23476/download. Accessed: 07-12-2021.
Xie, L., Lin, K., Wang, S., Wang, F., and Zhou, J. (2018).
Differentially private generative adversarial network.
arXiv preprint arXiv:1802.06739.
Xu, L., Skoularidou, M., Cuesta-Infante, A., and Veera-
machaneni, K. (2019). Modeling tabular data using
conditional gan. arXiv preprint arXiv:1907.00503.
Xu, L. and Veeramachaneni, K. (2018). Synthesizing tabu-
lar data using generative adversarial networks. arXiv
preprint arXiv:1811.11264.
Yoon, J., Drumright, L. N., and Van Der Schaar, M. (2020).
Anonymization through data synthesis using genera-
tive adversarial networks (ads-gan). IEEE journal of
biomedical and health informatics, 24(8):2378–2388.