SynthCheck: A Dashboard for Synthetic Data Quality Assessment
Gabriele Santangelo, Giovanna Nicora, Riccardo Bellazzi and Arianna Dagliati
Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, Italy
Keywords: Synthetic Data, Quality Evaluation, Privacy, Graphical User Interface.
Abstract: In recent years, synthetic data generation has become a topic of growing interest, especially in healthcare,
where synthetic data can support the development of robust Artificial Intelligence (AI) tools. Additionally, synthetic
data offer advantages such as easier sharing and consultation compared to original data, which are subject to
patient privacy laws that have become increasingly rigorous in recent years. To ensure the safe use of synthetic
data, it is necessary to assess their quality. Synthetic data quality evaluation is based on three properties:
resemblance, utility, and privacy, which can be measured using different statistical approaches. Automatic
evaluation of synthetic data quality can foster their safe usage within medical AI systems. For this reason, we
have developed a dashboard application, in which users can perform a comprehensive quality assessment of
their synthetic data. This is achieved through a user-friendly interface, providing easy access and intuitive
functionalities for generating reports.
1 INTRODUCTION
Machine Learning (ML) and Artificial Intelligence
(AI) are increasingly being exploited to solve health-
related problems, such as prognosis prediction from
Electronic Health Records (EHR) or detecting
patterns in multi-omics data. These approaches are
gradually being translated from bench to bedside,
with 171 AI/ML-enabled medical devices authorized
by the Food and Drug Administration (FDA) as of
October 2023 (Joshi et al., 2022).
Data plays a significant role in the development of
such systems, but concerns have been raised when
dealing with patients' data, with regulators
underlining the need to protect patients' privacy. To
this end, in recent years, there have been growing
proposals to replace original data (derived from real
patients) with synthetic data that mimic the main
statistical characteristics of their real counterparts.
One of the most common definitions of synthetic data
is the one used by the US Census Bureau (Philpott,
2018), which reads as follows: “Synthetic data are
microdata records created by statistically modeling
original data and then using those models to generate
new data values that reproduce the original data’s
statistical properties”.
Synthetic data are now widely used to train ML
classifiers. For example, (Chen & Chen, 2022)
trained an ML model for lung cancer using synthetic
data only. Synthetic data can also be exploited to test
ML classification performance (Tucker et al., 2020).
(Hernandez et al., 2022) provides a systematic
review of the approaches for synthetic data generation
(SDG) developed in the last few years. SDGs can be
categorized into three main groups: (1) classical
approaches, which include baseline methods (e.g.
anonymization and noise addition) as well as statistical
and supervised machine learning approaches; (2) deep
learning approaches, where the generative model is
realized using deep learning; and (3) a third group
that includes approaches not falling into the previous
categories (e.g. methods that generate synthetic data
by simulating a series of procedures).
Regardless of the methods employed to generate
them, it is essential to assess the quality of the
synthetic data. In a recent paper, Hernadez et al. have
described the different metrics currently used to
evaluate tabular synthetic data (Hernadez et al.,
2023). These metrics can be classified into three
categories based on their evaluation objectives. First,
resemblance metrics focus on assessing the statistical
properties of synthetic data, by directly comparing the
statistical distributions of features between the original
and synthetic datasets and by analyzing whether the
correlation structure among the features of the original
dataset is preserved in the synthetic dataset. Utility-related
metrics are aimed at evaluating the usability of statistical
conclusions drawn from synthetic data or of the results of
ML models trained with synthetic data. The third relevant
aspect is privacy, a measure of how well synthetic data
protect against the disclosure of private or sensitive
information; for example, cyberattacks by a virtual attacker
can be simulated and their success subsequently evaluated.

Table 1: List of tools that perform evaluation of synthetic data utility, resemblance and privacy.

Tool | Description | Metrics | GUI | Report
Synthetic Data Vault (Patki et al., 2016) (SDGym and SDMetrics modules) | Python package to generate and evaluate synthetic data | resemblance, privacy | no | yes
SDNist (Task et al., 2023) | Python package for evaluation | resemblance, utility, privacy | no | yes
Anonymeter (Giomi et al., 2023) | Python package to evaluate privacy | privacy | no | no
SynthGauge | Python package | utility, privacy | no | no
synthpop (Raab et al., 2021) | R library | utility | no | no
syntheval (SynthEval, 2023) | Python package for evaluation | resemblance, utility, privacy | no | no
Different works have benchmarked SDG
methods, also in light of the above-mentioned aspects.
In (Goncalves et al., 2020), the authors compare eight
SDGs on medical data from the Surveillance,
Epidemiology, and End Results (SEER) program in
terms of statistical resemblance between the original
and the synthetic data and in terms of privacy, revealing
that no particular method demonstrated superior
performance. In (Reiner Benaim et al., 2020), a cross-
hospital study at the Rambam Health Care Campus,
Israel, the authors tested the validity of synthetic data
generated directly from real data across different
clinical research projects. Their results indicate that
synthetic data were a close estimate of real data from
a statistical point of view.
In (Yan et al., 2022), authors benchmarked several
deep learning SDGs on EHR data, investigating the
trade-off between utility and privacy, and finding that
no single SDG outperformed the others. Since
synthetic data can be used to train ML models,
(Rodriguez-Almeida et al., 2023) studied the
relationship between resemblance and the
performance of ML classifiers trained on synthetic
data. In a recent study, (Azizi et al., 2023) showed
how synthetic data can support federated learning.
The aim was to assess country-level differences in the
role of sex in cardiovascular disease using a dataset
of Austrian and Canadian individuals. The datasets
shared between the two countries were synthesized
using sequence-optimized decision trees and showed
low privacy risk.
Numerous tools have been created for the
generation and assessment of synthetic data. Table 1
presents a list of open-source tools available for
evaluating synthetic datasets concerning resemblance,
privacy, and utility. Only two of these tools measure all
three aspects, and notably, none of them offer a
Graphical User Interface (GUI). The absence of a GUI
might limit the usability of these tools for non-
informatics users, particularly clinicians.
To address this issue, we have implemented a
Dashboard application that users can install and
run on their computers. The application accepts
both real and synthetic data and computes various
metrics to assess resemblance, utility, and privacy.
Furthermore, users can download a report containing
the obtained results.
The following sections provide details on the
implemented metrics and the Dash application
designed for synthetic data evaluation. A case study
on a dataset of Intensive Care Unit (ICU) patients is
then presented.
2 METHODS
2.1 Quality Metrics
This section describes the methods to evaluate the
quality of a synthetic dataset in terms of resemblance,
utility and privacy that were included in the
application.
2.1.1 Resemblance Metrics
To assess the resemblance between the original data
and the synthetically generated dataset we considered
three main metric categories: Univariate
Resemblance Analysis (URA), Multivariate
Relationships Analysis (MRA) and Data Labeling
Analysis (DLA).
URA analysis evaluates synthetic data’s ability to
preserve original data’s univariate statistical
properties. It compares distributions of features
between the original and synthetic datasets using
statistical tests (e.g. Student t-test, Mann-Whitney U-
test, Kolmogorov-Smirnov test for continuous
features and Chi-square test for categorical features).
Preserved statistical properties in synthetic data are
indicated by accepted null hypotheses in tests for
continuous features and rejected hypotheses for
categorical ones. Distance measures such as the cosine,
Jensen-Shannon, and Wasserstein distances (the latter
only for continuous features) can also be used to assess
how well statistical properties are preserved: the smaller
the distance, the better the preservation.
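As an illustration of how such a per-feature URA table could be assembled, the following minimal scipy sketch is given; it is not the SynthCheck implementation, and the DataFrames `real` and `synth`, the `feature_types` mapping and the significance level `alpha` are assumptions.

```python
# Minimal URA sketch (illustrative): one statistical test and one distance
# per feature, assuming two pandas DataFrames `real` and `synth` with the
# same columns and a dict `feature_types` mapping columns to
# "numerical"/"categorical".
import pandas as pd
from scipy import stats
from scipy.spatial.distance import jensenshannon

def ura_report(real: pd.DataFrame, synth: pd.DataFrame,
               feature_types: dict, alpha: float = 0.05) -> pd.DataFrame:
    rows = []
    for col, ftype in feature_types.items():
        if ftype == "numerical":
            # Kolmogorov-Smirnov test on the raw values
            stat, p = stats.ks_2samp(real[col].dropna(), synth[col].dropna())
            # Wasserstein distance between the two empirical distributions
            dist = stats.wasserstein_distance(real[col].dropna(), synth[col].dropna())
        else:
            # Chi-square test on the table of category counts (real vs. synthetic)
            counts = pd.concat([real[col].value_counts(), synth[col].value_counts()],
                               axis=1).fillna(0)
            stat, p, _, _ = stats.chi2_contingency(counts.T)
            # Jensen-Shannon distance between the category proportions
            dist = jensenshannon(counts.iloc[:, 0] / counts.iloc[:, 0].sum(),
                                 counts.iloc[:, 1] / counts.iloc[:, 1].sum())
        rows.append({"feature": col, "type": ftype, "statistic": stat,
                     "p_value": p, "reject_H0": p < alpha, "distance": dist})
    return pd.DataFrame(rows)
```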
The MRA analyses determine whether the synthetic data
replicate the original data's statistical properties in a
multidimensional context, exploiting different methods
(a minimal sketch of these checks is given after the list):
- Correlation matrices: a Pearson correlation matrix is computed for continuous features and a normalized contingency table for categorical features, for both the original and synthetic datasets. It is assumed that if the synthetic data are generated correctly, the differences between the "real matrix" and the "synthetic matrix" will be small;
- Outliers analysis: for each observation in the original dataset and in the synthetic dataset, the Local Outlier Factor (LOF) score is computed; the scores assigned to original and synthetic data are then compared visually;
- Variance explained analysis: Principal Component Analysis (PCA) is performed to measure the variance explained by the variables in both the original and synthetic datasets;
- Data "shape" preservation: a visual analysis is performed using the Uniform Manifold Approximation and Projection (UMAP) method to visualize the "shape" assumed by the original data and compare it with that assumed by the synthetic data.
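A minimal sketch of how these MRA quantities could be computed with scikit-learn is shown below; it is illustrative only, and the numeric-only DataFrames `real_num` and `synth_num` (same columns) are assumed inputs.

```python
# Minimal MRA sketch (illustrative, not the SynthCheck source):
# correlation-matrix differences, LOF scores and PCA explained variance.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

def mra_summary(real_num: pd.DataFrame, synth_num: pd.DataFrame,
                n_components: int = 2) -> dict:
    # 1) Absolute difference between the Pearson correlation matrices:
    #    small entries mean the dependency structure is preserved.
    diff = (real_num.corr() - synth_num.corr()).abs().values
    pct_close = (diff[np.triu_indices_from(diff, k=1)] < 0.1).mean() * 100

    # 2) Negative Local Outlier Factor score of every observation
    #    (these are the values compared visually in the dashboard).
    lof_real = LocalOutlierFactor(n_neighbors=20).fit(real_num.values).negative_outlier_factor_
    lof_synth = LocalOutlierFactor(n_neighbors=20).fit(synth_num.values).negative_outlier_factor_

    # 3) Explained variance ratio of a PCA fitted separately on each dataset.
    evr_real = PCA(n_components).fit(StandardScaler().fit_transform(real_num)).explained_variance_ratio_
    evr_synth = PCA(n_components).fit(StandardScaler().fit_transform(synth_num)).explained_variance_ratio_

    return {"pct_corr_pairs_diff_below_0.1": pct_close,
            "negative_lof_real": lof_real,
            "negative_lof_synth": lof_synth,
            "explained_variance_ratio_diff": np.abs(evr_real - evr_synth)}
```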
In the DLA analysis, several classifiers (e.g.
Random Forest, K-Nearest Neighbors, Decision Tree,
Support Vector Machines and Multilayer Perceptron)
are trained to recognize whether a given record is
original or synthetic, and their performance is
evaluated by computing accuracy, recall, precision
and F1 score. If the semantics of the original data are
preserved in the synthetic data, the classifiers should
not be able to distinguish original from synthetic
records, i.e. they should show low performance.
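The DLA idea can be illustrated with a short scikit-learn sketch (not the SynthCheck code): a discriminator is trained on the pooled records, and performance close to chance suggests the two datasets are hard to tell apart. The inputs `real` and `synth` are assumed numeric DataFrames with identical columns.

```python
# Minimal DLA sketch (illustrative): train a classifier to separate real
# from synthetic records and report its detection performance.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def dla_scores(real: pd.DataFrame, synth: pd.DataFrame, seed: int = 0) -> dict:
    X = pd.concat([real, synth], ignore_index=True)
    y = np.r_[np.zeros(len(real)), np.ones(len(synth))]  # 0 = real, 1 = synthetic
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              stratify=y, random_state=seed)
    clf = RandomForestClassifier(random_state=seed).fit(X_tr, y_tr)
    y_hat = clf.predict(X_te)
    return {"accuracy": accuracy_score(y_te, y_hat),
            "precision": precision_score(y_te, y_hat),
            "recall": recall_score(y_te, y_hat),
            "f1": f1_score(y_te, y_hat)}
```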
2.1.2 Utility Metrics
To assess utility, we evaluate whether the
performance of ML classifiers trained with real data is
maintained when they are trained with synthetic data.
In the “Train on Real Test on Real” (TRTR)
approach, a classifier is selected and trained to predict
the value of a target class using a portion of the
original dataset as the training set. Subsequently, it is
evaluated using a test set derived from the same
original dataset. The performance metrics of the
trained classifier, including accuracy, recall,
precision and F1 score, are computed.
In the "Train on Synthetic Test on Real" (TSTR)
approach, the training set is derived from the
synthetic dataset, while the test set consists of
elements from the original dataset; thus, the
classifier is trained on synthetic data and tested on
real data. At the end of this analysis, the performance
metrics are computed and compared between the
TRTR and TSTR approaches.
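A minimal sketch of the TRTR/TSTR comparison is given below, assuming DataFrames `real_train`, `real_test` and `synth` that all contain the target column; the Random Forest model and the helper name are illustrative, not the application's exact choices.

```python
# Minimal TRTR vs. TSTR sketch (illustrative): the same model class is
# trained once on real and once on synthetic data, and both are evaluated
# on the same real test set.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

def trtr_tstr(real_train: pd.DataFrame, real_test: pd.DataFrame,
              synth: pd.DataFrame, target: str) -> dict:
    X_test, y_test = real_test.drop(columns=target), real_test[target]
    reports = {}
    for name, train_df in {"TRTR": real_train, "TSTR": synth}.items():
        clf = RandomForestClassifier(random_state=0)
        clf.fit(train_df.drop(columns=target), train_df[target])
        # accuracy/precision/recall/F1 for each approach, to be compared
        reports[name] = classification_report(y_test, clf.predict(X_test),
                                              output_dict=True)
    return reports
```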
2.1.3 Privacy Metrics
Privacy preservation is measured with two different
analyses: the first analysis is called Similarity
Evaluation Analysis (SEA), while the second
involves simulating two different cyberattacks, i.e.
Membership Inference Attack (MIA) and Attribute
Inference Attack (AIA).
In SEA analysis, three distance metrics between
the original and the synthetic data are calculated:
Euclidean distance, cosine similarity and Hausdorff
distance. The distances are calculated considering the
rows (each row representing a patient) of the two
datasets. High values of the Euclidean and Hausdorff
distances indicate low similarity between original and
synthetic data, and hence minimal privacy loss,
whereas the opposite holds for cosine similarity.
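A minimal sketch of how the SEA distances could be computed with scipy is shown below; the row-wise aggregation (nearest-record distances and similarities) is an assumption, and `real` and `synth` are assumed numeric arrays of shape (n_patients, n_features).

```python
# Minimal SEA sketch (illustrative): row-wise Euclidean and cosine
# comparisons plus the Hausdorff distance between the two point clouds.
import numpy as np
from scipy.spatial.distance import cdist, directed_hausdorff

def sea_distances(real: np.ndarray, synth: np.ndarray) -> dict:
    # Pairwise distances between every real and every synthetic record
    euclid = cdist(real, synth, metric="euclidean")
    cosine_sim = 1.0 - cdist(real, synth, metric="cosine")
    # Symmetric Hausdorff distance between the two datasets
    hausdorff = max(directed_hausdorff(real, synth)[0],
                    directed_hausdorff(synth, real)[0])
    return {"min_euclidean_per_synth_record": euclid.min(axis=0),
            "max_cosine_sim_per_synth_record": cosine_sim.max(axis=0),
            "hausdorff": hausdorff}
```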
In the simulated MIA, the attacker has access to a
portion of the original dataset (referred to as the
original subset) and attempts to identify the records in
the original subset that were part of the training set
used in the synthetic data generation phase. The attacker
calculates similarities (e.g., cosine similarity) between
each record of the original subset and the records of the
synthetic dataset. If any similarity exceeds a given
threshold, the record is labelled as belonging to the
original training set. After simulating the attack, the
accuracy and precision of the attacker are computed.
The underlying idea is that if the attacker succeeds in
identifying records, the synthetic dataset contains
records that are too similar to those in the original
training set, resulting in a loss of security for the
original data.
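The simulated MIA can be sketched as follows (illustrative only): the attacker labels a record as a training-set member when its highest cosine similarity to any synthetic record exceeds the chosen threshold. The arrays `attacker_subset` and `synth`, and the ground-truth membership vector `is_member` used only for scoring the attack, are assumptions.

```python
# Minimal MIA simulation sketch (illustrative).
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics import accuracy_score, precision_score

def simulate_mia(attacker_subset: np.ndarray, synth: np.ndarray,
                 is_member: np.ndarray, threshold: float = 0.7) -> dict:
    # Best cosine similarity of each attacker record to any synthetic record
    cosine_sim = 1.0 - cdist(attacker_subset, synth, metric="cosine")
    claimed_member = cosine_sim.max(axis=1) >= threshold
    return {"accuracy": accuracy_score(is_member, claimed_member),
            "precision": precision_score(is_member, claimed_member)}
```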
In the simulated AIA, the attacker has access to a
portion of the original dataset and the complete
synthetic dataset, but the original subset lacks some
of the features present in the original dataset. The
attacker’s objective is to reconstruct the missing
features from this subset, using an ML classification or
regression model, depending on the type of feature to
be reconstructed. The model is trained using the
features from the synthetic dataset and the target class
is chosen from the missing features that the attacker
wants to reconstruct. Then, the trained model is used
to predict the considered missing feature, utilizing the
features from the original subset. Finally, the attacker's
performance is evaluated by calculating accuracy, if
the reconstructed feature is categorical, or Root Mean
Squared Error (RMSE), in the case of continuous
features. If the synthetic dataset prevents accurate
reconstruction, it suggests preservation of the original
dataset’s privacy.
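A minimal sketch of the simulated AIA is shown below, assuming DataFrames `synth` and `original_subset`, a list `known` of features the attacker holds and a `target` feature to reconstruct; the Random Forest models are illustrative stand-ins for whatever classifier or regressor an attacker might use.

```python
# Minimal AIA simulation sketch (illustrative): train on synthetic data to
# predict a "missing" feature from the known ones, then score the
# reconstruction on the attacker's real subset (true values used only for
# evaluation).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, mean_squared_error

def simulate_aia(synth: pd.DataFrame, original_subset: pd.DataFrame,
                 known: list, target: str, categorical: bool) -> float:
    model = RandomForestClassifier() if categorical else RandomForestRegressor()
    model.fit(synth[known], synth[target])
    reconstructed = model.predict(original_subset[known])
    if categorical:
        return accuracy_score(original_subset[target], reconstructed)
    # RMSE for continuous features
    return float(np.sqrt(mean_squared_error(original_subset[target], reconstructed)))
```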
2.2 Dashboard Architecture
Figure 1 shows the architecture designed for the
application: an initial section where users can upload
the necessary data for quality analysis and a second
section where synthetic data are evaluated using the
metrics described above. Each of the various
subsections allows for downloading a detailed report
of the obtained results.
Figure 1: Architecture diagram of the application. It
comprises two distinct sections: a data loading section and
a section implementing the evaluation metrics for synthetic
data quality.
3 RESULTS
3.1 Dashboard Implementation
For the development of the application, we used the
Python Dash package, a library used for creating
interactive and customized applications. Code and
installation instructions are available in a GitHub
repository (Santangelo, 2023).
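As a rough illustration of this setup (not the SynthCheck source code), a multi-panel Dash application with an upload component and a bottom navigation bar can be skeletonized as follows; all component ids, routes and callbacks are hypothetical.

```python
# Hypothetical Dash skeleton: a page container, an upload component and a
# navigation bar, only to illustrate the pattern used by the dashboard.
from dash import Dash, dcc, html, Input, Output

app = Dash(__name__, suppress_callback_exceptions=True)

app.layout = html.Div([
    dcc.Location(id="url"),              # tracks the current panel
    dcc.Store(id="uploaded-data"),       # holds the parsed CSV files
    html.Div(id="panel-content"),        # current panel is rendered here
    html.Nav([                           # bottom navigation bar
        dcc.Link("Load datasets", href="/load"),
        dcc.Link("Resemblance", href="/resemblance"),
        dcc.Link("Utility", href="/utility"),
        dcc.Link("Privacy", href="/privacy"),
    ]),
])

@app.callback(Output("panel-content", "children"), Input("url", "pathname"))
def render_panel(pathname):
    # Each path maps to one evaluation panel; the load panel exposes
    # dcc.Upload components for the original dataset, the synthetic dataset
    # and the feature-type file.
    if pathname in (None, "/", "/load"):
        return dcc.Upload(id="upload-original",
                          children=html.Button("Upload original CSV"))
    return html.H3(f"{pathname.strip('/').capitalize()} panel")

if __name__ == "__main__":
    app.run(debug=True)
```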
As illustrated in Figure 2, the application is
composed of panels reflecting the architecture
illustrated in Figure 1. Through the navigation bar at
the bottom of each page, users can navigate through
the panels after entering the required data (refer to
the Appendix, Figures 9-17, for additional images of
the application GUI). In the following paragraphs,
the application's panels are described in detail.
3.1.1 Load Datasets Panel
In the first panel, users can upload all the data
necessary for quality evaluation in Comma Separated
Values (CSV) files (Figure 2, panel 1): (1) the
original dataset, (2) the synthetic dataset and (3) a file
indicating the type (numerical or categorical) of each
feature in the uploaded datasets. The structure of this
file consists of two columns, labelled "Feature" and
"Type": the first column lists all the feature names
from the uploaded datasets, and the second column
the corresponding feature type.
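For illustration, such a feature-type file could be produced as follows; the type labels "numerical" and "categorical" follow the description above, while the feature names are hypothetical examples.

```python
# Illustrative feature-type file: one row per column of the datasets.
import pandas as pd

feature_types = pd.DataFrame({
    "Feature": ["Age", "Gender", "HR_mean", "Inhospital_death"],
    "Type":    ["numerical", "categorical", "numerical", "categorical"],
})
feature_types.to_csv("feature_types.csv", index=False)
```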
3.1.2 Evaluation Panel
Once the users have uploaded the data, they can
execute the quality assessment of the synthetic data.
The evaluation consists of three panels, each
implementing a different quality analysis as described
in the Methods section.
From the Resemblance panel (Figure 2, panels 2-
a), the user can access three different subsections.
First, URA analysis can be performed. The user can
select the desired statistical tests and distance metrics
from three different dropdown menus. Under each
dropdown menu, a table displays the results obtained
for each feature. Additionally, for statistical tests only,
the table's rows are highlighted in red or green
depending on whether the null hypothesis is accepted
or rejected and, by clicking on a feature in the table,
the user can view a comparative plot of the probability
distributions (for continuous features) or a bar plot of
the category proportions (for categorical features) for
the original and synthetic data.
Figure 2: Navigation diagram of the sections included in the application. After loading the required data (panel 1), from the
navigation bar at the bottom, it is possible to navigate through various panels to perform specific analysis: Resemblance
analysis (panels 2-a), Utility analysis (panel 2-b) and Privacy analysis (panels 2-c). Furthermore, from the navigation bar, a
button is available for the user to download a report of the panel they are currently in (this button is not present in the data
loading panel).
In the second subsection, all the metrics related to
the MRA analysis are computed. In the correlation
matrices section, the user can choose from a
dropdown menu whether to view matrices related to
continuous or categorical features. In addition, the
user can choose to view the matrices separately for
real and synthetic data or the difference matrix
between the two. In the outliers analysis section, a
comparative plot with two boxplots of the negative
LOF scores in real and synthetic data is shown. The
variance explained analysis section displays a plot of
the explained variance ratio as the number of
components increases, for both the original and the
synthetic data; the adjacent table lists the differences
between the explained variance ratios obtained with
original and synthetic data. At the end of this
subsection, the user can run the UMAP method and
choose the parameters with which it should be
executed. In addition, two buttons implement two
different strategies: the first displays two separate
graphs for comparing real and synthetic data, while
the second shows a single graph obtained by running
UMAP on a single dataset obtained by concatenating
the original and synthetic datasets.
The last subsection does not require any user
interaction; it presents the results related to the
DLA analysis. On the left, for each classifier used in
the analysis, the values of performance metrics
(accuracy, precision, recall and F1 score) are
displayed, while on the right, four boxplots related to
the metrics are shown.
In the Utility panel (Figure 2, panels 2-b), the
TRTR and the TSTR approaches are implemented.
Initially, the user has to select, through the two
dropdown menus, a target class from the available
options (only categorical features are listed, since
both analyses are based on a classification problem)
and an ML model to be trained. Furthermore, the user
can choose to upload the original training set and test
set using the buttons at the top; otherwise, a random
split of the previously uploaded original dataset will
be performed. The analysis can then be started using
the button at the bottom.
The Privacy panel (Figure 2, panels 2-c) includes
all the analyses performed for privacy evaluation;
therefore, the user can access three different
subsections.
The first subsection displays the results obtained
from the SEA analysis. The user can select from the
dropdown menu which metric to compute. If cosine
similarity or Euclidean distance is chosen, density
plots of the computed pairwise distance values are
shown. For the Hausdorff distance, only its
corresponding value is shown.
In the second subsection, an MIA is simulated and
the attacker's performance is shown to the user upon
completing the simulation. Initially, the user has to
upload the training set used in the generation of the
analysed synthetic data and choose, using the
available sliders, the size of the dataset portion that
the attacker will have access to during the attack and
the similarity threshold used by the attacker. Once
this information is provided, the simulation can be
started and, when it is finished, the attacker’s
performance (accuracy and precision) is displayed
through two pie charts.
The last subsection is related to the AIA
simulation. The user has to set the size of the portion
of the original dataset available to the attacker using
a slider. Additionally, through the dropdown menu,
the user has to select which features from the original
dataset the attacker will have access to during the
attack. Subsequently, the simulation results are
shown in a tabbed interface and, by clicking on one
of the two tabs (Accuracy or RMSE),
the user can view the reconstruction performance of
categorical and continuous features, respectively. In
particular, for the continuous features, the
Interquartile Range (IQR) is also shown to better
understand the RMSE value obtained for each
feature.
Additionally, each panel allows the user to
download a report containing the graphs and/or tables
displayed within that specific panel.
3.2 A Case Study with MIMIC Dataset
To assess the validity and functionality of the
developed application, the MIMIC-II dataset was
utilized. This dataset (Silva et al., 2012) contains
vital signs and heterogeneous clinical data of 12,000
ICU patients. Up to 42 variables were recorded for
each patient at least once during the first 48 hours
after admission to the ICU: 6 of these variables are
general descriptors, while the remaining ones are
time series variables with multiple observations.
Aggregated features were obtained as reported in
(Johnson, 2023), followed by the removal of
features with at least 70% missing values. The
resulting dataset consists of 109 features and 6000
records, with some features containing missing
values. Before generating the synthetic data, the
dataset was divided into a training set (80%) and a
test set (20%); missing data were then imputed with
MICE (Multivariate Imputation by Chained
Equations) from the R library of the same name
(Buuren & Groothuis-Oudshoorn, 2011).
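For illustration, an analogous preprocessing pipeline in Python might look as follows; the paper used the R mice package, and scikit-learn's IterativeImputer is shown here only as a rough chained-equations analogue. File names are hypothetical and only numeric columns are assumed.

```python
# Illustrative preprocessing sketch: drop sparse features, split, impute.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

data = pd.read_csv("mimic_aggregated.csv")          # hypothetical file name
# Drop features with at least 70% missing values
data = data.loc[:, data.isna().mean() < 0.7]

train, test = train_test_split(data, test_size=0.2, random_state=42)

imputer = IterativeImputer(random_state=42)          # chained-equations analogue
train_imputed = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)
test_imputed = pd.DataFrame(imputer.transform(test), columns=test.columns)
```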
To generate synthetic data, we selected two
approaches, namely HealthGAN (Yale et al., 2020)
and Synthetic Data Vault (SDV) (Patki et al., 2016).
The first method is a deep learning approach that
creates a generative model for synthesizing new data;
specifically, the method uses a modified Generative
Adversarial Network (GAN).
The SDV method learns statistical information
from the original dataset to create the generative
model from which, subsequently, new synthetic data
is sampled. Each feature of the dataset to be modelled
is associated with the parameters of a continuous
statistical distribution. Then, the covariance matrix
among the features is estimated. Therefore, the
generative model consists of the set of all distribution
parameters and the covariance matrix.
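For illustration, the current SDV Python API (version 1.x, whose interface differs from the 2016 release used in this study) exposes this kind of model roughly as follows; the file name and configuration are assumptions.

```python
# Illustrative sketch of fitting and sampling an SDV Gaussian-copula model
# with the SDV 1.x API (not the configuration used in the case study).
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

train = pd.read_csv("mimic_train_imputed.csv")   # hypothetical file name

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(train)            # infer column types

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(train)                           # per-feature distributions + correlations
synthetic = synthesizer.sample(num_rows=len(train))
```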
For the sake of readability, only MRA results for
Resemblance evaluation, Utility results and MIA
results for Privacy evaluation are reported and
discussed.
3.2.1 MRA (Resemblance) Results
Table 2 compares correlation matrices derived from
real and synthetic data (both for continuous and
categorical features). The percentage of feature pair
combinations with a difference between real and
synthetic values less than 0.1 was calculated. For the
LOF method, the percentage ratio between the
numbers of identified synthetic and real outliers
(negative LOF score < -1.5) is reported. Finally, the
difference between real and synthetic data in terms of
explained variance, considering one component and
two components, is reported for the PCA.
Table 2: Summary table of MRA results.

Metric | HealthGAN | SDV
Pearson correlation matrix (% feature combinations with difference < 0.1) | 92 | 95
Normalized contingency table (% feature combinations with difference < 0.1) | 76 | 48
LOF method (% ratio of synthetic to real outliers) | 55.45 | 7.27
PCA method (% difference in explained variance, real vs. synthetic, for 1 / 2 components) | 1.90 / 3.94 | 20.21 / 5.54
As shown in Table 2, both methods appear to
perform well in replicating the statistical properties of
the original data. In particular, the HealthGAN
approach seems to provide excellent results even for
categorical features and outlier replication. Indeed,
both the percentage of synthetic categorical feature
pair combinations that adhere to the dependency
structure of the original features and the outlier
replication ratio are higher with HealthGAN.
Figure 3 and Figure 4 show correlation matrices
for continuous features and normalized contingency
tables for categorical features. As seen in Figure 4,
HealthGAN and SDV methods manage to faithfully
replicate the correlation structure among the
categorical features of the original dataset, as the
matrices (original vs. synthetic) are very similar. The
same conclusions can be drawn for continuous
features (see Figure 3), but only concerning the
HealthGAN method, as the matrix obtained with the
synthetic data generated by the SDV method has
some “gaps”.
Figure 3: Correlation matrices for continuous features (on
the left for original data, in the centre for synthetic data with
HealthGAN and on the right for synthetic data with SDV).
Figure 4: Normalized contingency tables for categorical
features (on the right for original data, at the upper left for
synthetic data with HealthGAN and at the bottom left for
synthetic data with SDV).
The long-tailed distribution of the violin plots in
Figure 5 is due to the distribution of the negative LOF
scores of the original dataset. As observed, both SDG
methods fail to faithfully replicate the behaviour of
the original data concerning outliers, although the
HealthGAN method appears to perform better
compared to the SDV method.
Figure 5: Split violin plots depicting the distribution of
negative LOF scores for original observations (blue) and
synthetic observations (red), obtained with HealthGAN
(top) and SDV (bottom).
Using PCA, the synthetic data obtained with both
methods show a very similar behaviour to the original
data (see Figure 6). Indeed, the two trends almost
completely overlap, with slight differences when
considering the first five components, especially with
the SDV method.
Figure 6: Plots showing the explained variance trend,
considering the original data (blue) and the synthetic data
(red), with HealthGAN (on the left) and SDV (on the right).
Figure 7 reports the UMAP projections of the
original and synthetic data. The UMAP parameter
controlling the number of neighbours was set to 20,
while the parameter determining the minimum
distance between points in the reduced representation
was set to 0.1. Particularly for HealthGAN, the results
obtained can be considered acceptable since the
“shape” of the synthetic data is similar to that of the
original data, even if rotated. For example, the central
cavity that is more prominent in the original data but
still present in the synthetic data generated with
HealthGAN and the perimeter shape of the synthetic
data, in this case with both HealthGAN and SDV
methods, that closely resembles that of the original
data.
Figure 8 was obtained using the same UMAP
parameters as in Figure 7, but in this case, the original
dataset was concatenated with the synthetic one.
From Figure 8, it can be observed that with both SDG
methods, the synthetic data adheres to the original
data, although the synthetic data obtained with the
SDV method does not cover some small portions of
the original dataset.
Figure 7: UMAP projections of the original data (in blue,
on the left) and the synthetic data with HealthGAN (in red,
in the centre) and with SDV (in red, on the right).
Figure 8: UMAP projections of the original dataset (blue)
concatenated with synthetic dataset (red), using
HealthGAN (left) and SDV (right).
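For reference, the projections in Figures 7 and 8 correspond to the two strategies described in Section 3.1.2; with the umap-learn package and the parameters reported above, they could be reproduced roughly as follows (placeholder random data stand in for the real and synthetic matrices).

```python
# Illustrative UMAP projections with n_neighbors=20 and min_dist=0.1.
import numpy as np
import umap

rng = np.random.default_rng(0)
real_num = rng.normal(size=(500, 10))     # placeholder for the original matrix
synth_num = rng.normal(size=(500, 10))    # placeholder for the synthetic matrix

params = dict(n_neighbors=20, min_dist=0.1, random_state=42)
# Strategy 1: separate projections of the two datasets (Figure 7)
emb_real = umap.UMAP(**params).fit_transform(real_num)
emb_synth = umap.UMAP(**params).fit_transform(synth_num)
# Strategy 2: a single projection of the concatenated datasets (Figure 8)
emb_joint = umap.UMAP(**params).fit_transform(np.vstack([real_num, synth_num]))
```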
3.2.2 Utility Results
In this analysis, a classifier was trained on a
classification problem (target class
“Inhospital_death”), initially using real data (TRTR
approach) and then using synthetic data (TSTR
approach). Both classifiers were tested on the same
original test set. In addition, the same training set and
test set used in learning the generative model were
selected as the original training set and original test
set, respectively. The results obtained with two
approaches are shown in Table 3 (with Random
Forest classifier) and Table 4 (with Decision Tree
classifier), which report the 95% confidence intervals
obtained through 100 replications.
Table 3: Summary table of Utility evaluation, with TRTR
approach results and TSTR approach results. The classifier
used is Random Forest with target class "Inhospital_death".

Metric | TRTR | TSTR (HealthGAN) | TSTR (SDV)
accuracy | (0.875, 0.876) | (0.867, 0.868) | (0.866, 0.867)
precision | (0.682, 0.700) | (0.374, 0.521) | (0.468, 0.502)
recall | (0.129, 0.134) | (0.005, 0.008) | (0.048, 0.052)
F1 score | (0.217, 0.225) | (0.010, 0.015) | (0.086, 0.094)
The goal of Utility evaluation is not to assess the
obtained performance (whether high or low) but to
analyse the differences between the values of the
performance metrics obtained in the TRTR approach
and those obtained in the TSTR approach.
Table 4: Summary table of Utility evaluation, with TRTR
approach results and TSTR approach results. The classifier
used is Decision Tree with target class "Inhospital_death".

Metric | TRTR | TSTR (HealthGAN) | TSTR (SDV)
accuracy | (0.785, 0.788) | (0.795, 0.799) | (0.616, 0.620)
precision | (0.280, 0.285) | (0.271, 0.280) | (0.136, 0.139)
recall | (0.387, 0.395) | (0.312, 0.320) | (0.349, 0.358)
F1 score | (0.325, 0.331) | (0.290, 0.297) | (0.195, 0.200)
Generally, if the model inherently overfits the data
(such as a Decision Tree) and the synthetic data are
very similar to the original data, then the differences
between the performance metrics obtained with the
two approaches (TRTR and TSTR) will be less
pronounced than when a classifier that
overfits less (such as a Random Forest) is used. As
evident from Table 3 and Table 4, better results in
utility evaluation are obtained using the Decision
Tree classifier.
3.2.3 MIA (Privacy) Results
Table 5 shows the results achieved by providing the
attacker with half of the original dataset and using a
similarity threshold of 0.7, which the attacker uses to
identify the records. The 95% confidence intervals
were obtained through 50 replications. From the
information present in this portion of the original
dataset and the information contained in the synthetic
dataset, the attacker must be able to identify which
records in the original subset belong to the training set
used during the SDG phase.
Table 5: Summary table of MIA results, showing the
attacker's performance values (accuracy and precision).

Metric | HealthGAN | SDV
Attacker's accuracy | (0.798, 0.800) | (0.799, 0.802)
Attacker's precision | (0.798, 0.800) | (0.799, 0.800)
The attacker’s performance is quite high in all the
considered cases. This indicates that the synthetic
data are similar to the original data used for training,
as the attacker was able to identify the latter based on
the synthetic data.
Different results can be obtained by changing the
proportion of the original dataset provided to the
attacker and the similarity threshold used by the
attacker. For example, reducing the size of the
original subset will result in lower attacker
performance. However, the results with half of the
original dataset are shown, as this represents a
meaningful test case.
4 CONCLUSIONS
This paper presents a dashboard application that,
through a simple and intuitive GUI, allows users to
conduct a quality analysis of a synthetic dataset
obtained using any generative method. The
application implements various quality evaluation
metrics across three assessment aspects: resemblance,
utility and privacy preservation. Furthermore, the
users can also download summary reports from the
different evaluation panels. The application is freely
available for download at (Santangelo, 2023).
In order to assess the performance of the different
proposed metrics, they were used to evaluate the
quality of synthetic datasets obtained from two SDG
methods, namely HealthGAN and SDV. The original
dataset used is the MIMIC-II, which contains EHR
information from patients in ICU. In general,
synthetic data successfully replicate original data’s
statistical properties and ML classifiers’ performance
metrics obtained with the original dataset. However,
the privacy aspect is not fully respected since the
synthetic data are too similar to the original data.
Furthermore, the HealthGAN method seems to
outperform the SDV method.
Among the limitations of this work, one is related
to the type of synthetic data generated, which includes
only tabular data, while EHRs may also include
bioimages and biosignals. All the implemented
metrics were designed for the evaluation of tabular
synthetic data, thus requiring modification or the
addition of new metrics for evaluating synthetic data
of a different nature. Another limitation is the
handling of missing data: the application assumes that
input datasets do not contain missing values.
Therefore, datasets with missing values need to be
imputed before use.
Regarding future developments of the
implemented metrics, it would be important and
advantageous for some analyses to integrate an
explainability (XAI) component for the results
obtained. For example, in the case of DLA, which
uses ML algorithms, it could be useful to identify
which features had a greater or lesser impact on the
final results, allowing for a detailed inspection of
these features. Moreover, it would be useful to
integrate a section for the evaluation of missing data
patterns, when present in the input datasets.
ACKNOWLEDGEMENTS
Gabriele Santangelo is a PhD student enrolled in the
National PhD program in Artificial Intelligence,
XXXIX cycle, course on Health and life sciences,
organized by Università Campus Bio-Medico di
Roma. This work was supported by “Fit4MedRob-
Fit for Medical Robotics” Grant B53C22006950001.
REFERENCES
Azizi, Z., Lindner, S., Shiba, Y., Raparelli, V., Norris, C. M.,
Kublickiene, K., Herrero, M. T., Kautzky-Willer, A.,
Klimek, P., Gisinger, T., Pilote, L., & El Emam, K.
(2023). A comparison of synthetic data generation and
federated analysis for enabling international evaluations
of cardiovascular health. Scientific Reports, 13(1),
11540. https://doi.org/10.1038/s41598-023-38457-3
Buuren, S. van, & Groothuis-Oudshoorn, K. (2011). mice:
Multivariate Imputation by Chained Equations in R.
Journal of Statistical Software, 45, 1–67.
https://doi.org/10.18637/jss.v045.i03
Chen, A., & Chen, D. O. (2022). Simulation of a machine
learning enabled learning health system for risk
prediction using synthetic patient data. Scientific
Reports, 12(1), 17917. https://doi.org/10.1038/s41598-022-23011-4
Giomi, M., Boenisch, F., Wehmeyer, C., & Tasnádi, B.
(2023). A Unified Framework for Quantifying Privacy
Risk in Synthetic Data. Proceedings on Privacy
Enhancing Technologies, 2023(2), 312–328.
https://doi.org/10.56553/popets-2023-0055
Goncalves, A., Ray, P., Soper, B., Stevens, J., Coyle, L., &
Sales, A. P. (2020). Generation and evaluation of
synthetic patient data. BMC Medical Research
Methodology, 20(1), 108. https://doi.org/10.1186/s12874-020-00977-1
Hernadez, M., Epelde, G., Alberdi, A., Cilla, R., & Rankin,
D. (2023). Synthetic Tabular Data Evaluation in the
Health Domain Covering Resemblance, Utility, and
Privacy Dimensions. Methods of Information in
Medicine, 62(S 01), e19–e38. https://doi.org/10.1055/s-0042-1760247
Hernandez, M., Epelde, G., Alberdi, A., Cilla, R., & Rankin,
D. (2022). Synthetic data generation for tabular health
records: A systematic review. Neurocomputing, 493, 28–
45. https://doi.org/10.1016/j.neucom.2022.04.053
Johnson, A. (2023). Challenge2012 [Jupyter Notebook].
https://github.com/alistairewj/challenge2012 (Original
work published 2018)
Joshi, G., Jain, A., Araveeti, S. R., Adhikari, S., Garg, H., &
Bhandari, M. (2022). FDA approved Artificial
Intelligence and Machine Learning (AI/ML)-Enabled
Medical Devices: An updated landscape [Preprint].
Health Informatics. https://doi.org/10.1101/2022.12.07.22283216
Patki, N., Wedge, R., & Veeramachaneni, K. (2016). The
Synthetic Data Vault. 2016 IEEE International
Conference on Data Science and Advanced Analytics
(DSAA), 399–410. https://doi.org/10.1109/DSAA.2016.49
Philpott, D. (Ed.). (2018). A guide to Federal terms and
acronyms (Second edition). Bernan Press.
Raab, G. M., Nowok, B., & Dibben, C. (2021). Assessing,
visualizing and improving the utility of synthetic data.
https://doi.org/10.48550/ARXIV.2109.12717
Reiner Benaim, A., Almog, R., Gorelik, Y., Hochberg, I.,
Nassar, L., Mashiach, T., Khamaisi, M., Lurie, Y.,
Azzam, Z. S., Khoury, J., Kurnik, D., & Beyar, R. (2020).
Analyzing Medical Research Results Based on Synthetic
Data and Their Relation to Real Data Results: Systematic
Comparison From Five Observational Studies. JMIR
Medical Informatics, 8(2), e16492. https://doi.org/10.2196/16492
Rodriguez-Almeida, A. J., Fabelo, H., Ortega, S., Deniz, A.,
Balea-Fernandez, F. J., Quevedo, E., Soguero-Ruiz, C.,
Wägner, A. M., & Callico, G. M. (2023). Synthetic
Patient Data Generation and Evaluation in Disease
Prediction Using Small and Imbalanced Datasets. IEEE
Journal of Biomedical and Health Informatics, 27(6),
2670–2680. https://doi.org/10.1109/JBHI.2022.3196697
Santangelo, G. (2023). SynthCheck [Python]. https://github.com/bmi-labmedinfo/SynthCheck.git
Silva, I., Moody, G., Scott, D. J., Celi, L. A., & Mark, R. G.
(2012). Predicting in-hospital mortality of ICU patients:
The PhysioNet/Computing in cardiology challenge 2012.
39, 245–248. Scopus.
SynthEval. (2023). syntheval [Jupyter Notebook]. schneiderkamplab.
https://github.com/schneiderkamplab/syntheval
Task, C., Bhagat, K., & Howarth, G. (2023). SDNist v2:
Deidentified Data Report Tool (1.0.0) [dataset]. National
Institute of Standards and Technology. https://doi.org/10.18434/MDS2-2943
Tucker, A., Wang, Z., Rotalinti, Y., & Myles, P. (2020).
Generating high-fidelity synthetic patient data for
assessing machine learning healthcare software. Npj
Digital Medicine, 3(1), 147. https://doi.org/10.1038/s41746-020-00353-9
Yale, A., Dash, S., Dutta, R., Guyon, I., Pavao, A., & Bennett,
K. P. (2020). Generation and evaluation of privacy
preserving synthetic health data. Neurocomputing, 416,
244–255. https://doi.org/10.1016/j.neucom.2019.12.136
Yan, C., Yan, Y., Wan, Z., Zhang, Z., Omberg, L., Guinney,
J., Mooney, S. D., & Malin, B. A. (2022). A Multifaceted
benchmarking of synthetic electronic health record
generation models. Nature Communications, 13(1),
7609. https://doi.org/10.1038/s41467-022-35295-1
APPENDIX
Figure 9: Initial screen with details about the table shown
to the user during the upload of the original dataset.
Figure 10: Detail of the results obtained with the statistical
tests in the URA subsection.
Figure 11: Detail of the comparison of correlation matrices
in the MRA subsection.
Figure 12: Detail of the dataset comparison using the
UMAP method in the MRA subsection.
Figure 13: Subsection related to DLA analysis with
boxplots.
Figure 14: Input panel for providing information required
for executing the Utility evaluation and results section.
Figure 15: Section for the SEA analysis with the result, in
the case of Cosine similarity calculation.
Figure 16: Input panel of the data required for the MIA
simulation and results section.
Figure 17: Input panel of the data required for the AIA
simulation and results section.