Authors: Moad Hani (1); Nacim Betrouni (2); Saïd Mahmoudi (1) and Mohammed Benjelloun (1)
Affiliations: (1) Department of Computer Engineering and Management, University of Mons (UMONS), Belgium; (2) Univ. Lille, Inserm, CHU Lille, U1172 – LilNCog – Lille Neuroscience & Cognition, France
Keyword(s):
Parkinson’s Disease, Longitudinal Imputation, Synthetic Data Generation, Clinical Bias Mitigation, HyperImpute, CTGAN, Sliced Wasserstein Distance, PPMI Dataset, Healthcare AI Governance, Multi-Center Reproducibility.
Abstract:
Longitudinal datasets such as the Parkinson’s Progression Markers Initiative (PPMI) face critical challenges from missing data and privacy constraints. This paper introduces PPMI-Benchmark, the first comprehensive framework evaluating 12 imputation methods and 6 synthetic data generation techniques across clinical, demographic, and biomarker variables in Parkinson’s disease research. We implement advanced methods including HyperImpute (ensemble optimization), VaDER (variational deep embedding), and conditional tabular GANs (CTGAN), and evaluate them with novel metrics that integrate sliced Wasserstein distance (SWD = 0.039 ± 0.012), temporal consistency analysis, and clinical validity constraints. Our results show that HyperImpute achieves the lowest imputation error (MAE = 5.16 vs. 5.19–5.57 for baselines), while CTGAN achieves the best distribution fidelity (SWD = 0.039 vs. 0.062–0.146). Crucially, we reveal persistent demographic biases in cognitive scores, with age-related imputation errors increasing by 23% for patients over 70, and propose mitigation strategies. The framework provides actionable guidelines for selecting data completion strategies based on missingness patterns (MCAR/MAR/MNAR), computational constraints, and clinical objectives, advancing reproducibility and fairness in neurodegenerative disease research. Validated on 1,483 PPMI participants, our work addresses emerging needs in healthcare AI governance and synthetic data interoperability for multi-center collaborations.
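
The abstract reports distribution fidelity as a sliced Wasserstein distance but does not state the exact estimator. As a rough illustration only, the sketch below shows one common Monte Carlo formulation: project both samples onto random unit directions and average the one-dimensional Wasserstein-1 distances. The function name `sliced_wasserstein`, the number of projections, the fixed seed, and the equal-sample-size restriction are all assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def sliced_wasserstein(x, y, n_projections=128, seed=0):
    """Monte Carlo estimate of the sliced Wasserstein-1 distance.

    Hypothetical helper for illustration; assumes `x` and `y` are
    (n_samples, n_features) arrays with identical shapes.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 2 or x.shape != y.shape:
        raise ValueError("this sketch assumes two 2-D samples of equal shape")

    rng = np.random.default_rng(seed)
    # Draw random directions and normalise them onto the unit sphere.
    directions = rng.normal(size=(n_projections, x.shape[1]))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)

    # Project both samples; each column is one 1-D projection, sorted.
    px = np.sort(x @ directions.T, axis=0)
    py = np.sort(y @ directions.T, axis=0)

    # For equal-size samples, 1-D W1 is the mean absolute difference of the
    # sorted values; averaging over columns averages over projections.
    return float(np.mean(np.abs(px - py)))
```

In a setting like the one described, `x` could hold real records and `y` an equally sized synthetic sample restricted to the same numeric columns; lower values indicate closer distributions. Features would normally be put on a common scale first, since the distance is sensitive to units.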