
parency of PreXP, with positive feedback on its interface
and explanations.
Future enhancements will focus on advanced feature engineering
(interaction detection, dimensionality reduction, and feature
selection), offered as user options. Generative AI will be used to
enhance imputation, data augmentation, and schema transformation.
Explainability will be improved with numerical justifications and
visual aids. Further developments include refining preprocessing
suggestions, cloud scalability, and benchmarking to establish PreXP
as a robust, domain-aware tool. Participants also proposed
enhancements for future iterations, including advanced feature
engineering for automated feature selection and expanded
explainability that gives precise numerical justifications for
preprocessing decisions.
ACKNOWLEDGMENT
We acknowledge the use of AI tools to generate and
enhance parts of the paper. The content was revised.
DATA 2025 - 14th International Conference on Data Science, Technology and Applications