Authors:
Daniel Rocha Franca 1; Caio Davi Rabelo Fiorini 2; Ligia Ferreira de Carvalho Gonçalves 2; Marta Dias Moreira Noronha 3; Mark Song 3 and Luis Enrique Zárate Galvez 3
Affiliations:
1 Bach. Computer Science, Pontifícia Universidade Católica de Minas Gerais, Rua Claudio Manuel, Belo Horizonte, Brazil
2 Bach. Data Science and Artificial Intelligence, Pontifícia Universidade Católica de Minas Gerais, Rua Claudio Manuel, Belo Horizonte, Brazil
3 Institute of Exact Sciences and Computer Science, Pontifícia Universidade Católica de Minas Gerais, Rua Claudio Manuel, Belo Horizonte, Brazil
Keyword(s):
Hypercholesterolemia, Young Population, Machine Learning, Decision Tree, Genetic Algorithm, Data Mining, National Health Survey, Risk Factors, Data Preprocessing, Health Informatics, CAPTO.
Abstract:
Understanding the risk factors associated with hypercholesterolemia in young individuals is crucial for developing preventive strategies to combat cardiovascular diseases. This study proposes a data mining pipeline employing machine learning techniques to profile high cholesterol in Brazilian youth aged 15 to 25, utilizing the 2019 National Health Survey (PNS) dataset. The PNS-2019 database has 1,088 attributes organized into 26 modules and 293,726 anonymized records. The Knowledge Discovery in Databases (KDD) process was implemented, incorporating a novel CAPTO-based conceptual attribute selection followed by feature selection using a Non-dominated Sorting Genetic Algorithm II (NSGA-II). A decision tree classifier was optimized and evaluated, achieving an F1 Score of 66%, demonstrating reasonable predictive power despite data limitations. The results highlight the significant impact of dietary habits, particularly high sugar and fat intake, on hypercholesterolemia risk. The study emphasizes the potential for early identification and targeted interventions, contributing to public health improvements and laying the groundwork for future research with advanced models and additional data sources.
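For readers unfamiliar with the reported metric, the F1 score is the harmonic mean of precision and recall. The sketch below is only an illustration of that definition; the function name `f1_score` and the confusion-matrix counts are hypothetical and are not taken from the study.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return the F1 score from true-positive, false-positive and false-negative counts."""
    # Precision: share of predicted positives that are truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual positives that were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts (not from the paper), chosen so that
# precision and recall are both 0.66.
example = f1_score(tp=66, fp=34, fn=34)
print(f"F1 = {example:.2f}")
```

Because F1 ignores true negatives, it is a common choice when the positive class (here, individuals with high cholesterol) is the class of interest.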