accuracy and better generalization on complex data
sets. This is why the XGBoost model is more accurate
than the other models compared in this study.
This study has two notable limitations: the
imbalance in the dataset and the narrow scope of its data
sources. Firstly, the age distribution's skew towards
older adults not only restricts the model's
generalizability to younger demographics but also
potentially masks age-specific risk factors that could
be crucial for comprehensive heart disease prediction.
This demographic limitation could lead to
underrepresentation of early-onset heart disease
patterns, thereby affecting the model's predictive
accuracy across all age groups.
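One way to make the age-skew concern measurable is to report accuracy separately for each age band rather than as a single pooled figure. The following is a minimal sketch under stated assumptions: the function name `accuracy_by_age_group`, the cut points in `edges`, and the toy values are all hypothetical and are not taken from the study's dataset or code.

```python
from collections import defaultdict


def accuracy_by_age_group(ages, y_true, y_pred, edges=(40, 55, 70)):
    """Return {age_band: (n_samples, accuracy)}.

    `edges` are hypothetical cut points giving the bands
    <40, 40-54, 55-69 and >=70.
    """
    def label(age):
        lower = None
        for edge in edges:
            if age < edge:
                return f"<{edge}" if lower is None else f"{lower}-{edge - 1}"
            lower = edge
        return f">={edges[-1]}"

    counts = defaultdict(lambda: [0, 0])  # band -> [correct, total]
    for age, truth, pred in zip(ages, y_true, y_pred):
        band = counts[label(age)]
        band[0] += int(truth == pred)
        band[1] += 1
    return {b: (total, correct / total) for b, (correct, total) in counts.items()}


# Made-up values purely for illustration (not from the study's dataset).
ages = [34, 38, 45, 52, 58, 61, 63, 66, 71, 74]
y_true = [0, 1, 0, 1, 1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]

report = accuracy_by_age_group(ages, y_true, y_pred)
for band, (n, acc) in report.items():
    print(f"{band:>6}: n={n}, accuracy={acc:.2f}")
```

A band with few samples or noticeably lower accuracy than the pooled figure would indicate where the skew is likely to hurt generalization.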
Secondly, the reliance on a single dataset, without
incorporating diverse data from multiple institutions
or international sources, may introduce geographical
and ethnic biases. Heart disease presents varying risk
factors and manifestations across different
populations due to genetic, environmental, and
lifestyle differences. The lack of a multi-institutional,
multinational dataset could hinder the model's ability
to capture these nuances, thus limiting its global
applicability and reducing its effectiveness in
providing personalized risk assessments.
To address these limitations, future research
should aim to develop a more balanced and diverse
dataset that includes a broader age range and
represents multiple populations. This strategy will
improve the model's predictive accuracy and ensure it
is better suited to assess heart disease risks across
different demographic groups. Additionally,
employing advanced feature selection techniques and
dimensionality reduction methods will allow for a
more holistic understanding of the complex interplay
between features, leading to more accurate and
nuanced predictions. Furthermore, expanding the
comparative analysis to include other models, such as
neural networks, may reveal additional insights and
potentially yield higher predictive accuracy. The primary
objective is to improve the predictive capabilities of
heart disease models, which will contribute to more
effective prevention strategies and better
cardiovascular health outcomes. As machine learning
continues to evolve, there is great potential for
developing more accurate, adaptive, and personalized
tools for predicting heart disease in the future.
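As an illustration of the feature selection step proposed above, the sketch below ranks features by the absolute value of their Pearson correlation with the label. This is a deliberately simple filter method used as a stand-in; it is not the advanced technique the study has in mind, and the feature names and values are made up for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def rank_features(columns, target, top_k=2):
    """Keep the top_k feature names by |correlation with target|."""
    scores = {name: abs(pearson(values, target))
              for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical toy data (names and values are illustrative only).
columns = {
    "max_heart_rate": [170, 165, 150, 140, 120, 110],
    "cholesterol": [200, 240, 210, 250, 205, 245],
    "constant_flag": [1, 1, 1, 1, 1, 1],
}
target = [0, 0, 0, 1, 1, 1]

selected = rank_features(columns, target, top_k=2)
print(selected)
```

A correlation filter only captures linear, one-feature-at-a-time relationships, so it would miss exactly the feature interactions the text refers to; it is shown here only to make the idea of scoring and pruning features concrete.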
5 CONCLUSIONS
In summary, three models were compared: decision tree,
random forest, and XGBoost. XGBoost achieved the
highest accuracy (0.9805), followed by the random
forest (0.9659) and the decision tree (0.9610). These
results confirm the expected performance ranking of
the three models. XGBoost's iterative boosting steps,
in which each new tree corrects the errors of the
previous ones, help it capture patterns that other
models might miss, allowing it to make better
predictions. In addition, it can handle missing data,
which is useful in cases where records are incomplete.
In future work, a wider range of models will be
evaluated for heart disease prediction, with the aim of
identifying alternatives that outperform XGBoost or of
further refining the existing XGBoost model for
improved accuracy. Additionally, the goal is to train a
refined heart disease prediction model and launch a
website where users can estimate their heart disease
risk and receive personalized advice on how to
improve their health.
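To put the accuracy gaps in perspective, the short sketch below converts the three reported accuracies into error rates and computes how much of each baseline's error XGBoost removes. Only the three accuracy values come from the text; the dictionary keys and the helper `relative_error_reduction` are names introduced here for illustration.

```python
# Accuracies as reported in the conclusions.
accuracy = {"decision_tree": 0.9610, "random_forest": 0.9659, "xgboost": 0.9805}

# Error rate = 1 - accuracy.
error = {name: 1.0 - acc for name, acc in accuracy.items()}


def relative_error_reduction(baseline, improved):
    """Fraction of the baseline's error that the improved model removes."""
    return (error[baseline] - error[improved]) / error[baseline]


reduction = {
    baseline: relative_error_reduction(baseline, "xgboost")
    for baseline in ("decision_tree", "random_forest")
}
for baseline, value in reduction.items():
    print(f"XGBoost vs {baseline}: error {error[baseline]:.4f} -> "
          f"{error['xgboost']:.4f} ({value:.1%} relative reduction)")
```

Seen this way, a gap of one to two percentage points in accuracy corresponds to XGBoost making roughly half as many errors as the decision tree, although the statistical significance of that gap depends on the test-set size, which this section does not state.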
REFERENCES
Ahsan, M. M., & Siddique, Z. 2022. Machine learning-
based heart disease diagnosis: A systematic literature
review. Artificial Intelligence in Medicine, 128,
102289.
Chen, T., & Guestrin, C. 2016. XGBoost: A scalable tree
boosting system. In Proceedings of the 22nd ACM
SIGKDD International Conference on Knowledge
Discovery and Data Mining, 785-794.
David, L. 2019. URL:
https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset.
Last Accessed: 2024/09/09.
De Ville, B. 2013. Decision trees. Wiley Interdisciplinary
Reviews: Computational Statistics, 5(6), 448-455.
Groenewegen, A., Rutten, F. H., Mosterd, A., & Hoes, A.
W. 2020. Epidemiology of heart failure. European
Journal of Heart Failure, 22(8), 1342-1356.
Khan, Y., Qamar, U., Yousaf, N., & Khan, A. 2019.
Machine learning techniques for heart disease datasets:
A survey. In Proceedings of the 2019 11th
International Conference on Machine Learning and
Computing, 27-35.
Mata, J., Frank, R., & Gigerenzer, G. 2014. Symptom
recognition of heart attack and stroke in nine European
countries: a representative survey. Health
Expectations, 17(3), 376-387.
Ponikowski, P., Anker, S. D., AlHabib, K. F., Cowie, M. R.,
Force, T. L., Hu, S., ... & Filippatos, G. 2014. Heart
failure: preventing disease and death worldwide. ESC
Heart Failure, 1(1), 4-25.
Rigatti, S. J. 2017. Random forest. Journal of Insurance
Medicine, 47(1), 31-39.
Song, Y. Y., & Ying, L. U. 2015. Decision tree methods:
applications for classification and prediction. Shanghai
Archives of Psychiatry, 27(2), 130.