
2 REVIEW OF RELATED
LITERATURE
Heart disease prediction remains a crucial area of research,
since cardiovascular risk is shaped not only by
biological factors but also by geography, social conditions,
and daily habits such as diet, activity, and healthcare
access. While traditional models focus on static
clinical indicators, recent advances in AI and machine
learning enable more adaptive and comprehensive
prediction frameworks.
Recent work by (Patil, 2021) introduced a hy-
brid model combining deep learning (Mask R-CNN
for segmentation and feature extraction) with classi-
cal ML classifiers like Random Forest and Gaussian
Naive Bayes, achieving a high heart attack predic-
tion accuracy of 98.5%. Similarly, (Jin et al., 2018)
used artificial neural networks (ANN) on sequential
EHR data to capture temporal healthcare patterns. To-
gether, these studies underscore the effectiveness of
both ensemble and sequence-based models for im-
proving heart disease prediction accuracy.
(Shah et al., 2020) compared several supervised
ML classifiers—ANN, Decision Trees, SVM, Naive
Bayes, and Gradient Boosting—for heart disease pre-
diction, finding that Gaussian Naive Bayes achieved
the highest accuracy at 81.9%. Their findings high-
light the importance of choosing the right algorithm
based on data characteristics. Similarly, (Salhi et al.,
2020) and (Rajesh et al., 2018) reported strong
predictive performance with ANN and Decision Trees.
(Srinivas et al., 2018) proposed hybrid ML strate-
gies to enhance prediction, while (Ranga and Rohila,
2018) conducted detailed parametric analyses to re-
veal strengths unique to each algorithm. Collectively,
these studies underscore that algorithm choice, fea-
ture selection, and robust preprocessing are critical to
building accurate and reliable heart disease prediction
models.
Additional research highlights that heart health depends
not only on medical factors but also on where people
live and their social environment. Differences
in risk factors such as cholesterol and smoking between
U.S. and Asian populations emphasize the importance
of including social determinants (such as income,
education, access to care, and diet) in prediction
models, rather than using a one-size-fits-all approach.
Using classification algorithms such as Random Forest,
Naive Bayes, and KNN, (Oladimeji and Oladimeji, 2020)
found that predictive outcomes vary significantly with
features such as smoking status, serum composition,
and ejection ratio.
Further studies highlight that ensemble and hy-
brid modeling approaches can significantly improve
prediction accuracy, often surpassing 90% (Abdeld-
jouad et al., 2020; Rahman et al., 2018). By inte-
grating clinical, behavioral, and demographic data,
these models enable more personalized risk stratification.
Similarly, (Oladimeji and Oladimeji, 2020) and
others demonstrated that combining key health and
demographic indicators with ensemble ML methods
not only enhances model performance and adaptability
but also achieves consistently high precision and
accuracy rates above 90% (Dangare and Apte, 2012).
Other studies point out challenges like data im-
balance, missing values, and overfitting. (Srivastava
et al., 2020; Hazra et al., 2018) tackled these issues
with data preprocessing, including correlation matrix
filtering, PCA, and hybrid model tuning. Inspired by
this, our project uses Random Forest, Logistic Re-
gression, and XGBoost, along with geosocial data, to
predict heart disease effectively across diverse groups.
3 PROPOSED FRAMEWORK
AND METHODOLOGY
This study uses supervised learning with demo-
graphic, societal, and lifestyle-physiological features
to predict CVD risk, employing XGBoost to enable
accurate early detection and personalized interven-
tions.
3.1 Data
Training is based on a dataset comprising 70,000
patient records suitable for modeling CVD risk.
The dataset contains a range of patient-level features
(Age, Height, Weight, Gender, Systolic blood pressure,
Diastolic blood pressure, Cholesterol, Glucose,
Smoking, Alcohol intake, Physical activity, Presence
or absence of cardiovascular disease) which are
important for deeper analysis of heart-related conditions.
The dataset did not have any missing values; however,
to support future re-training, any null value that does
appear is filled with the median of the corresponding
feature. Each sample is associated with a binary target
indicating the presence (1) or absence (0) of CVD.
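The median-fill rule described above can be illustrated with a minimal sketch using only the Python standard library. The column names (`age`, `weight`) and the helper names `fit_medians` and `impute` are hypothetical and are not taken from the study. The sketch also assumes that medians are computed once on the training records and then reused when new records arrive for re-training.

```python
from statistics import median


def fit_medians(records, columns):
    """Compute the median of each column from its non-missing values."""
    medians = {}
    for col in columns:
        observed = [r[col] for r in records if r.get(col) is not None]
        if observed:  # skip columns that have no observed values at all
            medians[col] = median(observed)
    return medians


def impute(records, medians):
    """Return copies of the records with missing values replaced by medians."""
    filled = []
    for r in records:
        row = dict(r)  # copy so the input records are left unchanged
        for col, value in medians.items():
            if row.get(col) is None:
                row[col] = value
        filled.append(row)
    return filled


# Illustrative records only; these are not values from the 70,000-record dataset.
train = [
    {"age": 50, "weight": 70.0},
    {"age": 60, "weight": None},
    {"age": 40, "weight": 80.0},
]
medians = fit_medians(train, ["age", "weight"])
clean = impute(train, medians)
```

Storing the medians separately from the imputation step is one possible design choice: it lets records seen later be filled with the same values that were used during training.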
3.2 Data Preprocessing and
Augmentation
The training data was preprocessed, and additional
features were synthesized and extracted from
the existing features of the dataset.
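As a hedged illustration of what synthesizing features from existing ones can look like, the sketch below derives two quantities from the listed attributes: body mass index (from Height and Weight) and pulse pressure (from Systolic and Diastolic blood pressure). These two features are assumptions chosen for illustration, and the text above does not state which features were actually derived. The field names and units (centimetres, kilograms, mmHg) are likewise hypothetical.

```python
def add_derived_features(record):
    """Return a copy of the record with two illustrative derived features.

    Assumes height in centimetres, weight in kilograms, and blood
    pressure in mmHg. Field names are hypothetical.
    """
    row = dict(record)  # copy so the input record is left unchanged
    height_m = row["height_cm"] / 100.0
    # Body mass index: weight (kg) divided by height (m) squared.
    row["bmi"] = row["weight_kg"] / (height_m ** 2)
    # Pulse pressure: systolic minus diastolic blood pressure.
    row["pulse_pressure"] = row["systolic_bp"] - row["diastolic_bp"]
    return row


# Illustrative record only; not taken from the dataset.
sample = {"height_cm": 170, "weight_kg": 72.25,
          "systolic_bp": 120, "diastolic_bp": 80}
enriched = add_derived_features(sample)
```

Derived quantities of this kind combine two raw measurements into a single value that may be easier for a tree-based model to split on.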
Predictive Model for Heart-Related Issues Based on Demographic, Societal, and Lifestyle Factors