combination of ensemble methods and other models
with the LightGBM model to improve the prediction of
Coronary Artery Disease risk.
Chowdary et al. (B. V. Chowdary, et.al., 2021)
built a distributed, high-speed LightGBM-based
model for heart disease prediction and sought to
ascertain its efficiency in handling bulk medical
data with high predictive accuracy.
Additionally, Gera and Melingi optimized a machine
learning ensemble approach to improve the predictive
quality and robustness of CAD risk models through
hyperparameter tuning and advanced loss functions
such as focal loss (FL) (S. Gera and S. B. Melingi,
2023). Their work emphasized the importance of data
preprocessing and feature extraction for improving
the performance of ML models. In addition, the
Enhanced Whale Optimization Algorithm (EWOA) was
proposed as a feature selection approach that
retains the most critical risk factors for CAD
prediction and thereby improves the classification
score (Lakshmi and R. Devi, 2023).
Hybrid models: Features extracted from different
types of data can also be combined. Gagoriya and
Khandelwal analyzed several hybrid ML techniques and
emphasized the benefits of using multiple algorithms
for more accurate disease classification and
prediction (M. Gagoriya and M. K. Khandelwal, 2023).
Gupta et al. compared multiple classification
models, including Decision Tree, Naïve Bayes, Random
Forest, and Logistic Regression, to find the best
approach for the prediction of CAD (Gupta, et.al.,
2023). They found that ensemble techniques were
superior to standalone classifiers in predictive
accuracy. Sharma and Goel experimented further with
other ML approaches. They employed SVM
classification and found that AI-based predictive
models perform better than classical risk score
systems (R. Sharma and A. K. Goel, 2023).
XGBoost has also been widely used as a key
algorithm in predicting CAD. Soni et al. used
XGBoost to build an effective predictive model based
on biomedical monitoring and wearable device data,
further improving risk factor detection and early
diagnosis (T. Soni, et.al., 2024). Their work
highlighted the value of combining ML and health
monitoring technology with real-time systems to
advance the management of cardiovascular diseases.
D. P. K et al. also discussed the use of supervised and
unsupervised ML algorithms in detecting
cardiovascular diseases. According to the study,
AI-based models increase the precision of cardiac
condition diagnosis through predictive analytics and
feature selection techniques (S. Katari, et.al.,
2023).
Despite advancements, there is still potential to
enhance ML-based CAD prediction. Higher accuracy
and clinical relevance rely on ensemble methods,
hyperparameter tuning, and optimized loss functions.
This study proposes a hybrid ML model integrating
LightGBM with ensemble techniques to improve
CAD risk prediction and early detection.
3 MATERIALS AND METHODS
3.1 Dataset
The CAD dataset, consisting of 5,240 records from
the Framingham Heart Institute, was used to validate
the model. It includes various attributes such as sex
(Male = 1, Female = 0), age (continuous), smoking
status (1 = Yes, 0 = No), and the average number of
cigarettes smoked per day. Additional features include
the use of antihypertensive drugs (1 = Yes, 0 = No),
history of stroke (1 = Yes, 0 = No), high blood
pressure (1 = Yes, 0 = No), and diabetes (1 = Yes, 0
= No). The dataset also includes total cholesterol
levels, both systolic and diastolic blood
pressure, body mass index, heart rate, and blood
glucose levels. The target variable, Ten Year CAD,
indicates whether a patient developed coronary artery
disease within ten years (1 = Yes, 0 = No). A total of
15.19% of the records corresponded to patients with
CAD (744 cases), while 84.81% represented normal
cases (4,596 cases). Among the CAD patients,
53.26% were men, and 46.74% were women.
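The attribute list above can be pictured as a typed record. The following is a minimal sketch only: every field name is hypothetical (the actual column names of the dataset are not given here), optional fields are assumed to be those that may contain missing values, and the two sample rows are invented purely to show how the share of positive cases would be computed.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class PatientRecord:
    """One dataset row. All field names are hypothetical placeholders."""
    sex: int                        # 1 = male, 0 = female
    age: float                      # continuous
    smoker: int                     # 1 = yes, 0 = no
    cigs_per_day: Optional[float]   # average cigarettes per day
    bp_meds: Optional[int]          # antihypertensive drugs, 1 = yes, 0 = no
    stroke_history: int             # 1 = yes, 0 = no
    high_bp: int                    # 1 = yes, 0 = no
    diabetes: int                   # 1 = yes, 0 = no
    total_chol: Optional[float]
    sys_bp: float
    dia_bp: float
    bmi: Optional[float]
    heart_rate: Optional[float]
    glucose: Optional[float]
    ten_year_cad: int               # target: 1 = CAD within ten years


def positive_rate(records: Sequence[PatientRecord]) -> float:
    """Fraction of records whose target label is 1."""
    if not records:
        raise ValueError("no records given")
    return sum(r.ten_year_cad for r in records) / len(records)


# Two invented rows, for illustration only (not taken from the dataset).
sample = [
    PatientRecord(1, 52.0, 1, 20.0, 0, 0, 1, 0, 240.0, 140.0, 90.0,
                  27.5, 80.0, 85.0, 1),
    PatientRecord(0, 45.0, 0, 0.0, None, 0, 0, 0, 195.0, 118.0, 76.0,
                  23.1, 72.0, None, 0),
]
rate = positive_rate(sample)
```

Applied to the full dataset, the same function would give the proportion of positive cases reported above.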
3.2 Data Cleaning and Preparation
Data cleaning and preparation play a crucial role in
machine learning by ensuring that the data are high-
quality, dependable, and consistent before predictive
models are trained. Raw medical data are prone to
missing values, outliers, and imbalanced
distributions, which degrade model performance.
Systematic data preprocessing techniques were
therefore used to address these problems. First,
missing values were handled either by deleting
incomplete records or by statistical imputation.
Outliers were then detected and removed using the
interquartile range (IQR) technique.
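The two cleaning steps can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the study's implementation: median imputation stands in for the unspecified "statistical imputation", the fence multiplier k = 1.5 is the conventional choice and is not stated in the text, and the function names and sample values are hypothetical.

```python
import statistics
from typing import List, Optional, Tuple


def impute_median(values: List[Optional[float]]) -> List[float]:
    """Replace missing entries (None) with the median of observed ones.

    Median imputation is an assumption; the text only says
    'statistical imputation'.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


def iqr_bounds(values: List[float], k: float = 1.5) -> Tuple[float, float]:
    """Return the fences (Q1 - k*IQR, Q3 + k*IQR).

    k = 1.5 is the conventional multiplier, assumed here.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(values: List[float], k: float = 1.5) -> List[float]:
    """Keep only the values that lie inside the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if lower <= v <= upper]


# Invented cholesterol-like readings with one gap and one extreme value.
raw = [200, 210, None, 220, 230, 600]
filled = impute_median(raw)
cleaned = drop_outliers(filled)
```

On this toy column the gap is filled with the median of the observed readings, and the extreme reading falls outside the fences and is dropped. Deleting incomplete records instead of imputing would simply filter out the `None` entries before the outlier step.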
$$\mathrm{IQR} = Q_3 - Q_1 \qquad (1)$$
where Q1 (lower quartile) and Q3 (upper quartile)