an in-depth examination of their predictive
capabilities for cardiovascular diseases. This study
examined several data preprocessing strategies,
including feature scaling, encoding, and selection,
and assessed the effectiveness of models, including
Logistic Regression, Decision Tree, Random Forest,
and Extreme Gradient Boosting (XGBoost).
Additionally, the author assessed the importance of
various features to understand their contribution to
disease prediction. Ultimately, through comparison,
this work identified the most robust and interpretable
prediction framework. With this research, the author strives to improve the prediction of cardiovascular disease and to offer deeper insights for predictive modeling in clinical settings.
2 METHOD
2.1 Dataset and Preprocessing
The Cardiovascular Disease Dataset, which was
obtained from Kaggle, was used in this investigation.
It includes 70,000 records, each representing a patient, described by 11 patient-related features together with a binary indicator signifying whether cardiovascular disease is present (Svetlana, 2018). The features fall into three categories: objective features are factual information such as weight, height, age, and gender; examination features are results of a medical examination, such as cholesterol, blood pressure, and glucose levels; and subjective features are self-reported information such as alcohol consumption, smoking, and physical activity. The target variable, "cardio," is binary: 1 indicates the presence of cardiovascular disease and 0 its absence. All data were collected during medical examinations, providing a snapshot of each patient's health status.
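For concreteness, the dataset can be loaded and inspected as in the minimal sketch below; the file name cardio_train.csv, the semicolon separator, and the column layout are assumptions based on the common Kaggle download rather than details reported by the author.

    import pandas as pd

    # File name and separator assumed from the typical Kaggle download.
    df = pd.read_csv("cardio_train.csv", sep=";")

    print(df.shape)                     # expected (70000, 13): id + 11 features + target
    print(df["cardio"].value_counts())  # binary target: 1 = disease present, 0 = absent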
In this study, several methods were applied to process the raw data before the formal experiments in order to improve accuracy and the other evaluation indicators. The four methods are as follows: (1) Handling Missing Values: although the dataset contains no missing values, checks were performed to guarantee the integrity of the data. (2) Categorical Encoding: one-hot encoding was used to convert features with more than two categories (e.g., gender, cholesterol, and glucose in this experiment) into binary indicator columns (0 and 1) suitable for learning. (3) Feature Scaling: to guarantee uniformity across different kinds of data, a single scale was applied to a range of features, including age, height, weight, and blood pressure. This normalization improves the model's predictive accuracy and stabilizes its training phase. (4) Feature Selection: to enhance the model's efficiency, features with minimal variability were removed to eliminate noise and irrelevant information. Furthermore, a technique was employed to identify and retain the most informative features based on their statistical relevance.
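A minimal sketch of this preprocessing pipeline with scikit-learn is given below; the column names (e.g., ap_hi and ap_lo for blood pressure) follow the assumed Kaggle schema, and the variance and selection thresholds are illustrative rather than values reported by the author.

    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif

    # df comes from the loading sketch above.
    # (1) Handling missing values: confirm none are present.
    assert df.isnull().sum().sum() == 0

    # (2) Categorical encoding: one-hot encode the multi-category features.
    df = pd.get_dummies(df, columns=["gender", "cholesterol", "gluc"], dtype=int)

    # (3) Feature scaling: put continuous features on a single scale.
    num_cols = ["age", "height", "weight", "ap_hi", "ap_lo"]
    df[num_cols] = StandardScaler().fit_transform(df[num_cols])

    # (4) Feature selection: drop near-constant features, then keep the
    #     statistically most informative ones (thresholds are illustrative).
    X = df.drop(columns=["id", "cardio"])
    y = df["cardio"]
    X_var = VarianceThreshold(threshold=0.01).fit_transform(X)
    X_sel = SelectKBest(f_classif, k=10).fit_transform(X_var, y)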
2.2 Models
2.2.1 Logistic Regression
Logistic Regression serves as the baseline model in this work; it is a statistical approach frequently used for binary classification (LaValley, 2008). It forms a linear combination of the input attributes and applies the logistic transformation to estimate the probability that the target variable belongs to a particular category. In this study, it is used to predict whether an individual has cardiovascular disease (1) or not (0), using the processed features. Although Logistic Regression provides a strong baseline, its performance can degrade when the relationship between the features and the target variable is non-linear.
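As an illustration of this baseline, the sketch below uses scikit-learn's LogisticRegression; the train/test split ratio and hyperparameters are assumptions, not the settings used in the study.

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # X_sel and y come from the preprocessing sketch above.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_sel, y, test_size=0.2, random_state=42)

    logreg = LogisticRegression(max_iter=1000)  # linear baseline classifier
    logreg.fit(X_tr, y_tr)
    print("accuracy:", accuracy_score(y_te, logreg.predict(X_te)))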
2.2.2 Decision Tree
A Decision Tree segments the dataset into subsets
based on attribute values. Within this structure, each
internal node corresponds to an original data feature,
while each leaf node signifies an original data class
label (Priyanka, 2020). In this study, it is employed to
forecast the occurrence or non-occurrence of
cardiovascular disease through the analysis of the
refined features. The model progressively partitions
the data by choosing the best split at each node, as
dictated by the Gini impurity measure. Decision trees are easy to interpret, but they may overfit the data if their growth is not adequately controlled. Adding, removing, or refining features is a form of feature engineering that can further improve the model's accuracy.
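A corresponding sketch with scikit-learn's DecisionTreeClassifier is shown below; the Gini criterion follows the text, while the depth cap is an assumed guard against overfitting rather than a setting reported in the study.

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # Gini impurity selects the best split at each node; max_depth is an
    # assumed cap to limit overfitting.
    tree = DecisionTreeClassifier(criterion="gini", max_depth=6, random_state=42)
    tree.fit(X_tr, y_tr)  # X_tr, y_tr from the split above
    print("accuracy:", accuracy_score(y_te, tree.predict(X_te)))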
2.2.3 Random Forest
Random Forest is an ensemble technique built upon decision trees. Its distinctive feature is the construction of numerous decision trees during the training phase, with the final class prediction determined by the majority vote of the individual trees (Biau, 2016). In this study, Random Forest predicts cardiovascular disease from the processed features. This model mitigates