A Comparative Study of Machine Learning Models for

Cardiovascular Disease Prediction

Beining Qian

Department of Computer Science and Technology, Chongqing University, Chongqing, 401331, China

Keywords: Machine Learning, Cardiovascular Disease, Random Forest.

Abstract: The efficacy of diverse machine learning approaches in predicting cardiovascular diseases is compared in this

analysis by utilizing a range of hyperparameter tuning and data preprocessing techniques to improve model

performance. The methods applied include encoding categorical data, generating additional features, selecting

the most relevant features, and standardizing data, along with extensive hyperparameter optimization. The

models evaluated include Decision Tree, Logistic Regression, Random Forest, and Extreme Gradient

Boosting (XGBoost). The results show that compared with other models, the Random Forest Model has a

unique ensemble learning method, and achieves excellent performance and robustness by virtue of this

method. As demonstrated by results, the Random Forest effectively balances bias and variance and addresses

the complexity of medical data. The results of the experiment proved to be the best performer in this survey

was the Random Forest Model, although XGBoost also showed strong performance with its sophisticated

boosting and regularization strategies. This emphasizes how crucial model tuning and selection are to

improving forecast accuracy. To further improve prediction reliability and generality, future research should

investigate more sophisticated models and methodologies, optimize preprocessing and tuning strategies, and

incorporate larger datasets.

1 INTRODUCTION

In recent times, medical analysis has become a crucial

component of modern healthcare, playing a vital role

in disease diagnosis, prognosis, and treatment

planning (Kononenko, 2001). With the advancement

of large-scale medical data sets such as electronic

health records, medical image files, and mobile apps,

it has become easier to accurately analyze patient data

to aid in diagnosing conditions. Medical analysis

encompasses various methods aimed at

understanding, predicting, and mitigating the

occurrence of diseases, giving medical professionals

insightful information for early intervention and

individualized care (Celermajer, 2012). Machine

learning is one of the most promising techniques in

this field since it uses data-driven methods to identify

patterns and correlations in medical information.

Machine learning approaches could intelligently

learn representative feature combinations for specific

missions (Jordan, 2015). It has demonstrated

extraordinary promise in medical analysis for

enhancing the precision of diagnoses, forecasting the

course of diseases, and enhancing treatment plans.

Particularly in the diagnosis of cardiovascular

diseases (CVD), these models have demonstrated

superior performance by identifying complex

relationships between related factors such as

cholesterol levels, age, blood pressure, and other

clinical parameters (Mathur, 2020). Machine learning

can handle non-linear correlations, manage high-

dimensional data, and adjust to changes in patient

populations more effectively than traditional

statistical methods (Hagan, 2021).

Machine learning approaches have been

leveraged in several recent research to predict disease

using a variety of medical datasets. A representative

work leveraged them to forecast the start of cancer,

diabetes, and cardiovascular disease (Knutsson,

2000). While these studies have shown promising

results, they often rely on specific feature sets, lack

extensive data preprocessing, or underutilize feature

engineering's potential to increase model correctness.

Furthermore, it is still difficult for these models to be

generalized to other populations.

This research focuses on tackling limitations in

prior investigations by evaluating the performance of

various machine learning techniques and conducting

Qian, B.

A Comparative Study of Machine Learning Models for Cardiovascular Disease Prediction.

DOI: 10.5220/0013332300004558

In Proceedings of the 1st International Conference on Modern Logistics and Supply Chain Management (MLSCM 2024), pages 377-381

ISBN: 978-989-758-738-2

377

an in-depth examination of their predictive

capabilities for cardiovascular diseases. This study

examined several data preprocessing strategies,

including feature scaling, encoding, and selection,

and assessed the effectiveness of models, including

Logistic Regression, Decision Tree, Random Forest,

and Extreme Gradient Boosting (XGBoost).

Additionally, the author assessed the importance of

various features to understand their contribution to

disease prediction. Ultimately, through comparison,

this work identified the most robust and interpretable

prediction framework. With this research, the author

strives to increase the predictability of cardiovascular

illness and offer enhanced insights for predictive

modeling in clinical settings.

2 METHOD

2.1 Dataset and Preprocessing

The Cardiovascular Disease Dataset, which was

obtained from Kaggle, was used in this investigation.

It includes 70,000 data, each representing a patient,

with 11 different patient-related eigenvalues as well

as a binary indicator signifying whether

cardiovascular disease is present or not (Svetlana,

2018). Three categories are used to classify the

features: Objective features are factual information

about things like weight, height, age, and gender;

examination features are results from a medical exam

like cholesterol, blood pressure, and glucose levels;

and subjective features are things like self-reported

information about things like drinking alcohol,

smoking, and physical activity. The goal variable,

"cardio," is binary and indicates the presence or

absence of cardiovascular illness (1) or (0). All data

was collected during medical examinations,

providing a snapshot of each patient's health status.

In this study, several methods are leveraged to

process original data before formal experiment, in

order to improve the accuracy and other evaluation

Indicators. Here are those four methods: (1) Handling

Missing Values: Although there are no such values in

the dataset, checks were made to guarantee the

accuracy of the data. (2) Categorical Encoding: One-

Hot Encoding is leveraged to encode data with more

than 2 categories (e.g., gender, cholesterol, and

glucose in this experiment) into a numerical format

for learning, i.e., 0 and 1. (3) Feature Scaling: To

guarantee uniformity across various forms of data, a

single scale was applied to a range of features,

including age, height, weight, and blood pressure.

This normalization improves the model's predictive

accuracy and stabilizes its training phase. (4) Feature

Selection: To enhance the model’s efficiency,

features with minimal variability were removed to

eliminate noise and irrelevant information.

Furthermore, a technique was employed to pinpoint

and preserve the most revealing features based on

their statistical relevance.

2.2 Models

2.2.1 Logistic Regression

It is the baseline model in this work, which is a

statistical approach that is frequently leveraged for

binary classification (LaValley, 2008). It introduces a

linear combination of various attributes, applies

logistic transformations, and constructs a Logistic

Regression model to predict the probability of the

target variable classifying into a particular category.

In this study, it is used to predict whether an

individual has cardiovascular disease (1) or not (0),

using processed features. Although Logistic

Regression provides a strong baseline, its

performance is always not very good when the

connection between the target's features variable is

non-linear.

2.2.2 Decision Tree

A Decision Tree segments the dataset into subsets

based on attribute values. Within this structure, each

internal node corresponds to an original data feature,

while each leaf node signifies an original data class

label (Priyanka, 2020). In this study, it is employed to

forecast the occurrence or non-occurrence of

cardiovascular disease through the analysis of the

refined features. The model progressively partitions

the data by choosing the best split at each node, as

dictated by the Gini impurity measure. Decision trees

are simple to understand, but if they are not

adequately managed, they may overfit the data.

Adding, deleting, or optimizing features is a type of

feature engineering that increases the correctness of

the model.

2.2.3 Random Forest

It is an ensemble technique that relies on the use of

decision trees. Its distinctive feature is the

construction of numerous decision trees throughout

the training phase, with the final class predictions

being determined by the majority vote of the

individual trees (Biau, 2016). In this study,

cardiovascular illness is predicted by Random Forest

using processed characteristics. This model mitigates

MLSCM 2024 - International Conference on Modern Logistics and Supply Chain Management

378

overfitting by averaging the predictions across

multiple trees. In the Random Forest framework, each

constituent tree is developed on a distinct data subset

and feature set, thereby enhancing the model's

precision and reliability, and thus, its overall

performance.

2.2.4 XGBoost

Employing the sophisticated ensemble method

known as XGBoost, a sequence of Decision Trees is

built, where each subsequent tree corrects the errors

of its predecessor (Chen, 2016). In this study,

XGBoost uses processed features to predict

cardiovascular disease. XGBoost is well-known for

its effectiveness and regularization strategies that stop

overfitting. It works especially well with big and

complicated datasets. The model supports extensive

hyperparameter tuning and feature selection, enabling

the optimization of the feature set or the introduction

of new feature engineering processes.

2.3 Evaluation Indicators

Four indicators are leveraged to measure model's

performance.

Accuracy (ACC) measures how well the model

performs overall by dividing the number of correctly

predicted cases by the total number of instances.

Precision (PRE) determines the ratio of true

positives to total expected positives (true positives +

false positives) to assess the accuracy of the positive

predictions.

Recall (REC) measures how well the model

detects all actual positive events by dividing the

number of true positives by the total number of false

negatives and true positives.

F1-score is the harmonic mean of recall and

precision. It provides a balanced metric where trade-

offs between recall and precision are necessary,

particularly when class distributions are unbalanced.

3 EXPERIMENT AND RESULTS

3.1 Experimental Details

The experiments were conducted on a machine with

the following configuration: The operating system

was Windows 11, driven by a Core i5-11400H 11th

generation Intel CPU. The system was equipped with

an NVIDIA GeForce RTX 3050 GPU featuring 4GB

of VRAM. This setup ensured efficient processing

and performance during the experiments.

The code was implemented using Python 3.9.13.

The key libraries used in the experiments included

Scikit-learn (1.0.2) for machine learning models,

XGBoost (2.0.3) for gradient boosting, Pandas (1.4.4)

for data manipulation.

3.2 Hyperparameters

To guarantee reproducible findings, the dataset was

split into80% training and 20% testing. A random

seed of 40 was used. The max number of iterations

for the Logistic Regression model, which served as a

benchmark for comparison, was set at 1000. The

Decision Tree had two fixed parameters: the

maximum depth was four, and the smallest sample

size required to divide a node was five thousand. At

least of 100 samples were required to divide a node in

the Random Forest model, which included 100

Decision Trees with 10 as the largest. Lastly, the

XGBoost model used a total of 2000 boosted trees, a

learning rate of 0.01, verbosity level set to 1 for

outputting messages, a largest tree depth of 4, a

smallest weight of 1 needed for a child node, a

training sample ratio of 0.7, a column sampling ratio

of 0.7, and early stopping if there was no

improvement in performance for 20 consecutive

rounds.

3.3 Performance Comparison

In this experiment, various techniques were employed

to enhance model performance, such as encoding

categorical variables, creating additional polynomial

features, filtering out features with minimal

variability, and selecting the most relevant features.

Data was also standardized to ensure uniformity

across features, and an extensive search was

conducted to identify the best hyperparameters. The

results provide a summary of the evaluation measures

that were used to determine how effective various

strategies were. Table 1 computes and presents them.

Table 1: Comparison of several models' performances

without the use of performance-enhancing methods.

ACC PRE REC F1

Logistic

ression

.6989 .7084 .6687 .6880

Decision

Tree

.6350 .6315 .6355 .6335

Random

Fores

.7201 .7252 .7024 .7136

XGBoos

.7297 .7410 .7002 .7200

A Comparative Study of Machine Learning Models for Cardiovascular Disease Prediction

379

Based on the previously indicated specific

methodologies, Table 2 presents the model scores.

Table 2: Comparison of several models' performances with

the use of performance-enhancing methods.

ACC PRE REC F1

Logistic

ression

.7131 .7185 .6936 .7059

Decision

Tree

.7318 .7488 .6916 .7191

Random

Fores

.7372 .7502 .7056 .7272

XGBoos

.7367 .7507 .7031 .7261

Taking the accuracy rate as the principal index,

the Random Forest Model’s performance is the best,

and the Logistic Regression Model’s performance is

worst. In general, the Logistic Regression Model’s

performance is worse than the other three models.

Meanwhile, the other three models performed

similarly and have got close scores, only three

decimal places apart. Additionally, all of scores of

four models increased to varying degrees, especially

the Decision Tree Model, almost increased 10%. It

seemed like that the influence of data preprocessing

and the hyperparameter settings is less than choosing

different models.

4 DISCUSSIONS

4.1 Model Performance Analysis

Based on the experimental results, different models

exhibit significant performance differences before

and after preprocessing, revealing their respective

strengths and weaknesses. Prior to preprocessing, the

Logistic Regression Model's accuracy was 0.6989

and its F1 score was 0.6880. Its accuracy increased to

0.7131 and its F1-score to 0.7059 following

preprocessing. This suggests that linear models

perform much better when feature engineering is

applied. However, due to its linear nature, Logistic

Regression struggles to capture complex nonlinear

relationships, leading to its performance being

inferior to more sophisticated models. The decision

tree model's accuracy was 0.6350 and its F1 score was

0.6335 when the data was not treated, clearly

indicating an overfitting bias. After preprocessing,

the Decision Tree's accuracy rose to 0.7318 and its

F1-score to 0.7191, demonstrating that feature

selection and standardization effectively mitigated

overfitting issues. Using an ensemble of Decision

Trees to minimize overfitting, the accuracy was

0.7201 and the F1-score was 0.7136 produced by the

Random Forest on the raw data; these increased to

0.7372 and 0.7272, respectively, following

preprocessing, demonstrating the model's resilience

and capacity for generalization. Using the raw dataset,

the accuracy was 0.7297 and the F1-score was 0.7200.

And it improved to 0.7367 and 0.7261 after

preprocessing — XGBoost again demonstrated

remarkable performance. In this experiment, the

Random Forest model significantly outperformed

XGBoost, despite the latter using gradient boosting

and regularization techniques to handle high-

dimensional and nonlinear data successfully. This is

mainly because Random Forest's ensemble method,

which lowers overfitting and better captures

complicated patterns by averaging several trees,

performs better. Overall, data preprocessing

significantly improved the accuracy across various

approaches, with Random Forest demonstrating the

highest robustness and accuracy, indicating that in

this analysis, it is the best accurate model for

predicting cardiovascular disease.

4.2 Limitations and Future Works

Although data preprocessing improved the

performance of the models, there's still potential for

additional optimization. Feature selection and

engineering strategies could be more comprehensive,

particularly by incorporating domain knowledge or

exploring automated feature generation techniques to

enhance model accuracy and generalization.

Furthermore, in order to handle more complicated

data patterns, future research may take into account

integrating deep learning models, like neural

networks, into the trials instead of solely depending

on classic machine learning models. Moreover,

optimizing hyperparameter search methods and using

techniques like cross-validation could further

improve model robustness and predictive accuracy. In

practical applications, increasing the data volume,

refining the feature set, and employing multi-model

fusion strategies could potentially provide more

reliable and accurate predictions for cardiovascular

disease.

5 CONCLUSIONS

In this study, the author evaluated the performance of

several machine learning models for predicting

cardiovascular disease, focusing on four models. The

analysis incorporated extensive preprocessing

MLSCM 2024 - International Conference on Modern Logistics and Supply Chain Management

380

techniques, including One-Hot Encoding, Polynomial

Feature Expansion, feature selection, and data

standardization, to enhance model performance and

robustness.

The experimental results revealed that the greatest

accuracy and F1-score were attained by Random

Forest, outperforming other models. Specifically,

after preprocessing, Random Forest's accuracy

reached 0.7372 and its F1-score improved to 0.7272,

showcasing its strong generalization ability and

exceptional handling of complicated data patterns.

XGBoost fared much better, with excellent F1-score

and accuracy but lagging slightly behind Random

Forest. The improvements observed in all models,

particularly the significant gains for Decision Tree

and Random Forest, emphasize how important

feature engineering and preprocessing are to

improving the performance of the model.

In order to handle even more complicated data

patterns, future work should investigate more

thoroughly how medical domain knowledge may be

integrated into feature selection and take into account

cutting-edge methods like deep learning models.

Moreover, utilizing multi-model fusion techniques

and expanding the dataset can further increase

anticipated dependability and accuracy. This study

highlights the significance of preprocessing the raw

data and choosing a suitable model to obtain the

highest predictive performance for cardiovascular

disease prognosis.

REFERENCES

Biau, G., & Scornet, E. 2016. A Random Forest guided

tour. Test, 25, 197-227.

Celermajer, D. S., Chow, C. K., Marijon, E., Anstey, N. M.,

& Woo, K. S. 2012. Cardiovascular disease in the

developing world: prevalences, patterns, and the

potential of early disease detection. Journal of the

American College of Cardiology, 60(14), 1207-1216.

Chen, T., & Guestrin, C. 2016. Xgboost: A scalable tree

boosting system. In Proceedings of the 22nd acm

sigkdd international conference on knowledge

discovery and data mining. 785-794.

Hagan, R., Gillan, C. J., & Mallett, F. 2021. Comparison of

machine learning methods for the classification of

cardiovascular disease. Informatics in Medicine

Unlocked, 24, 100606.

Jordan, M. I., & Mitchell, T. M. 2015. Machine learning:

Trends, perspectives, and prospects. Science,

349(6245), 255-260.

Knutsson, A., & Bøggild, H. 2000. Shiftwork and

cardiovascular disease: review of disease

mechanisms. Reviews on environmental health, 15(4),

359-372.

Kononenko, I. 2001. Machine learning for medical

diagnosis: history, state of the art and

perspective. Artificial Intelligence in medicine, 23(1),

89-109.

LaValley, M. P. 2008. Logistic regression. Circulation,

117(18), 2395-2399.

Mathur, P., Srivastava, S., Xu, X., & Mehta, J. L. 2020.

Artificial intelligence, machine learning, and

cardiovascular disease. Clinical Medicine Insights:

Cardiology, 14, 1179546820927404.

Priyanka, & Kumar, D. 2020. Decision tree classifier: a

detailed survey. International Journal of Information

and Decision Sciences, 12(3), 246-269.

Svetlana, U., 2018. Cardiovascular Disease dataset. URL:

https://www.kaggle.com/datasets/sulianova/cardiovasc

ular-disease-dataset. Last Accessed: 2024/09/18

A Comparative Study of Machine Learning Models for Cardiovascular Disease Prediction

381