have used machine learning methods for more
accurate prediction.
Scholars such as Ashir Javeed and colleagues have utilized
machine learning to predict heart failure (Javeed et
al., 2022). Their study analyzed a range of machine
learning models, including, but not restricted to,
Random Forest (RF), logistic regression, support
vector machines (SVM), convolutional neural
networks (CNN), and recurrent neural networks
(RNN). To decrease data dimensionality and increase
model accuracy, they also used a variety of
approaches, such as principal component analysis
(PCA) and independent component analysis (ICA), to
extract and select important features. Their paper
summarizes several public datasets, such as the
University of California, Irvine (UCI) Heart Disease
dataset, the Cleveland dataset, and the StatLog dataset.
Accuracy, sensitivity, specificity, and other indicators
were used to evaluate model performance.
However, the role of machine learning in self-
management for heart failure patients has not been
adequately explored. Sheojung Shin and other
scholars compared ML and statistical regression
models for predicting the prognosis of heart failure
patients, arguing that ML methods have an advantage
over statistical regression models because they
achieve higher c-indices. However,
from the perspective of epidemiological evaluation of
clinical prediction models, the quality of currently
available ML-based prediction models remains
suboptimal (Shin et al., 2019).
This research compares the predictive accuracy of
K-Nearest Neighbors, Random Forest, and Logistic
Regression models in predicting heart failure, aiming
to provide insights and references for heart failure
research and treatment.
2 METHODOLOGY
2.1 Data Source
The data used in this study were obtained from
Kaggle. The dataset used in this study, the Heart
Failure Prediction Dataset owned by Larxel, has been
widely used in healthcare research, especially for
heart failure prediction and analysis.
This dataset received a usability rating of 10.0,
indicating its high quality and reliability for research
purposes. It has been downloaded 158,994 times, reflecting
its popularity and usefulness among researchers and
practitioners. This dataset contains records from 299
participants and includes 12 clinical characteristics
such as age, sex, ejection fraction, serum creatinine,
and other relevant health measures. These
characteristics are commonly used in medical studies
to predict outcomes such as mortality or readmission
rates in patients with heart failure.
This dataset is particularly valuable for machine
learning and statistical modeling studies because it
provides a comprehensive set of clinical variables that
can be used to train and evaluate predictive models.
In addition, the structure and variables of the dataset
align closely with the objectives of this study.
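As a minimal, hypothetical sketch (the CSV file name and column names are assumptions based on the common Kaggle release of this dataset, not details reported by the dataset owner), the data can be loaded and inspected in Python with pandas:

import pandas as pd

# Load the Kaggle heart failure dataset (file name assumed)
df = pd.read_csv("heart_failure_clinical_records_dataset.csv")

# Inspect the number of records, the variable names, and the outcome distribution
print(df.shape)
print(df.columns.tolist())
print(df["DEATH_EVENT"].value_counts())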
2.2 Variables and Data Preprocessing
In the original dataset, there were 12 variables, and
the names and explanations of each variable are
shown in Table 1.
For data preprocessing, the normal range of serum
sodium is medically defined as 135-145 mEq/L.
Python code was used to remove outliers falling
outside this range, which removed 15 rows from the
dataset. At the same time, no variable in this study has
more than 20% missing values, so no variables needed
to be dropped.
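A minimal sketch of this filtering step is shown below, assuming the serum sodium column is named serum_sodium as in the Kaggle file and continuing from the loading sketch above:

# Keep only rows whose serum sodium lies in the clinically normal range (135-145 mEq/L)
mask = df["serum_sodium"].between(135, 145)
df_clean = df[mask].reset_index(drop=True)

# Report how many outlier rows were removed
print(f"Rows removed: {len(df) - len(df_clean)}")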
The original dataset contains 12 clinical features,
which may introduce complexity and redundancy into
the prediction model. High-dimensional data can lead
to overfitting, especially when some features exhibit
multicollinearity or correlate only weakly with the
target variable, DEATH_EVENT. Principal Component
Analysis (PCA) is therefore used in this research to
reduce the dimensionality of the data while retaining
the most important information.
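The following sketch illustrates one way this PCA step could be implemented with scikit-learn, continuing from the preprocessing sketch above; the choice of retaining enough components to explain 95% of the variance is an assumption for illustration, not a setting reported in this paper:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Separate the predictors from the target variable
X = df_clean.drop(columns=["DEATH_EVENT"])
y = df_clean["DEATH_EVENT"]

# PCA is scale-sensitive, so standardize the features first
X_scaled = StandardScaler().fit_transform(X)

# Retain enough principal components to explain 95% of the variance (assumed threshold)
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)
print(f"Components retained: {pca.n_components_}")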
2.3 Machine Learning Models
Three models were employed for analysis in this
work: the K-Nearest Neighbor model, the Random
Forest model, and the Logistic Regression model.
The logistic regression model uses the logistic
function to model the probability of a particular class
or event (Li, 2021). The formula is given by:
P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n)}}  (1)

where \beta_0 is the intercept, and \beta_1, \beta_2, \ldots, \beta_n are the
coefficients for the input features X_1, X_2, \ldots, X_n.
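For illustration, a logistic regression classifier of this form can be fitted with scikit-learn; the 80/20 train/test split and the random seed below are assumptions made for the sketch rather than settings reported in this study:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Assumed 80/20 split on the PCA-transformed features from the previous sketch
X_train, X_test, y_train, y_test = train_test_split(
    X_pca, y, test_size=0.2, random_state=42, stratify=y)

# Fit the logistic regression model of Equation (1) and evaluate its accuracy
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, log_reg.predict(X_test)))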
During training, a large number of decision trees
are constructed using an ensemble learning approach
called Random Forest; each tree is built independently,
and the final prediction is obtained by aggregating the
individual trees' outputs, using majority voting for
classification or the mean prediction for regression.
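A comparable sketch for the Random Forest classifier is given below; the number of trees and the random seed are assumed values chosen only for illustration:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Build an ensemble of decision trees; 100 trees is an assumed, not reported, setting
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# The forest aggregates the trees' votes into a single class prediction
print("Accuracy:", accuracy_score(y_test, rf.predict(X_test)))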